Method of Moments
$$y_i = x_i'\beta + u_i, \quad i = 1, \dots, N$$
The residual û_i ≡ y_i − x_i'b_lse, which is an estimator for u_i, has zero sample mean and zero sample covariance with the regressors due to the first-order condition:

$$\frac{1}{N}\sum_i x_i(y_i - x_i'b_{lse}) = \left(\frac{1}{N}\sum_i \hat u_i,\ \frac{1}{N}\sum_i x_{i2}\hat u_i,\ \dots,\ \frac{1}{N}\sum_i x_{ik}\hat u_i\right)' = 0.$$
Instead of minimizing N⁻¹∑_i (y_i − x_i'b)², LSE can be motivated directly from a moment condition. Observe that the LSE first-order condition at b = β is N⁻¹∑_i x_i u_i = 0, and its population version is
$$E(xu) = 0 \iff \begin{bmatrix} E(u) \\ E(x_2 u) \\ \vdots \\ E(x_k u) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$
as COV(x_j, u) = E(x_j u) − E(x_j)E(u), where COV and COR stand for covariance and correlation, respectively.
Replacing u with y − x'β yields

$$E\{x(y - x'\beta)\} = 0,$$

which is a restriction on the joint distribution of (x', y). Assuming that E(xx') is invertible, we get

$$\beta = \{E(xx')\}^{-1} \cdot E(xy).$$
The LSE b_lse is just a sample analog of this expression of β, obtained by replacing E(xx') and E(xy) with their sample versions N⁻¹∑_i x_i x_i' and N⁻¹∑_i x_i y_i. Instead of identifying β by minimizing the prediction error, here β is identified by the "information" (i.e., the assumption) that the observed x is "orthogonal" to the unobserved u. A sketch of this sample-analog computation appears below.
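The sample-analog formula can be checked numerically. Below is a minimal Python/NumPy sketch; the data-generating process (the names rng, X, beta) is hypothetical, chosen only so that E(xu) = 0 holds by construction.

```python
import numpy as np

# Hypothetical DGP: regressors x_i (with intercept), error u_i with E(xu) = 0
rng = np.random.default_rng(0)
N = 10_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # rows are x_i'
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=N)

Sxx = X.T @ X / N                    # sample version of E(xx')
Sxy = X.T @ y / N                    # sample version of E(xy)
b_lse = np.linalg.solve(Sxx, Sxy)    # {N^-1 sum x_i x_i'}^-1 N^-1 sum x_i y_i
print(b_lse)                         # close to beta = (1, 2, -0.5) for large N
```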
For any k × 1 constant vector γ, γ'E(xx')γ = E{(x'γ)²} ≥ 0, so E(xx') is positive semi-definite; the invertibility assumption requires it to be positive definite. Substituting y_i = x_i'β + u_i into b_lse gives

$$b_{lse} = \beta + \left(\frac{1}{N}\sum_i x_i x_i'\right)^{-1} \frac{1}{N}\sum_i x_i u_i.$$

Clearly, b_lse ≠ β due to the second term on the right-hand side (rhs), which shows that each x_i u_i contributes to the deviation b_lse − β. Using the LLN, we have
$$\frac{1}{N}\sum_i x_i u_i \to^p E(xu) = 0 \quad\text{and}\quad \frac{1}{N}\sum_i x_i x_i' \to^p E(xx').$$
Substituting these into the preceding display, we can get b_lse →p β, but we need to deal with the inverse: for a square random matrix W_N, when W_N →p W, will W_N⁻¹ converge to W⁻¹ in probability?
It is known that, for a rv matrix W_N and a constant matrix W_o, if W_N →p W_o and W_o is invertible, then W_N⁻¹ →p W_o⁻¹; matrix inversion is continuous at any invertible matrix.
Therefore, b_lse is β plus a product of two terms, one consistent for a zero vector and the other consistent for a bounded matrix; thus the product is consistent for zero, and we have b_lse →p β: b_lse is consistent for β.
1.2.2 CLT and √N-Consistency
For the asymptotic distribution of the LSE, a central limit theorem (CLT) is needed: for an iid random vector sequence z_1, ..., z_N with finite second moments,

$$\frac{1}{\sqrt N}\sum_i \{z_i - E(z)\} \rightsquigarrow N\big(0,\ E[\{z - E(z)\}\{z - E(z)\}']\big) \quad\text{as } N \to \infty$$
where "⇝" denotes convergence in distribution; i.e., letting Ψ(·) denote the df of N(0, E[{z − E(z)}{z − E(z)}']),

$$\lim_{N\to\infty} P\left[\frac{1}{\sqrt N}\sum_i \{z_i - E(z)\} \le t\right] = \Psi(t) \quad \forall\, t.$$
A single rv z always satisfies P{|z| > δ_ε} < ε, because we can capture "all but ε" of the probability mass by choosing δ_ε large enough; a random sequence w_1, w_2, ... is Op(1) ("bounded in probability") when a single δ_ε captures all but ε probability mass for every rv in the sequence. Any random sequence converging in distribution is Op(1), which implies N^(−1/2)∑_i {z_i − E(z)} = Op(1).
To understand Op better, consider N⁻¹ and N⁻², both of which converge to 0. Observe N⁻¹/N⁻¹ = 1, but N⁻¹/N^(−1+ε) = 1/N^ε → 0 whereas N⁻¹/N^(−1−ε) = N^ε → ∞ for any constant ε > 0. Thus the "(fastest) convergence rate" is N⁻¹, which, when divided into N⁻¹, makes the resulting ratio Op(1). Apply now the LLN to N⁻¹∑_i x_i x_i' and the CLT to N^(−1/2)∑_i x_i u_i in the display for b_lse − β to get

$$\sqrt N(b_{lse} - \beta) \rightsquigarrow N(0, \Omega) \quad\text{where } \Omega \equiv E^{-1}(xx')\,E(xx'u^2)\,E^{-1}(xx'): \qquad (*)$$
√N(b_lse − β) is asymptotically normal with mean 0 and variance Ω. Often this convergence in distribution (or "in law") of √N(b_lse − β) is informally stated as

$$b_{lse} \sim N\left(\beta,\ \frac{1}{N}E^{-1}(xx')E(xx'u^2)E^{-1}(xx')\right). \qquad (*')$$
The asymptotic variance Ω of √N(b_lse − β) can be estimated consistently with (this point will be further discussed later)

$$\Omega_N \equiv \left(\frac{1}{N}\sum_i x_i x_i'\right)^{-1}\left(\frac{1}{N}\sum_i x_i x_i'\hat u_i^2\right)\left(\frac{1}{N}\sum_i x_i x_i'\right)^{-1}. \qquad (*'')$$
Alternatively (and informally), the asymptotic variance of b_lse is estimated consistently with

$$\frac{\Omega_N}{N} = \left(\sum_i x_i x_i'\right)^{-1}\left(\sum_i x_i x_i'\hat u_i^2\right)\left(\sum_i x_i x_i'\right)^{-1}.$$
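The estimator (*'') is straightforward to compute. A minimal sketch follows, with a hypothetical heteroskedastic DGP so that the robust standard errors differ from the homoskedastic ones.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
u = rng.normal(size=N) * (1 + 0.5 * np.abs(X[:, 1]))   # heteroskedastic error
y = X @ np.array([1.0, 2.0, -0.5]) + u

b = np.linalg.solve(X.T @ X, X.T @ y)                  # LSE
uhat = y - X @ b                                       # residuals
Sxx = X.T @ X / N
Mid = (X * (uhat**2)[:, None]).T @ X / N               # N^-1 sum x_i x_i' uhat_i^2
Sinv = np.linalg.inv(Sxx)
Omega_N = Sinv @ Mid @ Sinv                            # (*'')
se_het = np.sqrt(np.diag(Omega_N) / N)                 # robust SE of b_lse
print(b, se_het)
```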
Define the projection matrices

$$P_X \equiv X(X'X)^{-1}X' \quad\text{and}\quad Q_X \equiv I_N - P_X$$

to get

$$Y = I_N Y = P_X Y + (I_N - P_X)Y = P_X Y + Q_X Y = X b_{lse} + \hat U.$$
Also note

$$P_X X = X \quad\text{and}\quad Q_X X = 0:$$

extracting the X part of X gives X itself, and removing the X part of X yields 0.
Suppose we use 1 as the only regressor. Defining 1_N as the N × 1 vector of 1's and denoting Q_{1_N} just as Q_1,

$$Q_1 Y = \left\{I_N - 1_N(1_N'1_N)^{-1}1_N'\right\}Y = \left(I_N - \frac{1}{N}1_N 1_N'\right)Y = \begin{bmatrix} y_1 - \bar y \\ y_2 - \bar y \\ \vdots \\ y_N - \bar y \end{bmatrix}.$$
The part (1_N'1_N)⁻¹1_N'Y = ȳ demonstrates that the LSE with 1 as the sole regressor is just the sample mean ȳ. Q_1 may be called the "mean-deviation" or "mean-subtracting" matrix.
$$R^2 \equiv \frac{\hat Y'Q_1\hat Y}{Y'Q_1 Y} = \frac{\hat Y'Q_1 Y \cdot \hat Y'Q_1\hat Y}{\hat Y'Q_1 Y \cdot Y'Q_1 Y} = \frac{(\hat Y'Q_1 Y)^2}{\hat Y'Q_1\hat Y \cdot Y'Q_1 Y} = \frac{\left\{\sum_i(\hat y_i - \bar{\hat y})(y_i - \bar y)\right\}^2}{\sum_i(\hat y_i - \bar{\hat y})^2 \cdot \sum_i(y_i - \bar y)^2} = (\text{sample correlation of } Y \text{ and } \hat Y)^2.$$
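The equality of the variation ratio and the squared correlation is easy to verify numerically. A minimal sketch (hypothetical data, intercept included so that Q_1Û = Û holds):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(size=N)
b = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ b

# R^2 two ways: variation ratio and squared sample correlation of y and yhat
Q1y = y - y.mean()
Q1yhat = yhat - yhat.mean()
r2_ratio = (Q1yhat @ Q1yhat) / (Q1y @ Q1y)
r2_corr = np.corrcoef(y, yhat)[0, 1] ** 2
print(r2_ratio, r2_corr)   # the two agree
```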
EXAMPLE: HOUSE SALE. A data set of size 467 was collected from the State College District in Pennsylvania for the year 1991. State College is a small college town with a population of about 50,000. The houses sold during the year were sampled, and the sale prices and the durations until sale since the first listing in the market were recorded.
The dependent variable is the discount (DISC) percentage, defined as 100 times the natural log of the list price (LP) over the sale price (SP) of a house:

$$100\cdot\ln\frac{LP}{SP} = 100\cdot\ln\left(1 + \frac{LP - SP}{SP}\right) \simeq 100\cdot\frac{LP - SP}{SP} = \text{discount \%}.$$
LP and SP are measured in $1000. Since LP is the initial list price, given LP, explaining DISC is equivalent to explaining SP. The following is the list of regressors—the measurement units should be kept in mind: the number of days on the market until sold (T), year built minus 1900 (YR), number of rooms (ROOM), number of bathrooms (BATH), dummy for heating by electricity (ELEC), property tax in $1000 (TAX), dummies for spring listing (L1), summer listing (L2), and fall listing (L3), sale-month interest rate in % (RATE), dummy for sale by a big broker (BIGS), and the number of houses on the market divided by 100 in the month when the house is listed (SUPPLY).
In Table 1, examine only the first three columns for a while. ln(T) appears before 1 because ln(T) is different from the other regressors—it is determined nearly simultaneously with DISC—and thus needs special attention. Judging from the t-values in "tv-het," most regressors are statistically significant at the 5% level, for their absolute t-values are greater than 1.96; "tv-ho" will be used in the next subsection, where the qualifiers "het" and "ho" will be explained. A larger ln(T) implies a bigger DISC: with ∂ln T ≃ ∂T/T, an increase of ∂ln T = 1 (i.e., a 100% increase in T) means a 4.6% increase in DISC, which in turn means that a 1% increase in T leads to a 0.046% increase in DISC.
Variable Mean SD
DISC 7.16 7.64
L1 0.29 0.45
L2 0.31 0.46
L3 0.19 0.39
SP 115 57.7
T 188 150
BATH 2.02 0.67
ELEC 0.52 0.50
ROOM 7.09 1.70
TAX 1.38 0.65
YR 73.0 15.1
LP 124 64.9
BIGS 0.78 0.42
RATE 9.33 0.32
SUPPLY 0.62 0.19
A newer house commands a smaller DISC: a one-year-newer house gets 0.15% less DISC, and thus a 10-year-newer house gets 1.5% less DISC. A higher RATE means a lower DISC (a 1% increase in RATE causes a 2.99% DISC drop); this finding seems, however, counter-intuitive, because a higher mortgage rate means a lower demand for houses. R² = 0.34 shows that 34% of the DISC variance is explained by x'b_lse, and s_N = 6.20 shows that about 95% of the u_i's fall in the range ±1.96 × 6.20 if the u_i's follow N(0, V(u)).
As just noted, a 1% increase in T causes a 0.046% increase in DISC. Since this may not be easy to grasp, T is used instead of ln(T) for the LSE in the last two columns of the table. The estimate for T is significant with the magnitude 0.027, meaning that a 100-day increase in T leads to a 2.7% DISC increase, which seems reasonable. This kind of query—whether the popular logged variable ln(T), the level T, or some other function of T should be used—will be addressed later when we deal with "transformation of variables" in nonlinear models.
X_f can be written as

$$\underset{N\times k_f}{X_f} = \underset{N\times k}{X}\cdot \underset{k\times k_f}{S_f}$$

where S_f is a "selection matrix" consisting only of 1's and 0's to select the components of X for X_f; analogously we can get X_g = X·S_g.
Observe

$$P_{X_f}P_X = P_{X_f} \quad\text{and}\quad Q_{X_f}Q_X = Q_X.$$

In words, for P_{X_f}P_X = P_{X_f}, extracting first the X part (with P_X) and then its subset X_f part (with P_{X_f}) is the same as extracting only the X_f part. As for Q_{X_f}Q_X = Q_X, removing first the X part and then its subset X_f part is the same as removing the whole X part.
Multiply Y = X_f b_f + X_g b_g + Û by Q_{X_f} to get

$$Q_{X_f}Y = Q_{X_f}X_g b_g + \hat U,$$

because Q_{X_f}X_f = 0 and Q_{X_f}Û = Û (Û is orthogonal to X, hence to X_f). Multiplying this by X_g'Q_{X_f} and solving for b_g (X_g'Q_{X_f}Û = 0) yields

$$b_g = (X_g'Q_{X_f}X_g)^{-1}X_g'Q_{X_f}Y.$$

This expression shows that the LSE b_g for β_g can be obtained in two stages. First, do the LSE of Y on X_f to get the partial residual Q_{X_f}Y, and do the LSE of X_g on X_f to get the partial residual Q_{X_f}X_g. Second, do the LSE of Q_{X_f}Y on Q_{X_f}X_g:
$$(X_g'Q_{X_f}'Q_{X_f}X_g)^{-1}X_g'Q_{X_f}'Q_{X_f}Y = (X_g'Q_{X_f}X_g)^{-1}X_g'Q_{X_f}Y.$$
Using the vector notation, the partial regression for the slopes b_g is nothing but the LSE with the mean-deviation variables: with x_i = (1, x̃_i')' and x̄̃ ≡ N⁻¹∑_i x̃_i,

$$b_g = \left\{\sum_i (\tilde x_i - \bar{\tilde x})(\tilde x_i - \bar{\tilde x})'\right\}^{-1}\sum_i (\tilde x_i - \bar{\tilde x})(y_i - \bar y).$$
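This partial-regression (Frisch–Waugh) equivalence can be verified directly. A minimal sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 800
Xf = np.column_stack([np.ones(N), rng.normal(size=N)])     # X_f: intercept + one regressor
Xg = rng.normal(size=(N, 1)) + 0.5 * Xf[:, [1]]            # X_g, correlated with X_f
X = np.hstack([Xf, Xg])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=N)

# Full LSE: the X_g coefficient
b_full = np.linalg.solve(X.T @ X, X.T @ y)

# Partial regression: residualize y and X_g on X_f, then regress
Pf = Xf @ np.linalg.solve(Xf.T @ Xf, Xf.T)                 # P_{X_f}
ry = y - Pf @ y                                            # Q_{X_f} Y
rg = Xg - Pf @ Xg                                          # Q_{X_f} X_g
b_partial = np.linalg.solve(rg.T @ rg, rg.T @ ry)
print(b_full[-1], b_partial.ravel())                       # identical
```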
Suppose the LSE of y on x_f alone is run while the true model is y = x_f'β_f + x_g'β_g + u, so that the omitted v ≡ x_g'β_g + u acts as the error term. The LSE b_f then satisfies

$$b_f = \beta_f + \left(\frac{1}{N}\sum_i x_{if}x_{if}'\right)^{-1}\frac{1}{N}\sum_i x_{if}v_i = \beta_f + \left(\frac{1}{N}\sum_i x_{if}x_{if}'\right)^{-1}\frac{1}{N}\sum_i x_{if}(x_{ig}'\beta_g + u_i)$$
$$= \beta_f + \left(\frac{1}{N}\sum_i x_{if}x_{if}'\right)^{-1}\frac{1}{N}\sum_i x_{if}x_{ig}'\cdot\beta_g + \left(\frac{1}{N}\sum_i x_{if}x_{if}'\right)^{-1}\frac{1}{N}\sum_i x_{if}u_i,$$

which is consistent for β_f + E⁻¹(x_f x_f')E(x_f x_g')·β_g.
The term other than β_f is called the omitted variable bias, which is 0 if either β_g = 0 (i.e., x_g does not belong to the model, so nothing is omitted) or E⁻¹(x_f x_f')E(x_f x_g') = 0, which is the population linear projection coefficient of regressing x_g on x_f. In simple words, if COR(x_f, x_g) = 0, then there is no omitted variable bias. When LSE is run on some data and the resulting estimates do not make sense intuitively, in most cases the omitted variable bias formula will provide a good guide to what might have gone wrong; the sketch below simulates the formula.
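A minimal simulation of the bias formula, under a hypothetical DGP in which the projection coefficient of x_g on x_f is 0.8:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
xf = rng.normal(size=N)
xg = 0.8 * xf + rng.normal(size=N)           # COR(x_f, x_g) != 0
y = 1.0 + 0.5 * xf + 2.0 * xg + rng.normal(size=N)

# LSE omitting x_g: slope converges to beta_f + (projection coef) * beta_g
Xs = np.column_stack([np.ones(N), xf])
b_short = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)
print(b_short[1])                            # approx 0.5 + 0.8 * 2.0 = 2.1
```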
One question that might arise when COR(x_f, x_g) ≠ 0 is what happens if a subvector x_f2 of x_f is correlated with x_g while the other subvector x_f1 of x_f is not, where x_f = (x_f1', x_f2')'. In this case, will x_f1 still be subject to the omitted variable bias? The answer depends on COR(x_f1, x_f2), as can be seen in

$$E^{-1}(x_f x_f')E(x_f x_g') = \begin{bmatrix} E(x_{f1}x_{f1}') & E(x_{f1}x_{f2}') \\ E(x_{f2}x_{f1}') & E(x_{f2}x_{f2}') \end{bmatrix}^{-1}\begin{bmatrix} 0 \\ E(x_{f2}x_g') \end{bmatrix} \quad\text{as } E(x_{f1}x_g') = 0$$
$$= \begin{bmatrix} 0 \\ E^{-1}(x_{f2}x_{f2}')E(x_{f2}x_g') \end{bmatrix} \quad\text{if } E(x_{f1}x_{f2}') = 0.$$
seat belt makes the driver go faster, which results in more accidents. That is, driving speed x_g in the error term is correlated with x_f, and the omitted variable bias dominates β_f so that the following sum becomes positive:

$$\beta_f + E^{-1}(x_f x_f')E(x_f x_g')\cdot\beta_g > 0.$$
In this case, enacting the seat belt law will increase y, not because β f > 0
but rather because it will cause xg to increase.
What the state has in mind is the ceteris paribus ("direct") effect β_f with all the other variables held constant, but what is estimated is the total effect, the sum of the direct effect β_f and the indirect effect of x_f on y through x_g. Both the direct and indirect effects can be estimated consistently using the LSE of y on x_f and x_g, but enacting only the seat belt law will not have the intended effect because the indirect effect will occur. A solution is enacting both the seat belt law and a speed limit law to assure COR(x_f, x_g) = 0 after the laws are passed.
In the example, omitted variable bias helped explain an apparently nonsensical result. But it can also help negate an apparently plausible result. Suppose that there are two types of people, one cautious and the other reckless, with x_g denoting the proportion of cautious people, and that cautious people tend to wear seat belts more (COR(x_f, x_g) > 0) and have fewer traffic accidents. Also suppose β_f = 0, i.e., no true effect of seat belt wearing. In this case, the LSE of y on x_f converges to a negative number and, due to omitting x_g, we may wrongly conclude that wearing a seat belt will lower y and enact the seat belt law. Here the endogeneity problem of x_f leads to an ineffective policy, as the seat belt law will have no true effect on y. Note that, differently from the x_g = speed example, there is no indirect effect of forcing seat belt wearing, because wearing a seat belt will not change a person's type.
(or there is "homoskedasticity"). Although we assume that (u_i, x_i) are iid across i, u_i|x_i are not iid across i under heteroskedasticity. For example, suppose each observation i is a city-level average over the n_i people in city i, so that the city-level error is

$$u_i \equiv \frac{1}{n_i}\sum_j u_{ji}.$$
That is, what is available is a random sample on cities with (n_i, x_i, y_i), i = 1, ..., N, where n_i is the total number of people in city i and N is the number of sampled cities. Suppose that u_ji is independent of x_ji, and that the u_ji's are iid with zero mean and variance σ² (i.e., u_ji ~ (0, σ²)). Then u_1, ..., u_N are independent, and u_i|(x_i, n_i) ~ (0, σ²/n_i): the error terms in the city-level model are heteroskedastic wrt n_i, but not wrt x_i. Note that all of n_i, x_i, and y_i are random, as we do not know which city gets drawn.
This type of heteroskedasticity can be dealt with by minimizing ∑_i (y_i − x_i'b)²n_i, which is equivalent to applying LSE to the transformed equation

$$y_i^* = x_i^{*\prime}\beta + u_i^*, \quad\text{where } y_i^* \equiv y_i\sqrt{n_i},\ x_i^* \equiv x_i\sqrt{n_i},\ u_i^* \equiv u_i\sqrt{n_i}.$$

Hence u_1*, ..., u_N* are iid (0, σ²). This LSE motivates the "weighted LSE" to appear later; a sketch follows.
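A minimal sketch of this √n_i-weighted LSE, under a hypothetical city-average DGP with V(u_i|n_i) = σ²/n_i:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2000
n = rng.integers(10, 500, size=N)            # city sizes n_i
x = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 2.0])
u = rng.normal(size=N) / np.sqrt(n)          # V(u_i | n_i) = sigma^2 / n_i, sigma = 1
y = x @ beta + u

# Transformed (weighted) LSE: multiply each row by sqrt(n_i)
w = np.sqrt(n)
Xs, ys = x * w[:, None], y * w
b_wls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
print(b_wls)                                 # close to beta, with iid transformed errors
```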
Two remarks on the city-level data example. First, there is no unity in the transformed regressors because 1 is replaced with √n_i. This requires a different definition of R², which was defined using Q_1Û = Û. R² for the transformed model can be defined as {sample COR(y, ŷ)}², not as {sample COR(y*, ŷ*)}², where ŷ_i = x_i'b*_lse, ŷ_i* = x_i*'b*_lse, and b*_lse is the LSE for the transformed model. This definition of R² can also be used for the "weighted LSE" below. Second, we assumed above that sampling is done at the city level and what is available is the averaged variables y_i and x_i along with n_i. If, instead, all cities are included but n_i individuals get sampled in city i, where n_i is a predetermined (i.e., fixed) constant ahead of sampling, then n_i is not random (but still may vary across i); in contrast, (x_i, y_i) is still random because it depends on the sampled individuals. In this case, the u_i's are independent but non-identically distributed (inid) due to V(u_i) = σ²/n_i, where V(u_i) is the marginal variance of u_i. Clearly, how sampling is done matters greatly.
Recall the variance decomposition formula

$$V(y) = E\{V(y|x)\} + V\{E(y|x)\},$$

which can help understand the sources of V(y). Suppose that x is a rv taking on 1, 2, or 3. Decompose the population with x into 3 groups (i.e., subpopulations). Each group has its conditional variance, and we may be tempted to think that E{V(y|x)}, which is the weighted average of V(y|x) with P(x) as the weight, yields the marginal variance V(y). But the variance decomposition formula demonstrates V(y) ≠ E{V(y|x)} unless V{E(y|x)} = 0 (i.e., unless E(y|x) is constant across x), although E(y) = E{E(y|x)} always holds. That is, the source of the variance is not just the "within-group variance" V(y|x), but also the "between-group variance" V{E(y|x)} of the group mean E(y|x).
If the variance decomposition is done with an observable variable x, then we may dig deeper by estimating E(y|x) and V(y|x). But a decomposition with an unobservable variable u can also be thought of, as we can choose any variable we want in the variance decomposition: V(y) = E{V(y|u)} + V{E(y|u)}. In this case, the decomposition can help us imagine the sources depending on u. For instance, if y is income and u is ability (whereas x is education), then the income variance is the weighted average of the ability-group variances plus the variance between the average group incomes.
Two polar cases are of interest. Suppose that y is income and x is the education group: 1 for "below high school graduation," 2 for "high school graduation" to "below college graduation," and 3 for "college graduation or above." One extreme case is the same mean income for all education groups:

$$E(y|x) = \mu \ \forall x \implies V\{E(y|x)\} = 0 \implies V(y) = E\{V(y|x)\}.$$

The other extreme case is the same variance in each education group:

$$V(y|x) = \sigma^2 \ \forall x \implies E\{V(y|x)\} = \sigma^2 \implies V(y) = \sigma^2 + V\{E(y|x)\};$$

if σ² = 0, then V(y) = V{E(y|x)}: the variance comes solely from the differences of E(y|x) across the groups.
Then the decomposition y_ij − ȳ = (y_ij − ȳ_j) + (ȳ_j − ȳ) is used in one-way ANOVA, where ȳ_j − ȳ is for V{E(y|x)}.
Specifically, take ∑_{j=1}^J ∑_{i=1}^{N_j} on (y_ij − ȳ)² = {(y_ij − ȳ_j) + (ȳ_j − ȳ)}² to see that the cross-product term is zero because

$$\sum_{j=1}^J\sum_{i=1}^{N_j}(y_{ij} - \bar y_j)(\bar y_j - \bar y) = \sum_{j=1}^J(\bar y_j - \bar y)\sum_{i=1}^{N_j}(y_{ij} - \bar y_j) = \sum_{j=1}^J(\bar y_j - \bar y)(N_j\bar y_j - N_j\bar y_j) = 0.$$
Thus we get

$$\underbrace{\sum_{j=1}^J\sum_{i=1}^{N_j}(y_{ij} - \bar y)^2}_{\text{total variation}} = \underbrace{\sum_{j=1}^J\sum_{i=1}^{N_j}(y_{ij} - \bar y_j)^2}_{\text{unexplained variation}} + \underbrace{\sum_{j=1}^J N_j(\bar y_j - \bar y)^2}_{\text{explained variation}}$$

where the two terms on the rhs, when divided by N, are for σ² and V{E(y|x)}, respectively.
The aforementioned test statistic for mean equality is

$$\frac{(J-1)^{-1}\sum_{j=1}^J N_j(\bar y_j - \bar y)^2}{(N-J)^{-1}\sum_{j=1}^J\sum_{i=1}^{N_j}(y_{ij} - \bar y_j)^2} \sim F(J-1,\ N-J).$$

To understand the dof's, note that there are J-many "observations" (the ȳ_j's) in the numerator, and 1 is subtracted in the dof because the grand mean gets estimated by ȳ. In the denominator, there are N-many observations y_ij, and J is subtracted in the dof because the group means get estimated by the ȳ_j's. Under the H_0, the test statistic is close to zero, as the numerator is, because V{E(y|x)} = 0; a sketch of the computation follows.
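A minimal sketch of the one-way ANOVA F statistic, with hypothetical groups generated under H_0 (equal means); scipy is assumed available for the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
J, Nj = 3, np.array([40, 55, 65])
means = np.array([10.0, 10.0, 10.0])          # H0: equal group means
groups = [m + rng.normal(size=n) for m, n in zip(means, Nj)]

y = np.concatenate(groups)
N = y.size
ybar = y.mean()
ybar_j = np.array([g.mean() for g in groups])

explained = (Nj * (ybar_j - ybar) ** 2).sum()             # between-group variation
unexplained = sum(((g - g.mean()) ** 2).sum() for g in groups)  # within-group
F = (explained / (J - 1)) / (unexplained / (N - J))
pval = 1 - stats.f.cdf(F, J - 1, N - J)
print(F, pval)                                # ~ F(J-1, N-J) under H0
```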
The model y_ij = μ_j + u_ij can be rewritten as a familiar linear model. Define J − 1 dummy variables, say x_i2, ..., x_iJ, where x_ij = 1 if observation i belongs to group j and x_ij = 0 otherwise. Then, with x_i ≡ (1, x_i2, ..., x_iJ)', the model is y_i = x_i'β + u_i where

$$\beta = (\mu_1,\ \mu_2 - \mu_1,\ \dots,\ \mu_J - \mu_1)'.$$
Here the intercept is for μ_1 and the slopes are for the deviations from μ_1; group 1 is typically the "control (i.e., no-treatment) group" whereas the other groups are the "treatment groups." For instance, if observation i belongs to treatment group 2, then E(y_i|x_i) = μ_1 + (μ_2 − μ_1) = μ_2.
Instead of the above F-test, we can test for H_0: μ_1 = ... = μ_J with the "Wald test" to appear later without assuming normality; the Wald test checks whether all slopes are zero or not.
"Two-way ANOVA" generalizes one-way ANOVA. There are two "factors" now, and we get y_ijk where j and k index group (j, k), j = 1, ..., J and k = 1, ..., K; group (j, k) has N_jk observations. The model is

$$y_{ijk} = \mu + \alpha_j + \beta_k + \gamma_{jk} + u_{ijk}$$

where α_j is the factor-1 effect, β_k is the factor-2 effect, and γ_jk is the interaction effect between the two factors. The relevant decomposition is

$$y_{ijk} - \bar y = (\bar y_{j.} - \bar y) + (\bar y_{.k} - \bar y) + (y_{ijk} - \bar y_{j.} - \bar y_{.k} + \bar y)$$

where ȳ is the grand mean, ȳ_j. is the average of all observations with j fixed (i.e., ȳ_j. ≡ ∑_{k=1}^K ∑_{i=1}^{N_jk} y_ijk / ∑_{k=1}^K N_jk), and ȳ_.k is analogously defined. Various F-test statistics can be devised by squaring and summing up this display, but the two-way ANOVA model can also be written as a familiar linear model, to which "Wald tests" can be applied.
• First, do the LSE of y_i on x_i to get the residuals û_i.

• Second, estimate θ by the LSE of û_i² on m_i to get the LSE θ̂ for θ; this is motivated by E(u²|x) = m'θ.

The assumption m_i'θ̂ > 0 for all m_i can be avoided if V(u|x) = exp(m'θ) and if θ is estimated with the "nonlinear LSE" that will appear later. The assumption m_i'θ̂ > 0 for all m_i is simply to illustrate WLS using LSE in the first step.
An easy practical alternative to guarantee positive estimated weights is adopting a log-linear model ln u_i² = m_i'ζ + v_i with v_i being an error term. The log-linear model is equivalent to

$$u_i^2 = e^{m_i'\zeta}e^{v_i} = (e^{m_i'\zeta/2}\nu_i)^2 \quad\text{where } \nu_i \equiv e^{v_i/2},$$

and e^{m_i'ζ/2} may be taken as the scale factor SD(u|x_i) for ν_i (but e^{m_i'ζ/2}ν_i > 0, and thus the error u_i cannot be e^{m_i'ζ/2}ν_i although u_i² = (e^{m_i'ζ/2}ν_i)²). This suggests using SD(u|x_i) = e^{m_i'ζ̂/2} for WLS weighting, where ζ̂ is the LSE for ζ. Strictly speaking, this "suggestion" is not valid because, for SD(u|x_i) = e^{m_i'ζ/2} to hold, we need

$$\ln E(u^2|x_i) = m_i'\zeta \iff E(u^2|x_i) = \exp(m_i'\zeta),$$
but ln u_i² = m_i'ζ + v_i postulates instead E(ln u²|x_i) = m_i'ζ. Since ln E(u²|x_i) ≠ E(ln u²|x_i), ln u_i² = m_i'ζ + v_i is not compatible with SD(u|x_i) = e^{m_i'ζ/2}. Despite this, however, defining û_i* ≡ y_i − x_i'b_wls where b_wls is the WLS with weight exp(m_i'ζ̂/2), so long as the LSE of û_i*² on m_i returns insignificant slopes, we can still say that the weight exp(m_i'ζ̂/2) is adequate, because the heteroskedasticity has been removed by the weight no matter how it was obtained. A sketch of this two-step procedure follows.
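A minimal sketch of the log-linear weighting, under a hypothetical DGP where SD(u|x) = exp(m'ζ/2) actually holds (taking m_i = x_i for illustration). Note that a multiplicative constant in the weight does not change the WLS point estimates, so the ln u² vs. ln E(u²|x) discrepancy does not matter for this step.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 5000
x = np.column_stack([np.ones(N), rng.normal(size=N)])
m = x                                           # m_i = x_i, for illustration
zeta = np.array([0.0, 1.0])
u = np.exp(m @ zeta / 2) * rng.normal(size=N)   # SD(u|x) = exp(m' zeta / 2)
y = x @ np.array([1.0, 2.0]) + u

# Step 1: LSE, then LSE of ln(uhat^2) on m to get zeta-hat
b0 = np.linalg.solve(x.T @ x, x.T @ y)
uhat = y - x @ b0
zhat = np.linalg.solve(m.T @ m, m.T @ np.log(uhat ** 2))

# Step 2: WLS with weight exp(m' zhat / 2), always positive
w = 1 / np.exp(m @ zhat / 2)
Xs, ys = x * w[:, None], y * w
b_wls = np.linalg.solve(Xs.T @ Xs, Xs.T @ ys)
print(b_wls)
```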
In short, the exact specification of the heteroskedasticity has different implications for how we go about LSE. If, for example, u_i = w_i exp(x_i'θ/2) with w_i ~ (0, 1) independent of x_i, then V(u|x) = exp(x'θ), and we can do WLS with this. This is also convenient in viewing y_i: y_i is obtained by generating x_i and w_i first and then summing up x_i'β and w_i exp(x_i'θ/2). But if the specified form of heteroskedasticity exp(x'θ) is wrong, then the asymptotic variance of the WLS is no longer E⁻¹{xx'/V(u|x)}. So it is safer to use LSE with the heteroskedasticity-robust variance. From now on, we will not invoke the homoskedasticity assumption unless it gives helpful insights for the problem at hand, which does happen from time to time.
We list both tv-het and tv-ho; the latter is in (·) and was computed with b_lse,j/√v_N,jj, j = 1, ..., k, where V_N ≡ [v_N,hj], h, j = 1, ..., k, is defined as s_N²(∑_i x_i x_i')⁻¹. The large differences between the two types of t-values indicate that the homoskedasticity assumption would not be valid for this model.
In this time-series data, if the form of heteroskedasticity is correctly modeled,
the gain in significance (i.e., the gain in the precision of the estimators) would
be substantial. Indeed, such modeling is often done in financial time-series.
Typically, we test for some chosen elements of β being zero jointly. In that case, R consists of the column vectors picking up the chosen elements of β (each column consists of k − 1 zeros and a single 1) and c is a zero vector.
Given the above C and R, define H and Λ such that

$$R'CR = H\Lambda H',$$

where H is orthonormal and Λ is the diagonal matrix of eigenvalues, and set S ≡ HΛ^{−0.5}H'. Further observe

$$\sqrt N\cdot R'(b_N - \beta) \rightsquigarrow N(0, R'CR) \quad \{\text{from } \sqrt N(b_N - \beta) \rightsquigarrow N(0, C) \text{ "times } R'\text{"}\},$$
$$\sqrt N\cdot S R'(b_N - \beta) \rightsquigarrow N(0, I_g), \quad\text{since } S\cdot R'CR\cdot S' = H\Lambda^{-0.5}H'\cdot H\Lambda H'\cdot H\Lambda^{-0.5}H' = I_g.$$

Hence N(R'b_N − R'β)'S'S(R'b_N − R'β) is a sum of g-many squared, asymptotically uncorrelated N(0, 1) random variables (rv). Replacing R'β with c under H_0: R'β = c, we get a Wald test statistic

$$N(R'b_N - c)'(R'C_N R)^{-1}(R'b_N - c) \rightsquigarrow \chi^2_g;$$

a sketch of its computation follows.
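A minimal sketch of the Wald statistic with the robust C_N, under a hypothetical DGP in which H_0 (two zero slopes) is true; scipy is assumed available for the chi-squared p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
N, k = 1000, 4
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=N)

b = np.linalg.solve(X.T @ X, X.T @ y)
uhat = y - X @ b
Sinv = np.linalg.inv(X.T @ X / N)
C_N = Sinv @ ((X * (uhat**2)[:, None]).T @ X / N) @ Sinv   # robust C_N

# H0: beta_3 = beta_4 = 0, i.e., R'beta = c with c = 0
R = np.zeros((k, 2)); R[2, 0] = 1; R[3, 1] = 1
g = R.shape[1]
diff = R.T @ b                                             # R'b - c, c = 0
W = N * diff @ np.linalg.solve(R.T @ C_N @ R, diff)        # Wald statistic
pval = 1 - stats.chi2.cdf(W, df=g)
print(W, pval)                                             # ~ chi^2_g under H0
```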
3.2 Remarks
When b_N is the LSE of y on x, we get C = E⁻¹(xx')E(xx'u²)E⁻¹(xx'), estimated with C_N ≡ Ω_N in (*''). Here, we take the "working proposition" that, for the expected value E(h(x, y, β)) where h(x, y, β) is a (matrix-valued) function of x, y, and β, it holds in general that

$$\frac{1}{N}\sum_i h(x_i, y_i, b_N) - E(h(x, y, \beta)) = o_p(1), \quad\text{if } b_N \to^p \beta.$$
$$\tilde C_N \equiv (N-1)(X'X)^{-1}\left(X'\tilde D X - \frac{X'\tilde r\tilde r'X}{N}\right)(X'X)^{-1}, \quad\text{where}$$
$$\tilde D \equiv \mathrm{diag}(\tilde r_1^2, \dots, \tilde r_N^2), \quad \tilde r_i \equiv \frac{y_i - x_i'b_{lse}}{1 - d_{ii}}, \quad \tilde r \equiv (\tilde r_1, \dots, \tilde r_N)',$$

and d_ii is the ith diagonal element of the matrix X(X'X)⁻¹X'. C̃_N and C_N are asymptotically equivalent, as the term X'r̃r̃'X/N in C̃_N is of smaller order than X'D̃X.
Although the two variance estimators C_N and C_No numerically differ in finite samples, we have C_N − C_No = o_p(1) under homoskedasticity. As already noted, too much difference between C_N and C_No would indicate the presence of heteroskedasticity, which is the basis for the White (1980) test for heteroskedasticity. We will not, however, test for heteroskedasticity; instead, we will just allow it by using the heteroskedasticity-robust variance estimator C_N. There have been criticisms of the heteroskedasticity-robust variance estimator. For instance, Kauermann and Carroll (2001) showed that, when homoskedasticity holds, the heteroskedasticity-robust variance estimator has a higher variance than the variance estimator under homoskedasticity, and that confidence intervals based on the former have a coverage probability lower than the nominal value.
Suppose

$$y_i = x_i'\beta + d_i\beta_d + d_i w_i'\beta_{dw} + u_i$$

where d_i is a dummy variable of interest (e.g., a key policy variable on (d = 1) or off (d = 0)), and w_i consists of elements of x_i interacting with d_i. Here, the effect of d_i on y_i is β_d + w_i'β_dw, which varies across i; i.e., we get N different individual effects. A way to summarize the N-many effects is using β_d + E(w')β_dw (the effect evaluated at the "mean person") or β_d + Med(w')β_dw (the effect evaluated at the "median person").
adding the other to the model when one is already included does not add any new explanatory power. With k = 11, g = 2, c = (0, 0)',

$$\underset{2\times 11}{R'} = \begin{bmatrix} 0 & 0 & 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}, \quad \underset{11\times 1}{\beta} = (\beta_{\ln(T)}, \beta_1, \beta_{bath}, \beta_{elec}, \beta_{room}, \dots, \beta_{supply})',$$
the Wald test statistic is 0.456 with the p-value 0.796 = P(χ²₂ > 0.456) for the model with ln(T) and C_N: the joint null hypothesis is not rejected. When C_No is used instead of C_N, the Wald test statistic is 0.501 with the p-value 0.779—hardly any change. Although BATH and ROOM are important variables for house prices, they do not explain the discount % DISC. The t-values with C_N and C̃_N shown below for the 11 regressors differ little (tv-C_N was shown already) because N = 467 is not too small for the number of regressors:
$$\frac{1}{2}\beta_{pq}\ln x_p\ln x_q + \frac{1}{2}\beta_{qp}\ln x_q\ln x_p = \frac{\beta_{pq} + \beta_{qp}}{2}\ln x_p\ln x_q:$$

we can only identify the average of β_pq and β_qp, and β_pq = β_qp essentially redefines the average as β_pq.
If we take the translog function as a second-order approximation to an underlying smooth function, say y = exp{f(x)}, then β_pq = β_qp is a natural restriction from the symmetry of the second-order derivative matrix. Specifically, observe

$$\beta_1 + \beta_2 + \beta_3 = 1, \quad \beta_{11} + \beta_{12} + \beta_{13} = 0,$$
$$\beta_{12} + \beta_{22} + \beta_{23} = 0 \ (\text{from } \beta_{21} + \beta_{22} + \beta_{23} = 0), \quad\text{and}$$
$$\beta_{13} + \beta_{23} + \beta_{33} = 0 \ (\text{from } \beta_{31} + \beta_{32} + \beta_{33} = 0).$$
$$\beta = E^{-1}(zx')\cdot E(zy).$$

While IVE in its broad sense includes any estimator using instruments, here we define IVE in its narrow sense as the one taking this particular form. IVE includes LSE as a special case when z = x (or Z = X in matrices).
Substitute y_i = x_i'β + u_i into the b_ive formula to get

$$b_{ive} = \left(\frac{1}{N}\sum_i z_i x_i'\right)^{-1}\frac{1}{N}\sum_i z_i(x_i'\beta + u_i) = \beta + \left(\frac{1}{N}\sum_i z_i x_i'\right)^{-1}\frac{1}{N}\sum_i z_i u_i.$$
The consistency of the IVE follows simply by applying the LLN to the terms other than β in the last equation. As for the asymptotic distribution, observe

$$\sqrt N(b_{ive} - \beta) = \left(\frac{1}{N}\sum_i z_i x_i'\right)^{-1}\frac{1}{\sqrt N}\sum_i z_i u_i.$$

Applying the LLN to N⁻¹∑_i z_i x_i' and the CLT to N^(−1/2)∑_i z_i u_i, it holds that

$$\sqrt N(b_{ive} - \beta) \rightsquigarrow N\big(0,\ E^{-1}(zx')E(zz'u^2)E^{-1}(xz')\big).$$

This is informally stated as

$$b_{ive} \sim N\left(\beta,\ \frac{1}{N}E^{-1}(zx')E(zz'u^2)E^{-1}(xz')\right);$$

a sketch with a simulated endogenous regressor follows.
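A minimal sketch of the just-identified IVE, under a hypothetical DGP where x_2 is endogenous (correlated with u through v) and z_2 is a valid instrument:

```python
import numpy as np

rng = np.random.default_rng(8)
N = 20_000
z2 = rng.normal(size=N)                       # instrument
v = rng.normal(size=N)
u = 0.7 * v + rng.normal(size=N)              # error correlated with x2 via v
x2 = 0.6 * z2 + v                             # endogenous regressor
X = np.column_stack([np.ones(N), x2])
Z = np.column_stack([np.ones(N), z2])         # z has the same dimension as x
y = X @ np.array([1.0, 2.0]) + u

b_ive = np.linalg.solve(Z.T @ X, Z.T @ y)     # (sum z_i x_i')^-1 sum z_i y_i
b_lse = np.linalg.solve(X.T @ X, X.T @ y)     # inconsistent for beta
print(b_ive, b_lse)                           # IVE near (1, 2); LSE biased upward
```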
(i) COR(w, u) = 0 (⟺ E(wu) = 0),
(ii) COR(w, x_4) ≠ 0 ("inclusion restriction"),
(iii) w does not appear in the y equation ("exclusion restriction"),

then w is a valid instrumental variable (IV)—or just instrument—for x_4, and we can use z_i = (1, x_i2, x_i3, w_i)'. The reason why (ii) is called the "inclusion restriction" is that w should be in the x_4 equation for (ii) to hold. Conditions (ii) and (iii) together are simply called "inclusion/exclusion restrictions."
As an example, suppose that y is blood pressure, x_2 is age, x_3 is gender, x_4 is exercise, u includes health concern, and w is a randomized education dummy variable on the health benefits of exercise (i.e., a coin is flipped to give person i the education if heads comes up). Those who are health-conscious may exercise more, which means COR(x_4, u) ≠ 0. Checking out (i)–(iii) for w: first, w satisfies (i) because w is randomized. Second, those who received the health education are likely to exercise more, thus implying (ii). Third, receiving the education alone cannot affect blood pressure, and hence (iii) holds. (iii) does not mean that w should not influence y at all: (iii) is that w can affect y only indirectly through x_4.
Condition (i) is natural in view of E(zu) = 0. Condition (ii) is necessary as w is used as a "proxy" for x_4; if COR(w, x_4) = 0, then w cannot represent x_4—a rv from a coin toss is independent of x_4 and fails (ii) despite satisfying (i) and (iii). Condition (iii) is necessary to make E(zx') invertible; an exogenous regressor x_2 (or x_3) already in the y-equation cannot be used as an instrument for x_4 despite satisfying (i) and possibly (ii), because E(zx') is not invertible if z = (1, x_2, x_3, x_2)'.
Recalling partial regression, only the part of x_4 not explained by the other regressors (1, x_2, x_3) in the y equation contributes to explaining y. Of that part of x_4, w picks only the part uncorrelated with u, because w is uncorrelated with u by condition (i). The instrument w is said to extract the "exogenous variation" in x_4. In view of this, to be more precise, (ii) should be replaced with

(ii') w is correlated with the part of x_4 that is not explained by the other regressors (1, x_2, x_3).
Condition (ii) can be (and should be) verified by the LSE of x_4 on w and the other regressors: the slope coefficient of w should be non-zero in this LSE for w to be a valid instrument. But conditions (i) and (iii) cannot be checked out; they can only be "argued for." In short, an instrument should be excluded from the response equation and included in the endogenous regressor equation, with zero correlation with the error term.
There are a number of sources for the endogeneity of x_4. One is measurement error: if only an error-ridden version x_4^e ≡ x_4 + e of x_4 is observed, we may rewrite β_4 x_4 as β_4 x_4^e − β_4 e and use x_4^e as a regressor. But the new error u − β_4 e is correlated with x_4^e through e.
When this is estimated by LSE, the slope estimator for x_3 is consistent for β_3 + β_4γ_2, where β_4γ_2 is nothing but the bias due to omitting x_4 in the LSE. The slope parameter β_3 + β_4γ_2 of x_3 consists of two parts: the "direct effect" of x_3 on y, and the "indirect part" of x_3 on y through x_4. If x_3 affects x_4 but not the other way around, then the indirect part can be interpreted as the "indirect effect" of x_3 on y through x_4. So long as we are interested in the total effect β_3 + β_4γ_2, the LSE is all right. But usually in economics, the desired effect is the "ceteris paribus" effect of changing x_3 while holding all the other variables (including x_4) constant.
The IVE can also be cast into a minimization problem. The sample analog of E(zu) is N⁻¹∑_i z_i u_i. Since u_i is unobservable, replace u_i by y_i − x_i'b to get N⁻¹∑_i z_i(y_i − x_i'b). We can get the IVE by minimizing the deviation of N⁻¹∑_i z_i(y_i − x_i'b) from 0. Since N⁻¹∑_i z_i(y_i − x_i'b) is a k × 1 vector, we need to choose how to measure the distance from 0. Adopting the squared Euclidean norm as usual and ignoring N⁻¹, we get

$$\left\{\sum_i z_i(y_i - x_i'b)\right\}'\cdot\sum_i z_i(y_i - x_i'b) = \{Z'(Y - Xb)\}'\cdot Z'(Y - Xb)$$
$$= (Y - Xb)'ZZ'(Y - Xb) = Y'ZZ'Y - 2b'X'ZZ'Y + b'X'ZZ'Xb.$$
               #Children   More than   First birth   Same     Twin second   First birth
               ever        two         boy           sex      birth         age
All women      2.50        0.375       0.512         0.505    0.012         21.8
               (0.76)      (0.48)      (0.50)        (0.50)   (0.108)       (3.5)
Wives          2.48        0.367       0.514         0.503    0.011         22.4
               (0.74)      (0.48)      (0.50)        (0.50)   (0.105)       (3.5)
The table shows that there is not much difference between the all-women data and the wives-only data, that the probability of a boy is slightly higher than the probability of a girl, and that the probability of a twin birth is about 1%. Part of Table 8 for the all-women data in Angrist and Evans (1998) is summarized as follows: having the third child decreases weeks worked by 6–9% and hours worked by 4–7 hours per week. Overall, the IVE magnitudes are about 50–100% smaller than the LSE magnitudes.
which is the y "reduced form (RF)." Substituting the y RF into x_i4 = q_i'γ + αy_i + v_i, we also get the x_4 RF:

$$x_{i4} = q_i'\gamma + \frac{\alpha}{1 - \beta_4\alpha}(\beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 q_i'\gamma + \beta_4 v_i + u_i) + v_i.$$
Judging from u's slope α(1 − β_4α)⁻¹ > 0, we get COR(x_4, u) > 0. Suppose that LSE is run for the y equation ignoring the simultaneity. Then, with x_i = (1, x_i2, x_i3, x_i4)', the LSE of y on x will be inconsistent by the magnitude

$$E^{-1}(xx')E(xu) = E^{-1}(xx')\{0, 0, 0, E(x_4 u)\}':$$

the LSE for β_4 is upward biased, and hence the LSE for β_4 can even be positive. Recalling the discussion on omitted variable bias, we can see that the bias is not restricted to β_4 if x_4 is correlated with x_2 or x_3, because the last column of E⁻¹(xx') can "spread" E(x_4 u) ≠ 0 to all components of the LSE.
One way to overcome the simultaneity problem is to use data for short periods. For instance, if y is a monthly crime number for city i and x_4 is the number of policemen in the same month, then it is unlikely that y affects x_4, as it takes time to adjust x_4, whereas x_4 can affect y almost instantly. Another way is to find an instrument. Levitt (1997) noted that the change in x_4 takes place almost always in election years, mayoral or gubernatorial. Thus he set up a "panel (or longitudinal) data" model where y_it is a change in crime numbers for city i and year t, x_it,4 is a change in policemen, and w_it = 1 if year t is an election year in city i and 0 otherwise; w_it is unlikely to be correlated with the error term in the crime-number-change equation. Levitt (1997) concluded that the police force size reduces (violent) crimes.
As McCrary (2002) noted, however, there was a small error in Levitt (1997). Levitt (2002) thus proposed the number of firefighters per capita as a new instrument for the number of policemen per capita in a panel data model. Part of Table 3 in Levitt (2002) for the police effect is shown below with SD in (·), where "LSE without city dummies" means the LSE without city dummies but with year dummies. By not using city dummies, the parameters are identified mainly with cross-city variation, because cross-city variation is much greater than over-time variation, and this LSE is thus similar to a cross-section LSE pooling all panel data.
This table shows that the LSE's are upward biased as analyzed above, although the bias is smaller when the city dummies are used, and that police force expansion indeed reduces the number of crimes. The number of firefighters is an attractive instrument, but somewhat less convincing than the instruments in the fertility example.
The model is

$$y_{it} = x_{it}'\beta + \delta_i + u_{it}$$

where y_it is the number of various offenses per 100,000 people in county i at year t; x_it includes the mean log weekly wage of non-college-educated men (wage_it), the unemployment rate of non-college-educated men (ur_it), the mean log household income (inc_it), and time dummies; δ_i is a time-constant error and u_it is a time-variant error. Our presentation in the following is a rough simplification of their longer models.
Since δ_i represents each county's unobserved long-standing culture and practice, such as how extensively crimes are reported and so on, δ_i is likely to be correlated with x_it. They take the difference between 1979 and 1989 to remove δ_i and get (removing δ_i by differencing is a "standard" procedure in panel data)

$$\Delta y_i = \Delta x_i'\beta + \Delta u_i,$$

where Δy_i ≡ y_i,1989 − y_i,1979, and Δx_i and Δu_i are analogously defined, with N = 564. Their estimation results are as follows, with SD in (·):

LSE: b_wage = −1.13 (0.38), b_ur = 2.35 (0.62), b_inc = 0.71 (0.35), R² = 0.094;
IVE: b_wage = −1.06 (0.59), b_ur = 2.71 (0.97), b_inc = 0.093 (0.55);
the instruments will be explained below. All three estimates in the LSE are significant and show that low wage, high unemployment rate, and high household income increase the crime rate. The IVE is close to the LSE in wage_it and ur_it, but much smaller for inc_it and insignificant. See Freeman (1999) for a survey on crime and economics.
It is possible that crime rates influence local labor market conditions, because firms may move out in response to high crime rates or may offer higher wages to compensate for high crime rates. This means that a simultaneous relation problem may occur between crime rates and labor market conditions. To avoid this problem, Gould et al. (2002) constructed a number of instruments. One of the instruments is

$$\sum_j (\text{employment share of industry } j \text{ in county } i \text{ in 1979})\cdot(\text{national growth rate of industry } j \text{ for 1979–1989}).$$
For the product AB of two matrices A and B where B⁻¹ exists, rank(AB) = rank(A); i.e., multiplication by a non-singular matrix B does not alter the rank of A. This fact implies that E(xz')E⁻¹(zz')E(zx') has rank k and thus is invertible, so that

$$\beta = \{E(xz')E^{-1}(zz')E(zx')\}^{-1}E(xz')E^{-1}(zz')E(zy).$$

If E(zx') is invertible (the case p = k), then this expression becomes E⁻¹(zx')·E(zy), and the resulting b_ive is the IVE when the number of instruments is the same as the number of parameters.
The sample analog of β is the following instrumental variable estimator:

$$b_{ive} = \left\{\sum_i x_i z_i'\left(\sum_i z_i z_i'\right)^{-1}\sum_i z_i x_i'\right\}^{-1}\cdot\sum_i x_i z_i'\left(\sum_i z_i z_i'\right)^{-1}\sum_i z_i y_i$$

where the many N⁻¹'s that cancel one another are ignored. The consistency is obvious, and the asymptotic distribution of √N(b_ive − β) is

$$N\big(0,\ A^{-1}\,E(xz')E^{-1}(zz')E(zz'u^2)E^{-1}(zz')E(zx')\,A^{-1}\big), \quad A \equiv E(xz')E^{-1}(zz')E(zx').$$

Defining X̂ ≡ Z(Z'Z)⁻¹Z'X = P_Z X, b_ive can be written as (X̂'X̂)⁻¹X̂'Y,
as if b_ive were the LSE for the equation Y = X̂β + error, where the "error" is Y − X̂β. This rewriting accords an interesting interpretation to b_ive: as X̂ is the part of X explained by the exogenous z, only the exogenous variation in X is used. The expression (X̂'X̂)⁻¹X̂'Y also demonstrates that the so-called "two-stage LSE (2SLSE)" for simultaneous equations is nothing but IVE. For simplification, consider two simultaneous equations with two endogenous variables y_1 and y_2:
ables y1 and y2 :
$$y_1 = \alpha_1 y_2 + x_1'\beta_1 + u_1, \quad y_2 = \alpha_2 y_1 + x_2'\beta_2 + u_2,$$
$$COR(x_j, u_{j'}) = 0,\ j, j' = 1, 2, \quad\text{and}\quad x_1 \neq x_2.$$
Let z denote the system exogenous regressors (i.e., the collection of the elements in x_1 and x_2). Denoting the regressors for the y_1 equation as x ≡ (y_2, x_1')', the first step of 2SLSE for (α_1, β_1) is the LSE of y_2 on z to obtain the fitted value ŷ_2 of y_2, and the second step is the LSE of y_1 on (ŷ_2, x_1). This 2SLSE is nothing but the IVE where the first step is P_Z X to obtain the LSE fitted value of x on z—the LSE fitted value of x_1 on z is simply x_1—and the second step is the LSE of y_1 on P_Z X; a numerical sketch follows.
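A minimal sketch verifying that the two-stage computation and the one-step IVE formula coincide, under a hypothetical over-identified DGP (p = 3 instruments, k = 2 parameters):

```python
import numpy as np

rng = np.random.default_rng(9)
N = 10_000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # p = 3 instruments
v = rng.normal(size=N)
x_endo = Z[:, 1] + 0.5 * Z[:, 2] + v                        # endogenous regressor
X = np.column_stack([np.ones(N), x_endo])                   # k = 2 < p
u = 0.8 * v + rng.normal(size=N)
y = X @ np.array([1.0, 2.0]) + u

# First stage: project X on Z; second stage: LSE of y on Xhat
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
Xhat = Pz @ X
b_2sls = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)

# Same estimator in one step: {X'Z(Z'Z)^-1 Z'X}^-1 X'Z(Z'Z)^-1 Z'Y
b_ive = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ y)
print(b_2sls, b_ive)                                        # identical
```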
The justification was given ahead already: Y = X̂β + error with the error term asymptotically orthogonal to X̂.
Observe

$$E(xx') = E(xz')E^{-1}(zz')E(zx') + E\{(x - \gamma'z)(x - \gamma'z)'\} \quad\text{with } \gamma \equiv E^{-1}(zz')E(zx').$$

This is a decomposition of E(xx') into two parts, one explained by z and the other unexplained by z; x − γ'z is the "residual" (γ'z is the linear projection of x on z; in comparison, E(x|z) is often called the projection of x on z). From the decomposition, we get E(xz')E⁻¹(zz')E(zx') ≤ E(xx') in the positive semi-definite sense.
There are many ways to combine the moment condition E(zu) = 0 with the model, the IVE being just one of them. If we use E(xz')W⁻¹ where W is a p × p p.d. matrix, we will get

$$E(xz')W^{-1}E(zy) = E(xz')W^{-1}E(zx')\beta \implies \beta = \{E(xz')W^{-1}E(zx')\}^{-1}E(xz')W^{-1}E(zy).$$

As it turns out, W = E(zz'u²) is optimal for iid samples, which is the theme of this section.
$$b_W = (X'ZW^{-1}Z'X)^{-1}\cdot(X'ZW^{-1}Z'Y) = \left\{\sum_i x_i z_i'\,W^{-1}\sum_i z_i x_i'\right\}^{-1}\cdot\sum_i x_i z_i'\,W^{-1}\sum_i z_i y_i, \quad\text{in vectors.}$$
With W = E(zz'u²), this matrix becomes {E(xz')E⁻¹(zz'u²)E(zx')}⁻¹, and we get the GMM with

$$\sqrt N(b_{gmm} - \beta) \rightsquigarrow N\big(0,\ \{E(xz')E^{-1}(zz'u^2)E(zx')\}^{-1}\big).$$

Differently from IVE, Z(Z'DZ)⁻¹Z' is no longer the linear projection matrix of Z. A consistent estimator for the GMM asymptotic variance {E(xz')E⁻¹(zz'u²)E(zx')}⁻¹ is easily obtained: it is simply the first part {X'Z(Z'DZ)⁻¹Z'X}⁻¹ of b_gmm times N; a two-step sketch follows.
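A minimal sketch of two-step GMM (IVE first stage for residuals, then the weighting with Z'DZ), under a hypothetical heteroskedastic over-identified DGP:

```python
import numpy as np

rng = np.random.default_rng(10)
N = 10_000
Z = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # p = 3
v = rng.normal(size=N)
x2 = Z[:, 1] - Z[:, 2] + v
X = np.column_stack([np.ones(N), x2])                        # k = 2
u = (1 + 0.5 * Z[:, 1] ** 2) * rng.normal(size=N) + 0.6 * v  # heteroskedastic, E(zu)=0
y = X @ np.array([1.0, 2.0]) + u

# Step 1: IVE to get residuals r_i
Pz = Z @ np.linalg.solve(Z.T @ Z, Z.T)
b_ive = np.linalg.solve(X.T @ Pz @ X, X.T @ Pz @ y)
r = y - X @ b_ive

# Step 2: GMM with weighting Z'DZ, D = diag(r_i^2)
ZDZ = (Z * (r**2)[:, None]).T @ Z                            # sum z_i z_i' r_i^2
XZ = X.T @ Z
A = XZ @ np.linalg.solve(ZDZ, XZ.T)                          # X'Z(Z'DZ)^-1 Z'X
b_gmm = np.linalg.solve(A, XZ @ np.linalg.solve(ZDZ, Z.T @ y))
se = np.sqrt(np.diag(np.linalg.inv(A)))                      # from N * A^-1, divided by N
print(b_gmm, se)
```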
Too big a value, greater than an upper quantile of χ²_{p−k}, indicates that some moment conditions do not hold (or some other assumptions of the model may be violated). The reader may wonder how we can test for the very moment conditions that were used to get the GMM. If there are only k moment conditions, this concern is valid. But when there are more than k moment conditions (p-many), essentially only k of them get used in obtaining the GMM. The GMM over-identification test checks whether the remaining p − k moment conditions are satisfied by the GMM, as can be seen in the degrees of freedom ("dof") of the test.
The test statistic may be viewed as

$$\left(\frac{1}{\sqrt N}\sum_i z_i u_{Ni}\right)'\left(\frac{1}{N}\sum_i z_i z_i'u_{Ni}^2\right)^{-1}\left(\frac{1}{\sqrt N}\sum_i z_i u_{Ni}\right).$$
Defining the matrix version of z_i'u_Ni/√N as G—i.e., the ith row of G is z_i'u_Ni/√N—this display can be written as

$$\{G(G'G)^{-1}G'1_N\}'\,G(G'G)^{-1}G'1_N = 1_N'G(G'G)^{-1}G'1_N,$$

using the symmetry and idempotency of G(G'G)⁻¹G'. The inner-product form shows that the test statistic is non-negative at least; a sketch follows.
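A minimal sketch of the over-identification (J) statistic; the function can be applied to the variables from the GMM sketch above (a hypothetical DGP), and scipy is assumed available for the p-value.

```python
import numpy as np
from scipy import stats

def j_test(Z, X, y, b_gmm):
    """Over-identification test: J ~ chi^2(p - k) under valid moment conditions."""
    r = y - X @ b_gmm                         # GMM residuals u_Ni
    g = Z.T @ r                               # sum_i z_i u_Ni
    S = (Z * (r**2)[:, None]).T @ Z           # sum_i z_i z_i' u_Ni^2
    J = g @ np.linalg.solve(S, g)             # quadratic form, non-negative
    dof = Z.shape[1] - X.shape[1]             # p - k
    return J, 1 - stats.chi2.cdf(J, dof)

# print(j_test(Z, X, y, b_gmm))               # usable with the previous snippet
```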
Using the GMM over-identification test, a natural thing to do is to use
only those moment conditions that are not rejected by the test. This can be
done in practice by doing GMM on various subsets of the moment conditions,
which would be ad hoc, however. Andrews (1999) and Hall and Peixe (2003)
provided a formal discussion on this issue of selecting valid moment condi-
tions, although how popular these suggestions will be in practice remains to
be seen.
Under the homoskedasticity E(u²|z) = σ², W = σ²E(zz') = σ²Z'Z/N + o_p(1). But any multiplicative scalar in W is irrelevant for the minimization. Hence setting W = Z'Z is enough, and b_gmm becomes b_ive under homoskedasticity; the aforementioned optimality of b_ive comes from the GMM optimality under homoskedasticity. Under homoskedasticity, we do not need an initial estimator to get the residuals r_i. But when we do not know whether homoskedasticity holds or not, GMM is obtained in two stages: first apply IVE to get the r_i's, then use ∑_i z_i z_i'r_i² to get the GMM. For this reason, the GMM is sometimes called a "two-stage IVE."
We can summarize our analysis for the linear model under the p × 1 moment condition E(zu) = 0 and heteroskedasticity of unknown form as follows. First, the efficient estimator when p ≥ k is

$$b_{gmm} = \{X'Z(Z'DZ)^{-1}Z'X\}^{-1}X'Z(Z'DZ)^{-1}Z'Y.$$

If homoskedasticity prevails, the efficient estimator is

$$b_{ive} = \{X'Z(Z'Z)^{-1}Z'X\}^{-1}X'Z(Z'Z)^{-1}Z'Y.$$
Then y_{t−1}, x_{t−1}, y_{t−2}, x_{t−2}, ... are all valid instruments, because the error term E(y_{t+1}|I_t) − y_{t+1} is uncorrelated with all available information up to t. If E(x_t ε_t) = 0, then x_t is also a good instrument.
can be dealt with by IVE and GMM. We will use L1, L2, and L3 as instruments. An argument for the instruments would be that, while the market conditions and the characteristics of the house and realtor may influence DISC directly, it is unlikely that DISC is affected directly by when the house is listed in the market. The reader may object to this argument, in which case the following should be taken just as an illustration.
Although we cannot test for the exclusion restriction, we can at least check whether the three variables have explanatory power for the potentially endogenous regressor ln(T). For this, the LSE of ln(T) on the instruments and the exogenous regressors was done to yield (heteroskedasticity-robust variance used):

$$\ln(T_i) = \underset{(-0.73)}{-1.352} + \dots \underset{(-2.77)}{-0.294}\cdot L1\ \underset{(-2.21)}{-0.269}\cdot L2\ \underset{(-1.22)}{-0.169}\cdot L3, \quad R^2 = 0.098 \quad (t\text{-values in }(\cdot)),$$
which shows that L1, L2, and L3 indeed have explanatory power for ln(T). Table 2 shows the LSE, IVE, and GMM results (LSE is provided here again for the sake of comparison). The pseudo R² for the IVE is 0.144; compare this to the R² = 0.34 of the LSE. The GMM over-identification test statistic value and its p-value are, respectively, 2.548 and 0.280, not rejecting the moment conditions.

In the IVE, the tv-ho's are little different from the tv-het's, other than for YR. The difference between the tv for GMM and the tv-het for IVE is also negligible. The LSE has far more significant variables than the IVE and GMM, which are close to each other. The ln(T) estimate is about 50% smaller in LSE than in IVE and GMM. ELEC and ln(LP) lose their significance in the IVE and GMM, and their estimate sizes are also halved. YR has almost the same estimates and t-values across the three estimators. BIGS has similar estimates across the three estimators, but is not significant in the IVE and GMM. RATE is significant for all three estimators, and its value changes from −3 in LSE to −5 in the IVE and GMM. Overall, the signs of the significant estimates are the same across all estimators, and most earlier remarks made for LSE apply to IVE and GMM.
√
As in WLS, we need to replace θ with a first-stage N -consistent esti-
mator, say θ̂; call the GLS with θ̂ the “feasible GLS ” and the GLS with θ
the “infeasible GLS.” Whether the feasible GLS is consistent with the same
asymptotic distribution as the infeasible GLS follows depends on the form of
Ω(X; θ), but in all cases we will consider GLS for, this will be the case as to
be shown in a later chapter. In the transformed equation, the error terms are
iid and homoskedastic with unit variance. Thus we get
√
N (bGLS − β) N (0, E −1 (x∗ x∗ )).
This is the asymptotic variance of GMM, which means that GMM is efficient under E(zu) = 0, y = x'β + u, and the iid assumption. When z = x and u is a scalar, the efficiency bound becomes

$$\{E(xx')E^{-1}(xx'u^2)E(xx')\}^{-1} = E^{-1}(xx')E(xx'u^2)E^{-1}(xx'),$$

which is the asymptotic variance of LSE. Thus LSE is the most efficient under the moment condition E(xu) = 0, the linear model, and the iid assumption. Chamberlain (1987) also showed that if the stronger condition E(u|x) = 0 is used, the efficiency bound is E⁻¹{xx'/V(u|x)}, which becomes σ²E⁻¹(xx') under homoskedasticity. To compare the LSE asymptotic variance with this bound, write, with SD(u|x) ≡ √E(u²|x),

$$E(xx')E^{-1}(xx'u^2)E(xx') = E(xx')\,E^{-1}\{xx'E(u^2|x)\}\,E(xx')$$
$$= E\left[\frac{x}{SD(u|x)}\{x\,SD(u|x)\}'\right]E^{-1}\big[\{x\,SD(u|x)\}\{x\,SD(u|x)\}'\big]\,E\left[\{x\,SD(u|x)\}\frac{x'}{SD(u|x)}\right].$$