Introductory Econometrics A Modern Appro
TEACHING NOTES
You have substantial latitude about what to emphasize in Chapter 1. I find it useful to talk about
the economics of crime example (Example 1.1) and the wage example (Example 1.2) so that
students see, at the outset, that econometrics is linked to economic reasoning, if not economic
theory.
I like to familiarize students with the important data structures that empirical economists use,
focusing primarily on cross-sectional and time series data sets, as these are what I cover in a
first-semester course. It is probably a good idea to mention the growing importance of data sets
that have both a cross-sectional and time dimension.
I spend almost an entire lecture talking about the problems inherent in drawing causal inferences
in the social sciences. I do this mostly through the agricultural yield, return to education, and
crime examples. These examples also contrast experimental and nonexperimental data. Students
studying business and finance tend to find the term structure of interest rates example more
relevant, although the issue there is testing the implication of a simple theory, as opposed to
inferring causality. I have found that spending time talking about these examples, in place of a
formal review of probability and statistics, is more successful (and more enjoyable for the
students and me).
CHAPTER 2
TEACHING NOTES
This is the chapter where I expect students to follow most, if not all, of the algebraic derivations.
In class I like to derive at least the unbiasedness of the OLS slope coefficient, and usually I
derive the variance. At a minimum, I talk about the factors affecting the variance. To simplify
the notation, after I emphasize the assumptions in the population model, and assume random
sampling, I just condition on the values of the explanatory variables in the sample. Technically,
this is justified by random sampling because, for example, E(ui|x1,x2,…,xn) = E(ui|xi) by
independent sampling. I find that students are able to focus on the key assumption SLR.3 and
subsequently take my word about how conditioning on the independent variables in the sample is
harmless. (If you prefer, the appendix to Chapter 3 does the conditioning argument carefully.)
Because statistical inference is no more difficult in multiple regression than in simple regression,
I postpone inference until Chapter 4. (This reduces redundancy and allows you to focus on the
interpretive differences between simple and multiple regression.)
You might notice how, compared with most other texts, I use relatively few assumptions to
derive the unbiasedness of the OLS slope estimator, followed by the formula for its variance.
This is because I do not introduce redundant or unnecessary assumptions. For example, once
SLR.3 is assumed, nothing further about the relationship between u and x is needed to obtain the
unbiasedness of OLS under random sampling.
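As a compact reference for the in-class derivation described above, the unbiasedness argument can be sketched as follows. The notation $SST_x$ for the total sample variation in $x$ is an assumption of this sketch, and SLR.3 is taken here to be the zero conditional mean assumption $E(u|x) = 0$.

```latex
\begin{align*}
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{SST_x}
  = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, u_i}{SST_x},
  \qquad SST_x \equiv \sum_{i=1}^{n} (x_i - \bar{x})^2, \\
E(\hat{\beta}_1 \mid x_1, \dots, x_n)
  &= \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, E(u_i \mid x_1, \dots, x_n)}{SST_x}
  = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, E(u_i \mid x_i)}{SST_x}
  = \beta_1 .
\end{align*}
```

The first line uses $\sum (x_i - \bar{x}) = 0$ and $\sum (x_i - \bar{x}) x_i = SST_x$ after plugging in the population model; the second line is where random sampling and SLR.3 enter.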
SOLUTIONS TO PROBLEMS
2.1 (i) Income, age, and family background (such as number of siblings) are just a few
possibilities. It seems that each of these could be correlated with years of education. (Income
and education are probably positively correlated; age and education may be negatively correlated
because women in more recent cohorts have, on average, more education; and number of siblings
and education are probably negatively correlated.)
(ii) Not if the factors we listed in part (i) are correlated with educ. Because we would like to
2.2 In the equation y = β0 + β1x + u, add and subtract α0 from the right hand side to get y = (α0 +
β0) + β1x + (u − α0). Call the new error e = u − α0, so that E(e) = 0. The new intercept is α0 + β0,
but the slope is still β1.
2.3 (i) Let $y_i = GPA_i$, $x_i = ACT_i$, and $n = 8$. Then $\bar{x} = 25.875$, $\bar{y} = 3.2125$,
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = 5.8125$, and $\sum_{i=1}^{n} (x_i - \bar{x})^2 = 56.875$. From equation (2.9),
we obtain the slope as $\hat{\beta}_1 = 5.8125/56.875 \approx .1022$, rounded to four places after the decimal.
From (2.17), $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \approx 3.2125 - .1022(25.875) \approx .5681$.
The intercept does not have a useful interpretation because ACT is not close to zero for the
population of interest. If ACT is 5 points higher, $\widehat{GPA}$ increases by .1022(5) = .511.
(ii) The fitted values and residuals — rounded to four decimal places — are given along with
the observation number i and GPA in the following table:
i    GPA    Fitted GPA    û
1    2.8    2.7143         .0857
2    3.4    3.0209         .3791
3    3.0    3.2253        −.2253
4    3.5    3.3275         .1725
5    3.6    3.5319         .0681
6    3.0    3.1231        −.1231
7    2.7    3.1231        −.4231
8    3.7    3.6341         .0659
You can verify that the residuals, as reported in the table, sum to −.0002, which is pretty close to
zero given the inherent rounding error.
(iii) When ACT = 20, $\widehat{GPA} = .5681 + .1022(20) \approx 2.61$.
(iv) The sum of squared residuals, $\sum_{i=1}^{n} \hat{u}_i^2$, is about .4347 (rounded to four decimal places),
and the total sum of squares, $\sum_{i=1}^{n} (y_i - \bar{y})^2$, is about 1.0288. So the R-squared from the
regression is
\[ R^2 = 1 - SSR/SST \approx 1 - (.4347/1.0288) \approx .577. \]
Therefore, about 57.7% of the variation in GPA is explained by ACT in this small sample of
students.
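The arithmetic in 2.3 can be checked with a short script. This is only a sketch: the GPA values come from the table in part (ii), while the ACT values are an assumption, backed out from the reported fitted values via (fitted − .5681)/.1022. All variable names are illustrative.

```python
# GPA values are taken from the table in part (ii).
gpa = [2.8, 3.4, 3.0, 3.5, 3.6, 3.0, 2.7, 3.7]
# ACT values are an assumption: recovered from the fitted values in the table.
act = [21, 24, 26, 27, 29, 25, 25, 30]
n = len(gpa)

x_bar = sum(act) / n
y_bar = sum(gpa) / n

# Sample cross-product and sum of squared deviations
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(act, gpa))
s_xx = sum((x - x_bar) ** 2 for x in act)

# OLS slope and intercept for the simple regression of GPA on ACT
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted values, residuals, and goodness of fit
fitted = [intercept + slope * x for x in act]
resid = [y - f for y, f in zip(gpa, fitted)]
ssr = sum(u ** 2 for u in resid)
sst = sum((y - y_bar) ** 2 for y in gpa)
r_squared = 1 - ssr / sst

print(round(slope, 4), round(intercept, 4), round(r_squared, 3))
```

Computed at full precision, the residuals sum to zero up to floating-point error; the −.0002 reported in part (ii) comes from rounding each residual to four places.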
2.4 (i) When cigs = 0, predicted birth weight is 119.77 ounces. When cigs = 20, bwght = 109.49.
This is about an 8.6% drop.
(ii) Not necessarily. There are many other factors that can affect birth weight, particularly
overall health of the mother and quality of prenatal care. These could be correlated with
cigarette smoking during pregnancy. Also, something such as caffeine consumption can affect birth
weight, and might also be correlated with cigarette smoking.
(iii) If we want a predicted bwght of 125, then cigs = (125 − 119.77)/(−.514) ≈ −10.18, or
about –10 cigarettes! This is nonsense, of course, and it shows what happens when we are trying
to predict something as complicated as birth weight with only a single explanatory variable. The
largest predicted birth weight is necessarily 119.77. Yet almost 700 of the births in the sample
had a birth weight higher than 119.77.
(iv) 1,176 out of 1,388 women did not smoke while pregnant, or about 84.7%.
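The numbers in 2.4 follow directly from the fitted line with intercept 119.77 and slope −.514 on cigs. A minimal sketch of the arithmetic, with illustrative function and variable names:

```python
# Coefficients of the fitted line used in 2.4: predicted bwght = 119.77 - .514 cigs
INTERCEPT = 119.77
SLOPE = -0.514

def predict_bwght(cigs):
    """Predicted birth weight in ounces for a given number of cigarettes per day."""
    return INTERCEPT + SLOPE * cigs

bw_0 = predict_bwght(0)
bw_20 = predict_bwght(20)

# Part (i): percentage drop in predicted birth weight from cigs = 0 to cigs = 20
pct_drop = 100 * (bw_0 - bw_20) / bw_0

# Part (iii): cigs needed for a predicted bwght of 125 (negative, hence nonsensical)
cigs_for_125 = (125 - INTERCEPT) / SLOPE

# Part (iv): share of women in the sample who did not smoke while pregnant
share_nonsmokers = 100 * 1176 / 1388

print(round(bw_20, 2), round(pct_drop, 1), round(cigs_for_125, 2), round(share_nonsmokers, 1))
```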
2.5 (i) The intercept implies that when inc = 0, cons is predicted to be negative $124.84. This, of
course, cannot be true, and reflects the fact that this consumption function might be a poor
predictor of consumption at very low income levels. On the other hand, on an annual basis,
$124.84 is not so far from zero.
(iii) The MPC and the APC are shown in the following graph. Even though the intercept is
negative, the smallest APC in the sample is positive. The graph starts at an annual income level
of $1,000 (in 1970 dollars).
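[Figure: MPC and APC (vertical axis, roughly .7 to .9) plotted against inc (horizontal axis, 1,000 to 30,000). The MPC is constant at .853; the APC is .728 at inc = 1,000.]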
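The shape of the APC curve can be reproduced numerically. This sketch assumes the fitted consumption function has intercept −124.84 (from part (i)) and slope .853 (the MPC level shown on the graph); the function name `apc` is illustrative.

```python
INTERCEPT = -124.84   # from part (i)
MPC = 0.853           # assumption: slope read from the MPC level on the graph

def apc(inc):
    """Average propensity to consume: predicted cons divided by inc."""
    return (INTERCEPT + MPC * inc) / inc

# APC at the income levels marked on the horizontal axis
apc_by_income = {inc: apc(inc) for inc in (1000, 10000, 20000, 30000)}
for inc, value in apc_by_income.items():
    print(inc, round(value, 3))
```

Because the intercept is negative, the APC lies below the MPC at every income level and rises toward it as inc grows.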
2.6 (i) Yes. If living closer to an incinerator depresses housing prices, then being farther away
increases housing prices.
(ii) If the city chose to locate the incinerator in an area away from more expensive
neighborhoods, then log(dist) is positively correlated with housing quality. This would violate
SLR.3, and OLS estimation is biased.
(iii) Size of the house, number of bathrooms, size of the lot, age of the home, and quality of
the neighborhood (including school quality), are just a handful of factors. As mentioned in part
(ii), these could certainly be correlated with dist [and log(dist)].
2.7 (i) When we condition on inc in computing an expectation, inc becomes a constant. So
E(u|inc) = E(√inc ⋅ e|inc) = √inc ⋅ E(e|inc) = √inc ⋅ 0 = 0 because E(e|inc) = E(e) = 0.
(iii) Families with low incomes do not have much discretion about spending; typically, a
low-income family must spend on food, clothing, housing, and other necessities. Higher income
people have more discretion, and some might choose more consumption while others more
saving. This discretion suggests wider variability in saving among higher income families.
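2.8 (i) The estimator from regression through the origin is
\[ \tilde{\beta}_1 = \left( \sum_{i=1}^{n} x_i y_i \right) \Big/ \left( \sum_{i=1}^{n} x_i^2 \right). \]
Plugging in $y_i = \beta_0 + \beta_1 x_i + u_i$ gives
\[ \tilde{\beta}_1 = \left( \sum_{i=1}^{n} x_i (\beta_0 + \beta_1 x_i + u_i) \right) \Big/ \left( \sum_{i=1}^{n} x_i^2 \right). \]
The numerator can be written as
\[ \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2 + \sum_{i=1}^{n} x_i u_i , \]
so that
\[ \tilde{\beta}_1 = \beta_0 \left( \sum_{i=1}^{n} x_i \right) \Big/ \left( \sum_{i=1}^{n} x_i^2 \right) + \beta_1 + \left( \sum_{i=1}^{n} x_i u_i \right) \Big/ \left( \sum_{i=1}^{n} x_i^2 \right). \]
Conditional on the $x_i$, we have
\[ E(\tilde{\beta}_1) = \beta_0 \left( \sum_{i=1}^{n} x_i \right) \Big/ \left( \sum_{i=1}^{n} x_i^2 \right) + \beta_1 \]
because $E(u_i) = 0$ for all i. Therefore, the bias in $\tilde{\beta}_1$ is given by the first term in this equation.
This bias is zero when $\beta_0 = 0$. It is also zero when $\sum_{i=1}^{n} x_i = 0$, which is the same as
$\bar{x} = 0$. In the latter case, regression through the origin is identical to regression with an intercept.
(ii) From the last expression for $\tilde{\beta}_1$ in part (i) we have, conditional on the $x_i$,
\[ Var(\tilde{\beta}_1) = \left( \sum_{i=1}^{n} x_i^2 \right)^{-2} Var\left( \sum_{i=1}^{n} x_i u_i \right) = \left( \sum_{i=1}^{n} x_i^2 \right)^{-2} \left( \sum_{i=1}^{n} x_i^2 \, Var(u_i) \right) \]
\[ = \left( \sum_{i=1}^{n} x_i^2 \right)^{-2} \left( \sigma^2 \sum_{i=1}^{n} x_i^2 \right) = \sigma^2 \Big/ \left( \sum_{i=1}^{n} x_i^2 \right). \]
(iii) From (2.57), $Var(\hat{\beta}_1) = \sigma^2 / \left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)$. From the hint,
$\sum_{i=1}^{n} x_i^2 \ge \sum_{i=1}^{n} (x_i - \bar{x})^2$, and so $Var(\tilde{\beta}_1) \le Var(\hat{\beta}_1)$. A more direct way to see this is to write
$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2$, which is less than $\sum_{i=1}^{n} x_i^2$ unless $\bar{x} = 0$.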
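(iv) For a given sample size, the bias in $\tilde{\beta}_1$ increases as $\bar{x}$ increases (holding the sum of the
$x_i^2$ fixed). But as $\bar{x}$ increases, the variance of $\hat{\beta}_1$ increases relative to $Var(\tilde{\beta}_1)$. The bias in $\tilde{\beta}_1$
is also small when $\beta_0$ is small. Therefore, whether we prefer $\tilde{\beta}_1$ or $\hat{\beta}_1$ on a mean squared error
basis depends on the sizes of $\beta_0$, $\bar{x}$, and n (in addition to the size of $\sum_{i=1}^{n} x_i^2$).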
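2.9 (i) We follow the hint, noting that $\overline{c_1 y} = c_1 \bar{y}$ (the sample average of $c_1 y_i$ is $c_1$ times the
sample average of $y_i$) and $\overline{c_2 x} = c_2 \bar{x}$. When we regress $c_1 y_i$ on $c_2 x_i$ (including an intercept) we
use equation (2.19) to obtain the slope:
\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^{n} (c_2 x_i - c_2 \bar{x})(c_1 y_i - c_1 \bar{y})}{\sum_{i=1}^{n} (c_2 x_i - c_2 \bar{x})^2}
 = \frac{\sum_{i=1}^{n} c_1 c_2 (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} c_2^2 (x_i - \bar{x})^2}
 = \frac{c_1}{c_2} \cdot \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
 = \frac{c_1}{c_2} \hat{\beta}_1 . \]
From (2.17), we obtain the intercept as $\tilde{\beta}_0 = (c_1 \bar{y}) - \tilde{\beta}_1 (c_2 \bar{x}) = (c_1 \bar{y}) - [(c_1/c_2) \hat{\beta}_1](c_2 \bar{x}) =
c_1(\bar{y} - \hat{\beta}_1 \bar{x}) = c_1 \hat{\beta}_0$ because the intercept from regressing $y_i$ on $x_i$ is $(\bar{y} - \hat{\beta}_1 \bar{x})$.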
(ii) We use the same approach from part (i) along with the fact that $\overline{(c_1 + y)} = c_1 + \bar{y}$ and
$\overline{(c_2 + x)} = c_2 + \bar{x}$. Therefore, $(c_1 + y_i) - \overline{(c_1 + y)} = (c_1 + y_i) - (c_1 + \bar{y}) = y_i - \bar{y}$ and
$(c_2 + x_i) - \overline{(c_2 + x)} = x_i - \bar{x}$. So $c_1$ and $c_2$ entirely drop out of the slope formula for the regression of
$(c_1 + y_i)$ on $(c_2 + x_i)$, and $\tilde{\beta}_1 = \hat{\beta}_1$. The intercept is $\tilde{\beta}_0 = \overline{(c_1 + y)} - \tilde{\beta}_1 \overline{(c_2 + x)} = (c_1 + \bar{y}) - \hat{\beta}_1 (c_2 + \bar{x})
= (\bar{y} - \hat{\beta}_1 \bar{x}) + c_1 - c_2 \hat{\beta}_1 = \hat{\beta}_0 + c_1 - c_2 \hat{\beta}_1$, which is what we wanted to show.
(iii) We can simply apply part (ii) because $\log(c_1 y_i) = \log(c_1) + \log(y_i)$. In other words,
replace $c_1$ with $\log(c_1)$, $y_i$ with $\log(y_i)$, and set $c_2 = 0$.
(iv) Again, we can apply part (ii) with $c_1 = 0$ and replacing $c_2$ with $\log(c_2)$ and $x_i$ with $\log(x_i)$.
If $\hat{\beta}_0$ and $\hat{\beta}_1$ are the original intercept and slope, then $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \hat{\beta}_0 - \log(c_2) \hat{\beta}_1$.
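The four results in 2.9 are easy to confirm numerically. The sketch below uses made-up data and constants (`x`, `y`, `c1`, `c2` are all hypothetical) and a small helper `ols` written for this illustration; the intercept shift of log(c1) in part (iii) is what part (ii) implies when c2 = 0.

```python
import math

def ols(x, y):
    """Return (intercept, slope) from a simple regression of y on x with an intercept."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical data and constants, chosen only for illustration
x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.0, 3.5, 4.0, 6.5, 8.0]
c1, c2 = 3.0, 0.5

b0, b1 = ols(x, y)

# (i) Regress c1*y on c2*x: slope scales by c1/c2, intercept by c1
s0, s1 = ols([c2 * a for a in x], [c1 * b for b in y])

# (ii) Regress (c1 + y) on (c2 + x): slope unchanged, intercept shifts
t0, t1 = ols([c2 + a for a in x], [c1 + b for b in y])

# (iii) Regress log(c1*y) on x, compared with log(y) on x
l0, l1 = ols(x, [math.log(b) for b in y])
m0, m1 = ols(x, [math.log(c1 * b) for b in y])

# (iv) Regress y on log(c2*x), compared with y on log(x)
g0, g1 = ols([math.log(a) for a in x], y)
h0, h1 = ols([math.log(c2 * a) for a in x], y)

print(s1 / b1, t1 - b1, m0 - l0, h0 - g0)
```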
2.10 (i) The average prate is about 87.36 and the average mrate is about .732.
(ii) The estimated equation is
\[ \widehat{prate} = 83.05 + 5.86\, mrate \]
n = 1,534, $R^2$ = .075.
(iii) The intercept implies that, even if mrate = 0, the predicted participation rate is 83.05
percent. The coefficient on mrate implies that a one-dollar increase in the match rate – a fairly
large increase – is estimated to increase prate by 5.86 percentage points. This assumes, of
course, that this change in prate is possible (if, say, prate is already at 98, this interpretation makes
no sense).
(v) mrate explains about 7.5% of the variation in prate. This is not much, and suggests that
many other factors influence 401(k) plan participation rates.
2.11 (i) Average salary is about 865.864, which means $865,864 because salary is in thousands
of dollars. Average ceoten is about 7.95.
(ii) There are five CEOs with ceoten = 0. The longest tenure is 37 years.
The intercept implies that the estimated amount of sleep per week for someone who does not
work is 3,586.4 minutes, or about 59.77 hours. This comes to about 8.5 hours per night.
(ii) If someone works two more hours per week then Δtotwrk = 120 (because totwrk is
measured in minutes), and so $\Delta\widehat{sleep}$ = −.151(120) = −18.12 minutes. This is only a few minutes
a night. If someone were to work one more hour on each of five working days, $\Delta\widehat{sleep}$ =
−.151(300) = −45.3 minutes, or about six minutes a night.
2.13 (i) Average salary is about $957.95 and average IQ is about 101.28. The sample standard
deviation of IQ is about 15.05, which is pretty close to the population value of 15.
(ii) This calls for a level-level model:
\[ \widehat{wage} = 116.99 + 8.30\, IQ \]
n = 935, $R^2$ = .096.
If ΔIQ = 15 then Δ log (wage) = .0088(15) = .132, which is the (approximate) proportionate
change in predicted wage. The percentage increase is therefore approximately 13.2.
log(rd) = β 0 + β1 log(sales) + u,
The estimated elasticity of rd with respect to sales is 1.076, which is just above one. A one
percent increase in sales is estimated to increase rd by about 1.08%.
CHAPTER 3
TEACHING NOTES
For undergraduates, I do not do most of the derivations in this chapter, at least not in detail.
Rather, I focus on interpreting the assumptions, which mostly concern the population. Other
than random sampling, the only assumption that involves more than population considerations is
the assumption about no perfect collinearity, where the possibility of perfect collinearity in the
sample (even if it does not occur in the population) should be touched on. The more important
issue is perfect collinearity in the population, but this is fairly easy to dispense with via examples.
These come from my experiences with the kinds of model specification issues that beginners
have trouble with.
The comparison of simple and multiple regression estimates – based on the particular sample at
hand, as opposed to their statistical properties – usually makes a strong impression. Sometimes I
do not bother with the “partialling out” interpretation of multiple regression.
As far as statistical properties, notice how I treat the problem of including an irrelevant variable:
no separate derivation is needed, as the result follows from Theorem 3.1.
I do like to derive the omitted variable bias in the simple case. This is not much more difficult
than showing unbiasedness of OLS in the simple regression case under the first four Gauss-
Markov assumptions. It is important to get the students thinking about this problem early on,
and before too many additional (unnecessary) assumptions have been introduced.
I have intentionally kept the discussion of multicollinearity to a minimum. This partly indicates
my bias, but it also reflects reality. It is, of course, very important for students to understand the
potential consequences of having highly correlated independent variables. But this is often
beyond our control, except that we can ask less of our multiple regression analysis. If two or
more explanatory variables are highly correlated in the sample, we should not expect to precisely
estimate their ceteris paribus effects in the population.
I find extensive treatments of multicollinearity, where one “tests” or somehow “solves” the
multicollinearity problem, to be misleading, at best. Even the organization of some texts gives
the impression that imperfect multicollinearity is somehow a violation of the Gauss-Markov
assumptions: they include multicollinearity in a chapter or part of the book devoted to “violation
of the basic assumptions,” or something like that. I have noticed that master’s students who have
had some undergraduate econometrics are often confused on the multicollinearity issue. It is
very important that students not confuse multicollinearity among the included explanatory
variables in a regression model with the bias caused by omitting an important variable.
I do not prove the Gauss-Markov theorem. Instead, I emphasize its implications. Sometimes,
and certainly for advanced beginners, I put a special case of Problem 3.12 on a midterm exam,
where I make a particular choice for the function g(x). Rather than have the students directly
compare the variances, they should appeal to the Gauss-Markov theorem for the superiority of
OLS over any other linear, unbiased estimator.
SOLUTIONS TO PROBLEMS
3.1 (i) hsperc is defined so that the smaller it is, the higher the student’s standing in high school.
Everything else equal, the worse the student’s standing in high school, the lower is his/her
expected college GPA.
(ii) Plugging hsperc = 20 and sat = 1050 into the estimated equation gives
\[ \widehat{colgpa} = 1.392 - .0135(20) + .00148(1050) = 2.676. \]
(iii) The difference between A and B is simply 140 times the coefficient on sat, because
hsperc is the same for both students. So A is predicted to have a score .00148(140) ≈ .207
higher.
(iv) With hsperc fixed, $\Delta\widehat{colgpa}$ = .00148Δsat. Now, we want to find Δsat such that
$\Delta\widehat{colgpa}$ = .5, so .5 = .00148(Δsat) or Δsat = .5/(.00148) ≈ 338. Perhaps not surprisingly, a
large ceteris paribus difference in SAT score (almost two and one-half standard deviations) is
needed to obtain a predicted difference in college GPA of a half a point.
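The calculations in 3.1 can be reproduced with the coefficients that appear in part (ii). This is a minimal sketch; the function name and the hsperc and sat values used for the A and B comparison are illustrative.

```python
def predict_colgpa(hsperc, sat):
    """Predicted college GPA from the fitted equation used in part (ii)."""
    return 1.392 - 0.0135 * hsperc + 0.00148 * sat

# Part (ii): hsperc = 20, sat = 1050
gpa_ii = predict_colgpa(20, 1050)

# Part (iii): same hsperc, 140-point difference in sat (the levels chosen are arbitrary)
diff_140 = predict_colgpa(20, 1190) - predict_colgpa(20, 1050)

# Part (iv): sat difference needed for a half-point difference in predicted GPA
sat_needed = 0.5 / 0.00148

print(round(gpa_ii, 3), round(diff_140, 3), round(sat_needed))
```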
3.2 (i) Yes. Because of budget constraints, it makes sense that, the more siblings there are in a
family, the less education any one child in the family has. To find the increase in the number of
siblings that reduces predicted education by one year, we solve 1 = .094(Δsibs), so Δsibs =
1/.094 ≈ 10.6.
(ii) Holding sibs and feduc fixed, one more year of mother’s education implies .131 years
more of predicted education. So if a mother has four more years of education, her son is
predicted to have about a half a year (.524) more years of education.
(iii) Since the number of siblings is the same, but meduc and feduc are both different, the
coefficients on meduc and feduc both need to be accounted for. The predicted difference in
education between B and A is .131(4) + .210(4) = 1.364.
3.3 (i) If adults trade off sleep for work, more work implies less sleep (other things equal), so
β1 < 0.
(ii) The signs of β 2 and β 3 are not obvious, at least to me. One could argue that more
educated people like to get more out of life, and so, other things equal, they sleep less ( β 2 < 0).
The relationship between sleeping and age is more complicated than this model suggests, and
economists are not in the best position to judge such things.
(iii) Since totwrk is in minutes, we must convert five hours into minutes: Δtotwrk =
5(60) = 300. Then sleep is predicted to fall by .148(300) = 44.4 minutes. For a week, 45
minutes less sleep is not an overwhelming change.
(iv) More education implies less predicted time sleeping, but the effect is quite small. If
we assume the difference between college and high school is four years, the college graduate
sleeps about 45 minutes less per week, other things equal.
(v) Not surprisingly, the three explanatory variables explain only about 11.3% of the
variation in sleep. One important factor in the error term is general health. Another is marital
status, and whether the person has children. Health (however we measure that), marital status,
and number and ages of children would generally be correlated with totwrk. (For example, less
healthy people would tend to work less.)
3.4 (i) A larger rank for a law school means that the school has less prestige; this lowers
starting salaries. For example, a rank of 100 means there are 99 schools thought to be better.
(ii) β1 > 0, β 2 > 0. Both LSAT and GPA are measures of the quality of the entering class.
No matter where better students attend law school, we expect them to earn more, on average. β 3 ,
β 4 > 0. The number of volumes in the law library and the tuition cost are both measures of the
school quality. (Cost is less obvious than library volumes, but should reflect quality of the
faculty, physical plant, and so on.)
(iv) This is an elasticity: a one percent increase in library volumes implies a .095%
increase in predicted median starting salary, other things equal.
(v) It is definitely better to attend a law school with a lower rank. If law school A has a
ranking 20 less than law school B, the predicted difference in starting salary is 100(.0033)(20) =
6.6% higher for law school A.
3.5 (i) No. By definition, study + sleep + work + leisure = 168. So if we change study, we
must change at least one of the other categories so that the sum is still 168.
(ii) From part (i), we can write, say, study as a perfect linear function of the other
independent variables: study = 168 − sleep − work − leisure. This holds for every observation,
so MLR.4 is violated.
Now, for example, β1 is interpreted as the change in GPA when study increases by one hour,
where sleep, work, and u are all held fixed. If we are holding sleep and work fixed but increasing
study by one hour, then we must be reducing leisure by one hour. The other slope parameters
have a similar interpretation.
3.6 Conditioning on the outcomes of the explanatory variables, we have
$E(\hat{\theta}_1) = E(\hat{\beta}_1 + \hat{\beta}_2) = E(\hat{\beta}_1) + E(\hat{\beta}_2) = \beta_1 + \beta_2 = \theta_1$.
3.7 Only (ii), omitting an important variable, can cause bias, and this is true only when the
omitted variable is correlated with the included explanatory variables. The homoskedasticity
assumption, MLR.5, played no role in showing that the OLS estimators are unbiased.
(Homoskedasticity was used to obtain the standard variance formulas for the $\hat{\beta}_j$.) Further, the
degree of collinearity between the explanatory variables in the sample, even if it is reflected in a
correlation as high as .95, does not affect the Gauss-Markov assumptions. Only if there is a
perfect linear relationship among two or more explanatory variables is MLR.4 violated.
3.8 We can use Table 3.2. By definition, β 2 > 0, and by assumption, Corr(x1,x2) < 0.
Therefore, there is a negative bias in $\tilde{\beta}_1$: $E(\tilde{\beta}_1) < \beta_1$. This means that, on average, the simple
3.9 (i) β1 < 0 because more pollution can be expected to lower housing values; note that β1 is
the elasticity of price with respect to nox. β 2 is probably positive because rooms roughly
measures the size of a house. (However, it does not allow us to distinguish homes where each
room is large from homes where each room is small.)
(ii) If we assume that rooms increases with quality of the home, then log(nox) and rooms
are negatively correlated when poorer neighborhoods have more pollution, something that is
often true. We can use Table 3.2 to determine the direction of the bias. If $\beta_2 > 0$ and
Corr(x1,x2) < 0, the simple regression estimator $\tilde{\beta}_1$ has a downward bias. But because $\beta_1 < 0$,
this means that the simple regression, on average, overstates the importance of pollution. [$E(\tilde{\beta}_1)$
is more negative than $\beta_1$.]
(iii) This is what we expect from the typical sample based on our analysis in part (ii). The
simple regression estimate, −1.043, is more negative (larger in magnitude) than the multiple
regression estimate, −.718. As those estimates are only for one sample, we can never know
which is closer to $\beta_1$. But if this is a “typical” sample, $\beta_1$ is closer to −.718.
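The direction-of-bias reasoning rests on a standard in-sample identity: the simple regression slope equals the multiple regression slope on x1 plus the multiple regression slope on x2 times the slope from regressing x2 on x1. The sketch below checks this on made-up data (all numbers are hypothetical, not the housing data), with x1 and x2 negatively correlated and a positive partial effect of x2, so the simple regression slope falls below the multiple regression slope.

```python
def mean(v):
    return sum(v) / len(v)

def simple_ols(x, y):
    """(intercept, slope) from regressing y on x with an intercept."""
    xb, yb = mean(x), mean(y)
    slope = sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)
    return yb - slope * xb, slope

def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    b0, b1 = simple_ols(x, y)
    return [b - b0 - b1 * a for a, b in zip(x, y)]

def partial_slope(x_main, x_other, y):
    """Multiple regression slope on x_main, obtained by partialling out x_other."""
    r = residuals(x_other, x_main)
    return sum(ri * yi for ri, yi in zip(r, y)) / sum(ri ** 2 for ri in r)

# Hypothetical data: x1 and x2 are negatively correlated, x2 has a positive effect
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [6.5, 5.0, 5.5, 3.0, 2.5, 1.0]
e = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0]
y = [1.0 - 1.0 * a + 2.0 * b + u for a, b, u in zip(x1, x2, e)]

beta1_hat = partial_slope(x1, x2, y)    # multiple regression slope on x1
beta2_hat = partial_slope(x2, x1, y)    # multiple regression slope on x2
_, delta1 = simple_ols(x1, x2)          # slope from regressing x2 on x1
_, beta1_tilde = simple_ols(x1, y)      # simple regression slope of y on x1

print(beta1_tilde, beta1_hat + beta2_hat * delta1)
```

With a positive slope on x2 and a negative delta1, the simple regression slope is more negative than the multiple regression slope, which is the pattern described in part (iii).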
3.10 The estimator is
\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^{n} \hat{r}_{i1} y_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2} , \]
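where the $\hat{r}_{i1}$ are defined in the problem. As usual, we must plug in the true model for $y_i$:
\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^{n} \hat{r}_{i1} (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + u_i)}{\sum_{i=1}^{n} \hat{r}_{i1}^2} . \]
The numerator of this expression simplifies because $\sum_{i=1}^{n} \hat{r}_{i1} = 0$, $\sum_{i=1}^{n} \hat{r}_{i1} x_{i2} = 0$, and
$\sum_{i=1}^{n} \hat{r}_{i1} x_{i1} = \sum_{i=1}^{n} \hat{r}_{i1}^2$. These all follow from the fact that the $\hat{r}_{i1}$ are the residuals from the regression of $x_{i1}$ on
$x_{i2}$: the $\hat{r}_{i1}$ have zero sample average and are uncorrelated in sample with $x_{i2}$. So the numerator
of $\tilde{\beta}_1$ can be expressed as
\[ \beta_1 \sum_{i=1}^{n} \hat{r}_{i1}^2 + \beta_3 \sum_{i=1}^{n} \hat{r}_{i1} x_{i3} + \sum_{i=1}^{n} \hat{r}_{i1} u_i . \]
Putting this back over the denominator gives
\[ \tilde{\beta}_1 = \beta_1 + \beta_3 \frac{\sum_{i=1}^{n} \hat{r}_{i1} x_{i3}}{\sum_{i=1}^{n} \hat{r}_{i1}^2} + \frac{\sum_{i=1}^{n} \hat{r}_{i1} u_i}{\sum_{i=1}^{n} \hat{r}_{i1}^2} . \]
Conditional on all sample values on x1, x2, and x3, only the last term is random due to its
dependence on $u_i$. But $E(u_i) = 0$, and so
\[ E(\tilde{\beta}_1) = \beta_1 + \beta_3 \frac{\sum_{i=1}^{n} \hat{r}_{i1} x_{i3}}{\sum_{i=1}^{n} \hat{r}_{i1}^2} , \]
which is what we wanted to show. Notice that the term multiplying $\beta_3$ is the regression
coefficient from the simple regression of $x_{i3}$ on $\hat{r}_{i1}$.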
3.11 (i) The shares, by definition, add to one. If we do not omit one of the shares then the
equation would suffer from perfect multicollinearity. The parameters would not have a ceteris
paribus interpretation, as it is impossible to change one share while holding all of the other
shares fixed.
(ii) Because each share is a proportion (and can be at most one, when all other shares are
zero), it makes little sense to increase sharep by one unit. If sharep increases by .01, which is
equivalent to a one percentage point increase in the share of property taxes in total revenue,
holding shareI, shareS, and the other factors fixed, then growth increases by β1(.01). With the
other shares fixed, the excluded share, shareF, must fall by .01 when sharep increases by .01.
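3.12 (i) With $s_{zx} = \sum_{i=1}^{n} (z_i - \bar{z}) x_i$ (the definition implied by the simplification below), the estimator is
\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^{n} (z_i - \bar{z}) y_i}{s_{zx}} . \]
This is clearly a linear function of the $y_i$: take the weights to be $w_i = (z_i - \bar{z})/s_{zx}$. To show
unbiasedness, as usual we plug $y_i = \beta_0 + \beta_1 x_i + u_i$ into this equation, and simplify:
\[ \tilde{\beta}_1 = \frac{\sum_{i=1}^{n} (z_i - \bar{z})(\beta_0 + \beta_1 x_i + u_i)}{s_{zx}}
 = \frac{\beta_0 \sum_{i=1}^{n} (z_i - \bar{z}) + \beta_1 s_{zx} + \sum_{i=1}^{n} (z_i - \bar{z}) u_i}{s_{zx}}
 = \beta_1 + \frac{\sum_{i=1}^{n} (z_i - \bar{z}) u_i}{s_{zx}} , \]
where we use the fact that $\sum_{i=1}^{n} (z_i - \bar{z}) = 0$ always. Now $s_{zx}$ is a function of the $z_i$ and $x_i$ and the
expected value of each $u_i$ is zero conditional on all $z_i$ and $x_i$ in the sample. Therefore, conditional
on these values,
\[ E(\tilde{\beta}_1) = \beta_1 + \frac{\sum_{i=1}^{n} (z_i - \bar{z}) E(u_i)}{s_{zx}} = \beta_1 . \]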
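(ii) From the fourth equation in part (i) we have (again conditional on the $z_i$ and $x_i$ in the
sample),
\[ Var(\tilde{\beta}_1) = \frac{Var\left[ \sum_{i=1}^{n} (z_i - \bar{z}) u_i \right]}{s_{zx}^2}
 = \frac{\sum_{i=1}^{n} (z_i - \bar{z})^2 \, Var(u_i)}{s_{zx}^2}
 = \sigma^2 \frac{\sum_{i=1}^{n} (z_i - \bar{z})^2}{s_{zx}^2} \]
because of the homoskedasticity assumption [$Var(u_i) = \sigma^2$ for all i]. Given the definition of $s_{zx}$,
this is what we wanted to show.
(iii) We know that $Var(\hat{\beta}_1) = \sigma^2 / [\sum_{i=1}^{n} (x_i - \bar{x})^2]$. Now we can rearrange the inequality in the
hint, drop $\bar{x}$ from the sample covariance, and cancel n−1 everywhere, to get
$[\sum_{i=1}^{n} (z_i - \bar{z})^2] / s_{zx}^2 \ge 1 / [\sum_{i=1}^{n} (x_i - \bar{x})^2]$. When we multiply through by $\sigma^2$ we get
$Var(\tilde{\beta}_1) \ge Var(\hat{\beta}_1)$, which is what
we wanted to show.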
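The variance ranking in 3.12 can be illustrated for one particular choice of z. In this sketch, the data, the choice z = x², and the error variance of 1 are all hypothetical; `variances` and `weights` are helper names introduced here, and s_zx is computed as the sum of (z_i − z̄)x_i.

```python
def variances(x, z, sigma2=1.0):
    """Return (Var of the z-based estimator, Var of OLS), conditional on x and z."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    s_zx = sum((zi - z_bar) * xi for zi, xi in zip(z, x))
    var_tilde = sigma2 * sum((zi - z_bar) ** 2 for zi in z) / s_zx ** 2
    var_ols = sigma2 / sum((xi - x_bar) ** 2 for xi in x)
    return var_tilde, var_ols

def weights(x, z):
    """Weights w_i = (z_i - z_bar)/s_zx that make the estimator linear in y."""
    z_bar = sum(z) / len(z)
    s_zx = sum((zi - z_bar) * xi for zi, xi in zip(z, x))
    return [(zi - z_bar) / s_zx for zi in z]

# Hypothetical sample and a hypothetical choice z_i = g(x_i) = x_i^2
x = [1.0, 2.0, 4.0, 5.0, 7.0, 8.0]
z = [xi ** 2 for xi in x]

var_tilde, var_ols = variances(x, z)
w = weights(x, z)

# With z = x the estimator is OLS itself, so the two variances coincide
var_same, var_ols_again = variances(x, x)

print(var_tilde, var_ols)
```

The weights sum to zero and satisfy sum(w_i x_i) = 1, which is what delivers unbiasedness in part (i); the variance comparison is the Gauss-Markov ranking from part (iii).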
3.13 (i) Probably β 2 > 0, as more income typically means better nutrition for the mother and
better prenatal care.
(ii) On the one hand, an increase in income generally increases the consumption of a good,
and cigs and faminc could be positively correlated. On the other, family incomes are also higher
for families with more education, and more education and cigarette smoking tend to be
negatively correlated. The sample correlation between cigs and faminc is about −.173, indicating
a negative correlation.
(iii) The regressions without and with faminc are
\[ \widehat{bwght} = 119.77 - .514\, cigs \]
n = 1,388, $R^2$ = .023
and
\[ \widehat{bwght} = 116.97 - .463\, cigs + .093\, faminc \]
n = 1,388, $R^2$ = .030.
The effect of cigarette smoking is slightly smaller when faminc is added to the regression, but the
difference is not great. This is due to the fact that cigs and faminc are not very correlated, and
the coefficient on faminc is practically small. (The variable faminc is measured in thousands, so
$10,000 more in 1988 income increases predicted birth weight by only .93 ounces.)
n = 88, R 2 = .632
(vi) From part (v), the estimated value of the home based only on square footage and
number of bedrooms is $353,544. The actual selling price was $300,000, which suggests the
buyer underpaid by some margin. But, of course, there are many other features of a house (some
that we cannot even measure) that affect price, and we have not controlled for these.
n = 177, R 2 = .299.
(ii) We cannot include profits in logarithmic form because profits are negative for nine of
the companies in the sample. When we add it in levels form we get
log (salary ) = 4.69 + .161 log( sales ) + .098 log(mktval ) + .000036 profits
n = 177, R 2 = .299.
The coefficient on profits is very small. Here, profits are measured in millions, so if profits
increase by $1 billion, which means Δprofits = 1,000 (a huge change), predicted salary
increases by about only 3.6%. However, remember that we are holding sales and market value
fixed.
Together, these variables (and we could drop profits without losing anything) explain
almost 30% of the sample variation in log(salary). This is certainly not “most” of the variation.
log (salary ) = 4.56 + .162 log( sales ) + .102 log(mktval ) + .000029 profits + .012ceoten
n = 177, R 2 = .318.
This means that one more year as CEO increases predicted salary by about 1.2%.
(iv) The sample correlation between log(mktval) and profits is about .78, which is fairly
high. As we know, this causes no bias in the OLS estimators, although it can cause their
variances to be large. Given the fairly substantial correlation between market value and firm
profits, it is not too surprising that the latter adds nothing to explaining CEO salaries. Also,
profits is a short term measure of how the firm is doing while mktval is based on past, current,
and expected future profitability.
3.16 (i) The minimum, maximum, and average values for these three variables are given in the
table below:
atndrte = 75.70 + 17.26 priGPA − 1.72 ACT
n = 680, R2 = .291.
The intercept means that, for a student whose prior GPA is zero and ACT score is zero, the
predicted attendance rate is 75.7%. But this is clearly not an interesting segment of the
population. (In fact, there are no students in the college population with priGPA = 0 and ACT =
0.)
(iii) The coefficient on priGPA means that, if a student’s prior GPA is one point higher
(say, from 2.0 to 3.0), the attendance rate is about 17.3 percentage points higher. This holds ACT
fixed. The negative coefficient on ACT is perhaps initially a bit surprising. Five more points on
the ACT is predicted to lower attendance by 8.6 percentage points at a given level of priGPA. As
priGPA measures performance in college (and, at least partially, could reflect past attendance
rates), while ACT is a measure of potential in college, it appears that students who had more
promise (which could mean more innate ability) think they can get by with missing lectures.
(iv) We have atndrte = 75.70 + 17.26(3.65) − 1.72(20) ≈ 104.3. Of course, a student
cannot have higher than a 100% attendance rate. Getting predictions like this is always possible
when using regression methods with natural upper or lower bounds on the dependent variable.
In practice, we would predict a 100% attendance rate for this student. (In fact, this student had
an attendance rate of only 87.5%.)
(v) The difference in predicted attendance rates for A and B is 17.26(3.1 − 2.1) − 1.72(21 −
26) = 25.86.
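The predictions in parts (iii) through (v) can be checked against the estimated equation from part (ii). This is a minimal sketch with illustrative names; capping the prediction at 100 is the practical adjustment mentioned in part (iv), not a feature of OLS.

```python
def predict_atndrte(pri_gpa, act):
    """Predicted attendance rate (percent) from the estimated equation in part (ii)."""
    return 75.70 + 17.26 * pri_gpa - 1.72 * act

# Part (iii): effect of five more ACT points, holding priGPA fixed
act_effect = predict_atndrte(3.0, 25) - predict_atndrte(3.0, 20)

# Part (iv): the raw prediction exceeds 100, so cap it at the natural upper bound
raw_prediction = predict_atndrte(3.65, 20)
capped_prediction = min(raw_prediction, 100.0)

# Part (v): student A (priGPA 3.1, ACT 21) versus student B (priGPA 2.1, ACT 26)
diff_a_b = predict_atndrte(3.1, 21) - predict_atndrte(2.1, 26)

print(round(act_effect, 1), round(raw_prediction, 1), capped_prediction, round(diff_a_b, 2))
```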
n = 526, R2 = .101.
n = 526, R2 = .207.
As expected, the coefficient on r̂1 in the second regression is identical to the coefficient on educ
in equation (3.19). Notice that the R-squared from the above regression is below that in (3.19).
In effect, the regression on r̂1 only uses the part of educ that is uncorrelated with exper and
tenure to explain log(wage).
3.18 (i) The slope coefficient from the regression IQ on educ is (rounded to five decimal places)
$\tilde{\delta}_1$ = 3.53383.
CHAPTER 4
TEACHING NOTES
The structure of this chapter allows you to remind students that a specific error distribution
played no role in the results of Chapter 3. Normality is needed, however, to obtain exact normal
sampling distributions (conditional on the explanatory variables). I emphasize that the full set of
CLM assumptions is used in this chapter, but that in Chapter 5 we relax the normality
assumption and still perform approximately valid inference. One could argue that the classical
linear model results could be skipped entirely, and that only large-sample analysis is needed.
But, from a practical perspective, students still need to know where the t distribution comes from,
because virtually all regression packages report t statistics and obtain p-values off of the t
distribution. I then find it very easy to cover Chapter 5 quickly, by just saying we can drop
normality and still use t statistics and the associated p-values as being approximately valid.
Besides, occasionally students will have to analyze smaller data sets, especially if they do their
own small surveys for a term project.
It is crucial to emphasize that we test hypotheses about unknown, population parameters. I tell
my students that they will be punished if they write something like $H_0: \hat{\beta}_1 = 0$ on an exam or,
One useful feature of Chapter 4 is its emphasis on rewriting a population model so that it
contains the parameter of interest in testing a single restriction. I find this is easier, both
theoretically and practically, than computing variances that can, in some cases, depend on
numerous covariance terms. The example of testing equality of the return to two- and four-year
colleges illustrates the basic method, and shows that the respecified model can have a useful
interpretation.
One can use an F test for single linear restrictions on multiple parameters, but this is less
transparent than a t test and does not immediately produce the standard error needed for a
confidence interval or for testing a one-sided alternative. The trick of rewriting the population
model is useful in several instances, including obtaining confidence intervals for predictions in
Chapter 6, as well as for obtaining confidence intervals for marginal effects in models with
interactions (also in Chapter 6).
The major league baseball player salary example illustrates the difference between individual
and joint significance when explanatory variables (rbisyr and hrunsyr in this case) are highly
correlated. I tend to emphasize the R-squared form of the F statistic because, in practice, it is
applicable a large percentage of the time, and it is much more readily computed. I do regret that
this example is biased toward students in countries where baseball is played. Still, it is one of the
better examples of multicollinearity that I have come across, and students of all backgrounds
seem to get the point.
SOLUTIONS TO PROBLEMS
4.1 (i) and (iii) generally cause the t statistics not to have a t distribution under H0.
Homoskedasticity is one of the CLM assumptions. An important omitted variable violates
Assumption MLR.3. The CLM assumptions contain no mention of the sample correlations
among independent variables, except to rule out the case where the correlation is one.
(ii) The proportionate effect on salary is .00024(50) = .012. To obtain the percentage effect,
we multiply this by 100: 1.2%. Therefore, a 50 point ceteris paribus increase in ros is predicted
to increase salary by only 1.2%. Practically speaking this is a very small effect for such a large
change in ros.
(iii) The 10% critical value for a one-tailed test, using df = ∞, is obtained from Table G.2 as
1.282. The t statistic on ros is .00024/.00054 ≈ .44, which is well below the critical value.
Therefore, we fail to reject H0 at the 10% significance level.
(iv) Based on this sample, the estimated ros coefficient appears to be different from zero only
because of sampling variation. On the other hand, including ros may not be causing any harm; it
depends on how correlated it is with the other independent variables (although these are very
significant even with ros in the equation).
$\Delta\widehat{rdintens}$ ≈ .032, or only about 3/100 of a percentage point. For such a large percentage
increase in sales, this seems like a practically small effect.
(ii) H0: β1 = 0 versus H1: β1 > 0, where β1 is the population slope on log(sales). The t
statistic is .321/.216 ≈ 1.486. The 5% critical value for a one-tailed test, with df = 32 – 3 = 29,
is obtained from Table G.2 as 1.699; so we cannot reject H0 at the 5% level. But the 10% critical
value is 1.311; since the t statistic is above this value, we reject H0 in favor of H1 at the 10%
level.
(iii) Not really. Its t statistic is only 1.087, which is well below even the 10% critical value
for a one-tailed test.
(ii) Other things equal, a larger population increases the demand for rental housing, which
should increase rents. The demand for overall housing is higher when average income is higher,
pushing up the cost of housing, including rental rates.
(iii) The coefficient on log(pop) is an elasticity. A correct statement is that “a 10% increase
in population increases rent by .066(10) = .66%.”
(iv) With df = 64 – 4 = 60, the 1% critical value for a two-tailed test is 2.660. The t statistic
is about 3.29, which is well above the critical value. So β3 is statistically different from zero at
the 1% level.
(ii) No, because the value .4 is well inside the 95% CI.
4.6 (i) With df = n – 2 = 86, we obtain the 5% critical value from Table G.2 with df = 90.
Because each test is two-tailed, the critical value is 1.987. The t statistic for H0: β0 = 0 is about
–.89, which is much less than 1.987 in absolute value. Therefore, we fail to reject β0 = 0. The t
statistic for H0: β1 = 1 is (.976 – 1)/.049 ≈ -.49, which is even less significant. (Remember, we
reject H0 in favor of H1 in this case only if |t| > 1.987.)
(ii) We use the SSR form of the F statistic. We are testing q = 2 restrictions and the df in the
unrestricted model is 86. We are given SSRr = 209,448.99 and SSRur = 165,644.51. Therefore,
F = [(209,448.99 − 165,644.51)/165,644.51](86/2) ≈ 11.37,
which is a strong rejection of H0: from Table G.3c, the 1% critical value with 2 and 90 df is 4.85.
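The SSR form of the F statistic used in part (ii) is easy to check numerically. The following is a minimal sketch; the function name f_stat_ssr is hypothetical, and the inputs are the sums of squared residuals and degrees of freedom quoted in part (ii).

```python
def f_stat_ssr(ssr_r, ssr_ur, q, df_ur):
    """SSR form of the F statistic: [(SSR_r - SSR_ur) / SSR_ur] * (df_ur / q)."""
    return ((ssr_r - ssr_ur) / ssr_ur) * (df_ur / q)


# Values quoted in the solution to 4.6(ii): q = 2 restrictions, 86 df.
F = f_stat_ssr(209_448.99, 165_644.51, q=2, df_ur=86)
print(round(F, 2))
```

The result should be compared with the 1% critical value of 4.85 cited in part (ii).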
1.46. The 10% critical value (again using 90 denominator df in Table G.3a) is 2.15, so we fail to
reject H0 at even the 10% level. In fact, the p-value is about .23.
(iv) If heteroskedasticity were present, Assumption MLR.5 would be violated, and the F
statistic would not have an F distribution under the null hypothesis. Therefore, comparing the F
statistic against the usual critical values, or obtaining the p-value from the F distribution, would
not be especially meaningful.
4.7 (i) While the standard error on hrsemp has not changed, the magnitude of the coefficient has
increased by half. The t statistic on hrsemp has gone from about –1.47 to –2.21, so now the
coefficient is statistically less than zero at the 5% level. (From Table G.2 the 5% critical value
with 40 df is –1.684. The 1% critical value is –2.423, so the p-value is between .01 and .05.)
(ii) If we add and subtract β 2 log(employ) from the right-hand-side and collect terms, we
have
log(scrap) = β 0 + β1 hrsemp + [ β 2 log(sales) – β 2 log(employ)]
+ [ β 2 log(employ) + β 3 log(employ)] + u
= β 0 + β1 hrsemp + β 2 log(sales/employ)
+ ( β 2 + β 3 )log(employ) + u,
(iii) No. We are interested in the coefficient on log(employ), which has a t statistic of .2,
which is very small. Therefore, we conclude that the size of the firm, as measured by employees,
does not matter, once we control for training and sales per employee (in a logarithmic functional
form).
(iv) The null hypothesis in the model from part (ii) is H0: β 2 = –1. The t statistic is [–.951 –
(–1)]/.37 = (1 – .951)/.37 ≈ .132; this is very small, and we fail to reject whether we specify a
one- or two-sided alternative.
4.8 (i) We use Property VAR.3 from Appendix B: Var(β̂1 − 3β̂2) = Var(β̂1) + 9 Var(β̂2) – 6 Cov(β̂1, β̂2).
(ii) t = (β̂1 − 3β̂2 − 1)/se(β̂1 − 3β̂2), so we need the standard error of β̂1 − 3β̂2.
(iii) Because θ1 = β1 – 3β2, we can write β1 = θ1 + 3β2. Plugging this into the population
model gives
y = β0 + (θ1 + 3β2)x1 + β2x2 + β3x3 + u
= β0 + θ1x1 + β2(3x1 + x2) + β3x3 + u.
This last equation is what we would estimate by regressing y on x1, 3x1 + x2, and x3. The
coefficient and standard error on x1 are what we want.
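The equivalence between parts (i) and (iii) can be verified on simulated data. The sketch below assumes NumPy is available; the data-generating values and the helper ols are hypothetical and only serve the illustration. It computes θ̂1 = β̂1 − 3β̂2 and its standard error twice: once from the covariance formula in part (i), and once as the coefficient on x1 in the rewritten regression of part (iii).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1, x2, x3 = rng.normal(size=(3, n))
# Hypothetical population model, used only to generate data.
y = 1.0 + 2.0 * x1 + 0.5 * x2 - 1.0 * x3 + rng.normal(size=n)


def ols(y, X):
    """Return OLS coefficients and their estimated covariance matrix."""
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return b, sigma2 * xtx_inv


ones = np.ones(n)

# Original model: y on x1, x2, x3. Use the variance formula from part (i).
b, V = ols(y, np.column_stack([ones, x1, x2, x3]))
theta_direct = b[1] - 3 * b[2]
se_direct = np.sqrt(V[1, 1] + 9 * V[2, 2] - 6 * V[1, 2])

# Rewritten model from part (iii): y on x1, (3*x1 + x2), x3.
b_r, V_r = ols(y, np.column_stack([ones, x1, 3 * x1 + x2, x3]))
theta_rewritten = b_r[1]
se_rewritten = np.sqrt(V_r[1, 1])

print(theta_direct, theta_rewritten)
print(se_direct, se_rewritten)
```

The two approaches agree up to floating-point error, but the rewritten regression delivers the standard error without any covariance bookkeeping.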
4.9 (i) With df = 706 – 4 = 702, we use the standard normal critical value (df = ∞ in Table G.2),
which is 1.96 for a two-tailed test at the 5% level. Now teduc = −11.13/5.88 ≈ −1.89, so |teduc| =
1.89 < 1.96, and we fail to reject H0: β educ = 0 at the 5% level. Also, tage ≈ 1.52, so age is also
statistically insignificant at the 5% level.
(ii) We need to compute the R-squared form of the F statistic for joint significance. But F =
[(.113 − .103)/(1 − .113)](702/2) ≈ 3.96. The 5% critical value in the F2,702 distribution can be
obtained from Table G.3b with denominator df = ∞: cv = 3.00. Therefore, educ and age are
jointly significant at the 5% level (3.96 > 3.00). In fact, the p-value is about .019, and so educ
and age are jointly significant at the 2% level.
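The R-squared form of the F statistic in part (ii) can be sketched as follows. The function names are hypothetical. The p-value helper relies on a special case: when there are exactly two restrictions, the upper-tail probability of an F(2, df) random variable has the closed form (1 + 2F/df)^(−df/2), so no statistical library is needed. For other values of q a proper F distribution routine would be required.

```python
def f_stat_r2(r2_ur, r2_r, q, df_ur):
    """R-squared form: [(R2_ur - R2_r) / (1 - R2_ur)] * (df_ur / q)."""
    return ((r2_ur - r2_r) / (1.0 - r2_ur)) * (df_ur / q)


def f_pvalue_two_restrictions(f, df_ur):
    """Upper-tail probability of an F(2, df_ur) variable; valid only for q = 2."""
    return (1.0 + 2.0 * f / df_ur) ** (-df_ur / 2.0)


# R-squareds and df quoted in the solution to 4.9(ii).
F = f_stat_r2(0.113, 0.103, q=2, df_ur=702)
p = f_pvalue_two_restrictions(F, 702)
print(round(F, 2), round(p, 4))
```

Because the R-squareds are rounded to three decimals, the computed p-value can differ slightly in the third decimal from the one reported by a regression package.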
(iii) Not really. These variables are jointly significant, but including them only changes the
coefficient on totwrk from –.151 to –.148.
(iv) The standard t and F statistics that we used assume homoskedasticity, in addition to the
other CLM assumptions. If there is heteroskedasticity in the equation, the tests are no longer
valid.
variable is individually significant at the 5% level. The largest absolute t statistic is on dkr, tdkr ≈
1.60, which is not significant at the 5% level against a two-sided alternative.
(ii) The F statistic (with the same df) is now [.0330/(1 – .0330)](137/4) ≈ 1.17, which is
even lower than in part (i). None of the t statistics is significant at a reasonable level.
(iii) It seems very weak. There are no significant t statistics at the 5% level (against a two-
sided alternative), and the F statistics are insignificant in both cases. Plus, less than 4% of the
variation in return is explained by the independent variables.
4.11 (i) In columns (2) and (3), the coefficient on profmarg is actually negative, although its t
statistic is only about –1. It appears that, once firm sales and market value have been controlled
for, profit margin has no effect on CEO salary.
(ii) We use column (3), which controls for the most factors affecting salary. The t statistic on
log(mktval) is about 2.05, which is just significant at the 5% level against a two-sided alternative.
(We can use the standard normal critical value, 1.96.) So log(mktval) is statistically significant.
Because the coefficient is an elasticity, a ceteris paribus 10% increase in market value is
predicted to increase salary by 1%. This is not a huge effect, but it is not negligible, either.
(iii) These variables are individually significant at low significance levels, with tceoten ≈ 3.11
and tcomten ≈ –2.79. Other factors fixed, another year as CEO with the company increases salary
by about 1.71%. On the other hand, another year with the company, but not as CEO, lowers
salary by about .92%. This second finding at first seems surprising, but could be related to the
“superstar” effect: firms that hire CEOs from outside the company often go after a small pool of
highly regarded candidates, and salaries of these people are bid up. More non-CEO years with a
company makes it less likely the person was hired as an outside superstar.
ΔvoteA = β1Δ log(expendA) = ( β1 /100)[100 ⋅ Δ log(expendA)]
≈ ( β1 /100)(%ΔexpendA),
where we use the fact that 100 ⋅Δ log(expendA) ≈ %ΔexpendA . So β1 /100 is the (ceteris paribus)
percentage point change in voteA when expendA increases by one percent.
and a z% increase in expenditure by B leaves voteA unchanged. We can equivalently write H0:
β1 + β2 = 0.
(iii) The estimated equation (with standard errors in parentheses below estimates) is
While the coefficients on log(expendA) and log(expendB) are of similar magnitudes (and
opposite in sign, as we expect), we do not have the standard error of β̂1 + β̂2, which is what we
voteA = β 0 + θ1 log(expendA) + β 2 [log(expendB) – log(expendA)] + β 3 prtystrA + u,
When we estimate this equation we obtain θ̂1 ≈ –.532 and se(θ̂1) ≈ .533. The t statistic for the
hypothesis in part (ii) is –.532/.533 ≈ –1. Therefore, we fail to reject H0: β2 = –β1.
the hypothesis that rank has no effect on log(salary) is H0: β 5 = 0. The estimated equation (now
with standard errors) is
log (salary ) = 8.34 + .0047 LSAT + .248 GPA + .095 log(libvol)
(0.53) (.0040) (.090) (.033)
+ .038 log(cost) – .0033 rank
(.032) (.0003)
n = 136, R2 = .842.
The t statistic on rank is –11, which is very significant. If rank decreases by 10 (which is a move
up for a law school), median starting salary is predicted to increase by about 3.3%.
(ii) LSAT is not statistically significant (t statistic ≈ 1.18) but GPA is very significant (t
statistic ≈ 2.76). The test for joint significance is moot given that GPA is so significant, but for
completeness the F statistic is about 9.95 (with 2 and 130 df) and p-value ≈ .0001.
(iii) When we add clsize and faculty to the regression we lose five observations. The test of
their joint significance (with 2 and 131 – 8 = 123 df) gives F ≈ .95 and p-value ≈ .39. So these
two variables are not jointly significant unless we use a very large significance level.
(iv) If we want to just determine the effect of numerical ranking on starting law school
salaries, we should control for other factors that affect salaries and rankings. The idea is that
there is some randomness in rankings, or the rankings might depend partly on frivolous factors
that do not affect quality of the students. LSAT scores and GPA are perhaps good controls for
student quality. However, if there are differences in gender and racial composition across
schools, and systematic gender and race differences in salaries, we could also control for these.
However, it is unclear why these would be correlated with rank. Faculty quality, as perhaps
measured by publication records, could be included. Such things do enter rankings of law
schools.
Therefore, θˆ1 = 150(.000379) + .0289 = .0858, which means that an additional 150 square foot
bedroom increases the predicted price by about 8.6%.
(iii) From part (ii), we run the regression
and obtain the standard error on bdrms. We already know that θ̂1 = .0858; now we also get
se(θ̂1) = .0268. The 95% confidence interval reported by my software package is .0326 to .1390.
4.15 The R-squared from the regression bwght on cigs, parity, and faminc, using all 1,388
observations, is about .0348. This means that, if we mistakenly use this in place of .0364, which
is the R-squared using the same 1,191 observations available in the unrestricted regression, we
would obtain F = [(.0387 − .0348)/(1 − .0387)](1,185/2) ≈ 2.40, which yields p-value ≈ .091 in
an F distribution with 2 and 1,185 df. This is significant at the 10% level, but it is incorrect.
The correct F statistic was computed as 1.42 in Example 4.9, with p-value ≈ .242.
Now hrunsyr is very statistically significant (t statistic ≈ 4.99), and its coefficient has increased
by about two and one-half times.
Of the three additional independent variables, only runsyr is statistically significant (t
statistic = .0174/.0051 ≈ 3.41). The estimate implies that one more run per year, other factors
fixed, increases predicted salary by about 1.74%, a substantial increase. The stolen bases
variable even has the “wrong” sign with a t statistic of about –1.23, while fldperc has a t statistic
of only .5. Most major league baseball players are pretty good fielders; in fact, the smallest
fldperc is 800 (which means .800). With relatively little variation in fldperc, it is perhaps not
surprising that its effect is hard to estimate.
(iii) From their t statistics, bavg, fldperc, and sbasesyr are individually insignificant. The F
statistic for their joint significance (with 3 and 345 df) is about .69 with p-value ≈ .56.
to obtain the 95% CI for θ 2 . This turns out to be about .0020 ± 1.96(.0047), or about -.0072
to .0112. Because zero is in this CI, θ 2 is not statistically different from zero at the 5% level,
and we fail to reject H0: β 2 = β 3 at the 5% level.
4.18 (i) The minimum value is 0, the maximum is 99, and the average is about 56.16.
log (wage) = 1.459 − .0093 jc + .0755 totcoll + .0049 exper + .00030 phsrank
(0.024) (.0070) (.0026) (.0002) (.00024)
n = 6,763, R2 = .223
So phsrank has a t statistic equal to only 1.25; it is not statistically significant. If we increase
phsrank by 10, log(wage) is predicted to increase by (.0003)10 = .003. This implies a .3%
increase in wage, which seems a modest increase given a 10 percentage point increase in phsrank.
(However, the sample standard deviation of phsrank is about 24.)
(iii) Adding phsrank makes the t statistic on jc even smaller in absolute value, about 1.33, but
the coefficient magnitude is similar to (4.26). Therefore, the base point remains unchanged: the
return to a junior college is estimated to be somewhat smaller, but the difference is not
significant at standard significance levels.
(iv) The variable id is just a worker identification number, which should be randomly
assigned (at least roughly). Therefore, id should not be correlated with any variable in the
regression equation. It should be insignificant when added to (4.17) or (4.26). In fact, its t
statistic is about .54.
4.19 (i) There are 2,017 single people in the sample of 9,275.
n = 2,017, R2 = .119.
The coefficient on inc indicates that one more dollar in income (holding age fixed) is reflected in
about 80 more cents in predicted nettfa; no surprise there. The coefficient on age means that,
holding income fixed, if a person gets another year older, his/her nettfa is predicted to increase
by about $843. (Remember, nettfa is in thousands of dollars.) Again, this is not surprising.
(iii) The intercept is not very interesting, as it gives the predicted nettfa for inc = 0 and age =
0. Clearly, there is no one with even close to these values in the relevant population.
(iv) The t statistic is (.843 − 1)/.092 ≈ −1.71. Against the one-sided alternative H1: β2 < 1,
the p-value is about .044. Therefore, we can reject H0: β2 = 1 at the 5% significance level
(against the one-sided alternative).
(v) The slope coefficient on inc in the simple regression is about .821, which is not very
different from the .799 obtained in part (ii). As it turns out, the correlation between inc and age
in the sample of single people is only about .039, which helps explain why the simple and
multiple regression estimates are not very different; refer back to page 79 of the text.
CHAPTER 5
TEACHING NOTES
Chapter 5 is short, but it is conceptually more difficult than the earlier chapters; it requires some
knowledge of asymptotic properties of estimators. In class, I give a brief, heuristic description of
consistency and asymptotic normality before stating the consistency and asymptotic normality of
OLS. (Conveniently, the same assumptions that work for finite sample analysis work for
asymptotic analysis.) More advanced students can follow the proof of consistency of the slope
coefficient in the bivariate regression case. Section E.4 contains a full matrix treatment of
asymptotic analysis appropriate for a master’s level course.
An explicit illustration of what happens to standard errors as the sample size grows emphasizes
the importance of having a larger sample. I do not usually cover the LM statistic in a first-
semester course, and I only briefly mention the asymptotic efficiency result. Without full use of
matrix algebra combined with limit theorems for vectors and matrices, it is very difficult to prove
asymptotic efficiency of OLS.
I think the conclusions of this chapter are important for students to know, even though they may
not grasp the details. On exams I usually include true-false type questions, with explanation, to
test the students’ understanding of asymptotics. [For example: “In large samples we do not have
to worry about omitted variable bias.” (False). Or “Even if the error term is not normally
distributed, in large samples we can still compute approximately valid confidence intervals under
the Gauss-Markov assumptions.” (True).]
SOLUTIONS TO PROBLEMS
5.1 Write y = β0 + β1x1 + u, and take the expected value: E(y) = β0 + β1E(x1) + E(u), or µy =
β0 + β1µx since E(u) = 0, where µy = E(y) and µx = E(x1). We can rewrite this as β0 = µy −
β1µx. Now, β̂0 = ȳ − β̂1x̄1. Taking the plim of this we have plim(β̂0) = plim(ȳ − β̂1x̄1) =
plim(ȳ) – plim(β̂1)⋅plim(x̄1) = µy − β1µx, where we use the fact that plim(ȳ) = µy and
plim(x̄1) = µx by the law of large numbers, and plim(β̂1) = β1. We have also used the parts of
5.2 A higher tolerance of risk means more willingness to invest in the stock market, so β 2 > 0.
By assumption, funds and risktol are positively correlated. Now we use equation (5.5), where
δ1 > 0: plim(β̃1) = β1 + β2δ1 > β1, so β̃1 has a positive inconsistency (asymptotic bias). This
makes sense: if we omit risktol from the regression and it is positively correlated with funds,
some of the estimated effect of funds is actually due to the effect of risktol.
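A small simulation can make the asymptotic bias formula plim(β̃1) = β1 + β2δ1 concrete. This is only a sketch under assumed values: the parameters beta1, beta2, and delta1 below are hypothetical, x1 plays the role of funds and x2 the role of the omitted risktol, and the sample is made large so that the short-regression slope sits close to its probability limit.

```python
import random

random.seed(1)
n = 100_000
beta1, beta2, delta1 = 1.0, 0.5, 0.8  # hypothetical; beta2 > 0 and delta1 > 0

# x2 is positively related to x1 through delta1.
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [delta1 * a + random.gauss(0, 1) for a in x1]
y = [beta1 * a + beta2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]


def simple_slope(x, y):
    """OLS slope from the simple regression of y on x (with intercept)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


beta1_tilde = simple_slope(x1, y)   # regression that omits x2
plim_formula = beta1 + beta2 * delta1

print(beta1_tilde, plim_formula)
```

Increasing n does not move the short-regression slope toward beta1; it only tightens it around beta1 + beta2*delta1, which is the point of the inconsistency result.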
5.3 The variable cigs has nothing close to a normal distribution in the population. Most people
do not smoke, so cigs = 0 for over half of the population. A normally distributed random
variable takes on no particular value with positive probability. Further, the distribution of cigs is
skewed, whereas a normal random variable must be symmetric about its mean.
5.4 Write y = β0 + β1x + u, and take the expected value: E(y) = β0 + β1E(x) + E(u), or µy =
β0 + β1µx, since E(u) = 0, where µy = E(y) and µx = E(x). We can rewrite this as β0 = µy −
β1µx. Now, β̃0 = ȳ − β̃1x̄. Taking the plim of this we have plim(β̃0) = plim(ȳ − β̃1x̄) =
plim(ȳ) – plim(β̃1)⋅plim(x̄) = µy − β1µx, where we use the fact that plim(ȳ) = µy and
plim(x̄) = µx by the law of large numbers, and plim(β̃1) = β1. We have also used the parts of
Below is a histogram of the 526 residuals, ûi, i = 1, 2, ..., 526. The histogram uses 27 bins,
which is suggested by the formula in the Stata manual for 526 observations. For comparison, the
normal distribution that provides the best fit to the histogram is also plotted.
[Figure: histogram of the residuals (horizontal axis: uhat, from about –8 to 15; vertical axis: Fraction), with the best-fitting normal density overlaid.]
The histogram for the residuals from this equation, with the best-fitting normal distribution
overlaid, is given below:
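[Figure: histogram of the residuals (horizontal axis: uhat, from about –2 to 1.5; vertical axis: Fraction), with the best-fitting normal density overlaid.]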
(iii) The residuals from the log(wage) regression appear to be more normally distributed.
Certainly the histogram in part (ii) fits under its comparable normal density better than in part (i),
and the histogram for the wage residuals is notably skewed to the left. In the wage regression
there are some very large residuals (roughly equal to 15) that lie almost five estimated standard
deviations (σ̂ = 3.085) from the mean of the residuals, which is identically zero. This does not
(iii) The ratio of the standard error using 2,070 observations to that using 4,137 observations
is about 1.31. From (5.10) we compute √(4,137/2,070) ≈ 1.41, which is somewhat above the
ratio of the actual standard errors.
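The square-root rule from (5.10) can be illustrated with simulated data. The sketch below is an assumption-laden illustration, not a replication of the problem: the model y = 1 + .5x + u and the helper slope_se are hypothetical, and only the two sample sizes are taken from the text. With independent samples the ratio of standard errors will be near, but not exactly equal to, the square root of the ratio of sample sizes.

```python
import math
import random


def slope_se(n, seed):
    """Simulate y = 1 + 0.5*x + u with n observations; return se of the OLS slope."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]
    mx = sum(x) / n
    my = sum(y) / n
    sst_x = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sst_x
    b0 = my - b1 * mx
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = ssr / (n - 2)
    return math.sqrt(sigma2_hat / sst_x)


se_small = slope_se(2_070, seed=0)
se_large = slope_se(4_137, seed=1)
ratio = se_small / se_large
rule_of_thumb = math.sqrt(4_137 / 2_070)

print(ratio, rule_of_thumb)
```

As in the solution above, the realized ratio and the rule of thumb differ somewhat, because the estimated error variance and the sample variation in x are not identical across the two samples.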
5.7 We first run the regression bwght on cigs, parity, and faminc using only the 1,191
observations with nonmissing values for motheduc and fatheduc. After obtaining the
residuals, ũi, these are regressed on cigsi, parityi, faminci, motheduci, and fatheduci, where, of
course, we can only use the 1,191 observations with nonmissing values for both motheduc and
fatheduc. The R-squared from this regression, R²u, is about .0024. With 1,191 observations, the
chi-square statistic is (1,191)(.0024) ≈ 2.86. The p-value from the χ² distribution with 2 df is about .239,
which is very close to .242, the p-value for the comparable F test.
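The last step of the LM procedure is simple arithmetic and can be sketched as below. The function names are hypothetical. The p-value helper uses the fact that a chi-square random variable with 2 degrees of freedom has upper-tail probability exp(−x/2); for a different number of restrictions a general chi-square routine would be needed.

```python
import math


def lm_statistic(n, r2_u):
    """LM (n-R-squared) statistic from the auxiliary regression of the residuals."""
    return n * r2_u


def chi2_pvalue_2df(x):
    """Upper-tail probability of a chi-square(2) variable: exp(-x/2)."""
    return math.exp(-x / 2.0)


# n and the auxiliary R-squared quoted in the solution to 5.7.
lm = lm_statistic(1_191, 0.0024)
p = chi2_pvalue_2df(lm)
print(round(lm, 2), round(p, 4))
```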
CHAPTER 6
TEACHING NOTES
I cover most of Chapter 6, but not all of the material in great detail. I use the example in Table
6.1 to quickly run through the effects of data scaling on the important OLS statistics. (Students
should already have a feel for the effects of data scaling on the coefficients, fitted values, and
R-squared because it is covered in Chapter 2.) At most, I briefly mention beta coefficients; if
students have a need for them, they can read this subsection.
The functional form material is important, and I spend some time on more complicated models
with logarithms, quadratics, and interactions. An important point for models with quadratics,
and especially interactions, is that we need to evaluate the partial effect at interesting values of
the explanatory variables. Often, zero is not an interesting value for an explanatory variable and
is well outside the range in the sample. Using the methods from Chapter 4, it is easy to obtain
confidence intervals for the effects at interesting x values.
As far as goodness-of-fit, I only introduce the adjusted R-squared, as I think using a slew of
goodness-of-fit measures to choose a model can be confusing (and is not representative of most
empirical analyses). It is important to discuss how, if we fixate on a high R-squared, we may
wind up with a model that has no interesting ceteris paribus interpretation.
I often have students and colleagues ask if there is a simple way to predict y when log(y) has
been used as the dependent variable, and to obtain a goodness-of-fit measure for the log(y) model
that can be compared with the usual R-squared obtained when y is the dependent variable. The
methods described in Section 6.4 are easy to implement and, unlike other approaches, do not
require normality.
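One such method, the regression-through-the-origin adjustment that also appears in the solution to Problem 6.12, can be sketched on simulated data. Everything below is an assumption for illustration: the data-generating process, the single regressor x, and the helper predict_y are hypothetical. The steps are: estimate the log(y) model by OLS, form m̂i = exp of the fitted values, regress yi on m̂i without an intercept to obtain α̂0, and predict y as α̂0 times exp of the predicted log(y).

```python
import math
import random

rng = random.Random(0)
n = 2_000
x = [rng.uniform(0, 2) for _ in range(n)]
# Hypothetical data: log(y) is linear in x with an additive error.
y = [math.exp(1.0 + 0.5 * xi + rng.gauss(0, 0.5)) for xi in x]
log_y = [math.log(v) for v in y]

# Step 1: OLS of log(y) on x.
mx = sum(x) / n
ml = sum(log_y) / n
b1 = sum((xi - mx) * (li - ml) for xi, li in zip(x, log_y)) / sum(
    (xi - mx) ** 2 for xi in x
)
b0 = ml - b1 * mx

# Step 2: m_hat_i = exp(fitted value of log(y)).
m_hat = [math.exp(b0 + b1 * xi) for xi in x]

# Step 3: regress y on m_hat through the origin; the slope is alpha0_hat.
alpha0_hat = sum(m * v for m, v in zip(m_hat, y)) / sum(m * m for m in m_hat)


def predict_y(x_new):
    """Predict y (not log(y)) at x_new, scaling exp(fitted log y) by alpha0_hat."""
    return alpha0_hat * math.exp(b0 + b1 * x_new)


print(alpha0_hat, predict_y(1.0), math.exp(b0 + b1 * 1.0))
```

Simply exponentiating the predicted log(y) tends to underpredict y; the scale factor α̂0 corrects for this without assuming normal errors.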
The section on prediction and residual analysis contains several important topics, including
constructing prediction intervals. It is useful to see how much wider the prediction intervals are
than the confidence interval for the conditional mean. I usually discuss some of the residual-
analysis examples, as they have real-world applicability.
SOLUTIONS TO PROBLEMS
6.1 The generality is not necessary. The t statistic on roe² is only about −.30, which shows that
roe² is very statistically insignificant. Plus, having the squared term has only a minor effect on
the slope even for large values of roe. (The approximate slope is .0215 − .00016 roe, and even
when roe = 25 – about one standard deviation above the average roe in the sample – the slope
is .0175, as compared with .0215 at roe = 0.)
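6.2 By definition of the OLS regression of c0yi on c1xi1, …, ckxik, i = 1, …, n, the β̃j solve
\[
\sum_{i=1}^{n} \left[ (c_0 y_i) - \tilde\beta_0 - \tilde\beta_1 (c_1 x_{i1}) - \dots - \tilde\beta_k (c_k x_{ik}) \right] = 0
\]
\[
\sum_{i=1}^{n} (c_j x_{ij}) \left[ (c_0 y_i) - \tilde\beta_0 - \tilde\beta_1 (c_1 x_{i1}) - \dots - \tilde\beta_k (c_k x_{ik}) \right] = 0, \quad j = 1, 2, \dots, k.
\]
[We obtain these from equations (3.13), where we plug in the scaled dependent and independent
variables.] We now show that if β̃0 = c0β̂0 and β̃j = (c0/cj)β̂j, j = 1,…,k, then these k + 1
first order conditions are satisfied, which proves the result because we know that the OLS
estimates are the unique solutions to the FOCs (once we rule out perfect collinearity in the
independent variables). Plugging in these guesses for the β̃j gives the expressions
\[
\sum_{i=1}^{n} \left[ (c_0 y_i) - c_0 \hat\beta_0 - (c_0/c_1) \hat\beta_1 (c_1 x_{i1}) - \dots - (c_0/c_k) \hat\beta_k (c_k x_{ik}) \right]
\]
and
\[
\sum_{i=1}^{n} (c_j x_{ij}) \left[ (c_0 y_i) - c_0 \hat\beta_0 - (c_0/c_1) \hat\beta_1 (c_1 x_{i1}) - \dots - (c_0/c_k) \hat\beta_k (c_k x_{ik}) \right], \quad j = 1, 2, \dots, k.
\]
Simple cancellation shows we can write these expressions as
\[
\sum_{i=1}^{n} \left[ (c_0 y_i) - c_0 \hat\beta_0 - c_0 \hat\beta_1 x_{i1} - \dots - c_0 \hat\beta_k x_{ik} \right]
\]
and
\[
\sum_{i=1}^{n} (c_j x_{ij}) \left[ (c_0 y_i) - c_0 \hat\beta_0 - c_0 \hat\beta_1 x_{i1} - \dots - c_0 \hat\beta_k x_{ik} \right], \quad j = 1, 2, \dots, k,
\]
or, factoring out constants,
\[
c_0 \left( \sum_{i=1}^{n} ( y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik} ) \right)
\]
and
\[
c_0 c_j \left( \sum_{i=1}^{n} x_{ij} ( y_i - \hat\beta_0 - \hat\beta_1 x_{i1} - \dots - \hat\beta_k x_{ik} ) \right), \quad j = 1, 2, \dots, k.
\]
But the terms multiplying c0 and c0cj are identically zero by the first order conditions for the β̂j
since, by definition, they are obtained from the regression yi on xi1, …, xik, i = 1, 2, …, n. So we
have shown that β̃0 = c0β̂0 and β̃j = (c0/cj)β̂j, j = 1, …, k solve the requisite first order
conditions.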
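The scaling result in 6.2 can also be confirmed numerically. The sketch below assumes NumPy is available and uses simulated data with k = 2 regressors; the scale factors c0 and c and the helper ols are hypothetical choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 2
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=n)


def ols(y, X):
    """OLS coefficients (intercept first) from regressing y on the columns of X."""
    Z = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef


c0 = 10.0                    # scale factor for the dependent variable
c = np.array([0.5, 100.0])   # scale factors for x1 and x2

beta_hat = ols(y, X)               # regression of y on x1, x2
beta_tilde = ols(c0 * y, X * c)    # regression of c0*y on c1*x1, c2*x2

# Claimed relationship: intercept scales by c0, slope j scales by c0/c_j.
expected = np.concatenate([[c0 * beta_hat[0]], (c0 / c) * beta_hat[1:]])
print(beta_tilde)
print(expected)
```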
6.3 (i) The turnaround point is given by β̂1 /(2| β̂ 2 |), or .0003/(.000000014) ≈ 21,428.57;
remember, this is sales in millions of dollars.
(ii) Probably. Its t statistic is about –1.89, which is significant against the one-sided
alternative H1: β2 < 0 at the 5% level (cv ≈ –1.70 with df = 29). In fact, the p-value is
about .036.
(iii) Because sales gets divided by 1,000 to obtain salesbil, the corresponding coefficient gets
multiplied by 1,000: (1,000)(.00030) = .30. The standard error gets multiplied by the same
factor. As stated in the hint, salesbil² = sales²/1,000,000, and so the coefficient on the quadratic
gets multiplied by one million: (1,000,000)(.0000000070) = .0070; its standard error also gets
multiplied by one million. Nothing happens to the intercept (because rdintens has not been
rescaled) or to the R²:
rdintens = 2.613 + .30 salesbil – .0070 salesbil²
(0.429) (.14) (.0037)
n = 32, R2 = .1484.
(iv) The equation in part (iii) is easier to read because it contains fewer zeros to the right of
the decimal. Of course the interpretation of the two equations is identical once the different
scales are accounted for.
Dividing both sides by ∆educ gives the result. The sign of β 2 is not obvious, although β 2 > 0 if
we think a child gets more out of another year of education the more highly educated are the
child’s parents.
(ii) We use the values pareduc = 32 and pareduc = 24 to interpret the coefficient on
educ⋅pareduc. The difference in the estimated return to education is .00078(32 – 24) = .0062, or
about .62 percentage points.
(iii) When we add pareduc by itself, the coefficient on the interaction term is negative. The t
statistic on educ⋅pareduc is about –1.33, which is not significant at the 10% level against a
two-sided alternative. Note that the coefficient on pareduc is significant at the 5% level against a
two-sided alternative. This provides a good example of how omitting a level effect (pareduc in
this case) can lead to biased estimation of the interaction effect.
6.5 This would make little sense. Performance on math and science exams are measures of
outputs of the educational process, and we would like to know how various educational inputs
and school characteristics affect math and science scores. For example, if the staff-to-pupil ratio
has an effect on both exam scores, why would we want to hold performance on the science test
fixed while studying the effects of staff on the math pass rate? This would be an example of
controlling for too many factors in a regression equation. The variable sci11 could be a dependent
variable in an identical regression equation.
6.6 The extended model has df = 680 – 9 = 671, and we are testing two restrictions. Therefore,
F = [(.232 – .229)/(1 – .232)](671/2) ≈ 1.31, which is well below the 10% critical value in the F
distribution with 2 and ∞ df: cv = 2.30. Thus, atndrte² and ACT⋅atndrte are jointly insignificant.
Because adding these terms complicates the model without statistical justification, we would not
include them in the final model.
6.7 The second equation is clearly preferred, as its adjusted R-squared is notably larger than that
in the other two equations. The second equation contains the same number of estimated
parameters as the first, and one fewer than the third. The second equation is also easier to
interpret than the third.
6.8 (i) The causal (or ceteris paribus) effect of dist on price means that β1 ≥ 0: all other relevant
factors equal, it is better to have a home farther away from the incinerator. The estimated
equation is
which means a 1% increase in distance from the incinerator is associated with a predicted price
that is about .37% higher.
(ii) When the variables log(inst), log(area), log(land), rooms, baths, and age are added to the
regression, the coefficient on log(dist) becomes about .055 (se ≈ .058). The effect is much
smaller now, and statistically insignificant. This is because we have explicitly controlled for
several other factors that determine the quality of a home (such as its size and number of baths)
and its location (distance to the interstate). This is consistent with the hypothesis that the
incinerator was located near less desirable homes to begin with.
(iii) When [log(inst)]² is added to the regression in part (ii), we obtain (with the results only
partially reported)
The coefficient on log(dist) is now very statistically significant, with a t statistic of about three.
The coefficients on log(inst) and [log(inst)]² are both very statistically significant, each with t
statistics above four in absolute value. Just adding [log(inst)]² has had a very big effect on the
coefficient important for policy purposes. This means that distance from the incinerator and
distance from the interstate are correlated in some nonlinear way that also affects housing price.
We can find the value of log(inst) where the effect on log(price) actually becomes negative:
2.073/[2(.1193)] ≈ 8.69. When we exponentiate this we obtain about 5,943 feet from the
interstate. Therefore, it is best to have your home away from the interstate for distances less than
just over a mile. After that, moving farther away from the interstate lowers predicted house price.
(iv) The coefficient on [log(dist)]², when it is added to the model estimated in part (iii), is
about –.0365, but its t statistic is only about –.33. Therefore, it is not necessary to add this
complication.
(ii) The t statistic on exper² is about –6.16, which has a p-value of essentially zero. So exper²
is significant at the 1% level (and much smaller significance levels).
$\%\Delta\widehat{wage}$ ≈ 100[.0410 − 2(.000714)4] ≈ 3.53%.
$\%\Delta\widehat{wage}$ ≈ 100[.0410 − 2(.000714)19] ≈ 1.39%
(iv) The turnaround point is about .041/[2(.000714)] ≈ 28.7 years of experience. In the
sample, there are 121 people with at least 29 years of experience. This is a fairly sizeable
fraction of the sample.
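For readers who want to verify these numbers, the following sketch evaluates the quadratic's slope at 4 and 19 years of experience and recomputes the turnaround point. The coefficients are the ones quoted above; the function name is hypothetical.

```python
def pct_return_to_next_year(exper, b_exper=0.0410, b_exper_sq=-0.000714):
    """Approximate % change in wage from one more year, starting at `exper` years."""
    return 100.0 * (b_exper + 2.0 * b_exper_sq * exper)

fifth_year = pct_return_to_next_year(4)      # return to the fifth year
twentieth_year = pct_return_to_next_year(19)  # return to the twentieth year
turnaround = 0.0410 / (2.0 * 0.000714)        # where the slope reaches zero

print(round(fifth_year, 2), round(twentieth_year, 2), round(turnaround, 1))
```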
or
Δlog(wage)/Δeduc = (β1 + β3 exper).
This is the approximate proportionate change in wage given one more year of education.
(ii) H0: β3 = 0. If we think that education and experience interact positively – so that people
with more experience are more productive when given another year of education – then β3 > 0
is the appropriate alternative.
and run the regression log(wage) on educ, exper, and educ(exper – 10). We want the coefficient
on educ. We obtain θ̂1 ≈ .0761 and se(θ̂1) ≈ .0066. The 95% CI for θ1 is about .063 to .089.
The quadratic term is very statistically significant, with t statistic ≈ –3.87.
(ii) We want the value of hsize, say hsize*, where ŝat reaches its maximum. This is the
turning point in the parabola, which we calculate as hsize* = 19.81/[2(2.13)] ≈ 4.65. Since hsize
is in 100s, this means 465 students is the “optimal” class size. Of course, the very small
R-squared shows that class size explains only a tiny amount of the variation in SAT score.
(iii) Only students who actually take the SAT exam appear in the sample, so it is not
representative of all high school seniors. If the population of interest is all high school seniors,
we need a random sample of such students who all took the same standardized exam.
The optimal class size is now estimated as about 469, which is very close to what we obtained
with the level-level model.
6.12 (i) The results of estimating the log-log model (but with bdrms in levels) are
log(p̂rice) = 5.61 + .168 log(lotsize) + .700 log(sqrft) + .037 bdrms
(0.65) (.038) (.093) (.028)
l̂price = 5.61 + .168 ⋅ log(20,000) + .700 ⋅ log(2,500) + .037(4) ≈ 12.90
where we use lprice to denote log(price). To predict price, we use the equation p̂rice =
α̂0 exp(l̂price), where α̂0 is the slope on m̂i ≡ exp(l̂pricei) from the regression pricei on m̂i, i =
1, 2, …, 88 (without an intercept). When we do this regression we get α̂0 ≈ 1.023. Therefore,
for the values of the independent variables given above, p̂rice ≈ (1.023)exp(12.90) ≈ $409,519
(rounded to the nearest dollar). If we forget to multiply by α̂0 the predicted price would be
about $400,312.
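The prediction steps can be reproduced numerically. The sketch below plugs the reported coefficients into the fitted equation, rounds the fitted log price to two decimals as the text does, and then scales by the reported α̂0 ≈ 1.023. The function and variable names are illustrative only.

```python
import math

def predicted_log_price(lotsize, sqrft, bdrms):
    """Fitted log(price) from the reported equation."""
    return 5.61 + 0.168 * math.log(lotsize) + 0.700 * math.log(sqrft) + 0.037 * bdrms

# The text rounds the fitted value to 12.90 before exponentiating; we do the same.
lprice_hat = round(predicted_log_price(20_000, 2_500, 4), 2)
alpha0_hat = 1.023  # reported slope from regressing price on exp(lprice_hat), no intercept

naive_price = math.exp(lprice_hat)         # forgets the scale adjustment
adjusted_price = alpha0_hat * naive_price  # prediction for price itself

print(f"naive ${naive_price:,.0f}, adjusted ${adjusted_price:,.0f}")
```

Skipping the rounding step changes the dollar figures by a few hundred dollars, which is a reminder that exponentiation magnifies small differences in the fitted log value.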
(iii) When we run the regression with all variables in levels, the R-squared is about .672.
When we compute the correlation between pricei and the m̂i from part (ii), we obtain about .859.
The square of this, or roughly .738, is the comparable goodness-of-fit measure for the model
with log(price) as the dependent variable. Therefore, for predicting price, the log model is
notably better.
the ceteris paribus effect of expendB on voteA is obtained by taking changes and holding prtystrA,
expendA, and u fixed:
ΔvoteA = (β3 + β4 expendA) ΔexpendB,
or
ΔvoteA/ΔexpendB = β3 + β4 expendA.
We think β3 < 0 if a ceteris paribus increase in spending by B lowers the share of the vote
received by A. But the sign of β4 is ambiguous: Is the effect of more spending by B smaller or
larger for higher levels of spending by A?
v̂oteA = 32.12 + .342 prtystrA + .0383 expendA – .0317 expendB
(4.59) (.088) (.0050) (.0046)
– .0000066 expendA ⋅ expendB
(.0000072)
The interaction term is not statistically significant, as its t statistic is less than one in absolute
value.
(iii) The average value of expendA is about 310.61, or $310,610. If we set expendA at 300,
which is close to the average value, we have
Δv̂oteA = [–.0317 – .0000066 ⋅ (300)] ΔexpendB ≈ –.0337(ΔexpendB).
So, when ΔexpendB = 100, Δv̂oteA ≈ –3.37, which is a fairly large effect. (Note that, given the
insignificance of the interaction term, we would be justified in leaving it out and reestimating the
model. This would make the calculation easier.)
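As a quick check on the interaction arithmetic, the sketch below evaluates the estimated effect of expendB at expendA = 300 using the coefficients reported in part (ii). The function name is hypothetical.

```python
def effect_of_expendB(expendA, b_expendB=-0.0317, b_interaction=-0.0000066):
    """Estimated change in voteA per one-unit ($1,000) increase in expendB."""
    return b_expendB + b_interaction * expendA

slope_at_300 = effect_of_expendB(300)
change_for_100 = slope_at_300 * 100  # effect of $100,000 more spending by B

print(round(slope_at_300, 4), round(change_for_100, 2))
```

Setting `b_interaction=0` shows how little the answer depends on the (insignificant) interaction term.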
Δv̂oteA = (β̂2 + β̂4 expendB)ΔexpendA ≈ .0376(ΔexpendA) = 3.76
when ΔexpendA = 100. This does make sense, and it is a nontrivial effect.
v̂oteA = 18.20 + .157 prtystrA − .0067 expendA + .0043 expendB + .494 shareA
(2.57) (.050) (.0028) (.0026) (.025)
Notice how much higher the goodness-of-fit measures are as compared with the equation
estimated in part (ii), and how significant shareA is. To obtain the partial effect of expendB on
v̂oteA we must compute the partial derivative. Generally, we have
∂v̂oteA/∂expendB = β̂3 + β̂4 (∂shareA/∂expendB),
where
∂shareA/∂expendB = −100 [expendA/(expendA + expendB)²].
Evaluated at expendA = 300 and expendB = 0, the partial derivative is –100(300/300²) = −1/3,
and therefore
∂v̂oteA/∂expendB = β̂3 − β̂4(1/3) = .0043 − .494/3 ≈ −.160.
So v̂oteA falls by .160 percentage points given the first thousand dollars of spending by
candidate B, where A’s spending is held fixed at 300 (or $300,000). This is a fairly large effect,
although it may not be the most typical scenario (because it is rare to have one candidate spend
so much and another spend so little). The effect tapers off as expendB grows. For example, at
expendB = 100, the effect of the thousand dollars of spending is only about .0043 − .494(.188) ≈
–.089.
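The chain-rule calculation is easy to get wrong by hand, so here is a small sketch that evaluates it. It assumes shareA = 100 ⋅ expendA/(expendA + expendB), which is the definition implied by the derivative above, and uses the reported coefficients .0043 and .494. Evaluating .0043 − .494/3 directly gives about −.160.

```python
def dshareA_dexpendB(expendA, expendB):
    """Derivative of shareA = 100*expendA/(expendA + expendB) w.r.t. expendB (assumed definition)."""
    return -100.0 * expendA / (expendA + expendB) ** 2

def effect_of_expendB(expendA, expendB, b_expendB=0.0043, b_shareA=0.494):
    """Total estimated effect of expendB on voteA, direct plus through shareA."""
    return b_expendB + b_shareA * dshareA_dexpendB(expendA, expendB)

at_zero = effect_of_expendB(300, 0)       # first thousand dollars spent by B
at_hundred = effect_of_expendB(300, 100)  # after B has already spent $100,000

print(round(at_zero, 3), round(at_hundred, 3))
```

The second value differs from the text's –.089 only in the third decimal, because the text rounds the derivative to .188 before multiplying.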
6.14 (i) If we hold all variables except priGPA fixed and use the usual approximation Δ(priGPA²)
≈ 2(priGPA)ΔpriGPA, then we have
Δstndfnl ≈ (β2 + 2β4 priGPA + β6 atndrte)ΔpriGPA,
and dividing by ΔpriGPA gives the result. In equation (6.19) we have β̂2 = −1.63, β̂4 = .296,
and β̂6 = .0056. When priGPA = 2.59 and atndrte = .82 we have
Δŝtndfnl/ΔpriGPA = −1.63 + 2(.296)(2.59) + .0056(.82) ≈ −.092.
(ii) First, note that (priGPA – 2.59)² = priGPA² – 2(2.59)priGPA + (2.59)² and
priGPA(atndrte − .82) = priGPA ⋅ atndrte – (.82)priGPA. So we can write equation (6.18) as
When we run the regression associated with this last model, we obtain θ̂2 ≈ −.091 (which differs
from part (i) by rounding error) and se(θ̂2) ≈ .363. This implies a very small t statistic for θ̂2.
p̂rice = −21,770.3 + 2.068 lotsize + 122.78 sqrft + 13,852.5 bdrms
(29,475.0) (0.642) (13.24) (9,010.1)
The predicted price at lotsize = 10,000, sqrft = 2,300, and bdrms = 4 is about $336,714.
(ii) The regression is pricei on (lotsizei – 10,000), (sqrfti – 2,300), and (bdrmsi – 4). We want
the intercept and the associated 95% CI from this regression. The CI is approximately
336,706.7 ± 14,665, or about $322,042 to $351,372 when rounded to the nearest dollar.
(iii) We must use equation (6.36) to obtain the standard error of ê0 and then use equation
Using 1.99 as the approximate 97.5th percentile in the t84 distribution gives the 95% CI for price0,
at the given values of the explanatory variables, as 336,706.7 ± 1.99(60,285.8) or, rounded to the
nearest dollar, $216,738 to $456,675. This is a fairly wide prediction interval. But we have not
used many factors to explain housing price. If we had more, we could, presumably, reduce the
error standard deviation, and therefore σ̂, to obtain a tighter prediction interval.
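To see how much wider the prediction interval is than the confidence interval for the expected price, the two can be computed side by side. This sketch uses only numbers reported above (the center 336,706.7, the half-width 14,665, the critical value 1.99, and se(ê0) = 60,285.8); the helper name is made up.

```python
def interval(center, half_width):
    """Return (lower, upper) for center ± half_width."""
    return center - half_width, center + half_width

center = 336_706.7

# Part (ii): CI for the expected price at the given characteristics.
ci_low, ci_high = interval(center, 14_665)

# Part (iii): prediction interval for an individual house, using se(e0_hat).
pi_low, pi_high = interval(center, 1.99 * 60_285.8)

print(round(ci_low), round(ci_high), round(pi_low), round(pi_high))
print(round((pi_high - pi_low) / (ci_high - ci_low), 1))  # ratio of interval widths
```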
(ii) The turnaround point is 2.364/[2(.0770)] ≈ 15.35. So, the increase from 15 to 16 years of
experience would actually reduce salary. This is a very high level of experience, and we can
essentially ignore this prediction: only two players in the sample of 269 have more than 15 years
of experience.
(iii) Many of the most promising players leave college early, or, in some cases, forego
college altogether, to play in the NBA. These top players command the highest salaries. It is not
more college that hurts salary, but less college is indicative of super-star potential.
(iv) When age² is added to the regression from part (i), its coefficient is .0536 (se = .0492).
Its t statistic is barely above one, so we are justified in dropping it. The coefficient on age in the
same regression is –3.984 (se = 2.689). Together, these estimates imply a negative, increasing,
return to age. The turning point is roughly at 3.984/[2(.0536)] ≈ 37 years old. In any case, the linear function of
age seems sufficient.
log(ŵage) = 6.78 + .078 points + .218 exper − .0071 exper² − .048 age − .040 coll
(.85) (.007) (.050) (.0028) (.035) (.053)
(vi) The joint F test produced by Stata is about 1.19. With 2 and 263 df, this gives a p-value
of roughly .31. Therefore, once scoring and years played are controlled for, there is no evidence
for wage differentials depending on age or years played in college.
The quadratic term is very significant; its t statistic is above 3.5 in absolute value.
(ii) The turning point calculation is by now familiar: npvis* = .0189 /[2(.00043)] ≈ 21.97 , or
about 22. In the sample, 89 women had 22 or more prenatal visits.
(iii) While prenatal visits are a good thing for helping to prevent low birth weight, a woman
having many prenatal visits is a possible indicator of a pregnancy with difficulties. So it does
make sense that the quadratic has a hump shape.
log(b̂wght) = 7.584 + .0180 npvis − .00041 npvis² + .0254 mage − .00041 mage²
(.137) (.0037) (.00012) (.0093) (.00015)
The birth weight is maximized at mage ≈ 31. In the sample, 746 women are at least 31 years
old; 605 are at least 32.
(v) These variables explain on the order of 2.6% of the variation in log(bwght), or even less
based on the adjusted R-squared, which is not very much.
(vi) If we regress bwght on npvis, npvis², mage, and mage², then R² = .0192. But remember,
we cannot compare this directly with the R-squared from part (iv). Instead, we compute an
R-squared for the log(bwght) model that can be compared with .0192. From Section 6.4, we
compute the squared correlation between bwght and exp(l̂bwght), where l̂bwght denotes the
fitted values from the log(bwght) model. The correlation is .1362, so its square is about .0186.
Therefore, for explaining bwght, the model with bwght actually fits slightly better (but nothing to
make a big deal about).
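The comparable goodness-of-fit measure is just a squared correlation, which the following sketch computes from scratch. The five observations and fitted log values are made up purely to show the mechanics and are not taken from the birth weight data; only the last line uses a number from the text.

```python
import math

def squared_corr(y, yhat):
    """Squared sample correlation between y and yhat."""
    n = len(y)
    ybar, hbar = sum(y) / n, sum(yhat) / n
    cov = sum((a - ybar) * (b - hbar) for a, b in zip(y, yhat))
    var_y = sum((a - ybar) ** 2 for a in y)
    var_h = sum((b - hbar) ** 2 for b in yhat)
    return cov * cov / (var_y * var_h)

# Hypothetical data: birth weights and fitted values from a log(bwght) regression.
bwght = [3100.0, 3400.0, 2900.0, 3650.0, 3300.0]
lbwght_hat = [8.05, 8.12, 8.01, 8.15, 8.10]

m_hat = [math.exp(v) for v in lbwght_hat]   # exponentiate the fitted logs
r2_comparable = squared_corr(bwght, m_hat)  # comparable to the levels R-squared

reported = 0.1362 ** 2  # the correlation reported in the text, squared
print(round(r2_comparable, 3), round(reported, 4))
```

Because correlation is unchanged by rescaling, multiplying the exponentiated fitted values by an adjustment factor such as α̂0 would not change this measure.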
CHAPTER 7
TEACHING NOTES
This is a fairly standard chapter on using qualitative information in regression analysis, although
I try to emphasize examples with policy relevance (and only cross-sectional applications are
included).
In discussing the Chow test, I think it is important to discuss testing for differences in slope
coefficients after allowing for an intercept difference. In many applications, a significant Chow
statistic simply indicates intercept differences. (See the example in Section 7.4 on student-
athlete GPAs in the text.) From a practical perspective, it is important to know whether the
partial effects differ across groups or whether a constant differential is sufficient.
An unconventional feature of this chapter is its introduction of the linear probability model. I
cover the LPM here for several reasons. First, the LPM is being used more and more. Empirical
researchers find it much easier to interpret than probit or logit models, and, once the proper
scalings are done, the estimated effects are often similar near the mean or median values of the
explanatory variables. The theoretical drawbacks of the LPM are often of secondary importance
in practice. Computer Exercise 7.17 is a good one to illustrate that, even with over 9,000
observations, the LPM can deliver fitted values strictly between zero and one for all observations.
If the LPM is not covered, many students will never be exposed to using econometrics to explain
qualitative outcomes. This would be especially unfortunate for students who might need to read
an article that uses an LPM or who might want to estimate an LPM for a term paper or senior
thesis.
A useful modification of the LPM estimated in equation (7.29) is to drop kidsge6 (since it is not
significant) and then define two dummy variables, one for kidslt6 equal to one and the other for
kidslt6 at least two. These can be included in place of kidslt6 (with no young children being the
base group). This allows a diminishing marginal effect in an LPM. Perhaps surprisingly, the
diminishing effect does not materialize.
SOLUTIONS TO PROBLEMS
7.1 (i) The coefficient on male is 87.75, so a man is estimated to sleep almost one and one-half
hours more per week than a comparable woman. Further, tmale = 87.75/34.33 ≈ 2.56, which is
close to the 1% critical value against a two-sided alternative (about 2.58). Thus, the evidence for
a gender differential is fairly strong.
(ii) The t statistic on totwrk is −.163/.018 ≈ −9.06, which is very statistically significant. The
coefficient implies that one more hour of work (60 minutes) is associated with .163(60) ≈ 9.8
minutes less sleep.
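The t statistics and the unit conversion in parts (i) and (ii) are simple ratios, sketched here with the coefficients and standard errors quoted above. The helper name is illustrative.

```python
def t_stat(coef, se):
    """t statistic for H0: coefficient = 0."""
    return coef / se

t_male = t_stat(87.75, 34.33)
t_totwrk = t_stat(-0.163, 0.018)

# sleep and totwrk are both in minutes per week, so one more hour of work
# (60 minutes) changes predicted sleep by 60 times the coefficient.
minutes_less_sleep = 0.163 * 60

print(round(t_male, 2), round(t_totwrk, 2), round(minutes_less_sleep, 1))
```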
(iii) To obtain Rᵣ², the R-squared from the restricted regression, we need to estimate the
model without age and age². When age and age² are both in the model, age has no effect only if
the parameters on both terms are zero.
(ii) A white child is estimated to weigh about 5.5% more, other factors in the first equation
fixed. Further, twhite ≈ 4.23, which is well above any commonly used critical value. Thus, the
(iii) If the mother has one more year of education, the child’s birth weight is estimated to
be .3% higher. This is not a huge effect, and the t statistic is only one, so it is not statistically
significant.
(iv) The two regressions use different sets of observations. The second regression uses fewer
observations because motheduc or fatheduc are missing for some observations. We would have
to reestimate the first equation (and obtain the R-squared) using the same observations used to
estimate the second equation.
7.3 (i) The t statistic on hsize² is over four in absolute value, so there is very strong evidence that
(ii) This is given by the coefficient on female (since black = 0): nonblack females have SAT
scores about 45 points lower than nonblack males. The t statistic is about –10.51, so the
difference is very statistically significant. (The very large sample size certainly contributes to
the statistical significance.)
(iii) Because female = 0, the coefficient on black implies that a black male has an estimated
SAT score almost 170 points less than a comparable nonblack male. The t statistic is over 13 in
absolute value, so we easily reject the hypothesis that there is no ceteris paribus difference.
(iv) We plug in black = 1, female = 1 for black females and black = 0 and female = 1 for
nonblack females. The difference is therefore –169.81 + 62.31 = −107.50. Because the estimate
depends on two coefficients, we cannot construct a t statistic from the information given. The
easiest approach is to define dummy variables for three of the four race/gender categories and
choose nonblack females as the base group. We can then obtain the t statistic we want as the
coefficient on the black females dummy variable.
(ii) 100 ⋅ [exp(−.283) – 1] ≈ −24.7%, and so the estimate is somewhat smaller in magnitude.
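The difference between the approximate and exact percentage effects of a dummy variable in a log equation can be seen with a two-line calculation. The sketch below uses the coefficient −.283 quoted above; the function name is hypothetical.

```python
import math

def exact_pct_diff(log_coef):
    """Exact percentage difference implied by a dummy coefficient in a log(y) equation."""
    return 100.0 * (math.exp(log_coef) - 1.0)

approx = 100.0 * -0.283         # usual approximation: 100 times the coefficient
exact = exact_pct_diff(-0.283)  # close to the -24.7% reported in the text

print(round(approx, 1), round(exact, 2))
```

With the coefficient rounded to three decimals the exact figure lands within a tenth of a point of the reported value; the remaining gap is presumably rounding in the coefficient.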
(iii) The proportionate difference is .181 − .158 = .023, or about 2.3%. One equation that can
be estimated to obtain the standard error of this difference is
where trans is a dummy variable for the transportation industry. Now, the base group is finance,
and so the coefficient δ1 directly measures the difference between the consumer products and
7.5 (i) Following the hint, ĉolGPA = β̂0 + δ̂0 (1 – noPC) + β̂1 hsGPA + β̂2 ACT = (β̂0 + δ̂0) −
δ̂0 noPC + β̂1 hsGPA + β̂2 ACT. For the specific estimates in equation (7.6), β̂0 = 1.26 and
δ̂0 = .157, so the new intercept is 1.26 + .157 = 1.417. The coefficient on noPC is –.157.
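A brief numerical check makes the point that switching from PC to noPC only relabels the same fit. The sketch uses the intercept and dummy coefficient quoted above and ignores the hsGPA and ACT terms, which are identical in both versions; the function names are made up.

```python
b0_hat, d0_hat = 1.26, 0.157  # intercept and PC coefficient from equation (7.6)

new_intercept = b0_hat + d0_hat  # intercept when noPC replaces PC
coef_noPC = -d0_hat              # coefficient on noPC

def part_with_PC(PC):
    """Intercept-plus-dummy part of the fitted value, PC version."""
    return b0_hat + d0_hat * PC

def part_with_noPC(noPC):
    """Intercept-plus-dummy part of the fitted value, noPC version."""
    return new_intercept + coef_noPC * noPC

# The two parameterizations agree for owners (PC = 1) and nonowners (PC = 0).
gaps = [abs(part_with_PC(pc) - part_with_noPC(1 - pc)) for pc in (0, 1)]
print(round(new_intercept, 3), coef_noPC, max(gaps) < 1e-12)
```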
(ii) Nothing happens to the R-squared. Using noPC in place of PC is simply a different way
of including the same information on PC ownership.
(iii) It makes no sense to include both dummy variables in the regression: we cannot hold
noPC fixed while changing PC. We have only two groups based on PC ownership so, in
addition to the overall intercept, we need only to include one dummy variable. If we try to
include both along with an intercept we have perfect multicollinearity (the dummy variable trap).
7.6 In Section 3.3 – in particular, in the discussion surrounding Table 3.2 – we discussed how to
determine the direction of bias in the OLS estimators when an important variable (ability, in this
case) has been omitted from the regression. As we discussed there, Table 3.2 only strictly holds
with a single explanatory variable included in the regression, but we often ignore the presence of
other independent variables and use this table as a rough guide. (Or, we can use the results of
Problem 3.10 for a more precise analysis.) If less able workers are more likely to receive
training, then train and u are negatively correlated. If we ignore the presence of educ and exper,
or at least assume that train and u are negatively correlated after netting out educ and exper, then
we can use Table 3.2: the OLS estimator of β1 (with ability in the error term) has a downward
bias. Because we think β1 ≥ 0, we are less likely to conclude that the training program was
effective. Intuitively, this makes sense: if those chosen for training had not received training,
they would have lower wages, on average, than the control group.
− β6 kidslt6 − β7 kidsge6 − u,
The new error term, −u, has the same properties as u. From this we see that if we regress outlf on
all of the independent variables in (7.29), the new intercept is 1 − .586 = .414 and each slope
coefficient takes on the opposite sign from when inlf is the dependent variable. For example, the
new coefficient on educ is −.038 while the new coefficient on kidslt6 is .262.
(ii) The standard errors will not change. In the case of the slopes, changing the signs of the
estimators does not change their variances, and therefore the standard errors are unchanged (but
the t statistics change sign). Also, Var(1 − β̂0) = Var(β̂0), so the standard error of the intercept
(iii) We know that changing the units of measurement of independent variables, or entering
qualitative information using different sets of dummy variables, does not change the R-squared.
But here we are changing the dependent variable. Nevertheless, the R-squareds from the
regressions are still the same. To see this, part (i) suggests that the squared residuals will be
identical in the two regressions. For each i the error in the equation for outlfi is just the negative
of the error in the other equation for inlfi, and the same is true of the residuals. Therefore, the
SSRs are the same. Further, in this case, the total sums of squares are the same. For outlf we
have
which is the SST for inlf. Because R² = 1 – SSR/SST, the R-squared is the same in the two
regressions.
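The claims in parts (i) through (iii) can be illustrated with a tiny simulation. The sketch below fits a one-regressor linear probability model by hand to made-up data (eight hypothetical observations, not the data behind equation (7.29)) and compares the inlf and outlf = 1 − inlf regressions.

```python
def simple_ols(x, y):
    """OLS of y on a constant and x; returns intercept, slope, SSR, SST."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return b0, b1, ssr, sst

# Hypothetical data: years of education and labor force participation.
educ = [8, 10, 12, 12, 14, 16, 16, 18]
inlf = [0, 0, 1, 0, 1, 1, 0, 1]
outlf = [1 - v for v in inlf]

b0_in, b1_in, ssr_in, sst_in = simple_ols(educ, inlf)
b0_out, b1_out, ssr_out, sst_out = simple_ols(educ, outlf)

r2_in = 1 - ssr_in / sst_in
r2_out = 1 - ssr_out / sst_out
print(round(b0_in, 3), round(b0_out, 3), round(b1_in, 3), round(b1_out, 3))
print(round(r2_in, 3), round(r2_out, 3))
```

The intercepts sum to one, the slopes are negatives of each other, and the SSR, SST, and R-squared coincide, exactly as the algebra says.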
7.8 (i) We want to have a constant semi-elasticity model, so a standard wage equation with
marijuana usage included would be
log(wage) = β0 + β1 usage + β2 educ + β3 exper + β4 exper² + β5 female + u.
Then 100⋅ β1 is the approximate percentage change in wage when marijuana usage increases by
one time per month.
The null hypothesis that the effect of marijuana usage does not differ by gender is H0: β 6 = 0.
(iii) We take the base group to be nonuser. Then we need dummy variables for the other
three groups: lghtuser, moduser, and hvyuser. Assuming no interactive effect with gender, the
model would be
(v) The error term could contain factors, such as family background (including parental
history of drug abuse) that could directly affect wages and also be correlated with marijuana
usage. We are interested in the effects of a person’s drug usage on his or her wage, so we would
like to hold other confounding factors fixed. We could try to collect data on relevant background
information.
ĉolGPA = 1.26 + .152 PC + .450 hsGPA + .0077 ACT − .0038 mothcoll
(0.34) (.059) (.094) (.0107) (.0603)
+ .0418 fathcoll
(.0613)
n = 141, R² = .222.
The estimated effect of PC is hardly changed from equation (7.6), and it is still very significant,
with tpc ≈ 2.58.
(ii) The F test for joint significance of mothcoll and fathcoll, with 2 and 135 df, is about .24
with p-value ≈ .78; these variables are jointly very insignificant. It is not surprising the
estimates on the other coefficients do not change much when mothcoll and fathcoll are added to
the regression.
(iii) When hsGPA² is added to the regression, its coefficient is about .337 and its t statistic is
about 1.56. (The coefficient on hsGPA is about –1.803.) This is a borderline case. The
quadratic in hsGPA has a U-shape, and it only turns up at about hsGPA* = 2.68, which is hard to
interpret. The coefficient of main interest, on PC, falls to about .140 but is still significant.
Adding hsGPA² is a simple robustness check of the main finding.
The coefficient on black implies that, at given levels of the other explanatory variables, black
men earn about 18.8% less than nonblack men. The t statistic is about –4.95, and so it is very
statistically significant.
(ii) The F statistic for joint significance of exper² and tenure², with 2 and 925 df, is about
1.49 with p-value ≈ .226. Because the p-value is above .20, these quadratics are jointly
insignificant at the 20% level.
(iii) We add the interaction black ⋅ educ to the equation in part (i). The coefficient on the
interaction is about −.0226 (se ≈ .0202). Therefore, the point estimate is that the return to
another year of education is about 2.3 percentage points lower for black men than nonblack men.
(The estimated return for nonblack men is about 6.7%.) This is nontrivial if it really reflects
differences in the population. But the t statistic is only about 1.12 in absolute value, which is not
enough to reject the null hypothesis that the return to education does not depend on race.
(iv) We choose the base group to be single, nonblack. Then we add dummy variables
marrnonblck, singblck, and marrblck for the other three groups. The result is
log(ŵage) = 5.40 + .0655 educ + .0141 exper + .0117 tenure
(0.11) (.0063) (.0032) (.0025)
− .092 south + .184 urban + .189 marrnonblck
(.026) (.027) (.043)
− .241 singblck + .0094 marrblck
(.096) (.0560)
n = 935, R² = .253.
We obtain the ceteris paribus differential between married blacks and married nonblacks by
taking the difference of their coefficients: .0094 − .189 = −.1796, or about −.18. That is, a
married black man earns about 18% less than a comparable, married nonblack man.
7.11 (i) H0: β13 = 0. Using the data in MLB1.RAW gives β̂13 ≈ .254, se(β̂13) ≈ .131. The t
statistic is about 1.94, which gives a p-value against a two-sided alternative of just over .05.
Therefore, we would reject H0 at just about the 5% significance level. Controlling for the
(ii) This is a joint null, H0: β9 = 0, β10 = 0, …, β13 = 0. The F statistic, with 5 and 339 df,
is about 1.78, and its p-value is about .117. Thus, we cannot reject H0 at the 10% level.
(iii) Parts (i) and (ii) are roughly consistent. The evidence against the joint null in part (ii) is
weaker because we are testing, along with the marginally significant catcher, several other
insignificant variables (especially thrdbase and shrtstop, which have absolute t statistics well
below one).
7.12 (i) The two signs that are pretty clear are β3 < 0 (because hsperc is defined so that the
smaller the number the better the student) and β4 > 0. The effect of size of graduating class is
not clear. It is also unclear whether males and females have systematically different GPAs. We
may think that β6 < 0, that is, athletes do worse than other students with comparable
characteristics. But remember, we are controlling for ability to some degree with hsperc and sat.
n = 4,137, R² = .293.
Holding other factors fixed, an athlete is predicted to have a GPA about .169 points higher than a
nonathlete. The t statistic is .169/.042 ≈ 4.02, which is very significant.
(iii) With sat dropped from the model, the coefficient on athlete becomes about .0054
(se ≈ .0448), which is practically and statistically not different from zero. This happens because
we do not control for SAT scores, and athletes score lower on average than nonathletes. Part (ii)
shows that, once we account for SAT differences, athletes do better than nonathletes. Even if we
do not control for SAT score, there is no difference.
(iv) To facilitate testing the hypothesis that there is no difference between women athletes
and women nonathletes, we should choose one of these as the base group. We choose female
nonathletes. The estimated equation is
The coefficient on femath = female ⋅ athlete shows that colgpa is predicted to be about .175 points
higher for a female athlete than a female nonathlete, other variables in the equation fixed.
(v) Whether we add the interaction female ⋅ sat to the equation in part (ii) or part (iv), the
outcome is practically the same. For example, when female ⋅ sat is added to the equation in part
(ii), its coefficient is about .000051 and its t statistic is about .40. There is very little evidence
that the effect of sat differs by gender.
ŝleep = 3,648.2 − .182 totwrk − 13.05 educ + 7.16 age − .0448 age² + 60.38 yngkid
(310.0) (.024) (7.41) (14.32) (.1684) (59.02)
n = 400, R² = .156.
ŝleep = 4,238.7 − .140 totwrk − 10.21 educ − 30.36 age + .368 age² − 118.28 yngkid
(384.9) (.028) (9.59) (18.53) (.223) (93.19)
n = 306, R² = .098.
There are certainly notable differences in the point estimates. For example, having a young child
in the household leads to less sleep for women (about two hours a week) while men are
estimated to sleep about an hour more. The quadratic in age is a hump-shape for men but a U-
shape for women. The intercepts for men and women are also notably different.
(ii) The F statistic (with 6 and 694 df) is about 2.12 with p-value ≈ .05, and so we reject the
null that the sleep equations are the same at the 5% level.
(iii) If we leave the coefficient on male unspecified under H0, and test only the five
interaction terms, male ⋅ totwrk, male ⋅ educ, male ⋅ age, male ⋅ age², and male ⋅ yngkid, the F
statistic (with 5 and 694 df) is about 1.26 and p-value ≈ .28.
(iv) The outcome of the test in part (iii) shows that, once an intercept difference is allowed,
there is not strong evidence of slope differences between men and women. This is one of those
cases where the practically important differences in estimates for women and men in part (i) do
not translate into statistically significant differences. We apparently need a larger sample size to
determine whether there are differences in slopes. For the purposes of studying the sleep-work
tradeoff, the original model with male added as an explanatory variable seems sufficient.
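For students who want to see where the degrees of freedom and the F statistics in parts (ii) and (iii) come from, here is a sketch of the Chow-type calculation. The sample sizes are the ones reported above; the sums of squared residuals are invented placeholders, since the text does not report them, so the printed F values are not the ones from the sleep data.

```python
def f_statistic(ssr_restricted, ssr_unrestricted, q, df_unrestricted):
    """F = [(SSR_r - SSR_ur)/q] / [SSR_ur/df_ur]."""
    return ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / df_unrestricted)

n_men, n_women = 400, 306
k_per_group = 6  # intercept plus five slopes in each separate equation
df_ur = n_men + n_women - 2 * k_per_group

# Hypothetical SSRs (placeholders only, not from the text).
ssr_men, ssr_women = 600.0, 402.0
ssr_ur = ssr_men + ssr_women  # unrestricted SSR: sum over the separate regressions
ssr_pooled = 1020.0           # restricted model for part (ii): one equation for everyone
ssr_pooled_male = 1011.0      # restricted model for part (iii): pooled plus a male dummy

f_all = f_statistic(ssr_pooled, ssr_ur, q=6, df_unrestricted=df_ur)
f_slopes = f_statistic(ssr_pooled_male, ssr_ur, q=5, df_unrestricted=df_ur)
print(df_ur, round(f_all, 2), round(f_slopes, 2))
```

The only change between the two tests is the restricted model: allowing an intercept difference uses up one restriction, leaving five.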
7.15 (i) When educ = 12.5, the approximate proportionate difference in estimated wage between
women and men is −.227 − .0056(12.5) = −.297. When educ = 0, the difference is −.227. So the
+ other factors
≡ β0 + θ0 female + β1 educ + δ1 female ⋅ (educ – 12.5) + other factors,
where θ0 ≡ δ0 + 12.5 δ1 is the gender differential at 12.5 years of education. When we run this
regression we obtain about –.294 as the coefficient on female (which differs from –.297 due to
rounding error). Its standard error is about .036.
(iii) The t statistic on female from part (ii) is about –8.17, which is very significant. This is
because we are estimating the gender differential at a reasonable number of years of education,
12.5, which is close to the average. In equation (7.18), the coefficient on female is the gender
differential when educ = 0. There are no people of either gender with close to zero years of
education, and so we cannot hope, nor do we want, to estimate the gender differential at
educ = 0.
7.16 (i) If the appropriate factors have been controlled for, β1 > 0 signals discrimination against
minorities: a white person has a greater chance of having a loan approved, other relevant factors
fixed.
âpprove = .708 + .201 white
(.018) (.020)
n = 1,989, R² = .049.
The coefficient on white means that, in the sample of 1,989 loan applications, an application
submitted by a white applicant was 20.1 percentage points more likely to be approved than that
of a nonwhite applicant. This is a practically large difference and the t statistic is about 10. (We have a large
sample size, so standard errors are pretty small.)
(iii) When we add the other explanatory variables as controls, we obtain β̂1 ≈ .129,
se(β̂1) ≈ .020. The coefficient has fallen by some margin because we are now controlling for
factors that should affect loan approval rates, and some of these clearly differ by race. (On
average, white people have financial characteristics – such as higher incomes and stronger credit
(iv) When we add the interaction white ⋅ obrat to the regression, its coefficient and t statistic
are about .0081 and 3.53, respectively. Therefore, there is an interactive effect: a white
applicant is penalized less than a nonwhite applicant for having other obligations as a larger
percent of income.
(v) The trick should be familiar by now. Replace white ⋅ obrat with white ⋅ (obrat – 32); the
coefficient on white is now the race differential when obrat = 32. We obtain about .113 and
se ≈ .020. So the 95% confidence interval is about .113 ± 1.96(.020) or about .074 to .152.
Clearly, this interval excludes zero, so at the average obrat there is evidence of discrimination
(or, at least loan approval rates that differ by race for some other reason that is not captured by
the control variables).
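The confidence interval in part (v) follows directly from the reported estimate and standard error, as this short sketch shows. The helper name is illustrative, and 1.96 is the usual large-sample critical value.

```python
def conf_int(estimate, se, crit=1.96):
    """Two-sided confidence interval: estimate ± crit * se."""
    return estimate - crit * se, estimate + crit * se

low, high = conf_int(0.113, 0.020)  # race differential at obrat = 32
excludes_zero = low > 0 or high < 0

print(round(low, 3), round(high, 3), excludes_zero)
```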
7.17 (i) About .392, or 39.2%.
ê401k = −.506 + .0124 inc − .000062 inc² + .0265 age − .00031 age² − .0035 male
(.081) (.0006) (.000005) (.0039) (.00005) (.0121)
n = 9,275, R² = .094.
(iii) 401(k) eligibility clearly depends on income and age in part (ii). Each of the four terms
involving inc and age has a very significant t statistic. On the other hand, once income and age
are controlled for, there seems to be no difference in eligibility by gender. The coefficient on
male is very small – at given income and age, males are estimated to have a .0035 probability
less of being 401(k) eligible – and it has a very small t statistic.
(iv) Perhaps surprisingly, out of 9,275 fitted values, none is outside the interval [0,1]. The
smallest fitted value is about .030 and the largest is about .697. This means one theoretical
problem with the LPM – the possibility of generating silly probability estimates – does not occur
in this application.
ê401k = −.502 + .0123 inc − .000061 inc² + .0265 age − .00031 age²
(.081) (.0006) (.000005) (.0039) (.00005)
− .0038 male + .0198 pira
(.0121) (.0122)
n = 9,275, R² = .095.
The coefficient on pira means that, other things equal, IRA ownership is associated with about
a .02 higher probability of being eligible for a 401(k) plan. However, the t statistic is only about
1.62, which gives a two-sided p-value = .105. So pira is not significant at the 10% level against
a two-sided alternative.
p̂oints = 4.76 + 1.28 exper − .072 exper² + 2.31 guard + 1.54 forward
(1.18) (.33) (.024) (1.00) (1.00)
(ii) Including all three position dummy variables would be redundant, and result in the
dummy variable trap. Each player falls into one of the three categories, and the overall intercept
is the intercept for centers.
(iii) A guard is estimated to score about 2.3 points more per game, holding experience fixed.
The t statistic is 2.31, so the difference is statistically different from zero at the 5% level, against
a two-sided alternative.
(iv) When marr is added to the regression, its coefficient is about .584 (se = .740). Therefore,
a married player is estimated to score just over half a point more per game (experience and
position held fixed), but the estimate is not statistically different from zero (p-value = .43). So,
based on points per game, we cannot conclude married players are more productive.
(v) Adding the terms marr ⋅ exper and marr ⋅ exper² leads to complicated signs on the three
terms involving marr. The F test for their joint significance, with 3 and 261 df, gives F = 1.44
and p-value = .23. Therefore, there is not very strong evidence that marital status has any partial
effect on points scored.
(vi) If in the regression from part (iv) we use assists as the dependent variable, the coefficient
on marr becomes .322 (se = .222). Therefore, holding experience and position fixed, a married
man has almost one-third of an assist more per game. The p-value against a two-sided alternative is
about .15, which is stronger, but not overwhelming, evidence that married men are more
productive when it comes to assists.
7.19 (i) The average is 19.072, the standard deviation is 63.964, the smallest value is –502.302,
and the largest value is 1,536.798. Remember, these are in thousands of dollars.
(ii) This can be easily done by regressing nettfa on e401k and doing a t test on βˆe401k ; the
estimate is the average difference in nettfa for those eligible for a 401(k) and those not eligible.
Using the 9,275 observations gives βˆe401k = 18.858 and te401k = 14.01. Therefore, we strongly
reject the null hypothesis that there is no difference in the averages. The coefficient implies that,
on average, a family eligible for a 401(k) plan has $18,858 more in net total financial assets.
\( \widehat{nettfa} = 23.09 + 9.705\,e401k - .278\,inc + .0103\,inc^2 - 1.972\,age + .0348\,age^2 \)
(9.96) (1.277) (.075) (.0006) (.483) (.0055)
n = 9,275, R2 = .202
Now, holding income and age fixed, a 401(k)-eligible family is estimated to have $9,705 more in
wealth than a non-eligible family. This is just more than half of what is obtained by simply
comparing averages.
(iv) Only the interaction e401k⋅(age − 41) is significant. Its coefficient is .654 (t = 4.98). It
shows that the effect of 401(k) eligibility on financial wealth increases with age. Another way to
think about it is that age has a stronger positive effect on nettfa for those with 401(k) eligibility.
The coefficient on e401k⋅(age − 41)² is −.0038 (t statistic = −.33), so we could drop this term.
(v) The effect of e401k in part (iii) is the same for all ages, 9.705. In the regression in part
(iv), the coefficient on e401k is about 9.960, which is the effect at the average age,
age = 41. Including the interactions increases the estimated effect of e401k, but only by $255. If
we evaluate the effect in part (iv) at a wide range of ages, we would see more dramatic
differences.
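To see how much the part (iv) effect varies with age, the quoted coefficients can be evaluated at a few ages. This is a rough sketch, assuming the part (iv) model contains only the two interactions described above, e401k⋅(age − 41) and e401k⋅(age − 41)²; the function name and the ages chosen are illustrative.

```python
def e401k_effect(age):
    """Estimated effect of e401k on nettfa (in $1,000s) at a given age,
    using the part (iv) coefficients quoted in the text."""
    centered = age - 41  # age is centered at its sample average, 41
    return 9.960 + 0.654 * centered - 0.0038 * centered ** 2


for age in (25, 41, 60):
    print(age, round(e401k_effect(age), 3))
```

At age 41 the function returns the coefficient on e401k itself; away from 41 the linear interaction term dominates the change.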
\( \widehat{nettfa} = 16.34 + 9.455\,e401k - .240\,inc + .0100\,inc^2 - 1.495\,age + .0290\,age^2 \)
(10.12) (1.278) (.075) (.0006) (.483) (.0055)
The F statistic for joint significance of the four family size dummies is about 5.44. With 4 and
9,265 df, this gives p-value = .0002. So the family size dummies are jointly significant.
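(vii) The SSR for the restricted model is from part (vi): SSRr = 30,215,207.5. The SSR for
the unrestricted model is SSRur = 29,985,400, so the F statistic is F = [(30,215,207.5 −
29,985,400)/29,985,400]*(9245/20) ≈ 3.54. With 20 and 9,245 df, the p-value is essentially zero. In this case,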
there is strong evidence that the slopes change across family size. Allowing for intercept
changes alone is not sufficient. (If you look at the individual regressions, you will see that the
signs on the income variables actually change across family size.)
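The F statistic in part (vii) follows from the usual SSR form of the test. A minimal sketch using the two SSR figures and the degrees of freedom quoted in part (vii); the function name is ours.

```python
def f_from_ssr(ssr_r, ssr_ur, q, df_ur):
    """SSR form of the F statistic: q restrictions, df_ur unrestricted df."""
    return ((ssr_r - ssr_ur) / ssr_ur) * (df_ur / q)


f_stat = f_from_ssr(30_215_207.5, 29_985_400, q=20, df_ur=9_245)
print(round(f_stat, 2))
```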
CHAPTER 8
TEACHING NOTES
This is a good place to remind students that homoskedasticity played no role in showing that
OLS is unbiased for the parameters in the regression equation. In addition, you should probably
mention that there is nothing wrong with the R-squared or adjusted R-squared as goodness-of-fit
measures. The key is that these are estimates of the population R-squared, 1 – [Var(u)/Var(y)],
where the variances are the unconditional variances in the population. The usual R-squared, and
the adjusted version, consistently estimate the population R-squared whether or not Var(u|x) =
Var(y|x) depends on x. Of course, heteroskedasticity causes the usual standard errors, t statistics,
and F statistics to be invalid, even in large samples, with or without normality.
As I mention in the text, other traditional tests for heteroskedasticity, such as the Park and
Glejser tests, do not directly test what we want, or are too restrictive. The Goldfeld-Quandt test
only works when there is a natural way to order the data based on one independent variable.
This is rare in practice, especially for cross-sectional applications.
Some argue that weighted least squares is a relic, and is no longer necessary given the
availability of heteroskedasticity-robust standard errors and test statistics. While I am somewhat
sympathetic to this argument, it presumes that we do not care much about efficiency. Even in
large samples, the OLS estimates may not be precise enough to learn much about the population
parameters. With substantial heteroskedasticity, we might do better with weighted least squares,
even if the weighting function is misspecified. As mentioned in Question 8.4 on page 280, one
can (and perhaps should) compute robust standard errors after weighted least squares. These
would be directly comparable to the heteroskedasticity-robust standard errors for OLS.
Weighted least squares estimation of the LPM is a nice example of feasible GLS, at least when
all fitted values are in the unit interval. Interestingly, in the LPM examples and exercises, the
heteroskedasticity-robust standard errors often differ by only small amounts from the usual
standard errors. However, in a couple of cases the differences are notable, as in Computer
Exercise 8.12.
SOLUTIONS TO PROBLEMS
8.1 Parts (ii) and (iii). The homoskedasticity assumption played no role in Chapter 5 in showing
that OLS is consistent. But we know that heteroskedasticity causes statistical inference based on
the usual t and F statistics to be invalid, even in large samples. As heteroskedasticity is a
violation of the Gauss-Markov assumptions, OLS is no longer BLUE.
8.2 With Var(u|inc,price,educ,female) = σ²inc², h(x) = inc², where h(x) is the heteroskedasticity
function defined in equation (8.21). Therefore, √h(x) = inc, and so the transformed equation is
obtained by dividing the original equation by inc:
Notice that β1 , which is the slope on inc in the original model, is now a constant in the
transformed equation. This is simply a consequence of the form of the heteroskedasticity and the
functional forms of the explanatory variables in the original equation.
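As a sketch of the transformation, write the dependent variable generically as y (its name is not shown in this solution, so y is a placeholder) and number the coefficients in the order the variables appear in the variance function, with β1 on inc as stated above. Dividing through by inc would then give:

```latex
\frac{y}{inc} = \beta_0 \frac{1}{inc} + \beta_1 + \beta_2 \frac{price}{inc}
  + \beta_3 \frac{educ}{inc} + \beta_4 \frac{female}{inc} + \frac{u}{inc},
\qquad
\mathrm{Var}\!\left(\frac{u}{inc} \,\Big|\, \mathbf{x}\right)
  = \frac{\sigma^2\, inc^2}{inc^2} = \sigma^2 .
```

The second expression shows why the transformed error is homoskedastic under the assumed variance function.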
8.3 False. The unbiasedness of WLS and OLS hinges crucially on Assumption MLR.4, and, as
we know from Chapter 4, this assumption is often violated when an important variable is omitted.
When MLR.4 does not hold, both WLS and OLS are biased. Without specific information on
how the omitted variable is correlated with the included explanatory variables, it is not possible
to determine which estimator has a smaller bias. It is possible that WLS would have more bias
than OLS or less bias.
8.4 (i) These variables have the anticipated signs. If a student takes courses where grades are, on
average, higher – as reflected by higher crsgpa – then his/her grades will be higher. The better
the student has been in the past – as measured by cumgpa, the better the student does (on average)
in the current semester. Finally, tothrs is a measure of experience, and its coefficient indicates
an increasing return to experience.
The t statistic for crsgpa is very large, over five using the usual standard error (which is the
larger of the two). Using the robust standard error for cumgpa, its t statistic is about 2.61, which
is also significant at the 5% level. The t statistic for tothrs is only about 1.17 using either
standard error, so it is not significant at the 5% level.
(ii) This is easiest to see without other explanatory variables in the model. If crsgpa were the
only explanatory variable, H0: β crsgpa = 1 means that, without any information about the student,
the best predictor of term GPA is the average GPA in the students’ courses; this holds essentially
by definition. (The intercept would be zero in this case.) With additional explanatory variables
it is not necessarily true that β crsgpa = 1 because crsgpa could be correlated with characteristics of
the student. (For example, perhaps the courses students take are influenced by ability – as
measured by test scores – and past college performance.) But it is still interesting to test this
hypothesis.
The t statistic using the usual standard error is t = (.900 – 1)/.175 ≈ −.57; using the
heteroskedasticity-robust standard error gives t ≈ −.60. In either case we fail to reject H0: β crsgpa = 1 at
any reasonable significance level, certainly including 5%.
(iii) The in-season effect is given by the coefficient on season, which implies that, other
things equal, an athlete’s GPA is about .16 points lower when his/her sport is competing. The t
statistic using the usual standard error is about –1.60, while that using the robust standard error is
about –1.96. Against a two-sided alternative, the t statistic using the robust standard error is just
significant at the 5% level (the standard normal critical value is 1.96), while using the usual
standard error, the t statistic is not quite significant at the 10% level (cv ≈ 1.65). So the standard
error used makes a difference in this case. This example is somewhat unusual, as the robust
standard error is more often the larger of the two.
8.5 (i) No. For each coefficient, the usual standard errors and the heteroskedasticity-robust ones
are practically very similar.
(ii) The effect is −.029(4) = −.116, so the probability of smoking falls by about .116.
(iii) As usual, we compute the turning point in the quadratic: .020/[2(.00026)] ≈ 38.46, so
about 38 and one-half years.
(iv) Holding other factors in the equation fixed, a person in a state with restaurant smoking
restrictions has a .101 lower chance of smoking. This is similar to the effect of having four more
years of education.
(v) We just plug the values of the independent variables into the OLS regression line:
\( \widehat{smokes} = .656 - .069 \cdot \log(67.44) + .012 \cdot \log(6{,}500) - .029(16) + .020(77) - .00026(77^2) \approx .0052. \)
Thus, the estimated probability of smoking for this person is close to zero. (In fact, this person is
not a smoker, so the equation predicts well for this particular observation.)
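The arithmetic in parts (iii) and (v) is easy to reproduce. The sketch below uses only the coefficients and values that appear in the prediction above; the variable names are ours.

```python
import math

# Coefficients on age and age squared, as they appear in the part (v) prediction.
b_age, b_agesq = 0.020, -0.00026

# Part (iii): the quadratic in age turns where b_age + 2 * b_agesq * age = 0.
turning_point = b_age / (2 * abs(b_agesq))

# Part (v): plug the person's characteristics into the OLS regression line.
p_smokes = (0.656
            - 0.069 * math.log(67.44)
            + 0.012 * math.log(6500)
            - 0.029 * 16
            + b_age * 77
            + b_agesq * 77 ** 2)

print(round(turning_point, 2), round(p_smokes, 4))
```

Note that nothing in the linear probability model forces this fitted probability to stay between zero and one; here it happens to land just above zero.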
the assumption that the variance of u given all explanatory variables depends only on gender is
Var(u | male) = δ0 + δ1 male.
Then the variance for women is simply δ0 and that for men is δ0 + δ1; the difference in
variances is δ1.
(ii) After estimating the above equation by OLS, we regress \( \hat{u}_i^2 \) on \( male_i \), i = 1, 2, …, 706
(including, of course, an intercept). We can write the results as
Because the coefficient on male is negative, the estimated variance is higher for women.
(iii) No. The t statistic on male is only about –1.06, which is not significant at even the 20%
level against a two-sided alternative.
8.7 (i) The estimated equation with both sets of standard errors (heteroskedasticity-robust
standard errors in brackets) is
\( \widehat{price} = -21.77 + .00207\,lotsize + .123\,sqrft + 13.85\,bdrms \)
(29.48) (.00064) (.013) (9.01)
[36.28] [.00122] [.017] [8.28]
n = 88, R2 = .672.
The robust standard error on lotsize is almost twice as large as the usual standard error, making
lotsize much less significant (the t statistic falls from about 3.23 to about 1.70). The t statistic on
sqrft also falls, but it is still very significant. The variable bdrms actually becomes somewhat
more significant, but it is still barely significant. The most important change is in the
significance of lotsize.
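The t statistics discussed in part (i) can be tabulated directly from the reported coefficients and the two sets of standard errors. A small sketch; the dictionary layout is just one convenient way to organize the numbers.

```python
# (coefficient, usual se, robust se) from the price equation in part (i).
estimates = {
    "lotsize": (0.00207, 0.00064, 0.00122),
    "sqrft": (0.123, 0.013, 0.017),
    "bdrms": (13.85, 9.01, 8.28),
}

# For each variable: (t with usual se, t with robust se).
t_stats = {
    name: (coef / se_usual, coef / se_robust)
    for name, (coef, se_usual, se_robust) in estimates.items()
}

for name, (t_usual, t_robust) in t_stats.items():
    print(f"{name:8s} usual t = {t_usual:5.2f}   robust t = {t_robust:5.2f}")
```

Only bdrms has a larger t statistic under the robust standard error, which matches the discussion above.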
Here, the heteroskedasticity-robust standard error is always slightly greater than the
corresponding usual standard error, but the differences are relatively small. In particular,
log(lotsize) and log(sqrft) still have very large t statistics, and the t statistic on bdrms is not
significant at the 5% level against a one-sided alternative using either standard error.
(iii) As we discussed in Section 6.2, using the logarithmic transformation of the dependent
variable often mitigates, if not entirely eliminates, heteroskedasticity. This is certainly the case
here, as no important conclusions in the model for log(price) depend on the choice of standard
error. (We have also transformed two of the independent variables to make the model of the
constant elasticity variety in lotsize and sqrft.)
8.8 After estimating equation (8.18), we obtain the squared OLS residuals û 2 . The full-blown
White test is based on the R-squared from the auxiliary regression (with an intercept),
û² on llotsize, lsqrft, bdrms, llotsize², lsqrft², bdrms², llotsize⋅lsqrft, llotsize⋅bdrms, and lsqrft⋅bdrms,
where “l ” in front of lotsize and sqrft denotes the natural log. [See equation (8.19).] With 88
observations the n-R-squared version of the White statistic is 88(.109) ≈ 9.59, and this is the
outcome of an (approximately) χ² random variable with 9 df. The p-value is about .385, which provides
little evidence against the homoskedasticity assumption.
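The LM statistic and its p-value can be reproduced without a statistics package. The sketch below assumes 9 degrees of freedom, as in the solution, and uses the finite closed form for the chi-square upper tail that holds when the degrees of freedom are odd; the helper name is ours.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square with odd df.

    For df = 2m + 1 the tail has the closed form
    erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum_{r=1..m} x**(r - 1/2) / (1*3*...*(2r-1)),
    where phi is the standard normal density.
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("this helper only handles odd df")
    z = math.sqrt(x)
    phi = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    total, term = 0.0, z  # the r = 1 term is x**(1/2)
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)  # move to the next term in the sum
    return math.erfc(z / math.sqrt(2)) + 2 * phi * total


lm = 88 * 0.109          # n times the R-squared from the auxiliary regression
p_value = chi2_sf_odd_df(lm, 9)
print(round(lm, 2), round(p_value, 3))
```

In practice one would normally call a library routine for the chi-square tail; the closed form is used here only to keep the example self-contained.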
voteA = 37.66 + .252 prtystrA + 3.793 democA + 5.779 log(expendA)
(4.74) (.071) (1.407) (0.392)
− 6.238 log(expendB) + û
(0.397)
squared of zero, although it might not be exactly zero in your computer output due to rounding
error. Remember, this is how OLS works: the estimates βˆ j are chosen to make the residuals be
uncorrelated in the sample with each independent variable (as well as have zero sample average).
(ii) The B-P test entails regressing the \( \hat{u}_i^2 \) on the independent variables in part (i). The F
statistic for joint significance (with 4 and 168 df) is about 2.33 with p-value ≈ .058. Therefore,
there is some evidence of heteroskedasticity, but not quite at the 5% level.
(iii) Now we regress \( \hat{u}_i^2 \) on \( \widehat{voteA}_i \) and \( (\widehat{voteA}_i)^2 \), where the \( \widehat{voteA}_i \) are the OLS fitted values
from part (i). The F test, with 2 and 170 df, is about 2.79 with p-value ≈ .065. This is slightly
less evidence of heteroskedasticity than provided by the B-P test, but the conclusion is very
similar.
8.10 (i) By regressing sprdcvr on an intercept only we obtain μ̂ ≈ .515 (se ≈ .021). The
asymptotic t statistic for H0: µ = .5 is (.515 − .5)/.021 ≈ .71, which is not significant at the 10%
level, or even the 20% level.
(iii) The estimated LPM is
\( \widehat{sprdcvr} = .490 + .035\,favhome + .118\,neutral - .023\,fav25 + .018\,und25 \)
(.045) (.050) (.095) (.050) (.092)
n = 553, R2 = .0034.
The variable neutral has by far the largest effect – if the game is played on a neutral court, the
probability that the spread is covered is estimated to be about .12 higher – and, except for the
intercept, its t statistic is the only t statistic greater than one in absolute value (about 1.24).
(iv) Under H0: β1 = β 2 = β 3 = β 4 = 0, the response probability does not depend on any
explanatory variables, which means neither the mean nor the variance depends on the
explanatory variables. [See equation (8.38).]
(v) The F statistic for joint significance, with 4 and 548 df, is about .47 with p-value ≈ .76.
There is essentially no evidence against H0.
(vi) Based on these variables, it is not possible to predict whether the spread will be covered.
The explanatory power is very low, and the explanatory variables are jointly very insignificant.
The coefficient on neutral may indicate something is going on with games played on a neutral
court, but we would not want to bet money on it unless it could be confirmed with a separate,
larger sample.
8.11 (i) The estimates are given in equation (7.31). Rounded to four decimal places, the smallest
fitted value is .0066 and the largest fitted value is .5577.
(ii) The estimated heteroskedasticity function for each observation i is \( \hat{h}_i = \widehat{arr86}_i (1 - \widehat{arr86}_i) \),
which is strictly between zero and one because \( 0 < \widehat{arr86}_i < 1 \) for all i. The weights for WLS are
\( 1/\hat{h}_i \). To show the WLS estimate of each parameter, we report the WLS results using the same
The coefficients on the significant explanatory variables are very similar to the OLS estimates.
The WLS standard errors on the slope coefficients are generally lower than the nonrobust OLS
standard errors. A proper comparison would be with the robust OLS standard errors.
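The weights described in part (ii) are simple to construct once the OLS fitted values are in hand. The following is a sketch under the assumption that the fitted values are available as a plain list; the function name is hypothetical, and the two inputs are the smallest and largest fitted values reported in part (i).

```python
def lpm_wls_weights(fitted):
    """Return WLS weights 1/h_i with h_i = p_i * (1 - p_i) for LPM fitted values p_i.

    Raises ValueError if any fitted value lies outside (0, 1), because h_i would
    then be zero or negative and could not serve as an estimated variance.
    """
    weights = []
    for p in fitted:
        if not 0 < p < 1:
            raise ValueError(f"fitted value {p} is outside the unit interval")
        weights.append(1 / (p * (1 - p)))
    return weights


# Smallest and largest fitted values from part (i).
w_smallest_fit, w_largest_fit = lpm_wls_weights([0.0066, 0.5577])
print(round(w_smallest_fit, 2), round(w_largest_fit, 3))
```

Observations with fitted probabilities near zero receive very large weights, which is why WLS for the LPM can only be applied directly when every fitted value is strictly inside the unit interval.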
(iii) After WLS estimation, the F statistic for joint significance of avgsen and tottime, with 2
and 2,719 df, is about .88 with p-value ≈ .41. They are not close to being jointly significant at
the 5% level. If your econometrics package has a command for WLS and a test command for
joint hypotheses, the F statistic and p-value are easy to obtain. Alternatively, you can obtain the
restricted R-squared using the same weights as in part (ii) and dropping avgsen and tottime from
the WLS estimation. (The unrestricted R-squared is .0744.)
8.12 (i) The heteroskedasticity-robust standard error for βˆwhite ≈ .129 is about .026, which is
notably higher than the nonrobust standard error (about .020). The heteroskedasticity-robust
95% confidence interval is about .078 to .179, while the nonrobust CI is, of course, narrower,
about .090 to .168. The robust CI still excludes the value zero by some margin.
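Both intervals in part (i) have the form estimate ± 1.96⋅se. A quick sketch using the rounded figures quoted above; because the inputs are rounded, the robust upper endpoint comes out near .180 rather than the .179 reported from the unrounded estimates.

```python
def conf_int(coef, se, crit=1.96):
    """Asymptotic 95% confidence interval: coef +/- crit * se."""
    return coef - crit * se, coef + crit * se


robust_ci = conf_int(0.129, 0.026)   # heteroskedasticity-robust se
usual_ci = conf_int(0.129, 0.020)    # nonrobust se

print([round(v, 3) for v in robust_ci], [round(v, 3) for v in usual_ci])
```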
(ii) There are no fitted values less than zero, but there are 231 greater than one. Unless we
do something to those fitted values, we cannot directly apply WLS, as hˆi will be negative in 231
cases.
colGPA = 1.36 + .412 hsGPA + .013 ACT − .071 skipped + .124 PC
(.33) (.092) (.010) (.026) (.057)
(ii) The F statistic obtained for the White test is about 3.58. With 2 and 138 df, this gives
p-value ≈ .031. So, at the 5% level, we conclude there is evidence of heteroskedasticity in the
errors of the colGPA equation. (As an aside, note that the t statistics for each of the terms are very
small, and we could have simply dropped the quadratic term without losing anything of value.)
(iii) In fact, the smallest fitted value from the regression in part (ii) is about .027, while the
largest is about .165. Using these fitted values as the hˆi in a weighted least squares regression
gives the following:
colGPA = 1.40 + .402 hsGPA + .013 ACT − .076 skipped + .126 PC
(.30) (.083) (.010) (.022) (.056)
There is very little difference in the estimated coefficient on PC, and the OLS t statistic and WLS
t statistic are also very close. Note that we have used the usual OLS standard error, even though
it would be more appropriate to use the heteroskedasticity-robust form (since we have evidence
of heteroskedasticity). The R-squared in the weighted least squares estimation is larger than that
from the OLS regression in part (i), but, remember, these are not comparable.
(iv) With robust standard errors – that is, with standard errors that are robust to misspecifying
the function h(x) – the equation is
colGPA = 1.40 + .402 hsGPA + .013 ACT − .076 skipped + .126 PC
(.31) (.086) (.010) (.021) (.059)
The robust standard errors do not differ by much from those in part (iii); in most cases, they are
slightly higher, but all explanatory variables that were statistically significant before are still
statistically significant. But the confidence interval for βPC is a bit wider.
8.14 (i) I now get R2 = .0527, but the other estimates seem okay.
(ii) One way to ensure that the unweighted residuals are being provided is to compare them
with the OLS residuals. They will not be the same, of course, but they should not be wildly
different.
(iii) The R-squared from the regression of \( \breve{u}_i^2 \) on \( \breve{y}_i \), \( \breve{y}_i^2 \), i = 1,...,807 is about .027. We use this
as \( R^2_{\hat{u}^2} \) in equation (8.15) but with k = 2. This gives F = 11.15, and so the p-value is about zero.
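The F statistic in part (iii) is the overall-significance form based on the R-squared from the auxiliary regression. A minimal sketch, with n = 807 and k = 2 as stated; using the rounded R-squared of .027 reproduces the reported 11.15 only approximately.

```python
def f_from_r2(r2, n, k):
    """F statistic for overall significance of a regression with k regressors."""
    return (r2 / (1 - r2)) * ((n - k - 1) / k)


f_stat = f_from_r2(0.027, n=807, k=2)
print(round(f_stat, 3))
```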
(iv) The substantial heteroskedasticity found in part (iii) shows that the feasible GLS
procedure described on page 279 does not, in fact, eliminate the heteroskedasticity. Therefore,
the usual standard errors, t statistics, and F statistics reported with weighted least squares are not
valid, even asymptotically.
(v) The weighted least squares equation with robust standard errors is
\( \widehat{cigs} = 5.64 + 1.30\,\log(income) - 2.94\,\log(cigpric) - .463\,educ \)
(37.31) (.54) (8.97) (.149)
n = 807, R2 = .1134
The substantial differences in standard errors compared with equation (8.36) are another indication
that our proposed correction for heteroskedasticity did not really do the trick. With the exception
of restaurn, all standard errors got notably bigger; for example, the standard error for log(cigpric)
doubled. All variables that were significant with the nonrobust standard errors remain significant,
but the confidence intervals are much wider in several cases.
[ Instructor’s Note: You can also do this exercise with regression (8.34) used in place of (8.32).
This gives a somewhat larger estimated income effect.]
8.15 (i) In the following equation, estimated by OLS, the usual standard errors are in (⋅) and the
heteroskedasticity-robust standard errors are in [⋅]:
e401k = −.506 + .0124 inc − .000062 inc2 + .0265 age − .00031 age2 − .0035 male
(.081) (.0006) (.000005) (.0039) (.00005) (.0121)
[.079] [.0006] [.000005] [.0038] [.00004] [.0121]
n = 9,275, R2 = .094.
There are no important differences; if anything, the robust standard errors are smaller.
(ii) This is a general claim. Since Var(y|x) = p (x)[1 − p (x)] , we can write
E(u 2 | x) = p(x) − [ p(x)]2 . Written in error form, u 2 = p (x) − [ p (x)]2 + v . In other words, we
can write this as a regression model u 2 = δ 0 + δ1 p(x) + δ 2 [ p(x)]2 + v , with the restrictions δ0 = 0,
δ1 = 1, and δ2 = -1. Remember that, for the LPM, the fitted values, yˆi , are estimates of
\( p(\mathbf{x}_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} \). So, when we run the regression \( \hat{u}_i^2 \) on \( \hat{y}_i \), \( \hat{y}_i^2 \) (including an
intercept), the intercept estimate should be close to zero, the coefficient on \( \hat{y}_i \) should be close to
one, and the coefficient on \( \hat{y}_i^2 \) should be close to –1.
(iii) The White F statistic is about 310.32, which is very significant. The coefficient on
\( \widehat{e401k} \) is about 1.010, the coefficient on \( \widehat{e401k}^2 \) is about −.970, and the intercept is about −.009.
This accords quite well with what we expect to find.
(iv) The smallest fitted value is about .030 and the largest is about .697. The WLS estimates
of the LPM are
e401k = −.488 + .0126 inc − .000062 inc2 + .0255 age − .00030 age2 − .0055 male
(.076) (.0005) (.000004) (.0037) (.00004) (.0117)
n = 9,275, R2 = .108.
There are no important differences with the OLS estimates. The largest relative change is in the
coefficient on male, but this variable is very insignificant using either estimation method.
CHAPTER 9
TEACHING NOTES
The coverage of RESET in this chapter recognizes that it is a test for neglected nonlinearities,
and it should not be expected to be more than that. (Formally, it can be shown that if an omitted
variable has a conditional mean that is linear in the included explanatory variables, RESET has
no ability to detect the omitted variable. Interested readers can consult my chapter in Companion
to Theoretical Econometrics, 2001, edited by Badi Baltagi.) I would just teach students the F
statistic version of the test, although the LM version is easier to make robust to heteroskedasticity.
(However, some econometrics packages, including Eviews and Stata, have simple commands for
obtaining a heteroskedasticity-robust F-type statistic.)
The Davidson-MacKinnon test can be useful for detecting functional form misspecification,
especially when one has in mind a specific alternative, nonnested model. It is always a one
degree of freedom test.
I think the proxy variable material is important, but the main points can be made with Examples
9.3 and 9.4. The first shows that controlling for IQ can substantially change the estimated return
to education, and the omitted ability bias is in the expected direction. Interestingly, education
and ability do not appear to have an interactive effect. Example 9.4 is a nice example of how
controlling for a previous value of the dependent variable – something that is often possible with
survey and nonsurvey data – can greatly affect a policy conclusion. Computer Exercise 9.8 is
also a good illustration of this method.
I rarely get to teach the measurement error material, although the attenuation bias result for
classical errors-in-variables is worth mentioning.
The result on exogenous sample selection is easy to discuss, with more details given in Chapter
17. The effects of outliers can be illustrated using the examples. I think the infant mortality
example, Example 9.10, is useful for illustrating how a single influential observation can have a
large effect on the OLS estimates.
With the growing importance of least absolute deviations, it makes sense to discuss the
merits of LAD, at least in more advanced courses. Computer Exercise 9.14 is a good example to
show how mean and median effects can be very different, even though there may not be
“outliers” in the usual sense.
SOLUTIONS TO PROBLEMS
9.1 There is functional form misspecification if β 6 ≠ 0 or β 7 ≠ 0, where these are the population
parameters on ceoten2 and comten2, respectively. Therefore, we test the joint significance of
these variables using the R-squared form of the F test: F = [(.375 − .353)/(1 − .375)][(177 –
8)/2] ≈ 2.97. With 2 and ∞ df, the 10% critical value is 2.30, while the 5% critical value is 3.00.
Thus, the p-value is slightly above .05, which is reasonable evidence of functional form
misspecification. (Of course, whether this has a practical impact on the estimated partial effects
for various levels of the explanatory variables is a different matter.)
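The R-squared form of the F statistic used in 9.1 can be checked in a few lines. A minimal sketch with the two R-squareds, the two restrictions, and the 177 − 8 unrestricted degrees of freedom from the solution; the function name is ours.

```python
def f_r2_form(r2_ur, r2_r, q, df_ur):
    """R-squared form of the F statistic for q exclusion restrictions."""
    return ((r2_ur - r2_r) / (1 - r2_ur)) * (df_ur / q)


f_quad = f_r2_form(0.375, 0.353, q=2, df_ur=177 - 8)
print(round(f_quad, 2))
```

The value falls between the 10% and 5% critical values quoted above (2.30 and 3.00), which is why the p-value is slightly above .05.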
9.2 [Instructor’s Note: Out of the 186 records in VOTE2.RAW, three have voteA88 less than 50,
which means the incumbent running in 1990 cannot be the candidate who received voteA88
percent of the vote in 1988. You might want to reestimate the equation dropping these three
observations.]
(i) The coefficient on voteA88 implies that if candidate A had one more percentage point of
the vote in 1988, she/he is predicted to have only .067 more percentage points in 1990. Or, 10
more percentage points in 1988 implies .67 points, or less than one point, in 1990. The t statistic
is only about 1.26, and so the variable is insignificant at the 10% level against the positive one-
sided alternative. (The critical value is 1.282.) While this small effect initially seems surprising,
it is much less so when we remember that candidate A in 1990 is always the incumbent.
Therefore, what we are finding is that, conditional on being the incumbent, the percent of the
vote received in 1988 does not have a strong effect on the percent of the vote in 1990.
(ii) Naturally, the coefficients change, but not in important ways, especially once statistical
significance is taken into account. For example, while the coefficient on log(expendA) goes from
−.929 to −.839, the coefficient is not statistically or practically significant anyway (and its sign is
not what we expect). The magnitudes of the coefficients in both equations are quite similar, and
there are certainly no sign changes. This is not surprising given the insignificance of voteA88.
9.3 (i) Eligibility for the federally funded school lunch program is very tightly linked to living in
poverty. Therefore, the percentage of students eligible for the lunch program is very similar to
the percentage of students living in poverty.
(ii) We can use our usual reasoning on omitting important variables from a regression
equation. The variables log(expend) and lnchprg are negatively correlated: school districts with
poorer children spend, on average, less on schools. Further, β 3 < 0. From Table 3.2, omitting
lnchprg (the proxy for poverty) from the regression produces an upward biased estimator of β1
(ignoring the presence of log(enroll) in the model). So when we control for the poverty rate, the
effect of spending falls.
(iii) Once we control for lnchprg, the coefficient on log(enroll) becomes negative and has a t
(iv) Both math10 and lnchprg are percentages. Therefore, a ten percentage point increase in
lnchprg leads to about a 3.23 percentage point fall in math10, a sizeable effect.
(v) In column (1) we are explaining very little of the variation in pass rates on the MEAP
math test: less than 3%. In column (2), we are explaining almost 19% (which still leaves much
variation unexplained). Clearly most of the variation in math10 is explained by variation in
lnchprg. This is a common finding in studies of school performance: family income (or related
factors, such as living in poverty) are much more important in explaining student performance
than are spending per student or other school characteristics.
9.4 (i) For the CEV assumptions to hold, we must be able to write tvhours = tvhours* + e0,
where the measurement error e0 has zero mean and is uncorrelated with tvhours* and each
explanatory variable in the equation. (Note that for OLS to consistently estimate the parameters
we do not need e0 to be uncorrelated with tvhours*.)
(ii) The CEV assumptions are unlikely to hold in this example. For children who do not
watch TV at all, tvhours* = 0, and it is very likely that reported TV hours is zero. So if
tvhours* = 0 then e0 = 0 with high probability. If tvhours* > 0, the measurement error can be
positive or negative, but, since tvhours ≥ 0, e0 must satisfy e0 ≥ −tvhours*. So e0 and tvhours*
are likely to be correlated. As mentioned in part (i), because it is the dependent variable that is
measured with error, what is important is that e0 is uncorrelated with the explanatory variables.
But this is unlikely to be the case, because tvhours* depends directly on the explanatory
variables. Or, we might argue directly that more highly educated parents tend to underreport
how much television their children watch, which means e0 and the education variables are
negatively correlated.
9.5 The sample selection in this case is arguably endogenous. Because prospective students may
look at campus crime as one factor in deciding where to attend college, colleges with high crime
rates have an incentive not to report crime statistics. If this is the case, then the chance of
appearing in the sample is negatively related to u in the crime equation. (For a given school size,
higher u means more crime, and therefore a smaller probability that the school reports its crime
figures.)
9.6 (i) To obtain the RESET F statistic, we estimate the model in Problem 7.13 and obtain the
fitted values, say \( \widehat{lsalary}_i \). To use the version of RESET in (9.3), we add \( (\widehat{lsalary}_i)^2 \) and
\( (\widehat{lsalary}_i)^3 \) and obtain the F test for joint significance of these variables. With 2 and 203 df, the
F statistic is about 1.33 and p-value ≈ .27, which means that there is not much concern about
functional form misspecification.
(ii) Interestingly, the heteroskedasticity-robust F-type statistic is about 2.24 with p-value
≈ .11, so there is stronger evidence of some functional form misspecification with the robust test.
9.7 [Instructor’s Note: If educ ⋅ KWW is used along with KWW, the interaction term is significant.
This is in contrast to when IQ is used as the proxy. You may want to pursue this as an additional
part to the exercise.]
(i) We estimate the model from column (2) but with KWW in place of IQ. The coefficient on
educ becomes about .058 (se ≈ .006), so this is similar to the estimate obtained with IQ, although
(ii) When KWW and IQ are both used as proxies, the coefficient on educ becomes about .049
(se ≈ .007). Compared with the estimate when only KWW is used as a proxy, the return to
(iii) The t statistic on IQ is about 3.08 while that on KWW is about 2.07, so each is significant
at the 5% level against a two-sided alternative. They are jointly very significant, with F2,925 ≈
9.8 (i) If the grants were awarded to firms based on firm or worker characteristics, grant could
easily be correlated with such factors that affect productivity. In the simple regression model,
these are contained in u.
(ii) The simple regression estimates using the 1988 data are
The coefficient on grant is actually positive, but not statistically different from zero.
where the year subscripts are for clarity. The t statistic for H0: β grant = 0 is −.254/.147 ≈ −1.73.
We use the 5% critical value for 40 df in Table G.2: −1.68. Because t = −1.73 < −1.68, we reject
H0 in favor of H1: β grant < 0 at the 5% level.
(iv) The t statistic is (.831 – 1)/.044 ≈ −3.84, which is a strong rejection of H0.
(v) With the heteroskedasticity-robust standard error, the t statistic for grant88 is −.254/.142 ≈
−1.79, so the coefficient is even more significantly less than zero when we use the
heteroskedasticity-robust standard error. The t statistic for H0: β log(scrap87) = 1 is (.831 – 1)/.071 ≈
−2.38, which is notably smaller than before, but it is still pretty significant.
The coefficient on DC means that even if there was a state that had the same per capita income,
per capita physicians, and population as Washington D.C., we predict that D.C. has an infant
mortality rate that is about 16 deaths per 1000 live births higher. This is a very large difference.
(ii) In the regression from part (i), the intercept and all slope coefficients, along with their
standard errors, are identical to those in equation (9.38), which simply excludes D.C. (Of course,
equation (9.38) does not have DC in it, so we have nothing to compare with its coefficient and
standard error.) Therefore, for the purposes of obtaining the effects and statistical significance of
the other explanatory variables, including a dummy variable for a single observation is identical
to just dropping that observation when doing the estimation.
The R-squareds and adjusted R-squareds from (9.38) and the regression in part (i) are not the
same. They are much larger when DC is included as an explanatory variable because we are
predicting the infant mortality rate perfectly for D.C. You might want to confirm that the
residual for the observation corresponding to D.C. is identically zero.
9.10 With sales defined to be in billions of dollars, we obtain the following estimated equation
using all companies in the sample:
rdintens = 2.06 + .317 sales − .0074 sales² + .053 profmarg
(0.63) (.139) (.0037) (.044)
When we drop the largest company (with sales of roughly $39.7 billion), we obtain
rdintens = 1.98 + .361 sales − .0103 sales² + .055 profmarg
(0.72) (.239) (.0131) (.046)
When the largest company is left in the sample, the quadratic term is statistically significant,
even though the coefficient on the quadratic is less in absolute value than when we drop the
largest firm. What is happening is that by leaving in the large sales figure, we greatly increase
the variation in both sales and sales²; as we know, this reduces the variances of the OLS
estimators (see Section 3.4). The t statistic on sales² in the first regression is about –2, which
makes it almost significant at the 5% level against a two-sided alternative. If we look at Figure
9.1, it is not surprising that a quadratic is significant when the large firm is included in the
regression: rdintens is relatively small for this firm even though its sales are very large
compared with the other firms. Without the largest firm, a linear relationship between rdintens
and sales seems to suffice.
9.11 (i) Only four of the 408 schools have b/s less than .01.
(ii) We estimate the model in column (3) of Table 4.3, omitting schools with b/s < .01:
log (salary ) = 10.71 − .421 (b/s) + .089 log(enroll) − .219 log (staff)
(0.26) (.196) (.007) (.050)
− .00023 droprate + .00090 gradrate
(.00161) (.00066)
n = 404, R2 = .354.
Interestingly, the estimated tradeoff is reduced by a nontrivial amount (from .589 to .421). This
is a pretty large difference considering only four of 408 observations, or less than 1%, were
omitted.
9.12 (i) 205 observations out of the 1,989 records in the sample have obrat > 40. (Data are
missing for some variables, so not all of the 1,989 observations are used in the regressions.)
(ii) When observations with obrat > 40 are excluded from the regression in part (iii) of
Problem 7.16, we are left with 1,768 observations. The coefficient on white is about .129 (se
≈ .020). To three decimal places, these are the same estimates we got when using the entire
sample (see Computer Exercise 7.16). Perhaps this is not very surprising since we only lost 203
out of 1,971 observations. However, regression results can be very sensitive when we drop over
10% of the observations, as we have here.
(iii) The estimates from part (ii) show that βˆ white does not seem very sensitive to the sample
used, although we have tried only one way of reducing the sample.
9.13 (i) The mean of stotal is .047, its standard deviation is .854, the minimum value is –3.32,
and the maximum value is 2.24.
(ii) In the regression jc on stotal, the slope coefficient is .011 (se = .011). Therefore, while
the estimated relationship is positive, the t statistic is only one: the correlation between jc and
stotal is weak at best. In the regression univ on stotal, the slope coefficient is 1.170 (se = .029),
for a t statistic of 38.5. Therefore, univ and stotal are positively correlated (with correlation
= .435).
(iii) When we add stotal to (4.17) and estimate the resulting equation by OLS, we get
log (wage) = 1.495 + .0631 jc + .0686 univ + .00488 exper + .0494 stotal
(.021) (.0068) (.0026) (.00016) (.0068)
n = 6,758, R2 = .228
For testing βjc = βuniv, we can use the same trick as in Section 4.4 to get the standard error of the
difference: replace univ with totcoll = jc + univ, and then the coefficient on jc is the difference
in the estimated returns, along with its standard error. Let θ1 = βjc − βuniv. Then
θˆ1 = −.0055 (se = .0069) . Compared with what we found without stotal, the evidence is even
weaker against H1: βjc < βuniv. The t statistic from equation (4.27) is about –1.48, while here we
have obtained only −.80.
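The substitution used in part (iii) can be checked numerically. The sketch below uses simulated data (not the data set from the text) and a small hand-written OLS routine, both of which are assumptions made for illustration. It shows that after replacing univ with totcoll = jc + univ, the coefficient on jc equals the difference of the two original slope estimates. A regression package would also report the standard error of that coefficient, which this sketch does not compute.

```python
import random

def ols(columns, y):
    """OLS coefficients for y on an intercept plus the given regressor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for p in range(k):
        best = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[best] = A[best], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]

# Simulated data with hypothetical parameter values
random.seed(0)
n = 500
jc = [random.uniform(0, 2) for _ in range(n)]
univ = [random.uniform(0, 4) for _ in range(n)]
lwage = [1.5 + 0.06 * j + 0.07 * u + random.gauss(0, 0.4)
         for j, u in zip(jc, univ)]

# Original parameterization: separate slopes on jc and univ
b0, b_jc, b_univ = ols([jc, univ], lwage)

# Reparameterized model: jc and totcoll = jc + univ
totcoll = [j + u for j, u in zip(jc, univ)]
a0, theta1, b_tot = ols([jc, totcoll], lwage)

print(f"b_jc - b_univ     = {b_jc - b_univ:.6f}")
print(f"coefficient on jc = {theta1:.6f}")
```

The two printed numbers agree up to floating-point error, and the coefficient on totcoll reproduces the original slope on univ.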
(iv) When stotal² is added to the equation, its coefficient is .0019 (t statistic = .40).
Therefore, there is no reason to add the quadratic term.
(v) The F statistic for the significance of the interaction terms stotal⋅jc and stotal⋅univ is
about 1.96; with 2 and 6,756 df, this gives p-value = .141. So, even at the 10% level, the
interaction terms are jointly insignificant. It is probably not worth complicating the basic model
estimated in part (iii).
(vi) I would just use the model from part (iii), where stotal appears only in level form. The
other embellishments were not statistically significant at small enough significance levels to
warrant the additional complications.
nettfa = 21.198 − .270 inc + .0102 inc² − 1.940 age + .0346 age²
(9.992) (.075) (.0006) (.483) (.0055)
n = 9,275, R² = .202
The coefficient on e401k means that, holding other things in the equation fixed, the average level
of net financial assets is about $9,713 higher for a family eligible for a 401(k) than for a family
not eligible.
(ii) The OLS regression of û_i² on inc_i, inc_i², age_i, age_i², male_i, and e401k_i gives R²_{û²} = .0374,
which translates into F = 59.97. The associated p-value, with 6 and 9,268 df, is essentially zero.
So there is strong evidence of heteroskedasticity, which means u and the explanatory variables
cannot be independent [even though E(u|x1,x2,…,xk) = 0 is possible].
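The F statistic in part (ii) follows from the R-squared of the auxiliary regression. A minimal Python sketch, assuming the standard R-squared form F = (R²/k)/[(1 − R²)/(n − k − 1)] with k = 6 regressors and n = 9,275; the function name is illustrative. Because the R-squared is entered here rounded to four decimals, the result is close to, but not exactly, the reported 59.97.

```python
def f_from_r2(r2, k, n):
    """F statistic for joint significance of k regressors, computed from R-squared."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# R-squared from the regression of squared residuals on the six regressors
f_stat = f_from_r2(0.0374, k=6, n=9275)
df_num, df_den = 6, 9275 - 6 - 1

print(f"F = {f_stat:.2f} with {df_num} and {df_den} df")
```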
nettfa = 12.491 − .262 inc + .00709 inc² − .723 age + .0111 age²
(1.382) (.010) (.00008) (.067) (.0008)
Now, the coefficient on e401k means that, at given income, age, and gender, the median
difference in net financial assets between families with and without 401(k) eligibility is about
$3,737.
(iv) The findings from parts (i) and (iii) are not in conflict. We are finding that 401(k)
eligibility has a larger effect on mean wealth than on median wealth. Finding different mean and
median effects for a variable such as nettfa, which has a skewed distribution, is not surprising.
Apparently, 401(k) eligibility has some large wealth effects, and these are reflected in the mean.
The median is much less sensitive to effects at the upper end of the distribution.
CHAPTER 10
TEACHING NOTES
Because of its realism and its care in stating assumptions, this chapter puts a somewhat heavier
burden on the instructor and student than traditional treatments of time series regressions, but I
think it is worth it. It is important that students learn that there are potential pitfalls inherent in
using regression with time series data that are not present for cross-sectional applications.
Trends, seasonality, and high persistence are ubiquitous in time series data. By this time,
students should have a firm grasp of multiple regression mechanics and inference, and so you
can focus on those features that make time series applications different from cross-sectional ones.
I think it is useful to discuss static and finite distributed lag models at the same time, as these at
least have a shot at satisfying the Gauss-Markov assumptions. Many interesting examples have
distributed lag dynamics. In discussing the time series versions of the CLM assumptions, I rely
mostly on intuition. The notion of strict exogeneity is easy to discuss in terms of feedback. It is
also pretty apparent that, in many applications, there are likely to be some explanatory variables
that are not strictly exogenous. What the student should know is that, to conclude that OLS is
unbiased – as opposed to consistent – we need to assume a very strong form of exogeneity of the
regressors. Chapter 11 shows that only contemporaneous exogeneity is needed for consistency.
Although the text is careful in stating the assumptions, in class, after discussing strict exogeneity,
I leave the conditioning on X implicit, especially when I discuss the no serial correlation
assumption. As this is a new assumption I spend some time on it. (I also discuss why we did not
need it for random sampling.)
Once the unbiasedness of OLS, the Gauss-Markov theorem, and the sampling distributions under
the classical linear model assumptions have been covered – which can be done rather quickly – I
focus on applications. Fortunately, the students already know about logarithms and dummy
variables. I treat index numbers in this chapter because they arise in many time series examples.
A novel feature of the text is the discussion of how to compute goodness-of-fit measures with a
trending or seasonal dependent variable. While detrending or deseasonalizing y is hardly perfect
(and does not work with integrated processes), it is better than simply reporting the very high R-
squareds that often come with time series regressions with trending variables.
SOLUTIONS TO PROBLEMS
10.1 (i) Disagree. Most time series processes are correlated over time, and many of them
strongly correlated. This means they cannot be independent across observations, which simply
represent different time periods. Even series that do appear to be roughly uncorrelated – such as
stock returns – do not appear to be independently distributed, as you will see in Chapter 12 under
dynamic forms of heteroskedasticity.
(ii) Agree. This follows immediately from Theorem 10.1. In particular, we do not need the
homoskedasticity and no serial correlation assumptions.
(iii) Disagree. Trending variables are used all the time as dependent variables in a regression
model. We do need to be careful in interpreting the results because we may simply find a
spurious association between yt and trending explanatory variables. Including a trend in the
regression is a good idea with trending dependent or independent variables. As discussed in
Section 10.5, the usual R-squared can be misleading when the dependent variable is trending.
(iv) Agree. With annual data, each time period represents a year and is not associated with
any season.
Now by assumption, ut-1 has zero mean and is uncorrelated with all right-hand-side variables in
the previous equation, except itself of course. So
because γ1 > 0. If σ_u² = E(u_t²) for all t then Cov(int_t, u_{t-1}) = γ1σ_u². This violates the strict
exogeneity assumption, TS.2. While u_t is uncorrelated with int_t, int_{t-1}, and so on, u_t is correlated
with int_{t+1}.
10.3 Write
10.4 We use the R-squared form of the F statistic (and ignore the information on the adjusted R-squared). The 10%
critical value with 3 and 124 degrees of freedom is about 2.13 (using 120 denominator df in
Table G.3a). The F statistic is
10.5 The functional form was not specified, but a reasonable one is
where Q2t, Q3t, and Q4t are quarterly dummy variables (the omitted quarter is the first) and the
other variables are self-explanatory. The inclusion of the linear time trend allows the dependent
variable and log(pcinct) to trend over time (intt probably does not contain a trend), and the
quarterly dummies allow all variables to display seasonality. The parameter β2 is an elasticity
(ii) This is suggested in part (i). For clarity, define three new variables: zt0 = (zt + zt-1 + zt-2 +
zt-3 + zt-4), zt1 = (zt-1 + 2zt-2 + 3zt-3 + 4zt-4), and zt2 = (zt-1 + 4zt-2 + 9zt-3 + 16zt-4). Then, α0, γ0, γ1,
and γ2 are obtained from the OLS regression of yt on zt0, zt1, and zt2, t = 1, 2, …, n. (Following
(iii) The unrestricted model is the original equation, which has six parameters (α0 and the
five δj). The PDL model has four parameters. Therefore, there are two restrictions imposed in
moving from the general model to the PDL model. (Note how we do not have to actually write
out what the restrictions are.) The df in the unrestricted model is n – 6. Therefore, we would
obtain the unrestricted R-squared, R²ur, from the regression of yt on zt, zt-1, …, zt-4 and the
restricted R-squared from the regression in part (ii), R²r. The F statistic is
F = [(R²ur − R²r)/(1 − R²ur)]⋅[(n − 6)/2].
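The construction in part (ii) rests on an identity: if δj = γ0 + γ1 j + γ2 j² for j = 0, …, 4, then the distributed lag δ0 zt + … + δ4 zt-4 equals γ0 zt0 + γ1 zt1 + γ2 zt2. The Python sketch below checks this identity for one time period. The gamma values and the series z are made up for illustration and are not taken from the text.

```python
def pdl_variables(z, t):
    """Return (z0, z1, z2) at index t for a quadratic PDL with four lags."""
    lags = [z[t - j] for j in range(5)]            # z_t, z_{t-1}, ..., z_{t-4}
    z0 = sum(lags)                                 # z_t + z_{t-1} + ... + z_{t-4}
    z1 = sum(j * lags[j] for j in range(5))        # z_{t-1} + 2 z_{t-2} + ... + 4 z_{t-4}
    z2 = sum(j ** 2 * lags[j] for j in range(5))   # z_{t-1} + 4 z_{t-2} + ... + 16 z_{t-4}
    return z0, z1, z2

# Hypothetical PDL parameters and the implied lag coefficients
g0, g1, g2 = 0.5, -0.2, 0.03
delta = [g0 + g1 * j + g2 * j ** 2 for j in range(5)]

# Made-up series; t must be at least 4 so that all four lags exist
z = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.5]
t = 6
z0, z1, z2 = pdl_variables(z, t)

unrestricted_form = sum(delta[j] * z[t - j] for j in range(5))
pdl_form = g0 * z0 + g1 * z1 + g2 * z2
print(unrestricted_form, pdl_form)
```

Because the two forms coincide for any series, regressing yt on zt0, zt1, and zt2 imposes exactly the two restrictions counted in part (iii).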
10.7 Let post79 be a dummy variable equal to one for years after 1979, and zero otherwise.
Adding post79 to equation (10.15) gives
The coefficient on post79 is statistically significant (t statistic ≈ 2.14) and economically large:
accounting for inflation and deficits, i3 was about 1.4 points higher on average in years after
1979. The coefficient on def falls substantially once post79 is included in the regression.
Only the trend is statistically significant. In fact, in addition to the time trend, which has a t
statistic over three, only afdec6 has a t statistic bigger than one in absolute value. Accounting for
a linear trend has important effects on the estimates.
(ii) The F statistic for joint significance of all variables (except the trend and intercept, of
course) is about .54. The df in the F distribution are 6 and 123. The p-value is about .78, and so
the explanatory variables other than the time trend are jointly very insignificant. We would have
to conclude that once a positive linear trend is allowed for, nothing else helps to explain
log(chnimp). This is a problem for the original event study analysis.
(iii) Nothing of importance changes. In fact, the p-value for the test of joint significance of
log (prepopt ) = −6.66 − .212 log(mincovt) + .486 log(usgnpt) + .285 log(prgnpt)
(1.26) (.040) (.222) (.080)
− .027 t
(.005)
The coefficient on log(prgnpt) is very statistically significant (t statistic ≈ 3.56). Because the
dependent and independent variable are in logs, the estimated elasticity of prepop with respect to
prgnp is .285. Including log(prgnp) actually increases the size of the minimum wage effect: the
estimated elasticity of prepop with respect to mincov is now −.212, as compared with −.169 in
equation (10.38).
10.10 If we run the regression of gfrt on pet, (pet-1 – pet), (pet-2 – pet), ww2t, and pillt, the
coefficient and standard error on pet are, rounded to four decimal places, .1007 and .0298,
respectively. When rounded to three decimal places we obtain .101 and .030, as reported in the
text.
10.11 (i) The coefficient on the time trend in the regression of log(uclms) on a linear time trend
and 11 monthly dummy variables is about −.0139 (se ≈ .0012), which implies that monthly
unemployment claims fell by about 1.4% per month on average. The trend is very significant.
There is also very strong seasonality in unemployment claims, with 6 of the 11 monthly dummy
(ii) When ez is added to the regression, its coefficient is about −.508 (se ≈ .146). Because
(iii) We must assume that around the time of EZ designation there were not other external
factors that caused a shift down in the trend of log(uclms). We have controlled for a time trend
and seasonality, but this may not be enough.
Although t and t2 are individually insignificant, they are jointly very significant (p-value ≈ .0000).
(ii) Using the detrended gfrt as the dependent variable in (10.35) gives R² ≈ .602, compared with about .727
if we do not initially detrend. Thus, the equation still explains a fair amount of variation in gfr
even after we net out the trend in computing the total variation in gfr.
(iii) The coefficient and standard error on t³ are about −.00129 and .00019, respectively, which
results in a very significant t statistic. It is difficult to know what to make of this. The cubic
trend, like the quadratic, is not monotonic. So this almost becomes a curve-fitting exercise.
gct = .0081 + .571 gyt
(.0019) (.067)
n = 36, R² = .679.
This equation implies that if income growth increases by one percentage point, consumption
growth increases by .571 percentage points. The coefficient on gyt is very statistically significant
(t statistic ≈ 8.5).
The t statistic on gyt-1 is only about 1.39, so it is not significant at the usual significance levels.
(It is significant at the 20% level against a two-sided alternative.) In addition, the coefficient is
not especially large. At best there is weak evidence of adjustment lags in consumption.
The t statistic on r3t is very small. The estimated coefficient is also practically small: a one-
point increase in r3t reduces consumption growth by about .021 percentage points.
gfrt = 92.05 + .089 pet − .0040 pet-1 + .0074 pet-2 + .018 pet-3 + .014 pet-4
The p-value for the F statistic of joint significance of pet-3 and pet-4 is about .94, which is very
weak evidence against H0.
(ii) The LRP and its standard error can be obtained as the coefficient and standard error on
pet in the regression
gfrt on pet, (pet-1 – pet), (pet-2 – pet), (pet-3 – pet), (pet-4 – pet), ww2t, pillt
We get an estimated LRP of about .129 (se ≈ .030), which is above the estimated LRP with only two lags (.101).
The standard errors are the same rounded to three decimal places.
(iii) We estimate the PDL with the additional variables ww2t and pillt. To estimate γ0, γ1,
and γ2, we define the variables
Then, run the regression gfrt on z0t, z1t, z2t, ww2t, pillt. Using the data in FERTIL3.RAW gives
(to three decimal places) γˆ0 = .069, γˆ1 = –.057, γˆ2 = .012. So δˆ0 = γˆ0 = .069, δˆ1 = .069 –
.057 + .012 = .024, δˆ2 = .069 – 2(.057) + 4(.012) = .003, δˆ3 = .069 – 3(.057) + 9(.012) = .006,
δˆ4 = .069 – 4(.057) + 16(.012) = .033. Therefore, the LRP is .135. This is slightly above
the .129 obtained from the unrestricted model, but not much.
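The arithmetic in part (iii) can be reproduced in a few lines of Python. This sketch takes the three rounded gamma estimates reported above as given and applies δj = γ0 + γ1 j + γ2 j² for j = 0, …, 4; the variable names are illustrative.

```python
# Rounded PDL estimates reported in part (iii)
g0, g1, g2 = 0.069, -0.057, 0.012

# Implied lag coefficients delta_j = g0 + g1*j + g2*j^2
delta = [g0 + g1 * j + g2 * j ** 2 for j in range(5)]

# The long-run propensity is the sum of the lag coefficients
lrp = sum(delta)

for j, d in enumerate(delta):
    print(f"delta_{j} = {d:.3f}")
print(f"LRP = {lrp:.3f}")
```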
Incidentally, the F statistic for testing the restrictions imposed by the PDL is about [(.537 –
.536)/(1 − .537)](60/2) ≈ .065, which is very insignificant. Therefore, the restrictions are not
rejected by the data. Anyway, the only parameter we can estimate with any precision, the LRP,
is not very different in the two models.
10.15 (i) The sign of β 2 is fairly clear-cut: as interest rates rise, stock returns fall, so β 2 < 0.
Higher interest rates imply that T-bill and bond investments are more attractive, and also signal a
future slowdown in economic activity. The sign of β1 is less clear. While economic growth can
be a good thing for the stock market, it can also signal inflation, which tends to depress stock
prices.
(ii) The estimated equation is
A one percentage point increase in industrial production growth is predicted to increase the stock
market return by .036 percentage points (a very small effect). On the other hand, a one
percentage point increase in interest rates decreases the stock market return by an estimated 1.36
percentage points.
(iv) The regression in part (i) has nothing directly to say about predicting stock returns
because the explanatory variables are dated contemporaneously with rsp500. In other words,
we do not know i3t before we know rsp500t. What the regression in part (i) says is that a change
in i3 is associated with a contemporaneous change in rsp500.
10.16 (i) The sample correlation between inf and def is only about .048, which is very small.
Perhaps surprisingly, inflation and the deficit rate are practically uncorrelated over this period.
Of course, this is a good thing for estimating the effects of each variable on i3, as it implies
almost no multicollinearity.
i3t = 1.23 + .425 inft + .273 inft-1 + .163 deft + .405 deft-1
(0.44) (.129) (.141) (.257) (.218)
(iii) The estimated LRP of i3 with respect to inf is .425 + .273 = .698, which is somewhat
larger than .613, which we obtain from the static model in (10.15). But the estimates are fairly
close considering the size and marginal significance of the coefficient on inft-1.
(iv) The F statistic for significance of inft-1 and deft-1 is about 2.18, with p-value ≈ .125. So
they are not jointly significant at the 5% level. But the p-value may be small enough to justify
their inclusion, especially since the coefficient on deft-1 is practically large.
10.17 (i) The variable beltlaw becomes one at t = 61, which corresponds to January, 1986. The
variable spdlaw goes from zero to one at t = 77, which corresponds to May, 1987.
log (totacc) = 10.469 + .00275 t − .0427 feb + .0798 mar + .0185 apr
(.019) (.00016) (.0244) (.0244) (.0245)
n = 108, R2 = .797
When multiplied by 100, the coefficient on t gives roughly the average monthly percentage
growth in totacc, ignoring seasonal factors. In other words, once seasonality is eliminated,
totacc grew by about .275% per month over this period, or, 12(.275) = 3.3% at an annual rate.
There is pretty clear evidence of seasonality. Only February has a lower number of total
accidents than the base month, January. The peak is in December: roughly, there are 9.6%
more accidents in December than in January in the average year. The F statistic for joint
significance of the monthly dummies is F = 5.15. With 11 and 95 df, this gives a p-value
essentially equal to zero.
n = 108, R2 = .910
The negative coefficient on unem makes sense if we view unem as a measure of economic
activity. As economic activity increases – unem decreases – we expect more driving, and
therefore more accidents. The estimate is that a one percentage point increase in the
unemployment rate reduces total accidents by about 2.1%. A better economy does have costs in
terms of traffic accidents.
(iv) At least initially, the coefficients on spdlaw and beltlaw are not what we might
expect. The coefficient on spdlaw implies that accidents dropped by about 5.4% after the
highway speed limit was increased from 55 to 65 miles per hour. There are at least a couple of
possible explanations. One is that people became safer drivers after the speed limit increased,
recognizing that they must be more cautious. It could also be that some other change – other than
the increased speed limit or the relatively new seat belt law – caused a lower total number of
accidents, and we have not properly accounted for this change.
The coefficient on beltlaw also seems counterintuitive at first. But, perhaps people became less
cautious once they were forced to wear seatbelts.
(v) The average of prcfat is about .886, which means, on average, slightly less than one
percent of all accidents result in a fatality. The highest value of prcfat is 1.217, which means
there was one month where 1.2% of all accidents resulted in a fatality.
(vi) As in part (iii), I do not report the coefficients on the time trend and seasonal dummy
variables:
prcfatt = 1.030 + … + .00063 wkends − .0154 unem
(.103) (.00616) (.0055)
n = 108, R² = .717
Higher speed limits are estimated to increase the percent of fatal accidents, by .067 percentage
points. This is a statistically significant effect. The new seat belt law is estimated to decrease
the percent of fatal accidents by about .03, but the two-sided p-value is about .21.
Interestingly, increased economic activity also increases the percent of fatal accidents. This may
be because more commercial trucks are on the roads, and these probably increase the chance that
an accident results in a fatality.
CHAPTER 11
TEACHING NOTES
Much of the material in this chapter is usually postponed, or not covered at all, in an introductory
course. However, as Chapter 10 indicates, the set of time series applications that satisfy all of
the classical linear model assumptions might be very small. In my experience, spurious time
series regressions are the hallmark of many student projects that use time series data. Therefore,
students need to be alerted to the dangers of using highly persistent processes in time series
regression equations. (The spurious regression problem, and the relatively recent notion of
cointegration, are covered in more detail in Chapter 18.)
It is fairly easy to heuristically describe the difference between a weakly dependent process and
an integrated process. Using the MA(1) and the stable AR(1) examples is usually sufficient.
When the data are weakly dependent and the explanatory variables are contemporaneously
exogenous, OLS is consistent. This result has many applications, including the stable AR(1)
regression model. When we add the appropriate homoskedasticity and no serial correlation
assumptions, the usual test statistics are asymptotically valid.
The random walk process is a good example of a unit root (highly persistent) process. In a one-
semester course, the issue comes down to whether or not to first difference the data before
specifying the linear model. While unit root tests are covered in Chapter 18, just computing the
first-order autocorrelation is often sufficient, perhaps after detrending. The examples in Section
11.3 illustrate how different first-difference results can be from estimating equations in levels.
Section 11.4 is novel in an introductory text, and simply points out that, if a model is
dynamically complete in a well-defined sense, it should not have serial correlation. Therefore,
we need not worry about serial correlation when, say, we test the efficient market hypothesis.
Section 11.5 further investigates the homoskedasticity assumption, and, in a time series context,
emphasizes that what is contained in the explanatory variables determines what kind of
heteroskedasticity is ruled out. These two sections could be skipped without loss of continuity.
SOLUTIONS TO PROBLEMS
11.1 Because of covariance stationarity, γ0 = Var(xt) does not depend on t, so sd(xt+h) = √γ0 for
any h ≥ 0. By definition, Corr(xt,xt+h) = Cov(xt,xt+h)/[sd(xt)⋅sd(xt+h)] = γh/(√γ0⋅√γ0) = γh/γ0.
11.2 (i) E(xt) = E(et) – (1/2)E(et-1) + (1/2)E(et-2) = 0 for t = 1, 2, …. Also, because the et are
independent, they are uncorrelated and so Var(xt) = Var(et) + (1/4)Var(et-1) + (1/4)Var(et-2) = 1 +
(1/4) + (1/4) = 3/2 because Var (et) = 1 for all t.
(ii) Because xt has zero mean, Cov(xt,xt+1) = E(xtxt+1) = E[(et – (1/2)et-1 + (1/2)et-2)(et+1 –
(1/2)et + (1/2)et-1)] = E(etet+1) – (1/2)E(et²) + (1/2)E(etet-1) – (1/2)E(et-1et+1) + (1/4)E(et-1et) –
(1/4)E(et-1²) + (1/2)E(et-2et+1) – (1/4)E(et-2et) + (1/4)E(et-2et-1) = –(1/2)E(et²) – (1/4)E(et-1²) =
–(1/2) – (1/4) = –3/4; the third to last equality follows because the et are pairwise uncorrelated
and E(et²) = 1 for all t. Using Problem 11.1 and the variance calculation from part (i),
Corr(xt,xt+1) = –(3/4)/(3/2) = –1/2.
Computing Cov(xt,xt+2) is even easier, because only one of the nine terms has expectation not
equal to zero: (1/2)E(et²) = 1/2. Therefore, Corr(xt,xt+2) = (1/2)/(3/2) = 1/3.
(iii) Corr(xt,xt+h) = 0 for h > 2 because for h > 2, xt+h depends at most on et+j for j > 0, while xt
depends on et+j, j ≤ 0.
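The autocorrelations in Problem 11.2 can be verified with exact arithmetic. The sketch below assumes, as in the problem, that the et are uncorrelated with unit variance, so that Cov(xt,xt+h) is the sum of products of moving-average coefficients that are h positions apart. The names theta and autocov are illustrative.

```python
from fractions import Fraction

# Coefficients on e_t, e_{t-1}, e_{t-2} in x_t = e_t - (1/2)e_{t-1} + (1/2)e_{t-2}
theta = [Fraction(1), Fraction(-1, 2), Fraction(1, 2)]

def autocov(h):
    """Cov(x_t, x_{t+h}) when the e_t are uncorrelated with unit variance."""
    if h >= len(theta):
        return Fraction(0)
    return sum(theta[j] * theta[j + h] for j in range(len(theta) - h))

gamma0 = autocov(0)                                  # Var(x_t)
corr = {h: autocov(h) / gamma0 for h in range(1, 5)}

print("Var(x_t) =", gamma0)
for h, value in corr.items():
    print(f"Corr(x_t, x_t+{h}) = {value}")
```

The output matches parts (i) through (iii): variance 3/2, correlations –1/2 and 1/3 at lags one and two, and zero beyond lag two.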
(ii) We assume h > 0; when h = 0 we obtain Var(yt). Then Cov(yt,yt+h) = E(ytyt+h) = E[(z +
et)(z + et+h)] = E(z²) + E(zet+h) + E(etz) + E(etet+h) = E(z²) = σ_z² because {et} is an uncorrelated
sequence (it is an independent sequence) and z is uncorrelated with et for all t. From part (i) we
know that E(yt) and Var(yt) do not depend on t and we have shown that Cov(yt,yt+h) depends on
neither t nor h. Therefore, {yt} is covariance stationary.
(iii) From Problem 11.1 and parts (i) and (ii), Corr(yt,yt+h) = Cov(yt,yt+h)/Var(yt) = σ_z²/(σ_z² +
σ_e²) > 0.
(iv) No. In fact, the correlation between yt and yt+h is the same positive value obtained in part
(iii) for any h > 0. In other words, no matter how far apart yt and yt+h are, their correlation is
always the same. Of course this is due to the presence of the time-constant variable, z.
11.4 Assuming y0 = 0 is a special case of assuming y0 nonrandom, and so we can obtain the
variances from (11.21): Var(yt) = σ_e² t and Var(yt+h) = σ_e² (t + h), h > 0. Because E(yt) = 0 for all
where we have used the fact that {et} is a pairwise uncorrelated sequence. Therefore,
Corr(yt,yt+h) = Cov(yt,yt+h)/√[Var(yt)⋅Var(yt+h)] = t/√[t(t + h)] = √[t/(t + h)].
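The random walk correlation √[t/(t + h)] can be checked by simulation. The sketch below is illustrative only: it assumes y0 = 0 and draws the et as independent standard normal variables (normality is a convenience for the simulation, not something the derivation requires). The function name, the seed, and the choice t = 10, h = 30 are arbitrary.

```python
import math
import random

def simulated_corr(t, h, reps=20000, seed=1):
    """Sample correlation between y_t and y_{t+h} across simulated random walks."""
    rng = random.Random(seed)
    y_at_t, y_at_th = [], []
    for _ in range(reps):
        y = 0.0                                   # y_0 = 0
        for s in range(1, t + h + 1):
            y += rng.gauss(0.0, 1.0)              # y_s = y_{s-1} + e_s
            if s == t:
                y_at_t.append(y)
        y_at_th.append(y)
    mean_a = sum(y_at_t) / reps
    mean_b = sum(y_at_th) / reps
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(y_at_t, y_at_th))
    var_a = sum((a - mean_a) ** 2 for a in y_at_t)
    var_b = sum((b - mean_b) ** 2 for b in y_at_th)
    return cov / math.sqrt(var_a * var_b)

t, h = 10, 30
theory = math.sqrt(t / (t + h))
sim = simulated_corr(t, h)
print(f"theoretical correlation: {theory:.3f}")
print(f"simulated correlation:   {sim:.3f}")
```

For fixed h the correlation tends to one as t grows, which is one way to see why the random walk is highly persistent.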
11.5 (i) The following graph gives the estimated lag distribution:
[Figure: estimated lag distribution, plotting the lag coefficient (vertical axis, 0 to .16) against the lag (horizontal axis, 0 to 12).]
By some margin, the largest effect is at the ninth lag, which says that a temporary increase in
wage inflation has its largest effect on price inflation nine months later. The smallest effect is at
the twelfth lag, which hopefully indicates (but does not guarantee) that we have accounted for
enough lags of gwage in the FDL model.
(ii) Lags two, three, and twelve have t statistics less than two. The other lags are statistically
significant at the 5% level against a two-sided alternative. (Assuming either that the CLM
assumptions hold for exact tests or Assumptions TS.1′ through TS.5′ hold for asymptotic tests.)
(iii) The estimated LRP is just the sum of the lag coefficients from zero through twelve:
1.172. While this is greater than one, it is not much greater, and the difference could certainly be
due to sampling error.
(iv) The model underlying the estimated equation can be written with intercept α0 and
lag coefficients δ0, δ1, …, δ12. Denote the LRP by θ0 = δ0 + δ1 + … + δ12. Now, we can write
δ0 = θ0 − δ1 − δ2 − … − δ12. If we plug this into the FDL model we obtain (with yt = gpricet and
zt = gwaget)
Therefore, we regress yt on zt, (zt-1 – zt), (zt-2 – zt), …, (zt-12 – zt) and obtain the coefficient and
standard error on zt as the estimated LRP and its standard error.
(v) We would add lags 13 through 18 of gwaget to the equation, which leaves 273 – 6 = 267
observations. Now, we are estimating 20 parameters, so the df in the unrestricted model is dfur =
267 – 20 = 247. Let R²ur be the R-squared from this regression. To obtain the restricted R-squared, R²r, we
need to reestimate the model reported in the problem but with the same 267 observations used to
estimate the unrestricted model. Then F = [(R²ur − R²r)/(1 − R²ur)](247/6). We would find the
critical value from the F6,247 distribution.
[Instructor’s Note: As a computer exercise, you might have the students test whether all 13 lag
coefficients in the population model are equal. The restricted regression is gprice on (gwage +
gwage-1 + gwage-2 + … + gwage-12), and the R-squared form of the F test, with 12 and 259 df, can
be used.]
11.6 (i) The t statistic for H0: β1 = 1 is t = (1.104 – 1)/.039 ≈ 2.67. Although we must rely on
asymptotic results, we might as well use df = 120 in Table G.2. So the 1% critical value against
a two-sided alternative is about 2.62, and so we reject H0: β1 = 1 against H1: β1 ≠ 1 at the 1%
level. It is hard to know whether the estimate is practically different from one without
comparing investment strategies based on the theory (β1 = 1) and the estimate ( βˆ1 = 1.104). But
(ii) The t statistic for the null in part (i) is now (1.053 – 1)/.039 ≈ 1.36, so H0: β1 = 1 is no
longer rejected against a two-sided alternative unless we are using more than a 10% significance
level. But the lagged spread is very significant (contrary to what the expectations hypothesis
predicts): t = .480/.109 ≈ 4.40. Based on the estimated equation, when the lagged spread is
(iii) This suggests unit root behavior for {hy3t}, which generally invalidates the usual t-
testing procedure.
(iv) We would include three quarterly dummy variables, say Q2t, Q3t, and Q4t, and do an F
test for joint significance of these variables. (The F distribution would have 3 and 117 df.)
11.7 (i) We plug the first equation into the second to get
and, rearranging,
E(ut|xt,yt-1,xt-1, …) = 0, which means that the model is dynamically complete [see equation
(11.37)]. Therefore, the errors are serially uncorrelated. If the homoskedasticity assumption
Var(ut|xt,yt-1) = σ² holds, then the usual standard errors, t statistics and F statistics are
asymptotically valid.
11.8 (i) The first order autocorrelation for log(invpc) is about .639. If we first detrend log(invpc)
by regressing on a linear time trend, ρ̂1 ≈ .485. Especially after detrending there is little
evidence of a unit root in log(invpc). For log(price), the first order autocorrelation is about .949,
which is very high. After detrending, the first order autocorrelation drops to .822, but this is still
pretty large. We cannot confidently rule out a unit root in log(price).
The coefficient on Δlog(pricet) implies that a one percentage point increase in the growth in price
leads to a 3.88 percent increase in housing investment above its trend. [If Δlog(pricet) = .01 then
$\widehat{\Delta\log(invpc_t)}$ = .0388; we multiply both by 100 to convert the proportionate changes to
percentage changes.]
(iii) If we first linearly detrend log(invpct) before regressing it on Δlog(pricet) and the time
trend, then R2 = .303, which is substantially lower than that when we do not detrend. Thus,
∆log(pricet) explains only about 30% of the variation in log(invpct) about its trend.
$$\widehat{\Delta\log(invpc_t)} = \underset{(.048)}{.006} + \underset{(1.14)}{1.57}\,\Delta\log(price_t) + \underset{(.00190)}{.00004}\,t$$
n = 41, R2 = .048.
The coefficient on Δlog(pricet) has fallen substantially and is no longer significant at the 5%
level against a positive one-sided alternative. The R-squared is much smaller; Δlog(pricet)
explains very little variation in Δlog(invpct). Because differencing eliminates linear time trends,
it is not surprising that the estimate on the trend is very small and very statistically insignificant.
$$\widehat{ghrwage}_t = \underset{(.005)}{-.010} + \underset{(.167)}{.728}\,goutphr_t + \underset{(.166)}{.458}\,goutphr_{t-1}$$
The t statistic on the lag is about 2.76, so the lag is very significant.
(ii) We follow the hint and write the LRP as θ = β1 + β2, and then plug β1 = θ – β2 into the
original model:
$$ghrwage_t = \beta_0 + \theta\,goutphr_t + \beta_2\,(goutphr_{t-1} - goutphr_t) + u_t.$$
Therefore, we regress ghrwaget onto goutphrt, and (goutphrt-1 – goutphrt) and obtain the standard
error for θ̂. Doing this regression gives 1.186 [as we can compute directly from part (i)] and
se(θ̂) = .203. The t statistic for testing H0: θ = 1 is (1.186 – 1)/.203 ≈ .916, which is not
significant at the usual significance levels (not even 20% against a two-sided alternative).
(iii) When goutphrt-2 is added to the regression from part (i), and we use the 38 observations
now available for the regression, β̂3 ≈ .065 with a t statistic of about .41. Therefore, goutphrt-2 …
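The reparameterization in part (ii) can be checked numerically. The sketch below is an illustration under stated assumptions: the data are simulated (not the EARNS data), the coefficient values .7 and .4 are arbitrary, and the small `ols` helper is my own pure-Python routine rather than any package's API. It shows that the coefficient on x_t in the regression on x_t and (x_{t-1} − x_t) equals the sum of the two original slope estimates, and that its standard error comes for free.

```python
import random


def ols(X, y):
    """OLS via the normal equations; returns (coefficients, standard errors)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    # Gauss-Jordan inversion of X'X on the augmented matrix [X'X | I].
    aug = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(xtx)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    b = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(bi * xi for bi, xi in zip(b, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)
    se = [(s2 * inv[i][i]) ** 0.5 for i in range(k)]
    return b, se


# Simulated distributed lag data (assumed values, for illustration only).
random.seed(0)
x = [random.gauss(0, 1) for _ in range(201)]
y = [0.5 + 0.7 * x[t] + 0.4 * x[t - 1] + random.gauss(0, 1) for t in range(1, 201)]

X_orig = [[1.0, x[t], x[t - 1]] for t in range(1, 201)]        # x_t and x_{t-1}
X_lrp = [[1.0, x[t], x[t - 1] - x[t]] for t in range(1, 201)]  # x_t and (x_{t-1} - x_t)

b, _ = ols(X_orig, y)
b_lrp, se_lrp = ols(X_lrp, y)

lrp_direct = b[1] + b[2]                     # sum of the two slope estimates
theta_hat, se_theta = b_lrp[1], se_lrp[1]    # LRP and its standard error
print(lrp_direct, theta_hat, se_theta)
```

The point of the trick is the standard error: adding the two estimates gives θ̂ but not se(θ̂), which would otherwise require the covariance between β̂1 and β̂2.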
$$\widehat{return}_t = \underset{(.087)}{.226} + \underset{(.039)}{.049}\,return_{t-1} - \underset{(.0070)}{.0097}\,return_{t-1}^2$$
n = 689, R2 = .0063.
(ii) The null hypothesis is H0: β1 = β2 = 0. Only if both parameters are zero does
E(returnt|returnt-1) not depend on returnt-1. The F statistic is about 2.16 with p-value ≈ .116.
Therefore, we cannot reject H0 at the 10% level.
(iii) When we put returnt-1 ⋅ returnt-2 in place of return²t-1 the null can still be stated as in part
(ii): no past values of return, or any functions of them, should help us predict returnt. The R-
squared is about .0052 and F ≈ 1.80 with p-value ≈ .166. Here, we do not reject H0 at even the
15% level.
(iv) Predicting returnt based on past returns does not appear promising. Even though the F
statistic from part (ii) is almost significant at the 10% level, we have many observations. We
cannot even explain 1% of the variation in returnt.
The coefficient on Δunem has the sign that implies an inflation-unemployment tradeoff, and the
coefficient is quite large in magnitude. The t statistic on Δunem is about –2.68, which is very
(ii) Based on the R-squareds (or adjusted R-squareds), the model from part (i) explains Δinf
better than (11.19): the model with Δunem as the explanatory variable explains about three
percentage points more of the variation in Δinf.
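$$\widehat{\Delta gfr} = \underset{(1.05)}{-1.27} - \underset{(.027)}{.035}\,\Delta pe - \underset{(.028)}{.013}\,\Delta pe_{-1} - \underset{(.027)}{.111}\,\Delta pe_{-2} + \underset{(.0242)}{.0079}\,t$$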
The time trend coefficient is very insignificant, so it is not needed in the equation.
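$$\widehat{\Delta gfr} = \underset{(.582)}{-.650} - \underset{(.032)}{.075}\,\Delta pe - \underset{(.033)}{.051}\,\Delta pe_{-1} + \underset{(.028)}{.088}\,\Delta pe_{-2} + \underset{(2.83)}{4.84}\,ww2 - \underset{(1.00)}{1.68}\,pill$$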
The F statistic for joint significance is F = 2.82 with p-value ≈ .067. So ww2 and pill are not
jointly significant at the 5% level, but they are at the 10% level.
(iii) By regressing Δgfr on Δpe, (Δpe-1 − Δpe), (Δpe-2 − Δpe), ww2, and pill, we obtain the
LRP and its standard error as the coefficient on Δpe: −.075, se = .032. So the estimated LRP is
now negative and significant, which is very different from the equation in levels, (10.19) (the
estimated LRP was .101 with a t statistic of about 3.37). This is a good example of how
differencing variables before including them in a regression can lead to very different
conclusions than a regression in levels.
[Instructor’s Note: A variation on this exercise is to start with the model in levels and then
difference all of the independent variables, including the dummy variables ww2 and pill.]
$$\widehat{\Delta inven}_t = \underset{(3.64)}{2.59} + \underset{(.023)}{.152}\,\Delta GDP_t$$
n = 36, R2 = .554.
Both inven and GDP are measured in billions of dollars, so a one billion dollar change in GDP
changes inventory investment by $152 million. β̂1 is very statistically significant, with t ≈ 6.61.
$$\widehat{\Delta inven}_t = \underset{(3.69)}{3.00} + \underset{(.025)}{.159}\,\Delta GDP_t - \underset{(1.101)}{.895}\,r3_t$$
n = 36, R2 = .562.
The sign of βˆ2 is negative, as predicted by economic theory, and it seems practically large: a
one percentage point increase in r3t reduces inventories by almost $1 billion. However, βˆ2 is
not statistically different from zero. (Its t statistic is less than one in absolute value.)
If Δr3t is used instead, the coefficient becomes about −.470, se = 1.540. So this is even less
significant than when r3t is in the equation. But, without more data, we cannot conclude that
interest rates have a ceteris paribus effect on inventory investment.
11.14 (i) If E(gct|It-1) = E(gct) – that is, E(gct|It-1) does not depend on gct-1 – then β1 = 0 in gct =
β0 + β1gct-1 + ut. So the null hypothesis is H0: β1 = 0 and the alternative is H1: β1 ≠ 0. Estimating
the simple regression using the data in CONSUMP.RAW gives
$$\widehat{gc}_t = \underset{(.004)}{.011} + \underset{(.156)}{.446}\,gc_{t-1}$$
n = 35, R2 = .199.
The t statistic for βˆ1 is about 2.86, and so we strongly reject the PIH. The coefficient on gct-1 is
also practically large, showing significant autocorrelation in consumption growth.
(ii) When gyt-1 and i3t-1 are added to the regression, the R-squared becomes about .288. The
F statistic for joint significance of gyt-1 and i3t-1, obtained using the Stata “test” command, is 1.95,
with p-value ≈ .16. Therefore, gyt-1 and i3t-1 are not jointly significant at even the 15% level.
$$\widehat{unem}_t = \underset{(0.58)}{1.57} + \underset{(.097)}{.732}\,unem_{t-1}$$
n = 48, R2 = .554, σ̂ = 1.049.
In 1996 the unemployment rate was 5.4, so the predicted unemployment rate for 1997 is
1.57 + .732(5.4) ≈ 5.52. From the 1998 Economic Report of the President (p. 330), the U.S.
civilian unemployment rate was 4.9. Therefore, the equation overpredicts the 1997
unemployment rate by a nontrivial margin.
$$\widehat{unem}_t = 1.30 + .647\,unem_{t-1} + .184\,inf_{t-1}$$
(iii) To use the equation from part (ii) to predict unemployment in 1997, we also need the
inflation rate for 1996. This is given in PHILLIPS.RAW as 3.0. Therefore, the prediction of
unem in 1997 is 1.30 + .647(5.4) + .184(3.0) ≈ 5.35. This is still too large, but it is closer to 4.9 …
(iv) We use the model from part (iii) because inft-1 is very significant. To use the 95%
prediction interval from Section 6.4, we assume that unemt has a conditional normal distribution.
As shown in equation (6.36), we need the standard error of the predicted value as well as the
standard error of the regression. The latter is given in part (ii), σ̂ = .883. To obtain the
standard error of the predicted value, se(ŷ0) in the notation of Chapter 6, we need to find the
standard error of β̂0 + (5.4)β̂1 + (3.0)β̂2. We use the method described in Section 6.4: we run
the regression unemt on (unemt-1 – 5.4) and (inft-1 – 3.0), and obtain the intercept and standard
error from this regression. We know the intercept must be (approximately) 5.35 from part (iii).
The standard error is about .137. Therefore, from equation (6.36), the standard error of the
prediction error combines the two pieces, se(ê0) = [(.137)² + (.883)²]^(1/2), which is about .893. The
critical value for a 95% prediction interval (that is, for a 5% test against a two-sided
alternative) is 2.021, so the 95% prediction interval
for 1997 unemployment is 5.35 ± 2.021(.893), or about 3.54 to 7.16. The actual unemployment rate for
1997, 4.9, is comfortably in this interval. (If we forget to include σ̂ in obtaining the standard
error of the future value, the CI would be about 5.07 to 5.62, which excludes 4.9. But this is not
the correct prediction interval as it ignores the unobservables that affect unem in 1997.)
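The interval in part (iv) can be reproduced from the reported numbers. This is a sketch with a helper name of my own choosing; the inputs (.137, .883, 2.021, and the coefficients from part (ii)) are taken from the solution above. Because the code carries unrounded intermediate values, its endpoints can differ from the text in the second decimal place.

```python
import math


def prediction_interval(y0_hat, se_y0, sigma_hat, crit):
    """Interval for a future outcome, as in equation (6.36).

    se_y0     : standard error of the predicted value
    sigma_hat : standard error of the regression
    crit      : critical value (2.021 in the text)
    """
    se_e0 = math.sqrt(se_y0 ** 2 + sigma_hat ** 2)  # se of the prediction error
    return y0_hat - crit * se_e0, y0_hat + crit * se_e0, se_e0


# Point prediction for 1997 from the equation in part (ii).
y0_hat = 1.30 + 0.647 * 5.4 + 0.184 * 3.0

pi_low, pi_high, se_e0 = prediction_interval(y0_hat, 0.137, 0.883, 2.021)

# The interval that (incorrectly) ignores sigma_hat: a CI for the
# expected value only, not for the 1997 outcome itself.
ci_low = y0_hat - 2.021 * 0.137
ci_high = y0_hat + 2.021 * 0.137

print(round(y0_hat, 2), round(pi_low, 2), round(pi_high, 2))
print(round(ci_low, 2), round(ci_high, 2))
```

Nearly all of the width comes from σ̂, which is why leaving it out produces an interval that is far too narrow.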
[Instructor’s Note: This problem can be redone using more recent data, reported below in
Computer Exercise 11.17.]
11.16 (i) The first order autocorrelation for prcfat is .709, which is high but not necessarily a
cause for concern. For unem, ρ̂1 = .950, which is cause for concern in using unem as an …
(ii) If we use the first differences of prcfat and unem, but leave all other variables in their
original form, we get the following:
n = 107, R2 = .344,
where I have again suppressed the coefficients on the time trend and seasonal dummies. This
regression basically shows that the change in prcfat cannot be explained by the change in unem
or any of the policy variables. It does have some seasonality, which is why the R-squared is .344.
(iii) This is an example about how estimation in first differences loses the interesting
implications of the model estimated in levels. Of course, this is not to say the levels regression is
valid. But, as it turns out, we can reject a unit root in prcfat, and so we can at least justify using
it in level form; see Computer Exercise 18.22. Generally, the issue of whether to take first
differences is very difficult, even for professional time series econometricians.
11.17 (i) I got the unemployment rates from Table B−43 of the 2001 Economic Report of the
President and the inflation rates from Table B−63 from the same year. The numbers are in the
following table:
n = 51, R2 = .103
These estimates are similar to those obtained in equation (11.19), as we would hope. Both the
intercept and slope have gotten a little smaller in magnitude.
(ii) The estimate of the natural rate is obtained as in Example 11.5. The new estimate is
2.85/.520 ≈ 5.48, which is slightly smaller than the 5.58 obtained using only the data through
1996.
(iii) The first order autocorrelation of unem is about .75. This is one of those tough cases:
the correlation between unemt and unemt-1 is large, but it is not especially close to one.
(iv) As with the earlier data, the model with Δunemt as the explanatory variable fits
somewhat better:
n = 51, R2 = .132
CHAPTER 12
TEACHING NOTES
Most of this chapter deals with serial correlation, but it also explicitly considers
heteroskedasticity in time series regressions. The first section allows a review of what
assumptions were needed to obtain both finite sample and asymptotic results. Just as with
heteroskedasticity, serial correlation itself does not invalidate R-squared. In fact, if the data are
stationary and weakly dependent, R-squared and adjusted R-squared consistently estimate the
population R-squared (which is well-defined under stationarity).
Equation (12.4) is useful for explaining why the usual OLS standard errors are not generally
valid with AR(1) serial correlation. It also provides a good starting point for discussing serial
correlation-robust standard errors in Section 12.5. The subsection on serial correlation with
lagged dependent variables is included to debunk the myth that OLS is always inconsistent with
lagged dependent variables and serial correlation. I do not teach it to undergraduates, but I do to
master’s students.
Section 12.2 is somewhat untraditional in that it begins with an asymptotic t test for AR(1) serial
correlation (under strict exogeneity of the regressors). It may seem heretical not to give the
Durbin-Watson statistic its usual prominence, but I do believe the DW test is less useful than the
t test. With nonstrictly exogenous regressors I cover only the regression form of Durbin’s test, as
the h statistic is asymptotically equivalent and not always computable.
Section 12.3, on GLS and FGLS estimation, is fairly standard, although I try to show how
comparing OLS estimates and FGLS estimates is not so straightforward. Unfortunately, at the
beginning level (and even beyond), it is difficult to choose a course of action when they are very
different.
I do not usually cover Section 12.5 in a first-semester course, but, because some econometrics
packages routinely compute fully robust standard errors, students can be pointed to Section 12.5
if they need to learn something about what the corrections do. I do cover Section 12.5 for a
master’s level course in applied econometrics (after the first-semester course).
I also do not cover Section 12.6 in class; again, this is more to serve as a reference for more
advanced students, particularly those with interests in finance. One important point is that
ARCH is heteroskedasticity and not serial correlation, something that is confusing in many texts.
If a model contains no serial correlation, the usual heteroskedasticity-robust statistics are valid. I
have a brief subsection on correcting for a known form of heteroskedasticity and AR(1) errors in
models with strictly exogenous regressors.
SOLUTIONS TO PROBLEMS
12.1 We can reason this from equation (12.4) because the usual OLS standard error is an
estimate of σ/√SSTx. When the dependent and independent variables are in level (or log) form,
the AR(1) parameter, ρ, tends to be positive in time series regression models. Further, the
independent variables tend to be positively correlated, so (xt − x̄)(xt+j − x̄) – which is what
generally appears in (12.4) when the {xt} do not have zero sample average – tends to be positive
for most t and j. With multiple explanatory variables the formulas are more complicated but
have similar features.
If ρ < 0, or if the {xt} is negatively autocorrelated, the second term in the last line of (12.4)
could be negative, in which case the true standard deviation of β̂1 is actually less than σ/√SSTx.
12.2 This statement implies that we are still using OLS to estimate the βj. But we are not using
OLS; we are using feasible GLS (without or with the equation for the first time period). In other
words, neither the Cochrane-Orcutt nor the Prais-Winsten estimators are the OLS estimators (and
they usually differ from each other).
12.3 (i) Because U.S. presidential elections occur only every four years, it seems reasonable to
think the unobserved shocks – that is, elements in ut – in one election have pretty much
dissipated four years later. This would imply that {ut} is roughly serially uncorrelated.
(ii) The t statistic for H0: ρ = 0 is −.068/.240 ≈ −.28, which is very small. Further, the
estimate ρ̂ = −.068 is small in a practical sense, too. There is no reason to worry about serial
correlation in this example.
(iii) Because the test based on t_ρ̂ is only justified asymptotically, we would generally be
concerned about using the usual critical values with n = 20 in the original regression. But any
kind of adjustment, either to obtain valid standard errors for OLS as in Section 12.5 or a feasible
GLS procedure as in Section 12.3, relies on large sample sizes, too. (Remember, FGLS is not
even unbiased, whereas OLS is under TS.1 through TS.3.) Most importantly, the estimate of ρ is
practically small, too. With ρ̂ so close to zero, FGLS or adjusting the standard errors would
yield similar results to OLS with the usual standard errors.
12.4 This is false, and a source of confusion in several textbooks. (ARCH is often discussed as a
way in which the errors can be serially correlated.) As we discussed in Example 12.9, the errors
in the equation returnt = β0 + β1returnt-1 + ut are serially uncorrelated, but there is strong …
12.5 (i) There is substantial serial correlation in the errors of the equation, and the OLS standard
errors almost certainly underestimate the true standard deviation in β̂EZ. This makes the usual …
(ii) We can use the method in Section 12.5 to obtain an approximately valid standard error.
[See equation (12.43).] While we might use g = 2 in equation (12.42), with monthly data we
might want to try a somewhat longer lag, maybe even up to g = 12.
12.6 With the strong heteroskedasticity in the errors it is not too surprising that the robust
standard error for β̂1 differs from the OLS standard error by a substantial amount: the robust
standard error is almost 82% larger. Naturally, this reduces the t statistic. The robust t statistic
is .059/.069 ≈ .86, which is even less significant than before. Therefore, we conclude that, once
heteroskedasticity is accounted for, there is very little evidence that returnt-1 is useful for
predicting returnt.
12.7 Regressing ût on ût-1, using the 69 available observations, gives ρ̂ ≈ .292 and
se(ρ̂) ≈ .118. The t statistic is about 2.47, and so there is significant evidence of positive AR(1)
serial correlation in the errors (even though the variables have been differenced). This means we
should view the standard errors reported in equation (11.27) with some suspicion.
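The AR(1) test used here (and repeatedly in this chapter) is just a simple regression of the residuals on their own first lag. The following sketch uses simulated errors, not the residuals from equation (11.27); the function name and the AR coefficient of .5 are assumptions made for illustration. An intercept is included, which is harmless because OLS residuals average to zero.

```python
import random


def ar1_test(resid):
    """Regress u_t on u_{t-1} (with intercept); return (rho_hat, se, t)."""
    y = resid[1:]      # u_t,     t = 2, ..., n
    x = resid[:-1]     # u_{t-1}
    n = len(y)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    rho = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    alpha = ybar - rho * xbar
    ssr = sum((yi - alpha - rho * xi) ** 2 for xi, yi in zip(x, y))
    se = (ssr / (n - 2) / sxx) ** 0.5   # usual (nonrobust) OLS standard error
    return rho, se, rho / se


# Simulated AR(1) errors with rho = .5 (assumed value).
random.seed(1)
u = [0.0]
for _ in range(500):
    u.append(0.5 * u[-1] + random.gauss(0, 1))

rho_hat, se_rho, t_rho = ar1_test(u)
print(rho_hat, se_rho, t_rho)

# The ratio reported in Problem 12.7:
t_12_7 = 0.292 / 0.118
```

This version is valid only under strict exogeneity of the regressors and homoskedasticity; with lagged dependent variables the regressors from the original equation must be added to the test regression.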
12.8 (i) After estimating the FDL model by OLS, we obtain the residuals and run the regression
ût on ût-1, using 272 observations. We get ρ̂ ≈ .503 and t_ρ̂ ≈ 9.60, which is very strong …
(ii) When we estimate the model by iterated C-O, the LRP is estimated to be about 1.110.
(iii) We use the same trick as in Problem 11.5, except now we estimate the equation by
iterated C-O. In particular, write
where θ0 is the LRP and {ut} is assumed to follow an AR(1) process. Estimating this equation
by C-O gives θˆ0 ≈ 1.110 and se( θˆ0 ) ≈ .191. The t statistic for testing H0: θ0 = 1 is (1.110 –
1)/.191 ≈ .58, which is not close to being significant at the 5% level. So the LRP is not
statistically different from one.
12.9 (i) The test for AR(1) serial correlation gives (with 35 observations) ρ̂ ≈ –.110,
se( ρ̂ ) ≈ .175. The t statistic is well below one in absolute value, so there is no evidence of serial
correlation in the accelerator model. If we view the test of serial correlation as a test of dynamic
misspecification, it reveals no dynamic misspecification in the accelerator model.
(ii) It is worth emphasizing that, if there is little evidence of AR(1) serial correlation, there is
no need to use feasible GLS (Cochrane-Orcutt or Prais-Winsten).
12.10 (i) After obtaining the residuals ût from equation (11.16) and then estimating (12.48), we
can compute the fitted values ĥt = 4.66 – 1.104 returnt-1 for each t. This is easily done in a single
command using most software packages. It turns out that 12 of 689 fitted values are negative.
Among other things, this means we cannot directly apply weighted least squares using the
heteroskedasticity function in (12.48).
So the conditional variance is a quadratic in returnt-1, in this case a U-shape that bottoms out
at .789/[2(.297)] ≈ 1.33. Now, there are no fitted values less than zero.
… (se ≈ .078) and β̂1 ≈ .039 (se ≈ .046). So the coefficient on returnt-1, once weighted least
squares has been used, is even less significant (t statistic ≈ .85) than when we used OLS.
(iv) To obtain the WLS using an ARCH variance function we first estimate the equation in
(12.51) and obtain the fitted values, ĥt. The WLS estimates are now β̂0 ≈ .159 (se ≈ .076) and
β̂1 ≈ .024 (se ≈ .047). The coefficient and t statistic are even smaller. Therefore, once we
account for heteroskedasticity via one of the WLS methods, there is virtually no evidence that
E(returnt|returnt-1) depends linearly on returnt-1.
demwins = .441 − .473 partyWH + .479 incum + .059 partyWH ⋅ gnews
(.107) (.354) (.205) (.036)
− .024 partyWH ⋅ inf
(.028)
The largest t statistic is on incum, which is estimated to have a large effect on the probability of
winning. But we must be careful here. incum is equal to 1 if a Democratic incumbent is running
and –1 if a Republican incumbent is running. Similarly, partyWH is equal to 1 if a Democrat is
currently in the White House and –1 if a Republican is currently in the White House. So, for an
incumbent Democrat running, we must add the coefficients on partyWH and incum together, and
this nets out to about zero.
The economic variables are less statistically significant than in equation (10.23). The gnews
interaction has a t statistic of about 1.64, which is significant at the 10% level against a one-sided
alternative. (Since the dependent variable is binary, this is a case where we must appeal to
asymptotics. Unfortunately, we have only 20 observations.) The inflation variable has the
expected sign but is not statistically significant.
(ii) There are two fitted values less than zero, and two fitted values greater than one.
(iii) Out of the 10 elections with demwins = 1, 8 of these are correctly predicted. Out of the
10 elections with demwins = 0, 7 are correctly predicted. So 15 out of 20 elections through 1992
are correctly predicted. (But, remember, we used data from these years to obtain the estimated
equation.)
(iv) The explanatory variables are partyWH = 1, incum = 1, gnews = 3, and inf = 3.019.
Therefore, for 1996,
demwins = .441 − .473 + .479 + .059(3) − .024(3.019) ≈ .552.
Because this is above .5, we would have predicted that Clinton would win the 1996 election, as
he did.
(v) The regression of uˆt on uˆt −1 produces ρ̂ ≈ -.164 with heteroskedasticity-robust standard
error of about .195. (Because the LPM contains heteroskedasticity, testing for AR(1) serial
correlation in an LPM generally requires a heteroskedasticity-robust test.) Therefore, there is
little evidence of serial correlation in the errors. (And, if anything, it is negative.)
(vi) The heteroskedasticity-robust standard errors are given in [ ⋅ ] below the usual standard
errors:
demwins = .441 − .473 partyWH + .479 incum + .059 partyWH ⋅ gnews
(.107) (.354) (.205) (.036)
[.086] [.301] [.185] [.030]
– .024 partyWH ⋅ inf
(.028)
[.019]
In fact, all heteroskedasticity-robust standard errors are less than the usual OLS standard errors,
making each variable more significant. For example, the t statistic on partyWH ⋅ gnews becomes
about 1.97, which is notably above 1.64. But we must remember that the standard errors in the
LPM have only asymptotic justification. With only 20 observations it is not clear we should
prefer the heteroskedasticity-robust standard errors to the usual ones.
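The arithmetic in parts (i), (iv), and (vi) is easy to verify from the reported coefficients. The sketch below hard-codes the estimates and standard errors printed above; the dictionary keys and the function name are my own labels, not variable names from the data set.

```python
# Estimated LPM coefficients from part (i).
coef = {
    "const": 0.441,
    "partyWH": -0.473,
    "incum": 0.479,
    "partyWH_gnews": 0.059,
    "partyWH_inf": -0.024,
}


def predict_demwins(partyWH, incum, gnews, inf):
    """Fitted probability of a Democratic win from the estimated LPM."""
    return (coef["const"]
            + coef["partyWH"] * partyWH
            + coef["incum"] * incum
            + coef["partyWH_gnews"] * partyWH * gnews
            + coef["partyWH_inf"] * partyWH * inf)


# Part (iv): values of the explanatory variables for 1996.
p_1996 = predict_demwins(partyWH=1, incum=1, gnews=3, inf=3.019)

# Part (i): for an incumbent Democrat the two coefficients net out to about zero.
net_incumbent_democrat = coef["partyWH"] + coef["incum"]

# Part (vi): t statistics on partyWH*gnews, usual versus robust standard error.
t_usual = coef["partyWH_gnews"] / 0.036
t_robust = coef["partyWH_gnews"] / 0.030

print(round(p_1996, 3), round(net_incumbent_democrat, 3))
print(round(t_usual, 2), round(t_robust, 2))
```

Because this is a linear probability model, nothing constrains the fitted values to the unit interval, which is what part (ii) documents.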
12.12 (i) The regression uˆt on uˆt −1 (with 35 observations) gives ρ̂ ≈ −.089 and se( ρ̂ ) ≈ .178;
there is no evidence of AR(1) serial correlation in this equation, even though it is a static model
in the growth rates.
(ii) We regress gct on gct-1 and obtain the residuals ût. When we then regress û²t on gct-1 and
gc²t-1 (using 35 observations), the F statistic (with 2 and 32 df) is about 1.08. The p-value is
about .352, and so there is little evidence of heteroskedasticity in the AR(1) model for gct. This
means that we need not modify our test of the PIH by correcting somehow for heteroskedasticity.
12.13 (i) The iterated Prais-Winsten estimates are given below. The estimate of ρ is, to three
decimal places, .293, which is the same as the estimate used in the final iteration of Cochrane-
Orcutt:
n = 131, R2 = .202
(ii) Not surprisingly, the C-O and P-W estimates are quite similar. To three decimal places,
they use the same value of ρ̂ (to four decimal places it is .2934 for C-O and .2932 for P-W).
The only practical difference is that P-W uses the equation for t = 1. With n = 131, we hope this
makes little difference.
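The only mechanical difference between the two estimators is how the first observation is treated in the quasi-differencing step. Here is a minimal sketch of that transformation for a single series; the function name and the four data points are hypothetical, and ρ = .293 is the rounded estimate from part (i). A full FGLS routine would apply the same transformation to every regressor, including the constant.

```python
import math


def quasi_difference(z, rho, keep_first=False):
    """Quasi-difference a series: z_t - rho * z_{t-1} for t >= 2.

    keep_first=False mimics Cochrane-Orcutt (first observation dropped).
    keep_first=True mimics Prais-Winsten (first observation kept,
    scaled by sqrt(1 - rho**2)).
    """
    out = [z[t] - rho * z[t - 1] for t in range(1, len(z))]
    if keep_first:
        out.insert(0, math.sqrt(1.0 - rho ** 2) * z[0])
    return out


y = [1.0, 2.0, 1.5, 3.0]           # hypothetical data
rho_hat = 0.293

y_co = quasi_difference(y, rho_hat)                   # n - 1 observations
y_pw = quasi_difference(y, rho_hat, keep_first=True)  # n observations
print(len(y_co), len(y_pw))
```

With n = 131 the extra observation is less than one percent of the sample, which is why the two sets of estimates are so close.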
12.14 (i) This is the model that was estimated in part (vi) of Computer Exercise 10.17. After
getting the OLS residuals, ût, we run the regression ût on ût-1, t = 2,...,108. (I included an
intercept, but that is unimportant.) The coefficient on ût-1 is ρ̂ = .281 (se = .094). Thus, there is
evidence of some positive serial correlation in the errors (t ≈ 2.99). A strong case can be made
that all explanatory variables are strictly exogenous. Certainly there is no concern about the time
trend, the seasonal dummy variables, or wkends, as these are determined by the calendar. It
seems safe to assume that unexplained changes in prcfat today do not cause future changes in the
state-wide unemployment rate. Also, over this period, the policy changes were permanent once
they occurred, so strict exogeneity seems reasonable for spdlaw and beltlaw. (Given legislative
lags, it seems unlikely that the dates the policies went into effect had anything to do with recent,
unexplained changes in prcfat.)
(ii) Remember, we are still estimating the βj by OLS, but we are computing different
standard errors that have some robustness to serial correlation. Using Stata 7.0, I get
β̂spdlaw = .0671, se(β̂spdlaw) = .0267 and β̂beltlaw = −.0295, se(β̂beltlaw) = .0331. The t statistic for
spdlaw has fallen to about 2.5, but it is still significant. Now, the t statistic on beltlaw is less than
one in absolute value, so there is little evidence that beltlaw had an effect on prcfat.
(iii) For brevity, I do not report the time trend and monthly dummies. The final estimate of ρ
is ρ̂ = .289:
$$\widehat{prcfat} = \underset{(.102)}{1.009} + \ldots + \underset{(.00500)}{.00062}\,wkends - \underset{(.0055)}{.0132}\,unem$$
n = 108, R2 = .641
There are no drastic changes. Both policy variable coefficients get closer to zero, and the
standard errors are bigger than the incorrect OLS standard errors [and, coincidentally, pretty
close to the Newey-West standard errors for OLS from part (ii)]. So the basic conclusion is the
same: the increase in the speed limit appeared to increase prcfat, but the seat belt law, while it is
estimated to decrease prcfat, does not have a statistically significant effect.
log (avgprc) = −.073 − .0040 t − .0101 mon − .0088 tues + .0376 wed + .0906 thurs
(.115) (.0014) (.1294) (.1273) (.1257) (.1257)
n = 97, R2 = .086
The test for joint significance of the day-of-the-week dummies is F = .23, which gives p-value
= .92. So there is no evidence that the average price of fish varies systematically within a week.
log (avgprc) = −.920 − .0012 t − .0182 mon − .0085 tues + .0500 wed + .1225 thurs
(.190) (.0014) (.1141) (.1121) (.1117) (.1110)
n = 97, R2 = .310
Each of the wave variables is statistically significant, with wave2 being the most important.
Rough seas (as measured by high waves) would reduce the supply of fish (shift the supply curve
back), and this would result in a price increase. One might argue that bad weather reduces the
demand for fish at a market, too, but that would reduce price. If there are demand effects
captured by the wave variables, they are being swamped by the supply effects.
(iii) The time trend coefficient becomes much smaller and statistically insignificant. We can
use the omitted variable bias table from Chapter 3, Table 3.2 (page 92) to determine what is
probably going on. Without wave2 and wave3, the coefficient on t seems to have a downward
bias. Since we know the coefficients on wave2 and wave3 are positive, this means the wave
variables are negatively correlated with t. In other words, the seas were rougher, on average, at
the beginning of the sample period. (You can confirm this by regressing wave2 on t and wave3
on t.)
(iv) The time trend and daily dummies are clearly strictly exogenous, as they are just
functions of time and the calendar. Further, the height of the waves is not influenced by past
unexpected changes in log(avgprc).
(v) We simply regress the OLS residuals on one lag, getting ρ̂ = .618, se(ρ̂) = .081, t_ρ̂ = 7.63.
Therefore, there is strong evidence of positive serial correlation.
(vi) The Newey-West standard errors are se(β̂wave2) = .0234 and se(β̂wave3) = .0195. Given the
significant amount of AR(1) serial correlation in part (v), it is somewhat surprising that these
standard errors are not much larger compared with the usual, incorrect standard errors. In fact,
the Newey-West standard error for β̂wave3 is actually smaller than the OLS standard error.
log (avgprc) = −.658 − .0007 t + .0099 mon + .0025 tues + .0624 wed + .1174 thurs
(.239) (.0029) (.0652) (.0744) (.0746) (.0621)
n = 97, R2 = .135
The coefficient on wave2 drops by a nontrivial amount, but it still has a t statistic of almost 3.
CHAPTER 13
TEACHING NOTES
While this chapter falls under “Advanced Topics,” most of this chapter requires no more
sophistication than the previous chapters. (In fact, I would argue that, with the possible
exception of Section 13.5, this material is easier than some of the time series chapters.)
Two years of panel data are often available, in which case differencing across time is a simple
way of removing unobserved heterogeneity. If you have covered Chapter 9, you might
compare this with a regression in levels using the second year of data, but where a lagged
dependent variable is included. (The second approach only requires collecting information on
the dependent variable in a previous year.) These often give similar answers. Two years of
panel data, collected before and after a policy change, can be very powerful for policy analysis.
Having more than two periods of panel data causes slight complications in that the errors in the
differenced equation may be serially correlated. (However, the traditional assumption that the
errors in the original equation are serially uncorrelated is not always a good one. In other words,
it is not always more appropriate to use fixed effects, as in Chapter 14, than first differencing.)
With large N and relatively small T, a simple way to account for possible serial correlation after
differencing is to compute standard errors that are robust to arbitrary serial correlation and
heteroskedasticity. Econometrics packages that do cluster analysis (such as Stata) often allow
this by specifying each cross-sectional unit as its own cluster.
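For students who doubt that differencing really removes unobserved heterogeneity, a short simulation can help. The sketch below is my own illustration with made-up parameter values (true slope 2, an unobserved effect a_i that is correlated with x in both years); it compares a second-year levels regression with the first-differenced regression.

```python
import random


def slope(x, y):
    """Simple-regression OLS slope of y on x (with intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


random.seed(2)
N, beta1 = 500, 2.0          # assumed sample size and true slope
rows = []
for _ in range(N):
    a_i = random.gauss(0, 3)                 # unobserved, time-constant effect
    x1 = a_i + random.gauss(0, 1)            # x is correlated with a_i
    x2 = a_i + random.gauss(0, 1)
    y1 = 1.0 + beta1 * x1 + a_i + random.gauss(0, 1)
    y2 = 1.5 + beta1 * x2 + a_i + random.gauss(0, 1)   # 1.5 allows a year effect
    rows.append((x1, y1, x2, y2))

# Levels regression using only the second year: a_i sits in the error term.
b_levels = slope([r[2] for r in rows], [r[3] for r in rows])

# First-differenced regression: a_i drops out of y2 - y1.
b_fd = slope([r[2] - r[0] for r in rows], [r[3] - r[1] for r in rows])

print(b_levels, b_fd)
```

In this design the levels slope is biased upward because a_i and x are positively correlated, while the differenced slope is centered on the true value; the intercept in the differenced equation picks up the change in the year effect.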
SOLUTIONS TO PROBLEMS
13.1 Without changes in the averages of any explanatory variables, the average fertility rate fell
by .545 between 1972 and 1984; this is simply the coefficient on y84. To account for the
increase in average education levels, we obtain an additional effect: –.128(13.3 – 12.2) ≈ –.141.
So the drop in average fertility if the average education level increased by 1.1 is .545
+ .141 = .686, or roughly two-thirds of a child per woman.
13.2 The first equation omits the 1981 year dummy variable, y81, and so does not allow any
appreciation in nominal housing prices over the three year period in the absence of an incinerator.
The interaction term in this case is simply picking up the fact that even homes that are near the
incinerator site have appreciated in value over the three years. This equation suffers from
omitted variable bias.
The second equation omits the dummy variable for being near the incinerator site, nearinc,
which means it does not allow for systematic differences in homes near and far from the site
before the site was built. If, as seems to be the case, the incinerator was located closer to less
valuable homes, then omitting nearinc attributes lower housing prices too much to the
incinerator effect. Again, we have an omitted variable problem. This is why equation (13.9) (or,
even better, the equation that adds a full set of controls), is preferred.
13.3 We do not have repeated observations on the same cross-sectional units in each time period,
and so it makes no sense to look for pairs to difference. For example, in Example 13.1, it is very
unlikely that the same woman appears in more than one year, as new random samples are
obtained in each year. In Example 13.3, some houses may appear in the sample for both 1978
and 1981, but the overlap is usually too small to do a true panel data analysis.
13.4 The sign of β1 does not affect the direction of bias in the OLS estimator of β1 , but only
whether we underestimate or overestimate the effect of interest. If we write Δcrmrtei = δ0 +
β1Δunemi + Δui, where Δui and Δunemi are negatively correlated, then there is a downward bias
in the OLS estimator of β1. Because β1 > 0, we will tend to underestimate the effect of
unemployment on crime.
13.5 No, we cannot include age as an explanatory variable in the original model. Each person in
the panel data set is exactly two years older on January 31, 1992 than on January 31, 1990. This
means that ∆agei = 2 for all i. But the equation we would estimate is of the form
Δsavingi = δ0 + β1Δagei + …,
where δ0 is the coefficient on the year dummy for 1992 in the original model. As we know, when
we have an intercept in the model we cannot include an explanatory variable that is constant
across i; this violates Assumption MLR.3. Intuitively, since age changes by the same amount for
everyone, we cannot distinguish the effect of age from the aggregate time effect.
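A minimal sketch of why the differenced age variable cannot be used, assuming a handful of made-up ages (only the two-year gap comes from the problem). With an intercept in the equation, the OLS slope divides by the sample variation in the regressor, and that variation is zero here.

```python
# Hypothetical illustration: the ages are made up, only the two-year gap matters.
age_1990 = [25, 31, 44, 52, 60]
age_1992 = [a + 2 for a in age_1990]

# First difference of age for each person in the two-period panel.
d_age = [a92 - a90 for a90, a92 in zip(age_1990, age_1992)]

# Sample variation in d_age around its mean; this is the denominator of the
# OLS slope when the regression includes an intercept.
mean_d_age = sum(d_age) / len(d_age)
sst_d_age = sum((d - mean_d_age) ** 2 for d in d_age)

print(d_age)      # every entry equals 2
print(sst_d_age)  # zero, so the slope on d_age is not identified
```

Because the total variation is exactly zero, the change in age is perfectly collinear with the intercept, which is the violation of Assumption MLR.3 described above.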
13.6 (i) Let FL be a binary variable equal to one if a person lives in Florida, and zero otherwise.
Let y90 be a year dummy variable for 1990. Then, from equation (13.10), we have the linear
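probability model
arrest = β0 + δ0 y90 + β1 FL + δ1 y90 ⋅ FL + u,
where arrest equals one if the person was arrested for drunk driving, and zero otherwise.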
The effect of the law is measured by δ1, which is the change in the probability of drunk driving
arrest due to the new law in Florida. Including y90 allows for aggregate trends in drunk driving
arrests that would affect both states; including FL allows for systematic differences between
Florida and Georgia in either drunk driving behavior or law enforcement.
(ii) It could be that the populations of drivers in the two states change in different ways over
time. For example, age, race, or gender distributions may have changed. The levels of education
across the two states may have changed. As these factors might affect whether someone is
arrested for drunk driving, it could be important to control for them. At a minimum, there is the
possibility of obtaining a more precise estimator of δ1 by reducing the error variance. Essentially,
any explanatory variable that affects arrest can be used for this purpose. (See Section 6.3 for
discussion.)
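The role of δ1 in a model with a year dummy, a state dummy, and their interaction can be illustrated with simulated data. The sketch below is only an illustration: the arrest probabilities are invented, and the helper functions `solve`, `ols`, and `cell_mean` are hypothetical names written for this example. It shows that, without additional controls, the OLS coefficient on the interaction equals the difference-in-differences of the four cell means.

```python
import random

random.seed(0)

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Invented arrest probabilities by cell; keys are (FL, y90).
p = {(0, 0): 0.10, (0, 1): 0.12, (1, 0): 0.15, (1, 1): 0.11}
rows, y = [], []
for (fl, y90), prob in p.items():
    for _ in range(2000):
        rows.append([1, y90, fl, y90 * fl])   # columns: constant, y90, FL, y90*FL
        y.append(1 if random.random() < prob else 0)

b0, d0, b1, d1 = ols(rows, y)

def cell_mean(fl, y90):
    """Sample arrest rate for one state-year cell."""
    vals = [yi for r, yi in zip(rows, y) if r[2] == fl and r[1] == y90]
    return sum(vals) / len(vals)

did = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))
print(d1, did)  # the two numbers agree up to floating-point error
```

Adding further explanatory variables, as suggested in part (ii), breaks this exact equality but can reduce the error variance.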
13.7 (i) The F statistic (with 4 and 1,111 df) is about 1.16 and p-value ≈ .328, which shows that
the living environment variables are jointly insignificant.
(ii) The F statistic (with 3 and 1,111 df) is about 3.01 and p-value ≈ .029, and so the region
dummy variables are jointly significant at the 5% level.
(iii) After obtaining the OLS residuals, û, from estimating the model in Table 13.1, we run
the regression û² on y74, y76, …, y84 using all 1,129 observations. The null hypothesis of
homoskedasticity is H0: γ1 = 0, γ2 = 0, … , γ6 = 0. So we just use the usual F statistic for joint
significance of the year dummies. The R-squared is about .0153 and F ≈ 2.90; with 6 and 1,122
df, the p-value is about .0082. So there is evidence of heteroskedasticity that is a function of
time at the 1% significance level. This suggests that, at a minimum, we should compute
heteroskedasticity-robust standard errors, t statistics, and F statistics. We could also use
weighted least squares (although the form of heteroskedasticity used here may not be sufficient;
it does not depend on educ, age, and so on).
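The F statistic in part (iii) can be recovered from the reported R-squared using the R-squared form of the F statistic. The sketch below assumes only the rounded values quoted in the text; the function name `f_from_r2` is a hypothetical helper, and the small gap from the reported 2.90 reflects rounding of the R-squared.

```python
def f_from_r2(r2, q, df_resid):
    """R-squared form of the F statistic for joint significance of q regressors."""
    return (r2 / q) / ((1.0 - r2) / df_resid)

# Values quoted in the text: R-squared of about .0153, 6 year dummies, 1,122 residual df.
f_stat = f_from_r2(0.0153, 6, 1122)
print(f_stat)  # roughly 2.9
```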
(iv) Adding y74 ⋅ educ, …, y84 ⋅ educ allows the relationship between fertility and education
to be different in each year; remember, the coefficient on the interaction gets added to the
coefficient on educ to get the slope for the appropriate year. When these interaction terms are
added to the equation, R2 ≈ .137. The F statistic for joint significance (with 6 and 1,105 df) is
about 1.48 with p-value ≈ .18. Thus, the interactions are not jointly significant at even the 10%
level. This is a bit misleading, however. An abbreviated equation (which just shows the
coefficients on the terms involving educ) is
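$$
\begin{aligned}
\widehat{\mathit{kids}} = {} & \underset{(3.13)}{-8.48} - \underset{(.054)}{.023}\,\mathit{educ} + \dots - \underset{(.073)}{.056}\,y74 \cdot \mathit{educ} - \underset{(.071)}{.092}\,y76 \cdot \mathit{educ} \\
& - \underset{(.075)}{.152}\,y78 \cdot \mathit{educ} - \underset{(.070)}{.098}\,y80 \cdot \mathit{educ} - \underset{(.068)}{.139}\,y82 \cdot \mathit{educ} - \underset{(.070)}{.176}\,y84 \cdot \mathit{educ}.
\end{aligned}
$$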
Three of the interaction terms, y78 ⋅ educ, y82 ⋅ educ, and y84 ⋅ educ are statistically significant at
the 5% level against a two-sided alternative, with the p-value on the latter being about .012. The
coefficients are large in magnitude as well. The coefficient on educ – which is for the base year,
1972 – is small and insignificant, suggesting little if any relationship between fertility and
education in the early seventies. The estimates above are consistent with fertility becoming more
linked to education as the years pass. The F statistic is insignificant because we are testing some
insignificant coefficients along with some significant ones.
13.8 (i) The coefficient on y85 is roughly the proportionate change in wage for a male (female =
0) with zero years of education (educ = 0). This is not especially useful since we are not
interested in people with no education.
(ii) What we want to estimate is θ0 = δ0 + 12δ1; this is the change in the intercept for a male
with 12 years of education, where we also hold other factors fixed. If we write δ0 = θ0 − 12δ1,
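plug this into (13.1), and rearrange, the terms involving y85 become
δ0 y85 + δ1 y85 ⋅ educ = (θ0 − 12δ1) y85 + δ1 y85 ⋅ educ = θ0 y85 + δ1 y85 ⋅ (educ – 12),
with all other terms in the equation unchanged.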
Therefore, we simply replace y85 ⋅ educ with y85 ⋅ (educ – 12), and then the coefficient and
standard error we want is on y85. These turn out to be θˆ0 = .339 and se( θˆ0 ) = .034. Roughly,
the nominal increase in wage is 33.9%, and the 95% confidence interval is 33.9 ± 1.96(3.4), or
about 27.2% to 40.6%. (Because the proportionate change is large, we could use equation (7.10),
which implies the point estimate 40.4%; but obtaining the standard error of this estimate is
harder.)
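The interval and the exact percentage change quoted in part (ii) can be checked with a few lines of arithmetic. This sketch uses only the estimate and standard error reported in the text; the variable names are illustrative.

```python
import math

theta_hat = 0.339   # coefficient on y85 after recentering educ at 12 (from the text)
se_theta = 0.034    # its standard error (from the text)

# Approximate 95% confidence interval using the normal critical value 1.96.
lower = theta_hat - 1.96 * se_theta
upper = theta_hat + 1.96 * se_theta

# Exact percentage change implied by a log difference: 100 * [exp(theta) - 1].
exact_pct = 100 * (math.exp(theta_hat) - 1)

print(round(100 * lower, 1), round(100 * upper, 1))  # 27.2 40.6
print(round(exact_pct, 1))                           # 40.4
```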
(iii) Only the coefficient on y85 differs from equation (13.2). The new coefficient is about
–.383 (se ≈ .124). This shows that real wages have fallen over the seven year period, although
less so for the more educated. For example, the proportionate change for a male with 12 years of
education is –.383 + .0185(12) = −.161, or a fall of about 16.1%. For a male with 20 years of
education, the same calculation gives –.383 + .0185(20) = −.013, so there has been almost no change.
(iv) The R-squared when log(rwage) is the dependent variable is .356, as compared with .426
when log(wage) is the dependent variable. If the SSRs from the regressions are the same, but the
R-squareds are not, then the total sum of squares must be different. This is the case, as the
dependent variables in the two equations are different.
(v) In 1978, about 30.6% of workers in the sample belonged to a union. In 1985, only about
18% belonged to a union. Therefore, over the seven-year period, there was a notable fall in
union membership.
(vi) When y85 ⋅ union is added to the equation, its coefficient and standard error are about
−.00040 (se ≈ .06104). This is practically very small and the t statistic is almost zero. There has
been no change in the union wage premium over time.
(vii) Parts (v) and (vi) are not at odds. They imply that while the economic return to union
membership has not changed (assuming we think we have estimated a causal effect), the fraction
of people reaping those benefits has fallen.
13.9 (i) Other things equal, homes farther from the incinerator should be worth more, so δ1 > 0.
If β1 > 0, then the incinerator was located farther away from more expensive homes.
(ii) The estimated equation is
log (price) = 8.06 − .011 y81 + .317 log(dist) + .048 y81 ⋅ log(dist)
(0.51) (.805) (.052) (.082)
While δ̂1 = .048 is the expected sign, it is not statistically significant (t statistic ≈ .59).
(iii) When we add the list of housing characteristics to the regression, the coefficient on
y81 ⋅ log(dist) becomes .062 (se = .050). So the estimated effect is larger – the elasticity of price
with respect to dist is .062 after the incinerator site was chosen – but its t statistic is only 1.24.
The p-value for the one-sided alternative H1: δ1 > 0 is about .108, which is close to being
significant at the 10% level.
13.10 (i) In addition to male and married, we add the variables head, neck, upextr, trunk,
lowback, lowextr, and occdis for injury type, and manuf and construc for industry. The
coefficient on afchnge ⋅ highearn becomes .231 (se ≈ .070), and so the estimated effect and t
statistic are now larger than when we omitted the control variables. The estimate .231 implies a
substantial response of durat to the change in the cap for high-earnings workers.
(ii) The R-squared is about .041, which means we are explaining only 4.1% of the variation
in log(durat). This means that there are some very important factors that affect log(durat) that
we are not controlling for. While this means that predicting log(durat) would be very difficult
for a particular individual, it does not mean that there is anything biased about δ̂1: it could still
be an unbiased estimator of the causal effect of changing the earnings cap for workers’
compensation.
(iii) The estimated equation for Michigan is
log (durat) = 1.413 + .097 afchnge + .169 highearn + .192 afchnge ⋅ highearn
(0.057) (.085) (.106) (.154)
n = 1,524, R2 = .012.
The estimate of δ1, .192, is remarkably close to the estimate obtained for Kentucky (.191).
The estimate for Michigan is not statistically significant at even the 10% level against δ1 > 0.
However, the standard error for the Michigan estimate is much higher (.154 compared with .069).
Even though we have over 1,500 observations, we cannot get a very precise estimate. (For
Kentucky, we have over 5,600 observations.)
13.11 (i) The pooled OLS estimates are
log (rent) = −.569 + .262 d90 + .041 log(pop) + .571 log(avginc) + .0050 pctstu
(.535) (.035) (.023) (.053) (.0010)
n = 128, R2 = .861.
The positive and very significant coefficient on d90 simply means that, other things in the
equation fixed, nominal rents grew by over 26% over the 10 year period. The coefficient on
pctstu means that a one percentage point increase in pctstu increases rent by half a percent (.5%).
The t statistic of five shows that, at least based on the usual analysis, pctstu is very statistically
significant.
(ii) The standard errors from part (i) are not valid, unless we think ai does not really appear in
the equation. If ai is in the error term, the errors across the two time periods for each city are
positively correlated, and this invalidates the usual OLS standard errors and t statistics.
Interestingly, the effect of pctstu is over twice as large as we estimated in the pooled OLS
equation. Now, a one percentage point increase in pctstu is estimated to increase rental rates by
about 1.1%. Not surprisingly, we obtain a much less precise estimate when we difference
(although the OLS standard errors from part (i) are likely to be much too small because of the
positive serial correlation in the errors within each city). While we have differenced away ai,
there may be other unobservables that change over time and are correlated with Δpctstu.
(iv) The heteroskedasticity-robust standard error on Δpctstu is about .0028, which is actually
much smaller than the usual OLS standard error. This only makes pctstu even more significant
(robust t statistic ≈ 4). Note that serial correlation is no longer an issue because we have no time
component in the first-differenced equation.
13.12 (i) You may use an econometrics software package that directly tests restrictions such as
H0: β1 = β2 after estimating the unrestricted model in (13.22). But, as we have seen many times,
we can simply rewrite the equation to test this using any regression software. Write the
differenced equation as
Following the hint, we define θ1 = β1 − β2, and then write β1 = θ1 + β2. Plugging this into the
differenced equation and rearranging gives
Estimating this equation by OLS gives θˆ1 = .0091, se( θˆ1 ) = .0085. The t statistic for H0: β1 = β2
is .0091/.0085 ≈ 1.07, which is not statistically significant.
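The two displayed equations referred to in part (i) are not reproduced above. As a sketch of the algebra only, write the dependent variable as Δy and the two regressors as Δx1 and Δx2; these are placeholder names assumed for illustration, not the variable names in (13.22). Substituting β1 = θ1 + β2 gives:

```latex
\begin{aligned}
\Delta y &= \delta_0 + \beta_1 \Delta x_1 + \beta_2 \Delta x_2 + \Delta u \\
         &= \delta_0 + (\theta_1 + \beta_2)\,\Delta x_1 + \beta_2 \Delta x_2 + \Delta u \\
         &= \delta_0 + \theta_1 \Delta x_1 + \beta_2\,(\Delta x_1 + \Delta x_2) + \Delta u .
\end{aligned}
```

Under this rewriting, regressing Δy on Δx1 and the sum (Δx1 + Δx2) delivers the estimate of θ1 and its standard error directly as the coefficient on Δx1, so the usual t statistic tests H0: β1 = β2.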
(ii) Since we did not reject the hypothesis in part (i), we would be justified in using the simpler
model with avgclr. Based on adjusted R-squared, we have a slightly worse fit with the restriction
imposed. But this is a minor consideration. Ideally, we could get more data to determine
whether the fairly different unconstrained estimates of β1 and β2 in equation (13.22) reveal true
differences in β1 and β2.
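13.13 (i) The estimated equation is
$$
\begin{aligned}
\widehat{\mathit{trmgpa}} = {} & \underset{(0.35)}{-1.75} - \underset{(.048)}{.058}\,\mathit{spring} + \underset{(.00015)}{.00170}\,\mathit{sat} - \underset{(.0010)}{.0087}\,\mathit{hsperc} \\
& + \underset{(.052)}{.350}\,\mathit{female} - \underset{(.123)}{.254}\,\mathit{black} - \underset{(.117)}{.023}\,\mathit{white} - \underset{(.076)}{.035}\,\mathit{frstsem} \\
& - \underset{(.00073)}{.00034}\,\mathit{tothrs} + \underset{(0.104)}{1.048}\,\mathit{crsgpa} - \underset{(.049)}{.027}\,\mathit{season}
\end{aligned}
$$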
The coefficient on season implies that, other things fixed, an athlete’s term GPA is about .027
points lower when his/her sport is in season. On a four point scale, this is a modest effect (although
(ii) The quick answer is that if omitted ability is correlated with season then, as we know
from Chapters 3 and 5, OLS is biased and inconsistent. The fact that we are pooling across two
semesters does not change that basic point.
If we think harder, the direction of the bias is not clear, and this is where pooling across
semesters plays a role. First, suppose we used only the fall term, when football is in season.
Then the error term and season would be negatively correlated, which produces a downward bias
in the OLS estimator of βseason. Because βseason is hypothesized to be negative, an OLS regression
using only the fall data produces a downward biased estimator. [When just the fall data are used,
β̂season = −.116 (se = .084), which is in the direction of more bias.] However, if we use just the
spring semester, the bias is in the opposite direction because ability and season would be positively
correlated (more academically able athletes are in season in the spring). In fact, using just the
spring semester gives β̂season = .00089 (se = .06480), which is practically and statistically equal
to zero. When we pool the two semesters we cannot, without a much more detailed analysis,
determine which bias will dominate.
(iii) The variables sat, hsperc, female, black, and white all drop out because they do not vary
by semester. The intercept in the first-differenced equation is the intercept for the spring. We
have
Interestingly, the in-season effect is larger now: term GPA is estimated to be about .065 points
lower in a semester that the sport is in-season. The t statistic is about –1.51, which gives a one-
sided p-value of about .065.
(iv) One possibility is a measure of course load. If some fraction of student-athletes take a
lighter load during the season (for those sports that have a true season), then term GPAs may
tend to be higher, other things equal. This would bias the results away from finding an effect of
season on term GPA.
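13.14 (i) The estimated equation is
$$
\widehat{\Delta\mathit{vote}} = \underset{(0.63)}{-2.56} - \underset{(1.38)}{1.29}\,\Delta\log(\mathit{inexp}) - \underset{(.711)}{.599}\,\Delta\log(\mathit{chexp}) + \underset{(.064)}{.156}\,\Delta\mathit{incshr}
$$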
Only Δincshr is statistically significant at the 5% level (t statistic ≈ 2.44, p-value ≈ .016). The
other two independent variables have t statistics less than one in absolute value.
(ii) The F statistic (with 2 and 153 df) is about 1.51 with p-value ≈ .224. Therefore,
Δlog(inexp) and Δlog(chexp) are jointly insignificant at even the 20% level.
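(iii) The estimated equation is
$$
\widehat{\Delta\mathit{vote}} = \underset{(0.63)}{-2.68} + \underset{(.032)}{.218}\,\Delta\mathit{incshr}
$$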
This equation implies that a 10 percentage point increase in the incumbent’s share of total
spending increases the percent of the incumbent’s vote by about 2.2 percentage points.
The estimated effect is notably smaller and, not surprisingly, the standard error is much larger
13.15 (i) When we add the changes of the nine log wage variables to equation (13.33) we obtain
Δ log (crmrte) = .020 − .111 d83 − .037 d84 − .0006 d85 + .031 d86 + .039 d87
(.021) (.027) (.025) (.0241) (.025) (.025)
− .323 Δlog(prbarr) − .240 Δlog(prbconv) − .169 Δlog(prbpris)
(.030) (.018) (.026)
− .016 Δlog(avgsen) + .398 Δlog(polpc) − .044 Δlog(wcon)
(.022) (.027) (.030)
+ .025 Δlog(wtuc) − .029 Δlog(wtrd) + .0091 Δlog(wfir)
(0.14) (.031) (.0212)
+ .022 Δlog(wser) − .140 Δlog(wmfg) − .017 Δlog(wfed)
(.014) (.102) (.172)
− .052 Δlog(wsta) − .031 Δlog(wloc)
(.096) (.102)
The coefficients on the criminal justice variables change very modestly, and the statistical
significance of each variable is also essentially unaffected.
(ii) Since some signs are positive and others are negative, they cannot all really have the
manufacturing wages lead to lower crime, as we might expect, but, while the estimated
coefficient is by far the largest in magnitude, it is not statistically different from zero (t
statistic ≈ –1.37). The F test for joint significance of the wage variables, with 9 and 529 df,
13.16 (i) The estimated equation using the 1987 to 1988 and 1988 to 1989 changes, where we
include a year dummy for 1989 in addition to an overall intercept, is
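$$
\widehat{\Delta\mathit{hrsemp}} = \underset{(1.942)}{-.740} + \underset{(2.65)}{5.42}\,d89 + \underset{(2.97)}{32.60}\,\Delta\mathit{grant} + \underset{(5.55)}{2.00}\,\Delta\mathit{grant}_{-1} + \underset{(4.868)}{.744}\,\Delta\log(\mathit{employ})
$$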
There are 124 firms with both years of data and three firms with only one year of data used, for a
total of 127 firms; 30 firms in the sample have missing information in both years and are not
used at all. If we had information for all 157 firms, we would have 314 total observations in
estimating the equation.
(ii) The coefficient on grant – more precisely, on Δgrant in the differenced equation – means
that if a firm received a grant for the current year, it trained each worker an average of 32.6 hours
more than it would have otherwise. This is a practically large effect, and the t statistic is very
large.
(iii) Since a grant last year was used to pay for training last year, it is perhaps not surprising
that the grant does not carry over into more training this year. It would if inertia played a role in
training workers.
(iv) The coefficient on the employees variable is very small: a 10% increase in employ
increases hours per employee by only .074. [Recall: $\widehat{\Delta\mathit{hrsemp}}$ ≈ (.744/100)(%Δemploy).] This
is very small, and the t statistic is also rather small.
13.17. (i) Take changes as usual, holding the other variables fixed: Δmath4it = β1Δlog(rexppit) =
(β1/100)⋅[ 100⋅Δlog(rexppit)] ≈ (β1/100)⋅( %Δrexppit). So, if %Δrexppit = 10, then Δmath4it =
(β1/100)⋅(10) = β1/10.
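The approximation in part (i) can be compared with the exact change in the log. The sketch below is only illustrative: it borrows the spending coefficient of −3.45 implied by part (ii) of this problem and treats it as given.

```python
import math

beta1 = -3.45          # spending coefficient implied by part (ii) of the text
pct_change = 10.0      # a 10% increase in real spending per pupil

# Approximation used in the text: (beta1 / 100) * (% change).
approx = (beta1 / 100.0) * pct_change

# Exact effect through the log: beta1 * log(1 + pct/100).
exact = beta1 * math.log(1.0 + pct_change / 100.0)

print(round(approx, 3))  # -0.345
print(round(exact, 3))   # -0.329
```

For a 10% change the two numbers are close, which is why the rule of dividing the coefficient by 10 is adequate here.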
(ii) The equation, estimated by pooled OLS in first differences (except for the year dummies),
is
$$
\widehat{\Delta\mathit{math4}} = \underset{(.52)}{5.95} + \underset{(.73)}{.52}\,y94 + \underset{(.78)}{6.81}\,y95 - \underset{(.73)}{5.23}\,y96 - \underset{(.72)}{8.49}\,y97 + \underset{(.72)}{8.97}\,y98
$$
n = 3,300, R2 = .208.
Taken literally, the spending coefficient implies that a 10% increase in real spending per pupil
decreases the math4 pass rate by about 3.45/10 ≈ .35 percentage points.
(iii) When we add the lagged spending change, and drop another year, we get
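$$
\begin{aligned}
\widehat{\Delta\mathit{math4}} = {} & \underset{(.55)}{6.16} + \underset{(.77)}{5.70}\,y95 - \underset{(.79)}{6.80}\,y96 - \underset{(.74)}{8.99}\,y97 + \underset{(.74)}{8.45}\,y98 \\
& + \underset{(.061)}{.073}\,\Delta\mathit{lunch}
\end{aligned}
$$
n = 2,750, R2 = .238.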
The contemporaneous spending variable, while still having a negative coefficient, is not at all
statistically significant. The coefficient on the lagged spending variable is very statistically
significant, and implies that a 10% increase in spending last year increases the math4 pass rate
by about 1.1 percentage points. Given the timing of the tests, a lagged effect is not surprising.
In Michigan, the fourth grade math test is given in January, and so if preparation for the test
begins a full year in advance, spending when the students are in third grade would at least partly
matter.
(iv) The heteroskedasticity-robust standard error for $\hat{\beta}_{\Delta\log(rexpp)}$ is about 4.28, which reduces
the significance of Δlog(rexpp) even further. The heteroskedasticity-robust standard error of
$\hat{\beta}_{\Delta\log(rexpp_{-1})}$ is about 4.38, which substantially lowers the t statistic. Still, Δlog(rexpp-1) is
statistically significant at just over the 1% significance level against a two-sided alternative.
(v) The fully robust standard error for $\hat{\beta}_{\Delta\log(rexpp)}$ is about 4.94, which even further reduces
the t statistic for Δlog(rexpp). The fully robust standard error for $\hat{\beta}_{\Delta\log(rexpp_{-1})}$ is about 5.13,
which gives Δlog(rexpp-1) a t statistic of about 2.15. The two-sided p-value is about .032.
(vi) We can use four years of data for this test. Doing a pooled OLS regression of r̂it on r̂i,t−1,
using years 1995, 1996, 1997, and 1998 gives ρ̂ = −.423 (se = .019), which is strong negative
serial correlation.
(vii) The fully robust “F” test for Δlog(enroll) and Δlunch, reported by Stata 7.0, is .93. With
2 and 549 df, this translates into p-value = .40. So we would be justified in dropping these
variables, but they are not doing any harm.
CHAPTER 14
TEACHING NOTES
My preference is to view the fixed and random effects methods of estimation as applying to the
same underlying unobserved effects model. The name “unobserved effect” is neutral to the issue
of whether the time-constant effects should be treated as fixed parameters or random variables.
With large N and relatively small T, it almost always makes sense to treat them as random
variables, since we can just view the unobserved ai as being drawn from the population along with the
observed variables. Especially for undergraduates and master’s students, it seems sensible to not
raise the philosophical issues underlying the professional debate. In my mind, the key issue in
most applications is whether the unobserved effect is correlated with the observed explanatory
variables. The fixed effect transformation eliminates the unobserved effect entirely while the
random effects transformation accounts for the serial correlation in the composite error via GLS.
(Alternatively, the random effects transformation only eliminates part of the unobserved effect.)
As a practical matter, the fixed effect and random effect estimates are closer when T is large or
when the variance of the unobserved effect is large relative to the variance of the idiosyncratic
error. I think Example 14.4 is representative of what often happens in applications that apply
pooled OLS, random effects, and fixed effects, at least on the estimates of the marriage and
union wage premiums. The random effects estimates are below pooled OLS and the fixed
effects estimates are below the random effects estimates.
Choosing between the fixed effects transformation and first differencing is harder, although
useful evidence can be obtained by testing for serial correlation in the first-difference estimation.
If the AR(1) coefficient is significant and negative (say, less than −.3, to pick a not quite
Matched pairs samples have been profitably used in recent economic applications, and
differencing or random effects methods can be applied. In an equation such as (14.12), there is
probably no need to allow a different intercept for each sister provided that the labeling of sisters
is random. The different intercepts might be needed if a certain feature of a sister that is not
included in the observed controls is used to determine the ordering. A statistically significant
intercept in the differenced equation would be evidence of this.
SOLUTIONS TO PROBLEMS
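14.1 First, for each t > 1, Var(Δuit) = Var(uit – ui,t-1) = Var(uit) + Var(ui,t-1) = $2\sigma_u^2$, where we use
the assumptions of no serial correlation in {uit} and constant variance. Next, we find the
covariance between Δuit and Δui,t+1. Because these each have a zero mean, the covariance is
$E(\Delta u_{it}\,\Delta u_{i,t+1}) = E[(u_{it} - u_{i,t-1})(u_{i,t+1} - u_{it})] = -E(u_{it}^2) = -\sigma_u^2$,
because the cross-product terms have zero expectation under no serial correlation. The correlation
between adjacent differences is therefore $-\sigma_u^2/(2\sigma_u^2) = -.5$.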
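The variance and correlation of first-differenced errors can be checked by simulation. This is a minimal sketch under stated assumptions: a single long series of normal, serially uncorrelated errors stands in for the panel, and the standard deviation of 1.5 is arbitrary.

```python
import random

random.seed(1)

# Serially uncorrelated errors with constant variance (sigma_u = 1.5 is arbitrary).
sigma_u = 1.5
u = [random.gauss(0.0, sigma_u) for _ in range(200_001)]

# First differences of the error sequence.
du = [u[t] - u[t - 1] for t in range(1, len(u))]

def mean(z):
    return sum(z) / len(z)

def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

var_du = cov(du, du)
corr_adjacent = cov(du[:-1], du[1:]) / var_du

print(var_du / sigma_u**2)  # close to 2
print(corr_adjacent)        # close to -0.5
```

A first-order autocorrelation near −.5 in the differenced errors is what one expects when the errors in levels are serially uncorrelated.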
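14.2 (i) The between estimator is just the OLS estimator from the cross-sectional regression of
$\bar{y}_i$ on $\bar{x}_i$ (including an intercept). Because we just have a single explanatory variable $\bar{x}_i$ and the
error term is $a_i + \bar{u}_i$, we have, from Section 5.1,
$$
\text{plim}\,\tilde{\beta}_1 = \beta_1 + \frac{\text{Cov}(\bar{x}_i, a_i)}{\text{Var}(\bar{x}_i)} = \beta_1 + \frac{\sigma_{xa}}{\text{Var}(\bar{x}_i)},
$$
where $\text{Cov}(\bar{x}_i, a_i) = T^{-1}\sum_{t=1}^{T}\text{Cov}(x_{it}, a_i) = \sigma_{xa}$ when $\sigma_{xa}$ is the covariance between $x_{it}$ and $a_i$ in each period.
(ii) If {xit} is serially uncorrelated with constant variance $\sigma_x^2$ then $\text{Var}(\bar{x}_i) = \sigma_x^2/T$, and so
$\text{plim}\,\tilde{\beta}_1 = \beta_1 + \sigma_{xa}/(\sigma_x^2/T) = \beta_1 + T(\sigma_{xa}/\sigma_x^2)$.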
(iii) As part (ii) shows, when the xit are pairwise uncorrelated the magnitude of the
inconsistency actually increases linearly with T. The sign depends on the covariance between xit
and ai.
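14.3 (i) $E(e_{it}) = E(v_{it} - \lambda\bar{v}_i) = E(v_{it}) - \lambda E(\bar{v}_i) = 0$ because $E(v_{it}) = 0$ for all t.
(ii) $\text{Var}(v_{it} - \lambda\bar{v}_i) = \text{Var}(v_{it}) + \lambda^2\,\text{Var}(\bar{v}_i) - 2\lambda\,\text{Cov}(v_{it}, \bar{v}_i)$. Because $\text{Var}(v_{it}) = \sigma_a^2 + \sigma_u^2$ and
$\text{Var}(\bar{v}_i) = \text{Cov}(v_{it}, \bar{v}_i) = \sigma_a^2 + \sigma_u^2/T$, we have
$$
\text{Var}(v_{it} - \lambda\bar{v}_i) = (\sigma_a^2 + \sigma_u^2) + \lambda^2(\sigma_a^2 + \sigma_u^2/T) - 2\lambda(\sigma_a^2 + \sigma_u^2/T).
$$
Now define $\gamma \equiv \sigma_a^2 + \sigma_u^2/T$ and $\eta \equiv \sigma_u^2/T$, so that $\lambda = 1 - \sqrt{\eta/\gamma}$, and collect terms:
$$
\begin{aligned}
\text{Var}(v_{it} - \lambda\bar{v}_i) &= (\sigma_a^2 + \sigma_u^2) - 2\bigl(1 - \sqrt{\eta/\gamma}\bigr)\gamma + \bigl(1 - \sqrt{\eta/\gamma}\bigr)^2\gamma \\
&= (\sigma_a^2 + \sigma_u^2) - 2\gamma + 2\sqrt{\eta}\sqrt{\gamma} + \bigl(1 - 2\sqrt{\eta/\gamma} + \eta/\gamma\bigr)\gamma \\
&= (\sigma_a^2 + \sigma_u^2) - 2\gamma + 2\sqrt{\eta}\sqrt{\gamma} + \gamma - 2\sqrt{\eta}\sqrt{\gamma} + \eta \\
&= (\sigma_a^2 + \sigma_u^2) + \eta - \gamma = \sigma_u^2 .
\end{aligned}
$$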
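(iii) We must show that $E(e_{it}e_{is}) = 0$ for t ≠ s. Now
$$
\begin{aligned}
E(e_{it}e_{is}) &= E[(v_{it} - \lambda\bar{v}_i)(v_{is} - \lambda\bar{v}_i)] \\
&= E(v_{it}v_{is}) - \lambda E(\bar{v}_i v_{is}) - \lambda E(v_{it}\bar{v}_i) + \lambda^2 E(\bar{v}_i^2) \\
&= \sigma_a^2 - 2\lambda(\sigma_a^2 + \sigma_u^2/T) + \lambda^2(\sigma_a^2 + \sigma_u^2/T).
\end{aligned}
$$
The rest of the proof is very similar to part (ii), with $\gamma$ and $\eta$ as in part (ii):
$$
\begin{aligned}
E(e_{it}e_{is}) &= \sigma_a^2 - 2\bigl(1 - \sqrt{\eta/\gamma}\bigr)\gamma + \bigl(1 - \sqrt{\eta/\gamma}\bigr)^2\gamma \\
&= \sigma_a^2 - 2\gamma + 2\sqrt{\eta}\sqrt{\gamma} + \bigl(1 - 2\sqrt{\eta/\gamma} + \eta/\gamma\bigr)\gamma \\
&= \sigma_a^2 - 2\gamma + 2\sqrt{\eta}\sqrt{\gamma} + \gamma - 2\sqrt{\eta}\sqrt{\gamma} + \eta \\
&= \sigma_a^2 + \eta - \gamma = 0 .
\end{aligned}
$$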
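The two results in Problem 14.3 can be verified numerically for any particular set of variance components. The values below are hypothetical and chosen only for illustration; γ is taken to be σa² + σu²/T and η to be σu²/T, the definitions under which the final lines of the derivation hold.

```python
import math

# Hypothetical variance components and panel length, chosen only for illustration.
sigma2_a, sigma2_u, T = 2.0, 3.0, 5

gamma = sigma2_a + sigma2_u / T      # Var(v_bar_i), which also equals Cov(v_it, v_bar_i)
eta = sigma2_u / T
lam = 1.0 - math.sqrt(eta / gamma)   # quasi-demeaning parameter lambda

# Var(e_it) for e_it = v_it - lam * v_bar_i
var_e = (sigma2_a + sigma2_u) + lam**2 * gamma - 2 * lam * gamma
# Cov(e_it, e_is) for t != s
cov_e = sigma2_a + lam**2 * gamma - 2 * lam * gamma

print(var_e)  # equals sigma2_u up to rounding error
print(cov_e)  # zero up to rounding error
```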
14.4 (i) Men’s athletics are still the most prominent, although women’s sports, especially
basketball but also gymnastics, softball, and volleyball, are very popular at some universities.
Winning percentages for football and men’s and women’s basketball are good possibilities, as
well as indicators for whether teams won conference championships, went to a visible bowl
game (football), or did well in the NCAA basketball tournament (such as making the Sweet 16).
We must be sure that we use measures of athletic success that are available prior to application
deadlines. So, we would probably use football success from the previous school year; basketball
success might have to be lagged one more year.
(ii) Tuition could be important: ceteris paribus, higher tuition should mean fewer
applications. Measures of university quality that change over time, such as student/faculty ratios
or faculty grant money, could be important.
The variable athsuccit is shorthand for a measure of athletic success; we might include several
measures. If, for example, athsuccit is football winning percentage, then 100β1 is the percentage
change in applications given a one percentage point increase in winning percentage. It is likely
that ai is correlated with athletic success, tuition, and so on, so fixed effects estimation is
appropriate. Alternatively, we could first difference to remove ai, as discussed in Chapter 13.
14.5 (i) For each student we have several measures of performance, typically three or four, the
number of classes taken by a student that have final exams. When we specify an equation for
each standardized final exam score, the errors in the different equations for the same student are
certain to be correlated. Students who have more (unobserved) ability tend to do better on all
tests.
where $a_s$ is the unobserved student effect. Because SAT score and cumulative GPA depend only
on the student, and not on the particular class he/she is taking, these do not have a c subscript.
The attendance rates do generally vary across class, as does the indicator for whether a class is in
the student’s major. The term θc denotes different intercepts for different classes. Unlike with a
panel data set, where time is the natural ordering of the data within each cross-sectional unit, and
the aggregate time effects apply to all units, intercepts for the different classes may not be
needed. If all students took the same set of classes then this is similar to a panel data set, and we
would want to put in different class intercepts. But with students taking different courses, the
class we label as “1” for student A need have nothing to do with class “1” for student B. Thus,
the different class intercepts based on arbitrarily ordering the classes for each student probably
are not needed. We can replace θc with β0, an intercept constant across classes.
(iii) Maintaining the assumption that the idiosyncratic error, $u_{sc}$, is uncorrelated with all
explanatory variables, we need the unobserved student heterogeneity, $a_s$, to be uncorrelated with
$atndrte_{sc}$. The inclusion of SAT score and cumulative GPA should help in this regard, as $a_s$ is
the part of ability that is not captured by $SAT_s$ and $cumGPA_s$. In other words, controlling for
$SAT_s$ and $cumGPA_s$ could be enough to obtain the ceteris paribus effect of class attendance.
(iv) If $SAT_s$ and $cumGPA_s$ are not sufficient controls for student ability and motivation, $a_s$ is
correlated with $atndrte_{sc}$, and this would cause pooled OLS to be biased and inconsistent. We
could use fixed effects instead. Within each student we compute the demeaned data, where, for
each student, the means are computed across classes. The variables $SAT_s$ and $cumGPA_s$ drop out
of the analysis.
(iv) This is the only new part. The fixed effects estimates, reported in equation form, are
log (rentit ) = .386 y90t + .072 log(popit) + .310 log(avgincit) + .0112 pctstuit,
(.037) (.088) (.066) (.0041)
N = 64, T = 2.
(There are N = 64 cities and T = 2 years.) We do not report an intercept because it gets removed
by the time demeaning. The coefficient on y90t is identical to the intercept from the first
difference estimation, and the slope coefficients and standard errors are identical to first
differencing. We do not report an R-squared because none is comparable to the R-squared
obtained from first differencing.
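The equivalence of fixed effects and first differencing when T = 2 can be illustrated on simulated data. The sketch below is not the rental data: the sample is generated with invented parameter values, there is a single explanatory variable, and `ols2` is a hypothetical helper for a two-regressor OLS fit.

```python
import random

random.seed(2)

def ols2(x1, x2, y):
    """OLS of y on two regressors (no added intercept), via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Simulated two-period panel; all parameter values are invented.
N = 64
panel = []
for _ in range(N):
    a_i = random.gauss(0, 1)                                  # unobserved unit effect
    x = [random.gauss(0, 1) + 0.5 * a_i for _ in range(2)]    # regressor correlated with a_i
    y = [1.0 + 0.3 * t + 0.8 * x[t] + a_i + random.gauss(0, 0.5) for t in range(2)]
    panel.append((x, y))

# First differencing: regress the change in y on a constant and the change in x.
dy = [y[1] - y[0] for x, y in panel]
dx = [x[1] - x[0] for x, y in panel]
fd_intercept, fd_slope = ols2([1.0] * N, dx, dy)

# Fixed effects: time-demean y, x, and the second-period dummy within each unit,
# then run pooled OLS on the 2N demeaned observations.
d2_dm, x_dm, y_dm = [], [], []
for x, y in panel:
    xbar, ybar = sum(x) / 2, sum(y) / 2
    for t in range(2):
        d2_dm.append(t - 0.5)
        x_dm.append(x[t] - xbar)
        y_dm.append(y[t] - ybar)
fe_dummy, fe_slope = ols2(d2_dm, x_dm, y_dm)

print(fd_intercept, fe_dummy)  # equal up to floating-point error
print(fd_slope, fe_slope)      # equal up to floating-point error
```

The coefficient on the second-period dummy under fixed effects matches the first-difference intercept, and the slopes coincide, mirroring the comparison described above.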
[Instructor’s Note: Some econometrics packages do report an intercept for fixed effects
estimation; if so, it is usually the average of the estimated intercepts for the cross-sectional units,
and it is not especially informative. If one obtains the FE estimates via the dummy variable
regression, an intercept is reported for the base group, which is usually an arbitrarily chosen
cross-sectional unit.]
log (crmrteit ) = .013 d82t − .079 d83t − .118 d84t − .112 d85t
(.022) (.021) (.022) (.022)
− .082 d86t − .040 d87t − .360 log(prbarrit) − .286 log(prbconvit)
(.021) (.021) (.032) (.021)
− .183 log(prbprisit) − .0045 log(avgsenit) + .424 log(polpcit)
(.032) (.0264) (.026)
N = 90, T = 7.
There is no intercept because it gets swept away in the time demeaning. If your econometrics
package reports a constant or intercept, it is choosing one of the cross-sectional units as the base
group, and then the overall intercept is for the base unit in the base year. This overall intercept is
not very informative because, without obtaining each aˆi , we cannot compare across units.
Remember that the coefficients on the year dummies are not directly comparable with those
in the first-differenced equation because we did not difference the year dummies in (13.33). The
fixed effects estimates are unbiased estimators of the parameters on the time dummies in the
original model.
The first-difference and fixed effects slope estimates are broadly consistent. The variables
that are significant with first differencing are significant in the FE estimation, and the signs are
all the same. The magnitudes are also similar, although, with the exception of the insignificant
variable log(avgsen), the FE estimates are all larger in absolute value. But we conclude that the
estimates across the two methods paint a similar picture.
(ii) When the nine log wage variables are added and the equation is estimated by fixed
effects, very little of importance changes on the criminal justice variables. The following table
contains the new estimates and standard errors.
Independent Variable    Coefficient    Standard Error
log(prbarr)             –.356          .032
log(prbconv)            –.286          .021
log(prbpris)            –.175          .032
log(avgsen)             –.0029         .026
log(polpc)              .423           .026
The F statistic, with 9 and N(T – 1) – k = 90(6) – 20 = 520 df, is F ≈ 2.47 with p-value ≈ .0090.
The changes in these estimates are minor, even though the wage variables are jointly significant.
14.8 (i) 135 firms are used in the FE estimation. Because there are three years, we would have a
total of 405 observations if each firm had data on all variables for all three years. Instead, due to
missing data, we can use only 390 observations in the FE estimation. The fixed effects estimates
are
(ii) The coefficient on grant means that if a firm received a grant for the current year, it
trained each worker an average of 34.2 hours more than it would have otherwise. This is a
practically large effect, and the t statistic is very large.
(iii) Since a grant last year was used to pay for training last year, it is perhaps not surprising
that the grant does not carry over into more training this year. It would if inertia played a role in
training workers.
(iv) The coefficient on the employees variable is very small: a 10% increase in employ
increases predicted hours per employee by only about .018. [Recall: $\widehat{\Delta\mathit{hrsemp}}$ ≈ (.176/100)
(%Δemploy).] This is very small, and the t statistic is practically zero.
and subtract the second equation from the first. The ai are eliminated and cit – ci(t – 1) = ci. So,
for each t ≥ 2, we have
(ii) Because the differenced equation contains the fixed effect ci, we estimate it by FE. We
get β̂1 = –.251, se(β̂1) = .121. The estimate is actually larger in magnitude than we obtain in
Example 13.8 (where β̂1 = –.182, se(β̂1) = .078), but we have not yet included year dummies.
In any case, the estimated effect of an EZ is still large and statistically significant.
(iii) Adding the year dummies reduces the estimated EZ effect, and makes it more
comparable to what we obtained without cit in the model. Using FE on the first-differenced
equation gives β̂1 = –.192, se(β̂1) = .085, which is fairly similar to the estimates without the
city-specific trends.
14.10 (i) Different occupations are unionized at different rates, and wages also differ by
occupation. Therefore, if we omit binary indicators for occupation, the union wage differential
may simply be picking up wage differences across occupations. Because some people change
occupation over the period, we should include these in our analysis.
(ii) Because the nine occupational categories (occ1 through occ9) are exhaustive, we must
choose one as the base group. Of course the group we choose does not affect the estimated
union wage differential. The fixed effect estimation on union, to four decimal places, is .0804
with standard error = .0194. There is practically no difference between this estimate and
14.11 First, the random effects estimate on unionit becomes .174 (se ≈ .031), while the
coefficient on the interaction term unionit ⋅ t is about –.0155 (se ≈ .0057). Therefore, the
interaction between the union dummy and time trend is very statistically significant (t statistic ≈
–2.72), and is important economically. While at a given point in time there is a large union
differential, the projected wage growth is less for unionized workers (on the order of 1.6% less
per year).
The fixed effects estimate on unionit becomes .148 (se ≈ .031), while the coefficient on the
interaction unionit ⋅ t is about −.0157 (se ≈ .0057). Therefore, the story is very similar to that for
the random effects estimates.
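The t statistics behind this comparison follow directly from the reported coefficients and standard errors. The sketch below only redoes that arithmetic with the rounded values quoted above; the helper name is an assumption for illustration.

```python
def t_stat(coef, se):
    """Ratio of a coefficient estimate to its standard error."""
    return coef / se

t_re = t_stat(-0.0155, 0.0057)   # random effects, union-by-trend interaction
t_fe = t_stat(-0.0157, 0.0057)   # fixed effects, union-by-trend interaction
print(round(t_re, 2), round(t_fe, 2))
```

Because the inputs are rounded, the ratios may differ slightly from statistics computed on the unrounded estimates.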
14.12 (i) If there is a deterrent effect then β1 < 0. The sign of β2 is not entirely obvious, although
one possibility is that a better economy means less crime in general, including violent crime
(such as drug dealing) that would lead to fewer murders. This would imply β2 > 0.
(ii) The pooled OLS estimates using 1990 and 1993 are
mrdrteit = −5.28 − 2.07 d93t + .128 execit + 2.53 unemit
(4.43) (2.14) (.263) (.78)
N = 51, T = 2, R2 = .102
There is no evidence of a deterrent effect, as the coefficient on exec is actually positive (though
not statistically significant).
Δmrdrtei = .413 − .104 Δexeci − .067 Δunemi
(.209) (.043) (.159)
n = 51, R2 = .110
(iv) The heteroskedasticity-robust standard error for Δexeci is .017. Somewhat surprisingly,
(v) Texas had by far the largest value of exec, 34. The next highest state was Virginia, with
11. These are three-year totals.
(vi) Without Texas, we get the following, with heteroskedasticity-robust standard errors in [⋅]:
Δmrdrtei = .413 − .067 Δexeci − .070 Δunemi
(.211) (.105) (.160)
[.200] [.079] [.146]
n = 50, R2 = .013
Now the estimated deterrent effect is smaller. Perhaps more importantly, the standard error on
Δexeci has increased by a substantial amount. This happens because, when we drop Texas, we
(vii) When we apply fixed effects using all three years of data and all states we get
mrdrteit = 1.73 d90t + 1.70 d93t − .054 execit + .395 unemit
(.75) (.71) (.160) (.285)
N = 51, T = 3, R2 = .068
The size of the deterrent effect is only about half as big as when 1987 is not used. Plus, the t
statistic, about −.34, is very small. The earlier finding of a deterrent effect does not seem to be
very robust.
math4 = −31.66 + 6.38 y94 + 18.65 y95 + 18.03 y96 + 15.34 y97 + 30.40 y98
(10.30) (.74) (.79) (.77) (.78) (.78)
N = 550, T = 6, R2 = .505
(ii) The lunch variable is the percent of students in the district eligible for free or reduced-
price lunches, which is determined by poverty status. Therefore, lunch is effectively a poverty
rate. We see that the district poverty rate has a large impact on the math pass rate: a one
percentage point increase in lunch reduces the pass rate by about .41 percentage points.
math4 = 6.18 y94 + 18.09 y95 + 17.94 y96 + 15.19 y97 + 29.88 y98
(.56) (.69) (.76) (.80) (.84)
N = 550, T = 6, R2 = .603
The coefficient on the lagged spending variable has gotten somewhat smaller, but its t statistic is
still almost three. Therefore, there is still evidence of a lagged spending effect after controlling
for unobserved district effects.
(v) The change in the coefficient and significance on the lunch variable is most dramatic.
Both enrol and lunch are slow to change over time, which means that their effects are largely
captured by the unobserved effect, ai. Plus, because of the time demeaning, their coefficients are
hard to estimate. The spending coefficients can be estimated more precisely because of a policy
change during this period, where spending shifted markedly in 1994 after the passage of
Proposal A in Michigan, which changed the way schools were funded.
(vi) The estimated long-run spending effect is θ̂1 = 6.59, se(θ̂1) = 2.64.
n = 194, R2 = .108
Investment choice is associated with about 11.7 percentage points more in stocks. The t statistic
is 1.88, and so it is marginally significant.
(ii) These variables are not very important. The F statistic for joint significance is 1.03. With 9
and 179 df, this gives p-value = .42. Plus, when these variables are dropped from the regression,
the coefficient on choice only falls to 11.15.
(iv) I will only report the cluster-robust standard error for choice: 6.20. Therefore, it is
essentially the same as the usual OLS standard error. This is perhaps not very surprising since at
least 171 of the 194 observations can be assumed independent of one another. The explanatory
variables may adequately capture the within-family correlation.
(v) There are only 23 families with spouses in the data set. Differencing within these families
gives
−1.22 Δeduc
(3.43)
All of the income and wealth variables, and the stock and IRA indicators, drop out, as these are
defined at the family level (and therefore the same for the husband and wife).
(vi) None of the explanatory variables is significant in part (v), and this is not too surprising.
We have only 23 observations, and we are removing much of the variation in the explanatory
variables (except the gender variable) by using within-family differences.
CHAPTER 15
TEACHING NOTES
When I wrote the first edition, I took the novel approach of introducing instrumental variables as
a way of solving the omitted variable (or unobserved heterogeneity) problem. Traditionally, a
student’s first exposure to IV methods comes by way of simultaneous equations models.
Occasionally, IV is first seen as a method to solve the measurement error problem. I have even
seen texts where the first appearance of IV methods is to obtain a consistent estimator in an
AR(1) model with AR(1) serial correlation.
The omitted variable problem is conceptually much easier than simultaneity, and stating the
conditions needed for an IV to be valid in an omitted variable context is straightforward.
Besides, most modern applications of IV have more of an unobserved heterogeneity motivation.
A leading example is estimating the return to education when unobserved ability is in the error
term. We are not thinking that education and wages are jointly determined; for the vast majority
of people, education is completed before we begin collecting information on wages or salaries.
Similarly, in studying the effects of attending a certain type of school on student performance,
the choice of school is made and then we observe performance on a test. Again, we are primarily
concerned with unobserved factors that affect performance and may be correlated with school
choice; it is not an issue of simultaneity.
The asymptotics underlying the simple IV estimator are no more difficult than for the OLS
estimator in the bivariate regression model. Certainly consistency can be derived in class. It is
also easy to demonstrate how, even just in terms of inconsistency, IV can be worse than OLS if
the IV is not completely exogenous.
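For reference, the comparison can be written compactly. The display below restates the standard probability limits for the simple regression model y = β0 + β1x + u with instrument z, which are the expressions that Problem 15.5 cites as equations (15.19) and (15.20). It is a sketch of the argument, not a full derivation.

```latex
\operatorname{plim}\hat{\beta}_{1,\mathrm{IV}}
  = \beta_1 + \frac{\operatorname{Corr}(z,u)}{\operatorname{Corr}(z,x)}\cdot\frac{\sigma_u}{\sigma_x},
\qquad
\operatorname{plim}\tilde{\beta}_{1,\mathrm{OLS}}
  = \beta_1 + \operatorname{Corr}(x,u)\cdot\frac{\sigma_u}{\sigma_x}.
```

When Corr(z,x) is small, even a small Corr(z,u) is magnified in the IV expression, which is the sense in which IV can be worse than OLS.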
At a minimum, it is important to always estimate the reduced form equation and test whether the
IV is partially correlated with the endogenous explanatory variable. The material on
multicollinearity and 2SLS estimation is a direct extension of the OLS case. Using equation
(15.43), it is easy to explain why multicollinearity is generally more of a problem with 2SLS
estimation.
Testing for endogeneity and testing any overidentification restrictions is something that should
be covered in second semester courses. The tests are fairly easy to motivate and are very easy to
implement.
While I provide a treatment for time series applications in Section 15.7, I admit to having trouble
finding compelling time series applications. These are likely to be found at a less aggregated
level, where exogenous IVs have a chance of existing. (See also Chapter 16.)
SOLUTIONS TO PROBLEMS
15.1 (i) It has been fairly well established that socioeconomic status affects student performance.
The error term u contains, among other things, family income, which has a positive effect on
GPA and is also very likely to be correlated with PC ownership.
(ii) Families with higher incomes can afford to buy computers for their children. Therefore,
family income certainly satisfies the second requirement for an instrumental variable: it is
correlated with the endogenous explanatory variable [see (15.5) with x = PC and z = faminc].
But as we suggested in part (i), faminc has a positive effect on GPA, so the first requirement for a
good IV, (15.4), fails for faminc. If we had faminc we would include it as an explanatory
variable in the equation; if it is the only important omitted variable correlated with PC, we could
then estimate the expanded equation by OLS.
(iii) This is a natural experiment that affects whether or not some students own computers.
Some students who buy computers when given the grant would not have without the grant.
(Students who did not receive the grants might still own computers.) Define a dummy variable,
grant, equal to one if the student received a grant, and zero otherwise. Then, if grant was
randomly assigned, it is uncorrelated with u. In particular, it is uncorrelated with family income
and other socioeconomic factors in u. Further, grant should be correlated with PC: the
probability of owning a PC should be significantly higher for students receiving grants.
Incidentally, if the university gave grant priority to low-income students, grant would be
negatively correlated with u, and IV would be inconsistent.
15.2 (i) It seems reasonable to assume that dist and u are uncorrelated because classrooms are not
usually assigned with convenience for particular students in mind.
(ii) The variable dist must be partially correlated with atndrte. More precisely, in the
reduced form
atndrte = π0 + π1priGPA + π2ACT + π3dist + v,
we must have π3 ≠ 0. Given a sample of data we can test H0: π3 = 0 against H1: π3 ≠ 0 using a t
test.
generally correlated with u.) Under the exogeneity assumption that E(u|priGPA,ACT,dist) = 0,
any function of priGPA, ACT, and dist is uncorrelated with u. In particular, the interaction
priGPA⋅dist is uncorrelated with u. If dist is partially correlated with atndrte then priGPA⋅dist is
by 2SLS using IVs dist, priGPA, ACT, and priGPA⋅dist. It turns out this is not generally optimal.
It may be better to add priGPA2 and priGPA⋅ACT to the instrument list. This would give us
overidentifying restrictions to test. See Wooldridge (2002, Chapters 5 and 9) for further
discussion.
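$$\sum_{i=1}^{n} z_i (y_i - \bar{y}) = \sum_{i=1}^{n} z_i y_i - \Big(\sum_{i=1}^{n} z_i\Big)\bar{y} = n_1\bar{y}_1 - n_1\bar{y},$$
where $n_1 = \sum_{i=1}^{n} z_i$ is the number of observations with $z_i = 1$ and we have used the fact that
$\big(\sum_{i=1}^{n} z_i y_i\big)/n_1 = \bar{y}_1$, the average of the $y_i$ over the $i$ with $z_i = 1$. So far, we have shown that the
numerator in $\hat{\beta}_1$ is $n_1(\bar{y}_1 - \bar{y})$. Next, write $\bar{y}$ as a weighted average of the averages over the
two subgroups:
$$\bar{y} = (n_0/n)\bar{y}_0 + (n_1/n)\bar{y}_1,$$
where $n_0 = n - n_1$. Therefore, $\bar{y}_1 - \bar{y} = (n_0/n)(\bar{y}_1 - \bar{y}_0)$, and the numerator of $\hat{\beta}_1$ can be written as
$$(n_0 n_1/n)(\bar{y}_1 - \bar{y}_0).$$
Replacing $y$ with $x$ shows that the denominator is $(n_0 n_1/n)(\bar{x}_1 - \bar{x}_0)$. Taking the ratio gives
$$\hat{\beta}_1 = (\bar{y}_1 - \bar{y}_0)/(\bar{x}_1 - \bar{x}_0).$$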
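The algebra for a binary instrument can be confirmed numerically. The sketch below computes the simple IV estimator from its sums-of-cross-products form and compares it with the ratio of differences in subgroup means. The data are made up purely for illustration, and the function names are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)

def iv_slope(z, x, y):
    """Simple IV estimator: sum (z - zbar)(y - ybar) / sum (z - zbar)(x - xbar)."""
    zb, xb, yb = mean(z), mean(x), mean(y)
    num = sum((zi - zb) * (yi - yb) for zi, yi in zip(z, y))
    den = sum((zi - zb) * (xi - xb) for zi, xi in zip(z, x))
    return num / den

def grouping_slope(z, x, y):
    """Grouping form for binary z: (ybar1 - ybar0) / (xbar1 - xbar0)."""
    y1 = mean([yi for zi, yi in zip(z, y) if zi == 1])
    y0 = mean([yi for zi, yi in zip(z, y) if zi == 0])
    x1 = mean([xi for zi, xi in zip(z, x) if zi == 1])
    x0 = mean([xi for zi, xi in zip(z, x) if zi == 0])
    return (y1 - y0) / (x1 - x0)

# Made-up illustrative data (not from any data set used in the text).
z = [0, 0, 0, 1, 1, 1, 1]
x = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0, 3.5]
y = [2.0, 2.5, 3.0, 5.0, 4.0, 6.5, 5.5]

b_iv = iv_slope(z, x, y)
b_group = grouping_slope(z, x, y)
print(b_iv, b_group)
```

The two numbers agree up to floating-point error, as the derivation implies for any sample in which both groups are nonempty and the subgroup means of x differ.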
15.4 (i) The state may set the level of its minimum wage at least partly based on past or expected
current economic activity, and this could certainly be part of ut. Then gMINt and ut are
correlated, which causes OLS to be biased and inconsistent.
(ii) Because gGDPt controls for the overall performance of the U.S. economy, it seems
reasonable that gUSMINt is uncorrelated with the disturbances to employment growth for a
particular state.
(iii) In some years, the U.S. minimum wage will increase in such a way so that it exceeds the
state minimum wage, and then the state minimum wage will also increase. Even if the U.S.
minimum wage is never binding, it may be that the state increases its minimum wage in response
to an increase in the U.S. minimum. If the state minimum is always the U.S. minimum, then
gMINt is exogenous in this equation and we would just use OLS.
15.5 (i) From equation (15.19) with σu = σx, plim β̂1 = β1 + (.1/.2) = β1 + .5, where β̂1 is the IV
estimator. So the asymptotic bias is .5.
(ii) From equation (15.20) with σu = σx, plim β̃1 = β1 + Corr(x,u), where β̃1 is the OLS
estimator. So we would have to have Corr(x,u) > .5 before the asymptotic bias in OLS exceeds
that of IV. This is a simple illustration of how a seemingly small correlation (.1 in this case)
between the IV (z) and error (u) can still result in IV being more biased than OLS if the
correlation between z and x is weak (.2).
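The comparison in parts (i) and (ii) is simple enough to tabulate. The sketch below codes the two asymptotic bias expressions under the problem's assumption σu = σx. The function names and the trial values of Corr(x,u) are illustrative assumptions, not part of the problem.

```python
def iv_asymptotic_bias(corr_zu, corr_zx, sd_ratio=1.0):
    """plim(beta_IV) - beta1 = [Corr(z,u) / Corr(z,x)] * (sigma_u / sigma_x)."""
    return (corr_zu / corr_zx) * sd_ratio

def ols_asymptotic_bias(corr_xu, sd_ratio=1.0):
    """plim(beta_OLS) - beta1 = Corr(x,u) * (sigma_u / sigma_x)."""
    return corr_xu * sd_ratio

iv_bias = iv_asymptotic_bias(0.1, 0.2)

# Trial values of Corr(x,u): OLS is worse only when its bias exceeds the IV bias.
ols_worse = [c for c in (0.3, 0.5, 0.7) if ols_asymptotic_bias(c) > iv_bias]
print(iv_bias, ols_worse)
```

Only the trial correlation above .5 makes OLS the more biased estimator, matching the threshold stated in part (ii).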
(iii) By assumption, u1 has zero mean and is uncorrelated with z1 and z2, and v2 has these
properties by definition. So v1 has zero mean and is uncorrelated with z1 and z2, which means
that OLS consistently estimates the αj. [OLS would only be unbiased if we add the stronger
15.7 (i) Even at a given income level, some students are more motivated and more able than
others, and their families are more supportive (say, in terms of providing transportation) and
enthusiastic about education. Therefore, there is likely to be a self-selection problem: students
that would do better anyway were also more likely to attend a choice school.
(ii) Assuming we have the functional form for faminc correct, the answer is yes. Since u1
does not contain income, random assignment of grants within income class means that grant
designation is not correlated with unobservables such as student ability, motivation, and family
support.
choice = π0 + π1faminc + π2grant + v2,
and we need π2 ≠ 0. In other words, after accounting for income, the grant amount must have
some effect on choice. This seems reasonable, provided the grant amounts differ within each
income class.
(iv) The reduced form for score is just a linear function of the exogenous variables (see
Problem 15.6):
This equation allows us to directly estimate the effect of increasing the grant amount on the test
score, holding family income fixed. From a policy perspective this is itself of some interest.
15.8 (i) Family income and background variables, such as parents’ education.
(iii) Parents who are supportive and motivated to have their daughters do well in school may
also be more likely to enroll their daughters in a girls’ high school. It seems likely that girlhs
and u1 are correlated.
(iv) Let numghs be the number of girls’ high schools within a 20-mile radius of a girl’s home.
To be a valid IV for girlhs, numghs must satisfy two requirements: it must be uncorrelated with
u1 and it must be partially correlated with girlhs. The second requirement probably holds, and
can be tested by estimating the reduced form
and testing numghs for statistical significance. The first requirement is more problematical.
Girls’ high schools tend to locate in areas where there is a demand, and this demand can reflect
the seriousness with which people in the community view education. Some areas of a state have
better students on average for reasons unrelated to family income and parents’ education, and
these reasons might be correlated with numghs. One possibility is to include community-level
variables that can control for differences across communities.
15.9 Just use OLS on an expanded equation, where SAT and cumGPA are added as proxy
variables for student ability and motivation; see Chapter 9.
15.10 (i) Better and more serious students tend to go to college, and these same kinds of students
may be attracted to private and, in particular, Catholic high schools. The resulting correlation
between u and CathHS is another example of a self-selection problem: students self select
toward Catholic high schools, rather than being randomly assigned to them.
(ii) A standardized score is a measure of student ability, so this can be used as a proxy
variable in an OLS regression. Having this measure in an OLS regression should be an
improvement over having no proxies for student ability.
(iii) The first requirement is that CathRel must be uncorrelated with unobserved student
motivation and ability (whatever is not captured by any proxies) and other factors in the error
term. This holds if growing up Catholic (as opposed to attending a Catholic high school) does
not make you a better student. It seems reasonable to assume that Catholics do not have more
innate ability than non-Catholics. Whether being Catholic is unrelated to student motivation, or
preparation for high school, is a thornier issue.
The second requirement is that being Catholic has an effect on attending a Catholic high
school, controlling for the other exogenous factors that appear in the structural model. This can
be tested by estimating the reduced form equation of the form CathHS = π0 + π1CathRel + (other
(iv) Evans and Schwab (1995) find that being Catholic substantially increases the probability
of attending a Catholic high school. Further, it seems that assuming CathRel is exogenous in the
structural equation is reasonable. See Evans and Schwab (1995) for an in-depth analysis.
(ii) By assumption E(x*_{t-1}u_t) = E(e_{t-1}u_t) = E(x*_{t-1}e_t) = E(e_{t-1}e_t) = 0, and so E(x_{t-1}u_t) = E(x_{t-1}e_t) =
0 because x_t = x*_t + e_t. Therefore, E(x_{t-1}v_t) = E(x_{t-1}u_t) – β1E(x_{t-1}e_t) = 0.
(iii) Most economic time series, unless they represent the first difference of a series or the
percentage change, are positively correlated over time. If the initial equation is in levels or logs,
xt and xt-1 are likely to be positively correlated. If the model is for first differences or percentage
changes, there still may be positive or negative correlation between xt and xt-1.
yt = β0 + β1xt + vt,
as we showed in part (ii): Cov(xt-1,vt) = E(xt-1vt) = 0. Second, xt-1 will often be correlated with xt,
and we can check this easily enough by running a regression of xt on xt-1. This suggests
This is a reduced form simple regression equation. It shows that, controlling for no other factors,
one more sibling in the family is associated with monthly salary that is about 2.8% lower. The t
statistic on sibs is about –4.73. Of course sibs can be correlated with many things that should
have a bearing on wage including, as we already saw, years of education.
(ii) It could be that older children are given priority for higher education, and families may
hit budget constraints and may not be able to afford as much education for children born later.
The simple regression of educ on brthord gives
educ = 14.15 − .283 brthord
(0.13) (.046)
n = 852, R2 = .042.
(Note that brthord is missing for 83 observations.) The equation predicts that every one-unit
increase in brthord reduces predicted education by about .28 years. In particular, the difference
in predicted education for a first-born and fourth-born child is about .85 years.
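The first-born versus fourth-born comparison is just three times the slope. A minimal sketch using the fitted equation reported above; the function name is hypothetical.

```python
def predicted_educ(brthord, intercept=14.15, slope=-0.283):
    """Fitted years of education from the simple regression of educ on brthord."""
    return intercept + slope * brthord

# Difference in predicted education between a first-born and a fourth-born child.
gap = predicted_educ(1) - predicted_educ(4)
print(round(gap, 2))
```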
(iii) When brthord is used as an IV for educ in the simple wage equation we get
(The R-squared is negative.) This is much higher than the OLS estimate (.060) and even above
the estimate when sibs is used as an IV for educ (.122). Because of missing data on brthord, we
are using fewer observations than in the previous analyses.
(iv) In the reduced form equation
we need π2 ≠ 0 in order for the βj to be identified. We take the null to be H0: π2 = 0, and look to
reject H0 at a small significance level. The regression of educ on sibs and brthord (using 852
observations) yields π̂2 = −.153 and se(π̂2) = .057. The t statistic is about –2.68, which rejects
The standard error on β̂educ is much larger than we obtained in part (iii). The 95% CI for βeduc is
roughly −.010 to .284, which is very wide and includes the value zero. The standard error of
β̂sibs is very large relative to the coefficient estimate, rendering sibs very insignificant.
using IV, multicollinearity is a serious problem here, and is not allowing us to estimate βeduc with
much precision.
children = −4.138 − .0906 educ + .332 age − .00263 age2
(0.241) (.0059) (.017) (.00027)
n = 4,361, R2 = .569.
Another year of education, holding age fixed, results in about .091 fewer children. In other
words, for a group of 100 women, if each gets another year of education, they collectively are
predicted to have about nine fewer children.
and we need π3 ≠ 0. When we run the regression we obtain π̂3 = −.852 and se(π̂3) = .113.
Therefore, women born in the first half of the year are predicted to have almost one year less
education, holding age fixed. The t statistic on frsthalf is over 7.5 in absolute value, and so the
identification condition holds.
children = −3.388 − .1715 educ + .324 age − .00267 age2
(0.548) (.0532) (.018) (.00028)
n = 4,361, R2 = .550.
The estimated effect of education on fertility is now much larger. Naturally, the standard error
for the IV estimate is also bigger, about nine times bigger. This produces a fairly wide 95% CI
for β1.
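The comparison of standard errors and the implied interval can be reproduced from the two reported equations. This sketch assumes the normal critical value 1.96, which is a reasonable approximation at this sample size; the helper name is hypothetical.

```python
def normal_ci(coef, se, crit=1.96):
    """Approximate 95% confidence interval using the normal critical value."""
    return coef - crit * se, coef + crit * se

se_ratio = 0.0532 / 0.0059                    # IV standard error relative to OLS
ci_low, ci_high = normal_ci(-0.1715, 0.0532)  # IV estimate on educ
print(round(se_ratio, 1), round(ci_low, 3), round(ci_high, 3))
```

The interval excludes zero but spans effects from roughly .07 to .28 fewer children per additional year of education, which illustrates the loss of precision from using IV.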
(iv) When we add electric, tv, and bicycle to the equation and estimate it by OLS we obtain
children = −4.390 − .0767 educ + .340 age − .00271 age2 − .303 electric
(.0240) (.0064) (.016) (.00027) (.076)
− .253 tv + .318 bicycle
(.091) (.049)
n = 4,356, R2 = .576.
children = −3.591 − .1640 educ + .328 age − .00272 age2 − .107 electric
(0.645) (.0655) (.019) (.00028) (.166)
− .0026 tv + .332 bicycle
(.2092) (.052)
n = 4,356, R2 = .558.
Adding electric, tv, and bicycle to the model reduces the estimated effect of educ in both cases,
but not by too much. In the equation estimated by OLS, the coefficient on tv implies that, other
factors fixed, four families that own a television will have about one fewer child than four
families without a TV. Television ownership can be a proxy for different things, including
income and perhaps geographic location. A causal interpretation is that TV provides an
alternative form of recreation.
Interestingly, the effect of TV ownership is practically and statistically insignificant in the
equation estimated by IV (even though we are not using an IV for tv). The coefficient on electric
is also greatly reduced in magnitude in the IV estimation. The substantial drops in the
magnitudes of these coefficients suggest that a linear model might not be the best functional form,
which would not be surprising since children is a count variable. (See Section 17.4.)
15.14 (i) IQ scores are known to vary by geographic region, and so does the availability of four
year colleges. It could be that, for a variety of reasons, people with higher abilities grow up in
areas with four year colleges nearby.
(iii) When we add smsa66, reg662, …, reg669 to the regression in part (ii), we obtain
IQ = 104.77 + .348 nearc4 + 1.09 smsa66 + …
(1.62) (.814) (0.81)
n = 2,061, R2 = .0626,
where, for brevity, the coefficients on the regional dummies are not reported. Now, the
relationship between IQ and nearc4 is much weaker and statistically insignificant. In other
words, once we control for region and environment while growing up, there is no apparent link
between IQ score and living near a four-year college.
(iv) The findings from parts (ii) and (iii) show that it is important to include smsa66,
reg662, …, reg669 in the wage equation to control for differences in access to colleges that
might also be correlated with ability.
15.15 (i) The equation estimated by OLS, omitting the first observation, is
The estimate on inft is no longer statistically different from one. (If β1 = 1, then a one percentage
point increase in inflation leads to a one percentage point increase in the three-month T-bill rate.)
(iii) In first differences, the equation estimated by OLS is
Δinft = .088 + .0096 Δinft-1
(.325) (.1266)
n = 47, R2 = .0001.
Therefore, Δinft and Δinft-1 are virtually uncorrelated, which means that Δinft-1 cannot be used as
an IV for Δinft.
15.16 (i) When we add v̂2 to the original equation and estimate it by OLS, the coefficient on v̂2
is about –.057 with a t statistic of about –1.08. Therefore, while the difference in the estimates of
the return to education is practically large, it is not statistically significant.
(ii) We now add nearc2 as an IV along with nearc4. (Although, in the reduced form for educ,
nearc2 is not significant.) The 2SLS estimate of β1 is now .157, se(β̂1) = .053. So the estimate
is even larger.
(iii) Let ûi be the 2SLS residuals. We regress these on all exogenous variables, including
nearc2 and nearc4. The n-R-squared statistic is (3,010)(.0004) ≈ 1.20. There is one
overidentifying restriction, so we compute the p-value from the χ²₁ distribution: p-value = P(χ²₁ >
1.20) ≈ .27, so the overidentifying restriction is not rejected.
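The tail probability for one degree of freedom can be checked without statistical tables, because a chi-square variable with one degree of freedom is the square of a standard normal. The sketch below uses only Python's standard library; the function name is hypothetical.

```python
import math

def chi2_sf_1df(x):
    """P(chi-square with 1 df > x), using chi2_1 = Z**2 for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

n, r_squared = 3010, 0.0004
stat = n * r_squared          # the n-R-squared overidentification statistic
p_value = chi2_sf_1df(stat)
print(round(stat, 2), round(p_value, 2))
```

Any p-value of this size is far above conventional significance levels, so the conclusion that the overidentifying restriction is not rejected does not hinge on the exact figure.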
15.17 (i) Sixteen states executed at least one prisoner in 1991, 1992, or 1993. (That is, for 1993,
exec is greater than zero for 16 observations.) Texas had by far the most executions with 34.
mrdrte = –5.28 – 2.07 d93 + .128 exec + 2.53 unem
(4.43) (2.14) (.263) (0.78)
The positive coefficient on exec is no evidence of a deterrent effect. Statistically, the coefficient
is not different from zero. The coefficient on unem implies that higher unemployment rates are
associated with higher murder rates.
(iii) When we difference (and use only the changes from 1990 to 1993), we obtain
Δmrdrte = .413 – .104 Δexec – .067 Δunem
(.209) (.043) (.159)
The coefficient on Δexec is negative and statistically significant (p-value ≈ .02 against a two-
sided alternative), suggesting a deterrent effect. One more execution reduces the murder rate by
about .1, so 10 more executions reduce the murder rate by one (which means one murder per
100,000 people). The unemployment rate variable is no longer significant.
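The significance and magnitude statements in this part can be checked from the reported coefficient and standard error. The sketch below uses the standard normal approximation for the p-value, which is an assumption made to keep the code in the standard library. A t distribution with the sample's degrees of freedom would give a slightly larger value.

```python
import math

def two_sided_p_normal(t):
    """Two-sided p-value using the standard normal approximation."""
    return math.erfc(abs(t) / math.sqrt(2.0))

t = -0.104 / 0.043            # t statistic on the change in executions
p = two_sided_p_normal(t)
effect_of_10 = 10 * -0.104    # change in murders per 100,000 from 10 more executions
print(round(t, 2), round(p, 3), round(effect_of_10, 2))
```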
which shows a strong negative correlation in the change in executions. This means that,
apparently, states follow policies whereby if executions were high in the preceding three-year
period, they are lower, one-for-one, in the next three-year period.
Technically, to test the identification condition, we should add Δunem to the regression. But
its coefficient is small and statistically very insignificant, and adding it does not change the
outcome at all.
(v) When the differenced equation is estimated using Δexec-1 as an IV for Δexec, we obtain
Δmrdrte = .411 – .100 Δexec – .067 Δunem
(.211) (.064) (.159)
This is very similar to when we estimate the differenced equation by OLS. Not surprisingly, the
most important change is that the standard error on β̂1 is now larger and reduces the statistical
significance of β̂1.
[Instructor’s Note: As an illustration of how important a single observation can be, you might
want the students to redo this exercise dropping Texas, which accounts for a large fraction of
executions; see also Computer Exercise 14.12. The results are not nearly as significant. Does
this mean Texas is an “outlier”? Not necessarily, especially given that we have differenced to
remove the state effect. But we reduce the variation in the explanatory variable, Δexec,
substantially by dropping Texas.]
15.18 (i) As usual, if unemt is correlated with et, OLS will be biased and inconsistent for
estimating β1.
(ii) If E(et|inft-1,unemt-1, …) = 0 then unemt-1 is uncorrelated with et, which means unemt-1
satisfies the first requirement for an IV in
(iii) The second requirement for unemt-1 to be a valid IV for unemt is that unemt-1 must be
sufficiently correlated with unemt. The regression of unemt on unemt-1 yields
(0.58) (.097)
n = 48, R2 = .554.
Δinft = .694 − .138 unemt
(1.883) (.319)
n = 48, R2 = .048.
The IV estimate of β1 is much lower in magnitude than the OLS estimate (−.543), and β̂1 is not
statistically different from zero. The OLS estimate had a t statistic of about –2.36 [see equation
(11.19)].
pira = −.198 + .054 p401k + .0087 inc − .000023 inc2 − .0016 age + .00012 age2
(.069) (.010) (.0005) (.000004) (.0033) (.00004)
n = 9,275, R2 = .180
The coefficient on p401k implies that participation in a 401(k) plan is associated with a .054
higher probability of having an individual retirement account, holding income and age fixed.
(ii) While the regression in part (i) controls for income and age, it does not account for the
fact that different people have different taste for savings, even within given income and age
categories. People that tend to be savers will tend to have both a 401(k) plan as well as an IRA.
(This means that the error term, u, is positively correlated with p401k.) What we would like to
know is, for a given person, if that person participates in a 401(k) does it make it less likely or
more likely that the person also has an IRA. This ceteris paribus question is difficult to answer
by OLS without many more controls for the taste for saving.
(iii) First, we need e401k to be partially correlated with p401k; not surprisingly, this is not an
issue, as being eligible for a 401(k) plan is, by definition, necessary for participation. (The
regression in part (iv) verifies that they are strongly positively correlated.) The more difficult
issue is whether e401k can be taken as exogenous in the structural model. In other words, is
being eligible for a 401(k) correlated with unobserved taste for saving? If we think workers that
like to save for retirement will match up with employers that provide vehicles for retirement
saving, then u and e401k would be positively correlated. Certainly we think that e401k is less
correlated with u than is p401k. But remember, this alone is not enough to ensure that the IV
estimator has less asymptotic bias than the OLS estimator; see page 493.
(iv) The reduced form equation, estimated by OLS but with heteroskedasticity-robust
standard errors, is
p401k = .059 + .689 e401k + .0011 inc − .0000018 inc2 − .0047 age + .000052 age2
(.046) (.008) (.0003) (.0000027) (.0022) (.000026)
n = 9,275, R2 = .596
The t statistic on e401k is over 85, and its coefficient estimate implies that, holding income and
age fixed, eligibility in a 401(k) plan increases the probability of participation in a 401(k) by .69.
Clearly, e401k passes one of the two requirements as an IV for p401k.
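The size of the t statistic on e401k can be approximated from the rounded figures in the reduced form. Because the standard error is reported to only three decimals, the sketch also shows the range of ratios consistent with that rounding. This range is an illustrative assumption about rounding, not something reported in the text.

```python
coef, se = 0.689, 0.008        # reported to three decimals
t_approx = coef / se

# Rounding of the standard error alone allows se anywhere in [.0075, .0085).
t_low, t_high = coef / 0.0085, coef / 0.0075
print(round(t_approx, 1), round(t_low, 1), round(t_high, 1))
```

Whatever the exact value, the ratio is so large that the partial correlation between eligibility and participation is not in doubt.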
(v) When e401k is used as an IV for p401k we get the following, with heteroskedasticity-
robust standard errors:
pira = −.207 + .021 p401k + .0090 inc − .000024 inc2 − .0011 age + .00011 age2
(.065) (.013) (.0005) (.000004) (.0032) (.00004)
n = 9,275, R2 = .180
The IV estimate of βp401k is less than half as large as the OLS estimate, and the IV estimate has a
t statistic roughly equal to 1.62. The reduction in β̂p401k is what we expect given the unobserved
taste for saving argument made in part (ii). But we still do not estimate a tradeoff between
participating in a 401(k) plan and participating in an IRA. This conclusion has prompted some
in the literature to claim that 401(k) saving is additional saving; it does not simply crowd out
saving in other plans.
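The two claims about the IV estimate, that it is less than half the OLS estimate and that its t statistic is about 1.62, follow from the reported numbers. A minimal sketch using the rounded coefficients from parts (i) and (v):

```python
ols_coef = 0.054               # OLS estimate on p401k, part (i)
iv_coef, iv_se = 0.021, 0.013  # IV estimate and robust standard error, part (v)

ratio = iv_coef / ols_coef     # IV estimate as a fraction of the OLS estimate
t_iv = iv_coef / iv_se
print(round(ratio, 2), round(t_iv, 2))
```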
(vi) After obtaining the reduced form residuals from part (iv), say vˆi , we add these to the
structural equation and run OLS. The coefficient on vˆi is .075 with a heteroskedasticity-robust t
= 3.92. Therefore, there is strong evidence that p401k is endogenous in the structural equation
(assuming, of course, that the IV, e401k, is exogenous).
log (wage) = 5.22 + .0936 educ + .0209 exper + .0115 tenure − .183 black
(.54) (.0337) (.0084) (.0027) (.050)
n = 935, R2 = .169
is .0700 and the corresponding standard error is .0264. Both are too low. The reduction in the
estimated return to education from about 9.4% to 7.0% is not trivial. This illustrates that it is
best to avoid doing 2SLS manually.
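To see why the manual second-stage standard error cannot be trusted, here is a minimal simulated sketch. The data-generating process, the sample size, and the function names are all assumptions made for illustration; this is not the wage data used above. The point estimate from the manual second stage matches 2SLS, but the manual residuals are built from the first-stage fitted values rather than from the endogenous regressor itself, so the estimated error variance (and hence the standard error) is wrong. In this particular design the manual standard error comes out too small; in other designs it can be too large.

```python
import math
import random

def simple_ols(y, x):
    """OLS of y on a constant and x: returns (intercept, slope, sum of squared deviations of x)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope, sxx

def manual_vs_proper_2sls(y, x, z):
    """Return (slope, manual second-stage se, proper 2SLS se), using z as the IV for x."""
    n = len(y)
    a0, a1, _ = simple_ols(x, z)                 # first stage
    xhat = [a0 + a1 * zi for zi in z]            # fitted values of the endogenous regressor
    b0, b1, sxx_hat = simple_ols(y, xhat)        # second stage: same point estimates as 2SLS
    # Manual residuals use xhat; the proper 2SLS residuals use x itself.
    ssr_manual = sum((yi - b0 - b1 * xh) ** 2 for yi, xh in zip(y, xhat))
    ssr_proper = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))
    se_manual = math.sqrt(ssr_manual / (n - 2) / sxx_hat)
    se_proper = math.sqrt(ssr_proper / (n - 2) / sxx_hat)
    return b1, se_manual, se_proper

# Hypothetical simulated data: x is endogenous because u depends on v.
rng = random.Random(0)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]
v = [rng.gauss(0, 1) for _ in range(n)]
u = [-0.8 * vi + rng.gauss(0, 1) for vi in v]
x = [zi + vi for zi, vi in zip(z, v)]
y = [1.0 + 0.5 * xi + ui for xi, ui in zip(x, u)]

slope, se_manual, se_proper = manual_vs_proper_2sls(y, x, z)
print(round(slope, 3), round(se_manual, 4), round(se_proper, 4))
```

Any econometrics package that has a built-in 2SLS routine computes the second of the two standard errors, which is the reason to prefer it over two separate OLS runs.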
n = 1,230, R2 = .162
Given the above estimates, the 95% confidence interval for the return to education is roughly
8.7% to 11.5%.
n = 1,230, R2 = .0003
While the correlation between educ and ctuit has the expected negative sign, the t statistic is only
about −.59, and this is not nearly large enough to conclude that these variables are correlated.
This means that, even if ctuit is exogenous in the simple wage equation, we cannot use it as an
IV for educ.
log (wage) = −.507 + .137 educ + .112 exper − .0030 exper2 − .017 ne − .017 nc
(.241) (.009) (.027) (.0012) (.086) (.071)
+ .018 west + .156 ne18 + .011 nc18 − .030 west18 + .205 urban + .126 urban18
(.081) (.087) (.073) (.086) (.042) (.049)
n = 1,230, R2 = .219
(iv) In the multiple regression of educ on ctuit and the other explanatory variables in part (iii),
the coefficient on ctuit is −.165, t statistic = −2.77. So an increase of $1,000 in tuition reduces
years of education by about .165 (since the tuition variables are measured in thousands).
(v) Now we estimate the multiple regression model by IV, using ctuit as an IV for educ. The
IV estimate of βeduc is .250 (se = .122). While the point estimate seems large, the 95%
(vi) The very large standard error of the IV estimate in part (v) shows that the IV analysis is
not very useful. This is as it should be, as ctuit is not especially convincing as an IV. While it is
significant in the reduced form for educ with other controls, the fact that it was insignificant in
part (ii) is troubling. If we changed the set of explanatory variables slightly, would educ and
ctuit cease to be partially correlated?
CHAPTER 16
TEACHING NOTES
I spend some time in Section 16.1 trying to distinguish between good and inappropriate uses of
SEMs. Naturally, this is partly determined by my taste, and many applications fall into a gray
area. But students who are going to learn about SEMs should know that just because two (or
more) variables are jointly determined does not mean that it is appropriate to specify and
estimate an SEM. I have seen many bad applications of SEMs where no equation in the system
can stand on its own with an interesting ceteris paribus interpretation. In most cases, the
researcher either wanted to estimate a tradeoff between two variables, controlling for other
factors – in which case OLS is appropriate – or should have been estimating what is (often
derogatorily) called the “reduced form.”
The identification of a two-equation SEM in Section 16.3 is fairly standard except that I
emphasize that identification is a feature of the population. (The early work on SEMs also had
this emphasis.) Given the treatment of 2SLS in Chapter 15, the rank condition is easy to state
(and test).
Romer’s (1993) inflation and openness example is a nice example of using aggregate cross-
sectional data. Purists may not like the labor supply example, but it has become common to
view labor supply as being a two-tier decision. While there are different ways to model the two
tiers, specifying a standard labor supply function conditional on working is not outside the realm
of reasonable models.
Section 16.5 begins by expressing doubts about the usefulness of SEMs for aggregate models such
as those that are specified based on standard macroeconomic models. Such models raise all
kinds of thorny issues; these are ignored in virtually all texts, where such models are still used to
illustrate SEM applications.
SEMs with panel data, which are covered in Section 16.6, are not covered in any other
introductory text. Presumably, if you are teaching this material, it is to more advanced students
in a second semester, perhaps even in a more applied course. Once students have seen first
differencing or the within transformation, along with IV methods, they will find specifying and
estimating models of the sort contained in Example 16.8 straightforward. Levitt’s example
concerning prison populations is especially convincing because his instruments seem to be truly
exogenous.
SOLUTIONS TO PROBLEMS
16.1 (i) If α1 = 0 then y1 = β1z1 + u1, and so the right-hand side depends only on the exogenous
variable z1 and the error term u1. This then is the reduced form for y1. If α2 = 0, the reduced
form for y1 is y1 = β2z2 + u2. (Note that having both α1 and α2 equal zero is not interesting as it
implies the bizarre condition u2 – u1 = β1z1 − β2z2.)
If α1 ≠ 0 and α2 = 0, we can plug y1 = β2z2 + u2 into the first equation and solve for y2:
β2z2 + u2 = α1y2 + β1z1 + u1, so that α1y2 = −β1z1 + β2z2 + (u2 – u1).
Dividing by α1 gives y2 = π21z1 + π22z2 + v2,
where π21 = −β1/α1, π22 = β2/α1, and v2 = (u2 – u1)/α1. Note that the reduced form for y2
generally depends on z1 and z2 (as well as on u1 and u2).
(ii) If we multiply the second structural equation by (α1/α2) and subtract it from the first
structural equation, we obtain
y1 – (α1/α2)y1 = α1y2 – α1y2 + β1z1 – (α1/α2)β2z2 + u1 – (α1/α2)u2
= β1z1 – (α1/α2)β2z2 + u1 – (α1/α2)u2
or
[1 – (α1/α2)]y1 = β1z1 – (α1/α2)β2z2 + u1 – (α1/α2)u2.
Because α1 ≠ α2, 1 – (α1/α2) ≠ 0, and so we can divide the equation by 1 – (α1/α2) to obtain the
reduced form for y1: y1 = π11z1 + π12z2 + v1, where π11 = β1/[1 – (α1/α2)], π12 = −(α1/α2)β2/[1 –
(α1/α2)], and v1 = [u1 – (α1/α2)u2]/[1 – (α1/α2)].
A reduced form does exist for y2, as can be seen by subtracting the second equation from the
first:
0 = (α1 – α2)y2 + β1z1 – β2z2 + u1 – u2;
because α1 ≠ α2, we can rearrange and divide by α1 − α2 to obtain the reduced form.
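The reduced form for y1 in part (ii) can be checked numerically. The sketch below solves the two structural equations directly and compares the result with the π11, π12, and v1 expressions; the parameter values and function names are hypothetical and chosen only for illustration.

```python
def solve_structural(a1, a2, b1, b2, z1, z2, u1, u2):
    """Solve y1 = a1*y2 + b1*z1 + u1 and y1 = a2*y2 + b2*z2 + u2 for (y1, y2)."""
    # Subtracting the second equation from the first eliminates y1.
    y2 = (b2 * z2 - b1 * z1 + u2 - u1) / (a1 - a2)
    y1 = a1 * y2 + b1 * z1 + u1
    return y1, y2

def reduced_form_y1(a1, a2, b1, b2, z1, z2, u1, u2):
    """Reduced form for y1 using the pi11, pi12, and v1 expressions from part (ii)."""
    r = a1 / a2
    pi11 = b1 / (1 - r)
    pi12 = -r * b2 / (1 - r)
    v1 = (u1 - r * u2) / (1 - r)
    return pi11 * z1 + pi12 * z2 + v1

# Hypothetical values with a1 != a2 (supply slopes up, demand slopes down).
args = dict(a1=0.5, a2=-1.0, b1=2.0, b2=3.0, z1=1.0, z2=2.0, u1=0.1, u2=-0.2)
y1, y2 = solve_structural(**args)
print(y1, reduced_form_y1(**args))
```

Both numbers agree, and the solved (y1, y2) pair also satisfies the second structural equation, which is a useful check when students derive reduced forms by hand.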
(iii) In supply and demand examples, α1 ≠ α2 is very reasonable. If the first equation is the
supply function, we generally expect α1 > 0, and if the second equation is the demand function,
α2 < 0. The reduced forms can exist even in cases where the supply function is not upward
sloping and the demand function is not downward sloping, but we might question the usefulness
of such models.
16.2 Using simple economics, the first equation must be the demand function, as it depends on
income, which is a common determinant of demand. The second equation contains a variable,
rainfall, that affects crop production and therefore corn supply.
16.3 No. In this example, we are interested in estimating the tradeoff between sleeping and
working, controlling for some other factors. OLS is perfectly suited for this, provided we have
been able to control for all other relevant factors. While it is true individuals are assumed to
optimally allocate their time subject to constraints, this does not result in a system of
simultaneous equations. If we wrote down such a system, there is no sense in which each
equation could stand on its own; neither would have an interesting ceteris paribus interpretation.
Besides, we could not estimate either equation because economic reasoning gives us no way of
excluding exogenous variables from either equation. See Example 16.2 for a similar discussion.
16.4 We can easily see that the rank condition for identifying the second equation does not hold:
there are no exogenous variables appearing in the first equation that are not also in the second
equation. The first equation is identified provided γ3 ≠ 0 (and we would presume γ3 < 0). This
gives us an exogenous variable, log(price), that can be used as an IV for alcohol in estimating
the first equation by 2SLS (which is just standard IV in this case).
(ii) If students having sex behave rationally, and condom usage does prevent STDs, then
condom usage should increase as the rate of infection increases.
(iii) If we plug the structural equation for infrate into conuse = γ0 + γ1infrate + …, we see
that conuse depends on γ1u1. Because γ1 > 0, conuse is positively related to u1. In fact, if the
structural error (u2) in the conuse equation is uncorrelated with u1, Cov(conuse,u1) = γ1Var(u1) >
0. If we ignore the other explanatory variables in the infrate equation, we can use equation (5.4)
to obtain the direction of bias: plim(β̂1) − β1 > 0 because Cov(conuse,u1) > 0, where β̂1 denotes
the OLS estimator. Since we think β1 < 0, OLS is biased towards zero. In other words, if we use
OLS on the infrate equation, we are likely to underestimate the importance of condom use in
reducing STDs. (Remember, the more negative is β1, the more effective is condom usage.)
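A small simulation can make the direction of the simultaneity bias concrete. This is only a sketch: the values of β1 and γ1, the unit error variances, and the omission of intercepts and other regressors are all assumptions made for illustration.

```python
import random

def ols_slope(y, x):
    """Slope from the simple regression of y on x (with an intercept)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

beta1, gamma1 = -0.5, 0.4        # hypothetical structural parameters
rng = random.Random(1)
infrate, conuse = [], []
for _ in range(5000):
    u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)
    # Solve infrate = beta1*conuse + u1 and conuse = gamma1*infrate + u2 for conuse.
    c = (gamma1 * u1 + u2) / (1 - gamma1 * beta1)
    conuse.append(c)
    infrate.append(beta1 * c + u1)

b_ols = ols_slope(infrate, conuse)
print(round(b_ols, 3), "versus the true value", beta1)
```

With these settings the OLS slope lands well above −.5 (closer to zero), which is the pattern described in part (iii).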
(iv) We would have to assume that condis does not appear, in addition to conuse, in the
infrate equation. This seems reasonable, as it is usage that should directly affect STDs, and not
just having a distribution program. But we must also assume condis is exogenous in the infrate equation:
it cannot be correlated with unobserved factors (in u1) that also affect infrate.
We must also assume that condis has some partial effect on conuse, something that can be
tested by estimating the reduced form for conuse. It seems likely that this requirement for an IV
– see equations (15.30) and (15.31) – is satisfied.
16.6 (i) It could be that the decision to unionize certain segments of workers is related to how a
firm treats its employees. While the timing may not be contemporaneous, with the snapshot of a
single cross section we might as well assume that it is.
(ii) One possibility is to collect information on whether workers’ parents belonged to a union,
and construct a variable that is the percentage of workers who had a parent in a union (say,
perpar). This may be (partially) correlated with the percent of workers that belong to a union.
(iii) We would have to assume that perpar is exogenous in the pension equation. We can
test whether perunion is partially correlated with perpar by estimating the reduced form for
perunion and doing a t test on perpar.
16.7 (i) Attendance at women’s basketball may grow in ways that are unrelated to factors that we
can observe and control for. The taste for women’s basketball may increase over time, and this
would be captured by the time trend.
(ii) No. The university sets the price, and it may change price based on expectations of next
year’s attendance; if the university uses factors that we cannot observe, these are necessarily in
the error term ut. So even though the supply is fixed, it does not mean that price is uncorrelated
with the unobservables affecting demand.
(iii) If people only care about how this year’s team is doing, SEASPERCt-1 can be excluded
from the equation once WINPERCt has been controlled for. Of course, this is not a very good
assumption for all games, as attendance early in the season is likely to be related to how the team
did last year. We would also need to check that lPRICEt is partially correlated with
SEASPERCt-1 by estimating the reduced form for lPRICEt.
(iv) It does make sense to include a measure of men’s basketball ticket prices, as attending a
women’s basketball game is a substitute for attending a men’s game. The coefficient on
lMPRICEt would be expected to be positive, because a higher price for men’s games should shift some demand toward women’s games. The winning percentage of the men’s team is
another good candidate for an explanatory variable in the women’s demand equation.
(v) It might be better to use first differences of the logs, which are then growth rates. We
would then drop the observation for the first game in each season.
(vi) If a game is sold out, we cannot observe true demand for that game. We only know that
desired attendance is some number above capacity. If we just plug in capacity, we are
understating the actual demand for tickets. (Chapter 17 discusses censored regression methods
that can be used in such cases.)
16.8 We must first eliminate the unobserved effect, ai1. If we difference, we have
for t = 2,3. The δt here denotes different intercepts in the two years. The key assumption is that
the change in the (log of) the state allocation, ΔlSTATEALLit, is exogenous in this equation.
Naturally, ΔlSTATEALLit is (partially) correlated with ΔlEXPENDit because local expenditures
depend at least partly on the state subsidy. The policy change in 1994 means that there should be
significant variation in ΔlSTATEALLit, at least for the 1994 to 1996 change. Therefore, we can
16.9 (i) Assuming the structural equation represents a causal relationship, 100⋅β1 is the
approximate percentage change in income if a person smokes one more cigarette per day.
(ii) Since consumption and price are, ceteris paribus, negatively related, we expect γ5 ≤ 0
(allowing for γ5 = 0). Similarly, everything else equal, restaurant smoking restrictions should
reduce cigarette smoking, so γ6 ≤ 0.
(iii) We need γ5 or γ6 to be different from zero. That is, we need at least one exogenous
variable in the cigs equation that is not also in the log(income) equation.
log (income) = 7.80 + .0017 cigs + .060 educ + .058 age − .00063 age2
(0.17) (.0017) (.008) (.008) (.00008)
n = 807, R2 = .165.
The coefficient on cigs implies that cigarette smoking causes income to increase, although the
coefficient is not statistically different from zero. Remember, OLS ignores potential
simultaneity between income and cigarette smoking.
cigs = 1.58 − .450 educ + .823 age − .0096 age2 − .351 log(cigpric)
(23.70) (.162) (.154) (.0017) (5.766)
− 2.74 restaurn
(1.11)
n = 807, R2 = .051.
While log(cigpric) is very insignificant, restaurn had the expected negative sign and a t statistic
of about –2.47. (People living in states with restaurant smoking restrictions smoke almost three
fewer cigarettes, on average, given education and age.) We could drop log(cigpric) from the
log (income) = 7.78 − .042 cigs + .040 educ + .094 age − .00105 age2
(0.23) (.026) (.016) (.023) (.00027)
n = 807.
Now the coefficient on cigs is negative and almost significant at the 10% level against a two-sided
alternative. The estimated effect is very large: each additional cigarette someone smokes
lowers predicted income by about 4.2%. Of course, the 95% CI for βcigs is very wide.
(vii) Assuming that state level cigarette prices and restaurant smoking restrictions are
exogenous in the income equation is problematical. Incomes are known to vary by region, as do
restaurant smoking restrictions. It could be that in states where income is lower (after controlling
for education and age), restaurant smoking restrictions are less likely to be in place.
16.10 (i) We estimate a constant elasticity version of the labor supply equation (naturally, only
for hours > 0), again by 2SLS. We get
which implies a labor supply elasticity of 1.99. This is even higher than the 1.26 we obtained
from equation (16.24) at the mean value of hours (1303).
(ii) Now we estimate the equation by 2SLS but allow log(wage) and educ to both be
endogenous. The full list of instrumental variables is age, kidslt6, nwifeinc, exper, exper2,
motheduc, and fatheduc. The result is
log (hours ) = 7.26 + 1.81 log(wage) − .129 educ − .012 age
(1.02) (0.50) (.087) (.011)
− .543 kidslt6 − .019 nwifeinc
(.211) (.009)
n = 428.
The biggest effect is to reduce the size of the coefficient on educ as well as its statistical
significance. The labor supply elasticity is only moderately smaller.
(iii) After obtaining the 2SLS residuals, û1, from the estimation in part (ii), we regress these
on age, kidslt6, nwifeinc, exper, exper2, motheduc, and fatheduc. The n-R-squared statistic is
428(.0010) = .428. We have two overidentifying restrictions, so the p-value is roughly
P(χ₂² > .43) ≈ .81. There is no evidence against the exogeneity of the IVs.
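The overidentification p-value is easy to reproduce by hand, because a chi-square random variable with two degrees of freedom has survival function exp(−x/2). The sketch below uses n = 428 (the sample size reported with the 2SLS estimates) and the R-squared of .0010 from the residual regression; the function name is my own.

```python
import math

def chi2_sf_two_df(x):
    """P(chi-square with 2 df > x); valid only for 2 degrees of freedom."""
    return math.exp(-x / 2.0)

n, r_squared = 428, 0.0010
stat = n * r_squared              # n-R-squared statistic
p_value = chi2_sf_two_df(stat)
print(round(stat, 3), round(p_value, 2))   # 0.428 0.81
```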
The OLS coefficient is the same, to three decimal places, when log(pcinc) is included in the
model. The IV estimate with log(pcinc) in the equation is −.337, which is very close to −.333.
While log(land) is very significant, land is not, so we might as well use only log(land) as the IV
for open.
[Instructor’s Note: You might ask students whether it is better to use log(land) as the single IV
for open or to use both land and land2. In fact, log(land) explains much more variation in open.]
(iii) When we add oil to the original model, and assume oil is exogenous, the IV estimates
are
Being an oil producer is estimated to reduce average annual inflation by over 6.5 percentage
points, but the effect is not statistically significant. This is not too surprising, as there are only
seven oil producers in the sample.
16.12 (i) The usual form of the test assumes no serial correlation under H0, and this appears to be
the case. We also assume homoskedasticity. After estimating (16.35), we obtain the 2SLS
residuals, ût. We then run the regression ût on gct-1, gyt-1, and r3t-1. The n-R-squared statistic is
35(.0613) ≈ 2.15. With one df the (asymptotic) p-value is P(χ₁² > 2.15) ≈ .143, and so the
(ii) If we estimate (16.35) but with gct-2, gyt-2, and r3t-2 as the IVs, we obtain, with n = 34,
The coefficient on gyt has doubled in size compared with equation (16.35), but it is not
statistically significant. The coefficient on r3t is still small and statistically insignificant.
gyt = .021 − .070 gct-2 + .094 gyt-2 + .00074 r3t-2
(.007) (.469) (.330) (.00166)
n = 34, R2 = .0137.
The F statistic for joint significance of all explanatory variables yields p-value ≈ .94, and so
there is no correlation between gyt and the proposed IVs, gct-2, gyt-2, and r3t-2. Therefore, we
never should have done the IV estimation in part (ii) in the first place.
[Instructor’s Note: There may be serial correlation in this regression, in which case the F
statistic is not valid. But the point remains that gyt is not at all correlated with two lags of all
variables.]
16.13 This is an open-ended question without a unique answer. Even if we settle on extending
the data through a particular year, we might want to change the disposable income and
nondurable consumption numbers in earlier years, as these are often recalculated. For example,
the value for real disposable personal income in 1995, as reported in Table B-29 of the 1997
Economic Report of the President (ERP), is $4,945.8 billion. In the 1999 ERP, this value has
been changed to $4,906.0 billion (see Table B-31). All series can be updated using the latest
edition of the ERP. The key is to use real values and make them per capita by dividing by
population. Make sure that you use nondurable consumption.
16.14 (i) If we estimate the inverse supply function by OLS we obtain (with the coefficients on
the monthly dummies suppressed)
Several of the monthly dummy variables are very statistically significant, but their coefficients
(ii) We need grdefst to have a nonzero coefficient in the reduced form for gcemt. More
precisely, if we write
then identification requires π1 ≠ 0. When we run this regression, πˆ1 = −1.054 with a t statistic of
about –0.294. Therefore, we cannot reject H0: π1 = 0 at any reasonable significance level, and
we conclude that grdefst is not a useful IV for gcemt (even if grdefst is exogenous in the supply
equation).
and we need at least one of π1 and π2 to be different from zero. In fact, πˆ1 = .136, t( πˆ1 ) = .984
and πˆ 2 = 1.15, t( πˆ 2 ) = 5.47. So grnont is very significant in the reduced form for gcemt, and we
can proceed with IV estimation.
(iv) We use both grrest and grnont as IVs for gcemt and apply 2SLS, even though the former
is not significant in the RF. The estimated supply function (with seasonal dummy
coefficients suppressed) is now
While the coefficient on gcemt is still negative, it is only about one-fourth the size of the OLS
coefficient, and it is now very insignificant. At this point we would conclude that the static
supply function is horizontal (with gprc on the vertical axis, as usual). Shea (1993) adds many
lags of gcemt and estimates a finite distributed lag model by IV, using leads as well as lags of
grrest and grnont as IVs. He estimates a positive long run propensity.
16.15 (i) If county administrators can predict when crime rates will increase, they may hire more
(ii) This may be reasonable, although tax collections depend in part on income and sales
taxes, and revenues from these depend on the state of the economy, which can also influence
crime rates.
(iv) If the grants were awarded randomly, then the grant amounts, say grantit for the dollar
amount for county i and year t, will be uncorrelated with Δuit, the changes in unobservables that
affect county crime rates. By definition, grantit should be correlated with Δlog(polpcit) across i
and t. This means we have an exogenous variable that can be omitted from the crime equation
and that is (partially) correlated with the endogenous explanatory variable. We could reestimate
(13.33) by IV.
16.16 (i) To estimate the demand equations, we need at least one exogenous variable that appears
in the supply equation.
(ii) For wave2t and wave3t to be valid IVs for log(avgprct), we need two assumptions. The
first is that these can be properly excluded from the demand equation. This may not be entirely
reasonable, as wave heights are determined partly by weather, and demand at a local fish
market could depend on weather. The second assumption is that at least one of wave2t and
wave3t appears in the supply equation. There is indirect evidence of this in part (iii), as the two
variables are jointly significant in the reduced form for log(avgprct).
log(avgprct) = −1.02 − .012 mont − .0090 tuest + .051 wedt + .124 thurst
(.14) (.114) (.1119) (.112) (.111)
n = 97, R2 = .304
The variables wave2t and wave3t are jointly very significant: F = 19.1, p-value = zero to four
decimal places.
n = 97, R2 = .193
The 95% confidence interval for the demand elasticity is roughly −1.47 to −.17. The point
estimate, −.82, seems reasonable: a 10 percent increase in price reduces quantity demanded by
about 8.2%.
(v) The coefficient on uˆi ,t −1 is about .294 (se = .103), so there is strong evidence of positive
serial correlation, although the estimate of ρ is not huge. One could compute a Newey-West
standard error for 2SLS in place of the usual standard error.
(vi) To estimate the supply elasticity, we would have to assume that the day-of-the-week
dummies do not appear in the supply equation, but they do appear in the demand equation. Part
(iii) provides evidence that there are day-of-the-week effects in the demand function. But we
cannot know about the supply function.
(vii) Unfortunately, in the estimation of the reduced form for log(avgprct) in part (iii), the
variables mon, tues, wed, and thurs are jointly insignificant [F(4,90) = .53, p-value = .71.] This
means that, while some of these dummies seem to show up in the demand equation, things cancel
out in a way that they do not affect equilibrium price, once wave2 and wave3 are in the equation.
So, without more information, we have no hope of estimating the supply equation.
[Instructor’s Note: You could have the students try part (vii), anyway, to see what happens.
Also, you could have them estimate the demand function by OLS, and compare the estimates
with the 2SLS estimates in part (iv). You could also have them compute the test of the single
overidentification condition.]
CHAPTER 17
TEACHING NOTES
I emphasize to the students that, first and foremost, the reason we use the probit and logit models
is to obtain more reasonable functional forms for the response probability. Once we move to a
nonlinear model with a fully specified conditional distribution, it makes sense to use the efficient
estimation procedure, maximum likelihood. It is important to spend some time on interpreting
probit and logit estimates. In particular, the students should know the rules-of-thumb for
comparing probit, logit, and LPM estimates. Beginners sometimes mistakenly think that,
because the probit and especially the logit estimates are much larger than the LPM estimates, the
explanatory variables now have larger estimated effects on the response probabilities than in the
LPM case. This may or may not be true.
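The rules of thumb come from evaluating the probit and logit densities at zero, which is where the response probability changes fastest. As a rough sketch (the variable names are mine, and the scale factors are only approximate guides, not substitutes for computing partial effects at interesting values of the regressors):

```python
import math

# Standard normal density at zero: scale factor for probit coefficients.
probit_scale = 1.0 / math.sqrt(2.0 * math.pi)      # about 0.40

# Logistic density at zero, exp(0) / (1 + exp(0))**2: scale factor for logit coefficients.
logit_scale = math.exp(0.0) / (1.0 + math.exp(0.0)) ** 2   # exactly 0.25

# Implied ratio of logit to probit coefficients.
logit_to_probit = probit_scale / logit_scale        # about 1.6

print(round(probit_scale, 3), logit_scale, round(logit_to_probit, 2))
```

So, very roughly, probit estimates are multiplied by .4 (divided by 2.5) and logit estimates by .25 (divided by 4) before being set beside LPM estimates, and logit estimates are about 1.6 times the probit estimates.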
I view the Tobit model, when properly applied, as improving functional form for corner solution
outcomes. In most cases it is wrong to view a Tobit application as a data-censoring problem
(unless there is true data censoring in collecting the data or because of institutional constraints).
For example, in using survey data to estimate the demand for a new product, say a safer pesticide
to be used in farming, some farmers will demand zero at the going price, while some will
demand positive pounds per acre. There is no data censoring here; some farmers find it optimal
to use none of the new pesticide. The Tobit model provides more realistic functional forms for
E(y|x) and E(y|y > 0,x) than a linear model for y. With the Tobit model, students may be tempted
to compare the Tobit estimates with those from the linear model and conclude that the Tobit
estimates imply larger effects for the independent variables. But, as with probit and logit, the
Tobit estimates must be scaled down to be comparable with OLS estimates in a linear model.
(See Equation (17.27); for an example, see Computer Exercise 17.10.)
Poisson regression with an exponential conditional mean is used primarily to improve over a
linear functional form for E(y|x). The parameters are easy to interpret as semi-elasticities or
elasticities. If the Poisson distributional assumption is correct, we can use the Poisson
distribution to compute probabilities, too. But overdispersion is often present in count regression
models, and standard errors and likelihood ratio statistics should be adjusted to reflect this.
Some reviewers of the first edition complained about either the inclusion of this material or its
location within the chapter. I think applications of count data models are on the rise: in
microeconometric fields such as criminology, health economics, and industrial organization,
many interesting response variables come in the form of counts. One suggestion was that
Poisson regression should not come between the Tobit model in Section 17.2 and Section 17.4,
on censored and truncated regression. In fact, I put the Poisson regression model between these
two topics on purpose: I hope it helps emphasize that the material in Section 17.2 is purely about
functional form, as is Poisson regression. Sections 17.4 and 17.5 deal with underlying linear
models, but where there is a data-observability problem.
Censored regression, truncated regression, and incidental truncation are used for missing data
problems. Censored and truncated data sets usually result from sample design, as in duration
analysis. Incidental truncation often arises from self-selection into a certain state, such as
employment or participating in a training program. It is important to emphasize to students that
the underlying models are classical linear models; if not for the missing data or sample selection
problem, OLS would be the efficient estimation procedure.
SOLUTIONS TO PROBLEMS
17.1 (i) Let m0 denote the number (not the percent) correctly predicted when yi = 0 (so the
prediction is also zero) and let m1 be the number correctly predicted when yi = 1. Then the
proportion correctly predicted is (m0 + m1)/n, where n is the sample size. By simple algebra, we
can write this as (n0/n)(m0/n0) + (n1/n)(m1/n1) = (1 − ȳ)(m0/n0) + ȳ(m1/n1), where we have used
the fact that ȳ = n1/n (the proportion of the sample with yi = 1) and 1 − ȳ = n0/n (the proportion
of the sample with yi = 0). But m0/n0 is the proportion correctly predicted when yi = 0, and m1/n1
is the proportion correctly predicted when yi = 1. Therefore, multiplying through by 100, we have
p̂ = (1 − ȳ) q̂0 + ȳ⋅q̂1,
where we use the fact that, by definition, p̂ = 100[(m0 + m1)/n], q̂0 = 100(m0/n0), and q̂1 =
100(m1/n1).
(ii) We just use the formula from part (i): p̂ = .30(80) + .70(40) = 52. Therefore, overall we
correctly predict only 52% of the outcomes. This is because, while 80% of the time we correctly
predict y = 0, yi = 0 accounts for only 30 percent of the outcomes. More weight (.70) is given to
the predictions when yi = 1, and we do much less well predicting that outcome (getting it right
only 40% of the time).
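The weighted-average formula is simple enough to verify directly. In this sketch the function name and the illustrative counts (a sample of 100 with 30 zeros and 70 ones) are my own; the percentages are the ones used in part (ii).

```python
def overall_percent_correct(ybar, q0, q1):
    """Overall percent correctly predicted as a weighted average of q0 and q1."""
    return (1.0 - ybar) * q0 + ybar * q1

p_hat = overall_percent_correct(0.70, 80.0, 40.0)

# Same answer from raw counts: n0 = 30 with m0 = 24 correct, n1 = 70 with m1 = 28 correct.
n0, n1, m0, m1 = 30, 70, 24, 28
p_from_counts = 100.0 * (m0 + m1) / (n0 + n1)

print(round(p_hat, 6), p_from_counts)   # 52.0 52.0
```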
17.2 We need to compute the estimated probability first at hsGPA = 3.0, SAT = 1,200, and
study = 10 and subtract from this the estimated probability with hsGPA = 3.0, SAT = 1,200, and
study = 5. To obtain the first probability, we start by computing the linear function inside Λ(⋅):
−1.77 + .24(3.0) + .00058(1,200) + .073(10) = .376. Next, we plug this into the logit function:
exp(.376)/[1 + exp(.376)] ≈ .593. This is the estimated probability that a student-athlete with
the given characteristics graduates in five years.
For the student-athlete who attended study hall five hours a week, we compute –1.77 + .24(3.0) +
.00058(1,200) + .073(5) = .011. Evaluating the logit function at this value gives exp(.011)/[1 +
exp(.011)] ≈ .503. Therefore, the difference in estimated probabilities
is .593 − .503 = .090, or just under .10. [Note how far off the calculation would be if we simply
use the coefficient on study to conclude that the difference in probabilities is .073(10 – 5) = .365.]
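The two logit probabilities can be reproduced with a few lines of code. The coefficients are the ones used in the calculation above; the function names are hypothetical.

```python
import math

def logit_cdf(index):
    """Logistic function: exp(index) / (1 + exp(index))."""
    return 1.0 / (1.0 + math.exp(-index))

def grad_index(hsgpa, sat, study):
    """Linear index inside the logit function, using the estimates from the problem."""
    return -1.77 + 0.24 * hsgpa + 0.00058 * sat + 0.073 * study

p_study10 = logit_cdf(grad_index(3.0, 1200, 10))
p_study5 = logit_cdf(grad_index(3.0, 1200, 5))
difference = p_study10 - p_study5

print(round(p_study10, 3), round(p_study5, 3), round(difference, 3))   # 0.593 0.503 0.09
```

Comparing the result with .073(10 – 5) = .365 shows students why a logit coefficient cannot be read as a change in probability.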
17.3 (i) We use the chain rule and equation (17.23). In particular, let x1 ≡ log(z1). Then, by the
chain rule,
∂E(y|y > 0,x)/∂z1 = [∂E(y|y > 0,x)/∂x1]⋅(∂x1/∂z1) = [∂E(y|y > 0,x)/∂x1]⋅(1/z1),
where we use the fact that the derivative of log(z1) is 1/z1. When we plug in (17.23) for
∂E(y|y > 0,x)/∂x1, we obtain the answer.
(ii) As in part (i), we use the chain rule, which is now more complicated:
∂E(y|y > 0,x)/∂z1 = [∂E(y|y > 0,x)/∂x1]⋅(∂x1/∂z1) + [∂E(y|y > 0,x)/∂x2]⋅(∂x2/∂z1),
where x1 = z1 and x2 = z1². But ∂E(y|y > 0,x)/∂x1 = β1{1 − λ(xβ/σ)[xβ/σ + λ(xβ/σ)]}, ∂E(y|y >
0,x)/∂x2 = β2{1 − λ(xβ/σ)[xβ/σ + λ(xβ/σ)]}, ∂x1/∂z1 = 1, and ∂x2/∂z1 = 2z1. Plugging these into
the first formula and rearranging gives the answer.
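Both answers share the adjustment factor 1 − λ(c)[c + λ(c)], where c = xβ/σ and λ is the inverse Mills ratio. A minimal sketch of the two partial effects follows; the function names are my own, and any numerical inputs a reader supplies would be hypothetical rather than estimates from the text.

```python
import math

def norm_pdf(c):
    """Standard normal density."""
    return math.exp(-0.5 * c * c) / math.sqrt(2.0 * math.pi)

def norm_cdf(c):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(c / math.sqrt(2.0)))

def tobit_adjustment(c):
    """1 - lambda(c)*[c + lambda(c)], with lambda(c) = pdf(c)/cdf(c) and c = x*beta/sigma."""
    lam = norm_pdf(c) / norm_cdf(c)
    return 1.0 - lam * (c + lam)

def effect_log_regressor(beta1, z1, c):
    """Part (i): effect of z1 on E(y|y>0,x) when x1 = log(z1)."""
    return (beta1 / z1) * tobit_adjustment(c)

def effect_quadratic(beta1, beta2, z1, c):
    """Part (ii): effect of z1 on E(y|y>0,x) when x1 = z1 and x2 = z1 squared."""
    return (beta1 + 2.0 * beta2 * z1) * tobit_adjustment(c)

print(round(tobit_adjustment(0.0), 4))   # equals 1 - 2/pi, about 0.3634
```

The adjustment factor lies strictly between zero and one, which is why the partial effects on E(y|y > 0,x) are always smaller in magnitude than the corresponding linear-index derivatives.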
17.4 Since log(⋅) is an increasing function – that is, for positive w1 and w2, w1 > w2 if and only if
log(w1) > log(w2) – it follows that, for each i, mvpi > minwagei if and only if log(mvpi) >
log(minwagei). Therefore, log(wagei) = max[log(mvpi), log(minwagei)].
17.5 (i) patents is a count variable, and so the Poisson regression model is appropriate.
(ii) Because β1 is the coefficient on log(sales), β1 is the elasticity of patents with respect to
sales. (More precisely, β1 is the elasticity of E(patents|sales,RD) with respect to sales.)
(iii) We use the chain rule to obtain the partial derivative of exp[β0 + β1log(sales) + β2RD +
β3RD2] with respect to RD:
∂E(patents|sales,RD)/∂RD = (β2 + 2β3RD)exp[β0 + β1log(sales) + β2RD + β3RD2].
A simpler way to interpret this model is to take the log and then differentiate with respect to RD:
this gives β2 + 2β3RD, which shows that the semi-elasticity of patents with respect to RD is
100(β2 + 2β3RD).
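The semi-elasticity formula can be checked against a numerical derivative of the log of the conditional mean. All coefficient values below are hypothetical and serve only to illustrate the calculation.

```python
import math

def cond_mean(sales, rd, b0, b1, b2, b3):
    """Exponential conditional mean: exp[b0 + b1*log(sales) + b2*RD + b3*RD^2]."""
    return math.exp(b0 + b1 * math.log(sales) + b2 * rd + b3 * rd ** 2)

def semi_elasticity(rd, b2, b3):
    """Approximate percentage change in E(patents|sales,RD) per unit change in RD."""
    return 100.0 * (b2 + 2.0 * b3 * rd)

# Hypothetical coefficients and evaluation point.
b0, b1, b2, b3 = 0.5, 0.8, 0.10, -0.005
sales, rd, h = 50.0, 4.0, 1e-6

# Central finite difference of log E(patents|sales,RD) with respect to RD.
numeric = 100.0 * (math.log(cond_mean(sales, rd + h, b0, b1, b2, b3))
                   - math.log(cond_mean(sales, rd - h, b0, b1, b2, b3))) / (2.0 * h)

print(round(semi_elasticity(rd, b2, b3), 4), round(numeric, 4))
```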
17.6 (i) OLS will be unbiased, because we are choosing the sample on the basis of an exogenous
explanatory variable. The population regression function for sav is the same as the regression
function in the subpopulation with age > 25.
(ii) Assuming that marital status and number of children affect sav only through household
size (hhsize), this is another example of exogenous sample selection. But, in the subpopulation
of married people without children, hhsize = 2. Because there is no variation in hhsize in the
subpopulation, we would not be able to estimate β2; effectively, the intercept in the
subpopulation becomes β0 + 2β2, and that is all we can estimate. But, assuming there is variation
in inc, educ, and age among married people without children (and that we have a sufficiently
varied sample from this subpopulation), we can still estimate β1, β3, and β4.
(iii) This would be selecting the sample on the basis of the dependent variable, which causes
OLS to be biased and inconsistent for estimating the βj in the population model. We should
17.7 For the immediate purpose of finding out the variables that determine whether accepted
applicants choose to enroll, there is not a sample selection problem. The population of interest is
applicants accepted by the particular university. Therefore, it is perfectly appropriate to specify a
model for this group, probably a linear probability model, a probit model, or a logit model. OLS
or maximum likelihood estimation will produce consistent, asymptotically normal estimators.
This is a good example of where many data analysts’ knee-jerk reaction might be to conclude
that there is a sample selection problem, which is why it is important to be very precise about the
purpose of the analysis, including stating the population of interest.
If the university is hoping the pool of applicants changes in the near future, then there is a
sample selection problem: the current students that apply may be systematically different from
students that may apply in the future. As the nature of the pool of applicants is unlikely to
change dramatically over one year, the sample selection problem can be mitigated, if not entirely
eliminated, by updating the analysis after each first-year class has enrolled.
17.8 (i) If spread is zero, there is no favorite, and the team we (arbitrarily) label the
favorite should have a 50% chance of winning.
(ii) The linear probability model estimated by OLS gives
favwin = .577 + .0194 spread
         (.028) (.0023)
         [.032] [.0019]
n = 553, R² = .111,
where the usual standard errors are in (⋅) and the heteroskedasticity-robust standard errors are in
[⋅]. Using the usual standard error, the t statistic for H0: β0 = .5 is (.577 − .5)/.028 = 2.75, which
leads to rejecting H0 against a two-sided alternative at the 1% level (critical value ≈ 2.58).
Using the robust standard error, the t statistic is (.577 − .5)/.032 ≈ 2.41, which still leads to
rejecting H0 at the 2% level against a two-sided alternative (critical value ≈ 2.33).
(iii) As we expect, spread is very statistically significant using either standard error, with a t
statistic greater than eight. If spread = 10 the estimated probability that the favored team wins
is .577 + .0194(10) = .771.
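The arithmetic in parts (ii) and (iii) can be reproduced with a few lines of Python. This is only a sketch that plugs in the estimates reported above; the variable names are illustrative.

```python
# Estimates and standard errors copied from the LPM reported above.
b0, b1 = 0.577, 0.0194
se_b0_usual, se_b0_robust = 0.028, 0.032
se_b1_usual, se_b1_robust = 0.0023, 0.0019

# t statistics for H0: beta0 = .5 (not the default H0: beta0 = 0).
t_usual = (b0 - 0.5) / se_b0_usual
t_robust = (b0 - 0.5) / se_b0_robust

# t statistics on spread under either standard error.
t_spread_usual = b1 / se_b1_usual
t_spread_robust = b1 / se_b1_robust

# Fitted probability that the favorite wins when spread = 10.
p_spread10 = b0 + b1 * 10

print(round(t_usual, 2), round(t_robust, 2), round(p_spread10, 3))
```

Note that the intercept test is centered at .5, so the t statistic printed by a regression package (which tests β0 = 0) is not the relevant one here.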
(iv) The probit results are given in the following table:
Dependent Variable: favwin
Independent Variable      Coefficient    (Standard Error)
spread                    .0925          (.0122)
constant                  −.0106         (.1037)
Number of Observations    553
Log Likelihood Value      −263.56
Pseudo R-Squared          .129
In the probit model P(favwin = 1|spread) = Φ(β0 + β1spread), where Φ(⋅) is the standard normal
cdf. If β0 = 0, then P(favwin = 1|spread) = Φ(β1spread)
and, in particular, P(favwin = 1|spread = 0) = Φ(0) = .5. This is the analog of testing whether the
intercept is .5 in the LPM. From the table, the t statistic for testing H0: β0 = 0 is only about −.102,
so we do not reject H0.
(v) When spread = 10 the predicted response probability from the estimated probit model is
Φ[−.0106 + .0925(10)] = Φ(.9144) ≈ .820. This is somewhat above the estimate for the LPM.
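As a quick check of part (v), the standard normal cdf can be evaluated with the error function in Python's standard library. The helper name norm_cdf is illustrative; the coefficients are the reported probit and LPM estimates.

```python
import math

def norm_cdf(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Probit index and response probability at spread = 10.
index = -0.0106 + 0.0925 * 10
p_probit = norm_cdf(index)

# LPM fitted probability at the same spread, for comparison.
p_lpm = 0.577 + 0.0194 * 10

print(round(index, 4), round(p_probit, 3), round(p_lpm, 3))
```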
(vi) When favhome, fav25, and und25 are added to the probit model, the value of the log-
likelihood becomes –262.64. Therefore, the likelihood ratio statistic is 2[−262.64 – (−263.56)] =
2(263.56 – 262.64) = 1.84. The p-value from the χ² distribution with 3 df is about .61, so favhome,
fav25, and und25 are jointly very insignificant. Once spread is controlled for, these other factors
have no additional power for predicting the outcome.
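The likelihood ratio calculation in part (vi) can be sketched in Python without any statistics package. The function chi2_sf_3df is an illustrative helper that uses the closed-form upper tail of a chi-square distribution with 3 degrees of freedom; with other degrees of freedom a statistics library would be the natural choice.

```python
import math

def chi2_sf_3df(x):
    """P(X > x) for X ~ chi-square with 3 df (closed form for this df only)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

loglik_restricted = -263.56    # probit with spread only
loglik_unrestricted = -262.64  # adds favhome, fav25, und25

# LR = 2 * (unrestricted log-likelihood - restricted log-likelihood).
lr_stat = 2.0 * (loglik_unrestricted - loglik_restricted)
p_value = chi2_sf_3df(lr_stat)

print(round(lr_stat, 2), round(p_value, 2))
```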
17.9 (i) The probit estimates from approve on white are given in the following table:
Dependent Variable: approve
Independent Variable      Coefficient    (Standard Error)
white                     .784           (.087)
constant                  .547           (.075)
Number of Observations    1,989
Log Likelihood Value      −700.88
As there is only one explanatory variable that takes on just two values, there are only two
different predicted values: the estimated probabilities of loan approval for white and nonwhite
applicants. Rounded to three decimal places these are .708 for nonwhites and .908 for whites.
Without rounding errors, these are identical to the fitted values from the linear probability model.
This must always be the case when the independent variables in a binary response model are
mutually exclusive and exhaustive binary variables. Then, the predicted probabilities, whether
we use the LPM, probit, or logit models, are simply the cell frequencies. (In other words, .708 is
the proportion of loans approved for nonwhites and .908 is the proportion approved for whites.)
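The two predicted probabilities in part (i) follow directly from the probit table. A short Python sketch, with norm_cdf as an illustrative helper built on the standard library:

```python
import math

def norm_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

b_const, b_white = 0.547, 0.784  # probit estimates from the table above

p_nonwhite = norm_cdf(b_const)            # white = 0
p_white = norm_cdf(b_const + b_white)     # white = 1

print(round(p_nonwhite, 3), round(p_white, 3))
```

Because the model is saturated in a single dummy variable, these two numbers are just the sample approval rates in each group, up to rounding of the reported coefficients.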
(ii) With the set of controls added, the probit estimate on white becomes about .520
(se ≈ .097). Therefore, there is still very strong evidence of discrimination against nonwhites.
We can divide this by 2.5 to make it roughly comparable to the LPM estimate in part (iii) of
Computer Exercise 7.16: .520/2.5 ≈ .208, compared with .129 in the LPM.
(iii) When we use logit instead of probit, the coefficient (standard error) on white
becomes .938 (.173).
(iv) Recall that, to make probit and logit estimates roughly comparable, we can multiply the
logit estimates by .625. The scaled logit coefficient becomes .625(.938) ≈ .586, which is
reasonably close to the probit estimate. A better comparison would be to compare the predicted
probabilities by setting the other controls at interesting values, such as their average values in the
sample.
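The rule-of-thumb rescalings used in parts (ii) and (iv) are simple enough to script. The sketch below only applies the two scale factors quoted in the text (2.5 for probit to LPM, .625 for logit to probit); the variable names are illustrative.

```python
probit_white = 0.520   # probit coefficient on white, with controls
logit_white = 0.938    # logit coefficient on white, with controls
lpm_white = 0.129      # LPM coefficient quoted in the text

# Rough comparability factors from the text.
probit_as_lpm = probit_white / 2.5
logit_as_probit = 0.625 * logit_white

print(round(probit_as_lpm, 3), round(logit_as_probit, 3))
```

These are only rough conversions; as the text notes, comparing predicted probabilities at chosen values of the controls is the more reliable comparison.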
17.10 (i) Out of 616 workers, 172, or about 28%, have zero pension benefits. For the 444
workers reporting positive pension benefits, the range is from $7.28 to $2,880.27. Therefore, we
have a nontrivial fraction of the sample with pension = 0, and the range of positive pension
benefits is fairly wide. The Tobit model is well-suited to this kind of dependent variable.
(ii) The Tobit results are given in the following table:
Dependent Variable: pension
Independent Variable    (1)                     (2)
exper                   5.20 (6.01)             4.39 (5.83)
age                     −4.64 (5.71)            −1.65 (5.56)
tenure                  36.02 (4.56)            28.78 (4.50)
educ                    93.21 (10.89)           106.83 (10.77)
depends                 35.28 (21.92)           41.47 (21.21)
married                 53.69 (71.73)           19.75 (69.50)
white                   144.09 (102.08)         159.30 (98.97)
male                    308.15 (69.89)          257.25 (68.02)
union                   –––––                   439.05 (62.49)
constant                −1,252.43 (219.07)      −1,571.51 (218.54)
σ̂                       677.74                  652.90
(Standard errors are in parentheses.)
In column (1), which does not control for union, being white or male (or, of course, both)
increases predicted pension benefits, although only male is statistically significant (t ≈ 4.41).
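(iii) We use equation (17.22) with exper = tenure = 10, age = 35, educ = 16, depends = 0,
married = 0, white = 1, and male = 1 to estimate the expected benefit for a white male with the
given characteristics. Using our shorthand, we have
xβ̂ = −1,252.43 + 5.20(10) − 4.64(35) + 36.02(10) + 93.21(16) + 144.09 + 308.15 ≈ 940.97.
Therefore, with σ̂ = 677.74 we estimate E(pension|x) as
Φ(940.97/677.74)⋅(940.97) + (677.74)⋅φ(940.97/677.74) ≈ 966.
For a nonwhite female with the same characteristics, xβ̂ ≈ 940.97 − (144.09 + 308.15) = 488.73,
and the same formula gives an estimate of about 582. The estimated difference is about 384.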
[Instructor’s Note: If we had just done a linear regression, we would add the coefficients on
white and male to obtain the estimated difference. We get about 114.94 + 272.95 = 387.89,
which is very close to the Tobit estimate. Provided that we focus on partial effects, Tobit and a
linear model often give similar answers for explanatory variables near the mean values.]
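The expected-value calculation in part (iii) can be sketched in Python. Two assumptions are made here: equation (17.22) is taken to be the usual Tobit conditional mean, E(y|x) = Φ(xβ/σ)xβ + σφ(xβ/σ), and the comparison person is a nonwhite female with otherwise identical characteristics. Only the column (1) coefficients on variables with nonzero values are needed, since depends = married = 0.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tobit_mean(xb, sigma):
    """Unconditional mean E(y|x) for a Tobit model censored at zero (assumed form)."""
    z = xb / sigma
    return norm_cdf(z) * xb + sigma * norm_pdf(z)

sigma_hat = 677.74
# Index excluding the white and male dummies (exper = tenure = 10, age = 35, educ = 16).
xb_base = -1252.43 + 5.20 * 10 - 4.64 * 35 + 36.02 * 10 + 93.21 * 16
xb_white_male = xb_base + 144.09 + 308.15

e_white_male = tobit_mean(xb_white_male, sigma_hat)
e_nonwhite_female = tobit_mean(xb_base, sigma_hat)
tobit_diff = e_white_male - e_nonwhite_female

print(round(xb_white_male, 2), round(e_white_male, 1), round(tobit_diff, 1))
```

Under these assumptions the Tobit difference comes out close to the 387.89 obtained from the linear model in the Instructor's Note.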
(iv) Column (2) in the previous table gives the results with union added. The coefficient is
large, but to see exactly how large, we should use equation (17.22) to estimate E(pension|x) with
union = 1 and union = 0, setting the other explanatory variables at interesting values. The t
statistic on union is over seven.
(v) When peratio is used as the dependent variable in the Tobit model, white and male are
individually and jointly insignificant. The p-value for the test of joint significance is about .74.
Therefore, neither whites nor males seem to have different tastes for pension benefits as a
fraction of earnings. White males have higher pension benefits because they have, on average,
higher earnings.
17.11 (i) The results for the Poisson regression model that includes pcnv², ptime86², and inc86²
are given in the following table:
Dependent Variable: narr86
Independent Variable      Coefficient    (Standard Error)
pcnv                      1.15           (0.28)
avgsen                    −.026          (.021)
tottime                   .012           (.016)
ptime86                   .684           (.091)
qemp86                    .023           (.033)
inc86                     −.012          (.002)
black                     .591           (.074)
hispan                    .422           (.075)
born60                    −.093          (.064)
pcnv²                     −1.80          (0.31)
ptime86²                  −.103          (.016)
inc86²                    .000021        (.000006)
constant                  −.710          (.070)
Number of Observations    2,725
Log Likelihood Value      −2,168.87
σ̂                         1.179
(iii) From Table 17.3 we have the log-likelihood value for the restricted model, Lr =
−2,248.76. The log-likelihood value for the unrestricted model is given in the above table as
–2,168.87. Therefore, the usual likelihood ratio statistic is 159.78. The quasi-likelihood ratio
statistic is 159.78/1.39 ≈ 114.95. In a χ² distribution with 3 df this gives a p-value of essentially
zero.
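The quasi-likelihood ratio statistic in part (iii) divides the usual LR statistic by σ̂². A short Python sketch using the reported values follows; chi2_sf_3df is an illustrative closed-form helper for 3 degrees of freedom.

```python
import math

def chi2_sf_3df(x):
    """P(X > x) for X ~ chi-square with 3 df (closed form for this df only)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

loglik_restricted = -2248.76    # from Table 17.3
loglik_unrestricted = -2168.87  # model with the three squared terms
sigma_hat = 1.179

lr_stat = 2.0 * (loglik_unrestricted - loglik_restricted)
qlr_stat = lr_stat / sigma_hat ** 2   # quasi-LR: rescale by the variance factor
p_value = chi2_sf_3df(qlr_stat)

print(round(lr_stat, 2), round(qlr_stat, 2), p_value)
```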
17.12 (i) The Poisson regression results are given in the following table:
Independent Variable    Coefficient    Standard Error
educ                    −.048          .007
age                     .204           .055
age²                    −.0022         .0006
black                   .360           .061
east                    .088           .053
northcen                .142           .048
west                    .080           .066
farm                    −.015          .058
othrural                −.057          .069
town                    .031           .049
smcity                  .074           .062
y74                     .093           .063
y76                     −.029          .068
y78                     −.016          .069
y80                     −.020          .069
y82                     −.193          .067
y84                     −.214          .069
constant                −3.060         1.211
n = 1,129
L = −2,070.23
σ̂ = .944
The coefficient on y82 means that, other factors in the model fixed, a woman’s fertility was
about 19.3% lower in 1982 than in 1972.
(ii) Because the coefficient on black is so large, we obtain the estimated proportionate
difference as exp(.36) – 1 ≈ .433, so a black woman has 43.3% more children than a comparable
nonblack woman.
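The exact proportionate change implied by a coefficient in an exponential mean model is exp(β̂) − 1, which differs noticeably from β̂ itself when the coefficient is large. A small Python sketch compares the two for the black and y82 coefficients; the helper name is illustrative.

```python
import math

def exact_pct_change(coef):
    """Exact proportionate change in the expected count for a one-unit change."""
    return math.exp(coef) - 1.0

b_black, b_y82 = 0.360, -0.193  # Poisson coefficients from the table above

exact_black = exact_pct_change(b_black)
exact_y82 = exact_pct_change(b_y82)

# The approximation 100*coef is close for small coefficients, less so for large ones.
print(round(100 * b_black, 1), round(100 * exact_black, 1))
print(round(100 * b_y82, 1), round(100 * exact_y82, 1))
```

The 19.3% figure quoted for y82 in part (i) is the approximation 100·β̂; the exact calculation gives a somewhat smaller decline.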
(iii) From the above table, σ̂ = .944, which shows that there is actually underdispersion in
the estimated model.
(iv) The sample correlation between kidsi and its fitted value is about .348, which means the
R-squared (or, at least one version of it), is about (.348)² ≈ .121. Interestingly, this is actually
smaller than the R-squared for the linear model estimated by OLS. (However, remember that
OLS obtains the highest possible R-squared for a linear model, while Poisson regression does not
obtain the highest possible R-squared for an exponential regression model.)
17.13 The results of an OLS regression using only the uncensored durations are given in the
following table.
There are several important differences between the OLS estimates using the uncensored
durations and the estimates from the censored regression in Table 17.4. For example, the binary
indicator for drug usage, drugs, has become positive and insignificant, whereas it was negative
(as we expect) and significant in Table 17.4. On the other hand, the work program dummy,
workprg, becomes positive but is still insignificant. The remaining coefficients maintain the
same sign, but they are all attenuated toward zero. The apparent attenuation bias of OLS for the
coefficient on black is especially severe, where the estimate changes from −.543 in the
17.14 (i) When log(wage) is regressed on educ, exper, exper², nwifeinc, age, kidslt6, and kidsge6,
the coefficient and standard error on educ are .0999 (se = .0151).
(ii) The Heckit coefficient on educ is .1187 (se = .0341), where the standard error is just the
usual OLS standard error. The estimated return to education is somewhat larger than without the
Heckit corrections, but the Heckit standard error is over twice as large.
(iii) Regressing λ̂ on educ, exper, exper², nwifeinc, age, kidslt6, and kidsge6 (using only the
selected sample of 428) produces R² ≈ .962, which means that there is substantial
multicollinearity among the regressors in the second stage regression. This is what leads to the
large standard errors. Without an exclusion restriction in the log(wage) equation, λ̂ is almost a
17.15 (i) 185 out of 445 participated in the job training program. The longest time in the
experiment was 24 months (obtained from the variable mosinex).
(ii) The F statistic for joint significance of the explanatory variables is F(7,437) = 1.43 with
p-value = .19. Therefore, they are jointly insignificant at even the 15% level. Note that, even
though we have estimated a linear probability model, the null hypothesis we are testing is that all
slope coefficients are zero, and so there is no heteroskedasticity under H0. This means that the
usual F statistic is asymptotically valid.
(iii) After estimating the model P(train = 1|x) = Φ(β0 + β1unem74 + β2unem75 + β3age +
β4educ + β5black + β6hisp + β7married) by probit maximum likelihood, the likelihood ratio test
for joint significance is 10.18. In a χ² distribution with 7 df this gives p-value = .18, which is very
similar to that obtained for the LPM in part (ii).
(iv) Training eligibility was randomly assigned among the participants, so it is not surprising
that train appears to be independent of other observed factors. (However, there can be a
difference between eligibility and actual participation, as men can always refuse to participate if
chosen.)
(v) The simple LPM results are
unem78 = .354 − .111 train
         (.028) (.044)
n = 445, R² = .014.
Participating in the job training program lowers the estimated probability of being unemployed
in 1978 by .111, or 11.1 percentage points. This is a large effect: the probability of being
unemployed without participation is .354, and the training program reduces it to .243. The
difference is statistically significant at almost the 1% level against a two-sided alternative.
(Note that this is another case where, because training was randomly assigned, we have
confidence that OLS is consistently estimating a causal effect, even though the R-squared from
the regression is very small. There is much about being unemployed that we are not explaining,
but we can be pretty confident that this job training program was beneficial.)
where standard errors are in parentheses. It does not make sense to compare the coefficient on
train for the probit, −.321, with the LPM estimate. The probabilities have different functional
forms. However, note that the probit and LPM t statistics are essentially the same (although the
LPM standard errors should be made robust to heteroskedasticity).
(vii) There are only two fitted values in each case, and they are the same: .354 when train =
0 and .243 when train = 1. This has to be the case, because any method simply delivers the cell
frequencies as the estimated probabilities. The LPM estimates are easier to interpret because
they do not involve the transformation by Φ(⋅), but it does not matter which is used provided the
(viii) The fitted values are no longer identical because the model is not saturated, that is, the
explanatory variables are not an exhaustive, mutually exclusive set of dummy variables. But,
because the other explanatory variables are insignificant, the fitted values are highly correlated:
the LPM and probit fitted values have a correlation of about .993.
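The saturated-model point in parts (vii) and (viii) can be checked numerically: with a single dummy regressor, OLS fitted values are exactly the group means. The Python sketch below uses a small made-up sample (hypothetical data, not the data set used in the exercise) and an illustrative helper ols_simple.

```python
def ols_simple(x, y):
    """Intercept and slope from a simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical binary data: treatment dummy and a binary outcome.
train = [0, 0, 0, 0, 0, 1, 1, 1, 1]
unemployed = [1, 0, 0, 1, 0, 0, 0, 1, 0]

b0, b1 = ols_simple(train, unemployed)

# Cell frequencies computed directly.
mean_untrained = sum(u for t, u in zip(train, unemployed) if t == 0) / train.count(0)
mean_trained = sum(u for t, u in zip(train, unemployed) if t == 1) / train.count(1)

print(b0, b0 + b1, mean_untrained, mean_trained)
```

Once additional non-dummy regressors enter, as in part (viii), this equality no longer holds and the LPM and probit fitted values can differ.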
17.16 (i) The distribution is not continuous: there are clear focal points, and rounding. For
example, many more people report one pound than either two-thirds of a pound or 1 1/3 pounds.
This violates the latent variable formulation underlying the Tobit model, where the latent error
has a normal distribution. Nevertheless, we should view Tobit in this context as a way to
possibly improve functional form. It may work better than the linear model for estimating the
expected demand function.
(ii) The following table contains the Tobit estimates and, for later comparison, OLS
estimates of a linear model:
Independent Variable    Tobit      OLS (Linear Model)
ecoprc                  −5.82      −2.90
                        (.89)      (.59)
σ̂                       3.44       2.48
R-squared               .0369      .0393
(iii) Only the price variables, ecoprc and regprc, are statistically significant at the 1% level.
(iv) The signs of the price coefficients accord with basic demand theory: the own-price
effect is negative, the cross price effect for the substitute good (regular apples) is positive.
(v) The null hypothesis can be stated as H0: β1 + β2 = 0. Define θ1 = β1 + β2. Then θ̂1 = −.16.
To obtain the t statistic, I write β2 = θ1 − β1, plug in, and rearrange. This results in doing Tobit
of ecolbs on (ecoprc − regprc), regprc, faminc, and hhsize. The coefficient on regprc is θ̂1 and,
of course we get its standard error: about .59. Therefore, the t statistic is about −.27 and p-value
= .78. We do not reject the null.
(vi) The smallest fitted value is .798, while the largest is 3.327.
(vii) The squared correlation between ecolbsi and its fitted value is about .0369. This is one
possible R-squared measure.
(viii) The linear model estimates are given in the table for part (ii). The OLS estimates are
smaller than the Tobit estimates because the OLS estimates are estimated partial effects on
E(ecolbs|x), whereas the Tobit coefficients must be scaled by the term in equation (17.27). The
scaling factor is always between zero and one, and often substantially less than one. The Tobit
model does not fit better, at least in terms of estimating E(ecolbs|x): the linear model R-squared
is a bit larger (.0393 versus .0369).
(ix) This is not a correct statement. We have another case where we have confidence in the
ceteris paribus price effects (because the price variables are exogenously set), yet we cannot
explain much of the variation in ecolbs. The fact that demand for a fictitious product is hard to
explain is not very surprising.
[Instructor’s Notes: This might be a good place to remind students about basic economics. You
can ask them whether reglbs should be included as an additional explanatory variable in the
demand equation for ecolbs, making the point that the resulting equation would no longer be a
demand equation. In other words, reglbs and ecolbs are jointly determined, but it is not
appropriate to write each as a function of the other. You could have the students compute
heteroskedasticity-robust standard errors for the OLS estimates. Also, you could have them
estimate a probit model for ecolbs = 0 versus ecolbs > 0, and have them compare the scaled
Tobit slope estimates with the probit estimates.]
17.17 (i) 497 people do not smoke at all. 101 people report smoking 20 cigarettes a day. Since
one pack of cigarettes contains 20 cigarettes, it is not surprising that 20 is a focal point.
(ii) The Poisson distribution does not allow for the kinds of focal points that characterize cigs.
If you look at the full frequency distribution, there are blips at half a pack, two packs, and so on.
The probabilities in the Poisson distribution have a much smoother transition. Fortunately, the
Poisson regression model has nice robustness properties.
(iii) The results of the Poisson regression are given in the following table, along with the
OLS estimates of a linear model for later reference. The Poisson standard errors are the usual
Poisson maximum likelihood standard errors, and the OLS standard errors are the usual
(nonrobust) standard errors.
Dependent Variable: cigs
Independent Variable    Poisson (Exponential Model)    OLS (Linear Model)
log(cigpric)            −.355                          −2.90
                        (.144)                         (5.70)
σ̂                       4.54                           13.46
R-squared               .043                           .045
The estimated price elasticity is −.355 and the estimated income elasticity is .085.
(iv) If we use the maximum likelihood standard errors, the t statistic on log(cigpric) is about
−2.47, which is significant at the 5% level against a two-sided alternative. The t statistic on
(v) σ̂² = 20.61, and so σ̂ = 4.54. This is evidence of severe overdispersion, and means that
all of the standard errors for Poisson regression should be multiplied by 4.54; the t statistics
should be divided by 4.54.
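The overdispersion adjustment in part (v) is a simple rescaling, sketched below in Python for the log(cigpric) coefficient. The inputs are the reported estimate, its maximum likelihood standard error, and σ̂²; the variable names are illustrative.

```python
import math

sigma2_hat = 20.61
sigma_hat = math.sqrt(sigma2_hat)

coef_lcigpric = -0.355
se_ml = 0.144  # usual Poisson MLE standard error

t_ml = coef_lcigpric / se_ml

# Under the variance = sigma^2 * mean assumption, scale the standard error by sigma_hat.
se_adjusted = se_ml * sigma_hat
t_adjusted = coef_lcigpric / se_adjusted

print(round(sigma_hat, 2), round(t_ml, 2), round(t_adjusted, 2))
```

The adjusted t statistic lines up with the robust value of about −.54 reported in part (vi).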
(vi) The robust t statistic for log(cigpric) is about −.54, which makes it very insignificant.
This is a good example of how misleading the usual Poisson standard errors and test statistics can be.
The robust t statistic for log(income) is about .94, which also makes the income elasticity
statistically insignificant.
(vii) The education and age variables are still quite significant; the robust t statistic on educ
is over three in absolute value, and the robust t statistic on age is over five. The coefficient on educ
implies that one more year of education reduces the expected number of cigarettes smoked by
about 6.0%.
(viii) The minimum predicted value is .515 and the maximum is 18.84. The fact that we
predict some smoking for anyone in the sample is a limitation with using the expected value for
prediction. Further, we do not predict that anyone will smoke even one pack of cigarettes, even
though more than 25% of the people in the sample report smoking a pack or more per day! This
shows that smoking, especially heavy smoking, is difficult to predict based on the explanatory
variables we have access to.
(x) The linear model results are reported in the last column of the previous table. The R-
squared is slightly higher for the linear model – but remember, the OLS estimates are chosen to
maximize the R-squared, while the MLE estimates do not maximize the R-squared (as we have
calculated it). In any case, both R-squareds are quite small.
CHAPTER 18
TEACHING NOTES
Several of the topics in this chapter, including testing for unit roots and cointegration, have
become staples of applied time series analysis. Instructors who like their course to be more time
series oriented might cover this chapter after Chapter 12, if time permits. Or, the chapter can be
used as a reference for ambitious students who wish to be versed in recent time series
developments.
The discussion of infinite distributed lag models, and in particular geometric DL and rational DL
models, gives one particular interpretation of dynamic regression models. But one must
emphasize that only under fairly restrictive assumptions on the serial correlation in the error of
the infinite DL model does the dynamic regression consistently estimate the parameters in the lag
distribution. Computer Exercise 18.10 provides a good illustration of how the GDL model, and a
simple RDL model, can be too restrictive.
Example 18.5 tests for cointegration between the general fertility rate and the value of the
personal exemption. There is not much evidence of cointegration, which sheds further doubt on
the regressions in levels that were used in Chapter 10. The error correction model for holding
yields in Example 18.7 is likely to be of interest to students in finance. As a class project, or a
term project for a student, it would be interesting to update the data to see if the error correction
model is stable over time.
The forecasting section is heavily oriented towards regression methods and, in particular,
autoregressive models. These can be estimated using any econometrics package, and forecasts
and mean absolute errors or root mean squared errors are easy to obtain. The interest rate data
sets (for example, in INTQRT.RAW) can be updated to do much more recent out-of-sample
forecasting exercises.
SOLUTIONS TO PROBLEMS
18.1 With zt1 and zt2 now in the model, we should use one lag each as instrumental variables, zt-1,1
and zt-1,2. This gives one overidentifying restriction that can be tested.
18.2 (i) When we lag equation (18.68) once, multiply it by (1 – λ), and subtract it from (18.68),
we obtain
when we plug this into the first equation we obtain the desired result.
(ii) If {ut} is serially uncorrelated, then {vt = ut – (1 – λ)ut-1} must be serially correlated. In
fact, {vt} is an MA(1) process with α = – (1 – λ). Therefore, Cov(vt,vt-1) = – (1 – λ)σu², and the
correlation between vt and vt-h is zero for h > 1.
(iii) Because {vt} follows an MA(1) process, it is correlated with the lagged dependent
variable, yt-1. Therefore, the OLS estimators of the βj will be inconsistent (and biased, of course).
Nevertheless, we can use xt-2 as an IV for yt-1 because xt-2 is uncorrelated with vt (because ut and
ut-1 are both uncorrelated with xt-2) and xt-2 is partially correlated with yt-1.
18.3 For δ ≠ β, yt – δzt = yt – βzt + (β – δ)zt, which is an I(0) sequence (yt – βzt) plus an I(1)
sequence. Since an I(1) sequence has a growing variance, it dominates the I(0) part, and the
resulting sum is an I(1) sequence.
18.4 Following the hint, we show that yt-2 – βxt-2 can be written as a linear function of yt-1 – βxt-1,
Δyt-1, and Δxt-1. That is,
(yt-1 – βxt-1) – Δyt-1 + βΔxt-1 = yt-1 – βxt-1 – (yt-1 – yt-2) + β(xt-1 – xt-2) = yt-2 – βxt-2,
or
18.6 (i) This is given by the estimated intercept, 1.54. Remember, this is the percentage growth
at an annualized rate. It is statistically different from zero since t = 1.54/.56 = 2.75.
(ii) 1.54 + .031(10) = 1.85. As an aside, you could obtain the standard error of this estimate
by running the regression.
(iii) Growth in the S&P 500 index has a statistically significant effect on industrial
production growth – in the Granger causality sense – because the t statistic on pcspt-1 is about
2.38. The economic effect is reasonably large.
18.7 If unemt follows a stable AR(1) process, then this is the null model used to test for Granger
causality: under the null that gMt does not Granger cause unemt, we can write
unemt = β0 + β1unemt-1 + ut
E(ut|unemt-1, gMt-1, unemt-2, gMt-2, …) = 0
and |β1| < 1. Now, it is up to us to choose how many lags of gM to add to this equation. The
simplest approach is to add gMt-1 and to do a t test. But we could add a second or third lag (and
probably not beyond this with annual data), and compute an F test for joint significance of all
lags of gMt.
By assumption, E(et|It-1) = 0, and since yt-1, zt-1, and zt-2 are all in It-1, we have
We obtain the desired answer by adding one to the time index everywhere.
(ii) The forecasting equation for yn+1 is obtained by using part (i) with t = n, and then
plugging in the estimates:
(iii) From part (i), it follows that the model with one lag of z and AR(1) serial correlation in
the errors can be obtained from
with α0 = (1 − ρ)α, γ1 = δ1, and γ2 = −ρδ1 = −ργ1. The key is that γ2 is entirely determined (in a
nonlinear way) by ρ and γ1. So the model with a lag of z and AR(1) serial correlation is a special
case of the more general model. (Note that the general model depends on four parameters, while
the model from part (i) depends on only three.)
(iv) For forecasting, the AR(1) serial correlation model may be too restrictive. It may
impose restrictions on the parameters that are not met. On the other hand, if the AR(1) serial
correlation model holds, it captures the conditional mean E(yt|It-1) with one fewer parameter than
the general model; in other words, the AR(1) serial correlation model is more parsimonious.
[See Harvey (1990) for ways to test the restriction γ2 = −ργ1, which is called a common factor
restriction.]
18.9 Let ên+1 be the forecast error for forecasting yn+1, and let ân+1 be the forecast error for
forecasting Δyn+1. By definition, ên+1 = yn+1 − f̂n = yn+1 – (ĝn + yn) = (yn+1 – yn) − ĝn = Δyn+1 −
ĝn = ân+1, where the last equality follows by definition of the forecasting error for Δyn+1.
gprice = .0013 + .081 gwage + .640 gprice-1
(.0003) (.031) (.045)
n = 284, R2 = .454.
The estimated impact propensity is .081 while the estimated LRP is .081/(1 – .640) = .225. The
estimated lag distribution is graphed below.
[Figure: estimated lag distribution for the GDL model. Vertical axis: coefficient (0 to .1);
horizontal axis: lag (0 to 12).]
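The lag distribution that the figure displays can be regenerated from the two GDL estimates. The Python sketch below assumes the standard geometric form, in which the coefficient at lag h is γρ^h with γ the coefficient on gwage and ρ the coefficient on lagged gprice; the variable names are illustrative.

```python
gamma_hat = 0.081  # coefficient on gwage (impact propensity)
rho_hat = 0.640    # coefficient on lagged gprice

# Implied lag coefficients for lags 0 through 12.
gdl_weights = [gamma_hat * rho_hat ** h for h in range(13)]

# Long-run propensity: sum of the infinite geometric series.
gdl_lrp = gamma_hat / (1.0 - rho_hat)

for h, w in enumerate(gdl_weights):
    print(h, round(w, 4))
print("LRP:", round(gdl_lrp, 3))
```

The weights decline monotonically from lag zero, which is the restriction discussed in part (ii).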
(ii) The IP for the FDL model estimated in Problem 11.5 was .119, which is substantially
above the estimated IP for the GDL model. Further, the estimated LRP from GDL model is
much lower than that for the FDL model, which we estimated as 1.172. Clearly we cannot think
of the GDL model as a good approximation to the FDL model. One reason these are so different
can be seen by comparing the estimated lag distributions (see below for the GDL model). With
the FDL, the largest lag coefficient is at the ninth lag, which is impossible with the GDL model
(where the largest impact is always at lag zero). It could also be that {ut} in equation (18.8) does
not follow an AR(1) process with parameter ρ, which would cause the dynamic regression to
gprice = .0011 + .090 gwage + .619 gprice-1 + .055 gwage-1
(.0003) (.031) (.046) (.032)
n = 284, R2 = .460.
The coefficient on gwage-1 is not especially significant, but we compute the IP and LRP
anyway. The estimated IP is .09 while the LRP is (.090 + .055)/(1 – .619) ≈ .381. These are
both slightly higher than what we obtained for the GDL, but the LRP is still well below what we
obtained for the FDL in Problem 11.5. While this RDL model is more flexible than the GDL
model, it imposes a maximum lag coefficient (in absolute value) at lag zero or one. For the
estimates given above, the maximum effect is at the first lag. (See the estimated lag distribution
below.) This is not consistent with the FDL estimates in Problem 11.5.
[Figure: estimated lag distribution for the RDL model. Vertical axis: coefficient (0 to .12);
horizontal axis: lag (0 to 12).]
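The RDL lag distribution can be generated the same way. The sketch assumes the usual implied lag coefficients for a model with one lag of the dependent variable and one lag of the regressor: δ0 = γ0, δ1 = ργ0 + γ1, and δh = ρδh-1 for h ≥ 2. The symbol and variable names are illustrative.

```python
gamma0 = 0.090  # coefficient on gwage
gamma1 = 0.055  # coefficient on lagged gwage
rho = 0.619     # coefficient on lagged gprice

# Build lag coefficients for lags 0 through 12 by recursion.
rdl_weights = [gamma0, rho * gamma0 + gamma1]
for h in range(2, 13):
    rdl_weights.append(rho * rdl_weights[-1])

rdl_lrp = (gamma0 + gamma1) / (1.0 - rho)
peak_lag = rdl_weights.index(max(rdl_weights))

for h, w in enumerate(rdl_weights):
    print(h, round(w, 4))
print("LRP:", round(rdl_lrp, 3), "peak lag:", peak_lag)
```

With these estimates the peak is at lag one, after which the coefficients decay geometrically.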
ginvpct = –.786 – .956 log(invpct-1) + .0068 t
          (.170) (.198)               (.0021)
          + .532 ginvpct-1 + .290 ginvpct-2
          (.162)            (.165)
n = 39, R² = .437,
where ginvpct = log(invpct) – log(invpct-1). The t statistic for the augmented Dickey-Fuller unit
root test is –.956/.198 ≈ –4.82, which is well below –3.96, the 1% critical value obtained from
Table 18.3. Therefore, we strongly reject a unit root in log(invpct). (Incidentally, remember that
the t statistics on the intercept and time trend in this estimated equation do not have approximate t
distributions, although those on ginvpct-1 and ginvpct-2 do under the usual null hypothesis that the
parameter is zero.)
gpricet = –.040 – .222 log(pricet-1) + .00097 t
Now the Dickey-Fuller t statistic is about –2.41, which is above –3.12, the 10% critical value
from Table 18.3. [The estimated root is 1 – .222 = .778, which is much larger than for
log(invpct).] We cannot reject the unit root null at a sufficiently small significance level.
(iii) Given the very strong evidence that log(invpct) does not contain a unit root, while
log(pricet) may very well, it makes no sense to discuss cointegration between the two. If we take
any nontrivial linear combination of an I(0) process (which may have a trend) and an I(1) process,
the result will be an I(1) process (possibly with drift).
pcipt = 1.80 + .349 pcipt-1 + .071 pcipt-2 + .067 pcipt-3
        (0.55) (.043)        (.045)        (.043)
n = 554, R² = .166, σ̂ = 12.15.
When pcipt-4 is added, its coefficient is .0043 with a t statistic of about .10.
The null hypothesis is that pcsp does not Granger cause pcip. This is stated as H0: γ1 = γ2 = γ3 =
0. The F statistic for joint significance of the three lags of pcspt, with 3 and 547 df, is F = 5.37
and p-value = .0012. Therefore, we strongly reject H0 and conclude that pcsp does Granger
cause pcip.
(iii) When we add Δi3t-1, Δi3t-2, and Δi3t-3 to the regression from part (ii), and now test the
joint significance of pcspt-1, pcspt-2, and pcspt-3, the F statistic is 5.08. With 3 and 544 df in the F
distribution, this gives p-value = .0018, and so pcsp Granger causes pcip even conditional on
past Δi3.
[Instructor’s Note: The F test for joint significance of Δi3t-1, Δi3t-2, and Δi3t-3 yields p-
value = .228, and so Δi3 does not Granger cause pcip conditional on past pcsp.]
18.13 We first run the regression gfrt on pet, t, and t², and obtain the residuals, ût. We then
apply the augmented Dickey-Fuller test, with one lag of Δût, by regressing Δût on ût-1 and
Δût-1. There are 70 observations available for this last regression, and it yields −.165 as the
coefficient on ût-1 with t statistic = −2.76. This is well above –4.15, the 5% critical value
[obtained from Davidson and MacKinnon (1993, Table 20.2)]. Therefore, we cannot reject the
null hypothesis of no cointegration, so we conclude gfrt and pet are not cointegrated even if we
allow them to have different quadratic trends.
hy6t = .078 + 1.027 hy3t-1 − 1.021 Δhy3t − .085 Δhy3t-1 − .104 Δhy3t-2
       (.028) (0.016)       (0.038)      (.037)         (.037)
n = 121, R² = .982, σ̂ = .123.
The t statistic for H0: β = 1 is (1.027 – 1)/.016 ≈ 1.69. We do not reject H0: β = 1 at the 5% level
against a two-sided alternative, although we would reject at the 10% level.
[Instructor’s Note: The standard errors on all slope coefficients can be used to construct t
statistics with approximate t distributions, provided there is no serial correlation in {et}.]
Neither of the added terms is individually significant. The F test for their joint significance gives
F = 1.35, p-value = .264. Therefore, we would omit these terms and stick with the error
correction model estimated in (18.39).
18.15 (i) The updated equations using data through 1997 are
unemt = 1.549 + .734 unemt-1
(0.572) (.096)
n = 49, R² = .554, σ̂ = 1.041
and
unemt = 1.286 + .648 unemt-1 + .185 inft-1
(0.484) (.083) (.041)
n = 49, R² = .691, σ̂ = .876.
The parameter estimates do not change by much. This is not very surprising, as we have added
only one year of data.
(ii) The forecast for unem1998 from the first equation is 1.549 + .734(4.9) ≈ 5.15; from the
second equation the forecast is 1.286 + .648(4.9) + .185(2.3) ≈ 4.89. The actual civilian
unemployment rate for 1998 was 4.5 (from Table B-42 in the 1999 Economic Report of the
President). Once again the model that includes lagged inflation produces a better forecast.
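The one-step-ahead forecasts in part (ii) are direct plug-ins. A short Python sketch, using the updated estimates and the 1997 values quoted in the text (unemployment 4.9, inflation 2.3), follows; the variable names are illustrative.

```python
unem_1997, inf_1997 = 4.9, 2.3
actual_1998 = 4.5

# AR(1) forecast and forecast from the model that adds lagged inflation.
f_ar1 = 1.549 + 0.734 * unem_1997
f_with_inf = 1.286 + 0.648 * unem_1997 + 0.185 * inf_1997

err_ar1 = actual_1998 - f_ar1
err_with_inf = actual_1998 - f_with_inf

print(round(f_ar1, 2), round(f_with_inf, 2))
print(round(err_ar1, 2), round(err_with_inf, 2))
```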
(iii) There is no practical improvement in reestimating the parameters using data through
1997: 4.89 versus 4.90, which differs in a digit that is not even reported in the published
unemployment series.
(iv) To obtain the two-step-ahead forecast we need the 1996 unemployment rate, which was
5.4. From equation (18.55), the forecast of unem1998 made after we know unem1996 is
(1 + .732)(1.572) + (.732²)(5.4) ≈ 5.62. The one-step-ahead forecast is 1.572 + .732(4.9) ≈ 5.16.
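The forecast arithmetic in part (iv) can be checked with a short script. This is only a sketch: the function name ar1_forecast is hypothetical, and the coefficients (1.572, .732) and the unemployment rates (5.4 for 1996, 4.9 for 1997) are simply the numbers quoted in the solution. Iterating the AR(1) equation h times reproduces the closed form (1 + ρ)α + ρ²y for h = 2.

```python
def ar1_forecast(alpha, rho, y_last, h):
    """h-step-ahead forecast from y_t = alpha + rho * y_{t-1} + u_t.

    Hypothetical helper: iterates the fitted equation h times,
    starting from the last observed value y_last.
    """
    forecast = y_last
    for _ in range(h):
        forecast = alpha + rho * forecast
    return forecast


# Two-step-ahead forecast of unem in 1998, made knowing unem in 1996 (5.4).
two_step = ar1_forecast(1.572, 0.732, 5.4, 2)
# One-step-ahead forecast of unem in 1998, made knowing unem in 1997 (4.9).
one_step = ar1_forecast(1.572, 0.732, 4.9, 1)

print(round(two_step, 2), round(one_step, 2))
```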
18.16 (i) The estimated linear trend equation using the first 119 observations is
chnimpt = 248.58 + 5.15 t
(53.20) (0.77)
n = 119, R2 = .277, σˆ = 288.33.
(ii) The estimated AR(1) model is
chnimpt = 329.18 + .416 chnimpt-1
(54.71) (.084)
n = 118, R2 = .174, σˆ = 308.17.
Because σˆ is lower for the linear trend model, it provides the better in-sample fit. (The R-
squared is also larger for the linear trend model.)
(iii) Using the last 12 observations for one-step-ahead out-of-sample forecasting gives an
RMSE and MAE for the linear trend equation of about 315.5 and 201.9, respectively. For the
AR(1) model, the RMSE and MAE are about 388.6 and 246.1, respectively. Perhaps
surprisingly, the linear trend is the better forecasting model.
(iv) Using again the first 119 observations, the F statistic for joint significance of febt,
mart, …, dect when added to the linear trend model is about 1.15 with p-value ≈ .328. (The df
are 11 and 107.) So there is no evidence that seasonality needs to be accounted for in forecasting
chnimp.
18.17 (i) As can be seen from the following graph, gfr does not have a clear upward or
downward trend. Starting from 1913, there is a sharp downward trend in fertility until the mid-
1930s, when the fertility rate bottoms out. Fertility increased markedly until the end of the baby
boom in the early 1960s, after which point it fell sharply and then leveled off.
[Graph: gfr plotted against year, 1913 through 1984.]
(ii) The regression of gfrt on a cubic in t, using the data up through 1979, gives
If we use the usual t critical values, all terms are very statistically significant, and the R-squared
indicates that this curve-fitting exercise tracks gfrt pretty well, at least up through 1979.
(iv) The regression Δgfrt on just an intercept, using data up through 1979, gives
Δgfrˆt = –.871
(.543)
n = 66, σˆ = 4.41.
(The R-squared is identically zero since there are no explanatory variables. But σˆ , which
estimates the standard deviation of the error, is comparable to that in part (ii), and we see that it
is much smaller here.) The t statistic for the intercept is about –1.60, which is not significant at
the 10% level against a two-sided alternative. Therefore, it is legitimate to treat gfrt as having no
drift, if it is indeed a random walk. (That is, if gfrt = α0 + gfrt-1 + et, where {et} is zero-mean,
we cannot reject H0: α0 = 0.)
(v) The prediction of gfrn+1 is simply gfrn, so the prediction error is simply Δgfrn+1 = gfrn+1 –
gfrn. Obtaining the MAE for the five prediction errors for 1980 through 1984 gives MAE ≈ .840,
which is much lower than the 43.02 obtained with the cubic trend model. The random walk is
clearly preferred for forecasting.
The second lag is significant. (Recall that its t statistic is valid even though gfrt apparently
contains a unit root: the coefficients on the two lags sum to .961.) The standard error of the
regression is slightly below that of the random walk model.
(vii) The out-of-sample forecasting performance of the AR(2) model is worse than the
random walk without drift: the MAE for 1980 through 1984 is about .991 for the AR(2) model.
[Instructor’s Note: You might have the students compare an AR(1) model for ∆gfrt − that is,
impose the unit root − to the random walk without drift model. The MAE is about .879, so it is
better to impose the unit root. But this still does less well than the simple random walk without
drift.]
(Notice how high the R-squared is. However, it is meaningless as a goodness-of-fit measure
because {yt} has a trend and possibly a unit root.)
(ii) The forecast for 1990 (t = 32) is 3,186.04 + 116.24(32) + .630(17,804.09) ≈ 18,122.30,
because y is $17,804.09 in 1989. The actual value for real per capita disposable income was
$17,944.64, and so the forecast error is –$177.66.
(iii) The MAE for the 1990s, using the model estimated in part (i), is about 371.76.
(iv) Without yt-1 in the equation, we obtain
yˆ t = 8,143.11 + 311.26 t
(103.38) (5.64)
n = 31, R2 = .991, σˆ = 280.87.
The MAE for the forecasts in the 1990s is about 718.26. This is much higher than for the model
with yt-1, so we should use the AR(1) model with a linear time trend.
18.19 (i) The AR(1) model for Δr6, estimated using all but the last 16 observations, is
Δr6t = .047 – .179 Δr6t-1
(.131) (.096)
The RMSE for forecasting one-step-ahead over the last 16 quarters is about .704.
Δr6t = .372 – .171 Δr6t-1 – 1.045 sprt-1
(.195) (.095) (0.474)
The RMSE is about .788, which is higher than the RMSE without the error correction term.
Therefore, while the EC term improves the in-sample fit (and is statistically significant), it
actually hampers out-of-sample forecasting.
(iii) To make the forecasting exercises comparable, we exclude the last 16 observations to
estimate the cointegrating parameters. The CI coefficient is about 1.028. The estimated error
correction model is
Δr6t = .372 – .171 Δr6t-1 – 1.045 (r6t-1 – 1.028 r3t-1)
(.195) (.095) (0.474)
which shows that this fits worse than the EC model when the cointegrating parameter is assumed
to be one. The RMSE for the last 16 quarters is .782, so this works slightly better. But both
versions of the EC model are dominated by the AR(1) model for Δr6t.
[Instructor’s Note: Since Δr6t-1 is only marginally significant in the AR(1) model, and its
coefficient is small, and the intercept is also very small and insignificant, you might have the
students use zero to predict Δr6 for each of the last 16 quarters. The RMSE is about .657, which
means this works best of all. The lesson is that econometric methods are not always called for,
or even desirable.]
(iv) The conclusions would be identical because, as shown in Problem 18.9, the one-step-ahead
errors for forecasting r6n+1 are identical to those for forecasting Δr6n+1.
18.20 (i) For lsp500, the ADF statistic without a trend is t = −.79; with a trend, the t statistic is
−2.20. These are both well above their respective 10% critical values. In addition, the estimated
roots are quite close to one. For lip, the ADF statistic is −1.37 without a trend
and −2.52 with a trend. Again, these are not close to rejecting even at the 10% levels, and the
estimated roots are very close to one.
n = 558, R2 = .903
The t statistic for lip is over 70, and the R-squared is over .9. These are hallmarks of spurious
regressions.
(iii) Using the residuals uˆt obtained in part (ii), the ADF statistic (with two lagged changes)
is −1.57, and the estimated root is over .99. There is no evidence of cointegration. (The 10%
critical value is −3.04.)
(iv) After adding a linear time trend to the regression from part (ii), the ADF statistic applied
to the residuals is −1.88, and the estimated root is again about .99. Even with a time trend there
is no evidence of cointegration.
(v) It appears that lsp500 and lip do not move together in the sense of cointegration, even if
we allow them to have unrestricted linear time trends. This analysis does not point to a long-run
equilibrium relationship.
18.21 (i) This is supposed to be an AR(3) model, otherwise the claim is incorrect. So, estimating
an AR(3) for pcipt, and computing the F statistic for the second and third lags, gives F(2,550) =
3.76, p-value = .024.
(ii) When pcspt-1 is added to the AR(3) model in part (i), its coefficient is about .031 and its
t statistic is about 2.40. Therefore, we conclude that pcsp does Granger cause pcip.
(iii) The heteroskedasticity-robust t statistic is 2.47, so the conclusion from part (ii) does not
change.
18.22 (i) The DF statistic is about −3.31, which is below the 2.5% critical value (−3.12), and so,
using this test, we can reject a unit root at the 2.5% level. (The estimated root is about .81.)
(ii) When two lagged changes are added to the regression in part (i), the t statistic becomes
−1.50, and the root is larger (about .915). Now, there is little evidence against a unit root.
(iii) If we add a time trend to the regression in part (ii), the ADF statistic becomes −3.67, and
the estimated root is about .57. The 2.5% critical value is −3.66, and so we are back to fairly
convincingly rejecting a unit root.
(iv) The best characterization seems to be an I(0) process about a linear trend. In fact, a
stable AR(3) about a linear trend is suggested by the regression in part (iii).
(v) For prcfatt, the ADF statistic without a trend is −4.74 (estimated root = .62) and with a
time trend the statistic is −5.29 (estimated root = .54). Here, the evidence is strongly in favor of
an I(0) process, whether we include a trend or not.
CHAPTER 19
TEACHING NOTES
This is a chapter that students should read if you have assigned them a term paper. I used to
allow students to choose their own topics, but this is difficult in a first-semester course, and
places a heavy burden on instructors or teaching assistants, or both. I now assign a common
topic and provide a data set with about six weeks left in the term. The data set is cross-sectional
(because I teach time series at the end of the course), and I provide guidelines of the kinds of
questions students should try to answer. (For example, I might ask them to answer the following
questions: Is there a marriage premium for NBA basketball players? If so, does it depend on
race? Can the premium, if it exists, be explained by productivity differences?) The specifics are
up to the students, and they are to craft a 10 to 15-page paper on their own. This gives them
practice writing up the results in a way that is easy-to-read, and forces them to interpret their
findings. While leaving the topic to each student’s discretion is more interesting, I find that
many students flounder with an open-ended assignment until it is too late. Naturally, for a
second-semester course, or a senior seminar, students would be expected to design their own
topic, collect their own data, and then write a more substantial term paper.
APPENDIX A
SOLUTIONS TO PROBLEMS
(ii) The two middle numbers are 480 and 530; when these are averaged, we obtain 505, or
$505.
(iv) The average increases to $586 while the median is unchanged ($505).
A.2 (i) This is just a standard linear equation with intercept equal to 3 and slope equal to .2. The
intercept is the number of missed classes for a student who lives on campus.
A.3 If price = 15 and income = 200, quantity = 120 – 9.8(15) + .03(200) = –21, which is
nonsense. This shows that linear demand functions generally cannot describe demand over a
wide range of prices and income.
A.4 (i) The percentage point change is 5.6 – 6.4 = –.8, or an eight-tenths of a percentage point
decrease in the unemployment rate.
(ii) The percentage change in the unemployment rate is 100[(5.6 – 6.4)/6.4] = –12.5%.
A.5 The majority shareholder is referring to the percentage point increase in the stock return,
while the CEO is referring to the change relative to the initial return of 15%. To be precise, the
shareholder should specifically refer to a 3 percentage point increase.
A.7 (i) When exper = 0, log(salary) = 10.6; therefore, salary = exp(10.6) ≈ $40,134.84. When
exper = 5, salary = exp[10.6 + .027(5)] ≈ $45,935.80.
(ii) The approximate proportionate increase is .027(5) = .135, so the approximate percentage
change is 13.5%.
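A short script makes the difference between the approximate and exact percentage changes concrete. It is a sketch that only re-evaluates the equation log(salary) = 10.6 + .027 exper quoted in the solution; the variable names are illustrative.

```python
import math

# Salary implied by log(salary) = 10.6 + .027*exper at exper = 0 and exper = 5.
salary_0 = math.exp(10.6)
salary_5 = math.exp(10.6 + 0.027 * 5)

# Approximate percentage change: 100 times the change in log(salary).
approx_pct = 100 * 0.027 * 5
# Exact percentage change: 100 * [exp(change in log) - 1].
exact_pct = 100 * (salary_5 / salary_0 - 1)

print(round(salary_0, 2), round(salary_5, 2))
print(round(approx_pct, 1), round(exact_pct, 2))
```

The exact change is a little larger than the 13.5% approximation, as expected for a log change of this size.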
A.8 From the given equation, Δgrthemp = –.78(Δsalestax). Since both variables are in
proportion form, we can multiply the equation through by 100 to turn each variable into
percentage form. This leaves the slope as –.78. So, a one percentage point increase in the sales
tax rate (say, from 4% to 5%) reduces employment growth by .78 percentage points.
A.9 (i) The relationship between yield and fertilizer is graphed below.
[Graph: yield plotted against fertilizer, for fertilizer from 0 to 100.]
has a diminishing effect, and the slope approaches zero as fertilizer gets large. The initial pound
of fertilizer has the largest effect, and each additional pound has an effect smaller than the
previous pound.
APPENDIX B
SOLUTIONS TO PROBLEMS
B.1 Before the student takes the SAT exam, we do not know – nor can we predict with certainty
– what the score will be. The actual score depends on numerous factors, many of which we
cannot even list, let alone know ahead of time. (The student’s innate ability, how the student
feels on exam day, and which particular questions were asked, are just a few.) The eventual SAT
score clearly satisfies the requirements of a random variable.
B.2 (i) P(X ≤ 6) = P[(X – 5)/2 ≤ (6 – 5)/2] = P(Z ≤ .5) ≈ .691, where Z denotes a Normal (0,1)
random variable.
(ii) P(X > 4) = P[(X – 5)/2 > (4 – 5)/2] = P(Z > –.5) = P(Z < .5) ≈ .691.
(iii) P(|X – 5| > 1) = P(X – 5 > 1) + P(X – 5 < –1) = P(X > 6) + P(X < 4) = [1 – P(Z ≤ .5)] +
P(Z < –.5) = 2[1 – P(Z ≤ .5)] ≈ .617, where we use the answer from part (i) along with the
symmetry of the Normal (0,1) distribution about zero.
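These normal probabilities can be computed directly. The sketch below assumes X ~ Normal(5, 4), that is, mean 5 and standard deviation 2, as the standardization (X – 5)/2 implies; normal_cdf is a hypothetical helper built on the standard library's error function.

```python
import math


def normal_cdf(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu, sd = 5.0, 2.0  # assumed: X ~ Normal(5, 4), so sd = 2

p_i = normal_cdf((6 - mu) / sd)        # P(X <= 6)
p_ii = 1 - normal_cdf((4 - mu) / sd)   # P(X > 4)
# P(|X - 5| > 1) = P(X > 6) + P(X < 4)
p_iii = (1 - normal_cdf((6 - mu) / sd)) + normal_cdf((4 - mu) / sd)

print(round(p_i, 3), round(p_ii, 3), round(p_iii, 3))
```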
B.3 (i) Let Yit be the binary variable equal to one if fund i outperforms the market in year t. By
assumption, P(Yit = 1) = .5 (a 50-50 chance of outperforming the market for each fund in each
year). Now, for any fund, we are also assuming that performance relative to the market is
independent across years. But then the probability that fund i outperforms the market in all 10
years, P(Yi1 = 1, Yi2 = 1, …, Yi,10 = 1), is just the product of the probabilities: P(Yi1 = 1) ⋅ P(Yi2 = 1)
⋯ P(Yi,10 = 1) = (.5)¹⁰ = 1/1024 (which is slightly less than .001). In fact, if we define a binary
random variable Yi such that Yi = 1 if and only if fund i outperformed the market in all 10 years,
then P(Yi = 1) = 1/1024.
(ii) Let X denote the number of funds out of 4,170 that outperform the market in all 10 years.
Then X = Y1 + Y2 + … + Y4,170. If we assume that performance relative to the market is
independent across funds, then X has the Binomial (n,θ) distribution with n = 4,170 and θ =
1/1024. We want P(X ≥ 1) = 1 – P(X = 0) = 1 – (1023/1024)⁴¹⁷⁰ ≈ .983. So, if performance
relative to the market is random and independent across funds, it is almost certain that at least
one fund will outperform the market in all 10 years.
(iii) Using the Stata command Binomial(4170,5,1/1024), the answer is about .385. So there
is a nontrivial chance that at least five funds will outperform the market in all 10 years.
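For readers without Stata, the binomial probabilities in parts (ii) and (iii) can be reproduced with the standard library. This is a sketch, not the command used in the solution: binom_pmf is a hypothetical helper, and n = 4,170 and θ = 1/1024 are the values from the problem.

```python
from math import comb


def binom_pmf(n, k, theta):
    """P(X = k) for X ~ Binomial(n, theta)."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)


n, theta = 4170, 1 / 1024

# Probability that at least one fund beats the market in all 10 years.
p_at_least_one = 1 - binom_pmf(n, 0, theta)
# Probability that at least five funds do so.
p_at_least_five = 1 - sum(binom_pmf(n, k, theta) for k in range(5))

print(round(p_at_least_one, 3), round(p_at_least_five, 3))
```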
B.4 We want P(X ≥ .6). Because X is continuous, this is the same as P(X > .6) = 1 – P(X ≤ .6) =
1 – F(.6) = 1 – [3(.6)² – 2(.6)³] = 1 – .648 = .352. One way to interpret this is that about 35% of
all counties have an elderly employment rate of .6 or higher.
B.5 (i) As stated in the hint, if X is the number of jurors convinced of Simpson’s innocence, then
X ~ Binomial(12,.20). We want P(X ≥ 1) = 1 – P(X = 0) = 1 – (.8)¹² ≈ .931.
(ii) Above, we computed P(X = 0) as about .069. We need P(X = 1), which we obtain from
(B.14) with n = 12, θ = .2, and x = 1: P(X = 1) = 12 ⋅ (.2)(.8)¹¹ ≈ .206. Therefore, P(X ≥ 2) ≈ 1 –
(.069 + .206) = .725, so there is almost a three in four chance that the jury had at least two
members convinced of Simpson’s innocence prior to the trial.
0
= 81/4.
0 0 0 0
B.7 In eight attempts the expected number of free throws is 8(.74) = 5.92, or about six free
throws.
B.8 The weights for the two-, three-, and four-credit courses are 2/9, 3/9, and 4/9, respectively.
Let Yj be the grade in the jth course, j = 1, 2, and 3, and let X be the overall grade point average.
B.9 If Y is salary in dollars then Y = 1000 ⋅ X, and so the expected value of Y is 1,000 times the
expected value of X, and the standard deviation of Y is 1,000 times the standard deviation of X.
Therefore, the expected value and standard deviation of salary, measured in dollars, are $52,300
and $14,600, respectively.
(ii) Following the hint, we use the law of iterated expectations. Since
E(GPA|SAT) = .70 + .002 SAT, the (unconditional) expected value of GPA is .70 + .002
E(SAT) = .70 + .002(1100) = 2.9.
APPENDIX C
SOLUTIONS TO PROBLEMS
C.1 (i) This is just a special case of what we covered in the text, with n = 4: E(Ȳ) = µ and
Var(Ȳ) = σ²/4.
(ii) E(W) = E(Y1)/8 + E(Y2)/8 + E(Y3)/4 + E(Y4)/2 = µ[(1/8) + (1/8) + (1/4) + (1/2)] = µ(1 +
1 + 2 + 4)/8 = µ, which shows that W is unbiased. Because the Yi are independent,
Var(W) = Var(Y1)/64 + Var(Y2)/64 + Var(Y3)/16 + Var(Y4)/4 = σ²(1 + 1 + 4 + 16)/64 = σ²(11/32).
(iii) Because 11/32 > 8/32 = 1/4, Var(W) > Var(Ȳ) for any σ² > 0, so Ȳ is preferred to W
because each is unbiased.
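The comparison of W and the sample average comes down to two sums over the weights: the weights must sum to one (unbiasedness), and the sum of squared weights gives the variance as a multiple of σ². The sketch below checks both with exact fractions; variance_factor is a hypothetical helper, and the weights 1/8, 1/8, 1/4, 1/2 are those appearing in E(W) above.

```python
from fractions import Fraction


def variance_factor(weights):
    """Var(sum of a_i * Y_i) / sigma^2 for independent Y_i with common variance."""
    return sum(w * w for w in weights)


w_weights = [Fraction(1, 8), Fraction(1, 8), Fraction(1, 4), Fraction(1, 2)]
ybar_weights = [Fraction(1, 4)] * 4

# Both estimators are unbiased because each set of weights sums to one.
print(sum(w_weights), sum(ybar_weights))
# Variance multiples of sigma^2 for W and for the sample average.
print(variance_factor(w_weights), variance_factor(ybar_weights))
```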
C.2 (i) E(Wa) = a1E(Y1) + a2E(Y2) + … + anE(Yn) = (a1 + a2 + … + an)µ. Therefore, we must
have a1 + a2 + … + an = 1 for unbiasedness.
(ii) Var(Wa) = a1²Var(Y1) + a2²Var(Y2) + … + an²Var(Yn) = (a1² + a2² + … + an²)σ².
(iii) From the hint, when a1 + a2 + … + an = 1 – the condition needed for unbiasedness of Wa
– we have 1/n ≤ a1² + a2² + … + an². But then Var(Ȳ) = σ²/n ≤ σ²(a1² + a2² + … + an²) =
Var(Wa).
C.3 (i) E(W1) = [(n – 1)/n]E(Ȳ) = [(n – 1)/n]µ, and so Bias(W1) = [(n – 1)/n]µ – µ = –µ/n.
Similarly, E(W2) = E(Ȳ)/2 = µ/2, and so Bias(W2) = µ/2 – µ = –µ/2. The bias in W1 tends to
zero as n → ∞, while the bias in W2 is –µ/2 for all n. This is an important difference.
(ii) plim(W1) = plim[(n – 1)/n] ⋅ plim(Ȳ) = 1 ⋅ µ = µ. plim(W2) = plim(Ȳ)/2 = µ/2. Because
plim(W1) = µ and plim(W2) = µ/2, W1 is consistent whereas W2 is inconsistent.
(iii) Var(W1) = [(n – 1)/n]²Var(Ȳ) = [(n – 1)²/n³]σ² and Var(W2) = Var(Ȳ)/4 = σ²/(4n).
[(n – 1)²/n³]σ² < σ²/n = Var(Ȳ) because (n – 1)/n < 1. Therefore, MSE(W1) is smaller than
Var(Ȳ) for µ close to zero. For large n, the difference between the two estimators is trivial.
C.4 (i) Using the hint, E(Z|X) = E(Y/X|X) = E(Y|X)/X = θX/X = θ. It follows by Property CE.4,
the law of iterated expectations, that E(Z) = E(θ) = θ.
(ii) This follows from part (i) and the fact that the sample average is unbiased for the
population average: write
\[ W_1 = n^{-1} \sum_{i=1}^{n} (Y_i / X_i) = n^{-1} \sum_{i=1}^{n} Z_i, \]
(iii) In general, the average of the ratios, Yi/Xi, is not the ratio of averages, W2 = Y / X . (This
non-equivalence is discussed a bit on page 676.) Nevertheless, W2 is also unbiased, as a simple
and so
\[ E(\bar{Y} \mid X_1, \ldots, X_n) = n^{-1} \sum_{i=1}^{n} E(Y_i \mid X_1, \ldots, X_n) = n^{-1} \sum_{i=1}^{n} \theta X_i = \theta n^{-1} \sum_{i=1}^{n} X_i = \theta \bar{X}. \]
(iv) For the n = 17 observations given in the table – which are, incidentally, the first 17
observations in the file CORN.RAW – the point estimates are w1 = .418 and w2 = 120.43/297.41
= .405. These are pretty similar estimates. If we use w1, we estimate E(Y|X = x) for any x > 0 as
E(Y | X = x) = .418 x. For example, if x = 300 then the predicted yield is .418(300) = 125.4.
C.5 (i) While the expected value of the numerator of G is E( Y ) = θ, and the expected value of
the denominator is E(1 – Y ) = 1 – θ, the expected value of the ratio is not the ratio of the
expected value.
(ii) By Property PLIM.2(iii), the plim of the ratio is the ratio of the plims (provided the plim
of the denominator is not zero): plim(G) = plim[Ȳ/(1 – Ȳ)] = plim(Ȳ)/[1 – plim(Ȳ)] = θ/(1 –
θ) = γ.
(iii) The standard error of ȳ is s/√n = 466.4/30 ≈ 15.55. Therefore, the t statistic for
testing H0: µ = 0 is t = ȳ/se(ȳ) = –32.8/15.55 ≈ –2.11. We obtain the p-value as P(Z ≤ –2.11),
where Z ~ Normal(0,1). These probabilities are in Table G.1: p-value = .0174. Because the
p-value is below .05, we reject H0 against the one-sided alternative at the 5% level. We do not
reject at the 1% level because p-value = .0174 > .01.
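The test statistic and one-sided p-value in part (iii) can be reproduced without a table. In the sketch below, n = 900 is an assumption inferred from √n = 30 in the solution, and normal_cdf is a hypothetical helper based on the standard library's error function.

```python
import math


def normal_cdf(z):
    """Standard normal cdf, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


ybar, s, n = -32.8, 466.4, 900  # n = 900 assumed, since sqrt(n) = 30

se = s / math.sqrt(n)          # standard error of the sample mean
t_stat = ybar / se             # t statistic for H0: mu = 0
p_value = normal_cdf(t_stat)   # one-sided p-value, P(Z <= t)

print(round(se, 2), round(t_stat, 2), round(p_value, 4))
```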
(iv) The estimated reduction, about 33 ounces, does not seem large for an entire year’s
consumption. If the alcohol is beer, 33 ounces is less than three 12-ounce cans of beer. Even if
this is hard liquor, the reduction seems small. (On the other hand, when aggregated across the
entire population, alcohol distributors might not think the effect is so small.)
(v) The implicit assumption is that other factors that affect liquor consumption – such as
income, or changes in price due to transportation costs – are constant over the two years.
C.7 (i) The average increase in wage is d̄ = .24, or 24 cents. The sample standard deviation is
about .451, and so, with n = 15, the standard error of d̄ is .451/√15 ≈ .1164. From Table G.2,
the 97.5th percentile in the t14 distribution is 2.145. So the 95% CI is .24 ± 2.145(.1164), or about
–.010 to .490.
(ii) If µ = E(Di) then H0: µ = 0. The alternative is that management’s claim is true: H1: µ > 0.
(iii) We have the mean and standard error from part (i): t = .24/.1164 ≈ 2.062. The 5%
critical value for a one-tailed test with df = 14 is 1.761, while the 1% critical value is 2.624.
Therefore, H0 is rejected in favor of H1 at the 5% level but not the 1% level.
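The interval in part (i) and the t statistic in part (iii) follow from three numbers. The sketch below takes the critical value 2.145 from the solution rather than computing it, and the variable names are illustrative. With the unrounded standard error the t statistic comes out at about 2.061, a hair below the 2.062 obtained from the rounded value .1164.

```python
import math

d_bar, s_d, n = 0.24, 0.451, 15   # sample mean, sample sd, and sample size
t_crit = 2.145                    # 97.5th percentile of t(14), from Table G.2

se = s_d / math.sqrt(n)                           # standard error of d_bar
ci = (d_bar - t_crit * se, d_bar + t_crit * se)   # 95% confidence interval
t_stat = d_bar / se                               # t statistic for H0: mu = 0

print(round(se, 4), round(ci[0], 3), round(ci[1], 3), round(t_stat, 3))
```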
(iv) The p-value obtained from Stata is .029; this is half of the p-value for the two-sided
alternative. (Econometrics packages, including Stata, report the p-value for the two-sided
alternative.)
(ii) Var(Ȳ) = θ(1 – θ)/n [because the variance of each Yi is θ(1 − θ)] and so sd(Ȳ) =
√[θ(1 − θ)/n].
(iii) The asymptotic t statistic is (Ȳ − .5)/se(Ȳ); when we plug in the estimate for Mark Price,
se(ȳ) = √[ȳ(1 − ȳ)/n] = √[.438(1 − .438)/429] ≈ .024. So the observed t statistic is (.438 –
.5)/.024 ≈ –2.583. This is well below the 5% critical value (based on the standard normal
distribution), –1.645. In fact, the 1% critical value is –2.326, and so H0 is rejected against H1 at
the 1% level.
(iv) The evidence is pretty strong against the dictator’s claim. If 65% of the voting
population actually voted yes in the plebiscite, there is only about a 1.3% chance of obtaining
115 or fewer voters out of 200 who voted yes.
C.10 Since y = .394, se( y ) ≈ .024. We can use the standard normal approximation for the
95% CI: .394 ± 1.96(.024), or about .347 to .441. Therefore, based on Gwynn’s average up to
the strike, there is not very strong evidence against θ = .400, as this value is well within the 95% CI.
(Of course, .350 is within this CI, too.)
APPENDIX D
SOLUTIONS TO PROBLEMS
D.1 (i)
\[ AB = \begin{pmatrix} 2 & -1 & 7 \\ -4 & 5 & 0 \end{pmatrix} \begin{pmatrix} 0 & 1 & 6 \\ 1 & 8 & 0 \\ 3 & 0 & 0 \end{pmatrix} = \begin{pmatrix} 20 & -6 & 12 \\ 5 & 36 & -24 \end{pmatrix} \]
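The product in D.1 (i) is easy to verify by machine. The sketch below uses a hypothetical matmul helper on nested lists so that it needs nothing beyond the standard library; A and B are the matrices from the problem.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


A = [[2, -1, 7], [-4, 5, 0]]
B = [[0, 1, 6], [1, 8, 0], [3, 0, 0]]

AB = matmul(A, B)
print(AB)
```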
D.2 This result is easy to visualize. If A and B are n × n diagonal matrices, then AB is an n × n
diagonal matrix with jth diagonal element ajbj. Similarly, BA is an n × n diagonal matrix with jth
diagonal element bjaj, which, of course, is the same as ajbj.
D.3 Using the basic rules for transpose, ( X′X)′ = ( X′)( X′)′ = X′X , which is what we wanted to
show.
D.4 (i) This follows from tr(BC) = tr(CB), when B is n × m and C is m × n. Take B = A′ and
C = A.
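(ii)
\[ A'A = \begin{pmatrix} 2 & 0 \\ 0 & 3 \\ -1 & 0 \end{pmatrix} \begin{pmatrix} 2 & 0 & -1 \\ 0 & 3 & 0 \end{pmatrix} = \begin{pmatrix} 4 & 0 & -2 \\ 0 & 9 & 0 \\ -2 & 0 & 1 \end{pmatrix}; \]
therefore, tr(A′A) = 14. Similarly,
\[ AA' = \begin{pmatrix} 2 & 0 & -1 \\ 0 & 3 & 0 \end{pmatrix} \begin{pmatrix} 2 & 0 \\ 0 & 3 \\ -1 & 0 \end{pmatrix} = \begin{pmatrix} 5 & 0 \\ 0 & 9 \end{pmatrix}, \]
and so tr(AA′) = 14.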
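The equality tr(A′A) = tr(AA′) in D.4 can be checked numerically for the 2 × 3 matrix A used in part (ii). This is a sketch with hypothetical helpers (transpose, matmul, trace) on nested lists.

```python
def transpose(m):
    """Transpose of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def trace(m):
    """Sum of the diagonal elements of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))


A = [[2, 0, -1], [0, 3, 0]]
AtA = matmul(transpose(A), A)   # 3 x 3
AAt = matmul(A, transpose(A))   # 2 x 2

print(trace(AtA), trace(AAt))
```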
D.5 (i) The n × n matrix C is the inverse of AB if and only if C(AB) = In and (AB)C = In. We
verify both of these equalities for C = B⁻¹A⁻¹. First, (B⁻¹A⁻¹)(AB) = B⁻¹(A⁻¹A)B = B⁻¹InB =
B⁻¹B = In. Similarly, (AB)(B⁻¹A⁻¹) = A(BB⁻¹)A⁻¹ = AInA⁻¹ = AA⁻¹ = In.
D.6 (i) Let ej be the n × 1 vector with jth element equal to one and all other elements equal to zero.
Then straightforward matrix multiplication shows that e′jAej = ajj, where ajj is the jth diagonal
element. But by definition of positive definiteness, x′Ax > 0 for all x ≠ 0, including x = ej. So
ajj > 0, j = 1, 2, …, n.
(ii) The matrix
\[ A = \begin{pmatrix} 1 & -2 \\ -2 & 1 \end{pmatrix} \]
works because x′Ax = −2 < 0 for x′ = (1 1).
D.7 We must show that, for any n × 1 vector x, x ≠ 0, x′(P′AP)x > 0. But we can write this
quadratic form as (Px)′A(Px) = z′Az where z ≡ Px. Because A is positive definite by assumption,
z′Az > 0 for z ≠ 0. So, all we have to show is that x ≠ 0 implies that z ≠ 0. We do this by
showing the contrapositive, that is, if z = 0 then x = 0. If Px = 0 then, because P⁻¹ exists, we
have P⁻¹Px = 0, or x = 0.
D.8 Let z = Ay + b. Then, by the first property of expected values, E(z) = Aµy + b, where µy =
E(y). By Property (3) for variances, Var(z) = E[(z – µz)(z – µz)′]. But z – µz = Ay + b – (Aµy +
b) = A(y – µy). Therefore, (z – µz)′ = (y – µy)′A′, and so (z – µz)(z – µz)′ = A(y – µy)(y – µy)′A′.
APPENDIX E
SOLUTIONS TO PROBLEMS
E.1 This follows directly from partitioned matrix multiplication in Appendix D. Write
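\[ X = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}, \quad X' = ( x_1' \; x_2' \; \cdots \; x_n' ), \quad \text{and} \quad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \]
\[ \hat{\beta} = \left( n^{-1} \sum_{t=1}^{n} x_t' x_t \right)^{-1} \left( n^{-1} \sum_{t=1}^{n} x_t' y_t \right) \]
which, when we plug in yt = xtβ + ut for each t and do some algebra, can be written as
\[ \hat{\beta} = \beta + \left( n^{-1} \sum_{t=1}^{n} x_t' x_t \right)^{-1} \left( n^{-1} \sum_{t=1}^{n} x_t' u_t \right). \]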
As shown in Section E.4, this expression is the basis for the asymptotic analysis of OLS using
matrices.
E.2 (i) Following the hint, we have SSR(b) = (y – Xb)′(y – Xb) = [ û + X( β̂ – b)]′[ û + X( β̂ –
b)] = û ′ û + û ′X( β̂ – b) + ( β̂ – b)′X′ û + ( β̂ – b)′X′X( β̂ – b). But by the first order conditions
for OLS, X′ û = 0, and so (X′ û )′ = û ′X = 0. But then SSR(b) = û ′ û + ( β̂ – b)′X′X( β̂ – b),
which is what we wanted to show.
(ii) If X has a rank k then X′X is positive definite, which implies that ( β̂ – b) ′X′X( β̂ – b) >
0 for all b ≠ β̂ . The term û ′ û does not depend on b, and so SSR(b) – SSR( β̂ ) = ( β̂ – b) ′X′X
( β̂ – b) > 0 for b ≠ β̂ .
E.3 (i) We use the placeholder feature of the OLS formulas. By definition, β̃ = (Z′Z)⁻¹Z′y =
[(XA)′(XA)]⁻¹(XA)′y = [A′(X′X)A]⁻¹A′X′y = A⁻¹(X′X)⁻¹(A′)⁻¹A′X′y = A⁻¹(X′X)⁻¹X′y = A⁻¹β̂.
(ii) By definition of the fitted values, ŷt = xtβ̂ and ỹt = ztβ̃. Plugging zt and β̃ into the
second equation gives ỹt = (xtA)(A⁻¹β̂) = xtβ̂ = ŷt.
(iii) The estimated variance matrix from the regression of y on Z is σ̃²(Z′Z)⁻¹ where σ̃² is
the error variance estimate from this regression. From part (ii), the fitted values from the two
regressions are the same, which means the residuals must be the same for all t. (The dependent
variable is the same in both regressions.) Therefore, σ̃² = σ̂². Further, as we showed in part (i),
(iv) The β̃j are obtained from a regression of y on XA, where A is the k × k diagonal matrix
with 1, a2, …, ak down the diagonal. From part (i), β̃ = A⁻¹β̂. But A⁻¹ is easily seen to be the
k × k diagonal matrix with 1, a2⁻¹, …, ak⁻¹ down its diagonal. Straightforward multiplication
shows that the first element of A⁻¹β̂ is β̂1 and the jth element is β̂j/aj, j = 2, …, k.
(v) From part (iii), the estimated variance matrix of β̃ is σ̂²A⁻¹(X′X)⁻¹(A⁻¹)′. But A⁻¹ is a
symmetric, diagonal matrix, as described above. The estimated variance of β̃j is the jth
diagonal element of σ̂²A⁻¹(X′X)⁻¹A⁻¹, which is easily seen to be σ̂²cjj/aj², where cjj is the jth
diagonal element of (X′X)⁻¹. The square root of this, σ̂√cjj/|aj|, is se(β̃j), which is simply
se(β̂j)/|aj|.
and so the absolute value is (|β̂j|/|aj|)/[se(β̂j)/|aj|] = |β̂j|/se(β̂j), which is just the absolute value
of the t statistic for β̂j. If aj > 0, the t statistics themselves are identical; if aj < 0, the t statistics
have opposite signs.
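The rescaling results in E.3 can be illustrated in the simplest case: a regression with an intercept and one regressor, where the regressor is multiplied by a constant a. The sketch below is only an illustration under stated assumptions: simple_ols is a hypothetical helper, the data are made up, and a = –4 is chosen to show the sign flip in the t statistic.

```python
import math


def simple_ols(x, y):
    """Slope and its standard error from regressing y on an intercept and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma2 = ssr / (n - 2)                # error variance estimate
    return slope, math.sqrt(sigma2 / sxx)


# Made-up data, used only to illustrate the algebra.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]
a = -4.0
z = [a * xi for xi in x]                  # rescaled regressor

b_x, se_x = simple_ols(x, y)
b_z, se_z = simple_ols(z, y)

# Slope is divided by a, standard error by |a|, and t changes sign when a < 0.
print(b_z, b_x / a)
print(se_z, se_x / abs(a))
print(b_z / se_z, b_x / se_x)
```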
(ii) Var(δ̂ | X) = Var(Gβ̂ | X) = G[Var(β̂ | X)]G′ = G[σ²(X′X)⁻¹]G′ = σ²G[(X′X)⁻¹]G′.
Further, as shown in Problem E.3, the residuals are the same as from the regression y on X, and
so the error variance estimate, σ̂², is the same. Therefore, the estimated variance matrix is
σ̂²[(XG⁻¹)′XG⁻¹]⁻¹ = σ̂²G(X′X)⁻¹G′,
\[ G = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & \cdots & 0 & 1 & 0 \\ c_1 & c_2 & c_3 & \cdots & c_k \end{pmatrix} \]
(v) Straightforward matrix multiplication shows that, for the suggested choice of G⁻¹,
G⁻¹G = In. Also by multiplication, it is easy to see that, for each t,
(iii) The estimator β̃ is linear in y and, as shown in part (i), it is unbiased (conditional on X).
Since the Gauss-Markov assumptions hold, the OLS estimator, β̂, is best linear unbiased. In
particular, its variance-covariance matrix is “smaller” (in the matrix sense) than Var(β̃ | X).
Therefore, we prefer the OLS estimator.