09 Discrete Choice 1 Notes
Paul Goldsmith-Pinkham
February 26, 2024
Binary choice
Yi = Fi ( Xi ),
using robust standard errors, but does mean that a normal approxi-
mation with the error is a poor one.
Second, except under some special circumstances, it’s very likely
that the predicted values of Yi will be outside of [0, 1]. What’s an
example where they will not be? Discrete exhaustive regressors!
Why? Discrete exhaustive regressors are the one setting where you
can guarantee that the model is correctly specified. When the model
is misspecified, it is quite possible that the model will extrapolate in a
way such that there will be values outside support.
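A minimal sketch of this contrast, using made-up cell counts (the data and variable names are illustrative assumptions, not from the notes). With a discrete regressor, a saturated LPM reproduces the cell means, which lie in [0, 1] by construction, while a linear-in-x LPM extrapolates outside the unit interval.

```python
# Made-up data: for each value of a discrete regressor x, (cell size, number with Y = 1).
cells = {0: (100, 0), 1: (100, 0), 2: (100, 50), 3: (100, 100), 4: (100, 100)}

# Expand the cell counts into individual observations.
xs, ys = [], []
for x, (n_x, ones) in cells.items():
    xs += [x] * n_x
    ys += [1] * ones + [0] * (n_x - ones)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Linear-in-x LPM: OLS of y on a constant and x (closed form for one regressor).
slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = y_bar - slope * x_bar
linear_pred = {x: intercept + slope * x for x in cells}

# Saturated LPM: one dummy per value of x, so the OLS fitted values are the cell means.
saturated_pred = {x: ones / n_x for x, (n_x, ones) in cells.items()}

for x in cells:
    print(x, round(linear_pred[x], 3), saturated_pred[x])
```

In this example the linear fit predicts below 0 at the smallest x and above 1 at the largest, while the saturated fit never leaves [0, 1].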
How does this impact our causal estimates? If the model is correctly
specified, we can generate counterfactual predictions of the outcome.
If not, then we get a linear approximation that may be nonsensical.

Table 1: LPM model estimates
variable      linear est.   std.error
Intercept      0.0242        0.0410
age            0.0220        0.0017
age2          -0.0002        0.0000
income/10k     0.0069        0.0007

Example 1 (LPM estimates of homeownership)
We estimate the linear model in Table 1, and note that if income
were strictly ignorable, we could say that a 10k increase in income
leads to a 0.69 p.p. increase in the probability of homeownership. But
the predicted probability of homeownership would range from 0.283
to 1.78. Oops.
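As a rough arithmetic check, the rounded coefficients from Table 1 can be plugged into the linear index directly. The ages and incomes below are hypothetical individuals chosen for illustration, and because the published coefficients are rounded, the numbers are only approximate.

```python
def lpm_prediction(age, income_10k):
    """Predicted Pr(homeowner) from the rounded Table 1 LPM coefficients."""
    return 0.0242 + 0.0220 * age - 0.0002 * age ** 2 + 0.0069 * income_10k

# Hypothetical individuals (not from the data): a 50-year-old at two income levels.
modest = lpm_prediction(age=50, income_10k=5)      # income of 50k
very_high = lpm_prediction(age=50, income_10k=150)  # income of 1.5 million

print(round(modest, 4), round(very_high, 4))
```

The second prediction exceeds 1, which is the kind of nonsensical "probability" the example describes.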
There are two ways to think about this estimation problem.
These are not mutually exclusive, and it is important
to note that both of these approaches are very focused on the model-
based aspect of estimating causal effects.
The first is a statistical view. How can we model the statisti-
cal process for Yi better? In other words, can we fit the outcome
model better? If we treat Xi β as the conditional mean of some process,
what statistical model fits with this? This is a case of
what are termed “Generalized Linear Models” (GLM).
A second way to view this is as a structural (economic) choice
problem. Most models of binary outcome variables assume a latent
index for the utility of choosing Yi:[2]

\[
Y_i^* = X_i\beta + \varepsilon_i, \qquad
Y_i = \begin{cases} 1 & Y_i^* > 0 \\ 0 & Y_i^* \le 0. \end{cases} \tag{2}
\]

[2] The careful reader will note the analogy to the Heckman model of
treatment choice.
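The latent-index formulation in equation (2) can be simulated directly. The sketch below assumes a standard logistic error (one of the usual choices) and a hypothetical index value of 0.5; the function names are illustrative. The simulated share choosing Y = 1 should be close to the logistic CDF evaluated at the index.

```python
import math
import random

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def draw_logistic(rng):
    # Inverse-CDF draw from the standard logistic distribution.
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))

def simulate_choice_share(index, n_draws=100_000, seed=0):
    """Share of draws with Y = 1, where Y = 1 if Y* = index + eps > 0."""
    rng = random.Random(seed)
    chosen = 0
    for _ in range(n_draws):
        y_star = index + draw_logistic(rng)
        if y_star > 0:
            chosen += 1
    return chosen / n_draws

share = simulate_choice_share(0.5)
print(share, logistic_cdf(0.5))
```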
Note that these are not, in the binary setting, deeply substantive
assumptions. In Figure 1, we see that there are very minor differences
in the thickness of the tails for a logit vs. normal error, but they’re
both symmetric and centered around zero.[3] One downside for probit
models is that there’s no closed form solution for Φ, the CDF for the
normal distribution:

\[
\Phi(X_i\beta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{X_i\beta} e^{-t^2/2}\,dt \tag{3}
\]

We will discuss later how to estimate β given these assumptions,
but they will involve numerical optimization, as there is no closed
form for β like in linear regression.

[3] Important caveat: these models only identify β up to scale. Why? The
“true” model of ε could have variance σ² that is unknown. Consider
F(Xi β) = Φ(Xi β). If this were a general normal (rather than
standardized with variance 1), we could just scale up the coefficients
in proportion to σ and the realized binary outcome would be identical.
Hence, we normalize σ = 1 in most cases. This is not a meaningful
assumption.
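The scale normalization can be checked numerically. The sketch below assumes a single regressor and normal errors with standard deviation σ, so that Pr(Y = 1 | x) = Φ(xβ/σ); the parameter values are arbitrary. Scaling β and σ by the same constant leaves every choice probability unchanged, so the data cannot tell the two parameterizations apart.

```python
import math

def normal_cdf(z):
    # Standard normal CDF, evaluated numerically through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_y1(x, beta, sigma):
    # Pr(x * beta + eps > 0) when eps ~ N(0, sigma^2).
    return normal_cdf(x * beta / sigma)

xs = [-2.0, -0.5, 0.0, 1.0, 3.0]  # arbitrary covariate values
p_unit = [prob_y1(x, beta=0.4, sigma=1.0) for x in xs]
p_scaled = [prob_y1(x, beta=1.2, sigma=3.0) for x in xs]  # both scaled by 3

print(p_unit)
print(p_scaled)
```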
to interpret. To see why, consider the derivative of the probability
with respect to Xi:

\[
\frac{\partial \Pr(Y_i = 1 \mid X_i)}{\partial X_i} = \beta\,\phi(X_i\beta) \quad \text{(Probit)}
\]
\[
\frac{\partial \Pr(Y_i = 1 \mid X_i)}{\partial X_i} = \beta\,\frac{\exp(X_i\beta)}{(1 + \exp(X_i\beta))^2} \quad \text{(Logit).}
\]

variable        (1)       (2)       (3)
constant      -2.14      0.0242   -0.392
age            0.0903    0.022     0.0166
age2          -0.0006   -0.0002   -0.0001
income/10k     0.0716    0.0069    0.0131
\[
n^{-1}\sum_i \frac{\partial E(Y_i \mid X_i)}{\partial X_i}
= n^{-1}\sum_i \beta\,\frac{\exp(X_i\beta)}{(1 + \exp(X_i\beta))^2}
\]
This will calculate the derivative for every value in the sample, and
then average them. This is a way to get a sense of the average effect of
Xi on Yi . We see a much larger effect of income on homeownership in
the logit model than in the linear model (Column 2).
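The averaging step can be sketched as follows. The coefficient and the fitted index values are hypothetical placeholders, not estimates from the homeownership data; in practice the indices would be Xi β evaluated at the estimated coefficients for every observation.

```python
import math

def logistic_density(z):
    # exp(z) / (1 + exp(z))^2, written with exp(-|z|) so large |z| cannot overflow.
    e = math.exp(-abs(z))
    return e / (1.0 + e) ** 2

def average_derivative(beta_k, indices):
    """Sample average of beta_k * exp(X_i b) / (1 + exp(X_i b))^2."""
    return beta_k * sum(logistic_density(z) for z in indices) / len(indices)

indices = [-2.0, -1.0, 0.0, 0.5, 1.5]  # hypothetical values of X_i * beta
beta_k = 0.5                            # hypothetical logit coefficient
ame = average_derivative(beta_k, indices)
print(ame)
```

Because the logistic density is at most 1/4, the average derivative is never larger in magnitude than a quarter of the logit coefficient, which is one reason raw logit coefficients and linear coefficients are not directly comparable.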
Example 1 (continued)
Figure 2 shows the predicted values of homeownership from the linear
and logit models. The linear model is predicting values outside of the
support of the outcome, and the logit model is not. This is one benefit
of correctly specifying the model.
which solves the second problem, but makes the first problem even
worse! [Bellemare and Wichman, 2020, Aihounton and Henningsen,
2021, Cohn et al., 2022] Why these solutions? For one, they’re both
well-defined at Y = 0. Second, they have “similar” properties to
taking a log. Effectively, since the distance between log(1 + Y) and
log(Y) is small when Y is large, the hope is that the differences
would “wash out.” It turns out, thanks to work by Chen and Roth
[2023], that neither of these solutions is a good idea and that these
differences do not wash out.
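A small numerical sketch of why the differences need not wash out, using made-up outcome data (four control and four treated units, with one treated unit moving from zero to a positive value). The estimated "effect" on log(1 + Y) changes substantially when the same outcomes are expressed in cents rather than dollars.

```python
import math

# Made-up outcomes in dollars; treatment moves one unit off zero.
control = [0.0, 0.0, 10.0, 20.0]
treated = [0.0, 10.0, 20.0, 40.0]

def mean_log1p_diff(scale):
    """Treated-minus-control difference in mean log(1 + scale * Y)."""
    t = sum(math.log1p(scale * y) for y in treated) / len(treated)
    c = sum(math.log1p(scale * y) for y in control) / len(control)
    return t - c

effect_dollars = mean_log1p_diff(1.0)
effect_cents = mean_log1p_diff(100.0)  # same data, measured in cents
print(round(effect_dollars, 3), round(effect_cents, 3))
```

The gap between the two numbers comes almost entirely from the unit that leaves zero: its contribution grows with the log of the rescaling factor.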
The key point of Chen and Roth [2023] is that percentage effects
are not well-defined for outcomes that are potentially zero-valued.
That is in some ways obvious: there is no way to talk about the
percent increase for something where the base level is zero. Dividing by
zero is undefined! But recall that part of the goal of using log outcomes
was to approximate percentage changes in the outcome due to treat-
ments. The main result of Chen and Roth [2023] shows that for any
other function approximating log, but defined at zero, the results will
be arbitrarily sensitive to changes in units (e.g. dollars to yuan).[9]

[9] This includes both log(1 + y) and arcsinh(y).

What drives this effect? Effects close to zero and at zero. Most
Multiple Choices
2. X j (choice characteristics)
Now recall there are two (non-exclusive) ways to think the dis- down to binary choice.
Conditional logit
A second way to view the problem is as a structural (economic)
choice problem (pioneered by McFadden [1972]). Consider
a set of utilities Uij (unobserved) such that
1. Uij = Xij′ β + ε ij
\[
\Pr(Y_i = j \mid X_{ij}) = \frac{\exp(X_{ij}\beta)}{\sum_{k=0}^{J} \exp(X_{ik}\beta)}. \tag{6}
\]
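Equation (6) is straightforward to evaluate once β is given. The sketch below is illustrative only: the characteristics and coefficients are made up, and subtracting the maximum index before exponentiating is a standard numerical safeguard that leaves the probabilities unchanged.

```python
import math

def conditional_logit_probs(X_i, beta):
    """Choice probabilities from equation (6).

    X_i  : list of characteristic vectors, one per choice j = 0, ..., J
    beta : coefficient vector shared across choices
    """
    indices = [sum(x * b for x, b in zip(x_j, beta)) for x_j in X_i]
    m = max(indices)  # subtracting the max avoids overflow in exp
    weights = [math.exp(v - m) for v in indices]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical example: three choices described by (price, quality).
X_i = [[2.0, 1.0], [1.5, 0.5], [1.0, 0.0]]
beta = [-0.8, 1.2]
probs = conditional_logit_probs(X_i, beta)
print(probs)
```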
Comment 3
Note that if the characteristics Xij only vary based on the individual
(e.g. we can write Xij β as Xi β j ), then the effects across choices are
relative to each other. We can write our probability equation as
\[
\Pr(Y_i = j \mid X_{ij}) = \frac{\exp(\alpha_j + X_i\beta_j)}{1 + \sum_{k=1}^{J} \exp(\alpha_k + X_i\beta_k)}. \tag{7}
\]
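Equation (7) can be read as equation (6) with the index of choice 0 normalized to zero, which is where the 1 in the denominator comes from. A minimal sketch with a scalar Xi and made-up parameters (all names and values are assumptions for illustration):

```python
import math

def mnl_probs(x, alphas, betas):
    """Probabilities for choices 0..J under equation (7).

    alphas, betas hold the parameters for choices 1..J; choice 0 is the
    base category, with its index normalized to zero.
    """
    indices = [0.0] + [a + x * b for a, b in zip(alphas, betas)]
    m = max(indices)  # numerical safeguard; cancels in the ratio
    weights = [math.exp(v - m) for v in indices]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameters for J = 2 non-base choices.
alphas = [0.2, -0.5]
betas = [0.7, 1.1]
x = 1.5
probs = mnl_probs(x, alphas, betas)
print(probs)
```

The ratio of any choice's probability to the base choice's probability is exp(αj + Xi βj), so each βj is interpreted relative to the base category.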
Note that with equation (6) as our probability model, we can estimate
all these elasticities (assuming we have the data on prices, and we are
willing to assume prices are exogenous, a very strong assumption).
But, this formulation creates issues.
A key issue with this formulation of the conditional logit model
is that the cross-price elasticities are identical. Specifically, ϵ jk = ϵlk ,
such that the effect of shifting price of a different good causes an
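identical proportionate shift in all choices’ market share. You can see
this by simply plugging in for ∂Pr(Yi = j|Xij)/∂pk:

\[
\epsilon_{jk}
= \underbrace{-\gamma \Pr(Y_i = j \mid X_{ij}) \Pr(Y_i = k \mid X_{ij})}_{\partial \Pr(Y_i = j \mid X_{ij}) / \partial p_k}
\times \frac{p_k}{\Pr(Y_i = j \mid X_{ij})}
= -\gamma\, p_k \Pr(Y_i = k \mid X_{ij}),
\]

which does not depend on j.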
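The restriction can be verified numerically. The sketch below assumes utilities of the form δj + γ pj with a common price coefficient γ (the δ values, prices, and γ are made up), and computes cross-price elasticities by finite differences. The elasticities of two different goods with respect to the price of a third come out equal.

```python
import math

def shares(deltas, prices, gamma):
    """Logit market shares with utility index delta_j + gamma * p_j."""
    v = [d + gamma * p for d, p in zip(deltas, prices)]
    m = max(v)
    w = [math.exp(vi - m) for vi in v]
    total = sum(w)
    return [wi / total for wi in w]

def cross_price_elasticity(j, k, deltas, prices, gamma, h=1e-6):
    """Elasticity of share j with respect to price k, by central differences."""
    up = list(prices)
    down = list(prices)
    up[k] += h
    down[k] -= h
    d_share = (shares(deltas, up, gamma)[j] - shares(deltas, down, gamma)[j]) / (2 * h)
    return d_share * prices[k] / shares(deltas, prices, gamma)[j]

# Hypothetical three-good market.
deltas = [1.0, 0.5, 0.0]
prices = [2.0, 1.5, 1.0]
gamma = -0.8

e_02 = cross_price_elasticity(0, 2, deltas, prices, gamma)
e_12 = cross_price_elasticity(1, 2, deltas, prices, gamma)
print(e_02, e_12)
```

Both values match the closed form −γ pk Pr(Yi = k), regardless of how similar or different goods 0 and 1 are to good 2.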
With the nested Logit approach, you can specify sets (as the re-
searcher), and allow correlation of the ε within these sets. The key is
that the errors are uncorrelated across choice sets, which preserves
the logit structure (see Goldberg [1995] for an example application),
and the errors within a nest are allowed to be correlated, governed by
a distinct similarity parameter. In essence, the similarity parameter
scales up and down the effect of the covariates within a nest: if the
similarity is high, then the effect of the covariates is swamped by the
random error, and the choices are highly correlated; if the similarity
is low, the nest approaches the standard IIA setting. See Wen and
Koppelman [2001] for a more recent discussion.
An alternative approach is to allow the covariance matrix of the
error terms to be flexibly estimated by the data using a multivariate
normal:
\[
\hat{E}\big(\Pr(Y_i = j \mid X, \beta, \sigma)\big)
= \frac{1}{S} \sum_{s=1}^{S}
\frac{\exp(X_j\beta + X_j\sigma\nu_{is})}{\sum_{k=0}^{J} \exp(X_k\beta + X_k\sigma\nu_{is})}. \tag{15}
\]
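The simulation average in equation (15) can be sketched as below. This assumes a single scalar characteristic per choice and standard normal draws for νis; the characteristic values, β, σ, and the number of draws S are all hypothetical.

```python
import math
import random

def logit_probs(indices):
    m = max(indices)
    w = [math.exp(v - m) for v in indices]
    total = sum(w)
    return [wi / total for wi in w]

def simulated_probs(x, beta, sigma, n_draws=2000, seed=0):
    """Average logit probabilities over S = n_draws draws of nu ~ N(0, 1).

    x holds one scalar characteristic X_j per choice j = 0, ..., J.
    """
    rng = random.Random(seed)
    avg = [0.0] * len(x)
    for _ in range(n_draws):
        nu = rng.gauss(0.0, 1.0)
        # Index for choice j on this draw: X_j * beta + X_j * sigma * nu.
        probs = logit_probs([xj * beta + xj * sigma * nu for xj in x])
        avg = [a + p / n_draws for a, p in zip(avg, probs)]
    return avg

x = [0.0, 1.0, 2.0]  # hypothetical choice characteristics
mixed = simulated_probs(x, beta=0.5, sigma=1.0)
plain = simulated_probs(x, beta=0.5, sigma=0.0)  # collapses to standard logit
print(mixed)
print(plain)
```

With σ = 0 every draw gives the same probabilities, so the average reduces to the standard logit formula; with σ > 0 the simulated probabilities reflect heterogeneity in the coefficient across individuals.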
References
Jiafeng Chen and Jonathan Roth. Logs with zeros? Some problems
and solutions. The Quarterly Journal of Economics, page qjad054,
2023.
Daniel McFadden and Kenneth Train. Mixed MNL models for discrete
response. Journal of Applied Econometrics, 15(5):447–470, 2000.