
Lecture 9 - Discrete Choice and GLM

Paul Goldsmith-Pinkham
February 26, 2024

We are now going to generalize our estimation problem beyond linear models like linear (and quantile) regression, and consider more complex objective functions. This will initially be motivated by the binary choice model, but will be more generally applicable to a wide range of problems. This will lead us to cover a wide range of topics, including binary choice models, generalized linear models (GLMs), numerical estimation methods for non-linear models, inconsistency of non-linear models with many parameters, and the challenges of estimating models with multiple discrete choices. Conceptually, we will be considering minimizing objective functions as a general case of minimizing squares.

Binary choice

Consider the following binary outcome problem: let $Y_i$ denote whether person $i$ is a homeowner, and $X_i$ includes three covariates: income, age, and age$^2$ (plus a constant). How should we model the relationship between $X$ and $Y$? Conceptually, a very general form would consider

$$Y_i = F_i(X_i),$$

where $F_i$ could vary by individual. However, this doesn't seem like a very good model for considering estimands, such as "how much does homeownership increase with a 10k increase in income?"[1] In many ways, this is similar to the questions related to binscatter and other semiparametric models.

[1] Formally, this would look something like $E(dF_i(X_i)/dX_i \mid X_i)$, and we would need to make some assumptions on $F_i$ to make progress. That's what we'll do now.

The potential issues with blithely assuming a linear model for $F_i(X_i)$ become very apparent in the context of a binary dependent variable. Say we model this outcome using a linear regression (this is often called a linear probability model), assuming strong ignorability or just $E(\epsilon_i \mid X_i) = 0$:

$$E(Y_i \mid X_i) = \Pr(Y_i = 1 \mid X_i) = X_i\beta \;\Rightarrow\; Y_i = X_i\beta + \epsilon_i \tag{1}$$

The problems with modeling $Y$ in this way are twofold. First, since the outcome is binary, the error structure will be bimodal and unusual looking. To see this, consider $\varepsilon_i = Y_i - X_i\beta$, and consider how $\varepsilon_i$ changes for $Y_i = 0$ vs. $Y_i = 1$. For a given $X_i$, it is exactly bimodal (like the outcome). One implication of this is that $V(Y \mid X) = X_i\beta(1 - X_i\beta)$, and you'll have pretty significant heteroskedasticity. This is solvable using robust standard errors, but does mean that a normal approximation for the error is a poor one.
Second, except under some special circumstances, it's very likely that the predicted values of $Y_i$ will be outside of $[0, 1]$. What's an example where they will not be? Discrete exhaustive regressors! Why? Discrete exhaustive regressors are the one setting where you can guarantee that the model is correctly specified. When the model is misspecified, it is quite possible that the model will extrapolate in a way such that there will be values outside the support.
How does this impact our causal estimates? If the model is correctly specified, we can generate counterfactual predictions of the outcome. If not, then we get a linear approximation that may be nonsensical.

Table 1: LPM model estimates

    variable      linear est.   std. error
    Intercept      0.0242       0.0410
    age            0.0220       0.0017
    age^2         -0.0002       0.0000
    income/10k     0.0069       0.0007

Example 1 (LPM estimates of homeownership)
We estimate the linear model in Table 1, and note that if income were strictly ignorable, we could say that a 10k increase in income leads to a 0.69 p.p. increase in the probability of homeownership. But the predicted probability of homeownership would range from 0.283 to 1.78. Oops.
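To see this failure mode concretely, here is a minimal simulation sketch (the data-generating process, coefficient values, and variable names are illustrative assumptions, not the data behind Table 1):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 5_000
age = rng.uniform(20, 80, size=n)
income = rng.lognormal(mean=1.0, sigma=0.8, size=n)  # income in $10k units (assumed DGP)

# Assumed true DGP: homeownership probability comes from a logistic index
index = -4 + 0.08 * age - 0.0005 * age**2 + 0.4 * income
y = rng.binomial(1, 1 / (1 + np.exp(-index)))

# Linear probability model: OLS on a binary outcome, with robust SEs
# to handle the mechanical heteroskedasticity X_i b (1 - X_i b)
X = sm.add_constant(np.column_stack([age, age**2, income]))
lpm = sm.OLS(y, X).fit(cov_type="HC1")
yhat = lpm.predict(X)
print(f"fitted 'probabilities' range: [{yhat.min():.3f}, {yhat.max():.3f}]")
# Some fitted values typically land below 0 or above 1, as in Example 1
```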

Modeling discrete choice

There are two ways to think about this estimation problem. These are not mutually exclusive, and it is important to note that both of these approaches are very focused on the model-based aspect of estimating causal effects.

The first is a statistical view: how can we model the statistical process for $Y_i$ better? In other words, can we fit the outcome model better? Consider $X_i\beta$ as the conditional mean of some process; what's the statistical model that fits with this? This is a case of what's termed "Generalized Linear Models" (GLM).
A second way to view this is as a structural (economic) choice problem. Most models of binary outcome variables assume a latent index on the utility of choosing $Y_i$:[2]

$$Y_i^* = X_i\beta + \varepsilon_i, \qquad Y_i = \begin{cases} 1 & Y_i^* > 0 \\ 0 & Y_i^* \le 0. \end{cases} \tag{2}$$

[2] The careful reader will note the analogy to the Heckman model of treatment choice.

As we will now see, both approaches arrive at a similar modeling conclusion, but the latter model will naturally accommodate choices. A natural approach in either of these is to make a distributional assumption about $\varepsilon_i$. Two common assumptions:

1. $\varepsilon_i$ is conditionally normally distributed (probit), such that $\Pr(Y_i = 1 \mid X_i) = \Phi(X_i\beta)$

2. $\varepsilon_i$ is conditionally extreme value (logistic), such that $\Pr(Y_i = 1 \mid X_i) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}$

[Figure 1: Logit vs. Probit error terms]

Note that these are not, in the binary setting, deeply substantive assumptions. In Figure 1, we see that there are very minor differences in the thickness of the tails for a logit vs. normal error, but they're both symmetric and centered around zero.[3] One downside of probit models is that there's no closed-form solution for $\Phi$, the CDF of the normal distribution:

$$\Phi(X_i\beta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{X_i\beta} e^{-t^2/2}\, dt \tag{3}$$

We will discuss later how to estimate $\beta$ given these assumptions, but it will involve numerical optimization, as there is no closed form for $\beta$ like in linear regression.

[3] Important caveat: these models only identify $\beta$ up to scale. Why? The "true" model of $\epsilon$ could have variance $\sigma^2$ that is unknown. Consider if $F(X_i\beta) = \Phi(X_i\beta)$. If this were a general normal (rather than standardized with variance 1), we could just scale up the coefficients proportionate to $\sigma$ and the realized binary outcome would be identical. Hence, we normalize $\sigma = 1$ in most cases. This is not a meaningful assumption.

Example 1 (continued)
Consider now the same homeowner problem from Example 1, but estimated with logit. The $\beta$ coefficients in Column 1 of Table 2 are hard to interpret. To see why, consider the derivative of the probability with respect to $X_i$:

$$\frac{\partial \Pr(Y_i = 1 \mid X_i)}{\partial X_i} = \beta\, \phi(X_i\beta) \quad \text{(Probit)}$$
$$\frac{\partial \Pr(Y_i = 1 \mid X_i)}{\partial X_i} = \beta\, \frac{\exp(X_i\beta)}{(1 + \exp(X_i\beta))^2} \quad \text{(Logit)}$$

Table 2: Homeownership problem estimated with logit

    term           (1) logit est.   (2) linear est.   (3) avg. deriv.
    constant       -2.14             0.0242           -0.392
    age             0.0903           0.022             0.0166
    age^2          -0.0006          -0.0002           -0.0001
    income/10k      0.0716           0.0069            0.0131

In both cases, the effect of $X_i$ changes depending on the value of $X_i$. This is a problem for interpretation. The average derivative in Column 3 is a way to get around this, but it's not a perfect solution:

$$n^{-1} \sum_i \frac{\partial E(Y \mid X)}{\partial X_i} = n^{-1} \sum_i \beta\, \frac{\exp(X_i\beta)}{(1 + \exp(X_i\beta))^2}$$

This calculates the derivative for every value in the sample and then averages them, which gives a sense of the average effect of $X_i$ on $Y_i$. We see a much larger effect of income on homeownership in the logit model than in the linear model (Column 2).
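Here is a sketch of the average-derivative calculation (simulated data and coefficients, not the data behind Table 2; statsmodels' `get_margeff` performs the same computation):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
x = rng.normal(size=n)
X = sm.add_constant(x)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x))))  # assumed logit DGP

fit = sm.Logit(y, X).fit(disp=0)
b = fit.params

# Average derivative: n^{-1} sum_i beta * exp(X_i b) / (1 + exp(X_i b))^2
xb = X @ b
w = np.exp(xb) / (1 + np.exp(xb)) ** 2
print("manual average derivative for x:", (w.mean() * b)[1])

# statsmodels computes the same "overall" average marginal effect directly
print(fit.get_margeff(at="overall").summary())
```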
[Figure 2: Linear vs. Logit model predictions]

Example 1 (continued)
Figure 2 shows the predicted values of homeownership from the linear and logit models. The linear model is predicting values outside of the support of the outcome, and the logit model is not. This is one benefit of correctly specifying the model.

Generalized Linear Models (GLM)

We can generalize the intuition above, where we let the underlying distribution of $\epsilon$ be non-normal, and parameterize the mean of the distribution to be a function of $X_i\beta$. This is the idea behind Generalized Linear Models (GLM), originally formulated in Nelder and Wedderburn [1972].[4]

[4] Interestingly, this is very common in non-economics fields, but much less common in economics.

The overall setup of GLMs in broad strokes is to consider estimation of a linear model $X\beta$, which is linked to the conditional mean $E(Y \mid X)$ by a link function $g$: $E(Y \mid X) = g^{-1}(X\beta)$. The crucial assumption for the underlying machinery is that $Y$, the outcome, is distributed according to some member of the exponential family of distributions. This includes the normal, binomial, Poisson, and gamma distributions, among others.
Some simple examples of GLMs include:

1. Logit, with link function $g^{-1}(X_i\beta) = \frac{\exp(X_i\beta)}{1 + \exp(X_i\beta)}$

2. Normal, with an identity link function $g^{-1}(X_i\beta) = X_i\beta$

3. Poisson, with a log link function $g^{-1}(X_i\beta) = \exp(X_i\beta)$

In essence, we can enforce a linear functional form for the mean, and allow the error distribution to fit the form of the data.[5]

[5] It's interesting to note that the underlying machinery of GLMs is similar to many of the selection and discrete choice models we've discussed and will discuss today. The linear index provides an extremely convenient parameterization of the mean, but also makes some particular assumptions about the substitutability of the covariates.
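As a sketch of these three cases in code (simulated data; the statsmodels GLM families implement the link functions above):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(size=n)
X = sm.add_constant(x)

# 1. Logit: binary outcome, logistic link
y_bin = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x))))
logit_fit = sm.GLM(y_bin, X, family=sm.families.Binomial()).fit()

# 2. Normal: identity link (equivalent to OLS)
y_lin = 0.5 + 1.0 * x + rng.normal(size=n)
normal_fit = sm.GLM(y_lin, X, family=sm.families.Gaussian()).fit()

# 3. Poisson: log link, so E(Y|X) = exp(X beta)
y_count = rng.poisson(np.exp(0.5 + 1.0 * x))
poisson_fit = sm.GLM(y_count, X, family=sm.families.Poisson()).fit()

for name, fit in [("logit", logit_fit), ("normal", normal_fit), ("poisson", poisson_fit)]:
    print(name, fit.params)  # each recovers roughly (0.5, 1.0) on its own scale
```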

We will now discuss the Poisson regression case in more detail, as it tends to be underused in economics and is a very important use case. A key takeaway in GLM, like with OLS, is that it is possible to correctly specify just the conditional mean function and then robustly estimate standard errors on parameters of that function, rather than fully specifying the distribution correctly.

Poisson Regression for non-negative outcomes


Consider a non-negative outcome $Y \ge 0$. There is a huge host of outcomes in economics and finance that are restricted to this support: investment, assets, wages, patent citations, output, and so on. We are often interested in the estimand of the partial effect $dE(Y \mid X)/dX$. If we estimate this conditional mean with linear regression (e.g. by assuming $Y_i = X_i\beta + \epsilon_i$), what are the potential issues?

Mechanically, the error terms $\hat\epsilon_i = Y_i - X_i\hat\beta$ will be skewed, since $Y_i$ is skewed. This is not on its own a huge issue, but it does suggest that the asymptotic approximation for $\hat\beta$ will be worse for a given $n$. This leads to highly influential outliers for OLS as well.[6]

[6] Note that one solution to this issue is to consider quantile regression instead!
Comment 1
Consider two outcomes, $Y_1$ and $Y_2$. In both cases, the true model is linear (with coefficient of 1) with respect to $X$, but the error term is Normal with mean zero and variance 1 in $Y_1$, and is log-Normal with mean zero and variance 1 in $Y_2$. If we simulate and estimate this model using linear regression, plotting the t-statistic of the coefficient on $X$ for each model, we find much higher power for the model with Normal errors than for the model with log-Normal errors. This reflects the lack of efficiency of OLS in the presence of non-Normal errors (but not a lack of consistency!). See Figure 3 for a visual representation of this.

[Figure 3: Non-normal errors in linear regression]
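A sketch of the simulation behind Comment 1 and Figure 3 (the sample size, number of replications, and the recentering of the log-Normal draw to mean zero are my assumptions):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)
n, reps = 200, 2_000
tstats = {"Normal": [], "log-Normal": []}

for _ in range(reps):
    x = rng.normal(size=n)
    errors = {
        "Normal": rng.normal(size=n),
        # exp(N(0,1)) draw, recentered so the error has mean zero
        "log-Normal": rng.lognormal(0.0, 1.0, size=n) - np.exp(0.5),
    }
    for name, e in errors.items():
        res = linregress(x, x + e)  # true coefficient on x is 1
        tstats[name].append(res.slope / res.stderr)

for name, ts in tstats.items():
    print(f"{name}: mean t-stat = {np.mean(ts):.1f}")
# The log-Normal design yields much smaller t-stats on average: lower power,
# even though OLS remains consistent in both cases
```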

What are solutions to this issue? One commonly used approach is to estimate linear regressions on $\log(Y)$ instead of $Y_i$. This solves many of the outlier and skew issues,[7] but creates its own problems. First, the parameters have a different interpretation. Note that the units for the outcome are different (log points). Often, these are interpreted in percentage points, since log differences are approximately equal to percentage changes.[8] This is useful, but can at times be confusing (e.g. what is the actual level effect? Sometimes a percentage effect can exaggerate or minimize a large level effect).

[7] In the ideal case, note that a log-Normal outcome will be exactly linear after taking logs.

[8] Recall $\log(Y_1) - \log(Y_0) = \log(Y_1/Y_0) = \log(1 + \Delta Y/Y_0) \approx \Delta Y/Y_0$ for $\Delta Y/Y_0$ small.

Second, what if $Y = 0$? This is a problem, as $\log(0)$ is undefined. One common solution is to use $\log(1 + Y)$ or $\operatorname{arcsinh}(Y)$ [Manning and Mullahy, 2001, Ravallion, 2017, Bellemare and Wichman, 2020], which solves the second problem, but makes the first problem even worse! [Bellemare and Wichman, 2020, Aihounton and Henningsen, 2021, Cohn et al., 2022] Why these solutions? For one, they're both well-defined at $Y = 0$. Second, they have "similar" properties to taking a log: since the distance between $\log(1 + Y)$ and $\log(Y)$ becomes small as $Y$ gets large, the hope is that the differences would "wash out." It turns out, thanks to work by Chen and Roth [2023], that neither of these solutions is a good idea and that these differences do not wash out.
The key point of Chen and Roth [2023] is that percentage effects are not well-defined for outcomes that are potentially zero-valued. That is in some ways obvious: there is no way to talk about the percent increase for something where the base level is zero. Dividing by zero is undefined! But recall that part of the goal of using log outcomes was to approximate percentage changes in the outcome due to treatments. The main result of Chen and Roth [2023] shows that for any function approximating log but defined at zero, the results will be arbitrarily sensitive to changes in units (e.g. dollars to yuan).[9]

[9] This includes both $\log(1 + y)$ and $\operatorname{arcsinh}(y)$.

What drives this effect? Effects close to zero and at zero. Most importantly, the extensive margin of moving from zero to non-zero has huge, and arbitrary, impacts on estimates under these types of rescaling. Put precisely, if you change the units of the outcome by $a$ (e.g. $a = 100$, converting from dollars to cents), then the estimated effect will change by $\log(a)$ multiplied by the extensive margin effect. Note that this fails scale equivariance, which is the property of OLS and quantile regression that usually makes them good estimators.[10]

[10] Note also the practical implication: if there is a big extensive margin effect, a large $a$ has a big effect. In contrast, with a small $a$, most effects will be close to zero (since they are extensive margin, and hence close to zero by definition).

This can have some really serious implications. Chen and Roth [2023] find that for half the papers they surveyed in the AER, the estimated effects would change by more than 100% if the units of the outcome were changed by 100 (e.g. dollars to yen). This is a non-trivial effect!
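A sketch of this sensitivity in simulation: when units are rescaled by $a$, the $\log(1 + Y)$-style estimate moves by roughly $\log(a)$ times the extensive-margin effect (the DGP below, with a 10 p.p. extensive-margin effect, is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
T = rng.binomial(1, 0.5, size=n)

# Assumed DGP: treatment raises P(Y > 0) by 10 p.p.; the positive part is unaffected
positive = rng.binomial(1, 0.5 + 0.1 * T)
y = positive * rng.lognormal(2.0, 1.0, size=n)

for a in [1, 100, 10_000]:  # rescale the units of Y by a
    z = np.log(1 + a * y)
    effect = z[T == 1].mean() - z[T == 0].mean()
    print(f"a = {a:>6}: difference in means of log(1 + a*Y) = {effect:.3f}")
# Each 100x rescaling adds roughly log(100) * 0.10 ≈ 0.46 to the "effect"
```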
The takeaway I want you to have: you should not be running a regression with $\log(1 + Y)$ or $\operatorname{arcsinh}(Y)$ on the left-hand side![11]

[11] I have done this, historically, in my own work; we're all flawed creatures trying to inch towards better methodological implementations!

So what should you do if you have a zero in your left-hand-side variable? Chen and Roth [2023] suggest other ways of considering these situations:[12]

[12] These solutions are not perfect, but are motivated by a "trilemma" they prove: it is not possible to have an estimator that is simultaneously (1) an average of individual-level treatment effects, (2) invariant to rescaling of units, and (3) point-identified without more assumptions about the joint distribution of the potential outcomes (beyond what we usually do in regression).

1. First, if you really need something interpretable as a percentage effect (e.g. rescaling an ATE into a percentage), you could estimate $\tau = E(Y_i(1) - Y_i(0))/E(Y_i(0))$, which scales the ATE by the baseline average. This is the estimand targeted by Poisson regression. There are also other normalizations one could consider. Instead of normalizing by $E(Y_i(0))$, if there is a pre-treatment characteristic $W_i$ that is exogenous, you could normalize by $E(Y_i(0) \mid W_i)$, e.g. the predicted baseline value given characteristic $W_i$. This captures richer heterogeneity in the baseline characteristic, and may do a better job of reducing skewness.

2. Second, you could redefine the outcome in terms of functionals of the distribution, e.g. $\tilde Y = F_{Y^*}(Y)$. A prominent example is looking at the rank of an individual relative to the overall distribution, as in Chetty et al. [2014].

3. If the goal is to consider trade-offs in something like concave preferences, then it is plausible to specify exactly the "value" of a person at $Y = 0$, relative to positive $Y$, and then explicitly evaluate the parameter that way. This has the problem of losing scale-invariance, but at least the researcher is explicit about how they value these issues.

4. Finally, it is plausible to directly estimate the extensive and intensive effects separately. However, the intensive effect is only partially identified; we will explore this further in later lectures.

See Table 3 for a full set of alternative estimators.

Table 3: Table 2 from Chen and Roth (2023)

    Normalized ATE
      Parameter: $E(Y(1) - Y(0))/E(Y(0))$
      Pro: percent interpretation. Con: does not capture decreasing returns.

    Normalized outcome
      Parameter: $E(Y(1)/X - Y(0)/X)$
      Pro: per-unit-$X$ interpretation. Con: need to find a sensible $X$.

    Explicit trade-off of intensive/extensive margins
      Parameter: ATE for $m(y) = \log(y)$ if $y > 0$, $m(y) = -x$ if $y = 0$
      Pro: explicit trade-off of the two margins. Con: need to choose $x$; monotone only if the support excludes $(0, e^{-x})$.

    Intensive margin effect
      Parameter: $E[\log(Y(1)/Y(0)) \mid Y(1) > 0, Y(0) > 0]$
      Pro: ATE in logs for the intensive margin. Con: partially identified.

Comment 2 (Poisson Regression)
Poisson regression is a good example of a way to estimate $E(Y_i(1) - Y_i(0))/E(Y_i(0))$. This approach estimates $\log(E(Y \mid X)) = X\beta$, rather than $E(\log(Y) \mid X)$. You get a simple semi-elasticity measure for the parameters, and $Y$ can be zero. What are the typical concerns?

1. If $Y \mid X$ is truly distributed Poisson, then $Var(Y \mid X) = E(Y \mid X)$. This just comes from the Poisson distribution's statistical properties, but feels like a restrictive modeling assumption. However, it's not relevant for the parameter estimates of $\beta$: the estimates are still consistent, and the standard errors for these estimates can be adjusted for misspecification using robust standard errors (e.g. sandwich covariance estimators). These will give correct coverage, obviating concerns about the Poisson variance assumption (see the sketch after this list). It is not necessary to use a Negative Binomial regression.

2. As we will discuss shortly, in many non-linear models, if you include parameters, such as fixed effects, which cannot be consistently estimated, then this will make all the estimates in the model inconsistent. This is different from linear models. This concern is less of an issue in Poisson regression, as fixed effects can be concentrated out (see PPMLHDFE in Stata and glmhdfe in R).

3. Individuals are often not sure how to do instrumental variables in Poisson regression, but it is feasible! See Mullahy [1997], Windmeijer and Santos Silva [1997].
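As referenced in point 1, here is a minimal sketch: Poisson point estimates recover the conditional mean parameters even under heavy overdispersion, and sandwich standard errors handle the misspecified variance (the negative-binomial DGP is an illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_obs = 5_000
x = rng.normal(size=n_obs)
X = sm.add_constant(x)

# Overdispersed outcome: conditional mean exp(0.5 + x) but variance mu + mu^2/2,
# so Var(Y|X) = E(Y|X) fails badly
mu = np.exp(0.5 + x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))

fit = sm.GLM(y, X, family=sm.families.Poisson()).fit(cov_type="HC1")
print(fit.params)  # close to (0.5, 1.0): the conditional mean is correctly specified
print(fit.bse)     # sandwich SEs give correct coverage despite overdispersion
```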

The benefits of using the Poisson model (instead of log(1+Y)), according to Cohn et al. [2022]: "We replicate data sets from six papers published in top finance journals that together study two count or count-like outcomes... We...estimate log1plus and Poisson regressions based on that specification, and compare the coefficients of interest. These coefficients differ markedly in all six cases and have different signs in three of the six, suggesting that inference about even the direction of a relationship is sensitive to regression model choice in real-world applications...in all five cases involving regressions with control variables, switching from a log1plus to Poisson regression results in a larger change in the coefficient of interest than omitting the most important control variable, generally by a wide margin."

Inconsistency in binary choice models

Consider estimating a panel fixed effects model with binary choice:

$$Y_{it}^* = \alpha_i + X_{it}\beta + \epsilon_{it}, \qquad \Pr(Y_{it} = 1 \mid X_{it}) = F(\alpha_i + X_{it}\beta),$$

where we are interested in the parameter $\beta$. If we have a short panel (e.g. few time periods), we cannot consistently estimate $\alpha_i$. In the linear case, this does not affect estimation of $\beta$. More shocking, however, is that in the binary outcome case, the only model that consistently estimates $\beta$ is the conditional logit [Chamberlain, 1980, 2010].

More generally, if you have inconsistent fixed effects in your non-linear models, this can cause serious issues (except in special cases like the conditional logit). Often, the only way to get around these issues is by finding ways to "concentrate out" or otherwise avoid these nuisance parameters. Famous cases where this works include the conditional logit, Poisson with unit fixed effects, and partial likelihoods in the Cox proportional hazard model.

Multiple Choices

We'll now examine multiple discrete choice problems. Much of this discussion is very adjacent to industrial organization. However, many of these ideas are important for non-IO problems, such as multiple IVs and Roy models. Moreover, these tools are very promising in fields that have not yet used them.
Issues with choice problems that we’ll discuss:

• Independence of Irrelevant Alternatives (IIA)

• Choice sets and consideration sets

Consider the following problem: we observe choices for individuals $Y_i = j$, $j \in \Omega = \{0, 1, \ldots, J\}$, where $J + 1 = |\Omega|$ is the total number of choices. Importantly, the order of the choices has no particular meaning. This could be red bus, blue bus, and car as transportation choices, for example.

Given these sets of choices, we have different types of covariates we can observe. Some characteristics are choice-specific (such as a price), while some are unit-specific (such as a person's income). Often, we want to allow the characteristics to vary by both dimensions. This includes allowing a choice's characteristic to vary depending on the person (e.g. a unit-specific coefficient on the choice's characteristic), or allowing the person's characteristic to have differential effects on the choice of different goods. In total, we have three types of characteristics:

1. $X_i$ (individual characteristics, invariant to choices),

2. $X_j$ (choice characteristics),

3. $X_{ij}$ (individual-by-choice characteristics).

We can write $X_i$ as $X_{ij}$ by interacting with choice fixed effects, and $X_j$ can have $i$-specific coefficients.[13]

[13] Note that when $J = 1$, we collapse down to binary choice.

Now recall there are two (non-exclusive) ways to think about the discrete choice problem. The first is a statistical view: namely, how do we model the choice probabilities? In the binary choice problem, there is only one parameter that needs to be known, conditional on $X_i$: $\pi(X_i) = \Pr(Y_i = 1 \mid X_i)$. With more than two choices, the dimensionality becomes more complicated: with three choices, we now have two free probabilities $\pi_j(X)$, with the remaining one pinned down by the shares summing to one.

How should we parameterize how other choices' characteristics affect each other? Most of the models we will discuss will make very specific restrictions on how choices affect one another. These are not innocuous choices, as we'll see, but they provide a huge amount of additional structure that can be used to identify the parameters of interest.

The naive approach

If we want to estimate simple treatment effects, we could focus on binary outcomes. For example: we have a randomly assigned treatment $T$, and $J$ choices. What is the effect of $T$ on $\Pr(Y_i = j)$ under random assignment?

$$\tau_j = \Pr(Y_i = j \mid T_i = 1) - \Pr(Y_i = j \mid T_i = 0) \tag{4}$$

The downside of this approach is that there's no information about the substitution patterns of individuals in this form. Concretely, if $\tau_2$ is positive, is that because the share of individuals choosing $Y_i = 1$ decreases, the share of individuals choosing $Y_i = 0$ decreases, or both? Namely, what is the substitution pattern across the choices?[14] Nonetheless, it is still very helpful to estimate these measures, and it's useful when faced with a lot of choices to focus on the effect on one margin. We will need more structure to estimate relative choice substitution across outcomes, and ask questions like "what is the effect of $T$ on choosing $j$, conditional on choosing $j$ or $k$?"

[14] To put a statistical note on this, there are effectively two endogenous variables ($1(Y_i = 1)$ and $1(Y_i = 2)$), and we only have one randomly assigned variable ($T$). Hence, there's no way to simultaneously identify the effect on both.

Conditional logit
A second way to view the problem is as a structural (economic) choice problem, pioneered by McFadden [1972]. Consider a set of utilities $U_{ij}$ (unobserved) such that

$$Y_i = \arg\max_{j \in \Omega} U_{ij}. \tag{5}$$

In other words, person $i$ chooses $j$ if it's the choice that maximizes utility amongst all $J + 1$ choices. Note the similarity to the $Y_i^*$ in the binary case!

If we make the assumptions:

1. $U_{ij} = X_{ij}'\beta + \varepsilon_{ij}$

2. $\varepsilon_{ij}$ are independent across choices and individuals, and distributed Type-I extreme value,

then we get the McFadden conditional logit model:

$$\Pr(Y_i = j \mid X_{ij}) = \frac{\exp(X_{ij}\beta)}{\sum_{k=0}^{J} \exp(X_{ik}\beta)}. \tag{6}$$
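A minimal sketch of equation (6) in code, plus maximum-likelihood estimation by numerical optimization (data simulated from the model itself; the dimensions and values are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

def neg_loglik(beta, X, y):
    """Conditional logit: X is (n, J+1, k) choice-varying covariates, y the chosen j."""
    v = X @ beta                                          # (n, J+1) indices X_ij' beta
    v = v - v.max(axis=1, keepdims=True)                  # stabilize the exponentials
    p = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)  # equation (6)
    return -np.log(p[np.arange(len(y)), y]).sum()

rng = np.random.default_rng(5)
n, J_plus_1, k = 2_000, 3, 1
X = rng.normal(size=(n, J_plus_1, k))                # e.g. the price of each alternative
beta_true = np.array([-1.0])
U = X @ beta_true + rng.gumbel(size=(n, J_plus_1))   # Type-I extreme value errors
y = U.argmax(axis=1)                                 # equation (5): utility-maximizing choice

res = minimize(neg_loglik, x0=np.zeros(k), args=(X, y))
print(res.x)  # close to beta_true
```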

Comment 3
Note that if the characteristics $X_{ij}$ only vary based on the individual (e.g. we can write $X_{ij}\beta$ as $X_i\beta_j$), then the effects across choices are relative to each other. We can write our probability equation as

$$\Pr(Y_i = j \mid X_{ij}) = \frac{\exp(\alpha_j + X_i\beta_j)}{1 + \sum_{k=1}^{J} \exp(\alpha_k + X_i\beta_k)}. \tag{7}$$

This is the multinomial logit. Once we allow for choice-specific characteristics, we need to write the probability following Equation (6).

In many choice problems, a key parameter we're interested in is the price elasticity. The definition of the price elasticity is the percentage change in the market share of a good for a given percentage change in the price. Formally, the own-price elasticity is:

$$\epsilon_j = \frac{\partial \Pr(Y_i = j \mid X_{ij})/\Pr(Y_i = j \mid X_{ij})}{\partial p_j / p_j} = \frac{\partial \Pr(Y_i = j \mid X_{ij})}{\partial p_j} \frac{p_j}{\Pr(Y_i = j \mid X_{ij})}. \tag{8}$$

We can also think about cross-price elasticities, e.g. how market shares change when other goods shift their price:

$$\epsilon_{jk} = \frac{\partial \Pr(Y_i = j \mid X_{ij})}{\partial p_k} \frac{p_k}{\Pr(Y_i = j \mid X_{ij})}. \tag{9}$$

Note that with equation (6) as our probability model, we can estimate all of these elasticities (assuming we have data on prices, and we are willing to assume prices are exogenous, a very strong assumption). But this formulation creates issues.

A key issue with this formulation of the conditional logit model is that the cross-price elasticities are identical. Specifically, $\epsilon_{jk} = \epsilon_{lk}$, such that shifting the price of a different good causes an identical proportionate shift in all choices' market shares. You can see this by simply plugging in for $\partial \Pr(Y_i = j \mid X_{ij})/\partial p_k$:

$$\epsilon_{jk} = \underbrace{-\gamma \Pr(Y_i = j \mid X_{ij}) \Pr(Y_i = k \mid X_{ij})}_{\partial \Pr(Y_i = j \mid X_{ij})/\partial p_k} \times \frac{p_k}{\Pr(Y_i = j \mid X_{ij})} = -\gamma \Pr(Y_i = k \mid X_{ij})\, p_k,$$

where $\gamma$ is the coefficient on price in the conditional logit model. Note that this elasticity is not a function of $j$, and hence is identical for all other products.[15]

[15] It's useful to note that the levels of the market shares do vary by good, but the elasticity scaling makes the cross-price elasticities identical.

The canonical example of this is the "car, red bus and blue bus" example. Imagine a choice set where there are three choices for transportation: a car and two buses, one red and one blue. Presumably a person is purely indifferent between red and blue buses. Hence, a shift in the red bus price should cause a bigger substitution from the blue bus than from car users, but the conditional logit (in this form) will not account for this.
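A numeric sketch of this (the prices and price coefficient are made up): under the plain conditional logit, a red-bus price increase raises the car share and the blue-bus share by an identical percentage.

```python
import numpy as np

gamma = 1.0                             # assumed price coefficient
prices = np.array([3.0, 1.0, 1.0])      # car, red bus, blue bus

def shares(p):
    v = -gamma * p                      # utility index depends only on price here
    return np.exp(v) / np.exp(v).sum()

s0 = shares(prices)
p1 = prices.copy()
p1[1] *= 1.01                           # raise the red-bus price by 1%
s1 = shares(p1)

print("baseline shares:", s0.round(3))
print("% change in shares:", ((s1 - s0) / s0).round(5))
# The car and blue-bus shares rise by the same percentage, even though a
# red-bus rider should mostly substitute to the (identical) blue bus
```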
How can we deal with the IIA issue? This is a problem of poor substitution patterns, which is an economics problem. In other words, economics gives us intuition about market substitution patterns, and we don't think identical cross-elasticities make sense. It's also a statistical problem: there is a very strong statistical functional form we have assumed, which was analytically convenient but has somewhat perverse properties. We will now consider a few (but not all) solutions to the problem proposed in the literature.

Nested Logit and Correlated Multivariate Probit

One part of the IIA problem comes from the independence of $\varepsilon$ across choices. Recall that the $\varepsilon$ effectively rationalize the market shares beyond what is explained by the covariates. Recall the blue and red bus case: getting two independent $\varepsilon$ draws for the buses is not an intuitive view of bus demand. Instead, the blue and red buses likely have highly correlated (if not identical) $\varepsilon$ draws; that is, the unobserved latent demand for the blue and red buses is correlated! The issue is exactly how to specify a correlation structure that preserves the ability to estimate the model.

With the nested logit approach, you (as the researcher) specify sets of choices, and allow correlation of the $\varepsilon$ within these sets. The key is that the errors are uncorrelated across choice sets, which preserves the logit structure (see Goldberg [1995] for an example application), and the correlation within a nest is governed by a distinct similarity parameter. In essence, the similarity parameter scales up and down the effect of the covariates within a nest: if the similarity is high, then the effect of the covariates is swamped by the common random error, and the choices are highly correlated; if the similarity is low, the nest approaches the standard IIA setting. See Wen and Koppelman [2001] for a more recent discussion.
An alternative approach is to allow the covariance matrix of the error terms to be flexibly estimated from the data using a multivariate normal:

$$\epsilon_i = (\epsilon_{i0}, \epsilon_{i1}, \ldots, \epsilon_{iJ}) \sim N(0, \Sigma) \tag{10}$$

where the researcher directly estimates $\Sigma$. Unfortunately, this problem gets hard with many choices (the parameter space grows at rate $O(J^2)$). See McCulloch et al. [2000] and Geweke et al. [2003] for details and an application in the Bayesian setting, and Train (2009) for simulation discussions in the frequentist case.
Rather than directly target the distribution of the $\varepsilon_{ij}$, an alternative approach is to add more richness to the coefficients themselves. Adding random variation in the loadings effectively creates a richer substitution pattern by adding more structure to the error term. Consider a slight extension of our previous model, with $\beta_i$ varying by individual (in an unobserved way):[16]

$$U_{ij} = X_{ij}\beta_i + \varepsilon_{ij}$$
$$U_{ij} = X_{ij}\beta + \nu_{ij}, \qquad \nu_{ij} = \varepsilon_{ij} + X_{ij}(\beta_i - \beta)$$

There are a number of ways to estimate this model, but notice the key point: substitution patterns are more richly modeled (and allowed) because $\nu_{ij}$ varies with $X_{ij}$.

[16] Note that this random variation in preferences is usually viewed as exogenous.

Example 2 (Random coefficients estimation example)
Let $J = 3$, and let $X_j$ be a scalar (e.g. price). We assume that

$$U_{ij} = X_j\beta_i + \varepsilon_{ij}, \qquad \beta_i = \beta + \sigma\nu_i, \quad \nu_i \sim N(0, 1). \tag{11}$$

Separate the utility of choosing $j$ into

$$U_{ij} = \mu_{ij}(\beta) + X_j\sigma\nu_i + \varepsilon_{ij} \tag{12}$$
$$\mu_{ij} = X_j\beta. \tag{13}$$

We can write the probability of choosing $j$ as:

$$\Pr(Y_i = j \mid X, \beta, \sigma) = \int \frac{\exp(X_j\beta + X_j\sigma\nu_i)}{\sum_{k=0}^{J} \exp(X_k\beta + X_k\sigma\nu_i)}\, \phi(\nu_i)\, d\nu_i \tag{14}$$

where $\phi(\cdot)$ is the standard normal pdf.


This setup is often referred to as a "mixed logit" model (in contrast with the more common Berry Levinsohn Pakes approach, which we'll discuss later) [McFadden and Train, 2000]. The typical approach to estimating these models involves Maximum Simulated Likelihood or the Method of Simulated Moments. McFadden and Train [2000] show that a straightforward approach is to simulate the draws $\nu_{is}$ $S$ times, and then use the simulated draws to approximate the integral:

$$\hat E(\Pr(Y_i = j \mid X, \beta, \sigma)) = \frac{1}{S} \sum_{s=1}^{S} \frac{\exp(X_j\beta + X_j\sigma\nu_{is})}{\sum_{k=0}^{J} \exp(X_k\beta + X_k\sigma\nu_{is})}. \tag{15}$$

This probability can then be used to form a log-likelihood function, and the model can be estimated using standard optimization techniques for maximizing log-likelihoods.
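A sketch of maximum simulated likelihood for Example 2's model (the sample size, number of draws $S$, and true parameter values are illustrative assumptions; the simulation draws are held fixed across optimizer iterations):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n, J_plus_1, S = 2_000, 3, 200
Xj = rng.normal(size=J_plus_1)           # one scalar characteristic per choice
nu = rng.normal(size=(n, S))             # fixed simulation draws nu_is

# Simulate choices from U_ij = X_j * beta_i + eps_ij, beta_i = beta + sigma * nu_i
beta_true, sigma_true = -1.0, 0.5
beta_i = beta_true + sigma_true * rng.normal(size=n)
y = (beta_i[:, None] * Xj[None, :] + rng.gumbel(size=(n, J_plus_1))).argmax(axis=1)

def neg_sim_loglik(theta):
    beta, sigma = theta
    b = beta + sigma * nu                               # (n, S) simulated beta_i draws
    v = b[:, :, None] * Xj[None, None, :]               # (n, S, J+1) utility indices
    p = np.exp(v) / np.exp(v).sum(axis=2, keepdims=True)
    p_hat = p[np.arange(n), :, y].mean(axis=1)          # equation (15): average over draws
    return -np.log(p_hat).sum()

res = minimize(neg_sim_loglik, x0=np.array([0.0, 0.2]), method="Nelder-Mead")
print(res.x)  # roughly (beta_true, sigma_true), up to simulation error
```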
Note that an important piece in this setting is micro-level choice data (which we use to form the likelihood), and the lack of any unobserved heterogeneity that creates endogeneity and bias in our estimates. Without an additional error term, there's no need for an instrument here. This is a version of assuming exogeneity conditional on observables. Often, we will only observe market-level shares of goods. Then we'll need many markets in order to have sufficient independent variation to estimate parameters. We will discuss this next.

The workhorse set of demand estimation models is known as BLP (Berry Levinsohn Pakes), named after the authors of Berry et al. [1995]. This model combines random coefficient estimation with unobserved market-good-level demand heterogeneity that is potentially endogenous and correlated with price. In other words, not only are individuals allowed to have random (independent) errors, but there is a fixed unobserved error in demand for each good. This allows for a highly correlated set of demand choices within a market, and also creates unobserved demand heterogeneity that requires an instrument.
This model is often specified using the following utility function:

$$U_{ijm} = \delta_{jm} + \mu_{ijm} + \epsilon_{ijm}, \tag{16}$$

where $\delta_{jm} = X_j\beta + \xi_{jm}$ is the mean utility of choosing $j$ in market $m$, $\mu_{ijm}$ is the random substitution pattern specific to an individual (typically driven by random coefficients on good characteristics as in Example 2), and $\epsilon_{ijm}$ is the individual-specific logit error that is i.i.d. Often, this type of setting is used when only market-level data is available, and so the researcher observes the market shares of goods, but not the individual choices.[17]

[17] This is a common setting in many IO applications, where the researcher observes the market shares of goods, but not the individual choices. However, it's wonderful when you have more, and a host of papers using the micro data exist as well [Berry et al., 2004, Conlon and Gortmaker, 2023].

Under the standard logit distributional assumptions for $\epsilon_{ijm}$,

$$\Pr(Y_{im} = j \mid X) = s_{jm}(\delta_m, \theta) = \int \frac{\exp(\delta_{jm} + \mu_{ijm})}{\sum_{k \in J_m} \exp(\delta_{km} + \mu_{ikm})}\, f(\mu \mid \theta)\, d\mu_{im}. \tag{17}$$
The key insight in Berry et al. [1995] is that the vector of $\delta_{jm}$ in market $m$, $\delta_m$, can be inverted from the market shares, $s_m$, and $\theta$, the parameters of the random mixing coefficients. Once we know $\delta_m$, we can define $\xi_{jm} \equiv \delta_{jm} - X_j\beta$, and define a conditional moment condition $E(\xi_{jm} \mid Z_{jm}) = 0$. This moment condition can be used to estimate $\beta$ using GMM. Conlon and Gortmaker [2020] provide a very nice discussion of the algorithmic approach for how to do this, and provide a Python package to solve this problem.[18]

[18] Part of the reasoning for this is that the trick to invert the shares and recover $\delta_{jm}$ is a non-linear fixed point problem that needs to converge to a high degree of precision for successful estimation. Conlon and Gortmaker [2020] highlight the best approaches.
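A sketch of the Berry inversion itself: iterate the contraction $\delta \leftarrow \delta + \log(s^{obs}) - \log(s(\delta, \theta))$ to a tight fixed point (one market, a scalar random coefficient, and made-up values; the pyblp package handles the real thing):

```python
import numpy as np

def predicted_shares(delta, X, sigma, nu):
    """Mixed-logit market shares, averaging over random-coefficient draws nu."""
    v = delta[None, :] + sigma * nu[:, None] * X[None, :]   # (S, J)
    ev = np.exp(v)
    # The 1 + ... in the denominator is the outside good with utility normalized to 0
    return (ev / (1 + ev.sum(axis=1, keepdims=True))).mean(axis=0)

def berry_inversion(s_obs, X, sigma, nu, tol=1e-13, max_iter=50_000):
    delta = np.log(s_obs) - np.log(1 - s_obs.sum())  # plain-logit starting point
    for _ in range(max_iter):
        step = np.log(s_obs) - np.log(predicted_shares(delta, X, sigma, nu))
        delta = delta + step
        if np.abs(step).max() < tol:  # iterate to high precision, per the footnote
            break
    return delta

rng = np.random.default_rng(7)
X = rng.normal(size=3)                 # 3 inside goods, one characteristic each
nu = rng.normal(size=500)
delta_true = np.array([0.5, -0.2, 0.1])
s_obs = predicted_shares(delta_true, X, sigma=0.8, nu=nu)

print(berry_inversion(s_obs, X, sigma=0.8, nu=nu))  # recovers delta_true, given theta
```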
Conclusion

The underlying structure of discrete choice is valuable in IV settings. Much of this discussion centered on IO-style applications, but it also shows up when thinking about Roy-style models.[19] When we discuss instruments and individuals' choices to take up a policy or not, if the policy is multi-dimensional, these types of models play a huge role. Recall our discussion of propensity scores for treatment effects: if individuals choose between multiple treatment options, this maps directly into a discrete choice setting like what we've discussed today. Thinking carefully about the counterfactual pattern across choices will give guidance in more complicated IV settings.

[19] See Hull [2018] for an example.

There is also value in arbitraging IO methods into other fields. Many fields have discrete choice applications but have not adopted the tools. The cutting edge of IO tools is quite complex, but this type of structure is very valuable when thinking about complicated choice patterns. It is worthwhile to try to arbitrage these methods into fields that are less exposed to them (e.g. Koijen and Yogo [2019]).

References

Ghislain B. D. Aihounton and Arne Henningsen. Units of measurement and the inverse hyperbolic sine transformation. The Econometrics Journal, 24(2):334–351, 2021.

Marc F. Bellemare and Casey J. Wichman. Elasticities and the inverse hyperbolic sine transformation. Oxford Bulletin of Economics and Statistics, 82(1):50–61, 2020.

Steven Berry, James Levinsohn, and Ariel Pakes. Automobile prices in market equilibrium. Econometrica, 63(4):841–890, 1995.

Steven Berry, James Levinsohn, and Ariel Pakes. Differentiated products demand systems from a combination of micro and macro data: The new car market. Journal of Political Economy, 112(1):68–105, 2004.

Gary Chamberlain. Analysis of covariance with qualitative data. The Review of Economic Studies, 47(1):225–238, 1980.

Gary Chamberlain. Binary response models for panel data: Identification and information. Econometrica, 78(1):159–168, 2010.

Jiafeng Chen and Jonathan Roth. Logs with zeros? Some problems and solutions. The Quarterly Journal of Economics, page qjad054, 2023.

Raj Chetty, Nathaniel Hendren, Patrick Kline, and Emmanuel Saez. Where is the land of opportunity? The geography of intergenerational mobility in the United States. The Quarterly Journal of Economics, 129(4):1553–1623, 2014.

Jonathan B. Cohn, Zack Liu, and Malcolm I. Wardlaw. Count (and count-like) data in finance. Journal of Financial Economics, 146(2):529–551, 2022.

Christopher Conlon and Jeff Gortmaker. Best practices for differentiated products demand estimation with pyblp. The RAND Journal of Economics, 51(4):1108–1161, 2020.

Christopher Conlon and Jeff Gortmaker. Incorporating micro data into differentiated products demand estimation with pyblp. Technical report, NYU working paper, 2023.

John Geweke, Gautam Gowrisankaran, and Robert J. Town. Bayesian inference for hospital quality in a selection model. Econometrica, 71(4):1215–1238, 2003.

Pinelopi Koujianou Goldberg. Product differentiation and oligopoly in international markets: The case of the U.S. automobile industry. Econometrica: Journal of the Econometric Society, pages 891–951, 1995.

Peter Hull. Estimating hospital quality with quasi-experimental data. Available at SSRN 3118358, 2018.

Ralph S. J. Koijen and Motohiro Yogo. A demand system approach to asset pricing. Journal of Political Economy, 127(4):1475–1515, 2019.

Willard G. Manning and John Mullahy. Estimating log models: To transform or not to transform? Journal of Health Economics, 20(4):461–494, 2001. URL https://www.sciencedirect.com/science/article/pii/S0167629601000868.

Robert E. McCulloch, Nicholas G. Polson, and Peter E. Rossi. A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics, 99(1):173–193, 2000.

Daniel McFadden. Conditional logit analysis of qualitative choice behavior. 1972.

Daniel McFadden and Kenneth Train. Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5):447–470, 2000.

John Mullahy. Instrumental-variable estimation of count data models: Applications to models of cigarette smoking behavior. Review of Economics and Statistics, 79(4):586–593, 1997.

John Ashworth Nelder and Robert W. M. Wedderburn. Generalized linear models. Journal of the Royal Statistical Society Series A: Statistics in Society, 135(3):370–384, 1972.

Martin Ravallion. A concave log-like transformation allowing non-positive values. Economics Letters, 161:130–132, 2017.

Chieh-Hua Wen and Frank S. Koppelman. The generalized nested logit model. Transportation Research Part B: Methodological, 35(7):627–641, 2001.

Frank A. G. Windmeijer and João M. C. Santos Silva. Endogeneity in count data models: An application to demand for health care. Journal of Applied Econometrics, 12(3):281–294, 1997.
