SSRN 4508227
Abstract
Choice modeling is at the core of understanding how changes to the competitive landscape affect
consumer choices and reshape market equilibria. In this paper, we propose a fundamental characteri-
zation of choice functions that encompasses a wide variety of extant choice models. We demonstrate
how non-parametric estimators like neural nets can easily approximate such functionals and overcome
the curse of dimensionality that is inherent in the non-parametric estimation of choice functions. We
demonstrate through extensive simulations that our proposed functionals can flexibly capture underlying
consumer behavior in a completely data-driven fashion and outperform traditional parametric models.
As demand settings often exhibit endogenous features, we extend our framework to incorporate esti-
mation under endogenous features. Further, we also describe a formal inference procedure to construct
valid confidence intervals on objects of interest like price elasticity. Finally, to assess the practical appli-
cability of our estimator, we utilize a real-world dataset from Berry et al. (1995). Our empirical analysis
confirms that the estimator generates realistic and comparable own- and cross-price elasticities that are
consistent with the observations reported in the existing literature.
Keywords: Choice Models, Demand Estimation, Permutation Invariance, Set Functions, Neural net-
works
2 Literature Review
2.1 Parametric Discrete Choice Models
Discrete choice models play a crucial role in various fields, including economics, marketing, and operations
management, as they describe decision-making processes when individuals face multiple alternatives. These
models typically rely on a random utility maximization assumption, which can be traced back to Thurstone
(1927) and Marschak (1959).
Over time, various choice models have emerged under different specifications for the density of unob-
3 Theory
3.1 Choice Models
In this section, we provide a general characterization of consumer choice functions. In particular, we focus
on a scenario where researchers have access only to aggregate market-level demand data, while individual-
level choices and characteristics remain unobserved. Aggregate demand models have been extensively stud-
ied in marketing and economics (Berry et al., 1995; Besanko et al., 1998; Sudhir, 2001; Chintagunta, 2001;
Albuquerque and Bronnenberg, 2009; Compiani, 2022), and are particularly useful when individual-level
demand data is not available.
Suppose consumers in a market t face an offer set St that can comprise any subset of Jt distinct products
({1, 2, . . . , Jt }). We use uijt to represent the index tuple {Xjt , pjt , Iit , εijt }, where Xjt ∈ Cd denotes d
non-price features belonging to some countable universe Cd ; pjt ∈ C denotes the price of the product;
Iit ∈ Cl denotes the l demographic features of consumer i in market t, which belong to
some countable universe Cl ; and εijt denotes random idiosyncratic components pertinent to consumer i for
product j in market t that are unobservable to the researcher but observable to consumers.
Definition 1 (Choice Function). Given the offer set St ⊂ {1, 2, 3, . . . , Jt }, we define a function π : {uijt :
j ∈ St } → R|St | that maps a set of index tuples {uijt }j∈St to a |St |-dimensional probability vector. Each
element in the π(·) vector represents the probability of consumer i choosing product j in market t.
Here we present a very general characterization of choice functions that maps the observable and un-
observable components of product and individual characteristics to observed choices through some choice
function π. Note that, traditionally, uijt is a scalar that represents utility in choice models. However, in our
framework, uijt does not necessarily represent utility. Further, we have not yet imposed any assumption on
π, i.e., how consumers make choices.
We now specify a set of assumptions on the model and data-generating process below.
Assumption 1 (Exogeneity). The unobserved error term εijt is independent and identically distributed (i.i.d.)
across all products. This can be expressed as follows:
$$\pi_{ijt}(\{u_{ikt}\}_{k \in S_t}) = \pi_{ijt}(u_{ijt}, \{u_{ikt}\}_{k \in S_t, k \neq j}) = \pi(u_{ijt}, \{u_{ikt}\}_{k \in S_t, k \neq j})$$
This assumption implies two things: first, the functional form of the choice probability for different products
and markets is the same; second, for any market-level heterogeneity (e.g., in the distribution Ft (Iit , εijt )),
we can include them in Xjt as features. Intuitively, this assumption suggests that conditional on product and
consumer features and the unobserved error term, the choice probabilities are not functions of the identities
of the products themselves.
Assumption 3 (Permutation Invariance). The choice function π is invariant under any permutation function
σj () that rearranges the indices of the competitors of product j, such that:
In this assumption, we state that the choice function for product j is invariant to all permutations of its
competitors. This implies that the individual’s choice for product j is not affected by the order or identity of
the other products in the market, and it only depends on the set of competitors’ characteristics.
Since researchers only observe aggregate data, we next define the aggregate demand function. In aggre-
gate demand settings, individual-level choices are not observable and only aggregate demand is observable.
It is often the case that the market-specific individual features are not observable and are assumed to be
exogenously drawn from some distribution F(mt ), where mt represents the market-level characteristics.
For the sake of notational simplicity, we let mt be the same across all markets. One can easily incorporate
market-specific user demographics in the choice function. Thus the demand of product j in market t denoted
by πjt can be expressed as follows:
$$\pi_{jt} = \int\!\!\int \pi_{ijt}(\{u_{ikt}\}_{k \in S_t})\, dF(m_t)\, dG(\varepsilon_{ijt}), \tag{1}$$
where G(εijt ) denotes the CDF of unobserved errors εijt . Since uijt is determined by {Xjt , pjt , Iit , εijt }
and both Iit and εijt are integrated out within a market, we can express πjt as a function of only the observable
product characteristics:
$$\pi_{jt} = g(X_{jt}, p_{jt}, \{X_{kt}, p_{kt}\}_{k \in S_t, k \neq j}). \tag{2}$$
Lemma 1. For any choice function that satisfies Assumptions 1 and 3, the aggregate demand function is
permutation invariant.
This permutation invariance of the aggregate demand function exists because, under the exogeneity
Theorem 1. For any offer set St ⊂ {1, 2, 3, . . . , Jt }, if a choice function π : {uijt : j ∈ St } → R|St | where
uijt represents the index tuple {Xjt , pjt , Iit , εijt } satisfies Assumption 1, 2 and 3, then there exists suitable
ρ, ϕ1 and ϕ2 such that
$$\pi_{jt} = \rho\Big(\phi_1(X_{jt}, p_{jt}) + \sum_{k \neq j,\, k \in S_t} \phi_2(X_{kt}, p_{kt})\Big),$$
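The decomposition in Theorem 1 can be illustrated with a small numerical sketch. The code below is not the paper's implementation: the helper `mlp`, the layer widths, and the random (untrained) weights are all assumptions made for illustration. It only shows that any function of the form ρ(ϕ1(focal) + Σ ϕ2(competitor)) returns the same value when the competitors are reordered, and that it accepts offer sets of different sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Hypothetical helper: a ReLU network with random (untrained) weights."""
    weights = [(rng.normal(size=(m, n)) / np.sqrt(m), np.zeros(n))
               for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(x):
        for i, (W, b) in enumerate(weights):
            x = x @ W + b
            if i < len(weights) - 1:
                x = np.maximum(x, 0.0)  # ReLU on hidden layers only
        return x

    return forward

d_in, d_embed = 3, 8                 # assumed: price + 2 features, embedding width 8
phi1 = mlp([d_in, 16, d_embed])      # embeds the focal product
phi2 = mlp([d_in, 16, d_embed])      # embeds each competitor with shared weights
rho = mlp([d_embed, 16, 1])          # maps the pooled embedding to a scalar

def demand(focal, competitors):
    """g = rho(phi1(focal) + sum over competitors of phi2(competitor))."""
    pooled = phi1(focal) + phi2(competitors).sum(axis=0)
    return rho(pooled)[0]

focal = rng.normal(size=d_in)
competitors = rng.normal(size=(5, d_in))
shuffled = competitors[rng.permutation(5)]   # same competitors, different order

print(demand(focal, competitors), demand(focal, shuffled))
```

Because the competitors enter only through a sum, the number of inputs to ρ does not grow with the number of products, which is the property the paper relies on to avoid the curse of dimensionality.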
At this point, no specific assumptions are made regarding the function γ. However, in the subsequent
inference section, we will discuss that the estimator of γ must be estimable at the n^{-1/2} rate in order to construct
valid confidence intervals. Next, to address the issue of price endogeneity, we impose a mild restriction on
the space of choice functions we consider.
Assumption 4 (Linear Separability). The unobserved product characteristics can be expressed as the sum of
an endogenous (CF) and exogenous component
$$\pi_{jt} = \rho\Big(\phi_1(X_{jt}, p_{jt}, \mu_{jt}(\gamma_0)) + \sum_{k \neq j,\, k \in S_t} \phi_2(X_{kt}, p_{kt}, \mu_{kt}(\gamma_0))\Big),$$
The result follows straightforwardly from the observation that after controlling for CF (µjt ; λ) the un-
observable component ε̃ is exogenous. This implies the aggregate demand function is invariant under any
permutation applied to the competitors of product j. The result demonstrates that endogeneity can be ad-
dressed by using the residuals from Equation (4) as an additional set of features along with observable
product characteristics.
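The control-function idea described above can be sketched as follows. This is a minimal illustration, assuming a linear first stage estimated by ordinary least squares; the first-stage specification in Equation (4) is not reproduced here, and the simulated instruments, coefficients, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical data: instruments Z (with intercept) and an unobserved demand shock xi
Z = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
xi = rng.normal(size=n)
price = Z @ np.array([1.0, 0.5, -0.3]) + 0.8 * xi + rng.normal(scale=0.1, size=n)

# First stage: regress the endogenous price on the instruments
gamma_hat, *_ = np.linalg.lstsq(Z, price, rcond=None)
mu_hat = price - Z @ gamma_hat        # residual used as the control function

# Second-stage inputs: observed characteristics plus the first-stage residual
X = rng.normal(size=(n, 2))           # hypothetical non-price features
features = np.column_stack([price, X, mu_hat])
print(features.shape)
```

In this simulated example the residual is strongly correlated with the unobserved shock, which is why adding it as a feature absorbs the endogenous part of the price variation.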
3.3 Inference
This paper aims to estimate choice functions flexibly using non-parametric estimators. However, often in
social science contexts, researchers and managers are also interested in conducting inference over some
economic objects. Note that because non-parametric regression functions are estimated at a slower rate
compared to parametric regressions, it is often infeasible to construct confidence intervals directly on the
estimated π̂. However, it is generally possible to perform inference and construct valid confidence intervals
for specific economic objects that are functions of π. In this section, we will provide an example of one such
important economic object and demonstrate how to construct valid confidence intervals for it. This will be
done by leveraging the recent advances in automatic debiased machine learning as shown in the works of
Ichimura and Newey (2022); Chernozhukov et al. (2022b,a, 2021), and others. However, unlike existing
automatic debiased machine learning setups we also have to account for an additional first-stage estimator
γ̂.
In demand estimation, researchers are often interested in estimating the average effect of a price change
on the demand for a product, as it can significantly influence market dynamics, pricing strategies, and
regulatory decisions. To proceed with our analysis, let wjt = (yjt , pjt , Xjt , {pkt , Xkt }k̸=j ) and zjt =
(pjt , Xjt , {pkt , Xkt }k̸=j ) represent the variables associated with product j in market t. Here, pjt ∈ C
denotes the observed prices, Xjt ∈ Cd−1 represents other product characteristics, and yjt ∈ R refers to
the observed demand for product j in market t, such as market shares or log shares. Note that either the
observed price (pjt ) or other characteristics (Xjt ) could be endogenous. For simplicity and without loss of
generality, we focus on pjt as the endogenous variable in the following analysis.
The average effect of a price change2 can be expressed as the difference between the demand function
πjt (·; γ) evaluated at the original price pjt and at the price incremented by ∆pjt , given by the following
expression:
$$m(w_{jt}, \pi(\cdot;\gamma)) = \pi(p_{jt} + \Delta p_{jt}, X_{jt}, \{p_{kt}, X_{kt}\}_{k \neq j}; \gamma) - \pi(p_{jt}, X_{jt}, \{p_{kt}, X_{kt}\}_{k \neq j}; \gamma).$$
2 The expression for the average effect of a price change can be adapted to represent average price elasticity by placing the
known and fixed value of ∆pjt in the denominator.
$$\theta_0 = E[m(w_{jt}, \pi(\cdot;\gamma))] = E[\pi(p_{jt} + \Delta p_{jt}, X_{jt}, \{p_{kt}, X_{kt}\}_{k \neq j}; \gamma) - \pi(p_{jt}, X_{jt}, \{p_{kt}, X_{kt}\}_{k \neq j}; \gamma)].$$
In summary, the average effect of a price change on demand, denoted by θ0 , is calculated by evaluating
the difference between the demand function at the original price and at the price incremented by ∆pjt , and
then computing the expected value of this difference.
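As a concrete illustration of the functional m and the average effect θ0, the sketch below evaluates the plug-in average for a toy demand function. The logit form of `logit_share`, the price coefficient, and the simulated prices are assumptions standing in for an estimated π̂; they are not taken from the paper.

```python
import numpy as np

def logit_share(p_focal, p_comp, alpha=-1.0):
    """Toy stand-in for pi: logit share of the focal product with an outside option."""
    num = np.exp(alpha * p_focal)
    return num / (1.0 + num + np.exp(alpha * p_comp).sum(axis=-1))

def moment(pi, p_focal, p_comp, delta):
    """m(w, pi) = pi(p + delta, competitors) - pi(p, competitors)."""
    return pi(p_focal + delta, p_comp) - pi(p_focal, p_comp)

rng = np.random.default_rng(2)
p_focal = rng.uniform(1, 2, size=200)          # focal prices, one per observation
p_comp = rng.uniform(1, 2, size=(200, 4))      # four competitors per observation
delta = 0.01 * p_focal                         # a 1% price increase

theta_hat = moment(logit_share, p_focal, p_comp, delta).mean()
print(theta_hat)
```

With a negative price coefficient the average effect is negative, since raising the focal price lowers its share while competitors' prices are held fixed.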
In practice, we estimate θ0 by computing its empirical analog using the estimated demand function π̂
and first-stage estimator γ̂, i.e.,
$$\hat\theta = \frac{1}{n} \sum_{t=1}^{n} m(w_{jt}, \hat\pi(z_{jt}; \hat\gamma)), \tag{6}$$
where n is the number of observations. When parametric methods are employed to estimate π̂ and γ̂, the
estimator θ̂ is generally √n-consistent, assuming that the model is correctly specified. However,
√n-consistency may not hold when non-parametric estimators are used, particularly if the first-order bias does
not vanish at the √n rate. Irrespective of the method used to estimate π, this is often the case, as flexible
estimation of π always requires some form of regularization and/or model selection. Debiasing techniques
are required to mitigate the effects of regularization and/or model selection when learning flexible demand
models. These approaches can help improve the performance of the estimator and facilitate valid inference
with θ̂. We therefore adapt debiasing techniques developed in the recent automatic debiased machine
learning literature (see Chernozhukov et al. (2022b)). Specifically, we will focus on problems where there
exists a square-integrable random variable α0 (z) such that ∀ ||γ − γ0 || small enough –
By the Riesz representation theorem, the existence of such α0 (zjt ) is equivalent to E[m(wjt , π(zjt ; γ))]
being a mean square continuous functional of π(zjt ; γ). Henceforth, we refer to α0 (z) as Riesz representer
(or RR). Newey (1994) shows that the mean square continuity of E[m(wjt , πjt (zjt ; γ))] is equivalent to
the semiparametric efficiency bound of θ0 being finite. Thus, our approach focuses on regular functionals.
Similar uses of the Riesz representation theorem can be found in Ai and Chen (2007), Ackerberg et al.
(2014), Hirshberg and Wager (2020), and Chernozhukov et al. (2022b) among others. The debiasing term
in this case takes the form α(zjt )(yjt − π(zjt ; γ)). To see that, consider the score m(wjt , π(zjt ; γ)) +
α(zjt )(yjt − π(zjt ; γ)) − θ0 . It satisfies the following mixed bias property:
This property implies double robustness (Robins et al., 1994; Funk et al., 2011) of the score. That is, if either
3
We assume the data reflects the true population.
The mixed bias property implies that the bias of this estimator will vanish at a rate equal to the product of
the mean-square convergence rates of α̂ and π̂. Therefore, in cases where the demand function π can be
estimated very well, the rate requirements on α̂ will be less strict, and vice versa. More notably, whenever
the product of the mean-square convergence rates of α̂ and π̂ is faster than n^{-1/2}, we have that √n(θ̂ − θ0)
converges in distribution to the centered normal law N(0, E[ψ0(wjt)^2]), where
as proven formally in Theorem 3 of Chernozhukov et al. (2022b). Results in Newey (1994) imply that
E[ψ0(wjt)^2] is a semiparametric efficient variance bound for θ0 , and therefore the estimator achieves this
bound.
Theorem 3. [Chernozhukov et al. (2021)] One can view the Riesz representer as the minimizer of the loss
function:
$$\alpha_0 = \arg\min_{\alpha} E\big[(\alpha(z_{jt}) - \alpha_0(z_{jt}))^2\big]$$
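The steps that turn this loss into one that can be computed without knowing α0 can be sketched as follows. This is the standard expansion, written here under the Riesz identity E[m(w; α)] = E[α0(z) α(z)]: the term α0(z)² does not depend on α and is dropped in the second line, and the identity is applied in the third.

```latex
\begin{align*}
\alpha_0 &= \arg\min_{\alpha} E\big[(\alpha(z_{jt}) - \alpha_0(z_{jt}))^2\big] \\
         &= \arg\min_{\alpha} E\big[\alpha(z_{jt})^2 - 2\,\alpha_0(z_{jt})\,\alpha(z_{jt})\big] \\
         &= \arg\min_{\alpha} E\big[\alpha(z_{jt})^2 - 2\, m(w_{jt}; \alpha)\big].
\end{align*}
```

The last line involves only α and the known functional m, which is the form used for the empirical loss in Equation (8).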
In our earlier discussions, we employed the moment function of π, whereas in Theorem 3, we focus
on the moment function of α. This shift is justified by the Riesz Representation Theorem, which implies
E[m(wjt ; π)] = E[α0 (zjt )π(zjt )]. Given that π can represent any function, substituting α for π is permis-
sible, thereby validating the transition from the second to the third line in Theorem 3. We use the above
theorem to flexibly estimate the RR. The advantage of this approach is that it eliminates the need to derive
an analytical form for the RR estimator, allowing it to be addressed as a simple computational optimization
problem.
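To make the optimization view concrete, the sketch below minimizes the sample analog of E[α(z)² − 2m(w; α)] over a linear sieve α(z) = b(z)'ρ, for which the minimizer has a closed form. The paper uses neural networks rather than a linear sieve; the polynomial dictionary `basis`, the simulated data, and the price-shift functional are assumptions chosen so that the example stays short.

```python
import numpy as np

def basis(p, x):
    """Hypothetical polynomial dictionary b(z) in price p and one feature x."""
    return np.column_stack([np.ones_like(p), p, x, p * x, p**2, x**2])

rng = np.random.default_rng(3)
n = 1000
p = rng.uniform(1, 2, size=n)
x = rng.normal(size=n)
delta = 0.01 * p                              # 1% price increase

B = basis(p, x)
M = (basis(p + delta, x) - B).mean(axis=0)    # sample analog of E[m(w; b)]
G = B.T @ B / n                               # sample analog of E[b b']

# With alpha = b'rho the loss is rho'G rho - 2 M'rho, minimized at rho = G^{-1} M
rho_hat = np.linalg.solve(G, M)
alpha_hat = B @ rho_hat                       # estimated Riesz representer at each z

def riesz_loss(rho):
    """Sample analog of E[alpha(z)^2 - 2 m(w; alpha)] for alpha = b'rho."""
    return rho @ G @ rho - 2 * M @ rho

print(riesz_loss(rho_hat))
```

For any function f in the span of the dictionary, the sample average of m(w; f) equals the sample average of α̂(z) f(z), which is the defining property of the Riesz representer restricted to that span.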
Theorem 4. [Chernozhukov et al. (2021)] Let δn be an upper bound on the critical radius (Wainwright
(2019)) of the function spaces:
and suppose that for all f in the spaces above: ∥f ∥∞ ≤ 1. Suppose, furthermore, that m satisfies the
for all α, α′ ∈ An and some M ≥ 1. Then for some universal constant C, we have that w.p. 1 − ζ :
$$\|\hat\alpha - \alpha_0\|_2^2 \le C\Big(\delta_n^2 M + \frac{M \log(1/\zeta)}{n} + \inf_{\alpha^* \in \mathcal{A}_n} \|\alpha^* - \alpha_0\|_2^2\Big)$$
The critical radius has been widely studied in various function spaces, such as high-dimensional linear
functions, neural networks, and shallow regression trees, often showing δn = O(dn n^{-1/2}), where dn
represents the effective dimension of the hypothesis space (Chernozhukov et al. (2021)). In our research,
we apply Theorem 3 to neural networks.
To that end, we make the following assumptions.
Assumption 5. (i) α0 (z) is bounded, (ii) ∀ ||γ − γ0 || small enough, E[(y − π0 (zjt ; γ))2 |zjt ] is bounded, and
(iii) E[m(wjt , π0 (zjt ; γ0 ))2 ] < ∞.
These assumptions are standard regularity conditions used in the automatic debiased machine learning literature.
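Assumption 6. (i) ∀ ||γ − γ0 || small enough, $\|\hat\pi(\cdot;\gamma) - \pi_0(\cdot;\gamma)\| \xrightarrow{p} 0$ and $\|\hat\alpha - \alpha_0\| \xrightarrow{p} 0$; (ii) $\sqrt{n}\,\|\hat\alpha - \alpha_0\|\,\|\hat\pi(\cdot;\gamma) - \pi_0(\cdot;\gamma)\| \xrightarrow{p} 0$; (iii) α̂ is bounded; (iv) $\sqrt{n}\,\|\hat\gamma - \gamma_0\| \xrightarrow{p} 0$.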
Intuitively, these assumptions mean the following. First, the estimators of both π and α should be consistent for values
of γ in a close enough neighborhood of γ0 . Second, the product of the mean-square errors of α̂
and π̂ should vanish faster than the n^{-1/2} rate. This can be achieved if both terms converge
faster than the n^{-1/4} rate. Finally, we also assume that the first-stage estimator γ̂ is estimable at the n^{-1/2} rate. This
limits the class of functions one can use to estimate γ.
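The link between the product condition in Assumption 6(ii) and the individual rates can be made explicit with a short calculation. The polynomial rates a and b below are illustrative assumptions, not quantities defined in the paper.

```latex
\|\hat\alpha - \alpha_0\| = O_p(n^{-a}), \quad \|\hat\pi - \pi_0\| = O_p(n^{-b})
\;\Longrightarrow\;
\sqrt{n}\,\|\hat\alpha - \alpha_0\|\,\|\hat\pi - \pi_0\| = O_p\!\left(n^{1/2 - a - b}\right)
\xrightarrow{p} 0 \quad \text{if } a + b > \tfrac{1}{2}.
```

Under these assumed rates the condition holds, for example, when each estimator converges slightly faster than n^{-1/4}, or when one converges slowly and the other compensates, which is the trade-off between α̂ and π̂ noted above.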
Assumption 7. m(w, π) is linear in π and there is C > 0 such that
We show the proof in Web Appendix B. This theorem shows that if γ̂ is estimable at a fast enough
rate one can still construct valid confidence intervals for θ̂. This result can be shown following similar
arguments as in Chernozhukov et al. (2022a). Finally, we note that while the above arguments focus on the
estimation of the average effect of a price change on demand, we can follow the same arguments to derive
inference results for other economic quantities of interest, e.g., the effect of changing some product features
on demand.
• Stage 0 (Data partition): We randomly split the observed markets into L folds such that the data
Dl := {yt , zt }t∈l , where l denotes the lth partition. Note that all the observations for one market are
always in one fold.
• Stage 1 (Estimate γ̂): For each fold l, we estimate γl by regressing the endogenous variable on the
exogenous instruments on the left-out data Dlc := {yt , zt }t∉l . We then use the cross-fitting technique,
as in Chernozhukov et al. (2021), to calculate the residual µ̂l of fold l using γ̂l estimated on Dlc .
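– Stage 2a (Estimation): In the second stage, for each fold l, we estimate both the choice function
(π̂) and the Riesz estimator (α̂) on the left-out data Dlc := {yt , zt }t∉l :
$$\hat\pi_l = \arg\min_{\pi \in \mathcal{F}} \frac{1}{\sum_{t \in D_l^c} J_t} \sum_{t \in D_l^c} \sum_{j \in J_t} \big[(y_{jt} - \pi(z_{jt}; \hat\gamma))^2\big] \tag{7}$$
$$\hat\alpha_l = \arg\min_{\alpha \in \mathcal{A}} \frac{1}{\sum_{t \in D_l^c} J_t} \sum_{t \in D_l^c} \sum_{j \in J_t} \big[\alpha(z_{jt})^2 - 2\, m(w_{jt}; \alpha)\big]. \tag{8}$$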
Based on Theorem 2, instead of directly estimating the function π, we decompose the estimation
into three sub-components: ρ, ϕ1 , and ϕ2 . Specifically, for each component of our model (ϕ1 ,
ϕ2 , and ρ), we use a standard 3-layer neural network4 , and this is implemented without further
hyperparameter tuning. We implement ReLU activation function at each layer as it is standard
in feedforward designs due to its simplicity and computational efficiency in gradients. Figure 1 illustrates
how we pass the data to the neural networks. Specifically, we pass the focal product’s
characteristics (price pjt and other product features Xjt ) and the residuals (µ̂jt ) estimated from
the first-stage regression to ϕ1 . In parallel, we pass the characteristics of each of the
other products in the same market (pkt and Xkt ) and the corresponding residuals (µ̂kt ) to the
same ϕ2 , and then sum the outputs. The outputs of ϕ1 and ϕ2 have the same dimension
(e.g., a 64-dimensional vector). Next, we pass the sum of the outputs of ϕ1 and ϕ2 to a third
neural network ρ. The output of ρ is a scalar that represents the market share of the focal
product j in market t.
4
For ϕ1 and ϕ2 , each of the three layers consists of 64 neurons, with the output vector also featuring 64 neurons. For ρ, the
layers are configured with 300, 100, and 64 neurons for the first, second, and third layers, respectively.
We use the same neural network structure (with the three sub-components same as ϕ1 , ϕ2 , and
ρ) to estimate α. The only difference is that the loss function of α is not based on the difference
between the observed and the predicted market share as in Equation 7. Instead, the loss function
is based on the squared difference between α and the moment function of α as stated in Equation
8 and Theorem 3.
– Stage 2b (Cross-fitting): Now we again use the cross-fitting technique to reduce the bias when
estimating θ̂. Specifically, we use the estimators (π̂ and α̂) estimated on Dlc to estimate the
θ̂l of fold l. By applying cross-fitting, we ensure that the nuisance functions and the parameters are
estimated on separate, non-overlapping datasets. This approach diminishes the risk of overfitting
and enhances the robustness of our estimation. And finally, to estimate θ̂, we randomly select
one observation t∗ in each market t and average it out across all folds. Thus the estimator for θ0
and its variance can be given as follows:
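$$\hat\theta = \frac{1}{n} \sum_{\ell=1}^{L} \sum_{t \in D_\ell^c} \big\{ m(w_{t^*}, \hat\pi_\ell) + \hat\alpha_\ell(z_{t^*}) (y_{t^*} - \hat\pi_\ell(z_{t^*}; \hat\gamma)) \big\} \tag{9}$$
$$\hat V = \frac{1}{n} \sum_{\ell=1}^{L} \sum_{t \in D_\ell^c} \hat\psi_{t^*\ell}^2, \qquad \hat\psi_{t^*\ell} = m(w_{t^*}, \hat\pi_\ell) - \hat\theta + \hat\alpha_\ell(z_{t^*}) (y_{t^*} - \hat\pi_\ell(z_{t^*}; \hat\gamma)) \tag{10}$$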
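Two of the steps above, the market-level data partition in Stage 0 and the debiased average with its variance in Equations (9) and (10), can be sketched as follows. This is a simplified illustration rather than the paper's code: the function names are hypothetical, the inputs to `debiased_estimate` are assumed to be out-of-fold predictions that have already been computed, and the placeholder arrays are random numbers, not model output.

```python
import numpy as np

def market_folds(market_ids, n_folds, seed=0):
    """Stage 0: assign whole markets to folds so no market is split across folds."""
    rng = np.random.default_rng(seed)
    markets = rng.permutation(np.unique(market_ids))
    fold_of_market = {m: i % n_folds for i, m in enumerate(markets)}
    return np.array([fold_of_market[m] for m in market_ids])

def debiased_estimate(m_hat, alpha_hat, y, pi_hat, z=1.96):
    """Average of m + alpha * (y - pi), its plug-in variance, and a normal CI."""
    score = m_hat + alpha_hat * (y - pi_hat)
    theta = score.mean()
    var = ((score - theta) ** 2).mean()          # mean of squared psi-hat terms
    half_width = z * np.sqrt(var / score.size)
    return theta, var, (theta - half_width, theta + half_width)

# Hypothetical example: 20 markets with 5 products each, split into 4 folds
market_ids = np.repeat(np.arange(20), 5)
folds = market_folds(market_ids, n_folds=4)

rng = np.random.default_rng(4)
n = market_ids.size
m_hat = rng.normal(-0.01, 0.002, size=n)         # placeholder for m(w, pi_hat)
alpha_hat = rng.normal(size=n)                   # placeholder Riesz representer values
y = rng.uniform(0.05, 0.15, size=n)              # placeholder observed demand
pi_hat = y + rng.normal(0, 0.01, size=n)         # placeholder predicted demand

theta, var, ci = debiased_estimate(m_hat, alpha_hat, y, pi_hat)
print(theta, ci)
```

Splitting by market rather than by row keeps all products of a market in the same fold, so the competitors fed to ϕ2 never straddle the training and evaluation data.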
5 Numerical Experiments
We now present a series of simulation studies that establish the numerical performance of our approach.
First, in §5.1, we examine the predictive performance of our model on a series of models, including stylized
discrete choice models with linear utilities as well as more general models that allow non-linear utilities
We first consider the two standard Data Generating Processes (DGPs) used in the demand estimation lit-
erature that use linear utility-based choice models – (1) Multinomial Logit model, and (2) the Random
Coefficient Logit model. For both cases, we consider a setting with 10 products (J = 10), 100 markets
(T = 100), one price feature, and 10 non-price features (d = 10). We define the utility uijt that consumer i
in market t derives from product j as the following linear function:
$$u_{ijt} = \alpha_i p_{jt} + \beta_i X_{jt} + \varepsilon_{ijt}, \tag{11}$$
where εijt represents an independently and identically distributed (iid) Type-I extreme value error across products
and consumers. Xjt ∈ Cd denotes the non-price features of the product. αi , βi are the model coefficients,
which are kept constant for all consumers in the MNL, while in the RCL, they are normally distributed across
consumers. The probability distribution of features and coefficients used are shown in Web Appendix D.
Also, the mean utility from the outside option is normalized to 0.
We denote the market share of product j in market t generated from MNL by $\pi^{MNL}_{jt}$ and the market
share generated from RCL by $\pi^{RCL}_{jt}$. For each market, we generate the market shares of each product by
simulating N = 10,000 individual choices and aggregating by each market as shown below.
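$$\pi^{MNL}_{jt} = \frac{1}{N} \sum_{i}^{N} \mathbb{1}\big(u_{ijt} = \max_{k \in S_t}(u_{ikt})\big) \tag{12}$$
$$\pi^{RCL}_{jt} = \frac{1}{N} \sum_{i}^{N} \frac{\exp(\alpha_i p_{jt} + \beta_i X_{jt})}{1 + \sum_{k \in S_t} \exp(\alpha_i p_{kt} + \beta_i X_{kt})} \tag{13}$$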
Note that for MNL, instead of simulating each individual’s choice probability, we simulate each individual’s
choice based on the utility maximization principle. This approach ensures that when we use MNL (true
model) for estimation, it does not reproduce the data perfectly.
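The two simulation schemes in Equations (12) and (13) can be sketched for a single market as follows. The feature distributions and coefficient values below are placeholders (the ones used in the paper are reported in its Web Appendix D and are not reproduced here), and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
J, d, N = 10, 10, 10_000                  # products, non-price features, consumers

price = rng.uniform(0.5, 1.5, size=J)     # placeholder feature distributions
X = rng.uniform(0, 1, size=(J, d))

def mnl_shares(alpha, beta):
    """Equation (12): simulate utility-maximizing choices, then aggregate."""
    eps = rng.gumbel(size=(N, J + 1))     # Type-I extreme value; column 0 = outside option
    v = np.concatenate([[0.0], alpha * price + X @ beta])   # outside mean utility is 0
    choice = np.argmax(v + eps, axis=1)
    return np.bincount(choice, minlength=J + 1)[1:] / N

def rcl_shares(alpha_i, beta_i):
    """Equation (13): average individual logit probabilities over consumers."""
    v = np.exp(alpha_i[:, None] * price + beta_i @ X.T)     # shape (N, J)
    return (v / (1.0 + v.sum(axis=1, keepdims=True))).mean(axis=0)

beta = np.full(d, 0.1)
s_mnl = mnl_shares(alpha=-1.0, beta=beta)
s_rcl = rcl_shares(rng.normal(-1.0, 0.5, size=N),
                   rng.normal(0.1, 0.05, size=(N, d)))
print(s_mnl.round(3), s_rcl.round(3))
```

Because the MNL shares come from simulated choices rather than closed-form probabilities, they carry sampling noise of order N^{-1/2}, which is why a correctly specified MNL does not fit them perfectly.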
For each DGP, we split the generated data into training data (80%) and test data (20%). We use the
training data for estimation (both our model and the benchmark models described above).7 For the predicted
market share, we present all the model results and comparisons on the test data. For the predicted own- and
cross-elasticities, we present all the model results and comparisons on the training data.8
Table 2a shows the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) in the predicted
market share (π̂jt ) for our approach as well as the baseline models. We see that when the true model is MNL,
7
In our simulations, we train both our model and NP with the log market shares (log(πjt )). The performance metrics reported
for predicted market shares are computed based on the exponential values of the predicted log market shares, bringing these metrics
back to market shares (πjt ). The performance metrics reported for elasticities are computed based on the relative change of the
predicted log market shares (∂log(πjt )) divided by the percentage change of the price (∂pjt /pjt for own-elasticity and ∂pkt /pkt
for cross-elasticity), which is equivalent to the elasticity calculated directly using the market share ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{jt} / p_{jt}}$
for own-elasticity and $\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{kt} / p_{kt}}$, k ≠ j, for cross-elasticity).
8
The reason we only use test data to report predicted market share accuracy is to demonstrate the model’s predictive perfor-
mance on unseen data. In contrast, we use training data to report accuracy in elasticities to mimic the real empirical setting where
we use full data to estimate elasticity.
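The elasticity calculation described in footnote 7 can be sketched with a finite difference on log shares. The function `log_share` below is a plain logit used only as a stand-in for a model trained on log market shares; the step size and prices are illustrative assumptions.

```python
import numpy as np

def log_share(prices, j, alpha=-1.0):
    """Toy stand-in for a model that predicts log market shares (plain logit)."""
    v = alpha * prices
    return v[j] - np.log1p(np.exp(v).sum())

def elasticity(log_pi, prices, j, k, rel_step=1e-4):
    """d log(pi_j) / d log(p_k): own-elasticity if k == j, cross-elasticity otherwise."""
    bumped = prices.copy()
    bumped[k] *= 1.0 + rel_step                 # small percentage change in p_k
    return (log_pi(bumped, j) - log_pi(prices, j)) / np.log1p(rel_step)

prices = np.array([1.0, 1.5, 2.0])
own = elasticity(log_share, prices, j=0, k=0)
cross = elasticity(log_share, prices, j=0, k=1)
print(own, cross)
```

For the logit stand-in these values can be checked against the textbook formulas, own-elasticity α p_j (1 − π_j) and cross-elasticity −α p_k π_k, which the finite difference reproduces up to the step size.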
(b) Own-Elasticity ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{jt} / p_{jt}}$)
(c) Cross-Elasticity ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{kt} / p_{kt}}$, k ≠ j)
In the previous section, we focused on linear utility specifications and standard choice behaviors. Recent
literature has highlighted that the oversight of non-linear relationships between features and utilities can
introduce biases in the estimates (Allenby et al., 2004). Conversely, non-parametric estimators are adept
at capturing these non-linear patterns directly from the data. As a result, there has been a growing trend
towards the adoption of non-linear utility functions. Therefore, we now focus on data generated from a
random coefficient logit model with non-linear transformations applied to observable features. We consider
a case with two product features – price and a non-price feature x. We apply a non-linear transformation
g(x). Following Bakhitov and Singh (2022), we consider two functions for g(x):
The log transformation is common in empirical studies, to capture a diminishing sensitivity of a feature
on the market share. The sine transformation captures periodic or cyclical effects. For example, when the
feature represents the time of year, normalized from 0 to 1, then using sine transformation can effectively
capture the seasonal variations in consumer preferences.
The utility that consumer i in market t derives from product j then has the following non-linear form:
The market shares then follow a similar structure to that from Equation (13).
10
In an extreme case, when the variance of random coefficients is zero, RCL is equivalent to MNL.
Table 3: Predictive Performance in Non-Linear Utility: Market Shares, Own-Elasticity, and Cross-Elasticity
(b) Own-Elasticity ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{jt} / p_{jt}}$)
(c) Cross-Elasticity ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{kt} / p_{kt}}$, k ≠ j)
Finally, we consider a scenario where consumers do not pay attention to all the products and/or are not fully
informed of all the alternatives in the choice set. Recent literature has shown that this is often the case in
many empirical settings (Goeree, 2008; Gabaix, 2019; Honka et al., 2019; Abaluck et al., 2020; Compiani,
2022). However, such cases violate a standard assumption of the choice model: that consumers are informed
and consider all options when they make purchase decisions. In some parametric models (e.g., Van Nierop
$$\frac{1}{1 + p_{ht}} \cdot \frac{\exp(\alpha_i p_{ht})}{\sum_{k} \exp(\alpha_i p_{kt})}.$$
We first present detailed results on the own- and cross-elasticity for the two-product case in Figure 2.
We consider the number of markets to be 1,000 so that we can observe enough variance in our data. Figure
2a shows how the estimated own-elasticity for the highest-priced product (which is ignored by a subset of
consumers) varies over a range of prices. Note that in our model, when the price is higher, the portion of inattentive
consumers is higher. Thus, when we change the price, the change in market share is smaller than in the case
without inattention. While our model and the fully NP model are able to capture this pattern, both the
parametric models (MNL and RCL) are unable to do so. Figure 2b shows how the estimated cross-elasticity
of the other product varies with the price of the highest-priced product. Similarly, because they ignore
inattention, both MNL and RCL overestimate the magnitude of the elasticity of the other product. In
contrast, our model and the fully NP model capture the true cross-price elasticity and are close to the
true model.
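The inattention mechanism discussed above can be sketched as a mixture of two logit models. The mixture form, the presence of an outside option with utility zero, and the price coefficient are assumptions made for illustration; the only element taken from the text is that a fraction 1 − 1/(1 + p_h) of consumers ignores the highest-priced product h.

```python
import numpy as np

def shares_with_inattention(prices, alpha=-1.0):
    """Logit shares when a fraction 1 - 1/(1 + p_h) of consumers ignores
    the highest-priced product h (assumed mixture form)."""
    h = np.argmax(prices)
    attentive = 1.0 / (1.0 + prices[h])        # share of consumers who consider h
    v = np.exp(alpha * prices)
    full = v / (1.0 + v.sum())                 # attentive consumers see every product
    v_wo = v.copy()
    v_wo[h] = 0.0                              # inattentive consumers never pick h
    restricted = v_wo / (1.0 + v_wo.sum())
    return attentive * full + (1.0 - attentive) * restricted

prices = np.array([1.0, 2.0])
shares = shares_with_inattention(prices)
print(shares)
```

In this sketch a price increase for product h changes its share through two channels, the usual utility effect and the shrinking attentive fraction, which is a pattern a standard MNL or RCL specification has no term to represent.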
Next, we show a more comprehensive set of results for all three metrics (market-shares, own-, and
cross-price elasticities) when there are more products (2, 5, and 10) and fewer markets (100) in Tables 4a,
4b, and 4c. We find that our approach consistently outperforms RCL and MNL, as expected. Further,
the performance of our model improves as the number of products increases while that of the NP model
monotonically decreases with the number of products (for the reasons discussed in §5.1.1).
In summary, we find that our approach adapts well even as the underlying model of consumer behavior
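(b) Cross-elasticity ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{kt} / p_{kt}}$, k ≠ j) in consumer inattention
(b) Own-Elasticity ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{jt} / p_{jt}}$)
(c) Cross-Elasticity ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{kt} / p_{kt}}$, k ≠ j)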
Note: These tables present the MAE and RMSE for predicted market shares, own-elasticity, and cross-elasticity under consumer
inattention across scenarios with 2, 5, and 10 products. The market share and elasticity are simulated assuming a portion of
consumers ($1 - \frac{1}{1 + p_{j_{ht}}}$, where $j_{ht}$ is the index of the highest-price product in market t) ignore the product with the highest price. Each
scenario is fixed at 100 markets and 1 feature (price) to maintain consistency with the RCL baseline.
confidence intervals of this effect. The difference between this section and §5.1.1 lies in both the estimators
and the methods. In terms of estimators, in §5.1.1, we predict the market share (π̂jt ), own-elasticity
($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{jt} / p_{jt}}$) and cross-elasticity ($\frac{\partial \hat\pi_{jt} / \pi_{jt}}{\partial p_{kt} / p_{kt}}$, k ≠ j) for individual products. In contrast, the object of interest in this
section is the average effect of price on demand across all products (θ̂). As a result, in §5.1.1, we did not use
the debiasing techniques that we apply here. It is important to emphasize that, in our approach, constructing
a confidence interval is viable only for aggregate measures, not for individual observations.
To simulate the data, we consider a random coefficient logit model of demand with 3 products across
100 markets. We set the true model parameters to be βik ∼ N (1, 0.5), αi ∼ N (−1, 0.5). The effect of a
1% increase in a product’s price is given by
θ0 = E[m(wjt , π)] = E[π(pjt ∗ (1.01), Xjt , {pkt , Xkt }k̸=j ) − π(pjt , Xjt , {pkt , Xkt }k̸=j )],
As discussed earlier, one way to estimate this effect is to compute the sample analog of this using the
estimated π̂, such that $\hat\theta = \frac{1}{n} \sum_{i=1}^{n} m(w_{jt}, \hat\pi)$. However, as we pointed out earlier, the distribution of θ̂
might not be asymptotically normal. To demonstrate this, in Figure 3a, we display the histogram of the
estimated effect across 50 random samples by using the plug-in method. We standardize the estimates by
subtracting the mean and then dividing by the standard deviation and plot them against the standard normal
distribution. As one can observe, the distribution appears multi-modal and deviates from a standard normal
distribution. Next, we use our proposed debiased estimator and plot the standardized estimates across 50
samples of draws in Figure 3b. The resultant distribution with the debiased estimator is much closer to a
standard normal. Finally, we calculate the 95% confidence intervals using our debiased estimator across
50 random draws. In Table 6, we report the bias (mean absolute error from all draws) and the coverage
i.e., the percentage of times the true parameter is covered in the estimated confidence intervals. We find
that bias across both data-generating processes (RCL and MNL) is notably low, reflecting only a -0.0001
difference from the true effect. The coverage rate of the 95% confidence interval in our corrected model
is 90%, indicating good coverage. This shows that our debiased estimator can be used to conduct valid
inference in finite samples.
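The two summary statistics reported for this exercise, bias measured as the mean absolute error across draws and the coverage of the 95% confidence intervals, can be computed as in the sketch below. The function name is hypothetical, and the replicated estimates are generated from a simple sample-mean problem purely for illustration; they are not the paper's simulation output.

```python
import numpy as np

def bias_and_coverage(estimates, std_errors, truth, z=1.96):
    """Mean absolute error across draws and the share of intervals covering the truth."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    bias = np.abs(estimates - truth).mean()
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= truth) & (truth <= upper)
    return bias, covered.mean()

# Illustrative replication study: 50 draws of a sample mean with known truth
rng = np.random.default_rng(6)
truth, n_obs, n_draws = 1.0, 200, 50
samples = rng.normal(truth, 1.0, size=(n_draws, n_obs))
estimates = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(n_obs)

bias, coverage = bias_and_coverage(estimates, std_errors, truth)
print(bias, coverage)
```

With only 50 draws the coverage estimate is itself noisy, so moderate departures from the nominal level are to be expected even when the intervals are valid.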
year regarded as a market. The number of cars varies from 86 to 150 each year. For each car, the dataset
provides information such as the car’s name, the manufacturing company, factory region, market share,
price, and four exogenous car characteristics: horsepower, space, mileage per dollar, and the presence of an
air conditioning device.
Even though the dataset is relatively small, it presents three key challenges that make it difficult to
use naive non-parametric estimators: (i) the dataset features markets with more than 100 products but
only 20 markets in total; (ii) the product assortment varies across years (markets); and (iii) the feature
“price” is endogenous, an issue that was not considered in the numerical experiments in §5. In this section,
we demonstrate that our estimator can effectively address these challenges, which are typical of
real-world datasets.
For each component of our model (ϕ1 , ϕ2 , and ρ), we use a standard 3-layer neural network with a ReLU
activation function at each layer, without further hyperparameter tuning. This architecture is the same
as the one we used in our numerical experiments. For comparison, we
replicate the random coefficient logit model (with only the demand side) used by Berry et al. (1995) using
the Python package pyblp (Conlon and Gortmaker, 2020). In our replication, we allow for heterogeneity
in random coefficients across all variables. Our findings show that the estimates obtained from our model
are comparable to the random coefficient logit estimation presented in Berry et al. (1995).
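To make the role of the three components concrete, the sketch below wires ϕ1, ϕ2, and ρ into a permutation-invariant score for a focal product. It assumes the composition ρ(ϕ1(focal), Σ ϕ2(competitor)), which is inferred from the component names and is not restated in this section. The weights are random, biases are omitted, and all layer widths and helper names (`make_mlp`, `forward`, `score`) are hypothetical; this is not the authors' implementation.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def make_mlp(sizes, rng):
    """Random weight matrices for a dense network with layer widths `sizes`."""
    return [
        [[rng.gauss(0, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(mlp, v):
    """Apply each dense layer followed by ReLU (no biases, for brevity)."""
    for layer in mlp:
        v = relu([sum(w * x for w, x in zip(row, v)) for row in layer])
    return v


def score(phi1, phi2, rho, focal, competitors):
    """rho(phi1(focal), sum of phi2 over competitors); order of competitors is irrelevant."""
    own = forward(phi1, focal)
    pooled = [0.0] * len(own)
    for comp in competitors:
        pooled = [a + b for a, b in zip(pooled, forward(phi2, comp))]
    return forward(rho, own + pooled)[0]


rng = random.Random(0)
phi1 = make_mlp([2, 8, 8, 8], rng)   # three weight layers each
phi2 = make_mlp([2, 8, 8, 8], rng)
rho = make_mlp([16, 8, 8, 1], rng)

focal = [1.0, 0.5]                   # hypothetical (feature, price) pair
competitors = [[0.2, 1.5], [0.9, 3.0], [-0.4, 2.2]]
s_original = score(phi1, phi2, rho, focal, competitors)
s_permuted = score(phi1, phi2, rho, focal, competitors[::-1])
```

Because the competitor embeddings are summed before entering ρ, reversing the competitor list leaves the score unchanged up to floating-point rounding.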
We estimate our model both without and with consideration of endogeneity. To address endogeneity,
we utilize three sets of IVs – (i) the sum of characteristics of all car models, excluding the product in focus,
produced by the same firm in the same year; (ii) the sum of characteristics of all car models, excluding the
product in focus, produced by rival firms in the same year; and (iii) cost shifters, which encompass the wage
and exchange rate prevalent in the year and region where the factory is located. Traditional BLP-style
instruments can be relatively weak, as discussed by Gandhi and Houde (2019), and often result in
considerable bias in the estimated parameters. These issues are significantly
exacerbated in non-parametric models. Thus, to counter potential concerns related to weak instruments, we
employ a machine-learning-based IV methodology (MLIV) as proposed by Singh et al. (2020). We detail
the estimation procedure and results using BLP-style IVs in Web Appendix F.
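Instrument sets (i) and (ii) can be built with simple grouped sums. The sketch below is a minimal illustration for a single characteristic, assuming each product record carries a market, a firm, and one characteristic value; the field names (`market`, `firm`, `x`) and the toy records are hypothetical.

```python
from collections import defaultdict


def blp_instruments(products):
    """For each product, sum a characteristic over (i) other products of the
    same firm and (ii) products of rival firms, within the same market."""
    firm_total = defaultdict(float)    # (market, firm) -> sum of characteristic
    market_total = defaultdict(float)  # market -> sum of characteristic
    for p in products:
        firm_total[(p["market"], p["firm"])] += p["x"]
        market_total[p["market"]] += p["x"]

    rows = []
    for p in products:
        same_firm = firm_total[(p["market"], p["firm"])]
        rows.append({
            "name": p["name"],
            "own_firm_iv": same_firm - p["x"],                  # excludes the focal product
            "rival_iv": market_total[p["market"]] - same_firm,  # rival firms only
        })
    return rows


products = [
    {"name": "a1", "market": 1990, "firm": "A", "x": 1.0},
    {"name": "a2", "market": 1990, "firm": "A", "x": 2.0},
    {"name": "b1", "market": 1990, "firm": "B", "x": 4.0},
]
ivs = blp_instruments(products)
```

With several characteristics, the same sums would be repeated per characteristic, giving one pair of instruments for each.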
In Figure 4, we present the estimated own-elasticity ($\frac{\partial \hat{\pi}_{jt} / \pi_{jt}}{\partial p_{jt} / p_{jt}}$) of our model without IV and with IV.
The x-axis represents the price of the focal product, while the y-axis shows the product’s own-elasticity.
Each point corresponds to a product in a market, resulting in 2,217 observations. We report the estimated
elasticity based on the same price variation used in the BLP paper (a 1,000-dollar change). In Figure 4a,
we observe that the majority of low-priced products (priced below 6,000 dollars) exhibit positive estimated
own-elasticity, which indicates the presence of endogeneity. This issue is attenuated
when we correct for endogeneity (see Figure 4), which suggests that our approach is able to handle
situations with endogenous features.
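An elasticity based on a discrete price change, such as the 1,000-dollar change used here, can be approximated by a finite difference. The sketch below is a minimal illustration assuming a generic share function of the focal product's own price; `toy_share` is a hypothetical linear demand curve with price measured in thousands of dollars and stands in for the estimated model.

```python
def arc_elasticity(share_fn, price, delta=1.0):
    """Percentage change in share divided by percentage change in price."""
    base = share_fn(price)
    pct_share = (share_fn(price + delta) - base) / base
    pct_price = delta / price
    return pct_share / pct_price


def toy_share(price):
    # Hypothetical linear demand; price is measured in thousands of dollars.
    return 0.5 - 0.02 * price


# Elasticity at a price of 10 (thousand dollars) for a 1 (thousand dollar) increase.
own_elasticity = arc_elasticity(toy_share, price=10.0, delta=1.0)
```

A cross-elasticity follows the same pattern, with the share function evaluated at a changed competitor price instead of the product's own price.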
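We report the own-elasticity ($\frac{\partial \hat{\pi}_{jt} / \pi_{jt}}{\partial p_{jt} / p_{jt}}$) and cross-elasticity ($\frac{\partial \hat{\pi}_{jt} / \pi_{jt}}{\partial p_{k \neq j, t} / p_{k \neq j, t}}$) estimated in our model and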
random coefficient logit model with a sample of 13 cars in the 1990 market in Tables 7 and 8. The sample
of 13 cars is the same as the one reported in Berry et al. (1995). Overall, our results are very similar and
[Figure: (a) Own-Elasticity Estimation (Our Model vs. BLP Model); (b) Cross-Elasticity Estimation (Our Model vs. BLP Model)]
We further estimate the average own-elasticity (θ̂) for high-priced, medium-priced, and low-priced cars
and construct a confidence interval for each category using our inference procedure. We present our result
in Table 9. Both our model and the BLP model indicate that the average own-elasticity is highest (in terms
of the absolute value of own-elasticity) for high-priced cars and lowest for low-priced cars. Moreover, the
95% confidence intervals for all three categories do not include zero. This also demonstrates the efficiency
of our model even when the sample is limited to only 20 markets.
The empirical analysis demonstrates the applicability and effectiveness of our model in a real-world set-
ting, addressing challenges such as limited sample size, variability in product assortments, and endogeneity.
The comparable results with established econometric models, such as the BLP model, help validate the
robustness and reliability of our approach.
7 Conclusion
Choice models are fundamental in understanding consumer behavior and informing business decisions.
Over the years, various methods, both parametric and non-parametric, have been developed to represent
consumer behavior. While parametric methods, such as logit or probit-based models, are favored for their
simplicity and interpretability, their restrictive assumptions can limit their ability to fully capture the
intricacies of consumer preferences. On the other hand, non-parametric methods offer a more flexible approach, but
Lexus LS400 0.3357 0.2606 0.3136 0.3325 0.0822 0.3235 0.3137 0.3126 -6.8791 0.3271 0.3495 0.3348 0.3494
Lincoln Town Car 0.2681 0.2310 0.2548 0.2656 0.0713 0.2663 0.2548 0.2648 0.2586 -5.3996 0.3009 0.2719 0.3009
Mazda 323 0.0361 0.0212 0.0272 0.0326 -0.0127 0.0249 0.0272 0.0363 0.0323 0.0326 -2.6589 0.0404 0.0357
Nissan Maxima 0.1579 0.1367 0.1589 0.1555 0.0425 0.1670 0.1589 0.1689 0.1484 0.1534 0.1884 -7.2216 0.1884
Nissan Sentra 0.0386 0.0239 0.0304 0.0384 -0.0202 0.0294 0.0304 0.0496 0.0375 0.0383 0.0439 0.0470 -1.8754
Table 7: Estimated own- and cross-elasticities of a sample of automobile data using our method
Note: This table presents the estimated own- and cross-elasticities of a sample of 13 cars in the 1990 market using our model. The selected cars are the same as those reported in Berry et al. (1995).
Each entry with row index i and column index j gives the percentage change in demand divided by the percentage change in price (based on a $1,000 change in the price of i).
Lexus LS400 0.1350 0.0536 0.0166 0.1126 0.0130 0.0130 0.0045 0.1400 -7.4316 0.0243 0.0006 0.1464 0.0024
Lincoln Town Car 0.0087 0.0034 0.0286 0.0069 0.0217 0.0135 0.1693 0.0362 0.0086 -5.6139 0.0011 0.0123 0.0024
Mazda 323 0.0114 0.0020 0.0743 0.0056 0.1476 0.1494 0.0679 0.2723 0.0063 0.0304 -2.8631 0.0390 0.0254
Nissan Maxima 0.1008 0.0393 0.0314 0.0831 0.0493 0.0500 0.0271 0.1749 0.1209 0.0286 0.0033 -4.7872 0.0086
Nissan Sentra 0.0140 0.0031 0.0707 0.0078 0.1471 0.1504 0.0617 0.2763 0.0095 0.0271 0.0105 0.0422 -3.1799
Table 8: Estimated own- and cross-elasticities of a sample of automobile data using the BLP model
Note: This table presents the estimated own- and cross-elasticities of a sample of 13 cars in the 1990 market using the BLP model. The selected cars are the same as those reported in Berry et al. (1995).
Each entry with row index i and column index j gives the percentage change in demand divided by the percentage change in price (based on a $1,000 change in the price of i).
they often suffer from the “curse of dimensionality”, where the complexity of estimating choice functions
escalates exponentially with an increase in the number of products.
In this paper, we propose a fundamental characterization of choice models that combines the tractability
of traditional choice models and the flexibility of non-parametric estimators. This characterization specif-
ically tackles the challenge of high dimensionality in choice systems and facilitates flexible estimation of
choice functions. Through extensive simulations, we validate the efficacy of our model, demonstrating its
superior ability to capture a range of consumer behaviors that traditional choice models fail to capture.
We also show how to address the endogeneity issue and estimate counterfactuals in our characterization.
Furthermore, leveraging the recent strides in the automatic debiased machine learning literature, we offer
an inference procedure that constructs confidence intervals on relevant objects, such as price elasticities.
Finally, we apply our method to the automobile dataset from Berry et al. (1995). Our empirical analysis
affirms that our model produces results that align well with the extant literature.
Our paper opens many avenues for future research. We focus on neural network-based estimators;
however, other estimators, such as Gaussian processes and gradient boosting-based estimators, could also be
adopted to estimate the proposed functionals. We also believe that more experimentation is needed on the
neural network design side.
References
J. Abaluck, G. Compiani, and F. Zhang. A method to estimate discrete choice models that is robust to
consumer search. Technical report, National Bureau of Economic Research, 2020.
J. Abrevaya. Rank estimation of a generalized fixed-effects regression model. Journal of Econometrics, 95
(1):1–23, 2000.
D. Ackerberg, X. Chen, J. Hahn, and Z. Liao. Asymptotic efficiency of semiparametric two-step GMM.
Review of Economic Studies, 81(3):919–943, 2014.
A.1 MNL
Assuming there are J products, the market share of each product under the MNL model is
If we permute two products $k_1$ and $k_2$ in the set $S \setminus \{j\}$, the only change is in the order of terms in the
summation in the denominator of the market share. Since addition is commutative, the sum
$\sum_{k \in S, k \neq j} \exp(f(X_{kt}, p_{kt}; \beta))$ remains the same regardless of the order of $k_1$ and $k_2$.
Therefore, the value of $\pi_{jt}$ remains unchanged,
demonstrating that the MNL model satisfies permutation invariance.
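The argument can be checked numerically. The sketch below is a minimal illustration, assuming the share takes the standard logit form with the focal product's exponentiated index in the numerator and the sum over all products in the denominator; whether the original specification also includes an outside option is not recoverable from this section, so none is included. The index values stand in for f(X, p; β) and are hypothetical.

```python
import math


def mnl_share(v_focal, v_others):
    """MNL share of the focal product given index values f(X, p; beta)."""
    denom = math.exp(v_focal) + sum(math.exp(v) for v in v_others)
    return math.exp(v_focal) / denom


v_j = 0.4                          # index of the focal product j
v_competitors = [1.2, -0.3, 0.7]   # indices of the other products in S
share = mnl_share(v_j, v_competitors)

# Swap the first two competitors (k1 and k2) and recompute.
swapped = [v_competitors[1], v_competitors[0], v_competitors[2]]
share_swapped = mnl_share(v_j, swapped)
```

The two shares agree up to floating-point rounding, since only the order of the terms in the denominator has changed.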
where $F(\beta_i)$ is the CDF of the random coefficient $\beta_i$. Similar to MNL, permuting the positions of any two
other products $k_1, k_2 \in S \setminus \{j\}$ only affects the order of the summation in the denominator, so the
demand for product j remains the same. Therefore, the model satisfies permutation invariance.
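The same check carries over when the integral over the random coefficient is approximated by simulation. The sketch below is a minimal illustration, assuming a single characteristic, a linear index β_i · x, and normally distributed draws of β_i; these choices and the name `rcl_share` are hypothetical and only serve to illustrate the permutation argument.

```python
import math
import random


def rcl_share(x_focal, x_others, beta_draws):
    """Average the logit share of the focal product over draws of beta_i."""
    total = 0.0
    for beta in beta_draws:
        num = math.exp(beta * x_focal)
        denom = num + sum(math.exp(beta * x) for x in x_others)
        total += num / denom
    return total / len(beta_draws)


rng = random.Random(1)
beta_draws = [rng.gauss(1.0, 1.0) for _ in range(500)]  # simulated beta_i
x_others = [0.5, -1.0, 2.0]

s1 = rcl_share(0.3, x_others, beta_draws)
s2 = rcl_share(0.3, [x_others[2], x_others[0], x_others[1]], beta_draws)
```

For each draw the inner logit share is unchanged by the permutation, so the average over draws is unchanged as well.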
Where:
• $\sigma \in \mathbb{R}^L$ is an L-dimensional vector whose lth element is the scale parameter for nest l,
which captures the correlation in unobserved utilities among products in the same nest. Therefore, the
scale parameter of product jt, which belongs to nest l, can be expressed as Njt σ.
• Nm denotes a binary vector of dimension L, in which the mth element is one and all other elements
are zero. Therefore, Nm σ is the scale parameter of the nest m.
It is intuitive that when we swap the position of any two products, the probability of choosing a nest
(P (l)) would not change because both the denominator and the numerator only depend on the nest affilia-
tions. Moreover, P (j|l) also satisfies permutation invariance because of the additive nature of the denom-
inator in its formula, where the sum over all products in nest l remains constant despite any permutation
of product positions within the nest. The permutation invariance property extends to the
random coefficient nested logit by the same logic as for the random coefficient logit.
When we permute the order of any two competitor products, the choice probability of the highest-price
product remains the same. The choice probability of other products can be expressed as
where uijt represents consumer i’s utility for product j in market t. Since uikt for any k ∈ St is
parameterized by its own features (Xkt ), permuting any two k ≠ j, k ∈ St , affects neither the choice
probability πijt nor the aggregate demand πjt . Indeed, this shows that any model building on the
random utility framework with the decision rule stated in Equation A10 is permutation invariant.
The permutation invariance property also extends to other recently developed choice models, such as the
Markov chain choice model (Blanchet et al., 2016). The choice probability of a product jt is modeled as a
Given that both λjt and pjk,t rely only on jt, St , and St \ {j}, permuting the order of any two products
other than the focal product jt in market t changes neither the arrival probability nor the transition
probability. Therefore, the choice probability does not change.
Proof. To show asymptotic normality, we first verify Assumptions 1-3 of Chernozhukov et al.
(2022a), henceforth CEINR, with $g(w, \pi(z;\gamma), \theta) = m(w, \pi(z;\gamma)) - \theta$ and
$\phi(w, \pi(z;\gamma), \alpha(z), \theta) = \alpha(z) \cdot (y - \pi(z;\gamma))$. Using a Taylor series expansion, Assumption 6, and
$\|\hat{\pi}(z;\gamma) - \pi_0(z;\gamma)\| \xrightarrow{p} 0$, we have
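$$
\begin{aligned}
&\le C \int \|\pi_0(z;\gamma_0) - \hat{\pi}(z;\hat{\gamma})\|^2 \, P_0(dw) \\
&\le C \|\hat{\pi}(z;\hat{\gamma}) - \pi_0(z;\gamma_0)\|^2 \xrightarrow{p} 0
\end{aligned}
\tag{A14}
$$
Also, by Assumption 5 i) and $\|\hat{\alpha} - \alpha_0\| \xrightarrow{p} 0$, we have
$$
\begin{aligned}
\int \|\phi(w, \pi_0(z;\gamma_0), \hat{\alpha}, \tilde{\theta}) - \phi(w, \pi_0(z;\gamma_0), \alpha_0, \theta_0)\|^2 \, P_0(dw)
&= \int \|(\hat{\alpha}(z) - \alpha_0(z)) (y - \pi_0(z;\gamma_0))\|^2 \, P_0(dw) \\
&\le C \int \|\hat{\alpha} - \alpha_0\|^2 \, P_0(dw) \\
&\le C \|\hat{\alpha} - \alpha_0\|^2 \xrightarrow{p} 0
\end{aligned}
\tag{A15}
$$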
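$$
(\hat{\psi}_i - \psi_i)^2 \le C \sum_{j=1}^{4} R_{ij} = C \sum_{j=1}^{3} R_{ij} + o_p(1)
$$
where
$$
\begin{aligned}
R_{i1} &= [m(w_i, \hat{\pi}_\ell(z_i;\hat{\gamma}_\ell)) - m(w_i, \pi_0(z_i;\gamma_0))]^2, \\
R_{i2} &= \hat{\alpha}_\ell^2(z_i) \, [\hat{\pi}_\ell(z_i;\hat{\gamma}_\ell) - \pi_0(z_i;\gamma_0)]^2, \\
R_{i3} &= [\hat{\alpha}_\ell(z_i) - \alpha_0(z_i)]^2 \, [y_i - \pi_0(z_i;\gamma_0)]^2, \\
R_{i4} &= (\hat{\theta} - \theta_0)^2.
\end{aligned}
$$
We already showed consistency, so $R_{i4} \xrightarrow{p} 0$.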
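We already showed,
$$
E[R_{i1} \mid I_{-\ell}] = \int [m(w_i, \hat{\gamma}_\ell) - m(w_i, \gamma_0)]^2 \, F_0(dw) = o_p(1)
$$
$$
\begin{aligned}
E[R_{i2} \mid W_{-\ell}] &= \int \hat{\alpha}_\ell^2 \, [\hat{\pi}_\ell(z_i;\hat{\gamma}_\ell) - \pi_0(z_i;\gamma_0)]^2 \, F_0(dz) \\
&= \int [\hat{\alpha}_\ell + \alpha_0 - \alpha_0]^2 \, [\hat{\pi}_\ell(z_i;\hat{\gamma}_\ell) - \pi_0(z_i;\gamma_0)]^2 \, F_0(dz) \\
&\le \int [\hat{\alpha}_\ell - \alpha_0]^2 \, [\hat{\pi}_\ell(z_i;\hat{\gamma}_\ell) - \pi_0(z_i;\gamma_0)]^2 \, F_0(dz) \\
&\quad + \int [\alpha_0]^2 \, [\hat{\pi}_\ell(z_i;\hat{\gamma}_\ell) - \pi_0(z_i;\gamma_0)]^2 \, F_0(dz) \\
&\le O_p(1) \int [\hat{\pi}_\ell(z_i;\hat{\gamma}_\ell) - \pi_0(z_i;\gamma_0)]^2 \, F_0(dz) \xrightarrow{p} 0
\end{aligned}
$$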
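Finally, we have
$$
\begin{aligned}
E[R_{i3} \mid I_{-\ell}] &= E\left[ E\left[ [\hat{\alpha}_\ell(z_i) - \alpha_0(z_i)]^2 \, [y_i - \pi_0(z_i;\gamma_0)]^2 \mid z_i, I_{-\ell} \right] \mid I_{-\ell} \right] \\
&= E\left[ [\hat{\alpha}_\ell(Z_i) - \alpha_0(Z_i)]^2 \, E\left[ [y_i - \pi_0(z_i;\gamma_0)]^2 \mid Z_i \right] \mid I_{-\ell} \right] \\
&\le C \|\hat{\alpha}_\ell - \alpha_0\|^2 \xrightarrow{p} 0.
\end{aligned}
$$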
As a result,
$$
\frac{1}{n} \sum_{i=1}^{n} (\hat{\psi}_i - \psi_i)^2 \xrightarrow{p} 0
$$
Thus, we have $\hat{V} \xrightarrow{p} V$.
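Also, by Assumption 9 and iterated expectations,
$$
\begin{aligned}
E[R_{i3} \mid W_{-\ell}] &\le \int \{\hat{\alpha}_\ell(z) - \bar{\alpha}(z)\}^2 \, E\left[ (y - \bar{\pi}(z))^2 \mid Z = z \right] F_Z(dz) \\
&\le C \int \{\hat{\alpha}_\ell(z) - \bar{\alpha}(z)\}^2 \, F_Z(dz) = C \|\hat{\alpha}_\ell - \bar{\alpha}\|^2 = o_p(1).
\end{aligned}
$$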
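The quantities in this proof map onto a short computation. The sketch below is a minimal illustration built from the moment and correction terms defined at the start of the proof, g = m − θ and ϕ = α(z)(y − π): the debiased estimate averages m + α(z)(y − π), the variance estimate averages the squared influence values, and a normal-approximation interval follows. Cross-fitting over folds is omitted for brevity, and the function name `debiased_estimate` and the toy inputs are hypothetical.

```python
import math


def debiased_estimate(m_vals, alpha_vals, y, pi_hat, z=1.96):
    """Debiased theta, variance estimate V-hat, and a 95% interval."""
    n = len(y)
    corrected = [
        m + a * (yi - pi) for m, a, yi, pi in zip(m_vals, alpha_vals, y, pi_hat)
    ]
    theta = sum(corrected) / n
    psi = [c - theta for c in corrected]   # estimated influence values
    v_hat = sum(p * p for p in psi) / n    # V-hat = (1/n) * sum of psi_i^2
    half = z * math.sqrt(v_hat / n)
    return theta, v_hat, (theta - half, theta + half)


# Toy inputs (hypothetical): plug-in moments, Riesz weights, outcomes, fitted shares.
m_vals = [1.0, 2.0, 3.0, 2.0]
alpha_vals = [0.5, 0.5, 0.5, 0.5]
y = [1.0, 0.0, 1.0, 0.0]
pi_hat = [0.5, 0.5, 0.5, 0.5]
theta, v_hat, ci = debiased_estimate(m_vals, alpha_vals, y, pi_hat)
```

In the toy example the plug-in average of m is already 2.0 and the correction terms cancel, so the debiased estimate equals 2.0 with a variance estimate of 0.5625.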
Hyperparameter                   Space
Number of hidden layers          [3, 4, 5]
Number of nodes in each layer    [64, 128, 256]
Learning rate                    [1e-2, 1e-3, 1e-4]
Number of epochs                 [1, 2, 4]
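The search space above can be enumerated as a full grid. The sketch below is a minimal illustration; the dictionary keys are hypothetical names, and whether the space was searched exhaustively or by sampling is not stated here, so the exhaustive grid is an assumption.

```python
from itertools import product

search_space = {
    "hidden_layers": [3, 4, 5],
    "nodes_per_layer": [64, 128, 256],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "epochs": [1, 2, 4],
}

# One dictionary per configuration, in the order produced by itertools.product.
grid = [dict(zip(search_space, combo)) for combo in product(*search_space.values())]
```

With three candidate values for each of the four hyperparameters, the full grid contains 81 configurations.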
         MNL        RCL
pjt      U [0, 4]   U [0, 4]
Xjt      N (0, 1)   N (0, 1)

         MNL        RCL
αi       -1         N (−1, 1)
βik      1          N (µβk , 1)
Notes: µβk ∼ N (0, 1/2d)
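The feature and coefficient distributions tabulated above can be turned into a small simulator. The sketch below is a minimal illustration for the MNL column only, assuming a linear index α · price + β · (sum of features) and no outside option; the utility specification is not restated in this section, so both choices, along with the name `simulate_mnl_market`, are hypothetical.

```python
import math
import random


def simulate_mnl_market(n_products, n_features, rng, alpha=-1.0, beta=1.0):
    """Draw prices from U[0, 4] and features from N(0, 1); return MNL shares."""
    prices = [rng.uniform(0, 4) for _ in range(n_products)]
    feats = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_products)]
    index = [alpha * p + beta * sum(x) for p, x in zip(prices, feats)]
    expv = [math.exp(v) for v in index]
    total = sum(expv)
    return prices, feats, [e / total for e in expv]


prices, feats, shares = simulate_mnl_market(5, 3, random.Random(0))
```

The RCL column would replace the fixed α and β with consumer-level draws and average the resulting shares over consumers.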
where µjt is the unobservable that is correlated with pjt , and εijt is i.i.d. Type-I extreme value distributed.
Specifically, without loss of generality, we specify the correlation between pjt and µjt as
• Exogenous Benchmark: Uses the exact same DGP but treats µjt as observable to the researcher in
estimation. Since the coefficients of price and the market shares are the same as in the endogenous case,
the true elasticities are the same for these two DGPs. This gives a benchmark performance with the same
data under the assumption that endogeneity is not present.
• Ignoring Endogeneity: Directly trains the model using only the observed features Xjt and pjt with-
out considering the endogeneity problem. Note that, even though the endogeneity problem is ignored,
this does not mean the predictive performance in market shares would be low for this case because the
model could be overfitted. However, the elasticities will be biased when the endogeneity is ignored.
Following the routine in our main text, we simulate 20 times for the same DGP with different parameters
and features and report the predictive performance on market share ($\hat{\pi}_{jt}$), own-price elasticity
($\frac{\partial \hat{\pi}_{jt} / \pi_{jt}}{\partial p_{jt} / p_{jt}}$), and cross-price elasticity
($\frac{\partial \hat{\pi}_{jt} / \pi_{jt}}{\partial p_{k \neq j, t} / p_{k \neq j, t}}$). We report the performance of our model in Table A4. When
we use the control function to correct for endogeneity (Row 1: Our Method), our model underperforms
slightly compared to the case where µjt is treated as an observable (Row 2: Exogenous Benchmark), on
all three metrics. However, when we ignore the endogeneity issue and apply our model (Row 3: Ignore
Endogeneity), the predictive performance on market shares remains reasonable, but the estimates of own- and
cross-elasticities are significantly biased, with MAEs 10–25 times higher than in the cases where we
account for endogeneity. This demonstrates both the importance of accounting for endogeneity and
the effectiveness of our method in handling this problem in real settings.
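The contrast between ignoring endogeneity and correcting for it with a control function can be reproduced in a deliberately simple linear setting. The sketch below is not the paper's DGP or estimator: it assumes a linear outcome with a true price coefficient of -1, a single instrument, and an unobservable that shifts both price and the outcome. All names (`slope`, `ols2`) and numbers are hypothetical.

```python
import random


def slope(y, x):
    """Slope and intercept of a simple regression of y on x."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return b, my - b * mx


def ols2(y, x1, x2):
    """OLS of y on [1, x1, x2]; returns the coefficients on x1 and x2."""
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


rng = random.Random(42)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]       # instrument
mu = [rng.gauss(0, 1) for _ in range(n)]      # unobservable
price = [zi + mi for zi, mi in zip(z, mu)]    # price is correlated with mu
y = [-1.0 * p + m + rng.gauss(0, 0.1) for p, m in zip(price, mu)]

naive, _ = slope(y, price)                    # ignores endogeneity
b1, a1 = slope(price, z)                      # first-stage regression
resid = [p - a1 - b1 * zi for p, zi in zip(price, z)]
corrected, _ = ols2(y, price, resid)          # control function: add the residual
```

In this toy setting the naive slope is pulled toward roughly -0.5, while including the first-stage residual as an extra regressor recovers a price coefficient close to the true value of -1.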
• Step 1: Data Partition We randomly split the data set, D, into three separate partitions of markets,
each denoted as Dl . Each market is exclusively assigned to only one partition. For each partition, we
define its complement set, Dlc , as the subset of data in D that is not included in Dl .
• Step 2: Cross-fitting For each partition l, we first estimate a linear regression model on the com-
plement data set, Dlc , using the Lasso method with hyperparameters tuned by 3-fold cross-validation.
As discussed in Section 3.3, we need the estimator of γ to converge at the $n^{-1/2}$ rate, a similar result
• Step 3: First-stage Regression We estimate the first-stage estimator γ and residual µ using the MLIV
as the only predictor following step 1 in Section 4.
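The market-level partition in Step 1 can be sketched as follows. This is a minimal illustration, assuming markets are identified by integer ids; the function name `partition_markets`, the seed, and the ids are hypothetical.

```python
import random


def partition_markets(market_ids, n_folds=3, seed=0):
    """Randomly assign each market to exactly one of `n_folds` partitions and
    return (partition, complement) pairs of market-id lists."""
    ids = sorted(set(market_ids))
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]
    return [(fold, [m for m in ids if m not in fold]) for fold in folds]


# Twenty hypothetical market ids, split into three partitions.
splits = partition_markets(range(20))
```

For each pair, a model fitted on the complement would then be used to predict on the partition, so that no market contributes to its own prediction.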
As a supplement to our main result, we also run our model using non-machine learning-based IVs.
Similar to Figure 4 in the main text, we present the estimated own-elasticity of our model without IV, with
BLP-style IVs, with differentiation IVs, and with MLIV in Figure A1. In Figure A1b, even when IVs
are applied, the persistence of many positive own-elasticities suggests the weakness of the BLP-style IVs.
Furthermore, we apply the differentiation IVs (Gandhi and Houde, 2019), which use exogenous measures
of differentiation and provide a more robust instrument compared to the conventional BLP IVs. As one can
see from Figure A1c, the use of differentiation IVs provides a more realistic estimation of own-elasticities,
reinforcing the concern that the BLP-style IVs are weak. We also include the distributions of the
estimated own- and cross-elasticities obtained from our model using different sets of IVs in Figure A2.
In addition, we perform a weak instrument test on both the BLP-style IVs and the MLIV and report the
F-statistics and p-values in Table A5. Both the BLP-style IVs and the MLIV pass the weak instrument tests.
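For a single instrument, a first-stage F-statistic can be computed directly from the R-squared of the first-stage regression. The sketch below is a minimal illustration of that calculation and is not necessarily the test reported in Table A5; the function name `first_stage_f` and the toy data are hypothetical. A value above 10 is a commonly used rule of thumb for instrument strength.

```python
def first_stage_f(endog, instrument):
    """F-statistic for a regression of `endog` on a constant and one instrument."""
    n = len(endog)
    mx = sum(instrument) / n
    my = sum(endog) / n
    sxx = sum((x - mx) ** 2 for x in instrument)
    syy = sum((v - my) ** 2 for v in endog)
    sxy = sum((x - mx) * (v - my) for x, v in zip(instrument, endog))
    r2 = sxy ** 2 / (sxx * syy)
    # One restriction in the numerator, n - 2 residual degrees of freedom.
    return r2 / (1.0 - r2) * (n - 2)


# Toy data (hypothetical): price moves almost one-for-one with the instrument.
instrument = [1.0, 2.0, 3.0, 4.0, 5.0]
price = [1.1, 1.9, 3.2, 3.8, 5.1]
f_stat = first_stage_f(price, instrument)
```

With several instruments, the same idea applies, with the number of instruments in the numerator degrees of freedom.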