2018 - Machine Learning
DOI: https://doi.org/10.1287/opre.2018.1757
The Big Data Newsvendor:
Practical Insights from Machine Learning
Gah-Yi Ban*
Management Science & Operations, London Business School, Regent’s Park, London, NW1 4SA, United Kingdom.
[email protected]
Cynthia Rudin
Department of Computer Science, Duke University, Durham, NC 27708, United States of America
[email protected]
We investigate the data-driven newsvendor problem when one has n observations of p features related
to the demand as well as historical demand data. Rather than a two-step process of first estimating a
demand distribution then optimizing for the optimal order quantity, we propose solving the “Big Data”
newsvendor problem via single step machine learning algorithms. Specifically, we propose algorithms based
on the Empirical Risk Minimization (ERM) principle, with and without regularization, and an algorithm
based on Kernel-weights Optimization (KO). The ERM approaches, equivalent to high-dimensional quantile
regression, can be solved by convex optimization problems and the KO approach by a sorting algorithm.
We analytically justify the use of features by showing that their omission yields inconsistent decisions. We
then derive finite-sample performance bounds on the out-of-sample costs of the feature-based algorithms,
which quantify the effects of dimensionality and cost parameters. Our bounds, based on algorithmic stability
theory, generalize known analyses for the newsvendor problem without feature information. Finally, we apply
the feature-based algorithms for nurse staffing in a hospital emergency room using a data set from a large
UK teaching hospital and find that (i) the best ERM and KO algorithms beat the best practice benchmark
by 23% and 24% respectively in the out-of-sample cost, and (ii) the best KO algorithm is faster than the best
ERM algorithm by three orders of magnitude and the best practice benchmark by two orders of magnitude.
Key words : big data, newsvendor, machine learning, Sample Average Approximation, statistical learning
theory, quantile regression
History : First version: February 6, 2014; revisions submitted February 1, 2015, August 7, 2016 and
November 2, 2017. Accepted for publication in Operations Research on March 2, 2018.
1. Introduction
We investigate the newsvendor problem when one has access to n past demand observations as well
as a potentially large number, p, of features about the demand. By features we mean exogenous
variables (also known as covariates, attributes and explanatory variables) that are predictors of the
demand and are available to the decision-maker (hereafter, “DM”) before the ordering occurs. While
inventory models to date have typically been constructed with demand as the stochastic primitive,
in the world of Big Data, the DM has access to a potentially large amount of relevant information
such as customer demographics, weather forecast, seasonality (e.g. day of the week, month of the
year and season), economic indicators (e.g. the consumer price index), as well as past demands to
inform her ordering decisions.
In this paper, we propose solving the “Big Data” newsvendor problem via distribution-free, one-
step machine learning algorithms that handle high-dimensional feature data, and derive finite-sample
performance bounds on their out-of-sample costs. The one-step algorithms contrast with the ap-
proach of solving the Big Data newsvendor problem via a two-step process, of first estimating a
(feature-dependent) demand distribution then optimizing for the optimal order quantity. Such two-
step processes can be problematic because demand model specification is difficult in high-dimensions,
and errors in the first step will amplify in the optimization step.
In this setting, the paper is structured to answer the following four questions. (Q1) How should
the DM use a feature-demand data set to solve the newsvendor problem? (Q2) What is the value
of incorporating features in newsvendor decision-making in the first place? (Q3) What theoretical
guarantees does the DM using such data have, and how do these scale with the various problem
parameters? (Q4) How do newsvendor decisions based on the feature-demand data set compare to
other benchmarks in practice?
We address (Q1) in Sec. 2, where we propose one-step approaches to finding the optimal order
quantity with a data set of both demand and related feature observations. One approach is based
on the machine learning principle of Empirical Risk Minimization (ERM) (Vapnik 1998), with and
without regularization, and the other, which we call Kernel-weights Optimization (KO), is inspired
by Nadaraya-Watson’s kernel regression method (Nadaraya 1964, Watson 1964). The ERM ap-
proach, equivalent to high-dimensional quantile regression, is a linear programming (LP) algorithm
without regularization, and a mixed-integer program (MIP), an LP or a quadratic program (QP)
with ℓ0, ℓ1 and ℓ2 regularization, respectively. Under the KO approach, the optimal data-driven
order quantity can be found by a simple sorting algorithm.
We justify the use of features by answering (Q2) in Sec. 3, where we analytically quantify the
value of features by comparing against decisions made without any features (the “Sample Average
Approximation (SAA)” method). This is necessary as most data-driven inventory works to date
do not consider features. We consider two demand models for the comparison — a two-population
model and the linear demand model. For both models, we show that the no-feature SAA decision
does not converge to the true optimal decision, whereas the feature-based ERM decision does. In
other words, the SAA decision can have a constant bias (i.e. O(1) error) regardless of how many
observations n the DM has, whereas any finite-sample bias of the feature-based decision shrinks to
zero as n tends to infinity. Accordingly, the SAA decision can have a higher newsvendor cost than
the ERM decision. We quantify the additional cost incurred by biased decisions in Theorem 4.
We address (Q3) in Sec. 4. In practice, the DM may never make the truly optimal decision
with finite amount of data, even if s/he uses as much relevant feature information as possible.
To understand the tradeoffs of not having the full distributional information of the demand, we
derive theoretical performance bounds for the DM who uses the algorithms proposed in Sec. 2. Our
bounds characterize how the in-sample cost of the in-sample decision (which the DM can calculate,
as opposed to the expected cost of the in-sample decision) deviates from the true expected cost
of the true optimal decision in terms of the various parameters of the problem, in particular in
terms of p and n. The bounds show how the out-of-sample cost of the in-sample decision (the
“generalization error”) deviates from its in-sample cost by a complexity term that scales gracefully
as 1/√n in n and as √(log(1/δ)) in δ, where 1 − δ is the probabilistic accuracy of our bound,
under the minimal assumption that the demand data are independent and identically distributed (iid)
in the high-dimensional feature space. The bounds also show how the finite-sample bias from the
true optimal decision scales as n^{−1/(2+p/2)} √(log n), under the additional assumption of a linear demand
model.
From a practical perspective, our bounds explicitly show the trade-offs between the generalization
error (which measures the degree of in-sample overfitting) and the finite-sample bias and how
they depend on the size of the data set, the newsvendor cost parameters and any free parameters
(controls) in the algorithms. While some past papers in operations management have incorporated
specific features in inventory models, none have analyzed the out-of-sample performance (cost) with
high-dimensional data. Our work also contrasts with past works in quantile regression in that we use
the algorithm-specific stability theory of Bousquet and Elisseeff (2002) to analyze the generalization
error, as opposed to Vapnik-Chervonenkis (VC) theory (Vapnik 1998). This results in bounds with
tight constants that depend only on the newsvendor cost parameters, rather than on uniform
complexity measures such as covering numbers, VC dimensions and Rademacher averages, which are
often difficult to compute and interpret in practice. Detailed discussions of the literature
can be found below in Section 1.1.
We address (Q4) in Sec. 5, where we evaluate our algorithms against other known benchmarks
through an extensive empirical investigation. Specifically, we apply our algorithms and other meth-
ods to a nurse staffing problem in a hospital emergency room. The best result using the ERM
approach was with ℓ1 regularization, with a cost improvement of 23% [a saving of £44,219 per annum
(p.a.)] relative to the best practice benchmark (Sample Average Approximation (SAA) clustered by
day of the week), and the best result using the KO approach had a cost improvement of 24% (a
saving of £46,555 p.a.) relative to the same benchmark. Both results were statistically significant
at the 5% level. The best KO method, solved using the sorting procedure described in Sec. 2, was
also very computationally efficient, taking just 0.05 seconds to compute the optimal staffing level
for the next period, which is three orders of magnitude faster than the best ERM method and two
orders of magnitude faster than the best practice benchmark and other benchmarks.
Finally, in Sec. 6, we conclude with a discussion of the practical take-aways, generalizable insights
as well as limitations of our investigation, and thoughts on future directions for research.
Our work contributes to the following areas of investigation in operations management and machine
learning.
(i) Data-driven inventory models. Models in newsvendor/inventory management have long been
constructed with the available data (or, rather, the lack thereof) in mind, with the stochastic nature
of the demand modelled under various assumptions. The earliest papers (Arrow et al. 1958, Scarf
1959b) make the assumption that the demand distribution is fully known, and this has been relaxed
in recent years. Overall, there have been three main approaches to modeling demand uncertainty
in the literature. In the Bayesian approach, the demand distribution is assumed to be known up
to unknown parameter(s), which are dynamically learned starting with prior assumptions (Scarf
1959a, Azoury 1985, Lovejoy 1990). In the minimax approach, the DM
opts for the best robust decision among all demand distributions within a specified uncertainty
set (Scarf et al. 1958, Gallego and Moon 1993, Chen et al. 2007, Perakis and Roels 2008). Finally,
in the data-driven approach, the DM has access to samples of demand observations drawn from
an unknown distribution. In this setting, Burnetas and Smith (2000), Huh and Rusmevichientong
(2009) and Kunnumkal and Topaloglu (2008) propose stochastic gradient algorithms, Godfrey and
Powell (2001) and Powell et al. (2004) consider the adaptive value estimation method, Levi et al.
(2007) provide a sampling-based method based on solving a shadow problem to solve for the optimal
ordering quantities, and Levi et al. (2015) improve upon the bounds of Levi et al. (2007) for the
single-period, featureless newsvendor case.
Within this line of work, the distinguishing aspect of our paper is in the incorporation and analysis
of a potentially large number of features directly in an inventory model. As far as we are aware, the
works of Liyanage and Shanthikumar (2005), See and Sim (2010) and Hannah et al. (2010) are the
only other preceding methodological papers in operations management to incorporate some form
of feature information in the decision model. We compare them with our approaches in detail in
Sec. 2.4. Most importantly, we derive performance bounds that quantify the effect of the feature
dimension on the out-of-sample cost, which has no precedent in this line of literature.
Our setup and subsequent analysis also differ from the parametric modeling approaches of Feld-
man (1978), Lovejoy (1992), Song and Zipkin (1993), Gallego and Özer (2001), Lu et al. (2006),
Iida and Zipkin (2006), where the demand is modeled as a Markov-modulated demand process (MMDP)
with known state-dependent distributions, where the states capture various exogenous information
such as economic indicators or advanced demand information. The key difference is that we make
minimal assumptions about the underlying demand distribution (iid for the most part and later, for
more specific finite-sample bias analysis, the linear demand model with unknown error distribution),
whereas in all of the aforementioned works, the demand is modelled parametrically with known,
state-dependent distributions (e.g. Normal or Poisson, where the mean is a function of the state),
with the state evolving as a Markov process.
(ii) Performance bound analysis of data-driven inventory decisions. While there is no precedent
for feature-dependent performance bounds in the inventory theory literature, Levi et al. (2007)
and Levi et al. (2015) provide such bounds without features. Levi et al. (2007) studies the single-
period and dynamic inventory problems with zero setup cost from a nonparametric perspective and
provides sample average approximation (SAA)-type algorithms to solve them. They then provide
probabilistic bounds on the minimal number of iid demand observations that are needed for the
algorithms to be near-optimal. Levi et al. (2015) improves upon the bound for the single-period
case. Our performance bounds are generalizations of the bounds of Levi et al. (2007) and Levi et al.
(2015) to incorporate features in the DM’s data set. We demonstrate how our bounds can retrieve
the bounds of Levi et al. (2007) when p = 1 in Appendix D; a similar result can be shown for Levi
et al. (2015) as well.
(iii) High-dimensional Quantile Regression. Due to the equivalence of the newsvendor cost func-
tion with the loss function in quantile regression, our work can be classified as a study in high-
dimensional quantile regression as well; albeit with the twist that we analyze (analytically and
empirically) the cost of the estimated quantile as opposed to the estimated quantile itself. We
make three contributions to this literature. First, the KO method is, to the best of our knowledge, a
new nonparametric quantile regression method. Second, our out-of-sample performance analyses of
both the ERM methods and the KO method, which use algorithmic stability theory from machine
learning (Bousquet and Elisseeff 2002), are new. Lastly, we extend the results of Chaudhuri et al.
(1991) to the high-dimensional setting to derive bounds on the biases of the newsvendor algorithms
under consideration. For references on quantile regression, we refer the readers to Koenker (2005)
for a textbook reference, Takeuchi et al. (2006), Chernozhukov and Hansen (2008), Chernozhukov
et al. (2010), Belloni and Chernozhukov (2011) and references therein for more recent works on
high-dimensionality.
A company sells perishable goods and needs to make an order before observing the uncertain
demand. For repetitive sales, a sensible goal is to order a quantity that minimizes the total expected
cost according to:
min_{q≥0} E[C(q; D)],   (1)
where
C(q; D) := b(D − q)+ + h(q − D)+   (2)
is the random cost of order q and demand D, and b and h are respectively the unit backordering
and holding costs. If the demand distribution, F, is known, one can show the optimal decision is
given by the b/(b + h) quantile; that is:
q* = inf{ y : F(y) ≥ b/(b + h) }.   (3)
In practice, the decision maker does not know the true distribution. If one has access to historical
demand observations d(n) = [d1, . . . , dn], but no other information, then a sensible approach is to
substitute the true expectation with a sample average and solve the resulting problem:
min_{q≥0} R̂(q; d(n)) = (1/n) Σ_{i=1}^n [b(di − q)+ + h(q − di)+],   (SAA)
where we use the ˆ notation to emphasize quantities estimated from data. This approach is called
the Sample Average Approximation (SAA) approach in stochastic optimization (for further details
on the SAA approach in stochastic optimization, see Shapiro et al. 2009). One can show the optimal
SAA decision is given by
q̂n = inf{ y : F̂n(y) ≥ b/(b + h) },   (4)
where F̂n(·) is the empirical cdf of the demand from the n observations. Note that if F is continuous,
and we let r = b/(b + h), then q̂n = d(⌈nr⌉), the ⌈nr⌉-th smallest demand observation.
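As a concrete sketch (the function name and toy data are ours), the SAA decision reduces to picking an order statistic:

```python
import numpy as np

def saa_order_quantity(demands, b, h):
    """Featureless SAA newsvendor decision: the smallest demand value y
    with empirical cdf F_hat(y) >= b/(b+h), i.e. the ceil(n*r)-th smallest
    observation for r = b/(b+h)."""
    d = np.sort(np.asarray(demands, dtype=float))
    n = len(d)
    r = b / (b + h)
    k = int(np.ceil(n * r))   # ceil(n*r)-th order statistic
    return d[k - 1]           # convert 1-indexed rank to 0-indexed array

# Example: 10 demand observations, b = 3, h = 1, so r = 0.75
demands = [12, 7, 9, 15, 11, 8, 14, 10, 13, 6]
print(saa_order_quantity(demands, b=3, h=1))  # -> 13.0, the 8th smallest
```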
With feature information, the newsvendor problem becomes one of minimizing the conditional
expected cost:
min_{q(·): X→R} E[C(q(x); D) | x],   (5)
where the decision is now a function that maps the feature space X ⊂ Rp to the reals and the
expected cost that we minimize is now conditional on the feature vector x ∈ X.
The decision-maker intent on finding an optimal order quantity in this new setting has three issues
to address. The first issue is in knowing what features the demand depends on, which prescribes
what data to collect. As this is application-specific, we assume that the decision maker has already
collected appropriate historical data Sn = [(x1 , d1 ), . . . , (xn , dn )]. The data may be low-dimensional,
where the number of features p is small compared to the number of observations n, or high-dimensional,
where the number of features is large compared to n (we make the distinction between the low- and
high-dimensional regimes precise in Sec. 4). The second issue is how to solve the problem (5) in an
efficient manner given the feature-demand data set. In this section, we propose two approaches to
solving (5) — the ERM and KO approaches. Both approaches are direct, in that the decision-maker
solves for the (in-sample) optimal order quantity in a single step. As such, our proposed algorithms
are customized for the feature-based newsvendor problem, and are distinct from SAA (which
ignores features) and the separated estimation and optimization (SEO) approach, against
which we compare in Sec. 5, along with other known benchmarks. The final concern is what perfor-
mance guarantee is possible prior to observing the demand in the next period. We address this in
Sec. 4.
Before we begin, we clarify that the setting of interest is one in which the DM observes the
features xn+1 before making the ordering decision at time n + 1.
2.3.1. Empirical Risk Minimization Algorithms. The ERM approach to solving the
newsvendor problem with feature data is:
min_{q(·) ∈ Q ⊆ {q: X→R}} R̂(q(·); Sn) = (1/n) Σ_{i=1}^n [b(di − q(xi))+ + h(q(xi) − di)+],   (NV-ERM)
where R̂ is called the empirical risk of the function q with respect to the data set Sn.
To solve (NV-ERM), one needs to specify the function class Q. The size or the complexity of Q
controls overfitting or underfitting: for instance, if Q is too large, it will contain functions that fit
the noise in the data, leading to overfitting. Let us consider linear decision rules of the form
Q = { q : X → R : q(x) = q′x = Σ_{j=1}^p q^j x^j },
where x1 = 1, to allow for a feature-independent term (an intercept term). This is not restrictive,
as one can easily accommodate nonlinear dependencies by considering nonlinear transformations of
basic features. We might, for instance, consider polynomial transformations of the basic features,
e.g., [x1, . . . , xp, x1², . . . , xp², x1x2, . . . , xp−1xp]. Such transformations can be motivated from generative
models of the demand (but do not need to be); for instance, assume: D = f (x) + ε, where x is a
p-dimensional vector of features. If we also assume that f (·) is analytic, we can express the demand
function by its Taylor expansion:
D = f(x) + ε = f(0) + Σ_{j=1}^p (∂f(0)/∂x^j) x^j + (1/2) Σ_{j,k=1}^p (∂²f(0)/∂x^j ∂x^k) x^j x^k + · · · + ε,
which means that the demand function of a basic feature vector x can be approximated by a linear
demand model with a much larger feature space. For example, the second-order Taylor
approximation of the demand model can be considered to be a linear demand model with the (p + p²)
features mentioned earlier: [x1, . . . , xp, x1², . . . , xp², x1x2, . . . , xp−1xp]. Regardless of the motivation for
the transformations of basic features, we can choose them to be arbitrarily complex; hence our choice
of decision functions that depend linearly on the feature vector is not particularly restrictive. The
choice of Q can be made more or less complex depending on which transformations are included.
For a discussion on how piecewise linear decisions can be considered by transforming the features,
see Sec. 2.4.2. This comes at the cost of increasing the number of features, perhaps dramatically so,
thereby increasing the likelihood of overfitting if there are not enough data. We thus propose ERM
with regularization for large p (discussed further in Sec. 4). Although we consider linear decision
functions for the rest of the paper, we show how one can consider nonlinear functional spaces for
Q in Appendix A via mapping the original features onto a higher dimensional reproducing kernel
Hilbert space.
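As an illustration of the quadratic feature map above (the helper name is ours), the enlarged feature vector can be built directly:

```python
import itertools
import numpy as np

def second_order_features(x):
    """Map a basic feature vector x in R^p to
    [x_1, ..., x_p, x_1^2, ..., x_p^2, x_1 x_2, ..., x_{p-1} x_p],
    so that a rule that is linear in the new features is quadratic in x."""
    x = np.asarray(x, dtype=float)
    squares = x ** 2
    cross = [x[i] * x[j] for i, j in itertools.combinations(range(len(x)), 2)]
    return np.concatenate([x, squares, cross])

# p = 3 basic features expand to 3 linear + 3 square + 3 cross terms
print(second_order_features([2.0, 3.0, 5.0]))
# -> [ 2.  3.  5.  4.  9. 25.  6. 10. 15.]
```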
When p is relatively small, the DM can solve (NV-ERM) via the following linear program:
ERM Algorithm 1
min_{q: q(x) = Σ_{j=1}^p q^j x^j} R̂(q(·); Sn) = (1/n) Σ_{i=1}^n [b(di − q(xi))+ + h(q(xi) − di)+]
≡ min_{q = [q^1, . . . , q^p], u, o} (1/n) Σ_{i=1}^n (b ui + h oi)
s.t. ∀ i = 1, . . . , n:
  ui ≥ di − q^1 − Σ_{j=2}^p q^j xi^j
  oi ≥ q^1 + Σ_{j=2}^p q^j xi^j − di
  ui, oi ≥ 0,   (NV-ERM1)
where the auxiliary variables ui and oi represent, respectively, the underage and overage in period
i. This is an LP with a (p + 2n)-dimensional decision vector and 4n constraints. We will see in Sec. 4
that while (NV-ERM1) yields decisions that are algorithmically stable, the performance guarantee
relative to the true optimal decision is loose when p is large (relative to n). Thus, in the case of
high dimensional data, one can solve the LP (NV-ERM1) by selecting a subset of the most relevant
features according to some feature-selection criterion, for example via cross validation or via model
selection criteria such as the Akaike Information Criterion (Akaike 1974) or the Bayesian Information
Criterion (Schwarz 1978). Alternatively, one can automate feature selection by solving the following
regularized version of (NV-ERM1):
ERM Algorithm 2 (with regularization)
min_{q: q(x) = Σ_{j=1}^p q^j x^j} R̂(q(·); Sn) + λ‖q‖k² = (1/n) Σ_{i=1}^n [b(di − q(xi))+ + h(q(xi) − di)+] + λ‖q‖k²
≡ min_{q = [q^1, . . . , q^p], u, o} (1/n) Σ_{i=1}^n (b ui + h oi) + λ‖q‖k²
s.t. ∀ i = 1, . . . , n:
  ui ≥ di − q^1 − Σ_{j=2}^p q^j xi^j
  oi ≥ q^1 + Σ_{j=2}^p q^j xi^j − di
  ui, oi ≥ 0,   (NV-ERM2)
where λ > 0 is the regularization parameter and ‖q‖k denotes the ℓk-norm of the vector q =
[q^1, . . . , q^p]. If we regularize by the ℓ2 norm, the problem becomes a quadratic program (QP) and can
be solved efficiently using widely available conic programming solvers. If we believe that the number
of features involved in predicting the demand is very small, we can choose to regularize by the ℓ0
semi-norm or the ℓ1 norm to encourage sparsity in the coefficient vector. The resulting problem
then becomes, respectively, a mixed-integer program (MIP) or an LP. Note that regularization by the
ℓk-norm is widely used across engineering, statistics and computer science to handle overfitting.
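To make the formulation concrete, here is a minimal sketch of (NV-ERM1) solved as an LP with SciPy (the helper name and toy data are ours; the regularized variants of (NV-ERM2) would add penalty variables or quadratic terms on top of this):

```python
import numpy as np
from scipy.optimize import linprog  # assumes SciPy is available

def erm_newsvendor(X, d, b, h):
    """Solve (NV-ERM1): min (1/n) sum_i (b*u_i + h*o_i)
    s.t. u_i >= d_i - q.x_i, o_i >= q.x_i - d_i, u_i, o_i >= 0.
    X is n x p with first column all ones (the intercept term)."""
    d = np.asarray(d, dtype=float)
    n, p = X.shape
    # decision vector z = [q (p entries, free), u (n, >= 0), o (n, >= 0)]
    c = np.concatenate([np.zeros(p), b / n * np.ones(n), h / n * np.ones(n)])
    I, Z = np.eye(n), np.zeros((n, n))
    # -q.x_i - u_i <= -d_i   and   q.x_i - o_i <= d_i
    A_ub = np.block([[-X, -I, Z], [X, Z, -I]])
    b_ub = np.concatenate([-d, d])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:p]  # the coefficient vector q

# toy example: demand roughly 10 + 2x with noise, b = 2, h = 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
d = 10 + 2 * x + rng.normal(0, 0.5, size=50)
X = np.column_stack([np.ones(50), x])
q = erm_newsvendor(X, d, b=2, h=1)
print(q)  # coefficients near [10, 2], shifted toward the 2/3 quantile
```

Since the objective is the quantile-regression loss, the recovered coefficients track the conditional b/(b + h) quantile of the demand rather than its conditional mean.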
Let us consider variations. We may want a set of coefficients to be either all present or all absent,
for instance if they fall into the same category (e.g., all are weather-related features). We can
accommodate this with the regularization term Σ_{g=1}^G ‖q_{I_g}‖₂, where I_g is the index set of group g.
This regularization term is an intermediate between ℓ1 and ℓ2 regularization, where sparsity at the
group level is encouraged by the sum (ℓ1 norm) over groups. We will see in Sec. 4 that regularization
leads to stable decisions with good finite-sample performance guarantees.
We now turn to the second, kernel-based approach. Consider the problem of estimating the
conditional expectation m(xn+1) = E[Y | x = xn+1], where Y ∈ R is the dependent variable and
xn+1 ∈ Rp is a vector of new independent variables. In 1964, Nadaraya and Watson proposed to
estimate this quantity by the locally weighted average
m̂w(xn+1) = Σ_{i=1}^n Kw(xn+1 − xi) yi / Σ_{i=1}^n Kw(xn+1 − xi),
where Kw(·) is a kernel function with bandwidth w. Typical examples of the kernel function include
the uniform kernel
K(u) = (1/2) I(‖u‖₂ ≤ 1)
and the Gaussian kernel
K(u) = (1/√(2π)) exp(−‖u‖₂²/2),
with Kw(·) := K(·/w)/w. Now for an order quantity q, the feature-dependent newsvendor expected
cost after observing features xn+1 is given by
E[C(q; D) | x = xn+1],   (7)
which depends (implicitly) on the demand distribution at xn+1. Thus if we consider the newsvendor
cost to be the dependent variable, we can estimate (7) by the Nadaraya-Watson estimator
Σ_{i=1}^n Kw(xn+1 − xi) C(q, di) / Σ_{i=1}^n Kw(xn+1 − xi).
This gives rise to a new approach to the feature-based data-driven newsvendor problem, which we
call the Kernel-weights Optimization (KO) method:
min_{q≥0} R̃(q; Sn, xn+1) = min_{q≥0} Σ_{i=1}^n Kw(xn+1 − xi) C(q, di) / Σ_{i=1}^n Kw(xn+1 − xi).   (NV-KO)
Note that there are no edge effects in the objective estimate if the kernel is smooth, which is the case
for the Gaussian kernel. Notice that the optimization is over the non-negative reals, and the optimal
decision implicitly depends on xn+1 . (NV-KO) is a one-dimensional piecewise linear optimization
problem, and we can find its solution according to the following proposition.
Proposition 1. The optimal feature-based newsvendor decision q̂nκ obtained by solving (NV-KO)
is given by
q̂nκ = q̂nκ(xn+1) = inf{ q : Σ_{i=1}^n κi I(di ≤ q) / Σ_{i=1}^n κi ≥ b/(b + h) },   (8)
where for simplicity we introduce κi = Kw(xn+1 − xi). In other words, we can find q̂nκ by ranking the
past demands in increasing order and choosing the smallest value at which the inequality in (8) is
satisfied.
Notice that the left hand side (lhs) of the inequality in (8) is similar to the empirical cdf of
the demand, except that each past demand observation di is re-weighted by the distance of its
corresponding feature xi to the current feature xn+1 .
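A minimal sketch of the sorting procedure in Proposition 1, using a Gaussian kernel (the function name, toy data and bandwidth are ours):

```python
import numpy as np

def ko_order_quantity(X, d, x_new, b, h, w=1.0):
    """Kernel-weights Optimization (Proposition 1): a weighted b/(b+h)
    quantile of past demand, where demand d_i carries weight
    kappa_i = K((x_new - x_i)/w) from a Gaussian kernel."""
    X, d = np.asarray(X, float), np.asarray(d, float)
    dist2 = np.sum((X - x_new) ** 2, axis=1)
    kappa = np.exp(-dist2 / (2 * w ** 2))        # Gaussian kernel weights
    order = np.argsort(d)                         # rank demands increasingly
    cum = np.cumsum(kappa[order]) / kappa.sum()   # weighted empirical cdf
    r = b / (b + h)
    idx = np.searchsorted(cum, r)                 # smallest q with cdf >= r
    return d[order][idx]

# toy example: demand grows with the single feature
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
d = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
print(ko_order_quantity(X, d, x_new=np.array([3.0]), b=1, h=1, w=0.5))
# -> 14.0: past demands near x = 3 dominate the weighted quantile
```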
In the above, we have identified three systematic approaches to incorporating features in the
newsvendor problem. Incorporating exogenous information in inventory decision-making is, however,
not entirely new. In what follows, we compare and contrast the algorithms introduced thus far with
past works that incorporate such information.
2.4.1. Comparison with Liyanage and Shanthikumar (2005). Our first comparison is
with operational statistics (OS), which was first introduced by Liyanage and Shanthikumar (2005).
The idea behind OS is to integrate parameter estimation and optimization rather than separate
them. Let us illustrate how OS works by an example similar to the one used in Liyanage and
Shanthikumar (2005).
Suppose the true demand has an exponential distribution, i.e. D ∼ exp(1/θ), and that the decision
maker has access to past demand observations d1, . . . , dn. Then with straightforward calculations,
one can show that first estimating, then optimizing (“Separated Estimation and Optimization”,
hereafter SEO) leads to the decision
q̂SEO = log((b + h)/h) d̄n,
where d̄n is the sample average of the demand. Now consider instead the decision
q̂OS^1(α) = α d̄n   (9)
parameterized by a constant α > 0. The OS approach then picks α by solving:
min_{α≥0} Eθ[C(q̂OS^1(α); D)].   (10)
As α = log((b + h)/h) is a feasible solution of (10), the OS decision is guaranteed to yield a true
expected cost that is bounded above by the true expected cost of the SEO decision. In other words,
by construction we have
Eθ[C(q̂OS^1(α*); D)] ≤ Eθ[C(q̂SEO; D)],   (11)
where α* is the optimal parameter in (10). With some computations, one can show
α* = [((b + h)/h)^{1/(n+1)} − 1] n.
Liyanage and Shanthikumar (2005) also show that one can improve upon the SAA optimal
decision in terms of the true expected cost by considering the decision
q̂OS^2(α, β) = d(β−1) + α(d(β) − d(β−1)),   (12)
where d(k) denotes the k-th smallest demand observation, and choosing the parameters by solving:
min_{α≥0, β∈{1,...,n}} Eθ[C(q̂OS^2(α, β); D)].   (13)
As the above example illustrates, OS takes insight from the form of the decision derived by other
methods (e.g. SEO and SAA) and constructively improves upon them in terms of the true expected
cost simply by considering a decision that is a function of past demand data rather than a scalar
quantity. In the parlance of our feature-based approach, the OS method is essentially considering
meaningful statistics of past demand data as features. However, there is an important difference
between the OS approach and ours, and this is in the way the unknown coefficients (parameters)
of the decision function are chosen. Under our decision-making paradigm, one would simply input
the sample average of past demand and differences of order statistics of past demand as features
and choose the coefficients that minimize the in-sample average cost. In contrast, OS is based on
the premise that one knows the distributional family the demand belongs to, and thus is able to
compute the coefficients that minimize the true expected cost. Knowing the true distributional
family is not a weak assumption; nevertheless, the insights from the OS analysis are valuable. In Sec. 5, we
will consider solving (NV-ERM1) and (NV-ERM2) both with and without OS-inspired features, to
evaluate their practical benefit in terms of the out-of-sample cost.
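The ordering between the OS and SEO decisions can be checked numerically for the exponential example. The sketch below (toy parameters ours) evaluates the exact expected cost E[b(D − q)+ + h(q − D)+] for D ∼ exp(1/θ), which works out to (b + h)θe^{−q/θ} + hq − hθ, at both decisions, averaged over simulated samples of d̄n:

```python
import numpy as np

def true_cost(q, b, h, theta):
    """Expected newsvendor cost for D ~ exp(1/theta) (mean theta):
    E[b(D-q)^+ + h(q-D)^+] = (b+h)*theta*exp(-q/theta) + h*q - h*theta."""
    return (b + h) * theta * np.exp(-q / theta) + h * q - h * theta

b, h, theta, n = 2.0, 1.0, 10.0, 10
rng = np.random.default_rng(1)
# sample mean of n iid exp(1/theta) draws is Gamma(n, theta/n)
dbar = rng.gamma(shape=n, scale=theta / n, size=100_000)

alpha_seo = np.log((b + h) / h)                      # SEO coefficient
alpha_os = n * (((b + h) / h) ** (1 / (n + 1)) - 1)  # optimal OS alpha*

cost_seo = true_cost(alpha_seo * dbar, b, h, theta).mean()
cost_os = true_cost(alpha_os * dbar, b, h, theta).mean()
print(cost_seo, cost_os)  # OS is never worse than SEO, as in (11)
```

Note that as n grows, α* tends to log((b + h)/h), so the OS correction over SEO vanishes in the large-sample limit.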
2.4.2. Comparison with See and Sim (2010). See and Sim (2010) investigates the multi-
period inventory management problem in the presence of features such as “market outlook, oil prices,
trend, seasonality, cyclic variation” that a product demand can depend on. Specifically, they model
the demand at time t as
dt(z̃) = dt^0 + Σ_{k=1}^{Nt} dt^k z̃k,
where z̃ = [z̃1, . . . , z̃Nt] represents the random features, and dt^k, k = 0, . . . , Nt, are the coefficients. They
then make a number of assumptions on the random features (zero mean, positive definite covari-
ance matrix, bounded support set which is second-order conic representable, etc.), all of which are
assumed to be known to the DM, and solve an approximation of the robust problem by considering
linear decision rules (linear as a function of the features) as well as piecewise linear decision rules.
See and Sim (2010) demonstrates that piecewise linear decision rules have the best performance.
While See and Sim (2010) certainly consider the presence of features in their problem setup, their
work is distinctly different from ours on a number of fronts. First of all, See and Sim (2010) is about
robust decision-making, as opposed to data-driven decision-making; and as such their theoretical
results, while interesting, do not pertain to the data-driven questions we explore in this paper.
Secondly, See and Sim (2010) assumes that a number of key statistics of the random features are
known to the DM, and derives performance bounds for their approximate decisions as a function of
these statistics. In contrast, we make no assumptions about the data-generating process other than
iid sampling, and as such our performance analyses are independent of statistics that are presumed
known. Thirdly, See and Sim (2010) does not study the effect of high-dimensionality (the “Big” in
Big Data), which is the central theme of this paper.
Lastly, See and Sim (2010) makes the interesting observation that decisions that are piecewise linear
functions of the underlying features perform better than linear decision rules. However, in this paper
we consider only linear decision rules, because non-linear decisions can be transformed into linear
decision rules by adding new features. We explained how to enlarge the feature space when the true
decision is an analytic function of the features in Section 2.3.1. Let us
now illustrate how to enlarge the feature space to allow for piecewise linear decisions. For simplicity,
consider a single feature x. We can then construct new features x̃1 = I(x ≤ c1), x̃2 = I(c1 < x ≤ c2)
and x̃3 = I(x > c2) based on the "basic" feature x and some pre-specified constants c1 and c2. Then
the corresponding linear decision rule q(x̃) = q1 x̃1 + q2 x̃2 + q3 x̃3
has the same piecewise linear structure as the decisions considered in See and Sim (2010). By
incorporating a large number of breakpoints, any piecewise linear decision can be approximated
by linear decision rules arbitrarily well, although perhaps with the addition of a large number of
"new" features, one per breakpoint. Thus we solely focus on the feature-based
newsvendor problem with decisions that are linear functions of the feature vector, where the feature
dimension may be large.
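To make the enlargement concrete, here is a minimal sketch (our own illustration; the breakpoints, coefficients and data are hypothetical, not from the paper) of mapping a single feature into indicator features so that a linear rule in the expanded space is piecewise over the original feature:

```python
import numpy as np

def expand_piecewise(x, c1, c2):
    """Map a single feature x to indicator features over the
    intervals (-inf, c1], (c1, c2], (c2, inf)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([
        (x <= c1).astype(float),               # x_tilde_1
        ((x > c1) & (x <= c2)).astype(float),  # x_tilde_2
        (x > c2).astype(float),                # x_tilde_3
    ])

# A rule that is linear in the expanded features is piecewise
# over the original feature x: one order quantity per segment.
X = expand_piecewise([0.5, 1.5, 2.5], c1=1.0, c2=2.0)
q = np.array([10.0, 20.0, 30.0])  # hypothetical coefficients
orders = X @ q
```

Each row of `X` activates exactly one indicator, so the linear rule returns a segment-specific quantity; adding more breakpoints refines the approximation.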
2.4.3. Comparison with Hannah et al. (2010). Hannah et al. (2010) consider the single-
period stochastic optimization problem

min_{x∈X} E[F(x, Z) | S = s],

where x ∈ X is the decision variable, Z is a state-dependent random variable, and S = s is the current
state of the world. They propose solving the above problem by the weighted empirical stochastic
optimization problem

min_{x∈X} ∑_{i=1}^{n} wn(s, Si) F(x, Z(Si)),
where wn (s, Si ) are the weights given to the past data (Si , Z(Si ))ni=1 determined by the Nadaraya-
Watson-based kernel estimator as in the (KO) method or by a complex Dirichlet process mixture
model.
This is similar to our feature-based newsvendor setup; however, there are a few key differences.
Firstly, the model studied by Hannah et al. (2010) requires discrete state variables, whereas we
consider both discrete and continuous feature variables. Secondly, Hannah et al. (2010) do not
study the effect of high-dimensionality (the “Big” in Big Data), which is the central theme of this
paper (their numerical example, which also considers the newsvendor problem, has only two states).
Finally, Hannah et al. (2010) report numerical studies of computing the in-sample decisions, whereas
we provide both theoretical performance guarantees and extensive empirical computation of out-of-
sample performance of the in-sample decisions.
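To illustrate the weighted-empirical idea in the newsvendor setting, the following sketch (our own, with a Gaussian kernel and made-up data; not the authors' implementation) computes Nadaraya–Watson-type weights and then solves the weighted newsvendor problem by sorting, in the spirit of the (KO) method:

```python
import numpy as np

def kernel_weights(x_new, X_past, w):
    """Nadaraya-Watson-type weights from a Gaussian kernel with bandwidth w."""
    d2 = np.sum((X_past - x_new) ** 2, axis=1)
    k = np.exp(-d2 / (2.0 * w ** 2))
    return k / k.sum()

def weighted_newsvendor(demands, weights, b, h):
    """Smallest past demand at which the cumulative weight reaches b/(b+h);
    found by sorting, so no optimization solver is needed."""
    r = b / (b + h)
    order = np.argsort(demands)
    cum = np.cumsum(weights[order])
    idx = min(np.searchsorted(cum, r), len(demands) - 1)
    return demands[order][idx]

# Hypothetical past data: one binary feature (e.g. weekend), plus demand.
X_past = np.array([[0.0], [0.0], [1.0], [1.0]])
d_past = np.array([10.0, 12.0, 20.0, 22.0])
wts = kernel_weights(np.array([1.0]), X_past, w=0.3)
q = weighted_newsvendor(d_past, wts, b=3.0, h=1.0)
# The weights concentrate on the x = 1 subsample, so q falls in its range.
```

With a small bandwidth the weights localize on the most similar past periods; a large bandwidth smooths toward the plain SAA solution.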
Consider the following two-population demand model:

D = D0(1 − x) + D1 x,   (14)
where D0 and D1 are non-negative continuous random variables such that the corresponding critical
newsvendor fractiles q0∗ and q1∗ follow q0∗ < q1∗ , and x ∈ {0, 1} is a binary feature (e.g. 0 for weekday
and 1 for weekend, or 0 for male and 1 for female). Let p0 be the proportion of time x = 0. We have n
historical observations: [(x1 , d1 ), . . . , (xn , dn )], of which n0 = np0 are when x = 0 and n1 = n − n0 are
when x = 1 (assume rounding effects are negligible). Note the observations dk can be decomposed
into: {dk | xk = 0} = d0k and {dk | xk = 1} = d1k. Also let r = b/(b + h) to simplify notation. Let F0 and
F1 denote the cumulative distribution functions (cdfs), F0−1 and F1−1 denote the inverse cdfs, and
f0 and f1 the probability density functions (pdfs) of D0 and D1 respectively.
We also assume the following.
Condition (A). Assume F0 and F1 are twice differentiable (i.e. f0 and f1 are differentiable) and
that there exists a 0 < γ < 2 such that, for i = 0, 1,

sup_{0<y<1} y(1 − y) |Ji(y)| / fi(Fi−1(y)) ≤ γ,

where Ji(y) := −fi′(Fi−1(y))/fi(Fi−1(y)); equivalently,

sup_{x∈dom(D)} Fi(x)(1 − Fi(x)) |fi′(x)| / fi(x)² ≤ γ.
In Table 1, we display some standard distributions that satisfy the requirement of Condition A. For
more details see Parzen (1979).

Distribution | f(F−1(y)) | J(y) | Is sup_{0<y<1} y(1 − y)|J(y)|/f(F−1(y)) < 2?
Uniform | 1 | 0 | Yes, LHS = 0
Exponential | 1 − y | 1 | Yes, LHS = 1
Logistic | y(1 − y) | 2y − 1 | Yes, LHS = 1
Normal | (1/√(2π)) exp{−|Φ−1(y)|²/2} | Φ−1(y) | Yes, LHS = 1
Lognormal | φ(Φ−1(y)) exp{−Φ−1(y)} | exp{−Φ−1(y)}(Φ−1(y) + 1) | Yes, LHS ≈ 1.24

Table 1: Some standard distributions that satisfy the requirement of Condition A. The standard normal cdf and
pdf are denoted as Φ(·) and φ(·) respectively.
Lemma 1 (Optimal ordering decision of (NV-ERM1)). Let F̂i denote the empirical cdf of
D|x = i with ni iid observations for i = 0, 1. Then the optimal decision that solves (NV-ERM1) is
given by

q̂n0 = inf{q : F̂0(q) ≥ b/(b + h)} = d0(⌈n0 r⌉),   if xn+1 = 0,
q̂n1 = inf{q : F̂1(q) ≥ b/(b + h)} = d1(⌈n1 r⌉),   if xn+1 = 1.
Put simply, q̂n0 solves the SAA problem for the subsample of data corresponding to x = 0 and q̂n1
solves the SAA problem for the subsample of data corresponding to x = 1.
i.e. the feature-based decision is asymptotically optimal, correctly identifying the case when x = 0 or
1 as the number of observations goes to infinity.
Lemma 2 (Optimal SAA ordering decision). Let F mix denote the cdf of the mixture distribu-
tion Dmix = p0 D0 + (1 − p0 )D1 and F̂nmix its empirical counterpart with n observations. Then the
optimal SAA decision is given by

q̂nSAA = inf{q : F̂nmix(q) ≥ b/(b + h)} = d(⌈nr⌉).
Theorem 2 (Finite-sample bias and asymptotic (sub)-optimality of SAA). The finite-
sample bias of the SAA decision is given by

E[q̂nSAA] − (F mix)−1(r) ≤ O(log n / n),   (15)

where (F mix)−1 is the inverse cdf of Dmix. Hence we also have

E[q̂nSAA − q̂n0] = (F mix)−1(r) − F0−1(r) + O(log n / n) = O(1),
E[q̂n1 − q̂nSAA] = F1−1(r) − (F mix)−1(r) + O(log n / n) = O(1).   (16)
That is, on average, if x = 0 in the next decision period, the SAA decision orders too much and if
x = 1 the SAA decision orders too little. In addition,

q0∗ < lim_{n→∞} q̂nSAA = (F mix)−1(r) < q1∗   a.s.   (17)
As a final point, we remark that these observations are analogous in spirit to the bias and in-
consistency of regression coefficients when there are, in econometric parlance, correlated omitted
variables in the model.
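The bias described above is easy to reproduce numerically. A small simulation sketch (our own illustration, with hypothetical normal demands for the two populations) compares the pooled SAA quantile with the two subsample quantiles:

```python
import numpy as np

rng = np.random.default_rng(0)
b, h = 3.0, 1.0
r = b / (b + h)   # critical fractile = 0.75

# Two populations, e.g. weekday (x = 0) versus weekend (x = 1) demand.
d0 = rng.normal(10.0, 2.0, size=20_000)   # draws of D0
d1 = rng.normal(20.0, 2.0, size=20_000)   # draws of D1

q0_hat = np.quantile(d0, r)   # feature-based decision when x = 0
q1_hat = np.quantile(d1, r)   # feature-based decision when x = 1
q_saa = np.quantile(np.concatenate([d0, d1]), r)   # pooled, feature-blind

# q_saa lies strictly between q0_hat and q1_hat: the no-feature decision
# over-orders for the x = 0 population and under-orders for x = 1.
```

The pooled quantile targets the mixture distribution, so no amount of additional data moves it to either population's critical fractile.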
Consider now the linear demand model

D = β⊤X + ε,   (18)

where ε ∼ Fε is independent of the (random) feature vector X, is continuous with probability den-
sity function fε(·) which is bounded away from zero on the ordering domain [D, D̄] and has zero
mean, and X1 = 1 almost surely, to allow for a constant location term. In other words, the demand
D depends linearly on the random features X : X → Rp, with some error. This is a widely-used
and useful demand model: apart from the fact that it can arbitrarily approximate nonlinear models
as outlined in (6), it also subsumes time series models and the Martingale Model of Fore-
cast Evolution (MMFE) of Heath and Jackson (1994) and Graves et al. (1986). For example, the
autoregressive model of degree p, AR(p) is modeled by
Dt = α0 + α1 Dt−1 + . . . + αp Dt−p + εt ,
where εt is a noise term with zero mean; this is clearly a linear demand model with features
Dt−p , . . . , Dt−1 . Also, the additive MMFE model for the demand at time t is given recursively by
Dt = Dt−1 + εt−1,t ,
where εt−1,t is a mean zero normal random variable that captures forecast update at time t − 1 for
demand at time t. Expanding the recursion, we get, for any t ≥ 1:

Dt = D0 + ∑_{i=0}^{t−1} εi,i+1,
where D0 is known; and so the demand follows a linear model with εi,i+1 , i = 0, . . . , t − 1 as features.
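To make the reduction concrete, the sketch below (our own illustration with a made-up series) builds the lagged-demand feature matrix under which AR(p) becomes a linear model in the features:

```python
import numpy as np

def lag_features(demand, p):
    """Build (X, y) where row t of X holds [1, D_{t-1}, ..., D_{t-p}]
    and y[t] = D_t, so that D_t = a0 + a1 D_{t-1} + ... + ap D_{t-p} + eps
    is linear in the features."""
    d = np.asarray(demand, dtype=float)
    n = len(d) - p
    X = np.ones((n, p + 1))         # first column: intercept
    for k in range(1, p + 1):
        X[:, k] = d[p - k : p - k + n]   # lag-k demands
    y = d[p:]
    return X, y

demand = np.arange(1.0, 11.0)   # hypothetical demand series 1..10
X, y = lag_features(demand, p=2)
# First row: [1, D_2, D_1] = [1, 2, 1] paired with y[0] = D_3 = 3.
```

The resulting `(X, y)` pair can be handed to any of the feature-based algorithms without further changes.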
A DM without the feature information only has access to past demand data: D = {d1 , . . . , dn }; and
a DM who has both past feature and demand data has the information: Dx = {(x1 , d1 ), . . . , (xn , dn )}.
Let xn+1 denote the feature at time n + 1, which is available to the DM. Then, the optimal order
quantity is given by

q∗(xn+1) = Qε(b/(b + h)) + β⊤xn+1,   (19)
where

Qε(b/(b + h)) = inf{y : Fε(y) ≥ b/(b + h)}

is the b/(b + h)-quantile of the distribution of ε.
The SAA solution is given by

q̂SAA(xn+1) = inf{y : F̂n0(y) ≥ b/(b + h)} + d̄n,   (20)

where d̄n is the sample average of the demand data, and F̂n0 is the empirical distribution of
{di − d̄n}_{i=1}^{n}. Note the SAA decision is not dependent on any of the features, as the DM does not have
access to any feature data. Thus the SAA decision is to order the same critical fractile quantity for
the entire population at time n + 1, regardless of what the population or the particular point in time
may be. As a concrete example, this is like a national newspaper vendor who stocks the same number
of newspapers at all shops, disregarding location features pertaining to the shop (e.g. customer
demographics in the area) as well as relevant temporal features (e.g. holiday or not, weekday versus
weekend, major political or sporting events) or features that pertain to both (e.g. historical demand
for the shop).
The DM with features, however, orders the quantity

q̂DM2(xn+1) = inf{y : F̂n2(y) ≥ b/(b + h)} + ∑_{k=2}^{p} q̂k xkn+1,

where F̂n2 is the empirical distribution of {di − ∑_{k=2}^{p} q̂k xki}_{i=1}^{n}, and q̂k, k = 2, . . . , p are the solution
coefficients to (NV-ERM1). Unlike the SAA decision, DM2’s decision does depend on all relevant
features. Continuing on with the national newspaper example, this corresponds to orders being
different across stores as well as in time, taking into account such information as past sales and
customer demographics at each store as well as temporal effects such as holidays/weekends and
major public events.
Theorem 3 (Using no features leads to inconsistent decisions). Under the linear demand
model of (18), given features X = x̃,

q̂nSAA(x̃) →a.s. Qε(b/(b + h)) + EX[Eε[D|X]] = Qε(b/(b + h)) + β⊤E[X],

and

q̂nDM2(x̃) →a.s. Qε(b/(b + h)) + β⊤x̃ = q∗(x̃),
as n tends to infinity.
Considering the same national newspaper example above, we see that the SAA decision converges
to the critical fractile of the dispersion ε plus the population-temporal average demand EX [Eε [D|X]],
whereas DM2’s decision converges to the correct one, q ∗ (x̃).
The results of Theorems 1–3 indicate that the no-feature SAA decisions are inconsistent, i.e.,
even with an infinite amount of demand data the SAA decisions converge to quantities different from
the true optimal. A natural question then follows: is a decision that is further away from the true
optimal necessarily worse in terms of the expected cost? In other words, does the loss in the
expected cost increase as the effect of the feature information increases? The answer is in the
affirmative, which we detail below.
Theorem 4 (Expected Cost Difference). Let q∗ = q∗(x̃) denote the true optimal newsvendor
decision given feature x̃, and q̂ some other decision not equal to q∗. Then the difference of the
expected costs of the two decisions is given by

Rtrue(q̂) − Rtrue(q∗) = (b + h) E[ |q̂ − D| 1{min(q∗, q̂) ≤ D ≤ max(q∗, q̂)} ].

Theorem 4 thus provides an exact formula for the expected cost difference of the sub-optimal decision
q̂ from the true optimal decision q∗. We observe that the expected cost difference scales as the
expectation of |q̂ − D| over the interval between the two decisions, q̂ and q∗. Thus what matters
is the size of this interval and how the demand is distributed over it: the more concentrated
the distribution over this interval, the larger the difference. While the exact quantity can only
be computed with the knowledge of the demand distribution and the true optimal decision, we
nevertheless arrive at the universal insight that the expected cost difference increases as q̂ deviates
further away from q ∗ .
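The characterization above can be verified in closed form for a simple case. A minimal sketch (our own illustration, with hypothetical Uniform(0, 1) demand and cost parameters) compares the cost gap computed directly from the newsvendor risk with (b + h) times the expectation of |q̂ − D| over the interval between q̂ and q∗:

```python
# Closed-form check with D ~ Uniform(0, 1):
# E[(D - q)^+] = (1 - q)^2 / 2 and E[(q - D)^+] = q^2 / 2, so the
# newsvendor risk is R(q) = b (1 - q)^2 / 2 + h q^2 / 2 and q* = b/(b+h).
b, h = 2.0, 1.0
q_star = b / (b + h)   # = 2/3
q_hat = 0.9            # a hypothetical sub-optimal decision

def R(q):
    return b * (1 - q) ** 2 / 2 + h * q ** 2 / 2

gap = R(q_hat) - R(q_star)

# For uniform demand, E[|q_hat - D| 1{q* <= D <= q_hat}] is the
# integral of (q_hat - t) over [q*, q_hat], i.e. (q_hat - q*)^2 / 2.
formula = (b + h) * (q_hat - q_star) ** 2 / 2
# gap and formula coincide (both equal 49/600).
```

The same check can be repeated for any demand distribution by numerical integration; the uniform case simply makes both sides available in closed form.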
In particular, Theorem 4 implies that for the two-population model (14), the more distinct the
two population demands D0 and D1 in their critical fractiles, the worse the expected cost of the
no-feature SAA decision relative to the true optimal decision. Likewise, for the linear demand model (18),
the more idiosyncratic the feature information β⊤x̃ relative to the average β⊤E[X], the worse the expected
cost of the no-feature SAA decision in comparison to the true optimal decision.
While Theorem 4 together with Theorems 1–3 justify the collection of features, a shortcoming is
that the DM needs to know the demand distribution in order to quantify the gain in the expected
cost due to a sub-optimal in-sample decision, which of course she does not know. In the next section,
we characterize, with high probability bounds, the expected cost of the DM’s in-sample decision
using the information at hand.
We are interested in minimizing this cost, but we cannot measure it as the distribution is unknown.
The empirical risk is the average cost over the training sample:

R̂(q; Sn) := (1/n) ∑_{i=1}^{n} C(q; di(xi)).
The empirical risk can be calculated, whereas the true risk cannot; the empirical risk alone, however,
is an incomplete picture of the true risk. We must have some additional property of the algorithm to
ensure that the method does not overfit. If the algorithm is stable, it is less likely to overfit, which
we quantify in the results in this section. Specifically, we provide probabilistic upper bounds on the
true risk in terms of the empirical risk and the algorithmic stability of the method. Since we desire
the true risk to be low, a combination of low empirical risk and sufficient stability ensures this.
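The empirical risk itself is straightforward to compute. A minimal sketch (our own, with made-up numbers; the decision is taken to be a scalar for simplicity):

```python
import numpy as np

def newsvendor_cost(q, d, b, h):
    """C(q; d) = b (d - q)^+ + h (q - d)^+ : backorder plus holding cost."""
    return b * np.maximum(d - q, 0.0) + h * np.maximum(q - d, 0.0)

def empirical_risk(q, demands, b, h):
    """R_hat(q; S_n): average newsvendor cost over the training sample."""
    return float(np.mean(newsvendor_cost(q, np.asarray(demands), b, h)))

# Hypothetical sample: order q = 12 against demands [10, 14].
risk = empirical_risk(12.0, [10.0, 14.0], b=3.0, h=1.0)
# cost(12, 10) = h*2 = 2 and cost(12, 14) = b*2 = 6, so the average is 4.
```

In the feature-based setting the scalar `q` is replaced by a decision rule q(x) evaluated at each xi; the averaging is unchanged.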
The training set is, as before, Sn = {z1 = (x1, d1), . . . , zn = (xn, dn)}, z ∈ Z, and we also define the
modified training set

Sn\i := {z1, . . . , zi−1, zi+1, . . . , zn},

i.e. Sn with the i-th sample removed. In other words, a symmetric learning algorithm does not depend on the order of the elements in the
training set Sn. The loss of the decision rule q ∈ Q with respect to a sample z = (x, d) is defined as

ℓ(q, z) := c(q(x), d),

for some cost function c, which in our work is the newsvendor cost C.
We derive performance bounds on (NV-ERM1)–(NV-KO) under the following assumptions.
Assumptions for Theorems 5–7.
1. The feature vector X is normalized (X1 = 1 almost surely, X[2:p] has mean zero and standard
deviation one) and lives in a closed ball: ||X||2 ≤ Xmax√p.
2. The demand follows the linear model (18) where the distribution of ε, fε , is bounded away
from zero on the domain [D, D̄] (otherwise unspecified).
3. All decision functions (policies) described are measurable, and Q is a convex subset of a linear
space.
Assumption 1 is for the feature vector X; we note the normalization assumption is to simplify the
exposition, and the results do not require that the DM know the true mean or standard deviation
of X, only the size bound Xmax, the existence of which is realistic and not prohibitive. Assumption
2 details assumptions on the demand model, which is assumed to be linear for tractability, but
we do not assume any distributional knowledge beyond its total range. Finally, Assumption 3 is a
necessary requirement for sensible optimization over a function class when the demand is linear.
First, we state the performance bound on (NV-ERM1).

Theorem 5 (Performance bound on (NV-ERM1)). The following holds with probability at least
1 − δ over the random draw of the sample Sn, where each element of Sn is drawn i.i.d. from an
unknown distribution on X × D, and for all n ≥ 3:

|Rtrue(q∗) − R̂in(q̂; Sn)| ≤ (b ∨ h)D̄ [ (2(b ∨ h)/(b ∧ h)) (p/n) + ((4(b ∨ h)/(b ∧ h)) p + 1) √(log(2/δ)/(2n)) ]
  + (b ∨ h) K √(log n) / n^{1/(2+p/2)},

where K = √( 9(8 + 5p) / ((4 + p)(1 − 2^{−4/(4+p)}) λ∗2) ) and λ∗2 = min_{t∈[D,D̄]} fε(t).
Theorem 5 is a statement about how close the in-sample cost of the in-sample decision, R̂in (q̂; Sn )
is to the expected cost of the true optimal decision, Rtrue (q ∗ ), in terms of quantities the DM knows.
The first term on the right hand side upper bound is the bound on the generalization error,
which is the difference between the training error and test error for the in-sample decision. For
fixed cost parameters b and h, we find that the generalization error scales as O(p/√n). Thus if
the number of relevant features in the population model is small and not growing relative to the
number of observations, then in-sample decisions generalize well to out-of-sample data; in other
words overfitting should not be an issue. However, if p/√n is large, or growing (which happens when
new observations are associated with new features), then overfitting will be an issue, and Theorem 5
suggests that (NV-ERM1) may not be a good algorithm in such a scenario. The dependence on the
upper bound on the demand, D̄ is necessary so that the bound is not scale invariant. In other words,
if the risks on the left hand side of the inequality changed units (e.g. from dollars per kilo of demand
to dollars per ton), it would not make sense for the right hand side of the inequality to stay the
same. Lastly, we note that the bound on the generalization error is tight in the sense that it comes
from showing that the probability of a large deviation of |Rtrue(q̂) − R̂in(q̂; Sn)| decays exponentially
fast in the number of observations n, which we establish through a property known as uniform
stability of a learning algorithm, and because the constants in the bound are the smallest possible.
For further details, we refer the reader to the proof in Appendix C and Bousquet and Elisseeff
(2002). These are the best finite-sample bounds we know of for this problem. This is because they
are not uniform bounds, which would require a complexity measure for the entire decision
space (e.g. covering numbers, VC dimension or Rademacher complexity), but rather algorithm-specific
bounds that consider how the algorithm searches the decision space.
The second term on the right hand side upper bound is due to the finite-sample bias, E|q ∗ − q̂ |.
The only way a DM can reduce the finite-sample bias is by collecting more observations. The rate
n^{−1/(2+p/2)}√(log n) is optimal and cannot be improved upon without further assumptions on the
demand model and/or the data generating process. For details, we refer the reader to the proof of
Theorem 5 in Appendix C.
When p = 1, we are in the setup where the demand does not depend on any exogenous features
(recalling that X1 = 1 is the intercept term). This setting was studied by Levi et al. (2007)
and our results are consistent with them in that their sampling bound, up to a constant factor, can
be obtained from our bound. The details of this can be found in Appendix D.
We now state the performance bound on (NV-ERM2).
The performance bound of Theorem 6 has three components. The first term is a bound on
the generalization error, which is of O(p/(√n λ)). Thus the amount of overfitting can be directly
controlled by the amount of regularization imposed on the problem; the larger the λ, the smaller
the generalization error. Choosing λ = O(1/p²) retrieves the same error rate as for (NV-ERM1), so
λ = O(1/p²) is a good starting point for choosing λ. The bound thus provides a sense of the "right"
scale for λ, which is useful when one has to search for its best value in practice.
The second term, which does not appear in Theorem 5, is the bias of the in-sample decision due
to regularization, in other words the bias due to having perturbed the optimization problem away
from the true problem of interest. This term is larger for larger λ, and so there is an inherent trade-
off between the generalization error and the regularization bias. Ultimately, however, regularization
gives the DM an extra degree of control while being agnostic to which feature is important a
priori, and in practice would work with the optimal value of λ that balances the generalization
error-regularization bias tradeoff on a validation data set (see Sec. 5).
The third and the final term is the finite-sample bias. We note that while the regularization bias
can be controlled by λ, the finite-sample bias can only be controlled by collecting more data.
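For concreteness, (NV-ERM1) and its ℓ1-regularized variant can be written as linear programs and handed to a generic LP solver. A minimal sketch (our own illustration using scipy, not the paper's code); as a sanity check, with an intercept-only feature and λ = 0 the solution reduces to the ⌈nr⌉-th order statistic, consistent with Lemma 1:

```python
import numpy as np
from scipy.optimize import linprog

def nv_erm(X, d, b, h, lam=0.0):
    """Solve min (1/n) sum_i [b u_i + h o_i] + lam * sum_{k>=2} t_k
    s.t. u_i >= d_i - x_i'q, o_i >= x_i'q - d_i, t_k >= |q_k|,
    i.e. (NV-ERM1) for lam = 0, with an l1 penalty on the
    non-intercept coefficients for lam > 0.
    LP variables: [q (p), u (n), o (n), t (p-1)]."""
    n, p = X.shape
    nv = p + 2 * n + (p - 1)
    c = np.concatenate([np.zeros(p), np.full(n, b / n),
                        np.full(n, h / n), np.full(p - 1, lam)])
    A, rhs = [], []
    for i in range(n):
        row = np.zeros(nv); row[:p] = -X[i]; row[p + i] = -1.0
        A.append(row); rhs.append(-d[i])        # u_i >= d_i - x_i'q
        row = np.zeros(nv); row[:p] = X[i]; row[p + n + i] = -1.0
        A.append(row); rhs.append(d[i])         # o_i >= x_i'q - d_i
    for k in range(1, p):                       # t_k >= |q_k|
        row = np.zeros(nv); row[k] = 1.0; row[p + 2 * n + k - 1] = -1.0
        A.append(row); rhs.append(0.0)
        row = np.zeros(nv); row[k] = -1.0; row[p + 2 * n + k - 1] = -1.0
        A.append(row); rhs.append(0.0)
    bounds = [(None, None)] * p + [(0, None)] * (2 * n + p - 1)
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return res.x[:p]

# Intercept-only check: b = 3, h = 1 gives r = 0.75; with demands 1..5
# the ERM decision is the ceil(5 * 0.75) = 4th order statistic.
X = np.ones((5, 1))
q = nv_erm(X, np.array([1.0, 2.0, 3.0, 4.0, 5.0]), b=3.0, h=1.0)
```

In practice one would sweep `lam` over a grid and pick the value with the lowest cost on a validation set, mirroring the generalization-bias tradeoff discussed above.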
Finally, we have the following result for (NV-KO).

Theorem 7 (Performance bound on (NV-KO)). The following holds with probability at least
1 − δ over the random draw of the sample Sn, where each element of Sn is drawn i.i.d. from an
unknown distribution on X × D, and for all n ≥ 3:

|Rtrue(q∗) − R̂in(q̂κ; Sn)| ≤ (b ∨ h)D̄ [ (2(b ∨ h)/(b ∧ h)) (1/(1 + (n − 1)rw(p))) + (4(b ∨ h)/(1/n + (1 − 1/n)rw(p)) + 1) √(log(2/δ)/(2n)) ]
  + (b ∨ h) ED|xn+1[|q̂κ − q̂|] + (b ∨ h) K √(log n) / n^{1/(2+p/2)},

where rw(p) = exp(−2X²max p/w²), w is the kernel bandwidth, K = √( 9(8 + 5p) / ((4 + p)(1 − 2^{−4/(4+p)}) λ∗2) ),
and λ∗2 = min_{t∈[D,D̄]} fε(t).
As with Theorem 6, the performance bound on (KO) has three components: a bound on the
generalization error, the bias due to optimizing with a scalar decision when the true decision is a
function, and the finite-sample bias term, which is the same as in Theorems 5–6. The generalization
error is of O(1/(rw(p)√n)), so it can be controlled by increasing rw(p) via the kernel bandwidth
w. Setting w = O(√p) gives an error of O(1/√n), which is as good as having the demand
not depend on any features. When w is set to an arbitrarily large number, rw(p) = 1, so the error
rate O(1/√n) cannot be improved upon. It is not surprising that the generalization error can be
made small with large w, since this corresponds to smoother comparisons of the feature vectors
from the past to the one in period n + 1. However, as with (NV-ERM2), increasing w increases the
second term, thus in practice w needs to be optimized over a reasonable range of values. Finally,
the finite-sample bias term plays the same role as the corresponding terms in Theorems 5–6.
Our data comes from the emergency room of a large teaching hospital in the United Kingdom
from July 2008 to June 2009. The data set includes the total number of patients in the emergency
room at 2-hour intervals. We provide box plots of the number of patients by day and by time periods
in Fig. 1. We assumed a nurse-to-patient ratio of 1 to 5, hence the demand is the total number of
patients divided by 5. We do not require the staffing level to be an integer in our predictions, as
multi-skilled workers could be used for part-time work. We also assumed that the hourly wage of
an agency nurse is 2.5 times that of a regular nurse, that is b = 2.5/3.5 and h = 1/3.5, resulting in a
target fractile of r = b/(b + h) = 2.5/3.5. Although the exact agency nurse rate differs by location,
experience and agency, our assumption is a modest estimate (Donnelly and Mulhern 2012).
We considered two sets of features: the first set being the day of the week, time of the day and
m number of days of past demands; the second set being the first set plus the sample average of
past demands and the differences in the order statistics of past demands, which is inspired by the
analysis in Liyanage and Shanthikumar (2005) as described in Sec. 2.4.1. We refer to these features
as Operational Statistics (OS) features. We used n = 1344 past demand observations (16 weeks)
as training data and computed the critical staffing level 3 periods ahead. We then recorded the
out-of-sample newsvendor cost of the predicted staffing level on 1344/2 = 672 validation data on a
rolling horizon basis, following the rule of thumb in Friedman et al. (2009) for choosing the size of
the validation data set. Any parameter that needs calibration was calibrated on the validation data
set. We then applied the algorithms to a test set of 672 unseen observations.
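The rolling-horizon evaluation can be sketched as follows (our own illustration with hypothetical window sizes; the predictor is a placeholder for any of the methods above):

```python
import numpy as np

def rolling_oos_costs(demand, n_train, b, h, predict):
    """Slide a training window of length n_train through the series;
    at each step, fit on the window and record the out-of-sample
    newsvendor cost of the predicted order for the next period."""
    costs = []
    for t in range(n_train, len(demand)):
        window = demand[t - n_train : t]
        q = predict(window)
        d = demand[t]
        costs.append(b * max(d - q, 0.0) + h * max(q - d, 0.0))
    return costs

# Placeholder predictor: SAA on the window (the r-th sample quantile),
# using the cost parameters from the case study.
b, h = 2.5 / 3.5, 1.0 / 3.5
saa = lambda w: float(np.quantile(w, b / (b + h)))

demand = [10.0, 12.0, 11.0, 13.0, 12.0, 14.0]   # hypothetical series
costs = rolling_oos_costs(demand, n_train=4, b=b, h=h, predict=saa)
# One out-of-sample cost per evaluation period (here, 2 of them).
```

Averaging `costs` over the validation periods gives the out-of-sample performance used to calibrate parameters such as w or λ before touching the test set.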
Figure 1 A boxplot of the number of patients in the emergency room (a) by day and (b) by time period.
All computations were carried out on MATLAB2013a with the solver MOSEK and CVX, a
package for specifying and solving convex programs (CVX Research 2012, Grant and Boyd 2008)
on a Dell Precision T7600 workstation with two Intel Xeon E5-2643 processors, each of which has
4 cores, and 32.0 GB of RAM.
In Table 3, we report the out-of-sample performance of the seventeen methods considered. We report
the calibrated parameter (if any), the mean and the 95% confidence interval for the out-of-sample
staffing cost in normalized units (parameters are calibrated by in-sample calculations, which can be
found in Appendix E Tables EC.1–EC.7). In the last column, we report the annual cost savings of
the method relative to SAA-day where there is a statistically significant net cost saving, assuming
a regular nurse salary of £25,000 (which is the Band 4 nurse salary for the National Health Service
in the United Kingdom in 2014) and standard working hours of 37.5 hours per week. Cost savings
in USD are also reported, assuming an exchange rate of £1: USD 1.6.
The best result was obtained by the KO method with OS features, with bandwidth w = 1.62,
which yields a cost improvement of 24% (a saving of £46,555 p.a.) relative to the best practice
benchmark (“SAA-day”) with statistical significance at the 5% level. The next best results were
obtained by the ERM method with ℓ1 regularization and the KO method without OS features, which
have average annual cost improvements of 23% (£44,219 p.a.) and 21% (£39,915 p.a.) respectively.
The computational costs of the feature-based methods are very different, however. The KO
method is three orders of magnitude faster than the ERM-based methods. For instance, it takes just
0.0494 seconds to find the next optimal staffing level using the KO method with OS features, which
is the best in terms of the out-of-sample cost, whereas it takes 114 seconds for the ERM method
Method | Calibrated parameter | Avg. comp. time (per iteration) | Mean (95% CI) | % savings rel. to SAA-day | Annual cost savings rel. to SAA-day
1a. SAA-day | — | 14.0 s | 1.523 (± 0.109) | — | —
1b. Cluster+SAA | — | 14.9 s | 1.424 (± 0.102) | — | —
2a. Ker-0 | w = 0.08 | 0.0444 s | 1.208 (± 0.146) | 20.7% | £39,915 ($63,864)
2b. Ker-OS | w = 1.62 | 0.0494 s | 1.156 (± 0.140) | 24.1% | £46,555 ($74,488)
3a. NV-0 | 12 days | 325 s | 1.326 (± 0.100) | 12.9% | £24,909 ($39,854)
3b. NV-OS | 4 days | 360 s | 1.463 (± 0.144) | — | —
4a. NVreg1 | 1 × 10−7 | 84.5 s | 1.336 (± 0.100) | — | —
4b. NVreg1-OS | 1 × 10−7 | 114 s | 1.174 (± 0.113) | 22.9% | £44,219 ($70,750)
5a. NVreg2 | 5 × 10−7 | 79.6 s | 1.336 (± 0.110) | — | —
5b. NVreg2-OS | 1 × 10−7 | 107 s | 1.215 (± 0.111) | 20.2% | £39,065 ($62,503)
6a. SEO-0 | 1 day | 10.8 s | 1.279 (± 0.099) | 16.0% | £30,952 ($49,523)
6b. SEO-OS | 6 days | 16.1 s | 12.57 (± 10.63) | — | —
7a. SEOreg1 | 5 × 10−1 | 22.1 s | 1.417 (± 0.106) | — | —
7b. SEOreg1-OS | 5 × 10−3 | 25.9 s | 11.95 (± 6.00) | — | —
8a. SEOreg2 | 1 × 10−1 | 26.6 s | 1.392 (± 0.105) | — | —
8b. SEOreg2-OS | 5 × 10−3 | 27.1 s | 12.57 (± 10.63) | — | —
9. Scarf | 12 days | 20.8 s | 1.593 (± 0.114) | — | —
Table 3 A summary of results. We assume the hourly wage of an agency nurse is 2.5 times that of a regular
nurse. We report the calibrated parameter (if any), the average computational time taken to solve one problem
instance, and the mean and the 95% confidence interval for the out-of-sample staffing cost in normalized units. In
the last column, we report the annual cost savings of the method relative to SAA-day where there is a statistically
significant net cost saving, assuming a regular nurse salary of £25,000 (which is the Band 4 nurse salary for the
National Health Service in the United Kingdom in 2014) and standard working hours. A dashed line represents cost
differential that is not statistically significant. Cost savings in USD are also reported, assuming an exchange rate of
£1: USD 1.6.
with ℓ1 regularization, which is the second-best performing method. The KO method is also faster
than SAA-day, the SEO methods and Scarf by two orders of magnitude.
Let us further investigate the staffing decision of the best method: KO-OS with w = 1.62. In Fig. 2
(a), we display the staffing levels predicted by KO-OS with w = 1.62 along with the actual required
levels. For comparison, we also provide the staffing levels predicted by the second-best method,
NVreg1-OS with λ = 1 × 10−7 in Fig. 2 (b). A striking observation is that both KO-OS and NVreg1-
OS methods anticipate periods of high demand fairly well, as evidenced by the matching of the
peaks in the predicted and actual staffing levels. The two methods are otherwise quite different
in the prediction; in particular, the KO-OS method balances both over-staffing and under-staffing,
whereas NVreg1-OS method seems to systematically over-predict the staffing level.
Let us now suppose the hospital indeed implements our algorithm for its nurse staffing decisions.
We wish to gain some insight into the predictions made by the algorithm. In particular, we would like
to know when the hospital is over- or under-staffed, assuming the hospital chooses to implement the
best possible method, provided by KO-OS with w = 1.62. In Figs. 3 and 4, we show the conditional
probability (frequency) of under- and over-staffing by day of the week and by time period. We derive
the following insights from these plots, which could be useful for patients and managers directly:
(i) mid-week days are more likely to be under-staffed than weekends; thus, given the choice to visit
the emergency room on a weekday or a weekend, we would choose a weekend; (ii) the period from
noon to midnight is substantially more likely to be over-staffed than the period from midnight to
noon; thus, given the choice of time to visit the emergency room, we would choose to visit in the
afternoon; and (iii) the algorithm is more likely to over-staff by at least 50% of the required level
on a Monday than on any other day of the week; hence, given the flexibility, we would choose to visit
the emergency room on a Monday.
Figure 2 (a) A time-series plot of actual staffing demand (solid blue) versus staffing levels predicted by
KO-OS with w = 1.62 (best method) on test data set in dotted red. (b) A time-series plot
of actual staffing demand (solid blue) versus staffing levels predicted by NVreg1-OS with λ =
1 × 10−7 (second best method) on test data set in dotted red.
Figure 3 A plot of the conditional probabilities of under-staffing (a) by day and (b) by time period for
KO-OS with w = 1.62 (best method). The conditioning is done by the particular day or the time
period, i.e. the probability of under-staffing given it is a Monday.
Figure 4 A plot of the conditional probabilities of over-staffing by at least 50% (a) by day and (b) by
time period for KO-OS with w = 1.62 (best method). The conditioning is done by the particular
day or the time period, i.e. the probability of over-staffing given it is a Monday.
6. Conclusion
We investigated the newsvendor problem when the DM has historical data on both demands and
p features that are related to the demand. We have analyzed this problem using recent techniques
from machine learning (algorithmic stability theory) as well as theoretical statistics (theory of
quantile estimators). Rather than reiterate the contributions detailed in the Introduction, below we
summarize some practical insights that may not be obvious upon first reading (especially to
readers unfamiliar with machine learning), and discuss potential directions for future research.
Some practical insights from this work are as follows. (i) There is not a single approach to solving
the “Big Data” newsvendor problem. In this paper we proposed three approaches (ERM with and
without regularization, and KO) to using the feature-demand data set and compared them against a
number of other potential approaches to solving the problem, with or without features. This is not
to say these approaches are exhaustive — we are optimistic that new methods can be developed.
(ii) A “Big Data”-driven decision-maker always needs to be wary of overfitting, whereby decisions
perform well in-sample but poorly out-of-sample, yielding decisions that are not only bad but also
misleading. We have shown that overfitting can be controlled by
a-priori feature selection or regularization, which gives the DM extra control to bias the decision
in favour of improved out-of-sample generalization. (iii) The true performance of the in-sample
decision is determined by the generalization error (a.k.a. overfitting error), the bias from regularization,
and the finite-sample bias that arises simply from having a finite amount of data. The DM can control the generalization
error and bias from regularization through the regularization parameter, and in practice needs to
optimize over a range of parameter values on a separate validation data set. Finite-sample bias
cannot be controlled except by collecting more data. (iv) While the KO method dominates all others
in terms of the out-of-sample performance in our case study, this need not be the case for a different
data set. We believe the most important point of our case study is in demonstrating how to carry
out a careful data-driven investigation, not in the case-specific conclusion that the KO method
performed the best. We note, however, that the relative speed of the KO method is generalizable.
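To make insight (i) concrete, the basic ERM approach (NV-ERM1) is a small linear program. The following sketch is our illustration, not the authors' implementation (the paper's experiments used MATLAB/CVX); the function name and the toy data are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def nv_erm1(X, d, b, h):
    """Sketch of (NV-ERM1): choose q so that the linear rule q(x) = q'x
    minimizes the empirical newsvendor cost
    (1/n) * sum_i [ b*(d_i - q'x_i)^+ + h*(q'x_i - d_i)^+ ]."""
    n, p = X.shape
    # Decision vector z = [q (p), u (n), o (n)]; u, o are underage/overage slacks.
    c = np.concatenate([np.zeros(p), (b / n) * np.ones(n), (h / n) * np.ones(n)])
    I, Z = np.eye(n), np.zeros((n, n))
    # u_i >= d_i - q'x_i  <=>  -Xq - u <= -d ;  o_i >= q'x_i - d_i  <=>  Xq - o <= d
    A_ub = np.vstack([np.hstack([-X, -I, Z]), np.hstack([X, Z, -I])])
    b_ub = np.concatenate([-d, d])
    bounds = [(None, None)] * p + [(0, None)] * (2 * n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:p]

# Toy data: an intercept feature plus one covariate driving demand.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(60), rng.uniform(0.0, 1.0, 60)])
d = 10.0 + 5.0 * X[:, 1] + rng.normal(0.0, 1.0, 60)
q = nv_erm1(X, d, b=2.0, h=1.0)   # order q @ x_new for a new feature vector x_new
```

Regularization (NV-ERM2) would add a penalty on q to the objective, with the penalty weight λ tuned on a separate validation set as discussed above.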
There are many directions for follow-up work, and we discuss a few. First, investigating how the
dynamic inventory management problem can be solved with a demand-feature data set remains an
open question. Second, as mentioned in point (i) of the previous paragraph, the methods considered
in this paper are not exhaustive — new methods can be developed, especially if different assumptions
are made on the data-generating process than we have assumed here. It would be interesting to see,
for instance, if Markov-modulated demand processes can be adapted to handle a large number of
features. Finally, extending our theoretical results to more general classes of stochastic
optimization problems remains an open task.
Acknowledgments
This research was supported by London Business School Research and Material Development Scheme (Ban)
and the National Science Foundation grant IIS-1053407 (Rudin). The authors thank Nicos Savva and Stefan
Scholtes for providing the data for the case study. The authors are also grateful to Paul Zipkin, Chung-Piaw
Teo, the anonymous referees and seminar attendees at a number of institutions and conferences for helpful
suggestions.
References
Akaike, Hirotugu. 1974. A new look at the statistical model identification. Automatic Control, IEEE
Transactions on 19(6) 716–723.
Arrow, Kenneth Joseph, Samuel Karlin, Herbert Scarf. 1958. Studies in the mathematical theory of inventory
and production. 1, Stanford University Press.
Azoury, Katy S. 1985. Bayes solution to dynamic inventory models under unknown demand distribution.
Management Science 31(9) 1150–1160.
Belloni, Alexandre, Victor Chernozhukov. 2011. ℓ1-penalized quantile regression in high-dimensional sparse
models. The Annals of Statistics 39(1) 82–130.
Bousquet, Olivier, André Elisseeff. 2002. Stability and generalization. The Journal of Machine Learning
Research 2 499–526.
Burnetas, Apostolos N, Craig E Smith. 2000. Adaptive ordering and pricing for perishable products. Oper-
ations Research 48(3) 436–443.
Chaudhuri, Probal, et al. 1991. Nonparametric estimates of regression quantiles and their local Bahadur
representation. The Annals of Statistics 19(2) 760–777.
Chen, Xin, Melvyn Sim, Peng Sun. 2007. A robust optimization perspective on stochastic programming.
Operations Research 55(6) 1058–1071.
Chernozhukov, Victor, Iván Fernández-Val, Alfred Galichon. 2010. Quantile and probability curves without
crossing. Econometrica 78(3) 1093–1125.
Chernozhukov, Victor, Christian Hansen. 2008. Instrumental variable quantile regression: A robust inference
approach. Journal of Econometrics 142(1) 379–398.
CVX Research, Inc. 2012. CVX: Matlab software for disciplined convex programming, version 2.0. http:
//cvxr.com/cvx.
Devroye, Luc, T Wagner. 1979a. Distribution-free inequalities for the deleted and holdout error estimates.
Information Theory, IEEE Transactions on 25(2) 202–207.
Devroye, Luc, T Wagner. 1979b. Distribution-free performance bounds for potential function rules. Infor-
mation Theory, IEEE Transactions on 25(5) 601–604.
Donnelly, L., M. Mulhern. 2012. NHS pays £1,600 a day for nurses as agency use soars. http://www.telegraph.co.uk/news/9400079/NHS-pays-1600-a-day-for-nurses-as-agency-use-soars.html.
Durrett, Rick. 2010. Probability: theory and examples. Cambridge university press.
Feldman, Richard M. 1978. A continuous review (s, S) inventory system in a random environment. Journal
of Applied Probability 654–659.
Friedman, Jerome, Trevor Hastie, Robert Tibshirani. 2009. The elements of statistical learning: Data mining,
inference, and prediction. Springer Series in Statistics .
Gallego, Guillermo, Ilkyeong Moon. 1993. The distribution free newsboy problem: review and extensions.
Journal of the Operational Research Society 825–834.
Gallego, Guillermo, Özalp Özer. 2001. Integrating replenishment decisions with advance demand information.
Management Science 47(10) 1344–1360.
Godfrey, Gregory A, Warren B Powell. 2001. An adaptive, distribution-free algorithm for the newsvendor
problem with censored demands, with applications to inventory and distribution. Management Science
47(8) 1101–1112.
Grant, M., S. Boyd. 2008. Graph implementations for nonsmooth convex programs. V. Blondel, S. Boyd,
H. Kimura, eds., Recent Advances in Learning and Control . Lecture Notes in Control and Information
Sciences, Springer-Verlag Limited, 95–110. http://stanford.edu/~boyd/graph_dcp.html.
Graves, Stephen C, Harlan C Meal, Sriram Dasu, Yuping Qui. 1986. Two-stage production planning in a
dynamic environment. Multi-stage production planning and inventory control . Springer, 9–43.
Green, Linda V, Sergei Savin, Nicos Savva. 2013. Nursevendor problem: Personnel staffing in the presence
of endogenous absenteeism. Management Science 59(10) 2237–2256.
Hannah, Lauren, Warren Powell, David M Blei. 2010. Nonparametric density estimation for stochastic
optimization with an observable state variable. Advances in Neural Information Processing Systems.
820–828.
Heath, David C, Peter L Jackson. 1994. Modeling the evolution of demand forecasts ith application to safety
stock analysis in production/distribution systems. IIE transactions 26(3) 17–30.
Hofmann, Thomas, Bernhard Schölkopf, Alexander J Smola. 2008. Kernel methods in machine learning. The
Annals of Statistics 1171–1220.
Huh, Woonghee Tim, Paat Rusmevichientong. 2009. A nonparametric asymptotic analysis of inventory
planning with censored demand. Mathematics of Operations Research 34(1) 103–123.
Iida, Tetsuo, Paul H Zipkin. 2006. Approximate solutions of a dynamic forecast-inventory model. Manufac-
turing & Service Operations Management 8(4) 407–425.
Kunnumkal, Sumit, Huseyin Topaloglu. 2008. Using stochastic approximation methods to compute optimal
base-stock levels in inventory control problems. Operations Research 56(3) 646–664.
Levi, Retsef, Georgia Perakis, Joline Uichanco. 2015. The data-driven newsvendor problem: new bounds and
insights. Operations Research 63(6) 1294–1306.
Levi, Retsef, Robin O Roundy, David B Shmoys. 2007. Provably near-optimal sampling-based policies for
stochastic inventory control models. Mathematics of Operations Research 32(4) 821–839.
Liyanage, Liwan H, J George Shanthikumar. 2005. A practical inventory control policy using operational
statistics. Operations Research Letters 33(4) 341–348.
Lovejoy, William S. 1990. Myopic policies for some inventory models with uncertain demand distributions.
Management Science 36(6) 724–738.
Lovejoy, William S. 1992. Stopped myopic policies in some inventory models with generalized demand
processes. Management Science 38(5) 688–707.
Lu, Xiangwen, Jing-Sheng Song, Amelia Regan. 2006. Inventory planning with forecast updates: Approximate
solutions and cost error bounds. Operations Research 54(6) 1079–1097.
Manton, Jonathan H, Pierre-Olivier Amblard, et al. 2015. A primer on reproducing kernel Hilbert spaces.
Foundations and Trends® in Signal Processing 8(1–2) 1–126.
Nadaraya, Elizbar A. 1964. On estimating regression. Theory of Probability & Its Applications 9(1) 141–142.
Parzen, Emanuel. 1979. Nonparametric statistical data modeling. Journal of the American Statistical Asso-
ciation 74(365) 105–121.
Perakis, Georgia, Guillaume Roels. 2008. Regret in the newsvendor model with partial information. Opera-
tions Research 56(1) 188–203.
Powell, Warren, Andrzej Ruszczyński, Huseyin Topaloglu. 2004. Learning algorithms for separable approx-
imations of discrete stochastic optimization problems. Mathematics of Operations Research 29(4)
814–836.
Rockafellar, R Tyrell. 1997. Convex analysis, vol. 28. Princeton University Press.
Rogers, William H, Terry J Wagner. 1978. A finite sample distribution-free performance bound for local
discrimination rules. The Annals of Statistics 506–514.
Scarf, Herbert. 1959a. Bayes solutions of the statistical inventory problem. The Annals of Mathematical
Statistics 490–508.
Scarf, Herbert. 1959b. The optimality of (s,S) policies in the dynamic inventory problem. Mathematical
Methods in the Social Science, KJ Arrow, S. Karlin, P. Suppes, eds.. Stanford University Press,
Stanford.
Scarf, Herbert, KJ Arrow, S Karlin. 1958. A min-max solution of an inventory problem. Studies in the
Mathematical Theory of Inventory and Production 10 201–209.
Schwarz, Gideon. 1978. Estimating the dimension of a model. The Annals of Statistics 6(2) 461–464.
See, Chuen-Teck, Melvyn Sim. 2010. Robust approximation to multiperiod inventory management. Opera-
tions Research 58(3) 583–594.
Shapiro, Alexander, Darinka Dentcheva, Andrzej P Ruszczyński. 2009. Lectures on stochastic programming:
modeling and theory, vol. 9. SIAM.
Song, Jing-Sheng, Paul Zipkin. 1993. Inventory control in a fluctuating demand environment. Operations
Research 41(2) 351–370.
Takeuchi, Ichiro, Quoc V Le, Timothy D Sears, Alexander J Smola. 2006. Nonparametric quantile estimation.
The Journal of Machine Learning Research 7 1231–1264.
Watson, Geoffrey S. 1964. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A
359–372.
e-companion to Ban & Rudin: Big Data Newsvendor ec1
u_i ≥ d_i − q_0 − ⟨φ(x_i), q⟩
o_i ≥ q_0 + ⟨φ(x_i), q⟩ − d_i
u_i, o_i ≥ 0,    (NV-ERM2-KER)

where the optimal order quantity will be q_opt(x) = ⟨φ(x), q*⟩ + q_0*, where q* and q_0* are the solution to
the optimization problem. Following the standard approaches we can derive the convex dual, using
the notation K for the matrix of k(x_i, x_l) values, and α for the vector of n combined Lagrange
multipliers. Note that if k is a valid kernel (i.e. an inner product for a reproducing kernel Hilbert
space), the Gram matrix K is positive semi-definite. The dual problem is thus:

min_α  (1/2) α^T K α − α^T d
s.t.  −h/(2λn) ≤ α_i ≤ b/(2λn)  for all i = 1, . . . , n,  and  Σ_i α_i = 0.
If α* is the solution to this dual problem, which we can compute with a quadratic programming
solver, the solution to the primal will be given by:

q* = Σ_i α_i* φ(x_i),

and q_0* is d_i − ⟨φ(x_i), q*⟩ = d_i − Σ_{l=1}^n α_l* k(x_i, x_l) for any i where α_i* is neither −h/(2λn)
nor b/(2λn). This means that the optimal order quantity q(x) for any new x can be evaluated directly,
using only computations of k(x, x_i). This is given by:

q_opt(x) = Σ_i α_i* k(x, x_i) + q_0*.
One can choose any kernel k so long as it satisfies the property of being a reproducing kernel,
without calculating (or even knowing) the map φ to the reproducing kernel Hilbert space. Typical
kernels are polynomial kernels or Gaussian kernels. The kernel must be symmetric, as it needs to
be an inner product in the reproducing kernel Hilbert space, and inner products are symmetric.
Even if the kernel function is much more complicated than a standard inner product, qopt (x) can
be evaluated for any x, by solving the dual formulation and using the formulae above. This is the
generalization of (NV-ERM) to nonlinear decision rules.
For further readings on the topic, we direct the readers to Hofmann et al. (2008) and Manton
et al. (2015).
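As a concrete illustration of the recipe above, the following sketch (ours, not the authors' implementation; all names are hypothetical) solves the dual for a Gaussian kernel via a generic constrained solver — a dedicated QP solver would normally be preferred — and recovers the intercept and the order rule:

```python
import numpy as np
from scipy.optimize import minimize

def kernel_nv_dual(X, d, b, h, lam, w=1.0):
    """Sketch: solve min_a 0.5*a'Ka - a'd subject to
    -h/(2*lam*n) <= a_i <= b/(2*lam*n) and sum_i a_i = 0,
    then recover the intercept q0* and the kernelized order rule."""
    n = len(d)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * w ** 2))          # Gram matrix (PSD for a valid kernel)
    lo, hi = -h / (2.0 * lam * n), b / (2.0 * lam * n)
    res = minimize(lambda a: 0.5 * a @ K @ a - a @ d,
                   np.zeros(n),
                   jac=lambda a: K @ a - d,
                   bounds=[(lo, hi)] * n,
                   constraints={"type": "eq", "fun": lambda a: a.sum()})
    alpha = res.x
    # q0* from any i with alpha_i strictly between its bounds (fallback if none).
    interior = np.where((alpha > lo + 1e-8) & (alpha < hi - 1e-8))[0]
    i = interior[0] if len(interior) else int(np.argmin(np.abs(alpha)))
    q0 = d[i] - K[i] @ alpha
    def order(x_new):                         # q_opt(x) = sum_i alpha_i k(x, x_i) + q0*
        return np.exp(-((X - x_new) ** 2).sum(1) / (2.0 * w ** 2)) @ alpha + q0
    return alpha, q0, order

# Toy usage with hypothetical data.
rng = np.random.default_rng(1)
Xtr = rng.uniform(0.0, 1.0, (20, 2))
dtr = 5.0 + Xtr[:, 0] + 0.1 * rng.normal(size=20)
alpha, q0, order = kernel_nv_dual(Xtr, dtr, b=2.0, h=1.0, lam=0.1)
```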
where the outer and inner minimization problems correspond to the SAA problem for the subsample
of data corresponding to x = 0 and x = 1 respectively. Hence the solutions are the corresponding
SAA solutions for the appropriate subsample of data, which is the well-known critical fractile of the
inverse empirical cdf as in (4).
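The SAA solution just described — the critical fractile of the inverse empirical cdf — is simply an order statistic of the observed demands, computable by sorting. A minimal sketch (our illustration, not the paper's code; the function name is hypothetical):

```python
import numpy as np

def saa_newsvendor(demands, b, h):
    """SAA order quantity: the b/(b+h) critical fractile of the empirical
    demand cdf, i.e. the ceil(n*b/(b+h))-th order statistic."""
    d = np.sort(np.asarray(demands, dtype=float))
    n = len(d)
    j = int(np.ceil(n * b / (b + h))) - 1     # 0-indexed order statistic
    return d[max(j, 0)]

print(saa_newsvendor([1, 2, 3, 4, 5], b=1, h=1))   # with b = h: the sample median
```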
Proof of Theorem 1: Under Condition A, the following strong result holds via Theorem 4.1.2.
pp. 31 of Csörgö (1983): there exists, for each n_i, i = 0, 1, a Brownian Bridge {B_{n_i}(y), 0 ≤ y ≤ 1}
such that

sup_{0<y<1} | f_i(F_i^{-1}(y)) (F̂_i^{-1}(y) − F_i^{-1}(y)) − B_{n_i}(y)/√n_i | = O(log n_i / n_i)  a.s.    (EC.2)

The above implies, for y = r:

| f_i(F_i^{-1}(r)) (F̂_i^{-1}(r) − F_i^{-1}(r)) − B_{n_i}(r)/√n_i | ≤ O(log n_i / n_i)  a.s.
⟹  F̂_i^{-1}(r) − F_i^{-1}(r) ≤ B_{n_i}(r)/(f_i(F_i^{-1}(r)) √n_i) + O(log n_i / n_i)  a.s.
⟹  E[F̂_i^{-1}(r)] − F_i^{-1}(r) ≤ E[B_{n_i}(r)]/(f_i(F_i^{-1}(r)) √n_i) + O(log n_i / n_i) = O(log n_i / n_i),

where the last line uses Jensen's inequality and the fact that the mean of a Brownian Bridge is zero
everywhere. Hence we get both the finite-sample bias result and the asymptotic optimality result.
Proof of Lemma 2: This is equivalent to the SAA solution for the complete data set.
Proof of Theorem 2: Proof of (15) parallels that of Proposition 1. Proof of (16) then follows from
(15) and Proposition 1.
Proof of (17): The asymptotic convergence of q̂nSAA to its true value is again due to the asymptotic
convergence of the sample quantile estimator, as shown in Theorem 1. The statement then follows.
Proof of Theorem 3. The SAA decision is given by

q̂_n^SAA(x̃) = (F̂_n^0)^{-1}(b/(b+h)) + d̄_n
 = (F̂_n^0)^{-1}(b/(b+h)) + (1/n) Σ_{i=1}^n β^T X_i + (1/n) Σ_{i=1}^n ε_i
 → Q_ε(b/(b+h)) + β^T E[X] + 0  almost surely,

where the convergence of the first term is due to the Glivenko-Cantelli theorem and the convergence
of the last two terms is due to the Strong Law of Large Numbers (SLLN) (see Durrett 2010).
Proof of Theorem 4: Let D ≤ q1 < q2 ≤ D̄ be two newsvendor decisions. Then by considering
the difference C(q2 ; ·) − C(q1 ; ·) for three different regions: D < q1 , q1 ≤ D ≤ q2 and D > q2 , one can
show
C(q2 ; D) − C(q1 ; D)
= −b(q2 − q1 )I(D > q2 ) + h(q2 − q1 )I(D < q1 ) + [bq1 + hq2 − (b + h)D]I(q1 ≤ D ≤ q2 )
= −b(q2 − q1 )I(D > q2 ) + h(q2 − q1 )I(D < q1 ) + [b(q2 − (q2 − q1 )) + hq2 − (b + h)D]I(q1 ≤ D ≤ q2 )
= −b(q2 − q1 )I(D > q1 ) + h(q2 − q1 )I(D < q1 ) + (b + h)(q2 − D)I(q1 ≤ D ≤ q2 ) (EC.3)
Also,
C(q2 ; D) − C(q1 ; D)
= −b(q2 − q1 )I(D > q2 ) + h(q2 − q1 )I(D < q1 ) + [bq1 + h(q1 + (q2 − q1 )) − (b + h)D]I(q1 ≤ D ≤ q2 )
Case I: Let us suppose q1 = q ∗ . Then P(D < q1 ) = b/(b + h) and P(D > q1 ) = h/(b + h), since we
assume the demand is continuous. Thus the expectation of the first two terms in (EC.3) is equal to
zero. The expected cost difference then becomes
familiar to the reader, are algorithm-independent. Uniform generalization bounds do not consider
the way in which the algorithm searches the space of possible models, and as such they lack specific
insights about prediction.
Stability bounds can be thought of as a form of Hoeffding’s inequality for algorithms (recall, for
instance, that the sample-size bounds of Levi et al. 2007 and Levi et al. 2015 are critically based
on Hoeffding’s inequality). To apply Hoeffding’s inequality, we need to know a bound on the values
of a random variable, whereas to apply stability bounds, we need to know a bound on the stability
of an algorithm to a random data set. The challenging part is to prove a stability result that is as
strong as possible for the algorithm, which we here achieve for the newsvendor problem. Stability
bounds have origins in the 1970’s (Rogers and Wagner 1978, Devroye and Wagner 1979a,b), and
recent work includes that of Bousquet and Elisseeff (2002).
To show that our newsvendor algorithms are stable, we first show that the newsvendor cost on
the training set does not change very much when one of the training examples changes. Stability of
a decision is a desirable property, and these bounds show that stability is intimately linked to how
well the decision performs on new observations. One of the main contributions of this section is thus
in showing that (NV-ERM1), (NV-ERM2) and (NV-KO) are strongly stable, which are important
results in their own right.
Our algorithms for the learning newsvendor problem turn out to have a very strong stability
property: they are uniformly stable. We define stability and uniform stability below.
Definition EC.1 (Uniform stability, Bousquet and Elisseeff (2002) Def 6 pp. 504).
A symmetric algorithm A has uniform stability α with respect to a loss function ` if for all Sn ∈ Z n
and for all i ∈ {1, . . . , n},
Here the notation b ∨ h indicates the maximum value of b and h, and b ∧ h indicates the minimum
of the two. This bound illuminates that if one of b or h is too large and the other too small, in
other words if the backordering and holding costs are highly asymmetric, then the algorithm is
very sensitive to small changes in the data set. The algorithm is more stable when the target order
quantity is closer to the median of the demand distribution.
The stability result, coupled with a Hoeffding/McDiarmid based argument (in our case, we are
using more contemporary versions of the results due to Bousquet and Elisseeff 2002), yields the
following:
Proposition EC.2 (Generalization Bound for (NV-ERM1)). Let q̂ be the model produced
by Algorithm (NV-ERM1). The following bound holds with probability at least 1 − δ over the random
draw of the sample Sn , where each element of Sn is drawn iid from an unknown distribution on
X × D:
|R_true(q̂) − R̂(q̂; S_n)| / ((b ∨ h)D̄) ≤ (2(b ∨ h)/(b ∧ h)) (p/n) + ((4(b ∨ h)/(b ∧ h)) p + 1) √(log(2/δ)/(2n)).
For large p/n, we had suggested finding the optimal order quantity by solving (NV-ERM2) instead.
In this case, the generalization is driven by the regularization parameter rather than the ratio p/n,
as well as by b and h as before. The stability result comes first.
α_n^r = ((b ∨ h)² X_max² p) / (2nλ).    (EC.7)
The new stability parameter is O(p/nλ). Thus the stability of the algorithm can be controlled via
the regularization parameter λ.
We will use the following lemma in the proof of Propositions EC.1 and EC.5.
Lemma EC.1 (Tight uniform bound on the newsvendor cost). The newsvendor cost function C(·, ·) is bounded by (b ∨ h)D̄, and this bound is tight in the sense that it is attained for some (q, d) ∈ [0, D̄]².
Proof of Lemma EC.1: Clearly, D̄(b ∨ h) is an upper bound on |C(q, d)| for all q, d ∈ [0, D̄]. Now
if d = 0 and q = D̄, |C(q, d)| = D̄h. Conversely, if d = D̄ and q = 0, |C(q, d)| = D̄b. Hence the upper
bound is attained.
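A quick numeric sanity check of this lemma, with hypothetical values of b, h, and D̄:

```python
# Lemma EC.1 check: |C(q, d)| <= (b v h)*D_bar on [0, D_bar]^2, attained at the corners.
b, h, D_bar = 2.0, 3.0, 10.0
C = lambda q, d: b * max(d - q, 0.0) + h * max(q - d, 0.0)
assert C(D_bar, 0.0) == h * D_bar      # pure overage corner: q = D_bar, d = 0
assert C(0.0, D_bar) == b * D_bar      # pure underage corner: q = 0, d = D_bar
grid = [0.0, 2.5, 5.0, 7.5, 10.0]
assert max(C(q, d) for q in grid for d in grid) <= max(b, h) * D_bar
```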
Now for the proof of Proposition EC.1.
Proof of Proposition EC.1: Symmetry follows from the fact that the data-generating process is
iid. For stability, we will change our notation slightly to make the dependence on n and Sn explicit.
Let

q_n(x) := q_n^T x = Σ_{j=1}^p q_n^j x_j    and    q_{n\i}(x) := q_{n\i}^T x = Σ_{j=1}^p q_{n\i}^j x_j,

where

[q_n^1, . . . , q_n^p] = arg min_{q=[q^1,...,q^p]} R̂(q; S_n),   R̂(q; S_n) = (1/n) Σ_{k=1}^n [ b(d_k − Σ_{j=1}^p q^j x_{kj})^+ + h(Σ_{j=1}^p q^j x_{kj} − d_k)^+ ],

and q_{n\i} is defined analogously with the i-th observation left out. Since C(·, d) is (b ∨ h)-Lipschitz,

sup_{(x,D)∈X×D} |C(q_n(x), D(x)) − C(q_{n\i}(x), D(x))| ≤ sup_{(x,D)∈X×D} (b ∨ h)|q_n(x) − q_{n\i}(x)|.
So we want to bound

|q_n(x) − q_{n\i}(x)| = | Σ_{j=1}^p q_n^j x_j − Σ_{j=1}^p q_{n\i}^j x_j |.
By the convexity of the function R̂(·; S_n), we have (see Section 23 of Rockafellar (1997)):

0 ≤ max_{ν∈∂R̂(q_n; S_n)} Σ_{j=1}^p ν_j (q_{n\i}^j − q_n^j) ≤ R̂(q_{n\i}; S_n) − R̂(q_n; S_n),

where the max over ν can be attained because ∂R̂(q_n; S_n) is a compact set. Denote this maximum
ν*. We thus have

R̂(q_{n\i}; S_n) − R̂(q_n; S_n) ≥ |ν*^T (q_{n\i} − q_n)| = | Σ_{j=1}^p ν_j* (q_{n\i}^j − q_n^j) |
 ≥ |ν_j* (q_{n\i}^j − q_n^j)| = |ν_j*| |q_{n\i}^j − q_n^j|  for all j = 1, . . . , p,

where the second inequality is because ν_j*(q_{n\i}^j − q_n^j) > 0 for all j, because R̂(·; S_n) is piecewise linear
and nowhere flat. Thus we get, for all j = 1, . . . , p,

|q_{n\i}^j − q_n^j| ≤ (R̂(q_{n\i}; S_n) − R̂(q_n; S_n)) / |ν_j*|.
Let us bound R̂(q_{n\i}; S_n) − R̂(q_n; S_n). Note

R̂(q_n; S_n) = ((n−1)/n) R̂(q_n; S_n^{\i}) + (1/n) R̂(q_n; S_i)
 ≥ ((n−1)/n) R̂(q_{n\i}; S_n^{\i}),

since q_{n\i} is the minimizer of R̂(·; S_n^{\i}). Also, R̂(q_n; S_n) ≤ R̂(q_{n\i}; S_n) since q_n is by definition the
minimizer of R̂(·; S_n). Putting these together, we get

((n−1)/n) R̂(q_{n\i}; S_n^{\i}) − R̂(q_{n\i}; S_n) ≤ R̂(q_n; S_n) − R̂(q_{n\i}; S_n) ≤ 0
⟹  |R̂(q_n; S_n) − R̂(q_{n\i}; S_n)| ≤ R̂(q_{n\i}; S_n) − ((n−1)/n) R̂(q_{n\i}; S_n^{\i})
 = ((n−1)/n) R̂(q_{n\i}; S_n^{\i}) + (1/n) R̂(q_{n\i}; S_i) − ((n−1)/n) R̂(q_{n\i}; S_n^{\i})
 = (1/n) |R̂(q_{n\i}; S_i)|.
Thus

sup_{(x,D)∈X×D} (b ∨ h)|q_n(x) − q_{n\i}(x)| ≤ sup_{(x,D)∈X×D} (b ∨ h) Σ_{j=1}^p |q_n^j − q_{n\i}^j| |x_j|
 ≤ sup_{(x,D)∈X×D} (b ∨ h) Σ_{j=1}^p (|x_j|/|ν_j*|) · (R̂(q_{n\i}; S_n) − R̂(q_n; S_n))
 ≤ sup_{(x,D)∈X×D} ((b ∨ h)/n) Σ_{j=1}^p (|x_j|/|ν_j*|) · |R̂(q_{n\i}; S_i)|.    (EC.9)
We can further simplify the upper bound (EC.9) as follows. Recall that ν* is the subgradient of
R̂(·; S_n) at q_n that maximizes Σ_{j=1}^p ν_j(q_{n\i}^j − q_n^j); and as ∂R̂(q_n; S_n) is compact (by the convexity
of R̂(·; S_n)), we can compute ν* exactly. It is straightforward to show:

ν_j* = −b x_j  if q_{n\i}^j − q_n^j ≤ 0,  and  ν_j* = h x_j  if q_{n\i}^j − q_n^j ≥ 0,  for all j.
We can thus bound 1/|νj∗ | by 1/[(b ∧ h)|xj |]. By using the tight uniform upper bound (b ∨ h)D̄ on
each term of |R̂(·, ·)| from Lemma EC.1, we get the desired result.
We move on to the main result needed to prove Proposition EC.3. First, we introduce some terminology.
Note that Rp is a reproducing kernel Hilbert space where the kernel is the standard inner product.
Thus, κ in our case is Xmax .
Proof of Proposition EC.3: By the Lipschitz property of C(·; d),
sup |C(q1 (x), d) − C(q2 (x), d)| ≤ (b ∨ h)|q1 (x) − q2 (x)|, ∀q1 (x), q2 (x) ∈ Q
d∈D
Proposition EC.4 (Generalization Bound for (NV-ERM2)). Let q̂ be the model produced
by Algorithm (NV-ERM2) with `2 regularization. The following bound holds with probability at least
1 − δ over the random draw of the sample Sn , where each element of Sn is drawn iid from an unknown
distribution on X × D:
|R_true(q̂) − R̂(q̂; S_n)| / ((b ∨ h)D̄) ≤ ((b ∨ h) X_max² p)/(D̄ n λ) + ((2(b ∨ h) X_max² p)/(D̄ λ) + 1) √(log(2/δ)/(2n)).    (EC.10)
We now state stability and generalization bound results for (NV-KO).
Proposition EC.5 (Uniform stability of (NV-KO)). The algorithm (NV-KO) with iid data
and the Gaussian kernel is symmetric with respect to the newsvendor cost function C(·, ·) with
uniform stability parameter
α_κ = (D̄(b ∨ h)²/(b ∧ h)) · 1/(1 + (n − 1)r_w),    (EC.11)

where r_w = exp(−2X_max²/w²).
The first term in the kernel stability parameter is the same as for (NV-ERM1); hence the insight
that highly asymmetric backordering and holding costs lead to unstable decisions still holds. The
second term depends on n and rw , which in turn depends on the bandwidth w and the number of
features p through Xmax . The term rw goes from zero to one as w increases from zero to infinity;
this shows that greater stability is achieved with a larger bandwidth. This is an intuitive result, as
a larger bandwidth is associated with bundling feature observations closer together. With this in
mind we state the generalization bound for (NV-KO).
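The bandwidth effect can also be checked numerically. The sketch below is our illustration with hypothetical constants: it plugs in the r_w and α_κ expressions from Proposition EC.5 and confirms that r_w grows with w while the stability parameter shrinks.

```python
import math

# Hypothetical problem constants.
X_max, n, D_bar, b, h = 1.0, 100, 10.0, 2.0, 1.0
r_w = lambda w: math.exp(-2.0 * X_max ** 2 / w ** 2)   # form in Proposition EC.5
alpha_kappa = lambda w: (D_bar * max(b, h) ** 2 / min(b, h)) / (1.0 + (n - 1) * r_w(w))
widths = [0.5, 1.0, 2.0, 8.0]
# r_w increases from 0 toward 1 as the bandwidth w grows ...
assert all(r_w(a) < r_w(c) for a, c in zip(widths, widths[1:]))
# ... so the uniform stability parameter alpha_kappa decreases.
assert all(alpha_kappa(a) > alpha_kappa(c) for a, c in zip(widths, widths[1:]))
```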
Proof of Proposition EC.5: The proof parallels that of Proposition EC.1. Symmetry follows from
the fact that the data-generating process is i.i.d. For stability, we will change our notation slightly
to make the dependence on n and Sn explicit. Let
q_n^κ = arg min_{q≥0} R̃(q; S_n, x_{n+1}) = arg min_{q≥0} [ Σ_{j=1}^n κ_j (b(d_j − q)^+ + h(q − d_j)^+) ] / [ Σ_{j=1}^n κ_j ],
where Si = (xi , di ).
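The kernel-weighted objective above is a weighted newsvendor loss in the scalar q, so its minimizer is a weighted b/(b + h) quantile of the observed demands, computable by sorting. A minimal sketch (our illustration, not the paper's code; names are hypothetical):

```python
import numpy as np

def ko_order(X, d, x_new, b, h, w=1.0):
    """Sketch of (NV-KO): Gaussian kernel weights kappa_j, then the weighted
    b/(b+h) quantile of the demands d_j, found by sorting."""
    kappa = np.exp(-((X - x_new) ** 2).sum(axis=1) / (2.0 * w ** 2))
    idx = np.argsort(d)
    cum = np.cumsum(kappa[idx]) / kappa.sum()
    r = b / (b + h)
    return d[idx][np.searchsorted(cum, r)]   # first sorted demand with cum. weight >= r
```

With uniform weights (all observations equidistant from x_new) this reduces to the SAA critical fractile.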
By definition, the algorithm is stable if for all Sn ∈ Z n and i ∈ {1, . . . , n},
where αn ≤ O(1/n). Now for a fixed x, we have, by the Lipschitz property of C(q; ·),
for all ν ∈ ∂ R̃(qnκ ; Sn , xn+1 ) (set of subgradients of R̃(·; Sn , xn+1 ) at qnκ ). Further, because 0 ∈
∂ R̃(qnκ ; Sn , xn+1 ) by the optimality of qnκ , we have
0 ≤ max_{ν∈∂R̃(q_n^κ; S_n, x_{n+1})} ν(q_{n\i}^κ − q_n^κ) ≤ R̃(q_{n\i}^κ; S_n, x_{n+1}) − R̃(q_n^κ; S_n, x_{n+1}),
where the max over ν can be attained because ∂ R̃(qnκ ; Sn , xn+1 ) is a compact set. Denote this
maximum ν ∗ .
Following arguments parallel to those of the proof of Proposition EC.1,
sup_{(x,D)∈X×D} (b ∨ h)|q_n^κ − q_{n\i}^κ| ≤ sup_{(x,D)∈X×D} ((b ∨ h)/|ν*|) · (R̃(q_{n\i}^κ; S_n, x_{n+1}) − R̃(q_n^κ; S_n, x_{n+1}))
 ≤ sup_{(x,D)∈X×D} ((b ∨ h) κ_i / (|ν*| Σ_j κ_j)) · |R̃(q_{n\i}^κ; S_i, x_{n+1})|.    (EC.12)
The supremum is thus achieved by the infimum of the ratio κ_j/κ_i over X × X × X. For the Gaussian
kernel,

inf_{(x_i,x_j,x_{n+1})∈X×X×X} K_w(x_{n+1} − x_j)/K_w(x_{n+1} − x_i) = e^{−4X_max² p/(2w²)}/e^0 = e^{−2X_max² p/w²} := r_w.
Finally, we can bound 1/|νj∗ | by 1/(b ∧ h), as in the proof for Proposition EC.1. By using the tight
uniform upper bound (b ∨ h)D̄ on each term of |R̃(·; ·, xn+1 )| from Lemma EC.1, we get the desired
result.
Proposition EC.6 (Generalization Bound for (NV-KO)). Let q̂ κ be the optimal decision of
(NV-KO) with the Gaussian kernel. The following bound holds with probability at least 1 − δ over the
random draw of the sample Sn , where each element of Sn is drawn iid from an unknown distribution
on X × D:
|R_true(q̂^κ) − R̂(q̂^κ; S_n)| / ((b ∨ h)D̄) ≤ (2(b ∨ h)/(b ∧ h)) · 1/(1 + (n − 1)r_w) + ((4(b ∨ h)/(b ∧ h)) · 1/(1 + (n − 1)r_w) + 1) √(log(2/δ)/(2n)).
Figure EC.1 A plot illustrating that the difference |C(qn (x), D(x)) − C(qn\i (x), D(x))| is bounded.
We need the following lemma to wrap up the proofs of Propositions EC.2, EC.4 and EC.6.
Lemma EC.2. Let q̂(Sn ) be an in-sample decision with uniform stability αn with respect to a loss
function ` such that 0 ≤ `(q̂(Sn ), z) ≤ M , for all z ∈ Z and all sets Sn of size n. Then for any n ≥ 1
and any δ ∈ (0, 1), the following bound holds with probability at least 1 − δ over the random draw of
the sample Sn :
|R_true(q̂(S_n)) − R̂(q̂(S_n), S_n)| ≤ 2α_n + (4nα_n + M) √(log(2/δ)/(2n)).
Proof of Lemma EC.2: The result is obtained by extending Theorem 12 of Bousquet and Elis-
seeff (2002) on pp. 507 by using the two-sided version of McDiarmid’s inequality.
We can now put the results together to prove Propositions EC.2, EC.4 and EC.6.
Proof of Proposition EC.2: The result follows from Proposition EC.1 and Lemma EC.2.
Proof of Proposition EC.4: By Lemma EC.1, 0 ≤ `(AS , z) ≤ D̄(b ∨ h) for all z ∈ Z and all sets
S . The result then follows from Proposition EC.3 and Lemma EC.2.
Proof of Proposition EC.6: The result follows from Proposition EC.5 and Lemma EC.2.
Before proving Theorems 5, 6 and 7, we first prove the following lemma, which shows the equivalence
of the newsvendor objective function to the nonparametric regression loss function up to a
multiplicative factor.

Lemma EC.3. The newsvendor objective function is a constant multiple of the nonparametric
regression loss function H_r(·):

C(q; D) ≡ H_r(q − D) := (1/2)[|q − D| + (2r − 1)(q − D)].
Proof of Theorem 5:

|R_true(q*) − R̂_in(q̂; S_n)| ≤ |R_true(q̂) − R̂_in(q̂; S_n)| + |R_true(q*) − R_true(q̂)|
 ≤ (2(b ∨ h)² D̄/(b ∧ h)) (p/n) + ((4(b ∨ h)² D̄/(b ∧ h)) p + (b ∨ h)D̄) √(log(2/δ)/(2n)) + |R_true(q*) − R_true(q̂)|,
because the first term in the last inequality is the generalization bound in Proposition EC.2. As in
the proof of Theorem 7, we can bound the second term by
Finally, the term E|q ∗ − q̂ | is the finite-sample bias of the decision q̂ , which is the b/(b + h) quantile of
the conditional distribution of D|xn+1 by Lemma EC.3. Then, by Theorem 3.2. of Chaudhuri et al.
(1991), we have |q* − q̂| = O(n^{(m−k−γ)/(2(k+γ)+p)}) almost surely, where m is the order of derivatives (of
the demand model) required to determine the true conditional quantile, k is the order of derivatives
taken by the estimation procedure and 0 < γ ≤ 1 is the Hölder continuity exponent of the demand
model [see Chaudhuri et al. (1991) for the exact definitions]. For the linear demand model, we have
γ = 1, m = 0 and k = 1, and we obtain the rate n^{−1/(2+p/2)}.
Computing the lower bound on the constant K is quite involved and below we present only the
last steps in the computation. Following the steps in the proof of Theorem 3.2. of Chaudhuri et al.
(1991), we require the convergence of the series
Σ_{n=1}^∞ n^{p+1 − p(p+1)/(4+p) − c_7},

where c_7 = (1 − ε_1)²(c_5*)² K² c̃/(p + 1), where c̃ ∈ (c_1, c_2), c_1 and c_2 are as in the proof of Theorem
3.2. of Chaudhuri et al. (1991), c_5* is as in the proof of Proposition 6.1. of Chaudhuri et al. (1991)
and ε_2 ∈ (0, 1) is such that

c̃ n^{4/(4+p)} − (p + 1) ≥ (1 − ε_2) c̃ n^{4/(4+p)}

for all n > N_1, where N_1 is required in Fact 6.5. of Chaudhuri et al. (1991). Rearranging and using
the fact that over-harmonic series are convergent, we arrive at the requirement

K² > ((8 + 5p)(1 + p)/(4 + p)) · 1/((1 − ε_1)²(c_5*)² c̃).
For a tighter bound, we want to find the smallest lower bound on K , i.e. the smallest possible lower
bounds on ε1 , and the largest possible upper bounds on c∗5 and c̃.
From the proof of Proposition 6.1. of Chaudhuri et al. (1991), we have c∗5 equal to λ∗2 ρ/2, where
ρ is equal to 1/3 and λ∗2 is given by
λ_2* := min_{t∈[D, D̄]} f_ε(t),
Proof of Theorem 6:

|R_true(q*) − R̂_in(q̂_λ; S_n)| ≤ |R_true(q̂_λ) − R̂_in(q̂_λ; S_n)| + |R_true(q̂) − R_true(q̂_λ)| + |R_true(q*) − R_true(q̂)|
 ≤ (b ∨ h)[ ((b ∨ h) X_max² p)/(nλ) + ((2(b ∨ h) X_max² p)/λ + D̄) √(log(2/δ)/(2n)) ]
 + |R_true(q̂) − R_true(q̂_λ)| + |R_true(q*) − R_true(q̂)|,
because the first term in the last inequality is the generalization bound in Proposition EC.4. As in
the proof of Theorem 5, we can bound the second and the third terms by
The first term E|q̂ − q̂λ | is the finite-sample bias that results from regularization, which depends
on the exact regularization used and on the demand distribution. The second term E|q ∗ − q̂ | is the
finite-sample bias of the decision q̂ , which we bound as in the proof for Theorem 5.
Proof of Theorem 7: We have

|R_true(q*) − R̂_in(q̂^κ; S_n)| ≤ |R_true(q̂^κ) − R̂_in(q̂^κ; S_n)| + |R_true(q*) − R_true(q̂^κ)|
 ≤ (2(b ∨ h)² D̄/(b ∧ h)) · 1/(1 + (n − 1)r_w) + ((4(b ∨ h)² D̄/(b ∧ h)) · 1/(1 + (n − 1)r_w) + (b ∨ h)D̄) √(log(2/δ)/(2n))
 + |R_true(q*) − R_true(q̂^κ)|,
because the first term in the last inequality is the generalization bound in Proposition EC.6. As in
the proof of Theorem 5, we can bound the second and the third terms by
The first term E|q̂ − q̂ κ | is the finite-sample bias that results from kernel-weights optimization, which
depends on the choice of the kernel and on the demand distribution. The second term E|q ∗ − q̂ | is
the finite-sample bias of the decision q̂ , which we bound as in the proof for Theorem 5. The result
then follows from similar arguments to the proof for Theorem 5.
|Rtrue (q̂) − Rtrue (q ∗ )| ≤ |Rtrue (q̂) − Rin (q̂; Sn )| + |Rin (q̂; Sn ) − Rtrue (q ∗ )|.
The first term is the generalization bound from Proposition EC.2; thus it is less than

(b ∨ h)D̄ [ 2(b ∨ h)/((b ∧ h)n) + (4(b ∨ h)/(b ∧ h) + 1) √(log(2/δ)/(2n)) ]
= P(|R_in(q̂; S_n) − R_true(q*)| ≤ ε_2 ; |q̂ − q*| ≤ ∆_{n,2}) + P(|R_in(q̂; S_n) − R_true(q*)| ≤ ε_2 ; |q̂ − q*| > ∆_{n,2}),

where ∆_{n,2} = M n^{−2/(4+p)} √(log n), for some appropriately large M such that |q̂ − q*| ≤ ∆_{n,2} almost
surely (note such M exists because |q̂ − q*| = O(n^{−2/(4+p)} √(log n)); see the proof of Theorem 5). We
can further bound the above by
The final expression is a statement about the concentration of the sample average Rin (q ∗ ; Sn ) about
its mean Rtrue (q ∗ ), so we can use the well-known Hoeffding’s (concentration) inequality to get
n ≥ (b ∨ h)D̄ · log(2/δ) / (2(ε_2 − (b ∨ h)∆_{n,2})²),

whenever

n ≥ (b ∨ h)D̄ max{ (4(b ∨ h)/(b ∧ h) + 1)² log(2/δ)/(2ε_1²),  log(2/δ)/(2(ε_2 − (b ∨ h)∆_{n,2})²) },
Set ε_1 = (b ∨ h)D̄ ε − ∆_{1,n} = (b ∨ h)D̄ [ε − 4(b ∨ h)/((b ∧ h)n)] and ε_2 = (b ∨ h)D̄ ε + (b ∨ h)∆_{n,2}. Then
we get
|R_true(q̂) − R_true(q*)| / ((b ∨ h)D̄) ≤ ε

whenever

n ≥ (log(2/δ)/(2(b ∨ h)D̄)) max{ (4(b ∨ h)/(b ∧ h) + 1)² / [ε − 4(b ∨ h)/((b ∧ h)n)]²,  1/ε² }
 ≥ (1/(2(b ∨ h)D̄)) (4(b ∨ h)/(b ∧ h) + 1)² log(2/δ)/ε²,
which is the same as the sampling bound of Levi et al. (2007) up to a constant factor (which bound
is tighter depends on the exact values of b, h and D̄).