SSRN Id3952925

Inventory Control and Learning for One-Warehouse
Multi-Store System with Censored Demand

Recep Yusuf Bekci
Desautels Faculty of Management, McGill University [email protected]
Mehmet Gümüş
Sentao Miao
Motivated by our collaboration with one of the largest fast-fashion retailers in Europe, we study a two-echelon
inventory control problem called the One-Warehouse Multi-Store (OWMS) problem when the demand distri-
bution is unknown. This system has a central warehouse that receives an initial replenishment and distributes
its inventory to multiple stores in each time period during a finite horizon. The goal is to minimize the total
expected cost which consists of shipment costs, holding costs, lost-sales costs, and end-of-horizon disposal
costs. The OWMS system is ubiquitous in supply chain management, yet its optimal policy is notoriously
difficult to calculate even under complete demand distribution case. In this work, we consider the OWMS
problem when the demand distribution is unknown a priori. By observing the censored demand, the firm
has to jointly learn the demand and make inventory control decisions on the fly. We first develop a learning
algorithm based on empirical demand distribution and prove an upper bound on its theoretical perfor-
mance when the demand information is uncensored. Then, in the censored demand case, we propose a more
sophisticated algorithm based on a primal-dual learning and optimization approach. Results show that both
algorithms have great theoretical and empirical performances.
1. Introduction
Rapidly changing market conditions, product proliferation, and ever-shifting buying habits of con-
sumers made the retail sector more unpredictable than ever. As if that were not enough, many
retailers are now faced with supply-side glitches due to rising globalization, which left many strug-
gling to find ways to fulfill demand as quickly and cost-efficiently as possible (Petro 2021). Along
these lines, the last decade witnessed increasing use of analytics and data-driven technologies that
enabled supply chain planners to connect the voice of customers into their inventory management
decisions (Nicasio 2021). Indeed, a recent survey conducted by Deloitte Consulting (Davenport
et al. 2019) among 1048 executives who work at large companies (501 and more employees) sug-
gests that executives who incorporate data-driven insights into their approach are 24% more likely
to have exceeded their business goals. That said, the same study also reports that while everyone is
aware of the benefits of data-driven approaches, the majority of them either lack the technological
or human resource infrastructure or fail to integrate the real-time data into their decision-making
process.
Electronic copy available at: https://ssrn.com/abstract=3952925

2
The inventory planners in retail companies are no stranger to these challenges. According to
2021 Inventory Optimization Survey Report (enVista 2021) conducted among 100 inventory man-
agement professionals, while 35% of the respondents consider inventory optimization a strategic
priority and have projects in place to deliver the required capabilities, the majority reports that
the top three challenges are the lack of an effective process to incorporate real-time data into inven-
tory planning, high demand volatility, and reduced forecast accuracy. Aimed at addressing these
challenges, in this paper, we focus on developing effective real-time data-driven inventory control
strategies whose objective is to not only match demand with supply but also learn it on the fly.
Our setting in this paper is motivated by our collaboration with one of the largest fashion
retailers in Europe. Similar to many fast fashion retail chains characterized by quick response
capabilities (Caro and Gallien 2010), our industry partner first receives the new products from its
suppliers into its main warehouse and then distributes the new products to its network of stores.
While inbound shipments are executed in very large quantities and less frequent fashion, outbound
shipments are executed in small quantities and more frequent fashion. This is very typical in a
fast fashion supply chain with tens of thousands of new SKUs introduced to hundreds of stores
in a relatively short period of time where the last-mile shipments to the stores are spread out in
a frequent fashion across the selling season due to logistical considerations. In fact, our industry
partner operates more than 70 stores in a densely populated area (overall more than 400 stores in
its network), as such, the shipments of new products from its central warehouse to its stores are
conducted in an almost daily basis. Furthermore, since the majority of the SKUs designed for a
new season have never been sold before, the retailers in this industry almost always start every
season with very limited data and improve their forecast on the fly based on actual sales during the
season. Indeed, before each selling season, our industry partner conducts focus group analyses to
gather opinions from a group that consists of 10-20 potential customers about the future demand
of a selected SKU. Since the sample size is extremely low, the resulting range for the potential
demand turns out to be inevitably very wide, which leaves the inventory planner with almost no
clue about the true underlying distribution. Using this as our main setting, in this paper, we aim to
develop several data-driven strategies for in-season inventory planning problem between the central
warehouse and the multiple stores that face demand uncertainty with unknown distribution. In
practice, the majority of the SKUs managed by the fast fashion retailers (e.g., approximately more
than 75% of the SKUs in our partner retailer) fall into this category in the sense that once the
initial inventory is shipped to the central warehouse, there is no inventory replenishment decision
(i.e., in-season inventory planning) between the suppliers and central warehouse throughout the
selling season. Therefore, once the season starts, the focus of an inventory planner in a typical fast

3
fashion retailer shifts from deciding on initial pre-season order quantity to optimizing the in-season
inventory shipments from the central warehouse to the stores based on realized sales.
In order to capture the above setting, we develop a two-tier supply chain model, where the
upstream tier consists of one central warehouse and the downstream tier consists of multiple stores.
The central warehouse is replenished at the beginning of the horizon, whereas sales are observed at
the store level and the inventory is distributed periodically from the central warehouse to different
stores during the planning horizon. To model the business environment described above, we assume
that the initial inventory in the central warehouse is fixed and not replenished during the planning
horizon. Aligned with the above setting, the main decisions are periodic shipments to the stores,
which face stochastic demand with unknown distribution. Unsatisfied demand is lost and remaining
inventories in the stores are carried over to the next period. The goal of the firm is to minimize the
total expected cost, which includes shipment cost (of transferring items from the central warehouse
to stores), holding cost, lost-sales cost (of unsatisfied demands in the stores), and end-of-horizon
disposal cost (of remaining inventory in the central warehouse).
The above-mentioned setting, also referred to as one-warehouse-multiple-store (OWMS) model
in OM literature, has received a lot of attention from academia, for its prevalence not only in
fast-fashion but also in many other real-life applications such as grocery chains, cash management
at ATMs, and gas stations and production planning. As shown in a seminal paper by Clark and
Scarf (1960), it is a well-known fact that OWMS model has complicated state-dependent optimal
policies, which suffer from the curse of dimensionality. Therefore, preceding studies on the OWMS
model (see e.g. Marklund and Rosling 2012, Miao et al. 2020, Lei et al. 2020, Nambiar et al. 2020,
Chao et al. 2021) restricted their attention to developing computationally efficient algorithms with
provable worst-case bounds. That said, all of these studies assume that the demand distribution
is known in advance. This paper is the first work focusing on the setting of unknown demand
information, which is prevalent in fast fashion industry due to the majority of SKUs being new
seasonal products. In this work, we analyze the OWMS model with demand learning in two dif-
ferent settings, namely, with and without demand censoring. When the firm is able to observe the
uncensored demand, we show that an efficient inventory control policy can be developed using
sample average approximation by constructing empirical demand distributions. However, when the
demand is censored so that only sales (instead of actual demand) can be observed, this approach
no longer works as it requires gathering uncensored samples to draw inference from the data (see
e.g. Kaplan and Meier 1958, Huh et al. 2011). This complication can be rendered as an exploration-
exploitation tradeoff. Exploration, in this manner, can be defined as gathering uncensored samples
which requires keeping excessive inventory on hand to prevent censorship. On the other hand,
exploitation is keeping optimal (or near-optimal) inventory levels that are identified from the data.

4
It is clear that efficient exploitation requires sufficient exploration; on the other hand, too much
exploration leads to excessive costs from over-ordering, which may even hurt our decisions later.
Our main results and contributions can be summarized as follows.
1. Benchmark and demand learning without demand censoring. Due to well-known
complex structure of optimal policy in the OWMS system (Clark and Scarf 1960), we develop
our learning algorithms, both with and without demand censoring, based on a benchmark
approximation algorithm in Miao et al. (2020). When there is no demand censoring, we pro-
pose an algorithm named EDAF which utilizes the empirical demand distributions in each
period to compute the policy based on the benchmark. Convergence results of sample average
√
approximation are developed and the algorithm EDAF is proved to have a regret of Õ(N T )
where N is the number of stores and T is the length of the horizon.
2. Double Binary Search (DBS) algorithm with demand censoring. When the demand
is censored, as we discussed earlier, optimization based on empirical demand distribution
requires us to observe uncensored demand data, which might hurt the performance of the
algorithm by deliberately ordering too many inventories to the stores. To tackle this challenge,
we propose another algorithm named DBS based on the Lagrangian relaxation heuristic. This
algorithm has a primal-dual learning and optimization framework of finding the optimal dual
variable and the optimal base-stock levels of any fixed dual. Theoretical results show that the
√
algorithm DBS again achieves a regret of Õ(N T ).
3. Thorough numerical experiments based on synthetic and real data. We present two
sets of experiments. First, we experiment on synthetic data for performing a sensitivity analysis
of input parameters such as horizon length, cost parameters, and demand variability. In the
second set of experiments, we utilize a real dataset obtained from our partner fast-fashion
company to infer parameters and demand distributions. We also apply what-if scenarios to
add robustness to our evaluation in a subset of three products with low, medium, and high
prices respectively. In both experiments, the numerical results demonstrate that our DBS
algorithm significantly outperforms all benchmarks.
The rest of the paper is organized as follows. In Section 2, we review the relevant literature.
Section 3 contains the description of the OWMS model and notation. In Section 4 we present
the details about our benchmark policy and the EDAF algorithm for the uncensored demand
case, together with its theoretical performance. The DBS algorithm for censored demand and its
performance analysis is given in Section 5. We carry out numerical experiments in Section 6 using
synthetic and real data. Finally, we conclude in Section 7.

5
2. Literature Review
Our research falls within the general theme of inventory control in the presence of unknown demand
distribution. However, to the best of our knowledge, ours is the first paper in the literature that
studies this problem in the context of OWMS model. As such, there are two streams of research
directly related to our paper. In what follows, we review each stream separately.
Inventory control with demand learning. Inventory control has been an important research
area since decades ago (see e.g., Zipkin 2000, Porteus 2002 for detailed analysis and survey for
some classical work). While earlier work mostly focuses on the inventory control and optimization
with complete demand information, recently there is a growing body of literature investigating the
case when demand is unknown and has to be learned either from an offline dataset or on the fly.
For the demand model, the literature either assumes that it is a parametric model with unknown
parameter, or it is a nonparametric model, which is the one we consider in this paper.
For parametric demand model, usually, the decision-maker has an idea or prior belief about
the demand distribution. Afterwards, the decision-maker updates their belief as they discover new
information. Bayesian inference is one of the methods to update the parameter (see Scarf 1959,
Azoury 1985), and this method has also been applied in censored demands (see Chen 2010).
Compared with the parametric demand model, many other work focus on the nonparametric
demand model which is more flexible and requires fewer assumptions. Nonparametric demand mod-
els do not assume any functional form of the demand distribution. If the decision-maker has access
to a historical dataset, the learning and optimization can be done with this on-hand information.
Since there is no active interaction with the environment (i.e., sales operation and the data collec-
tion process), this type of learning is called offline learning. On the other hand, in contrast with
the offline setting, the online learning starts with zero data and requires direct interaction with the
environment.
The offline learning problem for inventory control has been extensively studied. Levi et al. (2007)
apply an empirical distribution function based solution to the offline newsvendor problem, and
they report the number of samples required for achieving an approximately optimal solution. Later,
Levi et al. (2015) improve this bound for fewer number of samples. Similar problems have been
studied in other literature as well with different settings: Cheung and Simchi-Levi (2019) study
this problem with capacity constraint; Qin et al. (2019) address the joint pricing and inventory
problem using the empirical distribution method; Ban and Rudin (2019) extend the solution using
contextual information; Bu et al. (2020) study the joint pricing and inventory control problem with
censored data; Ban (2020) consider fixed cost of ordering with censored demand.
Our paper belongs to the online learning stream. For newsvendor model, Besbes and Muhar-
remoglu (2013) study the implication of demand censoring, and Godfrey and Powell (2001) use

6
sample gradients to estimate piecewise-linear concave functions to approximate the cost function.
Compared with the newsvendor model, more research focus on the dynamic programming model
where inventories are carried over to the next period. For instance, Huh and Rusmevichientong
(2009) solve an multi-period inventory control problem with demand censoring by applying some
stochastic gradient descent method (this method has been used in other work as well, see e.g.,
Shi et al. 2016, Zhang et al. 2018, Yuan et al. 2021), and Huh et al. (2011) use Kaplan-Meier
estimator to tackle the censored data. Besides pure inventory control, some recent work also study
the joint pricing and inventory problem: Chen et al. (2019) use linear function approximation for
the demand estimation; Chen et al. (2020b) consider fixed costs; Chen et al. (2020c) apply binary
and tri-section search and Chen et al. (2021) use spline approximation to tackle demand censoring;
Chen et al. (2020a) study limited price changes.
One-Warehouse Multi-Store System. The research interest for OWMS system dates back
to the seminal paper by Clark and Scarf (1960) on multi-echelon inventory systems. Due to the
computational difficulty pointed out in Clark and Scarf (1960), researchers focus on developing
efficient heuristics on this problem. One stream of the OWMS system is when the inventory in
the central warehouse is given at the beginning without replenishment, which is first studied by
Jackson (1988) who proposed a constant base-stock heuristic without theoretical performance.
Jackson et al. (2019) is another paper on this topic that investigated a two-period model with
backorder and developed a heuristic based on an approximation where the number of stores goes
to infinity (see McGavin et al. 1993, 1997 for work on similar topic).
Recently, some work on OWMS developed heuristics with theoretical performance (Marklund
and Rosling 2012, Nambiar et al. 2020, Miao et al. 2020, Lei et al. 2020, Chao et al. 2021). In
particular, Marklund and Rosling (2012) and Nambiar et al. (2020) proposed some heuristics based
on Lagrangian relaxation for backorder and lost-sales OWMS system respectively, and they showed
that their heuristics are asymptotically optimal in the number of stores N , but not in the length
of horizon T . Later, Miao et al. (2020) generalized their result to a Multi-Warehouse Multi-Store
(MWMS) system, and proved that Lagrangian relaxation heuristics are asymptotically optimal in
both N and T . Similar results were obtained in Chao et al. (2021) when there is fixed ordering
cost, and the authors showed that the performance of the heuristics can be further improved using
appropriate re-adjustment of policy. Besides Lagrangian relaxation method, asymptotically optimal
heuristics based on certainty-equivalent optimization are used in Lei et al. (2020) who studied a
joint pricing and inventory optimization problem in OWMS system.
It shall be noted that all the aforementioned work on this subject assumes that the demand
distribution is known, whereas, in our study, it is learned along the way.

7
Table 1 List of notations

Di,t : Random demand for store i ∈ [N ] at period t ∈ [T ]
di,t : Realization of the demand
Ii,t : On-hand inventory at store i ∈ [N ] and period t ∈ [T ]
yi,t : Order-up-to level for store i ∈ [N ] and period t ∈ [T ]
π: Inventory control policy
V π: Total expected cost of the policy π in the original system
V ∗: Total expected cost of the optimal policy
Ṽ λ : Total expected cost of the Lagrangian relaxation system with dual variable λ
Ci (yi ; λ): Myopic cost function for store i ∈ [N ]
λ: Lagrangian multiplier
κi (λ): Critical fractile function
Parameters
hi > 0: holding cost per unit for store i ∈ [N ]
bi > 0: lost-sales cost per unit for store i ∈ [N ]
ci ≥ 0: shipment cost from the warehouse to store i ∈ [N ]
w ∈ R: end-of-horizon disposal cost
W: Central warehouse capacity
3. Model Description
Model primitives. We consider the One-Warehouse Multi-Store (OWMS) inventory problem
where there is a central warehouse with an initial capacity W that distributes its inventory to N
stores, denoted by i = 1, . . . , N , during a finite horizon t = 1, . . . , T . There is no external replen-
ishment during the horizon to the central warehouse and the inventory is non-perishable. Trans-
shipment between stores is not allowed. Demands Di,t , which are unknown in our problem, are
independent and identically distributed over time t and independent across stores i (i.e., the demand
distributions for different stores can be different).
Before providing the detail description of our model, we present notations in Table 1. Also,
throughout the paper, we denote “defined as” with “:=”. x+ represents max(x, 0). O(·) is the classic
big-O notation that is common in algorithm analysis. In Õ(·) notation, the logarithmic terms are
hidden.
The decision making process and the optimization problem. The decision making pro-
cess, together with the cost structure, of our problem is summarized as follows.
• At the beginning of each period t ∈ [T ], the decision-maker observes the on-hand inventory
levels Ii,t at stores i ∈ [N ] and remaining inventory at the central warehouse.
• The decision-maker then decides the amount of inventory to allocate to each store from the
central warehouse. There is an exogenous delivery cost per item ci which is specific to the
store i. Deliveries are instantaneous and the corresponding inventory level after the delivery
is denoted by yi,t for each store i. As a result, each store i incurs a delivery cost ci (yi,t − Ii,t ).

8
• Demand Di,t is realized at each store i. Any unsatisfied demand is lost and incurs lost-sales
cost bi (Di,t − yi,t )+ . If there is positive inventory left, it incurs holding cost hi (yi,t − Di,t )+ and
will be carried over to the next time period.
• At the end of the time horizon, any unit of the left-over inventory in the central warehouse
incurs an end-of-horizon disposal cost/value w ∈ R (i.e., it is positive if it is a cost, and negative
if it is a value). For the sake of brevity, we will just call it the end-of-horizon disposal cost.
According to the description above, the decision maker’s goal is to minimize the total expected
cost, which can be modeled as a dynamic programming formulation as follows
T X
X N
V ∗ := min π + π
− Di,t )+ ] + c0i E[yi,t
π π

E[bi (Di,t − yi,t ) + hi (yi,t − Ii,t ] + wW (P)
π
t=1 i=1
XT X N
π π
st (yi,t − Ii,t )≤W
i=1 i=1
π π
Ii,t+1 = (yi,t − Di,t )+ t ∈ [T ], i ∈ [N ]
π π
yi,t ≥ Ii,t t ∈ [T ], i ∈ [N ]
where c0i = ci − w. Note that we denote the policy with π, and in later context, it may be dropped
sometimes for brevity whenever there is no confusion. The objective value is the total expected
cost. The first constraint is the inventory constraint in the central warehouse; the second constraint
is the system dynamics of the inventory carry-over at the stores; the last constraint means that the
delivery from the warehouse to stores cannot be negative. Note that in our dynamic programming
formulation, the constraints are satisfied almost surely.
Demand learning and regret. In this paper, we discuss how to solve the optimization problem
(P) when the distribution of Di,t is unknown. Although our main focus is on the demand learning
with censored demand, we will first discuss the results without censored demand in Section 4, which
may be of independent interest, and help to build some foundation for analysis later and illustrate
the challenges with demand censoring. In particular, we will introduce an important benchmark
based on Lagrangean relaxation of (P) which is both used for bounding the regret and developing
our algorithms with and without demand censoring. Our main result with censored demand is
presented in Section 5.
Similar to other online learning problems, we face the challenge of the trade-off between demand
learning and cost minimization (also known as the exploration-exploitation trade-off). As a result,
the common performance metric of online learning algorithms, named regret, will be used in this
paper as well. In particular, we have the following definition.
Definition 1. For a policy π, the (expected) regret is defined as
R(π) = V π − V ∗ ,

9
which is the difference between the expected cost of policy π and the optimal value of the clair-
voyant.
According to this definition, minimizing V π is equivalent to minimizing the regret R(π).
4. The Benchmark Policy and Demand Learning without Censoring

In this section, we present the algorithm and its performance when there is no demand censor-
ing. We first present a Lagrangean relaxation based benchmark, which is used for bounding the
regret as well as developing the algorithm. Then our algorithm based on sample average approxi-
mation (SAA) is presented together with its theoretical performance. It shall be noted that besides
approximation results which may be of independent interest, this section also shows the major
differences and challenges of online optimization in OWMS system compared with single-store
inventory control literature (see e.g., Levi et al. 2007, Huh and Rusmevichientong 2009).
The benchmark. It is known that the inventory control in an OWMS system has complicated
optimal structure, and hence heuristics are needed to find an approximately optimal solution of (P).
The heuristics with full demand information has been studied in Miao et al. (2020), which propose
two heuristics to solve the OWMS inventory control based on Lagrangian relaxation. In this paper,
we use the heuristic named LaBS in Miao et al. (2020) as our benchmark for both developing the
learning algorithm and bounding the regret. This heuristic solves a modified Lagrangian relaxation
of (P) as follows.
T X
X N
Ṽ λ := min π + π
− Di,t )+ ] + c0i E[yi,t
π π

E[bi (Di,t − yi,t ) + hi (yi,t − Ii,t ] + wW (PS)
π
t=1 i=1
X N X T X N N
X
0 π + π π π
− ci E[(yi,T − Di,T ) ] + λ E[(yi,t − Ii,t )] − W − E[(yi,T − Di,T )+ ]
i=1 i=1 i=1 i=1
π π
st Ii,t+1 = (yi,t − Di,t )+ t ∈ [T ] i ∈ [N ]
π π
yi,t ≥ Ii,t t ∈ [T ] i ∈ [N ]
The idea of LaBS is to apply Lagrangian relaxation for the inventory constraint in the central
warehouse, and allow a hypothetical sell-back option from all stores back to the warehouse at the
end of the horizon. By this relaxation, (PS) can be decomposed into store-level inventory control
problem as
N
X
Ṽ λ = (w − λ)W + Ṽiλ (1)
i=1

10
where each Ṽiλ can be written as

T
X
Ṽiλ := min π + π
− Di,t )+ ] + ci (λ)E[yi,t
π

E[bi (Di,t − yi,t ) + (hi − ci (λ))(yi,t ]
π
t=1
π π
st Ii,t+1 = (yi,t − Di,t )+ t ∈ [T ] (NV)
π π
yi,t ≥ Ii,t t ∈ [T ]
where ci (λ) = c0i + λ. Without loss of generality, we omit c0i in our analysis and use ci in subsequent
sections for brevity. Note that Ṽiλ is simply a single-store inventory control problem with stationary
demand and end-of-horizon sell-back, and it has been studied very well (see e.g., Zipkin 2000). An
important structure of the optimal policy of Ṽiλ is that it is a base-stock policy with constant base-
stock level, which can be solved by minimizing the following newsvendor problem of the myopic
cost function:
Ci (yi ; λ) := E[ci (λ)yi + (hi − ci (λ))(yi − Di,t )+ + bi (Di,t − yi )+ ],
and its optimal solution yi∗ (λ) satisfies that F i (yi∗ (λ)) = κi (λ) where
bi − ci − λ
κi (λ) := ,
bi + hi − ci − λ
and we have that Ṽiλ = T · Ci (yi∗ (λ); λ).

Knowing the optimal value of Ṽ λ , we are able to find the optimal dual variable λ∗ by solving
the convex optimization problem
λ∗ = arg max Ṽ λ , (2)

λ≥0
and its corresponding optimal base-stock levels.

Above all, it is clear that the main purpose that we choose LaBS as our benchmark to develop
the learning algorithm is by its efficient computation as well as the simple optimal policy structure.
Due to the fact that Lagrangian relaxation always acts as a lower bound for the optimal value
∗
function, it has been shown that Ṽ λ ≤ V ∗ in Miao et al. (2020). As a result, we are able to bound
∗
the regret of any policy π by R(π) ≤ V π − Ṽ λ .
Technical assumptions. Before we proceed to the algorithm and its theoretical performance,
let us first present several mild technical assumptions.
Assumption 1. Demand Di,t ∈ [D, D] almost surely where 0 < D < D < ∞. Its distributions F i :
[D, D] → [0, 1] is continuous and it has a probability density function f i with 0 < f ≤ fi (d) ≤ f¯ < ∞
for all i ∈ [N ] and d ∈ [D, D].

11
Assumption 1 is a standard assumption regarding the regularity of the demand distribution

which is also made in other inventory learning literature (see e.g., Chen et al. 2020c). Note that
the boundedness assumption Di,t ∈ [D, D] is for the sake of brevity of technical analysis only and it
can be relaxed. Assumption 1 is satisfied by many popular demand distributions such as uniform
distribution and truncated normal distribution. Under this assumption, optimal base-stock level
yi∗ (λ) is unique and equal to (F i )−1 (κi (λ)) for any λ ∈ [0, bi − ci ] (otherwise yi∗ (λ) = 0 if λ > bi − ci ).
This also leads to our second assumption.
Assumption 2. For the optimal Lagrangian dual variable λ∗ ≥ 0, we have λ∗ ≤ mini∈[N ] (bi − ci ).
Assumption 2 essentially means that with full demand information, every store has positive optimal
base-stock level. The reason is that if λ∗ > bi − ci for some i ∈ [N ], we will have ci (λ∗ ) > bi meaning
that store i should not order anything from the warehouse at all. In real practice, it is unlikely
that cost structure at certain store is significantly different from the others.
The algorithm. Without demand censoring, our algorithm, named EDAF, is based on sample
average approximation. Briefly, at the beginning of time period t ≥ 2, we have collected all the
demand realizations Di,s for all i ∈ [N ] and s ≤ t − 1 with full observation. Using these data, we
can construct an empirical estimation F̂ti (·) of the demand distribution F i (·). More specifically, we
have Pt−1
1 ≤ d)
s=1 1 (Di,s
F̂ti (d) := .
t−1
According to Dvoretzky–Kiefer–Wolfowitz inequality (see Massart 1990), as t grows, F̂ti (·) becomes
more accurate (see Lemma 1 below).
Lemma 1 (Dvoretzky–Kiefer–Wolfowitz inequality). For any ≥ 0 and any period t ∈ [T ]

i
P sup |F̂t+1 (d) − F i (d)| ≥ ≤ 2 exp(−2t2 ). (3)
d∈[D,D]
Similarly, we have the corresponding empirical probability density function as

Pt−1
î 1 (Di,s = d)
f (d) = s=1 .
t−1
Once we have these empirical demand distribution, we are able to estimate yi∗ (λ) for any λ, which
is defined as
ŷi,t (λ) := inf {y : F̂ti (y) ≥ κi (λ)}. (4)
Then for λ∗ , according to Miao et al. (2020), λ∗ can be found by satisfying the complementary
slackness
X W
λ E[min(yi∗ (λ), Di,t )] − = 0.
i
T

12
As a result, in the case we only have empirical demand distribution, the estimated λ∗ , denoted by
λ̂t , is found by solving the “empirical” complementary slackness
X W
λ Ê[min(ŷi,t (λ), D̂i,t )] − =0
i
T
where D̂i,t is the demand with empirical distribution, and the notation Ê[·] represents the expec-
tation according to the empirical distribution. Since the left-hand side of (5) is a left-continuous
stepwise function, we let
( )
X W
λ̂t := sup λ ≥ 0 : Ê[min(ŷi,t (λ), D̂i,t )] − ≥0 . (5)
i
T
The algorithm then simply applies the base-stock policy with level ŷi,t (λ̂t ) (we write it as ŷi,t for
brevity). The detailed algorithm is summarized in Algorithm 1.
Algorithm 1: EDAF(Empirical Distribution Algorithm for Full Demand)

Input: Starting warehouse capacity W1 = W , initial store inventories Ii,1 , ∀i ∈ [N ]
Output: Policy π EDAF = {yi,t = max(ŷi,t , Ii,t )}
1 for t ∈ [T ] do
2 if t = 1 then
3 Set ŷi,1 ∈ [D, D] arbitrarily for all i ∈ [N ]
4 else
5 Compute λ̂t and ŷi,t as in (4) and (5)
6 end
PN
7 if i=1 (ŷi,t − Ii,t )+ ≤ Wt then
8 send (ŷi,t − Ii,t )+ to each store i
PN
9 Wt+1 ← Wt − i=1 (ŷi,t − Ii,t )+
10 Ii,t+1 ← (max(ŷi,t , Ii,t ) − Di,t )+
11 else
12 distribute Wt arbitrarily
13 Ii,t+1 ← (max(ŷi,t , Ii,t ) − Di,t )+
14 end
15 end
The performance of Algorithm EDAF is given in the following proposition. We delegate the
detailed proofs of the propositions and technical lemmas to the Appendix.
Proposition 1. Let π EDAF be the policy constructed by EDAF algorithm. The regret of π EDAF
satisfies that
√
R(π EDAF ) ≤ Õ(N T ).

13
To prove our claim in Proposition 1, we first show that the empirical estimates for order-up-to
levels and the dual parameter λ do not deviate away from the real values with high probability. We
π
then define Vunlim which is a hypothetical system of applying policy π with the same cost structure
as the original OWMS problem, except without inventory constraint in the central warehouse. With
π
the help of Vunlim , we decompose the regret into two parts: regret due to violating the capacity
∗
constraint (i.e., V π − Vunlim
π π
) and regret due to learning (i.e., Vunlim − Ṽ λ ). For the first part,
we show that the regret occurred because of violating the central warehouse capacity is at most
√
Õ(N T ). Moreover, using the estimation properties we show that regret due to learning is not
√
larger than Õ(N T ) as well. A more detailed proof can be found in the appendix.
Challenges for OWMS system. According to our discussion in this section, one significant
difference of our problem from the single-store inventory control (e.g., Huh and Rusmevichientong
2009) is that we also need to estimate λ∗ . In single-store inventory control, estimating optimal
base-stock level does not necessarily need to know the full demand distribution. Even with demand
censoring as in Huh and Rusmevichientong (2009), certain data-driven stochastic gradient descent
method can be used, which still does not require estimating the full demand distribution F i (·).
However, their approach cannot be directly applied to our problem because of λ∗ . In particular,
λ∗ maximizes the concave function Ṽ λ , and Miao et al. (2020) show that the gradient of Ṽ λ
(normalized by dividing T ) is given by
X W
E[min(yi∗ (λ), Di,t )] − .
i
T
As a result, there are two ways to estimate λ∗ . The first way is to construct an empirical distribution
for F i (·) and then estimate λ∗ (and yi∗ (λ∗ )) accordingly. This is exactly what we did in this section
for the case without demand censoring. However, this method fails to achieve the same regret
√
Õ(N T ) when the demand is censored, because constructing the empirical distribution of Di,t
requires full observations of Di,t . For this purpose we have to raise up the base-stock level to D for
some time, and this costs unnecessary money for the firm. Therefore, with the presence of censored
demand, we have to find a way to avoid estimating F i (·), which is our second way of estimating
λ∗ . The idea is to observe samples of min(yi , Di,t ) for some fixed yi , and then by taking average
of these samples, we have an unbiased estimation of E[min(yi , Di,t )] which helps to construct the
(approximated) gradient of Ṽ λ . However, what is the yi we should use to sample min(yi , Di,t )? How
many samples we need in order to approximate E[min(yi , Di,t )]? These are nontrivial questions,
and the next section will discuss this method with more details.

14
5. Proposed Algorithm for Censored Demand

This section presents the proposed algorithm for censored demand, its theoretical performance,
and the main steps of the proof for the theoretical result. In Section 5.1, we give the detailed
algorithm and its theoretical performance together with some high-level ideas. Then the theoretical
performance is proved in Section 5.2 with some technical details provided in the appendix.
5.1. The Algorithm and Its Performance
As we discussed at the end of the previous section, when the demand is censored, the decision-
maker can only observe the sales instead of the actual demand. This hurts the usage of the EDAF
algorithm because the empirical distribution constructed by censored data will be biased. Hence,
to have unbiased estimation, active exploration of the demand distribution is required. Exploration
for inventory control can generally be done by ordering high amounts of inventory to cope with the
censorship. However, this deliberate extra ordering hurts the overall expected cost, which may be
particularly suboptimal if the central warehouse inventory is scarce. Consequently, in this section,
we propose a smarter way that iteratively improves the base-stock levels at the store level and the
dual variable at the warehouse level.
The key idea of our algorithm is that if we want to find λ∗ , the gradient of Ṽ λ (divided by T ;
for the sake of brevity, we will just call it the gradient of Ṽ λ ), i.e.,
X W
E[min(yi∗ (λ), Di,t )] −
i
T
can possibly be estimated without knowing the full demand distribution, as mentioned in the dis-
cussion following Proposition 1. More specifically, suppose we know yi∗ (λ), if we apply yi∗ (λ) for
multiple times, we are able to observe samples of min(yi∗ (λ), Di,t ); by taking average of these sam-
ples, we get an unbiased estimation of E[min(yi∗ (λ), Di,t )]. This nice property is another important
reason that we use LaBS in Miao et al. (2020) as our benchmark instead of other heuristics such
as SLaBS in Miao et al. (2020) and the algorithm in Nambiar et al. (2020). Of course, there are
still two questions left: First, we do not know yi∗ (λ) obviously. Second, how should we divide the
time horizon smartly to have samples to estimate the gradient of Ṽ λ ? Our algorithm, named the
Double Binary Search (DBS) algorithm, is developed to address these two challenges.
Algorithm overview. The overall idea of the algorithm is taking turns to find λ∗ (i.e., update
λ) and yi∗ (λ) (for a fixed λ) via binary search method. More specifically, the algorithm has an outer
loop of searching for λ∗ (with carefully designed exponentially increasing loop length) using binary
search. In each outer loop, with λ fixed, there are inner loops of estimating yi∗ (λ) for each i ∈ [N ]
using binary search again. Briefly, since yi∗ (λ) is given by minimizing the convex function Ci (yi ; λ),

15
we find the estimated yi∗ (λ) iteratively using some data-driven binary search for the zero point
of Ci0 (yi ; λ). As a result, with the estimated yi∗ (λ) for all i ∈ [N ], we are able to approximate the
gradient of Ṽ λ and apply the binary search method again to find λ∗ . More details of each step will
be explained in the following, and we refer to Figure 1 for a graphic representation of the overall
algorithm structure.
Figure 1 Double Binary Search algorithm. Arrows represent information flow. The search durations increase with
time. Optimization for order-up-to levels is done in a decentralized manner while optimization for λ
considers the whole data.
The inner loop of estimating yi∗ (λ). The idea of the inner loop is from Chen et al. (2020c),
who studied a joint pricing and inventory control of single product with censored demand. As we
mentioned earlier, yi∗ (λ) is given by minimizing the convex function Ci (yi ; λ); then we can use
certain data-driven binary search to find the zero point of Ci0 (yi ; λ). More specifically, we have
Ci0 (yi ; λ) = (hi + bi − ci (λ))F i (yi ) − (bi − ci (λ)),
and if we have multiple samples of (hi + bi − ci (λ))11(Di,t ≤ yi ) − (bi − ci (λ)), which are obviously
observable regardless of demand censoring, we can have an unbiased estimation of Ci0 (yi ; λ) by
taking the average of these samples. In particular, let l denote the index of an inner loop, let nl be
the current number of samples of (hi + bi − ci (λ))11(Di,t ≤ yi,l ) − (bi − ci (λ)) by applying the same
yi,l , and let ĝl denote the average of these samples (i.e., the estimated Ci0 (yi,l ; λ)). By concentration
√
inequality, with high probability we shall have Ci0 (yi,l ; λ) ∈ [g l , ḡl ] where g l = ĝl − C1 / nl and ḡl =
√
ĝl + C1 / nl for some parameter C1 to be specified later. As a result, if g l > 0 (ḡl < 0), we have yi,l
too small (large) and we will go to the next iteration in binary search by setting yi,l as the new

16
lower bound (upper bound) of the search range. Otherwise, we will keep applying yi,l . This process
is visualized in Figure 2, and we refer to Algorithm 2 for the details of the inner loop subroutine.
Figure 2 Three consecutive steps of the inner loop. After sampling enough data, the confidence interval for the
derivative excludes zero and the algorithm proceeds to the next step by halving the search range.
Algorithm 2: Inner Loop of Binary Search for Order-up-to Levels

Input: dual parameter λ, search duration ν, parameter C1 , current time period t, store i
Output: yi,l , dî,l = dl /nl for l with the largest nl
1 function SearchOrderUpTo(i, t, λ, ν, C1 ):
2 Set Ll = D , Rl = D, yi,l = D/2, l = 0, gl = 0, dl = 0, nl = 0
P
3 while ν ≥ l nl and sufficient capacity at the central warehose do
4 Observe zt = min(Di,t , yi,l ) where t is the index of the current time period
5 dl ← dl + zt , nl ← nl + 1, gl ← gl − (hi + bi − ci (λ))11(Di,t ≤ yi,l ) + (bi − ci (λ))
6 Set ĝl = gl
nl
and [g l , ḡl ] = ĝl ± √Cn1
l
7 if l < dlog2 (νD)e and g l > 0 then

Ll+1 +Rl+1
8 Ll+1 ← yi,l , Rl+1 ← Rl , yi,l+1 ← 2
, nl+1 ← 0
9 l←l+1
10 else if l < dlog2 (νD)e and ḡl < 0 then
L +R
11 Ll+1 ← Ll , Rl+1 ← yi,l , yi,l+1 ← l+1 2 l+1 , nl+1 ← 0
12 l←l+1
13 end
14 t←t+1
15 end
16 end function

17
The outer loop of estimating λ∗ . Our main algorithm DBS consists of outer loops of finding
optimal Lagrangian dual variable λ∗ . Let τ denote the index of the outer loops. In each loop τ
with dual variable λ̂τ , which is an estimation of λ∗ , we apply the subroutine Algorithm 2 to obtain
approximated yi∗ (λ̂τ ) for all i ∈ [N ]. The idea of estimating λ∗ , as discussed earlier, is based on the
fact that Ṽ λ is a concave function of λ with a gradient equal to
X W
E[min(yi (λ∗ ), Di,t )] −
i
T
according to Miao et al. (2020) and Assumption 1. The method for estimating λ∗ is still a binary
search algorithm (but different from the one in Algorithm 2) to find the zero of the (approximated)
gradient of Ṽ λ . More specifically, let ŷi;τ , dî;τ be the output of subroutine Algorithm 2. Since dî;τ
is the sample average of
min(ŷi;τ , Di,t )
for all time periods t of applying base-stock level ŷi;τ in loop τ , and ŷi;τ approximates yi∗ (λ̂τ ), we
define vτ = i dî;τ − W/T which is an approximated gradient of Ṽ λ̂τ , and define the confidence
P
√
interval around vτ as [v τ , v τ ] = vl ± C2 N/ ντ for some parameter C2 such that the true gradient
of Ṽ λ̂τ is in [v τ , v τ ] with high probability. By concavity of Ṽ λ , v τ ≥ 0 (v τ ≤ 0) implies λ∗ ≥ λ̂τ
(λ∗ ≤ λ̂τ ) so we can cut the search range of λ∗ in half. Otherwise, it means that λ∗ is within a
close range (with similar length as [v τ , v τ ]) of λ̂τ , and we can immediately narrow our search range
accordingly. The length of each outer loop τ is given by C0 β τ where C0 and β are parameters to
be determined later. We refer to Figure 3 for a graphic representation of the outer loop. Note that
the value of these learning parameters have to be determined in a nontrivial way, and we cannot
directly use the same binary search as Algorithm 2 in order to achieve the optimal learning rate
of λ∗ (see Lemma EC.3 in the Appendix). The detailed algorithm DBS is given in Algorithm 3.
The performance of our algorithm DBS is presented in the following theorem.
Theorem 1. Let π DBS be the policy applied by the DBS algorithm with
C0 ≥ max(4/β 2 , 2dlog2 (T D)e)
p p
C1 ≥ 3/2(hi + bi − ci − λ) log(T )
C1
q
C2 ≥ D + 2 log(N T )dlog2 (T D)e
hi f
p h2i
C3 ≥ max C0 min(bi − ci ), 2C2 / min
i i (h + b − c )3 f
i i i
β ∈(1, 4].
The regret of π DBS satisfies that

√
R(π DBS ) ≤ Õ(N T )

18
(a) λ∗ = 0 and the confidence interval already (b) The confidence interval cannot indicate a sign
excluded zero
Figure 3 Two distinct principles of the DBS algorithm. a) the algorithm starts with λ̂ = 0 to increase the effec-
tiveness in the case of abundant inventory. If the value of λ∗ is zero, the algorithm can easily commit
to zero, otherwise, it can continue to search on the rest of the domain. b) different from the classical
stochastic binary search, the DBS algorithm shrinks the search space even though the confidence interval
does not have enough evidence for halving.
5.2. Main Steps of the Proof of Theorem 1
By our definition of regret, we can write

∗
R(π DBS ) ≤ V DBS − Vunlim
DBS DBS
+ Vunlim − Ṽ λ (6)
DBS
where Vunlim is the cost of applying Algorithm 3 in a hypothetical system with the same cost struc-
ture as the original OWMS problem, except without inventory constraint in the central warehouse.
∗
Therefore, our regret will be bounded by V DBS − Vunlim
DBS DBS
and Vunlim − Ṽ λ respectively.
Bound of V DBS − Vunlim
DBS
. Let us first consider the first term. Note that the discrepancy between
V DBS − Vunlim
DBS
comes only from the possible lost-sales due to stockout in the central warehouse.
As a result,
" N T #
XX +
V DBS − Vub
DBS
≤b̄E (yi,t − Ii,t )+ − W
" i=1
N X
t=1
T N
# (7)
X X
≤b̄E (yi,t − Ii,t )+ − T · E[min{yi∗ (λ∗ ), Di,1 }] + O(N )

i=1 t=1 i=1
where the second inequality is from Lemma 5 in Miao et al. (2020), and yi,t denotes the actual
base-stock level from Algorithm DBS. Denote Zi,t = (yi,t − Ii,t )+ as the ordering in time t at store
i and τ̄ the index of the last outer loop. By our algorithm design of exponentially increasing cycles
τ , we have

19
Algorithm 3: Double Binary Search (DBS)

Input: All model parameters for the OWMS problem; parameters C0 , C1 , C2 , C3 > 0, β > 1;
1 Initialize. τ = 0; t = 0; Lτ = 0; Uτ = mini (bi − ci ); λ̂τ = 0;
2 while T periods have not been reached and warehouse did not run out of stock do
3 Set ντ = C0 β τ ;
ŷi;τ , dî;τ ←SearchOrderUpTo(i, t, λ̂τ , ντ , C1 ) for all i ∈ [N ]; define vτ = i dî;τ − W/T ;
P
4
√
5 Construct confidence interval [v τ , v τ ] = vτ ± C2 N/ ντ ;
6 if v τ ≥ 0 then
7 Update Lτ +1 ← λ̂τ , Uτ +1 ← Uτ , λ̂τ +1 ← (Lτ +1 + Uτ +1 )/2;
8 end
9 else if v τ ≤ 0 then
10 Update Lτ +1 ← Lτ , Uτ +1 ← λ̂τ , λ̂τ +1 ← (Lτ +1 + Uτ +1 )/2;
11 end
12 else
√
13 Update Lτ +1 ← Proj[0,mini (bi −ci )] (λ̂τ − C3 / ντ ),
√
Uτ +1 ← Proj[0,mini (bi −ci )] (λ̂τ + C3 / ντ ), λ̂τ +1 ← (Lτ +1 + Uτ +1 )/2;
14 end
15 τ ←τ +1
16 t ← t + ντ
17 end
"
N X
T N
#
X X
(yi,t − Ii,t )+ − T · E[min{yi∗ (λ∗ ), Di,1 }]

E
" i=1
N X
t=1
τ̄ X N
i=1
τ̄
#
X X X
=E Zi,t − ντ · E[min{yi∗ (λ∗ ), Di,1 }]

i=1 τ =0 t∈Tτ i=1 τ =0
N X
τ̄
" #
X X
∗
≤ Zi,t − ντ · E[min{yi (λ̂τ ), Di,1 }]

E
i=1 τ =0 t∈Tτ
N X
X τ̄ h i
+ E ντ · E[min{yi∗ (λ∗ ), Di,1 }] − ντ · E[min{yi∗ (λ̂τ ), Di,1 }]

i=1 τ =0
τ̄
X
≤ Õ(N β τ /2 ) + O(N )
τ =0
XN X
τ̄ h i
+ E ντ · E[min{yi∗ (λ∗ ), Di,1 }] − ντ · E[min{yi∗ (λ̂τ ), Di,1 }]

i=1 τ =0
τ̄
X
≤ Õ(N β τ /2 ) + O(N )
τ =0

20
√
≤Õ(N T )
where Tτ is the set of time periods in cycle τ so that ντ = |Tτ |, and τ̄ ≤ dlogβ (T )e is the index
of last cycle. The second inequality is from Lemma EC.2(iv.). The third inequality is from Lemma
EC.3. Combining this with (7) we have that
√
V DBS − Vunlim
DBS
≤ Õ(N T ). (8)
∗ ∗
DBS
Bound of Vunlim − Ṽ λ . For the part Vunlim
DBS
− Ṽ λ , we first note that,
N X
X τ̄ X
DBS
Vunlim = E[Ci (max(yi,t , Ii,t ); λ̂τ )]
i=1 τ =0 t∈Tτ
N X
τ̄
! N X
τ̄ X
X X X
≤ Ci (yi,t ; λ̂τ ) + O(log2 (ντ )) + wW − E[λ̂τ (yi,t − Ii,t )+ ]
i=1 τ =0 t∈Tτ i=1 τ =0 t∈Tτ
where the inequality is due to the fact that there is at most dlog2 (ντ D)e updates of yi,t for t ∈ Tτ ,
and the assumption that Di,t ≥ D > 0. As a result, we have
∗
DBS
Vunlim − Ṽ λ
N X
X τ̄ X
≤ Ci (yi,t ; λ̂τ ) − Ci (yi∗ (λ̂τ ); λ̂τ )
i=1 τ =0 t∈Tτ
N X
τ̄
X (9)
+ ντ (Ci (yi∗ (λ̂τ ); λ̂τ ) − Ci (yi∗ (λ∗ ); λ∗ ))
i=1 τ =0
N X
τ̄
!
X X
+ ντ λ∗ E[min{yi∗ (λ∗ ), Di,t }] − E[λ̂τ (yi,t − Ii,t )+ ] + O(log2 (ντ )).
i=1 τ =0 t∈Tτ
Therefore, we need to bound the three summations above respectively.

First, we have with probability at least 1 − O(T −1 )
N X
X τ̄ X Xτ̄
Ci (yi,t ; λ̂τ ) − Ci (yi∗ (λ̂τ ); λ̂τ ) ≤ O(N log2 (ντ )) ≤ Õ(N ) (10)
i=1 τ =0 t∈Tτ τ =0
according to Lemma EC.2.(i.) by applying union bound over all τ .

Second, with probability at least 1 − O(T −1 )
N X
X τ̄
ντ (Ci (yi∗ (λ̂τ ); λ̂τ ) − Ci (yi∗ (λ∗ ); λ∗ ))
i=1 τ =0
XN X τ̄
≤ ντ O(E[|λ̂τ − λ∗ |])
i=1 τ =0
(11)
XN X τ̄
≤ Õ(β τ /2 )
i=1 τ =0
√
≤Õ(N T )

21
where the first inequality is from Lipschitz continuity of Ci (yi∗ (λ); λ) with respect to λ, and the
second inequality is from Lemma EC.3.
Third, with probability at least 1 − O(T −1 )
N X
τ̄
!
X X
∗
ντ λ E[min{yi∗ (λ∗ ), Di,t }] − +
E[λ̂τ (yi,t − Ii,t ) ]
i=1 τ =0 t∈Tτ
N X
X τ̄
= ντ λ∗ E[min{yi∗ (λ∗ ), Di,t }] − ντ E[λ̂τ min{yi∗ (λ̂τ ), Di,t }]
i=1 τ =0
N X τ̄
!
X X
+ ντ E[λ̂τ min{yi∗ (λ̂τ ), Di,t }] − E[λ̂τ (yi,t − Ii,t )+ ]
i=1 τ =0 t∈Tτ
N X
τ̄
(12)
X
∗
≤ O(ντ E[|λ̂τ − λ |])
i=1 τ =0
N X τ̄
" !#
X X
+ E λ̂τ ντ min{yi∗ (λ̂τ ), Di,t } − Zi,t
i=1 τ =0 t∈Tτ
τ̄ −1
N X
X √
≤ Õ(β τ /2 ) + O(N ) = Õ(N T )
i=1 τ =0
where the first inequality is from Lipschitz continuity, and the second inequality is from Lemma
EC.3 and Lemma EC.2(iv.) with union bound over τ . Combining (9-12), we have
∗ √ √
DBS
Vunlim − Ṽ λ ≤ Õ(N T ) + O(N log2 (T )) + Õ(N ) + O(N T · T −1 ) = Õ(N T ). (13)
In the end, this theorem is proved by combining (6), (7), and (13).
6. Numerical Study
In this section, we present the performance of the DBS algorithm in two groups of experiments. In
Experiment 1, we perform a sensitivity analysis for different model parameters using synthetic data.
In Experiment 2, we make use of a unique dataset collected from our partner fast-fashion company
and test the performance of the DBS algorithm on three products with cheap, intermediate, and
expensive price points. In both Experiment 1 and Experiment 2, the algorithms run in 100 different
random seeds and their average values and standard errors are reported.
As for benchmark algorithms, we use the explore-then-commit (Exp) algorithm with three dif-
ferent exploration durations. The explore-then-commit algorithm has two phases: the exploration
phase and exploitation phase. In the exploration phase, the decision maker raises the inventory lev-
els to D in the stores to get uncensored demand samples. The exploration phase can continue until
a predetermined time or until the system runs out of inventory. Afterwards, during the exploitation
phase, it applies the best policy calculated from these unbiased data. The explore-then-commit

22
algorithm can be thought as the censored demand version SAA approach. We build our benchmarks
with the lengths of exploration to be T 1/2 , T 2/3 , and T 3/4 respectively.
For better comparison of the performances in different scenarios, we define the relative regret
for a policy π as
∗
V π − Ṽ λ
RR(π) = .
Ṽ λ∗
6.1. Experiment 1: Synthetic Data
In this experiment, we test the empirical performance of the DBS algorithm and perform a sensi-
tivity analysis of the DBS Algorithm’s performance with respect to various input parameters such
as the length of the horizon, coefficient of variation of demand distribution, the central warehouse
capacity, cost parameters (holding and lost-sales cost), and the number of stores. We assume that
the system has two stores which are facing independent but identical demands with homogeneous
cost structures as hi = 6, bi = 60, and ci = 0.5. For the baseline demand, we use synthetic demand
data generated from a truncated normal distribution with domain [0, 175], mean µ = 50, and stan-
dard deviation σ = 50. We fix horizon length to be T = 1000 and the central warehouse capacity
T ·µ·N
to be 2
. We then change parameters one at a time for sensitivity analysis, and the results are
summarized as follows.
Figure 4 Change in relative regret with horizon length. The shades of the curves represent standard errors.
Impact of length of horizon T . Our first experiment is when the length of the horizon T takes
different values and its impact on the relative regret RR(π). The result is summarized in Figure 4,
which shows clearly that longer horizon leads to lower relative regret for all algorithms. However,

23
it is evident from Figure 4 that the performance of our algorithm DBS is significantly better than
the benchmarks Exp. The gap between the DBS algorithm and Exp algorithms could be a result
of two factors. The first factor is that the central warehouse capacity is taken into account by
the DBS algorithm, whereas the Exp algorithms naively orders without considering the capacity,
which can be particularly suboptimal when the warehouse inventory level is not very high. Second,
the Exp algorithms order unnecessarily high amounts to prevent censorship during the exploration
phase. On the other hand, however, the DBS algorithm avoids extravagant exploration with the
help of first-order information as promised in our theoretical analysis.
Figure 5 Change in relative regret with respect to coefficient of variation. The shades of the curves represent
standard errors.
Impact of coefficient of variation. To see the effect of demand variation on learning, we keep
the mean of the demand distribution constant and change its standard deviation. It can be seen
from Figure 5 that the relative regret increases by the coefficient of variation for all algorithms.
This is an expected result since it is harder to learn from demand distributions with high volatility.
As a reaction to increased variation, the committed order-up-to levels of Exp algorithms increase
as well. For the DBS algorithm, we can see that its performance is much more robust by adaptively
changing the policies as more data is gained.
Impact of the warehouse capacity. In this experiment, we test the performance of algorithms
with respect to different warehouse capacity. In particular, we let the initial inventory be φ · T · µ · N ,
and change the values of φ (named as capacity multiplier). Figure 6 shows that relative regret first
increases and then decreases with more capacity. This is due to the trade-off between holding and
lost sales costs. With low capacity, the system suffers from excessive lost sales costs, whereas holding

24
Figure 6 Change in relative regret with respect to capacity. The shades of the curves represent standard errors.
cost is lower. When capacity is large holding cost is taking place and lost sales cost diminishes.
After 1.4 times the standard capacity, relative regrets of algorithms stay constant, since excessive
capacity does not affect policies.
Figure 7 Change in relative regret with respect to holding cost. The shades of the curves represent standard
errors.
Impact of holding cost hi . Intuitively, holding cost is an integral part of the cost of learning
since learning algorithms need to explore as discussed in the previous sections. As the holding cost
increases, the learning process becomes more expensive as optimal order-up-to levels are lower (i.e.,
conservative exploration results in higher cost), which in principle increases the regret as reflected

25
in Figure 7. For comparison between algorithms, Figure 7 shows that as holding cost increases, the
regret of the Exp algorithms increases almost linearly. Regret of the DBS algorithm also increases
with holding cost but it is much more slowly and remains almost constant (even slightly decreases)
when hi is large. This better adaptability to holding cost of DBS algorithm could potentially come
from its ability to control ordering quantity according to system parameters, while Exp algorithms
have to follow the pre-determined order-up-to level D in exploration phase.
Figure 8 Change in relative regret with respect to lost sales cost. The shades of the curves represent standard
errors.
Impact of lost-sales cost bi . It can be seen in Figure 8 that as lost sales cost increases, relative
regrets of all algorithms decrease, which is seemingly counter-intuitive. However, one intuition
behind this decrease is that with higher bi , the optimal order-up-to levels and the benchmark cost
∗
Ṽ λ are also higher. As a result, in Exp algorithms exploration phase actually becomes less costly,
while in DBS algorithm, its flexibility still allows it to identify near-optimal order-up-to levels very
quickly. This intuition is consistent with our observation in Figure 8, while DBS still outperforms
Exp, their differences becomes smaller with higher lost-sales cost.
Impact of shipment cost ci . One can observe from Figure 9 that higher shipment costs can
lead to lower relative regret. We change shipment cost multiplier from % = 2−1 to % = 26 . In all
scenarios, our algorithm DBS again is significantly better than Exp algorithms. The reason for
decreasing relative regret in all algorithms could be similar to the case of higher bi . That is, the
∗
increase of the benchmark objective Ṽ λ outgrows the increase of regret R(π).
Impact of the number of stores N . In this experiment, we explore the effect of the number
of stores on the performances of DBS and Exp algorithms, and the results are summarized in Table

26
Figure 9 Change in relative regret with respect to shipment cost. The shades of the curves represent standard
errors.
Table 2 Average relative regret with number of stores

Number of stores: 2 4 6 8 10
Exp T 1/2 0.060 0.060 0.060 0.060 0.060
Exp T 2/3 0.068 0.068 0.068 0.068 0.068
Exp T 3/4 0.074 0.074 0.074 0.074 0.074
DBS 0.028 0.025 0.024 0.024 0.024
2. Table 2 shows that the relative regret is constant as the number of stores increases. This result is
√ ∗
exactly consistent with our theoretical regret Õ(N T ). It is not difficult to see that Ṽ λ = Θ(N T )
(where Θ(·) represents a tight bound) in our setting, so the relative regret essentially is independent
of N .
6.2. Experiment 2: Real Data
In this experiment, we test the DBS algorithm on various settings derived from a real dataset
collected from our partner fast-fashion company. For illustration purposes, we choose a two-store
system and three popular products with cheap, intermediate, and expensive prices respectively,
whose demand is relatively high and stable. For each product, we run an independent experiment
on the one-warehouse two-store system. The demand distributions and cost parameters of products
are derived from the data. For the cheap product, the demand is estimated to be exponentially
distributed with rate µ = 44.51, cost parameters b = 15, c = 0.5, h = 1.5. The intermediate priced
product has demand to be gamma distribution with shape parameter k = 1.26 and scale parameter
θ = 25.5. The cost parameters for the intermediate priced product are b = 60, c = 0.5, h = 6. For the

27
Table 3 Scenario setups

Scenario Group Scenario W b h c
Capacity Fair Capacity W0 · 2 b h c
Tight Capacity W0 b h c
W0
Very Tight Capacity b h c
2
h
Holding Cost Low Holding Cost W0 b c
2
Medium Holding Cost W0 b h c
High Holding Cost W0 b h·2 c
b
Lost Sales Cost Low Lost Sales Cost W0 h c
2
Medium Lost Sales Cost W0 b h c
High Lost Sales Cost W0 b·2 h c
c
Shipment Cost Low Shipment Cost W0 b h
2
Medium Shipment Cost W0 b h c
High Shipment Cost W0 b h c·2
Table 4 Average relative regrets and standard errors of algorithms on different products
DBS Exp T 1/2 Exp T 2/3 Exp T 3/4

Cheap Product 0.090(0.007) 0.139(0.044) 0.153(0.016) 0.156(0.016)
Intermediate Priced Product 0.093(0.021) 0.128(0.036) 0.139(0.018) 0.149(0.016)
Expensive Product 0.136(0.031) 0.157(0.049) 0.174(0.019) 0.182(0.019)
expensive product, the demand is exponentially distributed with rate µ = 6.39 and cost parameters
are b = 200, c = 0.5, h = 20.
Similar to what we did in Experiment 1, for the sake of robustness, we test the performance
of our algorithm DBS in different scenarios of problem parameters. Among all scenarios of cost
parameters, the medium scenario is based on the real parameters of the product and the warehouse
T ·µ·N
capacity is taken as W0 = 2
. A summary of all the scenarios can be found in Table 3.
The results of this experiment can be found in Table 4. Moreover, the detailed results with
scenario variations are given in Table EC.1, Table EC.3, and Table EC.5 in the appendix for
cheap, intermediate priced, and expensive products respectively. The results of this experiment
follow similar patterns with the sensitivity analysis done in Experiment 1 (i.e. relative regret first
increases then decreases with capacity, increases with holding cost, decreases with lost sales cost,
and decreases with shipment cost). Moreover, we see these patterns in all three price points, which
shows robustness and adaptability in different scenarios. Also, one can observe that the variability
of the results tends to increase with cost parameters (i.e. holding, lost sales, and shipment) and
product price.

28
7. Conclusion
Motivated by our collaboration with one of the largest fast-fashion retailers in Europe, this paper
studied the OWMS problem with unknown demand information a priori, which is the first among
the literature to the best of our knowledge. In particular, the decision maker has to jointly make
inventory allocation decisions from the central warehouse to different stores and learn the unknown
demand faced by stores on the fly. Two scenarios are considered in this paper. The first scenario
is that the decision maker can observe uncensored demand, and in this scenario, we developed an
algorithm named EDAF based on a sample average approximation approach. Convergence results
√
were obtained and the regret of EDAF was proved to be Õ(N T ). In the second scenario, uncen-
sored demand is no longer observable and the decision maker can only observe the sales amount
instead. In this case, EDAF does not work due to demand censoring. To tackle this challenge,
we proposed another algorithm named DBS, which adaptively learns the demand and decides
inventory allocation policy by optimizing a Lagrange dual variable of the inventory constraint in
√
the central warehouse. This algorithm DBS was proved to have regret Õ(N T ) as well, and we
performed simulations leveraging both synthetic data and real data from a fast-fashion firm to
illustrate the effectiveness of our algorithm DBS.
Lastly, we would like to discuss some possible future work directions. In some cases, fixed replen-
ishment costs are charged. Clark and Scarf (1962) show that with fixed costs, characterizing the
optimal policy is difficult (see Chao et al. 2021 for a recent study on multi-warehouse multi-store
system with fixed costs). It would be interesting to develop a solution for the fixed cost setting with
demand learning. Additionally, although the one-warehouse multi-store system is a classic model
in inventory management, demand learning in a general model for multiple warehouses would also
be interesting. In that case, our benchmark would not be generalizable for multi-warehouses due
to the curse of dimensionality. Hence, designing a new algorithm with provable performance is a
nontrivial task and we leave it as a future research direction.
References
Azoury KS (1985) Bayes solution to dynamic inventory models under unknown demand distribution. Man-
agement Science 31(9):1150–1160.
Ban GY (2020) Confidence intervals for data-driven inventory policies with demand censoring. Operations
Research 68(2):309–326.
Ban GY, Rudin C (2019) The big data newsvendor: Practical insights from machine learning. Operations
Research 67(1):90–108.
Besbes O, Muharremoglu A (2013) On implications of demand censoring in the newsvendor problem. Man-
agement Science 59(6):1407–1424.

29
Bu J, Simchi-Levi D, Wang L (2020) Offline pricing and demand learning with censored data. Available at
SSRN 3619625 .
Caro F, Gallien J (2010) Inventory management of a fast-fashion retail network. Operations research
58(2):257–273.
Chao X, Jasin S, Miao S (2021) Adaptive algorithms for multi-warehouse multi-store inventory system with
lost sales and fixed replenishment cost. Available at SSRN 3888794 .
Chen B, Chao X, Ahn HS (2019) Coordinating pricing and inventory replenishment with nonparametric
demand learning. Operations Research 67(4):1035–1052.
Chen B, Chao X, Shi C (2021) Nonparametric learning algorithms for joint pricing and inventory control
with lost sales and censored demand. Mathematics of Operations Research .
Chen B, Chao X, Wang Y (2020a) Data-based dynamic pricing and inventory control with censored demand
and limited price changes. Operations Research 68(5):1445–1456.
Chen B, Simchi-Levi D, Wang Y, Zhou Y (2020b) Dynamic pricing and inventory control with fixed ordering
cost and incomplete demand information. Available at SSRN 3632475 .
Chen B, Wang Y, Zhou Y (2020c) Optimal policies for dynamic pricing and inventory control with nonpara-
metric censored demands. Available at SSRN 3750413 .
Chen L (2010) Bounds and heuristics for optimal bayesian inventory control with unobserved lost sales.
Operations Research 58(2):396–413.
Cheung WC, Simchi-Levi D (2019) Sampling-based approximation schemes for capacitated stochastic inven-
tory control models. Mathematics of Operations Research 44(2):668–692.
Clark AJ, Scarf H (1960) Optimal policies for a multi-echelon inventory problem. Management Science
6(4):475–490.
Clark AJ, Scarf H (1962) Approximate solutions to a simple multi-echelon inventory problem. Studies in
Applied Probability and Management Science 88–110.
Davenport T, Guszcza J, Smith T, Jackson BS (2019) 2021 inventory optimization survey report. Deloitte
Consulting LLP URL https://www2.deloitte.com/us/en/insights/topics/analytics/insight-
driven-organization.html.
enVista (2021) 2021 inventory optimization survey report. National Retail Federation URL https://
nrf.com/research/member-submitted/2021-inventory-optimization-survey-report.
Erkip NK (1984) Approximate policies in multi-echelon inventory systems. Ph.D. thesis, Stanford University.
Godfrey GA, Powell WB (2001) An adaptive, distribution-free algorithm for the newsvendor problem with
censored demands, with applications to inventory and distribution. Management Science 47(8):1101–
1112.

30
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. Journal of the American
Statistical Association 58(301):13–30.
Huh WT, Levi R, Rusmevichientong P, Orlin JB (2011) Adaptive data-driven inventory control with censored
demand based on kaplan-meier estimator. Operations Research 59(4):929–941.
Huh WT, Rusmevichientong P (2009) A nonparametric asymptotic analysis of inventory planning with
censored demand. Mathematics of Operations Research 34(1):103–123.
Jackson PL (1988) Stock allocation in a two-echelon distribution system or “what to do until your ship
comes in”. Management Science 34(7):880–895.
Jackson PL, Muckstadt JA, Li Y (2019) Multiperiod stock allocation via robust optimization. Management
Science 65(2):794–818.
Kaplan EL, Meier P (1958) Nonparametric estimation from incomplete observations. Journal of the American
Statistical Association 53(282):457–481.
Lei YM, Liu S, Jasin S, Vakhutinsky A (2020) On the joint inventory and pricing control for a one-
warehouse multi-store problem with lost sales: Spiraling phenomena and a near-optimal heuristic.
Preprint (September 7, 2020) .
Levi R, Perakis G, Uichanco J (2015) The data-driven newsvendor problem: new bounds and insights.
Operations Research 63(6):1294–1306.
Levi R, Roundy RO, Shmoys DB (2007) Provably near-optimal sampling-based policies for stochastic inven-
tory control models. Mathematics of Operations Research 32(4):821–839.
Marklund J, Rosling K (2012) Lower bounds and heuristics for supply chain stock allocation. Operations
Research 60(1):92–105.
Massart P (1990) The tight constant in the dvoretzky-kiefer-wolfowitz inequality. The Annals of Probability
1269–1283.
McGavin EJ, Schwarz LB, Ward JE (1993) Two-interval inventory-allocation policies in a one-warehouse
n-identical-retailer distribution system. Management Science 39(9):1092–1107.
McGavin EJ, Ward JE, Schwarz LB (1997) Balancing retailer inventories. Operations Research 45(6):820–
830.
Miao S, Jasin S, Chao X (2020) Asymptotically optimal lagrangian policies for multi-warehouse multi-store
systems with lost sales. Available at SSRN 3552995 .
Nambiar M, Simchi-Levi D, Wang H (2020) Dynamic inventory allocation with demand learning for seasonal
goods. Production and Operations Management .
Nicasio F (2021) Retail analytics: How to use data to win more sales and customers. Vend
URL https://www.vendhq.com/blog/how-retailers-can-use-data-to-boost-productivity-
customer-service-sales.

31
Petro G (2021) Supply chain blues: Challenges, shortages and disruptions ahead. 4 minutes to a better retail
industry. URL https://www.forbes.com/sites/gregpetro/2021/10/08/supply-chain-blues-
challenges-shortages-and-disruptions-ahead-------5-minutes-to-a-better-retail-
industry-read-on.
Porteus EL (2002) Foundations of stochastic inventory theory (Stanford University Press).
Qin H, Simchi-Levi D, Wang L (2019) Data-driven approximation schemes for joint pricing and inventory
control models. Available at SSRN 3354358 .
Scarf H (1959) Bayes solutions of the statistical inventory problem. The Annals of Mathematical Statistics
30(2):490–508.
Shi C, Chen W, Duenyas I (2016) Nonparametric data-driven algorithms for multiproduct inventory systems
with censored demand. Operations Research 64(2):362–370.
Yuan H, Luo Q, Shi C (2021) Marrying stochastic gradient descent with bandits: Learning algorithms for
inventory systems with fixed costs. Management Science .
Zhang H, Chao X, Shi C (2018) Perishable inventory systems: Convexity results for base-stock policies and
learning algorithms under censored demand. Operations Research 66(5):1276–1286.
Zipkin PH (2000) Foundations of inventory management (McGraw-Hill).

e-companion ec1
This page is intentionally blank. Proper e-companion title

page, with INFORMS branding and exact metadata of the
main paper, will be produced by the INFORMS office when
the issue is being assembled.

ec2 e-companion
EC.1. Technical Proofs in Section 4

Throughout our analysis we will use t as an explicit variable. We define
r
log(2/δt )
t :=
2t
for simplicity with respective probability values δt . Before we go to the proof of Proposition 1, let
us first prove the next lemma which reveals the consistency of the estimated parameters and costs.
Lemma EC.1. For any t ∈ [T ], estimates of the EDAF algorithm have the following properties
with probability at least 1 − O(N −2 T −2 ).
i. |λ∗ − λ̂t+1 | ≤ O(N t )
ii. |ŷi,t+1 (λ̂t+1 ) − yi∗ (λ∗ )| ≤ O(N t )
iii. |Ci (ŷi,t+1 (λ̂t+1 ); 0) − Ci (yi∗ (λ∗ ); 0)| ≤ O(N t )
i
Proof of Lemma EC.1 We first show that given |F̂t+1 (d) − F i (d)| < t , for all d
|ŷi,t+1 (λ) − yi∗ (λ)| ≤ O(t ) (EC.1)

i
for any λ. In the rest of the proof, everything will be conditioned on event |F̂t+1 (d) − F i (d)| < t ,
which holds with probability at least 1 − O(N −2 T −2 ).
Let q = κ(λ). To simplify notation, we omit store number i. Additionally, with the abuse of
notation, we denote F −1 (q) = inf {y : F (y) ≥ q }. We want to show that
−1
|F̂t+1 (q) − F −1 (q)| ≤ O(t ). (EC.2)
If this claim holds, then (EC.1) follows immediately. To this end, let d = F −1 (q), dˆ = F̂t+1
−1
(q), and
ˆ Then |F̂t+1 (q) − F (q)| = |F (q̂) − F (q)| ≤ |q̂ − q |/|f | where f is the lower bound
q̂ = F (d). −1 −1 −1 −1
of probability density function from Assumption 1 and the inequality is due to inverse function
theorem. Note that we have |F (d) ˆ − F̂t+1 (d)
ˆ | ≤ t by the given event.
−1
By definition of F̂t+1 ˆ ≥ q. In the case that F̂t+1 (d)
, we always have F̂t+1 (d) ˆ = q, we immediately
−1
have |F̂t+1 ˆ − F̂t+1 (d)
(q) − F −1 (q)| ≤ |q̂ − q |/|f | = |F (d) ˆ |/|f | = O(t ). Thus let us focus on the case
ˆ > q. Then we have another two cases: q < q̂ and q ≥ q̂. In the first case, note that
that F̂t+1 (d)
by definition F̂t+1 has a staircase form. Thus F̂t+1 (d) ˆ > q implies that F̂t+1 (dˆ− ) ≤ q < F̂t+1 (d),
ˆ
where dˆ− ≤ dˆ is infinitesimally close to d. ˆ By the fact that 0 ≤ q − F̂t+1 (dˆ− ) < q̂ − F̂t+1 (dˆ− ) =
F (dˆ− ) − F̂t+1 (dˆ− ) ≤ t where the equality is by continuity of F , we have
−1 q̂ − q q̂ − F̂t+1 (dˆ− ) t
|F̂t+1 (q) − F −1 (q)| ≤ ≤ ≤ = O(t ).
|f | |f | |f |
In the second case that q ≥ q̂,
ˆ − F (d)
q − q̂ F̂t+1 (d) ˆ
−1
|F̂t+1 (q) − F −1 (q)| ≤ < = O()
|f | |f |
and we are done with the proof of (EC.2).

e-companion ec3
Proof of Part i. It can be seen that the dual program is concave in λ. The gradient of Ṽ λ /T is
given as
N
X W
E[min(yi∗ (λ), Di,1 )] − (EC.3)
i=1
T
according to Miao et al. (2020).

For a given λ, according to (EC.1), at period t + 1, we have
|E[min(ŷi,t+1 (λ), D̂i,1 )] − E[min(yi∗ (λ), Di,1 )]| ≤ O(t ).

PN
We define g(λ) := i=1 E[min(yi∗ (λ), Di,1 )] − W/T and its empirical estimate at time t as ĝt+1 (λ) :=
PN
i=1 Ê[min(ŷi,t+1 (λ), D̂i,t+1 )] − W/T . As a result, we immediately have
|g(λ) − ĝt (λ)| ≤ O(N t ). (EC.4)
Note that by concavity of Ṽ λ , we have g(λ) is monotone strictly (by Assumption 1) decreasing for
all λ ≥ 0 with |g 0 (λ)| = Θ(N −1 ), and ĝt (λ) is a step-wise function.
There are two cases for λ∗ .
Case 1: λ∗ > 0. By complementary slackness condition in Lemma 5 in Miao et al. (2020),
−1
λ∗ = g −1 (0). Then we just need to bound |λ∗ − λ̂t+1 | = |g −1 (0) − ĝt+1 (0)|. We just repeat the same
proof for (EC.2) except we replace F, F̂t+1 , t by g, ĝt+1 , N t and we are done.
Case 2: λ∗ = 0. In this case, we have g(0) ≤ 0 according to complementary slackness condition
in Lemma 5 in Miao et al. (2020). However, we can always continously extend g(λ) to λ < 0 by
e.g., linear extension such that g(λ) = 0 for some λ < 0. Similarly, we can do the same extension to
ĝt+1 (λ) such that |g(λ) − ĝt+1 (λ)| ≤ O(N t ) for any λ ∈ R. Then we just repeat the same proof for
(EC.2) except we replace F, F̂t+1 , t by g, ĝt+1 , N t (as |λ∗ − λ̂t+1 | ≤ |g −1 (0) − ĝ −1 (0)|) and we are
done .
Proof of Part ii. When bi − ci − λ < 0, the order-up-to levels are zero, therefore the claim
trivially holds. Assume otherwise, the expression can be written as
|ŷi,t+1 (λ̂t+1 ) − yi∗ (λ∗ )| = | ŷi,t (λ̂t+1 ) − yi∗ (λ̂t+1 ) + yi∗ (λ̂t+1 ) − yi∗ (λ∗ ) |
| {z } | {z }
(I) (II)
Element (I) has O(t ) upperbound from Line EC.1.

Element (II) can be written as
1
|F −1 (κ(λ̂t )) − F −1 (κ(λ∗ ))| ≤ |κ(λ̂t ) − κ(λ∗ )|
f (F −1 (q̃))
≤ O(N t )

ec4 e-companion
where the first inequality follows from Taylor’s Theorem for a q̃ between κ(λ̂t ) and κ(λ) and
the inverse function theorem, and the second line is a result of P art i. after simple algebraic
manipulations.
Proof of Part iii. We have
|Ci (ŷi,t+1 (λ̂t+1 ); 0) − Ci (yi∗ (λ∗ ); 0)| ≤ O(|ŷi,t+1 (λ̂t+1 ) − yi∗ (λ∗ )|)
≤ O(N t )
where the first inequality is from Lipschitz continuity of Ci (yi∗ (λ); λ) and the second inequality is
from part (ii).
q
log N T
Taking t = t
gives the result.
The following proposition gives the performance analysis of the EDAF algorithm.
Proof of Proposition 1 In this proof, we shall condition on the intersection of all events in
Lemma EC.1, which happens with probability at least 1 − O(N −1 T −1 ) by union bound. As a result,
the regret on the complement of this event is at most O(1) which can be neglected.
∗
As we discussed, we have the regret can be bounded using benchmark Ṽ λ as
∗ √
R(π) ≤ V π − Ṽ λ + O( N T + N ).
∗
As a result, it suffices to bound V π − Ṽ λ . To this end, we define Vunlim
π
as a hypothetical system
of applying policy π with the same cost structure as the original OWMS problem, except without
inventory constraint in the central warehouse. Then we have
∗ ∗
V π − Ṽ λ = V π − Vunlim
π π
+ Vunlim − Ṽ λ .
∗
As a result, it suffices to bound V π − Vunlim
π π
and Vunlim − Ṽ λ separately.
Bound of V π − Vunlim
π
. We first prove the following claim:
N X
h X T T X
X N + i
E (ŷi,t (λ̂t ) − Ii,t )+ − min(yi∗ (λ∗ ), Di,t )
i=1 t=1 t=1 i=1
T
(EC.5)
X
≤Õ N (t + t−1 ) + N .
t=2
This holds because we have

(ŷi,t+1 (λ̂t+1 ) − Ii,t+1 )+ =(ŷi,t+1 (λ̂t+1 ) − (max(ŷi,t (λ̂t ), Ii,t ) − Di,t )+ )+
≤ max(ŷi,t+1 (λ̂t+1 ), (ŷi,t+1 (λ̂t+1 ) − max(ŷi,t (λ̂t ), Ii,t ) + Di,t )+ )
≤ max(yi∗ (λ∗ ) + Õ(N t ), Õ(N t ) + Õ(N t−1 ) + Di,t )
≤ max(yi∗ (λ∗ ), Di,1 ) + Õ(N t ) + Õ(N t−1 )

e-companion ec5
where the second inequality is from Lemma EC.1.Part ii.

Now we are ready to bound V π − Vunlim
π
. Similar to Miao et al. (2020), the only possible extra
cost in the real system compared with the unbounded system is the extra lost-sales cost due to
stockout in the central warehouse. Therefore, we have
penalty for central warehouse stockout
z }| {
N X
h X T + i
V π − Vunlim
π
≤ b̄E (ŷi,t (λ̂t ) − Ii,t )+ − W
i=1 t=1
N X
h X T T X
X N + i
≤ b̄E (ŷi,t (λ̂t ) − Ii,t )+ − E[min(yi∗ (λ∗ ), Di,t )] + O(N )
i=1 t=1 t=1 i=1
N X
h X T T X
X N + i
∗ ∗
≤ b̄E min(yi (λ ), Di,t ) − E[min(yi∗ (λ∗ ), Di,t )] (EC.6)
i=1 t=1 t=1 i=1
XT
+ Õ N (t + t−1 ) + N + O(N )
t=2
T
√ X
≤ O( N T ) + Õ N (t + t−1 ) + N + O(N )
t=2
√
≤ Õ(N T )
where the second inequality is from Lemma 5 in Miao et al. (2020), the third inequality is from
(EC.5), the fourth inequality is from the upper bound by the variance of sum of independent
random variables, and the last inequality is by definition of t .
∗
π
Bound of Vunlim − Ṽ λ . But before we go to the specific bound for each part, let us prove
2
an important claim: for any t ≥ 2t0 = 4D log(N T )/D2 and i ∈ [N ], with probability at least 1 −
O(N −2 T −2 ),
max(ŷi,t+1 (λ̂t+1 ), Ii,t+1 ) − yi∗ (λ∗ ) ≤ O(N t ). (EC.7)
Notice that by our condition on events in Lemma EC.1, we have |ŷi,t+1 (λ̂t+1 ) − yi∗ (λ∗ )| ≤ O(N t )
for all t. Thus this inequality does not hold only if Ii,t+1 is very large. However, we shall show that
this is very unlikely to happen. To this end, by Hoeffding’s inequality (Hoeffding 1963)
t+1
X 2ε2
P Di,s ≥ t0 E[Di,1 ] − ε ≥ 1 − exp(− 2 ).
s=t−t0 +2 t0 D
1/2
Let ε = Dt0 log(N T )1/2 , then
t+1
X 1/2
1
P Di,s ≥ t0 E[Di,1 ] − Dt0 log(N T )1/2 ≥ 1 − (EC.8)
s=t−t1 +2
N 2T 2
Pt+1 1/2
Then on the event above, s=t−t0 +2 Di,s ≥ t0 E[Di,1 ] − Dt0 log(N T )1/2 ≥ 21 t0 E[Di,1 ] ≥ D given that
2
t0 = 2D log(N T )/D2 . This means no matter what is the value of max(ŷi,t+1−t0 (λ̂t+1 ), Ii,t+1−t0 )

ec6 e-companion
(which is at most D), at some time s ∈ [t + 1 − t0 , t + 1], we will have Ii,s ≤ ŷi,s (λ̂s ), which further
implies that
√
max(ŷi,t+1 (λ̂t+1 ), Ii,t+1 ) ≤ max ŷi,s (λ̂s ) ≤ yi∗ (λ∗ ) + O(N t+1−t0 ) ≤ yi∗ (λ∗ ) + O( 2N t+1−t0 )
s∈[t+1−t0 ,t+1]
2
given that t ≥ 2t0 = 4D log(N T )/D2 .
2
In the rest of the proof, we shall condition on event (EC.7) over all t ≥ 4D log(N T )/D2 and
i ∈ [N ], which happens with probability at least 1 − O(N −1 T −1 ) by union bound (hence the regret
on the complement is at most O(1) which can be neglected).
∗ ∗
π
For the bound of Vunlim − Ṽ λ , by definition of Ṽ λ and complementary slackness condition in
Lemma 5 in Miao et al. (2020), we have
N X
T
∗ X
π
Vunlim − Ṽ λ = E[Ci (max(ŷi,t (λ̂t ), Ii,t ); 0) − Ci (yi∗ (λ∗ ); 0)]
i=1 t=1
XN
+ ci E[(max(ŷi,T (λ̂T ), Ii,T ) − Di,T )+ ]
i=1 (EC.9)
T
X
≤Õ N t + 2N t0 + O(N )
t=2t0 +1
√
≤Õ(N T )
where the first inequality is from (EC.7) and Lipschitz continuity of Ci (·; 0), and the last inequality
is by definition of t , t0 .
Summarizing (EC.6) and (EC.9) gives us the final regret bound.
EC.2. Technical Proofs in Section 5

p p
Lemma EC.2. Given a fixed dual variable λ, and the constant C1 ≥ 3/2(hi + bi − ci − λ) log(T ),
with probability 1 − O(T −2 ), if Algorithm 2 runs during ν periods, for any store i ∈ [N ] the following
hold
Ci (yi,t ; λ) − Ci (yi∗ (λ); λ) ≤ O(log2 (ν)) where Tν is the set of time periods for Algorithm
P
i. t∈Tν
2 and yi,t is the actual base-stock level in time t.

q
∗
ii. |yi,l − yi (λ)| ≤ Õ(1/ν + 1/ ν − dlog2 (νD)e) where yi,l is the final output of the base-stock level.
q
iii. |dî,l − E[min{yi∗ (λ), Di,1 }]| ≤ D + hCi1f log(N T )dlog2 (νD)e
ν−dlog (νD)e
+ ν1 where dî,l is the final output of
2
the estimated expected sales.

√
iv. | t∈Tν Zi,t − νE[min{yi∗ (λ), Di,1 }]| ≤ Õ( ν) where Zi,t is the actual ordering in time t for
P
store i.
Furthermore, the total number of extra periods to lower the inventory level to yi,l does not exceed
O(log(ν)).

e-companion ec7
Proof of Lemma EC.2 Before we go into each part of this lemma, let us define
E1 := {Ci0 (yi,l ; λ) ∈ [g l , ḡl ] for all l}
which is event that the gradient is included in the interval [g l , ḡl ] for all l. We shall show this event
holds with probability at least 1 − O(T −2 ), and the rest of the proof shall be on this event. Note
that the proof of part i. and ii. mostly resembles Lemma 1 in Chen et al. (2020c), and we write
the whole proof here for being self-contained. Note that in the rest of the proof, we shall assume
λ < bi − ci because otherwise yi∗ (λ) = 0 and all the results trivially hold. Also, for the last claim of
this lemma, it holds because each loop has at most D/D periods of no ordering, which adds up to
at most O(log(ν)).
ĝ is an unbiased estimation of Ci0 (yi,l ; λ), and we have by Hoeffding’s inequality (Hoeffding 1963)
−2nl ζ 2
P Ci0 (yi,l ; λ) 6∈ [ĝ − ζ, ĝ + ζ] ≤ 2 exp
(hi + bi − ci − λ)2
√ √
3/2(hi +bi −ci −λ) log(T ) p
for any ζ > 0. Setting ζ = √
nl
, and having the constant C1 ≥ 3/2(hi + bi − ci −
p
λ) log(T ) we have 1 − O(T −3 ) probability for the gradient included in the confidence interval.
Applying union bound for ν ≤ T , we have

P (E1 ) = P Ci0 (yi,l ; λ) ∈ [g l , ḡl ] ∀l ≥ 1 − O(T −2 ).
Therefore, when g l > 0, then Ci0 (yi,l ; λ) > 0 and when ḡl < 0 , Ci0 (yi,l ; λ) < 0. This implies that yi∗ (λ)
is always between Ll and Rl .
Proof of Part i. Let Tl be the set of time periods during l. Suppose l < dlog2 (νD)e, and consider
l0 -th iteration in loop l. Given E1 , we have that
1
|Ci0 (yi,l ; λ)| ≤ C1 √ . (EC.10)
l0 − 1
Recall that according to Assumption 1 the probability density has 0 < f ≤ fi (d) ≤ f¯ < ∞ for
all i ∈ [N ] and f , f¯ ∈ R. Note that C 00 (y; λ) ≥ (hi + bi − ci − λ)f ≥ hi f for any order-up-to level
i
y ∈ [D, D] and 0 ≤ λ < bi − ci . Thus, Taylor’s theorem gives that
Ci (yi∗ (λ); λ) − Ci (yi,l ; λ) = Ci0 (yi,l ; λ)(yi∗ (λ) − yi,l ) + Ci00 (r; λ)(yi∗ (λ) − yi,l )2 ≤ 0
where r ∈ R is between yi∗ (λ) and yi,l . Consequently,
Ci0 (yi,l ; λ)(yi∗ (λ) − yi,l ) + hi f (yi∗ (λ) − yi,l )2 ≤ 0
Rearranging the sides

1
hi f |yi∗ (λ) − yi,l | ≤ −Ci0 (yi,l ; λ) ≤ |Ci0 (yi,l ; λ)| ≤ C1 √
l0 − 1

ec8 e-companion
Therefore we have
C
|yi∗ (λ) − yi,l | ≤ √1 . (EC.11)
hi f l0 − 1
∗
P
For each loop l, let RTl = t∈Tl (Ci (yi,t ; λ) − Ci (yi (λ); λ)).
Case 1. l < dlog2 (νD)e: For any l0 -th time period in loop l, we have
1 C
Ci (yi,l ; λ) − Ci (yi∗ (λ); λ) ≤ Ci0 (yi,l ; λ)|yi∗ (λ) − yi,l | ≤ C1 √ · √1
l0 − 1 hi f l0 − 1
C12
=
hi f (l0 − 1)
where the first inequality follows from convexity of Ci (·; λ) and the second one follows from (EC.10)
and (EC.11).
Summing the cycle regret gives
|Tl |
X C12
RTl ≤ O(1) + ≤ O(C12 f −1 h−1
i log(ν)).
l0 =2
hi f (l0 − 1)
Case 2 l = dlog2 (νD)e: Note that Rl − Ll ≤ D( 21 )log2 (νD) = ν1 . Hence, |yi∗ (λ) − yi,l | ≤ 1/ν. Then,
by convexity,
hi + bi − ci
Ci (yi,l ; λ) − Ci (yi∗ (λ); λ) ≤ Ci0 (yi,l ; λ)|yi∗ (λ) − yi,l | ≤ ;
ν
thus the summation over the cycle gives that
hi + bi − ci
RTl ≤ nl ≤ hi + bi − ci .
ν
Combining the two cases, we have
n
X X
Ci (yi,t ; λ) − Ci (yi∗ (λ); λ) ≤ RTl
t=1 l
≤O(log(νD)) O(C12 f −1 hi log(ν)) + (hi + bi − ci )

=O(log2 (ν))
which gives the result.
Proof of Part ii. Let l∗ be the cycle with largest nl .
Case 1. l∗ < dlog2 (νD)e: Let ¯l be the latest cycle. We have that
ν ν
nl ∗ ≥ ¯ ≥
l dlog2 (νD)e
where the first inequality follows from the pigeonhole principle. Then, by inequality (EC.11), we
have
q
C C 1 dlog2 (νD)e
|yi∗ (λ) − yi,l∗ | ≤ √1 ≤ q
hi f nl∗ − 1 h f ν − dlog (νD)e
i 2

e-companion ec9
Case 2. l∗ = dlog2 (νD)e: Then we have by binary search, |yi∗ (λ) − yi,l∗ | ≤ 1/ν.
Combining the two cases, we have
q
1 C1 dlog2 (νD)e
|yi,l∗ (λ) − yi∗ (λ)| ≤ + q .
ν
hi f ν − dlog2 (νD)e
Proof of Part iii. Let loop l∗ be the one with the largest nl . dî,l∗ is defined as
P
t∈Tl∗ min{yi,l , Di,t }
∗
dî,l∗ = .
nl ∗
According to Hoeffding’s inequality, we have that with probability at least 1 − O(N −1 T −2 ),

s

ˆ
log(N T )
di,l∗ − E[min{yi,l∗ , Di,1 }] ≤ D .

nl∗
Then by part ii., we have

ˆ
di,l∗ − E[min{yi∗ (λ), Di,1 }]

ˆ
≤di,l − E[min{yi,l , Di,1 }] + |E[min{yi,l∗ , Di,1 }] − E[min{yi∗ (λ), Di,1 }]|

∗ ∗

≤dî,l∗ − E[min{yi,l∗ , Di,1 }] + |yi,l∗ − yi∗ (λ)|

s q
log(N T ) 1 C 1 dlog2 (νD)e
≤D + + q
nl ∗ ν
hi f ν − dlog2 (νD)e
Since nl∗ is the largest among all nl , and l ≤ dlog2 (νD)e, we must have nl∗ ≥ ν/dlog2 (νD)e, implying
that s
C1 log(N T )dlog2 (νD)e 1
|dî,l∗ − E[min{yi∗ (λ), Di,1 }] ≤ D + + .

hi f ν − dlog2 (νD)e ν
Proof for Part iv. We have
X XX
Zi,t = Zi,t
t∈Tν l t∈Tl
XX
≤ min{yi,l , Di,t } + D(¯l + 1).
l t∈Tl
where ¯l is the index of last cycle, and the last inequality is due to possible extra ordering at the
beginning of each cycle l. Similarly, we have
X XX
Zi,t ≥ min{yi,l , Di,t } − D/D(¯l + 1)
t∈Tν l t∈Tl
where D/D(¯l + 1) is from possible no-order periods in each loop.

ec10 e-companion
Then
X
∗
Z − νE[min{y (λ), D } ]

i,t i i,1
t∈Tν
XX l̄
X
≤ min{yi,l , Di,t } − nl E[min{yi,l , Di,1 }]

l t∈Tl l=0
l̄
X
+ nl |E[min{yi,l , Di,1 }] − E[min{yi∗ (λ), Di,1 }]| + O(¯l + 1)
l=0
l̄
X p Xl̄
≤ O log(N T )nl + nl |E[min{yi,l , Di,1 }] − E[min{yi∗ (λ), Di,1 }]| + O(¯l + 1)
l=0 l=0
l̄
X p
≤ O log(N T )nl + O(dlog2 (Dν)e + 1)
l=0
where the second inequality is satisfied with probability at least 1 − O(N −1 T −2 ) using Hoeffding’s
inequality, and the last inequality holds according to a similar argument as in Part ii. Note that
Pl̄
since we have l=0 nl = ν, we have
l̄
X √
q √
nl ≤ (¯l + 1)ν ≤ Õ( ν),
l=0
and we are done.

In the next lemma, we prove an important convergence result with respect to λ̂τ in our algorithm
DBS.
Lemma EC.3. Let

C0 ≥ max(4/β 2 , 2dlog2 (T D)e)
C1
q
C2 ≥ D + 2 log(N T )dlog2 (T D)e
hi f
p h2i
C3 ≥ max C0 min(bi − ci ), 2C2 / min
i i (h + b − c )3 f
i i i
β ∈(1, 4],
then with probability at least 1 − O(T −1 ), we have
|λ̂τ − λ∗ | ≤ Õ(β −τ /2 )
for all τ .
Proof of Lemma EC.3 Define the event

 
 X X √ 
E2 := dî;τ − E[min{yi∗ (λ̂τ ), Di,1 }] ≤ C2 N/ ντ , ∀τ .

 
i∈[N ] i∈[N ]

e-companion ec11
Note that
X X X √
dî;τ −
ˆ
E[min{yi∗ (λ̂τ ), Di,1 }] ≤ di;τ − E[min{yi∗ (λ̂τ ), Di,1 }] ≤ C2 N/ ντ

i∈[N ] i∈[N ] i∈[N ]
where the last inequality holds with probability at least 1 − O(T −1 ) according to Lemma

EC.2.(iii.) by union bound over τ and the fact that ν0 = C0 ≥ 2dlog2 (T D)e and C2 ≥ D +
q
C1
hi f
2 log(N T )dlog2 (T D)e. Thus event E2 holds with probability at least 1 − O(T −1 ). Our argu-
ment will be conditioned on event E2 holds.
By a bit abuse of notation, let us write Ṽ λ as its normalized version (by T , i.e., Ṽ λ /T ). For
Ṽ λ
notation convenience, let Ṽ 0 (λ) := i E[min(yi∗ (λ), Di,1 )] − W/T = ddλ
P
. With E2 holds, this shows
that Ṽ 0 (λ̂τ ) = i∈[N ] E[min{yi∗ (λ̂τ ), Di,1 }] − W/T ∈ [v τ , v τ ] for all τ . Moreover, we have, by inverse
P
function theorem and Assumption 1 and Assumption 2,
d2 Ṽ λ X hi
=− (1 − Fi (yi∗ (λ)))
dλ 2 (hi + bi − ci − λ)2 fi (yi∗ (λ))
i∈[N ]
X h2i
=−
(hi + bi − ci − λ)3 fi (yi∗ (λ))
i∈[N ]
X h2i h2i
≤− ≤ −N min .
i∈[N ]
(hi + bi − ci )3 f i∈[N ] (hi + bi − ci )3 f
Let us extend Ṽ 0 (λ) to the domain of negative λ. The way we do it is by extending Ṽ 0 (λ) linearly
d2 Ṽ 0
with the rate of dλ2
so that Ṽ 0 (λ) is differentiable after extension and we still have the same upper
2 λ
d2 Ṽ 0
bound of d Ṽ
dλ2
for all λ ≤ mini (bi − ci ). Note that since dλ2
< 0, we have Ṽ 0 (λ) → ∞ if λ → −∞,
which implies, together with Assumption 2, that there always exists λ̃ ∈ R such that Ṽ 0 (λ̃) = 0.
Obviously, if λ̃ < 0, we have λ∗ = 0 and otherwise λ∗ = λ̃.
For the rest of the proof, we claim that Uτ − Lτ ≤ β 3/2 C3 β −τ /2 and λ∗ ∈ [Lτ , Uτ ] for all τ .
Obviously, if this claim is true, we immediately have our result. We prove by induction. This
statement is obviously true for τ = 0 by assuming β 3/2 C3 ≥ mini (bi − ci ). For τ = 1, we have one of
the following conditions.
If line 7 happens, we have Uτ , Lτ unchanged. Then just take β 3/2 C3 β −1/2 ≥ mini (bi − ci ) and we
are good. If line 10 happens, we immediately have λ∗ = 0 and Uτ = Lτ = 0 so we are good. If line
√
13 happens, since |λ̂0 − λ∗ | = |λ∗ | ≤ mini (bi − ci ) ≤ C3 / C0 where the last inequality is because C3 ≥
√ √
C0 mini (bi − ci ). Thus taking U1 = Proj[0,mini (bi −ci )] (λ̂0 + C3 / C0 ) and L1 = Proj[0,mini (bi −ci )] (λ̂0 −
√
C3 / C0 ) has λ∗ ∈ [L1 , U1 ], and
p
U1 − L1 ≤ 2C3 / C0 ≤ β 3/2 C3 β −1/2
because C0 ≥ 4/β 2 , which again proves the claim.

ec12 e-companion
For the inductive step, suppose it holds for τ , let us consider τ + 1. The procedure is almost the
same as for τ = 1.
If condition line 7 or 10 is satisfied for τ , then by algorithm design, we have
Uτ +1 − Lτ +1 = (Uτ − Lτ )/2 ≤ β 3/2 C3 β −τ /2 /2 ≤ β 3/2 C3 β −(τ +1)/2
where the first inequality is from inductive hypothesis, and the last inequality is because β ≤ 4.
Moreover, since Ṽ 0 (λ̂τ ) = i∈[N ] E[min{yi∗ (λ̂τ ), Di,1 }] − W/T ∈ [v τ , v τ ], line 7 (10) implies that
P
λ̂τ ≤ λ∗ (λ̂τ ≥ λ∗ ), which shows that λ∗ ∈ [Lτ +1 , Uτ +1 ]. Thus our claim is proved.
If condition line 13 is satisfied for τ , this implies that 0, Ṽ 0 (λ̂τ ) ∈ [v τ , v τ ] and thus |Ṽ 0 (λ̂τ )| ≤
√
2C2 N/ ντ . Note that by Taylor’s theorem on Ṽ 0 (λ), we have
d2 Ṽ λ̄
Ṽ 0 (λ̂τ ) = Ṽ 0 (λ̂τ ) − Ṽ 0 (λ̃) = (λ̂τ − λ̃)
dλ2
where λ̄ is between λ̂τ and λ̃. This implies that
|λ̂τ − λ∗ | ≤|λ̂τ − λ̃|

1
0

≤ Ṽ (λ̂τ )

h2
N mini (h +b −c )3 f
i
√i i i
2C2 / ντ
≤ h2
mini (h +b −c
i
)3 f
√ i i i
≤C3 / ντ
h2
where the last inequality is because C3 ≥ 2C2 / mini (h +b −c
i
)3 f
. Thus taking Uτ +1 =
i i i
Proj[0,mini (bi −ci )] (λ̂τ + C3 β −τ /2 ) and Lτ +1 = Proj[0,mini (bi −ci )] (λ̂τ − C3 β −τ /2 ) has λ∗ ∈ [Lτ +1 , Uτ +1 ], and
Uτ +1 − Lτ +1 ≤ 2C3 β −τ /2 ≤ β 3/2 C3 β −(τ +1)/2
because C0 ≥ 4/β 2 , which again proves the claim.
EC.3. Computational Experiments

We give the results of Experiment 2 as relative regret and standard errors in the following tables.

e-companion
Table EC.1
Duration 100 1000 5000 20000

1 2 1 2 1 2 1 2
Algorithm DBS Exp T 2 Exp T 3 DBS Exp T 2 Exp T 3 DBS Exp T 2 Exp T 3 DBS Exp T 2 Exp T 3
Scenario Group Scenario
Capacity Fair Capacity 0.017 0.261 2.139 0.020 0.094 0.897 0.013 0.031 0.444 0.006 0.014 0.306
Tight Capacity 0.112 0.190 5.827 0.090 0.139 6.145 0.030 0.120 5.945 0.008 0.112 5.800
Very Tight Capacity 0.044 0.076 3.417 0.038 0.055 3.348 0.021 0.045 3.312 0.008 0.041 3.210
Holding Low Holding 0.055 0.107 8.558 0.052 0.083 5.175 0.034 0.077 4.952 0.016 0.077 4.894
Medium Holding 0.112 0.190 5.827 0.090 0.139 6.145 0.030 0.120 5.945 0.008 0.112 5.800
High Holding 0.226 0.268 5.884 0.068 0.169 5.530 0.042 0.143 5.220 0.023 0.138 5.017
Lost Sales Low Lost 0.222 0.260 3.862 0.171 0.164 3.575 0.056 0.138 3.360 0.016 0.129 3.216
Medium Lost 0.112 0.190 5.827 0.090 0.139 6.145 0.030 0.120 5.945 0.008 0.112 5.800
High Lost 0.057 0.105 10.644 0.051 0.090 11.074 0.019 0.086 10.746 0.005 0.087 10.611
Shipment Low Shipment 0.114 0.193 8.938 0.092 0.142 9.700 0.030 0.121 9.247 0.008 0.114 9.037
Medium Shipment 0.112 0.190 5.827 0.090 0.139 6.145 0.030 0.120 5.945 0.008 0.112 5.800
High Shipment 0.109 0.184 3.678 0.088 0.112 3.841 0.029 0.116 3.716 0.008 0.108 3.610

Average relative regret for the product at cheap price point
ec13
ec14
Duration 100 1000 5000 20000

Table EC.2
1 2 1 2 1 2 1 2
Tight Capacity 0.035 0.052 2.905 0.007 0.044 3.057 0.001 0.025 3.081 0.000 0.015 3.109
Medium Holding 0.035 0.052 2.905 0.007 0.044 3.057 0.001 0.025 3.081 0.000 0.015 3.109
High Holding 0.064 0.069 2.715 0.007 0.044 2.945 0.042 0.031 2.770 0.018 0.021 2.653
Medium Lost 0.035 0.052 2.905 0.007 0.044 3.057 0.001 0.025 3.081 0.000 0.015 3.109
High Lost 0.024 0.028 6.220 0.009 0.029 7.006 0.007 0.020 7.165 0.003 0.015 7.151
Medium Shipment 0.035 0.052 2.905 0.007 0.044 3.057 0.001 0.025 3.081 0.000 0.015 3.109
High Shipment 0.034 0.050 1.557 0.007 0.033 1.654 0.001 0.024 2.128 0.000 0.015 1.661
Standard errors for the product at cheap price point

e-companion
e-companion
Table EC.3
Duration 100 1000 5000 20000

1 2 1 2 1 2 1 2
Tight Capacity 0.110 0.169 14.272 0.093 0.128 14.794 0.084 0.117 14.502 0.074 0.113 14.309
Medium Holding 0.110 0.169 14.272 0.093 0.128 14.794 0.084 0.117 14.502 0.074 0.113 14.309
High Holding 0.132 0.237 12.109 0.112 0.160 12.275 0.110 0.151 11.798 0.098 0.142 11.481
Medium Lost 0.110 0.169 14.272 0.093 0.128 14.794 0.084 0.117 14.502 0.074 0.113 14.309
High Lost 0.061 0.086 25.944 0.058 0.080 26.975 0.058 0.081 26.328 0.057 0.083 26.199
Medium Shipment 0.110 0.169 14.272 0.093 0.128 14.794 0.084 0.117 14.502 0.074 0.113 14.309
High Shipment 0.109 0.167 9.573 0.091 0.127 10.056 0.082 0.116 9.769 0.073 0.112 9.678

Average relative regret for the product at intermediate price point
ec15
ec16
Table EC.4
Duration 100 1000 5000 20000

1 2 1 2 1 2 1 2
Tight Capacity 0.032 0.054 10.445 0.021 0.036 11.170 0.019 0.029 11.126 0.019 0.020 11.345
Medium Holding 0.032 0.054 10.445 0.021 0.036 11.170 0.019 0.029 11.126 0.019 0.020 11.345
High Holding 0.046 0.065 8.861 0.048 0.048 9.110 0.045 0.033 9.203 0.045 0.021 8.907
Medium Lost 0.032 0.054 10.445 0.021 0.036 11.170 0.019 0.029 11.126 0.019 0.020 11.345
High Lost 0.022 0.029 22.059 0.013 0.021 23.654 0.013 0.018 23.648 0.013 0.016 23.987
Medium Shipment 0.032 0.054 10.445 0.021 0.036 11.170 0.019 0.029 11.126 0.019 0.020 11.345
High Shipment 0.032 0.053 6.339 0.021 0.036 6.889 0.019 0.029 6.919 0.019 0.020 7.072

Standard errors for the product at intermediate price point
e-companion
e-companion
Table EC.5
Duration 100 1000 5000 20000

1 2 1 2 1 2 1 2
Tight Capacity 0.164 0.226 25.357 0.136 0.157 24.048 0.123 0.132 23.987 0.127 0.130 23.140
Medium Holding 0.164 0.226 25.357 0.136 0.157 24.048 0.123 0.132 23.987 0.127 0.130 23.140
High Holding 0.179 0.306 19.273 0.132 0.183 19.391 0.115 0.163 18.620 0.112 0.151 17.605
Medium Lost 0.164 0.226 25.357 0.136 0.157 24.048 0.123 0.132 23.987 0.127 0.130 23.140
High Lost 0.097 0.132 48.269 0.077 0.105 43.924 0.070 0.097 43.829 0.072 0.098 42.637
Medium Shipment 0.164 0.226 25.357 0.136 0.157 24.048 0.123 0.132 23.987 0.127 0.130 23.140
High Shipment 0.163 0.225 17.792 0.135 0.157 16.943 0.123 0.132 16.770 0.127 0.130 16.184

Average relative regret for the product at expensive price point
ec17
ec18
Table EC.6
Duration 100 1000 5000 20000

1 2 1 2 1 2 1 2
Tight Capacity 0.048 0.064 19.549 0.031 0.049 19.627 0.033 0.027 20.682 0.032 0.018 20.660
Medium Holding 0.048 0.064 19.549 0.031 0.049 19.627 0.033 0.027 20.682 0.032 0.018 20.660
High Holding 0.069 0.081 16.398 0.062 0.049 16.985 0.063 0.036 16.940 0.057 0.024 16.516
Medium Lost 0.048 0.064 19.549 0.031 0.049 19.627 0.033 0.027 20.682 0.032 0.018 20.660
High Lost 0.029 0.035 41.253 0.017 0.032 41.224 0.018 0.022 42.923 0.018 0.016 42.973
Medium Shipment 0.048 0.064 19.549 0.031 0.049 19.627 0.033 0.027 20.682 0.032 0.018 20.660
High Shipment 0.048 0.064 12.395 0.030 0.048 12.437 0.033 0.027 13.122 0.032 0.018 13.138

Standard errors for the product at expensive price point
e-companion

SSRN Id3952925

Uploaded by

Copyright:

Available Formats

SSRN Id3952925

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

SSRN Id3952925

Uploaded by

Copyright:

Available Formats

Inventory Control and Learning for One-Warehouse

Multi-Store System with Censored Demand

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Table 1 List of notations

Electronic copy available at: https://ssrn.com/abstract=3952925

Definition 1. For a policy π, the (expected) regret is defined as

Electronic copy available at: https://ssrn.com/abstract=3952925

According to this definition, minimizing V π is equivalent to minimizing the regret R(π).

4. The Benchmark Policy and Demand Learning without Censoring

Electronic copy available at: https://ssrn.com/abstract=3952925

where each Ṽiλ can be written as

Ci (yi ; λ) := E[ci (λ)yi + (hi − ci (λ))(yi − Di,t )+ + bi (Di,t − yi )+ ],

and we have that Ṽiλ = T · Ci (yi∗ (λ); λ).

λ∗ = arg max Ṽ λ , (2)

and its corresponding optimal base-stock levels.

Electronic copy available at: https://ssrn.com/abstract=3952925

Assumption 1 is a standard assumption regarding the regularity of the demand distribution

Lemma 1 (Dvoretzky–Kiefer–Wolfowitz inequality). For any  ≥ 0 and any period t ∈ [T ]

Similarly, we have the corresponding empirical probability density function as

Electronic copy available at: https://ssrn.com/abstract=3952925

Algorithm 1: EDAF(Empirical Distribution Algorithm for Full Demand)

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

5. Proposed Algorithm for Censored Demand

5.1. The Algorithm and Its Performance

Electronic copy available at: https://ssrn.com/abstract=3952925

Ci0 (yi ; λ) = (hi + bi − ci (λ))F i (yi ) − (bi − ci (λ)),

Electronic copy available at: https://ssrn.com/abstract=3952925

Algorithm 2: Inner Loop of Binary Search for Order-up-to Levels

7 if l < dlog2 (νD)e and g l > 0 then

Electronic copy available at: https://ssrn.com/abstract=3952925

The regret of π DBS satisfies that

Electronic copy available at: https://ssrn.com/abstract=3952925

5.2. Main Steps of the Proof of Theorem 1

By our definition of regret, we can write

Electronic copy available at: https://ssrn.com/abstract=3952925

Algorithm 3: Double Binary Search (DBS)

Electronic copy available at: https://ssrn.com/abstract=3952925

Therefore, we need to bound the three summations above respectively.

according to Lemma EC.2.(i.) by applying union bound over all τ .

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

6.1. Experiment 1: Synthetic Data

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Table 2 Average relative regret with number of stores

6.2. Experiment 2: Real Data

Electronic copy available at: https://ssrn.com/abstract=3952925

Table 3 Scenario setups

DBS Exp T 1/2 Exp T 2/3 Exp T 3/4

Electronic copy available at: https://ssrn.com/abstract=3952925

Electronic copy available at: https://ssrn.com/abstract=3952925

Lemma 1 (Dvoretzky–Kiefer–Wolfowitz inequality). For any ≥ 0 and any period t ∈ [T ]

|ŷi,t+1 (λ) − yi∗ (λ)| ≤ O(t ) (EC.1)

|E[min(ŷi,t+1 (λ), D̂i,1 )] − E[min(yi∗ (λ), Di,1 )]| ≤ O(t ).

|g(λ) − ĝt (λ)| ≤ O(N t ). (EC.4)

Element (I) has O(t ) upperbound from Line EC.1.

≤ max(yi∗ (λ∗ ) + Õ(N t ), Õ(N t ) + Õ(N t−1 ) + Di,t )

≤ max(yi∗ (λ∗ ), Di,1 ) + Õ(N t ) + Õ(N t−1 )