Computers & Operations Research 38 (2011) 1705–1718
Sampled fictitious play for approximate dynamic programming
Marina Epelman a, Archis Ghate b,*, Robert L. Smith a
a Industrial and Operations Engineering, University of Michigan, Ann Arbor, USA
b Industrial and Systems Engineering, Box 352650, University of Washington, Seattle, WA 98195, USA
* Corresponding author. E-mail address: [email protected] (A. Ghate).
Available online 16 February 2011

Abstract
Sampled fictitious play (SFP) is a recently proposed iterative learning mechanism for computing Nash
equilibria of non-cooperative games. For games of identical interests, every limit point of the sequence
of mixed strategies induced by the empirical frequencies of best response actions that players in SFP
play is a Nash equilibrium. Because discrete optimization problems can be viewed as games of identical
interests wherein Nash equilibria define a type of local optimum, SFP has recently been employed as a
heuristic optimization algorithm with promising empirical performance. However, there have been no
guarantees of convergence to a globally optimal Nash equilibrium established for any of the problem
classes considered to date. In this paper, we introduce a variant of SFP and show that it converges
almost surely to optimal policies in model-free, finite-horizon stochastic dynamic programs. The key
idea is to view the dynamic programming states as players, whose common interest is to maximize the
total multi-period expected reward starting in a fixed initial state. We also offer empirical results
suggesting that our SFP variant is effective in practice for small to moderate sized model-free problems.
© 2011 Elsevier Ltd. All rights reserved.
Keywords:
Stochastic dynamic optimization
Computational game theory
Simulation optimization
1. Introduction
In this paper, we introduce a variant of a game theoretic
learning mechanism called sampled fictitious play (SFP) [20] to
solve model-free stochastic dynamic programming problems, and
investigate its convergence properties and empirical performance.
The defining feature of model-free problems is that the state
space, immediate rewards resulting from choosing an action in a
state, and state transition probabilities are not known explicitly,
and hence, system behavior must be ‘‘learned’’ off-line or on-line
by repeated computer simulations or system runs. This rules out
methods like backward induction, value iteration and mathematical programming. Examples of model-free problems include
control of queueing networks with complicated service disciplines
whose state transitions are available only through simulation via
computer programs [3], control of manufacturing processes where
the effect of a decision on the process is calculated by simulating
the process [18], and dynamic portfolio optimization or financial
derivative pricing problems where the performance of the underlying financial instrument is obtained by simulating complex
computer models. Algorithms for model-free problems are termed
‘‘simulation based’’ methods [3,9,26,32] and typically provide an
approximate solution. Thus, these simulation based techniques,
including our SFP variant, fall within the realm of approximate
dynamic programming (ADP) [26].
Stochastic search methods rooted in game theory have recently
been applied to large-scale discrete optimization problems, with
special focus on cases where the objective function is available only
through computationally expensive simulations [2,10,14–16,20–22].
Consequently, the hope is to at least find local optima, as stronger
forms of optimality are nearly impossible to attain, and very difficult
to check. These techniques have been numerically tested with
encouraging results on problems in transportation [10,14,22], power
management in sensor networks [16], network optimization [15], and
manufacturing systems [2].
Such heuristic optimization algorithms are applied to problems
of the form
max u(y_1, y_2, ..., y_n)
s.t. (y_1, y_2, ..., y_n) ∈ (Y_1 × Y_2 × ... × Y_n),    (1)

where (Y_1 × Y_2 × ... × Y_n) denotes the Cartesian product of finite
sets Y_1 through Y_n. The main idea is then to animate (1) as a game between n players (corresponding to decision variables y_1, ..., y_n), who share the identical interest of maximizing the objective function u(·). Recall that a Nash equilibrium is a collection of
probability distributions over each player’s actions with the
property that no player can unilaterally improve its utility in
expectation by changing its own distribution [13]. Such an
equilibrium serves as a type of coordinate-wise local optimum
of (1), and hence, the goal is to implement a computationally
efficient procedure to find it.
Most of these game theory based techniques for discrete
optimization employ variants of fictitious play (FP) [6,29], a
well-known iterative learning technique for computing Nash
equilibria. At every iteration of FP, each player chooses a strategy
that is a best response (with respect to that player’s expected
utility, which depends on decisions of all players) to the other
players’ strategies, assuming they will be chosen based on the
empirical probability distribution induced by the historical
frequency of their best response decisions in all previous iterations. Suitability of the FP approach for optimization stems from
its convergence properties. In particular, for a game of identical
interests, every limit point of the sequence of mixed strategies
induced by the empirical frequencies of best response actions is a
Nash equilibrium irrespective of the structure of the objective
function [24]. This is termed the ‘‘fictitious play property.’’
Another advantage of such methods is that they are easily
parallelizable, making them potentially suitable for large-scale
optimization problems. However, the best response problem for
each player in FP is computationally intractable for realistic
problems since an exact computation of the expected utility for
a player may require evaluation of the utility function for every
possible combination of actions for all players.
As a result, two of the authors and a co-author recently
proposed sampled fictitious play (SFP) [20], a modified version
of FP where the players choose an action that is a best reply to an
independent sample of actions of other players drawn according to
the empirical probability distribution of actions they used in all
previous iterations. SFP offers significant computational advantage
over FP, and for games of identical interests, almost surely exhibits
the fictitious play property if the sample size is increased at a
polynomial rate with iterations [20]. However, the efficacy of the original version of SFP for optimization problems is subject to the following significant limitations: (i) the fictitious play property guarantees convergence only to an equilibrium solution rather than an
optimal solution, (ii) SFP may converge to a mixed strategy
equilibrium, whereas in many applications, and especially in
optimization, a pure strategy equilibrium is desirable, (iii) the
best response computation becomes increasingly expensive as the
sample size grows without bound with iterations, (iv) it is
computationally very expensive to force every player in large-scale problems to perform a best reply computation in every
iteration, and, finally, (v) problem form (1) excludes optimization
problems with constraints across variables. Thus, practical implementations of SFP [2,10,21] have attempted ad hoc variations of
the original version. Unfortunately, the original convergence
results for SFP in [20] do not hold for these ad hoc variants.
Consequently, the question as to whether one can design an SFP variant that provably finds optimal solutions to an important class of optimization problems by surmounting the above difficulties has remained open. We answer this question in the affirmative by introducing an SFP variant that solves finite-horizon stochastic dynamic programs.
The key idea is to view the states of the dynamic programming
problem as players engaged in a non-cooperative game of identical
interests, where the objective of each player is to maximize the
expected multi-period reward from the initial state. The problem
structure inherent in dynamic programming, and specifically, the
principle of optimality, help our SFP players coordinate their
actions and hence solve the problem to optimality. Viewing the
states as players also has the important advantage that all
combinations of feasible actions of these players are jointly
feasible so that the resulting problem is an unconstrained one of
the form (1). Importantly, since the objectives of all players are
aligned with one another, it suffices for a very small fraction of the
players to participate in the game in each iteration, naturally
leading to an asynchronous procedure. The procedure to
determine which players will participate in an iteration adaptively
favors optimal actions (and hence the states they deliver) from the
recent past. Specifically, unlike the original SFP in [20], we provide
the players with only finite memory. Moreover, we allow players
to sample only one action at a time (unlike the original SFP version
in [20] which requires an increasing action sample size), and we
deliberately add exogenous noise to this selection procedure so
that every player in theory gets an opportunity to perform a best
response computation infinitely often with probability one even in
the asynchronous case. We also remark that if the inherent
randomness in state transitions of the system is such that all
states will be observed infinitely often with probability one
irrespective of the policy implemented, then this exogenous noise
is not needed (even in the asynchronous case).
This paper is organized as follows. We develop the necessary
notation and formulate our dynamic programming problem in the
second section. This is followed by a precise description of our SFP
algorithm in the third section. Convergence results and proofs
appear in the fourth section. Numerical experiments are presented
in the fifth section. Since our SFP variant falls within the realm of
simulation-based algorithms for approximate dynamic programming (ADP) [26], a detailed discussion of similarities and differences between our approach and existing simulation-based
techniques for ADP is provided in the sixth section, along with
other specific conclusions, and future research directions.
2. Problem formulation
Even though our SFP approach is applicable to any finite-state,
finite-action, finite-horizon stochastic dynamic program, it is
presented here for the case of model-free problems [3,26]. In
particular, consider the following T period sequential decision
problem. The initial state of a system is s1. In each period
t = 1, 2, ..., T, we observe state s_t of the system and make a decision x_t, which must be chosen from a finite set X_t(s_t) of feasible decisions in state s_t, as determined by a feasibility oracle F_t. This feasibility oracle receives s_t as input and produces a finite set X_t(s_t) as output. Then, another oracle O_t receives s_t and x_t as input and returns the (stochastic) state of the system s_{t+1} in the next period, and a one period deterministic reward r_t(s_t, x_t). It is common in the
literature to consider deterministic one period rewards [7,33]. The
slightly more general case of random one period rewards can also
be handled by our algorithm, and the convergence results
in Section 4 can be extended to that case without much difficulty.
All states of the form sT + 1 are ‘‘terminal’’ states and there are no
feasible actions in these states. We adopt the convention that
terminal states have no intrinsic value. Our results generalize in a
straightforward manner to problems where nonzero values are
assigned to terminal states as for example in our TIC-TAC-TOE
example in Section 5. We use S_t to denote the set of feasible states at the beginning of period t, t = 1, 2, ..., T+1, with S_1 = {s_1}. Let S denote the finite set of all feasible states of the system, i.e., S = S_1 ∪ S_2 ∪ ... ∪ S_{T+1}. The sets S_2, S_3, ..., S_{T+1}, and (hence) the state space S are not known a priori, but must be ''discovered'' by repeatedly querying oracles F_t and O_t.
A policy π is a deterministic decision rule that assigns a feasible decision x_t to every state s_t in S. Thus, π is a |S|-dimensional vector, and the decision that policy π assigns to state s_t ∈ S is denoted by π(s_t). We use Π to denote the set of all feasible policies of this form. Therefore, for each s_t ∈ S, any policy π ∈ Π must have the property that π(s_t) ∈ X_t(s_t). Let V_π(s_1) be the value of state s_1 under policy π, i.e., the total T-period expected reward from the initial state if we implement decisions prescribed by policy π in every
period. Mathematically,

V_π(s_1) = r_1(s_1, π(s_1)) + E[ Σ_{i=2}^{T} r_i(s_i, π(s_i)) ],

where s_i ∈ S_i for i = 2, ..., T and the expectation is with respect to the joint distribution induced by oracles O_1, O_2, ..., O_T. Our goal is to find a policy that solves the problem

π* ∈ argmax_{π ∈ Π} V_π(s_1).    (2)

We define V*(s_1) ≡ V_{π*}(s_1). We assume for simplicity that there is a unique policy π* optimal to problem (2). The extension to the case of multiple optimal policies is straightforward (see [17]). We next describe our SFP approach in detail.
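To make the model-free setting above concrete, the following minimal Python sketch shows one way the two oracles could be exposed to an algorithm, together with a Monte Carlo estimate of V_π(s_1) obtained purely by querying them. The callable signatures (a feasibility oracle returning X_t(s_t) and a combined transition oracle returning both s_{t+1} and the one period reward r_t(s_t, x_t)) are illustrative assumptions on our part, not an interface prescribed by the paper.

from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable

# Assumed oracle signatures for the model-free setting of Section 2:
#   feasibility(t, s)   -> list of feasible actions X_t(s)          (oracle F_t)
#   transition(t, s, x) -> (next state s_{t+1}, reward r_t(s, x))   (oracle O_t)
Feasibility = Callable[[int, State], List[Action]]
Transition = Callable[[int, State, Action], Tuple[State, float]]


def estimate_policy_value(s1: State, policy: Dict[State, Action], transition: Transition,
                          T: int, num_replications: int = 1000) -> float:
    """Monte Carlo estimate of V_pi(s1): average total T-period reward obtained
    by repeatedly simulating the system under a fixed policy pi."""
    total = 0.0
    for _ in range(num_replications):
        s, reward = s1, 0.0
        for t in range(1, T + 1):
            x = policy[s]               # decision prescribed by pi in state s
            s, r = transition(t, s, x)  # query oracle O_t
            reward += r
        total += reward
    return total / num_replications

A policy here is simply a dictionary from reachable states to actions; the SFP procedure described next builds such a policy out of the players' histories.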
3. Algorithm description
Recall that we view the states in S as players engaged in a game
of identical interest, namely, that of maximizing the total T-period
expected reward from the initial state s1. Consequently, we
henceforth use terms ‘‘players’’ and ‘‘states’’ interchangeably.
The central idea in our algorithm is borrowed from SFP: in each
iteration, players (i) sample actions according to probability
distributions derived from the empirical frequency of their past
best responses, (ii) calculate best responses to actions sampled by
other players, and (iii) update empirical frequencies of best
responses.
3.1. Preliminaries: a synchronous approach
To facilitate discussion, first consider the simpler version of the
above model-free problem where the state space and feasible
actions are known at the outset, while rewards and state transitions may still be available only through oracles. Suppose the
players in this SFP procedure each sample one action in every
iteration. These sampled actions then define a policy, and each
player can find a best response action simply by comparing an
estimate of the value of this policy with the corresponding value
estimates of policies that result if the player unilaterally changed
its action to each of its other feasible actions. Conceptually, such
an SFP procedure would then lead to a synchronous method as
outlined in the pseudo-code below.
A synchronous version of sampled fictitious play for approximate
dynamic programming:
1. Action sampling: Each player, independently of the other
players, samples one action from its (finite) history of past
best responses. (In the first iteration, when the histories are
empty, each player samples a feasible action randomly.)
2. Best response: Each player, independently of the other players,
estimates (in the manner described in detail in our asynchronous algorithm in Section 3.2) the expected utility (namely the
total expected reward from the initial state s1) corresponding
to each one of its feasible actions assuming that the other
players have chosen the actions they sampled in Step 1 above,
and selects an action that yields the highest reward.
3. History update: Each player updates its own history of best
responses by appending to it the best response action
chosen in Step 2 above, ensuring through a FIFO procedure
that history-length remains less than a preset finite bound.
One can show, using the backward induction technique presented in Section 4, that every entry in every player's history will eventually be optimal with probability one. Thus, we have established that for the class of optimization problems at hand, namely stochastic finite-horizon dynamic programs, a properly designed variant of SFP finds an optimal solution.

Proposition 3.1. In the above synchronous sampled fictitious play algorithm for approximate dynamic programming, the probability that there exists an iteration after which all entries in the finite histories of all players are optimal is one.

Proof. By backward induction as in the proof of Theorem 4.1 in Section 4, and hence omitted. □

However, the above synchronous approach is not viable when the entire state space and feasible actions in each state are not known at the outset. Even if they were available, performing action sampling, value estimation and best response computation for every player in every iteration would be computationally expensive. Thus, the problem setting in Section 2 calls for an asynchronous extension of the above conceptual SFP procedure wherein only some of the players perform action sampling, value estimation and best response calculation in each iteration. However, this introduces new hurdles, such as the need for deliberate exploration so that all players are chosen infinitely often to perform these activities (which is required for convergence), and the need to clearly specify a sequence in which players are chosen. Our SFP based asynchronous procedure, described below first in text and then as pseudo-code, surmounts these difficulties.
3.2. The asynchronous procedure
Each iteration of our asynchronous algorithm consists of two
phases: the choose players phase and the best response phase.
The choose players phase: In the choose players phase, T players,
one each from sets S_1, ..., S_T, are chosen to perform a best response calculation, and are said to be in play in that iteration. The precise way in which these players are selected will be explained later. In simple terms, they are selected by drawing a sequence of player–action pairs {(s_t, y_t)}_{t=1}^{T} starting at player s_1 and ending at some terminal state, where y_t ∈ X_t(s_t) is sampled using the history of past plays of player s_t and s_{t+1} is the state obtained by sending the state–action pair (s_t, y_t) to oracle O_t. We denote the sequence of player–action pairs drawn this way in iteration k by {(s_t^k, y_t^k)}_{t=1}^{T}. Observe that corresponding to each player s_t^k in this sequence, there is a realized reward U_t(s_t^k) defined as

U_t(s_t^k) = Σ_{i=1}^{t-1} r_i(s_i^k, y_i^k).
The choose players phase ends here.
The best response phase: In this phase, each player s_t^k, t = 1, 2, ..., T, that is selected to be in play in the choose players phase uses each of its feasible actions x_t ∈ X_t(s_t^k) to generate a ''path''. A path starting at player s_t^k and action x_t ∈ X_t(s_t^k) is a sequence of player–action pairs {(s_{t+j}^k, z_{t+j}^k)}_{j=1}^{T-t}, where s_{t+1}^k is received by sending (s_t^k, x_t) to O_t, player s_r^k uses its history of past plays to sample action z_r^k for r = t+1, ..., T, and s_{r+1}^k is obtained by sending the player–action pair (s_r^k, z_r^k) to the oracle O_r for r = t+1, ..., T-1. We remark that other names for paths in the ADP literature include ''episode'' and ''trajectory'' [4,32]. We denote such a path by p^k(s_t^k, x_t) and define the corresponding reward as

R(p^k(s_t^k, x_t)) = Σ_{j=1}^{T-t} r_{t+j}(s_{t+j}^k, z_{t+j}^k).    (3)
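As an illustration of how a path and its reward R(p^k(s_t^k, x_t)) in Eq. (3) can be generated by querying the oracles, consider the short Python sketch below. It reuses the oracle signatures assumed in the sketch at the end of Section 2 and a plain dictionary mapping each state to its finite history of past best responses; all names are our own.

import random
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable


def path_reward(t: int, s_t: State, x_t: Action, T: int,
                history: Dict[State, List[Action]],
                feasibility: Callable[[int, State], List[Action]],
                transition: Callable[[int, State, Action], Tuple[State, float]]) -> float:
    """Draw one path p(s_t, x_t) as in the best response phase and return
    R(p(s_t, x_t)) = sum_{j=1}^{T-t} r_{t+j}(s_{t+j}, z_{t+j}) of Eq. (3).
    The one period reward r_t(s_t, x_t) of the first action is not part of R;
    it is added separately in the best response problem (8)."""
    s, _ = transition(t, s_t, x_t)  # send (s_t, x_t) to oracle O_t; discard r_t(s_t, x_t)
    total = 0.0
    for j in range(1, T - t + 1):
        # Downstream players sample from their history of past best responses,
        # or uniformly from their feasible actions when the history is empty.
        plays = history.get(s)
        z = random.choice(plays) if plays else random.choice(feasibility(t + j, s))
        s, r = transition(t + j, s, z)  # oracle O_{t+j} returns s_{t+j+1} and r_{t+j}
        total += r
    return total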
For any player s_t ∈ S_t, and any positive integer m, let I(s_t, m) denote the set of iterations in which player s_t is in play up to and including iteration m. We now define the best response problem for player s_t^k that is in play in iteration k. In this iteration k, player s_t^k estimates that if it chooses action x_t ∈ X_t(s_t^k), its objective,
namely, the total T-period expected reward from the initial state s_1, is

V^k(s_1; (s_t^k, x_t)) = U_t(s_t^k) + r_t(s_t^k, x_t) + (1/|I(s_t^k, k)|) Σ_{i ∈ I(s_t^k, k)} R(p^i(s_t^k, x_t)),    (4)

where p^i(s_t^k, x_t) is the path drawn by the player–action pair (s_t^k, x_t) in the i-th iteration in which player s_t^k was in play. Player s_t^k then solves the following best response problem:

x̄^k(s_t^k) ∈ argmax_{x_t ∈ X_t(s_t^k)} V^k(s_1; (s_t^k, x_t)),    (5)
which is equivalent to the best response problem
x̄^k(s_t^k) ∈ argmax_{x_t ∈ X_t(s_t^k)} ( r_t(s_t^k, x_t) + (1/|I(s_t^k, k)|) Σ_{i ∈ I(s_t^k, k)} R(p^i(s_t^k, x_t)) ),    (6)
since U_t(s_t^k) does not depend on x_t and therefore there is no need to actually compute U_t(s_t^k) in an algorithmic implementation. Ties in the above problem are broken by choosing an action with the lowest index (note that since there is a finite number of feasible actions in each state, the actions can be indexed). Finally, for brevity, we define
W^k(s_t^k, x_t) ≡ (1/|I(s_t^k, k)|) Σ_{i ∈ I(s_t^k, k)} R(p^i(s_t^k, x_t))    (7)

and rewrite the above best response problem as

x̄^k(s_t^k) ∈ argmax_{x_t ∈ X_t(s_t^k)} ( r_t(s_t^k, x_t) + W^k(s_t^k, x_t) ).    (8)
Recall that in the reinforcement learning [32] parlance, the quantity r_t(s_t^k, x_t) + W^k(s_t^k, x_t) in Eq. (8) would be called the ''Q-value estimate'' (see Eqs. (13)–(14) below for the definition of ''Q-value''). In light of Eq. (3) it appears that to use estimation Eq. (7), player s_t^k must know |I(s_t^k, k)| and perhaps every sequence of player–action pairs {(s_{t+j}^i, z_{t+j}^i)}_{j=1}^{T-t} that s_t^k generated in iterations i ∈ I(s_t^k, k). Although the former is true, the latter is not. To see this, note that I(s_t^k, k) = I(s_t^k, k-1) ∪ {k} and in particular |I(s_t^k, k)| = 1 + |I(s_t^k, k-1)|. As a result,
W^k(s_t^k, x_t) = (1/|I(s_t^k, k)|) Σ_{i ∈ I(s_t^k, k)} R(p^i(s_t^k, x_t))
  = (1/(1 + |I(s_t^k, k-1)|)) Σ_{i ∈ I(s_t^k, k-1) ∪ {k}} R(p^i(s_t^k, x_t))
  = (1/(1 + |I(s_t^k, k-1)|)) ( Σ_{i ∈ I(s_t^k, k-1)} R(p^i(s_t^k, x_t)) + R(p^k(s_t^k, x_t)) )
  = (1/(1 + |I(s_t^k, k-1)|)) ( |I(s_t^k, k-1)| W^{k-1}(s_t^k, x_t) + R(p^k(s_t^k, x_t)) ).    (9)
Therefore, in order to compute W^k(s_t^k, x_t), player s_t^k needs to know only |I(s_t^k, k-1)|, W^{k-1}(s_t^k, x_t) and the reward R(p^k(s_t^k, x_t)) associated with the player–action pairs {(s_{t+j}^k, z_{t+j}^k)}_{j=1}^{T-t} that it generated in the current iteration k. Based on Eq. (9), throughout the algorithm each player s_t stores the number of times it has been in play (denoted by N(s_t)) and an estimate W(s_t, x_t) that is updated every iteration s_t is in play as follows:

W(s_t, x_t) ← ( N(s_t) W(s_t, x_t) + R(p(s_t, x_t)) ) / (1 + N(s_t)).    (10)

Note that at iteration k, N(s_t) = |I(s_t, k)| is used in the algorithm implementation, whereas our proofs use the notation I(s_t, k) to make the dependence on k explicit. Similarly for W(s_t, x_t) and W^k(s_t, x_t).
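In code, the incremental form (9)–(10) is a one-line running-average update; the sketch below (our own naming) is all that a player needs to store per state–action pair, besides the visit counter N(s_t).

def update_w(w_old: float, n_old: int, new_path_reward: float) -> float:
    """Eqs. (9)-(10): fold the newest path reward R(p^k(s_t, x_t)) into the
    running average W(s_t, x_t) without storing any previously drawn paths."""
    return (n_old * w_old + new_path_reward) / (1 + n_old)

Here n_old plays the role of N(s_t) = |I(s_t, k-1)|, which is incremented once per iteration in which the player is in play.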
Description of history of past plays: We define the term ''history'' here. Let L be a positive integer. For each state s_t ∈ S_t, t = 1, 2, 3, ..., T, its history H^k(s_t) after iteration k is an ordered list composed of (at most) L actions in X_t(s_t) that were the solutions to the best response problem for state s_t in the (at most) L most recent iterations in which it was in play. Thus, the history of a state is updated each time after it performs a best response computation, and is not updated at any other time. Note that H^k(s_t) is a FIFO list of length at most L; namely, if |I(s_t, k)| ≥ L, then |H^k(s_t)| = L, and if |I(s_t, k)| < L, then |H^k(s_t)| = |I(s_t, k)|.
Now we are ready to describe exactly how states are chosen in
the choose players phase of each iteration to perform a best
response computation. This selection is done inductively starting
from state s1 (because that is the only state in S1). In iteration k,
suppose states s1,s2,y,st have been selected so far. Note that st
may or may not have a history of past best responses. If its history
is empty, an action is selected from the set Xt(st) of its feasible
actions uniformly at random. On the other hand, if the history is
not empty, with probability 1ak one action is drawn uniformly at
random from this history, whereas with probability ak one action
is drawn uniformly at random from Xt(st). Suppose this selected
action is xt. Then the pair (st,xt) is sent to the oracle Ot to receive a
state st + 1 from set St + 1 as output. This is repeated until we reach a
terminal state sT + 1. The parameter ak is asymptotically reduced to
zero as k increases in a way that ensures that every state is
selected infinitely often with probability one. Finally, in the best
response phase, the procedure to sample actions (and hence
paths) is the same as above except that players do not use
parameter ak while drawing actions when the history is
not empty.
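The sampling rule just described can be written compactly as follows; the helper below is a sketch under our own naming, with the exploration probability passed in explicitly so that the same routine covers both phases (alpha = α_k in the choose players phase, alpha = 0 in the best response phase).

import random
from typing import Dict, Hashable, List

State = Hashable
Action = Hashable


def sample_action(s: State, feasible: List[Action],
                  history: Dict[State, List[Action]], alpha: float) -> Action:
    """With probability alpha (or whenever the history is empty) draw uniformly
    from the feasible set X_t(s); otherwise draw uniformly from the finite
    history H^{k-1}(s) of past best responses."""
    plays = history.get(s)
    if not plays or random.random() < alpha:
        return random.choice(feasible)
    return random.choice(plays)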
Asynchronous sampled fictitious play for approximate dynamic programming:

1. Initialize: Set k ← 1, choose a decreasing sequence {α_k}_{k=1}^∞ ⊂ [0,1], fix some positive integer L, set N(s_1) ← 0, W(s_1, x_1) ← 0 for all x_1 ∈ X_1(s_1) obtained by sending s_1 to oracle F_1, and H^0(s_1) ← ∅.
2. Choose players: Starting at player s_1, inductively select a sequence of players s_1 ≡ s_1^k, s_2^k, ..., s_T^k said to be in play in iteration k, where s_t^k ∈ S_t for t = 1, ..., T, as follows: for t = 1, ..., T:
   - If H^{k-1}(s_t^k) = ∅, then obtain X_t(s_t^k) by sending s_t^k to the feasibility oracle F_t, sample an action y_t^k uniformly at random from X_t(s_t^k), set N(s_t^k) = 0, and W(s_t^k, x_t) = 0 for all x_t ∈ X_t(s_t^k).
   - Else let u be a Uniform[0,1] random variable; if u ≤ 1 − α_k, sample an action y_t^k uniformly at random from H^{k-1}(s_t^k), else sample an action y_t^k uniformly at random from X_t(s_t^k).
   - Send the player–action pair (s_t^k, y_t^k) to oracle O_t to receive player s_{t+1}^k as output.
   end for.
3. Best response: For each player s_t^k that is in play:
   (a) For each x_t ∈ X_t(s_t^k) draw a path p^k(s_t^k, x_t) = {(s_{t+1}^k, z_{t+1}^k), ..., (s_T^k, z_T^k)} inductively as follows: send (s_t^k, x_t) to oracle O_t to receive s_{t+1}^k as output, and then
       for j = 1, ..., T − t:
         - If H^{k-1}(s_{t+j}^k) = ∅, then sample an action z_{t+j}^k uniformly at random from X_{t+j}(s_{t+j}^k), obtained by sending s_{t+j}^k to the feasibility oracle F_{t+j}.
         - Else sample an action z_{t+j}^k uniformly at random from H^{k-1}(s_{t+j}^k).
         - Send the player–action pair (s_{t+j}^k, z_{t+j}^k) to oracle O_{t+j} to receive player s_{t+j+1}^k as output.
       end for.
       Set W(s_t^k, x_t) ← ( N(s_t^k) W(s_t^k, x_t) + R(p^k(s_t^k, x_t)) ) / (1 + N(s_t^k)).
       end for.
   (b) Find x̄^k(s_t^k) by solving the best response problem (8), where ties are broken via the lowest index first rule.
   (c) Append history H^{k-1}(s_t^k) with x̄^k(s_t^k) in a FIFO manner to keep its length at most L, and set N(s_t^k) ← N(s_t^k) + 1.
4. Decrease α_k to α_{k+1}, increase the iteration counter to k + 1, and go to Step 2 (the choose players phase).
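The pseudo-code above can be turned into working code in a few dozen lines. The Python sketch below implements one iteration (Steps 2–4) under the oracle signatures assumed in the earlier sketches; the class name, the data structures, and the choice α_k = (1/k)^{1/T} (which satisfies the sufficient condition of Lemma 4.2 in Section 4) are our own illustrative choices and not the authors' implementation. Tie-breaking by lowest index assumes the feasibility oracle lists actions in index order.

import random
from collections import defaultdict, deque
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
Feasibility = Callable[[int, State], List[Action]]                # assumed oracle F_t
Transition = Callable[[int, State, Action], Tuple[State, float]]  # assumed oracle O_t


class AsynchronousSFP:
    """Sketch of the asynchronous SFP procedure of Section 3.2."""

    def __init__(self, s1: State, T: int, L: int,
                 feasibility: Feasibility, transition: Transition):
        self.s1, self.T, self.L = s1, T, L
        self.F, self.O = feasibility, transition
        self.k = 1
        self.history: Dict[State, deque] = {}         # H(s): FIFO list of length <= L
        self.N: Dict[State, int] = defaultdict(int)   # number of times s has been in play
        self.W: Dict[Tuple[State, Action], float] = defaultdict(float)

    def _sample(self, t: int, s: State, alpha: float) -> Action:
        # Uniform over X_t(s) when exploring or when the history is empty,
        # otherwise uniform over the history of past best responses.
        plays = self.history.get(s)
        if not plays or random.random() < alpha:
            return random.choice(self.F(t, s))
        return random.choice(list(plays))

    def _rollout(self, t: int, s: State, x: Action) -> Tuple[float, float]:
        # Return (r_t(s, x), R(p(s, x))) for one freshly drawn path; no exploration here.
        s_next, r_first = self.O(t, s, x)
        total = 0.0
        for j in range(1, self.T - t + 1):
            z = self._sample(t + j, s_next, alpha=0.0)
            s_next, r = self.O(t + j, s_next, z)
            total += r
        return r_first, total

    def iterate(self) -> None:
        alpha = (1.0 / self.k) ** (1.0 / self.T)      # exploration rate satisfying Lemma 4.2
        # Step 2 (choose players): draw the player-action sequence (s_t^k, y_t^k).
        in_play, s = [], self.s1
        for t in range(1, self.T + 1):
            in_play.append((t, s))
            y = self._sample(t, s, alpha)
            s, _ = self.O(t, s, y)
        # Step 3 (best response) for every player in play.
        for t, s in in_play:
            actions = self.F(t, s)
            q = {}
            for x in actions:                          # Step 3(a): one fresh path per feasible action
                r_first, R = self._rollout(t, s, x)
                self.W[(s, x)] = (self.N[s] * self.W[(s, x)] + R) / (1 + self.N[s])
                q[x] = r_first + self.W[(s, x)]        # Q-value estimate of Eq. (8)
            best = max(actions, key=lambda a: q[a])    # Step 3(b); max keeps the first (lowest-index) maximizer
            self.history.setdefault(s, deque(maxlen=self.L)).append(best)  # Step 3(c)
            self.N[s] += 1
        self.k += 1                                    # Step 4

Calling iterate() repeatedly and reading off the most recent history entry of s_1 (and of the states it leads to) recovers the policy the procedure has learned.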
A few comments on the algorithm are in order.
We will present sufficient conditions on the sequence {α_k} to
ensure desirable convergence properties in the next section
(see Lemma 4.2). It should be noted that deliberate introduction
of stochastic noise in choosing players to perform best response
computations through parameter α_k is not required in theory if the
underlying stochastic behavior of the system itself ensures that
every state will be visited infinitely often irrespective of decisions
in earlier periods. More generally, the choose players phase and
the best response phase are entirely decoupled and the only
information the latter needs from the former is a set of players
chosen to perform a best response computation. Our convergence
results remain valid irrespective of the details of the mechanism
used to select this set of players as long as the mechanism ensures
that all players are chosen infinitely often with probability one. In
the extreme case, every player may be chosen to perform a best
response calculation in every iteration resulting in the synchronous procedure outlined in Section 3.1. We reiterate the slight
difference between the action sampling mechanisms in the choose
players phase and the best response phase: the latter does not
employ α_k. This ensures that the estimates W(s_t, x_t) are based on
actions that were best responses in earlier iterations, thus avoiding unnecessary stochastic errors in best response calculations.
This leads to stronger convergence results and potentially better
practical performance.
The idea of using history of finite length in FP-type algorithms
has been used with much success in computational game theory
in the past (see for example the adaptive play (AP) algorithm
in [35]). In our context, as mentioned earlier, it is intended to
reduce correlation between early and late iterations, progressively
producing better value estimates. The stochastic noise and finite
history in the sampling process may also be viewed as the players themselves being boundedly rational [19].
To assess the computational effort in each iteration, note that since X_t(s_t) is a finite set for all s_t ∈ S_t and t = 1, 2, ..., T, there exists a positive integer A such that |X_t(s_t)| ≤ A for all t and s_t. Therefore, the number of paths p^k(s_t^k, x_t) that player s_t^k samples in the best response phase is at most A. Each of these paths includes T − t + 1 players. The number of players who participate in one iteration is therefore at most A Σ_{t=1}^{T} (T − t + 1) = A T (T+1)/2. This observation will be used in comparing our procedure with a benchmark in the numerical results of Section 5.
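For instance, with the data of the first inventory example of Section 5 (at most A = 2 feasible actions per state and horizon T = 3), this bound evaluates to

\[ A\sum_{t=1}^{T}(T-t+1) \;=\; \frac{A\,T(T+1)}{2} \;=\; \frac{2\cdot 3\cdot 4}{2} \;=\; 12 \]

players per iteration, which is the figure used there to compare 50 iterations of SFP (at most 600 sampled states) with the AMS benchmark.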
In the parlance of existing simulation-based asynchronous ADP
algorithms for model-free finite-horizon problems, Step 2 above
can be seen as a mechanism to select states at which policies will
be updated. Step 3(a) can be viewed as an approximate policy
evaluation step, whereas Step 3(b) can be regarded as a policy improvement mechanism. The probability distribution from which actions will be sampled in the next iteration is updated in Step 3(c)
(indirectly) through changes to the history of plays. In the next
two paragraphs, we briefly mention fundamental features of some
other simulation-based table look-up methods for the reader to
contrast with our SFP procedure. (For a more detailed comparison,
and a discussion of convergence results, see Section 6.)
In Monte Carlo methods, the state–action value (that is, Q-value) estimate is updated in Step 3(a) only for each state–action pair on the path selected in Step 2, by adding the total reward on that path following the occurrence of that state–action pair and then smoothing. In comparison, in our SFP procedure the state–action value estimate is updated for each state on the selected path and each action feasible in that state. Thus the counterpart of Step 3(a) in typical Monte Carlo methods is simpler than in the pseudo-code above. The decision in each state on the selected path is then set to be optimal with respect to the current Q-value estimates (Steps 3(b) and 3(c)). This updated policy is then used in the next iteration to select a new path. See Fig. 5.4 in [32].
Temporal difference methods, unlike Monte Carlo methods,
employ bootstrapping to update Q-value estimates of state–action
pairs on the selected path. That is, instead of simply adding total
reward following the occurrence of a state–action pair on the
selected path, Q-value estimates ‘‘downstream’’ on the selected
path are used in the update mechanism. See Fig. 6.9 in [32]. The
well-known Q-learning [34] algorithm also uses bootstrapping
and can in fact be viewed as an extension of a particular temporal
difference method [4]. See Fig. 6.12 in [32]. The SFP pseudo-code
above, on the other hand, does not bootstrap.
We present convergence results for our asynchronous SFP
algorithm in the next section.
4. Convergence results
We first introduce some notation. We define the value of any state s_t ∈ S_t for t = 1, 2, 3, ..., T under policy π ∈ Π as

V_π(s_t) = r_t(s_t, π(s_t)) + E[ Σ_{i=t+1}^{T} r_i(s_i, π(s_i)) ],

where the expectation is with respect to the joint distribution induced by oracles O_t, O_{t+1}, ..., O_T. The corresponding optimal value is defined as

V*(s_t) ≡ max_{π ∈ Π} V_π(s_t).    (11)
The above maxima are achieved because the set Π is finite. It is well known [3,27] that the optimal values satisfy the following recursive equation:

V*(s_t) = max_{x_t ∈ X_t(s_t)} ( r_t(s_t, x_t) + E_{O_t(s_t, x_t)}[V*(s_{t+1})] ),    (12)

where the subscript on the expectation operator asserts that the expectation is with respect to the probability distribution of state s_{t+1} over S_{t+1} as induced by the stochastic oracle O_t that receives the state–decision pair (s_t, x_t) as input. Note here that V*(s_{T+1}) = 0 for all terminal states s_{T+1} by assumption. The decision prescribed in state s_t by the optimal policy π* is given by

π*(s_t) = argmax_{x_t ∈ X_t(s_t)} ( r_t(s_t, x_t) + E_{O_t(s_t, x_t)}[V*(s_{t+1})] ),    (13)

where the quantity being maximized is denoted Q(s_t, x_t); that is,

π*(s_t) = argmax_{x_t ∈ X_t(s_t)} Q(s_t, x_t).    (14)
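When the transition probabilities are available explicitly (as they are for the benchmark computations in Section 5, where optimal values are obtained by backward recursion), Eqs. (12)–(14) can be evaluated directly. The sketch below assumes a hypothetical explicit model callable returning (probability, next state, reward) triples; it is included only to contrast the model-based recursion with the model-free SFP procedure.

from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable
# Hypothetical explicit model: model(t, s, x) -> list of (prob, next_state, reward) triples,
# with the deterministic one period reward r_t(s, x) repeated in each triple.
Model = Callable[[int, State, Action], List[Tuple[float, State, float]]]


def backward_recursion(states: List[List[State]],
                       actions: Callable[[int, State], List[Action]],
                       model: Model, T: int) -> Tuple[Dict[State, float], Dict[State, Action]]:
    """Compute V*(s) via Eq. (12) and pi*(s) via Eqs. (13)-(14); terminal values
    are zero by convention. states[t-1] lists S_t for t = 1, ..., T."""
    V: Dict[State, float] = {}
    pi: Dict[State, Action] = {}
    for t in range(T, 0, -1):                  # t = T, T-1, ..., 1
        for s in states[t - 1]:
            best_q, best_x = None, None
            for x in actions(t, s):
                # Q(s, x) = r_t(s, x) + E[V*(s_{t+1})]; unseen (terminal) states contribute 0.
                q = sum(p * (r + V.get(s_next, 0.0)) for p, s_next, r in model(t, s, x))
                if best_q is None or q > best_q:
                    best_q, best_x = q, x
            V[s], pi[s] = best_q, best_x
    return V, pi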
For t = 1, ..., T and any s_t ∈ S_t, let γ(s_t) be the event that there exists an iteration K_{s_t} such that for all iterations k ≥ K_{s_t}, each of the L entries in the history of the player corresponding to state s_t is the optimal decision π*(s_t). Similarly, for any t = 1, ..., T, let Ω(t) be the event that there exists an iteration K_t such that for all iterations k ≥ K_t, all L entries in the history of the player corresponding to any state s ∈ S_t ∪ S_{t+1} ∪ ... ∪ S_T equal the optimal action π*(s). Note that (∩_{r=t,...,T} ∩_{s_r ∈ S_r} γ(s_r)) ⇒ Ω(t), since we can choose K_t = max_{r=t,...,T} max_{s_r ∈ S_r} K_{s_r}. This implies that P[Ω(t)] ≥ P[∩_{r=t,...,T} ∩_{s_r ∈ S_r} γ(s_r)]. Similarly, (∩_{s_{t-1} ∈ S_{t-1}} γ(s_{t-1})) ∩ Ω(t) ⇒ Ω(t−1).
The rest of this section is devoted to the proof of the following
theorem.
Theorem 4.1. P[Ω(1)] = 1, that is, the probability that there exists an iteration K_1 such that for all iterations k ≥ K_1 of asynchronous SFP for ADP, all L entries in the histories of players corresponding to all states s ∈ S = S_1 ∪ ... ∪ S_T equal the optimal action π*(s) is one.
We employ an intuitive backward induction proof technique.
In particular, we first show that players in ST are able to compute
optimal actions the first time they are in play. Once all players in
ST have been in play at least once (this is ensured by Lemma 4.2),
players in ST 1 are able to generate increasingly accurate estimates of the total expected reward obtained on playing each of
their actions. Consequently, players in ST 1 are eventually able to
compute their optimal actions after being in play in a sufficiently
large number of iterations (this again is ensured by Lemma 4.2).
This process is then propagated backwards through periods T 1,
T 2, y, 1. In particular, after a sufficiently large number of
iterations, all best responses are ‘‘correct’’ implying that every
entry in the players’ history is optimal. These ideas are now
formalized.
Observe that, by problem definition, every state s_{t+1} ∈ S_{t+1} is reachable from the initial state s_1 (otherwise s_{t+1} would not be a member of S_{t+1}). That is, for every s_{t+1} ∈ S_{t+1}, there exists a state–action pair (s_t, x_t) with s_t ∈ S_t and x_t ∈ X_t(s_t) such that the probability that oracle O_t returns state s_{t+1} upon receiving (s_t, x_t) as input is strictly positive. Because we have a finite number of states and actions, there is a uniform (over all periods t, all states s_{t+1}, and all state–action pairs (s_t, x_t)) lower bound q > 0 on these strictly positive probabilities.
Lemma 4.2. If α_k ≥ (1/k)^{1/T} for all iterations k, then every player is in play infinitely often with probability one.

Proof. Let E^j(s_t) be the event that the player corresponding to a specific state s_t ∈ S_t is in play in iteration j and the t − 1 players in stages 1, 2, ..., t − 1 each sampled actions randomly from their sets of feasible actions (as opposed to sampling from their history of best responses) in the choose players phase. Observe that P[E^j(s_t)] ≥ (q(α_j/A))^T ≥ (q/A)^T (1/j) and hence Σ_{j=1}^∞ P[E^j(s_t)] = ∞. Since events E^1(s_t), E^2(s_t), ... are independent, the probability that infinitely many of them occur is one by the second Borel–Cantelli lemma [12]. □
Note that a fixed constant strictly positive value of α_k for all k, such as α_k = 1, clearly satisfies the sufficient condition in Lemma 4.2 above. However, the importance of decreasing values of α_k is that they employ asymptotically diminishing exploration in the choose players phase, giving more and more weight to best response results of previous iterations along the progress of the algorithm. Potential practical benefits of changing the rate of exploration over iterations instead of using a constant rate have been discussed in the literature, for instance in [32]. An interesting question in our context then is how small a value of α_k can be used while still guaranteeing that each player is chosen infinitely often with probability one. Lemma 4.2 provides a simple bound on such a value.
The proof of Theorem 4.1 employs the following inductive statement.

Induction statement M_t: P[Ω(t)] = 1, that is, the probability that there exists an iteration K_t such that for all iterations k ≥ K_t of asynchronous SFP for ADP, all L entries in the histories of players corresponding to all states s ∈ S_t ∪ ... ∪ S_T equal the optimal action π*(s) is one.
Lemma 4.3. M_T is true, that is, P[Ω(T)] = 1.

Proof. Consider any state s_T ∈ S_T. This player is in play infinitely often with probability one owing to Lemma 4.2. Moreover, in every iteration k in which this player is in play, it solves the best reply problem x̄^k(s_T) ∈ argmax_{x_T ∈ X_T(s_T)} r_T(s_T, x_T), since W^k(s_T, x_T) = 0 in problem (8) for all x_T ∈ X_T(s_T). Thus, x̄^k(s_T) = π*(s_T) in every such iteration k owing to the principle of optimality and uniqueness of π*. Let K_{s_T} be the iteration in which this player is in play for the L-th time (such an iteration exists with probability one, again owing to Lemma 4.2). Then for all iterations k ≥ K_{s_T}, each of the L entries in this player's history is π*(s_T). Since s_T ∈ S_T was arbitrary, P[γ(s_T)] = 1 for every s_T ∈ S_T. Then it is easy to see that P[∩_{s_T ∈ S_T} γ(s_T)] = 1. This implies that P[Ω(T)] = 1. □
Now we prove an intermediate lemma.
Lemma 4.4. M_t implies that for every player–action pair (s_{t-1}, x_{t-1}) with s_{t-1} ∈ S_{t-1} and x_{t-1} ∈ X_{t-1}(s_{t-1}), the estimate W^k(s_{t-1}, x_{t-1}) converges to E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) as k → ∞ with probability one. As a result, r_{t-1}(s_{t-1}, x_{t-1}) + W^k(s_{t-1}, x_{t-1}) converges to r_{t-1}(s_{t-1}, x_{t-1}) + E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) as k → ∞ with probability one.
Proof. Recall that I(s_{t-1}, k) is the set of iterations up to and including iteration k in which the player corresponding to state s_{t-1} is in play. Let J(s_{t-1}, m, n) be the set of iterations between iterations m and n (not including m but including n) in which the player corresponding to state s_{t-1} is in play. Let K_t be as defined in the induction statement above and let k ≥ K_t be any iteration large enough that |I(s_{t-1}, k)| > 0 (such an iteration exists with probability one by Lemma 4.2). We have

|W^k(s_{t-1}, x_{t-1}) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t)| = | (1/|I(s_{t-1}, k)|) Σ_{i ∈ I(s_{t-1}, k)} Σ_{j=0}^{T−t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) |.

Using I ≡ I(s_{t-1}, K_t) and J ≡ J(s_{t-1}, K_t, k) for brevity, the above right hand side becomes

| (1/(|I|+|J|)) Σ_{i ∈ I ∪ J} Σ_{j=0}^{T−t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) |,

which in turn is bounded above by

(1/(|I|+|J|)) | Σ_{i ∈ I} Σ_{j=0}^{T−t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) − |I| E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) | + (|J|/(|I|+|J|)) | (1/|J|) Σ_{i ∈ J} Σ_{j=0}^{T−t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) |.

Since the state and decision spaces are finite, the absolute values of the one period deterministic rewards are uniformly bounded above, say by a positive number R, and hence |E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t)| ≤ R(1+T−t) as well. As a result, the above right hand side is at most

2|I| R (1+T−t) / (|I|+|J|) + (|J|/(|I|+|J|)) | (1/|J|) Σ_{i ∈ J} Σ_{j=0}^{T−t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) |.

The first term above converges to zero as k → ∞ with probability one since |J(s_{t-1}, K_t, k)| → ∞ as k → ∞ with probability one owing to Lemma 4.2. Similarly, |J|/(|I|+|J|) → 1 as k → ∞ with probability one. Note that M_t implies that for all iterations i ∈ J(s_{t-1}, K_t, k), x_{t+j}^i = π*(s_{t+j}^i) for j = 0, ..., T−t. Moreover, when the state–decision pair (s_{t-1}, x_{t-1}) is sent to the oracle O_{t-1}, the sums

Σ_{j=0}^{T−t} r_{t+j}(s_{t+j}^i, π*(s_{t+j}^i))

are independent and identically distributed for iterations i ∈ J(s_{t-1}, K_t, k). Therefore,

(1/|J|) Σ_{i ∈ J} Σ_{j=0}^{T−t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) → E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t)  as k → ∞

with probability one by the strong law of large numbers and the definition of V*(s_t) for any s_t ∈ S_t. Thus the above term of interest converges to zero with probability one. Hence |W^k(s_{t-1}, x_{t-1}) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t)| is bounded below by zero and bounded above by a term that converges to zero as k → ∞ with probability one. Therefore, W^k(s_{t-1}, x_{t-1}) converges to E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) as k → ∞ with probability one. This completes the proof. □
The above intermediate lemma helps us restore the inductive
hypothesis as follows.
Lemma 4.5. M_t implies M_{t−1}.
Proof. Consider δ(s_{t-1}) defined as follows:

δ(s_{t-1}) = min_{x_{t-1} ∈ X_{t-1}(s_{t-1}), x_{t-1} ≠ π*(s_{t-1})} ( V*(s_{t-1}) − r_{t-1}(s_{t-1}, x_{t-1}) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) ),

that is, δ(s_{t-1}) is the difference between the optimal value of state s_{t-1} and the value of the next-best feasible action in s_{t-1}. The above minimum is well defined since X_{t-1}(s_{t-1}) is a finite set. The definition of V*(s_{t-1}) implies that δ(s_{t-1}) ≥ 0, whereas uniqueness of the optimal policy implies that δ(s_{t-1}) > 0. The best reply problem (8) for the player corresponding to state s_{t-1} implies that if |r_{t-1}(s_{t-1}, x_{t-1}) + E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) − r_{t-1}(s_{t-1}, x_{t-1}) − W^k(s_{t-1}, x_{t-1})| < δ(s_{t-1})/2 for all x_{t-1} ∈ X_{t-1}(s_{t-1}), then x̄^k(s_{t-1}) = π*(s_{t-1}). For any ε > 0, let A_ε(s_{t-1}, x_{t-1}) be the event that there exists an integer N_ε(s_{t-1}, x_{t-1}) such that for all iterations k ≥ N_ε(s_{t-1}, x_{t-1}),

|r_{t-1}(s_{t-1}, x_{t-1}) + W^k(s_{t-1}, x_{t-1}) − r_{t-1}(s_{t-1}, x_{t-1}) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t)| < ε.

Note that the almost sure convergence r_{t-1}(s_{t-1}, x_{t-1}) + W^k(s_{t-1}, x_{t-1}) → r_{t-1}(s_{t-1}, x_{t-1}) + E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t) is equivalent (see [30]) to P[A_ε(s_{t-1}, x_{t-1})] = 1. Therefore,

P[ ∩_{x_{t-1} ∈ X_{t-1}(s_{t-1})} A_{δ(s_{t-1})/2}(s_{t-1}, x_{t-1}) ] = 1.

More specifically, choosing N_{δ(s_{t-1})/2}(s_{t-1}) = max_{x_{t-1} ∈ X_{t-1}(s_{t-1})} N_{δ(s_{t-1})/2}(s_{t-1}, x_{t-1}), the probability that

|r_{t-1}(s_{t-1}, x_{t-1}) + W^k(s_{t-1}, x_{t-1}) − r_{t-1}(s_{t-1}, x_{t-1}) − E_{O_{t-1}(s_{t-1}, x_{t-1})} V*(s_t)| < δ(s_{t-1})/2

for all x_{t-1} ∈ X_{t-1}(s_{t-1}) and for all iterations k ≥ N_{δ(s_{t-1})/2}(s_{t-1}) is one. As a result, since the player corresponding to state s_{t-1} is in play infinitely often with probability one, the probability that there exists an iteration K'_{s_{t-1}} such that x̄^k(s_{t-1}) = π*(s_{t-1}) for all iterations k ≥ K'_{s_{t-1}} is one. This implies that P[γ(s_{t-1})] = 1. Then it is easy to see that P[∩_{s_{t-1} ∈ S_{t-1}} γ(s_{t-1})] = 1. Together with M_t, i.e., P[Ω(t)] = 1, this implies that P[Ω(t−1)] = 1. That is, M_{t−1} holds.

By the principle of induction, Lemmas 4.3 and 4.5 imply that M_1 is true, that is, P[Ω(1)] = 1. This concludes the proof of Theorem 4.1. □
Since M_T, ..., M_t, ..., M_1 are true, Lemma 4.4 implies that for any t = 1, ..., T−1, and any state–decision pair (s_t, x_t) with s_t ∈ S_t and x_t ∈ X_t(s_t), W^k(s_t, x_t) converges to E_{O_t(s_t, x_t)} V*(s_{t+1}) as k → ∞ with probability one. In addition, Theorem 4.1 above states that with probability one, there exists an iteration K_1 such that for all iterations k ≥ K_1, every entry in the history of the player corresponding to state s_1, denoted by h^k(s_1), is π*(s_1). We thus have the following ''value convergence'' result.

Corollary 4.6. r_1(s_1, h^k(s_1)) + W^k(s_1, h^k(s_1)) converges with probability one to V*(s_1) as k → ∞.
Let E^k(s_t) denote the event that state s_t ∈ S_t is in play in iteration k. Given event E^k(s_t), let y^k(s_t) denote the feasible action sampled by state s_t in the choose players phase of the SFP algorithm of Section 3.2. Similarly, let D^k(s_t) denote the event that state s_t gets the opportunity to sample an action in the best response phase in iteration k, and given D^k(s_t), let z^k(s_t) denote an action sampled by s_t. Then Theorem 4.1 and the ideas discussed in its proof imply the following ''solution convergence'' result.

Corollary 4.7. For any 0 < ε < 1, there exists an iteration K_ε such that P[y^k(s_t) = π*(s_t) | E^k(s_t)] > 1 − ε for all k ≥ K_ε. That is, sufficiently far along the progress of the algorithm, the actions sampled by players in the choose players phase are optimal with arbitrarily high probability. Similarly, P[z^k(s_t) = π*(s_t) | D^k(s_t)] = 1 for sufficiently large k. That is, sufficiently far along the progress of the algorithm, the actions sampled by players in the best response phase are optimal.
In summary, the version of SFP designed here uses action samples of size one and a finite history of past best responses, and allows players to make sampling mistakes. It is much simpler to
implement in practice and computationally more efficient than
the original approach in [20], and yet exhibits stronger convergence behavior. In the next section, we present numerical results
that suggest our asynchronous SFP procedure is effective in
practice.
5. Numerical results
We apply SFP to two stochastic inventory control examples
from [7] and to the game of TIC-TAC-TOE against nature. The state
spaces in the inventory control examples are small, whereas in
TIC-TAC-TOE the state space is moderate.
5.1. Stochastic inventory control Example 1
This example is taken from [7]. Our main reason for choosing
this example is the same as given in [7], namely, that it can be
solved either by using inventory theory or by stochastic dynamic
programming backward recursion as described in [7], and hence it
is easy to test the performance of SFP. Moreover, it provides us the
opportunity to compare the performance of SFP with that of a
recent simulation-based method for stochastic dynamic
programming.
Consider the following T period inventory control problem. At
the beginning of period t ¼1,2,y,T, we observe inventory level st
of a single product and decide whether to order an amount q of
the product or not. The ordered amount arrives immediately. The
inventory levels are bounded above by a positive integer M. Note
that an order can be placed only if s_t + q ≤ M. A random demand D_t is realized after receiving the placed order. The distribution of this demand is not explicitly specified to the decision maker; however, it is known that demand is an integer from the set {0, 1, ..., D}, where D is a positive integer. The inventory s_{t+1} at the beginning of the next period is given by s_{t+1} = max{s_t + x_t − D_t, 0}, that is, backlogging is not allowed and unsatisfied demand is lost. We incur three types of cost. There is a fixed cost of K every time we place an order, that is, every period t in which x_t = q. Every unit of inventory left over at the end of period t is charged an inventory holding cost of h. There is also a penalty cost of p per unit of lost demand. The initial inventory at the beginning of the first period is s_1. The objective is to decide the order quantity for every possible inventory level in each period so as to minimize the expected total cost over the entire horizon of T periods. Observe that this problem is slightly more general than the generic formulation described in Section 2 in that the one period cost for a given state–action pair is random. In particular, the one period cost for ordering a feasible amount x_t when the inventory is s_t is given by

K · I(x_t > 0) + h · (s_t + x_t − D_t)^+ + p · (s_t + x_t − D_t)^−,

where I(x_t > 0) is the indicator function which equals one if x_t > 0 and zero otherwise, (s_t + x_t − D_t)^+ = max{0, s_t + x_t − D_t}, and (s_t + x_t − D_t)^− = max{0, −(s_t + x_t − D_t)}. We solve this problem using
the following data from [7]: T = 3, q = 10, M = 20, h = 1, and the demand is a uniformly chosen integer from [0,9]. SFP does not use this information about the demand distribution; however, this information is needed in our numerical experiments to implement the oracles O_t. We use two values of K = 0, 5 and two values of p = 1, 10. The initial inventory is s_1 = 5. We wish to estimate V*(s_1).
It is important to reiterate that even though the demand
distribution is known to be uniform, that information is used only
by the backward recursion algorithm we employed to calculate
optimal values to be compared with the results of our SFP
procedure. The SFP procedure does not use this explicit demand
distribution information but rather ‘‘learns’’ the distribution
through repeated samples generated from it. As a result, the
challenge in this model-free version of the stochastic inventory
control problem with small action and state spaces is in accurately
estimating the expected rewards and then optimizing the ordering
policy based on these estimates.
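For concreteness, the two oracles for this example can be sketched as below; the function names and the convention of returning the negative one period cost as a reward (so that the maximization framework of Section 2 applies to this minimization problem) are our own choices. The uniform demand distribution lives inside the oracle and is never exposed to SFP.

import random
from typing import List, Tuple


def feasible_orders(s: int, q: int = 10, M: int = 20) -> List[int]:
    """Oracle F_t for Example 1: order q units or nothing; ordering is feasible
    only when the inventory bound s + q <= M is respected."""
    return [0, q] if s + q <= M else [0]


def inventory_oracle(s: int, x: int, K: float, h: float, p: float,
                     max_demand: int = 9) -> Tuple[int, float]:
    """Oracle O_t for Example 1: sample a uniform demand on {0, ..., max_demand}
    and return the next inventory level together with the negative one period cost
    K*I(x>0) + h*(s + x - D)^+ + p*(s + x - D)^-."""
    d = random.randint(0, max_demand)
    leftover = max(s + x - d, 0)   # (s + x - D)^+ : inventory carried over (no backlogging)
    lost = max(d - (s + x), 0)     # (s + x - D)^- : lost demand
    cost = K * (x > 0) + h * leftover + p * lost
    return leftover, -cost

With (K, p) set to one of the four combinations above, these callables can stand in for F_t and O_t in the SFP sketch of Section 3.2.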
Table 1
Estimates V(s_1 = 5) of V*(s_1 = 5) for stochastic inventory control Example 1, averaged over 30 independent runs of 50 iterations of SFP with L = 1 and L = 5. The three estimates reported in [7], each with sample sizes N = 4 and N = 32, are listed in the last three columns for comparison.

(K,p)    V*(s_1=5)   SFP, L=1   SFP, L=5   AMS Est. 1 (N=4, N=32)   AMS Est. 2 (N=4, N=32)   AMS Est. 3 (N=4, N=32)
(0,1)    10.440      10.8856    11.4699    15.03, 11.23             9.13, 10.45              9.56, 10.49
(0,10)   24.745      24.9614    25.7516    30.45, 26.12             19.98, 24.73             20.48, 24.74
(5,1)    10.490      12.4176    12.3314    18.45, 11.47             10.23, 10.46             10.41, 10.46
(5,10)   31.635      31.183     32.6679    37.52, 33.11             26.42, 31.62             26.92, 31.64
The first column in Table 1 below lists the values of K and p
used. The second column lists the corresponding optimal value
V*(s_1 = 5). The third column reports the average of the optimal
values estimated by our algorithm after 50 iterations over 30
independent runs when the players remember only their most
recent best response, i.e., L = 1. The fourth column lists this average when the players remember their five most recent best responses, i.e., L = 5. Parameter α_k was taken to be (1/k)^{1/T} for all iterations k in all runs, as it satisfies the rate calculated in Lemma
4.2. In order to estimate the optimal value of state st, the algorithm
in [7], which the authors term AMS for adaptive multistage
sampling, recursively samples N states from St + 1 for each action
in st and proposes three different estimators: the first estimator
uses a weighted average of Q-values, the second estimator uses
the maximum Q-value whereas the third one employs the Q-value
of the most frequently sampled action. For comparison, we have
copied from [7] estimates of the optimal value of state s_1 = 5 using these three estimators with sample sizes N = 4 and N = 32 (these were the smallest and the largest sample sizes used, respectively) in the last three columns of Table 1. The evolution of our estimate of V*(s_1 = 5) is plotted versus 50 iterations in Figs. 1 and 2.
The runtimes of SFP and the AMS algorithm of [7] can be
assessed by looking at the number of states sampled by these
methods, since sampling is the bottleneck step of both algorithms.
This method of analysis provides a reasonable comparison of the
algorithms, including the rate of growth of runtime as number
of iterations or problem size increases, without focusing on the
specifics of implementation or computers used. Since the optimal
value of a state in AMS is recursively estimated by sampling
N states from the next stage for each action in the state at hand,
the number of states sampled in a problem with A actions, horizon
T and sample size N is (AN)^T [7, p. 129].
[Figure 1 about here: four panels plotting ''Estimate of optimum value'' versus iterations, one per (K,p) combination, each with L = 1.]
Fig. 1. Evolution of the optimal value estimate with iterations for four test problems with L = 1 in Example 1. Each plot shows the estimates averaged over 30 independent runs, and the corresponding standard error bars. The optimal value reported in Table 1 is shown on each plot with a flat line.
Since A = 2 and T = 3 in this inventory control example, the number of states sampled by AMS is 512 with N = 4 and 262,144 with N = 32. Recalling the discussion of the asynchronous algorithm in Section 3.2, the number of states sampled by SFP in 50 iterations is 50 · A T (T+1)/2 = 50 · 2 · 12/2 = 600, and hence its computational effort for this problem is comparable to the faster version (N = 4) of AMS.
Comparing the optimal value estimates obtained by SFP to those obtained by AMS using N = 4, which requires similar computational effort, we observe (Table 1) that, for the cases (K, p) = (0, 1), (0, 10), (5, 10), SFP estimates are significantly closer to the optimal value than those obtained by AMS with any of the three estimators. Comparing SFP estimates for these three cases to AMS with N = 32, we see that SFP estimates are more accurate than those obtained with the first estimator and only slightly worse than ones obtained with the other two estimators, even though SFP requires only 0.23% of the samples that AMS with N = 32 does. When (K, p) = (5, 1), the SFP estimate outperforms the first estimator with N = 4, but is worse in the other cases. Finally, the tables indicate that SFP performance is not very different for L = 1 and L = 5, although L = 1 seems to work slightly better for this problem.
Like any other finite-horizon dynamic programming algorithm, the computational effort in our procedure depends on the length of the problem horizon. This is illustrated for the inventory control example at hand in Table 2. The table lists the average of our optimal value estimate over 30 independent runs of 50 iterations each, as in Table 1, for L = 1 and various values of T. Optimal values are also listed for comparison. The percentage relative error in our optimal value estimates after 50 iterations is plotted versus horizon length in Fig. 3. We remark that our algorithm was able to reach closer to the optimal values in all cases after running more iterations; however, this is not apparent in Table 2 since we deliberately fixed the iterations to 50 to bring out the sensitivity of error to horizon T. To illustrate this point, we have included Table 3, which compares optimal value estimates obtained for T = 10 with 50 and 200 iterations. Plots of optimal value estimates over 200 SFP iterations for T = 10 are also shown in Fig. 4. The choice of 200 was motivated by the fact that the complexity of the standard backward recursion method of dynamic programming grows linearly with horizon [5, Section 5.1]; since we used 50 iterations for T = 3, 200 iterations for T = 10 seem appropriate.
5.2. Stochastic inventory control Example 2
Our second example is also taken from [7] and is similar to Example 1 above except that we may now order any integer amount in {0, 1, ..., 20} (and so A = 21). Since SFP performed slightly better with L = 1 for the first example, we only focus on that case for the second example. The numerical results are presented in Table 4. The second column lists the average estimate of the optimal value over 30 independent runs of 5000 iterations each. As before, we include for comparison three estimates obtained by the AMS algorithm of [7] with the smallest (N = 21) and the largest (N = 35) sample sizes that were reported there. The convergence behavior of our average optimal value estimate with iterations is illustrated in Fig. 5.
We again turn to the number of states sampled to compare the computational effort of SFP and AMS. The number of states sampled by AMS is (AN)^T = (21 · 21)^3 = 85,766,121 when N = 21 and (21 · 35)^3 = 397,065,375 when N = 35, whereas it is 5000 · 21 · 12/2 = 630,000 in SFP.
Table 4 shows that, in spite of the significantly smaller number of states sampled by SFP, its optimal value estimates are considerably more accurate than those of AMS for 20 of the 24 combinations of parameter values (K,p), estimator type and sample size, and close in the remaining four instances. The percentage relative error 100 · |V(s_1 = 5) − V*(s_1 = 5)| / V*(s_1 = 5) in SFP estimates across problem instances and estimator types is on average about 15 times smaller than that of AMS with N = 21, and about 8 times smaller with N = 35, even though in these two cases SFP samples only 0.74% and 0.16%, respectively, of the states sampled by AMS.
[Figure 2 about here: four panels plotting ''Estimate of optimum value'' versus iterations, one per (K,p) combination, each with L = 5.]
Fig. 2. Evolution of the optimal value estimate with iterations for four test problems with L = 5 in Example 1. Each plot shows the estimates averaged over 30 independent runs, and the corresponding standard error bars. The optimal value reported in Table 1 is shown on each plot with a flat line.
Table 2
Sensitivity of the accuracy of optimal value estimates to horizon T in stochastic inventory control Example 1. The first column lists various horizon lengths. The other columns present the optimal value V*(s1 = 5) and the optimal value estimate V(s1 = 5), averaged over 30 independent trials of 50 iterations of SFP with L = 1 (separated by a comma), for different combinations of (K,p). The percentage relative error between V*(s1 = 5) and V(s1 = 5) is plotted in Fig. 3. Note that the first row of results is copied from Table 1.

T    (K = 0, p = 1)      (K = 0, p = 10)     (K = 5, p = 1)      (K = 5, p = 10)
3    10.440, 10.8856     24.745, 24.9614     10.490, 12.4176     31.635, 31.183
4    14.4752, 14.3797    31.8861, 31.6529    14.9452, 17.6562    40.9761, 40.9536
5    18.0422, 18.5967    39.0273, 39.0144    19.4368, 23.6745    50.3198, 50.7654
6    22.5475, 23.0523    46.1685, 47.4667    23.9345, 29.498     59.6634, 59.5954
7    26.587, 27.8457     53.3097, 54.4915    28.4351, 34.5876    69.0071, 70.549
8    30.6267, 32.6117    60.4509, 62.0641    32.9351, 39.6497    78.3507, 80.1529
9    34.6663, 36.9863    67.5921, 70.732     37.4351, 43.9118    87.6944, 89.0752
10   38.706, 42.1640     74.7333, 79.6176    41.9351, 50.7621    97.038, 101.505
Fig. 3. Sensitivity of the accuracy of our optimal value estimates to problem horizon in stochastic inventory control Example 1. The four panels plot the estimation error e after 50 iterations against horizon length T for (K = 0, p = 1), (K = 0, p = 10), (K = 5, p = 1) and (K = 5, p = 10), all with L = 1. The percentage relative error between optimal values and their estimates in Table 2 is given by e = 100 |V*(s1 = 5) - V(s1 = 5)| / V*(s1 = 5). In each plot, these relative errors are shown as black dots connected by a smoothed solid black line to illustrate the error trend. The dashed trend lines in each plot were obtained by fitting either a quadratic (in (a), (b), (d)) or a linear (in (c)) curve to the error data.
Table 3
Comparison of optimal value estimates averaged over 30 independent runs with 50 and with 200 iterations for Example 1 with T = 10.

(K,p)    V*(s1 = 5)   50 iterations   200 iterations   % improvement
(0,1)    38.706       42.1640         40.7585          3.33
(0,10)   74.7333      79.6176         77.3510          2.84
(5,1)    41.9351      50.7621         47.6187          6.19
(5,10)   97.038       101.505         98.5661          2.89
5.3. TIC-TAC-TOE against nature

TIC-TAC-TOE is a well-known, relatively simple two-player game played on a 3-by-3 square grid (called the game-board), where the two players take turns placing distinct tokens (typically an "X" and an "O") in empty squares on the grid. The first player to place three tokens in a horizontal row, a vertical column, or a diagonal wins the game. Thus the game lasts for at most nine moves in total. It is well known [32] that every game of TIC-TAC-TOE between two "expert players" ends in a tie, whereas an expert player never loses no matter whom it is pitted against [25].
Fig. 4. Evolution of the optimal value estimate with iterations for the four test problems ((K,p) = (0,1), (0,10), (5,1), (5,10)) with L = 1 in Example 1 for T = 10. Each plot shows the estimates for 200 SFP iterations averaged over 30 independent runs, together with the corresponding standard error bars. The optimal value is shown on each plot as a flat line.
Table 4
Estimates V(s1 = 5) of V*(s1 = 5) for stochastic inventory control Example 2, averaged over 30 independent runs of 5000 iterations of SFP with L = 1. The three estimates reported in [7], each with sample sizes N = 21 and N = 35, are listed in the last three columns for comparison.

(K,p)    V*(s1 = 5)   SFP estimate   AMS estimates from [7]
                      (L = 1)        Estimate 1          Estimate 2          Estimate 3
                                     N = 21    N = 35    N = 21    N = 35    N = 21    N = 35
(0,1)    7.5          7.5920         24.06     18.82     3.12      6.26      9.79      6.62
(0,10)   13.5         13.5874        29.17     26.06     6.04      12.23     12.06     13.69
(5,1)    10.49        12.2923        33.05     25.33     8.73      10.96     18.62     11.12
(5,10)   25.785       24.6296        39.97     36.89     17.78     24.71     26.76     25.51
In this section we focus on TIC-TAC-TOE against nature, where the second player places its tokens in empty squares chosen randomly. Such a player is also sometimes termed a "novice player." Thus, even when the first player employs a deterministic strategy, its game against a novice player unfolds stochastically. One web site [25] on TIC-TAC-TOE reports that, out of 1000 games between an expert player and a novice player, the expert player won 978 times, whereas the other 22 games ended in a tie. Thus, roughly speaking, the probability that an expert player wins against a novice player is about 0.978, and the probability that the game ends in a tie is about 0.022.
We model TIC-TAC-TOE against nature as a sequential decision problem in which our decision maker corresponds to the first player, playing (say) "X," who makes the first move starting with an empty game-board. The states of this problem correspond to the different possible game-board configurations. Since every square on the board can be empty, occupied by an "X," or occupied by an "O," the total number of states is 3^9 = 19,683. In fact, one can give a more compact description of the state space of the game by taking rotational symmetries into account, but we deliberately did not incorporate this into our implementation, since we wanted to evaluate how SFP performs with a larger state space. Note that in a given state, the first player's feasible actions correspond to choosing one of the empty squares in which to place an "X." Thus, the number of feasible actions is at most nine in any state. Once the first player chooses an action in a specific state, this state-action pair is "sent to an oracle" that returns the new game-board, or equivalently, the state reached after nature, i.e., the second player, places an "O" on a randomly chosen empty square. This sequence of events continues until the game ends, either in a tie or with one of the two players winning. The first player receives a reward of 1 for winning, a reward of -1 for losing, and 0 for a tie. Notice that rewards are earned only at game termination; there are no intermediate rewards along the way. Strictly speaking, terminal rewards do not fit the assumptions of our analysis, but our proofs can be modified in a straightforward manner to accommodate them. Our goal is to find the optimal expected reward for the first player from the initial state in which the game-board is empty. Note that the numbers from the web site mentioned above imply that this reward is roughly 0.978 (= 0.978 x 1 + 0.022 x 0) for an expert player. Thus we expect the optimal expected reward to be very close to 0.978.
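To make the oracle interface concrete, the following Python sketch shows one possible simulator for this sequential decision problem. It is illustrative only (the function and variable names are ours; the implementation actually used in our experiments is not shown), but it captures the state encoding, the random "nature" move, and the terminal rewards described above.

import random

# State: a tuple of 9 cells, 0 = empty, 1 = "X" (decision maker), 2 = "O" (nature).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),      # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),      # columns
         (0, 4, 8), (2, 4, 6)]                 # diagonals

def winner(board, token):
    return any(all(board[i] == token for i in line) for line in LINES)

def feasible_actions(board):
    return [i for i in range(9) if board[i] == 0]

def oracle(board, action):
    """Apply the first player's move, then nature's random reply.

    Returns (next_board, reward, done) with terminal rewards
    +1 (win), -1 (loss), 0 (tie); intermediate rewards are 0."""
    cells = list(board)
    cells[action] = 1                              # first player places an "X"
    if winner(cells, 1):
        return tuple(cells), 1, True
    empties = feasible_actions(cells)
    if not empties:                                # board full: tie
        return tuple(cells), 0, True
    cells[random.choice(empties)] = 2              # nature places an "O" at random
    if winner(cells, 2):
        return tuple(cells), -1, True
    if not feasible_actions(cells):
        return tuple(cells), 0, True
    return tuple(cells), 0, False

# Example: start from the empty board and play "X" in the center square.
state, reward, done = oracle((0,) * 9, 4)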
We ran 10 independent trials of 50,000 iterations each of asynchronous SFP on this problem, using a_k = (1/k)^{1/9} and three different values of L: L = 1, L = 5, and L = 10.
Fig. 5. Evolution of the optimal value estimate with iterations for the four test problems with L = 1 in Example 2. Each plot shows the estimates averaged over 30 independent runs, together with the corresponding standard error bars. The optimal value reported in Table 4 is shown on each plot as a flat line.
The average of our
estimate of the optimal expected value over these 10 trials is
plotted versus iterations in Fig. 6. The average and standard
deviation of estimates obtained after 50,000 iterations are shown
in Table 5. Note again that our SFP procedure does not exploit symmetries or the structure of the game, and does not know that the game is being played against nature, but rather learns the optimal expected reward through repeated stochastic simulations; this explains the relatively large number of iterations required.
6. Discussion and conclusions
We have presented a variant of sampled fictitious play (SFP) that converges to a deterministic optimal policy while solving model-free, finite-horizon stochastic dynamic programs. This convergence result is in stark contrast with the recently proposed, original version of SFP in [20], which may only converge in a very weak sense to the set of mixed strategy Nash equilibria, i.e., to the set of sub-optimal randomized policies. Our SFP variant adaptively biases the sampling mechanism toward high-quality actions by endowing the players with only finite recall, and allows the players to make occasional sampling mistakes to ensure that all players participate in the game infinitely often. Most importantly, it is computationally far less demanding than the original version in [20]. It is well known that table look-up simulation-based methods that explicitly store state value or state-action value information, as in our SFP approach here, work well for small to moderate-sized model-free problems but are unlikely to successfully tackle very large problems [3,4]. For instance, the memory requirement for algorithms using state-action value (Q-value) information is proportional to the product of the sizes of the state and action spaces. We expect SFP to have this feature as well.
Our SFP procedure can be seen as an asynchronous version of a Q-value based [34] variant of Monte Carlo optimistic policy iteration [4,26,33] for model-free finite-horizon ADP [3,9,26,32] (optimistic policy iteration is a somewhat non-standard name used in [4] and [33], whereas in [26] it is called hybrid value/policy iteration). In the infinite-horizon case, synchronous versions of such algorithms (see Section 5 of [33]) maintain Q-values for each state-action pair and generate infinite trajectories starting at every state-action pair, using greedy actions (actions that optimize the Q-values) in every state visited. The Q-values are then updated by "smoothing" (commonly by averaging) the observed cumulative costs along these trajectories. Even though these methods are only of academic interest [33], as they require the generation of infinite trajectories, their analysis provides significant insight into implementable model-free ADP algorithms. In addition, in some cases, for example when eventual absorption into a zero-cost terminal state is inevitable along every trajectory, these methods are indeed implementable in practice. Interestingly, even though variants of optimistic policy iteration have been known for some time, detailed theoretical analyses of them have been rare. In fact, in a list of important open questions in reinforcement learning, Sutton [31] asked whether synchronous Q-function based Monte Carlo optimistic policy iteration converges to optimality. This question was recently resolved in the affirmative by Tsitsiklis [33] using monotonicity properties of the dynamic programming operator. Non-convergence of asynchronous Monte Carlo optimistic policy iteration for infinite-horizon problems is also well known [33]. We were not able to find in the literature a rigorous proof of convergence for asynchronous Monte Carlo optimistic policy iteration methods in the model-free finite-horizon case.
Fig. 6. Evolution of the estimate of the optimal expected reward with iterations for TIC-TAC-TOE against nature with (a) L = 1, (b) L = 5, and (c) L = 10.
Table 5
Average and standard deviation of the optimal expected reward estimates over 10 independent trials of 50,000 iterations each for TIC-TAC-TOE against nature.

L     Average      Standard deviation
1     0.975322     0.001299451
5     0.973054     0.00153441
10    0.97319      0.00127577
We have shown that strong convergence results for our model-free finite-horizon Monte Carlo algorithm are easy to prove using backward induction and the strong law of large numbers, even in the asynchronous case. Part of this simplicity stems from the way in which Q-values are updated in our SFP procedure: the update is performed for every feasible action in a selected state, rather than only for the action currently estimated to be optimal (see the asynchronous SFP pseudo-code and discussion in Section 3).
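As an illustration of this update rule, the following sketch shows a tabular update of the kind just described. It is our own schematic rendering, not the pseudo-code of Section 3; the data structures and names (q_sum, q_count, sample_future_reward) are assumptions made for the example.

from collections import defaultdict

# Schematic sketch of the Q-value update discussed above: in the selected state s
# at stage t, EVERY feasible action is re-estimated by smoothing (here: averaging)
# sampled downstream rewards, rather than updating only the currently greedy action.
q_sum = defaultdict(float)    # running sums of sampled rewards, keyed by (t, s, a)
q_count = defaultdict(int)    # number of samples per (t, s, a)

def update_state(t, s, feasible_actions, sample_future_reward):
    """Update Q-estimates for all feasible actions in state s at stage t.

    sample_future_reward(t, s, a) is assumed to simulate taking action a in (t, s)
    and continuing onward, returning the observed cumulative reward."""
    for a in feasible_actions:
        r = sample_future_reward(t, s, a)
        q_sum[(t, s, a)] += r
        q_count[(t, s, a)] += 1

def greedy_action(t, s, feasible_actions):
    # The action currently estimated to be optimal (ties broken arbitrarily).
    return max(feasible_actions,
               key=lambda a: q_sum[(t, s, a)] / max(q_count[(t, s, a)], 1))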
As noted in the literature [4,32], a key to ensuring convergence to optimality in simulation-based asynchronous ADP algorithms, without significantly sacrificing computational performance, is to bias the underlying action sampling procedure toward good actions, and hence good states, while still sampling all states infinitely often; that is, to properly balance exploitation with exploration. Most asynchronous techniques take an ad hoc approach toward this issue [26]. Recently developed algorithms that attempt to carefully trade off exploitation against exploration in simulation-based methods for model-free finite-horizon stochastic dynamic programs are surveyed in [8], with expanded details appearing in [9]. Our approach in this paper is close in spirit to two multi-stage adaptive sampling algorithms discussed in the second chapter of [9]. Of these, the upper confidence bound (UCB) algorithm from [7] exploits the theory of multi-armed bandit problems [1] to sample actions so as to minimize expected regret (see Section 2.1.4 in [9]), whereas the pursuit learning automata (PLA) technique samples actions from a probability distribution biased toward an estimate of an optimal action, essentially implementing a recursive extension of the pursuit algorithm [28,32]. This probability distribution is iteratively and explicitly updated through a formula that increases the probability of sampling an action estimated to be optimal, that is, a "greedy action," by a fraction β (see Eq. (2.12) in [32]). This "pursuit" of a greedy action motivated the name "pursuit algorithm." In our SFP approach as well, actions are sampled from a probability distribution, namely the one induced by the history of recent plays. This probability distribution is adaptively updated, not via an explicit formula as in PLA, but implicitly and naturally as the history of plays evolves with iterations. Whereas the estimates of the optimal expected value in PLA converge in probability (Theorem 2.9, p. 50 in [9]), we are able to obtain almost sure convergence results because we do not have exogenous noise in the action sampling mechanism employed in the best response process. Specifically, we proved in Theorem 4.1 that our SFP variant converges to a deterministic optimal policy with probability one. Corollary 4.6 showed that the value estimates generated by SFP converge with probability one to the optimal expected multi-period reward. In contrast, the UCB technique converges in mean (Theorem 2.3, p. 29 in [9]).
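To contrast the two updating mechanisms, the snippet below sketches a pursuit-style explicit update of the general form used by pursuit methods (whether it matches Eq. (2.12) in [32] exactly should be checked against that reference), side by side with the implicit, history-induced distribution used by SFP. All names here are ours and purely illustrative.

def pursuit_update(probs, greedy_act, beta):
    """Explicit pursuit-style update: move probability mass toward the action
    currently estimated to be optimal by a fraction beta (0 < beta < 1)."""
    return {a: p + beta * ((1.0 if a == greedy_act else 0.0) - p)
            for a, p in probs.items()}

def empirical_distribution(recent_plays):
    """Implicit SFP-style distribution: induced by the (finite) history of
    recent best-response plays; it changes only because the history changes."""
    counts = {}
    for a in recent_plays:
        counts[a] = counts.get(a, 0) + 1
    total = len(recent_plays)
    return {a: c / total for a, c in counts.items()}

# Example: three actions, greedy action 'b', beta = 0.1.
p = pursuit_update({'a': 0.3, 'b': 0.4, 'c': 0.3}, 'b', 0.1)
q = empirical_distribution(['b', 'b', 'a', 'b', 'c'])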
Extension of results in this paper to infinite-horizon model-free
problems is a natural direction for future research. However, since
it is not possible to sample an infinite path in practice, an
implementable procedure should approximate the infinite-horizon expected rewards with some appropriately selected finite-horizon truncations. This will introduce an additional hurdle in
proving convergence to optimality.
In addition, the fundamental distinguishing features of our variant of SFP (finite memory, action samples of size one, and occasional mistakes in sampling) appear to help, both in simplifying the convergence proofs and in enhancing computational performance compared with the original version of SFP in [20]. Therefore, we will focus on investigating the theoretical and empirical properties of this variant when applied to combinatorial optimization problems that are not modeled as dynamic programs. We have taken initial steps in this direction, obtaining encouraging numerical results, comparable to those of genetic algorithms [11], on multi-dimensional knapsack problems [17].
Researchers have recently proposed "payoff-based" dynamics for learning equilibria in multi-player games [23]. In contrast to "action-based" mechanisms such as FP or AP, these approaches do not require a best response computation. For example, in the so-called safe experimentation dynamics [23], players simply maintain a record of the best utility value they have obtained in the past and the corresponding action they played at the time, called the baseline utility and the baseline action, respectively. In each iteration k, each player samples its baseline action with probability e_k, and with probability 1 - e_k samples an action randomly. If the utility a particular player receives for the joint action profile sampled by all players is higher than its baseline utility, that player updates its baseline utility and action. It is easy to prove (Theorem 3.1 in [23]) that this dynamic process converges almost surely to a pure strategy optimal Nash equilibrium for all games of identical interests, under a standard sufficient condition on e_k similar to our condition on a_k in Lemma 4.2. In the future, it would be interesting to compare the computational performance of such payoff-based processes on dynamic programs with the action-based approach taken in this paper.
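A minimal sketch of one iteration of such a payoff-based update, following the verbal description above (the utility function, action sets, and schedule for e_k are placeholder interfaces of ours, not taken from [23]):

import random

def safe_experimentation_step(players, baselines, eps_k, utility):
    """One iteration of a payoff-based update as described above (schematic).

    players maps each player to its feasible action set; baselines maps each
    player to (baseline_action, baseline_utility); utility(player, joint_action)
    returns that player's payoff for the sampled joint action profile."""
    joint_action = {}
    for p, actions in players.items():
        base_action, _ = baselines[p]
        if random.random() < eps_k:                 # play the baseline action
            joint_action[p] = base_action
        else:                                       # otherwise experiment randomly
            joint_action[p] = random.choice(list(actions))

    for p in players:                               # update baselines on improvement
        u = utility(p, joint_action)
        if u > baselines[p][1]:
            baselines[p] = (joint_action[p], u)
    return joint_action, baselines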
Acknowledgments
This research was funded in part by the National Science
Foundation under Grants CMMI-0422752, CCF-0830380, and
CCF-0830092. Archis Ghate appreciates summer support from
the University of Washington.
References
[1] Auer P, Cesa-Bianchi N, Fischer P. Finite-time analysis of the multiarmed bandit
problem. Machine Learning 2002;47:235–56.
[2] Baumert S, Cheng SF, Ghate AV, Reaume D, Sharma D, Smith RL. Joint
optimization of capital investment, revenue management, and production
planning in complex manufacturing systems. Technical Report 05-05, Industrial and Operations Engineering, University of Michigan, Ann Arbor; 2005.
[3] Bertsekas DP. Dynamic programming and optimal control, vols. 1 and 2.
Belmont, MA, USA: Athena Scientific; 1995.
[4] Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming. Belmont, MA, USA:
Athena Scientific; 1996.
[5] Blondel VD, Tsitsiklis JN. A survey of computational complexity results in
systems and control. Automatica 2000;36.
[6] Brown GW. Iterative solution of games by fictitious play. In: Activity analysis
of production and allocation. New York, NY, USA: Wiley; 1951.
[7] Chang HS, Fu MC, Hu J, Marcus SI. An adaptive sampling algorithm for solving
Markov decision processes. Operations Research 2005;53(1):126–39.
[8] Chang HS, Fu MC, Hu J, Marcus SI. A survey of some simulation-based
algorithms for Markov decision processes. Communications in Information
and Systems 2007;7(1):59–92.
[9] Chang HS, Fu MC, Hu J, Marcus SI. Simulation-based algorithms for Markov
decision processes. London, UK: Springer; 2007.
[10] Cheng SF, Epelman MA, Smith RL. CoSIGN: a parallel algorithm for coordinated traffic signal control. IEEE Transactions on Intelligent Transportation
Systems 2007;7(4):551–64.
[11] Chu PC, Beasley JE. A genetic algorithm for the multi-dimensional knapsack
problem. Journal of Heuristics 1998;4:63–86.
[12] Feller W. An introduction to probability theory and its applications. New
York, NY, USA: John Wiley and Sons; 1970.
[13] Fudenberg D, Tirole J. Game theory. Cambridge, MA, USA: MIT Press; 1991.
[14] Garcia A, Reaume D, Smith RL. Fictitious play for finding system optimal
routings in dynamic traffic networks. Transportation Research B, Methods
2000;34(2).
[15] Garcia A, Patek SD, Sinha K. A decentralized approach to discrete optimization
via simulation: application to network flows. Operations Research
2007;55(4).
[16] Garcia A, Campos E. Game theoretic approach to efficient power management
in sensor networks. Operations Research 2008;56(3).
[17] Ghate AV. Game theory, Markov chains and infinite programming: three
paradigms for optimization of complex systems. PhD thesis, Industrial and
Operations Engineering, The University of Michigan, Ann Arbor; 2006.
[18] Gershwin SB. Manufacturing systems engineering. Englewood Cliffs, NJ, USA:
Prentice-Hall; 1994.
[19] Gigerenzer G, Selten R. Bounded rationality. Cambridge, MA, USA: MIT Press;
2002.
[20] Lambert TJ, Epelman MA, Smith RL. A fictitious play approach to large-scale
optimization. Operations Research 2005;53(3):477–89.
[21] Lambert TJ, Wang H. Fictitious play approach to a mobile unit situation
awareness problem. Technical Report, University of Michigan, Ann Arbor;
2003.
[22] Marden JR, Arslan G, Shamma JS. Joint strategy fictitious play with inertia for
potential games. In: 44th IEEE conference on decision and control; December
2005.
[23] Marden JR, Young HP, Arslan G, Shamma JS. Payoff-based dynamics for
multiplayer weakly acyclic games. SIAM Journal on Control and Optimization
2009;48(1):373–96.
[24] Monderer D, Shapley L. Fictitious play property for games with identical
interests. Journal of Economic Theory 1996;68:258–65.
[25] Ostermiller S. http://ostermiller.org/tictactoeexpert.html [last accessed
August 3, 2010].
[26] Powell W. Approximate dynamic programming: solving the curses of dimensionality. New York, NY, USA: John Wiley and Sons; 2007.
[27] Puterman M. Markov decision processes. New York, NY, USA: John Wiley and
Sons; 1994.
[28] Rajaraman K, Sastry PS. Finite time analysis of the pursuit algorithm for
learning automata. IEEE Transactions on Systems, Man, and Cybernetics, Part
B 1996;26(4):590–8.
[29] Robinson J. An iterative method of solving a game. Annals of Mathematics
1951;54:296–301.
[30] Stout W. Almost sure convergence. New York, NY: Academic Press; 1974.
[31] Sutton RS. Open theoretical questions in reinforcement learning. In: Fischer P,
Simon HU, editors. Proceedings of the fourth European conference on
computational learning theory (Proceedings EuroCOLT99). Springer-Verlag;
1999. p. 11–7. ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-99.ps.
[32] Sutton R, Barto A. Reinforcement learning: an introduction. Cambridge, MA,
USA: MIT Press; 1998.
[33] Tsitsiklis JN. On the convergence of optimistic policy iteration. Journal of
Machine Learning Research 2002;3:59–72.
[34] Watkins C, Dayan P. Technical note: Q-learning. Machine Learning
1992;8:279–92.
[35] Young HP. Evolution of conventions. Econometrica 1993;61(1):57–84.