
Sampled fictitious play for approximate dynamic programming

2011, Computers & Operations Research

Sampled fictitious play (SFP) is a recently proposed iterative learning mechanism for computing Nash equilibria of non-cooperative games. For games of identical interests, every limit point of the sequence of mixed strategies induced by the empirical frequencies of best response actions that players in SFP play is a Nash equilibrium. Because discrete optimization problems can be viewed as games of identical interests wherein Nash equilibria define a type of local optimum, SFP has recently been employed as a heuristic optimization algorithm with promising empirical performance. However, there have been no guarantees of convergence to a globally optimal Nash equilibrium established for any of the problem classes considered to date. In this paper, we introduce a variant of SFP and show that it converges almost surely to optimal policies in model-free, finite-horizon stochastic dynamic programs. The key idea is to view the dynamic programming states as players, whose common interest is to maximize the total multi-period expected reward starting in a fixed initial state. We also offer empirical results suggesting that our SFP variant is effective in practice for small to moderate sized model-free problems.

Computers & Operations Research 38 (2011) 1705–1718

Marina Epelman (a), Archis Ghate (b), Robert L. Smith (a)
(a) Industrial and Operations Engineering, University of Michigan, Ann Arbor, USA
(b) Industrial and Systems Engineering, Box 352650, University of Washington, Seattle, WA 98195, USA

Available online 16 February 2011. doi:10.1016/j.cor.2011.01.023

Keywords: Stochastic dynamic optimization; Computational game theory; Simulation optimization

1. Introduction

In this paper, we introduce a variant of a game theoretic learning mechanism called sampled fictitious play (SFP) [20] to solve model-free stochastic dynamic programming problems, and investigate its convergence properties and empirical performance. The defining feature of model-free problems is that the state space, the immediate rewards resulting from choosing an action in a state, and the state transition probabilities are not known explicitly; hence, system behavior must be "learned" off-line or on-line through repeated computer simulations or system runs. This rules out methods like backward induction, value iteration and mathematical programming. Examples of model-free problems include control of queueing networks with complicated service disciplines whose state transitions are available only through simulation via computer programs [3], control of manufacturing processes where the effect of a decision on the process is calculated by simulating the process [18], and dynamic portfolio optimization or financial derivative pricing problems where the performance of the underlying financial instrument is obtained by simulating complex computer models. Algorithms for model-free problems are termed "simulation-based" methods [3,9,26,32] and typically provide an approximate solution. Thus, these simulation-based techniques, including our SFP variant, fall within the realm of approximate dynamic programming (ADP) [26].
Stochastic search methods rooted in game theory have recently been applied to large-scale discrete optimization problems, with special focus on cases where the objective function is available only through computationally expensive simulations [2,10,14–16,20–22]. Consequently, the hope is to at least find local optima, as stronger forms of optimality are nearly impossible to attain and very difficult to check. These techniques have been numerically tested with encouraging results on problems in transportation [10,14,22], power management in sensor networks [16], network optimization [15], and manufacturing systems [2]. Such heuristic optimization algorithms are applied to problems of the form

  \max \; u(y_1, y_2, \ldots, y_n) \quad \text{s.t.} \quad (y_1, y_2, \ldots, y_n) \in Y_1 \times Y_2 \times \cdots \times Y_n,    (1)

where Y_1 × Y_2 × ... × Y_n denotes the Cartesian product of the finite sets Y_1 through Y_n. The main idea is then to animate (1) as a game between n players (corresponding to the decision variables y_1, ..., y_n), who share the identical interest of maximizing the objective function u(·). Recall that a Nash equilibrium is a collection of probability distributions over each player's actions with the property that no player can unilaterally improve its utility in expectation by changing its own distribution [13]. Such an equilibrium serves as a type of coordinate-wise local optimum of (1), and hence the goal is to implement a computationally efficient procedure to find it.

Most of these game theory based techniques for discrete optimization employ variants of fictitious play (FP) [6,29], a well-known iterative learning technique for computing Nash equilibria. At every iteration of FP, each player chooses a strategy that is a best response (with respect to that player's expected utility, which depends on the decisions of all players) to the other players' strategies, assuming they will be chosen based on the empirical probability distribution induced by the historical frequency of their best response decisions in all previous iterations. The suitability of the FP approach for optimization stems from its convergence properties. In particular, for a game of identical interests, every limit point of the sequence of mixed strategies induced by the empirical frequencies of best response actions is a Nash equilibrium, irrespective of the structure of the objective function [24]. This is termed the "fictitious play property." Another advantage of such methods is that they are easily parallelizable, making them potentially suitable for large-scale optimization problems. However, the best response problem for each player in FP is computationally intractable for realistic problems, since an exact computation of the expected utility for a player may require evaluation of the utility function for every possible combination of actions of all players. As a result, two of the authors and a co-author recently proposed sampled fictitious play (SFP) [20], a modified version of FP in which the players choose an action that is a best reply to an independent sample of actions of the other players, drawn according to the empirical probability distribution of actions they used in all previous iterations. SFP offers a significant computational advantage over FP and, for games of identical interests, almost surely exhibits the fictitious play property if the sample size is increased at a polynomial rate with iterations [20].
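To make the mechanism concrete, the following minimal sketch (in Python, with hypothetical names) performs one SFP iteration for a problem of the form (1) with a black-box objective u. Each player samples a single action from its empirical history of past best responses, as in the single-sample variant studied in this paper; the original SFP of [20] instead draws a sample whose size grows with iterations.

```python
import random

def sfp_iteration(u, feasible_sets, histories):
    """One sampled fictitious play iteration for an identical-interest game of the
    form (1); u is a black-box objective evaluated at joint actions (y_1, ..., y_n)."""
    n = len(feasible_sets)
    # Each player samples one action from its history of past best responses,
    # or uniformly from its feasible set if the history is still empty.
    sampled = [random.choice(histories[i]) if histories[i]
               else random.choice(list(feasible_sets[i])) for i in range(n)]
    for i in range(n):
        # Best response of player i to the actions sampled by the other players.
        best = max(feasible_sets[i],
                   key=lambda a: u(tuple(sampled[:i]) + (a,) + tuple(sampled[i + 1:])))
        histories[i].append(best)  # empirical frequencies are induced by these histories
    return histories
```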
However, the efficacy of the original version of SFP for optimization problems is subject to the following significant limitations: (i) the fictitious play property guarantees convergence only to an equilibrium solution rather than an optimal solution; (ii) SFP may converge to a mixed strategy equilibrium, whereas in many applications, and especially in optimization, a pure strategy equilibrium is desirable; (iii) the best response computation becomes increasingly expensive as the sample size grows without bound with iterations; (iv) it is computationally very expensive to force every player in large-scale problems to perform a best reply computation in every iteration; and, finally, (v) problem form (1) excludes optimization problems with constraints across variables. Thus, practical implementations of SFP [2,10,21] have attempted ad hoc variations of the original version. Unfortunately, the original convergence results for SFP in [20] do not hold for these ad hoc variants. Consequently, the question of whether one can design an SFP variant that provably finds optimal solutions to an important class of optimization problems by surmounting the above difficulties has remained open.

We answer this question in the affirmative by introducing an SFP variant that solves finite-horizon stochastic dynamic programs. The key idea is to view the states of the dynamic programming problem as players engaged in a non-cooperative game of identical interests, where the objective of each player is to maximize the expected multi-period reward from the initial state. The problem structure inherent in dynamic programming, and specifically the principle of optimality, helps our SFP players coordinate their actions and hence solve the problem to optimality. Viewing the states as players also has the important advantage that all combinations of feasible actions of these players are jointly feasible, so that the resulting problem is an unconstrained one of the form (1). Importantly, since the objectives of all players are aligned with one another, it suffices for a very small fraction of the players to participate in the game in each iteration, naturally leading to an asynchronous procedure. The procedure that determines which players participate in an iteration adaptively favors optimal actions (and hence the states they deliver) from the recent past. Specifically, unlike the original SFP in [20], we provide the players with only finite memory. Moreover, we allow players to sample only one action at a time (unlike the original SFP version in [20], which requires an increasing action sample size), and we deliberately add exogenous noise to this selection procedure so that, in theory, every player gets an opportunity to perform a best response computation infinitely often with probability one even in the asynchronous case. We also remark that if the inherent randomness in the state transitions of the system is such that all states will be observed infinitely often with probability one irrespective of the policy implemented, then this exogenous noise is not needed (even in the asynchronous case).

This paper is organized as follows. We develop the necessary notation and formulate our dynamic programming problem in the second section. This is followed by a precise description of our SFP algorithm in the third section. Convergence results and proofs appear in the fourth section. Numerical experiments are presented in the fifth section.
Since our SFP variant falls within the realm of simulation-based algorithms for approximate dynamic programming (ADP) [26], a detailed discussion of similarities and differences between our approach and existing simulation-based techniques for ADP is provided in the sixth section, along with other specific conclusions and future research directions.

2. Problem formulation

Even though our SFP approach is applicable to any finite-state, finite-action, finite-horizon stochastic dynamic program, it is presented here for the case of model-free problems [3,26]. In particular, consider the following T-period sequential decision problem. The initial state of a system is s_1. In each period t = 1, 2, ..., T, we observe the state s_t of the system and make a decision x_t, which must be chosen from a finite set X_t(s_t) of feasible decisions in state s_t, as determined by a feasibility oracle F_t. This feasibility oracle receives s_t as input and produces a finite set X_t(s_t) as output. Then another oracle O_t receives s_t and x_t as input and returns the (stochastic) state of the system s_{t+1} in the next period, and a one-period deterministic reward r_t(s_t, x_t). It is common in the literature to consider deterministic one-period rewards [7,33]. The slightly more general case of random one-period rewards can also be handled by our algorithm, and the convergence results in Section 4 can be extended to that case without much difficulty. All states of the form s_{T+1} are "terminal" states and there are no feasible actions in these states. We adopt the convention that terminal states have no intrinsic value. Our results generalize in a straightforward manner to problems where nonzero values are assigned to terminal states, as for example in our TIC-TAC-TOE example in Section 5.

We use S_t to denote the set of feasible states at the beginning of period t, for t = 1, 2, ..., T+1, with S_1 = {s_1}. Let S denote the finite set of all feasible states of the system, i.e., S = S_1 ∪ S_2 ∪ ... ∪ S_{T+1}. The sets S_2, S_3, ..., S_{T+1}, and hence the state space S, are not known a priori, but must be "discovered" by repeatedly querying the oracles F_t and O_t. A policy π is a deterministic decision rule that assigns a feasible decision x_t to every state s_t in S. Thus, π is a |S|-dimensional vector, and the decision that policy π assigns to state s_t ∈ S is denoted by π(s_t). We use P to denote the set of all feasible policies of this form. Therefore, for each s_t ∈ S, any policy π ∈ P must have the property that π(s_t) ∈ X_t(s_t). Let V_π(s_1) be the value of state s_1 under policy π, i.e., the total T-period expected reward from the initial state if we implement the decisions prescribed by policy π in every period. Mathematically,

  V_\pi(s_1) = r_1(s_1, \pi(s_1)) + E\left[ \sum_{i=2}^{T} r_i(s_i, \pi(s_i)) \right],

where s_i ∈ S_i for i = 2, ..., T and the expectation is with respect to the joint distribution induced by the oracles O_1, O_2, ..., O_T. Our goal is to find a policy that solves the problem

  \pi^* \in \arg\max_{\pi \in P} V_\pi(s_1).    (2)

We define V^*(s_1) ≡ V_{\pi^*}(s_1). We assume for simplicity that there is a unique policy π^* optimal to problem (2). The extension to the case of multiple optimal policies is straightforward (see [17]). We next describe our SFP approach in detail.
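The oracle structure just described can be summarized by a small interface sketch; the class and function names below are hypothetical, and the Monte Carlo routine merely illustrates how V_π(s_1) can be estimated by repeated oracle calls when no explicit model is available.

```python
import random

class ModelFreeDP:
    """Oracle interface for the model-free problem of Section 2 (hypothetical names).

    feasible(t, s)  plays the role of F_t: returns the finite action set X_t(s).
    step(t, s, x)   plays the role of O_t: returns (s_next, r), a stochastic next
                    state together with the one-period reward r_t(s, x)."""
    def feasible(self, t, s):
        raise NotImplementedError

    def step(self, t, s, x):
        raise NotImplementedError

def estimate_policy_value(dp, policy, s1, T, runs=1000):
    """Monte Carlo estimate of V_pi(s1): average total T-period reward when
    policy(t, s) returns the decision assigned to state s in period t."""
    total = 0.0
    for _ in range(runs):
        s, reward = s1, 0.0
        for t in range(1, T + 1):
            x = policy(t, s)          # must satisfy x in dp.feasible(t, s)
            s, r = dp.step(t, s, x)   # query the oracle O_t
            reward += r
        total += reward
    return total / runs
```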
3. Algorithm description

Recall that we view the states in S as players engaged in a game of identical interest, namely, that of maximizing the total T-period expected reward from the initial state s_1. Consequently, we henceforth use the terms "players" and "states" interchangeably. The central idea in our algorithm is borrowed from SFP: in each iteration, players (i) sample actions according to probability distributions derived from the empirical frequency of their past best responses, (ii) calculate best responses to the actions sampled by the other players, and (iii) update the empirical frequencies of best responses.

3.1. Preliminaries: a synchronous approach

To facilitate discussion, first consider the simpler version of the above model-free problem in which the state space and feasible actions are known at the outset, while rewards and state transitions may still be available only through oracles. Suppose the players in this SFP procedure each sample one action in every iteration. These sampled actions then define a policy, and each player can find a best response action simply by comparing an estimate of the value of this policy with the corresponding value estimates of the policies that would result if the player unilaterally changed its action to each of its other feasible actions. Conceptually, such an SFP procedure leads to a synchronous method as outlined in the pseudo-code below.

A synchronous version of sampled fictitious play for approximate dynamic programming:

1. Action sampling: Each player, independently of the other players, samples one action from its (finite) history of past best responses. (In the first iteration, when the histories are empty, each player samples a feasible action randomly.)
2. Best response: Each player, independently of the other players, estimates (in the manner described in detail in our asynchronous algorithm in Section 3.2) the expected utility (namely, the total expected reward from the initial state s_1) corresponding to each one of its feasible actions, assuming that the other players have chosen the actions they sampled in Step 1 above, and selects an action that yields the highest reward.
3. History update: Each player updates its own history of best responses by appending the best response action chosen in Step 2 above, ensuring through a FIFO procedure that the history length remains no larger than a preset finite bound.

One can show, using the backward induction technique presented in Section 4, that every entry in every player's history will eventually be optimal with probability one. Thus, we have established that for the class of optimization problems at hand, namely stochastic finite-horizon dynamic programs, a properly designed variant of SFP finds an optimal solution.

Proposition 3.1. In the above synchronous sampled fictitious play algorithm for approximate dynamic programming, the probability that there exists an iteration after which all entries in the finite histories of all players are optimal is one.

Proof. By backward induction, as in the proof of Theorem 4.1 in Section 4, and hence omitted. □
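As an illustration only (not the implementation used in the experiments of Section 5), one iteration of this synchronous procedure can be sketched as follows, assuming the state space and feasible action sets are known and that estimate_value is a hypothetical simulation-based estimator of the expected total reward from s_1 under a given joint action assignment.

```python
import random

def synchronous_sfp_iteration(states, actions, histories, estimate_value, L):
    """One iteration of the synchronous SFP of Section 3.1 (illustrative sketch).

    states: all states; actions[s]: feasible action set of state s;
    histories[s]: FIFO list of past best responses of player s (length <= L);
    estimate_value(assignment): estimate of the expected total reward from s_1
    when each state s plays assignment[s]."""
    # Step 1: action sampling from each player's finite history of best responses.
    assignment = {s: (random.choice(histories[s]) if histories[s]
                      else random.choice(list(actions[s]))) for s in states}
    for s in states:
        # Step 2: best response, holding the other players' sampled actions fixed.
        def utility(a, s=s):
            trial = dict(assignment)
            trial[s] = a
            return estimate_value(trial)
        best = max(actions[s], key=utility)
        # Step 3: FIFO history update with bound L.
        histories[s].append(best)
        if len(histories[s]) > L:
            histories[s].pop(0)
    return histories
```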
However, the above synchronous approach is not viable when the entire state space and the feasible actions in each state are not known at the outset. Even if they were available, performing action sampling, value estimation and best response computation for every player in every iteration would be computationally expensive. Thus, the problem setting in Section 2 calls for an asynchronous extension of the above conceptual SFP procedure, wherein only some of the players perform action sampling, value estimation and best response calculation in each iteration. However, this introduces new hurdles, such as the need to include deliberate exploration so that all players are chosen infinitely often to perform these activities (to ensure convergence), and the need to clearly specify a sequence in which players are chosen. Our SFP-based asynchronous procedure, described below first in text and then as pseudo-code, surmounts these difficulties.

3.2. The asynchronous procedure

Each iteration of our asynchronous algorithm consists of two phases: the choose players phase and the best response phase.

The choose players phase: In the choose players phase, T players, one each from the sets S_1, ..., S_T, are chosen to perform a best response calculation, and are said to be in play in that iteration. The precise way in which these players are selected will be explained later. In simple terms, they are selected by drawing a sequence of player–action pairs {(s_t, y_t)}_{t=1}^{T}, starting at player s_1 and ending at some terminal state, where y_t ∈ X_t(s_t) is sampled using the history of past plays of player s_t and s_{t+1} is the state obtained by sending the state–action pair (s_t, y_t) to oracle O_t. We denote the sequence of player–action pairs drawn this way in iteration k by {(s_t^k, y_t^k)}_{t=1}^{T}. Observe that corresponding to each player s_t^k in this sequence, there is a realized reward U_t(s_t^k) defined as

  U_t(s_t^k) = \sum_{i=1}^{t-1} r_i(s_i^k, y_i^k).

The choose players phase ends here.

The best response phase: In this phase, each player s_t^k, t = 1, 2, ..., T, that was selected to be in play in the choose players phase uses each of its feasible actions x_t ∈ X_t(s_t^k) to generate a "path". A path starting at player s_t^k and action x_t ∈ X_t(s_t^k) is a sequence of player–action pairs {(s_{t+j}^k, z_{t+j}^k)}_{j=1}^{T-t}, where s_{t+1}^k is received by sending (s_t^k, x_t) to O_t, player s_r^k uses its history of past plays to sample action z_r^k for r = t+1, ..., T, and s_{r+1}^k is obtained by sending the player–action pair (s_r^k, z_r^k) to the oracle O_r for r = t+1, ..., T-1. We remark that other names for paths in the ADP literature include "episode" and "trajectory" [4,32]. We denote such a path by p^k(s_t^k, x_t) and define the corresponding reward as

  R(p^k(s_t^k, x_t)) = \sum_{j=1}^{T-t} r_{t+j}(s_{t+j}^k, z_{t+j}^k).    (3)

For any player s_t ∈ S_t and any positive integer m, let I(s_t, m) denote the set of iterations in which player s_t is in play up to and including iteration m. We now define the best response problem for a player s_t^k that is in play in iteration k. In iteration k, player s_t^k estimates that if it chooses action x_t ∈ X_t(s_t^k), its objective, namely the total T-period expected reward from the initial state s_1, is

  V^k(s_1; (s_t^k, x_t)) = U_t(s_t^k) + r_t(s_t^k, x_t) + \frac{1}{|I(s_t^k, k)|} \sum_{i \in I(s_t^k, k)} R(p^i(s_t^k, x_t)),    (4)

where p^i(s_t^k, x_t) is the path drawn by the player–action pair (s_t^k, x_t) in the i-th iteration in which player s_t^k was in play. Player s_t^k then solves the following best response problem:

  \bar{x}^k(s_t^k) \in \arg\max_{x_t \in X_t(s_t^k)} V^k(s_1; (s_t^k, x_t)),    (5)

which is equivalent to the best response problem

  \bar{x}^k(s_t^k) \in \arg\max_{x_t \in X_t(s_t^k)} \left( r_t(s_t^k, x_t) + \frac{1}{|I(s_t^k, k)|} \sum_{i \in I(s_t^k, k)} R(p^i(s_t^k, x_t)) \right),    (6)

since U_t(s_t^k) does not depend on x_t, and therefore there is no need to actually compute U_t(s_t^k) in an algorithmic implementation.
Ties in the above problem are broken by choosing the action with the lowest index (note that, since there is a finite number of feasible actions in each state, the actions can be indexed). Finally, for brevity, we define

  W^k(s_t^k, x_t) \equiv \frac{1}{|I(s_t^k, k)|} \sum_{i \in I(s_t^k, k)} R(p^i(s_t^k, x_t))    (7)

and rewrite the above best response problem as

  \bar{x}^k(s_t^k) \in \arg\max_{x_t \in X_t(s_t^k)} \left( r_t(s_t^k, x_t) + W^k(s_t^k, x_t) \right).    (8)

Recall that in reinforcement learning [32] parlance, the quantity r_t(s_t^k, x_t) + W^k(s_t^k, x_t) in Eq. (8) would be called the "Q-value estimate" (see Eqs. (13)–(14) below for the definition of "Q-value"). In light of Eq. (3), it appears that in order to use the estimate in Eq. (7), player s_t^k must know |I(s_t^k, k)| and perhaps every sequence of player–action pairs {(s_{t+j}^i, z_{t+j}^i)}_{j=1}^{T-t} that s_t^k generated in iterations i ∈ I(s_t^k, k). Although the former is true, the latter is not. To see this, note that I(s_t^k, k) = I(s_t^k, k-1) ∪ {k}, and in particular |I(s_t^k, k)| = 1 + |I(s_t^k, k-1)|. As a result,

  W^k(s_t^k, x_t) = \frac{1}{|I(s_t^k, k)|} \sum_{i \in I(s_t^k, k)} R(p^i(s_t^k, x_t))
    = \frac{1}{1 + |I(s_t^k, k-1)|} \sum_{i \in I(s_t^k, k-1) \cup \{k\}} R(p^i(s_t^k, x_t))
    = \frac{1}{1 + |I(s_t^k, k-1)|} \left( \sum_{i \in I(s_t^k, k-1)} R(p^i(s_t^k, x_t)) + R(p^k(s_t^k, x_t)) \right)
    = \frac{1}{1 + |I(s_t^k, k-1)|} \left( |I(s_t^k, k-1)| \cdot W^{k-1}(s_t^k, x_t) + R(p^k(s_t^k, x_t)) \right).    (9)

Therefore, in order to compute W^k(s_t^k, x_t), player s_t^k needs to know only |I(s_t^k, k-1)|, W^{k-1}(s_t^k, x_t), and the reward R(p^k(s_t^k, x_t)) associated with the player–action pairs {(s_{t+j}^k, z_{t+j}^k)}_{j=1}^{T-t} that it generated in the current iteration k. Based on Eq. (9), throughout the algorithm each player s_t stores the number of times it has been in play (denoted by N(s_t)) and an estimate W(s_t, x_t) that is updated in every iteration in which s_t is in play, as follows:

  W(s_t, x_t) \leftarrow \frac{N(s_t) \cdot W(s_t, x_t) + R(p(s_t, x_t))}{1 + N(s_t)}.    (10)

Note that at iteration k, N(s_t) = |I(s_t, k)| is used in the algorithm implementation, whereas our proofs use the notation I(s_t, k) to make the dependence on k explicit; similarly for W(s_t, x_t) and W^k(s_t, x_t).
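In code, the update (10) is the familiar incremental running average; a minimal sketch with hypothetical names is:

```python
def update_w(W, N, s, x, path_reward):
    """Incremental form (10) of the sample average (7): W[(s, x)] is the running mean
    of the path rewards R(p(s, x)) observed in iterations in which player s was in play;
    N[s] counts those iterations and is incremented only after all feasible actions of s
    have been processed in the current iteration."""
    n = N.get(s, 0)
    W[(s, x)] = (n * W.get((s, x), 0.0) + path_reward) / (1 + n)
```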
Description of the history of past plays: We now define the term "history". Let L be a positive integer. For each state s_t ∈ S_t, t = 1, 2, 3, ..., T, its history H^k(s_t) after iteration k is an ordered list composed of (at most) L actions in X_t(s_t) that were the solutions to the best response problem for state s_t in the (at most) L most recent iterations in which it was in play. Thus, the history of a state is updated each time it performs a best response computation, and is not updated at any other time. Note that H^k(s_t) is a FIFO list of length at most L; namely, if |I(s_t, k)| ≥ L, then |H^k(s_t)| = L, and if |I(s_t, k)| < L, then |H^k(s_t)| = |I(s_t, k)|.

Now we are ready to describe exactly how states are chosen in the choose players phase of each iteration to perform a best response computation. This selection is done inductively, starting from state s_1 (because that is the only state in S_1). In iteration k, suppose states s_1, s_2, ..., s_t have been selected so far. Note that s_t may or may not have a history of past best responses. If its history is empty, an action is selected from the set X_t(s_t) of its feasible actions uniformly at random. On the other hand, if the history is not empty, with probability 1-α_k one action is drawn uniformly at random from this history, whereas with probability α_k one action is drawn uniformly at random from X_t(s_t). Suppose this selected action is x_t. Then the pair (s_t, x_t) is sent to the oracle O_t to receive a state s_{t+1} from set S_{t+1} as output. This is repeated until we reach a terminal state s_{T+1}. The parameter α_k is asymptotically reduced to zero as k increases, in a way that ensures that every state is selected infinitely often with probability one. Finally, in the best response phase, the procedure to sample actions (and hence paths) is the same as above, except that players do not use the parameter α_k while drawing actions when the history is not empty.

Asynchronous sampled fictitious play for approximate dynamic programming:

1. Initialize: Set k ← 1, choose a decreasing sequence {α_k}_{k=1}^{∞} ⊆ [0,1], fix some positive integer L, set N(s_1) ← 0, W(s_1, x_1) ← 0 for all x_1 ∈ X_1(s_1) obtained by sending s_1 to oracle F_1, and H^0(s_1) ← ∅.
2. Choose players: Starting at player s_1 ≡ s_1^k, inductively select a sequence of players s_1^k, s_2^k, ..., s_T^k, said to be in play in iteration k, where s_t^k ∈ S_t, as follows. For t = 1, ..., T:
   - If H^{k-1}(s_t^k) = ∅, then obtain X_t(s_t^k) by sending s_t^k to the feasibility oracle F_t, sample an action y_t^k uniformly at random from X_t(s_t^k), set N(s_t^k) = 0, and set W(s_t^k, x_t) = 0 for all x_t ∈ X_t(s_t^k).
   - Else let u be a Uniform[0,1] random variable; if u ≤ 1-α_k, sample an action y_t^k uniformly at random from H^{k-1}(s_t^k); else sample an action y_t^k uniformly at random from X_t(s_t^k).
   - Send the player–action pair (s_t^k, y_t^k) to oracle O_t to receive player s_{t+1}^k as output.
3. Best response: For each player s_t^k that is in play:
   (a) For each x_t ∈ X_t(s_t^k), draw a path p^k(s_t^k, x_t) = {(s_{t+1}^k, z_{t+1}^k), ..., (s_T^k, z_T^k)} inductively as follows: send (s_t^k, x_t) to oracle O_t to receive s_{t+1}^k as output, and then for j = 1, ..., T-t:
       - If H^{k-1}(s_{t+j}^k) = ∅, sample an action z_{t+j}^k uniformly at random from X_{t+j}(s_{t+j}^k), obtained by sending s_{t+j}^k to the feasibility oracle F_{t+j}.
       - Else sample an action z_{t+j}^k uniformly at random from H^{k-1}(s_{t+j}^k).
       - Send the player–action pair (s_{t+j}^k, z_{t+j}^k) to oracle O_{t+j} to receive player s_{t+j+1}^k as output.
       Once the path is drawn, set W(s_t^k, x_t) ← (1/(1+N(s_t^k)))(N(s_t^k)·W(s_t^k, x_t) + R(p^k(s_t^k, x_t))).
   (b) Find \bar{x}^k(s_t^k) by solving the best response problem (8), where ties are broken via the lowest-index-first rule.
   (c) Append \bar{x}^k(s_t^k) to the history of s_t^k in a FIFO manner to keep its length at most L, and set N(s_t^k) ← N(s_t^k) + 1.
4. Decrease α_k to α_{k+1}, increase the iteration counter to k+1, and go to Step 2 (the choose players phase).
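The pseudo-code above can be condensed into the following illustrative sketch of a single iteration. It assumes oracle calls dp.feasible(t, s) and dp.step(t, s, x) standing in for F_t and O_t (hypothetical names), treats actions as sortable so that the lowest-index tie-break is meaningful, and is not intended as the authors' reference implementation.

```python
import random

def asynchronous_sfp_iteration(dp, s1, T, a_k, hist, W, N, L):
    """One iteration of asynchronous SFP for ADP (illustrative sketch).
    hist[s]: FIFO history H(s); W[(s, x)]: running estimate (7); N[s]: in-play count;
    L: history bound; a_k: exploration probability of iteration k."""
    def sample_action(t, s, explore_prob):
        acts = sorted(dp.feasible(t, s))      # indexable, for lowest-index tie-breaks
        if not hist.get(s) or random.random() < explore_prob:
            return acts, random.choice(acts)
        return acts, random.choice(hist[s])

    # Choose players phase: walk from s1 to a terminal state, marking players in play.
    in_play, s = [], s1
    for t in range(1, T + 1):
        acts, y = sample_action(t, s, a_k)
        in_play.append((t, s, acts))
        s, _ = dp.step(t, s, y)

    # Best response phase.
    for t, s, acts in in_play:
        q = {}
        for x in acts:
            # Draw a path p^k(s, x): the first step uses x, later players sample from
            # their histories without the exploration parameter a_k.
            s_next, r = dp.step(t, s, x)
            path_reward = 0.0
            for j in range(t + 1, T + 1):
                _, z = sample_action(j, s_next, 0.0)
                s_next, rj = dp.step(j, s_next, z)
                path_reward += rj
            n = N.get(s, 0)
            W[(s, x)] = (n * W.get((s, x), 0.0) + path_reward) / (n + 1)   # Eq. (10)
            q[x] = r + W[(s, x)]                                           # Eq. (8)
        best = max(acts, key=lambda x: q[x])   # first maximizer = lowest-index tie-break
        hist.setdefault(s, []).append(best)    # FIFO history update
        if len(hist[s]) > L:
            hist[s].pop(0)
        N[s] = N.get(s, 0) + 1
```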
A few comments on the algorithm are in order. We present sufficient conditions on the sequence {α_k} that ensure desirable convergence properties in the next section (see Lemma 4.2). It should be noted that the deliberate introduction of stochastic noise, through the parameter α_k, in choosing the players that perform best response computations is not required in theory if the underlying stochastic behavior of the system itself ensures that every state will be visited infinitely often irrespective of decisions in earlier periods. More generally, the choose players phase and the best response phase are entirely decoupled, and the only information the latter needs from the former is the set of players chosen to perform a best response computation. Our convergence results remain valid irrespective of the details of the mechanism used to select this set of players, as long as the mechanism ensures that all players are chosen infinitely often with probability one. In the extreme case, every player may be chosen to perform a best response calculation in every iteration, resulting in the synchronous procedure outlined in Section 3.1.

We reiterate the slight difference between the action sampling mechanisms in the choose players phase and the best response phase: the latter does not employ α_k. This ensures that the estimates W(s_t, x_t) are based on actions that were best responses in earlier iterations, thus avoiding unnecessary stochastic errors in best response calculations. This leads to stronger convergence results and potentially better practical performance. The idea of using a history of finite length in FP-type algorithms has been used with much success in computational game theory in the past (see, for example, the adaptive play (AP) algorithm in [35]). In our context, as mentioned earlier, it is intended to reduce correlation between early and late iterations, progressively producing better value estimates. The stochastic noise and finite history in the sampling process may also be viewed as the players themselves being boundedly rational [19].

To assess the computational effort in each iteration, note that since X_t(s_t) is a finite set for all s_t ∈ S_t and t = 1, 2, ..., T, there exists a positive integer A such that |X_t(s_t)| ≤ A for all t and s_t. Therefore, the number of paths p^k(s_t^k, x_t) that player s_t^k samples in the best response phase is at most A. Each of these paths includes T-t+1 players. The number of players who participate in one iteration is therefore at most A·\sum_{t=1}^{T}(T-t+1) = A·T(T+1)/2. This observation will be used in comparing our procedure with a benchmark in the numerical results of Section 5.

In the parlance of existing simulation-based asynchronous ADP algorithms for model-free finite-horizon problems, Step 2 above can be seen as a mechanism for selecting the states at which policies will be updated. Step 3(a) can be viewed as an approximate policy evaluation step, whereas Step 3(b) can be regarded as a policy improvement mechanism. The probability distribution from which actions will be sampled in the next iteration is updated in Step 3(c), indirectly, through changes to the history of plays. In the next two paragraphs, we briefly mention fundamental features of some other simulation-based table look-up methods for the reader to contrast with our SFP procedure. (For a more detailed comparison, and a discussion of convergence results, see Section 6.)

In Monte Carlo methods, only the state–action value (that is, the Q-value estimate) of each state–action pair on the path selected in Step 2 is updated in Step 3(a), by adding the total reward obtained on that path following the occurrence of that state–action pair and then smoothing. In comparison, in our SFP procedure the state–action value estimate is updated for each state on the selected path and for each action feasible in that state. Thus the counterpart of Step 3(a) in typical Monte Carlo methods is simpler than in the pseudo-code above. The decision in each state on the selected path is then set to be optimal with respect to the current Q-value estimates (Steps 3(b) and 3(c)). This updated policy is then used in the next iteration to select a new path. See Fig. 5.4 in [32].

Temporal difference methods, unlike Monte Carlo methods, employ bootstrapping to update the Q-value estimates of state–action pairs on the selected path. That is, instead of simply adding the total reward following the occurrence of a state–action pair on the selected path, Q-value estimates "downstream" on the selected path are used in the update mechanism. See Fig. 6.9 in [32]. The well-known Q-learning [34] algorithm also uses bootstrapping and can in fact be viewed as an extension of a particular temporal difference method [4]. See Fig. 6.12 in [32]. The SFP pseudo-code above, on the other hand, does not bootstrap. We present convergence results for our asynchronous SFP algorithm in the next section.
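To make the bootstrapping distinction concrete, the two hypothetical updates below contrast a one-step Q-learning-style target, which relies on the current value estimate at the next state, with the non-bootstrapped full-path average used in Step 3(a); neither block is taken from the paper.

```python
def q_learning_style_update(Q, s, x, r, s_next, next_actions, alpha, gamma=1.0):
    """Bootstrapped one-step update: the target uses the current Q-value estimate
    at the next state rather than the realized remainder of the path."""
    best_next = max((Q.get((s_next, a), 0.0) for a in next_actions), default=0.0)
    Q[(s, x)] = Q.get((s, x), 0.0) + alpha * (r + gamma * best_next - Q.get((s, x), 0.0))

def sfp_style_update(W, n_in_play, s, x, path_reward):
    """Non-bootstrapped update: the realized total reward of the sampled path is
    averaged in directly, with no reliance on downstream value estimates (cf. Eq. (10))."""
    W[(s, x)] = (n_in_play * W.get((s, x), 0.0) + path_reward) / (n_in_play + 1)
```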
4. Convergence results

We first introduce some notation. We define the value of any state s_t ∈ S_t for t = 1, 2, 3, ..., T under policy π ∈ P as

  V_\pi(s_t) = r_t(s_t, \pi(s_t)) + E\left[ \sum_{i=t+1}^{T} r_i(s_i, \pi(s_i)) \right],

where the expectation is with respect to the joint distribution induced by the oracles O_t, O_{t+1}, ..., O_T. The corresponding optimal value is defined as

  V^*(s_t) \equiv \max_{\pi \in P} V_\pi(s_t).    (11)

The above maxima are achieved because the set P is finite. It is well known [3,27] that the optimal values satisfy the following recursive equation:

  V^*(s_t) = \max_{x_t \in X_t(s_t)} \left( r_t(s_t, x_t) + E_{O_t(s_t, x_t)}\left[ V^*(s_{t+1}) \right] \right),    (12)

where the subscript on the expectation operator asserts that the expectation is with respect to the probability distribution of state s_{t+1} over S_{t+1} as induced by the stochastic oracle O_t that receives the state–decision pair (s_t, x_t) as input. Note here that V^*(s_{T+1}) = 0 for all terminal states s_{T+1} by assumption. The decision prescribed in state s_t by the optimal policy π^* is given by

  \pi^*(s_t) = \arg\max_{x_t \in X_t(s_t)} \left( r_t(s_t, x_t) + E_{O_t(s_t, x_t)}\left[ V^*(s_{t+1}) \right] \right),    (13)

that is, denoting the maximand in (13) by Q(s_t, x_t),

  \pi^*(s_t) = \arg\max_{x_t \in X_t(s_t)} Q(s_t, x_t).    (14)

For t = 1, ..., T and any s_t ∈ S_t, let g(s_t) be the event that there exists an iteration K_{s_t} such that for all iterations k ≥ K_{s_t}, each of the L entries in the history of the player corresponding to state s_t is the optimal decision π^*(s_t). Similarly, for any t = 1, ..., T, let O(t) be the event that there exists an iteration K_t such that for all iterations k ≥ K_t, all L entries in the history of the player corresponding to any state s ∈ S_t ∪ S_{t+1} ∪ ... ∪ S_T equal the optimal action π^*(s). Note that (∩_{r=t,...,T} ∩_{s_r ∈ S_r} g(s_r)) ⇒ O(t), since we can choose K_t = max_{r=t,...,T} max_{s_r ∈ S_r} K_{s_r}. This implies that P[O(t)] ≥ P[∩_{r=t,...,T} ∩_{s_r ∈ S_r} g(s_r)]. Similarly, (∩_{s_{t-1} ∈ S_{t-1}} g(s_{t-1})) ∩ O(t) ⇒ O(t-1). The rest of this section is devoted to the proof of the following theorem.

Theorem 4.1. P[O(1)] = 1; that is, the probability that there exists an iteration K_1 such that for all iterations k ≥ K_1 of asynchronous SFP for ADP, all L entries in the histories of the players corresponding to all states s ∈ S_1 ∪ ... ∪ S_T equal the optimal action π^*(s) is one.
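For reference, when the transition probabilities are available explicitly (which the model-free setting of Section 2 does not assume), the optimality equations (12)–(14) can be evaluated by standard backward recursion; the sketch below uses hypothetical names, and it is this kind of computation that supplies the benchmark optimal values used in Section 5.

```python
def backward_recursion(states_by_period, feasible, reward, transition, T):
    """Evaluate (12)-(14) given an explicit model (illustrative only; SFP itself never
    needs `transition`). feasible(t, s) -> X_t(s); reward(t, s, x) -> r_t(s, x);
    transition(t, s, x) -> dict {s_next: probability}."""
    V = {s: 0.0 for s in states_by_period[T + 1]}   # terminal states have value zero
    policy = {}
    for t in range(T, 0, -1):
        for s in states_by_period[t]:
            Q = {x: reward(t, s, x) + sum(p * V[s2] for s2, p in transition(t, s, x).items())
                 for x in feasible(t, s)}
            best = max(Q, key=Q.get)                # first maximizer breaks ties
            V[s], policy[s] = Q[best], best
    return V, policy
```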
We employ an intuitive backward induction proof technique. In particular, we first show that players in S_T are able to compute optimal actions the first time they are in play. Once all players in S_T have been in play at least once (this is ensured by Lemma 4.2), players in S_{T-1} are able to generate increasingly accurate estimates of the total expected reward obtained on playing each of their actions. Consequently, players in S_{T-1} are eventually able to compute their optimal actions after being in play in a sufficiently large number of iterations (this again is ensured by Lemma 4.2). This process is then propagated backwards through periods T-1, T-2, ..., 1. In particular, after a sufficiently large number of iterations, all best responses are "correct", implying that every entry in the players' histories is optimal. These ideas are now formalized.

Observe that, by problem definition, every state s_{t+1} ∈ S_{t+1} is reachable from the initial state s_1 (otherwise s_{t+1} would not be a member of S_{t+1}). That is, for every s_{t+1} ∈ S_{t+1}, there exists a state–action pair (s_t, x_t) with s_t ∈ S_t and x_t ∈ X_t(s_t) such that the probability that oracle O_t returns state s_{t+1} upon receiving (s_t, x_t) as input is strictly positive. Because we have a finite number of states and actions, there is a uniform (over all periods t, all states s_{t+1}, and all state–action pairs (s_t, x_t)) lower bound q > 0 on these strictly positive probabilities.

Lemma 4.2. If α_k ≥ (1/k)^{1/T} for all iterations k, then every player is in play infinitely often with probability one.

Proof. Let E^j(s_t) be the event that the player corresponding to a specific state s_t ∈ S_t is in play in iteration j and the players in stages 1, 2, ..., t-1 each sampled actions uniformly at random from their sets of feasible actions (as opposed to sampling from their histories of best responses) in the choose players phase. Observe that P[E^j(s_t)] ≥ (q α_j / A)^T ≥ (q/A)^T (1/j), and hence \sum_{j=1}^{∞} P[E^j(s_t)] = ∞. Since the events E^1(s_t), E^2(s_t), ... are independent, the probability that infinitely many of them occur is one by the second Borel–Cantelli lemma [12]. □

Note that a fixed, strictly positive constant value of α_k for all k, such as α_k = 1, clearly satisfies the sufficient condition in Lemma 4.2 above. However, the importance of decreasing values of α_k is that they employ asymptotically diminishing exploration in the choose players phase, giving more and more weight to the best response results of previous iterations as the algorithm progresses. Potential practical benefits of changing the rate of exploration over iterations instead of using a constant rate have been discussed in the literature, for instance in [32]. An interesting question in our context, then, is how small a value of α_k can be used while still guaranteeing that each player is chosen infinitely often with probability one. Lemma 4.2 provides a simple bound on such a value.

The proof of Theorem 4.1 employs the following inductive statement.

Induction statement M_t: P[O(t)] = 1; that is, the probability that there exists an iteration K_t such that for all iterations k ≥ K_t of asynchronous SFP for ADP, all L entries in the histories of the players corresponding to all states s ∈ S_t ∪ ... ∪ S_T equal the optimal action π^*(s) is one.

Lemma 4.3. M_T is true, that is, P[O(T)] = 1.

Proof. Consider any state s_T ∈ S_T. This player is in play infinitely often with probability one owing to Lemma 4.2. Moreover, in every iteration k in which this player is in play, it solves the best reply problem \bar{x}^k(s_T) = argmax_{x_T ∈ X_T(s_T)} r_T(s_T, x_T), since W^k(s_T, x_T) = 0 in problem (8) for all x_T ∈ X_T(s_T). Thus, \bar{x}^k(s_T) = π^*(s_T) in every such iteration k, owing to the principle of optimality and the uniqueness of π^*. Let K_{s_T} be the iteration in which this player is in play for the L-th time (such an iteration exists with probability one, again owing to Lemma 4.2). Then for all iterations k ≥ K_{s_T}, each of the L entries in this player's history is π^*(s_T). Since s_T ∈ S_T was arbitrary, P[g(s_T)] = 1 for every s_T ∈ S_T. Then it is easy to see that P[∩_{s_T ∈ S_T} g(s_T)] = 1. This implies that P[O(T)] = 1. □

Now we prove an intermediate lemma.

Lemma 4.4. M_t implies that for every player–action pair (s_{t-1}, x_{t-1}) with s_{t-1} ∈ S_{t-1} and x_{t-1} ∈ X_{t-1}(s_{t-1}), the estimate W^k(s_{t-1}, x_{t-1}) converges to E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) as k → ∞ with probability one. As a result, r_{t-1}(s_{t-1}, x_{t-1}) + W^k(s_{t-1}, x_{t-1}) converges to r_{t-1}(s_{t-1}, x_{t-1}) + E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) as k → ∞ with probability one.
Proof. Recall that I(s_{t-1}, k) is the set of iterations up to and including iteration k in which the player corresponding to state s_{t-1} is in play. Let J(s_{t-1}, m, n) be the set of iterations between iterations m and n (not including m but including n) in which the player corresponding to state s_{t-1} is in play. Let K_t be as defined in the induction statement above, and let k ≥ K_t be any iteration large enough that |I(s_{t-1}, k)| > 0 (such an iteration exists with probability one by Lemma 4.2). We have

  \left| W^k(s_{t-1}, x_{t-1}) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right|
    = \left| \frac{1}{|I(s_{t-1}, k)|} \sum_{i \in I(s_{t-1}, k)} \sum_{j=0}^{T-t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right|.

Using I ≡ I(s_{t-1}, K_t) and J ≡ J(s_{t-1}, K_t, k) for brevity, the above right hand side becomes

  \left| \frac{1}{|I| + |J|} \sum_{i \in I \cup J} \sum_{j=0}^{T-t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right|,

which in turn is bounded above by

  \frac{1}{|I| + |J|} \left| \sum_{i \in I} \sum_{j=0}^{T-t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) \right|
    + \left| \frac{1}{|I| + |J|} \sum_{i \in J} \sum_{j=0}^{T-t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right|.

Since the state and decision spaces are finite, the absolute values of the one-period deterministic rewards are uniformly bounded above, say by a positive number R. As a result, the above right hand side is at most

  \frac{|I| R (1 + T - t)}{|I| + |J|}
    + \left| \frac{1}{|I| + |J|} \sum_{i \in J} \sum_{j=0}^{T-t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right|
  = \frac{|I| R (1 + T - t)}{|I| + |J|}
    + \left| \frac{|J|}{|I| + |J|} \cdot \frac{1}{|J|} \sum_{i \in J} \sum_{j=0}^{T-t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right|.

The first term above converges to zero as k → ∞ with probability one, since |J(s_{t-1}, K_t, k)| → ∞ as k → ∞ with probability one owing to Lemma 4.2. Similarly, |J|/(|I| + |J|) → 1 as k → ∞ with probability one. Note that M_t implies that for all iterations i ∈ J(s_{t-1}, K_t, k), x_{t+j}^i = π^*(s_{t+j}^i) for j = 0, ..., T-t. Moreover, when the state–decision pair (s_{t-1}, x_{t-1}) is sent to the oracle O_{t-1}, the sums

  \sum_{j=0}^{T-t} r_{t+j}(s_{t+j}^i, \pi^*(s_{t+j}^i))

are independent and identically distributed for iterations i ∈ J(s_{t-1}, K_t, k). Therefore,

  \frac{1}{|J|} \sum_{i \in J} \sum_{j=0}^{T-t} r_{t+j}(s_{t+j}^i, x_{t+j}^i) \to E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t)

as k → ∞ with probability one, by the strong law of large numbers and the definition of V^*(s_t) for any s_t ∈ S_t. Thus the above term of interest converges to zero with probability one. Thus |W^k(s_{t-1}, x_{t-1}) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t)| is bounded below by zero and bounded above by a term that converges to zero as k → ∞ with probability one. Therefore, W^k(s_{t-1}, x_{t-1}) converges to E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) as k → ∞ with probability one. This completes the proof. □

The above intermediate lemma helps us restore the inductive hypothesis as follows.

Lemma 4.5. M_t implies M_{t-1}.
Proof. Consider δ(s_{t-1}) defined as follows:

  \delta(s_{t-1}) = \min_{x_{t-1} \in X_{t-1}(s_{t-1}),\; x_{t-1} \neq \pi^*(s_{t-1})} \left( V^*(s_{t-1}) - r_{t-1}(s_{t-1}, x_{t-1}) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right);

that is, δ(s_{t-1}) is the difference between the optimal value of state s_{t-1} and the value of the next-best feasible action in s_{t-1}. The above minimum is well defined since X_{t-1}(s_{t-1}) is a finite set. The definition of V^*(s_{t-1}) implies that δ(s_{t-1}) ≥ 0, whereas uniqueness of the optimal policy implies that δ(s_{t-1}) > 0. The best reply problem (8) for the player corresponding to state s_{t-1} implies that if

  \left| r_{t-1}(s_{t-1}, x_{t-1}) + E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) - r_{t-1}(s_{t-1}, x_{t-1}) - W^k(s_{t-1}, x_{t-1}) \right| < \delta(s_{t-1})/2

for all x_{t-1} ∈ X_{t-1}(s_{t-1}), then \bar{x}^k(s_{t-1}) = π^*(s_{t-1}). For any ε > 0, let A_ε(s_{t-1}, x_{t-1}) be the event that there exists an integer N_ε(s_{t-1}, x_{t-1}) such that for all iterations k ≥ N_ε(s_{t-1}, x_{t-1}),

  \left| r_{t-1}(s_{t-1}, x_{t-1}) + W^k(s_{t-1}, x_{t-1}) - r_{t-1}(s_{t-1}, x_{t-1}) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right| < \varepsilon.

Note that the almost sure convergence r_{t-1}(s_{t-1}, x_{t-1}) + W^k(s_{t-1}, x_{t-1}) → r_{t-1}(s_{t-1}, x_{t-1}) + E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) is equivalent (see [30]) to P[A_ε(s_{t-1}, x_{t-1})] = 1. Therefore,

  P\left[ \bigcap_{x_{t-1} \in X_{t-1}(s_{t-1})} A_{\delta(s_{t-1})/2}(s_{t-1}, x_{t-1}) \right] = 1.

More specifically, choosing N_{δ(s_{t-1})/2}(s_{t-1}) = max_{x_{t-1} ∈ X_{t-1}(s_{t-1})} N_{δ(s_{t-1})/2}(s_{t-1}, x_{t-1}), the probability that

  \left| r_{t-1}(s_{t-1}, x_{t-1}) + W^k(s_{t-1}, x_{t-1}) - r_{t-1}(s_{t-1}, x_{t-1}) - E_{O_{t-1}(s_{t-1}, x_{t-1})} V^*(s_t) \right| < \delta(s_{t-1})/2

for all x_{t-1} ∈ X_{t-1}(s_{t-1}) and for all iterations k ≥ N_{δ(s_{t-1})/2}(s_{t-1}) is one. As a result, since the player corresponding to state s_{t-1} is in play infinitely often with probability one, the probability that there exists an iteration K'_{s_{t-1}} such that \bar{x}^k(s_{t-1}) = π^*(s_{t-1}) for all iterations k ≥ K'_{s_{t-1}} is one. This implies that P[g(s_{t-1})] = 1. Then it is easy to see that P[∩_{s_{t-1} ∈ S_{t-1}} g(s_{t-1})] = 1. Together with M_t, i.e., P[O(t)] = 1, this implies that P[O(t-1)] = 1. That is, M_{t-1} holds.

By the principle of induction, Lemmas 4.3 and 4.5 imply that M_1 is true, that is, P[O(1)] = 1. This concludes the proof of Theorem 4.1. □

Since M_T, ..., M_t, ..., M_1 are true, Lemma 4.4 implies that for any t = 1, ..., T-1 and any state–decision pair (s_t, x_t) with s_t ∈ S_t and x_t ∈ X_t(s_t), W^k(s_t, x_t) converges to E_{O_t(s_t, x_t)} V^*(s_{t+1}) as k → ∞ with probability one. In addition, Theorem 4.1 above states that with probability one there exists an iteration K_1 such that for all iterations k ≥ K_1, every entry in the history of the player corresponding to state s_1, denoted by h^k(s_1), is π^*(s_1). We thus have the following "value convergence" result.

Corollary 4.6. r_1(s_1, h^k(s_1)) + W^k(s_1, h^k(s_1)) converges with probability one to V^*(s_1) as k → ∞.

Let E^k(s_t) denote the event that state s_t ∈ S_t is in play in iteration k. Given event E^k(s_t), let y^k(s_t) denote the feasible action sampled by state s_t in the choose players phase of the SFP algorithm of Section 3.2. Similarly, let D^k(s_t) denote the event that state s_t gets the opportunity to sample an action in the best response phase in iteration k, and given D^k(s_t), let z^k(s_t) denote an action sampled by s_t. Then Theorem 4.1 and the ideas discussed in its proof imply the following "solution convergence" result.

Corollary 4.7. For any 0 < ε < 1, there exists an iteration K_ε such that P[y^k(s_t) = π^*(s_t) | E^k(s_t)] > 1-ε for all k ≥ K_ε. That is, sufficiently far along the progress of the algorithm, the actions sampled by players in the choose players phase are optimal with arbitrarily high probability. Similarly, P[z^k(s_t) = π^*(s_t) | D^k(s_t)] = 1 for sufficiently large k.
That is, sufficiently far along the progress of the algorithm, the actions sampled by players in the best response phase are optimal.

In summary, the version of SFP designed here uses action samples of size one and a finite history of past best responses, and allows players to make sampling mistakes. It is much simpler to implement in practice and computationally more efficient than the original approach in [20], and yet exhibits stronger convergence behavior. In the next section, we present numerical results suggesting that our asynchronous SFP procedure is effective in practice.

5. Numerical results

We apply SFP to two stochastic inventory control examples from [7] and to the game of TIC-TAC-TOE against nature. The state spaces in the inventory control examples are small, whereas in TIC-TAC-TOE the state space is moderate.

5.1. Stochastic inventory control Example 1

This example is taken from [7]. Our main reason for choosing this example is the same as that given in [7], namely, that it can be solved either by using inventory theory or by the stochastic dynamic programming backward recursion described in [7], and hence it is easy to test the performance of SFP. Moreover, it provides us the opportunity to compare the performance of SFP with that of a recent simulation-based method for stochastic dynamic programming.

Consider the following T-period inventory control problem. At the beginning of period t = 1, 2, ..., T, we observe the inventory level s_t of a single product and decide whether or not to order an amount q of the product. The ordered amount arrives immediately. The inventory levels are bounded above by a positive integer M; note that an order can be placed only if s_t + q ≤ M. A random demand D_t is realized after receiving the placed order. The distribution of this demand is not explicitly specified to the decision maker; however, it is known that demand is an integer from the set {0, 1, ..., D}, where D is a positive integer. The inventory s_{t+1} at the beginning of the next period is given by s_{t+1} = max{s_t + x_t - D_t, 0}; that is, backlogging is not allowed and unsatisfied demand is lost. We incur three types of cost. There is a fixed cost of K every time we place an order, that is, in every period t in which x_t = q. Every unit of inventory left over at the end of period t is charged an inventory holding cost of h. There is also a penalty cost of p per unit of lost demand. The initial inventory at the beginning of the first period is s_1. The objective is to decide the order quantity for every possible inventory level in each period so as to minimize the expected total cost over the entire horizon of T periods. Observe that this problem is slightly more general than the generic formulation described in Section 2 in that the one-period cost for a given state–action pair is random. In particular, the one-period cost of ordering a feasible amount x_t when the inventory is s_t is given by

  K \cdot I(x_t > 0) + h \cdot (s_t + x_t - D_t)^+ + p \cdot (s_t + x_t - D_t)^-,

where I(x_t > 0) is the indicator function, which equals one if x_t > 0 and zero otherwise, (s_t + x_t - D_t)^+ = max{0, s_t + x_t - D_t}, and (s_t + x_t - D_t)^- = max{0, -(s_t + x_t - D_t)}.
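A possible implementation of the oracles F_t and O_t for this example is sketched below. The costs are negated so that the reward-maximization convention of Section 2 applies, the demand distribution is visible only inside the oracle, mirroring the model-free setting, and the names and default parameter values (which correspond to one of the (K, p) instances used below) are illustrative.

```python
import random

class InventoryOracle:
    """Oracles F_t and O_t for stochastic inventory control Example 1 (illustrative)."""
    def __init__(self, q=10, M=20, K=5, h=1, p=10, demand_max=9):
        self.q, self.M, self.K, self.h, self.p, self.demand_max = q, M, K, h, p, demand_max

    def feasible(self, t, s):
        # order the fixed amount q only if it does not overflow the capacity M
        return [0, self.q] if s + self.q <= self.M else [0]

    def step(self, t, s, x):
        d = random.randint(0, self.demand_max)   # demand unknown to the decision maker
        leftover = max(s + x - d, 0)             # lost sales, no backlogging
        shortage = max(d - (s + x), 0)
        cost = self.K * (x > 0) + self.h * leftover + self.p * shortage
        return leftover, -cost                   # next state, one-period reward
```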
We solve this problem using the following data from [7]: T = 3, q = 10, M = 20, h = 1, and the demand is a uniformly chosen integer from [0,9]. SFP does not use this information about the demand distribution; however, this information is needed in our numerical experiments to implement the oracles O_t. We use two values of K (0 and 5) and two values of p (1 and 10). The initial inventory is s_1 = 5. We wish to estimate V^*(s_1). It is important to reiterate that even though the demand distribution is known to be uniform, that information is used only by the backward recursion algorithm we employed to calculate the optimal values against which the results of our SFP procedure are compared. The SFP procedure does not use this explicit demand distribution information, but rather "learns" the distribution through repeated samples generated from it. As a result, the challenge in this model-free version of the stochastic inventory control problem with small action and state spaces lies in accurately estimating the expected rewards and then optimizing the ordering policy based on these estimates.

Table 1. Estimates V(s_1 = 5) of V^*(s_1 = 5) for stochastic inventory control Example 1, averaged over 30 independent runs of 50 iterations of SFP with L = 1 and L = 5.

  (K,p)    V^*(s_1=5)   SFP, L=1   SFP, L=5
  (0,1)    10.440       10.8856    11.4699
  (0,10)   24.745       24.9614    25.7516
  (5,1)    10.490       12.4176    12.3314
  (5,10)   31.635       31.183     32.6679

The three estimates reported in [7], each with sample sizes N = 4 and N = 32, are included in Table 1 for comparison: 15.03, 30.45, 18.45, 37.52, 11.23, 9.13, 10.45, 9.56, 26.12, 19.98, 24.73, 20.48, 11.47, 10.23, 10.46, 10.41, 33.11, 26.42, 31.62, 26.92, 10.49, 24.74, 10.46, 31.64.

The first column in Table 1 lists the values of K and p used. The second column lists the corresponding optimal value V^*(s_1 = 5). The third column reports the average of the optimal values estimated by our algorithm after 50 iterations over 30 independent runs when the players remember only their most recent best response, i.e., L = 1. The fourth column lists this average when the players remember their five most recent best responses, i.e., L = 5. The parameter α_k was taken to be (1/k)^{1/T} for all iterations k in all runs, as it satisfies the rate calculated in Lemma 4.2. In order to estimate the optimal value of a state s_t, the algorithm in [7], which the authors term AMS for adaptive multistage sampling, recursively samples N states from S_{t+1} for each action in s_t and proposes three different estimators: the first estimator uses a weighted average of Q-values, the second estimator uses the maximum Q-value, whereas the third one employs the Q-value of the most frequently sampled action. For comparison, we have copied from [7] the estimates of the optimal value of state s_1 = 5 obtained with these three estimators using sample sizes N = 4 and N = 32 (the smallest and largest sample sizes used there, respectively). The evolution of our estimate of V^*(s_1 = 5) over 50 iterations is plotted in Figs. 1 and 2.

The runtimes of SFP and the AMS algorithm of [7] can be assessed by looking at the number of states sampled by these methods, since sampling is the bottleneck step of both algorithms. This method of analysis provides a reasonable comparison of the algorithms, including the rate of growth of runtime as the number of iterations or the problem size increases, without focusing on the specifics of implementation or computers used. Since the optimal value of a state in AMS is recursively estimated by sampling N states from the next stage for each action in the state at hand, the number of states sampled in a problem with A actions, horizon T and sample size N is (AN)^T [7, p. 129]. Since A = 2 and T = 3 in this inventory control example, the number of states sampled by AMS is 512 with N = 4 and 262,144 with N = 32. Recalling the discussion of the asynchronous algorithm in Section 3.2, the number of states sampled by SFP in 50 iterations is 50 × A·T(T+1)/2 = 50 × 2 × 12/2 = 600, and hence its computational effort for this problem is comparable to the faster version (N = 4) of AMS.
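The sampling-effort comparison can be reproduced with a few lines of arithmetic (illustrative only):

```python
def sfp_states_sampled(iterations, A, T):
    # at most A * T * (T + 1) / 2 players participate in each iteration (Section 3.2)
    return iterations * A * T * (T + 1) // 2

def ams_states_sampled(A, T, N):
    # AMS recursively samples N next-stage states per action over T stages [7]
    return (A * N) ** T

# Example 1: A = 2 actions, T = 3 periods.
print(sfp_states_sampled(50, 2, 3))   # 600
print(ams_states_sampled(2, 3, 4))    # 512
print(ams_states_sampled(2, 3, 32))   # 262144
```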
Fig. 1. Evolution of the optimal value estimate with iterations for the four test problems with L = 1 in Example 1. Each plot shows the estimates averaged over 30 independent runs, and the corresponding standard error bars. The optimal value reported in Table 1 is shown on each plot with a flat line.

Fig. 2. Evolution of the optimal value estimate with iterations for the four test problems with L = 5 in Example 1. Each plot shows the estimates averaged over 30 independent runs, and the corresponding standard error bars. The optimal value reported in Table 1 is shown on each plot with a flat line.

Comparing the optimal value estimates obtained by SFP to those obtained by AMS using N = 4, which requires similar computational effort, we observe (Table 1) that, for the cases (K,p) = (0,1), (0,10), (5,10), the SFP estimates are significantly closer to the optimal value than those obtained by AMS with any of the three estimators. Comparing the SFP estimates for these three cases to AMS with N = 32, we see that the SFP estimates are more accurate than those obtained with the first estimator and only slightly worse than the ones obtained with the other two estimators, even though SFP requires only 0.23% of the samples that AMS with N = 32 does. When (K,p) = (5,1), the SFP estimate outperforms the first estimator with N = 4, but is worse in the other cases. Finally, the tables indicate that SFP performance is not very different for L = 1 and L = 5, although L = 1 seems to work slightly better for this problem.

Like any other finite-horizon dynamic programming algorithm, our procedure requires computational effort that depends on the length of the problem horizon. This is illustrated for the inventory control example at hand in Table 2. The table lists the average of our optimal value estimate over 30 independent runs of 50 iterations each, as in Table 1, for L = 1 and various values of T. Optimal values are also listed for comparison. The percentage relative error in our optimal value estimates after 50 iterations is plotted versus horizon length in Fig. 3. We remark that our algorithm was able to get closer to the optimal values in all cases after running more iterations; however, this is not apparent in Table 2, since we deliberately fixed the number of iterations at 50 to bring out the sensitivity of the error to the horizon T. To illustrate this point, we have included Table 3, which compares the optimal value estimates obtained for T = 10 with 50 and 200 iterations. Plots of the optimal value estimates over 200 SFP iterations for T = 10 are also shown in Fig. 4. The choice of 200 was motivated by the fact that the complexity of the standard backward recursion method of dynamic programming grows linearly with the horizon [5, Section 5.1]; since we used 50 iterations for T = 3, 200 iterations for T = 10 seemed appropriate.

Table 2. Sensitivity of the accuracy of the optimal value estimates to the horizon T in stochastic inventory control Example 1. The first column lists the horizon length. Each remaining column presents, for one combination of (K,p), the optimal value V^*(s_1 = 5) followed by the optimal value estimate V(s_1 = 5) averaged over 30 independent trials of 50 iterations of SFP with L = 1 (separated by a comma). The percentage relative error between V^*(s_1 = 5) and V(s_1 = 5) is plotted in Fig. 3. Note that the first row of results is copied from Table 1.

  T    (K=0, p=1)         (K=0, p=10)        (K=5, p=1)         (K=5, p=10)
  3    10.440, 10.8856    24.745, 24.9614    10.490, 12.4176    31.635, 31.183
  4    14.4752, 14.3797   31.8861, 31.6529   14.9452, 17.6562   40.9761, 40.9536
  5    18.0422, 18.5967   39.0273, 39.0144   19.4368, 23.6745   50.3198, 50.7654
  6    22.5475, 23.0523   46.1685, 47.4667   23.9345, 29.498    59.6634, 59.5954
  7    26.587, 27.8457    53.3097, 54.4915   28.4351, 34.5876   69.0071, 70.549
  8    30.6267, 32.6117   60.4509, 62.0641   32.9351, 39.6497   78.3507, 80.1529
  9    34.6663, 36.9863   67.5921, 70.732    37.4351, 43.9118   87.6944, 89.0752
  10   38.706, 42.1640    74.7333, 79.6176   41.9351, 50.7621   97.038, 101.505

Fig. 3. Sensitivity of the accuracy of our optimal value estimates to the problem horizon in stochastic inventory control Example 1. The percentage relative error between the optimal values and their estimates in Table 2 is given by e = 100 × |V(s_1 = 5) - V^*(s_1 = 5)| / V^*(s_1 = 5). In each plot, these relative errors are shown as black dots and are connected by a smoothed solid black line to illustrate the error trend. The dashed trend lines in each plot were obtained by fitting either a quadratic (in (a), (b), (d)) or a linear (in (c)) curve to the error data.

Table 3. Comparison of optimal value estimates averaged over 30 independent runs with 50 and 200 iterations for Example 1 with T = 10.

  (K,p)    V^*(s_1=5)   50 iterations   200 iterations   % improvement
  (0,1)    38.706       42.1640         40.7585          3.33
  (0,10)   74.7333      79.6176         77.3510          2.84
  (5,1)    41.9351      50.7621         47.6187          6.19
  (5,10)   97.038       101.505         98.5661          2.89

Fig. 4. Evolution of the optimal value estimate with iterations for the four test problems with L = 1 in Example 1 for T = 10. Each plot shows the estimates for 200 SFP iterations averaged over 30 independent runs, and the corresponding standard error bars. The optimal value is shown on each plot with a flat line.
5.2. Stochastic inventory control Example 2

Our second example is also taken from [7] and is similar to Example 1 above, except that we may now order any integer amount in {0, 1, ..., 20} (so that A = 21). Since SFP performed slightly better with L = 1 in the first example, we focus only on that case for the second example. The numerical results are presented in Table 4. The second column lists the average estimate of the optimal value over 30 independent runs of 5000 iterations each. As before, we include for comparison the three estimates obtained by the AMS algorithm of [7] with the smallest (N = 21) and the largest (N = 35) sample sizes that were reported there. The convergence behavior of our average optimal value estimate with iterations is illustrated in Fig. 5. We again turn to the number of states sampled to compare the computational effort of SFP and AMS. The number of states sampled by AMS is (AN)^T = (21 × 21)^3 = 85,766,121 when N = 21 and (21 × 35)^3 = 397,065,375 when N = 35, whereas it is 5000 × 21 × 12/2 = 630,000 for SFP. Table 4 shows that, in spite of the significantly smaller number of states sampled by SFP, its optimal value estimates are considerably more accurate than those of AMS for 20 of the 24 combinations of parameter values (K,p), estimator type and sample size, and close in the remaining four instances.
Table 2. Sensitivity of the accuracy of optimal value estimates to horizon T in stochastic inventory control Example 1. The first column lists various horizon lengths. The other columns present the optimal values V*(s1 = 5) and the optimal value estimates V(s1 = 5) averaged over 30 independent trials of 50 iterations of SFP with L = 1 (separated by a comma), for different combinations of (K, p). The percentage relative error between V*(s1 = 5) and V(s1 = 5) is plotted in Fig. 3. Note that the first row of results is copied from Table 1.

T     (K = 0, p = 1)        (K = 0, p = 10)       (K = 5, p = 1)        (K = 5, p = 10)
3     10.440, 10.8856       24.745, 24.9614       10.490, 12.4176       31.635, 31.183
4     14.4752, 14.3797      31.8861, 31.6529      14.9452, 17.6562      40.9761, 40.9536
5     18.0422, 18.5967      39.0273, 39.0144      19.4368, 23.6745      50.3198, 50.7654
6     22.5475, 23.0523      46.1685, 47.4667      23.9345, 29.498       59.6634, 59.5954
7     26.587, 27.8457       53.3097, 54.4915      28.4351, 34.5876      69.0071, 70.549
8     30.6267, 32.6117      60.4509, 62.0641      32.9351, 39.6497      78.3507, 80.1529
9     34.6663, 36.9863      67.5921, 70.732       37.4351, 43.9118      87.6944, 89.0752
10    38.706, 42.1640       74.7333, 79.6176      41.9351, 50.7621      97.038, 101.505

Fig. 3. Sensitivity of the accuracy of our optimal value estimates to problem horizon in stochastic inventory control Example 1. The four panels correspond to the four (K, p) combinations, with horizon length T on the horizontal axis and the estimation error e after 50 iterations on the vertical axis. The percentage relative error between optimal values and their estimates in Table 2 is given by e = 100 × |V(s1 = 5) - V*(s1 = 5)|/V*(s1 = 5). In each plot, these relative errors are shown as black dots and are connected by a smoothed solid black line to illustrate the error trend. The dashed trend lines in each plot were obtained by fitting either a quadratic (in (a), (b), (d)) or a linear (in (c)) curve to the error data.

Table 3. Comparison of optimal value estimates averaged over 30 independent runs with 50 and 200 iterations for Example 1 with T = 10.

(K, p)    V*(s1 = 5)    50 iterations    200 iterations    % improvement
(0, 1)    38.706        42.1640          40.7585           3.33
(0, 10)   74.7333       79.6176          77.3510           2.84
(5, 1)    41.9351       50.7621          47.6187           6.19
(5, 10)   97.038        101.505          98.5661           2.89

5.3. TIC-TAC-TOE against nature

TIC-TAC-TOE is a well-known, relatively simple two-player game played on a 3 by 3 square grid (called the game-board), in which the two players take turns placing distinct tokens (typically an 'X' and an 'O') in empty squares on the grid. The first player to place three tokens in a horizontal row, a vertical column, or a diagonal wins the game. Thus the game lasts for a total of at most nine moves. It is well known [32] that every game of TIC-TAC-TOE between two 'expert players' ends in a tie, whereas an expert player never loses, no matter whom it is pitted against [25].
Fig. 4. Evolution of the optimal value estimate with iterations for four test problems with L = 1 in Example 1 for T = 10. Each plot shows the estimates for 200 SFP iterations averaged over 30 independent runs, and the corresponding standard error bars. The optimal value is shown on each plot with a flat line.

Table 4. Estimates V(s1 = 5) of V*(s1 = 5) for stochastic inventory control Example 2, averaged over 30 independent runs of 5000 iterations of SFP with L = 1. The three estimates obtained by the AMS algorithm of [7], each with sample sizes N = 21 and N = 35, are listed in the last three columns for comparison.

(K, p)    V*(s1 = 5)    SFP estimate (L = 1)    Estimate 1 (N = 21, N = 35)    Estimate 2 (N = 21, N = 35)    Estimate 3 (N = 21, N = 35)
(0, 1)    7.5           7.5920                  24.06, 18.82                   3.12, 6.26                     9.79, 6.62
(0, 10)   13.5          13.5874                 29.17, 26.06                   6.04, 12.23                    12.06, 13.69
(5, 1)    10.49         12.2923                 33.05, 25.33                   8.73, 10.96                    18.62, 11.12
(5, 10)   25.785        24.6296                 39.97, 36.89                   17.78, 24.71                   26.76, 25.51

Fig. 5. Evolution of the optimal value estimate with iterations for four test problems with L = 1 in Example 2. Each plot shows the estimates averaged over 30 independent runs, and the corresponding standard error bars. The optimal value reported in Table 4 is shown on each plot with a flat line.

In this section we focus on TIC-TAC-TOE against nature, where the second player places its tokens in empty squares chosen at random. Such a player is also sometimes termed a 'novice player.' Thus, even when the first player employs a deterministic strategy, its game against a novice player unfolds stochastically. One web site [25] on TIC-TAC-TOE reports that, in an experiment of 1000 games between an expert player and a novice player, the expert player won 978 times, whereas the other 22 games ended in a tie. Thus, roughly speaking, the probability that an expert player wins against a novice player is about 0.978, and the probability that the game ends in a tie is about 0.022.

We model TIC-TAC-TOE against nature as a sequential decision problem in which our decision maker corresponds to the first player, who plays (say) 'X' and makes the first move starting with an empty game-board. The states of this problem correspond to the different possible game-board configurations. Since every square on the board can be empty, occupied by an 'X', or occupied by an 'O', the total number of states is 3^9 = 19,683. In fact, one can provide a more compact description of the state space of the game by taking rotational symmetries into account, but we deliberately did not incorporate this into our implementation because we wanted to evaluate how SFP performs with a larger state space. Note that in a given state, the first player's feasible actions correspond to choosing one of the empty squares in which to place an 'X.' Thus, the number of feasible actions is at most nine in any state.

Once the first player chooses an action in a specific state, this state–action pair is "sent to an oracle" that returns the new game-board, or equivalently, the state reached after nature, i.e., the second player, places an 'O' on a randomly chosen empty square. This sequence of events continues until the game ends either in a tie or with one of the two players winning. The first player receives a reward of 1 for winning, a reward of -1 for losing, and 0 for a tie. Notice that rewards are earned at game termination and there are no intermediate rewards along the way. Strictly speaking, terminal rewards do not fit the assumptions of our analysis, but our proofs can be modified in a straightforward manner to accommodate them.
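To make the oracle concrete, here is a minimal sketch of such a simulator, written by us for illustration rather than taken from the authors' implementation; the function and variable names are ours. A board is encoded as a tuple of nine entries (0 for empty, 1 for the decision maker's 'X', 2 for nature's 'O'), and the oracle returns the next board together with the terminal reward described above.

import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),    # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),    # columns
         (0, 4, 8), (2, 4, 6)]               # diagonals

def winner(board):
    # Return 1 or 2 if that player has three tokens in a line, else 0.
    for i, j, k in LINES:
        if board[i] != 0 and board[i] == board[j] == board[k]:
            return board[i]
    return 0

def legal_actions(board):
    # Empty squares in which the first player may place an 'X'.
    return [i for i in range(9) if board[i] == 0]

def step(board, action):
    # Oracle: apply the first player's move, then nature's random 'O'.
    # Returns (next_board, reward, done) with reward +1 / -1 / 0 at termination.
    board = list(board)
    board[action] = 1
    if winner(board) == 1:
        return tuple(board), 1.0, True
    empty = [i for i in range(9) if board[i] == 0]
    if not empty:                               # board full: a tie
        return tuple(board), 0.0, True
    board[random.choice(empty)] = 2             # nature moves at random
    if winner(board) == 2:
        return tuple(board), -1.0, True
    if all(v != 0 for v in board):              # tie after nature's move
        return tuple(board), 0.0, True
    return tuple(board), 0.0, False

initial_state = (0,) * 9    # empty game-board; 3**9 = 19,683 board configurations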
Our goal is to find the optimal expected reward for the first player starting from the initial state in which the game-board is empty. Note that the numbers from the web site mentioned above imply that this reward is roughly 0.978 (0.978 × 1 + 0.022 × 0) for an expert player. Thus we expect the optimal expected reward to be very close to 0.978.

We ran 10 independent trials, with 50,000 iterations each, of asynchronous SFP on this problem. We used a_k = (1/k)^(1/9) and three different L values: L = 1, L = 5 and L = 10. The average of our estimate of the optimal expected value over these 10 trials is plotted versus iterations in Fig. 6. The average and standard deviation of the estimates obtained after 50,000 iterations are shown in Table 5. Note again that our SFP procedure does not exploit symmetries or the structure of the game, and does not know that the game is being played against nature; rather, it learns the optimal expected reward by repeated stochastic simulations, which explains the relatively large number of iterations required.

Fig. 6. Evolution of the estimate of optimal expected reward with iterations for TIC-TAC-TOE against nature with (a) L = 1, (b) L = 5, and (c) L = 10.

Table 5. Average and standard deviation of optimal expected reward estimates over 10 independent trials of 50,000 iterations each for TIC-TAC-TOE against nature.

L     Average      Standard deviation
1     0.975322     0.001299451
5     0.973054     0.00153441
10    0.97319      0.00127577
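The optimal expected reward above is estimated in our experiments by repeated stochastic simulation. Purely as an illustration of that idea (and not of the SFP updates themselves), the generic sketch below, with names of our choosing, averages the terminal reward of a fixed policy over simulated games; the simulator is passed in as a function with the (next state, reward, done) interface of the oracle sketched earlier.

import random

def estimate_expected_reward(initial_state, legal_actions, step, policy, episodes):
    # Monte Carlo average of the terminal reward earned by `policy`.
    # `step(state, action)` must return (next_state, reward, done).
    total = 0.0
    for _ in range(episodes):
        state, reward, done = initial_state, 0.0, False
        while not done:
            action = policy(state, legal_actions(state))
            state, reward, done = step(state, action)
        total += reward            # only a terminal reward is earned in this game
    return total / episodes

# Example of a (far from optimal) uniformly random first-player policy:
random_policy = lambda state, actions: random.choice(actions)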
6. Discussion and conclusions

We have presented a variant of sampled fictitious play (SFP) that converges to a deterministic optimal policy while solving model-free, finite-horizon stochastic dynamic programs. This convergence result is in stark contrast with the recently proposed, original version of SFP in [20], which may only converge in a very weak sense to the set of mixed strategy Nash equilibria, i.e., to the set of sub-optimal randomized policies. Our SFP variant adaptively biases the sampling mechanism toward high quality actions by endowing the players with only finite recall, and allows the players to make occasional sampling mistakes to ensure that all players participate in the game infinitely often. Most importantly, it is computationally far less demanding than the original version in [20].

It is well known that table look-up simulation-based methods that explicitly store state value or state–action value information, as in our SFP approach here, work well for small to moderate sized model-free problems but are unlikely to successfully tackle very large problems [3,4]. For instance, the memory requirement of algorithms using state–action value (Q-value) information is proportional to the product of the sizes of the state and action spaces. We expect SFP to have this feature as well. Our SFP procedure can be seen as an asynchronous version of a Q-value based [34] variant of Monte Carlo optimistic policy iteration [4,26,33] for model-free finite-horizon ADP [3,9,26,32] (Optimistic Policy Iteration is a somewhat non-standard name used in [4] and [33]; in [26] it is called hybrid value/policy iteration). In the infinite-horizon case, synchronous versions of such algorithms (see [33], Section 5) maintain Q-values for each state–action pair and generate infinite trajectories starting at every state–action pair, using greedy actions (actions that optimize the Q-values) in every state visited. The Q-values are then updated by "smoothing" (commonly by averaging) the observed cumulative costs along these trajectories. Even though these methods are only of academic interest [33], since they require the generation of infinite trajectories, their analysis provides significant insight into implementable model-free ADP algorithms. In addition, in some cases, for example where eventual absorption into a zero-cost terminal state is inevitable along every trajectory, these methods are indeed implementable in practice. Interestingly, even though variants of optimistic policy iteration have been known for some time, detailed theoretical analyses of them have been rare. In fact, in a list of important open questions in reinforcement learning, Sutton [31] asked whether synchronous Q-function based Monte Carlo optimistic policy iteration converges to optimality. This question was recently resolved in the affirmative by Tsitsiklis [33] using monotonicity properties of the dynamic programming operator. Also, non-convergence of asynchronous Monte Carlo optimistic policy iteration is well known [33] for infinite-horizon problems. We were not able to find in the literature a rigorous proof of convergence for asynchronous Monte Carlo optimistic policy iteration methods in the model-free finite-horizon case. We have shown that strong convergence results for our model-free finite-horizon Monte Carlo algorithm are easy to prove using backward induction and the strong law of large numbers, even in the asynchronous case. Part of this simplicity stems from the way in which Q-values are updated in our SFP procedure: an update is performed for every feasible action in a selected state rather than only for the action currently estimated to be optimal (see the asynchronous SFP pseudo-code and discussion in Section 3).
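As a generic illustration of the "smoothing" update mentioned above (a running average of observed cumulative rewards, and not the exact pseudo-code of Section 3), the Q-value book-keeping can be written as follows; the names are ours.

from collections import defaultdict

q_value = defaultdict(float)    # (state, action) -> current estimate
n_obs = defaultdict(int)        # (state, action) -> number of observations

def smooth_update(state, action, observed_return):
    # Average the newly observed cumulative reward into Q(state, action).
    key = (state, action)
    n_obs[key] += 1
    q_value[key] += (observed_return - q_value[key]) / n_obs[key]

def greedy_action(state, feasible_actions):
    # Action that currently optimizes (here, maximizes) the Q-value estimates.
    return max(feasible_actions, key=lambda a: q_value[(state, a)])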
As noted in the literature [4,32], a key to ensuring convergence to optimality in simulation-based asynchronous ADP algorithms, without significantly sacrificing computational performance, is to bias the underlying action sampling procedure toward good actions, and hence toward good states, while still sampling all states infinitely often; that is, to properly balance exploitation with exploration. Most asynchronous techniques take an ad hoc approach to this issue [26]. Recently developed algorithms that attempt to carefully trade off exploitation against exploration in simulation-based methods for model-free finite-horizon stochastic dynamic programs are surveyed in [8], with expanded details appearing in [9]. Our approach in this paper is close in spirit to two multistage adaptive sampling algorithms discussed in the second chapter of [9]. Of these, the upper confidence bound (UCB) algorithm from [7] exploits the theory of multi-armed bandit problems [1] to sample actions so as to minimize expected regret (see Section 2.1.4 in [9]), whereas the pursuit learning automata (PLA) technique samples actions from a probability distribution biased toward an estimate of an optimal action, essentially implementing a recursive extension of the pursuit algorithm [28,32]. This probability distribution is updated iteratively through an explicit formula that increases the probability of sampling the action currently estimated to be optimal, that is, the "greedy action", by a fraction β (see Eq. (2.12) in [32]); this "pursuit" of a greedy action motivated the name "pursuit algorithm." In our SFP approach as well, actions are sampled from a probability distribution, namely the one induced by the history of recent plays. This probability distribution is adaptively updated, not through an explicit formula as in PLA, but rather implicitly and naturally as the history of plays evolves with iterations. Whereas the estimates of the optimal expected value in PLA converge in probability (Theorem 2.9, p. 50 in [9]), we are able to obtain almost sure convergence results because we do not have exogenous noise in the action sampling mechanism employed in the best response process. Specifically, we proved in Theorem 4.1 that our SFP variant converges to a deterministic optimal policy with probability one, and Corollary 4.6 showed that the value estimates generated by SFP converge with probability one to the optimal expected multi-period reward. In contrast, the UCB technique converges in mean (Theorem 2.3, p. 29 in [9]).

Extension of the results in this paper to infinite-horizon model-free problems is a natural direction for future research. However, since it is not possible to sample an infinite path in practice, an implementable procedure would have to approximate the infinite-horizon expected rewards with appropriately selected finite-horizon truncations. This introduces an additional hurdle in proving convergence to optimality. In addition, the fundamental distinguishing features of our variant of SFP (finite memory, action samples of size one, and occasional mistakes in sampling) appear to help both in simplifying convergence proofs and in enhancing computational performance as compared to the original version of SFP in [20]. Therefore, we will focus on investigating theoretical and empirical properties of this variant when applied to combinatorial optimization problems that are not modeled as dynamic programs. We have taken initial steps in this direction, obtaining encouraging numerical results, comparable to those of genetic algorithms [11], on multi-dimensional knapsack problems [17].
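As a simple illustration of such a truncation (under the added assumption, not made in this paper, of a discounted criterion with factor gamma in (0, 1) and per-period rewards bounded by r_max), the horizon can be chosen so that the discarded tail contributes at most a prescribed tolerance.

import math

def truncation_horizon(gamma, r_max, eps):
    # Smallest T with gamma**T * r_max / (1 - gamma) <= eps, so that truncating
    # an infinite-horizon discounted reward stream after T periods changes the
    # total expected reward by at most eps.
    assert 0.0 < gamma < 1.0 and r_max > 0.0 and eps > 0.0
    return max(0, math.ceil(math.log(eps * (1.0 - gamma) / r_max, gamma)))

# e.g., gamma = 0.95, per-period rewards bounded by 10, tolerance 0.01
print(truncation_horizon(0.95, 10.0, 0.01))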
Researchers have recently proposed "payoff-based" dynamics for learning equilibria in multi-player games [23]. In contrast to "action-based" mechanisms such as FP or AP, these approaches do not require a best response computation. For example, in the so-called safe experimentation dynamics [23], players simply maintain a record of the best utility value they have obtained in the past and the corresponding action they played at the time; these are called the baseline utility and the baseline action, respectively. In each iteration k, each player samples its baseline action with probability ε_k and, with probability 1 - ε_k, samples an action randomly. If the utility a particular player receives for the joint action profile sampled by all players is higher than its baseline utility, that player updates its baseline utility and action. It is easy to prove (Theorem 3.1 in [23]) that this dynamic process converges almost surely to a pure strategy optimal Nash equilibrium for all games of identical interests under a standard sufficient condition on ε_k, similar to our condition on a_k in Lemma 4.2. In the future, it would be interesting to compare the computational performance of such payoff-based processes on dynamic programs with the action-based approach taken in this paper.
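Following the description above, a schematic sketch of safe experimentation dynamics for a game of identical interests is given below; the probability convention mirrors the text, while the utility function, the ε_k schedule and all names are illustrative placeholders rather than the dynamics of [23] verbatim.

import random

def safe_experimentation(action_sets, utility, iterations, epsilon):
    # action_sets: list of per-player action lists; utility: maps a joint action
    # (tuple) to the common payoff; epsilon(k): probability of replaying the
    # baseline action in iteration k, as described in the text.
    baseline_action = [random.choice(acts) for acts in action_sets]
    baseline_utility = [float("-inf")] * len(action_sets)
    for k in range(1, iterations + 1):
        joint = tuple(
            baseline_action[i] if random.random() < epsilon(k) else random.choice(acts)
            for i, acts in enumerate(action_sets)
        )
        u = utility(joint)
        for i in range(len(action_sets)):
            if u > baseline_utility[i]:          # improvement: update the baseline
                baseline_utility[i] = u
                baseline_action[i] = joint[i]
    return baseline_action, baseline_utility

# Illustrative use on a tiny two-player identical-interest game in which the
# common payoff is 1 if both players choose action 1 and 0 otherwise.
best, _ = safe_experimentation(
    [[0, 1], [0, 1]],
    lambda a: 1.0 if a == (1, 1) else 0.0,
    iterations=5000,
    epsilon=lambda k: 1.0 - (1.0 / k) ** 0.5,   # exploration probability decays like 1/sqrt(k)
)
print(best)    # the learned baseline (pure strategy) actions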
Acknowledgments

This research was funded in part by the National Science Foundation under Grants CMMI-0422752, CCF-0830380, and CCF-0830092. Archis Ghate appreciates summer support from the University of Washington.

References

[1] Auer P, Cesa-Bianchi N, Fisher P. Finite-time analysis of the multiarmed bandit problem. Machine Learning 2002;47:235–56.
[2] Baumert S, Cheng SF, Ghate AV, Reaume D, Sharma D, Smith RL. Joint optimization of capital investment, revenue management, and production planning in complex manufacturing systems. Technical Report 05-05, Industrial and Operations Engineering, University of Michigan, Ann Arbor; 2005.
[3] Bertsekas DP. Dynamic programming and optimal control, vols. 1 and 2. Belmont, MA, USA: Athena Scientific; 1995.
[4] Bertsekas DP, Tsitsiklis JN. Neuro-dynamic programming. Belmont, MA, USA: Athena Scientific; 1996.
[5] Blondel VD, Tsitsiklis JN. A survey of computational complexity results in systems and control. Automatica 2000;36.
[6] Brown GW. Iterative solution of games by fictitious play. In: Activity analysis of production and allocation. New York, NY, USA: Wiley; 1951.
[7] Chang HS, Fu MC, Hu J, Marcus SI. An adaptive sampling algorithm for solving Markov decision processes. Operations Research 2005;53(1):126–39.
[8] Chang HS, Fu MC, Hu J, Marcus SI. A survey of some simulation-based algorithms for Markov decision processes. Communications in Information and Systems 2007;7(1):59–92.
[9] Chang HS, Fu MC, Hu J, Marcus SI. Simulation-based algorithms for Markov decision processes. London, UK: Springer; 2007.
[10] Cheng SF, Epelman MA, Smith RL. CoSIGN: a parallel algorithm for coordinated traffic signal control. IEEE Transactions on Intelligent Transportation Systems 2007;7(4):551–64.
[11] Chu PC, Beasley JE. A genetic algorithm for the multi-dimensional knapsack problem. Journal of Heuristics 1998;4:63–86.
[12] Feller W. An introduction to probability theory and its applications. New York, NY, USA: John Wiley and Sons; 1970.
[13] Fudenberg D, Tirole J. Game theory. Cambridge, MA, USA: MIT Press; 1991.
[14] Garcia A, Reaume D, Smith RL. Fictitious play for finding system optimal routings in dynamic traffic networks. Transportation Research Part B: Methodological 2000;34(2).
[15] Garcia A, Patek SD, Sinha K. A decentralized approach to discrete optimization via simulation: application to network flows. Operations Research 2007;55(4).
[16] Garcia A, Campos E. Game theoretic approach to efficient power management in sensor networks. Operations Research 2008;56(3).
[17] Ghate AV. Game theory, Markov chains and infinite programming: three paradigms for optimization of complex systems. PhD thesis, Industrial and Operations Engineering, The University of Michigan, Ann Arbor; 2006.
[18] Gershwin SB. Manufacturing systems engineering. Englewood Cliffs, NJ, USA: Prentice-Hall; 1994.
[19] Gigerenzer G, Selten R. Bounded rationality. Cambridge, MA, USA: MIT Press; 2002.
[20] Lambert TJ, Epelman MA, Smith RL. A fictitious play approach to large-scale optimization. Operations Research 2005;53(3):477–89.
[21] Lambert TJ, Wang H. Fictitious play approach to a mobile unit situation awareness problem. Technical Report, University of Michigan, Ann Arbor; 2003.
[22] Marden JR, Arslan G, Shamma JS. Joint strategy fictitious play with inertia for potential games. In: 44th IEEE conference on decision and control; December 2005.
[23] Marden JR, Young HP, Arslan G, Shamma JS. Payoff-based dynamics for multi-player weakly acyclic games. SIAM Journal on Control and Optimization 2009;48(1):373–96.
[24] Monderer D, Shapley L. Fictitious play property for games with identical interests. Journal of Economic Theory 1996;68:258–65.
[25] Ostermiller S. http://ostermiller.org/tictactoeexpert.html [last accessed August 3, 2010].
[26] Powell W. Approximate dynamic programming: solving the curses of dimensionality. New York, NY, USA: John Wiley and Sons; 2007.
[27] Puterman M. Markov decision processes. New York, NY, USA: John Wiley and Sons; 1994.
[28] Rajaraman K, Sastry PS. Finite time analysis of the pursuit algorithm for learning automata. IEEE Transactions on Systems, Man, and Cybernetics, Part B 1996;26(4):590–8.
[29] Robinson J. An iterative method of solving a game. Annals of Mathematics 1951;54:296–301.
[30] Stout W. Almost sure convergence. New York, NY, USA: Academic Press; 1974.
[31] Sutton RS. Open theoretical questions in reinforcement learning. In: Fischer P, Simon HU, editors. Proceedings of the fourth European conference on computational learning theory (EuroCOLT99). Springer-Verlag; 1999. p. 11–7. ftp://ftp.cs.umass.edu/pub/anw/pub/sutton/sutton-99.ps.
[32] Sutton R, Barto A. Reinforcement learning: an introduction. Cambridge, MA, USA: MIT Press; 1998.
[33] Tsitsiklis JN. On the convergence of optimistic policy iteration. Journal of Machine Learning Research 2002;3:59–72.
[34] Watkins C, Dayan P. Technical note: Q-learning. Machine Learning 1992;8:279–92.
[35] Young HP. Evolution of conventions. Econometrica 1993;61(1):57–84.