Icin08 PDF

Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/29634548

On the Parallelization of Monte-Carlo planning

Article · May 2008


Source: OAI

CITATIONS READS
29 78

5 authors, including:

Sylvain Gelly Jean-Baptiste Hoock

89 PUBLICATIONS   2,075 CITATIONS   
National Institute for Research in Computer Science and Control
20 PUBLICATIONS   454 CITATIONS   
SEE PROFILE
SEE PROFILE

Arpad Rimmel
École Supérieure d'Electricité
35 PUBLICATIONS   540 CITATIONS   

SEE PROFILE

All content following this page was uploaded by Jean-Baptiste Hoock on 02 June 2014.

The user has requested enhancement of the downloaded file.


Author manuscript, published in "ICINCO, Madeira : Portugal (2008)"

THE PARALLELIZATION OF MONTE-CARLO PLANNING


Parallelization of MC-Planning

S. Gelly, J.B. Hoock, A. Rimmel, O. Teytaud


TAO (Inria), Lri, UMR Cnrs 8623, Univ. Paris-Sud, France
[email protected]

Y. Kalemkarian
High Performance Calculus, Bull, Grenoble, France [email protected]

Keywords: Bandit-based Monte-Carlo planning, parallelization, multi-core machines, message-passing.

Abstract: Since their impressive successes in various areas of large-scale parallelization, recent techniques like UCT and
other Monte-Carlo planning variants (Kocsis and Szepesvari, 2006a) have been extensively studied (Coquelin
and Munos, 2007; Wang and Gelly, 2007). We here propose and compare various forms of parallelization of
bandit-based tree-search, in particular for our computer-go algorithm XYZ.
inria-00287867, version 1 - 13 Jun 2008

1 INTRODUCTION lenges won by such methods (Hussain et al., 2006).


Its application to computer-go has provided partic-
Dynamic programming (Bellman, 1957; Bertsekas, ularly impressive results (Coulom, 2006; Wang and
1995) provides a robust and stable tool for dynamic Gelly, 2007). The decision taking, at each step, is far
optimization. However, its application to large scale more expensive, but the overall cost is much lower
problems is far from simple and involves approxima- than the computational cost of one dynamic program-
tions, as in approximate value functions in approxi- ming loop when the branching factor (or the state
mate dynamic programming (Powell, 2007). Some space) is large. The analysis of Monte-Carlo plan-
alternate approaches, like using simulations for focus- ning if far from being closed, as it is not clear that
ing on more important areas of the state space as in upper-confidence bounds are the best tool for that,
RTDP (Barto et al., 1993), have provided some im- in particular when heuristics like AMAF(Bruegmann,
portant yet unstable tools for large scale approximate 1993) (also termed RAVE) are included. However, we
dynamic programming. Monte-Carlo planning, ap- here consider a fixed empirically tuned algorithm, and
plied to Go since (Bruegmann, 1993), and in partic- we focus on its parallelization, for various forms of
ular Bandit-based Monte-Carlo planning (Kocsis and parallelization (multi-core machines, standard clus-
Szepesvari, 2006a), provides an alternate solution, in- ters). The application to computer-go is one of the
creasing the importance of simulations. Bandit-based most illustrative applications of bandit-based Monte-
Monte-Carlo planning features the construction of a Carlo planning methods. In particular, Go is still far
tree approximating incrementally the distribution of from being solved, as moderate human players are
possible futures. Tools from the bandit literature (see still much better than the best computer-go programs.
(Lai and Robbins, 1985; Auer et al., 2001); adversar- Monte-Carlo planning is a revolution in computer-
ial case in (Kocsis and Szepesvari, 2005); huge sets go, in some cases combined with tactical knowledge
of arms in (Banks and Sundaram, 1992; Agrawal, (Cazenave and Helmstetter, 2005; Coulom, 2007) but
1995; Dani and Hayes, 2006; Berry et al., 1997)) also in some cases almost from scratch without any
are used in order to bias the random development specific knowledge (Wang and Gelly, 2007), except
of the tree in the direction of the most important some likelihood of random playouts. The paralleliza-
directions; the important paper (Kocsis and Szepes- tion of Monte-Carlo planning is classical in the multi-
vari, 2006a) applies recursively bandits for the bi- core case; we here provide an analysis of the speed-
ased random development of a tree. The efficiency up of this multi-core approach (shared memory). We
of recursive bandits is also shown by some chal- then move to the message-passing architecture, for
a cluster-parallelization. We briefly introduce ban- the case of Go. We point out that UCT can be applied
dits and bandit-based Monte-Carlo planning below, to trees far from minimax problems: max/max prob-
for the sake of clarity of notations. We then present lems, or max/Expectation, etc. Algorithm 1 provides
the multi-core parallelization in section 2, and two an overview (see also flowchart in Fig. 1), and Algo-
different approaches for the cluster parallelization in rithm 3 provides a more detailed presentation. The
section 3, 3. A bandit problem is typically defined
Time left > 0 ?
as follows: (1) A finite set A = {1, . . ., N} of arms Update the
tree and its statistics
is given. (2) Each arm a ∈ A is equipped with an un- No Yes
Complete
known probability of reward pa . (3) At each time step Choose a path
Game
S,L,L’,S’
and its reward
t ∈ {1, 2, . . .}, the algorithm chooses at ∈ A depend- in the tree
(using statistics)
Simulation
S in the tree
ing on (a1 , . . . , at−1 ) and (r1 , . . . , rt−1 ). (4) The ban- Decision
and node L’ out
of the tree
dit gives a reward rt , with rt = 1 with probability pat (most simulated
move from the root)
Node L’ out
(independently), and rt = 0 otherwise. A bandit al- of the tree
Simulation S’
out of the tree
and reward
gorithm is an algorithm designed for minimizing the Simulation
from L’ to game
so-called regret: over

T
Figure 1: Flow chart of UCT. The bandit algorithm is ap-
Regret = T inf pa − ∑ rt . plied in the simulation part in the tree.
a∈A i=1

Important results, depending on assumptions on the


set of actions and their distribution of rewards, include
(Lai and Robbins, 1985; Auer et al., 2001; Banks and Algorithm 1 Monte-Carlo planning algorithm. The final
inria-00287867, version 1 - 13 Jun 2008

line chooses the decision with the conservative rule that the
Sundaram, 1992; Agrawal, 1995; Dani and Hayes, most simulated decision should be chosen. Other solutions,
2006; Berry et al., 1997; Hussain et al., 2006; Kocsis such as taking the decision with the higher ratio ”number
and Szepesvari, 2005; Coquelin and Munos, 2007). A of winning simulations divided by the number of simula-
bandit is said to have side information if some infor- tions”, are too much dangerous because a decision might
mation other than the reward is available to the algo- be very poor due to a small number of simulations. Some
rithm. We present various bandit algorithms in sec- other robust and elegant techniques include the use of lower
confidence bounds. Here, rewards are binary (win or loss),
tion 1: these bandit algorithms usually work as fol- but arbitrary distributions of rewards can be used.
lows for each time step:
Initialize T to the root, representing the current state. T is
- compute a score, for each possible arm, which a tree of positions, with each node equipped with statis-
consists in (i) an exploitation term, larger for arms tics.
for which the average rewards in the past is good; while Time left > 0 do
(ii) an exploration term, larger for arms which Simulate one game until a leaf (a position) L of T
(thanks to bandit algorithms). Then, choose one son
have not been tried often. (a successor) L′ of L.
- choose the arm with maximum score. Simulate one game from position L′ until the game is
over.
Typical improvements are: (i) give heuristically a Add L′ as a son of L in T .
score to arms which have not been visited yet (see e.g. Update statistics in all the tree. In UCT, each node
(Wang and Gelly, 2007; Coulom, 2007)); (ii) guess knows how many winning simulations (from this
some side information (see the AMAF heuristic e.g. node) have been performed and how many simula-
(Gelly and Silver, 2007)); (iii) restrict the sampling to tions (from this node) have been performed. In other
forms of tree-search, other informations are neces-
the k(t) first nodes for k(t) some non-decreasing map- sary (heuristic information in AMAF, total number
ping (see progressive widening in (Coulom, 2007)). of nodes in the tree in BAST(Coquelin and Munos,
In the case of Go and more generally bandit-based 2007)).
Monte-Carlo planning, we use one bandit at each end while
node of the tree, as explained in section 1. The dif- Pick up the decision which has been simulated most often
ficult elements are the anytime nature of the problem, from the root.
its non stationarity (Hussain et al., 2006; Kocsis and
Szepesvari, 2006b), its large number of arms (Banks main point in Algorithm 1 is that the tree is unbal-
and Sundaram, 1992). The side information (usually anced, with a strong bias in favor of important parts
termed AMAF) is detailed below. of the tree. The function used for taking decisions out
This section provides an overview of bandit-based of the tree (i.e. the so-called Monte-Carlo part) is de-
Monte-Carlo planning algorithms. It is presented in fined in (Wang and Gelly, 2007) The function used
the case of a game; the experiments are performed in for simulating in the tree is presented in Algorithm
2. This function is a main part of the code: it de- heuristic), at the nth simulation in a given node, with
cides in which direction the tree should be extended. K(n) is an non-decreasing mapping from N to N.
There are various formulas, all of them being based on
the idea of a compromise between exploitation (fur- Algorithm 3 More detailed Monte-Carlo planning algo-
ther simulating the seemingly good moves) and ex- rithm.
ploration (further simulating the moves which have Initialize T to the root, representing the current state. T
not been explored a lot). The empirically best tool in is a tree of positions.
while Time left > 0 do
Simulate one game until a leaf (a position) L of T
Algorithm 2 Taking a decision in the tree. The total (thanks to bandit algorithms applied until a leaf is met,
number of simulations at situation s is Simulations(s) = see Algorithm 2).
∑d Simulations(s, d). We present here various formulas for Choose one son (a successor) L′ of L, possibly with
computing the score (see (Lai and Robbins, 1985; Auer some offline learning (Gelly and Silver, 2007).
et al., 2001; Gelly and Silver, 2007) for UCB1, UCB- Simulate one game from position L′ until the game is
Tuned and AMAF respectively); other very important vari- over.
ants (for infinite domains, larger number of arms, of spe- Add L′ as a son of L in T .
cific assumptions) can be found in (Banks and Sundaram, Update statistics in all the tree. In UCT, each node
1992; Agrawal, 1995; Dani and Hayes, 2006; Coquelin and knows how many winning simulations (from this
Munos, 2007). node) have been performed and how many simula-
Function decision = Bandit(situation s in the tree). tions (from this node) have been performed. In other
for d in the set of possible decisions do forms of tree-search, other informations are neces-
Let p̂(d) = Wins(s, d)/Simulations(s, d). sary (heuristic information in AMAF, total number
SWITCH (bandit formula): of nodes in the tree in BAST(Coquelin and Munos,
- UCB1: compute score(d) = p̂(d) + 2007)).
inria-00287867, version 1 - 13 Jun 2008

p
2 log(Simulations(s))/Simulations(s, d). end while
- UCB-Tuned.1: compute score(d) = p̂(d) + Pick up the decision which has been simulated most often
from the root.
q
V̂ log(Simulations(s))/Simulations(s, d) with
V̂ = max(0.001, p̂(d)(1 − p̂(d))).
- qUCB-Tuned.2: compute score(d) = p̂(d) +
V̂ log(Simulations(s))/Simulations(s, d) +
log(Simulations(s))/Simulations(s, d) with 2 MULTI-CORE
V̂ = max(0.001, p̂(d)(1 − p̂(d))). PARALLELIZATION
- AMAF-guided exploration: Compute
score(d) = α(d) p̂(d) + (1 − α(d)) p̂(d)
ˆ with:
- ˆ
p̂(d) the AMAF-estimate of the asymptotic The multi-core parallelization is intuitively the
value of score(d). most natural one: the memory is shared. We just have
- α(d) a coefficient depending on to distribute the loop on various threads (each thread
Simulations(s, d) and the AMAF-confidence performs simulations independently of other threads,
ˆ
in p̂(d) (see (Gelly and Silver, 2007)). with just mutexes protecting the updates in memory),
END SWITCH leading to algorithm 4. Consider N the number of
end for
Return arg maxd score(d).
Algorithm 4 Multi-core Monte-Carlo planning algorithm.
Initialize T to the root, representing the current state. T
Algorithm 2 is the AMAF-guided exploration. In that is a tree of positions.
case, in the classical formalism of bandits, each time For each thread simultaneously:
one arm is played, we get: (i) a positive or null reward while Time left > 0 do
for the chosen arm (0 or 1); (ii) a list of arms, consist- Simulate one game until a node (=position) L of T
ing in half of the arms roughly, positively correlated (thanks to bandit algorithms); put this simulation S in
memory. Choose a successor L′ of L.
with the list of arms for which the rewards would Simulate one game from position L′ until the game is
have been positive if these arms have been played. over; put this simulation S′ in memory.
The bandit algorithm for this bandit problem has been Add L′ as a son of L in T .
empirically derived in (Gelly and Silver, 2007). To Update statistics in all the tree thanks to S and S′ .
the best of our knowledge, there’s no mathematically end while
grounded bandit algorithm in that case. An important Pick up the decision which has been simulated most of-
improvement (termed progressive widening(Coulom, ten.
2007)) of Algorithm 2 consists in considering only
the K(n) ”best” moves (the best according to some threads. The number of simulations per second is
Nb threads × 10 20 40 Algorithm 5 Cluster algorithm for Monte-Carlo planning.
comp. time sec.procs secs.procs secs.procs As there are many paths leading to the same node, we must
1 thread 51.1 ± 1.8 62.0 ± 2.2 74.4 ± 2.4 use a hash table, so that with some key (describing a goban)
2 threads 62.9 ± 1.8 we can find if a node is in the tree and what are its statistics
4 threads 66.4 ± 2.1 in constant time.
for Each node do
Table 1: Success rate against mogoRelease3, for various Initialize T to the root, representing the current state
computation times. Under the two-assumptions (1) almost of the root. T is a tree of positions, with statistics at-
N times more simulations per second with N cores (2) no tached to each node.
impact of the ”delay” point pointed out in the text, the end for
speed-up would be linear and all the raws would be equal. for For each computer simultaneously: do
We see a decay of performance for 4 threads. for For each thread simultaneously: do
while Time left > 0 do
Simulate one game until a node (=position) L of
typically almost multiplied by N. However, this algo- T (thanks to bandit algorithms); put this simula-
rithm is not equivalent to the sequential one: possibly, tion S in memory. Choose a successor L′ of L.
Simulate one game from position L′ until the
N − 1 simulations are running when one more simula- game is over; put this simulation S′ in memory.
tion is launched, and the updates of T corresponding S, S′ is a complete game starting at the root.
to these N − 1 simulations are not taken into account. Add L′ as a son of L in T , and update all the
There is a N − 1 delay, and the analysis of the delay is statistics in T with S, S′ .
not straightforward - we will quantify this effect ex- Add (L, L′ , S, S′ ) to a stack of to-be-sent simula-
perimentally. We get the following empirical results: tions.
Table 1 confirms the roughly 63% success rate known if time − t0 ≥ T and thread=first thread then
Set t0 = time.
inria-00287867, version 1 - 13 Jun 2008

when doubling the computational power. As a con- Send all the (L, L′ , S, S′ ) in the stack to all
clusion, in 9x9 Go, the speed-up of the multi-core al- other nodes.
gorithm 4 is linear for two nodes, slightly below the Reset the stack.
linear speed-up for 4 nodes (by interpolation, we can Receive many (L, L′ , S, S′ ) from all other
estimate the speed-up as 3 for 4-cores). nodes.
for Each (L, L′ , S, S′ ) received do
Add L′ as a son of L in T (if not present).
Update all the statistics in T with S, S′ .
3 CLUSTER PARALLELIZATION end for
end if
First, let’s consider the generalization of the multi- end while
end for
core approach to a cluster, i.e. a version with mas- end for
sive communication in order to keep roughly the same Pick up the decision which has been simulated most of-
state of the memory on all nodes. As the mem- ten.
ory is not shared here, we have to broadcast on the
network many update-informations; each simulation
on one node leads to one broadcast. Possibly, we
can group communications in order to reduce laten- information (costs 0 second); (iii) each node receives
cies. This leads to the algorithm 5. T = 0 is per- (N − 1)M update-informations (costs 0 second); (iv)
haps possible, for high performance clusters or pro- each node updates its tree with these (N − 1)M up-
cessors inside the same machine. Let’s consider the date informations (costs α(N − 1) second). If we di-
idealized case of a cluster with negligible communi- vide by the number N of nodes and let N → ∞, we
cation cost and infinite number of nodes. Let’s as- get (i) a cost (1 − α)/N → 0; (ii) a cost 0 for send-
sume that a proportion α of time is spent in the up- ing update-informations; (iii) a cost 0 for receiving
dates. Also, let’s assume that the delay of updates update-informations; (iv) a cost α(N − 1)/N → α for
does not reduce the overall efficiency of the algo- updates. This implies that the main cost is the update-
rithm. What is the speed-up in that case ? Consider cost, and that asymptotically, the speed-up is 1/α. In
M the number of simulations per second on one node the case of MoGo, this leads to α ≃ 0.05 and there-
in the (mono-node) case. With N nodes, at each time fore roughly 20 as maximal speed-up for the case of a
steps, we get NM simulations. The number of up- tree simultaneously updated on all nodes. As commu-
dates is therefore NM per second of simulation. If nications are far from negligible, as preliminary ex-
the time of one update is T , for each group of M periments were disappointing and as we expect better
simulations, (i) each node performs M simulations than the 20 speed-up, we will not keep this algorithm
(costs 1 − α second); (ii) each node sends M update- in the sequel.
AN ALTERNATE SOLUTION WITH Number T0 (time Success Estimated
N of between rate Speed-up
LESS COMMUNICATIONS
nodes updates) divided
by N
Section 3 showed that whenever communications are 3 1.00s 67.2 ± 3.5 0.95 (9x9)
perfect, the speed-up1 is limited to some constant 3 0.11s 67.3 ± 2.1 0.94 (9x9
1/α, roughly 20 in MoGo. We propose the follow- 4 0.33s 69.6 ± 1.5 0.81 (9x9)
ing algorithm (Algorithm 6), with the following ad- 9 0.33s 79.0 ± 1.0 (9x9)
vantages: (1) much less communications (can run on 9 0.11s 83.8 ± 5.4 (19x19)
usual ethernet); (2) tolerant to inhomogeneous nodes
(as other algorithms above also); (3) our implemen- We have no estimate for the 9-machines speed-
tation is not yet fault-tolerant, but it could be done; up, because comparing computational budget B with
(4) self-assessment possible (strong variance → more 9 machines and 9B with 1 machine implies the use
time). The algorithm is detailed in Algorithm 6. of games with average time per move 5.B, which re-
quires a lot of computational power. However, the
83.8% is in the extrapolation of linear speed-up. The
Algorithm 6 Algorithm for Monte-Carlo planning on a results were not significantly different with higher
cluster.
numbers of levels. We guess that for larger numbers
Initialize T to the root, representing the current state. T
is a tree of positions. T is the same on all computers. of nodes the results will be different but we have not
for Each computer simultaneously: do yet any empirical evidence of this.
for Each thread simultaneously: do
while Time left > 0 do
inria-00287867, version 1 - 13 Jun 2008

Simulate one game until a node (=position) L of


T (thanks to bandit algorithms). 4 CONCLUSION
Choose one son L′ of L.
Simulate one game from position L′ until the Computer-Go is both a main target for computer-
game is over. games, as the main unsolved game, and a challeng-
Add L′ as a son of L in T0 . ing planification problem as the most promising ap-
Thread 0 only: if time − t0 ≥ T0 , set t0 = time,
proaches have a moderate expert knowledge and are
and average statistics in all the tree for nodes of
depth ≤ K with at least Nmin simulations. therefore quite general. In particular, successful ap-
end while plications of the UCT approach have been reported
end for very far from computer-Go (Kocsis and Szepesvari,
end for 2006a). A main advantage of the approach is that it is
Decision step:Pick up the decision which has been sim- fully scalable. Whereas many expert-based tools have
ulated most often, on the whole set of nodes. roughly the same efficiency when doubling the com-
putational power, bandit-based Monte-Carlo planning
with time 2B has success rate roughly 63% against
An important advantage of this technique is that
a bandit-based Monte-Carlo planning algorithm with
averaging vectors of statistics is possible quickly on
time B. This leads to the hope of designing a paral-
a large number of nodes: the computational cost of
lel platform, for an algorithm that would be efficient
this averaging over N nodes is O(log(N)). The case
in various tasks of planifications. The main results in
T0 = ∞ in Algorithm 6 (no communication before the
this paper are the followings:
final step of the decision making) is just a form of
- Doubling the computational power (doubling the
averaging. For computer-go, we get 59 %± 3% of
time per move) leads to a 63% success rate against
success rate with an averaging over 43 machines ver-
the non-doubled version.
sus a single machine, whereas a speed-up 2 leads to
- The straightforward parallelization on a cluster (im-
63%. This means that averaging provides a speed-up
itating the multi-core case by updating continuously
less than 2 with 43 machines; this is not a good par-
the trees in each node so that the memory is roughly
allelization. We then experiment T0 finite and K = 0
the same in all nodes) does not work in practice and
(i.e. only the root of the tree is shared between nodes):
has strong theoretical limitations, even if all compu-
1 This is not a ”real” speed-up, as the parallel algorithm,
tational costs are neglected.
- A simple algorithm, based on averages which are
even in the multi-core case, is not equivalent to the sequen-
tial one - the difference is the ”delay” detailed in section 2. easily computable with classical message passing li-
We here consider that the speed-up is k if the parallel algo- braries, a few times per second, can lead to a great
rithm is as efficient as the sequential one with k times more successes; in 19x19, we have reached, with 9 nodes,
time. 84 % success rate against one equivalent node. This
success rate is far above simple voting schemas, sug- Barto, A., Bradtke, S., and Singh, S. (1993). Learning to
gesting that communications between independent act using real-time dynamic programming. Technical
randomized agents are important and that communi- Report UM-CS-1993-002.
cating only at the very end is not enough. Bellman, R. (1957). Dynamic Programming. Princeton
- Our two parallelizations (multi-core and cluster) are Univ. Press.
orthogonal, in the sense that: (i) the multi-core paral- Berry, D. A., Chen, R. W., Zame, A., Heath, D. C., and
lelization is based on a faster breadth-first exploration Shepp, L. A. (1997). Bandit problems with infinitely
(the different cores are analyzing the same tree and go many arms. Ann. Statist., 25(5):2103–2116.
through almost the same path in the tree; in spite of Bertsekas, D. (1995). Dynamic Programming and Optimal
many trials, we have no improvement by introducing Control, vols I and II. Athena Scientific.
deterministic or random diversification in the differ- Bruegmann, B. (1993). Monte carlo go. Unpublished.
ent threads. (ii) the cluster parallelization is based Cazenave, T. and Helmstetter, B. (2005). Combining tacti-
on sharing statistics guiding the first levels only of cal search and monte-carlo in the game of go. IEEE
the tree, leading to a natural form of load balancing. CIG 2005, pages 171–175.
The deep exploration of nodes is completely orthog- Coquelin, P.-A. and Munos, R. (2007). Bandit algorithms
onal. Moreover, the results are cumulative; we see for tree search. In Proceedings of UAI’07.
the same speed-up for the cluster parallelization with Coulom, R. (2006). Efficient selectivity and backup opera-
multi-threaded versions of the code or mono-thread tors in monte-carlo tree search. In P. Ciancarini and
versions. H. J. van den Herik, editors, Proceedings of the 5th
- In 9x9 Go, we have roughly linear speed-up until International Conference on Computers and Games,
Turin, Italy.
4 cores or 9 nodes. The speed-up is not negligible
beyond this limit, but not linear. In 19x19 Go, the Coulom, R. (2007). Computing elo ratings of move patterns
inria-00287867, version 1 - 13 Jun 2008

in the game of go. In van den Herik, H. J., Uiterwijk,


speed-up remains linear until at least 4 cores and 9 J. W. H. M., Winands, M., and Schadd, M., editors,
machines. Computer Games Workshop, Amsterdam.
Extending these results to higher numbers of ma-
Dani, V. and Hayes, T. P. (2006). Robbing the bandit:
chines is the natural further work. Increasing the less regret in online geometric optimization against
number of cores is difficult, as getting an access to a an adaptive adversary. In SODA ’06: Proceedings
16-cores machine is not easy. Monte-Carlo planning of the seventeenth annual ACM-SIAM symposium on
is a strongly innovative tool with more and more ap- Discrete algorithm, pages 937–943, New York, NY,
plications, in particular in cases in which variants of USA. ACM Press.
backwards dynamic programming do not work. Ex- Gelly, S. and Silver, D. (2007). Combining online and
trapolating the results to the human scale of perfor- offline knowledge in uct. In ICML ’07: Proceed-
ings of the 24th international conference on Machine
mance is difficult. People usually consider that dou-
learning, pages 273–280, New York, NY, USA. ACM
bling the computational power is roughly equivalent Press.
to adding almost one stone to the level. This is con-
Hussain, Z., Auer, P., Cesa-Bianchi, N., Newnham, L., and
firmed by our experiments. Then, from the 2nd or 3rd Shawe-Taylor, J. (2006). Exploration vs. exploitation
Kyu of the sequential MoGo in 19x19, we need 10 or challenge. Pascal Network of Excellence.
12 stones for the best human level. Then, we need Kocsis, L. and Szepesvari, C. (2005). Reduced-variance
a speed-up of a few thousands. This is far from im- payoff estimation in adversarial bandit problems. In
possible, if the speed-up remains close to linear with Proceedings of the ECML-2005 Workshop on Rein-
more nodes. forcement Learning in Non-Stationary Environments.
Kocsis, L. and Szepesvari, C. (2006a). Bandit-based monte-
carlo planning. ECML’06.
REFERENCES Kocsis, L. and Szepesvari, C. (2006b). Discounted-ucb. In
2nd Pascal-Challenge Workshop.
Agrawal, R. (1995). The continuum-armed bandit problem. Lai, T. and Robbins, H. (1985). Asymptotically efficient
SIAM J. Control Optim., 33(6):1926–1951. adaptive allocation rules. Advances in Applied Math-
ematics, 6:4–22.
Auer, P., Cesa-Bianchi, N., and Gentile, C. (2001). Adap-
tive and self-confident on-line learning algorithms. Powell, W.-B. (2007). Approximate Dynamic Program-
Machine Learning Journal. ming. Wiley.
Banks, J. S. and Sundaram, R. K. (1992). Wang, Y. and Gelly, S. (2007). Modifications of UCT and
Denumerable-armed bandits. Econo- sequence-like simulations for Monte-Carlo Go. In
metrica, 60(5):1071–96. available at IEEE Symposium on Computational Intelligence and
http://ideas.repec.org/a/ecm/emetrp/v60y1992i5p1071- Games, Honolulu, Hawaii, pages 175–182.
96.html.

View publication stats

You might also like