Y. Kalemkarian
High Performance Computing, Bull, Grenoble, France [email protected]
Abstract: Since their impressive successes in various areas, recent techniques like UCT and other Monte-Carlo planning variants (Kocsis and Szepesvari, 2006a) have been extensively studied (Coquelin and Munos, 2007; Wang and Gelly, 2007). We here propose and compare various forms of large-scale parallelization of bandit-based tree-search, in particular for our computer-Go algorithm XYZ.
Bandit problems have been widely studied (Banks and Sundaram, 1992; Agrawal, 1995; Dani and Hayes, 2006; Berry et al., 1997; Hussain et al., 2006; Kocsis and Szepesvari, 2005; Coquelin and Munos, 2007). A bandit is said to have side information if some information other than the reward is available to the algorithm. The performance of a bandit algorithm over T time steps is measured by its regret

$$\text{Regret} = T \sup_{a \in A} p_a - \sum_{t=1}^{T} r_t,$$

where A is the set of arms, p_a the expected reward of arm a, and r_t the reward received at time step t. We present various bandit algorithms in section 1; these bandit algorithms usually work as follows at each time step:

- compute a score for each possible arm, consisting of (i) an exploitation term, larger for arms for which the average reward in the past is good, and (ii) an exploration term, larger for arms which have not been tried often;
- choose the arm with maximum score.

Typical improvements are: (i) heuristically give a score to arms which have not been visited yet (see e.g. (Wang and Gelly, 2007; Coulom, 2007)); (ii) guess some side information (see e.g. the AMAF heuristic (Gelly and Silver, 2007)); (iii) restrict the sampling to the first k(t) nodes, for some non-decreasing mapping k(t) (see progressive widening in (Coulom, 2007)).

In the case of Go, and more generally of bandit-based Monte-Carlo planning, we use one bandit at each node of the tree, as explained in section 1. The difficult elements are the anytime nature of the problem, its non-stationarity (Hussain et al., 2006; Kocsis and Szepesvari, 2006b), and its large number of arms (Banks and Sundaram, 1992). The side information (usually termed AMAF) is detailed below.

This section provides an overview of bandit-based Monte-Carlo planning algorithms. It is presented in the case of a game; the experiments are performed on the game of Go. The overall method is summarized in Algorithm 1 and Figure 1.

Figure 1: Flow chart of UCT. The bandit algorithm is applied in the simulation part in the tree.

Algorithm 1 Monte-Carlo planning algorithm.
  Initialize T to the root, representing the current state. T is a tree of positions, with each node equipped with statistics.
  while Time left > 0 do
    Simulate one game until a leaf (a position) L of T (thanks to bandit algorithms). Then, choose one son (a successor) L′ of L.
    Simulate one game from position L′ until the game is over.
    Add L′ as a son of L in T.
    Update statistics in all the tree. In UCT, each node knows how many winning simulations (from this node) have been performed and how many simulations (from this node) have been performed. In other forms of tree-search, other information is necessary (heuristic information in AMAF, total number of nodes in the tree in BAST (Coquelin and Munos, 2007)).
  end while
  Pick up the decision which has been simulated most often from the root.

The last line chooses the decision with the conservative rule that the most simulated decision should be chosen. Other solutions, such as taking the decision with the higher ratio "number of winning simulations divided by number of simulations", are too dangerous, because a ratio based on a small number of simulations can be high while the decision is in fact very poor. Some other robust and elegant techniques include the use of lower confidence bounds. Here, rewards are binary (win or loss), but arbitrary distributions of rewards can be used.
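As a concrete illustration, here is a minimal sequential sketch of Algorithm 1 in Python. The Game interface (legal_moves, play, random_playout_result) is hypothetical, all sons of a leaf are created at once (a common simplification), and the alternation of the reward between the two players is omitted for brevity; this is not the authors' code.

import math
import random
import time

class Node:
    """One position in the tree T, with UCT statistics."""
    def __init__(self):
        self.wins = 0        # winning simulations through this node
        self.sims = 0        # simulations through this node
        self.children = {}   # move -> Node

def ucb1(parent, child):
    # Exploitation + exploration score (the UCB1 formula of Algorithm 2).
    if child.sims == 0:
        return float("inf")  # unvisited sons are tried first
    return child.wins / child.sims + math.sqrt(
        2.0 * math.log(parent.sims) / child.sims)

def uct(root_state, game, budget_seconds):
    root = Node()
    deadline = time.time() + budget_seconds
    while time.time() < deadline:                   # "while Time left > 0"
        node, state, path = root, root_state, [root]
        # In-tree part: descend with the bandit until a leaf L of T.
        while node.children:
            move = max(node.children,
                       key=lambda m: ucb1(node, node.children[m]))
            state = game.play(state, move)
            node = node.children[move]
            path.append(node)
        # Expansion: create the sons of L and pick one son L'.
        for m in game.legal_moves(state):
            node.children[m] = Node()
        if node.children:
            move = random.choice(list(node.children))
            state = game.play(state, move)
            path.append(node.children[move])
        # Monte-Carlo part: play from L' until the game is over.
        reward = game.random_playout_result(state)  # 1 = win, 0 = loss
        # Update statistics in all the (visited part of the) tree.
        for n in path:
            n.sims += 1
            n.wins += reward
    # Conservative final rule: the most simulated decision from the root.
    return max(root.children, key=lambda m: root.children[m].sims)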
The main point in Algorithm 1 is that the tree is unbalanced, with a strong bias in favor of important parts of the tree. The function used for taking decisions out of the tree (i.e. the so-called Monte-Carlo part) is defined in (Wang and Gelly, 2007). The function used for simulating in the tree is presented in Algorithm 2. This function is a main part of the code: it decides in which direction the tree should be extended.
There are various formulas, all of them based on the idea of a compromise between exploitation (further simulating the seemingly good moves) and exploration (further simulating the moves which have not been explored a lot).
Algorithm 2 Taking a decision in the tree. The total number of simulations at situation s is $Simulations(s) = \sum_d Simulations(s, d)$. We present here various formulas for computing the score (see (Lai and Robbins, 1985; Auer et al., 2001; Gelly and Silver, 2007) for UCB1, UCB-Tuned and AMAF respectively); other very important variants (for infinite domains, larger numbers of arms, or specific assumptions) can be found in (Banks and Sundaram, 1992; Agrawal, 1995; Dani and Hayes, 2006; Coquelin and Munos, 2007).
  Function decision = Bandit(situation s in the tree).
  for d in the set of possible decisions do
    Let $\hat p(d) = Wins(s, d)/Simulations(s, d)$.
    SWITCH (bandit formula):
    - UCB1: compute $score(d) = \hat p(d) + \sqrt{2 \log(Simulations(s)) / Simulations(s, d)}$.
    - UCB-Tuned.1: compute $score(d) = \hat p(d) + \sqrt{\hat V \log(Simulations(s)) / Simulations(s, d)}$ with $\hat V = \max(0.001, \hat p(d)(1 - \hat p(d)))$.
    - UCB-Tuned.2: compute $score(d) = \hat p(d) + \sqrt{\hat V \log(Simulations(s)) / Simulations(s, d)} + \log(Simulations(s)) / Simulations(s, d)$ with $\hat V = \max(0.001, \hat p(d)(1 - \hat p(d)))$.
    - AMAF-guided exploration: compute $score(d) = \alpha(d) \hat p(d) + (1 - \alpha(d)) \tilde p(d)$ with:
      - $\tilde p(d)$ the AMAF-estimate of the asymptotic value of score(d);
      - $\alpha(d)$ a coefficient depending on Simulations(s, d) and the AMAF-confidence in $\tilde p(d)$ (see (Gelly and Silver, 2007)).
    END SWITCH
  end for
  Return $\arg\max_d score(d)$.
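For illustration, the score formulas of Algorithm 2 can be written as follows. The counters wins and sims and the AMAF estimate amaf are assumed to be maintained by the caller, and the alpha(d) schedule is a placeholder, not the exact coefficient of (Gelly and Silver, 2007).

import math

def score(d, wins, sims, total_sims, formula="UCB1", amaf=None):
    # Decisions with sims[d] == 0 are assumed to be handled beforehand
    # (e.g. each decision is tried once before this formula is used).
    p = wins[d] / sims[d]                  # \hat p(d)
    log_t = math.log(total_sims)           # log(Simulations(s))
    if formula == "UCB1":
        return p + math.sqrt(2.0 * log_t / sims[d])
    if formula == "UCB-Tuned.1":
        v = max(0.001, p * (1.0 - p))      # \hat V
        return p + math.sqrt(v * log_t / sims[d])
    if formula == "UCB-Tuned.2":
        v = max(0.001, p * (1.0 - p))
        return p + math.sqrt(v * log_t / sims[d]) + log_t / sims[d]
    if formula == "AMAF":
        # alpha(d) -> 1 as sims[d] grows: trust the AMAF estimate first,
        # the real statistics later. The constant 50 is an arbitrary choice.
        alpha = sims[d] / (sims[d] + 50.0)
        return alpha * p + (1.0 - alpha) * amaf[d]
    raise ValueError(formula)

The decision returned by the bandit is then the arg max of score(d) over the possible decisions d.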
The empirically best tool in Algorithm 2 is the AMAF-guided exploration. In that case, in the classical formalism of bandits, each time one arm is played, we get: (i) a positive or null reward for the chosen arm (0 or 1); (ii) a list of arms, consisting roughly in half of the arms, positively correlated with the list of arms for which the rewards would have been positive if these arms had been played. The bandit algorithm for this bandit problem has been empirically derived in (Gelly and Silver, 2007). To the best of our knowledge, there is no mathematically grounded bandit algorithm in that case. An important improvement of Algorithm 2, termed progressive widening (Coulom, 2007), consists in considering only the K(n) "best" moves (the best according to some heuristic) at the nth simulation in a given node, where K(n) is a non-decreasing mapping from N to N.
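For instance, assuming that K(n) grows like a power of n (an illustrative choice; the text does not specify the mapping used in (Coulom, 2007)), restricting the bandit to the K(n) heuristically best moves looks as follows.

def widened_moves(moves_sorted_by_heuristic, n):
    # K(n): a non-decreasing mapping from N to N; the exponent 1/4 is an
    # illustrative choice, not necessarily the one of (Coulom, 2007).
    k = max(1, int(n ** 0.25))
    return moves_sorted_by_heuristic[:k]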
A more detailed view of the overall planning loop, including these refinements, is given in Algorithm 3.

Algorithm 3 More detailed Monte-Carlo planning algorithm.
  Initialize T to the root, representing the current state. T is a tree of positions.
  while Time left > 0 do
    Simulate one game until a leaf (a position) L of T (thanks to bandit algorithms applied until a leaf is met, see Algorithm 2).
    Choose one son (a successor) L′ of L, possibly with some offline learning (Gelly and Silver, 2007).
    Simulate one game from position L′ until the game is over.
    Add L′ as a son of L in T.
    Update statistics in all the tree. In UCT, each node knows how many winning simulations (from this node) have been performed and how many simulations (from this node) have been performed. In other forms of tree-search, other information is necessary (heuristic information in AMAF, total number of nodes in the tree in BAST (Coquelin and Munos, 2007)).
  end while
  Pick up the decision which has been simulated most often from the root.

2 MULTI-CORE PARALLELIZATION

The multi-core parallelization is intuitively the most natural one: the memory is shared. We just have to distribute the loop on various threads (each thread performs simulations independently of the other threads, with just mutexes protecting the updates in memory), leading to Algorithm 4.

Algorithm 4 Multi-core Monte-Carlo planning algorithm.
  Initialize T to the root, representing the current state. T is a tree of positions.
  For each thread simultaneously:
  while Time left > 0 do
    Simulate one game until a node (= position) L of T (thanks to bandit algorithms); put this simulation S in memory. Choose a successor L′ of L.
    Simulate one game from position L′ until the game is over; put this simulation S′ in memory.
    Add L′ as a son of L in T.
    Update statistics in all the tree thanks to S and S′.
  end while
  Pick up the decision which has been simulated most often.
Nb threads \ total budget   10 sec·procs   20 sec·procs   40 sec·procs
1 thread                    51.1 ± 1.8     62.0 ± 2.2     74.4 ± 2.4
2 threads                                  62.9 ± 1.8
4 threads                                                 66.4 ± 2.1

Table 1: Success rate against mogoRelease3, for various computation times; the column label is the total budget "number of threads × computation time per move", in seconds × processors. Under the two assumptions that (1) N cores yield almost N times more simulations per second and (2) the "delay" effect pointed out in the text has no impact, the speed-up would be linear and all the entries of a given column would be equal. We see a decay of performance for 4 threads.
Consider N the number of threads. The number of simulations per second is typically almost multiplied by N. However, this algorithm is not equivalent to the sequential one: possibly, N − 1 simulations are still running when one more simulation is launched, and the updates of T corresponding to these N − 1 simulations are not yet taken into account. There is an (N − 1)-simulation delay, and the analysis of this delay is not straightforward; we quantify its effect experimentally. The empirical results are reported in Table 1. Table 1 confirms the roughly 63% success rate known when doubling the computational power. As a conclusion, in 9x9 Go, the speed-up of the multi-core algorithm 4 is linear for two cores and slightly below linear for four cores (by interpolation, we can estimate the speed-up as 3 for 4 cores).
3 CLUSTER PARALLELIZATION

First, let's consider the generalization of the multi-core approach to a cluster, i.e. a version with massive communication in order to keep roughly the same state of the memory on all nodes. As the memory is not shared here, we have to broadcast many updates on the network; each simulation on one node leads to one broadcast. Possibly, we can group communications in order to reduce latencies. This leads to Algorithm 5, where updates are sent every T seconds; T = 0 is perhaps possible for high-performance clusters or for processors inside the same machine.

Algorithm 5 Cluster algorithm for Monte-Carlo planning. As there are many paths leading to the same node, we must use a hash table, so that from some key (describing a goban) we can find in constant time whether a node is in the tree and what its statistics are.
  for each computer do
    Initialize T to the root, representing the current state. T is a tree of positions, with statistics attached to each node.
  end for
  for each computer simultaneously do
    for each thread simultaneously do
      while Time left > 0 do
        Simulate one game until a node (= position) L of T (thanks to bandit algorithms); put this simulation S in memory. Choose a successor L′ of L.
        Simulate one game from position L′ until the game is over; put this simulation S′ in memory. S, S′ is a complete game starting at the root.
        Add L′ as a son of L in T, and update all the statistics in T with S, S′.
        Add (L, L′, S, S′) to a stack of to-be-sent simulations.
        if time − t0 ≥ T and thread = first thread then
          Set t0 = time.
          Send all the (L, L′, S, S′) in the stack to all other computers.
          Reset the stack.
          Receive many (L, L′, S, S′) from all other computers.
          for each (L, L′, S, S′) received do
            Add L′ as a son of L in T (if not present).
            Update all the statistics in T with S, S′.
          end for
        end if
      end while
    end for
  end for
  Pick up the decision which has been simulated most often.
Let's consider the idealized case of a cluster with negligible communication cost and an infinite number of nodes. Let's assume that a proportion α of the time is spent in the updates. Also, let's assume that the delay of updates does not reduce the overall efficiency of the algorithm. What is the speed-up in that case? Consider M the number of simulations per second on one node in the mono-node case. With N nodes, at each time step we get NM simulations; the number of updates is therefore NM per second of simulation. For each group of M simulations (i.e. one second of work in the mono-node case): (i) each node performs M simulations (cost 1 − α seconds); (ii) each node sends M updates (cost 0 seconds); (iii) each node receives (N − 1)M updates (cost 0 seconds); (iv) each node updates its tree with these (N − 1)M updates (cost α(N − 1) seconds). If we divide by the number N of nodes and let N → ∞, we get (i) a cost (1 − α)/N → 0 for the simulations; (ii) a cost 0 for sending updates; (iii) a cost 0 for receiving updates; (iv) a cost α(N − 1)/N → α for the updates. This implies that the main cost is the update cost, and that asymptotically the speed-up is 1/α. In the case of MoGo, α ≃ 0.05, and therefore the maximal speed-up is roughly 20 for a tree simultaneously updated on all nodes. As communications are far from negligible, as preliminary experiments were disappointing, and as we expect better than a 20-fold speed-up, we do not keep this algorithm in the sequel.
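Following this accounting, and counting a node's own updates (cost α) so that one node has speed-up exactly 1, the idealized speed-up with N nodes is N / ((1 − α) + αN); this formula is derived from the costs above, not stated explicitly in the text. A quick numerical check:

def speedup(n_nodes, alpha=0.05):
    # Per group of M simulations a node pays (1 - alpha) for simulating
    # plus alpha * n_nodes for all updates, so the speed-up tends to 1/alpha.
    return n_nodes / ((1.0 - alpha) + alpha * n_nodes)

for n in (1, 4, 16, 64, 1024):
    print(n, round(speedup(n), 2))   # tends to 1/alpha = 20 for alpha = 0.05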
4 AN ALTERNATE SOLUTION WITH LESS COMMUNICATIONS

Section 3 showed that even when communications are perfect, the speed-up is limited to some constant 1/α, roughly 20 in MoGo. We propose the following algorithm (Algorithm 6), with the following advantages: (1) much less communication (it can run on usual Ethernet); (2) tolerance to inhomogeneous nodes (as for the other algorithms above); (3) our implementation is not yet fault-tolerant, but it could be made so; (4) self-assessment is possible (strong variance → more time). The algorithm is detailed in Algorithm 6, and its empirical behavior is summarized in Table 2.

Number N of nodes   T0 (time between updates)   Success rate   Estimated speed-up divided by N
3                   1.00s                       67.2 ± 3.5     0.95 (9x9)
3                   0.11s                       67.3 ± 2.1     0.94 (9x9)
4                   0.33s                       69.6 ± 1.5     0.81 (9x9)
9                   0.33s                       79.0 ± 1.0     (9x9)
9                   0.11s                       83.8 ± 5.4     (19x19)

Table 2: Success rate and estimated speed-up of Algorithm 6 for various numbers of nodes and update periods T0 (board size in parentheses).

We have no estimate for the 9-machine speed-up, because comparing a computational budget B on 9 machines with a budget 9B on 1 machine implies playing games with an average time per move of 5B, which requires a lot of computational power. However, the 83.8% success rate is in line with the extrapolation of a linear speed-up. The results were not significantly different with higher numbers of levels. We guess that for larger numbers of nodes the results will be different, but we have not yet any empirical evidence of this.
Algorithm 6 Algorithm for Monte-Carlo planning on a cluster.
  Initialize T to the root, representing the current state. T is a tree of positions. T is the same on all computers.
  for each computer simultaneously do
    for each thread simultaneously do
      while Time left > 0 do
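Only the beginning of Algorithm 6 is reproduced above. As a rough illustration consistent with this section (periodic exchanges every T0 seconds, sharing restricted to a few levels of the tree, cf. Table 2 and the remark on numbers of levels), the computers might periodically merge the statistics of the first levels of their trees as sketched below; the merging rule, the communication primitives and the level cut-off are assumptions, not the authors' exact algorithm.

def nodes_up_to_level(root, levels):
    # Collect the nodes of the first `levels` levels of the tree.
    frontier, out = [root], []
    for _ in range(levels):
        out.extend(frontier)
        frontier = [c for n in frontier for c in n.children.values()]
    return out

def share_statistics(root, levels, broadcast, receive_all, lookup):
    """broadcast, receive_all and lookup (hash-table access, cf. Section 3)
    are placeholders for the cluster communication and storage layer."""
    local = {n.key: (n.wins, n.sims) for n in nodes_up_to_level(root, levels)}
    broadcast(local)                   # called every T0 seconds
    for remote in receive_all():       # one statistics dict per other computer
        for key, (w, s) in remote.items():
            node = lookup(key)
            if node is not None:
                node.wins += w         # merge by summing counters
                node.sims += s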