SOLVING THE OSHI-ZUMO GAME
M. Buro
University of Alberta, Edmonton, AB T6G 2E8, Canada
[email protected], http://www.cs.ualberta.ca/~mburo/
Abstract
Kotani (2002) determined the part of the state space of the Japanese Oshi-Zumo
game in which pure strategies suffice to win. This paper completes the analysis
by computing and discussing a Nash-optimal mixed strategy for this game.
Keywords:
Nash-optimal strategy, Oshi-Zumo, two-player game
1. Introduction
In this article the Japanese game Oshi-Zumo is analyzed. Moves in this game
consist of simultaneous actions by two players who otherwise have complete
information about the current game state. In general, such games can be represented by a collection of payoff matrix pairs whose entries define the expected
amount paid to the players in case the respective action pair was chosen. It is
well known that not knowing the opponent's action already makes it necessary
to consider mixed strategies and that so-called Nash-optimal mixed strategies
exist for any matrix game (Nash, 1950). A simple example is the Rock-Paper-Scissors game in which Rock beats Scissors, Scissors beats Paper, and Paper
in turn beats Rock. The Nash-optimal strategy picks each of the actions with
probability 1/3.
In what follows, we first introduce the Oshi-Zumo game. It is more complex
than Rock-Paper-Scissors, but considerably simpler than other popular incomplete-information games such as Poker and Bridge. In fact, we will show how
to compute a Nash-optimal strategy within seconds on ordinary PC hardware.
We then highlight interesting properties of a Nash strategy and conclude the
paper by discussing how the optimal player performs against reasonable, but
sub-optimal strategies.
2. The Game
Oshi-Zumo, meaning "the pushing sumo (wrestler)", is played by two
players who both start off with N coins. At the beginning of a game, a sumo
H. J. Van Den Herik et al. (eds.), Advances in Computer Games
© Springer Science+Business Media New York 2004
[Figure 1 diagrams: the playing field with each player's coin count at either end]
[50, 4, *]-Oshi-Zumo starting position, code (50, 50, 0)
Position after move (4, 2), code (46, 48, 1)
Figure 1. Oshi-Zumo positions and their triple representation.
wrestler is positioned at the center of a one-dimensional playing field which
consists of 2K + 1 locations (see Figure 1). Moves are played by secretly
choosing a number of coins less than or equal to the amount currently available to
the respective player, but at least M. The bids are then revealed and the higher
bidder pushes the wrestler one location towards the opponent's side. If the bids
are equal, the wrestler does not move. Both bids are deducted and the game
proceeds until the money runs out or the wrestler is pushed off the playing
field. The final position of the wrestler determines the winner: if he is located
at the center, the game result is a draw. Otherwise, the player in whose half
the wrestler is located loses the game. We call this parameterized game an
[N, K, M]-Oshi-Zumo game. In this paper we only consider the minimal bids
M = 0 and M = 1 and declare a game over if both bids are 0. As before, the
winner is determined by the current wrestler position.
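The move rule above can be sketched as a small transition function. This is a minimal illustration, assuming the triple code (n, m, k) of Figure 1 with k counted from the center and positive k meaning the wrestler stands in the second player's half. The names `step` and `game_over` are hypothetical, and the rule that a player holding fewer than M coins may only bid 0 is an assumption, not something the text states.

```python
def step(state, bid1, bid2, K, M):
    """Apply one simultaneous move to state (n, m, k) and return the new state."""
    n, m, k = state
    for bid, coins in ((bid1, n), (bid2, m)):
        # Legal bids lie between the minimum bid M and the coins available.
        # Assumption: a player with fewer than M coins can only bid 0.
        legal = (M <= bid <= coins) or (bid == 0 and coins < M)
        if not legal:
            raise ValueError(f"illegal bid {bid} with {coins} coins")
    if bid1 > bid2:
        k += 1          # first player pushes the wrestler towards the opponent
    elif bid2 > bid1:
        k -= 1          # second player pushes the wrestler back
    # Both bids are deducted, whoever won the push.
    return (n - bid1, m - bid2, k)


def game_over(state, last_bids, K):
    """True if the wrestler left the field, the money ran out, or both bid 0."""
    n, m, k = state
    return abs(k) > K or last_bids == (0, 0) or (n == 0 and m == 0)


# Reproduces the example of Figure 1: move (4, 2) from the starting position.
print(step((50, 50, 0), 4, 2, K=4, M=0))   # (46, 48, 1)
```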
3. Computing a Nash-Optimal Strategy
Certain Oshi-Zumo positions possess pure winning strategies. For example,
all positions in which the opponent has no money left and the wrestler position is
sufficiently advanced can be won by simply bidding one coin for the remainder
of the game. Kotani (2002) determined all such positions for the standard
[50, 3, 1]-Oshi-Zumo game. The following list specifies some more interesting
[50, 4, 0]-positions that can be won by the first player with a pure strategy:
(n, n, 1): $1 \le n \le 11$ [bid 1]
(n, n + 1, 2): $1 \le n \le 12$ [bid 1]
(50, n, -4): $1 \le n \le 16$ [bid n]
(49, n, -4): $1 \le n \le 16$ [bid n]
All such positions can be computed by dynamic programming for small values
of N and K because the size of the state space is only a polynomial $(N + 1)^2 \times (2K + 3)$
in the parameters. First, we compute the payoffs $P_i$ for both players
at the boundary positions:
$P_1(0, 0, k) = -P_2(0, 0, k) = \mathrm{sign}(k)$, for $-K \le k \le K$
$P_1(n, m, \pm(K + 1)) = -P_2(n, m, \pm(K + 1)) = \pm 1$, for $0 \le n, m \le N$
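The boundary conditions and the state-space size translate directly into code. The following is a small sketch with hypothetical function names. It returns the first player's payoff only, since the second player receives the negated amount.

```python
def sign(k):
    """Return -1, 0 or 1 according to the sign of k."""
    return (k > 0) - (k < 0)


def boundary_payoff(n, m, k, K):
    """Payoff P1 at a boundary position (n, m, k), or None for interior positions."""
    if abs(k) == K + 1:          # wrestler pushed off the field
        return sign(k)
    if n == 0 and m == 0:        # no money left: the wrestler's side decides
        return sign(k)
    return None


def state_space_size(N, K):
    """(N + 1)^2 coin combinations times 2K + 3 wrestler locations,
    counting the two off-field locations -(K + 1) and K + 1."""
    return (N + 1) ** 2 * (2 * K + 3)


print(state_space_size(50, 3))   # 23409
```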
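LP for MAX:
\[
\text{maximize } Z \text{ such that}\quad
\text{for all } M \le j \le n_2:\; Z \le \sum_{i=M}^{n_1} A_{i,j}\, x_i, \qquad
\text{for all } M \le i \le n_1:\; x_i \ge 0, \quad\text{and}\quad \sum_{i=M}^{n_1} x_i = 1
\]
LP for MIN:
\[
\text{minimize } Z \text{ such that}\quad
\text{for all } M \le i \le n_1:\; Z \ge \sum_{j=M}^{n_2} A_{i,j}\, y_j, \qquad
\text{for all } M \le j \le n_2:\; y_j \ge 0, \quad\text{and}\quad \sum_{j=M}^{n_2} y_j = 1
\]
Figure 2. Linear programs (LPs) for determining Nash-optimal mixed strategies.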
Then we search for positions with pure winning or drawing strategies, or ones
that lose for sure no matter what. A position is won for player A if there exists
an action such that for all actions of the opponent the expected payoff for A is 1.
Declaring a position drawn or lost requires that all successor position values are
known. We repeat this process until we do not find any new position values.
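One way to phrase this test on a single position is as a check on its payoff matrix. The sketch below is an interpretation rather than the original program. It assumes `A[i][j]` holds the payoff for player A after action pair (i, j), or `None` while the successor value is still unknown, and it treats a position as having a pure value only if the matrix has a saddle point in pure strategies. The name `pure_result` is hypothetical.

```python
def pure_result(A):
    """Return the pure-strategy value of a position, or None if undecided."""
    # Won: some action yields payoff 1 against every opponent action.
    # Unknown entries (None) never compare equal to 1, so they block a win.
    if any(all(v == 1 for v in row) for row in A):
        return 1
    # Draws and losses can only be declared once all successors are known.
    if any(v is None for row in A for v in row):
        return None
    maximin = max(min(row) for row in A)            # best guaranteed payoff for A
    minimax = min(max(col) for col in zip(*A))      # best the opponent can hold A to
    return maximin if maximin == minimax else None  # None: randomization needed


print(pure_result([[1, 1], [None, -1]]))    # 1
print(pure_result([[0, -1], [-1, 0]]))      # None
```

Repeating this check over all positions until no new value appears gives the fixed-point iteration described above.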
A Nash-optimal strategy can be computed similarly. Starting again with
assigning values to the boundary positions, we iterate through all positions with
unknown expected payoff until we find one for which all successor values have
been established. At this time we make use of the fact that optimal strategies
$\{(i, x_i) \mid M \le i \le n_1\}$ and $\{(j, y_j) \mid M \le j \le n_2\}$ for players MAX and
MIN can be found by solving two linear programs (see Figure 2). MAX has
move choices $M, \ldots, n_1$ and MIN has actions $M, \ldots, n_2$. $x_i$ and $y_j$ denote the
respective action probabilities. Matrix element $A_{i,j}$ defines the payment for
MAX if action pair $(i, j)$ is chosen. Because Oshi-Zumo is a zero-sum game,
MIN receives the negated amount. $Z$ denotes the expected payoff for MAX.
This procedure eventually halts and computes the expected payoffs and mixed
strategies for all positions.
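The procedure can be illustrated end to end on small instances. The sketch below is not the original C++ program. It swaps in a compact exact simplex over Python's `Fraction` type (entries shifted to be positive, Bland's rule to avoid cycling) for the LP solver, and memoized recursion for the iterative sweep. All names are hypothetical, and the rule that a player with fewer than M coins bids 0 is an assumption.

```python
from fractions import Fraction
from functools import lru_cache


def solve_matrix_game(A):
    """Return (value, x, y) of the zero-sum matrix game A; the row player maximizes."""
    rows, cols = len(A), len(A[0])
    shift = 1 - min(min(r) for r in A)        # B = A + shift has entries >= 1
    # Tableau for: maximize sum(w) subject to B w <= 1, w >= 0.
    # Columns: w_0..w_{cols-1}, one slack per row, right-hand side.
    T = [[Fraction(A[i][j]) + shift for j in range(cols)]
         + [Fraction(int(i == r)) for r in range(rows)] + [Fraction(1)]
         for i in range(rows)]
    obj = [Fraction(-1)] * cols + [Fraction(0)] * (rows + 1)
    basis = [cols + i for i in range(rows)]
    while True:
        # Bland's rule: lowest-index column with negative reduced cost enters.
        enter = next((j for j in range(cols + rows) if obj[j] < 0), None)
        if enter is None:
            break
        # Ratio test; ties are broken by the lowest basis index.
        _, _, leave = min((T[i][-1] / T[i][enter], basis[i], i)
                          for i in range(rows) if T[i][enter] > 0)
        pivot = T[leave][enter]
        T[leave] = [v / pivot for v in T[leave]]
        for i in range(rows):
            if i != leave and T[i][enter] != 0:
                f = T[i][enter]
                T[i] = [a - f * b for a, b in zip(T[i], T[leave])]
        f = obj[enter]
        obj = [a - f * b for a, b in zip(obj, T[leave])]
        basis[leave] = enter
    total = obj[-1]                           # sum(w) = 1 / (value + shift)
    w = [Fraction(0)] * cols
    for i, b in enumerate(basis):
        if b < cols:
            w[b] = T[i][-1]
    x = [obj[cols + i] / total for i in range(rows)]   # dual solution: MAX's strategy
    y = [wj / total for wj in w]                       # primal solution: MIN's strategy
    return 1 / total - shift, x, y


def sign(k):
    return (k > 0) - (k < 0)


def make_solver(K, M):
    """Return value(n, m, k), the expected payoff for MAX in an [N, K, M] game."""
    @lru_cache(maxsize=None)
    def value(n, m, k):
        if abs(k) > K:                             # wrestler pushed off the field
            return Fraction(sign(k))
        bids1 = list(range(M, n + 1)) or [0]       # assumption: too few coins -> bid 0
        bids2 = list(range(M, m + 1)) or [0]
        A = [[Fraction(sign(k)) if i == 0 and j == 0      # both bid 0: game over
              else value(n - i, m - j, k + sign(i - j))
              for j in bids2] for i in bids1]
        return solve_matrix_game(A)[0]
    return value


print(make_solver(K=3, M=0)(5, 2, -3))   # -1/2
print(make_solver(K=3, M=1)(6, 3, -3))   # -1/2
```

Both printed values agree with the values reported in Section 5 for the smallest positions that require randomization. Exact rational arithmetic keeps the sketch short but is slow, so it is meant for small positions only.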
We decided not only to create a table containing expected payoffs (which
would be sufficient for computing values for all positions) but also to store
the move distributions to speed up later game play and move analyses. Only
one distribution needs to be computed and stored for each position because the
move distribution for the second player in position (n, m, k) is identical to that
of the first player at (m, n, -k).
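The symmetry means that a lookup for the second player only needs to mirror the position. A minimal sketch, assuming a hypothetical dictionary `table` that maps a position to the first player's bid distribution, filled here with the M = 0 distributions the paper reports for position (5, 2, -3):

```python
def first_player_distribution(table, n, m, k):
    """Bid distribution of the first player at (n, m, k)."""
    return table[(n, m, k)]


def second_player_distribution(table, n, m, k):
    """Bid distribution of the second player at (n, m, k).

    Only first-player distributions are stored, so swap the coin counts
    and negate the wrestler position before looking up."""
    return table[(m, n, -k)]


# Hypothetical table excerpt: {position: {bid: probability}}.
table = {
    (5, 2, -3): {1: 0.5, 2: 0.5},   # first player's bids at (5, 2, -3)
    (2, 5, 3): {0: 0.5, 2: 0.5},    # stored once, serves the second player at (5, 2, -3)
}

print(second_player_distribution(table, 5, 2, -3))   # {0: 0.5, 2: 0.5}
```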
4. Implementation Issues
In our first implementation we adopted Michel Berkelaar's open-source software package LPSOLVE. Unfortunately, the solver ran into numerical problems
which caused it to either give up on instances or report incorrect solutions. Implementing efficient LP solvers is by no means easy. In order to overcome the
numerical problems we decided to replace floating-point by rational arithmetic
in LPSOLVE, which turned out to be more complicated than expected. Finally,
we took the simpler LP solver code from Press et al. (1992) and combined it
with GMP (the GNU arbitrary precision arithmetic library) by replacing the
float/double data types by GMP's rational number C++ class. Solving LPs
using rational arithmetic takes much longer than using floating-point values,
even if the denominators are bounded. In order to speed up the Oshi-Zumo
solver we therefore implemented a two-phase approach: whenever the fast LP
solver reported problems or produced inconsistent results, we would start the
slow solver based on rational arithmetic. We bounded denominators by 10^8 and
normalized rational numbers whenever this limit was exceeded. Test runs on
Oshi-Zumo games manageable by the floating-point based solver indicated that
the results obtained by rational arithmetic only differed by a negligible amount.
On a notebook PC with a 1-GHz Pentium-III CPU, solving the standard [50, 3, 1]
game takes just 12 seconds. The C++ source code can be downloaded from
http://www.cs.ualberta.ca/~mburo/sumo.tgz.
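The two ideas in this section, bounding denominators and falling back from a fast solver to an exact one, can be sketched in a few lines. This is an illustration in Python rather than the GMP-based C++ code. `Fraction.limit_denominator` stands in for the normalization step, and `fast_solver`, `exact_solver` and `is_consistent` are hypothetical callables supplied by the caller.

```python
from fractions import Fraction

LIMIT = 10 ** 8   # denominator bound mentioned in the text


def normalize(q, limit=LIMIT):
    """Replace q by the closest fraction with denominator <= limit, if needed."""
    return q.limit_denominator(limit) if q.denominator > limit else q


def solve_two_phase(problem, fast_solver, exact_solver, is_consistent):
    """Try the fast solver first; fall back to the exact one on any trouble."""
    try:
        result = fast_solver(problem)
        if is_consistent(problem, result):
            return result
    except ArithmeticError:      # e.g. division by zero or overflow in the fast path
        pass
    return exact_solver(problem)


q = Fraction(1, 3) + Fraction(1, 10 ** 12)   # denominator far above the bound
print(normalize(q))                          # 1/3
```

Rounding to a bounded denominator trades exactness for speed, which is consistent with the remark that the bounded results differed only negligibly from the floating-point ones.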
5. A Nash-Optimal Oshi-Zumo Strategy
In what follows we concentrate on the [50, 3, 0] and [50, 3, 1] versions of
the game and highlight interesting properties of their respective Nash-optimal
strategies. We start by looking at the move distributions for the starting position:
M = 0   position = (50, 50, 0)   value = 0.0
bids:    0    1    2    3    4    5    6    7    8    9   10
prob: .083 .077 .088 .083 .092 .088 .097 .092 .099 .094 .101

M = 1   position = (50, 50, 0)   value = 0.0
bids:    1    2    3    4    5    6    7    8    9
prob: .139 .053 .146 .060 .152 .067 .156 .068 .156
Apparent is an "odd-even" effect in which higher and lower bid probabilities
alternate. This probability pattern occurs in many positions. Why it occurs is
an open question.
The smallest positions with randomization requirement are (5, 2, -3) for
M = 0 and (6, 3, -3) for M = 1. The move distributions are as follows:

M = 0   position = (5, 2, -3)   value1 = -0.5
bid1:  1   2
prob: .5  .5
bid2:  0   2
prob: .5  .5

M = 1   position = (6, 3, -3)   value1 = -0.5
bid1:  1   3
prob: .5  .5
bid2:  1   3
prob: .5  .5
In 5,271 cases of the 23,409 possible [50, 3, 0]-positions, and in 4,057 cases for
M = 1, more than one move has to be considered. To illustrate how complex
the move decision can be, we present two positions with a high number of holes
in the move distribution:
M = 0   position = (17, 34, 3)   value1 = 0.047
bid1:    0    2    3    4    5    6    7    8   10   12   14   17
prob: .482 .015 .008 .017 .016 .020 .022 .026 .069 .087 .107 .126
bid2:    1    2    3    4    5    6    7   17
prob: .066 .065 .086 .080 .082 .082 .058 .476

M = 1   position = (20, 32, 3)   value1 = 0.333
bid1:    1    2    3    5    6    7    9   11   12   14   17   18   20
prob: .369 .004 .038 .050 .007 .038 .125 .069 .048 .039 .080 .030 .096
bid2:    2    3    4    5    6    9   10   11   12   20
prob: .136 .021 .079 .029 .059 .016 .188 .091 .043 .333
Given such complex distributions, the question arises how well human players can play Oshi-Zumo.
6. How Good is Optimal?
Playing any mixed or pure strategy against a Nash-optimal player results
in an expected payoff no better than the expected value E of a game between
two Nash players. In contrast, the expected value of any pure strategy that
picks actions from the set an optimal strategy considers, is exactly E when
playing against the Nash-optimal player. This follows from the fact that all
actions with non-zero probability have the same expected value. Therefore,
the Nash-optimal solution is far from optimal with respect to exploiting simple
(pure) strategies, such as playing Rock all the time in a sequence of Rock-Paper-Scissors games. In Rock-Paper-Scissors the Nash strategy cannot win anything
against any other strategy in the long run. However, in more complex games, such as Oshi-Zumo or Poker, it can, because not all actions have non-zero
probability in all situations.
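A short calculation makes both observations concrete. The sketch below computes the expected payoff of every pure action against a fixed mixed strategy. The Rock-Paper-Scissors matrix is standard. The 6 x 3 matrix for the M = 0 position (5, 2, -3) is an assumption: it was worked out by hand from the rules of Section 2 and is not given in the paper, so treat it as an illustration. The mixed strategies are the ones reported in Section 5, and the function names are hypothetical.

```python
from fractions import Fraction


def payoffs_of_rows(A, y):
    """Expected payoff for MAX of each pure row action against MIN's mix y."""
    return [sum(a * p for a, p in zip(row, y)) for row in A]


def payoffs_of_cols(A, x):
    """Expected payoff for MAX under MAX's mix x against each pure column action."""
    return [sum(a * p for a, p in zip(col, x)) for col in zip(*A)]


half, third = Fraction(1, 2), Fraction(1, 3)

# Rows and columns: Rock, Paper, Scissors; entries are payoffs for the row player.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
print(payoffs_of_rows(RPS, [third] * 3))   # all 0: nothing is gained or lost

# Hand-derived payoffs at (5, 2, -3), M = 0: rows are bids 0..5 of the first
# player, columns are bids 0..2 of the second player (assumption, see above).
A = [[-1, -1, -1], [0, 0, -1], [-1, 0, 0], [-1, -1, 0], [-1, -1, -1], [-1, -1, -1]]
y = [half, 0, half]             # second player bids 0 or 2
x = [0, half, half, 0, 0, 0]    # first player bids 1 or 2

print(payoffs_of_rows(A, y))    # bids 1 and 2 earn E = -1/2, bid 0 earns -1
print(payoffs_of_cols(A, x))    # second player's bid 1 lets MAX reach 0 instead of -1/2
```

In this position a second player who bids 1, an action outside the support of the Nash strategy, gives away half a unit in expectation, which is the kind of gain that is impossible in Rock-Paper-Scissors.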
A player who just memorizes one move from a Nash-optimal strategy for
each position does not lose money against a Nash player in the long run. How
much does a player lose who occasionally plays moves not played by a Nash player, and how well do simple hand-crafted strategies play? To answer these
questions we wrote a program that played a large number of games between a
Nash-optimal strategy and several simple move selection algorithms. Figure 3
presents the tournament results. As expected, the completely random player
loses almost every game. The player that randomly chooses bids in the interval
formed by the minimal and maximum Nash bid performs much better and loses
only about 0.035 units per game for M = 0 and 0.01 for M = 1. Simply
choosing moves in a small fixed interval also leads to good results and shows
how easy it is to look good against a Nash player. Also some fairly simple pure
strategies perform surprisingly well.
A more interesting question is therefore how to adapt to players and exploit
their weaknesses while minimizing the risk of being exploited. We think that
M = 0
Strategy                            Average payoff
random 0..#                         -.97882
random Nash range                   -.035
random 1..min(6,#)                  -.31884
random 1..min(5,#)                  -.16971
random 1..min(4,#)                  -.05115
random 1..min(3,#)                  -.00292
random 1..min(2,#)                  +.000645
1                                   -.002765
if # ≥ 2: 2 else 1                  -.00156

M = 1
Strategy                            Average payoff
random 1..#                         -.98216
random Nash range                   -.0105
random min(2,#)..min(6,#)           -.3533
random min(2,#)..min(5,#)           -.21524
random min(2,#)..min(4,#)           -.05683
random min(2,#)..min(3,#)           -.00372
if # ≥ 2: 2 else 1                  +.00039
if # ≥ 3: 3 elif # ≥ 2: 2 else 1    -.02987

Figure 3. The average payoff of various simple move-selection algorithms playing 200,000
[50, 3, M]-games against a Nash-optimal strategy. # denotes the current number of coins left for
the heuristic player.
using games simpler than, say, Poker but harder than Rock-Paper-Scissors as
test domains can shed light on this interesting problem, which appears to be
the last remaining hurdle on the way to Poker programs stronger than human
players (Billings et al., 2003). Oshi-Zumo is a suitable candidate because its
Nash-optimal strategy is non-trivial, but can be computed quickly.
Acknowledgement
Thanks go to Darse Billings for helpful discussions clarifying questions on
Nash-optimal strategies.
References
Billings, D., Burch, N., Davidson, A., Holte, R., Schaeffer, J., Schauenberg, T., and Szafron, D.
(2003). Approximating Game-theoretic Optimal Strategies for Full-scale Poker. Proceedings
of the International Joint Conference on Artificial Intelligence. To appear.
Kotani, Y. (2002). Analysis of the Pushing Sumo Game and its Application to Creativity Education. IPSJ SIG-Notes Game Informatics, Vol. 8.
Nash, J.F. (1950). Equilibrium Points in n-Person Games. Proceedings of the National Academy of Sciences,
Vol. 36, pp. 48-49.
Press, W.H., Teukolsky, S.A., Vetterling, W.T., and Flannery, B.P. (1992). Numerical Recipes in
C. Cambridge University Press. 2nd edition.