Ipsj GPWS2018022

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

The 23rd Game Programming Workshop 2018

Counterfactual Regret Minimization for the


Board Game Geister

Chen Chen1,a) Tomoyuki Kaneko1,b)

Abstract: Counterfactual Regret Minimization (CFR) is an effective method to compute approximate


Nash Equilibria for large zero-sum, imperfect information games. In this paper, we applied CFR al-
gorithm to a simplified version of the board game Geister to generate a strategy. We also observed
the changes in efficiency and convergence of different CFR variants.

Keywords: Counterfactual Regret Minimization, CFR, board game, Geister

is a finite set of all terminal histories.


1. Introduction
A(h) denotes available actions for a non-
Counterfactual Regret Minimization (CFR) terminal history h ⊆ H. h0 is a prefix of
is an effective method to compute Nash Equi- history h, which means that h begins with
libria for large zero-sum extensive games with h0 .
imperfect information. It is proved that CFR • A finite set Ii of information sets I for
algorithm is effective in solving many kinds player i, which is called an information
of games, such as card game Poker [1] and partition. An information set I is a set
dice-bidding game Bluff [2]. In the study [3] of histories that player cannot distinguish
by Bowling et al., the game of heads-up limit one from another, which means that if
hold’em is weakly solved using a variant of h, h0 ⊆ I, A(h) = A(h0 ). A(I) denotes
CFR called CFR+ . While CFR has made a available actions at information set I.
great contribution to card games, there are few • A strategy profile σ that consists of a strat-
researches on applying CFR to board games. egy σi for each player i. Σi is a set of all
In this paper, we applied counterfactual re- strategies for player i. σI→a represents a
gret minimization to a simplified version of strategy profile identical to σ except that
the board game Geister to observe the per- action a is always chosen at information
formance of CFR algorithm. We also observed set I.
variance of efficiency and convergence between • A utility function u that maps terminal
using different CFR variants. history to players’ utilities. ui (z) is player
2. Background i’s utility on terminal history z ∈ Z.
• π σ (I) stands for the probability of reaching
2.1 Extensive Games information set I if players choose actions
A finite extensive game with imperfect in- according to σ. πiσ (I) is player i’s contri-
formation is consisting of the following com- bution to the probability.
ponents:
• A finite set N of players, and a chance 2.2 Nash Equilibrium
player c. For player i, −i stands for all Nash equilibrium is a solution concept of an
other players. extensive game, where if each player knows
• A finite set H of possible histories h, which the equilibrium strategies of other players, no
is possible sequences of actions a. Z ⊆ H player is able to gain more reward by changing
his strategy alone.
1
Graduate School of Interdisciplinary Information Studies,
The University of Tokyo
For a two-player extensive game, a Nash
a)
chenchen-319@g.ecc.u-tokyo.ac.jp equilibrium can be represented as a strategy
b)
kaneko@graco.c.u-tokyo.ac.jp

© 2018 Information Processing Society of Japan - 137 -


The 23rd Game Programming Workshop 2018

profile σ that satisfies as


T 1
max RT (I, a)

u1 (σ) ≥ max u1 (σ10 , σ2 ) Ri,imm (I) = (6)
T a∈A(I) i

0 σ1 ∈Σ1
(1)
u2 (σ) ≥ max

0
u2 (σ1 , σ20 ) where the counterfactual regret at time t is
σ2 ∈Σ2
rit (I, a) = vi (σI→a
t
, I) − vi (σ t , I) (7)
-Nash equilibrium is an approximation of a
Nash equilibrium, which is a strategy profile σ and the cumulative counterfactual regret is
that satisfies T
X
 RiT = rit (I, a) (8)
u1 (σ) +  ≥ max
 u1 (σ10 , σ2 ) t=1
σ10 ∈Σ1
(2)
u2 (σ) +  ≥ max
 u2 (σ1 , σ20 ) Now, with regret matching, the strategy for
0 σ2 ∈Σ2 time T + 1 is calculated according to the fol-
3. Former Research lowing formulas
RiT,+ (I,a)

3.1 CFR
 Pa∈A(I) RiT,+ (I,a) ,


Counterfactual Regret Minimization was σiT +1 (I)(a) = if a∈A(I) RiT,+ (I, a) > 0
P
first developed by Zinkevich et al. to approx- 
 1

imate a Nash equilibrium for very large in- |A(I)| , otherwise,
stances of incomplete information extensive (9)
games [1]. where the cumulative counterfactual regret
For an extensive game being played repeat- RiT (I, a) is maintained for every informa-
edly, let σit be the strategy used by player i at tion set and its actions. RiT,+ (I, a) =
time t. The average overall regret of player i max(RiT (I, a), 0).
at time T is defined as Zinkevich et al. applied counterfactual re-
T gret minimization to the game of Poker. With
1 X
abstraction, cards are partitioned into ten
RiT ui (σi∗ , σ−it
) − ui (σ t )

= max
T σi ∈Σi t=1

equally sized buckets and the number of in-
(3) formation sets is reduced from 3.19 × 1014
Denote the probability that action a is cho- to 5.73 × 107 . For each iteration, when it
sen under information set I by σ(I)(a), player comes to a chance player, a deterministic ac-
i’s average strategy σ̄it for information set I is tion is sampled, which is also referred to as
defined as chance-sampled CFR. Executed parallelly on
PT σt t four CPUs, the algorithm is iterated 2 billion
t t=1 πi (I)σ (I)(a)
σ̄i (I)(a) = PT (4) times in 14 days. The resulting strategy has
σt
t=1 πi (I) outperformed all of the competitors from the
According to Theorem 2 in the study [1], in bankroll portion of the 2006 AAAI Computer
a zero-sum game at time T , if both players’ Poker Competition [1].
average overall regret is less than , then σ̄iT is
a 2 equilibrium. 3.2 MCCFR
Define counterfactual value vi (σ, I) as Lanctot et al. described Monte Carlo
X counterfactual regret minimization (MCCFR)
σ
vi (σ, I) = ui (z)π−i (z[I])π σ (z[I], z) based on CFR [4]. All possible terminal his-
z∈ZI tories are partitioned into blocks, and on each
(5) iteration only one of the blocks is sampled and
where ZI is the set of terminal histories pass- only terminal histories in that block are con-
ing through I and z[I] is the prefix of z con- sidered, while expectation values of immediate
tained in I. counterfactual regrets are not changed. They
Theorem 3 in the study [1] suggests that the introduced two sampling schemes: outcome
average overall regret is bounded by the sum of sampling and external sampling. In outcome-
immediate counterfactual regret of each infor- sampling MCCFR, each block contains a sin-
mation set I. Average overall regret is defined gle terminal history, and on each iteration

© 2018 Information Processing Society of Japan - 138 -


The 23rd Game Programming Workshop 2018

only this terminal history and information sets regularly. Therefore, computing and storing
along this history will be updated. In external- the average strategies can be omitted, lead-
sampling MCCFR, only the actions of the op- ing to a reduction of the computational cost.
ponent and chance are sampled. Empirically, By running the implementation on a clus-
MCCFR converges faster than CFR [4]. ter of 200 nodes each with 24 cores for 68.5
days, a close approximation to a Nash equi-
3.3 PCS librium is finally reached, making Heads-up
Johanson et al. described new sampling Limit Hold’em poker weakly solved [3].
techniques to sample a subset of chance nodes,
resulting in slower but more accurate itera- 3.5 Pure CFR
tions [2]. Chance nodes are partitioned into Pure CFR is a new sampling algorithm de-
three sets by observability: observable only scribed in Gibson’s PhD dissertation [5]. Pure
by the player, observable only by the oppo- CFR samples a pure strategy profile ŝt from
nent, and observable by both players. The the current profile σ t , and a pure chance dis-
first two are considered to be private and the tribution ŝtc from chance player’s outcomes σc .
last one is considered to be public. There are For each information set I,
three variants: Self-Public Chance Sampling
(SPCS), Opponent-Public Chance Sampling Prob[ŝ(I) = a] = σ(I, a) for all a ∈ A(I)
(OPCS), and Public-Chance Sampling (PCS). (14)
In SPCS, player’s and public chance nodes Then the estimated counterfactual value
are sampled; in OPCS, opponent’s and public will be
chance nodes are sampled while in PCS only X

public chance nodes are sampled. Empirically, v̂i (I, σ) = ui (z)π−i (z[I])π ŝ (z[I], z)
z∈ZI
Public Chance Sampling converges faster than
(15)
Chance Sampling on large games [2].
When the utilities of the game are integers,
3.4 CFR+ all computations in Pure CFR can be done
Bowling et al. introduced a new algorithm with integers. The computing procedure of
called CFR+ [3]. While CFR implementations integers can be faster than float numbers and
typically sample only a part of the game tree, storing integer values reduces memory cost by
CFR+ does exhaustive iterations over the en- half. These benefits enable Pure CFR to be
tire game tree and uses a different method for applied to games of a larger scale [5].
regret matching called regret matching+ . It 4. The Game of Geister
maintains an alternative counterfactual regret
4.1 The Original Game
value Qt (I, a) for every information set and its
Geister, which is also referred to as Ghosts
actions, which is updated as
or Phantoms vs phantoms, is a board game de-
Q0 (I, a) = 0 (10) signed by Alex Randolph for two players [6].
Qt (I, a) = (11) The game was the winner of Årets Spel Best
+ Family Game in 1986 and has been recom-
Qt−1 (I, a) + Rt (I, a) − Rt−1 (I, a) mended in Spiel des Jahres in 1982 [7]. The
(12) game takes place on a 6×6 square grid game
board. Each player has four good ghosts and
where x+ = max(x, 0).
four evil ghosts assembled in two rows, and
The strategy for time t + 1 is calculated as
their types are not revealed to the opponent
Qti (I,a)

player [6]. In each turn a player can move one
 a∈A(I) Qti (I,a) ,

 P
of his ghosts one step vertically or horizontally.
σit+1 (I)(a) = if a∈A(I) Qti (I, a) > 0
P
 Moving into a square containing an opponent’s
 1 , otherwise

|A(I)| ghost will kill the opponent’s ghost. A player
(13) will win if one of the three constraints is sat-
+
Empirically, with CFR , player’s current isfied [7]:
strategies also approximate an equilibrium • All the player’s evil ghosts are killed.

© 2018 Information Processing Society of Japan - 139 -


The 23rd Game Programming Workshop 2018

• All the opponent player’s good ghosts are 5 1600 six core@3.4GHz CPU using PyPy3
killed. v6.0.0.
• One of the player’s good ghosts is moved We assume that when playing a board
off the board from one of the opponent game, the situation of the board often con-
player’s corner squares. tains enough information for an experienced
player to decide the next move. Since Geister
4.2 The Simplified Game is a board game, for simplicity, we adopt the
As the game complexity increases exponen- players’ vision of the game board as the infor-
tially with the size of the game board, it is mation set, which includes no history memory.
hard to solve the full game of Geister with As is said in Section 4.2, board games have a
normal computing resources. Due to the limi- problem of infinite loops. We restricted play-
tation of computing resources, our research on ers’ moving turns to see the expansion of the
Geister is based on a reduced game board with game tree, i.e., the number of total informa-
small changes in the rules. Now we introduce a tion sets and the time needed to run the pro-
simplified version of Geister, Geister(2×3, 2). gram for a certain number of iterations. We
Geister(2×3, 2) is a two-player board game denote one player moving one of his ghosts as
that takes place on a 2×3 board with 1 good one move, and two players making their moves
ghost and 1 evil ghost for each player. Other respectively as one turn. We set the terminal
rules are identical to those of the full game, state utilities such that 1 for a win, -1 for a
except that: loss and 0 if maximum move limit is reached.
• The game of Geister does not have any
chance nodes. In order to implement it 5.1 Chance-sampled CFR
with CFR, we added a chance node at the Chance-sampled CFR program is imple-
beginning of the game, which decides the mented according to the pseudocode and the
positions of players’ ghosts randomly. It Java source code written in the study [8]. An
means that at the beginning of the game, iteration is to run the recursive CFR func-
every player gets a random ghost arrange- tion for one time. For chance-sampled CFR,
ment rather than arranging them by him- chance node events are sampled before each
self. iteration.
• CFR algorithm calculates all possible We first set moving turn limits from 1 to
routes until the end of the game. How- 10 (one turn contains two moves, one move
ever, in board games, infinite loops are in- for each player, so total move limits are 2 to
evitable. To avoid infinite loops, we add a 20), and ran the program for 100 iterations
constraint that if each player has played a to observe the changes in execution time and
certain number of turns, the game is ended the total number of information sets. For ev-
forcibly with a draw. ery different turn limit, the program was run
• Due to the limited size of the game board, from the beginning. Execution time is the real
the corners of the board are not removed. time between the starting and the ending of
To win the game by moving good ghosts total training iterations. Time is calculated
off the board, players must move the good in Python code by time package. Results are
ghost to the corner first and then move it shown in Figure 1.
off at the next turn. We can see from Figure 1 that when turn
5. Experiments limit is greater than or equal to 7 (14 total
moves), the number of information sets does
We implemented the game of Geister and not increase anymore. It suggests that after
variants of CFR algorithms using Python lan- 7 turns, all possible nodes of the game tree
guage. To accelerate the execution of the pro- have been visited by the program. We can
gram, we ran the program with PyPy, which is also conclude that the execution time grows
a fast interpreter of the Python language with exponentially with move limit.
Just-in-Time compiler. All our programs are Due to the limitations of computing re-
executed on a single thread of AMDr Ryzen sources, turn limit over 8 will take too much

© 2018 Information Processing Society of Japan - 140 -


The 23rd Game Programming Workshop 2018

Execution time and number of information sets - cscfr


Win rate against random player - cscfr
175 Time for 100 iterations 1200 0.650
150 # of infosets 1000 0.625
125 800
0.600
# of infosets

0.575

Win rate
100

Time/s
600
75 0.550
400
50 0.525
200
25 0.500
0 0 0.475
2 4 6 8 10 12 14 16 18 20
Move limit 0.450
2 4 6 8 10 12 14 16
Execution time and number of information sets - cscfr Move limit
175 Time for 100 iterations 103 with Random with move limit=10
with move limit=2 with move limit=12
150 # of infosets with move limit=4 with move limit=14
with move limit=6 with move limit=16
125 102 with move limit=8
# of infosets

100 Time/s
Fig. 2 Win rate against other players - cscfr
75 101
50 We can observe from the figure that win
25 100 rate against random player is quite high when
0 move limit is small, and decreases when move
2 4 6 8 10 12 14 16 18 20 limit increases, and then increase again. How-
Move limit
ever, self-play between agents shows an oppo-
Fig. 1 Execution time and number of infosets - CSCFR
site trend. To find the reason for this result,
we looked into the first step strategy every
time to converge. For turn limits from 1 to agent makes. The probability of choosing to
8, we run the program for 100 000 iterations move the evil ghost for each agent is shown in
respectively. Each program was run from the Figure 3.
beginning.
According to the average strategy profile
Probability to move the evil ghost at first step - cscfr
generated under different turn limits, we made 1.0
an agent for each strategy profile and per-
formed self-plays between agents and the ran- 0.9
dom player. Self-plays do not limit turns. If
Probability

the agent meets an unexplored information 0.8


set, a random action is chosen from all legal
0.7
moves with equal probability. Every point on
the graph represents the win rate of 100 000 0.6
Probability
plays performed between the x-axis player
against the player described on the legend (ev- 2 4 6 8 10 12 14 16
Move limit
ery player is first hand for 50 000 times). The
Fig. 3 Probability to move the evil ghost at the first step -
result is shown in Figure 2. cscfr
By simple inference, we can conclude that
when each player takes only one step, the We can infer from the figure that battles
game can be simplified as a normal game, with the random player almost depend on the
whose Nash equilibrium point is the strategy action of the first step, but battles between
profile that player 1 moves his evil ghost one agents do not. We believe that when move
step forward and player 2 moves his ghost on limit is very small, the agent only looks into
the opposite side one step forward to avoid few steps of each player. The game is small so
killing player 1’s ghost. It is believed to be it is easy for the agent to get a correct strat-
the best strategy for the first two steps. egy. When the agent looks into more steps,

© 2018 Information Processing Society of Japan - 141 -


The 23rd Game Programming Workshop 2018

it will find a better strategy under that move


Execution time and number of information sets - purecfr
limit. However, this strategy may fall into a 175
# of infosets
local optimal which is not a good one overall. 250
150
When move limit increases continuously, the
125 200

# of infosets
agent finally finds its right way to go. When
100

Time/s
move limits are too large, although informa- 150
tion sets are fully visited, the long history of 75
100
the game contains too many loops. Our im- 50
plementation of CFR does not include history 50
25
memory in the information set, so these loops Time for 10000 iterations
0 0
may confuse the agents. 2 4 6 8 10 12 14 16 18 20
Move limit
Also, limitations of hardware resources also
Fig. 4 Execution time and number of infosets - purecfr
prevented us from running more iterations or
doing experiments on larger move limits. It is We generated Pure CFR agents for turn lim-
possible that the strategy profile we get from its from 1 to 8. For every move limit, the Pure
100 000 iterations is not fully converged, which CFR algorithm was executed for 1 000 000 it-
may also affect the strategies generated by the erations. We also performed self-play between
agents. agents and the random player. Other rules are
the same as these in Section 5.1. The result is
5.2 Pure CFR
shown in Figure 5.
Pure CFR program is implemented accord-
ing to the pseudocode written in the study [5].
Win rate against random player - purecfr
An iteration is to run the recursive CFR func- 0.65
tion for both players once respectively. The
chance node events are also sampled before 0.60
each iteration.
0.55
Win rate

For Pure CFR, we did similar tests to ob-


serve the changes in execution time and the
number of information sets to moving turn 0.50
limits. Pure CFR requires less time for one
iteration, as it only samples a pure strategy
0.45
profile while Chance-sampled CFR explores
all possible actions. We also set moving turn
2 4 6 8 10 12 14 16
limits from 1 to 10, and ran the program Move limit
for 10 000 iterations. For every different turn with Random with move limit=10
limit, the program was run from the begin- with move limit=2 with move limit=12
with move limit=4 with move limit=14
ning. Execution time is calculated by the same with move limit=6 with move limit=16
method as in Section 5.1. Results are shown with move limit=8
in Figure 4. Fig. 5 Win rate against other players - purecfr
We found that Pure CFR does not ex-
plore every information set in the game. It We can observe a similar pattern compared
is because that Pure CFR uses a sampled to the result of chance-sampled CFR. The
pure strategy profile, and some actions may probabilities of moving the evil ghost shown
have extremely low possibilities to be sampled. in Figure 6 are also corresponding to the win
There is a high possibility that these actions’ rates against the random player. Although the
resulting information sets had not been vis- trend of the curve is similar to what chance-
ited. sampled CFR algorithm has got, the minimum
The execution time of Pure CFR program points turn out to be different and there are
first goes linearly to the move limits, and then more fluctuations than CSCFR. We infer the
shows no certain pattern due to the sampled reason is that the program has not fully con-
strategy. verged.

© 2018 Information Processing Society of Japan - 142 -


The 23rd Game Programming Workshop 2018

CFR and Pure CFR algorithm to a sim-


Probability to move the evil ghost at first step - purecfr
1.00 plified version of the board game Geister,
0.95 Geister(2×3, 2).
0.90 To deal with the infinite loop problem in
0.85 the board games, we limited players’ moving
Probability

0.80 turns to interrupt the game forcibly. Under


different limitations, we compared the execu-
0.75
tion time for each algorithm, and analyzed
0.70
the result of self-play. We propose that with
0.65 Probability certain turn limits, chance-sampled CFR and
0.60
2 4 6 8 10 12 14 16 Pure CFR algorithm are able to generate an
Move limit
appropriate strategy profile for board games
Fig. 6 Probability to move the evil ghost at the first step -
purecfr
like Geister(2×3, 2).
We infer that in this game, in order to win
We calculated the total execution time of most battles against the random player, the
two algorithms, which is calculated by the strategy of the first step is important. We
same method as in Section 5.1. The result also observed that Pure CFR algorithm con-
is shown in Figure 7. Thanks to the tech- sumes time about linearly to the length of the
nique of sampling a pure strategy profile, Pure game, while chance-sampled CFR does expo-
CFR iterations runs much faster than chance- nentially. Pure CFR algorithm is able to get
sampled CFR. Although running 10 times as faster convergence when solving games with
many iterations as CSCFR does, Pure CFR large turn limits.
saves up to 90% time when solving games of However, due to the limited computing re-
large move limit. We can imagine that while sources, we are not able to test the result for
Pure CFR needs more iterations to converge, larger turn limits. For what we have tested,
the total efficiency will overperform CSCFR in it is possible that our result is not fully con-
large game instances. verged. In future works, we want to run the
program for more iterations to see if it will
Total execution time gain better convergence, and test larger move
80000 cscfr - 100K iterations
limits using Pure CFR within reasonable ex-
70000 purecfr - 1M iterations ecution time. In addition, we will also apply
60000 CFR algorithms to a larger board of Geister
50000 to observe how powerful can CFR algorithms
Time/s

40000 be.
30000
Acknowledgement
20000
10000 A part of this work was supported by JSPS
0 KAKENHI Grant Number 16H02927 and by
2 4 6 8 10 12 14 16 JST, PRESTO.
Move limit
Fig. 7 Total execution time References
[1] Zinkevich, M., Johanson, M., Bowling, M. and Piccione,
C.: Regret Minimization in Games with Incomplete In-
After all, as agents battling with the ran- formation, Advances in Neural Information Processing
Systems 20, pp. 1729–1736 (2008).
dom player has a win rate of far greater than [2] Johanson, M., Bard, N., Lanctot, M., Gibson, R. and
50% and their first step strategies finally con- Bowling, M.: Efficient Nash Equilibrium Approximation
through Monte Carlo Counterfactual Regret Minimiza-
verge to ones that are corresponding to our tion, Proceedings of the Eleventh International Confer-
analysis, we propose that CFR algorithm is ence on Autonomous Agents and Multi-Agent Systems
(AAMAS) (2012).
able to generate an appropriate strategy for [3] Bowling, M., Burch, N., Johanson, M. and Tam-
melin, O.: Heads-up limit hold’em poker is solved, Sci-
Geister(2×3, 2). ence, Vol. 347, No. 6218, pp. 145–149 (online), DOI:
10.1126/science.1259433 (2015).
6. Conclusion [4] Lanctot, M., Waugh, K., Zinkevich, M. and Bowling,
M.: Monte Carlo Sampling for Regret Minimization in
In this paper, we applied chance-sampled Extensive Games, Advances in Neural Information Pro-

© 2018 Information Processing Society of Japan - 143 -


The 23rd Game Programming Workshop 2018

cessing Systems 22, pp. 1078–1086 (2009).


[5] Gibson, R.: Regret Minimization in Games and the De-
velopment of Champion Multiplayer Computer Poker-
Playing Agents, PhD Thesis, University of Alberta
(2014).
[6] Wikipedia: Ghosts (board game), https://en.
wikipedia.org/wiki/Ghosts_(board_game).
[7] BoardGameGeek: Phantoms vs phantoms,
https://www.boardgamegeek.com/boardgame/2290/
phantoms-vs-phantoms.
[8] Neller, T. W. and Lanctot, M.: An Introduction
to Counterfactual Regret Minimization, Proceedings of
Model AI Assignments, The Fourth Symposium on Ed-
ucational Advances in Artificial Intelligence (EAAI-
2013) (2013).

© 2018 Information Processing Society of Japan - 144 -

You might also like