
Learning Self-Game-Play Agents for Combinatorial Optimization Problems

Extended Abstract

Ruiyang Xu, Northeastern University, Boston, Massachusetts, ruiyang@ccs.neu.edu
Karl Lieberherr, Northeastern University, Boston, Massachusetts, lieber@ccs.neu.edu

ABSTRACT
Recent progress in reinforcement learning (RL) using self-game-play has shown remarkable performance on several board games as well as video games (e.g., Atari games and Dota 2). DeepMind researchers have already applied model-free RL to play Go and Chess at a superhuman level using neural Monte Carlo Tree Search (neural MCTS). It is therefore plausible that RL, starting from zero knowledge, can be applied to other problems that can be converted into games. We try to leverage the computational power of neural MCTS to solve a class of combinatorial optimization problems. Following the idea of Hintikka's Game-Theoretical Semantics, we propose the Zermelo Gamification (ZG) to transform specific combinatorial optimization problems into Zermelo games whose winning strategies correspond to the solutions of the original optimization problem. The ZG also provides a specially designed neural MCTS. We use a combinatorial planning problem for which the ground-truth policy is efficiently computable to demonstrate that ZG is promising.

KEYWORDS
Reinforcement Learning; Neural MCTS; Self-game-play; Combinatorial Optimization; Model-free

ACM Reference Format:
Ruiyang Xu and Karl Lieberherr. 2019. Learning Self-Game-Play Agents for Combinatorial Optimization Problems. In Proc. of the 18th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2019), Montreal, Canada, May 13–17, 2019, IFAAMAS, 3 pages.

1 INTRODUCTION
We transform a certain form of combinatorial optimization problem (e.g., the HSR problem described in Section 3) into games so that a game-play agent can be leveraged to play the game and solve the original problem on a specific instance. In Fig. 1 one can see how two competitive agents, called P and OP, gradually, but with setbacks (as in AlphaZero [9] and [8]), improve and jointly arrive at the winning strategy. Model-free learning converges and solves a non-trivial problem, although the underlying game is fundamentally different from Go and Chess. Related work on machine learning and combinatorial problems can be found in [1], [4], [5], [6], and [10].

We make three main contributions. (1) We introduce the Zermelo Gamification, which itself consists of two contributions: (1.a) a way to generalize combinatorial problems to more comprehensive combinatorial problems in the form of Zermelo games, following the approach of Hintikka's Game-Theoretical Semantics [3]; and (1.b) an implementation of a variant of the neural MCTS algorithm ([2] and [7]) specifically designed for those Zermelo games. (2) Evaluation: we evaluate our algorithm on a problem (HSR) for which Bernoulli's triangle exhibits the winning strategy efficiently. Our results show that, for problems under a certain size, neural MCTS does find the optimal strategy, hence solving the original optimization problem in a tabula-rasa (model-free) style. (3) Indications: we show how the winning strategy for both players of the Zermelo game provides indications of whether a given problem instance does (or does not) have a solution. These indications are made possible through the generalization mentioned in contribution (1), and they are more useful than the simple answer "there is no solution." A complete description of this study can be found in [11].

2 ZERMELO GAMIFICATION
The combinatorial optimization problems studied in this paper can be described with the following logic statement, MQ2:

∃n {G(n) ∧ (∀n′ > n ¬G(n′))},  where G(n) := ∀x ∃y : {F(x, y; n)} or G(n) := ∃y ∀x : {F(x, y; n)}.

In this statement, n is a natural number, x and y can be any instances depending on the concrete problem, and F is a predicate on x, y, n. The logic statement above essentially means that there is a maximum number n such that for all x, some y can be found so that the predicate F(x, y; n) is true. Formulating those problems as interpreted logic statements is crucial to transforming them into Zermelo games. A Zermelo game is defined to be a two-player, finite, perfect-information game with exactly one winner and one loser, in which the players move alternately (i.e., no simultaneous moves). Leveraging the logic statement above, the Zermelo game is built on the Game-Theoretical Semantics approach (by Hintikka [3]) with two phases and two players: the Proponent (P), who claims that the statement is true, and the Opponent (OP), who argues that the statement is false. The original problem can be solved if and only if P is able to propose some optimal number n so that a perfect OP cannot refute it.

2.1 Proposal Phase
In the initial phase of the Zermelo game, player P proposes a number n. Player OP then decides whether to accept or reject this n. OP bases her decision on the logic statement A ∧ B, where A := G(n) and B := ∀n′ > n ¬G(n′). Specifically, OP tries to refute P by attacking either statement A or statement B. OP accepts the proposed n if she can confirm A = False. OP rejects n if she is unable to confirm A = False; in this case, OP treats n as non-optimal and proposes a new n′ > n which makes B = False. The rejection can be regarded as a role-flip between the two players.
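For concreteness, the following minimal Python sketch summarizes the proposal-phase decision just described. It is our illustration only, not the paper's implementation: confirms_not_G and counter_propose are hypothetical stand-ins for OP's (learned) judgment of A and her counter-proposal policy.

from typing import Callable, Tuple

def proposal_phase(
    n: int,
    confirms_not_G: Callable[[int], bool],
    counter_propose: Callable[[int], int],
) -> Tuple[int, str]:
    """Return the number under dispute and the player who must defend G(.) next."""
    if confirms_not_G(n):             # OP believes A := G(n) is false and attacks A:
        return n, "P"                 # she accepts n, and P must defend G(n).
    n_prime = counter_propose(n)      # OP attacks B := (forall n' > n, not G(n')):
    assert n_prime > n                # she rejects n as non-optimal, claims G(n')
    return n_prime, "OP"              # for some n' > n, and the roles flip.

In the actual system, both decisions are produced by the trained OP agent rather than by hand-written functions.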

2.2 Refutation Phase
This is the phase in which the two players actually search for evidence and construct strategies to attack each other or defend themselves. Generally speaking, regardless of the role-flip, the player in the P role claims that G(n) holds for the number n under dispute; the player in the OP role refutes this claim by giving an instance of x (the existentially quantified variable in ¬G(n)) for which she believes ¬G(n) holds. If P successfully produces the exceptional y (the universally quantified variable in ¬G(n)) that makes F(x, y; n) hold, OP loses the game; otherwise, P loses.

Figure 1: Correctness ratio measured on k = 7, q = 7. New_OP means the newly trained neural network playing as OP; Old_P means the previously trained neural network playing as P; similarly for New_P and Old_OP.
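As a reading aid only (not the authors' implementation), the sketch below shows one refutation round for a claim of the form G(n) := ∀x ∃y : {F(x, y; n)}. The move functions pick_x and pick_y are hypothetical stand-ins for the two agents' policies, and F stands for the problem-specific predicate.

from typing import Any, Callable

def refutation_round(
    n: int,
    pick_x: Callable[[int], Any],          # refuter instantiates the universal quantifier
    pick_y: Callable[[Any, int], Any],     # defender instantiates the existential quantifier
    F: Callable[[Any, Any, int], bool],    # the problem-specific predicate
) -> str:
    x = pick_x(n)      # the refuter tries to exhibit a counterexample x
    y = pick_y(x, n)   # the defender answers with a witness y
    return "defender wins" if F(x, y, n) else "refuter wins"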
3 HSR PROBLEM AND HSR GAME
The HSR (highest safe rung) problem serves as a proof of concept in our research: consider throwing jars from a specific rung of a ladder; a jar either breaks or it does not. Given k identical jars and q test chances to throw those jars, can one find the maximum height of the ladder such that any highest safe rung can be located with the given resources k, q? We observed the recursive structure in this problem and formulated the HSR problem as a logic statement as described in Section 2, where:

G_{k,q}(n) = ∃m ≤ n ∀a ∈ {"break", "not break"} : {F(m, a; n)}

F(m, a; n) =
    True,                    if n = 0
    False,                   if n > 0 ∧ (k = 0 ∨ q = 0)
    G_{k−1,q−1}(m − 1),      if a = "break"
    G_{k,q−1}(n − m),        if a = "not break"

In the logic statement, G_{k,q}(n) means that one can use k jars and q tests to locate any highest safe rung in a ladder with n rungs. The problem is defined recursively: m is the optimal testing point (if there is one) such that, no matter whether the jar breaks or not, the remaining resources can still be used to locate the highest safe rung.
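To make the recursion concrete, the following short Python sketch (our illustration, not the paper's neural MCTS code) evaluates G_{k,q}(n) directly from the definition above and computes the optimal number for P to propose. For k = q = 7 it yields 127, consistent with Figure 2's choice of n = 129 as an instance with no solution.

from functools import lru_cache

@lru_cache(maxsize=None)
def G(k: int, q: int, n: int) -> bool:
    """True iff k jars and q tests suffice to locate any highest safe rung among n rungs."""
    if n == 0:
        return True                  # nothing left to distinguish
    if k == 0 or q == 0:
        return False                 # rungs remain but no resources are left
    # P picks a testing point m <= n; OP answers "break" or "not break".
    # P wins iff some m survives both answers.
    return any(G(k - 1, q - 1, m - 1) and G(k, q - 1, n - m)
               for m in range(1, n + 1))

def max_n(k: int, q: int, limit: int = 4096) -> int:
    """Largest n with G(k, q, n): the optimal number for P to propose."""
    n = 0
    while n < limit and G(k, q, n + 1):
        n += 1
    return n

# max_n(7, 7) evaluates to 127, so an instance with n = 129 (as in Fig. 2)
# admits no winning strategy for P in the refutation phase.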
With G_{k,q}(n), we define the HSR game, in which P and OP move alternately. In the proposal phase, P proposes a number n and OP decides whether to accept or reject it, as described in Section 2. During the refutation phase, P picks a testing point m, and OP replies "break" or "not break". Specifically, in each round P claims that ∃m ≤ n : {G_{k−1,q−1}(m − 1) ∧ G_{k,q−1}(n − m)} holds; OP then refutes the claim by refuting either G_{k−1,q−1}(m − 1) or G_{k,q−1}(n − m). In this way, both the testing policy and the highest safe rung are learned implicitly.

4 EXPERIMENT
Two independent neural networks are applied to learn the proposal game and the refutation game, respectively. During each iteration of the learning process: (1) 100 episodes of self-play are executed through a neural MCTS using the current neural network, and the data generated during self-play are stored; (2) the neural networks are trained with the data in the replay buffer; and (3) the newly trained neural network and the previous neural network are put into a competition. We collect the correctness data for both neural networks during each iteration. Fig. 1 shows the process whereby the two players continuously compete with each other until they finally converge to a solution in which P's policy can always keep her winning position. OP is in a dilemma when P is perfect, i.e., when P chooses only the optimal n and m. On the other hand, for a problem without any solution, the opposite happens (see Fig. 2).

Figure 2: Refutation phase on k = 7, q = 7, but with n set to 129 so that there is no solution.
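The training iteration described in Section 4 can be summarized by the schematic sketch below. The helper names (neural_mcts_self_play, train, arena) are our own placeholders for illustration, not the paper's actual API.

def training_iteration(current_net, replay_buffer,
                       neural_mcts_self_play, train, arena,
                       episodes: int = 100):
    """One iteration of the learning process sketched in Section 4."""
    # 1. Run self-play episodes through neural MCTS guided by the current network
    #    and store the generated data.
    for _ in range(episodes):
        replay_buffer.extend(neural_mcts_self_play(current_net))
    # 2. Train the network on the data in the replay buffer.
    new_net = train(current_net, replay_buffer)
    # 3. Pit the newly trained network against the previous one and record the
    #    correctness of both players' moves.
    correctness = arena(new_net, current_net)
    return new_net, correctness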
5 CONCLUSION
We introduce MQ2 and the MQ2 Zermelo Gamification. We formulate the optimization problem using predicate logic (where the types of the variables are not "too" complex) and then use the corresponding Zermelo game, which we give to the adapted neural MCTS algorithm. For our proof-of-concept specialization of MQ2, the HSR Zermelo Gamification, we observe that the adapted neural MCTS algorithm converges on small instances that can be handled by our hardware and finds the winning strategy. Our evaluation counts all correct and incorrect moves of the players, based on the ground truth. We hope our research sheds some light on the potential of neural MCTS for solving interesting gamified problems.


REFERENCES
[1] Irwan Bello, Hieu Pham, Quoc V. Le, Mohammad Norouzi, and Samy Bengio. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940 (2016).
[2] Cameron Browne, Edward Jack Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez Liebana, Spyridon Samothrakis, and Simon Colton. 2012. A Survey of Monte Carlo Tree Search Methods. IEEE Trans. Comput. Intellig. and AI in Games 4, 1 (2012), 1–43.
[3] Jaakko Hintikka. 1982. Game-theoretical semantics: insights and prospects. Notre Dame J. Formal Logic 23, 2 (1982), 219–241.
[4] Elias Khalil, Hanjun Dai, Yuyu Zhang, Bistra Dilkina, and Le Song. 2017. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems. 6348–6358.
[5] Alexandre Laterre, Yunguan Fu, Mohamed Khalil Jabri, Alain-Sam Cohen, David Kas, Karl Hajjar, Torbjorn S. Dahl, Amine Kerkeni, and Karim Beguir. 2018. Ranked Reward: Enabling Self-Play Reinforcement Learning for Combinatorial Optimization. arXiv preprint arXiv:1807.01672 (2018).
[6] Daniel Selsam, Matthew Lamm, Benedikt Bunz, Percy Liang, Leonardo de Moura, and David L. Dill. 2018. Learning a SAT Solver from Single-Bit Supervision. arXiv preprint arXiv:1802.03685 (2018).
[7] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (Jan. 2016), 484.
[8] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. 2018. A general reinforcement learning algorithm that masters Chess, Shogi, and Go through self-play. Science 362, 6419 (2018), 1140–1144.
[9] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. 2017. Mastering the game of Go without human knowledge. Nature 550 (Oct. 2017), 354.
[10] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 2692–2700.
[11] Ruiyang Xu and Karl Lieberherr. 2019. Learning Self-Game-Play Agents for Combinatorial Optimization Problems. arXiv preprint arXiv:1903.03674 (2019).
