\[
F(m, a; n) =
\begin{cases}
\text{True}, & \text{if } n = 0\\
\text{False}, & \text{if } n > 0 \wedge (k = 0 \vee q = 0)\\
G_{k-1,q-1}(m-1), & \text{if } a = \text{``break''}\\
G_{k,q-1}(n-m), & \text{if } a = \text{``not break''}
\end{cases}
\]
In the logic statement, G_{k,q}(n) means that one can use k jars and q tests to locate the highest safe rung in any ladder with n rungs. The problem is defined recursively; m is the optimal testing point (if there is one) such that, whether the jar breaks or not, the remaining resources can still be used to locate the highest safe rung.
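To make the recursion concrete, the following is a minimal memoized sketch (not code from the paper; the names `solvable` and `best_m` are purely illustrative) that decides G_{k,q}(n) directly:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def solvable(k: int, q: int, n: int) -> bool:
    """Decide G_{k,q}(n): can k jars and q tests locate the highest safe
    rung among n rungs?  (Illustrative sketch, not the paper's code.)"""
    if n == 0:
        return True                      # nothing left to distinguish
    if k == 0 or q == 0:                 # rungs remain but resources are gone
        return False
    # A testing point m works iff both outcomes leave a solvable subproblem.
    return any(solvable(k - 1, q - 1, m - 1) and solvable(k, q - 1, n - m)
               for m in range(1, n + 1))

def best_m(k: int, q: int, n: int):
    """Return one optimal testing point m, or None if G_{k,q}(n) is False."""
    if k == 0 or q == 0:
        return None
    for m in range(1, n + 1):
        if solvable(k - 1, q - 1, m - 1) and solvable(k, q - 1, n - m):
            return m
    return None

# With k = 7 jars and q = 7 tests, 127 rungs are solvable but 128 are not,
# which is why n = 129 in Fig. 2 has no solution.
assert solvable(7, 7, 127) and not solvable(7, 7, 128)
```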
With G_{k,q}(n), we define the HSR game, in which the P and the OP move alternately. In the proposal phase, the P proposes a number n and the OP decides whether to accept or reject it, as described in section 2. During the refutation phase, the P picks a testing point m, and the OP replies "break" or "not break". Specifically, in each round the P claims that ∃m ≤ n : {G_{k-1,q-1}(m-1) ∧ G_{k,q-1}(n-m)} holds; the OP then refutes the claim by refuting either G_{k-1,q-1}(m-1) or G_{k,q-1}(n-m). In this way, both the testing policy and the highest safe rung are learned implicitly.
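As an illustration of the refutation phase, the following hypothetical sketch (not the authors' implementation) encodes the game state as the triple (k, q, n): the P moves by choosing a testing point m, and the OP moves by choosing the outcome it believes refutes the claim.

```python
def op_reply(k, q, n, m, answer):
    """Apply the OP's answer to the P's proposed testing point m
    (illustrative sketch; the state is the triple (k, q, n))."""
    if answer == "break":        # OP attacks G_{k-1,q-1}(m-1)
        return k - 1, q - 1, m - 1
    return k, q - 1, n - m       # "not break": OP attacks G_{k,q-1}(n-m)

def terminal_winner(k, q, n):
    """The P wins once no rungs remain to be distinguished; the OP wins
    when rungs remain but jars or tests have run out."""
    if n == 0:
        return "P"
    if k == 0 or q == 0:
        return "OP"
    return None                  # the game continues

# One round: with (k, q, n) = (2, 3, 6), the P tests rung m = 3 and the OP
# answers "break"; the remaining subproblem is (1, 2, 2).
print(op_reply(2, 3, 6, 3, "break"))   # -> (1, 2, 2)
```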
4 EXPERIMENT

Two independent neural networks are applied to learn the proposal game and the refutation game, respectively. During each iteration of the learning process: (1) 100 episodes of self-play are executed through a neural MCTS using the current neural network, and the data generated during self-play are stored in a replay buffer; (2) the neural networks are trained on the data in the replay buffer; and (3) the newly trained neural network and the previous neural network are put into a competition. We collect the correctness data for both neural networks during each iteration. Fig. 1 shows the process in which the two players continuously compete with each other until finally converging to a solution in which the P's policy can always keep her winning position. The OP is in a dilemma when the P is perfect, i.e., when the P chooses only the optimal n and m. On the other hand, for a problem without any solution, the opposite happens (see Fig. 2).

Figure 2: Refutation phase on k = 7, q = 7 but with n set to 129, so there is no solution.
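The per-iteration loop can be summarized by the following hedged sketch; the callables `play_episode`, `fit`, and `pit` are hypothetical placeholders standing in for the neural-MCTS self-play, network training, and competition steps, not the authors' actual API.

```python
from typing import Any, Callable, List, Tuple

Example = Any  # one self-play training example, e.g. (state, policy target, outcome)

def train_iteration(
    net: Any,
    replay_buffer: List[Example],
    play_episode: Callable[[Any], List[Example]],    # one neural-MCTS self-play episode
    fit: Callable[[Any, List[Example]], Any],        # trains and returns a new network
    pit: Callable[[Any, Any], Tuple[float, float]],  # correctness of (new, old) players
    episodes: int = 100,                             # 100 self-play episodes per iteration
) -> Tuple[Any, Tuple[float, float]]:
    """One iteration of the self-play learning loop described above
    (hypothetical interfaces, not the paper's implementation)."""
    # 1. Self-play with the current network; store the generated data.
    for _ in range(episodes):
        replay_buffer.extend(play_episode(net))
    # 2. Train a new network on the replay buffer.
    new_net = fit(net, replay_buffer)
    # 3. Pit the new network against the previous one and record correctness.
    return new_net, pit(new_net, net)
```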
5 CONCLUSION

We introduce MQ2 and the MQ2 Zermelo Gamification. We formulate the optimization problem using predicate logic (where the types of the variables are not "too" complex) and then give the corresponding Zermelo game to the adapted neural MCTS algorithm. For our proof-of-concept specialization of MQ2, the HSR Zermelo Gamification, we notice that the adapted neural MCTS algorithm converges on small instances that can be handled by our hardware and finds the winning strategy. Our evaluation counts all correct and incorrect moves of the players, based on the ground truth. We hope our research sheds some light on the potential for neural MCTS to solve interesting gamified problems.