An Analysis of Stochastic Game Theory For Multiagent Reinforcement Learning
Abstract
Learning behaviors in a multiagent environment is crucial for developing and adapting multiagent systems. Reinforcement learning techniques have addressed this problem for a single agent acting in a stationary environment, which is modeled as a Markov decision process (MDP). But multiagent environments are inherently non-stationary since the other agents are free to change their behavior as they also learn and adapt. Stochastic games, first studied in the game theory community, are a natural extension of MDPs to include multiple agents. In this paper we contribute a comprehensive presentation of the relevant techniques for solving stochastic games from both the game theory and reinforcement learning communities. We examine the assumptions and limitations of these algorithms, and identify similarities between these algorithms, single agent reinforcement learners, and basic game theory techniques.
This research was sponsored by the United States Air Force under Cooperative Agreements No. F30602-97-2-0250 and No. F30602-98-2-0135. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency (DARPA), the Air Force, or the US Government.
1 Introduction
The problem of an agent learning to act in an unknown world is both challenging and interesting. Reinforcement learning has been successful at finding optimal control policies for a single agent operating in a stationary environment, specifically a Markov decision process. Learning to act in multiagent systems offers additional challenges; see the following surveys [17, 19, 27]. Multiple agents can be employed to solve a single task, or an agent may be required to perform a task in a world containing other agents, either human, robotic, or software ones. In either case, from an agent's perspective the world is not stationary. In particular, the behavior of the other agents may change, as they also learn to better perform their tasks. This type of multiagent nonstationary world creates a difficult problem for learning to act in these environments.

However, this nonstationary scenario can be viewed as a game with multiple players. Game theory has aimed at providing solutions to the problem of selecting optimal actions in multi-player environments. In game theory, there is an underlying assumption that the players have similar adaptation and learning abilities, and therefore the actions of each agent affect the task achievement of the other agents. It seems therefore promising to identify and build upon the relevant results from game theory towards multiagent reinforcement learning.

Stochastic games extend the single agent Markov decision process to include multiple agents whose actions all impact the resulting rewards and next state. They can also be viewed as an extension of game theory's simpler notion of matrix games. Such a view emphasizes the difficulty of finding optimal behavior in stochastic games, since optimal behavior depends on the behavior of the other agents, and vice versa. This model then serves as a bridge combining notions from game theory and reinforcement learning.

A comprehensive examination of the multiagent learning techniques for stochastic games does not exist. In this paper we contribute such an analysis, examining techniques from both game theory and reinforcement learning. The analysis both helps to understand existing algorithms and is suggestive of areas for future work. In Section 2 we provide the theoretical framework for stochastic games as extensions of both MDPs and matrix games. Section 3 summarizes algorithms for solving stochastic games from the game theory and reinforcement learning communities. We discuss the assumptions, goals, and limitations of these algorithms. We also taxonomize the algorithms based on their game theoretic and reinforcement learning components. Section 4 presents two final algorithms that are based on a different game theoretic mechanism, which addresses a limitation of the other algorithms. Section 5 concludes with a brief summary and a discussion of future work in multiagent reinforcement learning.
2 Theoretical Framework
In this section we set up the framework for stochastic games. We first examine MDPs, which provide a single-agent, multiple-state framework. We then examine matrix games, which provide a multiple-agent, single-state framework. Finally, we introduce the stochastic game framework, which can be seen as the merging of MDPs and matrix games.
Figure 1: The matching pennies game.

Players can also play mixed strategies, which select actions according to a probability distribution. It is clear that in the above game there is also no optimal mixed strategy that is independent of the opponent. This leads us to define an opponent-dependent solution, or set of solutions:

Definition 1 For a game, define the best-response function for player $i$, $BR_i(\sigma_{-i})$, to be the set of all, possibly mixed, strategies that are optimal given the other player(s) play the possibly mixed joint strategy $\sigma_{-i}$.

The major advancement that has driven much of the development of matrix games and game theory is the notion of a best-response equilibrium, or Nash equilibrium:

Definition 2 A Nash equilibrium is a collection of strategies (possibly mixed) for all players, $\sigma_i$, with
$$\sigma_i \in BR_i(\sigma_{-i}) \quad \text{for every player } i.$$
So, no player can do better by changing strategies given that the other players continue to follow the equilibrium strategy. What makes the notion of equilibrium compelling is that all matrix games have a Nash equilibrium, although there may be more than one.

Types of Matrix Games. Matrix games can be usefully classified according to the structure of their payoff functions. Two common classes of games are purely collaborative and purely competitive games. In purely collaborative games, all agents have the same payoff function, so an action in the best interest of one agent is in the best interest of all the agents. In purely competitive games, there are two agents, where one's payoff function is the negative of the other's (i.e., $R_1 = -R_2$). The game in Figure 1 is an example of one such game. Purely competitive games are also called zero-sum games since the payoff functions sum to zero. Other games, including purely collaborative games, are called general-sum games. One appealing feature of zero-sum games is that they contain a unique Nash equilibrium (more precisely, all of their equilibria share the same value and are interchangeable). This equilibrium can be found as the solution to a relatively simple linear program.¹ Finding equilibria in general-sum games requires a more difficult quadratic programming solution [6].
¹ The value of the equilibrium can be computed using linear programming with the following objective function, where $\sigma_1$ ranges over probability distributions on player 1's actions: $\max_{\sigma_1} \min_{a_2} \sum_{a_1} \sigma_1(a_1)\, R_1(a_1, a_2)$.
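To make the definitions above concrete, here is a minimal numerical check (ours, not from the paper) of the matching pennies game in Figure 1: no pure strategy profile is a Nash equilibrium, but the uniform mixed strategies are mutual best responses. The payoff matrices and helper names are illustrative assumptions.

```python
import numpy as np

# Matching pennies: the row player wins (+1) when the pennies match and
# loses (-1) otherwise; the column player's payoffs are the negation.
R1 = np.array([[ 1., -1.],
               [-1.,  1.]])
R2 = -R1

def expected_payoff(R, x, y):
    """Expected payoff x^T R y under mixed strategies x (rows) and y (columns)."""
    return x @ R @ y

def is_best_response(R, x, y, tol=1e-9):
    """x is a best response to y if no pure strategy earns strictly more against y."""
    best_pure = max(expected_payoff(R, e, y) for e in np.eye(len(x)))
    return expected_payoff(R, x, y) >= best_pure - tol

heads = np.array([1.0, 0.0])
uniform = np.array([0.5, 0.5])

# If both players play Heads, the column player gains by switching to Tails,
# so (Heads, Heads) is not a Nash equilibrium.
print(is_best_response(R2.T, heads, heads))      # False

# The uniform mixed strategies are mutual best responses: a Nash equilibrium.
print(is_best_response(R1, uniform, uniform))    # True
print(is_best_response(R2.T, uniform, uniform))  # True
```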
In the algorithms presented in this paper we use the functions Value(MG) and Solve$_i$(MG) to refer to algorithms for solving matrix games, either linear programming (for zero-sum games) or quadratic programming (for general-sum games), depending on the context. Value returns the expected value of playing the matrix game's equilibrium and Solve$_i$ returns player $i$'s equilibrium strategy.
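As a sketch of what Value and Solve might look like for zero-sum games (our illustration; the paper only names the functions), the maximin linear program from the footnote above can be implemented with scipy.optimize.linprog. The function and variable names are ours.

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(R1):
    """Maximin LP for the row player of a zero-sum matrix game with payoffs R1.

    Variables are (v, x_1, ..., x_n): maximize v subject to
    sum_i x_i * R1[i, j] >= v for every column j, with x a probability vector.
    Returns (value, row player's equilibrium mixed strategy).
    """
    n_rows, n_cols = R1.shape
    # linprog minimizes, so minimize -v.
    c = np.concatenate(([-1.0], np.zeros(n_rows)))
    # v - sum_i x_i R1[i, j] <= 0 for each column j.
    A_ub = np.hstack((np.ones((n_cols, 1)), -R1.T))
    b_ub = np.zeros(n_cols)
    # The strategy must sum to one.
    A_eq = np.concatenate(([0.0], np.ones(n_rows))).reshape(1, -1)
    b_eq = [1.0]
    bounds = [(None, None)] + [(0.0, None)] * n_rows
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[0], res.x[1:]

def Value(R1):
    """Expected value of the zero-sum game to the row player at equilibrium."""
    return solve_zero_sum(R1)[0]

def Solve(R1):
    """Row player's equilibrium (possibly mixed) strategy."""
    return solve_zero_sum(R1)[1]

# Matching pennies: value 0, equilibrium strategy (0.5, 0.5).
R1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(Value(R1), Solve(R1))
```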
The algorithms described in this section all seek an equilibrium solution, and usually make few requirements on the behavior of the other agents. This fact is discussed further in Section 3.3.
Table 1: Algorithm: Shapley. Notice the algorithm is nearly identical to value iteration for MDPs, with the max operator replaced by the Value operator. The algorithm also shows that equilibria in stochastic games are solutions to Bellman-like equations. The algorithm's value function converges to $V^*$, which satisfies the following equations:
$$V^*(s) = \text{Value}\big(G(s, V^*)\big), \qquad G(s, V^*)_{a_1, a_2} = R_1(s, a_1, a_2) + \gamma \sum_{s'} T(s, a_1, a_2, s')\, V^*(s').$$
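A rough sketch of this value iteration scheme, assuming a known model and a zero-sum matrix game value routine such as the LP above, might look as follows; the data layout (R[s], T[s]) and function names are our assumptions rather than the paper's Table 1.

```python
import numpy as np

def shapley_value_iteration(R, T, gamma, matrix_game_value, iters=1000, tol=1e-8):
    """Value iteration for a two-player zero-sum stochastic game (Shapley [18]).

    R[s] is the |A1| x |A2| payoff matrix to player 1 in state s,
    T[s][a1, a2, s'] is the transition probability to s' under joint action (a1, a2),
    and matrix_game_value returns the equilibrium value of a zero-sum matrix game
    (e.g. the LP routine sketched earlier).
    """
    n_states = len(R)
    V = np.zeros(n_states)
    for _ in range(iters):
        V_new = np.empty(n_states)
        for s in range(n_states):
            # Build the temporal-differencing matrix game for state s:
            # immediate reward plus discounted value of the successor states.
            G = R[s] + gamma * T[s] @ V
            V_new[s] = matrix_game_value(G)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```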
Just as Shapley's algorithm is an extension of value iteration to stochastic games, Pollatschek & Avi-Itzhak [25] introduced an extension of policy iteration [9]. The algorithm is shown in Table 2. Each player selects the equilibrium policy according to the current value function, making use of the same temporal-differencing matrix, $G(s, V)$, as in Shapley's algorithm. The value function is then updated based on the actual rewards of following these policies. Like Shapley's algorithm, this algorithm also computes the equilibrium value function, from which the equilibrium policies can be derived. This algorithm, though, is only guaranteed to converge if the transition function, $T$, and discount factor, $\gamma$, satisfy a certain property. A similar algorithm by Hoffman & Karp [25], which alternates the players' policy updates (rather than performing them simultaneously), avoids the convergence problem, but requires explicit control over when the other agents in the environment are learning.
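Under the same assumptions, the policy-iteration-style scheme described above could be sketched as follows: extract equilibrium policies from the matrix games induced by the current value function, then evaluate those fixed policies exactly. The helper solve_matrix_game (returning both players' mixed strategies) and the data layout are assumed, not taken from the paper's Table 2, and, as noted above, this simultaneous version is not guaranteed to converge in general.

```python
import numpy as np

def policy_iteration_sg(R, T, gamma, solve_matrix_game, iters=100):
    """Policy-iteration-style scheme for a two-player zero-sum stochastic game.

    solve_matrix_game(G) must return the equilibrium mixed strategies (x, y)
    of the zero-sum matrix game G (for example via the LP sketched earlier).
    R[s] and T[s] are laid out as in the value iteration sketch above.
    """
    n_states = len(R)
    V = np.zeros(n_states)
    for _ in range(iters):
        # Improvement: each player plays the equilibrium of the induced matrix game.
        policies = [solve_matrix_game(R[s] + gamma * T[s] @ V) for s in range(n_states)]
        # Evaluation: value of following these fixed policies forever,
        # i.e. solve V = r_pi + gamma * P_pi V for the induced Markov chain.
        r_pi = np.array([policies[s][0] @ R[s] @ policies[s][1] for s in range(n_states)])
        P_pi = np.array([np.einsum('i,j,ijk->k', policies[s][0], policies[s][1], T[s])
                         for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return V, policies
```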
Table 3: Algorithm: Minimax-Q and Nash-Q. The difference between the algorithms is in the Value function used and in the values that are stored. Minimax-Q uses the linear programming solution for zero-sum games and Nash-Q uses the quadratic programming solution for general-sum games. Also, the values stored by Nash-Q are actually a vector of expected rewards, one entry for each player.

Minimax-Q does in fact converge to the stochastic game's equilibrium solution, assuming the other agent executes all of its actions infinitely often. This is true even if the other agent does not converge to the equilibrium, and so it provides an opponent-independent method for learning an equilibrium solution.
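A minimal tabular sketch of the Minimax-Q update just described (our code, not Littman's original [13]); the value_of_matrix_game argument would be the zero-sum LP Value sketched earlier, and Nash-Q would instead use a general-sum equilibrium solver and keep one such table of values per player.

```python
import numpy as np
from collections import defaultdict

class MinimaxQ:
    """Tabular Minimax-Q for a two-player zero-sum stochastic game.

    Q[s] is a |A1| x |A2| table of joint-action values for the learning agent;
    value_of_matrix_game computes the zero-sum equilibrium value (e.g. by LP).
    """
    def __init__(self, n_actions, n_opponent_actions, value_of_matrix_game,
                 gamma=0.9, alpha=0.1):
        self.Q = defaultdict(lambda: np.zeros((n_actions, n_opponent_actions)))
        self.value = value_of_matrix_game
        self.gamma, self.alpha = gamma, alpha

    def update(self, s, a, o, reward, s_next):
        """Temporal-difference update after taking action a while the opponent
        took action o, receiving reward, and reaching state s_next."""
        target = reward + self.gamma * self.value(self.Q[s_next])
        self.Q[s][a, o] += self.alpha * (target - self.Q[s][a, o])
```

Action selection would then solve the matrix game Q[s] (with some exploration) rather than taking a simple argmax, which is what makes the learned policy possibly mixed.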
(Table 4 spans here; only fragments of its layout survive: the column headings "MG" and "Temporal Differencing", matrix-game entries LP, LP, LP, Nash, and FP, and the game theory rows Shapley, Pollatschek & Avi-Itzhak, Van der Wal [25], and Fictitious Play.)
Table 4: Summary of algorithms to solve stochastic games. Each contains a matrix game solving component and a temporal differencing component. Game theory algorithms assume the transition and reward functions are known. RL algorithms only receive observations of the transition and reward functions.

3.2.2 Nash-Q

Hu & Wellman [10] extended the Minimax-Q algorithm to general-sum games. The algorithm is structurally identical and is also shown in Table 3. The extension requires that each agent maintain values for all the other agents. Also, the linear programming solution used to find the equilibrium of zero-sum games is replaced with the quadratic programming solution for finding an equilibrium in general-sum games.

This algorithm is the first to address the complex problem of general-sum games, but it requires a number of very limiting assumptions. The most restrictive of these limits the structure of all the intermediate matrix games faced while learning (i.e., each must have a global optimum or a saddle point). The largest difficulty is that it is impossible to predict whether this assumption will remain satisfied while learning [3]. The other assumption to note is that the game must have a unique equilibrium, which is not always true of general-sum stochastic games. This is necessary since the algorithm strives for the opponent-independence property of Minimax-Q, which allows the algorithm to converge almost regardless of the other agents' actions. With multiple equilibria it is important for all the agents to play the same equilibrium in order for it to have its reinforcing properties, so learning independently is not possible.
3.3 Observations
A number of observations can be drawn from these algorithms. The first concerns the structure of the algorithms: they all have a matrix game solving component and a temporal differencing component. A summary of the components making up these algorithms can be seen in Table 4. The first thing to note is the lack of an RL algorithm with a multiple step backup (i.e., policy iteration or TD($\lambda$)). One complication for multi-step backup techniques is that they are on-policy, that is, they require that the agent be executing the policy it is learning. In a multiagent environment this would require that the other agents also be learning on-policy. Despite this limitation, it is still very suggestive of new algorithms that may be worth investigating. Also, there has been work on off-policy TD($\lambda$) [15, 26] that may be useful in overcoming this limitation.

A second observation is that proving convergence to an equilibrium solution is not trivial. For convergence, almost all the algorithms assume a zero-sum game, and the Pollatschek & Avi-Itzhak algorithm requires stringent assumptions on the game's transition function. The only other algorithm, Nash-Q, although it does not assume zero-sum games, still requires that the game have a unique equilibrium along with additional assumptions about the payoffs.
The final observation is that all of these algorithms use a closed-form solution for matrix games (i.e., linear programming or quadratic programming). Although this provides what appear to be opponent-independent algorithms for finding equilibria, it automatically rules out stochastic games with multiple equilibria. If a game has multiple equilibria, the optimal policy must depend on the policies of the other agents. Values of states therefore depend on the other agents' policies, and static opponent-independent algorithms simply will not work. In the next section we examine algorithms that do not use a static matrix game solver.
4 Other Solutions
We now examine algorithms that use some form of learning as the matrix game solving component. There are two reasons to consider a matrix game learner rather than a closed-form solution. The first was mentioned in the previous section: selecting between multiple equilibria requires observing and adapting to the other players' behavior. It is important to point out that observing the other players' behavior cannot be a step that follows a static learning algorithm. Equilibrium selection affects both the agent's policy at a state and that state's value. The value of a state in turn affects the equilibria of states that transition into it. In fact, the number of equilibrium solutions in stochastic games can grow exponentially with the number of states, and so equilibrium selection after learning is not feasible. The second reason is that using an opponent-dependent matrix game solver allows us to examine the problem of behaving when the other agents have limitations, physical or rational. These limitations might prevent such agents from playing certain equilibrium strategies, which the static algorithms would not be able to exploit. Handling these situations is important since solving large problems often requires generalization and approximation techniques that impose learning limitations [1, 4, 8, 23]. Here we present two similar algorithms, one from game theory (fictitious play) and the other from reinforcement learning (opponent modeling), that depend on the other players' behavior. Both of these algorithms are only capable of playing deterministic policies, and therefore can only converge to pure strategy equilibria. Despite this fact, they still have interesting properties for zero-sum games that have a mixed policy equilibrium.
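As background for both, here is a minimal sketch (ours) of classical fictitious play in a single matrix game: each player tracks empirical counts of the opponent's actions and plays a pure best response to the empirical mixture. In zero-sum games the empirical frequencies converge to an equilibrium (Robinson [16]), even though every individual action is deterministic, which mirrors the properties noted above.

```python
import numpy as np

def fictitious_play(R1, iterations=10000):
    """Classical fictitious play in a two-player zero-sum matrix game.

    Each player counts the opponent's past actions and plays a (pure) best
    response to the empirical mixed strategy those counts define. The
    empirical frequencies of play converge to an equilibrium in zero-sum
    games, even though every individual action is deterministic.
    """
    n1, n2 = R1.shape
    counts1 = np.ones(n1)   # how often player 1 has played each action
    counts2 = np.ones(n2)   # how often player 2 has played each action
    for _ in range(iterations):
        # Best responses to the opponent's empirical strategy.
        a1 = np.argmax(R1 @ (counts2 / counts2.sum()))
        a2 = np.argmin((counts1 / counts1.sum()) @ R1)   # player 2 minimizes R1
        counts1[a1] += 1
        counts2[a2] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

# Matching pennies: the empirical frequencies approach the (0.5, 0.5) equilibrium.
R1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(fictitious_play(R1))
```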
Table 5: Algorithm: Fictitious play for two-player, zero-sum stochastic games using a model.

The opponent modeling approach from reinforcement learning (Table 6) maintains, for each state, empirical distributions of the other agent's actions. These distributions, combined with joint-action values learned by standard temporal differencing, are used to select an action. Uther & Veloso [22] investigated this algorithm in the context of a fully competitive domain, and Claus & Boutilier [5] examined it for fully collaborative domains. The algorithm has very similar behavior to fictitious play. It does require observations of the opponent's actions, but not of their individual rewards. Like fictitious play, its empirical distribution of play may converge to an equilibrium solution, but its action selection is deterministic and it cannot play a mixed strategy.

Table 6: Algorithm: Opponent Modeling Q-Learning.
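A minimal tabular sketch of this opponent modeling scheme (ours, not the paper's Table 6): per-state counts of the opponent's actions define an empirical distribution, joint-action Q-values are learned by standard temporal differencing, and actions are chosen greedily (hence deterministically) against that distribution. Names and parameters are illustrative.

```python
import numpy as np
from collections import defaultdict

class OpponentModelingQ:
    """Opponent modeling Q-learning for a two-agent stochastic game."""
    def __init__(self, n_actions, n_opponent_actions, gamma=0.9, alpha=0.1):
        self.Q = defaultdict(lambda: np.zeros((n_actions, n_opponent_actions)))
        # Counts start at one to avoid division by zero before any observation.
        self.counts = defaultdict(lambda: np.ones(n_opponent_actions))
        self.gamma, self.alpha = gamma, alpha

    def choose_action(self, s):
        """Greedy (hence deterministic) best response to the empirical
        distribution of the opponent's play in state s."""
        opponent_dist = self.counts[s] / self.counts[s].sum()
        return int(np.argmax(self.Q[s] @ opponent_dist))

    def update(self, s, a, o, reward, s_next):
        """Record the opponent's action o and apply a TD update to the
        joint-action value Q(s, a, o)."""
        self.counts[s][o] += 1
        next_dist = self.counts[s_next] / self.counts[s_next].sum()
        v_next = np.max(self.Q[s_next] @ next_dist)
        self.Q[s][a, o] += self.alpha * (reward + self.gamma * v_next - self.Q[s][a, o])
```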
5 Conclusion
This paper examined a number of different algorithms for solving stochastic games from both the game theory and reinforcement learning communities. The algorithms differ in their assumptions, such as whether a model is available, or what behavior or control is required of the other agents. But the algorithms also have strong similarities in the division of their matrix game and temporal differencing components.

This paper also points out a number of areas for future work. There currently is no multi-step backup algorithm for stochastic games. There is also no algorithm to find solutions to general-sum games with possibly many equilibria. Variants of fictitious play that learn policies based on the behavior of the other agents may be extended to these domains. Another related issue is algorithms that can handle less information, for example, not requiring observation or knowledge of the other agents' rewards, or not requiring observation of their actions. Another important area for future work is examining how to learn when the other agents may have some apparent physical or rational limitation (e.g. [24]). These limitations may come from an incorrect belief about the actions available to other agents, or about their reward function. More than likely, limitations on learning will be imposed to bias the agent for faster learning in large problems. These limitations often take the form of using a function approximator for the values or policy. Learning to cope with limited teammates or to exploit limited opponents is necessary for applying multiagent reinforcement learning to large problems.
References
[1] Leemon C. Baird and Andrew W. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems 11. MIT Press, 1999.
[2] Craig Boutilier. Planning, learning and coordination in multiagent decision processes. In Proceedings of the Sixth Conference on the Theoretical Aspects of Rationality and Knowledge, pages 195–210, Amsterdam, Netherlands, 1996.
[3] Michael Bowling. Convergence problems of general-sum multiagent reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 89–94, Stanford University, June 2000. Morgan Kaufmann.
[4] J. A. Boyan and A. W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Advances in Neural Information Processing Systems 7. The MIT Press, 1995.
[5] Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, Menlo Park, CA, 1998. AAAI Press.
[6] Jerzy Filar and Koos Vrieze. Competitive Markov Decision Processes. Springer Verlag, New York, 1997.
[7] Drew Fudenberg and David K. Levine. The Theory of Learning in Games. The MIT Press, 1999.
[8] Geoff Gordon. Approximate Solutions to Markov Decision Problems. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, 1999.
[9] Ronald A. Howard. Dynamic Programming and Markov Processes. The MIT Press, 1960.
[10] Junling Hu and Michael P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 242–250, San Francisco, 1998. Morgan Kaufmann.
[11] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[12] Harold W. Kuhn, editor. Classics in Game Theory. Princeton University Press, 1997.
[13] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, pages 157–163. Morgan Kaufmann, 1994.
[14] Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press, 1994.
[15] J. Peng and R. J. Williams. Incremental multi-step Q-learning. Machine Learning, 22:283–290, 1996.
[16] Julia Robinson. An iterative method of solving a game. Annals of Mathematics, 54:296–301, 1951. Reprinted in [12].
[17] Sandip Sen, editor. Adaptation, Coevolution and Learning in Multiagent Systems: Papers from the 1996 AAAI Spring Symposium. AAAI Press, Menlo Park, CA, March 1996. AAAI Technical Report SS-96-01.
[18] L. S. Shapley. Stochastic games. PNAS, 39:1095–1100, 1953. Reprinted in [12].
[19] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 2000. In press.
[20] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning. The MIT Press, 1998.
[21] Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, Amherst, MA, 1993.
[22] William Uther and Manuela Veloso. Adversarial reinforcement learning. Technical report, Carnegie Mellon University, 1997. Unpublished.
[23] William T. B. Uther and Manuela M. Veloso. Tree based discretization for continuous state space reinforcement learning. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, Menlo Park, CA, 1998. AAAI Press.
[24] Jose M. Vidal and Edmund H. Durfee. The moving target function problem in multi-agent learning. In Proceedings of the Third International Conference on Multi-Agent Systems (ICMAS-98), pages 317–324, 1998.
[25] O. J. Vrieze. Stochastic Games with Finite State and Action Spaces. Number 33 in CWI Tracts. CWI, Amsterdam, 1987.
[26] Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, 1989.
[27] Gerhard Weiß and Sandip Sen, editors. Adaptation and Learning in Multiagent Systems. Springer Verlag, Berlin, 1996.