
Modelling MAS with Finite Analytic Stochastic Processes

2008, AISB 2008 Convention Communication, …


Modelling MAS with Finite Analytic Stochastic Processes
Luke Dickens, Krysia Broda and Alessandra Russo¹

Abstract. The Multi-Agent paradigm is becoming increasingly popular as a way of capturing complex control processes with stochastic properties. Many existing modelling tools are not flexible enough for these purposes, possibly because many of the available modelling frameworks inherit their structure from single agent frameworks. This paper proposes a new family of modelling frameworks called FASP, which is based on state encapsulation and is powerful enough to capture multi-agent domains. It identifies how the FASP is more flexible and describes systems more naturally than other approaches, demonstrating this with a number of robot football (soccer) formulations. This is important because more natural descriptions give more control when designing the tasks against which a group of agents' collective behaviour is evaluated and regulated.

1 Introduction

Modelling stochastic processes is a time consuming and complicated task that involves many issues of concern. Among these, some that may appear high on most modellers' lists are highlighted below. What type of language to use for building the model is often among the first decisions to make. Models are often finite state but they represent real-life continuous domains, so approximations are needed. A common problem is deciding what approximations to make: it is often difficult to determine how drastically a model will suffer, in terms of results, from making choice A rather than choice B; moreover, if the language is restrictive or arcane it can hamper matters further. Furthermore, the modeller may wish to relate small problems to more complex tasks, especially if they want to partially solve or learn solutions on the simpler systems, and then import them to larger and more complicated ones². A related issue is one of tuning — a system may have some parameters which are hard to anticipate; instead, experimentation and incremental changes might be needed, so a language which supports natural representations with tunable features is desirable. Additionally, it is preferable to have as many analytic and learning tools available as is realistically possible. This means that the modelling framework should either support these directly or allow models to be transformed into frameworks which support them. Finally, single agent control paradigms may not be sufficient, nor may single real valued measures for evaluation; the modern experimenter may want more flexibility, potentially using multiple non-cooperative, isolated agents, each with a number of orthogonal constraints.

This paper highlights some of the issues above, and aims to address them by proposing a family of modelling frameworks known as Finite Analytic Stochastic Processes (FASP). We will demonstrate: how the multi-agent scenarios alluded to above can be described within this family; the naturally descriptive nature of the representations thus produced; and which real world applications might benefit. We highlight some of the features of this family by examining an often visited problem domain in the literature, that of robot soccer³. To do this, we introduce a relatively simple problem domain first developed by Littman [16].

¹ Imperial College London, South Kensington Campus, UK, email: luke.dickens@imperial.ac.uk
² This step-wise training is sometimes called shaping, and is an active area of research, see [7, 14, 22].
³ Or, as we like to call it outside the US and Australia, robot football.
We will show how this can be rewritten in the FASP framework in a number of ways, with each one giving more flexibility to re-tune in order to examine certain implicit choices of Littman's. Then we will examine some other more recent robot soccer models from the related literature, which extend and add more interest to Littman's original formulation. We will show that these can also be represented by the FASP family, and will show how our more flexible framework can be exploited here. Ultimately, we generate a tunable turn-based non-cooperative solution, which can be scaled up to a finer granularity of pitch size and a greater number of players with simultaneous team and independent objectives. We finish with a discussion of the added value provided by the FASP languages, who might benefit from them, and some of the new challenges this presents to the reinforcement learning community (amongst others). The guiding principle here is that richer domains and evaluation criteria, provided by better modelling tools, allow us to regulate agent and group behaviour in more subtle ways. It is expected that advanced learning techniques will be required to develop strategies concordant with these more sophisticated demands. This paper is a stepping stone on that longer journey.

2 Preliminaries

This section introduces the concepts of the stochastic map and the stochastic function, and uses them to construct a modelling framework for stochastic processes. Initially, we formalise a way of probabilistically mapping from one finite set to another.

Definition 1 (Stochastic Map) A stochastic map m, from finite independent set X to finite dependent set Y, maps all elements in X probabilistically to elements in Y, i.e. m : X → PD(Y), where PD(Y) is the set of all probability distributions over Y. The set of all such maps is Σ(X → Y); notationally, the undecided probabilistic outcome of m given x is m(x), and the following shorthand is defined: Pr(m(x) = y) = m(y|x). Two such maps, m1, m2 ∈ Σ(X → Y), identical for each such conditional probability, i.e. (∀x, y)(m1(y|x) = m2(y|x)), are said to be equal, i.e. m1 = m2.

The stochastic map is a generalisation of the functional map, see [8]. Another method for generating outcomes probabilistically is the stochastic function, which relies on the concept of the probability density function (PDF). For readers unfamiliar with the PDF, a good definition can be found in [19].

Definition 2 (Stochastic Function) A stochastic function f, from finite independent set X to the set of real numbers R, maps all elements in X probabilistically to the real numbers, i.e. f : X → PDF(R). The set of all such functions is Φ(X → R); notationally f(x) = F_x, where F_x(y) is the PDF associated with x and belonging to f; f(x) is also used to represent the undecided probabilistic outcome of F_x (it should be clear from the context which meaning is intended); the probability that this outcome lies between two bounds α1, α2 ∈ R, α1 < α2, is denoted by [f(x)]_{α1}^{α2}. Two functions, f, g ∈ Φ(X → R), identical for every PDF mapping, i.e. (∀x, α1, α2)([f(x)]_{α1}^{α2} = [g(x)]_{α1}^{α2}), are said to be equal, i.e. f = g; if the two functions are inversely equal for each PDF mapping, i.e. (∀x, α1, α2)([f(x)]_{α1}^{α2} = [g(x)]_{−α2}^{−α1}), then the stochastic functions are said to be inversely equal, i.e. f = −g.
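To make Definition 1 concrete, the following minimal sketch (ours, not from the paper; the class and method names are our own) shows one way a stochastic map over finite sets might be stored, queried and sampled in Python:

import random

class StochasticMap:
    """A stochastic map m : X -> PD(Y), stored as conditional probabilities m(y|x)."""

    def __init__(self, table):
        # table: dict mapping each x to a dict {y: probability}, each row summing to 1
        self.table = table

    def prob(self, y, x):
        """Return m(y|x), the probability that m(x) = y."""
        return self.table[x].get(y, 0.0)

    def sample(self, x):
        """Draw an outcome of m(x) from the distribution associated with x."""
        ys, ps = zip(*self.table[x].items())
        return random.choices(ys, weights=ps, k=1)[0]

# Example: a noisy 'move east' action on a three-cell corridor.
move_east = StochasticMap({
    "c0": {"c1": 0.9, "c0": 0.1},
    "c1": {"c2": 0.9, "c1": 0.1},
    "c2": {"c2": 1.0},
})
print(move_east.prob("c1", "c0"))  # 0.9
print(move_east.sample("c0"))      # 'c1' with probability 0.9

A stochastic function could be sketched analogously, with each x mapped to a sampler over the reals rather than to a finite distribution.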
Armed with these two stochastic generators, we can formalise the Finite Analytic Stochastic Process (FASP), an agent oriented modelling framework.

Definition 3 (FASP) A FASP is defined by the tuple (S, A, O, t, ω, F, i, Π), where the parameters are as follows: S is the finite state space; A is the finite action space; O is the finite observation space; t ∈ Σ(S × A → S) is the transition function that defines the actions' effects on the system; ω ∈ Σ(S → O) is the observation function that generates observations; F = {f^1, f^2, ..., f^N} is the set of measure functions, where for each i, f^i ∈ Φ(S → R) generates a real valued measure signal, f^i_n, at each time-step n; i ∈ Σ(∅ → S) is the initialisation function that defines the initial system state; and Π is the set of all control policies available.

Broadly similar to MDP style constructions, the FASP can be used to build models representing agent interactions with some environmental system, in discrete time-steps. The FASP allows an observation function, which probabilistically generates an observation in each state, and multiple measure signals (analogous to the MDP's reward signal), which generate a measure at each state. One important feature of the FASP is that it generates observations and measures from state information alone. There is no in-built interpretation of the measure functions; their meaning, and any preferred constraints on their output, are left to be imposed by the modeller. Care needs to be taken when interpreting the measure signals and setting the desired output. These can be considered separate reward functions for separate agents (see Section 3), or conflicting constraints on the same agent. A FASP does not have a natively implied solution (or even set of solutions): there can be multiple policies that satisfy an experimenter's constraints, or there may be none — this is a by-product of the FASP's flexibility. Fortunately, small problems (those that are analytically soluble) can be solved in two steps: first by finding the policy dependent state occupancy probabilities, and then by solving the expected output from the measure functions. This means that measure function interpretation can be imposed late and/or different possibilities can be explored without having to resolve the state probabilities again [8].

If we confine ourselves to a purely reactive policy space, i.e. where action choices are based solely on the most recent observation, hence Π = Σ(O → A), and a policy π ∈ Π is fixed, then the dynamics of this system resolve into a set of state to state transition probabilities, called the Full State Transition Function.

Definition 4 (FASP Full State Transition Function) The FASP M = (S, A, O, t, ω, F, i, Π), with Π = Σ(O → A), has full state transition (FST) function τ_M : Π → Σ(S → S), where for each π ∈ Π, τ_M(π) = τ^π_M (written τ^π when M is clear), and, for all s, s' ∈ S,

    τ^π_M(s'|s) = ∑_{o ∈ O} ∑_{a ∈ A} ω(o|s) π(a|o) t(s'|s, a).

It is the FST function that concisely defines the policy dependent stochastic dynamics of the system, each fixed policy identifying a Markov chain, and together giving a policy labelled family of Markov chains. A more detailed examination of FASPs (including the multi-agent derivatives appearing later in this paper) can be found in [8].
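As an illustration of Definition 4, the sketch below (our own; the array layouts are an assumption, not a prescription of the paper) computes the FST matrix τ^π for a purely reactive policy by marginalising over observations and actions:

import numpy as np

def full_state_transition(t, omega, pi):
    """Compute tau^pi(s'|s) = sum_o sum_a omega(o|s) pi(a|o) t(s'|s,a).

    t:     array of shape (S, A, S), t[s, a, s2] = Pr(s2 | s, a)
    omega: array of shape (S, O),    omega[s, o] = Pr(o | s)
    pi:    array of shape (O, A),    pi[o, a]    = Pr(a | o)
    Returns an (S, S) row-stochastic matrix tau[s, s2].
    """
    action_given_state = omega @ pi                      # (S, A): Pr(a|s) under pi
    tau = np.einsum('sa,sap->sp', action_given_state, t)  # sum over actions
    return tau

# Each row of tau sums to 1; tau is the Markov chain induced by fixing the policy pi.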
3 Multi-Agent Settings

This section shows how the FASP framework can be naturally extended to multi-agent settings. We distinguish two domains: models with simultaneous joint observations and subsequent actions by synchronised agents, and models with asynchronous agents forced to take turns observing and acting on the environment.

To model the first situation, we define the action space as a tuple of action spaces, each part specific to one agent; similarly, the observation space is a tuple of agent observation spaces. At every time-step, the system generates a joint observation, delivers the relevant parts to each agent, then combines their subsequent action choices into a joint action which then acts on the system. Any process described as a FASP constrained in such a way is referred to as a Synchronous Multi-Agent (SMA)FASP.

Definition 5 (Synchronous Multi-Agent FASP) A synchronous multi-agent FASP (SMAFASP) is a FASP, with a set of enumerated agents, G, and the added constraints that: the action space A is a Cartesian product of action subspaces, A^g, one for each g ∈ G, i.e. A = ∏_{g∈G} A^g; the observation space O is a Cartesian product of observation subspaces, O^g, one for each g ∈ G, i.e. O = ∏_{g∈G} O^g; and the policy space Π can be rewritten as a Cartesian product of sub-policy spaces, Π^g, one for each agent g, i.e. Π = ∏_{g∈G} Π^g, where Π^g generates agent specific actions from A^g, using previous such actions from A^g and observations from O^g.

To see how a full policy space might be partitioned into agent specific sub-policy spaces, consider a FASP with purely reactive policies; any constrained policy π ∈ Π (= Σ(O → A)) can be rewritten as a vector of sub-policies π = (π^1, π^2, ..., π^{|G|}), where for each g ∈ G, Π^g = Σ(O^g → A^g). Given some vector observation ~o_i ∈ O, ~o_i = (o^1_i, o^2_i, ..., o^{|G|}_i), and vector action ~a_j ∈ A, ~a_j = (a^1_j, a^2_j, ..., a^{|G|}_j), the following is true,

    π(~a_j | ~o_i) = ∏_{g ∈ G} π^g(a^g_j | o^g_i).

In all other respects ~o_i and ~a_j behave as an observation and an action in a FASP. The FST function depends on the joint policy; otherwise it is exactly as for the FASP. Note that the above example is simply for illustrative purposes: Definition 5 does not restrict itself to purely reactive policies. Many POMDP style multi-agent examples in the literature could be formulated as synchronous multi-agent FASPs (and hence as FASPs), such as those found in [6, 12, 13, 20], although the FASP formulation is more flexible, especially compared to frameworks that only allow a single reward shared amongst agents, as used in [20].
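A minimal sketch of this factorisation (our own illustration, assuming dictionary-based reactive sub-policies): the probability of a joint action given a joint observation is simply the product of the per-agent policy probabilities.

def joint_policy_prob(sub_policies, joint_obs, joint_action):
    """Probability of a joint action under independent reactive sub-policies.

    sub_policies: list of per-agent policies, one per agent g, each a dict
                  mapping an agent observation o_g to a dict {a_g: Pr(a_g|o_g)}
    joint_obs:    tuple (o_1, ..., o_|G|)
    joint_action: tuple (a_1, ..., a_|G|)
    """
    prob = 1.0
    for pi_g, o_g, a_g in zip(sub_policies, joint_obs, joint_action):
        prob *= pi_g[o_g].get(a_g, 0.0)   # pi^g(a^g | o^g)
    return prob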
In real world scenarios with multiple rational decision makers, the likelihood that each decision maker chooses actions in step with every other, or even as often as every other, is small. Therefore it is natural to consider an extension to the FASP framework that allows each agent to act independently, yielding an Asynchronous Multi-Agent FASP (AMAFASP). Here agents' actions affect the system as in the FASP, but any pair of action choices by two different agents are strictly non-simultaneous: each action is either before or after every other. This process description relies on the FASP's state encapsulated nature, i.e. all state information is accessible to all agents, via the state description.

Definition 6 (AMAFASP) An AMAFASP is the tuple (S, G, A, O, t, ω, F, i, u, Π), where the parameters are as follows: S is the state space; G = {g_1, g_2, ..., g_{|G|}} is the set of agents; A = ⊎_{g∈G} A^g is the action set, a disjoint union of agent specific action spaces; O = ⊎_{g∈G} O^g is the observation set, a disjoint union of agent specific observation spaces; t ∈ Σ(S × A → S) is the transition function and is the union function t = ⊎_{g∈G} t^g, where for each agent g the agent specific transition function t^g ∈ Σ((S × A^g) → S) defines the effect of that agent's actions on the system state; ω ∈ Σ(S → O) is the observation function and is the union function ω = ⊎_{g∈G} ω^g, where ω^g ∈ Σ(S → O^g) is used to generate an observation for agent g; F = {f^1, f^2, ..., f^N} is the set of measure functions; i ∈ Σ(∅ → S) is the initialisation function as in the FASP; u ∈ Σ(S → G) is the turn taking function; and Π = ∏_{g∈G} Π^g is the combined policy space as in the SMAFASP.

A formal definition of the union map and union function used here can be found in [8]; for simplicity these objects can be thought of as a collection of independent stochastic maps or functions. As with the FASP and SMAFASP, if we consider only reactive policies, i.e. Π^g = Σ(O^g → A^g) for each g, it is possible to determine the probability of any state-to-state transition as a function of the joint policy, and hence write a full state transition function.

Definition 7 (AMAFASP Full State Transition) An AMAFASP, M, with Π^g = Σ(O^g → A^g) for each g, has associated with it a full state transition function τ_M, given by,

    τ^π_M(s'|s) = ∑_{g ∈ G} ∑_{o^g ∈ O^g} ∑_{a^g ∈ A^g} u(g|s) ω^g(o^g|s) π^g(a^g|o^g) t^g(s'|s, a^g).

To our knowledge, there are no similar frameworks which model asynchronous agent actions within stochastic state dependent environments. This may be because, without state encapsulation, it would not be at all as straightforward. A discussion of the potential benefits of turn-taking appears in Appendix A. From this point on, the umbrella term FASP covers FASPs, SMAFASPs and AMAFASPs.

The multi-agent FASPs defined above allow for the full range of cooperation or competition between agents, dependent on our interpretation of the measure signals. This includes general-sum games, as well as allowing hard and soft requirements to be combined separately for each agent, or applied to groups. Our paper focuses on problems where each agent, g, is associated with a single measure signal, f^g_n, at each time-step n. Without further loss of generality, these measure signals are treated as rewards, r^g_n = f^g_n, and each agent g is assumed to prefer higher values of r^g_n over lower ones⁴.

⁴ It would be simple to consider cost signals rather than rewards by inverting the signs.

For readability we present the general-sum case first, and incrementally simplify. The general-sum scenario considers agents that are following unrelated agendas, and hence rewards are independently generated.

Definition 8 (General-Sum Scenario) A FASP is general-sum if, for each agent g, there is a measure function f^g ∈ Φ(S → R), and at each time-step n with the system in state s_n, g's reward signal is r^g_n = f^g_n, where each f^g_n is an outcome of f^g(s_n), for all g.

An agent's measure function is sometimes also called its reward function in this scenario.
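To tie Definitions 6–8 together, here is a sketch (ours; the data structures and names are assumptions, not part of the paper) of a single AMAFASP time-step under the general-sum scenario: measure every agent's reward in the current state, sample whose turn it is, generate that agent's observation and action, and apply its agent specific transition.

import random

def sample(dist):
    """Draw from a dict {outcome: probability}."""
    outcomes, probs = zip(*dist.items())
    return random.choices(outcomes, weights=probs, k=1)[0]

def amafasp_step(s, u, omega, policies, t, f):
    """One AMAFASP time-step in the general-sum scenario.

    u:        turn taking function, u[s] = {g: Pr(g|s)}
    omega:    per-agent observation functions, omega[g][s] = {o: Pr(o|s)}
    policies: per-agent reactive policies, policies[g][o] = {a: Pr(a|o)}
    t:        per-agent transitions, t[g][(s, a)] = {s2: Pr(s2|s, a)}
    f:        per-agent measure samplers, f[g](s) -> real number
    Returns the next state and a dict of per-agent rewards.
    """
    rewards = {h: f[h](s) for h in f}   # r^h_n drawn from f^h(s_n), as in Definition 8
    g = sample(u[s])                    # whose turn it is
    o = sample(omega[g][s])             # g's (possibly partial) observation
    a = sample(policies[g][o])          # g's action under its reactive sub-policy
    s_next = sample(t[g][(s, a)])       # agent specific transition t^g
    return s_next, rewards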
Another popular scenario, especially when modelling competitive games, is the zero-sum case, where the net reward across all agents at each time-step is zero. Here, we generate a reward for all agents and then subtract the average.

Definition 9 (Zero-Sum Scenario) A FASP is zero-sum if, for each agent g, there is a measure function f^g ∈ Φ(S → R), and at each time-step n with the system in state s_n, g's reward signal is r^g_n = f^g_n − f̄_n, where each f^g_n is an outcome of f^g(s_n) and f̄_n = (∑_h f^h_n) / |G|.

With deterministic rewards, the zero-sum case can be achieved with one less measure than there are agents, the final agent's measure being determined by the constraint that the rewards sum to 1 (see [4]). For probabilistic measures with more than two agents, there are subtle effects on the distribution of rewards, so to avoid this we generate rewards independently. If agents are grouped together, and within these groups always rewarded identically, then the groups are referred to as teams, and the scenario is called a team scenario.

Definition 10 (Team Scenario) A FASP is in a team scenario if the set of agents G is partitioned into some set of sets {G_j}_j, so G = ⊎_j G_j, and for each j there is a team measure function f^j. At some time-step n in state s_n, each team j's reward is r^j_n = f^j_n, where f^j_n is an outcome of f^j(s_n), and for all g ∈ G_j, r^g_n = r^j_n.

The team scenario above is general-sum, but can be adapted to a zero-sum team scenario in the obvious way. Other scenarios are possible, but we restrict ourselves to these. It might be worth noting that the team scenario with one team (modelling fully cooperative agents) and the two-team zero-sum scenario are those most often examined in the associated multi-agent literature, most likely because they can be achieved with a single measure function and are thus relatively similar to the single agent POMDP, see [1, 9, 10, 11, 15, 16, 18, 23, 24, 25].

4 Examples

This section introduces the soccer example originally proposed in [16], and revisited in [2, 3, 5, 20, 24]. We illustrate both the transparency of the modelling mechanisms and, ultimately, the descriptive power this gives us in the context of problems which attempt to recreate some properties of a real system. The example is first formulated below as it appears in Littman's paper [16], and then recreated in two different ways.

Formulation 1 (Littman's adversarial MDP soccer) An early model of MDP style multi-agent learning, referred to as soccer, the game is played on a 4 × 5 board of squares, with two agents (one of whom is holding the ball), and is zero-sum, see Fig. 1. Each agent is located at some grid reference, and chooses to move in one of the four cardinal points of the compass (N, S, E and W) or the H(old) action at every time-step. Two agents cannot occupy the same square. The state space, S_L1, is of size 20 × 19 = 380, and the joint action space, A_L1, is a Cartesian product of the two agents' action spaces, A^A_L1 × A^B_L1, and is of size 5 × 5 = 25. A game starts with the agents in random positions in their own halves. The outcome for some joint action is generated by determining the outcome of each agent's action separately, in a random order, and is deterministic otherwise. An agent moves when unobstructed and does not when obstructed. If the agent with the ball tries to move into a square occupied by the other agent, then the ball changes hands. If the agent with the ball moves into the goal, then the game is restarted and the scoring agent gains a reward of 1 (the opposing agent getting −1). Therefore, other than the random turn ordering, the game mechanics are deterministic.

[Figure 1: The MDP adversarial soccer example from [16].]
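The random-order resolution of a joint action described in Formulation 1 can be sketched as follows (our own illustration; the helper resolve_single(state, agent, action) is hypothetical and stands in for Littman's deterministic per-agent move rules):

import random

def resolve_joint_action(state, actions, resolve_single):
    """Resolve a joint action by applying each agent's action in a random order.

    actions:        dict {agent: action}, e.g. {"A": "E", "B": "S"}
    resolve_single: deterministic rule (state, agent, action) -> new state,
                    handling blocking, ball exchange and goals
    """
    order = list(actions)
    random.shuffle(order)      # the random turn ordering of Formulation 1
    for agent in order:
        state = resolve_single(state, agent, actions[agent])
    return state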
[Figure 2 (a), (b): If agent A chooses to go East and agent B chooses to go South when diagonally adjacent as shown, the outcome will be non-deterministic, depending on which agent's move is calculated first. In the original Littman version this was achieved by resolving agent specific actions in a random order, fig. (a). In the SMAFASP version this is translated into a flat transition function with probabilities on arcs (square brackets), fig. (b).]

The Littman soccer game can be recreated as a SMAFASP with very little work. We add a couple more states for the reward function, add a trivial observation function, and flatten the turn-taking into the transition function.

Formulation 2 (The SMAFASP soccer formulation) The SMAFASP formulation of the soccer game is formed as follows: the state space is S_L2 = S_L1 ∪ {s*_A, s*_B}, where s*_g is the state immediately after g scores a goal; the action space is A_L2 = A_L1; the observation space is O_L2 = S_L2; the observation function is the identity mapping; and the measure/reward function for agent A, f^A ∈ Φ(S_L2 → R), is as follows: f^A(s*_A) = 1, f^A(s*_B) = −1, and f^A(s) = 0 for all other s ∈ S_L2.

Agent B's reward function is simply the inverse, i.e. f^B(s) = −f^A(s), for all s. To define the transition function, we need first to imagine that Littman's set of transitions were written as agent specific transition functions, t^g_L1 ∈ Σ(S_L1 × A^g_L1 → S_L1) for each agent g — although this is not explicitly possible within the framework he uses, his description suggests he did indeed do this. The new transition function, t_L2 ∈ Σ(S_L2 × A_L2 → S_L2), would then be defined, for all s_n, s_{n+1} ∈ S_L1, a^A_n ∈ A^A_L2, a^B_n ∈ A^B_L2, in the following way,

    t_L2(s_{n+1} | s_n, (a^A_n, a^B_n)) = (1/2) ∑_{s ∈ S_L1} [ t^A_L1(s | s_n, a^A_n) · t^B_L1(s_{n+1} | s, a^B_n) + t^B_L1(s | s_n, a^B_n) · t^A_L1(s_{n+1} | s, a^A_n) ].

The transition probabilities involving the two new states, s*_A and s*_B, would be handled in the expected way. The turn-taking is absorbed, so that the random order of actions within a turn is implicit within the probabilities of the transition function (see Fig. 2(b)), rather than, as before, being a product of the implicit ordering of agent actions (as in Fig. 2(a)).

It is possible to reconstruct Littman's game in a more flexible way. To see how, it is instructive to first examine what Littman's motives may have been in constructing this problem, which may require some supposition on our part. Littman's example is of particular interest to the multi-agent community, in that there is no independently optimal policy for either agent; instead, each policy's value is dependent on the opponent's policy, so each agent is seeking a policy referred to as the best response to the other agent's policy. Each agent is further limited by Littman's random order mechanism, see Fig. 2(a), which means that while one agent in each turn is choosing an action based on current state information, in effect the second agent to act is basing its action choice on state information that is one time-step out of date; and because this ordering is random, neither agent can fully rely on the current state information. Littman doesn't have much control over this turn-taking, and as can be seen from the SMAFASP formulation, the properties of this turn-taking choice can be incorporated into the transition function probabilities (see Fig. 2(b)). Different properties would lead to different probabilities, and would constitute a slightly different system with possibly different solutions.
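A sketch of this flattening (ours, under the assumption that each agent's deterministic move rule is available as a hypothetical function det_move(state, agent, action) standing in for t^g_L1): the flattened probability of each successor is the average over the two possible resolution orders, exactly as in the equation for t_L2 above.

from collections import defaultdict

def flatten_transition(state, a_A, a_B, det_move):
    """Flattened SMAFASP transition for one joint action (a_A, a_B).

    det_move(state, agent, action) -> successor state; assumed deterministic.
    Returns a dict {s_next: probability}, averaging the two resolution orders.
    """
    dist = defaultdict(float)
    # A moves first, then B (probability 1/2)
    s_after_A = det_move(state, "A", a_A)
    dist[det_move(s_after_A, "B", a_B)] += 0.5
    # B moves first, then A (probability 1/2)
    s_after_B = det_move(state, "B", a_B)
    dist[det_move(s_after_B, "A", a_A)] += 0.5
    return dict(dist)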
However, consider, for example, an agent close to its own goal, defending an attack. Its behaviour depends to some degree on where it expects to see its attacker next: the defender may wish to wait at one of these positions to ambush the other agent. The range of these positions depends on how many turns the attacker might take between these observations, which is in turn dependent on the turn-taking built into the system. For ease of reading, we introduce an intermediate AMAFASP formulation. The individual agent action spaces are as in the SMAFASP, as is the observation space, but the new state information is enriched with the positional states of the previous time-step, which in turn can be used to generate the observations for agents.

Formulation 3 (The first AMAFASP soccer formulation) This AMAFASP formulation of the soccer game, M_L3, is formed as follows: the new state space is S_L3 = S_L2 × S_L2, so a new state at some time-step n is given by the tuple (s_n, s_{n−1}), where s_{n−1}, s_n ∈ S_L2, recording the current and most recent positional states; there are two action spaces, one for each agent, A^A_L3 = A^A_L2 and A^B_L3 = A^B_L2; and two identical agent specific observation spaces, O^A_L3 = O^B_L3 = O_L2. The new agent specific transition functions, t^g_L3 ∈ Σ(S_L3 × A^g_L3 → S_L3), are defined, for all s_{n−1}, s_n, s'_n, s_{n+1} ∈ S_L2 and a^g_n ∈ A^g_L3, in the following way:

    t^g_L3((s_{n+1}, s'_n) | (s_n, s_{n−1}), a^g_n) = t^g_L1(s_{n+1} | s_n, a^g_n) if s'_n = s_n, and 0 otherwise,

where t^g_L1 represents agent g's deterministic action effects in Littman's example, as in Formulation 2. The goal states, s*_A and s*_B, are dealt with as expected. Recalling that O^g_L3 = S_L2, the observation function, ω^g_L3 ∈ Σ(S_L3 → O^g_L3), is generated, for all (s_n, s_{n−1}) ∈ S_L3, o_n ∈ O^g_L3, and g ∈ {A, B}, in the following way:

    ω^g_L3(o_n | (s_n, s_{n−1})) = 1 if s_{n−1} = o_n, and 0 otherwise.

The reward function is straightforward and left to the reader. Finally, we construct the turn-taking function u_L3 ∈ Σ(S_L3 → {A, B}), which simply generates either agent in an unbiased way at each time-step. The turn taking function is defined, for all (s_n, s_{n−1}) ∈ S_L3, as u_L3(A | (s_n, s_{n−1})) = u_L3(B | (s_n, s_{n−1})) = 1/2.

This does not fully replicate the Littman example, but satisfies the formulation in spirit, in that agents are acting on potentially stale positional information as well as dealing with an unpredictable opponent. In one sense, it better models hardware robots playing football, since all agents observe slightly out of date positional information, rather than a mix of some and not others. Both this and the Littman example do, however, share the distinction between turn ordering and game dynamics typified by Fig. 2(a); what is more, this is now explicitly modelled by the turn-taking function.
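A minimal sketch of Formulation 3's stale-observation mechanism (our own illustration; states are represented as (current, previous) pairs and det_move again stands in for t^g_L1): observations are read from the previous positional state, while transitions update the current one.

def observe_L3(state):
    """omega^g_L3: both agents observe the previous positional state."""
    current, previous = state
    return previous

def transition_L3(state, agent, action, det_move):
    """t^g_L3 on the augmented state: the previous slot becomes the old current state."""
    current, previous = state
    return (det_move(current, agent, action), current)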
To fully recreate the mix of stale and fresh observations seen in Littman's example, along with the constrained turn-taking, we need the state to include turn relevant information. This can be done with a tri-bit of information included with the other state information, to differentiate between: the start of a Littman time-step, when either agent could act next; when agent A has just acted in this time-step, so it must be B next; and vice versa, when A must act next. We label these situations l_0, l_B and l_A respectively. This has the knock-on effect that in l_0 labelled states the observation function is as in Formulation 2, while in l_A and l_B labelled states the stale observation is used, as in Formulation 3. Otherwise Formulation 4 is very much like Formulation 3.

Formulation 4 (The second AMAFASP soccer formulation) This AMAFASP formulation of the soccer game, M_L4, is formed as follows: there is a set of turn labels, L = {l_0, l_A, l_B}; the state space is a three way Cartesian product, S_L4 = S_L2 × S_L2 × L, where the parts can be thought of as current-positional-state, previous-positional-state and turn-label respectively; the action spaces and observation spaces are as before, i.e. A^g_L4 = A^g_L3 and O^g_L4 = O^g_L3, for each agent g; the transition and reward functions are straightforward and are omitted for brevity; the observation and turn taking functions are defined, for all (s_n, s_{n−1}, l_n) ∈ S_L4, o_n ∈ O^g_L4 and all agents g, in the following way,

    ω^g_L4(o_n | (s_n, s_{n−1}, l_n)) = 1 if o_n = s_n and l_n = l_0; 1 if o_n = s_{n−1} and l_n = l_g; 0 otherwise,

and

    u_L4(g | (s_n, s_{n−1}, l_n)) = 1/2 if l_n = l_0; 1 if l_n = l_g; 0 otherwise.

The above formulation recreates Littman's example precisely, and instead of the opaque turn-taking mechanism hidden in the textual description of the problem, it is transparently and explicitly modelled as part of the turn-taking function. So the Littman example can be recreated as a SMAFASP or an AMAFASP, but, more interestingly, both AMAFASP formulations, 3 and 4, can be tuned or extended to yield new, equally valid, formulations. What is more, the intuitive construction means that these choices can be interpreted more easily. Consider Formulation 3: the turn-taking function u can be defined to give different turn-taking probabilities at different states. For instance, if an agent is next to its own goal, we could increase its probability of acting (over the other agent being chosen) to reflect a defender behaving more fiercely when a loss is anticipated. Alternatively, if an agent's position has not changed since the last round but the other's has, then the first agent could be made more likely to act (possible because two steps of positional data are stored); this gives an advantage to the H(old) action, but otherwise encourages a loosely alternating agent mechanism. While Formulation 4 recreates the Littman example, it too can be adjusted to allow different choices of turn taking mechanism; in particular, it is now possible to enforce strictly alternating agents. This would be done by flipping from state label l_A to l_B, or vice versa, at each step transition, and otherwise keeping things very much as before. It is important to note that many specific models built in this way can be recreated by implicit encoding of probabilities within existing frameworks, but it is difficult to see how the experimenter would then interpret the group of models as being members of a family of related systems.
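To illustrate how the turn-label mechanism makes such tuning explicit (our own sketch; the string labels and helper names are assumptions), the turn taking function of Formulation 4 and a strictly alternating variant differ only in how the label is advanced:

import random

def turn_L4(state):
    """u_L4: at l_0 either agent acts with probability 1/2; at l_g agent g must act."""
    current, previous, label = state
    if label == "l0":
        return random.choice(["A", "B"])
    return "A" if label == "lA" else "B"

def next_label_littman(label, acting_agent):
    """Littman-style: after the first mover acts, force the other agent, then reset."""
    if label == "l0":
        return "lB" if acting_agent == "A" else "lA"
    return "l0"

def next_label_alternating(label, acting_agent):
    """Strictly alternating variant: always force the other agent to act next."""
    return "lB" if acting_agent == "A" else "lA"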
4.1 Flexible Behaviour Regulation

If we increase the number of players in our game, we can consider increasing the number of measure functions for a finer degree of control over desired behaviour. With just two agents competing in the Littman problem, it is difficult to see how to interpret any extra signals, and adding agents will increase the state space, and hence the policy size, radically. So, before we address this aspect of the FASP formalisms, it is useful to examine a more recent derivative soccer game, namely Peshkin et al.'s partially observable identical payoff stochastic game (POIPSG) version [20], which is more amenable to scaling up.

[Figure 3: The POIPSG cooperative soccer example from [20].]

Peshkin et al.'s example is illustrated in Fig. 3. There are two teammates, V1 and V2, and an opponent O; each agent has partial observability and can only see whether the four horizontally and vertically adjacent squares are occupied or not. Also, players V1 and V2 have an extra pass action when in possession of the ball. Otherwise the game is very much like Littman's, with joint actions being resolved for individual agents in some random order at each time-step. Contrary to expectations, the opponent is not modelled as a learning agent and does not receive a reward; instead the two teammates share a reward and learn to optimise team behaviour against a static opponent policy; for more details see [20].

As with its progenitor, the Peshkin example could be simply reworked as a SMAFASP, in much the same way as in Formulation 2. Recreating it as an AMAFASP is also reasonably straightforward; as before, the trick of including the previous step's positional state in the AMAFASP state representation allows us to generate stale observations, which are now also partial. As this is relatively similar to the Littman adaptations, the details are omitted. The focus here, instead, is to show that, in the context of Peshkin's example, a zero- or general-sum adaptation with more agents could have some utility. Firstly, agent O could receive the opposite of the shared reward of V1 and V2; this could be done without introducing another measure function, merely by reinterpreting Peshkin's reward function; the opponent could then learn to optimise against the cooperative team. More interestingly, the players V1 and V2 could be encouraged to learn different roles by rewarding V1 (say) more when the team scores a goal and penalising V2 more when the team concedes one; all this requires is a measure function for each agent and a zero-sum correction. Further, we can add a second opponent (giving, say, O1 and O2), either rewarding them equally or encouraging different roles as with V1 and V2. In this way we could explore the value of different reward structures by competing the teams against one another. If more agents were added, even up to 11 players a side and a much larger grid, the AMAFASP framework supports a much richer landscape of rewards and penalties, which can encourage individual roles within the team while still differentiating between good and bad team collaborations.
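The role-based reward idea above, combined with the zero-sum correction of Definition 9, might look as follows (our own sketch; the weights and event flags are illustrative assumptions, not values taken from the paper):

def role_based_rewards(scored, conceded, weights):
    """Per-agent raw measures for goal events, followed by a zero-sum correction.

    scored, conceded: booleans for the V team at this time-step
    weights: dict {agent: (goal_bonus, concede_penalty)}, e.g. V1 weighted more
             heavily for scoring and V2 more heavily for conceding
    Returns zero-mean rewards, as in Definition 9.
    """
    raw = {}
    for agent, (goal_bonus, concede_penalty) in weights.items():
        raw[agent] = goal_bonus * scored - concede_penalty * conceded
    mean = sum(raw.values()) / len(raw)
    return {agent: value - mean for agent, value in raw.items()}

# Example: V1 rewarded more for scoring, V2 penalised more for conceding,
# with the opponents O1 and O2 simply mirroring the team signal.
weights = {"V1": (2.0, 1.0), "V2": (1.0, 2.0), "O1": (-1.0, -1.0), "O2": (-1.0, -1.0)}
print(role_based_rewards(scored=True, conceded=False, weights=weights))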
5 Discussion

In this paper, we have examined a new family of frameworks, the FASP, geared towards modelling stochastic processes for one or many agents. We focus on the multi-agent aspects of this family, and show how a variety of different game scenarios can be explored within this context. Further, we have highlighted the difference between synchronous and asynchronous actions within the multi-agent paradigm, and shown how these choices are explicit within the FASP frameworks. As a package, this delivers what we consider to be a much broader tool-set for regulating behaviour within cooperative, non-cooperative and competitive problems. This is in sharp contrast to the more traditional pre-programmed expert systems approaches that attempt to prescribe agent intentions and interactions. The overarching motivation for our approach is to provide the modeller with transparent mechanisms; roughly, this means that all the pertinent mechanisms are defined explicitly within the model. Two such mechanisms are visited repeatedly in this paper, namely multiple measures and turn-taking, but they are not the only ones.

In fact, the FASP borrows from the MDP and POMDP frameworks a few (arguably) transparent mechanisms. The transition function is a good example, and the observation and reward functions as they are defined in the POMDP have a degree of transparency; we argue that strict state encapsulation improves upon this, though. The general agent-environment relationship is the same as in the POMDP, meaning existing analytic and learning tools can still be applied where appropriate⁵. Moreover, techniques for shaping and incremental learning developed for related single agent frameworks [14, 22] and multi-agent frameworks [7] can also be applied without radical changes.

⁵ Tools for evaluating the expected long term reward in a POMDP can be used without change within a FASP, except that multiple measure signals could be evaluated simultaneously. Maximisation procedures are still possible too, but care needs to be taken with what is maximised where multiple agents are maximising different measures independently [4, 26].

Other ideas for good transparent mechanisms exist in the literature, and the FASP family would benefit by incorporating the best of them. For instance, modelling extraneous actions/events can reduce the required size of a model, by allowing us to approximate portions of an overall system without trying to explicitly recreate the entire system. Imagine a FASP with an external event function X ∈ Σ(S → S), which specifies how an open system might change from time-step to time-step; a similar extraneous event function can be found in [21]. This transition might occur between the observation and action outcomes, incorporating staleness into the observations as with our examples in Section 4. More interestingly, this function could be part of an AMAFASP and be treated analogously to the action of a null agent; this would allow the turn-taking function to manage how rarely or often extraneous events occurred. Another candidate transparent mechanism can be found in [4], where a broken actuator is simulated by separating intended action from actual action with what amounts to a stochastic map between the two. We would encourage modellers to incorporate such mechanisms as needed.

There are other clear directions for future work: the tools outlined in this paper enable a variety of multi-agent problems, with a choice of measures for evaluation, but they leave out how these measures might be used to solve or learn desirable solutions. The terms general- and zero-sum are not used accidentally; the similarities with traditional game theory are obvious, but the differences are more subtle. The extensive form game in the game theoretic sense, as described in [17], enforces that a game has a beginning and an end, rewards only at the end of a game run, and players (agents) who do not forget any steps in their game; and it seeks to address the behaviour of intelligent players who have full access to the game's properties. While this can be defended as appropriate for human players, our paradigm allows for an ongoing game of synthetic agents with potentially highly restricted memory capabilities and zero prior knowledge of the game to be played; and rewards are calculated at every step. In fact, if we were to combine the AMAFASP with extraneous actions, it would constitute a generalisation of the extensive form game as described in [17], but we omit the proof here. It is sufficient to say that the FASP family demands a different solution approach than the extensive form game, and certainly a different approach than the easily understood reward-maximisation/cost-minimisation required by single-agent and cooperative MDP style problems. Some preliminary research has been done with respect to such systems [4, 11, 26]; we envisage this being an active area of research in the next few years, and hope that the FASP tool-set facilitates that study.
REFERENCES

[1] Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman, 'The complexity of decentralized control of Markov decision processes', in Proceedings of the 16th Annual Conference on Uncertainty in Artificial Intelligence (UAI-00), pp. 32–37, San Francisco, CA, (2000). Morgan Kaufmann.
[2] Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro, and Anna H. Reali Costa, 'Heuristic selection of actions in multiagent reinforcement learning', in IJCAI, ed., Manuela M. Veloso, pp. 690–695, (2007).
[3] Michael Bowling, Rune Jensen, and Manuela Veloso, 'A formalization of equilibria for multiagent planning', in Proceedings of the AAAI-2002 Workshop on Multiagent Planning, (August 2002).
[4] Michael Bowling and Manuela Veloso, 'Existence of multiagent equilibria with limited agents', Journal of Artificial Intelligence Research, 22, (2004).
[5] Michael H. Bowling, Rune M. Jensen, and Manuela M. Veloso, 'Multiagent planning in the presence of multiple goals', in Planning in Intelligent Systems: Aspects, Motivations and Methods, John Wiley and Sons, Inc., (2005).
[6] Michael H. Bowling and Manuela M. Veloso, 'Simultaneous adversarial multi-robot learning', in IJCAI, eds., Georg Gottlob and Toby Walsh, pp. 699–704. Morgan Kaufmann, (2003).
[7] Olivier Buffet, Alain Dutech, and François Charpillet, 'Shaping multi-agent systems with gradient reinforcement learning', Autonomous Agents and Multi-Agent Systems, 15(2), 197–220, (2007).
[8] Luke Dickens, Krysia Broda, and Alessandra Russo, 'Transparent modelling of finite stochastic processes for multiple agents', Technical Report 2008/2, Imperial College London, (January 2008).
[9] Alain Dutech, Olivier Buffet, and François Charpillet, 'Multi-agent systems by incremental gradient reinforcement learning', in IJCAI, pp. 833–838, (2001).
[10] Jerzy Filar and Koos Vrieze, Competitive Markov Decision Processes, Springer-Verlag New York, Inc., New York, NY, USA, 1996.
[11] P. Gmytrasiewicz and P. Doshi, 'A framework for sequential planning in multi-agent settings', (2004).
[12] Amy Greenwald and Keith Hall, 'Correlated Q-learning', in AAAI Spring Symposium Workshop on Collaborative Learning Agents, (2002).
[13] Junling Hu and Michael P. Wellman, 'Multiagent reinforcement learning: theoretical framework and an algorithm', in Proc. 15th International Conf. on Machine Learning, pp. 242–250. Morgan Kaufmann, San Francisco, CA, (1998).
[14] Adam Daniel Laud, Theory and Application of Reward Shaping in Reinforcement Learning, Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2004. Advisor: Gerald DeJong.
[15] Martin Lauer and Martin Riedmiller, 'An algorithm for distributed reinforcement learning in cooperative multi-agent systems', in Proc. 17th International Conf. on Machine Learning, pp. 535–542. Morgan Kaufmann, San Francisco, CA, (2000).
[16] Michael L. Littman, 'Markov games as a framework for multi-agent reinforcement learning', in Proceedings of the 11th International Conference on Machine Learning (ML-94), pp. 157–163, New Brunswick, NJ, (1994). Morgan Kaufmann.
[17] Roger B. Myerson, Game Theory: Analysis of Conflict, Harvard University Press, September 1997.
[18] Frans Oliehoek and Arnoud Visser, 'A hierarchical model for decentralized fighting of large scale urban fires', in Proceedings of the Fifth International Conference on Autonomous Agents and Multiagent Systems (AAMAS), eds., P. Stone and G. Weiss, Hakodate, Japan, (May 2006).
[19] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw Hill, 3rd edn., 1991.
[20] Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie P. Kaelbling, 'Learning to cooperate via policy search', in Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 307–314, San Francisco, CA, (2000). Morgan Kaufmann.
[21] Bharaneedharan Rathnasabapathy and Piotr Gmytrasiewicz, 'Formalizing multi-agent POMDPs in the context of network routing'.
[22] Mark B. Ring, 'CHILD: A first step towards continual learning', Machine Learning, 28, 77–104, (May 1997).
[23] Yoav Shoham, Rob Powers, and Trond Grenager, 'If multi-agent learning is the answer, what is the question?', Artificial Intelligence, 171(7), 365–377, (2007).
[24] William T. B. Uther and Manuela M. Veloso, 'Adversarial reinforcement learning', Technical report, Computer Science Department, Carnegie Mellon University, (April 1997).
[25] Erfu Yang and Dongbing Gu, 'Multiagent reinforcement learning for multi-robot systems: A survey', Technical report, University of Essex, (2003).
[26] Martin Zinkevich, Amy Greenwald, and Michael Littman, 'Cyclic equilibria in Markov games', in Advances in Neural Information Processing Systems 18, pp. 1641–1648, MIT Press, Cambridge, MA, (2005).

A The Benefits of Turn-Taking

Modelling with turn-taking is not only more natural in some cases; in certain cases it also allows for a more concise representation of a system. Consider a system with two agents, A and B, and 9 positional states (in a 3 × 3 grid), where the agents cannot occupy the same location (so there are 9 × 8 = 72 states), and movement actions are in the 4 cardinal directions for each agent. Let us first consider how this might be formulated without explicit turn taking. Imagine some prior-state, s1 (shown as a board diagram in the original). Imagine also that agent A chooses action E(ast) and agent B chooses action S(outh), making joint action <E, S>, and that these actions will be resolved in a random order. This results in one of three post-action states, call them s'_α, s'_β and s'_γ (also board diagrams in the original), with the following probabilities:

    Pr(s' = s'_α | s = s1, a = <E, S>) = α1,
    Pr(s' = s'_β | s = s1, a = <E, S>) = β1, and
    Pr(s' = s'_γ | s = s1, a = <E, S>) = γ1,    where α1 + β1 + γ1 = 1.

Let us assume that these probabilities are transparently modelled in terms of agents A and B, whose actions fail with probabilities Pr(A's action fails) = p and Pr(B's action fails) = q, acting in some order, where Pr(A is first) = r, and such that agents cannot move into an already occupied square. We can use these values to determine the flattened probabilities as

    α1 = (1 − p)(r + (1 − r)q),    β1 = pq,    and    γ1 = (1 − q)(pr + (1 − r)).

α1, β1 and γ1 are not independent (they represent only two independent variables), but they are fed by p, q and r, which are independent. It may seem that α1, β1 and γ1 model the situation more concisely, since any one combination of α1, β1 and γ1 could correspond to many different choices of p, q and r. However, this isn't the whole story. Consider instead a different prior-state, s2 (another board diagram in the original), and the combined action <S, E>. Treated naively, the possible outcomes might all be specified with separate probabilities (say α2, β2 and γ2) as before.
However, the modeller can, with transparent mechanisms, choose to exploit underlying symmetries of the system. If, as is quite natural, the turn ordering probabilities are independent of position, then no extra parameters need be specified for this situation: the original p, q and r are already sufficient to define the transition probabilities here too.
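A small sketch of this reuse (our own, assuming the Fig. 2 style situation in which both moves target the same empty square): the same three parameters p, q and r yield the flattened probabilities wherever that situation arises, so no per-state α, β and γ need be stored.

def flattened_probs(p, q, r):
    """Flattened outcome probabilities for two agents contesting the same square.

    p: probability that A's action fails
    q: probability that B's action fails
    r: probability that A acts first
    Returns (alpha, beta, gamma) for the three post-action states.
    """
    alpha = (1 - p) * (r + (1 - r) * q)   # A's move succeeds and is not pre-empted by B
    beta = p * q                          # both actions fail
    gamma = (1 - q) * (p * r + (1 - r))   # B's move succeeds and is not pre-empted by A
    assert abs(alpha + beta + gamma - 1.0) < 1e-9
    return alpha, beta, gamma

# The same p, q and r can be reused for symmetric configurations such as s2 above.
print(flattened_probs(p=0.1, q=0.2, r=0.5))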