Modelling MAS with Finite Analytic Stochastic Processes
Luke Dickens, Krysia Broda and Alessandra Russo1
Abstract. The Multi-Agent paradigm is becoming increasingly popular as a way of capturing complex control processes with stochastic
properties. Many existing modelling tools are not flexible enough
for these purposes, possibly because many of the modelling frameworks available inherit their structure from single agent frameworks.
This paper proposes a new family of modelling frameworks called
FASP, which is based on state encapsulation and powerful enough
to capture multi-agent domains. It identifies how the FASP is more
flexible, and describes systems more naturally than other approaches,
demonstrating this with a number of robot football (soccer) formulations. This is important because more natural descriptions give more
control when designing the tasks, against which a group of agents’
collective behaviour is evaluated and regulated.
1 Introduction
Modelling stochastic processes is a time-consuming and complicated
task that involves many issues of concern. Among these, some that
may appear high on most modellers’ list are highlighted below. What
type of language to use for building the model is often among the first
decisions to make. Models are often finite state but they represent
real-life continuous domains, so approximations are needed. A common problem is deciding which approximations to make: it is often difficult to determine how drastically a model's results will suffer from making choice A rather than choice B; moreover, if the language is restrictive or arcane, it can hamper this process further. Furthermore,
the modeller may wish to relate small problems to more complex
tasks, especially if they want to partially solve or learn solutions on
the simpler systems, and then import them to larger and more complicated ones². A related issue is one of tuning — a system may have
some parameters which are hard to anticipate; instead experimentation and incremental changes might be needed, so a language which
supports natural representations with tunable features is desirable.
Additionally, it is preferable to have as many analytic and learning
tools available as is realistically possible. This means that the modelling framework should either support these directly or allow models
to be transformed into frameworks which support them. Finally, single agent control paradigms may not be sufficient, nor may single real
valued measures for evaluation; the modern experimenter may want
more flexibility; potentially using multiple non-cooperative, isolated,
agents each with a number of orthogonal constraints.
This paper highlights some of the issues above, and aims to address them by proposing a family of modelling frameworks known as
Finite Analytic Stochastic Processes (FASP). We will demonstrate:
how the multi-agent scenarios alluded to above can be described
1 Imperial College London, South Kensington Campus, UK, email: luke.dickens@imperial.ac.uk
2 This step-wise training is sometimes called shaping, and is an active area of research; see [7, 14, 22].
within this family; the naturally descriptive nature of the representations thus produced; and which real world applications might benefit.
We highlight some of the features of this family by examining an often visited problem domain in the literature, that of robot soccer3 . To
do this, we introduce a relatively simple problem domain first developed by Littman [16]. We will show how this can be rewritten in the
FASP framework in a number of ways, with each one giving more
flexibility to re-tune in order to examine certain implicit choices of
Littman. Then we will examine some other more recent robot soccer
models from the related literature, which extend and add more interest to Littman’s original formulation. We will show that these can
also be represented by the FASP family, and will show how our more
flexible framework can be exploited here. Ultimately, we generate a
tunable turn-based non-cooperative solution, which can be scaled up
to finer granularity of pitch size and a greater number of players with
simultaneous team and independent objectives.
We finish with a discussion of the added value provided by the
FASP languages, who might benefit from them, and some of the new
challenges this presents to the reinforcement learning community
(amongst others). The guiding principle here is that richer domains and evaluation, provided by better modelling tools, allow us to regulate agent and group behaviour in more subtle ways. It is expected
that advanced learning techniques will be required to develop strategies concordant with these more sophisticated demands. This paper
is a stepping stone on that longer journey.
2 Preliminaries
This section introduces the concepts of the stochastic map and the stochastic function,
and uses them to construct a modelling framework for stochastic
processes. Initially, we formalise a way of probabilistically mapping
from one finite set to another.
Definition 1 (Stochastic Map) A stochastic map m, from finite independent set X to finite dependent set Y, maps all elements in X probabilistically to elements in Y, i.e. m : X → PD(Y), where PD(Y) is the set of all probability distributions over Y. The set of all such maps is Σ(X → Y); notationally, the undecided probabilistic outcome of m given x is m(x), and the following shorthand is defined: Pr(m(x) = y) = m(y|x). Two such maps, m1, m2 ∈ Σ(X → Y), identical for each such conditional probability, i.e. (∀x, y)(m1(y|x) = m2(y|x)), are said to be equal, i.e. m1 = m2.
The stochastic map is a generalisation of the functional map, see [8]. Another method for generating outcomes probabilistically is the stochastic function, which relies on the concept of the probability density function (PDF). For readers unfamiliar with PDFs, a good definition can be found in [19].
3 Or as we like to call it outside the US and Australia, robot football.
Definition 2 (Stochastic Function) A stochastic function f, from finite independent set X to the set of real numbers ℝ, maps all elements in X probabilistically to the real numbers, i.e. f : X → PDF(ℝ). The set of all such functions is Φ(X → ℝ); notationally, f(x) = F_x, where F_x(y) is the PDF associated with x and belonging to f; f(x) is also used to represent the undecided probabilistic outcome of F_x (it should be clear from the context which meaning is intended); the probability that this outcome lies between two bounds α1, α2 ∈ ℝ, α1 < α2, is denoted by [f(x)]_{α1}^{α2}. Two functions, f, g ∈ Φ(X → ℝ), identical for every PDF mapping, i.e. (∀x, α1, α2)([f(x)]_{α1}^{α2} = [g(x)]_{α1}^{α2}), are said to be equal, i.e. f = g; if the two functions are inversely equal for each PDF mapping, i.e. (∀x, α1, α2)([f(x)]_{α1}^{α2} = [g(x)]_{−α2}^{−α1}), then the stochastic functions are said to be inversely equal, i.e. f = −g.
Armed with these two stochastic generators, we can formalise the Finite Analytic Stochastic Process (FASP), an agent-oriented modelling framework.
Definition 3 (FASP) A FASP is defined by the tuple (S, A, O, t, ω, F, i, Π), where the parameters are as follows: S is the finite state space; A is the finite action space; O is the finite observation space; t ∈ Σ(S × A → S) is the transition function that defines the actions' effects on the system; ω ∈ Σ(S → O) is the observation function that generates observations; F = {f^1, f^2, . . . , f^N} is the set of measure functions, where for each i, f^i ∈ Φ(S → ℝ) generates a real-valued measure signal, f_n^i, at each time-step n; i ∈ Σ(∅ → S) is the initialisation function that defines the initial system state; and Π is the set of all control policies available.
Broadly similar to MDP-style constructions, the FASP can be used to build models representing agent interactions with some environmental system, in discrete time-steps. It allows an observation function, which probabilistically generates an observation in each state, and multiple measure signals, analogous to the MDP's reward signal, each of which generates a measure at each state. One important feature of the FASP is that it generates observations and measures from state information alone. There is no in-built interpretation of the measure functions; their meaning and any preferred constraints on their output are left to be imposed by the modeller.
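To make the components concrete, the following minimal Python sketch (our own illustrative encoding, not part of the FASP definition) represents t, ω, F and i as tabulated distributions and samples one time-step:

```python
import numpy as np

class FASP:
    """A minimal sketch of Definition 3 with tabulated distributions."""
    def __init__(self, t, omega, measures, init):
        # t[s, a, s'] = t(s'|s, a): transition function in Sigma(S x A -> S)
        # omega[s, o] = omega(o|s): observation function in Sigma(S -> O)
        # measures: one sampler per measure function f^i in F
        # init[s] = i(s): initial state distribution
        self.t, self.omega, self.measures, self.init = t, omega, measures, init

    def step(self, s, a, rng):
        """Sample one time-step: next state, observation, and measure signals."""
        s2 = rng.choice(len(self.init), p=self.t[s, a])
        o = rng.choice(self.omega.shape[1], p=self.omega[s2])
        f = [m(s2, rng) for m in self.measures]
        return s2, o, f

rng = np.random.default_rng(0)
# Two states, two actions, two observations; one deterministic measure.
t = np.array([[[0.0, 1.0], [1.0, 0.0]],
              [[0.5, 0.5], [0.0, 1.0]]])
omega = np.eye(2)                      # fully observed: o = s
measures = [lambda s, rng: float(s)]   # f^1(s) = s, deterministic
m = FASP(t, omega, measures, np.array([1.0, 0.0]))
s2, o, f = m.step(0, 0, rng)           # from state 0, action 0 -> state 1
```

Observations and measures here are generated from state information alone, mirroring the state-encapsulation property described above.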
Care needs to be taken when interpreting the measure signals and setting the desired output. These can be considered separate reward functions for separate agents (see Section 3), or conflicting constraints on the same agent. A FASP does not have a natively implied solution (or even set of solutions): there can be multiple policies that satisfy an experimenter's constraints, or there may be none. This is a by-product of the FASP's flexibility.
Fortunately, small problems (those that are analytically soluble) can be solved in two steps: first by finding the policy-dependent state occupancy probabilities, and then by solving for the expected output from the measure functions. This means that measure function interpretation can be imposed late, and/or different possibilities can be explored without having to resolve the state probabilities again [8].
If we confine ourselves to a purely reactive policy space, i.e. one where action choices are based solely on the most recent observation, so that Π = Σ(O → A), and a policy π ∈ Π is fixed, then the dynamics of this system resolve into a set of state-to-state transition probabilities, called the full state transition function.
Definition 4 (FASP Full State Transition Function) The FASP M = (S, A, O, t, ω, F, i, Π), with Π = Σ(O → A), has full state transition (FST) function τ_M : Π → Σ(S → S), where for π ∈ Π, τ_M(π) = τ_M^π (written τ^π when M is clear), and, ∀s, s′ ∈ S,

    τ_M^π(s′|s) = Σ_{o∈O} Σ_{a∈A} ω(o|s) π(a|o) t(s′|s, a)
It is the FST function that concisely defines the policy-dependent stochastic dynamics of the system, each fixed policy identifying a Markov chain, and together giving a policy-labelled family of Markov chains. A more detailed examination of FASPs (including the multi-agent derivatives appearing later in this paper) can be found in [8].
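For tabulated models, the FST function of Definition 4 reduces to a single tensor contraction; a sketch in Python, where the array layout is our own choice:

```python
import numpy as np

def full_state_transition(t, omega, pi):
    """tau^pi(s'|s) = sum_o sum_a omega(o|s) pi(a|o) t(s'|s,a)  (Definition 4).

    t[s, a, s'], omega[s, o] and pi[o, a] are row-stochastic arrays."""
    # Marginalise the observation and action out of the chain s -> o -> a -> s'.
    return np.einsum('so,oa,sap->sp', omega, pi, t)

# A two-state example: fixing the policy turns the FASP into a Markov chain.
t = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[1.0, 0.0], [0.0, 1.0]]])
omega = np.eye(2)                  # identity observation function
pi = np.array([[1.0, 0.0],         # on observation 0, always action 0
               [0.0, 1.0]])        # on observation 1, always action 1
tau = full_state_transition(t, omega, pi)
assert np.allclose(tau.sum(axis=1), 1.0)   # each row is a distribution
```

Each fixed policy π yields one such matrix τ^π, i.e. one member of the policy-labelled family of Markov chains.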
3 Multi-Agent Settings
This section shows how the FASP framework can be naturally extended to multi-agent settings. We distinguish two domains; models
with simultaneous joint observations and subsequent actions by synchronised agents, and asynchronous agents forced to take turns, observing and acting on the environment. To model the first situation,
we define the action space as being a tuple of action spaces — each
part specific to one agent, similarly the observation space is a tuple of
agent observation spaces. At every time-step, the system generates a
joint observation, delivers the relevant parts to each agent, then combines their subsequent action choices into a joint action which then
acts on the system. Any process described as a FASP constrained in
such a way is referred to as a Synchronous Multi-Agent (SMA)FASP.
Definition 5 (Synchronous Multi-Agent FASP) A synchronous multi-agent FASP (SMAFASP) is a FASP, with a set of enumerated agents, G, and the added constraints that: the action space A is a Cartesian product of action subspaces, A^g, for each g ∈ G, i.e. A = ∏_{g∈G} A^g; the observation space O is a Cartesian product of observation subspaces, O^g, for each g ∈ G, i.e. O = ∏_{g∈G} O^g; and the policy space, Π, can be rewritten as a Cartesian product of sub-policy spaces, Π^g, one for each agent g, i.e. Π = ∏_{g∈G} Π^g, where Π^g generates agent-specific actions from A^g, using previous such actions from A^g and observations from O^g.
To see how a full policy space might be partitioned into agent-specific sub-policy spaces, consider a FASP with purely reactive policies; any constrained policy π ∈ Π (= Σ(O → A)) can be rewritten as a vector of sub-policies π = (π^1, π^2, . . . , π^{|G|}), where for each g ∈ G, Π^g = Σ(O^g → A^g). Given some vector observation ~o_i ∈ O, ~o_i = (o_i^1, o_i^2, . . . , o_i^{|G|}), and vector action ~a_j ∈ A, ~a_j = (a_j^1, a_j^2, . . . , a_j^{|G|}), the following is true,

    π(~a_j | ~o_i) = ∏_{g∈G} π^g(a_j^g | o_i^g)
In all other respects ~o_i and ~a_j behave as an observation and an action in a FASP. The FST function depends on the joint policy, but otherwise it is exactly as for the FASP. Note that the above example is simply for illustrative purposes; Definition 5 does not restrict itself to purely reactive policies. Many POMDP-style multi-agent examples in the literature could be formulated as synchronous multi-agent FASPs (and hence as FASPs), such as those found in [6, 12, 13, 20], although our formulation is more flexible, especially compared to frameworks that only allow a single reward shared amongst agents, as used in [20].
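The factorisation of a joint reactive policy into per-agent sub-policies can be sketched directly; the helper name and array layout below are our own illustrative assumptions:

```python
import itertools
import numpy as np

def joint_policy(sub_policies):
    """pi(a_vec|o_vec) = prod_g pi^g(a^g|o^g), for row-stochastic pi_g[o_g, a_g]."""
    def pi(o_vec, a_vec):
        p = 1.0
        for g, (o, a) in enumerate(zip(o_vec, a_vec)):
            p *= sub_policies[g][o, a]   # each agent acts on its own observation
        return p
    return pi

# Two agents, each with two observations and two actions.
pi_A = np.array([[0.5, 0.5], [1.0, 0.0]])
pi_B = np.array([[0.2, 0.8], [0.0, 1.0]])
pi = joint_policy([pi_A, pi_B])
# The joint distribution over vector actions still sums to one.
total = sum(pi((0, 1), a) for a in itertools.product(range(2), repeat=2))
assert abs(total - 1.0) < 1e-12
```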
In real-world scenarios with multiple rational decision makers, the likelihood that each decision maker chooses actions in step with every other, or even as often as every other, is small. Therefore it is natural to consider an extension to the FASP framework that allows each agent to act independently, yielding the Asynchronous Multi-Agent FASP (AMAFASP). Here agents' actions affect the system as in the FASP, but any pair of action choices by two different agents is strictly non-simultaneous; each action is either before or after every other. This process description relies on the FASP's state-encapsulated nature, i.e. all state information is accessible to all agents, via the state description.
Definition 6 (AMAFASP) An AMAFASP is the tuple (S, G, A, O, t, ω, F, i, u, Π), where the parameters are as follows: S is the state space; G = {g_1, g_2, . . . , g_{|G|}} is the set of agents; A = ⊔_{g∈G} A^g is the action set, a disjoint union of agent-specific action spaces; O = ⊔_{g∈G} O^g is the observation set, a disjoint union of agent-specific observation spaces; t ∈ Σ(S × A → S) is the transition function and is the union function t = ⊔_{g∈G} t^g, where for each agent g, the agent-specific transition function t^g ∈ Σ((S × A^g) → S) defines the effect of each agent's actions on the system state; ω ∈ Σ(S → O) is the observation function and is the union function ω = ⊔_{g∈G} ω^g, where ω^g ∈ Σ(S → O^g) is used to generate an observation for agent g; F = {f^1, f^2, . . . , f^N} is the set of measure functions; i ∈ Σ(∅ → S) is the initialisation function as in the FASP; u ∈ Σ(S → G) is the turn-taking function; and Π = ∏_{g∈G} Π^g is the combined policy space as in the SMAFASP.
A formal definition of the union map and union function used here can be found in [8]; for simplicity, these objects can be thought of as a collection of independent stochastic maps or functions.
As with the FASP and SMAFASP, if we consider only reactive policies, i.e. Π^g = Σ(O^g → A^g) for each g, it is possible to determine the probability of any state-to-state transition as a function of the joint policy, and hence write the full state transition function.
Definition 7 (AMAFASP Full State Transition) An AMAFASP, M, with Π^g = Σ(O^g → A^g) for each g, has associated with it a full state transition function τ_M, given by,

    τ_M^π(s′|s) = Σ_{g∈G} Σ_{o^g∈O^g} Σ_{a^g∈A^g} u(g|s) ω^g(o^g|s) π^g(a^g|o^g) t^g(s′|s, a^g)
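This sum can likewise be computed directly for tabulated models; the sketch below uses our own array layout and a toy two-agent example:

```python
import numpy as np

def amafasp_fst(u, omegas, pis, ts):
    """Definition 7: tau(s'|s) = sum_g u(g|s) sum_{o,a} omega^g(o|s) pi^g(a|o) t^g(s'|s,a).

    u[s, g]; per-agent arrays omegas[g][s, o], pis[g][o, a], ts[g][s, a, s']."""
    S = u.shape[0]
    tau = np.zeros((S, S))
    for g in range(u.shape[1]):
        # Agent g's reactive dynamics, weighted by the turn-taking probability.
        tau += u[:, [g]] * np.einsum('so,oa,sap->sp', omegas[g], pis[g], ts[g])
    return tau

# Two states, two agents; agent 0 always pushes towards state 0, agent 1 to state 1.
u = np.full((2, 2), 0.5)                       # unbiased turn-taking
omegas = [np.eye(2), np.eye(2)]                # identity observations
pis = [np.ones((2, 1)), np.ones((2, 1))]       # each agent has a single action
ts = [np.array([[[1.0, 0.0]], [[1.0, 0.0]]]),  # t^0[s, a, s']
      np.array([[[0.0, 1.0]], [[0.0, 1.0]]])]  # t^1[s, a, s']
tau = amafasp_fst(u, omegas, pis, ts)
assert np.allclose(tau, 0.5)                   # either push is equally likely
```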
To our knowledge, there are no similar frameworks which model asynchronous agent actions within stochastic state-dependent environments. This may be because, without state encapsulation, it would be far less straightforward. A discussion of the potential benefits of turn-taking appears in Appendix A. From this point on, the umbrella term FASP covers FASPs, SMAFASPs and AMAFASPs.
The multi-agent FASPs defined above allow for the full range of cooperation or competition between agents, dependent on our interpretation of the measure signals. This includes general-sum games, as well as allowing hard and soft requirements to be combined separately for each agent, or applied to groups. Our paper focuses on problems where each agent, g, is associated with a single measure signal, f_n^g, at each time-step n. Without further loss of generality, these measure signals are treated as rewards, r_n^g = f_n^g, and each agent g is assumed to prefer higher values of r_n^g over lower ones⁴.
4 It would be simple to consider cost signals rather than rewards by inverting the signs.
For readability we present the general-sum case first, and incrementally simplify.
The general-sum scenario considers agents that are following unrelated agendas, and hence rewards are independently generated.
Definition 8 (General-Sum Scenario) A FASP is general-sum if, for each agent g, there is a measure function f^g ∈ Φ(S → ℝ), and at each time-step n with the system in state s_n, g's reward signal r_n^g = f_n^g, where each f_n^g is an outcome of f^g(s_n), for all g. An agent's measure function is sometimes also called its reward function, in this scenario.
Another popular scenario, especially when modelling competitive
games, is the zero-sum case, where the net reward across all agents
at each time-step is zero. Here, we generate a reward for all agents
and then subtract the average.
Definition 9 (Zero-Sum Scenario) A FASP is zero-sum if, for each agent g, there is a measure function f^g ∈ Φ(S → ℝ), and at each time-step n with the system in state s_n, g's reward signal r_n^g = f_n^g − f̄_n, where each f_n^g is an outcome of f^g(s_n) and f̄_n = (Σ_h f_n^h) / |G|.
With deterministic rewards, the zero-sum case can be achieved with one less measure than there are agents; the final agent's measure is determined by the constraint that rewards sum to zero (see [4]). For probabilistic measures with more than two agents, there are subtle effects on the distribution of rewards, so to avoid this we generate rewards independently.
If agents are grouped together, and within these groups always
rewarded identically, then the groups are referred to as teams, and
the scenario is called a team scenario.
Definition 10 (Team Scenario) A FASP is in a team scenario if the set of agents G is partitioned into some set of sets {G_j}_j, so G = ⊔_j G_j, and for each j, there is a team measure function f^j. At some time-step n in state s_n, each team j's reward r_n^j = f_n^j, where f_n^j is an outcome of f^j(s_n), and for all g ∈ G_j, r_n^g = r_n^j.
The team scenario above is general-sum, but can be adapted to a zero-sum team scenario in the obvious way. Other scenarios are possible, but we restrict ourselves to these. It is worth noting that the team scenario with one team (modelling fully cooperative agents) and the two-team zero-sum scenario are those most often examined in the associated multi-agent literature, most likely because they can be achieved with a single measure function and are thus relatively similar to the single-agent POMDP, see [1, 9, 10, 11, 15, 16, 18, 23, 24, 25].
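The three scenarios can be sketched as simple post-processing of sampled measure signals; the function names below are our own illustrations:

```python
def general_sum(f):
    """Definition 8: each agent's reward is its own measure signal."""
    return dict(f)

def zero_sum(f):
    """Definition 9: r^g = f^g - f_bar, with f_bar the mean over agents."""
    mean = sum(f.values()) / len(f)
    return {g: v - mean for g, v in f.items()}

def team_rewards(team_f, teams):
    """Definition 10: every member of team j receives the team reward f^j."""
    return {g: team_f[j] for j, members in teams.items() for g in members}

f = {'A': 3.0, 'B': 1.0}
r = zero_sum(f)
assert abs(sum(r.values())) < 1e-12      # net reward is zero by construction
assert r == {'A': 1.0, 'B': -1.0}
t = team_rewards({'j1': 5.0}, {'j1': ['A', 'B']})   # both teammates get 5.0
```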
4 Examples
This section introduces the soccer example originally proposed in
[16], and revisited in [2, 3, 5, 20, 24]. We illustrate both the transparency of the modelling mechanisms and ultimately the descriptive
power this gives us in the context of problems which attempt to recreate some properties of a real system. The example is first formulated
below as it appears in Littman’s paper [16], and then recreated in two
different ways.
Formulation 1 (Littman's adversarial MDP soccer) An early model of MDP-style multi-agent learning, referred to as soccer, the game is played on a 4 × 5 board of squares, with two agents (one of whom is holding the ball), and is zero-sum, see Fig. 1. Each agent is located at some grid reference, and chooses to move in one of the four cardinal points of the compass (N, S, E and W) or the H(old) action at every time-step. Two agents cannot occupy the same square. The state space, S_L1, is of size 20 × 19 = 380, and the joint action space, A_L1, is a Cartesian product of the two agents' action spaces, A^A_L1 × A^B_L1, and is of size 5 × 5 = 25. A game starts with agents in a random position in their own halves. The outcome for some joint action is generated by determining the outcome of each agent's action separately, in a random order, and is deterministic otherwise. An agent moves when unobstructed and does not when obstructed. If the agent with the ball tries to move into a square occupied by the other agent, then the ball changes hands. If the agent with the ball moves into the goal, then the game is restarted and the scoring agent gains a reward of 1 (the opposing agent getting −1). Therefore, other than the random turn ordering, the game mechanics are deterministic.

Figure 1. The MDP adversarial soccer example from [16].

Figure 2. If agent A chooses to go East and agent B chooses to go South when diagonally adjacent as shown, the outcome will be non-deterministic, depending on which agent's move is calculated first. In the original Littman version this was achieved by resolving agent-specific actions in a random order, fig. (a). In the SMAFASP version this is translated into a flat transition function with probabilities on arcs (square brackets), fig. (b).

The Littman soccer game can be recreated as a SMAFASP with very little work. We add a couple more states for the reward function, add a trivial observation function, and flatten the turn-taking into the transition function.

Formulation 2 (The SMAFASP soccer formulation) The SMAFASP formulation of the soccer game is formed as follows: the state space S_L2 = S_L1 ∪ {s*_A, s*_B}, where s*_g is the state immediately after g scores a goal; the action space A_L2 = A_L1; the observation space O_L2 = S_L2; the observation function is the identity mapping; and the measure/reward function for agent A, f^A ∈ Φ(S_L2 → ℝ), is as follows:

    f^A(s*_A) = 1,  f^A(s*_B) = −1,  and f^A(s) = 0 for all other s ∈ S_L2.

Agent B's reward function is simply the inverse, i.e. f^B(s) = −f^A(s), for all s.
To define the transition function we need first to imagine that Littman's set of transitions were written as agent-specific transition functions, t^g_L1 ∈ Σ(S_L1 × A^g_L1 → S_L1) for each agent g. Although this is not explicitly possible within the framework he uses, his description suggests he did indeed do this. The new transition function, t_L2 ∈ Σ(S_L2 × A_L2 → S_L2), would then be defined, for all s_n, s_{n+1} ∈ S_L1, a^A_n ∈ A^A_L2, a^B_n ∈ A^B_L2, in the following way:

    t_L2(s_{n+1} | s_n, (a^A_n, a^B_n)) = (1/2) Σ_{s∈S_L1} [ t^A_L1(s | s_n, a^A_n) t^B_L1(s_{n+1} | s, a^B_n) + t^B_L1(s | s_n, a^B_n) t^A_L1(s_{n+1} | s, a^A_n) ]

The transition probabilities involving the two new states, s*_A and s*_B, would be handled in the expected way.
The turn-taking is absorbed so that the random order of actions within a turn is implicit within the probabilities of the transition function, see Fig. 2(b), rather than, as before, being a product of the implicit ordering of agent actions, as in Fig. 2(a).
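The flattening of the random-order resolution into a single joint transition function can be sketched numerically; the array layout and function name below are our own, assuming per-agent transition arrays:

```python
import numpy as np

def flatten_random_order(tA, tB):
    """Average the two resolution orders into one joint transition array:
    t(s''|s,(aA,aB)) = 1/2 sum_s' [ tA(s'|s,aA) tB(s''|s',aB)
                                  + tB(s'|s,aB) tA(s''|s',aA) ].
    tA[s, i, s'] and tB[s, j, s'] are agent-specific transition arrays."""
    ab = np.einsum('sip,pjq->sijq', tA, tB)   # A resolves first, then B
    ba = np.einsum('sjp,piq->sijq', tB, tA)   # B resolves first, then A
    return 0.5 * (ab + ba)                    # result indexed [s, aA, aB, s'']

# Toy example: two states, one action each; A pushes to state 0, B to state 1.
tA = np.array([[[1.0, 0.0]], [[1.0, 0.0]]])
tB = np.array([[[0.0, 1.0]], [[0.0, 1.0]]])
t = flatten_random_order(tA, tB)
assert np.allclose(t[0, 0, 0], [0.5, 0.5])   # order decides the outcome
```

Deterministic per-agent moves thus become genuinely stochastic joint transitions, exactly the effect shown in Fig. 2(b).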
It is possible to reconstruct Littman’s game in a more flexible way.
To see how, it is instructive to first examine what Littman’s motives
may have been in constructing this problem, which may require some
supposition on our part. Littman’s example is of particular interest to
the multi-agent community, in that there is no independently optimal policy for either agent; instead each policy’s value is dependent
on the opponent’s policy — therefore each agent is seeking a policy referred to as the best response to the other agent’s policy. Each
agent is further limited by Littman's random order mechanism, see Fig. 2(a): while the first agent to act each turn chooses an action based on current state information, the second agent to act is in effect basing its action choice on state information that is one time-step out of date; and because this ordering is random, neither agent can fully rely on the current state information. Littman
doesn’t have much control over this turn-taking, and as can be seen
from the SMAFASP formulation, the properties of this turn-taking
choice can be incorporated into the transition function probabilities
(see Fig. 2 (b)). Different properties would lead to different probabilities, and would constitute a slightly different system with possibly
different solutions.
However, consider for example, that an agent is close to their own
goal defending an attack. Its behaviour depends to some degree on
where it expects to see its attacker next: the defender may wish to
wait at one of these positions to ambush the other agent. The range
of these positions is dependent on how many turns the attacker might
take between these observations, which is in turn dependent on the
turn-taking built into the system. For ease of reading, we introduce
an intermediate AMAFASP formulation. The individual agent action
spaces are as in the SMAFASP, as is the observation space, but the
new state information is enriched with the positional states of the
previous time-step, which in turn can be used to generate the observations for agents.
Formulation 3 (The first AMAFASP soccer formulation) This AMAFASP formulation of the soccer game, M_L3, is formed as follows: the new state space S_L3 = S_L2 × S_L2, so a new state at some time-step n is given by the tuple (s_n, s_{n−1}), where s_{n−1}, s_n ∈ S_L2 record the most recent and current positional states; there are two action spaces, one for each agent, A^A_L3 = A^A_L2 and A^B_L3 = A^B_L2; and two identical agent-specific observation spaces, O^A_L3 = O^B_L3 = O_L2; the new agent-specific transition functions, t^g_L3 ∈ Σ(S_L3 × A^g_L3 → S_L3), are defined, for all s_{n−1}, s_n, s′_n, s_{n+1} ∈ S_L2, a^g_n ∈ A^g_L3, in the following way:

    t^g_L3((s_{n+1}, s′_n) | (s_n, s_{n−1}), a^g_n) = t^g_L1(s_{n+1} | s_n, a^g_n) if s′_n = s_n, and 0 otherwise,

where t^g_L1 represents agent g's deterministic action effects in Littman's example, as in Formulation 2. The goal states, s*_A and s*_B, are dealt with as expected.
Recalling that O_L3 = S_L2, the observation function, ω^g_L3 ∈ Σ(S_L3 → O^g_L3), is generated, for all (s_n, s_{n−1}) ∈ S_L3, o_n ∈ O^g_L3, and g ∈ {A, B}, in the following way,

    ω^g_L3(o_n | (s_n, s_{n−1})) = 1 if s_{n−1} = o_n, and 0 otherwise.

The reward function is straightforward and left to the reader.
Finally, we construct the turn-taking function u_L3 ∈ Σ(S_L3 → {A, B}), which simply generates either agent in an unbiased way at each time-step. The turn-taking function is defined, for all (s_n, s_{n−1}) ∈ S_L3, as

    u_L3(A | (s_n, s_{n−1})) = u_L3(B | (s_n, s_{n−1})) = 1/2.
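A simulation loop for this style of formulation might look as follows; the positional dynamics and names here are purely illustrative toys, not the full soccer model:

```python
import numpy as np

# Sketch of an augmented state (s_n, s_{n-1}): the observation function returns
# the stale component, and the turn-taking function picks an agent uniformly.
rng = np.random.default_rng(1)

def observe(state):            # both agents see the previous positional state
    s_now, s_prev = state
    return s_prev

def turn(state):               # unbiased coin at every time-step
    return rng.choice(['A', 'B'])

def step(state, act):          # act(g, obs) -> new positional state (hypothetical)
    s_now, _ = state
    g = turn(state)
    s_next = act(g, observe(state))
    return (s_next, s_now)     # the old current state becomes the stale record

state = ('s0', 's_init')
state = step(state, lambda g, obs: obs + '+' + g)   # toy positional dynamics
assert state[1] == 's0'        # the previous positional state was recorded
```

The acting agent's choice is driven by `observe`, i.e. by one-step-stale information, which is the property the formulation is designed to capture.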
This does not fully replicate the Littman example, but satisfies the formulation in spirit, in that agents are acting on potentially stale positional information, as well as dealing with an unpredictable opponent. In one sense, it better models hardware robots playing football, since all agents observe slightly out-of-date positional information, rather than a mix of some and not others. Both this and the Littman example do, however, share the distinction between turn ordering and game dynamics typified by Fig. 2(a); what is more, this is now explicitly modelled by the turn-taking function.
To fully recreate the mix of stale and fresh observations seen in Littman's example, along with the constrained turn-taking, we need the state to include turn-relevant information. This can be done with a tri-bit of information included with the other state information, to differentiate between: the start of a Littman time-step, when either agent could act next; when agent A has just acted in this time-step, and it must be B next; and vice versa, when A must act next. We shall label these situations with l_0, l_B and l_A respectively. This has the knock-on effect that in l_0-labelled states the observation function is as in Formulation 2; in l_A- and l_B-labelled states the stale observation is used, as in Formulation 3. Otherwise, Formulation 4 is very much like Formulation 3.
Formulation 4 (The second AMAFASP soccer formulation) This AMAFASP formulation of the soccer game, M_L4, is formed as follows: there is a set of turn labels, L = {l_0, l_A, l_B}; the state space is a three-way Cartesian product, S_L4 = S_L2 × S_L2 × L, where the parts can be thought of as current positional state, previous positional state and turn label respectively; the action spaces and observation spaces are as before, i.e. A^g_L4 = A^g_L3, O^g_L4 = O^g_L3, for each agent g; the transition and reward functions are straightforward and are omitted for brevity; the observation and turn-taking functions are defined, for all (s_n, s_{n−1}, l_n) ∈ S_L4, o_n ∈ O^g_L4 and all agents g, in the following way,

    ω^g_L4(o_n | (s_n, s_{n−1}, l_n)) = 1 if s_n = o_n and l_n = l_0; 1 if s_{n−1} = o_n and l_n = l_g; 0 otherwise,

and

    u_L4(g | (s_n, s_{n−1}, l_n)) = 1/2 if l_n = l_0; 1 if l_n = l_g; 0 otherwise.
The above formulation recreates Littman’s example precisely, and
instead of the opaque turn-taking mechanism hidden in the textual
description of the problem, it is transparently and explicitly modelled
as part of the turn-taking function.
So the Littman example can be recreated as a SMAFASP or
AMAFASP, but more interestingly both AMAFASP formulations,
3 and 4, can be tuned or extended to yield new, equally valid, formulations. What is more, the intuitive construction means that these
choices can be interpreted more easily.
Consider Formulation 3; the turn-taking function u can be defined to give different turn-taking probabilities in different states. For instance, if an agent is next to its own goal, we could increase its probability of acting (over the other agent being chosen) to reflect a defender behaving more fiercely when a loss is anticipated. Alternatively, if an agent's position has not changed since the last round, but the other's has, then the first agent could be more likely to act (possible as two steps of positional data are stored); giving an advantage to the H(old) action, but otherwise encouraging a loose alternating agent mechanism.
While Formulation 4 recreates the Littman example, it again can be adjusted to allow different choices of turn-taking mechanism; in particular, it is now possible to enforce strictly alternating agents. This would be done by flipping from state label l_A to l_B, or vice versa, at each step transition, and otherwise keeping things very much as before. It is important to note that many specific models built in this way can be recreated by implicit encoding of probabilities within existing frameworks, but it is difficult to see how the experimenter would interpret the group of models as being members of a family of related systems.
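As a sketch, such alternative turn-taking choices can be written as interchangeable functions over the labelled states of Formulation 4; both variants below are our own illustrations:

```python
def u_littman(state):
    """Random within a Littman step: either agent may act from an l0 state."""
    s_now, s_prev, label = state
    if label == 'l0':
        return {'A': 0.5, 'B': 0.5}
    # lA means A must act next; lB means B must act next.
    return {'A': 1.0, 'B': 0.0} if label == 'lA' else {'A': 0.0, 'B': 1.0}

def u_strict_alternation(state):
    """Strictly alternating agents: the label alone decides who acts."""
    _, _, label = state
    return {'A': 1.0, 'B': 0.0} if label == 'lA' else {'A': 0.0, 'B': 1.0}

assert u_littman(('s', 's', 'l0')) == {'A': 0.5, 'B': 0.5}
assert u_strict_alternation(('s', 's', 'lB'))['B'] == 1.0
```

Swapping one function for the other changes the turn-taking regime without touching the rest of the model, which is the tunability being claimed.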
4.1 Flexible Behaviour Regulation
If we increase the number of players in our game, we can consider
increasing the number of measure functions for a finer degree of
control over desired behaviour. With just 2 agents competing in the
Littman problem, it is difficult to see how to interpret any extra signals, and adding agents will increase the state space and hence the
policy size radically. So, before we address this aspect of the FASP
formalisms, it is useful to examine a more recent derivative soccer
game, namely Peshkin et al.’s partially observable identical payoff
stochastic game (POIPSG) version [20], which is more amenable to
scaling up.
Figure 3. The POIPSG cooperative soccer example from [20].
Peshkin et al.'s example is illustrated in Fig. 3. There are two teammates, V1 and V2, and an opponent O; each agent has partial observability and can only see whether the 4 horizontally and vertically adjacent squares are occupied or not. Also, players V1 and V2 have an extra pass action when in possession of the ball. Otherwise the game is very much like Littman's, with joint actions being resolved for individual agents in some random order at each time-step. Contrary to expectations, the opponent is not modelled as a learning agent and does not receive a reward; instead the two teammates share a reward and learn to optimise team behaviour versus a static opponent policy; for more details see [20].
As with its progenitor, the Peshkin example could simply be reworked as a SMAFASP, in much the same way as in Formulation 2. Recreating it as an AMAFASP is also reasonably straightforward; as before, the trick of including the previous step's positional state in an AMAFASP state representation allows us to generate stale observations, which are now also partial. As this is relatively similar to the Littman adaptations, the details are omitted. The focus here, instead, is to show that, in the context of Peshkin's example, a zero- or general-sum adaptation with more agents could have some utility. Firstly, agent O could receive the opposite of the shared reward of V1 and V2; this could be done without introducing another measure function, merely by reinterpreting Peshkin's reward function; now the opponent could learn to optimise against the cooperative team. More interestingly, the players V1 and V2 could be encouraged to learn different roles by rewarding V1 (say) more when the team scores a goal and penalising V2 more when the team concedes one; all this requires is a measure function for each agent and a zero-sum correction. Further, we can add a second opponent (giving say
O1 and O2 ), either rewarding them equally or encouraging different
roles as with V1 and V2 . In this way we could explore the value of
different reward structures by competing the teams. If more agents
were added, even up to 11 players a side, and a much larger grid, the
AMAFASP framework supports a much richer landscape of rewards
and penalties, which can encourage individual roles within the team,
while still differentiating between good and bad team collaborations.
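To make this concrete, here is a minimal sketch of the role-based reward structure described above, with a zero-sum correction across both teams. The function name, the specific weights, and the dictionary representation are our own illustrative assumptions, not part of the FASP definition.

```python
# Hedged sketch: per-agent measure functions for a 2-vs-2 variant.
# V1 is rewarded more for goals scored, V2 penalised more for goals
# conceded; opponents O1/O2 split the negated team total so that the
# four measures always sum to zero. Weights are illustrative only.
def team_measures(goal_for, goal_against):
    v1 = 1.0 * goal_for - 0.5 * goal_against   # attacker-biased measure
    v2 = 0.5 * goal_for - 1.0 * goal_against   # defender-biased measure
    team_total = v1 + v2
    # zero-sum correction: opponents share the mirrored team reward
    o1 = o2 = -team_total / 2.0
    return {"V1": v1, "V2": v2, "O1": o1, "O2": o2}

measures = team_measures(goal_for=1, goal_against=0)
assert abs(sum(measures.values())) < 1e-12  # zero-sum across all agents
```

The same scheme extends directly to larger teams: one measure function per agent, with a final correction term that keeps the sum over all agents at zero.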
5 Discussion
In this paper, we have examined a new family of frameworks, the
FASP, geared towards modelling stochastic processes for one
or many agents. We focus on the multi-agent aspects of this family,
and show how a variety of different game scenarios can be explored
within this context. Further, we have highlighted the difference between
synchronous and asynchronous actions within the multi-agent
paradigm, and shown how these choices are made explicit within the FASP
frameworks. As a package, this delivers what we consider to be a
much broader tool-set for regulating behaviour within cooperative,
non-cooperative and competitive problems. This is in sharp contrast
to the more traditional pre-programmed expert systems approaches,
which attempt to prescribe agent intentions and interactions.
The overarching motivation for our approach is to provide the
modeller with transparent mechanisms; roughly, this means that all
the pertinent mechanisms are defined explicitly within the model.
Two such mechanisms are visited repeatedly in this paper,
multiple measures and turn-taking, but they are not
the only ones. In fact, the FASP borrows a few (arguably) transparent
mechanisms from the MDP and POMDP frameworks.
The transition function is a good example, and the observation and
reward functions as defined in the POMDP have a degree of
transparency; we argue, though, that strict state encapsulation improves upon
this. The general agent-environment relationship is the same
as in the POMDP, meaning existing analytic and learning tools can still
be applied where appropriate5. Moreover, techniques for shaping and
incremental learning developed for related single agent frameworks
[14, 22] and multi-agent frameworks [7] can also be applied without
radical changes.
Other ideas for good transparent mechanisms exist in the literature,
and the FASP family would benefit by incorporating the best
of them. For instance, modelling extraneous actions/events can reduce
the required size of a model, by allowing us to approximate
portions of an overall system without trying to explicitly recreate
the entire system. Imagine a FASP with an external event function
X ∈ Σ(S → S), which specifies how an open system might change
from time-step to time-step; a similar extraneous event function can
be found in [21]. This transition might occur between the observation
and action outcomes, incorporating staleness into the observations
as with our examples in Section 4. More interestingly, this function
could be part of an AMAFASP and be treated analogously to
the action of a null agent; this would allow the turn-taking function
to manage how rarely or often an extraneous event occurred. Another
candidate transparent mechanism can be found in [4], where the authors
simulate a broken actuator by separating intended action from actual
action with what amounts to a stochastic map between the two.
We would encourage modellers to incorporate such mechanisms as
needed.
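As a rough illustration of how such an external event function might be realised, the sketch below implements a stochastic map over states. The state labels, probabilities, and function names are our own assumptions for illustration, not taken from [21] or [4].

```python
import random

# Hedged sketch of an external event function X in Sigma(S -> S): a
# stochastic map over states, applied between observation and action
# outcomes. The toy state space and probabilities are illustrative only.
EVENT_TABLE = {
    "ball_left":   [("ball_left", 0.9), ("ball_centre", 0.1)],
    "ball_centre": [("ball_centre", 0.8), ("ball_left", 0.1),
                    ("ball_right", 0.1)],
    "ball_right":  [("ball_right", 0.9), ("ball_centre", 0.1)],
}

def external_event(state, rng=random):
    """Sample a successor state from the extraneous event distribution."""
    u, acc = rng.random(), 0.0
    for successor, prob in EVENT_TABLE[state]:
        acc += prob
        if u < acc:
            return successor
    return state  # numerical guard; each row sums to 1
```

Treated as the action of a null agent in an AMAFASP, the turn-taking function would then decide how often `external_event` fires relative to the real agents' actions.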
There are other clear directions for future work. The tools outlined
in this paper enable a variety of multi-agent problems, with a choice of
measures for evaluation, but they leave out how these measures might
be used to solve or learn desirable solutions. The terms general- and
zero-sum are not used accidentally; the similarities with traditional
game theory are obvious, but the differences are more subtle. The
extensive form game in the game theoretic sense, as described in
[17], enforces that a game has a beginning and an end, gives rewards
only at the end of a game run, and assumes players (agents) do not forget
any steps in their game; it seeks to address the behaviour of intelligent
players who have full access to the game's properties. While this can
be defended as appropriate for human players, our paradigm allows
for an ongoing game of synthetic agents with potentially highly restricted
memory capabilities and zero prior knowledge of the game
to be played, and rewards are calculated at every step. In fact, if we
were to combine the AMAFASP with extraneous actions, it would
constitute a generalisation of the extensive form game, as described
in [17], but we omit the proof here.
It is sufficient to say that the FASP family demands a different solution
approach than the extensive form game, and certainly a different
approach than the easily understood reward-maximisation/cost-minimisation
required by single-agent and cooperative MDP style
problems. Some preliminary research has been done with respect to
such systems [4, 11, 26]; we envisage this as being an active area
of research in the next few years, and hope that the FASP tool-set
facilitates that study.
5 Tools for evaluating the expected long term reward in a POMDP can be
used without change within a FASP, except that multiple measure signals
could be evaluated simultaneously. Maximisation procedures are still possible
too, but care needs to be taken with what is maximised where
multiple agents are maximising different measures independently [4, 26].

REFERENCES
[1] Daniel S. Bernstein, Shlomo Zilberstein, and Neil Immerman, 'The complexity of decentralized control of Markov decision processes', in Proceedings of the 16th Annual Conference on Uncertainty in Artificial Intelligence (UAI-00), pp. 32–37, San Francisco, CA, (2000). Morgan Kaufmann.
[2] Reinaldo A. C. Bianchi, Carlos H. C. Ribeiro, and Anna H. Reali Costa, 'Heuristic selection of actions in multiagent reinforcement learning', in IJCAI, ed., Manuela M. Veloso, pp. 690–695, (2007).
[3] Michael Bowling, Rune Jensen, and Manuela Veloso, 'A formalization of equilibria for multiagent planning', in Proceedings of the AAAI-2002 Workshop on Multiagent Planning, (August 2002).
[4] Michael Bowling and Manuela Veloso, 'Existence of multiagent equilibria with limited agents', Journal of Artificial Intelligence Research, 22, (2004).
[5] Michael H. Bowling, Rune M. Jensen, and Manuela M. Veloso, 'Multiagent planning in the presence of multiple goals', in Planning in Intelligent Systems: Aspects, Motivations and Methods, John Wiley and Sons, Inc., (2005).
[6] Michael H. Bowling and Manuela M. Veloso, 'Simultaneous adversarial multi-robot learning', in IJCAI, eds., Georg Gottlob and Toby Walsh, pp. 699–704. Morgan Kaufmann, (2003).
[7] Olivier Buffet, Alain Dutech, and François Charpillet, 'Shaping multi-agent systems with gradient reinforcement learning', Autonomous Agents and Multi-Agent Systems, 15(2), 197–220, (2007).
[8] Luke Dickens, Krysia Broda, and Alessandra Russo, 'Transparent modelling of finite stochastic processes for multiple agents', Technical Report 2008/2, Imperial College London, (January 2008).
[9] Alain Dutech, Olivier Buffet, and François Charpillet, 'Multi-agent systems by incremental gradient reinforcement learning', in IJCAI, pp. 833–838, (2001).
[10] Jerzy Filar and Koos Vrieze, Competitive Markov Decision Processes, Springer-Verlag New York, Inc., New York, NY, USA, 1996.
[11] P. Gmytrasiewicz and P. Doshi, 'A framework for sequential planning in multi-agent settings', (2004).
[12] Amy Greenwald and Keith Hall, 'Correlated Q-learning', in AAAI Spring Symposium Workshop on Collaborative Learning Agents, (2002).
[13] Junling Hu and Michael P. Wellman, 'Multiagent reinforcement learning: theoretical framework and an algorithm', in Proc. 15th International Conf. on Machine Learning, pp. 242–250. Morgan Kaufmann, San Francisco, CA, (1998).
[14] Adam Daniel Laud, Theory and Application of Reward Shaping in Reinforcement Learning, Ph.D. dissertation, University of Illinois at Urbana-Champaign, 2004.
[15] Martin Lauer and Martin Riedmiller, 'An algorithm for distributed reinforcement learning in cooperative multi-agent systems', in Proc. 17th International Conf. on Machine Learning, pp. 535–542. Morgan Kaufmann, San Francisco, CA, (2000).
[16] Michael L. Littman, 'Markov games as a framework for multi-agent reinforcement learning', in Proceedings of the 11th International Conference on Machine Learning (ML-94), pp. 157–163, New Brunswick, NJ, (1994). Morgan Kaufmann.
[17] Roger B. Myerson, Game Theory: Analysis of Conflict, Harvard University Press, September 1997.
[18] Frans Oliehoek and Arnoud Visser, 'A hierarchical model for decentralized fighting of large scale urban fires', in Proceedings of the Fifth International Conference on Autonomous Agents and Multiagent Systems (AAMAS), eds., P. Stone and G. Weiss, Hakodate, Japan, (May 2006).
[19] A. Papoulis, Probability, Random Variables, and Stochastic Processes, McGraw Hill, 3rd edn., 1991.
[20] Leonid Peshkin, Kee-Eung Kim, Nicolas Meuleau, and Leslie P. Kaelbling, 'Learning to cooperate via policy search', in Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 307–314, San Francisco, CA, (2000). Morgan Kaufmann.
[21] Bharaneedharan Rathnasabapathy and Piotr Gmytrasiewicz, 'Formalizing multi-agent POMDPs in the context of network routing'.
[22] Mark B. Ring, 'CHILD: A first step towards continual learning', Machine Learning, 28, 77–104, (May 1997).
[23] Yoav Shoham, Rob Powers, and Trond Grenager, 'If multi-agent learning is the answer, what is the question?', Artif. Intell., 171(7), 365–377, (2007).
[24] William T. B. Uther and Manuela M. Veloso, 'Adversarial reinforcement learning', Technical report, Computer Science Department, Carnegie Mellon University, (April 1997).
[25] Erfu Yang and Dongbing Gu, 'Multiagent reinforcement learning for multi-robot systems: A survey', Technical report, University of Essex, (2003).
[26] Martin Zinkevich, Amy Greenwald, and Michael Littman, 'Cyclic equilibria in Markov games', in Advances in Neural Information Processing Systems 18, 1641–1648, MIT Press, Cambridge, MA, (2005).
A The Benefits of Turn-Taking
Modelling with turn-taking is not only more natural in some cases;
it also allows for a more concise representation of a system.
Consider a system with 2 agents, A and B, and 9 positional states (in
a 3 × 3 grid), where agents cannot occupy the same location (so there
are 9 × 8 = 72 states), and movement actions are in the 4 cardinal
directions for each agent.
Let us first consider how this might be formulated without explicit
turn-taking. Imagine that s1 is our prior-state (a particular placement
of A and B on the grid; the grid diagrams are omitted here). Imagine
also that agent A chooses action E(ast) and agent B chooses
action S(outh), making joint action ⟨E, S⟩, and that these actions
will be resolved in a random order. This results in one of three post-action
states s′, with probabilities

Pr(s′ | s = s1, a = ⟨E, S⟩) = α1, β1 and γ1

for the three possible post-action states respectively, where α1 + β1 + γ1 = 1.
Let us assume that these probabilities are transparently modelled by agents
A and B, whose actions fail with the following probabilities:
Pr(A's action fails) = p and Pr(B's action fails) = q, acting in
some order, where Pr(A is first) = r, and such that agents cannot
move into an already occupied square. We can use these values to
determine the flattened probabilities as α1 = (1 − p)(r + q(1 − r)),
β1 = pq, and γ1 = (1 − q)(pr + (1 − r)). α1, β1 and γ1 are not
independent; they represent only 2 independent variables, but are fed
into by p, q and r, which are independent. It may seem that α1, β1
and γ1 model the situation more concisely, since any one combination
of α1, β1 and γ1 could correspond to many different choices for
p, q and r. However, this isn't the whole story. Consider instead a second
prior-state s2 (again pictured in the original grid diagram, omitted here),
and the combined action ⟨S, E⟩. Treated naively, the possible
outcomes might all be specified with separate probabilities (say α2,
β2 and γ2) as before. However, the modeller can, with transparent
mechanisms, choose to exploit underlying symmetries of the system.
If, as is quite natural, the turn ordering probabilities are independent
of position, then there need be no extra parameters specified for this
situation; the original p, q and r are already sufficient to define the
transition probabilities here too.
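The flattening above is easy to check mechanically. The sketch below computes α1, β1 and γ1 from p, q and r using the formulas in this appendix (the function name is our own), and confirms that the three outcome probabilities are exhaustive.

```python
# Sketch: flatten the turn-taking parameters of Appendix A into the
# three outcome probabilities of the joint action <E, S>.
def flatten(p, q, r):
    """p = Pr(A's action fails), q = Pr(B's action fails),
    r = Pr(A acts first)."""
    alpha = (1 - p) * (r + q * (1 - r))
    beta = p * q                          # both actions fail
    gamma = (1 - q) * (p * r + (1 - r))
    return alpha, beta, gamma

a, b, g = flatten(0.1, 0.2, 0.5)
# the three outcomes are mutually exclusive and exhaustive
assert abs(a + b + g - 1.0) < 1e-12
```

Note that the identity α1 + β1 + γ1 = 1 holds for every choice of p, q and r in [0, 1], which is why only 2 of the 3 flattened probabilities are independent.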