REINFORCEMENT LEARNING
edited by
Richard S. Sutton
GTE Laboratories
A Reinforcement Connectionist Approach to Robot Path Finding in Non-Maze-Like Environments ......................... J. del R. Millán and C. Torras
Introduction:
The Challenge of Reinforcement Learning
Reinforcement learning is the learning of a mapping from situations to actions so as to
maximize a scalar reward or reinforcement signal. The learner is not told which action
to take, as in most forms of machine learning, but instead must discover which actions
yield the highest reward by trying them. In the most interesting and challenging cases,
actions may affect not only the immediate reward, but also the next situation, and through
that all subsequent rewards. These two characteristics, trial-and-error search and delayed
reward, are the two most important distinguishing features of reinforcement learning.
Reinforcement learning is both a new and very old topic in AI. The term appears to
have been coined by Minsky (1961), and independently in control theory by Waltz and
Fu (1965). The earliest machine learning research now viewed as directly relevant was
Samuel's (1959) checker player, which used temporal-difference learning to manage delayed
reward much as it is used today. Of course learning and reinforcement have been studied
in psychology for almost a century, and that work has had a very strong impact on the
AI/engineering work. One could in fact consider all of reinforcement learning to be simply
the reverse engineering of certain psychological learning processes (e.g., operant conditioning and secondary reinforcement).¹
Despite the early papers mentioned above, reinforcement learning was largely forgotten
in the late 1960s and the 1970s. Not until the early 1980s did it gradually become an active
and identifiable area of machine learning research (Barto et al., 1981, 1983; see also Hampson, 1983). Research in genetic algorithms and classifier systems, initiated by John Holland
(1975, 1986), has also been an influential part of reinforcement learning research, as has
learning automata theory (see Narendra & Thathachar, 1974). Most recently, Chris Watkins
(1989) and Paul Werbos (1987), among others, have invigorated theoretical research in
reinforcement learning by linking it to optimal control theory and dynamic programming.
The seven articles of this special issue are representative of the excellent reinforcement
learning research ongoing today. Some are theoretical, some empirical. Most of them use
some form of connectionist network as part of their learning method.² The article by Williams
introduces a gradient theory of reinforcement learning analogous to that available for connectionist supervised learning. Whereas Williams' theory treats the case of immediate
reward, the article by Tesauro focuses on delayed reward. Tesauro compares temporal-difference and supervised-learning approaches to learning to play backgammon. Among
other surprising results, his temporal-difference program learns to play significantly better
than the previous world-champion program and as well as expert human players.
Notes
1. Psychologists do not use exactly the term "reinforcement learning," so we can feel free to use it, as I do here,
to refer exclusively to the engineering enterprise.
2. This is typical, but by no means necessary, as is shown by reinforcement learning research that instead uses
genetic algorithms (e.g., Grefenstette et al., 1990).
References
Barto, A.G., Bradtke, S.J., & Singh, S.P. (1991). Real-time learning and control using asynchronous dynamic programming (Technical Report 91-57). Amherst, MA: University of Massachusetts, Computer Science Department.
Barto, A.G. & Sutton, R.S. (1981). Landmark learning: An illustration of associative search. Biological Cybernetics, 42, 1-8.
Barto, A.G., Sutton, R.S. & Anderson, C.W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13, 834-846.
Barto, A.G., Sutton, R.S. & Brouwer, P.S. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40, 201-211.
Booker, L.B. (1988). Classifier systems that learn world models. Machine Learning, 3, 161-192.
Grefenstette, J.J., Ramsey, C.L. & Schultz, A.C. (1990). Learning sequential decision rules using simulation models and competition. Machine Learning, 5, 355-382.
Hampson, S.E. (1983). A neural model of adaptive behavior. Ph.D. dissertation, Dept. of Information and Computer Science, Univ. of Calif., Irvine (Technical Report #213). A revised edition appeared as Connectionist Problem Solving, Boston: Birkhäuser, 1990.
Holland, J.H. (1975). Adaptation in natural and artificial systems. Ann Arbor, MI: Univ. of Michigan Press.
Holland, J.H. (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R.S. Michalski, J.G. Carbonell, & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach, Volume II, 593-623. Los Altos, CA: Morgan Kaufmann.
Kaelbling, L.P. (1990). Learning in embedded systems. Ph.D. dissertation, Computer Science Dept., Stanford University.
Mahadevan, S. & Connell, J. (1990). Automatic programming of behavior-based robots using reinforcement learning. IBM technical report. To appear in Artificial Intelligence.
Minsky, M.L. (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49, 8-30. Reprinted in E.A. Feigenbaum & J. Feldman (Eds.), Computers and Thought, 406-450. New York: McGraw-Hill, 1963.
Narendra, K.S. & Thathachar, M.A.L. (1974). Learning automata: A survey. IEEE Transactions on Systems, Man, and Cybernetics, 4, 323-334. (Or see their textbook, Learning Automata: An Introduction, Englewood Cliffs, NJ: Prentice Hall, 1989.)
Samuel, A.L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210-229. Reprinted in E.A. Feigenbaum & J. Feldman (Eds.), Computers and Thought, 71-105. New York: McGraw-Hill, 1963.
Waltz, M.D. & Fu, K.S. (1965). A heuristic approach to reinforcement learning control systems. IEEE Transactions on Automatic Control, AC-10, 390-398.
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. dissertation, Psychology Department, Cambridge University.
Werbos, P.J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, Jan-Feb.
Whitehead, S.D. & Ballard, D.H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7, 45-84.
Abstract. This article presents a general class of associative reinforcement learning algorithms for connectionist
networks containing stochastic units. These algorithms, called REINFORCE algorithms, are shown to make weight
adjustments in a direction that lies along the gradient of expected reinforcement in both immediate-reinforcement
tasks and certain limited forms of delayed-reinforcement tasks, and they do this without explicitly computing
gradient estimates or even storing information from which such estimates could be computed. Specific examples
of such algorithms are presented, some of which bear a close relationship to certain existing algorithms while
others are novel but potentially interesting in their own right. Also given are results that show how such algorithms
can be naturally integrated with backpropagation. We close with a brief discussion of a number of additional
issues surrounding the use of such algorithms, including what is known about their limiting behaviors as well
as further considerations that might be used to help develop similar but potentially more powerful reinforcement
learning algorithms.
Keywords. Reinforcement learning, connectionist networks, gradient descent, mathematical analysis
1. Introduction
The general framework of reinforcement learning encompasses a broad variety of problems
ranging from various forms of function optimization at one extreme to learning control
at the other. While research in these individual areas tends to emphasize different sets of
issues in isolation, it is likely that effective reinforcement learning techniques for autonomous
agents operating in realistic environments will have to address all of these issues jointly.
Thus while it remains a useful research strategy to focus on limited forms of reinforcement
learning problems simply to keep the problems tractable, it is important to keep in mind
that eventual solutions to the most challenging problems will probably require integration
of a broad range of applicable techniques.
In this article we present analytical results concerning certain algorithms for tasks that
are associative, meaning that the learner is required to perform an input-output mapping,
and, with one limited exception, that involve immediate reinforcement, meaning that the
reinforcement (i.e., payoff) provided to the learner is determined by the most recent input-output pair only. While delayed reinforcement tasks are obviously important and are receiving
much-deserved attention lately, a widely used approach to developing algorithms for such
tasks is to combine an immediate-reinforcement learner with an adaptive predictor or critic
based on the use of temporal difference methods (Sutton, 1988). The actor-critic algorithms
investigated by Barto, Sutton and Anderson (1983) and by Sutton (1984) are clearly of this
form, as is the Q-learning algorithm of Watkins (1989; Barto, Sutton, & Watkins, 1990).
A further assumption we make here is that the learner's search behavior, always a necessary
component of any form of reinforcement learning algorithm, is provided by means of randomness in the input-output behavior of the learner. While this is a common way to achieve
the desired exploratory behavior, it is worth noting that other strategies are sometimes
available in certain cases, including systematic search or consistent selection of the apparent best alternative. This latter strategy works in situations where the goodness of alternative actions is determined by estimates which are always overly optimistic and which
become more realistic with continued experience, as occurs for example in A* search
(Nilsson, 1980).
In addition, all results will be framed here in terms of connectionist networks, and the
main focus is on algorithms that follow or estimate a relevant gradient. While such algorithms
are known to have a number of limitations, there are a number of reasons why their study
can be useful. First, as experience with backpropagation (le Cun, 1985; Parker, 1985;
Rumelhart, Hinton & Williams, 1986; Werbos, 1974) has shown, the gradient seems to
provide a powerful and general heuristic basis for generating algorithms which are often
simple to implement and surprisingly effective in many cases. Second, when more
sophisticated algorithms are required, gradient computation can often serve as the core
of such algorithms. Also, to the extent that certain existing algorithms resemble the
algorithms arising from such a gradient analysis, our understanding of them may be
enhanced.
Another distinguishing property of the algorithms presented here is that while they can
be described roughly as statistically climbing an appropriate gradient, they manage to do
this without explicitly computing an estimate of this gradient or even storing information
from which such an estimate could be directly computed. This is the reason they have
been called simple in the title. Perhaps a more informative adjective would be non-model-based. This point is discussed further in a later section of this paper.
Although we adopt a connectionist perspective here, it should be noted that certain aspects
of the analysis performed carry over directly to other ways of implementing adaptive input-output mappings. The results to be presented apply in general to any learner whose input-output mapping consists of a parameterized, input-controlled distribution function from
which outputs are randomly generated, and the corresponding algorithms modify the
learner's distribution function on the basis of performance feedback. Because of the gradient approach used here, the only restriction on the potential applicability of these results
is that certain obvious differentiability conditions must be met.
A number of the results presented here have appeared in various forms in several earlier
technical reports and conference papers (Williams, 1986; 1987a; 1987b; 1988a; 1988b).
corresponding activity through the net, and sending the activity produced at its output units
to the environment for evaluation. The evaluation consists of the scalar reinforcement signal
r, which we assume is broadcast to all units in the net. At this point each unit performs
an appropriate modification of its weights, based on the particular learning algorithm in
use, and the cycle begins again.
The notation we use throughout is as follows: Let Yi denote the output of the ith unit
in the network, and let Xi denote the pattern of input to that unit. This pattern of input
Xi is a vector whose individual elements (typically denoted Xj) are either the outputs of
certain units in the network (those sending their output directly to the ith unit) or certain
inputs from the environment (if that unit happens to be connected so that it receives input
directly from the environment). The output y_i is then drawn from a distribution depending
on x^i and the weights w_ij on the input lines to this unit. For each i, let w^i denote the weight
vector consisting of all the weights w_ij. Then let W denote the weight matrix consisting of
all weights w_ij in the network. In a more general setting, w^i can be viewed as the collection of all parameters on which the behavior of the ith unit (or agent) depends, while W
is the collection of parameters on which the behavior of the entire network (or collection of agents)
depends.
In addition, for each i let g_i(ξ, w^i, x^i) = Pr{y_i = ξ | w^i, x^i}, so that g_i is the probability
mass function determining the value of y_i as a function of the parameters of the unit and
its input. (For ease of exposition, we consistently use terminology and notation appropriate
for the case when the set of possible output values y_i is discrete, but the results to be derived also apply to continuous-valued units when g_i is taken to be the corresponding probability density function.) Since the vector w^i contains all the network parameters on which
the input-output behavior of the ith unit depends, we could just as well have defined g_i
by g_i(ξ, w^i, x^i) = Pr{y_i = ξ | W, x^i}.
Note that many of the quantities we have named here, such as r, Yi, and Xi, actually depend on time, but it is generally convenient in the sequel to suppress explicit reference
to this time dependence, with the understanding that when several such quantities appear
in a single equation they represent the values for the same time step t. We assume that
each new time step begins just before external input is presented to the network. In the
context of immediate-reinforcement tasks we also call each time step's cycle of network-environment interaction a trial.
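As a concrete picture of this cycle, the following is a minimal Python sketch of one way the trial loop could be organized; the `environment` and `network` objects and their method names are purely illustrative assumptions and are not part of the paper.

```python
def run_trials(network, environment, num_trials):
    """Sketch of the trial cycle described above: present an input, let the
    stochastic network produce an output, receive the scalar reinforcement r
    (broadcast to all units), and let every unit update its weights."""
    for _ in range(num_trials):
        x = environment.observe()        # external input for this trial
        y = network.sample_output(x)     # units draw their outputs randomly
        r = environment.evaluate(y)      # scalar reinforcement signal
        network.update_weights(r)        # each unit adjusts its own weights
```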
To illustrate these definitions and also introduce a useful subclass, we define a stochastic
semilinear unit to be one whose output Yi is drawn from some given probability distribution whose mass function has a single parameter Pi, which is in turn computed as
$$p_i = f_i(s_i), \tag{1}$$

where

$$s_i = \mathbf{w}^{i\,T} \mathbf{x}^i = \sum_j w_{ij} x_j, \tag{2}$$
the inner product of Wi and Xi. This can be viewed as a semilinear unit, as widely used
in connectionist networks, followed by a singly parameterized random number generator.
A noteworthy special case of a stochastic semilinear unit is a Bernoulli semilinear unit,
for which the output Yi is a Bernoulli random variable with parameter Pi, which means
that the only possible output values are 0 and 1, with Pr{y_i = 0 | w^i, x^i} = 1 − p_i and
Pr{y_i = 1 | w^i, x^i} = p_i. Thus, for a Bernoulli semilinear unit,

$$g_i(y_i, \mathbf{w}^i, \mathbf{x}^i) = \begin{cases} 1 - p_i & \text{if } y_i = 0 \\ p_i & \text{if } y_i = 1, \end{cases}$$
where Pi is computed via equations (1) and (2). This type of unit is common in networks
using stochastic units; it appears, for example, in the Boltzmann machine (Hinton & Sejnowski, 1986) and in the reinforcement learning networks explored by Barto and colleagues
(Barto & Anderson, 1985; Barto & Jordan, 1987; Barto, Sutton, & Brouwer, 1981). While
the name Bernoulli semilinear unit may thus appear to be simply a fancy new name for
something quite familiar, use of this term is intended to emphasize its membership in a
potentially much more general class.
A particular form of squashing function commonly used is the logistic function, given by
$$f_i(s_i) = \frac{1}{1 + e^{-s_i}}. \tag{3}$$
A stochastic semilinear unit using both the Bernoulli random number generator and the
logistic squashing function will be called a Bernoulli-logistic unit.
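The following minimal Python sketch shows one way a Bernoulli-logistic unit of this kind could be simulated; the function name and the use of NumPy are illustrative choices, not part of the original formulation.

```python
import numpy as np

def bernoulli_logistic_unit(w, x, rng=np.random.default_rng()):
    """Sample the output of a Bernoulli-logistic unit: the deterministic
    semilinear part computes p = f(w.x) with f the logistic squashing
    function of equation (3), and a Bernoulli random number generator with
    parameter p then produces the 0/1 output y."""
    s = np.dot(w, x)               # s_i = sum_j w_ij x_j, equation (2)
    p = 1.0 / (1.0 + np.exp(-s))   # p_i = f_i(s_i), equations (1) and (3)
    y = float(rng.random() < p)    # y_i ~ Bernoulli(p_i)
    return y, p
```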
Now we observe that the class of Bernoulli semilinear units includes certain types of
units whose computation is alternatively described in terms of a linear threshold computation together with either additive input noise or a noisy threshold. This observation is useful
because this latter formulation is the one used by Barto and colleagues (Barto, 1985; Barto
& Anandan, 1985; Barto & Anderson, 1985; Barto, Sutton, & Anderson, 1983; Barto,
Sutton, & Brouwer, 1981; Sutton, 1984) in their investigations. Specifically, they assume
a unit computes its output y_i by

$$y_i = \begin{cases} 1 & \text{if } \sum_j w_{ij} x_j + \eta > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where η is drawn randomly from a given distribution Ψ. To see that such a unit may be
viewed as a Bernoulli semilinear unit, let

$$s_i = \sum_j w_{ij} x_j.$$

Then

$$\Pr\{y_i = 1 \mid \mathbf{w}^i, \mathbf{x}^i\} = \Pr\{s_i + \eta > 0 \mid s_i\} = 1 - \Pr\{s_i + \eta \le 0 \mid s_i\} = 1 - \Pr\{\eta \le -s_i \mid s_i\} = 1 - \Psi(-s_i).$$

Thus, as long as Ψ is differentiable, such a unit is a Bernoulli semilinear unit with squashing
function f_i given by f_i(s_i) = 1 − Ψ(−s_i).
4. REINFORCE algorithms
Consider a network facing an associative immediate-reinforcement learning task. Recall
that weights are adjusted in this network following receipt of the reinforcement value r
at each trial. Suppose that the learning algorithm for this network is such that at the end
of each trial each parameter wij in the network is incremented by an amount
$$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij})\, e_{ij},$$
where α_ij is a learning rate factor, b_ij is a reinforcement baseline, and e_ij = ∂ ln g_i/∂w_ij
is called the characteristic eligibility of w_ij. Suppose further that the reinforcement baseline
b_ij is conditionally independent of y_i, given W and x^i, and the rate factor α_ij is nonnegative
and depends at most on w^i and t. (Typically, α_ij will be taken to be a constant.) Any learning algorithm having this particular form will be called a REINFORCE algorithm. The
name is an acronym for "REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility," which describes the form of the algorithm.
What makes this class of algorithms interesting is the following mathematical result:
Theorem 1. For any REINFORCE algorithm, the inner product of E{ΔW | W} and
∇_W E{r | W} is nonnegative. Furthermore, if α_ij > 0 for all i and j, then this inner product is zero only when ∇_W E{r | W} = 0. Also, if α_ij = α is independent of i and j, then
E{ΔW | W} = α ∇_W E{r | W}.
This result relates ∇_W E{r | W}, the gradient in weight space of the performance
measure E{r | W}, to E{ΔW | W}, the average update vector in weight space, for any
REINFORCE algorithm.¹ Specifically, it says that for any such algorithm the average update vector in weight space lies in a direction for which this performance measure is increasing. The last sentence of the theorem is equivalent to the claim that for each weight
w_ij the quantity (r − b_ij) ∂ln g_i/∂w_ij represents an unbiased estimate of ∂E{r | W}/∂w_ij.
A proof of this theorem is given in Appendix A.
There are a number of interesting special cases of such algorithms, some of which coincide with algorithms already proposed and explored in the literature and some of which
are novel. We begin by showing that some existing algorithms are REINFORCE algorithms,
from which it follows immediately that Theorem 1 applies to them. Later we will consider
some novel algorithms belonging to this class.
Consider first a Bernoulli unit having no (nonreinforcement) input and suppose that the
parameter to be adapted is p_i = Pr{y_i = 1}. This is equivalent to a two-action stochastic
learning automaton (Narendra & Thathachar, 1989) whose actions are labeled 0 and 1.
The probability mass function g_i is then given by

$$g_i(y_i, p_i) = \begin{cases} 1 - p_i & \text{if } y_i = 0 \\ p_i & \text{if } y_i = 1, \end{cases} \tag{4}$$
from which it follows that the characteristic eligibility for the parameter Pi is given by
$$\frac{\partial \ln g_i}{\partial p_i}(y_i, p_i) = \begin{cases} -\dfrac{1}{1 - p_i} & \text{if } y_i = 0 \\ \dfrac{1}{p_i} & \text{if } y_i = 1 \end{cases} \;=\; \frac{y_i - p_i}{p_i(1 - p_i)}. \tag{5}$$
using the result (5) above. The special case of this algorithm when the reinforcement signal
is limited to 0 and 1 coincides with the 2-action version of the linear reward-inaction
(L_{R-I}) stochastic learning automaton (Narendra & Thathachar, 1989). A "network" consisting of more than one such unit constitutes a team of such learning automata, each using
its own individual learning rate. The behavior of teams of L_{R-I} automata has been investigated by Narendra and Wheeler (1983; Wheeler & Narendra, 1986).
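As an informal numerical check of the unbiasedness claim in this simplest setting, the following Python snippet averages the quantity (r − b) ∂ln g_i/∂p_i over many trials of a single Bernoulli unit; the reward values r0 and r1 received after the two actions, and the baseline b, are arbitrary assumed numbers used only for illustration.

```python
import numpy as np

# For a Bernoulli unit with no input and assumed rewards r0, r1 for actions
# y = 0 and y = 1, E{r | p} = (1 - p) r0 + p r1, so dE{r}/dp = r1 - r0.
# The REINFORCE estimate (r - b) d(ln g)/dp, with d(ln g)/dp = (y - p)/(p(1-p))
# from equation (5), should average to the same value for any fixed baseline b.
rng = np.random.default_rng(0)
p, b = 0.3, 0.5                 # action probability and an arbitrary baseline
r0, r1 = 0.2, 0.9               # assumed rewards for the two actions
y = (rng.random(1_000_000) < p).astype(float)
r = np.where(y == 1, r1, r0)
estimates = (r - b) * (y - p) / (p * (1 - p))   # per-trial REINFORCE estimate
print(estimates.mean(), "should be close to", r1 - r0)
```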
Now consider a Bernoulli semilinear unit. In this case, g_i(y_i, w^i, x^i) is given by the right-hand side of (4) above, where p_i is expressed in terms of w^i and x^i using equations (1)
and (2). To compute the characteristic eligibility for a particular parameter w_ij, we use
the chain rule. Differentiating equations (1) and (2) yields dp_i/ds_i = f_i'(s_i) and ∂s_i/∂w_ij
= x_j. Noting that ∂ln g_i/∂p_i (y_i, w^i, x^i) is given by the right-hand side of (5) above, we
multiply these three quantities to find that the characteristic eligibility for the weight w_ij
is given by
$$\frac{\partial \ln g_i}{\partial w_{ij}}(y_i, \mathbf{w}^i, \mathbf{x}^i) = \frac{y_i - p_i}{p_i(1 - p_i)}\, f_i'(s_i)\, x_j, \tag{6}$$
as long as p_i is not equal to 0 or 1. In the special case when f_i is the logistic function,
given by equation (3), p_i is never equal to 0 or 1 and f_i'(s_i) = p_i(1 − p_i), so the
characteristic eligibility of w_ij is simply
$$\frac{\partial \ln g_i}{\partial w_{ij}} = (y_i - p_i)\, x_j. \tag{7}$$

A REINFORCE algorithm for a network of such Bernoulli-logistic units, with α_ij = α and b_ij = 0, then prescribes weight increments

$$\Delta w_{ij} = \alpha\, r\, (y_i - p_i)\, x_j, \tag{8}$$
using the result (7) above. It is interesting to compare this with the associative reward-penalty (A_{R-P}) algorithm (Barto, 1985; Barto & Anandan, 1985; Barto & Anderson, 1985;
Barto & Jordan, 1987), which, for r ∈ [0, 1], uses the learning rule

$$\Delta w_{ij} = \alpha \bigl[ r\,(y_i - p_i) + \lambda (1 - r)(1 - y_i - p_i) \bigr]\, x_j,$$
where α is a positive learning rate parameter and 0 < λ ≤ 1. If λ = 0, this is called
the associative reward-inaction (A_{R-I}) algorithm, and we see that the learning rule reduces
to equation (8) in this case. Thus A_{R-I}, when applied to a network of Bernoulli-logistic
units, is a REINFORCE algorithm.
In all the examples considered so far, the reinforcement baseline is 0. However, the use
of reinforcement comparison (Sutton, 1984) is also consistent with the REINFORCE formulation. For this strategy one maintains an adaptive estimate r̄ of upcoming reinforcement based on past experience. As a particular example, for a network of Bernoulli-logistic
units one may use the learning rule

$$\Delta w_{ij} = \alpha\,(r - \bar{r})(y_i - p_i)\, x_j, \tag{9}$$

where r̄ is updated according to

$$\bar{r}(t) = \gamma\, r(t-1) + (1 - \gamma)\, \bar{r}(t-1), \tag{10}$$

where 0 < γ ≤ 1. More sophisticated strategies are also consistent with the REINFORCE
framework, including making r̄ a function of the current input pattern x^i to the unit.
While the analytical results given here offer no basis for comparing various choices of
reinforcement baseline in REINFORCE algorithms, it is generally believed that the use
of reinforcement comparison leads to algorithms having superior performance in general.
We discuss questions of REINFORCE algorithm performance at greater length below.
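A minimal Python sketch of one trial of this rule for a single Bernoulli-logistic unit follows; the function and argument names (in particular `reward_fn`, standing in for the environment's evaluation) are hypothetical, and the step-size constants are arbitrary illustrative values.

```python
import numpy as np

def reinforce_comparison_step(w, x, r_bar, reward_fn, alpha=0.1, gamma=0.1,
                              rng=np.random.default_rng()):
    """One trial of REINFORCE with reinforcement comparison for a single
    Bernoulli-logistic unit, following the eligibility (7) and the rule
    (9)-(10): delta w_j = alpha (r - r_bar)(y - p) x_j, with r_bar an
    exponentially weighted average of past reinforcement."""
    s = np.dot(w, x)
    p = 1.0 / (1.0 + np.exp(-s))
    y = float(rng.random() < p)
    r = reward_fn(x, y)                          # scalar reinforcement for this trial
    w = w + alpha * (r - r_bar) * (y - p) * x    # REINFORCE update, eq. (9)
    r_bar = gamma * r + (1.0 - gamma) * r_bar    # adaptive reinforcement estimate, eq. (10)
    return w, r_bar
```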
$$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij}) \sum_{t=1}^{k} e_{ij}(t), \tag{11}$$
where all notation is the same as that defined earlier, with e_ij(t) representing the
characteristic eligibility for w_ij evaluated at the particular time t. By definition, e_ij(t) =
e^t_ij, where this latter quantity makes sense within the acyclic network N*. For example, in a completely interconnected recurrent network of Bernoulli-logistic units that is updated synchronously, e_ij(t) = (y_i(t) − p_i(t)) x_j(t − 1). All quantities are assumed to satisfy the same
conditions required for the REINFORCE algorithm, where, in particular, for each i and
j, the reinforcement baseline b_ij is independent of any of the output values y_i(t) and the
rate factor α_ij depends at most on w^i and the episode number. Call any algorithm of this form
(and intended for such a learning problem) an episodic REINFORCE algorithm.
For example, if the network consists of Bernoulli-logistic units, an episodic REINFORCE
algorithm would prescribe weight changes according to the rule

$$\Delta w_{ij} = \alpha_{ij}\,(r - b_{ij}) \sum_{t=1}^{k} \bigl( y_i(t) - p_i(t) \bigr)\, x_j(t-1).$$
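The following Python sketch shows one way such an episodic update could be implemented for a layer of Bernoulli-logistic units; for simplicity it uses each unit's input at time t in that step's eligibility (rather than the previous step's input, as in the synchronously updated recurrent case mentioned above), and all names are illustrative.

```python
import numpy as np

def episodic_reinforce_update(W, episode, r, baseline, alpha=0.01):
    """Sketch of an episodic REINFORCE update for Bernoulli-logistic units.
    `episode` is a list of (x, y, p) tuples recorded over the k time steps of
    one episode; the characteristic eligibilities are accumulated over the
    episode and a single weight change is made when the terminal
    reinforcement r arrives."""
    eligibility = np.zeros_like(W)
    for x, y, p in episode:
        eligibility += np.outer(y - p, x)   # sum over t of (y_i(t) - p_i(t)) x_j(t)
    return W + alpha * (r - baseline) * eligibility
```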
For a unit whose output y is drawn from a normal distribution with adaptable mean μ and standard deviation σ, the characteristic eligibility of the mean is

$$\frac{\partial \ln g}{\partial \mu} = \frac{y - \mu}{\sigma^{2}}, \tag{12}$$

while the characteristic eligibility of the standard deviation σ is

$$\frac{\partial \ln g}{\partial \sigma} = \frac{(y - \mu)^{2} - \sigma^{2}}{\sigma^{3}}.$$
A REINFORCE algorithm for such a unit then prescribes parameter changes of the form

$$\Delta \mu = \alpha_{\mu}\,(r - b_{\mu})\, \frac{y - \mu}{\sigma^{2}} \tag{13}$$

and

$$\Delta \sigma = \alpha_{\sigma}\,(r - b_{\sigma})\, \frac{(y - \mu)^{2} - \sigma^{2}}{\sigma^{3}}, \tag{14}$$

where α_μ, b_μ, α_σ, and b_σ are chosen appropriately. A reasonable algorithm is obtained
by setting α_μ = α_σ = ασ², where α is a suitably small positive constant,² and letting b_μ = b_σ be determined according to a reinforcement comparison scheme.
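A minimal Python sketch of one trial of this algorithm for a single Gaussian unit, with α_μ = α_σ = ασ² and a reinforcement-comparison baseline, is given below; `reward_fn` is a hypothetical stand-in for the environment, and the constants are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_unit_step(mu, sigma, r_bar, reward_fn, alpha=0.05, gamma=0.1,
                       rng=np.random.default_rng()):
    """One trial of REINFORCE for a single Gaussian unit with adaptable mean
    and standard deviation, using the eligibilities (12), the updates (13)-(14)
    with alpha_mu = alpha_sigma = alpha * sigma**2, and a reinforcement-
    comparison baseline r_bar."""
    y = mu + sigma * rng.standard_normal()   # sample the unit's output
    r = reward_fn(y)                         # scalar reinforcement
    d_mu = alpha * (r - r_bar) * (y - mu)
    d_sigma = alpha * (r - r_bar) * ((y - mu) ** 2 - sigma ** 2) / sigma
    mu, sigma = mu + d_mu, sigma + d_sigma   # note: sigma can drift toward 0 (see note 2)
    r_bar = gamma * r + (1.0 - gamma) * r_bar
    return mu, sigma, r_bar
```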
It is interesting to note the resemblance between equation (12), giving the characteristic
eligibility for the parameter μ of the normal distribution, and equation (5), giving the
characteristic eligibility for the parameter p of the Bernoulli distribution. Since p is the
mean and p(1 − p) the variance of the corresponding Bernoulli random variable, both
equations have the same form. In fact, the characteristic eligibility of the mean parameter
has this form for an even wider variety of distributions, as stated in the following result:
Proposition 1. Suppose that the probability mass or density function g has the form
$$g(y, \mu, \theta_1, \ldots, \theta_k) = \exp\bigl[\, Q(\mu, \theta_1, \ldots, \theta_k)\, y + D(\mu, \theta_1, \ldots, \theta_k) + T(y, \theta_1, \ldots, \theta_k) \,\bigr],$$

where μ is the mean of the distribution and θ_1, ..., θ_k are additional parameters. Then

$$\frac{\partial \ln g}{\partial \mu} = \frac{y - \mu}{\sigma^{2}},$$

where σ² is the variance of the distribution.
Mass or density functions having this form represent special cases of exponential families
of distributions (Rohatgi, 1976). It is easily checked that a number of familiar distributions, such as the Poisson, exponential, Bernoulli, and normal distributions, are all of this
form. A proof of this proposition is given in Appendix B.
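As a quick concrete check (not part of the original text), consider the Poisson distribution with mean μ, for which the variance is also μ:

$$g(y, \mu) = \frac{e^{-\mu}\,\mu^{y}}{y!} = \exp\bigl[\, y \ln \mu - \mu - \ln y! \,\bigr], \qquad \frac{\partial \ln g}{\partial \mu} = \frac{y}{\mu} - 1 = \frac{y - \mu}{\mu} = \frac{y - \mu}{\sigma^{2}},$$

with Q(μ) = ln μ, D(μ) = −μ, and T(y) = −ln y!, exactly as the proposition requires.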
7. Compatibility with backpropagation
It is useful to note that REINFORCE, like most other reinforcement learning algorithms
for networks of stochastic units, works essentially by measuring the correlation between
variations in local behavior and the resulting variations in global performance, as given
by the reinforcement signal. When such algorithms are used, all information about the
effect of connectivity between units is ignored; each unit in the network tries to determine
the effect of changes of its output on changes in reinforcement independently of its effect
on even those units to which it is directly connected. In contrast, the backpropagation
algorithm works by making use of the fact that entire chains of effects are predictable from
knowledge of the effects of individual units on each other. While the backpropagation
algorithm is appropriate only for supervised learning in networks of deterministic units,
it makes sense to also use the term backpropagation for the single component of this
algorithm that determines relevant partial derivatives by means of the backward pass. (In
this sense it is simply a computational implementation of the chain rule.) With this meaning of the term we can then consider how backpropagation might be integrated into the
statistical gradient-following reinforcement learning algorithms investigated here, thereby
giving rise to algorithms that can take advantage of relevant knowledge of network connectivity where appropriate. Here we examine two ways that backpropagation can be used.
When the output units are the only stochastic units in the network, the probability of any particular output vector ξ, given the weights W and the network input x, factors as

$$g(\xi, W, x) = \Pr\{y = \xi \mid W, x\} = \prod_{k \in O} \Pr\{y_k = \xi_k \mid W, x\} = \prod_{k \in O} \Pr\{y_k = \xi_k \mid \mathbf{w}^k, \mathbf{x}^k\},$$

where O denotes the set of output units and, for each k, x^k is the pattern appearing at the input to the kth unit as a result of
presentation of the pattern x to the network. Note that each x^k depends deterministically
on x.
Therefore,
16
241
GRADIENT ALGORITHMS
$$\ln g(\xi, W, x) = \ln \prod_{k \in O} g_k(\xi_k, \mathbf{w}^k, \mathbf{x}^k) = \sum_{k \in O} \ln g_k(\xi_k, \mathbf{w}^k, \mathbf{x}^k),$$

so that

$$\frac{\partial \ln g}{\partial w_{ij}}(\xi, W, x) = \sum_{k \in O} \frac{\partial \ln g_k}{\partial w_{ij}}(\xi_k, \mathbf{w}^k, \mathbf{x}^k).$$
Clearly, this sum may be computed via backpropagation. For example, when the output
units are Bernoulli semilinear units, we can use the parameters p_k as intermediate variables
and write the characteristic eligibility of any weight w_ij as

$$\frac{\partial \ln g}{\partial w_{ij}} = \sum_{k \in O} \frac{\partial \ln g_k}{\partial p_k}\, \frac{\partial p_k}{\partial w_{ij}},$$

which can be computed by injecting the quantity ∂ln g_k/∂p_k, given by the right-hand side of (5),
just after the kth unit's squashing function, for each k, and then performing the standard
backward pass. Note that if w_ij is a weight attached to an output unit, this backpropagation computation simply gives rise to the result (6) derived earlier. For that result we essentially backpropagated the characteristic eligibility of the Bernoulli parameter p_i through
the sub-units consisting of the "squasher" and the "summer."
While we have restricted attention here to networks having stochastic output units only,
it is not hard to see that such a result can be further generalized to any network containing
an arbitrary mixture of stochastic and deterministic units. The overall algorithm in this
case consists of the use of the correlation-style REINFORCE computation at each stochastic
unit, whether an output unit or not, with backpropagation used to compute (or, more precisely, estimate) all other relevant partial derivatives.
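The following Python sketch illustrates this overall scheme for a small network with one deterministic logistic hidden layer and Bernoulli-logistic output units: the factor (y_k − p_k) is injected at each output unit and backpropagated through the deterministic part of the network. The function names, the use of NumPy, and the `reward_fn`/`baseline` arguments are illustrative assumptions rather than anything specified in the paper.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def reinforce_backprop_step(W_h, W_o, x, reward_fn, baseline, alpha=0.05,
                            rng=np.random.default_rng()):
    """Sketch of REINFORCE combined with backpropagation for a network with a
    deterministic logistic hidden layer (weights W_h) and Bernoulli-logistic
    output units (weights W_o)."""
    h = sigmoid(W_h @ x)                          # deterministic hidden activations
    p = sigmoid(W_o @ h)                          # Bernoulli parameters of output units
    y = (rng.random(p.shape) < p).astype(float)   # stochastic outputs
    r = reward_fn(y)                              # scalar reinforcement for this trial

    delta_o = y - p                               # eligibility factor at each output unit (eq. (7))
    elig_o = np.outer(delta_o, h)                 # eligibility of output weights
    delta_h = (W_o.T @ delta_o) * h * (1.0 - h)   # backpropagated through the hidden layer
    elig_h = np.outer(delta_h, x)                 # estimated eligibility of hidden weights

    W_o = W_o + alpha * (r - baseline) * elig_o
    W_h = W_h + alpha * (r - baseline) * elig_h
    return W_h, W_o
```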
Furthermore, it is not difficult to prove an even more general compatibility between computation of unbiased estimates, not necessarily based on REINFORCE, and backpropagation through deterministic functions. The result is, essentially, that when one set of variables
depends deterministically on a second set of variables, backpropagating unbiased estimates
of partial derivatives with respect to the first set of variables gives rise to unbiased estimates
of partial derivatives with respect to the second set of variables. It is intuitively reasonable
that this should be true, but we omit the rigorous mathematical details here since we make
no use of the result.
The output y of a Gaussian unit with mean μ and standard deviation σ can be written as

$$y = \mu + \sigma z,$$

where z is a standard normal random variable that does not depend on μ or σ, and

$$\frac{\partial y}{\partial \sigma} = z = \frac{y - \mu}{\sigma}.$$
Thus, for example, one may combine the use of backpropagation through Gaussian hidden
units with REINFORCE in the output units. In this case the characteristic eligibility for
the parameter μ in such a unit is set equal to that computed for the output value y, while the characteristic
eligibility for the σ parameter is obtained by multiplying that for y by (y − μ)/σ. It is worth
noting that these particular results in no way depend on the fact that μ is the mean and
σ the standard deviation; the identical result applies whenever μ represents a translation
parameter and σ a scaling parameter for the distribution. More generally, the same technique can obviously be used whenever the output can be expressed as a function of the
parameters together with some auxiliary random variables, as long as the dependence on
the parameters is differentiable.
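A minimal Python sketch of this reparameterization for a single Gaussian output; the function name is purely illustrative.

```python
import numpy as np

def sample_gaussian_with_grads(mu, sigma, rng=np.random.default_rng()):
    """Sketch of 'backpropagating through' a Gaussian random number generator:
    writing the output as y = mu + sigma * z, with z a standard normal draw,
    gives dy/dmu = 1 and dy/dsigma = z = (y - mu)/sigma, so any derivative
    with respect to y can be converted into derivatives with respect to the
    unit's parameters."""
    z = rng.standard_normal()
    y = mu + sigma * z
    dy_dmu = 1.0
    dy_dsigma = z            # equivalently (y - mu) / sigma
    return y, dy_dmu, dy_dsigma
```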
Note that the argument given here is based on the results obtained above for the use
of backpropagation when computing the characteristic eligibility in a REINFORCE
algorithm, so the conclusion is necessarily limited to this particular use of backpropagation here. Nevertheless, because it is also true that backpropagation preserves the unbiasedness of gradient estimates in general, this form of argument can be applied to yield
statistical gradient-following algorithms that make use of backpropagation in a variety of
other situations where a network of continuous-valued stochastic units is used. One such
application is to supervised training of such networks.
when these simpler reinforcement functions are used. Like A_{R-P}, REINFORCE is generally very slow even when it succeeds. Episodic REINFORCE has been found to be
especially slow, but this, too, is not surprising since it performs temporal credit-assignment
by essentially spreading credit or blame uniformly over all past times.
One REINFORCE algorithm whose asymptotic behavior is reasonably well understood
analytically is 2-action L_{R-I}, and simulation experience obtained to date with a number
of other REINFORCE algorithms suggests that their range of possible limiting behaviors
may, in fact, be similar. The L_{R-I} algorithm is known to converge to a single deterministic
choice of action with probability 1. What is noteworthy about this convergence is that,
in spite of the fact that the expected motion is always in the direction of the best action,
as follows from Theorem 1, there is always a nonzero probability of its converging to an
inferior choice of action. A simpler example that exhibits the same kind of behavior is
a biased random walk on the integers with absorbing barriers. Even though the motion
is biased in a particular direction, there is always a nonzero probability of being absorbed
at the other barrier.
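A short simulation (with arbitrarily chosen barrier positions and bias, not taken from the text) illustrates the point: even with upward-biased steps, a noticeable fraction of walks is absorbed at the lower barrier.

```python
import numpy as np

# Biased random walk with absorbing barriers: steps go up with probability 0.6,
# yet some walks are still absorbed at the lower barrier.
rng = np.random.default_rng(0)
lower, upper, start, p_up = 0, 10, 3, 0.6
absorbed_low = 0
for _ in range(10_000):
    pos = start
    while lower < pos < upper:
        pos += 1 if rng.random() < p_up else -1
    absorbed_low += (pos == lower)
print("fraction absorbed at the lower barrier:", absorbed_low / 10_000)
```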
In general, a reasonable conjecture consistent with what is known analytically about simple
REINFORCE algorithms like L R- 1 and what has been found in simulations of more
sophisticated REINFORCE algorithms is the following: Depending on the choice of reinforcement baseline used, any such algorithm is more or less likely to converge to a local
maximum of the expected reinforcement function, with some nonzero (but typically comfortably small) probability of convergence to other points that lead to zero variance in network behavior. For further discussion of the role of the reinforcement baseline, see below.
and then stop changing. Simulation studies using both deterministic and noisy reinforcement confirm this behavior. They also demonstrate that if r is always nonnegative and reinforcement comparison is not used (i.e., b = 0), REINFORCE may cause σ to converge
to 0 before μ has moved to the top of any hill. This can be viewed as a generalization of
the potential convergence to suboptimal performance described earlier for L_{R-I}.
It is interesting to compare REINFORCE for such a unit with an alternative algorithm
for the adaptation of μ and σ that has been proposed by Gullapalli (1990). In this approach,
μ is adapted in essentially the same manner as in REINFORCE but σ is adapted in a quite
different manner. With reinforcement values r assumed to lie between 0 and 1, σ is taken
to be proportional to 1 − r. This strategy makes sense if one takes the point of view that
σ is a parameter controlling the scale of the search being performed and the optimum value
for the function is unknown. In those situations when it is known that unsatisfactory performance is being achieved it is reasonable to broaden this scale in order to take a coarse-grained view of the search space and identify a broad region in which the optimum has
a reasonable chance of being found.
Also relevant here is the work of Schmidhuber and Huber (1990), who have reported
successful results using networks having Gaussian output units in control tasks involving
backpropagating through a model (Jordan & Rumelhart, 1990). In this work, backpropagation through random number generators was used to allow learning of a model and learning of performance to proceed simultaneously rather than in separate phases.
a slight improvement in convergence speed over the use of mean reinforcement, but a more
convincing advantage remains to be demonstrated.
$$\Delta w_{ij} = \alpha\,(r - \bar{r})(y_i - \bar{y}_i)\, x_j,$$

where r̄ is computed according to

$$\bar{r}(t) = \gamma\, r(t-1) + (1 - \gamma)\, \bar{r}(t-1),$$

where 0 < γ ≤ 1, and where ȳ_i is updated by

$$\bar{y}_i(t) = \gamma\, y_i(t-1) + (1 - \gamma)\, \bar{y}_i(t-1),$$

using the same γ as is used for updating r̄. This particular algorithm has been found generally
to converge faster and more reliably than the corresponding REINFORCE algorithm.
It is clear that the two algorithms bear some strong similarities. The variant is obtained
by simply replacing p_i by ȳ_i, and each of these can be viewed as a reasonable a priori estimate
of the output y_i. Furthermore, the corresponding strategy can be used to generate variants
of REINFORCE in a number of other cases. For example, if the randomness in the unit
uses any distribution to which Proposition 1 applies, then the REINFORCE algorithm for
adjusting its mean parameter μ will involve the factor y − μ and we can simply replace
this by y − ȳ. Such an algorithm for adapting the mean of a Gaussian unit has been tested
and found to behave very well.
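A sketch of this variant for adapting the mean of a Gaussian unit might look as follows in Python; the exponential-averaging constants and the `reward_fn` stand-in are illustrative assumptions.

```python
import numpy as np

def gaussian_mean_variant_step(mu, sigma, r_bar, y_bar, reward_fn,
                               alpha=0.05, gamma=0.1,
                               rng=np.random.default_rng()):
    """Variant discussed above: the REINFORCE factor (y - mu) is replaced by
    (y - y_bar), where y_bar is an exponentially weighted average of past
    outputs kept with the same gamma used for the reinforcement estimate."""
    y = mu + sigma * rng.standard_normal()
    r = reward_fn(y)
    mu = mu + alpha * (r - r_bar) * (y - y_bar)   # (y - mu) replaced by (y - y_bar)
    r_bar = gamma * r + (1.0 - gamma) * r_bar
    y_bar = gamma * y + (1.0 - gamma) * y_bar
    return mu, r_bar, y_bar
```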
While some arguments can be given (Rich Sutton, personal communication, 1988) that
suggest potential advantages of the use of y − ȳ in such algorithms, a more complete analysis
has not yet been performed. Interestingly, one possible analytical justification for the use
of such algorithms may be found in considerations like those discussed next.
using the y − ȳ form of eligibility described above may be related to such an approach,
but this has not been fully analyzed yet.
9. Conclusion
The analyses presented here, together with a variety of simulation experiments performed
by this author and others, suggest that REINFORCE algorithms are useful in their own
right and, perhaps more importantly, may serve as a sound basis for developing other more
effective reinforcement learning algorithms. One major advantage of the REINFORCE
approach is that it represents a prescription for devising statistical gradient-following
algorithms for reinforcement-learning networks of units that compute their random output
in essentially any arbitrary fashion. Also, because it is a gradient-based approach, it integrates well with other gradient computation techniques such as backpropagation. The
main disadvantages are the lack of a general convergence theory applicable to this class
of algorithms and, as with all gradient algorithms, an apparent susceptibility to convergence
to false optima.
Acknowledgments
I have benefitted immeasurably from numerous discussions with Rich Sutton and Andy
Barto on various aspects of the material presented herein. Preparation of this paper was
supported by the National Science Foundation under grant IRI-8921275.
Notes
1. In more detail, ∇_W E{r | W} and E{ΔW | W} are both vectors having the same dimensionality as W, with
the (i, j) coordinate of ∇_W E{r | W} being ∂E{r | W}/∂w_ij and the corresponding coordinate of E{ΔW | W}
being E{Δw_ij | W}.
2. Strictly speaking, there is no choice of α for this algorithm guaranteeing that σ will not become negative,
unless the normal distribution has its tails truncated (which is necessarily the case in practice). Another approach is to take λ = ln σ as the adaptable parameter rather than σ, which leads to an algorithm guaranteed
to keep σ positive.
Appendix A
This appendix contains proofs of Theorems 1 and 2 on REINFORCE and episodic REINFORCE algorithms, respectively. In addition to the notation introduced in the text, we symbolize some sets of interest by letting Y_i denote the set of possible output values y_i of the
ith unit, with X_i denoting the set of possible values of the input vector x^i to this unit.
Although it is not a critical assumption, we take Y_i and X_i to be discrete sets throughout.
Also, we let I denote the index set for elements of W, so that (i, j) ∈ I if and only if w_ij
is a parameter in the system.
It should be remarked here that, in the interest of brevity, all the assertions proved in
this appendix make use of a convention in which each unbound variable is implicitly assumed
to be universally quantified over an appropriate set of values. For example, whenever i
and j appear, they are to be considered arbitrary (subject only to (i, j) E I).
Fact 1.

$$\frac{\partial E\{r \mid W, \mathbf{x}^i\}}{\partial w_{ij}} = \sum_{\xi \in Y_i} E\{r \mid W, \mathbf{x}^i, y_i = \xi\}\, \frac{\partial g_i}{\partial w_{ij}}(\xi, \mathbf{w}^i, \mathbf{x}^i).$$
Proof. Conditioning on the possible values of the output y_i, we may write

$$E\{r \mid W, \mathbf{x}^i\} = \sum_{\xi \in Y_i} E\{r \mid W, \mathbf{x}^i, y_i = \xi\}\, \Pr\{y_i = \xi \mid W, \mathbf{x}^i\} = \sum_{\xi \in Y_i} E\{r \mid W, \mathbf{x}^i, y_i = \xi\}\, g_i(\xi, \mathbf{w}^i, \mathbf{x}^i).$$

Note that specification of the value of y_i causes w_ij to have no influence on the ultimate
value of r, which means that E{r | W, x^i, y_i = ξ} does not depend on w_ij. The result
then follows by differentiating both sides of this last equation with respect to w_ij.
Fact 2.

$$\sum_{\xi \in Y_i} \frac{\partial g_i}{\partial w_{ij}}(\xi, \mathbf{w}^i, \mathbf{x}^i) = 0.$$

Proof. Since Σ_{ξ∈Y_i} g_i(ξ, w^i, x^i) = 1, the result follows by differentiating both sides with respect to w_ij.
Lemma 1. For any REINFORCE algorithm,

$$E\{\Delta w_{ij} \mid W, \mathbf{x}^i\} = \alpha_{ij}\, \frac{\partial E\{r \mid W, \mathbf{x}^i\}}{\partial w_{ij}}.$$

Proof. For any REINFORCE algorithm, Δw_ij = α_ij(r − b_ij) e_ij, where the characteristic eligibility can be written

$$e_{ij} = \frac{\partial \ln g_i}{\partial w_{ij}} = \frac{1}{g_i}\, \frac{\partial g_i}{\partial w_{ij}}.$$

Although this fails to be defined when g_i = 0, it will still be the case that Δw_ij is well-defined for any REINFORCE algorithm as long as Y_i is discrete. This is because
g_i(ξ, w^i, x^i) = 0 means that the value ξ has zero probability of occurrence as a value of
the output y_i.
Then

$$E\{\Delta w_{ij} \mid W, \mathbf{x}^i\} = \sum_{\xi \in Y_i} E\{\Delta w_{ij} \mid W, \mathbf{x}^i, y_i = \xi\}\, \Pr\{y_i = \xi \mid W, \mathbf{x}^i\}$$

$$= \alpha_{ij} \sum_{\xi \in Y_i} E\{r \mid W, \mathbf{x}^i, y_i = \xi\}\, \frac{\partial g_i}{\partial w_{ij}}(\xi, \mathbf{w}^i, \mathbf{x}^i) \;-\; \alpha_{ij} \sum_{\xi \in Y_i} E\{b_{ij} \mid W, \mathbf{x}^i, y_i = \xi\}\, \frac{\partial g_i}{\partial w_{ij}}(\xi, \mathbf{w}^i, \mathbf{x}^i),$$

making use of the fact that α_ij does not depend on the particular value of the output y_i.
By Fact 1, the first term of this last expression is α_ij ∂E{r | W, x^i}/∂w_ij. Consider the
remaining term. Since E{b_ij | W, x^i, y_i = ξ} = E{b_ij | W, x^i}, by assumption, we have

$$\sum_{\xi \in Y_i} E\{b_{ij} \mid W, \mathbf{x}^i, y_i = \xi\}\, \frac{\partial g_i}{\partial w_{ij}}(\xi, \mathbf{w}^i, \mathbf{x}^i) = E\{b_{ij} \mid W, \mathbf{x}^i\} \sum_{\xi \in Y_i} \frac{\partial g_i}{\partial w_{ij}}(\xi, \mathbf{w}^i, \mathbf{x}^i) = 0,$$

by Fact 2, and the result follows.
Fact 3.

$$\frac{\partial E\{r \mid W\}}{\partial w_{ij}} = \sum_{x \in X_i} \frac{\partial E\{r \mid W, \mathbf{x}^i = x\}}{\partial w_{ij}}\, \Pr\{\mathbf{x}^i = x \mid W\}.$$

Proof. Conditioning on the possible values of the input x^i, we may write
$$E\{r \mid W\} = \sum_{x \in X_i} E\{r \mid W, \mathbf{x}^i = x\}\, \Pr\{\mathbf{x}^i = x \mid W\}.$$

Note that the weight w_ij lies downstream of all computation performed to determine x^i.
This means that Pr{x^i = x | W} does not depend on w_ij, so the result follows by differentiating both sides of this last equation by w_ij.
Lemma 2. For any REINFORCE algorithm,

$$E\{\Delta w_{ij} \mid W\} = \alpha_{ij}\, \frac{\partial E\{r \mid W\}}{\partial w_{ij}}.$$
Proof.

$$E\{\Delta w_{ij} \mid W\} = \sum_{x \in X_i} E\{\Delta w_{ij} \mid W, \mathbf{x}^i = x\}\, \Pr\{\mathbf{x}^i = x \mid W\}$$

$$= \sum_{x \in X_i} \alpha_{ij}\, \frac{\partial E\{r \mid W, \mathbf{x}^i = x\}}{\partial w_{ij}}\, \Pr\{\mathbf{x}^i = x \mid W\} = \alpha_{ij} \sum_{x \in X_i} \frac{\partial E\{r \mid W, \mathbf{x}^i = x\}}{\partial w_{ij}}\, \Pr\{\mathbf{x}^i = x \mid W\} = \alpha_{ij}\, \frac{\partial E\{r \mid W\}}{\partial w_{ij}},$$

where the first equality is obtained by conditioning on the possible input patterns to the
unit, the second equality follows from Lemma 1, the third equality follows from the assumption that α_ij does not depend on the input to the unit, and the last equality follows from
Fact 3.
Establishing this last result, which is just like Lemma 1 except that the conditioning on
input to unit i has been removed from both sides of the equation, is a key step. It relates
two quantities that, unlike those of Lemma 1, would be quite messy to compute explicitly
in general because Pr{x^i = x | W} can be quite complicated. From this lemma our main
result follows easily.
Theorem 1. For any REINFORCE algorithm, E{ΔW | W}^T ∇_W E{r | W} ≥ 0. Furthermore, if α_ij > 0 for all i and j, then equality holds if and only if ∇_W E{r | W} = 0.
Proof.

$$E\{\Delta W \mid W\}^{T}\, \nabla_{W} E\{r \mid W\} = \sum_{(i,j) \in I} E\{\Delta w_{ij} \mid W\}\, \frac{\partial E\{r \mid W\}}{\partial w_{ij}} = \sum_{(i,j) \in I} \alpha_{ij} \left( \frac{\partial E\{r \mid W\}}{\partial w_{ij}} \right)^{2},$$

by Lemma 2, and both claims follow from the assumptions on α_ij.
Fact 4.

$$\frac{\partial E\{r \mid W\}}{\partial w_{ij}} = \sum_{t=1}^{k} \frac{\partial E\{r \mid W^{*}\}}{\partial w^{t}_{ij}}.$$

Proof. Since each weight w^t_ij of the unrolled network N* is constrained to equal w_ij, the chain rule gives

$$\frac{\partial E\{r \mid W\}}{\partial w_{ij}} = \sum_{t=1}^{k} \frac{\partial E\{r \mid W^{*}\}}{\partial w^{t}_{ij}}\, \frac{\partial w^{t}_{ij}}{\partial w_{ij}} = \sum_{t=1}^{k} \frac{\partial E\{r \mid W^{*}\}}{\partial w^{t}_{ij}},$$

since ∂w^t_ij/∂w_ij = 1 for each t.
Lemma 3. For any episodic REINFORCE algorithm,

$$E\{\Delta w_{ij} \mid W\} = \alpha_{ij}\, \frac{\partial E\{r \mid W\}}{\partial w_{ij}}.$$

Proof. Let Δw^t_ij = α_ij(r − b_ij) e_ij(t), so that Δw_ij = Σ_{t=1}^{k} Δw^t_ij. Note that this represents a
REINFORCE algorithm in N*, so it follows from Lemma 2 that

$$E\{\Delta w^{t}_{ij} \mid W^{*}\} = \alpha_{ij}\, \frac{\partial E\{r \mid W^{*}\}}{\partial w^{t}_{ij}}.$$

But then
$$E\{\Delta w_{ij} \mid W\} = E\Bigl\{ \sum_{t=1}^{k} \Delta w^{t}_{ij} \,\Big|\, W^{*} \Bigr\} = \sum_{t=1}^{k} E\{\Delta w^{t}_{ij} \mid W^{*}\} = \sum_{t=1}^{k} \alpha_{ij}\, \frac{\partial E\{r \mid W^{*}\}}{\partial w^{t}_{ij}} = \alpha_{ij}\, \frac{\partial E\{r \mid W\}}{\partial w_{ij}},$$

by Fact 4.
For Theorem 2 we then have

$$E\{\Delta W \mid W\}^{T}\, \nabla_{W} E\{r \mid W\} = \sum_{(i,j) \in I} E\{\Delta w_{ij} \mid W\}\, \frac{\partial E\{r \mid W\}}{\partial w_{ij}} = \sum_{(i,j) \in I} \alpha_{ij} \left[ \frac{\partial E\{r \mid W\}}{\partial w_{ij}} \right]^{2}.$$
Note that the proof of Theorem 2 is identical to that for Theorem 1. This is because
Theorem 1 uses Lemma 2 and Theorem 2 uses Lemma 3, and both lemmas have the same
conclusion.
Appendix B
Proposition 1. Suppose that the probability mass or density function g has the form

$$g(y, \mu, \theta_1, \ldots, \theta_k) = \exp\bigl[\, Q(\mu, \theta_1, \ldots, \theta_k)\, y + D(\mu, \theta_1, \ldots, \theta_k) + T(y, \theta_1, \ldots, \theta_k) \,\bigr],$$

where μ is the mean of the distribution. Then

$$\frac{\partial \ln g}{\partial \mu} = \frac{y - \mu}{\sigma^{2}},$$

where σ² is the variance of the distribution.
Proof. Here we consider the case of a probability mass function only, but a corresponding
argument can be given for a density function.
Let Y denote the support of g. In general,
$$\sum_{y \in Y} \frac{\partial \ln g}{\partial \mu}\, g = \sum_{y \in Y} \frac{\partial g}{\partial \mu} = \frac{\partial}{\partial \mu} \sum_{y \in Y} g = 0, \tag{15}$$

since Σ_{y∈Y} g = 1. Combining this with the fact that μ = Σ_{y∈Y} y g, we also find that
$$\sum_{y \in Y} (y - \mu)\, \frac{\partial \ln g}{\partial \mu}\, g = \sum_{y \in Y} y\, \frac{\partial \ln g}{\partial \mu}\, g - \mu \sum_{y \in Y} \frac{\partial \ln g}{\partial \mu}\, g = \frac{\partial}{\partial \mu} \sum_{y \in Y} y\, g = 1. \tag{16}$$
Now introduce the shorthand notation α = ∂Q/∂μ and β = ∂D/∂μ. From the hypothesis
of the proposition we have

$$\frac{\partial \ln g}{\partial \mu} = \alpha y + \beta,$$

so that
""J g
illng
L
--
alt
yEY
2: (ay + (3) g
yEY
2: yg + {3 2: g
yEY
a/A-
{3.
(17)
yEY
Also,
$$\sum_{y \in Y} (y - \mu)\, \frac{\partial \ln g}{\partial \mu}\, g = \sum_{y \in Y} (y - \mu)(\alpha y + \beta)\, g = \sum_{y \in Y} (y - \mu)\bigl[\alpha (y - \mu) + \alpha \mu + \beta\bigr]\, g = \alpha \sum_{y \in Y} (y - \mu)^{2} g + (\alpha \mu + \beta) \sum_{y \in Y} (y - \mu)\, g = \alpha \sigma^{2}, \tag{18}$$

since Σ_{y∈Y} (y − μ) g = 0.
Combining (15) with (17) and (16) with (18), we see that αμ + β = 0 and ασ² = 1, from which it follows that

$$\frac{\partial \ln g}{\partial \mu}(y, \mu, \theta_1, \ldots, \theta_k) = \alpha y + \beta = \alpha (y - \mu) = \frac{y - \mu}{\sigma^{2}}.$$
References
Barto, A.G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human
Neurobiology, 4, 229-256.
Barto, A.G. & Anandan, P. (1985). Pattern recognizing stochastic learning automata. IEEE Transactions on Systems,
Man, and Cybernetics, 15, 360-374.
Barto, A.G. & Anderson, C.W. (1985). Structural learning in connectionist systems. Proceedings of the Seventh Annual Conference of the Cognitive Science Society (pp. 43-53). Irvine, CA.
Barto, A.G., Sutton, R.S., & Anderson, C.W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 835-846.
Barto, A.G., Sutton, R.S., & Brouwer, P.S. (1981). Associative search network: A reinforcement learning associative memory. Biological Cybernetics, 40, 201-211.
Barto, A.G. & Jordan, M.I. (1987). Gradient following without back-propagation in layered networks. Proceedings of the First Annual International Conference on Neural Networks, Vol. II (pp. 629-636). San Diego, CA.
Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H. (1990). Learning and sequential decision making. In M. Gabriel & J.W. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks. Cambridge, MA: MIT Press.
Dayan, P. (1990). Reinforcement comparison. In D.S. Touretzky, J.L. Elman, T.J. Sejnowski, & G.E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School (pp. 45-51). San Mateo, CA: Morgan Kaufmann.
Goodwin, G.C. & Sin, K.S. (1984). Adaptive filtering, prediction and control. Englewood Cliffs, NJ: Prentice-Hall.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671-692.
Hinton, G.E. & Sejnowski, T.J. (1986). Learning and relearning in Boltzmann machines. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press.
Jordan, M.I. & Rumelhart, D.E. (1990). Forward models: Supervised learning with a distal teacher (Occasional Paper 40). Cambridge, MA: Massachusetts Institute of Technology, Center for Cognitive Science.
le Cun, Y. (1985). Une procédure d'apprentissage pour réseau à seuil asymétrique [A learning procedure for asymmetric threshold networks]. Proceedings of Cognitiva 85, 599-604.
Munro, P. (1987). A dual back-propagation scheme for scalar reward learning. Proceedings of the Ninth Annual
Conference of the Cognitive Science Society (pp. 165-176). Seattle, WA.
Narendra, K.S. & Thathachar, M.A.L. (1989). Learning automata: An introduction. Englewood Cliffs, NJ: Prentice Hall.
Narendra, K.S. & Wheeler, R. M., Jr. (1983). An N-player sequential stochastic game with identical payoffs.
IEEE Transactions on Systems, Man, and Cybernetics, 13, 1154-1158.
Nilsson, N.J. (1980). Principles of artificial intelligence. Palo Alto, CA: Tioga.
Parker, D.B. (1985). Learning-logic (Technical Report TR-47). Cambridge, MA: Massachusetts Institute of Technology, Center for Computational Research in Economics and Management Science.
Rohatgi, V.K. (1976). An introduction to probability theory and mathematical statistics. New York: Wiley.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations. Cambridge, MA: MIT Press.
Schmidhuber, J.H. & Huber, R. (1990). Learning to generate focus trajectories for attentive vision (Technical Report FKI-128-90). Technische Universität München, Institut für Informatik.
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. Dissertation, Dept. of Computer and Information Science, University of Massachusetts, Amherst, MA.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Thathachar, M.A.L. & Sastry, P.S. (1985). A new approach to the design of reinforcement schemes for learning automata. IEEE Transactions on Systems, Man, and Cybernetics, 15, 168-175.
Wheeler, R.M., Jr. & Narendra, K.S. (1986). Decentralized learning in finite Markov chains. IEEE Transactions on Automatic Control, 31, 519-526.
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. dissertation, Cambridge University, Cambridge, England.
Werbos, P.J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University, Cambridge, MA.
Williams, R.J. (1986). Reinforcement learning in connectionist networks: A mathematical analysis (Technical Report 8605). San Diego: University of California, Institute for Cognitive Science.
Williams, R.J. (1987a). Reinforcement-learning connectionist systems (Technical Report NU-CCS-87-3). Boston, MA: Northeastern University, College of Computer Science.
Williams, R.J. (1987b). A class of gradient-estimating algorithms for reinforcement learning in neural networks. Proceedings of the First Annual International Conference on Neural Networks, Vol. II (pp. 601-608). San Diego, CA.
Williams, R.J. (1988a). On the use of backpropagation in associative reinforcement learning. Proceedings of the Second Annual International Conference on Neural Networks, Vol. I (pp. 263-270). San Diego, CA.
Williams, R.J. (1988b). Toward a theory of reinforcement-learning connectionist systems (Technical Report NU-CCS-88-3). Boston, MA: Northeastern University, College of Computer Science.
Williams, R.J. & Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms.
Connection Science, 3, 241-268.
1. Introduction
One of the most fascinating and challenging paradigms of traditional machine learning research is the delayed reinforcement learning paradigm. In the simplest form of this paradigm,
the learning system passively observes a temporal sequence of input states that eventually
leads to a final reinforcement or reward signal (usually a scalar). The learning system's
task in this case is to predict expected reward given an observation of an input state or
sequence of input states. The system may also be set up so that it can generate control
signals that influence the sequence of states. In this case the learning task is usually to
generate the optimal control signals that will lead to maximum reinforcement.
Delayed reinforcement learning is difficult for two reasons. First, there is no explicit
teacher signal that indicates the correct output at each time step. Second, the temporal
delay of the reward signal implies that the learning system must solve a temporal credit
assignment problem, i.e., must apportion credit and blame to each of the states and actions
that resulted in the final outcome of the sequence.
Despite these difficulties, delayed reinforcement learning has attracted considerable interest for many years in the machine learning community. The notion of a learning system
interacting with an environment and learning to perform a task solely from the outcome
of its experience in the environment is very intellectually appealing. It could also have
numerous practical applications in areas such as manufacturing process control, navigation
and path planning, and trading in financial markets.
One possible approach to temporal credit assignment is to base the apportionment of
credit on the difference between temporally successive predictions. Algorithms using this
approach have been termed "temporal difference" methods in Sutton (1988), and have been
studied for many years in a variety of contexts. Examples include Samuel's checkers program (Samuel, 1959) and Holland's bucket brigade algorithm (Holland, 1986). An incremental real-time algorithm called TD(λ) has been proposed in Sutton (1988) for adjusting the
weights in a connectionist network. It has the following form:
$$\Delta w_t = \alpha\,(P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{\,t-k}\, \nabla_w P_k, \tag{1}$$

where P_t is the network's output upon observation of input pattern x_t at time t, w is the
vector of weights that parameterizes the network, and ∇_w P_k is the gradient of network output with respect to weights. Equation 1 basically couples a temporal difference method
for temporal credit assignment with a gradient-descent method for structural credit assignment. Many supervised learning procedures use gradient-descent methods to optimize network structures; for example, the back-propagation learning procedure (Rumelhart et al.,
1986) uses gradient-descent to optimize the weights in a feed-forward multilayer perceptron.
Equation 1 provides a way to adapt such supervised learning procedures to solve temporal
credit assignment problems. (An interesting open question is whether more complex supervised learning procedures, such as those that dynamically add nodes or connections during
training, could be adapted to do temporal credit assignment.)
It can be shown that the case λ = 1 corresponds to an explicit supervised pairing of
each input pattern x_t with the final reward signal z. Similarly, the case λ = 0 corresponds
to an explicit pairing of x_t with the next prediction P_{t+1}. The parameter λ provides a
smooth heuristic interpolation between these two limits.
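A minimal Python sketch of how the updates of equation 1 could be applied over one observed sequence is given below; `predict_and_grad` is a hypothetical helper returning the network's prediction and its gradient with respect to the weights, and the constants are illustrative.

```python
import numpy as np

def td_lambda_episode(w, states, z, predict_and_grad, alpha=0.1, lam=0.7):
    """Apply the TD(lambda) weight updates of equation 1 online over one
    sequence of input states ending with final reward z. The running trace
    holds sum_k lambda^(t-k) grad_w P_k."""
    P_prev, grad_prev = predict_and_grad(w, states[0])
    trace = np.zeros_like(w)
    for x in states[1:]:
        trace = lam * trace + grad_prev
        P_next, grad_next = predict_and_grad(w, x)
        w = w + alpha * (P_next - P_prev) * trace   # equation (1)
        P_prev, grad_prev = P_next, grad_next
    # terminal step: the final reward z plays the role of P_{t+1}
    trace = lam * trace + grad_prev
    w = w + alpha * (z - P_prev) * trace
    return w
```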
Sutton provides a number of intuitive arguments why TD(λ) should be a more efficient
learning procedure than explicit supervised pairing of input states with final reward. A
rigorous proof is also given that TD(0) converges to the optimal predictions for a linear
network and a linearly independent set of input patterns. This proof has recently been extended to arbitrary values of λ in Dayan (1992). However, no theoretical or empirical results
are available for more complex tasks requiring multilayer networks, although a related algorithm called the Adaptive Heuristic Critic (Sutton, 1984) has been successfully applied
to a relatively small-scale cart-pole balancing problem (Barto, Sutton & Anderson, 1983;
Anderson, 1987).
The present paper seeks to determine whether temporal difference learning procedures
such as TD(λ) can be applied to complex, real-world problem domains. The paper approaches this question from two perspectives. First, section 2 identifies a number of important practical issues and discusses them in the context of the current theoretical understanding of TD(λ). Some of the issues are familiar to connectionist researchers who have been
studying real-world applications of supervised learning procedures. Other issues are novel
and more complex than those that arise in supervised learning. Next, section 3 examines
these issues in the context of a specific application: learning backgammon strategy from
the outcome of self-play. This application was selected because of its complexity and stochastic nature, and because detailed comparisons can be made with the alternative approach
of supervised learning from expert examples (Tesauro, 1990). We shall see that, despite
a number of potentially serious theoretical and practical problems, the TD approach works
amazingly well. With zero built-in knowledge (apart from the rules), networks are able
to learn to play a fairly strong intermediate-level game. The level of performance achieved
not only exceeds that of conventional commercial programs, but perhaps more surprisingly, it also
surpasses what can be achieved by supervised training on a massive data base of human
expert examples. The final section discusses the implications of these results for more general
practical applications, and suggests a number of directions for further research.
be harder to learn. In this case, one might need special techniques analogous to those used
in pattern classification tasks when some of the classes have a much lower probability of
occurring than other classes.
Noisy environment: A final issue is whether the environment is noisy or deterministic.
Noise may appear, for example, in the rules which govern transitions from state to state,
and in the final reward signal given at the terminal states. An important consideration which
we examine in more detail below is the volatility of the stochastic environment, i.e., the
step-to-step variance in expected reward. We shall see that learning is more difficult in
highly volatile environments, and that a natural way to approach learning in such environments is with a look-ahead process akin to search or planning. Noise may also enter into
the representation of input patterns seen by the network. This was not addressed by Sutton,
and it is not known to what extent such noise degrades the learning performance.
place on-line using patterns generated de novo. One might hope that in this situation, performance would always improve monotonically with increasing training time, i.e., overtraining would not occur. One might also hope that one could always improve the performance
of the TD nets by adding more and more hidden units to the network, i.e., overfitting would
not occur.
Both overtraining and overfitting may occur, however, if the error function minimized
during training does not correspond to the performance function that the user cares about.
For example, the performance that one cares about for a game-playing network is not how
accurately it estimates the probability of winning in a given position, but rather its ability
to select good moves. It may be the case that the network could produce fairly accurate
predictions but not select very good moves. One would especially expect this to be true
for games in which the best move is only slightly better than other alternatives. On the
other hand, if the network has large errors in the absolute accuracy of its predictions, it
could still be able to select good moves. This is because, as discussed in Christensen and
Korf (1986), and Utgoff and Clouse (1991), a heuristic evaluation function need not exactly
represent the true values of states for correct move selection. Instead, it only needs to have
the correct sign for the slope between pairs of points in order to make correct state preferences. There may be many such functions, with widely varying degrees of prediction accuracy. A simple example illustrating this is shown in table 1. Overtraining and overfitting
might also occur if the distributions of input training and testing patterns are different. For
example, the game-playing network might be trained on its own play, but have to perform
against other players. The differing styles of play would lead to different distributions of
input patterns.
Incremental learning: A nice feature of TD(λ) is that the weights can be updated in a
fully incremental fashion: it is not necessary to wait until the end of the sequence to compute the weight change at a given time step. However, full incrementality may not be strictly necessary
in practical implementations. Modern workstations have enough memory to store the entire
sequence of input and output patterns, even for fairly large problems, as long as the sequences terminate in a reasonable amount of time. If a sequence is so long that it cannot
be stored, the algorithm may not be able to learn the task in any case. Thus one could
imagine other algorithms that give improved performance at the price of sacrificing the
fully incremental nature of TD(λ).
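For instance, one could store the episode and apply the total TD(λ) weight change only at the end; the sketch below (same assumptions and hypothetical names as the earlier sketch, with the weights held fixed during the episode) illustrates this non-incremental alternative.

    import numpy as np

    def td_lambda_batch(w, states, z, lam=0.7, alpha=0.1):
        """Accumulate the TD(lambda) weight change over a stored episode,
        holding the weights fixed, and apply it once at the end."""
        delta_w = np.zeros_like(w)
        trace = np.zeros_like(w)
        preds = [w @ x for x in states] + [z]   # P_1 ... P_T followed by the outcome z
        for t, x_t in enumerate(states):
            trace = lam * trace + x_t
            delta_w = delta_w + alpha * (preds[t + 1] - preds[t]) * trace
        return w + delta_w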
Table 1. An example showing the difference between a good predictor and a good move selector.

    Move      True Prob.    Network #1    Network #2
    Move 1    0.846         0.842         0.719
    Move 2    0.840         0.843         0.621
There are a number of reasons why P_{t+1} might not be a more accurate stochastic estimator of O_t than P_t is. One possibility comes from the fact that the network has to generalize in any practical problem, as discussed previously. The network's generalizations for
states near the end of the sequence may well be less accurate than for states far from the end.
Another reason why P_{t+1} may fail to be a more accurate estimator of O_t is due to volatility
in the environment. As learning progresses, P_t approaches O_t, while P_{t+1} approaches O_{t+1}.
Now in highly volatile environments, O_{t+1} may vary significantly from O_t. It is true that
the average value over all possible states that can evolve from x_t is given by ⟨O_{t+1}⟩ = O_t,
but we must remember that the network does not get to see infinitely many evolutions from
x_t. In a typical complex problem, the network will only see x_t once, and thus only one
state following x_t. If the value of x_{t+1} has little to do with the value of x_t due to high volatility, then P_{t+1} may be a poor estimator of O_t. Furthermore, in some problems it could be
the case that volatility increases with time, so that states near the end of the sequence are
more volatile than earlier states. This would provide a further limitation on the ability of
P_{t+1} to estimate O_t more accurately than P_t.
A natural way to try to overcome limitations due to volatility of the environment is to
allow multiple evolutions from a given state, i.e., for each observed x_t, reset the state to
x_t many times and let the stochastic process generate a distribution of successor states x_{t+1}
with average expected reward ⟨P_{t+1}⟩. The error signal at each time step would then be
proportional to ⟨P_{t+1}⟩ − P_t. The multiple samples would provide an estimate of ⟨O_{t+1}⟩,
which, as stated previously, should equal O_t. If the temporal evolution at each step is governed by a stochastic process with a small number of possible outcomes (such as flipping
a coin or rolling dice), one could explicitly compute an average over all possible outcomes.
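A sketch of this idea, assuming a hypothetical enumerate_outcomes(x) routine that returns each possible successor state with its probability (e.g., the 21 distinct backgammon dice rolls), would replace the single sampled prediction P_{t+1} with its expectation:

    def expected_next_prediction(x_t, predict, enumerate_outcomes):
        """Average the next prediction over all possible stochastic outcomes.
        predict(x)            -> the network's prediction P for state x
        enumerate_outcomes(x) -> iterable of (probability, successor_state) pairs
        """
        return sum(p * predict(x_next) for p, x_next in enumerate_outcomes(x_t))

    # The error signal at time t would then be proportional to
    #     expected_next_prediction(x_t, predict, enumerate_outcomes) - predict(x_t)
    # rather than to P_{t+1} - P_t computed from a single sampled successor.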
Since the input encoding scheme contains no built-in knowledge about useful features,
and since the network only observes its own play, we may say that this is a "knowledge-free" approach to learning backgammon. Such an approach is interesting because it is not
clear that it can make any progress at all beyond its starting state, which for a network
with random initial weights is a random move selector. The zero-knowledge approach also
provides an important baseline for judging other approaches using various forms of built-in
knowledge.
The approach described above is similar in spirit to Samuel's approach to learning checkers
from self-play, but in several ways it is a more challenging learning task. One important
difference is that Samuel's board description was in terms of a number of hand-crafted
features, several of which were designed in consultation with human checkers experts.
The networks studied here, however, use only a raw board description and have no knowledge
built into the input features. The evaluation function learned in Samuel's study was a linear
function of the input variables, whereas multilayer networks learn more complex nonlinear
functions. Also, the final reward signal in backgammon is noisy due to the dice rolls; this
presumably makes the learning task more difficult than in noise-free games such as checkers.
The branching ratios in backgammon are so large that look-ahead methods cannot be employed, whereas Samuel used search both in move selection and in calculation of the learning algorithm's error signal. Finally, Samuel found that it was necessary to give the learning
system at least one fixed intermediate goal, material advantage, as well as the ultimate goal
of the game. The proposed backgammon learning system has no such intermediate goals.
The networks trained on this task had 16 units to encode the board description: 6 units
to encode the number of Black men at locations 1-6, 6 units to encode the number of White
men at locations 19-24, 2 units to encode the number of White and Black men off the
board, and 2 units to encode the side to move. The networks had a feed-forward structure
with full connectivity from one layer to the next. Two architectures were examined: a single-layer architecture with direct input-output connections and no hidden units, and a multilayer architecture containing a single hidden layer with varying numbers of hidden units.
The single-layer architecture can only represent linearly separable functions (Minsky &
Papert, 1969), while the multilayer architecture, given sufficient hidden units, is capable
of universal function approximation (Hornik, Stinchcombe & White, 1989). Both the hidden
units and the output unit utilized the sigmoidal transfer function y = 1/(1 + e^{-x}).
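A forward pass through such a network might look like the following sketch (the 16-unit input and the logistic transfer function come from the text; the layer shapes, bias terms, and names are illustrative assumptions):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(board_units, W_hidden, b_hidden, w_out, b_out):
        """board_units        : the 16-unit input encoding described in the text
        W_hidden, b_hidden : hidden-layer weights and biases (absent in the
                             single-layer architecture)
        w_out, b_out       : output weights and bias
        Returns the network's estimate of the probability of winning."""
        h = sigmoid(W_hidden @ board_units + b_hidden)   # hidden layer
        return sigmoid(w_out @ h + b_out)                # single sigmoidal output

    # Example shapes for a 10-hidden-unit net:
    #   W_hidden (10, 16), b_hidden (10,), w_out (10,), b_out a scalar.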
The initial board configurations were generated randomly by distributing all 15 men for
each side with uniform probability in the six home board locations. With this distribution
of initial positions, the average sequence length is about 14 time steps with good play on
both sides.
As with back-propagation, a certain amount of parameter tuning was required to get good
results with the TD(λ) algorithm. It was found that a learning rate of α = 0.1 usually gave
good results. Lower learning rates did not improve the maximum performance (although
they did reduce the level of stochastic fluctuations in performance), whereas significantly
higher learning rates did degrade performance. Also, the value of λ appeared to have almost
no effect on the maximum obtainable performance, although there was a speed advantage
to using large values of λ. The results reported here used a value of λ set (somewhat arbitrarily) at 0.7. The initial random weight scale also did not appear to be important; in
the results reported here, weights were initialized to uniform random values in the range
[-0.5, +0.5].
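Putting these settings together, a self-play training loop could be organized roughly as in the sketch below (reusing the td_lambda_episode routine sketched earlier; play_one_game is a stand-in whose real counterpart would roll dice, generate the legal moves, and pick the move whose successor position the network rates highest):

    import numpy as np

    rng = np.random.default_rng(0)

    ALPHA = 0.1      # learning rate reported to work well
    LAM = 0.7        # lambda, set somewhat arbitrarily at 0.7
    N_INPUTS = 16    # the 16-unit bearoff encoding described above

    def play_one_game(w):
        """Stand-in for self-play: returns a sequence of board encodings and a
        final reward z (1 = win, 0 = loss)."""
        length = int(rng.integers(10, 20))                # about 14 steps on average
        states = [rng.random(N_INPUTS) for _ in range(length)]
        z = float(rng.random() < 0.5)
        return states, z

    w = rng.uniform(-0.5, 0.5, size=N_INPUTS)             # initial weights in [-0.5, +0.5]

    for game in range(1000):                              # tens of thousands in the experiments
        states, z = play_one_game(w)
        w = td_lambda_episode(w, states, z, lam=LAM, alpha=ALPHA)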
Two measures of performance were monitored to assess the results of learning. The absolute prediction error was monitored by comparing the network's outputs for the positions
seen during training with the exact win probabilities computed from the lookup tables.
(One should keep in mind that these exact probabilities were not used as part of the training signal.) The network's move selection ability was also monitored during training by
measuring the fraction of time the network selected the theoretically optimal move in a
fixed set of test positions. (There were 210 test positions, taken from a set of games in
which the author played both sides. This test set was pruned from a larger set of over 300
positions by deleting all positions with no unique best move, according to the lookup tables.
In most of the deleted positions, the probability of winning was either exactly 1 or exactly
0 regardless of which move was selected.)
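In code, these two monitors might be computed as in the following sketch (hypothetical function names; net_predict, select_move, exact_prob, and best_move stand for the network and the lookup-table quantities described above):

    import numpy as np

    def prediction_error(net_predict, positions, exact_prob):
        """Mean absolute difference between the network's output and the exact
        win probability taken from the lookup tables."""
        return float(np.mean([abs(net_predict(x) - exact_prob(x)) for x in positions]))

    def move_selection_accuracy(select_move, test_positions, best_move):
        """Fraction of test positions on which the network picks the move the
        lookup tables say is uniquely best."""
        hits = sum(1 for pos in test_positions if select_move(pos) == best_move(pos))
        return hits / len(test_positions)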
A perhaps surprising discovery was that, apart from stochastic fluctuations controlled
by the value of α, the absolute prediction error always decreased during training. On the
other hand, the move-selection performance usually reached a maximum after several tens
of thousands of games, and then decreased thereafter. (Due to the stochastic training procedure and the relatively flat nature of the learning curves, it is difficult to say precisely
when the maximum performance was reached.) As stated previously, this "overtraining"
could be due to the differing distributions of training and test positions, or it could also
be due to the difference between the function minimized by the learning algorithm (prediction error) and the function that the user cares about (move-selection ability).
(Figure 1 appears here: it plots the fraction of moves correct, on a vertical scale of roughly .82 to .92, against the number of hidden units, with tick marks at 10, 20, 40, and 80, for the TD-trained and expert-trained networks.)
The level of move-selection performance obtained by the networks was also surprising
in view of the average absolute prediction errors. Typically the networks are only able to
estimate the probability of winning to within about 10% of the true probability. On the
other hand, one usually needs to make discriminations at the 1% level or lower for accurate
move selection. Thus based on this consideration alone, one would expect the networks
to be rather poor move selectors. However, figure 1, which plots move-selection performance as a function of the number of hidden units for TD-trained networks, shows that
performance is in fact quite good. Furthermore, the networks with hidden units clearly
do better than networks without hidden units. This indicates that the networks have absorbed
some of the nonlinear aspects of the problem, and are not just implementing a linear rule
such as "maximize the number of pieces taken off the board." It is also reassuring that
this nonlinear structure can be learned in the TD procedure, despite the lack of any theoretical guarantees.
Also plotted for comparison are the results of supervised training of identical networks
with identical coding schemes on a data set of about 1700 training positions. In each training position, a human expert (the author) recorded a preference of best move. The training
methodology is the "comparison paradigm" described in Tesauro (1989), in which the network is trained by back-propagation to score the expert's move higher than each of the
other legal moves.
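A minimal sketch of this kind of comparison training, using a linear scorer and a logistic pairwise loss purely for illustration (the original work back-propagated an analogous comparison error through a multilayer network), is:

    import numpy as np

    def comparison_update(w, expert_features, other_features_list, lr=0.01):
        """One comparison-training step for a linear scorer s(x) = w . x: nudge
        the weights so the expert's move scores higher than each other legal move
        (logistic pairwise loss)."""
        for x_other in other_features_list:
            margin = w @ expert_features - w @ x_other
            # gradient of log(1 + exp(-margin)) with respect to w
            grad = -(expert_features - x_other) / (1.0 + np.exp(margin))
            w = w - lr * grad
        return w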
We can see that, as is usual with back-prop nets trained on a fixed data set, the ability
to increase performance by adding more hidden units is eventually lost for sufficiently large
nets: at this point, the networks start to overfit the training data. On the other hand, since
the TD nets are not trained on a fixed data set, the performance can in principle always
be improved by adding more hidden units. We do in fact find that, at least for the network
sizes studied here, performance increases monotonically with the number of hidden units,
and that the TD nets eventually catch up to the performance level of the nets trained on
expert preferences (denoted EP in figure 1).
One should note that the absolute prediction error of the TD nets generally gets worse
as the end of the game approaches. This is not surprising, because states far from the end
can be accurately represented by a simple pip-count approximation, but this breaks down
when there are only a few pieces left. Also, for this particular task, states near the end
of the game have higher volatility than states far from the end. As stated previously, these
factors may provide important limits on the ability of TD learning to approach theoretically
optimal performance.
non-monotonic function of the lead in the race. The initial board configurations were generated by randomly picking a divider location on the board, and randomly distributing
all White men uniformly to one side and all Black men uniformly to the other side of the
divider location. With this distribution of starting positions, the average number of time
steps from start to finish was about 20, as compared to 14 in the bearoff experiments. As
in the previous section, the algorithm parameters were set at α = 0.1 and λ = 0.7.
In this more complex situation, algorithms are no longer available for computing the
exact outcome probabilities. Hence the only performance measure used was the fraction
of time the network agreed with a human expert (the author) on the best move in a fixed
set of 248 full-board racing test positions. This test set measure is less reliable than in
the previous section, because there may be significant human errors in the choice of best
move. There may also be a substantial fraction of positions in which the outcome is uniquely
determined regardless of which move is made.
Results of TD training are shown in figure 2, which plots maximum test set performance
as a function of the number of hidden units. Once again these results are compared with networks using an identical coding scheme trained on a data set of about 3100 human expert
examples. Once again the TD networks reached peak performance after several tens of
thousands of games. For this task, no clear evidence of overtraining was seen; thus with
further training it might be possible to obtain higher levels of performance.
We can see that, despite the increased size of the input space, complexity of the task,
and length of the sequence, the TD networks were still able to achieve a high level of performance. The test set performance is almost as good as that of the EP nets. In fact, since the
test set measure is admittedly biased in favor of the expert-trained nets (which could pick
(Figure 2 appears here: it plots the fraction of moves correct, on a vertical scale of roughly .78 to .86, against the number of hidden units, with tick marks at 10, 20, and 40.)
up consistently wrong or arbitrary stylistic preferences of the expert), one may legitimately
wonder whether the TD net really is inferior in an absolute sense, and if so, by how much.
Also, further improvements in TD performance might be obtainable by adding more hidden
units to the TD networks, by training longer, or by using a more representative distribution
of starting positions.
Qualitatively, the TD nets appear to have discovered many important ingredients of general
racing strategies. The main apparent defect is a tendency to waste pips when the network
has both a small chance of winning and a small chance of losing a gammon. In this case
a human would try to remove all doubt about saving the gammon, whereas the network
tends to make plays that suggest trying to win.
46
271
game. On the other hand, there may be problems due to the fact that the racing network
was trained with a somewhat arbitrary distribution of starting positions. Furthermore, any
systematic biases in the racing network's judgement might be transferred to the engaged
network. Empirically this two-phase approach seemed to offer some potential for improving
the performance of smaller nets, but not for larger nets, and in view of the above-mentioned
theoretical concerns, this approach was eventually abandoned.
Training a single net on the entire game was not expected to yield any useful results
other than a reference point for judging more sensible approaches. However, the rather
surprising result was that a significant amount of learning actually took place. Performance
on the 248-position racing test set reached about 65 %. (This is substantially worse than
the racing specialists described in the previous section.) Performance on a separate set
of 500 full-contact test positions is plotted in figure 3. Again these figures are compared
with results of supervised training on a human expert training set containing over 15,000
engaged positions. We can see that the TD nets reached levels of performance well beyond
the initial random networks, showing that substantial learning took place. (The random
initial networks only select the right move about 5 % of the time.) The performance levels
lag somewhat behind what can be achieved with expert preference training, but we must
remember that this test set measure may not give the most accurate assessment of true
game-playing strength.
A more accurate and objective measure of game-playing strength is actual game performance against an opponent. Both the TD nets and the EP nets have been tested in actual
game play against Sun Microsystems' Gammontool program. Gammontool is representative
of the level of performance that is typically achieved with conventional commercial programs,
(Figure 3 appears here: the vertical axis is the fraction of moves correct, roughly .40 to .70; the horizontal axis is the number of hidden units, 0 to 40; the EP curve lies above the TD curve.)
Figure 3. Plot of move selection performance in general engaged positions vs. number of hidden units for networks
trained using TD learning from self-play (TD), and supervised training on human expert preferences (EP). Both
learning systems used identical 198-unit coding schemes.
(Figure 4 appears here: the vertical axis is the fraction of games won against Gammontool, roughly .45 to .70; the horizontal axis is the number of hidden units, 10 to 40; the TD curve lies above the EP curve.)
Figure 4. Plot of game performance against Gammontool vs. number of hidden units for networks trained using
TD learning from self-play (TD), and supervised training on human expert preferences (EP). Each data point
represents the result of a 10,000 game test, and should be accurate to within one percentage point.
and provides a decent benchmark for measuring game-playing strength. Human beginners
can win about 40% of the time against it, decent intermediate-level humans would win
about 60%, and human experts would win about 75%. (The random initial networks before
training win only about 1%.) Since the EP nets are trained on engaged positions only, the
testing procedure is to play out the game with the network until it becomes a race, and
then use Gammontool's algorithm to move for both sides until the end. This also avoids
penalizing the TD net for having learned the racing phase of the game rather poorly. Results
are plotted in figure 4.
Given the complexity of the task, size of input space and length of typical sequences,
it seems remarkable that the TD nets can learn on their own to play at a level substantially
better than Gammontool. Perhaps even more remarkable is that the TD nets surpass the
EP nets trained on a massive human expert data base: the best TD net won 66.2 % against
Gammontool, whereas the best EP net could only manage 59.4 %. This was confirmed
in a head-to-head test in which the best TD net played 10,000 games against the best EP
net. The result was 55 % to 45 % in favor of the TD net. This confirms that the Gammontool benchmark gives a reasonably accurate measure of relative game-playing strength,
and that the TD net really is better than the EP net. In fact, the TD net with no features
appears to be as good as Neurogammon 1.0, backgammon champion of the 1989 Computer
Olympiad, which does have features, and which wins 65% against Gammontool. A 10,000
game test of the best TD net against Neurogammon 1.0 yielded statistical equality: 50%
for the TD net and 50% for Neurogammon.
The TD net's performance against these three opponents indicates that it has reached a
significant level of playing ability. This violates a widely held belief in computer games and
machine learning research that significant levels of performance in game-playing programs
can only be achieved through the use of hand-crafted features in the evaluation function.
Apparently the hidden units in the TD net have discovered useful features through self-play. When one looks at the pattern of weights learned by the TD net, one can see a great
deal of spatially organized structure, and some of this structure can be interpreted as useful
features by a knowledgeable backgammon player. Figure 5 shows the weights from the
input layer to two of the hidden units in the best TD net. Both hidden units contribute positively to the estimation of Black's chances of winning and gammoning, and negatively to
the estimation of White's chances of winning and gammoning. The first hidden unit appears
to be a race-oriented feature detector, while the second hidden unit appears to be an attack-oriented feature detector.
The training times needed to reach the levels of performance shown in figure 4 were
on the order of 50,000 training games for the networks with 0 and 10 hidden units, 100,000
games for the 20-hidden-unit net, and 200,000 games for the 40-hidden-unit net. Since
the number of training games appears to scale roughly linearly with the number of weights
in the network, and the CPU simulation time per game on a serial computer also scales
linearly with the number of weights, the total CPU time thus scales quadratically with the
number of weights: on an IBM RS/6000 workstation, the smallest network was trained
in several hours, while the largest net required two weeks of simulation time.
The networks did not appear to overtrain, but it is not clear whether the performance
increases very slowly with further training or stays flat. Since the networks in the previous
sections also required several tens of thousands of training games, this suggests that the
number of training sequences needed to train a TD network may not scale badly with the
task complexity or average sequence length. Instead, the training time may just depend
on the number of bits of information needed to specify a trained network, and on how
many bits of information are received per game. Since the outcome of each game gives
one bit of information (two bits including gammons), and since the networks have several
thousand weights that probably must be specified to at least 4-8 bits of accuracy, this suggests a training time on the order of tens of thousands of games.
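The back-of-envelope arithmetic behind this estimate can be written down directly (a sketch with illustrative numbers only):

    def games_needed(n_weights, bits_per_weight, bits_per_game):
        """Back-of-envelope information estimate from the text."""
        return n_weights * bits_per_weight / bits_per_game

    # e.g. 3000 weights at 6 bits each, 1 bit of outcome information per game:
    #   games_needed(3000, 6, 1) -> 18000.0, i.e. tens of thousands of games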
Some qualitative observations on the styles of play learned by the TD and EP nets are
worth noting. The TD nets have developed a style emphasizing running and tactical play.
For example, they prefer to split their back men immediately rather than bringing down builders
or slotting home-board points. They are good in running-game situations and in tactical situations such as blot-hitting contests and blitz attacks. The EP nets, however, favor a more
quiescent positional style of play emphasizing blocking rather than racing. This is more
in line with human expert play (at least the particular human expert who made the training
data), but it often leads to complex prime vs. prime and back-game situations that are hard
for the network to evaluate properly. This suggests one possible advantage of the TD approach
over the EP approach: by learning to imitate an expert teacher, the learning system may
get itself into situations that it doesn't know how to handle. With the alternative approach
of learning from experience, the learner may not reproduce the expert strategies, but at
least it will learn to handle whatever situations are brought about by its own strategy.
It's also interesting that the TD net ended up playing well in the early engaged phase of play,
whereas its play in the late racing phase is rather poor. This is contrary to the intuitive
notion that in temporal credit assignment learning, states far from the end of the sequence
will be harder to learn than states near the end. Apparently the inductive bias due to the
representation scheme and network architecture is more important than temporal distance
to the final outcome. This may partially explain why training times were not dramatically
worse in the full-game situation than in the simplified endgame situations.
4. Conclusions
We have seen that, from the theoretical point of view, there may be a number of important
limitations to the effectiveness of TD learning in large-scale complex domains. The algorithm
may not converge even for prediction-only tasks, let alone for combined prediction/control
tasks. Even if the algorithm does converge, it may get stuck in poor locally optimal solutions. Finally, even if the algorithm is able to find good solutions, the scaling of required
training time with problem size or sequence length may be so poor that the learning task
would be effectively intractable.
In view of these potential difficulties, there are a number of very encouraging conclusions
that can be drawn from the backgammon application. Empirically the algorithm always converges to at least a local minimum. The quality of solution found by the network is usually
fairly good, and generally improves with increasing numbers of hidden units. Furthermore,
the scaling of training time with the length of input sequences, and with the size and complexity of the task, does not appear to be a serious problem. Finally, it is encouraging
that the network was able to learn to make good control decisions as well as learn to estimate
expected outcome. In fact, the network's ability to select good moves is much better than
we have a right to expect, because its absolute accuracy in estimating expected outcome
is usually at the 10% level, whereas the difference in expected outcome between optimal
and non-optimal moves is usually at the level of 1% or less. This suggests that much of
the network's absolute prediction error is systematic error that applies equally to all moves
generated from the same initial position, and thus cancels out in move-making decisions.
The most encouraging finding, however, is a clear demonstration that in the full-game
situation, the TD learning network with zero built-in knowledge can achieve a higher level
of overall game performance than an identical network trained on a massive data base of
human expert examples. The level of performance achieved is in fact equal to that of a program
that convincingly won the backgammon championship in a major tournament for computer
programs. This level of performance is so far beyond what one would have expected beforehand that it seems worthwhile to devote some further effort to understanding exactly how
Figure 5. A diagram illustrating the weights from the input units to two of the 40 hidden units in the best TD
net. Black squares represent negative weights and white squares represent positive weights; the size of the square
indicates the magnitude of the weights. The rows represent from bottom to top the spatial locations 1-24. The
top row represents: (W barmen, B barmen, W men off, B men off, W turn, B turn). The columns represent
the number of Black and White men as indicated. The first hidden unit has two noteworthy features: a linearly
increasing pattern of negative weights for Black blots and Black points, and a negative weighting of White men
off and a positive weighting of Black men off. These features are recognizable as contributing to an estimate
of Black's probability of winning based on his lead in the race. The second hidden unit has the following noteworthy features: strong positive weights for Black home board points, strong positive weights for White men
on bar, positive weights for White blots, and negative weights for White points in Black's home board. The experienced backgammon player recognizes that these factors all contribute to the probability of a successful Black
attacking strategy.
this is possible. It may also be worthwhile trying to apply the combination of TD with
back-propagation, or TD with other supervised learning algorithms, to other difficult temporal credit assignment problems.
Future prospects for the backgammon application look very promising. It certainly seems
possible that further improvements in performance can be obtained merely by adding more
hidden units to the network and training for longer, although the quadratic
scaling of CPU time with the number of weights may limit how far this can be carried
in practice. Better results might also be obtainable by optimizing the parameter settings,
or by modifications of the TD training procedure. For example, the next prediction could
be replaced by an average over all possible dice rolls; this could reduce limitations due
to volatility. Also, partitioning the game into a number of temporal phases and training
separate specialist networks on each phase may make it easier for each specialist network
to learn the evaluation function appropriate for that particular phase. In actual game play,
the outputs of the specialists could be combined using smoothly-varying application coefficients, as suggested in Berliner (1979). Finally, an improved representation scheme, which
uses features known to be useful in backgammon evaluation functions, and which is better
able to represent the value of near-end states, might give substantially better results. Such
improved representation schemes are currently under investigation. As this article goes
to press, a TD net containing all of Neurogammon's features is learning from self-play.
The network has reached 71% against Gammontool and continues to improve slowly with
further training. It appears to be the strongest program ever seen by this author.
Beyond this specific application, however, the larger and more important issue is whether
learning from experience can be useful and practical for more general complex problems.
Certainly the quality of results obtained in this study suggests that the approach may work
well in practice, and may work better than we have a right to expect theoretically. There
may also be some intrinsic advantages to this approach over supervised training on a fixed
set of labeled exemplars. At the very least, for tasks in which the exemplars are hand-labeled by humans, it eliminates the laborious and time-consuming process of labeling the
data. Furthermore, the learning system would not be fundamentally limited by the quantity
of labeled data, or by any intrinsic errors in whatever process is used to label the data.
Also, the learning system might be able to avoid getting itself into situations that it doesn't
know how to handle, as can happen when one is learning by imitating an expert. Finally,
preserving the intrinsic temporal nature of the task, and informing the system of the consequences of its actions, may convey important information about the task that is not necessarily
contained in the labeled exemplars. More theoretical and empirical work will be needed
to establish the relative advantages and disadvantages of the two approaches. A result of
this may be the development of novel learning paradigms combining supervised learning
with learning from outcome; in combination it might be possible to surpass what either
approach could achieve individually. Preliminary work supporting this hypothesis is reported
in Utgoff and Clouse (1991).
Acknowledgments
The author thanks Scott Kirkpatrick and Rich Sutton for helpful comments on an earlier
version of the manuscript.
References
Anderson, C.W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth
International Workshop on Machine Learning (pp. 103-114).
Barto, A.G., Sutton, R.S., & Anderson, C.W. (1983). Neuronlike adaptive elements that can solve difficult learning
control problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 835-846.
Berliner, H. (1977). Experiences in evaluation with BKG-a program that plays backgammon. Proceedings of
IJCAI (pp. 428-433).
Berliner, H. (1979). On the construction of evaluation functions for large domains. Proceedings of IJCAI (pp. 53-55).
Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis
dimension. JACM, 36, 929-965.
Christensen, J. & Korf, R. (1986). A unified theory of heuristic evaluation functions and its application to learning. Proceedings of AAAI-86 (pp. 148-152).
Dayan, P. (1992). The convergence of TD(λ). Machine Learning, 8, 341-362.
Frey, P.W. (1986). Algorithmic strategies for improving the performance of game-playing programs. In: D. Farmer,
et al. (Eds.), Evolution, games and learning. Amsterdam: North Holland.
Griffith, A.K. (1974). A comparison and evaluation of three machine learning procedures as applied to the game
of checkers. Artificial Intelligence, 5, 137-148.
Holland, J.H. (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to
parallel rule-based systems. In: R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning:
An artificial intelligence approach (Vol. 2). Los Altos, CA: Morgan Kaufmann.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators.
Neural Networks, 2, 359-366.
Lee, K.-F. & Mahajan, S. (1988). A pattern classification approach to evaluation function learning. Artificial
Intelligence, 36, 1-25.
Magriel, P. (1976). Backgammon. New York: Times Books.
Minsky, M.L. & Papert, S.A. (1969). Perceptrons. Cambridge, MA: MIT Press. (Republished as an expanded
edition in 1988).
Mitchell, D.H. (1984). Using features to evaluate positions in experts' and novices' Othello games. Master's Thesis,
Northwestern Univ., Evanston, IL.
Quinlan, J.R. (1983). Learning efficient classification procedures and their application to chess end games. In:
R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning. Palo Alto, CA: Tioga.
Robbins, H. & Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22, 400-407.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation.
In: D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing, Vol. 1. Cambridge, MA: MIT Press.
Samuel, A. (1959). Some studies in machine learning using the game of checkers. IBM J. of Research and Development, 3, 210-229.
Samuel, A. (1967). Some studies in machine learning using the game of checkers, II-recent progress. IBM J.
of Research and Development, 11, 601-617.
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Doctoral Dissertation, Dept. of Computer
and Information Science, Univ. of Massachusetts, Amherst.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Tesauro, G. & Sejnowski, T.J. (1989). A parallel network that learns to play backgammon. Artificial Intelligence,
39, 357-390.
Tesauro, G. (1989). Connectionist learning of expert preferences by comparison training. In D. Touretzky (Ed.),
Advances in neural information processing, 1, 99-106.
Tesauro, G. (1990). Neurogammon: a neural network backgammon program. IJCNN Proceedings, III, 33-39.
Utgoff, P.E. & Clouse, J.A. (1991). Two kinds of training information for evaluation function training. To appear
in: Proceedings of AAAI-91.
Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Prob. Appl., 16, 264-280.
Widrow, B., et al. (1976). Stationary and nonstationary learning characteristics of the LMS adaptive filter. Proceedings of the IEEE, 64, 1151-1162.
Zadeh, N. & Kobliska, G. (1977). On optimal doubling in backgammon. Management Science, 23, 853-858.
Technical Note
Q-Learning
CHRISTOPHER J.C.H. WATKINS
25b Framfield Road, Highbury, London N5 1UU, England
PETER DAYAN
Centre for Cognitive Science, University of Edinburgh, 2 Buccleuch Place, Edinburgh EH8 9EH, Scotland
Abstract. Q-learning (Watkins, 1989) is a simple way for agents to learn how to act optimally in controlled Markovian
domains. It amounts to an incremental method for dynamic programming which imposes limited computational
demands. It works by successively improving its evaluations of the quality of particular actions at particular states.
This paper presents and proves in detail a convergence theorem for Q-learning based on that outlined in Watkins
(1989). We show that Q-learning converges to the optimum action-values with probability 1 so long as all actions
are repeatedly sampled in all states and the action-values are represented discretely. We also sketch extensions
to the cases of non-discounted, but absorbing, Markov environments, and where many Q values can be changed
each iteration, rather than just one.
Keywords. Q-learning, reinforcement learning, temporal differences, asynchronous dynamic programming
1. Introduction
Q-learning (Watkins, 1989) is a form of model-free reinforcement learning. It can also be
viewed as a method of asynchronous dynamic programming (DP). It provides agents with
the capability of learning to act optimally in Markovian domains by experiencing the consequences of actions, without requiring them to build maps of the domains.
Learning proceeds similarly to Sutton's (1984; 1988) method of temporal differences
(TD): an agent tries an action at a particular state, and evaluates its consequences in terms
of the immediate reward or penalty it receives and its estimate of the value of the state
to which it is taken. By trying all actions in all states repeatedly, it learns which are best
overall, judged by long-term discounted reward. Q-learning is a primitive (Watkins, 1989)
form of learning, but, as such, it can operate as the basis of far more sophisticated devices.
Examples of its use include Barto and Singh (1990), Sutton (1990), Chapman and Kaelbling (1991), Mahadevan and Connell (1991), and Lin (1992), who developed it independently. There are also various industrial applications.
This paper presents the proof outlined by Watkins (1989) that Q-learning converges. Section 2 describes the problem, the method, and the notation, section 3 gives an overview
of the proof, and section 4 discusses two extensions. Formal details are left as far as possible to the appendix. Watkins (1989) should be consulted for a more extensive discussion
of Q-learning, including its relationship with dynamic programming and TD. See also Werbos
(1977).
The task facing the agent is that of determining an optimal policy, one that maximizes total
discounted expected reward. By discounted reward, we mean that rewards received s steps
hence are worth less than rewards received now, by a factor of γ^s (0 < γ < 1). Under
a policy π, the value of state x is

    V^π(x) ≡ R_x(π(x)) + γ Σ_y P_xy[π(x)] V^π(y),

because the agent expects to receive R_x(π(x)) immediately for performing the action π
recommends, and then moves to a state that is 'worth' V^π(y) to it, with probability
P_xy[π(x)]. The theory of DP (Bellman & Dreyfus, 1962; Ross, 1983) assures us that there
is at least one optimal stationary policy π* which is such that

    V*(x) ≡ V^{π*}(x) = max_a { R_x(a) + γ Σ_y P_xy[a] V^{π*}(y) }

is as well as an agent can do from state x. Although this might look circular, it is actually
well defined, and DP provides a number of methods for calculating V* and one π*, assuming that R_x(a) and P_xy[a] are known. The task facing a Q learner is that of determining
a π* without initially knowing these values. There are traditional methods (e.g., Sato, Abe
& Takeda, 1988) for learning R_x(a) and P_xy[a] while concurrently performing DP, but
any assumption of certainty equivalence, i.e., calculating actions as if the current model
were accurate, costs dearly in the early stages of learning (Barto & Singh, 1990). Watkins
(1989) classes Q-learning as incremental dynamic programming, because of the step-by-step manner in which it determines the optimal policy.
For a policy π, define Q values (or action-values) as:

    Q^π(x, a) = R_x(a) + γ Σ_y P_xy[a] V^π(y).
In other words, the Q value is the expected discounted reward for executing action a at
state x and following policy π thereafter. The object in Q-learning is to estimate the Q
values for an optimal policy. For convenience, define these as Q*(x, a) ≡ Q^{π*}(x, a), ∀x, a.
It is straightforward to show that V*(x) = max_a Q*(x, a) and that if a* is an action at which
the maximum is attained, then an optimal policy can be formed as π*(x) ≡ a*. Herein
lies the utility of the Q values: if an agent can learn them, it can easily decide what it
is optimal to do. Although there may be more than one optimal policy or a*, the Q* values
are unique.
In Q-learning, the agent's experience consists of a sequence of distinct stages or episodes.
In the nth episode, the agent:
• observes its current state x_n,
• selects and performs an action a_n,
• observes the subsequent state y_n,
• receives an immediate payoff r_n, and
• adjusts its Q_{n−1} values using a learning factor α_n, according to:

    Q_n(x, a) = (1 − α_n) Q_{n−1}(x, a) + α_n [r_n + γ V_{n−1}(y_n)]   if x = x_n and a = a_n,
    Q_n(x, a) = Q_{n−1}(x, a)                                          otherwise,        (1)

where

    V_{n−1}(y) ≡ max_b { Q_{n−1}(y, b) }        (2)

is the best the agent thinks it can do from state y. Of course, in the early stages of learning, the Q values may not accurately reflect the policy they implicitly define (the maximizing actions in equation 2). The initial Q values, Q_0(x, a), for all states and actions
are assumed given.
Note that this description assumes a look-up table representation for the Qn(x, a).
Watkins (1989) shows that Q-learning may not converge correctly for other representations.
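A minimal tabular sketch of the update in equations (1) and (2) (Python; the dictionary-based table and the names are illustrative assumptions, not from the paper):

    from collections import defaultdict

    Q = defaultdict(float)   # Q[(x, a)]; the initial values Q_0 are taken to be 0 here

    def q_update(x_n, a_n, r_n, y_n, actions, alpha_n, gamma):
        """Adjust only the entry for the state-action pair actually experienced,
        following equations (1) and (2)."""
        V_prev = max(Q[(y_n, b)] for b in actions)               # equation (2)
        Q[(x_n, a_n)] = ((1 - alpha_n) * Q[(x_n, a_n)]
                         + alpha_n * (r_n + gamma * V_prev))     # equation (1)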
The most important condition implicit in the convergence theorem given below is that
the sequence of episodes that forms the basis of learning must include an infinite number
of episodes for each starting state and action. This may be considered a strong condition
on the way states and actions are selected; however, under the stochastic conditions of
the theorem, no method could be guaranteed to find an optimal policy under weaker conditions. Note, however, that the episodes need not form a continuous sequence; that is,
the y of one episode need not be the x of the next episode.
The following theorem defines a set of conditions under which Q_n(x, a) → Q*(x, a)
as n → ∞. Define n^i(x, a) as the index of the ith time that action a is tried in state x.
Theorem
Given bounded rewards |r_n| ≤ R, learning rates 0 ≤ α_n < 1, and

    Σ_{i=1}^∞ α_{n^i(x,a)} = ∞,   Σ_{i=1}^∞ [α_{n^i(x,a)}]² < ∞,   ∀x, a,        (3)

then Q_n(x, a) → Q*(x, a) as n → ∞, ∀x, a, with probability 1.
Define

    P^(n)_xy[a] = Σ_m P^ARP_{⟨x,n⟩,⟨y,m⟩}[a]

as the probabilities that, for each x, n and a, executing action a at state ⟨x, n⟩ in the ARP
leads to state y of the real process at some lower level in the deck.
As defined above, the ARP is as much a controlled Markov process as is the real process. One can therefore consider sequences of states and controls, and also optimal discounted Q* values for the ARP.² Note that during such a sequence, episode cards are only
removed from the deck, and are never replaced. Therefore, after a finite number of actions,
the bottom card will always be reached.
3.1. Lemmas
Two lemmas form the heart of the proof. One shows that, effectively by construction, the
optimal Q value for ARP state ⟨x, n⟩ and action a is just Q_n(x, a). The next shows that
for almost all possible decks, P^(n)_xy[a] converge to P_xy[a] and R^(n)_x(a) converge to R_x(a)
as n → ∞. Informal statements of the lemmas and outlines of their proofs are given below;
consult the appendix for the formal statements.
Lemma A
Q_n(x, a) are the optimal action values for ARP states ⟨x, n⟩ and ARP actions a.
The ARP was directly constructed to have this property. The proof proceeds by backwards
induction, following the ARP down through the stack of past episodes.
Lemma B
Lemma B concerns the convergence of the ARP to the real process. The first two steps
are preparatory; the next two specify the form of the convergence and provide foundations
for proving that it occurs.
B.1
Consider a discounted, bounded-reward, finite Markov process. From any starting state
x, the difference between the value of that state under the finite sequence of s actions and
its value under that same sequence followed by any other actions tends to 0 as s → ∞.
This follows from the presence of the discount factor, which weights the (s + 1)th state
by γ^s → 0 as s → ∞.
B.2
Given any level l, there exists another, yet higher, level h such that the probability can
be made arbitrarily small of straying below l after taking s actions in the ARP, starting
from above h.
The probability, starting at level h of the ARP, of straying below any fixed level l tends
to 0 as h → ∞. Therefore there is some sufficiently high level for which s actions can
be safely accommodated, with an arbitrarily high probability of leaving the ARP above l.
B.3
With probability 1, the probabilities P^(n)_xy[a] and expected rewards R^(n)_x(a) in the ARP converge and tend to the transition matrices and expected rewards in the real process as the
level n increases to infinity. This, together with B.2, makes it appropriate to consider
P^(n)_xy[a] rather than the ARP transition matrices P^ARP_{⟨x,n⟩,⟨y,m⟩}[a], i.e., essentially ignoring the
level at which the ARP enters state y.
The ARP effectively estimates the mean rewards and transitions of the real process over
all the episodes. Since its raw data are unbiased, the conditions on the sums and sums
of squares of the learning rates α_{n^i(x,a)} ensure the convergence with probability one.
B.4
Consider executing a series of s actions in the ARP and in the real process. If the probabilities P^(n)_xy[a] and expected rewards R^(n)_x(a) at appropriate levels of the ARP for each
of the actions are close to P_xy[a] and R_x(a), ∀a, x, y, respectively, then the value of the
series of actions in the ARP will be close to its value in the real process.
The discrepancy in the action values over a finite number s of actions between the values
of two approximately equal Markov processes grows at most quadratically with s. So,
if the transition probabilities and rewards are close, then the values of the actions must
be close too.
The main theorem now follows from these lemmas. Given ε > 0, choose s by B.1 such
that truncating after s actions changes the value of any state by less than ε/6 in both the
ARP and the real process (this uses the bound R on the rewards and on the initial values
Q_0(x, a)). Then:
By B.3, with probability 1, it is possible to choose l sufficiently large such that for n > l,
the one-step rewards and transition probabilities of the ARP are sufficiently close to R_x(a)
and P_xy[a] for B.4 to apply.
By B.2, choose h sufficiently large such that for n > h, the probability, after taking
s actions, of ending up at a level lower than l is less than min{ε(1 − γ)/6sR,
ε/3s(s + 1)R}. This means that
where the primes on P'^(n) and R'^(n) indicate that these are conditional on the level in
the ARP after the sth step being greater than l.
Then, for n > h, by B.4, compare the value Q̄_ARP(⟨x, n⟩, a_1, ..., a_s) of taking actions a_1, ..., a_s at state x in the ARP with Q̄(x, a_1, ..., a_s) of taking them in the
real process:³
    |Q̄_ARP(⟨x, n⟩, a_1, ..., a_s) − Q̄(x, a_1, ..., a_s)|
        < [ε(1 − γ)/6sR] [2sR/(1 − γ)] + [s(s + 1)/2] [2ε/3s(s + 1)]
        = ε/3 + ε/3 = 2ε/3.                                         (4)
(4)
Where, in equation 4, the first term counts the cost of conditions for B.2 not holding,
as the cost of straying below l is bounded by 2sffi /(1 - ')'). The second term is the cost,
from BA, of the incorrect rewards and transition probabilities.
However, by B.l, the effect of taking only s actions makes a difference of less than f /6
for both the ARP and the real process. Also since equation 4 applies to any set of actions, it applies perforce to a set of actions optimal for either the ARP or the real process. Therefore
    |Q*_ARP(⟨x, n⟩, a) − Q*(x, a)| < ε.

So Q_n(x, a) → Q*(x, a) as n → ∞, as required.
    |V^π(x)| ≤ u*R + (1 − p*)u*R + (1 − p*)²u*R + ... = u*R/p*,

since in each u* steps the agent earns a reward of less than u*R, and has probability less
than (1 − p*) of not having been trapped. Similarly, the effect of measuring the reward
after only φu* steps is less than (1 − p*)^φ u*R → 0 as φ → ∞, and so an equivalent
of lemma B.1 does hold.
Changing more than one Q value on each iteration requires a minor modification to the
action replay process ARP such that an action can be taken at any level at which it was
executed in the real process, i.e., more than one action can be taken at each level. As
long as the stochastic convergence conditions in equation 3 are still satisfied, the proof
requires no non-trivial modification. The Q_n(x, a) values are still optimal for the modified
ARP, and this still tends to the real process in the original manner. Intuitively, the proof
relies on the ARP estimating rewards and transition functions based on many episodes,
and this is just speeded up by changing more than one Q value per iteration.
Although the paper has so far presented an apparent dichotomy between Q-learning and
methods based on certainty equivalence, such as Sato, Abe and Takeda (1988), in fact there
is more of a continuum. If the agent can remember the details of its learning episodes,
then, after altering the learning rates, it can use each of them more than once (which is
equivalent to putting cards that were thrown away back in, lower down on the ARP stack).
This biases the Q-learning process towards the particular sample of the rewards and transitions that it has experienced. In the limit of re-presenting 'old' cards infinitely often, this
reuse amounts to the certainty equivalence step of calculating the optimal actions for the
observed sample of the Markovian environment rather than the actual environment itself.
The theorem above only proves the convergence of a restricted version of Watkins' (1989)
comprehensive Q-learning algorithm, since it does not permit updates based on the rewards
from more than one iteration. This addition was pioneered by Sutton (1984; 1988) in his
TD(λ) algorithm, in which a reward from a step taken r iterations previously is weighted
by λ^r, where λ < 1. Unfortunately, the theorem does not extend trivially to this case, and
alternative proof methods such as those in Kushner and Clark (1978) may be required.
This paper has presented the proof outlined by Watkins (1989) that Q-learning converges
with probability one under reasonable conditions on the learning rates and the Markovian
environment. Such a guarantee has previously eluded most methods of reinforcement
learning.
Acknowledgments
We are very grateful to Andy Barto, Graeme Mitchison, Steve Nowlan, Satinder Singh,
Rich Sutton and three anonymous reviewers for their valuable comments on multifarious
aspects of ~-Iearning and this paper. Such clarity as it possesses owes to Rich Sutton's
tireless efforts. Support was from Philips Research Laboratories and SERC. PD's current
address is CNL, The Salk Institute, PO Box 85800, San Diego, CA 92186-5800, USA.
Notes
1. In general, the set of available actions may differ from state to state. Here we assume it does not, to simplify
the notation. The theorem we present can straightforwardly be extended to the general case.
2. The discount factor for the ARP will be taken to be γ, the same as for the real process.
3. The bars over the Q indicate that the sum is over only a finite number of actions, with 0 terminal reward.
Appendix
Let i_e, which is 0 with probability

    ∏_{i=1} (1 − α_{n^i}),

be the index of the episode that is replayed or taken, chosen probabilistically from the
collection of existing samples from the real process. If i_e = 0, then the reward is set at
Q_0(x, a) and the ARP absorbs, as above. Otherwise, taking i_e provides reward r_{n^{i_e}}, and
causes a state transition to ⟨y_{n^{i_e}}, n^{i_e} − 1⟩, which is at level n^{i_e} − 1. This last point is
crucial: taking an action in the ARP always causes a state transition to a lower level, so
it ultimately terminates. The discount factor in the ARP is γ, the same as in the real process.
Lemma A
Q_n(x, a) are the optimal action values for ARP states ⟨x, n⟩ and ARP actions a. That is,

    Q_n(x, a) = Q*_ARP(⟨x, n⟩, a),   ∀a, x, and n ≥ 0.
Proof
By induction. From the construction of the ARP, Q_0(x, a) is the optimal (indeed the only
possible) action value of ⟨x, 0⟩, a. Therefore,

    Q_0(x, a) = Q*_ARP(⟨x, 0⟩, a).

Suppose, as the induction hypothesis, that the Q_{n−1}(x, a) are the optimal action values for
the ARP at level n − 1, so that the optimal value of ARP state ⟨x, n − 1⟩ is

    V*_ARP(⟨x, n − 1⟩) = V_{n−1}(x) ≡ max_a Q_{n−1}(x, a).

Now consider the cases in trying to perform action a in ⟨x, n⟩. If (x, a) ≠ (x_n, a_n), then this
is the same as performing a in ⟨x, n − 1⟩, and Q_n(x, a) = Q_{n−1}(x, a). If (x, a) = (x_n, a_n),
then with probability α_n the nth episode is replayed, yielding reward r_n and a transition to
⟨y_n, n − 1⟩, whose optimal value is V_{n−1}(y_n), and with probability (1 − α_n) the action
behaves as it does at level n − 1. Therefore,

    Q*_ARP(⟨x, n⟩, a) = (1 − α_n) Q_{n−1}(x, a) + α_n [r_n + γ V_{n−1}(y_n)] = Q_n(x, a),

from the induction hypothesis and the Q_n iteration formula in equation 1.
Hence, Q_n(x, a) = Q*_ARP(⟨x, n⟩, a), ∀a, x, as required.
Lemma B

B.1
The difference δ between the value of a state under the finite sequence of s actions and
its value under that sequence followed by any other actions satisfies

    δ = γ^s Σ_{y_{s+1}} P_{y_s y_{s+1}} V(y_{s+1}).

But if all rewards are bounded by R, |V^π(x)| < R/(1 − γ), and so

    |δ| < γ^s R/(1 − γ) → 0 as s → ∞.
B.2 The probability of straying below level l in executing s actions can be made arbitrarily
small
Given any level l, there exists another yet higher level, h, such that the probability can
be made arbitrarily small of straying below l after taking s actions in the ARP, starting
from above h.
Proof
Define ih as the largest i such that ni(x, a) =::; n, and if as the smallest such that ni(x, a) ~ l.
Then, defining ano = 1, the probability of straying below l starting from (X, n), n > l
executing action a is:
where, as before, ni
ni(x, a). But rr;';'i/l - ani) < exp(- E;';,i/ ani) --> 0 as nand
hence ih ..... 00. Furthermore, since the state and action spaces are finite, given 1/, there
exists some level nl such that starting above there from any (x, a) leads to a level above
l with probability at least I - 1/. This argument iterates for the second action with n\ as
the new lower limit. 1/ can be chosen appropriately to set the overall probability of straying
below l less than any arbitrary > O.
Proof
A standard theorem in stochastic convergence (e.g., theorem 2.3.1 of Kushner & Clark,
1978) states that if X_n are updated according to

    X_{n+1} = X_n + β_n (ξ_n − X_n),

where 0 ≤ β_n < 1, Σ_{i=1}^∞ β_i = ∞, Σ_{i=1}^∞ β_i² < ∞, and the ξ_n are bounded random
variables with mean E, then X_n → E as n → ∞, with probability 1.
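As a quick numerical illustration of this theorem (a sketch, not part of the original proof), averaging bounded noisy samples with learning rates β_n = 1/n satisfies both conditions and converges to the mean:

    import numpy as np

    rng = np.random.default_rng(1)
    true_mean = 0.3
    X = 0.0
    for n in range(1, 200_001):
        beta = 1.0 / n                            # sum(beta_n) diverges, sum(beta_n^2) converges
        xi = true_mean + rng.uniform(-0.5, 0.5)   # bounded noise with mean true_mean
        X = X + beta * (xi - X)
    # X is now within a few thousandths of true_mean (with probability approaching 1)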
If R^(n)_x(a) is the expected immediate reward for performing action a from state x at level
n in the ARP, then R^(n)_x(a) satisfies

    R^(n^i)_x(a) = (1 − α_{n^i}) R^(n^{i−1})_x(a) + α_{n^i} r_{n^i},

where the expected value of r_{n^i} is R_x(a), and where the r and the α satisfy the conditions
of the theorem with β_i = α_{n^i(x,a)}, remembering that n^i is the ith occasion on which action
a was tried at state x. Therefore R^(n)_x(a) → R_x(a) as n → ∞, with probability one. Also,
since there is only a finite number of states and actions, the convergence is uniform.
Similarly, define

    X_n(y) = 1 if y_n = y, and 0 otherwise,

as a (random variable) indicator function of the nth transition, with mean value P_xy[a]. Then,
with P^(n)_xy[a] as the probability of ending up at state y based on a transition from state x
using action a at level n in the ARP,

    P^(n^i)_xy[a] = (1 − α_{n^i}) P^(n^{i−1})_xy[a] + α_{n^i} X_{n^i}(y),

and so, by the theorem, P^(n)_xy[a] → P_xy[a] (the transition matrix in the real process) as
n → ∞, with probability one.
Since, in addition, all observations from the real process are independent, and, by B.2,
the probability of straying below a fixed level k can be made arbitrarily small, the transition probabilities and expected rewards for a single step conditional on ending up at a level
greater than k also converge to P_xy[a] and R_x(a) as n → ∞.
Let P^i_xy[a], for i = 1 ... s, be the transition matrices of s Markov chains, and R^i_x(a) be
the reward functions. Consider the s-step chain formed from the concatenation of these,
i.e., starting from state x_1, move to state x_2 according to P^1_{x_1 x_2}[a_1], then to state x_3 according
to P^2_{x_2 x_3}[a_2], and so on, with commensurate rewards. Given η > 0, if the P^i_xy[a] are within
η/R of P_xy[a], ∀a, x, y, and the R^1_x(a), ..., R^s_x(a) are within η of R_x(a), ∀a, x, then the
value of the s actions in the concatenated chain is within ηs(s + 1)/2 of their value in the
real process.
Proof
Define

    Q̄(x, a_1, a_2) = R_x(a_1) + γ Σ_y P_xy[a_1] R_y(a_2)

as the expected reward in the real process for executing two actions, a_1 and a_2, at state x,
and

    Q̄'(x, a_1, a_2) = R^1_x(a_1) + γ Σ_y P^1_xy[a_1] R^2_y(a_2)

as the equivalent in the concatenated chain for exactly the same actions.
Then, since |R^i_x(a) − R_x(a)| < η and |P^i_xy[a] − P_xy[a]| < η/R, ∀a, i, x, y,

    |Q̄'(x, a_1, a_2) − Q̄(x, a_1, a_2)|
        ≤ |R^1_x(a_1) − R_x(a_1)| + γ |Σ_y P^1_xy[a_1] (R^2_y(a_2) − R_y(a_2))|
          + γ |Σ_y (P^1_xy[a_1] − P_xy[a_1]) R_y(a_2)|
        < 3η.
Similarly, for s actions, the difference is bounded by s(s + 1)η/2.
This applies to the ARP if the rewards and transition matrices at the successively lower
levels are sufficiently close to those in the real process-the main body of the theorem
quantifies the cost of this condition failing.
References
Barto, A.G., Bradtke, S.J. & Singh, S.P. (1991). Real-time learning and control using asynchronous dynamic programming. (COINS Technical Report 91-57). Amherst: University of Massachusetts.
Barto, A.G. & Singh, S.P. (1990). On the computational economics of reinforcement learning. In D.S. Touretzky, J. Elman, T.J. Sejnowski & G.E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.
Bellman, R.E. & Dreyfus, S.E. (1962). Applied dynamic programming. RAND Corporation.
Chapman, D. & Kaelbling, L.P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. Proceedings of the 1991 International Joint Conference on Artificial Intelligence (pp. 726-731).
Kushner, H. & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin, Germany: Springer-Verlag.
Lin, L. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8.
Mahadevan, S. & Connell, J. (1991). Automatic programming of behavior-based robots using reinforcement learning. Proceedings of the 1991 National Conference on AI (pp. 768-773).
Ross, S. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.
Sato, M., Abe, K. & Takeda, H. (1988). Learning control of finite Markov chains with explicit trade-off between estimation and control. IEEE Transactions on Systems, Man and Cybernetics, 18, pp. 677-684.
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. PhD Thesis, University of Massachusetts, Amherst, MA.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, pp. 9-44.
Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Watkins, C.J.C.H. (1989). Learning from delayed rewards. PhD Thesis, University of Cambridge, England.
Werbos, P.J. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22, pp. 25-38.
Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching

LONG-JI LIN
Carnegie Mellon University
ljl@cs.cmu.edu
Abstract. To date, reinforcement learning has mostly been studied for solving simple learning tasks. Reinforcement
learning methods that have been studied so far typically converge slowly. The purpose of this work is thus twofold: 1) to investigate the utility of reinforcement learning in solving much more complicated learning tasks than
previously studied, and 2) to investigate methods that will speed up reinforcement learning.
This paper compares eight reinforcement learning frameworks: adaptive heuristic critic (AHC) learning due
to Sutton, Q-learning due to Watkins, and three extensions to both basic methods for speeding up learning. The
three extensions are experience replay, learning action models for planning, and teaching. The frameworks were
investigated using connectionism as an approach to generalization. To evaluate the performance of different frameworks, a dynamic environment was used as a testbed. The environment is moderately complex and nondeterministic. This paper describes these frameworks and algorithms in detail and presents empirical evaluation of the
frameworks.
Keywords. Reinforcement learning, planning, teaching, connectionist networks
1. Introduction
Reinforcement learning is an interesting learning technique. It requires only a scalar reinforcement signal as performance feedback from the environment. Reinforcement learning
often involves two difficult subproblems. The first is known as the temporal credit assignment problem (Sutton, 1984). Suppose a learning agent performs a sequence of actions
and finally obtains certain outcomes. It must figure out how to assign credit or blame to
each individual situation (or situation-action pair) to adjust its decision making and improve
its performance. The second subproblem is the generalization problem (also known as the
structural credit assignment problem). When the problem space is too large to explore completely, a learning agent must have the ability to guess about new situations based on experience with similar situations. In the course of learning, both subproblems must be solved.
The most popular and best-understood approach to the temporal credit assignment problem
is temporal difference (TD) methods (Sutton, 1988). TD methods, which have a solid mathematical foundation, are closely related to dynamic programming (Barto et al., 1991). Two forms of TD-based reinforcement learning have been proposed: the adaptive heuristic critic (AHC) learning architecture (Sutton, 1984; Barto et al., 1990) and Q-learning (Watkins, 1989). While both methods have been applied to solve some simple learning problems (e.g., Sutton, 1984; Anderson, 1987; Watkins, 1989; Kaelbling, 1990; Sutton, 1990), they were found to converge slowly.
Several approaches to the generalization problem have been studied for reinforcement
learning. Anderson (1987) and Lin (1991a, 1991b, 1991c) have successfully combined TD
methods with the connectionist error backpropagation algorithm (Rumelhart et al., 1986).
Watkins (1989) used a CMAC algorithm, Grefenstette et al. (1990) a genetic algorithm
method, Mahadevan and Connell (1991) a statistical clustering method, Chapman and
Kaelbling (1991) a method similar to decision-tree algorithms, and Moore (1991) a method
which is basically look-up tables but which uses variable state resolution.
The goal of this work is to study connectionist reinforcement learning in solving nontrivial learning problems and to study methods for speeding up reinforcement learning.
Although this study took the connectionist approach, the algorithms and results presented
here appear to be relevant to other generalization methods as well.
This work studied eight frameworks for connectionist reinforcement learning: connectionist AHC- and Q-learning, and their extensions. A number of studies (Anderson, 1987; Chapman & Kaelbling, 1991; Mahadevan & Connell, 1991) have found that AHC- and Q-learning
often converge slowly. For speedup, three extensions to both basic methods were investigated
in this work: experience replay, learning action models for planning, and teaching.
The different frameworks were evaluated using a nondeterministic dynamic environment
as a testbed. The environment consists of four kinds of objects: the agent, stationary food
and obstacles, and moving enemies. The task of the agent is to survive in the environment,
which is by no means trivial for a knowledge-poor agent, since the agent has to process
a large number of input signals, has several actions to choose from, and has multiple goals.
The rest of this paper is organized as follows. Section 2 presents the learning frameworks
and algorithms. Section 3 describes the dynamic environment and the survival task. Sections 4 and 5 present the implementation and performance of the learning agents. Section
6 assesses the different learning frameworks by comparing the agents' performance. Section 7 discusses the limitations. Finally, Section 8 concludes the paper by summarizing
the lessons learned from this study.
V_t = Σ_{k=0}^∞ γ^k r_{t+k}     (1)

where V_t is the discounted cumulative reinforcement starting from time t throughout the future, r_t is the reinforcement received after the transition from time t to t + 1, and 0 ≤ γ ≤ 1 is a discount factor, which adjusts the importance of long-term consequences of actions.
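As a concrete illustration of Equation 1 (this sketch is not part of the original paper), the discounted cumulative reinforcement of a finite reward sequence can be computed as follows; the infinite sum is simply truncated at the end of the sequence.

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reinforcement V_t of Equation 1, truncated to a
    finite sequence of reinforcements r_t, r_{t+1}, ..., r_{t+n-1}."""
    v = 0.0
    # Accumulate gamma^k * r_{t+k} from the last reward backwards.
    for r in reversed(rewards):
        v = r + gamma * v
    return v

# Example: three reinforcements with discount factor 0.9.
print(discounted_return([0.0, 0.0, 1.0], 0.9))  # 0.81
```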
This section describes eight frameworks for connectionist reinforcement learning: AHCON (connectionist AHC-learning), AHCON-R (AHCON plus experience replay), AHCON-M (AHCON plus using action models), AHCON-T (AHCON plus experience replay and teaching), and the corresponding Q-learning versions QCON, QCON-R, QCON-M, and QCON-T.
The basic idea behind these frameworks is to learn an evaluation function (see eval and
util below) to predict the discounted cumulative reinforcement to be received in the future.
The evaluation function is represented using connectionist networks and learned using a
combination of temporal difference (TD) methods (Sutton, 1988) and the error backpropagation algorithm (Rumelhart et al., 1986). In essence, TD methods compute the error (called
the TD error) between temporally successive predictions, and the backpropagation algorithm
minimizes the error by modifying the weights of the networks.
Before starting to discuss the learning frameworks, two terms are defined for later use:
eval(x): the expected discounted cumulative reinforcement that will be received starting
from world state x, or simply the utility of state x;
util(x, a): the expected discounted cumulative reinforcement that will be received after
the execution of action a in response to world state x, or simply the utility of the state-action pair (x, a).
In a deterministic world, util(x, a) is equal to the immediate reinforcement r plus the utility of the next state y, discounted by γ:

util(x, a) = r + γ · eval(y)     (2)
Figure 1. Framework AHCON. The bold lines indicate a vector of signals and the thinner lines indicate a scalar signal.
using TD methods, an evaluation network that accurately models the function eval(x). The policy network takes a world state¹ as input and assigns to each action a value (called
action merit) indicating the relative merit of performing that action in response to that world
state. Thus, the second subtask is to adjust the policy network so that it will assign higher
merits to actions that result in higher utilities (as measured by the evaluation network).
Ideally, the outputs of the policy network will approach the extremes, for example, 1
for the best action(s) and -1 for the others. Given a policy network and a state, the agent's
policy is to choose the action that has the highest merit. In order to learn an optimal policy
effectively, the agent has to take actions that appear suboptimal once in a while so that
the relative merits of actions can be assessed. A stochastic action selector, which favors
actions having high merits, can be used for this purpose.
In the course of learning, both the evaluation and policy networks are adjusted incrementally. Figure 2 summarizes the learning algorithm, in which the simplest TD method, TD(0), is used.² To use TD methods to learn an evaluation function, the first step is to write down a recursive definition of the desired function. By Definition 1, the utility of a state x is the immediate payoff r plus the utility of the next state y, discounted by γ.³ Therefore, the desired function must satisfy Equation 3:

eval(x) = r + γ · eval(y)     (3)
During learning, the equation may not hold true. The difference between the two sides
of the equation (called TD error) is reduced by adjusting the weights of the evaluation network using the backpropagation algorithm (Step 5).
The policy network is also adjusted (Step 6) according to the same TD error. The idea
is as follows: If e' > e, meaning that action a is found to be better than previously expected, the policy is modified to increase the merit of the action. On the other hand, if
e' < e, the policy is modified to decrease the merit. The policy network is not updated
with respect to actions other than a, since from a single experience we know nothing about
the merits of the other actions.
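The update just described can be sketched in Python as follows. The eval_net and policy_net objects, their call and train interfaces, and the way the TD error is folded into the action merit are illustrative assumptions, not the paper's implementation; the actual algorithm is given in Figure 2.

```python
def ahcon_update(eval_net, policy_net, x, a, y, r, gamma):
    """Schematic TD(0) update for Framework AHCON on one experience (x, a, y, r).

    eval_net(x) predicts the utility of state x; policy_net(x) returns a list of
    action merits.  Both are assumed to expose a train(inputs, target) method that
    moves the prediction toward the target (e.g., by one backpropagation step)."""
    e = eval_net(x)                       # current estimate of eval(x)
    e_prime = r + gamma * eval_net(y)     # one-step corrected estimate
    td_error = e_prime - e

    # Step 5: reduce the TD error by adjusting the evaluation network.
    eval_net.train(x, e_prime)

    # Step 6: adjust the merit of the executed action a only.
    merits = policy_net(x)
    merits[a] = merits[a] + td_error      # raise the merit if e' > e, lower it if e' < e
    policy_net.train(x, merits)
```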
Figure 2. The algorithm for Framework AHCON.
During learning, the stochastic action selector (i.e., the select function in the algorithm)
chooses actions randomly according to a probability distribution determined by action merits.
In this work, the probability of choosing action ai is computed as follows:
Prob(a_i) = e^{m_i/T} / Σ_k e^{m_k/T}     (4)

where m_i is the merit of action a_i, and the temperature T adjusts the randomness of action selection.
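Equation 4 corresponds to Boltzmann (softmax) action selection, which can be sketched as follows; the merit values and temperature in the example are placeholders.

```python
import math
import random

def select_action(merits, temperature):
    """Stochastic action selection of Equation 4: Prob(a_i) is proportional to
    exp(m_i / T), where the temperature T adjusts the randomness of the choice."""
    weights = [math.exp(m / temperature) for m in merits]
    return random.choices(range(len(merits)), weights=weights, k=1)[0]

# Example: four action merits, moderately low temperature.
print(select_action([0.2, 0.9, -0.3, 0.1], temperature=0.5))
```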
Figure 3. Framework QCON. The bold lines indicate a vector of signals and the thinner lines indicate a scalar signal.
Figure 4. The algorithm for Framework QCON.
action a for which util(x, a) is maximal. Therefore, the utility network is not only an evaluation function but also used as a policy.
The learning algorithm is depicted in Figure 4 and briefly described below. Generally
speaking, the utility of an action a in response to a state x is equal to the immediate payoff
r plus the best utility that can be obtained from the next state y, discounted by 'Y. Therefore,
the desired util function must satisfy Equation 5:

util(x, a) = r + γ · Max{util(y, k) | k ∈ actions}     (5)
During learning, the difference between the two sides of the equation is minimized using
the backpropagation algorithm (Step 5). Note that the network is not modified with respect
to actions other than a, since from a single experience we know nothing about the utilities
of the other actions.
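A schematic one-experience update for Framework QCON, mirroring Equation 5, is sketched below. The util_net object and its train method are the same illustrative interface assumed in the earlier sketch, not the paper's code.

```python
def qcon_update(util_net, x, a, y, r, gamma, actions):
    """Schematic connectionist Q-learning update (Equation 5) for one experience
    (x, a, y, r).  Only the utility of the executed action a is moved toward its
    one-step corrected estimate; the other outputs are left as they are."""
    target = r + gamma * max(util_net(y)[k] for k in actions)
    utilities = util_net(x)
    utilities[a] = target
    util_net.train(x, utilities)
```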
The utility network described above has multiple outputs (one for each action). It could
be replaced by multiple networks (one for each action); each with a single output. The
former implementation might be less desirable, because whenever the single utility network
is modified with respect to an action, no matter whether it is desired or not, the network
is also modified with respect to the other actions as a result of shared hidden units between
actions. A similar argument can apply to the policy network of Framework AHCON. (This
study used multiple networks with single outputs.)
Again, a stochastic action selector is needed to explore the consequences of different
actions. The same action selector used by Framework AHCON can be employed for this
purpose.
The basic AHC- and Q-learning algorithms described above are inefficient in that experiences obtained by trial-and-error are utilized to adjust the networks only once and then
thrown away. This is wasteful, since some experiences may be rare and some (such as those
involving damages) costly to obtain. Experiences should be reused in an effective way.
In this paper, an experience is a quadruple, (x, a, y, r), meaning that the execution of
an action a in a state x results in a new state y and reinforcement r. A lesson is a temporal
sequence of experiences starting from an initial state to a final state, where the goal may
or may not be achieved.
The most straightforward way of reusing experiences is what I call experience replay.
By experience replay, the learning agent simply remembers its past experiences and repeatedly presents the experiences to its learning algorithm as if the agent experienced again
and again what it had experienced before. The result of doing this is that the process of
credit/blame propagation is sped up and therefore the networks usually converge more
quickly. However, it is important to note that a condition for experience replay to be useful
is that the laws that govern the environment of the learning agent should not change over
time (or at least not change rapidly), simply because if the laws have changed, past experiences may become irrelevant or even harmful.
Experience replay can be more effective in propagating credit/blame if a sequence of
experiences is replayed in temporally backward order. It can be even more effective if
TD(λ) methods are used with the recency factor, λ, greater than zero. With λ > 0, the TD error for adjusting network weights is determined by the discrepancy between not only two but multiple consecutive predictions. Lin (1991c) presents a detailed algorithm and study of using backward replay and nonzero λ in a mobile robot domain. The results of that paper are encouraging. In this paper, however, only backward replay and λ = 0 are used for simplicity.
The algorithm for AHCON-R is simply the one for AHCON plus repeated presentation
of past experiences to the algorithm, except that experiences involving non-policy actions
(according to the current policy) are not presented. By presenting an experience to the
algorithm, I mean to bind the variables, (x, a, y, r), used in the algorithm with the experience. The reason for the exception is the following: AHCON estimates eval(x) (relative
to the current policy) by sampling the current policy (i.e., executing some policy actions).
Replaying past experiences, which is equivalent to sampling policies in the past, will disturb
the sampling of the current policy, if the past policies are different from the current one.
Consider an agent whose current policy chooses a very good action a in state x. If the
agent changes to choose a very bad action b in the same state, the utility of state x, eval(x), will drop dramatically. Similarly, the utility of x will be underestimated if the bad action b is replayed a few times. For experience replay to be useful, the agent should only replay
the experiences involving actions that still follow the current policy.
Just like AHCON-R but for a different reason, QCON-R should also replay only actions
which follow the current policy. Consider the previous example. Even if the agent changes
from choosing good action a to choosing bad action b in state x, the values of both util(x, a)
and util(x, b) will not change. As a matter of fact, if the util function is implemented as
look-up tables, Watkins (1989) has shown that Q-learning is guaranteed to learn an optimal
policy as long as all state-action pairs are tried an infinite number of times (along with a few
weak conditions). In other words, it is not harmful for the tabular version of Q-learning
to replay bad actions. But this does not hold true for connectionist Q-learning, because
whenever the backpropagation algorithm modifies a utility network with respect to one
input state, it also affects the network with respect to many or all of the possible input
states. If a utility network is trained on bad experiences many times consecutively, the network might come to underestimate the real utilities of some state-action pairs.
In short, AHCON-R and QCON-R should replay only policy actions. However, since
an agent may use a stochastic policy as suggested previously, all actions have more or less
possibility of being chosen by any policy. It is difficult to say whether an action is a current
policy action or not. In this work, an action is considered a non-policy action if its probability (see Equation 4) of being chosen (according to the current policy) is lower than P1.⁴ For AHCON-R, P1 = 0.2 was found to give roughly the best performance for the learning task to be described later in this paper. For QCON-R, P1 = 0.1, 0.01 and 0.001 were tried and found to give similar performance. (P1 = 0 indeed gave bad performance.) In other words, AHCON-R is quite sensitive to replaying non-policy actions, while QCON-R is much less sensitive. The difference in sensitivity can be explained by the fact that replaying non-policy actions, in the case of using look-up tables, is bad for AHC-learning but is fine for Q-learning.
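As an illustration of the replay procedure described above, the following sketch (not the paper's implementation; the update function, the policy-probability function, and the threshold name p1 are assumptions standing in for the components defined earlier) replays a lesson in temporally backward order and skips experiences whose actions have become non-policy actions.

```python
def replay_lesson(update_fn, policy_probs, lesson, p1):
    """Backward experience replay with non-policy actions filtered out.

    lesson       -- temporal sequence of experiences (x, a, y, r)
    update_fn    -- per-experience learning update (e.g., the QCON sketch above)
    policy_probs -- function mapping a state to the current action-selection
                    probabilities of Equation 4
    p1           -- probability threshold below which an action is treated as a
                    non-policy action and not replayed
    """
    for (x, a, y, r) in reversed(lesson):   # temporally backward order
        if policy_probs(x)[a] < p1:
            continue                        # skip non-policy actions
        update_fn(x, a, y, r)
```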
For Frameworks AHCON-R and QCON-R to be efficient in terms of time and memory
space, only "recent" experiences need to be stored and replayed. As a matter of fact, it
is not only unnecessary but also harmful to replay an experience as many times as possible,
because the networks might be over-trained and become too specific to that experience,
which usually harms generalization. Consider an agent living in a nondeterministic world.
Whenever the agent is in state x and performs action a, 80% of the time it will receive
a great penalty, while 20% of the time it will receive no penalty. If the agent just experienced
the situation with no penalty and is trained on this experience many times, the agent will
come to believe that it is harmless to perform a in x, which is certainly wrong. Section
4.7 describes a partial solution to prevent over-training.
Framework AHCON-M is very similar to Framework AHCON, but differs in that AHCON-M also learns an action model and uses it to do what Sutton called relaxation planning
(Sutton, 1990). Relaxation planning, which is closely related to dynamic programming,
is an incremental planning process that consists of a series of shallow (usually one-step
look-ahead) searches and ultimately produces the same results as a conventional deep search.
In Sutton's Dyna architecture, his learning algorithm (similar to the one for AHCON) is
applied to real-world situations faced by the agent and also hypothetical situations randomly
generated. In the former case the next state is obtained by executing an action, while in
the latter case the next state is obtained by applying an action model. His approach has
two inefficiencies. First of all, since hypothetical situations are randomly generated, the
agent may spend too much effort planning what to do about hypothetical situations that
will never happen in the real world at all. Second, it is not clear how his approach decides when relaxation planning is no longer necessary and when to stop doing it.
The relaxation planning algorithm (see Figure 5) proposed here addresses both inefficiencies by projecting all actions from states actually visited, not from states chosen at random.
Since all actions from a state are examined at the same time, the relative merits of actions can be more directly and effectively assessed than with the kind of policy iteration (Howard, 1960) used in AHCON and the Dyna architecture. Compared with the Dyna architecture,
the disadvantage of this algorithm, however, is that the number of hypothetical experiences
that can be generated is limited by the number of states visited.
In a deterministic world, the utility of a state x is equal to the immediate payoff r obtained
from executing the best action a plus the discounted utility of the new state y:
eval(x) = Max{r + γ · eval(y) | a ∈ actions}     (Note: y is a function of a)     (6)
Thus, if a correct action model is available, the utility of a state can be accurately estimated
by looking ahead one step. During learning, the evaluation network is adjusted to minimize
the difference between the two sides of Equation 6 (Step 6).
The policy network is updated in the following manner: First we compute the average utility of state x, μ, under the assumption that the current policy and stochastic action selector
will be used throughout the future (Step 5). Prob is the probability distribution function
for action selection; it is Equation 4 in this work. Next, the policy network is modified
to increase/decrease the merits of actions which are above/below the average (Step 7).
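The following sketch illustrates the one-step look-ahead behind Equation 6 and the policy adjustment just described. It is a schematic rendering only: the network objects, their train method, the hypothetical deterministic action model model(x, a) returning (r, y), and the probability function prob are assumed interfaces, not the paper's code.

```python
def relaxation_planning_step(eval_net, policy_net, model, x, actions, gamma, prob):
    """One-step look-ahead relaxation planning in the spirit of AHCON-M."""
    # Project every action from the actually visited state x (hypothetical experiences).
    backed_up = {}
    for a in actions:
        r, y = model(x, a)                       # assumed deterministic action model
        backed_up[a] = r + gamma * eval_net(y)

    # Adjust the evaluation network toward the best one-step estimate (Equation 6).
    eval_net.train(x, max(backed_up.values()))

    # Average utility of x under the current stochastic policy (Step 5), then push
    # the merits of above-average actions up and the others down (Step 7).
    probs = prob(policy_net(x))                  # e.g., Equation 4
    mu = sum(probs[a] * backed_up[a] for a in actions)
    merits = policy_net(x)
    for a in actions:
        merits[a] += backed_up[a] - mu
    policy_net.train(x, merits)
```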
The algorithm for Framework AHCON-M is similar to that for AHCON, except that
relaxation planning may take place either before Step 2 or after Step 3 (in Figure 2)-in
the latter case the agent will be more reactive, while in the former case the agent may
make better action choices because it can benefit directly from the one-step look-ahead.
To be efficient, relaxation planning is performed selectively (Steps 2 & 3 in Figure 5).
The idea is this: For situations where the policy is very decisive about the best action,
relaxation planning is not needed. If the policy cannot be very sure about which is the
best action, relaxation planning is performed. In this way, at the beginning of learning,
all actions are equally good and relaxation planning is performed frequently. As learning
proceeds, relaxation planning is performed less and less often. (In this study, promising
actions are those whose probability of being chosen is greater than 2%.)
Figure 5. Relaxation planning algorithm for AHCON-M. Prob is the probability function for stochastic action selection.

Figure 6. Relaxation planning algorithm for QCON-M.
point of view may not be optimal for robots if we take into account the fact that robots
may not be able to sense all the information that humans use to make decisions. Human
teachers must teach in terms of what robots can sense.
Obviously, Frameworks AHCON-T and QCON-T do not have the first and second drawbacks. They also do not have the third drawback, because they do not learn by rote. Instead,
they determine the real utilities of the shown actions: if a shown action results in a state
which the agent now believes to be good/bad, then the agent increments/decrements the
likelihood of performing that action given the same situation in the future. On the other
hand, typically learners are going to benefit from experts more than from naive teachers.
3. A dynamic environment
To date, reinforcement learning has been studied mostly for solving simple learning tasks,
such as pole-balancing, route-finding, etc. (Some researchers, for instance, Mahadevan and
Connell (1991) and Chapman and Kaelbling (1991), have begun to study more complicated
problems, however.) It is therefore unclear whether reinforcement learning can scale up
to deal with more realistic problems. One of the goals of this work is thus to study the
utility of various frameworks in solving nontrivial learning problems. To this end, I devised
a simulated dynamic environment to be used as a testbed for evaluating the performance
of various frameworks.
The dynamic environment is a 25 × 25 cell world. A sample environment is shown in Figure 7.⁷ There are four kinds of objects in the environment: the agent ("I"), food ("$"), enemies ("E"), and obstacles ("O"). The perimeter of the world is considered
to be occupied by obstacles. The bottom of the figure is an energy indicator ("H"). At
the start, the agent and four enemies are placed in their initial positions as shown in Figure 7, and fifteen pieces of food are randomly placed on unoccupied cells.

Figure 7. A dynamic environment involving an agent (I), enemies (E), food ($), and obstacles (O). The H's indicate the agent's energy level.

On each move, the
agent has four actions to choose from; it can walk to one of the four adjacent cells. If the
agent attempts to walk into obstacles, it will remain at the same position.
After the agent moves, each of the enemies is allowed to stay or move to an adjacent
cell that is not occupied by obstacles. To allow the agent to escape from chasing enemies,
the enemy speed is limited to 80% of the full speed of the agent. The enemies move randomly, but tend to move towards the agent; the tendency becomes stronger as the agent
gets closer. Appendix A gives the algorithm for choosing enemy actions.
The agent has two goals: to get as much food as possible and to avoid being caught by
enemies. Note that the two goals conflict in some situations, and the agent must learn to
arbitrate between them. A play ends when the agent gets all of the food or dies. The agent
dies when it either collides with an enemy or runs out of energy. At the start of a new
play, the agent is given 40 units of energy. Each piece of food provides the agent 15 units
of additional energy, and each move costs the agent 1 unit. These parameter values were
empirically chosen so that survival in the environment would not be too easy or too difficult.
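The energy bookkeeping described above can be summarized in a small sketch; the constants come from the text, while the step logic is illustrative rather than the author's simulator.

```python
# Constants taken from the text; the functions below are an illustrative sketch only.
INITIAL_ENERGY = 40
FOOD_ENERGY = 15
MOVE_COST = 1

def step_energy(energy, ate_food):
    """Update the agent's energy after one move."""
    energy -= MOVE_COST
    if ate_food:
        energy += FOOD_ENERGY
    return energy

def play_over(energy, caught_by_enemy, food_left):
    """A play ends when the agent gets all of the food or dies."""
    return caught_by_enemy or energy <= 0 or food_left == 0
```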
To make the learning task more interesting and realistic, the agent is allowed to see only
a local area surrounding it. From the agent's point of view, the world is nondeterministic,
not only because the enemies behave randomly, but also because the world is only partially
observable (thus, what will be seen after a move is not completely predictable). Although
a human player can avoid the enemies and get all of the food most of the time, survival
in this environment is not trivial. To survive, the agent must learn to 1) approach food,
2) escape from enemies, 3) avoid obstacles, 4) identify and stay away from certain dangerous
places, (e.g., corridors), where it can be easily seized, and 5) seek food when food is out
of sight.
Figure 8. The (a) food, (b) enemy, and (c) obstacle sensor arrays.
types of sensors, types "X", "O", and "Y". Different food sensor types have different resolution and different receptive fields: "X", "O", and "Y" sensors can be activated
by food at any of the five, nine, and thirteen nearby cells, respectively. This technique of
encoding spatial positions of objects is known as coarse coding (Hinton et al., 1986).
Through the use of multiple resolution and coarse coding techniques, the food positions
are effectively coded without loss of critical information.
The enemy sensor array has a layout similar to that of the food sensor array, except that it consists of only "X" and "O" types of sensors. The obstacle sensors are organized differently;
there is only one type of obstacle sensor, and each obstacle sensor is only activated by
an obstacle at the corresponding cell. The coarse coding technique is not used to encode
obstacle positions, because it only works effectively when the features to be encoded are
sparse (Hinton et al., 1986), and this is not the case for obstacles.
The agent's energy level is, again, coarse-coded using sixteen input units. Each of the
sixteen units represents a specific energy level and is activated when the agent's energy
level is close to that specific level. Finally, four units are used to encode the agent's previous
action choice and one unit to indicate whether or not the previous action resulted in a collision with an obstacle. These five units convey a kind of history information, which allows
the agent to learn heuristics such as "moving back to previous positions is generally bad,"
"when no interesting objects (e.g., food) are around, keep moving in the same direction
until you see something interesting," etc.
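As a hedged illustration of the coarse coding of the energy level, the sketch below activates each unit when the energy is near that unit's level. The number of levels, their spacing, and the activation width are assumptions for the example, since the paper does not give the exact activation rule here.

```python
def coarse_code_energy(energy, levels, width):
    """Illustrative coarse coding of the agent's energy level: each unit stands
    for a specific level and is activated when the current energy is close to it."""
    return [1.0 if abs(energy - level) <= width else 0.0 for level in levels]

# Example: sixteen evenly spaced levels (an assumed spacing, not from the paper).
units = coarse_code_energy(37, levels=[5 * i for i in range(1, 17)], width=5)
```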
Global representation: The four actions are to move to the north, south, west and east,
regardless of the agent's current orientation in the environment.
Local representation: The four actions are to move to the front, back, right and left
relative to the agent's current orientation.
If the local representation is used, the agent begins with a random orientation for each
new play, and its orientation is changed to the move direction after each step. For instance,
if the agent takes four consecutive "right" moves and no collision occurs, it will end up
being at the same cell with the same orientation as before.
Global representation: With this action representation, the agent uses only one policy
network, whose single output represents the merit of moving to the "north." The merits
of moving in other directions can be computed using the same network by rotating the
state inputs (including food map, enemy map, obstacle map, and the 4 bits encoding the
agent's previous action) by 90, 180, and 270 degrees. Similarly, the agent uses only one
utility network, whose single output represents the utility of moving to the "north" in response to the input state. Again, the utilities of moving in other directions can be computed
by rotating the state inputs appropriately. By taking advantage of the action symmetry, the
learning task is simplified, because whatever is learned for a situation is automatically carried
over to situations which are more or less symmetric to it.
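The use of action symmetry can be sketched as follows, assuming the three maps are square arrays and the previous action is a 4-bit one-hot code; the network interface and the exact correspondence between rotations and move directions are illustrative assumptions.

```python
import numpy as np

def merits_by_symmetry(net, food_map, enemy_map, obstacle_map, prev_action_bits):
    """Illustrative use of action symmetry: a single network outputs the merit of
    moving 'north', and the other directions are obtained by rotating the state
    inputs by 90, 180 and 270 degrees before presenting them to the same network."""
    merits = []
    for k in range(4):                       # 0, 90, 180, 270 degree rotations
        rotated_maps = [np.rot90(m, k) for m in (food_map, enemy_map, obstacle_map)]
        rotated_prev = np.roll(prev_action_bits, k)   # rotate the 4-bit action code
        merits.append(net(rotated_maps, rotated_prev))
    return merits                            # one merit per move direction
```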
Local representation: This representation does not take advantage of action symmetry.
Each AHC-agent uses four policy networks and each Q-agent four utility networks. Each
network has a single output.
The outputs of all the networks are between -1 and +1. By Definition 1, the eval and util
functions may be greater than 1 when the agent is close to many pieces of food. In such
(rare) cases, they are truncated to 1 before being used to compute TD errors. Also, the
output units of the evaluation and utility networks use mainly the linear part of the sigmoid
function. An alternative is to use a truly linear activation function for the output units.
4.5. The action model
Agents AHCON-M and QCON-M learn an action model. The action model is intended to
model the input-output behavior of the dynamic environment. More specifically, given a
world state and an action to be performed, the model is to predict what reinforcement signal
will be received, where the food, enemies and obstacles will appear, and what the agent's
energy level will be. In this nondeterministic environment, each action can have many possible outcomes. There are two alternatives to modeling the environment: the action model
can generate either a list of outcomes associated with probabilities of happening or only
the most likely outcome. The second alternative is adopted, because of its simplicity.
Since the food and obstacles do not move, their positions were found to be quite easy
to predict using connectionist networks. So, to shorten simulation time without significantly
simplifying the model-learning task, Agents AHCON-M and QCON-M were only required
to learn reinforcement networks for predicting the immediate reinforcement signal and enemy
networks for predicting the enemy positions. The reinforcement and enemy networks are
all two-layer networks (i.e., without hidden layers). The reinforcement networks use as
inputs all of the 145 input bits mentioned before, while the enemy networks use only the
enemy and obstacle maps. The single output of each enemy network is trained to predict
whether a particular enemy sensor will be turned on or off after the agent moves.
The reinforcement and enemy networks are learned on-line just like the other networks.
Learning these networks is a kind of supervised learning, and the encountered experiences
(i.e., x, a, y, r) can be saved in a queue (of limited length) for repeated presentation to
the networks. Because the enemies are nondeterministic and the agent does not know the
exact enemy positions due to coarse coding, the prediction of enemy positions was found
to be often incorrect even after the networks were well-trained. Since it is also interesting
to see what performance the agent will have if a perfect model can be learned quickly,
AHCON-M and QCON-M are also provided with a "perfect" model, which is just the
environment simulator. Section 6.3 presents a performance comparison between using a
perfect model and using a learned, potentially incorrect one.
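A minimal sketch of the model-learning bookkeeping described above, assuming hypothetical network objects with a train method and a hypothetical helper encode_enemy_inputs that extracts the enemy and obstacle maps; the queue length is an illustrative choice, not a value given in the paper.

```python
from collections import deque

# Bounded queue of encountered experiences (x, a, y, r) for repeated presentation
# to the reinforcement and enemy networks (illustrative length).
model_experiences = deque(maxlen=1000)

def record(x, a, y, r):
    model_experiences.append((x, a, y, r))

def train_model(reinforcement_net, enemy_net, encode_enemy_inputs):
    """Replay the stored experiences as supervised training examples."""
    for (x, a, y, r) in model_experiences:
        reinforcement_net.train((x, a), r)              # predict the immediate reinforcement
        enemy_net.train(encode_enemy_inputs(x, a), y)   # predict the enemy sensor readings
```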
5. Experimental results
This section presents the performance of various learning agents. In the first study, the
learning agents used the global action representation and took advantage of action symmetry.
In the second study, the agents used the local action representation and did not exploit
action symmetry.
The agents' learning parameters (learning rates and cooling temperature schedules such as 1/T = 20 → 60 and 1/T = 2 → 10) were set separately for AHCON, QCON, and the other AHC- and Q-agents.
Figure 9. Learning curves of the agents using the global action representation.
merits are supposed to approach 1 for the best action(s) and -1 for the others, while utilities
of state-action pairs are usually small numbers between 1 and -1. Cooling temperatures
were used, although fixed low temperatures seemed to work as well. The learning agents
with learning-speedup techniques used smaller learning rates than those used by the basic
learning agents, because the former trained their networks more often.
In this study, the agents used the local action representation and did not exploit action symmetry. The agents used the same parameter settings as in Study 1, except that all the learning rates were doubled. Figure 10 shows the learning curves of the agents.
Figure 10. Learning curves of the agents using the local action representation.
As mentioned previously, the learning curves in Figures 9 and 10 show the mean performance over 7 experiments. The standard deviations of these curves, point by point, are
roughly 2 food pieces at the beginning of learning and 1 food piece after the asymptotic
performance is reached. Generally speaking, the agents who had better performance had
smaller performance deviations between experiments. But an agent's performance difference
between experiments did not imply much about the repeatability of experimental results,
because each experiment used a different set of training and test environments. What really matters is the relative performance of agents, which was found to be consistent most of the time throughout each entire experiment.
6. Discussion
This section compares the learning agents by analyzing the experimental results in Section 5.
Figure 11. Average number of hypothetical actions taken on each step (from Study 2).
results from the 7 experiments of Study 2.) The maximum is 4, since there are 4 actions
to choose from on any step. As we can see from the figure, the amount of planning dropped
very quickly as learning approached the asymptote. But it did not drop close to zero at
the end, because in this task there are many situations where several actions are in fact
almost equally good and therefore they all end up being tried in simulation. For tasks where
only a small portion of actions are relevant most of the time, the saving should be more
significant.
way of using action models. For example, the relaxation planning algorithms used here
performed just one-step look-ahead. Also, the number of hypothetical experiences that
AHCON-M and QCON-M could generate was limited by the number of states actually
visited. (On the other hand, if the agents are allowed to generate hypothetical experiences
from any randomly chosen states, they may spend too much time planning what to do about
hypothetical situations that will never happen at all.)
Secondly, experience replay is also a kind of relaxation planning! Experience replay uses
an "action model," but does not need to learn one; an action model is simply obtained
by sampling past experiences. In essence, the collection of past experiences is a model.
It represents not only explicitly the environment's input-output patterns but also implicitly
the probability distributions of multiple outcomes of actions.
Since experience replay is so effective and also easy to implement, is it of no use to
learn an action model? The answer is unclear. If a model is learned merely for doing relaxation planning, perhaps it is not worthwhile to learn a model, since experience replay does
the same thing as relaxation planning. One may argue that: 1) a model that generalizes
can provide the learning algorithms with induced or interpolated experiences, which are
not seen before, and 2) the extra experiences will result in better learning of the evaluation
functions. But if the evaluation functions also generalize (as should be the case for non-toy tasks), it is unclear whether these extra experiences can actually do any extra good.
On the other hand, having an action model can be useful. In this work I only investigated
how to use action models for learning the evaluation functions. There are other ways of
using a model to improve performance that I did not study here. For example, we can use
an evaluation function, an action model, and a look-ahead planning technique to help find
the best actions (Whitehead & Ballard, 1989; Thrun et aI., 1991). In a complex domain
where an optimal policy may be hardly obtainable, by looking ahead a few steps (much
as computer chess does), the non-optimality of a policy can be compensated, if an accurate
action model is available. How to use action models effectively is an interesting issue and
needs further study.
7. Limitations
While it seems that all of these learning frameworks are promising, they have several common limitations.
Representation dependent. As in any other learning system, a good input representation
is critical to successful learning. Because of the use of coarse coding and concentric multiple resolution maps, the number of input units needed for the task was dramatically reduced.
Without using these techniques, the agents could not have been so successful. Choosing
a good action representation is also crucial sometimes. For example, the local action
representation gave AHC-agents better performance than the global action representation.
Discrete time and discrete actions. So far TD methods have been used almost exclusively
in domains with discrete time and actions. While it seems possible to extend the idea of
temporal difference to handle continuous time and actions, it is not clear how easily this
extension can be done. But note that these frameworks can work for continuous states,
since connectionist networks take real as well as binary numbers as inputs.
Unwise use of sensing. The learning agent is required to sense the entire environment
at each step to determine what world state it is in. In the real world, there are too many
things to look at. Most of them may be irrelevant. In practice, an agent often cannot afford
to sense everything all the time. It is an important issue to decide how to use sensors efficiently while still knowing what the current state is (Whitehead & Ballard, 1991a; Tan, 1991).
History insensitive. Unlike humans who usually make decisions based on information
which is currently sensed and information which is in the past and cannot be sensed, the
learning agent is only reactive to what is perceived at the current moment, but insensitive
to what was perceived in the past. One possibility of becoming history-sensitive is to use
time-delay networks (Lang, 1989) or recurrent networks (Williams & Zipser, 1988).
Perceptual aliasing. The perceptual aliasing problem occurs when the agent's internal
state representation is insufficient to discriminate different external world states. As a consequence of perceptual aliasing, the learning agent cannot learn a correct evaluation function and therefore cannot act optimally. Consider a packing task which involves three steps:
open a box, put an object in the box, and close the box. An agent driven only by its current
visual percepts cannot accomplish this task, because facing a closed box, the agent does
not know whether an object is already in the box or not. There are two ways to resolve
this problem. The first solution is to actively choose perceptual actions, such as measuring
the weight of the box, to resolve the ambiguity. This kind of solution has been investigated
by Whitehead and Ballard (1991a) for a block-stacking problem and by Tan (1991) for a
route-finding problem. The second solution is to use history information, such as whether
the box was ever opened or not, to help determine the current state of the world. This
solution seems more general and more powerful than the first one. To deal with complex
real world problems, perhaps both kinds of solutions are needed.
No hierarchical control. In theory, TD methods are able to accurately propagate credit
through a long sequence of actions. However, this can be true only when the evaluation
function can be modeled to any arbitrary accuracy, for example, using look-up tables. In
practice, more compact function approximators that allow generalization, such as connectionist networks and decision trees, should be used. By using such approximators, TD
methods can hardly propagate credit accurately through an action sequence longer than,
say, 20. Furthermore, the longer the sequence is, the more time is needed for credit propagation. A potential solution to this problem is to use hierarchical control. By hierarchical
control, a top-level task is decomposed into subtasks, each subtask is learned separately
using reinforcement learning, and finally a top-level policy is learned to control the invocation of the subtasks (Lin, 1991c; Mahadevan & Connell, 1991).
8. Conclusion
This paper studied connectionist AHC- and Q-learning for a moderately complex task: survival in a dynamic environment. The results are encouraging and suggest that reinforcement learning is a promising approach to building autonomous learning systems. Some previous studies (Lin, 1991a; Sutton, 1990) found the superiority of Q-learning over AHC-learning. This work confirmed this finding in one case and found comparable performance
of both methods in the other. More studies will be required before it is possible to formulate the conditions for which AHC- or Q-learning will be more effective than the other.
Reinforcement learning algorithms often converge slowly. This paper investigated three
extensions for speedup: experience replay, learning action models for relaxation planning,
and teaching. Relaxation planning, which is a kind of incremental dynamic programming
(Watkins, 1989), caches the results of repeated shallow searches in an evaluation function,
using an action model for looking ahead. Learning an action model for planning is a wellknown idea. But whether using a model is going to speed up learning depends on the relative
difficulty of learning the model and learning to solve the task directly. In this study, learning a good model for the nondeterministic, dynamic world turned out to be difficult, and
this has limited the utility of using a model in some cases.
By sampling past experiences, experience replay performs a process much like relaxation
planning, but does not need to learn an action model. The key point is to replay only policy
actions. This study found that experience replay was quite effective in speeding up the credit
assignment process. As a matter of fact, experience replay was found to work better than
relaxation planning with a learned model; after all, learning a model takes time. So, perhaps
it is not worthwhile to learn a model, if a model is learned merely for doing relaxation
planning. But an action model is still a good thing to have, because a model together with
an evaluation function can be utilized, for instance, to perform conventional look-ahead
search for optimal actions when an optimal policy is not available.
This paper also described an approach to integrating teaching with reinforcement learning. To learn to solve a problem by reinforcement learning, the learning agent must achieve
the goal (by trial-and-error) at least once. Teaching can be an effective technique to shorten
the trial-and-error process by simply providing some success examples to the learner. Teaching can also help the learner avoid being stuck in local maxima (for instance, Section 6.4).
Although teaching was not found to be critical to learning the survival task in this work,
I expect teaching to be more and more important as tasks get more complicated and rewards become less likely to be obtained by luck (Lin, 1991c).
Acknowledgments
I would like to express my gratitude to Tom Mitchell and Rich Sutton for many fruitful
discussions on this work. Thanks also to Jeffrey Schlimmer, Nils Nilsson, the reviewers,
and Sebastian Thrun for their valuable comments. This work was supported partly by NASA
under Contract NAGW-1175 and partly by Fujitsu Laboratories Ltd.
Notes
1. Normally the agent's internal description of the world is obtained from a vector of sensory readings, which
may or may not be pre-processed. Here it is assumed that the agent's input adequately represents the external world and is sufficient to determine the optimal actions. Reinforcement learning often works, more or less,
even when this assumption is relaxed.
2. See (Lin, 1991c; Watkins, 1989; Dayan, 1992) for using TD(λ), λ > 0.
3. For simplicity, I will assume the world is deterministic throughout the discussion of the learning algorithms, although it is straightforward to extend the discussion to handle nondeterministic worlds (Barto et al., 1990;
Watkins, 1989). The reinforcement learning algorithms presented in the paper can work for deterministic and
nondeterministic worlds.
4. Another way to reduce the negative effects of AHCON-R replaying non-policy actions is to treat an action
being a policy action as a matter of probability. In other words, the computed TD error, which is used to
adjust the networks, is weighted by the probability of choosing the replayed action. This method, however,
did not work as effectively as the thresholding method (i.e., using P1).
5. Instead of using the term world model as Sutton did, I decided to use the term action model to refer to a model
for predicting the effects of actions. I will use the term world model to mean an agent's internal representation
of the world. For example, knowledge about the next-state function is referred to as action model, while a
cognitive map about a familiar environment is referred to as world model.
6. It seems that AHCON-M does not have this problem, since at each planning step, only the most promising
action is used to determine the amount of change to eval(x).
7. It may appear that the learning task is too simplified by making the obstacles symmetric. That is not at all true, because there are also other objects in the world: enemies and food pieces are positioned randomly and with no symmetrical pattern. The number of different input patterns that the agent is likely to come across is estimated to be greater than 2'0
8. My previous studies (Lin, 1991a; Lin, 1991b) reported similar results. There are in fact a few differences between the experimental designs of this study and the previous studies. For example, the action model was different, some of the parameter values were changed, and some implementation details of the learning algorithms were also different. See (Lin, 1991a; Lin, 1991b) for more details about the differences.
References
Anderson, C.W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth International Workshop on Machine Learning (pp. 103-114).
Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H. (1990). Learning and sequential decision making. In M. Gabriel & J.W. Moore (Eds.), Learning and computational neuroscience. MIT Press.
Barto, A.G., Bradtke, S.J., & Singh, S.P. (1991). Real-time learning and control using asynchronous dynamic programming. (Technical Report 91-57). University of Massachusetts, Computer Science Department.
Chapman, D. & Kaelbling, L.P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. Proceedings of IJCAI-91.
Dayan, P. (1992). The convergence of TD(λ) for general λ. Machine Learning, 8, 341-362.
Grefenstette, J.J., Ramsey, C.L., & Schultz, A.C. (1990). Learning sequential decision rules using simulation models and competition. Machine Learning, 5, 355-382.
Hinton, G.E., McClelland, J.L., & Rumelhart, D.E. (1986). Distributed representations. Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1. Bradford Books/MIT Press.
Howard, R.A. (1960). Dynamic programming and Markov processes. Wiley, New York.
Kaelbling, L.P. (1990). Learning in embedded systems. Ph.D. Thesis, Department of Computer Science, Stanford University.
Lang, K.J. (1989). A time-delay neural network architecture for speech recognition. Ph.D. Thesis, School of Computer Science, Carnegie Mellon University.
Lin, Long-Ji. (1991a). Self-improving reactive agents: Case studies of reinforcement learning frameworks. Proceedings of the First International Conference on Simulation of Adaptive Behavior: From Animals to Animats (pp. 297-305). Also Technical Report CMU-CS-90-109, Carnegie Mellon University.
Lin, Long-Ji. (1991b). Self-improvement based on reinforcement learning, planning and teaching. Proceedings of the Eighth International Workshop on Machine Learning (pp. 323-327).
Lin, Long-Ji. (1991c). Programming robots using reinforcement learning and teaching. Proceedings of AAAI-91 (pp. 781-786).
Mahadevan, S. & Connell, J. (1991). Scaling reinforcement learning to robotics by exploiting the subsumption architecture. Proceedings of the Eighth International Workshop on Machine Learning (pp. 328-332).
Mitchell, T.M. (1982). Generalization as search. Artificial Intelligence, 18, 203-226.
Moore, A.W. (1991). Variable resolution dynamic programming: Efficiently learning action maps in multivariate real-valued state-spaces. Proceedings of the Eighth International Workshop on Machine Learning (pp. 333-337).
Mozer, M.C. (1986). RAMBar: A connectionist expert system that learns by example. (Institute for Cognitive Science Report 8610). University of California at San Diego.
Pomerleau, D.A. (1989). ALVINN: An autonomous land vehicle in a neural network. (Technical Report CMU-CS-89-107). Carnegie Mellon University.
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1. Bradford Books/MIT Press.
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. Thesis, Dept. of Computer and Information Science, University of Massachusetts.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Workshop on Machine Learning (pp. 216-224).
Tan, Ming. (1991). Learning a cost-sensitive internal representation for reinforcement learning. Proceedings of the Eighth International Workshop on Machine Learning (pp. 358-362).
Thrun, S.B., Moller, K., & Linden, A. (1991). Planning with an adaptive world model. In D.S. Touretzky (Ed.), Advances in neural information processing systems 3. Morgan Kaufmann.
Thrun, S.B. & Moller, K. (1992). Active exploration in dynamic environments. To appear in D.S. Touretzky (Ed.), Advances in neural information processing systems 4. Morgan Kaufmann.
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. Thesis, King's College, Cambridge.
Williams, R.J. & Zipser, D. (1988). A learning algorithm for continually running fully recurrent neural networks. (Institute for Cognitive Science Report 8805). University of California at San Diego.
Whitehead, S.D. & Ballard, D.H. (1989). A role for anticipation in reactive systems that learn. Proceedings of the Sixth International Workshop on Machine Learning (pp. 354-357).
Whitehead, S.D. & Ballard, D.H. (1991a). Learning to perceive and act by trial and error. Machine Learning, 7, 45-83.
Whitehead, S.D. (1991b). Complexity and cooperation in Q-learning. Proceedings of the Eighth International Workshop on Machine Learning (pp. 363-367).
Appendix A. The algorithm for choosing enemy actions

Each enemy's action is chosen stochastically, with a preference for each candidate action A_i that is zero if A_i will result in collision with obstacles and otherwise depends on W(angle) and T(dist), where
angle = the angle between the direction of action A_i and the direction from the enemy to the agent,
dist = the distance between the enemy and the agent,
W(angle) = (180 − |angle|)/180, and
T(dist) is a piecewise function of dist with breakpoints at dist = 4 and dist = 15 (two of its pieces take the values 15 − dist and dist/2).
Transfer of Learning by Composing Solutions of Elemental Sequential Tasks

SATINDER PAL SINGH
University of Massachusetts, Amherst
SATINDER@cs.umass.edu
Abstract. Although building sophisticated learning agents that operate in complex environments will require learning
to perform multiple tasks, most applications of reinforcement learning have focused on single tasks. In this paper I consider a class of sequential decision tasks (SDTs), called composite sequential decision tasks, formed by temporally concatenating a number of elemental sequential decision tasks. Elemental SDTs cannot be decomposed into simpler SDTs. I consider a learning agent that has to learn to solve a set of elemental and composite SDTs. I assume that the structure of the composite tasks is unknown to the learning agent. The straightforward application of reinforcement learning to multiple tasks requires learning the tasks separately, which can waste computational resources, both memory and time. I present a new learning algorithm and a modular architecture that learns the decomposition of composite SDTs, and achieves transfer of learning by sharing the solutions of elemental SDTs across multiple composite SDTs. The solution of a composite SDT is constructed by computationally inexpensive modifications of the solutions of its constituent elemental SDTs. I provide a proof of one aspect of the
learning algorithm.
Keywords. Reinforcement learning, compositional learning, modular architecture, transfer of learning
1. Introduction
Although building sophisticated autonomous learning systems that operate in complex
environments will require learning to perform multiple tasks, most applications of control
architectures based on reinforcement learning have focused on single tasks (e.g., Barto
et al., 1991; Sutton, 1990; Whitehead & Ballard, 1990). One way to extend reinforcement
learning to multiple tasks is to use different modules to learn the solutions of the different
tasks. If the tasks are related, learning them separately will waste computational resources,
both memory and time. In addition, reinforcement learning methods do not scale efficiently
to large tasks, and thus the "learn each task independently" method is only feasible for
simple tasks. Discovering faster learning algorithms for single tasks will certainly help,
but algorithms that allow transfer of learning across tasks must also play a crucial role
in control systems that have to learn to perform multiple tasks. While achieving transfer
of learning across an arbitrary set of tasks may be difficult, or even impossible, there are
useful classes of tasks where such transfer is achievable. In this paper I focus on one class
of such tasks.
In this paper I consider a learning agent that interacts with a dynamic external environment and faces multiple sequential decision tasks. Each task requires the agent to execute
a sequence of actions to control the environment,1 either to bring it to a desired state or
to traverse a desired state trajectory over time. In addition to the environment dynamics
that define state transitions, such tasks are defined by a payoff function that specifies the
immediate evaluation of each action taken by the agent. The agent's goal is to learn a closed
loop control policy,2 i.e., a function assigning actions to states, that extremizes some measure
of the cumulative payoffs received over time. Sequential decision tasks are difficult to solve
because the long-term consequences of taking an action are seldom reflected in the immediate payoff, making it difficult to assign credit to actions. Thus, decisions cannot be
viewed in isolation because the agent must account for both the short-term and the longterm consequences of its actions. The framework of sequential decision-making is quite
general and can be used to pose a large number of tasks from diverse fields (e.g., Bertsekas, 1987).
Much of everyday human activity involves sequential tasks that have compositional structure, i.e., complex tasks built up in a systematic way from simpler tasks. As an example,
consider how many of our daily activities involve the subtasks of lifting an object, opening
a door, sitting down, walking, etc. Compositionally-structured tasks allow the possibility
of sharing knowledge across the many tasks that have common subtasks. To formulate the
problem abstractly, consider an agent that has to learn to solve many different elemental
and composite tasks. In general, there may be n elemental tasks labeled T_1, T_2, ..., T_n.
Elemental tasks cannot be decomposed into simpler subtasks. Composite tasks, labeled C_1,
C_2, ..., C_m, are produced by temporally concatenating a number of elemental tasks. For
example, C_j = [T(j, 1) T(j, 2) ... T(j, k)] is composite task j made up of k elemental
tasks that have to be performed in the order listed. T(j, i) ∈ {T_1, T_2, ..., T_n}, for 1 ≤ i
≤ k, is the ith elemental task in the list for task C_j. The sequence of elemental tasks in
a composite task will be referred to as the decomposition of the composite task. I assume
that the decomposition of a composite task is unknown to the learning agent.
Compositional learning involves solving a composite task by learning to compose the
solutions of the elemental tasks in its decomposition. It is to be emphasized that given the
short-term, evaluative nature of the payoff from the environment (often the agent gets informative payoff only at the completion of the composite task), the task of discovering the
decomposition of a composite task is formidable. In this paper I propose a compositional
learning scheme in which the separate modules learn to solve the elemental tasks, and a
task-sensitive gating module solves composite tasks by learning to compose the appropriate
elemental modules over time (also see Singh, 1992a). I present the results of three simulations using a set of simple navigational tasks that illustrate the advantage of compositional
learning, and I prove one aspect of the learning algorithm in Appendix A.
prior to the current state. The transition probabilities for all states and actions constitute
the environment dynamics. The payoff function that assigns expected payoffs to state-action
pairs is defined independently of the environment dynamics.
The agent's goal is to determine a closed loop control policy that maximizes some
cumulative measure of the payoffs received over time. The number of time steps over which
the cumulative payoff is determined is called the horizon of the MDT. One measure, called
the expected return of a policy for a given initial environment state x_0, is the expected value
of the sum over an infinite horizon of the discounted payoffs that the agent would receive
if it were to use the given policy to select its actions. The expected return is equal to
E[Σ_{t=0}^{∞} γ^t R(x_t, a_t)], where E indicates expected value. The discount factor γ, where
0 ≤ γ ≤ 1, allows payoffs distant in time to be weighted less than more immediate
payoffs. Any control policy that achieves the agent's goal (and there may be more than
one) is called an optimal policy. The optimal policies for the class of infinite horizon MDTs
defined above are stationary (Ross, 1983), i.e., are not functions of time, and therefore,
I restrict attention to stationary policies. Clearly, both the environment dynamics and the
payoff function play a role in determining the set of optimal policies. Changing the payoff
function while keeping the same environment dynamics can change the set of policies that
are optimal. That this is not always true is apparent from the fact that there are a finite
number of stationary policies but an infinite number of payoff functions.
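For a concrete sense of the discounting in the expression above: with γ = 0.9 a payoff received three steps from now counts only 0.9³ ≈ 0.73 as much as an immediate one. The short sketch below (Python; the payoff sequence is made up purely for illustration) just evaluates that discounted sum.

    # Illustrative only: discounted return for a hypothetical payoff sequence.
    gamma = 0.9
    payoffs = [1.0, 0.0, 0.0, 5.0]                      # R(x_t, a_t) for t = 0, 1, 2, 3
    ret = sum(gamma**t * r for t, r in enumerate(payoffs))
    print(ret)                                          # 1.0 + 0.9**3 * 5.0 = 4.645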
If the environment dynamics and the payoff function are known, computational procedures
based on dynamic programming can be used to find an optimal policy (Bertsekas, 1987).
In particular, DP methods based on the value iteration algorithm (e.g., Ross, 1983) compute an optimal value function that assigns to states their scalar expected returns under
an optimal policy. Given the optimal value function, an optimal policy can be determined
by selecting for each state an action that if executed would maximize the expected value
of the sum of the immediate payoff and the expected return for the next state. In the absence
of a model of the environment dynamics, reinforcement learning methods that approximate
dynamic programming, such as temporal difference (Sutton, 1988; Barto et al., 1983) and
Q-learning methods (Watkins, 1989; Barto et al., 1990) can be used to directly estimate
an optimal policy without building an explicit model of the dynamics of the environment.
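To make the dynamic-programming route just mentioned concrete, the sketch below runs value iteration on a tiny made-up MDT and then reads off a greedy (hence optimal) policy from the resulting value function. The transition array P, payoff table R, and discount factor are invented for the example and are not taken from the paper.

    import numpy as np

    # Hypothetical 2-state, 2-action MDT: P[a][x][y] = Pr(y | x, a), R[x][a] = expected payoff.
    P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # action 0
                  [[0.5, 0.5], [0.1, 0.9]]])     # action 1
    R = np.array([[1.0, 0.0],
                  [0.5, 2.0]])
    gamma = 0.9

    V = np.zeros(2)
    for _ in range(1000):                        # value iteration to (near) convergence
        Q = R + gamma * np.einsum('axy,y->xa', P, V)   # expected return of each (state, action)
        V = Q.max(axis=1)                        # optimal value function estimate
    policy = Q.argmax(axis=1)                    # greedy policy with respect to the values
    print(V, policy)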
Following Watkins, I define the Q-value, Q(x, a), for x ∈ S and a ∈ A, as the expected
return on taking action a in state x, under the condition that an optimal policy is followed
thereafter. Given the Q-values, a greedy policy that in each state selects an action with
the highest associated Q-value, is optimal. Q-learning is an asynchronous version of value
iteration that maintains an estimate, Q, of the Q-values for all states and actions. Q-learning
is often used on-line, i.e., actual experience interacting with the environment is used to
incrementally improve Q over time. On executing action a in state x at time t, the resulting
payoff and next state are used to update the estimate at time t, Q_t(x, a), as follows:4

    Q_{t+1}(x, a) = (1.0 - α_t) Q_t(x, a) + α_t (R(x, a) + γ max_{b∈A} Q_t(y, b))        (1)

where y is the state at time t + 1, and α_t is the value of a positive learning rate parameter
at time t. Watkins and Dayan (1992) prove that under certain conditions on the sequence
{α_t}, if every state-action pair is updated infinitely often using Equation 1, Q converges
to the true Q-values asymptotically.
Computing a greedy policy with respect to Q requires evaluating Q(x, a) for all x ∈ S
and a ∈ A. A common practice that eliminates this computation is to cache the control
policy in a separate controller module.5 To ensure asymptotic convergence most control
methods based on Q-learning incorporate some exploration strategy (Kaelbling, 1990; Barto
& Singh, 1990; Sutton, 1990) that sometimes takes non-greedy actions.
command to the agent. On the other hand, if the device is part of the agent, it provides
a context or internal state for the agent. The two views are formally equivalent; the crucial
property is that they determine the payoff function but do not affect the dynamics of the
environment.
If the task command is considered part of the state description, the entire set of tasks
faced by the agent becomes one unstructured Markovian decision task. While an optimal
policy for the unstructured MDT can be found by using Q-Iearning, just as for an elementa,1 MDT, the structure inherent in the set of compositionally structured tasks allows a more
efficient solution. In the next section I show that the Q-values for a composite task have
a special relationship to the Q-values of the elemental tasks in its decomposition. I then
present a learning architecture that uses that relationship to advantage.
4. Compositional Q-learning
Compositional Q-learning (CQ-learning) is a method for constructing the Q-values of a
composite task from the Q-values of the elemental tasks in its decomposition. Let Q^{T_i}(x, a)
be the Q-value of (x, a), x ∈ S and a ∈ A, for elemental task T_i, and let Q^{C_j}_{T_i}(x', a) be
the Q-value of (x', a), for x' ∈ S' and a ∈ A, for task T_i when performed as part of the
composite task C_j = [T(j, 1) ... T(j, k)]. Let T(j, l) = T_i. Note that the superscript on
Q refers to the task and the subscript refers to the elemental task currently being performed.
The absence of a subscript implies that the task is elemental.
Consider a set of undiscounted (γ = 1) MDTs that have compositional structure and
satisfy the following conditions:
(A1) Each elemental task has a single desired final state.
(A2) For all elemental and composite tasks the expected value of the undiscounted return
for an optimal policy is bounded both from above and below for all states.
(A3) The cost associated with each state-action pair is independent of the task being
accomplished.
(A4) For each elemental task T_i the reward function r^{T_i} is zero for all states except the
desired final state for that task. For each composite task C_j, the reward function r^{C_j}
is zero
for all states except the final states of the elemental tasks in its decomposition (see Section 3).
Then, for any elemental task T_i and for all composite tasks C_j containing elemental task
T_i, the following holds:

    Q^{C_j}_{T_i}(x', a) = Q^{T_i}(x, a) + K(C_j, l)        (2)

for all x' ∈ S' and a ∈ A, where x ∈ S is the projected state, and K(C_j, l) is a function of
the composite task C_j and l, where T_i is the lth task in the decomposition of C_j. Note that
K(C_j, l) is independent of the state and the action. Thus, given solutions of the elemental
tasks, learning the solution of a composite task with n elemental tasks requires learning
only the values of the function K for the n different subtasks. A proof of Equation 2 is
given in Appendix A.
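In implementation terms, Equation 2 says that the composite-task Q-values are just the elemental Q-values shifted by a constant that depends only on which subtask is being performed. The hypothetical lookup-table sketch below illustrates that; the data structures, argument names, and the assumption that an augmented state is a (projected state, augmenting bits) pair are mine, not the paper's.

    def composite_q(q_elemental, K, composite, subtask_index, x_augmented, a):
        """Q^{C_j}_{T_i}(x', a) = Q^{T_i}(x, a) + K(C_j, l), per Equation 2.
        q_elemental: dict task -> {(state, action): Q-value}; K: dict (composite, l) -> bias."""
        elemental_task = composite[subtask_index]        # T_i = T(j, l)
        x = x_augmented[0]                               # projected state: drop the augmenting bits
        return q_elemental[elemental_task][(x, a)] + K[(tuple(composite), subtask_index)]

So, given solved elemental tasks, only the scalar table K (one entry per subtask position of each composite task) remains to be learned.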
Equation 2 is based on the assumption that the decomposition of the composite tasks
is known. In the next Section, I present a modular architecture and learning algorithm
that simultaneously discovers the decomposition of a composite task and implements Equation 2.
Figure 1. CQ-L: The CQ-Learning Architecture. This figure is adapted from Jacobs, et al. (1991). The Q-modules
learn the Q-values for elemental tasks. The gating module has an output for each Q-module and determines the
probability of selecting a particular Q-module. The bias module learns the function K (see Equation 2).
where R(x_t, a_t) was the expected payoff at time t.7 Alternatively, the estimate of the desired
output of CQ-L at time t can be defined as D(t) = e(t) + Q̄(t). At each time step the
learning rules described in Appendix B are used to adjust all the Q-modules, the gating
module, and the bias module.
Each Q-module is adjusted to reduce the error between its output and D(t) - K(t) in
proportion to the probability of that Q-module having produced the estimated desired output. Hence the Q-module whose output would have produced the least error is adjusted
the most. The gating module is adjusted so that the a priori probability of selecting each
Q-module becomes equal to the a posteriori probability of selecting that Q-module, given
D(t). Over time, different Q-modules start winning the competition for different elemental tasks, and the gating module learns to select the appropriate Q-module for each task.
For composite tasks, while performing subtask T_i, the Q-module that has best learned task
T_i will have smaller expected error than any other Q-module. The bias module is also adjusted to reduce the error e(t). The parameter β is increased over time so that eventually
only the greedy actions are selected.
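One common way to realize "mostly greedy, occasionally exploratory" action selection governed by a single parameter β is a Boltzmann (soft-max) rule over Q-values; as β grows, the distribution concentrates on the greedy action. The sketch below is in that spirit and is not claimed to be the paper's exact selection mechanism.

    import math, random

    def select_action(q_values, beta):
        """Pick action a with probability proportional to exp(beta * Q(x, a)); large beta -> greedy."""
        prefs = [math.exp(beta * q) for q in q_values]
        total = sum(prefs)
        r = random.random() * total
        for a, p in enumerate(prefs):
            r -= p
            if r <= 0:
                return a
        return len(prefs) - 1

    # beta is increased over training, so exploration gradually gives way to greedy actions.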
Figure 2. The Grid Room. The room is an 8 x 8 grid with three desired final locations designated A, B and
C. The filled squares represent obstacles. The robot is shown as a circle and has four actions available: UP, DOWN,
RIGHT, and LEFT.
Table 1. Tasks. Tasks T1, T2 and T3 are elemental tasks; tasks C1, C2 and
C3 are composite tasks. The last column describes the compositional structure of the tasks.

    Label    Command    Description                      Decomposition
    T1       000001     visit A                          T1
    T2       000010     visit B                          T2
    T3       000100     visit C                          T3
    C1       001000     visit A and then C               T1 T3
    C2       010000     visit B and then C               T2 T3
    C3       100000     visit A, then B and then C       T1 T2 T3
(Table 1). Lookup tables were used to implement all the modules of the CQ-Learning architecture. The elements of the set S are the possible positions of the robot in the grid
room. The 3 augmenting bits, one for each of the three elemental tasks, and the task command vector form the input to the bias and the gating modules (Figure 1).
7. Simulation results
In the simulations described below, the performance of CQ-L is compared to the performance of a "one-for-one" architecture that implements the "learn-each-task-separately"
strategy. The one-for-one architecture has a pre-assigned distinct module for each task,
which prevents transfer of learning; in other words, the one-for-one architecture treats the
entire set of tasks as one unstructured MDT. Each module of the one-for-one architecture
was provided with the augmented state (∈ S').
[Figure 3 plots: four panels showing, against trial number, the actions per trial for the CQ-Learning and one-for-one architectures and the normalized gating-module outputs for Q-modules 1-3 on tasks "A", "B", and "C".]
Figure 3. (i) Learning Curves for Multiple Elemental Tasks. Each data point is the average, taken over 50 trials,
of the number of actions taken by the robot to get to the desired final state. (ii) Module Selection for Task T1.
The 3 normalized outputs of the gating module are shown averaged over each trial with task T1. Initially the
outputs were about 0.3 each, but as learning proceeded the gating module learned to select Q-module 2 for task
T1. (iii) Module Selection for Task T2. Q-module 3 was selected. (iv) Module Selection for Task T3. Q-module
1 was selected for task T3.
simulations reported by Jacobs (1990), except that he does not apply his architecture to
sequential decision tasks. See Appendix C for simulation details.
[Figure 4 plots: a learning curve (actions per trial versus trial number) comparing the CQ-Learning and one-for-one architectures, and three panels showing the gating-module outputs for Q-modules 1-3 over the steps of single trials of tasks "AC", "BC", and "ABC".]
Figure 4. (i) Learning Curves for a Set of Elemental and Composite Tasks. Each data point is the average over
50 trials of the time taken by the robot to reach the desired final state. (ii) Temporal Composition for Task C1.
After 10,000 learning trials, the three outputs of the gating module during one trial of task C1 are shown. Q-module
1 was turned on for the first seven actions to accomplish subtask T1, and then Q-module 2 was turned on to
accomplish subtask T3. (iii) Temporal Composition for Task C2. Q-module 3 was turned on for the first six actions to accomplish subtask T2 and then Q-module 2 was turned on to accomplish task T3. (iv) Temporal Composition for Task C3. The three outputs of the gating module for one trial with task C3 are shown. Q-modules
1, 3 and 2 were selected in that order to accomplish the composite task C3.
over time. This simulation shows that CQ-L is able to learn the decomposition of a composite task, and that compositional learning, due to transfer of training across tasks, can
be faster than learning each composite task separately. See Appendix C for simulation details.
7.3. Simulation 3: Shaping
One approach (cf. Section 7.2) for training a robot to learn a composite task is to train
the robot on multiple tasks: all the elemental tasks in the decomposition of the composite
task, and the composite task itself. Another approach is to train the robot on a succession
of tasks, where each succeeding task requires some subset of the already learned elemental
tasks, plus a new elemental task. This roughly corresponds to the "shaping" procedures
(Skinner, 1938) used by psychologists to train animals to do complex motor tasks.
A simple simulation to illustrate shaping was constructed by training CQ-L with one
Q-module on one elemental task, T3, for 1,000 trials and then training on the composite
task C2. After the first 1,000 trials, the learning was turned off for the first Q-module and
a second Q-module was added for the composite task. Figure 5 shows the learning curve
for task T3 composed with the learning curve for task C2. The number of actions taken
by the robot to get to the desired final state, averaged over 50 trials, was plotted by concatenating the data points for the two tasks, T3 and C2. Figure 5 shows that the average
[Figure 5 plot: actions per trial versus trial number for the CQ-Learning architecture over the two phases of the shaping simulation.]
Figure 5. Shaping. The number of actions taken by the robot to reach the desired final state, averaged over 50
trials, is plotted. The CQ-Learning architecture containing one Q-module was trained for 1000 trials on task T3.
Then another Q-module was added, and the architecture was trained to accomplish task C2. The only task-dependent payoff for the task C2 was on reaching the desired final location C.
number of actions required to reach the final state increased when the composite task was
introduced, but eventually the gating module learned to decompose the task and the average
decreased. The second Q-module learned the task T2 without ever being explicitly exposed to it.
A separate shaping simulation was conducted in the same way as above, except that learning was not turned off for the first Q-module after it had learned task T3 . The solution
to task T3 was partially unlearned before the architecture figured out the decomposition
for the composite task. As a result, it took more time than in the earlier shaping simulation
to learn the elemental task T2 in the second Q-module. See Appendix C for simulation
details.
8. Discussion
Learning to solve MDTs with large state sets is difficult due to the sparseness of the evaluative
information and the low probability that a randomly selected sequence of actions will be
optimal. Learning the long sequences of actions required to solve such tasks can be accelerated considerably if the agent has prior knowledge of useful subsequences. Such subsequences can be learned through experience in learning to solve other tasks. In this paper,
I define a class of MDTs, called composite tasks, that are structured as the temporal concatenation of simpler MDTs, called elemental tasks. I present CQ-L, an architecture that
combines the Q-learning algorithm of Watkins (1989) and the modular architecture of Jacobs,
et al. (1991) to achieve transfer of learning by sharing the solutions of elemental tasks across
multiple composite tasks. CQ-L's use of acquired subsequences to solve composite tasks
is similar to the use of macro-operators (Korf, 1985) in macro-learning systems (Iba, 1989).
Another architecture similar to CQ-L is the subsumption architecture for autonomous
intelligent agents (Brooks, 1989), which is composed of several task-achieving modules
along with precompiled switching circuitry that controls which module should be active
at any time. Maes and Brooks (1990) showed how reinforcement learning can be used to
learn the switching circuitry for a robot with hardwired task modules. Mahadevan and Connell (1990), on the other hand, showed how Q-learning can be used to acquire behaviors
that can then be controlled using a hardwired switching scheme. The simulation illustrating
shaping with CQ-L demonstrates that not only can the gating module compose sequences
of elemental tasks on which it has been trained, it can also learn new elemental tasks in
order to solve a composite task. Thus, for compositionally-structured MDTs, CQ-L combines the complementary objectives of Maes and Brooks's architecture with that of
Mahadevan and Connell's architecture.
Given a set of composite and elemental MDTs, the sequence in which the learning agent
receives training experiences on the different tasks determines the relative advantage of
CQ-L over other architectures that learn the tasks separately. The simulation reported in
Section 7.2 demonstrates that it is possible to train CQ-L on intermixed trials of elemental
and composite tasks, while the simulation involving shaping (Section 7.3) demonstrates
that it is possible for CQ-L to learn elemental tasks on which it has not been explicitly
trained. Nevertheless, some training sequences on a set of tasks will result in faster learning of the set of tasks than other training sequences. The ability of CQ-L to scale well
to complex sets of tasks is still dependent on the choice of the training sequence.
To focus on compositional learning I avoided the issues arising from the use of representations more structured than unit-basis vectors and function approximators more powerful
than lookup tables. In addition, Watkins and Dayan's (1992) theorem on the convergence
of Q-learning holds for lookup-table representations; it is not known to hold for other function
approximation methods. Nevertheless, future work with connectionist networks may make
it possible to use distributed representations for states and thus to take advantage of the
ability of networks to generalize across states within the same task.
According to the definition used in this paper, composite tasks have only one decomposition and require the elemental tasks in their decomposition to be performed in a fixed
order. A broader definition of a composite task allows it to be an unordered list of elemental tasks, or more generally, a disjunction of many ordered elemental task sequences. CQL should work with the broader definition for composite tasks without any modification
because it should select the particular decomposition that is optimal with respect to its
goal of maximizing expected returns. Further work is required to test this conjecture.
Although compositional Q-learning was illustrated using a set of simple navigational tasks,
it is suitable for a number of different domains where multiple sequences from some set
of elemental tasks have to be learned. CQ-L is a general mechanism whereby a "vocabulary" of elemental tasks can be learned in separate Q-modules, and arbitrary8 temporal
syntactic compositions of elemental tasks can be learned with the help of the bias and gating
modules.
where φ(y, x_g) is the expected cost of going from state y to x_g under an optimal policy,
π_i, for elemental task T_i.
Consider a composite task C_j that satisfies assumptions A1-A4 given in Section 4 and
w.l.o.g. assume that for C_j = [T(j, 1) T(j, 2) ... T(j, k)], ∃ l, 1 ≤ l ≤ k, such that T(j,
l) = T_i. Let the set L ⊂ S' be the set of all x' ∈ S' that satisfy the property that the
augmenting bits corresponding to the tasks before T_i in the decomposition of C_j are equal
to one and the rest are equal to zero. Let y' ∈ L be the state that has the projected state
y ∈ S. Let x'_g ∈ S' be the state formed from x_g ∈ S by setting to one the augmenting bits
corresponding to all the subtasks before and including subtask T_i in the decomposition of
C_j, and setting the other augmenting bits to zero. Let π'_j be an optimal policy for task C_j.
r^{C_j}(x') is the reward for state x' ∈ S' while performing task C_j. Then by assumptions
A1-A4, we know that r^{C_j}(x') = 0 for all x' ∈ L and that c(x', a) = c(x, a).
By the definition of C_j, the agent has to navigate from state y' to state x'_g to accomplish
subtask T_i. Let φ'(y', x'_g) be the expected cost of going from state y' to state x'_g under policy
π'_j. Then, given that T(j, l) = T_i,

    Q^{C_j}_{T_i}(y', a) = r^{C_j}(x'_g) - φ'(y', x'_g) + Q^{C_j}_{T(j,l+1)}(x'_g, b)

where b ∈ A is an optimal action for state x'_g while performing subtask T(j, l + 1) in task
C_j. Clearly, φ'(y', x'_g) = φ(y, x_g), for if it were not, either policy π_i would not be optimal
for task T_i, or given the independence of the solutions of the subtasks the policy π'_j would
not be optimal for task C_j. Define K(C_j, l) ≡ Q^{C_j}_{T(j,l+1)}(x'_g, b) + r^{C_j}(x'_g) - r^{T_i}(x_g). Then
Q^{C_j}_{T_i}(y', a) = Q^{T_i}(y, a) + K(C_j, l).
Q.E.D.
Appendix B: Learning rules for the CQ-learning architecture
The derivation of the learning rules in this section follows the derivation in Nowlan (1990)
closely. The output of Q-module i, q_i, is the mean of a Gaussian probability distribution
with variance σ. The ith output of the gating module, s_i, determines the a priori probability, g_i = e^{s_i} / Σ_j e^{s_j}, of selecting Q-module i.
The final output of the architecture at time step t is given by Q̄(t) = q_i(t) + K(t),
where q_i is the output of the selected Q-module and K is the output of the bias module.
The desired output at time t is D(t) = R(t) + Q̄(t + 1), where R(t) is the expected
payoff at time t. Then the probability that Q-module i will generate the desired output is

    p_i(D(t)) = (1/N) exp( -||D(t) - K(t) - q_i(t)||² / (2σ²) )

where N is a normalizing constant. The a posteriori probability that Q-module i was selected,
given the desired output D(t), is

    p(i | D(t)) = g_i(t) p_i(D(t)) / Σ_j g_j(t) p_j(D(t)).

The partial derivative of the log likelihood with respect to the ith output of the gating module
simplifies to

    ∂ log L(t) / ∂s_i(t) = p(i | D(t)) - g_i(t).
Using the above results, the update rules for Q-module j, the ith output of the gating module,
and the bias module9 respectively are:

    q_j(t + 1) = q_j(t) + α_Q p(j | D(t)) (D(t) - K(t) - q_j(t)),

    s_i(t + 1) = s_i(t) + α_g ∂ log L(t) / ∂s_i(t),

and

    b(t + 1) = b(t) + α_b (D(t) - Q̄(t)),
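Read together, these rules amount to a mixture-of-experts credit assignment: compute each Q-module's likelihood of having produced the desired output, convert the gating outputs to prior probabilities with a soft-max, form posteriors, and move each module in proportion to its posterior. The sketch below restates that computation for a single time step; the array shapes, the default learning rates, and the simplification of treating the module with the highest posterior as the selected one are assumptions for illustration, not the paper's code.

    import numpy as np

    def cq_l_step(q_outputs, s, K, D, alpha_q=0.1, alpha_g=0.3, alpha_b=0.0001, sigma=1.0):
        """One CQ-L update. q_outputs: Q-module outputs q_i(t); s: gating outputs s_i(t);
        K: bias-module output K(t); D: estimated desired output D(t)."""
        g = np.exp(s) / np.exp(s).sum()                  # a priori selection probabilities g_i
        err = D - K - q_outputs                          # each module's error against D(t) - K(t)
        lik = np.exp(-(err ** 2) / (2.0 * sigma ** 2))   # likelihood each module produced D(t)
        post = g * lik / np.sum(g * lik)                 # a posteriori probabilities p(i | D(t))
        qbar = q_outputs[np.argmax(post)] + K            # architecture output (most responsible module + bias)
        new_q = q_outputs + alpha_q * post * err         # modules adjusted in proportion to posterior
        new_s = s + alpha_g * (post - g)                 # gating moves its prior toward the posterior
        new_K = K + alpha_b * (D - qbar)                 # bias module reduces the overall error
        return new_q, new_s, new_K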
table were random values in the range 0.0-0.4. For all three simulations, the variance of
the Gaussian noise, σ, was 1.0 for all Q-modules.
For Simulation 1, the parameter values for both the CQ-Learning and the one-for-one
architectures were α_Q = 0.1, α_b = 0.0, α_g = 0.3. The policy selection parameter, β, started
at 1.0 and was incremented by 1.0 after every 100 trials.
For Simulation 2, the parameter values for CQ-L were: α_Q = 0.015, α_b = 0.0001, α_g =
0.025, and β was incremented by 1.0 after every 200 trials. For the one-for-one architecture,
the parameter values were: α_Q = 0.01 and β was incremented by 1.0 after every 500 trials.10
For Simulation 3, the parameter values α_Q = 0.1, α_b = 0.0001, and α_g = 0.01 were
used, and β was incremented by 1.0 every 25 trials.
Acknowledgments
This work was supported by the Air Force Office of Scientific Research, Bolling AFB,
under Grant AFOSR-89-0526 and by the National Science Foundation under Grant ECS-8912623. I thank Andrew Barto, Rich Sutton, and Robbie Jacobs for their guidance and
help in formulating my ideas. Discussions with Richard Yee, Vijaykumar Gullapalli, and
Jonathan Bachrach helped focus my ideas. I also thank Andrew Barto for his patience in
reading and commenting on many hastily written drafts of this paper. Without his extensive help, no reader would have gotten far enough in this paper to be reading this sentence.
Notes
1. The environment is the system or plant being controlled by the agent.
2. In the machine learning literature closed loop control policies are also referred to as situation-action rules,
or simply as "reactions."
3. For ease of exposition I assume that the same set of actions are available to the agent from each state. The
extension to the case where different sets of actions are available in different states is straightforward.
4. Here, I assume that on executing action a in state x the agent receives the expected value of the immediate
payoff R(x, a), instead of just a sample.
5. However, this adds the computational effort of training the controller module. In addition, the mismatch between the controller module and Q can adversely affect the time required to learn an optimal policy.
6. The theory developed in this paper does not depend on the particular extension of S chosen, as long as the
appropriate connection between the new states and the elements of S can be made.
7. Note that Equation 3 is an accurate implementation of Q-learning only if the action taken at time t + 1 is
the greedy action. As the value of parameter β is increased, the probability of executing the greedy action
increases.
8. This assumes that the state representation is rich enough to distinguish repeated performances of the same
elemental task.
9. This assumes that the bias module is minimizing a mean square error criterion.
10. Incrementing the β values faster did not help the one-for-one architecture.
References
Barto, A.G., Bradtke, S.J., & Singh, S.P. (1991). Real-time learning and control using asynchronous dynamic
programming. (Technical Report 91-57). Amherst, MA: University of Massachusetts, COINS Dept.
Barto, A.G. & Singh, S.P. (1990). On the computational economics of reinforcement learning. Proceedings of
the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.
Barto, A.G., Sutton, R.S., & Anderson, C.W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE SMC, 13, 835-846.
Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H. (1990). Sequential decision problems and neural networks. In
D.S. Touretzky, (Ed.), Advances in neural information processing systems 2, San Mateo, CA: Morgan Kaufmann.
Bertsekas, D.P. (1987). Dynamic programming: Deterministic and stochastic models. Englewood Cliffs, NJ:
Prentice-Hall.
Brooks, R. (1989). A robot that walks: Emergent behaviors from a carefully evolved network. Neural Computation, 1, 253-262.
Duda, R.O. & Hart, P.E. (1973). Pattern classification and scene analysis. New York: Wiley.
Iba, G.A. (1989). A heuristic approach to the discovery of macro-operators. Machine Learning, 3, 285-317.
Jacobs, R.A. (1990). Task decomposition through competition in a modular connectionist architecture. Ph.D.
Thesis, COINS Dept., Univ. of Massachusetts, Amherst, Mass.
Jacobs, R.A. & Jordan, M.I. (1991). A competitive modular connectionist architecture. Advances in neural information processing systems, 3.
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., & Hinton, G.E. (1991). Adaptive mixtures of local experts. Neural
Computation, 3.
Kaelbling, L.P. (1990). Learning in embedded systems. Ph.D. Thesis, Stanford University, Department of Computer Science, Stanford CA. Technical Report TR-90-04.
Korf, R.E. (1985). Macro-operators: A weak method for learning. Artificial Intelligence, 26, 35-77.
Maes, P. & Brooks, R. (1990). Learning to coordinate behaviours. Proceedings of the Eighth AAAI (pp. 796-802).
Morgan Kaufmann.
Mahadevan, S. & Connell, J. (1990). Automatic programming of behavior-based robots using reinforcement learning. (Technical Report) Yorktown Heights, NY: IBM Research Division, T.J. Watson Research Center.
Nowlan, S.J. (1990). Competing experts: An experimental investigation of associative mixture models. (Technical
Report CRG-TR-90-5). Toronto, Canada: Univ. of Toronto, Department of Computer Science.
Ross, S. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.
Singh, S.P. (1992a). On the efficient learning of multiple sequential tasks. In J. Moody, S.J. Hanson, & R.P.
Lippman, (Eds.), Advances in neural information processing systems 4, San Mateo, CA: Morgan Kaufmann.
Singh, S.P. (1992b). Solving multiple sequential tasks using a hierarchy of variable temporal resolution models.
Submitted to Machine Learning Conference, 1992.
Skinner, B.F. (1938). The behavior of organisms: An experimental analysis. New York: D. Appleton Century.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic
programming. Proceedings of the Seventh International Workshop on Machine Learning (pp. 216-224). San
Mateo, CA: Morgan Kaufmann
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. Thesis, Cambridge Univ., Cambridge, England.
Watkins, C.J.C.H. & Dayan, P. (1992). Q-learning. Machine Learning, 8, 279-292.
Whitehead, S.D. & Ballard, D.H. (1990). Active perception and reinforcement learning. Proceedings of the Seventh
International Conference on Machine Learning. Austin, TX.
Abstract. The method of temporal differences (TD) is one way of making consistent predictions about the future.
This paper uses some analysis of Watkins (1989) to extend a convergence theorem due to Sutton (1988) from
the case which only uses information from adjacent time steps to that involving information from arbitrary ones.
It also considers how this version of TD behaves in the face of linearly dependent representations for states, demonstrating that it still converges, but to a different answer from the least mean squares algorithm. Finally
it adapts Watkins' theorem that Q-learning, his closely related prediction and action learning method, converges
with probability one, to demonstrate this strong form of convergence for a slightly modified version of TD.
Keywords. Reinforcement learning, temporal differences, asynchronous dynamic programming
1. Introduction
Many systems operate in temporally extended circumstances, for which whole sequences
of states rather than just individual ones are important. Such systems may frequently have
to predict some future outcome, based on a potentially stochastic relationship between it
and their current states. Furthermore, it is often important for them to be able to learn
these predictions based on experience.
Consider a simple version of this problem in which the task is to predict the expected
ultimate terminal values starting from each state of an absorbing Markov process, and for
which some further random processes generate these terminating values at the absorbing
states. One way to make these predictions is to learn the transition matrix of the chain
and the expected values from each absorbing state, and then solve a simultaneous equation
in one fell swoop. A simpler alternative is to learn the predictions directly, without first
learning the transitions.
The methods of temporal differences (TD), first defined as such by Sutton (1984; 1988),
fall into this simpler category. Given some parametric way of predicting the expected values
of states, they alter the parameters to reduce the inconsistency between the estimate from
one state and the estimates from the next state or states. This learning can happen incrementally, as the system observes its states and terminal values. Sutton (1988) proved some
results about the convergence of a particular case of a TD method.
Many control problems can be formalized in terms of controlled, absorbing, Markov
processes, for which each policy, i.e. , mapping from states to actions, defines an absorbing
Markov chain. The engineering method of dynamic programming (DP) (Bellman & Dreyfus,
1962) uses the predictions of the expected terminal values as a way of judging and hence
improving policies, and TD methods can also be extended to accomplish this. As discussed
extensively by Watkins (1989) and Barto, Sutton and Watkins (1990), TD is actually very
closely related to DP in ways that significantly illuminate its workings.
This paper uses Watkins' insights to extend Sutton's theorem from a special case of TD,
which considers inconsistencies only between adjacent states, to the general case in which
arbitrary states are important, weighted exponentially less according to their temporal distances. It also considers what TD converges to if the representation adopted for states is
linearly dependent, and proves that one version of TD prediction converges with probability
one, by casting it in the form of Q-learning.
Some of the earliest work in temporal difference methods was due to Samuel (1959; 1967).
His checkers (draughts) playing program tried to learn a consistent function for evaluating
board positions, using the discrepancies between the predicted values at each state based
on limited depth game-tree searches, and the subsequently predicted values after those
numbers of moves had elapsed. Many other proposals along similar lines have been made:
Sutton acknowledged the influence of Klopf (1972; 1982) and in Sutton (1988) discussed
Holland's bucket brigade method for classifier systems (Holland, 1986), and a procedure
by Witten (1977). Hampson (1983; 1990) presented empirical results for a quite similar
navigation task to the one described by Barto, Sutton and Watkins (1990). Barto, Sutton
and Anderson (1983) described an early TD system which learns how to balance an upended
pole, a problem introduced in a further related paper by Michie and Chambers (1968).
Watkins (1989) also gave further references.
The next section defines TD(λ), shows how to use Watkins' analysis of its relationship with
DP to extend Sutton's theorem, and makes some comments about unhelpful state representations. Section 3 looks at Q-learning, and uses a version of Watkins' convergence theorem
to demonstrate in a particular case the strongest guarantee known for the behavior of TD(0).
2. TD(λ)
Sutton (1988) developed the rationale behind TD methods for prediction, and proved that
TD(0), a special case with a time horizon of only one step, converges in the mean for
observations of an absorbing Markov chain. Although his theorem applies generally, he
illustrated the case in point with an example of the simple random walk shown in figure 1.
Here, the chain always starts in state D, and moves left or right with equal probabilities
from each state until it reaches the left absorbing barrier A or the right absorbing barrier
G. The problem facing TD is estimating the probability that it absorbs at the right hand barrier
rather than the left hand one, given any of the states as a current location.
Figure 1. Sutton's Markov example. Transition probabilities given above (right to left) and below (left to right)
the arrows.
The raw information available to the system is a collection of sequences of states and
terminal locations generated from the random walk-it initially has no knowledge of the
transition probabilities. Sutton described the supervised least mean squares (LMS) (Widrow
& Stearns, 1985) technique, which works by making the estimates of the probabilities for
each place visited on a sequence closer to 1 if that sequence ended up at the right hand
barrier, and closer to 0 if it ended up at the left hand one. He showed that this technique
is exactly TD(1), one special case of TD, and contrasted it with TD(λ) and particularly
TD(0), which tries to make the estimate of probability from one state closer to the estimate
from the next, without waiting to see where the sequence might terminate. The discounting
parameter λ in TD(λ) determines exponentially the weights of future states based on their
temporal distance, smoothly interpolating between λ = 0, for which only the next state
is relevant, and λ = 1, the LMS case, for which all states are equally weighted. As described
in the introduction, it is its obeisance to the temporal order of the sequence that marks
out TD.
The following subsections describe Sutton's results for TD(0) and separate out the
algorithm from the vector representation of states. They then show how Watkins' analysis
provides the wherewithal to extend it to TD(λ) and finally re-incorporate the original
representation.
    T                    terminal states
    N                    non-terminal states
    q_ij ∈ [0, 1]        i ∈ N, j ∈ N ∪ T     transition probabilities
    x_i                  i ∈ N                vectors representing non-terminal states
    z̄_j                  j ∈ T                expected terminal values from state j
    μ_i                  i ∈ N                probabilities of starting at state i,
                                              where Σ_{i∈N} μ_i = 1.
The payoff structure of the chain shown in figure 1 is degenerate, in the sense that the
values of the terminal states A and G are deterministically 0 and 1 respectively. This makes
the expected value from any state just the probability of absorbing at G.
The estimation system is fed complete sequences x_{i_1}, x_{i_2}, ..., x_{i_m} of observation vectors,
together with their scalar terminal value z. It has to generate for every non-terminal state
i ∈ N a prediction of the expected value E[z | i] for starting from that state. If the transition
matrix of the Markov chain were completely known, these predictions could be computed as:
(1)
Again, following Sutton, let [M]_{ab} denote the abth entry of any matrix M, [u]_a denote the
ath component of any vector u, Q denote the square matrix with components [Q]_{ab} = q_{ab},
a, b ∈ N, and h denote the vector whose components are [h]_a = Σ_{b∈T} q_{ab} z̄_b, for a ∈ N.

    E[z | i] = [h + Qh + Q^2 h + ⋯]_i = [(I - Q)^{-1} h]_i        (2)
As Sutton showed, the existence of the limit in this equation follows from the fact that
Q is the transition matrix for the nonterminal states of an absorbing Markov chain, which,
with probability one will ultimately terminate.
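For the random walk of figure 1 the ideal predictions of equation (2) can be computed directly; the sketch below (NumPy, with the five non-terminal states B-F ordered left to right, an arrangement assumed here for concreteness) just evaluates (I - Q)^{-1} h and recovers the familiar values 1/6, ..., 5/6.

    import numpy as np

    # Non-terminal states B, C, D, E, F of the figure-1 random walk;
    # terminals are A (value 0) on the left and G (value 1) on the right.
    n = 5
    Q = np.zeros((n, n))
    for i in range(n - 1):
        Q[i, i + 1] = 0.5      # step right
        Q[i + 1, i] = 0.5      # step left
    h = np.zeros(n)
    h[-1] = 0.5 * 1.0          # from F, a right step absorbs at G with value 1
    ideal = np.linalg.solve(np.eye(n) - Q, h)
    print(ideal)               # -> [1/6, 2/6, 3/6, 4/6, 5/6]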
During the learning phase, linear TD(λ) generates successive weight vectors w^λ_1, w^λ_2, ...,
changing w^λ after each complete observation sequence. Define V^λ_n(i) = w^λ_n · x_i as the prediction of the terminal value starting from state i, at stage n in learning. Then, during one
such sequence, V^λ_n(i_t) are the intermediate predictions of these terminal values, and, abusing notation somewhat, define also V^λ_n(i_{m+1}) = z, the observed terminal value. Note that
in Sutton (1988), Sutton used P^n_t for V^λ_n(i_t). TD(λ) changes w^λ according to:

    w^λ_{n+1} = w^λ_n + Σ_{t=1}^{m} α [ V^λ_n(i_{t+1}) - V^λ_n(i_t) ] Σ_{k=1}^{t} λ^{t-k} ∇_{w^λ} V^λ_n(i_k)        (3)
Theorem T For any absorbing Markov chain, for any distribution of starting probabilities
μ_i such that there are no inaccessible states, for any outcome distributions with finite expected values z̄_j, and for any linearly independent set of observation vectors {x_i | i ∈ N},
there exists an ε > 0 such that, for all positive α < ε and for any initial weight vector,
the predictions of linear TD(λ) (with weight updates after each sequence) converge in expected value to the ideal predictions (2); that is, if w^λ_n denotes the weight vector after n
sequences have been experienced, then

    lim_{n→∞} E[w^λ_n · x_i] = E[z | i] = [(I - Q)^{-1} h]_i,    ∀ i ∈ N.

Sutton (1988) proved that theorem T is true in the case that λ = 0. This paper proves theorem T for general λ.
Equation (3) conflates two issues: the underlying TD(λ) algorithm and the representation
of the prediction functions V^λ_n. Even though these will remain tangled in the ultimate proof
of convergence, it is beneficial to separate them out, since it makes the operation of the
algorithm clearer.
Consider representing V^λ_n as a look-up table, with one entry for each state. This is equivalent to choosing a set of vectors x_i for which just one component is 1 and all the others
are 0 for each state, and no two states have the same representation. These trivially satisfy
the conditions of Sutton's theorem, and also make the w_n easy to interpret, as each component is the prediction for just one state. Using them also prevents generalization. For
this representation, the terms ∇_{w_n} V^λ_n(i_k) in the sum

    Σ_{k=1}^{t} λ^{t-k} ∇_{w_n} V^λ_n(i_k)

just reduce to counting the number of times the chain has visited each state, exponentially
weighted in recency by λ. In this case, as in the full linear case, these terms do not depend
on n, only on the states the chain visits. Define a characteristic function for state j:

    χ_j(t) = 1 if i_t = j, and 0 otherwise,

and the prediction function V^λ_n(i) as the entry in the look-up table for state i at stage n
during learning. Then equation (3) can be reduced to its elemental pieces

    V^λ_{n+1}(i) = V^λ_n(i) + Σ_{t=1}^{m} α [ V^λ_n(i_{t+1}) - V^λ_n(i_t) ] Σ_{k=1}^{t} λ^{t-k} χ_i(k)        (4)
    Σ_{k=1}^{t} λ^{t-k} χ_i(k)

for state i and time t.
Define the r-step estimates

    V^1_{n,i_0} = V_n(i_1) if i_1 ∈ N, and z otherwise,

    V^2_{n,i_0} = V_n(i_2) if i_2 ∈ N, and z otherwise,

and, in general,

    V^r_{n,i_0} = V_n(i_r) if i_r ∈ N, and z otherwise,        (5)
where i_1 is the first state accessed by the Markov chain in one particular sequence starting
from i_0; i_2 is the second, and so on, and z is the observed terminal value if the chain gets
absorbed before time step r is reached. These are random variables, since they depend
on the particular sequence of states which will be observed. Naturally, they also depend
on the values V_n(i).
The point of these is that if the chain absorbs after completing s steps from i_0, then each
of the V^r_{n,i_0} for r ≥ s will be based on the terminal value provided by the world, rather
than one derived from the 'bootstraps' V_n(i). V^r_{n,i_0} should therefore on average be more
accurate than V_n and so can be used incrementally to improve it. This can be shown by
looking at the difference between E[V^r_{n,i_0}] and E[z | i_0], the ideal predictions.
Here:

    E[V^r_{n,i_0}] = Σ_{t_1∈T} q_{i_0 t_1} z̄_{t_1} + Σ_{i_1∈N} q_{i_0 i_1} [ Σ_{t_2∈T} q_{i_1 t_2} z̄_{t_2} + Σ_{i_2∈N} q_{i_1 i_2} [ ⋯ ] ]        (6)

    = [h]_{i_0} + [Qh]_{i_0} + ⋯ + [Q^{r-1} h]_{i_0} + Σ_{i_r∈N} [Q^r]_{i_0 i_r} V_n(i_r)        (7)

and

    E[z | i_0] = [h]_{i_0} + [Qh]_{i_0} + ⋯ + [Q^{r-1} h]_{i_0} + Σ_{i_r∈N} [Q^r]_{i_0 i_r} E[z | i_r].        (8)

Therefore,

    E[V^r_{n,i_0}] - E[z | i_0] = Σ_{i_r∈N} [Q^r]_{i_0 i_r} ( V_n(i_r) - E[z | i_r] ).        (9)
Watkins actually treated a slightly different case, in which the target values of the predictors
are based on discounted future values whose contribution diminishes exponentially with
the time until they happen. In this case it is easier to see how the reduction in error is
brought about. His analogue of equation (9) was:
    | E[V^r_{n,i_0}] - E[z | i_0] | ≤ γ^r max_{i_r∈N} | V_n(i_r) - E[z | i_r] |,

which provides a (weak) guarantee that the error of V^r_n will be less than that of V_n.
The nondiscounted case, when γ = 1, is somewhat different. Here, for some initial states
i_0, there is a nonzero probability that the chain will absorb before finishing r steps. In
this case, the value of V^r_{n,i_0}, being z, will be unbiased, and so should provide for error
reduction. Even if the chain does not absorb, its value can be no farther from what it should
be than is the most inaccurate component of V_n. Although there is no error reduction due
to γ, it is guaranteed that

    | E[V^r_{n,i_0}] - E[z | i_0] | ≤ max_{i_r∈N} | V_n(i_r) - E[z | i_r] |,

with inequality for all those states from which it is possible to absorb within r steps.
This does not ensure that

    max_{i_0} | E[V^r_{n,i_0}] - E[z | i_0] | < max_{i_r} | V_n(i_r) - E[z | i_r] |
since the maximum could be achieved, pathologically, at a state from which it is impossible to absorb in only r steps. However, the estimates for the states that are within r steps
of absorption will, on average, improve, and this should, again on average, filter back to
the other states.
Watkins demonstrated that TD(λ) is based on a weighted average of the V^r_{n,i_0}. Consider

    V^λ_{n,i_0} = (1 - λ) Σ_{r=1}^{∞} λ^{r-1} V^r_{n,i_0}        (10)

which is also a valid estimator of the terminal value starting at i_0. He points out that in
choosing the value of λ, there is a tradeoff between the bias caused by the error in V_n
and the variance of the real terminal value z. The higher λ, the more significant are the
V^r for higher values of r, and the more effect the unbiased terminal values will have. This
leads to higher variance and lower bias. Conversely, the lower λ, the less significant are
the contributions from higher values of r, and the less the effect of the unbiased terminal
values. This leads to smaller variance and greater bias.
It remains to be shown that TD(λ) is indeed based on this combination estimator. Expanding out the sum in equation (10),

    V^λ_{n,i_0} = V_n(i_0) + Σ_{t=0}^{∞} λ^t [ V_n(i_{t+1}) - V_n(i_t) ],        (11)

defining V_n(i_s) = z for i_s ∉ N.
The whole point of defining V^λ_{n,i_0} is so that it can be used to make V_n more accurate.
The obvious incremental update rule to achieve this has

    V_{n+1}(i_0) = V_n(i_0) + α [ V^λ_{n,i_0} - V_n(i_0) ].        (12)

From equation (11) it is apparent that the changes to V_n(i_0) involve summing future values
of V_n(i_{t+1}) - V_n(i_t) weighted by powers of λ. Again following Watkins, these differences
can be calculated through an activity trace based on the characteristic functions χ_i(t) that
were defined earlier as a way of counting how often and how recently the chain has entered
particular states. Using index t for the members of the observed sequence, the on-line version of the TD(λ) rule has

    ΔV(i) = α [ V(i_{t+1}) - V(i_t) ] Σ_{k=1}^{t} λ^{t-k} χ_i(k).        (13)
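Equation (13) is what is usually implemented as TD(λ) with an accumulating eligibility trace: the trace holds Σ_k λ^{t-k} χ_i(k) for every state, and every prediction is nudged by the current temporal difference weighted by its trace. The sketch below runs that rule on the figure-1 random walk with a lookup-table V; the step size, number of sequences, and random seed are illustrative choices, not values from the paper.

    import random

    def td_lambda_random_walk(lam=0.5, alpha=0.05, n_sequences=2000, seed=0):
        random.seed(seed)
        V = [0.5] * 5                        # predictions for non-terminal states B..F
        for _ in range(n_sequences):
            trace = [0.0] * 5                # sum_k lambda^{t-k} chi_i(k) for each state i
            i = 2                            # every sequence starts in the middle state D
            while True:
                trace = [lam * e for e in trace]
                trace[i] += 1.0
                j = i + random.choice((-1, 1))
                if j < 0:    next_v, done = 0.0, True    # absorbed at A, terminal value 0
                elif j > 4:  next_v, done = 1.0, True    # absorbed at G, terminal value 1
                else:        next_v, done = V[j], False
                delta = next_v - V[i]                    # V(i_{t+1}) - V(i_t)
                for s in range(5):
                    V[s] += alpha * delta * trace[s]     # equation (13), applied to every state
                if done:
                    break
                i = j
        return V                             # approaches [1/6, 2/6, 3/6, 4/6, 5/6]

    print(td_lambda_random_walk())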
For the problem that Sutton treated, the change to V_n is applied off-line, after a complete
sequence through the chain. Therefore, if the states through which the chain passes on one
sequence are i_0, i_1, ..., i_{m-1} ∈ N, and i_m ∈ T, it absorbs with terminal value V_n(i_m) ≡ z,
and V_{n+1} is the new estimator after experiencing the sequence, then
    V_{n+1}(i_0) = V_n(i_0) + Σ_{t=1}^{m-1} α [ V_n(i_{t+1}) - V_n(i_t) ] Σ_{k=1}^{t} λ^{t-k} χ_{i_0}(k)

    V_{n+1}(i_1) = V_n(i_1) + Σ_{t=2}^{m-1} α [ V_n(i_{t+1}) - V_n(i_t) ] Σ_{k=1}^{t} λ^{t-k} χ_{i_1}(k)

    ⋮

    V_{n+1}(i_{m-1}) = V_n(i_{m-1}) + α [ z - V_n(i_{m-1}) ] Σ_{k=1}^{m-1} λ^{m-1-k} χ_{i_{m-1}}(k),

summing over terms where i_a = i_b (so χ_{i_a} ≡ χ_{i_b}). Note that these expressions are exactly
the same as the TD(λ) weight change formula in equation (4).
Thus, the actual TD(λ) algorithm is based on the exponentially weighted sum defined
in equation (10) of the outcomes of the V^r_n random variables. The mean contraction properties of these variables will therefore determine the mean contraction properties of the overall
TD(λ) estimator.
2.4. Linear representation
The previous subsection considered the TD(λ) algorithm isolated from the representation
Sutton used. Although a number of different representations might be employed, the simplest
is the linear one he adopted. Identifying the vectors x with the states they represent gives
    V_n(x) = w_n · x,

so that the analogue of the update rule (12) changes w_n along the direction of x_{i_0},
weighting the error due to state i_0 by the vector representation of i_0. Then the equivalent
of equation (13) is just Sutton's main TD(λ) equation (3).
More sophisticated representations such as kd-trees (see Omohundro (1987) for a review)
or CMACs (Albus, 1975) may lead to faster learning and better generalization, but each
requires a separate convergence proof. Dayan (1991) compares the qualities of certain different
representations for Barto, Sutton and Watkins' grid task (Barto, Sutton & Watkins, 1990).
    V^r_{n,i_0} = w_n · x_{i_r} if x_{i_r} ∈ N, and z otherwise,

where x_i are identified with the states in the observed sequence, w^λ_n is the current weight
vector defining the estimated terminal values, and z is the actual value. Then, after observing the whole sequence, w^λ_n is updated as:

    w^λ_{n+1} = w^λ_n + α Σ_{x_i ∈ N visited} [ V^λ_{n,i} - w_n · x_i ] x_i.        (14)
x_k ∈ T, with η_{ij} defined as the number of observed transitions of the corresponding kind from state i to state j in one sequence.
The sum in equation (14) can be regrouped in terms of source and destination states of
the transitions:

    w^λ_{n+1} = w^λ_n + α Σ_{i∈N} [ Σ_{j_1∈N} η_{i j_1} w^λ_n · x_{j_1} + ⋯ - w^λ_n · x_i ] x_i        (15)

where z_j indicates that the terminal value is generated from the distribution due to state
j, and the extra terms are generated by the possibility that, from visiting any x_i ∈ N, the
chain absorbs before taking r further steps.
Taking expected values over sequences, for i ∈ N,

    E[η_{ij}] = d_i Q_{ij}                                  for j ∈ N,
    E[η̄_{ij}] = Σ_{k∈N} d_i [Q^{r-1}]_{ik} q_{kj}            for j ∈ T,
    E[η̄'_{ij}] = Σ_{k∈N} d_i [Q^{r-2}]_{ik} q_{kj}           for j ∈ T,
    ⋮                                                        for j ∈ T,
where d_i is the expected number of times the Markov chain is in state i in one sequence.
For an absorbing Markov chain, it is known that the dependency of this on the probabilities
μ_i of starting in the various states is:

    d_i = Σ_{j∈N} μ_j [(I - Q)^{-1}]_{ji} = [μ^T (I - Q)^{-1}]_i        (16)
Substituting into equation (15), after taking expectations on both sides, noting that the
dependence of E[w^λ_{n+1} | w^λ_n] on w^λ_n is linear, and using w̄ to denote expected values, a close
relation of equation (6) emerges for the linear representation:
    w̄^λ_{n+1} = ⋯        (17)
Define X to be the matrix whose columns are x_i, so [X]_{ab} = [x_a]_b, and D to be the
diagonal matrix [D]_{ab} = δ_{ab} d_a, where δ_{ab} is the Kronecker delta. Remembering that
h_i = Σ_{j∈T} q_{ij} z̄_j, and converting to matrix form,
since the transition probabilities involved sum to one, as this covers all the possible options for r-step moves from state i.
Define the correct predictions [e*]_i = E[z | i]; then also, from equation (2),

    e* = [E[z | i]] = h + Qh + Q^2 h + ⋯
       = Σ_{k=0}^{∞} [Q^r]^k ( I + Q + Q^2 + ⋯ + Q^{r-1} ) h        (18)

where the sum converges since the chain is absorbing. This is another way of writing equation (7).
Multiplying equation (17) on the left by X^T,
Subtracting e* from both sides of the equation, and noting that from equation (18)
(I - Q^r) e* = (I + Q + Q^2 + ⋯ + Q^{r-1}) h, this gives the update rule, which is the
equivalent of equation (9):
The Watkins construction of TD(λ) developed in equation (10) in the previous section
reveals that, starting from w^r_n = w^λ_n, ∀r,

    w̄^λ_{n+1} = (1 - λ) Σ_{r=1}^{∞} λ^{r-1} w̄^r_{n+1},

where, for 0 < λ < 1, (1 - λ) Σ_{r=1}^{∞} λ^{r-1} = 1, and

    (1 - λ) Σ_{r=1}^{∞} λ^{r-1} Q^r = (1 - λ) Q [I - λQ]^{-1}        (19)

converges since 0 < λ < 1.
Define

    ⋯
then the truth of theorem T will be shown if it can be demonstrated that ∃ ε > 0 such that
for 0 < α < ε this quantity tends to zero as n → ∞. For then [X^T w̄^λ_n - e*] → 0 as n → ∞, and all the estimates will tend to be correct.
Almost all of Sutton's (1988) proof of this applies mutatis mutandis to the case that λ
≠ 0, always provided the crucial condition holds that X has full rank. For completeness,
the entire proof is given in the appendix. Overall it implies that the expected values of
the estimates will converge to their desired values as more sequences are observed under
the conditions stated in theorem T.
In moving from Watkins' representation-free proof to Sutton's treatment of the linear case,
one assumption was that the x_i, the vectors representing the states, were independent.
If they are not, so that matrix X does not have full rank, the proof breaks down.
D(I - (1 - λ)Q[I - λQ]^{-1}) is still positive; however, X^T X D(I - (1 - λ)Q[I - λQ]^{-1})
will no longer have a full set of eigenvalues with positive real parts, since the null subspace

    Y = { y | X D(I - (1 - λ)Q[I - λQ]^{-1}) y = 0 } ≠ {0}

is not empty. Any nonzero member of this is an eigenvector with eigenvalue 0 of
X^T X D(I - (1 - λ)Q[I - λQ]^{-1}).
Saying what will happen to the expected values of the weights turns out to be easier than
understanding it. Choose a basis

    b_1, ..., b_n,

for 1 ≤ i ≤ n and 0 < α < ε. Also, ⋯ by the definition of Y. Writing ⋯ , then

    [ I - αX^T X D(I - (1 - λ)Q[I - λQ]^{-1}) ]^n [ ⋯ ] → Σ_{i=1}^{p} β_i b_i    as n → ∞,
and so

    X D(I - (1 - λ)Q[I - λQ]^{-1}) [X^T w̄^λ_n - e*] → 0    as n → ∞.        (20)
To help understand this result, consider the equivalent for the LMS rule, TD(1). There

    X D [X^T w̄^1_n - e*] → 0    as n → ∞.        (21)

    ⋯        (22)

    = 2 X D [X^T w̄^1_n - e*]        (23)

    → 0    as n → ∞,        (24)
by equation (21). For weights w, the square error for state i is |[X^T w - e*]_i|², and the expected number of visits to i in one sequence is d_i. Therefore the quadratic form

    [X^T w - e*]^T D [X^T w - e*]

is just the loaded square error between the predictions at each state and their desired values,
where the loading factors are just the expected frequencies with which the Markov chain
hits those states. The condition in equation (24) implies that the expected values of the
weights tend to be so as to minimize this error.
This does not happen in general for λ ≠ 1. Intuitively, bias has returned to haunt. For
the case where X is full rank, Sutton shows that it is harmless to use the inaccurate estimates
from the next state x_{i_{t+1}} · w to criticize the estimates for the current state x_{i_t} · w. Where X
is not full rank, these successive estimates become biased on account of what might be
deemed their 'shared' representation. The amount of extra bias is then related to the amount
of sharing and the frequency with which the transitions happen from one state to the next.
Formalizing this leads to a second issue; the interaction between the two statistical processes of calculating the mean weight and calculating the expected number of transitions.
Comparing equations (20) and (21), one might expect

∂/∂w { [X^T w − e*]^T D(I − (1 − λ)Q[I − λQ]^{−1}) [X^T w − e*] } evaluated at w_n^λ → 0   as n → ∞.   (25)

However, the key step in proving equation (24) was the transition between equations (22)
and (23), which relied on the symmetry of D. Since Q is not in general symmetric, this
will not happen.
Defining

g(w') = ∂/∂w { [X^T w − e*]^T D(I − (1 − λ)Q[I − λQ]^{−1}) [X^T w' − e*] }
      = X D(I − (1 − λ)Q[I − λQ]^{−1}) [X^T w' − e*],   (26)

all that will actually happen is that g(w_n^λ) → 0 as n → ∞.
Although the behavior described by equation (25) is no more satisfactory than that described by equation (26), it is revealing to consider what happens if one attempts to arrange
for it to hold. This can be achieved by 'completing' the derivative, i.e., by having a learning rule whose effect is

E[Δw_n] = −α X { D(I − (1 − λ)Q[I − λQ]^{−1}) + (I − (1 − λ)Q^T[I − λQ^T]^{−1}) D } [X^T w_n − e*].

The Q^T term effectively arranges for backwards as well as forwards learning to occur, so
that not only would state i_t adjust its estimate to make it more like state i_{t+1}, but also state
i_{t+1} would adjust its estimate to make it more like state i_t.
Werbos (1990) and Sutton (personal communication) both discussed this point in the context of the gradient descent of TD(λ) rather than its convergence for non-independent x_i.
Werbos presented an example based on a learning technique very similar to TD(0), in which
completing the derivative in this manner makes the rule converge away from the true solution. He faulted this procedure for introducing the unhelpful correlations between the learning rule and the random moves from one state to the next which were mentioned above.
He pointed out the convergence in terms of functions g in equation (26) in which the w'
weights are fixed.
Sutton presented an example to help explain the result. At first sight, augmenting TD(A)
seems quite reasonable; after all it could quite easily happen by random chance of the training
sequences that the predictions for one state are more accurate than the predictions for the
next at some point. Therefore, training the second to be more like the first would be helpful.
However, Sutton pointed out that time and choices always move forward, not backwards.
Consider the case shown in figure 2, where the numbers over the arrows represent the transition probabilities, and the numbers at the terminal nodes represent terminal absorbing values.
Here, the value of state A is reasonably 1/2, as there is 50% probability of ending up
at either Y or Z. The value of state B, though, should be 1, as the chain is certain to end
up at Y. Training forwards will give this, but training backwards too will make the value
of B tend to 3/4. In Werbos' terms, there are correlations between the weights and the possible
transitions that count against the augmented term. Incidentally, this result does not affect
TD(l), because the training values, being just the terminal value for the sequence, bear
no relation to the transitions themselves, just the number of times each state is visited.
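As an illustration of the chain in figure 2 (my own sketch, not from the paper), forward TD(0) with a table lookup representation can be simulated directly. The transition structure below (A goes to B or to the terminal Z with probability 1/2 each, and B always goes to the terminal Y) is an assumption read off the description in the text.

```python
import random

# Terminal values: Y = 1, Z = 0. Forward-only tabular TD(0).
V = {"A": 0.0, "B": 0.0}
alpha = 0.05
random.seed(0)

for _ in range(20000):
    # One sequence starting at A.
    if random.random() < 0.5:
        # A -> B -> Y: update A towards V(B), then B towards the terminal value 1.
        V["A"] += alpha * (V["B"] - V["A"])
        V["B"] += alpha * (1.0 - V["B"])
    else:
        # A -> Z: update A towards the terminal value 0.
        V["A"] += alpha * (0.0 - V["A"])

print(V)   # V(A) ends up near 1/2 and V(B) near 1, as the text argues
```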
Coming back to the case where X is not of full rank: TD(λ) for λ ≠ 1 will still converge,
but away from the 'best' value, to a degree that is determined by the matrix
(I − (1 − λ)Q[I − λQ]^{−1}).
Figure 2. Didactic example of the pitfalls of backwards training. If Y and Z are terminal states with values 1
and 0 respectively, what values should be assigned to states A and B?
V^π(i) = E[r_i(π(i))] + γ Σ_{j∈N} q_ij(π(i)) V^π(j),

where γ is the discount factor. Define the Q value of state i and action a under policy π as

Q^π(i, a) = E[r_i(a)] + γ Σ_{j∈N} q_ij(a) V^π(j),

which is the value of taking action a at i followed by policy π thereafter. Then the theory
of DP (Bellman & Dreyfus, 1962) implies that a policy which is at least as good as π is
to take the action a* at state i, where a* = argmax_b {Q^π(i, b)}, and to follow π in all other
states. In this fact lies the utility of the Q values. For discounted problems, it turns out
that there is at least one optimal policy π*; define Q*(i, a) = Q^{π*}(i, a).
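The policy-improvement fact just quoted can be written out directly. The sketch below is an illustration of my own with made-up transition probabilities and rewards: it computes Q^π from V^π and picks the greedy action at one state.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# Illustrative model: q[a, i, j] is the transition probability under action a,
# r[i, a] the expected immediate reward.
q = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0], [0.0, 0.3, 0.7]],
              [[0.2, 0.0, 0.8], [0.0, 0.5, 0.5], [0.6, 0.0, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]])

# Some policy pi and its value function V^pi (solve the linear system).
pi = np.array([0, 1, 1])
P_pi = q[pi, np.arange(n_states), :]
r_pi = r[np.arange(n_states), pi]
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Q^pi(i, a) = E[r_i(a)] + gamma * sum_j q_ij(a) V^pi(j), then the greedy action.
Q_pi = r + gamma * np.einsum('aij,j->ia', q, V_pi)
i = 0
a_star = np.argmax(Q_pi[i])
print(Q_pi[i], a_star)   # taking a_star at i (and pi elsewhere) is at least as good as pi
```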
Q-learning is a method of determining Q*, and hence an optimal policy, based on exploring the effects of actions at states. Consider a sequence of observations (i_n, a_n, j_n, z_n),
where the process at state i_n is probed with action a_n, taking it to state j_n and giving reward
z_n. Then define recursively:

Q_{n+1}(i, a) = (1 − α_n) Q_n(i, a) + α_n [z_n + γ U_n(j_n)],   if i = i_n and a = a_n,
Q_{n+1}(i, a) = Q_n(i, a),                                       otherwise,              (27)

for any starting values Q_0(i, a), where U_n(j_n) = max_b {Q_n(j_n, b)}. The α_n are a set of
learning rates that obey standard stochastic convergence criteria:

Σ_{k=1}^∞ α_{n_k(i,a)} = ∞,   Σ_{k=1}^∞ α²_{n_k(i,a)} < ∞,   ∀i, a,   (28)

where n_k(i, a) is the kth time that i_n = i and a_n = a. Watkins (1989) proved that if, in
addition, the rewards are all bounded, then, with probability one,

lim_{n→∞} Q_n(i, a) = Q*(i, a),   for all i and a.
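A tabular implementation of the recursion in equation (27) is straightforward. The sketch below uses a toy two-state, two-action process of my own devising and a simple visit-count learning-rate schedule, purely to show the update; neither the environment nor the constants come from Watkins' or Dayan's work.

```python
import random

random.seed(1)
n_states, n_actions, gamma = 2, 2, 0.9
Q = [[0.0] * n_actions for _ in range(n_states)]
visits = [[0] * n_actions for _ in range(n_states)]

def step(i, a):
    """Toy environment: returns (next state j, reward z)."""
    if a == 0:
        return (i, 0.0)
    return (1 - i, 1.0 if i == 0 else 0.0)

i = 0
for _ in range(50000):
    a = random.randrange(n_actions)              # explore every pair infinitely often
    j, z = step(i, a)
    visits[i][a] += 1
    alpha = 1.0 / visits[i][a]                   # satisfies the conditions in (28)
    U = max(Q[j])                                # U_n(j_n) = max_b Q_n(j_n, b)
    Q[i][a] = (1 - alpha) * Q[i][a] + alpha * (z + gamma * U)   # equation (27)
    i = j

print(Q)   # approximates Q*(i, a) for this toy process
```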
Consider a degenerate case of a controlled Markov process in which there is only one
action possible from every state. In that case, the Q^π, V^π, and the (similarly defined) U^π
values are exactly the same and equal to Q*, and equation (27) is exactly the on-line form
of TD(0) for the case of a nonabsorbing chain in which rewards (i.e., the terminal values
discussed above in the context of absorbing Markov chains) arrive from every state rather
than just some particular set of absorbing states. Therefore, under the conditions of Watkins'
theorem, the on-line version of TD(0) converges to the correct predictions, with probability one.
Although clearly a TD procedure, there are various differences between this and the
one described in the previous section. Here, learning is on-line, that is the V(=Q) values
are changed for every observation. Also, learning need not proceed along an observed sequence, since there is no requirement that j_n = i_{n+1}, and so uncoupled or disembodied moves
can be used.³ The conditions in equation (28) have as a consequence that every state must
be visited infinitely often. Also note that Sutton's proof, since it is confined to showing
convergence in the mean, works for a fixed learning rate α, whereas Watkins', in common
with other stochastic convergence proofs, requires α_n to tend to 0.
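In the one-action degenerate case just described, the recursion in equation (27) reduces to the familiar on-line TD(0) update for state values. A minimal sketch of that specialization follows; the three-state chain, reward function and constant learning rate are invented for illustration only.

```python
import random

random.seed(2)
gamma = 0.9
V = [0.0, 0.0, 0.0]
P = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]   # single action per state
reward = [0.0, 0.0, 1.0]

i, alpha = 0, 0.05
for _ in range(100000):
    j = random.choices(range(3), weights=P[i])[0]
    # With one action, (27) becomes V(i) <- (1 - alpha) V(i) + alpha (z + gamma V(j)).
    V[i] = (1 - alpha) * V[i] + alpha * (reward[i] + gamma * V[j])
    i = j

print(V)
```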
Also, as stated, the Q-learning theorem only applies to discounted, nonabsorbing Markov
chains, rather than the absorbing ones with γ = 1 of the previous section. γ < 1 plays the
important role in Watkins' proof of bounding the effect of early Q values. It is fairly easy
to modify his proof to the case of an absorbing Markov chain with γ = 1, as the ever-increasing probability of absorption achieves the same effect. Also, the conditions of Sutton's
theorem imply that every nonabsorbing state will be visited infinitely often, and so it suffices to have one set of α_i that satisfy the conditions in (28) and apply them sequentially
for each visit to each state in the normal running of the chain.
4. Conclusions
This paper has used Watkins' analysis of the relationship between temporal difference (TD)
estimation and dynamic programming to extend Sutton's theorem that TD(0) prediction
converges in the mean, to the case of theorem T: TD(λ) for general λ. It also demonstrated
that if the vectors representing the states are not linearly independent, then TD(λ) for λ ≠ 1
converges to a different solution from the least mean squares algorithm.
Further, it has applied a special case of Watkins' theorem that Q-learning, his method
of incremental dynamic programming, converges with probability one, to show that TD(0)
using a localist state representation also converges with probability one. This leaves open
the question of whether TD(λ), with punctate or distributed representations, also converges
in this manner.
it is necessary to show that there is some ε such that for 0 < α < ε, lim_{n→∞} Δ_n^λ = 0.
In the case that λ = 0 (for which this formula remains correct), and X has full rank, Sutton
proved this on pages 26-28 of (Sutton, 1988), by showing successively that D(I − Q) is
positive, that X^TXD(I − Q) has a full set of eigenvalues all of whose real parts are positive, and finally that α can thus be chosen such that all eigenvalues of I − αX^TXD(I − Q)
are less than 1 in modulus. This proof requires little alteration to the case that λ ≠ 0, and
its path will be followed exactly.
The equivalent of D(I − Q) is D(I − (1 − λ)Q[I − λQ]^{−1}). This will be positive definite, according to a lemma by Varga (1962) and an observation by Sutton, if

S = D(I − (1 − λ)Q[I − λQ]^{−1}) + {D(I − (1 − λ)Q[I − λQ]^{−1})}^T

can be shown to be strictly diagonally dominant with positive diagonal entries. This is the
part of the proof that differs from Sutton's, but even here, its structure is rather similar.
Define

S_r = D(I − Q^r) + {D(I − Q^r)}^T.

Then

[S_r]_ii = [D(I − Q^r)]_ii + [{D(I − Q^r)}^T]_ii
         = 2d_i(1 − [Q^r]_ii)
         > 0,
since Q is the matrix of an absorbing Markov chain, and so Q^r has no diagonal elements
≥ 1. Therefore S_r has positive diagonal elements.
Also, for i ≠ j,

[S_r]_ij = −d_i [Q^r]_ij − d_j [Q^r]_ji ≤ 0,

since all the elements of Q, and hence also those of Q^r, are positive.
In this case, S_r will be strictly diagonally dominant if, and only if, Σ_j [S_r]_ij ≥ 0, with
strict inequality for some i. Now

Σ_j [S_r]_ij = d_i Σ_j [I − Q^r]_ij + Σ_j d_j [I − Q^r]_ji
             = d_i [1 − Σ_j [Q^r]_ij] + [μ^T(I − Q)^{−1}(I − Q^r)]_i   (29)
             = d_i [1 − Σ_j [Q^r]_ij] + [μ^T(I + Q + ... + Q^{r−1})]_i   (30)
             ≥ 0,   (31)
where equation (29) follows from equation (16), equation (30) holds since

I − Q^r = (I − Q)(I + Q + Q² + ... + Q^{r−1}),

and equation (31) holds since Σ_j [Q^r]_ij ≤ 1, as the chain is absorbing, and [Q^s]_ij ≥ 0, ∀s.
Also, there exists at least one i for which μ_i > 0, and the inequality is strict for that i.
Since S_r is strictly diagonally dominant for all r ≥ 1,

S^λ = (1 − λ) Σ_{r=1}^∞ λ^{r−1} S_r = S
is strictly diagonally dominant too, and therefore D(I − (1 − λ)Q[I − λQ]^{−1}) is positive
definite.
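The positive-definiteness claim is easy to spot-check numerically for a particular chain. The sketch below is my own illustration, not part of the proof: it draws a small absorbing chain, computes the expected numbers of visits d from a starting distribution μ via the standard formula d^T = μ^T(I − Q)^{−1}, verifies that the symmetric part of D(I − (1 − λ)Q[I − λQ]^{−1}) has positive eigenvalues, and checks that a small enough α makes the iteration matrix contract (here with X = I, a punctate representation).

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 4, 0.6

# Random absorbing chain: nonterminal-to-nonterminal part Q with row sums < 1.
Q = rng.uniform(0.0, 1.0, size=(n, n))
Q = 0.8 * Q / Q.sum(axis=1, keepdims=True)

mu = np.full(n, 1.0 / n)                       # starting distribution
d = mu @ np.linalg.inv(np.eye(n) - Q)          # expected numbers of visits
D = np.diag(d)

M = D @ (np.eye(n) - (1 - lam) * Q @ np.linalg.inv(np.eye(n) - lam * Q))

# Positive definiteness of the (nonsymmetric) M <=> its symmetric part is p.d.
sym_eigs = np.linalg.eigvalsh(M + M.T)
print("smallest eigenvalue of S:", sym_eigs.min())          # > 0

# With X = I, X^T X D(...) = M; pick alpha from its spectrum as in the proof.
eigs = np.linalg.eigvals(M)
alpha = 0.9 * min(2 * e.real / abs(e) ** 2 for e in eigs)
iteration = np.eye(n) - alpha * M
print("spectral radius:", max(abs(np.linalg.eigvals(iteration))))   # < 1
```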
The next stage is to show that X^TXD(I − (1 − λ)Q[I − λQ]^{−1}) has a full set of eigenvalues all of whose real parts are positive. X^TX, D and (I − (1 − λ)Q[I − λQ]^{−1}) are
all nonsingular, which ensures that the set is full. Let ψ and u be any eigenvalue-eigenvector
pair, with u = a + bi and v = (X^TX)^{−1}u ≠ 0, so u = (X^TX)v. Then

ψ (Xv)* Xv = u* D(I − (1 − λ)Q[I − λQ]^{−1}) u.

Since the right side (by positive definiteness) and (Xv)*Xv are both strictly positive, the
real part of ψ must be strictly positive too.
Furthermore, u must also be an eigenvector of I − αX^TXD(I − (1 − λ)Q[I − λQ]^{−1}),
since

[I − αX^TXD(I − (1 − λ)Q[I − λQ]^{−1})] u = u − αψu = (1 − αψ)u.

Therefore, all the eigenvalues of I − αX^TXD(I − (1 − λ)Q[I − λQ]^{−1}) are of the form
1 − αψ, where ψ = ν + φi has positive ν. Take

α < 2ν / (ν² + φ²)

for all eigenvalues ψ, and then all the eigenvalues 1 − αψ of the iteration matrix are guaranteed to have modulus less than one. By another theorem of Varga (1962),

lim_{n→∞} [I − αX^TXD(I − (1 − λ)Q[I − λQ]^{−1})]^n = 0.
Acknowledgments
This paper is based on a chapter of my thesis (Dayan, 1991). I am very grateful to Andy Barto,
Steve Finch, Alex Lascarides, Satinder Singh, Chris Watkins, David Willshaw, the large number of people who read drafts of the thesis, and particularly Rich Sutton and two anonymous
reviewers for their helpful advice and comments. Support was from SERC. Peter Dayan's
current address is CNL, The Salk Institute, P.O. Box 85800, San Diego, CA 92186-5800.
Notes
1. Here and subsequently, a superscript λ is used to indicate a TD(λ)-based estimator.
2. States A and G are absorbing and so are not represented.
3. This was one of Watkins' main motivations, as it allows his system to learn about the effect of actions it believes
to be suboptimal.
References
Albus, J.S. (1975). A new approach to manipulator control: The Cerebellar Model Articulation Controller (CMAC).
Transactions of the ASME: Journal of Dynamic Systems, Measurement and Control, 97, 220-227.
Barto, A.G., Sutton, R.S. & Anderson, C.W. (1983). Neuronlike elements that can solve difficult learning problems.
IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.
Barto, A.G., Sutton, R.S. & Watkins, C.J.C.H. (1990). Learning and sequential decision making. In M. Gabriel
& J. Moore (Eds.), Learning and computational neuroscience: Foundations of adaptive networks. Cambridge,
MA: MIT Press, Bradford Books.
Bellman, R.E. & Dreyfus, S.E. (1962). Applied dynamic programming. RAND Corporation.
Dayan, P. (1991). Reinforcing connectionism: Learning the statistical way. Ph.D. Thesis, University of Edinburgh,
Scotland.
Hampson, S.E. (1983). A neural model of adaptive behavior. Ph.D. Thesis, University of California, Irvine, CA.
Hampson, S.E. (1990). Connectionistic problem solving: Computational aspects of biological learning. Boston,
MA: Birkhäuser Boston.
Holland, J.H. (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to
parallel rule-based systems. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning:
An artificial intelligence approach, 2. Los Altos, CA: Morgan Kaufmann.
Klopf, A.H. (1972). Brain function and adaptive systems: A heterostatic theory. Air Force Research Laboratories
Research Report, AFCRL-72-0164. Bedford, MA.
Klopf, A.H. (1982). The hedonistic neuron: A theory of memory, learning, and intelligence. Washington, DC:
Hemisphere.
Michie, D. & Chambers, R.A. (1968). BOXES: An experiment in adaptive control. Machine Intelligence, 2, 137-152.
Moore, A.W. (1990). Efficient memory-based learning for robot control. Ph.D. Thesis, University of Cambridge
Computer Laboratory, Cambridge, England.
Omohundro, S. (1987). Efficient algorithms with neural network behaviour. Complex Systems, 1, 273-347.
Samuel, A.L. (1959). Some studies in machine learning using the game of checkers. Reprinted in E.A. Feigenbaum
& J. Feldman (Eds.) (1963). Computers and thought. McGraw-Hill.
Samuel, A.L. (1967). Some studies in machine learning using the game of checkers II: Recent progress. IBM
Journal of Research and Development, 11, 601-617.
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. Thesis, University of Massachusetts, Amherst, MA.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Varga, R.S. (1962). Matrix iterative analysis. Englewood Cliffs, NJ: Prentice-Hall.
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. Thesis, University of Cambridge, England.
Werbos, P.J. (1990). Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks,
3, 179-189.
Widrow, B. & Stearns, S.D. (1985). Adaptive signal processing. Englewood Cliffs, NJ: Prentice-Hall.
Witten, I.H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control,
34, 286-295.
torras@ic.upc.es
Abstract. This paper presents a reinforcement connectionist system which finds and learns the suitable situation-action rules so as to generate feasible paths for a point robot in a 2D environment with circular obstacles. The
basic reinforcement algorithm is extended with a strategy for discovering stable solution paths. Equipped with
this strategy and a powerful codification scheme, the path-finder (i) learns quickly, (ii) deals with continuous-valued inputs and outputs, (iii) exhibits good noise-tolerance and generalization capabilities, (iv) copes with dynamic
environments, and (v) solves an instance of the path finding problem with strong performance demands.
Keywords. Connectionism, reinforcement learning, robot path finding, stability, reactive systems
1. Introduction
An important problem in robotics is that of generating a path between initial and goal robot
configurations that avoids collisions with obstacles. This is called the robot path finding
problem. Many attempts at solving the problem have been made in the fields of Artificial
Intelligence and Computational Geometry (for reviews, see: Brady, et al., 1982; Whitesides,
1985; Yap, 1987; Torras, 1990). All these approaches are based on planning. The time complexity of exact geometric approaches grows exponentially with the number of degrees of
freedom of robot motion (Canny, 1988), thus being of practical use only when this number
is very low. This fact has led to the emergence of numerous heuristic approaches, which
either rely on potential fields (Khatib, 1986) or carry out a search through a state space.
These approaches trade reliability for speed, in that they do not guarantee to find a solution when it exists and they can even produce unfeasible solutions because of the use of
discretized representations.
There have been some attempts at combining exact and heuristic approaches, so as to
exploit their respective strong points and minimize their deficiencies. Along this line, Donald
(1987) proposes to carry out a local heuristic search process on an algebraic (exact) model
of configuration space,¹ while Ilari and Torras (1990) propose a two-stage process, where
first an exact model of physical space-i.e., disregarding both the shape and the kinematics
of the robot-is used to plan a path and then this path is refined through local search to
conform to a trajectory in configuration space.
The above two-stage decomposition of the path finding problem suggests a possible separation between the path-planning and motion-control aspects of the problem. In path planning,
the environment is considered at a coarse enough level of detail to be assumed fixed and
known a priori-through access to a map-and the goal is to determine, prior to its execution, a sequence of displacements in physical space for reaching a given place. Once a
path through physical space is planned at this coarse level, a high-level motion control problem arises since commands must be sent to the different motors to make the robot follow
the path previously computed, while transforming a discrete path in physical space into
a continuous, obstacle-avoiding trajectory in configuration space. Note that, with this problem decomposition, it seems natural to include within motion control the possibility to
accommodate for slight deviations in the positions of obstacles, so that it is no longer required that the environment be completely known a priori and static.
In Torras (1990), we have argued that motion planning and motion control involve two
different types of processing. Global planning involves reasoning symbolically upon an explicit representation of the environment, whilst local obstacle-avoidance capabilities rely
on subsymbolic pattern processing. This distinction applies to learning as well. Learning
to plan would be done symbolically (at a central level) , while the learning of reflexes would
be carried out at a subsymbolic peripheral level.
This paper focuses on the second aspect above, namely the subsymbolic learning of
obstacle-avoidance reflexes for an instance of the robot path finding problem characterized
by (i) a continuous set of robot configurations and of robot actions, (ii) a partially unknown
and dynamic environment, (iii) a need to react in real-time, and (iv) strong performance
demands such as finding short paths with wide clearances. Specifically, we present a
reinforcement-based connectionist system able to generate feasible paths for a mobile robot
in a non-maze-like 2D environment, while appropriately dealing with the four problem
characteristics above. By saying "non-maze-like" we stress that finding a path in that environment does not require sophisticated planning capabilities, but mainly obstacle-avoidance
skills. Maze-like environments require the previous concourse of a path-planner that provides the connectionist system with a series of subgoals along a feasible path, so that the
environment around every two consecutive subgoals becomes non-maze-like.
Discovering suitable obstacle-avoidance reflexes by using only a reinforcement signal
is a very general approach whose simplest formulation could be characterized as a weak
search method. This means that reinforcement methods have theoretically limited learning
abilities; i.e., they might require heavy learning phases and they might be unable to capture
complex features of the problem. These theoretical limitations can be overcome if domain-specific heuristics are incorporated into the basic reinforcement-based search method
(Langley, 1985). The codification scheme adopted in the present work and the algorithm
used to discover stable solution paths are instances of such heuristics for the path finding
domain. The algorithm greatly speeds up learning and makes it possible to deal with continuous-valued actions. To the best of our knowledge, this is the first reinforcement system in which
continuous-valued actions are used in conjunction with a critic. The codification scheme,
besides contributing to solving the problem in a short time, is responsible for the noise-tolerance and generalization capabilities, for satisfying the strong performance demands
concerning path length and clearance and, partially, for the ability to cope with dynamic
environments exhibited by the path-finder.
The paper is structured as follows. In Section 2, previous works relevant to the pathfinder developed are reviewed. These are connectionist approaches that either deal with
the same particular problem, namely path finding, or tackle the general class of associative
reinforcement learning (ARL) problems. The relation of the latter approaches to the system
developed becomes clear in Section 3, where path finding is formulated as an ARL problem. Section 4 is devoted to the description of the connectionist path-finder developed.
The results of an experimental comparative study of different versions of the general learning rule adopted, as well as the simulation results obtained with the path-finder equipped
with the best of such versions are presented next in Section 5. Finally, in Section 6, some
conclusions from this work are provided and future work is addressed.
2. Previous work
A general approach to path finding in partially unknown environments is to build a mapping from perceived situations to correct actions, and iterate this mapping until the goal is
reached. Systems that use this kind of situation-action rules are known as reactive systems.
Reactive systems normally rely on knowledge-based techniques. However, most of them
are not adaptive, i.e., they do not learn (Agre & Chapman, 1987; Arkins, 1987; Schoppers,
1987; Blythe & Mitchell, 1989). In some systems, the situation-action rules are preprogrammed explicitly by their designers (Agre & Chapman, 1987; Arkins, 1987). In other
systems, the rules result from a compilation of previous planning activities (Schoppers,
1987; Blythe & Mitchell, 1989). In any case, the main limitation of the knowledge-based
systems developed up to now is the need to specify actions to be taken in all possible situations the agent may find itself in.
The knowledge-based reactive systems able to construct by themselves an internal model
of the environment have only been applied to simple tasks (Rivest & Schapire, 1987). Mozer
and Bachrach (1989) have implemented Rivest and Schapire's system in a connectionist
network, which has been shown to perform better than the original symbolic version in
several respects.
Brooks' approach (1986) is a novel way of building reactive systems. Its originality is
based on the subsumption architecture and also on hard-wiring sensors to effectors rather
than representing the rules. This approach does not overcome the above-mentioned limitation, but it emphasizes a key issue: reactive behavior should involve parallel distributed
processing.
(Figure: the model-network approach to the ARL problem, in which an action network maps the stimulus X(t) to the action Y(t) and a model network predicts the environmental reinforcement z(t + 1) returned by the environment.)
network, the model network, to identify the function that relates the agent's output to
the subsequent reinforcement received from the environment. Once this function is encoded
in the model network, the weights of this network are frozen and the supervised learning
process shifts to the action network, i.e., the agent. The gradient of the reinforcement signal
with regard to the agent's output, ∂z/∂Y, required for training the action network is estimated
by back-propagating derivatives through the model network. This approach has been suggested by Werbos (1987), but the only implementations are those of Munro (1987), Robinson
(1989), and Jordan and Jacobs (1990). Although ingenious and analytically well-defined, this
approach has the important limitation of assuming that the model has been learned with
enough accuracy to allow a good approximation of the reinforcement gradients to be computed.
A simpler approach to the ARL problem is to approximate ∂z/∂Y by its sign. This idea
has been proposed, and successfully applied, by Saerens and Socquet (1989) in the field
of control. The signs of the partial derivatives are obtained from qualitative knowledge
about the direction in which the action components must be modified to increase the reinforcement signal.
Finally, the most common approach to the ARL problem is to estimate ∂z/∂Y by measuring
the correlation between variations in actions and variations in reinforcement. This learning
paradigm, known as reinforcement learning, relies on performing different actions in response to each stimulus, observing the resultant reinforcement, and incorporating the best
action into the situation-action rules of the system. Several reinforcement systems have been
developed for solving nontrivial problems (Barto, et aI., 1983; Anderson, 1987; Chapman
& Kaelbling, 1990; Lin, 1990; Mahadevan & Connell, 1990). Anderson (1987) illustrates
the kind of situation-action rules learned by a particular reinforcement system.
The two main limitations of the basic reinforcement learning algorithm are a possibly
long learning phase and the inability to capture complex features of the problem. To palliate
these limitations, several extensions to the basic algorithm have been proposed, such as
adding an action model for relaxation planning (Lin, 1990; Sutton, 1990) and combining
several modules, each one specialized in solving a particular primitive task (Singh, 1991).
One contribution of this paper is in this direction, since we show that adding a stabilizing
strategy to the learning algorithm and incorporating domain-specific knowledge in the codification scheme permits avoiding the above limitations.
Note also that the reinforcement learning approach is open to improvement through the
incorporation of supervised learning methods. One classic example of this is the introduction of a critic element to predict the amount of reinforcement that would follow a given
action (Barto, et aI., 1983; Sutton, 1984). Anderson (1986) also used a supervised learning
rule for updating the hidden layers of a reinforcement-based system.
dist[conf_start, conf_goal] < goal_ε.   (1)
To formulate the above instance of the path finding problem as an ARL problem, we
need to specify what the input, output and reinforcement signal are.
where k_att is a constant. The direction of the attraction force is that of the SPY, but since
the output of the path-finder is computed with respect to this direction (see the next subsection), it need not be included in the input to the system.
Each repulsion force represents the resistance of an obstacle to the fact that the robot
follow the SPY. Each such force follows the bisector of the quadrant where the obstacle
lies and heads towards the opposite quadrant. For example, in the upper situation in Figure
2, an obstacle is to the southeast of the current configuration and its repulsion force is
northwestward. Because the directions of the repulsion forces are specified with respect
to the direction of the attraction force, they are implicit in the codification and therefore are
not included in the input to the system. Note also that this specification permits grouping
all the repulsion forces into four resulting signals, each being the magnitude of a vector
starting at the current configuration and following the bisector of a quadrant.
The intensity of each single repulsion force depends on three factors: (i) the shortest
distance between the obstacle and the SPY, r_a; (ii) the distance from the current configuration of the robot to the point of conflict (i.e., the point of the SPY nearest to the obstacle), r_b; and (iii) in the case that the SPY intersects the obstacle, the shortest translation of the
obstacle perpendicular to the SPY that leads to nonintersection, r_c. If for a configuration
and an obstacle several points of conflict exist, then the one nearest to the current robot
configuration is taken. Figure 2 illustrates these factors for two different situations and
the resulting repulsion forces.
The first factor is aimed at avoiding obstacles in the proximity of the SPY. The second
allows obstacles near the robot's current configuration to be avoided. The third ensures that,
in the case that the SPY intersects an obstacle, the next robot movement is the more distant
from the SPY, the deeper the penetration into the obstacle.
In sum, the number of input signals to the path-finder is independent of the number of
obstacles in the environment, and these signals are five: the intensity of the attraction force,
i_a, and the intensity of the environmental repulsion from each quadrant, r_1, r_2, r_3 and r_4,
quadrants being numbered counterclockwise. The detailed way in which the repulsion signals
are computed is described in Appendix A. Since the input signals are factors in the learning equations of the system (see Section 4.4), signals with larger magnitudes would have
a greater influence on learning than would other signals. To remove this bias all input signals
are scaled to lie within the same interval, and are thus codified as real numbers in [0, 1].
It is worth noting that one of the aims of adopting the above codification scheme has
been to favor the generalization abilities of the path-finder. Since the goal is codified in
the input information and the input is independent of the environment used during the learning phase, the knowledge acquired for a situation should be transferable to a different one.
3.2. Output information
The output of the path-finder represents the step taken by the robot and it is codified as
a move in relative cartesian coordinates with regard to the SPY. That is, the positive x
axis is the SPY. Both increments, Δy and Δx, are real values in the interval [−1, 1].
The output signals determine a direction and a length. Nevertheless, the actual length
of the move made by the robot is computed by the following expression:

actual_length = length * radius,   if length * radius > step_r,   (3)
actual_length = step_r,            otherwise.                      (4)
The motivation of this postprocessing of the output signals is twofold. Firstly, because
the robot concentrates on its neighboring obstacles, the maximum distance the robot can
cover in a single step should be limited by the perception range. Otherwise, the robot could
collide with obstacles "not completely" perceived. Secondly, the path-finder is intended
to be a reactive system, so it should react to each stimulus with some action; step_r represents the minimum action.
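A direct transcription of this postprocessing, using the names in equations (3) and (4), is sketched below. The numeric defaults for the perception radius and step_r are placeholders of my own, not the values used in the paper, and the handling of a zero-length move is my own choice.

```python
import math

def postprocess(dx, dy, radius=2.0, step_r=0.01):
    """Scale the network's move (dx, dy) in [-1, 1]^2 to an actual step.

    radius is the perception range and step_r the minimum action; the
    numeric values here are placeholders.
    """
    length = math.hypot(dx, dy)
    if length * radius > step_r:
        actual_length = length * radius      # equation (3)
    else:
        actual_length = step_r               # equation (4)
    # Keep the direction chosen by the network, rescale its length.
    if length > 0:
        return dx / length * actual_length, dy / length * actual_length
    return actual_length, 0.0                # arbitrary direction for a zero move

print(postprocess(0.3, -0.4))
```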
Two important consequences of the manner in which the output signals are codified and
postprocessed are that the space of attainable configurations is continuous and that there
is no predetermination whatsoever of the direction and length of each step, with the only
constraint imposed by the perception range. A final benefit is that the output postprocessing,
in combination with the input codification that supports the generalization capabilities and
the reactive nature of the path-finder, offers the robot the possibility of coping with dynamic
environments. The step the robot takes is mainly aimed at avoiding neighboring obstacles
and it never goes beyond the perception range, therefore the probability of colliding with
a mobile obstacle is low.
The reinforcement signal, z, is a real number in [−1, 1]. It equals 1 when the goal configuration is reached and it equals −1 when a collision with an obstacle or a wall happens:

z = 1,              if a = 1,
    −r,             if r > 1 − rep_f,
    a − k_ar * r,   otherwise.        (6)
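Equation (6) translates directly into code. In the sketch below, a and r stand for the attraction and repulsion quantities that enter equation (6), and the default values of k_ar and rep_f are placeholders rather than the paper's constants.

```python
def reinforcement(a, r, k_ar=0.75, rep_f=0.1):
    """Environmental reinforcement z of equation (6).

    a: attraction quantity (a == 1 means the goal has been reached).
    r: repulsion quantity (values near 1 mean a collision).
    The constant values are placeholders.
    """
    if a == 1:
        return 1.0
    if r > 1 - rep_f:
        return -r
    return a - k_ar * r

print(reinforcement(1, 0.0), reinforcement(0.4, 0.95), reinforcement(0.4, 0.2))
```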
Table 1. Actual values of the constants (coefficients and thresholds) used to compute the different signals.

Name:  k_pa, perception range, k_ra, k_rb, k_rc, k_rep, step_r, k_ar, rep_f
Value: 0.05, 0.1, 100.0, 2.0, 0.2, 0.1, 5.0, 1.0, 0.01, 0.75, 0.1
Finally, the way in which the reinforcement signal is computed allows to reduce the complexity of the task to be solved. In the robot path finding problem, the consequences of
an action can emerge later in time. Thus, actions must be selected based on both their
short- and long-term consequences. Since the reinforcement signal is computed using global
information-i.e., it is based not on the current robot configuration but on the SPY-the
path-finder gets a measure of the short- and long-term consequences of an action only one
time-step after executing it. Thus the task is reduced to learning, for each stimulus, to perform
the action which maximizes the reinforcement signal.²
The determination of both the input and the reinforcement signals is reminiscent of the
potential field approach to path finding (Khatib, 1986).
Table 1 summarizes all the constants-coefficients and thresholds-and their actual values
used to compute the different signals.
Of the two kinds of reinforcement learning algorithms proposed to solve the ARL problem, namely associative reward-penalty, A_{R-P} (Barto & Anandan, 1985), and associative
search, AS (Barto, et al., 1981; Sutton, 1984; Williams, 1986, 1987), we adopt the second
one because it fits better the specifications of the output and reinforcement signals we have.
The A_{R-P} algorithm has been mainly designed for instances of the ARL problem where
the action network-the step generator in our case-produces binary output signals and
where the reinforcement has only two possible values, namely success or failure.
A central issue for any reinforcement system is to explore alternative actions for the same
stimulus. Stochastic units provide this source of variation. Thus, the step generator-or,
at least, its output layer-is built out of this kind of unit. The different architectures of
the step generator (i.e., number of hidden layers, kinds of hidden units, and connectivity) tested during the simulations will be described in the next section. Nevertheless, the stochastic behavior of the path-finder should tend to be deterministic with learning. Otherwise,
the path-finder could not generate stable solution paths after it eventually discovers them.
4.1. The basic AS algorithm
Continuous stochastic units compute their output in three steps (Gullapalli, 1988), as depicted
in Figure 4 and expressed in Equations (7) through (10).
Since the signals we are interested in are continuous, a separate control of the location
being sought (mean) and the breadth of the search around that location (variance) is needed.
The first step is to determine the value of these parameters. The mean μ_i should be an estimate of the optimal output. A simple way is to let μ_i(t) equal a weighted sum of the inputs
s_j(t) to the unit i plus a threshold θ_i(t):

μ_i(t) = Σ_{j=1}^{n} w_ij(t) s_j(t) + θ_i(t).   (7)
(Figure 4. The ith stochastic unit: its inputs s_1, ..., s_n and extra inputs are combined through the weights w_i1, ..., w_in, a random number generator provides the stochastic activation, and an output function produces the unit's output.)
The standard deviation σ should be small if the expected output of the step generator is
close to the optimal one, and it should be high in the opposite case. Since the heuristic reinforcement signal provides this comparative measure of output goodness, σ(t) should depend
on the expected heuristic reinforcement, ĥ(t):

σ(t) = g(ĥ(t)),   (8)

where the function g will be made explicit in Section 4.3, once the different ways of computing the heuristic reinforcement signal have been presented.
As a second step, the unit calculates its activation level a_i(t), which is a normally distributed random variable:

a_i(t) ~ N(μ_i(t), σ(t)).   (9)

Finally, the output s_i(t) of the unit is obtained by passing its activation through a logistic function:

s_i(t) = f(a_i(t)) = 1 / (1 + e^{−a_i(t)}).   (10)
The weights of the unit are then updated according to

Δw_ij(t) = α h(t) e_ij(t),   (11)

where α is the learning rate and e_ij is the eligibility factor of w_ij. The eligibility factor
of a given weight measures how influential that weight was in choosing the action. We
have explored twenty-five versions of this general rule, each particular version differing
from the others in the way h and e_ij are calculated. These versions result from combining
the five heuristic reinforcement signals with the five eligibility factors which will be presented in Sections 4.3 and 4.4.
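Putting equations (7) through (11) together, a single continuous stochastic output unit and its associative-search update can be sketched as follows. The eligibility rule shown is the one of equation (23), the standard deviation is passed in directly rather than computed by g, and the heuristic reinforcement is taken to be the difference between the environmental reinforcement and a baseline, as Section 4.3 suggests; all numbers and names are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

class StochasticUnit:
    def __init__(self, n_inputs):
        self.w = np.zeros(n_inputs)
        self.theta = 0.0

    def act(self, s, sigma):
        """Equations (7), (9) and (10): mean, Gaussian activation, logistic output."""
        mu = self.w @ s + self.theta                      # (7)
        a = rng.normal(mu, sigma)                         # (9)
        y = 1.0 / (1.0 + np.exp(-a))                      # (10)
        return mu, a, y

    def update(self, s, mu, a, sigma, h, alpha=0.1):
        """Equation (11) with the eligibility factor of equation (23)."""
        e = s * (a - mu) / sigma ** 2                     # (23)
        self.w += alpha * h * e
        # The threshold is treated as a weight on a constant input of 1.
        self.theta += alpha * h * (a - mu) / sigma ** 2

# One interaction: stimulus, action, environmental reinforcement z, baseline b.
unit = StochasticUnit(n_inputs=5)
s = rng.uniform(0.0, 1.0, size=5)
sigma = 0.3                                               # in the paper, sigma = g(h_hat), eq. (8)
mu, a, y = unit.act(s, sigma)
z, b = 0.2, 0.0                                           # made-up reinforcement and baseline
h = z - b                                                 # heuristic reinforcement
unit.update(s, mu, a, sigma, h)
print(y, unit.w)
```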
(Conditions defining when a path qualifies as an acc-path and as a q0-path, expressed in terms of dist[conf_start, conf_goal].)
The path-finder is not allowed to alternate the deterministic and stochastic phases more than a fixed number of times.
If this limit is reached, the step generator and the critic are reinitialized.
Figure 5 shows the final algorithm for discovering a stable qo-path.
The critic receives the input pattern X(t) and predicts the reinforcement baseline b(t) with
which to compare the associated environment reinforcement signal z(t + I).
A first alternative for the reinforcement baseline is to make it a constant-zero in the
simulations presented here. This is equivalent to shutting the critic off. One has to expect
a long learning phase before obtaining a q0-path, since the robot may receive a positive
environmental reinforcement when moving away from the goal, even in cases where it could have approached it.
In order to avoid this shortcoming, the environmental reinforcement received at the current step should be compared with that received at previous steps. Two possibilities for
making this comparison arise: short-term, and long-term. In the first case, the reinforcement baseline is the environmental reinforcement received at the preceding time step. In
the second case, it is a trace of all the environmental reinforcement received by the pathfinder. There are, however, reasons to doubt that these techniques would work well on
the task at hand. The main one is that the best move the robot can make in a certain
situation may correspond to an environmental reinforcement lower than the preceding one.
Δv_ij(t) = β [z(t + 1) − ẑ(t)] x_j(t) + η Δv_ij(t − 1),   (15)

where V is the weight matrix of the critic, β is its learning rate, η is the momentum factor,
and ẑ(t) is the output of the critic, i.e., the expected environmental reinforcement signal.
The critic has to be updated after each path-finder/environment interaction because as
the step generator improves, the mapping from stimuli to reinforcement changes.
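The critic of equation (15) is just a linear predictor trained by the delta rule with a momentum term. A minimal sketch, with invented dimensions and placeholder constants, is:

```python
import numpy as np

class Critic:
    def __init__(self, n_inputs, beta=0.4, eta=0.5):
        self.v = np.zeros(n_inputs)
        self.prev_dv = np.zeros(n_inputs)
        self.beta, self.eta = beta, eta          # learning rate and momentum factor

    def predict(self, x):
        return self.v @ x                        # expected environmental reinforcement

    def update(self, x, z_next):
        """Equation (15): delta rule on (z(t+1) - z_hat(t)) plus momentum."""
        error = z_next - self.predict(x)
        dv = self.beta * error * x + self.eta * self.prev_dv
        self.v += dv
        self.prev_dv = dv

critic = Critic(n_inputs=5)
x = np.array([0.2, 0.9, 0.1, 0.0, 0.3])
critic.update(x, z_next=-0.4)
print(critic.predict(x))
```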
We have found that using a large learning rate for the critic and a small learning rate
α for the step generator accelerates the learning process. The rationale is the following.
By allowing the critic to adapt more rapidly than the step generator, the critic has the opportunity to predict acceptably the next environmental reinforcement signal associated with the
current stimulus, as the step generator is, during a certain period of time, almost stable.
Then, the step generator can take full advantage of the reinforcement baseline.
A potential benefit of the momentum term when applied to dynamic tasks-i.e., tasks
where the training information changes over time-like that faced by the critic is that it
prevents V from oscillating dramatically. Since, at the beginning of the learning phase,
the step generator explores a wide range of alternative actions, to each stimulus perceived
by the path-finder will correspond a large variety of environmental reinforcement signals.
Consequently, the error between the actual and the predicted environmental reinforcement
is not only due to a malfunction of the critic, but also to the "chaotic" mapping the critic
is dealing with. So, there is the risk that the weight modifications made at a certain time
change V too much. The momentum tackles this problem by reducing the intensity of opposite weight modifications at consecutive times.
A second advantage of using a predicted-comparison reinforcement baseline is to mitigate
the possibility of overlearning. Since the goal of the step generator is to produce the optimal action in the presence of every stimulus, learning should stop when the step generator
has discovered the suitable situation-action rules. A predicted-comparison mechanism accomplishes this effect because the expected and the actual environmental reinforcement
signals tend to be the actual and the optimal ones, respectively, as the critic and the step
generator improve.
Predicted comparison, however, may not cope adequately with collisions-a frequent
situation during the learning phase. The reason is that when a collision happens, one would
like to punish severely the step generator. This is satisfied if no baseline is used, whereas
predicted comparison fails when the prediction is close to the environmental reinforcement
signal. Consequently, the performance of the path-finder should improve if the reinforcement baseline were computed as a heuristic predicted comparison:

z̃(t) = 0,      if z(t − 1) = −1,
       ẑ(t),   otherwise.        (16)
The following expression summarizes the five alternative ways used to calculate the reinforcement baseline:

b(t) = 0,
       z(t),
       b̄(t),
       ẑ(t),
       z̃(t),        (17)
where

b̄(t) = λ z(t) + [1 − λ] b̄(t − 1).   (18)

The function g of equation (8) is computed as

σ(t) = k_σ * h̄(t),   (19)
where k_σ is a constant and h̄(t) is a trace of the absolute value of the past heuristic reinforcement received:

h̄(t) = ζ * |h(t)| + [1 − ζ] h̄(t − 1),   (20)
ζ being a constant in [0, 1]. Gullapalli proposes the following expression to compute σ:

σ(t) = k_σ * [1 − ẑ(t)],   (19′)
where ẑ(t) is the expected environmental reinforcement. Both expressions are equivalent
only in the case that the highest environmental reinforcement is received for every pair
(stimulus, optimal action). This condition does not hold in the path finding problem, where
the highest environmental reinforcement is only received when the goal is reached.
In the case that no reinforcement baseline is used, the heuristic reinforcement is the environmental one. This means that (19) provides a standard deviation that is not the desired
one when the expected output is good, as σ will be high when it should be small. For
this particular case, (19′) will be used.
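The five reinforcement baselines of equation (17), together with the two ways (19) and (19′) of setting the standard deviation, amount to a few lines of bookkeeping. The sketch below is an illustrative transcription with placeholder constants.

```python
class Baselines:
    """The five alternatives of equation (17), kept up to date from z(t) and z_hat(t)."""

    def __init__(self, lam=0.5):
        self.lam = lam
        self.prev_z = 0.0        # z(t): short-term comparison
        self.trace = 0.0         # b_bar(t): long-term comparison, equation (18)

    def all_five(self, z_hat, prev_z_was_collision):
        heuristic_pc = 0.0 if prev_z_was_collision else z_hat    # equation (16)
        return [0.0, self.prev_z, self.trace, z_hat, heuristic_pc]

    def observe(self, z):
        self.trace = self.lam * z + (1 - self.lam) * self.trace  # equation (18)
        self.prev_z = z

def sigma_eq19(h_trace, k_sigma=0.2):
    return k_sigma * h_trace                  # equation (19)

def sigma_eq19prime(z_hat, k_sigma=0.2):
    return k_sigma * (1.0 - z_hat)            # equation (19')

b = Baselines()
b.observe(0.3)
print(b.all_five(z_hat=0.25, prev_z_was_collision=False))
print(sigma_eq19(0.4), sigma_eq19prime(0.25))
```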
The eligibility factor of a weight measures the contribution of that weight to the action
taken. We have used two kinds of rules to compute this contribution. The first measures
the correlation between the outputs of the units linked by the connection under consideration. Two ways of implementing this intensity-based mechanism are the Hebbian rule and
a trace of it.
The second kind of rule consists of a discrepancy-based mechanism which evaluates the
difference between the actual output and the expected one:

e_ij(t) = s_j(t) [s_i(t) − f(μ_i(t))],   (21)
where f(·) is the same function as in (10). A slightly different implementation proposed
by Gullapalli (1988) is the following:

e_ij(t) = s_j(t) [a_i(t) − μ_i(t)] / σ(t).   (22)
Finally, a third discrepancy-based rule makes the learning algorithm perform gradient
ascent on the expected environmental reinforcement, as proved by Williams (1986, 1987):

e_ij(t) = ∂ ln N / ∂w_ij = s_j(t) [a_i(t) − μ_i(t)] / σ²(t).   (23)
In sum, the eligibility factor is computed in one of the following five ways:

e_ij(t) = s_i(t) s_j(t),
e_ij(t) = ē_ij(t),
e_ij(t) = s_j(t) [s_i(t) − f(μ_i(t))],
e_ij(t) = s_j(t) [a_i(t) − μ_i(t)] / σ(t),
e_ij(t) = s_j(t) [a_i(t) − μ_i(t)] / σ²(t),   (24)

where

ē_ij(t) = γ s_j(t) s_i(t) + [1 − γ] ē_ij(t − 1).   (25)
Note that only the first two rules, i.e., those corresponding to the intensity-based mechanism, are applicable when deterministic units are in use. The other three eligibility factors
depend on the stochastic behavior of the units.
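For completeness, the five eligibility factors of equation (24) can be written as one function; the trace of the Hebbian rule (equation (25)) needs the previous trace value as an extra argument. This is an illustrative transcription, with the unit's output written s_i as in the text and placeholder argument values.

```python
def eligibility(rule, s_j, s_i, f_mu_i, a_i, mu_i, sigma, prev_trace=0.0, gamma=0.2):
    """The five eligibility factors of equation (24); rule is 1..5."""
    if rule == 1:
        return s_i * s_j                                     # Hebbian
    if rule == 2:
        return gamma * s_j * s_i + (1 - gamma) * prev_trace  # trace, equation (25)
    if rule == 3:
        return s_j * (s_i - f_mu_i)                          # discrepancy, equation (21)
    if rule == 4:
        return s_j * (a_i - mu_i) / sigma                    # Gullapalli, equation (22)
    if rule == 5:
        return s_j * (a_i - mu_i) / sigma ** 2               # Williams, equation (23)
    raise ValueError("rule must be between 1 and 5")

print([eligibility(r, s_j=0.7, s_i=0.6, f_mu_i=0.55, a_i=0.2, mu_i=0.1, sigma=0.3)
       for r in range(1, 6)])
```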
5. Experimental results
The system has been implemented in Common Lisp on a VAX station 2000. The feasibility
study has been carried out in two main steps. The first step is an experimental comparison
of the twenty-five versions of the general AS algorithm described in the preceding section.
The objective of this comparative study is to identify the version that best suits the robot
path finding problem. Then, the most promising version is used to try to build a path-finder with powerful generalization abilities so as to produce, after a limited learning process, q0-paths in any non-maze-like workspace and to cope with dynamic environments.
The workspace depicted in Figure 6 has been employed in the simulations. There exist
three obstacles-shown as circles-and the goal configuration is represented by a square.
The workspace is a room of 10 * 10 units. The centers of the obstacles are located at (3.0,
6.0), (6.0, 6.0) and (6.0, 3.0), and their radii are equal to 0.75 units. The goal configuration
is located at (8.0, 8.0).
In the simulations, a path is considered to end either at the goal or with a collision.
A path could be alternatively defined to finish only when the goal is reached. This second
definition, however, presents the following shortcoming. During the learning phase, the
path-finder needs to be sufficiently stimulated-i.e., the training data should be both sufficiently varied to reveal the reinforcement function and sufficiently repetitive to make learning possible-to discover the set of suitable situation-action rules. But, if the robot is allowed
to leave the collision situations by itself, then the proportion of inputs corresponding to
a collision with respect to all the other kinds of inputs is very high. So, the training data
is biased to a particular kind of stimuli, which makes learning harder.
A desirable feature for any system is its robustness to changes in the values of the parameters governing it. Thus, no attempt has been made to search for the best set of parameter
Figure 6. Behavior of the first path-finder.
values. Instead, we have chosen values that intuitively seem adequate for all the versions
of the general AS rule. The values of the constants used in the learning rules appear in
Table 2.
In the implementation of the strategy described in Section 4.2, the deterministic phase
is finished when either a single path reaches 500 steps or 500 paths are generated without
finding a q0-path.
Table 2. Actual values of the factors used in the learning rules.

Name:  β, α, η, λ, γ, k_q, k_acc_length, k_acc_clear, k_q0_length, k_q0_clear
Value: 0.4, 0.125, 0.5, 0.5, 0.2, 0.2, 2.0, 0.75, 1.5, 0.15, 1.15, 0.2
Table 3. Number of steps taken by the robot to find a stable q0-path, for each combination of reinforcement baseline (rows 1-5) and eligibility factor (columns 1-5).

Baseline \ Eligibility      1        2        3        4        5
1                        34939    46122    15559    14527    10457
2                        59599    66791    48753    24810    32332
3                        31870    52202    28379    17494    27645
4                        25997    27999    12675    13839     8473
5                        21362    22840    16301    11789    17907
Finally, of the three discrepancy-based rules, none outperforms the others for all choices
of reinforcement baseline. However, the fifth eligibility factor leads to the two best versions
of the algorithm.
5.1.2. Reinforcement baseline
The influence of the reinforcement baseline on the performance of the path-finder is not
so clear and does not agree so exactly with the theoretical expectations as the influence
of the eligibility factor.
As expected, short-term and long-term comparisons-second and third rows of the table,
respectively-are not suitable for solving our formulation of the path finding problem. In
addition, the long-term versions work better than the short-term ones.
Surprisingly, the performance of the versions adopting a null reinforcement baseline (first row) is very similar to the performance of the versions using predicted comparison
as reinforcement baseline-fourth row-when combined with discrepancy-based eligibility
factors. Nevertheless, the former never outperforms the latter. An explanation for this fact
could be that a null reinforcement baseline deals better with collisions than predicted comparison, as we hypothesized in Section 4.3. However, the performance of the versions using
heuristic predicted comparison-fifth row-which combines a null reinforcement baseline
and predicted comparison, does not support this hypothesis, since it is considerably worse
than the performance of plain predicted comparison when two of the three best eligibility
factors are used-third and fifth columns.
So, it seems that predicted comparison is the best kind of reinforcement baseline.
5.1.3. Conclusions
Since the performance of the versions combining the first, fourth and fifth types of reinforcement baseline-i.e., null, predicted comparison and heuristic predicted comparison,
respectively-with the third, fourth and fifth types of eligibility factors-i.e., the three
discrepancy-based rules-are all quite similar, we will look at some other information to
better discriminate which of these nine versions is the best.
One of these criteria is the average number of times that the path-finder has been reinitialized until a stable q0-path is found. This information is implicit in Table 3, as all the steps
taken by the robot for finding a stable q0-path are recorded. Nevertheless, this information
provides a picture of the sensitivity of each version to the initial conditions. Clearly, the
less a version depends on its initial conditions, the more it should be preferred. Table 4
shows the average number of reinitializations of the nine most promising versions identified before.
Table 4 confirms the ranking we have outlined in the preceding subsections. The reinforcement baseline and the eligibility factor most suitable for our formulation of the robot
path finding problem are predicted comparison and the discrepancy-based rule proposed
by Williams (1986), respectively. This version, however, is not far better than some other
versions.
Appendix B provides an analysis of variance for the most promising versions that supports these conclusions.
Table 4. Average number of reinitializations for the nine most promising versions (rows: eligibility factors 3-5; columns: reinforcement baselines 1, 4 and 5).

Eligibility \ Baseline     1      4      5
3                         1.5    1.4    1.6
4                         1.4    1.1    1.4
5                         1.3    0.9    1.5
where α_h is the learning rate of the hidden units, k ranges over the input units, j over the hidden units, and i over the output units.
We have compared these two principles on the generation of q0-paths for a pair of symmetrical initial configurations that have obstacles in between them and the goal. In the
simulations, α_h = 0.125, therefore the learning rate is the same for all the units of the step
generator. For each principle, five q0-paths have been generated. The average number of
steps taken for networks adopting each principle appear in Table 5.
The computer experiments confirm that the performance of the MF principle is better
than that of the EH principle. Nevertheless, the difference is negligible, perhaps because
the task is not sufficiently difficult and, above all, because the system does not begin from
scratch but with the "approximate" desired knowledge.
The most important result of this comparative study is illustrated in Figure 8. Panel A
and panel B depict the behavior of two of the path-finders generated, one using the EH
principle and the other the MF principle. Similar behaviors have been observed for other
path-finders using these principles. The unexpected feature is that the EH principle is able
to modify correctly all the situation-action rules already discovered, while the MF principle is not.
Table 5. Average number of steps required by each principle.

Principle    Steps
EH           3487
MF           3307
Figure 8. Behavior of two path-finders adopting each principle.
Figure 9 shows the behavior of a pure reinforcement path-finder (EH principle) that is
able to produce q0-paths from almost all the initial configurations in the training set. The
number of steps required to reach this state of the path-finder has been 77315. The only
situations not properly handled by the path-finder are a subset of the most difficult ones,
that is those in which an obstacle lies in between the goal and the current robot configuration and is very close to the latter. These situations may be handled by getting appropriate
guidance from a symbolic path-planner (Millán & Torras, 1991b).
It is worth noting that no two-layered step generator trained in one phase succeeded.
the number of obstacles have changed. The results of these three experiments show that
the situation-action rules learned are both goal-independent and workspace-independent
and that, even in the worst cases, only a "light" additional learning phase suffices to readapt
the path-finder to the new workspace.
Finally, Figure 14 shows how the path-finder copes with dynamic environments. If the
robot has taken one or more steps toward the goal and either the obstacles (panel A) or
the goal (panel B) move, the path-finder is still able to generate feasible paths. In the first
case, the obstacles are moving toward the northwest, therefore approaching the goal, and
Figure 13. Generalization abilities: New goal and more obstacles.
the path-finder avoids them. In the second case, the goal is moving toward the northeast
and the path-finder tracks it. This ability could be enhanced if a module to predict the
motion of the goal and the obstacles were incorporated into the system.
6. Conclusions and future work
The simulations carried out in this paper demonstrate the adequacy of a reinforcement connectionist learning approach to implement local obstacle-avoidance capabilities. The formulation of the problem used to test this approach is a difficult one, since the input and output
are continuous, the environment is partially unknown, and the optimality criterion (finding
short paths with wide clearances) is severe. The problem, however, has been simplified
by assuming a point robot and circular obstacles. These simplifications are dropped in the
extended prototype we are currently working on, which deals with a 2D mobile robot and
polygonal obstacles.
The codification scheme adopted and the algorithm used to discover q0-paths can be
thought of as domain-specific heuristics (Langley, 1985) for the robot path finding problem,
which greatly improve the basic reinforcement-based weak search method. Equipped with
these two modules, the path-finder not only learns the necessary situation-action rules in
a very reduced time, but also exhibits good noise-tolerance and generalization capabilities,
and is able to cope with dynamic environments.
In the Introduction we claimed that the robot path finding problem could be solved efficiently if symbolic planning were interfaced to subsymbolic obstacle avoidance. Our current
research is oriented towards designing a hybrid path-finder by coupling a geometric global
planning approach such as that in Ilari and Torras (1990) with the connectionist local obstacle
avoidance approach described in this paper.
In Millán and Torras (1991b) we illustrate with a very simple example the potential benefits
of integrating symbolic and subsymbolic techniques according to this general framework.
A symbolic path-planner suggests intermediate configurations, subgoals or landmarks, that the path has to go through. The path-planner is invoked both at the beginning of the
task and whenever the course of action seems to be wrong. This happens when the path-finder cannot find a suitable action to handle the current situation or when the robot deviates
considerably from the planned physical path.
Acknowledgments
We thank Rich Sutton whose suggestions have been really valuable for improving the presentation of our work, and Aristide Varfis for helpful discussions. We would like also to express
our gratitude to two of the reviewers for their detailed comments. Carme Torras has been
partially supported by the ESPRIT Basic Research Action number 3234.
Notes
1. Although the notion of configuration space is that of a "state space" and thus is quite old, the term itself
was first introduced by Lozano-Perez and Wesley (1979) and subsequently developed in Lozano-Perez (1983).
Roughly speaking, it is the space of degrees of freedom of motion in which a given motion planning problem
is to be solved.
2. If the reinforcement signal were computed using local information, then the task would be to select actions
that optimize the cumulative reinforcement received by the path-finder over time. A prediction problem would
have to be solved, temporal-difference methods (Sutton, 1988; Barto, et al., 1989; Watkins, 1989) having proved
to be useful for this kind of task.
References
Agre, P.E., & Chapman, D. (1987). Pengi: An implementation of a theory of activity. Proceedings of the Seventh
AAAI Conference (pp. 268-272).
Anderson, C.W. (1986). Learning and problem solving with multilayer connectionist systems. Ph.D. Thesis, Dept.
of Computer and Information Science, University of Massachusetts, Amherst.
Anderson, C.W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth
International Workshop on Machine Learning (pp. 103-114).
Arkins, R.C. (1987). Motor schema based navigation for a mobile robot: An approach to programming by behavior.
Proceedings of the IEEE International Conference on Robotics and Automation (pp. 264-271).
Barto, A.G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human
Neurobiology, 4, 229-256.
Barto, A.G., & Anandan, P. (1985). Pattern-recognizing stochastic learning automata. IEEE Transactions on Sysrems,
Man, and Cybernetics, 15, 360-374.
Barto, A.G., Sutton, R.S., & Anderson, C.W (1983). Neuronlike elements that can solve difficultleaming control
problems. IEEE Transactions on Sysrems, Man, and Cybernetics, 13, 835-846.
Barto, A.G., Sutton, R.S., & Brouwer, P.S. (1981). Associative search network: A reinforcement learning associative
memory. Biological Cybernetics, 40, 201-211.
Barto, A.G., Sutton, R.S., & Watkins, C.J.C.H. (1989). Learning and sequential decision making (Technical Report
COINS-89-95). University of Massachusetts, Amherst, MA: Dept. of Computer and Information Science.
Blythe, J., & Mitchell, T.M. (1989). On becoming reactive. Proceedings of the Sixth International Workshop
on Machine Learning (pp. 255-259).
Brady, M., Hollerbach, J.M., Johnson, T.L., Lozano-Perez, T., & Mason, M.T. (Eds.) (1982). Robot motion:
Planning and control. Cambridge, MA: MIT Press.
Brooks, R.A. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2, 14-23.
Canny, J.F. (1988). The complexity of robot motion planning. Cambridge, MA: MIT Press.
Chapman, D., & Kaelbling, L.P. (1990). Learning from delayed reinforcement in a complex domain (Technical
Report 90-11). Palo Alto, CA: Teleos Research.
Donald, B.R. (1987). A search algorithm for robot motion planning with six degrees of freedom. Artificial Intelligence, 31, 295-353.
Graf, D.H., & LaLonde, W.R. (1988). A neural controller for collision-free movement of general robot manipulators.
Proceedings of the IEEE Second International Conference on Neural Networks, Vol. I (pp. 77-84).
Gullapalli, V. (1988). A stochastic algorithm for learning real-valued functions via reinforcement feedback (Technical
Report COINS-88-91). University of Massachusetts, Amherst, MA: Dept. of Computer and Information Science.
Ilari, J., & Torras, C. (1990). 2D path planning: A configuration space heuristic approach. The International
Journal of Robotics Research, 9, 75-91.
Jordan, M.I., & Jacobs, R.A. (1990). Learning to control an unstable system with forward modeling. In D.S.
Touretzky (Ed.), Advances in neural information processing systems 2, 324-331. San Mateo, CA: Morgan
Kaufmann.
Jorgensen, C.C. (1987). Neural network representation of sensor graphs in autonomous robot navigation. Proceedings of the IEEE First International Conference on Neural Networks, Vol. IV (pp. 507-515).
Khatib, O. (1986). Real time obstacle avoidance for manipulators and mobile robots. The International Journal
of Robotics Research, 5, 90-98.
Langley, P. (1985). Learning to search: From weak methods to domain-specific heuristics. Cognitive Science,
9, 217-260.
Lin, L.-J. (1990). Self-improving reactive agents: Case studies of reinforcement learning frameworks. Proceedings
of the First International Conference on the Simulation of Adaptive Behavior: From Animals to Animats (pp.
297-305).
Lozano-Perez, T. (1983). Spatial planning: A configuration space approach. IEEE Transactions on Computers,
32, 108-120.
Lozano-Perez, T., & Wesley, M. (1979). An algorithm for planning collision-free paths among polyhedral obstacles.
Communications of the ACM, 22, 560-570.
Mahadevan, S., & Connell, J. (1990). Automatic programming of behavior-based robots using reinforcement learning
(Technical Report RC 16359). Yorktown Heights, NY: IBM, T.J. Watson Research Center.
Mel, B.W. (1989). MURPHY: A neurally-inspired connectionist approach to learning and performance in vision-based robot motion planning. Ph.D. Thesis, Graduate College, University of Illinois, Urbana-Champaign.
Millan, J. del R., & Torras, C. (1990). Reinforcement learning: Discovering stable solutions in the robot path
finding domain. Proceedings of the Ninth European Conference on Artificial Intelligence (pp. 219-221).
Millan, J. del R., & Torras, C. (1991a). Connectionist approaches to robot path finding. In O.M. Omidvar (Ed.),
Progress in neural networks series, Vol. 3. Norwood, NJ: Ablex.
Millan, J. del R., & Torras, C. (1991b). Learning to avoid obstacles through reinforcement. In L. Birnbaum &
G. Collins (Eds.), Machine learning: Proceedings of the Eighth International Workshop, 298-302. San Mateo,
CA: Morgan Kaufmann.
Mozer, M.C., & Bachrach, J. (1989). Discovering the structure of a reactive environment by exploration (Technical
Report CU-CS-451-89). Boulder, CO: University of Colorado, Dept. of Computer Science.
Munro, P. (1987). A dual back-propagation scheme for scalar reward learning. Proceedings of the Ninth Annual
Conference of the Cognitive Science Society (pp. 165-176).
Rivest, R.L., & Schapire, R.E. (1987). A new approach to unsupervised learning in deterministic environments.
Proceedings of the Fourth International Workshop on Machine Learning (pp. 364-375).
Robinson, A.J. (1989). Dynamic error propagation networks. Ph.D. Thesis, Engineering Department, Cambridge
University, Cambridge, England.
Saerens, M., & Soquet, A. (1989). A neural controller. Proceedings of the First IEE International Conference
on Artificial Neural Networks (pp. 211-215).
Schoppers, M.J. (1987). Universal plans for reactive robots in unpredictable environments. Proceedings of the
Tenth International Joint Conference on Artificial Intelligence (pp. 1039-1046).
Singh, S.P. (1991). Transfer of learning across compositions of sequential tasks. In L. Birnbaum & G. Collins
(Eds.), Machine learning: Proceedings of the Eighth International Workshop, 348-352. San Mateo, CA: Morgan
Kaufmann.
Steels, L. (1988). Steps towards common sense. Proceedings of the Eighth European Conference on Artificial
Intelligence (pp. 49-54).
Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. Ph.D. Thesis, Dept. of Computer
and Information Science, University of Massachusetts, Amherst.
Sutton, R.S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9-44.
Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic
programming. Proceedings of the Seventh International Conference on Machine Learning (pp. 216-224).
Torras, C. (1990). Motion planning and control: Symbolic and neural levels of computation. Proceedings of the
Third COGNITIVA Conference (pp. 207-218).
Watkins, C.J.C.H. (1989). Learning from delayed rewards. Ph.D. Thesis, Psychology Department, Cambridge
University, Cambridge, England.
Werbos, P.J. (1987). Building and understanding adaptive systems: A statistical/numerical approach to factory
automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17, 7-20.
Whitesides, S.H. (1985). Computational geometry and motion planning. In G. Toussaint (Ed.), Computational
geometry. Amsterdam, New York, Oxford: North-Holland.
Williams, R.J. (1986). Reinforcement learning in connectionist networks: A mathematical analysis (Technical
Report ICS-8605). San Diego, CA: University of California, Institute for Cognitive Science.
Williams, R.J. (1987). Reinforcement-learning connectionist systems (Technical Report NU-CCS-87-3). Northeastern
University, Boston, MA: College of Computer Science.
Yap, C.-K. (1987). Algorithmic motion planning. In J.T. Schwartz & C.-K. Yap (Eds.), Advances in robotics,
Vol. 1: Algorithmic and geometric aspects of robotics, 95-143. Hillsdale, NJ: Lawrence Erlbaum.
where ρ ∈ [0, 1]. So, each point of conflict (x_c, y_c) is determined by the value ρ_c which
solves the differential equation

    d/dρ dist[(x_0, y_0), (x_c, y_c)] = 0,                                  (29)

and by the additional constraint that it must be outside the corresponding obstacle. This
value is

    ρ_c                                                                     (30)

where

    boxt(x) = { 1,   if x > 1,
              { 0,   if x < 0,
              { x,   otherwise.                                             (31)
                                                                            (33)

                                                                            (34)

where

    { 0,              if x < y,
    { limit(x − y),   otherwise,                                            (35)

    { limit(x),                  if y > z,
    { limit(x − √(z² − y²)),     otherwise,                                 (36)

    box_i(x, y) = { 0,       if x > y,
                  { y − x,   otherwise,                                     (37)

and

    limit(x) = { k_pa · x,   if x > perception range,
               { x,          otherwise.
This limit function is defined in such a way as to keep the repulsion forces very small in
those cases where obstacles are located outside the perception range of the robot (see (39)
below). k_pa and the perception range are constants, k_pa ≥ 1, and the perception range is chosen to be
a fifth of the dimension of the workspace.
Once r, r_b and r_c have been calculated, the intensity of a repulsion force is

                                                                            (39)

                                                                            (40)

where k_rep is a constant and x is the sum of the intensities of the repulsion forces from
quadrant i.
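For concreteness, the following Python sketch implements the two clipping helpers as reconstructed above. The definition of boxt follows equation (31); the branch condition of limit, the direction of its scaling, and the constants K_PA and PERCEPTION_RANGE are assumptions suggested by the surrounding text, not values taken from the paper.

```python
# Sketch of the clipping helpers appearing in the force equations above.
# boxt follows equation (31); the branch condition in limit and the constants
# K_PA and PERCEPTION_RANGE are assumptions, not values taken from the paper.

WORKSPACE_DIMENSION = 1.0
PERCEPTION_RANGE = WORKSPACE_DIMENSION / 5.0  # "a fifth of the dimension of the workspace"
K_PA = 10.0                                   # assumed value of the constant k_pa (k_pa >= 1)


def boxt(x: float) -> float:
    """Clip x to the interval [0, 1], as in equation (31)."""
    if x > 1.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return x


def limit(x: float) -> float:
    """Assumed reading of the limit function: distances beyond the perception range
    are scaled by k_pa so that, through equation (39), the repulsion forces produced
    by obstacles the robot cannot perceive remain very small."""
    if x > PERCEPTION_RANGE:
        return K_PA * x
    return x
```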
Baseline: 15654, 9541, 9145, 11477, 9814, 9318, 5249, 8655, 10948.
Table 7. Significance degrees of the differences between means, μ1 < μ2, where μi is the mean of sample i.

Sample 2 \ Sample 1     V45      V15      V54      V43      V44      V14      V13      V53
V15                   80.23%
V54                   87.08%   70.54%
V43                   92.07%   81.06%   61.03%
V44                   96.25%   90.66%   74.54%   64.43%
V14                   96.64%   91.92%   78.81%   70.54%   57.53%
V13                   95.73%   90.99%   81.59%   75.49%   65.54%   59.10%
V53                   99.65%   99.20%   93.32%   88.49%   78.81%   69.85%   56.75%
V55                   99.84%   99.62%   96.78%   94.18%   88.49%   82.12%   70.19%   68.79%
In order to compute the significance degree of the difference between a pair of sample
means μ1 and μ2, the following discriminant function is built:

    u = (μ1 − μ2) / √(σ1²/n1 + σ2²/n2),                                     (41)

where n1 and n2 are the sizes of the two samples, and σ1 and σ2 are the sample standard deviations. Now, if experimentally μ2 > μ1, then to estimate the significance degree 1 − ε
it is necessary to find u_{1-ε}, in the table of the normal distribution, such that u < −u_{1-ε}.
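The significance degrees of this kind can be computed with a few lines of code. The sketch below assumes that the discriminant function (41) is the usual normal statistic for the difference of two sample means; the sample figures passed to it in the example are made up for illustration only.

```python
# Minimal sketch of the significance computation described above, assuming the
# discriminant function (41) is the usual normal statistic for a difference of
# two sample means; the exact expression used in the paper is not reproduced here.
import math


def significance_degree(mean1, std1, n1, mean2, std2, n2):
    """Return the significance degree 1 - epsilon that mu1 < mu2."""
    # Discriminant (z) statistic for the difference of the two sample means.
    u = (mean1 - mean2) / math.sqrt(std1 ** 2 / n1 + std2 ** 2 / n2)
    # If mu2 > mu1 experimentally, u is negative; the significance degree is the
    # standard normal probability mass above u, i.e. Phi(-u).
    return 0.5 * (1.0 + math.erf(-u / math.sqrt(2.0)))


# Illustrative (made-up) numbers: two samples of 10 runs each.
print(f"{100 * significance_degree(9000, 1500, 10, 11000, 2000, 10):.2f}%")
```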
Data in Table 7 confirm the results of Section 5.1. In addition, it is possible to draw the
following two main conclusions. First, the versions are ranked in the following order: V45
(i.e., the version using the fourth baseline and the fifth eligibility factor), V15, V54, V43, V44, V14,
V13, V53 and V55. Second, V45 is significantly better than V53 and V55 (the corresponding
significance degrees are greater than 99%), quasi-significantly better than V13, V14 and V44
(significance degrees greater than 95%), presumably better than V43 (significance degree
greater than 90%), and likely better than V15 and V54 (significance degrees greater than 80%).
INDEX

A
asynchronous dynamic programming, 55, 117

B
backgammon, 33

C
compositional learning, 99
connectionism, 139
connectionist methods, 33
connectionist networks, 5, 69

F
feature discovery, 33

G
games, 33
gradient descent, 5

M
mathematical analysis, 5
modular architecture, 99

N
neural networks, 33

P
planning, 69

Q
Q-learning, 55

S
stability, 139

T
teaching, 69
temporal difference learning, 33
temporal differences, 55, 117
transfer of learning, 99