Reinforcement Learning
Peter Dayan
Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR
Tel: +44 (0) 20 7679 1175  Fax: +44 (0) 20 7679 1173  Email: dayan@gatsby.ucl.ac.uk

Christopher JCH Watkins
Department of Computer Science, Royal Holloway, University of London, Egham, Surrey TW20 0EX
Tel: +44 (0) 1784 443419  Fax: +44 (0) 1784 439786  Email: chrisw@dcs.rhbnc.ac.uk
Table of Contents
A Formal framework for learning from reinforcement
    Markov decision problems
    Optimization of long term return
    Interaction between an agent and its environment
Dynamic programming as a formal solution
    Policy iteration
    Value iteration
Temporal difference methods as a practical solution
    Actor-critic learning
    Q-learning
Extensions to temporal difference
    Model-based methods
    TD($\lambda$)
    Representation and neural nets
    Gradient-based methods
Formalizing learning in the brain
Examples
    Checkers
Glossary
Markov chain: a model for a random process that evolves over time such that the states (like locations in a maze) occupied in the future are independent of the states in the past given the current state.

Markov decision problem (MDP): a model for a controlled random process in which an agent's choice of action determines the probabilities of transitions of a Markov chain and leads to rewards (or costs) that need to be maximised (or minimised).

policy: a deterministic or stochastic scheme for choosing an action at every state or location.

reward: an immediate, possibly stochastic, payoff that results from performing an action in a state. In an MDP, the immediate reward depends on the current state and action only. The agent's aim is to optimise a sum or average of (possibly discounted) future rewards.

value function: a function defined over states, which gives an estimate of the total (possibly discounted) reward expected in the future, starting from each state, and following a particular policy.

discounting: if rewards received in the far future are worth less than rewards received sooner, they are described as being discounted. Humans and animals appear to discount future rewards hyperbolically; exponential discounting is common in engineering and finance.

dynamic programming: a collection of calculation techniques (notably policy and value iteration) for finding a policy that maximises reward or minimises costs.

temporal difference prediction error: a measure of the inconsistency between estimates of the value function at two successive states. This prediction error can be used to improve the predictions and also to choose good actions.
Reinforcement Learning
Keywords: dynamic programming, value function, policy, actor-critic, Q-learning

Peter Dayan, Gatsby Computational Neuroscience Unit, University College London, UK
Christopher JCH Watkins, Department of Computer Science, Royal Holloway, University of London, UK

Reinforcement learning is the study of how animals and artificial systems can learn to optimize their behavior in the face of rewards and punishments. Reinforcement learning algorithms have been developed that are closely related to methods of dynamic programming, which is a general approach to optimal control. Reinforcement learning phenomena have been observed in psychological studies of animal behavior, and in neurobiological investigations of neuromodulation and addiction.
1 Introduction
One way in which animals acquire complex behaviors is by learning to obtain rewards and to avoid punishments. Reinforcement learning theory is a formal computational model of this type of learning. To see that some theory is needed, consider that in many environments animals need to perform unrewarded or unpleasant preparatory actions in order to obtain some later reward. For example, a mouse may need to leave the warmth and safety of its burrow to go on a cold and initially unrewarded search for food. Is it possible to explain learning to take such an unpleasant decision in terms of learning to obtain a large but distant reward? What computational abilities does an agent need in principle for this type of learning? What capacity for episodic memory is necessary? Does an agent need to be able to plan ahead by envisioning future situations and actions? If an agent is doing something without immediate reward, how does it learn to recognize whether it is making progress towards a desirable goal? It is these questions that the theory of reinforcement learning (RL) seeks to answer. RL theory comprises a formal framework for describing an agent interacting with its environment, together with a family of learning algorithms, and analyses of their performance.
The simplest RL algorithms require minimal ability for episodic memory of past states and actions, no prediction of the effects of actions, and very simple computations. The speed of learning, and the efficiency with which experience is used, can be improved if the agent has greater abilities. A continuum of improvement is possible: if an agent has or constructs partial models of the effects of its actions, or if it can remember and process longer sequences of past states and actions, learning can be faster and the agent's experience can be used more efficiently.
2 Markov Decision Problems

RL is formalized in terms of an agent interacting with a Markov decision problem (MDP). At each time step $t$, the agent observes the state $x_t$ of its environment, chooses an action $a_t$, and receives a possibly stochastic payoff $r_t$ whose distribution depends only on $x_t$ and $a_t$; the environment then makes a possibly stochastic transition to a new state $x_{t+1} = y$, with a probability $P^{a_t}_{x_t y}$ that again depends only on $x_t$ and $a_t$. We write $r^a_x$ for the mean payoff for performing action $a$ at state $x$. The agent's task is to choose a policy, that is, its actions at every state, so as to maximize the expected discounted return

$\left\langle \sum_{t=0}^{\infty} \gamma^t r_t \right\rangle$   (1)

where the $\langle \cdot \rangle$ indicates that an average is taken over the stochasticity in the states and the payoffs. The factor $0 \le \gamma < 1$ is called the discount factor, and controls the relative weighting of payoffs that happen sooner and those that happen later. The larger $\gamma$, the more important are distant payoffs, and, typically, the more difficult the optimization problem. Other definitions of return are possible. We assume that the Markov decision problem is absorbing, so there is a state or set of states which define the end of a trial (like the goal of a maze), allowing a new trial to start. The infinite sum of equation 1 is effectively truncated at these states, which is equivalent to specifying degenerate transition matrices and reward distributions there. In this case, it is possible to learn to optimize the sum total payoff that the agent receives before reaching the terminal state. If the MDP is such that it is always possible to return to every state, it is possible to learn to optimize the average payoff. A policy that optimizes average payoff will also optimize discounted payoff for a value of $\gamma$ sufficiently close to $1$. RL learning algorithms can be adapted to use any of these definitions of return; we will consider only total discounted reward here. For convenience, we also assume that the problem is finite, ie there are only finitely many states and possible actions. Conventionally, the agent starts off not knowing the transition matrices or reward distributions. Over the course of repeated exploration, it has to work out a course of action that will optimize its return.
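To make the formalism concrete, here is a minimal illustrative sketch: a small, entirely hypothetical three-state absorbing MDP encoded as transition and reward arrays, with the expected discounted return of equation 1 estimated for a fixed policy by Monte Carlo simulation. The particular numbers, the choice of state 2 as the terminal state, and the name sample_return are assumptions of the sketch, not part of the article; later sketches reuse the arrays P, r, gamma and the generator rng defined here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state, 2-action absorbing MDP (state 2 is terminal).
# P[a, x, y] = probability of moving from state x to state y under action a.
P = np.array([
    [[0.8, 0.2, 0.0],    # action 0
     [0.1, 0.6, 0.3],
     [0.0, 0.0, 1.0]],
    [[0.5, 0.5, 0.0],    # action 1
     [0.0, 0.2, 0.8],
     [0.0, 0.0, 1.0]],
])
# r[x, a] = mean payoff for performing action a at state x (payoffs are taken
# to be deterministic at their means here, purely for simplicity).
r = np.array([[0.0, -1.0],
              [0.0,  2.0],
              [0.0,  0.0]])
gamma = 0.9                        # discount factor

def sample_return(policy, x0=0, max_steps=200):
    """Sample the discounted return of equation 1 from one trial."""
    x, total, discount = x0, 0.0, 1.0
    for _ in range(max_steps):
        if x == 2:                 # the absorbing state ends the trial
            break
        a = policy[x]
        total += discount * r[x, a]
        discount *= gamma
        x = rng.choice(3, p=P[a, x])
    return total

policy = np.array([1, 1, 0])       # a fixed stationary, deterministic policy
returns = [sample_return(policy) for _ in range(5000)]
print("estimated expected discounted return:", np.mean(returns))
```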
3 Dynamic Programming
The obvious problem with optimizing a goal such as that in expression 1 is that the action taken at time $t$ can affect the rewards at all later times $t' > t$. In effect, one has to take into consideration all possible sequences of actions. However, the number of such sequences typically grows exponentially with their length, making it impractical to consider each one in turn. Dynamic programming (Bellman, 1957) offers a set of methods for solving the optimization problem while (generally) avoiding this explosion, by taking advantage of the Markov property. The two main methods are called policy iteration and value iteration, and we discuss them in turn.
3.1 Policy Iteration

At least one solution to the optimization problem of expression 1 can be cast in terms of a deterministic policy which is constant over time, so the agent performs the single same action every time it gets to a particular state. Policies without time dependence are called stationary. We will, for the moment, consider stationary deterministic policies, and use the notation $\pi(x)$ for the action that policy $\pi$ recommends at state $x$.

In policy iteration, a policy is first evaluated, and then improved. Policy evaluation consists of working out the value $V^\pi(x)$ of every state $x$ under policy $\pi$, ie the expected long term reward available starting from $x$ and following policy $\pi$. That is

$V^\pi(x) = \left\langle \sum_{t=0}^{\infty} \gamma^t r_t \right\rangle_{\pi}$, with $x_0 = x$,   (2)

where the states $x_1, x_2, \ldots$ follow from the transitions $P^{\pi(x_t)}_{x_t x_{t+1}}$. We can separate the first and subsequent terms in the sum,

$V^\pi(x) = r^{\pi(x)}_{x} + \gamma \sum_y P^{\pi(x)}_{xy} V^\pi(y)$.   (3)

The second term in both these expressions is just $\gamma$ times the expected infinite horizon return for the state at time $t = 1$, averaged over all the possibilities for this state. Equation 3 is a linear equation for the values $V^\pi(x)$, and so can be solved by a variety of numerical methods. Later, we see how to find these values without knowing the mean rewards or the transition distributions.

Policy improvement uses the values $V^\pi$ to specify a policy $\pi'$ that is guaranteed to be at least as good as $\pi$. The idea is to consider at state $x$ the non-stationary policy which consists of taking action $a$ (which may, or may not, be the action $\pi(x)$ that policy $\pi$ recommends at $x$) and then following policy $\pi$ thereafter. By the same reasoning as above, the expected value of this is

$Q^\pi(x, a) = r^a_x + \gamma \sum_y P^a_{xy} V^\pi(y)$,   (4)

and the improved policy is

$\pi'(x) = \operatorname{argmax}_a Q^\pi(x, a)$,   (5)

from which it is obvious that $V^{\pi'}(x) \ge V^\pi(x)$. Since the actions at all states are improved simultaneously in this way, the overall policy can also only improve. If policy $\pi'$ is the same as policy $\pi$, then it is an optimal policy (of which there may be more than one). The values and action values associated with an optimal policy are called the optimal values and the optimal action values, and are often written $V^*(x)$ and $Q^*(x, a)$. We will see later how to improve a policy without having explicitly to maximize $Q^\pi(x, a)$.

Since the problem is finite, and policy iteration improves the policy measurably on each step until it finds the optimal policy, policy iteration is bound to converge. Although, in the worst case, policy iteration can take an exponential number of steps, it is generally very quick.

3.2 Value Iteration

Value iteration works directly with the optimal action values. Applying the reasoning above to an optimal policy gives

$Q^*(x, a) = r^a_x + \gamma \sum_y P^a_{xy} \max_b Q^*(y, b)$,   (6)

and repeatedly applying the right hand side of equation 6 as an update to an arbitrary initial estimate of $Q^*$ converges to the optimal action values, from which an optimal policy follows as in equation 5.
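As a minimal sketch of both procedures on the hypothetical MDP introduced earlier (reusing the arrays P, r and gamma, and with helper names that are assumptions of the sketch): policy evaluation solves the linear system of equation 3 directly, policy improvement applies equation 5, and value iteration repeatedly applies the right hand side of equation 6.

```python
def policy_evaluation(policy):
    """Solve equation 3 for V^pi as a linear system."""
    n = len(policy)
    P_pi = np.array([P[policy[x], x] for x in range(n)])   # transitions under pi
    r_pi = np.array([r[x, policy[x]] for x in range(n)])   # mean rewards under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def policy_iteration():
    policy = np.zeros(3, dtype=int)              # arbitrary initial policy
    while True:
        V = policy_evaluation(policy)
        # Q^pi(x, a) from equation 4, then the greedy improvement of equation 5.
        Q = r + gamma * np.einsum('axy,y->xa', P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):   # no change: policy is optimal
            return policy, V
        policy = new_policy

def value_iteration(tol=1e-8):
    Q = np.zeros((3, 2))
    while True:
        # Repeated application of the right hand side of equation 6.
        Q_new = r + gamma * np.einsum('axy,y->xa', P, Q.max(axis=1))
        if np.abs(Q_new - Q).max() < tol:
            return Q_new.argmax(axis=1), Q_new
        Q = Q_new

print(policy_iteration())
print(value_iteration())
```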
4 Temporal Differences
Implementing either policy or value iteration requires the agent to know the expected rewards and the transition matrices. If the agent does not know these, but rather just has to learn by interacting with the environment (eg by trying actions and observing their consequences), then what can it do? Methods of temporal differences were invented to perform prediction and optimization in exactly these circumstances. There are two principal flavors of temporal difference methods: an actor-critic scheme (Barto, Sutton & Anderson, 1983), which parallels policy iteration, and has been suggested as being implemented in biological RL, and a method called Q-learning (Watkins, 1989), which parallels value iteration. Methods of temporal differences were originally invented (Sutton, 1988; Sutton & Barto, 1987) to account for the behavior of animals in psychological experiments involving prediction; the links with dynamic programming were only made much later (Watkins, 1989; Barto, Sutton & Watkins, 1989).
4.1 Actor-critic learning

The actor-critic scheme maintains two structures: an actor, which specifies the current policy, and a critic, whose job is policy evaluation. The critic learns an estimate $\hat{V}(x)$ of the value of each state $x$ under the policy currently specified by the actor, without knowledge of the mean rewards and transition probabilities of the world (written $r^a_x$ and $P^a_{xy}$ for convenience). It does this by maintaining and improving the approximation $\hat{V}(x)$ to the true values $V^\pi(x)$.

Two ideas underlie the temporal difference algorithm. One is to use averages from random samples to determine means; the second is to use a form of bootstrapping, substituting the current, incorrect, estimates $\hat{V}(y)$ as approximations for $V^\pi(y)$ in equation 3. First, consider the quantity

$r_t + \gamma V^\pi(x_{t+1})$,   (7)

which comes from the random reward and transition consequent on choosing action $\pi(x_t)$ at state $x_t$. The expected value of this is

$r^{\pi(x_t)}_{x_t} + \gamma \sum_y P^{\pi(x_t)}_{x_t y} V^\pi(y)$,   (8)

which is just $V^\pi(x_t)$. In this case,

$\delta_t = r_t + \gamma V^\pi(x_{t+1}) - \hat{V}(x_t)$   (9)

could be used as a sampled error in the current estimate of the value of state $x_t$, namely $\hat{V}(x_t)$, and an algorithm such as

$\hat{V}(x_t) \to \hat{V}(x_t) + \epsilon \delta_t$,   (10)

when applied over the course of many trials, might be expected to make $\hat{V}(x_t) \to V^\pi(x_t)$. Here, $\epsilon$ is a learning rate. In fact, this is a form of stochastic, error-correcting learning rule like the delta rule (see DELTA RULE). The same is true for all subsequent timesteps. Of course, expression 7 contains $V^\pi(x_{t+1})$, which is exactly the quantity that we are trying to estimate, only at state $x_{t+1}$ rather than $x_t$. The second key idea underlying temporal difference methods is to substitute the current approximation $\hat{V}(x_{t+1})$ for this quantity. In this case, the sampled error parallel to equation 9 is

$\delta_t = r_t + \gamma \hat{V}(x_{t+1}) - \hat{V}(x_t)$,   (11)

and the resulting update

$\hat{V}(x_t) \to \hat{V}(x_t) + \epsilon \delta_t$   (12)

is called the temporal difference learning rule. The term $\delta_t$ can be considered as a prediction error, or a measure of the inconsistency between the estimates $\hat{V}(x_t)$ and $\hat{V}(x_{t+1})$. Despite the apparent approximations, circumstances are known under which $\hat{V}(x) \to V^\pi(x)$ as the number of trials gets large, so implementing policy evaluation correctly. In fact, this is also true in the case that the policy is stochastic, that is, when $\pi(a; x)$ is a probability distribution over the actions $a$ associated with state $x$.
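As an illustrative sketch of equations 11 and 12, the following code estimates $\hat{V}$ for a fixed policy on the same hypothetical MDP by sampling transitions rather than using the transition and reward structure directly. The learning rate and trial count are arbitrary assumptions; the sketch reuses P, r, gamma, rng and policy_evaluation from the earlier sketches.

```python
def td0_evaluation(policy, n_trials=5000, eps=0.1):
    """Tabular temporal difference policy evaluation (equations 11 and 12)."""
    V_hat = np.zeros(3)
    for _ in range(n_trials):
        x = 0                                    # each trial starts at state 0
        while x != 2:                            # state 2 is absorbing
            a = policy[x]
            y = rng.choice(3, p=P[a, x])
            r_t = r[x, a]                        # sampled payoff (deterministic here)
            delta = r_t + gamma * V_hat[y] - V_hat[x]   # TD error, equation 11
            V_hat[x] += eps * delta                     # update, equation 12
            x = y
    return V_hat

policy = np.array([1, 1, 0])
print("TD estimate: ", td0_evaluation(policy))
print("exact values:", policy_evaluation(policy))
```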
The other half of policy iteration is policy improvement, patterned after equation 5. Here, the idea is to use a stochastic policy (rather than the deterministic ones that we have hitherto considered), which is defined by a set of parameters $M_{xa}$ called the action matrix. Then, rather than implement the full maximization of equation 5, we consider changing the parameters, using the answer from policy evaluation, in order to increase the probability of actions that are associated with higher values of $Q^\pi(x, a)$. A natural way to parameterize a stochastic policy is to use a softmax function:

$\pi(a; x) = \exp(M_{xa}) / \sum_b \exp(M_{xb})$.   (13)

Under this policy, $\sum_a \pi(a; x) Q^\pi(x, a) = V^\pi(x)$ is the average value of the actions specified by policy $\pi$ at state $x$. If action $a_t$ is better than average at state $x_t$, then, on average, $\delta_t > 0$; if action $a_t$ is worse than average, then, on average, $\delta_t < 0$. The policy could therefore be improved by changing the action matrix according to

$M_{x_t a_t} \to M_{x_t a_t} + \epsilon' \delta_t$,   (14)

using just the same temporal difference term $\delta_t$ as in equation 12. Normally, the action matrix is changed according to equation 14 before the critic has converged, and there is no proof that the combined estimation and optimization procedure is guaranteed to find an optimal policy. Nevertheless, it is found to work well in practice. Note that learning is based on following, and simultaneously trying to improve, a policy.
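The sketch below couples the critic update of equation 12 with the softmax actor of equations 13 and 14 on the same hypothetical MDP (reusing P, r, gamma and rng). The two learning rates are arbitrary assumptions, and, as noted above, no convergence guarantee is implied.

```python
def softmax_policy(M_x):
    """Equation 13: action probabilities from one row of the action matrix."""
    e = np.exp(M_x - M_x.max())          # subtract the max for numerical stability
    return e / e.sum()

def actor_critic(n_trials=20000, eps=0.1, eps_actor=0.1):
    V_hat = np.zeros(3)                  # critic: estimated state values
    M = np.zeros((3, 2))                 # actor: action matrix
    for _ in range(n_trials):
        x = 0
        while x != 2:
            probs = softmax_policy(M[x])
            a = rng.choice(2, p=probs)
            y = rng.choice(3, p=P[a, x])
            delta = r[x, a] + gamma * V_hat[y] - V_hat[x]   # TD error (equation 11)
            V_hat[x] += eps * delta                         # critic update (equation 12)
            M[x, a] += eps_actor * delta                    # actor update (equation 14)
            x = y
    return M, V_hat

M, V_hat = actor_critic()
print("learned action preferences:\n", M)
print("learned values:", V_hat)
```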
4.2 Q-learning
Q-learning is a temporal difference version of value iteration. It applies to equation 6 the same two ideas underlying the actor-critic. That is, it maintains and improves estimated action values $\hat{Q}(x, a)$, employing

$r_t + \gamma \max_b \hat{Q}(x_{t+1}, b)$

as a sampled version of the right hand side of equation 6. The overall update at time $t$ is

$\hat{Q}(x_t, a_t) \to \hat{Q}(x_t, a_t) + \epsilon \left( r_t + \gamma \max_b \hat{Q}(x_{t+1}, b) - \hat{Q}(x_t, a_t) \right)$.
Unlike the actor-critic scheme, circumstances are known under which Q-learning is guaranteed to converge on the optimal policy.
Once the optimal action values have been found, an optimal policy can be read off directly as $\pi^*(x) = \operatorname{argmax}_a Q^*(x, a)$.
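A minimal tabular Q-learning sketch on the same hypothetical MDP (reusing P, r, gamma and rng) is given below. The text does not specify how actions are chosen during learning; the epsilon-greedy exploration used here is a common choice and an assumption of this sketch.

```python
def q_learning(n_trials=20000, eps=0.1, explore=0.1):
    Q_hat = np.zeros((3, 2))
    for _ in range(n_trials):
        x = 0
        while x != 2:
            # Epsilon-greedy exploration (an assumption of this sketch; any
            # sufficiently exploratory scheme would serve).
            if rng.random() < explore:
                a = int(rng.integers(2))
            else:
                a = int(Q_hat[x].argmax())
            y = rng.choice(3, p=P[a, x])
            # Sampled version of the right hand side of equation 6.
            delta = r[x, a] + gamma * Q_hat[y].max() - Q_hat[x, a]
            Q_hat[x, a] += eps * delta
            x = y
    return Q_hat

Q_hat = q_learning()
print("greedy policy from learned values:", Q_hat.argmax(axis=1))
```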
5 Extensions to Temporal Difference

The basic algorithms can be extended in various ways. One extension, TD($\lambda$), bases the prediction error on sampled returns over more than one step, weighted by a factor $\lambda$, rather than on the single-step sample of equation 7. The remaining bootstrap approximation is the same as before. The significance of $\lambda$ is that it controls a trade-off between bias and variance. If $\lambda$ is near $1$, then the estimate is highly variable (since it depends on long sample paths), but not strongly biased (since real rewards from the reward distribution of the environment are considered). If $\lambda$ is near $0$, then the estimate is not very variable (since only short sample paths are involved), but it can be biased, because of the bootstrapping.

A further extension concerns representation. We described the temporal difference algorithm using a unary or table-lookup representation, in which we can separately manipulate the value $\hat{V}(x)$ associated with each state $x$. One can also consider parameterized representations of the values (and also the policies), for instance the linear form

$\hat{V}(x) = \sum_k w_k \phi_k(x)$,   (15)

where $\mathbf{w}$ is a set of parameters and $\phi(x)$ is a population representation for state $x$. In this case, the temporal difference update rule of equation 12 is changed to one for the parameters,

$w_k \to w_k + \epsilon \delta_t \phi_k(x_t)$.

Again, this is quite similar to the delta rule. More sophisticated representational schemes than equation 15 can also be used, including neural networks, in which case the gradient $\partial \hat{V}(x_t) / \partial w_k$ should be used in place of $\phi_k(x_t)$.
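The following sketch illustrates the linear form of equation 15 and the corresponding parameter update, again on the hypothetical MDP from the earlier sketches; the population representation PHI is an arbitrary illustrative choice, and the terminal state is assigned value zero during bootstrapping.

```python
# Hypothetical population representation: each state x is coded by a fixed
# feature vector phi(x); here, coarse overlapping features.
PHI = np.array([[1.0, 0.2, 0.0],
                [0.2, 1.0, 0.2],
                [0.0, 0.2, 1.0]])

def linear_td(policy, n_trials=5000, eps=0.05):
    w = np.zeros(3)                               # parameters w of equation 15
    for _ in range(n_trials):
        x = 0
        while x != 2:
            a = policy[x]
            y = rng.choice(3, p=P[a, x])
            v_x = w @ PHI[x]                      # V_hat(x) from equation 15
            v_y = 0.0 if y == 2 else w @ PHI[y]   # terminal state has value 0
            delta = r[x, a] + gamma * v_y - v_x   # TD error
            w += eps * delta * PHI[x]             # parameter update
            x = y
    return w

w = linear_td(np.array([1, 1, 0]))
print("approximate values:", PHI @ w)
```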
References
Barto, AG, Sutton, RS & Anderson, CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13:834-846.
Bellman, RE (1957) Dynamic Programming. Princeton, NJ: Princeton University Press.

Sutton, RS (1988) Learning to predict by the methods of temporal differences. Machine Learning 3:9-44.

Sutton, RS & Barto, AG (1987) A temporal-difference model of classical conditioning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society. Seattle, WA:???

Watkins, CJCH (1989) Learning from Delayed Rewards. PhD Thesis, University of Cambridge, Cambridge, UK.
Further Reading
Barto, AG, Sutton, RS & Watkins, CJCH (1990) Learning and sequential decision making. In M Gabriel & J Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks. Cambridge, MA: MIT Press, Bradford Books.

Bertsekas, DP & Tsitsiklis, JN (1996) Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.

Sutton, RS & Barto, AG (1998) Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.