
Encyclopedia of Cognitive Science, in press

Reinforcement Learning
Peter Dayan
Gatsby Computational Neuroscience Unit, University College London
17 Queen Square, London WC1N 3AR
Tel: +44 (0) 20 7679 1175; Fax: +44 (0) 20 7679 1173
Email: dayan@gatsby.ucl.ac.uk

Christopher JCH Watkins
Department of Computer Science, Royal Holloway, University of London
Egham, Surrey TW20 0EX
Tel: +44 (0) 1784 443419; Fax: +44 (0) 1784 439786
Email: chrisw@dcs.rhbnc.ac.uk

Table of Contents

A Formal framework for learning from reinforcement
Markov decision problems
Optimization of long term return
Interaction between an agent and its environment
Dynamic programming as a formal solution
Policy iteration
Value iteration
Temporal difference methods as a practical solution
Actor-critic learning
Q-learning
Extensions to temporal difference
Model-based methods
TD(λ) and eligibility traces
Representation and neural nets
Gradient-based methods
Formalizing learning in the brain
Examples
Checkers
Backgammon
Robots
Finance
Summary and future directions


Glossary
Markov chain: a model for a random process that evolves over time such that the states (like locations in a maze) occupied in the future are independent of the states in the past, given the current state.

Markov decision problem (MDP): a model for a controlled random process in which an agent's choice of action determines the probabilities of transitions of a Markov chain and leads to rewards (or costs) that need to be maximised (or minimised).

policy: a deterministic or stochastic scheme for choosing an action at every state or location.

reward: an immediate, possibly stochastic, payoff that results from performing an action in a state. In an MDP, the immediate reward depends on the current state and action only. The agent's aim is to optimise a sum or average of (possibly discounted) future rewards.

value function: a function defined over states, which gives an estimate of the total (possibly discounted) reward expected in the future, starting from each state and following a particular policy.

discounting: if rewards received in the far future are worth less than rewards received sooner, they are described as being discounted. Humans and animals appear to discount future rewards hyperbolically; exponential discounting is common in engineering and finance.

dynamic programming: a collection of calculation techniques (notably policy and value iteration) for finding a policy that maximises reward or minimises costs.

temporal difference prediction error: a measure of the inconsistency between estimates of the value function at two successive states. This prediction error can be used to improve the predictions and also to choose good actions.

Reinforcement Learning
Keywords: dynamic programming, value function, policy, actor-critic, Q-learning

Peter Dayan, Gatsby Computational Neuroscience Unit, University College London, UK
Christopher JCH Watkins, Department of Computer Science, Royal Holloway College, UK

Reinforcement learning is the study of how animals and artificial systems can learn to optimize their behavior in the face of rewards and punishments. Reinforcement learning algorithms have been developed that are closely related to methods of dynamic programming, which is a general approach to optimal control. Reinforcement learning phenomena have been observed in psychological studies of animal behavior, and in neurobiological investigations of neuromodulation and addiction.

1 Introduction
One way in which animals acquire complex behaviors is by learning to obtain rewards and to avoid punishments. Reinforcement learning theory is a formal computational model of this type of learning. To see that some theory is needed, consider that in many environments animals need to perform unrewarded or unpleasant preparatory actions in order to obtain some later reward. For example, a mouse may need to leave the warmth and safety of its burrow to go on a cold and initially unrewarded search for food. Is it possible to explain learning to take such an unpleasant decision in terms of learning to obtain a large but distant reward? What computational abilities does an agent need in principle for this type of learning? What capacity for episodic memory is necessary? Does an agent need to be able to plan ahead by envisioning future situations and actions? If an agent is doing something without immediate reward, how does it learn to recognize whether it is making progress towards a desirable goal? It is these questions that the theory of reinforcement learning (RL) seeks to answer. RL theory comprises a formal framework for describing an agent interacting with its environment, together with a family of learning algorithms, and analyses of their performance.

2 The Reinforcement Learning Framework


In the standard framework for RL, a learning agent (an animal or perhaps a robot) repeatedly observes the state of its environment, and then chooses and performs an action. Performing the action changes the state of the world, and the agent also obtains an immediate numeric payoff as a result. Positive payoffs are rewards and negative payoffs are punishments. The agent must learn to choose actions so as to maximize a long term sum or average of the future payoffs it will receive.

The agent's payoffs are subjective, in the sense that they are the agent's own evaluation of its experience. The ability to evaluate its experience in this way is an essential part of the agent's prior knowledge. The payoffs define what the agent is trying to achieve; what the agent needs to learn is how to choose actions to obtain high payoffs. For this framework to be sensible, the agent must be able to visit the same or at least similar states on many occasions, to take advantage of learning.

A standard assumption is that the agent is able to observe those aspects of the state of the world that are relevant to deciding what to do. This assumption is convenient rather than plausible, since it leads to great simplification of the RL problem. We will not here consider the important problem of how the agent could learn what information it needs to observe.

In the conventional framework, the agent does not initially know what effect its actions have on the state of the world, nor what immediate payoffs its actions will produce. In particular, it does not know which action is best to take. Rather, it tries out various actions at various states, and gradually learns which one is best at each state so as to maximize its long run payoff. The agent thus acquires what is known as a closed-loop control policy, or a rule for choosing an action according to the observed current state of the world.

From an engineering perspective, the most natural way to learn the best actions would be for the agent to try out various actions in various states, and so learn a predictive model of the effects its actions have on the state of the world and of what payoffs its actions produce. Given such a model, the agent could plan ahead by envisioning alternative courses of action and the states and payoffs that would result. However, this approach to learning does not seem at all plausible for animals. Planning ahead involves accurately envisioning alternative sequences of actions and their consequences: the animal would have to imagine what states of the world would result and what payoffs it would receive for different possible sequences of actions. Planning ahead is particularly difficult if actions may have stochastic effects, so that performing an action may lead to one of several different possible states.

One of the most surprising discoveries in RL is that there are simple learning algorithms by means of which an agent can learn an optimal policy without ever being able to predict the effects of its actions or the immediate payoffs they will produce, and without ever planning ahead. It is also surprising that, in principle, learning requires only a minimal amount of episodic memory: an agent can learn if it can consider together the last action it took, the state when it chose that action, the payoff received, and the current state.

The simplest RL algorithms require minimal ability for episodic memory of past states and actions, no prediction of effects of actions, and very simple computations. The speed of learning, and efficiency of use of experience, can be improved if the agent has greater abilities. A continuum of improvement is possible. If an agent has or constructs partial models of the effects of its actions, or if an agent can remember and process longer sequences of past states and actions, learning can be faster and the agent's experience can be used more efficiently.

2.1 Markov Decision Problems


Formally, an RL agent faces a Markov decision problem (MDP). An MDP has four components: states, actions, and transition and reward distributions. The state at time $t$, for which we use the notation $s_t$, characterizes the current situation of the agent in the world. For an agent in a maze, for instance, the relevant state would generally be the location of the agent. The action the agent takes at time $t$ is called $a_t$. Sometimes there may be little or no choice of actions at some state or states. One consequence of taking the action is the immediate payoff or reward, which we call $r(s_t, a_t)$, or sometimes just $r_t$. If the rewards are stochastic, then $\bar{r}(s_t, a_t)$ is its mean. The other consequence of taking the action is that the state changes. Such changes are characterized by a transition distribution, which allows them to be stochastic. In the simplest RL problems, state-transition distributions and the rewards depend only on the current state and the current action, and not on the history of previous actions and states. This restriction is called the Markov property, and ensures that the description of the current state contains all information relevant to choosing an action in the current state. The Markov property is critically necessary for the learning algorithms that will be described below. We typically use a set of matrices $P^a_{ss'}$ to describe the transition structure of the world. Here, $P^a_{ss'}$ is the probability that the state changes from $s$ to $s'$ given that action $a$ is emitted.

As an example of a Markov decision process, consider a hypothetical experiment in which a rat presses levers to obtain food in a cage with a light. Suppose that if the light is off, pressing lever A turns the light on with a certain probability, and pressing lever B has no effect. When the light is on, pressing lever A has no effect, but pressing lever B delivers food with a certain probability, and turns the light off again. In this simple environment there are two relevant states: light on and light off. Lever A may cause a transition from light off to light on; in light on, lever B may yield a reward. The only information that the rat needs to decide what to do is whether the light is on or off. The optimal policy is simple: in light off, press lever A; in light on, press lever B.

The goal for an agent in solving a Markov decision problem is to maximize its expected, long-term payoff, known as its return. A mathematically convenient way to formalize return is as a discounted sum of payoffs. That is, starting from state $s_t$ at time $t$, it should choose actions $a_t, a_{t+1}, \ldots$ to maximize

$$\Big\langle \sum_{\tau=0}^{\infty} \gamma^{\tau}\, r(s_{t+\tau}, a_{t+\tau}) \Big\rangle \qquad (1)$$

where the angle brackets $\langle \cdot \rangle$ indicate that an average is taken over the stochasticity in the states and the payoffs. The factor $\gamma \in [0,1)$ is called the discount factor, and controls the relative weighting of payoffs that happen sooner and those that happen later. The larger $\gamma$, the more important are distant payoffs, and, typically, the more difficult the optimization problem.

Other definitions of return are possible. We assume that the Markov decision problem is absorbing, so there is a state or set of states which define the end of a trial (like the goal of the maze), allowing a new trial to start. The infinite sum of equation 1 is effectively truncated at these states, which is equivalent to specifying degenerate transition matrices and reward distributions there. In this case, it is possible to learn to optimize the sum total payoff that the agent receives before reaching the terminal state. If the MDP is such that it is always possible to return to every state, it is possible to learn to optimize the average payoff. A policy that optimizes average payoff will also optimize discounted payoff for a value of $\gamma$ sufficiently close to $1$. RL learning algorithms can be adapted to use any of these definitions of return; we will consider only total discounted reward here. For convenience, we also assume that the problem is finite, ie there are only finitely many states and possible actions.

Conventionally, the agent starts off not knowing the transition matrices or reward distributions. Over the course of repeated exploration, it has to work out a course of action that will optimize its return.
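For concreteness, the light-and-lever environment can be written down as a small MDP in code. The sketch below is ours rather than the authors'; the transition probabilities (0.8), the reward size (1.0), and the discount factor are arbitrary illustrative choices.

```python
import numpy as np

# A minimal encoding of the hypothetical light-and-lever MDP described above.
# The probabilities and reward size are illustrative assumptions, not values
# taken from the text.

states = ["light_off", "light_on"]    # indices 0 and 1
actions = ["press_A", "press_B"]      # indices 0 and 1

# P[a, s, s'] = probability of moving from state s to s' when taking action a.
P = np.zeros((2, 2, 2))
P[0] = [[0.2, 0.8],   # light_off, press A: light turns on with probability 0.8
        [0.0, 1.0]]   # light_on,  press A: no effect
P[1] = [[1.0, 0.0],   # light_off, press B: no effect
        [0.8, 0.2]]   # light_on,  press B: food and light off with probability 0.8

# R[a, s] = mean immediate reward for taking action a in state s.
R = np.zeros((2, 2))
R[1, 1] = 0.8 * 1.0   # expected food reward for pressing B when the light is on

gamma = 0.9           # discount factor

def expected_return(policy, horizon=100, n_runs=500, seed=0):
    """Rough Monte-Carlo estimate of the discounted return of a fixed policy,
    using mean rewards along sampled state trajectories."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_runs):
        s, g = 0, 0.0
        for t in range(horizon):
            a = policy[s]
            g += (gamma ** t) * R[a, s]
            s = rng.choice(2, p=P[a, s])
        total += g
    return total / n_runs

# The optimal policy from the text: press A when the light is off, B when it is on.
print(expected_return(policy=[0, 1]))
```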

3 Dynamic Programming
The obvious problem with optimizing a goal such as that in expression 1 is that the action taken at time $t$ can affect the rewards at times $t' > t$. In effect, one has to take into consideration all possible sequences of actions. However, the number of such sequences of length $T$ typically grows exponentially as a function of $T$, making it impractical to consider each one in turn. Dynamic programming (Bellman, 1957) offers a set of methods for solving the optimization problem while (generally) avoiding this explosion, by taking advantage of the Markov property. The two main methods are called policy iteration and value iteration, and we discuss them in turn.

3.1 Policy iteration


A policy is an assignment of actions to states, eg a recipe that says: push lever A if the light is off; and push lever B if the light is on. In general, policies can be stochastic, specifying the probability of performing each action at each state. It can be shown that the solution to the optimization problem of expression 1 can be cast in terms of a deterministic policy which is constant over time, so the agent performs the single same action every time it gets to a particular state. Policies without time dependence are called stationary. We will, for the moment, consider stationary deterministic policies. We use the notation $\pi(s)$ for the action that policy $\pi$ recommends at state $s$.

In policy iteration, a policy is first evaluated, and then improved. Policy evaluation consists of working out the value $V^\pi(s)$ of every state $s$ under policy $\pi$, ie the expected long term reward available starting from $s$ and following policy $\pi$. That is

$$V^\pi(s) = \Big\langle \sum_{\tau=0}^{\infty} \gamma^{\tau}\, r(s_{t+\tau}, \pi(s_{t+\tau})) \,\Big|\, s_t = s \Big\rangle \qquad (2)$$

where the states $s_{t+1}, s_{t+2}, \ldots$ follow from the transitions with actions $a_{t+\tau} = \pi(s_{t+\tau})$. We can separate the first and subsequent terms in the sum, which, using the Markov property, is

$$V^\pi(s) = \bar{r}(s, \pi(s)) + \gamma \sum_{s'} P^{\pi(s)}_{ss'}\, V^\pi(s') \qquad (3)$$

The second term in both these expressions is just $\gamma$ times the expected infinite horizon return for the state at time $t+1$, averaged over all the possibilities for this state. Equation 3 is a linear equation for the values $V^\pi(s)$, and so can be solved by a variety of numerical methods. Later, we see how to find these values without knowing the mean rewards or the transition distributions.

Policy improvement uses the values $V^\pi(s)$ to specify a policy $\pi'$ that is guaranteed to be at least as good as $\pi$. The idea is to consider at state $s$ the non-stationary policy which consists of taking action $a$ (which may, or may not, be the action $\pi(s)$ that policy $\pi$ recommends at $s$) and then following policy $\pi$ thereafter. By the same reasoning as above, the expected value of this is

$$Q^\pi(s, a) = \bar{r}(s, a) + \gamma \sum_{s'} P^{a}_{ss'}\, V^\pi(s') \qquad (4)$$

The new policy is

$$\pi'(s) = \mathrm{argmax}_a\, Q^\pi(s, a) \qquad (5)$$

from which it is obvious that $Q^\pi(s, \pi'(s)) \ge V^\pi(s)$. Since the actions are improved simultaneously at every state, the overall policy can also only improve. If policy $\pi'$ is the same as policy $\pi$, then it is an optimal policy (of which there may be more than one). The values $V^\pi$ and $Q^\pi$ associated with an optimal policy are called the optimal values or the optimal action values and are often written $V^*$ and $Q^*$. We will see later how to improve a policy without having explicitly to maximize $Q^\pi(s, a)$.
Since the problem is finite, and policy iteration improves the policy measurably on each step until it finds the optimal policy, policy iteration is bound to converge. Although, in the worst case, policy iteration can take an exponential number of steps, it is generally very quick.
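As an illustration of equations 2-5, here is a compact policy iteration sketch. It is ours, not the authors', and it assumes the `P[a, s, s']` transition array, `R[a, s]` mean-reward array and discount `gamma` introduced in the example of section 2.1.

```python
import numpy as np

def policy_evaluation(P, R, gamma, policy):
    """Solve the linear system of equation 3 for V^pi exactly."""
    n_states = P.shape[1]
    # Transition matrix and mean rewards under the given deterministic policy.
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    r_pi = np.array([R[policy[s], s] for s in range(n_states)])
    # V = r_pi + gamma * P_pi V   =>   (I - gamma * P_pi) V = r_pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma):
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial policy
    while True:
        V = policy_evaluation(P, R, gamma, policy)
        # Policy improvement: equation 4 gives Q^pi, equation 5 the new policy.
        Q = R + gamma * np.einsum("ast,t->as", P, V)
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

Run on the arrays from the lever example, this should recover the press-A / press-B policy described earlier.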

3.2 Value iteration


Value iteration is the main alternative to policy iteration. For one version of this, a set of values $Q^{(n)}(s, a)$ is updated simultaneously according to the formula

$$Q^{(n+1)}(s, a) = \bar{r}(s, a) + \gamma \sum_{s'} P^{a}_{ss'}\, \max_{a'} Q^{(n)}(s', a') \qquad (6)$$

Although the $Q^{(n)}$-values are not necessarily the $Q^\pi$-values associated with any policy $\pi$, as in equation 4, one can show that iterating equation 6 infinitely often will lead to the optimal values $Q^*$. Then, equation 5 can be used to find the associated optimal policy. The proof that value iteration works depends on a contraction property. That is, a particular measure (called the max norm) of the distance between $Q^{(n)}$ and $Q^*$ is greater than that between $Q^{(n+1)}$ and $Q^*$ by a factor of at least $1/\gamma$. Thus, the values converge exponentially quickly to the optimal values. The optimal policy can actually be derived from them even before convergence.
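A corresponding sketch of value iteration on Q-values, iterating equation 6 until the values stop changing; again the array conventions and the stopping tolerance are our assumptions rather than anything from the article.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate equation 6 until the Q-values stop changing appreciably."""
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_actions, n_states))
    while True:
        # Backup: Q'(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')
        Q_new = R + gamma * np.einsum("ast,t->as", P, Q.max(axis=0))
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new, Q_new.argmax(axis=0)   # optimal Q and greedy policy
        Q = Q_new
```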

4 Temporal Differences
Implementing either policy or value iteration requires the agent to know the expected rewards and the transition matrices. If the agent does not know these, but rather just has to learn by interacting with the environment (eg by pulling the levers), then what can it do? Methods of temporal differences were invented to perform prediction and optimization in exactly these circumstances. There are two principal flavors of temporal difference methods: an actor-critic scheme (Barto, Sutton & Anderson, 1983), which parallels policy iteration, and has been suggested as being implemented in biological RL, and a method called Q-learning (Watkins, 1989), which parallels value iteration. Methods of temporal differences were invented (Sutton, 1988; Sutton & Barto, 1987) to account for the behavior of animals in psychological experiments involving prediction; the links with dynamic programming were only made much later (Watkins, 1989; Barto, Sutton & Watkins, 1989).

4.1 Actor-critic learning


In actor-critic learning, the actor specifies a policy, which the critic evaluates. Consider first a fixed policy or actor $\pi$. The critic has to find a set of values $V^\pi(s)$ that satisfy equation 3 (writing $V(s)$ for convenience), based on samples from the transition and reward structure of the world. It does this by maintaining and improving an approximation $\hat{V}(s)$ to these quantities.

Two ideas underlie the temporal difference algorithm. One is to use averages from random samples to determine means; the second is to use a form of bootstrapping, substituting the possibly incorrect estimates $\hat{V}(s')$ as approximations for $V^\pi(s')$ in equation 3. First, consider the quantity

$$r_t + \gamma V^\pi(s_{t+1}) \qquad (7)$$

which comes from the random reward and transition consequent on choosing action $\pi(s_t)$ at state $s_t$. The expected value of this is

$$\bar{r}(s_t, \pi(s_t)) + \gamma \sum_{s'} P^{\pi(s_t)}_{s_t s'}\, V^\pi(s') \qquad (8)$$

which is just $V^\pi(s_t)$, the quantity on the left hand side of equation 3. Therefore,

$$r_t + \gamma V^\pi(s_{t+1}) - \hat{V}(s_t) \qquad (9)$$

could be used as a sampled error in the current estimate of the value of state $s_t$, namely $\hat{V}(s_t)$, and an algorithm such as

$$\hat{V}(s_t) \leftarrow \hat{V}(s_t) + \epsilon\,\big( r_t + \gamma V^\pi(s_{t+1}) - \hat{V}(s_t) \big)$$

when applied over the course of many trials might be expected to make $\hat{V}(s_t) \rightarrow V^\pi(s_t)$. Here, $\epsilon$ is a learning rate. In fact, this is a form of a stochastic, error-correcting, learning rule like the delta rule (see the DELTA RULE). The same is true for all subsequent timesteps.

Of course, expression 7 contains $V^\pi(s_{t+1})$, which is exactly the quantity that we are trying to estimate, only at state $s_{t+1}$ rather than $s_t$. The second key idea underlying temporal difference methods is to substitute the current approximation $\hat{V}(s_{t+1})$ for this quantity. In this case, the sampled error parallel to equation 9 is

$$\delta_t = r_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) \qquad (10)$$

which is called the temporal difference. The temporal difference term for subsequent timesteps is defined similarly

$$\delta_{t+\tau} = r_{t+\tau} + \gamma \hat{V}(s_{t+\tau+1}) - \hat{V}(s_{t+\tau}) \qquad (11)$$

The learning rule

$$\hat{V}(s_t) \leftarrow \hat{V}(s_t) + \epsilon\, \delta_t \qquad (12)$$

is called the temporal difference learning rule. It can be considered as a prediction error or a measure of the inconsistency between the estimates $\hat{V}(s_t)$ and $\hat{V}(s_{t+1})$. Despite the apparent approximations, circumstances are known under which $\hat{V}(s) \rightarrow V^\pi(s)$ as the number of trials gets large, so implementing policy evaluation correctly. In fact, this is also true in the case that the policy is stochastic, that is $\pi(a; s)$ is a probability distribution over actions $a$ associated with state $s$.

The other half of policy iteration is policy improvement, patterned after equation 5. Here, the idea is to use a stochastic policy (rather than the deterministic ones that we have hitherto considered), which is defined by a set of parameters, called the action matrix $M$. Then, rather than implement the full maximization of equation 5, we consider changing the parameters, using the answer from policy evaluation in order to increase the probability of actions that are associated with higher values of $Q^\pi(s, a)$. A natural way to parameterize a stochastic policy is to use a softmax function:

$$\pi(a; s) = \frac{\exp(M_{as})}{\sum_{a'} \exp(M_{a's})} \qquad (13)$$

The normalization forces $\sum_a \pi(a; s) = 1$ for all states $s$. Here, the larger $M_{as}$ compared with $M_{a's}$, the more likely is action $a$ at state $s$.

Consider the temporal difference after convergence, ie when $\hat{V}(s) = V^\pi(s)$. In this case, the mean temporal difference under policy $\pi$ is zero, since $V^\pi(s_t)$ is the average value of the actions specified by policy $\pi$. However, if the agent takes action $a$ at state $s_t$, then $\langle \delta_t \rangle = Q^\pi(s_t, a) - V^\pi(s_t)$. If action $a$ is better than average, then $\langle \delta_t \rangle > 0$; if action $a$ is worse than average, then $\langle \delta_t \rangle < 0$. Thus, the policy could be improved by changing the action matrix, and so the actor learning rule is

$$M_{a s_t} \leftarrow M_{a s_t} + \epsilon'\, \delta_t \qquad (14)$$

using just the same temporal difference term as in equation 12. Normally, the action values are changed according to equation 14 before the critic has converged, and there is no proof that the combined estimation and optimization procedure is guaranteed to find an optimal policy. Nevertheless, it is found to work well in practice. Note that learning is based on following, and simultaneously trying to improve, a policy.
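To make the scheme concrete, here is a minimal tabular actor-critic sketch, ours rather than the authors': the critic applies equations 10 and 12, and the actor uses the softmax policy of equation 13 with the update of equation 14. The `step(s, a)` environment interface, the learning rates, and the step count are assumptions.

```python
import numpy as np

def actor_critic(step, n_states, n_actions, gamma=0.9,
                 eps_v=0.1, eps_m=0.1, n_steps=10_000, seed=0):
    """Tabular actor-critic. step(s, a) -> (reward, next_state) is assumed."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)               # critic: value estimates V_hat(s)
    M = np.zeros((n_actions, n_states))  # actor: action matrix of equation 13

    s = 0                                # arbitrary start state
    for _ in range(n_steps):
        # Softmax policy, equation 13 (shifted by the max for numerical stability).
        prefs = np.exp(M[:, s] - M[:, s].max())
        pi = prefs / prefs.sum()
        a = int(rng.choice(n_actions, p=pi))

        r, s_next = step(s, a)

        delta = r + gamma * V[s_next] - V[s]   # temporal difference, equation 10
        V[s] += eps_v * delta                  # critic update, equation 12
        M[a, s] += eps_m * delta               # actor update, equation 14
        s = s_next
    return V, M
```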

4.2 Q-learning
Q-learning is a temporal difference version of value iteration. It applies to equation 6 the same two ideas underlying the actor-critic. That is, it maintains and improves values $\hat{Q}(s, a)$, employing

$$r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a')$$

as a sampled version of the right hand side of equation 6. The overall update at time $t$ is

$$\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \epsilon\, \big( r_t + \gamma \max_{a'} \hat{Q}(s_{t+1}, a') - \hat{Q}(s_t, a_t) \big)$$

As in value iteration, a policy can be defined at any time according to

$$\pi(s) = \mathrm{argmax}_a\, \hat{Q}(s, a)$$

Unlike the actor-critic scheme, circumstances are known under which Q-learning is guaranteed to converge on the optimal policy.
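The following sketch (ours, not from the article) implements this tabular update. It assumes the same `step(s, a)` environment interface as the earlier sketches and takes the exploring behaviour policy as an argument; one such policy is sketched in the next subsection.

```python
import numpy as np

def q_learning(step, behaviour, n_states, n_actions,
               gamma=0.9, epsilon=0.1, n_steps=10_000):
    """Tabular Q-learning.

    step(s, a) -> (reward, next_state)   assumed environment interface
    behaviour(Q, s) -> a                 any exploring behaviour policy
    """
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_steps):
        a = behaviour(Q, s)
        r, s_next = step(s, a)
        # Sampled version of the right-hand side of equation 6.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += epsilon * (target - Q[s, a])
        s = s_next
    # Greedy policy read off from the learned Q-values.
    return Q, Q.argmax(axis=1)
```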

4.3 Exploration and Exploitation


One of the most important issues for the temporal difference learning algorithms is maintaining a balance between exploration and exploitation. That is, the agent must sometimes choose actions that it believes to be suboptimal in order to find out whether they might actually be good. This is particularly true in problems which change over time (which is true of most behavioral experiments), since actions that used to be good might become bad, and vice-versa. The theorems proving that temporal difference methods work usually require much experimentation: all actions must be repeatedly tried in all states. In practice, it is common to choose policies that always embody exploration (such as choosing a random action some small fraction of the time, but otherwise the action currently believed to be best).
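The rule just mentioned (a random action some small fraction of the time, otherwise the action currently believed best) can be packaged as a behaviour policy for the Q-learning sketch of section 4.2; the exploration rate of 0.1 is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, explore=0.1):
    """Mostly the action currently believed best, occasionally a random one."""
    if rng.random() < explore:
        return int(rng.integers(Q.shape[1]))
    return int(Q[s].argmax())

# Example: pass this policy to the Q-learning sketch above.
# Q, policy = q_learning(step, epsilon_greedy, n_states=2, n_actions=2)
```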

5 Extensions to Temporal Difference Learning


We presented a very simple form of temporal difference learning algorithm, but it has been extended in various ways.

First, it is possible to learn the transition matrices and rewards and then use standard dynamic programming methods on the resulting estimated model of the environment (this is called a model-based method). It is also possible to integrate the learning of a model with the temporal difference learning methods, as a way of using the samples of transitions and rewards more effectively. There is substantial evidence that animals build models of their environments, although the way that they use this information to learn appropriate behaviors is not so clear.

Second, the idea behind equation 7 is to use $r_t + \gamma V^\pi(s_{t+1})$ as a stochastic sample of the return, ie to replace all but the first reward term in the infinite sum of equation 2. One could equally well imagine collecting two actual steps of reward, and using $r_t + \gamma r_{t+1} + \gamma^2 V^\pi(s_{t+2})$ as a sample of all but the first two terms in equation 2. Similarly, one could consider all but the first three terms, etc. It turns out to be simple to weigh these different contributions in an exponentially decreasing manner, and this leads to an algorithm called TD($\lambda$), where the value of $\lambda$ is the weighting, taking the value $\lambda = 0$ for equation 7. The remaining bootstrap approximation is the same as before. The significance of $\lambda$ is that it controls a trade-off between bias and variance. If $\lambda$ is near $1$, then the estimate is highly variable (since it depends on long sample paths), but not strongly biased (since real rewards from the reward distribution of the environment are considered). If $\lambda$ is near $0$, then the estimate is not very variable (since only short sample paths are involved), but it can be biased, because of the bootstrapping.

Third, we described the temporal difference algorithm using a unary or table-lookup representation in which we can separately manipulate the value associated with each state $s$. One can also consider parameterized representations of the values (and also the policies), for instance the linear form

$$\hat{V}(s) = \sum_i w_i\, \phi_i(s) \qquad (15)$$

where $\mathbf{w}$ is a set of parameters, and $\phi(s)$ is a population representation for state $s$. In this case, the temporal difference update rule of equation 12 is changed to one for the parameters, $w_i \leftarrow w_i + \epsilon\, \delta_t\, \phi_i(s_t)$. Again, this is quite similar to the delta rule. More sophisticated representational schemes than equation 15 can also be used, including neural networks, in which case the gradient $\partial \hat{V}(s_t)/\partial w_i$ should be used in place of $\phi_i(s_t)$.
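As an illustration of the last two extensions together, the sketch below applies TD($\lambda$) with an eligibility trace to the linear representation of equation 15. The recorded-episode format, the feature function `phi`, the accumulating-trace formulation, and the constants are assumptions of ours, not details from the article.

```python
import numpy as np

def linear_td_lambda(episode, phi, n_features, gamma=0.9, lam=0.7, epsilon=0.05):
    """One pass of TD(lambda) over a recorded episode.

    episode: list of (state, reward, next_state) transitions
    phi:     feature function mapping a state to a length-n_features vector
    """
    w = np.zeros(n_features)       # parameters of the linear value estimate
    e = np.zeros(n_features)       # eligibility trace, decayed by gamma * lambda
    for s, r, s_next in episode:
        v, v_next = w @ phi(s), w @ phi(s_next)
        delta = r + gamma * v_next - v          # temporal difference, equation 10
        e = gamma * lam * e + phi(s)            # accumulate eligibility
        w += epsilon * delta * e                # generalization of equation 12
    return w
```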

6 Formalizing learning in the brain


We have stressed that animals and artificial systems face similar problems in learning how to optimize their behavior in the light of rewards and punishments. Although we have described temporal difference methods from the perspective of the engineering method of dynamic programming, they are also interesting as a way of formalizing behavioral and neurobiological experiments on animals. Briefly (see REINFORCEMENT LEARNING IN THE BRAIN), it has been postulated that a prediction error signal for appetitive events, with some properties like those of the temporal difference signal (equation 11), is broadcast by dopaminergic cells in the ventral tegmental area, to control the learning of predictions, and by dopaminergic cells in the substantia nigra, to control the learning of actions. The model is, as yet, quite incomplete, failing, for example, to specify the interaction between attention and learning, which is critical for accounting for the results of behavioral experiments.

References
Barto, AG, Sutton, RS & Anderson, CW (1983) Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics 13:834-846.


Bellman, RE (1957) Dynamic Programming. Princeton, NJ: Princeton University Press.

Sutton, RS (1988) Learning to predict by the methods of temporal differences. Machine Learning 3:9-44.

Sutton, RS & Barto, AG (1987) A temporal-difference model of classical conditioning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society. Seattle, WA:???

Watkins, CJCH (1989) Learning from Delayed Rewards. PhD Thesis, University of Cambridge, Cambridge, UK.

Further Reading
Barto, AG, Sutton, RS & Watkins, CJCH (1990) Learning and sequential decision making. In M Gabriel & J Moore, editors, Learning and Computational Neuroscience: Foundations of Adaptive Networks. Cambridge, MA: MIT Press, Bradford Books.

Bertsekas, DP & Tsitsiklis, JN (1996) Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.

Sutton, RS & Barto, AG (1998) Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.

