RL+ LSTM
Abstract
∗ Corresponding author.
Email addresses: daan@idsia.ch (Daan Wierstra), alexander@idsia.ch
(Alexander Förster), mail@jan-peters.net (Jan Peters), juergen@idsia.ch
(Jürgen Schmidhuber).
Preprint submitted to Elsevier Journal of Algorithms in Cognition, Informatics and Logic 6 May 2009
1 Introduction
Policy gradient (PG) methods constitute an exception, as these allow for learn-
ing policies even with noisy state information [3], work in combination with
function approximation [4,5], are compatible with policies that have internal
memory [6], can naturally deal with continuous actions [7–9] and are guaranteed to converge at least to a locally optimal policy. Furthermore, most successful
algorithms for solving real world reinforcement learning tasks are applications
of PG methods, see, e.g., [10–12,3,7,13,14] for an overview. Provided the choice
of policy representation is powerful enough, PGs can tackle quite complex RL problems with large numbers of parameters.
Work on policy gradient methods with memory has been scarce so far, largely
limited to finite state controllers. Strikingly, memory models based on fi-
nite state controllers perform less than satisfactorily, even on quite simple
benchmarks (e.g. single pole balancing without velocity information cannot
be learned beyond 1000 time steps [6,18], whereas evolutionary methods and
the algorithm presented in this paper manage to balance the pole for 100,000+ steps). We conjecture that the reason is that for finite state controllers a
stochastic memory state model must be learned in conjunction with a pol-
icy, which is prohibitively expensive. In this paper, we extend policy gradient methods to more sophisticated, memory-capable policy representations, using a recurrent neural network (RNN) architecture called Long Short-Term Memory (LSTM) [19] to represent the policy. We develop a new reinforcement learning al-
gorithm aimed specifically at RNNs that can effectively learn memory-based
policies for deep memory POMDPs. This algorithm, the Recurrent Policy
Gradient (RPG) algorithm, backpropagates the estimated return-weighted el-
igibilities backwards through time using recurrent connections in the RNN.
As a result, policy updates can become a function of any event in the history.
We show that the presented method outperforms other RL methods on three
important RL benchmark tasks with different properties: continuous control
in a non-Markovian double pole balancing environment, and discrete control
on both the deep memory T-maze [20] task (which was designed to test an
RL algorithm’s ability to deal with extremely long term dependencies) and
the still-unsolved (up to human-level performance) stochastic 89-state Maze
task. Moreover, we show promising results in a complex car driving simulation
which is challenging for humans. Here, we show real-time improvement of the policy, something that has largely remained out of reach for reinforcement learning on such complex tasks.
The paper is organized as follows. The next section describes the reinforcement
learning framework and briefly reviews LSTM’s architecture. The subsequent
sections derive the Recurrent Policy Gradient algorithm and
present our experimental results using RPGs with memory. The paper finishes
with a discussion.
2 Preliminaries
2.1 Reinforcement Learning for Recurrent Neural Networks
First, let us introduce the RL framework used in this paper and the corre-
sponding notation. The environment produces a state gt at every time step.
Transitions from state to state are governed by a function p(gt+1 |a1:t , g1:t ) un-
known to the agent but dependent upon all previous actions a1:t executed
by the agent and all previous states g1:t of the system. Note that most reinforcement learning papers need to assume Markovian environments; we will later see that this assumption is not needed for policy gradient methods with internal memory. Let rt be the reward assigned to the agent at time t, and let ot be
the corresponding observation produced by the environment. We assume that
both quantities are governed by fixed distributions p(o|g) and p(r|g), solely
dependent on state g.
In the more general reinforcement setting, we require that the agent has
a memory of the generated experience consisting of finite episodes. Such
episodes are generated by the agent’s operations on the (stochastic) envi-
ronment, executing action at at every time step t, after observing observa-
tion ot and special ‘observation’ rt (the reward) which both depend solely
on gt . We define the observed history 1 ht as the string or vector of ob-
servations and actions up to moment t since the beginning of the episode:
ht = ⟨o0, a0, o1, a1, . . . , ot−1, at−1, ot⟩. The complete history H includes the unobserved states and is given by HT = ⟨hT, g0:T⟩. At any time t, the actor optimizes the return $R_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k$, where 0 < γ < 1 denotes a discount factor.
1 Note that such histories are also called paths or trajectories in the literature.
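For concreteness, the return R_t can be computed for every step of a finite episode by a single backward pass over the rewards. The following minimal Python sketch illustrates this; the example rewards and the value of γ are illustrative only.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{k >= t} gamma^(k - t) * r_k for a finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the end of the episode backwards: R_t = r_t + gamma * R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a pole-balancing-style episode, reward 0 until the final failure step (-1).
print(discounted_returns([0.0, 0.0, 0.0, -1.0], gamma=0.99))
```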
Fig. 1. The Long Short-Term Memory cell. The figure shows an LSTM cell with a
net input, a Constant Error Carousel (CEC), an input gate, a forget gate and an
output gate. The cell has an internal state CEC and a forget gate that determines
how much the CEC is attenuated at each time step. The input gate controls access
to the state by the external inputs and the outputs of other cells, and the output
gate determines how much and when the cell fires.
Recurrent neural networks are designed to deal with issues of time, such as ap-
proximating time series. A crucial feature of this class of architectures is that
they are capable of relating events in a sequence, in principle even if placed
arbitrarily far apart. A typical RNN π maintains an internal state M (ht ) (or
memory) which it uses to pass on (compressed) history information to the
next moment by using recurrent connections. At every time step, the RNN
takes an input vector ot and produces an output vector π(M (ht )) from its
internal state, and since the internal state M(ht) of any step is a function f of the previous state and the current input signal, M(ht) = f(ot, M(ht−1); θ),
it can take into account the entire history of past observations by using its
recurrent connections for recalling events. Not only can RNNs represent mem-
ory, they can, in theory, be used to model any dynamic system [21]. Like
conventional neural networks, they can be trained using a special variant of
backpropagation, backpropagation through time (BPTT) [17,22], usually in a supervised fashion; here, however, we apply this technique in a reinforcement learning setting.
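As a concrete toy illustration of the recurrence M(ht) = f(ot, M(ht−1); θ), the NumPy sketch below implements a plain recurrent layer; the tanh nonlinearity, weight shapes and initialization are illustrative choices and not the LSTM architecture actually used in this paper.

```python
import numpy as np

class SimpleRNN:
    """Minimal recurrent layer: M(h_t) = tanh(W_in o_t + W_rec M(h_{t-1}) + b)."""
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.01, (n_hidden, n_in))
        self.W_rec = rng.normal(0.0, 0.01, (n_hidden, n_hidden))
        self.b = np.zeros(n_hidden)
        self.state = np.zeros(n_hidden)   # M(h_{t-1}): compressed history information

    def step(self, o_t):
        # The new internal state depends on the current observation and the old state,
        # so the output can in principle depend on the entire observation history.
        self.state = np.tanh(self.W_in @ o_t + self.W_rec @ self.state + self.b)
        return self.state
```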
RNNs have attracted some attention in the past decade because of their sim-
plicity and potential power. However, though powerful in theory, they turn out
to be quite limited in practice due to their inability to capture long-term time
dependencies – they suffer from the vanishing gradient problem [23,24]: the gradient signal vanishes as the error signal is propagated back through time. Because of this, events more than 10 time steps apart can
typically not be related.
3 Recurrent Policy Gradients

In this section, we first formally derive the Recurrent Policy Gradient framework. Subsequently, history-dependent baselines are introduced, and the section is concluded with a description of the Recurrent Policy Gradient algorithm.
3.1 Derivation of Recurrent Policy Gradients
The type of RL algorithm we employ in this paper falls in the class of policy
gradient algorithms, which, unlike many other (notably TD) methods, update
the agent’s policy-defining parameters θ directly by estimating a gradient in
the direction of higher (average or discounted) reward.
Now, let R(H) be some measure of the total reward accrued during a history.
R(H) could be the average of the rewards for the average reward case, or the
discounted sum for the discounted case. Let p(H|θ) denote the probability of a history given policy-defining weights θ. The quantity the algorithm should be optimizing is $J = \int_H p(H|\theta) R(H)\, dH$. This, in essence, indicates the expected reward over all possible histories, weighted by their probabilities under policy π. In order to be able to apply gradient ascent to find a better policy, we have to find the gradient ∇θ J, which can then be used to incrementally update the parameters θ of policy π in small steps. Since we know that rewards R(H) for a given history H do not depend on the policy parameters θ (that is, ∇θ R(H) = 0), we can write $\nabla_\theta J = \nabla_\theta \int_H p(H|\theta) R(H)\, dH = \int_H \nabla_\theta p(H|\theta) R(H)\, dH$. Now,

$$
\nabla_\theta J = \nabla_\theta \int p(H)\, R(H)\, dH
= \int \frac{p(H)}{p(H)}\, \nabla_\theta p(H)\, R(H)\, dH
= \int p(H)\, \nabla_\theta \log p(H)\, R(H)\, dH .
$$
Taking the sample average as a Monte Carlo (MC) approximation of this expectation over N trial histories, we get

$$
\nabla_\theta J = \mathbb{E}_H\!\left[\nabla_\theta \log p(H)\, R(H)\right] \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log p(H^n)\, R(H^n),
$$

which is a fast approximation of the policy gradient for the current policy, with a convergence speed of $O(N^{-1/2})$ to the true gradient, independent of the number of parameters of the policy (i.e., the number of elements of the gradient).
The probability of a complete history $H_T$ is the product of the probabilities of all actions and observations given the respective subhistories:

$$
p(H_T) = p(\langle o_0, g_0 \rangle) \prod_{t=1}^{T} p(\langle o_t, g_t \rangle \mid h_{t-1}, a_{t-1}, g_{0:t})\, \pi(a_{t-1} \mid h_{t-1}).
$$
Taking the logarithm transforms this large product into a sum, $\log p(H_T) = \text{const} + \sum_{t=0}^{T} \log \pi(a_t \mid h_t)$, where most terms are not affected by θ, i.e., are constant. Thus, when taking the derivative of this term, we obtain

$$
\nabla_\theta \log p(H_T) = \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t \mid h_t),
$$

which yields the desired gradient estimator involving only observable variables.
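To make the resulting estimator concrete, the sketch below computes the Monte Carlo gradient estimate for a deliberately simple, memoryless linear-softmax policy π(a|o) = softmax(θo); the paper instead uses an LSTM policy, and `sample_episode` is a hypothetical helper that rolls out one episode under the current parameters and returns its observations, actions and rewards.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_estimate(theta, sample_episode, n_episodes=20):
    """Monte Carlo estimate of grad_theta J = E_H[grad_theta log p(H) R(H)], using
    grad_theta log p(H) = sum_t grad_theta log pi(a_t | h_t). Illustration only:
    theta has shape (n_actions, n_obs) and the policy is memoryless."""
    grad = np.zeros_like(theta)
    for _ in range(n_episodes):
        observations, actions, rewards = sample_episode(theta)
        R = sum(rewards)                       # return R(H) of this episode
        for o, a in zip(observations, actions):
            probs = softmax(theta @ o)
            one_hot = np.zeros(len(probs))
            one_hot[a] = 1.0
            # Gradient of log softmax_a(theta o) with respect to theta:
            grad += np.outer(one_hot - probs, o) * R
    return grad / n_episodes
```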
However, the variance of this gradient estimator typically grows linearly with T. One way to tackle such problems and reduce this variance is to include a constant baseline b (first introduced by Williams [27]) in the gradient estimate,

$$
\nabla_\theta J \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t^n \mid h_t^n)\, (R_t^n - b).
$$

Baseline b is typically taken to be the expected average return and is subtracted from the actual return, such that the resulting quantity (Rt − b) intuitively indicates whether the return was better or worse than expected. Due to the likelihood-ratio trick, $\int p(H)\, \nabla_\theta \log p(H)\, b\, dH = \nabla_\theta \int p(H)\, b\, dH = b\, \nabla_\theta 1 = 0$, the baseline can only reduce the variance but cannot bias the gradient in any way [27].
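The experiments in Section 4 mostly use such a constant baseline in the form of a simple moving average of the returns observed so far; one simple way to track it is an exponential moving average, sketched below (the exponential form and the decay constant are illustrative choices).

```python
class MovingAverageBaseline:
    """Constant (history-independent) baseline b tracking the average observed return."""
    def __init__(self, decay=0.99):
        self.decay = decay
        self.b = 0.0

    def update(self, observed_return):
        # Exponential moving average of the returns seen so far.
        self.b = self.decay * self.b + (1.0 - self.decay) * observed_return
        return self.b
```

In the gradient estimate above, each return R_t is then replaced by the advantage (R_t − b).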
To reduce the variance further, the constant baseline can be replaced by a history-dependent baseline, implemented as a second recurrent neural network receiving observations and actions as inputs, trained to predict future return given the current policy π. This construct closely resembles the concept of
value functions in temporal difference methods. However, note that we do not
use temporal difference methods for training the history-dependent baseline
network (since such updates can be arbitrarily bad in partially observable
environments [2]), but apply supervised training using simply the actually ex-
perienced returns as targets for every time step. Using non-constant, history-
dependent baselines, our algorithm now uses two separate RNNs: one policy
π parameterized by θ, and one baseline network B parameterized by w. Us-
ing the extended baseline network, the gradient update for the policy now
becomes

$$
\nabla_\theta J \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t^n \mid h_t^n)\, \bigl(R_t^n - B(h_t^n)\bigr).
$$
Only the output part of the neural network is interpreted stochastically. This
allows us, during the backward pass, to only estimate the eligibilities of the
output units at every time step. The gradient on the other parameters θ can
be derived efficiently via eligibility backpropagation through time, treating
output eligibilities like we would treat normal errors (‘deltas’) in an RNN
trained with gradient descent. Also, by having only stochastic output units,
we do not have to compute complicated gradients on stochastic internal (belief) states as is done in [6,18] – eligibility backpropagation through time
disambiguates relevant hidden state automatically, when possible.
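The gradient just described – return-weighted output eligibilities backpropagated through time – can equivalently be obtained with reverse-mode automatic differentiation: summing the log-probabilities of the chosen actions, weighted by the baseline-corrected returns, and differentiating sends exactly these eligibilities backwards through the recurrent connections. The PyTorch sketch below illustrates this for an LSTM policy and an LSTM baseline; it is not the authors' implementation, and the module layout, the optimizers and the single-episode (N = 1) update are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentNet(nn.Module):
    """LSTM followed by a linear read-out; used both for the policy and the baseline."""
    def __init__(self, n_obs, n_hidden, n_out):
        super().__init__()
        self.lstm = nn.LSTM(n_obs, n_hidden)
        self.head = nn.Linear(n_hidden, n_out)

    def forward(self, obs_seq):               # obs_seq: (T, 1, n_obs)
        h_seq, _ = self.lstm(obs_seq)         # the LSTM state carries the history h_t
        return self.head(h_seq)               # (T, 1, n_out)

def rpg_update(policy, baseline, obs_seq, actions, returns, policy_opt, baseline_opt):
    """One update from a single episode: obs_seq (T, 1, n_obs), actions (T,) long,
    returns (T,) float tensor of discounted returns R_t."""
    logits = policy(obs_seq).squeeze(1)                       # (T, n_actions)
    log_pi = torch.distributions.Categorical(logits=logits).log_prob(actions)
    b = baseline(obs_seq).squeeze(1).squeeze(-1)              # history-dependent B(h_t)

    # backward() propagates d log pi / d(outputs), weighted by (R_t - B(h_t)),
    # through time via the recurrent connections (eligibility backpropagation).
    policy_loss = -(log_pi * (returns - b.detach())).sum()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # The baseline is trained by plain regression on the observed returns,
    # not by temporal-difference learning.
    baseline_loss = ((b - returns) ** 2).mean()
    baseline_opt.zero_grad()
    baseline_loss.backward()
    baseline_opt.step()
```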
4 Experiments
All experiments were carried out with 10-cell LSTMs. The baseline estimator
used was simply a moving average of the return received at any time step,
except for the 89-state Maze task, where an additional LSTM network was
used to estimate a history-dependent baseline.
4.1 Continuous Control: Partially Observable Single & Double Pole Balancing
This task involves trying to balance a pole hinged on a cart that moves on
a finite track (see Figure 2). The single control consists of the force F ap-
plied to the cart (in Newtons), and observations usually include the cart’s
position x, the pole’s angle β and velocities ẋ and β̇. It provides a perfect
testbed for algorithms focussing on learning fine control in continuous state
and action spaces. However, recent successes in the RL field have made the standard pole balancing setup too easy and therefore obsolete.
          Markov          non-Markov
1 pole    863 ± 213       1893 ± 527
2 poles   4981 ± 1386     5649 ± 1548
Fig. 2. The non-Markov double pole balancing task. The task consists of a moving
cart on a track, with two poles of different lengths (1m and 0.1m) hinged on top.
The controller applies a (continuous) force F to the cart at every time step, after
observing pole angles β1 and β2 . The objective is to indefinitely keep the poles from
falling. The table shows the results for RPGs on the pole balancing task, for the
four possible cases investigated in this paper: 1 pole Markov, 2 poles Markov, 1 pole
non-Markov, and 2 poles non-Markov. The results show the mean and standard
deviation of the number of evaluations until the success criterion was reached, that
is, when a run lasts more than 10,000 time steps. Results are computed over 20
runs.
To make the task more challenging, we (1) remove velocity information ẋ and β̇ such that
the problem becomes non-Markov, and (2) add a second pole to the same cart,
of length 1/10th of the original one. This yields non-Markovian double pole
balancing [28], a truly challenging task that has not been solved by any other
single-agent RL method but RPGs.
We applied RPGs to the pole balancing task, using a Gaussian output struc-
ture for our LSTM RNN, consisting of two output neurons: a mean µ (which
was interpreted linearly) and a standard deviation σ (which was scaled with
the logistic function between 0 and 1 in order to prevent variances from being
negative), where eligibilities were calculated according to [27]. We use a learning rate of ασ² (as suggested by Williams [27]) to prevent numerical instabilities when variances tend to 0, with α = 0.001, momentum 0.9 and discount factor γ = 0.99. Initial parameters θ were initialized randomly
between −0.01 and 0.01. Reward was always 0.0, except for the last time step
when one of the poles falls over, where it is −1.0.
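For the Gaussian output structure, the eligibilities of the two output units follow Williams [27]: with a ∼ N(µ, σ²), ∂ log π/∂µ = (a − µ)/σ² and ∂ log π/∂σ = ((a − µ)² − σ²)/σ³. A minimal sketch of these output eligibilities, including the logistic squashing of σ and the variance-scaled step size ασ², is given below; the numerical values in the usage lines are illustrative, and the gradients for the network weights themselves are obtained by backpropagating these eligibilities through time as described in Section 3.

```python
import math

def gaussian_output_eligibilities(a, mu, sigma_net):
    """Eligibilities (d log pi / d output) for a Gaussian action unit, following
    Williams' REINFORCE formulas; sigma is squashed with the logistic function."""
    sigma = 1.0 / (1.0 + math.exp(-sigma_net))            # keep sigma in (0, 1)
    d_mu = (a - mu) / sigma ** 2                          # eligibility of the mean unit
    d_sigma = ((a - mu) ** 2 - sigma ** 2) / sigma ** 3   # eligibility w.r.t. sigma
    d_sigma_net = d_sigma * sigma * (1.0 - sigma)         # chain rule through the logistic
    return d_mu, d_sigma_net, sigma

# Variance-scaled learning rate alpha * sigma^2 keeps the update bounded as sigma -> 0.
alpha = 0.001
d_mu, d_sig_net, sigma = gaussian_output_eligibilities(a=0.3, mu=0.1, sigma_net=-1.0)
step_mu = alpha * sigma ** 2 * d_mu
```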
A run was considered a success when the pole(s) did not fall over for 10,000
time steps. Figure 2 shows results averaged over 20 runs. RPGs clearly outper-
form earlier PG methods (for a comparison, see [18]’s finite state controller,
which cannot balance a single pole in a partially observable setting for more
than 1000 time steps, even after 500,000 trials). As far as we are aware, RPGs
constitute the only published single-agent approach that can satisfactorily
solve this problem.
Fig. 3. The T-maze task. The agent observes its immediate surroundings and is
capable of the actions goNorth, goEast, goSouth, and goWest. It starts in the po-
sition labeled ‘S’, there and only there observing either the signal ‘up’ or ‘down’,
indicating whether it should go up or down at the end of the alley. It receives a
reward if it goes in the correct direction, and a punishment if not. In this example,
the direction is ‘up’ and N , the length of the alley, is 35.
4.2 Reinforcement Learning in Discrete POMDPs
In this section, we show the high performance of our algorithm for traditional
discrete POMDP problems. The Long Term Dependency T-maze from Section
4.2.1 is a standard benchmark for learning deep-memory POMDPs while the
89-state Maze in Section 4.2.2 is a problem where humans are still able to
beat the best known algorithmically learned policy (e.g. see [20]).
4.2.1 Long Term Dependency T-maze

The second experiment was carried out on the T-maze [20] (see Figure 3).
Designed to test an RL algorithm’s ability to correlate events far apart in
history, it involves having to learn to remember the observation from the first
time step until the episode ends. At the first time step, it starts at position
S and perceives the X either north or south – meaning that the goal state
G is in the north or south part of the T-junction, respectively. Additionally,
the agent perceives its immediate surroundings. The agent has four possible
actions: North, East, South and West. These discrete actions are represented
in the network as a softmax layer. When the agent makes the correct decision
at the T-junction, i.e. go south if the X was south and north otherwise, it
receives a reward of 4.0, otherwise a reward of -0.1. In both cases, this ends
the episode. Note that the corridor length N can be increased to make the
problem more difficult, since the agent has to learn to remember the initial
‘road sign’ for N + 1 time steps. In Figure 3 we see an example T-maze with
corridor length 35.
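For readers who wish to experiment with this benchmark, a minimal simulator consistent with the description above might look as follows; the observation encoding and the handling of wall collisions are simplifying assumptions and differ in detail from the benchmark of [20].

```python
import random

class TMaze:
    """Simplified T-maze: a corridor of length N followed by a T-junction.
    The road sign (up/down) is visible only at the start position."""
    ACTIONS = ("North", "East", "South", "West")

    def __init__(self, corridor_length=10):
        self.n = corridor_length

    def reset(self):
        self.pos = 0
        self.goal_is_north = random.random() < 0.5
        return self._observe()

    def _observe(self):
        if self.pos == 0:                     # road sign, visible only at the start
            return "sign_up" if self.goal_is_north else "sign_down"
        return "corridor" if self.pos < self.n else "junction"

    def step(self, action):
        """Return (observation, reward, done). Moving into a wall simply leaves the
        position unchanged (a simplifying assumption)."""
        if self.pos < self.n:
            if action == "East":
                self.pos += 1
            return self._observe(), 0.0, False
        # At the junction the agent must recall the sign seen N steps earlier.
        correct = "North" if self.goal_is_north else "South"
        return self._observe(), (4.0 if action == correct else -0.1), True
```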
Corridor length N was systematically varied from 10 to 100, and for each
length, 10 runs were performed. Training was performed in batches of 20, normalizing the gradient to length 0.3. A discount factor of γ = 0.98 was used. In
Figure 4 the results are displayed, together with the results of other algorithms (RL-Elman and RL-LSTM) taken from [20], of which RL-LSTM's were the best reported so far. We can see that RPGs clearly outperform
the value-based methods, even by more than an order of magnitude in terms of iterations for corridor lengths longer than 40.
Fig. 4. T-maze results. Elman-based Value Iteration (Elman VI) starts to degrade af-
ter corridor length N = 10, LSTM Value Iteration (LSTM VI) falters after N = 50,
while Recurrent Policy Gradients’ performance starts to degrade at length N = 100.
The plot shows the number of iterations required to solve the task, averaged
over the successful runs. RPGs clearly outperform other RL methods on this task,
to the best of the authors’ knowledge. (The results for the Value Iteration based
algorithms are taken from [20]).
Additionally, RPGs are able
to solve this task up to N=90, while the second best algorithm, RL-LSTM,
solves it up to N=50. The large performance gain on this task for Recur-
rent Policy Gradients might be due to the difference in complexity of learning
a simple (memory-based) policy versus learning unnecessarily complex value
functions. Nevertheless, the impressive performance advantage of RPGs over value-based methods on this domain indicates a possibly significant poten-
tial for the application of Recurrent Policy Gradients to other deep-memory
domains.
4.2.2 The 89-state Maze

In this extremely stochastic benchmark task (see Figure 5; see [29] for a com-
plete description) the aim for the agent is to get to the goal as fast as possible
(where the reward is 1, other locations have reward 0) from a random starting
position and orientation, but within 251 time steps. For reward attribution,
discount factor γ = 0.98 is used. The agent has not only a position, but also
an orientation, and its actions consist of moving forward, turning left, turn-
ing right, turning about, and doing nothing. State transitions are extremely
noisy. Observations, which consist of 4 bits representing adjacent wall infor-
mation (wall or no wall), are noisy and are inverted with probability 10%,
which sets the chance of getting the correct observation somewhere between
0.65 and 0.81, depending on the agent's location. It is interesting to note that, to the authors' knowledge, this domain has as yet not been satisfactorily solved by any algorithm up to the level of human performance.
Fig. 5. The 89-state maze. In this extremely stochastic maze, the agent has a posi-
tion, an orientation, and can execute five different actions: forward, turnleft, turn-
right, turnabout, and doNothing. The agent starts every trial in a random position.
Its goal is to move to the square labeled ‘G’. Observations comprise the local walls
but are noisy (there is a high probability of observing walls where there are none
and vice versa). Action outcomes are noisy and cannot be relied on. See [29] for a
complete description of this problem domain.
Because of the random starting position, this task is extremely difficult with-
out the use of any history-dependent baseline, since the agent might start
close to the target or not, which influences the expected rewards accordingly.
That is why we apply a history-dependent baseline for this task, trained with
α = 0.001 and momentum = 0.9 after every episode. 20 runs were performed
to test the performance of the algorithm, using a history-dependent baseline
which was trained on actually received returns using a separate LSTM network with 10 memory cells and the same inputs as the policy network, including a bias. Each run was executed for 30,000,000 iterations. After that, the re-
sulting policy was evaluated. The median number of steps to achieve the goal
(in case the goal is achieved) was 58, and the goal was reached in 95% of
the trials. This compares favorably with the best other (model-free) method the authors are aware of, Bakker's RL-LSTM algorithm [30], which achieves 61 steps and 93.9%, respectively. In [29] a human performance of 29 steps and
100% is highlighted, which again underlines the difficulty of the task. Nevertheless, the fact that RPGs outperform all other algorithms on this task indicates that the application of Recurrent Policy Gradients to RNNs, especially in combination with history-dependent baselines, may indeed be fruitful.
4.3 Car Driving Simulation

For our final experiment we used the TORCS racing car simulator [31] (see Figure 6). Open source, the game was specifically designed for programming competitions be-
tween steering agents, and the code framework allows for easy plug-ins of
code snippets for competitions between preprogrammed drivers. As such, it
provides a perfect testbed for reinforcement learning algorithms that aim to
go beyond the current benchmark standards.
We trained our RPG agent on one single track (see Figure 6), on which it
has to learn to drive a Porsche GT1 and stay on the road while achieving
high speed. Whenever the car gets stuck off the road, a learning episode ends,
the car is put back on track and a new episode begins. The steering outputs
of the RNN, which were executed at a rate of 30 frames per second, are
interpreted as a Gaussian with one output neuron interpreted linearly (µ, the
mean) and one output neuron interpreted logistically between 0 and 1 (σ, the
standard deviation) to ensure it is nonnegative. The inputs, which were normalized to mean 0 and standard deviation 1, include a bias, the speed, the steering angle, the position on the road and the look-ahead distance (which was varied linearly with speed). The rewards consist of speed measurements at every time step, and the agent receives negative rewards for spending time off track. A large penalty is inflicted when the car gets stuck off track, which ends the episode. The
car’s speed starts off at 10 km/h, which is gradually increased over time to
reach 70 km/h after 30 minutes.
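Based on this description, the reward signal can be sketched roughly as follows; the paper does not report the exact constants, so the penalty values below are purely illustrative assumptions.

```python
def driving_reward(speed_kmh, on_track, stuck,
                   off_track_penalty=-1.0, stuck_penalty=-100.0):
    """Sketch of the reward described in the text: the car's speed at every step,
    a negative reward while off the track, and a large penalty when the car gets
    stuck (which also ends the episode). All constants are illustrative."""
    if stuck:
        return stuck_penalty
    if not on_track:
        return off_track_penalty
    return speed_kmh
```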
We performed 10 runs on this lap using the same learning settings and 10-
cell network as applied to the non-Markovian double pole balancing task. The
baseline was updated after every 100 time steps. We found that the agent
learns, for all runs, to consistently steer and stay on the road after just under
2 minutes of real-time behavior. In all runs, the car first drives off the track
immediately four or five times, then learns to stay on track until it hits the
first curve, where it slides off again. Within two minutes, however, it drives
nearly perfectly in the middle of the road, and learns to ‘cut curves’ slightly
when the speed is increased gradually to 70 km/h after 30 minutes. The agent
can learn to drive safely – not getting off track – up to 70 km/h, after which its behavior destabilizes in all runs. Future work will investigate how to make
the behavior more robust and how to cope with higher speeds. This will have
to include speed control and braking by the network as well, which could be
actualized using additional (softmax) output neurons for gears, brakes and
gas. The fastest lap time achieved after 30 minutes of training was just under
3 minutes, which is, unfortunately, still twice the lap time of a trained human player or of our preprogrammed agent.
Fig. 6. The TORCS racing car simulator.
Nevertheless, the fast learning suggests this approach might be worth investigating when dealing with real-time learning problems in continuous robot control.
5 Conclusion
Acknowledgments
This research was funded by SNF grant 200021-111968/1 and by the 6th FP
of the EU (project number IST-511931).
References
on Machine Learning (ICML 1994), Morgan Kaufmann, San Francisco, CA,
1994, pp. 284–292.
[7] J. Peters, S. Schaal, Policy gradient methods for robotics, in: Proceedings of
the IEEE International Conference on Intelligent Robots and Systems (IROS),
Beijing, China, 2006, pp. 2219 – 2225.
[8] N. Kohl, P. Stone, Policy gradient reinforcement learning for fast quadrupedal
locomotion, in: Proceedings of the IEEE International Conference on Robotics
and Automation, Vol. 3, 2004, pp. 2619–2624.
[9] J. Peters, S. Schaal, Reinforcement learning of motor skills with policy gradients,
Neural Networks 21 (4) (2008) 682–97.
[17] P. Werbos, Backpropagation through time: What it does and how to do it, in:
Proceedings of the IEEE, Vol. 78, 1990, pp. 1550–1560.
[18] N. Meuleau, L. Peshkin, K.-E. Kim, L. P. Kaelbling, Learning finite-state
controllers for partially observable environments, in: Proc. Fifteenth Conference
on Uncertainty in Artificial Intelligence (UAI ’99). Morgan Kaufmann, 1999, pp.
427–436.
[19] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation
9 (8) (1997) 1735–1780.
[20] B. Bakker, Reinforcement learning with Long Short-Term Memory, in:
Advances in Neural Information Processing Systems 14, MIT Press, 2002, pp.
1475–1482.
[21] H. T. Siegelmann, E. D. Sontag, Turing computability with neural nets, Applied
Mathematics Letters 4 (6) (1991) 77–80.
[22] R. J. Williams, D. Zipser, A learning algorithm for continually running fully
recurrent networks, Neural Computation 1 (2) (1989) 270–280.
[23] Y. Bengio, P. Simard, P. Frasconi, Learning long-term dependencies with
gradient descent is difficult, IEEE Transactions on Neural Networks 5 (2) (1994)
157–166.
[24] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber, Gradient flow in
recurrent nets: the difficulty of learning long-term dependencies, in: S. C.
Kremer, J. F. Kolen (Eds.), A Field Guide to Dynamical Recurrent Neural
Networks, IEEE Press, 2001, pp. 237–244.
[25] F. A. Gers, N. Schraudolph, J. Schmidhuber, Learning precise timing with
LSTM recurrent networks, Journal of Machine Learning Research 3 (2002) 115–
143.
[26] J. Schmidhuber, RNN overview, http://www.idsia.ch/~juergen/rnn.html
(2004).
[27] R. J. Williams, Simple statistical gradient-following algorithms for connectionist
reinforcement learning, Machine Learning 8 (1992) 229–256.
[28] A. Wieland, Evolving neural network controllers for unstable systems, in:
Proceedings of the International Joint Conference on Neural Networks (Seattle,
WA), Piscataway, NJ: IEEE, 1991, pp. 667–673.
[29] M. Littman, A. Cassandra, L. Kaelbling, Learning policies for partially
observable environments: Scaling up, in: A. Prieditis, S. Russell (Eds.),
Machine Learning: Proceedings of the Twelfth International Conference,
Morgan Kaufmann Publishers, San Francisco, CA, 1995, pp. 362–370.
[30] B. Bakker, The state of mind: Reinforcement learning with recurrent neural
networks, Ph.D. thesis, Leiden University, the Netherlands (2004).
[31] TORCS, The Open Racing Car Simulator, http://torcs.sourceforge.net/ (2007).