
Reinforcement Learning as Classification:

Leveraging Modern Classifiers

Michail G. Lagoudakis    mgl@cs.duke.edu
Ronald Parr    parr@cs.duke.edu
Department of Computer Science, Duke University, Durham, NC 27708 USA
Abstract

The basic tools of machine learning appear in the inner loop of most reinforcement learning algorithms, typically in the form of Monte Carlo methods or function approximation techniques. To a large extent, however, current reinforcement learning algorithms draw upon machine learning techniques that are at least ten years old and, with a few exceptions, very little has been done to exploit recent advances in classification learning for the purposes of reinforcement learning. We use a variant of approximate policy iteration based on rollouts that allows us to use a pure classification learner, such as a support vector machine (SVM), in the inner loop of the algorithm. We argue that the use of SVMs, particularly in combination with the kernel trick, can make it easier to apply reinforcement learning as an “out-of-the-box” technique, without extensive feature engineering. Our approach opens the door to modern classification methods, but does not preclude the use of classical methods. We present experimental results in the pendulum balancing and bicycle riding domains using both SVMs and neural networks for classifiers.

1. Introduction

Reinforcement learning provides an intuitively appealing framework for addressing a wide variety of planning and control problems (Sutton & Barto, 1998). Compelling convergence results exist for small state spaces (Jaakkola et al., 1994) and there has been some success in tackling large state spaces through the use of value function approximation and/or search through a space of parameterized policies (Williams, 1992).

Despite the successes of reinforcement learning, some frustration remains about the extent and difficulty of feature engineering required to achieve success. This is true both of value function methods, which require a rich set of features to represent accurately the value of a policy, and policy search methods, which require a parameterized policy function that is both rich enough to express interesting policies, yet smooth enough to ensure that good policies can be discovered. The validity of such criticisms is debatable (since nearly all practical applications of machine learning methods require some feature engineering), but there can be no question that recent advances in classifier learning have raised the bar. For example, it is not uncommon to hear anecdotal reports of the naive application of support vector machines matching or exceeding the performance of classical approaches with careful feature engineering.

We are not the first to note the potential benefits of modern classification methods to reinforcement learning. For example, Yoon et al. (2002) use inductive learning techniques, including bagging, to generalize across similar problems. Dietterich and Wang (2001) also use a kernel-based approximation method to generalize across similar problems. The novelty in our approach is its orientation towards the application of modern classification methods within a single, noisy problem at the inner loop of a policy iteration algorithm. By using rollouts and a classifier to represent policies, we avoid the sometimes problematic step of value function approximation. Thus we aim to address the critiques of value function methods raised by the proponents of direct policy search, while avoiding the confines of a parameterized policy space.

We note that in recent work, Fern, Yoon and Givan (2003) also examine policy iteration with rollouts and an inductive learner at the inner loop. However, their emphasis is different. They focus on policy space bias as a means of searching a rich space of policies, while we emphasize modern classifiers as a method of recovering high-dimensional structure in policies.

2. Basic definitions and algorithms

In this section we review the basic definitions for Markov Decision Processes (MDPs), policy iteration, approximate policy iteration and rollouts. This is intended primarily as a review and to familiarize the reader with our notation. More extensive discussions of these topics are widely available (Bertsekas & Tsitsiklis, 1996).

Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.
2.1. MDPs

An MDP is defined as a 6-tuple (S, A, P, R, γ, D) where: S is the state space of the process; A is a finite set of actions; P is a Markovian transition model, where P(s, a, s′) is the probability of making a transition to state s′ when taking action a in state s; R is a reward (or cost) function, such that R(s, a) is the expected reward for taking action a in state s; γ ∈ [0, 1) is the discount factor for future rewards; and D is the initial state distribution from which states are drawn when the process is initialized.

In reinforcement learning, it is assumed that the learner can observe the state of the process and the immediate reward at every step; however, P and R are completely unknown. In this paper, we also make the assumption that our learning algorithm has access to a generative model of the process, which is a black box that takes a state s and an action a as inputs and outputs a next state s′ drawn from P and a reward r. Note that this is not the same as having the model (P and R) itself.

A deterministic policy π for an MDP is a mapping π : S → A from states to actions, where π(s) is the action the agent takes at state s. The value Vπ(s) of a state s under a policy π is the expected, total, discounted reward when the process begins in state s and all decisions at all steps are made according to policy π:

V_\pi(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R\big(s_t, \pi(s_t)\big) \,\middle|\, s_0 = s \right].

The goal of the decision maker is to find an optimal policy π* that maximizes the expected, total, discounted reward from the initial state distribution D:

\pi^* = \arg\max_{\pi} \eta(\pi, D) = E_{s \sim D}\left[ V_\pi(s) \right].

It is known that for every MDP, there exists at least one optimal deterministic policy.

2.2. Policy Iteration

Policy iteration is a method of discovering such a policy by iterating through a sequence π1, π2, ..., πk of improving policies. The iteration terminates when there is no change in the policy (πk = πk−1), and πk is the optimal policy. This is typically achieved by computing Vπi, which can be done iteratively or by solving a system of linear equations, and then determining a set of Q-values:

Q_{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s'} P(s, a, s')\, V_{\pi_i}(s'),

and the improved policy as

\pi_{i+1}(s) = \arg\max_{a} Q_{\pi_i}(s, a).

In practice, policy iteration terminates in a surprisingly small number of steps. However, it relies on the availability of the full model of the MDP, exact evaluation of each policy, and exact representation of each policy.

2.3. Approximate methods

Approximate methods are frequently used when the state space of the underlying MDP is extremely large and exact (solution or learning) methods fail. A general framework for using approximation within policy iteration is known as approximate policy iteration (API). In its most general form, API assumes a policy π̂i at each step, which may not necessarily be an exact implementation of the improved policy from the previous time step. Typically, the error introduced is bounded by the maximum norm (L∞) error in what is presumed to be an approximate Q̂_{π_{i−1}} that was used to generate π̂i (Bertsekas & Tsitsiklis, 1996). API cannot guarantee monotonic improvement and convergence to the optimal policy. However, in practice it often finds very good policies in a few iterations, since it normally makes big steps in the space of possible policies. This is in contrast to policy gradient methods which, despite acceleration methods, are often forced to take very small steps.

LSPI (Lagoudakis & Parr, 2001) is an example of an API algorithm. It uses temporal differences of sample transitions between state-action pairs to approximate each Qπ. It is also possible to use a technique closer to pure Monte Carlo evaluation called rollouts. Rollouts estimate Qπ(s, a) by executing action a in state s and following policy π thereafter, while recording the total discounted reward obtained during the run. This process is repeated several times, and it requires a simulator with resets to arbitrary states (or a generative model), since a large number of trajectories is required to obtain an accurate estimate of Qπ(s, a). Rollouts were first used by Tesauro and Galperin (1997) to improve online performance of a backgammon player. (A supercomputer did rollouts before selecting each move.) However, they can also be used to specify a set of target values for a function approximator used within API to estimate Qπ. The rollout algorithm is summarized in Figure 1.
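For concreteness, a minimal Python sketch of this rollout estimator follows; the generative model is assumed to be a callable simulate(s, a) that returns a sampled next state and reward, and all names and signatures here are illustrative rather than taken from the paper.

```python
def rollout(simulate, s, a, policy, gamma, K, T):
    """Monte Carlo estimate of Q_pi(s, a): execute a in s, then follow
    policy for T - 1 further steps, averaging the discounted return over K trials."""
    total = 0.0
    for _ in range(K):
        state, reward = simulate(s, a)          # first step uses the queried action
        q = reward
        for t in range(1, T):
            state, reward = simulate(state, policy(state))
            q += (gamma ** t) * reward          # later rewards discounted by gamma^t
        total += q
    return total / K
```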
Rollout (M, s, a, γ, π, K, T)
// M : Generative model
// (s, a) : State-action pair whose value is sought
// γ : Discount factor
// π : Policy
// K : Number of trajectories
// T : Length of each trajectory

for k = 1 to K
    (s′, r) ← Simulate(M, s, a)
    Q̃_k ← r
    for t = 1 to T − 1
        (s′, r) ← Simulate(M, s′, π(s′))
        Q̃_k ← Q̃_k + γ^t r
Q̃ ← (1/K) Σ_{k=1..K} Q̃_k
return Q̃

Figure 1. Estimation of state-action values using rollouts.

3. API without Value Functions

The dependence of typical API algorithms on value functions places continuous function approximation methods at the inner loop of most approximate policy iteration algorithms. These methods typically minimize L2 error, which is a poor match with the L∞ bounds for API. This problem is not just theoretical. Efforts to improve performance, such as adding new features to a neural network or new basis functions to LSPI, don’t always have the expected effect. Increasing expressive power can sometimes lead to surprisingly worse performance, which can make feature engineering a somewhat tedious and counterintuitive task.

// M : Generative model
// Dρ : Source of rollout states
// γ : Discount factor
// π0 : Initial policy (default: uniformly random)
// K : Number of trajectories
// T : Length of each trajectory

π′ = π0
repeat
    π = π′
    TS = ∅
    for each s ∈ Dρ
        for each a ∈ A
            Q̃π(s, a) ← Rollout(M, s, a, γ, π, K, T)
        a* = arg max_{a ∈ A} Q̃π(s, a)
        if ∀ a ∈ A, a ≠ a* : Q̃π(s, a) < Q̃π(s, a*)
            TS ← TS ∪ {(s, a*)+}
        for each a ∈ A such that Q̃π(s, a) < Q̃π(s, a*)
            TS ← TS ∪ {(s, a)−}
    π′ = Learn(TS)
until (π ≈ π′)
return π

Figure 2. Approximate Policy Iteration with Rollouts.
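A corresponding Python sketch of the main loop in Figure 2, reusing the rollout function from the sketch above; learn stands in for any supervised classifier that maps a labeled training set to a policy, and all names are illustrative assumptions.

```python
def api_with_rollouts(simulate, rollout_states, actions, policy0,
                      gamma, K, T, learn, max_iters=10):
    """Approximate policy iteration with rollouts (cf. Figure 2): estimate
    Q values by rollouts at each rollout state, label the state with the
    dominating action, and fit a classifier to obtain the next policy."""
    policy = policy0
    for _ in range(max_iters):
        training_set = []
        for s in rollout_states:
            q = {a: rollout(simulate, s, a, policy, gamma, K, T) for a in actions}
            a_star = max(q, key=q.get)
            if all(q[a] < q[a_star] for a in actions if a != a_star):
                training_set.append((s, a_star, +1))   # positive example (s, a*)+
            for a in actions:
                if a != a_star and q[a] < q[a_star]:
                    training_set.append((s, a, -1))    # negative example (s, a)-
        new_policy = learn(training_set)               # supervised learner -> new policy
        if all(new_policy(s) == policy(s) for s in rollout_states):
            break                                      # policies agree on the rollout states
        policy = new_policy
    return policy
```

With exact estimates, the strict-dominance test simply excludes ties; Section 6 replaces it with a statistical test on the noisy rollout values.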

If a model is available and the structure of the value function approximation is compatible with the model, value function approximation minimizing L∞ error is possible (Guestrin et al., 2001). There are arguments that such approximations are better suited to MDPs, suggesting the use of something like support vector regression, which tries to limit the worst case approximation error. Unfortunately, the approach does not generalize easily to reinforcement learning in the presence of noise. To see this, consider that in minimizing L2 error, we serve two objectives: averaging within states and smoothing across states. Thus, the mean state value (based upon trajectories or temporal differences) is also the one that minimizes L2 error at that state. In contrast, minimizing L∞ error is ill suited to raw data of the type encountered in reinforcement learning, since the mean state value can differ sharply from the L∞ error minimizing estimate. (Suppose we draw 100 next states from state s, of which 99 have value 1.0 and 1 has value 0.0. The mean value for V(s) is 0.99, which is also the value that minimizes the L2 error in the temporal differences. However, V(s) = 0.5 minimizes the L∞ error.)

An important observation, also noted by Fern, Yoon and Givan (2003), is that rollouts can be used within API to avoid the problematic value function approximation step entirely. We choose some representative set of states Sρ and assume that we can perform enough rollouts to determine which action maximizes Qπ(s, a) for the current policy. Rather than fitting a function approximator to the values obtained by the rollouts, we instead train a classification learner where the maximizing action is the label for the state. This approximate policy iteration algorithm is described in Figure 2.

We use the notation (s, a)+ to indicate a positive training example, and (s, a)− to indicate a negative example. Learn is a supervised learning algorithm that trains a classifier given a set of labeled training data. The structure of the policy iteration algorithm naturally suggests a batch implementation, since the policy is updated in distinct phases. The termination condition is left somewhat open ended. It can be when the performance of the current policy does not exceed that of the previous one, when two subsequent policies are similar (the notion of similarity will depend upon the learner used), or when a cycle of policies is detected (also learner dependent). If we assume a fortuitous choice of Sρ and a sufficiently powerful learner that can correctly generalize from Sρ to the entire state space, the ith iteration of this algorithm will learn the improved policy of πi, effectively implementing a full policy iteration algorithm, and terminating with the optimal policy. For large-scale problems, choosing Sρ and dealing with imperfect classifiers will pose some challenges.

4. Choosing Sρ

For the choice of Sρ, we have a number of alternatives. The simplest is to try a dense, uniform covering of the state space. For low-dimensional state spaces, this will be practical, but it scales poorly. A similar option would be to randomly select Sρ from some uniform distribution over the state space. This is again problematic due to poor coverage for high-dimensional spaces.
A natural choice of Sρ would be the distribution of states induced by the current policy, πi. While intuitively appealing, this distribution may differ dramatically from the distribution of the subsequent policy, πi+1, for which we must train our classifier. (To see this, consider a πi that directs the system towards one “side” of the state space and a πi+1 that directs the system towards another side. If we train our classifier on states drawn from πi, when we try to use our classifier to execute πi+1, it may be asked to classify states that are disjoint from the ones it has been trained on.) This mismatch between training and testing can be dealt with by using a step size α, 0 < α ≤ 1, to keep the policy for the next iteration sufficiently close to the current policy so that performance does not degrade:

\tilde{\pi}_{i+1} = \alpha \arg\max_a \hat{Q}_{\pi_i}(s, a) + (1 - \alpha)\, \hat{\pi}_i ,

Note that π̃_{i+1} is now a stochastic policy that chooses from the improved policy with probability α and from the old policy with probability (1 − α). A positive α that improves performance is guaranteed to exist (Jaakkola et al., 1995). Recently, Kakade and Langford (2002) demonstrated a method for picking α near optimally, although the largest “safe” value of α may be quite small.

Fortunately, our assumption of a generative model gives us the luxury of drawing states from the policy we wish to learn before we have completely discovered or learned it. While this may sound paradoxical at first, it is actually quite simple thanks to an observation by Fern (personal communication). If we begin in some state s0, we can use rollouts to determine πi+1 for s0. We can then sample s1 by executing action πi+1(s0) in s0. We continue by using rollouts to determine πi+1(s1) and executing this action to obtain s2. We continue in this fashion until we have sampled states and actions along an entire trajectory of πi+1 starting from s0. Trajectories produced during rollouts are discarded; the only training data kept are from πi+1.

Fern’s observation mostly solves the Sρ problem, but it leaves open the question of how the initial state s0 is selected. The initial distribution D may seem like a natural distribution from which to draw s0. In practice, however, this can cause API to get stuck in local optima: Suppose πi visits only a small region of the state space. To improve upon πi, rollouts must discover better alternatives at the fringe of the states reachable by πi. However, our classifier for πi was never trained on states that aren’t reachable by πi, making it unlikely that rollouts from the frontier of πi will produce a better alternative to staying within the region normally circumscribed by πi. The choice of a “restart distribution” which differs from the problem’s natural starting distribution is also explored by Kakade and Langford (2002), who show that a poor choice of a restart distribution can lead to arbitrarily bad performance (policy loss that grows with the size of the state space). Of course, the ideal restart distribution would be that of the optimal policy, but this begs the question.

A related practical problem is what to do in states where rollouts cannot provide sufficient information to select πi+1. We discuss this issue in Section 6.
5. Imperfect Learners

Suppose that our learner fails to learn πi(s) perfectly when presented with π̃i(s) for all s in Sρ. To quantify the extent of this failure, we must first define the test distribution. Following Kakade and Langford (2002), we express the natural test distribution for this problem as the set of states encountered when starting from states drawn from the initial distribution D, and following π̂i. The probability of reaching future states is discounted by γ and the infinite sum is normalized by (1 − γ), resulting in:

d_{\pi_i, D}(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \left( P_{\pi_i}^{t+1} D \right)_s ,

where the s subscript indicates that we are selecting component s of the matrix-vector product.

If we have an a priori guarantee that our learner will choose the wrong action with probability at most δ on states drawn from this distribution, then we can bound the expected shortfall from following πi instead of π̂i as follows:

V_{\hat{\pi}_i}(s) - V_{\pi_i}(s) \le \frac{\delta R_{\max}}{(1 - \gamma)^2} ,

where Rmax is the maximum reward value. This pessimistic bound arises from the assumption that all mistakes are made in the initial states, which occur with probability (1 − γ), incurring penalty Rmax/(1 − γ). Since we cannot guarantee that such learners exist, this is not meant to be a serious bound, but reassurance that in principle good learners can produce good policies. In practice, the actual loss is best measured empirically by Monte Carlo evaluation, or estimated by the error rate on the training set. From a practical standpoint, high observed errors (or low performance) will suggest a change in representation, or a change in the learning mechanism, such as a change of kernel.

6. A Practical Algorithm

The main contribution of our paper is a particular embodiment of the approximate policy iteration algorithm described in Section 3. Training examples can be formed for any given state s ∈ S assuming some underlying policy π̂. The estimated values Q̃_π̂(s, a) are computed by rollouts for all possible actions in state s. If the values Q̃_π̂(s, a) were exact, then the maximizing action a* would yield one positive example (s, a*)+ and the rest of the actions would yield a number of negative examples (s, a)− for all a ≠ a*. Unfortunately, the estimates Q̃_π̂(s, a) are noisy and could yield incorrect examples if treated as exact. Thus, we used a simple two-sample t-test to compare rollout values. To generate examples in any state s using the rollout values Q̃_π(s, a), we did the following:

1. Use a fixed budget of k samples to determine Q̃_π in state s and a*:

   a^* = \arg\max_{a \in A} \tilde{Q}_{\pi}(s, a) .

2. Generate a positive example (s, a*)+ if the value of action a* is statistically significantly bigger than the value of every other action a ∈ A:

   \forall\, a \in A,\ a \ne a^* :\ \tilde{Q}_{\pi}(s, a) \,\tilde{<}\, \tilde{Q}_{\pi}(s, a^*) .

3. Generate a negative example (s, a)− for each action a whose value is statistically significantly smaller than the value of action a*:

   \forall\, a \in A :\ \tilde{Q}_{\pi}(s, a) \,\tilde{<}\, \tilde{Q}_{\pi}(s, a^*) .

A positive example is generated only if there is a clearly best action, in which case all other actions generate negative examples. If there is no best action, negative examples can still be generated for the actions that are clearly inferior. Notice that in this case the remaining actions appear to be equally good and, by not generating a positive example, the classifier is essentially given the freedom to choose any of them. The only case where no training examples are generated is when all actions appear to be equally good. We expect this approach would benefit from more sophisticated approaches to managing the number of samples used (Kaelbling, 1993; Kearns et al., 1999).

One peculiarity of rollout-based policy iteration is that if the current policy is very good, i.e., able to recover from small mistakes, there will be no statistically significant differences between many of the actions. This can make it difficult to acquire sufficient training data for the next policy. We mitigate this problem by treating the demonstrably bad actions as negative training examples even if we cannot determine a single, clearly superior action. Note that randomly selecting an action among the equivalent ones and marking it as positive will create a lot of noise for our learner, since subsequent visits to the same state may pollute the training set with multiple “optimal” actions for the same state. A simple lexicographic ordering can also have unexpected side effects at execution time by introducing strong preferences for particular actions and heavily biasing the training data with examples of just one class.

The most significant contribution of this effort is that it opens reinforcement learning to the full array of modern classification methods through the Learn function. SVMs are a particularly appealing choice for the reinforcement learning practitioner vexed by the feature selection problem. We offer a brief sketch of how SVMs work to justify this appeal: With the kernel trick, SVMs are able to implicitly and automatically consider classifiers with very complex feature spaces. Nevertheless, the optimization performed by SVMs can be interpreted as a search through a space of classifiers to find one that is both a good fit and has low VC dimension. In the most optimistic interpretation, this dodges the feature selection problem while simultaneously demonstrating resistance to overfitting. In practice there are, of course, complications, but if SVMs come close to this dramatic and optimistic description, we should be able to feed the raw state variables used by our simulators into our SVM classifier with little regard for the feature engineering required to obtain success in these problems using value function methods.

While SVMs are a particularly appealing choice for Learn, they are not the only option and may not be the most desirable option in many cases. The theoretical motivations for using SVMs are not as crisp for multiclass problems. For problems with many actions, other classification methods may be more natural: neural nets, Bayes nets, decision trees, etc. For these reasons, and for the sake of comparison, we also implemented Learn using a neural network. We designed the neural network with a number of outputs equal to the number of actions and trained the network to activate output i (and no others) on positive examples (s, ai)+. Our neural network classifier did not take advantage of negative examples.
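A Python sketch of the example-generation rule above (steps 1–3), using Welch's two-sample t-test over the K individual rollout returns to judge statistical significance; the use of scipy for the test, the 0.05 significance level, and all names are illustrative assumptions.

```python
from scipy import stats

def rollout_returns(simulate, s, a, policy, gamma, K, T):
    """K independent discounted returns for taking a in s, then following policy."""
    returns = []
    for _ in range(K):
        state, reward = simulate(s, a)
        q = reward
        for t in range(1, T):
            state, reward = simulate(state, policy(state))
            q += (gamma ** t) * reward
        returns.append(q)
    return returns

def generate_examples(simulate, s, actions, policy, gamma, K, T, alpha=0.05):
    """Positive/negative training examples at state s, per the rule above."""
    samples = {a: rollout_returns(simulate, s, a, policy, gamma, K, T) for a in actions}
    means = {a: sum(v) / len(v) for a, v in samples.items()}
    a_star = max(means, key=means.get)

    def worse(a):
        # Is action a statistically significantly worse than a*?
        _, p = stats.ttest_ind(samples[a], samples[a_star], equal_var=False)
        return p < alpha and means[a] < means[a_star]

    positives, negatives = [], []
    if all(worse(a) for a in actions if a != a_star):
        positives.append((s, a_star))                  # a* is the single clearly best action
    negatives.extend((s, a) for a in actions if a != a_star and worse(a))
    return positives, negatives
```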
7. Experimental Results

We implemented the SVM version of our API algorithm using SVMTorch (Collobert & Bengio, 2001), a publicly available implementation of support vector machines. The SVMTorch package provides a simple multiclass capability (one versus all), but is not necessarily representative of the best that can be done on multiclass problems using SVM technology. We also implemented a version of our algorithm using a simple feedforward, multi-layer neural network as the multiclass classifier. In this section, we present experimental results on the inverted pendulum problem and the bicycle balancing and riding problem. Our goal in these preliminary experiments is not necessarily to demonstrate the superiority of our rollout approach in terms of CPU cycles or sample complexity, but rather its viability as an alternate approach to the reinforcement learning control problem.

In our experiments we ran approximate policy iteration until the observed performance of the policy, as measured with experiments with the simulator, decreased. Since approximate policy iteration does not ensure monotonically improving policies, it is possible that continuing to run policy iteration beyond an initial setback could still result in better policies, but we did not explore this possibility.

7.1. Inverted pendulum

In the inverted pendulum domain, the task is to balance a pendulum of unknown length and mass at the upright position by applying forces to the cart to which it is attached. Three actions are allowed: left force LF (−50 Newtons), right force RF (+50 Newtons), or no force (NF) at all (0 Newtons). All three actions are noisy; uniform noise in [−10, 10] is added to the chosen action. The state space of the problem is continuous and consists of the vertical angle θ and the angular velocity θ̇ of the pendulum. The transitions are governed by the nonlinear dynamics of the system (Wang et al., 1996) and depend on the current state and the current (noisy) control u:

\ddot{\theta} = \frac{g \sin(\theta) - \alpha m l \dot{\theta}^2 \sin(2\theta)/2 - \alpha \cos(\theta)\, u}{4l/3 - \alpha m l \cos^2(\theta)} ,

where g is the gravity constant (g = 9.8 m/s²), m is the mass of the pendulum (m = 2.0 kg), M is the mass of the cart (M = 8.0 kg), l is the length of the pendulum (l = 0.5 m), and α = 1/(m + M). A reward of 1 is given as long as the angle of the pendulum does not exceed π/2 in absolute value (the pendulum is above the horizontal line). An angle greater than π/2 signals the end of the episode and a reward (penalty) of 0. The discount factor of the process is set to 0.95.

Using about 200 rollout states, the algorithm consistently learns excellent balancing policies in one or two iterations with both neural nets and SVMs, starting with an initial policy that selects actions randomly with uniform probability. Such “excellent” policies balance the pendulum for more than 3 simulated minutes (in practice, we found that such policies could balance essentially indefinitely). The choice of the sampling distribution did not affect the results significantly. For illustration, we used uniform sampling for rollout states. Figure 3 shows the training data obtained for the LF action. A positive example indicates a state where LF was found to be the best action and a negative example is a state where LF was found to be a bad choice. It is easy to see that positive and negative examples are easily separated. The same figure also shows the resulting support vectors for the LF classifier using a polynomial kernel of degree 2.

Figure 3. Training data (+ : positive, x : negative) and support vectors (o) for the LF action.

Figure 4 shows the entire learned policies (blue/dark-gray for LF, red/medium-gray for RF, and green/light-gray for NF) for all three classifiers: SVM with a polynomial kernel, SVM with a Gaussian kernel, and a neural network classifier with 5 hidden units. Interestingly, in the case of the polynomial kernel, the policy does not use the NF action at all, whereas the other policies do. This is due to the limited abilities of the polynomial degree-2 kernel. All policies are excellent in the sense that they can all balance the pendulum for a long time, perhaps indefinitely. In all cases, the input to the SVM or the neural network was just the 2-dimensional state description. For SVMs, the number of support vectors was normally smaller than the number of rollout states. The constant C, the trade-off between training error and margin, was set to 1.

Figure 4. Pendulum: policies learned with the polynomial kernel SVM, the Gaussian kernel SVM, and the neural network classifier.

We note that pendulum balancing is a relatively simple problem. The classes are nearly linearly separable, so good classification performance here should not be surprising to those familiar with modern classification methods. Noteworthy features from the reinforcement learning perspective are the small number of iterations of policy iteration required and the non-parametric representation of the policy. Figure 3 shows the ability of the SVM to adapt the representation to match the training data, since only the support vectors are used to represent the policy.
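For reference, a minimal Python sketch of a generative model for this pendulum, following the dynamics, noise, and reward described above; the Euler integration and the 0.1-second time step are illustrative assumptions, since the paper does not specify them for this domain.

```python
import math
import random

# Constants from the text: gravity, pendulum mass, cart mass, pendulum length.
G, M_PEND, M_CART, LENGTH = 9.8, 2.0, 8.0, 0.5
ALPHA = 1.0 / (M_PEND + M_CART)
ACTIONS = {"LF": -50.0, "RF": +50.0, "NF": 0.0}   # forces in Newtons
DT = 0.1   # integration time step (an assumption; not stated in the paper)

def simulate(state, action):
    """Generative model: one noisy transition of the inverted pendulum.
    state = (theta, theta_dot); returns (next_state, reward)."""
    theta, theta_dot = state
    u = ACTIONS[action] + random.uniform(-10.0, 10.0)   # uniform action noise
    accel = (G * math.sin(theta)
             - ALPHA * M_PEND * LENGTH * theta_dot ** 2 * math.sin(2 * theta) / 2.0
             - ALPHA * math.cos(theta) * u) / (4.0 * LENGTH / 3.0
                                               - ALPHA * M_PEND * LENGTH * math.cos(theta) ** 2)
    theta_dot = theta_dot + DT * accel            # simple Euler integration
    theta = theta + DT * theta_dot
    reward = 1.0 if abs(theta) <= math.pi / 2 else 0.0   # episode ends past horizontal
    return (theta, theta_dot), reward
```

Plugged into the rollout and policy-iteration sketches above, this plays the role of the black-box generative model M.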
7.2. Bicycle riding

In the bicycle balancing and riding problem (Randløv & Alstrøm, 1998) the goal is to learn to balance and ride a bicycle to a target position located 1 km away from the starting location. Initially, the bicycle’s orientation is at an angle of 90° to the goal. The state description is a six-dimensional real-valued vector (θ, θ̇, ω, ω̇, ω̈, ψ), where θ is the angle of the handlebar, ω is the vertical angle of the bicycle, and ψ is the angle of the bicycle to the goal. The actions are the torque τ applied to the handlebar (discretized to {−2, 0, +2}) and the displacement of the rider υ (discretized to {−0.02, 0, +0.02}). In our experiments, actions are restricted so that either τ = 0 or υ = 0, giving a total of 5 actions. The noise in the system is a uniformly distributed term in [−0.02, +0.02] added to the displacement component of the action. The dynamics of the bicycle are based on the model of Randløv and Alstrøm (1998) and the time step of the simulation is set to 0.02 seconds. As is typical with this problem, we used a shaping reward (Ng et al., 1999).

Our experiments with the bicycle did show some sensitivity to the parameters of the problem as well as the parameters of our learner. This made it difficult for us to find parameters that consistently produced good performance. Some of this may simply be reflective of our inexperience in tuning the parameters of SVMs. It is also possible that we did not consider enough samples.

For our SVM experiments, we used a shaping reward of rt given at each time step, where rt = (dt−1 − γdt) as long as |ω| < π/15, and r = 0 otherwise; dt is the distance of the back wheel of the bicycle to the goal position at time t. The discount factor was set to 0.95.

In our preliminary experiments with this domain, we were able to solve the problem with uniform sampling and polynomial kernels of low degree. However, it required a large number of rollout states (about 5,000). With sampling from the distribution of the next policy, we were able to solve the problem with fewer rollout states and both RBF and polynomial kernels. However, we did not find kernels that consistently produced good policies with reasonable sample sizes. (The balancing problem is solved easily using any of the classification methods, but riding to the goal proved more difficult.)

Figure 5 shows a sample trajectory from the final policy of one of our better policy iteration runs using SVMs. The bicycle moves in the 2-dimensional plane from the initial position (0, 0) (left side) to the goal position (1000, 0) (right side). This policy was produced with a polynomial kernel of degree 3 and 4000 rollout states. In the final policy, the bicycle rides to the goal, then turns around toward the goal in a very tight radius. This policy was obtained in just two API iterations, starting with a uniformly random action selection policy. Similarly to the pendulum, the input to the SVM was the raw 6-dimensional state description and C = 1.

Figure 5. Bicycle: Trajectory of an SVM policy.

For our neural network experiments, we used a shaping reward of rt given at each time step, where rt = 1 + (dt−1 − γdt) as long as |ω| < π/15, and r = 0 otherwise; dt is the distance of the back wheel of the bicycle to the goal position at time t. The discount factor was set to 0.99. Since our neural network learner only uses positive examples and not all states successfully produce positive training instances, we used 8000 rollout states.

Figure 6 shows sample trajectories of one of our better neural network policy iteration runs using 30 hidden units. After the first iteration, the learned policy can only balance the bicycle for a few steps and it crashes. The policy at the second iteration reaches the goal, but fails to return to it. Finally, the policy at the third iteration reaches the goal faster and stays there. The best neural network policy is not as good as the best SVM policy, but it is more illustrative of the progress of policy iteration because it takes an extra iteration.
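As a small concrete reference, a sketch of the shaping reward used in the SVM bicycle runs above, assuming the back-wheel distances d_{t−1} and d_t are available from the simulator (the neural network runs add 1 to this quantity).

```python
import math

def shaping_reward(d_prev, d_curr, omega, gamma=0.95):
    """Shaping reward for the SVM bicycle runs: r_t = d_{t-1} - gamma * d_t
    while the bicycle stays upright (|omega| < pi/15), and 0 once it has fallen."""
    if abs(omega) < math.pi / 15:
        return d_prev - gamma * d_curr
    return 0.0
```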
Figure 6. Bicycle: policies at successive iterations (NN classifier).

8. Discussion

We have presented a case for an approach to RL that combines policy iteration and pure classification learning with rollouts. The emphasis of the approach in this paper is the ability to use modern classification techniques such as SVMs to alleviate some of the burden of feature engineering from the practitioner of reinforcement learning. However, our empirical results also suggest that more traditional methods such as neural networks can be used successfully.

We believe that these initial successes will help open the door to greater exploitation of modern classification methods on more challenging reinforcement learning domains. Of course, many questions remain. More thorough investigations of issues relating to sample complexity and restart distributions are important areas for future work.

Acknowledgments

This research was supported in part by NSF grant 0209088. We also thank Alan Fern, Bob Givan, Carlos Guestrin and Ryan Deering for helpful discussions.

References

Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont, Massachusetts: Athena Scientific.

Collobert, R., & Bengio, S. (2001). SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research (JMLR), 1, 143–160.

Dietterich, T. G., & Wang, X. (2001). Batch value function approximation via support vectors. Advances in Neural Information Processing Systems 14: Proceedings of the 2001 Conference. Vancouver, British Columbia: MIT Press.

Fern, A., Yoon, S., & Givan, R. (2003). Approximate policy iteration with a policy language bias: Learning control knowledge in planning domains (Technical report TR-ECE-03-11). Purdue University School of Electrical and Computer Engineering.

Guestrin, C. E., Koller, D., & Parr, R. (2001). Max-norm projections for factored MDPs. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01) (pp. 673–680). Seattle, Washington: Morgan Kaufmann.

Jaakkola, T., Jordan, M., & Singh, S. (1994). On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1185–1201.

Jaakkola, T., Singh, S. P., & Jordan, M. I. (1995). Reinforcement learning algorithm for partially observable Markov decision problems. Advances in Neural Information Processing Systems 7 (pp. 345–352). Cambridge, Massachusetts: MIT Press.

Kaelbling, L. P. (1993). Learning in embedded systems. Cambridge, Massachusetts: MIT Press.

Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. The Nineteenth International Conference on Machine Learning (ICML-2002). Sydney, Australia.

Kearns, M., Mansour, Y., & Ng, A. Y. (1999). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99) (pp. 1324–1331). Stockholm, Sweden: Morgan Kaufmann.

Lagoudakis, M., & Parr, R. (2001). Model-free least-squares policy iteration. To appear in 14th Neural Information Processing Systems (NIPS-14). Vancouver, Canada.

Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. Proc. 16th International Conf. on Machine Learning (pp. 278–287). San Francisco, CA: Morgan Kaufmann.

Randløv, J., & Alstrøm, P. (1998). Learning to drive a bicycle using reinforcement learning and shaping. The Fifteenth International Conference on Machine Learning. Madison, Wisconsin: Morgan Kaufmann.

Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Tesauro, G., & Galperin, G. (1997). On-line policy improvement using Monte-Carlo search. 9th Neural Information Processing Systems (NIPS-9). Denver, Colorado.

Wang, H., Tanaka, K., & Griffin, M. (1996). An approach to fuzzy control of nonlinear systems: Stability and design issues. IEEE Transactions on Fuzzy Systems, 4, 14–23.

Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.

Yoon, S. W., Fern, A., & Givan, R. (2002). Inductive policy selection for first-order MDPs. Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI-02). Edmonton, Canada: Morgan Kaufmann.