Towards Direct Policy Search Reinforcement Learning
Abstract— This paper proposes a high-level Reinforcement Learning (RL) control system for solving the action selection problem of an autonomous robot. Although the dominant approach, when using RL, has been to apply value function based algorithms, the system detailed here is characterized by the use of Direct Policy Search methods. Rather than approximating a value function, these methodologies approximate a policy using an independent function approximator with its own parameters, trying to maximize the future expected reward. The policy based algorithm presented in this paper is used for learning the internal state/action mapping of a behavior. In this preliminary work, we demonstrate its feasibility with simulated experiments using the underwater robot GARBI in a target reaching task.

I. INTRODUCTION

Reinforcement Learning is a widely used methodology in robot learning [1]. In RL, an agent tries to maximize a scalar evaluation obtained as a result of its interaction with the environment. The goal of an RL system is to find an optimal policy mapping the state of the environment to an action which in turn will maximize the accumulated future rewards. The agent interacts with a new, undiscovered environment selecting the actions computed as the best for each state, receiving a numerical reward for every decision. The rewards are used to teach the agent and, in the end, the robot learns which action it must take at each state, achieving an optimal or sub-optimal policy (state-action mapping).

The dominant approach over the last decade has been to apply reinforcement learning using the value function approach. As a result, many RL based control systems have been applied to robotics. In [2], an instance-based learning algorithm was applied to a real robot in a corridor-following task. For the same task, in [3] a hierarchical memory-based RL method was proposed, obtaining good results as well. In [4] an underwater robot that learnt different behaviors using a modified Q-learning algorithm was presented. Although value function methodologies have worked well in many applications, they have several limitations. Function approximator methods in "value-only" RL algorithms may present convergence problems if the state space is not completely observable, and small changes in the value function can cause big changes in the policy [5].

Over the past few years, studies have shown that approximating a policy can be easier than working with value functions, and better results can be obtained [6] [7]. Informally, it is intuitively simpler to determine how to act than to determine the value of acting [8]. So, rather than approximating a value function, new methodologies approximate a policy using an independent function approximator with its own parameters, trying to maximize the future expected reward. Furthermore, scientists have developed different kinds of policy search algorithms obtaining good results [5] [9] [10]. Also, in [10] a study about how gradient methods can be used to search in the space of stochastic policies is presented.

Policy gradient algorithms can be used to represent the policy, for example with an ANN whose weights are the policy parameters. The state would be the input of the network and, as output, we would have a probability distribution for action selection. In (1) we can see that if θ represents the vector of the policy parameters and ρ the performance of the policy (e.g., the reward received), then the policy parameters are updated approximately proportionally to the gradient [6]:

Δθ ≈ α ∂ρ/∂θ        (1)

where α is a positive step size. In comparison with the value function approach, small changes in θ can cause only small changes in the policy.
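To make (1) concrete, the following minimal sketch (our illustration, not code from the paper; the two-action linear soft-max policy and the toy reward are assumptions) performs one stochastic gradient-ascent step on the policy parameters, estimating ∂ρ/∂θ with the likelihood-ratio term r ∇ log π used by the policy gradient methods discussed below.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_actions = 3, 2                 # hypothetical sizes, for illustration only
theta = np.zeros((n_features, n_actions))    # policy parameters
alpha = 0.01                                 # positive step size of (1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def act(state):
    """Sample an action from the stochastic policy pi(a | s; theta)."""
    probs = softmax(state @ theta)
    return rng.choice(n_actions, p=probs), probs

def update(theta, state, action, probs, reward):
    """One step of Delta(theta) ~ alpha * d(rho)/d(theta), with the gradient
    estimated by the likelihood ratio reward * grad log pi(action | state)."""
    grad_log_pi = np.outer(state, -probs)
    grad_log_pi[:, action] += state          # d log pi / d theta for a soft-max policy
    return theta + alpha * reward * grad_log_pi

# Single illustrative interaction with a made-up reward.
state = np.array([0.2, -0.1, 1.0])
action, probs = act(state)
reward = 1.0 if action == 0 else -1.0
theta = update(theta, state, action, probs, reward)
```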
The advantages of policy gradient methods over value-function based methods are various. The main advantage is that using a function approximator to represent the policy directly solves the generalization problem. Besides, a problem for which the policy is easier to represent should be solved using policy algorithms [7]. Furthermore, learning systems should be designed to explicitly account for the resulting violations of the Markov property. Studies have shown that stochastic policy-only methods can obtain better results when working in partially observable Markov decision processes (POMDPs) than those obtained with deterministic value-function methods [11]. In [7] a comparison between a policy-only algorithm [12] and a value Q-learning method [13] is presented; both algorithms use a simple neural network as function approximator. A 13-state Markovian decision process is simulated, for which the Q-learning oscillates between the optimal and a suboptimal policy while the policy-only method converges to the optimal policy.
On the other hand, as a disadvantage, the policy gradient estimators used in these algorithms may have a large variance, so these methods learn much more slowly than RL algorithms using a value function [14] [15] [6], and they can converge to local optima of the expected reward [16].

The first example of an algorithm optimizing the averaged reward obtained, for stochastic policies working with gradient direction estimates, was Williams's REINFORCE algorithm [17]. This algorithm learns much more slowly than other RL algorithms which work with a value function and, maybe for this reason, has received little attention. However, the ideas and mathematical concepts presented in REINFORCE were a basic platform for later algorithms.

A few years later, in [18], Williams's algorithm was extended to the infinite horizon setting. Kimura's method is, like REINFORCE, based on stochastic gradient ascent (SGA). The authors compared their algorithm with Jaakkola's method [19] and Watkins's Q-learning algorithm [13] in a robot control problem, achieving good results.

The Baxter and Bartlett approach [20] is the one selected in this paper to carry out the experiments. Their method calculates a parameterized policy that converges to an optimal one by computing approximations of the gradient of the averaged reward from a single path of a controlled POMDP. The convergence of the method is proven with probability 1, and one of its most attractive features is that it can be implemented on-line. Baxter and Bartlett's approach is based on the fact that, given a state s, it searches for a policy that maximizes the expected reward.

Moreover, in [21] and [22] an algorithm similar to Baxter and Bartlett's approach was described and its convergence demonstrated. The algorithm is only suitable for finite MDPs and can be implemented to work on-line.

Building on these theoretical variants of policy search methods, only a few, but promising, practical applications of these algorithms have appeared. Chronologically, this paper emphasizes the work presented in [23], where an autonomous helicopter learns to fly using an off-line model-based policy search method. Also important is the work presented in [24], where a simple "biologically motivated" policy gradient method is used to teach a robot in a weightlifting task. More recent is the work done in [25], where a simplified policy gradient algorithm is implemented to optimize the gait of Sony's AIBO quadrupedal robot. Finally, in [26] and [27], a biped robot is trained to walk by means of a "hybrid" RL algorithm that combines policy search with value function methods.

In this paper we apply Baxter and Bartlett's algorithm to a particular robotic task in which a neural network acts as the policy function. The task consists of reaching a target which is detected by a forward looking camera. These experiments have been designed for the Autonomous Underwater Vehicle (AUV) GARBI. In this paper the learning has been carried out with the hydrodynamic model of GARBI and the model of a video camera. The structure of the paper is as follows. In Section II the algorithm and the learning procedure are detailed. Section III describes all the elements considered in the target reaching task (the simulated world, the robot model and the controller). The experimentation procedure and the results obtained are included in Section IV and, finally, the conclusions and the future work to be done after this work are included in Section V.

II. LEARNING PROCEDURE

The objective of this work is to transfer an accurate policy, learned in a simulated environment, to a real robot and test the behavior of the policy in real conditions. So, the learning process can be divided into two main phases. First, the learning task will be performed in simulation using the model of the environment. Once the learning process is considered to be finished, the policy will be transferred to the GARBI AUV in order to test it in the real world. In this paper we present only the results of the first phase.

A. The Algorithm

Baxter and Bartlett's algorithm procedure is summarized in Algorithm 1. The algorithm works as follows: having initialized the parameter vector θ0, the initial state i0 and the eligibility trace z0 = 0, the learning procedure will be iterated T times. At every iteration, the parameters' eligibility zt will be updated according to the policy gradient approximation. The discount factor β ∈ [0, 1) increases or decreases the agent's memory of past actions. The immediate reward received, r(it+1), and the learning rate α allow us to finally compute the new vector of parameters θt+1. The current policy is directly modified by the new parameters, becoming a new policy to be followed in the next iteration, getting closer to a final policy that represents a correct solution of the problem.

Algorithm 1: Baxter and Bartlett's On-Line POMDP (OLPOMDP) algorithm
1. Initialize:
   T > 0
   Initial parameter values θ0 ∈ R^K
   Initial state i0
2. Set z0 = 0 (z0 ∈ R^K)
3. for t = 0 to T do:
   (a) Observe state yt
   (b) Generate control action ut according to the current policy μ(θ, yt)
   (c) Observe the reward obtained, r(it+1)
   (d) Set zt+1 = β zt + ∇μut(θ, yt) / μut(θ, yt)
   (e) Set θt+1 = θt + αt r(it+1) zt+1
4. end for
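For reference, a minimal Python sketch of the loop in Algorithm 1 is given below. It is a reading of the pseudo-code, not the authors' implementation: the `policy_probs`, `grad_mu` and `env_step` callables stand in for the ANN policy and the simulated environment described later.

```python
import numpy as np

def olpomdp(theta0, y0, T, alpha, beta, policy_probs, grad_mu, env_step, rng):
    """Sketch of Algorithm 1 (OLPOMDP).
    policy_probs(theta, y) -> action probabilities mu(theta, y)
    grad_mu(theta, y, u)   -> gradient of mu_u(theta, y) w.r.t. theta
    env_step(u)            -> (next observation, reward r(i_{t+1}))
    """
    theta = theta0.copy()
    z = np.zeros_like(theta)                         # 2. z_0 = 0
    y = y0
    for t in range(T):                               # 3. main loop
        probs = policy_probs(theta, y)               # (a) observe state y_t
        u = rng.choice(len(probs), p=probs)          # (b) draw u_t from mu(theta, y_t)
        y_next, r = env_step(u)                      # (c) observe reward r(i_{t+1})
        z = beta * z + grad_mu(theta, y, u) / probs[u]   # (d) eligibility trace
        theta = theta + alpha * r * z                # (e) parameter update
        y = y_next
    return theta
```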
As aforementioned, the algorithm is designed to work on-line. The function approximator adopted to define our policy is an artificial neural network (ANN) (see Figure 1).

The next lines relate closely to the weight update process performed by the algorithm. Once the ANN is initialized at random, the network is given an observation of the state and, as a result, a stochastic control action is computed. Then the learner is driven to another state and receives a reward associated with this new state. The first step in the parameter update procedure is to compute the ratio:
∇μut(θ, yt) / μut(θ, yt)        (2)

for every weight of the network. In artificial neural networks like the one used in the algorithm, the expression defined in step 3.(d) of Algorithm 1 can be rewritten in terms of the local gradients of every neuron, giving expression (3).

[Figure residue, recovered labels: soft-max over the network outputs o1 … on; the selected action j takes the error ej = 1 − Prj, while the other outputs take ei = −Pri; hidden-layer local gradients δh are obtained from the output-layer gradients δo through the weights w.]

Fig. 1. Schema of the ANN architecture adopted.
Fig. 3. Gradient computation for a hidden layer neuron.
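One way to read expression (2) for the soft-max output layer sketched in Figure 1 (the exact network code is not given in the paper, so this is a sketch consistent with the error terms ej = 1 − Prj shown there): the ratio ∇μut/μut equals ∇ log μut, and its derivative with respect to the soft-max input of output neuron i is simply 1 − Pri for the selected action and −Pri for the rest; back-propagating these errors through the hidden layers (Figure 3) yields the per-weight components of (2).

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def output_errors(o, selected_action):
    """Derivative of log mu_u (i.e. grad(mu_u)/mu_u, expression (2)) with respect
    to the soft-max inputs o_i: 1 - Pr_j for the selected action j, -Pr_i otherwise."""
    e = -softmax(o)
    e[selected_action] += 1.0
    return e

# Example: four output neurons, action 2 was the one sampled.
print(output_errors(np.array([0.1, -0.3, 0.8, 0.0]), selected_action=2))
```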
Having all the local gradients of all the neurons calculated, the expression in (3) can be obtained. Finally, the old parameters are updated following expression 3.(e) of Algorithm 1:

θt+1 = θt + α r(it+1) zt+1

The vector of parameters θt represents the network weights to be updated, r(it+1) is the reward given to the learner at every time step, zt+1 describes the estimated gradients mentioned before and, at last, α is the learning rate of the algorithm.
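Putting expression (2) and steps 3.(d) and 3.(e) together at the weight level, a sketch of one update is shown below; the per-layer list layout and the back-propagated `grad_log_mu` arrays are illustrative assumptions rather than the authors' code, while the default α and β are the values reported later in the experiments.

```python
def update_parameters(weights, traces, grad_log_mu, reward, alpha=0.0001, beta=0.95):
    """One OLPOMDP update of the ANN weights.
    weights, traces and grad_log_mu are lists of same-shaped arrays, one per layer;
    grad_log_mu holds the back-propagated components of expression (2)."""
    new_traces, new_weights = [], []
    for w, z, g in zip(weights, traces, grad_log_mu):
        z_next = beta * z + g                             # step 3.(d): eligibility trace
        new_traces.append(z_next)
        new_weights.append(w + alpha * reward * z_next)   # step 3.(e): weight update
    return new_weights, new_traces
```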
[Figure residue, recovered labels: position of the target with respect to the camera coordinate frame and the focal axis.]

[Figure residue, recovered labels: four outputs o1 … o4 feeding a soft-max.]
Fig. 5. The ANN used by the controller.

stability in both pitch and roll degrees of freedom. Its five thrusters allow GARBI to be operated in the remaining degrees of freedom (surge, sway, heave and yaw), achieving maximum speeds of 3 knots (see Figure 7).

The mathematical model of GARBI was obtained using parameter identification methods [30]. The whole model has been uncoupled and reduced to emulate a robot with only two degrees of freedom (DOF): X movement and rotation about the Z axis. Also, the model of the camera has been used to simulate the vision system of the GARBI AUV.
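To make the simulated setup more tangible, the toy sketch below shows one possible form of the reduced 2-DOF (surge and yaw) update and of a normalized camera observation (fx, fy). It is purely illustrative: the kinematics, field of view and range normalization are assumptions, not the identified hydrodynamic model of [30] or the actual camera model.

```python
import numpy as np

DT = 0.1                        # sample time used in the experiments (0.1 s)
HALF_FOV = np.deg2rad(30.0)     # assumed half field of view of the camera
MAX_RANGE = 10.0                # assumed range used to normalize fx

def step(pose, surge_vel, yaw_rate):
    """Toy kinematic update of the reduced 2-DOF robot pose (x, y, yaw)."""
    x, y, yaw = pose
    x += surge_vel * np.cos(yaw) * DT
    y += surge_vel * np.sin(yaw) * DT
    yaw += yaw_rate * DT
    return np.array([x, y, yaw])

def observe(pose, target):
    """Normalized target coordinates: fy is the bearing scaled to [-1, 1] across
    the field of view, fx a normalized distance along the focal axis."""
    dx, dy = target - pose[:2]
    bearing = np.arctan2(dy, dx) - pose[2]
    fy = np.clip(bearing / HALF_FOV, -1.0, 1.0)
    fx = np.clip(np.hypot(dx, dy) * np.cos(bearing) / MAX_RANGE, 0.0, 1.0)
    return fx, fy
```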
IV. RESULTS

The controller was trained in an episodic task. A reward value was given according to the variables fx, in the X DOF, and fy, in the yaw DOF. Only three reward values were used: -20, -1 and 0. In order to maintain the target in front of the robot, the reward r = 0 is given when the position of the target is around fy = 0 (between -0.2 and 0.2) and at a certain distance from the robot along the X axis, around fx = 0.3 (between 0.2 and 0.4). The reward value of -20 is given if the target is almost outside the image (fy < -0.9 or fy > 0.9), and the robot perceives a reward of -1 if it definitively loses the target. Robot and target positions are reset either every 50 seconds or when a "success" (reward 0 is reached) takes place, whichever comes first. The sample time was 0.1 seconds. The robot is always reset to position (0,0), while the target is reset to a random location inside the robot's field of view.

Achieving a "success" or spending 50 seconds without reaching the target represents the end of an episode. The number of episodes to be run has been set to 100,000. For every episode, the total amount of reward perceived is calculated.
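The reward rules above translate directly into a small function. This is our reading of the text; the reward returned in the remaining (target visible but not centred) states is not stated explicitly and is assumed to be -1 here, which is consistent with the best episode totals of about -40 reported below.

```python
def reward(fx, fy, target_lost):
    """Reward signal of the target reaching task (three values: 0, -1, -20)."""
    if target_lost:
        return -1                              # target definitively lost
    if fy < -0.9 or fy > 0.9:
        return -20                             # target almost outside the image
    if -0.2 <= fy <= 0.2 and 0.2 <= fx <= 0.4:
        return 0                               # "success": centred and at the right range
    return -1                                  # other visible states (assumed)

def episode_done(step_count, r, dt=0.1, max_seconds=50.0):
    """An episode ends on a success (r == 0) or after 50 seconds."""
    return r == 0 or step_count * dt >= max_seconds
```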
Figure 8 represents the performance of the neural-network robot controller as a function of the number of episodes when trained using OLPOMDP. The episodes have been averaged over bins of 40 episodes. The experiment has been repeated in 100 independent runs, and the results presented here are a mean over these runs. The simulated experiments have been repeated for different values of the discount factor β.

Once the vehicle has learnt the task, it needs only a few time steps to reach the goal. As can be appreciated in Figure 8, the best performance is around -40. The best results are obtained when using a decay factor of β = 0.95. Different values of α have been tested without improving the results presented here. Figure 9 represents the behavior of a trained robot controller. Target positions were deterministically selected to observe the robot moving to different locations.
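The averaging used in Figure 8 can be reproduced in a few lines; in this sketch `rewards` is assumed to be a runs-by-episodes array holding the total reward of every episode.

```python
import numpy as np

def binned_performance(rewards, bin_size=40):
    """Mean total reward per episode, averaged over bins of `bin_size` episodes
    and over the independent runs (one row of `rewards` per run)."""
    runs, episodes = rewards.shape
    usable = episodes - episodes % bin_size
    bins = rewards[:, :usable].reshape(runs, -1, bin_size)
    return bins.mean(axis=(0, 2))              # one point per group of episodes

# Example with random data standing in for the 100 runs of the experiment.
fake = np.random.default_rng(1).normal(-60.0, 10.0, size=(100, 2000))
curve = binned_performance(fake)
```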
V. CONCLUSIONS

A direct policy search algorithm for robot control based on Baxter and Bartlett's direct-gradient algorithm has been studied. The method has been applied to a simulated control system where the robot GARBI navigates a two-dimensional world, learning to reach a target by means of a forward looking camera. The policy is represented by a neural network whose weights are the policy parameters. The objective of the agent was to compute a stochastic policy, which assigns a probability to each action.

The results of this preliminary work show a good performance of the algorithm. The convergence times are quite good. A future work will compare these results with a value function algorithm. A classical value method would have been affected by the generalization problem and, therefore, would have spent many more iterations to converge. It is also important to note the reduced dimensions of the ANN used in the simulation.

The current work is focused on transferring the learned policy to the real robot and testing it in real conditions. A future work will consist of performing the learning process on-line with the robot GARBI navigating in the real underwater environment.

ACKNOWLEDGMENT

This research was sponsored by the Spanish commission MCYT (CTM2004-04205/MAR). The authors would like to thank Mr. Douglas Alexander Aberdeen of the Australian National University, Mr. Russ Tedrake of the Massachusetts Institute of Technology and Mr. Jun Morimoto of the ATR Computational Neuroscience Laboratories for their help.
Fig. 6. GARBI schema. Control actions.
Fig. 7. GARBI AUV in experimental test.
Fig. 8. Performance of the neural-network robot controller as a function of the number of episodes (mean total reward per trial vs. groups of 40 trials). Performance estimates were generated by simulating 100,000 episodes and averaging them over bins of 40 episodes. The process was repeated in 100 independent runs; the results are a mean of these runs. Fixed α = 0.0001, and β = 0.95, 0.96, 0.97.

Fig. 9. Behavior of a trained robot controller: results of the target reaching task after the learning process was completed (GARBI trajectories, camera field of view and target positions; X and Y locations in meters).

REFERENCES

[1] R. Sutton and A. Barto, Reinforcement Learning: An Introduction. MIT Press, 1998.
[2] W. D. Smart and L. P. Kaelbling, "Practical reinforcement learning in continuous spaces," in International Conference on Machine Learning, 2000.
[3] N. Hernandez and S. Mahadevan, "Hierarchical memory-based reinforcement learning," in Fifteenth International Conference on Neural Information Processing Systems, Denver, USA, 2000.
[4] M. Carreras, P. Ridao, R. Garcia, and T. Nicosevici, "Vision-based localization of an underwater robot in a structured environment," in IEEE International Conference on Robotics and Automation, Taipei, Taiwan, 2003.
[5] D. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996.
[6] R. Sutton, D. McAllester, S. Singh, and Y. Mansour, "Policy gradient methods for reinforcement learning with function approximation," Advances in Neural Information Processing Systems, vol. 12, pp. 1057–1063, 2000.
[7] C. Anderson, "Approximating a policy can be easier than approximating a value function," Colorado State University, Computer Science Technical Report, 2000.
[8] D. A. Aberdeen, "Policy-gradient algorithms for partially observable Markov decision processes," Ph.D. dissertation, Australian National University, April 2003.
[9] E. A. Hansen, "Solving POMDPs by searching in policy space," in 8th Conference on Uncertainty in Artificial Intelligence, Madison, WI, 1998, pp. 211–219.
[10] N. Meuleau, K. E. Kim, L. P. Kaelbling, and A. R. Cassandra, "Solving POMDPs by searching the space of finite policies," in 15th Conference on Uncertainty in Artificial Intelligence, M. Kaufmann, Ed., Computer Science Dept., Brown University, July 1999, pp. 127–136.
[11] S. Singh, T. Jaakkola, and M. Jordan, "Learning without state-estimation in partially observable Markovian decision processes," in Proceedings of the Eleventh International Conference on Machine Learning, New Jersey, USA, 1994.
[12] J. Baxter and P. Bartlett, "Direct gradient-based reinforcement learning," in International Symposium on Circuits and Systems, Geneva, Switzerland, May 2000.
[13] C. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, pp. 279–292, 1992.
[14] P. Marbach and J. N. Tsitsiklis, "Gradient-based optimization of Markov reward processes: Practical variants," Center for Communications Systems Research, University of Cambridge, Tech. Rep., March 2000.
[15] V. Konda and J. Tsitsiklis, "On actor-critic algorithms," SIAM Journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
[16] N. Meuleau, L. Peshkin, and K. Kim, "Exploration in gradient based reinforcement learning," Massachusetts Institute of Technology, AI Memo 2001-003, Tech. Rep., April 2001.
[17] R. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, pp. 229–256, 1992.
[18] H. Kimura, K. Miyazaki, and S. Kobayashi, "Reinforcement learning in POMDPs with function approximation," in Fourteenth International Conference on Machine Learning (ICML'97), D. H. Fisher, Ed., 1997, pp. 152–160.
[19] T. Jaakkola, S. Singh, and M. Jordan, Reinforcement Learning Algorithms for Partially Observable Markov Decision Problems. Morgan Kaufmann, 1995, vol. 7, pp. 345–352.
[20] J. Baxter and P. Bartlett, "Direct gradient-based reinforcement learning: I. Gradient estimation algorithms," Australian National University, Tech. Rep., 1999.
[21] P. Marbach and J. N. Tsitsiklis, "Simulation-based optimization of Markov reward processes," Massachusetts Institute of Technology, Tech. Rep. LIDS-P-2411, 1998.
[22] P. Marbach, "Simulation-based methods for Markov decision processes," Ph.D. thesis, Laboratory for Information and Decision Systems, MIT, 1998.
[23] J. Bagnell and J. Schneider, "Autonomous helicopter control using reinforcement learning policy search methods," in Proceedings of the IEEE International Conference on Robotics and Automation, Korea, 2001.
[24] M. Rosenstein and A. Barto, "Robot weightlifting by direct policy search," in Proceedings of the International Joint Conference on Artificial Intelligence, 2001.
[25] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in IEEE International Conference on Robotics and Automation (ICRA), 2004.
[26] R. Tedrake, T. W. Zhang, and H. S. Seung, "Stochastic policy gradient reinforcement learning on a simple 3D biped," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS'04), Sendai, Japan, September 28 to October 2, 2004.
[27] T. Matsubara, J. Morimoto, J. Nakanishi, M. Sato, and K. Doya, "Learning sensory feedback to CPG with policy gradient for biped locomotion," in Proceedings of the International Conference on Robotics and Automation (ICRA), Barcelona, Spain, April 2005.
[28] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall, 1999.
[29] A. Savitzky and M. Golay, Analytical Chemistry, vol. 36, pp. 1627–1639, 1964.
[30] P. Ridao, M. Carreras, and J. Batlle, "Model identification of a low-speed UUV," in Control Applications in Marine Systems (CAMS'01), Scotland, UK, 2001.