
IROS2021 - Deep Learning Using Gazebo

This document summarizes a research paper that proposes a method called Interactive Progressive Network Learning (IPNL) to improve the Progressive Neural Network approach for sim-to-real transfer learning. IPNL integrates Progressive Networks with interactive reinforcement learning, which allows an agent to learn from evaluative feedback provided by a human trainer. The authors test IPNL on several reinforcement learning tasks and find that while Progressive Networks are effective for transferring between some tasks, IPNL allows agents to learn more stable policies faster for both easier and harder transfer scenarios. It also shows potential for application to real-world robotics tasks.

Uploaded by

roykaushik4u

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/355109893

Shaping Progressive Net of Reinforcement Learning for Policy Transfer with Human Evaluative Feedback

Conference Paper · September 2021


DOI: 10.1109/IROS51168.2021.9636061


All content following this page was uploaded by Guangliang Li on 07 October 2021.



Shaping Progressive Net of Reinforcement Learning for Policy
Transfer with Human Evaluative Feedback

Rongshun Juan¹*, Jie Huang¹*, Randy Gomez², Keisuke Nakamura², Qixin Sha¹, Bo He¹, Guangliang Li¹**

¹ Rongshun Juan, Jie Huang, Qixin Sha, Bo He and Guangliang Li are with the College of Information Science and Engineering, Ocean University of China, Songling Road 238, 266100, Qingdao, Shandong, China. {bhe, guangliangli}@ouc.edu.cn
² Randy Gomez and Keisuke Nakamura are with Honda Research Institute Japan Co., Ltd, Wako, Japan. {r.gomez, keisuke}@jp.honda-ri.com
* Contributing equally.
** Corresponding author.

Abstract: Deep reinforcement learning has achieved significant success in many fields, but it confronts sampling efficiency and safety problems when applied to robot control in the real world. Sim-to-real transfer learning was proposed to make use of samples in simulation and overcome the gap between simulation and the real world. In this paper, we focus on improving Progressive Neural Network, an effective sim-to-real learning method, by proposing Interactive Progressive Network Learning (IPNL). IPNL integrates progressive network and interactive reinforcement learning (interactive RL), which learns from evaluative feedback provided by an observing human trainer. We test our method using five RL tasks with discrete or continuous actions in OpenAI Gym and a sinusoids curve following task with an AUV simulator on the Gazebo platform. Our results suggest that Progressive Network performs well when transferring from tasks with a low-dimensional state space to those with a high-dimensional one but has little effect when transferring from high-dimensional tasks to low-dimensional ones, whereas IPNL allows an agent to learn a more stable policy with better performance faster in both cases. More importantly, our further analysis indicates that there is a synergy between Progressive Network and interactive RL for improving the agent's learning. Our results in the path following of AUV shed light on the potential of applying our method in real-world tasks.

I. INTRODUCTION
Recent advances in deep learning (DL) enable reinforcement learning (RL) to scale to decision-making in challenging tasks with high-dimensional state and action spaces [1]. Deep reinforcement learning (DRL) has been shown to be successful in playing a range of Atari games directly from image pixels [2], AlphaGo defeating a human world champion [3] and the Watson DeepQA system beating the best human Jeopardy! players [4], etc. However, when applying DRL methods to robot control in the real world, agents usually have to confront two main challenges: sampling efficiency and safety guarantees. Firstly, a DRL agent generally requires tens of millions of samples to learn an optimal policy, and it would take several months to collect samples of this magnitude in the real world. Secondly, an agent learning with DRL requires large-scale random sampling in the environment for exploration, so at certain moments its behavior may damage the robot itself and/or even living things in the surrounding environment [5], [6].

Sampling in a simulated environment is faster, cheaper and safer than learning directly in the real world, but using a policy trained in a simulator directly in the real world is difficult and risky since there is a gap between simulation and reality [7]. How to bridge this gap has attracted a great deal of attention. Many sim-to-real algorithms have been proposed to solve this problem, such as domain adaptation [8], inverse dynamics model [9], domain randomization [5] and progressive network [10]. Among them, progressive neural network (PNN) achieves positive transfer between disparate tasks without needing to specify source and target tasks [11]. That is to say, even for a target task with different action and state spaces from the source task, a PNN agent can have better learning capability and faster learning speed.

Although PNN can speed up agent learning, an agent trained in the target task still needs a large number of samples to explore before learning an optimal policy, which might still induce safety problems. In this case, agents can further speed up their learning by interacting with human users, since most robots will operate in human-inhabited environments. As in human learning, we not only learn how to perform a task quickly by transferring capabilities learned in a previous task, but also learn from feedback or guidance provided by surrounding people like teachers and parents. Interactive reinforcement learning (interactive RL) has been proposed to enable an agent to learn from ordinary people and has been shown to learn faster than a traditional RL agent [12]. In this paper, we propose to improve upon PNN by combining it with interactive RL. We hypothesize that agents learning with our method can obtain an optimal policy faster than a PNN agent and an interactive RL agent. We test our method using five RL tasks with discrete or continuous actions in OpenAI Gym and a sinusoids curve following task with an AUV simulator on the Gazebo platform. Our results suggest that PNN performs well when transferring from tasks with a low-dimensional state space to those with a high-dimensional one but has little effect when transferring from high-dimensional tasks to low-dimensional ones, whereas our proposed IPNL agent can learn a more stable policy with better performance faster in both cases. More importantly, our further analysis indicates that there is a synergy between PNN and interactive RL for improving the agent's learning, even when Progressive Network Learning has little transferring effect between tasks. Our results in the path following of AUV also shed light on the potential of applying our method in real-world tasks.

II. RELATED WORK
In this section, we review the work most related to our method in terms of sim-to-real transfer learning and interactive RL.

For the challenge illustrated above, how to bridge the gap between simulation and reality has received significant attention in reinforcement learning. Domain adaptation [8] learns the mapping from a state space that simulation and reality share to a hidden variable space. In the simulation environment, the mapped state space is used for algorithm training. When transferring to reality, the model trained in simulation can be applied directly. While this method has generality to some extent and succeeds in transferring simulation data to reality, training a policy entirely in simulation requires tackling the question of physical adaptation and accounting for the mismatch between simulation and reality. Inverse Dynamics Model [9] extends this by learning an inverse transition probability matrix in the real environment and directly applying the trained model from simulation to the real environment. By training a deep inverse dynamics model, this method can adapt complex policies between environments. However, it only focuses on action adaptation rather than observation adaptation, since it is unreasonable to expect simulated visuals to match the high fidelity of the real world.

With domain randomization, an agent first learns by randomizing visual information or physical parameters in simulation [5]. For example, the agent can learn in a simulated environment where the wall color, floor color, friction, atmospheric pressure, etc., change randomly. The obvious advantage of this method is that a policy trained only in simulation can be directly applied in the real world and achieve good performance. One limitation is that it can only be used when the source task is similar to the target task. Progressive network [13] is a good idea for DRL to transfer a policy even between disparate tasks. First, features learnt in the source task can be transferred to the target task without fine-tuning. Second, features of each layer can be transferred and analyzed concretely. Third, PNN adds new capacity when transferring knowledge to a new task with new input connections [11]. These advantages allow PNN to have a faster learning speed and better performance when learning in new tasks. An obvious downside of PNN is the growth in the number of parameters, and when PNN solves K tasks, choosing columns to transfer requires knowledge of the task.

Interactive RL was proposed to reduce the learning time of an RL agent based on reward shaping [14]. An agent with interactive RL allows people to observe the agent's behavior in an environmental state and give an evaluation of the quality of its behavior as feedback to teach it how to complete the task. Many interactive RL algorithms have been proposed and tested, which can mainly be divided into three groups: interactive shaping, learning from categorical feedback, and learning from policy feedback [12]. Interactive shaping allows an agent to learn to perform a task by interpreting human evaluative feedback as human reward signals as in traditional RL [15], [16]. In interactive shaping, the agent takes the human reward signal as a measure of task performance. The agent trained by interactive shaping acts more like humans, but the optimality of the policy in the task cannot be guaranteed [14]. In the learning from categorical feedback strategy [17], an agent learns depending on the teacher's behavior and training strategy, and the human teacher can provide feedback with different strategies. Learning from policy feedback includes policy shaping and COACH. Policy shaping formulates human feedback as policy labels and uses those labels to infer what the human teacher believes to be the optimal policy [18]. COACH uses a model of human feedback as an advantage function to capture feedback strategies that depend on the current policy [19].

III. BACKGROUND
A. Deep Reinforcement Learning
An RL problem can be modeled as a Markov Decision Process (MDP), in which an agent learns how to perform a task by interacting with the environment [20], [21]. An MDP can be represented with a tuple $M = \{S, A, T, R, \gamma\}$. $S$ and $A$ are the state and action spaces respectively. $T$ is the transition probability distribution. $R: S \times A \times S \to \mathbb{R}$ is the reward function. $\gamma \in (0, 1]$ is the discount factor.

In RL, at time step $t$ the agent observes the current state $s_t \in S$ and chooses an action $a_t \in A$, which causes a state transition $s_{t+1} \sim T(s_{t+1} \mid s_t, a_t)$. After performing the selected action, the agent receives a reward $R(s_{t+1} \mid s_t, a_t)$, which represents the value of the executed action towards the task. The sequence of states, actions and rewards in an episode constitutes a trajectory or rollout of a policy. In general, a policy $\pi$ is a mapping from states to a probability distribution over actions: $\pi: S \to p(A = a \mid S)$. The goal of an RL agent is to find an optimal policy $\pi^*$ that achieves the maximum expected return from all states.

An RL agent can learn a state value function $V^\pi(s)$ or an action value function $Q^\pi(s, a)$. The optimal policy can be obtained by greedily selecting actions with the learned optimal value functions. An agent with an actor-critic method learns a parameterized policy (the actor) and a parameterized value function (the critic) separately at the same time.

Deep reinforcement learning (DRL) was proposed by combining deep learning and RL to learn in complex tasks with high-dimensional state and action spaces. In DRL, a deep neural network (DNN) is used as a function approximator for value functions and policies. For example, the deep Q-network (DQN) [3] algorithm was proposed using a DNN to approximate the Q function in Q-learning. A typical actor-critic algorithm in DRL is the deep deterministic policy gradient algorithm (DDPG) [22]. DDPG was proposed to solve tasks with continuous action spaces by combining DQN and deterministic policy gradient (DPG) [23]. DDPG deploys experience replay to make the learning stable and robust. With the actor-critic structure in DPG, DDPG can update the policy at each step.
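The quantities introduced in Section III-A (return, action value function, greedy policy) can be illustrated with a small sketch. It assumes a hypothetical five-state chain MDP that is not one of the paper's tasks, and it uses a plain Q-table as a stand-in for the deep networks that DQN and DDPG use as function approximators. The function names, the learning rate, and the exploration rate are illustrative choices, not values from the paper.

```python
import random


def discounted_return(rewards, gamma):
    """Discounted return sum_t gamma**t * r_t of one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def q_learning_chain(n_states=5, episodes=500, gamma=0.9, lr=0.5, eps=0.2, seed=0):
    """Tabular Q-learning on a hypothetical chain MDP (not from the paper).

    States 0..n_states-1, actions 0 (left) and 1 (right). Reaching the last
    state gives reward 1 and ends the episode; every other step gives 0.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            # Epsilon-greedy behaviour policy, random choice on ties.
            if rng.random() < eps or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == n_states - 1
            r = 1.0 if done else 0.0
            # Q-learning target: r + gamma * max_a' Q(s', a'); no bootstrap at the end.
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += lr * (target - q[s][a])
            s = s_next
            if done:
                break
    return q


def greedy_policy(q):
    """Greedy action per state with respect to the learned Q-table."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


if __name__ == "__main__":
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
    q_table = q_learning_chain()
    print(greedy_policy(q_table)[:-1])  # actions for the non-terminal states
```

In this toy setting the greedy policy read off the learned table moves right in every non-terminal state, which is the "greedily selecting actions with the learned value function" step described above. DQN replaces the table with a network trained on the same kind of target.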
B. Interactive Reinforcement Learning
With increasing attention being paid to RL, the long learning time of standard RL methods becomes a critical problem that cannot be ignored when applying RL to robot systems in the real world. Therefore, reward shaping was proposed to speed up agent learning with standard RL. Based on reward shaping, interactive reinforcement learning (interactive RL) allows an agent to learn from non-experts in agent design and even programming [12]. In interactive RL, a human trainer can observe an agent's behavior in the environment and give evaluative feedback, which is used to train the agent learner. Every time the agent takes an action $a$ in a state $s$, the human trainer can provide evaluative feedback to tell the value of the selected action $a$. The agent uses this feedback to update its policy. A policy trained with interactive RL usually has a faster learning speed, and the agent's behavior will be more in line with human expectations.

C. Progressive Neural Network
In a progressive neural network, a neural network has $L$ layers with hidden activations $h_i^{(k)} \in \mathbb{R}^{n_i}$, where $n_i$ is the number of units at layer $i \le L$. A progressive network starts with a single column. When switching to the second column, the parameters $\Theta^{(1)}$ of the first column are frozen and the parameters $\Theta^{(2)}$ of the second column are randomly initialized, and the activation $h_i^{(2)}$ of the second column layer receives input from $h_{i-1}^{(1)}$ and $h_{i-1}^{(2)}$. This generalizes to $K$ columns as follows:

$$h_i^{(k)} = f\left(W_i^{(k)} h_{i-1}^{(k)} + \sum_{j<k} U_i^{(k:j)} h_{i-1}^{(j)}\right), \tag{1}$$

where $W_i^{(k)} \in \mathbb{R}^{n_i \times n_{i-1}}$ is the weight matrix of layer $i$ in column $k$, and $U_i^{(k:j)} \in \mathbb{R}^{n_i \times n_j}$ are the lateral connections from layer $i-1$ of column $j$. A progressive network with 3 columns is shown in Figure 1.

Fig. 1: Progressive network architecture with 3 columns (reproduced from [13]).

When applying progressive network to deep reinforcement learning, each column is trained to solve a Markov Decision Process: the $k$-th column defines a policy $\pi^{(k)}(a \mid s)$ taking input $s$ from the target environment and generating probabilities over actions $\pi^{(k)}(a \mid s) := h_L^{(k)}(s)$. At each time step, an action is taken by the agent from this distribution, yielding the subsequent state [11], [13].

IV. INTERACTIVE PROGRESSIVE NETWORK LEARNING
In this paper, we focus on extending Progressive Network Learning for transferring a policy from a source task to a target task by further improving its learning speed and performance in the target task. Therefore, we propose Interactive Progressive Network Learning (IPNL) by combining progressive network and interactive reinforcement learning. Progressive network and interactive reinforcement learning have been previously shown to provide benefits in improving the learning performance and efficiency of an agent [13], [11], [12], [15], [24], [17], [19]. Our method can leverage the better learning capability of progressive network and the high quality of samples in interactive reinforcement learning. We hypothesize that with our method, an agent can learn faster and better than agents learning with Progressive Network or interactive RL alone.

Specifically, instead of asking a human trainer to give human rewards all the time, which is exhausting, we trained a human reward network (HRN) to predict human feedback in the task. The loss function of the reward network in our method is the standard mean square error. That is to say, given an input $[\alpha S, \beta A]$ and the received human evaluative feedback $R_h$, we look to minimize the HRN loss:

$$L_{hr}(\alpha, \beta, S, A, R_h) = \frac{1}{n} \sum_{i=1}^{n} \left(R_h - \Phi(\alpha S, \beta A)\right)^2, \tag{2}$$

where $R_h$ is the human reward based on the evaluation of state $S$ and action $A$, $\alpha$ and $\beta$ are hyper-parameters set to trade off how strongly the human trainer weights the state and action when evaluating the quality of the agent's behavior, $\Phi$ is the estimated HRN, and $n$ is the number of samples. $\alpha$ and $\beta$ are used because a human trainer might weight the state and action differently when evaluating the quality of the agent's behavior in different tasks. In our experiments, $\alpha$ and $\beta$ are fixed values pre-defined according to our experience of training in different tasks.

Unlike the TAMER framework [24], which learns the human reward function and updates the value function or policy at the same time, our method allows the agent to collect $T$ samples to train the HRN before transfer learning happens. We believe that a human reward approximator with inaccurate predictions at the beginning of the learning process would have bad effects on policy learning. Our method makes sure that the HRN predicts human rewards well before the policy is updated. Normally, only a few samples are enough to train the HRN to a high prediction accuracy, a number that is insignificant compared with training an agent in the target task. The number of samples needed can be set differently for different tasks. The human trainer can also give evaluative feedback to further train the HRN after transferring the policy in the target task. In our experiments, we collected 400 samples to train the HRN, while an episode usually lasts 1000 time steps.

The proposed Interactive Progressive Network Learning (IPNL) approach is summarized in Algorithm 1. Our approach starts by collecting $T$ samples $M_R = \{R_h, \alpha, s, \beta, a\}$ stored in a buffer $D_H$ to train the HRN. During the process, the human trainer can give feedback by evaluating the quality of the agent's behavior in the task. The human reward is defined as below:

$$R_h = \begin{cases} +1 & \text{good action} \\ -1 & \text{bad action} \\ +100 & \text{agent reaching the goal state} \\ -100 & \text{agent fails} \end{cases} \tag{3}$$

After the human reward function HRN is good enough, the modified policy trained from the source task will be transferred to the target task. We use a progressive network with 2 columns for transferring in our method. The first column is the trained policy from the source task, which is a deep network with $L$ layers and parameters $\theta^1$. When switching to the second column, the parameters $\theta^1$ are frozen and the parameters $\theta^2$ are initialized randomly. Freezing the parameters $\theta^1$ of the first column is an effective method to avoid catastrophic forgetting. The activation functions $h_i^2$ of layer $L_i^2$ receive input from the previous layers $L_{i-1}^1$ and $L_{i-1}^2$. After transferring the source policy to the target task, the human trainer can give human rewards to further train the human reward function HRN.

Algorithm 1 Interactive Progressive Network Learning
Input: Environment env, number of training samples for HRN $T$, $\alpha$, $\beta$ for training HRN, policy $\pi_{source}$ of source task, number of episodes $N$, episode length $M$, replay buffer $D_T$
Output: Reward $r_h$, target policy $\pi_{target}$ of PNN
Initialize target agent and HRN buffer $D_H$
for $t = 1, \ldots, T$ do
    Receive human reward $R_h^t$
    Store transition $\{R_h^t, \alpha, s_t, \beta, a_t\}$ in $D_H$
    Update HRN with the reward loss $L_{hr}$
end for
Fix parameters $\theta^1$ of source policy $\pi_{source}$ with PNN
for number of episodes $= 1, \ldots, N$ do
    Reset env and initialize state $s_0$
    for number of steps $= 1, \ldots, M$ do
        Select action $a_t$ with target policy $\pi_{target}$
        Perform action $a_t$
        Receive reward $r_h^t$ from HRN and transition to next state $s_{t+1}$
        Store transition $\{s_t, a_t, s_{t+1}, r_h^t\}$ in $D_T$
        if end of episode then
            Reset env, state $s_0$ and receive reward $R_h$
        end if
        Update target policy $\pi_{target}$
    end for
end for

V. EXPERIMENT
We test our proposed method by conducting experiments using five RL tasks with both discrete and continuous action spaces from OpenAI Gym: CartPole, MountainCar, MountainCarContinuous, InvertedPendulum and InvertedDoublePendulum. Screenshots of the tasks are shown in Figure 2.

Fig. 2: Screenshots of the tasks used in our experiments: (a) CartPole, (b) MountainCar, (c) InvertedPendulum, (d) InvertedDoublePendulum.

We provide detailed descriptions of the five tasks below:

CartPole: CartPole-v0 was used in our experiment, which has a four-dimensional continuous state space and a discrete action space with two actions. In the task, a pole is connected to a cart by an unactuated joint, and the cart can move along a frictionless track. When an episode begins, the pole always stands upright. The agent can control the cart by applying a force of +1 or -1 to it to prevent the pole from falling. An episode ends when the pole is more than 15 degrees from vertical, or the cart moves more than 2.4 units from the center. A +i reward is given for each step and a -200 reward for ending the episode, where i is the total number of steps in the episode.

MountainCar: MountainCar-v0 was used in our experiment, which has a two-dimensional continuous state space and a discrete action space with three actions. The state of the agent is represented by the position and the velocity of the car; the actions are driving left, driving right or staying in place. When an episode starts, the car is always in a valley between two mountains. The goal of the agent is to reach the top of the mountain on the right, but the car's engine is not powerful enough to reach the top in a single pass. The agent receives a reward of -1 at each step, and when reaching the goal it receives a reward of +100.

MountainCarContinuous: MountainCarContinuous-v0 was used, which is the same as MountainCar-v0 except that the action is one-dimensional and continuous.

InvertedPendulum: InvertedPendulum-v2 was used in our experiment, which is a 3D environment for CartPole with a one-dimensional continuous action space.
InvertedDoublePendulum: In the task, two pendulums task with AUV as the source policy. Then we transfer the
are attached by the cart, and the goal of this task is to trained policy to the target task of sinusoids curve following.
balance the pendulum in the upright position by taking the Since the two tasks are similar, the second column of the
continuous action a ∈ [−1, 1]. The state space is represented policy in target task is initialized with the source policy.
by an eleven-dimensional continuous vector consisting of Agents with interactive DQN and IPNL learn from both
the status information of the cart and two poles. The action environmental reward and human reward as in [25].
space is one-dimensional and continuous.
We conducted mainly two groups of experiments: transfer
learning between tasks with discrete actions and transfer
learning between tasks with continuous actions. For transfer
learning between tasks with discrete actions, a DQN agent
was first trained in MountainCar-v0 as source task, and
transferring to CartPole with PNN and IPNL, to see the
effect of our method transferring from tasks with low-
dimensional continuous observation to tasks with a high-
dimensional one. We also trained a DQN agent in CartPole
as source task, and transferring to MountainCar with PNN
and IPNL, to see the effect of our method transferring from
Fig. 3: The autonomous underwater vehicle simulator and
tasks with high-dimensional continuous state space to tasks
environment in the Gazebo robot platform.
with a low-dimensional one. When the source task and target
task have different dimensions of state space, we expand the TABLE I: The architecture and number of parameters of
dimension of input for the low-dimensional task, and fill in PNN and HRN. fc1 and fc2 are two fully connected layers.
the expanded dimensions with 0 until the two tasks have the
same dimensions of state input. PNN(DQN) PNN(DDPG)
HRN
For transfer learning between tasks with continuous ac- Source Target Source Target
tions, to see the effect of our method transferring be- fc1 50 50 300 300 100
tween tasks with the same dimension of action space but fc2 50 50 400 400 100
from high-dimensional continuous state space to a low- params 2.7K 5.4K 247.4K 494.8K 11.2K
dimensional one, a DDPG agent was trained in Inverted-
Pendulum and transferring to MountainCarContinuous with VI. RESULTS AND ANALYSIS
PNN and IPNL. For transferring from tasks with high-
dimensional continuous state space to tasks with a much In this section, we present and analyze the experimental
higher dimensional one, the policy trained in InvertedPen- results. Figure 4 shows the learning curves of all methods in
dulum was also transferred to InvertedDoublePendulum with the two groups of experiments: transferring between tasks
PNN and IPNL respectively. with discrete actions and transferring between tasks with
continuous actions.
We also trained an interactive DQN/DDPG agent in the
target task which learns from the trained human reward A. Transferring Between Tasks with Discrete Actions
function in the two experiments for comparison. All agents Figure 4a shows the learning curves of the PNN and IPNL
are trained with feedforward neural networks with input agents transferring from MountainCar with two-dimensional
layers followed by two fully connected layers. The activation state space to CartPole with four-dimensional state space. A
function for output layer of HRN is tanh. The architecture DQN agent was also trained in CartPole as baseline for com-
and number of parameters of PNN and HRN are shown in parison. The action space of MountainCar and CartPole is
Table I. The maximum number of time steps is 1000 in almost the same except that MountainCar has an additional
CartPole, MountainCarContinuous and InvertedDoublePen- “staying in the place” action which is not meaningful for
dulum, and 2000 in MountainCar. When transferring to the CartPole. From Figure 4a we can see that, a DQN agent
policy learned in the source task to the target one, the second can generally learn a good policy after about 700 episodes’
column of policy in the target task is initialized randomly. training in CartPole, while a PNN agent transferring from
In addition, to examine the potential of applying our MountainCar can learn a better policy in 400 episodes, but
method in the real world task, we also tested our method the learning speed in the first 200 episodes is similar for
and compared to DQN, PNN, interactive DQN in the path both agents. With our proposed IPNL method, the agent
following task of autonomous underwater vehicle (AUV) learns a much better policy than both of them with only
with a simulator of the real system in our lab on Gazebo. 200 episodes’ training.
A screenshot of the simulator and environment is shown in Figure 4b shows the learning curves of the PNN and IPNL
Figure 3. The state space, action space and reward function agents transferring from CartPole to MountainCar. A DQN
are the same as in [25]. For PNN and our proposed IPNL, agent was also trained in MountainCar as baseline for com-
we first trained a policy to complete a straight line following parison. Different from Figure 4a, Figure 4b shows that a
(a) CartPole (b) MountainCar

(c) MountainCarContinuous (d) InveredDoublePendulum


Fig. 4: Mean performance of the IPNL, DQN, DDPG, PNN, and interactive DQN/DDPG agents trained with 20 random
seeds in CartPole, MountainCar, MountainCarContinuous and InvertedDoublePendulum.

DQN agent can learn a good policy after about 30 episodes’ that, a DDPG agent can generally learn a good policy after
training in MountainCar, but a PNN agent transferring from about 5500 episodes’ training in InvertedDoublePendulum,
CartPole has a similar learning performance and speed to the DQN agent. However, with our proposed IPNL method, the agent can learn a better policy much faster than both of them.

B. Transferring Between Tasks with Continuous Actions

Figure 4c shows the learning curves of the PNN and IPNL agents transferring from InvertedPendulum, which has a four-dimensional state space, to MountainCarContinuous, which has a two-dimensional state space. Both tasks have a one-dimensional continuous action space. A DDPG agent was also trained in MountainCarContinuous as a baseline for comparison. Figure 4c shows that a DDPG agent can generally learn a good policy after about 200 episodes of training in MountainCarContinuous, whereas a PNN agent transferring from InvertedPendulum reaches similar performance in fewer than 150 episodes, slightly faster than the DDPG agent. In contrast, with our proposed IPNL method, the agent learns a much more stable policy with the best performance almost from the first few episodes.

Figure 4d shows the learning curves of the PNN and IPNL agents transferring from InvertedPendulum to InvertedDoublePendulum, which has an eleven-dimensional state space. Both tasks have a one-dimensional continuous action space. A DDPG agent was also trained in InvertedDoublePendulum as a baseline for comparison. From Figure 4d we can see [...] while a PNN agent transferring from InvertedPendulum can learn a policy with similar performance in fewer than 3000 episodes. With our proposed IPNL method, the agent obtains a more stable policy with the same performance faster than both the DDPG and PNN agents.

C. Performance in Sinusoids Curve Following of AUV

Figure 5 shows the trajectories of an AUV simulator with the DQN, interactive DQN, PNN and IPNL methods at different learning episodes in the sinusoids curve following task. Figure 5a shows that the AUV with DQN learns to follow the curve in about 60 episodes. Figure 5b shows that the AUV with interactive DQN reaches similar performance in 40 episodes, faster than the AUV with DQN. Figure 5c shows that the AUV with PNN, using a policy transferred from a straight line following task, can already approximately follow the curve in 20 episodes, which is similar to the performance of the AUV with DQN at Episode 40. At Episode 50, the AUV with PNN has learned to follow the curve, but still performs slightly worse than the AUV with interactive DQN at Episode 40. Figure 5d shows that, with our proposed IPNL method, the AUV obtains much better performance than with DQN, interactive DQN and PNN after only 20 episodes of training.
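Statements such as "learns a good policy after about 200 episodes" can be made concrete with an episodes-to-threshold measure on the learning curve. The following is a minimal sketch, not the evaluation procedure used in the paper: the function name `episodes_to_threshold`, the moving-average window, the threshold and the learning curves are all hypothetical values chosen for illustration.

```python
from collections import deque


def episodes_to_threshold(returns, threshold, window=10):
    """Return the 1-based episode at which the moving average of the last
    `window` episode returns first reaches `threshold`, or None if it never does.

    The window size and threshold are illustrative assumptions.
    """
    recent = deque(maxlen=window)
    running_sum = 0.0
    for episode, episode_return in enumerate(returns, start=1):
        if len(recent) == window:
            # The oldest value is about to be evicted by append().
            running_sum -= recent[0]
        recent.append(episode_return)
        running_sum += episode_return
        if len(recent) == window and running_sum / window >= threshold:
            return episode
    return None


if __name__ == "__main__":
    # Synthetic learning curves (made-up data, not results from the paper).
    slow_agent = [0.0] * 150 + [90.0] * 100
    fast_agent = [0.0] * 20 + [90.0] * 230
    print("slow agent:", episodes_to_threshold(slow_agent, threshold=85.0))
    print("fast agent:", episodes_to_threshold(fast_agent, threshold=85.0))
```

Using a moving average rather than a single episode return makes the measure less sensitive to one lucky episode, which matters when comparing the stability of policies as well as their speed of learning.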
Fig. 5: Trajectories of the AUV simulator with (a) DQN, (b) interactive DQN, (c) PNN and (d) IPNL learning in the task of sinusoids curve following. X is the coordinate along the horizontal axis and Y is the coordinate along the vertical axis.
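One simple way to quantify how closely a trajectory like those in Fig. 5 follows the reference curve is the mean vertical deviation from a sinusoid. The sketch below is an illustration only: the reference curve parameters (`amplitude`, `wavelength`) and the function name are hypothetical, the paper's actual reference curve and performance measure are not given in this section, and vertical deviation is a simplification of true cross-track distance.

```python
import math


def mean_tracking_error(trajectory, amplitude=1.0, wavelength=2 * math.pi):
    """Mean absolute vertical deviation of (x, y) points from the reference
    curve y = amplitude * sin(2 * pi * x / wavelength).

    The reference curve parameters are assumptions for illustration.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one (x, y) point")
    errors = [
        abs(y - amplitude * math.sin(2 * math.pi * x / wavelength))
        for x, y in trajectory
    ]
    return sum(errors) / len(errors)


if __name__ == "__main__":
    # Made-up trajectory: the reference curve shifted upward by 0.2.
    shifted = [(x / 10, math.sin(x / 10) + 0.2) for x in range(100)]
    print("mean tracking error:", mean_tracking_error(shifted))
```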

D. Component Analysis

Our experimental results in Figure 4 suggest that our proposed IPNL method allows better and faster transfer learning between tasks with both discrete and continuous actions, even when Progressive Network Learning has little effect, as happens when transferring from tasks with a high-dimensional state space to those with a low-dimensional one. To study the reason, we also trained an interactive DQN/DDPG agent, which learns from human rewards, in the target task of each experiment above, in order to investigate the effects of the human reward network and the Progressive Network in our method.

Figure 4a and Figure 4d show that, for tasks with either discrete or continuous actions, agents with a Progressive Network learn well when transferring from tasks with a low-dimensional state space to those with a high-dimensional one, while the interactive DQN/DDPG agent reaches better or similar performance even faster than the PNN agent. Combining interactive RL and the Progressive Network as in our proposed IPNL method, the agent learns a more stable policy with better or similar performance much faster than both the PNN and the interactive DQN/DDPG agents. Since the dimension of the state space in the straight line following task is lower than that of the sinusoids curve following task, Figure 5 further supports these results by showing that the AUV with IPNL learns faster and better than the AUV with PNN or interactive DQN when transferring from a low-dimensional task to a high-dimensional one.

However, Figure 4b and Figure 4c show that, for tasks with either discrete or continuous actions, while the Progressive Network has little effect when transferring from tasks with a high-dimensional state space to those with a low-dimensional one, the interactive DQN/DDPG agent still learns a better or similar policy faster than the PNN agent. More importantly, combining interactive RL and the Progressive Network as in our proposed IPNL method, the agent learns a more stable policy with better or similar performance much faster than the interactive DQN/DDPG agent.

In summary, our results suggest that, for transfer learning between tasks with discrete and continuous actions, Progressive Network Learning performs well when transferring from tasks with a low-dimensional state space to those with a high-dimensional one, but has little effect when transferring from tasks with a high-dimensional state space to those with a low-dimensional one. However, our proposed IPNL method allows an agent to learn a better policy faster in both cases. Our further analysis suggests that there is a synergy between the Progressive Network and interactive RL for improving the agent's learning, even when the Progressive Network has little transfer effect between tasks.
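To make the role of the Progressive Network in this analysis more tangible, the sketch below shows a two-column forward pass in which a target column receives lateral input from a source column, following the general formulation of progressive neural networks. It is a simplified illustration, not the architecture used in the paper: the layer sizes, the random weights, the class name `TwoColumnProgressiveNet` and, in particular, the helper `fit_state`, which zero-pads or truncates the target state so that it fits the source column, are assumptions. This section does not describe how state spaces of different dimensions are reconciled.

```python
import random


def dense(weights, inputs):
    """Matrix-vector product: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def fit_state(state, dim):
    """ASSUMPTION: zero-pad or truncate the target state to the source
    column's input size. The paper's own mapping is not given here."""
    return (list(state) + [0.0] * dim)[:dim]


class TwoColumnProgressiveNet:
    """Source column (treated as pre-trained and frozen) plus a target column
    whose output layer also receives the source column's hidden features."""

    def __init__(self, source_dim, target_dim, hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.source_dim = source_dim
        self.w1_source = random_matrix(hidden, source_dim, rng)  # frozen
        self.w1_target = random_matrix(hidden, target_dim, rng)  # trainable
        self.w2_target = random_matrix(n_out, hidden, rng)       # trainable
        self.u2_lateral = random_matrix(n_out, hidden, rng)      # trainable

    def forward(self, target_state):
        h_source = relu(dense(self.w1_source,
                              fit_state(target_state, self.source_dim)))
        h_target = relu(dense(self.w1_target, target_state))
        own = dense(self.w2_target, h_target)
        lateral = dense(self.u2_lateral, h_source)
        # Target output = own path + lateral contribution from the source.
        return [a + b for a, b in zip(own, lateral)]


if __name__ == "__main__":
    # Four-dimensional source task, two-dimensional target task.
    net = TwoColumnProgressiveNet(source_dim=4, target_dim=2, hidden=8, n_out=1)
    print(net.forward([0.1, -0.2]))
```

If the lateral weights contribute little useful signal, the target column effectively learns from scratch, which is one way to picture the case where the Progressive Network has little transfer effect.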
VII. CONCLUSION

In this paper, to improve upon Progressive Network Learning for transferring a policy from a source task to a target task, we proposed Interactive Progressive Network Learning (IPNL), which combines the Progressive Network with interactive reinforcement learning. To reduce the human trainer's workload, we trained a human reward network (HRN) to predict human feedback in our method. We tested our method on five RL tasks with discrete or continuous actions in OpenAI Gym (CartPole, MountainCar, MountainCarContinuous, InvertedPendulum and InvertedDoublePendulum) and on a sinusoids curve following task with an AUV simulator on the Gazebo platform. Our results suggest that, for transfer learning between tasks with discrete or continuous actions, Progressive Network Learning performs well when transferring from tasks with a low-dimensional state space to those with a high-dimensional one but has little effect in the opposite direction, whereas our proposed IPNL method allows an agent to reach better performance faster in both cases. More importantly, our further analysis indicates that there is a synergy between the Progressive Network and interactive RL for improving the agent's learning, even when Progressive Network Learning has little transfer effect between tasks. Our results in the path following of the AUV show the potential of applying our method to real-world tasks.
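The idea of training a model to predict human feedback, so that the trainer does not have to label every step, can be sketched with a small supervised regressor. This is a minimal illustration under stated assumptions, not the HRN described in the paper: the model here is linear rather than a neural network, and the class name `LinearRewardModel`, the single feature and the feedback values are all hypothetical.

```python
import random


class LinearRewardModel:
    """Stand-in for a human reward network: predicts scalar human feedback
    from a feature vector. A linear model is an ASSUMPTION for brevity."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        return self.bias + sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, human_feedback):
        """One stochastic gradient step on the squared prediction error."""
        error = self.predict(features) - human_feedback
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]
        self.bias -= self.learning_rate * error
        return error ** 2


def train(model, samples, epochs=500, seed=0):
    """Fit the model to (features, feedback) pairs collected from a trainer."""
    rng = random.Random(seed)
    samples = list(samples)
    for _ in range(epochs):
        rng.shuffle(samples)
        for features, feedback in samples:
            model.update(features, feedback)
    return model


if __name__ == "__main__":
    # Made-up feedback: the hypothetical trainer rewards small deviations
    # (feature near 0) and punishes large ones (feature near 1).
    feedback_log = [([0.0], 1.0), ([0.5], 0.0), ([1.0], -1.0)]
    reward_model = train(LinearRewardModel(n_features=1), feedback_log)
    # The predicted feedback can now stand in for the trainer on new states.
    print(reward_model.predict([0.25]))
```

Once fitted, the model's predictions can be used as the reward signal for states the trainer never labeled, which is the sense in which such a model reduces the trainer's workload.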
[22] S. Li, Y. Wu, X. Cui, H. Dong, F. Fang, and S. Russell, “Robust
R EFERENCES multi-agent reinforcement learning via minimax deep deterministic
policy gradient,” in Proceedings of the AAAI Conference on Artificial
[1] Z. Zhu, K. Lin, and J. Zhou, “Transfer learning in deep reinforcement Intelligence, vol. 33, 2019, pp. 4213–4220.
learning: A survey,” arXiv preprint arXiv:2009.07888, 2020. [23] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Ried-
[2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. miller, “Deterministic policy gradient algorithms,” 2014.
Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, [24] W. B. Knox and P. Stone, “Tamer: Training an agent manually via
et al., “Human-level control through deep reinforcement learning,” evaluative reinforcement,” in 2008 7th IEEE International Conference
nature, vol. 518, no. 7540, pp. 529–533, 2015. on Development and Learning. IEEE, 2008, pp. 292–297.
[3] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van [25] Q. Zhang, J. Lin, Q. Sha, B. He, and G. Li, “Deep interactive
Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, reinforcement learning for path following of autonomous underwater
M. Lanctot, et al., “Mastering the game of go with deep neural vehicle,” IEEE Access, vol. 8, pp. 24 258–24 268, 2020.
networks and tree search,” nature, vol. 529, no. 7587, pp. 484–489,
2016.
[4] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A.
Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, et al.,
“Building watson: An overview of the deepqa project,” AI magazine,
vol. 31, no. 3, pp. 59–79, 2010.
[5] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel,
“Domain randomization for transferring deep neural networks from
simulation to the real world,” in 2017 IEEE/RSJ International Con-
ference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp.
23–30.
[6] X. B. Peng, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Sim-to-
real transfer of robotic control with dynamics randomization,” in 2018
IEEE international conference on robotics and automation (ICRA).
IEEE, 2018, pp. 1–8.
[7] J. Kober and J. Peters, “Reinforcement learning in robotics: a survey,”
International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–
1274, 2014.
[8] E. Tzeng, C. Devin, J. Hoffman, C. Finn, X. Peng, S. Levine,
K. Saenko, and T. Darrell, “Towards adapting deep visuomotor
representations from simulated to real environments,” arXiv preprint
arXiv:1511.07111, vol. 2, no. 3, 2015.
[9] P. Christiano, Z. Shah, I. Mordatch, J. Schneider, T. Blackwell,
J. Tobin, P. Abbeel, and W. Zaremba, “Transfer from simulation to
real world through learning deep inverse dynamics model,” arXiv
preprint arXiv:1610.03518, 2016.
