Visual Hindsight Experience Replay
Himanshu Sahni 1 2 Toby Buckley 1 Pieter Abbeel 1 3 Ilya Kuzovkin 1
Abstract
Reinforcement Learning algorithms typically require millions of environment interactions to
learn successful policies in sparse reward settings.
Hindsight Experience Replay (HER) was introduced as a technique to increase sample efficiency
through re-imagining unsuccessful trajectories as
successful ones by replacing the originally intended goals. However, this method is not applicable to visual domains where the goal configuration is unknown and must be inferred from
observation. In this work, we show how unsuccessful visual trajectories can be hallucinated to
be successful using a generative model trained
on relatively few snapshots of the goal. As far
as we are aware, this is the first work that does
so with the agent policy conditioned solely on
its state. We then apply this model to training
reinforcement learning agents in discrete and continuous settings. We show results on a navigation
and pick-and-place task in a 3D environment and
on a simulated robotics application. Our method
shows marked improvement over standard RL algorithms and baselines derived from prior work.
1. Introduction
Deep reinforcement learning (RL) has recently demonstrated success in a range of previously unsolved tasks, from
playing Atari and Go on a superhuman level (Mnih et al.,
2015; Silver et al., 2017) to learning control policies for
real robotics tasks (Levine et al., 2016; OpenAI, 2018; Pinto
et al., 2017). But deep RL algorithms are highly sample
inefficient for complex tasks and learning from sparse rewards can be challenging. In these settings, millions of
steps are wasted exploring trajectories that yield no learning
signal. On the other hand, providing dense rewards along
these trajectories is a tedious job that requires substantial
domain knowledge and RL expertise. Ill-specified shaping rewards can also lead to unexpected hacking behaviour
1 OffWorld Inc. 2 Georgia Institute of Technology 3 University of California Berkeley. Correspondence to: Himanshu Sahni <hsahni3@gatech.edu>.
Figure 1. VHER works by using a generative model to hallucinate
the presence of goals at the end of unsuccessful trajectories. The
agent’s task is to search for a pebble randomly placed in its surroundings and collect it by approaching and centering it in its view.
The top row shows a failed trajectory which ends in the agent not
finding the pebble. The bottom row replays the same trajectory
with a hallucinated visual goal inserted by HALGAN at every state
such that a pebble appears to be collected.
(Ng et al., 1999; Randløv & Alstrøm, 1998). Therefore, an important direction for RL research is toward more sample efficient methods that minimize the number of environment interactions yet can still be trained using only sparse rewards.
To this end, Andrychowicz et al. (2017) introduced the idea
of Hindsight Experience Replay (HER), which can rapidly
train a goal-conditioned policy by retroactively imagining
failed trajectories as successful ones. By making use of
failed attempts to increase sample efficiency, HER was able
to learn a range of robotics tasks that traditional RL methods
were unable to solve. But HER was only shown to work in
non-visual environments, where the precise goal configuration is provided to the agent’s policy throughout training and
where it is straightforward to find a goal that is satisfied in
any state. It is not directly applicable to challenging visual
domains resembling real world applications, where the goal
location is not explicitly known and must be searched for
within the environment.
Yet, we desire for RL agents to quickly learn to operate
in the high-dimensional visual environments that humans
inhabit. In HER, Andrychowicz et al. (2017) employed a
goal conditioned policy using universal value function approximators (UVFAs) (Schaul et al., 2015) to generalize
over multiple goals. Some recent work has extended that to
visual goal conditioned policies (Nair et al., 2018) where
goals are sampled from the set of possible agent states. But
there is a wide range of visual tasks where we do not have
an explicit representation of a goal beforehand and where a
state may not easily map to a goal. Thus, we would like the
agent to be able to perform visual tasks without providing it
an exact specification of the goal during execution and instead have it search for the goal in its environment. For this,
the agent must be able to infer the presence of goals from
the state image itself. Without a direct goal specification,
the agent must also learn to generalize over multiple goals
just from its state.
To address high sample complexity of RL in such visual
environments, we introduce Visual Hindsight Experience
Replay (VHER), which combines a hallucinatory generative
model with HER to rapidly solve tasks using only raw pixels
in the state as input to the agent policy. The hallucinatory
generative model, HALGAN, minimally alters images in
snippets of failed trajectories to appear as if the desired goal
is achieved at the end. In order to retroactively hallucinate
success in a visual environment, it is necessary to alter the
state images along the failed trajectory to make it appear as if
the goal was present throughout (see figure 1). HALGAN is
trained using a few snapshots of near goal images, where the
relative location of the agent to the goal is known. It is then
combined with HER during the reinforcement learning loop
to hallucinate goals along unsuccessful trajectories. The RL
policy is trained solely on images and without knowledge
of relative goal configuration.
The key contributions of this work are to expand the applicability of HER to visual domains by providing a way
to retroactively transform failed trajectories into successful
ones and hence allow the agent to rapidly generalize across
multiple goals using only the state as input to its policy. In
this work, we aim to minimize the amount of direct goal
specification required and learn RL policies conditioned
solely on the agent state image. We believe that the sample
complexity reduction that VHER provides is an important
step towards being able to train RL policies directly in the
real world.
2. Background
Below, we lay out some preliminary information on reinforcement learning and generative models.
2.1. Reinforcement Learning
In reinforcement learning, the agent is tasked with the maximization of some notion of a long term expected reward
(Sutton & Barto, 2018). The problem is typically modeled
as a Markov decision process (MDP). An MDP consists
of a tuple ⟨S, A, R, T, γ⟩, where S is the set of states
the agent can exist in, A is the set of environment actions,
R : S × A → R is the function mapping states and actions to a scalar reward, T : S × A → S is the transition
function, and γ ∈ [0, 1) is a discount factor that weighs
how important future rewards are versus immediate ones.
Stochasticity in the environment can be present in the form
of uncertainties in transition or reward.
The agent must learn a policy, π : S → A, mapping every state to an action. The optimal policy, π*, is often the goal of learning; it informs the agent of an action that maximizes the expected value of the sum of future discounted rewards, E[Σ_k γ^k R(s_{t+k})], starting from any state s_t. This expectation, known as the state value (V : S → R), is over trajectories experienced under the current
policy and environment dynamics. UVFAs (Schaul et al.,
2015) approximate value functions with respect to a goal
in addition to the state, V : S × G → R. Goals are drawn
from the space G and are typically represented as desired
agent states or configurations of objects in the environment
or as desired state images. The optimal policy, π ∗ (s; g), in
this case maximizes the probability of achieving a particular
goal, g, from any state.
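For concreteness, the discounted return inside this expectation can be computed for a recorded trajectory as in the following minimal Python sketch (an illustration only; the function name and defaults are our own):

    def discounted_return(rewards, gamma=0.99):
        """Sum_k gamma^k * r_{t+k}, accumulated backwards over one trajectory."""
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    # A sparse-reward episode where only the final step succeeds:
    # discounted_return([0.0, 0.0, 0.0, 1.0]) == 0.99**3, approximately 0.970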
Off-policy RL algorithms can learn an optimal policy using
experiences from a behavior policy separate from the optimal policy. In particular, off-policy algorithms can make
use of samples collected in the past, leading to more sample efficient learning. An experience replay (Lin, 1992)
is typically employed to store past transitions as tuples of
(s_t, a_t, r_t, s_{t+1}). At every step of training, a minibatch of transitions is sampled from the replay at random, and a loss on the expected future return is minimized. The off-policy algorithms employing an experience replay that we use in this work
are Double Deep Q-Networks (DDQN) (Van Hasselt et al.,
2016) and Deep Deterministic Policy Gradients (DDPG)
(Lillicrap et al., 2015).
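A minimal sketch of such a replay buffer is given below; the capacity and batch size are illustrative defaults, not values from our experiments:

    import random
    from collections import deque

    class ExperienceReplay:
        """Bounded buffer of (s_t, a_t, r_t, s_{t+1}) tuples (Lin, 1992)."""
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)

        def store(self, s, a, r, s_next):
            self.buffer.append((s, a, r, s_next))

        def sample(self, batch_size=32):
            # Uniform sampling breaks temporal correlation between transitions.
            return random.sample(self.buffer, batch_size)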
2.2. Hindsight Experience Replay
HER was shown to achieve speedups in learning in environments where the goal configuration is provided along with
the agent state to the policy. The essential idea is to store each trajectory, Traj_i = s_0^i, s_1^i, ..., s_T^i, with a number of additional goals along with the originally specified one. An off-policy algorithm employing an experience replay is used to train a UVFA, which learns a policy that generalizes across multiple goals. During replay, the original goals are changed to states that have actually been achieved by the agent in the past.
The reward is also modified retroactively to reflect the new goal being replayed. In particular, HER assumes that every goal, g ∈ G, can be expressed as a predicate f_g : S → {0, 1}; that is, any state can be judged as to whether or not a goal g has been achieved in it. Thus, while replaying the trajectory Traj_i with a surrogate goal g, one can easily reassign rewards along the entire trajectory as

    r_g(s_t^i) = 1 if f_g(s_t^i) = 1, and 0 otherwise.
Andrychowicz et al. (2017) report that selecting g to be a future state from within the same (failed) episode leads to the
best results. This training approach forms a sort of implicit
curriculum for the agent. In the beginning, it encourages the
agent to explore further outwards along trajectories it has
visited before. Since the surrogate goal, g, is also explicitly
provided to the UVFA policy, it soon learns to also generalize this curriculum over unseen goals. Over time, the agent
is able to achieve any goal in G, including the real ones.
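A minimal sketch of this relabeling with the future strategy is shown below; to_goal stands in for the assumed mapping m : S → G, achieves for the predicate f_g, and k hindsight goals are drawn per transition:

    import random

    def relabel_episode(states, actions, to_goal, achieves, k=4):
        """Augment a failed episode with future-state hindsight goals."""
        relabeled = []
        for t in range(len(states) - 1):
            for _ in range(k):
                future = random.randint(t + 1, len(states) - 1)
                g = to_goal(states[future])      # a goal achieved later on
                r = 1.0 if achieves(g, states[t + 1]) else 0.0
                relabeled.append((states[t], actions[t], g, r, states[t + 1]))
        return relabeled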
2.3. Wasserstein GANs
We employ an improved Wasserstein ACGAN (Gulrajani
et al., 2017; Odena et al., 2017) as our generative model
because of its stability, realistic looking outputs, and ability
to condition the generated images on a desired class. A
typical W-ACGAN has a generator, H, that takes as input a
class variable and a latent vector of random noise. It then
generates an image which is fed into the discriminator, D.
D rates the image on its fidelity to the training data and,
as an auxiliary task, predicts class membership. The Earth Mover's distance between the distributions of real, p_R, and generated, p_H, images is used as a loss to train the combined
model. A standard practice in Wasserstein GANs is to train
the discriminator multiple times for each generator update.
The discriminator begins to act as a critic that rates images
on their fidelity to the training data.
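A minimal PyTorch sketch of the gradient penalty used in this training scheme is shown below; the critic module and image batch shapes are assumptions:

    import torch

    def gradient_penalty(critic, real, fake):
        """WGAN-GP penalty: push the critic's gradient norm towards 1
        on random interpolates between real and generated images."""
        eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
        x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
        grads, = torch.autograd.grad(critic(x_hat).sum(), x_hat,
                                     create_graph=True)
        return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

    # In the training loop, the critic is typically updated several times
    # (e.g., 5, as in our hyperparameters) for every generator update.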
Here, it is important to point out that the motivation behind
using a GAN in this work is to produce realistic looking
hallucinations that will allow the agent to easily generalize
from imagined goals to real ones. Realistic insertion of
goals was not an issue in HER because a new goal could
directly be substituted in a replayed transition without any
modification to the states.
3. Related Work
Generative Models in RL. In recent years, generative models have demonstrated significant improvements in the areas of image generation, data compression, denoising, and
latent-space representations, among others (Goodfellow
et al., 2014; Chen et al., 2016; Vincent et al., 2008). Reinforcement learning has also benefited from incorporating
generative models in the training process. Ha & Schmidhuber (2018) synthesize much prior work in the area by proposing a Recurrent Neural Network (RNN) based generative dynamics model (Schmidhuber, 1990) of popular OpenAI Gym (Brockman et al., 2016) and VizDoom (Kempka et al., 2016) environments. They employ a fairly common procedure of encoding high dimensional visual inputs from the environment into lower dimensional embedding vectors using a Variational Auto-Encoder (VAE) (Kingma & Welling, 2013) before passing them on to the RNN model. Held et al.
(2017) use a GAN to generate goals matching in difficulty
to an agent’s skill on a task. Called GoalGAN, it generates
an automatic curriculum of incrementally harder to reach
goals. But it assumes that goals can easily be set in the
environment by the agent and does not make efficient use
of trajectories that failed to achieve these objectives.
Generative models have also been used in the closely related field of imitation learning to learn from human demonstrations or observation sequences (Ho & Ermon, 2016;
Edwards et al., 2018b; Schroecker et al., 2019). In our approach, we do not require demonstrations of the task, or
even a sequence of observations, but random snapshots of
the goal which we use to speed up reinforcement learning.
Goal Based RL. Some recent work has focused on leveraging information on the goal or surrounding states to speed up
reinforcement learning. Edwards et al. (2018a) and Goyal
et al. (2018) learn a reverse dynamics model to generate
states backwards from the goal which are then added to the
agent’s replay buffer. The former work assumes that the goal
configuration is known and backtracks from there, whereas
in the latter, high-value states are picked from the replay
buffer or a GoalGAN is used to generate goals. The latter
work also learns an inverse policy, π(at |st+1 ) to generate
plausible actions leading back from goal states. In contrast,
we focus on minimally altering states in existing failed trajectories already in the replay buffer to appear as if a goal
has been completed in them. This avoids having to generate
entirely new trajectories and allows us to make full use of
the environment dynamics already present in previous state
transitions.
Others have focused on learning goal-conditioned policies
in visual domains using a single or few images of the goal
(Xie et al., 2018; Zhu et al., 2017). Nair et al. (2018) train a
β-VAE (Burgess et al., 2018) on state images for a threefold
purpose: (1) to sample new goals during training, (2) to use
the Euclidean distance between feature encodings of current
and goal images as a dense reward, and (3) to retroactively
alter goals with VAE generated images and reassign rewards
appropriately. The set of goals G is assumed to be the same
as the set of states S and hence they are easy to swap back
and forth. This works well for domains where the goal is
separately provided to the policy along with the agent state,
and where states do not have to be modified for changing
goals. In this work, we attempt learning in domains where
the goal image is not known beforehand and thus cannot be
provided to the agent’s policy, and where the goal may or
may not be present in a particular agent state.
4. The missing component in HER
First, we will more formally discuss what is missing from
the original HER formulation that does not allow it to readily
extend to visual domains. Then, in the next section, we will
describe in detail how the use of hallucinatory generative
models can help bridge the gap.
HER makes an assumption on the domain that “given a state
s we can easily find a goal g which is satisfied in this state”
(Andrychowicz et al., 2017). It requires a mapping, m :
S → G that maps every state s ∈ S to a goal g ∈ G that is
achieved in that state. While this mapping may be relatively
straightforward to hand design for real-valued state spaces,
its analog for visual states cannot be constructed easily. For
example, if the state space of the agent lies on the plane of
real values in R2 , the goal may be to achieve a particular
x-coordinate. So in the agent state (x = 0.5, y = 1.0), a
goal that is satisfied is simply g : x = 0.5. Now imagine if
the agent must instead navigate to a beacon on a 2D plane
using camera images as state inputs. In order to convert
any arbitrary state into one in which a goal is satisfied, the
beacon must be visually inserted into the image itself. We
call these goal hallucinations (see figure 2).
In order to fully utilize the power of HER, not only should
the agent be able to hallucinate goals in arbitrary states, but
also consistently in the same absolute position throughout
the failed trajectory. Note that with each step along the trajectory, the position of the goal (a beacon) changes relative
to the agent’s and thus the agent’s observation must be correctly updated to reflect this change. The goal must appear
to have been solved in a future state along every step of the
trajectory (see figure 1). Only then can we make use of the
existing transitions along the entire trajectory for replay with
hallucinated as well as original goals. Thus, visual settings require the mapping m to be extended along the entire trajectory s_0, ..., s_T, becoming m_V : S_Traj^T → G, where T is the maximum length of a trajectory and Traj is the space of failed trajectories. Every state s along a trajectory from Traj must be modified by the mapping into a near-goal state that is consistent with the final goal state g ∈ G of that trajectory. This is where the main contribution of this work lies.
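To make the required mapping concrete, the sketch below computes the relative configuration of the agent to the final state of a snippet, for every state along it, assuming planar ⟨x, y, yaw⟩ configurations as in our navigation tasks:

    import numpy as np

    def relative_config(pose_t, pose_T):
        """Pose of the snippet's final state expressed in the frame of pose_t."""
        dx, dy = pose_T[0] - pose_t[0], pose_T[1] - pose_t[1]
        c, s = np.cos(-pose_t[2]), np.sin(-pose_t[2])  # rotate into agent frame
        return np.array([c * dx - s * dy, s * dx + c * dy, pose_T[2] - pose_t[2]])

    def trajectory_configs(poses):
        """m_V: pair every state of the snippet with the same final pose, so the
        hallucinated goal stays at a consistent absolute position."""
        return [relative_config(p, poses[-1]) for p in poses]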
It is apparent that the use of UVFAs to generalize over
multiple goals, as in HER, does not extend to visual settings
where the goal location is unknown and must be identified
within the environment. Hence, in this work, the agent’s
policy is solely conditioned on its state.
5. Approach
To address the shortcomings of HER, we adopt a two part approach. First, a generative adversarial network (GAN) is trained to modify any existing state from a failed trajectory into a goal or near goal state. We call this model HALGAN.

Figure 2. Hallucinated images generated by our model. The original, failed image is on the top left. All others include goals generated by HALGAN. The goal distance increases from top to bottom and the angle from left to right. This demonstrates that, with our training approach, goal hallucinations can be generated with high fidelity in any relative configuration.
HALGAN generates goal hallucinations conditioned on the
configuration of the robot in the current state relative to its
configuration in a future state from the same episode. Note
that we will make use of the assumption that in realistic
robotic applications, while it may be difficult to obtain the
explicit location of the goal throughout reinforcement learning, one can obtain the configuration of the robot relative to
itself easily. This can be done using SLAM or other state
tracking techniques (Montemerlo et al., 2002).
Then, during reinforcement learning, random snippets of
past failed trajectories are replayed with the final state in the
snippet set as the target goal location. The trained HALGAN
modifies pairs of states that constitute the transitions along
the trajectory to appear as if the goal was indeed achieved
by the end of it. Details of the entire hallucinating process
are provided in the next few subsections.
5.1. Hallucinating Visual Goals
HALGAN is trained on a dataset, R, of observations of
the goal where its relative location to the agent is explicitly known. These snapshots of the goal can be collected
beforehand and are only used once to train the generative
model. HALGAN then generalizes to create thousands of
hallucinations along failed trajectories during reinforcement
learning. These failed trajectories are ones the agent has
taken in the past and are stored in its experience replay.
In order to fool the agent into thinking that it has indeed
achieved a goal, one has to insert the goal into the final
image of that trajectory snippet. Thus, the state s_T at the end of a trajectory has to be modified to s̄_T such that it appears as if the goal were achieved in it. This is in contrast to the regular HER approach and the approach of Nair et al. (2018), where the state can be directly mapped to a goal using the hand designed mapping m : S → G.

Figure 3. A conditioning vector c(s_t; g) informs the generator, H, of the desired relative location of the goal. l is a random noise vector drawn from N(1, 0.1). The generated goal image is added to a failed state and then passed through a renormalizing tanh function. This is the final hallucinated state with the goal positioned as desired. H is trained adversarially along with D, which is learning to rate the fake and real near goal images from the dataset R. D also predicts relative goal configurations in real and fake images, which in turn incentivizes H to hallucinate goals in the correct relative locations.

During learning, a snippet of a failed trajectory in the agent's experience replay is sampled randomly. Along with the final state of the snippet, s_T, other states in the trajectory leading up to it, s_0, s_1, ..., s_{T-1}, must also be modified to appear as if the goal were indeed accomplished in s_T. For this, the hallucinated goal location must remain consistent throughout the replayed trajectory. In the following subsections, we describe each component of HALGAN and then show how it fits together to generate consistent hallucinations of the goal.

5.2. Minimal Hallucinations

One of our aims is to minimally alter a failed trajectory in order to turn its states into goal (s_T) or near-goal (s_0, s_1, ..., s_{T-1}) states. This makes full use of existing trajectories and does not require HALGAN to re-imagine the environment dynamics or unnecessary details about the goal state, such as the background.

To this end, we train an additive model, such that the generator, H, has to produce only the differences to the state image that add in the goal. To obtain a hallucinated image s̄_t with the goal at the final state of the trajectory, s_T, we compute

    s̄_t = tanh(s_t + H(c(s_t; s_T), l)),    (1)

where H is the generative model function, c(s_t; g) is the relative configuration of the robot to a desired goal state g, and l is a random latent conditioning vector. The tanh is used to re-normalize the hallucinated state image to [-1, 1]; any differentiable bounded function can be used for this purpose. The hallucinated state, s̄_t, along with a state s_r sampled from the dataset R, is then fed to the discriminator D to compute the discriminative loss,

    L_D = E_{s̄_t ~ p_H}[log D(s̄_t)] - E_{s_r ~ p_R}[log D(s_r)],    (2)

where p_H and p_R are the hallucinated and real near goal image distributions.

In addition to the discriminator image loss, a gradient penalty is employed, as in the improved training of Wasserstein GANs (see Gulrajani et al. (2017) for more details):

    L_∇ = E_{ŝ ~ p_ŝ}[(‖∇D(ŝ)‖ - 1)^2].    (3)

As a result of generating only image differences, the trained hallucinatory model is invariant to some kinds of visual variation, such as the background, the presence of other objects, etc. Note that we do not condition H on the current failed state, s_t, nor on the end state in the trajectory, s_T. It is only conditioned on the agent's relative configuration to the desired goal state. While this may lead to some awkward goal hallucinations, we found that in practice it did not noticeably influence learning.

To encourage the model to generate minimal modifications to the original failed image, we also add an L2 norm loss on the output of H. In our experiments, this helped discourage the generator from focusing on unnecessary elements of goals, such as background information or extra objects in the environment:

    L_H = ‖H(c(s_t; s_T), l)‖.    (4)

5.3. Regression Auxiliary Task

Typical ACGANs are conditioned on a discrete set of classes, such as flower, dog, etc. (Odena et al., 2017). But to be useful for reinforcement learning along the failed trajectory, the generator must be conditioned on the relative configuration of the agent from the desired goal state, which is a vector c(s_t; g) ∈ R^n. The auxiliary task for the discriminator then is to regress the real valued relative location of the goal seen in a training image. To train this regression based auxiliary task, we use a mean squared error loss,

    L_A = ‖c(s_t; g) - ĉ(s_t; g)‖,    (5)

where ĉ(s_t; g) is the relative configuration predicted by D. We found it helpful to add a small amount of Gaussian noise to our auxiliary inputs for robust training, especially on smaller datasets.
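A minimal PyTorch sketch of the additive hallucination of equation (1) is given below; the generator module and its signature are assumptions:

    import torch

    def hallucinate(generator, failed_state, rel_config, latent_dim=128):
        """Equation (1): add the generated goal difference image to a failed
        state and renormalize with tanh."""
        l = 1.0 + 0.1 * torch.randn(failed_state.size(0), latent_dim,
                                    device=failed_state.device)  # l ~ N(1, 0.1)
        diff = generator(rel_config, l)          # goal pixels only
        return torch.tanh(failed_state + diff)   # hallucinated state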
5.4. HALGAN

Our final loss for the combined HALGAN is

    L = L_D + α L_∇ + β L_H + λ L_A,    (6)

where α, β, and λ are weighting hyperparameters, which we set to 10, 1, and 10, respectively, in all our experiments.

To summarize, the training process is as follows. The generator, conditioned on a randomly drawn relative goal location, produces a difference image, which is then added to a randomly selected image from a failed trajectory to create a goal hallucination. The discriminator is provided with these hallucinated images as well as ground truth images from R, and has to score the images on their authenticity and also predict the auxiliary variable. See figure 3 for a representation of the HALGAN training process and the appendix for more details on the network architectures and training procedure.
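A minimal sketch of how these terms combine in a training step is given below; the tensor names are assumptions, and the critic term is written in the standard Wasserstein form:

    def halgan_loss(d_real, d_fake, grad_pen, diff_image, aux_pred, aux_target,
                    alpha=10.0, beta=1.0, lam=10.0):
        """Equation (6): L = L_D + alpha*L_grad + beta*L_H + lambda*L_A."""
        l_d = d_fake.mean() - d_real.mean()                # critic loss, cf. (2)
        l_h = diff_image.flatten(1).norm(2, dim=1).mean()  # minimal-change loss (4)
        l_a = ((aux_pred - aux_target) ** 2).mean()        # auxiliary MSE (5)
        return l_d + alpha * grad_pen + beta * l_h + lam * l_a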
For the purposes of our experiments, we collect the training data for HALGAN, R, using the last few states of a successful rollout, in this case a demonstration. The exact data required in R are randomly selected snapshots from near the goal, plus the final agent configuration in which the goal is achieved, used to calculate relative poses. Note that only observations, consisting of the state image and agent configuration, are used; no actions have to be provided or demonstrated. This alleviates the data collection burden, as the human does not have to demonstrate optimal completion of the task and snapshots can be collected in any order. For example, it is significantly simpler to record the desired final configuration of objects on a table than to record a full, optimal demonstration of a robot arm arranging them. It also
allows the generative model to be independent of the agent
and demonstrator action spaces. We also collect a dataset
of failed trajectories using random exploration. These are
used during HALGAN training to add to the output of H
and create hallucinated near goal states. Most off-policy
RL methods that employ an experience replay have a replay
warmup period where actions are taken randomly to fill the
replay to a minimum before training begins. This dataset
of failed trajectories can be the same as the replay warmup
and no extra exploration is required.
5.5. Visual HER
During reinforcement learning, the agent explores its environment as normal. Every time a batch is sampled for
training, a few of the data points from it are augmented with
goal hallucinations. The detailed process is explained in
algorithm 1. The result is that the agent encounters hallucinated near goal states with a much higher frequency than
if it were randomly exploring. This in turn encourages the
agent to explore further from near goal states.
Algorithm 1 Visual Hindsight Experience Replay
1: Given: trained hallucinatory model H, reward reassignment strategy r_g(s).
2: Initialize off-policy algorithm A. {e.g., DDQN, DDPG}
3: Initialize experience replay E by random exploration.
4: for step = 1, N do
5:   Sample an action according to the behavior policy, a_t ← π(s_t), in the current state.
6:   Execute a_t in the environment and observe state s_{t+1} and reward r_t.
7:   Store tuple ⟨s_t, a_t, r_t, s_{t+1}⟩ in E.
8:   Sample minibatch B from E for training.
9:   for e = ⟨s_i, a_i, r_i, s_{i+1}⟩ in B do
10:    Sample c ~ Bern(p). {p = hallucination prob.}
11:    if c then
12:      Sample d ~ Unif({0, 1, ..., D}). {distance to goal state}
13:      Compute relative configurations c(s_i; s_{i+d}) and c(s_{i+1}; s_{i+d}). {setting s_{i+d} as the goal state}
14:      s_i ← s_i + H(c(s_i; s_{i+d}), l)
15:      s_{i+1} ← s_{i+1} + H(c(s_{i+1}; s_{i+d}), l)
16:      r_i ← r_{s_{i+d}}(s_{i+1})
17:    end if
18:  end for
19:  Perform one step of optimization using A on the modified minibatch B.
20: end for
An important consideration is the retroactive reassignment
of rewards. As a reminder, HER uses a manually defined
function fg (s) which decides if the goal g is satisfied in a
state s to designate rewards during hindsight replay. This
sort of retroactive reward function is hard to hand design
in visual environments. Comparing state and goal images
pixel by pixel is typically ineffective. Fortunately, for the
purposes of reward reassignment during hindsight replay,
one need only compare the agent state to a future one in
the same episode. Hence, we assume the existence of a
similar function, fs : S × S → {0, 1}, which decides
whether a pair of states are the same for the purpose of goal
completion. This sort of function is also difficult to hand
specify for visual states because of the above mentioned
difficulties in pixel-by-pixel comparisons. As mentioned
in section 3, Nair et al. (2018) use a trained β-VAE as fs
to reassign rewards in a dense manner. Here, we make use
of the access to the robot’s own configuration to design a
similar function, fc , where c is the robot configuration at a
particular state. We then assume that any goal satisfied in c
must also be satisfied in any other state with a similar enough
configuration. During retroactive reward reassignment, we
compare the relative configurations in the current and future
goal state, and hallucinate a reward if they are similar. We
also compare against the distance metric employed by Nair et al. (2018), which did not perform as well as using the agent configuration in our experiments.
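A minimal sketch of such a configuration based reassignment is shown below; the threshold is an assumed hyperparameter, not a value from our experiments:

    import numpy as np

    def reassign_reward(rel_config, threshold=0.1):
        """f_c: a hallucinated goal counts as achieved when the agent's relative
        configuration to the chosen future state is close enough to zero."""
        return 1.0 if np.linalg.norm(rel_config) < threshold else 0.0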
6. Experiments
We test our method on two first person visual environments.
In a modified version of MiniWorld (Chevalier-Boisvert,
2018), we design two tasks. The first one is to navigate to
a red box located in an enclosed room (figure 4(a)). The
second task is a pick-and-place variant for first person 3D
environments, where the agent must navigate to the red box,
visually center it to pick it up and then carry it to a green
box somewhere else in the room (see figure 4(b)).
The second environment is a more visually realistic simulated robotics domain, where a TurtleBot2 (Wise & Foote, 2011)
equipped with an RGB camera is simulated within Gazebo
(Koenig & Howard, 2004). We use gym-gazebo (Zamora
et al., 2016) to interface with Gazebo. In this environment,
the agent must collect a pebble scattered randomly on a road
by approaching and centering it in its visual field (figure
4(c)). The episode ends and the agent is reset to the starting
location if it wanders too far. Episodes also end after 400
steps or upon completion of the goal.
Figure 4. Examples of near goal states in the MiniWorld navigate (a), MiniWorld pick-and-place (b), and TurtleBot (c) environments.

Figure 4 depicts near goal states in all of our tasks. The goal is randomly spawned a small distance away from the agent. Encountering the goal is extremely rare, and standard RL is sample inefficient or completely ineffective. The sizes of the near goal dataset, R, for the TurtleBot, navigation, and pick-and-place tasks are 6840, 2000, and 6419 images with relative goal configurations, respectively. We show, however, that reducing the number of near goal states leads to little performance degradation in the TurtleBot environment (figure 6).

In the TurtleBot and MiniWorld navigation tasks, the configuration of the agent is simply its ⟨x, y, yaw⟩. In pick-and-place, an additional binary field indicates whether the red box is held by the agent. The agent's relative configuration is calculated with respect to the red box before it is picked up, and the green box afterwards. Hallucinations are generated for the agent approaching both boxes. In these tasks, we found it helpful to anneal the number of hallucinations in a batch over time as the agent starts filling the replay with real reward. Details of the annealing rate and other experimental hyperparameters are provided in the appendix.

Comparisons. We compare our approach against several extensions of prior work to the visual domain, where goals are not provided to the policy explicitly, as well as against standard model-free RL baselines. A naive extension of HER into the visual domain, her, simply rewards the agent for states at the end of failed trajectories during replay, without hallucinating. The agent thus receives hindsight rewards, but the sampled trajectories still appear to end in failure. This is an ablation of our approach that tests the effect of removing HALGAN from the training procedure.

A second baseline is derived from Nair et al. (2018)'s work (RIG) on training goal-conditioned policies with a dense reward based on the distance between the embedding of the sampled state and that of a goal image. RIG's retroactive reassignment of goals relies on the use of UVFAs, which is not possible in our domains, where the goal image is unknown. We therefore test two variants of this baseline. We first train a VAE on exactly the data available to HALGAN, i.e., near goal images in R and failed state images collected by random exploration. Then, during RL, vae-her simply sets the final image in a failed trajectory, without any hallucinations, as the goal and uses the trained VAE to compute the reward for each transition along that trajectory. This baseline evaluates the effectiveness of dense reward reassignment in our domains without hallucinations from HALGAN.

rig- follows a similar dense reward reassignment strategy, but computes the distance of a state to a randomly sampled goal image in R. Goal images are identified in R by filtering for a zero relative configuration of the agent from the goal. Hence, rig- rewards the agent for being in states that look similar to goal states in retrospect, without employing any hallucinations. For the distance based rewards provided by the VAE in rig- to be of the same order of magnitude as the environment rewards, it was necessary to rescale them; the scaling factor in all our experiments was set to 0.02.

Discrete and Continuous Control. An advantage of our method is that HALGAN is agnostic to the agent's action space. By conditioning on the relative location of the robot to a state in the future, we free the model of any assumptions about how the robot actually gets there.

In the discrete TurtleBot environment, only a sparse reward is used to indicate completion of the goal. The action space consists of back and forth movement and turning (4 actions). The base off-policy algorithm used is Double DQN (Van Hasselt et al., 2016). For the continuous MiniWorld environments, a penalty on the L2 norm of the output actions is applied
at each step to simulate an energy cost. Otherwise, the agent is only provided the sparse task completion reward. The output actions are the linear and rotational velocities of the agent at the next step, capped at a fixed amount. The base algorithm used in this setting is DDPG (Lillicrap et al., 2015). We employ deep convolutional neural networks as function approximators that take in the state image as input and output the desired control actions or values.

Figure 5. In all tasks, VHER starts learning immediately, whereas the baselines need to explore far more to randomly encounter positive rewards. In the TurtleBot pebble collection task (left), all algorithms eventually learn an optimal policy, but VHER begins learning immediately and converges quickly. In the harder, continuous control MiniWorld navigate task (middle), neither DDPG nor naive HER is able to learn to complete the task; only the rig- baseline somewhat learns the task eventually, on three of the five random seeds. In the final pick-and-place task, only VHER learns the optimal policy, in four out of five random seeds.
7. Results

In all of our experiments, VHER begins learning immediately (figure 5), because the realistic looking hallucinated goals are quickly identified as desirable states. This is in contrast to standard RL, which rarely encounters reward and must explore at length to encounter random rewards in order to begin the learning process, if at all.

In the discrete TurtleBot pebble collection domain (figure 5(a)), the naive HER strategy provides a good enough exploration bonus for the agent to explore further and more quickly than standard DDQN, and it begins learning by 100K steps. VHER, by contrast, starts learning to navigate to real goals immediately.

For the continuous control experiments in MiniWorld (figures 5(b) and 5(c)), only VHER is able to learn to complete the task. Note that achieving a reward of 0 in these environments is relatively easy; it is only positive rewards that indicate achievement of the goal. DDPG never encounters any reward during exploration and hence learns to simply minimize its actions in order to avoid the movement penalty. Naive her initially explores heavily and hence incurs a heavy penalty, but does not learn to associate the rewards it receives with the presence of a goal; some of the random seeds eventually converge to the same degenerate policy as DDPG. vae-her, the augmentation of her with dense rewards from a trained VAE, also proves unsuccessful for either task, demonstrating that dense rewards without hallucinated or real goals in failed trajectories are likewise ineffective for learning in these domains. Only the rig- strategy of providing dense rewards relative to random goal images eventually learns to complete the navigation task on some of the seeds. For the pick-and-place task, rig- only learns a working policy on a single seed, and the other baselines perform similarly or worse. Interestingly, RIG's dense reward reassignment can readily be combined with our approach of state modification by hallucination, providing a direction for future work.

Figure 6. Reinforcement learning using VHER in the TurtleBot task with varying sizes of the training dataset for HALGAN. The similarity of the curves is a positive result, showing only minor variance in RL agent performance as the training data available to HALGAN is reduced from 6840 (original) down to 1000 near goal training samples.

Finally, in figure 6, we show the change in performance on the TurtleBot pebble collection task when using fewer training samples in R. The effect is only slightly slower learning, even for the greatly reduced dataset of only 1000 images. The minimalistic hallucinations created by HALGAN require a relatively small amount of data to train well enough to provide a significant boost in reinforcement learning.
8. Discussion
A major impediment to training RL agents in the real world
is the amount of data an agent must collect and process
before it can start drawing inference on which actions lead
to rewards and which ones are to be avoided. High sample complexity makes problems such as the fragility of physical systems, energy consumption, robot speed, and sensor errors manifest acutely when one attempts to run the reinforcement learning process in the real world.
In this work, we have shown that Hindsight Experience
Replay can be extended to visual scenarios where the goal
location is not explicitly known beforehand, as is common
in many realistic applications. We empirically show that, by hallucinating goals along failed trajectories, the agent can begin learning to solve tasks immediately. VHER converges faster than standard RL techniques, which flounder fruitlessly before encountering rewards and, in complex tasks, fail to find a working policy at all. VHER requires
relatively few snapshots of near goal images with known
goal configurations. In certain environments, this dataset
could be generated online as the agent learns, or supplied
from orthogonal techniques such as GoalGAN (Held et al.,
2017). We leave this as an avenue for future work.
9. Acknowledgements
We would like to thank the entire OffWorld team for their
enthusiastic support of this work. Special thanks to Ashish
Kumar for help in the setup of experiments and for many
hours of fruitful discussions.
References
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong,
R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and
Zaremba, W. Hindsight experience replay. In Advances
in Neural Information Processing Systems 30, pp. 5048–
5058. Curran Associates, Inc., 2017.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI Gym, 2016.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters,
N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-vae. arXiv preprint arXiv:1804.03599,
2018.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever,
I., and Abbeel, P. Infogan: Interpretable representation
learning by information maximizing generative adversarial nets. In Advances in neural information processing
systems, pp. 2172–2180, 2016.
Chevalier-Boisvert, M. gym-miniworld environment for
openai gym. https://github.com/maximecb/
gym-miniworld, 2018.
Edwards, A. D., Downs, L., and Davidson, J. C.
Forward-backward reinforcement learning.
CoRR,
abs/1803.10227, 2018a. URL http://arxiv.org/
abs/1803.10227.
Edwards, A. D., Sahni, H., Schroecker, Y., and Isbell,
C. L. Imitating latent policies from observation. CoRR,
abs/1805.07914, 2018b. URL http://arxiv.org/
abs/1805.07914.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Bengio,
Y. Generative adversarial nets. In Advances in neural
information processing systems, pp. 2672–2680, 2014.
Goyal, A., Brakel, P., Fedus, W., Lillicrap, T. P., Levine, S.,
Larochelle, H., and Bengio, Y. Recall traces: Backtracking models for efficient reinforcement learning. CoRR,
abs/1804.00379, 2018. URL http://arxiv.org/
abs/1804.00379.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and
Courville, A. C. Improved training of wasserstein gans.
In Advances in Neural Information Processing Systems
30, pp. 5767–5777. Curran Associates, Inc., 2017.
Ha, D. and Schmidhuber, J. World models. arXiv preprint
arXiv:1803.10122, 2018.
Held, D., Geng, X., Florensa, C., and Abbeel, P. Automatic
goal generation for reinforcement learning agents. CoRR,
abs/1705.06366, 2017. URL http://arxiv.org/
abs/1705.06366.
Ho, J. and Ermon, S. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.
Kempka, M., Wydmuch, M., Runc, G., Toczek, J., and Jaśkowski, W. ViZDoom: A Doom-based AI research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pp. 1–8. IEEE, 2016.
Kingma, D. P. and Welling, M. Auto-encoding variational
bayes. arXiv preprint arXiv:1312.6114, 2013.
Koenig, N. P. and Howard, A. Design and use paradigms for
gazebo, an open-source multi-robot simulator. In IROS,
volume 4, pp. 2149–2154. Citeseer, 2004.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez,
T., Tassa, Y., Silver, D., and Wierstra, D. Continuous
control with deep reinforcement learning. arXiv preprint
arXiv:1509.02971, 2015.
Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning,
8(3-4):293–321, 1992.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness,
J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control
through deep reinforcement learning. Nature, 518(7540):
529, 2015.
Montemerlo, M., Thrun, S., Koller, D., Wegbreit, B., et al. FastSLAM: A factored solution to the simultaneous localization and mapping problem. In AAAI/IAAI, pp. 593–598, 2002.
Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and
Levine, S. Visual reinforcement learning with imagined
goals. In Advances in Neural Information Processing
Systems, pp. 9209–9220, 2018.
Ng, A. Y., Harada, D., and Russell, S. Policy invariance
under reward transformations: Theory and application to
reward shaping. In ICML, volume 99, pp. 278–287, 1999.
Odena, A., Olah, C., and Shlens, J. Conditional image synthesis with auxiliary classifier GANs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2642–2651. JMLR.org, 2017.
OpenAI. Learning dexterous in-hand manipulation. CoRR,
abs/1808.00177, 2018. URL http://arxiv.org/
abs/1808.00177.
Pinto, L., Andrychowicz, M., Welinder, P., Zaremba, W.,
and Abbeel, P. Asymmetric actor critic for image-based
robot learning. CoRR, abs/1710.06542, 2017. URL
http://arxiv.org/abs/1710.06542.
Randløv, J. and Alstrøm, P. Learning to drive a bicycle using
reinforcement learning and shaping. In ICML, volume 98,
pp. 463–471. Citeseer, 1998.
Schaul, T., Horgan, D., Gregor, K., and Silver, D. Universal value function approximators. In Proceedings of the
32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research,
pp. 1312–1320, Lille, France, 07–09 Jul 2015. PMLR.
Schmidhuber, J. Making the world differentiable: On using self-supervised fully recurrent neural networks for
dynamic reinforcement learning and planning in nonstationary environments. 1990.
Schroecker, Y., Vecerik, M., and Scholz, J. Generative predecessor models for sample-efficient imitation learning.
In International Conference on Learning Representations,
2019. URL https://openreview.net/forum?
id=SkeVsiAcYm.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou,
I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M.,
Bolton, A., et al. Mastering the game of go without
human knowledge. Nature, 550(7676):354, 2017.
Sutton, R. S. and Barto, A. G. Reinforcement learning: An
introduction. MIT press, 2018.
Van Hasselt, H., Guez, A., and Silver, D. Deep reinforcement learning with double q-learning. In AAAI, volume 2,
pp. 5. Phoenix, AZ, 2016.
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A.
Extracting and composing robust features with denoising
autoencoders. In Proceedings of the 25th international
conference on Machine learning, pp. 1096–1103. ACM,
2008.
Wise, M. and Foote, T. REP 119: Specification for TurtleBot compatible platforms, Dec. 2011.
Xie, A., Singh, A., Levine, S., and Finn, C. Few-shot goal
inference for visuomotor learning and planning. CoRR,
abs/1810.00482, 2018. URL http://arxiv.org/
abs/1810.00482.
Zamora, I., Lopez, N. G., Vilches, V. M., and Cordero,
A. H. Extending the openai gym for robotics: a toolkit
for reinforcement learning using ros and gazebo. arXiv
preprint arXiv:1608.05742, 2016.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., FeiFei, L., and Farhadi, A. Target-driven visual navigation
in indoor scenes using deep reinforcement learning. In
Robotics and Automation (ICRA), 2017 IEEE International Conference on, pp. 3357–3364. IEEE, 2017.
Appendix
A. Experimental Hyperparameters
Refer to the table below for environment specific hyperparameters.

Hyperparameter | TurtleBot | MiniWorld Navigate | MiniWorld Pick-and-Place
Replay warmup | 10,000 | 10,000 | 10,000
Replay capacity | 100,000 | 100,000 | 100,000
Initial exploration ε | 1.0 | 1.0 | 1.0
Final exploration ε | 0.5 | 0.5 | 0.5
ε anneal steps | 100,000 | 100,000 | 250,000
Discount (γ) | 0.99 | 0.99 | 0.99
Off-policy algorithm | DDQN | DDPG | DDPG
Policy optimizer | Adam | Adam | Adam
Learning rate | 1e−3 | 1e−5 (actor), 1e−4 (critic) | 1e−5 (actor), 1e−4 (critic)
Size of R for HALGAN | 6,840 | 2,000 | 6,419
Hallucination start % | 20% | 30% | 30%
Hallucination end % | 0% | 0% | 0%
Max failed trajectory length | 16 | 32 | 16
Image size | 64 x 64 | 64 x 64 | 64 x 64
Random seeds | 75839, 69045, 47040 | 75839, 69045, 47040, 60489, 11798 | 75839, 69045, 47040, 60489, 11798

Table 1. Environment Specific Hyperparameters
Refer to the table below for HALGAN specific hyperparameters.

Hyperparameter | Value
Latent vector size | 128
Latent sampling distribution | N(1, 0.1)
Auxiliary task weight | 10
Gradient penalty weight | 10
L2 loss on H weight | 1
Optimizer | Adam
Learning rate | 1e−4
Adam β1 | 0.5
Adam β2 | 0.9
D iters per H iter | 5

Table 2. Hyperparameters involved in training HALGAN
B. Network Architectures

Refer to the table below for details on the network architecture for DDQN. LeakyReLUs were used as activations throughout, except for the output layer, where no activation was used.

Layer | Shape | Filters | # Params
Image Input | 64 x 64 | 3 | 0
Conv 1 | 5 x 5 | 4 | 304
Conv 2 | 5 x 5 | 8 | 808
Conv 3 | 5 x 5 | 16 | 3,216
Conv 4 | 5 x 5 | 32 | 12,832
Dense 1 | 32 | - | 16,416
Dense 2 | 4 (nb actions) | - | 132
Total | - | - | 33,708

Table 3. Network Architecture for DDQN Agent
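For concreteness, a PyTorch sketch consistent with the parameter counts above is given below; the strides and paddings are assumptions chosen to reproduce those counts (each 5 x 5 convolution halves the 64 x 64 input down to 4 x 4, giving a 32 * 4 * 4 = 512 dimensional input to Dense 1):

    import torch.nn as nn

    ddqn = nn.Sequential(
        nn.Conv2d(3, 4, 5, stride=2, padding=2), nn.LeakyReLU(),   # 304 params
        nn.Conv2d(4, 8, 5, stride=2, padding=2), nn.LeakyReLU(),   # 808
        nn.Conv2d(8, 16, 5, stride=2, padding=2), nn.LeakyReLU(),  # 3,216
        nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.LeakyReLU(), # 12,832
        nn.Flatten(),
        nn.Linear(512, 32), nn.LeakyReLU(),                        # 16,416
        nn.Linear(32, 4),                                          # 132 (4 actions)
    )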
Refer to the table below for details on the network architecture of the actor for DDPG. LeakyReLUs were used as activations throughout, except for the output layer, where a tanh was used.

Layer | Shape | Filters | # Params
Image Input | 64 x 64 | 3 | 0
Conv 1 | 5 x 5 | 4 | 304
Conv 2 | 5 x 5 | 8 | 808
Conv 3 | 5 x 5 | 16 | 3,216
Conv 4 | 5 x 5 | 32 | 12,832
Dense 1 | 32 | - | 16,416
Dense 2 | 2 (nb actions) | - | 66
Total | - | - | 33,642

Table 4. Network Architecture for DDPG Actor

Refer to the table below for details on the network architecture of the critic for DDPG. LeakyReLUs were used as activations throughout, except for the output layer, where no activation was used.

Layer | Shape | Filters | # Params
Image Input | 64 x 64 | 3 | 0
Conv 1 | 5 x 5 | 4 | 304
Conv 2 | 5 x 5 | 8 | 808
Conv 3 | 5 x 5 | 16 | 3,216
Conv 4 | 5 x 5 | 32 | 12,832
Dense 1 | 32 | - | 16,416
Dense 2 | 1 | - | 33
Total | - | - | 33,673

Table 5. Network Architecture for DDPG Critic

Refer to the table below for details on the network architecture of the generator in HALGAN. LeakyReLUs were used as activations throughout, except immediately after the conditioning layer, where no activation was used, and the output, where tanh was used.

Layer | Shape | Filters | # Params
Config Input | 3 | - | 0
Dense 1 | 128 | - | 384
Conditioning Input | 128 | - | 0
Multiply | 128 | - | 0
Reshape | 1 x 1 | 128 | 0
UpSample + Conv 1 | 4 x 4 | 64 | 131,136
BatchNorm | 2 x 2 | 64 | 256
UpSample + Conv 2 | 4 x 4 | 64 | 65,600
BatchNorm | 4 x 4 | 64 | 256
UpSample + Conv 3 | 4 x 4 | 64 | 65,600
BatchNorm | 8 x 8 | 64 | 256
UpSample + Conv 4 | 4 x 4 | 32 | 32,800
BatchNorm | 16 x 16 | 32 | 256
UpSample + Conv 5 | 4 x 4 | 32 | 16,416
BatchNorm | 32 x 32 | 32 | 128
UpSample + Conv 6 | 4 x 4 | 16 | 8,028
BatchNorm | 64 x 64 | 16 | 64
Conv 7 | 4 x 4 | 8 | 2,056
BatchNorm | 64 x 64 | 8 | 32
Conv 8 | 4 x 4 | 3 | 387
Total | - | - | 323,707

Table 6. Network Architecture for HALGAN Generator

Refer to the table below for details on the network architecture of the discriminator in HALGAN. LeakyReLUs were used as activations throughout, except at the output, where no activation was used.

Layer | Shape | Filters | # Params
Image Input | 64 x 64 | 3 | 0
Conv 1 | 4 x 4 | 32 | 1,568
Conv 2 | 4 x 4 | 32 | 16,416
Conv 3 | 4 x 4 | 32 | 16,416
Conv 4 | 4 x 4 | 64 | 32,832
Conv 5 | 4 x 4 | 64 | 65,600
Conv 6 | 4 x 4 | 64 | 65,600
Conv 7 | 4 x 4 | 128 | 131,200
Dense (aux) | 2 | - | 258
Dense (real/fake) | 1 | - | 129
Total | - | - | 330,019

Table 7. Network Architecture for HALGAN Discriminator