Stabilizing Off-Policy Q-Learning via Bootstrapping Error Reduction

Aviral Kumar∗, UC Berkeley, aviralk@berkeley.edu
Justin Fu∗, UC Berkeley, justinjfu@eecs.berkeley.edu
George Tucker, Google Brain, gjt@google.com
Sergey Levine, UC Berkeley and Google Brain, svlevine@eecs.berkeley.edu

arXiv:1906.00949v2 [cs.LG] 25 Nov 2019
Abstract
Off-policy reinforcement learning aims to leverage experience collected from
prior policies for sample-efficient learning. However, in practice, commonly used
off-policy approximate dynamic programming methods based on Q-learning and
actor-critic methods are highly sensitive to the data distribution, and can make only
limited progress without collecting additional on-policy data. As a step towards
more robust off-policy algorithms, we study the setting where the off-policy experi-
ence is fixed and there is no further interaction with the environment. We identify
bootstrapping error as a key source of instability in current methods. Bootstrap-
ping error is due to bootstrapping from actions that lie outside of the training data
distribution, and it accumulates via the Bellman backup operator. We theoretically
analyze bootstrapping error, and demonstrate how carefully constraining action se-
lection in the backup can mitigate it. Based on our analysis, we propose a practical
algorithm, bootstrapping error accumulation reduction (BEAR). We demonstrate
that BEAR is able to learn robustly from different off-policy distributions, including
random and suboptimal demonstrations, on a range of continuous control tasks.

1 Introduction
One of the primary drivers of the success of machine learning methods in open-world perception
settings, such as computer vision [20] and NLP [9], has been the ability of high-capacity function
approximators, such as deep neural networks, to learn generalizable models from large amounts of
data. Reinforcement learning (RL) has proven comparatively difficult to scale to unstructured real-
world settings because most RL algorithms require active data collection. As a result, RL algorithms
can learn complex behaviors in simulation, where data collection is straightforward, but real-world
performance is limited by the expense of active data collection. In some domains, such as autonomous
driving [39] and recommender systems [4], previously collected datasets are plentiful. Algorithms
that can utilize such datasets effectively would not only make real-world RL more practical, but also
would enable substantially better generalization by incorporating diverse prior experience.
In principle, off-policy RL algorithms can leverage this data; however, in practice, off-policy al-
gorithms are limited in their ability to learn entirely from off-policy data. Recent off-policy RL
methods (e.g., [19, 30, 24, 10]) have demonstrated sample-efficient performance on complex tasks
in robotics [24] and simulated environments [37]. However, these methods can still fail to learn
when presented with arbitrary off-policy data without the opportunity to collect more experience
from the environment. This issue persists even when the off-policy data comes from effective expert

∗Equal Contribution

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
policies, which in principle should address any exploration challenge [7, 13, 12]. This sensitivity
to the training data distribution is a limitation of practical off-policy RL algorithms, and one would
hope that an off-policy algorithm should be able to learn reasonable policies through training on
static datasets before being deployed in the real world. In this paper, we aim to develop off-policy,
value-based RL methods that can learn from large, static datasets. As we show, a crucial challenge in
applying value-based methods to off-policy scenarios arises in the bootstrapping process employed
when Q-functions are evaluated on out-of-distribution action inputs for computing the backup
when training from off-policy data. This may introduce errors in the Q-function, and because the
algorithm is unable to collect new data to remedy those errors, training becomes unstable and may
diverge. Our primary contribution is an analysis of error accumulation in the bootstrapping process
due to out-of-distribution inputs and a practical way of addressing this error. First, we formalize and
analyze the reasons for instability and poor performance when learning from off-policy data. We
show that, through careful action selection, error propagation through the Q-function can be mitigated.
We then propose a principled algorithm called bootstrapping error accumulation reduction (BEAR)
to control bootstrapping error in practice, which uses the notion of support-set matching to prevent
error accumulation. Through systematic experiments, we show the effectiveness of our method on
continuous-control MuJoCo tasks, with a variety of off-policy datasets generated by random,
suboptimal, or optimal policies. BEAR is consistently robust to the training dataset, matching or
exceeding the state-of-the-art in all cases, whereas existing algorithms only perform well for specific
datasets.

2 Related Work
In this work, we study off-policy reinforcement learning with static datasets. Errors arising from
inadequate sampling, distributional shift, and function approximation have been rigorously studied as
“error propagation” in approximate dynamic programming (ADP) [5, 28, 11, 34]. These works often
study how Bellman errors accumulate and propagate to nearby states via bootstrapping. In this work,
we build upon tools from this analysis to show that performing Bellman backups on static datasets
leads to error accumulation due to out-of-distribution values. Our approach is motivated by reducing
the rate of error propagation between states.
Our approach constrains actor updates so that the actions remain in the support of the training dataset
distribution. Several works have explored similar ideas in the context of off-policy learning
in online settings. Kakade and Langford [23] show that large policy updates can be destructive, and
propose a conservative policy iteration scheme which constrains actor updates to be small for provably
convergent learning. Grau-Moya et al. [17] use a learned prior over actions in the maximum entropy
RL framework [26] and justify it as a regularizer based on mutual information. However, none of these
methods use static datasets. Importance Sampling based distribution re-weighting [30, 16, 31, 27]
has also been explored primarily in the context of off-policy policy evaluation.
Most closely related to our work are batch-constrained Q-learning (BCQ) [13] and SPIBB [25], which
also discuss instability arising from previously unseen actions. Fujimoto et al. [13] show convergence
properties of an action-constrained Bellman backup operator in tabular, error-free settings. We prove
stronger results under approximation errors and provide a bound on the suboptimality of the solution.
This is crucial as it drives the design choices for a practical algorithm. As a consequence, although we
experimentally find that [13] outperforms standard Q-learning methods when the off-policy data is
collected by an expert, BEAR outperforms [13] when the off-policy data is collected by a suboptimal
policy, as is common in real-life applications. Empirically, we find BEAR achieves stronger and more
consistent results than BCQ across a wide variety of datasets and environments. As we explain below,
the BCQ constraint is too aggressive; BCQ generally fails to substantially improve over the behavior
policy, while our method actually improves when the data collection policy is suboptimal or random.
SPIBB [25], like BEAR, is an algorithm based on constraining the learned policy to the support of a
behavior policy. However, the authors do not extend safe performance guarantees from the batch-
constrained case to the relaxed support-constrained case, and do not evaluate on high-dimensional
control tasks. REM [1] is a concurrent work that uses a random convex combination of an ensemble
of Q-networks to perform offline reinforcement learning from a static dataset consisting of interaction
data generated while training a DQN agent.

3 Background
We represent the environment as a Markov decision process (MDP) defined by a tuple
(S, A, P, R, ρ0, γ), where S is the state space, A is the action space, P(s′|s, a) is the transition
distribution, ρ0(s) is the initial state distribution, R(s, a) is the reward function, and γ ∈ (0, 1) is the
discount factor. The goal in RL is to find a policy π(a|s) that maximizes the expected cumulative
discounted reward, which is also known as the return. The notation µπ(s) denotes the discounted
state marginal of a policy π, defined as the average state visited by the policy, $\sum_{t=0}^{\infty} \gamma^t p^t_\pi(s)$. P^π is
shorthand for the transition matrix from s to s′ following a certain policy π, P^π(s′|s) = Eπ[p(s′|s, a)].
Q-learning learns the optimal state-action value function Q∗ (s, a), which represents the expected
cumulative discounted reward starting in s taking action a and then acting optimally thereafter. The
optimal policy can be recovered from Q∗ by choosing the maximizing action. Q-learning algorithms
are based on iterating the Bellman optimality operator T , defined as
$$(\mathcal{T}\hat{Q})(s,a) := R(s,a) + \gamma\, \mathbb{E}_{T(s'|s,a)}\Big[\max_{a'} \hat{Q}(s',a')\Big].$$

When the state space is large, we represent Q̂ as a hypothesis from the set of function approximators
Q (e.g., neural networks). In theory, the estimate of the Q-function is updated by projecting T Q̂ into
Q (i.e., minimizing the mean squared Bellman error Eν [(Q − T Q̂)2 ], where ν is the state occupancy
measure under the behaviour policy). This is also referred to as Q-iteration. In practice, an empirical
estimate of T Q̂ is formed with samples and treated as a supervised ℓ2 regression target to form the
next approximate Q-function iterate.
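As a concrete illustration of this regression view, the following is a minimal tabular sketch of one step of empirical Q-iteration; the toy MDP sizes, transitions, and learning rate are assumptions made purely for exposition, not settings from the paper:

```python
import numpy as np

def q_iteration_step(Q, batch, gamma=0.99, lr=0.5):
    """Regress Q(s, a) towards the sampled Bellman target r + gamma * max_a' Q(s', a')."""
    Q_new = Q.copy()
    for s, a, r, s_next in batch:
        target = r + gamma * np.max(Q[s_next])      # empirical estimate of (T Q)(s, a)
        Q_new[s, a] += lr * (target - Q_new[s, a])  # one step on the squared Bellman error
    return Q_new

# Toy example: 3 states, 2 actions, a few sampled transitions (s, a, r, s').
Q = np.zeros((3, 2))
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 2), (2, 1, 5.0, 2)]
for _ in range(50):
    Q = q_iteration_step(Q, batch)
```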
In large action spaces (e.g., continuous), the maximization max_{a′} Q(s′, a′) is generally intractable.
Actor-critic methods [36, 14, 19] address this by additionally learning a policy πθ that maximizes
the Q-function. In this work, we study off-policy learning from a static dataset of transitions
D = {(s, a, s′, R(s, a))}, collected under an unknown behavior policy β(·|s). We denote the
distribution over states and actions induced by β as µ(s, a).

4 Out-of-Distribution Actions in Q-Learning


Q-learning methods often fail to learn on static, off-policy data, as shown in Figure 1. At first glance, this
resembles overfitting, but increasing the size of the static dataset does not rectify the problem, suggesting
the issue is more complex. We can understand the source of this instability by examining the form of the
Bellman backup. Although minimizing the mean squared Bellman error corresponds to a supervised
regression problem, the targets for this regression are themselves derived from the current Q-function
estimate. The targets are calculated by maximizing the learned Q-values with respect to the action at the
next state. However, the Q-function estimator is only reliable on inputs from the same distribution as its
training set. As a result, naïvely maximizing the value may evaluate the Q̂ estimator on actions that lie far
outside of the training distribution, resulting in pathological values that incur large error. We refer to these
actions as out-of-distribution (OOD) actions.

Figure 1: Performance of SAC on HalfCheetah-v2 (return (left) and log Q-values (right)) with off-policy
expert data w.r.t. number of training samples (n). Note the large discrepancy between returns (which are
negative) and log Q-values (which have large positive values), which is not solved with additional samples.
Formally, let ζk(s, a) = |Qk(s, a) − Q∗(s, a)| denote the total error at iteration k of Q-learning,
and let δk(s, a) = |Qk(s, a) − T Qk−1(s, a)| denote the current Bellman error. Then, we have
ζk(s, a) ≤ δk(s, a) + γ max_{a′} E_{s′}[ζk−1(s′, a′)]. In other words, errors from (s′, a′) are discounted,
then accumulated with new errors δk(s, a) from the current iteration. We expect δk(s, a) to be high
on OOD states and actions, as errors at these state-actions are never directly minimized while training.
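Unrolling this recursion makes the accumulation explicit. As a rough illustration (the uniform bound $\bar\delta$ below is an assumption introduced here for exposition, not a quantity defined in the paper), if $\delta_j(s,a) \le \bar\delta$ for all $j, s, a$, then
$$\zeta_k(s,a) \;\le\; \bar\delta\,(1 + \gamma + \cdots + \gamma^{k-1}) + \gamma^k \max_{s,a}\zeta_0(s,a) \;\le\; \frac{\bar\delta}{1-\gamma} + \gamma^k \max_{s,a}\zeta_0(s,a),$$
so per-iteration Bellman errors can be amplified by up to a factor of 1/(1 − γ), and the iterations with large δk, i.e., those that bootstrap from OOD state-actions, dominate the accumulated error.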
To mitigate bootstrapping error, we can restrict the policy to ensure that it outputs actions that lie in
the support of the training distribution. This is distinct from previous work (e.g., BCQ [13]) which
implicitly constrains the distribution of the learned policy to be close to the behavior policy, similarly
to behavioral cloning [32]. While this is sufficient to ensure that actions lie in the training set with
high probability, it is overly restrictive. For example, if the behavior policy is close to uniform, the
learned policy will behave randomly, resulting in poor performance, even when the data is sufficient
to learn a strong policy (see Figure 2 for an illustration). Formally, this means that a learned policy

π(a|s) has positive density only where the density of the behaviour policy β(a|s) is more than a
threshold (i.e., ∀a, β(a|s) ≤ ε =⇒ π(a|s) = 0), instead of a closeness constraint on the value of
the density π(a|s) and β(a|s). Our analysis instead reveals a tradeoff between staying within the
data distribution and finding a suboptimal solution when the constraint is too restrictive. Our analysis
motivates us to restrict the support of the learned policy, but not the probabilities of the actions lying
within the support. This avoids evaluating the Q-function estimator on OOD actions, but remains
flexible in order to find a performant policy. Our proposed algorithm leverages this insight.

4.1 Distribution-Constrained Backups

In this section, we define and analyze a backup operator that restricts the set of policies used in the
maximization of the Q-function, and we derive performance bounds which depend on the restricted
set. This provides motivation for constraining policy support to the data distribution. We begin with
the definition of a distribution-constrained operator:
Definition 4.1 (Distribution-constrained operators). Given a set of policies Π, the distribution-
constrained backup operator is defined as:
$$\mathcal{T}^\Pi Q(s,a) \stackrel{\text{def}}{=} \mathbb{E}\Big[R(s,a) + \gamma \max_{\pi \in \Pi} \mathbb{E}_{P(s'|s,a)}[V(s')]\Big], \qquad V(s) \stackrel{\text{def}}{=} \max_{\pi \in \Pi} \mathbb{E}_\pi[Q(s,a)].$$
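For intuition, here is a minimal tabular sketch of this operator on a toy discrete MDP; the array shapes, the random construction of the candidate policy set Π, and the discount value are assumptions made purely for illustration:

```python
import numpy as np

def distribution_constrained_backup(Q, R, P, Pi, gamma=0.95):
    """One application of T^Pi: V(s) = max_{pi in Pi} E_{a~pi(.|s)}[Q(s, a)],
    then T^Pi Q(s, a) = R(s, a) + gamma * E_{P(s'|s,a)}[V(s')]."""
    values = np.stack([(pi * Q).sum(axis=1) for pi in Pi])  # (|Pi|, S)
    V = values.max(axis=0)                                  # (S,)
    return R + gamma * P @ V                                # (S, A)

S, A = 4, 2
rng = np.random.default_rng(0)
R = rng.random((S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))                  # P[s, a] is a distribution over s'
Pi = [rng.dirichlet(np.ones(A), size=S) for _ in range(3)]  # three candidate policies, shape (S, A)
Q = np.zeros((S, A))
for _ in range(200):
    Q = distribution_constrained_backup(Q, R, P, Pi)        # iterates converge to the fixed point
```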

This backup operator satisfies properties of the standard Bellman backup, such as convergence to a
fixed point, as discussed in Appendix A. To analyze the (sub)optimality of performing this backup
under approximation error, we first quantify two sources of error. The first is a suboptimality bias.
The optimal policy may lie outside the policy constraint set, and thus a suboptimal solution will be
found. The second arises from distribution shift between the training distribution and the policies
used for backups. This formalizes the notion of OOD actions. To capture suboptimality in the final
solution, we define a suboptimality constant, which measures how far π ∗ is from Π.
Definition 4.2 (Suboptimality constant). The suboptimality constant is defined as:
$$\alpha(\Pi) = \max_{s,a} \, \big|\mathcal{T}^\Pi Q^*(s,a) - \mathcal{T} Q^*(s,a)\big|.$$
Next, we define a concentrability coefficient [29], which quantifies how far the visitation distribution
generated by policies from Π is from the training data distribution. This constant captures the degree
to which states and actions are out of distribution.
Assumption 4.1 (Concentrability). Let ρ0 denote the initial state distribution, and µ(s, a) denote
the distribution of the training data over S × A, with marginal µ(s) over S. Suppose there exist
coefficients c(k) such that for any π1 , ...πk ∈ Π and s ∈ S:
$$\rho_0 P^{\pi_1} P^{\pi_2} \cdots P^{\pi_k}(s) \le c(k)\,\mu(s),$$
where P^{πi} is the transition operator on states induced by πi. Then, define the concentrability
coefficient C(Π) as
$$C(\Pi) \stackrel{\text{def}}{=} (1-\gamma)^2 \sum_{k=1}^{\infty} k \gamma^{k-1} c(k).$$

To provide some intuition for C(Π), if µ was generated by a single policy π, and Π = {π} was a
singleton set, then we would have C(Π) = 1, which is the smallest possible value. However, if Π
contained policies far from π, the value could be large, potentially infinite if the support of the policies
in Π is not contained in the support of π. Now, we bound the performance of approximate
distribution-constrained Q-iteration:
Theorem 4.1. Suppose we run approximate distribution-constrained value iteration with a set
constrained backup T Π . Assume that δ(s, a) ≥ maxk |Qk (s, a) − T Π Qk−1 (s, a)| bounds the
Bellman error. Then,
$$\lim_{k\to\infty} \mathbb{E}_{\rho_0}\big[|V^{\pi_k}(s) - V^*(s)|\big] \;\le\; \frac{\gamma}{(1-\gamma)^2}\left[ C(\Pi)\, \mathbb{E}_\mu\Big[\max_{\pi\in\Pi} \mathbb{E}_\pi[\delta(s,a)]\Big] + \frac{1-\gamma}{\gamma}\,\alpha(\Pi)\right]$$

Proof. See Appendix B, Theorem B.1

This bound formalizes the tradeoff between keeping policies chosen during backups close to the data
(captured by C(Π)) and keeping the set Π large enough to capture well-performing policies (captured
by α(Π)). When we expand the set of policies Π, we are increasing C(Π) but decreasing α(Π). An
example of this tradeoff, and how a careful choice of Π can yield superior results, is given in a tabular
gridworld example in Fig. 2, where we visualize errors accumulated during distribution-constrained
Q-iteration for different choices of Π.

Figure 2: Visualized error propagation in Q-learning for various choices of the constraint set Π: unconstrained
(top row), distribution-constrained (middle), and constrained to the behaviour policy (policy evaluation, bottom).
Triangles represent Q-values for actions that move in different directions. The task (left) is to reach the bottom-
left corner (G) from the top-left (S), but the behaviour policy (visualized as arrows in the task image; support
state-action pairs are shown in black on the support-set image) travels to the bottom-right with a small amount of
ε-greedy exploration. Dark values indicate high error, and light values indicate low error. Standard backups
propagate large errors from the low-support regions into the high-support regions, leading to high error. Policy
evaluation reduces error propagation from low-support regions, but introduces significant suboptimality bias, as
the data policy is not optimal. A carefully chosen distribution-constrained backup strikes a balance between these
two extremes, by confining error propagation in the low-support region while introducing minimal suboptimality
bias.
Finally, we motivate the use of support sets to construct Π. We are interested in the case where
Πε = {π | π(a|s) = 0 whenever β(a|s) < ε}, where β is the behavior policy (i.e., Πε is the set of
policies that have support in the probable regions of the behavior policy). Defining Πε in this way
allows us to bound the concentrability coefficient:
Theorem 4.2. Assume the data distribution µ is generated by a behavior policy β. Let µ(s)
be the marginal state distribution under the data distribution. Define Πε = {π | π(a|s) = 0 whenever
β(a|s) < ε} and let µΠε be the highest discounted marginal state distribution starting from the initial
state distribution ρ and following policies π ∈ Πε at each time step thereafter. Then, there exists a
concentrability coefficient C(Πε) which is bounded:
$$C(\Pi_\varepsilon) \le C(\beta) \cdot \left[1 + \frac{\gamma}{(1-\gamma)\, f(\varepsilon)}\,(1-\varepsilon)\right]$$
where $f(\varepsilon) \stackrel{\text{def}}{=} \min_{s \in S,\, \mu_{\Pi_\varepsilon}(s) > 0} [\mu(s)] > 0$.
Proof. See Appendix B, Theorem B.2
Qualitatively, f(ε) is the minimum discounted visitation marginal of a state under the behaviour policy
if only actions which are more than ε likely are executed in the environment. Thus, using support sets
gives us a single lever, ε, which simultaneously trades off the value of C(Π) and α(Π). Not only can
we provide theoretical guarantees, we will see in our experiments (Sec. 6) that constructing Π in this
way provides a simple and effective method for implementing distribution-constrained algorithms.
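As a purely illustrative calculation (the numbers are assumptions chosen for exposition, not values from our experiments), take γ = 0.9 and suppose f(ε) = 0.1. Then
$$\varepsilon = 0.5: \;\; C(\Pi_\varepsilon) \le C(\beta)\Big(1 + \tfrac{0.9}{0.1 \cdot 0.1}\cdot 0.5\Big) = 46\,C(\beta), \qquad \varepsilon = 0.9: \;\; C(\Pi_\varepsilon) \le C(\beta)\Big(1 + \tfrac{0.9}{0.1\cdot 0.1}\cdot 0.1\Big) = 10\,C(\beta),$$
so increasing ε (holding f(ε) fixed for simplicity) tightens the concentrability bound towards C(β), while shrinking Πε and therefore potentially increasing the suboptimality constant α(Πε).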
Intuitively, this means we can prevent an increase in overall error in the Q-estimate by selecting
policies supported on the support of the training action distribution, which would ensure roughly
bounded projection error δk (s, a) while reducing the suboptimality bias, potentially by a large amount.
Bounded error δk (s, a) on the support set of the training distribution is a reasonable assumption when
using highly expressive function approximators, such as deep networks, especially if we are willing
to reweight the transition set [33, 12]. We further elaborate on this point in Appendix C.

5 Bootstrapping Error Accumulation Reduction (BEAR)


We now propose a practical actor-critic algorithm (built on the framework of TD3 [14] or SAC [19])
that uses distribution-constrained backups to reduce accumulation of bootstrapping error. The key

insight is that we can search for a policy with the same support as the training distribution, while
preventing accidental error accumulation. Our algorithm has two main components. Analogous
to BCQ [13], we use K Q-functions and use the minimum Q-value for policy improvement, and we
design a constraint which will be used for searching over the set of policies Πε, which share the
same support as the behaviour policy. Both of these components will appear as modifications of the
policy improvement step in actor-critic style algorithms. We also note that policy improvement can
be performed with the mean of the K Q-functions, and we found that this scheme works equally well
in our experiments.
We denote the set of Q-functions as Q̂1, · · · , Q̂K. Then, the policy is updated to maximize the
conservative estimate of the Q-values within Πε:
$$\pi_\phi(s) := \max_{\pi \in \Pi_\varepsilon} \; \mathbb{E}_{a \sim \pi(\cdot|s)}\Big[\min_{j=1,\dots,K} \hat{Q}_j(s,a)\Big]$$

In practice, the behaviour policy β is unknown, so we need an approximate way to constrain π to Πε.
We define a differentiable constraint that approximately constrains π to Πε, and then approximately
solve the constrained optimization problem via dual gradient descent. We use the sampled version
of maximum mean discrepancy (MMD) [18] between the unknown behaviour policy β and the
actor π because it can be estimated based solely on samples from the distributions. Given samples
x1 , · · · , xn ∼ P and y1 , · · · , ym ∼ Q, the sampled MMD between P and Q is given by:
$$\text{MMD}^2(\{x_1,\dots,x_n\},\{y_1,\dots,y_m\}) = \frac{1}{n^2}\sum_{i,i'} k(x_i, x_{i'}) \;-\; \frac{2}{nm}\sum_{i,j} k(x_i, y_j) \;+\; \frac{1}{m^2}\sum_{j,j'} k(y_j, y_{j'}).$$
Here, k(·, ·) is any universal kernel. In our experiments, we find both Laplacian and Gaussian kernels
work well. The expression for MMD does not involve the density of either distribution and it can be
optimized directly through samples. Empirically we find that, in the low-intermediate sample regime,
the sampled MMD between P and Q is similar to the MMD between a uniform distribution over P ’s
support and Q, which makes MMD roughly suited for constraining distributions to a given support
set. (See Appendix C.3 for numerical simulations justifying this approach).
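As a reference point, the following is a minimal PyTorch sketch of this sampled estimator; the Gaussian kernel and the bandwidth value are illustrative assumptions (the text also notes a Laplacian kernel works well, and bandwidths are tuning choices), not prescribed settings:

```python
import torch

def gaussian_kernel(x, y, sigma=20.0):
    # x: (n, d), y: (m, d) -> (n, m) kernel matrix; sigma is an illustrative bandwidth.
    sq_dists = torch.cdist(x, y, p=2) ** 2
    return torch.exp(-sq_dists / (2 * sigma ** 2))

def sampled_mmd_sq(x, y, sigma=20.0):
    # Biased (V-statistic) estimate of MMD^2 from n samples of P and m samples of Q,
    # matching the displayed expression: mean k(x, x) - 2 mean k(x, y) + mean k(y, y).
    k_xx = gaussian_kernel(x, x, sigma).mean()
    k_xy = gaussian_kernel(x, y, sigma).mean()
    k_yy = gaussian_kernel(y, y, sigma).mean()
    return k_xx - 2.0 * k_xy + k_yy

# Example: 4 behaviour-policy actions vs. 4 actions from the learned policy (2-d actions).
x = torch.randn(4, 2)
y = torch.randn(4, 2)
print(sampled_mmd_sq(x, y))
```

Because the estimator is differentiable in the sampled actions, it can be minimized directly with respect to the actor's outputs.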
Putting everything together, the optimization problem in the policy improvement step is
$$\pi_\phi := \max_{\pi \in \Delta_{|S|}} \; \mathbb{E}_{s\sim\mathcal{D}}\Big[\mathbb{E}_{a\sim\pi(\cdot|s)}\big[\min_{j=1,\dots,K} \hat{Q}_j(s,a)\big]\Big] \quad \text{s.t.} \quad \mathbb{E}_{s\sim\mathcal{D}}\big[\text{MMD}(\mathcal{D}(s), \pi(\cdot|s))\big] \le \varepsilon \qquad (1)$$
where ε is an approximately chosen threshold. We choose a threshold of ε = 0.05 in our experiments.
The algorithm is summarized in Algorithm 1.
How does BEAR connect with distribution-constrained backups described in Section 4.1? Step 5 of
the algorithm restricts πφ to lie in the support of β. This insight is formally justified in Theorems 4.1
& 4.2 (C(Πε ) is bounded). Computing distribution-constrained backup exactly by maximizing over
π ∈ Πε is intractable in practice. As an approximation, we sample Dirac policies in the support of β
(Alg 1, Line 5) and perform empirical maximization to compute the backup. As the maximization
is performed over a narrower set of Dirac policies ({δai } ⊆ Πε ), the bound in Theorem 4.1 still
holds. Empirically, we show in Section 6 that this approximation is sufficient to outperform previous
methods. This connection is briefly discussed in Appendix C.2.
Algorithm 1 BEAR Q-Learning (BEAR-QL)
input: Dataset D, target network update rate τ, mini-batch size N, sampled actions for MMD n, minimum λ
1: Initialize Q-ensemble {Q_θi}_{i=1..K}, actor π_φ, Lagrange multiplier α, target networks {Q_θ'i}_{i=1..K}, and a target actor π_φ', with φ' ← φ, θ'_i ← θ_i
2: for t in {1, . . . , N} do
3:   Sample mini-batch of transitions (s, a, r, s') ∼ D
   Q-update:
4:   Sample p action samples, {a_i ∼ π_φ'(·|s')}_{i=1..p}
5:   Define y(s, a) := max_{a_i} [λ min_{j=1,..,K} Q_θ'j(s', a_i) + (1 − λ) max_{j=1,..,K} Q_θ'j(s', a_i)]
6:   ∀i, θ_i ← arg min_{θi} (Q_θi(s, a) − (r + γ y(s, a)))²
   Policy-update:
7:   Sample actions {â_i ∼ π_φ(·|s)}_{i=1..m} and {a_j ∼ D(s)}_{j=1..n}, with n preferably an intermediate integer (1–10)
8:   Update φ, α by minimizing Equation 1 using dual gradient descent with Lagrange multiplier α
9:   Update target networks: θ'_i ← τ θ_i + (1 − τ) θ'_i; φ' ← τ φ + (1 − τ) φ'
10: end for
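For concreteness, the following is a hedged PyTorch-style sketch of the core computations in Algorithm 1 (the Q-target of line 5 and the dual-gradient policy update of lines 7–8). The network interfaces (an `actor.rsample(states, n)` method, critics that take state-action batches), the tensor shapes, and the default values of λ, p, and n are assumptions made for illustration rather than the paper's exact implementation:

```python
import torch

def bear_q_target(target_critics, target_actor, rewards, next_states,
                  gamma=0.99, lam=0.75, p=10):
    """Line 5: soft min/max over the target ensemble, maximized over p sampled actions."""
    B = next_states.shape[0]
    next_actions = target_actor.rsample(next_states, p)              # (B, p, act_dim)
    flat_a = next_actions.reshape(B * p, -1)
    rep_s = next_states.repeat_interleave(p, dim=0)
    qs = torch.stack([q(rep_s, flat_a) for q in target_critics], 0)  # (K, B*p)
    soft = lam * qs.min(0).values + (1 - lam) * qs.max(0).values     # (B*p,)
    y = soft.reshape(B, p).max(dim=1).values                         # max over the p actions
    return rewards + gamma * y                                       # regression target (line 6)

def bear_policy_losses(critics, actor, log_alpha, states, dataset_actions,
                       mmd_fn, eps=0.05, n=4):
    """Lines 7-8: conservative policy improvement with a dual-gradient MMD constraint (Eq. 1)."""
    B = states.shape[0]
    pi_actions = actor.rsample(states, n)                            # (B, n, act_dim)
    flat_a = pi_actions.reshape(B * n, -1)
    rep_s = states.repeat_interleave(n, dim=0)
    qs = torch.stack([q(rep_s, flat_a) for q in critics], 0)         # (K, B*n)
    q_min = qs.min(0).values.reshape(B, n).mean(dim=1)               # conservative Q estimate
    mmd = mmd_fn(dataset_actions, pi_actions)                        # (B,) sampled MMD per state
    alpha = log_alpha.exp()                                          # Lagrange multiplier
    actor_loss = (-q_min + alpha.detach() * mmd).mean()              # gradient step on phi
    alpha_loss = -(alpha * (mmd.detach() - eps)).mean()              # dual ascent on alpha
    return actor_loss, alpha_loss
```

Here `mmd_fn` is assumed to be a batched version of the sampled MMD estimator sketched in Section 5, applied per state between the dataset actions and the actor's sampled actions.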

[Figure 3 panels: HalfCheetah-v2, Walker2d-v2, Hopper-v2, Ant-v2; average return vs. TrainSteps.]
Figure 3: Average performance of BEAR-QL, BCQ, Naïve RL and BC on medium-quality data averaged over
5 seeds. BEAR-QL outperforms both BCQ and Naïve RL. Average return over the training data is indicated by
the magenta line. One step on the x-axis corresponds to 1000 gradient steps.

In summary, the actor is updated towards maximizing the Q-function while still being constrained
to remain in the valid search space defined by Π . The Q-function uses actions sampled from the
actor to then perform distribution-constrained Q-learning, over a reduced set of policies. At test time,
we sample p actions from πφ (s) and the Q-value maximizing action out of these is executed in the
environment. Implementation and other details are present in Appendix D.

6 Experiments
In our experiments, we study how BEAR performs when learning from static off-policy data on a
variety of continuous control benchmark tasks. We evaluate our algorithm in three settings: when the
dataset D is generated by (1) a completely random behaviour policy, (2) a partially trained, medium
scoring policy, and (3) an optimal policy. Condition (2) is of particular interest, as it captures many
common use-cases in practice, such as learning from imperfect demonstration data (e.g., of the sort
that are commonly available for autonomous driving [15]), or reusing previously collected experience
during off-policy RL. We compare our method to several prior methods: a baseline actor-critic
algorithm (TD3), the BCQ algorithm [13], which aims to address a similar problem, as discussed in
Section 4, KL-control [22] (which solves a KL-penalized RL problem similarly to maximum entropy
RL), a static version of DQfD [21] (where a constraint to upweight Q-values of state-action pairs
observed in the dataset is added as an auxiliary loss on top of a regular actor-critic algorithm), and a
behaviour cloning (BC) baseline, which simply imitates the data distribution. This serves to measure
whether each method actually performs effective RL, or simply copies the data. We report the average
evaluation return over 5 seeds of the policy given by the learned algorithm, in the form of a learning
curve as a function of number of gradient steps taken by the algorithm. These samples are only
collected for evaluation, and are not used for training.
6.1 Performance on Medium-Quality Data

We first discuss the evaluation in condition (2), with “mediocre” data, as this condition resembles
the settings where we expect training on offline data to be most useful. We collected one million
transitions from a partially trained policy, so as to simulate imperfect demonstration data or data from
a mediocre prior policy. In this scenario, we found that BEAR-QL consistently outperforms both
BCQ [13] and a naïve off-policy RL baseline (TD3) by large margins, as shown in Figure 3. This
scenario is the most relevant from an application point of view, as access to optimal data may not
be feasible, and random data might have inadequate exploration to efficiently learn a good policy. We
also evaluate the accuracy with which the learned Q-functions predict actual policy returns. These
trends are provided in Appendix E. Note that the performance of BCQ often tracks the performance
of the BC baseline, suggesting that BCQ primarily imitates the data. Our KL-control baseline uses
automatic temperature tuning [19]. We find that KL-control usually performs similarly to or worse than
BC, whereas DQfD often diverges due to cumulative error from OOD actions and exhibits huge variance
across different runs (for example, in the HalfCheetah-v2 environment).
6.2 Performance on Random and Optimal Datasets

In Figure 5, we show the performance of each method when trained on data from a random policy
(top) and a near-optimal policy (bottom). In both cases, our method BEAR achieves good results,
consistently exceeding the average dataset return on random data, and matching the optimal policy
return on optimal data. Naïve RL also often does well on random data. For a random data policy, all
actions are in-distribution, since they all have equal probability. This is consistent with our hypothesis

[Figure 5 panels: HalfCheetah-v2, Walker2d-v2, Hopper-v2, Ant-v2; average return vs. TrainSteps; random data (top row) and optimal data (bottom row).]

Figure 5: Average performance of BEAR-QL, BCQ, Naïve RL and BC on random data (top row) and optimal
data (bottom row) over 5 seeds. BEAR-QL is the only algorithm capable of learning in both scenarios. Naïve RL
cannot handle optimal data, since it does not illustrate mistakes, and BCQ favors a behavioral cloning strategy
(performs quite close to behaviour cloning in most cases), causing it to fail on random data. Average return over
the training dataset is indicated by the dashed magenta line.

that OOD actions are one of the main sources of error in off-policy learning on static datasets. The
prior BCQ method [13] performs well on optimal data but performs poorly on random data, where
the constraint is too strict. These results show that BEAR-QL is robust to the dataset composition,
and can learn consistently in a variety of settings. We find that KL-control and DQfD can be unstable
in these settings.
Finally, in Figure 4, we show that BEAR outperforms other considered prior methods in the challeng-
ing Humanoid-v2 environment as well, in two cases: medium-quality data and random data.

6.3 Analysis of BEAR-QL


Figure 4: Performance of BEAR-QL, BCQ, Naïve RL and BC on medium-quality (left) and random (right)
data in the Humanoid-v2 environment. Note that BEAR-QL outperforms prior methods.

In this section, we aim to analyze different components of our method via an ablation study. Our first
ablation studies the support constraint discussed in Section 5, which uses MMD to measure support. We
replace it with a more standard KL-divergence distribution constraint, which measures similarity in density.
Our hypothesis is that this should provide a more conservative constraint, since matching distributions is
not necessary for matching support. KL-divergence performs well in some cases, such as with optimal data,
but as shown in Figure 6, it performs worse than MMD on medium-quality data. Even when the KL-divergence
constraint is hand-tuned fully so as to prevent instability issues, it still performs worse than a not-well-tuned
MMD constraint. We provide the results for this setting in the Appendix. We
also vary the number of samples n that are used to compute the MMD constraint. We find that
smaller n (≈ 4 or 5) gives better performance. Although the difference is not large, consistently better
performance with 4 samples leans in favour of our hypothesis that an intermediate number of samples
works well for support matching, and hence is less restrictive.

7 Discussion and Future Work


The goal in our work was to study off-policy reinforcement learning with static datasets. We
theoretically and empirically analyze how error propagates in off-policy RL due to the use of out-of-
distribution actions for computing the target values in the Bellman backup. Our experiments suggest
that this source of error is one of the primary issues afflicting off-policy RL: increasing the number

of samples does not appear to mitigate the degradation issue (Figure 1), and training with naïve
RL on data from a random policy, where there are no out-of-distribution actions, shows much less
degradation than training on data from more focused policies (Figure 5). Armed with this insight, we
develop a method for mitigating the effect of out-of-distribution actions, which we call BEAR-QL.
BEAR-QL constrains the backup to use actions that have non-negligible support under the data
distribution, but without being overly conservative in constraining the learned policy. We observe
experimentally that BEAR-QL achieves good performance across a range of tasks, and across a range
of dataset compositions, learning well on random, medium-quality, and expert data.
Figure 6: Average return (averaged over Hopper-v2 and Walker2d-v2) as a function of train steps for the
ablation studies from Section 6.3. (a) MMD vs. KL constraint on mediocre data: MMD-constrained optimization
is more stable and leads to better returns. (b) Number of samples for MMD: 4-sample MMD is more performant
than 10-sample MMD.

While BEAR-QL substantially stabilizes off-policy RL, we believe that this problem merits further study.
One limitation of our current method is that, although the learned policies are more performant than those
acquired with naïve RL, performance sometimes still tends to degrade for long learning runs. An exciting
direction for future work would be to develop an early stopping condition for RL, perhaps by generalizing
the notion of validation error to reinforcement learning. A limitation of approaches that perform constrained
action selection is that they can be overly conservative when compared to methods that constrain state
distributions directly, especially with datasets collected from mixtures of policies. We leave
it to future work to design algorithms that can directly constrain state distributions. A theoretically
robust method for support matching efficiently in high-dimensional continuous action spaces is a
question for future research. Perhaps methods from outside RL, predominantly used in domain
adaptation, such as using asymmetric f-divergences [38] can be used for support restriction. Another
promising future direction is to examine how well BEAR-QL can work on large-scale off-policy
learning problems, of the sort that are likely to arise in domains such as robotics, autonomous driving,
operations research, and commerce. If RL algorithms can learn effectively from large-scale off-policy
datasets, reinforcement learning can become a truly data-driven discipline, benefiting from the same
advantage in generalization that has been seen in recent years in supervised learning fields, where
large datasets have enabled rapid progress in terms of accuracy and generalization [8].

Acknowledgements
We thank Kristian Hartikainen for sharing implementations of RL algorithms and for help in debug-
ging certain issues. We thank Matthew Soh for help in setting up environments. We thank Aurick
Zhou, Chelsea Finn, Abhishek Gupta, Kelvin Xu and Rishabh Agarwal for informative discussions.
We thank Ofir Nachum for comments on an earlier draft of this paper. We thank Google, NVIDIA,
and Amazon for providing computational resources. This research was supported by Berkeley Deep-
Drive, JPMorgan Chase & Co., NSF IIS-1651843 and IIS-1614653, the DARPA Assured Autonomy
program, and ARL DCIST CRA W911NF-17-2-0181.

References
[1] Rishabh Agarwal, Dale Schuurmans, and Mohammad Norouzi. Striving for simplicity in
off-policy deep reinforcement learning. CoRR, abs/1907.04543, 2019. URL http://arxiv.
org/abs/1907.04543.
[2] András Antos, Csaba Szepesvári, and Rémi Munos. Value-iteration based fitted policy iteration:
Learning with a single trajectory. In 2007 IEEE International Symposium on Approximate
Dynamic Programming and Reinforcement Learning, pages 330–337, April 2007. doi: 10.1109/
ADPRL.2007.368207.
[3] András Antos, Csaba Szepesvári, and Rémi Munos. Fitted q-iteration in continuous action-
space mdps. In Advances in Neural Information Processing Systems 20, pages 9–16. Curran
Associates, Inc., 2008.

[4] James Bennett, Stan Lanning, et al. The netflix prize. 2007.
[5] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming. Athena Scientific,
1996.
[6] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning?
In ICML 2019.
[7] Tim de Bruin, Jens Kober, Karl Tuyls, and Robert Babuska. The importance of experience
replay database composition in deep reinforcement learning. 01 2015.
[8] Jia Deng, Wei Dong, Richard S. Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A
Large-Scale Hierarchical Image Database. In CVPR09, 2009.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of
deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
2018.
[10] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward,
Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl
with importance weighted actor-learner architectures. In Proceedings of the International
Conference on Machine Learning (ICML), 2018.
[11] Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approxi-
mate policy and value iteration. In Advances in Neural Information Processing Systems, pages
568–576, 2010.
[12] Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing bottlenecks in deep
q-learning algorithms. arXiv preprint arXiv:1902.10250, 2019.
[13] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning
without exploration. arXiv preprint arXiv:1812.02900, 2018.
[14] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error
in actor-critic methods. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th
International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning
Research, pages 1587–1596. PMLR, 2018.
[15] Yang Gao, Huazhe Xu, Ji Lin, Fisher Yu, Sergey Levine, and Trevor Darrell. Reinforcement
learning from imperfect demonstrations. In ICLR (Workshop). OpenReview.net, 2018.
[16] Carles Gelada and Marc G. Bellemare. Off-policy deep reinforcement learning by bootstrapping
the covariate shift. CoRR, abs/1901.09455, 2019.
[17] Jordi Grau-Moya, Felix Leibfried, and Peter Vrancx. Soft q-learning with mutual-information
regularization. In International Conference on Learning Representations, 2019. URL https:
//openreview.net/forum?id=HyEtjoCqFX.
[18] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander
Smola. A kernel two-sample test. J. Mach. Learn. Res., 13:723–773, March 2012. ISSN
1532-4435. URL http://dl.acm.org/citation.cfm?id=2188385.2188410.
[19] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint
arXiv:1801.01290, 2018.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 770–778, 2016.
[21] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan,
John Quan, Andrew Sendonaris, Ian Osband, et al. Deep q-learning from demonstrations. In
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[22] Natasha Jaques, Asma Ghandeharioun, Judy Hanwen Shen, Craig Ferguson, Àgata Lapedriza,
Noah Jones, Shixiang Gu, and Rosalind W. Picard. Way off-policy batch deep reinforcement
learning of implicit human preferences in dialog. CoRR, abs/1907.00456, 2019. URL http:
//arxiv.org/abs/1907.00456.
[23] Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning.
In Proceedings of the Nineteenth International Conference on Machine Learning, pages 267–
274. Morgan Kaufmann Publishers Inc., 2002.

[24] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang,
Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine.
Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings
of The 2nd Conference on Robot Learning, volume 87 of Proceedings of Machine Learning
Research, pages 651–673. PMLR, 2018.
[25] Romain Laroche, Paul Trichelair, and Remi Tachet Des Combes. Safe policy improvement with
baseline bootstrapping. In International Conference on Machine Learning (ICML), 2019.
[26] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and
review. CoRR, abs/1805.00909, 2018. URL http://arxiv.org/abs/1805.00909.
[27] A Rupam Mahmood, Huizhen Yu, Martha White, and Richard S Sutton. Emphatic temporal-
difference learning. arXiv preprint arXiv:1507.01569, 2015.
[28] Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the Twentieth
International Conference on International Conference on Machine Learning, pages 560–567.
AAAI Press, 2003.
[29] Rémi Munos. Error bounds for approximate value iteration. In Proceedings of the National
Conference on Artificial Intelligence, 2005.
[30] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient
off-policy reinforcement learning. In Advances in Neural Information Processing Systems,
pages 1054–1062, 2016.
[31] Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal-difference learning
with function approximation. In International Conference on Machine Learning (ICML), 2001.
[32] Stefan Schaal. Is imitation learning the route to humanoid robots?, 1999.
[33] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay.
CoRR, abs/1511.05952, 2016.
[34] Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist.
Approximate modified policy iteration and its application to the game of tetris. Journal of
Machine Learning Research, 16:1629–1676, 2015. URL http://jmlr.org/papers/v16/
scherrer15a.html.
[35] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust
region policy optimization. In Francis Bach and David Blei, editors, Proceedings of the 32nd
International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning
Research, pages 1889–1897, Lille, France, 07–09 Jul 2015. PMLR.
[36] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. Second
edition, 2018.
[37] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based
control. In IROS, pages 5026–5033, 2012.
[38] Yifan Wu, Ezra Winston, Divyansh Kaushik, and Zachary Lipton. Domain adaptation with
asymmetrically-relaxed distribution alignment. In ICML 2019.
[39] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and
Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling.
CoRR, abs/1805.04687, 2018. URL http://arxiv.org/abs/1805.04687.

Appendices
A Distribution-Constrained Backup Operator
In this section, we analyze properties of the constrained Bellman backup operator, defined as:
$$\mathcal{T}^\Pi Q(s,a) \stackrel{\text{def}}{=} \mathbb{E}\Big[R(s,a) + \gamma \max_{\pi \in \Pi} \mathbb{E}_{P(s'|s,a)}[V(s')]\Big],$$
where
$$V(s) \stackrel{\text{def}}{=} \max_{\pi \in \Pi} \mathbb{E}_\pi[Q(s,a)].$$

Such an operator can be reduced to a standard Bellman backup in a modified MDP. We can construct
an MDP M′ from the original MDP M as follows:

• The state space, discount, and initial state distributions remain unchanged from M.
• We define a new action set A′ = Π to be the choice of policy π to execute.
• We define the new transition distribution p′ as taking one step under the chosen policy π and
  one step under the original dynamics p: p′(s′|s, π) = Eπ[p(s′|s, a)].
• Q-values in this new MDP, QΠ (s, π) would, in words, correspond to executing policy π for
one step and executing the policy which maximizes the future discounted value function in
the original MDP M thereafter.

Under this redefinition, the Bellman operator T Π is mathematically the same operation as the Bellman
operator under M′. Thus, standard results from MDP theory carry over, i.e., the existence of a fixed
point and convergence of repeated application of T Π to said fixed point.
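To make the reduction concrete, one way to write the Bellman optimality equation of M′ (a sketch using only the definitions above) is
$$Q_{M'}(s, \pi) = \mathbb{E}_{a \sim \pi(\cdot|s)}[R(s,a)] + \gamma\, \mathbb{E}_{s' \sim p'(\cdot|s,\pi)}\Big[\max_{\pi' \in \Pi} Q_{M'}(s', \pi')\Big],$$
whose solution relates to the fixed point Q of T^Π via Q_{M'}(s, π) = Eπ[Q(s, a)], so that max_{π∈Π} Q_{M'}(s, π) recovers V(s) as defined above.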

B Error Propagation
In this section, we provide proofs for Theorem 4.1 and Theorem 4.2.
Theorem B.1. Suppose we run approximate distribution-constrained value iteration with a set
constrained backup T Π . Assume that δ(s, a) ≥ maxk |Qk (s, a) − T Π Qk−1 (s, a)| bounds the
Bellman error. Then,
$$\lim_{k\to\infty} \mathbb{E}_{\rho_0}\big[|V_k(s) - V^*(s)|\big] \;\le\; \frac{\gamma}{(1-\gamma)^2}\left[ C(\Pi)\, \mathbb{E}_\mu\Big[\max_{\pi\in\Pi} \mathbb{E}_\pi[\delta(s,a)]\Big] + \frac{1-\gamma}{\gamma}\,\alpha(\Pi)\right]$$

Proof. We first begin by introducing V Π , the fixed point of T Π . By the triangle inequality, we have:
$$\mathbb{E}_{\rho_0}\big[|V_k(s) - V^*(s)|\big] = \mathbb{E}_{\rho_0}\big[|V_k(s) - V^\Pi(s) + V^\Pi(s) - V^*(s)|\big] \le \underbrace{\mathbb{E}_{\rho_0}\big[|V_k(s) - V^\Pi(s)|\big]}_{L_1} + \underbrace{\mathbb{E}_{\rho_0}\big[|V^\Pi(s) - V^*(s)|\big]}_{L_2}$$

First, we note that maxπ Eπ [δ(s, a)] provides an upper bound on the value error:
$$|V_k(s) - \mathcal{T}^\Pi V_{k-1}(s)| = \Big|\max_\pi \mathbb{E}_\pi[Q_k(s,a)] - \max_\pi \mathbb{E}_\pi[\mathcal{T}^\pi Q_{k-1}(s,a)]\Big| \le \max_\pi \mathbb{E}_\pi\big[|Q_k(s,a) - \mathcal{T}^\pi Q_{k-1}(s,a)|\big] \le \max_\pi \mathbb{E}_\pi[\delta(s,a)]$$

We can bound L1 with
$$L_1 \le \frac{\gamma}{(1-\gamma)^2}\, C(\Pi)\, \mathbb{E}_\mu\Big[\max_{\pi\in\Pi} \mathbb{E}_\pi[\delta(s,a)]\Big]$$
by direct modification of the proof of Theorem 3 of Farahmand et al. [11] or Theorem 1 of Munos
[29] with k = 1 (p = 1), but replacing V ∗ with V Π and T with T Π , as T Π is a contraction and

V Π is its fixed point. An alternative proof involves viewing T Π as a backup under a modified MDP
(see Appendix A) and directly applying Theorem 1 of Munos [29] under this modified MDP. A similar
bound also holds for value iteration with the T Π operator, which can be analysed along similar lines
as the above proof and Munos [29].
To bound L2, we provide a simple ℓ∞-norm bound, although we could in principle apply techniques
used to bound L1 to get a tighter distribution-based bound.
$$\|V^\Pi - V^*\|_\infty = \|\mathcal{T}^\Pi V^\Pi - \mathcal{T} V^*\|_\infty \le \|\mathcal{T}^\Pi V^\Pi - \mathcal{T}^\Pi V^*\|_\infty + \|\mathcal{T}^\Pi V^* - \mathcal{T} V^*\|_\infty \le \gamma\, \|V^\Pi - V^*\|_\infty + \alpha(\Pi)$$
Thus, we have $\|V^\Pi - V^*\|_\infty \le \frac{\alpha(\Pi)}{1-\gamma}$. Because the maximum is greater than the expectation,
$L_2 = \mathbb{E}_{\rho_0}\big[|V^\Pi(s) - V^*(s)|\big] \le \|V^\Pi - V^*\|_\infty$.

Adding L1 and L2 completes the proof.


Theorem B.2. Assume the data distribution µ is generated by a behavior policy β, such that
µ(s, a) = µβ (s, a). Let µ(s) be the marginal state distribution under the data distribution. Let us
define Πε = {π | π(a|s) = 0 whenever β(a|s) < ε}. Then, there exists a concentrability coefficient
C(Πε) which is bounded as:
$$C(\Pi_\varepsilon) \le C(\beta) \cdot \left[1 + \frac{\gamma}{(1-\gamma)\, f(\varepsilon)}\,(1-\varepsilon)\right]$$
where $f(\varepsilon) \stackrel{\text{def}}{=} \min_{s \in S,\, \mu_{\Pi_\varepsilon}(s) > 0} [\mu(s)]$.

Proof. For notational clarity, we refer to Πε as Π in this proof. The term µΠ is the highest discounted
marginal state distribution starting from the initial state distribution ρ and following policies π ∈ Π.
Formally, it is defined as:
$$\mu_\Pi \stackrel{\text{def}}{=} \max_{\{\pi_i\}_i :\, \forall i,\, \pi_i \in \Pi} \; (1-\gamma) \sum_{m=1}^{\infty} m\, \gamma^{m-1}\, \rho_0 P^{\pi_1} \cdots P^{\pi_m}$$

Now, we begin the proof of the theorem. We first note, from the definition of Π, that ∀ s ∈ S, ∀ π ∈
Π, π(a|s) > 0 =⇒ β(a|s) > ε. This suggests a bound on the total variation distance between
β and any π ∈ Π for all s ∈ S: DTV(β(·|s) || π(·|s)) ≤ 1 − ε. This means that the marginal state
distributions of β and Π are bounded in total variation distance by DTV(µβ || µΠ) ≤ γ/(1 − γ) (1 − ε),
where µΠ is the marginal state distribution as defined above. This can be derived from Schulman
et al. [35], Appendix B, which bounds the difference in returns of two policies by showing the state
marginals between two policies are bounded if their total variation distance is bounded.
Further, the definition of the set of policies Π implies that ∀ s ∈ S, µΠ(s) > 0 =⇒ µβ(s) ≥ f(ε),
where f(ε) > 0 is a constant that depends on ε and captures the minimum visitation probability of
a state s ∈ S when rollouts are executed from the initial state distribution ρ while executing the
behaviour policy β(a|s), under the constraint that only actions with β(a|s) ≥ ε are selected for
execution in the environment. Combining it with the total variation divergence bound, max_s |µβ(s) −
µΠ(s)| ≤ γ/(1 − γ) (1 − ε), we get that
$$\sup_{s \in S} \frac{\mu_\Pi(s)}{\mu_\beta(s)} \le 1 + \frac{\gamma}{(1-\gamma)\, f(\varepsilon)}\,(1-\varepsilon)$$

We know that $C(\Pi) \stackrel{\text{def}}{=} (1-\gamma)^2 \sum_{k=1}^{\infty} k\gamma^{k-1} c(k)$ captures the ratio of the marginal state visitation
distribution under the policy iterates, when performing backups using the distribution-constrained
operator, to the data distribution µ = µβ. Therefore,
$$\frac{C(\Pi_\varepsilon)}{C(\beta)} \stackrel{\text{def}}{=} \sup_{s \in S} \frac{\mu_\Pi(s)}{\mu_\beta(s)} \le 1 + \frac{\gamma}{(1-\gamma)\, f(\varepsilon)}\,(1-\varepsilon)$$

C Additional Details Regarding BEAR-QL
In this appendix, we address several remaining points regarding the support matching formulation of
BEAR-QL, and further discuss its connections to prior work.

C.1 Why can we choose actions from Πε, the support of the training distribution, and need
not restrict action selection to the policy distribution?

In Section 4.1, we designed a new distribution-constrained backup and analyzed its properties from
an error propagation perspective. Theorems 4.1 and 4.2 tell us that, if the maximum projection error
on all actions within the support of the train distribution is bounded, then the worst-case error incurred
is also bounded. That is, we have a bound on maxπ∈Π Eπ [δk (s, a)]. In this section, we provide
an intuitive explanation for why action distributions that are very different from the training policy
distributions, but still lie in the support of the train distribution, can be chosen without incurring
large error. In practice, we use powerful function approximators for Q-learning, such as deep neural
networks. That is, δk (s, a) is the Bellman error for one iteration of Q-iteration/Q-learning, which
can essentially be viewed as a supervised regression problem with a very expressive function class.
In this scenario, we expect a bounded error on the entire support of the training distribution, and
we therefore expect approximation error to depend less on the specific density of a datapoint under
the data distribution, and more on whether or not that datapoint is within the support of the data
distribution. I.e., any point that is within the support would have a comparatively low error, due to
the expressivity of the function approximator.
Another justification is that a different version of the Bellman error objective renormalizes the action
distributions to the uniform distribution by applying an inverse behavior-policy density weighting.
For example, [3, 2] use this variant of the Bellman error:
$$Q_{k+1} = \arg\min_{Q} \sum_{i=1,\, a_i \sim \beta(\cdot|s_i)}^{N} \frac{1}{\hat{\beta}(a_i|s_i)} \Big( Q(s_i, a_i) - \big( R(s_i, a_i) + \gamma \max_{a' \in \mathcal{A}} Q_k(s_{i+1}, a') \big) \Big)^2$$

This implies that this form of Bellman error mainly depends upon the support of the behaviour policy
β (i.e., the set of action samples drawn from β with a high enough probability, which we formally
refer to as β(a|s) ≥ ε in the main text). In a scenario where this form of Bellman error is being
minimized, δk(s, a) is defined as
$$\delta_k(s,a) = \frac{1}{\beta(a|s)}\, \big|Q_k(s,a) - \mathcal{T} Q_{k-1}(s,a)\big|$$
The overall error incurred due to error propagation is therefore expected to be insensitive to distribution
change, provided the support of the distribution does not change. Hence, all policies π ∈ Π incur
the same amount of propagated error (|Vk − VΠ|) but different suboptimality biases, suggesting the
existence of a policy in Π which propagates the same amount of error while having a lower
suboptimality bias. However, in practice, it has been observed that using the
inverse density weighting under the behaviour policy doesn’t lead to substantially better performance
for vanilla RL (not in the setting with purely off-policy, static datasets), so we use the unmodified
Bellman error objective.
Both of these justifications indicate that bounded δk (s, a) is reasonable to expect under in-support
action distributions.

C.2 Details on connection between BEAR-QL and distribution-constrained backups

Distribution-constrained backups perform maximization over a set of policies Πε, defined as the set of
policies that share the support of the behaviour policy. In the BEAR-QL algorithm, πφ is trained to
maximize the expected Q-value for each state under the action distribution it defines, while staying
in-support (through the MMD constraint). The maximization step biases πφ towards the in-support
actions which maximize the current Q-value. Sampling multiple Dirac-delta action distributions δ_{a_i}
and then performing an explicit maximum over them to compute the target is a stochastic approximation
to the distribution-constrained operator. What is the importance of training the actor to maximize the
expected Q-value? We found empirically that this step is important: without this maximization step,
in high-dimensional action spaces it is likely to require many more samples (exponentially more, due
to the curse of dimensionality) to find the correct action that maximizes the target value while being
in-support. This is hard and unlikely, and in some experiments we tried with this variant, we found that
it led to suboptimal solutions. At evaluation time, we use the Q-function as the actor, and the same
process is followed: Dirac-delta action distribution candidates δ_{a_i} are sampled, and the action a_i
that gives the empirical maximum over the Q-function values is executed in the environment.

C.3 How effective is the MMD constraint in constraining supports of distributions?

In Section 5, we argued in favour of the usage of the sampled MMD distance between distributions
to search for a policy that is supported on the same support as the train distribution. Revisiting the
argument, in this section we demonstrate, via numerical simulations, the effectiveness of the MMD distance
between two probability distributions in constraining the support of the distribution being learned,
without constraining the distribution's density function too much. While minimizing the MMD distance
computed exactly between two distributions matches the distributions exactly (which explains its
applicability in 2-sample tests), with a limited number of samples we empirically find
that the value of the MMD distance computed using samples from two d-dimensional Gaussian
distributions with diagonal covariance matrices, $P \stackrel{\text{def}}{=} \mathcal{N}(\mu_P, \Sigma_P)$ and $Q \stackrel{\text{def}}{=} \mathcal{N}(\mu_Q, \Sigma_Q)$, is roughly
equal to the MMD distance computed using samples from
$U_\alpha(P) \stackrel{\text{def}}{=} [\,\text{Uniform}(\mu_P^1 \pm \alpha\Sigma_P^{1,1})\,] \times \cdots \times [\,\text{Uniform}(\mu_P^d \pm \alpha\Sigma_P^{d,d})\,]$ and Q.
This means that when minimizing the MMD distance in order to train the distribution Q, the gradient
signal would push Q towards a uniform distribution supported on P's support, as this solution exhibits
a lower MMD value, which is the objective we are optimizing.
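The following is a small numpy simulation in the spirit of this claim (and of Figure 7 below); the kernel, bandwidth, sample size, and uniform support width are illustrative assumptions rather than the exact settings used for the figure:

```python
import numpy as np

def mmd_sq(x, y, sigma=1.0):
    # Biased sampled MMD^2 with a Gaussian kernel between 1-d sample sets x and y.
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma ** 2))
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()

rng = np.random.default_rng(0)
n, trials = 5, 2000
self_vals, unif_vals = [], []
for _ in range(trials):
    p = rng.normal(0.0, 1.0, size=n)        # n samples from P = N(0, 1)
    q = rng.normal(0.0, 1.0, size=n)        # another n samples from P
    u = rng.uniform(-1.5, 1.5, size=n)      # n samples from a uniform over (roughly) P's support
    self_vals.append(mmd_sq(p, q))
    unif_vals.append(mmd_sq(u, q))
# With so few samples, the two averages are of comparable magnitude,
# which is the regime the support-matching argument relies on.
print(np.mean(self_vals), np.mean(unif_vals))
```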
Figure 7 shows an empirical comparison of MMD(P, Q) when Q = P , computed by sampling
n-samples from P , and MMD(Uα (P ), Q) (also when Q = P ) computed by sampling n-samples
from Uα (P ). We observe that MMD distance computed using limited samples can, in fact, be higher
between a distribution and itself as compared to a uniform distribution over a distribution’s support
and itself. In Figure 7, note that for smaller values of n and appropriately chosen α (mentioned
against each figure, the support of the uniform distribution), the estimator for MMD(Uα (P ), P ) can
provide lower estimates than the value of the estimator for MMD(P, P ). This observation suggests
that when the number of samples is not enough to sample infer the distribution shape, density-agnostic
distances like MMD can be used as optimization objectives to push distributions to match supports.
Subfigures (c) and (d) shows the increase in MMD distance as the support of the uniform distribution
is expanded.
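As a rough illustration of this kind of numerical simulation, the following NumPy sketch estimates the sampled MMD between two independent sample sets drawn from a 1-d Gaussian P, and between samples from a uniform distribution over an interval around P's mean and samples from P. The particular choices of n, the interval, and the kernel bandwidth are ours; whether the uniform case comes out lower depends on these choices, as Figure 7 illustrates.

```python
import numpy as np

def mmd_squared(x, y, sigma=1.0):
    """Sampled (biased) estimate of squared MMD with a Gaussian kernel
    k(a, b) = exp(-||a - b||^2 / (2 * sigma^2))."""
    def gram(a, b):
        sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))
    return gram(x, x).mean() - 2.0 * gram(x, y).mean() + gram(y, y).mean()

rng = np.random.default_rng(0)
n, trials = 5, 2000                              # small n: the regime discussed above
self_vals, unif_vals = [], []
for _ in range(trials):
    p_ref = rng.normal(0.0, 1.0, size=(n, 1))    # samples from P = N(0, 1)
    p_other = rng.normal(0.0, 1.0, size=(n, 1))  # independent samples from P
    u = rng.uniform(-2.0, 2.0, size=(n, 1))      # samples from a uniform over P's support
    self_vals.append(mmd_squared(p_other, p_ref))
    unif_vals.append(mmd_squared(u, p_ref))

print("avg sampled MMD^2(P, P):      ", np.mean(self_vals))
print("avg sampled MMD^2(Uniform, P):", np.mean(unif_vals))
```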
In order to provide a theoretical example, we refer to Example 1 in Gretton et al. [18] and extend it. First, note that the example argues that, for a fixed size of sample drawn from a distribution P, there exists another discrete distribution Q, supported on m² samples from the support set of P, such that with probability at least (m² choose m) · m!/m^{2m} > 1 − e^{−1} > 0.63, a sample from Q is indeed a sample from P as well. So, for smaller values of m, no 2-sample test will be able to distinguish between P and Q. We note that this example is exactly the argument that our algorithm builds upon. We further extend this example by noting that if Q were not completely supported on the support of P, then there would be a probability of at least ε that a sample from Q lies outside the support of P. This gives a lower bound on the value of the MMD estimator, indicating that the MMD 2-sample test will be able to detect this distribution due to an irreducible difference of ε·√(min_{y∈Extremal(P)} E_{x∼P}[k(x, y)]) (where y is an "extremal point" in P's support) in the MMD estimate.

D Additional Experimental Details

Data collection We trained behaviour policies using the Soft Actor-Critic algorithm [19]. In all cases, random data was generated by running a uniformly random policy in the environment. Optimal data was generated by training SAC agents in all 4 domains until convergence to the returns mentioned in Figure 5. Mediocre data was generated by training a policy until the return value marked in each of the plots in Figure 3. Each of our datasets contained 1e6 samples. We used the same datasets for evaluating different algorithms to maintain uniformity across results.

Figure 7: Comparing the MMD distance between a 1-d Gaussian distribution P and itself, and between a uniform distribution over the support set of P and the distribution P. The parameters of the Gaussian and uniform distributions in each panel are: (a) N(0, 0.1) and U(−0.1, 0.1); (b) N(0, 1.0) and U(−1.5, 1.5); (c) N(0, 1.0) and U(−2.0, 2.0); (d) N(0, 1.0) and U(−4.0, 4.0). ('Self' refers to MMD(P, P) and 'Uniform' refers to MMD(P, U(P)).) Note that for small values of n ≈ 1–10, the MMD with the uniform distribution is slightly lower in magnitude than the MMD between the distribution P and itself (sub-figures (a), (b) and (c)). For (d), as the support of the uniform distribution is enlarged, the MMD value in the uniform-approximation case increases – which suggests that a near-local minimizer of the MMD distance can be obtained by ensuring that the distribution being trained shares the same support as the other, given distribution.

Choice of kernels In our experiments, we found that the choice of kernel is an important design decision. In general, we found that a Laplacian kernel k(x, y) = exp(−||x − y||/σ) worked well in all cases. A Gaussian kernel k(x, y) = exp(−||x − y||²/(2σ²)) worked quite well in the case of the optimal dataset. For the Laplacian kernel, we chose σ = 10.0 for Cheetah, Ant and Hopper, and σ = 20.0 for Walker. However, we found that σ = 20.0 worked well for all environments in all settings. For the Gaussian kernel, we chose σ = 20.0 for all settings. Kernels often fail to provide relevant measurements of distance, especially in high-dimensional spaces, so one direction for future work is to design better-suited kernels. We further experimented with a mixture of Laplacian kernels with different bandwidth parameters σ ∈ {1.0, 10.0, 50.0} on Hopper-v2 and Walker2d-v2, where we found that it performs comparably to, and sometimes better than, a single Laplacian kernel, probably because it can track supports up to different thresholds thanks to the multiple kernels.
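A compact sketch of these kernels (our own helper functions, not the released code) is given below; the mixture simply sums Laplacian kernels over the listed bandwidths.

```python
import torch

def laplacian_kernel(x, y, sigma=20.0):
    """k(x, y) = exp(-||x - y|| / sigma); x: [n, d], y: [m, d] -> [n, m] Gram matrix."""
    return torch.exp(-torch.cdist(x, y, p=2) / sigma)

def gaussian_kernel(x, y, sigma=20.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    return torch.exp(-torch.cdist(x, y, p=2) ** 2 / (2.0 * sigma ** 2))

def laplacian_mixture_kernel(x, y, bandwidths=(1.0, 10.0, 50.0)):
    """Mixture of Laplacian kernels over several bandwidths."""
    return sum(laplacian_kernel(x, y, sigma) for sigma in bandwidths)
```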

More details about the algorithm At evaluation time, we find that using the greedy maximum of the Q-function over the support set of the behaviour policy (which can be approximated by sampling multiple Dirac-delta policies δ_{a_i} from the policy πφ and greedily maximizing the Q-values over these Dirac-delta policies) works best – better than unrolling the learned actor πφ in the environment. This was also found useful in [13]. Another detail of the algorithm is deciding which samples to use for computing the MMD objective. We train a parametric model πdata that fits a tanh-Gaussian distribution to the actions given the state s, πdata(·|s) = tanh(N(µ(·|s), σ(·|s))), and then use it to sample n candidate actions for computing the MMD distance, meaning that MMD is computed between a_1, · · · , a_N ∼ πdata and πφ. We find using samples from this fitted model, rather than raw dataset actions, to work better in practice. Also, computing the MMD distance between actions before applying the tanh transformation works better and leads to a constraint that arguably provides a stronger gradient signal – because tanh saturates very quickly, after which gradients almost vanish.
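The sketch below illustrates this detail under our own naming conventions: pre-tanh action samples are drawn from the fitted behaviour model and from πφ at a given state, and the sampled MMD is computed on these pre-tanh actions with a Laplacian kernel. The distribution interface (torch.distributions.Normal objects for the pre-tanh Gaussians) is an assumption made for illustration.

```python
import torch

def laplacian_mmd_sq(x, y, sigma=20.0):
    """Sampled MMD^2 with a Laplacian kernel k(a, b) = exp(-||a - b|| / sigma)."""
    k = lambda a, b: torch.exp(-torch.cdist(a, b, p=2) / sigma)
    return k(x, x).mean() - 2.0 * k(x, y).mean() + k(y, y).mean()

def mmd_constraint_value(data_pretanh_dist, policy_pretanh_dist, n=4, m=4):
    """MMD between pre-tanh actions of the fitted behaviour model pi_data(.|s)
    and of the learned policy pi_phi(.|s) at a single state s.

    Both arguments are assumed to be the *pre-tanh* Gaussian distributions
    (e.g. torch.distributions.Normal over the action dimensions)."""
    a_data = data_pretanh_dist.sample((n,))     # [n, act_dim]; no gradient needed
    a_pi = policy_pretanh_dist.rsample((m,))    # [m, act_dim]; reparameterized so
                                                # gradients flow into pi_phi
    return laplacian_mmd_sq(a_data, a_pi)
```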

Other hyperparameters Other hyperparameters include the following: (1) the kernel bandwidth – the variance σ² of the Gaussian kernel / the standard deviation σ of the Laplacian kernel: we tried 10, 20, and 40, and found that 10 and 20 worked well across Cheetah, Hopper and Ant, while 20 worked well for Walker2d; (2) the learning rate for the Lagrange multiplier was chosen to be 1e-3, and the log of the Lagrange multiplier was clipped to [−5, 10] to prevent instabilities; (3) for the policy improvement step, we found that using the average Q works better than the min Q for Walker2d. For the baselines, we used the BCQ code from the official implementation accompanying [13], the TD3 code from the official implementation accompanying [14], and the BC baseline was the VAE-based behaviour cloning baseline also used in [13]. We evaluated on 10 evaluation episodes (which were separate from the train distribution) after every 1000 iterations and used the average score and the variance for the plots.
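As a rough sketch of item (2), the dual update of the Lagrange multiplier with the clipping described above could look as follows; the constraint threshold epsilon, the optimizer choice, and all variable names are our own illustrative assumptions, not values taken from the paper.

```python
import torch

# Illustrative names: log_alpha is the log of the Lagrange multiplier.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=1e-3)   # learning rate from item (2)

def dual_update(mmd_value, epsilon=0.05):
    """One dual-ascent step on the multiplier for the constraint MMD <= epsilon.
    mmd_value is a scalar tensor; only log_alpha receives gradients here."""
    alpha = log_alpha.clamp(-5.0, 10.0).exp()
    # Minimizing -alpha * (MMD - eps) raises alpha when the constraint is violated
    # (MMD > eps) and lowers it otherwise.
    dual_loss = -alpha * (mmd_value.detach() - epsilon)
    alpha_optimizer.zero_grad()
    dual_loss.backward()
    alpha_optimizer.step()
    with torch.no_grad():
        log_alpha.clamp_(-5.0, 10.0)   # keep the log-multiplier in [-5, 10]
    return alpha.detach()
```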

E Additional Experimental Results

(Figure 8 plots: 'Ant-v2: Q − MC' and 'Walker2d-v2: Q − MC', with BCQ and BEAR-QL curves plotted against TrainSteps.)

Figure 8: The trend of the difference between Q-values and Monte-Carlo returns (Q − MC) for 2 environments. Note that a high value of Q − MC corresponds to more overestimation. In these plots, BEAR-QL is better behaved than BCQ. In Walker2d-v2, BCQ tends to diverge in the negative direction. In the case of Ant-v2, although roughly the same, the difference between Q-values and Monte-Carlo returns is slightly lower for BEAR-QL, suggesting no risk of overestimation. (This corresponds to medium-quality data.)

(Figure 9 plots: Q-value curves for Hopper-v2, Walker2d-v2 and HalfCheetah-v2, with BCQ and BEAR-QL plotted against TrainSteps.)

Figure 9: The trends of Q-values as a function of the number of gradient steps taken for 3 environments. BCQ's Q-values tend to be more unstable (especially in the case of Walker2d, where they diverge in the negative direction) compared to BEAR-QL. This corresponds to medium-quality data.

In this section, we provide additional plots for supplementary experiments. In Figure 8 we show the difference between the learned Q-values and the Monte Carlo returns of the policy in the environment. In Figure 9 we show the trends of the Q-values learned by BEAR-QL and BCQ in three environments. In Figure 10 we compare the performance when using the MMD constraint versus the KL constraint in three environments. To make the comparison to MMD fair, we train a model of the behaviour policy and constrain the KL-divergence to this fitted behaviour policy. (For MMD, we compute MMD using samples from the model of the behaviour policy.) Note that in the case of HalfCheetah with medium-quality data, the KL-divergence constraint works quite well, but it fails drastically in the case of Hopper and Walker2d, where the Q-values tend to diverge. Figure 10 summarizes the trends for the 3 environments.

(Figure 10 plots: 'MMD vs KL' AverageReturn curves for HalfCheetah-v2, Walker2d-v2 and Hopper-v2, plotted against TrainSteps.)

Figure 10: Performance trends (measured in AverageReturn) for the Hopper-v2, HalfCheetah-v2 and Walker2d-v2 environments with the BEAR-QL algorithm under varying kinds of constraint. In general we find that using the KL constraint leads to worse performance, and that the KL-constraint often leads to diverging Q-values. However, in some rare cases (for example, HalfCheetah-v2), the KL constraint learns faster. This experiment corresponds to medium-quality data.

(Figure 11 plots: 'MMD vs KL' AverageReturn curves for Hopper-v2 and Walker2d-v2, plotted against TrainSteps.)

Figure 11: Performance trends (measured in AverageReturn) for the Hopper-v2 and Walker2d-v2 environments with the BEAR-QL algorithm, comparing an extensively tuned KL-constraint against the MMD-constraint. Note that the MMD-constraint still outperforms the KL-constraint.

We further study the performance of the KL-divergence constraint in the setting where it is stable. In this setting we needed to perform extensive hyperparameter tuning to find the optimal Lagrange multiplier for the KL-constraint, and plain dual descent always gave us an unstable solution with the KL-constraint. Even in this tuned-hyperparameter case, we find that using a KL-constraint is worse than using an MMD-constraint. The trends are summarized in Figure 11.
As described in Section C, we can achieve a reduced overall error ||Vk(s) − V∗(s)|| if we use the MMD support-matching constraint alongside importance sampling, i.e. when we multiply the Bellman error by the inverse of the behaviour-policy density. Empirically, we tried reweighting the Bellman error by the inverse of the fitted behaviour-policy density within the BEAR-QL algorithm. The trends for two environments with medium-quality data are summarized in Figure 12. We found that reweighting the Bellman error was not particularly useful in practice, even though in theory it provides an absolute error reduction, as described by Theorem 4.1. We hypothesize that this is because importance sampling tends to be less beneficial when optimizing neural networks with stochastic gradient procedures [6].
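A sketch of what this reweighting might look like in practice is given below; the weight clipping and within-batch normalization are our own stabilization choices, not details taken from the paper, and all names are illustrative.

```python
import torch

def inverse_density_weighted_bellman_loss(q_pred, q_target, log_behaviour_density,
                                          clip_max=10.0):
    """Bellman error reweighted by the inverse of the fitted behaviour-policy
    density beta(a|s).

    q_pred, q_target:      [batch] current and target Q-values
    log_behaviour_density: [batch] log beta(a|s) of the dataset action under the
                           fitted behaviour model
    """
    with torch.no_grad():
        weights = torch.exp(-log_behaviour_density)   # 1 / beta(a|s)
        weights = weights.clamp(max=clip_max)         # clip large weights for stability
        weights = weights / weights.mean()            # normalize within the batch
    return (weights * (q_pred - q_target.detach()) ** 2).mean()
```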

Figure 12: BEAR-QL with importance-sampled Bellman error minimization. We find that importance sampling is not particularly beneficial in practice.
