
Physics-Informed Model and Hybrid Planning for Efficient Dyna-Style Reinforcement Learning


Zakariae El Asri Olivier Sigaud Nicolas Thome
Sorbonne Université, CNRS, ISIR, F-75005 Paris, France
{elasri, sigaud, thome}@sorbonne-universite.fr

Abstract

Applying reinforcement learning (RL) to real-world applications requires addressing
a trade-off between asymptotic performance, sample efficiency, and inference time.
In this work, we demonstrate how to address this triple challenge by leveraging par-
tial physical knowledge about the system dynamics. Our approach involves learning
a physics-informed model to boost sample efficiency and generating imaginary tra-
jectories from this model to learn a model-free policy and Q-function. Furthermore,
we propose a hybrid planning strategy, combining the learned policy and Q-function
with the learned model to enhance time efficiency in planning. Through practical
demonstrations, we illustrate that our method improves the compromise between
sample efficiency, time efficiency, and performance over state-of-the-art methods.
Code is available at https://anonymous.4open.science/r/PHIHP/

1 Introduction

Reinforcement learning (RL) has proven successful in sequential decision-making tasks across diverse
artificial domains, ranging from games to robotics (Mnih et al., 2015; Lillicrap et al., 2016; Fujimoto
et al., 2018; Haarnoja et al., 2018). However, this success has not yet been evident in real-world
applications, where RL is facing many challenges (Dulac-Arnold et al., 2019), especially in terms
of sample efficiency and inference time needed to reach a satisfactory performance. A limitation of
existing research is that most works address these three challenges – sample efficiency, time efficiency,
and performance – individually, whereas we posit that addressing them simultaneously can benefit
from useful synergies between the leveraged mechanisms.
Concretely, on one side Model-Free Reinforcement Learning (MFRL) techniques excel at learning
a wide range of control tasks (Lillicrap et al., 2016; Fujimoto et al., 2018), but at a high sample
cost. On the other side, Model-Based Reinforcement Learning (MBRL) drastically reduces the
need for samples by acquiring a representation of the agent-environment interaction (Deisenroth &
Rasmussen, 2011; Chua et al., 2018), but requires heavy planning strategies to reach competitive
performance, at the cost of inference time.
A recent line of work focuses on combining MBRL and MFRL to benefit from the best of both
worlds (Ha & Schmidhuber, 2018; Hafner et al., 2019a; Clavera et al., 2020). In particular, Byravan
et al. (2021); Wang & Ba (2019); Hansen et al. (2022) combine a learned model and a learned policy
in planning; this combination improves the asymptotic performance but requires more samples,
due to the sample cost of learning a good policy.
This paper introduces PhIHP, a Physics-Informed model and Hybrid Planning method in RL¹.
PhIHP improves the compromise between the three main challenges outlined above – sample efficiency,
time efficiency, and performance – as illustrated in Figure 1. Compared to the state-of-the-art
MFRL method TD3 (Fujimoto et al., 2018) and the hybrid TD-MPC (Hansen et al., 2022), we show that PhIHP
¹ The source code will be released upon acceptance.
provides a much better sample efficiency, reaches higher asymptotic performance, and is much faster
than TD-MPC at inference.
To achieve this goal, PhIHP first learns a physics-informed model of the environment and uses it to
learn an MFRL policy in imagination. This policy is then used in a hybrid planning scheme. PhIHP
leverages three main mechanisms:
• Physics-informed model: we leverage an approximate physical model and combine it with a
learned data-driven residual to match the true dynamics. This physical prior boosts the sample
efficiency of PhIHP and the learned residual improves asymptotic performance.
• MFRL in imagination: we preserve sample efficiency by training a policy in an actor-critic
fashion, using TD3 on trajectories generated from the learned model. The reduced bias of the
physics-informed model makes it possible to learn an effective policy in imagination, which is
challenging with data-driven models, e.g. TD-MPC.
• Hybrid planning strategy: we incorporate the learned policy and Q-function in planning with
the learned model. A better model and a policy learned in imagination improve the performance
vs. inference time trade-off.

Figure 1: PhIHP includes a Physics-Informed model and hybrid planning for efficient policy learning
in RL. PhIHP improves the compromise over state-of-the-art methods, model-free TD3 and hybrid
TD-MPC, between sample efficiency, time efficiency, and performance. Results averaged over 6 tasks
(Towers et al., 2023).

2 Related work

Our work is at the intersection of Model-based RL, physics-informed methods, and hybrid controllers.
Model-based RL: Since DYNA architectures (Sutton, 1991), model-based RL algorithms are
known to be generally more sample-efficient than model-free methods. Planning with inaccurate
or biased models can lead to bad performance due to compounding errors, so many works have fo-
cused on developing different methods to learn accurate models: PILCO (Deisenroth & Rasmussen,
2011), SVG (Heess et al., 2015), PETS (Chua et al., 2018), PlaNet (Hafner et al., 2019b) and
Dreamer (Hafner et al., 2019a; 2020; 2023). Despite the high asymptotic performance achieved by
model-based planning, these methods require a large inference time. By contrast, by learning a
policy used to sample better actions, we can drastically reduce the inference time.
Physics-informed methods: Recently, a new line of work has attempted to leverage the physical
knowledge available from the laws of physics governing the dynamics to speed up learning and enhance
sample efficiency in MBRL (Ajay et al., 2018; Jeong et al., 2019; Johannink et al., 2019; Zeng et al.,
2020; Kloss et al., 2017; El Asri et al., 2022; Ramesh & Ravindran, 2023). However, these methods
use the learned model in model predictive control (MPC) and suffer from a large inference time.
In this work, we efficiently learn an accurate model by combining physical prior knowledge and a
data-driven component using Neural ODEs.
Hybrid controllers: An interesting line of work consists in combining MBRL and MFRL to benefit
from the best of both worlds. This combination can be done by using a learned model to generate
imaginary samples and augment the training data for a model-free agent (Buckman et al., 2018;
Clavera et al., 2020; Morgan et al., 2021; Young et al., 2022). However, the improvement in terms
of sample efficiency is limited, since the agent remains trained on real data. Recent hybrid methods
enhance the planning process by using a policy (Byravan et al., 2021; Wang & Ba, 2019), or a Q-
function (Bhardwaj et al., 2020) with a learned model. More related to our work, TD-MPC (Hansen
et al., 2022) combines the last two methods, using a learned policy and a Q-function with a learned

data-driven model to evaluate trajectories. TD-MPC jointly trains all components on real samples
and learns a latent representation of the world, resulting in improved sample efficiency. However, the
need for samples remains significant as they learn a policy from real data. By contrast, we first train
a physics-informed model from real samples, and then the policy and the Q-function are trained
in imagination. In addition, TD-MPC uses an expensive method to optimize sequences of actions,
which impacts inference time. By contrast, accurately learning a policy from the physics-informed
model reduces the action optimization budget, thereby enhancing time efficiency.

3 Background

Our work builds on reinforcement learning and the cross-entropy method.

Reinforcement learning: In RL, the problem of solving a given task is formulated as a Markov
Decision Process (MDP), that is a tuple (S, A, T, R, γ, ρ0) where S is the state space, A the
action space, T : S × A → S the transition function, R : S × A → R the reward function, γ ∈ [0, 1]
a discount factor, and ρ0 the initial state distribution. The objective in RL is to maximize
the expected return $\sum_{t=t_0}^{\infty} \gamma^{t-t_0} r_t$ at each timestep t0. In model-free RL, an agent learns a policy
πθ : S → A that maximizes this expected return. In contrast, in model-based RL, the agent learns
a model that represents the transition function T, then uses this learned model $\hat{T}_\theta$ to predict the
next state $\hat{s}_{t+1} = \hat{T}_\theta(s_t, a_t)$. The agent maximizes the expected return by optimizing a sequence of
actions $A = \{a_{t_0}, \ldots, a_{t_0+H}\}$ over a horizon H:

$$A^* = \arg\max_{A \in \mathcal{A}^H} \sum_{t=t_0}^{H} \gamma^{t-t_0} R(s_t, a_t), \quad \text{subject to} \quad s_{t+1} = \hat{T}_\theta(s_t, a_t). \tag{1}$$

Furthermore, using an inaccurate model can degrade solutions due to compounding errors. So, one
often solves this optimization problem at each time step, only executes the first action from the
sequence, and plans again at the next time step with updated state information. This is known as
model predictive control (MPC).
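
To make this concrete, the sketch below shows a generic receding-horizon MPC loop under the gymnasium step/reset API; the `plan` routine (e.g. the CEM sketched below) and the learned model `T_hat` are placeholders, not the exact PhIHP components.

```python
# Sketch of a generic receding-horizon MPC loop for Eq. 1, using the gymnasium
# step/reset API; `plan` and the learned model `T_hat` are placeholders.
def mpc_episode(env, T_hat, reward_fn, plan, horizon=30, max_steps=500):
    state, _ = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        # Re-plan from the current state, but execute only the first action.
        action_sequence = plan(state, T_hat, reward_fn, horizon)
        state, reward, terminated, truncated, _ = env.step(action_sequence[0])
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```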

Cross Entropy Method (CEM): Since the dynamics and the reward functions are generally
nonlinear, it is difficult to compute the exact solution of (1) analytically. In this work, we use the
derivative-free Cross-Entropy Method (de Boer et al., 2005) to solve this optimization problem.
In CEM, the agent looks for the best sequence of actions over a finite horizon H. It first generates
N candidate sequences of actions from a normal distribution X ∼ N(µ, σ²). Then, it evaluates the
resulting trajectories using the learned dynamics model and a reward model, and determines the
K elite sequences of actions (K < N), that is, the sequences that lead to the highest return. Finally,
the normal distribution parameters µ and σ are updated to fit the elites. This process is repeated
for a fixed number of iterations, and the optimal action sequence is taken as the mean of the K
elites after the last iteration. We call CEM budget the population size times the number of
iterations; this budget is the main factor of inference time in methods that use the CEM.
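
For illustration, here is a minimal CEM planner sketch following the description above; `dynamics` and `reward_fn` are single-step callables and all sizes are illustrative, so this is a sketch rather than the exact planner used in PhIHP.

```python
import numpy as np

# Minimal CEM planner sketch: sample sequences from N(mu, sigma^2), evaluate
# them with the learned model and a reward model, refit on the K elites.
def cem_plan(state, dynamics, reward_fn, horizon, action_dim,
             population=200, elites=10, iterations=3, gamma=0.99):
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample N candidate action sequences from N(mu, sigma^2).
        candidates = mu + sigma * np.random.randn(population, horizon, action_dim)
        returns = np.zeros(population)
        for n in range(population):
            s = state
            for t in range(horizon):
                returns[n] += (gamma ** t) * reward_fn(s, candidates[n, t])
                s = dynamics(s, candidates[n, t])        # predicted next state
        # Keep the K elite sequences and refit the sampling distribution.
        elite = candidates[np.argsort(returns)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu  # mean of the elites after the last iteration
```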

4 Physics-Informed model for Hybrid Planning

In this section, we describe PhIHP, our proposed Physics-Informed model for Hybrid Planning.
PhIHP first learns a physics-informed residual dynamics model (Sec. 4.1), then learns a MFRL
agent through imagination (Sec. 4.2), and uses a hybrid planning strategy at inference (Sec. 4.3).
PhIHP follows recent hybrid MBRL/MFRL approaches, e.g. TD-MPC (Hansen et al., 2022), but
the physics-informed model brings important improvements at each stage of the process. It yields a
more accurate model, which improves predictive performance and robustness to shifts in the training
data distribution. Crucially, it benefits from the continuous Neural ODE formulation (Sec. 4.1)
to accurately predict trajectories, enabling us to learn a powerful model-free agent in imagination
(Sec. 4.2). Finally, it enables a hybrid planning scheme (Sec. 4.3) that optimizes the performance
vs. time efficiency trade-off.

(a) Learn a physics-informed model (b) Learn an actor/critic offline (c) Behaviour at inference time

Figure 2: Schematic view of PhIHP. (a) We iteratively learn a physics-informed model from few
interactions in the environment. (b) We learn a policy and Q-function from trajectories imagined
with the learned model. (c) The agent samples actions from the policy output and random actions
and then evaluates the resulting trajectories using CEM, a reward function, and the Q-function.

4.1 Learning a physics-informed dynamics model

Model-based RL methods aim to learn the transition function T of the world, i.e. a mapping from
(st, at) to st+1. However, learning T is challenging when st and st+1 are similar and actions have
a low impact on the output, in particular when the time interval between steps decreases. We
address this issue by learning a dynamics function $\hat{T}_\theta$ that predicts the state change ∆st over the time
step duration ∆t. The next state st+1 can then be obtained through integration with an
Ordinary Differential Equation (ODE) solver. Thus, we describe the dynamics as a system following
an ODE of the form:

$$\left.\frac{ds_t}{dt}\right|_{t=t_0} = \hat{T}_\theta(s_{t_0}, a_{t_0}), \quad \text{and} \quad s_{t+1} \simeq \text{ODESolve}\left(s_t, a_t, \hat{T}_\theta, t, t+\Delta t\right), \tag{2}$$

where st and at are the state and action vectors at a given time t. We assume the common situation
where partial knowledge of the dynamics is available, generally from the underlying physical
laws. The dynamics $\hat{T}_\theta$ can thus be written as $\hat{T}_\theta = F^p_{\theta_p} + F^r_{\theta_r}$, where $F^p_{\theta_p}$ is the known analytic
approximation of the dynamics and $F^r_{\theta_r}$ is a residual part used to reduce the gap between the
model prediction and the real world by learning the complex phenomena that cannot be captured
analytically. The physical model $F^p_{\theta_p}$ is described by an ODE and the residual part $F^r_{\theta_r}$ by a neural
network, with respective parameters $\theta_p$ and $\theta_r$. We learn the dynamics model in a supervised manner
by optimizing the following objective:

$$\mathcal{L}_{pred}(\theta) = \frac{1}{|\mathcal{D}_{re}|} \sum_{(s_t, a_t, s_{t+1}) \in \mathcal{D}_{re}} \|s_{t+1} - \hat{s}_{t+1}\|_2^2 \quad \text{subject to} \quad \left.\frac{d\hat{s}_t}{dt}\right|_{t=t'} = (F^p_{\theta_p} + F^r_{\theta_r})(s_{t'}, a_{t'}), \tag{3}$$

on a dataset $\mathcal{D}_{re}$ of real transitions (st, at, st+1). As the decomposition $\hat{T}_\theta = F^p_{\theta_p} + F^r_{\theta_r}$ is not unique,
we apply an $\ell_2$ constraint on the residual part with a coefficient λ to enforce the model $\hat{T}_\theta$ to
rely mostly on the physical prior. The learning objective becomes $\mathcal{L}_\lambda(\theta) = \mathcal{L}_{pred}(\theta) + \frac{1}{\lambda} \cdot \|F^r_{\theta_r}\|_2$. The
coefficient λ is initialized with a value $\lambda_0$ and updated at each epoch with $\lambda_{j+1} = \lambda_j + \tau_{ph} \cdot \mathcal{L}_{pred}(\theta)$,
where $\lambda_0$ and $\tau_{ph}$ are fixed hyperparameters.
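
A possible PyTorch sketch of this physics-informed model is given below, assuming the analytic prior $F^p$ is provided as a callable; the RK4 integrator stands in for the ODE solver of Eq. 2, and the loss follows Eq. 3 with the $1/\lambda$ residual penalty. Layer sizes and the exact solver are illustrative choices, not the actual PhIHP implementation.

```python
import torch
import torch.nn as nn

# Sketch of the physics-informed dynamics T_theta = F^p + F^r (Eqs. 2-3).
# `analytic_ode(s, a)` stands for the known frictionless physics F^p; the
# residual F^r is a small MLP. RK4 is an illustrative stand-in for ODESolve.
class PhysicsInformedDynamics(nn.Module):
    def __init__(self, analytic_ode, state_dim, action_dim, hidden=16):
        super().__init__()
        self.f_p = analytic_ode                      # known approximate physics
        self.f_r = nn.Sequential(                    # data-driven residual
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def derivative(self, s, a):
        return self.f_p(s, a) + self.f_r(torch.cat([s, a], dim=-1))

    def forward(self, s, a, dt=0.05):
        # One RK4 step with a zero-order hold on the action.
        k1 = self.derivative(s, a)
        k2 = self.derivative(s + 0.5 * dt * k1, a)
        k3 = self.derivative(s + 0.5 * dt * k2, a)
        k4 = self.derivative(s + dt * k3, a)
        return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def regularized_loss(model, s, a, s_next, lam):
    # L_lambda = L_pred + (1/lambda) * ||F^r||_2, cf. Sec. 4.1.
    l_pred = ((s_next - model(s, a)) ** 2).sum(dim=-1).mean()
    l_res = model.f_r(torch.cat([s, a], dim=-1)).norm(dim=-1).mean()
    # After each epoch, lambda is increased: lam = lam + tau_ph * l_pred.item()
    return l_pred + l_res / lam, l_pred
```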

4.2 Learning a policy and Q-function through imagination

Simply planning with a learned model and CEM is expensive at inference time. MFRL methods are generally
more time-efficient at inference than planning methods, since they use policies that directly
map a state to an action. However, learning complex policies requires a large amount of training
data, which hurts sample efficiency. To maintain sample efficiency, a policy can be learned from
synthetic data generated by a model; however, an imperfect model may propagate its bias to
the learned policy. In this work, we benefit from the reduced bias of the physics-informed model
to generate a sufficiently accurate synthetic dataset $\mathcal{D}_{im}$ to train a parametric policy πθ(st) and a
Q-function Qθ(st, at), using the TD3 model-free actor-critic algorithm (Fujimoto et al., 2018).
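
The sketch below illustrates how such an imaginary dataset could be produced: short rollouts of the current policy (with exploration noise) through the learned model fill a replay buffer on which standard TD3 updates are then performed. The component names (`model`, `reward_fn`, `reset_state`, `agent`, `buffer`) are placeholders rather than the actual PhIHP code.

```python
import torch

# Sketch of filling the imaginary replay buffer D_im with model rollouts and
# training TD3 on it; `agent` stands for a TD3 implementation.
def train_in_imagination(model, reward_fn, reset_state, agent, buffer,
                         n_rollouts=1000, rollout_len=50, expl_noise=0.1):
    for _ in range(n_rollouts):
        s = reset_state()                              # sampled initial state
        for _ in range(rollout_len):
            with torch.no_grad():
                a = agent.policy(s)
                a = a + expl_noise * torch.randn_like(a)   # exploration noise
                s_next = model(s, a)                   # imagined transition
            buffer.add(s, a, reward_fn(s, a), s_next)
            s = s_next
        agent.update(buffer)   # standard TD3 actor/critic updates on D_im batches
```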

4.3 Hybrid planning with learned model and policy

PhIHP leverages a hybrid planning method that combines a physics-informed model with a learned
policy and Q-function. This combination helps overcome the drawbacks associated with each method
when used individually. While using a sub-optimal policy in control tasks significantly affects the
asymptotic performance, planning with a learned model has a high computational cost: i) the
planning horizon must be long enough to capture future rewards and ii) the CEM budget must be
sufficiently large to converge.
We use the learned policy in PhIHP to guide planning. In practice, a CEM-based planner first
samples $N_\pi$ informative candidates from the learned policy outputs $\hat{\pi}(s_t)$ and complements them
with $N_{rand}$ exploratory candidates sampled from a distribution X ∼ N(µ, σ²). These
informative candidates help reduce the population size and accelerate convergence. The planner
rolls out the resulting trajectories with the learned model and evaluates each trajectory using the
immediate reward function up to the MPC horizon and the Q-value beyond that horizon.
By using the Q-value, we can evaluate trajectories over a considerably reduced planning horizon
H, adding the Q-value of the last state to cover the long-term reward. Hence, the optimization
problem is written as follows:

$$A^* = \arg\max_{A \in \mathcal{A}^H} \left[ \sum_{t=t_0}^{H} \gamma^{t-t_0} R(s_t, a_t) + \alpha \cdot \gamma^{H-t_0} Q(s_H) \right], \quad \text{subject to} \quad s_{t+1} = \hat{T}_\theta(s_t, a_t), \tag{4}$$

where the discounted sum term represents a local solution to the optimization problem, while the
Q-value term encodes the long-term reward and α balances the immediate reward over the planning
horizon and the Q-value.
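
As an illustration of Eq. 4, the sketch below mixes policy proposals and exploratory CEM samples and scores each candidate with the model reward over the short horizon H plus a terminal Q-value. Since the learned critic is a state-action function, the terminal value is approximated here by Q(s_H, π(s_H)), and the small noise on policy proposals is an illustrative choice; all components are assumed to be provided.

```python
import numpy as np

# Sketch of the hybrid planner of Eq. 4: candidates mix policy proposals and
# exploratory CEM samples; trajectories are scored with model rewards plus a
# terminal Q-value. `model`, `reward_fn`, `policy`, `q_fn` are placeholders.
def hybrid_plan(state, model, reward_fn, policy, q_fn, action_dim,
                horizon=4, n_policy=20, n_random=200, elites=10,
                iterations=3, gamma=0.99, alpha=1.0):
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iterations):
        random_cands = mu + sigma * np.random.randn(n_random, horizon, action_dim)
        # Informative candidates: roll the (slightly perturbed) policy through the model.
        policy_cands = np.zeros((n_policy, horizon, action_dim))
        for n in range(n_policy):
            s = state
            for t in range(horizon):
                policy_cands[n, t] = policy(s) + 0.05 * np.random.randn(action_dim)
                s = model(s, policy_cands[n, t])
        candidates = np.concatenate([policy_cands, random_cands], axis=0)
        scores = np.zeros(len(candidates))
        for n, seq in enumerate(candidates):
            s = state
            for t in range(horizon):
                scores[n] += (gamma ** t) * reward_fn(s, seq[t])
                s = model(s, seq[t])
            scores[n] += alpha * (gamma ** horizon) * q_fn(s, policy(s))  # long-term value
        elite = candidates[np.argsort(scores)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu[0]  # receding horizon: execute only the first action
```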

5 Experiments

We first compare PhIHP to baselines in terms of performance, sample efficiency, and time efficiency.
Then we perform ablations and highlight the generalization capability brought by the physics prior.
The robustness of PhIHP to hyper-parameter settings is deferred to Appendix E.

5.1 Experimental setup

Environments: We evaluate our method on 6 ODE-governed environments from the gymnasium
classic control suite. These include the continuous versions of 3 basic environments: Pendulum,
Cartpole, and Acrobot. Additionally, we consider their swing-up variants, where the initial state is
“hanging down” and the goal is to swing up and balance the pole at the upright position, similarly
to Yildiz et al. (2021). We opted for this benchmark for its challenging characteristics, including
tasks with sparse rewards and early termination.
However, to move closer to real-world settings, we added to the original environments from the
gymnasium suite a friction term which is not present in their analytical model. Thus, the dynamics
of each system are governed by an ODE that can be represented as the combination of two terms:
a frictionless component F^p and a friction term F^r. Please refer to Appendix B for additional details.
Evaluation metrics. In all experiments, we use three main metrics to compare methods:
• Asymptotic performance: we report the episodic cumulative reward on each environment.
• Sample efficiency: we define the sample efficiency of a method as the minimal amount of samples
required to achieve 90% of its maximum performance (a minimal computation of this metric is
sketched after this list).
• Inference time: we report the wall-clock time taken by the agent to select an action at one timestep.
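
As a minimal sketch of the sample-efficiency metric, the function below returns the first sample count at which a run reaches 90% of its maximum performance; returns are min-max normalized here because the exact normalization is not specified above, so this is one reasonable choice rather than the exact procedure used in the experiments.

```python
import numpy as np

# Minimal sketch of the sample-efficiency metric: the smallest number of samples
# at which a run reaches 90% of its maximum performance.
def sample_efficiency(samples, returns, fraction=0.9):
    r = np.asarray(returns, dtype=float)
    r = (r - r.min()) / (r.max() - r.min() + 1e-8)   # min-max normalization (assumption)
    first_idx = int(np.argmax(r >= fraction))        # first evaluation above threshold
    return samples[first_idx]
```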
Design choices for PhIHP: We learn the model by combining an approximate ODE describing
frictionless motion with a data-driven residual parameterized as a low-dimensional MLP. We
use TD3 (Fujimoto et al., 2018) for the model-free component of our method, i.e. the policy and
Q-function. We found it beneficial to modify the original hyperparameters of TD3 to solve the
environments with friction. For planning, we use CEM-based MPC. Please refer to Appendix C for
additional details.

5.2 Comparison to the state of the art

We compare PhIHP to the following state-of-the-art methods:


• TD-MPC (Hansen et al., 2022), a state-of-the-art hybrid MBRL/MFRL algorithm shown to
outperform strong state-based algorithms, both model-based, e.g. LOOP (Sikchi et al., 2022), and
model-free, e.g. SAC (Haarnoja et al., 2018), on diverse continuous control tasks.
• TD3 (Fujimoto et al., 2018), a state-of-the-art model-free algorithm. In addition to its popularity
and strong performance on continuous control tasks, TD3 is a backbone algorithm for our method
to learn the policy and Q-function. We used the same hyperparameters as in PhIHP.
• CEM-oracle: a CEM-based controller with the ground-truth model.

(a) Learning curves, the x-axis uses a symlog scale. (b) Performance profiles.

Figure 3: Comparison of PhIHP vs baselines aggregated on 6 control tasks (10 runs). a) PhIHP shows
excellent sample efficiency and better asymptotic performance. b) Performance profiles are obtained
with rliable (Agarwal et al., 2021). PhIHP shows better performance profiles, which indicates
better robustness to outliers. Comparisons on individual environments are shown in Appendix D.

Asymptotic performance:
| Task | PhIHP | TD-MPC | TD3 | CEM-oracle |
| Pendulum | -263 ±144 | -276 ±301 | -229 ±155 | -228 ±71 |
| Pendulum-sw | -356 ±13 | -395 ±324 | -368 ±14 | -597 ±6 |
| CartPole | 500 ±0 | 432 ±129 | 464 ±80 | 453 ±24 |
| CartPole-sw | 453 ±8 | 460 ±4 | 354 ±113 | 446 ±5 |
| Acrobot | -138 ±122 | -249 ±168 | -237 ±183 | -500 ±0 |
| Acrobot-sw | 371 ±52 | 373 ±127 | 119 ±71 | 349 ±5 |

Sample efficiency (×10³ samples):
| Task | PhIHP | TD-MPC | TD3 |
| Pendulum | 2 ±0 | 26 ±24 | 86 ±40 |
| Pendulum-sw | 5 ±0 | 28 ±12 | 57 ±16 |
| CartPole | 5 ±0 | 23 ±10 | 108 ±27 |
| CartPole-sw | 5 ±0 | 76 ±27 | 27 ±10 |
| Acrobot | 5 ±0 | 10 ±5 | 233 ±110 |
| Acrobot-sw | 15 ±0 | 135 ±123 | 500 ±0 |

Inference time (in milliseconds):
| Task | PhIHP | TD-MPC | TD3 | CEM-oracle |
| Pendulum | 6.32 ±0.02 | 39.56 ±0.28 | 0.11 ±0.0 | 18.89 ±0.26 |
| Pendulum-sw | 6.37 ±0.01 | 39.6 ±0.54 | 0.11 ±0.0 | 18.87 ±0.05 |
| CartPole | 7.43 ±0.02 | 39.3 ±0.07 | 0.11 ±0.0 | 33.22 ±0.03 |
| CartPole-sw | 10.13 ±0.06 | 39.36 ±0.05 | 0.11 ±0.0 | 33.73 ±0.03 |
| Acrobot | 11.14 ±0.02 | 39.38 ±0.09 | 0.12 ±0.0 | 59.83 ±0.1 |
| Acrobot-sw | 9.12 ±0.02 | 39.39 ±0.06 | 0.12 ±0.0 | 58.50 ±0.27 |

Table 1: Asymptotic return, sample efficiency, and inference time of PhIHP and baselines on 6 classic control tasks. Mean and std. over 10 runs.

In Tab. 1, Figure 4 and Figure 3a, we show that PhIHP outperforms the baselines by a large margin
in at least one of the metrics without being worse on the others. Specifically, PhIHP is far more
sample efficient than TD3 and it generally shows 5-15 times better sample efficiency than TD-MPC,
except on Acrobot where they are comparable. Figure 3a further illustrates the excellent sample
efficiency of PhIHP and how TD3 plateaus at sub-optimal performance. This enhanced sample

Figure 4: Aggregated median, interquartile mean (IQM), mean performance, and optimality gap of
PhIHP and baselines on 6 tasks (10 runs). Higher mean, median, and IQM performance and lower
optimality gaps are better. Confidence intervals are estimated using the percentile bootstrap with
stratified sampling (Agarwal et al., 2021). PhIHP outperforms baselines in all metrics.

efficiency of PhIHP results from training the model-free policy on imaginary trajectories generated by
the learned model, as opposed to using real samples as in the baselines. Besides, PhIHP demonstrates
superior performance in sparse-reward, early-termination tasks (Cartpole and Acrobot)
compared to TD-MPC, and PhIHP outperforms TD3 by a large margin in Cartpole-swingup,
Acrobot, and Acrobot-swingup. Figure 4 in Appendix D.1 shows how TD3 plateaus at a lower
asymptotic performance on the aforementioned tasks. It also shows that the performance of TD-MPC drops in
sparse-reward, early-termination environments, e.g. Cartpole and Acrobot. It also illustrates that,
since CEM-oracle uses the reward function to evaluate trajectories within a limited horizon, it manages
to solve tasks with smooth reward functions, as well as sparse-reward tasks where the goal
is to maintain the initial state (i.e. Cartpole), but it fails to solve sparse-reward problems where the
goal is to reach a position beyond the planning horizon (i.e. Acrobot).
Finally, Figure 3b shows that PhIHP has better performance profiles than the baselines, which
indicates better robustness to outliers.
Tab. 1 also reports the time needed for planning at each time step, obtained with an Apple M1 CPU
with 8 cores. It is noteworthy that PhIHP significantly reduces the inference time when compared
to TD-MPC. The inference time is still larger than that of TD3 since the latter is a component of
our method, but it meets the real-time requirements of various robotics applications.

5.3 Ablation study

In this section, we study the impact of each PhIHP component to illustrate the benefits of using an
analytical physics model, imagination learning, and combining CEM with a model-free policy and
Q-function for planning. To illustrate this, we compare PhIHP to several methods:
• TD-MPC*: our method without the physical prior and without imagination. It is similar to TD-MPC
since the model is data-driven and learned, together with the policy, from real trajectories,
but model learning and policy learning are separated.
• Ph-TD-MPC*: our method without learning in imagination, thus a physics-informed TD-MPC*.
• dd-CEM: our method without the physical prior or the policy component, thus a CEM with a data-
driven model learned from real trajectories.
• Ph-CEM: our method without the policy component, thus a simple CEM with a physics-informed
model learned from real trajectories.
Figure 5 shows the impact of the quality of the model on the final performance in MBRL. Precisely,
leveraging a physical prior in Ph-CEM and Ph-TD-MPC* shows improvements compared to fully
data-driven methods, i.e. dd-CEM and TD-MPC*. We also illustrate that planning with a model,
a Q-function, and a policy leads to better performance compared to planning only with the model:
for instance, Ph-TD-MPC* outperforms Ph-CEM and TD-MPC* outperforms dd-CEM. However,
this gain in performance comes with a significant cost in samples, because the agent needs a large
amount of data to learn a good policy and Q-function.
Figure 5 also illustrates the trade-off between asymptotic performance, sample efficiency, and inference
time in RL. On one hand, methods that learn a model and directly plan with it (e.g. dd-CEM
and Ph-CEM) do not need many samples to achieve sufficiently good performance, but they are
[Figure 5 shows three bar plots: Asymptotic Performance (normalized score), Number of Samples Needed (×10³ samples), and Inference Time (milliseconds) for PhIHP, Ph-TD-MPC*, TD-MPC*, Ph-CEM, and dd-CEM.]

Figure 5: Comparison of PhIHP and its variants on the 3 main metrics. The figures illustrate the
aggregated results of running all algorithms on 6 classic control tasks. Histograms and bars represent
mean and std. over 10 runs.

too expensive at inference time. On the other hand, methods that plan with a model, a
Q-function, and a policy are fast at inference but require many samples to train their policies and Q-functions.
PhIHP is the only method that achieves good asymptotic performance with a low sample cost,
thanks to learning in imagination, and a good inference time, thanks to hybrid planning.

5.4 Generalization benefits of the physics prior

In this section, we highlight the key role of incorporating physical knowledge into PhIHP in finding
the better compromise between asymptotic performance, sample efficiency, and time efficiency
illustrated in Figure 5. Indeed, learning a policy and Q-function through imagination leads to

[Figure 6 shows two plots over 10 training episodes on Pendulum swingup: total reward (left) and MSE of next-state prediction (right) for the physics-informed model and the data-driven model.]

Figure 6: A data-driven model still poorly predicts the next states even when its asymptotic per-
formance matches that of the physics-informed model. Figure obtained with 10 episodes of model
training on Pendulum swingup.

superior performance only when the model used to generate samples is accurate enough. Figure 5
in Appendix D.3 shows that an agent trained on imaginary trajectories generated with a physics-
informed model largely outperforms the same agent using a fully data-driven model and matches
the performance of TD3 which is trained on real trajectories. This highlights the capability of the
physics-informed model to immediately generalize to unseen data, in contrast to the data-driven
model, which poorly predicts trajectories in unseen states. Figure 6 illustrates this faster general-
ization capability, showing that the agent with a data-driven model still poorly predicts trajectories
even when it meets the asymptotic performance of the agent with the physics-informed model.

6 Conclusion

We have introduced PhIHP, a novel approach that leverages physics knowledge of system dynamics
to address the trade-off between asymptotic performance, sample efficiency, and time efficiency in
RL. PhIHP enhances sample efficiency by learning a physics-informed model that serves to
train a model-free agent through imagination, and uses a hybrid planning strategy to improve
inference time and asymptotic performance. In the future, we envision applying PhIHP to more
challenging control tasks where there is a larger discrepancy between the known equations and the
real dynamics of the system.

References
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G Bellemare.
Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Informa-
tion Processing Systems, 2021.
Anurag Ajay, Jiajun Wu, Nima Fazeli, Maria Bauza, Leslie P Kaelbling, Joshua B Tenenbaum,
and Alberto Rodriguez. Augmenting physical simulators with stochastic neural networks: Case
study of planar pushing and bouncing. In 2018 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pp. 3066–3073. IEEE, 2018.
Mohak Bhardwaj, Sanjiban Choudhury, and Byron Boots. Blending mpc & value function approxi-
mation for efficient reinforcement learning. arXiv preprint arXiv:2012.05909, 2020.
Jacob Buckman, Danijar Hafner, George Tucker, Eugene Brevdo, and Honglak Lee. Sample-efficient
reinforcement learning with stochastic ensemble value expansion. Advances in neural information
processing systems, 31, 2018.
Arunkumar Byravan, Leonard Hasenclever, Piotr Trochim, Mehdi Mirza, Alessandro Davide Ialongo,
Yuval Tassa, Jost Tobias Springenberg, Abbas Abdolmaleki, Nicolas Heess, Josh Merel, et al.
Evaluating model-based planning and planner amortization for continuous control. arXiv preprint
arXiv:2110.03363, 2021.
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement
learning in a handful of trials using probabilistic dynamics models. Advances in neural information
processing systems, 31, 2018.
Ignasi Clavera, Violet Fu, and Pieter Abbeel. Model-augmented actor-critic: Backpropagating
through paths. arXiv preprint arXiv:2005.08068, 2020.
P. T. de Boer, Dirk P. Kroese, Shie Mannor, and Reuven Y. Rubinstein. A tutorial on the cross-
entropy method. Annals of Operations Research, 134:19–67, 2005.
Marc Peter Deisenroth and Carl Edward Rasmussen. PILCO: A model-based and data-efficient
approach to policy search. In ICML, 2011.
Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester. Challenges of real-world reinforcement
learning. arXiv preprint arXiv:1904.12901, 2019.
Zakariae El Asri, Clément Rambour, Vincent Le Guen, and Nicolas Thome. Residual model-based
reinforcement learning for physical dynamics. In 3rd Offline RL Workshop: Offline RL as
a "Launchpad", 2022.
Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-
critic methods. In International conference on machine learning, pp. 1587–1596. PMLR, 2018.
David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018.
Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash
Kumar, Henry Zhu, Abhishek Gupta, Pieter Abbeel, et al. Soft actor-critic algorithms and appli-
cations. arXiv preprint arXiv:1812.05905, 2018.
Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning
behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019a.

Danijar Hafner, Timothy Lillicrap, Ian Fischer, Ruben Villegas, David Ha, Honglak Lee, and James
Davidson. Learning latent dynamics for planning from pixels. In International conference on
machine learning, pp. 2555–2565. PMLR, 2019b.
Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with
discrete world models. arXiv preprint arXiv:2010.02193, 2020.
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains
through world models. arXiv preprint arXiv:2301.04104, 2023.
Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive
control. In ICML, 2022.
Nicolas Heess, Gregory Wayne, David Silver, Timothy Lillicrap, Tom Erez, and Yuval Tassa. Learn-
ing continuous control policies by stochastic value gradients. Advances in neural information
processing systems, 28, 2015.
Rae Jeong, Jackie Kay, Francesco Romano, Thomas Lampe, Tom Rothorl, Abbas Abdolmaleki, Tom
Erez, Yuval Tassa, and Francesco Nori. Modelling generalized forces with reinforcement learning
for sim-to-real transfer. arXiv preprint arXiv:1910.09471, 2019.
Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll,
Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Residual reinforcement learning for
robot control. In 2019 International Conference on Robotics and Automation (ICRA), pp. 6023–
6029. IEEE, 2019.
Alina Kloss, Stefan Schaal, and Jeannette Bohg. Combining learned and analytical models for
predicting action effects. arXiv preprint arXiv:1710.04102, 11, 2017.
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Manfred Otto Heess, Tom Erez,
Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement
learning. CoRR, abs/1509.02971, 2016.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Belle-
mare, Alex Graves, Martin A. Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen,
Charlie Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wier-
stra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning.
Nature, 518:529–533, 2015.
Andrew S Morgan, Daljeet Nandha, Georgia Chalvatzaki, Carlo D’Eramo, Aaron M Dollar, and Jan
Peters. Model predictive actor-critic: Accelerating robot skill acquisition with deep reinforcement
learning. In 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 6672–
6678. IEEE, 2021.
Adithya Ramesh and Balaraman Ravindran. Physics-informed model-based reinforcement learning.
In Learning for Dynamics and Control Conference, pp. 26–37. PMLR, 2023.
Harshit Sikchi, Wenxuan Zhou, and David Held. Learning off-policy with online planning. In
Conference on Robot Learning, pp. 1622–1633. PMLR, 2022.
Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting. ACM
Sigart Bulletin, 2(4):160–163, 1991.
Mark Towers, Jordan K. Terry, Ariel Kwiatkowski, John U. Balis, Gianluca de Cola, Tristan Deleu,
Manuel Goulão, Andreas Kallinteris, Arjun KG, Markus Krimmel, Rodrigo Perez-Vicente, Andrea
Pierré, Sander Schulhoff, Jun Jet Tai, Andrew Tan Jin Shen, and Omar G. Younis. Gymnasium,
March 2023. URL https://zenodo.org/record/8127025.
Tingwu Wang and Jimmy Ba. Exploring model-based planning with policy networks. arXiv preprint
arXiv:1906.08649, 2019.

Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforce-
ment learning. In International Conference on Machine Learning, pp. 12009–12018. PMLR, 2021.
Kenny Young, Aditya Ramesh, Louis Kirsch, and Jürgen Schmidhuber. The benefits of model-based
generalization in reinforcement learning. arXiv preprint arXiv:2211.02222, 2022.
Andy Zeng, Shuran Song, Johnny Lee, Alberto Rodriguez, and Thomas Funkhouser. Tossingbot:
Learning to throw arbitrary objects with residual physics. IEEE Transactions on Robotics, 36(4):
1307–1319, 2020.


A Comparison to existing methods

In this section, we present a conceptual comparison of PhIHP and existing RL methods. Figure 1
illustrates the general scheme of existing RL methods and the possible connections between learning
and planning. We highlight in Figure 2 the origin of the well-known drawbacks in RL: i) learning a
policy on real data (arrow 1) impacts the sample efficiency, ii) learning a policy from a data-driven
learned model (arrow 3) impacts the asymptotic performance due to the bias in the learned model,
iii) model-based planning (arrow 4) impacts the inference time.

Figure 1: Overview of existing schemes of learning/planning in RL. 1- learn a policy/value function
from real data. 2- learn a model from real data. 3- learn a policy/value function from imaginary
data. 4- plan with a learned model. 5- plan with a learned policy/value function. 6- act based on a
policy output. 7- act based on the planning outcome. 8- collect data from the interaction with the
real world.

(a) general scheme (b) MFRL (TD3, SAC) (c) MBRL (PILCO)

(d) Dyna-style RL (LOOP) (e) Hybrid RL (TD-MPC) (f) PhIHP (Ours)

Figure 2: Conceptual comparison of PhIHP and existing methods based on the general scheme in
Figure 1. Thick lines indicate the mechanisms used by a method; red lines indicate the origin of the
main drawbacks: 1- learning on real data impacts the sample efficiency, 3- bias introduced by the
data-driven model impacts the asymptotic performance, 4- planning with a model impacts the inference time.


PhIHP benefits from the good sample efficiency of model-based learning methods (arrow 2) and
from the physical knowledge to reduce the bias in the learned model. The accurately learned model
generates good trajectories to train the policy/value networks (arrow 3). When interacting with
the environment, PhIHP uses a hybrid planning strategy (arrows 4 & 5) to improve asymptotic
performance and time efficiency.

B Environments

In this section, we give a comprehensive description of the environments employed in our work.
Across all environments, observations are continuous within [−Sbox , Sbox ] and actions are continuous
and restricted to a [−amax , amax ] range. An overview of all tasks is depicted in Figure 3 and specific
parameters are outlined in Table 1.
Pendulum: A single-link pendulum is fixed at one end, with an actuator on the joint. The
pendulum starts at a random position and the goal is to swing it up and balance it at the upright
position. Let θ be the joint angle at time t and θ̇ its velocity; the state at time t is (θ, θ̇).
Pendulum-Swingup: the version of Pendulum where it is started at the "hanging down" position.
Cartpole: A pole is attached by an unactuated joint to a cart, which moves along a horizontal
track. The pole is started upright on the cart and the goal is to balance the pole by applying forces
in the left and right direction on the cart.
Cartpole-Swingup: the version of Cartpole where the pole is started at the "hanging down"
position.
Acrobot: A pendulum with two links connected linearly to form a chain, with one end of the chain
fixed. Only the joint between the two links is actuated. The goal is to apply torques on the actuated
joint to swing the free end of the linear chain above a given height.
Acrobot-Swingup: For the swingup task, we experiment with the fully actuated version of the
Acrobot similarly to (Yildiz et al., 2021; Xie et al., 2016). Initially, both links point downwards at
the "hanging down" position. The goal is to swing up the Acrobot and balance it in the upright
position. Let θ1 be the angle of the first joint (fixed to a hinge) at time t and θ2 the relative angle
between the two links at time t. The state at time t is (θ1, θ2, θ̇1, θ̇2).

Figure 3: Experimental tasks: Pendulum & Pendulum-swingup (left), Cartpole & Cartpole-swingup
(center), Acrobot & Acrobot-swingup (right). Acrobot-swingup is fully actuated while Acrobot
is only actuated at the joint between the two links, thus a1 = 0.


| Parameters | Pendulum | Pendulum-SU | Cartpole | Cartpole-SU | Acrobot | Acrobot-SU |
| Reward type | Smooth | Smooth | Sparse | Smooth | Sparse | Smooth |
| Early termination | No | No | Yes | No | Yes | No |
| State space | R² | R² | R⁴ | R⁴ | R⁴ | R⁴ |
| States | (θ, θ̇) | (θ, θ̇) | (x, ẋ, θ, θ̇) | (x, ẋ, θ, θ̇) | (θ1, θ2, θ̇1, θ̇2) | (θ1, θ2, θ̇1, θ̇2) |
| Observation space | R³ | R³ | R⁵ | R⁵ | R⁶ | R⁶ |
| Observations | (cos θ, sin θ, θ̇) | (cos θ, sin θ, θ̇) | (x, ẋ, cos θ, sin θ, θ̇) | (x, ẋ, cos θ, sin θ, θ̇) | (cos θ1, sin θ1, cos θ2, sin θ2, θ̇1, θ̇2) | (cos θ1, sin θ1, cos θ2, sin θ2, θ̇1, θ̇2) |
| Action space | R¹ | R¹ | R¹ | R¹ | R¹ | R² |
| a_max | [2.0] | [2.0] | [10.0] | [10.0] | [1.0] | [1.0, 1.0] |
| Length of the rollout | 200 | 500 | 500 | 500 | 500 | 500 |
| ∆t | 0.05 | 0.05 | 0.02 | 0.02 | 0.2 | 0.2 |

Table 1: Environment specifications.

B.1 Dynamic functions

In this section, we provide details of the dynamics functions. For each task, the dynamics function
consists of a frictionless component and a friction term.
Pendulum and Pendulum Swingup: Let $s_t = (\theta, \dot{\theta})$ be the state and $a_t$ the action at time $t$.
The dynamics of the pendulum are described as:

$$F(s_t, a_t) = \begin{pmatrix} \dot{\theta} \\ \ddot{\theta} \end{pmatrix} = \begin{pmatrix} \dot{\theta} \\ C_g \cdot \sin(\theta) + C_i \cdot a_t + C_{Fr} \cdot \dot{\theta} \end{pmatrix} \tag{1}$$

where $C_g$ is the gravity norm, $C_i$ is the inertia norm and $C_{Fr}$ is the friction norm.
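
For illustration, the frictionful pendulum ODE of Eq. (1) can be written as a single derivative function; the constants below are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

# Sketch of the pendulum ODE of Eq. (1) with friction; constants are placeholders.
C_G, C_I, C_FR = -15.0, 3.0, -0.1   # gravity, inertia, and friction norms (illustrative)

def pendulum_derivative(state, action):
    theta, theta_dot = state
    theta_ddot = C_G * np.sin(theta) + C_I * action + C_FR * theta_dot
    return np.array([theta_dot, theta_ddot])
```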
Acrobot and Acrobot Swingup: Let $s_t = (\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2)$ be the state and $a_t = (a_1, a_2)$ ($a_1 = 0$
for the Acrobot environment) the action at time $t$. The dynamics of the system, similar to (Yildiz
et al., 2021), are described as:

$$F(s_t, a_t) = \begin{pmatrix} \dot{\theta}_1 \\ \dot{\theta}_2 \\ \ddot{\theta}_1 \\ \ddot{\theta}_2 \end{pmatrix} = \begin{pmatrix} \dot{\theta}_1 \\ \dot{\theta}_2 \\ \dfrac{-(\alpha_0 + d_2 \, \ddot{\theta}_2 + \Sigma_1)}{d_1} \\[1ex] \dfrac{\alpha_1 + \frac{d_2}{d_1}\,\Sigma_1 - m_2 \, l_1 \, l_{c2} \, \dot{\theta}_1^2 \, \sin\theta_2 - \Sigma_2}{m_2 \, l_{c2}^2 + I_2 - \frac{d_2^2}{d_1}} \end{pmatrix} \tag{2}$$

where:
$\alpha_0 = a_1 - C_{fr_1} \cdot \dot{\theta}_1$, with $C_{fr_1}$ the friction norm in the first joint,
$\alpha_1 = a_2 - C_{fr_2} \cdot \dot{\theta}_2$, with $C_{fr_2}$ the friction norm in the second joint,
$m_1$ and $m_2$ the masses of the first and second links,
$l_1$ and $l_2$ the lengths of the first and second links,
$l_{c1}$ and $l_{c2}$ the positions of the center of mass of the first and second links,
$I_1$ and $I_2$ the moments of inertia of the first and second links,

and
$d_1 = m_1 \cdot l_{c1}^2 + m_2 \cdot (l_1^2 + l_{c2}^2 + 2 \cdot l_1 \cdot l_{c2} \cdot \cos(\theta_2)) + I_1 + I_2$
$d_2 = m_2 \cdot (l_{c2}^2 + l_1 \cdot l_{c2} \cdot \cos(\theta_2)) + I_2$
$\Sigma_2 = m_2 \cdot l_{c2} \cdot g \cdot \cos(\theta_1 + \theta_2 - \frac{\pi}{2})$
$\Sigma_1 = m_2 \cdot l_1 \cdot l_{c2} \cdot \dot{\theta}_2 \cdot \sin(\theta_2) \cdot (\dot{\theta}_2 - 2 \cdot \dot{\theta}_1) + (m_1 \cdot l_{c1} + m_2 \cdot l_1) \cdot g \cdot \cos(\theta_1 - \frac{\pi}{2}) + \Sigma_2$.


Cartpole and Cartpole Swingup: Let $s_t = (x, \dot{x}, \theta, \dot{\theta})$ be the state and $a_t$ the action at time $t$.
The dynamics of the system are based on (Barto et al., 1983) and described as:

$$F(s_t, a_t) = \begin{pmatrix} \dot{x} \\ \ddot{x} \\ \dot{\theta} \\ \ddot{\theta} \end{pmatrix} = \begin{pmatrix} \dot{x} \\ \Sigma - \dfrac{m_p \cdot l \cdot \ddot{\theta} \cdot \cos(\theta)}{m_{total}} \\ \dot{\theta} \\ \dfrac{g \cdot \sin(\theta) - (\cos(\theta) \cdot \Sigma) - \frac{Fr_p \cdot \dot{\theta}}{m_p \cdot l}}{l \cdot \left[\frac{4}{3} - \frac{m_p \cdot \cos(\theta)^2}{m_{total}}\right]} \end{pmatrix}, \tag{3}$$

where:
$Fr_c$ is the friction norm of the contact between the cart and the ground,
$Fr_p$ is the friction norm in the joint between the cart and the pole,
$l$ is the length of the pole,
$m_{total} = m_c + m_p$, with $m_p$, $m_c$ the masses of the pole and the cart respectively,
$\Sigma = \frac{1}{m_{total}} \cdot (a_t + m_p \cdot l \cdot \dot{\theta}^2 \cdot \sin(\theta) - Fr_c \cdot \text{sgn}(\dot{x}))$.
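
A possible implementation sketch of Eq. (3) is shown below; note that θ̈ is computed first and then reused in the ẍ equation. All physical constants are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

# Sketch of the cart-pole ODE of Eq. (3) with friction; constants are placeholders.
G, L_POLE, M_CART, M_POLE = 9.8, 0.5, 1.0, 0.1
FR_C, FR_P = 5e-4, 2e-6               # cart-ground and pole-joint friction norms (illustrative)
M_TOTAL = M_CART + M_POLE

def cartpole_derivative(state, action):
    x, x_dot, theta, theta_dot = state
    sigma = (action + M_POLE * L_POLE * theta_dot**2 * np.sin(theta)
             - FR_C * np.sign(x_dot)) / M_TOTAL
    # theta_ddot is computed first, then reused in the x_ddot equation.
    theta_ddot = (G * np.sin(theta) - np.cos(theta) * sigma
                  - FR_P * theta_dot / (M_POLE * L_POLE)) \
                 / (L_POLE * (4.0 / 3.0 - M_POLE * np.cos(theta)**2 / M_TOTAL))
    x_ddot = sigma - M_POLE * L_POLE * theta_ddot * np.cos(theta) / M_TOTAL
    return np.array([x_dot, x_ddot, theta_dot, theta_ddot])
```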

B.2 Reward Functions

The reward function encodes the desired task. We adopt the original reward functions in the three
main environments. For the swingup variants, we choose functions that describe the swingup task:
we adopt the same function as Pendulum for Pendulum swingup. For Cartpole swingup, we set a
reward function as the negative distance from the goal position sgoal = (x = 0, y = 1). For Acrobot
swingup, we take the height of the pole as a reward function.

| Environment | Reward function |
| Pendulum | $-\theta^2 - 0.1 \cdot \dot{\theta}^2 - 0.001 \cdot a^2$ |
| Pendulum swingup | $-\theta^2 - 0.1 \cdot \dot{\theta}^2 - 0.001 \cdot a^2$ |
| Cartpole | +1 for every step until termination |
| Cartpole swingup | $\exp(-\|s - s_{goal}\|_2^2)$ |
| Acrobot | -1 for every step until termination |
| Acrobot swingup | $-\cos(\theta_1) - \cos(\theta_1 + \theta_2)$ |

Table 2: Reward functions for each environment.

C Implementation details

In this section, we describe the experimental setup and the implementation details of PhIHP. We first
learn a physics-informed residual dynamics model, then learn an MFRL agent through imagination,
and use a hybrid planning strategy at inference.
To learn the model, we first use a pure exploratory policy during T timesteps to collect the initial
samples that fill Dre, then we perform stochastic gradient descent on the loss function (Eq. 3 in Sec.
4.1) to train Fθ. The learned model F̂ is used with CEM to perform planning and gather T new
samples that are added to Dre. To improve the quality of the model, the algorithm iteratively alternates
between training and planning for a fixed number of iterations.
To train the model-free component of PhIHP, the training dataset Dim is initially filled with T′
samples generated from the learned model F̂ and random actions from a pure exploratory policy.
Then, πθ and Qθ are trained on batches from Dim, which is continuously filled with samples from the
learned model F̂.
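
A compact sketch of this alternating loop is given below; `collect_random`, `train_model`, and `plan_episode` are hypothetical helpers standing in for the exploratory policy, the SGD step on Eq. 3, and CEM planning with the current model.

```python
# Sketch of the alternating model-learning / planning loop described above.
# The helper callables are hypothetical stand-ins, not the actual PhIHP code.
def learn_dynamics_model(env, model, buffer_re, collect_random, train_model,
                         plan_episode, n_iterations=10, T=500):
    buffer_re.extend(collect_random(env, T))           # exploratory seed samples
    for _ in range(n_iterations):
        train_model(model, buffer_re)                  # fit F_theta on D_re (Eq. 3)
        buffer_re.extend(plan_episode(env, model, T))  # gather T new samples with CEM
    return model
```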
We list in Tab. 3 the relevant hyperparameters of PhIHP and the baselines, and we report in Tab. 4 the
task-specific hyperparameters of PhIHP.
We adopted the original implementation and hyperparameters of TD-MPC. However, we needed
to adapt it for early termination environments (i.e. Cartpole and Acrobot) to support episodes of
variable length, and we found it beneficial for TD-MPC to set the critic learning rate at 1e-4 in
these two tasks.


For TD3, we tuned the original hyperparameters and used the same values for the TD3 baseline and the
model-free component of PhIHP.

Hyperparameter PhIHP TD-MPC TD3 CEM-oracle


Model learning
Model ODE + MLP MLP - Ground truth
Activation Relu ELU - -
MLP size 2 x 16 2 x 512 - -
Learning rate 1e-3 1e-3 - -
Policy/Value learning
Batch size 64 512 64 -
Critic size 3 x 200 2 x 512 3 x 200 -
Actor size 2 x 300 2 x 512 2 x 300 -
Activation Relu ELU Relu -
Critic learning rate 1e-4 1e-3 1e-4 -
Actor learning rate 1e-3 1e-3 1e-3 -
Soft update coefficient τ 0.05 0.01 0.05 -
Policy update frequency 2 2 2 -
Discount factor 0.99 0.99 0.99 -
Exploratory steps 10000 5000 10000 -
Replay Buffer size 1e6 1e6 1e6 -
Sampling technique Uniform PER (α = 0.6, β = 0.4) Uniform -
Planning
Planner CEM MPPI - CEM
Exploratory population size 200 512 - 700
Policy population size 20 25 - -
Elite 10 64 - 20
CEM iterations I 3 6 - 3
Update distribution mean and std. weighted mean and std. - mean and std.
Planning horizon H 4 5 - 30
Receding horizon RH 1 1 - 5

Table 3: PhIHP and baselines hyperparameters. We emphasize that we use the same hyperparam-
eters for TD3 in the baseline and the model-free component of PhIHP.

Hyperparameter Pendulum Pendulum swingup Cartpole Cartpole swingup Acrobot Acrobot swingup
Model learning
MLP size 2 x 16 2 x 16 2 x 16 2 x 16 3 x 16 3 x 16
Loss initial coefficient λ0 1e3 1e3 1e3 1e3 1e2 1e3
Loss update coefficient τph 1e3 1e3 1e5 1e5 1e5 1e5
Samples needed 2000 5000 5000 5000 5000 15000
Planning
Planning horizon H 5 5 4 6 4 3
Reward coefficient α 1.5 1.5 0.2 0.03 0.8 0.8

Table 4: Task-specific hyperparameters of PhIHP.


D Comparison to state of the art

We compare PhIHP to the baselines on individual tasks, presenting both statistical results and a
qualitative analysis.

D.1 Learning curves

We provide learning curves of PhIHP and baselines on individual tasks. PhIHP outperforms baselines
by a large margin in terms of sample efficiency. Figure 4 shows that TD3, even when converging
early in Cartpole-swingup, achieves sub-optimal performance and fails to converge within 500k steps
in Acrobot-swingup.

[Figure 4 shows learning curves (episode return vs. ×1000 steps) on Pendulum, Cartpole, Acrobot, Pendulum swingup, Cartpole swingup, and Acrobot swingup.]

Figure 4: Return of PhIHP and baselines on the gymnasium classic control tasks. Mean and std.
over 10 runs. PhIHP outperforms or matches the baselines.

D.2 Statistical Comparison: PhIHP vs. Baselines

To ensure a robust and statistically sound comparison with the results previously reported in Table
1 in Sec. 5.2, we conducted Welch’s t-test to statistically compare the performance of PhIHP vs
baselines across individual tasks. We set the significance threshold at 0.05, and calculated p-values
to determine whether observed differences in performance were statistically significant. Tab. 5 shows
that PhIHP is equivalent to all baselines in Pendulum, and it significantly outperforms TD3 on the
remaining tasks. Moreover, PhIHP outperforms TD-MPC in sparse-reward early-termination envi-
ronment tasks (Cartpole and Acrobot), while they demonstrate equivalent performance in Pendulum,
Pendulum swingup, and Acrobot swingup.
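
For reference, a minimal sketch of this test with SciPy is shown below; the return arrays are synthetic placeholders, not the actual experimental results.

```python
import numpy as np
from scipy import stats

# Sketch of the comparison protocol above: Welch's t-test (unequal variances)
# between final returns of PhIHP and a baseline on one task.
rng = np.random.default_rng(0)
phihp_returns = rng.normal(loc=-140, scale=120, size=10)     # placeholder data
baseline_returns = rng.normal(loc=-250, scale=170, size=10)  # placeholder data

t_stat, p_value = stats.ttest_ind(phihp_returns, baseline_returns, equal_var=False)
significant = p_value < 0.05   # significance threshold used in Table 5
print(f"T-statistic={t_stat:.2f}, p-value={p_value:.3g}, significant={significant}")
```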


Pendulum:
| | TD3 | TD-MPC | CEM-oracle |
| T-statistic | -1.41 | 0.52 | -1.18 |
| P-value | 0.16 | 0.61 | 0.26 |
| Significant difference | No | No | No |

Cartpole:
| | TD3 | TD-MPC | CEM-oracle |
| T-statistic | 4.66 | 5.47 | 25.75 |
| P-value | 9.92e-06 | 3.40e-07 | 9.69e-10 |
| Significant difference | Yes | Yes | Yes |

Acrobot:
| | TD3 | TD-MPC | CEM-oracle |
| T-statistic | 4.39 | 5.35 | 29.78 |
| P-value | 1.96e-05 | 2.58e-07 | 3.30e-51 |
| Significant difference | Yes | Yes | Yes |

Pendulum swingup:
| | TD3 | TD-MPC | CEM-oracle |
| T-statistic | 6.35 | 1.19 | 6.47 |
| P-value | 1.48e-09 | 0.24 | 1.15e-4 |
| Significant difference | Yes | No | Yes |

Cartpole swingup:
| | TD3 | TD-MPC | CEM-oracle |
| T-statistic | 8.41 | -7.59 | 1.65 |
| P-value | 2.70e-13 | 9.01e-12 | 0.11 |
| Significant difference | Yes | Yes | No |

Acrobot swingup:
| | TD3 | TD-MPC | CEM-oracle |
| T-statistic | 27.49 | -0.10 | 4.02 |
| P-value | 3.54e-66 | 0.92 | 1.09e-4 |
| Significant difference | Yes | No | Yes |

Table 5: Statistical comparison of PhIHP vs baselines across individual tasks: we present the
Welch's t-test results, including T-statistics and P-values, to assess the significance of performance
differences. Yes denotes a statistically significant difference (p-value < 0.05), with T-statistics > 0
indicating that PhIHP outperforms the baseline and T-statistics < 0 indicating that the baseline
performs better. No indicates no significant difference between PhIHP and the baseline (p-value > 0.05).

D.3 Imagination learning for model-free TD3

We provide learning curves of TD3 through imagination on individual tasks in Figure 5. TD3-im-ph
is a component of PhIHP: it is a TD3 agent trained on trajectories from a physics-informed model.
It largely outperforms TD3-im-dd, a TD3 agent trained on trajectories from a data-driven model. We
limited the training budget of TD3-re, trained on real trajectories, to 500k real samples in all tasks.

[Figure 5 shows learning curves (episode return vs. ×1000 steps) on Pendulum, Cartpole, Acrobot, Pendulum swingup, Cartpole swingup, and Acrobot swingup.]

Figure 5: Learning curve of TD3 on classic control tasks, mean and std. over 5 runs. TD3-re (orange
curve) is a TD3 agent trained on real trajectories; TD3-im-ph (green curve) and TD3-im-dd (red
curve) are TD3 agents trained on imaginary trajectories from a physics-informed model and a
data-driven model, respectively.


D.4 Qualitative comparison

In this section, we compare performance metrics on individual classic control tasks. We estimate
confidence intervals using the percentile bootstrap with stratified sampling (Agarwal et al., 2021).
We show in Figure 6 a comparison of the median, interquartile mean (IQM), mean performance,
and optimality gap of PhIHP and baselines. PhIHP matches or outperforms the performance of
TD-MPC and TD3 in all tasks except Cartpole swingup. PhIHP is shown to be robust to outliers
compared to TD-MPC, with shorter confidence intervals.
Moreover, Figure 7 shows the performance profiles of PhIHP and baselines. PhIHP shows better
robustness to outliers.

(a) Pendulum. PhIHP matches the performance of TD-MPC and TD3.

(b) Pendulum swingup. PhIHP outperforms TD-MPC and TD3, and is robust to outliers compared
to TD-MPC.

(c) Cartpole. PhIHP largely outperforms TD-MPC and TD3.

(d) Cartpole swingup. PhIHP outperforms TD3 and shows slightly less performance than TD-MPC.

(e) Acrobot. PhIHP largely outperforms TD3 and TD-MPC.

(f) Acrobot swingup. PhIHP outperforms TD3 and matches the performance of TD-MPC.

Figure 6: Median, interquartile mean (IQM), mean performance, and optimality gap of PhIHP and
baselines on individual classic control tasks (10 runs). Higher mean, median, and IQM performance
and lower optimality gap are better. Confidence intervals (CIs) are estimated using the percentile
bootstrap with stratified sampling (Agarwal et al., 2021).


Figure 7: Performance profiles of PhIHP and baselines on individual tasks (10 runs). Confidence
intervals are estimated using the percentile bootstrap with stratified sampling (Agarwal et al., 2021).
PhIHP shows a better robustness to outliers.


E Hyperparameter sensitivity analysis

We investigate the impact of varying the controller hyperparameters on the performance and inference
time of PhIHP. We first study the impact of varying planning horizons and receding horizons (from
1 to 8). We note that planning over longer horizons generally leads to better performance; however,
the performance slightly drops in Acrobot-swingup for planning horizons H > 4 (Figure 8). We
explain this by the effect of compounding errors on complex dynamics. Unsurprisingly, lower receding
horizons always improve the performance because the agent benefits from replanning.
Regarding the impact of the population size, Figure 8 shows that excluding the policy from planning
(policy population = 0) degrades the performance, and increasing it below 10 does not have a significant
impact. Moreover, excluding random actions (random population = 0) from planning degrades the
performance.
Unsurprisingly, the inference time increases with both the planning horizon and the
population size. Conversely, it decreases when the receding horizon increases.

[Figure 8 shows, for each task, the episode return and inference time (in milliseconds) as a function of the planning horizon, the receding horizon, the policy population, and the random population.]

Figure 8: Impact of varying planning hyperparameters on asymptotic performance and inference
time on individual tasks.


References
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, and Marc G Bellemare.
Deep reinforcement learning at the edge of the statistical precipice. Advances in Neural Informa-
tion Processing Systems, 2021.
Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that
can solve difficult learning control problems. IEEE transactions on systems, man, and cybernetics,
pp. 834–846, 1983.
Chris Xie, Sachin Patil, Teodor Moldovan, Sergey Levine, and Pieter Abbeel. Model-based rein-
forcement learning with parametrized physical models and optimism-driven exploration. In 2016
IEEE international conference on robotics and automation (ICRA), pp. 504–511. IEEE, 2016.

Cagatay Yildiz, Markus Heinonen, and Harri Lähdesmäki. Continuous-time model-based reinforce-
ment learning. In International Conference on Machine Learning, pp. 12009–12018. PMLR, 2021.
