Augmenting MPC Schemes With Active Learning: Intuitive Tuning and Guaranteed Performance
3 authors:
Frank Allgöwer
Universität Stuttgart
All content following this page was uploaded by Raffaele Soloperto on 01 April 2020.
Abstract: A framework to augment an existing model predictive control (MPC) design/implementation with active learning is proposed. Active learning is achieved by employing a user-defined learning cost function (e.g. enforcing persistence of excitation or using exploration terms from reinforcement learning), with the aim to improve model knowledge and reduce uncertainty through model adaptation. The framework is applicable to a general class of nonlinear MPC design procedures and ensures desired performance bounds for the resulting closed-loop, which can be intuitively tuned compared to the initial MPC design. The performance bounds are obtained by coupling the active learning objective with performance bounds of the primary MPC, using tools from multi-objective MPC and average constraints from economic MPC. The overall framework can be easily implemented¹ and it is intuitive to tune. The resulting computational demand typically is comparable to the original MPC scheme. We demonstrate the practicality of the proposed framework using a numerical example involving a nonlinear uncertain model, and active learning.

I. INTRODUCTION

Model predictive control (MPC) [1] is an optimization-based control method that can deal with general nonlinear systems subject to hard state and input constraints. The performance of MPC schemes depends largely on the accuracy of the prediction model. Thus, online model refinement in MPC, e.g., using machine learning approaches [2], [3], or (robust) adaptive methods [4], [5], [6], [7], is an active research topic. Passive learning, i.e. without any active system excitation, is often not sufficient to achieve satisfactory performance [8]. In this paper, we present a framework to augment an existing MPC implementation with a user-defined active learning cost, while retaining the original properties regarding closed-loop constraint satisfaction and performance.

Related works: In Reinforcement Learning approaches [9], a random excitation input is applied in order to explore unknown dynamics. Since this might lead to constraint violation, in [10] a predictive safety filter is proposed with the aim to ensure safety, without considering performance.

Optimizing for performance while ensuring safety is the goal of many applications. The ideal performance under model uncertainty is obtained using dual control [11], which is in general computationally intractable. As a result, dual MPC schemes [12], [13], [14], [15] try to approximately solve the dual control problem. However, these approaches (i) fail to provide theoretical guarantees regarding closed-loop safety and/or performance [13], [14], [15], or (ii) are limited to linear system dynamics [12], [15]. To overcome the drawbacks associated with existing dual MPC schemes, we focus on the problem of augmenting an existing MPC implementation to enhance learning/system excitation, while preserving the safety and performance guarantees of the original MPC scheme.

Contributions: We present a framework to augment an existing nonlinear MPC scheme with a learning cost function to incentivize active learning. The framework is applicable to a wide variety of nonlinear MPC designs: general economic MPC schemes [16], MPC schemes without terminal constraints [17], robust MPC schemes [18], and combinations thereof [19], [20], [21]. For this general setup, we derive closed-loop performance guarantees depending on intuitively tunable constants, which allow for a simple trade-off between potential performance relaxation and freedom to explore the system. These performance bounds are enabled using techniques from averaged constraint MPC [22], which extend existing approaches for multi-objective MPC [23], [24]. In addition, the proposed framework poses no restrictions on the predicted learning cost function, allowing for the incorporation of most existing formulations, e.g., persistence of excitation cost [5], covariance predictions using recursive least squares (RLS) [12], variance maximization/exploration in Gaussian processes (GPs) [25], and general reinforcement learning based functions [10].

To summarize, we propose a comprehensive framework to extend existing MPC implementations in order to incorporate a learning cost function. The desired performance and learning can be easily tuned with intuitive scalar constants. In addition, the computational demand of the proposed learning augmented MPC framework is typically only moderately increased compared to the baseline MPC implementation. We showcase the practicality of the proposed framework with a nonlinear uncertain system, involving reference tracking and active learning.

Outline: Section II presents the basic setup and discusses preliminaries regarding the existing MPC framework. The proposed framework is presented in Section III, including a theoretical derivation of performance bounds. Section IV demonstrates the results with a numerical example and Section V concludes the paper.

This work was supported by the German Research Foundation under Grants GRK 2198/1 - 277536708, AL 316/12-2, and MU 3929/1-2 - 279734922. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Raffaele Soloperto.
Raffaele Soloperto, Johannes Köhler, and Frank Allgöwer are with the "Institute for Systems Theory and Automatic Control", University of Stuttgart, 70550 Stuttgart, Germany. (email: raffaele.soloperto, johannes.koehler, frank.allgower@ist.uni-stuttgart.de).
¹ An exemplary implementation can be found at: https://www.ist.uni-stuttgart.de/dokumente/public/Soloperto 20.m
II. PROBLEM FORMULATION

A. Problem setup

We consider a nonlinear, time-invariant, perturbed discrete-time system

$$x_{t+1} = f_w(x_t, u_t, d_t), \tag{1}$$

with state $x \in \mathbb{R}^n$, control input $u \in \mathbb{R}^m$, disturbance $d \in D \subset \mathbb{R}^q$, time $t \in \mathbb{N}$, and perturbed (unknown) system $f_w$. We impose point-wise in time state and input constraints

$$(x_t, u_t) \in Z, \quad \forall t \ge 0, \tag{2}$$

where $Z \subseteq \mathbb{R}^{n+m}$ is a constraint set.

We consider the case where the model can be learned online to ensure an improvement in the performance and therefore to reduce conservatism. This gives rise to a time-varying nominal model $f_t(x, u)$ and some uncertainty bounds $W_t(x, u)$, satisfying the following condition for all $t \ge 0$

$$f_w(x, u, d) - f_t(x, u) \in W_t(x, u).$$

Such a condition is standard in robust MPC [26, Ass. 1] and in robust adaptive MPC using set-membership estimation [4], [5], [6], [7], and is enforced in recent non-parametric machine learning approaches [27]. To ensure constraint satisfaction despite the uncertainty $W_t$, tube-based robust MPC approaches predict sets $X_{k|t} \subset \mathbb{R}^n$ that contain the uncertain predicted trajectories. For linear models, the sets $X_{k|t}$ can often be obtained using vertex enumeration [6], while computationally more efficient approaches parametrize $X_{k|t}$ with a nominal trajectory $x_{k|t}$ and some scaled set, e.g. using fixed robust positive invariant (RPI) sets [18], [21], polytopic tubes [4], [5] or incremental Lyapunov functions [7], [10].

To reduce the conservatism of this tube propagation, a parametrized control law $\pi_{k|t} : \mathbb{R}^n \to \mathbb{R}^m$ [6], [7] is typically included in the prediction, e.g., $\pi_{k|t}(x) = Kx + u_{k|t}$ with a nominal input $u_{k|t}$ and a stabilizing feedback $K \in \mathbb{R}^{m \times n}$, compare [4], [5]. In the special case of nominal MPC (i.e., with no uncertainty), we have $f_t(x, u) := f_w(x, u, d)$, $W_t(x, u) := \{0\}$ and the predictions reduce to standard state and input trajectories $\pi_{k|t} = u_{k|t}$, $X_{k|t} = x_{k|t}$.

B. Existing MPC framework

In this paper, we consider the case where an existing MPC scheme is already implemented. Such a scheme ensures constraint satisfaction and an acceptable level of closed-loop performance, according to some user-defined criteria. In Section III, we show how this scheme can be augmented in order to incentivize learning of system (1), and thus reduce the uncertainty $W_t$.

Primary cost function: Based on the definition of the sets $X_{\cdot|t}$ and the parametrized control law $\pi_{\cdot|t}$, we define a general cost function as follows

$$E(X_{\cdot|t}, \pi_{\cdot|t}) := \sum_{k=0}^{N-1} \ell(X_{k|t}, \pi_{k|t}) + V_f(X_{N|t}), \tag{3}$$

where $N \in \mathbb{N}$ is the MPC prediction horizon, $\ell$ is the bounded stage cost, and $V_f$ is the bounded terminal cost. By $\ell_{\min} \in \mathbb{R}$ we denote the lower bound on $\ell$ satisfying

$$\ell_{\min} \le \ell(X, \pi), \tag{4}$$

for all $X$, $\pi$ satisfying $(x, \pi(x)) \in Z$ for all $x \in X$.

In most MPC implementations, the cost function (3) is simply based on some nominal trajectory $\ell(x_{k|t}, u_{k|t})$ [4], [5], [6], [7], [20]. In a robust tube framework, one can also use a worst-case cost satisfying $\ell(X, \pi) \ge \ell(x, \pi(x))$, $\forall x \in X$, which provides stronger performance guarantees [18], [21].

Existing MPC scheme: Given the nominal model $f_t(x, u)$, the description of the uncertainty $W_t$, the primary cost function $E$, and possibly a suitable terminal set $X_f \subset \mathbb{R}^n$, the resulting optimization problem is given by:

$$E_t^* := \min_{X_{\cdot|t},\, \pi_{\cdot|t}} E(X_{\cdot|t}, \pi_{\cdot|t}) \tag{5a}$$
$$\text{s.t.} \quad f_t(x_{k|t}, \pi_{k|t}(x_{k|t})) + w_{k|t} \in X_{k+1|t}, \tag{5b}$$
$$(x_{k|t}, \pi_{k|t}(x_{k|t})) \in Z, \tag{5c}$$
$$\forall w_{k|t} \in W_t(x_{k|t}, \pi_{k|t}(x_{k|t})), \quad \forall x_{k|t} \in X_{k|t}, \tag{5d}$$
$$\{x_t\} \in X_{0|t}, \quad X_{N|t} \subseteq X_f, \tag{5e}$$
$$k = 0, \dots, N-1.$$

The solution of (5) consists of the optimal cost $E_t^*$, the sets $X^*_{\cdot|t}$ and the control laws $\pi^*_{\cdot|t}$. A discussion on various tractable formulations for the sets $X_{\cdot|t}$, control laws $\pi_{\cdot|t}$ and the tube propagation (5b) can be found in [7], [26, Rk. 1]. The resulting closed-loop system is given by

$$x_{t+1} = f_w(x_t, u_t, d_t) \in X^*_{1|t}, \quad u_t = \pi^*_{0|t}(x_t). \tag{6}$$

The following assumption characterizes the fact that the existing MPC scheme (5) is properly designed, i.e., ensures recursive feasibility and provides a suitable performance bound on the primary objective $\ell$.

Assumption 1. Problem (5) is feasible for all $t \ge 0$ and the constraints (2) are satisfied for the resulting closed-loop system (6). Furthermore, there exist constants $c \ge \ell_{\min}$, $\alpha \in (0, 1]$, such that for any trajectory $(X_{\cdot|t}, \pi_{\cdot|t})$ satisfying the constraints in (5), and for any $x_{t+1} \in X_{1|t}$, there exists a candidate trajectory $(X_{\cdot|t+1}, \pi_{\cdot|t+1})$, which satisfies the constraints in (5) and satisfies the following performance bound

$$E(X_{\cdot|t+1}, \pi_{\cdot|t+1}) \le E(X_{\cdot|t}, \pi_{\cdot|t}) - \alpha \ell(X_{0|t}, \pi_{0|t}) + c. \tag{7}$$

The following remark summarizes different MPC design procedures that satisfy Assumption 1 and the corresponding constants $\alpha$, $c$.

Remark 1. Recursive feasibility and constraint satisfaction are standard properties of MPC schemes, which can be guaranteed with suitable conditions on the tube propagation ($X_{\cdot|t}$, (5b)), the model update ($f_t$, $W_t$) and the terminal set $X_f$, compare [7, Thm. 1] for general conditions. In the following, we discuss how $\alpha$ and $c$ vary according to the considered scenario.

• MPC with terminal ingredients ($V_f$, $X_f$): The standard MPC design [1] uses a (robust) positive invariant terminal set
$X_f$ combined with a control Lyapunov function $V_f$, which is used as a terminal cost. In the nominal case, such a design directly implies satisfaction of Assumption 1 with $\alpha = 1$ and

$$c := \ell(x_s, u_s) = \min_{(x,u) \in Z} \ell(x, u) \quad \text{s.t.} \quad f(x, u) = x, \tag{8}$$

compare [16]. Furthermore, in case of tracking MPC ($\ell$ positive definite), (8) holds with $c = \ell_{\min} = 0$.
In the robust case with a nominal stage cost $\ell(x_{k|t}, \pi_{k|t})$, the value of $c$ increases by a factor depending on the magnitude of the model mismatch, compare e.g. [5, Thm. 7], [26, Thm. 1]. In case a worst-case stage cost $\ell$ is used, we have $c = \ell(\Omega, \pi_f)$, with some RPI set $\Omega$, compare [18].

• MPC without terminal ingredients² ($V_f := 0$, $X_f := \mathbb{R}^n$): In (nominal) tracking MPC without terminal ingredients, inequality (7) is satisfied with some sub-optimality index $\alpha \in (0, 1]$ and $c = 0$ for a sufficiently large prediction horizon $N$ [17]. In the economic context, the bound (7) holds with $\alpha = 1$, $c = \ell(x_s, u_s) + \epsilon_N$ with some $\epsilon_N > 0$ [19]. In the robust case, similar bounds hold with a larger constant $c$, compare [20], [21].

[...]

• instability.

These issues arise mainly because finding an appropriate trade-off between the two costs is, in general, not intuitive. Tuning is therefore not only expensive in terms of time and resources, but might also generate unsafe behaviours during experiments. Note that even an arbitrarily small $\lambda$ allows the learning cost to have dominant effects when the primary cost is close to zero.

To avoid the above-mentioned shortcomings resulting from such a naive augmentation, we propose a framework for learning MPC that is easy to implement, allows for intuitive tuning, and gives guaranteed performance bounds on the primary cost.

B. Proposed MPC scheme

In the following, we introduce the proposed MPC framework, which is based on Problem (5) augmented with active learning. For the latter, we consider a general learning cost function $H_t(X_{\cdot|t}, \pi_{\cdot|t})$. Frequent choices for such a function are discussed in Section III-D. The proposed MPC scheme is defined as follows

$$\min_{X_{\cdot|t},\, \pi_{\cdot|t},\, \Delta^E_t} H_t(X_{\cdot|t}, \pi_{\cdot|t}) \tag{9a}$$
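The closing remark of Section III-A, that even an arbitrarily small $\lambda$ lets the learning cost dominate once the primary cost is close to zero, can be illustrated on a scalar toy problem. The following is only a sketch under stated assumptions: the naive augmentation is taken to be the weighted sum $E + \lambda H$, and the quadratic costs $E(u) = u^2$ and $H(u) = (u - u^L)^2$, the helper names `naive_minimizer` and `cost_split`, and the numerical values are hypothetical choices for illustration, not taken from the paper.

```python
def naive_minimizer(lam, u_learn):
    """Closed-form minimizer of the assumed naive cost u**2 + lam * (u - u_learn)**2."""
    # Setting the derivative 2*u + 2*lam*(u - u_learn) to zero gives this expression.
    return lam * u_learn / (1.0 + lam)


def cost_split(lam, u_learn):
    """Return (primary cost, weighted learning cost) evaluated at the naive minimizer."""
    u = naive_minimizer(lam, u_learn)
    primary = u ** 2                            # hypothetical primary cost E(u)
    weighted_learning = lam * (u - u_learn) ** 2  # hypothetical lam * H(u)
    return primary, weighted_learning


if __name__ == "__main__":
    for lam in (1.0, 0.1, 0.01):
        primary, learning = cost_split(lam, u_learn=3.0)
        print(f"lambda={lam}: E={primary:.6f}, lambda*H={learning:.6f}")
```

For these assumed costs, the ratio $\lambda H / E$ at the minimizer equals $1/\lambda$, so shrinking $\lambda$ makes the learning term relatively more dominant near the primary optimum rather than less, which is one way to read the tuning difficulty described above.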
$$\begin{aligned}
\hat{E}_{t+1} - \hat{E}_t &\overset{(9b)}{=} E^B_{t+1} + \Delta^E_{t+1} - \hat{E}_t \\
&\overset{(11)}{=} -E^+_{t+1} + \Delta^E_{t+1} \\
&\overset{(12)}{=} -E^+_{t+1} + Y_t + \bar{\beta} \max\{E^+_{t+1}, 0\} + \bar{\gamma} - Y_{t+1}.
\end{aligned} \tag{18}$$

If $E^+_{t+1} \ge 0$, we have

$$\hat{E}_{t+1} - \hat{E}_t \overset{(18)}{=} -(1 - \bar{\beta}) E^+_{t+1} + Y_t + \bar{\gamma} - Y_{t+1}. \tag{19}$$

Similarly, if $E^+_{t+1} < 0$, the following holds

$$\hat{E}_{t+1} - \hat{E}_t \overset{(18)}{=} -E^+_{t+1} + Y_t + \bar{\gamma} - Y_{t+1}. \tag{20}$$

Let us now define the set $\tau$ as follows

$$\tau(T) := \{t \in \mathbb{N}_{[0,T]} \mid E^+_t < 0\}, \tag{21}$$

where $\mathbb{N}_{[0,T]}$ is the set of all the integers between $0$ and $T$. Combining both cases, we have for all $T \ge 0$

$$\begin{aligned}
\hat{E}_T - \hat{E}_0 + Y_T - Y_0 &\overset{(19),(20),(21)}{=} -\sum_{t=0}^{T-1} (1 - \bar{\beta}) E^+_{t+1} + T\bar{\gamma} - \sum_{t \in \tau(T)} \bar{\beta} E^+_{t+1} \\
&\overset{(4),(11)}{\le} -\sum_{t=0}^{T-1} (1 - \bar{\beta}) E^+_{t+1} + T\bar{\gamma} + T\bar{\beta} c - T\bar{\beta}\alpha\ell_{\min} \\
&\overset{(11)}{\le} -\sum_{t=0}^{T-1} (1 - \bar{\beta}) \alpha \ell(\hat{X}_{0|t}, \hat{\pi}_{0|t}) + T(\bar{\gamma} + c - \bar{\beta}\alpha\ell_{\min}).
\end{aligned}$$

Similarly, condition (14) can be shown using an analogous reasoning to the proof of (13), by using (9d) in (18) instead of (12) to bound $\Delta^E_t$. Satisfaction of (16) can be proved based on (13) and on the non-negativity of $c$, $\ell$, and $Y_T$.

D. Learning formulations

A standard approach in learning MPC schemes is to augment the primary cost function with a term that incentivizes learning [5], [12]. Any learning cost $H_t$ can be directly incorporated in the proposed framework, with the main difference that we guarantee suitable performance bounds. The different learning cost functions $H_t$ typically depend on the considered model description ($f_t$, $W_t$) and the corresponding update rule.

In particular, if the model ($f_t$, $W_t$) is obtained through a Gaussian Process (GP), then similar to [25], the learning cost $H_t$ can be chosen as $\sum_{k=0}^{N-1} -\Sigma_t(X_{k|t}, \pi_{k|t})$ to seek operations with maximum uncertainty/covariance $\Sigma_t$.

In case of nonlinear models linear in the uncertain parameters, i.e., a term $G(x, u)\theta$ appearing in the model, a Recursive Least Squares (RLS) filter can be used to obtain a Gaussian [...] as an additional constraint in Problem (9), which allows for convex relaxations in the linear polytopic case, compare [5].

Similar to [10], the learning cost can also be defined based on some external exciting input $u^L_t$ (e.g. resulting from a Reinforcement Learning algorithm) using $H_t := \|u_t - u^L_t\|^2$. Compared to a predictive safety filter [10], the proposed approach also provides suitable bounds on the transient and average performance, which is crucial in many applications.

Moreover, one can define $H_t := E(\tilde{x}_{\cdot|t}, \pi_{\cdot|t})$, where $\tilde{x}$ is based on a different online adapted/learned model $\hat{f}_t$, e.g. using least-mean square parameter estimation [4], [7] or policy gradients from reinforcement learning [3].

IV. NUMERICAL EXAMPLE

In the following example, we illustrate how the tuning variables $\bar{\beta}$ and $\bar{\gamma}$ can be used to intuitively tune the primary performance and active learning. We consider a mass-spring-damper system subject to additive and parametric uncertainty

$$m\ddot{x}_1 = -k_0 \exp(-x_1) x_1 - d\dot{x}_1 + u + w, \tag{22}$$

with mass $m = 1$, uncertain damping constant $d = 1$, spring constant $k_0 = 0.33$, disturbance $w$, $|w| \le 0.01$, and constraint set $Z = [-0.1, 1.1] \times [-5, 5] \times [-5, 5]$. The model (22) is discretized with an Euler discretization with a sampling time of $T_s = 0.1\,$s. At time $t = 0$, we start with a nominal $\hat{d}_0 = 1.1$, which is then learned through a Least Mean Square (LMS) algorithm. We make use of the robust MPC approach proposed in [26], and we employ a standard quadratic tracking stage cost, with weight matrices $Q = \mathrm{diag}(10, 1)$ and $R = 1$. The variable $E^B_t$ is defined according to case 3), i.e., $E^B_t = E^*_t$. The goal is to induce excitation in system (22) while steering it toward the desired reference $x^{\mathrm{ref}}_t$, $u^{\mathrm{ref}}_t$. We consider the learning cost $H_t = \sum_{k=0}^{N-1} \|u_{k|t} - u^L_{t+k}\|^2$, where $u^L_t = u^{\mathrm{ref}}_t + 3\sin(0.2t)$ is an external probing signal that excites system (22), similar to [10]. In Fig. 1, we show four different scenarios. In particular, in all four scenarios, we set $Y_0 = \beta^{\max} = 0$ and $\gamma^{\max} = \infty$, so that we focus on how the variables $\bar{\beta}$ and $\bar{\gamma}$ influence the closed-loop behaviour.

First, we consider the two extreme cases, i.e., the case of pure tracking (passive learning) and of pure (active) learning. In the case of pure tracking ($\bar{\beta} = 0$, $\bar{\gamma} = 0$), the solution of Problem (9) is equivalent to the one from Problem (5). In the case of pure learning ($\bar{\beta} = 0$, $\bar{\gamma} = \infty$), the proposed MPC scheme reduces to the safety filter shown in [10], i.e., maximal learning while guaranteeing constraint satisfaction. In this case, we see that the external input $u^L_t$ generates high
oscillations in closed-loop, leading to a high tracking error. Such a behaviour can be considered as safe in the sense that the constraints are satisfied, but due to the large tracking error it may not be desirable in practical applications.

[Figure 1]
Fig. 1: Comparison of closed-loop trajectories. The value $e_{20}$ indicates the accuracy of the estimated $\hat{d}_{20}$ at time 20 s, and is defined as $e_t := |\hat{d}_t - d| / d$. Note that $e_0 = 10\%$.

In order to generate the desired trade-off between learning and tracking, we consider the case where $\bar{\beta} = 0.8$ (80%) and $\bar{\gamma} = 0$, where we see that the system shows some probing only during the transient phase, and then it converges to the desired reference point, as discussed in Section III-B. In the case where $\bar{\beta} = 0$ and $\bar{\gamma} = 5.0$, we see that the controller induces a constantly bounded amount of active probing, which is mainly visible when the system is close to the steady-state. This implies a negligible relaxation on the convergence rate, and a visible probing when the system is close to the desired steady state. To conclude, the results shown in Fig. 1 confirm the intuitive meaning of the parameters $\bar{\beta}$ and $\bar{\gamma}$, how they influence the closed-loop and therefore the accuracy of the estimated parameter $\hat{d}$.

V. CONCLUSION

We proposed a framework for enhancing an existing MPC design with active learning to improve model quality and thus performance. We proved suitable transient and average performance bounds on the resulting MPC scheme with intuitively tunable constants. The practicality of the proposed approach is demonstrated using an example involving tracking and active learning. Designing active learning cost functions that guarantee closed-loop reduction of uncertainty is part of future work.

REFERENCES

[1] J. B. Rawlings and D. Q. Mayne, Model predictive control: Theory and design. Nob Hill Pub., 2009.
[2] L. Hewing, J. Kabzan, and M. N. Zeilinger, "Cautious model predictive control using Gaussian process regression," IEEE Transactions on Control Systems Technology, 2019.
[3] M. Zanon and S. Gros, "Safe reinforcement learning using robust MPC," arXiv preprint arXiv:1906.04005, 2019.
[4] M. Lorenzen, M. Cannon, and F. Allgöwer, "Robust MPC with recursive model update," Automatica, vol. 103, pp. 461–471, 2019.
[5] X. Lu, M. Cannon, and D. Koksal-Rivet, "Robust adaptive model predictive control: Performance and parameter estimation," arXiv preprint arXiv:1911.00865, 2019.
[6] M. Bujarbaruah, X. Zhang, M. Tanaskovic, and F. Borrelli, "Adaptive MPC under time varying uncertainty: Robust and stochastic," arXiv preprint arXiv:1909.13473, 2019.
[7] J. Köhler, P. Kötting, R. Soloperto, F. Allgöwer, and M. A. Müller, "A robust adaptive model predictive control framework for nonlinear uncertain systems," arXiv preprint arXiv:1911.02899, 2019.
[8] A. Mesbah, "Stochastic model predictive control with active uncertainty learning: A survey on dual control," Annual Reviews in Control, vol. 45, pp. 107–117, 2018.
[9] B. Recht, "A tour of reinforcement learning: The view from continuous control," Annual Review of Control, Robotics, and Autonomous Systems, vol. 2, pp. 253–279, 2019.
[10] K. P. Wabersich and M. N. Zeilinger, "Safe exploration of nonlinear dynamical systems: A predictive safety filter for reinforcement learning," arXiv preprint arXiv:1812.05506, 2018.
[11] A. A. Feldbaum, "Dual control theory. I," Automation and Remote Control, vol. 21, no. 9, pp. 1240–1249, 1960.
[12] R. Soloperto, J. Köhler, M. A. Müller, and F. Allgöwer, "Dual adaptive MPC for output tracking of linear systems," in Proc. 58th IEEE Conf. Decision and Control (CDC), 2019, pp. 1377–1382.
[13] S. Thangavel, S. Lucia, R. Paulen, and S. Engell, "Dual robust nonlinear model predictive control: A multi-stage approach," Journal of Process Control, vol. 72, pp. 39–51, 2018.
[14] E. Arcari, L. Hewing, and M. N. Zeilinger, "An approximate dynamic programming approach for dual stochastic model predictive control," arXiv preprint arXiv:1911.03728, 2019.
[15] A. Iannelli, M. Khosravi, and R. S. Smith, "Structured exploration in the finite horizon linear quadratic dual control problem," arXiv preprint arXiv:1910.14492, 2019.
[16] D. Angeli, R. Amrit, and J. B. Rawlings, "On average performance and stability of economic model predictive control," IEEE Transactions on Automatic Control, vol. 57, pp. 1615–1626, 2012.
[17] A. Boccia, L. Grüne, and K. Worthmann, "Stability and feasibility of state constrained MPC without stabilizing terminal constraints," Systems & Control Letters, vol. 72, pp. 14–21, 2014.
[18] F. A. Bayer, M. A. Müller, and F. Allgöwer, "Min-max economic model predictive control approaches with guaranteed performance," in Proc. 55th IEEE Conf. Decision and Control (CDC), 2016, pp. 3210–3215.
[19] L. Grüne and M. Stieler, "Asymptotic stability and transient optimality of economic MPC without terminal conditions," Journal of Process Control, vol. 24, no. 8, pp. 1187–1196, 2014.
[20] J. Köhler, M. A. Müller, and F. Allgöwer, "A novel constraint tightening approach for nonlinear robust model predictive control," in Proc. American Control Conf. (ACC). IEEE, 2018, pp. 728–734.
[21] L. Schwenkel, J. Köhler, M. A. Müller, and F. Allgöwer, "Robust economic model predictive control without terminal conditions," in Proc. 21st IFAC World Congress, 2020, accepted.
[22] M. A. Müller, D. Angeli, F. Allgöwer, R. Amrit, and J. B. Rawlings, "Convergence in economic model predictive control with average constraints," Automatica, vol. 50, pp. 3100–3111, 2014.
[23] D. He, L. Wang, and J. Sun, "On stability of multiobjective NMPC with objective prioritization," Automatica, vol. 57, pp. 189–198, 2015.
[24] L. Grüne and M. Stieler, "Multiobjective model predictive control for stabilizing cost criteria," Discrete & Continuous Dynamical Systems-B, pp. 2823–2830, 2019.
[25] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, "Gaussian process optimization in the bandit setting: No regret and experimental design," arXiv preprint arXiv:0912.3995, 2009.
[26] J. Köhler, R. Soloperto, M. A. Müller, and F. Allgöwer, "A computationally efficient robust model predictive control framework for uncertain nonlinear systems," IEEE Transactions on Automatic Control, 2020, to appear.
[27] E. Maddalena and C. Jones, "Learning non-parametric models with guarantees: A smooth Lipschitz interpolation approach," p. 6, 2019. [Online]. Available: http://infoscience.epfl.ch/record/265764
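The simulation setup of Section IV can be sketched in a few lines. The snippet below is a hedged illustration, not the authors' implementation: it uses the parameters of (22) and the Euler discretization with $T_s = 0.1\,$s, but it replaces the MPC by the open-loop probing input $3\sin(0.2t)$ with $u^{\mathrm{ref}} = 0$ (so the constraint set $Z$ is not enforced), and it assumes a normalized LMS update. The gains `mu` and `eps`, the random seed, and all function names are hypothetical.

```python
import math
import random

# Parameters taken from the example around (22).
M, K0, D_TRUE, TS = 1.0, 0.33, 1.0, 0.1
W_MAX = 0.01     # disturbance bound |w| <= 0.01
D_HAT_0 = 1.1    # initial estimate of the damping constant


def euler_step(x1, x2, u, w, d=D_TRUE):
    """One explicit Euler step of m*x1'' = -k0*exp(-x1)*x1 - d*x1' + u + w."""
    acc = (-K0 * math.exp(-x1) * x1 - d * x2 + u + w) / M
    return x1 + TS * x2, x2 + TS * acc


def nlms_update(d_hat, x1, x2, u, x2_next, mu=0.5, eps=1e-2):
    """Normalized LMS step for d (mu and eps are assumed tuning values)."""
    phi = -TS * x2 / M                                     # regressor multiplying d
    known = x2 + TS * (-K0 * math.exp(-x1) * x1 + u) / M   # part independent of d
    residual = x2_next - known - phi * d_hat               # one-step prediction error
    return d_hat + mu * phi * residual / (eps + phi * phi)


def simulate(steps=200, seed=0):
    """Return the relative errors e_t = |d_hat_t - d| / d for t = 0, ..., steps."""
    rng = random.Random(seed)
    x1, x2, d_hat = 0.0, 0.0, D_HAT_0
    errors = []
    for t in range(steps):
        errors.append(abs(d_hat - D_TRUE) / D_TRUE)
        u = 3.0 * math.sin(0.2 * t)        # probing signal, assuming u_ref = 0
        w = rng.uniform(-W_MAX, W_MAX)     # bounded additive disturbance
        x1_next, x2_next = euler_step(x1, x2, u, w)
        d_hat = nlms_update(d_hat, x1, x2, u, x2_next)
        x1, x2 = x1_next, x2_next
    errors.append(abs(d_hat - D_TRUE) / D_TRUE)
    return errors


if __name__ == "__main__":
    errs = simulate()
    print(f"e_0 = {errs[0]:.3f}, e_final = {errs[-1]:.4f}")
```

The default of 200 steps corresponds to the 20 s horizon at which $e_{20}$ is reported in Fig. 1. With these assumed gains, the relative error starts at 10% and is driven down by the excitation; with $u = 0$ and the system at rest, the regressor stays close to zero and the estimate barely changes, which is the motivation for active probing.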