L4DC PolicyOptTutorial2023


Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies

Maryam Fazel, Bin Hu, Kaiqing Zhang

Joint with Tamer Başar, Na Li, Mehran Mesbahi, Yang Zheng

L4DC Tutorial, Philadelphia, PA


June 14, 2023

Motivation

Data-guided decision-making for complex tasks in dynamical systems, e.g., game playing, robotics, networked systems, ...

Many recent successes via Reinforcement Learning

Motivation : Policy Optimization

▶ A workhorse of (deep) RL : (direct) policy optimization methods


▶ Robotic manipulation, locomotion, video games, ChatGPT, etc.

Policy Optimization – One Workhorse of (Deep) RL

Why is policy optimization popular ?


▶ Easy-to-implement & scalable to high-dimensional problems
▶ Enable model-free search for complex dynamics (e.g. with rich
contact) or rich observations (e.g. images)
▶ This tutorial : Does policy optimization have guarantees on
linear control benchmarks (e.g. LQR, LQG, H∞ control, etc) ?

What is Policy Optimization (PO) ?
▶ PO is an old idea from control : Fix the controller structure, and
optimize a control metric over the parameters of the controllers
min_K J(K)
▶ Parametrized controller/policy K
▶ Cost function J (tracking errors, closed-loop H2 /H∞ norms, etc)

▶ Policy gradient method : K ′ = K − α∇J (K )


▶ Example : Optimization-based PID Tuning K = [Kp , Ki , Kd ]⊤ ∈ R3

Credit : Astrom & Murray, 2020


History : Convex LMIs vs. PO

Key points :
▶ In the 1980s, convex optimization methods became dominant due to
strong global guarantees and efficient interior point methods
▶ PO problem formulation is generally not convex
▶ Reparameterize as convex optimization problems (one does not
optimize the controller parameters directly) ; Lyapunov theory,
stability/performance certificates, HJB, ...
▶ Examples of LMIs : state-feedback or full-order output-feedback
H2 / H∞ control
▶ e.g., Boyd et al., “Linear Matrix Inequalities in System and
Control Theory”, 1994, SIAM

History : Convex LMIs vs. PO

Historically, PO has been used for control problems that can’t be convexified, often with no theory
▶ Sometimes the plant order is unknown : PID tuning, feedback
iterative tuning, etc
▶ Static output feedback LQ control
▶ Fixed-order structured H∞ synthesis: HIFOO and Hinfstruct [Apkarian and Noll, ’06]
▶ Distributed control design : Martensson/Rantzer (’09)

In recent years, there is a new reason to revisit PO for classical control: it helps provide theory towards understanding model-free RL

A Modern Perspective from Deep RL
A common practice nowadays in deep RL for robotic control : visuomotor
policy learning/image-to-torque [Levine et al., ’16]
▶ A type of perception-based control : purely model-free
▶ Train perception and control systems jointly end-to-end

Advantages :
▶ Direct and relatively simple to implement
▶ Mitigates the compounding error of pipelines (as in model-based RL) that separately train perception and control
▶ Make better use of deep NNs’ abstraction and perception capabilities to
handle high-dimensional visual signals
Policy Optimization : Old & New
Vanilla policy gradient :
▶ Policy Gradient Theorem [Sutton et al., ’99]
∇J(K) = E[ Q_K(x, u) · ∇ log π_K(u | x) ]

▶ REINFORCE estimator [Williams ’92]: from N trajectories of length T, (x_{t,i}, u_{t,i}, c_{t,i}), i ∈ [N], t ∈ [T],

∇J(K) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} ( Σ_{τ=t}^{T} c_{τ,i} ) · ∇ log π_K(u_{t,i} | x_{t,i}),

where Σ_{τ=t}^{T} c_{τ,i} is the accumulated cost and ∇ log π_K(u_{t,i} | x_{t,i}) is the score function (a minimal numerical sketch follows after these bullets)

▶ Other estimators: G(PO)MDP [Baxter & Bartlett, ’01], actor-critic [Konda &
Tsitsiklis, ’99], natural policy gradient [Kakade ’01] (will come back to it !)
▶ Essentially stochastic gradient descent (SGD) (heart of modern machine learning) !
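To make the estimator concrete, here is a minimal sketch (not from the tutorial) of the REINFORCE estimate for a Gaussian policy u ~ N(−Kx, σ²I) on a generic linear system; the matrices, policy class, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the REINFORCE estimator above, assuming a Gaussian policy
# u ~ N(-K x, sigma^2 I) on an illustrative linear system.
import numpy as np

def reinforce_gradient(A, B, Q, R, K, sigma=0.1, N=100, T=50, rng=None):
    rng = np.random.default_rng(rng)
    n, m = B.shape
    grad = np.zeros_like(K)
    for _ in range(N):
        x = rng.normal(size=n)
        xs, us, cs = [], [], []
        for _ in range(T):
            u = -K @ x + sigma * rng.normal(size=m)      # exploratory action
            cs.append(x @ Q @ x + u @ R @ u)             # stage cost c_t
            xs.append(x.copy()); us.append(u.copy())
            x = A @ x + B @ u
        cs = np.array(cs)
        for t in range(T):
            cost_to_go = cs[t:].sum()                    # sum_{tau >= t} c_tau
            # score function of the Gaussian policy w.r.t. K:
            # grad_K log pi_K(u|x) = -(u + K x) x^T / sigma^2
            score = -np.outer(us[t] + K @ xs[t], xs[t]) / sigma**2
            grad += cost_to_go * score
    return grad / N

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
g_hat = reinforce_gradient(A, B, Q, R, K=np.array([[0.2, 0.3]]))
print(g_hat)
```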

Modern variants (benefit from the advances of optimization theory) :


▶ Deep deterministic PG (DDPG) [Silver et al., ’14], Trust-region PO (TRPO)
[Schulman et al., ’15], Proximal PO (PPO) [Schulman et al., ’17], soft actor-critic
(SAC) [Haarnoja et al., ’18], variance-reduced PG [Papini et al., ’18]...
▶ Default algorithm in OpenAI Gym, Dota 5v5, ChatGPT training – PPO

Missing Perspectives in Deep RL Literature

▶ Convergence guarantees: Nonconvex optimization in policy parameter spaces, e.g., weights of neural networks
▶ Sample efficiency guarantees : How many samples are needed ?
Polynomial in problem parameters ?
▶ Constraints : Stability and robustness of the closed-loop systems

Missing Perspectives in Classic Control Literature

[Figure: GD on a convex landscape vs. GD on a nonconvex landscape]

▶ Landscape : Is convexity really needed for optimization ?


▶ Finite-iteration/sample complexity : If an algorithm converges,
how fast and how many samples are needed ?

Tutorial Overview

Tutorial Overview : RL/Control → PO

This tutorial: Understanding policy optimization by examining guarantees on linear control benchmarks
▶ Start from simpler contexts and gain insights
▶ Classical control benchmarks (c.f. [Recht et al., ’17])
▶ Identify issues for establishing guarantees of PO for control
▶ Employ modern optimization perspective : iteration/sample
complexity, first-order & zeroth-order oracle models, etc

Big picture :
▶ One perspective to bridge control theory and RL
▶ Understand and connect “model-free” & “model-based” views
▶ Towards a general framework for learning-based control

Schedule

▶ Now-2:30pm: Preview and Some Optimization Background
▶ 2:30-3:00pm: PO Theory for LQR
▶ 3:00-3:30pm: PO Theory for Risk-sensitive & H2/H∞ Robust Control
▶ 3:30-4:00pm: Coffee Break
▶ 4:00-4:30pm: PO Theory for State-feedback H∞ Synthesis
▶ 4:30-5:00pm: PO Theory for LQG
▶ 5:00-5:15pm: Role of convex parameterization
▶ 5:15-5:30pm: Future work and Q&A/discussions

Preview : Big Picture
▶ Revisit linear control problems as benchmarks for PO

min_K J(K), s.t. K ∈ K

▶ Parametrized policy K (e.g. linear mapping, neural networks)


▶ Cost function J (tracking errors, closed-loop H2 /H∞ norms, etc)
▶ Constraint set K (stability, robustness, safety, etc)

▶ Policy gradient : K ′ = K − α∇J(K )


▶ The gradient ∇J can be estimated from data in a model-free manner
(policy gradient theorem or stochastic finite difference)
▶ For nonsmooth problems, replace the gradient with some subgradient

▶ Recent progress on PO theory (Nonconvexity is the key issue)


▶ Landscape : Is stationary global minimum ?
▶ Feasibility : Does the policy search stay in the feasible set K ?
▶ Global convergence & sample complexity
B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, T. Başar. Toward a theoretical
foundation of policy optimization for learning control policies, Annual Review
of Control, Robotics, and Autonomous Systems, 2023.
Preview : Linear Quadratic Regulator as PO

▶ Linear quadratic regulator (LQR) as PO: Consider x_{t+1} = A x_t + B u_t + w_t

min_K J(K) := lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} (x_t⊤ Q x_t + u_t⊤ R u_t) ],  s.t. K is stabilizing
▶ ut = −Kxt for gain matrix K
▶ K = {K : ρ(A − BK ) < 1} ; K a nonconvex constraint set
▶ PO theory for LQR
▶ Landscape : Feasible set is connected, and stationary is global
▶ Feasibility : The LQR cost is coercive and serves as a barrier on K
▶ Global convergence & sample complexity : Linear rate and finite
sample complexity via the gradient dominance/smoothness property
▶ Main Ref :
M. Fazel, R. Ge, S. Kakade, M. Mesbahi. Global convergence of policy
gradient methods for the linear quadratic regulator, ICML 2018.

Preview : Mixed H2 /H∞ Control as PO

▶ Mixed design: H∞ constraints are crucial for robustness

min_K J(K),  s.t. K is stabilizing and robust in the H∞ sense

▶ J(K ) is an upper bound on the H2 performance


▶ ut = −Kxt for gain matrix K
▶ K = {K : ρ(A − BK) < 1; ∥T(K)∥∞ < γ}; adds a robustness constraint
▶ γ → ∞ reduces to LQR
▶ PO theory for mixed design
▶ Key issue : The cost is not coercive ! How to maintain feasibility ?
▶ Fix : Implicit regularization via Natural policy gradient (NPG) and
Gauss-Newton
▶ Global sublinear convergence for NPG and Gauss-Newton
▶ Main Ref :
K. Zhang, B. Hu, T. Başar. Policy optimization for H2 linear control with
H∞ robustness guarantee : Implicit regularization and global convergence,
SIAM Journal on Control and Optimization (SICON), 2021.

Preview : H∞ State-Feedback Synthesis as PO

▶ H∞ state-feedback synthesis: x_{t+1} = A x_t + B u_t + w_t with x_0 = 0

min_K J(K),  s.t. K is stabilizing

▶ J = Σ_{t=0}^{∞} (x_t⊤ Q x_t + u_t⊤ R u_t) subject to the worst-case disturbance satisfying Σ_{t=0}^{∞} ∥w_t∥² ≤ 1
▶ u_t = −K x_t for gain matrix K
▶ K = {K : ρ(A − BK) < 1}
▶ J(K ) is the closed-loop H∞ norm (nonsmooth in K !)
▶ PO theory for H∞ state-feedback synthesis (Nonconvex and nonsmooth)
▶ Key issue : The cost may not be differentiable at important points
▶ Fix : Show that Clarke stationary points are global, and apply
Goldstein’s subgradient method to guarantee sufficient descent
▶ Global convergence : Goldstein’s subgradient method achieves global
convergence provably
▶ Main Ref :
X. Guo and B. Hu. Global convergence of direct policy search for
state-feedback H∞ robust control : A revisit of nonsmooth synthesis with
Goldstein subdifferential, NeurIPS 2022.

Preview : Linear Quadratic Gaussian as PO

▶ Linear quadratic Gaussian (LQG) is the partially observable variant of LQR, and can be treated as PO (more details later)

▶ PO theory for LQG


▶ Issue 1 : Feasible set may not be connected
▶ Issue 2 : Stationary may not be global
▶ Today’s talk : Some positive results and many open questions

▶ Main Ref :
Y. Zheng, Y. Tang, N. Li. Analysis of the optimization landscape of linear
quadratic Gaussian (LQG) control, Mathematical Programming, 2023.

Background : Optimization Theory

Optimization of Smooth Nonconvex Functions

Definition: A function J(K) is L-smooth if the following inequality holds for all (K, K′):

J(K′) ≤ J(K) + ⟨∇J(K), K′ − K⟩ + (L/2)∥K′ − K∥²_F.

The above definition is equivalent to ∇J being L-Lipschitz.

Complexity: The gradient descent method K^{n+1} = K^n − α∇J(K^n) is guaranteed to find an ϵ-stationary point of J within O(1/ϵ) steps:

J(K^{n+1}) ≤ J(K^n) + ⟨∇J(K^n), K^{n+1} − K^n⟩ + (L/2)∥K^{n+1} − K^n∥²_F
          = J(K^n) + (−α + Lα²/2)∥∇J(K^n)∥²_F.

Summing the above inequality from n = 0 to T:

(α − Lα²/2) Σ_{n=0}^{T} ∥∇J(K^n)∥²_F ≤ J(K^0) − J(K^{T+1})

Optimization of Smooth Nonconvex Functions
Complexity: The gradient descent method K^{n+1} = K^n − α∇J(K^n) is guaranteed to find an ϵ-stationary point of J within O(1/ϵ) steps:

(α − Lα²/2) Σ_{n=0}^{T} ∥∇J(K^n)∥²_F ≤ J(K^0) − J(K^{T+1})

If α < 2/L, then C = α − Lα²/2 > 0. We know J(K^{n+1}) ≥ J* for some J*. Hence

Σ_{n=0}^{T} ∥∇J(K^n)∥²_F ≤ (J(K^0) − J*)/C

=⇒ min_{0≤n≤T} ∥∇J(K^n)∥²_F ≤ (1/(T+1)) Σ_{n=0}^{T} ∥∇J(K^n)∥²_F ≤ (J(K^0) − J*)/(C(T+1)).

To find a point whose squared gradient norm is smaller than or equal to ϵ, we need to run T steps with

T = (J(K^0) − J*)/(Cϵ) − 1 = O(1/ϵ),

which is the complexity for finding an ϵ-approximate stationary point. (A minimal numerical sketch of this bound is given below.)
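A minimal numerical sketch of the argument above, run on a toy smooth nonconvex function (not a control problem); the test function, stepsize, and horizon are illustrative assumptions.

```python
# Check the O(1/eps) bound min_n ||grad J(K^n)||^2 <= (J(K^0) - J*)/(C (T+1))
# on a toy smooth nonconvex function.
import numpy as np

def J(K):
    return np.sum(0.5 * K**2 + np.cos(2 * K))

def grad_J(K):
    return K - 2 * np.sin(2 * K)

L = 5.0                       # |d^2/dk^2 (k^2/2 + cos 2k)| = |1 - 4 cos 2k| <= 5
alpha = 1.0 / L
C = alpha - L * alpha**2 / 2  # > 0 since alpha < 2/L

K = np.array([2.0, -1.5, 0.7])
J0, grad_norms = J(K), []
for _ in range(200):
    g = grad_J(K)
    grad_norms.append(np.sum(g**2))
    K = K - alpha * g

T = len(grad_norms) - 1
J_lower = -K.size             # k^2/2 + cos(2k) >= -1 per coordinate, a valid J* lower bound
print("min_n ||grad J(K^n)||^2      :", min(grad_norms))
print("(J(K^0) - J*) / (C (T+1))    :", (J0 - J_lower) / (C * (T + 1)))
```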
Optimization of Smooth Nonconvex Functions
Complexity: The gradient descent method K^{n+1} = K^n − α∇J(K^n) is guaranteed to find an ϵ-stationary point of J within O(1/ϵ) steps.

Convergence: The gradient descent method is guaranteed to converge to a stationary point eventually.

Question: What if we can show stationary is global?

Answer: Then the gradient descent method converges to a global minimum! We have J(K^n) → J*!

Take-away: Nonconvex optimization may not be that terrifying if stationary is global!
Gradient Dominance and Linear Rate to Global Minimum
Definition: A function J(K) is gradient dominant (with parameter µ > 0) if it is continuously differentiable and satisfies

J(K) − J(K*) ≤ (1/(2µ)) ∥∇J(K)∥²_F,  ∀K ∈ K.

Landscape : Stationary is global !
Complexity: The gradient descent method K^{n+1} = K^n − α∇J(K^n) is guaranteed to find an ϵ-optimal point of J within O(log(1/ϵ)) steps:

J(K^{n+1}) ≤ J(K^n) + (−α + Lα²/2)∥∇J(K^n)∥²_F
          ≤ J(K^n) − 2µ(α − Lα²/2)(J(K^n) − J*)
=⇒ J(K^{n+1}) − J* ≤ (1 − 2µα + µLα²)(J(K^n) − J*)
=⇒ J(K^T) − J* ≤ (1 − 2µα + µLα²)^T (J(K^0) − J*)

Running T steps with T = O(log(1/ϵ)) guarantees J(K^T) − J* ≤ ϵ. (A quick numerical check of this contraction factor is given below.)
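A quick numerical check of the contraction factor above on a quadratic objective, which is both L-smooth and gradient dominant; the Hessian and stepsize are illustrative assumptions.

```python
# Verify J(K^{n+1}) - J* <= (1 - 2 mu alpha + mu L alpha^2)(J(K^n) - J*) on a
# quadratic J(K) = 0.5 K^T H K, where mu = lambda_min(H), L = lambda_max(H), J* = 0.
import numpy as np

H = np.diag([1.0, 4.0, 10.0])                 # toy Hessian
mu, L = 1.0, 10.0
alpha = 1.0 / L
rho = 1 - 2 * mu * alpha + mu * L * alpha**2  # contraction factor from the slide (= 0.9 here)

K = np.array([1.0, -2.0, 3.0])
gap = 0.5 * K @ H @ K
for n in range(50):
    K = K - alpha * (H @ K)
    new_gap = 0.5 * K @ H @ K
    assert new_gap <= rho * gap + 1e-12       # linear (geometric) decrease of the gap
    gap = new_gap
print("final optimality gap:", gap)
```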

Coercive Functions and Compact Sublevel Sets
What if there are constraints ? If the cost is coercive, then it is a
barrier function by itself !

Definition: A function J(K) is coercive on K if for any sequence {K^l}_{l=1}^∞ ⊂ K we have J(K^l) → +∞ if either ∥K^l∥₂ → +∞, or K^l converges to an element on the boundary ∂K.
A Useful Result for Constrained Optimization

If J is coercive and twice continuously differentiable on K, we have


▶ The sublevel set Kγ := {K ∈ K : J (K ) ≤ γ} is compact.
▶ The function J (K ) is L-smooth on Kγ , and the constant L
depends on γ and the problem parameters.
▶ Suppose we run the GD method K^{n+1} = K^n − α∇J(K^n) initialized from K^0 ∈ K. Let L be the smoothness parameter for the sublevel set K_{J(K^0)}. Then GD finds an ϵ-approximate stationary point within O(1/ϵ) steps with α = 1/L.
▶ If J is gradient dominant with parameter µ, then GD achieves a linear convergence rate: J(K^T) − J* ≤ (1 − 2µα + µLα²)^T (J(K^0) − J*).

PO Theory for LQR

Linear quadratic theory – background

Standard LQR problem (discrete-time, infinite horizon): linear dynamics

x_{t+1} = A x_t + B u_t

with given initial state x_0, choose a control sequence u_0, u_1, . . . , u_t, . . . in order to minimize the total cost

Σ_{t=0}^{∞} (x_t⊤ Q x_t + u_t⊤ R u_t)

with given cost matrices Q, R ≻ 0.

Linear quadratic theory

Classical solution via dynamic programming (when A, B are known and (A, B) is stabilizable): solve the algebraic Riccati equation (for P)

P = Q + A⊤PA − (A⊤PB)(R + B⊤PB)⁻¹(B⊤PA)

then let

u_t = −K* x_t = −(R + B⊤PB)⁻¹(B⊤PA) x_t

▶ a “go-to” model-based control design (since Kalman in the 60’s)
▶ extensive theory and computational methods for solving the Riccati equation (Laub; Kleinman ’68; Hewer ’71)

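A minimal sketch (with illustrative matrices) of solving this algebraic Riccati equation by fixed-point iteration and recovering K*; this mirrors the classical Riccati recursion rather than any specific solver cited above.

```python
# Solve the discrete ARE above by fixed-point (value) iteration and recover K*.
import numpy as np

A = np.array([[1.0, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [0.5]])
Q, R = np.eye(2), np.eye(1)

P = Q.copy()
for _ in range(500):
    # P_{k+1} = Q + A^T P_k A - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ G
    if np.max(np.abs(P_next - P)) < 1e-12:
        break
    P = P_next

K_star = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u_t = -K* x_t
print("K* =", K_star)
print("closed-loop spectral radius:", max(abs(np.linalg.eigvals(A - B @ K_star))))
```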
Value and policy iterations

The solution of the ARE determines the value matrix:

min_u J(x_0, u) = x_0⊤ P* x_0

One can develop an iteration on P s.t. P → P*, then recover the optimal control policy (this would be called value iteration).

PO for LQR, on the other hand, would directly update K, e.g.,

K^{n+1} = K^n − η dK

where dK is some sort of gradient update and η is a (possibly time-varying) stepsize; this is a first-order method.

this tutorial: can we develop direct PO methods with guarantees for some typical control synthesis problems?

Direct policy optimization

towards writing LQR as “J (K )” ...

First, note that when A is Schur stable, the series Σ_t (A⊤)^t Q A^t → P converges, where P = A⊤PA + Q.

So with stabilizing feedback in place, the LQR cost for the dynamics

x_{t+1} = (A − BK) x_t

with an initial condition x_0 can be written as x_0⊤ P_K x_0, where

P_K = (A − BK)⊤ P_K (A − BK) + K⊤RK + Q.

So in this case, LQR is really optimizing

min_{P,K} trace(P Σ_0)   (with Σ_0 = x_0 x_0⊤)
s.t. P = (A − BK)⊤ P (A − BK) + K⊤RK + Q

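A minimal sketch of evaluating J(K) = trace(P_K Σ_0) by solving the Lyapunov equation above; the system matrices, Σ_0, and the test gain are illustrative assumptions.

```python
# Evaluate the LQR cost of a given gain K via the Lyapunov equation above.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[1.0, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [0.5]])
Q, R = np.eye(2), np.eye(1)
Sigma0 = np.eye(2)                          # covariance of the initial state x0

def lqr_cost(K):
    Acl = A - B @ K
    if max(abs(np.linalg.eigvals(Acl))) >= 1:
        return np.inf                       # K not stabilizing: infinite cost
    # solve P_K = Acl^T P_K Acl + Q + K^T R K
    PK = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    return np.trace(PK @ Sigma0)

print(lqr_cost(np.array([[0.1, 0.5]])))
```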
However, as stated, this problem is a bilinear matrix
optimization ...

When K ∈ S (the set of stabilizing K), the equation

P = (A − BK)⊤ P (A − BK) + K⊤RK + Q

has a unique solution P(K); hence, the LQR problem can be written (for a given Σ) as

min_{K ∈ S} J(K)

We take x_0 ∼ D(0, Σ_0) where Σ_0 is a full-rank covariance (equivalently, we can take Σ to correspond to a spanning set of initial conditions); thus J is a real analytic function over its domain.

Questions

Consider now PO algorithms :


▶ iterate on policy K ,
▶ using gradient of cost, ∇J (K ) (exact or approximate)

▶ does GD (with exact gradients) converge? under what assumptions? does it converge to the global optimum K*?
▶ rate of convergence ?
▶ how about related algorithms, e.g., “natural gradient” descent ?
▶ “model-free” version : if gradients not available, would sampling
J (K ) work ? finite-sample complexity ?
note : challenging as J (K ) is not convex

LQR and policy gradient methods

▶ Consider LQR without state noise (for simplicity), random initial condition x_0 ∼ D
▶ let J (K ) be the cost as function of policy K
▶ define covariance matrices:

Σ_K = E[ Σ_{t=0}^{∞} x_t x_t⊤ ],   Σ_0 = E[ x_0 x_0⊤ ]

▶ consider algorithms:
Gradient descent: K ← K − η∇J(K)
Natural GD: K ← K − η∇J(K) Σ_K⁻¹   [Kakade ’01]

Oracles

Suppose algorithms have only oracle access to the model (A, B not known explicitly), e.g.,
▶ exact gradient oracle: ∇J(K)
▶ “approximate gradient” oracle: use sample values of J(K)
i.e., first-order and zeroth-order oracles in optimization

Optimization landscape I

J(K) = ⟨Σ_0, P_K⟩ = vec(Σ_0)⊤ (I − (A − BK) ⊗ (A − BK))⁻¹ vec(Q + K⊤RK)

observation: J(K) is not convex in K (nor quasiconvex, nor star-convex) for n ≥ 3 (convex for the single-input case when n = 2); we can compute the gradient

∇J(K) = 2(RK − B⊤P_K A_K)Σ_K,  where A_K := A − BK,  Σ_K = E[ Σ_{t=0}^{∞} x_t x_t⊤ ],  Σ_0 = E[ x_0 x_0⊤ ]

lemma: if ∇J(K) = 0, then either
▶ K is optimal, or
▶ the covariance Σ_K is rank deficient.
if Σ_0 is full rank, then Σ_K is full rank =⇒ stationary points are globally optimal
(one can also show this via the convex LMI reformulation, though the proofs are no simpler; later research has explored this deeper connection and its uses)
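A minimal sketch of exact policy gradient descent for LQR using the gradient expression above, with a backtracking step so the iterates stay stabilizing (the coercive cost acts as a barrier); the system matrices, initial gain, and stepsize rule are illustrative assumptions, not the exact algorithm analyzed in the reference.

```python
# Exact policy gradient descent for LQR with a simple backtracking line search.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.eye(1), np.eye(2)

def cost_and_grad(K):
    Acl = A - B @ K
    if max(abs(np.linalg.eigvals(Acl))) >= 1:
        return np.inf, None
    PK = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)      # P_K
    SigmaK = solve_discrete_lyapunov(Acl, Sigma0)             # Sigma_K
    grad = 2 * (R @ K - B.T @ PK @ Acl) @ SigmaK              # 2(RK - B^T P_K A_K) Sigma_K
    return np.trace(PK @ Sigma0), grad

K = np.zeros((1, 2))                                          # stabilizing initial gain
for _ in range(300):
    J, G = cost_and_grad(K)
    eta = 0.1
    while cost_and_grad(K - eta * G)[0] > J - 0.5 * eta * np.sum(G**2):
        eta /= 2                                              # backtrack until sufficient decrease
    K = K - eta * G
print("learned K:", K, "cost:", cost_and_grad(K)[0])
```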
Optimization landscape II

lemma: Suppose Σ_0 is full rank. Then

J(K) − J(K*) ≤ ( ∥Σ_{K*}∥ / (σ_min(Σ_0) σ_min(R)) ) ∥∇J(K)∥²,

i.e., J(K) is gradient-dominated ([Polyak ’63], ...)

Main Theory : Summary

[Fazel, Ge, Kakade, Mesbahi 2018] for discrete-time LQR


▶ J(K ) is generally not convex/quasiconvex/star-convex
▶ convex combination of two stabilizing K ’s may not stabilize
▶ coerciveness : LQR cost is coercive on the set of stabilizing policies
▶ LQR cost is gradient dominant
▶ hence, gradient descent converges to K ∗ from any stabilizing initial
K0 ! (with a linear rate)
▶ similarly for related algorithms, e.g., natural policy gradients and
policy iteration algorithm

Global convergence in the exact case

Theorem 1: Suppose J(K_0) is finite (i.e., K_0 is stabilizing) and Σ_0 is full rank. With stepsize η chosen appropriately, and the number of iterations N chosen as
▶ for natural policy GD:

N ≥ (∥Σ_{K*}∥ / σ_min(Σ_0)) · ( ∥R∥/σ_min(R) + ∥B∥² J(K_0)/(σ_min(Σ_0) σ_min(R)) ) · log( (J(K_0) − J(K*)) / ϵ ),

▶ for GD:

N ≥ (∥Σ_{K*}∥ / σ_min(Σ_0)) · log( (J(K_0) − J(K*)) / ϵ ) · poly(everything else),

then K_N has cost ϵ-close to the optimum.

Numerical experiment

[Figure. Left: Gradient descent for continuous-time LQR (error ∥K − K*∥_F vs. iteration). Right: 2-dim projection of the LQ cost contour.]

Unknown model case

Gradient descent: K ← K − η ∇̂J(K)
Natural policy GD: K ← K − η ∇̂J(K) Σ̂_K⁻¹

▶ we do not know (or directly learn) A, B


▶ but have the ability to explore by perturbing K
▶ model-free estimation : add Gaussian noise to actions during
rollouts
▶ similar to zeroth order (derivative-free) optimization
▶ issues : how much noise ? length of rollouts ? overall sample
complexity ?

Algorithm

Input: K, # trajectories m, rollout length ℓ, smoothing radius r, dimension d

for i = 1, · · · , m:
▶ draw U_i uniformly at random from {U : ∥U∥_F ≤ r}
▶ sample policy K̂_i = K + U_i
▶ simulate K̂_i for ℓ steps starting from x_0 ∼ D
▶ get empirical estimates

Ĉ_i = Σ_{t=1}^{ℓ} c_t,   Σ̂_i = Σ_{t=1}^{ℓ} x_t x_t⊤

where c_t, x_t are the costs and states on this trajectory
end for

use the following estimates for PGD/NPGD:

∇̂J(K) = (1/m) Σ_{i=1}^{m} (d/r²) Ĉ_i U_i,   Σ̂_K = (1/m) Σ_{i=1}^{m} Σ̂_i

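A minimal sketch of the zeroth-order estimator above; one common variant (assumed here) draws the perturbation U on the sphere of radius r. The dynamics, rollout length, and smoothing radius are illustrative.

```python
# Zeroth-order estimates of the gradient and state covariance for LQR.
import numpy as np

A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def zeroth_order_estimates(K, m=200, ell=50, r=0.05, rng=None):
    rng = np.random.default_rng(rng)
    d = K.size
    grad_hat = np.zeros_like(K)
    Sigma_hat = np.zeros((A.shape[0], A.shape[0]))
    for _ in range(m):
        U = rng.normal(size=K.shape)
        U *= r / np.linalg.norm(U)                 # random direction with ||U||_F = r
        Ki = K + U
        x = rng.normal(size=A.shape[0])            # x0 ~ D
        C_i, Sigma_i = 0.0, np.zeros_like(Sigma_hat)
        for _ in range(ell):
            u = -Ki @ x
            C_i += x @ Q @ x + u @ R @ u           # accumulated rollout cost
            Sigma_i += np.outer(x, x)
            x = A @ x + B @ u
        grad_hat += (d / r**2) * C_i * U           # one-point estimator term
        Sigma_hat += Sigma_i
    return grad_hat / m, Sigma_hat / m

g_hat, Sigma_hat = zeroth_order_estimates(np.array([[0.1, 0.2]]))
print(g_hat)
```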
Idea for the Proof of Convergence

1. Prove that when the rollout length ℓ is large enough, the cost function C and covariance Σ are close to the infinite-horizon quantities
2. Show with enough samples, alg can estimate gradient and
covariance matrix within the desired accuracy
3. Show GD and NPGD converge with a similar rate, despite
bounded perturbations in gradient/natural gradient estimates

Suppose J(K_0) is finite, µ > 0, and x_0 has norm bounded by L almost surely; GD and NPGD are called with parameters

m, ℓ, 1/r = poly( J(K_0), 1/µ, 1/σ_min(Q), ∥A∥, ∥B∥, ∥R∥, 1/σ_min(R), d, 1/ϵ, L²/µ ).

▶ NPGD: for stepsize η = 1 / ( ∥R∥ + ∥B∥² J(K_0)/µ ) and

N ≥ (∥Σ_{K*}∥/µ) · ( ∥R∥/σ_min(R) + ∥B∥² J(K_0)/(µ σ_min(R)) ) · log( 2(J(K_0) − J(K*)) / ϵ ),

with high probability NPGD satisfies J(K_N) − J(K*) ≤ ϵ.
▶ GD: for an appropriate stepsize

η = poly( µσ_min(Q)/J(K_0), 1/∥A∥, 1/∥B∥, 1/∥R∥, σ_min(R) )

and

N ≥ (∥Σ_{K*}∥/µ) · log( (J(K_0) − J(K*)) / ϵ ) · poly( J(K_0)/(µσ_min(Q)), ∥A∥, ∥B∥, ∥R∥, 1/σ_min(R) ),

with high probability GD satisfies J(K_N) − J(K*) ≤ ϵ.
Related Results
A burst of recent research interest :

LQR, continuous-time : [Mohammadi et al., 2019], [Bu et al., 2020]


LQR, discrete-time : [Fazel et al., 2018], [Bu et al., 2019]
Stabilization : [Perdomo et al., 2021]
Decentralized finite-horizon LQR under QI : [Furieri et al., 2020]
LQ games : [Zhang et al., 2019], [Bu et al., 2019], [Mazumdar et al.,
2019], [Hambly et al., 2021]
Markov jump linear systems : [Jansch-Porto et al, 2020]
Output estimation with differentiable convex liftings (DCL)
framework : [Umenberger et al, 2022]
H∞ control : [Tang and Zheng, 2023]
Fundamental limits of policy gradient : [Ziemann et al, 2022]
And many other variants ! (See our survey article)
B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, T. Başar. Toward a theoretical
foundation of policy optimization for learning control policies, Annual Review of
Control, Robotics, and Autonomous Systems, Vol. 6, pp. 123-158, 2023.
Coming up ...

▶ 3:00-3:30pm: PO Theory for Risk-sensitive & H2/H∞ Robust Control
▶ 3:30-4:00pm: Coffee Break
▶ 4:00-4:30pm: PO Theory for State-feedback H∞ Synthesis
▶ 4:30-5:00pm: PO Theory for LQG
▶ 5:00-5:15pm: Role of convex parameterization
▶ 5:15-5:30pm: Future work and Q&A/discussions

Policy Optimization for Risk-Sensitive Control,
H2 /H∞ Robust Control, and Linear Quadratic Games

Kaiqing Zhang

University of Maryland, College Park

5th Annual Learning for Dynamics & Control Conference (L4DC)

University of Pennsylvania, PA June 14, 2023


General Background: Recall
▶ Reinforcement learning (RL) has achieved tremendous empirical
successes, including in some continuous-space control tasks
▶ Game of Go, video games, robotics, etc.¹

▶ A resurgence of interest in the theoretical understandings of RL


¹ Source: Google images
Motivation

▶ Most scenarios involve more than one agent, e.g., game of Go, video
games, robot team motion-planning, autonomous driving
▶ Most control systems are safety-critical, e.g., robotics, unmanned
(aerial) vehicles, cyber-physical systems

Goal: Provable RL with robustness and multi-agency considerations


Background
Theoretical Understanding of Policy Optimization

▶ One workhorse in RL: Direct policy search/policy optimization


▶ Whether, where, how fast, PO methods converge?
▶ Nonconvex in policy parameter space
▶ Let’s start with benchmark RL/control tasks before deep RL?
▶ PO for linear quadratic regulator (LQR) (and variants) has been extensively studied recently [Fazel et al., ′18][Tu & Recht, ′19][Malik et al., ′19][Bu et al., ′19][Mohammadi et al., ′19][Gravell et al., ′19][Li et al., ′19][Furieri et al., ′19]...

All models are wrong, so is LQR =⇒ robustness concern is critical

▶ Question: whether & how PO methods can address benchmark


control/RL with robustness/risk-sensitivity concerns?
▶ Side motivation: multi-agent RL (will come back later)
PO for RL/Control =⇒ Optimization
PO for RL/Control =⇒ (Constrained Nonconvex) Opt.

▶ PO for RL/Control can be generally written as

min J (K ), s.t. K ∈ K
K
▶ Parametrized policy/controller K
▶ Objective to optimize J
▶ Constraint set K (important but sometimes implicit!)

▶ Linear quadratic regulator (LQR) as an example

min_K J(K) := Σ_{t=0}^{∞} E[ x_t⊤ Q x_t + u_t⊤ R u_t ],  s.t. K is stabilizing
▶ ut = −Kxt for gain matrix K
▶ K = {K | ρ(A − BK ) < 1}; K a nonconvex constraint set
▶ Other examples of K: boundedness of K ’s norm, probability simplex,
safety-constraint on states, etc.
Robustness Constraint on K
▶ Beyond stability, robustness is a core topic in control theory
▶ Need a controller robust to disturbance/model-uncertainty

▶ G –∆ model covers many robustness considerations


▶ Parametric uncertainty ∆A, ∆B in A, B (most popular in recent ML
for control literature)
▶ Time-varying parameters At , Bt
▶ Time-varying delay ut = −Kxt−τ
▶ Even dynamical uncertainty (unknown model-order)
▶ Robustness constraint K?
Questions

▶ How to enforce/maintain robustness for policy-optimization RL methods during learning?

▶ What are the global convergence guarantees, if any, of PO methods in learning for robust control?
Problem Statement
Starting Point: LEQG [Jacobson ′ 73]
▶ LQR/LQG =⇒ linear exponential quadratic Gaussian (LEQG)
▶ Simple but benchmark risk-sensitive control setting
▶ Linear system dynamics:

xt+1 = Axt + But + wt ,

system state xt ∈ Rd with x0 ∼ N (0, X0 ), noise wt ∼ N (0, W )


▶ One-stage cost c(x, u) = x⊤Qx + u⊤Ru, with objective

min J := lim_{T→∞} (2/(Tβ)) log E[ exp( (β/2) Σ_{t=0}^{T−1} c(x_t, u_t) ) ],  c(x_t, u_t) = x_t⊤ Q x_t + u_t⊤ R u_t

▶ Intuition: by Taylor expansion around β = 0,

J ≈ lim_{T→∞} (1/T) { E[ Σ_{t=0}^{T−1} c(x_t, u_t) ] + (β/4) Var[ Σ_{t=0}^{T−1} c(x_t, u_t) ] } + O(β²).
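A minimal sketch (illustrative system, gain, horizon, and β) that estimates the finite-horizon risk-sensitive objective from sampled trajectories and compares it with the mean-plus-variance approximation above.

```python
# Monte Carlo estimate of the risk-sensitive objective vs. mean + (beta/4)*variance.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.1, 0.3]])
beta, T, n_traj = 0.02, 100, 2000

totals = []
for _ in range(n_traj):
    x = np.zeros(2)
    total = 0.0
    for _ in range(T):
        u = -K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + rng.normal(scale=0.1, size=2)   # w_t ~ N(0, 0.01 I)
    totals.append(total)
totals = np.array(totals)

J_risk = (2 / (T * beta)) * np.log(np.mean(np.exp(0.5 * beta * totals)))
J_approx = (np.mean(totals) + 0.25 * beta * np.var(totals)) / T
print("risk-sensitive objective:", J_risk, "  mean + (beta/4)*variance:", J_approx)
```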
Starting Point: LEQG

▶ Optimal controller is LTI state-feedback [ZHB, ′ 21], conjectured in


[Glover and Doyle, ′ 88]

µt (x0:t , u0:t−1 ) = −K ∗ xt

▶ See more results on LEQG specifically in [ZHB, ′ 21]


▶ Implicit robustness constraint in LEQG

Lemma (Glover and Doyle ′88)
The feasible set of J(K) is the 1/√β-sublevel set of the H∞-norm of T(K), i.e.,

{ K : K stabilizing; ∥T(K)∥∞ < 1/√β }.
Bigger Picture: Mixed H2 /H∞ Control

▶ Linear dynamical system:

x_{t+1} = A x_t + B u_t + D w_t,   z_t = C x_t + E u_t

▶ H∞-norm: the ℓ2 → ℓ2 operator norm from {w_t} to {z_t}:

∥T(K)∥∞ := sup_{θ∈[0,2π)} λ_max^{1/2}( [T(K)(e^{−jθ})]⊤ T(K)(e^{jθ}) )
Bigger Picture: Mixed H2 /H∞ Control

▶ Solve

min_K J(K)  s.t.  ρ(A − BK) < 1 and ∥T(K)∥∞ < γ   (this defines K)

▶ J(K) upper-bounds the H2-norm:

J(K) = Tr(P_K DD⊤),  or  J(K) = −γ² log det(I − γ⁻² P_K DD⊤),

where P_K solves a Riccati equation given K:

P_K = Ã_K⊤ P_K Ã_K + Ã_K⊤ P_K D (γ²I − D⊤P_K D)⁻¹ D⊤ P_K Ã_K + C⊤C + K⊤RK,  with Ã_K = A − BK
▶ If γ → ∞, then J (K ) → H2 -norm, e.g., it reduces to LQR
Why an Important & Interesting Model?
▶ Intuition: Small gain theorem – if ∥T (K )∥∞ < γ and
∥∆∥ℓ2 →ℓ2 < 1/γ, then G –∆ is input-output stable
▶ Choosing

K = {K | ρ(A − BK ) < 1; ∥T (K )∥∞ < γ}

=⇒ certain level of robust stability


▶ γ → ∞ reduces to stability region in LQR

▶ The second choice of J(K) with D = W^{1/2} and γ = 1/√β coincides with the risk-sensitive LEQG objective
▶ It also unifies maximum-entropy/H∞ control, LQG/H∞ control
[Glover and Doyle, ′ 88; Mustafa, ′ 89]; γ-level disturbance attenuation
[Başar, ′ 91]; and zero-sum LQ dynamic games (come to it later)
[Jacobson, ′ 73; Başar and Bernhard, ′ 95]. Also used for solving
H∞ -optimal control
▶ Also used in Economics to model sequential decision-making under
model uncertainty [Hansen and Sargent, ′ 08]
Algorithms and Landscape
Policy Gradient Algorithms
▶ Following the naming convention in [Fazel et al., ′ 18] for LQR

Policy Gradient:
K′ = K − η · ∇J(K) = K − 2η · [ (R + B⊤P̃_K B)K − B⊤P̃_K A ] · ∆_K

Natural PG:
K′ = K − η · ∇J(K) · ∆_K⁻¹ = K − 2η · [ (R + B⊤P̃_K B)K − B⊤P̃_K A ]

Gauss-Newton:
K′ = K − η · (R + B⊤P̃_K B)⁻¹ · ∇J(K) · ∆_K⁻¹ = K − 2η · [ K − (R + B⊤P̃_K B)⁻¹ B⊤P̃_K A ]

▶ Recall that P̃_K is the solution to the Riccati equation dictated by K, and ∆_K is the solution to another (dual) Riccati equation (a minimal sketch of these updates in the LQR limit is given below)
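A minimal sketch of the three updates in the LQR limit γ → ∞, where (as assumed here) P̃_K reduces to the standard Lyapunov solution and ∆_K reduces to the closed-loop covariance Σ_K with DD⊤ = I; matrices and stepsizes are illustrative, and this is not the general H∞-constrained case.

```python
# The three policy update rules above, specialized to the LQR limit.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def pg_quantities(K):
    Acl = A - B @ K
    PK = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)       # plays the role of P~_K
    SigmaK = solve_discrete_lyapunov(Acl, np.eye(2))           # plays the role of Delta_K
    EK = (R + B.T @ PK @ B) @ K - B.T @ PK @ A                 # (R + B^T P B)K - B^T P A
    return PK, SigmaK, EK

def step(K, eta, method="gauss_newton"):
    PK, SigmaK, EK = pg_quantities(K)
    if method == "policy_gradient":
        return K - 2 * eta * EK @ SigmaK
    if method == "natural_pg":
        return K - 2 * eta * EK
    return K - 2 * eta * np.linalg.solve(R + B.T @ PK @ B, EK)  # Gauss-Newton

K = np.zeros((1, 2))
for _ in range(30):
    K = step(K, eta=0.5)       # eta = 1/2: the Gauss-Newton update becomes policy iteration
print("K after Gauss-Newton steps:", K)
```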
Landscape

Lemma (Nonconvexity)
There is a mixed H2 /H∞ control problem that is nonconvex for policy
optimization.

Lemma (Non-Coercivity)
There is a mixed H2 /H∞ control problem whose cost function J (K ) is
non-coercive. Particularly, as K → ∂K, J (K ) does not necessarily
approach infinity.

▶ Coercivity is key in the LQR analysis [Fazel et al., ′18][Bu et al., ′19][Malik et al., ′19][Mohammadi et al., ′19]

▶ Ensures any descent direction to be feasible


▶ Can confine everything to the sub-level set (reduces to standard
smooth optimization)
Convergence
Landscape Illustration

[Figure: landscape illustration for LQR vs. mixed H2/H∞ control]

▶ Descent in J(K) does not necessarily ensure feasibility/robust stability
▶ How to enforce K ∈ K during iteration?
Implicit Regularization

▶ Regularization: iterate K remains inside K, i.e., robustly stable


▶ Can be made explicit via projection onto K. But K is nonconvex,
and defined in frequency domain
▶ Implicit regularization:
▶ The convergence of certain algorithms behaves as if certain
regularization is used
▶ Borrowed from the machine learning literature: observed in learning overparametrized neural nets/nonlinear models [Kubo et al., ′19][Azizan et al., ′19], phase retrieval and matrix completion [Ma et al., ′17], etc., with (stochastic) gradient (mirror) descent
▶ Property of both the nonconvex problem and algorithm
▶ Gauss-Newton and natural PG enjoy implicit regularization!
Theory: Implicit Regularization

Theorem (Implicit Regularization)


Suppose the stepsizes η satisfy:
▶ Gauss-Newton: η ≤ 1/2,
▶ Natural policy gradient: η ≤ 1/(2∥R + B ⊤ PeK0 B∥).
Then, K ∈ K =⇒ K ′ ∈ K.

▶ General descent directions of J(K) may not work, but certain directions do
Implicit Regularization: Proof Idea

▶ Linear matrix inequalities (LMIs)-based approach


▶ A new use of Bounded Real Lemma [Başar & Bernhard, ′ 95][Zhou et
al., ′ 96][Dullerud & Paganini, ′ 00]:
K ∈ K ⇐⇒ Riccati equation ⇐⇒ strict Riccati inequality (RI)
▶ Observation: two consecutive iterates K → K ′ are related – previous
PK is a candidate for the non-strict RI under K ′ to hold
▶ Perturb PK in a certain way =⇒ strict RI
▶ Perturb P = PK + αP̄ for small enough α > 0, where P̄ > 0 solves
the Lyapunov equation as before, with C ⊤ C + K ⊤ RK replaced by −I
Theory: Global Convergence

Theorem (Global Convergence)


Suppose K0 ∈ K, then both the Gauss-Newton and natural PG updates converge
to the global optimum K ∗ = (R + B ⊤ PeK B)−1 B ⊤ PeK A with O(1/N) rate.
Theory: Local (Super-)Linear Rates

▶ Much faster rates around the global optimum

Theorem (Local Faster Rates)


Suppose the conditions above hold, and additionally DD ⊤ > 0. Then, both the
Gauss-Newton and natural PG updates converge to the optimal control gain K ∗
with locally linear rate. In addition, if η = 1/2, the Gauss-Newton update
converges to K ∗ with locally Q-quadratic rate.

▶ Gradient domination (Polyak-Lojasiewicz) property holds locally


▶ Q-quadratic rate mirrors that of policy iteration for LQR [Lanchaster
and Rodman ′ 95]
Simulations
Simulations

▶ Initialization near the boundary ∂K; infinitesimal stepsize η for PG


[Figure: three panels vs. iteration. Left: average squared gradient norm. Middle: J(K) − J(K*). Right: H∞-norm ∥T(K)∥∞.]


Simulations: Global Convergence

▶ Escaping suboptimal stationary points


Simulations: Scalability

▶ Computationally more efficient than existing general robust control


solvers – HIFOO [Arzelier et al., ′ 11] & Matlab h2hinfsyn function
[Mahmoud and Pascal, ′ 96] and systune function [Apkarian et al., ′ 08]

System Dim. | HIFOO | h2hinfsyn | systune | NPG | GN | Speedup
15 × 15 | 0.3742s | 95.2663s | 0.4276s | 0.0481s | 0.0420s | ∼ 8/2117/8.8×
60 × 60 | 18.4380s | fail, > 7200s | 171.7855s | 0.3906s | 0.3902s | ∼ 47/> 18461/440×
90 × 90 | 241.4416s | fail, > 7200s | 4126.9s | 0.8167s | 0.8103s | ∼ 295/> 36922/5093×

Table: Average runtime comparison


Connection to Multi-Agent RL (MARL)
Multi-Agent RL
▶ Usually studied under framework of Markov games [Shapley ′ 53]
▶ The most basic MARL model ever since [Littman, ′ 94]: two-player
zero-sum Markov games
▶ Benchmark in continuous control: linear quadratic zero-sum
dynamic games (DG) [Başar and Bernhard, ′ 95] (mirrors LQR for
single-agent RL)
▶ PO methods widely used in modern empirical MARL, while its
convergence guarantees remain largely open
▶ (Projected) PO for LQ zero-sum DGs [ZYB, ′ 19][Bu et al., ′ 19]
▶ Negative/Non-convergence results for (multi-player) LQ general-sum
DGs [Mazumdar, Ratliff, Jordan, and Sastry, ′ 19]
▶ PG methods (and variants) for tabular zero-sum Markov games
[Daskalakis et al., ′ 20][Zhao et al., ′ 21][Cen et al., ′ 21,′ 22]...
▶ Nonconvex-nonconcave [ZYB, ′ 19][Daskalakis et al., ′ 20], PO can
easily diverge if not designed carefully
LQ Zero-Sum Dynamic Games

▶ H2 /H∞ control is strongly tied to LQ zero-sum dynamic games


▶ Let u_t = −K x_t and w_t = −L x_t, then solve

J(K, L) := E_{x_0∼D} Σ_{t=0}^{∞} [ x_t⊤ Q x_t + (K x_t)⊤ R^u (K x_t) − (L x_t)⊤ R^v (L x_t) ]

Solve: min_K max_L J(K, L) ⇐⇒ min_K J(K, L(K))

with x_{t+1} = A x_t + B u_t + C w_t


▶ For fixed K (outer loop), taking the max over L (inner loop), the resulting Riccati equation becomes the same Riccati equation as in H2/H∞ control
Implication for MARL

▶ Previous results =⇒ double-loop update provably works


▶ Double-loop/nested-grad.: fix K and improve L, then improve K
▶ Aligned with the empirical tricks to stabilize nonconvex-
nonconcave minimax opt. with timescale separation [Lin, Jin, &
Jordan, ′ 18], as in training GANs [Heusel et al., ′ 18]
▶ Gives global convergence of PO in competitive MARL (zero-sum
Markov/dynamic games)
Benefit from MARL: Model-Free H2 /H∞ Control

▶ Recall the policy gradient form

∇J(K) = 2[ (R + B⊤P̃_K B)K − B⊤P̃_K A ] ∆_K,

while ∆_K cannot be estimated from sampled trajectories


▶ Instead solve the equivalent game using data
▶ Build up a virtual adversary wt = −Lxt
▶ Double-loop/nested-grad.: fix K and improve L, then improve K
▶ Derivative-free methods for LQR [Fazel et al., ′ 18][Malik et al., ′ 19]
cannot work directly
▶ Non-coercive & only certain direction works =⇒ no uniform margin
▶ Caveat: quantities (cost, action space, control gain matrices) in the
LQ setting are continuous, and can easily go unbounded!
▶ Leads to no-global-smoothness + nonconvexity-nonconcavity
▶ Can be addressed under the unified LQ game formulation, for
finite-horizon settings [ZZHB, ′ 21]
Illustration for Derivative-Free PO Convergence

▶ Proof idea illustrated with figures


▶ K0 is the “level-set” corresponding to initialization K0 ;
▶ K
b is a larger set that the iterates will not leave (with high
probability); due to stochastic errors when using samples
▶ both are compact =⇒ uniform smoothness constant over K b
Connection to Robust Adversarial RL (RARL)
Robust Adversarial RL [Pinto et al., ′ 17]

▶ RL hardly generalizes due to Sim2real and/or training-testing gap

[Google AI, ′ 16] [Tobin et al., ′ 17]

▶ One remedy: RARL [Pinto et al., ′ 17]


▶ Idea: introduce an adversary, playing against the agent
▶ Dates back to [Morimoto and Doya, ′ 05], under the name robust RL,
and “inspired by H∞ -theory”
▶ Made popular by the empirical work [Pinto et al., ′ 17]
▶ Question: Any robustness interpretation and convergence guarantee?
LQ RARL

▶ RARL setting ⇐⇒ zero-sum dynamic game


▶ LQ RARL: View wt as model-uncertainty, or the
model-misspecification when linearizing a nonlinear model
▶ Recall the RARL scheme in [Pinto et al., ′ 17]
RARL in [Pinto et al., ′ 17] Easily Fails [ZHB, ′ 20]

▶ Stability issue due to bad initialization


▶ Stability issue due to bad choices of (NK , NL ) (K0 , L0 )

[Figure: four training curves illustrating the stability issues above for different choices of (N_K, N_L) and (K_0, L_0).]

▶ What is a good combination of initialization & update rule?


Implication from Mixed H2 /H∞ Control

▶ By implicit regularization, we find a provably convergent pair of



(initialization, update rule): (K0 ∈ K, L0 = 0), (NK = 1, NL = ∞)
▶ How to find such a K0 ∈ K (in a model-free way) – robustify K0 ?
▶ For any stabilizing K, perform K′ = K − αg with g ∈ R^{m×n} the finite-difference estimate of the subgradient of ∥T(K)∥∞, where

g_ij = ( ∥T(K + ϵ d_ij)∥∞ − ∥T(K − ϵ d_ij)∥∞ ) / (2ϵ)

[Figure: decrease of the H∞-norm during robustification; closed-loop behavior before and after robustification.]


Additional Simulations
Convergent Cases

▶ Exact update:

▶ Derivative-free update:
Some Divergent Cases

▶ Update-rules other than double-loop may easily diverge, even with


infinitesimal stepsizes
▶ ANGDA: alternative-update of natural PG descent & ascent
▶ τ-NGDA: simultaneous update with stepsize ratio α/η = τ > 1
▶ Descent-Multi-Step-Ascent: multiple ascent steps per descent
step
Some Divergent Cases

▶ Without global smoothness, controlling the iterates’ boundedness is


critical and challenging
Concluding Remarks
Concluding Remarks
▶ Studied policy optimization landscape for risk-sensitive/robust
control, with fundamental challenges diff. from that of LQR –
deepened our understanding of existing results on LQR
▶ Developed two PO methods, identified their implicit regularization
property, and established global convergence + sample complexity
▶ Along the way
▶ Global convergence and sample complexity of PO for competitive
MARL, in the LQ zero-sum setting
▶ Some theoretical understanding and critical thinking on RARL, from
robust control perspective
▶ Explicit regularization and convex reformulation can also be useful: a unified differentiable convex liftings (DCL) framework [USPZT, ′22]

▶ A natural intersection of control, RL, and game theory


Thank You!
Direct Policy Search for Robust Control:
A Nonsmooth Optimization Perspective

Bin Hu
ECE & CSL, University of Illinois Urbana-Champaign

L4DC Tutorial 2023


Joint work with Xingang Guo
Outline

• Motivation and Problem Formulation

• Main Results

• Conclusions and Future Directions

Motivation: Reinforcement Learning for Control
• Many robust control problems are solved via lifting into convex spaces.
Recently, reinforcement learning (RL) has shown great promise for control!

• Main workhorse: direct policy search/policy optimization (PO)


min J(K), s.t. K ∈ K
K
• Parametrized policy K (e.g. linear mapping, neural networks)
• Cost function J (tracking errors, closed-loop H2 /H∞ norms, etc)
• Constraint set K (stability, robustness, safety, etc)
• PO algorithm: K′ = K − α∇J(K) (nonconvex problem!)
• Theory: Landscape, feasibility, convergence, complexity
• Question: How to tailor policy-based RL for robust control?
• This talk: Guarantees of PO on H∞ control benchmarks
PO Theory for Robust Control
• PO theory for mixed design (maintaining robustness via improving average)
• Landscape: Feasible set is connected, and stationary is global
• Feasibility: The cost is nonconvex and non-coercive! Fortunately,
double-loop natural policy gradient (NPG) can implicitly regularize
• Global sublinear convergence for NPG
• Ref: Zhang, Hu, Başar. Policy optimization for H2 linear Control with H∞
robustness guarantee: Implicit regularization and global convergence,
SICON 2021.
• PO theory for H∞ state-feedback synthesis (improving robustness)
• Feature: Nonconvex nonsmooth
• Landscape: Any Clarke stationary points are global
• Feasibility: The cost is coercive and serves as a barrier function on K
• Global convergence: Goldstein’s subgradient method achieves global
convergence provably
• Ref: Guo and Hu. Global convergence of direct policy search for
state-feedback H∞ robust control: A revisit of nonsmooth synthesis with
Goldstein subdifferential, NeurIPS 2022.
Review: Linear Quadratic Regulator
• LQR as PO: Consider xt+1 = Axt + But + wt with wt being stochastic IID
"T −1 #
1 X
min J(K) := lim E (x> >
t Qxt + ut Rut ) , s.t. K is stabilizing
K T →∞ T
t=0

• ut = −Kxt for gain matrix K


• K = {K | ρ(A − BK) < 1}; K a nonconvex constraint set
• PO theory for LQR
• Landscape: Stationary is global
• Feasibility: The LQR cost is coercive and serves as a barrier on K
• Global convergence & sample complexity: Linear rate via the gradient
dominance/smoothness property
• Main Ref:
M. Fazel, R. Ge, S. Kakade, M. Mesbahi. Global convergence of policy
gradient methods for the linear quadratic regulator, ICML 2018.

Problem Formulation: State-feedback H∞ Control
Consider the following linear time-invariant (LTI) system
xt+1 = Axt + But + wt , x0 = 0.

• We assume that (A, B) is stabilizable


• u := {u_0, u_1, · · · }, w := {w_0, w_1, · · · }, and ∥w∥ = ( Σ_{t=0}^{∞} ∥w_t∥² )^{1/2}.
• Our goal is to find a sequence u to minimize the quadratic cost

min_u max_{w:∥w∥≤1} Σ_{t=0}^{∞} (x_t⊤ Q x_t + u_t⊤ R u_t)

in the presence of the worst-case disturbance ∥w∥ ≤ 1.


• This is different than the LQR problem, where w is stochastic.
• kwk ≤ 1 is not restrictive, we can choose arbitrary `2 bound.
• We assume that Q and R are positive definite.
• It is well known that the optimal solution is using a linear state-feedback
policy ut = −Kxt (Başar and Bernhard 2008).
Problem Formulation: State-feedback H∞ Control
Consider u_t = −K x_t; the closed-loop system becomes x_{t+1} = (A − BK)x_t + w_t. We have the following PO problem:

min_K max_{w:∥w∥≤1} Σ_{t=0}^{∞} x_t⊤ (Q + K⊤RK) x_t.

The above optimization problem is equivalent to the following PO problem:

min_K J(K) := sup_{ω∈[0,2π]} σ_max( (Q + K⊤RK)^{1/2} (e^{jω}I − A + BK)⁻¹ )
s.t. K ∈ K := {K : ρ(A − BK) < 1}.

• This is a constrained nonconvex nonsmooth optimization problem.
• K can be nonconvex.
• The nonsmoothness comes from two sources:
  1. The computation of the maximum singular value.
  2. The sup over ω ∈ [0, 2π].
Convex LMIs vs. Direct Policy Search
• In the 1980s, convex optimization methods became popular for control due to global guarantees and efficient interior point methods
• Reparameterize problems as convex optimization problems (one does not
optimize the controller parameters directly)

{K ∈ K : J(K) ≤ γ} ⇐⇒ {K = LY⁻¹ : LMI(Y, L, γ) ≺ 0 is feasible, Y ≻ 0}.

• Minimizing γ subject to LMI(Y, L, γ) ≺ 0 and Y ≻ 0 is an SDP problem


• See for example, Boyd et al., “Linear Matrix Inequalities in System and
Control Theory”, 1994, SIAM
• PO is not convex!
• In the past, PO has been a popular approach for problems that cannot be convexified, e.g., structured H∞ synthesis! (HIFOO and Hinfstruct)
• This talk: View H∞ synthesis as a benchmark for understanding PO
Some Background on Nonsmooth Optimization

Clarke subdifferential:

∂_C J(K) := conv{ lim_{i→∞} ∇J(K_i) : K_i → K, K_i ∈ dom(∇J) ⊂ K }.

• ∂C J(K) is well defined for any K ∈ K.


• J(K) is locally Lipschitz and hence almost everywhere differentiable.

Proposition
If K is a local minimum of J, then 0 ∈ ∂C J(K) and K is a Clarke stationary
point.

Some Background on Nonsmooth Optimization
Generalized Clarke directional derivative:

J°(K, d) := limsup_{K′→K, t↓0} ( J(K′ + td) − J(K′) ) / t.

Directional derivative:

J′(K, d) := lim_{t↓0} ( J(K + td) − J(K) ) / t.

• J°(K, d) and J′(K, d) are different in general.
• J′(K, d) = J°(K, d) if J(K) is subdifferentially regular.

Subdifferentially Regular Property
Let K† be a Clarke stationary point for J. If J is subdifferentially regular, then J′(K†, d) ≥ 0 for all d.ᵃ
ᵃ This result is known; see Theorem 10.1 in Rockafellar and Wets 2009.
Some Background on Nonsmooth Optimization
Goldstein subdifferential:

∂_δ J(K) := conv( ∪_{K′∈B_δ(K)} ∂_C J(K′) ),

• B_δ(K) is the δ-ball around K
• requires B_δ(K) ⊂ K.
Generating a good descent direction (Goldstein 1977):

Descent inequality
Let F be the minimal-norm element of ∂_δ J(K). Suppose K − αF/∥F∥₂ ∈ K for any 0 ≤ α ≤ δ. Then we have

J(K − δF/∥F∥₂) ≤ J(K) − δ∥F∥₂.
Outline

• Motivation and Problem Formulation

• Main Results

• Conclusions and Future Directions

Summary of Known Facts

min_K J(K) := sup_{ω∈[0,2π]} σ_max( (Q + K⊤RK)^{1/2} (e^{jω}I − A + BK)⁻¹ )
s.t. K ∈ K := {K : ρ(A − BK) < 1}.

• K is open, can be unbounded, and nonconvex.


• J(K) is continuous, nonsmooth, and can be nonconvex in K.
• J(K) is locally Lipschitz, subdifferentially regular over the feasible set K.

High Level Ideas

Goldstein’s subgradient method:

K^{n+1} = K^n − δ^n F^n/∥F^n∥₂,

• F^n is the minimum-norm element of ∂_{δ^n} J(K^n).
• K^0 ∈ K is known.

High level ideas:


• Goldstein’s subgradient method finds Clarke stationary point
• Coerciveness ensures that K^n stays within the nonconvex feasible set.
• Clarke stationary points are global, and hence global optimum is found.

Main Results
Theorem (Guo and Hu, NeurIPS 2022)
Suppose (Q, R) are positive definite and (A, B) is stabilizable. We have:
1. J(K) is coercive over the set K. (Proved via the properties of (Q, R).)
2. For any K ∈ K satisfying J(K) > J*, there exists V ≠ 0 s.t. J′(K, V) < 0.
3. Any Clarke stationary point of the H∞ cost is a global minimum.
4. For any γ > J*, the sublevel set K_γ = {K ∈ K : J(K) ≤ γ} is compact. There is a strict separation between K_γ and K^c.
5. Suppose K^0 ∈ K. Set ∆_0 := dist(K_{J(K^0)}, K^c) > 0, and δ^n = 0.99∆_0/(n+1). Then Goldstein’s subgradient method K^{n+1} = K^n − δ^n F^n/∥F^n∥_F, with F^n being the minimum-norm element of ∂_{δ^n} J(K^n), is guaranteed to stay in K for all n. In addition, we have J(K^n) → J* as n → ∞.
6. There is also a complexity result for finding (ε, δ)-stationary points.

The most technical part of the proof is Step 2. It requires the use of a non-strict version of the KYP lemma.
Step 2 of Main Result
Lemma
For any K ∈ K with J(K) > J*, there exists a direction d ≠ 0 such that the directional derivative J′(K, d) ≤ J* − J(K) < 0.

Proof Sketch:

[Figure: the point (Y, L, γ) with LMI(Y, L, γ) ≺ 0 and the point (Y*, L*, γ*) with LMI(Y*, L*, γ*) ≺ 0, connected by the segment (Y + t∆Y, L + t∆L, γ + t(γ* − γ)).]

By convexity, we have LMI(Y + t∆Y, L + t∆L, γ + t(γ* − γ)) ≺ 0 with ∆Y = Y* − Y, ∆L = L* − L, and t ∈ [0, 1]. In addition, we must have

J((L + t∆L)(Y + t∆Y)⁻¹) ≤ γ + t(γ* − γ).
Step 2 of Main Result
Lemma
For any K ∈ K with J(K) > J*, there exists a direction d ≠ 0 such that the directional derivative J′(K, d) ≤ J* − J(K) < 0.

Proof Sketch (continued):
Based on the fact J(K*) < J(K), we can construct a direction d such that J′(K, d) < 0. Specifically, consider d = ∆L Y⁻¹ − L Y⁻¹ ∆Y Y⁻¹. Then we have

J′(K, d) = lim_{t↓0} ( J(K + t(∆L Y⁻¹ − L Y⁻¹ ∆Y Y⁻¹)) − J(K) ) / t
         ≤ lim_{t↓0} ( ( J((L + t∆L)(Y + t∆Y)⁻¹) − J(K) ) / t + O(t) )
         ≤ lim_{t↓0} ( ( J(K) + t(J(K*) − J(K)) − J(K) ) / t + O(t) )
         = J(K*) − J(K) < 0,

where we use the fact that (Y + t∆Y)⁻¹ = Y⁻¹ − t Y⁻¹ ∆Y Y⁻¹ + O(t²).
Finite-time complexity for (δ, ε)-stationary points
Goldstein’s subdifferential: ∂_δ J(K) := conv( ∪_{K′∈B_δ(K)} ∂_C J(K′) ).

Definition
A point K is said to be (δ, ε)-stationary if dist(0, ∂_δ J(K)) ≤ ε.

Theorem 3
If we choose δ^n = δ < ∆_0, then we have:
• K^n ∈ K for all n
• min_{0≤n≤N} ∥F^n∥₂ ≤ (J(K^0) − J*)/((N + 1)δ), i.e., the complexity of finding a (δ, ε)-stationary point is O(1/(δε))
• (δ, ε)-stationarity does not imply being δ-close to an ε-stationary point of J.
• Finite-time bounds for J(K^n) − J* are possible by exploiting other advanced properties of J(K).
Implementable Algorithms
Finding minimum norm element of Goldstein’s subdifferential may not be easy.
Fortunately, there are many implementable variants:
• Gradient Sampling (GS) (the HIFOO toolbox): The main idea is to randomly generate samples over B_{δ^n}(K^n), at which J is differentiable with probability 1. The convex hull of the gradients at these samples can be used as an approximation of ∂_{δ^n} J(K^n). (A minimal sketch appears after this list.)
• Nonderivative Sampling (NS)(Kiwiel2010): The NS method can be
viewed as the derivative-free version of the GS algorithm by only using the
zeroth-order oracle.
• Interpolated normalized gradient descent (INGD) (Zhang, J., et al.
2020; Davis, D., et al. 2022): INGD uses an iterative sampling strategy to
generate a descent direction. The INGD algorithm is guaranteed to find the
(δ, ε)-stationary point with the high-probability finite-time complexity
bound.

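A minimal sketch of a gradient-sampling-style approximation of a Goldstein subgradient step, applied to a toy nonsmooth function rather than the H∞ cost; the sampling scheme, toy objective, and the small QP for the minimum-norm element are illustrative assumptions.

```python
# Gradient-sampling approximation of a Goldstein subgradient step on a toy function.
import numpy as np
from scipy.optimize import minimize

def J(K):                      # toy nonsmooth objective: max of two smooth functions
    return max(np.sum((K - 1) ** 2), np.sum((K + 1) ** 2))

def grad_J(K):                 # gradient wherever J is differentiable (almost everywhere)
    return 2 * (K - 1) if np.sum((K - 1) ** 2) >= np.sum((K + 1) ** 2) else 2 * (K + 1)

def goldstein_step(K, delta, n_samples=20, rng=None):
    rng = np.random.default_rng(rng)
    G = []
    for _ in range(n_samples):                      # sample gradients in the delta-ball
        U = rng.normal(size=K.shape)
        U *= delta * rng.uniform() ** (1 / K.size) / np.linalg.norm(U)
        G.append(grad_J(K + U))
    G = np.array(G)
    # minimum-norm element of conv{g_1,...,g_m}: min ||w^T G||^2, w >= 0, sum w = 1
    m = len(G)
    res = minimize(lambda w: np.sum((w @ G) ** 2), np.ones(m) / m,
                   bounds=[(0, 1)] * m,
                   constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1})
    F = res.x @ G
    if np.linalg.norm(F) < 1e-8:                    # (approximately) stationary: stop moving
        return K
    return K - delta * F / np.linalg.norm(F)        # normalized Goldstein-style step

K = np.array([0.3, -2.0])
for _ in range(50):
    K = goldstein_step(K, delta=0.05)
print("K:", K, "J(K):", J(K))
```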
Numerical Example

To support our theory, we provide some numerical simulations. Consider the following example:

A = [ 1 0 −5 ; −1 1 0 ; 0 0 1 ],  B = [ 1 ; 0 ; −1 ],  Q = [ 2 −1 0 ; −1 2 −1 ; 0 −1 2 ],  R = 1.

For this example, we have J* = 7.3475. We initialize from

K^0 = [ 0.4931  −0.1368  −2.2654 ],

which satisfies ρ(A − BK^0) = 0.5756 < 1.
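A minimal sketch that checks ρ(A − BK^0) for this example and approximates J(K^0) by gridding the frequency axis; the grid size is an illustrative assumption and gridding only approximates the supremum.

```python
# Check the spectral radius at K^0 and evaluate the nonsmooth cost
# J(K) = sup_w sigma_max((Q + K^T R K)^{1/2} (e^{jw} I - A + B K)^{-1}) on a grid.
import numpy as np
from scipy.linalg import sqrtm

A = np.array([[1.0, 0.0, -5.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
B = np.array([[1.0], [0.0], [-1.0]])
Q = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
R = np.array([[1.0]])

def J_hinf(K, n_grid=2000):
    Acl = A - B @ K
    M = np.real(sqrtm(Q + K.T @ R @ K))
    sup_val = 0.0
    for w in np.linspace(0, 2 * np.pi, n_grid):
        T = M @ np.linalg.inv(np.exp(1j * w) * np.eye(3) - Acl)
        sup_val = max(sup_val, np.linalg.svd(T, compute_uv=False)[0])
    return sup_val

K0 = np.array([[0.4931, -0.1368, -2.2654]])
print("rho(A - B K0):", max(abs(np.linalg.eigvals(A - B @ K0))))  # slide reports 0.5756
print("grid estimate of J(K0):", J_hinf(K0))
```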

Numerical Example

Figure: Simulation results. Left: the trajectory of the relative error for the GS, NS, INGD, and model-free NS methods. Middle: the trajectory of the relative optimality gap for 8 randomly generated cases with the NS method. Right: the trajectory of the model-free NS method with a noisier oracle.

Outline

• Motivation and Problem Formulation

• Main Results

• Conclusions and Future Directions

Take Aways

We studied the global convergence of direct policy search for state-feedback H∞ robust control synthesis.
• State-feedback H∞ synthesis is a constrained nonconvex nonsmooth policy
optimization problem.

• Any Clarke stationary point for this problem is actually a global minimum.

• Goldstein’s subgradient methods are guaranteed to stay within the


nonconvex feasible set and converge to the global optimal.

• (δ, ε)-stationary points can be found with finite-time guarantees.

• (δ, ε)-stationarity does not imply being δ-close to an ε-stationary point of J.

23
Future Work

• Finite-time bounds for the optimality gap (i.e. J(K n ) − J ∗ )

• The sample complexity of direct policy search on model-free H∞ control

• Other H∞ synthesis problems (static/dynamic output feedback, etc)

24
Thanks!

If you are interested, feel free to send an email to binhu7@illinois.edu

Funding & Support: NSF

25
Analysis of the Optimization Landscape of Linear
Quadratic Gaussian (LQG) Control

Work by Yang Zheng, Yujie Tang, and Na Li

Presented by Bin Hu
5th Annual Learning for Dynamics & Control Conference
University of Pennsylvania. June 14-16, 2023
Today’s talk
❑ Optimal Control: the Feedback Paradigm
[Block diagram: controller → control input u(t) → system (disturbance d(t), noise w(t)) → state x(t), measurement y(t) → feedback to the controller.]
❑ Linear Quadratic Optimal Control
• Many practical applications
• Linear Quadratic Regulator (LQR) when the state 𝑥𝑡 is directly observable
• Linear Quadratic Gaussian (LQG) control when only the partial output 𝑦𝑡 is observed
• Extensive classical results (dynamic programming, separation principle, Riccati equations, etc.)
They are all model-based. Are there any guarantees for non-convex policy optimization? 2
Challenges for partially observed LQG
❑ Policy optimization for LQG control
• LQG is more complicated than LQR  [side figures: applications, sparse structures]
• Requires dynamical controllers
• Its non-convex landscape properties are much richer and more complicated than LQR

Our focus: non-convex LQG landscape

▪ Q1: Properties of the domain (set of stabilizing controllers)


• convexity, connectivity, open/closed?

▪ Q2: Properties of the accumulated LQG cost


• convexity, differentiability, coercivity?
• set of stationary points/local minima/global minima?

3
Outline

❑ LQG problem Setup

❑ Connectivity of the Set of Stabilizing Controllers

❑ Structure of Stationary Points of the LQG cost

4
LQG Problem Setup
[Block diagram: plant driven by Gaussian white noises w(t) (process) and v(t) (measurement), in feedback with a dynamical controller mapping y(t) to u(t).]
Objective: the LQG cost

Dynamical controller:
➢ internal state of the controller
➢ order of the controller
➢ full-order
➢ reduced-order

Minimal controller: the input-output behavior cannot be replicated by a
lower-order controller (* controllable and observable).

Standard assumption: the plant is controllable and observable.
5
Separation principle
[Block diagram: plant driven by Gaussian white noises w(t) and v(t), in feedback with a dynamical controller mapping y(t) to u(t).]
Objective: the LQG cost

Solution: Kalman filter for state estimation + LQR based on the estimated state

Two Riccati equations:
➢ Kalman gain
➢ Feedback gain
Explicit dependence on the dynamics (a minimal ARE-based sketch follows).
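To make the model-based baseline concrete, here is a minimal sketch of the separation-principle solution for a hypothetical plant (the matrices A, B, C and the weights/covariances Q, R, W, V below are illustrative and not taken from the slides): one ARE yields the LQR feedback gain, the other yields the Kalman gain, and the observer-based controller combines them.

```python
# Model-based LQG via the separation principle: two Riccati equations.
# The plant and weights below are hypothetical, chosen only for illustration.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [2.0, -1.0]])   # open-loop unstable plant
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Q, R = np.eye(2), np.array([[1.0]])        # LQR weights
W, V = np.eye(2), np.array([[0.1]])        # process / measurement noise covariances

# Control ARE -> state-feedback gain; filter ARE -> Kalman gain.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)            # u = -K x_hat
S = solve_continuous_are(A.T, C.T, W, V)
Lk = S @ C.T @ np.linalg.inv(V)            # Kalman gain

# Observer-based (full-order) dynamical controller.
A_K = A - B @ K - Lk @ C
B_K = Lk
C_K = -K

# The closed loop over [plant state; controller state] should be Hurwitz.
A_cl = np.block([[A, B @ C_K], [B_K @ C, A_K]])
print("closed-loop eigenvalues:", np.linalg.eigvals(A_cl))
```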
6
Policy Optimization formulation
❑ Closed-loop dynamics
❑ Feasible region of the controller parameters
❑ Cost function (solution to Lyapunov equations; a cost-evaluation sketch follows the reference below)

Policy optimization for LQG via direct policy search; the optimization landscape analysis asks:
✓ Does it converge at all?
✓ Converge to which point?
✓ Convergence speed?

Hyland, David, and Dennis Bernstein. "The optimal projection equations for fixed-order
dynamic compensation." IEEE Transactions on Automatic Control 29.11 (1984): 1034-1037. 7
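As a sketch of how the LQG cost of a fixed dynamical controller can be evaluated through a closed-loop Lyapunov equation, consider the snippet below; the plant, noise covariances, weights, and controller matrices are hypothetical placeholders, and the code merely asserts closed-loop stability rather than guaranteeing it.

```python
# LQG cost of a fixed dynamical controller (A_K, B_K, C_K) via a closed-loop
# Lyapunov equation. All numerical data below are hypothetical placeholders.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Plant: dx = Ax + Bu + w,  y = Cx + v, with noise covariances W, V.
A = np.array([[-1.0, 1.0], [0.0, -2.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
W, V = np.eye(2), np.array([[0.05]])
Q, R = np.eye(2), np.array([[1.0]])

# A (hypothetical) dynamical controller: dxi = A_K xi + B_K y,  u = C_K xi.
A_K = np.array([[-2.0, 0.0], [0.0, -3.0]])
B_K = np.array([[1.0], [0.5]])
C_K = np.array([[0.1, 0.1]])

# Closed loop over the joint state (x, xi).
A_cl = np.block([[A, B @ C_K], [B_K @ C, A_K]])
assert np.max(np.linalg.eigvals(A_cl).real) < 0, "controller must be stabilizing"

# Noise covariance and cost weight of the joint state.
N_cl = np.block([[W, np.zeros((2, 2))], [np.zeros((2, 2)), B_K @ V @ B_K.T]])
Q_cl = np.block([[Q, np.zeros((2, 2))], [np.zeros((2, 2)), C_K.T @ R @ C_K]])

# Stationary covariance from A_cl Sigma + Sigma A_cl^T + N_cl = 0.
Sigma = solve_continuous_lyapunov(A_cl, -N_cl)
J = np.trace(Q_cl @ Sigma)
print("LQG cost of this controller:", J)
```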
Main questions
Policy optimization for LQG

▪ Q1: Connectivity of the feasible region


• Is it connected?
• If not, how many connected components can it have?
Non-convex ▪ Q2: Structure of stationary points of
Landscape • Are there spurious (strictly suboptimal, saddle)
Analysis stationary points?
• How to check if a stationary point is globally optimal?

8
Outline

❑ LQG problem Setup

❑ Connectivity of the Set of Stabilizing Controllers

❑ Structure of Stationary Points of the LQG cost

9
Connectivity of the feasible region
❑ Simple observation: non-convex and unbounded

Lemma 1: the set is non-empty, unbounded, and can be non-convex.

Example: one controller stabilizes the plant, and thus belongs to the set;
another fails to stabilize the plant, and thus lies outside the set.
10
Connectivity of the feasible region
❑ Main Result 1: dis-connectivity
Theorem 1: The set can be disconnected but has at most 2 connected components.

✓ This is different from the set of static stabilizing state-feedback controllers,
which is always connected!
✓ Is this a negative result for gradient-based algorithms? → No
11
Connectivity of the feasible region
❑ Main Result 2: dis-connectivity
Theorem 2: If the set has 2 connected components, then there is a smooth bijection T between
the 2 connected components that preserves the cost function value.

✓ In fact, the bijection T is defined by a similarity


transformation (change of controller state coordinates)

Positive news: For gradient-based local search


methods, it makes no difference to search over
either connected component.

12
Connectivity of the feasible region
❑ Main Result 3: conditions for connectivity
Theorem 3: 1) The set is connected if there exists a reduced-order stabilizing controller.
2) The sufficient condition above becomes necessary if the plant is single-input or
single-output.

Corollary 1: Given any open-loop stable plant, the set of stabilizing controllers is connected.

Example: Open-loop stable system

Routh–Hurwitz stability criterion

13
Connectivity of the feasible region
❑ Main Result 3: conditions for connectivity
Example: Open-loop unstable system (SISO)

Disconnected feasible region

• Routh–Hurwitz stability criterion

• Two path-connected components (a grid-based numerical sketch follows below)
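As a rough illustration of how such a region can be inspected numerically, the sketch below grids two parameters of a first-order controller for a hypothetical unstable SISO plant and tests closed-loop stability by eigenvalue checks; the plant and the fixed output gain are made up, and this toy instance is not claimed to reproduce the two components shown on the slide (the number of components depends on the plant).

```python
# Visualizing a 2-D slice of the set of stabilizing dynamic controllers by
# brute force. Hypothetical SISO plant xdot = x + u, y = x, with a first-order
# controller xi_dot = aK xi + bK y, u = cK xi, and the output gain cK fixed to 1.
import numpy as np
from scipy.ndimage import label

a_grid = np.linspace(-6.0, 2.0, 200)
b_grid = np.linspace(-6.0, 2.0, 200)
stable = np.zeros((len(a_grid), len(b_grid)), dtype=bool)

for i, aK in enumerate(a_grid):
    for j, bK in enumerate(b_grid):
        # Closed loop of the plant (A = B = C = 1) with the controller (cK = 1).
        A_cl = np.array([[1.0, 1.0], [bK, aK]])
        stable[i, j] = np.max(np.linalg.eigvals(A_cl).real) < 0

components, n_components = label(stable)
print("stabilizing grid points:", int(stable.sum()))
print("connected components on this slice:", n_components)
```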

14
Policy Optimization formulation
Policy optimization for LQG: non-convex landscape analysis

▪ Q1: Connectivity of the feasible region
• Is it connected? No
• If not, how many connected components can it have? Two

▪ Q2: Structure of stationary points of the LQG cost
• Are there spurious (strictly suboptimal, saddle) stationary points?
• How to check if a stationary point is globally optimal?
15
Outline

❑ LQG problem Setup

❑ Connectivity of the Set of Stabilizing Controllers

❑ Structure of Stationary Points of the LQG cost

16
Structure of Stationary Points
❑ Simple observations
1) The LQG cost is a real analytic function over its domain
(smooth, infinitely differentiable)
2) The LQG cost has non-unique and non-isolated global optima

Similarity transformation:
➢ The LQG cost is invariant under similarity transformations
(a numerical check is sketched below).
➢ It has many stationary points, unlike LQR, which has a unique
stationary point
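A quick numerical check of this invariance, assuming the Lyapunov-equation cost evaluation sketched earlier: evaluate the cost of a controller and of its similarity-transformed copy (T A_K T⁻¹, T B_K, C_K T⁻¹) and compare; all plant, noise, and controller data below are hypothetical.

```python
# Check that the LQG cost is invariant under a similarity transformation of the
# controller state. All numerical data are hypothetical placeholders.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 1.0], [0.0, -2.0]]); B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]]); W, V = np.eye(2), np.array([[0.05]])
Q, R = np.eye(2), np.array([[1.0]])

def lqg_cost(A_K, B_K, C_K):
    # Closed-loop LQG cost via one Lyapunov equation (controller must stabilize).
    A_cl = np.block([[A, B @ C_K], [B_K @ C, A_K]])
    assert np.max(np.linalg.eigvals(A_cl).real) < 0
    N_cl = np.block([[W, np.zeros((2, 2))], [np.zeros((2, 2)), B_K @ V @ B_K.T]])
    Q_cl = np.block([[Q, np.zeros((2, 2))], [np.zeros((2, 2)), C_K.T @ R @ C_K]])
    Sigma = solve_continuous_lyapunov(A_cl, -N_cl)
    return np.trace(Q_cl @ Sigma)

A_K = np.array([[-2.0, 0.0], [0.0, -3.0]]); B_K = np.array([[1.0], [0.5]])
C_K = np.array([[0.1, 0.1]])
T = np.array([[2.0, 1.0], [0.0, 1.0]]); Tinv = np.linalg.inv(T)

J1 = lqg_cost(A_K, B_K, C_K)
J2 = lqg_cost(T @ A_K @ Tinv, T @ B_K, C_K @ Tinv)   # transformed controller
print(J1, J2)   # identical up to numerical round-off
```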
17
Structure of Stationary Points
❑ Gradient computation
Lemma 2: For every controller in the feasible set, we have an explicit expression for the
gradient of the LQG cost, where the matrices appearing in it are the unique positive
semidefinite solutions to two Lyapunov equations.

What does the set of stationary points look like?
❑ Non-unique, non-isolated
❑ Local minimum, local maximum, saddle points, or globally minimum?
18
Structure of Stationary Points
❑ Main Result: existence of strict saddle points
Theorem 4: Consider any open-loop stable plant. The zero controller (B_K = 0, C_K = 0) with any stable A_K
is a stationary point. Furthermore, the corresponding Hessian is either indefinite (strict saddle
point) or equal to zero (high-order saddle or else).

Example: a concrete stationary point of this form; its cost function and Hessian
can be computed explicitly, and the Hessian is indefinite (it has eigenvalues of
both signs), so this stationary point is a strict saddle.
19
Structure of Stationary Points
❑ Main Result: existence of strict saddle points
Theorem 4: Consider any open-loop stable plant. The zero controller (B_K = 0, C_K = 0) with any stable A_K
is a stationary point. Furthermore, the corresponding Hessian is either indefinite (strict saddle
point) or equal to zero (high-order saddle or else).

Another example with zero Hessian.

What does the set of stationary points look like?
❑ Non-unique, non-isolated
❑ Strictly suboptimal points; strict saddle points
❑ All bad stationary points correspond to non-minimal controllers
20
Structure of Stationary Points
❑ Main Result
Theorem 5: All stationary points corresponding to controllable and
observable controllers are globally optimal.

[Diagram: local information (zero gradient) combined with structural information (minimality) yields a global optimality certificate.]

In particular, given a stationary point that is a minimal controller:

1) It is globally optimal, and the set of all global optima forms a manifold with 2 connected components.
(A generic minimality check is sketched below.)
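Verifying the hypothesis of Theorem 5 for a given stationary point amounts to a standard controllability/observability test on the controller realization; below is a generic sketch (the two example controllers are hypothetical).

```python
# Minimality check for a controller realization (A_K, B_K, C_K): the realization
# is minimal iff (A_K, B_K) is controllable and (C_K, A_K) is observable.
import numpy as np

def is_minimal(A_K, B_K, C_K, tol=1e-9):
    q = A_K.shape[0]
    ctrb = np.hstack([np.linalg.matrix_power(A_K, i) @ B_K for i in range(q)])
    obsv = np.vstack([C_K @ np.linalg.matrix_power(A_K, i) for i in range(q)])
    return (np.linalg.matrix_rank(ctrb, tol) == q and
            np.linalg.matrix_rank(obsv, tol) == q)

A_K = np.diag([-1.0, -2.0])
print(is_minimal(A_K, np.array([[1.0], [1.0]]), np.array([[1.0, 1.0]])))  # True
print(is_minimal(A_K, np.array([[1.0], [0.0]]), np.array([[1.0, 0.0]])))  # False: second mode unreachable/unobservable
```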

[Figures: an example with an open-loop unstable system and an example with an open-loop stable system.]
21
Structure of Stationary Points
❑ Implication
Corollary: Consider gradient descent iterations on the LQG cost.

If the iterates converge to a minimal controller, then this minimal controller is a global optimum.

More questions:
✓ Escaping saddle points?
✓ Convergence conditions?
✓ Convergence speed?
✓ Alternative model-free parameterization?

22
Comparison with LQR
Connectivity of feasible region
  ❖ LQR: always connected
  ❖ LQG: disconnected, but at most 2 connected components; they are almost identical to each other

Stationary points
  ❖ LQR: unique
  ❖ LQG: non-unique, non-isolated stationary points; spurious stationary points (strict saddles,
    non-minimal controllers); all minimal stationary points are globally optimal

Gradient descent
  ❖ LQR: gradient dominance, global fast convergence (like strictly convex)
  ❖ LQG: no gradient dominance; local convergence/speed unknown; many open questions

References
  ❖ LQR: Fazel et al., ICML, 2018; Malik et al., 2019; Mohammadi et al., IEEE TAC, 2020;
    Li et al., 2019; K. Zhang, B. Hu, and T. Başar, 2021; Furieri et al., 2019;
    Feiran Zhao & Keyou You, 2021, and many others
  ❖ LQG: Zheng*, Tang*, Li, 2021, link (* equal contribution) 23
Conclusions

24
Policy optimization for LQG control

❑ Much richer and more complicated than LQR


❑ Disconnected, but at most 2 connected components
❑ Non-unique, non-isolated stationary points, strict saddle points
❑ Minimal (controllable and observable) stationary points are globally optimal

25
Ongoing and Future work

❑ How to certify the optimality of a non-minimal stationary point


❑ Perturbed policy gradient (PGD) for escaping saddle points
❑ Quantitative analysis of PGD algorithms for LQG
❑ Alternative model-free parametrization of dynamical controllers (e.g., Makdah
& Pasqualetti, 2023; Zhao, Fu & You, 2022).

✓ Better optimization landscape structures, smaller dimension


❑ Nonconvex landscape of H∞ dynamic output-feedback control (Tang &
Zheng, 2023, https://arxiv.org/abs/2304.00753)

26
Analysis of the Optimization Landscape of Linear
Quadratic Gaussian (LQG) Control

Thank you for your attention!


Q&A
1. Y. Tang*, Y. Zheng*, and N. Li, “Analysis of the optimization landscape of Linear Quadratic Gaussian (LQG) control,”
Mathematical Programming, 2023. Available: https://arxiv.org/abs/2102.04393 *Equal contribution
2. B. Hu and Y. Zheng, "Connectivity of the feasible and sublevel sets of dynamic output feedback control with robustness
constraints," IEEE Control Systems Letters, 2022.
3. Y. Zheng*, Y. Sun*, M. Fazel, and N. Li. "Escaping High-order Saddles in Policy Optimization for Linear Quadratic Gaussian
(LQG) Control." CDC, 2022 https://arxiv.org/abs/2204.00912. *Equal contribution
Role of convex parameterization

Message : favorable landscape properties for nonconvex J can be


obtained from the convex parameterization under appropriate
conditions on the mapping
[Sun, F.,’21 ; Umenberger et al.’22 ; Hu et al.’23 survey]

Warm up : convex formulation for continuous-time LQR

min_{Z,L,P}  Tr(QP + ZR)
s.t.  AP + PA⊤ + BL + L⊤B⊤ + Σ = 0,
      P ≻ 0,   [ Z   L
                 L⊤  P ] ⪰ 0,

⇒   min_{Z,L,P}  f(L, P, Z)   s.t.  (L, P, Z) ∈ S,

where K∗ = L∗(P∗)⁻¹.
▶ further, K = LP⁻¹ parameterizes all stabilizing K ∈ K (a small SDP sketch follows below)
▶ also see [Mohammadi et al.’19]
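A small numerical sketch of this convex program, assuming cvxpy with the SCS solver and a made-up stabilizable pair; the sign convention here takes K = LP⁻¹ with closed loop A + BK, which matches the Lyapunov-type equality above but may differ from the u = −Kx convention used elsewhere in the tutorial.

```python
# Convex (SDP) formulation of continuous-time LQR and recovery of K = L P^{-1}.
# The double-integrator data below are illustrative.
import cvxpy as cp
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma = np.eye(2), np.array([[1.0]]), np.eye(2)
n, m = B.shape

P = cp.Variable((n, n), symmetric=True)
L = cp.Variable((m, n))
Z = cp.Variable((m, m), symmetric=True)

constraints = [
    A @ P + P @ A.T + B @ L + L.T @ B.T + Sigma == 0,
    P >> 1e-6 * np.eye(n),                       # P strictly positive definite
    cp.bmat([[Z, L], [L.T, P]]) >> 0,            # Schur complement: Z >= L P^{-1} L^T
]
prob = cp.Problem(cp.Minimize(cp.trace(Q @ P) + cp.trace(Z @ R)), constraints)
prob.solve(solver=cp.SCS)

K = L.value @ np.linalg.inv(P.value)             # recovered gain, K = L P^{-1}
print("optimal cost:", prob.value)
print("closed-loop eigenvalues:", np.linalg.eigvals(A + B @ K))
```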

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 1/9
Assumptions on parameterization map

min_K  J(K)   s.t.  K ∈ K        ⇒        min_{Z,L,P}  f(L, P, Z)   s.t.  (L, P, Z) ∈ S

Assumptions :
1. S is convex, f (L, P, Z ) is convex, bounded, differentiable on S.
2. we can express J(K) as

J(K) = min_{L,P,Z}  f(L, P, Z),   s.t.  (L, P, Z) ∈ S,   K = LP⁻¹.

more generally, K = LP −1 can be replaced by a surjective map


K = Φ(L, P) with “nicely behaved” first-order derivatives.

[Sun, F.,’21], [Umenberger et al.,’22]

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 2/9
Role of convex parameterization

min_K  J(K)   s.t.  K ∈ K        ⇒        min_{Z,L,P}  f(L, P, Z)   s.t.  (L, P, Z) ∈ S

Theorem (simplified) [Sun & F.,’21]


Under assumptions 1 and 2,
∇J (K ) = 0 ⇐⇒ K = K ∗ .
Also,
▶ If f is convex, ∥∇J(K)∥F ≳ J(K) − J(K∗).
▶ If f is µ-strongly convex, ∥∇J(K)∥F ≳ (µ(J(K) − J(K∗)))^{1/2}.

(≳ hides instance-dependent constants that depend on the system
parameters & the initial point K0; a numerical sanity check for the LQR case is
sketched below)
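This bound can be sanity-checked numerically on a small instance. The sketch below uses the standard continuous-time LQR policy-gradient expressions under the u = −Kx, closed-loop A − BK convention (which may differ in sign from the K = LP⁻¹ parameterization on this slide); the system matrices and the initial gain K0 are illustrative.

```python
# Numerical sanity check of gradient dominance for continuous-time LQR:
# compare ||grad J(K)||_F against J(K) - J(K*). Conventions: u = -Kx, closed
# loop A - BK; grad J(K) = 2 (R K - B^T P_K) Sigma_K is the standard expression
# from the LQR policy-gradient literature.
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.array([[1.0]]), np.eye(2)

def lqr_value_and_grad(K):
    Acl = A - B @ K
    assert np.max(np.linalg.eigvals(Acl).real) < 0, "K must be stabilizing"
    P_K = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))   # value matrix
    S_K = solve_continuous_lyapunov(Acl, -Sigma0)                 # state covariance
    J = np.trace(P_K @ Sigma0)
    grad = 2.0 * (R @ K - B.T @ P_K) @ S_K
    return J, grad

# Optimal cost from the ARE.
P_star = solve_continuous_are(A, B, Q, R)
J_star = np.trace(P_star @ Sigma0)

K0 = np.array([[1.0, 1.0]])                                       # a stabilizing gain
J0, g0 = lqr_value_and_grad(K0)
print("J(K0) - J* =", J0 - J_star)
print("||grad J(K0)||_F =", np.linalg.norm(g0))
```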

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 3/9
Proof picture

min_K  J(K)   s.t.  K ∈ K        ⇒        min_{Z,L,P}  f(L, P, Z)   s.t.  (L, P, Z) ∈ S

[Figure: the point K_t, its gradient ∇J(K_t), and the descent direction are mapped
between the policy space and the convex parameterization.]

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 4/9
Role of convex parameterization

A general version that applies to non-smooth J (K ) as well :

Theorem [Hu et al.,’23]


Suppose J(K) is differentiable or subdifferentially regular, and Assumptions 1 and 2 hold.
For any K satisfying J(K) > J(K∗), there exists a
non-zero V in the descent cone of K at K , such that
0 < J (K ) − J (K ∗ ) ≤ −J ′ (K , V ),
so any stationary point of J is a global minimum.

J ′ (K , V ) denotes directional derivative of J (K ) along direction V .


When J is differentiable, J ′ (K , V ) = Tr (V T ∇J (K )).

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 5/9
Example : Continuous time LQR

min_{Z,L,P}  f(L, P, Z) := Tr(QP) + Tr(ZR)

s.t.  A(P) + B(L) + Σ = 0,   G ≻ 0,
      [ Z   L⊤
        L   G ] ⪰ 0
Question : K = LP −1 , is P always invertible ? (yes, if initial x0 has
full-rank covariance)

L, P , P −1 are bounded in the sublevel set {K : J (K ) ≤ a}.

then : a ≥ J (K ) = Tr (QP ) + Tr (LP −1 L⊤ R ).

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 6/9
Example : Continuous time LQR

L, P, P⁻¹ are bounded in the sublevel set {K : J(K) < a}.

Define
ν = (λ²_min(Σ)/4) · ( σ_max(A) λ_min^{−1/2}(Q) + σ_max(B) λ_min^{−1/2}(R) )^{−2},
then
∥∇J(K)∥_F ≥ C1 (J(K) − J(K∗)),
where
C1 = ( ν λ_min^{1/2}(Q) λ_min^{1/2}(R) / (4a) ) · min{ a², ν λ_min(Q)/4 }.

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 7/9
Related Results

Many other landscape results rely on connections to LMIs


H∞ landscape : Clarke stationary is global [Guo et al., 2022]
Dynamic filtering : Differentiable convex lifting [Umenberger et
al., 2022]

LQG : Connectivity [Y. Zheng, 2023]


Output-feedback H∞ : Connectivity [Hu et al., 2022]

A general tool for landscape study. More study is needed for


output-feedback problems !

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 8/9
Future Work

The last section of our survey article lists several directions :

▶ Further connections between optimization and control theory,


e.g. complexity of escaping saddles for output feedback problems
▶ Advanced regularization for stability, robustness, and safety
▶ Nonlinear systems, deep RL, and perception-based control
▶ Multi-agent systems and decentralized control
▶ Integration of model-based and model-free methods
▶ New PO formulations from machine learning

And many more which are not listed in our article !

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 9/9
