L4DC PolicyOptTutorial2023


Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies

Maryam Fazel, Bin Hu, Kaiqing Zhang

Joint with Tamer Başar, Na Li, Mehran Mesbahi, Yang Zheng

L4DC Tutorial, Philadelphia, PA


June 14, 2023

Motivation

Data-guided decision-making for complex tasks in dynamical systems, e.g., game playing, robotics, networked systems, ...

Many recent successes via Reinforcement Learning

Motivation : Policy Optimization

▶ A workhorse of (deep) RL : (direct) policy optimization methods


▶ Robotic manipulation, locomotion, video games, ChatGPT, etc.

Policy Optimization – One Workhorse of (Deep) RL

Why is policy optimization popular ?


▶ Easy-to-implement & scalable to high-dimensional problems
▶ Enable model-free search for complex dynamics (e.g. with rich
contact) or rich observations (e.g. images)
▶ This tutorial : Does policy optimization have guarantees on
linear control benchmarks (e.g. LQR, LQG, H∞ control, etc) ?

What is Policy Optimization (PO) ?
▶ PO is an old idea from control : Fix the controller structure, and
optimize a control metric over the parameters of the controllers
min_K J(K)
▶ Parametrized controller/policy K
▶ Cost function J (tracking errors, closed-loop H2 /H∞ norms, etc)

▶ Policy gradient method : K ′ = K − α∇J (K )


▶ Example : Optimization-based PID Tuning K = [Kp , Ki , Kd ]⊤ ∈ R3

Credit : Astrom & Murray, 2020


History : Convex LMIs vs. PO

Key points :
▶ In the 1980s, convex optimization methods became dominant due to
strong global guarantees and efficient interior point methods
▶ PO problem formulation is generally not convex
▶ Reparameterize as convex optimization problems (one does not
optimize the controller parameters directly) ; Lyapunov theory,
stability/performance certificates, HJB, ...
▶ Examples of LMIs : state-feedback or full-order output-feedback
H2 / H∞ control
▶ e.g., Boyd et al., “Linear Matrix Inequalities in System and
Control Theory”, 1994, SIAM

History : Convex LMIs vs. PO

Historically, PO has been used for control problems that can’t be convexified, often with no theory
▶ Sometimes the plant order is unknown : PID tuning, feedback
iterative tuning, etc
▶ Static output feedback LQ control
▶ Fixed-order structured H∞ synthesis: HIFOO and Hinfstruct [Apkarian and Noll, ’06]
▶ Distributed control design : Martensson/Rantzer (’09)

In recent years, there is a new reason to revisit PO for classical control: it helps provide theory towards understanding model-free RL

A Modern Perspective from Deep RL
A common practice nowadays in deep RL for robotic control : visuomotor
policy learning/image-to-torque [Levine et al., ’16]
▶ A type of perception-based control : purely model-free
▶ Train perception and control systems jointly end-to-end

Advantages :
▶ Direct and relatively simple to implement
▶ Mitigates the compounding error of pipelines (as in model-based RL) that separately train perception and control
▶ Make better use of deep NNs’ abstraction and perception capabilities to
handle high-dimensional visual signals
Policy Optimization : Old & New
Vanilla policy gradient :
▶ Policy Gradient Theorem [Sutton et al., ’99]
∇J(K) = E[ Q_K(x, u) · ∇ log π_K(u | x) ]

▶ REINFORCE estimator [Williams ’92]: from N trajectories of length T, (x_{t,i}, u_{t,i}, c_{t,i}), i ∈ [N], t ∈ [T],

∇J(K) ≈ (1/N) Σ_{i=1}^{N} Σ_{t=1}^{T} ( Σ_{τ=t}^{T} c_{τ,i} ) · ∇ log π_K(u_{t,i} | x_{t,i}),

where Σ_{τ=t}^{T} c_{τ,i} is the accumulated cost and ∇ log π_K(u_{t,i} | x_{t,i}) is the score function (a minimal numerical sketch follows after these bullets)

▶ Other estimators: G(PO)MDP [Baxter & Bartlett, ’01], actor-critic [Konda &
Tsitsiklis, ’99], natural policy gradient [Kakade ’01] (will come back to it !)
▶ Essentially stochastic gradient descent (SGD) (heart of modern machine learning) !
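To make the estimator concrete, here is a minimal sketch (not from the tutorial) of the REINFORCE estimate for a Gaussian policy u ~ N(−Kx, σ²I) on a generic linear system; the matrices, policy class, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the REINFORCE estimator above, assuming a Gaussian policy
# u ~ N(-K x, sigma^2 I) on an illustrative linear system.
import numpy as np

def reinforce_gradient(A, B, Q, R, K, sigma=0.1, N=100, T=50, rng=None):
    rng = np.random.default_rng(rng)
    n, m = B.shape
    grad = np.zeros_like(K)
    for _ in range(N):
        x = rng.normal(size=n)
        xs, us, cs = [], [], []
        for _ in range(T):
            u = -K @ x + sigma * rng.normal(size=m)      # exploratory action
            cs.append(x @ Q @ x + u @ R @ u)             # stage cost c_t
            xs.append(x.copy()); us.append(u.copy())
            x = A @ x + B @ u
        cs = np.array(cs)
        for t in range(T):
            cost_to_go = cs[t:].sum()                    # sum_{tau >= t} c_tau
            # score function of the Gaussian policy w.r.t. K:
            # grad_K log pi_K(u|x) = -(u + K x) x^T / sigma^2
            score = -np.outer(us[t] + K @ xs[t], xs[t]) / sigma**2
            grad += cost_to_go * score
    return grad / N

A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
g_hat = reinforce_gradient(A, B, Q, R, K=np.array([[0.2, 0.3]]))
print(g_hat)
```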

Modern variants (benefit from the advances of optimization theory) :


▶ Deep deterministic PG (DDPG) [Silver et al., ’14], Trust-region PO (TRPO)
[Schulman et al., ’15], Proximal PO (PPO) [Schulman et al., ’17], soft actor-critic
(SAC) [Haarnoja et al., ’18], variance-reduced PG [Papini et al., ’18]...
▶ Default algorithm in OpenAI Gym, Dota 5v5, ChatGPT training – PPO

Missing Perspectives in Deep RL Literature

▶ Convergence guarantees: Nonconvex optimization in policy parameter spaces, e.g., weights of neural networks
▶ Sample efficiency guarantees : How many samples are needed ?
Polynomial in problem parameters ?
▶ Constraints : Stability and robustness of the closed-loop systems

Missing Perspectives in Classic Control Literature

[Figure: GD on a convex landscape vs. GD on a nonconvex landscape]

▶ Landscape : Is convexity really needed for optimization ?


▶ Finite-iteration/sample complexity : If an algorithm converges,
how fast and how many samples are needed ?

Tutorial Overview

Tutorial Overview : RL/Control → PO

This tutorial: Understanding policy optimization by examining guarantees on linear control benchmarks
▶ Start from simpler contexts and gain insights
▶ Classical control benchmarks (c.f. [Recht et al., ’17])
▶ Identify issues for establishing guarantees of PO for control
▶ Employ modern optimization perspective : iteration/sample
complexity, first-order & zeroth-order oracle models, etc

Big picture :
▶ One perspective to bridge control theory and RL
▶ Understand and connect “model-free” & “model-based” views
▶ Towards a general framework for learning-based control

Schedule

▶ Now-2:30pm: Preview and Some Optimization Background
▶ 2:30-3:00pm: PO Theory for LQR
▶ 3:00-3:30pm: PO Theory for Risk-sensitive & H2/H∞ Robust Control
▶ 3:30-4:00pm: Coffee Break
▶ 4:00-4:30pm: PO Theory for State-feedback H∞ Synthesis
▶ 4:30-5:00pm: PO Theory for LQG
▶ 5:00-5:15pm: Role of convex parameterization
▶ 5:15-5:30pm: Future work and Q&A/discussions

Preview : Big Picture
▶ Revisit linear control problems as benchmarks for PO

min_K J(K), s.t. K ∈ K

▶ Parametrized policy K (e.g. linear mapping, neural networks)


▶ Cost function J (tracking errors, closed-loop H2 /H∞ norms, etc)
▶ Constraint set K (stability, robustness, safety, etc)

▶ Policy gradient : K ′ = K − α∇J(K )


▶ The gradient ∇J can be estimated from data in a model-free manner
(policy gradient theorem or stochastic finite difference)
▶ For nonsmooth problems, replace the gradient with some subgradient

▶ Recent progress on PO theory (Nonconvexity is the key issue)


▶ Landscape : Is stationary global minimum ?
▶ Feasibility : Does the policy search stay in the feasible set K ?
▶ Global convergence & sample complexity
B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, T. Başar. Toward a theoretical
foundation of policy optimization for learning control policies, Annual Review
of Control, Robotics, and Autonomous Systems, 2023.
Preview : Linear Quadratic Regulator as PO

▶ Linear quadratic regulator (LQR) as PO: Consider x_{t+1} = A x_t + B u_t + w_t

min_K J(K) := lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} (x_t⊤ Q x_t + u_t⊤ R u_t) ],  s.t. K is stabilizing
▶ ut = −Kxt for gain matrix K
▶ K = {K : ρ(A − BK ) < 1} ; K a nonconvex constraint set
▶ PO theory for LQR
▶ Landscape : Feasible set is connected, and stationary is global
▶ Feasibility : The LQR cost is coercive and serves as a barrier on K
▶ Global convergence & sample complexity : Linear rate and finite
sample complexity via the gradient dominance/smoothness property
▶ Main Ref :
M. Fazel, R. Ge, S. Kakade, M. Mesbahi. Global convergence of policy
gradient methods for the linear quadratic regulator, ICML 2018.

Preview : Mixed H2 /H∞ Control as PO

▶ Mixed design: H∞ constraints are crucial for robustness

min_K J(K),  s.t. K is stabilizing and robust in the H∞ sense

▶ J(K ) is an upper bound on the H2 performance


▶ ut = −Kxt for gain matrix K
▶ K = {K : ρ(A − BK) < 1; ∥T(K)∥∞ < γ}; adds a robustness constraint
▶ γ → ∞ reduces to LQR
▶ PO theory for mixed design
▶ Key issue : The cost is not coercive ! How to maintain feasibility ?
▶ Fix : Implicit regularization via Natural policy gradient (NPG) and
Gauss-Newton
▶ Global sublinear convergence for NPG and Gauss-Newton
▶ Main Ref :
K. Zhang, B. Hu, T. Başar. Policy optimization for H2 linear control with
H∞ robustness guarantee : Implicit regularization and global convergence,
SIAM Journal on Control and Optimization (SICON), 2021.

Preview : H∞ State-Feedback Synthesis as PO

▶ H∞ state-feedback synthesis: x_{t+1} = A x_t + B u_t + w_t with x_0 = 0

min_K J(K),  s.t. K is stabilizing

▶ J = Σ_{t=0}^{∞} (x_t⊤ Q x_t + u_t⊤ R u_t) subject to the worst-case disturbance satisfying Σ_{t=0}^{∞} ∥w_t∥² ≤ 1
▶ u_t = −K x_t for gain matrix K
▶ K = {K : ρ(A − BK) < 1}
▶ J(K ) is the closed-loop H∞ norm (nonsmooth in K !)
▶ PO theory for H∞ state-feedback synthesis (Nonconvex and nonsmooth)
▶ Key issue : The cost may not be differentiable at important points
▶ Fix : Show that Clarke stationary points are global, and apply
Goldstein’s subgradient method to guarantee sufficient descent
▶ Global convergence : Goldstein’s subgradient method achieves global
convergence provably
▶ Main Ref :
X. Guo and B. Hu. Global convergence of direct policy search for
state-feedback H∞ robust control : A revisit of nonsmooth synthesis with
Goldstein subdifferential, NeurIPS 2022.

Preview : Linear Quadratic Gaussian as PO

▶ Linear quadratic Gaussian (LQG) is the partially observable variant of LQR, and can be treated as PO (more details later)

▶ PO theory for LQG


▶ Issue 1 : Feasible set may not be connected
▶ Issue 2 : Stationary may not be global
▶ Today’s talk : Some positive results and many open questions

▶ Main Ref :
Y. Zheng, Y. Tang, N. Li. Analysis of the optimization landscape of linear
quadratic Gaussian (LQG) control, Mathematical Programming, 2023.

Background : Optimization Theory

Optimization of Smooth Nonconvex Functions

Definition: A function J(K) is L-smooth if the following inequality holds for all (K, K′):

J(K′) ≤ J(K) + ⟨∇J(K), K′ − K⟩ + (L/2)∥K′ − K∥²_F.

The above definition is equivalent to ∇J being L-Lipschitz.

Complexity: The gradient descent method K^{n+1} = K^n − α∇J(K^n) is guaranteed to find an ϵ-stationary point of J within O(1/ϵ) steps:

J(K^{n+1}) ≤ J(K^n) + ⟨∇J(K^n), K^{n+1} − K^n⟩ + (L/2)∥K^{n+1} − K^n∥²_F
          = J(K^n) + (−α + Lα²/2)∥∇J(K^n)∥²_F.

Summing the above inequality from n = 0 to T:

(α − Lα²/2) Σ_{n=0}^{T} ∥∇J(K^n)∥²_F ≤ J(K^0) − J(K^{T+1})

Optimization of Smooth Nonconvex Functions
Complexity: The gradient descent method K^{n+1} = K^n − α∇J(K^n) is guaranteed to find an ϵ-stationary point of J within O(1/ϵ) steps:

(α − Lα²/2) Σ_{n=0}^{T} ∥∇J(K^n)∥²_F ≤ J(K^0) − J(K^{T+1})

If α < 2/L, then C = α − Lα²/2 > 0. We know J(K^{n+1}) ≥ J* for some J*. Hence

Σ_{n=0}^{T} ∥∇J(K^n)∥²_F ≤ (J(K^0) − J*)/C

=⇒ min_{0≤n≤T} ∥∇J(K^n)∥²_F ≤ (1/(T+1)) Σ_{n=0}^{T} ∥∇J(K^n)∥²_F ≤ (J(K^0) − J*)/(C(T+1)).

To find a point whose squared gradient norm is smaller than or equal to ϵ, we need to run T steps with

T = (J(K^0) − J*)/(Cϵ) − 1 = O(1/ϵ),

which is the complexity for finding an ϵ-approximate stationary point. (A minimal numerical sketch of this bound is given below.)
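A minimal numerical sketch of the argument above, run on a toy smooth nonconvex function (not a control problem); the test function, stepsize, and horizon are illustrative assumptions.

```python
# Check the O(1/eps) bound min_n ||grad J(K^n)||^2 <= (J(K^0) - J*)/(C (T+1))
# on a toy smooth nonconvex function.
import numpy as np

def J(K):
    return np.sum(0.5 * K**2 + np.cos(2 * K))

def grad_J(K):
    return K - 2 * np.sin(2 * K)

L = 5.0                       # |d^2/dk^2 (k^2/2 + cos 2k)| = |1 - 4 cos 2k| <= 5
alpha = 1.0 / L
C = alpha - L * alpha**2 / 2  # > 0 since alpha < 2/L

K = np.array([2.0, -1.5, 0.7])
J0, grad_norms = J(K), []
for _ in range(200):
    g = grad_J(K)
    grad_norms.append(np.sum(g**2))
    K = K - alpha * g

T = len(grad_norms) - 1
J_lower = -K.size             # k^2/2 + cos(2k) >= -1 per coordinate, a valid J* lower bound
print("min_n ||grad J(K^n)||^2      :", min(grad_norms))
print("(J(K^0) - J*) / (C (T+1))    :", (J0 - J_lower) / (C * (T + 1)))
```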
Optimization of Smooth Nonconvex Functions
Complexity: The gradient descent method K^{n+1} = K^n − α∇J(K^n) is guaranteed to find an ϵ-stationary point of J within O(1/ϵ) steps.

Convergence: The gradient descent method is guaranteed to converge to a stationary point eventually.

Question: What if we can show stationary is global?

Answer: Then the gradient descent method converges to a global minimum! We have J(K^n) → J*!

Take-away: Nonconvex optimization may not be that terrifying if stationary is global!
Gradient Dominance and Linear Rate to Global Minimum
Definition: A function J(K) is gradient dominant (with parameter µ > 0) if it is continuously differentiable and satisfies

J(K) − J(K*) ≤ (1/(2µ)) ∥∇J(K)∥²_F,  ∀K ∈ K.

Landscape : Stationary is global !
Complexity: The gradient descent method K^{n+1} = K^n − α∇J(K^n) is guaranteed to find an ϵ-optimal point of J within O(log(1/ϵ)) steps:

J(K^{n+1}) ≤ J(K^n) + (−α + Lα²/2)∥∇J(K^n)∥²_F
          ≤ J(K^n) − 2µ(α − Lα²/2)(J(K^n) − J*)
=⇒ J(K^{n+1}) − J* ≤ (1 − 2µα + µLα²)(J(K^n) − J*)
=⇒ J(K^T) − J* ≤ (1 − 2µα + µLα²)^T (J(K^0) − J*)

Running T steps with T = O(log(1/ϵ)) guarantees J(K^T) − J* ≤ ϵ. (A quick numerical check of this contraction factor is given below.)
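A quick numerical check of the contraction factor above on a quadratic objective, which is both L-smooth and gradient dominant; the Hessian and stepsize are illustrative assumptions.

```python
# Verify J(K^{n+1}) - J* <= (1 - 2 mu alpha + mu L alpha^2)(J(K^n) - J*) on a
# quadratic J(K) = 0.5 K^T H K, where mu = lambda_min(H), L = lambda_max(H), J* = 0.
import numpy as np

H = np.diag([1.0, 4.0, 10.0])                 # toy Hessian
mu, L = 1.0, 10.0
alpha = 1.0 / L
rho = 1 - 2 * mu * alpha + mu * L * alpha**2  # contraction factor from the slide (= 0.9 here)

K = np.array([1.0, -2.0, 3.0])
gap = 0.5 * K @ H @ K
for n in range(50):
    K = K - alpha * (H @ K)
    new_gap = 0.5 * K @ H @ K
    assert new_gap <= rho * gap + 1e-12       # linear (geometric) decrease of the gap
    gap = new_gap
print("final optimality gap:", gap)
```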

Coercive Functions and Compact Sublevel Sets
What if there are constraints ? If the cost is coercive, then it is a
barrier function by itself !

Definition: A function J(K) is coercive on K if for any sequence {K^l}_{l=1}^∞ ⊂ K we have J(K^l) → +∞ if either ∥K^l∥₂ → +∞, or K^l converges to an element on the boundary ∂K.
A Useful Result for Constrained Optimization

If J is coercive and twice continuously differentiable on K, we have


▶ The sublevel set Kγ := {K ∈ K : J (K ) ≤ γ} is compact.
▶ The function J (K ) is L-smooth on Kγ , and the constant L
depends on γ and the problem parameters.
▶ Suppose we run the GD method K^{n+1} = K^n − α∇J(K^n) initialized from K^0 ∈ K. Let L be the smoothness parameter for the sublevel set K_{J(K^0)}. Then GD finds an ϵ-approximate stationary point within O(1/ϵ) steps with α = 1/L.
▶ If J is gradient dominant with parameter µ, then GD achieves a linear convergence rate: J(K^T) − J* ≤ (1 − 2µα + µLα²)^T (J(K^0) − J*).

PO Theory for LQR

Linear quadratic theory – background

Standard LQR problem (discrete-time, infinite horizon): linear dynamics

x_{t+1} = A x_t + B u_t

with given initial state x_0, choose a control sequence u_0, u_1, . . . , u_t, . . . in order to minimize the total cost

Σ_{t=0}^{∞} (x_t⊤ Q x_t + u_t⊤ R u_t)

with given cost matrices Q, R ≻ 0.

Linear quadratic theory

Classical solution via dynamic programming (when A, B are known and (A, B) is stabilizable): solve the algebraic Riccati equation (for P)

P = Q + A⊤PA − (A⊤PB)(R + B⊤PB)⁻¹(B⊤PA)

then let

u_t = −K* x_t = −(R + B⊤PB)⁻¹(B⊤PA) x_t

▶ a “go-to” model-based control design (since Kalman in the 60’s)
▶ extensive theory and computational methods for solving the Riccati equation (Laub; Kleinman ’68; Hewer ’71)

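A minimal sketch (with illustrative matrices) of solving this algebraic Riccati equation by fixed-point iteration and recovering K*; this mirrors the classical Riccati recursion rather than any specific solver cited above.

```python
# Solve the discrete ARE above by fixed-point (value) iteration and recover K*.
import numpy as np

A = np.array([[1.0, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [0.5]])
Q, R = np.eye(2), np.eye(1)

P = Q.copy()
for _ in range(500):
    # P_{k+1} = Q + A^T P_k A - A^T P_k B (R + B^T P_k B)^{-1} B^T P_k A
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = Q + A.T @ P @ A - A.T @ P @ B @ G
    if np.max(np.abs(P_next - P)) < 1e-12:
        break
    P = P_next

K_star = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # u_t = -K* x_t
print("K* =", K_star)
print("closed-loop spectral radius:", max(abs(np.linalg.eigvals(A - B @ K_star))))
```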
Value and policy iterations

The solution of the ARE determines the value matrix:

min_u J(x_0, u) = x_0⊤ P* x_0

One can develop an iteration on P s.t. P → P*, then recover the optimal control policy (this would be called value iteration).

PO for LQR, on the other hand, would directly update K, e.g.,

K^{n+1} = K^n − η dK

where dK is some sort of gradient update and η is a (possibly time-varying) stepsize; this is a first-order method.

this tutorial: can we develop direct PO methods with guarantees for some typical control synthesis problems?

Direct policy optimization

towards writing LQR as “J (K )” ...

First, note that when A is Schur stable, the series Σ_t (A⊤)^t Q A^t → P converges, where P = A⊤PA + Q.

So with stabilizing feedback in place, the LQR cost for the dynamics

x_{t+1} = (A − BK) x_t

with an initial condition x_0 can be written as x_0⊤ P_K x_0, where

P_K = (A − BK)⊤ P_K (A − BK) + K⊤RK + Q.

So in this case, LQR is really optimizing

min_{P,K} trace(P Σ_0)   (with Σ_0 = x_0 x_0⊤)
s.t. P = (A − BK)⊤ P (A − BK) + K⊤RK + Q

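A minimal sketch of evaluating J(K) = trace(P_K Σ_0) by solving the Lyapunov equation above; the system matrices, Σ_0, and the test gain are illustrative assumptions.

```python
# Evaluate the LQR cost of a given gain K via the Lyapunov equation above.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[1.0, 0.2], [0.0, 0.9]])
B = np.array([[0.0], [0.5]])
Q, R = np.eye(2), np.eye(1)
Sigma0 = np.eye(2)                          # covariance of the initial state x0

def lqr_cost(K):
    Acl = A - B @ K
    if max(abs(np.linalg.eigvals(Acl))) >= 1:
        return np.inf                       # K not stabilizing: infinite cost
    # solve P_K = Acl^T P_K Acl + Q + K^T R K
    PK = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    return np.trace(PK @ Sigma0)

print(lqr_cost(np.array([[0.1, 0.5]])))
```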
However, as stated, this problem is a bilinear matrix
optimization ...

When K ∈ S (the set of stabilizing K), the equation

P = (A − BK)⊤ P (A − BK) + K⊤RK + Q

has a unique solution P(K); hence, the LQR problem can be written (for a given Σ) as

min_{K ∈ S} J(K)

We take x_0 ∼ D(0, Σ_0) where Σ_0 is a full-rank covariance (equivalently, we can take Σ to correspond to a spanning set of initial conditions); thus J is a real analytic function over its domain.

Questions

Consider now PO algorithms :


▶ iterate on policy K ,
▶ using gradient of cost, ∇J (K ) (exact or approximate)

▶ does GD (with exact gradients) converge? under what assumptions? does it converge to the global optimum K*?
▶ rate of convergence ?
▶ how about related algorithms, e.g., “natural gradient” descent ?
▶ “model-free” version : if gradients not available, would sampling
J (K ) work ? finite-sample complexity ?
note : challenging as J (K ) is not convex

LQR and policy gradient methods

▶ Consider LQR without state noise (for simplicity), random initial condition x_0 ∼ D
▶ let J (K ) be the cost as function of policy K
▶ define covariance matrices:

Σ_K = E[ Σ_{t=0}^{∞} x_t x_t⊤ ],   Σ_0 = E[ x_0 x_0⊤ ]

▶ consider algorithms:
Gradient descent: K ← K − η∇J(K)
Natural GD: K ← K − η∇J(K) Σ_K⁻¹   [Kakade ’01]

Oracles

Suppose algorithms have only oracle access to the model (A, B not known explicitly), e.g.,
▶ exact gradient oracle: ∇J(K)
▶ “approximate gradient” oracle: use sample values of J(K)
i.e., first-order and zeroth-order oracles in optimization

Optimization landscape I

J(K) = ⟨Σ_0, P_K⟩ = vec(Σ_0)⊤ (I − (A − BK) ⊗ (A − BK))⁻¹ vec(Q + K⊤RK)

observation: J(K) is not convex in K (nor quasiconvex, nor star-convex) for n ≥ 3 (convex for the single-input case when n = 2); we can compute the gradient

∇J(K) = 2(RK − B⊤P_K A_K)Σ_K,  where A_K := A − BK,  Σ_K = E[ Σ_{t=0}^{∞} x_t x_t⊤ ],  Σ_0 = E[ x_0 x_0⊤ ]

lemma: if ∇J(K) = 0, then either
▶ K is optimal, or
▶ the covariance Σ_K is rank deficient.
if Σ_0 is full rank, then Σ_K is full rank =⇒ stationary points are globally optimal
(one can also show this via the convex LMI reformulation, though the proofs are no simpler; later research has explored this deeper connection and its uses)
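A minimal sketch of exact policy gradient descent for LQR using the gradient expression above, with a backtracking step so the iterates stay stabilizing (the coercive cost acts as a barrier); the system matrices, initial gain, and stepsize rule are illustrative assumptions, not the exact algorithm analyzed in the reference.

```python
# Exact policy gradient descent for LQR with a simple backtracking line search.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.eye(1), np.eye(2)

def cost_and_grad(K):
    Acl = A - B @ K
    if max(abs(np.linalg.eigvals(Acl))) >= 1:
        return np.inf, None
    PK = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)      # P_K
    SigmaK = solve_discrete_lyapunov(Acl, Sigma0)             # Sigma_K
    grad = 2 * (R @ K - B.T @ PK @ Acl) @ SigmaK              # 2(RK - B^T P_K A_K) Sigma_K
    return np.trace(PK @ Sigma0), grad

K = np.zeros((1, 2))                                          # stabilizing initial gain
for _ in range(300):
    J, G = cost_and_grad(K)
    eta = 0.1
    while cost_and_grad(K - eta * G)[0] > J - 0.5 * eta * np.sum(G**2):
        eta /= 2                                              # backtrack until sufficient decrease
    K = K - eta * G
print("learned K:", K, "cost:", cost_and_grad(K)[0])
```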
Optimization landscape II

lemma: Suppose Σ_0 is full rank. Then

J(K) − J(K*) ≤ ( ∥Σ_{K*}∥ / (σ_min(Σ_0) σ_min(R)) ) ∥∇J(K)∥²,

i.e., J(K) is gradient-dominated ([Polyak ’63], ...)

Main Theory : Summary

[Fazel, Ge, Kakade, Mesbahi 2018] for discrete-time LQR


▶ J(K ) is generally not convex/quasiconvex/star-convex
▶ convex combination of two stabilizing K ’s may not stabilize
▶ coerciveness : LQR cost is coercive on the set of stabilizing policies
▶ LQR cost is gradient dominant
▶ hence, gradient descent converges to K ∗ from any stabilizing initial
K0 ! (with a linear rate)
▶ similarly for related algorithms, e.g., natural policy gradients and
policy iteration algorithm

Global convergence in the exact case

Theorem 1: Suppose J(K_0) is finite (i.e., K_0 is stabilizing) and Σ_0 is full rank. With stepsize η chosen appropriately, and the number of iterations N chosen as
▶ for natural policy GD:

N ≥ (∥Σ_{K*}∥ / σ_min(Σ_0)) · ( ∥R∥/σ_min(R) + ∥B∥² J(K_0)/(σ_min(Σ_0) σ_min(R)) ) · log( (J(K_0) − J(K*)) / ϵ ),

▶ for GD:

N ≥ (∥Σ_{K*}∥ / σ_min(Σ_0)) · log( (J(K_0) − J(K*)) / ϵ ) · poly(everything else),

then K_N has cost ϵ-close to the optimum.

Numerical experiment

[Figure. Left: Gradient descent for continuous-time LQR (error ∥K − K*∥_F vs. iteration). Right: 2-dim projection of the LQ cost contour.]

Unknown model case

Gradient descent: K ← K − η ∇̂J(K)
Natural policy GD: K ← K − η ∇̂J(K) Σ̂_K⁻¹

▶ we do not know (or directly learn) A, B


▶ but have the ability to explore by perturbing K
▶ model-free estimation : add Gaussian noise to actions during
rollouts
▶ similar to zeroth order (derivative-free) optimization
▶ issues : how much noise ? length of rollouts ? overall sample
complexity ?

Algorithm

Input: K, # trajectories m, rollout length ℓ, smoothing radius r, dimension d

for i = 1, · · · , m:
▶ draw U_i uniformly at random from {U : ∥U∥_F ≤ r}
▶ sample policy K̂_i = K + U_i
▶ simulate K̂_i for ℓ steps starting from x_0 ∼ D
▶ get empirical estimates

Ĉ_i = Σ_{t=1}^{ℓ} c_t,   Σ̂_i = Σ_{t=1}^{ℓ} x_t x_t⊤

where c_t, x_t are the costs and states on this trajectory
end for

use the following estimates for PGD/NPGD:

∇̂J(K) = (1/m) Σ_{i=1}^{m} (d/r²) Ĉ_i U_i,   Σ̂_K = (1/m) Σ_{i=1}^{m} Σ̂_i

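A minimal sketch of the zeroth-order estimator above; one common variant (assumed here) draws the perturbation U on the sphere of radius r. The dynamics, rollout length, and smoothing radius are illustrative.

```python
# Zeroth-order estimates of the gradient and state covariance for LQR.
import numpy as np

A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def zeroth_order_estimates(K, m=200, ell=50, r=0.05, rng=None):
    rng = np.random.default_rng(rng)
    d = K.size
    grad_hat = np.zeros_like(K)
    Sigma_hat = np.zeros((A.shape[0], A.shape[0]))
    for _ in range(m):
        U = rng.normal(size=K.shape)
        U *= r / np.linalg.norm(U)                 # random direction with ||U||_F = r
        Ki = K + U
        x = rng.normal(size=A.shape[0])            # x0 ~ D
        C_i, Sigma_i = 0.0, np.zeros_like(Sigma_hat)
        for _ in range(ell):
            u = -Ki @ x
            C_i += x @ Q @ x + u @ R @ u           # accumulated rollout cost
            Sigma_i += np.outer(x, x)
            x = A @ x + B @ u
        grad_hat += (d / r**2) * C_i * U           # one-point estimator term
        Sigma_hat += Sigma_i
    return grad_hat / m, Sigma_hat / m

g_hat, Sigma_hat = zeroth_order_estimates(np.array([[0.1, 0.2]]))
print(g_hat)
```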
Idea for the Proof of Convergence

1. Prove that when the rollout length ℓ is large enough, the cost function C and covariance Σ are close to the infinite-horizon quantities
2. Show with enough samples, alg can estimate gradient and
covariance matrix within the desired accuracy
3. Show GD and NPGD converge with a similar rate, despite
bounded perturbations in gradient/natural gradient estimates

Suppose J(K_0) is finite, µ > 0, and x_0 has norm bounded by L almost surely; GD and NPGD are called with parameters

m, ℓ, 1/r = poly( J(K_0), 1/µ, 1/σ_min(Q), ∥A∥, ∥B∥, ∥R∥, 1/σ_min(R), d, 1/ϵ, L²/µ ).

▶ NPGD: for stepsize η = 1 / ( ∥R∥ + ∥B∥² J(K_0)/µ ) and

N ≥ (∥Σ_{K*}∥/µ) · ( ∥R∥/σ_min(R) + ∥B∥² J(K_0)/(µ σ_min(R)) ) · log( 2(J(K_0) − J(K*)) / ϵ ),

with high probability NPGD satisfies J(K_N) − J(K*) ≤ ϵ.
▶ GD: for an appropriate stepsize

η = poly( µσ_min(Q)/J(K_0), 1/∥A∥, 1/∥B∥, 1/∥R∥, σ_min(R) )

and

N ≥ (∥Σ_{K*}∥/µ) · log( (J(K_0) − J(K*)) / ϵ ) · poly( J(K_0)/(µσ_min(Q)), ∥A∥, ∥B∥, ∥R∥, 1/σ_min(R) ),

with high probability GD satisfies J(K_N) − J(K*) ≤ ϵ.
Related Results
A burst of recent research interest :

LQR, continuous-time : [Mohammadi et al., 2019], [Bu et al., 2020]


LQR, discrete-time : [Fazel et al., 2018], [Bu et al., 2019]
Stabilization : [Perdomo et al., 2021]
Decentralized finite-horizon LQR under QI : [Furieri et al., 2020]
LQ games : [Zhang et al., 2019], [Bu et al., 2019], [Mazumdar et al.,
2019], [Hambly et al., 2021]
Markov jump linear systems : [Jansch-Porto et al, 2020]
Output estimation with differentiable convex liftings (DCL)
framework : [Umenberger et al, 2022]
H∞ control : [Tang and Zheng, 2023]
Fundamental limits of policy gradient : [Ziemann et al, 2022]
And many other variants ! (See our survey article)
B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, T. Başar. Toward a theoretical
foundation of policy optimization for learning control policies, Annual Review of
Control, Robotics, and Autonomous Systems, Vol. 6, pp. 123-158, 2023.
Coming up ...

▶ 3:00-3:30pm: PO Theory for Risk-sensitive & H2/H∞ Robust Control
▶ 3:30-4:00pm: Coffee Break
▶ 4:00-4:30pm: PO Theory for State-feedback H∞ Synthesis
▶ 4:30-5:00pm: PO Theory for LQG
▶ 5:00-5:15pm: Role of convex parameterization
▶ 5:15-5:30pm: Future work and Q&A/discussions

Policy Optimization for Risk-Sensitive Control,
H2 /H∞ Robust Control, and Linear Quadratic Games

Kaiqing Zhang

University of Maryland, College Park

5th Annual Learning for Dynamics & Control Conference (L4DC)

University of Pennsylvania, PA June 14, 2023


General Background: Recall
▶ Reinforcement learning (RL) has achieved tremendous empirical
successes, including in some continuous-space control tasks
▶ Game of Go, video games, robotics, etc.¹

▶ A resurgence of interest in the theoretical understandings of RL


¹ Source: Google images
Motivation

▶ Most scenarios involve more than one agent, e.g., game of Go, video
games, robot team motion-planning, autonomous driving
▶ Most control systems are safety-critical, e.g., robotics, unmanned
(aerial) vehicles, cyber-physical systems

Goal: Provable RL with robustness and multi-agency considerations


Background
Theoretical Understanding of Policy Optimization

▶ One workhorse in RL: Direct policy search/policy optimization


▶ Whether, where, how fast, PO methods converge?
▶ Nonconvex in policy parameter space
▶ Let’s start with benchmark RL/control tasks before deep RL?
▶ PO for linear quadratic regulator (LQR) (and variants) has been extensively studied recently [Fazel et al., ′18][Tu & Recht, ′19][Malik et al., ′19][Bu et al., ′19][Mohammadi et al., ′19][Gravell et al., ′19][Li et al., ′19][Furieri et al., ′19]...

All models are wrong, so is LQR =⇒ robustness concern is critical

▶ Question: whether & how PO methods can address benchmark


control/RL with robustness/risk-sensitivity concerns?
▶ Side motivation: multi-agent RL (will come back later)
PO for RL/Control =⇒ Optimization
PO for RL/Control =⇒ (Constrained Nonconvex) Opt.

▶ PO for RL/Control can be generally written as

min J (K ), s.t. K ∈ K
K
▶ Parametrized policy/controller K
▶ Objective to optimize J
▶ Constraint set K (important but sometimes implicit!)

▶ Linear quadratic regulator (LQR) as an example

min_K J(K) := Σ_{t=0}^{∞} E[ x_t⊤ Q x_t + u_t⊤ R u_t ],  s.t. K is stabilizing
▶ ut = −Kxt for gain matrix K
▶ K = {K | ρ(A − BK ) < 1}; K a nonconvex constraint set
▶ Other examples of K: boundedness of K ’s norm, probability simplex,
safety-constraint on states, etc.
Robustness Constraint on K
▶ Beyond stability, robustness is a core topic in control theory
▶ Need a controller robust to disturbance/model-uncertainty

▶ G –∆ model covers many robustness considerations


▶ Parametric uncertainty ∆A, ∆B in A, B (most popular in recent ML
for control literature)
▶ Time-varying parameters At , Bt
▶ Time-varying delay ut = −Kxt−τ
▶ Even dynamical uncertainty (unknown model-order)
▶ Robustness constraint K?
Questions

▶ How to enforce/maintain robustness for policy-optimization RL methods during learning?

▶ What are the global convergence guarantees, if any, of PO methods in learning for robust control?
Problem Statement
Starting Point: LEQG [Jacobson ′ 73]
▶ LQR/LQG =⇒ linear exponential quadratic Gaussian (LEQG)
▶ Simple but benchmark risk-sensitive control setting
▶ Linear system dynamics:

xt+1 = Axt + But + wt ,

system state xt ∈ Rd with x0 ∼ N (0, X0 ), noise wt ∼ N (0, W )


▶ One-stage cost c(x, u) = x⊤Qx + u⊤Ru, with objective

min J := lim_{T→∞} (2/(Tβ)) log E[ exp( (β/2) Σ_{t=0}^{T−1} c(x_t, u_t) ) ],  c(x_t, u_t) = x_t⊤ Q x_t + u_t⊤ R u_t

▶ Intuition: by Taylor expansion around β = 0,

J ≈ lim_{T→∞} (1/T) { E[ Σ_{t=0}^{T−1} c(x_t, u_t) ] + (β/4) Var[ Σ_{t=0}^{T−1} c(x_t, u_t) ] } + O(β²).
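A minimal sketch (illustrative system, gain, horizon, and β) that estimates the finite-horizon risk-sensitive objective from sampled trajectories and compares it with the mean-plus-variance approximation above.

```python
# Monte Carlo estimate of the risk-sensitive objective vs. mean + (beta/4)*variance.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.1, 0.3]])
beta, T, n_traj = 0.02, 100, 2000

totals = []
for _ in range(n_traj):
    x = np.zeros(2)
    total = 0.0
    for _ in range(T):
        u = -K @ x
        total += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + rng.normal(scale=0.1, size=2)   # w_t ~ N(0, 0.01 I)
    totals.append(total)
totals = np.array(totals)

J_risk = (2 / (T * beta)) * np.log(np.mean(np.exp(0.5 * beta * totals)))
J_approx = (np.mean(totals) + 0.25 * beta * np.var(totals)) / T
print("risk-sensitive objective:", J_risk, "  mean + (beta/4)*variance:", J_approx)
```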
Starting Point: LEQG

▶ Optimal controller is LTI state-feedback [ZHB, ′ 21], conjectured in


[Glover and Doyle, ′ 88]

µt (x0:t , u0:t−1 ) = −K ∗ xt

▶ See more results on LEQG specifically in [ZHB, ′ 21]


▶ Implicit robustness constraint in LEQG

Lemma (Glover and Doyle ′88)
The feasible set of J(K) is the 1/√β-sublevel set of the H∞-norm of T(K), i.e.,

{ K : K stabilizing; ∥T(K)∥∞ < 1/√β }.
Bigger Picture: Mixed H2 /H∞ Control

▶ Linear dynamical system:

x_{t+1} = A x_t + B u_t + D w_t,   z_t = C x_t + E u_t

▶ H∞-norm: the ℓ2 → ℓ2 operator norm from {w_t} to {z_t}:

∥T(K)∥∞ := sup_{θ∈[0,2π)} λ_max^{1/2}( [T(K)(e^{−jθ})]⊤ T(K)(e^{jθ}) )
Bigger Picture: Mixed H2 /H∞ Control

▶ Solve

min_K J(K)  s.t.  ρ(A − BK) < 1 and ∥T(K)∥∞ < γ   (this defines K)

▶ J(K) upper-bounds the H2-norm:

J(K) = Tr(P_K DD⊤),  or  J(K) = −γ² log det(I − γ⁻² P_K DD⊤),

where P_K solves a Riccati equation given K:

P_K = Ã_K⊤ P_K Ã_K + Ã_K⊤ P_K D (γ²I − D⊤P_K D)⁻¹ D⊤ P_K Ã_K + C⊤C + K⊤RK,  with Ã_K = A − BK
▶ If γ → ∞, then J (K ) → H2 -norm, e.g., it reduces to LQR
Why an Important & Interesting Model?
▶ Intuition: Small gain theorem – if ∥T (K )∥∞ < γ and
∥∆∥ℓ2 →ℓ2 < 1/γ, then G –∆ is input-output stable
▶ Choosing

K = {K | ρ(A − BK ) < 1; ∥T (K )∥∞ < γ}

=⇒ certain level of robust stability


▶ γ → ∞ reduces to stability region in LQR

▶ The second choice of J(K) with D = W^{1/2} and γ = 1/√β coincides with the risk-sensitive LEQG objective
▶ It also unifies maximum-entropy/H∞ control, LQG/H∞ control
[Glover and Doyle, ′ 88; Mustafa, ′ 89]; γ-level disturbance attenuation
[Başar, ′ 91]; and zero-sum LQ dynamic games (come to it later)
[Jacobson, ′ 73; Başar and Bernhard, ′ 95]. Also used for solving
H∞ -optimal control
▶ Also used in Economics to model sequential decision-making under
model uncertainty [Hansen and Sargent, ′ 08]
Algorithms and Landscape
Policy Gradient Algorithms
▶ Following the naming convention in [Fazel et al., ′ 18] for LQR

Policy Gradient:
K′ = K − η · ∇J(K) = K − 2η · [ (R + B⊤P̃_K B)K − B⊤P̃_K A ] · ∆_K

Natural PG:
K′ = K − η · ∇J(K) · ∆_K⁻¹ = K − 2η · [ (R + B⊤P̃_K B)K − B⊤P̃_K A ]

Gauss-Newton:
K′ = K − η · (R + B⊤P̃_K B)⁻¹ · ∇J(K) · ∆_K⁻¹ = K − 2η · [ K − (R + B⊤P̃_K B)⁻¹ B⊤P̃_K A ]

▶ Recall that P̃_K is the solution to the Riccati equation dictated by K, and ∆_K is the solution to another (dual) Riccati equation (a minimal sketch of these updates in the LQR limit is given below)
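A minimal sketch of the three updates in the LQR limit γ → ∞, where (as assumed here) P̃_K reduces to the standard Lyapunov solution and ∆_K reduces to the closed-loop covariance Σ_K with DD⊤ = I; matrices and stepsizes are illustrative, and this is not the general H∞-constrained case.

```python
# The three policy update rules above, specialized to the LQR limit.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.8, 0.1], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

def pg_quantities(K):
    Acl = A - B @ K
    PK = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)       # plays the role of P~_K
    SigmaK = solve_discrete_lyapunov(Acl, np.eye(2))           # plays the role of Delta_K
    EK = (R + B.T @ PK @ B) @ K - B.T @ PK @ A                 # (R + B^T P B)K - B^T P A
    return PK, SigmaK, EK

def step(K, eta, method="gauss_newton"):
    PK, SigmaK, EK = pg_quantities(K)
    if method == "policy_gradient":
        return K - 2 * eta * EK @ SigmaK
    if method == "natural_pg":
        return K - 2 * eta * EK
    return K - 2 * eta * np.linalg.solve(R + B.T @ PK @ B, EK)  # Gauss-Newton

K = np.zeros((1, 2))
for _ in range(30):
    K = step(K, eta=0.5)       # eta = 1/2: the Gauss-Newton update becomes policy iteration
print("K after Gauss-Newton steps:", K)
```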
Landscape

Lemma (Nonconvexity)
There is a mixed H2 /H∞ control problem that is nonconvex for policy
optimization.

Lemma (Non-Coercivity)
There is a mixed H2 /H∞ control problem whose cost function J (K ) is
non-coercive. Particularly, as K → ∂K, J (K ) does not necessarily
approach infinity.

▶ Coercivity is key in the LQR analysis [Fazel et al., ′18][Bu et al., ′19][Malik et al., ′19][Mohammadi et al., ′19]

▶ Ensures any descent direction to be feasible


▶ Can confine everything to the sub-level set (reduces to standard
smooth optimization)
Convergence
Landscape Illustration

[Figure: landscape illustration for LQR vs. mixed H2/H∞ control]

▶ Descent in J(K) does not necessarily ensure feasibility/robust stability
▶ How to enforce K ∈ K during iteration?
Implicit Regularization

▶ Regularization: iterate K remains inside K, i.e., robustly stable


▶ Can be made explicit via projection onto K. But K is nonconvex,
and defined in frequency domain
▶ Implicit regularization:
▶ The convergence of certain algorithms behaves as if certain
regularization is used
▶ Borrowed from the machine learning literature: observed in learning overparametrized neural nets/nonlinear models [Kubo et al., ′19][Azizan et al., ′19], phase retrieval and matrix completion [Ma et al., ′17], etc., with (stochastic) gradient (mirror) descent
▶ Property of both the nonconvex problem and algorithm
▶ Gauss-Newton and natural PG enjoy implicit regularization!
Theory: Implicit Regularization

Theorem (Implicit Regularization)


Suppose the stepsizes η satisfy:
▶ Gauss-Newton: η ≤ 1/2,
▶ Natural policy gradient: η ≤ 1/(2∥R + B ⊤ PeK0 B∥).
Then, K ∈ K =⇒ K ′ ∈ K.

▶ General descent directions of J(K) may not work, but certain directions do
Implicit Regularization: Proof Idea

▶ Linear matrix inequalities (LMIs)-based approach


▶ A new use of Bounded Real Lemma [Başar & Bernhard, ′ 95][Zhou et
al., ′ 96][Dullerud & Paganini, ′ 00]:
K ∈ K ⇐⇒ Riccati equation ⇐⇒ strict Riccati inequality (RI)
▶ Observation: two consecutive iterates K → K ′ are related – previous
PK is a candidate for the non-strict RI under K ′ to hold
▶ Perturb PK in a certain way =⇒ strict RI
▶ Perturb P = PK + αP̄ for small enough α > 0, where P̄ > 0 solves
the Lyapunov equation as before, with C ⊤ C + K ⊤ RK replaced by −I
Theory: Global Convergence

Theorem (Global Convergence)


Suppose K0 ∈ K, then both the Gauss-Newton and natural PG updates converge
to the global optimum K ∗ = (R + B ⊤ PeK B)−1 B ⊤ PeK A with O(1/N) rate.
Theory: Local (Super-)Linear Rates

▶ Much faster rates around the global optimum

Theorem (Local Faster Rates)


Suppose the conditions above hold, and additionally DD ⊤ > 0. Then, both the
Gauss-Newton and natural PG updates converge to the optimal control gain K ∗
with locally linear rate. In addition, if η = 1/2, the Gauss-Newton update
converges to K ∗ with locally Q-quadratic rate.

▶ Gradient domination (Polyak-Lojasiewicz) property holds locally


▶ Q-quadratic rate mirrors that of policy iteration for LQR [Lanchaster
and Rodman ′ 95]
Simulations
Simulations

▶ Initialization near the boundary ∂K; infinitesimal stepsize η for PG


[Figure: three panels vs. iteration. Left: average squared gradient norm. Middle: J(K) − J(K*). Right: H∞-norm ∥T(K)∥∞.]


Simulations: Global Convergence

▶ Escaping suboptimal stationary points


Simulations: Scalability

▶ Computationally more efficient than existing general robust control


solvers – HIFOO [Arzelier et al., ′ 11] & Matlab h2hinfsyn function
[Mahmoud and Pascal, ′ 96] and systune function [Apkarian et al., ′ 08]

System Dim. | HIFOO | h2hinfsyn | systune | NPG | GN | Speedup
15 × 15 | 0.3742s | 95.2663s | 0.4276s | 0.0481s | 0.0420s | ∼ 8/2117/8.8×
60 × 60 | 18.4380s | fail, > 7200s | 171.7855s | 0.3906s | 0.3902s | ∼ 47/> 18461/440×
90 × 90 | 241.4416s | fail, > 7200s | 4126.9s | 0.8167s | 0.8103s | ∼ 295/> 36922/5093×

Table: Average runtime comparison


Connection to Multi-Agent RL (MARL)
Multi-Agent RL
▶ Usually studied under framework of Markov games [Shapley ′ 53]
▶ The most basic MARL model ever since [Littman, ′ 94]: two-player
zero-sum Markov games
▶ Benchmark in continuous control: linear quadratic zero-sum
dynamic games (DG) [Başar and Bernhard, ′ 95] (mirrors LQR for
single-agent RL)
▶ PO methods widely used in modern empirical MARL, while its
convergence guarantees remain largely open
▶ (Projected) PO for LQ zero-sum DGs [ZYB, ′ 19][Bu et al., ′ 19]
▶ Negative/Non-convergence results for (multi-player) LQ general-sum
DGs [Mazumdar, Ratliff, Jordan, and Sastry, ′ 19]
▶ PG methods (and variants) for tabular zero-sum Markov games
[Daskalakis et al., ′ 20][Zhao et al., ′ 21][Cen et al., ′ 21,′ 22]...
▶ Nonconvex-nonconcave [ZYB, ′ 19][Daskalakis et al., ′ 20], PO can
easily diverge if not designed carefully
LQ Zero-Sum Dynamic Games

▶ H2 /H∞ control is strongly tied to LQ zero-sum dynamic games


▶ Let u_t = −K x_t and w_t = −L x_t, then solve

J(K, L) := E_{x_0∼D} Σ_{t=0}^{∞} [ x_t⊤ Q x_t + (K x_t)⊤ R^u (K x_t) − (L x_t)⊤ R^v (L x_t) ]

Solve: min_K max_L J(K, L) ⇐⇒ min_K J(K, L(K))

with x_{t+1} = A x_t + B u_t + C w_t


▶ For fixed K (outer loop), taking the max over L (inner loop), the resulting Riccati equation becomes the same Riccati equation as in H2/H∞ control
Implication for MARL

▶ Previous results =⇒ double-loop update provably works


▶ Double-loop/nested-grad.: fix K and improve L, then improve K
▶ Aligned with the empirical tricks to stabilize nonconvex-
nonconcave minimax opt. with timescale separation [Lin, Jin, &
Jordan, ′ 18], as in training GANs [Heusel et al., ′ 18]
▶ Gives global convergence of PO in competitive MARL (zero-sum
Markov/dynamic games)
Benefit from MARL: Model-Free H2 /H∞ Control

▶ Recall the policy gradient form

∇J(K) = 2[ (R + B⊤P̃_K B)K − B⊤P̃_K A ] ∆_K,

while ∆_K cannot be estimated from sampled trajectories


▶ Instead solve the equivalent game using data
▶ Build up a virtual adversary wt = −Lxt
▶ Double-loop/nested-grad.: fix K and improve L, then improve K
▶ Derivative-free methods for LQR [Fazel et al., ′ 18][Malik et al., ′ 19]
cannot work directly
▶ Non-coercive & only certain direction works =⇒ no uniform margin
▶ Caveat: quantities (cost, action space, control gain matrices) in the
LQ setting are continuous, and can easily go unbounded!
▶ Leads to no-global-smoothness + nonconvexity-nonconcavity
▶ Can be addressed under the unified LQ game formulation, for
finite-horizon settings [ZZHB, ′ 21]
Illustration for Derivative-Free PO Convergence

▶ Proof idea illustrated with figures


▶ K0 is the “level-set” corresponding to initialization K0 ;
▶ K
b is a larger set that the iterates will not leave (with high
probability); due to stochastic errors when using samples
▶ both are compact =⇒ uniform smoothness constant over K b
Connection to Robust Adversarial RL (RARL)
Robust Adversarial RL [Pinto et al., ′ 17]

▶ RL hardly generalizes due to Sim2real and/or training-testing gap

[Google AI, ′ 16] [Tobin et al., ′ 17]

▶ One remedy: RARL [Pinto et al., ′ 17]


▶ Idea: introduce an adversary, playing against the agent
▶ Dates back to [Morimoto and Doya, ′ 05], under the name robust RL,
and “inspired by H∞ -theory”
▶ Made popular by the empirical work [Pinto et al., ′ 17]
▶ Question: Any robustness interpretation and convergence guarantee?
LQ RARL

▶ RARL setting ⇐⇒ zero-sum dynamic game


▶ LQ RARL: View wt as model-uncertainty, or the
model-misspecification when linearizing a nonlinear model
▶ Recall the RARL scheme in [Pinto et al., ′ 17]
RARL in [Pinto et al., ′ 17] Easily Fails [ZHB, ′ 20]

▶ Stability issue due to bad initialization


▶ Stability issue due to bad choices of (NK , NL ) (K0 , L0 )

[Figure: four training curves illustrating the stability issues above for different choices of (N_K, N_L) and (K_0, L_0).]

▶ What is a good combination of initialization & update rule?


Implication from Mixed H2 /H∞ Control

▶ By implicit regularization, we find a provably convergent pair of



(initialization, update rule): (K0 ∈ K, L0 = 0), (NK = 1, NL = ∞)
▶ How to find such a K0 ∈ K (in a model-free way) – robustify K0 ?
▶ For any stabilizing K, perform K′ = K − αg with g ∈ R^{m×n} the finite-difference estimate of the subgradient of ∥T(K)∥∞, where

g_ij = ( ∥T(K + ϵ d_ij)∥∞ − ∥T(K − ϵ d_ij)∥∞ ) / (2ϵ)

[Figure: decrease of the H∞-norm during robustification; closed-loop behavior before and after robustification.]


Additional Simulations
Convergent Cases

▶ Exact update:

▶ Derivative-free update:
Some Divergent Cases

▶ Update-rules other than double-loop may easily diverge, even with


infinitesimal stepsizes
▶ ANGDA: alternative-update of natural PG descent & ascent
▶ τ-NGDA: simultaneous update with stepsize ratio α/η = τ > 1
▶ Descent-Multi-Step-Ascent: multiple ascent steps per descent
step
Some Divergent Cases

▶ Without global smoothness, controlling the iterates’ boundedness is


critical and challenging
Concluding Remarks
Concluding Remarks
▶ Studied policy optimization landscape for risk-sensitive/robust
control, with fundamental challenges diff. from that of LQR –
deepened our understanding of existing results on LQR
▶ Developed two PO methods, identified their implicit regularization
property, and established global convergence + sample complexity
▶ Along the way
▶ Global convergence and sample complexity of PO for competitive
MARL, in the LQ zero-sum setting
▶ Some theoretical understanding and critical thinking on RARL, from
robust control perspective
▶ Explicit regularization and convex reformulation can also be useful: a unified differentiable convex liftings (DCL) framework [USPZT, ′22]

▶ A natural intersection of control, RL, and game theory


Thank You!
Direct Policy Search for Robust Control:
A Nonsmooth Optimization Perspective

Bin Hu
ECE & CSL, University of Illinois Urbana-Champaign

L4DC Tutorial 2023


Joint work with Xingang Guo
Outline

• Motivation and Problem Formulation

• Main Results

• Conclusions and Future Directions

Motivation: Reinforcement Learning for Control
• Many robust control problems are solved via lifting into convex spaces.
Recently, reinforcement learning (RL) has shown great promise for control!

• Main workhorse: direct policy search/policy optimization (PO)


min J(K), s.t. K ∈ K
K
• Parametrized policy K (e.g. linear mapping, neural networks)
• Cost function J (tracking errors, closed-loop H2 /H∞ norms, etc)
• Constraint set K (stability, robustness, safety, etc)
• PO algorithm: K′ = K − α∇J(K) (nonconvex problem!)
• Theory: Landscape, feasibility, convergence, complexity
• Question: How to tailor policy-based RL for robust control?
• This talk: Guarantees of PO on H∞ control benchmarks
PO Theory for Robust Control
• PO theory for mixed design (maintaining robustness via improving average)
• Landscape: Feasible set is connected, and stationary is global
• Feasibility: The cost is nonconvex and non-coercive! Fortunately,
double-loop natural policy gradient (NPG) can implicitly regularize
• Global sublinear convergence for NPG
• Ref: Zhang, Hu, Başar. Policy optimization for H2 linear Control with H∞
robustness guarantee: Implicit regularization and global convergence,
SICON 2021.
• PO theory for H∞ state-feedback synthesis (improving robustness)
• Feature: Nonconvex nonsmooth
• Landscape: Any Clarke stationary points are global
• Feasibility: The cost is coercive and serves as a barrier function on K
• Global convergence: Goldstein’s subgradient method achieves global
convergence provably
• Ref: Guo and Hu. Global convergence of direct policy search for
state-feedback H∞ robust control: A revisit of nonsmooth synthesis with
Goldstein subdifferential, NeurIPS 2022.
Review: Linear Quadratic Regulator
• LQR as PO: Consider xt+1 = Axt + But + wt with wt being stochastic IID
"T −1 #
1 X
min J(K) := lim E (x> >
t Qxt + ut Rut ) , s.t. K is stabilizing
K T →∞ T
t=0

• ut = −Kxt for gain matrix K


• K = {K | ρ(A − BK) < 1}; K a nonconvex constraint set
• PO theory for LQR
• Landscape: Stationary is global
• Feasibility: The LQR cost is coercive and serves as a barrier on K
• Global convergence & sample complexity: Linear rate via the gradient
dominance/smoothness property
• Main Ref:
M. Fazel, R. Ge, S. Kakade, M. Mesbahi. Global convergence of policy
gradient methods for the linear quadratic regulator, ICML 2018.

Problem Formulation: State-feedback H∞ Control
Consider the following linear time-invariant (LTI) system
xt+1 = Axt + But + wt , x0 = 0.

• We assume that (A, B) is stabilizable


• u := {u_0, u_1, · · · }, w := {w_0, w_1, · · · }, and ∥w∥ = ( Σ_{t=0}^{∞} ∥w_t∥² )^{1/2}.
• Our goal is to find a sequence u to minimize the quadratic cost

min_u max_{w:∥w∥≤1} Σ_{t=0}^{∞} (x_t⊤ Q x_t + u_t⊤ R u_t)

in the presence of the worst-case disturbance ∥w∥ ≤ 1.


• This is different than the LQR problem, where w is stochastic.
• kwk ≤ 1 is not restrictive, we can choose arbitrary `2 bound.
• We assume that Q and R are positive definite.
• It is well known that the optimal solution is using a linear state-feedback
policy ut = −Kxt (Başar and Bernhard 2008).
Problem Formulation: State-feedback H∞ Control
Consider u_t = −K x_t; the closed-loop system becomes x_{t+1} = (A − BK)x_t + w_t. We have the following PO problem:

min_K max_{w:∥w∥≤1} Σ_{t=0}^{∞} x_t⊤ (Q + K⊤RK) x_t.

The above optimization problem is equivalent to the following PO problem:

min_K J(K) := sup_{ω∈[0,2π]} σ_max( (Q + K⊤RK)^{1/2} (e^{jω}I − A + BK)⁻¹ )
s.t. K ∈ K := {K : ρ(A − BK) < 1}.

• This is a constrained nonconvex nonsmooth optimization problem.
• K can be nonconvex.
• The nonsmoothness comes from two sources:
  1. The computation of the maximum singular value.
  2. The sup over ω ∈ [0, 2π].
Convex LMIs vs. Direct Policy Search
• In the 1980s, convex optimization methods became popular for control due to global guarantees and efficient interior point methods
• Reparameterize problems as convex optimization problems (one does not
optimize the controller parameters directly)

{K ∈ K : J(K) ≤ γ} ⇐⇒ {K = LY⁻¹ : LMI(Y, L, γ) ≺ 0 is feasible, Y ≻ 0}.

• Minimizing γ subject to LMI(Y, L, γ) ≺ 0 and Y ≻ 0 is an SDP problem


• See for example, Boyd et al., “Linear Matrix Inequalities in System and
Control Theory”, 1994, SIAM
• PO is not convex!
• In the past, PO has been a popular approach for problems that cannot be convexified, e.g., structured H∞ synthesis! (HIFOO and Hinfstruct)
• This talk: View H∞ synthesis as a benchmark for understanding PO
Some Background on Nonsmooth Optimization

Clarke subdifferential:

∂_C J(K) := conv{ lim_{i→∞} ∇J(K_i) : K_i → K, K_i ∈ dom(∇J) ⊂ K }.

• ∂C J(K) is well defined for any K ∈ K.


• J(K) is locally Lipschitz and hence almost everywhere differentiable.

Proposition
If K is a local minimum of J, then 0 ∈ ∂C J(K) and K is a Clarke stationary
point.

Some Background on Nonsmooth Optimization
Generalized Clarke directional derivative:

J°(K, d) := limsup_{K′→K, t↓0} ( J(K′ + td) − J(K′) ) / t.

Directional derivative:

J′(K, d) := lim_{t↓0} ( J(K + td) − J(K) ) / t.

• J°(K, d) and J′(K, d) are different in general.
• J′(K, d) = J°(K, d) if J(K) is subdifferentially regular.

Subdifferentially Regular Property
Let K† be a Clarke stationary point for J. If J is subdifferentially regular, then J′(K†, d) ≥ 0 for all d.ᵃ
ᵃ This result is known; see Theorem 10.1 in Rockafellar and Wets 2009.
Some Background on Nonsmooth Optimization
Goldstein subdifferential:

∂_δ J(K) := conv( ∪_{K′∈B_δ(K)} ∂_C J(K′) ),

• B_δ(K) is the δ-ball around K
• requires B_δ(K) ⊂ K.
Generating a good descent direction (Goldstein 1977):

Descent inequality
Let F be the minimal-norm element of ∂_δ J(K). Suppose K − αF/∥F∥₂ ∈ K for any 0 ≤ α ≤ δ. Then we have

J(K − δF/∥F∥₂) ≤ J(K) − δ∥F∥₂.
Outline

• Motivation and Problem Formulation

• Main Results

• Conclusions and Future Directions

Summary of Known Facts

min_K J(K) := sup_{ω∈[0,2π]} σ_max( (Q + K⊤RK)^{1/2} (e^{jω}I − A + BK)⁻¹ )
s.t. K ∈ K := {K : ρ(A − BK) < 1}.

• K is open, can be unbounded, and nonconvex.


• J(K) is continuous, nonsmooth, and can be nonconvex in K.
• J(K) is locally Lipschitz, subdifferentially regular over the feasible set K.

High Level Ideas

Goldstein’s subgradient method:

K^{n+1} = K^n − δ^n F^n/∥F^n∥₂,

• F^n is the minimum-norm element of ∂_{δ^n} J(K^n).
• K^0 ∈ K is known.

High level ideas:


• Goldstein’s subgradient method finds Clarke stationary point
• Coerciveness ensures that K^n stays within the nonconvex feasible set.
• Clarke stationary points are global, and hence global optimum is found.

Main Results
Theorem (Guo and Hu, NeurIPS 2022)
Suppose (Q, R) are positive definite and (A, B) is stabilizable. We have:
1. J(K) is coercive over the set K. (Proved via the properties of (Q, R).)
2. For any K ∈ K satisfying J(K) > J*, there exists V ≠ 0 s.t. J′(K, V) < 0.
3. Any Clarke stationary point of the H∞ cost is a global minimum.
4. For any γ > J*, the sublevel set K_γ = {K ∈ K : J(K) ≤ γ} is compact. There is a strict separation between K_γ and K^c.
5. Suppose K^0 ∈ K. Set ∆_0 := dist(K_{J(K^0)}, K^c) > 0, and δ^n = 0.99∆_0/(n+1). Then Goldstein’s subgradient method K^{n+1} = K^n − δ^n F^n/∥F^n∥_F, with F^n being the minimum-norm element of ∂_{δ^n} J(K^n), is guaranteed to stay in K for all n. In addition, we have J(K^n) → J* as n → ∞.
6. There is also a complexity result for finding (ε, δ)-stationary points.

The most technical part of the proof is Step 2. It requires the use of a non-strict version of the KYP lemma.
Step 2 of Main Result
Lemma
For any K ∈ K with J(K) > J*, there exists a direction d ≠ 0 such that the directional derivative J′(K, d) ≤ J* − J(K) < 0.

Proof Sketch:

[Figure: the point (Y, L, γ) with LMI(Y, L, γ) ≺ 0 and the point (Y*, L*, γ*) with LMI(Y*, L*, γ*) ≺ 0, connected by the segment (Y + t∆Y, L + t∆L, γ + t(γ* − γ)).]

By convexity, we have LMI(Y + t∆Y, L + t∆L, γ + t(γ* − γ)) ≺ 0 with ∆Y = Y* − Y, ∆L = L* − L, and t ∈ [0, 1]. In addition, we must have

J((L + t∆L)(Y + t∆Y)⁻¹) ≤ γ + t(γ* − γ).
Step 2 of Main Result
Lemma
For any K ∈ K with J(K) > J*, there exists a direction d ≠ 0 such that the directional derivative J′(K, d) ≤ J* − J(K) < 0.

Proof Sketch (continued):
Based on the fact J(K*) < J(K), we can construct a direction d such that J′(K, d) < 0. Specifically, consider d = ∆L Y⁻¹ − L Y⁻¹ ∆Y Y⁻¹. Then we have

J′(K, d) = lim_{t↓0} ( J(K + t(∆L Y⁻¹ − L Y⁻¹ ∆Y Y⁻¹)) − J(K) ) / t
         ≤ lim_{t↓0} ( ( J((L + t∆L)(Y + t∆Y)⁻¹) − J(K) ) / t + O(t) )
         ≤ lim_{t↓0} ( ( J(K) + t(J(K*) − J(K)) − J(K) ) / t + O(t) )
         = J(K*) − J(K) < 0,

where we use the fact that (Y + t∆Y)⁻¹ = Y⁻¹ − t Y⁻¹ ∆Y Y⁻¹ + O(t²).
Finite-time complexity for (δ, ε)-stationary points
Goldstein’s subdifferential: ∂_δ J(K) := conv( ∪_{K′∈B_δ(K)} ∂_C J(K′) ).

Definition
A point K is said to be (δ, ε)-stationary if dist(0, ∂_δ J(K)) ≤ ε.

Theorem 3
If we choose δ^n = δ < ∆_0, then we have:
• K^n ∈ K for all n
• min_{0≤n≤N} ∥F^n∥₂ ≤ (J(K^0) − J*)/((N + 1)δ), i.e., the complexity of finding a (δ, ε)-stationary point is O(1/(δε))
• (δ, ε)-stationarity does not imply being δ-close to an ε-stationary point of J.
• Finite-time bounds for J(K^n) − J* are possible by exploiting other advanced properties of J(K).
Implementable Algorithms
Finding minimum norm element of Goldstein’s subdifferential may not be easy.
Fortunately, there are many implementable variants:
• Gradient Sampling (GS) (the HIFOO toolbox): The main idea is to randomly generate samples over B_{δ^n}(K^n), at which J is differentiable with probability 1. The convex hull of the gradients at these samples can be used as an approximation of ∂_{δ^n} J(K^n). (A minimal sketch appears after this list.)
• Nonderivative Sampling (NS)(Kiwiel2010): The NS method can be
viewed as the derivative-free version of the GS algorithm by only using the
zeroth-order oracle.
• Interpolated normalized gradient descent (INGD) (Zhang, J., et al.
2020; Davis, D., et al. 2022): INGD uses an iterative sampling strategy to
generate a descent direction. The INGD algorithm is guaranteed to find the
(δ, ε)-stationary point with the high-probability finite-time complexity
bound.

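A minimal sketch of a gradient-sampling-style approximation of a Goldstein subgradient step, applied to a toy nonsmooth function rather than the H∞ cost; the sampling scheme, toy objective, and the small QP for the minimum-norm element are illustrative assumptions.

```python
# Gradient-sampling approximation of a Goldstein subgradient step on a toy function.
import numpy as np
from scipy.optimize import minimize

def J(K):                      # toy nonsmooth objective: max of two smooth functions
    return max(np.sum((K - 1) ** 2), np.sum((K + 1) ** 2))

def grad_J(K):                 # gradient wherever J is differentiable (almost everywhere)
    return 2 * (K - 1) if np.sum((K - 1) ** 2) >= np.sum((K + 1) ** 2) else 2 * (K + 1)

def goldstein_step(K, delta, n_samples=20, rng=None):
    rng = np.random.default_rng(rng)
    G = []
    for _ in range(n_samples):                      # sample gradients in the delta-ball
        U = rng.normal(size=K.shape)
        U *= delta * rng.uniform() ** (1 / K.size) / np.linalg.norm(U)
        G.append(grad_J(K + U))
    G = np.array(G)
    # minimum-norm element of conv{g_1,...,g_m}: min ||w^T G||^2, w >= 0, sum w = 1
    m = len(G)
    res = minimize(lambda w: np.sum((w @ G) ** 2), np.ones(m) / m,
                   bounds=[(0, 1)] * m,
                   constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1})
    F = res.x @ G
    if np.linalg.norm(F) < 1e-8:                    # (approximately) stationary: stop moving
        return K
    return K - delta * F / np.linalg.norm(F)        # normalized Goldstein-style step

K = np.array([0.3, -2.0])
for _ in range(50):
    K = goldstein_step(K, delta=0.05)
print("K:", K, "J(K):", J(K))
```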
Numerical Example

To support our theory, we provide some numerical simulations. Consider the following example:

A = [ 1 0 −5 ; −1 1 0 ; 0 0 1 ],  B = [ 1 ; 0 ; −1 ],  Q = [ 2 −1 0 ; −1 2 −1 ; 0 −1 2 ],  R = 1.

For this example, we have J* = 7.3475. We initialize from

K^0 = [ 0.4931  −0.1368  −2.2654 ],

which satisfies ρ(A − BK^0) = 0.5756 < 1.
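A minimal sketch that checks ρ(A − BK^0) for this example and approximates J(K^0) by gridding the frequency axis; the grid size is an illustrative assumption and gridding only approximates the supremum.

```python
# Check the spectral radius at K^0 and evaluate the nonsmooth cost
# J(K) = sup_w sigma_max((Q + K^T R K)^{1/2} (e^{jw} I - A + B K)^{-1}) on a grid.
import numpy as np
from scipy.linalg import sqrtm

A = np.array([[1.0, 0.0, -5.0], [-1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
B = np.array([[1.0], [0.0], [-1.0]])
Q = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
R = np.array([[1.0]])

def J_hinf(K, n_grid=2000):
    Acl = A - B @ K
    M = np.real(sqrtm(Q + K.T @ R @ K))
    sup_val = 0.0
    for w in np.linspace(0, 2 * np.pi, n_grid):
        T = M @ np.linalg.inv(np.exp(1j * w) * np.eye(3) - Acl)
        sup_val = max(sup_val, np.linalg.svd(T, compute_uv=False)[0])
    return sup_val

K0 = np.array([[0.4931, -0.1368, -2.2654]])
print("rho(A - B K0):", max(abs(np.linalg.eigvals(A - B @ K0))))  # slide reports 0.5756
print("grid estimate of J(K0):", J_hinf(K0))
```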

Numerical Example

Figure: Simulation results. Left: the trajectory of the relative error for the GS, NS, INGD, and model-free NS methods. Middle: the trajectory of the relative optimality gap for 8 randomly generated cases with the NS method. Right: the trajectory of the model-free NS method with a noisier oracle.

Outline

• Motivation and Problem Formulation

• Main Results

• Conclusions and Future Directions

Take Aways

We studied the global convergence of direct policy search for state-feedback H∞ robust control synthesis.
• State-feedback H∞ synthesis is a constrained nonconvex nonsmooth policy
optimization problem.

• Any Clarke stationary point for this problem is actually a global minimum.

• Goldstein’s subgradient methods are guaranteed to stay within the


nonconvex feasible set and converge to the global optimal.

• (δ, ε)-stationary points can be found with finite-time guarantees.

• (δ, ε)-stationarity does not imply being δ-close to an ε-stationary point of J.

23
Future Work

• Finite-time bounds for the optimality gap (i.e. J(K n ) − J ∗ )

• The sample complexity of direct policy search on model-free H∞ control

• Other H∞ synthesis problems (static/dynamic output feedback, etc)

24
Thanks!

If you are interested, feel free to send an email to binhu7@illinois.edu

Funding & Support: NSF

25
Analysis of the Optimization Landscape of Linear
Quadratic Gaussian (LQG) Control

Work by Yang Zheng, Yujie Tang, and Na Li

Presented by Bin Hu
5th Annual Learning for Dynamics & Control Conference
University of Pennsylvania. June 14-16, 2023
Today’s talk
❑ Optimal Control: the Feedback Paradigm
[Block diagram: controller → control input u(t) → system (disturbance d(t), noise w(t)) → state x(t), measurement y(t) → feedback to the controller.]
❑ Linear Quadratic Optimal Control
• Many practical applications
• Linear Quadratic Regulator (LQR) when the state 𝑥𝑡 is directly observable
• Linear Quadratic Gaussian (LQG) control when only the partial output 𝑦𝑡 is observed
• Extensive classical results (dynamic programming, separation principle, Riccati equations, etc.)
They are all model-based. Are there any guarantees for non-convex policy optimization? 2
Challenges for partially observed LQG
❑ Policy optimization for LQG control
• LQG is more complicated than LQR  [side figures: applications, sparse structures]
• Requires dynamical controllers
• Its non-convex landscape properties are much richer and more complicated than LQR

Our focus: non-convex LQG landscape

▪ Q1: Properties of the domain (set of stabilizing controllers)


• convexity, connectivity, open/closed?

▪ Q2: Properties of the accumulated LQG cost


• convexity, differentiability, coercivity?
• set of stationary points/local minima/global minima?

3
Outline

❑ LQG problem Setup

❑ Connectivity of the Set of Stabilizing Controllers

❑ Structure of Stationary Points of the LQG cost

4
LQG Problem Setup
[Block diagram: plant driven by Gaussian white noises w(t) (process) and v(t) (measurement), in feedback with a dynamical controller mapping y(t) to u(t).]
Objective: the LQG cost

Dynamical controller:
➢ internal state of the controller
➢ order of the controller
➢ full-order
➢ reduced-order

Minimal controller: the input-output behavior cannot be replicated by a
lower-order controller (* controllable and observable).

Standard assumption: the plant is controllable and observable.
5
Separation principle
[Block diagram: plant driven by Gaussian white noises w(t) and v(t), in feedback with a dynamical controller mapping y(t) to u(t).]
Objective: the LQG cost

Solution: Kalman filter for state estimation + LQR based on the estimated state

Two Riccati equations:
➢ Kalman gain
➢ Feedback gain
Explicit dependence on the dynamics (a minimal ARE-based sketch follows).
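To make the model-based baseline concrete, here is a minimal sketch of the separation-principle solution for a hypothetical plant (the matrices A, B, C and the weights/covariances Q, R, W, V below are illustrative and not taken from the slides): one ARE yields the LQR feedback gain, the other yields the Kalman gain, and the observer-based controller combines them.

```python
# Model-based LQG via the separation principle: two Riccati equations.
# The plant and weights below are hypothetical, chosen only for illustration.
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [2.0, -1.0]])   # open-loop unstable plant
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Q, R = np.eye(2), np.array([[1.0]])        # LQR weights
W, V = np.eye(2), np.array([[0.1]])        # process / measurement noise covariances

# Control ARE -> state-feedback gain; filter ARE -> Kalman gain.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)            # u = -K x_hat
S = solve_continuous_are(A.T, C.T, W, V)
Lk = S @ C.T @ np.linalg.inv(V)            # Kalman gain

# Observer-based (full-order) dynamical controller.
A_K = A - B @ K - Lk @ C
B_K = Lk
C_K = -K

# The closed loop over [plant state; controller state] should be Hurwitz.
A_cl = np.block([[A, B @ C_K], [B_K @ C, A_K]])
print("closed-loop eigenvalues:", np.linalg.eigvals(A_cl))
```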
6
Policy Optimization formulation
❑ Closed-loop dynamics
❑ Feasible region of the controller parameters
❑ Cost function (solution to Lyapunov equations; a cost-evaluation sketch follows the reference below)

Policy optimization for LQG via direct policy search; the optimization landscape analysis asks:
✓ Does it converge at all?
✓ Converge to which point?
✓ Convergence speed?

Hyland, David, and Dennis Bernstein. "The optimal projection equations for fixed-order
dynamic compensation." IEEE Transactions on Automatic Control 29.11 (1984): 1034-1037. 7
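As a sketch of how the LQG cost of a fixed dynamical controller can be evaluated through a closed-loop Lyapunov equation, consider the snippet below; the plant, noise covariances, weights, and controller matrices are hypothetical placeholders, and the code merely asserts closed-loop stability rather than guaranteeing it.

```python
# LQG cost of a fixed dynamical controller (A_K, B_K, C_K) via a closed-loop
# Lyapunov equation. All numerical data below are hypothetical placeholders.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Plant: dx = Ax + Bu + w,  y = Cx + v, with noise covariances W, V.
A = np.array([[-1.0, 1.0], [0.0, -2.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
W, V = np.eye(2), np.array([[0.05]])
Q, R = np.eye(2), np.array([[1.0]])

# A (hypothetical) dynamical controller: dxi = A_K xi + B_K y,  u = C_K xi.
A_K = np.array([[-2.0, 0.0], [0.0, -3.0]])
B_K = np.array([[1.0], [0.5]])
C_K = np.array([[0.1, 0.1]])

# Closed loop over the joint state (x, xi).
A_cl = np.block([[A, B @ C_K], [B_K @ C, A_K]])
assert np.max(np.linalg.eigvals(A_cl).real) < 0, "controller must be stabilizing"

# Noise covariance and cost weight of the joint state.
N_cl = np.block([[W, np.zeros((2, 2))], [np.zeros((2, 2)), B_K @ V @ B_K.T]])
Q_cl = np.block([[Q, np.zeros((2, 2))], [np.zeros((2, 2)), C_K.T @ R @ C_K]])

# Stationary covariance from A_cl Sigma + Sigma A_cl^T + N_cl = 0.
Sigma = solve_continuous_lyapunov(A_cl, -N_cl)
J = np.trace(Q_cl @ Sigma)
print("LQG cost of this controller:", J)
```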
Main questions
Policy optimization for LQG

▪ Q1: Connectivity of the feasible region


• Is it connected?
• If not, how many connected components can it have?
Non-convex ▪ Q2: Structure of stationary points of
Landscape • Are there spurious (strictly suboptimal, saddle)
Analysis stationary points?
• How to check if a stationary point is globally optimal?

8
Outline

❑ LQG problem Setup

❑ Connectivity of the Set of Stabilizing Controllers

❑ Structure of Stationary Points of the LQG cost

9
Connectivity of the feasible region
❑ Simple observation: non-convex and unbounded

Lemma 1: the set is non-empty, unbounded, and can be non-convex.

Example: one controller stabilizes the plant, and thus belongs to the set;
another fails to stabilize the plant, and thus lies outside the set.
10
Connectivity of the feasible region
❑ Main Result 1: dis-connectivity
Theorem 1: The set can be disconnected but has at most 2 connected components.

✓ This is different from the set of static stabilizing state-feedback controllers,
which is always connected!
✓ Is this a negative result for gradient-based algorithms? → No
11
Connectivity of the feasible region
❑ Main Result 2: dis-connectivity
Theorem 2: If the set has 2 connected components, then there is a smooth bijection T between
the 2 connected components that preserves the cost function value.

✓ In fact, the bijection T is defined by a similarity


transformation (change of controller state coordinates)

Positive news: For gradient-based local search


methods, it makes no difference to search over
either connected component.

12
Connectivity of the feasible region
❑ Main Result 3: conditions for connectivity
Theorem 3: 1) The set is connected if there exists a reduced-order stabilizing controller.
2) The sufficient condition above becomes necessary if the plant is single-input or
single-output.

Corollary 1: Given any open-loop stable plant, the set of stabilizing controllers is connected.

Example: Open-loop stable system

Routh–Hurwitz stability criterion

13
Connectivity of the feasible region
❑ Main Result 3: conditions for connectivity
Example: Open-loop unstable system (SISO)

Disconnected feasible region

• Routh–Hurwitz stability criterion

• Two path-connected components (a grid-based numerical sketch follows below)
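As a rough illustration of how such a region can be inspected numerically, the sketch below grids two parameters of a first-order controller for a hypothetical unstable SISO plant and tests closed-loop stability by eigenvalue checks; the plant and the fixed output gain are made up, and this toy instance is not claimed to reproduce the two components shown on the slide (the number of components depends on the plant).

```python
# Visualizing a 2-D slice of the set of stabilizing dynamic controllers by
# brute force. Hypothetical SISO plant xdot = x + u, y = x, with a first-order
# controller xi_dot = aK xi + bK y, u = cK xi, and the output gain cK fixed to 1.
import numpy as np
from scipy.ndimage import label

a_grid = np.linspace(-6.0, 2.0, 200)
b_grid = np.linspace(-6.0, 2.0, 200)
stable = np.zeros((len(a_grid), len(b_grid)), dtype=bool)

for i, aK in enumerate(a_grid):
    for j, bK in enumerate(b_grid):
        # Closed loop of the plant (A = B = C = 1) with the controller (cK = 1).
        A_cl = np.array([[1.0, 1.0], [bK, aK]])
        stable[i, j] = np.max(np.linalg.eigvals(A_cl).real) < 0

components, n_components = label(stable)
print("stabilizing grid points:", int(stable.sum()))
print("connected components on this slice:", n_components)
```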

14
Policy Optimization formulation
Policy optimization for LQG: non-convex landscape analysis

▪ Q1: Connectivity of the feasible region
• Is it connected? No
• If not, how many connected components can it have? Two

▪ Q2: Structure of stationary points of the LQG cost
• Are there spurious (strictly suboptimal, saddle) stationary points?
• How to check if a stationary point is globally optimal?
15
Outline

❑ LQG problem Setup

❑ Connectivity of the Set of Stabilizing Controllers

❑ Structure of Stationary Points of the LQG cost

16
Structure of Stationary Points
❑ Simple observations
1) The LQG cost is a real analytic function over its domain
(smooth, infinitely differentiable)
2) The LQG cost has non-unique and non-isolated global optima

Similarity transformation:
➢ The LQG cost is invariant under similarity transformations
(a numerical check is sketched below).
➢ It has many stationary points, unlike LQR, which has a unique
stationary point
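A quick numerical check of this invariance, assuming the Lyapunov-equation cost evaluation sketched earlier: evaluate the cost of a controller and of its similarity-transformed copy (T A_K T⁻¹, T B_K, C_K T⁻¹) and compare; all plant, noise, and controller data below are hypothetical.

```python
# Check that the LQG cost is invariant under a similarity transformation of the
# controller state. All numerical data are hypothetical placeholders.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.0, 1.0], [0.0, -2.0]]); B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]]); W, V = np.eye(2), np.array([[0.05]])
Q, R = np.eye(2), np.array([[1.0]])

def lqg_cost(A_K, B_K, C_K):
    # Closed-loop LQG cost via one Lyapunov equation (controller must stabilize).
    A_cl = np.block([[A, B @ C_K], [B_K @ C, A_K]])
    assert np.max(np.linalg.eigvals(A_cl).real) < 0
    N_cl = np.block([[W, np.zeros((2, 2))], [np.zeros((2, 2)), B_K @ V @ B_K.T]])
    Q_cl = np.block([[Q, np.zeros((2, 2))], [np.zeros((2, 2)), C_K.T @ R @ C_K]])
    Sigma = solve_continuous_lyapunov(A_cl, -N_cl)
    return np.trace(Q_cl @ Sigma)

A_K = np.array([[-2.0, 0.0], [0.0, -3.0]]); B_K = np.array([[1.0], [0.5]])
C_K = np.array([[0.1, 0.1]])
T = np.array([[2.0, 1.0], [0.0, 1.0]]); Tinv = np.linalg.inv(T)

J1 = lqg_cost(A_K, B_K, C_K)
J2 = lqg_cost(T @ A_K @ Tinv, T @ B_K, C_K @ Tinv)   # transformed controller
print(J1, J2)   # identical up to numerical round-off
```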
17
Structure of Stationary Points
❑ Gradient computation
Lemma 2: For every controller in the feasible set, we have an explicit expression for the
gradient of the LQG cost, where the matrices appearing in it are the unique positive
semidefinite solutions to two Lyapunov equations.

What does the set of stationary points look like?
❑ Non-unique, non-isolated
❑ Local minimum, local maximum, saddle points, or globally minimum?
18
Structure of Stationary Points
❑ Main Result: existence of strict saddle points
Theorem 4: Consider any open-loop stable plant. The zero controller (B_K = 0, C_K = 0) with any stable A_K
is a stationary point. Furthermore, the corresponding Hessian is either indefinite (strict saddle
point) or equal to zero (high-order saddle or else).

Example: a concrete stationary point of this form; its cost function and Hessian
can be computed explicitly, and the Hessian is indefinite (it has eigenvalues of
both signs), so this stationary point is a strict saddle.
19
Structure of Stationary Points
❑ Main Result: existence of strict saddle points
Theorem 4: Consider any open-loop stable plant. The zero controller (B_K = 0, C_K = 0) with any stable A_K
is a stationary point. Furthermore, the corresponding Hessian is either indefinite (strict saddle
point) or equal to zero (high-order saddle or else).

Another example with zero Hessian.

What does the set of stationary points look like?
❑ Non-unique, non-isolated
❑ Strictly suboptimal points; strict saddle points
❑ All bad stationary points correspond to non-minimal controllers
20
Structure of Stationary Points
❑ Main Result
Theorem 5: All stationary points corresponding to controllable and
observable controllers are globally optimal.

[Diagram: local information (zero gradient) combined with structural information (minimality) yields a global optimality certificate.]

In particular, given a stationary point that is a minimal controller:

1) It is globally optimal, and the set of all global optima forms a manifold with 2 connected components.
(A generic minimality check is sketched below.)
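Verifying the hypothesis of Theorem 5 for a given stationary point amounts to a standard controllability/observability test on the controller realization; below is a generic sketch (the two example controllers are hypothetical).

```python
# Minimality check for a controller realization (A_K, B_K, C_K): the realization
# is minimal iff (A_K, B_K) is controllable and (C_K, A_K) is observable.
import numpy as np

def is_minimal(A_K, B_K, C_K, tol=1e-9):
    q = A_K.shape[0]
    ctrb = np.hstack([np.linalg.matrix_power(A_K, i) @ B_K for i in range(q)])
    obsv = np.vstack([C_K @ np.linalg.matrix_power(A_K, i) for i in range(q)])
    return (np.linalg.matrix_rank(ctrb, tol) == q and
            np.linalg.matrix_rank(obsv, tol) == q)

A_K = np.diag([-1.0, -2.0])
print(is_minimal(A_K, np.array([[1.0], [1.0]]), np.array([[1.0, 1.0]])))  # True
print(is_minimal(A_K, np.array([[1.0], [0.0]]), np.array([[1.0, 0.0]])))  # False: second mode unreachable/unobservable
```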

[Figures: an example with an open-loop unstable system and an example with an open-loop stable system.]
21
Structure of Stationary Points
❑ Implication
Corollary: Consider gradient descent iterations on the LQG cost.

If the iterates converge to a minimal controller, then this minimal controller is a global optimum.

More questions:
✓ Escaping saddle points?
✓ Convergence conditions?
✓ Convergence speed?
✓ Alternative model-free parameterization?

22
Comparison with LQR
Connectivity of feasible region
  ❖ LQR: always connected
  ❖ LQG: disconnected, but at most 2 connected components; they are almost identical to each other

Stationary points
  ❖ LQR: unique
  ❖ LQG: non-unique, non-isolated stationary points; spurious stationary points (strict saddles,
    non-minimal controllers); all minimal stationary points are globally optimal

Gradient descent
  ❖ LQR: gradient dominance, global fast convergence (like strictly convex)
  ❖ LQG: no gradient dominance; local convergence/speed unknown; many open questions

References
  ❖ LQR: Fazel et al., ICML, 2018; Malik et al., 2019; Mohammadi et al., IEEE TAC, 2020;
    Li et al., 2019; K. Zhang, B. Hu, and T. Başar, 2021; Furieri et al., 2019;
    Feiran Zhao & Keyou You, 2021, and many others
  ❖ LQG: Zheng*, Tang*, Li, 2021, link (* equal contribution) 23
Conclusions

24
Policy optimization for LQG control

❑ Much richer and more complicated than LQR


❑ Disconnected, but at most 2 connected components
❑ Non-unique, non-isolated stationary points, strict saddle points
❑ Minimal (controllable and observable) stationary points are globally optimal

25
Ongoing and Future work

❑ How to certify the optimality of a non-minimal stationary point


❑ Perturbed policy gradient (PGD) for escaping saddle points
❑ Quantitative analysis of PGD algorithms for LQG
❑ Alternative model-free parametrization of dynamical controllers (e.g., Makdah
& Pasqualetti, 2023; Zhao, Fu & You, 2022).

✓ Better optimization landscape structures, smaller dimension


❑ Nonconvex landscape of H∞ dynamic output-feedback control (Tang &
Zheng, 2023, https://arxiv.org/abs/2304.00753)

26
Analysis of the Optimization Landscape of Linear
Quadratic Gaussian (LQG) Control

Thank you for your attention!


Q&A
1. Y. Tang*, Y. Zheng*, and N. Li, “Analysis of the optimization landscape of Linear Quadratic Gaussian (LQG) control,”
Mathematical Programming, 2023. Available: https://arxiv.org/abs/2102.04393 *Equal contribution
2. B. Hu and Y. Zheng, "Connectivity of the feasible and sublevel sets of dynamic output feedback control with robustness
constraints," IEEE Control Systems Letters, 2022.
3. Y. Zheng*, Y. Sun*, M. Fazel, and N. Li. "Escaping High-order Saddles in Policy Optimization for Linear Quadratic Gaussian
(LQG) Control." CDC, 2022 https://arxiv.org/abs/2204.00912. *Equal contribution
Role of convex parameterization

Message : favorable landscape properties for nonconvex J can be


obtained from the convex parameterization under appropriate
conditions on the mapping
[Sun, F.,’21 ; Umenberger et al.’22 ; Hu et al.’23 survey]

Warm up : convex formulation for continuous-time LQR

min_{Z,L,P}  Tr(QP + ZR)
s.t.  AP + PA⊤ + BL + L⊤B⊤ + Σ = 0,
      P ≻ 0,   [ Z   L
                 L⊤  P ] ⪰ 0,

⇒   min_{Z,L,P}  f(L, P, Z)   s.t.  (L, P, Z) ∈ S,

where K∗ = L∗(P∗)⁻¹.
▶ further, K = LP⁻¹ parameterizes all stabilizing K ∈ K (a small SDP sketch follows below)
▶ also see [Mohammadi et al.’19]
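A small numerical sketch of this convex program, assuming cvxpy with the SCS solver and a made-up stabilizable pair; the sign convention here takes K = LP⁻¹ with closed loop A + BK, which matches the Lyapunov-type equality above but may differ from the u = −Kx convention used elsewhere in the tutorial.

```python
# Convex (SDP) formulation of continuous-time LQR and recovery of K = L P^{-1}.
# The double-integrator data below are illustrative.
import cvxpy as cp
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma = np.eye(2), np.array([[1.0]]), np.eye(2)
n, m = B.shape

P = cp.Variable((n, n), symmetric=True)
L = cp.Variable((m, n))
Z = cp.Variable((m, m), symmetric=True)

constraints = [
    A @ P + P @ A.T + B @ L + L.T @ B.T + Sigma == 0,
    P >> 1e-6 * np.eye(n),                       # P strictly positive definite
    cp.bmat([[Z, L], [L.T, P]]) >> 0,            # Schur complement: Z >= L P^{-1} L^T
]
prob = cp.Problem(cp.Minimize(cp.trace(Q @ P) + cp.trace(Z @ R)), constraints)
prob.solve(solver=cp.SCS)

K = L.value @ np.linalg.inv(P.value)             # recovered gain, K = L P^{-1}
print("optimal cost:", prob.value)
print("closed-loop eigenvalues:", np.linalg.eigvals(A + B @ K))
```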

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 1/9
Assumptions on parameterization map

min_K  J(K)   s.t.  K ∈ K        ⇒        min_{Z,L,P}  f(L, P, Z)   s.t.  (L, P, Z) ∈ S

Assumptions :
1. S is convex, f (L, P, Z ) is convex, bounded, differentiable on S.
2. we can express J(K) as

J(K) = min_{L,P,Z}  f(L, P, Z),   s.t.  (L, P, Z) ∈ S,   K = LP⁻¹.

more generally, K = LP −1 can be replaced by a surjective map


K = Φ(L, P) with “nicely behaved” first-order derivatives.

[Sun, F.,’21], [Umenberger et al.,’22]

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 2/9
Role of convex parameterization

min_K  J(K)   s.t.  K ∈ K        ⇒        min_{Z,L,P}  f(L, P, Z)   s.t.  (L, P, Z) ∈ S

Theorem (simplified) [Sun & F.,’21]


Under assumptions 1 and 2,
∇J (K ) = 0 ⇐⇒ K = K ∗ .
Also,
▶ If f is convex, ∥∇J(K)∥F ≳ J(K) − J(K∗).
▶ If f is µ-strongly convex, ∥∇J(K)∥F ≳ (µ(J(K) − J(K∗)))^{1/2}.

(≳ hides instance-dependent constants that depend on the system
parameters & the initial point K0; a numerical sanity check for the LQR case is
sketched below)
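This bound can be sanity-checked numerically on a small instance. The sketch below uses the standard continuous-time LQR policy-gradient expressions under the u = −Kx, closed-loop A − BK convention (which may differ in sign from the K = LP⁻¹ parameterization on this slide); the system matrices and the initial gain K0 are illustrative.

```python
# Numerical sanity check of gradient dominance for continuous-time LQR:
# compare ||grad J(K)||_F against J(K) - J(K*). Conventions: u = -Kx, closed
# loop A - BK; grad J(K) = 2 (R K - B^T P_K) Sigma_K is the standard expression
# from the LQR policy-gradient literature.
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R, Sigma0 = np.eye(2), np.array([[1.0]]), np.eye(2)

def lqr_value_and_grad(K):
    Acl = A - B @ K
    assert np.max(np.linalg.eigvals(Acl).real) < 0, "K must be stabilizing"
    P_K = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))   # value matrix
    S_K = solve_continuous_lyapunov(Acl, -Sigma0)                 # state covariance
    J = np.trace(P_K @ Sigma0)
    grad = 2.0 * (R @ K - B.T @ P_K) @ S_K
    return J, grad

# Optimal cost from the ARE.
P_star = solve_continuous_are(A, B, Q, R)
J_star = np.trace(P_star @ Sigma0)

K0 = np.array([[1.0, 1.0]])                                       # a stabilizing gain
J0, g0 = lqr_value_and_grad(K0)
print("J(K0) - J* =", J0 - J_star)
print("||grad J(K0)||_F =", np.linalg.norm(g0))
```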

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 3/9
Proof picture

min_K  J(K)   s.t.  K ∈ K        ⇒        min_{Z,L,P}  f(L, P, Z)   s.t.  (L, P, Z) ∈ S

[Figure: the point K_t, its gradient ∇J(K_t), and the descent direction are mapped
between the policy space and the convex parameterization.]

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 4/9
Role of convex parameterization

A general version that applies to non-smooth J (K ) as well :

Theorem [Hu et al.,’23]


Suppose J(K) is differentiable or subdifferentially regular, and Assumptions 1 and 2 hold.
For any K satisfying J(K) > J(K∗), there exists a
non-zero V in the descent cone of K at K , such that
0 < J (K ) − J (K ∗ ) ≤ −J ′ (K , V ),
so any stationary point of J is a global minimum.

J ′ (K , V ) denotes directional derivative of J (K ) along direction V .


When J is differentiable, J ′ (K , V ) = Tr (V T ∇J (K )).

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 5/9
Example : Continuous time LQR

min_{Z,L,P}  f(L, P, Z) := Tr(QP) + Tr(ZR)

s.t.  A(P) + B(L) + Σ = 0,   G ≻ 0,
      [ Z   L⊤
        L   G ] ⪰ 0
Question : K = LP −1 , is P always invertible ? (yes, if initial x0 has
full-rank covariance)

L, P , P −1 are bounded in the sublevel set {K : J (K ) ≤ a}.

then : a ≥ J (K ) = Tr (QP ) + Tr (LP −1 L⊤ R ).

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 6/9
Example : Continuous time LQR

L, P, P⁻¹ are bounded in the sublevel set {K : J(K) < a}.

Define
ν = (λ²_min(Σ)/4) · ( σ_max(A) λ_min^{−1/2}(Q) + σ_max(B) λ_min^{−1/2}(R) )^{−2},
then
∥∇J(K)∥_F ≥ C1 (J(K) − J(K∗)),
where
C1 = ( ν λ_min^{1/2}(Q) λ_min^{1/2}(R) / (4a) ) · min{ a², ν λ_min(Q)/4 }.

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 7/9
Related Results

Many other landscape results rely on connections to LMIs


H∞ landscape : Clarke stationary is global [Guo et al., 2022]
Dynamic filtering : Differentiable convex lifting [Umenberger et
al., 2022]

LQG : Connectivity [Y. Zheng, 2023]


Output-feedback H∞ : Connectivity [Hu et al., 2022]

A general tool for landscape study. More study is needed for


output-feedback problems !

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 8/9
Future Work

The last section of our survey article lists several directions :

▶ Further connections between optimization and control theory,


e.g. complexity of escaping saddles for output feedback problems
▶ Advanced regularization for stability, robustness, and safety
▶ Nonlinear systems, deep RL, and perception-based control
▶ Multi-agent systems and decentralized control
▶ Integration of model-based and model-free methods
▶ New PO formulations from machine learning

And many more which are not listed in our article !

Maryam Fazel, Bin Hu, Kaiqing Zhang Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies 9/9
