L4DC 2023 Tutorial: Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies
Maryam Fazel, Bin Hu, Kaiqing Zhang
Motivation
Motivation : Policy Optimization
Policy Optimization – One Workhorse of (Deep) RL
What is Policy Optimization (PO) ?
▶ PO is an old idea from control : Fix the controller structure, and
optimize a control metric over the parameters of the controller:
    min_K J(K)
▶ Parametrized controller/policy K
▶ Cost function J (tracking errors, closed-loop H2 /H∞ norms, etc)
Key points :
▶ In the 1980s, convex optimization methods became dominant, owing to
strong global guarantees and efficient interior-point methods
▶ PO problem formulation is generally not convex
▶ Reparameterize as convex optimization problems (one does not
optimize the controller parameters directly) ; Lyapunov theory,
stability/performance certificates, HJB, ...
▶ Examples of LMIs : state-feedback or full-order output-feedback
H2/H∞ control (a small illustrative sketch follows below)
▶ e.g., Boyd et al., “Linear Matrix Inequalities in System and
Control Theory”, 1994, SIAM
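To make the reparameterization idea concrete, here is a minimal Python/cvxpy sketch of a generic discrete-time state-feedback stabilization LMI. It only illustrates the lift-and-recover pattern; it is not the specific H2/H∞ synthesis LMIs cited above, and the random data and tolerance δ are arbitrary.

```python
import numpy as np
import cvxpy as cp

# Sketch of the convex-reparameterization route: find Y, L with
# [[Y - delta*I, A Y - B L], [(A Y - B L)^T, Y]] >= 0, then recover the
# state-feedback gain K = L Y^{-1}, which makes A - B K Schur stable.
rng = np.random.default_rng(0)
n, m, delta = 3, 1, 1e-3
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, m))

Y = cp.Variable((n, n), symmetric=True)
L = cp.Variable((m, n))
M = cp.Variable((2 * n, 2 * n), PSD=True)   # the LMI block, forced positive semidefinite
constraints = [
    M[:n, :n] == Y - delta * np.eye(n),
    M[:n, n:] == A @ Y - B @ L,
    M[n:, n:] == Y,
]
cp.Problem(cp.Minimize(0), constraints).solve()
K = L.value @ np.linalg.inv(Y.value)                 # recovered gain
print(np.abs(np.linalg.eigvals(A - B @ K)).max())    # spectral radius < 1 if feasible
```

The point is that feasibility is checked in the lifted variables (Y, L), and the controller is only recovered afterwards; direct policy optimization, discussed next, skips this lifting step.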
History : Convex LMIs vs. PO
A Modern Perspective from Deep RL
A common practice nowadays in deep RL for robotic control : visuomotor
policy learning/image-to-torque [Levine et al., ’16]
▶ A type of perception-based control : purely model-free
▶ Train perception and control systems jointly end-to-end
Advantages :
▶ Direct and relatively simple to implement
▶ Mitigates the compounding errors that arise when perception and control
are trained separately, as in model-based RL pipelines
▶ Makes better use of deep NNs’ abstraction and perception capabilities to
handle high-dimensional visual signals
Policy Optimization : Old & New
Vanilla policy gradient :
▶ Policy Gradient Theorem [Sutton et al., ’99]
∇J(K) = E[ Q^K(x, u) · ∇ log π_K(u | x) ]    (a sketch of the resulting estimator follows below)
▶ Other estimators: G(PO)MDP [Baxter & Bartlett, ’01], actor-critic [Konda &
Tsitsiklis, ’99], natural policy gradient [Kakade ’01] (will come back to it!)
▶ Essentially stochastic gradient descent (SGD) (heart of modern machine learning) !
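As a concrete, heavily simplified illustration of the estimator implied by the policy gradient theorem, the sketch below assumes a linear-Gaussian policy u = −Kx + σ·noise and a single rollout stored as (state, action, cost) tuples; the return-to-go is used as a stand-in for Q^K(x, u). The rollout format and names are assumptions made for illustration only.

```python
import numpy as np

# Score-function (REINFORCE-style) estimate of grad J(K) = E[ Q^K(x,u) grad log pi_K(u|x) ]
# for a linear-Gaussian policy u ~ N(-K x, sigma^2 I).
def pg_estimate(trajectory, K, sigma):
    """trajectory: list of (x, u, cost) tuples from one rollout."""
    costs = np.array([c for (_, _, c) in trajectory])
    q_hat = np.cumsum(costs[::-1])[::-1]                  # return-to-go as a crude Q estimate
    grad = np.zeros_like(K)
    for (x, u, _), q in zip(trajectory, q_hat):
        grad_log_pi = -np.outer(u + K @ x, x) / sigma**2  # grad_K log pi_K(u | x)
        grad += q * grad_log_pi
    return grad
```

Averaging such estimates over rollouts and taking K ← K − η·ĝ is exactly the "SGD" view mentioned above.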
Missing Perspectives in Deep RL Literature
Missing Perspectives in Classic Control Literature
Tutorial Overview
Tutorial Overview : RL/Control → PO
Big picture :
▶ One perspective to bridge control theory and RL
▶ Understand and connect “model-free” & “model-based” views
▶ Towards a general framework for learning-based control
Schedule
Preview : Big Picture
▶ Revisit linear control problems as benchmarks for PO
Preview : Mixed H2 /H∞ Control as PO
Preview : H∞ State-Feedback Synthesis as PO
Preview : Linear Quadratic Gaussian as PO
▶ Main Ref :
Y. Zheng, Y. Tang, N. Li. Analysis of the optimization landscape of linear
quadratic Gaussian (LQG) control, Mathematical Programming, 2023.
Background : Optimization Theory
Optimization of Smooth Nonconvex Functions
Complexity: the gradient descent method K^{n+1} = K^n − α∇J(K^n) is
guaranteed to find an ϵ-stationary point of J within O(1/ϵ) steps.

Summing the descent inequality over the iterates gives

    Σ_{n=0}^{T} (α − Lα²/2) ‖∇J(K^n)‖_F² ≤ J(K^0) − J(K^{T+1}).

If α < 2/L, then C := α − Lα²/2 > 0. We know J(K^{T+1}) ≥ J* for some J*, so

    Σ_{n=0}^{T} ‖∇J(K^n)‖_F² ≤ (J(K^0) − J*)/C

    ⟹  min_{0≤n≤T} ‖∇J(K^n)‖_F² ≤ (1/(T+1)) Σ_{n=0}^{T} ‖∇J(K^n)‖_F² ≤ (J(K^0) − J*)/(C(T+1)).

To find a point whose squared gradient norm is at most ϵ, we need to run T steps with

    T = (J(K^0) − J*)/(Cϵ) − 1 = O(1/ϵ),

which is the complexity for finding an ϵ-approximate stationary point.
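A minimal numerical sketch of this argument, on a toy smooth nonconvex function chosen purely for illustration (not a control cost):

```python
import numpy as np

# Gradient descent on f(z) = sum(z_i^4/4 - z_i^2/2), a smooth nonconvex
# (per-coordinate double-well) function; report the first iterate whose
# gradient norm drops below eps.
def grad_f(z):
    return z**3 - z

def gd_until_eps_stationary(z0, alpha=0.05, eps=1e-4, max_iter=100_000):
    z = np.array(z0, dtype=float)
    for n in range(max_iter):
        g = grad_f(z)
        if np.linalg.norm(g) <= eps:      # eps-stationary point found
            return n, z
        z = z - alpha * g                  # K^{n+1} = K^n - alpha * grad J(K^n)
    return max_iter, z

print(gd_until_eps_stationary([2.0, -1.5]))
```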
Optimization of Smooth Nonconvex Functions
Complexity: the gradient descent method K^{n+1} = K^n − α∇J(K^n) is
guaranteed to find an ϵ-stationary point of J within O(1/ϵ) steps.
Coercive Functions and Compact Sublevel Sets
What if there are constraints ? If the cost is coercive, then it is a
barrier function by itself !
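A scalar LQR instance makes this concrete (the numbers are arbitrary and Σ0 = 1 is assumed): the cost blows up as the gain approaches the boundary of the stabilizing set, so sufficiently small gradient steps can never leave it.

```python
# Scalar LQR: x_{t+1} = a x_t + b u_t, u_t = -k x_t,
# J(k) = (q + r k^2) / (1 - (a - b k)^2) on the stabilizing set {k : |a - b k| < 1}.
a, b, q, r = 1.2, 1.0, 1.0, 1.0
J = lambda k: (q + r * k**2) / (1 - (a - b * k)**2)
for k in [0.25, 0.21, 0.201]:           # the boundary of the stabilizing set is k = 0.2
    print(k, J(k))                       # the cost grows without bound near the boundary
```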
PO Theory for LQR
Linear quadratic theory – background
minimize over policies   E_{x_0} Σ_{t=0}^∞ ( x_t^⊤ Q x_t + u_t^⊤ R u_t ),   s.t.  x_{t+1} = A x_t + B u_t,
with given cost matrices Q, R ≻ 0.
Linear quadratic theory
Value and policy iterations
Direct policy optimization
P = (A − BK )⊤ P (A − BK ) + K ⊤ RK + Q
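For concreteness, here is a minimal sketch of evaluating the corresponding objective J(K) = ⟨Σ0, P_K⟩ (used on the following slides) with SciPy, assuming the model (A, B, Q, R) and the initial-state covariance Σ0 are known:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost(A, B, Q, R, Sigma0, K):
    """J(K) = <Sigma0, P_K>, with P_K solving the Lyapunov equation above."""
    AK = A - B @ K
    assert np.abs(np.linalg.eigvals(AK)).max() < 1, "K must be stabilizing"
    PK = solve_discrete_lyapunov(AK.T, Q + K.T @ R @ K)   # P = AK^T P AK + Q + K^T R K
    return np.trace(Sigma0 @ PK)
```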
However, as stated, this problem is a bilinear matrix
optimization ...
Questions
LQR and policy gradient methods
▶ consider the algorithms:
    Gradient descent:  K ← K − η ∇J(K)
    Natural GD:  K ← K − η ∇J(K) Σ_K^{−1}    [Kakade ’01]
Oracles
Optimization landscape I
J(K) = ⟨Σ_0, P_K⟩ = vec(Σ_0)^⊤ ( I − (A − BK) ⊗ (A − BK) )^{−1} vec(Q + K^⊤ R K)

Observation: J(K) is not convex in K (nor quasiconvex or star-convex) for n ≥ 3
(it is convex in the single-input case when n = 2). We can compute

    ∇J(K) = 2(RK − B^⊤ P_K A_K) Σ_K,   where   Σ_K = E[ Σ_{t=0}^∞ x_t x_t^⊤ ],   Σ_0 = E[x_0 x_0^⊤].

Lemma: if ∇J(K) = 0, then either
▶ K is optimal, or
▶ the covariance Σ_K is rank deficient.

If Σ_0 is full rank, then Σ_K is full rank ⟹ stationary points are globally optimal.
(One can also show this by leveraging the convex LMI reformulation; later research
has established this deeper connection and its uses.)
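A quick numerical sanity check of the gradient formula and the "stationary ⟹ optimal" picture, on a random instance (illustration only; not the tutorial's code):

```python
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

rng = np.random.default_rng(0)
n, m = 3, 2
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
Q, R, Sigma0 = np.eye(n), np.eye(m), np.eye(n)

# Riccati-optimal gain K*, and grad J(K) = 2(RK - B^T P_K A_K) Sigma_K evaluated at K = K*.
P_star = solve_discrete_are(A, B, Q, R)
K_star = np.linalg.solve(R + B.T @ P_star @ B, B.T @ P_star @ A)
AK = A - B @ K_star
PK = solve_discrete_lyapunov(AK.T, Q + K_star.T @ R @ K_star)
SigmaK = solve_discrete_lyapunov(AK, Sigma0)
grad = 2 * (R @ K_star - B.T @ PK @ AK) @ SigmaK
print(np.linalg.norm(grad))   # ~0: the optimal gain is a stationary point
```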
Main Theory : Summary
Global convergence in the exact case
Numerical experiment
[Figure: error ‖K − K*‖_F versus iteration, on a log scale.]
Unknown model case
Algorithm
Idea for the Proof of Convergence
Suppose J(K_0) is finite, µ > 0, and x_0 has norm bounded by L almost surely,
and that GD and the natural PG method are called with parameters
    m, ℓ, 1/r = poly( J(K_0), 1/µ, 1/σ_min(Q), ‖A‖, ‖B‖, ‖R‖, 1/σ_min(R), d, 1/ϵ, L²/µ ).
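The model-free analysis behind this statement relies on a smoothing-based zeroth-order gradient estimate. A minimal sketch (with a generic cost oracle J standing in for the truncated-rollout cost estimate, and with illustrative default constants) is:

```python
import numpy as np

def zeroth_order_grad(J, K, r=0.05, m=200, rng=np.random.default_rng(0)):
    """Estimate grad J(K) as (d / (m r^2)) * sum_i J(K + U_i) U_i, with U_i
    uniform on the sphere of Frobenius radius r (d = number of entries of K)."""
    d = K.size
    grad = np.zeros_like(K)
    for _ in range(m):
        U = rng.standard_normal(K.shape)
        U *= r / np.linalg.norm(U)          # uniform direction, radius r
        grad += J(K + U) * U
    return (d / (m * r**2)) * grad
```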
Related Results
A burst of recent research interest :
Policy Optimization for Risk-Sensitive Control,
H2 /H∞ Robust Control, and Linear Quadratic Games
Kaiqing Zhang
▶ Most scenarios involve more than one agent, e.g., game of Go, video
games, robot team motion-planning, autonomous driving
▶ Most control systems are safety-critical, e.g., robotics, unmanned
(aerial) vehicles, cyber-physical systems
min_K J(K),   s.t.  K ∈ K
▶ Parametrized policy/controller K
▶ Objective to optimize J
▶ Constraint set K (important but sometimes implicit!)
    min_µ  J := lim_{T→∞} (2/(Tβ)) log E exp[ (β/2) Σ_{t=0}^{T−1} c(x_t, u_t) ],    c(x_t, u_t) = x_t^⊤ Q x_t + u_t^⊤ R u_t
▶ The optimal policy is linear state feedback:  µ_t(x_{0:t}, u_{0:t−1}) = −K* x_t
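A Monte Carlo sketch of this risk-sensitive objective for a fixed linear policy u_t = −Kx_t (all matrices, the horizon T, and the noise scale below are illustrative stand-ins, and the T → ∞ limit is truncated):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[0.1, 0.3]])
beta, T, n_rollouts = 0.1, 200, 2000

costs = np.zeros(n_rollouts)
for i in range(n_rollouts):
    x = np.zeros(2)
    for t in range(T):
        u = -K @ x
        costs[i] += x @ Q @ x + u @ R @ u           # c(x_t, u_t)
        x = A @ x + B @ u + 0.1 * rng.standard_normal(2)
J_beta = (2.0 / (T * beta)) * np.log(np.mean(np.exp(0.5 * beta * costs)))
print(J_beta)    # the exponentiation penalizes high-cost (risky) trajectories
```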
▶ Solve
▶ J(K) upper-bounds the H2-norm:
    P_K = Ã_K^⊤ P_K Ã_K + Ã_K^⊤ P_K D (γ²I − D^⊤ P_K D)^{−1} D^⊤ P_K Ã_K + C^⊤C + K^⊤RK,   with Ã_K = A − BK
▶ If γ → ∞, then J(K) → the H2-norm, i.e., the problem reduces to LQR
Why an Important & Interesting Model?
▶ Intuition: small gain theorem; if ‖T(K)‖_∞ < γ and
‖∆‖_{ℓ2→ℓ2} < 1/γ, then the G-∆ feedback interconnection is input-output stable
▶ Choosing
Policy Gradient:
    K′ = K − η · ∇J(K) = K − 2η · [ (R + B^⊤ P̃_K B)K − B^⊤ P̃_K A ] · ∆_K
Natural PG:
    K′ = K − η · ∇J(K) · ∆_K^{−1}
Gauss-Newton:
    K′ = K − η · (R + B^⊤ P̃_K B)^{−1} · ∇J(K) · ∆_K^{−1} = K − 2η · [ K − (R + B^⊤ P̃_K B)^{−1} B^⊤ P̃_K A ]
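A minimal sketch of how one might evaluate the quantities entering these updates: a fixed-point iteration for the Riccati-type equation defining P̃_K above, followed by the Gauss-Newton step. This assumes K lies in the feasible set (so that γ²I − D^⊤PD stays positive definite and the iteration converges); the implementation details are illustrative, not the referenced papers' algorithms.

```python
import numpy as np

def solve_PK(A, B, C, D, R, K, gamma, iters=1000):
    # Iterate  P <- AK^T (P + P D (gamma^2 I - D^T P D)^{-1} D^T P) AK + C^T C + K^T R K
    AK = A - B @ K
    base = C.T @ C + K.T @ R @ K
    P = base.copy()
    for _ in range(iters):
        W = np.linalg.inv(gamma**2 * np.eye(D.shape[1]) - D.T @ P @ D)
        P = AK.T @ (P + P @ D @ W @ D.T @ P) @ AK + base
    return P

def gauss_newton_update(A, B, R, K, P, eta=0.5):
    # K' = K - 2*eta*(K - (R + B^T P B)^{-1} B^T P A); eta = 1/2 gives the full step.
    return K - 2 * eta * (K - np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A))
```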
Lemma (Nonconvexity)
There is a mixed H2 /H∞ control problem that is nonconvex for policy
optimization.
Lemma (Non-Coercivity)
There is a mixed H2 /H∞ control problem whose cost function J (K ) is
non-coercive. In particular, as K → ∂K, J(K) does not necessarily
approach infinity.
▶ Exact update:
▶ Derivative-free update:
Some Divergent Cases
Bin Hu
ECE & CSL, University of Illinois Urbana-Champaign
• Main Results
Motivation: Reinforcement Learning for Control
• Many robust control problems are solved via lifting into convex spaces.
Recently, reinforcement learning (RL) has shown great promise for control!
Problem Formulation: State-feedback H∞ Control
Consider the following linear time-invariant (LTI) system
x_{t+1} = A x_t + B u_t + w_t,   x_0 = 0.
{K ∈ K : J(K) ≤ γ}  ⟺  {K = LY^{−1} : LMI(Y, L, γ) is feasible, Y ≻ 0}.
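Here J(K) is a closed-loop H∞ norm. A crude way to approximate such a norm numerically (frequency gridding on a generic discrete-time realization (A, B, C, D), assumed here only for illustration; this is not the oracle used by the algorithms below) is:

```python
import numpy as np

def hinf_norm_grid(A, B, C, D, n_grid=2000):
    """max over frequencies of the largest singular value of C (e^{jw} I - A)^{-1} B + D."""
    n = A.shape[0]
    worst = 0.0
    for w in np.linspace(0.0, np.pi, n_grid):
        T = C @ np.linalg.solve(np.exp(1j * w) * np.eye(n) - A, B) + D
        worst = max(worst, np.linalg.svd(T, compute_uv=False)[0])
    return worst
```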
Clarke subdifferential:
Proposition
If K is a local minimum of J, then 0 ∈ ∂C J(K) and K is a Clarke stationary
point.
Some Background on Nonsmooth Optimization
Generalized Clarke directional derivative:
    J°(K, d) := limsup_{K′ → K, t ↓ 0}  [ J(K′ + t d) − J(K′) ] / t.
Directional derivative:
    J′(K, d) := lim_{t ↓ 0}  [ J(K + t d) − J(K) ] / t.
Descent inequality
Let F be the minimal-norm element of ∂_δ J(K). Suppose K − αF/‖F‖_2 ∈ K for
any 0 ≤ α ≤ δ. Then we have:
Outline
• Main Results
Summary of Known Facts
High Level Ideas
K^{n+1} = K^n − δ^n F^n / ‖F^n‖_2,
Main Results
Theorem (Guo and Hu, NeurIPS2022)
Suppose (Q, R) are positive definite, and (A, B) is stabilizable. We have
1. J(K) is coercive over the set K. (Proved via the properties of (Q, R))
2. For any K ∈ K satisfying J(K) > J*, there exists V ≠ 0 s.t. J′(K, V) < 0.
3. Any Clarke stationary point of the H∞ cost is a global minimum.
4. For any γ > J*, the sublevel set K_γ = {K ∈ K : J(K) ≤ γ} is compact.
There is a strict separation between K_γ and K^c.
5. Suppose K^0 ∈ K. Set ∆_0 := dist(K_{J(K^0)}, K^c) > 0 and δ^n = 0.99∆_0/(n+1). Then
Goldstein’s subgradient method K^{n+1} = K^n − δ^n F^n/‖F^n‖_F, with F^n the
minimal-norm element of ∂_{δ^n} J(K^n), stays in the feasible set K.
The most technical part of the proof is for Step 2. It requires the use of a
non-strict version of the KYP lemma.
Step 2 of Main Result
Lemma
For any K ∈ K with J(K) > J*, there exists a direction d ≠ 0 such that the
directional derivative J′(K, d) ≤ J* − J(K) < 0.
Proof Sketch:
By convexity, the LMI
    LMI(Y + t∆Y, L + t∆L, γ + t(γ* − γ))
remains feasible with ∆Y = Y* − Y, ∆L = L* − L, and t ∈ [0, 1]. In addition, we must have
    J( (L + t∆L)(Y + t∆Y)^{−1} ) ≤ γ + t(γ* − γ).
Definition
A point K is said to be (δ, ε)-stationary if dist(0, ∂δ J(K)) ≤ ε.
Theorem 3
If we choose δ^n = δ < ∆_0, then we have:
• K^n ∈ K for all n
• min_{0≤n≤N} ‖F^n‖_2 ≤ (J(K^0) − J*) / ((N + 1)δ), i.e., the complexity of finding a
(δ, ε)-stationary point is O(∆/(δε))
Implementable Algorithms
Finding minimum norm element of Goldstein’s subdifferential may not be easy.
Fortunately, there are many implementable variants:
• Gradient Sampling (GS) (the HIFOO toolbox): The main idea is to randomly
generate sample points over B_{δ^n}(K^n); with probability 1 these are points of
differentiability. The convex hull of the gradients at these samples can be used as
an approximation of ∂_{δ^n} J(K^n) (see the sketch below).
• Nonderivative Sampling (NS) (Kiwiel, 2010): The NS method can be
viewed as the derivative-free version of the GS algorithm, using only a
zeroth-order oracle.
• Interpolated normalized gradient descent (INGD) (Zhang, J., et al.,
2020; Davis, D., et al., 2022): INGD uses an iterative sampling strategy to
generate a descent direction. The INGD algorithm is guaranteed to find a
(δ, ε)-stationary point with a high-probability finite-time complexity bound.
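A toy sketch of the gradient-sampling idea referenced above: sample gradients near K, take (approximately) the minimum-norm element of their convex hull, and move by δ in that direction. The gradient oracle J_grad, the box-shaped sampling region, and the tiny SLSQP-based QP are simplifications made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_in_hull(grads):
    """Minimum-norm element of the convex hull of the given (flattened) gradients."""
    G = np.stack(grads)                                   # shape (m, dim)
    H = G @ G.T
    m = G.shape[0]
    res = minimize(lambda lam: lam @ H @ lam, np.full(m, 1.0 / m),
                   bounds=[(0.0, 1.0)] * m,
                   constraints=({'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0},))
    return res.x @ G

def gradient_sampling_step(J_grad, K, delta, m=10, rng=np.random.default_rng(0)):
    """One Goldstein-style step K <- K - delta * F / ||F||, with F approximated by sampling."""
    samples = [K] + [K + delta * rng.uniform(-1.0, 1.0, size=K.shape) for _ in range(m)]
    F = min_norm_in_hull([J_grad(Ks).ravel() for Ks in samples]).reshape(K.shape)
    return K - delta * F / np.linalg.norm(F)
```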
Numerical Example
Figure: Simulation results. Left: the relative-error trajectories of the GS, NS,
INGD, and model-free NS methods. Middle: the relative optimality-gap trajectories
of 8 randomly generated cases for the NS method. Right: the trajectory of the
model-free NS method with a noisier oracle.
Outline
• Main Results
Take-Aways
• Any Clarke stationary point of this problem is in fact a global minimum.
Future Work
Thanks!
Analysis of the Optimization Landscape of Linear
Quadratic Gaussian (LQG) Control
Presented by Bin Hu
5th Annual Learning for Dynamics & Control Conference
University of Pennsylvania. June 14-16, 2023
Today’s talk
❑ Optimal Control: the Linear Quadratic optimal control and feedback paradigm
[Block diagram: disturbances d(t), w(t); control input u(t) → System → state x(t) and measurement y(t) → Controller → feedback]
• Many practical applications
• Linear Quadratic Regulator (LQR) when the state x_t is directly observable
• Linear Quadratic Gaussian (LQG) control when only the partial output y_t is observed
• Extensive classical results (dynamic programming, separation principle, Riccati equations, etc.)
They are all model-based. Are there any guarantees for non-convex policy optimization?
Challenges for partially observed LQG
❑ Policy optimization for LQG control
• LQG is more complicated than LQR
• Requires dynamical controllers
• Its non-convex landscape properties are much richer and more complicated than those of LQR
Outline
LQG Problem Setup
[Block diagram: plant driven by Gaussian white noises w(t) and v(t); measurement y(t) fed to a dynamical controller producing u(t)]
Objective: the LQG cost
Classical solution: two Riccati equations
➢ Kalman gain
➢ Feedback gain
Explicit dependence on the dynamics
Policy Optimization formulation
❑ Closed-loop dynamics
Policy optimization for LQG
Solution to Lyapunov equations
[Hyland, D. and Bernstein, D., “The optimal projection equations for fixed-order dynamic compensation,” IEEE Transactions on Automatic Control, 29(11):1034–1037, 1984.]
Main questions
Policy optimization for LQG
Outline
Connectivity of the feasible region
❑ Simple observation: non-convex and unbounded
Example
Connectivity of the feasible region
❑ Main Result 3: conditions for connectivity
Theorem 3: 1) The set of stabilizing controllers is connected if there exists a reduced-order stabilizing controller.
2) The sufficient condition above becomes necessary if the plant is single-input or single-output.
Corollary 1: Given any open-loop stable plant, the set of stabilizing controllers is connected.
Connectivity of the feasible region
❑ Main Result 3: conditions for connectivity
Example: Open-loop unstable system (SISO)
Outline
Structure of Stationary Points
❑ Simple observations (policy optimization for LQG):
1) The cost is a real analytic function over its domain (smooth, infinitely differentiable).
2) It has non-unique and non-isolated global optima: any similarity transformation of the
controller leaves the cost unchanged.
(The stationarity conditions involve the unique positive semidefinite solutions of two
Lyapunov equations.)
❑ Are stationary points local minima, local maxima, saddle points, or global minima?
Structure of Stationary Points
❑ Main Result: existence of strict saddle points
Theorem 4: Consider any open-loop stable plant. The zero controller (B_K = 0, C_K = 0) with any
stable A_K is a stationary point. Furthermore, the corresponding Hessian is either indefinite
(strict saddle point) or equal to zero (high-order saddle or else).
Example: for a concrete open-loop stable plant, the zero controller is a stationary point of the
cost function, and its Hessian is indefinite, with both positive and negative eigenvalues (a strict saddle).
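A finite-difference sanity check of this statement on a toy instance (a discrete-time average-cost LQG variant with random, open-loop-stable data; every choice below is an assumption made for illustration):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)
n, m, p = 2, 1, 1
A = 0.3 * rng.standard_normal((n, n))                  # small scale: Schur stable w.h.p. (assumption)
B, C = rng.standard_normal((n, m)), rng.standard_normal((p, n))
Q, R = np.eye(n), np.eye(m)
W, V = np.eye(n), np.eye(p)                             # process / measurement noise covariances

def lqg_cost(AK, BK, CK):
    # Closed loop of the plant with the dynamic controller xi' = AK xi + BK y, u = CK xi.
    Acl = np.block([[A, B @ CK], [BK @ C, AK]])
    Wcl = np.block([[W, np.zeros((n, n))], [np.zeros((n, n)), BK @ V @ BK.T]])
    Qcl = np.block([[Q, np.zeros((n, n))], [np.zeros((n, n)), CK.T @ R @ CK]])
    X = solve_discrete_lyapunov(Acl, Wcl)               # stationary closed-loop covariance
    return np.trace(Qcl @ X)

AK0, BK0, CK0 = 0.3 * np.eye(n), np.zeros((n, p)), np.zeros((m, n))
J0, eps = lqg_cost(AK0, BK0, CK0), 1e-6
dBK = np.zeros((n, p)); dBK[0, 0] = eps
dCK = np.zeros((m, n)); dCK[0, 0] = eps
print((lqg_cost(AK0, BK0 + dBK, CK0) - J0) / eps)       # ~0: first-order insensitive to B_K
print((lqg_cost(AK0, BK0, CK0 + dCK) - J0) / eps)       # ~0: first-order insensitive to C_K
```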
Structure of Stationary Points
❑ Implication
Corollary: Consider gradient descent iterations
If the iterates converge to a minimal controller, then this minimal controller is a global optimum.
More questions:
✓ Escaping saddle points?
✓ Convergence conditions?
✓ Convergence speed?
✓ Alternative model-free parameterization?
Comparison with LQR
Policy optimization for LQR vs. policy optimization for LQG
Policy optimization for LQG control
Ongoing and Future work
s.t.  (L, P, Z) ∈ S,   P ≻ 0,   [[Z, L], [L^⊤, P]] ⪰ 0,
where K* = L*(P*)^{−1}.
▶ further, K = LP −1 parameterizes all stabilizing K ∈ K
▶ also see [Mohammadi et al.’19]
Assumptions on parameterization map
Assumptions :
1. S is convex, f (L, P, Z ) is convex, bounded, differentiable on S.
2. we can express J(K ) as
Role of convex parameterization
Proof picture
min_{K ∈ K} J(K)     ⟷     min_{Z,L,P} f(L, P, Z),   s.t.  (L, P, Z) ∈ S
Role of convex parameterization
Example: Continuous-time LQR
Define
    ν = (λ²_min(Σ)/4) · ( σ_max(A) λ_min(Q)^{−1/2} + σ_max(B) λ_min(R)^{−1/2} )^{−2},
then
    ‖∇J(K)‖²_F ≥ C1 ( J(K) − J(K*) ),
where
    C1 = ( ν λ^{1/2}_min(Q) λ^{1/2}_min(R) / (4a) ) · min{ a², ν λ_min(Q)/4 }.
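A numerical sketch evaluating the continuous-time LQR cost, its gradient, and the gradient-dominance ratio on a random instance (all data below are illustrative, not the slides' example):

```python
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov

rng = np.random.default_rng(1)
n, m = 3, 2
A = rng.standard_normal((n, n)) - 2 * np.eye(n)     # shifted to be Hurwitz (assumption)
B = rng.standard_normal((n, m))
Q, R, Sigma = np.eye(n), np.eye(m), np.eye(n)

def cost_and_grad(K):
    AK = A - B @ K
    assert np.linalg.eigvals(AK).real.max() < 0, "K must be stabilizing"
    P = solve_continuous_lyapunov(AK.T, -(Q + K.T @ R @ K))  # AK^T P + P AK + Q + K^T R K = 0
    X = solve_continuous_lyapunov(AK, -Sigma)                # AK X + X AK^T + Sigma = 0
    return np.trace(P @ Sigma), 2 * (R @ K - B.T @ P) @ X

K_star = np.linalg.solve(R, B.T @ solve_continuous_are(A, B, Q, R))
J_star, _ = cost_and_grad(K_star)
K = K_star + 0.1 * rng.standard_normal(K_star.shape)          # a nearby stabilizing gain
J, G = cost_and_grad(K)
print(np.linalg.norm(G) ** 2 / (J - J_star))                   # the gradient-dominance ratio
```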
Related Results
Future Work