6.881 Optimization for Machine Learning (MIT)
Lecture 3: Basic Problems, Duality
Suvrit Sra
Massachusetts Institute of Technology
23 Feb, 2021
Basic convex problems
Linear Programming

$$\min_x\; c^T x \quad \text{s.t.}\quad Ax \le b,\ Cx = d.$$

[Figure: a piecewise-linear convex function $f(x) = \max_i\,(a_i^T x + b_i)$, shown as the maximum of its affine pieces.]

Minimizing such an $f$ is an LP in epigraph form:

$$\min_{x,t}\; t \quad \text{s.t.}\quad a_i^T x + b_i \le t,\quad i = 1, \dots, m.$$
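A minimal numerical sketch of this epigraph trick (synthetic data; SciPy's linprog is an assumption of this sketch, not part of the lecture):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))   # rows are the a_i
b = rng.standard_normal(m)        # offsets b_i

# Variables z = (x, t); the objective selects t.
c = np.zeros(n + 1)
c[-1] = 1.0
# a_i^T x + b_i <= t  rewritten as  a_i^T x - t <= -b_i.
A_ub = np.hstack([A, -np.ones((m, 1))])
res = linprog(c, A_ub=A_ub, b_ub=-b,
              bounds=[(None, None)] * (n + 1))  # override linprog's default x >= 0
x_opt, t_opt = res.x[:-1], res.x[-1]
print(t_opt, np.max(A @ x_opt + b))  # t_opt should match max_i (a_i^T x + b_i)
```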
Exercises

Formulate $\min_x \|Ax - b\|_1$ as an LP (recall $\|x\|_1 = \sum_i |x_i|$).
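One standard reformulation (this sketch gives away the exercise): introduce $s \in \mathbb{R}^m$ with $-s \le Ax - b \le s$ and minimize $\mathbf{1}^T s$. A numerical check with synthetic data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
m, n = 8, 4
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Variables z = (x, s) with s in R^m; minimize sum(s).
c = np.concatenate([np.zeros(n), np.ones(m)])
I = np.eye(m)
#  Ax - b <= s   and   -(Ax - b) <= s
A_ub = np.block([[A, -I], [-A, -I]])
b_ub = np.concatenate([b, -b])
res = linprog(c, A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (n + m))
x = res.x[:n]
print(res.fun, np.abs(A @ x - b).sum())  # both equal the optimal l1 residual
```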
Quadratic Programming

$$\min_x\; \tfrac{1}{2} x^T A x + b^T x + c \quad \text{s.t.}\quad Gx \le h.$$

We assume $A \succeq 0$ (positive semidefinite).

Exercise: Suppose there are no constraints; does a QP always have a solution?
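A sketch of solving such a QP with CVXPY (an assumption of this sketch, not part of the lecture; data synthetic, $A$ made PSD by construction):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 3
M = rng.standard_normal((n, n))
A = M.T @ M                     # PSD by construction, as the slide assumes
b = rng.standard_normal(n)
G = rng.standard_normal((p, n))
h = rng.standard_normal(p)

x = cp.Variable(n)
obj = 0.5 * cp.quad_form(x, A) + b @ x   # the constant c does not affect the argmin
prob = cp.Problem(cp.Minimize(obj), [G @ x <= h])
prob.solve()
print(prob.value, x.value)
```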
Regularized least-squares

Lasso:
$$\min_x\; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1.$$
Exercise: How large must $\lambda > 0$ be so that $x = 0$ is the optimum?

Total-variation denoising:
$$\min_x\; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\sum_{i=1}^{n-1}|x_{i+1} - x_i|.$$
Exercise: Is the total-variation term a norm? Prove or disprove.

Group Lasso:
$$\min_{x_1,\dots,x_T}\; \tfrac{1}{2}\Big\|b - \sum_{j=1}^T A_j x_j\Big\|_2^2 + \lambda\sum_{j=1}^T \|x_j\|_2.$$
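Sketches of all three models in CVXPY (assumed available; data synthetic). For a vector argument, `cp.tv` computes $\sum_i |x_{i+1} - x_i|$:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
m, n, lam = 20, 10, 0.5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x = cp.Variable(n)
fit = 0.5 * cp.sum_squares(A @ x - b)
lasso = cp.Problem(cp.Minimize(fit + lam * cp.norm1(x)))
tv    = cp.Problem(cp.Minimize(fit + lam * cp.tv(x)))

# Group lasso: T = 2 blocks, each with its own A_j and x_j.
As = [rng.standard_normal((m, 5)) for _ in range(2)]
xs = [cp.Variable(5) for _ in range(2)]
resid = b - sum(Aj @ xj for Aj, xj in zip(As, xs))
group = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(resid)
                               + lam * sum(cp.norm(xj, 2) for xj in xs)))

for prob in (lasso, tv, group):
    prob.solve()
    print(prob.value)
```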
Robust LP as an SOCP
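A sketch of the standard construction (the ellipsoidal uncertainty model and the symbols $\bar a_i$, $P_i$ below are assumptions of this sketch): each row $a_i$ of $A$ is only known to lie in an ellipsoid, and the constraint must hold for every such $a_i$:

$$a_i \in \mathcal{E}_i = \{\bar a_i + P_i u : \|u\|_2 \le 1\}: \qquad \sup_{\|u\|_2 \le 1} (\bar a_i + P_i u)^T x \le b_i \;\Longleftrightarrow\; \bar a_i^T x + \|P_i^T x\|_2 \le b_i,$$

which is a second-order cone constraint, so the robust LP becomes an SOCP.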
Semidefinite Program (SDP)

$$\min_{x \in \mathbb{R}^n}\; c^T x \quad \text{s.t.}\quad A(x) := A_0 + x_1 A_1 + x_2 A_2 + \dots + x_n A_n \succeq 0.$$
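A tiny CVXPY sketch (the setup is an assumption, not from the slides): with $A(x) = I + x A_1$, the smallest feasible $x$ is $-1/\lambda_{\max}(A_1)$ whenever $\lambda_{\max}(A_1) > 0$, which the SDP recovers:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A1 = (M + M.T) / 2                      # symmetric data matrix

x = cp.Variable()
constraints = [np.eye(4) + x * A1 >> 0]  # the LMI A(x) >= 0 (PSD constraint)
prob = cp.Problem(cp.Minimize(x), constraints)
prob.solve()
# I + x*A1 >= 0 iff 1 + x*lam_i >= 0 for all eigenvalues lam_i of A1,
# so min x = -1/lam_max(A1) (assuming lam_max > 0, true here w.h.p.).
print(x.value, -1 / np.linalg.eigvalsh(A1)[-1])
```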
Examples

Minimizing the largest singular value (spectral norm) of $A(x)$:

$$\min_{x,t}\; t \quad \text{s.t.}\quad \begin{bmatrix} tI & A(x)^T \\ A(x) & tI \end{bmatrix} \succeq 0.$$
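One way to see what this SDP computes (a standard Schur-complement argument): for $t > 0$,

$$\begin{bmatrix} tI & A(x)^T \\ A(x) & tI \end{bmatrix} \succeq 0 \;\Longleftrightarrow\; tI - A(x)(tI)^{-1}A(x)^T \succeq 0 \;\Longleftrightarrow\; t^2 I \succeq A(x)A(x)^T \;\Longleftrightarrow\; \|A(x)\|_2 \le t,$$

so the problem minimizes the largest singular value of $A(x)$ over $x$.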
Duality
(Weak duality, strong duality)
Primal problem

$$\begin{array}{ll} \min & f(x)\\ \text{s.t.} & f_i(x) \le 0, \quad 1 \le i \le m, \qquad (P)\\ & x \in \mathcal{X} := \operatorname{dom} f \cap \operatorname{dom} f_1 \cap \dots \cap \operatorname{dom} f_m. \end{array}$$
"The reader will find no figures in this work. The methods which I set forth do not require either constructions or geometrical or mechanical reasonings: but only algebraic operations, subject to a regular and uniform rule of procedure."

Joseph-Louis Lagrange, Preface to Mécanique Analytique
Lagrangian

$$L(x, \lambda) := f(x) + \sum\nolimits_{i=1}^m \lambda_i f_i(x).$$

♠ The variables $\lambda \in \mathbb{R}^m_+$ are called Lagrange multipliers.

♠ Suppose $x$ is feasible and $\lambda \ge 0$. Then the lower bound

$$f(x) \ge L(x, \lambda)$$

holds for all feasible $x \in \mathcal{X}$ and $\lambda \in \mathbb{R}^m_+$.

♠ In other words,

$$\sup_{\lambda \in \mathbb{R}^m_+} L(x, \lambda) = \begin{cases} f(x), & \text{if } x \text{ is feasible},\\ +\infty, & \text{otherwise}. \end{cases}$$
Lagrangian – proof

$$L(x, \lambda) := f(x) + \sum\nolimits_{i=1}^m \lambda_i f_i(x).$$
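One way to argue the claim: if $x$ is feasible, then $f_i(x) \le 0$ while $\lambda_i \ge 0$, so $\sum_i \lambda_i f_i(x) \le 0$ and $L(x, \lambda) \le f(x)$; choosing $\lambda = 0$ attains equality, hence $\sup_{\lambda \ge 0} L(x, \lambda) = f(x)$. If $x$ is infeasible, some $f_j(x) > 0$, and letting $\lambda_j \to \infty$ (with the other $\lambda_i = 0$) drives $L(x, \lambda) \to +\infty$.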
Dual value

$$L(x, \lambda) := f(x) + \sum\nolimits_{i=1}^m \lambda_i f_i(x).$$

Dual function

$$g(\lambda) := \inf_{x \in \mathcal{X}} L(x, \lambda).$$
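A tiny worked example (not from the slides): take $\min x^2$ subject to $1 - x \le 0$. Then

$$L(x, \lambda) = x^2 + \lambda(1 - x), \qquad \inf_x L(x, \lambda) \text{ is attained at } x = \lambda/2 \;\Longrightarrow\; g(\lambda) = \lambda - \frac{\lambda^2}{4}.$$

Note that $g$ is concave, and $\max_{\lambda \ge 0} g(\lambda) = g(2) = 1 = p^*$.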
Weak duality theorem

Theorem (weak duality). $p^* \ge d^*$.

Proof:
1. $f(x') \ge L(x', \lambda)$ for every feasible $x' \in \mathcal{X}$.
2. Thus, for any feasible $x$, we have $f(x) \ge \inf_{x'} L(x', \lambda) = g(\lambda)$.
3. Now minimize over $x$ on the left-hand side to obtain $p^* \ge g(\lambda)$ for all $\lambda \in \mathbb{R}^m_+$.
4. Thus, taking the sup over $\lambda \in \mathbb{R}^m_+$, we obtain $p^* \ge d^*$.
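A quick numerical check of $p^* \ge g(\lambda)$ on the worked example above ($\min x^2$ s.t. $x \ge 1$, so $p^* = 1$):

```python
import numpy as np

lam = np.linspace(0.0, 6.0, 121)
g = lam - lam**2 / 4        # dual function derived above
p_star = 1.0                # optimum of min x^2 s.t. x >= 1, at x = 1
assert np.all(p_star >= g - 1e-12)   # weak duality, checked on a grid
print(g.max())              # equals p_star here: no duality gap
```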
Lagrangians - Exercise

$$\begin{array}{ll}\min & f(x)\\ \text{s.t.} & f_i(x) \le 0, \quad i = 1, \dots, m,\\ & h_i(x) = 0, \quad i = 1, \dots, p.\end{array}$$

With multipliers $\lambda \in \mathbb{R}^m_+$ for the inequalities and $\nu \in \mathbb{R}^p$ for the equalities, derive the dual function

$$g : \mathbb{R}^m_+ \times \mathbb{R}^p \to \mathbb{R}, \quad (\lambda, \nu) \mapsto \inf_x L(x, \lambda, \nu).$$
Exercises: Some duals

Derive Lagrangian duals for the following problems:
▶ Least-norm solution of linear equations: $\min x^T x$ s.t. $Ax = b$
▶ Dual of an LP
▶ Dual of an SOCP
▶ Dual of an SDP
▶ Study example (5.7) in BV (binary QP)
Strong duality
Duality gap

$$p^* - d^*$$

By weak duality, the gap is always nonnegative.
Abstract duality gap theorem?
Slater's sufficient conditions

$$\begin{array}{ll}\min & f(x)\\ \text{s.t.} & f_i(x) \le 0, \quad 1 \le i \le m,\\ & Ax = b.\end{array}$$

Constraint qualification: there exists $x \in \operatorname{ri} \mathcal{X}$ such that

$$f_i(x) < 0 \ (1 \le i \le m), \qquad Ax = b.$$

If this holds for a convex problem, then strong duality holds: $p^* = d^*$.
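For instance, in the worked example above ($\min x^2$ s.t. $1 - x \le 0$), the point $x = 2$ satisfies $1 - x = -1 < 0$, so Slater's condition holds, consistent with the zero gap $p^* = d^* = 1$ found earlier.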
Example with positive duality-gap

$$\min_{x,y}\; e^{-x} \quad \text{s.t.}\quad x^2/y \le 0, \; y > 0.$$

The constraint forces $x = 0$, so $p^* = 1$.

Dual problem

$$d^* = \max_{\lambda}\; 0 \quad \text{s.t.}\quad \lambda \ge 0,$$

so $d^* = 0$ and the gap is $p^* - d^* = 1$. (Slater fails: no point satisfies $x^2/y < 0$.)
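Where $d^* = 0$ comes from (a sketch): for every $\lambda \ge 0$,

$$g(\lambda) = \inf_{x,\, y > 0}\; e^{-x} + \lambda \frac{x^2}{y} = 0,$$

since the objective is nonnegative, while along $x = t$, $y = t^3$ it equals $e^{-t} + \lambda/t \to 0$ as $t \to \infty$. Hence the dual objective is identically $0$ on $\lambda \ge 0$.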
Example: Support Vector Machine (SVM)

$$\min_{x, \xi}\; \tfrac{1}{2}\|x\|_2^2 + C \sum\nolimits_i \xi_i \quad \text{s.t.}\quad Ax \ge 1 - \xi, \; \xi \ge 0.$$
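A sketch checking strong duality numerically with CVXPY (assumed available; data synthetic). The dual used below, $\max_{0 \le \lambda \le C} \mathbf{1}^T\lambda - \tfrac12\|A^T\lambda\|_2^2$, is the standard SVM dual:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
m, n, C = 30, 5, 1.0
A = rng.standard_normal((m, n))       # rows play the role of y_i * a_i^T

x, xi = cp.Variable(n), cp.Variable(m)
primal = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(x) + C * cp.sum(xi)),
    [A @ x >= 1 - xi, xi >= 0])

lam = cp.Variable(m)
dual = cp.Problem(
    cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(A.T @ lam)),
    [lam >= 0, lam <= C])

print(primal.solve(), dual.solve())   # equal up to solver tolerance
```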
Example: norm regularized problems

$$\min_x\; f(x) + \|Ax\|$$

Dual problem
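A standard derivation of the dual (a sketch; $f^*$ is the Fenchel conjugate and $\|\cdot\|_*$ the dual norm): split $z = Ax$ and form the Lagrangian.

$$\min_{x,z}\; f(x) + \|z\| \ \text{ s.t. }\ z = Ax, \qquad L(x, z, u) = f(x) + \|z\| + u^T(Ax - z),$$

$$g(u) = \inf_x \big(f(x) + u^T A x\big) + \inf_z \big(\|z\| - u^T z\big) = -f^*(-A^T u) + \begin{cases} 0, & \|u\|_* \le 1,\\ -\infty, & \text{otherwise}, \end{cases}$$

so the dual problem is $\max_u\; -f^*(-A^T u)$ s.t. $\|u\|_* \le 1$.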
Example: Dual via Fenchel conjugates

$$\min_x\; f_0(x) \quad \text{s.t.}\quad f_i(x) \le 0 \ \ (1 \le i \le m), \quad Ax = b.$$
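For instance, keeping only the equality constraints, the dual function can be written entirely in terms of the conjugate $f_0^*$:

$$g(\nu) = \inf_x\; f_0(x) + \nu^T(Ax - b) = -\sup_x\big((-A^T\nu)^T x - f_0(x)\big) - b^T\nu = -f_0^*(-A^T\nu) - b^T\nu.$$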
Example: Dual via Fenchel conjugates

Important trick: "variable splitting"
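One common form of the trick (a sketch): to dualize $\min_x f(x) + g(Ax)$, introduce a new variable to decouple $f$ and $g$:

$$\min_{x,z}\; f(x) + g(z) \ \ \text{s.t.}\ \ z = Ax \quad\Longrightarrow\quad g_{\mathrm{dual}}(u) = -f^*(-A^T u) - g^*(u),$$

which recovers the norm-regularized dual above when $g = \|\cdot\|$ (then $g^*$ is the indicator of the dual-norm ball).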
Strong duality: nonconvex example

$$\min_x\; x^T A x + 2 b^T x \quad \text{s.t.}\quad x^T x \le 1.$$

Strong duality holds here even when $A$ is indefinite (the trust-region subproblem).
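A sketch of the dual (a standard computation for the trust-region subproblem; $(\cdot)^\dagger$ is the pseudoinverse):

$$g(\lambda) = \inf_x\; x^T(A + \lambda I)x + 2b^T x - \lambda = \begin{cases} -b^T(A + \lambda I)^\dagger b - \lambda, & A + \lambda I \succeq 0, \ b \in \mathcal{R}(A + \lambda I),\\ -\infty, & \text{otherwise}, \end{cases}$$

so the dual is $\max_{\lambda \ge 0}\; -b^T(A + \lambda I)^\dagger b - \lambda$ s.t. $A + \lambda I \succeq 0$, and here $d^* = p^*$ despite nonconvexity.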
von Neumann minmax theorem?
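The classical bilinear statement, for reference: for probability simplices $\Delta_m$, $\Delta_n$ and any $A \in \mathbb{R}^{m \times n}$,

$$\min_{x \in \Delta_m} \max_{y \in \Delta_n} x^T A y \;=\; \max_{y \in \Delta_n} \min_{x \in \Delta_m} x^T A y.$$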