
Optimization for Machine Learning

Lecture 3: Basic problems, Duality


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

23 Feb, 2021
Basic convex problems

Linear Programming

$$\min_x \; c^T x \quad \text{s.t.} \quad Ax \le b, \; Cx = d.$$

Piecewise linear minimization is an LP

$$\min_x \; f(x) = \max_{1 \le i \le m} \; \big(a_i^T x + b_i\big)$$

(Figure: $f(x)$ shown as the pointwise maximum of the affine functions $a_i^T x + b_i$.)

$$\min_{x,t} \; t \quad \text{s.t.} \quad a_i^T x + b_i \le t, \quad i = 1, \dots, m.$$
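A minimal numerical sketch of the epigraph reformulation, assuming the cvxpy modeling package and made-up data $(a_i, b_i)$:

```python
# Piecewise-linear minimization via its LP (epigraph) form, with random data.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))   # rows are a_i^T
b = rng.standard_normal(m)

x = cp.Variable(n)
t = cp.Variable()
# Epigraph LP: minimize t subject to a_i^T x + b_i <= t for all i
prob = cp.Problem(cp.Minimize(t), [A @ x + b <= t])
prob.solve()

# Cross-check against minimizing the pointwise maximum directly
x2 = cp.Variable(n)
prob2 = cp.Problem(cp.Minimize(cp.max(A @ x2 + b)))
prob2.solve()
print(prob.value, prob2.value)    # the two optimal values should agree
```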

Exercises
Formulate $\min_x \|Ax - b\|_1$ as an LP $\big(\|x\|_1 = \sum_i |x_i|\big)$

Formulate $\min_x \|Ax - b\|_\infty$ as an LP $\big(\|x\|_\infty = \max_{1 \le i \le n} |x_i|\big)$

Explore: LP formulations for Markov Decision Processes (MDPs). MDPs are frequently used models in Reinforcement Learning, and in some cases admit nice LP formulations.

Explore: Integer LP: $\min_x c^T x$ s.t. $Ax \le b$, $x \in \mathbb{Z}^n$.

Open Problem. Can we solve the system of inequalities $Ax \le b$ in strongly polynomial time in the dimensions of the system, independent of the magnitudes of the coefficients? The best known result (Tardos, 1984) depends on the coefficients of $A$ but is independent of the magnitudes of $b$ and the cost vector $c$.
N. Megiddo, On the complexity of linear programming.
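For the first two exercises, a quick numerical baseline (assuming cvxpy and made-up data) is to solve the norm problems directly and compare the optimal values with those of your own LP reformulations:

```python
# Direct solves of the two exercise problems, as a check on hand-built LPs.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x1 = cp.Variable(5)
p1 = cp.Problem(cp.Minimize(cp.norm(A @ x1 - b, 1)))
p1.solve()

xinf = cp.Variable(5)
pinf = cp.Problem(cp.Minimize(cp.norm(A @ xinf - b, "inf")))
pinf.solve()

print(p1.value, pinf.value)  # compare with the values from your LP formulations
```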
Quadratic Programming

$$\min_x \; \tfrac{1}{2} x^T A x + b^T x + c \quad \text{s.t.} \quad Gx \le h.$$
We assume $A \succeq 0$ (positive semidefinite).
Exercise: Suppose there are no constraints; does the QP always have a solution?

Nonnegative least squares (NNLS)
$$\min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 \quad \text{s.t.} \quad x \ge 0.$$
Exercise: Prove that NNLS always has a solution.

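A minimal sketch of solving an NNLS instance, assuming SciPy and cvxpy with made-up data; the two solves should agree:

```python
# NNLS via SciPy's dedicated solver and, equivalently, as a QP in cvxpy.
import numpy as np
from scipy.optimize import nnls
import cvxpy as cp

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)

x_nnls, residual = nnls(A, b)          # scipy returns (solution, ||Ax - b||_2)

x = cp.Variable(8, nonneg=True)        # the constraint x >= 0
qp = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b)))
qp.solve()

print(residual**2 / 2, qp.value)       # the two objective values should agree
```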
Regularized least-squares
Lasso
$$\min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1.$$
Exercise: How large must $\lambda > 0$ be so that $x = 0$ is the optimum?
Total-variation denoising
$$\min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \sum_{i=1}^{n-1} |x_{i+1} - x_i|.$$
Exercise: Is the total-variation term a norm? Prove or disprove.
Group Lasso
$$\min_{x_1,\dots,x_T} \; \tfrac{1}{2}\Big\| b - \sum_{j=1}^{T} A_j x_j \Big\|_2^2 + \lambda \sum_{j=1}^{T} \|x_j\|_2.$$

Exercise: What is the dual norm of the regularizer above?

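One way to explore the Lasso exercise numerically, assuming cvxpy and made-up data, is to sweep $\lambda$ and watch when the solution becomes (numerically) zero:

```python
# Lasso solved for a range of lambda values; inspect when x becomes ~0.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)

x = cp.Variable(10)
lam = cp.Parameter(nonneg=True)
lasso = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b) + lam * cp.norm(x, 1)))

for val in [0.1, 1.0, 5.0, 20.0, 100.0]:
    lam.value = val
    lasso.solve()
    print(val, np.round(x.value, 4))
```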
Robust LP as an SOCP

$$\min \; c^T x \quad \text{s.t.} \quad a_i^T x \le b_i \;\; \forall\, a_i \in \mathcal{E}_i, \qquad \mathcal{E}_i := \{\bar a_i + P_i u \mid \|u\|_2 \le 1\}$$

Constraints are uncertain but with bounded uncertainty.

(Adversarially) Robust LP formulation
$$\min_x \; \max_{\|u\|_2 \le 1} \; \big\{ c^T x \;\big|\; a_i^T x \le b_i, \; a_i \in \mathcal{E}_i \big\}$$

Second Order Cone Program

$$\min \; c^T x \quad \text{s.t.} \quad \|P_i^T x\|_2 \le -\bar a_i^T x + b_i, \quad i = 1, \dots, m.$$

The SOCP constraint comes from:
$$\max_{\|u\|_2 \le 1} \; (\bar a_i + P_i u)^T x = \bar a_i^T x + \|P_i^T x\|_2.$$

Exercise: Give a quick argument for the above equality.

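A minimal sketch of the robust-LP SOCP, assuming cvxpy and made-up ellipsoid data $(\bar a_i, P_i, b_i)$:

```python
# Robust LP as an SOCP: each row gives the constraint ||P_i^T x||_2 <= b_i - abar_i^T x.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
m, n = 6, 4
abar = rng.standard_normal((m, n))          # nominal constraint rows abar_i
P = rng.standard_normal((m, n, n)) * 0.1    # ellipsoid shapes P_i
b = rng.standard_normal(m) + 5.0
c = rng.standard_normal(n)

x = cp.Variable(n)
constraints = [cp.norm(P[i].T @ x, 2) <= b[i] - abar[i] @ x for i in range(m)]
prob = cp.Problem(cp.Minimize(c @ x), constraints)
prob.solve()
print(prob.value, x.value)
```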
Semidefinite Program (SDP)

$$\min_{x \in \mathbb{R}^n} \; c^T x \quad \text{s.t.} \quad A(x) := A_0 + x_1 A_1 + x_2 A_2 + \cdots + x_n A_n \succeq 0.$$

▶ $A_0, \dots, A_n$ are real, symmetric matrices
▶ The inequality $A \preceq B$ means $B - A$ is semidefinite
▶ Also a cone program (conic optimization problem)
▶ SDP ⊃ SOCP ⊃ QP ⊃ LP
▶ Exercise: Write LPs, QPs, and SOCPs as SDPs
▶ The feasible set of an SDP is {semidefinite cone} ∩ {hyperplanes}
Explore: Which convex problems are representable as SDPs? (This is an important topic in optimization theory.)

Examples

♠ Eigenvalue optimization: $\min_x \lambda_{\max}(A(x))$
$$\min_{x,t} \; t \quad \text{s.t.} \quad A(x) \preceq tI.$$

♠ Norm minimization: $\min_x \|A(x)\|$
$$\min_{x,t} \; t \quad \text{s.t.} \quad \begin{bmatrix} tI & A(x)^T \\ A(x) & tI \end{bmatrix} \succeq 0.$$

♠ More examples – see CVX documentation and BV book

Explore: SDP relaxations of nonconvex problems: an important technique, starting with the MaxCut SDP (Goemans-Williamson).
Explore: Sum-of-squares (SOS) optimization and the Lasserre hierarchy of relaxations; see also: https://www.sumofsquares.org

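A minimal SDP sketch of the eigenvalue-optimization example, assuming cvxpy and random symmetric matrices $A_0, \dots, A_n$ (made-up data); the epigraph LMI is checked afterwards with plain numpy:

```python
# Minimize t s.t. A(x) <= t*I, i.e. minimize lambda_max(A(x)) as an SDP.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, k = 3, 4                       # x in R^n, the A_i are k x k symmetric
As = []
for _ in range(n + 1):
    M = rng.standard_normal((k, k))
    As.append(M + M.T)            # symmetrize the random matrices

def A_of(x):
    return As[0] + sum(x[i] * As[i + 1] for i in range(n))

x = cp.Variable(n)
t = cp.Variable()
# LMI epigraph constraint: t*I - A(x) is positive semidefinite
prob = cp.Problem(cp.Minimize(t), [t * np.eye(k) - A_of(x) >> 0])
prob.solve()

# Sanity check: at the solution, t should equal lambda_max(A(x))
Ax = As[0] + sum(x.value[i] * As[i + 1] for i in range(n))
print(prob.value, np.linalg.eigvalsh(Ax).max())   # should agree
```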
Duality
(Weak duality, strong duality)

Primal problem

Let $f_i : \mathbb{R}^n \to \mathbb{R}$ ($1 \le i \le m$). Generic nonlinear program:
$$\begin{aligned}
\min \;\; & f(x) \\
\text{s.t.} \;\; & f_i(x) \le 0, \quad 1 \le i \le m, \qquad \text{(P)} \\
& x \in \operatorname{dom} f \cap \operatorname{dom} f_1 \cap \cdots \cap \operatorname{dom} f_m.
\end{aligned}$$

Domain: the set $\mathcal{X} := \operatorname{dom} f \cap \operatorname{dom} f_1 \cap \cdots \cap \operatorname{dom} f_m$

▶ We call (P) the primal problem
▶ The variable $x$ is the primal variable

Lagrangians and Duality

"The reader will find no figures in this work. The methods which I set forth do not require either constructions or geometrical or mechanical reasonings: but only algebraic operations, subject to a regular and uniform rule of procedure."

—Joseph-Louis Lagrange, Preface to Mécanique Analytique

Lagrangian

To the primal, associate the Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m_+ \to (-\infty, \infty]$,
$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i f_i(x).$$

♠ The variables $\lambda \in \mathbb{R}^m_+$ are called Lagrange multipliers
♠ Suppose $x$ is feasible and $\lambda \ge 0$. Then the lower bound holds:
$$f(x) \ge L(x, \lambda) \qquad \text{for all feasible } x \in \mathcal{X}, \; \lambda \in \mathbb{R}^m_+.$$

♠ In other words,
$$\sup_{\lambda \in \mathbb{R}^m_+} L(x, \lambda) = \begin{cases} f(x), & \text{if } x \text{ is feasible}, \\ +\infty & \text{otherwise}. \end{cases}$$

Proof on next slide

Lagrangian – proof
$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i f_i(x).$$

▶ $f(x) \ge L(x, \lambda)$ for all feasible $x \in \mathcal{X}$ and $\lambda \in \mathbb{R}^m_+$; so the primal optimal value is
$$p^* = \inf_{x \in \mathcal{X}} \; \sup_{\lambda \ge 0} \; L(x, \lambda).$$

▶ If $x$ is not feasible, then some $f_i(x) > 0$
▶ In this case, the inner sup is $+\infty$, so the claim holds by definition
▶ If $x$ is feasible, each $f_i(x) \le 0$, so $\sup_{\lambda \ge 0} \sum_i \lambda_i f_i(x) = 0$

Dual value
$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i f_i(x).$$

Primal value $\in [-\infty, +\infty]$:
$$p^* = \inf_{x \in \mathcal{X}} \; \sup_{\lambda \ge 0} \; L(x, \lambda).$$

Dual value $\in [-\infty, +\infty]$:
$$d^* = \sup_{\lambda \ge 0} \; \inf_{x \in \mathcal{X}} \; L(x, \lambda).$$

Dual function:
$$g(\lambda) := \inf_{x \in \mathcal{X}} \; L(x, \lambda).$$

Observe that $g(\lambda)$ is always concave!

Weak duality theorem

Theorem (Weak duality). $p^* \ge d^*$ (i.e., weak duality always holds).

Proof:
1. $f(x') \ge L(x', \lambda)$ for every feasible $x' \in \mathcal{X}$ and every $\lambda \ge 0$
2. Thus, for any feasible $x$, we have $f(x) \ge \inf_{x' \in \mathcal{X}} L(x', \lambda) = g(\lambda)$
3. Now minimize over feasible $x$ on the lhs to obtain $p^* \ge g(\lambda)$ for all $\lambda \in \mathbb{R}^m_+$
4. Thus, taking the sup over $\lambda \in \mathbb{R}^m_+$, we obtain $p^* \ge d^*$.

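As a quick illustration (a toy instance chosen here, not from the lecture): for $\min x^2$ s.t. $1 - x \le 0$, the Lagrangian is $L(x, \lambda) = x^2 + \lambda(1 - x)$, so $g(\lambda) = \lambda - \lambda^2/4$, and $\sup_{\lambda \ge 0} g(\lambda) = 1 = p^*$, so here the bound is tight. A small numeric sketch of this check:

```python
# Weak/strong duality check on the toy problem min x^2 s.t. 1 - x <= 0.
import numpy as np

xs = np.linspace(-3, 3, 100001)
lams = np.linspace(0, 10, 101)

p_star = np.min(xs[xs >= 1] ** 2)                 # primal optimal value = 1

def g(lam):
    # dual function: inf_x x^2 + lam*(1 - x), evaluated on the grid
    return np.min(xs ** 2 + lam * (1 - xs))

d_star = max(g(lam) for lam in lams)              # sup over lambda >= 0
print(p_star, d_star)                             # both approximately 1
```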
Lagrangians - Exercise

$$\min \; f(x) \quad \text{s.t.} \quad f_i(x) \le 0, \;\; i = 1, \dots, m, \qquad h_i(x) = 0, \;\; i = 1, \dots, p.$$

Exercise: Show that we get the Lagrangian dual
$$g(\lambda, \nu) := \inf_x \; L(x, \lambda, \nu), \qquad (\lambda, \nu) \in \mathbb{R}^m_+ \times \mathbb{R}^p,$$
where the Lagrange variable $\nu$ corresponds to the equality constraints.

Exercise: Prove that $p^* \ge \sup_{\lambda \ge 0,\, \nu \in \mathbb{R}^p} g(\lambda, \nu) = d^*$.

Exercises: Some duals
Derive Lagrangian duals for the following problems
▶ Least-norm solution of linear equations: $\min \; x^T x$ s.t. $Ax = b$
▶ Dual of an LP
▶ Dual of an SOCP
▶ Dual of an SDP
▶ Study example (5.7) in BV (binary QP)

Strong duality

Duality gap

$p^* - d^*$

Strong duality holds if the duality gap is zero: $p^* = d^*$

Several sufficient conditions are known!

"Easy" necessary and sufficient conditions: unknown

Abstract duality gap theorem?

Theorem. Let $v : \mathbb{R}^m \to \mathbb{R}$ be the primal value function
$$v(u) := \inf \{ f(x) \mid f_i(x) \le u_i, \; 1 \le i \le m \}.$$
The following relations hold:
1. $p^* = v(0)$
2. $v^*(-\lambda) = \begin{cases} -g(\lambda) & \lambda \ge 0 \\ +\infty & \text{otherwise} \end{cases}$
3. $d^* = v^{**}(0)$

So if $v(0) = v^{**}(0)$, we have strong duality.

Remark: Conditions such as Slater's ensure $\partial v(0) \neq \emptyset$, which ensures $v$ is finite and lsc at $0$, whereby $v(0) = v^{**}(0)$ holds.

Slater’s sufficient conditions

$$\min \; f(x) \quad \text{s.t.} \quad f_i(x) \le 0, \;\; 1 \le i \le m, \qquad Ax = b.$$
Constraint qualification: there exists $x \in \operatorname{ri} \mathcal{X}$ s.t.
$$f_i(x) < 0, \qquad Ax = b.$$

In words: there is a strictly feasible point.

Theorem. Let the primal problem be convex. If there is a point that is strictly feasible for the non-affine constraints (and merely feasible for the affine ones), then strong duality holds. Moreover, in this case, the dual optimum is attained (i.e., $\partial v(0) \neq \emptyset$).
See BV §5.3.2 for a proof (above, $v$ is the primal value function).

Example with positive duality-gap

$$\min_{x,y} \; e^{-x} \quad \text{s.t.} \quad x^2/y \le 0,$$
over the domain $\mathcal{X} = \{(x, y) \mid y > 0\}$.

Clearly, the only feasible points have $x = 0$. So $p^* = 1$.

$$L(x, y, \lambda) = e^{-x} + \lambda x^2 / y,$$
so the dual function is
$$g(\lambda) = \inf_{x,\, y > 0} \; e^{-x} + \lambda x^2 / y = \begin{cases} 0 & \lambda \ge 0 \\ -\infty & \lambda < 0. \end{cases}$$

Dual problem:
$$d^* = \max_{\lambda} \; 0 \quad \text{s.t.} \quad \lambda \ge 0.$$

Thus, $d^* = 0$, and the gap is $p^* - d^* = 1$.


Here, we had no strictly feasible solution.

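A small numeric sketch of this example (an illustration only): for any fixed $\lambda \ge 0$, letting $x$ and $y$ grow drives $e^{-x} + \lambda x^2/y$ toward $0$, so $g(\lambda) = 0$, while every feasible point has $x = 0$ and objective $1$.

```python
# For each lambda >= 0, the infimum of e^{-x} + lambda*x^2/y over y > 0 tends to 0.
import numpy as np

def L(x, y, lam):
    return np.exp(-x) + lam * x**2 / y

for lam in [0.0, 1.0, 10.0, 100.0]:
    xs = np.linspace(0.0, 50.0, 2001)
    ys = 10.0 ** np.linspace(0, 12, 13)           # y -> large
    vals = [L(x, y, lam) for x in xs for y in ys]
    print(lam, min(vals))                          # tends to 0 as the grids grow

# p* = 1: the only feasible points have x = 0, where the objective e^0 = 1.
```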
Example: Support Vector Machine (SVM)
$$\min_{x, \xi} \; \tfrac{1}{2}\|x\|_2^2 + C \sum_i \xi_i \quad \text{s.t.} \quad Ax \ge 1 - \xi, \;\; \xi \ge 0.$$

$$L(x, \xi, \lambda, \nu) = \tfrac{1}{2}\|x\|_2^2 + C \mathbf{1}^T \xi - \lambda^T (Ax - 1 + \xi) - \nu^T \xi$$

$$g(\lambda, \nu) := \inf_{x, \xi} \; L(x, \xi, \lambda, \nu) = \begin{cases} \lambda^T \mathbf{1} - \tfrac{1}{2}\|A^T \lambda\|_2^2 & \lambda + \nu = C\mathbf{1} \\ -\infty & \text{otherwise} \end{cases}$$
$$d^* = \max_{\lambda \ge 0,\, \nu \ge 0} \; g(\lambda, \nu)$$

Exercise: Using $\nu \ge 0$, eliminate $\nu$ from the above dual and obtain the canonical dual SVM formulation.

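A numerical sketch of strong duality for this SVM, assuming cvxpy and made-up data. The dual used below (maximize $\mathbf{1}^T\lambda - \tfrac{1}{2}\|A^T\lambda\|_2^2$ over $0 \le \lambda \le C$) is a candidate answer to the exercise, so treat it as a way to check your own derivation:

```python
# Primal soft-margin SVM and a candidate dual; with strong duality the values match.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(7)
m, n, C = 25, 4, 1.0
labels = np.sign(rng.standard_normal(m))
feats = rng.standard_normal((m, n)) + labels[:, None]
A = labels[:, None] * feats                 # rows a_i = y_i * feature_i

# Primal
x = cp.Variable(n)
xi = cp.Variable(m)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(x) + C * cp.sum(xi)),
                    [A @ x >= 1 - xi, xi >= 0])
primal.solve()

# Dual after eliminating nu (candidate form; compare with your derivation)
lam = cp.Variable(m)
dual = cp.Problem(cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(A.T @ lam)),
                  [lam >= 0, lam <= C])
dual.solve()

print(primal.value, dual.value)             # should match up to solver tolerance
```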
Example: norm regularized problems
$$\min_x \; f(x) + \|Ax\|$$

Dual problem
$$\min_y \; f^*(-A^T y) \quad \text{s.t.} \quad \|y\|_* \le 1.$$

If there is $\bar y$ with $\|\bar y\|_* < 1$ such that $A^T \bar y \in \operatorname{ri}(\operatorname{dom} f^*)$, then we have strong duality; for instance, if $0 \in \operatorname{ri}(\operatorname{dom} f^*)$.

Exercise: Write the constrained form of the group lasso:
$$\min_{x_1,\dots,x_T} \; \tfrac{1}{2}\Big\| b - \sum_{j=1}^{T} A_j x_j \Big\|_2^2 + \lambda \sum_{j=1}^{T} \|x_j\|_2.$$

Example: Dual via Fenchel conjugates
$$\min_x \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0 \;\; (1 \le i \le m), \quad Ax = b.$$

Introduce $\nu$ and $\lambda$ as dual variables; consider the Lagrangian
$$L(x, \lambda, \nu) := f_0(x) + \sum_i \lambda_i f_i(x) + \nu^T (Ax - b)$$
$$\begin{aligned}
g(\lambda, \nu) &= \inf_x \; L(x, \lambda, \nu) \\
&= -\nu^T b + \inf_x \; \Big( x^T A^T \nu + F(x) \Big), \qquad F(x) := f_0(x) + \sum\nolimits_i \lambda_i f_i(x) \\
&= -\nu^T b - \sup_x \; \Big( \langle x, -A^T \nu \rangle - F(x) \Big) \\
&= -\nu^T b - F^*(-A^T \nu).
\end{aligned}$$

F∗ seems rather opaque...

Example: Dual via Fenchel conjugates
Important trick: “variable splitting”

min f0 (x) s.t. fi (xi ) ≤ 0, Ax = b


x
x = xi , i = 1, . . . , m.
L(x, xi , λ, ν, πi )
X X
:= f0 (x) + λi fi (xi ) + ν T (Ax − b) + πiT (xi − x)
i i
g(λ, ν, πi ) = inf L(x, xi , λ, ν, πi )
x,xi
 X 
= −ν b + inf f0 (x) + ν T Ax −
T
πiT x
x i
X  
T
+ inf πi xi + λi fi (xi ) ,
i xi
 X  X
= −ν T b − f ∗ −AT ν + πi − (λi fi )∗ (−πi ).
i i
P
(you may want to write i πi = s)
Exercise: the variable splitting trick
$$\min_x \; f(x) + h(x).$$

Exercise: Fill in the details for the following steps

$$\min_{x,z} \; f(x) + h(z) \quad \text{s.t.} \quad x = z$$
$$L(x, z, \nu) = f(x) + h(z) + \nu^T (x - z)$$
$$g(\nu) = \inf_{x,z} \; L(x, z, \nu)$$

Strong duality: nonconvex example

Trust region subproblem (TRS)

$$\min_x \; x^T A x + 2 b^T x \quad \text{s.t.} \quad x^T x \le 1.$$

A is symmetric but not necessarily semidefinite!

Theorem. TRS always has zero duality gap.

Proof: Read Section 5.2.4 of BV.

See the challenge problems on pg 18, Lect1

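A small numeric sketch of the zero-gap claim in two dimensions (illustrative, made-up data): whenever $A + \lambda I \succ 0$, the inner infimum of the Lagrangian is attained at $x = -(A + \lambda I)^{-1} b$, giving $g(\lambda) = -b^T (A + \lambda I)^{-1} b - \lambda$, and a brute-force primal value over the unit disk should match $\sup_{\lambda} g(\lambda)$:

```python
# Checking p* = d* for a small 2-D trust region subproblem with indefinite A.
import numpy as np

A = np.array([[1.0, 2.0], [2.0, -3.0]])   # symmetric, indefinite (made up)
b = np.array([0.5, -1.0])

# Primal value by brute force over a fine grid of the unit disk
ts = np.linspace(-1, 1, 801)
X, Y = np.meshgrid(ts, ts)
mask = X**2 + Y**2 <= 1.0
pts = np.stack([X[mask], Y[mask]], axis=1)
vals = np.einsum("ij,jk,ik->i", pts, A, pts) + 2 * pts @ b
p_star = vals.min()

# Dual value by a 1-D scan over lambda with A + lambda*I positive definite
lam_min = np.linalg.eigvalsh(A).min()
lams = np.linspace(max(0.0, -lam_min) + 1e-6, 20.0, 20001)
g = np.array([-b @ np.linalg.solve(A + l * np.eye(2), b) - l for l in lams])
d_star = g.max()

print(p_star, d_star)   # approximately equal: zero duality gap
```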
von Neumann minmax theorem?

(Simplified.) Let A be linear, C, D be compact convex sets.

$$\min_{x \in C} \; \max_{y \in D} \; \langle Ax, y \rangle = \max_{y \in D} \; \min_{x \in C} \; \langle Ax, y \rangle.$$

von Neumann proved this via fixed-point theory. By considering the Fenchel problem
$$\min_x \; \mathbf{1}_C(x) + \mathbf{1}_D^*(Ax),$$

we can conclude the theorem (some work required).

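A small numeric sketch for the special case of a matrix game, where $C$ and $D$ are probability simplices and $\langle Ax, y \rangle = y^T A x$, assuming cvxpy and a made-up payoff matrix:

```python
# min-max = max-min for a matrix game over probability simplices (two LPs).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 4))        # payoff matrix, made up

# min over x in simplex of max over y in simplex of y^T A x  =  min_x max_i (A x)_i
x = cp.Variable(4, nonneg=True)
minmax = cp.Problem(cp.Minimize(cp.max(A @ x)), [cp.sum(x) == 1])
minmax.solve()

# max over y in simplex of min over x in simplex of y^T A x  =  max_y min_j (A^T y)_j
y = cp.Variable(5, nonneg=True)
maxmin = cp.Problem(cp.Maximize(cp.min(A.T @ y)), [cp.sum(y) == 1])
maxmin.solve()

print(minmax.value, maxmin.value)      # equal: the value of the game
```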
