
Optimization for Machine Learning

Lecture 3: Basic problems, Duality


6.881: MIT

Suvrit Sra
Massachusetts Institute of Technology

23 Feb, 2021
Basic convex problems

Linear Programming

$$\min_x \; c^T x \quad \text{s.t.} \quad Ax \le b, \; Cx = d.$$

Piecewise linear minimization is an LP

$$\min_x \; f(x) = \max_{1 \le i \le m} \; \big(a_i^T x + b_i\big)$$

(Figure: $f(x)$ shown as the pointwise maximum of the affine functions $a_i^T x + b_i$.)

$$\min_{x,t} \; t \quad \text{s.t.} \quad a_i^T x + b_i \le t, \quad i = 1, \dots, m.$$
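A minimal numerical sketch of the epigraph reformulation, assuming the cvxpy modeling package and made-up data $(a_i, b_i)$:

```python
# Piecewise-linear minimization via its LP (epigraph) form, with random data.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.standard_normal((m, n))   # rows are a_i^T
b = rng.standard_normal(m)

x = cp.Variable(n)
t = cp.Variable()
# Epigraph LP: minimize t subject to a_i^T x + b_i <= t for all i
prob = cp.Problem(cp.Minimize(t), [A @ x + b <= t])
prob.solve()

# Cross-check against minimizing the pointwise maximum directly
x2 = cp.Variable(n)
prob2 = cp.Problem(cp.Minimize(cp.max(A @ x2 + b)))
prob2.solve()
print(prob.value, prob2.value)    # the two optimal values should agree
```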

Exercises
Formulate $\min_x \|Ax - b\|_1$ as an LP $\big(\|x\|_1 = \sum_i |x_i|\big)$

Formulate $\min_x \|Ax - b\|_\infty$ as an LP $\big(\|x\|_\infty = \max_{1 \le i \le n} |x_i|\big)$

Explore: LP formulations for Markov Decision Processes (MDPs). MDPs are frequently used models in Reinforcement Learning, and in some cases admit nice LP formulations.

Explore: Integer LP: $\min_x c^T x$ s.t. $Ax \le b$, $x \in \mathbb{Z}^n$.

Open Problem. Can we solve the system of inequalities $Ax \le b$ in strongly polynomial time in the dimensions of the system, independent of the magnitudes of the coefficients? The best known result (Tardos, 1984) depends on the coefficients of $A$ but is independent of the magnitudes of $b$ and the cost vector $c$.
N. Megiddo, On the complexity of linear programming.
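For the first two exercises, a quick numerical baseline (assuming cvxpy and made-up data) is to solve the norm problems directly and compare the optimal values with those of your own LP reformulations:

```python
# Direct solves of the two exercise problems, as a check on hand-built LPs.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

x1 = cp.Variable(5)
p1 = cp.Problem(cp.Minimize(cp.norm(A @ x1 - b, 1)))
p1.solve()

xinf = cp.Variable(5)
pinf = cp.Problem(cp.Minimize(cp.norm(A @ xinf - b, "inf")))
pinf.solve()

print(p1.value, pinf.value)  # compare with the values from your LP formulations
```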
Quadratic Programming

$$\min_x \; \tfrac{1}{2} x^T A x + b^T x + c \quad \text{s.t.} \quad Gx \le h.$$
We assume $A \succeq 0$ (positive semidefinite).
Exercise: Suppose there are no constraints; does the QP always have a solution?

Nonnegative least squares (NNLS)
$$\min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 \quad \text{s.t.} \quad x \ge 0.$$
Exercise: Prove that NNLS always has a solution.

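A minimal sketch of solving an NNLS instance, assuming SciPy and cvxpy with made-up data; the two solves should agree:

```python
# NNLS via SciPy's dedicated solver and, equivalently, as a QP in cvxpy.
import numpy as np
from scipy.optimize import nnls
import cvxpy as cp

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)

x_nnls, residual = nnls(A, b)          # scipy returns (solution, ||Ax - b||_2)

x = cp.Variable(8, nonneg=True)        # the constraint x >= 0
qp = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b)))
qp.solve()

print(residual**2 / 2, qp.value)       # the two objective values should agree
```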
Regularized least-squares
Lasso
$$\min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1.$$
Exercise: How large must $\lambda > 0$ be so that $x = 0$ is the optimum?
Total-variation denoising
$$\min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \sum_{i=1}^{n-1} |x_{i+1} - x_i|.$$
Exercise: Is the total-variation term a norm? Prove or disprove.
Group Lasso
$$\min_{x_1,\dots,x_T} \; \tfrac{1}{2}\Big\| b - \sum_{j=1}^{T} A_j x_j \Big\|_2^2 + \lambda \sum_{j=1}^{T} \|x_j\|_2.$$

Exercise: What is the dual norm of the regularizer above?

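One way to explore the Lasso exercise numerically, assuming cvxpy and made-up data, is to sweep $\lambda$ and watch when the solution becomes (numerically) zero:

```python
# Lasso solved for a range of lambda values; inspect when x becomes ~0.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
A = rng.standard_normal((40, 10))
b = rng.standard_normal(40)

x = cp.Variable(10)
lam = cp.Parameter(nonneg=True)
lasso = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(A @ x - b) + lam * cp.norm(x, 1)))

for val in [0.1, 1.0, 5.0, 20.0, 100.0]:
    lam.value = val
    lasso.solve()
    print(val, np.round(x.value, 4))
```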
Robust LP as an SOCP

$$\min \; c^T x \quad \text{s.t.} \quad a_i^T x \le b_i \;\; \forall\, a_i \in \mathcal{E}_i, \qquad \mathcal{E}_i := \{\bar a_i + P_i u \mid \|u\|_2 \le 1\}$$

Constraints are uncertain but with bounded uncertainty.

(Adversarially) Robust LP formulation
$$\min_x \; \max_{\|u\|_2 \le 1} \; \big\{ c^T x \;\big|\; a_i^T x \le b_i, \; a_i \in \mathcal{E}_i \big\}$$

Second Order Cone Program

$$\min \; c^T x \quad \text{s.t.} \quad \|P_i^T x\|_2 \le -\bar a_i^T x + b_i, \quad i = 1, \dots, m.$$

The SOCP constraint comes from:
$$\max_{\|u\|_2 \le 1} \; (\bar a_i + P_i u)^T x = \bar a_i^T x + \|P_i^T x\|_2.$$

Exercise: Give a quick argument for the above equality.

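A minimal sketch of the robust-LP SOCP, assuming cvxpy and made-up ellipsoid data $(\bar a_i, P_i, b_i)$:

```python
# Robust LP as an SOCP: each row gives the constraint ||P_i^T x||_2 <= b_i - abar_i^T x.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
m, n = 6, 4
abar = rng.standard_normal((m, n))          # nominal constraint rows abar_i
P = rng.standard_normal((m, n, n)) * 0.1    # ellipsoid shapes P_i
b = rng.standard_normal(m) + 5.0
c = rng.standard_normal(n)

x = cp.Variable(n)
constraints = [cp.norm(P[i].T @ x, 2) <= b[i] - abar[i] @ x for i in range(m)]
prob = cp.Problem(cp.Minimize(c @ x), constraints)
prob.solve()
print(prob.value, x.value)
```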
Semidefinite Program (SDP)

$$\min_{x \in \mathbb{R}^n} \; c^T x \quad \text{s.t.} \quad A(x) := A_0 + x_1 A_1 + x_2 A_2 + \cdots + x_n A_n \succeq 0.$$

▶ $A_0, \dots, A_n$ are real, symmetric matrices
▶ The inequality $A \preceq B$ means $B - A$ is semidefinite
▶ Also a cone program (conic optimization problem)
▶ SDP ⊃ SOCP ⊃ QP ⊃ LP
▶ Exercise: Write LPs, QPs, and SOCPs as SDPs
▶ The feasible set of an SDP is {semidefinite cone} ∩ {hyperplanes}
Explore: Which convex problems are representable as SDPs? (This is an important topic in optimization theory.)

Examples

♠ Eigenvalue optimization: $\min_x \lambda_{\max}(A(x))$
$$\min_{x,t} \; t \quad \text{s.t.} \quad A(x) \preceq tI.$$

♠ Norm minimization: $\min_x \|A(x)\|$
$$\min_{x,t} \; t \quad \text{s.t.} \quad \begin{bmatrix} tI & A(x)^T \\ A(x) & tI \end{bmatrix} \succeq 0.$$

♠ More examples – see CVX documentation and BV book

Explore: SDP relaxations of nonconvex problems: an important technique, starting with the MaxCut SDP (Goemans-Williamson).
Explore: Sum-of-squares (SOS) optimization and the Lasserre hierarchy of relaxations; see also: https://www.sumofsquares.org

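A minimal SDP sketch of the eigenvalue-optimization example, assuming cvxpy and random symmetric matrices $A_0, \dots, A_n$ (made-up data); the epigraph LMI is checked afterwards with plain numpy:

```python
# Minimize t s.t. A(x) <= t*I, i.e. minimize lambda_max(A(x)) as an SDP.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
n, k = 3, 4                       # x in R^n, the A_i are k x k symmetric
As = []
for _ in range(n + 1):
    M = rng.standard_normal((k, k))
    As.append(M + M.T)            # symmetrize the random matrices

def A_of(x):
    return As[0] + sum(x[i] * As[i + 1] for i in range(n))

x = cp.Variable(n)
t = cp.Variable()
# LMI epigraph constraint: t*I - A(x) is positive semidefinite
prob = cp.Problem(cp.Minimize(t), [t * np.eye(k) - A_of(x) >> 0])
prob.solve()

# Sanity check: at the solution, t should equal lambda_max(A(x))
Ax = As[0] + sum(x.value[i] * As[i + 1] for i in range(n))
print(prob.value, np.linalg.eigvalsh(Ax).max())   # should agree
```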
Duality
(Weak duality, strong duality)

Primal problem

Let $f_i : \mathbb{R}^n \to \mathbb{R}$ ($1 \le i \le m$). Generic nonlinear program:
$$\begin{aligned}
\min \;\; & f(x) \\
\text{s.t.} \;\; & f_i(x) \le 0, \quad 1 \le i \le m, \qquad \text{(P)} \\
& x \in \operatorname{dom} f \cap \operatorname{dom} f_1 \cap \cdots \cap \operatorname{dom} f_m.
\end{aligned}$$

Domain: the set $\mathcal{X} := \operatorname{dom} f \cap \operatorname{dom} f_1 \cap \cdots \cap \operatorname{dom} f_m$

▶ We call (P) the primal problem
▶ The variable $x$ is the primal variable

Lagrangians and Duality

"The reader will find no figures in this work. The methods which I set forth do not require either constructions or geometrical or mechanical reasonings: but only algebraic operations, subject to a regular and uniform rule of procedure."

—Joseph-Louis Lagrange, Preface to Mécanique Analytique

Lagrangian

To the primal, associate the Lagrangian $L : \mathbb{R}^n \times \mathbb{R}^m_+ \to (-\infty, \infty]$,
$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i f_i(x).$$

♠ The variables $\lambda \in \mathbb{R}^m_+$ are called Lagrange multipliers
♠ Suppose $x$ is feasible and $\lambda \ge 0$. Then the lower bound holds:
$$f(x) \ge L(x, \lambda) \qquad \text{for all feasible } x \in \mathcal{X}, \; \lambda \in \mathbb{R}^m_+.$$

♠ In other words,
$$\sup_{\lambda \in \mathbb{R}^m_+} L(x, \lambda) = \begin{cases} f(x), & \text{if } x \text{ is feasible}, \\ +\infty & \text{otherwise}. \end{cases}$$

Proof on next slide

Lagrangian – proof
$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i f_i(x).$$

▶ $f(x) \ge L(x, \lambda)$ for all feasible $x \in \mathcal{X}$ and $\lambda \in \mathbb{R}^m_+$; so the primal optimal value is
$$p^* = \inf_{x \in \mathcal{X}} \; \sup_{\lambda \ge 0} \; L(x, \lambda).$$

▶ If $x$ is not feasible, then some $f_i(x) > 0$
▶ In this case, the inner sup is $+\infty$, so the claim holds by definition
▶ If $x$ is feasible, each $f_i(x) \le 0$, so $\sup_{\lambda \ge 0} \sum_i \lambda_i f_i(x) = 0$

Dual value
$$L(x, \lambda) := f(x) + \sum_{i=1}^{m} \lambda_i f_i(x).$$

Primal value $\in [-\infty, +\infty]$:
$$p^* = \inf_{x \in \mathcal{X}} \; \sup_{\lambda \ge 0} \; L(x, \lambda).$$

Dual value $\in [-\infty, +\infty]$:
$$d^* = \sup_{\lambda \ge 0} \; \inf_{x \in \mathcal{X}} \; L(x, \lambda).$$

Dual function:
$$g(\lambda) := \inf_{x \in \mathcal{X}} \; L(x, \lambda).$$

Observe that $g(\lambda)$ is always concave!

Weak duality theorem

Theorem (Weak duality). $p^* \ge d^*$ (i.e., weak duality always holds).

Proof:
1. $f(x') \ge L(x', \lambda)$ for every feasible $x' \in \mathcal{X}$ and every $\lambda \ge 0$
2. Thus, for any feasible $x$, we have $f(x) \ge \inf_{x' \in \mathcal{X}} L(x', \lambda) = g(\lambda)$
3. Now minimize over feasible $x$ on the lhs to obtain $p^* \ge g(\lambda)$ for all $\lambda \in \mathbb{R}^m_+$
4. Thus, taking the sup over $\lambda \in \mathbb{R}^m_+$, we obtain $p^* \ge d^*$.

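As a quick illustration (a toy instance chosen here, not from the lecture): for $\min x^2$ s.t. $1 - x \le 0$, the Lagrangian is $L(x, \lambda) = x^2 + \lambda(1 - x)$, so $g(\lambda) = \lambda - \lambda^2/4$, and $\sup_{\lambda \ge 0} g(\lambda) = 1 = p^*$, so here the bound is tight. A small numeric sketch of this check:

```python
# Weak/strong duality check on the toy problem min x^2 s.t. 1 - x <= 0.
import numpy as np

xs = np.linspace(-3, 3, 100001)
lams = np.linspace(0, 10, 101)

p_star = np.min(xs[xs >= 1] ** 2)                 # primal optimal value = 1

def g(lam):
    # dual function: inf_x x^2 + lam*(1 - x), evaluated on the grid
    return np.min(xs ** 2 + lam * (1 - xs))

d_star = max(g(lam) for lam in lams)              # sup over lambda >= 0
print(p_star, d_star)                             # both approximately 1
```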
Lagrangians - Exercise

$$\min \; f(x) \quad \text{s.t.} \quad f_i(x) \le 0, \;\; i = 1, \dots, m, \qquad h_i(x) = 0, \;\; i = 1, \dots, p.$$

Exercise: Show that we get the Lagrangian dual
$$g(\lambda, \nu) := \inf_x \; L(x, \lambda, \nu), \qquad (\lambda, \nu) \in \mathbb{R}^m_+ \times \mathbb{R}^p,$$
where the Lagrange variable $\nu$ corresponds to the equality constraints.

Exercise: Prove that $p^* \ge \sup_{\lambda \ge 0,\, \nu \in \mathbb{R}^p} g(\lambda, \nu) = d^*$.

Exercises: Some duals
Derive Lagrangian duals for the following problems
▶ Least-norm solution of linear equations: $\min \; x^T x$ s.t. $Ax = b$
▶ Dual of an LP
▶ Dual of an SOCP
▶ Dual of an SDP
▶ Study example (5.7) in BV (binary QP)

Strong duality

Duality gap

$p^* - d^*$

Strong duality holds if the duality gap is zero: $p^* = d^*$

Several sufficient conditions are known!

"Easy" necessary and sufficient conditions: unknown

Abstract duality gap theorem?

Theorem. Let $v : \mathbb{R}^m \to \mathbb{R}$ be the primal value function
$$v(u) := \inf \{ f(x) \mid f_i(x) \le u_i, \; 1 \le i \le m \}.$$
The following relations hold:
1. $p^* = v(0)$
2. $v^*(-\lambda) = \begin{cases} -g(\lambda) & \lambda \ge 0 \\ +\infty & \text{otherwise} \end{cases}$
3. $d^* = v^{**}(0)$

So if $v(0) = v^{**}(0)$, we have strong duality.

Remark: Conditions such as Slater's ensure $\partial v(0) \neq \emptyset$, which ensures $v$ is finite and lsc at $0$, whereby $v(0) = v^{**}(0)$ holds.

Slater’s sufficient conditions

$$\min \; f(x) \quad \text{s.t.} \quad f_i(x) \le 0, \;\; 1 \le i \le m, \qquad Ax = b.$$
Constraint qualification: there exists $x \in \operatorname{ri} \mathcal{X}$ s.t.
$$f_i(x) < 0, \qquad Ax = b.$$

In words: there is a strictly feasible point.

Theorem. Let the primal problem be convex. If there is a point that is strictly feasible for the non-affine constraints (and merely feasible for the affine ones), then strong duality holds. Moreover, in this case, the dual optimum is attained (i.e., $\partial v(0) \neq \emptyset$).
See BV §5.3.2 for a proof (above, $v$ is the primal value function).

Example with positive duality-gap

$$\min_{x,y} \; e^{-x} \quad \text{s.t.} \quad x^2/y \le 0,$$
over the domain $\mathcal{X} = \{(x, y) \mid y > 0\}$.

Clearly, the only feasible points have $x = 0$. So $p^* = 1$.

$$L(x, y, \lambda) = e^{-x} + \lambda x^2 / y,$$
so the dual function is
$$g(\lambda) = \inf_{x,\, y > 0} \; e^{-x} + \lambda x^2 / y = \begin{cases} 0 & \lambda \ge 0 \\ -\infty & \lambda < 0. \end{cases}$$

Dual problem:
$$d^* = \max_{\lambda} \; 0 \quad \text{s.t.} \quad \lambda \ge 0.$$

Thus, $d^* = 0$, and the gap is $p^* - d^* = 1$.


Here, we had no strictly feasible solution.

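A small numeric sketch of this example (an illustration only): for any fixed $\lambda \ge 0$, letting $x$ and $y$ grow drives $e^{-x} + \lambda x^2/y$ toward $0$, so $g(\lambda) = 0$, while every feasible point has $x = 0$ and objective $1$.

```python
# For each lambda >= 0, the infimum of e^{-x} + lambda*x^2/y over y > 0 tends to 0.
import numpy as np

def L(x, y, lam):
    return np.exp(-x) + lam * x**2 / y

for lam in [0.0, 1.0, 10.0, 100.0]:
    xs = np.linspace(0.0, 50.0, 2001)
    ys = 10.0 ** np.linspace(0, 12, 13)           # y -> large
    vals = [L(x, y, lam) for x in xs for y in ys]
    print(lam, min(vals))                          # tends to 0 as the grids grow

# p* = 1: the only feasible points have x = 0, where the objective e^0 = 1.
```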
Example: Support Vector Machine (SVM)
$$\min_{x, \xi} \; \tfrac{1}{2}\|x\|_2^2 + C \sum_i \xi_i \quad \text{s.t.} \quad Ax \ge 1 - \xi, \;\; \xi \ge 0.$$

$$L(x, \xi, \lambda, \nu) = \tfrac{1}{2}\|x\|_2^2 + C \mathbf{1}^T \xi - \lambda^T (Ax - 1 + \xi) - \nu^T \xi$$

$$g(\lambda, \nu) := \inf_{x, \xi} \; L(x, \xi, \lambda, \nu) = \begin{cases} \lambda^T \mathbf{1} - \tfrac{1}{2}\|A^T \lambda\|_2^2 & \lambda + \nu = C\mathbf{1} \\ -\infty & \text{otherwise} \end{cases}$$
$$d^* = \max_{\lambda \ge 0,\, \nu \ge 0} \; g(\lambda, \nu)$$

Exercise: Using $\nu \ge 0$, eliminate $\nu$ from the above dual and obtain the canonical dual SVM formulation.

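A numerical sketch of strong duality for this SVM, assuming cvxpy and made-up data. The dual used below (maximize $\mathbf{1}^T\lambda - \tfrac{1}{2}\|A^T\lambda\|_2^2$ over $0 \le \lambda \le C$) is a candidate answer to the exercise, so treat it as a way to check your own derivation:

```python
# Primal soft-margin SVM and a candidate dual; with strong duality the values match.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(7)
m, n, C = 25, 4, 1.0
labels = np.sign(rng.standard_normal(m))
feats = rng.standard_normal((m, n)) + labels[:, None]
A = labels[:, None] * feats                 # rows a_i = y_i * feature_i

# Primal
x = cp.Variable(n)
xi = cp.Variable(m)
primal = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(x) + C * cp.sum(xi)),
                    [A @ x >= 1 - xi, xi >= 0])
primal.solve()

# Dual after eliminating nu (candidate form; compare with your derivation)
lam = cp.Variable(m)
dual = cp.Problem(cp.Maximize(cp.sum(lam) - 0.5 * cp.sum_squares(A.T @ lam)),
                  [lam >= 0, lam <= C])
dual.solve()

print(primal.value, dual.value)             # should match up to solver tolerance
```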
Example: norm regularized problems
$$\min_x \; f(x) + \|Ax\|$$

Dual problem
$$\min_y \; f^*(-A^T y) \quad \text{s.t.} \quad \|y\|_* \le 1.$$

If there is $\bar y$ with $\|\bar y\|_* < 1$ such that $A^T \bar y \in \operatorname{ri}(\operatorname{dom} f^*)$, then we have strong duality; for instance, if $0 \in \operatorname{ri}(\operatorname{dom} f^*)$.

Exercise: Write the constrained form of the group lasso:
$$\min_{x_1,\dots,x_T} \; \tfrac{1}{2}\Big\| b - \sum_{j=1}^{T} A_j x_j \Big\|_2^2 + \lambda \sum_{j=1}^{T} \|x_j\|_2.$$

Example: Dual via Fenchel conjugates
$$\min_x \; f_0(x) \quad \text{s.t.} \quad f_i(x) \le 0 \;\; (1 \le i \le m), \quad Ax = b.$$

Introduce $\nu$ and $\lambda$ as dual variables; consider the Lagrangian
$$L(x, \lambda, \nu) := f_0(x) + \sum_i \lambda_i f_i(x) + \nu^T (Ax - b)$$
$$\begin{aligned}
g(\lambda, \nu) &= \inf_x \; L(x, \lambda, \nu) \\
&= -\nu^T b + \inf_x \; \Big( x^T A^T \nu + F(x) \Big), \qquad F(x) := f_0(x) + \sum\nolimits_i \lambda_i f_i(x) \\
&= -\nu^T b - \sup_x \; \Big( \langle x, -A^T \nu \rangle - F(x) \Big) \\
&= -\nu^T b - F^*(-A^T \nu).
\end{aligned}$$

F∗ seems rather opaque...

Example: Dual via Fenchel conjugates
Important trick: “variable splitting”

min f0 (x) s.t. fi (xi ) ≤ 0, Ax = b


x
x = xi , i = 1, . . . , m.
L(x, xi , λ, ν, πi )
X X
:= f0 (x) + λi fi (xi ) + ν T (Ax − b) + πiT (xi − x)
i i
g(λ, ν, πi ) = inf L(x, xi , λ, ν, πi )
x,xi
 X 
= −ν b + inf f0 (x) + ν T Ax −
T
πiT x
x i
X  
T
+ inf πi xi + λi fi (xi ) ,
i xi
 X  X
= −ν T b − f ∗ −AT ν + πi − (λi fi )∗ (−πi ).
i i
P
(you may want to write i πi = s)
Exercise: the variable splitting trick
$$\min_x \; f(x) + h(x).$$

Exercise: Fill in the details for the following steps

$$\min_{x,z} \; f(x) + h(z) \quad \text{s.t.} \quad x = z$$
$$L(x, z, \nu) = f(x) + h(z) + \nu^T (x - z)$$
$$g(\nu) = \inf_{x,z} \; L(x, z, \nu)$$

Strong duality: nonconvex example

Trust region subproblem (TRS)

$$\min_x \; x^T A x + 2 b^T x \quad \text{s.t.} \quad x^T x \le 1.$$

A is symmetric but not necessarily semidefinite!

Theorem. TRS always has zero duality gap.

Proof: Read Section 5.2.4 of BV.

See the challenge problems on pg 18, Lect1

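A small numeric sketch of the zero-gap claim in two dimensions (illustrative, made-up data): whenever $A + \lambda I \succ 0$, the inner infimum of the Lagrangian is attained at $x = -(A + \lambda I)^{-1} b$, giving $g(\lambda) = -b^T (A + \lambda I)^{-1} b - \lambda$, and a brute-force primal value over the unit disk should match $\sup_{\lambda} g(\lambda)$:

```python
# Checking p* = d* for a small 2-D trust region subproblem with indefinite A.
import numpy as np

A = np.array([[1.0, 2.0], [2.0, -3.0]])   # symmetric, indefinite (made up)
b = np.array([0.5, -1.0])

# Primal value by brute force over a fine grid of the unit disk
ts = np.linspace(-1, 1, 801)
X, Y = np.meshgrid(ts, ts)
mask = X**2 + Y**2 <= 1.0
pts = np.stack([X[mask], Y[mask]], axis=1)
vals = np.einsum("ij,jk,ik->i", pts, A, pts) + 2 * pts @ b
p_star = vals.min()

# Dual value by a 1-D scan over lambda with A + lambda*I positive definite
lam_min = np.linalg.eigvalsh(A).min()
lams = np.linspace(max(0.0, -lam_min) + 1e-6, 20.0, 20001)
g = np.array([-b @ np.linalg.solve(A + l * np.eye(2), b) - l for l in lams])
d_star = g.max()

print(p_star, d_star)   # approximately equal: zero duality gap
```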
von Neumann minmax theorem?

(Simplified.) Let A be linear, C, D be compact convex sets.

$$\min_{x \in C} \; \max_{y \in D} \; \langle Ax, y \rangle = \max_{y \in D} \; \min_{x \in C} \; \langle Ax, y \rangle.$$

von Neumann proved this via fixed-point theory. By considering the Fenchel problem
$$\min_x \; \mathbf{1}_C(x) + \mathbf{1}_D^*(Ax),$$

we can conclude the theorem (some work required).

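A small numeric sketch for the special case of a matrix game, where $C$ and $D$ are probability simplices and $\langle Ax, y \rangle = y^T A x$, assuming cvxpy and a made-up payoff matrix:

```python
# min-max = max-min for a matrix game over probability simplices (two LPs).
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(9)
A = rng.standard_normal((5, 4))        # payoff matrix, made up

# min over x in simplex of max over y in simplex of y^T A x  =  min_x max_i (A x)_i
x = cp.Variable(4, nonneg=True)
minmax = cp.Problem(cp.Minimize(cp.max(A @ x)), [cp.sum(x) == 1])
minmax.solve()

# max over y in simplex of min over x in simplex of y^T A x  =  max_y min_j (A^T y)_j
y = cp.Variable(5, nonneg=True)
maxmin = cp.Problem(cp.Maximize(cp.min(A.T @ y)), [cp.sum(y) == 1])
maxmin.solve()

print(minmax.value, maxmin.value)      # equal: the value of the game
```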
