1 Convex Analysis: 1.1 Motivations: Convex Optimization Problems
Main references:
• Vandenberghe (UCLA): EE236C - Optimization methods for large-scale systems,
http://www.seas.ucla.edu/~vandenbe/ee236c.html
• Parikh and Boyd, Proximal algorithms, slides and notes.
http://stanford.edu/~boyd/papers/prox_algs.html
• Boyd, ADMM
http://stanford.edu/˜boyd/admm.html
• Simon Foucart and Holger Rauhut, Appendix B.
• Image processing:
$$\min \|\nabla x\|_1 \quad \text{subject to} \quad \|Ax - b\|_2^2 \le \epsilon.$$
• The constraint can be membership in a convex set C. That is,
$$\min_x f_0(x) \quad \text{subject to} \quad Ax \in C.$$
We can define an indicator function
$$\iota_C(x) = \begin{cases} 0 & \text{if } x \in C \\ +\infty & \text{otherwise.} \end{cases}$$
We can then rewrite the constrained minimization problem as an unconstrained minimization problem:
$$\min_x f_0(x) + \iota_C(Ax).$$
This can be reformulated as
$$\min_{x,y} f_0(x) + \iota_C(y) \quad \text{subject to} \quad Ax = y.$$
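As a sanity check of the indicator-function reformulation, here is a minimal one-dimensional sketch; the objective $f_0(x) = (x-2)^2$ and the set $C = [0,1]$ are assumed purely for illustration. Minimizing $f_0 + \iota_C$ as an unconstrained problem recovers the constrained minimizer:

```python
# Hypothetical 1-D instance: minimize f0(x) = (x - 2)^2 subject to x in C = [0, 1],
# rewritten as the unconstrained problem min f0(x) + iota_C(x).

def f0(x):
    return (x - 2.0) ** 2

def iota_C(x, lo=0.0, hi=1.0):
    # Indicator function of C = [lo, hi]: 0 on C, +infinity off C.
    return 0.0 if lo <= x <= hi else float("inf")

def argmin_on_grid(objective, grid):
    # Brute-force unconstrained minimization; +infinity points never win.
    return min(grid, key=objective)

grid = [i / 1000.0 for i in range(-2000, 4001)]  # covers [-2, 4]
x_star = argmin_on_grid(lambda x: f0(x) + iota_C(x), grid)
# The unconstrained problem "feels" the constraint only through iota_C.
```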
1.2 Convex functions
Goal: We want to extend the theory of smooth convex analysis to non-differentiable convex functions.
Let X be a separable Hilbert space and $f : X \to (-\infty, +\infty]$ a function.
• Proper: f is called proper if $f(x) < \infty$ for at least one x. The domain of f is defined to be
$$\operatorname{dom} f = \{x \mid f(x) < \infty\}.$$
• Lower semi-continuity: f is called lower semi-continuous if $\liminf_{x_n \to \bar{x}} f(x_n) \ge f(\bar{x})$.
• Convex function
• Strictly convex:
– f is called strictly convex if the strict Jensen inequality holds: for $x \neq y$ and $t \in (0,1)$,
$$f(tx + (1-t)y) < t f(x) + (1-t) f(y).$$
1.3 Gradients of convex functions
Proposition 1.3 (Monotonicity of $\nabla f$). Suppose $f \in C^1$. Then f is convex if and only if $\operatorname{dom} f$ is convex and $\nabla f$ is a monotone operator:
$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge 0.$$
Proof. 1. ($\Rightarrow$) Adding the two first-order inequalities
$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle, \qquad f(x) \ge f(y) + \langle \nabla f(y), x - y\rangle$$
gives monotonicity.
2. ($\Leftarrow$) Let $g(t) = f(x + t(y-x))$. Then $g'(t) = \langle \nabla f(x + t(y-x)), y - x\rangle \ge g'(0)$ by monotonicity. Hence
$$f(y) = g(1) = g(0) + \int_0^1 g'(t)\,dt \ge g(0) + \int_0^1 g'(0)\,dt = f(x) + \langle \nabla f(x), y - x\rangle.$$
Proposition 1.4. Suppose f is convex and in $C^1$. The following statements are equivalent.
(a) $\nabla f$ is Lipschitz continuous with parameter L.
(b) $g(x) := \frac{L}{2}\|x\|^2 - f(x)$ is convex.
(d) Co-coercivity:
$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2.$$
Proof. 1. (a) ⇒ (b):
3. (b) ⇒ (d): Define $f_x(z) = f(z) - \langle \nabla f(x), z\rangle$ and $f_y(z) = f(z) - \langle \nabla f(y), z\rangle$. From (b), both $\frac{L}{2}\|z\|^2 - f_x(z)$ and $\frac{L}{2}\|z\|^2 - f_y(z)$ are convex, and $z = x$ minimizes $f_x$. Thus from the proposition below,
$$f(y) - f(x) - \langle \nabla f(x), y - x\rangle = f_x(y) - f_x(x) \ge \frac{1}{2L}\|\nabla f_x(y)\|^2 = \frac{1}{2L}\|\nabla f(y) - \nabla f(x)\|^2.$$
Similarly, $z = y$ minimizes $f_y$, and we get
$$f(x) - f(y) - \langle \nabla f(y), x - y\rangle \ge \frac{1}{2L}\|\nabla f(y) - \nabla f(x)\|^2.$$
Adding these two together, we get the co-coercivity.
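The co-coercivity inequality can be checked numerically; the test function below, $f(x) = \log(1 + e^x)$ with $f'(x)$ the sigmoid and $\sup |f''| = 1/4$ (so $L = 1/4$), is an assumed example, not from the notes:

```python
import math

# Assumed test function: f(x) = log(1 + exp(x)). Its gradient is the sigmoid
# and its second derivative is bounded by 1/4, so grad f is Lipschitz, L = 1/4.
L = 0.25

def fprime(x):
    return 1.0 / (1.0 + math.exp(-x))

def cocoercivity_gap(x, y):
    # <f'(x) - f'(y), x - y> - (1/L)|f'(x) - f'(y)|^2, which should be >= 0.
    d = fprime(x) - fprime(y)
    return d * (x - y) - (1.0 / L) * d * d

gaps = [cocoercivity_gap(x, y) for x in (-3.0, -1.0, 0.5, 2.0)
                               for y in (-2.0, 0.0, 1.5)]
```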
Proposition 1.5. Suppose f is convex and in C 1 with ∇f (x) being Lipschitz continuous with parameter
L. Suppose x∗ is a global minimum of f . Then
$$\frac{1}{2L}\|\nabla f(x)\|^2 \le f(x) - f(x^*) \le \frac{L}{2}\|x - x^*\|^2.$$
Proof. 1. The right-hand inequality follows from the quadratic upper bound.
Proposition 1.7. If f is strongly convex with parameter m, then f has a unique global minimizer $x^*$, which satisfies
$$\frac{m}{2}\|x - x^*\|^2 \le f(x) - f(x^*) \le \frac{1}{2m}\|\nabla f(x)\|^2 \quad \text{for all } x \in \operatorname{dom} f.$$
Proof. 1. For the left-hand inequality, we apply the quadratic lower bound together with $\nabla f(x^*) = 0$:
$$f(x) \ge f(x^*) + \langle \nabla f(x^*), x - x^*\rangle + \frac{m}{2}\|x - x^*\|^2 = f(x^*) + \frac{m}{2}\|x - x^*\|^2.$$
Proposition 1.8. Suppose f is strongly convex with parameter m and $\nabla f$ is Lipschitz continuous with parameter L. Then f satisfies the stronger co-coercivity condition
$$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \frac{mL}{m+L}\|x - y\|^2 + \frac{1}{m+L}\|\nabla f(x) - \nabla f(y)\|^2.$$
Proof. 1. Consider $g(x) = f(x) - \frac{m}{2}\|x\|^2$. From the strong convexity of f, g is convex.
1.5 Subdifferential
Let f be convex. The subdifferential of f at a point x is the set
$$\partial f(x) = \{u \mid f(y) \ge f(x) + \langle u, y - x\rangle \text{ for all } y\}.$$
Proposition 1.9. Let f : Rn → (−∞, ∞] be convex and closed. Then x∗ is a minimum of f if and only
if 0 ∈ ∂f (x∗ ).
Proposition 1.10. The subdifferential of a convex function f is a set-valued monotone operator. That is,
if u ∈ ∂f (x), v ∈ ∂f (y), then hu − v, x − yi ≥ 0.
Proof. Adding the two subgradient inequalities
$$f(y) \ge f(x) + \langle u, y - x\rangle, \qquad f(x) \ge f(y) + \langle v, x - y\rangle,$$
we get monotonicity.
Since $f(u) + \frac{1}{2}\|u - x\|^2$ is strongly convex in u, it has a unique minimizer. Thus, $\operatorname{prox}_f(x) := \operatorname{argmin}_u f(u) + \frac{1}{2}\|u - x\|^2$ is well-defined.
Examples
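As an assumed illustration (this particular example is not spelled out in this excerpt): the proximal map of $f(x) = \lambda|x|$ has the closed form known as soft-thresholding, which can be checked against a brute-force evaluation of the defining argmin:

```python
def prox_abs(x, lam):
    # Closed form of argmin_z  lam*|z| + 0.5*(z - x)^2  (soft-thresholding).
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def prox_abs_numeric(x, lam, lo=-10.0, hi=10.0, n=20001):
    # Brute-force argmin of the same objective on a grid (spacing 1e-3).
    grid = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    return min(grid, key=lambda z: lam * abs(z) + 0.5 * (z - x) ** 2)
```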
Properties
• $z = \operatorname{prox}_f(x)$ if and only if
$$0 \in \partial f(z) + z - x,$$
or
$$x \in z + \partial f(z).$$
Sometimes, we express this as $\operatorname{prox}_f = (I + \partial f)^{-1}$.
• Co-coercivity:
$$\langle \operatorname{prox}_f(x) - \operatorname{prox}_f(y), x - y\rangle \ge \|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\|^2.$$
Let $x^+ = \operatorname{prox}_f(x) := \operatorname{argmin}_z f(z) + \frac{1}{2}\|z - x\|^2$. We have $x - x^+ \in \partial f(x^+)$. Similarly, $y^+ := \operatorname{prox}_f(y)$ satisfies $y - y^+ \in \partial f(y^+)$. From the monotonicity of $\partial f$, with $u = x - x^+$ and $v = y - y^+$,
$$\langle u - v, x^+ - y^+\rangle \ge 0,$$
which rearranges to the co-coercivity above and, by the Cauchy-Schwarz inequality, implies the non-expansiveness
$$\|\operatorname{prox}_f(x) - \operatorname{prox}_f(y)\| \le \|x - y\|.$$
Examples
1. $f(x) = \langle a, x\rangle - b$:
$$f^*(y) = \sup_x(\langle y, x\rangle - \langle a, x\rangle + b) = \begin{cases} b & \text{if } y = a \\ \infty & \text{otherwise.} \end{cases}$$
2. $f(x) = \begin{cases} ax & \text{if } x \le 0 \\ bx & \text{if } x \ge 0, \end{cases}$ with $a < 0 < b$:
$$f^*(y) = \begin{cases} 0 & \text{if } a \le y \le b \\ \infty & \text{otherwise.} \end{cases}$$
3. $f(x) = \frac{1}{2}\langle x, Ax\rangle + \langle b, x\rangle + c$, where A is symmetric and positive definite; then
$$f^*(y) = \frac{1}{2}\langle y - b, A^{-1}(y - b)\rangle - c.$$
In general, if $A \succeq 0$, then
$$f^*(y) = \frac{1}{2}\langle y - b, A^{\dagger}(y - b)\rangle - c,$$
where $A^{\dagger}$ is the Moore-Penrose pseudo-inverse of A, and $\operatorname{dom} f^* = \operatorname{range} A + b$.
4. $f(x) = \frac{1}{p}\|x\|^p$, $p > 1$: then $f^*(u) = \frac{1}{p^*}\|u\|^{p^*}$, where $1/p + 1/p^* = 1$.
5. $f(x) = e^x$:
$$f^*(y) = \sup_x(xy - e^x) = \begin{cases} y \ln y - y & \text{if } y > 0 \\ 0 & \text{if } y = 0 \\ \infty & \text{if } y < 0. \end{cases}$$
Properties
• $f^*$ is convex and l.s.c.
Note that $f^*$ is the supremum of a family of affine functions of y. We have seen that the supremum of a family of closed functions is closed, and the supremum of a family of convex functions is convex.
• Fenchel's inequality:
$$f(x) + f^*(y) \ge \langle x, y\rangle.$$
This follows directly from the definition of $f^*$:
$$f^*(y) = \sup_x(\langle x, y\rangle - f(x)) \ge \langle x, y\rangle - f(x).$$
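Fenchel's inequality can be checked numerically on Example 5 above ($f(x) = e^x$, $f^*(y) = y\ln y - y$); the test points below are assumptions for illustration, and equality is attained at $y = f'(x) = e^x$:

```python
import math

def f(x):
    return math.exp(x)

def f_star(y):
    # Conjugate of exp (Example 5): y*log(y) - y for y > 0, 0 at y = 0.
    if y < 0:
        return float("inf")
    if y == 0.0:
        return 0.0
    return y * math.log(y) - y

# Fenchel gap f(x) + f*(y) - <x, y>: nonnegative, zero at y = exp(x).
gaps = [f(x) + f_star(y) - x * y for x in (-1.0, 0.0, 2.0)
                                 for y in (0.0, 0.5, 3.0)]
eq_gap = f(1.0) + f_star(math.exp(1.0)) - 1.0 * math.exp(1.0)
```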
3. If $b < 0$, we may normalize so that $(a, b) = (y, -1)$. Then the separation inequality reads, for all $(z, s)$ in the epigraph of f,
$$\langle y, z - x\rangle - (s - f^{**}(x)) \le c < 0.$$
Taking the supremum over z with $s = f(z)$, we get
$$f^*(y) - \langle y, x\rangle + f^{**}(x) \le c < 0.$$
This contradicts Fenchel's inequality.
4. If $b = 0$, choose $\hat{y} \in \operatorname{dom} f^*$ and add $(\hat{y}, -1)$ to $(a, b)$; we can get
$$\left\langle \begin{pmatrix} a + \hat{y} \\ -1 \end{pmatrix}, \begin{pmatrix} z - x \\ s - f^{**}(x) \end{pmatrix} \right\rangle \le c_1 < 0.$$
5. If f ∗∗ = f , then f is closed and convex because f ∗∗ is closed and convex no matter what f is.
Proof. 1.
2. For the equivalence $x \in \partial f^*(y) \Leftrightarrow \langle x, y\rangle = f(x) + f^*(y)$, we use $f^{**}(x) = f(x)$ and apply the previous argument.
$$\inf_x f_0(x) \quad \text{subject to} \quad f_i(x) \le 0,\ i = 1, \ldots, m, \qquad h_i(x) = 0,\ i = 1, \ldots, p.$$
The method of Lagrange multipliers is to introduce augmented variables λ, µ and a Lagrangian so that the problem is transformed into an unconstrained optimization problem. Let us define the Lagrangian to be
$$L(x, \lambda, \mu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \mu_i h_i(x).$$
Here, λ and µ are the augmented variables, called the Lagrange multipliers or the dual variables.
and
$$\sup_{\mu} \sum_{i=1}^p \mu_i h_i(x) = \iota_{C_h}(x), \qquad C_h = \bigcap_i \{x \mid h_i(x) = 0\}.$$
Hence
$$\sup_{\lambda \succeq 0, \mu} L(x, \lambda, \mu) = f_0(x) + \iota_{C_f}(x) + \iota_{C_h}(x),$$
$$p^* = \inf_{x \in D}\left(f_0(x) + \iota_{C_f}(x) + \iota_{C_h}(x)\right) = \inf_{x \in D} \sup_{\lambda \succeq 0, \mu} L(x, \lambda, \mu).$$
The dual function $g(\lambda, \mu) := \inf_{x \in D} L(x, \lambda, \mu)$ is an infimum of a family of affine (hence concave and closed) functions of λ and µ; thus $g(\lambda, \mu)$ is a concave closed function. The dual problem is
$$d^* = \sup_{\lambda \succeq 0, \mu} g(\lambda, \mu).$$
We refer to $(\lambda, \mu) \in \operatorname{dom} g$ with $\lambda \succeq 0$ as dual feasible variables. The primal and dual problems are connected by the following duality property:
$$g(\lambda, \mu) \le p^*.$$
In other words,
$$d^* \le p^*.$$
Proof. Suppose x is a feasible point (i.e. $x \in D$, or equivalently, $f_i(x) \le 0$ and $h_i(x) = 0$). Then for any $\lambda_i \ge 0$ and any $\mu_i$, we have
$$\sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \mu_i h_i(x) \le 0.$$
This leads to
$$L(x, \lambda, \mu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \mu_i h_i(x) \le f_0(x).$$
Hence
$$g(\lambda, \mu) := \inf_{x \in D} L(x, \lambda, \mu) \le f_0(x) \quad \text{for all } x \in D.$$
Hence
$$g(\lambda, \mu) \le p^*$$
for all dual feasible pairs $(\lambda, \mu)$.
This is called the weak duality property. Thus, weak duality can also be read as
$$d^* = \sup_{\lambda \succeq 0, \mu} \inf_{x \in D} L(x, \lambda, \mu) \le \inf_{x \in D} \sup_{\lambda \succeq 0, \mu} L(x, \lambda, \mu) = p^*.$$
Definition 1.2. (a) A point $x^*$ is called primal optimal if it minimizes $\sup_{\lambda \succeq 0, \mu} L(x, \lambda, \mu)$.
(b) A dual pair $(\lambda^*, \mu^*)$ with $\lambda^* \succeq 0$ is said to be dual optimal if it maximizes $\inf_{x \in D} L(x, \lambda, \mu)$.
Strong duality
Definition 1.3. When d∗ = p∗ , we say the strong duality holds.
A sufficient condition for strong duality is the Slater condition: there exists a feasible x in the relative interior of D with $f_i(x) < 0$, $i = 1, \ldots, m$, and $h_i(x) = 0$, $i = 1, \ldots, p$. Such a point x is called a strictly feasible point.
Theorem 1.1. Suppose $f_0, \ldots, f_m$ are convex and $h(x) = Ax - b$, and assume the Slater condition holds: there exists $x \in D^\circ$ with $Ax - b = 0$ and $f_i(x) < 0$ for all $i = 1, \ldots, m$. Then strong duality holds.
Proof. See pp. 234-236, Boyd’s Convex Optimization.
Complementary slackness. Suppose there exist $x^*$, $\lambda^* \succeq 0$ and $\mu^*$ such that $x^*$ is the optimal primal point, $(\lambda^*, \mu^*)$ is the optimal dual point, and the duality gap $p^* - d^* = 0$. In this case,
$$\begin{aligned}
f_0(x^*) = g(\lambda^*, \mu^*) &:= \inf_x \left( f_0(x) + \sum_{i=1}^m \lambda_i^* f_i(x) + \sum_{i=1}^p \mu_i^* h_i(x) \right) \\
&\le f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{i=1}^p \mu_i^* h_i(x^*) \\
&\le f_0(x^*).
\end{aligned}$$
The last line follows from
$$\sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{i=1}^p \mu_i^* h_i(x^*) \le 0.$$
Hence equality holds throughout. In particular, $\sum_{i=1}^m \lambda_i^* f_i(x^*) = 0$; since each term satisfies $\lambda_i^* f_i(x^*) \le 0$, we conclude $\lambda_i^* f_i(x^*) = 0$ for every i. This is called complementary slackness. It holds for any optimal solutions $(x^*, \lambda^*, \mu^*)$.
KKT condition
Proposition 1.14. When $f_0$, $f_i$ and $h_i$ are differentiable, the optimal point $x^*$ of the primal problem and $(\lambda^*, \mu^*)$ of the dual problem satisfy the Karush-Kuhn-Tucker (KKT) conditions:
$$f_i(x^*) \le 0, \quad i = 1, \ldots, m,$$
$$\lambda_i^* \ge 0, \quad i = 1, \ldots, m,$$
$$\lambda_i^* f_i(x^*) = 0, \quad i = 1, \ldots, m,$$
$$h_i(x^*) = 0, \quad i = 1, \ldots, p,$$
$$\nabla f_0(x^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + \sum_{i=1}^p \mu_i^* \nabla h_i(x^*) = 0.$$
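As a concrete check of the KKT conditions, consider the assumed toy problem $\min x^2$ subject to $1 - x \le 0$ (so $f_0(x) = x^2$, $f_1(x) = 1 - x$, no equality constraints). The optimal pair is $x^* = 1$, $\lambda^* = 2$, and all conditions can be verified directly:

```python
# Hypothetical example: min f0(x) = x^2  s.t.  f1(x) = 1 - x <= 0.
# Claimed optimal pair: x* = 1, lambda* = 2.
x_star, lam_star = 1.0, 2.0

f1 = lambda x: 1.0 - x
grad_f0 = lambda x: 2.0 * x
grad_f1 = lambda x: -1.0

primal_feasible = f1(x_star) <= 0.0          # f1(x*) <= 0
dual_feasible = lam_star >= 0.0              # lambda* >= 0
comp_slack = lam_star * f1(x_star) == 0.0    # lambda* f1(x*) = 0
stationarity = grad_f0(x_star) + lam_star * grad_f1(x_star)  # should be 0
```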
Remark. If $f_i$, $i = 0, \ldots, m$, are closed and convex but possibly non-differentiable, then the last KKT condition is replaced by
$$0 \in \partial f_0(x^*) + \sum_{i=1}^m \lambda_i^* \partial f_i(x^*) + \sum_{i=1}^p \mu_i^* \partial h_i(x^*).$$
Theorem 1.2. If $f_0, f_i$ are closed and convex and h is affine, then the KKT condition is also a sufficient condition for optimality. That is, if $(\hat{x}, \hat{\lambda}, \hat{\mu})$ satisfies the KKT condition, then $\hat{x}$ is primal optimal, $(\hat{\lambda}, \hat{\mu})$ is dual optimal, and there is zero duality gap.
Proof. 1. Since $\hat{\lambda}_i \ge 0$, the $f_i$ are convex, and h is affine, the Lagrangian $L(x, \hat{\lambda}, \hat{\mu})$ is also convex in x.
3. The last KKT condition states that $\hat{x}$ minimizes $L(x, \hat{\lambda}, \hat{\mu})$. Thus
$$g(\hat{\lambda}, \hat{\mu}) = L(\hat{x}, \hat{\lambda}, \hat{\mu}) = f_0(\hat{x}),$$
where the last equality uses $\hat{\lambda}_i f_i(\hat{x}) = 0$ and $h_i(\hat{x}) = 0$. This shows that $\hat{x}$ and $(\hat{\lambda}, \hat{\mu})$ have zero duality gap and therefore are primal optimal and dual optimal, respectively.
2 Optimization algorithms
2.1 Gradient Methods
Assumptions
Gradient method
• Forward method:
$$x^k = x^{k-1} - t_k \nabla f(x^{k-1}).$$
• Backward method:
$$x^k = x^{k-1} - t_k \nabla f(x^k).$$
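For the quadratic $f(x) = \frac{1}{2}x^2$ (an assumed test case), both updates have closed forms: the forward step multiplies by $(1-t)$, while the backward (implicit) equation $x^k = x^{k-1} - t x^k$ solves to $x^k = x^{k-1}/(1+t)$. A minimal sketch:

```python
# Assumed test function f(x) = 0.5*x^2, so grad f(x) = x and minimizer x* = 0.

def forward_step(x, t):
    # x^k = x^{k-1} - t * grad f(x^{k-1})
    return x - t * x

def backward_step(x, t):
    # Solve x^k = x^{k-1} - t * grad f(x^k), i.e. (1 + t) x^k = x^{k-1}.
    return x / (1.0 + t)

xf = xb = 5.0
for _ in range(50):
    xf = forward_step(xf, t=0.5)
    xb = backward_step(xb, t=0.5)
# Both iterates approach 0; the implicit step is stable for any t > 0.
```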
Proposition 2.15. Suppose $f \in C^1$ is convex and $\nabla f$ is Lipschitz with constant L. If the step size t satisfies $t \le 1/L$, then the fixed-step gradient descent method satisfies
$$f(x^k) - f(x^*) \le \frac{1}{2kt}\|x^0 - x^*\|^2.$$
Remarks
• The sequence $\{x^k\}$ is bounded. Thus, it has a subsequence converging to some $\tilde{x}$, which is an optimal solution.
• If in addition f is strongly convex, then the sequence {xk } converges to the unique optimal solution
x∗ linearly.
Proof.
1. Let x+ := x − t∇f (x).
2. From the quadratic upper bound
$$f(y) \le f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\|y - x\|^2,$$
choosing $y = x^+$ and $t \le 1/L$, we get
$$f(x^+) \le f(x) + \left(-t + \frac{Lt^2}{2}\right)\|\nabla f(x)\|^2 \le f(x) - \frac{t}{2}\|\nabla f(x)\|^2.$$
3. From
$$f(x^*) \ge f(x) + \langle \nabla f(x), x^* - x\rangle,$$
we get
$$\begin{aligned}
f(x^+) &\le f(x) - \frac{t}{2}\|\nabla f(x)\|^2 \\
&\le f^* + \langle \nabla f(x), x - x^*\rangle - \frac{t}{2}\|\nabla f(x)\|^2 \\
&= f^* + \frac{1}{2t}\left(\|x - x^*\|^2 - \|x - x^* - t\nabla f(x)\|^2\right) \\
&= f^* + \frac{1}{2t}\left(\|x - x^*\|^2 - \|x^+ - x^*\|^2\right).
\end{aligned}$$
Proposition 2.16. Suppose $f \in C^1$ is convex. The fixed-step backward gradient method satisfies
$$f(x^k) - f(x^*) \le \frac{1}{2kt}\|x^0 - x^*\|^2.$$
Here, no assumption on Lipschitz continuity of ∇f (x) is needed.
Proof. By convexity, for any z,
$$f(z) \ge f(x^+) + \langle \nabla f(x^+), z - x^+\rangle = f(x^+) + \langle \nabla f(x^+), z - x\rangle + t\|\nabla f(x^+)\|^2,$$
where we used $x - x^+ = t\nabla f(x^+)$.
3. Taking $z = x$, we get
$$f(x^+) \le f(x) - t\|\nabla f(x^+)\|^2.$$
Thus, $f(x^+) < f(x)$ unless $\nabla f(x^+) = 0$.
4. Taking $z = x^*$, we obtain
Proposition 2.17. Suppose f is strongly convex with parameter m and $\nabla f$ is Lipschitz continuous with parameter L. Suppose the minimum of f is attained at $x^*$. Then the gradient method converges linearly, namely
$$\|x^k - x^*\|^2 \le c^k\|x^0 - x^*\|^2, \qquad f(x^k) - f(x^*) \le \frac{c^k L}{2}\|x^0 - x^*\|^2,$$
where
$$c = 1 - t\,\frac{2mL}{m+L} < 1 \quad \text{if the step size } t \le \frac{2}{m+L}.$$
Proof. 1. For $0 < t \le 2/(m+L)$, t is chosen so that $c < 1$. Thus, the sequence $x^k - x^*$ converges linearly with rate c. From
$$f(x^k) - f(x^*) \le \frac{L}{2}\|x^k - x^*\|^2 \le \frac{c^k L}{2}\|x^0 - x^*\|^2,$$
we get that $f(x^k) - f(x^*)$ also converges to 0 at a linear rate.
2.2 Subgradient method
Assumptions
Subgradient method
$$x^k = x^{k-1} - t_k v^{k-1}, \qquad v^{k-1} \in \partial f(x^{k-1}).$$
Here $t_k$ is chosen so that $f(x^k) < f(x^{k-1})$.
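A minimal sketch of the subgradient iteration for the nonsmooth $f(x) = |x|$ (an assumed example); since f is not differentiable at 0, we pick the subgradient $v = \operatorname{sign}(x)$, and we use diminishing steps $t_k = 1/k$ rather than the decrease rule above:

```python
# Assumed example: f(x) = |x|; sign(x) is a subgradient (any value in [-1, 1]
# is valid at x = 0; we pick 0 there).

def subgrad(x):
    return 1.0 if x > 0 else (-1.0 if x < 0 else 0.0)

x = 5.0
for k in range(1, 2001):
    x = x - (1.0 / k) * subgrad(x)   # x^k = x^{k-1} - t_k v^{k-1}
# Once the iterate crosses 0 it oscillates with amplitude at most t_k -> 0.
```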
Proposition 2.18. Suppose f is closed and convex, and suppose an optimal solution $x^*$ of $\min f$ is attainable. Then the proximal point method $x^k = \operatorname{prox}_{tf}(x^{k-1})$ with $t > 0$ satisfies
$$f(x^k) - f(x^*) \le \frac{1}{2kt}\|x^0 - x^*\|^2.$$
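A one-dimensional sketch of the proximal point iteration; the quadratic $f(x) = \frac{1}{2}(x-3)^2$ is an assumed example, chosen because $\operatorname{prox}_{tf}$ then has the closed form $(x + 3t)/(1+t)$:

```python
# Assumed example: f(x) = 0.5*(x - 3)^2, minimizer x* = 3.

def prox_tf(x, t):
    # argmin_z 0.5*(z - 3)^2 + (1/(2t))*(z - x)^2  =>  z = (x + 3t)/(1 + t).
    return (x + 3.0 * t) / (1.0 + t)

x = 0.0
for _ in range(100):
    x = prox_tf(x, t=1.0)   # x^k = prox_{tf}(x^{k-1})
# With t = 1 the error to x* = 3 halves at every iteration.
```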
Convergence proof:
1. Given x, let $x^+ := \operatorname{prox}_{tf}(x)$ and $G_t(x) := (x - x^+)/t$. Then $G_t(x) \in \partial f(x^+)$. We then have, for any z,
$$f(z) \ge f(x^+) + \langle G_t(x), z - x^+\rangle = f(x^+) + \langle G_t(x), z - x\rangle + t\|G_t(x)\|^2.$$
2. Taking $z = x$, we get
$$f(x^+) \le f(x) - t\|G_t(x)\|^2.$$
Thus, $f(x^+) < f(x)$ unless $G_t(x) = 0$.
3. Taking $z = x^*$, we obtain
• Let
$$G(x) = x - x^+ = (I - F)(x).$$
Proof. (1) Let $x^+ = \operatorname{prox}_f(x) = F(x)$ and $y^+ = \operatorname{prox}_f(y) = F(y)$. Then $G(x) = x - x^+ \in \partial f(x^+)$. From the monotonicity of $\partial f$, we have
$$\langle G(x) - G(y), x^+ - y^+\rangle \ge 0.$$
This gives
$$\langle x^+ - y^+, x - y\rangle \ge \|x^+ - y^+\|^2,$$
that is,
$$\langle F(x) - F(y), x - y\rangle \ge \|F(x) - F(y)\|^2.$$
(2) From $G = I - F$, we have
$$\begin{aligned}
\langle G(x) - G(y), x - y\rangle &= \langle G(x) - G(y), (F + G)(x) - (F + G)(y)\rangle \\
&= \|G(x) - G(y)\|^2 + \langle G(x) - G(y), F(x) - F(y)\rangle \\
&= \|G(x) - G(y)\|^2 + \langle x - F(x) - y + F(y), F(x) - F(y)\rangle \\
&= \|G(x) - G(y)\|^2 + \langle x - y, F(x) - F(y)\rangle - \|F(x) - F(y)\|^2 \\
&\ge \|G(x) - G(y)\|^2.
\end{aligned}$$
4. Since the sequence $\{y^k\}$ is bounded, any convergent subsequence converges to some $\bar{y}$ satisfying $G(\bar{y}) = 0$, by the continuity of G. Thus, any cluster point $\bar{y}$ of $\{y^k\}$ satisfies $G(\bar{y}) = 0$. Hence, by the previous argument with $y^*$ replaced by $\bar{y}$, the sequence $\|y^k - \bar{y}\|$ is also non-increasing and has a limit.
5. We claim that $\{y^k\}$ has only one cluster point. Suppose $\bar{y}_1$ and $\bar{y}_2$ are two cluster points of $\{y^k\}$. Then both sequences $\{\|y^k - \bar{y}_1\|\}$ and $\{\|y^k - \bar{y}_2\|\}$ are non-increasing and have limits. Since $\bar{y}_1, \bar{y}_2$ are cluster points, there exist subsequences $\{k_i^1\}$ and $\{k_i^2\}$ such that $y^{k_i^1} \to \bar{y}_1$ and $y^{k_i^2} \to \bar{y}_2$ as $i \to \infty$. We can choose subsequences again so that
$$k_{i-1}^2 < k_i^1 < k_i^2 < k_{i+1}^1 \quad \text{for all } i.$$
Assumptions:
• $g \in C^1$ convex, $\nabla g$ Lipschitz continuous with parameter L.
We can express $\operatorname{prox}_{tf}$ as $(I + t\partial f)^{-1}$. Therefore the proximal gradient method can be expressed as
$$x^k = \operatorname{prox}_{tf}\left(x^{k-1} - t\nabla g(x^{k-1})\right) = (I + t\partial f)^{-1}(I - t\nabla g)(x^{k-1}).$$
Thus, the proximal gradient method is also called the forward-backward method.
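A one-dimensional forward-backward sketch for $\min_x g(x) + f(x)$ with the assumed pair $g(x) = \frac{1}{2}(x-4)^2$ (smooth, $L = 1$) and $f(x) = |x|$; the backward step is the soft-thresholding prox of $|\cdot|$, and the minimizer is $x^* = 3$ (where $x - 4 + \operatorname{sign}(x) = 0$):

```python
# Assumed instance: g(x) = 0.5*(x - 4)^2 (grad g = x - 4, L = 1), f(x) = |x|.

def grad_g(x):
    return x - 4.0

def soft(v, lam):
    # prox of lam*|.|: soft-thresholding.
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

t = 1.0   # step size with L*t = 1
x = 0.0
for _ in range(20):
    x = soft(x - t * grad_g(x), t)   # forward (gradient) step, then backward (prox) step
```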
Theorem 2.4. The forward-backward method converges provided Lt ≤ 1.
Proof. 1. Given a point x, define
$$x' = x - t\nabla g(x), \qquad x^+ = \operatorname{prox}_{tf}(x').$$
Then
$$-\frac{x' - x}{t} = \nabla g(x), \qquad -\frac{x^+ - x'}{t} \in \partial f(x^+).$$
Combining these two, we define a "gradient" $G_t(x) := -\frac{x^+ - x}{t}$. Then $G_t(x) - \nabla g(x) \in \partial f(x^+)$.
2. From the quadratic upper bound of g, we have
$$\begin{aligned}
g(x^+) &\le g(x) + \langle \nabla g(x), x^+ - x\rangle + \frac{L}{2}\|x^+ - x\|^2 \\
&= g(x) + \langle \nabla g(x), x^+ - x\rangle + \frac{Lt^2}{2}\|G_t(x)\|^2 \\
&\le g(x) + \langle \nabla g(x), x^+ - x\rangle + \frac{t}{2}\|G_t(x)\|^2.
\end{aligned}$$
The last inequality holds provided $Lt \le 1$. Combining this with the convexity of g, $g(x) \le g(z) + \langle \nabla g(x), x - z\rangle$, we get
$$g(x^+) \le g(z) + \langle \nabla g(x), x^+ - z\rangle + \frac{t}{2}\|G_t(x)\|^2.$$
3. From the first-order condition of f at $x^+$,
Assumptions
• f and g are closed and convex.
Examples:
• $g(y) = \iota_{\{b\}}(y) = \begin{cases} 0 & \text{if } y = b \\ \infty & \text{otherwise,} \end{cases}$ with $g^*(z) = \langle z, b\rangle$.
• $g(y) = \iota_C(y)$
• $g(y) = \|y - b\|^2$.
The Lagrangian is
$$L(x, y, z) := f(x) + g(y) + \langle z, Ax - y\rangle.$$
The primal function is
$$F_P(x) = \inf_y \sup_z L(x, y, z).$$
We have
$$\begin{aligned}
\sup_u \left( -f^*(-A^T u) - g^*(u) - \frac{1}{2t}\|u - z\|^2 \right)
&= \sup_u \inf_{x,y} \left( L(x, y, u) - \frac{1}{2t}\|u - z\|^2 \right) \\
&= \sup_u \inf_{x,y} \left( f(x) + g(y) + \langle u, Ax - y\rangle - \frac{1}{2t}\|u - z\|^2 \right) \\
&= \inf_{x,y} \sup_u \left( f(x) + g(y) + \langle u, Ax - y\rangle - \frac{1}{2t}\|u - z\|^2 \right) \\
&= \inf_{x,y} \left( f(x) + g(y) + \langle z, Ax - y\rangle + \frac{t}{2}\|Ax - y\|^2 \right).
\end{aligned}$$
Here, the maximizer is $u = z + t(Ax - y)$. Thus, we define the augmented Lagrangian to be
$$L_t(x, y, z) := f(x) + g(y) + \langle z, Ax - y\rangle + \frac{t}{2}\|Ax - y\|^2.$$
The augmented Lagrangian method is
$$(x^k, y^k) = \operatorname{argmin}_{x,y} L_t(x, y, z^{k-1}),$$
$$z^k = z^{k-1} + t(Ax^k - y^k).$$
Thus, the augmented Lagrangian method is equivalent to the proximal point method applied to the dual problem
$$\sup_z \left( -f^*(-A^*z) - g^*(z) \right).$$
Assumptions
• fi are closed and convex.
ADMM
• Define
$$L_t(x_1, x_2, z) := f_1(x_1) + f_2(x_2) + \langle z, A_1 x_1 + A_2 x_2 - b\rangle + \frac{t}{2}\|A_1 x_1 + A_2 x_2 - b\|^2.$$
• ADMM:
$$x_1^k = \operatorname{argmin}_{x_1} L_t(x_1, x_2^{k-1}, z^{k-1}) = \operatorname{argmin}_{x_1} f_1(x_1) + \frac{t}{2}\left\|A_1 x_1 + A_2 x_2^{k-1} - b + \frac{1}{t}z^{k-1}\right\|^2,$$
$$x_2^k = \operatorname{argmin}_{x_2} L_t(x_1^k, x_2, z^{k-1}) = \operatorname{argmin}_{x_2} f_2(x_2) + \frac{t}{2}\left\|A_1 x_1^k + A_2 x_2 - b + \frac{1}{t}z^{k-1}\right\|^2,$$
$$z^k = z^{k-1} + t(A_1 x_1^k + A_2 x_2^k - b).$$
• Douglas-Rachford method for
$$\min_z h_1(z) + h_2(z):$$
$$z^k = \operatorname{prox}_{h_1}(y^{k-1}),$$
$$y^k = y^{k-1} + \operatorname{prox}_{h_2}(2z^k - y^{k-1}) - z^k.$$
If we write $(I + \partial h_1)^{-1} = A$ and $(I + \partial h_2)^{-1} = B$, these two operators are firmly non-expansive. The Douglas-Rachford method finds a fixed point of $y^k = T y^{k-1}$, where
$$T = I - A + B(2A - I).$$
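The ADMM updates above can be sketched on an assumed scalar consensus problem: $f_1(x_1) = \frac{1}{2}(x_1 - 1)^2$, $f_2(x_2) = \frac{1}{2}(x_2 - 3)^2$, constraint $x_1 - x_2 = 0$ (so $A_1 = 1$, $A_2 = -1$, $b = 0$); the solution is $x_1 = x_2 = 2$, and both argmins are available in closed form:

```python
# Assumed instance: f1 = 0.5*(x1 - 1)^2, f2 = 0.5*(x2 - 3)^2, constraint x1 = x2.
t = 1.0
x2, z = 0.0, 0.0
for _ in range(100):
    # x1-update: argmin 0.5*(x1 - 1)^2 + (t/2)*(x1 - x2 + z/t)^2
    x1 = (1.0 + t * x2 - z) / (1.0 + t)
    # x2-update: argmin 0.5*(x2 - 3)^2 + (t/2)*(x1 - x2 + z/t)^2
    x2 = (3.0 + t * x1 + z) / (1.0 + t)
    # dual update: z^k = z^{k-1} + t*(x1 - x2)
    z = z + t * (x1 - x2)
```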
2.9 Primal-dual formulation
Consider
$$\inf_x \left( f(x) + g(Ax) \right).$$
Let
$$F_P(x) := f(x) + g(Ax).$$
Define $y = Ax$ and consider $\inf_{x,y} f(x) + g(y)$ subject to $y = Ax$. Now introduce the method of Lagrange multipliers: consider
$$L_P(x, y, z) = f(x) + g(y) + \langle z, Ax - y\rangle.$$
Then
$$F_P(x) = \inf_y \sup_z L_P(x, y, z),$$
and the problem is
$$\inf_x \inf_y \sup_z L_P(x, y, z).$$
We find that
$$\inf_{x,y} L_P(x, y, z) = -f^*(-A^*z) - g^*(z) =: F_D(z),$$
$$\inf_y L_P(x, y, z) = \inf_y \left( f(x) + g(y) + \langle z, Ax - y\rangle \right) = f(x) + \langle z, Ax\rangle - g^*(z) =: L_{PD}(x, z).$$
On the other hand, we can start from $F_D(z) := -f^*(-A^*z) - g^*(z)$. Consider
$$L_D(z, w, x) := -f^*(w) - g^*(z) - \langle x, -A^*z - w\rangle;$$
then we have
$$\sup_w \inf_x L_D(z, w, x) = F_D(z),$$
$$\sup_{z,w} L_D(z, w, x) = \sup_{z,w} \left( -f^*(w) - g^*(z) - \langle x, -A^*z - w\rangle \right) = f(x) + g(Ax) = F_P(x),$$
$$\sup_w L_D(z, w, x) = \sup_w \left( -f^*(w) - g^*(z) - \langle x, -A^*z - w\rangle \right) = f(x) - g^*(z) + \langle Ax, z\rangle = L_{PD}(x, z).$$
Let us summarize