Subgradients
Ryan Tibshirani
Convex Optimization 10-725
Last time: gradient descent
Consider the problem
  min_x f(x)
Outline

Today:
• Subgradients and examples
• Subdifferential and connection to convex geometry
• Subgradient calculus
• Optimality condition and examples
Subgradients

A subgradient of a convex function f : Rⁿ → R at x is any g ∈ Rⁿ such that

  f(y) ≥ f(x) + gᵀ(y − x)  for all y

• Always exists (at least on the relative interior of dom(f))
• If f is differentiable at x, then g = ∇f(x) uniquely
• The same definition works for nonconvex f (however, subgradients need not exist)
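As a quick one-dimensional numeric illustration of the definition (not from the slides; the helper name `check_subgradient`, the sampling scheme, and the tolerance are my own choices), we can sample points y and test the subgradient inequality, e.g. for f(x) = |x| at x = 0, where any g ∈ [−1, 1] should pass:

```python
import numpy as np

def check_subgradient(f, x, g, n_samples=10_000, seed=0):
    """Sample points y and test f(y) >= f(x) + g*(y - x) at each one.
    Passing is evidence, not proof, that g is a subgradient of f at x."""
    rng = np.random.default_rng(seed)
    y = x + rng.normal(scale=5.0, size=n_samples)
    return bool(np.all(f(y) >= f(x) + g * (y - x) - 1e-12))

print(check_subgradient(np.abs, x=0.0, g=0.3))   # True: any g in [-1, 1] works at 0
print(check_subgradient(np.abs, x=0.0, g=1.5))   # False: inequality fails for y > 0
```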
Examples of subgradients
Consider f : R → R, f(x) = |x|

[Figure: plot of f(x) = |x| over x ∈ [−2, 2]]

• For x ≠ 0, the subgradient is unique: g = sign(x)
• For x = 0, any g ∈ [−1, 1] is a subgradient
Consider f : Rⁿ → R, f(x) = ‖x‖₂
[Figure: surface plot of f(x) = ‖x‖₂ over (x₁, x₂)]

• For x ≠ 0, the subgradient is unique: g = x/‖x‖₂
• For x = 0, any g with ‖g‖₂ ≤ 1 is a subgradient
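A minimal sketch of these two cases (the function name is mine):

```python
import numpy as np

def subgrad_l2(x):
    """One subgradient of f(x) = ||x||_2: the gradient x/||x||_2 when x != 0;
    at x = 0 any g with ||g||_2 <= 1 works, so we return the zero vector."""
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else np.zeros_like(x)
```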
Consider f : Rⁿ → R, f(x) = ‖x‖₁
[Figure: surface plot of f(x) = ‖x‖₁ over (x₁, x₂)]

• For xᵢ ≠ 0, the i-th component of any subgradient is gᵢ = sign(xᵢ)
• For xᵢ = 0, any gᵢ ∈ [−1, 1] works
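The same idea componentwise (again, the function name is mine):

```python
import numpy as np

def subgrad_l1(x):
    """One subgradient of f(x) = ||x||_1: g_i = sign(x_i) when x_i != 0;
    where x_i = 0, np.sign gives 0, one valid choice from [-1, 1]."""
    return np.sign(x)
```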
Consider f(x) = max{f₁(x), f₂(x)}, for f₁, f₂ : Rⁿ → R convex and differentiable
[Figure: plot of f(x) = max{f₁(x), f₂(x)} over x ∈ [−2, 2]]

• If f₁(x) > f₂(x), the subgradient is unique: g = ∇f₁(x)
• If f₂(x) > f₁(x), the subgradient is unique: g = ∇f₂(x)
• If f₁(x) = f₂(x), g is any point on the segment between ∇f₁(x) and ∇f₂(x)
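A sketch of selecting a subgradient of the pointwise maximum according to these cases (the names and the tie-handling choice are mine):

```python
import numpy as np

def subgrad_max(f1, grad1, f2, grad2, x):
    """One subgradient of f = max{f1, f2} at x, for convex differentiable f1, f2:
    the gradient of whichever function is active; on ties, any convex
    combination of the two gradients works, and we pick the midpoint."""
    v1, v2 = f1(x), f2(x)
    if v1 > v2:
        return grad1(x)
    if v2 > v1:
        return grad2(x)
    return 0.5 * (grad1(x) + grad2(x))
```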
Subdifferential

The set of all subgradients of a convex f is called the subdifferential:

  ∂f(x) = {g ∈ Rⁿ : g is a subgradient of f at x}
Connection to convex geometry

For a convex set C ⊆ Rⁿ, consider the indicator function I_C(x), equal to 0 if x ∈ C and ∞ otherwise. Its subdifferential at x ∈ C is the normal cone of C at x:

  ∂I_C(x) = N_C(x) = {g ∈ Rⁿ : gᵀx ≥ gᵀy for all y ∈ C}
Subgradient calculus

Basic rules for convex functions:

• Scaling: ∂(af) = a · ∂f, provided a > 0
• Addition: ∂(f₁ + f₂) = ∂f₁ + ∂f₂
• Affine composition: if g(x) = f(Ax + b), then (see the sketch after these rules)

  ∂g(x) = Aᵀ ∂f(Ax + b)
• General composition: if

  f(x) = h(g(x)) = h(g₁(x), …, gₖ(x))

  where g : Rⁿ → Rᵏ, h : Rᵏ → R, f : Rⁿ → R, h is convex and nondecreasing in each argument, and each gᵢ is convex, then

  ∂f(x) ⊇ {p₁q₁ + · · · + pₖqₖ : p ∈ ∂h(g(x)), qᵢ ∈ ∂gᵢ(x), i = 1, …, k}
• Norms: important special case. To each norm ‖ · ‖ there corresponds a dual norm ‖ · ‖∗ such that

  ‖x‖ = max_{‖z‖∗ ≤ 1} zᵀx

  In fact, for f(x) = ‖x‖, the subdifferential is the set of maximizers:

  ∂f(x) = argmax_{‖z‖∗ ≤ 1} zᵀx
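A minimal sketch tying two of these rules together (all names are mine): the affine composition rule gives a subgradient of ‖Ax + b‖₁, using that the subdifferential of ‖·‖₁ at a point w consists of the maximizers of zᵀw over the dual-norm ball ‖z‖∞ ≤ 1 (ℓ∞ being dual to ℓ₁):

```python
import numpy as np

def subgrad_affine_l1(A, b, x):
    """One subgradient of g(x) = ||Ax + b||_1 via the affine composition rule:
    take u in the subdifferential of ||.||_1 at w = Ax + b, return A^T u.
    u = sign(w) is such a u: it maximizes z^T w over the dual-norm ball
    ||z||_inf <= 1, which is exactly the subdifferential of ||.||_1 at w."""
    w = A @ x + b
    u = np.sign(w)
    assert np.isclose(u @ w, np.sum(np.abs(w)))  # u attains ||w||_1
    return A.T @ u
```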
Optimality condition

For any f (convex or not),

  f(x⋆) = min_x f(x) ⇐⇒ 0 ∈ ∂f(x⋆)

i.e., x⋆ is a minimizer if and only if 0 is a subgradient of f at x⋆. Why? g = 0 being a subgradient means f(y) ≥ f(x⋆) + 0ᵀ(y − x⋆) = f(x⋆) for every y
Derivation of first-order optimality

Recall the first-order optimality condition: for f convex and differentiable and C convex, the problem min_x f(x) subject to x ∈ C is solved at x if and only if

  ∇f(x)ᵀ(y − x) ≥ 0 for all y ∈ C

Derivation: recast the problem as min_x f(x) + I_C(x), and apply subgradient optimality
Observe

  0 ∈ ∂(f(x) + I_C(x))
  ⇐⇒ 0 ∈ {∇f(x)} + N_C(x)
  ⇐⇒ −∇f(x) ∈ N_C(x)
  ⇐⇒ −∇f(x)ᵀx ≥ −∇f(x)ᵀy for all y ∈ C
  ⇐⇒ ∇f(x)ᵀ(y − x) ≥ 0 for all y ∈ C

as desired
Example: lasso optimality conditions

Given y ∈ Rⁿ, X ∈ Rⁿˣᵖ, the lasso problem can be parametrized as

  min_β (1/2)‖y − Xβ‖₂² + λ‖β‖₁

where λ ≥ 0. Subgradient optimality:

  0 ∈ ∂((1/2)‖y − Xβ‖₂² + λ‖β‖₁)
  ⇐⇒ 0 ∈ −Xᵀ(y − Xβ) + λ∂‖β‖₁
  ⇐⇒ Xᵀ(y − Xβ) = λv

for some v ∈ ∂‖β‖₁, i.e., vᵢ = sign(βᵢ) if βᵢ ≠ 0, and vᵢ ∈ [−1, 1] if βᵢ = 0
Write X₁, …, Xₚ for the columns of X. Then our condition reads:

  Xᵢᵀ(y − Xβ) = λ · sign(βᵢ)   if βᵢ ≠ 0
  |Xᵢᵀ(y − Xβ)| ≤ λ            if βᵢ = 0
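These conditions are easy to verify numerically at a candidate solution. A sketch (the function name and tolerance are my own):

```python
import numpy as np

def check_lasso_optimality(X, y, beta, lam, tol=1e-8):
    """Check the lasso subgradient optimality conditions at a candidate beta:
    X_i^T (y - X beta) = lam * sign(beta_i) where beta_i != 0, and
    |X_i^T (y - X beta)| <= lam where beta_i = 0."""
    r = X.T @ (y - X @ beta)          # one entry per column X_i
    active = beta != 0
    ok_active = np.allclose(r[active], lam * np.sign(beta[active]), atol=tol)
    ok_inactive = bool(np.all(np.abs(r[~active]) <= lam + tol))
    return ok_active and ok_inactive
```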
Example: soft-thresholding

Simplified lasso problem with X = I:

  min_β (1/2)‖y − β‖₂² + λ‖β‖₁

Claim: the solution is β = S_λ(y), where S_λ is the soft-thresholding operator:

  [S_λ(y)]ᵢ = yᵢ − λ   if yᵢ > λ
  [S_λ(y)]ᵢ = 0        if −λ ≤ yᵢ ≤ λ
  [S_λ(y)]ᵢ = yᵢ + λ   if yᵢ < −λ
Now plug in β = S_λ(y) and check that the optimality conditions (with X = I) are satisfied:

• When yᵢ > λ, βᵢ = yᵢ − λ > 0, so yᵢ − βᵢ = λ = λ · sign(βᵢ)
• When yᵢ < −λ, the argument is similar
• When |yᵢ| ≤ λ, βᵢ = 0, and |yᵢ − βᵢ| = |yᵢ| ≤ λ
Soft-thresholding in one variable:

[Figure: plot of S_λ(y) against y]
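A sketch of the operator itself, with a numeric spot-check of the case analysis above (the example data is made up):

```python
import numpy as np

def soft_threshold(y, lam):
    """Elementwise soft-thresholding: shrink each y_i toward zero by lam,
    setting it exactly to zero when |y_i| <= lam."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, -0.2, 0.7])
lam = 0.5
beta = soft_threshold(y, lam)   # [2.5, 0.0, 0.2]

# Verify the optimality conditions from the previous slide (with X = I):
r = y - beta
assert np.allclose(r[beta != 0], lam * np.sign(beta[beta != 0]))
assert np.all(np.abs(r[beta == 0]) <= lam)
```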
Example: distance to a convex set

Recall the distance function to a closed, convex set C:

  dist(x, C) = min_{y ∈ C} ‖y − x‖₂

This is a convex function. Writing P_C(x) for the projection of x onto C, we have dist(x, C) = ‖x − P_C(x)‖₂, and when dist(x, C) > 0,

  ∂dist(x, C) = {(x − P_C(x)) / ‖x − P_C(x)‖₂}
We will only show one direction, i.e., that

  (x − P_C(x)) / ‖x − P_C(x)‖₂ ∈ ∂dist(x, C)

Write u = P_C(x). By first-order optimality for the projection problem, (x − u)ᵀ(y − u) ≤ 0 for all y ∈ C, so C ⊆ H = {y : (x − u)ᵀ(y − u) ≤ 0}.

Claim:

  dist(y, C) ≥ (x − u)ᵀ(y − u) / ‖x − u‖₂  for all y

Check: first, for y ∈ H, the right-hand side is ≤ 0 ≤ dist(y, C)
Now for y ∉ H, we have (x − u)ᵀ(y − u) = ‖x − u‖₂ ‖y − u‖₂ cos θ, where θ is the angle between x − u and y − u. Thus

  (x − u)ᵀ(y − u) / ‖x − u‖₂ = ‖y − u‖₂ cos θ = dist(y, H) ≤ dist(y, C)

as desired
Using the claim, for any y,

  dist(y, C) ≥ (x − u)ᵀ(y − x + x − u) / ‖x − u‖₂
             = ‖x − u‖₂ + ((x − u) / ‖x − u‖₂)ᵀ(y − x)

Since dist(x, C) = ‖x − u‖₂, this says dist(y, C) ≥ dist(x, C) + gᵀ(y − x) with g = (x − u)/‖x − u‖₂, so g ∈ ∂dist(x, C)
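As a numeric companion to this derivation, here is a spot-check of the subgradient inequality for a concrete C where the projection has a closed form, a Euclidean ball (all names and data are my own):

```python
import numpy as np

def proj_ball(x, center, radius):
    """Euclidean projection onto the ball {z : ||z - center||_2 <= radius}."""
    d = x - center
    nrm = np.linalg.norm(d)
    return x if nrm <= radius else center + radius * d / nrm

def subgrad_dist_ball(x, center, radius):
    """The subgradient (x - P_C(x)) / ||x - P_C(x)||_2 of dist(x, C), x not in C."""
    u = proj_ball(x, center, radius)
    return (x - u) / np.linalg.norm(x - u)

# Spot-check dist(y, C) >= dist(x, C) + g^T (y - x) at random points y
rng = np.random.default_rng(0)
c, r = np.zeros(3), 1.0
x = np.array([2.0, 1.0, 0.0])
g = subgrad_dist_ball(x, c, r)
dist = lambda z: np.linalg.norm(z - proj_ball(z, c, r))
for _ in range(1000):
    y = rng.normal(scale=3.0, size=3)
    assert dist(y) >= dist(x) + g @ (y - x) - 1e-12
```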