Subgradients
Ryan Tibshirani
Convex Optimization 10-725
Last time: gradient descent
Consider the problem
  min_x f(x)
Outline

Today:
• Subgradients and examples
• Subdifferential and connection to convex geometry
• Subgradient calculus
• Optimality condition and examples
Subgradients

A subgradient of a convex function f : Rⁿ → R at x is any g ∈ Rⁿ such that

  f(y) ≥ f(x) + gᵀ(y − x)  for all y

• Always exists (at least on the relative interior of dom(f))
• If f is differentiable at x, then g = ∇f(x) uniquely
• The same definition works for nonconvex f (however, subgradients need not exist)
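As a quick one-dimensional numeric illustration of the definition (not from the slides; the helper name `check_subgradient`, the sampling scheme, and the tolerance are my own choices), we can sample points y and test the subgradient inequality, e.g. for f(x) = |x| at x = 0, where any g ∈ [−1, 1] should pass:

```python
import numpy as np

def check_subgradient(f, x, g, n_samples=10_000, seed=0):
    """Sample points y and test f(y) >= f(x) + g*(y - x) at each one.
    Passing is evidence, not proof, that g is a subgradient of f at x."""
    rng = np.random.default_rng(seed)
    y = x + rng.normal(scale=5.0, size=n_samples)
    return bool(np.all(f(y) >= f(x) + g * (y - x) - 1e-12))

print(check_subgradient(np.abs, x=0.0, g=0.3))   # True: any g in [-1, 1] works at 0
print(check_subgradient(np.abs, x=0.0, g=1.5))   # False: inequality fails for y > 0
```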
Examples of subgradients
Consider f : R → R, f(x) = |x|

[Figure: plot of f(x) = |x| over x ∈ [−2, 2]]

• For x ≠ 0, the subgradient is unique: g = sign(x)
• For x = 0, any g ∈ [−1, 1] is a subgradient
Consider f : Rⁿ → R, f(x) = ‖x‖₂
[Figure: surface plot of f(x) = ‖x‖₂ over (x₁, x₂)]

• For x ≠ 0, the subgradient is unique: g = x/‖x‖₂
• For x = 0, any g with ‖g‖₂ ≤ 1 is a subgradient
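A minimal sketch of these two cases (the function name is mine):

```python
import numpy as np

def subgrad_l2(x):
    """One subgradient of f(x) = ||x||_2: the gradient x/||x||_2 when x != 0;
    at x = 0 any g with ||g||_2 <= 1 works, so we return the zero vector."""
    nrm = np.linalg.norm(x)
    return x / nrm if nrm > 0 else np.zeros_like(x)
```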
Consider f : Rⁿ → R, f(x) = ‖x‖₁
[Figure: surface plot of f(x) = ‖x‖₁ over (x₁, x₂)]

• For xᵢ ≠ 0, the i-th component of any subgradient is gᵢ = sign(xᵢ)
• For xᵢ = 0, any gᵢ ∈ [−1, 1] works
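The same idea componentwise (again, the function name is mine):

```python
import numpy as np

def subgrad_l1(x):
    """One subgradient of f(x) = ||x||_1: g_i = sign(x_i) when x_i != 0;
    where x_i = 0, np.sign gives 0, one valid choice from [-1, 1]."""
    return np.sign(x)
```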
Consider f(x) = max{f₁(x), f₂(x)}, for f₁, f₂ : Rⁿ → R convex and differentiable
[Figure: plot of f(x) = max{f₁(x), f₂(x)} over x ∈ [−2, 2]]

• If f₁(x) > f₂(x), the subgradient is unique: g = ∇f₁(x)
• If f₂(x) > f₁(x), the subgradient is unique: g = ∇f₂(x)
• If f₁(x) = f₂(x), g is any point on the segment between ∇f₁(x) and ∇f₂(x)
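A sketch of selecting a subgradient of the pointwise maximum according to these cases (the names and the tie-handling choice are mine):

```python
import numpy as np

def subgrad_max(f1, grad1, f2, grad2, x):
    """One subgradient of f = max{f1, f2} at x, for convex differentiable f1, f2:
    the gradient of whichever function is active; on ties, any convex
    combination of the two gradients works, and we pick the midpoint."""
    v1, v2 = f1(x), f2(x)
    if v1 > v2:
        return grad1(x)
    if v2 > v1:
        return grad2(x)
    return 0.5 * (grad1(x) + grad2(x))
```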
Subdifferential

The set of all subgradients of a convex f is called the subdifferential:

  ∂f(x) = {g ∈ Rⁿ : g is a subgradient of f at x}
Connection to convex geometry

For a convex set C ⊆ Rⁿ, consider the indicator function I_C(x), equal to 0 if x ∈ C and ∞ otherwise. Its subdifferential at x ∈ C is the normal cone of C at x:

  ∂I_C(x) = N_C(x) = {g ∈ Rⁿ : gᵀx ≥ gᵀy for all y ∈ C}
Subgradient calculus

Basic rules for convex functions:

• Scaling: ∂(af) = a · ∂f, provided a > 0
• Addition: ∂(f₁ + f₂) = ∂f₁ + ∂f₂
• Affine composition: if g(x) = f(Ax + b), then (see the sketch after these rules)

  ∂g(x) = Aᵀ ∂f(Ax + b)
• General composition: if

  f(x) = h(g(x)) = h(g₁(x), …, gₖ(x))

  where g : Rⁿ → Rᵏ, h : Rᵏ → R, f : Rⁿ → R, h is convex and nondecreasing in each argument, and each gᵢ is convex, then

  ∂f(x) ⊇ {p₁q₁ + · · · + pₖqₖ : p ∈ ∂h(g(x)), qᵢ ∈ ∂gᵢ(x), i = 1, …, k}
• Norms: important special case. To each norm ‖ · ‖ there corresponds a dual norm ‖ · ‖∗ such that

  ‖x‖ = max_{‖z‖∗ ≤ 1} zᵀx

  In fact, for f(x) = ‖x‖, the subdifferential is the set of maximizers:

  ∂f(x) = argmax_{‖z‖∗ ≤ 1} zᵀx
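A minimal sketch tying two of these rules together (all names are mine): the affine composition rule gives a subgradient of ‖Ax + b‖₁, using that the subdifferential of ‖·‖₁ at a point w consists of the maximizers of zᵀw over the dual-norm ball ‖z‖∞ ≤ 1 (ℓ∞ being dual to ℓ₁):

```python
import numpy as np

def subgrad_affine_l1(A, b, x):
    """One subgradient of g(x) = ||Ax + b||_1 via the affine composition rule:
    take u in the subdifferential of ||.||_1 at w = Ax + b, return A^T u.
    u = sign(w) is such a u: it maximizes z^T w over the dual-norm ball
    ||z||_inf <= 1, which is exactly the subdifferential of ||.||_1 at w."""
    w = A @ x + b
    u = np.sign(w)
    assert np.isclose(u @ w, np.sum(np.abs(w)))  # u attains ||w||_1
    return A.T @ u
```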
Optimality condition

For any f (convex or not),

  f(x⋆) = min_x f(x) ⇐⇒ 0 ∈ ∂f(x⋆)

i.e., x⋆ is a minimizer if and only if 0 is a subgradient of f at x⋆. Why? g = 0 being a subgradient means f(y) ≥ f(x⋆) + 0ᵀ(y − x⋆) = f(x⋆) for every y
Derivation of first-order optimality

Recall the first-order optimality condition: for f convex and differentiable and C convex, the problem min_x f(x) subject to x ∈ C is solved at x if and only if

  ∇f(x)ᵀ(y − x) ≥ 0 for all y ∈ C

Derivation: recast the problem as min_x f(x) + I_C(x), and apply subgradient optimality
Observe

  0 ∈ ∂(f(x) + I_C(x))
  ⇐⇒ 0 ∈ {∇f(x)} + N_C(x)
  ⇐⇒ −∇f(x) ∈ N_C(x)
  ⇐⇒ −∇f(x)ᵀx ≥ −∇f(x)ᵀy for all y ∈ C
  ⇐⇒ ∇f(x)ᵀ(y − x) ≥ 0 for all y ∈ C

as desired
Example: lasso optimality conditions

Given y ∈ Rⁿ, X ∈ Rⁿˣᵖ, the lasso problem can be parametrized as

  min_β (1/2)‖y − Xβ‖₂² + λ‖β‖₁

where λ ≥ 0. Subgradient optimality:

  0 ∈ ∂((1/2)‖y − Xβ‖₂² + λ‖β‖₁)
  ⇐⇒ 0 ∈ −Xᵀ(y − Xβ) + λ∂‖β‖₁
  ⇐⇒ Xᵀ(y − Xβ) = λv

for some v ∈ ∂‖β‖₁, i.e., vᵢ = sign(βᵢ) if βᵢ ≠ 0, and vᵢ ∈ [−1, 1] if βᵢ = 0
Write X₁, …, Xₚ for the columns of X. Then our condition reads:

  Xᵢᵀ(y − Xβ) = λ · sign(βᵢ)   if βᵢ ≠ 0
  |Xᵢᵀ(y − Xβ)| ≤ λ            if βᵢ = 0
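These conditions are easy to verify numerically at a candidate solution. A sketch (the function name and tolerance are my own):

```python
import numpy as np

def check_lasso_optimality(X, y, beta, lam, tol=1e-8):
    """Check the lasso subgradient optimality conditions at a candidate beta:
    X_i^T (y - X beta) = lam * sign(beta_i) where beta_i != 0, and
    |X_i^T (y - X beta)| <= lam where beta_i = 0."""
    r = X.T @ (y - X @ beta)          # one entry per column X_i
    active = beta != 0
    ok_active = np.allclose(r[active], lam * np.sign(beta[active]), atol=tol)
    ok_inactive = bool(np.all(np.abs(r[~active]) <= lam + tol))
    return ok_active and ok_inactive
```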
Example: soft-thresholding

Simplified lasso problem with X = I:

  min_β (1/2)‖y − β‖₂² + λ‖β‖₁

Claim: the solution is β = S_λ(y), where S_λ is the soft-thresholding operator:

  [S_λ(y)]ᵢ = yᵢ − λ   if yᵢ > λ
  [S_λ(y)]ᵢ = 0        if −λ ≤ yᵢ ≤ λ
  [S_λ(y)]ᵢ = yᵢ + λ   if yᵢ < −λ
Now plug in β = S_λ(y) and check that the optimality conditions (with X = I) are satisfied:

• When yᵢ > λ, βᵢ = yᵢ − λ > 0, so yᵢ − βᵢ = λ = λ · sign(βᵢ)
• When yᵢ < −λ, the argument is similar
• When |yᵢ| ≤ λ, βᵢ = 0, and |yᵢ − βᵢ| = |yᵢ| ≤ λ
Soft-thresholding in one variable:

[Figure: plot of S_λ(y) against y]
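A sketch of the operator itself, with a numeric spot-check of the case analysis above (the example data is made up):

```python
import numpy as np

def soft_threshold(y, lam):
    """Elementwise soft-thresholding: shrink each y_i toward zero by lam,
    setting it exactly to zero when |y_i| <= lam."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

y = np.array([3.0, -0.2, 0.7])
lam = 0.5
beta = soft_threshold(y, lam)   # [2.5, 0.0, 0.2]

# Verify the optimality conditions from the previous slide (with X = I):
r = y - beta
assert np.allclose(r[beta != 0], lam * np.sign(beta[beta != 0]))
assert np.all(np.abs(r[beta == 0]) <= lam)
```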
Example: distance to a convex set

Recall the distance function to a closed, convex set C:

  dist(x, C) = min_{y ∈ C} ‖y − x‖₂

This is a convex function. Writing P_C(x) for the projection of x onto C, we have dist(x, C) = ‖x − P_C(x)‖₂, and when dist(x, C) > 0,

  ∂dist(x, C) = {(x − P_C(x)) / ‖x − P_C(x)‖₂}
We will only show one direction, i.e., that

  (x − P_C(x)) / ‖x − P_C(x)‖₂ ∈ ∂dist(x, C)

Write u = P_C(x). By first-order optimality for the projection problem, (x − u)ᵀ(y − u) ≤ 0 for all y ∈ C, so C ⊆ H = {y : (x − u)ᵀ(y − u) ≤ 0}.

Claim:

  dist(y, C) ≥ (x − u)ᵀ(y − u) / ‖x − u‖₂  for all y

Check: first, for y ∈ H, the right-hand side is ≤ 0 ≤ dist(y, C)
Now for y ∉ H, we have (x − u)ᵀ(y − u) = ‖x − u‖₂ ‖y − u‖₂ cos θ, where θ is the angle between x − u and y − u. Thus

  (x − u)ᵀ(y − u) / ‖x − u‖₂ = ‖y − u‖₂ cos θ = dist(y, H) ≤ dist(y, C)

as desired
Using the claim, for any y,

  dist(y, C) ≥ (x − u)ᵀ(y − x + x − u) / ‖x − u‖₂
             = ‖x − u‖₂ + ((x − u) / ‖x − u‖₂)ᵀ(y − x)

Since dist(x, C) = ‖x − u‖₂, this says dist(y, C) ≥ dist(x, C) + gᵀ(y − x) with g = (x − u)/‖x − u‖₂, so g ∈ ∂dist(x, C)
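As a numeric companion to this derivation, here is a spot-check of the subgradient inequality for a concrete C where the projection has a closed form, a Euclidean ball (all names and data are my own):

```python
import numpy as np

def proj_ball(x, center, radius):
    """Euclidean projection onto the ball {z : ||z - center||_2 <= radius}."""
    d = x - center
    nrm = np.linalg.norm(d)
    return x if nrm <= radius else center + radius * d / nrm

def subgrad_dist_ball(x, center, radius):
    """The subgradient (x - P_C(x)) / ||x - P_C(x)||_2 of dist(x, C), x not in C."""
    u = proj_ball(x, center, radius)
    return (x - u) / np.linalg.norm(x - u)

# Spot-check dist(y, C) >= dist(x, C) + g^T (y - x) at random points y
rng = np.random.default_rng(0)
c, r = np.zeros(3), 1.0
x = np.array([2.0, 1.0, 0.0])
g = subgrad_dist_ball(x, c, r)
dist = lambda z: np.linalg.norm(z - proj_ball(z, c, r))
for _ in range(1000):
    y = rng.normal(scale=3.0, size=3)
    assert dist(y) >= dist(x) + g @ (y - x) - 1e-12
```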