Subgradients: Ryan Tibshirani Convex Optimization 10-725

Download as pdf or txt
Download as pdf or txt
You are on page 1of 25

Subgradients

Ryan Tibshirani
Convex Optimization 10-725
Last time: gradient descent
Consider the problem
min f (x)
x

for f convex and differentiable, dom(f ) = Rn . Gradient descent:


choose initial x(0) ∈ Rn , repeat

x(k) = x(k−1) − tk · ∇f (x(k−1) ), k = 1, 2, 3, . . .

Step sizes tk chosen to be fixed and small, or by backtracking line


search

If ∇f is Lipschitz, gradient descent has convergence rate O(1/).


Downsides:
• Requires f differentiable
• Can be slow to converge

2
Outline

Today: crucial mathematical underpinnings!


• Subgradients
• Examples
• Properties
• Optimality characterizations

3
Subgradients

Recall that for convex and differentiable f ,

f (y) ≥ f (x) + ∇f (x)T (y − x) for all x, y

That is, linear approximation always underestimates f

A subgradient of a convex function f at x is any g ∈ Rn such that

f (y) ≥ f (x) + g T (y − x) for all y

• Always exists1
• If f differentiable at x, then g = ∇f (x) uniquely
• Same definition works for nonconvex f (however, subgradients
need not exist)

1
On the relative interior of dom(f )
4
Examples of subgradients

Consider f : R → R, f (x) = |x|

2.0
1.5
1.0
f(x)

0.5
0.0
−0.5

−2 −1 0 1 2

• For x 6= 0, unique subgradient g = sign(x)


• For x = 0, subgradient g is any element of [−1, 1]

5
Consider f : Rn → R, f (x) = kxk2

f(x)

x2
x1

• For x 6= 0, unique subgradient g = x/kxk2


• For x = 0, subgradient g is any element of {z : kzk2 ≤ 1}

6
Consider f : Rn → R, f (x) = kxk1

f(x)

x2
x1

• For xi 6= 0, unique ith component gi = sign(xi )


• For xi = 0, ith component gi is any element of [−1, 1]

7
Consider f (x) = max{f1 (x), f2 (x)}, for f1 , f2 : Rn → R convex,
differentiable

15
10
f(x)

5
0

−2 −1 0 1 2

• For f1 (x) > f2 (x), unique subgradient g = ∇f1 (x)


• For f2 (x) > f1 (x), unique subgradient g = ∇f2 (x)
• For f1 (x) = f2 (x), subgradient g is any point on line segment
between ∇f1 (x) and ∇f2 (x)

8
Subdifferential

Set of all subgradients of convex f is called the subdifferential:

∂f (x) = {g ∈ Rn : g is a subgradient of f at x}

• Nonempty (only for convex f )


• ∂f (x) is closed and convex (even for nonconvex f )
• If f is differentiable at x, then ∂f (x) = {∇f (x)}
• If ∂f (x) = {g}, then f is differentiable at x and ∇f (x) = g

9
Connection to convex geometry

Convex set C ⊆ Rn , consider indicator function IC : Rn → R,


(
0 if x ∈ C
IC (x) = I{x ∈ C} =
∞ if x ∈ /C

For x ∈ C, ∂IC (x) = NC (x), the normal cone of C at x is, recall

NC (x) = {g ∈ Rn : g T x ≥ g T y for any y ∈ C}

Why? By definition of subgradient g,

IC (y) ≥ IC (x) + g T (y − x) for all y


• For y ∈
/ C, IC (y) = ∞
• For y ∈ C, this means 0 ≥ g T (y − x)

10

● ●

11
Subgradient calculus

Basic rules for convex functions:


• Scaling: ∂(af ) = a · ∂f provided a > 0
• Addition: ∂(f1 + f2 ) = ∂f1 + ∂f2
• Affine composition: if g(x) = f (Ax + b), then

∂g(x) = AT ∂f (Ax + b)

• Finite pointwise maximum: if f (x) = maxi=1,...,m fi (x), then


 [ 
∂f (x) = conv ∂fi (x)
i:fi (x)=f (x)

convex hull of union of subdifferentials of active functions at x

12
• General composition: if
 
f (x) = h g(x) = h g1 (x), . . . , gk (x)

where g : Rn → Rk , h : Rk → R, f : Rn → R, h is convex
and nondecreasing in each argument, g is convex, then
n
∂f (x) ⊆ p1 q1 + · · · + pk qk :
o
p ∈ ∂h(g(x)), qi ∈ ∂gi (x), i = 1, . . . , k

• General pointwise maximum: if f (x) = maxs∈S fs (x), then


  [ 
∂f (x) ⊇ cl conv ∂fs (x)
s:fs (x)=f (x)

Under some regularity conditions (on S, fs ), we get equality

13
• Norms: important special case. To each norm k · k, there is a
dual norm k · k∗ such that

kxk = max z T x
kzk∗ ≤1

(For example, k · kp and k · kq are dual when 1/p + 1/q = 1.)


In fact, for f (x) = kxk (and fz (x) = z T x), we get equality:
  [ 
∂f (x) = cl conv ∂fz (x)
z:fz (x)=f (x)

Note that ∂fz (x) = z. And if z1 , z2 each achieve the max at


x, which means that z1T x = z2T x = kxk, then by linearity, so
will tz1 + (1 − t)z2 for any t ∈ [0, 1]. Thus

∂f (x) = argmax z T x
kzk∗ ≤1

14
Optimality condition

For any f (convex or not),

f (x? ) = min f (x) ⇐⇒ 0 ∈ ∂f (x? )


x

That is, x? is a minimizer if and only if 0 is a subgradient of f at


x? . This is called the subgradient optimality condition

Why? Easy: g = 0 being a subgradient means that for all y

f (y) ≥ f (x? ) + 0T (y − x? ) = f (x? )

Note the implication for a convex and differentiable function f ,


with ∂f (x) = {∇f (x)}

15
Derivation of first-order optimality

Example of the power of subgradients: we can use what we have


learned so far to derive the first-order optimality condition. Recall

min f (x) subject to x ∈ C


x

is solved at x, for f convex and differentiable, if and only if

∇f (x)T (y − x) ≥ 0 for all y ∈ C

Intuitively: says that gradient increases as we move away from x.


How to prove it? First recast problem as

min f (x) + IC (x)


x

Now apply subgradient optimality: 0 ∈ ∂(f (x) + IC (x))

16
Observe

0 ∈ ∂ f (x) + IC (x)
⇐⇒ 0 ∈ {∇f (x)} + NC (x)
⇐⇒ − ∇f (x) ∈ NC (x)
⇐⇒ − ∇f (x)T x ≥ −∇f (x)T y for all y ∈ C
⇐⇒ ∇f (x)T (y − x) ≥ 0 for all y ∈ C

as desired

Note: the condition 0 ∈ ∂f (x) + NC (x) is a fully general condition


for optimality in convex problems. But it’s not always easy to work
with (KKT conditions, later, are easier)

17
Example: lasso optimality conditions
Given y ∈ Rn , X ∈ Rn×p , lasso problem can be parametrized as
1
min ky − Xβk22 + λkβk1
β 2
where λ ≥ 0. Subgradient optimality:
1 
0 ∈ ∂ ky − Xβk22 + λkβk1
2
⇐⇒ 0 ∈ −X T (y − Xβ) + λ∂kβk1
⇐⇒ X T (y − Xβ) = λv

for some v ∈ ∂kβk1 , i.e.,



{1}
 if βi > 0
vi ∈ {−1} if βi < 0 , i = 1, . . . , p

[−1, 1] if βi = 0

18
Write X1 , . . . , Xp for columns of X. Then our condition reads:
(
XiT (y − Xβ) = λ · sign(βi ) if βi 6= 0
|XiT (y − Xβ)| ≤ λ if βi = 0

Note: subgradient optimality conditions don’t lead to closed-form


expression for a lasso solution ... however they do provide a way to
check lasso optimality

They are also helpful in understanding the lasso estimator; e.g., if


|XiT (y − Xβ)| < λ, then βi = 0 (used by screening rules, later?)

19
Example: soft-thresholding
Simplfied lasso problem with X = I:
1
min ky − βk22 + λkβk1
β 2

This we can solve directly using subgradient optimality. Solution is


β = Sλ (y), where Sλ is the soft-thresholding operator:

yi − λ if yi > λ

[Sλ (y)]i = 0 if − λ ≤ yi ≤ λ , i = 1, . . . , n

yi + λ if yi < −λ

Check: from last slide, subgradient optimality conditions are


(
yi − βi = λ · sign(βi ) if βi 6= 0
|yi − βi | ≤ λ if βi = 0

20
Now plug in β = Sλ (y) and check these are satisfied:
• When yi > λ, βi = yi − λ > 0, so yi − βi = λ = λ · 1
• When yi < −λ, argument is similar
• When |yi | ≤ λ, βi = 0, and |yi − βi | = |yi | ≤ λ

1.0
0.5
Soft-thresholding in 0.0
one variable:
−0.5
−1.0

−1.0 −0.5 0.0 0.5 1.0

21
Example: distance to a convex set

Recall the distance function to a closed, convex set C:

dist(x, C) = min ky − xk2


y∈C

This is a convex function. What are its subgradients?

Write dist(x, C) = kx − PC (x)k2 , where PC (x) is the projection of


x onto C. It turns out that when dist(x, C) > 0,
 
x − PC (x)
∂dist(x, C) =
kx − PC (x)k2

Only has one element, so in fact dist(x, C) is differentiable and


this is its gradient

22
We will only show one direction, i.e., that

x − PC (x)
∈ ∂dist(x, C)
kx − PC (x)k2

Write u = PC (x). Then by first-order optimality conditions for a


projection,
(x − u)T (y − u) ≤ 0 for all y ∈ C
Hence
C ⊆ H = {y : (x − u)T (y − u) ≤ 0}

Claim:
(x − u)T (y − u)
dist(y, C) ≥ for all y
kx − uk2
Check: first, for y ∈ H, the right-hand side is ≤ 0

23
Now for y ∈/ H, we have (x − u)T (y − u) = kx − uk2 ky − uk2 cos θ
where θ is the angle between x − u and y − u. Thus

(x − u)T (y − u)
= ky − uk2 cos θ = dist(y, H) ≤ dist(y, C)
kx − uk2

as desired

Using the claim, we have for any y

(x − u)T (y − x + x − u)
dist(y, C) ≥
kx − uk2
 T
x−u
= kx − uk2 + (y − x)
kx − uk2

Hence g = (x − u)/kx − uk2 is a subgradient of dist(x, C) at x

24
References and further reading

• S. Boyd, Lecture notes for EE 264B, Stanford University,


Spring 2010-2011
• R. T. Rockafellar (1970), “Convex analysis”, Chapters 23–25
• L. Vandenberghe, Lecture notes for EE 236C, UCLA, Spring
2011-2012

25

You might also like