
Tutorial: PART 2

Optimization for Machine Learning

Elad Hazan
Princeton University

+ help from Sanjeev Arora & Yoram Singer


Agenda
1. Learning as mathematical optimization
• Stochastic optimization, ERM, online regret minimization
• Online gradient descent
2. Regularization
• AdaGrad and optimal regularization
3. Gradient Descent++
• Frank-Wolfe, acceleration, variance reduction, second order methods,
non-convex optimization
Accelerating gradient descent?
1. Adaptive regularization (AdaGrad)
   • works for non-smooth & non-convex
2. Variance reduction
   • uses special ERM structure
   • very effective for smooth & convex
3. Acceleration/momentum
   • smooth convex only; general-purpose optimization, since the 80's
Condition number of convex functions

defined as $\gamma = \frac{\beta}{\alpha}$, where (simplified)

$0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I$

𝛼 = strong convexity, 𝛽 = smoothness


Non-convex smooth functions: (simplified)

$-\beta I \preceq \nabla^2 f(x) \preceq \beta I$

Why do we care?

well-conditioned functions exhibit much faster optimization!


(but equivalent via reductions)
Examples
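A small worked instance (added here for concreteness; not necessarily one of the original slide's examples): for ridge-regularized least squares, $f(x) = \frac{1}{2}\|Ax - b\|^2 + \frac{\lambda}{2}\|x\|^2$, the Hessian is constant, $\nabla^2 f(x) = A^\top A + \lambda I$, so $\alpha = \lambda_{\min}(A^\top A) + \lambda$, $\beta = \lambda_{\max}(A^\top A) + \lambda$, and $\gamma = \beta/\alpha$; increasing the regularization 𝜆 directly improves the condition number.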
Smooth gradient descent
The descent lemma, for 𝛽-smooth functions (algorithm: $x_{t+1} = x_t - \eta \nabla_t$, with $\nabla_t = \nabla f(x_t)$):

$f(x_{t+1}) - f(x_t) \le \nabla_t^\top (x_{t+1} - x_t) + \beta \|x_{t+1} - x_t\|^2$
$= -(\eta - \beta\eta^2)\,\|\nabla_t\|^2 = -\tfrac{1}{4\beta}\|\nabla_t\|^2$  (choosing $\eta = \tfrac{1}{2\beta}$)

Thus, for M-bounded functions ($|f(x_t)| \le M$):

$-2M \le f(x_T) - f(x_1) = \sum_t \big( f(x_{t+1}) - f(x_t) \big) \le -\tfrac{1}{4\beta} \sum_t \|\nabla_t\|^2$

Thus, there exists a t for which

$\|\nabla_t\|^2 \le \frac{8M\beta}{T}$
Smooth gradient descent
Conclusions: for $x_{t+1} = x_t - \eta \nabla_t$ and $T = \Omega\!\left(\frac{1}{\epsilon}\right)$, finds

$\|\nabla_t\|^2 \le \epsilon$

1. Holds even for non-convex functions

2. For convex functions this implies $f(x_t) - f(x^*) \le O(\epsilon)$ (faster for smooth!)
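A minimal runnable sketch of the scheme just analyzed (the quadratic objective, its smoothness constant, and the iteration count are illustrative assumptions, not from the slides): fixed-step gradient descent with $\eta = \frac{1}{2\beta}$, tracking $\min_t \|\nabla_t\|^2$, which the argument above bounds by $8M\beta/T$.

```python
import numpy as np

# Illustrative smooth objective: f(x) = 0.5 * x^T A x - b^T x, with A PSD.
rng = np.random.default_rng(0)
R = rng.standard_normal((20, 10))
A = R.T @ R / 20.0                      # PSD Hessian
b = rng.standard_normal(10)
beta = np.linalg.eigvalsh(A)[-1]        # smoothness constant (largest eigenvalue)

def grad(x):
    return A @ x - b

def smooth_gd(x0, T, eta):
    """x_{t+1} = x_t - eta * grad(x_t); returns final iterate and min_t ||grad_t||^2."""
    x, best = x0, np.inf
    for _ in range(T):
        g = grad(x)
        best = min(best, float(g @ g))
        x = x - eta * g
    return x, best

_, min_sq_grad = smooth_gd(np.zeros(10), T=500, eta=1.0 / (2 * beta))
print(f"min_t ||grad_t||^2 = {min_sq_grad:.2e}  (bound from the slide: 8*M*beta/T)")
```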
Non-convex stochastic gradient descent
The descent lemma, for 𝛽-smooth functions (algorithm: $x_{t+1} = x_t - \eta \hat{\nabla}_t$, where $\hat{\nabla}_t$ is an unbiased stochastic gradient, $E[\hat{\nabla}_t] = \nabla_t$):

$E\big[f(x_{t+1}) - f(x_t)\big] \le E\big[ \nabla_t^\top (x_{t+1} - x_t) + \beta \|x_{t+1} - x_t\|^2 \big]$
$= E\big[ -\eta\, \hat{\nabla}_t^\top \nabla_t + \beta\eta^2 \|\hat{\nabla}_t\|^2 \big] = -\eta \|\nabla_t\|^2 + \eta^2 \beta\, E\|\hat{\nabla}_t\|^2$
$= -\eta \|\nabla_t\|^2 + \eta^2 \beta \big( \|\nabla_t\|^2 + \mathrm{var}(\hat{\nabla}_t) \big)$

Thus, for M-bounded functions ($|f(x_t)| \le M$):

$T = O\!\left(\frac{M\beta}{\varepsilon^2}\right) \;\Rightarrow\; \exists\, t \le T :\ \|\nabla_t\|^2 \le \varepsilon$
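A matching sketch for the stochastic case (the finite-sum objective, step size, and horizon are illustrative assumptions; the descent-lemma argument above only needs smoothness): single-sample SGD, tracking the smallest full-gradient norm along the way.

```python
import numpy as np

# Illustrative finite-sum objective: f(x) = (1/m) * sum_i 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(1)
m, d = 200, 10
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

def full_grad(x):                       # nabla f(x)
    return A.T @ (A @ x - b) / m

def stoch_grad(x):                      # unbiased single-example estimate of nabla f(x)
    i = rng.integers(m)
    return A[i] * (A[i] @ x - b[i])

def sgd(x0, T, eta):
    x, best = x0, np.inf
    for _ in range(T):
        best = min(best, float(np.sum(full_grad(x) ** 2)))   # track ||nabla f(x_t)||^2
        x = x - eta * stoch_grad(x)
    return x, best

_, best = sgd(np.zeros(d), T=5000, eta=0.01)
print(f"min_t ||grad_t||^2 = {best:.3e}")
```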
Controlling the variance:
Interpolating GD and SGD
Model: both full and stochastic gradients are available. The estimator combines both into a lower-variance random variable:

$x_{t+1} = x_t - \eta \left( \hat{\nabla} f(x_t) - \hat{\nabla} f(x_0) + \nabla f(x_0) \right)$

Every so often, compute a full gradient and restart at a new $x_0$.

Theorem [Schmidt, Le Roux, Bach ‘12; Johnson and Zhang ‘13; Mahdavi, Zhang, Jin ’13]:
variance reduction for well-conditioned functions, $0 \prec \alpha I \preceq \nabla^2 f(x) \preceq \beta I$, $\gamma = \frac{\beta}{\alpha}$,
produces an 𝜖-approximate solution in time

$O\!\left( (m + \gamma)\, d \log \frac{1}{\epsilon} \right)$

(𝛾 should be interpreted as $\frac{1}{\epsilon}$)
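A minimal sketch of the interpolated estimator above in the style of SVRG (epoch length, step size, and the least-squares components are illustrative assumptions): each epoch fixes a snapshot $x_0$, pays for one full gradient, and then takes cheap steps with the reduced-variance estimator.

```python
import numpy as np

# Illustrative ERM problem: f(x) = (1/m) * sum_i 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(2)
m, d = 500, 20
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

def full_grad(x):
    return A.T @ (A @ x - b) / m

def component_grad(x, i):
    return A[i] * (A[i] @ x - b[i])

def svrg(x, epochs=30, inner=500, eta=0.01):
    for _ in range(epochs):
        x0, g0 = x.copy(), full_grad(x)      # snapshot and its full gradient (once per epoch)
        for _ in range(inner):
            i = rng.integers(m)
            # reduced-variance estimator from the slide: grad_i(x) - grad_i(x0) + full grad(x0)
            v = component_grad(x, i) - component_grad(x0, i) + g0
            x = x - eta * v
    return x

x = svrg(np.zeros(d))
print("||grad|| after SVRG-style updates:", np.linalg.norm(full_grad(x)))
```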
Acceleration/momentum [Nesterov ‘83]
• Optimal gradient complexity (smooth, convex)

• Modest practical improvements; non-convex “momentum” methods.

• With variance reduction, the fastest possible running time for first-order methods:

$O\!\left( (m + \sqrt{\gamma m})\, d \log \frac{1}{\epsilon} \right)$

[Woodworth, Srebro ’15] – tight lower bound w. gradient oracle
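For reference, a compact sketch of Nesterov-style acceleration for a smooth convex objective (the objective and the textbook momentum schedule $(t-1)/(t+2)$ are assumptions for illustration; the slide itself only states complexity results).

```python
import numpy as np

# Illustrative smooth convex objective: f(x) = 0.5 * ||Ax - b||^2
rng = np.random.default_rng(3)
A = rng.standard_normal((100, 30))
b = rng.standard_normal(100)
beta = np.linalg.eigvalsh(A.T @ A)[-1]          # smoothness constant of f

def grad(x):
    return A.T @ (A @ x - b)

def nesterov_agd(x0, T):
    """Accelerated gradient descent with the standard momentum schedule."""
    x_prev, y = x0.copy(), x0.copy()
    for t in range(1, T + 1):
        x = y - grad(y) / beta                      # gradient step at the look-ahead point y_t
        y = x + (t - 1) / (t + 2) * (x - x_prev)    # momentum extrapolation
        x_prev = x
    return x

x = nesterov_agd(np.zeros(30), T=300)
print("f(x) =", 0.5 * np.linalg.norm(A @ x - b) ** 2)
```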


Experiments w. convex losses

Improve upon GradientDescent++ ?


Next few slides:

Move from first order (GD++) to second order


Higher Order Optimization

• Gradient Descent – Direction of Steepest Descent


• Second Order Methods – Use Local Curvature
Newton’s method (+ Trust region)

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$

For non-convex functions: the iterates can move to ∞.

Solution: solve a quadratic approximation in a local area (trust region).
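A bare-bones sketch of the update above (illustrative only; the `damping` term is a crude stand-in for a genuine trust-region subproblem). It makes the $d^3$ cost of the linear solve, discussed on the next slide, explicit.

```python
import numpy as np

def newton_step(grad, hess, x, eta=1.0, damping=1e-6):
    """Damped Newton update: x - eta * [hess(x) + damping*I]^{-1} grad(x).

    The dense linear solve costs O(d^3) per iteration -- the bottleneck
    discussed on the next slide. The damping keeps the system well-posed
    when the Hessian is indefinite or ill-conditioned (a poor man's trust region).
    """
    H = hess(x) + damping * np.eye(x.shape[0])
    return x - eta * np.linalg.solve(H, grad(x))    # O(d^3)

# Toy usage on a strictly convex quadratic (one undamped Newton step is exact):
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(x, "vs exact minimizer", np.linalg.solve(A, b))
```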
Newton’s method (+ Trust region)

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x_t)]^{-1} \nabla f(x_t)$

$d^3$ time per iteration!
Infeasible for ML!!

Till recently…
Speed up the Newton direction computation??

• Spielman-Teng ‘04: diagonally dominant systems of equations in linear time!
  • 2015 Gödel prize
  • Used by Daitch-Spielman for faster flow algorithms

• Erdogdu-Montanari ‘15: low-rank approximation & inversion by Sherman-Morrison
  • Allows stochastic information
  • Still prohibitive: rank × d²
Stochastic Newton?

$x_{t+1} = x_t - \eta\, [\nabla^2 f(x)]^{-1} \nabla f(x)$

• ERM, rank-1 loss: $\arg\min_x\; E_{i \sim U[m]} \left[ \ell(x^\top a_i, b_i) + \tfrac{1}{2} \|x\|^2 \right]$

• Unbiased estimator of the Hessian:

$\hat{\nabla}^2 = a_i a_i^\top \cdot \ell''(x^\top a_i, b_i) + I, \qquad i \sim U[1, \dots, m]$

• Clearly $E[\hat{\nabla}^2] = \nabla^2 f$, but $E\big[(\hat{\nabla}^2)^{-1}\big] \neq (\nabla^2 f)^{-1}$
Circumvent Hessian creation and inversion!

• 3 steps:

• (1) Represent the Hessian inverse as an infinite series (for any distribution over the naturals, $i \sim N$):

$\nabla^{-2} = \sum_{i = 0}^{\infty} (I - \nabla^2)^i$

• (2) Sample from the infinite series (Hessian-gradient product), ONCE:

$[\nabla^2 f]^{-1} \nabla f = \sum_{i} (I - \nabla^2 f)^i\, \nabla f = E_{i \sim N}\left[ (I - \nabla^2 f)^i\, \nabla f \cdot \frac{1}{\Pr[i]} \right]$

• (3) Estimate the Hessian-power by sampling i.i.d. data examples (single examples; vector-vector products only):

$= E_{i \sim N,\ k \sim [i]}\left[ \prod_{k = 1}^{i} (I - \nabla^2 f_k)\, \nabla f \cdot \frac{1}{\Pr[i]} \right]$
Linear-time Second-order Stochastic Algorithm (LiSSA)

• Use the estimator $\widetilde{\nabla}^{-2} f$ defined previously

• Compute a full (large-batch) gradient $\nabla f$

• Move in the direction $\widetilde{\nabla}^{-2} f\, \nabla f$

• Theoretical running time to produce an 𝜖-approximate solution for 𝛾 well-conditioned functions (convex) [Agarwal, Bullins, Hazan ‘15]:

$O\!\left( dm \log \frac{1}{\epsilon} + \sqrt{\gamma d}\; d \log \frac{1}{\epsilon} \right)$

1. Faster than first-order methods!

2. Indications this is tight [Arjevani, Shamir ‘16]
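A sketch of the LiSSA idea under simplifying assumptions of my own (ridge-regularized least squares, so single-example Hessians are $a_i a_i^\top + \lambda I$; the scaling, recursion depth, and repetition counts are illustrative, not the tuned constants of [Agarwal, Bullins, Hazan ‘15]). It approximates $[\nabla^2 f]^{-1} \nabla f$ by the truncated Neumann recursion from the previous slide, using only single-example Hessian-vector products.

```python
import numpy as np

# Illustrative ERM: f(x) = (1/m) * sum_i 0.5 * (a_i^T x - b_i)^2 + (lam/2) * ||x||^2
rng = np.random.default_rng(4)
m, d, lam = 1000, 20, 1.0
A = rng.standard_normal((m, d))
b = rng.standard_normal(m)

# Rescale f so that every single-example Hessian satisfies 0 < H_i <= I
# (needed for the Neumann series I + (I-H) + (I-H)^2 + ... to converge).
scale = 1.0 / (np.max(np.sum(A ** 2, axis=1)) + lam)

def full_grad(x):
    return scale * (A.T @ (A @ x - b) / m + lam * x)

def sample_hess_vec(x, u):
    """Scaled single-example Hessian-vector product: scale * (a_i a_i^T + lam*I) u."""
    i = rng.integers(m)
    return scale * (A[i] * (A[i] @ u) + lam * u)

def lissa_direction(x, g, depth=200, repeats=5):
    """Estimate [hess]^{-1} g via u_{j+1} = g + (I - H_j) u_j with fresh samples H_j,
    averaging a few independent estimates to tame the variance."""
    estimates = []
    for _ in range(repeats):
        u = g.copy()
        for _ in range(depth):
            u = g + u - sample_hess_vec(x, u)
        estimates.append(u)
    return np.mean(estimates, axis=0)

def lissa(x, iters=30):
    for _ in range(iters):
        g = full_grad(x)                     # full (large-batch) gradient
        x = x - lissa_direction(x, g)        # Newton-like step; no d x d matrix is ever formed
    return x

x = lissa(np.zeros(d))
print("||grad|| after LiSSA-style iterations:", np.linalg.norm(full_grad(x)))
```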
What about constraints??

Next few slides – projection free


(Frank-Wolfe) methods
Recommendation systems

[Figure: users × movies matrix with a sparse set of observed entries; entry (i, j) is the rating of user i for movie j]

Recommendation systems

[Figure: the same matrix with the missing entries completed]

Recommendation systems

[Figure: the same matrix; new data arrives (“get new data”)]

Recommendation systems

[Figure: the missing entries are re-completed using the new data]

Assume the “true matrix” is low rank; convex relaxation: bounded trace norm.
Bounded trace norm matrices

• Trace norm of a matrix = sum of its singular values

• K = { X | X is a matrix with trace norm at most D }

• Computational bottleneck: projections onto K require an eigendecomposition: O(n³) operations

• But: linear optimization over K is easier — computing the top eigenvector; O(sparsity) time
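A small sketch of the contrast drawn above (illustrative; a plain power iteration stands in for a tuned Lanczos/sparse solver): linear optimization over K = { ‖X‖_tr ≤ D } only needs the top singular pair of the gradient, and the minimizer is rank one.

```python
import numpy as np

def top_singular_pair(G, iters=100, seed=0):
    """Power iteration on G^T G: approximates the top singular vectors (u, v) of G."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        v = G.T @ (G @ v)
        v /= np.linalg.norm(v)
    u = G @ v
    return u / np.linalg.norm(u), v

def linear_opt_trace_ball(G, D):
    """argmin_{||X||_tr <= D} <G, X>  =  -D * u1 v1^T  (a rank-one matrix)."""
    u, v = top_singular_pair(G)
    return -D * np.outer(u, v)

G = np.random.default_rng(1).standard_normal((300, 200))
V = linear_opt_trace_ball(G, D=10.0)
# The optimal value equals -D * sigma_max(G); a full SVD (the projection-style cost) confirms it.
print(np.sum(G * V), "vs", -10.0 * np.linalg.svd(G, compute_uv=False)[0])
```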
Projections → linear optimization

1. Matrix completion (K = bounded trace norm matrices)
   projection: eigendecomposition → linear opt: largest singular vector computation

2. Online routing (K = flow polytope)
   projection: conic optimization over the flow polytope → linear opt: shortest path computation

3. Rotations (K = rotation matrices)
   projection: conic optimization over the rotations set → linear opt: Wahba’s algorithm – linear time

4. Matroids (K = matroid polytope)
   projection: convex opt. via the ellipsoid method → linear opt: greedy algorithm – nearly linear time!
Conditional Gradient algorithm [Frank, Wolfe ’56]

Convex opt problem:  $\min_{x \in K} f(x)$

• f is smooth, convex
• linear optimization over K is easy

$v_t = \arg\min_{x \in K} \nabla f(x_t)^\top x$
$x_{t+1} = x_t + \eta_t (v_t - x_t)$

1. At iteration t: $x_t$ is a convex combination of at most t vertices (sparsity)

2. No learning rate: $\eta_t \approx \frac{1}{t}$ (independent of diameter, gradients, etc.)
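A minimal sketch of the two-line update above (illustrative assumptions: the domain is the probability simplex, so the linear-optimization step is just picking the coordinate with the smallest gradient, and the step size uses the common $\eta_t = 2/(t+2)$ variant of the $\approx 1/t$ schedule).

```python
import numpy as np

# Illustrative smooth convex objective over the simplex: f(x) = 0.5 * ||x - y||^2
rng = np.random.default_rng(5)
n = 50
y = rng.standard_normal(n)

def grad(x):
    return x - y

def frank_wolfe(T=500):
    x = np.ones(n) / n                       # any feasible starting point
    for t in range(T):
        g = grad(x)
        v = np.zeros(n)
        v[np.argmin(g)] = 1.0                # v_t = argmin_{x in simplex} <grad, x>: a vertex
        eta = 2.0 / (t + 2)                  # oblivious step size, no tuning
        x = x + eta * (v - x)                # stays inside K: a convex combination of vertices
    return x

x = frank_wolfe()
print("f(x_T) =", 0.5 * np.linalg.norm(x - y) ** 2)
```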
FW theorem

$v_t = \arg\min_{x \in K} \nabla_t^\top x, \qquad x_{t+1} = x_t + \eta_t (v_t - x_t)$

Theorem: $f(x_t) - f(x^*) = O\!\left(\frac{1}{t}\right)$

Proof, main observation:

$f(x_{t+1}) - f(x^*) = f\big(x_t + \eta_t (v_t - x_t)\big) - f(x^*)$
$\le f(x_t) - f(x^*) + \eta_t (v_t - x_t)^\top \nabla_t + \frac{\beta}{2} \eta_t^2 \|v_t - x_t\|^2$   (𝛽-smoothness of f)
$\le f(x_t) - f(x^*) + \eta_t (x^* - x_t)^\top \nabla_t + \frac{\beta}{2} \eta_t^2 \|v_t - x_t\|^2$   (optimality of $v_t$)
$\le f(x_t) - f(x^*) + \eta_t \big(f(x^*) - f(x_t)\big) + \frac{\beta}{2} \eta_t^2 \|v_t - x_t\|^2$   (convexity of f)
$\le (1 - \eta_t)\big(f(x_t) - f(x^*)\big) + \frac{\beta \eta_t^2}{2} D^2.$

Thus, with $h_t = f(x_t) - f(x^*)$:

$h_{t+1} \le (1 - \eta_t) h_t + O(\eta_t^2), \qquad \eta_t = O\!\left(\frac{1}{t}\right) \;\Rightarrow\; h_t = O\!\left(\frac{1}{t}\right)$
Online Conditional Gradient

• Set $x_1 \in K$ arbitrarily
• For t = 1, 2, …:
  1. Use $x_t$, obtain $f_t$
  2. Compute $x_{t+1}$ as follows:

$v_t = \arg\min_{x \in K} \left( \sum_{i=1}^{t} \nabla f_i(x_i) + \sigma_t x_t \right)^{\!\top} x$
$x_{t+1} \leftarrow (1 - \alpha_t)\, x_t + \alpha_t v_t$

Theorem [Hazan, Kale ’12]: $\mathrm{Regret} = O(T^{3/4})$

Theorem [Garber, Hazan ‘13]: For polytopes, strongly-convex and smooth losses,
1. Offline: convergence after t steps: $e^{-\Omega(t)}$
2. Online: $\mathrm{Regret} = O(\sqrt{T})$
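A rough sketch of the online variant above (many assumptions: the decision set is the simplex, the losses are linear so $\nabla f_i(x_i)$ is just the cost vector, and the schedules $\sigma_t$, $\alpha_t$ are placeholders rather than the analyzed choices of [Hazan, Kale ’12]).

```python
import numpy as np

rng = np.random.default_rng(6)
n, T = 20, 1000
x = np.ones(n) / n                            # x_1: uniform point of the simplex
grad_sum = np.zeros(n)                        # running sum of observed gradients
total_loss, costs = 0.0, []

for t in range(1, T + 1):
    c = rng.uniform(0.0, 1.0, n)              # linear loss f_t(x) = <c, x> (random stand-in adversary)
    total_loss += c @ x
    costs.append(c)
    grad_sum += c                             # grad f_t(x_t) = c for linear losses

    sigma_t = np.sqrt(t)                      # placeholder schedule
    v = np.zeros(n)
    v[np.argmin(grad_sum + sigma_t * x)] = 1.0   # linear optimization over the simplex
    alpha_t = t ** (-0.75)                    # placeholder schedule (O(T^{3/4})-regret flavor)
    x = (1 - alpha_t) * x + alpha_t * v

best_fixed = np.min(np.sum(costs, axis=0))    # best fixed vertex in hindsight
print("regret ~", total_loss - best_fixed)
```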
Next few slides:
survey state-of-the-art
Non-convex optimization in ML

[Figure: learning pipeline — input {a} ∈ $R^d$ → machine → distribution over labels]

$\arg\min_{x \in \Omega}\; \frac{1}{m} \sum_{i=1}^{m} \ell_i(x, a_i, b_i) + R(x)$
Solution concepts for non-convex optimization?

[Embedded figure: slide from Duchi (UC Berkeley), “Convex Optimization for Machine Learning” — “But generally speaking... we’re screwed”: local (non-global) minima of f₀; all kinds of constraints (even restricting to continuous functions), e.g. h(x) = sin(2πx) = 0; surface plot of a non-convex function]

• Global minimization is NP-hard, even for degree-4 polynomials. Even local minimization up to 4th-order optimality conditions is NP-hard.

• Algorithmic stability is sufficient [Hardt, Recht, Singer ‘16]

• Optimization approaches:
  • Finding vanishing gradients / local minima efficiently
  • Graduated optimization / homotopy method
  • Quasi-convexity
  • Structure of local optima (probabilistic assumptions that allow alternating minimization, …)
Gradient/Hessian based methods

Goal: find a point x such that
1. $\|\nabla f(x)\| \le \varepsilon$ (approximate first-order optimality)
2. $\nabla^2 f(x) \succeq -\varepsilon I$ (approximate second-order optimality)

1. (we’ve proved) GD algorithm: $x_{t+1} = x_t - \eta \nabla_t$ finds a point satisfying (1) in $O\!\left(\frac{1}{\varepsilon}\right)$ (expensive) iterations

2. (we’ve proved) SGD algorithm: $x_{t+1} = x_t - \eta \hat{\nabla}_t$ finds a point satisfying (1) in $O\!\left(\frac{1}{\varepsilon^2}\right)$ (cheap) iterations

3. SGD algorithm with noise finds a point satisfying (1 & 2) in $O\!\left(\mathrm{poly}\frac{1}{\varepsilon}\right)$ (cheap) iterations
   [Ge, Huang, Jin, Yuan ‘15]

4. Recent second-order methods: find a point satisfying (1 & 2) in $O\!\left(\frac{1}{\varepsilon^{7/4}}\right)$ (expensive) iterations
   [Carmon, Duchi, Hinder, Sidford ‘16]
   [Agarwal, Allen-Zhu, Bullins, Hazan, Ma ‘16]
Recap

[Figure: image-classification example (chair / car); background: the Duchi (UC Berkeley) “Convex Optimization for Machine Learning” slide shown earlier]

1. Online learning and stochastic optimization
   • Regret minimization
   • Online gradient descent
2. Regularization
   • AdaGrad and optimal regularization
3. Advanced optimization
   • Frank-Wolfe, acceleration, variance reduction, second order methods, non-convex optimization
Bibliography & more information, see:
http://www.cs.princeton.edu/~ehazan/tutorial/SimonsTutorial.htm

Thank you!
