
UW-Madison CS/ISyE/Math/Stat 726 Spring 2023

Lecture 23: Limited-Memory BFGS (L-BFGS)

Yudong Chen

1 Basic ideas
Newton and quasi-Newton methods enjoy fast convergence (small number of iterations), but for
large-scale problems each iteration may be too costly.
For example, recall the quasi-Newton method x_{k+1} = x_k − α_k H_k ∇f(x_k) with the BFGS update

\[
H_k = V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top , \tag{1}
\]

where

\[
\rho_k = \frac{1}{s_k^\top y_k}, \qquad
V_k = I - \rho_k y_k s_k^\top , \qquad
s_k = x_{k+1} - x_k , \qquad
y_k = \nabla f(x_{k+1}) - \nabla f(x_k),
\]

and the stepsize α_k satisfies the Wolfe conditions (WWC). The matrices B_k and H_k constructed by BFGS are often dense, even when the true Hessian is sparse. In general, BFGS requires Θ(d²) computation per iteration and Θ(d²) memory. For large d, Θ(d²) may be too much.
Idea of L-BFGS: instead of storing the full matrix H_k (an approximation of ∇²f(x_k)^{−1}), construct and represent H_k implicitly using a small number of vectors {s_i, y_i} from the last few iterations.
Intuition: we do not expect the current Hessian approximation to depend too much on "old" vectors s_i, y_i (old iterates x_i and their gradients).
Tradeoff: we reduce memory and computation to O(d) (treating m as a constant), but we may lose local superlinear convergence; in general we can only guarantee linear convergence.
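To make the savings concrete (an illustrative calculation, not from the original notes): with d = 10^6 and m = 10, storing a dense H_k requires d² = 10^{12} doubles, roughly 8 TB, whereas L-BFGS stores only the 2m vectors {s_i, y_i}, i.e., 2md = 2 × 10^7 doubles, roughly 160 MB.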

2 L-BFGS
Recall and expand the BFGS update:

\[
\begin{aligned}
\text{BFGS:}\quad H_k &= V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= V_{k-1}^\top V_{k-2}^\top H_{k-2} V_{k-2} V_{k-1}
   + \rho_{k-2} V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1}
   + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= \left( V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top \right) H_{k-m} \left( V_{k-m} V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m} \left( V_{k-1}^\top \cdots V_{k-m+1}^\top \right) s_{k-m} s_{k-m}^\top \left( V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m+1} \left( V_{k-1}^\top \cdots V_{k-m+2}^\top \right) s_{k-m+1} s_{k-m+1}^\top \left( V_{k-m+2} \cdots V_{k-1} \right) \\
&\quad + \cdots \\
&\quad + \rho_{k-2} V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} \\
&\quad + \rho_{k-1} s_{k-1} s_{k-1}^\top .
\end{aligned}
\]


In L-BFGS, we replace H_{k−m} (a dense d × d matrix) with some sparse matrix H_k^0, e.g., a diagonal matrix. Thus, H_k can be constructed using only the most recent m ≪ d pairs {s_i, y_i}_{i=k−m}^{k−1}. That is,

\[
\begin{aligned}
\text{L-BFGS:}\quad H_k &= \left( V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top \right) H_k^0 \left( V_{k-m} V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m} \left( V_{k-1}^\top \cdots V_{k-m+1}^\top \right) s_{k-m} s_{k-m}^\top \left( V_{k-m+1} \cdots V_{k-1} \right) \\
&\quad + \rho_{k-m+1} \left( V_{k-1}^\top \cdots V_{k-m+2}^\top \right) s_{k-m+1} s_{k-m+1}^\top \left( V_{k-m+2} \cdots V_{k-1} \right) \\
&\quad + \cdots \\
&\quad + \rho_{k-1} s_{k-1} s_{k-1}^\top .
\end{aligned}
\]

In fact, we only need the d-dimensional vector H_k ∇f(x_k) to update x_{k+1} = x_k − α_k H_k ∇f(x_k). Therefore, we do not even need to compute or store the matrix H_k explicitly. Instead, we only store the vectors {s_i, y_i}_{i=k−m}^{k−1}, from which H_k ∇f(x_k) can be computed using only vector-vector multiplications, thanks to tricks like (aa^⊤ + bb^⊤)g = a(a^⊤ g) + b(b^⊤ g).
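As a quick illustration of this trick (a minimal numpy sketch, not part of the original notes; the function name apply_rank_two is made up): forming the matrix aa^⊤ + bb^⊤ explicitly costs O(d²) time and memory, while the factored form needs only inner products.

```python
import numpy as np

def apply_rank_two(a, b, g):
    """Compute (a a^T + b b^T) g without forming the d x d matrix.

    Only inner products and scalar-vector multiplies are used, so the
    cost is O(d) instead of O(d^2).
    """
    return a * (a @ g) + b * (b @ g)

# Quick check against the explicit matrix on a small random instance.
rng = np.random.default_rng(0)
d = 5
a, b, g = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
explicit = (np.outer(a, a) + np.outer(b, b)) @ g
assert np.allclose(apply_rank_two(a, b, g), explicit)
```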

This leads to a two-loop recursion implementation for computing H_k ∇f(x_k), stated in Algorithm 1.

Algorithm 1 L-BFGS two-loop recursion

set q = ∇f(x_k)                          // we want to compute H_k ∇f(x_k)
for i = k − 1, k − 2, . . . , k − m do:
    α_i ← ρ_i s_i^⊤ q
    q ← q − α_i y_i                      // RHS = q − ρ_i (s_i^⊤ q) y_i = (I − ρ_i y_i s_i^⊤) q = V_i q
r ← H_k^0 q
for i = k − m, k − m + 1, . . . , k − 1 do:
    β ← ρ_i y_i^⊤ r
    r ← r + s_i (α_i − β)                // RHS = (I − ρ_i s_i y_i^⊤) r + α_i s_i = V_i^⊤ r + α_i s_i
return r                                 // which equals H_k ∇f(x_k)

(Exercise) The total number of multiplications is at most 4md + nnz(H_k^0) = O(md).
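The recursion translates almost line for line into code. Below is a minimal numpy sketch of Algorithm 1 (the function name two_loop and its signature are my own choices, not from the notes); it takes the current gradient, the stored pairs (s_i, y_i) ordered from oldest to newest, and a routine that applies H_k^0.

```python
import numpy as np

def two_loop(grad, pairs, apply_H0):
    """L-BFGS two-loop recursion (Algorithm 1).

    grad     : gradient of f at x_k, shape (d,)
    pairs    : list of (s_i, y_i) tuples, ordered from oldest (i = k-m)
               to newest (i = k-1)
    apply_H0 : callable mapping a vector q to H_k^0 q
    Returns H_k @ grad using only O(m d) work.
    """
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in pairs]
    alphas = []
    # First loop: newest pair to oldest, applying V_i to q.
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):
        alpha = rho * (s @ q)
        alphas.append(alpha)
        q -= alpha * y
    alphas.reverse()  # reorder so alphas[j] matches pairs[j]
    r = apply_H0(q)
    # Second loop: oldest pair to newest, applying V_i^T to r.
    for (s, y), rho, alpha in zip(pairs, rhos, alphas):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return r
```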
In practice:

• We often take m to be a small constant independent of d, e.g., 3 ≤ m ≤ 20.


• A popular choice for H_k^0 is H_k^0 = γ_k I, where

\[
\gamma_k = \frac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}.
\]

This choice appears to be quite effective in practice. (Optional) 1/γ_k is an approximation of

\[
\frac{z_k^\top \nabla^2 f(x_k)\, z_k}{\| z_k \|^2},
\]

which is the size of the true Hessian along the direction z_k ≈ (∇²f(x_k))^{1/2} s_k; see Section 6.1 in Nocedal-Wright.

The complete L-BFGS algorithm is given in Algorithm 2. As discussed in Lecture 21, it is important that α_k satisfies both the sufficient decrease condition and the curvature condition of the Wolfe line search.


Algorithm 2 L-BFGS
input: x_0 ∈ R^d (initial point), m > 0 (memory budget), ϵ > 0 (convergence criterion)
k ← 0
repeat:

• Choose H_k^0

• p_k ← −H_k ∇f(x_k), where H_k ∇f(x_k) is computed using Algorithm 1

• x_{k+1} ← x_k + α_k p_k, where α_k satisfies the Wolfe conditions

• if k > m:

    – discard {s_{k−m}, y_{k−m}} from storage

• Compute and store s_k ← x_{k+1} − x_k and y_k ← ∇f(x_{k+1}) − ∇f(x_k)

• k ← k + 1

until ∥∇f(x_k)∥ ≤ ϵ
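For concreteness, here is a compact Python sketch of Algorithm 2 built on the two_loop helper above (an illustrative implementation with my own naming, not code from the course; it uses scipy.optimize.line_search, which enforces the Wolfe conditions).

```python
from collections import deque
import numpy as np
from scipy.optimize import line_search

def lbfgs(f, grad_f, x0, m=10, eps=1e-6, max_iter=500):
    """Minimal L-BFGS loop following Algorithm 2 (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    pairs = deque(maxlen=m)           # most recent m pairs (s_i, y_i)
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        # Choose H_k^0 = gamma_k * I from the newest pair (identity on the first step).
        if pairs:
            s_last, y_last = pairs[-1]
            gamma = (s_last @ y_last) / (y_last @ y_last)
        else:
            gamma = 1.0
        p = -two_loop(g, list(pairs), lambda q: gamma * q)
        # Wolfe line search (sufficient decrease + curvature conditions).
        alpha = line_search(f, grad_f, x, p, gfk=g)[0]
        if alpha is None:             # line search failed; take a small safe step
            alpha = 1e-3
        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:             # keep the pair only if curvature is positive
            pairs.append((s, y))      # deque(maxlen=m) discards the oldest pair
        x, g = x_new, g_new
    return x
```

The positive-curvature guard before storing (s_k, y_k) is a common safeguard in practice; when the Wolfe conditions hold it is automatically satisfied.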

[Figure: some numerical results taken from Nocedal-Wright; not reproduced here.]

3 Relationship with nonlinear conjugate gradient methods


In Lecture 13 we mentioned several ways of generalizing CG to non-quadratic functions (a.k.a. nonlinear CG), including Dai-Yuan, Fletcher-Reeves and Polak-Ribière. The last one has a variant


called Hestenes-Stiefel, which uses the search direction

\[
p_{k+1} = -\nabla f(x_{k+1}) + \frac{\nabla f(x_{k+1})^\top y_k}{y_k^\top p_k}\, p_k
= -\underbrace{\left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right)}_{=:\hat H_{k+1}} \nabla f(x_{k+1}), \tag{2}
\]

where we recall that y_k = ∇f(x_{k+1}) − ∇f(x_k) and s_k = x_{k+1} − x_k.


The matrix Ĥ_{k+1} is neither symmetric nor p.d. If we try to symmetrize Ĥ_{k+1} by taking Ĥ_{k+1} Ĥ_{k+1}^⊤, we end up with a matrix that does not satisfy the secant equation and is singular (indeed, Ĥ_{k+1} Ĥ_{k+1}^⊤ y_k = 0 ≠ s_k, and Ĥ_{k+1} s_k = 0).
A symmetric p.d. matrix that satisfies the secant equation is

\[
\begin{aligned}
H_{k+1} &= \hat H_{k+1} \hat H_{k+1}^\top + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right) I \left( I - \frac{y_k s_k^\top}{y_k^\top s_k} \right) + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \text{BFGS update (1) applied to } H_k = I .
\end{aligned}
\]

Therefore, computing H_{k+1} as above for the search direction p_{k+1} = −H_{k+1} ∇f(x_{k+1}) can be viewed as "memoryless" BFGS, i.e., L-BFGS with m = 1 and H_k^0 = I.
Suppose we combine memoryless BFGS and exact line search:

\[
\alpha_k = \operatorname*{argmin}_{\alpha \in \mathbb{R}} f(x_k + \alpha p_k).
\]

For all k, the stepsize α_k satisfies

\[
0 = \langle \nabla f(x_k + \alpha_k p_k), p_k \rangle
  = \left\langle \nabla f(x_{k+1}), \alpha_k^{-1} s_k \right\rangle,
\]

hence s_k^⊤ ∇f(x_{k+1}) = 0. It follows that

\[
\begin{aligned}
p_{k+1} &= -H_{k+1} \nabla f(x_{k+1}) \\
&= -\left[ \left( I - \frac{s_k y_k^\top}{y_k^\top s_k} \right) \left( I - \frac{y_k s_k^\top}{y_k^\top s_k} \right) + \frac{s_k s_k^\top}{y_k^\top s_k} \right] \nabla f(x_{k+1}) \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top s_k}\, s_k
 && \text{using } s_k^\top \nabla f(x_{k+1}) = 0 \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top p_k}\, p_k
 && \text{since } s_k = \alpha_k p_k ,
\end{aligned}
\]

which is the same as the Hestenes-Stiefel CG update (2).
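To sanity-check this equivalence numerically (my own experiment, not from the notes): on a strongly convex quadratic, exact line search has the closed form α_k = −∇f(x_k)^⊤ p_k / (p_k^⊤ A p_k), and the memoryless-BFGS direction should coincide with the Hestenes-Stiefel direction at every step.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)           # random symmetric positive definite matrix
b = rng.standard_normal(d)
f_grad = lambda x: A @ x - b           # gradient of f(x) = 0.5 x^T A x - b^T x

x = rng.standard_normal(d)
p = -f_grad(x)                         # first direction: steepest descent
for _ in range(5):
    g = f_grad(x)
    alpha = -(g @ p) / (p @ A @ p)     # exact line search for the quadratic
    x_new = x + alpha * p
    g_new = f_grad(x_new)
    s, y = x_new - x, g_new - g
    # Memoryless BFGS: H_{k+1} = V_k^T V_k + rho_k s_k s_k^T, i.e. (1) applied to I.
    rho = 1.0 / (y @ s)
    V = np.eye(d) - rho * np.outer(y, s)
    H = V.T @ V + rho * np.outer(s, s)
    p_bfgs = -H @ g_new
    # Hestenes-Stiefel direction (2).
    p_hs = -g_new + (g_new @ y) / (y @ p) * p
    assert np.allclose(p_bfgs, p_hs)   # the two directions agree
    x, p = x_new, p_bfgs
```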
