UW-Madison CS/ISyE/Math/Stat 726 Spring 2023
Lecture 23: Limited-Memory BFGS (L-BFGS)
Yudong Chen
1 Basic ideas
Newton and quasi-Newton methods enjoy fast convergence (small number of iterations), but for
large-scale problems each iteration may be too costly.
For example, recall the quasi-Newton method $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$ with BFGS update:
$$H_k = V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top, \qquad (1)$$
where
$$\rho_k = \frac{1}{s_k^\top y_k}, \qquad V_k = I - \rho_k y_k s_k^\top,$$
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),$$
and the stepsize $\alpha_k$ satisfies the weak Wolfe conditions (WWC). The matrices $B_k$ and $H_k$ constructed by BFGS are often dense, even when the true Hessian is sparse. In general, BFGS requires $\Theta(d^2)$ computation per iteration and $\Theta(d^2)$ memory. For large $d$, $\Theta(d^2)$ may be too much.
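To make the $\Theta(d^2)$ per-iteration cost concrete, here is a minimal NumPy sketch of a single BFGS inverse-Hessian update (1), with the rank-one terms expanded; the function name and code are illustrative, not part of the lecture. The point is that $H_k$ is a dense $d \times d$ matrix, so even one update touches $\Theta(d^2)$ entries.

```python
import numpy as np

def bfgs_inverse_update(H_prev, s, y):
    """One BFGS update (1): H_k = V_{k-1}^T H_{k-1} V_{k-1} + rho_{k-1} s s^T,
    expanded into rank-one corrections. Cost and storage are Theta(d^2)."""
    rho = 1.0 / (s @ y)
    Hy = H_prev @ y                      # Theta(d^2) matrix-vector product
    yHy = y @ Hy
    return (H_prev
            - rho * np.outer(s, Hy)      # -rho * s (H y)^T
            - rho * np.outer(Hy, s)      # -rho * (H y) s^T
            + rho * (1.0 + rho * yHy) * np.outer(s, s))
```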
Idea of L-BFGS: instead of storing the full matrix $H_k$ (approximation of $\nabla^2 f(x_k)^{-1}$), construct and represent $H_k$ implicitly using a small number of vectors $\{s_i, y_i\}$ for the last few iterations. Intuition: we do not expect the current Hessian to depend too much on “old” vectors $s_i, y_i$ (old iterates $x_i$ and their gradients).
Tradeoff: we reduce memory and computation to O(d), but we may lose local superlinear
convergence—we can only guarantee linear convergence in general.
2 L-BFGS
Recall and expand the BFGS update:
$$\begin{aligned}
\text{BFGS:}\quad H_k &= V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= V_{k-1}^\top V_{k-2}^\top H_{k-2} V_{k-2} V_{k-1} + \rho_{k-2} V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top H_{k-m} \left(V_{k-m} V_{k-m+1} \cdots V_{k-1}\right) \\
&\quad + \rho_{k-m}\, V_{k-1}^\top \cdots V_{k-m+1}^\top s_{k-m} s_{k-m}^\top \left(V_{k-m+1} \cdots V_{k-1}\right) \\
&\quad + \rho_{k-m+1}\, V_{k-1}^\top \cdots V_{k-m+2}^\top s_{k-m+1} s_{k-m+1}^\top \left(V_{k-m+2} \cdots V_{k-1}\right) \\
&\quad + \cdots \\
&\quad + \rho_{k-2}\, V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} \\
&\quad + \rho_{k-1}\, s_{k-1} s_{k-1}^\top.
\end{aligned}$$
In L-BFGS, we replace $H_{k-m}$ (a dense $d \times d$ matrix) with some sparse matrix $H_k^0$, e.g., a diagonal matrix. Thus, $H_k$ can be constructed using the most recent $m \ll d$ pairs $\{s_i, y_i\}_{i=k-m}^{k-1}$. That is,
$$\begin{aligned}
\text{L-BFGS:}\quad H_k &= V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top H_k^0 \left(V_{k-m} V_{k-m+1} \cdots V_{k-1}\right) \\
&\quad + \rho_{k-m}\, V_{k-1}^\top \cdots V_{k-m+1}^\top s_{k-m} s_{k-m}^\top \left(V_{k-m+1} \cdots V_{k-1}\right) \\
&\quad + \rho_{k-m+1}\, V_{k-1}^\top \cdots V_{k-m+2}^\top s_{k-m+1} s_{k-m+1}^\top \left(V_{k-m+2} \cdots V_{k-1}\right) \\
&\quad + \cdots \\
&\quad + \rho_{k-1}\, s_{k-1} s_{k-1}^\top.
\end{aligned}$$
In fact, we only need the $d$-dimensional vector $H_k \nabla f(x_k)$ to update $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$. Therefore, we do not even need to compute or store the matrix $H_k$ explicitly. Instead, we only store the vectors $\{s_i, y_i\}_{i=k-m}^{k-1}$, from which $H_k \nabla f(x_k)$ can be computed using only vector-vector multiplications, thanks to tricks like $(aa^\top + bb^\top)\, g = a(a^\top g) + b(b^\top g)$.
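As a quick sanity check of this trick, here is a tiny NumPy sketch (all variable names are ours); it confirms that applying $aa^\top + bb^\top$ to a vector needs only inner products and scalar multiples, never an explicit $d \times d$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
a, b, g = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

naive = (np.outer(a, a) + np.outer(b, b)) @ g  # forms a d x d matrix: O(d^2)
fast = a * (a @ g) + b * (b @ g)               # vector-vector operations only: O(d)

print(np.allclose(naive, fast))  # True
```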
This leads to a two-loop recursion implementation for computing $H_k \nabla f(x_k)$, stated in Algorithm 1.
Algorithm 1 L-BFGS two-loop recursion
set $q = \nabla f(x_k)$   // we want to compute $H_k \cdot \nabla f(x_k)$
for $i = k-1, k-2, \ldots, k-m$ do:
    $\alpha_i \leftarrow \rho_i s_i^\top q$
    $q \leftarrow q - \alpha_i y_i$   // RHS $= q - \rho_i (s_i^\top q)\, y_i = \underbrace{\left(I - \rho_i y_i s_i^\top\right)}_{V_i} q$
$r \leftarrow H_k^0 q$
for $i = k-m, \ldots, k-1$ do:
    $\beta \leftarrow \rho_i y_i^\top r$
    $r \leftarrow r + s_i (\alpha_i - \beta)$   // RHS $= r + (\alpha_i - \rho_i y_i^\top r)\, s_i = \underbrace{\left(I - \rho_i s_i y_i^\top\right)}_{V_i^\top} r + \alpha_i s_i$
return $r$   // which equals $H_k \nabla f(x_k)$
(Exercise) The total number of multiplications is at most $4md + \mathrm{nnz}(H_k^0) = O(md)$.
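For concreteness, here is a minimal NumPy sketch of Algorithm 1 (a sketch only; the function name, the list-based storage of the pairs, and the scalar choice $H_k^0 = \gamma I$ are our conventions, not part of the lecture).

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list, gamma):
    """Compute H_k @ grad implicitly from the stored pairs {(s_i, y_i)}.

    s_list, y_list: the m most recent pairs, oldest first (s_{k-m}, ..., s_{k-1}).
    gamma: scalar so that H_k^0 = gamma * I.
    Only vector-vector operations are used; the cost is O(m d).
    """
    q = grad.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)

    # First loop: newest to oldest, q <- V_i q (recording alpha_i along the way).
    for i in reversed(range(len(s_list))):
        alpha[i] = rho[i] * (s_list[i] @ q)
        q -= alpha[i] * y_list[i]

    r = gamma * q  # apply H_k^0 = gamma * I

    # Second loop: oldest to newest, r <- V_i^T r + alpha_i s_i.
    for i in range(len(s_list)):
        beta = rho[i] * (y_list[i] @ r)
        r += (alpha[i] - beta) * s_list[i]

    return r  # equals H_k @ grad
```

Only inner products, scalar multiples, and vector additions appear, which is how the $O(md)$ multiplication count in the exercise arises.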
In practice:
• We often take $m$ to be a small constant independent of $d$, e.g., $3 \le m \le 20$.
• A popular choice for $H_k^0$ is $H_k^0 = \gamma_k I$, where $\gamma_k = \dfrac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}$. This choice appears to be quite effective in practice. (Optional) $\dfrac{1}{\gamma_k}$ is an approximation of $\dfrac{z_k^\top \nabla^2 f(x_k)\, z_k}{\|z_k\|^2}$, which is the size of the true Hessian along the direction $z_k \approx \left(\nabla^2 f(x_k)\right)^{1/2} s_k$; see Section 6.1 in Nocedal-Wright (and the short sketch after this list).
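With this scaling, the step $r \leftarrow H_k^0 q$ in Algorithm 1 is just a scalar multiplication; a minimal sketch (names ours):

```python
import numpy as np

def initial_scaling(s_prev, y_prev):
    """gamma_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1}); then H_k^0 = gamma_k * I."""
    return (s_prev @ y_prev) / (y_prev @ y_prev)

# Inside the two-loop recursion, applying H_k^0 = gamma_k * I is simply: r = gamma_k * q
```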
The complete L-BFGS algorithm is given in Algorithm 2. As discussed in Lecture 21, it is important that $\alpha_k$ satisfies both the sufficient decrease and curvature conditions of the Wolfe conditions (the latter ensures $s_k^\top y_k > 0$, so each $\rho_k$ is well defined).
Algorithm 2 L-BFGS
input: $x_0 \in \mathbb{R}^d$ (initial point), $m > 0$ (memory budget), $\epsilon > 0$ (convergence criterion)
$k \leftarrow 0$
repeat:
• Choose $H_k^0$
• $p_k \leftarrow -H_k \nabla f(x_k)$, where $H_k \nabla f(x_k)$ is computed using Algorithm 1
• $x_{k+1} \leftarrow x_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe conditions
• if $k > m$:
  – discard $\{s_{k-m}, y_{k-m}\}$ from storage
• Compute and store $s_k \leftarrow x_{k+1} - x_k$ and $y_k \leftarrow \nabla f(x_{k+1}) - \nabla f(x_k)$
• $k \leftarrow k + 1$
until $\|\nabla f(x_k)\| \le \epsilon$
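Putting the pieces together, here is a minimal Python sketch of Algorithm 2, built on the two_loop_recursion sketch above; it assumes SciPy is available and uses scipy.optimize.line_search (a strong Wolfe line search) for the stepsize. All names are illustrative, and the fallback stepsize and curvature safeguard are our additions, not part of the lecture.

```python
import numpy as np
from scipy.optimize import line_search  # (strong) Wolfe line search

def lbfgs(f, grad_f, x0, m=10, eps=1e-6, max_iter=500):
    """Minimal L-BFGS sketch following Algorithm 2 (illustrative, not production code)."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    s_list, y_list = [], []          # the m most recent pairs, oldest first

    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break

        # Choose H_k^0 = gamma_k * I (see the "In practice" remarks above).
        if s_list:
            gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        else:
            gamma = 1.0

        p = -two_loop_recursion(g, s_list, y_list, gamma)

        # Stepsize satisfying the Wolfe conditions.
        alpha = line_search(f, grad_f, x, p, gfk=g)[0]
        if alpha is None:
            alpha = 1e-3             # crude fallback if the line search fails

        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g

        if s @ y > 1e-10:            # curvature safeguard keeps rho_k well defined
            s_list.append(s)
            y_list.append(y)
            if len(s_list) > m:      # discard the oldest pair {s_{k-m}, y_{k-m}}
                s_list.pop(0)
                y_list.pop(0)

        x, g = x_new, g_new

    return x
```

For example, lbfgs(lambda x: x @ x, lambda x: 2 * x, np.ones(100)) should return a point with gradient norm below eps after a few iterations.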
[Some numerical results taken from Nocedal-Wright were shown here; figure omitted.]
3 Relationship with nonlinear conjugate gradient methods
In Lecture 13 we mentioned several ways of generalizing CG to non-quadratic functions (a.k.a. nonlinear CG), including Dai-Yuan, Fletcher-Reeves and Polak-Ribière. The last one has a variant
called Hestenes-Stiefel, which uses the search direction
$$p_{k+1} = -\nabla f(x_{k+1}) + \frac{\nabla f(x_{k+1})^\top y_k}{y_k^\top p_k}\, p_k = -\underbrace{\left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right)}_{=:\, \hat H_{k+1}} \nabla f(x_{k+1}), \qquad (2)$$
where we recall that $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ and $s_k = x_{k+1} - x_k$.
The matrix $\hat H_{k+1}$ is neither symmetric nor p.d. If we try to symmetrize $\hat H_{k+1}$ by taking $\hat H_{k+1}^\top \hat H_{k+1}$, we end up with a matrix that does not satisfy the secant equation and is singular (note that $\hat H_{k+1} s_k = 0$).
A symmetric p.d. matrix that satisfies the secant equation is
$$\begin{aligned}
H_{k+1} &= \hat H_{k+1} \hat H_{k+1}^\top + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right) I \left(I - \frac{y_k s_k^\top}{y_k^\top s_k}\right) + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \text{BFGS update (1) applied to } H_k = I.
\end{aligned}$$
Therefore, computing $H_{k+1}$ as above for the search direction $p_{k+1} = -H_{k+1} \nabla f(x_{k+1})$ can be viewed as “memoryless” BFGS, i.e., L-BFGS with $m = 1$ and $H_k^0 = I$.
Suppose we combine memoryless BFGS and exact line search:
$$\alpha_k = \operatorname*{argmin}_{\alpha \in \mathbb{R}} f(x_k + \alpha p_k).$$
For all $k$, the stepsize $\alpha_k$ satisfies
$$0 = \left\langle \nabla f(x_k + \alpha_k p_k),\, p_k \right\rangle = \left\langle \nabla f(x_{k+1}),\, \alpha_k^{-1} s_k \right\rangle,$$
hence $s_k^\top \nabla f(x_{k+1}) = 0$. It follows that
$$\begin{aligned}
p_{k+1} &= -H_{k+1} \nabla f(x_{k+1}) \\
&= -\left[ \left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right)\left(I - \frac{y_k s_k^\top}{y_k^\top s_k}\right) + \frac{s_k s_k^\top}{y_k^\top s_k} \right] \nabla f(x_{k+1}) \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top s_k}\, s_k \qquad \text{using } s_k^\top \nabla f(x_{k+1}) = 0 \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top p_k}\, p_k \qquad \text{using } s_k = \alpha_k p_k,
\end{aligned}$$
which is the same as the Hestenes-Stiefel CG update (2).
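To see this equivalence concretely, here is a small NumPy check on a quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$, where exact line search has the closed form $\alpha_k = -\nabla f(x_k)^\top p_k / (p_k^\top A p_k)$; the code (names ours) builds the memoryless-BFGS matrix $H_{k+1}$ explicitly and confirms that $-H_{k+1}\nabla f(x_{k+1})$ matches the Hestenes-Stiefel direction (2).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)                  # positive definite Hessian
b = rng.standard_normal(d)
grad = lambda x: A @ x - b                   # gradient of f(x) = 0.5 x^T A x - b^T x

x = rng.standard_normal(d)
p = -grad(x)                                 # initial (steepest descent) direction
alpha = -(grad(x) @ p) / (p @ A @ p)         # exact line search on a quadratic
x_new = x + alpha * p

s, y = x_new - x, grad(x_new) - grad(x)
rho = 1.0 / (y @ s)

# Memoryless BFGS: H_{k+1} = V_k^T I V_k + rho_k s_k s_k^T, i.e., update (1) applied to I.
V = np.eye(d) - rho * np.outer(y, s)
H = V.T @ V + rho * np.outer(s, s)
p_bfgs = -H @ grad(x_new)

# Hestenes-Stiefel direction (2).
p_hs = -grad(x_new) + (y @ grad(x_new)) / (y @ p) * p

print(np.allclose(p_bfgs, p_hs))  # True
```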