UW-Madison CS/ISyE/Math/Stat 726 Spring 2023
Lecture 23: Limited-Memory BFGS (L-BFGS)
Yudong Chen
1 Basic ideas
Newton and quasi-Newton methods enjoy fast convergence (small number of iterations), but for
large-scale problems each iteration may be too costly.
For example, recall the quasi-Newton method $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$ with BFGS update:
$$H_k = V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top, \qquad (1)$$
where
$$\rho_k = \frac{1}{s_k^\top y_k}, \qquad V_k = I - \rho_k y_k s_k^\top,$$
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k),$$
and the stepsize $\alpha_k$ satisfies the weak Wolfe conditions (WWC). The matrices $B_k$ and $H_k$ constructed by BFGS are often dense, even when the true Hessian is sparse. In general, BFGS requires $\Theta(d^2)$ computation per iteration and $\Theta(d^2)$ memory. For large $d$, $\Theta(d^2)$ may be too much.
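To make the $\Theta(d^2)$ per-iteration cost concrete, here is a minimal NumPy sketch of a single BFGS inverse-Hessian update (1), with the rank-one terms expanded; the function name and code are illustrative, not part of the lecture. The point is that $H_k$ is a dense $d \times d$ matrix, so even one update touches $\Theta(d^2)$ entries.

```python
import numpy as np

def bfgs_inverse_update(H_prev, s, y):
    """One BFGS update (1): H_k = V_{k-1}^T H_{k-1} V_{k-1} + rho_{k-1} s s^T,
    expanded into rank-one corrections. Cost and storage are Theta(d^2)."""
    rho = 1.0 / (s @ y)
    Hy = H_prev @ y                      # Theta(d^2) matrix-vector product
    yHy = y @ Hy
    return (H_prev
            - rho * np.outer(s, Hy)      # -rho * s (H y)^T
            - rho * np.outer(Hy, s)      # -rho * (H y) s^T
            + rho * (1.0 + rho * yHy) * np.outer(s, s))
```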
Idea of L-BFGS: instead of storing the full matrix $H_k$ (approximation of $\nabla^2 f(x_k)^{-1}$), construct and represent $H_k$ implicitly using a small number of vectors $\{s_i, y_i\}$ for the last few iterations. Intuition: we do not expect the current Hessian to depend too much on “old” vectors $s_i, y_i$ (old iterates $x_i$ and their gradients).
Tradeoff: we reduce memory and computation to O(d), but we may lose local superlinear
convergence—we can only guarantee linear convergence in general.
2 L-BFGS
Recall and expand the BFGS update:
$$\begin{aligned}
\text{BFGS:}\quad H_k &= V_{k-1}^\top H_{k-1} V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= V_{k-1}^\top V_{k-2}^\top H_{k-2} V_{k-2} V_{k-1} + \rho_{k-2} V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^\top \\
&= V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top H_{k-m} \left(V_{k-m} V_{k-m+1} \cdots V_{k-1}\right) \\
&\quad + \rho_{k-m}\, V_{k-1}^\top \cdots V_{k-m+1}^\top s_{k-m} s_{k-m}^\top \left(V_{k-m+1} \cdots V_{k-1}\right) \\
&\quad + \rho_{k-m+1}\, V_{k-1}^\top \cdots V_{k-m+2}^\top s_{k-m+1} s_{k-m+1}^\top \left(V_{k-m+2} \cdots V_{k-1}\right) \\
&\quad + \cdots \\
&\quad + \rho_{k-2}\, V_{k-1}^\top s_{k-2} s_{k-2}^\top V_{k-1} \\
&\quad + \rho_{k-1}\, s_{k-1} s_{k-1}^\top.
\end{aligned}$$
In L-BFGS, we replace $H_{k-m}$ (a dense $d \times d$ matrix) with some sparse matrix $H_k^0$, e.g., a diagonal matrix. Thus, $H_k$ can be constructed using the most recent $m \ll d$ pairs $\{s_i, y_i\}_{i=k-m}^{k-1}$. That is,
$$\begin{aligned}
\text{L-BFGS:}\quad H_k &= V_{k-1}^\top V_{k-2}^\top \cdots V_{k-m}^\top H_k^0 \left(V_{k-m} V_{k-m+1} \cdots V_{k-1}\right) \\
&\quad + \rho_{k-m}\, V_{k-1}^\top \cdots V_{k-m+1}^\top s_{k-m} s_{k-m}^\top \left(V_{k-m+1} \cdots V_{k-1}\right) \\
&\quad + \rho_{k-m+1}\, V_{k-1}^\top \cdots V_{k-m+2}^\top s_{k-m+1} s_{k-m+1}^\top \left(V_{k-m+2} \cdots V_{k-1}\right) \\
&\quad + \cdots \\
&\quad + \rho_{k-1}\, s_{k-1} s_{k-1}^\top.
\end{aligned}$$
In fact, we only need the $d$-dimensional vector $H_k \nabla f(x_k)$ to update $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$. Therefore, we do not even need to compute or store the matrix $H_k$ explicitly. Instead, we only store the vectors $\{s_i, y_i\}_{i=k-m}^{k-1}$, from which $H_k \nabla f(x_k)$ can be computed using only vector-vector multiplications, thanks to tricks like $(aa^\top + bb^\top)\, g = a(a^\top g) + b(b^\top g)$.
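As a quick sanity check of this trick, here is a tiny NumPy sketch (all variable names are ours); it confirms that applying $aa^\top + bb^\top$ to a vector needs only inner products and scalar multiples, never an explicit $d \times d$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
a, b, g = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)

naive = (np.outer(a, a) + np.outer(b, b)) @ g  # forms a d x d matrix: O(d^2)
fast = a * (a @ g) + b * (b @ g)               # vector-vector operations only: O(d)

print(np.allclose(naive, fast))  # True
```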
This leads to a two-loop recursion implementation for computing $H_k \nabla f(x_k)$, stated in Algorithm 1.
Algorithm 1 L-BFGS two-loop recursion
set $q = \nabla f(x_k)$   // we want to compute $H_k \cdot \nabla f(x_k)$
for $i = k-1, k-2, \ldots, k-m$ do:
    $\alpha_i \leftarrow \rho_i s_i^\top q$
    $q \leftarrow q - \alpha_i y_i$   // RHS $= q - \rho_i (s_i^\top q)\, y_i = \underbrace{\left(I - \rho_i y_i s_i^\top\right)}_{V_i} q$
$r \leftarrow H_k^0 q$
for $i = k-m, \ldots, k-1$ do:
    $\beta \leftarrow \rho_i y_i^\top r$
    $r \leftarrow r + s_i (\alpha_i - \beta)$   // RHS $= r + (\alpha_i - \rho_i y_i^\top r)\, s_i = \underbrace{\left(I - \rho_i s_i y_i^\top\right)}_{V_i^\top} r + \alpha_i s_i$
return $r$   // which equals $H_k \nabla f(x_k)$
(Exercise) The total number of multiplications is at most $4md + \mathrm{nnz}(H_k^0) = O(md)$.
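For concreteness, here is a minimal NumPy sketch of Algorithm 1 (a sketch only; the function name, the list-based storage of the pairs, and the scalar choice $H_k^0 = \gamma I$ are our conventions, not part of the lecture).

```python
import numpy as np

def two_loop_recursion(grad, s_list, y_list, gamma):
    """Compute H_k @ grad implicitly from the stored pairs {(s_i, y_i)}.

    s_list, y_list: the m most recent pairs, oldest first (s_{k-m}, ..., s_{k-1}).
    gamma: scalar so that H_k^0 = gamma * I.
    Only vector-vector operations are used; the cost is O(m d).
    """
    q = grad.copy()
    rho = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)

    # First loop: newest to oldest, q <- V_i q (recording alpha_i along the way).
    for i in reversed(range(len(s_list))):
        alpha[i] = rho[i] * (s_list[i] @ q)
        q -= alpha[i] * y_list[i]

    r = gamma * q  # apply H_k^0 = gamma * I

    # Second loop: oldest to newest, r <- V_i^T r + alpha_i s_i.
    for i in range(len(s_list)):
        beta = rho[i] * (y_list[i] @ r)
        r += (alpha[i] - beta) * s_list[i]

    return r  # equals H_k @ grad
```

Only inner products, scalar multiples, and vector additions appear, which is how the $O(md)$ multiplication count in the exercise arises.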
In practice:
• We often take $m$ to be a small constant independent of $d$, e.g., $3 \le m \le 20$.
• A popular choice for $H_k^0$ is $H_k^0 = \gamma_k I$, where $\gamma_k = \dfrac{s_{k-1}^\top y_{k-1}}{y_{k-1}^\top y_{k-1}}$. This choice appears to be quite effective in practice. (Optional) $\dfrac{1}{\gamma_k}$ is an approximation of $\dfrac{z_k^\top \nabla^2 f(x_k)\, z_k}{\|z_k\|^2}$, which is the size of the true Hessian along the direction $z_k \approx \left(\nabla^2 f(x_k)\right)^{1/2} s_k$; see Section 6.1 in Nocedal-Wright (and the short sketch after this list).
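With this scaling, the step $r \leftarrow H_k^0 q$ in Algorithm 1 is just a scalar multiplication; a minimal sketch (names ours):

```python
import numpy as np

def initial_scaling(s_prev, y_prev):
    """gamma_k = (s_{k-1}^T y_{k-1}) / (y_{k-1}^T y_{k-1}); then H_k^0 = gamma_k * I."""
    return (s_prev @ y_prev) / (y_prev @ y_prev)

# Inside the two-loop recursion, applying H_k^0 = gamma_k * I is simply: r = gamma_k * q
```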
The complete L-BFGS algorithm is given in Algorithm 2. As discussed in Lecture 21, it is important that $\alpha_k$ satisfies both the sufficient decrease and curvature conditions of the Wolfe conditions (the latter ensures $s_k^\top y_k > 0$, so each $\rho_k$ is well defined).
Algorithm 2 L-BFGS
input: $x_0 \in \mathbb{R}^d$ (initial point), $m > 0$ (memory budget), $\epsilon > 0$ (convergence criterion)
$k \leftarrow 0$
repeat:
• Choose $H_k^0$
• $p_k \leftarrow -H_k \nabla f(x_k)$, where $H_k \nabla f(x_k)$ is computed using Algorithm 1
• $x_{k+1} \leftarrow x_k + \alpha_k p_k$, where $\alpha_k$ satisfies the Wolfe conditions
• if $k > m$:
  – discard $\{s_{k-m}, y_{k-m}\}$ from storage
• Compute and store $s_k \leftarrow x_{k+1} - x_k$ and $y_k \leftarrow \nabla f(x_{k+1}) - \nabla f(x_k)$
• $k \leftarrow k + 1$
until $\|\nabla f(x_k)\| \le \epsilon$
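Putting the pieces together, here is a minimal Python sketch of Algorithm 2, built on the two_loop_recursion sketch above; it assumes SciPy is available and uses scipy.optimize.line_search (a strong Wolfe line search) for the stepsize. All names are illustrative, and the fallback stepsize and curvature safeguard are our additions, not part of the lecture.

```python
import numpy as np
from scipy.optimize import line_search  # (strong) Wolfe line search

def lbfgs(f, grad_f, x0, m=10, eps=1e-6, max_iter=500):
    """Minimal L-BFGS sketch following Algorithm 2 (illustrative, not production code)."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    s_list, y_list = [], []          # the m most recent pairs, oldest first

    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break

        # Choose H_k^0 = gamma_k * I (see the "In practice" remarks above).
        if s_list:
            gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
        else:
            gamma = 1.0

        p = -two_loop_recursion(g, s_list, y_list, gamma)

        # Stepsize satisfying the Wolfe conditions.
        alpha = line_search(f, grad_f, x, p, gfk=g)[0]
        if alpha is None:
            alpha = 1e-3             # crude fallback if the line search fails

        x_new = x + alpha * p
        g_new = grad_f(x_new)
        s, y = x_new - x, g_new - g

        if s @ y > 1e-10:            # curvature safeguard keeps rho_k well defined
            s_list.append(s)
            y_list.append(y)
            if len(s_list) > m:      # discard the oldest pair {s_{k-m}, y_{k-m}}
                s_list.pop(0)
                y_list.pop(0)

        x, g = x_new, g_new

    return x
```

For example, lbfgs(lambda x: x @ x, lambda x: 2 * x, np.ones(100)) should return a point with gradient norm below eps after a few iterations.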
[Some numerical results taken from Nocedal-Wright were shown here; figure omitted.]
3 Relationship with nonlinear conjugate gradient methods
In Lecture 13 we mentioned several ways of generalizing CG to non-quadratic functions (a.k.a. nonlinear CG), including Dai-Yuan, Fletcher-Reeves and Polak-Ribière. The last one has a variant
called Hestenes-Stiefel, which uses the search direction
$$p_{k+1} = -\nabla f(x_{k+1}) + \frac{\nabla f(x_{k+1})^\top y_k}{y_k^\top p_k}\, p_k = -\underbrace{\left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right)}_{=:\, \hat H_{k+1}} \nabla f(x_{k+1}), \qquad (2)$$
where we recall that $y_k = \nabla f(x_{k+1}) - \nabla f(x_k)$ and $s_k = x_{k+1} - x_k$.
The matrix $\hat H_{k+1}$ is neither symmetric nor p.d. If we try to symmetrize $\hat H_{k+1}$ by taking $\hat H_{k+1}^\top \hat H_{k+1}$, we end up with a matrix that does not satisfy the secant equation and is singular (note that $\hat H_{k+1} s_k = 0$).
A symmetric p.d. matrix that satisfies the secant equation is
$$\begin{aligned}
H_{k+1} &= \hat H_{k+1} \hat H_{k+1}^\top + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right) I \left(I - \frac{y_k s_k^\top}{y_k^\top s_k}\right) + \frac{s_k s_k^\top}{y_k^\top s_k} \\
&= \text{BFGS update (1) applied to } H_k = I.
\end{aligned}$$
Therefore, computing $H_{k+1}$ as above for the search direction $p_{k+1} = -H_{k+1} \nabla f(x_{k+1})$ can be viewed as “memoryless” BFGS, i.e., L-BFGS with $m = 1$ and $H_k^0 = I$.
Suppose we combine memoryless BFGS and exact line search:
$$\alpha_k = \operatorname*{argmin}_{\alpha \in \mathbb{R}} f(x_k + \alpha p_k).$$
For all $k$, the stepsize $\alpha_k$ satisfies
$$0 = \left\langle \nabla f(x_k + \alpha_k p_k),\, p_k \right\rangle = \left\langle \nabla f(x_{k+1}),\, \alpha_k^{-1} s_k \right\rangle,$$
hence $s_k^\top \nabla f(x_{k+1}) = 0$. It follows that
$$\begin{aligned}
p_{k+1} &= -H_{k+1} \nabla f(x_{k+1}) \\
&= -\left[ \left(I - \frac{s_k y_k^\top}{y_k^\top s_k}\right)\left(I - \frac{y_k s_k^\top}{y_k^\top s_k}\right) + \frac{s_k s_k^\top}{y_k^\top s_k} \right] \nabla f(x_{k+1}) \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top s_k}\, s_k \qquad \text{using } s_k^\top \nabla f(x_{k+1}) = 0 \\
&= -\nabla f(x_{k+1}) + \frac{y_k^\top \nabla f(x_{k+1})}{y_k^\top p_k}\, p_k \qquad \text{using } s_k = \alpha_k p_k,
\end{aligned}$$
which is the same as the Hestenes-Stiefel CG update (2).
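To see this equivalence concretely, here is a small NumPy check on a quadratic $f(x) = \frac{1}{2} x^\top A x - b^\top x$, where exact line search has the closed form $\alpha_k = -\nabla f(x_k)^\top p_k / (p_k^\top A p_k)$; the code (names ours) builds the memoryless-BFGS matrix $H_{k+1}$ explicitly and confirms that $-H_{k+1}\nabla f(x_{k+1})$ matches the Hestenes-Stiefel direction (2).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
M = rng.standard_normal((d, d))
A = M @ M.T + d * np.eye(d)                  # positive definite Hessian
b = rng.standard_normal(d)
grad = lambda x: A @ x - b                   # gradient of f(x) = 0.5 x^T A x - b^T x

x = rng.standard_normal(d)
p = -grad(x)                                 # initial (steepest descent) direction
alpha = -(grad(x) @ p) / (p @ A @ p)         # exact line search on a quadratic
x_new = x + alpha * p

s, y = x_new - x, grad(x_new) - grad(x)
rho = 1.0 / (y @ s)

# Memoryless BFGS: H_{k+1} = V_k^T I V_k + rho_k s_k s_k^T, i.e., update (1) applied to I.
V = np.eye(d) - rho * np.outer(y, s)
H = V.T @ V + rho * np.outer(s, s)
p_bfgs = -H @ grad(x_new)

# Hestenes-Stiefel direction (2).
p_hs = -grad(x_new) + (y @ grad(x_new)) / (y @ p) * p

print(np.allclose(p_bfgs, p_hs))  # True
```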