Lecture 04 - Conjugate Gradient Methods
LEARNING OUTCOMES
By the completion of this lecture you should be able to:
1. Describe the conjugate directions and conjugate gradient algorithms for quadratic functions.
2. Describe the conjugate gradient algorithm for general nonlinear functions.
3. Describe convergence properties of conjugate gradient algorithms.
Reference:
• Chapter 5, Jorge Nocedal and Stephen J. Wright, ‘Numerical Optimization’.
• Chapter 1, Dimitri P. Bertsekas, ‘Nonlinear Programming’.
4.1 Introduction
Conjugate direction methods were motivated by the need to improve the convergence rate of the steepest descent method without requiring the computation of the Hessian ∇²f(x), as Newton's method does. Conjugate direction methods were originally introduced for solving the quadratic problem
$$\min_{x \in \mathbb{R}^n} f(x) = \frac{1}{2} x^T Q x + b^T x + c, \qquad (4.1)$$
where Q is a symmetric positive definite matrix. Equivalently, they solve the linear system Qx = −b. The basic characteristic of conjugate direction methods is that they find the minimum of a quadratic function in a finite number of steps. These methods have been extended to solve unconstrained optimization problems with general nonlinear objective functions.
We first derive the conjugate gradient method for minimizing convex quadratic functions. Then we transfer
the technique to minimization problems of general nonlinear functions.
The steepest descent method often takes steps in the same direction as earlier steps (the zig-zag effect). It would be better to get each step right the first time, so that we never have to move in that direction again. The idea is to select a set of mutually conjugate directions {d_0, d_1, ..., d_{n−1}} and to take exactly one step in each direction, with exactly the right length so that the step accounts for the corresponding component of the optimal solution x^*. Ordinary orthogonality of the directions {d_0, d_1, ..., d_{n−1}} is not sufficient to obtain the optimal solution in this way. Rather, the search directions {d_0, d_1, ..., d_{n−1}} are required to be conjugate with respect to a symmetric positive definite matrix Q. The formal definition is:
Definition 4.1. Let Q be a symmetric positive definite matrix. Then, a set of non-zero vectors {d_0, d_1, ..., d_m} is said to be Q-conjugate if
$$d_i^T Q d_j = 0 \quad \text{for all } i \neq j. \qquad (4.2)$$
Note that if Q = I, then two vectors are Q-conjugate if and only if they are orthogonal. Also, a set of vectors {d_0, d_1, ..., d_{n−1}} must be linearly independent for the algorithm to work. Lemma 4.1 relates Q-conjugacy to linear independence.
Lemma 4.1. Let Q ∈ Rn×n be a symmetric positive definite matrix. If the directions d0 , d1 , . . . , dk ∈
Rn , k ≤ n − 1, are nonzero and Q-conjugate, then they are linearly independent.
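As a quick numerical illustration of Definition 4.1 and Lemma 4.1, the following Python sketch (the matrix Q, the starting basis, and all variable names are made-up for illustration, not taken from the lecture) constructs a Q-conjugate set by Gram-Schmidt orthogonalization in the Q-inner product and checks both conjugacy and linear independence.

import numpy as np

# A made-up symmetric positive definite matrix (illustration only).
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# Q-orthogonalize the standard basis (Gram-Schmidt with the inner product
# <u, v>_Q = u^T Q v) to obtain a set of Q-conjugate directions.
basis = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
directions = []
for v in basis:
    d = v.copy()
    for prev in directions:
        d -= (prev @ Q @ v) / (prev @ Q @ prev) * prev
    directions.append(d)

D = np.column_stack(directions)
print("D^T Q D =\n", D.T @ Q @ D)             # off-diagonal entries are ~0 (Definition 4.1)
print("rank(D) =", np.linalg.matrix_rank(D))  # full rank, i.e. linearly independent (Lemma 4.1)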
4.2 The Conjugate Direction Method
We now present the conjugate direction algorithm for minimizing the quadratic function of n variables
$$f(x) = \frac{1}{2} x^T Q x + b^T x + c,$$
where Q is a symmetric positive definite matrix. Note that because Q is positive definite, the function f
has a global minimizer that can be found by solving Qx = −b. The importance of Q-conjugacy lies in the
fact that a quadratic function can be minimized in n steps.
Since the search directions are Q-conjugate and thus linearly independent, the idea is to minimize successively along the individual directions. Given an arbitrary starting point x_0 and a set of Q-conjugate directions {d_0, d_1, ..., d_{n−1}}, the corresponding conjugate direction algorithm generates the sequence {x_k} by setting
$$x_{k+1} = x_k + s_k d_k, \qquad (4.3)$$
where the stepsize s_k is obtained by solving the one-dimensional minimization problem min_s f(x_k + s d_k). In particular, by setting the derivative of f(x_k + s d_k) with respect to s to zero, we obtain
$$s_k = -\frac{g_k^T d_k}{d_k^T Q d_k}. \qquad (4.4)$$
Here g_k denotes the gradient of the objective function at x_k, i.e., g_k = ∇f(x_k) = Q x_k + b. The basic conjugate direction algorithm is summarized in Algorithm 4.1.
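To make the iteration (4.3) and (4.4) concrete, here is a minimal NumPy sketch of one pass of the conjugate direction method; the function name and the example data are assumptions for illustration, and this is not a transcription of Algorithm 4.1.

import numpy as np

def conjugate_direction_minimize(Q, b, x0, directions):
    """Minimize f(x) = 0.5 x^T Q x + b^T x + c along a given set of
    Q-conjugate directions, using the exact stepsize (4.4)."""
    x = np.asarray(x0, dtype=float)
    for d in directions:
        g = Q @ x + b                     # gradient g_k = Q x_k + b
        s = -(g @ d) / (d @ Q @ d)        # stepsize (4.4)
        x = x + s * d                     # update (4.3)
    return x

# Made-up example: with Q-conjugate directions the loop reaches the exact
# minimizer x* = -Q^{-1} b after n = 2 steps (see Theorem 4.1 below).
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([-1.0, -2.0])
d0 = np.array([1.0, 0.0])
d1 = np.array([-0.25, 1.0])               # satisfies d0^T Q d1 = 0
print(conjugate_direction_minimize(Q, b, [0.0, 0.0], [d0, d1]))
print(np.linalg.solve(Q, -b))             # the two printed vectors agree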
Theorem 4.1. For any x_0 ∈ R^n, the sequence {x_k} generated by the conjugate direction algorithm (4.3), (4.4) converges to the optimal solution x^* of the linear system Qx = −b in at most n steps.
Proof: Consider x^* − x_0. Since the directions {d_i} are linearly independent, there exists a set of constants γ_i, i = 0, ..., n − 1, such that
$$x^* - x_0 = \gamma_0 d_0 + \gamma_1 d_1 + \cdots + \gamma_{n-1} d_{n-1}.$$
Now premultiply both sides of the above equation by d_k^T Q, for 0 ≤ k < n, to obtain
$$d_k^T Q (x^* - x_0) = \gamma_k\, d_k^T Q d_k,$$
where the terms d_k^T Q d_i, k ≠ i, vanish by the Q-conjugacy property. Hence
$$\gamma_k = \frac{d_k^T Q (x^* - x_0)}{d_k^T Q d_k}.$$
Now, the iterate x_k generated by the algorithm satisfies
$$x_k = x_0 + s_0 d_0 + s_1 d_1 + \cdots + s_{k-1} d_{k-1}.$$
Therefore,
$$x_k - x_0 = s_0 d_0 + s_1 d_1 + \cdots + s_{k-1} d_{k-1}.$$
So writing
$$x^* - x_0 = (x^* - x_k) + (x_k - x_0),$$
premultiplying the above expression by d_k^T Q and using the Q-conjugacy property, we obtain
$$d_k^T Q (x^* - x_0) = d_k^T Q (x^* - x_k) = -d_k^T g_k,$$
since Qx^* = −b and g_k = Q x_k + b. Hence
$$\gamma_k = -\frac{d_k^T g_k}{d_k^T Q d_k} = s_k,$$
so that x^* = x_0 + s_0 d_0 + \cdots + s_{n-1} d_{n-1} = x_n, i.e., the algorithm reaches x^* in at most n steps. Q.E.D.

The gradients generated by the conjugate direction algorithm satisfy a useful orthogonality property, stated in the following lemma.

Lemma 4.2. In the conjugate direction algorithm (4.3), (4.4),
$$g_{k+1}^T d_i = 0 \quad \text{for all } 0 \le k \le n-1 \text{ and } 0 \le i \le k.$$

Proof: Since f is quadratic, the gradients at successive iterates satisfy
$$g_{k+1} = Q x_{k+1} + b = Q(x_k + s_k d_k) + b = g_k + s_k Q d_k,$$
and since the function f is quadratic, the exact line search stepsize is
$$s_k = -\frac{g_k^T d_k}{d_k^T Q d_k}.$$
We prove the lemma by induction. We first show the result is true for k = 0, thus i = 0:
$$\begin{aligned}
g_1^T d_0 &= (Q x_1 + b)^T d_0 = x_1^T Q d_0 + b^T d_0 \\
&= \left(x_0 - \frac{g_0^T d_0}{d_0^T Q d_0}\, d_0\right)^{T} Q d_0 + b^T d_0 \\
&= x_0^T Q d_0 - g_0^T d_0 + b^T d_0 \\
&= (Q x_0 + b)^T d_0 - g_0^T d_0 \\
&= g_0^T d_0 - g_0^T d_0 = 0.
\end{aligned}$$
We now show that the result holds for k, assuming it holds for k − 1. Fix k > 0 and 0 ≤ i < k. By the induction hypothesis, g_k^T d_i = 0. Because
$$g_{k+1} = g_k + s_k Q d_k,$$
and d_k^T Q d_i = 0 by Q-conjugacy, we have
$$g_{k+1}^T d_i = g_k^T d_i + s_k\, d_k^T Q d_i = 0.$$
It remains to be shown that g_{k+1}^T d_k = 0. Indeed,
$$\begin{aligned}
g_{k+1}^T d_k &= (Q x_{k+1} + b)^T d_k = x_{k+1}^T Q d_k + b^T d_k \\
&= (x_k + s_k d_k)^T Q d_k + b^T d_k \\
&= x_k^T Q d_k - \frac{g_k^T d_k}{d_k^T Q d_k}\, d_k^T Q d_k + b^T d_k \\
&= (Q x_k + b)^T d_k - g_k^T d_k \\
&= g_k^T d_k - g_k^T d_k = 0.
\end{aligned}$$
Therefore, by induction, g_{k+1}^T d_i = 0 for all 0 ≤ k ≤ n − 1 and 0 ≤ i ≤ k. Q.E.D.
This property is very important as it will be used extensively in subsequent sections.
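As a numerical sanity check on Lemma 4.2, the following sketch (with the same made-up Q and Q-conjugate directions used in the earlier sketches) runs the conjugate direction iteration and prints g_{k+1}^T d_i for all i ≤ k; the printed values are on the order of machine precision.

import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([-1.0, -2.0])
directions = [np.array([1.0, 0.0]), np.array([-0.25, 1.0])]   # Q-conjugate pair

x = np.zeros(2)
for k, d in enumerate(directions):
    g = Q @ x + b                          # g_k
    x = x - (g @ d) / (d @ Q @ d) * d      # conjugate direction step (4.3)-(4.4)
    g_next = Q @ x + b                     # g_{k+1}
    for i in range(k + 1):                 # Lemma 4.2: g_{k+1}^T d_i = 0 for i <= k
        print(f"g_{k+1}^T d_{i} = {g_next @ directions[i]:.2e}")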
Example 4.1. Suppose we want to minimize f(x) = 2x_1^2 + x_2^2 − 3 starting at x_0 = [1, 1]^T with the initial direction being d_0 = [−4, −2]^T. Find a conjugate direction to the initial direction d_0 and find the minimizer.
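One way to check a hand computation of Example 4.1 numerically: here f corresponds to Q = diag(4, 2) and b = 0, and d_1 = [1, −4]^T is one (of many) directions Q-conjugate to d_0. The sketch below is an illustration only; the choice of d_1 is an assumption, not the unique answer.

import numpy as np

# Example 4.1: f(x) = 2*x1**2 + x2**2 - 3, i.e. Q = diag(4, 2), b = 0, c = -3.
Q = np.diag([4.0, 2.0])
b = np.zeros(2)

x = np.array([1.0, 1.0])                   # starting point x_0
d0 = np.array([-4.0, -2.0])                # given initial direction (= -grad f(x_0))
d1 = np.array([1.0, -4.0])                 # one direction with d0^T Q d1 = 0
assert abs(d0 @ Q @ d1) < 1e-12            # Q-conjugacy check

for d in (d0, d1):
    g = Q @ x + b                          # gradient
    s = -(g @ d) / (d @ Q @ d)             # exact stepsize (4.4)
    x = x + s * d                          # update (4.3)
print(x)                                   # the minimizer [0, 0]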
The conjugate direction algorithm is very effective. However, to use the algorithm, we need to specify
all Q-conjugate directions before we start. This is computationally cumbersome for large problems.
Fortunately there is a way to generate Q-conjugate directions iteratively. In the next section, we discuss
an algorithm that incorporates the generation of Q-conjugate directions with the gradient of the objective
function.
4.3 Conjugate Gradient Method for Quadratic Problems
The conjugate gradient algorithm does not use predefined Q-conjugate directions; instead, it computes the directions as the algorithm progresses. At each iteration of the algorithm, the direction d_{k+1} is calculated as a linear combination of the previous direction and the current gradient, in such a way that all the directions are mutually Q-conjugate, hence the name conjugate gradient method. In the conjugate gradient algorithm the n conjugate directions are generated iteratively according to the rule
$$d_0 = -g_0, \qquad d_{k+1} = -g_{k+1} + \beta_k d_k, \quad k \ge 0, \qquad (4.5)$$
where the scalar β_k is determined from the requirement that d_{k+1} and d_k must be conjugate with respect to the matrix Q. Using (4.5) and Definition 4.1, i.e., requiring d_{k+1}^T Q d_k = (−g_{k+1} + β_k d_k)^T Q d_k = 0, we can easily determine β_k to be
$$\beta_k = \frac{g_{k+1}^T Q d_k}{d_k^T Q d_k}. \qquad (4.6)$$
This choice of Q-conjugate directions (4.5) may be seen as a modification of the steepest descent search direction: the method improves on the steepest descent algorithm by taking the previous search direction into consideration. The conjugate gradient method is summarized in Algorithm 4.2.
In the conjugate gradient algorithm, the directions d_0, d_1, ... are Q-conjugate. Thus, for a quadratic function, the conjugate gradient method terminates with an optimal solution x^* after at most n iterations. The conjugate gradient method generates the Q-conjugate directions d_0, d_1, ... iteratively, without the need to remember all previous directions. This remarkable property implies that the method requires little computational storage compared to the conjugate direction method.
Example 4.2. Consider the quadratic function f(x) = 4x_1^2 + x_2^2 − 2x_1 x_2. Find the minimizer using the conjugate gradient algorithm, using the starting point x_0 = [1, 1]^T.
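The following is a minimal NumPy sketch of the conjugate gradient iteration for a quadratic, combining the update (4.3) and stepsize (4.4) with the direction rule (4.5) and the formula (4.6); the function name is an assumption for illustration, and it is not a transcription of Algorithm 4.2. It is applied here to the data of Example 4.2, for which Q = [[8, -2], [-2, 2]] and b = 0.

import numpy as np

def conjugate_gradient_quadratic(Q, b, x0, tol=1e-10):
    """Minimize f(x) = 0.5 x^T Q x + b^T x + c for symmetric positive
    definite Q (equivalently, solve Q x = -b) using (4.3)-(4.6)."""
    x = np.asarray(x0, dtype=float)
    g = Q @ x + b                    # g_0
    d = -g                           # d_0 = -g_0
    for _ in range(len(b)):          # terminates in at most n iterations
        if np.linalg.norm(g) < tol:
            break
        Qd = Q @ d
        s = -(g @ d) / (d @ Qd)      # stepsize (4.4)
        x = x + s * d                # update (4.3)
        g = Q @ x + b                # new gradient g_{k+1}
        beta = (g @ Qd) / (d @ Qd)   # beta_k from (4.6)
        d = -g + beta * d            # new direction (4.5)
    return x

# Example 4.2: f(x) = 4 x1^2 + x2^2 - 2 x1 x2, starting from x_0 = [1, 1]^T.
Q = np.array([[8.0, -2.0], [-2.0, 2.0]])
b = np.zeros(2)
print(conjugate_gradient_quadratic(Q, b, [1.0, 1.0]))   # expect the minimizer [0, 0]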
4.4 Conjugate Gradient Method for General Nonlinear Functions
In the previous section, we described the conjugate gradient algorithm for minimizing a quadratic function.
It is natural to ask whether we can adapt a similar technique to minimize general nonlinear functions. Indeed, conjugate gradient methods can be used to estimate a minimizer of a non-quadratic objective function f without having to compute the second partial derivatives of f. The conjugate gradient algorithm can be extended to general nonlinear functions by interpreting the quadratic function (4.1) as a second-order Taylor series approximation of the general nonlinear objective function. Near a minimizer, such functions behave approximately like quadratic functions.
For a quadratic function the Hessian matrix Q is constant. However, for a general nonlinear function the
Hessian matrix has to be re-evaluated at each iteration of the algorithm. This can be computationally
expensive. Thus, an efficient implementation of the conjugate gradient algorithm that eliminates the
Hessian evaluation at each iteration is desirable.
Notice that the Hessian matrix Q appears only in the calculations of the stepsize s_k and the scalar β_k. Since we are dealing with a general nonlinear function f, and s_k is determined by the one-dimensional minimization of f(x_k + s d_k), we can compute it numerically using a line search. Hence, to eliminate the Hessian from the calculation of the stepsize s_k, the analytical expression (4.4) is replaced by a numerical line search procedure. In this regard, an inexact line search based on the Armijo condition or the Wolfe conditions is recommended.
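For concreteness, here is a minimal backtracking line search sketch that enforces the Armijo (sufficient decrease) condition; the function name and the parameter values are common defaults chosen for illustration, not prescribed by the lecture.

import numpy as np

def backtracking_line_search(f, grad_f, x, d, s0=1.0, rho=0.5, c=1e-4):
    """Return a stepsize s satisfying the Armijo condition
    f(x + s*d) <= f(x) + c*s*grad_f(x)^T d, by repeatedly shrinking s."""
    s = s0
    fx = f(x)
    slope = grad_f(x) @ d            # directional derivative; should be negative
    for _ in range(50):              # cap the number of reductions
        if f(x + s * d) <= fx + c * s * slope:
            break
        s *= rho
    return s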
Next, we only need to concern ourselves with the formula for β_k, the expression (4.6). Fortunately, elimination of Q from this formula is possible and results in algorithms that depend only on the function and gradient values at each iteration. We now discuss modifications of the conjugate gradient algorithm for a quadratic function for the case in which the Hessian is unknown but the objective value and gradient are available. The modifications are all based on algebraically manipulating the formula (4.6) for β_k in such a way that Q is eliminated.
We know that for any quadratic function, the Hessian maps the difference in position to the difference in gradients, that is,
$$Q(x_{k+1} - x_k) = s_k Q d_k = g_{k+1} - g_k. \qquad (4.7)$$
The above property says that changes in the gradient provide information about the second derivative along the search direction d_k. The quadratic function can be interpreted as the second-order Taylor approximation of a general nonlinear objective function. Using this fact and the relation (4.7), we can replace the calculation of β_k for a non-quadratic function by
$$\beta_k = \frac{g_{k+1}^T (g_{k+1} - g_k)}{d_k^T (g_{k+1} - g_k)}. \qquad (4.8)$$
We get this expression for β_k by substituting (4.7) into (4.6). Further simplification of (4.8) results in the following algorithms: the Polak-Ribiere method and the Fletcher-Reeves method.
The Polak-Ribiere method
In this case the denominator in formula (4.8) is multiplied out. By Lemma 4.2, d_k^T g_{k+1} = 0. Also, since d_k = −g_k + β_{k−1} d_{k−1}, premultiplying this by g_k^T and using g_k^T d_{k−1} = 0 (again by Lemma 4.2), we get
$$g_k^T d_k = g_k^T\left(-g_k + \beta_{k-1} d_{k-1}\right) = -g_k^T g_k.$$
Hence the denominator of (4.8) becomes d_k^T (g_{k+1} − g_k) = g_k^T g_k, and
$$\beta_k = \frac{g_{k+1}^T (g_{k+1} - g_k)}{g_k^T g_k}. \qquad (4.9)$$
The Fletcher-Reeves method
We now use the fact that g_{k+1}^T g_k = 0, which we get by using the equation
$$0 = g_{k+1}^T d_k = -g_{k+1}^T g_k + \beta_{k-1}\, g_{k+1}^T d_{k-1},$$
together with g_{k+1}^T d_{k−1} = 0 from Lemma 4.2. The numerator of (4.9) then simplifies, giving
$$\beta_k = \frac{g_{k+1}^T g_{k+1}}{g_k^T g_k}. \qquad (4.10)$$
If step 7 of Algorithm 4.3 is updated using the expression (4.9), the conjugate gradient method is known as the Polak-Ribiere algorithm. Similarly, the algorithm is known as Fletcher-Reeves if the expression (4.10) is used. Numerical testing suggests that the Polak-Ribiere method tends to be more efficient than the Fletcher-Reeves method.
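Putting the pieces together, here is a minimal sketch of a nonlinear conjugate gradient method using an Armijo backtracking line search in place of the exact stepsize (4.4), with β_k computed by either the Polak-Ribiere formula (4.9) or the Fletcher-Reeves formula (4.10). The function names, the parameter values, and the restart safeguard (resetting to the steepest descent direction whenever the new direction fails to be a descent direction) are assumptions added for illustration; this is not a transcription of Algorithm 4.3.

import numpy as np

def nonlinear_cg(f, grad_f, x0, variant="PR", max_iter=2000, tol=1e-6):
    """Nonlinear conjugate gradient sketch with Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    d = -g                                        # initial steepest descent direction
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        # Armijo backtracking line search (common default parameters).
        s, fx, slope = 1.0, f(x), g @ d
        while f(x + s * d) > fx + 1e-4 * s * slope and s > 1e-12:
            s *= 0.5
        x_new = x + s * d
        g_new = grad_f(x_new)
        if variant == "FR":
            beta = (g_new @ g_new) / (g @ g)        # Fletcher-Reeves (4.10)
        else:
            beta = (g_new @ (g_new - g)) / (g @ g)  # Polak-Ribiere (4.9)
        d = -g_new + beta * d
        if g_new @ d >= 0:                          # safeguard: restart with -g
            d = -g_new
        x, g = x_new, g_new
    return x

# Illustration on Rosenbrock's function (cf. Problem 12); its minimizer is [1, 1].
rosen = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
rosen_grad = lambda x: np.array([
    -400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
    200.0 * (x[1] - x[0]**2)])
print(nonlinear_cg(rosen, rosen_grad, [-2.0, 2.0], variant="PR"))
print(nonlinear_cg(rosen, rosen_grad, [-2.0, 2.0], variant="FR"))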
4.5 Summary
We discussed how the conjugate direction method uses the idea of Q-conjugacy of the search directions to minimize quadratic functions. Determining all the conjugate directions in advance can be computationally challenging, especially for large problems. The conjugate gradient method resolves this challenge by generating the directions iteratively. We derived the conjugate gradient method for minimizing convex quadratic functions, and then transferred the technique to minimization problems for general nonlinear functions. In this context, we discussed the Fletcher-Reeves and Polak-Ribiere variants of the conjugate gradient method. The two variants differ in the update formula for the scalar β_k, which in turn determines the search direction used at each iteration.
PROBLEM SET IV
1. Consider the following matrix:
$$Q = \begin{bmatrix} 3 & 0 & 1 \\ 0 & 4 & 2 \\ 1 & 2 & 3 \end{bmatrix}$$
using the conjugate direction algorithm with the initial point x0 = [0, 0]T .
4. The starting search direction from x = [2, 2]^T to minimize
is the negative gradient. Find a conjugate direction to the starting direction. Is it unique?
5. Consider the problem of minimizing the objective function
$$f(x) = \frac{5}{2} x_1^2 + \frac{1}{2} x_2^2 + 2 x_1 x_2 - 3 x_1 - x_2.$$
(a) Express f(x) in the standard quadratic form.
(b) Find the minimizer of f using the conjugate gradient algorithm. Use the starting point x_0 = [0, 0]^T.
(c) Calculate the minimizer of f explicitly from Q and b, and check it against your answer in part (b).
6. Show that in the conjugate gradient algorithm, the directions d0 , d1 , . . . , dn−1 generated via (4.5) are
Q-conjugate.
7. Use the Polak-Ribiere method with exact line search to locate the minimizer of the function
$$f(x) = \frac{1}{2}(x_1 + 1)^2 + (x_2 + 2)^2 + 3.$$
Start with the initial point x_0 = [1/2, 3/2]^T.
8. Use the Fletcher-Reeves method with exact line search to locate the minimizer of the function
$$f(x) = \frac{1}{2}(x_1 + 1)^2 + (x_2 + 2)^2 + 3. \qquad (4.11)$$
Start with the initial point x_0 = [1/2, 3/2]^T. Compare with the result from Problem 7.
9. Starting with the initial point x_0 = [1, 1]^T, use (1) the Polak-Ribiere method and (2) the Fletcher-Reeves method to find the minimizer of the objective functions
(a) f(x) = 3x_1^2 + x_2^2,
(b) f(x) = 4(x_1 − 5)^2 + (x_2 − 6)^2.
10. Consider the following algorithm for minimizing a function f:
$$x_{k+1} = x_k + s_k d_k,$$
where s_k = arg min_{s>0} f(x_k + s d_k). Let g_k = ∇f(x_k). Suppose f is quadratic with Hessian Q. We choose d_{k+1} = γ_k g_{k+1} + d_k, and we wish the directions d_k and d_{k+1} to be Q-conjugate. Find a formula for γ_k in terms of d_k, g_{k+1}, and Q.
11. Show that the conjugate gradient directions generated by (4.5) satisfy the descent condition.
12. Write a computer program to implement the conjugate gradient algorithm for general functions. Use the backtracking algorithm for the line search. Test the different formulas for β_k on Rosenbrock's function, with the initial point x_0 = [−2, 2]^T. For this exercise, reinitialize the update direction to the negative gradient every 6 iterations.