
School of Computer Science and Applied Mathematics

APPM 3017: Optimization III


Lecture 04 : Conjugate Gradient Methods
Lecturer: Matthews Sejeso Date: August 2021

LEARNING OUTCOMES
By the completion of this lecture you should be able to:
1. Describe the conjugate directions and conjugate gradient algorithms for quadratic functions.
2. Describe the conjugate gradient algorithm for general nonlinear functions.
3. Describe convergence properties of conjugate gradient algorithms.
References:
• Chapter 5, Jorge Nocedal and Stephen J. Wright, ‘Numerical Optimization’.
• Chapter 1, Dimitri P. Bertsekas, ‘Nonlinear Programming’.

4.1 Introduction

Conjugate direction methods were motivated by the need to improve the convergence rate of the steepest descent method without requiring the computation of ∇²f(x), as in Newton's method. They were originally introduced for solving the quadratic problem

min_{x ∈ R^n} f(x) = (1/2) x^T Q x + b^T x + c,    (4.1)

where Q is a symmetric positive definite matrix. Equivalently, they solve the linear system Qx = -b. The basic characteristic of conjugate direction methods is that they find the minimum of a quadratic function in a finite number of steps. These methods have been extended to find solutions of unconstrained optimization problems with general nonlinear objective functions.
We first derive the conjugate gradient method for minimizing convex quadratic functions. Then we transfer
the technique to minimization problems of general nonlinear functions.

4.2 Conjugate Direction Algorithm

The steepest descent method often takes steps in the same direction as earlier steps (the zig-zag effect). It would be better to get each step right the first time, so that we never need to step in that direction again. The idea is to select a set of mutually conjugate directions {d_0, d_1, ..., d_{n-1}}; in each direction we take exactly one step, and that step has exactly the right length to line up with the corresponding component of the optimal solution x*. Plain orthogonality of the directions {d_0, d_1, ..., d_{n-1}} is not sufficient to obtain the optimal solution in this way. Rather, the search directions {d_0, d_1, ..., d_{n-1}} are required to be conjugate with respect to the symmetric positive definite matrix Q of the quadratic. The formal definition is:

Definition 4.1. Let Q be a symmetric positive definite matrix. Then a set of non-zero vectors {d_0, d_1, ..., d_m} is said to be Q-conjugate if

d_i^T Q d_j = 0,  for all i ≠ j.    (4.2)
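As a quick numerical illustration of Definition 4.1 (a sketch of our own; the matrix and vectors below are chosen purely for illustration), assuming Python with NumPy:

import numpy as np

# A symmetric positive definite Q and two candidate directions (illustrative values).
Q = np.array([[4.0, 0.0],
              [0.0, 2.0]])
d0 = np.array([-4.0, -2.0])
d1 = np.array([1.0, -4.0])

# Definition 4.1: d_i and d_j are Q-conjugate if d_i^T Q d_j = 0.
print(d0 @ Q @ d1)   # 0.0, so d0 and d1 are Q-conjugate
print(d0 @ Q @ d0)   # 72.0 > 0, consistent with Q being positive definite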

Note that if Q = I, then two vectors are Q-conjugate precisely when they are orthogonal. A set of vectors {d_0, d_1, ..., d_{n-1}} must also be linearly independent for the algorithm to work. Lemma 4.1 relates Q-conjugacy to linear independence.

Lemma 4.1. Let Q ∈ Rn×n be a symmetric positive definite matrix. If the directions d0 , d1 , . . . , dk ∈
Rn , k ≤ n − 1, are nonzero and Q-conjugate, then they are linearly independent.
We now present the conjugate direction algorithm for minimizing the quadratic function of n variables

f(x) = (1/2) x^T Q x + b^T x + c,

where Q is a symmetric positive definite matrix. Note that because Q is positive definite, the function f has a unique global minimizer, which can be found by solving Qx = -b. The importance of Q-conjugacy lies in the fact that the quadratic function can then be minimized in n steps.
Since the search directions are Q-conjugate, and thus linearly independent, the idea is successive minimization along the individual directions. Given an arbitrary starting point x_0 and a set of Q-conjugate directions {d_0, d_1, ..., d_{n-1}}, the conjugate direction algorithm generates the sequence {x_k} by setting

x_{k+1} = x_k + s_k d_k,    (4.3)

where s_k is the stepsize, obtained by solving the one-dimensional minimization problem

s_k = arg min_{s>0} f(x_k + s d_k).

In particular, setting to zero the derivative of f(x_k + s d_k) with respect to s, that is,

d/ds f(x_k + s d_k) = (Q(x_k + s d_k) + b)^T d_k = g_k^T d_k + s d_k^T Q d_k = 0,

we obtain

s_k = -(g_k^T d_k)/(d_k^T Q d_k).    (4.4)

Here g_k represents the gradient of the objective function, i.e. g_k = ∇f(x_k) = Qx_k + b. The basic conjugate direction algorithm is summarized in Algorithm 4.1.

Algorithm 4.1 Conjugate direction algorithm

1: Calculate Q-conjugate directions d_0, d_1, ..., d_{n-1}.
2: Set the initial point x_0 and iteration counter k ← 0.
3: for k = 0, ..., n-1 do
4:   Compute the gradient g_k = Qx_k + b.
5:   Compute the stepsize s_k = -(g_k^T d_k)/(d_k^T Q d_k).
6:   Set x_{k+1} = x_k + s_k d_k.
7: end for
8: return x* = x_n.
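For concreteness, a minimal NumPy sketch of Algorithm 4.1 (the function name and interface are our own, not part of the notes; it assumes the Q-conjugate directions are supplied by the caller):

import numpy as np

def conjugate_direction(Q, b, x0, directions):
    """Minimize f(x) = 0.5 x^T Q x + b^T x + c along given Q-conjugate directions.

    Q is symmetric positive definite and `directions` holds the n
    Q-conjugate vectors d_0, ..., d_{n-1}.
    """
    x = np.asarray(x0, dtype=float)
    for d in directions:
        g = Q @ x + b                  # gradient g_k = Q x_k + b
        s = -(g @ d) / (d @ Q @ d)     # stepsize (4.4)
        x = x + s * d                  # update (4.3)
    return x

By Theorem 4.1 below, the point returned after the n steps of the loop is the exact minimizer (up to rounding).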

Theorem 4.1. For any x_0 ∈ R^n, the sequence {x_k} generated by the conjugate direction algorithm (4.3), (4.4) converges to the optimal solution x*, i.e. the unique solution of the linear system Qx = -b, in at most n steps.
Proof: Consider x* - x_0. Since the directions {d_i} are linearly independent, there exist constants γ_i, i = 0, ..., n-1, such that

x* - x_0 = γ_0 d_0 + γ_1 d_1 + ... + γ_{n-1} d_{n-1}.

Now premultiply both sides of the above equation by d_k^T Q, for 0 ≤ k < n, to obtain

d_k^T Q (x* - x_0) = γ_k d_k^T Q d_k,

where the terms d_k^T Q d_i, k ≠ i, vanish by the Q-conjugacy property. Hence

γ_k = (d_k^T Q (x* - x_0)) / (d_k^T Q d_k).

If x_k is generated by (4.3) and (4.4), then we can write

x_k = x_0 + s_0 d_0 + s_1 d_1 + ... + s_{k-1} d_{k-1}.

Therefore,

x_k - x_0 = s_0 d_0 + s_1 d_1 + ... + s_{k-1} d_{k-1}.

So writing

x* - x_0 = (x* - x_k) + (x_k - x_0),

and premultiplying this expression by d_k^T Q and using the Q-conjugacy property, we obtain

d_k^T Q (x* - x_0) = d_k^T Q (x* - x_k) = -d_k^T g_k,

where g_k = Qx_k + b and Qx* = -b. Thus

γ_k = -(d_k^T g_k)/(d_k^T Q d_k) = s_k,

and hence x* = x_n, which completes the proof. Q.E.D.
For a quadratic function of n variables, the conjugate direction method reaches the solution after n steps. The method also possesses a desirable property at the intermediate steps: the current gradient is orthogonal to all previous search directions. This property is stated and proven in the following lemma.

Lemma 4.2. In the conjugate direction algorithm,

g_{k+1}^T d_i = 0

for all k, 0 ≤ k ≤ n - 1, and 0 ≤ i ≤ k.


Proof: Note that

g_{k+1} - g_k = Q(x_{k+1} - x_k) = Q(x_k + s_k d_k - x_k) = s_k Q d_k.

Hence

g_{k+1} = g_k + s_k Q d_k,

and since the function f is quadratic, the exact line search stepsize is

s_k = -(g_k^T d_k)/(d_k^T Q d_k).

We prove the lemma by induction. We first show the result is true for k = 0, thus i = 0:

g_1^T d_0 = (Qx_1 + b)^T d_0 = x_1^T Q d_0 + b^T d_0
          = (x_0 - (g_0^T d_0)/(d_0^T Q d_0) d_0)^T Q d_0 + b^T d_0
          = (Qx_0 + b)^T d_0 - g_0^T d_0
          = g_0^T d_0 - g_0^T d_0 = 0.

Assume the result is true for k - 1, that is,

g_k^T d_i = 0, for 0 ≤ i ≤ k - 1.

We now show the result is true for k. Fix k > 0 and 0 ≤ i < k. By the induction hypothesis g_k^T d_i = 0. Because

g_{k+1} = g_k + s_k Q d_k,

and d_k^T Q d_i = 0 by Q-conjugacy, we have

g_{k+1}^T d_i = g_k^T d_i + s_k d_k^T Q d_i = 0.

It remains to be shown that g_{k+1}^T d_k = 0. Indeed,

g_{k+1}^T d_k = (Qx_{k+1} + b)^T d_k = x_{k+1}^T Q d_k + b^T d_k
             = (x_k + s_k d_k)^T Q d_k + b^T d_k
             = x_k^T Q d_k - (g_k^T d_k)/(d_k^T Q d_k) d_k^T Q d_k + b^T d_k
             = (Qx_k + b)^T d_k - g_k^T d_k
             = g_k^T d_k - g_k^T d_k = 0.

Therefore, by induction, g_{k+1}^T d_i = 0 for all 0 ≤ k ≤ n - 1 and 0 ≤ i ≤ k. Q.E.D.
This property is very important as it will be used extensively in subsequent sections.

Example 4.1. Suppose we want to minimize f(x) = 2x_1^2 + x_2^2 - 3 starting at x_0 = [1, 1]^T, with the initial direction being d_0 = [-4, -2]^T. Find a direction conjugate to the initial direction d_0 and find the minimizer.
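One possible working (a sketch; the algebra is easy to verify): here Q = diag(4, 2) and b = 0, so Q d_0 = [-16, -4]^T, and any d_1 with d_1^T Q d_0 = -16 d_{1,1} - 4 d_{1,2} = 0 will do, for example d_1 = [1, -4]^T. Running Algorithm 4.1 from x_0 = [1, 1]^T gives s_0 = 5/18, x_1 = [-1/9, 4/9]^T, then s_1 = 1/9 and x_2 = [0, 0]^T, which is the minimizer since the gradient [4x_1, 2x_2]^T vanishes there. Numerically, using the hypothetical helper sketched after Algorithm 4.1:

x_star = conjugate_direction(Q=np.diag([4.0, 2.0]), b=np.zeros(2), x0=[1.0, 1.0],
                             directions=[np.array([-4.0, -2.0]), np.array([1.0, -4.0])])
# x_star is approximately [0, 0]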

The conjugate direction algorithm is very effective. However, to use the algorithm, we need to specify
all Q-conjugate directions before we start. This is computationally cumbersome for large problems.
Fortunately there is a way to generate Q-conjugate directions iteratively. In the next section, we discuss
an algorithm that incorporates the generation of Q-conjugate directions with the gradient of the objective
function.

4.3 Conjugate Gradient Method for Quadratic Problems

The conjugate gradient algorithm does not use predefined Q-conjugate directions; instead, it computes the directions as the algorithm progresses. At each iteration, the direction d_{k+1} is calculated as a linear combination of the current gradient and the previous direction, in such a way that all the directions are mutually Q-conjugate, hence the name conjugate gradient method. In the conjugate gradient algorithm the n conjugate directions are generated iteratively according to the rule

d_0 = -g_0,    d_{k+1} = -g_{k+1} + β_k d_k,   k ≥ 0,    (4.5)

where the scalar β_k is determined from the requirement that d_{k+1} and d_k must be conjugate with respect to the matrix Q. Using (4.5) and Definition 4.1, we impose d_{k+1}^T Q d_k = (-g_{k+1} + β_k d_k)^T Q d_k = 0 and solve for β_k to obtain

β_k = (g_{k+1}^T Q d_k)/(d_k^T Q d_k).    (4.6)

This choice of Q-conjugate directions (4.5) may be seen as a modification of the steepest descent search direction. The method improves on the steepest descent algorithm by taking the previous search direction into consideration. The conjugate gradient method is summarized in Algorithm 4.2.

Algorithm 4.2 Conjugate Gradient Algorithm for Quadratic Functions

1: Set the initial point x_0 and iteration counter k ← 0.
2: Calculate the initial gradient g_0 = ∇f(x_0). If g_0 = 0, stop; else set d_0 = -g_0.
3: for k = 0, 1, ... do
4:   Compute the stepsize s_k = -(g_k^T d_k)/(d_k^T Q d_k).
5:   Set the new iterate x_{k+1} = x_k + s_k d_k.
6:   Compute the gradient g_{k+1} = ∇f(x_{k+1}). If g_{k+1} = 0, stop.
7:   Compute β_k = (g_{k+1}^T Q d_k)/(d_k^T Q d_k).
8:   Set the direction d_{k+1} = -g_{k+1} + β_k d_k.
9: end for
10: return x* = x_k.
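A minimal NumPy sketch of Algorithm 4.2 (our own illustration; the function name and the tolerance-based stopping test are assumptions, since an exactly zero gradient is rarely reached in floating point):

import numpy as np

def conjugate_gradient_quadratic(Q, b, x0, tol=1e-10):
    """Minimize f(x) = 0.5 x^T Q x + b^T x + c for symmetric positive definite Q."""
    x = np.asarray(x0, dtype=float)
    g = Q @ x + b                      # g_0 = grad f(x_0)
    d = -g                             # d_0 = -g_0
    for _ in range(len(x)):            # at most n iterations for a quadratic
        if np.linalg.norm(g) <= tol:
            break
        s = -(g @ d) / (d @ Q @ d)     # stepsize, as in (4.4)
        x = x + s * d                  # x_{k+1} = x_k + s_k d_k
        g_new = Q @ x + b              # g_{k+1}
        beta = (g_new @ (Q @ d)) / (d @ Q @ d)   # formula (4.6)
        d = -g_new + beta * d          # d_{k+1} = -g_{k+1} + beta_k d_k
        g = g_new
    return x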

In the conjugate gradient algorithm, the directions d_0, d_1, ... are Q-conjugate. Thus, for a quadratic function, the conjugate gradient method terminates with an optimal solution x* after at most n iterations. The conjugate gradient method generates the Q-conjugate directions d_0, d_1, ... iteratively, without the need to store all previous directions. This remarkable property means the method requires little computational storage compared to the conjugate direction method.

Example 4.2. Consider the quadratic function f(x) = 4x_1^2 + x_2^2 - 2x_1x_2. Find the minimizer using the conjugate gradient algorithm, starting from the point x_0 = [1, 1]^T.
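As a numerical check (using the hypothetical conjugate_gradient_quadratic sketch above): here Q = [[8, -2], [-2, 2]] and b = 0, so the unique minimizer is the origin, and the method reaches it in at most n = 2 iterations.

Q = np.array([[8.0, -2.0],
              [-2.0,  2.0]])
x_star = conjugate_gradient_quadratic(Q, b=np.zeros(2), x0=[1.0, 1.0])
# x_star is approximately [0, 0]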

4.4 Conjugate Gradient Method for Non-Quadratic Problems

In the previous section, we described the conjugate gradient algorithm for minimizing a quadratic function. It is natural to ask whether a similar technique can be used to minimize general nonlinear functions. Indeed, conjugate gradient methods can be used to estimate a minimizer of a non-quadratic objective function f without having to compute the second partial derivatives of f. The conjugate gradient algorithm can be extended to general nonlinear functions by interpreting the quadratic function (4.1) as a second-order Taylor series approximation of the general nonlinear objective function. Near a minimum, such functions behave approximately like quadratic functions.
For a quadratic function the Hessian matrix Q is constant. However, for a general nonlinear function the
Hessian matrix has to be re-evaluated at each iteration of the algorithm. This can be computationally
expensive. Thus, an efficient implementation of the conjugate gradient algorithm that eliminates the
Hessian evaluation at each iteration is desirable.
Notice that the Hessian matrix Q appears only in the calculation of the stepsize s_k and the scalar β_k. Since we are dealing with a general nonlinear function f, and s_k is determined by the one-dimensional minimization of f(x_k + s d_k), we can carry out this minimization numerically with a line search. Hence, to eliminate the Hessian from the calculation of the stepsize s_k, the analytical expression (4.4) can be replaced by a numerical line search procedure. In this regard, an inexact line search with Armijo's condition or the Wolfe conditions is recommended.
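For concreteness, a minimal backtracking sketch enforcing Armijo's condition (the parameter values are common defaults, not prescribed by these notes; x and d are NumPy arrays):

def backtracking_armijo(f, grad_f, x, d, s0=1.0, rho=0.5, c1=1e-4, max_halvings=60):
    """Shrink s until f(x + s d) <= f(x) + c1 * s * grad_f(x)^T d."""
    s = s0
    fx = f(x)
    slope = grad_f(x) @ d          # directional derivative; negative for a descent direction
    for _ in range(max_halvings):
        if f(x + s * d) <= fx + c1 * s * slope:
            break
        s *= rho
    return s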
Next, we only need to concern ourselves with the formula (4.6) for β_k. Fortunately, elimination of Q from this formula is possible and results in algorithms that depend only on the function and gradient values at each iteration. We now discuss modifications of the conjugate gradient algorithm for a quadratic function for the case in which the Hessian is unknown but the objective and gradient values are available. The modifications are all based on algebraically manipulating the formula (4.6) for β_k in such a way that Q is eliminated.
We know that for any quadratic function, the Hessian maps the difference in position to the difference in
gradients, that is

gk+1 − gk = Q(xk+1 − xk ) = sk Qdk . (4.7)

The above property says that changes in the gradient provide information about the second derivative along the search direction d_k. The quadratic function can be interpreted as the second-order Taylor approximation of a general nonlinear objective function. Using this fact and the relation (4.7), we can replace the calculation of β_k for a non-quadratic function by

β_k = (g_{k+1}^T (g_{k+1} - g_k))/(d_k^T (g_{k+1} - g_k)).    (4.8)

We obtain this expression by substituting (4.7) into (4.6): since Qd_k = (g_{k+1} - g_k)/s_k, the factor 1/s_k cancels from the numerator and denominator of (4.6). Further simplification of (4.8) results in the following algorithms: the Polak-Ribiere method and the Fletcher-Reeves method.
The Polak-Ribiere method
In this case the denominator in formula (4.8) is multiplied out. By Lemma 4.2, d_k^T g_{k+1} = 0. Also, since d_k = -g_k + β_{k-1} d_{k-1}, premultiplying by g_k^T and using Lemma 4.2 once again gives

g_k^T d_k = g_k^T (-g_k + β_{k-1} d_{k-1}) = -g_k^T g_k.

Hence the denominator becomes d_k^T(g_{k+1} - g_k) = g_k^T g_k, and we get

β_k = (g_{k+1}^T (g_{k+1} - g_k))/(g_k^T g_k).    (4.9)

This expression for β_k is known as the Polak-Ribiere formula.
The Fletcher-Reeves method
Here we start with the Polak-Ribiere formula (4.9) and multiply out the numerator to get

β_k = (g_{k+1}^T g_{k+1} - g_{k+1}^T g_k)/(g_k^T g_k).

We now use the fact that g_{k+1}^T g_k = 0, which follows from the equation

g_{k+1}^T d_k = -g_{k+1}^T g_k + β_{k-1} g_{k+1}^T d_{k-1},

together with Lemma 4.2 (both g_{k+1}^T d_k and g_{k+1}^T d_{k-1} vanish). This leads to

β_k = (g_{k+1}^T g_{k+1})/(g_k^T g_k),    (4.10)

which is called the Fletcher-Reeves formula.
The above formulas give us conjugate gradient algorithms that do not require explicit knowledge of the Hessian matrix Q. All we need are the objective function and gradient values at each iteration. For a quadratic function, the three formulas (4.8), (4.9) and (4.10) are exactly equal. However, this is not the case for a general nonlinear objective function.
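As a compact reference, the three updates (4.8)-(4.10) written as small functions on NumPy arrays (a sketch; the function names are ours):

def beta_48(g_new, g_old, d_old):
    """Formula (4.8), using only gradients and the previous direction."""
    y = g_new - g_old
    return (g_new @ y) / (d_old @ y)

def beta_polak_ribiere(g_new, g_old):
    """Formula (4.9), Polak-Ribiere."""
    return (g_new @ (g_new - g_old)) / (g_old @ g_old)

def beta_fletcher_reeves(g_new, g_old):
    """Formula (4.10), Fletcher-Reeves."""
    return (g_new @ g_new) / (g_old @ g_old)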
A few more slight modifications are needed to apply the algorithm to general nonlinear functions in practice. For nonquadratic problems the algorithm will not usually converge in n steps, and as the algorithm progresses, the Q-conjugacy of the direction vectors tends to deteriorate. Thus, a common practice is to reinitialize the direction vector to the negative gradient after every few iterations (e.g., every n or n + 1 iterations), and to continue until the algorithm satisfies the stopping criterion. The conjugate gradient algorithm for nonlinear functions is summarized in Algorithm 4.3.

Algorithm 4.3 Conjugate Gradient for Nonlinear Functions

1: Set the initial point x_0 and iteration counter k ← 0.
2: Calculate the initial gradient g_0 = ∇f(x_0). If g_0 = 0, stop; else set d_0 = -g_0.
3: while not converged do
4:   Compute the stepsize s_k using an inexact line search (e.g. backtracking with Armijo).
5:   Set the new iterate x_{k+1} = x_k + s_k d_k.
6:   Compute the gradient g_{k+1} = ∇f(x_{k+1}). If g_{k+1} = 0, stop.
7:   Compute β_k = (g_{k+1}^T (g_{k+1} - g_k))/(d_k^T (g_{k+1} - g_k)).
8:   Set the direction d_{k+1} = -g_{k+1} + β_k d_k.
9: end while
10: return x* = x_k.
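A minimal NumPy sketch of Algorithm 4.3 (our own illustration; it reuses the hypothetical backtracking_armijo and beta_* helpers sketched earlier, and the tolerance and restart period are assumptions):

import numpy as np

def nonlinear_cg(f, grad_f, x0, rule="pr", tol=1e-6, max_iter=1000, restart=None):
    """Conjugate gradient for a general nonlinear f, following Algorithm 4.3."""
    x = np.asarray(x0, dtype=float)
    restart = restart or len(x)            # reset to -gradient every n steps by default
    g = grad_f(x)
    d = -g
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:       # stopping criterion on the gradient norm
            break
        s = backtracking_armijo(f, grad_f, x, d)    # inexact line search (step 4)
        x = x + s * d                               # step 5
        g_new = grad_f(x)                           # step 6
        if rule == "pr":
            beta = beta_polak_ribiere(g_new, g)     # (4.9)
        else:
            beta = beta_fletcher_reeves(g_new, g)   # (4.10)
        d = -g_new + beta * d                       # step 8
        if (k + 1) % restart == 0:                  # periodic reinitialization
            d = -g_new
        g = g_new
    return x

A call such as nonlinear_cg(rosenbrock, rosenbrock_grad, x0=[-2.0, 2.0], rule="fr", restart=6) would then reproduce the setting of Problem 12, assuming suitable rosenbrock and rosenbrock_grad implementations (both hypothetical here).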

If step 7 of Algorithm 4.3 is updated using the expression (4.9), the conjugate gradient method is known as the Polak-Ribiere algorithm. Similarly, the algorithm is known as Fletcher-Reeves if the expression (4.10) is used. Numerical testing suggests that the Polak-Ribiere method tends to be more efficient than the Fletcher-Reeves method.

4.5 Summary

We discussed how the conjugate direction method uses the idea of Q-conjugacy of search directions to minimize quadratic functions. Predetermining the conjugate directions can be computationally challenging, especially for large problems. The conjugate gradient method resolves this challenge by updating the directions iteratively. We derived the conjugate gradient method for minimizing convex quadratic functions, and then transferred the technique to minimization problems for general nonlinear functions. In this context, we discussed the Fletcher-Reeves and Polak-Ribiere variants of the conjugate gradient method. The two variants differ in the update formula for the scalar β_k, which in turn determines the search direction.

PROBLEM SET IV
1. Consider the following matrix:

   Q = [ 3  0  1
         0  4  2
         1  2  3 ]

(a) Show that the matrix Q is positive definite.
(b) Let x0 = [1, 0, 0]^T; find d_1 and d_2 such that the set {d_0, d_1, d_2} is Q-conjugate.
2. Prove Lemma 4.1, that is, any set of nonzero Q-conjugate vectors is linearly independent.
3. Find the minimizer of the following quadratic function:

f(x) = 2x_1^2 + 2x_1x_2 + x_2^2 + x_1 - x_2,

using the conjugate direction algorithm with the initial point x_0 = [0, 0]^T.
4. The starting search direction from x = [2, 2]^T to minimize

f(x) = x_1^2 + x_1x_2 + x_2^2 - 3x_1 - 3x_2

is the negative gradient. Find a direction conjugate to the starting direction. Is it unique?
5. Consider the problem of minimizing the objective function

f(x) = (5/2)x_1^2 + (1/2)x_2^2 + 2x_1x_2 - 3x_1 - x_2.

(a) Express f(x) in the standard quadratic form.
(b) Find the minimizer of f using the conjugate gradient algorithm. Use the starting point x_0 = [0, 0]^T.
(c) Calculate the minimizer of f explicitly from Q and b, and check it against your answer in part (b).
6. Show that in the conjugate gradient algorithm, the directions d0 , d1 , . . . , dn−1 generated via (4.5) are
Q-conjugate.
7. Use the Polak-Ribiere method with exact line search to locate the minimizer of the function

f(x) = (1/2)(x_1 + 1)^2 + (x_2 + 2)^2 + 3.

Start with the initial point x_0 = [1/2, 3/2]^T.
8. Use the Fletcher-Reeves method with exact line search to locate the minimizer of the function

f(x) = (1/2)(x_1 + 1)^2 + (x_2 + 2)^2 + 3.    (4.11)

Start with the initial point x_0 = [1/2, 3/2]^T. Compare with the result from Problem 7.
9. Starting with the initial point x_0 = [1, 1]^T, use (1) Polak-Ribiere and (2) Fletcher-Reeves to find the minimizer of the objective functions
(a) f(x) = 3x_1^2 + x_2^2,
(b) f(x) = 4(x_1 - 5)^2 + (x_2 - 6)^2.

10. Consider the following algorithm for minimizing a function f:

x_{k+1} = x_k + s_k d_k,

where s_k = arg min_{s>0} f(x_k + s d_k). Let g_k = ∇f(x_k). Suppose f is quadratic with Hessian Q. We choose d_{k+1} = γ_k g_{k+1} + d_k, and we wish the directions d_k and d_{k+1} to be Q-conjugate. Find a formula for γ_k in terms of d_k, g_{k+1}, and Q.
11. Show that the conjugate gradient directions generated by (4.5) satisfy the descent condition.
12. Write a computer programme to implement the conjugate gradient algorithm for general functions. Use the backtracking algorithm for the line search. Test the different formulas for β_k on Rosenbrock's function, with initial point x_0 = [-2, 2]^T. For this exercise, reinitialize the update direction to the negative gradient every 6 iterations.
