Numerical Optimization: Penn State Math
555 Lecture Notes
Version 1.0
Christopher Griffin
© 2012
Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
With Contributions By:
Simon Miller
Douglas Mercer
Contents

List of Figures
Using These Notes

Chapter 1. Introduction to Optimization Concepts, Geometry and Matrix Properties
  1. Optimality and Optimization
  2. Some Geometry for Optimization
  3. Matrix Properties for Optimization

Chapter 2. Fundamentals of Unconstrained Optimization
  1. Mean Value and Taylor's Theorems
  2. Necessary and Sufficient Conditions for Optimality
  3. Concave/Convex Functions and Convex Sets
  4. Concave Functions and Differentiability

Chapter 3. Introduction to Gradient Ascent and Line Search Methods
  1. Gradient Ascent Algorithm
  2. Some Results on the Basic Ascent Algorithm
  3. Maximum Bracketing
  4. Dichotomous Search
  5. Golden Section Search
  6. Bisection Search
  7. Newton's Method
  8. Convergence of Newton's Method

Chapter 4. Approximate Line Search and Convergence of Gradient Ascent
  1. Approximate Search: Armijo Rule and Curvature Condition
  2. Algorithmic Convergence
  3. Rate of Convergence for Pure Gradient Ascent
  4. Rate of Convergence for Basic Ascent Algorithm

Chapter 5. Newton's Method and Corrections
  1. Newton's Method
  2. Convergence Issues in Newton's Method
  3. Newton Method Corrections

Chapter 6. Conjugate Direction Methods
  1. Conjugate Directions
  2. Generating Conjugate Directions
  3. The Conjugate Gradient Method
  4. Application to Non-Quadratic Problems

Chapter 7. Quasi-Newton Methods
  1. Davidon-Fletcher-Powell (DFP) Quasi-Newton Method
  2. Implementation and Example of DFP
  3. Derivation of the DFP Method
  4. Broyden-Fletcher-Goldfarb-Shanno (BFGS) Quasi-Newton Method
  5. Implementation of the BFGS Method

Chapter 8. Numerical Differentiation and Derivative Free Optimization
  1. Numerical Differentiation
  2. Derivative Free Methods: Powell's Method
  3. Derivative Free Methods: Hooke-Jeeves Method

Chapter 9. Linear Programming: The Simplex Method
  1. Linear Programming: Notation
  2. Polyhedral Theory and Linear Equations and Inequalities
  3. The Simplex Method
  4. Karush-Kuhn-Tucker (KKT) Conditions
  5. Simplex Initialization
  6. Revised Simplex Method
  7. Cycling Prevention
  8. Relating the KKT Conditions to the Tableau
  9. Duality

Chapter 10. Feasible Direction Methods and Quadratic Programming
  1. Preliminaries
  2. Frank-Wolfe Algorithm
  3. Farkas' Lemma and Theorems of the Alternative
  4. Preliminary Results: Feasible Directions, Improving Directions
  5. Fritz-John and Karush-Kuhn-Tucker Theorems
  6. Quadratic Programming and Active Set Methods

Chapter 11. Penalty and Barrier Methods, Sequential Quadratic Programming, Interior Point Methods
  1. Penalty Methods
  2. Sequential Quadratic Programming
  3. Barrier Methods
  4. Interior Point Simplex as a Barrier Method
  5. Interior Point Methods for Quadratic Programs

Bibliography
List of Figures

1.1  Plot with Level Sets Projected on the Graph of z. The level sets exist in R2, while the graph of z exists in R3. The level sets have been projected onto their appropriate heights on the graph.
1.2  Contour Plot of z = x2 + y2. The circles in R2 are the level sets of the function. The lighter the circle hue, the higher the value of c that defines the level set.
1.3  A Line Function: The points in the graph shown in this figure are in the set produced using the expression x0 + ht where x0 = (2, 1) and h = (2, 2).
1.4  A Level Curve Plot with Gradient Vector: We've scaled the gradient vector in this case to make the picture understandable. Note that the gradient is perpendicular to the level set curve at the point (1, 1), where the gradient was evaluated. You can also note that the gradient is pointing in the direction of steepest ascent of z(x, y).
2.1  An illustration of the mean value theorem in one variable. The multi-variable mean value theorem is simply an application of the single variable mean value theorem applied to a slice of a function.
2.2  A convex function: A convex function satisfies the expression f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2) for all x1 and x2 and λ ∈ [0, 1].
2.3  (Left) A simple quartic function with two local maxima and one local minimum. (Right) A segment of the function that is locally concave.
3.1  Dichotomous Search iteratively refines the size of a bracket containing a maximum of the function φ by splitting the bracket in two.
3.2  A non-concave function with a maximum on the interval [0, 15].
3.3  The relative sizes of the interval and sub-interval lengths in a Golden Section Search.
3.4  A function for which Golden Section Search (and Dichotomous Search) might fail to find a global solution.
4.1  The function f(x, y) has a maximum at x = y = 0, where the function attains a value of 1.
4.2  A plot of φ(t) illustrates the function increases as we approach the global maximum of the function and then decreases.
4.3  The Wolfe Conditions are illustrated. Note the region accepted by the Armijo rule intersects with the region accepted by the curvature condition to bracket the (closest local) maximum for δk. Here σ1 = 0.15 and σ2 = 0.5.
4.4  We illustrate the failure of the gradient ascent method to converge to a stationary point when we do not use the Armijo rule or minimization.
4.5  Gradient ascent is illustrated on the function F(x, y) = −2x2 − 10y2 starting at x = 15, y = 5. The zig-zagging motion is typical of the gradient ascent algorithm in cases where λn and λ1 are very different (see Theorem 4.20).
5.1  Newton's Method converges for the function F(x, y) = −2x2 − 10y4 in 11 steps, with minimal zigzagging.
5.2  A double peaked function with a local minimum between the peaks. This function also has saddle points.
5.3  A simple modification to Newton's method first used by Gauss. While H(xk) is not negative definite, we use a gradient ascent to converge to the neighborhood of a stationary point (ideally a local maximum). We then switch to a Newton step.
5.4  Modified Newton's method uses the modified Cholesky decomposition and efficient linear solution methods to find an ascent direction in the case when the Hessian matrix is not negative definite. This algorithm converges superlinearly, as illustrated in this case.
6.1  The steps of the conjugate gradient algorithm applied to F(x, y).
6.2  In this example, the conjugate gradient method also converges in four total steps, with much less zig-zagging than the gradient descent method or even Newton's method.
7.1  The steps of the DFP algorithm applied to F(x, y).
7.2  The steps of the DFP algorithm applied to F(x, y).
8.1  A comparison of the BFGS method using numerical gradients vs. exact gradients.
8.2  Powell's Direction Set Method applied to a bimodal function and a variation of Rosenbrock's function. Notice the impact the valley has on the steps in Rosenbrock's method.
8.3  Hooke-Jeeves algorithm applied to a bimodal function.
8.4  Hooke-Jeeves algorithm applied to a bimodal function.
9.1  A hyperplane in 3 dimensional space: A hyperplane is the set of points satisfying an equation aT x = b, where b is a constant in R, a is a constant vector in Rn and x is a variable vector in Rn. The equation is written as a matrix multiplication using our assumption that all vectors are column vectors.
9.2  Two half-spaces defined by a hyper-plane: A half-space is so named because any hyper-plane divides Rn (the space in which it resides) into two halves, the side "on top" and the side "on the bottom."
9.3  An Unbounded Polyhedral Set: This unbounded polyhedral set has many directions. One direction is [0, 1]T.
9.4  Boundary Point: A boundary point of a (convex) set C is a point in the set so that every ball of any radius centered at the point contains some points inside C and some points outside C.
9.5  A Polyhedral Set: This polyhedral set is defined by five half-spaces and has a single degenerate extreme point located at the intersection of the binding constraints 3x1 + x2 ≤ 120, x1 + 2x2 ≤ 160 and (28/16)x1 + x2 ≤ 100. All faces are shown in bold.
9.6  Visualization of the set D: This set really consists of the set of points on the red line. This is the line where d1 + d2 = 1 and all other constraints hold. This line has two extreme points (0, 1) and (1/2, 1/2).
9.7  The Carathéodory Characterization Theorem: Extreme points and extreme directions are used to express points in a bounded and unbounded set.
9.8  The Simplex Algorithm: The path around the feasible region is shown in the figure. Each exchange of a basic and non-basic variable moves us along an edge of the polygon in a direction that increases the value of the objective function.
9.9  The Gradient Cone: At optimality, the cost vector c is obtuse with respect to the directions formed by the binding constraints. It is also contained inside the cone of the gradients of the binding constraints, which we will discuss at length later.
10.1 (a) The steps of the Frank-Wolfe Algorithm when maximizing −(x − 2)2 − (y − 2)2 over the set of (x, y) satisfying the constraints x + y ≤ 1 and x, y ≥ 0. (b) The steps of the Frank-Wolfe Algorithm when maximizing −7(x − 20)2 − 6(y − 40)2 over the set of (x, y) satisfying the constraints 3x + y ≤ 120, x + 2y ≤ 160, x ≤ 35 and x, y ≥ 0. (c) The steps of the Frank-Wolfe Algorithm when maximizing −7(x − 40)2 − 6(y − 40)2 over the set of (x, y) satisfying the constraints 3x + y ≤ 120, x + 2y ≤ 160, x ≤ 35 and x, y ≥ 0.
10.2 System 2 has a solution if (and only if) the vector c is contained inside the positive cone constructed from the rows of A.
10.3 System 1 has a solution if (and only if) the vector c is not contained inside the positive cone constructed from the rows of A.
10.4 An example of Farkas' Lemma: The vector c is inside the positive cone formed by the rows of A, but c′ is not.
10.5 The path taken when solving the proposed quadratic programming problem using the active set method. Notice we tend to hug the outside of the polyhedral set.
Using These Notes
Stop! This is a set of lecture notes. It is not a book. Go away and come back when you
have a real textbook on Numerical Optimization. Okay, do you have a book? Alright, let’s
move on then. This is a set of lecture notes for Math 555–Penn State’s graduate Numerical
Optimization course. Since I use these notes while I teach, there may be typographical errors
that I noticed in class, but did not fix in the notes. If you see a typo, send me an e-mail and
I’ll add an acknowledgement. There may be many typos, that’s why you should have a real
textbook.
The lecture notes are loosely based on Nocedal and Wright’s book Numerical Optimization, Avriel’s text on Nonlinear Optimization, Bazaraa, Sherali and Shetty’s book on Nonlinear Programming, Bazaraa, Jarvis and Sherali’s book on Linear Programming and several
other books that are cited in these notes. All of the books mentioned are good books (some
great). The problem is, some books don’t cover things in enough depth. The other problem
is for students taking this course, this may be the first time they’re seeing optimization, so
we have to cover some preliminaries. Our Math 555 course should really be two courses: one
on theory and the other on practical algorithms. Apparently we’re not that interested, so
we offer everything in one course.
This set of notes corrects some of the problems I mention by presenting the material in a format that can be used easily at Penn State in Math 555. These notes are probably really inappropriate if you are at a school with a strong Operations Research program. You'd be better off
reading Nocedal and Wright’s book directly.
In order to use these notes successfully, you should know something about multi-variable
calculus. It also wouldn’t hurt to have had an undergraduate treatment in optimization (in
some form). At Penn State, the only prerequisite for this course is Math 456, which is a
numerical methods course. That could be useful for some computational details, but I’ll
review everything that you’ll need. That being said, I hope you enjoy using these notes!
One last thing: the algorithms in these notes were coded using Maple. I’ve also coded
most of the algorithms in C++. The code will be posted (eventually – perhaps it’s already
there). Until then, you can e-mail me if you want the code. I can be reached at griffin
‘at’ ieee.org.
CHAPTER 1
Introduction to Optimization Concepts, Geometry and Matrix
Properties
1. Optimality and Optimization
Definition 1.1. Let z : D ⊆ Rn → R. The point x∗ is a global maximum for z if for all
x ∈ D, z(x∗ ) ≥ z(x). A point x∗ ∈ D is a local maximum for z if there is a neighborhood
S ⊆ D of x∗ (i.e., x∗ ∈ S) so that for all x ∈ S, z(x∗) ≥ z(x). When the foregoing inequalities are strict for all x ≠ x∗, the point x∗ is called a strict global or local maximum.
Remark 1.2. Clearly Definition 1.1 is valid only for domains and functions where the
concept of a neighborhood is defined and understood. In general, S must be a topologically
connected set (as it is in a neighborhood in Rn ) in order for this definition to be used or at
least we must be able to define the concept of neighborhood on the set.
Definition 1.3 (Maximization Problem). Let z : D ⊆ Rn → R; for i = 1, . . . , m,
gi : D ⊆ Rn → R; and for j = 1, . . . , l, hj : D ⊆ Rn → R be functions. Then the general maximization problem with objective function z(x1, . . . , xn), inequality constraints gi(x1, . . . , xn) ≤ bi (i = 1, . . . , m) and equality constraints hj(x1, . . . , xn) = rj (j = 1, . . . , l) is written as:

(1.1)  max  z(x1, . . . , xn)
       s.t. g1(x1, . . . , xn) ≤ b1
            ...
            gm(x1, . . . , xn) ≤ bm
            h1(x1, . . . , xn) = r1
            ...
            hl(x1, . . . , xn) = rl
Remark 1.4. Expression 1.1 is also called a mathematical programming problem. Naturally when constraints are involved we define the global and local maxima for the objective
function z(x1 , . . . , xn ) in terms of the feasible region instead of the entire domain of z, since
we are only concerned with values of x1 , . . . , xn that satisfy our constraints.
Remark 1.5. When there are no constraints (or the only constraint is that (x1 , . . . , xn ) ∈
Rn ), the problem is called an unconstrained maximization problem.
Example 1.6. Let’s recall a simple optimization problem from differential calculus (Math
140): Goats are an environmentally friendly and inexpensive way to control a lawn when
there are lots of rocks or lots of hills. (Seriously, both Google and some U.S. Navy bases use
goats on rocky hills instead of paying lawn mowers!)
Suppose I wish to build a pen to keep some goats. I have 100 meters of fencing and I
wish to build the pen in a rectangle with the largest possible area. How long should the sides
of the rectangle be? In this case, making the pen better means making it have the largest
possible area.
We can write this problem as:

(1.2)  max  A(x, y) = xy
       s.t. 2x + 2y = 100
            x ≥ 0
            y ≥ 0
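Although we will develop general machinery for problems like this, (1.2) can be solved directly by substitution: the fence constraint gives y = 50 − x, so A = x(50 − x) = 50x − x2, and setting A′(x) = 50 − 2x = 0 yields x = y = 25, a square pen with maximum area 625 square meters.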
Remark 1.7. It is easy to see that when we replace the objective function in Problem 1.1
with the objective function −z(x1 , . . . , xn ), then the solution to this new problem minimizes
z(x1 , . . . , xn ) subject to the constraints of Problem 1.1. It therefore suffices to know how to
solve either maximization or minimization problems, since one can be converted easily into
the other.
2. Some Geometry for Optimization
Remark 1.8. We’ll denote vectors in Rn in boldface. So x ∈ Rn is an n-dimensional
vector and we have x = (x1, . . . , xn). We'll always associate an n-dimensional vector with an n × 1 matrix (column vector) unless otherwise noted. Thus, when we write x ∈ Rn we also
mean x ∈ Rn×1 (the set of n × 1 matrices with entries from R).
Definition 1.9 (Dot Product). Recall that if x, y ∈ Rn are two n-dimensional vectors,
then the dot product (scalar product) is:
(1.3)  x · y = x1 y1 + x2 y2 + · · · + xn yn
where xi is the ith component of the vector x. Clearly if x, y ∈ Rn×1, then
(1.4)  x · y = xT y
where xT ∈ R1×n is the transpose of x when treated as a matrix.
Lemma 1.10. Let x, y ∈ Rn and let θ be the angle between x and y, then
(1.5)  x · y = ||x|| ||y|| cos θ
Remark 1.11. The preceding lemma can be proved using the law of cosines from
trigonometry. The following small lemma follows and is proved as Theorem 1 of [MT03]:
Lemma 1.12. Let x, y ∈ Rn . Then the following hold:
(1) The angle between x and y is less than π/2 (i.e., acute) iff x · y > 0.
(2) The angle between x and y is exactly π/2 (i.e., the vectors are orthogonal) iff x · y =
0.
(3) The angle between x and y is greater than π/2 (i.e., obtuse) iff x · y < 0.
Lemma 1.13 (Schwarz Inequality). Let x, y ∈ Rn. Then:
(1.6)  (xT y)2 ≤ (xT x)(yT y)
This is equivalent to:
(1.7)  (xT y)2 ≤ ||x||2 ||y||2
Definition 1.14 (Graph). Let z : D ⊆ Rn → R be a function; then the graph of z is the set of (n + 1)-tuples:
(1.8)  {(x, z(x)) ∈ Rn+1 | x ∈ D}
When z : D ⊆ R → R, the graph is precisely what you’d expect. It’s the set of pairs
(x, y) ∈ R2 so that y = z(x). This is the graph that you learned about back in Algebra 1.
Definition 1.15 (Level Set). Let z : Rn → R be a function and let c ∈ R. Then the
level set of value c for function z is the set:
(1.9)  {x = (x1, . . . , xn) ∈ Rn | z(x) = c} ⊆ Rn
Example 1.16. Consider the function z = x2 + y 2 . The level set of z at 4 is the set of
points (x, y) ∈ R2 such that:
(1.10)  x2 + y2 = 4
You will recognize this as the equation for a circle of radius 2. We illustrate this in the
following two figures. Figure 1.1 shows the level sets of z as they sit on the 3D plot of the
function, while Figure 1.2 shows the level sets of z in R2 . The plot in Figure 1.2 is called a
contour plot.
Figure 1.1. Plot with Level Sets Projected on the Graph of z. The level sets exist in R2, while the graph of z exists in R3. The level sets have been projected onto their appropriate heights on the graph.
Definition 1.17. (Line) Let x0 , h ∈ Rn . Then the line defined by vectors x0 and h is
the function l(t) = x0 + th. Clearly l : R → Rn . The vector h is called the direction of the
line.
Example 1.18. Let x0 = (2, 1) and let h = (2, 2). Then the line defined by x0 and h
is shown in Figure 1.3. The set of points on this line is the set L = {(x, y) ∈ R2 : x =
2 + 2t, y = 1 + 2t, t ∈ R}.
4
1. INTRODUCTION TO OPTIMIZATION CONCEPTS, GEOMETRY AND MATRIX PROPERTIES
Figure 1.2. Contour Plot of z = x2 + y 2 . The circles in R2 are the level sets of the
function. The lighter the circle hue, the higher the value of c that defines the level
set.
Figure 1.3. A Line Function: The points in the graph shown in this figure are in
the set produced using the expression x0 + ht where x0 = (2, 1) and h = (2, 2).
Definition 1.19 (Directional Derivative). Let z : Rn → R and let h ∈ Rn be a vector
(direction) in n-dimensional space. Then the directional derivative of z at point x0 ∈ Rn in
the direction of h is
(1.11)  d/dt z(x0 + th) |t=0
when this derivative exists.
Proposition 1.20. The directional derivative of z at x0 in the direction h is equal to:
(1.12)  lim t→0 [z(x0 + th) − z(x0)] / t
Exercise 1. Prove Proposition 1.20. [Hint: Use the definition of derivative for a univariate function and apply it to the definition of directional derivative and evaluate t = 0.]
Definition 1.21 (Gradient). Let z : Rn → R be a function and let x0 ∈ Rn. Then the gradient of z at x0 is the vector in Rn given by:
(1.13)  ∇z(x0) = ( ∂z/∂x1 (x0), . . . , ∂z/∂xn (x0) )
Gradients are extremely important concepts in optimization (and vector calculus in general). Gradients have many useful properties that can be exploited. The relationship between
the directional derivative and the gradient is of critical importance.
Theorem 1.22. If z : Rn → R is differentiable, then all directional derivatives exist.
Furthermore, the directional derivative of z at x0 in the direction of h is given by:
(1.14)  ∇z(x0) · h
where · denotes the dot product of two vectors.
Proof. Let l(t) = x0 + ht. Then l(t) = (l1 (t), . . . , ln (t)); that is, l(t) is a vector function
whose ith component is given by li (t) = x0i + hi t.
Apply the chain rule:
(1.15)  dz(l(t))/dt = (∂z/∂l1)(dl1/dt) + · · · + (∂z/∂ln)(dln/dt)
Thus:
(1.16)  d/dt z(l(t)) = ∇z · dl/dt
Clearly dl/dt = h. We have l(0) = x0. Thus:
(1.17)  d/dt z(x0 + th) |t=0 = ∇z(x0) · h
We now come to the two most important results about gradients: (i) the fact that the gradient always points in the direction of steepest ascent of a function and (ii) that it is perpendicular (normal) to the level curves of the function. We can exploit these facts as we seek to maximize (or minimize) functions.
Theorem 1.23. Let z : Rn → R be differentiable, x0 ∈ Rn. If ∇z(x0) ≠ 0, then ∇z(x0)
points in the direction in which z is increasing fastest.
Proof. Recall ∇z(x0 ) · h is the directional derivative of z in direction h at x0 . Assume
that h is a unit vector. We know that:
(1.18)
∇z(x0 ) · h = ||∇z(x0 )|| cos θ
(because we assumed h was a unit vector) where θ is the angle between the vectors ∇z(x0 )
and h. The function cos θ is largest when θ = 0, that is when h and ∇z(x0 ) are parallel
vectors. (If ∇z(x0 ) = 0, then the directional derivative is zero in all directions.)
Theorem 1.24. Let z : Rn → R be differentiable and let x0 lie in the level set S defined
by z(x) = k for fixed k ∈ R. Then ∇z(x0 ) is normal to the set S in the sense that if h
is a tangent vector at t = 0 of a path c(t) contained entirely in S with c(0) = x0 , then
∇z(x0 ) · h = 0.
Remark 1.25. Before giving the proof, we illustrate this theorem in Figure 1.4. The
function is z(x, y) = x4 + y 2 + 2xy and x0 = (1, 1). At this point ∇z(x0 ) = (6, 4). We include
the tangent line to the level set at the point (1,1) to illustrate the normality of the gradient
to the level curve at the point.
Figure 1.4. A Level Curve Plot with Gradient Vector: We’ve scaled the gradient
vector in this case to make the picture understandable. Note that the gradient
is perpendicular to the level set curve at the point (1, 1), where the gradient was
evaluated. You can also note that the gradient is pointing in the direction of steepest
ascent of z(x, y).
Proof. As stated, let c(t) be a curve in S. Then c : R → Rn and z(c(t)) = k for all
t ∈ R. Let h be the tangent vector to c at t = 0; that is:
(1.19)  dc(t)/dt |t=0 = h
Differentiating z(c(t)) with respect to t using the chain rule and evaluating at t = 0 yields:
(1.20)  d/dt z(c(t)) |t=0 = ∇z(c(0)) · h = ∇z(x0) · h = 0
Thus ∇z(x0 ) is perpendicular to h and thus normal to the set S as required.
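This computation is easy to verify; the following Maple sketch evaluates ∇z(1, 1) for the function of Remark 1.25 and then computes the directional derivative along h = (2, −3), a direction tangent to the level curve at (1, 1):

z := (x, y) -> x^4 + y^2 + 2*x*y:
gx := eval(diff(z(x, y), x), [x = 1, y = 1]);   # 6
gy := eval(diff(z(x, y), y), [x = 1, y = 1]);   # 4
# directional derivative along h = (2, -3) via Definition 1.19:
phi := t -> z(1 + 2*t, 1 - 3*t):
eval(diff(phi(t), t), t = 0);                   # 0 = 6*2 + 4*(-3), as Theorem 1.24 predicts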
3. Matrix Properties for Optimization
Definition 1.26 (Definiteness). A matrix M ∈ Rn×n is positive semi-definite if for all
x ∈ Rn with x 6= 0, xT Mx ≥ 0. The matrix M is positive definite if for all x ∈ Rn with
x 6= 0, xT Mx > 0. The matrix M is negative definite if −M is positive definite and negative
semi-definite if −M is positive semi-definite. If M satisfies none of these properties, then M
is indefinite.
Remark 1.27. We note that this is not the most general definition of matrix definiteness.
In general, matrix definiteness can be defined for complex matrices and has specialization to
Hermitian matrices.
Remark 1.28. Positive semi-definiteness is also called non-negative definiteness and
negative semi-definiteness is also called non-positive definiteness.
Lemma 1.29. A matrix M ∈ Rn×n is positive definite if and only if for every vector x ∈ Rn with x ≠ 0, there is an α ∈ R+ (that is, α > 0) such that xT Mx > αxT x.[1]

[1] Thanks to Doug Mercer for pointing out that α needs to be positive.
Exercise 2. Prove Lemma 1.29.
Lemma 1.30. Suppose that M is positive definite. If λ ∈ R is a real eigenvalue of M
with real eigenvector x, then λ > 0.
Exercise 3. Prove Lemma 1.30.
Lemma 1.31. An invertible matrix M ∈ Rn×n has eigenvalue λ if and only if M−1 has eigenvalue 1/λ.
Exercise 4. Prove Lemma 1.31.
Remark 1.32. The theory of eigenvalues and eigenvectors of matrices is deep and well
understood. A substantial part of this theory should be covered in Math 436, for those
interested. The following two results are proved in several different sources. One very nice
proof is in Chapter 8 of [GR01]. Unfortunately, the proofs are well outside the scope of the
class.
Theorem 1.33 (Spectral Theorem for Real Symmetric Matrices). Let M ∈ Rn×n be a
symmetric matrix. Then the eigenvalues of M are all real.
Theorem 1.34 (Principal Axis Theorem). Let M ∈ Rn×n be a symmetric matrix. Then Rn has an orthogonal basis consisting of eigenvectors of M.
Lemma 1.35. Let M ∈ Rn×n be a symmetric matrix and suppose that λ1 ≥ · · · ≥ λn
are the eigenvalues of M. If x ∈ Rn is a vector with ||x|| = 1, then xT Mx ≤ λ1 and
xT Mx ≥ λn .
Proof. From the Principal Axis Theorem, we know that there is an orthonormal basis B = {v1, . . . , vn} consisting of eigenvectors of M. For any x ∈ Rn, we can write:
(1.21)  x = (xT v1)v1 + · · · + (xT vn)vn
That is, in the coordinates of B, x = (xT v1, . . . , xT vn)T. It follows then that:
Mx = (xT v1)Mv1 + · · · + (xT vn)Mvn = (xT v1)λ1 v1 + · · · + (xT vn)λn vn = λ1(xT v1)v1 + · · · + λn(xT vn)vn
Pre-multiplying by xT yields:
xT Mx = λ1(xT v1)2 + · · · + λn(xT vn)2
From Equation 1.21 and our assumption on ||x||, we see that[2]:
1 = ||x||2 = (xT v1)2 + · · · + (xT vn)2
Since λ1 ≥ · · · ≥ λn, we have:
xT Mx = λ1(xT v1)2 + · · · + λn(xT vn)2 ≤ λ1(xT v1)2 + · · · + λ1(xT vn)2 = λ1
The fact that xT Mx ≥ λn follows by a similar argument. This completes the proof.

[2] If this is not obvious, note that the length of a vector is an invariant. That is, it is the same no matter in which basis the vector is expressed.
Theorem 1.36. Suppose M ∈ Rn×n is symmetric. If every eigenvalue of M is positive,
then M is positive definite.
Proof. By the spectral theorem, all the eigenvalues of M are real and thus we can write them in order as λ1 ≥ · · · ≥ λn. By Lemma 1.35 (applied to x/||x||), we know that xT Mx ≥ λn xT x for any x. By our assumption λn > 0 and thus xT Mx > 0 for all x ≠ 0.
Corollary 1.37. If M is a positive definite matrix, then its inverse is also positive
definite.
Proof. Applying Lemma 1.31 and the preceding theorem yields the result.
Remark 1.38. The proofs of the preceding two results can be found in [Ant94] (and most likely later editions). These are fairly standard proofs though and are available in most complete linear algebra texts.
Remark 1.39. Thus we've proved that a symmetric real matrix is positive definite if and only if every eigenvalue is positive. We can use this fact to obtain a simple test for positive definiteness.[3]
Definition 1.40 (Gerschgorin Disc). Let A ∈ Cn×n and let
Ri = Σ_{j ≠ i} |Aij|
That is, Ri is the sum of the absolute values of the off-diagonal entries of row i. Let D(Aii, Ri) be a closed disc centered at Aii with radius Ri. This is a Gerschgorin disc. When A ∈ Rn×n, this disc is a closed interval.
Remark 1.41. The following theorem was first proved in [Ger31] and is now a classical
matrix theorem, used frequently in control theory.
Theorem 1.42 (Gerschgorin Disc Theorem). Let A ∈ Cn×n . Every eigenvalue of A lies
in a Gerschgorin disc.
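For example, if A = [[4, 1], [1, 3]], then R1 = R2 = 1 and the Gerschgorin discs are the intervals [3, 5] and [2, 4]; the eigenvalues of A are (7 ± √5)/2 ≈ 4.62 and 2.38, each of which lies in one of the discs.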
Corollary 1.43. Let M ∈ Rn×n be symmetric. If
(1.22)  Mii > Σ_{j ≠ i} |Mij|
for each i = 1, . . . , n, then M is positive definite.
Exercise 5. Prove Corollary 1.43.
Exercise 6. Construct an example in which the converse of Corollary 1.43 does not
hold or prove that it must hold.
Remark 1.44. The last theorem of this section, which we will not prove, shows that
real symmetric positive definite matrices have a special unique decomposition (called the
Cholesky decomposition). (Actually, positive definite Hermitian matrices have a unique
Cholesky decomposition and this generalizes the result on real symmetric matrices that are
positive definite.) A proof can be found in Chapter 2 of [Dem97].
Theorem 1.45. A matrix M ∈ Rn×n is symmetric positive definite if and only if there is a (unique) lower triangular matrix H ∈ Rn×n such that M = HHT.
[3] Thanks to Xiao Chen for pointing out we needed the matrix to be symmetric.
Definition 1.46. The previous matrix factorization is called the Cholesky Decomposition. It serves as a test for positive definiteness in symmetric matrices that is less computationally complex than computing the eigenvalues of the matrix.
Remark 1.47. We note that Theorem 1.45 could also be stated with an upper triangular matrix H so that M = HT H. Some numerical systems implement this type of Cholesky decomposition, e.g., Mathematica. The example and code we provide compute the lower triangular matrix variation.
Example 1.48 (The Cholesky Decomposition). There is a classic algorithm for computing the Cholesky decomposition, which is available in [Dem97] (Algorithm 2.11). Instead of using the algorithm, we illustrate the algorithm with a more protracted example. Consider the matrix:

M = [  8  −2   7 ]
    [ −2  10   5 ]
    [  7   5  12 ]

We can construct the LU decomposition using Gaussian elimination without pivoting to obtain:

[  8  −2   7 ]   [  8    0      0   ]   [ 1  −1/4   7/8  ]
[ −2  10   5 ] = [ −2  19/2     0   ] · [ 0    1   27/38 ]
[  7   5  12 ]   [  7  27/4   41/38 ]   [ 0    0     1   ]

Notice the determinant of the U matrix is 1. The L matrix can be written as L = L1 D where D is a diagonal matrix whose determinant is equal to the determinant of M and L1 is a lower triangular matrix with determinant 1. This yields:

[  8  −2   7 ]   [  1     0    0 ]   [ 8    0     0   ]   [ 1  −1/4   7/8  ]
[ −2  10   5 ] = [ −1/4   1    0 ] · [ 0  19/2    0   ] · [ 0    1   27/38 ]
[  7   5  12 ]   [  7/8 27/38  1 ]   [ 0    0   41/38 ]   [ 0    0     1   ]

Notice L1T = U and, more importantly, notice that the diagonal of D is composed of ratios of the principal minors of M. That is:

8 = | 8 |

76 = |  8  −2 |
     | −2  10 |

82 = |  8  −2   7 |
     | −2  10   5 |
     |  7   5  12 |

And we have 76/8 = 19/2 and 82/76 = 41/38. Thus, if we multiply the diagonal elements of the matrix D we easily see that this yields 82, the determinant of M. Finally, define:

H = [  1     0    0 ]   [ 2√2     0          0       ]   [   2√2          0            0        ]
    [ −1/4   1    0 ] · [  0   (1/2)√38      0       ] = [ −(1/2)√2    (1/2)√38        0        ]
    [  7/8 27/38  1 ]   [  0      0    (1/38)√1558   ]   [  (7/4)√2   (27/76)√38   (1/38)√1558  ]

This is the Cholesky factor, where the second matrix in the product is obtained by taking the square root of the diagonal elements of D. It is easy to see now that M = HHT. The fact that the determinants of the principal minors of M are always positive is a direct result of the positive definiteness of M.

This is naturally not the most efficient way to compute the Cholesky decomposition, but it is illustrative. The Maple code for the standard algorithm for computing the Cholesky decomposition is shown in Algorithm 1.

Exercise 7. Determine the computational complexity of Algorithm 1. Compare it to the computational complexity of the steps shown in Example 1.48. Use Algorithm 1 to confirm the Cholesky decomposition from Example 1.48.
CholeskyDecomp := proc (M::Matrix, n::integer)::list;
  local MM, H, ret, i, j, k;
  MM := Matrix(M);      # working copy of M, updated in place
  H := Matrix(n, n);    # the lower triangular factor being constructed
  ret := true;          # set to false if M is not positive definite
  for i to n do
    if evalb(not ret) then
      ret := false;
      break:
    end if:
    if 0 <= MM[i, i] then
      H[i, i] := sqrt(MM[i, i])
    else                # negative pivot: M cannot be positive definite
      ret := false;
      break;
    end if;
    for j from i+1 to n do
      if H[i, i] = 0 then
        ret := false;
        break
      end if;
      H[j, i] := MM[j, i]/H[i, i];
      for k from i+1 to j do    # update the remaining lower triangle
        MM[j, k] := MM[j, k]-H[j, i]*H[k, i]
      end do
    end do
  end do;
  if ret then
    [H, ret]
  else
    [M, ret]
  end if
end proc:
Algorithm 1. Cholesky Decomposition
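As a quick check of Algorithm 1 (a usage sketch, assuming the procedure above has been loaded into a Maple session), we can apply it to the matrix M from Example 1.48 and confirm that M = HHT:

with(LinearAlgebra):
M := Matrix([[8, -2, 7], [-2, 10, 5], [7, 5, 12]]):
res := CholeskyDecomp(M, 3):
res[2];                                         # true: M is positive definite
map(simplify, res[1] . Transpose(res[1]) - M);  # the zero matrix, so M = H.H^T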
CHAPTER 2
Fundamentals of Unconstrained Optimization
1. Mean Value and Taylor’s Theorems
Remark 2.1. We state, without proof, some classic theorems from multi-variable calculus. The proofs of these theorems are not terribly hard, but they build on several other
results from single variable calculus and reviewing all these results is tedious and outside the
scope of the course.
Lemma 2.2 (Mean Value Theorem). Suppose that f : Rn → R is continuously differentiable. (That is f ∈ C 1 .) Let x0 , h ∈ Rn . Then there is a t ∈ (0, 1) such that:
(2.1)
f (x0 + h) − f (x0 ) = ∇f (x0 + th)T h
Remark 2.3. This is the natural generalization of the one-variable mean value theorem
from calculus (Math 140 at Penn State). The expression x0 + th is simply the line segment
connecting x0 and x0 + h. If we imagine this being the “x-axis” and the corresponding slice
of f (·), then we can see that we’re just applying the single variable mean value theorem to
this slice.
Figure 2.1. An illustration of the mean value theorem in one variable. The multivariable mean value theorem is simply an application of the single variable mean
value theorem applied to a slice of a function.
Lemma 2.4 (Second Order Mean Value Theorem). Suppose that f : Rn → R is twice
continuously differentiable. (That is f ∈ C 2 .) Let x0 , h ∈ Rn . Then there is a t ∈ (0, 1)
such that:
(2.2)  f(x0 + h) − f(x0) = ∇f(x0)T h + (1/2) hT ∇2 f(x0 + th)h
Lemma 2.5 (Mean Value Theorem – Vector Valued Function). Let f : Rn → Rn be a
differentiable function and let x0 , h ∈ Rn . Then:
(2.3)  f(x0 + h) − f(x0) = ∫_0^1 Df(x0 + th)h dt
where D is the Jacobian operator.
Remark 2.6. It should be noted that there is no exact analog of the mean value theorem
for vector valued functions. The previous lemma is the closest thing to such an analog and it is generally referred to as such.
Corollary 2.7. Let f : Rn → R with f ∈ C 2 and let x0 , h ∈ Rn . Then:
(2.4)  ∇f(x0 + h) − ∇f(x0) = ∫_0^1 ∇2 f(x0 + th)h dt
Remark 2.8. The proof of this corollary rests on the single variable mean value theorem and the mean value theorem for integrals in single variable calculus.
Lemma 2.9 (Taylor’s Theorem – Second Order). Let f : Rn → R with f ∈ C 2 and let
x0 , h ∈ Rn . Then:
(2.5)  f(x0 + h) = f(x0) + ∇f(x0)T h + R2(x0, h)
where:
(2.6)  R2(x0, h) = (1/2) hT ∇2 f(x0 + th)h
for some t ∈ (0, 1) or:
(2.7)  R2(x0, h) = ∫_0^1 (1 − t) hT ∇2 f(x0 + th)h dt
Remark 2.10. Taylor’s Theorem in the general case considers functions in C k . One can
convert between the two forms of the remainder using the mean value theorem, though this
is not immediately obvious without some concentration. Most of the proofs of the remainder
term use the mean value theorems. There is a very nice, very readable proof of Taylor's theorem in [MT03] (Chapter 3.2).
Remark 2.11. From Taylor’s theorem, we obtain first and second order approximations
for functions. That is, the first order approximation for f (x0 + h) is:
(2.8)
f (x0 + h) ∼ f (x0 ) + ∇f (x0 )T h
while the second order approximation is:
(2.9)  f(x0 + h) ∼ f(x0) + ∇f(x0)T h + (1/2) hT ∇2 f(x0)h
We will use these approximations and Taylor's theorem repeatedly throughout this set of notes.
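To get a feel for the quality of these approximations, the following Maple sketch (the function and point are chosen arbitrarily for illustration) compares both approximations against the true value of f(x0 + h) for f(x, y) = exp(x) cos(y), x0 = (0, 0) and h = (0.1, 0.2):

f := (x, y) -> exp(x)*cos(y):
hx := 0.1:  hy := 0.2:
fx := eval(diff(f(x, y), x), [x = 0, y = 0]):       # 1
fy := eval(diff(f(x, y), y), [x = 0, y = 0]):       # 0
fxx := eval(diff(f(x, y), x, x), [x = 0, y = 0]):   # 1
fxy := eval(diff(f(x, y), x, y), [x = 0, y = 0]):   # 0
fyy := eval(diff(f(x, y), y, y), [x = 0, y = 0]):   # -1
approx1 := f(0, 0) + fx*hx + fy*hy;                             # 1.1
approx2 := approx1 + (1/2)*(fxx*hx^2 + 2*fxy*hx*hy + fyy*hy^2); # 1.085
evalf(f(hx, hy));                                   # 1.0831..., much closer to approx2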
2. Necessary and Sufficient Conditions for Optimality
Theorem 2.12. Suppose that f : D ⊆ Rn → R is differentiable in an open neighborhood
of a local maximizer x∗ ∈ D, then ∇f (x∗ ) = 0.
Proof. By way of contradiction, suppose that ∇f (x∗ ) 6= 0. The differentiability of f
implies that ∇f (x) is continuous in the open neighborhood of x∗ . Thus, for all ǫ, there is a
δ so that if ||x∗ − x|| < δ, then ||∇f (x∗ ) − ∇f (x)|| < ǫ. Let h = ∇f (x∗ ). Trivially, hT h > 0
and (by continuity), for some t ∈ (0, 1) (perhaps very small), we know that:
(2.10)
hT ∇f (x∗ + th) > 0
Let p = th. From the mean value theorem, we know there is an s ∈ (0, 1) such that:
f(x∗ + p) − f(x∗) = ∇f(x∗ + sp)T p > 0
by our previous argument. Thus, x∗ is not a (local) maximum.
Exercise 8. In the last step of the previous proof, we assert the existence of a t ∈ (0, 1)
so that Equation 2.10 holds. Explicitly prove such a t must exist. [Hint: Use a component
by component argument with the continuity of ∇f .]
Exercise 9. Construct an analogous proof for the statement: Suppose that f : D ⊆ Rn → R is differentiable in an open neighborhood of a local minimizer x∗ ∈ D; then ∇f(x∗) = 0.
Exercise 10. Construct an alternate proof of Theorem 2.12 by studying the single-variable function g(t) = f(x + th). You may use the fact from Math 140 that g′(t∗) = 0 is a necessary condition for a local maximum in the one dimensional case. (Extra credit if you prove that fact as well.)
Theorem 2.13. Suppose that f : D ⊆ Rn → R is twice differentiable in an open neighborhood of a local maximizer x∗ ∈ D; then ∇f(x∗) = 0 and H(x∗) = ∇2 f(x∗) is negative semidefinite.
Proof. From Theorem 2.12, ∇f(x∗) = 0. Our assumption that f is twice differentiable implies that H(x) is continuous and therefore for all ǫ there exists a δ so that if ||x∗ − x|| < δ, then |Hij(x∗) − Hij(x)| < ǫ for all i = 1, . . . , n and j = 1, . . . , n. We are just asserting pointwise continuity for the elements of the matrix H(x).
Suppose we may choose a vector h so that hT H(x∗)h > 0 and thus H(x∗) is not negative semidefinite. Then by our continuity argument, we can choose an h with norm small enough to assure that hT H(x∗ + th)h > 0 for all t ∈ (0, 1). From Lemma 2.9 we have, for some t ∈ (0, 1):
(2.11)  f(x∗ + h) − f(x∗) = (1/2) hT H(x∗ + th)h
since ∇f(x∗) = 0. Then it follows that f(x∗ + h) − f(x∗) > 0 and thus we have found a direction in which the value of f increases, so x∗ cannot be a local maximum.
Theorem 2.14. Suppose that f : D ⊂ Rn → R is twice differentiable. If ∇f (x∗ ) = 0
and H(x∗ ) = ∇2 f (x∗ ) is negative definite, then x∗ is a local maximum.
Proof. Applying Lemma 2.9, we know for any h ∈ Rn, there is a t ∈ (0, 1) so that:
(2.12)  f(x∗ + h) = f(x∗) + (1/2) hT ∇2 f(x∗ + th)h
By the same argument as in the proof of Theorem 2.13, we know that there is an ǫ > 0 so that if ||h|| < ǫ then for all t ∈ (0, 1), ∇2 f(x∗ + th) is negative definite if ∇2 f(x∗) is negative definite. Let Bǫ(x∗) be the open ball centered at x∗ with radius ǫ.
Thus we can see that for all x ∈ Bǫ(x∗) with x ≠ x∗:
(1/2) hT ∇2 f(x)h < 0
where x = x∗ + th for some appropriately chosen h and t ∈ (0, 1). Equation 2.12 combined with the previous observation shows that for all such x ∈ Bǫ(x∗), f(x) < f(x∗) and thus x∗ is a local maximum. This completes the proof.
Exercise 11. Theorem 2.14 provides sufficient conditions for x∗ to be a strict local maximum. Give an example showing that the conditions are not necessary.
Example 2.15. The importance of the Cholesky Decomposition is now apparent. To confirm that a point is a local maximum, we must simply check that the Hessian at the point in question is negative definite, which can be done with the Cholesky Decomposition on the negative of the Hessian. Local minima can be verified by applying the same test to the Hessian itself. Consider the function:

f(x, y) = (1 − x)2 + 100 (y − x2)2

The gradient of this function is:

∇f(x, y) = [−2 + 2x − 400(y − x2)x,  200(y − x2)]T

while the Hessian of this function is:

H(x, y) = [ 2 − 400y + 1200x2   −400x ]
          [       −400x           200 ]

At the point x = 1, y = 1, it is easy to see that the gradient is 0, while the Hessian is:

[  802  −400 ]
[ −400   200 ]

This matrix has the Cholesky factor:

[      √802             0       ]
[ −(200/401)√802   (10/401)√802 ]

Since the Hessian is positive definite, we know that x = 1, y = 1 must be a local minimum for this function. As it happens, this is the global minimum of this function.
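Algorithm 1 gives a quick way to run this test in Maple (a sketch, assuming CholeskyDecomp has been loaded):

H := Matrix([[802, -400], [-400, 200]]):
CholeskyDecomp(-H, 2)[2];  # false: -H is not positive definite, so (1, 1) is not a local maximum
CholeskyDecomp(H, 2)[2];   # true: H is positive definite, so (1, 1) is a local minimum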
3. Concave/Convex Functions and Convex Sets
Definition 2.16 (Convex Set). Let X ⊆ Rn . Then the set X is convex if and only if
for all pairs x1 , x2 ∈ X we have λx1 + (1 − λ)x2 ∈ X for all λ ∈ [0, 1].
Theorem 2.17. The intersection of a finite number of convex sets in Rn is convex.
Proof. Let C1, . . . , Cn ⊆ Rn be a finite collection of convex sets. Let
(2.13)  C = C1 ∩ C2 ∩ · · · ∩ Cn
be the set formed from the intersection of these sets. Choose x1 , x2 ∈ C and λ ∈ [0, 1].
Consider x = λx1 + (1 − λ)x2. We know that x1, x2 ∈ Ci for i = 1, . . . , n by definition of C. By convexity of each set Ci, we know that x ∈ Ci for i = 1, . . . , n. Therefore, x ∈ C. Thus C is a convex set.
Definition 2.18 (Convex Function). A function f : Rn → R is a convex function if it
satisfies:
(2.14)
f (λx1 + (1 − λ)x2 ) ≤ λf (x1 ) + (1 − λ)f (x2 )
for all x1 , x2 ∈ Rn and for all λ ∈ [0, 1]. When the inequality is strict, for λ ∈ (0, 1), the
function is a strictly convex function.
Example 2.19. This definition is illustrated in Figure 2.2. When f is a univariate function, this definition can be shown to be equivalent to the definition you learned in Calculus I (Math 140) using first and second derivatives.

Figure 2.2. A convex function: A convex function satisfies the expression f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2) for all x1 and x2 and λ ∈ [0, 1].
Definition 2.20 (Concave Function). A function f : Rn → R is a concave function if it
satisfies:
(2.15)
f (λx1 + (1 − λ)x2 ) ≥ λf (x1 ) + (1 − λ)f (x2 )
for all x1 , x2 ∈ Rn and for all λ ∈ [0, 1]. When the inequality is strict, for λ ∈ (0, 1), the
function is a strictly concave function.
To visualize this definition, simply flip Figure 2.2 upside down. The following theorem
is a powerful tool that can be used to show sets are convex. Its proof is outside the scope of the class, but relatively easy.
Theorem 2.21. Let f : Rn → R be a convex function. Then the set C = {x ∈ Rn :
f (x) ≤ c}, where c ∈ R, is a convex set.
Exercise 12. Prove Theorem 2.21.
Definition 2.22 (Linear Function). A function z : Rn → R is linear if there are constants c1 , . . . , cn ∈ R so that:
(2.16)
z(x1 , . . . , xn ) = c1 x1 + · · · + cn xn
Definition 2.23 (Affine Function). A function z : Rn → R is affine if z(x) = l(x) + b
where l : Rn → R is a linear function and b ∈ R.
Exercise 13. Prove that every affine function is both convex and concave.
Theorem 2.24. Suppose that g1 , . . . , gm : Rn → R are convex functions and h1 , . . . , hl :
Rn → R are affine functions. Then the set:
(2.17)
X = {x ∈ Rn : gi (x) ≤ 0, (i = 1, . . . , m) and hj (x) = 0, (j = 1, . . . , l)}
is convex.
Exercise 14. Prove Theorem 2.24.
Theorem 2.25. Suppose that f : Rn → R is concave and that x∗ is a local maximizer of
f . Then x∗ is a global maximizer of f .
Proof. Suppose x+ ∈ Rn has the property that f (x+ ) > f (x∗ ). For any λ ∈ (0, 1) we
know that:
f (λx∗ + (1 − λ)x+ ) ≥ λf (x∗ ) + (1 − λ)f (x+ )
Since x∗ is a local maximum there is an ǫ > 0 so that for all x ∈ Bǫ (x∗ ), f (x∗ ) ≥ f (x). Choose
λ so that λx∗ + (1 − λ)x+ is in Bǫ (x∗ ) and let x = λx∗ + (1 − λ)x+ . Let r = f (x+ ) − f (x∗ ).
By assumption r > 0. Then we have:
f (x) ≥ λf (x∗ ) + (1 − λ)(f (x∗ ) + r)
But this implies that:
f (x) ≥ f (x∗ ) + (1 − λ)r
But x ∈ Bǫ (x∗ ) by choice of λ, which contradicts our assumption that x∗ is a local maximum.
Thus, x∗ must be a global maximum.
Theorem 2.26. Suppose that f : Rn → R is strictly concave and that x∗ is a global
maximizer of f . Then x∗ is the unique global maximizer of f .
Exercise 15. Prove Theorem 2.26. [Hint: Proceed by contradiction as in the proof of
Theorem 2.25.]
Remark 2.27. We generally think of a function as either being convex (or concave) or
not. However, it is often useful to think of functions being convex (or concave) on a subset
of its domain. This can be important for determining the existence of (global) maxima on a
constrained region.
Definition 2.28. Let f : D ⊆ Rn → R and suppose that X ⊆ D. Then f is convex on X if for all x1, x2 ∈ int(X):
(2.18)  f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)
for all λ ∈ [0, 1].
Example 2.29. Consider the function f (x) = −(x − 4)4 + 3(x − 4)3 + 6(x − 4)2 − 3(x −
4) + 100. This function is neither convex nor concave. However on a subset of its domain
(say x ∈ [6, ∞)) the function is concave.
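Indeed, writing u = x − 4 we have f′′(x) = −12u2 + 18u + 12 = −6(2u + 1)(u − 2), which is non-positive exactly when u ≥ 2 (or u ≤ −1/2), so f is concave on [6, ∞).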
Figure 2.3. (Left) A simple quartic function with two local maxima and one local minimum. (Right) A segment of the function that is locally concave.
4. Concave Functions and Differentiability
Remark 2.30. The following theorem is interesting, but its proof is outside the scope of
the course. There is a proof for the one-dimensional case in Rudin [Rud76]. The proof for
the general case can be derived from this. The general proof can also be found in Appendix
B of [Ber99].
Theorem 2.31. Every concave (convex) function is continuous on the interior of its
domain.
Remark 2.32. The proofs of the next two theorems are variations on those found in
[Ber99], with some details added for clarity.
Theorem 2.33. A differentiable function f : Rn → R is concave if and only if for all x0, x ∈ Rn:
(2.19)  f(x) ≤ f(x0) + ∇f(x0)T (x − x0)
Proof. (⇐) Suppose Inequality 2.19 holds. Let x1, x2 ∈ Rn, let λ ∈ (0, 1) and let x0 = λx1 + (1 − λ)x2. We may write:
f(x1) ≤ f(x0) + ∇f(x0)T (x1 − x0)
f(x2) ≤ f(x0) + ∇f(x0)T (x2 − x0)
Multiplying the first inequality by λ and the second by 1 − λ and adding yields:
λf(x1) + (1 − λ)f(x2) ≤ λf(x0) + (1 − λ)f(x0) + λ∇f(x0)T (x1 − x0) + (1 − λ)∇f(x0)T (x2 − x0)
Simplifying the inequality, we have:
λf(x1) + (1 − λ)f(x2) ≤ f(x0) + ∇f(x0)T (λ(x1 − x0) + (1 − λ)(x2 − x0))
which simplifies further to:
λf(x1) + (1 − λ)f(x2) ≤ f(x0) + ∇f(x0)T (λx1 + (1 − λ)x2 − λx0 − (1 − λ)x0)
Or:
λf(x1) + (1 − λ)f(x2) ≤ f(x0) + ∇f(x0)T (λx1 + (1 − λ)x2 − x0) = f(x0)
because we assumed that x0 = λx1 + (1 − λ)x2.
(⇒) Now let x, x0 ∈ Rn and let λ ∈ (0, 1). Define h = x − x0 and let:
(2.20)  g(λ) = [f(x0 + λh) − f(x0)] / λ
Clearly, as λ approaches 0 from the right, g(λ) approaches the directional derivative of f(x) at x0 in the direction h.

Claim 1. The function g(λ) is monotonically decreasing.

Proof. Consider λ1, λ2 ∈ (0, 1) with λ1 < λ2 and let α = λ1/λ2. Define z = x0 + λ2h. Note:
αz + (1 − α)x0 = x0 + α(z − x0) = x0 + (λ1/λ2)(x0 + λ2h − x0) = x0 + λ1h
Thus, by concavity:
(2.21)  f(x0 + α(z − x0)) ≥ αf(z) + (1 − α)f(x0) = f(x0) + αf(z) − αf(x0)
Simplifying, we obtain:
(2.22)  [f(x0 + α(z − x0)) − f(x0)] / α ≥ f(z) − f(x0)
Since z = x0 + λ2h, we have:
(2.23)  [f(x0 + α(λ2h)) − f(x0)] / α ≥ f(x0 + λ2h) − f(x0)
Recall α = λ1/λ2, thus the left hand side simplifies to:
(2.24)  [f(x0 + (λ1/λ2)(λ2h)) − f(x0)] / (λ1/λ2) = [f(x0 + λ1h) − f(x0)] / (λ1/λ2) ≥ f(x0 + λ2h) − f(x0)
Lastly, dividing both sides by λ2 yields:
(2.25)  [f(x0 + λ1h) − f(x0)] / λ1 ≥ [f(x0 + λ2h) − f(x0)] / λ2
Thus g(λ) is monotonically decreasing. This completes the proof of the claim.
Since g(λ) is monotonically decreasing, we must have:
(2.26)  lim λ→0+ g(λ) ≥ g(1)
But this implies that:
(2.27)  lim λ→0+ [f(x0 + λh) − f(x0)] / λ ≥ f(x0 + h) − f(x0) = f(x0 + x − x0) − f(x0) = f(x) − f(x0)
since h = x − x0. Applying Theorem 1.22, the inequality becomes:
(2.28)  ∇f(x0)T h ≥ f(x) − f(x0)
which can be rewritten as:
(2.29)  f(x) ≤ f(x0) + ∇f(x0)T h = f(x0) + ∇f(x0)T (x − x0)
This completes the proof.
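For example, for the concave function f(x) = −x2, Inequality 2.19 reads −x2 ≤ −x02 − 2x0(x − x0); the difference between the right and left hand sides is (x − x0)2 ≥ 0, so the inequality holds for every x and x0.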
Exercise 16. Argue that for strictly concave functions, Inequality 2.19 is strict and the
theorem still holds.
Exercise 17. State and prove a similar theorem for convex functions.
Theorem 2.34. Let f : Rn → R be a twice differentiable function. If f is concave, then ∇2 f(x) is negative semidefinite for all x ∈ Rn.

Proof. Suppose that there is a point x0 ∈ Rn and h ∈ Rn such that hT ∇2 f(x0)h > 0. By the same argument as in the proof of Theorem 2.13, we may choose an h with a small norm so that for every t ∈ [0, 1], hT ∇2 f(x0 + th)h > 0 as well. By Lemma 2.9, for x = x0 + h we have some t ∈ (0, 1) so that:
(2.30)  f(x) = f(x0) + ∇f(x0)T h + (1/2) hT ∇2 f(x0 + th)h
But since hT ∇2 f(x0 + th)h > 0, we know that:
(2.31)  f(x) > f(x0) + ∇f(x0)T h
and thus by Theorem 2.33, f cannot be concave, a contradiction. This completes the proof.
Theorem 2.35. Let f : Rn → R be a twice differentiable function. If ∇2 f(x) is negative semidefinite for all x, then f(x) is concave.

Proof. From Lemma 2.9, we know that for every x, x0 ∈ Rn there is a t ∈ (0, 1) such that, when h = x − x0:
(2.32)  f(x) = f(x0) + ∇f(x0)T h + (1/2) hT ∇2 f(x0 + th)h
Since ∇2 f is negative semidefinite, it follows that (1/2) hT ∇2 f(x0 + th)h ≤ 0 and thus:
(2.33)  f(x) ≤ f(x0) + ∇f(x0)T (x − x0)
Thus by Theorem 2.33, f(x) is concave.
Exercise 18. Suppose that f : Rn → R with f (x) = xT Hx, where x ∈ Rn and
H ∈ Rn×n . Show that f (x) is convex if and only if H is positive semidefinite.
Exercise 19. State and prove a theorem on the nature of the Hessian matrix when f (x)
is strictly concave.
Remark 2.36 (A Last Remark on Concave and Convex Functions). In this chapter, we've focused primarily on concave functions. It's clear there are similar theorems for convex functions, and many books focus on convex functions and minimization. In these notes, we'll focus on maximization, because the geometry of certain optimality conditions makes more sense on first exposure in a maximization problem. If you want to see the minimization side of the story, look at [Ber99] and [BSS06]. In reality, it doesn't matter: any maximization problem can be converted to a minimization problem and vice versa.
CHAPTER 3
Introduction to Gradient Ascent and Line Search Methods
1. Gradient Ascent Algorithm
Remark 3.1. In Theorem 1.23, we proved that the gradient points in the direction of
fastest increase for a function f : Rn → R. If we are trying to identify a point x∗ ∈ Rn
that is a local or global maximum for f , then a reasonable approach is to walk along some
direction p so that ∇f (x)T p > 0.
Definition 3.2. Let f : Rn → R be a continuous and differentiable function and let
p ∈ Rn . If ∇f (x)T p > 0, then p is called an ascent direction.
Remark 3.3. Care must be taken, however, since ∇f(x) represents only the direction of fastest increase at x. As a result, we only want to take a small step in the direction of p, then re-evaluate the gradient and continue until a stopping condition is reached.
Basic Ascent Algorithm
Input: f : Rn → R, a function to maximize; x0 ∈ Rn, a starting position
Initialize: k = 0
(1) do
(2)   Choose pk ∈ Rn×1 and δk ∈ R+ so that ∇f(xk)T pk > 0.
(3)   xk+1 := xk + δk pk
(4) while some stopping criteria are not met.
Output: xk+1
Algorithm 2. Basic Ascent Algorithm
Remark 3.4. There are some obvious ambiguities with Algorithm 2. We have neither specified how to choose pk nor δk in Line (2) of Algorithm 2, nor have we defined specific stopping criteria for the while loop. More importantly, we'd like to prove that there is a way of choosing pk so that when we use this method at Line (2), the algorithm both converges (i.e., at some point we exit the loop in Lines (1)-(4)) and, when we exit, we have identified a local maximum, or at least a point that satisfies the necessary conditions for a local maximum (see Theorem 2.12).
Remark 3.5. For the remainder of these notes, we will assume that:
pk = Bk−1 ∇f(xk)
where Bk ∈ Rn×n is some appropriately chosen symmetric and non-singular matrix.
Definition 3.6 (Gradient Ascent). When Bk = In (the n × n identity matrix) for all k, then Algorithm 2 is called Gradient Ascent.
Definition 3.7 (Newton's Method). When Bk = −∇2 f(xk) (with appropriately chosen δk ∈ R+), then Algorithm 2 is called Newton's Method, which we will study in the next chapter.
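To make the pieces of Algorithm 2 concrete, here is a minimal Maple sketch of gradient ascent in two variables with a constant step size (the constant step and the procedure name are simplifying assumptions for illustration; the notes choose δk by a line search, as discussed below):

GradientAscentFixed := proc (f::operator, x0::numeric, y0::numeric, delta::numeric, K::posint)::list;
  local x, y, gx, gy, k, X, Y;
  x := x0;  y := y0;
  for k to K do
    # ascent direction p_k = gradient of f at (x, y), since B_k = I
    gx := eval(diff(f(X, Y), X), [X = x, Y = y]);
    gy := eval(diff(f(X, Y), Y), [X = x, Y = y]);
    x := x + delta*gx;
    y := y + delta*gy
  end do;
  [x, y]
end proc:

F := (x, y) -> -2*x^2 - 10*y^2:
GradientAscentFixed(F, 15.0, 5.0, 0.05, 100);  # approaches the maximizer (0, 0)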
2. Some Results on the Basic Ascent Algorithm
Lemma 3.8. Suppose that Bk is positive definite. Then pk = Bk−1 ∇f(x) is an ascent direction; that is, ∇f(x)T pk > 0.

Proof. By Corollary 1.37, Bk−1 is positive definite. Thus:
∇f(x)T pk = ∇f(x)T Bk−1 ∇f(x) > 0
by definition.
Remark 3.9. Since Bk = In is always positive definite and non-singular, we need never
worry that pk = ∇f (x) is not a properly defined ascent direction in gradient ascent.
Remark 3.10. As we've noted, we cannot simply choose δk arbitrarily in Line (2) of Algorithm 2, since ∇f(x) is only the direction of greatest ascent at x and thus pk = Bk−1 ∇f(xk) is only an ascent direction in a neighborhood about x. If we define:
(3.1)  φ(δk) = f(xk + δk pk)
then our problem is to solve:
(3.2)  max  φ(δk)
       s.t. δk ≥ 0
This is an optimization problem in the single variable δk, and assuming its solution is δk∗ and we compute xk+1 := xk + δk∗ pk, then we will assuredly have increased (or at least not decreased) the value of f(xk+1) compared to f(xk).
Remark 3.11. As we’ll see in Section 2, it’s not enough to just increase the value of
f (xk ) at each iteration k, we must increase it by a sufficient amount. One of the easiest,
though not necessarily the most computationally efficient, ways to make sure this happens
is to ensure that we identify a solution to the problem in Expression 3.2. In the following
sections, we discuss several methods for determining a solution.
3. Maximum Bracketing
Remark 3.12. For the remainder of this section, let us assume that φ(x) is a one dimensional function that we wish to maximize. One problem with attempting to solve Problem 3.2 is that the half-interval δk ≥ 0 is unbounded and thus we could have to search a very long way to find a solution. To solve this problem, we introduce the notion of a constrained search on an interval [a, c] where we assume that we have a third value b ∈ (a, c) such that φ(a) < φ(b) and φ(b) > φ(c). Such a triple (a, b, c) is called a bracket.
Proposition 3.13. For simplicity of notation, write φ(x) = φx ∈ R. The coefficients of the quadratic curve rx2 + sx + t passing through the points (a, φa), (b, φb), (c, φc) are given by the expressions:

r = −(−bφa + bφc − aφc + φa c − φb c + φb a) / (b2 c − b2 a − bc2 + ba2 − a2 c + ac2)

s = (−a2 φc + a2 φb − φa b2 + φa c2 − c2 φb + φc b2) / ((−c + a)(ab − ac − b2 + cb))

t = (−φb a2 c + a2 bφc + ac2 φb − aφc b2 − c2 bφa + cφa b2) / ((−c + a)(ab − ac − b2 + cb))
Exercise 20. Prove Proposition 3.13. [Hint: Use a computer algebra system.]
Corollary 3.14. The maximum (or minimum) of the quadratic curve rx2 + sx + t
passing through the points (a, φa ), (b, φb ), (c, φc ) is given by:
(3.3)  u = (1/2) · (−a2 φc + a2 φb − φa b2 + φa c2 − c2 φb + φc b2) / (−bφa + bφc − aφc + φa c − φb c + φb a)
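For example, the quadratic through the points (0, 0), (1, 1) and (2, 0) is −x2 + 2x, and Equation 3.3 gives u = (1/2) · (−4)/(−2) = 1, its maximum, as expected.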
Remark 3.15. The Maple code for finding the maximum point for the curve passing
through (a, φa ), (b, φb ), (c, φc ) is shown below:
1  QuadraticTurn :=
2  proc (a::numeric, fa::numeric, b::numeric, fb::numeric, c::numeric, fc::numeric)::list;
3  local qf, r, s, t, SOLN;
4  qf := proc (x) options operator, arrow; r*x^2+s*x+t end proc;
5  SOLN := solve([fa = qf(a), fb = qf(b), fc = qf(c)], [r, s, t]);
6  [eval(-(1/2)*s/r, SOLN[1]), evalb(nops(SOLN) > 0 and eval(r < 0, SOLN[1]))]
7  end proc;
Note in Line 6, we actually return a list, whose second element is true if and only if
there is a solution to the proposed problem and the point is the maximum of a quadratic
approximation, rather than the minimum. Note also for simplicity we use fa, fb, and fc
instead of φa , φb and φc .
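For instance, QuadraticTurn(0, 0, 1, 1, 2, 0) fits the parabola −x2 + 2x through the points (0, 0), (1, 1) and (2, 0) and returns [1, true], since the turning point x = 1 is a maximum of the fit.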
Remark 3.16. In our case, we know that a = 0 and we could choose an arbitrarily
small value b = a + ǫ as our second value. The problem is identifying point c. To solve this
problem, we introduce a bracketing algorithm (see Algorithm 3). Essentially, the bracketing
algorithm keeps moving the points a, b and c further and further to the right using a quadratic
approximation of the function when possible. Note, this algorithm assumes that the local
behavior of φ(x) is concave, which is true in our line search, especially when f (x) is concave.
Remark 3.17. Note in Algorithm 3, our φ(x) is written phi(x) and we use tau for the Golden ratio, which is used to define our initial value c = b + τ · (b − a). This is also used when
the parabolic approximation of the function fails. In practice a tiny term is usually added
to the denominator of the fraction in Corollary 3.14 to prevent numerical instability (see
[PTVF07], Page 491). Also note that we compute φ(x) many times. The number of times that φ(x) is computed can be reduced by storing these values, as discussed in ([PTVF07], Page 491).
Proposition 3.18. If φ(x) has a maximum value on R+ , then Algorithm 3 will identify
a bracket containing a local maximum.
Proof. The proof is straightforward. The algorithm will continue to move the bracket
(a, b, c) rightward ensuring that φ(a) < φ(b). At some point, the algorithm will choose c so
that φ(b) > φ(c) with certainty (since φ(x) has a maximum and its maximum is contained
in R+ ). When this occurs, the algorithm terminates.
Example 3.19. Consider the following function:
φ(x) = −(x − 4)4 + 3(x − 4)3 + 6(x − 4)2 − 3(x − 4) + 100
The plot of this function is shown in Figure 2.3. When we execute the parabolic bracketing
algorithm on this function starting at a = 0 and b = 0.25, we obtain the output:
ParabolicBracket := proc (phi::operator, a::numeric, b::numeric)::list;
  local aa, bb, cc, tau, ret, M, umax, u, bSkip, uu;
  tau := evalf(1/2+(1/2)*sqrt(5));
  M := 10.0;
  aa := a; bb := b;
  if phi(bb) < phi(aa) and aa < bb then
    ret := []
  else
    cc := bb+tau*(bb-aa);
    ret := [aa, bb, cc]:
    while phi(bb) < phi(cc) do
      umax := (cc-bb)*M+cc;
      uu := QuadraticTurn(aa, phi(aa), bb, phi(bb), cc, phi(cc));
      u := uu[1];
      print(sprintf("a=%f\t\tb=%f\t\tc=%f\t\tu=%f", aa, bb, cc, u));
      if uu[2] then
        bSkip := false;
        if aa < u and u < bb then
          if phi(aa) < phi(u) and phi(bb) < phi(u) then
            ret := [aa, u, bb];
            bb := u; cc := bb
          else
            aa := bb; bb := cc; cc := bb+tau*(bb-aa);
            ret := [aa, bb, cc]
          end if:
        elif bb < u and u < cc then
          if phi(bb) < phi(u) and phi(cc) < phi(u) then
            ret := [bb, u, cc];
            aa := bb; bb := u
          else
            aa := bb; bb := u; cc := bb+tau*(cc-bb);
            ret := [aa, bb, cc]
          end if:
        elif cc < u and u < umax then
          if phi(cc) < phi(u) then
            aa := bb; bb := cc; cc := u; cc := bb+tau*(cc-bb);
            ret := [aa, bb, cc]
          else
            aa := bb; bb := cc; cc := bb+tau*(bb-aa);
            ret := [aa, bb, cc]
          end if:
        elif umax < u then
          aa := bb; bb := cc; cc := bb+tau*(bb-aa);
          ret := [aa, bb, cc]
        end if:
      else
        aa := bb; bb := cc; cc := bb+tau*(bb-aa);
        ret := [aa, bb, cc]
      end if:
    end do:
  end if:
  ret
end proc:
Algorithm 3. Bracketing Algorithm
"a=0.000000 b=0.250000 c=0.654508 u=1.580537"
"a=0.250000 b=0.654508 c=2.152854 u=2.095864"
"a=0.654508 b=2.095864 c=2.188076 u=2.461833"
"a=2.095864 b=2.188076 c=2.631024 u=2.733759"
"a=2.188076 b=2.631024 c=2.797253 u=2.839039"
"a=2.631024 b=2.797253 c=2.864864 u=2.889411"
"a=2.797253 b=2.864864 c=2.904582 u=2.899652"
Output = [2.864863884, 2.899652455, 2.904582162]
When we execute the algorithm starting at a = 4.3 and b = 4.5, we obtain the output:
"a=4.300000 b=4.500000 c=4.823607 u=4.234248"
"a=4.500000 b=4.823607 c=5.347214 u=4.235687"
"a=4.823607 b=5.347214 c=6.194427 u=3.781555"
"a=5.347214 b=6.194427 c=7.565248 u=7.317872"
Output = [6.194427192, 7.317871765, 7.565247585]
The actual (local) maxima for this function occur at x ≈ 2.900627632 and x ≈ 4.217851814, showing our maximum points are bracketed as expected.
Remark 3.20. Parabolic fitting is often overly complicated, especially if the function
φ(x) is locally concave. In this case, we can simply compute c = b + α(b − a) and successively
reassign a = b and b = c.
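For concreteness, here is a minimal Maple sketch of this simpler scheme (our own illustration, not one of the notes' numbered algorithms; the name SimpleBracket and the growth factor alpha are ours). It assumes φ(a0) < φ(b0) at the start:

SimpleBracket := proc (phi::operator, a0::numeric, b0::numeric, alpha::numeric)::list;
  local aa, bb, cc;
  aa := a0; bb := b0;
  cc := bb + alpha*(bb - aa);    # step right by a fixed multiple of the current gap
  while phi(bb) < phi(cc) do     # keep sliding right while the function is still rising
    aa := bb; bb := cc;
    cc := bb + alpha*(bb - aa)
  end do;
  [aa, bb, cc]                   # on exit phi(aa) < phi(bb) and phi(bb) >= phi(cc)
end proc: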
Exercise 21. Write the algorithm discussed in Remark 3.20 and prove that if φ(x) has a maximum on [0, ∞), this algorithm will eventually bracket the maximum if you start with a = 0 and b = a + ǫ.
Exercise 22. Compare the bracket size obtained from Algorithm 3 vs. the algorithm
you wrote in Exercise 21. Is there a substantial difference in the bracket size?
4. Dichotomous Search
Remark 3.21. Assuming we have a bracket around a (local) maximum of φ(x), it is
now our objective to find a value close to the value x∗ such that φ(x∗ ) yields that (local)
maximum value. A dichotomous search is a popular (but not the most efficient) method for
finding such a value. The Maple code for Dichotomous Search is shown in Algorithm 4.
Remark 3.22. Dichotomous Search begins with an interval around which we are certain there is a maximum. The interval is repeatedly reduced until the maximum is localized to a small enough interval. At Lines 11 and 12, two test points are computed. Depending upon the value of the function φ at these points, the interval is sub-divided to either the left or the right. This is illustrated in Figure 3.1. If the value of the function is the same at both test points, then the direction is chosen arbitrarily; in the implementation shown in Algorithm 4, the algorithm always chooses the right half of the interval. Convergence can be sped up by instead defining x1 = u − δ and x2 = u + δ in the algorithm for some small δ > 0.
Exercise 23. For some small δ (say 0.01), empirically compare the rate of convergence of the Dichotomous Search Algorithm shown in Algorithm 4 with the rate of the modified algorithm that uses x1 = u − δ and x2 = u + δ for the given δ.
1  DichotomousSearch := proc (phi::operator, a::numeric, b::numeric, epsilon::numeric)::real;
2    local u, aa, bb, x;
3    if a < b then
4      aa := a; bb := b
5    else bb := a; aa := b
6    end if:
7    u := (1/2)*aa+(1/2)*bb;
8    print("a\t\t\t\t\tb");
9    while epsilon < bb-aa do
10     print(sprintf("%f \t\t\t %f", aa, bb));
11     x[1] := (1/2)*aa+(1/2)*u;
12     x[2] := (1/2)*bb+(1/2)*u;
13     if phi(x[2]) < phi(x[1]) then
14       bb := x[2]; u := (1/2)*aa+(1/2)*bb
15     else
16       aa := x[1]; u := (1/2)*aa+(1/2)*bb
17     end if:
18   end do;
19   (1/2)*aa+(1/2)*bb
20 end proc
Algorithm 4. Dichotomous Search
Figure 3.1. Dichotomous Search iteratively refines the size of a bracket containing a maximum of the function φ by splitting the bracket in two.
Theorem 3.23. Suppose φ(x) is strictly concave and attains its unique maximum on the
(initial) interval [a, b] at x∗ . Then Dichotomous Search will converge to a point x+ such that
|x+ − x∗ | < ǫ.
Proof. We will show by induction on the algorithm iteration that x∗ is always in the
interval returned by the algorithm. At initialization, this is clearly true. Suppose after n
iterations we have the interval [l, r] and x∗ ∈ [l, r]. Let u = (l + r)/2 and without loss
of generality, suppose x∗ ∈ [u, r]. Let x1 and x2 be the two test points with x1 ∈ [l, u] and
x2 ∈ [u, r]. If φ(x2 ) > φ(x1 ), then at the next iteration the interval is [x1 , r] and clearly
x∗ ∈ [x1 , r]. Conversely, suppose that φ(x1 ) > φ(x2 ). Since φ(x) is concave and must
always lie above its secant, it follows that x∗ ∈ [x1 , x2 ] (because clearly the function has
reached its turning point between x1 and x2 ). Thus, x∗ ∈ [u, x2 ]. The next interval when
φ(x1 ) > φ(x2 ) is [l, x2 ] and thus x∗ ∈ [l, x2 ]. It follows by induction that x∗ is contained in
the interval computed at each iteration of the while loop at Line 9. Suppose that [l∗ , r∗ ] is
the final interval upon termination of the while loop at Line 9. We know that x∗ ∈ [l∗ , r∗ ]
and r∗ − l∗ < ǫ, thus |x+ − x∗ | < ǫ. This completes the proof.
Theorem 3.24. Suppose φ(x) is concave and attains a maximum on the (initial) interval
[a, b] at x∗ . Assume that at each iteration, the interval of uncertainty around x∗ is reduced
by a factor of r ∈ (0, 1). Then Dichotomous search converges in

n = log(ǫ/|b − a|) / log(r)

steps.
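For example, the implementation in Algorithm 4 discards one quarter of the bracket at each pass, so r = 3/4. Starting from the interval [0, 15] of Example 3.25 below with ǫ = 0.01, this predicts n = log(0.01/15)/log(3/4) ≈ 25.4, i.e., 26 iterations, which matches the 26 rows of output shown in that example.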
Exercise 24. Prove Theorem 3.24. Illustrate the correctness of your proof by implementing Dichotomous Search and maximizing the function φ(x) = 10 − (x − 5)² on the interval [0, 10].
Example 3.25. The assumption that φ(x) is concave (on [a, b]) is not necessary to ensure
convergence to the maximum within [a, b]. Generally speaking it is sufficient to assume φ(x)
is unimodal on [a, b]. To see this, consider the function:
φ(x) =
  2x + 1,     x < 2
  5,          2 ≤ x ≤ 8
  x − 3,      8 < x ≤ 9
  −x + 15,    x > 9
The function is illustrated in Figure 3.2.

Figure 3.2. A non-concave function with a maximum on the interval [0, 15].

If we execute the Dichotomous Search algorithm as illustrated in Algorithm 4, it will converge to the bracket [8.99, 9.00] even though this function is not concave on the interval. The output from the algorithm is shown below:

"a b"
"0.000000   15.000000"
"0.000000   11.250000"
"2.812500   11.250000"
"4.921875   11.250000"
"6.503906   11.250000"
"6.503906   10.063477"
"7.393799   10.063477"
"8.061218   10.063477"
"8.061218   9.562912"
"8.436642   9.562912"
"8.718209   9.562912"
"8.718209   9.351736"
"8.718209   9.193355"
"8.836996   9.193355"
"8.836996   9.104265"
"8.903813   9.104265"
"8.903813   9.054152"
"8.941398   9.054152"
"8.969586   9.054152"
"8.969586   9.033010"
"8.969586   9.017154"
"8.981478   9.017154"
"8.990397   9.017154"
"8.990397   9.010465"
"8.990397   9.005448"
"8.994160   9.005448"
Output = 9.001215067
5. Golden Section Search
Remark 3.26. In the Dichotomous search each iteration requires two new evaluations
of the value of φ. If φ is very costly to evaluate, it might be more efficient to construct
an algorithm that evaluates φ(x) only once at each iteration by cleverly constructing the
intervals. The Golden Section Search does exactly that.
Remark 3.27. At each iteration of the golden section search, there are four points under
consideration, x1 < x2 < x3 < x4 , where x1 and x4 are the left and right endpoints of the
current interval. At the next iteration, if φ(x2 ) > φ(x3 ), then x1 remains the left end point,
x3 becomes the right end point and x2 plays the role of x3 in the next iteration. On the other
hand, if φ(x2 ) < φ(x3 ), then x4 remains the right endpoint, x2 becomes the left endpoint and
x3 plays the role of x2 in the next iteration. The objective in the Golden Section Search is
to ensure that the sizes of the intervals remain proportionally constant throughout algorithm
execution.
Proposition 3.28. Given an initial bracket [x1, x4] containing a maximum of the function φ(x), to ensure that the ratio of x2 − x1 to x4 − x2 is kept constant, set x2 = (1/τ)x1 + (1 − 1/τ)x4 and x3 = (1 − 1/τ)x1 + (1/τ)x4, where τ is the Golden ratio; i.e., τ = (1/2)(1 + √5).
Proof. Consider Figure 3.3, where a = x2 − x1, b = x4 − x2 and c = x3 − x2 (as labeled in the figure).

Figure 3.3. The relative sizes of the interval and sub-interval lengths in a Golden Section Search.

In the initial interval, the ratio of x2 − x1 to x4 − x2 is a/b. At the next iteration, either φ(x2) < φ(x3) or not (in the case of a tie, an arbitrary left/right choice is made). If φ(x2) > φ(x3), then x2 becomes the new x3 point and thus we
require:

a/b = c/a

On the other hand, if φ(x2) < φ(x3) then x3 becomes the new x2 point and we require:

a/b = c/(b − c)

From the first equation, we see that a²/c = b and substituting that into the second equation yields:

(a/c)² − (a/c) − 1 = 0

which implies:

a/c = τ = b/a

Without loss of generality, suppose that x1 = 0 and x4 = 1. To compute x2 we have:

a/(1 − a) = 1/τ =⇒ a = 1/(1 + τ)

For the Golden Ratio, 1/(1 + τ) = 1 − 1/τ and thus it follows x2 = (1/τ)x1 + (1 − 1/τ)x4.
To compute x3, suppose a + c = r. Then we have:

(r − x2)/(1 − r) = 1/τ

When we replace x2 with 1/(1 + τ) we have:

r = (1 + 2τ)/(1 + τ)²

In the case of the Golden Ratio, the previous fraction reduces to r = 1/τ. Thus we have x3 = (1 − 1/τ)x1 + (1/τ)x4. This completes the proof.
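Numerically, with [x1, x4] = [0, 1] and τ ≈ 1.618034, this gives x2 = 1 − 1/τ ≈ 0.381966 and x3 = 1/τ ≈ 0.618034, and the ratio of the left sub-interval to the right is again 1/τ. A quick Maple check (our illustration, not part of the original notes):

tau := evalf((1 + sqrt(5))/2):
x2 := 1 - 1/tau;        # 0.3819660113
x3 := 1/tau;            # 0.6180339887
evalf(x2/(1 - x2));     # 0.6180339887 = 1/tau, so the proportions are preserved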
Remark 3.29. The Golden Section Search is illustrated in Algorithm 5. This is,
essentially, the Golden Section search as it is implemented in [PTVF07] (Page 495).
GoldenSectionSearch := proc (phi::operator, a::numeric, b::numeric,
    c::numeric, epsilon::numeric)::real;
  local aa, bb, cc, dd, tau, ret;
  tau := evalf(1/2+(1/2)*sqrt(5));
  aa := a; dd := c;
  if b-a < c-b then
    bb := b; cc := b+(1-1/tau)*(c-b)
  else
    bb := b-(1-1/tau)*(b-a); cc := b
  end if;
  while epsilon < dd-aa do
    print(sprintf("aa=%f\t\tbb=%f\t\tcc=%f\t\tdd=%f", aa, bb, cc, dd));
    if phi(cc) < phi(bb) then
      dd := cc; cc := bb; bb := cc/tau+(1-1/tau)*aa;
    else
      aa := bb; bb := cc; cc := bb/tau+(1-1/tau)*dd;
    end if
  end do;
  if phi(cc) < phi(bb) then
    ret := bb;
  else
    ret := cc;
  end if;
  ret;
end proc:
Algorithm 5. Golden Section Search
Theorem 3.30. Suppose φ(x) is strictly concave and attains its unique maximum on the
(initial) interval [a, b] at x∗ . Then Golden Section Search will converge to a point x+ such
that |x+ − x∗ | < ǫ.
Exercise 25. Prove Theorem 3.30.
Example 3.31. In Example 3.25, we saw an example of a non-concave function for which
Dichotomous Search converged. In this example, we show an instance of a function for which
the Golden Section Search does not converge to the global maximum of the function. Let:
φ(x) =
  10x,          x < 1/2
  −10x + 10,    1/2 ≤ x < 3/4
  5/2,          3/4 ≤ x < 9
  −x + 23/2,    x ≥ 9
The function in question is shown in Figure 3.4. If we use the bracket [0, 11], note that the
global functional maximum occurs in a position very close to the left end of the bracket. As
a result, the interval search misses the global maximum. This illustrates the importance of
tight bracketing in line search algorithms. The output of the search is shown below.

"aa=0.000000  bb=4.201626  cc=6.798374  dd=11.000000"
"aa=4.201626  bb=6.798374  cc=8.403252  dd=11.000000"
"aa=6.798374  bb=8.403252  cc=9.395122  dd=11.000000"
"aa=6.798374  bb=7.790243  cc=8.403252  dd=9.395122"
"aa=7.790243  bb=8.403252  cc=8.782113  dd=9.395122"
"aa=8.403252  bb=8.782113  cc=9.016261  dd=9.395122"
"aa=8.403252  bb=8.637401  cc=8.782113  dd=9.016261"
"aa=8.637401  bb=8.782113  cc=8.871549  dd=9.016261"
"aa=8.782113  bb=8.871549  cc=8.926824  dd=9.016261"
"aa=8.871549  bb=8.926824  cc=8.960986  dd=9.016261"
"aa=8.926824  bb=8.960986  cc=8.982099  dd=9.016261"
"aa=8.960986  bb=8.982099  cc=8.995148  dd=9.016261"
"aa=8.982099  bb=8.995148  cc=9.003213  dd=9.016261"
"aa=8.982099  bb=8.990164  cc=8.995148  dd=9.003213"
"aa=8.990164  bb=8.995148  cc=8.998228  dd=9.003213"
Output = 8.998228439

Figure 3.4. A function for which Golden Section Search (and Dichotomous Search) might fail to find a global solution.
Remark 3.32. There are other derivative free line search methods. One of the simplest applies the parabolic approximation we used in the bracketing method in Algorithm 3 (see [BSS06] or [Ber99]). Another method, generally called Brent's method, combines parabolic approximation with the Golden Section method (see [PTVF07], Chapter 10 for an
implementation of Brent’s Method). These techniques extend and combine derivative free
methods presented in this section, but do not break any substantially new ground beyond
their efficiency.
6. Bisection Search
Remark 3.33. We are now going to assume that we can compute the first and possibly
second derivative of φ(x). In doing so, we will construct new methods for finding points to
maximize the function.
Remark 3.34. Before proceeding any further, we warn the reader that methods requiring
function derivatives should use the actual derivative of the function rather than an approximation. While approximate methods are necessary in multi-dimensional optimization and
often used, line searches can be sensitive to incorrect derivatives. Thus, it is often safer to
use a derivative free method even though it may not be as efficient.
Remark 3.35. The bisection search assumes that for a given univariate function φ(x),
we have a bracket [a, b] ⊂ R containing a (local) maximum and furthermore that φ(x) is
differentiable on this closed interval and φ′ (a) > 0 and φ′ (b) < 0. As we will see, convergence
can be ensured when φ(x) is concave.
Remark 3.36. The bisection search algorithm is illustrated in Algorithm 6. Given the
assumption that φ′ (a) > 0 and φ′ (b) < 0, we choose a test point u ∈ [a, b]. In the implementation, we choose u = (a + b)/2. If φ′ (u) > 0, then this point is presumed to lie on the
left side of the maximum and we repeat on the new interval [u, b]. If φ′ (u) < 0, then u is
presumed to lie on the right side of the maximum and we repeat on the new interval [a, u].
If φ′ (u) = 0, then we terminate our search and return u as x∗ the maximizing value for φ(x).
Because of round off, we will never reach a condition in which φ′ (u) = 0, so instead we test
to see if |φ′ (u)| < ǫ, where ǫ is a small constant provided by the user.
Theorem 3.37. If φ(x) is concave and continuously differentiable on [a, b] with a maximum x∗ ∈ (a, b), then for all ǫ (an input parameter) there is a δ > 0 so that Bisection Search
converges to u with the property that |x∗ − u| ≤ δ.
Proof. The fact that φ(x) is concave on [a, b] combined with the fact that φ′ (a) > 0
and φ′ (b) < 0 implies that φ′ (x) is a monotonically decreasing continuous function on the
interval [a, b] and therefore if there are two values x∗ and x+ maximizing φ(x) on [a, b] then
for all λ ∈ [0, 1], x◦ = λx∗ + (1 − λ)x+ also maximizes φ(x) and by necessity φ′ (x◦ ) =
φ′ (x∗ ) = φ′ (x+ ) = 0. Thus, for any iteration if we have the bracket [l, r] if φ′ (l) > 0 and
φ′ (r) < 0 we can be sure that x∗ ∈ [l, r]. Suppose after n iterations we have interval [l, r]
with the property that φ′ (l) > 0 and φ′ (r) < 0. Lines 15 and 17 of Bisection search ensure
that at iteration n + 1 we continue to have a new interval [l′ , r′ ] ⊂ [l, r] with the property
that φ′ (l′ ) > 0 and φ′ (r′ ) < 0, assuming that |φ′ (u)| > ǫ. Thus, at each iteration x∗ is in
the bracket. Furthermore, since φ′ (x) is monotonically decreasing we can be sure that as
each iteration shrinks the length of the interval [l, r], φ′ (u) will not increase. Thus at some
point either |φ′ (u)| < ǫ or |r − l| < ǫ. If the latter occurs, then we know that |x∗ − u| < ǫ
since u ∈ [l, r] at the last iteration and δ = ǫ. On the other hand, if φ′ (u) = 0, then u is a
maximizer of φ(x) and we may assume that x∗ = u, so δ = 0. Finally, suppose 0 < φ′ (u) < ǫ.
Then there is some non-zero δ < |r − l| so that u + δ = x∗ . Thus |x∗ − u| ≤ δ. The case
when φ′ < 0 and |φ′ (u)| < ǫ is analogous.
Remark 3.38. Note, we do not put a detailed bound on the δ in the proof of the previous theorem. As a rule of thumb, we'd like φ′(x) to have a somewhat nice property like having a continuous inverse function, in which case we could make an exact ǫ-δ style argument to assert that if |φ′(u) − φ′(x∗)| < ǫ, then there better be some small δ so that |u − x∗| < δ, for an appropriately chosen maximizer x∗ (if the maximizer is not unique). Unfortunately, plain old continuity does not work for this argument. We could require φ′(x) to be bilipschitz, which would also achieve this effect. See Exercise 27 for an alternative way to bound |u − x∗| < δ in terms of ǫ.

1  BisectionSearch := proc (phi::operator, a::numeric, b::numeric, epsilon::numeric)::real;
2    local u, aa, bb, x, y;
3    if a < b then
4      aa := a; bb := b
5    else
6      bb := a; aa := b
7    end if;
8    u := (1/2)*aa+(1/2)*bb;
9    print("a\t\t\t\t\tb");
10   while epsilon < bb-aa do
11     print(sprintf("%f \t\t\t %f", aa, bb));
12     y := eval(diff(phi(z), z), z = u);
13     if epsilon < abs(y) then
14       if y < 0 then
15         bb := u; u := (1/2)*aa+(1/2)*bb
16       else
17         aa := u; u := (1/2)*aa+(1/2)*bb
18       end if
19     else
20       break
21     end if
22   end do;
23   u
24 end proc:
Algorithm 6. Bisection Search
Exercise 26. A function f : R → R is called Lipschitz continuous if there is a K > 0
so that for all x1 and x2 in R we have |f (x1 ) − f (x2 )| ≤ K|x1 − x2 |. A function is bilipschitz
if there is a K ≥ 1 so that:
(3.4) (1/K)|x1 − x2| ≤ |f(x1) − f(x2)| ≤ K|x1 − x2|
Suppose φ′ (x) is bilipschitz. Find an exact expression for δ.
Exercise 27. This exercise will tell you something about the convergence proof for
bisection search under a stronger assumption on φ(x).
Suppose that φ : R → R is twice continuously differentiable and strictly concave on [a, b]
with its unique maximum x∗ ∈ [a, b] (so φ′ (x∗ ) = 0). Assume that |φ′ (u)| < ǫ. Use the Mean
Value Theorem to find a bound on |u − x∗ |.
Remark 3.39. If φ(x) is twice differentiable then there is a t between u and x∗ so that:
(3.5) φ′(u) − φ′(x∗) = φ′′(t)(u − x∗)
This implies that:
(3.6) |φ′(u) − φ′(x∗)| = |φ′′(t)| |u − x∗|
From this we know that:
(3.7) |φ′(u) − φ′(x∗)| = |φ′′(t)| |u − x∗| < ǫ
Therefore:
(3.8) |u − x∗| < ǫ/|φ′′(t)|
Exercise 28. Modify the bisection search to detect minima. Test your algorithm on the
function f (x) = (x − 5)2 .
Example 3.40. Below we show example output from Bisection Search when executed
on the function φ(x) = 10 − (x − 5)² when we search on the interval [0, 9] and set ǫ = 0.01.

"a b"
"0.000000   9.000000"
"4.500000   9.000000"
"4.500000   6.750000"
"4.500000   5.625000"
"4.500000   5.062500"
"4.781250   5.062500"
"4.921875   5.062500"
"4.992188   5.062500"
"4.992188   5.027344"
"4.992188   5.009766"
Output = 5.000976562
7. Newton’s Method
Remark 3.41. Newton’s method is a popular method for root finding that can also be
applied to the problem of univariate functional maximization (or minimization) provided the
second derivative of the function can be computed. Since we will cover Newton’s Method for
multidimensional functions in the next chapter, we will not go into great detail on the theory
here. Note in theory, we do not require a bracket for Newton’s Method, just a starting point.
A bracket is useful though for finding a starting point.
Remark 3.42. The algorithm assumes that the quadratic Taylor approximation is reasonable for the function φ(x) to be optimized about the current iterated value xk. That is:
(3.9) φ(x) ≈ φ(xk) + φ′(xk)(x − xk) + (1/2)φ′′(xk)(x − xk)²
If we assume that φ′′(xk) < 0, then differentiating and setting equal to zero will yield a maximum for the function approximation:
(3.10) xk+1 = −(φ′(xk) − φ′′(xk)xk)/φ′′(xk) = xk − φ′(xk)/φ′′(xk)
Iteration continues until some termination criterion is achieved. An implementation of the
algorithm is shown in Algorithm 7.
NewtonsMethod := proc (phi::operator, a::numeric, epsilon::numeric)::list;
  local u, fp, fpp;
  u := a;
  fp := infinity;
  while epsilon < abs(fp) do
    print(sprintf("xk = %f", u));
    fp := eval(diff(phi(z), z), z = u);           # Take a first derivative at u.
    fpp := eval(diff(phi(z), ‘$‘(z, 2)), z = u);  # Take a second derivative at u.
    u := evalf(-(fp-fpp*u)/fpp)
  end do;
  [u, evalb(fpp < 0)]:
end proc:
Algorithm 7. Newton’s Method in One Dimension
Remark 3.43. Note that our implementation of Newton’s Algorithm checks to see if the
final value of φ′′ (xk ) < 0. This allows us to make sure we identify a maximum rather than a
minimum.
Example 3.44. Suppose we execute our bracketing algorithm on the function:
φ(x) = −(x − 4)4 + 3(x − 4)3 + 6(x − 4)2 − 3(x − 4) + 100
starting at 0 and 0.25. We obtain the bracket (2.864863884, 2.899652455, 2.904582162). If
we begin Newton’s method with the second point of the bracket 2.899652455 we obtain the
output:
"xk = 2.899652"
"xk = 2.900627"
Output = [2.900627632, true]
Exercise 29. Find a simple necessary condition for which Newton’s Method converges
in one iteration.
Exercise 30. Implement Newton’s Method. Using
φ(x) = −(x − 4)4 + 3(x − 4)3 + 6(x − 4)2 − 3(x − 4) + 100
find a starting point for which Newton’s Method converges to a maximum and another
starting point for which Newton’s Method converges to a minimum.
8. Convergence of Newton’s Method
Remark 3.45. Most of this discussion is taken from Chapter 8.1 of [Avr03]. A slightly
more general discussion can be found in [BSS06].
Definition 3.46. Let S = [a, b] ⊂ R. A contractor or contraction mapping on S is a
continuous function f : S → S with the property that there is a q ∈ (0, 1) so that:
(3.11) |f(x1) − f(x2)| ≤ q|x1 − x2| for all x1, x2 ∈ S
Remark 3.47. Definition 3.46 is a special form of Lipschitz continuity in which the
Lipschitz constant is required to be less than 1.
Lemma 3.48. Let f : S → S be a contraction mapping on S = [a, b] ⊂ R. Define
x(k+1) = f (x(k) ) and suppose x(0) ∈ S. Then there is a unique fixed point x∗ to which the
sequence {x(k) } converges and:
(3.12) |x(k+1) − x∗| ≤ q^(k+1)|x(0) − x∗|,  k = 0, 1, . . .
Proof. The fact that f(x) has a fixed point is immediate from a variation on Brouwer's Fixed Point Theorem (see [Mun00], Pages 351–353). Denote this point by x∗. The fact
that f is a contraction mapping implies that:
(3.13)
|f (x(0) ) − x∗ | ≤ q|x(0) − x∗ |
which implies that:
(3.14)
|x(1) − x∗ | ≤ q|x(0) − x∗ |
Now proceed by induction and assume that:
(3.15) |x(k) − x∗| ≤ q^k|x(0) − x∗|,  k = 0, 1, . . . , K
From Expression 3.15, we know that:
(3.16) |f(x(K)) − x∗| ≤ q|x(K) − x∗| ≤ q^(K+1)|x(0) − x∗|
Thus:
(3.17) |x(K+1) − x∗| ≤ q^(K+1)|x(0) − x∗|
Thus Expression 3.12 follows by induction. Finally, since q < 1, we see at once that {x(k) }
converges to x∗ .
To prove the uniqueness of x∗ , suppose there is a second, distinct, fixed point x+ . Then:
(3.18)
0 < |x∗ − x+ | = |f (x∗ ) − f (x+ )| ≤ q|x∗ − x+ |
As 0 < q < 1, this is a contradiction unless x∗ = x+ . Thus x∗ is unique. This completes the
proof.
Lemma 3.49. Suppose that f is continuously differentiable on S = [a, b] ⊂ R and maps
S into itself. If |f ′ (x)| < 1 for every x ∈ S, then f is a contraction mapping.
Proof. Let x1 and x2 be two points in S. Then by the Mean Value Theorem (Theorem
2.2) we know that there is some x∗ ∈ (x1 , x2 ) so that:
(3.19)
f (x1 ) = f (x2 ) + f ′ (x∗ )(x1 − x2 )
Thus:
(3.20)
|f (x1 ) − f (x2 )| = |f ′ (x∗ )||(x1 − x2 )|
Since |f′(x)| < 1 for all x ∈ S and f′ is continuous on the compact set S, we can set q equal to the maximum value of |f′(x)| on S, and q < 1. The fact that f is a contraction mapping follows from the definition.
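As a concrete illustration (our example, not from the references): f(x) = cos(x) maps S = [0, 1] into itself and |f′(x)| = |sin(x)| ≤ sin(1) ≈ 0.841 < 1 on S, so by Lemma 3.49 it is a contraction mapping and by Lemma 3.48 the iteration x(k+1) = cos(x(k)) converges to a unique fixed point:

xk := 0.5:
for k to 30 do xk := evalf(cos(xk)) end do:
xk;   # approximately 0.7390851, the unique solution of cos(x) = x on [0, 1]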
Theorem 3.50. Let h and γ be continuously differentiable functions on S = [a, b] ⊂ R.
Suppose further that:
(3.21)
h(a)h(b) < 0
and for all x ∈ S:
(3.22)
γ(x) > 0
(3.23)
h′ (x) > 0
(3.24)
0 ≤ 1 − (γ(x)h(x))′ ≤ q < 1
Let:
(3.25)
x(k+1) = x(k) − γ(x(k) )h(x(k) )
k = 0, 1, . . .
with x(0) ∈ S. Then the sequence {x(k) } converges to a solution x∗ with h(x∗ ) = 0.
Proof. Define:
(3.26) ρ(x) = x − γ(x)h(x)
(3.27) ρ′(x) = 1 − (γ(x)h(x))′
From Inequality 3.24, we see that 0 < ρ′(x) ≤ q < 1 for all x ∈ S and ρ is monotonically increasing (non-decreasing) on S. Inequality 3.23 shows that h is monotonically increasing on S and therefore, since h(a)h(b) < 0, it follows that h(a) < 0 and h(b) > 0. From this and Equation 3.26 we conclude that ρ(a) > a and ρ(b) < b, since γ(x) > 0 for all x ∈ S. By the monotonicity of ρ, we can conclude then that a < ρ(x) < b for all x ∈ S. Thus, ρ maps S into itself. Moreover, by Inequality 3.24 we have |ρ′(x)| < 1 on S, so it follows from Lemma 3.49 that ρ(x) is a contraction mapping. Lemma 3.48 then shows that the sequence {x(k)} converges to the unique fixed point x∗ of ρ in S; at this point γ(x∗)h(x∗) = 0 and, since γ(x∗) > 0, h(x∗) = 0.
Corollary 3.51. Suppose φ : R → R is a univariate function to be maximized that
is three-times continuously differentiable. Let h(x) = φ′ (x) and let γ(x) = 1/φ′′ (x). Then
ρ(x) = x − γ(x)h(x) is Newton's Method. If the hypotheses on h(x) and γ(x) from Theorem
3.50 hold, then Newton’s method converges to a stationary point (φ′ (x) = 0) on S = [a, b] ⊂
R.
Definition 3.52 (Rate of Convergence). Suppose a sequence {xk} ⊆ Rn converges to x∗ ∈ Rn but does not attain x∗ for any finite k. If there is a p ∈ R and an α ∈ R with α ≠ 0 such that:
(3.28) lim_{k→∞} ||xk+1 − x∗|| / ||xk − x∗||^p = α
then p is the order of convergence of the sequence {xk }. If p = 1, then convergence is linear.
If p > 1, then convergence is superlinear and if p = 2, then convergence is quadratic.
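When x∗ is known (or accurately approximated), p can be estimated from three successive errors ek = ||xk − x∗|| via p ≈ log(ek+2/ek+1)/log(ek+1/ek). A small Maple sketch of this estimate (the name EstimateOrder is ours, not from the notes):

EstimateOrder := proc (e0::numeric, e1::numeric, e2::numeric)::numeric;
  # p ~ log(e2/e1)/log(e1/e0) for successive errors e_k = ||x_k - x*||
  evalf(ln(e2/e1)/ln(e1/e0))
end proc:

For instance, the quadratically shrinking errors 0.1, 0.01, 0.0001 give EstimateOrder(0.1, 0.01, 0.0001) = 2., which is one way to carry out Exercise 31 below.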
Theorem 3.53. Suppose that the hypotheses of Theorem 3.50 and Corollary 3.51 hold
and the sequence {xk } ⊂ R is generated by Newton’s method and this sequence converges to
x∗ ∈ R. Then the rate of convergence is quadratic.
Proof. From Theorem 3.50 and Corollary 3.51, we are maximizing φ : S = [a, b] ⊂ R →
R with h(x) = φ′ (x) and γ(x) = 1/φ′′ (x) and φ is three-times continuously differentiable.
From Theorem 3.50, we know that φ′(x∗) = h(x∗) = 0. Furthermore, we know that x∗ is a fixed point of the equation:
(3.29) f(xk) = xk − h(xk)/h′(xk)
if and only if h(x∗) = 0. To see this, note that if h(x∗) = 0, then f(x∗) = x∗ and so x∗ is a fixed point. If x∗ is a fixed point, then:
(3.30) 0 = −h(x∗)/h′(x∗)
and thus, h(x∗) = 0. By the Mean Value Theorem (Theorem 2.2), for any iterate xk, we know there is a t ∈ R between xk and x∗ so that:
(3.31) xk+1 − x∗ = f(xk) − f(x∗) = f′(t)(xk − x∗)
From this, we conclude that:
(3.32) |xk+1 − x∗| = |f(xk) − f(x∗)| = |f′(t)| |xk − x∗|
We note that:
(3.33) f′(t) = 1 − (h′(t)² − h′′(t)h(t))/h′(t)² = h′′(t)h(t)/h′(t)²
We can rewrite Equation 3.32 as:
(3.34) |xk+1 − x∗| = |f(xk) − f(x∗)| = (|h′′(t)| |h(t)|/h′(t)²) |xk − x∗|
Since h(x∗) = 0, we know that there is some value s between t and x∗ so that:
(3.35) |h(t)| = |h(t) − h(x∗)| = |h′(s)| |t − x∗|
again by the Mean Value Theorem (Theorem 2.2). But since s is between t and x∗ and t is between xk and x∗, we conclude that:
(3.36) |h(t)| = |h(t) − h(x∗)| = |h′(s)| |t − x∗| ≤ |h′(s)| |xk − x∗|
Combining Equations 3.34 and 3.36 yields:
(3.37) |xk+1 − x∗| = (|h′′(t)| |h(t)|/h′(t)²) |xk − x∗| ≤ (|h′′(t)| |h′(s)|/h′(t)²) |xk − x∗|²
If:
β = sup_k (|h′′(t)| |h′(s)|/h′(t)²)
then
(3.38) |xk+1 − x∗| ≤ β|xk − x∗|²
The fact that f′(x) is bounded is sufficient to ensure that the supremum β exists. Thus the convergence rate of Newton's Method is quadratic.
Exercise 31. Using your code for Newton’s Method, illustrate empirically that convergence is quadratic.
CHAPTER 4
Approximate Line Search and Convergence of Gradient Ascent
1. Approximate Search: Armijo Rule and Curvature Condition
Remark 4.1. It is sometimes the case that optimizing the function φ(δk) is expensive. In this case, it would be nice to find a δk that is sufficient to guarantee convergence of a gradient ascent (or related) algorithm. We can then stop our line search before convergence to an optimal solution and speed up our overall optimization algorithm. The following conditions, usually called the Wolfe Conditions, can be used to ensure convergence of gradient descent and related algorithms.
Definition 4.2 (Armijo Rule). Given a function f : Rn → R and an ascent direction
pk with constant σ1 ∈ (0, 1), the Armijo rule is satisfied if:
(4.1)
f (xk + δk pk ) − f (xk ) ≥ σ1 δk ∇f (xk )T pk
Remark 4.3. Recall that φ(δk) = f(xk + δk pk); consequently, Equation 4.1 simply states that:
(4.2)
φ(δk ) − φ(0) ≥ σ1 δk ∇f (xk )T pk = σ1 δk φ′ (0)
which simply means there is a sufficient increase in the value of the function.
Definition 4.4 (Curvature Rule). Given a function f : Rn → R and an ascent direction
pk with constant σ2 ∈ (σ1, 1), the curvature condition is satisfied if:
(4.3)
σ2 ∇f (xk )T pk ≥ ∇f (xk + δk pk )T pk
Remark 4.5. Recall from Proposition 1.20 and Theorem 1.22:
φ′ (δk ) = ∇f (xk )T pk
Thus we can write the curvature condition as:
(4.4)
σ2 φ′ (0) ≥ φ′ (δk )
Remember, since ∇f (xk )T pk > 0 (since pk is an ascent direction), φ′ (0) > 0. Thus, all the
curvature condition is saying is that the slope of the univariate function φ at the chosen
point δk is not so steep as it was at 0.
Example 4.6. Consider the function:

f(x, y) = exp(−(x² + y²)/10) cos(x² + y²)

The function is illustrated in Figure 4.1. Suppose we start at the point x0 = y0 = 1; then the function φ(t) is given by φ(t) = f((x0, y0) + t∇f(x0, y0)). A plot of φ(t) is shown in Figure 4.2. Armijo's Rule simply says:
(4.5) φ(t) ≥ φ(0) + σ1 φ′(0)t
Figure 4.1. The function f (x, y) has a maximum at x = y = 0, where the function
attains a value of 1.
Figure 4.2. A plot of φ(t) illustrates the function increases as we approach the
global maximum of the function and then decreases.
Here σ1 ∈ (0, 1) is a constant we choose and φ′ (0) and φ(0) are computed from φ(t). This
rule can be associated to the set of t so that φ(t) lies above the line φ(0) + σ1 φ′ (0)t. This is
illustrated in Figure 4.3. Likewise, the Curvature Condition asserts that:
(4.6)
σ2 φ′ (0) ≥ φ′ (t)
This rule can be associated to the set of t so that φ′ (t) is greater than some constant, where
σ2 is chosen between σ1 and 1 and φ′ (0) is computed. This is illustrated in Figure 4.3. When
these two conditions are combined, it is clear that we have a good bracket of the maximum
of φ(t).
Lemma 4.7. Suppose pk is an ascent direction at xk and that f(x) is bounded on the ray xk + tpk, where t ≥ 0. If 0 < σ1 < σ2 < 1, then there exists an interval of values for t satisfying the Wolfe Conditions.
Proof. Define φ(t) = f (xk + tpk ). The fact that pk is an ascent direction, means:
φ′ (0) = ∇f (xk )T pk > 0
by Definition 3.2. Thus, since 0 < σ1 < 1, the line l(t) = φ(0) + σ1 φ′(0)t is unbounded above and must intersect the graph of φ(t) at least once because φ(t) is bounded above by
Figure 4.3. The Wolfe Conditions are illustrated. Note the region accepted by the Armijo rule intersects with the region accepted by the curvature condition to bracket the (closest local) maximum for δk. Here σ1 = 0.15 and σ2 = 0.5.
assumption. (Note, φ(t) must lie above l(t) initially since we choose σ1 < 1 and at φ(0) the
tangent line has slope φ′ (0).) Let t′ > 0 be the least such value of t at which l(t) intersects
φ(t). That is:
φ(t′ ) = φ(0) + σ1 φ′ (0)t′
It follows that the interval (0, t′ ) satisfies the Armijo rule.
Applying the univariate Mean Value Theorem (Lemma 2.2), we see that there is a t′′ ∈
(0, t′ ) so that:
φ(t′ ) − φ(0) = φ′ (t′′ )(t′ − 0)
Combining the two previous equations yields:
σ1 φ′ (0)t′ = φ′ (t′′ )t′
Since σ2 > σ1 , we conclude that:
φ′ (t′′ )t′ = σ1 φ′ (0)t′ < σ2 φ′ (0)t′
Thus φ′ (t′′ ) < σ2 φ′ (0) and t′′ satisfies the Curvature Condition. Since we assumed that f
was continuously differentiable, it follows that there is a neighborhood around t′′ satisfying
the Curvature Condition as well. This completes the proof.
Remark 4.8. It turns out that the curvature condition is not necessary to ensure that
gradient descent converges to a point x∗ at which ∇f (x∗ ) = 0. From this observation, we
can derive an algorithm for identifying a value for δk that satisfies the Armijo rule.
Remark 4.9. Algorithm 8 codifies our observation about the Armijo rule. Here, we
choose a σ1 (called sigma in the algorithm) and a value β ∈ (0, 1) and an initial value t0 .
We set t = β^k t0 and iteratively test to see if
(4.7)
φ(t) ≥ φ(0) + σ1 tφ′ (0)
When this holds, we terminate execution and return t = β^k t0 for some k. The proof of Lemma
4.7 is sufficient to ensure this algorithm terminates with a point satisfying the Armijo rule.
BackTrace := proc (phi::operator, beta::numeric, t0::numeric, sigma)::numeric;
  local t, k, dphi;
  if beta < 0 or 1 < beta then
    t := -1
  else
    k := 0;
    t := beta^k*t0;
    dphi := evalf(eval(diff(phi(x), x), x = 0));
    while evalf(phi(t)) < evalf(phi(0))+sigma*dphi*t do
      k := k+1;
      t := beta^k*t0
    end do
  end if;
  t  #return t.
end proc
Algorithm 8. The back-tracing algorithm
Remark 4.10. In practice, σ1 is chosen small (between 10⁻⁵ and 0.1). The value β is generally between 0.1 and 0.5. Often t0 = 1, though this varies based on the problem at hand [Ber99].
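As a quick usage illustration (ours), take φ(t) = 10 − (t − 5)², so φ(0) = −15 and φ′(0) = 10 > 0. The call

BackTrace(t -> 10 - (t - 5)^2, 0.5, 1.0, 0.0001);

returns t = 1.0 immediately, since φ(1) = −6 already exceeds φ(0) + σ1 φ′(0) · 1 = −14.999.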
2. Algorithmic Convergence
Definition 4.11 (Gradient Related). Let f : Rn → R be a continuously differentiable
function. A sequence of ascent directions {pk } is gradient related if for any subsequence
K = {k1 , k2 , . . . } of {xk } that converges to a non-stationary point of f the corresponding
subsequence {pk }k∈K is bounded and has the property that:
(4.8) lim sup_{k→∞, k∈K} ∇f(xk)T pk > 0
Theorem 4.12. Assume that δk is chosen to ensure the Armijo rule holds at each iteration and the ascent directions pk are chosen so they are gradient related. Then if Algorithm
2 converges, it converges to a point x∗ so that ∇f (x∗ ) = 0 and the stopping criteria:
(4.9)
||∇f (xk )|| < ǫ
for some small ǫ > 0 may be used.
Proof. If the algorithm converges to a point x∗ at which ∇f (x∗ ) = 0, then the stopping
criteria ||∇f (xk )|| < ǫ can clearly be used.
We proceed by contradiction: Suppose that the sequence of points {xk } converges to
a point x+ with the property that ∇f(x+) ≠ 0. We construct {xk } so that {f (xk )} is a
monotonically nondecreasing sequence and therefore, either {f (xk )} converges to some finite
value or diverges to +∞. By assumption, f (x) is continuous and since x+ is a limit point
of {xk }, it follows f (x+ ) is a limit point of {f (xk )} and thus the sequence {f (xk )} must
converge to this value. That is:
(4.10) lim_{k→∞} f(xk+1) − f(xk) = 0
From the Armijo rule we have:
(4.11)
f (xk+1 ) − f (xk ) ≥ σ1 δk ∇f (xk )T pk
since xk+1 = xk + δk pk . Note the left hand side of this inequality goes to zero, while the right
hand side is always positive (pk are ascent directions and δk and σ1 are positive). Hence:
(4.12) lim_{k→∞} δk∇f(xk)T pk = 0
Let K = {k1 , k2 , . . . } be any subsequence of {xk } that converges to x+ . Since pk is gradient
related, we know that:
(4.13) lim sup_{k→∞, k∈K} ∇f(xk)T pk > 0
Therefore, limk∈K δk = 0 (otherwise, Equation 4.12 can't hold). Consider the backtrace algorithm. Since limk∈K δk = 0, we know there is some k+ ∈ K so that if k ≥ k+:
(4.14) f(xk + (δk/β)pk) − f(xk) < σ1(δk/β)∇f(xk)T pk
That is, if k ≥ k+, the while-loop in Algorithm 8 executes at least once, and it stops only when Inequality 4.14 no longer holds. Therefore, at the iteration before it stops, Inequality 4.14 must hold.
Since {pk} is gradient related, the sequence {pk}, for k ∈ K, is bounded and thus by the Bolzano-Weierstrass theorem there is a subsequence K+ ⊆ K so that {pk}k∈K+ converges to a vector p+. From Equation 4.14:
(4.15) (f(xk + δk+ pk) − f(xk))/δk+ < σ1 ∇f(xk)T pk
where δk+ = δk/β. Let φ(δk) = f(xk + δk pk). Remember that φ′(δk) = ∇f(xk + δk pk)T pk. We can rewrite Expression 4.15 as:
(4.16) (φ(δk+) − φ(0))/δk+ < σ1 ∇f(xk)T pk
Applying the mean value theorem, there is a δ̄k ∈ (0, δk+) so that:
(4.17) φ′(δ̄k)(δk+ − 0)/δk+ = φ′(δ̄k) < σ1 ∇f(xk)T pk
If we write φ′(δ̄k) in terms of the gradient of f and pk we obtain:
(4.18) ∇f(xk + δ̄k pk)T pk < σ1 ∇f(xk)T pk
Taking the limit as k → ∞ (and k ∈ K+) we see that:
(4.19) ∇f(x+)T p+ ≤ σ1 ∇f(x+)T p+
Since {pk} is gradient related, ∇f(x+)T p+ > 0; but σ1 ∈ (0, 1), so this cannot be true, contradicting our assumption that ∇f(x+) ≠ 0.
Corollary 4.13. Assume that δk is chosen by maximizing φ(δk ) at each iteration and
the ascent directions pk are chosen so they are gradient related. Then Algorithm 2 converges
to a point x∗ so that ∇f (x∗ ) = 0.
Corollary 4.14. Suppose that pk = Bk⁻¹∇f(xk) in each iteration of Algorithm 2 and
every matrix Bk is symmetric, positive definite. Then, Algorithm 2 converges to a stationary
point of f .
Exercise 32. Prove Corollary 4.13 by arguing that if xk+1 is generated from xk by
maximizing φ(δk ), then:
(4.20) f(xk+1) − f(xk) ≥ f(x̂k+1) − f(xk) ≥ σ1 δ̂k ∇f(xk)T pk
where x̂k+1 and δ̂k are generated by the Armijo rule.
Exercise 33. Prove Corollary 4.14. [Hint: Use Lemma 3.8.]
Example 4.15 (Convergence Failure). We illustrate a simple case in which we do not use minimization or the Armijo rule to choose a step length. Consider the function:

f(x) =
  −(4/5)(1 − x)² + 2 − 2x,    1 < x
  −(4/5)(1 + x)² + 2 + 2x,    x < −1
  −x² + 1,                    otherwise

Suppose we start at x0 = −2 and fix the step length δk = 1. If we follow a gradient ascent, then we obtain a sequence of values {xk} (k = 1, 2, . . . ) shown in Figure 4.4.

Figure 4.4. We illustrate the failure of the gradient ascent method to converge to a stationary point when we do not use the Armijo rule or minimization.

At each algorithmic step we have:
(4.21) xk+1 = xk + (df/dx)|x=xk
Evaluation of Equation 4.21 yields:
(4.22) xn = (−1)^(n+1) (3/5)^n x0 + (−1)^n (1 − (3/5)^n)
As n → ∞, we see this sequence oscillates between −1 and 1 and does not converge.
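This behavior is easy to reproduce numerically. In the following sketch (ours, not part of the original notes), the derivative is coded by hand from the three branches of f(x):

df := x -> piecewise(x > 1, (8/5)*(1 - x) - 2,
                     x < -1, -(8/5)*(1 + x) + 2,
                     -2*x):
xk := -2.0:
for k to 8 do
  xk := evalf(xk + df(xk));   # fixed step length delta_k = 1
  print(xk)
end do:

The printed iterates 1.6, −1.36, 1.216, −1.1296, . . . oscillate with ever smaller amplitude toward ±1, never settling on a stationary point.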
Remark 4.16. We state, but do not prove, the capture theorem, a result that quantifies,
in some sense, why gradient methods tend to converge to unique limit points. The proof can
be found in [Ber99], Pages 50–51.
Theorem 4.17 (Capture Theorem). Let f : Rn → R be continuously differentiable and
let {xk } be a sequence satisfying f (xk+1 ) ≥ f (xk ) for all k generated as xk+1 = xk +δk pk that
is convergent in the sense that every limit point of sequences that it generates is a stationary
point of f. Assume there exist scalars s > 0 and c > 0 so that for all k we have:
(4.23) δk ≤ s and ||pk|| ≤ c||∇f(xk)||
Let x∗ be a local maximum of f that is the only stationary point in an open neighborhood of x∗. Then there is an open set S containing x∗ so that if xK ∈ S for some K ≥ 0, then xk ∈ S for all k ≥ K and {xk} (k ≥ K) converges to x∗. Furthermore, given any scalar ǫ > 0, the set S can be chosen so that ||x − x∗|| < ǫ for all x ∈ S.
3. Rate of Convergence for Pure Gradient Ascent
Remark 4.18. We state, but do not prove Kantorovich's Lemma. A sketch of the proof
can be found on Page 76 of [Ber99]. A variation of the proof provided for Theorem 4.20
can be found in [Ber99] for gradient descent.
Lemma 4.19 (Kantorovich's Inequality). Let Q ∈ Rn×n be a symmetric positive definite matrix; furthermore, suppose the eigenvalues of Q are 0 < λ1 ≤ λ2 ≤ · · · ≤ λn. For any x ∈ Rn×1 with x ≠ 0 we have:
(4.24) (xT x)² / ((xT Qx)(xT Q⁻¹x)) ≥ 4λ1λn / (λ1 + λn)²
Exercise 34. Prove that Kantorovich’s Inequality also holds for negative definite matrices.
Theorem 4.20. Suppose f : Rn → R is a strictly concave function defined by:
(4.25) f(x) = (1/2)xT Qx
where Q ∈ Rn×n is symmetric (and negative definite). Suppose the sequence {xk} is generated using an exact line search method and x∗ is the (unique) global maximizer of f. Finally, suppose the eigenvalues of Q are 0 > λ1 ≥ λ2 ≥ · · · ≥ λn. Then:
(4.26) f(xk+1) ≥ ((λn − λ1)/(λn + λ1))² f(xk)
Remark 4.21. The value

(λn − λ1)/(λn + λ1)

is called the condition number of the optimization problem. Condition numbers close to 1 lead to slow convergence of gradient ascent (see Example 4.24).
Proof. Denote by gk = ∇f(xk) = Qxk. If gk = 0, then f(xk+1) = f(xk) and the theorem is proved. Suppose gk ≠ 0. Consider the function:
(4.27) φ(δ) = f(xk + δgk) = (1/2)(xk + δgk)T Q(xk + δgk)
We see at once that:
(4.28) φ(δ) = f(xk) + (1/2)δ xkT Qgk + (1/2)δ gkT Qxk + δ² f(gk)
Note that since Q = QT, Qxk = gk and xkT QT = xkT Q = gkT, we have:
(4.29) φ(δ) = f(xk) + (1/2)δ gkT gk + (1/2)δ gkT gk + δ² f(gk) = f(xk) + δ gkT gk + δ² f(gk)
Differentiating and setting φ′(δ) = 0 yields:
(4.30) gkT gk + 2δ f(gk) = 0 =⇒ δ = −gkT gk / (gkT Qgk)
Notice that δ > 0 because Q is negative definite. Substituting this value into f(xk+1) yields:
(4.31) f(xk+1) = f(xk + δgk) = f(xk) + δ gkT gk + δ² f(gk) = f(xk) − (1/2)(gkT gk)² / (gkT Qgk) = (1/2)xkT Qxk − (1/2)(gkT gk)² / (gkT Qgk)
Note that since Q is symmetric:
(4.32) xkT Qxk = xkT QT Q⁻¹Qxk = gkT Q⁻¹gk
Therefore, we can write:
(4.33) f(xk+1) = (1/2)(xkT Qxk − (gkT gk)²/(gkT Qgk)) = (1/2)(1 − (gkT gk)²/((gkT Qgk)(gkT Q⁻¹gk))) xkT Qxk = (1 − (gkT gk)²/((gkT Qgk)(gkT Q⁻¹gk))) f(xk)
Since Q is negative definite (and thus −Q is positive definite), we know three things:
(1) f(xk), f(xk+1) ≤ 0
(2) We know:
(gkT gk)² / ((gkT Qgk)(gkT Q⁻¹gk)) > 0
(3) Thus, in order for the equality in Expression 4.33 to be true it follows that:
0 < 1 − (gkT gk)² / ((gkT Qgk)(gkT Q⁻¹gk)) < 1
Lemma 4.19 and our previous observations imply that:
(4.34) f(xk+1) ≥ (1 − 4λ1λn/(λ1 + λn)²) f(xk) = ((λ1 + λn)²/(λ1 + λn)² − 4λ1λn/(λ1 + λn)²) f(xk) = ((λ1² − 2λ1λn + λn²)/(λ1 + λn)²) f(xk) = ((λ1 − λn)/(λ1 + λn))² f(xk)
This completes the proof.
Remark 4.22. It’s a little easier to see the proof of the previous theorem when we are
minimizing, since everything is positive instead of negative and we do not have to keep track
of sign changes. Nevertheless, such exercises are good to ensure understanding of the proof.
Remark 4.23. A reference implementation for Gradient Ascent is shown in Algorithm 9. The LineSearch method called at Line 49 is a combination of the parabolic bracketing algorithm (Algorithm 3) and Golden Section Search (Algorithm 5), along with a simple backtrace (Algorithm 8) to find the second parameter that is passed into the parabolic bracketing algorithm (starting with a = 0 and ∆a = 1, we check to see if φ(a + ∆a) > φ(a); if not, we set ∆a = β∆a).
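A hedged sketch of such a LineSearch wrapper (our reconstruction of the routine just described, assuming ParabolicBracket and GoldenSectionSearch from Algorithms 3 and 5 are loaded; the tolerance 0.0001 and β = 1/2 are arbitrary choices):

LineSearch := proc (phi::operator)::numeric;
  local da, bracket;
  da := 1.0;
  # shrink the trial step until phi(0 + da) > phi(0);
  # this terminates because phi'(0) > 0 for an ascent direction
  while evalf(phi(da)) <= evalf(phi(0)) do da := 0.5*da end do;
  bracket := ParabolicBracket(phi, 0, da);
  GoldenSectionSearch(phi, bracket[1], bracket[2], bracket[3], 0.0001)
end proc: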
Example 4.24. Consider the function F(x, y) = −2x² − 10y². If we initialize our gradient ascent at x = 15, y = 5 and set ǫ = 0.001, then we obtain the trajectory shown in Figure 4.5, which converges near the optimal point x∗ = 0, y∗ = 0. The zig-zagging motion is typical of the gradient ascent algorithm in cases where the largest and smallest eigenvalues of the matrix Q (in a pure quadratic function) are very different (see Theorem 4.20). In this case,

Q = [ −2    0 ]
    [  0  −10 ]

Thus, we can see that the condition number is 2/3, which is close enough to 1 to cause slow convergence, leading to the zig-zagging.
Exercise 35. Implement Gradient Ascent with inexact line search. Using F (x, y) from
Example 4.24, draw a picture like Figure 4.5 to illustrate your algorithm’s steps.
4. Rate of Convergence for Basic Ascent Algorithm
Theorem 4.25. Let f : Rn → R be twice continuously differentiable and suppose that
{xk } is generated by Algorithm 2 so that xk+1 = xk +δk pk and suppose that {xk } converges to
x∗, ∇f(x∗) = 0 and H(x∗) = ∇²f(x∗) is negative definite. Finally, assume that ∇f(xk) ≠ 0 for any k = 1, 2, . . . and:
(4.35) lim_{k→∞} ||pk + H⁻¹(x∗)∇f(xk)|| / ||∇f(xk)|| = 0
Then if δk is chosen by means of the backtrace algorithm with t0 = 1 and σ1 < 1/2, we have:
(4.36) lim_{k→∞} ||xk+1 − x∗|| / ||xk − x∗|| = 0
Furthermore, there is an integer k̄ ≥ 0 so that δk = 1 for all k ≥ k̄ (that is, eventually, the
backtrace algorithm will converge with no iterations of the while-loop).
Figure 4.5. Gradient ascent is illustrated on the function F (x, y) = −2 x2 − 10 y 2
starting at x = 15, y = 5. The zig-zagging motion is typical of the gradient ascent
algorithm in cases where λn and λ1 are very different (see Theorem 4.20).
Remark 4.26. This theorem asserts that if we have a method for choosing our ascent direction pk so that at large iteration numbers the ascent direction approaches a Newton step, then the algorithm will converge super-linearly (since the ratio in Equation 4.36 goes to zero).
Remark 4.27. We note that in Newton's Method we have pk = −H⁻¹(xk)∇f(xk). Thus, Equation 4.35 tells us that in the limit as k approaches ∞, we are taking pure Newton steps.
Proof. By the Second Order Mean Value Theorem (Lemma 2.4), we know that there is some t ∈ (0, 1) so that:
(4.37) f(xk + pk) − f(xk) = ∇f(xk)T pk + (1/2)pkT H(xk + tpk)pk
If we can show that for k sufficiently large we have:
(4.38) ∇f(xk)T pk + (1/2)pkT H(xk + tpk)pk ≥ σ1∇f(xk)T pk
then we will have established that there is a k̄ so that for any k ≥ k̄, the backtrace algorithm converges with no executions of the while loop. Equation 4.38 may equivalently be written as:
(4.39) (1 − σ1)∇f(xk)T pk + (1/2)pkT H(xk + tpk)pk ≥ 0
If we let:
qk = ∇f(xk)/||∇f(xk)||    and    rk = pk/||∇f(xk)||
and divide Inequality 4.39 by ||∇f(xk)||², we obtain the equivalent inequality:
(4.40) (1 − σ1)qkT rk + (1/2)rkT H(xk + tpk)rk ≥ 0
From Equation 4.35, we know that:
(4.41) lim_{k→∞} ||rk + H⁻¹(x∗)qk|| = 0
Since H(x∗) is negative definite and ||qk|| = 1, it follows that rk must be a bounded sequence; and since ∇f(xk) converges to ∇f(x∗) = 0, it's clear that pk must also converge to 0 and hence xk + pk converges to x∗. From this, it follows that xk + tpk also converges to x∗ and H(xk + tpk) converges to H(x∗). We can re-write Equation 4.35 as:
(4.42) rk = −H⁻¹(x∗)qk + zk
where {zk} is a sequence of vectors that converges to 0. Substituting the previous equation into Equation 4.40 yields:
(4.43) (1 − σ1)qkT(−H⁻¹(x∗)qk + zk) + (1/2)(−H⁻¹(x∗)qk + zk)T H(xk + tpk)(−H⁻¹(x∗)qk + zk) ≥ 0
The preceding equation can be re-written as:
(4.44) −(1 − σ1)qkT H⁻¹(x∗)qk + (1/2)(H⁻¹(x∗)qk)T H(xk + tpk)(H⁻¹(x∗)qk) + g(zk) ≥ 0
where g(zk) is a function of zk with the property that g(zk) → 0 as k → ∞. Since H(x) is symmetric (and so is its inverse), for large enough k, H(xk + tpk) ∼ H(x∗) and we can re-write the previous expression as:
(4.45) −(1 − σ1)qkT H⁻¹(x∗)qk + (1/2)qkT H⁻¹(x∗)qk + ḡ(zk) ≥ 0
where ḡ(zk) is another function with the same properties as g(zk). We can re-write this expression as:
(4.46) −(1/2 − σ1)qkT H⁻¹(x∗)qk + ḡ(zk) ≥ 0
Now, since σ1 < 1/2 and H⁻¹(x∗) is negative definite and ḡ(zk) → 0, the inequality in Expression 4.46 must hold for very large k. Thus, Expression 4.39 must hold and therefore Expression 4.38 must hold. Therefore, it follows there is some k̄ so that for all k ≥ k̄, the unit step in the Armijo rule is sufficient and no iterations of the while-loop in the back-tracing algorithm are executed.
We can re-write Expression 4.35 as:
(4.47) pk + H⁻¹(x∗)∇f(xk) = ||∇f(xk)||yk
where yk is a vector sequence that approaches 0 as k approaches infinity. Applying Taylor's Theorem to the vector-valued function ∇f(xk), we see that:
(4.48) ∇f(xk) = ∇f(xk) − ∇f(x∗) = H(x∗)(xk − x∗) + o(||xk − x∗||)
since ∇f(x∗) = 0, by assumption. This implies:
(4.49) H⁻¹(x∗)∇f(xk) = (xk − x∗) + o(||xk − x∗||)
We also see that ||∇f(xk)|| = O(||xk − x∗||). These two relations show that:
(4.50) pk + (xk − x∗) = o(||xk − x∗||)
When k ≥ k̄, we know that:
(4.51) xk+1 = xk + pk
Combining this with Equation 4.50 yields:
(4.52) xk+1 − x∗ = o(||xk − x∗||)
Now, letting k approach infinity yields:
(4.53) lim_{k→∞} ||xk+1 − x∗|| / ||xk − x∗|| = lim_{k→∞} o(||xk − x∗||) / ||xk − x∗|| = 0
This completes the proof.
1  GradientAscent := proc (F::operator, initVals::list, epsilon::numeric) :: list;
2    # A function to execute Gradient Ascent! Pass in a function
3    # and an initial point in the format [x = x0, y=y0, z=z0,...]
4    # epsilon is the tolerance on the gradient.
5
6    local vars, xnow, i, G, normG, OUT, passIn, phi, LS, vals, ttemp;
7
8    # The first few lines are housekeeping...
9    vars := [];
10   xnow := [];
11   vals := initVals;
12   for i to nops(initVals) do
13     vars := [op(vars), lhs(initVals[i])];
14     xnow := [op(xnow), rhs(initVals[i])]
15   end do;
16
17   # Compute the gradient of the function using the VectorCalculus package
18   # This looks nicer in "Maple" Format.
19   G := Vector(Gradient(F(op(vars)), vars));
20
21   # Compute the norm of the gradient at the initial point. This uses the Linear
22   # Algebra package.
23   normG := Norm(evalf(eval(G, initVals)));
24
25   # Output will be the path we take to get to the "optimal point."
26   # You can just output the last point for simplicity.
27   OUT := [];
28
29   # While we're not at a stationary point...
30   while evalb(epsilon < normG) do
31
32     # Update the output.
33     OUT := [op(OUT), xnow];
34
35     # Do some house keeping: This is x_next = x_now + s * G(x_now)
36     # Essentially, we're going to be solving for the distance to walk "s" below.
37     # Most of this line is just getting Maple to treat things as floating point
38     # and returning a list of the inputs we can pass to F.
39     passIn := convert(Vector(xnow) + s * evalf(eval(G, vals)), list);
40
41     # Create a local procedure to be the line function f(x_k + delta * p_k)
42     # Here 't' is playing the role of delta.
43     phi := proc (t) options operator, arrow;
44       evalf(eval(F(op(passIn)), s = t)) end proc;
45
46     # Get the output from a line search function. I'm using the
47     # ParabolicBracket code and the GoldenSection search
48     # found in the notes.
49     ttemp := LineSearch(phi);
50
51     # Compute the new value for xnow using the 't' we just found.
52     xnow := evalf(eval(passIn, [s = ttemp]));
53
54     # Do some housekeeping.
55     vals := [];
56     for i to nops(vars) do
57       vals := [op(vals), vars[i] = xnow[i]]
58     end do;
59
60     # Evaluate the norm of the gradient at the new point.
61     normG := Norm(evalf(eval(G, vals)))
62   end do;
63
64   # Record the very last point you found (it's the local max).
65   OUT := [op(OUT), xnow];
66
67   # Return the list of points.
68   OUT
69 end proc;
Algorithm 9. Gradient Ascent Reference Implementation
CHAPTER 5
Newton’s Method and Corrections
1. Newton’s Method
Remark 5.1. Suppose f : Rn → R and f is twice continuously differentiable. Recall in
our general ascent algorithm, when Bk = ∇2 f (xk ) = H(xk ), then we call the general ascent
algorithm Newton’s method. If we fix δk = 1 for k = 1, 2, . . . , then the algorithm is pure
Newton’s method. When δk is chosen in each iteration using a line search technique, then
the technique is a variable step length Newton’s method.
Remark 5.2. The general Newton’s Method algorithm (with variable step length) is
shown in Algorithm 10.
Example 5.3. Consider the function F(x, y) = −2x² − 10y⁴. This is a concave function as illustrated by its Hessian matrix:

H(x, y) = [ −4       0   ]
          [  0   −120y²  ]

This matrix is negative definite except when y = 0, at which point it is negative semi-definite. Clearly, x = y = 0 is the global maximizer for the function. If we begin execution of Newton's Method at x = 15, y = 5, it converges in 11 steps, as illustrated in Figure 5.1.
Figure 5.1. Newton’s Method converges for the function F (x, y) = −2x2 − 10y 4
in 11 steps, with minimal zigzagging.
1  NewtonsMethod := proc (F::operator, initVals::list, epsilon::numeric)::list;
2    local vars, xnow, i, G, normG, OUT, passIn, phi, LS, vals, ttemp, H;
3    vars := []; xnow := []; vals := initVals;
4
5    #The first few lines are housekeeping, so we can pass variables around.
6    for i to nops(initVals) do
7      vars := [op(vars), lhs(initVals[i])];
8      xnow := [op(xnow), rhs(initVals[i])];
9    end do;
10
11   #Compute the gradient and hessian using the VectorCalculus package.
12   #Store the Gradient as a Vector (that's not the default).
13   G := Vector(Gradient(F(op(vars)), vars));
14   H := Hessian(F(op(vars)), vars);
15
16   #Compute the current gradient norm.
17   normG := Norm(evalf(eval(G, initVals)));
18
19   #Output will be a path to the optimal solution.
20   #Your output could just be the optimal solution.
21   OUT := [];
22
23   #While we're not at a stationary point...
24   while evalb(epsilon < normG) do
25
26     #Update the output.
27     OUT := [op(OUT), xnow];
28
29     #Do some housekeeping. This is x_next = x_now - s * H^(-1) * G.
30     #Essentially, we're going to be solving for the distance to walk "s" below.
31     #Most of this line is just getting Maple to treat things as floating point and
32     #returning a list of the inputs we can pass to F.
33     passIn := convert(Vector(xnow) -
34       s*evalf(eval(H, vals))^(-1) . evalf(eval(G, vals)), list);
35
36     #Build a univariate function that we can maximize. This is a little weird.
37     #We defined our list (above) in terms of s, and we're going to evaluate
38     #s at the value t we pass in.
39     phi := proc (t)
40       options operator, arrow;
41       evalf(eval(F(op(passIn)), s = t))
42     end proc;
43
44     #Find the optimal step using a line search.
45     ttemp := LineSearch(phi);
46
47     #Update xnow using ttemp. This is our next point.
48     xnow := evalf(eval(passIn, [s = ttemp]));
49
50     #Store the information in xnow in the vals list.
51     vals := [];
52     for i to nops(vars) do
53       vals := [op(vals), vars[i] = xnow[i]]
54     end do;
55
56     #Compute the new norm of the gradient. Notice, we don't have to recompute the
57     #gradient, it's symbolic.
58     normG := Norm(evalf(eval(G, vals)))
59   end do;
60
61   #Pass in the last point (the while loop doesn't get executed the last time around).
62   OUT := [op(OUT), xnow];
63
64   #Return the optimal solution.
65   OUT:
66 end proc:
Algorithm 10. Variable Step Newton’s Method
Exercise 36. Implement Newton's Method and test it on a variation of Rosenbrock's Function:
(5.1) −(1 − x)² − 100(y − x²)²
2. Convergence Issues in Newton’s Method
Example 5.4. Consider the function:
2
2
(5.2)
F (x, y) = x2 + 3 y 2 e1−x −y
A plot of the function is shown in Figure 5.2. It has two global maxima, a local minima
between the maxima and saddle points near the peaks. If we execute Newton’s Method
starting at x = 1 and y = 0.5, we obtain an interesting phenomenon. The Hessian matrix
at this point is:
"
#
−1.947001959 −3.504603525
(5.3)
H(1, 0.5) =
−3.504603525 −1.362901370
This matrix is not positive definite, as illustrated by the fact that its (1, 1) element is negative
(and therefore, the Cholesky decomposition algorithm will fail). In fact, this matrix is
indefinite. The consequence of this is that the direction −sH⁻¹(1, 0.5)∇F(1, 0.5) is not an
ascent direction for any positive value of s, and thus the line search algorithm converges to
Figure 5.2. A double peaked function with a local minimum between the peaks.
This function also has saddle points.
s = 0 when maximizing the function:
φ(s) = F([1, 0.5]^T − sH⁻¹(1, 0.5)∇F(1, 0.5))
Thus, the algorithm fails to move from the initial point. Unless one includes a loop counter
in the while loop at Line 24 of Algorithm 10, the algorithm will cycle forever.
Exercise 37. Implement Pure Newton's method and try it on the previous example.
(You should not see infinite cycling, but you should also not see good convergence.)
Remark 5.5. We are now in a situation where we'd like to know whether Newton's Method
converges and, if so, how fast. Naturally, we've already foreshadowed the results presented
below by our discussion of Newton's Method in one dimension (see Theorem 3.53).
Definition 5.6 (Matrix Norm). Let M ∈ R^(n×n). Then the matrix norm of M is:
(5.4)  ||M|| = max_{||x||=1} ||Mx||
Theorem 5.7. Let f : R^n → R with local maximum at x*. For δ > 0, let Sδ = {x :
||x − x*|| ≤ δ}. Suppose that H(x*) is invertible and that, within some sphere Sε, f is twice
continuously differentiable. Then:
(1) There exists a δ > 0 such that if x0 ∈ Sδ, the sequence x_k generated by the pure
Newton's method:
(5.5)  x_{k+1} = x_k − H⁻¹(x_k)∇f(x_k)
is defined, belongs to Sδ and converges to x*. Furthermore, ||x_k − x*|| converges
superlinearly.
(2) If for some L > 0, M > 0 and δ > 0 and for all x1, x2 ∈ Sδ:
(5.6)  ||H(x1) − H(x2)|| ≤ L||x1 − x2||
and
(5.7)  ||H⁻¹(x1)|| ≤ M
then, if x0 ∈ Sδ, we have:
||x_{k+1} − x*|| ≤ (LM/2)||x_k − x*||²,  ∀k = 0, 1, . . .
Thus, if LMδ/2 < 1 and x0 ∈ Sδ, ||x_k − x*|| converges superlinearly with order at
least 2.
Proof. To prove the first statement, choose δ > 0 so that H(x) exists and is invertible
for all x ∈ Sδ. Define M > 0 so that:
||H⁻¹(x)|| ≤ M  ∀x ∈ Sδ
Let h = x_k − x*. Then from Lemma 2.7 (the Mean Value Theorem for Vector Valued
Functions) we have:
(5.8)  ∇f(x* + h) − ∇f(x*) = ∫_0^1 ∇²f(x* + th)h dt  ⟹  ∇f(x_k) = ∫_0^1 ∇²f(x* + th)h dt
since ∇f(x*) = 0 at the local maximum x*.
We can write:
(5.9)  ||x_{k+1} − x*|| = ||x_k − H⁻¹(x_k)∇f(x_k) − x*|| = ||x_k − x* − H⁻¹(x_k)∇f(x_k)||
We now factor H⁻¹(x_k) out of the above equation to obtain:
(5.10)  ||x_{k+1} − x*|| = ||H⁻¹(x_k)[H(x_k)(x_k − x*) − ∇f(x_k)]||
Using Equation 5.8, we can write the previous equality as:
(5.11)  ||x_{k+1} − x*|| = ||H⁻¹(x_k)[H(x_k)(x_k − x*) − ∫_0^1 ∇²f(x* + th)h dt]||
        = ||H⁻¹(x_k)[H(x_k)(x_k − x*) − (∫_0^1 ∇²f(x* + th)dt)(x_k − x*)]||
because h = x_k − x* and is not dependent on t. Factoring (x_k − x*) from both terms inside
the square brackets on the right hand side yields:
(5.12)  ||x_{k+1} − x*|| = ||H⁻¹(x_k)[H(x_k) − ∫_0^1 ∇²f(x* + th)dt](x_k − x*)||
        = ||H⁻¹(x_k)[H(x_k) − ∫_0^1 H(tx_k + (1 − t)x*)dt](x_k − x*)||
Thus we conclude:
(5.13)  ||x_{k+1} − x*|| ≤ ||H⁻¹(x_k)|| ||H(x_k) − ∫_0^1 H(tx_k + (1 − t)x*)dt|| ||x_k − x*||
        = ||H⁻¹(x_k)|| ||∫_0^1 [H(x_k) − H(tx_k + (1 − t)x*)]dt|| ||x_k − x*||
        ≤ M (∫_0^1 ||H(x_k) − H(tx_k + (1 − t)x*)||dt) ||x_k − x*||
by our assumption on H(x) for x ∈ Sδ. Note that in the penultimate step we moved H(x_k)
under the integral sign, since:
∫_0^1 H(x_k)dt = H(x_k)
Our assumption on f(x) (namely that it is twice continuously differentiable in Sδ) tells us
we can take δ small enough to ensure that the norm of the integral is sufficiently small. This
establishes the superlinear convergence of the algorithm inside Sδ.
To prove the second assertion, suppose that H(x) is Lipschitz inside Sδ. Then Inequality
5.13 becomes:
(5.14)  ||x_{k+1} − x*|| ≤ M (∫_0^1 L||x_k − (tx_k + (1 − t)x*)||dt) ||x_k − x*||
        = M (∫_0^1 L(1 − t)||x_k − x*||dt) ||x_k − x*|| = (LM/2)||x_k − x*||²
This completes the proof.
Remark 5.8. Notice that at no time did we state that Newton’s Method would be
attracted to a local maximum. In fact, Newton’s method tends to be attracted to stationary
points of any kind.
Remark 5.9. What we’ve shown is that, when Newton’s method gets close to a solution,
it works extremely well. Unfortunately, there’s no way to ensure that Newton’s method – in
its raw form – will get to a point where it works well. For this reason, a number of corrections
have been devised.
3. Newton Method Corrections
Remark 5.10. The original “Newton’s Method” correction was a simple one developed
by Cauchy (and was the motivation for developing the method of steepest ascent). In essence,
one executes a gradient ascent algorithm until the Hessian matrix becomes negative definite.
At this point, we are close to a maximum and we use Newton’s method.
Example 5.11. Consider again the function:
F(x, y) = (x² + 3y²) e^(1−x²−y²)
Suppose that instead of using the variable step Newton's method algorithm, we take gradient
ascent steps until H(x) becomes negative definite, at which point we switch to variable step
Newton steps. This can be accomplished by adding a test on H before Line 29 of Algorithm
10. When we do this and start at x0 = 1, y0 = 0.1, we obtain convergence to x* = 0, y* = 1
in five steps, as illustrated in Figure 5.3.
Exercise 38. Implement the modification to Newton's method discussed above and
test the rate of convergence on Rosenbrock's function, compared to pure gradient ascent,
when starting from x0 = 0 and y0 = 0. Prove that the resulting sequence {x_k} generated
by this algorithm is gradient related and therefore the algorithm converges to a stationary
point.
Exercise 39. Prove that Cauchy’s modification to Newton’s algorithm converges superlinearly.
Remark 5.12. The most common correction to Newton’s Method is the Modified Cholesky
Decomposition approach. In this approach, we will be passing −H(x) into a modification of
the Cholesky decomposition. The goal is to correct the Hessian matrix so that it is negative
definite (and thus its negative is positive definite). In this way, we will always take an ascent
step. The problem is, we’d like to maintain as much of the Hessian structure as possible
and also ensure that when the Hessian is naturally negative definite, we do not change it
at all. A simple way to do this is to choose µ1, µ2 ∈ R+ with µ1 < µ2. In the Cholesky
decomposition at Line 12, we check to see if Hii > µ1. If not, we replace Hii with µ2 and
continue. In this way, the modified Cholesky decomposition will always return a factorization
for a positive definite matrix that has some resemblance to the Hessian matrix. A reference
implementation is shown in Algorithm 11.
Figure 5.3. A simple modification to Newton's method first used by Cauchy. While
H(x_k) is not negative definite, we use gradient ascent to converge to the neighborhood
of a stationary point (ideally a local maximum). We then switch to a Newton step.
Remark 5.13. Suppose that we have constructed L ∈ R^(n×n) from the negative of the
Hessian matrix for the function f(x_k). That is:
(5.15)  −H(x_k) ∼ LL^T
where ∼ indicates that −H(x_k) may be positive definite, or may just be loosely related to
LL^T, depending on the steps executed in the modified Cholesky decomposition. We must
solve the problem:
(5.16)  (LL^T)⁻¹ ∇f(x_k) = p
where p is our direction. This can be accomplished efficiently by solving:
(5.17)  ∇f(x_k) = Lz
by forward substitution, and then solving:
(5.18)  L^T p = z
by backward substitution, since L is a lower triangular matrix and L^T is an upper triangular
matrix.
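A minimal sketch of this two-stage triangular solve (our code, not part of the original notes; the name TriangularSolve is ours) is:

TriangularSolve := proc (L::Matrix, g::Vector, n::integer)::Vector;
  local z, p, i, j;
  z := Vector(n); p := Vector(n);
  #Forward substitution: solve L z = g.
  for i to n do
    z[i] := (g[i] - add(L[i, j]*z[j], j = 1 .. i-1))/L[i, i]
  end do;
  #Backward substitution: solve L^T p = z.
  for i from n by -1 to 1 do
    p[i] := (z[i] - add(L[j, i]*p[j], j = i+1 .. n))/L[i, i]
  end do;
  #p now solves (L.Transpose(L)).p = g.
  p
end proc: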
Example 5.14. Consider again the function:
F(x, y) = (x² + 3y²) e^(1−x²−y²)
1   ModifiedCholeskyDecomp := proc (M::Matrix, n::integer,
2       mu1::numeric, mu2::numeric)::list;
3     local MM, H, ret, i, j, k;
4     MM := Matrix(M);
5     H := Matrix(n, n);
6     ret := true;
7     for i to n do
8       if mu1 <= MM[i, i] then
9         H[i, i] := sqrt(MM[i, i])
10      else
11        H[i, i] := mu2
12      end if;
13
14      for j from i+1 to n do
15        H[j, i] := MM[j, i]/H[i, i];
16
17        for k from i+1 to j do
18          MM[j, k] := MM[j, k]-H[j, i]*H[k, i]
19        end do:
20      end do:
21    end do:
22    if ret then [H, ret]
23    else [M, ret]
24    end if:
25  end proc:
Algorithm 11. The modified Cholesky algorithm always returns a factorization of a
positive definite matrix that is close to the input matrix.
and suppose we are starting at position x = 1 and y = 1/2. Then the Hessian matrix is:
H = [−(5/2)e^(−1/4), −(9/2)e^(−1/4); −(9/2)e^(−1/4), −(7/4)e^(−1/4)]
As we noted, this matrix is not positive definite. When we perform the modified Cholesky
decomposition, we obtain the matrix:
L = [1, 0; −(9/2)e^(−1/4), 1]
The gradient at this point is:
∇f(1, 1/2) = [−(3/2)e^(−1/4); (5/4)e^(−1/4)]
This leads to the first problem:
(5.19)  [1, 0; −(9/2)e^(−1/4), 1] [z_1; z_2] = [−(3/2)e^(−1/4); (5/4)e^(−1/4)]
Thus we see immediately that:
z_1 = −(3/2)e^(−1/4)
z_2 = (5/4)e^(−1/4) + (9/2)e^(−1/4) z_1 = −(27/4)(e^(−1/4))² + (5/4)e^(−1/4)
We can now solve the problem:
(5.20)  [1, −(9/2)e^(−1/4); 0, 1] [p_1; p_2] = [−(3/2)e^(−1/4); −(27/4)(e^(−1/4))² + (5/4)e^(−1/4)]
Thus:
p_2 = −(27/4)(e^(−1/4))² + (5/4)e^(−1/4)
p_1 = −(3/2)e^(−1/4) + (9/2)e^(−1/4) p_2 = −(243/8)(e^(−1/4))³ + (45/8)(e^(−1/4))² − (3/2)e^(−1/4)
If we compute ∇f(1, 1/2)^T p we see:
(5.21)  ∇f(1, 1/2)^T p = −(3/2)e^(−1/4) (−(243/8)(e^(−1/4))³ + (45/8)(e^(−1/4))² − (3/2)e^(−1/4))
        + (5/4)e^(−1/4) (−(27/4)(e^(−1/4))² + (5/4)e^(−1/4)) ≈ 11.103 > 0
Thus p is an ascent direction.
Remark 5.15. A reference implementation of the modified Cholesky correction to Newton’s Method is shown in Algorithm 12.
Theorem 5.16. Let f : R^n → R be twice continuously differentiable. Suppose 0 < µ1 <
µ2. Let {x_k} be the sequence of points generated by the modified Newton's method. Then this
sequence converges to a stationary point x* (i.e., ∇f(x*) = 0).
Proof. By construction, p_k = B_k⁻¹∇f(x_k) at each stage, where B_k is positive definite
and non-singular for each k. Thus, ∇f(x_k)^T p_k > 0 just in case ∇f(x_k) ≠ 0. Therefore
{p_k} is gradient related, the necessary conditions of Theorem 4.12 are satisfied, and {x_k}
converges to a stationary point.
Theorem 5.17. Let f : R^n → R be twice continuously differentiable. Suppose that x*
is the stationary point to which the sequence {x_k} converges, when {x_k} is the sequence of
points generated by the modified Newton's method. If H(x*) is negative definite, then there
is a µ1 > 0 so that the algorithm converges superlinearly when using the Armijo rule with
σ1 < 1/2 and t0 = 1 in the backtrace algorithm.
Proof. Let 0 < µ1 be any value that is less than the minimal diagonal value Mii encountered at
Line 8 of Algorithm 11 when it is executed on −H(x*). By the continuity of H (f is twice
continuously differentiable), there is some neighborhood S of x* so that the modified Cholesky
decomposition of −H(x) is identical to the Cholesky decomposition of −H(x) for all x ∈ S.
Then, for all x_k in S, the modified Newton's algorithm becomes Newton's method and thus:
lim_{k→∞} ||p_k + H⁻¹(x*)∇f(x_k)|| / ||∇f(x_k)|| = 0
ModifiedNewtonsMethod := proc (F::operator, initVals::list, epsilon::numeric,
    maxIter::integer)::list;
  local vars, xnow, i, G, normG, OUT, passIn, phi, vals, ttemp, H,
    count, L, X, P;
  vars := []; xnow := []; vals := initVals;

  #The first few lines are housekeeping, so we can pass variables around.
  for i to nops(initVals) do
    vars := [op(vars), lhs(initVals[i])];
    xnow := [op(xnow), rhs(initVals[i])];
  end do;

  #Compute the gradient and hessian using the VectorCalculus package.
  #Store the Gradient as a Vector (that's not the default).
  G := Vector(Gradient(F(op(vars)), vars));
  H := Hessian(F(op(vars)), vars);

  #Compute the current gradient norm.
  normG := Norm(evalf(eval(G, initVals)));

  #Output will be a path to the optimal solution.
  #Your output could just be the optimal solution. Initialize the iteration counter.
  OUT := [];
  count := 0;

  #While we're not at a stationary point...
  while evalb(epsilon < normG and count < maxIter) do

    #Update the output.
    OUT := [op(OUT), xnow];

    #Update count.
    count := count + 1;

    #Compute the modified Cholesky decomposition for the Hessian.
    L := ModifiedCholeskyDecomp(evalf(eval(-H, vals)), nops(vals), .1, 1)[1];

    #Now solve for the ascent direction P.
    X := LinearSolve(L, evalf(eval(G, vals)), method = 'subs');
    P := LinearSolve(Transpose(L), X, method = 'subs');

    #Do some housekeeping. This is x_next = x_now + s * P.
    #Essentially, we're going to be solving for the distance to walk "s" below.
    passIn := convert(Vector(xnow) + s*P, list);

    #Build a univariate function that we can maximize. This is a little weird.
    #We defined our list (above) in terms of s, and we're going to evaluate
    #s at the value t we pass in.
    phi := proc (t)
      options operator, arrow;
      evalf(eval(F(op(passIn)), s = t))
    end proc;

    #Find the optimal step using a line search.
    ttemp := LineSearch(phi);

    #Update xnow using ttemp. This is our next point.
    xnow := evalf(eval(passIn, [s = ttemp]));

    #Store the information in xnow in the vals list.
    vals := [];
    for i to nops(vars) do
      vals := [op(vals), vars[i] = xnow[i]]
    end do;

    #Compute the new norm of the gradient. Notice, we don't have to recompute the
    #gradient, it's symbolic.
    normG := Norm(evalf(eval(G, vals)))
  end do;

  #Pass in the last point (the while loop doesn't get executed the last time
  #around).
  OUT := [op(OUT), xnow];

  #Return the optimal solution.
  OUT:
end proc:
Algorithm 12. Variable Step Newton’s Method using the Modified Cholesky Decomposition
In this case, the necessary conditions of Theorem 4.25 are satisfied and we see that:
lim_{k→∞} ||x_{k+1} − x*|| / ||x_k − x*|| = 0
This completes the proof.
Remark 5.18. The previous theorem can also be modified so that µ1 = c||∇f(x_k)||,
where c > 0 is a fixed constant.
Exercise 40. Prove the superlinear convergence of the modified Newton’s method when
µ1 = c||∇f (xk )|| and c > 0.
Remark 5.19. There is no scientific method for choosing µ1 and µ2 in the modified
Newton's method. Various authors suggest different approaches; most recommend beginning
with a very small µ1. If µ2 is too small, the matrix LL^T may be near singular. On the other
hand, if µ2 is too large, then convergence will be slow, since the matrix will be diagonally
dominant. Generally, it is up to the user to find adequate values of µ1 and µ2, though
[Ber99] suggests varying the size of µ1 and µ2 based on the largest value of the diagonal of
the computed Hessian matrix.
Example 5.20. We illustrate the convergence of the modified Newton's method for the
function:
F(x, y) = (x² + 3y²) e^(1−x²−y²)
starting at position x = 1 and y = 1/2. We set µ1 = 0.1 and µ2 = 1. Algorithmic steps are
shown in Figure 5.4. Notice that in this case we obtain convergence to a global maximum in 5
iterations of the algorithm.
Figure 5.4. Modified Newton’s method uses the modified Cholesky decomposition
and efficient linear solution methods to find an ascent direction in the case when
the Hessian matrix is not negative definite. This algorithm converges superlinearly,
as illustrated in this case.
CHAPTER 6
Conjugate Direction Methods
Remark 6.1. We begin this chapter with the study of conjugate gradient methods,
which have the property that they converge to an optimal solution for a concave quadratic
objective function f : Rn → R in n iterations, as we will see. These methods can be applied
to non-quadratic functions (with some efficiency loss), but they are remarkably good for
problems of very large dimensionality.
1. Conjugate Directions
Definition 6.2 (Conjugate Directions). Let Q ∈ Rn×n be a negative (or positive) definite matrix. Directions p0 , . . . pn−1 ∈ Rn are Q conjugate if for all i 6= j we have:
(6.1)
pTi Qpj = 0
Remark 6.3. For the remainder of this section when we say Q is definite we mean it is
either positive or negative definite and our results will apply in each case.
Lemma 6.4. If Q ∈ Rn×n is definite and p0 , . . . pn−1 ∈ Rn are Q conjugate, then
p0 , . . . pn−1 are linearly independent.
Proof. By way of contradiction, suppose that:
(6.2)
pn−1 = α0 p0 + · · · + αn−2 pn−2
Multiplying on the left by p_{n−1}^T Q yields:
(6.3)  p_{n−1}^T Q p_{n−1} = α_0 p_{n−1}^T Q p_0 + · · · + α_{n−2} p_{n−1}^T Q p_{n−2} = 0
by the definition of Q conjugacy. However, since Q is definite, we know that
p_{n−1}^T Q p_{n−1} ≠ 0, a contradiction.
Definition 6.5 (Conjugate Direction Method). Suppose Q ∈ R^(n×n) is a symmetric
negative definite matrix with conjugate directions p_0, . . . , p_{n−1} ∈ R^n and that f : R^n → R is
given by:
(6.4)  f(x) = (1/2) x^T Q x + b^T x
with b ∈ R^n. If x0 ∈ R^n, then the sequence generated by the conjugate direction method, x_k
(k = 0, . . . , n − 1), is given by:
(6.5)  x_{k+1} = x_k + δ_k p_k
where δ_k solves the problem:
(6.6)  max_{δ_k} f(x_k + δ_k p_k)
That is, δ_k = arg max_{δ_k} f(x_k + δ_k p_k).
Lemma 6.6. For any k, in the conjugate direction method, δ_k is given by:
(6.7)  δ_k = − (p_k^T (Q x_k + b)) / (p_k^T Q p_k) = − (p_k^T ∇f(x_k)) / (p_k^T Q p_k)
Exercise 41. Prove Lemma 6.6. Argue that when pk is an ascent direction, δk > 0.
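A minimal sketch of the closed-form step length from Lemma 6.6 for the quadratic f(x) = (1/2)x^T Q x + b^T x (our code; the name StepLength is ours):

StepLength := proc (Q::Matrix, b::Vector, x::Vector, p::Vector)
  uses LinearAlgebra;
  #delta = -p^T (Q x + b) / (p^T Q p), per Equation 6.7.
  -(Transpose(p).(Q.x + b))/(Transpose(p).Q.p)
end proc: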
Theorem 6.7. Let Mk ⊆ Rn be the subspace spanned by p0 , . . . , pk−1 . Then f (x) is
maximized on Mk by the conjugate direction method after k iterations and therefore after
n iterations, the conjugate direction method converges to xn = x∗ , the global maximum of
f (x).
Proof. Since p_0, . . . , p_{k−1} are linearly independent and:
(6.8)  x_k = x_0 + δ_0 p_0 + · · · + δ_{k−1} p_{k−1}
it suffices to show that¹:
(6.9)  ∂f(x_k)/∂δ_i = 0  ∀i = 0, . . . , k − 1
Observe first that:
(6.10)  ∂f(x_k)/∂δ_{k−1} = 0
as a necessary condition for the conjugate direction step. That is, since
δ_{k−1} = arg max_{δ_{k−1}} f(x_{k−1} + δ_{k−1} p_{k−1}), Equation 6.10 follows at once. We also note
for i = 0, . . . , k − 1:
(6.11)  ∂f(x_k)/∂δ_i = ∇f(x_k)^T p_i
since ∂f(x_k)/∂δ_i is a directional derivative. Thus we already know that:
(6.12)  ∇f(x_k)^T p_{k−1} = 0
Now, observe that for i = 0, . . . , k − 2:
(6.13)  x_k = x_i + δ_i p_i + δ_{i+1} p_{i+1} + · · · + δ_{k−1} p_{k−1} = x_i + Σ_{j=i}^{k−1} δ_j p_j
Therefore:
(6.14)  ∂f(x_k)/∂δ_i = ∇f(x_k)^T p_i = (Q x_k + b)^T p_i = x_k^T Q p_i + b^T p_i
        = (x_i + Σ_{j=i}^{k−1} δ_j p_j)^T Q p_i + b^T p_i
Since p_j^T Q p_i = 0 if i ≠ j, the previous equation becomes:
(6.15)  ∂f(x_k)/∂δ_i = ((x_i + δ_i p_i)^T Q + b^T) p_i = (x_{i+1}^T Q + b^T) p_i
¹There is something a little subtle going on here. We are actually arguing that a strictly concave function
achieves a unique global maximum on a constraining set M_k. We have not shown that the sufficient condition
given below is a sufficient condition outside of R^n. This can be corrected, but for now, we assume it is clear
to the reader.
since x_{i+1} = x_i + δ_i p_i. But:
(6.16)  ∇f(x_{i+1})^T = x_{i+1}^T Q + b^T
Thus, we have proved that:
(6.17)  ∂f(x_k)/∂δ_i = ∇f(x_{i+1})^T p_i = 0
when we apply Equation 6.12 with i in place of k. Thus we have shown Equation 6.9 holds
for all k and it follows that f(x) is maximized on M_k by the conjugate direction method
after k iterations; therefore, after n iterations, the conjugate direction method converges
to x_n = x*, the global maximum of f(x), since p_0, . . . , p_{n−1} must be a basis for R^n.
Example 6.8. Consider the case when:
Q = [−1, 0; 0, −1]
and suppose we are given the Q conjugate directions:
p_0 = [1; 0]   p_1 = [0; 1]
and begin at the point x_0 = [1; 1]. To compute δ_0, we use Equation 6.7 (with b = 0):
(6.18)  δ_0 = − (p_0^T Q x_0) / (p_0^T Q p_0) = − (−1)/(−1) = −1
Notice that δ_0 < 0 because p_0 is not an ascent direction. Thus:
(6.19)  x_1 = [1; 1] + (−1)[1; 0] = [0; 1]
Repeating this logic with p_1, we see that δ_1 = −1 as well and:
(6.20)  x_2 = [0; 1] + (−1)[0; 1] = [0; 0]
which is clearly the global maximum of the function.
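Using the StepLength sketch above (with b = 0 here), one can check δ_0 directly:

Q := Matrix([[-1, 0], [0, -1]]):
delta0 := StepLength(Q, Vector([0, 0]), Vector([1, 1]), Vector([1, 0]));
#Returns -1, matching Equation 6.18.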
Remark 6.9. All conjugate direction methods can be thought of as scaled versions of the
previous example. Essentially, when transformed into an appropriate basis, we are simply
maximizing along the various basis elements. The challenge is to determine a basis that is Q
conjugate.
2. Generating Conjugate Directions
Remark 6.10. The principal problem we have yet to overcome is the generation of Q
conjugate directions. We overcome this problem by applying the Gram-Schmidt procedure,
which we define in the following theorem.
Theorem 6.11. Suppose that d_0, . . . , d_{n−1} ∈ R^n are arbitrarily chosen linearly independent directions (that span R^n). Then the directions:
(6.21)  p_0 = d_0
(6.22)  p_{i+1} = d_{i+1} + Σ_{j=0}^{i} c_j^(i+1) p_j
where:
(6.23)  c_m^(i+1) = − (d_{i+1}^T Q p_m) / (p_m^T Q p_m),  m = 0, . . . , i,  ∀i = 0, . . . , n − 2
are Q conjugate.
Proof. Assume that Equations 6.21 and 6.22 hold. At construction stage i + 1, to
ensure Q conjugacy we require p_{i+1}^T Q p_m = 0 for m = 0, . . . , i. Using Equation 6.22 yields:
(6.24)  p_{i+1}^T Q p_m = (d_{i+1} + Σ_{j=0}^{i} c_j^(i+1) p_j)^T Q p_m
        = d_{i+1}^T Q p_m + Σ_{j=0}^{i} c_j^(i+1) p_j^T Q p_m
If we apply induction, we know that c_j^(i+1) p_j^T Q p_m ≠ 0 only when j = m. Thus, if
p_{i+1}^T Q p_m = 0, then:
(6.25)  d_{i+1}^T Q p_m + c_m^(i+1) p_m^T Q p_m = 0,  m = 0, . . . , i
That is, we have i + 1 equations with i + 1 unknowns and one unknown in each equation.
Solving Equation 6.25 yields:
(6.26)  c_m^(i+1) = − (d_{i+1}^T Q p_m) / (p_m^T Q p_m),  m = 0, . . . , i,  ∀i = 0, . . . , n − 2
This completes the proof.
Corollary 6.12. The space spanned by d0 , . . . , dk is identical to the space spanned by
p0 , . . . , pk .
Exercise 42. Prove Corollary 6.12. [Hint: Note that Equations 6.21 and 6.22 show that
di+1 is a linear combination of pi+1 , . . . , p0 .]
Remark 6.13. The approach described in the preceding theorem is known as the Gram-Schmidt method.
Exercise 43. In Example 6.8, show that the directions generated by the Gram-Schmidt method are identical to the provided directions.
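A minimal sketch of the procedure in Theorem 6.11 (our code; the name MakeConjugate is ours; dirs is a list of linearly independent Vectors):

MakeConjugate := proc (dirs::list, Q::Matrix)::list;
  uses LinearAlgebra;
  local P, i, m, p;
  #Equation 6.21: p[0] = d[0].
  P := [dirs[1]];
  for i from 2 to nops(dirs) do
    #Equations 6.22-6.23: subtract off components against earlier p's.
    p := dirs[i];
    for m to i-1 do
      p := p - ((Transpose(dirs[i]).Q.P[m])/(Transpose(P[m]).Q.P[m]))*P[m]
    end do;
    P := [op(P), p]
  end do;
  P
end proc: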
3. The Conjugate Gradient Method
Remark 6.14. One major problem to overcome with a conjugate direction method is
obtaining a set of Q conjugate directions. This can be accomplished by using the gradients
of f (x) at xk . Naturally, we must show that these vectors are linearly independent (and
thus will form a basis of Rn after n iterations).
Lemma 6.15. Let f : R^n → R be defined as in Definition 6.5 and let x_k be the sequence generated by the conjugate direction method using the Gram-Schmidt generated
directions with d_k = g_k = ∇f(x_k). Then g_k is orthogonal to p_0, . . . , p_{k−1} (and by extension
g_0, . . . , g_{k−1}).
Proof. We proceed by induction on k. When k = 0, the set of vectors is just g0 and
this is clearly a linearly independent set and the space spanned by g0 is identical to the space
spanned by d0 from Corollary 6.12. Assume the statement is true up to stage k − 1. To see
it is true for stage k, consider the following cases:
(1) Suppose gk = 0. Then xk = xk−1 and we must be at a global maximum, and the
theorem is true by assumption.
(2) Suppose g_k ≠ 0. Then from Equations 6.10 and 6.11, we know that:
∇f(x_k)^T p_i = 0
for i = 0, . . . , k − 1.
Thus, gk is orthogonal to p0 , . . . , pk−1 and since (by Corollary 6.12) the space spanned
by p0 , . . . , pk−1 is identical to the space spanned by g0 , . . . , gk−1 , we see that gk must be
orthogonal to g0 , . . . , gk−1 .
Theorem 6.16. Let f : R^n → R be as in Definition 6.5 and let x_k be the sequence
generated by the conjugate direction method using the Gram-Schmidt generated directions
with d_k = g_k = ∇f(x_k). If x* is the global maximum for f(x), then the sequence converges
to x* after at most n steps and furthermore, we can write:
(6.27)  p_k = g_k + β_k p_{k−1}
where:
(6.28)  β_k = (g_k^T g_k) / (g_{k−1}^T g_{k−1})
Proof. From Equations 6.21 and 6.22, we know that:
(6.29)  p_0 = g_0
(6.30)  p_{k+1} = g_{k+1} + Σ_{j=0}^{k} c_j^(k+1) p_j
where:
(6.31)  c_m^(k+1) = − (g_{k+1}^T Q p_m) / (p_m^T Q p_m),  m = 0, . . . , k,  ∀k = 0, . . . , n − 2
Note that:
(6.32)  g_{m+1} − g_m = Q(x_{m+1} − x_m) = α_m Q p_m
and α_m ≠ 0 (for otherwise x_{m+1} = x_m and the algorithm will have converged). This
is because x_{m+1} = x_m + α_m p_m, where α_m is the optimal δ_m found by line search. Left
multiplying by −g_{k+1}^T yields:
(6.33)  −α_m g_{k+1}^T Q p_m = −g_{k+1}^T (g_{m+1} − g_m) = −g_{k+1}^T g_{k+1} if m = k, and 0 otherwise,
by Lemma 6.15. Thus we have:
(6.34)  c_m^(k+1) = − (g_{k+1}^T Q p_m) / (p_m^T Q p_m)
        = −(1/α_k) (g_{k+1}^T g_{k+1}) / (p_k^T Q p_k) if m = k, and 0 otherwise
By a similar argument, we note that:
(6.35)  p_k^T Q p_k = (1/α_k) p_k^T (g_{k+1} − g_k)
Substituting Equation 6.35 into Equation 6.34 yields:
(6.36)  β_{k+1} = c_k^(k+1) = − (g_{k+1}^T g_{k+1}) / (p_k^T (g_{k+1} − g_k))
Substituting this into the Gram-Schmidt procedure yields:
(6.37)  p_{k+1} = g_{k+1} − ((g_{k+1}^T g_{k+1}) / (p_k^T (g_{k+1} − g_k))) p_k
Written for k, rather than k + 1, we have:
(6.38)  p_k = g_k − ((g_k^T g_k) / (p_{k−1}^T (g_k − g_{k−1}))) p_{k−1}
Finally, observe that p_{k−1}^T g_k = 0 and:
(6.39)  p_{k−1} = g_{k−1} + β_{k−1} p_{k−2}
Thus:
(6.40)  −p_{k−1}^T g_{k−1} = −g_{k−1}^T g_{k−1} − β_{k−1} p_{k−2}^T g_{k−1} = −g_{k−1}^T g_{k−1}
because p_{k−2}^T g_{k−1} = 0. Finally we see that:
(6.41)  p_k = g_k + ((g_k^T g_k) / (g_{k−1}^T g_{k−1})) p_{k−1}
Thus:
(6.42)  β_k = (g_k^T g_k) / (g_{k−1}^T g_{k−1})
The fact that this algorithm converges to x* in at most n iterations is a result of Theorem
6.7. This completes the proof.
Exercise 44. Decide whether the following is true and if so, prove it; if not, provide a
counter-example: In the conjugate gradient method, pk is always an ascent direction and
therefore βk > 0.
Definition 6.17. Let f : R^n → R be as in Definition 6.5 and suppose that x = My,
where M ∈ R^(n×n) is symmetric and invertible. Suppose the conjugate gradient method is
applied to:
(6.43)  h(y) = f(My) = (1/2) y^T MQM y + b^T My
so that:
y_{k+1} = y_k + δ_k d_k
where:
d_0 = ∇h(y_0)
and
d_k = ∇h(y_k) + β_k d_{k−1},  β_k = (∇h(y_k)^T ∇h(y_k)) / (∇h(y_{k−1})^T ∇h(y_{k−1}))
When δ_k is obtained by line maximization, the method is called the preconditioned
conjugate gradient method.
Theorem 6.18. Consider the preconditioned conjugate gradient method with matrix M.
If x_k = My_k, then the preconditioned conjugate gradient method is equivalent to
the conjugate gradient method in which:
(6.44)  x_{k+1} = x_k + δ_k p_k
(6.45)  p_0 = Mg_0
(6.46)  p_k = Mg_k + β_k p_{k−1},  k = 1, . . . , n − 1
(6.47)  β_k = (g_k^T M² g_k) / (g_{k−1}^T M² g_{k−1})
and δ_k is obtained by line maximization. Furthermore, the p_k are Q conjugate.
Exercise 45. Prove Theorem 6.18.
4. Application to Non-Quadratic Problems
Remark 6.19. The conjugate gradient method can be applied to non-quadratic functions, usually in an attempt to exploit locally quadratic behavior. Assuming f : Rn → R is
a differentiable function, the update rule is given by:
∇f (xk )T (∇f (xk ) − ∇f (xk−1 ))
∇f (xk−1 )T ∇f (xk−1 )
Alternatively we can also use:
(6.48)
βk =
∇f (xk )T ∇f (xk )
∇f (xk−1 )T ∇f (xk−1 )
where the first formulation is called the Polak-Ribiéra formulation and is derived from Equation 6.33, while the second is the Fletcher-Reeves formulation and adapted from the straightforward transformation of the conjugate gradient method for quadratic functions. We note
that the Polak-Ribiéra formulation is generally preferred.
(6.49)
βk =
Exercise 46. Prove that the Polak-Ribière formulation and the Fletcher-Reeves formulation for β_k are identical in the case when f(x) is a strictly concave quadratic function.
Remark 6.20. Sample code for the general conjugate gradient method is shown in Algorithm 13.
Remark 6.21. Notice that we have a while-loop encasing the steps of the conjugate
gradient method. There are three ways to handle the general conjugate gradient method:
ConjugateGradient := proc (F::operator, initVals::list, epsilon::numeric,
    maxIter::integer)::list;
  uses VectorCalculus, LinearAlgebra, Optimization:
  local vars, xnow, i, j, G, normG, OUT, passIn, phi, vals, ttemp, count,
    X, p, gnext, gnow, beta, pnext;

  #The first few lines are house keeping.
  vars := []; xnow := []; vals := initVals;
  for i to nops(initVals) do
    vars := [op(vars), lhs(initVals[i])];
    xnow := [op(xnow), rhs(initVals[i])]:
  end do;

  #Compute the gradient and its norm, use the VectorCalculus package.
  G := Vector(Gradient(F(op(vars)), vars));
  normG := Norm(evalf(eval(G, initVals)));

  #Output will be the path we take to the "optimal" solution.
  #Your output could just be the optimal solution.
  OUT := [];

  #An iteration counter.
  count := 0;

  #While we're not at a stationary point:
  while evalb(epsilon < normG and count < maxIter) do
    count := count+1;

    #Compute the initial direction. This is p[0].
    p := evalf(eval(G, vals));

    #Here's the actual conjugate gradient search.
    for j to nops(initVals) do
      #Append the current position.
      OUT := [op(OUT), xnow];

      #Compute the gradient (wherever we are). This is g[k-1].
      gnow := evalf(eval(G, vals));

      #Compute the next x-position, this will be x[k].
      #Do housekeeping first.
      passIn := convert(Vector(xnow)+s*p, list):

      #Define the line function f(x+delta*p).
      phi := proc (t) options operator, arrow;
        evalf(eval(F(op(passIn)), s = t))
      end proc;

      #Run line search to find delta.
      ttemp := LineSearch(phi);

      #Compute the new position x + delta * p. (This is x[k].)
      xnow := evalf(eval(passIn, [s = ttemp]));

      #Do some house keeping.
      vals := [];
      for i to nops(vars) do
        vals := [op(vals), vars[i] = xnow[i]]
      end do;

      #Compute the next gradient (this is g[k]).
      gnext := evalf(eval(G, vals));

      #Compute beta[k] using the Polak-Ribiere formulation.
      beta := (Transpose(gnext).(gnext - gnow))/(Transpose(gnow).gnow);

      #Compute p[k].
      pnext := gnext + beta * p:
      p := pnext:
    end do;

    #Append any remaining x positions.
    OUT := [op(OUT), xnow];

    #Compute the new gradient norm.
    normG := Norm(evalf(eval(G, vals)));

    #Print out the gradient norm.
    print(sprintf("normg=%f", normG))
  end do;
  #Return the output.
  OUT:
end proc:
Algorithm 13. Conjugate Gradient Method for non-quadratic functions with a
simple loop.
(1) The conjugate gradient sub-steps can be executed n times (when maximizing a
function f : R^n → R).
(2) The conjugate gradient sub-steps can be executed k < n times.
(3) The conjugate gradient sub-steps can be executed for n steps, or for as long as:
(6.50)  |∇f(x_k)^T ∇f(x_{k−1})| ≤ δ ||∇f(x_{k−1})||²
for δ ∈ (0, 1); otherwise, restart with a gradient step. The previous equation acts
as a test for conjugacy; a sketch of this test appears after this list.
The idea behind these conjugate gradient methods is to execute a gradient ascent step every
"few" iterations to ensure global convergence to a stationary point, while simultaneously
trying to speed up convergence by using conjugate gradient steps in between.
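A minimal sketch of the conjugacy test in item (3), reusing the names gnow, gnext, beta and p from Algorithm 13 (the threshold delta is assumed given; this is our code, not the original implementation):

#Hypothetical restart test (Equation 6.50) inside Algorithm 13's inner loop.
if abs(Transpose(gnext).gnow) > delta*(Transpose(gnow).gnow) then
  p := gnext;               #conjugacy lost: restart with a gradient step
else
  p := gnext + beta*p;      #ordinary conjugate gradient step
end if: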
Example 6.22. Consider the execution of the conjugate gradient method on the function:
F(x, y) = (x² + 3y²) e^(1−x²−y²)
We obtain convergence in four iterations (8 steps of the conjugate gradient method) when
we choose ε = 0.001 and begin at x = 1, y = 0.1. This is illustrated in Figure 6.1.
Figure 6.1. The steps of the conjugate gradient algorithm applied to F (x, y).
We can compare this behavior with the case when:
F(x, y) = −2x² − 10y⁴
and we start from x = 15, y = 5. This is illustrated in Figure 6.2. In this example, the
conjugate gradient method converges in two iterations (four steps), with much less zig-zagging
than the gradient descent method or even Newton's method. Notice also how the steps are
nearly orthogonal to each other, as we expect from Lemma 6.15.
Example 6.23. Consider the function:
(6.51)  F(x, y) = −2x² − 10y²
The conjugate gradient method should converge in two steps for this function, but in practice
(depending on the value of ε) it may not because of floating point error. In fact, if we run
the conjugate gradient algorithm on this function, starting from x = 15, y = 5, we obtain
Figure 6.2. In this example, the conjugate gradient method also converges in four
total steps, with much less zig-zagging than the gradient descent method or even
Newton’s method.
convergence after a surprising 4 steps when ε = 0.001, because of floating point error. This
will not present itself if we use exact arithmetic. We can solve this problem in floating point
by using the preconditioned conjugate gradient method. First, let us write:
(6.52)  F(x, y) = (1/2) [x y] [−4, 0; 0, −20] [x; y]
Consider the matrix:
(6.53)  M = [1/2, 0; 0, √5/10]
If we let:
(6.54)  [x; y] = M [z; w]
Then:
(6.55)  h(z, w) = (1/2) [z w] M^T Q M [z; w]
because:
(6.56)  M^T Q M = [−1, 0; 0, −1]
where:
(6.57)  Q = [−4, 0; 0, −20]
If we execute the preconditioned conjugate gradient algorithm on h(z, w) starting from the
position z = 30 and w = 10√5, which is:
[z; w] = [1/2, 0; 0, √5/10]⁻¹ [15; 5] = [30; 10√5]
we obtain convergence to z* = w* = 0 in two steps (as expected), and we can see at
once that x* = y* = 0. Thus, M has stretched (and it could have twisted) our problem into
the problem from Example 6.8.
Exercise 47. Implement the conjugate gradient method. Try it on F(x, y) = −2x² − 10y².
Remark 6.24. We state but do not prove one final theorem regarding the conjugate
gradient method, which helps explain why the conjugate gradient method might only be
executed k < n times before a restart. The proof is available in [Ber99].
Theorem 6.25. Assume Q has n − k eigenvalues in the interval [a, b] with b < 0 and that
the remaining eigenvalues are less than a. Then for every x0, the vector x_{k+1} produced after
k + 1 steps of the conjugate gradient method satisfies:
(6.58)  f(x_{k+1}) ≥ ((b − a)/(b + a))² f(x_0)
Remark 6.26. A condition similar to that in Theorem 6.25 can be proved for the pre-conditioned
conjugate gradient method, with appropriate transformations to the matrix. Thus, Theorem
6.25 helps explain conditions under which we may only be interested in executing the conjugate
gradient loop k < n times.
CHAPTER 7
Quasi-Newton Methods
Remark 7.1. We noted in Chapter 5 that Newton's method only works in a region
around local maxima (or minima); otherwise, the direction −H(x_k)⁻¹∇f(x_k) may not
be an ascent direction. This is solved in modified Newton's method approaches by adjusting
the Hessian matrix to ensure negative (or positive) definiteness. Quasi-Newton methods, by
contrast, construct positive definite matrices B_k and use the update:
(7.1)  r_k = B_k ∇f(x_k)
(7.2)  x_{k+1} = x_k + δ_k r_k
These methods have the property that B_k approaches −H(x_k)⁻¹ as x_k approaches a stationary point x*. This property can be shown exactly for quadratic functions.
Exercise 48. Show that the Conjugate Gradient method with the Fletcher-Reeves rule
can be thought of as like a quasi-Newton method when:
(7.3)  B_k = I_n + ((x_k − x_{k−1})∇f(x_k)^T) / (∇f(x_{k−1})^T ∇f(x_{k−1}))
while the Conjugate Gradient method with the Polak-Ribière rule can be thought of as like
a quasi-Newton method when:
(7.4)  B_k = I_n + ((x_k − x_{k−1})(∇f(x_k) − ∇f(x_{k−1}))^T) / (∇f(x_{k−1})^T ∇f(x_{k−1}))
You do not have to prove that B_k converges to H⁻¹(x*).
1. Davidon-Fletcher-Powell (DFP) Quasi-Newton Method
Definition 7.2 (Davidon-Fletcher-Powell Approximation). Suppose that f : R^n → R is
a continuously differentiable function. Let:
(7.5)  p_k = x_{k+1} − x_k = δ_k r_k
(7.6)  q_k = ∇f(x_k) − ∇f(x_{k+1})
Then the Davidon-Fletcher-Powell (DFP) inverse-Hessian approximation is:
(7.7)  B_{k+1} = B_k + (p_k p_k^T) / (p_k^T q_k) − (B_k q_k q_k^T B_k) / (q_k^T B_k q_k)
where B0 ∈ R^(n×n) is a symmetric positive definite matrix (usually I_n).
Remark 7.3. This formulation is useful for maximization problems. The DFP minimization formulation sets q_k = ∇f(x_{k+1}) − ∇f(x_k) and r_k = −B_k∇f(x_k).
Lemma 7.4. Suppose that f : R^n → R, x0 ∈ R^n and B0 ∈ R^(n×n) is a symmetric and
positive definite matrix, that B1, . . . , Bn are generated using Equation 7.7, and that x1, . . . , xn are
generated by:
(7.8)  r_k = B_k ∇f(x_k)
(7.9)  x_{k+1} = x_k + δ_k r_k
with δ_k = arg max f(x_k + δ_k r_k). If ∇f(x_k) ≠ 0 for k = 0, . . . , n then B1, . . . , Bn are symmetric
and positive definite. Thus, r_0, . . . , r_{n−1} are ascent directions.
Proof. We proceed by induction. Clearly B0 is symmetric and positive definite, and thus
r_0 is an ascent direction since ∇f(x_0)^T r_0 = ∇f(x_0)^T B_0 ∇f(x_0) > 0. Symmetry of each B_k
is ensured by the form of the DFP formula. Assume the statement is true for all k < n. We show
the statement is true for B_{k+1}.
Consider y^T B_{k+1} y for any y ∈ R^n. We have:
(7.10)  y^T B_{k+1} y = y^T B_k y + (y^T p_k p_k^T y) / (p_k^T q_k) − (y^T B_k q_k q_k^T B_k y) / (q_k^T B_k q_k)
Note y^T p_k = p_k^T y and y^T B_k q_k = q_k^T B_k y because B_k is symmetric by the induction hypothesis. Furthermore, B_k has a Cholesky decomposition, so that B_k = L_k L_k^T for some lower
triangular matrix L_k. Let:
(7.11)  a^T = y^T L_k
(7.12)  b^T = q_k^T L_k
Then we can write:
(7.13)  y^T B_{k+1} y = a^T a − (a^T b)(b^T a)/(b^T b) + (y^T p_k)² / (p_k^T q_k)
        = (a^T a · b^T b − (a^T b)(b^T a))/(b^T b) + (y^T p_k)² / (p_k^T q_k)
Recall the Schwartz Inequality (Lemma 1.13):
(7.14)  (a^T b)(b^T a) = (a^T b)² ≤ (a^T a)(b^T b)
Thus,
(7.15)  (a^T a · b^T b − (a^T b)(b^T a))/(b^T b) ≥ 0
since b^T b > 0, with equality only if a and b are parallel. As in the proof of Theorem 6.7, we
know that:
(7.16)  ∂f(x_{k+1})/∂δ_k = ∇f(x_{k+1})^T r_k = ∇f(x_{k+1})^T p_k = 0
because δ_k = arg max f(x_k + δ_k r_k). This means that:
(7.17)  p_k^T q_k = p_k^T (∇f(x_k) − ∇f(x_{k+1})) = p_k^T ∇f(x_k) = δ_k ∇f(x_k)^T B_k^T ∇f(x_k) > 0
Thus:
(7.18)  (y^T p_k)² / (p_k^T q_k) ≥ 0
and this term is strictly positive whenever y^T p_k ≠ 0. If instead the first term of Equation 7.13
vanishes, then a and b are parallel, so y is a nonzero multiple of q_k (L_k is invertible) and
y^T p_k ≠ 0 by Equation 7.17. In either case, y^T B_{k+1} y > 0. Thus B_{k+1} is positive definite
and r_{k+1} = B_{k+1}∇f(x_{k+1}) is an ascent direction.
Exercise 49. Prove: Suppose f : R^n → R, x0 ∈ R^n and B0 ∈ R^(n×n) is a symmetric and positive definite matrix, B1, . . . , Bn are generated using the DFP formula for
minimization, and x1, . . . , xn are generated by:
(7.19)  r_k = −B_k ∇f(x_k)
(7.20)  x_{k+1} = x_k + δ_k r_k
with δ_k = arg min f(x_k + δ_k r_k). If ∇f(x_k) ≠ 0 for k = 0, . . . , n then B1, . . . , Bn are symmetric
and positive definite. Thus, r_0, . . . , r_{n−1} are descent directions.
Theorem 7.5. Suppose that f : R^n → R with:
(7.21)  f(x) = (1/2) x^T Q x + b^T x
where b ∈ R^n and Q ∈ R^(n×n) is symmetric and negative definite. Suppose that B1, . . . , Bn−1
are generated using the DFP formulation (Equation 7.7) with x0 ∈ R^n and B0 ∈ R^(n×n)
symmetric and positive definite. If x1, . . . , xn and r_0, . . . , r_{n−1} are generated by:
(7.22)  r_k = B_k ∇f(x_k)
(7.23)  x_{k+1} = x_k + δ_k r_k
with δ_k = arg max f(x_k + δ_k r_k) and
(7.24)  p_k = x_{k+1} − x_k = δ_k r_k
(7.25)  q_k = ∇f(x_k) − ∇f(x_{k+1})
then:
(1) p_0, . . . , p_{n−1} are Q conjugate,
(2) B_{k+1}Qp_j = −p_j for j = 0, . . . , k,
(3) B_n = −Q⁻¹, and
(4) x_n = x* is the global maximum of f(x).
Proof. It suffices to prove (1) and (2) above. To see this, note that if (1) holds, then the
DFP quasi-Newton method is a conjugate direction method and therefore after n iterations
it must converge to the maximum x* of f(x). Furthermore, since p_0, . . . , p_{n−1} must be
linearly independent by Lemma 6.4, they are a basis for R^n. Therefore, for each standard
basis vector e_i (i = 1, . . . , n) there are α_0, . . . , α_{n−1} so that:
e_i = α_0 p_0 + · · · + α_{n−1} p_{n−1}
and thus, by (2) with k = n − 1, B_n Q e_i = −e_i. The previous result holds for i = 1, . . . , n
and therefore B_n Q = −I_n, which means that B_n = −Q⁻¹.
Recall that p_k = x_{k+1} − x_k and therefore:
(7.26)  Qp_k = Q(x_{k+1} − x_k) = ∇f(x_{k+1}) − ∇f(x_k) = −q_k
Multiplying by B_{k+1} we have:
(7.27)  B_{k+1}Qp_k = −B_{k+1}q_k
        = −(B_k + (p_k p_k^T)/(p_k^T q_k) − (B_k q_k q_k^T B_k)/(q_k^T B_k q_k)) q_k
        = −B_k q_k − p_k (p_k^T q_k)/(p_k^T q_k) + B_k q_k (q_k^T B_k q_k)/(q_k^T B_k q_k)
        = −B_k q_k − p_k + B_k q_k = −p_k
We now proceed by induction to show that B_{k+1}Qp_j = −p_j for j = 0, . . . , k. We will
also prove Q conjugacy. We have just proved the base case when k = 0 and therefore we
assume the result is true for all iterations up to some k. Note that:
(7.28)  ∇f(x_{k+1}) = ∇f(x_{i+1}) +
(∇f(x_{i+2}) − ∇f(x_{i+1})) + (∇f(x_{i+3}) − ∇f(x_{i+2})) + · · · + (∇f(x_{k+1}) − ∇f(x_k))
= ∇f(x_{i+1}) + Qp_{i+1} + Qp_{i+2} + · · · + Qp_k = ∇f(x_{i+1}) + Q(p_{i+1} + p_{i+2} + · · · + p_k)
By the induction hypothesis, we know that p_i^T Q p_j = 0 for j = i + 1, . . . , k, and
we know a fortiori that p_i is orthogonal to ∇f(x_{i+1}) (see Equation 7.16, as in the proof of
Theorem 6.7). We conclude that p_i^T ∇f(x_{k+1}) = 0 for 0 ≤ i ≤ k. Applying the induction
hypothesis again, we know that B_{k+1}Qp_j = −p_j for j = 0, . . . , k and therefore:
(7.29)  ∇f(x_{k+1})^T B_{k+1} Q p_j = −∇f(x_{k+1})^T p_j = 0
for j = 0, . . . , k. But:
(7.30)  δ_{k+1} r_{k+1}^T = δ_{k+1} ∇f(x_{k+1})^T B_{k+1} = p_{k+1}^T
Therefore:
(7.31)  p_{k+1}^T Q p_j = δ_{k+1} r_{k+1}^T Q p_j = δ_{k+1} ∇f(x_{k+1})^T B_{k+1} Q p_j = 0
and thus, p_0, . . . , p_{k+1} are Q conjugate. We now need only prove that B_{k+2}Qp_j = −p_j for
j = 0, . . . , k + 1. We first note, by the induction hypothesis, that for i ≤ k:
(7.32)  q_{k+1}^T B_{k+1} Q p_i = −q_{k+1}^T p_i = (Qp_{k+1})^T p_i = p_{k+1}^T Q p_i = 0
because Q is symmetric and we have established that p_0, . . . , p_{k+1} are Q conjugate. Then
we write:
(7.33)  B_{k+2} q_i = (B_{k+1} + (p_{k+1} p_{k+1}^T)/(p_{k+1}^T q_{k+1}) − (B_{k+1} q_{k+1} q_{k+1}^T B_{k+1})/(q_{k+1}^T B_{k+1} q_{k+1})) q_i
        = B_{k+1} q_i + (p_{k+1} p_{k+1}^T q_i)/(p_{k+1}^T q_{k+1}) − (B_{k+1} q_{k+1} q_{k+1}^T B_{k+1} q_i)/(q_{k+1}^T B_{k+1} q_{k+1})
Note that Qp_i = −q_i and since p_{k+1}^T Q p_i = 0, we know −p_{k+1}^T q_i = p_{k+1}^T Q p_i = 0, which
implies p_{k+1}^T q_i = 0. Therefore,
(7.34)  (p_{k+1} p_{k+1}^T q_i)/(p_{k+1}^T q_{k+1}) = 0
Finally, since B_{k+1}Qp_i = −p_i, we know:
(7.35)  q_{k+1}^T B_{k+1} q_i = −q_{k+1}^T B_{k+1} Q p_i = q_{k+1}^T p_i = −p_{k+1}^T Q p_i = 0
Thus:
(7.36)  (B_{k+1} q_{k+1} q_{k+1}^T B_{k+1} q_i)/(q_{k+1}^T B_{k+1} q_{k+1}) = 0
Thus we conclude that:
(7.37)  B_{k+2} q_i = B_{k+1} q_i = −B_{k+1} Q p_i = p_i
and therefore:
(7.38)  B_{k+2} Q p_i = −p_i
So for i = 0, . . . , k we have B_{k+2}Qp_i = −p_i. To prove that this is also true when i = k + 1,
recall that Equation 7.27 handles this case. Thus we have shown that B_{k+2}Qp_i = −p_i for
i = 0, . . . , k + 1. This completes the proof.
2. Implementation and Example of DFP
Example 7.6. Consider F(x, y) = −2x² − 10y². We know that with exact arithmetic, the
DFP method should converge in two steps (1 iteration). To see this, suppose B0 = I2 and we
start at x = 15 and y = 5. Then our first gradient and ascent direction are:
(7.39)  r_0 = g_0 = [−60; −100]
Our first line function, then, is:
(7.40)  φ_0(t) = −2(15 − 60t)² − 10(5 − 100t)²
Solving φ′_0(t) = 0 for t yields:
(7.41)  t* = 17/268
This leads to our next point x_1 = 15 + (17/268)(−60) and y_1 = 5 + (17/268)(−100), or
x_1 = 750/67 and y_1 = −90/67. We compute the gradient at this new point to obtain:
(7.42)  g_1 = [−3000/67; 1800/67]
This leads to the two expressions:
(7.43)  p_0 = [−255/67; −425/67]
(7.44)  q_0 = [−1020/67; −8500/67]
If we compute B_1 we obtain:
(7.45)  B_1 = [170353/169912, −15345/169912; −15345/169912, 10337/169912]
We now compute r_1:
(7.46)  r_1 = [−15000/317; 1800/317]
From r_1, we can compute our new line search function:
(7.47)  φ_1(t) = −2(750/67 − (15000/317)t)² − 10(−90/67 + (1800/317)t)²
We can find t* = 317/1340 as before and compute our new position:
(7.48)  x_2 = 750/67 + (317/1340)(−15000/317) = 0
(7.49)  y_2 = −90/67 + (317/1340)(1800/317) = 0
Thus showing that the DFP method converges in one iteration or two steps.
Remark 7.7. An implementation of the DFP quasi-Newton method is shown in Algorithm 14.¹
Example 7.8. We can apply the DFP method to:
F(x, y) = (x² + 3y²) e^(1−x²−y²)
We obtain convergence in four iterations (8 steps of the DFP method) when
we choose ε = 0.001 and begin at x = 1, y = 0.1. This is illustrated in Figure 7.1.
Figure 7.1. The steps of the DFP algorithm applied to F(x, y).
Notice that the steps are identical to those shown in the conjugate gradient method (see Figure
6.1). This is because the DFP method is a conjugate direction method.
¹Thanks to Simon Miller who noticed that line 75 was missing in versions before 0.8.7. That's a big mistake!
1   DFPQuasiNewton := proc (F::operator, initVals::list,
2       epsilon::numeric, maxIter::integer)::list;
3     local vars, xnow, i, j, G, normG, OUT, passIn, phi, vals, ttemp, count, X, p,
4       gnext,
5       gnow, beta, pnext, r, B, xnext, q;
6
7     #Do some house keeping.
8     vars := [];
9     xnow := [];
10    vals := initVals;
11    for i to nops(initVals) do
12      vars := [op(vars), lhs(initVals[i])];
13      xnow := [op(xnow), rhs(initVals[i])]
14    end do;
15
16    #Define the gradient vector and the current norm of the gradient.
17    G := Vector(Gradient(F(op(vars)), vars));
18    normG := Norm(evalf(eval(G, initVals)));
19
20    #Output will be the path we take to an optimal point. We also define an
21    #iteration counter.
22    OUT := [];
23    count := 0;
24
25    #While we're not at a stationary point...
26    while evalb(epsilon < normG and count < maxIter) do
27      count := count + 1;
28
29      #Define B[0] and r[0], the first direction to walk. (It's a gradient step.)
30      B := IdentityMatrix(nops(vars), nops(vars));
31      r := B.evalf(eval(G, vals));
32
33      #Now go into our conjugate direction method.
34      for j to nops(initVals) do
35        print(j);
36
37        #Append some output.
38        OUT := [op(OUT), xnow];
39
40        #Evaluate the gradient at this point.
41        gnow := evalf(eval(G, vals));
42
43        #Do some housekeeping and define the line function.
44        passIn := convert(Vector(xnow) + s*r, list);
45        phi := proc (t) options operator, arrow;
46          evalf(eval(F(op(passIn)), s = t))
47        end proc;
48
49        #Compute the optimal step length using parabolic bracketing and
50        #Golden section search.
51        ttemp := LineSearch(phi);
52
53        #Define the next x position and the p-vector in DFP.
54        xnext := evalf(eval(passIn, [s = ttemp]));
55        p := Vector(xnext - xnow);
56
57        #Do some housekeeping.
58        xnow := xnext;
59        vals := [];
60        for i to nops(vars) do
61          vals := [op(vals), vars[i] = xnow[i]]
62        end do;
63
64        #Evaluate the gradient at the next position.
65        gnext := evalf(eval(G, vals));
66
67        #Compute the q vector.
68        q := Vector(gnow - gnext);
69
70        #Update the B matrix B[k]
71        B := B + (p.Transpose(p))/(Transpose(p).q) - (B.q.Transpose(q).B)/(Transpose(q).B.q);
72
73        #Compute the next direction.
74        r := B.gnext
75      end do;
76
77      #Compute the norm of the gradient.
78      normG := Norm(evalf(eval(G, vals)));
79    end do;
80
81    OUT
82  end proc:
Algorithm 14. Davidon-Fletcher-Powell quasi-Newton method.
3. Derivation of the DFP Method
Suppose we wish to derive the DFP update from first principles. Recall we have defined:
(7.50)  p_k = x_{k+1} − x_k = δ_k r_k
(7.51)  q_k = ∇f(x_k) − ∇f(x_{k+1}) = −Qp_k
Proceeding inductively, suppose that at Step k, we know that p_0, . . . , p_k are all:
(1) Q conjugate and
(2) B_{k+1}Qp_j = −p_j for j = 0, . . . , k
The second statement asserts that p_0, . . . , p_k are to be eigenvectors of B_{k+1}Q with eigenvalue
−1. If this holds all the way through to k = n − 1, then clearly p_0, . . . , p_{n−1} are eigenvectors
of B_n Q, each with eigenvalue −1, yielding precisely the fact that B_n Q = −I_n.
If we want to ensure that this pattern continues to hold, suppose that we write:
(7.52)  B_{k+1} = B_k + C_k
where C_k is a correction matrix that will ensure that the nice properties above are also true
at k + 1. That is, it ensures that p_0, . . . , p_k are eigenvectors of B_{k+1}Q with eigenvalue −1.
This means we require:
(7.53)  B_{k+1}Qp_j = −p_j  ⟹  B_{k+1}q_j = p_j,  j = 0, . . . , k
Combining this with Equation 7.52 yields:
(7.54)  (B_k + C_k)q_j = B_k q_j + C_k q_j = p_j,  j = 0, . . . , k
But:
(7.55)  B_k q_j = −B_k Q p_j = p_j,  j = 0, . . . , k − 1
Thus:
(7.56)  B_k q_j + C_k q_j = p_j  ⟹  p_j + C_k q_j = p_j  ⟹  C_k q_j = 0,  j = 0, . . . , k − 1
When k = j, we require:
(7.57)  C_k q_k = p_k − B_k q_k
from Equation 7.54. We are now free to use our imagination about the structure of C_k.
Suppose that C_k had the rank-one term:
(7.58)  T_1 = (p_k p_k^T)/(p_k^T q_k)
Then we see that T_1 q_k = p_k and we could satisfy part of our requirement from Equation
7.57. On the other hand, if C_k had a second rank-one term:
(7.59)  T_2 = −(B_k q_k q_k^T B_k)/(q_k^T B_k q_k)
then T_2 q_k = −B_k q_k. Combining these we see:
(7.60)  C_k = T_1 + T_2 = (p_k p_k^T)/(p_k^T q_k) − (B_k q_k q_k^T B_k)/(q_k^T B_k q_k)
as expected. Now, note that for any p_j (j ≠ k) we see that:
(7.61)  C_k q_j = (p_k p_k^T q_j)/(p_k^T q_k) − (B_k q_k q_k^T B_k q_j)/(q_k^T B_k q_k)
        = −(p_k p_k^T Q p_j)/(p_k^T q_k) + (B_k q_k q_k^T B_k Q p_j)/(q_k^T B_k q_k)
We know that:
(7.62)  (p_k p_k^T Q p_j)/(p_k^T q_k) = 0
by Q conjugacy, while:
(7.63)  (B_k q_k q_k^T B_k Q p_j)/(q_k^T B_k q_k) = (B_k q_k p_k^T Q p_j)/(q_k^T B_k q_k) = 0
again by conjugacy. Note that in the previous equation we transformed q_k^T to −p_k^T Q and
B_k Q p_j to −p_j. Thus, C_k q_j = 0 as required.
Remark 7.9. Since we are adding two rank-one corrections together to obtain Ck , the
matrix Ck is sometimes called a rank-two correction.
4. Broyden-Fletcher-Goldfarb-Shanno (BFGS) Quasi-Newton Method
Remark 7.10. Recall from our construction of the DFP correction that we really only
require two things of the correction matrix C_k:
C_k q_j = 0,  j = 0, . . . , k − 1
and Equation 7.57:
C_k q_k = p_k − B_k q_k
Thus we could add any extra term T_k we like, as long as T_k q_j = 0 for j = 0, . . . , k.
Definition 7.11 (Broyden Family of Updates). Suppose that f : R^n → R is a continuously differentiable function. Let:
(7.64)  p_k = x_{k+1} − x_k = δ_k r_k
(7.65)  q_k = ∇f(x_k) − ∇f(x_{k+1})
Then the Broyden Family of Updates for the inverse-Hessian approximation is:
(7.66)  B_{k+1} = B_k + (p_k p_k^T)/(p_k^T q_k) − (B_k q_k q_k^T B_k)/(q_k^T B_k q_k) + φ_k τ_k v_k v_k^T
where:
(7.67)  v_k = p_k/(p_k^T q_k) − (B_k q_k)/τ_k
(7.68)  τ_k = q_k^T B_k q_k
(7.69)  0 ≤ φ_k ≤ 1
and B0 ∈ R^(n×n) is a symmetric positive definite matrix (usually I_n).
Exercise 50. Verify that when:
(7.70)
Tk = φk τk vk vkT
then Tk qj = 0 for j = 0, . . . , k.
Definition 7.12 (BFGS Update). If φk = 1 for all k, then Equation 7.66 is called the
Broyden-Fletcher-Goldfarb-Shanno (BFGS) update.
Remark 7.13. We state, but do not prove the following proposition, which can be verified
by (extensive) matrix arithmetic.
Proposition 7.14. The BFGS Update is given by:
(7.71)  B_{k+1} = B_k + (1 + (q_k^T B_k q_k)/(p_k^T q_k)) (p_k p_k^T)/(p_k^T q_k)
        − (B_k q_k p_k^T + p_k q_k^T B_k)/(p_k^T q_k)
Thus if C_B(φ_k) is the Broyden family of updates correction matrix, C_DFP is the DFP correction matrix and C_BFGS is the BFGS correction matrix, then:
(7.72)  C_B(φ_k) = (1 − φ_k) C_DFP + φ_k C_BFGS
Exercise 51. Prove Proposition 7.14.
Remark 7.15. The following lemma and theorem are proved in precisely the same way
as the corresponding results for the DFP method, with appropriate allowances made for the
extra term Tk = φk τk vk vkT .
Lemma 7.16. Suppose that f : R^n → R, x0 ∈ R^n and B0 ∈ R^(n×n) is a symmetric and
positive definite matrix, that B1, . . . , Bn are generated using Equation 7.66, and that x1, . . . , xn are
generated by:
(7.73)  r_k = B_k ∇f(x_k)
(7.74)  x_{k+1} = x_k + δ_k r_k
with δ_k = arg max f(x_k + δ_k r_k). If ∇f(x_k) ≠ 0 for k = 0, . . . , n then B1, . . . , Bn are symmetric
and positive definite. Thus, r_0, . . . , r_{n−1} are ascent directions.
Theorem 7.17. Suppose that f : R^n → R with:
(7.75)  f(x) = (1/2) x^T Q x + b^T x
where b ∈ R^n and Q ∈ R^(n×n) is symmetric and negative definite. Suppose that B1, . . . , Bn−1
are generated using the Broyden family of updates formulation (Equation 7.66) with x0 ∈ R^n
and B0 ∈ R^(n×n) symmetric and positive definite. If x1, . . . , xn and r_0, . . . , r_{n−1} are generated
by:
(7.76)  r_k = B_k ∇f(x_k)
(7.77)  x_{k+1} = x_k + δ_k r_k
with δ_k = arg max f(x_k + δ_k r_k) and
(7.78)  p_k = x_{k+1} − x_k = δ_k r_k
(7.79)  q_k = ∇f(x_k) − ∇f(x_{k+1})
then:
(1) p_0, . . . , p_{n−1} are Q conjugate,
(2) B_{k+1}Qp_j = −p_j for j = 0, . . . , k,
(3) B_n = −Q⁻¹, and
(4) x_n = x* is the global maximum of f(x).
Theorem 7.18. The sequence of iterates {x_k} generated by any quasi-Newton algorithm
in the Broyden family, when applied to a maximization problem for a quadratic function:
f(x) = (1/2) x^T Q x + b^T x
where Q is negative definite, is identical to the sequence generated by the preconditioned
conjugate gradient method with scaling matrix B0.
Sketch of Proof. It is sufficient to show that x_{k+1} maximizes f(x) over the subspace:
(7.80)  M_k = {x : x = x_0 + α_0 B_0 ∇f(x_0) + · · · + α_k B_0 ∇f(x_k), α_0, . . . , α_k ∈ R}
This can be proved when B0 = I_n by applying induction to show that for all k there are
scalars β_{ij}^k such that:
(7.81)  B_k = I_n + Σ_{i=0}^{k} Σ_{j=0}^{k} β_{ij}^k ∇f(x_i)∇f(x_j)^T
Therefore we can write:
(7.82)  p_k = B_k ∇f(x_k) = Σ_{i=0}^{k} b_i^k ∇f(x_i)
for some scalars b_i^k ∈ R. Thus for all i, x_{i+1} lies in M_i. By Theorem 7.17, we know that any
quasi-Newton method in the Broyden family is a conjugate direction method and therefore
x_{k+1} maximizes f(x) over M_k, and by the nature of f(x) this maximum must be unique. To
see this for the case when B0 ≠ I_n, note that we can simply transform Q to a space in which
it is −I_n, as in the preconditioned conjugate gradient method. (See Example 6.23.)
Remark 7.19. This explains why Figure 7.1 is identical to Figure 6.1 and why we will
see the same figure again when we apply the BFGS method to F(x, y) = (x² + 3y²) e^(1−x²−y²).
5. Implementation of the BFGS Method
Remark 7.20. The only reason that users prefer the BFGS method to the DFP method is
that the BFGS method has better convergence properties in practice than the DFP method,
which tends to produce singular matrices in a wider range of optimization problems. The
BFGS method seems to be more stable.
Remark 7.21. We note also that the φk can be varied from iteration to iteration. This
is not recommended (in fact, we usually either set it to be 0 or 1) since it can be shown that
for each iteration there is a specific φk that will generate a singular matrix at iteration k + 1
and it is better to not accidentally cause this to occur.
Remark 7.22. The BFGS method is implemented by replacing Lines 70 - 71 of Algorithm
14 with:

#Update the B matrix B[k]
B := B +
  (1 + (Transpose(q).B.q)/(Transpose(p).q)) * (p.Transpose(p))/(Transpose(p).q) -
  (B.q.Transpose(p) + p.Transpose(q).B)/(Transpose(p).q);
Exercise 52. Show that the steps taken in Example 7.6 are identical for the BFGS
method.
Example 7.23. We can apply the BFGS method to:
F(x, y) = (x² + 3y²) e^(1−x²−y²)
We obtain convergence in four iterations when we choose ε = 0.001 and begin at x = 1,
y = 0.1 (as with the DFP method). This is illustrated in Figure 7.2.
Figure 7.2. The steps of the BFGS algorithm applied to F(x, y).
Notice that the steps are identical to those shown in the conjugate gradient method (see Figure
6.1) and the DFP method.
Remark 7.24. There are several extensions of the BFGS method that are popular. The
simplest is the limited memory BFGS (L-BFGS) method [NL89], which does not require as
much storage as the BFGS method since B_k is never stored explicitly, but is approximated
from the previous m values of p_k and q_k, where m < n. This makes it possible to apply
the BFGS method to problems with many variables. The L-BFGS method can also be
extended for simple bound constraints, in the L-BFGS-B method [ZBLN97]. The code for
these algorithms is available freely and often incorporated into popular commercial software;
for example, Matlab uses a variation on BFGS in its unconstrained minimization routines.
BFGS is also implemented in the GNU Scientific Library, which is available for free.
Remark 7.25. It is worth noting that implementations of the DFP and BFGS methods
may produce singular or even negative definite matrices as a result of round-off errors. A
check can be placed in the code (using the Cholesky decomposition) to determine this. When
this happens, one can re-initialize B_k to I_n (for example) and continue with execution.
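A minimal sketch of such a check (our code; it uses Maple's IsDefinite in place of an explicit Cholesky test and assumes B and vars from Algorithm 14):

#If round-off has destroyed positive definiteness, restart from the identity.
if not LinearAlgebra:-IsDefinite(B, query = 'positive_definite') then
  B := LinearAlgebra:-IdentityMatrix(nops(vars))
end if: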
Exercise 53. Implement the BFGS algorithm with a check for singularity. Use it on
Rosenbrock's Function:
(7.83)  −(1 − x)² − 100(y − x²)²
starting at the point (0, 0).
CHAPTER 8
Numerical Differentiation and Derivative Free Optimization
The topics contained in this chapter are actually important and each one could easily
deserve a chapter on its own. Unfortunately, in order to get on to constrained problems, we
must cut somewhere and this is as reasonable a place to do it as any. If this were a real
book, each of these sections would have its own chapter. Therefore, the concerned reader
should see [NW06] for details.
1. Numerical Differentiation
Remark 8.1. Occasionally, it is non-trivial to compute a derivative of a function. In
such a case, a numerical derivative can be used to compute an approximation of the gradient
of the function at that point.
Definition 8.2 (Forward Difference Approximation). Let f : R^n → R. The forward
difference approximation is given by the formula:
(8.1)  ∂f(x_0)/∂x_i ≈ δf(x_0)/δx_i = (f(x_0 + εe_i) − f(x_0))/ε
Here ε > 0 is a tiny value and e_i is the ith standard basis vector for R^n.
Remark 8.3. Note that by Taylor's Theorem (in one dimension):
(8.2)  f(x_0 + εe_i) = f(x_0) + ε ∂f(x_0)/∂x_i + O(ε²)
Thus:
(8.3)  δf(x_0)/δx_i = ∂f(x_0)/∂x_i + O(ε)
We know that δx_i = ε. Assuming a floating point error of h when evaluating δf(x_0), we see
that:
(8.4)  h/ε = O(ε)
or h = O(ε²). If the machine precision for a given system is h_min, then setting ε any smaller
than √h_min could lead to numerical instability. Thus, a useful estimate is ε ≥ √h_min. Note
that, to a point, the smaller ε, the better the estimate of the derivative.
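For concreteness, here is a minimal sketch (ours) of the rule in hardware floating point, where Maple's evalhf(DBL_EPSILON) plays the role of h_min:

f := x -> x^2:
eps := sqrt(evalhf(DBL_EPSILON)):      #epsilon >= sqrt(h_min)
fd := (f(1.0 + eps) - f(1.0))/eps:     #forward difference; exact value f'(1) = 2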
Definition 8.4 (Central Difference Approximation). Let f : R^n → R. The central
difference approximation is given by the formula:
(8.5)  ∂f(x_0)/∂x_i ≈ Δf(x_0)/Δx_i = (f(x_0 + εe_i) − f(x_0 − εe_i))/(2ε)
Here ε > 0 is a tiny value and e_i is the ith standard basis vector for R^n.
Remark 8.5. Note that by Taylor's Theorem (in one dimension):
(8.6) $f(x_0 - \epsilon e_i) = f(x_0) - \epsilon \frac{\partial f(x_0)}{\partial x_i} + \frac{1}{2}\epsilon^2 \frac{\partial^2 f(x_0)}{\partial x_i^2} + O(\epsilon^3)$
A similar expression holds for f(x0 + εei). Thus, we can see that:
(8.7) $f(x_0 + \epsilon e_i) - f(x_0 - \epsilon e_i) = 2\epsilon \frac{\partial f(x_0)}{\partial x_i} + O(\epsilon^3)$
because the quadratic terms (of identical sign) cancel out. The result is:
(8.8) $\frac{f(x_0 + \epsilon e_i) - f(x_0 - \epsilon e_i)}{2\epsilon} = \frac{\partial f(x_0)}{\partial x_i} + O(\epsilon^2)$
Using a similar argument to the one above, we have h = O(ε³). This yields a rule for choosing ε when using the central difference approximation: ε ≥ ∛hmin. We also note that the central difference formula is substantially more accurate (an order of magnitude in ε) than the forward difference formula, but requires two function evaluations rather than one.
Remark 8.6. From the previous observations, one might wonder when it is best to use
the forward difference formula and when it is best to use the central difference formula.
Bertsekas [Ber99] provides the practical advice that when the approximated derivative(s)
are not very close to zero, then the forward difference is fine. Once the derivative begins to
approach zero (and the error may swamp the computation), then a transition to the central
difference is called for.
Remark 8.7. Code for computing a numerical gradient is shown in Algorithm 17. It
uses the Forward Differencing Method shown in Algorithm 15 and the Central Differencing
Method shown in Algorithm 16.
Exercise 54. Re-implement the BFGS algorithm using a finite difference approximation of the gradient. Compare your results to your original implementation on the function:
$F(x, y) = (x^2 + 3y^2)e^{1 - x^2 - y^2}$
Remark 8.8. In a similar fashion, we can compute a numerical Hessian approximation. If H(x0) = ∇²f(x0), then:
(8.9) $H_{i,j} = \frac{\nabla f(x_0 + \epsilon e_j)_i - \nabla f(x_0)_i}{\epsilon}$
Here ∇f(x0)i is the ith component of the gradient vector.
Exercise 55. Implement a numerical Hessian computation algorithm and illustrate its
effect on Newton’s method.
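A possible starting point for this exercise is the sketch below; it assumes NumericalGradientF from Algorithm 15 is loaded. Since rounding destroys symmetry, replacing the result H by (H + Hᵀ)/2 afterward is a sensible extra step:

NumericalHessian := proc (f::operator, xnow::list, epsilon::numeric)::Matrix;
  local H, g0, gj, i, j, x;
  g0 := NumericalGradientF(f, xnow, epsilon);
  H := Matrix(nops(xnow), nops(xnow));
  x := xnow;
  for j to nops(xnow) do
    #Perturb x[j] and recompute the gradient (Equation 8.9).
    x[j] := x[j] + epsilon;
    gj := NumericalGradientF(f, x, epsilon);
    for i to nops(xnow) do
      H[i, j] := (gj[i] - g0[i])/epsilon;
    end do;
    #Unperturb x[j].
    x[j] := x[j] - epsilon;
  end do;
  H
end proc: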
Example 8.9. Numerical gradients have little impact on the convergence of quasi-Newton or conjugate gradient methods (or even modified Newton's method) for well-behaved functions. This is illustrated in Figure 8.1, in which we compare a BFGS run using numerical gradients with a BFGS run using exact gradients. The dashed black line uses symbolic gradients, while the solid red line uses numerical gradients. As you can see, there is very little difference detectable by inspection.
You may, however, face a time penalty for functions that are difficult to evaluate, because you are introducing additional function evaluations at each step. This is a practical issue you must weigh.
NumericalGradientF := proc (f::operator, xnow::list, epsilon::numeric)::Vector;
  local G, fnow, i, x;

  #Initialize a list to store output.
  G := [];

  #Passed in variables are non-modifiable in Maple.
  x := xnow;

  #We'll use this over and over, don't keep computing it.
  fnow := f(op(xnow));

  #For each index...
  for i to nops(xnow) do
    #Perturb x[i].
    x[i] := x[i]+epsilon;

    #Store the forward difference in G.
    G := [op(G), (f(op(x))-fnow)/epsilon];

    #Unperturb x[i].
    x[i] := x[i]-epsilon
  end do;

  #Return a gradient vector.
  Vector(G)
end proc:

Algorithm 15. Forward Differencing Method for Numerical Gradients
2. Derivative Free Methods: Powell’s Method
Remark 8.10. We will discuss two derivative free methods for finding optimal points of
functions. The first, Powell’s method, is a conjugate direction method, while the second, the
Hooke-Jeeves algorithm, is a type of optimization method called a pattern search. We will
not prove convergence properties of either method, though as a conjugate direction method,
Powell’s method will have all the properties one should expect when applying it to a concave
quadratic function.
Remark 8.11. Given a function f : Rn → R, Powell's basic method works in the following way:
(1) Choose a starting position x0 and n directions p0, . . . , pn−1.
(2) For each k = 0, . . . , n − 1, let δk = arg maxδ f(xk + δpk) and let xk+1 = xk + δk pk.
(3) For each k = 0, . . . , n − 2, set pk = pk+1.
(4) Set pn−1 = xn − x0.
(5) If ||xn − x0|| < ε, stop.
(6) Solve δn = arg maxδ f(xn + δpn). Set x0 = xn + δn pn. Goto (1).
Usually the directions p0, . . . , pn−1 are initialized as the standard basis vectors.
NumericalGradientC := proc (f::operator, xnow::list, epsilon::numeric)::Vector;
  local G, f1, f2, i, x;

  #Initialize a list to store output.
  G := [];

  #Passed in variables are non-modifiable in Maple.
  x := xnow;

  #For each index...
  for i to nops(xnow) do
    #Perturb x[i] forward and store the function value.
    x[i] := x[i]+epsilon;
    f1 := f(op(x));

    #Perturb x[i] backward and store the function value.
    x[i] := x[i]-2*epsilon;
    f2 := f(op(x));

    #Store the central difference in G.
    G := [op(G), (f1-f2)/(2*epsilon)];

    #Unperturb x[i].
    x[i] := x[i]+epsilon
  end do;

  #Return a gradient vector.
  Vector(G)
end proc:

Algorithm 16. Central Differencing Method for Numerical Gradients
Remark 8.12. One problem with Powell's method is that the directions may tend to become linearly dependent, as the directions tend (more or less) toward the gradient, especially in valleys (like in Rosenbrock's function). There are a few ways to solve this problem:
(1) Reinitialize the basic directions every n or n + 1 iterations of the basic algorithm (once all the original directions are gone).
(2) Reset the directions to any set of orthogonal directions, at the expense of losing information on the directions you've already constructed, or construct a special set of orthogonal directions (say following the Gram-Schmidt procedure); the latter, however, assumes that you have some special knowledge about the function (e.g., that it is quadratic, which obviates the need for derivative free methods).
(3) Use a heuristic, provided by Powell, described in Remark 8.13 below.
NumericalGradient := proc (f::operator, xnow::list, epsilon::numeric)::Vector;
  local G;

  #Compute the forward difference approximation.
  G := NumericalGradientF(f, xnow, epsilon);

  #Test to see how small the norm of G is. (You can supply your own rule.)
  if Norm(G) <= 2*epsilon then
    G := NumericalGradientC(f, xnow, epsilon)
  end if;

  #Return the vector.
  G
end proc:

Algorithm 17. A simple routine to compute the numerical gradient using forward
and central differencing.
Figure 8.1. A comparison of the BFGS method using numerical gradients vs. exact gradients.
Remark 8.13 (Powell's Heuristic). Let:
(8.10) f0 = f(x0)
(8.11) fn = f(xn)
(8.12) fE = f(2xn − x0) = f(xn + (xn − x0))
Here fE is the value of f(x) when we move from xn as far again along the direction xn − x0. Finally, define Δf to be the magnitude of the largest increase in f along any single direction encountered in Step (2).
Powell then provides the following rules:
(1) If fE ≤ f0, then keep the old set of directions for the next procedural iteration. That is, do not execute Steps (3), (4), or (6); simply check for convergence and return to Step (1) with x0 = xn.
(2) If 2(2fn + fE − f0)((fn − f0) − Δf)² ≥ (fE − f0)²Δf, then keep the old set of directions for the next procedural iteration. That is, do not execute Steps (3), (4), or (6); simply check for convergence and return to Step (1) with x0 = xn.
(3) If neither of these situations occurs, replace the direction of greatest increase with xn − x0.
A reference implementation for this algorithm is shown in Algorithm 18. We remark that the BasisVector function called within this algorithm is a custom function that returns the ith standard basis vector; it is not part of the Maple baseline.
Example 8.14. We illustrate two examples of Powell's Method, the first on:
$F(x, y) = (x^2 + 3y^2)e^{1 - x^2 - y^2}$
and the second on a variation of Rosenbrock's Function:
$G(x, y) = -(1 - x)^2 - 100(y - x^2)^2$
Both runs are shown in Figure 8.2. We draw the reader's attention to the effect the valley has on the steps in Powell's method: while each step individually is an ascent, the zig-zagging pattern seems almost random.
Figure 8.2. Powell's Direction Set Method applied to a bimodal function and a variation of Rosenbrock's function. Notice the impact the valley has on the steps Powell's method takes on Rosenbrock's function.
3. Derivative Free Methods: Hooke-Jeeves Method
Remark 8.15. The method of Hooke and Jeeves is a pattern search method, meaning it creates a search pattern and then replicates that pattern, expanding or contracting it when possible. The original discrete form has no proof of convergence [HJ61]. Bazaraa et al. present a version of the Hooke and Jeeves algorithm using line search, which they claim has convergence properties [BSS06].
PowellsMethod := proc (F::operator, initVals::list, epsilon::numeric,
                       maxIter::integer)::list;
  local vars, xnow, i, P, L, vals, xdiff, r, OUT, passIn, phi, ttemp,
        xnext, p, count, Df, x0, Nr, f0, fN, fE, bestIndex;

  vars := [];
  xnow := [];
  vals := initVals;
  for i to nops(initVals) do
    vars := [op(vars), lhs(initVals[i])];
    xnow := [op(xnow), rhs(initVals[i])]
  end do;

  xdiff := 1;
  count := 0;
  OUT := [];
  L := [];
  bestIndex := -1;

  for i to nops(vals) do
    L := [op(L), BasisVector(i, nops(vals))]
  end do;

  while epsilon < xdiff and count < maxIter do
    count := count + 1;
    Df := 0;
    x0 := xnow;

    for i to nops(L) do
      OUT := [op(OUT), xnow];
      p := L[i];
      passIn := convert(Vector(xnow) + s*p, list);

      phi := proc (t) options operator, arrow;
        evalf(VectorCalculus:-eval(F(op(passIn)), s = t))
      end proc;

      ttemp := LineSearch(phi);

      if ttemp < epsilon then
        passIn := convert(Vector(xnow) - s*p, list);

        phi := proc (t) options operator, arrow;
          evalf(VectorCalculus:-eval(F(op(passIn)), s = t))
        end proc;

        ttemp := LineSearch(phi)
      end if;

      xnext := evalf(eval(passIn, [s = ttemp]));
      r := xnext - xnow;
      Nr := Norm(Vector(r));

      if Df < Nr then
        Df := Nr;
        bestIndex := i
      end if;

      xnow := xnext
    end do;

    P := xnow - x0;
    f0 := evalf(F(op(x0)));
    fN := evalf(F(op(xnow)));
    fE := evalf(F(op(2*xnow - x0)));

    if not (evalb(fE <= f0) or
            evalb(evalf(2*(2*fN+fE-f0)*((fN-f0)-Df)^2) >= evalf((fE-f0)^2*Df))) then
      L[bestIndex] := Vector(P)
    end if;

    xdiff := Norm(Vector(P))
  end do;
  OUT
end proc:

Algorithm 18. Powell's derivative free method of optimization with heuristic.
Remark 8.16. The method of Hooke and Jeeves attempts to find a movement pattern that increases the functional value, and then repeats this pattern until there is evidence it is no longer working. Assume f : Rn → R. Like Powell's method, we begin with a set of (orthogonal) directions p0, . . . , pn−1 and scalars ε > 0, Δ ≥ ε and α ∈ [0, 1]. We summarize the method as:
(1) Given a starting point x0 ∈ Rn, set y0 = x0. Set j = k = 0.
(2) For each j = 0, . . . , n − 1:
(a) If f(yj + Δpj) > f(yj), the trial is a success and yj+1 = yj + Δpj.
(b) Otherwise, the trial is a failure; in that case, if f(yj − Δpj) > f(yj), the new trial is a success and yj+1 = yj − Δpj.
(c) Otherwise yj+1 = yj.
(3) If f(yn) > f(xk), then set xk+1 = yn, set y0 = xk+1 + α(xk+1 − xk), increment k and goto (2).
(4) Otherwise: if Δ < ε, terminate. If Δ ≥ ε, set Δ = Δ/2. Let xk+1 = xk, let y0 = xk, increment k and goto (2).
Remark 8.17. The Hooke-Jeeves pattern search essentially gropes around in each (cardinal) direction, attempting to improve the value of f(x) as it goes. When it finds a point where improvement is not possible, it shrinks the distance it looks and tries again, until it reaches a stopping point. When y0 = xk+1 + α(xk+1 − xk), the algorithm is attempting to use the pattern of improvement just identified to move f(x) further uphill. A reference implementation for Hooke-Jeeves is shown in Algorithm 19.
Example 8.18. We can apply the Hooke-Jeeves algorithm to our bimodal function:
$F(x, y) = (x^2 + 3y^2)e^{1 - x^2 - y^2}$
When we start at x0 = 1, y0 = 0.1 with α = 0.5 and Δ = 1, we get good convergence, illustrated in Figure 8.3.
Figure 8.3. Hooke-Jeeves algorithm applied to a bimodal function.
If we apply Hooke-Jeeves to a variation of Rosenbrock's Function:
$G(x, y) = -(1 - x)^2 - 100(y - x^2)^2$
we see different behavior. For Δ = 1 and α = 0.5, we fail to converge to the optimal point (1, 1). If we adjust α to be 1, we do converge, in 28 steps. This is shown in Figure 8.4. Thus we can see that the Hooke-Jeeves method may be very sensitive to the parameters supplied. It should therefore only be used on functions where evaluation is very easy, since we may have to try a few parameter combinations before we identify a true maximum.
HookeJeeves := proc (F::operator, initVals::list, epsilon::numeric, DeltaIn::numeric,
                     alpha::numeric)::list;
  local i, OUT, vars, xnow, L, vals, Delta, p, xlast, ynow, ytest;

  Delta := DeltaIn;
  vars := [];
  xnow := [];
  vals := initVals;
  for i to nops(initVals) do
    vars := [op(vars), lhs(initVals[i])];
    xnow := [op(xnow), rhs(initVals[i])]
  end do;

  OUT := [xnow];
  L := [];
  for i to nops(vals) do
    L := [op(L), BasisVector(i, nops(vals))]
  end do;

  ynow := xnow;

  while epsilon < Delta do
    for i to nops(L) do
      p := L[i];
      ytest := convert(Vector(ynow)+Delta*p, list);
      if F(op(ynow)) < F(op(ytest)) then
        ynow := ytest
      else
        ytest := convert(Vector(ynow)-Delta*p, list);
        if F(op(ynow)) < F(op(ytest)) then
          ynow := ytest
        end if
      end if
    end do;
    if F(op(xnow)) < F(op(ynow)) then
      xlast := xnow;
      xnow := ynow;
      ynow := xnow+alpha*(xnow-xlast);
      OUT := [op(OUT), xnow]
    else
      Delta := (1/2)*Delta;
      ynow := xnow
    end if
  end do;
  OUT
end proc:

Algorithm 19. Hooke-Jeeves derivative free algorithm.
Figure 8.4. Hooke-Jeeves algorithm applied to a variation of Rosenbrock's function.
CHAPTER 9
Linear Programming: The Simplex Method
The material covered in this chapter is a summary of my Math 484 lecture notes [Gri11],
which go into more detail (and have proofs). If you’re interested, I suggest you look there or
get a good book on Linear Programming like [BJS04]. Also, under no circumstances do I
recommend you implement the simplex method. There are too many “nits” for implementing
an efficient algorithm. Instead, you should check out the GNU Linear Programming Kit
(http://www.gnu.org/software/glpk/). It’s very well implemented and free (in every
sense of the word).
1. Linear Programming: Notation
Definition 9.1 (Linear Programming Problem). A linear programming problem is an
optimization problem of the form:
max z(x1 , . . . , xn ) = c1 x1 + · · · + cn xn
s.t. a11 x1 + · · · + a1n xn ≤ b1
..
.
am1 x1 + · · · + amn xn ≤ bm
(9.1)
h11 x1 + · · · + hn1 xn = r1
..
.
hl1 x1 + · · · + hln xn = rl
Remark 9.2. You will recall from your matrices class (Math 220) that matrices can be used as a shorthand way to represent linear equations. Consider the following system of equations:
(9.2)
$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1\\ a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2\\ &\ \ \vdots\\ a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m \end{aligned}$
Then we can write this in matrix notation as:
(9.3) Ax = b
where Aij = aij for i = 1, . . . , m, j = 1, . . . , n, x is a column vector in Rn with entries xj (j = 1, . . . , n), and b is a column vector in Rm with entries bi (i = 1, . . . , m). Obviously, if we replace the equalities in Expression 9.2 with inequalities, we can also express systems of inequalities in the form:
(9.4) Ax ≤ b
Using this representation, we can write our general linear programming problem using matrix and vector notation. Expression 9.1 can be written as:
(9.5)
max z(x) = cᵀx
s.t. Ax ≤ b
     Hx = r
Definition 9.3. In Problem 9.5, if we restrict some of the decision variables (the xi ’s)
to have only integer (or discrete) values, then the problem becomes a mixed integer linear
programming problem. If all of the variables are restricted to integer values, the problem is
an integer programming problem and if every variable can only take on the values 0 or 1, the
program is called a 0 − 1 integer programming problem. [WN99] is an excellent reference
for Integer Programming.
Definition 9.4 (Canonical Form). A maximization linear programming problem is in canonical form if it is written as:
(9.6)
max z(x) = cᵀx
s.t. Ax ≤ b
     x ≥ 0
A minimization linear programming problem is in canonical form if it is written as:
(9.7)
min z(x) = cᵀx
s.t. Ax ≥ b
     x ≥ 0
Definition 9.5 (Standard Form). A linear programming problem is in standard form if it is written as:
(9.8)
max z(x) = cᵀx
s.t. Ax = b
     x ≥ 0
Remark 9.6. The following theorem is outside the scope of the course. You may see it covered in Math 484 [Gri11].
Theorem 9.7. Every linear programming problem in canonical form can be put into
standard form.
Exercise 56. Show that a minimization linear programming problem in canonical form can be rephrased as a maximization linear programming problem in canonical form. [Hint: Multiply the objective and constraints by −1. Define new matrices.]
Remark 9.8. To illustrate Theorem 9.7, we note that it is relatively easy to convert any
inequality constraint into an equality constraint. Consider the inequality constraint:
(9.9)
ai1 x1 + ai2 x2 + · · · + ain xn ≤ bi
We can add a new slack variable si to this constraint to obtain:
ai1 x1 + ai2 x2 + · · · + ain xn + si = bi
Obviously this slack variable si ≥ 0. The slack variable then becomes just another variable
whose value we must discover as we solve the linear program for which Expression 9.9 is a
constraint.
We can deal with constraints of the form:
(9.10)
ai1 x1 + ai2 x2 + · · · + ain xn ≥ bi
in a similar way. In this case we subtract a surplus variable si to obtain:
ai1 x1 + ai2 x2 + · · · + ain xn − si = bi
Again, we must have si ≥ 0.
Example 9.9. Consider the linear programming problem:
max z(x1, x2) = 2x1 − x2
s.t. x1 − x2 ≤ 1
     2x1 + x2 ≥ 6
     x1, x2 ≥ 0
This linear programming problem can be put into standard form by using both a slack and a surplus variable:
max z(x1, x2) = 2x1 − x2
s.t. x1 − x2 + s1 = 1
     2x1 + x2 − s2 = 6
     x1, x2, s1, s2 ≥ 0
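The standard form coefficient matrix can be assembled mechanically; a quick sketch in Maple (the matrix S holds one ±1 column per slack or surplus variable):

with(LinearAlgebra):
A := Matrix([[1, -1], [2, 1]]):     #constraint coefficients
S := Matrix([[1, 0], [0, -1]]):     #+1 for the slack s1, -1 for the surplus s2
Ahat := <A | S>;                    #the standard form matrix [A | S]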
Definition 9.10 (Row Rank). Let A ∈ Rm×n . The row rank of A is the size of the
largest set of row (vectors) from A that are linearly independent.
Remark 9.11. The column rank of a matrix A ∈ Rm×n is defined analogously on columns rather than rows. The following theorem relates the row and column rank. Its proof is outside the scope of the course.
Theorem 9.12. If A ∈ Rm×n is a matrix, then the row rank of A is equal to the column
rank of A. Further, rank(A) ≤ min{m, n}.
Definition 9.13. Suppose that A ∈ Rm×n and let m ≤ n. Then A has full row rank if
rank(A) = m.
Remark 9.14. We will assume, when dealing with Linear Programming Problems in standard or canonical form, that the matrix A has full row rank; if it does not, we will adjust it so this is true.
2. Polyhedral Theory and Linear Equations and Inequalities
2.1. Solving Systems with More Variables than Equations. Suppose now that A ∈ Rm×n where m ≤ n, and let b ∈ Rm. Then the equation:
(9.11) Ax = b
has more variables than equations and is underdetermined; if A has full row rank, then the system will have an infinite number of solutions. We can formulate an expression to describe this infinite set of solutions.
Since A has full row rank, we may choose any m linearly independent columns of A, corresponding to a subset of the variables, say xi1, . . . , xim. We can use these to form the matrix:
(9.12) B = [A·i1 · · · A·im]
from the columns A·i1, . . . , A·im of A, so that B is invertible. It should be clear at this point that B will be invertible precisely because we've chosen m linearly independent column vectors. We can then use elementary column operations to write the matrix A as:
(9.13) A = [B|N]
The matrix N is composed of the n − m other columns of A not in B. We can similarly sub-divide the column vector x and write:
(9.14) $[B|N]\begin{bmatrix} x_B \\ x_N \end{bmatrix} = b$
where the vector xB holds the variables corresponding to the columns in B and the vector xN holds the variables corresponding to the columns of the matrix N.
Definition 9.15 (Basic Variables). For historical reasons, the variables in the vector
xB are called the basic variables and the variables in the vector xN are called the non-basic
variables.
We can use matrix multiplication to expand the left-hand side of this expression as:
(9.15) BxB + NxN = b
The fact that B is composed of linearly independent columns implies that applying Gauss-Jordan elimination to it will yield an m × m identity matrix, and thus that B is invertible. We can solve for the basic variables xB in terms of the non-basic variables:
(9.16) xB = B⁻¹b − B⁻¹NxN
We can find an arbitrary solution to the system of linear equations by choosing values for the non-basic variables and solving for the basic variable values using Equation 9.16.
Definition 9.16 (Basic Solution). When we assign xN = 0, the resulting solution for x is called a basic solution, and:
(9.17) xB = B⁻¹b
Example 9.17. Consider the problem:
(9.18) $\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 7 \\ 8 \end{bmatrix}$
Then we can let x3 = 0 and:
(9.19) $B = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}$
We then solve:
(9.20) $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = B^{-1}\begin{bmatrix} 7 \\ 8 \end{bmatrix} = \begin{bmatrix} -19/3 \\ 20/3 \end{bmatrix}$
Other basic solutions could be formed by creating B out of columns 1 and 3 or columns 2 and 3. (Thanks to Doug Mercer, who found a typo here that was fixed.)
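The computation in Example 9.17 is easy to check mechanically (a sketch; the same pattern answers Exercise 57 below):

with(LinearAlgebra):
A := Matrix([[1, 2, 3], [4, 5, 6]]):
b := Vector([7, 8]):
B := A[.., [1, 2]]:           #the columns corresponding to x1 and x2
xB := LinearSolve(B, b);      #returns [-19/3, 20/3]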
Exercise 57. Find the two other basic solutions in Example 9.17 corresponding to:
$B = \begin{bmatrix} 2 & 3 \\ 5 & 6 \end{bmatrix}$ and $B = \begin{bmatrix} 1 & 3 \\ 4 & 6 \end{bmatrix}$
In each case, determine what the matrix N is. [Hint: Find the solutions any way you like. Make sure you record exactly which xi (i ∈ {1, 2, 3}) is equal to zero in each case.]
2.2. Polyhedral Sets. Important examples of convex sets are polyhedral sets, the
multi-dimensional analogs of polygons in the plane. In order to understand these structures, we must first understand hyperplanes and half-spaces.
Definition 9.18 (Hyperplane). Let a ∈ Rn be a constant vector in n-dimensional space and let b ∈ R be a constant scalar. The set of points:
(9.21) H = {x ∈ Rn : aᵀx = b}
is a hyperplane in n-dimensional space. Note the use of column vectors for a and x in this definition.
Example 9.19. Consider the hyper-plane 2x1 +3x2 +x3 = 5. This is shown in Figure 9.1.
This hyperplane is composed of the set of points (x1 , x2 , x3 ) ∈ R3 satisfying 2x1 +3x2 +x3 = 5.
This can be plotted implicitly or explicitly by solving for one of the variables, say x3 . We
can write x3 as a function of the other two variables as:
(9.22)
x3 = 5 − 2x1 − 3x2
Definition 9.20 (Half-Space). Let a ∈ Rn be a constant vector in n-dimensional space and let b ∈ R be a constant scalar. The sets of points:
(9.23) Hl = {x ∈ Rn : aᵀx ≤ b}
(9.24) Hu = {x ∈ Rn : aᵀx ≥ b}
are the half-spaces defined by the hyperplane aᵀx = b.
Example 9.21. Consider the two dimensional hyperplane (line) x1 + x2 = 1. Then the
two half-spaces associated with this hyper-plane are shown in Figure 9.2. A half-space is
so named because the hyperplane aT x = b literally separates Rn into two halves: the half
above the hyperplane and the half below the hyperplane.
Figure 9.1. A hyperplane in 3 dimensional space: A hyperplane is the set of points satisfying an equation aᵀx = b, where b is a constant in R, a is a constant vector in Rn and x is a variable vector in Rn. The equation is written as a matrix multiplication using our assumption that all vectors are column vectors.
Figure 9.2. Two half-spaces defined by a hyper-plane: A half-space is so named because any hyper-plane divides Rn (the space in which it resides) into two halves, the side "on top" and the side "on the bottom." (Panel (a) shows Hl; panel (b) shows Hu.)
Definition 9.22 (Polyhedral Set). If P ⊆ Rn is the intersection of a finite number of half-spaces, then P is a polyhedral set. Formally, let a1, . . . , am ∈ Rn be a finite set of constant vectors and let b1, . . . , bm ∈ R be constants. Consider the set of half-spaces:
Hi = {x ∈ Rn : aiᵀx ≤ bi}
Then the set:
(9.25) $P = \bigcap_{i=1}^{m} H_i$
is a polyhedral set.
It should be clear that we can represent any polyhedral set using a matrix inequality.
The set P is defined by the set of vectors x satisfying:
(9.26)
Ax ≤ b,
where the rows of A ∈ Rm×n are made up of the vectors a1 , . . . , am and b ∈ Rm is a column
vector composed of elements b1 , . . . , bm .
Theorem 9.23. Every polyhedral set is convex.
Exercise 58. Prove Theorem 9.23. [Hint: You can prove this by brute force, verifying
convexity. You can also be clever and use two results that we’ve proved in the notes.]
2.3. Directions of Polyhedral Sets. Recall the definition of a line (Definition 1.17 from Chapter 1). A ray is a one sided line.
Definition 9.24 (Ray). Let x0 ∈ Rn be a point and let d ∈ Rn be a vector called the direction. Then the ray with vertex x0 and direction d is the collection of points {x : x = x0 + λd, λ ≥ 0}.
Definition 9.25 (Direction of a Convex Set). Let C be a convex set. Then d ≠ 0 is a (recession) direction of the convex set if for all x0 ∈ C the ray with vertex x0 and direction d is contained entirely in C. Formally, for all x0 ∈ C we have:
(9.27) {x : x = x0 + λd, λ ≥ 0} ⊆ C
Remark 9.26. There is a unique relationship between the defining matrix A of a polyhedral set P and a direction of this set that is particularly useful when we assume that P is
located in the positive orthant of Rn (i.e., x ≥ 0 are defining constraints of P ).
Remark 9.27. A proof of the next theorem can be found in [Gri11].
Theorem 9.28. Suppose that P ⊆ Rn is a polyhedral set defined by:
(9.28) P = {x ∈ Rn : Ax ≤ b, x ≥ 0}
If d is a direction of P, then the following hold:
(9.29) Ad ≤ 0, d ≥ 0, d ≠ 0.
Corollary 9.29. If:
(9.30) P = {x ∈ Rn : Ax = b, x ≥ 0}
and d is a direction of P, then d must satisfy:
(9.31) Ad = 0, d ≥ 0, d ≠ 0.
Exercise 59. Prove the corollary above.
Example 9.30. Consider the polyhedral set defined by the equations:
x1 − x2 ≤ 1
2x1 + x2 ≥ 6
x1 ≥ 0
x2 ≥ 0
This set is clearly unbounded, as we showed in class, and it has at least one direction. The direction d = [0, 1]ᵀ pointing directly up is a direction of this set. This is illustrated in Figure 9.3. In this example, we have:
(9.32) $A = \begin{bmatrix} 1 & -1 \\ -2 & -1 \end{bmatrix}$
Note, the second inequality constraint was a greater-than constraint. We reversed it to the less-than inequality constraint −2x1 − x2 ≤ −6 by multiplying by −1. For our chosen direction d = [0, 1]ᵀ, we can see that:
(9.33) $Ad = \begin{bmatrix} 1 & -1 \\ -2 & -1 \end{bmatrix}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix} \le 0$
Clearly d ≥ 0 and d ≠ 0.
Figure 9.3. An Unbounded Polyhedral Set: This unbounded polyhedral set has many directions. One direction is [0, 1]ᵀ.
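This check is mechanical; a one-line sketch in Maple:

with(LinearAlgebra):
A := Matrix([[1, -1], [-2, -1]]):
A . Vector([0, 1]);           #returns the vector [-1, -1], so Ad <= 0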
2.4. Extreme Points.
Definition 9.31 (Extreme Point of a Convex Set). Let C be a convex set. A point x0 ∈ C is an extreme point of C if there are no points x1 and x2 in C (x1 ≠ x0 or x2 ≠ x0) so that x0 = λx1 + (1 − λ)x2 for some λ ∈ (0, 1).²
An extreme point is simply a point in a convex set C that cannot be expressed as a strict convex combination of any other pair of points in C. We will see that extreme points must be located in specific locations in convex sets.
Definition 9.32 (Boundary of a set). Let C ⊆ Rn be a (convex) set. A point x0 ∈ C is on the boundary of C if for all ε > 0:
Bε(x0) ∩ C ≠ ∅ and Bε(x0) ∩ (Rn \ C) ≠ ∅
Example 9.33. A convex set, its boundary and a boundary point are illustrated in
Figure 9.4.
²Thanks to Bob Pakzad-Hurson who fixed a typo in this definition in Version ≤ 1.4.
Figure 9.4. Boundary Point: A boundary point of a (convex) set C is a point in the set such that every ball of any radius centered at the point contains some points inside C and some points outside C. (The figure shows the interior, the boundary and a boundary point.)
Lemma 9.34. Suppose C is a convex set. If x is an extreme point of C, then x is on the
boundary of C.
Exercise 60. Prove the previous lemma.
Most important in our discussion of linear programming will be the extreme points of
polyhedral sets that appear in linear programming problems. The following theorem establishes the relationship between extreme points in a polyhedral set and the intersection of
hyperplanes in such a set.
Theorem 9.35. Let P ⊆ Rn be a polyhedral set and suppose P is defined as:
(9.34)
P = {x ∈ Rn : Ax ≤ b}
where A ∈ Rm×n and b ∈ Rm . A point x0 ∈ P is an extreme point of P if and only if x0 is
the intersection of n linearly independent hyperplanes from the set defining P .
Remark 9.36. The easiest way to see this as relevant to linear programming is to assume
that
(9.35)
P = {x ∈ Rn : Ax ≤ b, x ≥ 0}
In this case, we could have m < n. In that case, P is composed of the intersection of n + m
half-spaces. The first m are for the rows of A and the second n are for the non-negativity
constraints. An extreme point comes from the intersection of n of the hyperplanes defining
these half-spaces. We might have m come from the constraints Ax ≤ b and the other n − m
from x ≥ 0.
Remark 9.37. A complete proof of the previous theorem can be found in [Gri11].
Definition 9.38. Let P be the polyhedral set from Theorem 9.35. If x0 is an extreme
point of P and more than n hyperplanes are binding at x0 , then x0 is called a degenerate
extreme point.
Definition 9.39 (Face). Let P be a polyhedral set defined by
P = {x ∈ Rn : Ax ≤ b}
where A ∈ Rm×n and b ∈ Rm . If X ⊆ P is defined by a non-empty set of binding linearly
independent hyperplanes, then X is a face of P .
That is, there is some set of linearly independent rows Ai1 · , . . . Ail · with il < m so that
when G is the matrix made of these rows and g is the vector of bi1 , . . . , bil then:
(9.36)
X = {x ∈ Rn : Gx = g and Ax ≤ b}
In this case we say that X has dimension n − l.
Remark 9.40. Based on this definition, we can easily see that an extreme point, which is the intersection of n linearly independent hyperplanes, is a face of dimension zero.
Definition 9.41 (Edge and Adjacent Extreme Point). An edge of a polyhedral set P is
any face of dimension 1. Two extreme points are called adjacent if they share n − 1 binding
constraints. That is, they are connected by an edge of P .
Example 9.42. Consider the polyhedral set defined by the system of inequalities:
3x1 + x2 ≤ 120
x1 + 2x2 ≤ 160
(28/16)x1 + x2 ≤ 100
x1 ≤ 35
x1 ≥ 0
x2 ≥ 0
The polyhedral set is shown in Figure 9.5. The extreme points of the polyhedral set are shown as large diamonds and correspond to intersections of binding constraints. Note the extreme point (16, 72) is degenerate since it occurs at the intersection of three binding constraints: 3x1 + x2 ≤ 120, x1 + 2x2 ≤ 160 and (28/16)x1 + x2 ≤ 100. All the faces of the polyhedral set are shown in bold. They are locations where one constraint (or half-space) is binding. An example of a pair of adjacent extreme points is (16, 72) and (35, 15), as they are connected by the edge defined by the binding constraint 3x1 + x2 ≤ 120.
Figure 9.5. A Polyhedral Set: This polyhedral set is defined by five half-spaces and has a single degenerate extreme point located at the intersection of the binding constraints 3x1 + x2 ≤ 120, x1 + 2x2 ≤ 160 and (28/16)x1 + x2 ≤ 100. All faces are shown in bold.
Exercise 61. Consider the polyhedral set defined by the system of inequalities:
4x1 + x2 ≤ 120
x1 + 8x2 ≤ 160
x1 + x2 ≤ 30
x1 ≥ 0
x2 ≥ 0
Identify all extreme points and edges in this polyhedral set and their binding constraints.
Are any extreme points degenerate? List all pairs of adjacent extreme points.
2.5. Extreme Directions.
Definition 9.43 (Extreme Direction). Let C ⊆ Rn be a convex set. Then a direction d of C is an extreme direction if there are no two other directions d1 and d2 of C (d1 ≠ d and d2 ≠ d) and scalars λ1, λ2 > 0 so that d = λ1d1 + λ2d2.
We have already seen in Theorem 9.28 that if P is a polyhedral set in the positive orthant of Rn with form:
P = {x ∈ Rn : Ax ≤ b, x ≥ 0}
then a direction d of P is characterized by the set of inequalities and equations:
Ad ≤ 0, d ≥ 0, d ≠ 0.
Clearly two directions d1 and d2 with d1 = λd2 for some λ > 0 may both satisfy this system. To isolate a unique set of directions, we can normalize and construct the set:
(9.37) D = {d ∈ Rn : Ad ≤ 0, d ≥ 0, eᵀd = 1}
Here e is the vector of all ones and we are interested only in directions satisfying eᵀd = 1. This is a normalizing constraint that will choose only vectors whose components sum to 1.
Theorem 9.44. A direction d ∈ D is an extreme direction of P if and only if d is an
extreme point of D when D is taken as a polyhedral set.
Remark 9.45. A complete proof of the previous theorem can be found in [Gri11].
Example 9.46. Let's consider Example 9.30 again. The polyhedral set in this example was defined by the A matrix:
$A = \begin{bmatrix} 1 & -1 \\ -2 & -1 \end{bmatrix}$
and the b vector:
$b = \begin{bmatrix} 1 \\ -6 \end{bmatrix}$
If we assume that P = {x ∈ Rn : Ax ≤ b, x ≥ 0}, then the set of extreme directions of P is the same as the set of extreme points of the set:
D = {d ∈ Rn : Ad ≤ 0, d ≥ 0, eᵀd = 1}
Then we have the set of directions d = [d1, d2]ᵀ so that:
d1 − d2 ≤ 0
−2d1 − d2 ≤ 0
d1 + d2 = 1
d1 ≥ 0
d2 ≥ 0
The feasible region (which is really only the line d1 + d2 = 1) is shown in red in Figure 9.6. The critical part of this figure is the red line: it is the true set D. As a line, it has two extreme points, (0, 1) and (1/2, 1/2). Note that the extreme point (0, 1) is precisely the direction [0, 1]ᵀ we illustrated in Example 9.30.
Figure 9.6. Visualization of the set D: This set really consists of the set of points on the red line. This is the line where d1 + d2 = 1 and all other constraints hold. This line has two extreme points (0, 1) and (1/2, 1/2).
Exercise 62. Show that d = [1/2, 1/2]T is a direction of the polyhedral set P from
Example 9.30. Now find a non-extreme direction (whose components sum to 1) using the
feasible region illustrated in the previous example. Show that the direction you found is
a direction of the polyhedral set. Create a figure like Figure 9.3 to illustrate both these
directions.
2.6. Caratheodory Characterization Theorem.
Remark 9.47. The theorem stated in this sub-section is critical to understanding the
fundamental theorems of linear programming. Proofs can be found in [Gri11].
Lemma 9.48. The polyhedral set defined by:
P = {x ∈ Rn : Ax ≤ b, x ≥ 0}
has a finite, non-zero number of extreme points (assuming that A is not an empty matrix).³
³Thanks to Bob Pakzah-Hurson for the suggestion to improve the statement of this lemma.
Lemma 9.49. Let P be a non-empty polyhedral set. Then the set of directions of P is
empty if and only if P is bounded.
Lemma 9.50. Let P be a non-empty unbounded polyhedral set. Then the number of extreme directions of P is finite and non-zero.
Theorem 9.51. Let P be a non-empty, unbounded polyhedral set defined by:
P = {x ∈ Rn : Ax ≤ b, x ≥ 0}
(where we assume A is not an empty matrix). Suppose that P has extreme points x1, . . . , xk and extreme directions d1, . . . , dl. If x ∈ P, then there exist constants λ1, . . . , λk and μ1, . . . , μl such that:
(9.38)
$\begin{aligned} x &= \sum_{i=1}^{k} \lambda_i x_i + \sum_{j=1}^{l} \mu_j d_j\\ \sum_{i=1}^{k} \lambda_i &= 1\\ \lambda_i &\ge 0 \quad i = 1, \dots, k\\ \mu_j &\ge 0 \quad j = 1, \dots, l \end{aligned}$
Example 9.52. The Caratheodory Characterization Theorem is illustrated for a bounded and an unbounded polyhedral set in Figure 9.7. This example illustrates simply how one could construct an expression for an arbitrary point x inside a polyhedral set in terms of extreme points and extreme directions.
Figure 9.7. The Caratheodory Characterization Theorem: Extreme points and extreme directions are used to express points in a bounded and unbounded set. (In the bounded set, x = μx5 + (1 − μ)(λx2 + (1 − λ)x3); in the unbounded set, x = λx2 + (1 − λ)x3 + θd1.)
3. The Simplex Method
For the remainder of this chapter, assume that A ∈ Rm×n has full row rank and b ∈ Rm. Let:
(9.39) X = {x ∈ Rn : Ax ≤ b, x ≥ 0}
be a polyhedral set over which we will maximize the objective function z(x1, . . . , xn) = cᵀx, where c, x ∈ Rn. That is, we will focus on the linear programming problem:
(9.40)
P: max cᵀx
   s.t. Ax ≤ b
        x ≥ 0
Theorem 9.53. If Problem P has an optimal solution, then Problem P has an optimal extreme point solution.
Proof. Applying the Caratheodory Characterization theorem, we know that any point x ∈ X can be written as:
(9.41) $x = \sum_{i=1}^{k} \lambda_i x_i + \sum_{i=1}^{l} \mu_i d_i$
where x1, . . . , xk are the extreme points of X and d1, . . . , dl are the extreme directions of X, and we know that:
(9.42) $\sum_{i=1}^{k} \lambda_i = 1, \qquad \lambda_i, \mu_i \ge 0\ \forall i$
We can rewrite problem P using this characterization as:
(9.43)
$\begin{aligned} \max\ & \sum_{i=1}^{k} \lambda_i c^T x_i + \sum_{i=1}^{l} \mu_i c^T d_i\\ s.t.\ & \sum_{i=1}^{k} \lambda_i = 1\\ & \lambda_i, \mu_i \ge 0\ \forall i \end{aligned}$
If there is some i such that cᵀdi > 0, then we can simply choose μi as large as we like, making the objective as large as we like; the problem will then have no finite solution.
Therefore, assume that cᵀdi ≤ 0 for all i = 1, . . . , l (in which case, we may simply choose μi = 0 for all i). Since the set of extreme points x1, . . . , xk is finite, we can simply set λp = 1 if cᵀxp has the largest value among all possible values of cᵀxi, i = 1, . . . , k. This is clearly the solution to the linear programming problem. Since xp is an extreme point, we have shown that if P has a solution, it must have an extreme point solution.
Corollary 9.54. Problem P has a finite solution if and only if cᵀdi ≤ 0 for all i = 1, . . . , l, where d1, . . . , dl are the extreme directions of X.
Proof. This is implicit in the proof of the theorem.
Corollary 9.55. Problem P has alternative optimal solutions if there are at least two
extreme points xp and xq so that cT xp = cT xq and so that xp is the extreme point solution
to the linear programming problem.
Proof. Suppose that xp is the extreme point solution to P identified in the proof of
the theorem. Suppose xq is another extreme point solution with cT xp = cT xq . Then every
convex combination of xp and xq is contained in X (since X is convex). Thus every x with
form λxp + (1 − λ)xq and λ ∈ [0, 1] has objective function value:
λcT xp + (1 − λ)cT xq = λcT xp + (1 − λ)cT xp = cT xp
which is the optimal objective function value, by assumption.
Exercise 63. Let X = {x ∈ Rn : Ax ≤ b, x ≥ 0} and suppose that d1, . . . , dl are the extreme directions of X (assuming it has any). Show that the problem:
(9.44)
min cᵀx
s.t. Ax ≤ b
     x ≥ 0
has a finite optimal solution if (and only if) cᵀdj ≥ 0 for j = 1, . . . , l. [Hint: Modify the proof above using the Caratheodory characterization theorem.]
3.1. Algorithmic Characterization of Extreme Points. In the previous sections we
showed that if a linear programming problem has a solution, then it must have an extreme
point solution. The challenge now is to identify some simple way of identifying extreme
points. To accomplish this, let us now assume that we write X as:
(9.45)
X = {x ∈ Rn : Ax = b, x ≥ 0}
Our work in the previous sections shows that this is possible. Recall we can separate A into
an m × m matrix B and an m × (n − m) matrix N and we have the result:
(9.46)
xB = B−1 b − B−1 NxN
We know that B is invertible since we assumed that A had full row rank. If we assume that
xN = 0, then the solution
(9.47)
xB = B−1 b
was called a basic solution (See Definition 9.16.) Clearly any basic solution satisfies the
constraints Ax = b but it may not satisfy the constraints x ≥ 0.
Definition 9.56 (Basic Feasible Solution). If xB = B⁻¹b and xN = 0 is a basic solution to Ax = b and xB ≥ 0, then the solution (xB, xN) is called a basic feasible solution.
Theorem 9.57. Every basic feasible solution is an extreme point of X. Likewise, every extreme point is characterized by a basic feasible solution of Ax = b, x ≥ 0.
Proof. Since Ax = BxB + NxN = b, this represents the intersection of m linearly independent hyperplanes (since the rank of A is m). The fact that xN = 0 and xN contains n − m variables means that we have n − m binding, linearly independent hyperplanes from xN ≥ 0. Thus the point (xB, xN) is the intersection of m + (n − m) = n linearly independent hyperplanes. By Theorem 9.35 we know that (xB, xN) must be an extreme point of X.
Conversely, let x be an extreme point of X. Clearly x is feasible and by Theorem 9.35 it must represent the intersection of n hyperplanes. The fact that x is feasible implies that Ax = b. This accounts for m of the intersecting linearly independent hyperplanes. The remaining n − m hyperplanes must come from x ≥ 0. That is, n − m variables are zero. Let xN = 0 be the variables for which x ≥ 0 are binding. Denote the remaining variables xB. We can see that A = [B|N] and that Ax = BxB + NxN = b. Clearly, xB is the unique solution to BxB = b and thus (xB, xN) is a basic feasible solution.
3.2. The Simplex Algorithm–Algebraic Form. In this section, we will develop the simplex algorithm algebraically. The idea behind the simplex algorithm is as follows:
(1) Convert the linear program to standard form.
(2) Obtain an initial basic feasible solution (if possible).
(3) Determine whether the basic feasible solution is optimal. If yes, stop.
(4) If the current basic feasible solution is not optimal, then determine which non-basic variable (zero valued variable) should become basic (become non-zero) and which basic variable (non-zero valued variable) should become non-basic (go to zero) to make the objective function value better.
(5) Determine whether the problem is unbounded. If yes, stop.
(6) If the problem doesn't seem to be unbounded at this stage, find a new basic feasible solution from the old basic feasible solution. Go back to Step 3.
Suppose we have a basic feasible solution x = (xB, xN). We can divide the cost vector c into its basic and non-basic parts, so we have c = [cB|cN]ᵀ. Then the objective function becomes:
(9.48) $c^T x = c_B^T x_B + c_N^T x_N$
We can substitute Equation 9.46 into Equation 9.48 to obtain:
(9.49) $c^T x = c_B^T\left(B^{-1}b - B^{-1}N x_N\right) + c_N^T x_N = c_B^T B^{-1} b + \left(c_N^T - c_B^T B^{-1} N\right)x_N$
Let J be the set of indices of non-basic variables. Then we can write Equation 9.49 as:
(9.50) $z(x_1, \dots, x_n) = c_B^T B^{-1} b + \sum_{j \in J}\left(c_j - c_B^T B^{-1} A_{\cdot j}\right)x_j$
Consider now the fact that xj = 0 for all j ∈ J. Further, we can see that:
(9.51) $\frac{\partial z}{\partial x_j} = c_j - c_B^T B^{-1} A_{\cdot j}$
This means that if cj − cBᵀB⁻¹A·j > 0 and we increase xj from zero to some new value, then we will increase the value of the objective function. For historical reasons, we actually consider the value cBᵀB⁻¹A·j − cj, called the reduced cost, and denote it as:
(9.52) $-\frac{\partial z}{\partial x_j} = z_j - c_j = c_B^T B^{-1} A_{\cdot j} - c_j$
In a maximization problem, we choose non-basic variables xj with negative reduced cost to become basic because, in this case, ∂z/∂xj is positive.
Assume we choose xj, a non-basic variable, to become non-zero (because zj − cj < 0). We wish to know which of the basic variables will become zero as we increase xj away from zero. We must also be very careful that none of the variables become negative as we do this.
By Equation 9.46 we know that only the current basic variables will be affected by increasing xj. Let us focus explicitly on Equation 9.46 where we include only variable xj (since all other non-basic variables are kept at zero). Then we have:
(9.53) $x_B = B^{-1}b - B^{-1}A_{\cdot j}x_j$
Let b̄ = B⁻¹b be an m × 1 column vector and let āj = B⁻¹A·j be another m × 1 column vector. Then we can write:
(9.54) $x_B = \overline{b} - \overline{a}_j x_j$
Let b̄ = [b̄1, . . . , b̄m]ᵀ and āj = [āj1, . . . , ājm]ᵀ; then we have:
(9.55) $\begin{bmatrix} x_{B_1} \\ x_{B_2} \\ \vdots \\ x_{B_m} \end{bmatrix} = \begin{bmatrix} \overline{b}_1 \\ \overline{b}_2 \\ \vdots \\ \overline{b}_m \end{bmatrix} - \begin{bmatrix} \overline{a}_{j_1} \\ \overline{a}_{j_2} \\ \vdots \\ \overline{a}_{j_m} \end{bmatrix} x_j = \begin{bmatrix} \overline{b}_1 - \overline{a}_{j_1}x_j \\ \overline{b}_2 - \overline{a}_{j_2}x_j \\ \vdots \\ \overline{b}_m - \overline{a}_{j_m}x_j \end{bmatrix}$
We know (a priori) that b̄i ≥ 0 for i = 1, . . . , m. If āji ≤ 0, then as we increase xj, b̄i − āji xj ≥ 0 no matter how large we make xj. On the other hand, if āji > 0, then as we increase xj, we know that b̄i − āji xj will get smaller and eventually hit zero. In order to ensure that all variables remain non-negative, we cannot increase xj beyond a certain point.
For each i (i = 1, . . . , m) such that āji > 0, the value of xj that will make xBi go to 0 can be found by observing that:
(9.56) $x_{B_i} = \overline{b}_i - \overline{a}_{j_i}x_j$
and if xBi = 0, then we can solve:
(9.57) $0 = \overline{b}_i - \overline{a}_{j_i}x_j \implies x_j = \frac{\overline{b}_i}{\overline{a}_{j_i}}$
Thus, the largest possible value we can assign xj while ensuring that all variables remain non-negative is:
(9.58) $\min\left\{\frac{\overline{b}_i}{\overline{a}_{j_i}} : i = 1, \dots, m \text{ and } \overline{a}_{j_i} > 0\right\}$
Expression 9.58 is called the minimum ratio test. We are interested in which index i achieves the minimum ratio.
Suppose that in executing the minimum ratio test, we find that xj = b̄k/ājk. The variable xj (which was non-basic) becomes basic and the variable xBk becomes non-basic. All other basic variables remain basic (and positive). In executing this procedure (of exchanging one basic variable and one non-basic variable) we have moved from one extreme point of X to another.
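A sketch of the minimum ratio test as a small Maple routine (the names bbar and abar stand for b̄ = B⁻¹b and āj = B⁻¹A·j; they are ours):

MinRatioIndex := proc(bbar::Vector, abar::Vector)::integer;
  local i, k, best;
  k := -1;
  best := infinity;
  for i to LinearAlgebra:-Dimension(bbar) do
    if abar[i] > 0 and bbar[i]/abar[i] < best then
      best := bbar[i]/abar[i];
      k := i;
    end if;
  end do;
  k    #k = -1 means no abar[i] > 0; the problem is unbounded (Theorem 9.59)
end proc: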
Theorem 9.58. If zj − cj ≥ 0 for all j ∈ J, then the current basic feasible solution is (locally) optimal.⁴
⁴We show later that it is globally optimal; however, we cannot do any better than a local argument for now.
Proof. We have already shown in Theorem 9.53 that if a linear programming problem
has an optimal solution, then it occurs at an extreme point and we’ve shown in Theorem
9.57 that there is a one-to-one correspondence between extreme points and basic feasible
solutions. If zj − cj ≥ 0 for all j ∈ J , then ∂z/∂xj ≤ 0 for all non-basic variables xj . That
is, we cannot increase the value of the objective function by increasing the value of any nonbasic variable. Thus, since moving to another basic feasible solution (extreme point) will not
improve the objective function, it follows we must be at a (locally) optimal solution.
Theorem 9.59. In a maximization problem, if āji ≤ 0 for all i = 1, . . . , m, and zj − cj < 0, then the linear programming problem is unbounded.
Proof. The fact that zj − cj < 0 implies that increasing xj will improve the value of the objective function. Since āji ≤ 0 for all i = 1, . . . , m, we can increase xj indefinitely without violating feasibility (no basic variable will ever go to zero). Thus the objective function can be made as large as we like.
Remark 9.60. We should note that in executing the exchange of one basic variable and one non-basic variable, we must be very careful to ensure that the resulting basis consists of m linearly independent columns of the original matrix A. Specifically, we must be able to write the column corresponding to xj, the entering variable, as a linear combination of the columns of B:
(9.59) $\alpha_1 B_{\cdot 1} + \cdots + \alpha_m B_{\cdot m} = A_{\cdot j}$
and further, if we are exchanging xj for xBi (i = 1, . . . , m), then αi ≠ 0. We can see that this holds from the fact that āj = B⁻¹A·j, and therefore:
Bāj = A·j
which gives:
A·j = B·1āj1 + · · · + B·mājm
showing how to write the column A·j as a linear combination of the columns of B.
Exercise 64. Consider the linear programming problem given in Exercise 63. Under
what conditions should a non-basic variable enter the basis? State and prove an analogous
theorem to Theorem 9.58 using your observation. [Hint: Use the definition of reduced cost.
Remember that it is −∂z/∂xj .]
Example 9.61. Consider a simple Linear Programming Problem:
max z(x1, x2) = 7x1 + 6x2
s.t. 3x1 + x2 ≤ 120
     x1 + 2x2 ≤ 160
     x1 ≤ 35
     x1 ≥ 0
     x2 ≥ 0
We can convert this problem to standard form by introducing the slack variables s1, s2 and s3:
max z(x1, x2) = 7x1 + 6x2
s.t. 3x1 + x2 + s1 = 120
     x1 + 2x2 + s2 = 160
     x1 + s3 = 35
     x1, x2, s1, s2, s3 ≥ 0
which yields the matrices:
$c = \begin{bmatrix} 7 \\ 6 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad x = \begin{bmatrix} x_1 \\ x_2 \\ s_1 \\ s_2 \\ s_3 \end{bmatrix} \quad A = \begin{bmatrix} 3 & 1 & 1 & 0 & 0 \\ 1 & 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \end{bmatrix} \quad b = \begin{bmatrix} 120 \\ 160 \\ 35 \end{bmatrix}$
We can begin with the matrices:
$B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad N = \begin{bmatrix} 3 & 1 \\ 1 & 2 \\ 1 & 0 \end{bmatrix}$
In this case we have:
$x_B = \begin{bmatrix} s_1 \\ s_2 \\ s_3 \end{bmatrix} \quad x_N = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \quad c_B = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \quad c_N = \begin{bmatrix} 7 \\ 6 \end{bmatrix}$
and:
$B^{-1}b = \begin{bmatrix} 120 \\ 160 \\ 35 \end{bmatrix} \quad B^{-1}N = \begin{bmatrix} 3 & 1 \\ 1 & 2 \\ 1 & 0 \end{bmatrix}$
Therefore:
$c_B^T B^{-1} b = 0 \quad c_B^T B^{-1} N = \begin{bmatrix} 0 & 0 \end{bmatrix}$
Using this information, we can compute:
$c_B^T B^{-1} N - c_N^T = \begin{bmatrix} -7 & -6 \end{bmatrix}$
$c_B^T B^{-1} A_{\cdot 1} - c_1 = -7 \qquad c_B^T B^{-1} A_{\cdot 2} - c_2 = -6$
and therefore:
$\frac{\partial z}{\partial x_1} = 7 \quad\text{and}\quad \frac{\partial z}{\partial x_2} = 6$
Based on this information, we could choose either x1 or x2 to enter the basis and the value of the objective function would increase. If we choose x1 to enter the basis, then we must determine which variable will leave the basis. To do this, we must investigate the elements of B⁻¹A·1 and the current basic feasible solution B⁻¹b. Since each element of B⁻¹A·1 is positive, we must perform the minimum ratio test on each element of B⁻¹A·1. We know that B⁻¹A·1 is just the first column of B⁻¹N, which is:
$B^{-1}A_{\cdot 1} = \begin{bmatrix} 3 \\ 1 \\ 1 \end{bmatrix}$
Performing the minimum ratio test, we have:
$\min\left\{\frac{120}{3}, \frac{160}{1}, \frac{35}{1}\right\}$
In this case, we see that index 3 (35/1) achieves the minimum ratio. Therefore, variable x1 will enter the basis and variable s3 will leave the basis. The new basic and non-basic variables will be:
$x_B = \begin{bmatrix} s_1 \\ s_2 \\ x_1 \end{bmatrix} \quad x_N = \begin{bmatrix} s_3 \\ x_2 \end{bmatrix} \quad c_B = \begin{bmatrix} 0 \\ 0 \\ 7 \end{bmatrix} \quad c_N = \begin{bmatrix} 0 \\ 6 \end{bmatrix}$
and the matrices become:
$B = \begin{bmatrix} 1 & 0 & 3 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix} \quad N = \begin{bmatrix} 0 & 1 \\ 0 & 2 \\ 1 & 0 \end{bmatrix}$
Note we have simply swapped the column corresponding to x1 with the column corresponding to s3 in the basis matrix B and the non-basic matrix N. We will do this repeatedly in the example and we recommend the reader keep track of which variables are being exchanged and why certain columns in B are being swapped with those in N.
Using the new B and N matrices, the derived matrices are then:
$B^{-1}b = \begin{bmatrix} 15 \\ 125 \\ 35 \end{bmatrix} \quad B^{-1}N = \begin{bmatrix} -3 & 1 \\ -1 & 2 \\ 1 & 0 \end{bmatrix}$
The cost information becomes:
$c_B^T B^{-1} b = 245 \quad c_B^T B^{-1} N = \begin{bmatrix} 7 & 0 \end{bmatrix}$
Using this information, we can compute:
$c_B^T B^{-1} N - c_N^T = \begin{bmatrix} 7 & -6 \end{bmatrix}$
$c_B^T B^{-1} A_{\cdot 5} - c_5 = 7 \qquad c_B^T B^{-1} A_{\cdot 2} - c_2 = -6$
Based on this information, we can only choose x2 to enter the basis to ensure that the value of the objective function increases. We can perform the minimum ratio test to figure out which basic variable will leave the basis. We know that B⁻¹A·2 is just the second column of B⁻¹N, which is:
$B^{-1}A_{\cdot 2} = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}$
Performing the minimum ratio test (on the positive entries only), we have:
$\min\left\{\frac{15}{1}, \frac{125}{2}\right\}$
In this case, we see that index 1 (15/1) achieves the minimum ratio. Therefore, variable x2 will enter the basis and variable s1 will leave the basis. The new basic and non-basic variables will be:
$x_B = \begin{bmatrix} x_2 \\ s_2 \\ x_1 \end{bmatrix} \quad x_N = \begin{bmatrix} s_3 \\ s_1 \end{bmatrix} \quad c_B = \begin{bmatrix} 6 \\ 0 \\ 7 \end{bmatrix} \quad c_N = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
and the matrices become:
$B = \begin{bmatrix} 1 & 0 & 3 \\ 2 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix} \quad N = \begin{bmatrix} 0 & 1 \\ 0 & 0 \\ 1 & 0 \end{bmatrix}$
The derived matrices are then:
$B^{-1}b = \begin{bmatrix} 15 \\ 95 \\ 35 \end{bmatrix} \quad B^{-1}N = \begin{bmatrix} -3 & 1 \\ 5 & -2 \\ 1 & 0 \end{bmatrix}$
The cost information becomes:
$c_B^T B^{-1} b = 335 \quad c_B^T B^{-1} N = \begin{bmatrix} -11 & 6 \end{bmatrix} \quad c_B^T B^{-1} N - c_N^T = \begin{bmatrix} -11 & 6 \end{bmatrix}$
Based on this information, we can only choose s3 to (re-)enter the basis to ensure that the value of the objective function increases. We can perform the minimum ratio test to figure out which basic variable will leave the basis. We know that B⁻¹A·5 is just the first column of B⁻¹N, which is:
$B^{-1}A_{\cdot 5} = \begin{bmatrix} -3 \\ 5 \\ 1 \end{bmatrix}$
Performing the minimum ratio test (on the positive entries only), we have:
$\min\left\{\frac{95}{5}, \frac{35}{1}\right\}$
In this case, we see that index 2 (95/5) achieves the minimum ratio. Therefore, variable s3 will enter the basis and variable s2 will leave the basis. The new basic and non-basic variables will be:
$x_B = \begin{bmatrix} x_2 \\ s_3 \\ x_1 \end{bmatrix} \quad x_N = \begin{bmatrix} s_2 \\ s_1 \end{bmatrix} \quad c_B = \begin{bmatrix} 6 \\ 0 \\ 7 \end{bmatrix} \quad c_N = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
and the matrices become:
$B = \begin{bmatrix} 1 & 0 & 3 \\ 2 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \quad N = \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 0 \end{bmatrix}$
The derived matrices are then:
$B^{-1}b = \begin{bmatrix} 72 \\ 19 \\ 16 \end{bmatrix} \quad B^{-1}N = \begin{bmatrix} 6/10 & -1/5 \\ 1/5 & -2/5 \\ -1/5 & 2/5 \end{bmatrix}$
The cost information becomes:
$c_B^T B^{-1} b = 544 \quad c_B^T B^{-1} N = \begin{bmatrix} 11/5 & 8/5 \end{bmatrix} \quad c_B^T B^{-1} N - c_N^T = \begin{bmatrix} 11/5 & 8/5 \end{bmatrix}$
Since the reduced costs are now positive, we can conclude that we've obtained an optimal solution because no improvement is possible. The final solution then is:
$x_B^* = \begin{bmatrix} x_2 \\ s_3 \\ x_1 \end{bmatrix} = B^{-1}b = \begin{bmatrix} 72 \\ 19 \\ 16 \end{bmatrix}$
Simply, we have x1 = 16 and x2 = 72. The path of extreme points we actually took in traversing the boundary of the polyhedral feasible region is shown in Figure 9.8.
Figure 9.8. The Simplex Algorithm: The path around the feasible region is shown
in the figure. Each exchange of a basic and non-basic variable moves us along an
edge of the polygon in a direction that increases the value of the objective function.
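The arithmetic in this example is easy to verify mechanically. A sketch in Maple, checking the final basis (column order x1, x2, s1, s2, s3):

with(LinearAlgebra):
A := Matrix([[3, 1, 1, 0, 0], [1, 2, 0, 1, 0], [1, 0, 0, 0, 1]]):
b := Vector([120, 160, 35]):
c := Vector([7, 6, 0, 0, 0]):
basis := [2, 5, 1]:                 #the final basic variables: x2, s3, x1
B := A[.., basis]:
xB := LinearSolve(B, b);            #returns [72, 19, 16]
DotProduct(c[basis], xB);           #returns the objective value 544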
Exercise 65. Assume that a leather company manufactures two types of belts: regular
and deluxe. Each belt requires 1 square yard of leather. A regular belt requires 1 hour of
skilled labor to produce, while a deluxe belt requires 2 hours of labor. The leather company
receives 40 square yards of leather each week and a total of 60 hours of skilled labor is
available. Each regular belt nets $3 in profit, while each deluxe belt nets $5 in profit. The
company wishes to maximize profit.
(1) Ignoring the divisibility issues, construct a linear programming problem whose solution will determine the number of each type of belt the company should produce.
(2) Use the simplex algorithm to solve the problem you stated above remembering to
convert the problem to standard form before you begin.
(3) Draw the feasible region and the level curves of the objective function. Verify that
the optimal solution you obtained through the simplex method is the point at which
the level curves no longer intersect the feasible region in the direction following the
gradient of the objective function.
There are a few ways to choose the entering variable:
(1) Using Dantzig's Rule, we choose the variable with the greatest (absolute value) reduced cost.
(2) We can compute the next objective function value for each possible entering variable and choose the one that results in the largest objective function increase, in a greedy search.
(3) We can choose the first variable we find that is an acceptable entering variable (i.e., has a negative reduced cost).
(4) We can use a combination of these approaches.
Most Simplex codes have a complex recipe for choosing entering variables. We will deal with the question of breaking ties among leaving variables in our section on degeneracy.
4. Karush-Kuhn-Tucker (KKT) Conditions
Remark 9.62. The single most important thing to learn about Linear Programming (or
optimization in general) is the Karush-Kuhn-Tucker optimality conditions. These conditions
provide necessary and sufficient conditions for a point x ∈ Rn to be an optimal solution to
a Linear Programming problem. We state the Karush-Kuhn-Tucker theorem, but do not
prove it. A proof can be found in Chapter 8 of [Gri11].
Theorem 9.63. Consider the linear programming problem:
(9.60)
P: max cx
   s.t. Ax ≤ b
        x ≥ 0
with A ∈ Rm×n, b ∈ Rm and (row vector) c ∈ Rn. Then x∗ ∈ Rn is an optimal solution to Problem P if and only if there exist (row) vectors w∗ ∈ Rm and v∗ ∈ Rn and a slack variable vector s∗ ∈ Rm so that:
(9.61) Primal Feasibility: Ax∗ + s∗ = b, x∗ ≥ 0
(9.62) Dual Feasibility: w∗A − v∗ = c, w∗ ≥ 0, v∗ ≥ 0
(9.63) Complementary Slackness: w∗(Ax∗ − b) = 0, v∗x∗ = 0
Remark 9.64. The vectors w∗ and v∗ are sometimes called dual variables for reasons
that will be clear in the next section. They are also sometimes called Lagrange Multipliers.
You may have encountered Lagrange Multipliers in your Math 230 or Math 231 class. These
are the same kind of variables, except applied to linear optimization problems. There is one element in the dual variable vector w∗ for each constraint of the form Ax ≤ b and one element in the dual variable vector v∗ for each constraint of the form x ≥ 0.
Example 9.65. Consider a simple Linear Programming Problem with Dual Variables
(Lagrange Multipliers) listed next to their corresponding constraints:
max z(x1 , x2 ) = 7x1 + 6x2
Dual Variable
s.t. 3x1 + x2 ≤ 120
(w1 )
x1 + 2x2 ≤ 160
(w1 )
x1 ≤ 35
(w3 )
x1 ≥ 0
(v1 )
x2 ≥ 0
(v2 )
In this problem we have:
3 1
120
b = 160
A= 1 2
35
1 0
c= 7 6
Then the KKT conditions can be written as:

Primal Feasibility:
    [ 3  1 ]            [ 120 ]
    [ 1  2 ] [ x1 ]  ≤  [ 160 ]          [ x1 ]  ≥  [ 0 ]
    [ 1  0 ] [ x2 ]     [  35 ]          [ x2 ]     [ 0 ]

Dual Feasibility:
                   [ 3  1 ]
    [ w1 w2 w3 ]   [ 1  2 ]  −  [ v1 v2 ]  =  [ 7  6 ]
                   [ 1  0 ]

    [ w1 w2 w3 ] ≥ [ 0 0 0 ]          [ v1 v2 ] ≥ [ 0 0 ]

Complementary Slackness:
                 ( [ 3  1 ]            [ 120 ] )
    [ w1 w2 w3 ] ( [ 1  2 ] [ x1 ]  −  [ 160 ] )  =  0
                 ( [ 1  0 ] [ x2 ]     [  35 ] )

    [ v1 v2 ] [ x1 ]  =  0
              [ x2 ]
Note, we are suppressing the slack variables s in the primal feasibility expression. Recall
that at optimality, we had x1 = 16 and x2 = 72. The binding constraints in this case were:

3x1 + x2 ≤ 120
x1 + 2x2 ≤ 160

To see this, note that 3(16) + 72 = 120 and 16 + 2(72) = 160. Then we should be able to express c = [7 6] (the vector of coefficients of the objective function) as a positive
combination of the gradients of the binding constraints:
∇(7x1 + 6x2) = [ 7  6 ]
∇(3x1 + x2) = [ 3  1 ]
∇(x1 + 2x2) = [ 1  2 ]

That is, we wish to solve the linear equation:

(9.64) [ w1  w2 ] [ 3  1 ]  =  [ 7  6 ]
                  [ 1  2 ]

The result is the system of equations:

3w1 + w2 = 7
w1 + 2w2 = 6

A solution to this system is w1 = 8/5 and w2 = 11/5. This fact is illustrated in Figure 9.9.
Figure 9.9 shows the gradient cone formed by the binding constraints at the optimal point for the toy maker problem. Since x1, x2 > 0, we must have v1 = v2 = 0. Moreover, since x1 < 35, we know that x1 ≤ 35 is not a binding constraint and thus its dual variable w3 is also zero.

Figure 9.9. The Gradient Cone: At optimality, the cost vector c is obtuse with respect to the directions formed by the binding constraints. It is also contained inside the cone of the gradients of the binding constraints, which we will discuss at length later.

This leads to the conclusion:
[ x1∗ ; x2∗ ] = [ 16 ; 72 ],   [ w1∗  w2∗  w3∗ ] = [ 8/5  11/5  0 ],   [ v1∗  v2∗ ] = [ 0  0 ]
and the KKT conditions are satisfied.
Exercise 66. Consider the problem:
max x1 + x2
s.t. 2x1 + x2 ≤ 4
x1 + 2x2 ≤ 6
x1, x2 ≥ 0
Write the KKT conditions for an optimal point for this problem. (You will have a vector
w = [w1 w2 ] and a vector v = [v1 v2 ]).
Draw the feasible region of the problem and use Matlab to solve the problem. At the
point of optimality, identify the binding constraints and draw their gradients. Show that the
KKT conditions hold. (Specifically find w and v.)
Exercise 67. Find the KKT conditions for the problem:
(9.65)  min cx
        s.t. Ax ≥ b
             x ≥ 0
[Hint: Remember, every minimization problem can be converted to a maximization problem
by multiplying the objective function by −1 and the constraints Ax ≥ b are equivalent to
the constraints −Ax ≤ −b.]
5. Simplex Initialization
So far we have investigated linear programming problems that had form:
max cT x
s.t. Ax ≤ b
x≥0
In this case, we use slack variables to convert the problem to:
max cT x
s.t. Ax + Im xs = b
x, xs ≥ 0
where xs are slack variables, one for each constraint. If b ≥ 0, then our initial basic feasible
solution can be x = 0 and xs = b (that is, our initial basis matrix is B = Im ).
Suppose now we wish to investigate problems in which we do not have a problem structure
that lends itself to easily identifying an initial basic feasible solution. The simplex algorithm
requires an initial BFS to begin execution and so we must develop a method for finding such
a BFS.
For the remainder of this chapter we will assume, unless told otherwise, that we are
interested in solving a linear programming problem provided in Standard Form. That is:
(9.66) P    max cTx
            s.t. Ax = b
                 x ≥ 0
and that b ≥ 0. Clearly our work in Chapter 3 shows that any linear programming problem
can be put in this form.
Suppose to each constraint Ai· x = bi we associate an artificial variable xai . We can
replace constraint i with:
(9.67)
Ai· x + xai = bi
Since bi ≥ 0, we will require xai ≥ 0. If xai = 0, then this is simply the original constraint.
Thus if we can find values for the ordinary decision variables x so that xai = 0, then constraint
i is satisfied. If we can identify values for x so that all the artificial variables are zero and at most m variables of x are non-zero, then the modified constraints described by Equation 9.67 are satisfied and we have identified an initial basic feasible solution.
Obviously, we would like to penalize non-zero artificial variables. This can be done by
writing a new linear programming problem:
T
min e xa
s.t. Ax + Im xa = b
(9.68) P1
x, xa ≥ 0
Remark 9.66. We can see that the artificial variables are similar to slack variables, but
they should have zero value because they have no true meaning in the original problem P .
They are introduced artificially to help identify an initial basic feasible solution to Problem
P.
Theorem 9.67. The optimal objective function value in Problem P1 is bounded below by
0. Furthermore, if the optimal solution to problem P1 has xa = 0, then the values of x form
a feasible solution to Problem P .
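As a concrete illustration, the following Python sketch builds and solves Problem P1 to search for an initial basic feasible solution. This is our own construction, not part of the notes: the function name and tolerance are assumptions, and scipy.optimize.linprog is used as the LP solver.

import numpy as np
from scipy.optimize import linprog

def phase_one(A, b):
    # Solve  min e^T x_a  s.t.  A x + I x_a = b, x, x_a >= 0  (Problem P1).
    # Returns a feasible x for Ax = b, x >= 0, or None if P is infeasible.
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    m, n = A.shape
    sign = np.where(b < 0, -1.0, 1.0)       # flip rows so b >= 0
    A1 = np.hstack([A * sign[:, None], np.eye(m)])
    c1 = np.concatenate([np.zeros(n), np.ones(m)])   # objective e^T x_a
    res = linprog(c1, A_eq=A1, b_eq=b * sign,
                  bounds=[(0, None)] * (n + m), method="highs")
    if res.status == 0 and res.fun < 1e-8:  # all artificials driven to zero
        return res.x[:n]
    return None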
6. Revised Simplex Method
Consider an arbitrary linear programming problem, which we will assume is written in
standard form:
(9.69) P    max cTx
            s.t. Ax = b
                 x ≥ 0
Consider the data we need at each iteration of the Simplex algorithm:
(1) Reduced costs: cBTB−1A·j − cj for each variable xj where j ∈ J and J is the set of indices of non-basic variables.
(2) Right-hand-side values: b̄ = B−1b for use in the minimum ratio test.
(3) āj = B−1A·j for use in the minimum ratio test.
(4) z = cBTB−1b, the current objective function value.
The one value that is clearly critical to the computation is B−1, as it appears in each and every computation. It would be far more effective to keep only the values B−1, cBTB−1, b̄ and z, and compute the reduced cost values and the vectors āj as we need them.
Let w = cBTB−1. Then the pertinent information may be stored in a new revised simplex tableau with form:

(9.70)       [  w   |  z  ]
        xB   [ B−1  |  b̄  ]
The revised simplex algorithm is detailed in Algorithm 20. In essence, the revised simplex algorithm allows us to avoid computing āj until we absolutely need to do so. In fact, if we do not apply Dantzig's entering variable rule and simply select the first acceptable entering variable, then we may be able to avoid computing a substantial number of columns in the tableau.

Revised Simplex Algorithm
(1) Identify an initial basis matrix B and compute B−1, w, b̄ and z and place these into a revised simplex tableau:
         [  w   |  z  ]
    xB   [ B−1  |  b̄  ]
(2) For each j ∈ J use w to compute: zj − cj = wA·j − cj.
(3) Choose an entering variable xj (for a maximization problem, we choose a variable with negative reduced cost; for a minimization problem we choose a variable with positive reduced cost):
    (a) If there is no entering variable, STOP, you are at an optimal solution.
    (b) Otherwise, continue to Step 4.
(4) Append the column āj = B−1A·j to the revised simplex tableau:
         [  w   |  z  | zj − cj ]
    xB   [ B−1  |  b̄  |   āj    ]
(5) Perform the minimum ratio test and determine a leaving variable (using any leaving variable rule you prefer).
    (a) If āj ≤ 0, STOP, the problem is unbounded.
    (b) Otherwise, assume that the leaving variable is xBr, which appears in row r of the revised simplex tableau.
(6) Use row operations and pivot on the leaving variable row of the column:
    [ zj − cj ; āj ]
    transforming the revised simplex tableau into:
          [  w′    |  z′  | 0  ]
    x′B   [ B′−1   |  b̄′  | er ]
    where er is an identity column with a 1 in row r (the row that left). The variable xj is now the rth element of xB.
(7) Goto Step 2.
Algorithm 20. Revised Simplex Algorithm
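To make the bookkeeping concrete, here is a minimal Python sketch of the revised simplex algorithm for max cTx s.t. Ax = b, x ≥ 0. This is our own illustration (not the notes' method verbatim): it recomputes a dense B−1 on every pass for clarity, whereas a production code would maintain and update a factorization of B; the function name and tolerances are assumptions.

import numpy as np

def revised_simplex(c, A, b, basis, max_iter=1000):
    # basis: list of m column indices forming a feasible starting basis
    # (e.g., the indices of an embedded identity when b >= 0).
    A, b, c = np.asarray(A, float), np.asarray(b, float), np.asarray(c, float)
    m, n = A.shape
    basis = list(basis)
    for _ in range(max_iter):
        Binv = np.linalg.inv(A[:, basis])
        bbar = Binv @ b                      # b-bar = B^{-1} b
        w = c[basis] @ Binv                  # w = c_B^T B^{-1}
        nonbasic = [j for j in range(n) if j not in basis]
        zc = {j: w @ A[:, j] - c[j] for j in nonbasic}   # z_j - c_j
        enter = min((j for j in nonbasic if zc[j] < -1e-9),
                    key=lambda j: zc[j], default=None)   # Dantzig's rule
        if enter is None:                    # all z_j - c_j >= 0: optimal
            x = np.zeros(n)
            x[basis] = bbar
            return x, c @ x
        abar = Binv @ A[:, enter]            # computed only when needed
        if np.all(abar <= 1e-9):
            raise ValueError("problem is unbounded")
        ratios = [(bbar[i] / abar[i], i) for i in range(m) if abar[i] > 1e-9]
        _, r = min(ratios)                   # minimum ratio test
        basis[r] = enter                     # x_enter replaces x_{B_r}
    raise RuntimeError("iteration limit reached")

For instance, the problem of Exercise 68, after adding slack variables, could be solved with revised_simplex([1, 1, 0, 0], [[2, 1, 1, 0], [1, 2, 0, 1]], [4, 6], basis=[2, 3]).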
Example 9.68. Consider a software company that is developing a new program. The company has identified two types of bugs that remain in this software: non-critical and critical. The company's actuarial firm predicts that the risks associated with these bugs are uniform random variables with mean $100 per non-critical bug and mean $1000 per critical bug. The software currently has 50 non-critical bugs and 5 critical bugs.
Assume that it requires 3 hours to fix a non-critical bug and 12 hours to fix a critical bug. For each day (8 hour period) beyond two business weeks (80 hours) that the company fails to ship its product, the actuarial firm estimates it will lose $500 per day.
We can find the optimal number of bugs of each type the software company should fix
assuming it wishes to minimize its exposure to risk using a linear programming formulation.
Let x1 be the number of non-critical bugs corrected and x2 be the number of critical
software bugs corrected. Define:
(9.71) y1 = 50 − x1
(9.72) y2 = 5 − x2

Here y1 is the number of non-critical bugs that are not fixed while y2 is the number of critical bugs that are not fixed. The time (in hours) it takes to fix these bugs is:

(9.73) 3x1 + 12x2
Let:

(9.74) y3 = (1/8)(80 − 3x1 − 12x2)

Then y3 is a variable that is unrestricted in sign and determines the amount of time (in days) either over or under the two-week period that is required to ship the software. As an unrestricted variable, we can break it into two components:

(9.75) y3 = z1 − z2
We will assume that z1 , z2 ≥ 0. If y3 > 0, then z1 > 0 and z2 = 0. In this case, the software
is completed ahead of the two-week deadline. If y3 < 0, then z1 = 0 and z2 > 0. In this case
the software is finished after the two-week deadline. Finally, if y3 = 0, then z1 = z2 = 0 and
the software is finished precisely on time.
We can form the objective function as:
(9.76) z = 100y1 + 1000y2 + 500z2
The linear programming problem is then:
(9.77)
min z =y1 + 10y2 + 5z2
s.t. x1 + y1 = 50
x2 + y2 = 5
3
3
x1 + x2 + z1 − z2 = 10
8
2
x 1 , x2 , y 1 , y 2 , z 1 , z 2 ≥ 0
Notice we have modified the objective function by dividing by 100. This will make the
arithmetic of the simplex algorithm easier. The matrix of coefficients for this problem is:
          x1    x2   y1  y2  z1  z2
(9.78)  [  1     0    1   0   0   0 ]
        [  0     1    0   1   0   0 ]
        [ 3/8   3/2   0   0   1  −1 ]
Notice there is an identity matrix embedded inside the matrix of coefficients. Thus a good
initial basic feasible solution is {y1 , y2 , z1 }. The initial basis matrix is I3 and naturally,
B−1 = I3 as a result. We can see that cB = [1 10 0]T . It follows that cTB B−1 = w = [1 10 0].
Our initial revised simplex tableau is thus:

(9.79)  z    [ 1  10  0 | 100 ]
        y1   [ 1   0  0 |  50 ]
        y2   [ 0   1  0 |   5 ]
        z1   [ 0   0  1 |  10 ]
There are three variables that might enter at this point: x1, x2 and z2. We can compute the reduced costs for each of these variables using the columns of the A matrix, the coefficients of these variables in the objective function and the current w vector (in row 0 of the revised simplex tableau). We obtain:

z1 − c1 = wA·1 − c1 = [ 1 10 0 ] [ 1 ; 0 ; 3/8 ] − 0 = 1
z2 − c2 = wA·2 − c2 = [ 1 10 0 ] [ 0 ; 1 ; 3/2 ] − 0 = 10
z6 − c6 = wA·6 − c6 = [ 1 10 0 ] [ 0 ; 0 ; −1 ] − 5 = −5

By Dantzig's rule, we enter variable x2. We append B−1A·2 and the reduced cost to the revised simplex tableau to obtain:
(9.80)  z    [ 1  10  0 | 100 | 10  ]   MRT
        y1   [ 1   0  0 |  50 |  0  ]   −
        y2   [ 0   1  0 |   5 |  1  ]   5
        z1   [ 0   0  1 |  10 | 3/2 ]   20/3
After pivoting on the indicated element, we obtain the new tableau:

(9.81)  z    [ 1    0    0 |  50  ]
        y1   [ 1    0    0 |  50  ]
        x2   [ 0    1    0 |   5  ]
        z1   [ 0  −3/2   1 |  5/2 ]
We can compute reduced costs for the non-basic variables (except for y2 , which we know will
not re-enter the basis on this iteration) to obtain:
z1 − c1 = wA·1 − c1 = 1
z6 − c6 = wA·6 − c6 = −5
In this case, x1 will enter the basis and we augment our revised simplex tableau to obtain:

(9.82)  z    [ 1    0    0 |  50  |  1  ]   MRT
        y1   [ 1    0    0 |  50  |  1  ]   50
        x2   [ 0    1    0 |   5  |  0  ]   −
        z1   [ 0  −3/2   1 |  5/2 | 3/8 ]   20/3
Note that:

           [ 1    0   0 ] [  1  ]   [  1  ]
B−1A·1  =  [ 0    1   0 ] [  0  ] = [  0  ]
           [ 0  −3/2  1 ] [ 3/8 ]   [ 3/8 ]

This is the ā1 column that is appended to the right hand side of the tableau along with z1 − c1 = 1. After pivoting, the tableau becomes:
(9.83)  z    [ 1   4   −8/3 | 130/3 ]
        y1   [ 1   4   −8/3 | 130/3 ]
        x2   [ 0   1    0   |   5   ]
        x1   [ 0  −4   8/3  |  20/3 ]
We can now check our reduced costs. Clearly, z1 will not re-enter the basis. Therefore, we
need only examine the reduced costs for the variables y2 and z2 .
z4 − c4 = wA·4 − c4 = −6
z6 − c6 = wA·6 − c6 = −7/3
Since all reduced costs are now negative, no further minimization is possible and we conclude
we have arrived at an optimal solution.
Two things are interesting to note: first, the solution for the number of non-critical
software bugs to fix is non-integer. Thus, in reality the company must fix either 6 or 7 of
the non-critical software bugs. The second thing to note is that this economic model helps
to explain why some companies are content to release software that contains known bugs.
In making a choice between releasing a flawless product or making a quicker (larger) profit,
a selfish, profit maximizer will always choose to fix only those bugs it must fix and release
sooner rather than later.
Exercise 68. Solve the following problem using the revised simplex algorithm.
max x1 + x2
s.t. 2x1 + x2 ≤ 4
x1 + 2x2 ≤ 6
x1, x2 ≥ 0
7. Cycling Prevention
Theorem 9.69. Consider Problem P (our linear programming problem). Let B ∈ Rm×m be a basis matrix corresponding to some set of basic variables xB. Let b̄ = B−1b. If b̄j = 0 for some j = 1, . . . , m, then xB = b̄ and xN = 0 is a degenerate extreme point of the feasible region of Problem P.
Degeneracy can cause us to take extra steps on our way from an initial basic feasible solution to an optimal solution. When the simplex algorithm takes extra steps while remaining
at the same degenerate extreme point, this is called stalling. The problem can become much
worse; for certain entering variable rules, the simplex algorithm can become locked in a
cycle of pivots, each one moving from one characterization of a degenerate extreme point to the next. The following example, due to Beale and illustrated in Chapter 4 of [BJS04], demonstrates the point.
Remark 9.70. In this section, for the sake of simplicity, we illustrate cycling with the full tableau. That is, we show the entire table of reduced costs as the top row and the entire matrix B−1A. The right-hand-side and z values are identical to those of the revised simplex method.
Example 9.71. Consider the following linear programming problem:

(9.84)  min −(3/4)x4 + 20x5 − (1/2)x6 + 6x7
        s.t. x1 + (1/4)x4 − 8x5 − x6 + 9x7 = 0
             x2 + (1/2)x4 − 12x5 − (1/2)x6 + 3x7 = 0
             x3 + x6 = 1
             xi ≥ 0   i = 1, . . . , 7
It is instructive to analyze the A matrix of the constraints of this problem. We have:

(9.85)  A = [ 1  0  0  1/4   −8    −1   9 ]
            [ 0  1  0  1/2  −12  −1/2   3 ]
            [ 0  0  1   0     0     1   0 ]

The fact that the A matrix contains an identity matrix embedded within it suggests that an initial basic feasible solution with basic variables x1, x2 and x3 would be a good choice. This leads to a vector of reduced costs given by:

(9.86) cBTB−1N − cNT = [ 3/4  −20  1/2  −6 ]
These yield an initial tableau with structure:

        z  x1  x2  x3   x4    x5    x6   x7  RHS
  z     1   0   0   0   3/4  −20   1/2   −6    0
  x1    0   1   0   0   1/4   −8    −1    9    0
  x2    0   0   1   0   1/2  −12  −1/2    3    0
  x3    0   0   0   1    0     0     1    0    1
If we apply an entering variable rule where we always choose the non-basic variable with the most positive reduced cost (since this is a minimization problem), and we choose the leaving variable to be the first (lowest-index) row among those that tie in the minimum ratio test, then we will obtain the following sequence of tableaux:
Tableau I:
        z  x1  x2  x3   x4    x5    x6   x7  RHS
  z     1   0   0   0   3/4  −20   1/2   −6    0
  x1    0   1   0   0   1/4   −8    −1    9    0
  x2    0   0   1   0   1/2  −12  −1/2    3    0
  x3    0   0   0   1    0     0     1    0    1

Tableau II:
        z  x1  x2  x3  x4   x5    x6    x7  RHS
  z     1  −3   0   0   0    4   7/2   −33    0
  x4    0   4   0   0   1  −32    −4    36    0
  x2    0  −2   1   0   0    4   3/2   −15    0
  x3    0   0   0   1   0    0     1     0    1

Tableau III:
        z   x1    x2   x3  x4  x5   x6     x7   RHS
  z     1   −1    −1    0   0   0    2    −18     0
  x4    0  −12     8    0   1   0    8    −84     0
  x5    0  −1/2  1/4    0   0   1  3/8  −15/4     0
  x3    0    0     0    1   0   0    1      0     1

Tableau IV:
        z   x1     x2   x3   x4     x5  x6    x7   RHS
  z     1    2     −3    0  −1/4     0   0     3     0
  x6    0  −3/2     1    0   1/8     0   1  −21/2    0
  x5    0  1/16  −1/8    0  −3/64    1   0   3/16    0
  x3    0  3/2     −1    1  −1/8     0   0   21/2    1

Tableau V:
        z  x1    x2   x3   x4    x5   x6  x7  RHS
  z     1   1    −1    0   1/2  −16    0   0    0
  x6    0   2    −6    0  −5/2   56    1   0    0
  x7    0  1/3  −2/3   0  −1/4  16/3   0   1    0
  x3    0  −2     6    1   5/2  −56    0   0    1

Tableau VI:
        z  x1   x2  x3   x4   x5    x6   x7  RHS
  z     1   0    2   0   7/4  −44  −1/2   0    0
  x1    0   1   −3   0  −5/4   28   1/2   0    0
  x7    0   0  1/3   0   1/6   −4  −1/6   1    0
  x3    0   0    0   1    0     0     1   0    1

Tableau VII:
        z  x1  x2  x3   x4    x5    x6   x7  RHS
  z     1   0   0   0   3/4  −20   1/2   −6    0
  x1    0   1   0   0   1/4   −8    −1    9    0
  x2    0   0   1   0   1/2  −12  −1/2    3    0
  x3    0   0   0   1    0     0     1    0    1
We see that the last tableau (VII) is the same as the first tableau and thus we have constructed an instance where (using the given entering and leaving variable rules), the Simplex
Algorithm will cycle forever at this degenerate extreme point.
7.1. The Lexicographic Minimum Ratio Leaving Variable Rule. Given the example of the previous section, we require a method for breaking ties in the case of degeneracy that prevents cycling from occurring. There is a large literature on cycling prevention rules; however, the most well known is the lexicographic rule for selecting the leaving variable.
Definition 9.72 (Lexicographic Order). Let x = [x1 , . . . , xn ]T and y = [y1 , . . . , yn ]T be
vectors in Rn . We say that x is lexicographically greater than y if: there exists m < n so
that xi = yi for i = 1, . . . , m, and xm+1 > ym+1 .
138
9. LINEAR PROGRAMMING: THE SIMPLEX METHOD
Clearly, if there is no such m < n, then xi = yi for i = 1, . . . , n and thus x = y. We write x ≻ y to indicate that x is lexicographically greater than y. Naturally, we can write x ⪰ y to indicate that x is lexicographically greater than or equal to y.
Lexicographic ordering is simply the standard order operation > applied to the individual
elements of a vector in Rn with a precedence on the index of the vector.
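Incidentally, this is exactly the order Python uses when comparing tuples, which gives a quick way to experiment with the definition (a tiny illustration of our own, not part of the notes):

# Python tuples compare lexicographically, matching Definition 9.72:
assert (1, 3, 5) > (1, 2, 9)    # the first differing entry decides: 3 > 2
assert (0, 0, 1) > (0, 0, 0)    # so (0, 0, 1) is lexicographically positive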
Definition 9.73. A vector x ∈ Rn is lexicographically positive if x ≻ 0 where 0 is the
zero vector in Rn .
Lemma 9.74. Let x and y be two lexicographically positive vectors in Rn . Then x + y is
lexicographically positive. Let c > 0 be a constant in R, then cx is a lexicographically positive
vector.
Exercise 69. Prove Lemma 9.74.
Suppose we are considering a linear programming problem and we have chosen an entering
variable xj according to a fixed entering variable rule. Assume further, we are given some
current basis matrix B and as usual, the right-hand-side vector of the constraints is denoted
b, while the coefficient matrix is denoted A. Then the minimum ratio test asserts that we will choose as the leaving variable a basic variable achieving the minimum ratio in the minimum ratio test. Consider the following set:

(9.87) I0 = { r : b̄r/ājr = min { b̄i/āji : i = 1, . . . , m and āji > 0 } }
In the absence of degeneracy, I0 contains a single element: the row index that has the
smallest ratio of bi to aji , where naturally: b = B−1 b and aj = B−1 A·j . In this case, xj is
swapped into the basis in exchange for xBr (the rth basic variable).
When we have a degenerate basic feasible solution, then I0 is not a singleton set and
contains all the rows that have tied in the minimum ratio test. In this case, we can form a
new set:
(9.88) I1 = { r ∈ I0 : ā1r/ājr = min { ā1i/āji : i ∈ I0 } }

Here, ā1 = B−1A·1 is the first column of B−1A. The elements of this (column) vector are divided by the elements of the (column) vector āj on an index-by-index basis. If this set is a singleton, then basic variable xBr leaves the basis. If this set is not a singleton, we may form a new set I2 with column ā2. In general, we will have the set:

(9.89) Ik = { r ∈ Ik−1 : ākr/ājr = min { āki/āji : i ∈ Ik−1 } }
Lemma 9.75. For any degenerate basis matrix B for any linear programming problem,
we will ultimately find a k so that Ik is a singleton.
Exercise 70. Prove Lemma 9.75. [Hint: Assume that the tableau is arranged so that the identity columns are columns 1 through m. (That is, āj = ej for j = 1, . . . , m.) Show that this configuration will easily lead to a singleton Ik for k < m.]
In executing the lexicographic minimum ratio test, we can see that we are essentially
comparing the tied rows in a lexicographic manner. If a set of rows ties in the minimum
ratio test, then we execute a minimum ratio test on the first column of the tied rows. If
there is a tie, then we move on executing a minimum ratio test on the second column of the
rows that tied in both previous tests. This continues until the tie is broken and a single row
emerges as the leaving row.
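A sketch of this tie-breaking procedure in Python follows. This is our own illustration, not part of the notes: the function name and tolerance are assumptions. It takes the current matrix B−1A and right-hand side b̄ = B−1b (as numpy arrays) and returns the leaving row for entering column j:

import numpy as np

def lex_min_ratio_row(Binv_A, bbar, j, tol=1e-9):
    aj = Binv_A[:, j]
    rows = [i for i in range(len(bbar)) if aj[i] > tol]
    ratios = bbar[rows] / aj[rows]
    # the set I_0: rows tied at the minimum ratio
    tied = [rows[k] for k in range(len(rows)) if ratios[k] <= ratios.min() + tol]
    k = 0
    while len(tied) > 1:                     # build I_1, I_2, ... until singleton
        col = Binv_A[:, k]
        r = np.array([col[i] / aj[i] for i in tied])
        tied = [tied[t] for t in range(len(tied)) if r[t] <= r.min() + tol]
        k += 1                               # Lemma 9.75: a singleton emerges
    return tied[0]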
Example 9.76. Let us consider the example from Beale again using the lexicographic
minimum ratio test. Consider the tableau shown below.
Tableau I:
        z  x1  x2  x3   x4    x5    x6   x7  RHS
  z     1   0   0   0   3/4  −20   1/2   −6    0
  x1    0   1   0   0   1/4   −8    −1    9    0
  x2    0   0   1   0   1/2  −12  −1/2    3    0
  x3    0   0   0   1    0     0     1    0    1
Again, we choose to enter variable x4 as it has the most positive reduced cost. Variables x1
and x2 tie in the minimum ratio test. So we consider a new minimum ratio test on the first
column of the tableau:
(9.90) min { 1/(1/4), 0/(1/2) } = min { 4, 0 }
From this test, we see that x2 is the leaving variable and we pivot on element 1/2 as indicated
in the tableau. Note, we only need to execute the minimum ratio test on variables x1 and x2
since those were the tied variables in the standard minimum ratio test. That is, I0 = {1, 2}
and we construct I1 from these indexes alone. In this case I1 = {2}. Pivoting yields the new
tableau:
Tableau II:
        z  x1   x2   x3  x4   x5    x6     x7   RHS
  z     1   0  −3/2   0   0   −2   5/4  −21/2    0
  x1    0   1  −1/2   0   0   −2  −3/4   15/2    0
  x4    0   0    2    0   1  −24   −1     6      0
  x3    0   0    0    1   0    0    1     0      1
There is no question this time of the entering or leaving variable: clearly x6 must enter and x3 must leave, and we obtain (thanks to Ethan Wright for finding a small typo in this example, which is now fixed):
Tableau III:
        z  x1   x2    x3   x4   x5  x6    x7    RHS
  z     1   0  −3/2  −5/4   0   −2   0  −21/2  −5/4
  x1    0   1  −1/2   3/4   0   −2   0   15/2   3/4
  x4    0   0    2     1    1  −24   0    6      1
  x6    0   0    0     1    0    0   1    0      1
Since this is a minimization problem and the reduced costs of the non-basic variables are
now all negative, we have arrived at an optimal solution. The lexicographic minimum ratio
test successfully prevented cycling.
Remark 9.77. The following theorem is proved in [Gri11].
Theorem 9.78. Consider the problem:
P    max cTx
     s.t. Ax = b
          x ≥ 0
Suppose the following hold:
(1) Im is embedded in the matrix A and is used as the starting basis,
(2) a consistent entering variable rule is applied (e.g., largest reduced cost first), and
(3) the lexicographic minimum ratio test is applied as the leaving variable rule.
Then the simplex algorithm converges in a finite number of steps.
8. Relating the KKT Conditions to the Tableau
Consider a linear programming problem in Standard Form:
(9.91) P    max cx
            s.t. Ax = b
                 x ≥ 0
with A ∈ Rm×n , b ∈ Rm and (row vector) c ∈ Rn .
The KKT conditions for a problem of this type assert that
wA − v = c
vx = 0
at an optimal point x for some vector w unrestricted in sign and v ≥ 0. (Note, for the sake
of notational ease, we have dropped the ∗ notation.)
Suppose at optimality we have a basis matrix B corresponding to a set of basic variables
xB and we simultaneously have non-basic variables xN . We may likewise divide v into vB
and vN .
Then we have:

(9.92) wA − v = c  =⇒  w[ B  N ] − [ vB  vN ] = [ cB  cN ]

(9.93) vx = 0  =⇒  [ vB  vN ] [ xB ; xN ] = 0
We can rewrite Expression 9.92 as:

(9.94) [ wB − vB   wN − vN ] = [ cB   cN ]

This simplifies to:

wB − vB = cB
wN − vN = cN

Let w = cBB−1. Then we see that:

(9.95) wB − vB = cB  =⇒  cBB−1B − vB = cB  =⇒  cB − vB = cB  =⇒  vB = 0
Since we know that xB ≥ 0, we know that vB should be equal to zero to ensure complementary slackness. Thus, this is consistent with the KKT conditions.
We further see that:
(9.96)
wN − vN = cN =⇒ cB B−1 N − vN = cN =⇒ vN = cB B−1 N − cN
Thus, the vN are just the reduced costs of the non-basic variables. (vB are the reduced costs
of the basic variables.) Furthermore, dual feasibility requires that v ≥ 0. Thus we see that
at optimality we require:
(9.97)
cB B−1 N − cN ≥ 0
This is precisely the condition for optimality in the simplex tableau.
We now can see the following facts are true about the Simplex Method:
(1) At each iteration of the Simplex Method, primal feasibility is satisfied. This is
ensured by the minimum ratio test and the fact that we start at a feasible point.
(2) At each iteration of the Simplex Method, complementary slackness is satisfied. After
all, the vector v is just the reduced cost vector (Row 0) of the Simplex tableau.
If a variable xj is basic (and hence possibly non-zero), then its reduced cost vj = 0. Otherwise, vj may be non-zero.
(3) At each iteration of the Simplex Algorithm, we may violate dual feasibility because
we may not have v ≥ 0. It is only at optimality that we achieve dual feasibility and
satisfy the KKT conditions.
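These three facts can be checked numerically. The following Python sketch is our own illustration, not part of the notes: given an optimal basis for max cx s.t. Ax = b, x ≥ 0, it recovers w = cBB−1 and v = wA − c and verifies the conditions discussed above (function name and tolerances are assumptions).

import numpy as np

def kkt_from_basis(c, A, b, basis):
    A, b, c = np.asarray(A, float), np.asarray(b, float), np.asarray(c, float)
    Binv = np.linalg.inv(A[:, basis])
    w = c[basis] @ Binv              # dual variables (Lagrange multipliers)
    v = w @ A - c                    # v_j = z_j - c_j; zero on basic columns
    x = np.zeros(len(c))
    x[basis] = Binv @ b
    assert np.allclose(A @ x, b) and np.all(x >= -1e-9)    # primal feasibility
    assert np.allclose(v[basis], 0) and abs(v @ x) < 1e-9  # compl. slackness
    return w, v, bool(np.all(v >= -1e-9))                  # dual feasibility?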
We can now prove the following theorem:
Theorem 9.79. Assuming an appropriate cycling prevention rule is used, the simplex
algorithm converges in a finite number of iterations to an optimal solution to the linear
programming problem.
Proof. Convergence is guaranteed by the proof of Theorem 9.78 in which we show that
when the lexicographic minimum ratio test is used, then the simplex algorithm will always
converge. Our work above shows that at optimality, the KKT conditions are satisfied because
the termination criteria for the simplex algorithm are precisely the same as the criteria in
the Karush-Kuhn-Tucker conditions. This completes the proof.
9. Duality
Remark 9.80. In this section, we show that to each linear programming problem (the
primal problem) we may associate another linear programming problem (the dual linear
programming problem). These two problems are closely related to each other and an analysis
of the dual problem can provide deep insight into the primal problem.
Consider the linear programming problem
(9.98) P    max cTx
            s.t. Ax ≤ b
                 x ≥ 0
Then the dual problem for Problem P is:

(9.99) D    min wb
            s.t. wA ≥ c
                 w ≥ 0
Remark 9.81. Let v be a vector of surplus variables. Then we can transform Problem
D into standard form as:
(9.100) DS    min wb
              s.t. wA − v = c
                   w ≥ 0
                   v ≥ 0
Thus we already see an intimate relationship between duality and the KKT conditions. The
feasible region of the dual problem (in standard form) is precisely the dual feasibility constraints of the KKT conditions for the primal problem.
In this formulation, we see that we have assigned a dual variable wi (i = 1, . . . , m) to
each constraint in the system of equations Ax ≤ b of the primal problem. Likewise dual
variables v can be thought of as corresponding to the constraints in x ≥ 0.
Remark 9.82. The proof of the following lemma is outside the scope of the class, but it
establishes an important fact about duality.
Lemma 9.83. The dual of the dual problem is the primal problem.
Remark 9.84. Lemma 9.83 shows that the notion of dual and primal can be exchanged
and that it is simply a matter of perspective which problem is the dual problem and which is
the primal problem. Likewise, by transforming problems into canonical form, we can develop
dual problems for any linear programming problem.
The process of developing these formulations can be exceptionally tedious, as it requires
enumeration of all the possible combinations of various linear and variable constraints. The
following table summarizes the process of converting an arbitrary primal problem into its
dual. This table can be found in Chapter 6 of [BJS04].
Example 9.85. Consider the problem of finding the dual problem when the primal
problem is:
max 7x1 + 6x2
s.t. 3x1 + x2 + s1 = 120        (w1)
     x1 + 2x2 + s2 = 160        (w2)
     x1 + s3 = 35               (w3)
     x1, x2, s1, s2, s3 ≥ 0
Here we have placed dual variable names (w1 , w2 and w3 ) next to the constraints to which
they correspond.
The primal problem variables in this case are all positive, so using Table 1 we know that
the constraints of the dual problem will be greater-than-or-equal-to constraints. Likewise, we
know that the dual variables will be unrestricted in sign since the primal problem constraints
are all equality constraints.
9. DUALITY
≥0
≤
≤0
≥
=
UNRESTRICTED
≥
≥0
≤
=
≤0
UNRESTRICTED
VARIABLES
CONSTRAINTS
MAXIMIZATION PROBLEM
CONSTRAINTS
VARIABLES
MINIMIZATION PROBLEM
143
Table 1. Table of Dual Conversions: To create a dual problem, assign a dual
variable to each constraint of the form Ax ◦ b, where ◦ represents a binary relation.
Then use the table to determine the appropriate sign of the inequality in the dual
problem as well as the nature of the dual variables.
The coefficient matrix is:

    [ 3  1  1  0  0 ]
A = [ 1  2  0  1  0 ]
    [ 1  0  0  0  1 ]

Clearly we have:

c = [ 7  6  0  0  0 ]        b = [ 120 ; 160 ; 35 ]

Since w = [w1 w2 w3], we know that wA will be:

wA = [ 3w1 + w2 + w3    w1 + 2w2    w1    w2    w3 ]
This vector will be related to c in the constraints of the dual problem. Remember, in this
case, all variables in the primal problem are greater-than-or-equal-to zero. Thus
we see that the constraints of the dual problem are:

3w1 + w2 + w3 ≥ 7
w1 + 2w2 ≥ 6
w1 ≥ 0
w2 ≥ 0
w3 ≥ 0
We also have the redundant set of constraints that tell us w is unrestricted because the
primal problem had equality constraints. This will always happen in cases when you’ve
introduced slack variables into a problem to put it in standard form. This should be clear
from the definition of the dual problem for a maximization problem in canonical form.
Thus the whole dual problem becomes:

(9.101) min 120w1 + 160w2 + 35w3
        s.t. 3w1 + w2 + w3 ≥ 7
             w1 + 2w2 ≥ 6
             w1 ≥ 0
             w2 ≥ 0
             w3 ≥ 0
             w unrestricted
Again, note that in reality, the constraints we derived from the wA ≥ c part of the dual
problem make the constraints “w unrestricted” redundant, for in fact w ≥ 0 just as we
would expect it to be if we’d found the dual of the Toy Maker problem given in canonical
form.
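We can verify this primal-dual pair numerically. The sketch below is our own illustration using scipy.optimize.linprog (which solves minimization problems, so the primal objective is negated); it solves the toy maker primal and the dual (9.101) and confirms that the optimal objective values agree, as Theorem 9.87 below asserts:

from scipy.optimize import linprog

primal = linprog(c=[-7, -6], A_ub=[[3, 1], [1, 2], [1, 0]],
                 b_ub=[120, 160, 35], bounds=[(0, None)] * 2, method="highs")
# Dual constraints wA >= c rewritten as -wA <= -c for linprog:
dual = linprog(c=[120, 160, 35], A_ub=[[-3, -1, -1], [-1, -2, 0]],
               b_ub=[-7, -6], bounds=[(0, None)] * 3, method="highs")
print(-primal.fun, dual.fun)   # both should print 544.0: cx* = w*b
print(primal.x, dual.x)        # x* = (16, 72), w* = (8/5, 11/5, 0)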
Exercise 71. Identify the dual problem for:
max x1 + x2
s.t. 2x1 + x2 ≥ 4
x1 + 2x2 ≤ 6
x1, x2 ≥ 0
Exercise 72. Use the table or the definition of duality to determine the dual for the
problem:
(9.102) min cx
        s.t. Ax ≥ b
             x ≥ 0
Remark 9.86. The following theorems are outside the scope of this course, but they can
be useful to know and will help cement your understanding of the true nature of duality.
Theorem 9.87 (Strong Duality Theorem). Consider Problem P and Problem D. Then
(Weak Duality): cx∗ ≤ w∗ b, thus every feasible solution to the primal problem
provides a lower bound for the dual and every feasible solution to the dual problem
provides an upper bound to the primal problem.
Furthermore exactly one of the following statements is true:
(1) Both Problem P and Problem D possess optimal solutions x∗ and w∗ respectively
and cx∗ = w∗ b.
(2) Problem P is unbounded and Problem D is infeasible.
(3) Problem D is unbounded and Problem P is infeasible.
(4) Both problems are infeasible.
Theorem 9.88. Problem D has an optimal solution w∗ ∈ Rm if and only if there exist vectors x∗ ∈ Rn and s∗ ∈ Rm and a vector of surplus variables v∗ ∈ Rn such that:

(9.103) Primal Feasibility        w∗A ≥ c
                                  w∗ ≥ 0

(9.104) Dual Feasibility          Ax∗ + s∗ = b
                                  x∗ ≥ 0
                                  s∗ ≥ 0

(9.105) Complementary Slackness   (w∗A − c)x∗ = 0
                                  w∗s∗ = 0

Furthermore, these KKT conditions are equivalent to the KKT conditions for the primal problem.
CHAPTER 10
Feasible Direction Methods and Quadratic Programming
1. Preliminaries
Remark 10.1. In this chapter, we return to the problem we originally discussed in
Chapter 1, namely: Let f : Rn → R; for i = 1, . . . , m, gi : Rn → R; and for j = 1, . . . , l
hj : Rn → R be functions. Then the problem we consider is:
max f(x1, . . . , xn)
s.t. g1(x1, . . . , xn) ≤ b1
     ...
     gm(x1, . . . , xn) ≤ bm
     h1(x1, . . . , xn) = r1
     ...
     hl(x1, . . . , xn) = rl

For simplicity, however, we will consider a minor variation of the problem, namely:

(10.1)  max f(x1, . . . , xn)
        s.t. g1(x1, . . . , xn) ≤ 0
             ...
             gm(x1, . . . , xn) ≤ 0
             h1(x1, . . . , xn) = 0
             ...
             hl(x1, . . . , xn) = 0
It is easy to see that this is an equivalent formulation, since the constants bi and rj can be absorbed into the definitions of g1, . . . , gm and h1, . . . , hl. Unlike in other chapters, we may consider the minimization version of this problem when it is convenient for understanding. When we do so, the reader will be given sufficient warning.
In addition, for the remainder of this chapter (unless otherwise stated), we will suppose that:
(10.2)
X = {x ∈ Rn : g1 (x), . . . , gm (x) ≤ 0, h1 (x) = · · · = hl (x) = 0}
is a closed, non-empty, convex set.
Remark 10.2. We begin our discussion of constrained optimization with a lemma, better known as Weierstrass' Theorem. We will not prove it, as the proof requires a bit more analysis than is required for the rest of the notes. The interested reader may consult a standard text on real analysis.

Lemma 10.3 (Weierstrass' Theorem). Let X be a non-empty closed and bounded set in Rn and let f : X → R be a continuous mapping. Then the optimization problem:
(10.3)  max f(x)
        s.t. x ∈ X
has at least one solution x∗ ∈ X.
Remark 10.4. The proof of the next theorem is a simple modification to the proof of
Theorem 2.25.
Theorem 10.5. Suppose that f : Rn → R is concave and x∗ is a local maximizer of
(10.4) P    max f(x)
            s.t. x ∈ X
then x∗ is a global maximizer.
Proof. Suppose that x+ ∈ X has the property that f (x+ ) > f (x∗ ). For any λ ∈ (0, 1)
we know that:
f (λx∗ + (1 − λ)x+ ) ≥ λf (x∗ ) + (1 − λ)f (x+ )
Moreover, by the convexity of X, λx∗ + (1 − λ)x+ ∈ X. Since x∗ is a local maximum there
is an ǫ > 0 so that for all x ∈ Bǫ (x∗ ) ∩ X, f (x∗ ) ≥ f (x). Choose λ so that λx∗ + (1 − λ)x+
is in Bǫ (x∗ ) ∩ X and let x = λx∗ + (1 − λ)x+ . Let r = f (x+ ) − f (x∗ ). By assumption r > 0.
Then we have:
f (x) ≥ λf (x∗ ) + (1 − λ)(f (x∗ ) + r)
But this implies that:
f (x) ≥ f (x∗ ) + (1 − λ)r
But x ∈ Bǫ (x∗ ) ∩ X by choice of λ, which contradicts our assumption that x∗ is a local
maximum. Thus, x∗ must be a global maximum.
Theorem 10.6. Suppose that f : Rn → R is strictly concave and that x∗ is a global
solution of
(10.5) P    max f(x)
            s.t. x ∈ X
Then x∗ is the unique global maximizer of f .
Exercise 73. Prove Theorem 10.6.
Theorem 10.7. Suppose f (x) is continuously differentiable on X. If x∗ is a local maximum of f (x), then:
(10.6) ∇f(x∗)T(x − x∗) ≤ 0   for all x ∈ X
Furthermore, if f (x) is a concave function, then this is a sufficient condition for a maximum.
Proof. Suppose that Inequality 10.6 does not hold, but x∗ is a local maximum. By the mean value theorem, there is some t ∈ (0, 1) so that:

(10.7) f(x) = f(x∗) + ∇f(x∗ + t(x − x∗))T(x − x∗)

If Inequality 10.6 does not hold for some x ∈ X, then ∇f(x∗)T(x − x∗) > 0. Choose y = x∗ + ǫ(x − x∗) where ǫ > 0 is small enough so that, by continuity,

(10.8) ∇f(x∗ + t(y − x∗))T(y − x∗) > 0

since ∇f(x∗)T(x − x∗) > 0. Note that y is in the same direction as x relative to x∗, ensuring that ∇f(x∗)T(y − x∗) > 0. Thus:

(10.9) f(y) − f(x∗) = ∇f(x∗ + t(y − x∗))T(y − x∗) > 0

and x∗ cannot be a local maximum.
If f(x) is concave, then by Theorem 2.19, we know that:

(10.10) f(x) ≤ f(x∗) + ∇f(x∗)T(x − x∗)

If Inequality 10.6 holds, then for all x ∈ X we know that f(x) ≤ f(x∗). Thus, x∗ must be a (global) maximum; when f(x) is strictly concave, it is the unique global maximum.
Definition 10.8 (Stationary Point). Suppose that x∗ ∈ X satisfies Inequality 10.6.
Then x∗ is a stationary point for the problem of maximizing f (x) subject to the constraints
that x ∈ X.
Remark 10.9. This notion of stationarity takes the place of ∇f (x∗ ) = 0 used in unconstrained optimization. We will use it throughout the rest of this chapter. Not only that, but
the definition of gradient related remains the same as before. See Definition 4.11.
2. Frank-Wolfe Algorithm
Remark 10.10. For the remainder of this chapter, we assume the following Problem P :
(10.11) P    max f(x)
             s.t. x ∈ X
where X is a closed, convex and non-empty set.
Remark 10.11. Our study of Linear Programming in the previous chapter suggests that
when f (x) in Problem P is non-linear, but g1 , . . . , gm and h1 , . . . , hl are all linear, we might be
able to find a simple way of using linear programming to solve this non-linear programming
problem. The simplest approach to this is the Frank-Wolfe algorithm, or reduced gradient
algorithm. This method is (almost) the non-linear programming variation of the gradient
descent method, in that it works well initially, but slows down as the algorithm approaches
an optimal point.
Remark 10.12. The Frank-Wolfe algorithm works by iteratively approximating the objective function as a linear function and then uses linear programming to choose an ascent
direction. Univariate maximization is then used to choose a distance along the ascent direction. In essence, given problem P , at step k, with iterate xk , we solve:
(10.12) L    max ∇f(xk)Tx
             s.t. x ∈ X
150
10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING
Note that this is equivalent to maximizing the first order approximation:
(10.13) f (xk ) + ∇f (xk )T (x − xk )
If x+ is the solution to this problem, we then solve the problem:
(10.14) max_{t ∈ [0,1]} f(xk + t(x+ − xk))

By the convexity of X, we see that restricting t to [0, 1] ensures that xk+1 = xk + t∗(x+ − xk) is in X, where t∗ is the solution to the univariate problem. Clearly when X is a polyhedral set,
Problem L is a linear programming problem. Iteration continues until a stopping criterion
is reached. A simple one is ||xk+1 − xk || < ǫ. A reference implementation for this algorithm
is shown in Algorithm 21.
Example 10.13. We can see an example of the Frank-Wolfe algorithm when we attempt
to solve the problem:
(10.15)  max −(x − 2)^2 − (y − 2)^2
         s.t. x + y ≤ 1
              x, y ≥ 0
In this case, the sequence of points when starting from x0 = 0, y0 = 0 is
(1) x1 = 0, y1 = 1
(2) x2 = 0.5, y2 = 0.5
when we set ǫ = 0.0001. This is illustrated in Figure 10.1(a).
Figure 10.1. (a) The steps of the Frank-Wolfe Algorithm when maximizing −(x −
2)2 −(y−2)2 over the set of (x, y) satisfying the constraints x+y ≤ 1 and x, y ≥ 0. (b)
The steps of the Frank-Wolfe Algorithm when maximizing −7(x − 20)2 − 6(y − 40)2
over the set of (x, y) satisfying the constraints 3x + y ≤ 120, x + 2y ≤ 160, x ≤
35 and x, y ≥ 0. (c) The steps of the Frank-Wolfe Algorithm when maximizing
−7(x − 40)2 − 6(y − 40)2 over the set of (x, y) satisfying the constraints 3x + y ≤ 120,
x + 2y ≤ 160, x ≤ 35 and x, y ≥ 0.
FrankWolfe := proc (F::operator, LINCON::set, initVals::list, epsilon::numeric, maxIter::integer)::list;

  local vars, xnow, xnext, ttemp, FLIN, SOLN, X, vals, i, G, dX, count, p, r, passIn, phi, OUT;

  vars := []; xnow := []; vals := initVals;
  for i to nops(initVals) do
    vars := [op(vars), lhs(initVals[i])];
    xnow := [op(xnow), rhs(initVals[i])]
  end do;

  #Compute the gradient.
  G := Gradient(F(op(vars)), vars);
  OUT := [];

  dX := 1;
  count := 0;

  while epsilon < dX and count < maxIter do
    count := count + 1;

    #Store output.
    OUT := [op(OUT), xnow];

    #A vector of variables, used to create the linear objective.
    X := Vector([seq(vars[i], i = 1 .. nops(vars))]);

    #Create the linear objective.
    FLIN := Transpose(evalf(eval(G, vals))) . X;

    #Solve the linear subproblem and recover the actual solution.
    SOLN := LPSolve(FLIN, LINCON, maximize = true)[2];

    #Compute the direction.
    p := [];
    for i to nops(initVals) do
      p := [op(p), eval(vars[i], SOLN)]
    end do;
    r := Vector(p);
    passIn := convert(Vector(xnow) + s*(r - Vector(xnow)), list);

    #Define the line function.
    phi := proc (t) options operator, arrow;
      evalf(eval(F(op(passIn)), s = t))
    end proc;

    #Golden section search on the step length (defined separately).
    ttemp := GoldenSectionSearch(phi, 0, .39, 1, 0.1e-3);

    xnext := evalf(eval(passIn, [s = ttemp]));
    dX := Norm(Vector(xnext - xnow));

    #Move and update the current values.
    xnow := xnext;
    vals := [];
    for i to nops(vars) do
      vals := [op(vals), vars[i] = xnow[i]]
    end do:
  end do;
  OUT
end proc:

Algorithm 21. Frank-Wolfe Algorithm for finding the maximum of a (concave) function with linear constraints.
By contrast, we can investigate the behavior of the Frank-Wolfe Algorithm when we
attempt to solve the problem:
(10.16)  max −7(x − 20)^2 − 6(y − 40)^2
         s.t. 3x + y ≤ 120
              x + 2y ≤ 160
              x ≤ 35
              x, y ≥ 0
Here the optimal point is in the interior of the constraint set (as opposed to on its boundary) and the Frank-Wolfe algorithm exhibits behavior like Gradient Ascent. Convergence occurs in 73 iterations. This is shown in Figure 10.1(b).
Finally, we can consider the objective function −7(x − 40)^2 − 6(y − 40)^2, with the same constraints as before. In this case, convergence takes over 2000 iterations, but we get very close to an approximate solution quickly. (See Figure 10.1(c).) This illustrates the power of the Frank-Wolfe method. It can be used to find a reasonably good solution quickly. If refinement is needed, then an algorithm with better convergence properties can take over.
Lemma 10.14. Each search direction (x+ −xk ) in the Frank-Wolfe algorithm is an ascent
direction.
Proof. By construction x+ maximizes ∇f(xk)Tx subject to the constraint that x ∈ X.
This implies that for all x ∈ X:
(10.17) ∇f (xk )T x+ ≥ ∇f (xk )T x
Thus:
(10.18) ∇f (xk )T (x+ − x) ≥ 0 ∀x ∈ X
Thus, in particular, ∇f (xk )T (x+ − xk ) ≥ 0 and thus, x+ − xk is an ascent direction.
Definition 10.15. Given f : Rn → R and a convex set X, a feasible direction method is
one in which at each iteration k a direction pk is chosen and xk+1 = xk + δpk is constructed
so that xk+1 ∈ X and for some ǫ > 0 if 0 ≤ δ ≤ ǫ, xk + δpk ∈ X. Thus, pk is a feasible
direction.
Lemma 10.16. Let {xk } be a sequence generated by a feasible direction method (e.g. the
Frank-Wolfe Algorithm). If δ is chosen by limited line search or the Armijo rule, then any
point x∗ to which {xk } converges is a stationary point.
Proof. The proof is identical (mutatis mutandis) to the proof of Theorem 4.12.
Theorem 10.17. Suppose that the Frank-Wolfe algorithm converges. Then it converges
to a stationary point.
Proof. Every direction generated by the Frank-Wolfe algorithm is an ascent direction
and therefore the sequence of directions must be gradient related. By the previous lemma,
the Frank-Wolfe algorithm converges to a stationary point.
3. Farkas’ Lemma and Theorems of the Alternative
Lemma 10.18 (Farkas’ Lemma). Let A ∈ Rm×n and c ∈ Rn be a row vector. Suppose
x ∈ Rn is a column vector and w ∈ Rm is a row vector. Then exactly one of the following
systems of inequalities has a solution:
(1) Ax ≥ 0 and cx < 0 or
(2) wA = c and w ≥ 0
Remark 10.19. Before proceeding to the proof, it is helpful to restate the lemma in the
following way:
(1) If there is a vector x ∈ Rn so that Ax ≥ 0 and cx < 0, then there is no vector
w ∈ Rm so that wA = c and w ≥ 0.
(2) Conversely, if there is a vector w ∈ Rm so that wA = c and w ≥ 0, then there is
no vector x ∈ Rn so that Ax ≥ 0 and cx < 0.
Proof. We can prove Farkas’ Lemma using the fact that a bounded linear programming
problem has an extreme point solution. Suppose that System 1 has a solution x. If System
2 also has a solution w, then
(10.19) wA = c =⇒ wAx = cx.
The fact that System 1 has a solution ensures that cx < 0 and therefore wAx < 0. However,
it also ensures that Ax ≥ 0. The fact that System 2 has a solution implies that w ≥ 0.
Therefore we must conclude that:
(10.20) w ≥ 0 and Ax ≥ 0 =⇒ wAx ≥ 0.
154
10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING
This contradiction implies that if System 1 has a solution, then System 2 cannot have a
solution.
Now, suppose that System 1 has no solution. We will construct a solution for System 2.
If System 1 has no solution, then there is no vector x so that cx < 0 and Ax ≥ 0. Consider
the linear programming problem:
(10.21) PF    min cx
              s.t. Ax ≥ 0
Clearly x = 0 is a feasible solution to this linear programming problem and furthermore is optimal. To see this, note that since there is no x with cx < 0 and Ax ≥ 0, we must have cx ≥ 0 for every feasible x; i.e., 0 is a lower bound for the linear programming problem PF. At x = 0, the objective achieves its lower bound and therefore this must be an optimal solution. Therefore PF is bounded and feasible.
We can convert PF to standard form through the following steps:
(1) Introduce two new vectors y and z with y, z ≥ 0 and write x = y − z (since x is
unrestricted).
(2) Append a vector of surplus variables s to the constraints.
This yields the new problem:
(10.22) PF′   min cy − cz
              s.t. Ay − Az − Im s = 0
                   y, z, s ≥ 0
Applying Theorems 9.53 and 9.57, we see we can obtain an optimal basic feasible solution for Problem PF′ in which the reduced costs of the variables are all non-positive (that is, zj − cj ≤ 0 for j = 1, . . . , 2n + m). Here we have n variables in vector y, n variables in vector z and m
variables in vector s. Let B ∈ Rm×m be the basis matrix at this optimal feasible solution with
basic cost vector cB . Let w = cB B−1 (as it was defined for the revised simplex algorithm).
Consider the columns of the simplex tableau corresponding to a variable xk (in our
original x vector). The variable xk = yk − zk . Thus, these two columns are additive inverses.
That is, the column for yk will be B−1 A·k , while the column for zk will be B−1 (−A·k ) =
−B−1 A·k . Furthermore, the objective function coefficient will be precisely opposite as well.
Thus the fact that zj − cj ≤ 0 for all variables implies that:

wA·k − ck ≤ 0   and   −wA·k + ck ≤ 0

That is, wA·k = ck; and since this holds for all columns of A, we obtain:

(10.23) wA = c
Consider the surplus variable sk . Surplus variables have zero as their coefficient in the
objective function. Further, their simplex tableau column is simply B−1 (−ek ) = −B−1 ek .
The fact that the reduced cost of this variable is non-positive implies that:
(10.24) w(−ek ) − 0 = −wek ≤ 0
Since this holds for all surplus variable columns, we see that −w ≤ 0 which implies w ≥ 0.
Thus, the optimal basic feasible solution to Problem PF′ must yield a vector w that solves
System 2.
Lastly, the fact that if System 2 does not have a solution, then System 1 does, follows by contrapositive from the fact we just proved.
Exercise 74. Suppose we have two statements A and B so that:
A ≡ System 1 has a solution.
B ≡ System 2 has a solution.
Our proof showed explicitly that NOT A =⇒ B. Recall that contrapositive is the logical
rule that asserts that:
(10.25) X =⇒ Y ≡ NOT Y =⇒ NOT X
Use contrapositive to prove explicitly that if System 2 has no solution, then System 1 must
have a solution. [Hint: NOT NOT X ≡ X.]
3.1. Geometry of Farkas' Lemma. Farkas' Lemma has a pleasant geometric interpretation. (Thanks to Akinwale Akinbiyi for pointing out a typo in this discussion.) Consider System 2, namely:

wA = c and w ≥ 0

Geometrically, this states that c is inside the positive cone generated by the rows of A. That is, let w = (w1, . . . , wm). Then we have:

(10.26) c = w1A1· + · · · + wmAm·

and wi ≥ 0 for i = 1, . . . , m. Thus c is a positive combination of the rows of A. This is illustrated in Figure 10.2.

Figure 10.2. System 2 has a solution if (and only if) the vector c is contained inside the positive cone constructed from the rows of A.

On the other hand, suppose System 1 has a solution. Then let y = −x. System 1 states that Ay ≤ 0 and cy > 0. That means that each row of A (as a
vector) must be at a right angle or obtuse to y. (Since Ai· x ≥ 0.) Further, we know that the
vector y must be acute with respect to the vector c. This means that System 1 has a solution
only if the vector c is not in the positive cone of the rows of A or equivalently the intersection
of the open half-space {y : cy > 0} and the set of vectors {y : Ai· y ≤ 0, i = 1, . . . m} is
non-empty. This set is the cone of vectors perpendicular to the rows of A. This is illustrated
in Figure 10.3.

Figure 10.3. System 1 has a solution if (and only if) the vector c is not contained inside the positive cone constructed from the rows of A.
Example 10.20. Consider the matrix:

A = [ 1  0 ]
    [ 0  1 ]

and the vector c = [1 2]. Then clearly, we can see that the vector w = [1 2] will satisfy System 2 of Farkas' Lemma, since w ≥ 0 and wA = c.
Contrast this with c′ = [1 −1]. In this case, we can choose x = [0 1]T. Then Ax = [0 1]T ≥ 0 and c′x = −1. Thus x satisfies System 1 of Farkas' Lemma.
These two facts are illustrated in Figure 10.4. Here, we see that c is inside the positive cone formed by the rows of A, while c′ is not.
Figure 10.4. An example of Farkas' Lemma: The vector c is inside the positive cone formed by the rows of A, but c′ is not.
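One can also test which system holds numerically: deciding System 2 is a linear feasibility problem, and when it fails, Farkas' Lemma guarantees that a solution to System 1 can be found by a second LP. The following Python sketch (our own illustration; scipy.optimize.linprog is assumed available) does exactly this for the matrices of Example 10.20:

import numpy as np
from scipy.optimize import linprog

def farkas(A, c):
    A, c = np.asarray(A, float), np.asarray(c, float)
    m, n = A.shape
    # System 2: find w >= 0 with wA = c, i.e. A^T w = c (as columns).
    res = linprog(np.zeros(m), A_eq=A.T, b_eq=c,
                  bounds=[(0, None)] * m, method="highs")
    if res.status == 0:
        return "System 2", res.x
    # System 1: find x with Ax >= 0 and cx < 0; by homogeneity we may
    # instead impose cx <= -1 (any solution can be rescaled).
    res = linprog(np.zeros(n), A_ub=np.vstack([-A, c]),
                  b_ub=np.concatenate([np.zeros(m), [-1.0]]),
                  bounds=[(None, None)] * n, method="highs")
    return "System 1", res.x

print(farkas([[1, 0], [0, 1]], [1, 2]))    # System 2, w = (1, 2)
print(farkas([[1, 0], [0, 1]], [1, -1]))   # System 1: some x with Ax >= 0, c'x < 0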
Exercise 75. Consider the following matrix:
A = [ 1  0 ]
    [ 1  1 ]

and the vector c = [1 2]. For this matrix and this vector, does System 1 have a solution or
does System 2 have a solution? [Hint: Draw a picture illustrating the positive cone formed
by the rows of A. Draw in c. Is c in the cone or not?]
3.2. Theorems of the Alternative. Farkas’ lemma can be manipulated in many ways
to produce several equivalent statements. The collection of all such theorems are called
Theorems of the Alternative and are used extensively in optimization theory in proving
optimality conditions. We state two that will be useful to us.
Corollary 10.21. Let A ∈ Rm×n and let c ∈ Rn be a row vector. Then exactly one of
the following systems has a solution:
(1) Ax < 0 and cx > 0 or
(2) wA = w0c and [w0, w] ≥ 0, [w0, w] ≠ 0.
Proof. First define:

(10.27) M = [ −c ]
            [  A ]

so that System 1 becomes Mx < 0 for some x ∈ Rn. Now, re-write this new System 1 as:

(10.28) Mx + s1 ≤ 0
(10.29) s > 0

where 1 is a vector of m + 1 ones. If we write:

(10.30) Q = [ M | 1 ]
(10.31) y = [ x ; s ]
(10.32) r = [ 0  0  · · ·  0  1 ]

then System 1 becomes Qy ≤ 0 and ry > 0. Finally, letting z = −y, we see that System 1 becomes Qz ≥ 0 and rz < 0, which is System 1 of Farkas' Lemma. Thus, if System 1 has no solution, then there is a solution v to:

(10.33) vQ = r
        v ≥ 0

This implies that:

(10.34) vM = 0
(10.35) v1 = 1
(10.36) v ≥ 0

Suppose that v = [w0, w1, . . . , wm] and w = [w1, . . . , wm]. Then:

(10.37) [ w0  w ] [ −c ] = 0
                  [  A ]

But this implies that:

(10.38) −w0c + wA = 0

Therefore wA = w0c; and v1 = 1 and v ≥ 0 imply that [w0, w] ≥ 0, [w0, w] ≠ 0.
Exercise 76. Prove the following corollary to Farkas’ Lemma:
158
10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING
Corollary 10.22. Let A ∈ Rm×n and c ∈ Rn be a row vector. Suppose d ∈ Rn is a
column vector and w ∈ Rm is a row vector and v ∈ Rn is a row vector. Then exactly one of
the following systems of inequalities has a solution:
(1) Ad ≤ 0, d ≥ 0 and cd > 0 or
(2) wA − v = c and w, v ≥ 0
[Hint: Write System 2 from this corollary as wA − In v = c and then re-write the system
with an augmented vector [w v] with an appropriate augmented matrix. Let M be the
augmented matrix you identified. Now write System 1 from Farkas’ Lemma using M and x.
Let d = −x and expand System 1 until you obtain System 1 for this problem.]
4. Preliminary Results: Feasible Directions, Improving Directions
Remark 10.23. In the previous chapter, we saw the Karush-Kuhn-Tucker (KKT) necessary and sufficient conditions for optimality for linear programming problems. In this
chapter, we generalize these conditions to the case of non-linear programming problems.
Remark 10.24. We return to the study of Problem P :
P    max f(x)
     s.t. x ∈ X
where X is a set with non-empty interior that is, ideally, closed and convex.
Definition 10.25 (Cone of Feasible Directions). For Problem P , the cone of feasible
directions at x ∈ X is the set:
(10.39) D(x) = {p ∈ Rn : p ≠ 0 and there is some δ > 0 so that x + λp ∈ X for all λ ∈ (0, δ)}
Definition 10.26 (Cone of Improving Directions). For Problem P , the cone of improving
directions at x ∈ X is the set:
(10.40) F(x) = {p ∈ Rn : there is some δ > 0 so that f(x + λp) > f(x) for all λ ∈ (0, δ)}
Proposition 10.27. Let x ∈ Rn . If p ∈ Rn and ∇f (x)T p > 0, then p ∈ F (x). Thus:
(10.41) F0 (x) = {p ∈ Rn : ∇f (x)T p > 0} ⊆ F (x)
Exercise 77. Prove Proposition 10.27.
Lemma 10.28. For Problem P , suppose that x∗ ∈ X is a local maximum. Then F0 (x∗ ) ∩
D(x∗ ) = ∅.
Proof. Suppose, by way of contradiction, that this is not the case. Choose p ∈ F0(x∗) ∩ D(x∗). Let x = x∗ + λp for some appropriately chosen λ > 0 so that x ∈ X. Such a λ exists because p ∈ D(x∗). Then x − x∗ = λp and
(10.42) ∇f (x∗ )T (x − x∗ ) = λ∇f (x∗ )T p > 0
because p ∈ F0 (x∗ ). This contradicts Theorem 10.7 and thus x∗ cannot have been a local
maximum.
Proposition 10.29. If f : Rn → R is concave and F0 (x∗ ) ∩ D(x∗ ) = ∅, then x∗ is a
local maximum of f on X.
Exercise 78. Using the sufficiency argument in the proof of Theorem 10.7, prove Proposition 10.29.
Remark 10.30. It is relatively straight-forward to prove that when f is concave, then
F0 (x∗ ) = F (x∗ ) and when f is strictly concave, then:
(10.43) F(x∗) = {p ∈ Rn : p ≠ 0, ∇f(x∗)Tp ≥ 0}
Lemma 10.31. Consider a simplified form of Problem P , called Problem P ′ :
(10.44) P′   max f(x)
             s.t. gi(x) ≤ 0   i = 1, . . . , m

Here, f : Rn → R and gi : Rn → R are continuously differentiable at a point x0. Denote the set I = {i ∈ {1, . . . , m} : gi(x0) = 0}. Then:

(10.45) G0(x0) = {p ∈ Rn : ∇gi(x0)Tp < 0, ∀i ∈ I} ⊆ D(x0)
Proof. Suppose p ∈ G0 (x0 ) and define x = x0 + λp. For all j 6∈ I, we know that
gj (x0 ) < 0. By the continuity of gj , for all ǫ > 0, there exists a δ > 0 so that if ||x0 − x|| < δ,
then |gj (x0 ) − gj (x)| < ǫ and thus for some choice of λ > 0 we can ensure that gj (x) < 0.
Now since ∇gi (x0 )T p < 0, this is a descent direction of gi for i ∈ I. Thus (by a variation
of Proposition 10.27), there is some λ′ > 0 so that gi (x0 + λ′ p) < gi (x0 ) = 0. Thus, if we
choose t = min{λ, λ′ }, then gi (x0 + tp) ≤ 0 for i = 1, . . . , m and thus x0 + tp ∈ X. Thus it
follows that p ∈ D(x0).
Lemma 10.32. Consider Problem P ′ as in Lemma 10.31 and suppose that gi (i ∈ I) are
strictly convex (as defined in Lemma 10.31). Then G0 (x0 ) = D(x0 ).
Proof. Suppose that p ∈ D(x0) and suppose p ∉ G0(x0). Then ∇gi(x0)Tp ≥ 0 for some i ∈ I. Thus,
p is not a descent direction. Suppose that x = x0 + λp and define h = x − x0 . Then by
strict convexity (and Exercises 16 and 17), for all λ > 0 we know:
(10.46) gi (x0 + λh) − gi (x0 ) > ∇gi (x0 )T h = λ∇gi (x0 )T p ≥ 0
Thus, there is no λ > 0 so that x = x0 + λp ∈ X and p 6∈ D(x0 ).
Theorem 10.33. Consider Problem P′, where f : Rn → R and gi : Rn → R are continuously differentiable at a local maximum x∗. Denote the set I = {i ∈ {1, . . . , m} : gi(x∗) = 0}. Then F0(x∗) ∩ G0(x∗) = ∅. Conversely, if gi (i = 1, . . . , m) are strictly convex and f is concave and F0(x∗) ∩ G0(x∗) = ∅, then x∗ is a local maximum.
Proof. Suppose that x∗ is a local maximum. Then we know from Lemma 10.28 that
F0 (x∗ ) ∩ D(x∗ ) = ∅. Additionally from Lemma 10.31 we know that G0 (x∗ ) ⊆ D(x∗ ). Thus
it follows that F0 (x∗ ) ∩ G0 (x∗ ) = ∅.
Now suppose that gi (i = 1, . . . , m) are strictly convex and f is concave and F0 (x∗ ) ∩
G0 (x∗ ) = ∅. From Lemma 10.32, we know that D(x∗ ) = G0 (x∗ ) and thus we know at once
that F0 (x∗ ) ∩ D(x∗ ) = ∅. It follows from Proposition 10.29 that x∗ is a local maximum over
X.
Remark 10.34. All of these theorems can be generalized substantially, to include functions not defined over all Rn , functions that are only locally convex or concave or functions
that exhibit generalized concavity/convexity properties. Full details can be found in Chapter
4 of [BSS06], on which most of this section is based.
5. Fritz-John and Karush-Kuhn-Tucker Theorems
Theorem 10.35 (Fritz-John Theorem). Consider Problem P′, where f : Rn → R and gi : Rn → R are continuously differentiable at a local maximum x∗. Denote the set I = {i ∈ {1, . . . , m} : gi(x∗) = 0}. Then there exist scalars u0, u1, . . . , um so that:

(10.47) u0∇f(x∗) − ∑_{i=1}^{m} ui∇gi(x∗) = 0
        uigi(x∗) = 0   i = 1, . . . , m
        u0, ui ≥ 0   i = 1, . . . , m
        [u0, u1, . . . , um]T ≠ 0
Proof. Let c = ∇f(x∗)T and let A be the matrix whose rows are formed from ∇gi(x∗)T (i ∈ I). From Theorem 10.33 we know that F0(x∗) ∩ G0(x∗) = ∅. This implies that there is no p satisfying cp > 0 and Ap < 0. It then follows from Corollary 10.21 that there is a scalar u0 and a vector u so that uA = u0c and [u0, u] ≥ 0, [u0, u] ≠ 0. Thus we see that:

(10.48) uA = u0c  =⇒  u0∇f(x∗) − ∑_{i∈I} ui∇gi(x∗) = 0

Now let uj = 0 for j ∉ I. Then we see Equation 10.48 is equivalent to:

(10.49) u0∇f(x∗) − ∑_{i=1}^{m} ui∇gi(x∗) = 0

The fact that u0, ui ≥ 0 and [u0, u1, . . . , um]T ≠ 0 is also immediate from Corollary 10.21. By definition of the ui, it is easy to see that if i ∈ I, then gi(x∗) = 0 and thus uigi(x∗) = 0. On the other hand, if i ∉ I, then ui = 0 and thus uigi(x∗) = 0. Thus we have shown that Expression 10.47 must hold.
Theorem 10.36 (Karush-Kuhn-Tucker Necessary Conditions). Consider Problem P′, where f : Rn → R and gi : Rn → R are continuously differentiable at a local maximum x∗. Denote the set I = {i ∈ {1, . . . , m} : gi(x∗) = 0}. Suppose that the set of vectors ∇gi(x∗) (i ∈ I) is linearly independent. Then there exist scalars λ1, . . . , λm not all zero so that:

(10.50) ∇f(x∗) − ∑_{i=1}^{m} λi∇gi(x∗) = 0
        λigi(x∗) = 0   i = 1, . . . , m
        λi ≥ 0   i = 1, . . . , m
Proof. Applying the Fritz-John Theorem, we know there are constants u0, u1, . . . , um
so that:

(10.51)    u0∇f(x∗) − Σ_{i=1}^m ui∇gi(x∗) = 0
           ui gi(x∗) = 0,  i = 1, . . . , m
           u0, ui ≥ 0,  i = 1, . . . , m
           [u0, u1, . . . , um]T ≠ 0

Suppose that u0 = 0. Then:

(10.52)    Σ_{i=1}^m ui∇gi(x∗) = 0
(10.53)    [u1, . . . , um]T ≠ 0

This is not possible by our assumption that the ∇gi(x∗) (i ∈ I) are linearly independent and
that uj = 0 for j ∉ I. Thus u0 > 0. Therefore, let λi = ui/u0. The result follows at once by
dividing:

    u0∇f(x∗) − Σ_{i=1}^m ui∇gi(x∗) = 0

by u0. This completes the proof.
Remark 10.37. The assumption that the set of vectors ∇gi(x∗) (i ∈ I) is linearly
independent is called a constraint qualification. There are several alternative weaker (and
stronger) constraint qualifications. These are covered in depth in a course on convex optimization or the theory of nonlinear programming.
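To see the conditions of Theorem 10.36 in action, consider a small illustrative instance
of Problem P′ (constructed here for this purpose, not drawn from any earlier example):
maximize f(x) = −(x − 2)2 subject to g1(x) = x − 1 ≤ 0. The maximum occurs at x∗ = 1,
so I = {1} and ∇g1(x∗) = 1 ≠ 0, which gives the required linear independence. Since
∇f(x∗) = −2(x∗ − 2) = 2, setting λ1 = 2 yields:

    ∇f(x∗) − λ1∇g1(x∗) = 2 − 2 · 1 = 0,  λ1 g1(x∗) = 2 · 0 = 0,  λ1 ≥ 0

exactly as Expression 10.50 requires.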
Theorem 10.38 (Fritz-John Sufficient Theorem). Consider Problem P′, where f : Rn →
R is locally concave about x∗ and gi : Rn → R are locally strictly convex about x∗. If there
are scalars u0, u1, . . . , um so that:

(10.54)    u0∇f(x∗) − Σ_{i=1}^m ui∇gi(x∗) = 0
           ui gi(x∗) = 0,  i = 1, . . . , m
           u0, ui ≥ 0,  i = 1, . . . , m
           [u0, u1, . . . , um]T ≠ 0

then x∗ is a local maximum of Problem P′. Moreover, if the concavity and convexity are
global, then x∗ is a global maximum.
Proof. Applying the same reasoning as in the proof of the Fritz-John theorem, we have
that F0(x∗) ∩ G0(x∗) = ∅. Suppose we restrict our attention only to the set Bε(x∗) in which
f is concave and gi (i = 1, . . . , m) are strictly convex. In doing so, we can (if necessary)
redefine f and gi (i = 1, . . . , m) outside this ball so that they are globally concave and
strictly convex as needed. Then we may apply Theorem 10.33 to see that x∗ is a local
maximum.
Now suppose that global concavity and strict convexity hold. From Lemma 10.32, we
know that D(x∗) = G0(x∗) and thus we know that F0(x∗) ∩ D(x∗) = ∅. Consider a relaxation
of the problem in which only the constraints indexed in I are kept. Then we still have
F0(x∗) ∩ D(x∗) = ∅ in our relaxed problem. For any feasible solution x in the relaxed
feasible set we know that for each i ∈ I and all λ ∈ (0, 1):

(10.55)    gi(x∗ + λ(x − x∗)) < (1 − λ)gi(x∗) + λgi(x) = λgi(x) ≤ 0

Thus p = x − x∗ ∈ D(x∗); that is, (x − x∗) is a feasible direction. The fact that F0(x∗) ∩
D(x∗) = ∅ means that ∇f(x∗)T p ≤ 0 for every such p; that is, for all x in the (new) feasible
region we have ∇f(x∗)T(x − x∗) ≤ 0. Thus, by Theorem 10.5, x∗ is a global maximum, since
the new feasible region must contain the original feasible region X.
Theorem 10.39. Consider Problem P′, where f : Rn → R is locally concave about x∗
and gi : Rn → R are locally strictly convex about x∗. Denote the set I = {i ∈ {1, . . . , m} :
gi(x∗) = 0}. Suppose that the set of vectors ∇gi(x∗) (i ∈ I) is linearly independent. If
there exist scalars λ1, . . . , λm not all zero so that:

(10.56)    ∇f(x∗) − Σ_{i=1}^m λi∇gi(x∗) = 0
           λi gi(x∗) = 0,  i = 1, . . . , m
           λi ≥ 0,  i = 1, . . . , m

then x∗ is a local maximum of Problem P′. Moreover, if the concavity and convexity are
global, then x∗ is a global maximum.
Exercise 79. Use the sufficiency condition of the Fritz-John Theorem to prove the
previous sufficient condition.
Remark 10.40. We can extend both these theorems to our general problem P by rewriting each constraint hj(x) = 0 (j = 1, . . . , l) as the pair of constraints hj(x) ≤ 0 and
−hj(x) ≤ 0. For the sufficiency to hold, we require hj(x) to be affine, as otherwise hj(x)
and −hj(x) (j = 1, . . . , l) could not both be convex. These constraints yield a
pair of multipliers ρj ≥ 0 and νj ≥ 0 (j = 1, . . . , l) which satisfy:

(10.57)    ∇f(x∗) − Σ_{i=1}^m λi∇gi(x∗) − Σ_{j=1}^l (ρj − νj)∇hj(x∗) = 0

as well as:

           ρj hj(x∗) = 0,  j = 1, . . . , l
           −νj hj(x∗) = 0,  j = 1, . . . , l

It is now a simple matter to define µj = ρj − νj and note that it is a value that is
unrestricted in sign (since ρj ≥ 0 and νj ≥ 0). We also note that hj(x∗) = 0 (j = 1, . . . , l)
and thus ρj hj(x∗) = 0 and νj hj(x∗) = 0. The following theorems follow at once.
Theorem 10.41 (Karush-Kuhn-Tucker Necessary Conditions). Consider Problem P,
where f : Rn → R, gi : Rn → R and hj : Rn → R are continuously differentiable at a
local maximum x∗. Denote the set I = {i ∈ {1, . . . , m} : gi(x∗) = 0}. Suppose that the set of
vectors ∇gi(x∗) (i ∈ I) and ∇hj(x∗) (j = 1, . . . , l) is linearly independent. Then there exist
scalars λ1, . . . , λm and µ1, . . . , µl so that:

(10.58)    ∇f(x∗) − Σ_{i=1}^m λi∇gi(x∗) − Σ_{j=1}^l µj∇hj(x∗) = 0
           λi gi(x∗) = 0,  i = 1, . . . , m
           λi ≥ 0,  i = 1, . . . , m
           µj unrestricted,  j = 1, . . . , l
Theorem 10.42. Consider Problem P, where f : Rn → R is locally concave about x∗,
gi : Rn → R are locally strictly convex about x∗ and hj : Rn → R are locally affine
about x∗. Denote the set I = {i ∈ {1, . . . , m} : gi(x∗) = 0}. Suppose that the set of vectors
∇gi(x∗) (i ∈ I) and ∇hj(x∗) (j = 1, . . . , l) is linearly independent. If there exist scalars λ1, . . . , λm and
µ1, . . . , µl not all zero so that:

(10.59)    ∇f(x∗) − Σ_{i=1}^m λi∇gi(x∗) − Σ_{j=1}^l µj∇hj(x∗) = 0
           λi gi(x∗) = 0,  i = 1, . . . , m
           λi ≥ 0,  i = 1, . . . , m
           µj unrestricted,  j = 1, . . . , l

then x∗ is a local maximum of Problem P. Moreover, if the concavity and convexity are
global and hj (j = 1, . . . , l) are globally affine, then x∗ is a global maximum.
6. Quadratic Programming and Active Set Methods
Lemma 10.43. Consider the constrained optimization problem:

(10.60)    Q0:  max cT x + ½xT Qx
                s.t. Ax = b

where A ∈ Rm×n, c ∈ Rn×1 and Q ∈ Rn×n is symmetric. Then the KKT conditions for
Problem Q0 are:

(10.61)    Qx − AT λ = −c
(10.62)    Ax = b
Exercise 80. Prove Lemma 10.43.
Lemma 10.44. Suppose that A ∈ Rm×n has full row rank. Further, suppose that for all
x ∈ Rn such that Ax = 0 and x ≠ 0, we have xT Qx < 0. Then:

(10.63)    M = [ Q  −AT ]
               [ A   0  ]

is non-singular.
Proof. Suppose that we have a solution pair (x, λ) so that:

(10.64)    [ Q  −AT ] [ x ]  =  0
           [ A   0  ] [ λ ]

From this we conclude that:

(10.65)    Qx − AT λ = 0
(10.66)    Ax = 0

Thus, x is in the null space of A. From Equation 10.65 we deduce:

(10.67)    xT Qx − xT AT λ = 0
But xT AT = 0T and therefore xT Qx = 0. By our assumption, this can only be true if x = 0.
Given this, we see that:

(10.68)    −AT λ = 0

This implies that λT A = 0T. Note that λT A is a linear combination of the rows of A,
which are linearly independent (since A has full row rank). Thus for Equation
10.68 to hold, we must have λ = 0. Thus, the null space of M consists only of the zero vector
and M is non-singular.
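The lemma is easy to sanity-check numerically. The following Maple fragment (a sketch;
the data Q and A here are hypothetical, chosen so that Q is negative definite and A has
full row rank) assembles M as in Equation 10.63 and verifies that it is non-singular:

  with(LinearAlgebra):
  Q := Matrix([[-1, 0], [0, -1]]): #Negative definite, so x^T Q x < 0 for all x <> 0.
  A := Matrix([[1, 1]]):           #A single row; trivially full row rank.
  M := <<Q | -Transpose(A)>; <A | ZeroMatrix(1, 1)>>: #The matrix of Equation 10.63.
  Determinant(M); #Returns -2, which is non-zero, so M is non-singular.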
Lemma 10.45. Suppose the conditions of Lemma 10.44 are satisfied. Then Problem
Q0 has a unique solution.
Proof. It is easy to see that any pair (x, λ) satisfying:

(10.69)    [ Q  −AT ] [ x ]  =  [ −c ]
           [ A   0  ] [ λ ]     [  b ]

is a KKT point. Suppose that (x∗, λ∗) is a solution to Equation 10.69 and suppose that
z(x+) ≥ z(x∗) where:

(10.70)    z(x) = cT x + ½xT Qx

Define p = x+ − x∗. Then Ap = Ax+ − Ax∗ = b − b = 0 and thus p is in the null space
of A. Consider that:

(10.71)    z(x+) = z(x∗ + p) = cT(x∗ + p) + ½(x∗ + p)T Q(x∗ + p)
                 = z(x∗) + cT p + ½pT Qp + ½(x∗T Qp + pT Qx∗)

We've assumed Q is symmetric and thus:

(10.72)    ½(x∗T Qp + pT Qx∗) = pT Qx∗

By the necessity of the KKT conditions, we know that Qx∗ = −c + AT λ∗. From this we
conclude that:

    pT Qx∗ = −pT c + pT AT λ∗

We can substitute this quantity back into Equation 10.71 to obtain:

(10.73)    z(x+) = z(x∗) + cT p + ½pT Qp − pT c + pT AT λ∗

We've already observed that p is in the null space of A and thus pT AT = 0T. Likewise,
cT p = pT c and thus:

(10.74)    z(x+) = z(x∗) + ½pT Qp

Finally, by our assumption on Q, the fact that Ap = 0 implies that pT Qp < 0 whenever
p ≠ 0. Thus, if p ≠ 0, then z(x+) < z(x∗), contradicting our assumption; hence x+ = x∗
and x∗ is the unique global optimal solution to Problem Q0.
Remark 10.46. Maple code for solving equality constrained quadratic programming
problems with concave objective functions is shown in Algorithm 23. We also include a
helper function (Algorithm 22) to determine whether matrices or vectors are empty in Maple.
IsEmpty := overload([
  proc (A::Matrix)::boolean;
    option overload;
    uses LinearAlgebra:
    if evalb(RowDimension(A) = 0 or ColumnDimension(A) = 0) then
      true:
    else
      false:
    end if:
  end proc,

  proc (V::Vector)::boolean;
    option overload;
    uses LinearAlgebra:
    if evalb(Dimension(V) = 0) then
      true:
    else
      false:
    end if
  end proc
]):

Algorithm 22. An algorithm to determine whether a matrix is empty in Maple.
Notice this is an overloaded method, allowing us to pass in Vectors or Matrices.
Remark 10.47. We can use the previous lemmas to develop an algorithm for solving
convex quadratic programming problems. Before proceeding, we state the following theorem,
which we do not prove. See [Sah74].
Theorem 10.48. Let
(1) Q ∈ Rn×n,
(2) A ∈ Rm×n,
(3) H ∈ Rl×n,
(4) b ∈ Rm×1,
(5) r ∈ Rl×1 and
(6) c ∈ Rn×1.
Then finding a solution for the problem:

(10.75)    max ½xT Qx + cT x
           s.t. Ax ≤ b
                Hx = r

is NP-complete for arbitrary parameters.
Remark 10.49. A general algorithm for solving arbitrary quadratic programming problems is given in [BSS06]. It is, essentially, a branch-and-bound style algorithm befitting the
fact that the problem is NP-complete.
SolveEQQP := proc (c::Vector, Q::Matrix, A::Matrix, b::Vector)::list;
  uses LinearAlgebra:
  local M, r, n, m, y;
  n := RowDimension(Q);
  m := RowDimension(A);
  assert(n = ColumnDimension(A));
  assert(n = Dimension(c));
  assert(m = Dimension(b));

  if IsEmpty(A) then
    M := Q:
    r := -c:
  else
    #Construct some augmented matrices and vectors.
    M := <<Q | -Transpose(A)>; <A | ZeroMatrix(m, m)>>:
    r := <-c ; b>
  end if;
  #More efficient than inverting.
  y := LinearSolve(M, r);
  if IsEmpty(A) then
    #Return the solution; there are no Lagrange multipliers.
    [y, []]:
  else
    #Return the solution and the Lagrange multipliers in a list.
    [SubVector(y, [1 .. n]), SubVector(y, [n+1 .. n+m])]
  end if:
end proc:

Algorithm 23. An algorithm to solve equality constrained quadratic programming
problems, under the foregoing assumptions. Note we do not test for full row
rank.
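As a quick illustration, the following session fragment (a usage sketch with hypothetical
data, assuming Algorithms 22 and 23 have been executed) solves the problem
max x1 + x2 − ½(x1² + x2²) subject to x1 + x2 = 1:

  with(LinearAlgebra):
  Q := Matrix([[-1, 0], [0, -1]]):
  c := Vector([1, 1]):
  A := Matrix([[1, 1]]): #The single equality constraint x1 + x2 = 1.
  b := Vector([1]):
  SolveEQQP(c, Q, A, b); #Returns [x, lambda] with x = [1/2, 1/2] and lambda = [1/2].

The returned pair satisfies Equation 10.61, since Qx − AT λ = [−1, −1]T = −c.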
Remark 10.50. We now consider an active set method for solving a convex quadratic
programming problem, which will be used in the next chapter when developing sequential
quadratic programming.
Remark 10.51. The algorithm assumes a problem with the following structure:

(10.76)    QP:  max ½xT Qx + cT x
                s.t. aiT x ≤ bi,  i ∈ I
                     aiT x = bi,  i ∈ E

Obviously, this is identical to the formulation of the problem in (10.75) except we have index
sets E and I for the equality and inequality constraints, respectively. Furthermore, we will
assume that Q is negative definite.
Lemma 10.52. Suppose that xk is a feasible solution to QP. Let

    Wk = {i ∈ I : aiT xk = bi} ∪ E

Consider the problem:

(10.77)    QP(xk):  max pT(c + Qxk) + ½pT Qp
                    s.t. aiT p = 0,  i ∈ Wk

If pk solves this problem, then xk + pk solves:

(10.78)    max cT x + ½xT Qx
           s.t. aiT x = bi,  i ∈ Wk
Proof. By Lemmas 10.44 and 10.45 and our assumption on Q, there is a unique
solution to each of the problems given in the statement of the theorem. Suppose pk is the
solution to the first problem. Since aiT pk = 0 for all i ∈ Wk and aiT xk = bi for all i ∈ Wk,
it follows that aiT(xk + pk) = bi for all i ∈ Wk and thus xk + pk is feasible to
the second problem in the statement of the theorem. Note that:

(10.79)    cT(xk + pk) + ½(xk + pk)T Q(xk + pk) = z(xk) + cT pk + ½pkT Qpk + pkT Qxk

just as in Equation 10.71, where again z(x) = cT x + ½xT Qx. Thus:

(10.80)    cT(xk + pk) + ½(xk + pk)T Q(xk + pk) = z(xk) + pkT(c + Qxk) + ½pkT Qpk

Since pk was chosen to maximize pkT(c + Qxk) + ½pkT Qpk subject to the constraints that
aiT pk = 0 for all i ∈ Wk, it follows that xk + pk must be the unique solution to the second
problem in the statement of the theorem; otherwise, we could have chosen a different
p′k satisfying aiT(xk + p′k) = bi for all i ∈ Wk with xk + p′k producing a larger objective
function value, and p′k would then have been the maximal solution to the first problem. This
completes the proof.
Remark 10.53. Note that the sub-problem:

(10.81)    QP(xk):  max pT(c + Qxk) + ½pT Qp
                    s.t. aiT p = 0,  i ∈ Wk

can be solved using simple linear algebra, as illustrated in Lemma 10.45, to obtain not just
pk but also a set of Lagrange multipliers.
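For instance, using SolveEQQP from Algorithm 23, the sub-problem can be solved in a
few lines of Maple (a sketch; Abind, a hypothetical matrix whose rows are aiT for i ∈ Wk,
and the current iterate xk are assumed to exist already, along with Q and c):

  with(LinearAlgebra):
  cc := c + Q . xk:                     #The linear term of the sub-problem objective.
  r := ZeroVector(RowDimension(Abind)): #Right hand side: a_i^T p = 0 for i in W_k.
  SOLN := SolveEQQP(cc, Q, Abind, r):
  pk := SOLN[1];     #The search direction.
  lambda := SOLN[2]; #The Lagrange multipliers used in Lemma 10.54.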
Lemma 10.54. Suppose that xk is a feasible solution to QP. Let

    Ik = {i ∈ I : aiT xk = bi}

so that (as in Lemma 10.52) Wk = Ik ∪ E. Let λ be the vector of Lagrange multipliers obtained for
the problem:

    QP(xk):  max pT(c + Qxk) + ½pT Qp
             s.t. aiT p = 0,  i ∈ Wk

when finding the optimal solution pk. If pk = 0 and λi ≥ 0 for all i ∈ Ik, then xk is the
optimal solution to QP.
Exercise 81. Prove Lemma 10.54. [Hint: Clearly, λ can act as Lagrange multipliers for
all the constraints indexed in Wk. Define the remaining Lagrange multipliers for constraints
with index not in Wk to be zero and argue that the KKT conditions for the whole problem
are satisfied because they are satisfied for the problem max cT x + ½xT Qx such that aiT x = bi
for all i ∈ Wk.]
Lemma 10.55. Suppose that pk ≠ 0. If xk + pk is feasible to QP, set xk+1 = xk + pk;
otherwise, using a minimum ratio test, set:

(10.82)    αk = min{ (bq − aqT xk) / (aqT pk) : q ∈ I \ Ik, aqT pk > 0 }

and define xk+1 = xk + αk pk. Then xk+1 is feasible to QP and z(xk+1) > z(xk); in the latter
case, Ik+1 = Ik ∪ {q∗}, where q∗ is the index identified in the minimum ratio test of
Expression 10.82.
Proof. In the case when xk+1 = xk + pk is feasible to QP, the result follows immediately
from Lemma 10.52. Suppose, therefore, that xk + pk is not feasible to QP.
To prove feasibility, we observe that aiT xk+1 = bi for all i ∈ Wk by the proof of Lemma
10.52. Thus, it suffices to ensure that:

(10.83)    aiT(xk + αk pk) ≤ bi,  i ∈ I \ Ik

The previous inequality implies that for all i ∈ I \ Ik with aiT pk > 0 we must have:

(10.84)    αk ≤ (bi − aiT xk) / (aiT pk)

Clearly Expression 10.83 can be made true by choosing αk with the minimum ratio test given
in Expression 10.82, since if aiT pk ≤ 0 and aiT xk < bi, then aiT xk+1 < bi as well.
To see that z(xk+1) > z(xk), recall we have assumed that Q is negative definite. Thus, the
objective function is strictly concave. We know from Lemma 10.52 that z(xk + pk) > z(xk)
(in particular, since pk ≠ 0). Let yk = xk + pk. Then there is some λ ∈ (0, 1) so that
λxk + (1 − λ)yk = xk + αk pk. From this we deduce that:

(10.85)    z(xk+1) = z(xk + αk pk) = z(λxk + (1 − λ)yk) > λz(xk) + (1 − λ)z(yk)
                   > λz(xk) + (1 − λ)z(xk) = z(xk)

This completes the proof.
Remark 10.56. We are now in a position to sketch the active set algorithm for solving
convex quadratic programming problems.
(1) Initialize with a feasible point x0, derived possibly from a Phase I problem on the
linear constraints. Set k = 0. Compute Ik and Wk.
(2) Solve QP(xk) to obtain pk and λk.
(3) If pk = 0 and λki ≥ 0 for all i ∈ Ik, then stop; xk is optimal.
(4) If pk = 0 and for some i ∈ Ik, λki < 0, then set i∗ = arg min{λki : i ∈ Ik}. Remove
i∗ from Ik to obtain Ik+1 and set Wk+1 = Wk \ {i∗}. Set xk+1 = xk. Set k = k + 1.
Goto (2).
(5) If pk ≠ 0 and xk + pk is feasible, set xk+1 = xk + pk. Compute Ik+1 and Wk+1. Set
k = k + 1. Goto (2).
(6) If pk ≠ 0 and xk + pk is not feasible, then compute:

    αk = min{ (bq − aqT xk) / (aqT pk) : q ∈ I \ Ik, aqT pk > 0 }

Set xk+1 = xk + αk pk. Compute Ik+1 and Wk+1. Set k = k + 1. Goto (2). Note,
Ik+1 and Wk+1 can be computed easily as a part of the minimum ratio test.
A reference implementation for the algorithm in Maple is shown in Algorithm 27.
Theorem 10.57. The algorithm described in Remark 10.56 and illustrated in Algorithm
27 converges in a finite number of steps to the optimal solution of Problem 10.76, provided
that Q is negative definite (or that z(x) = ½xT Qx + cT x is strictly concave).
Sketch of Proof. It suffices to show that a working (active) set Wk is never repeated.
To see this, note that if pk = 0 and we have not achieved optimality, then a binding constraint
is removed from Wk to obtain Wk+1. This process continues until pk ≠ 0, at which point,
by Lemma 10.55, the objective function increases. Since the objective never decreases,
and whenever it fails to increase we force a change in Wk, it is easy to see we
can never repeat a working set. Since there are only a finite number of working sets,
the algorithm must converge after a finite number of iterations to a solution
with λki ≥ 0 for all i ∈ Ik. By Lemmas 10.52 and 10.54 it follows this must be an optimal
solution for Problem 10.76.
Remark 10.58. The “proof” for the convergence of the active set method given in
[NW06] is slightly more complete than the sketch of the proof given above, though the
major elements of the proof are identical.
Remark 10.59. The point of introducing this specialized quadratic programming algorithm is that it can be used very efficiently in implementing sequential quadratic programming (SQP): through judicious use of the BFGS method, we can ensure that we
always have a convex quadratic sub-problem. The SQP method is a generally convergent
approach for solving constrained non-linear programming problems that enjoys substantial
commercial success. (Matlab's fmincon and Maple's Minimize functions both use it.) We
introduce this technique in the next chapter.
Remark 10.60. We end this chapter by providing a Maple implementation of the active
set method for solving convex quadratic programming problems. This algorithm is long and
requires sub-routines for finding an initial point, determining the index of the minimum
value in a vector, and determining the (initial) set of binding constraints I0. The actual
routine for solving quadratic programs is shown in Algorithm 27.
Example 10.61. Consider the following problem:

(10.86)    max −½(x − 1)² − ½(y − 1)²
           s.t. x + y ≤ 1
                2x + y ≥ 1
                x, y ≥ 1/4
MinimumIndex := proc (V::Vector)::integer;
  uses LinearAlgebra:
  local leastValue, index, i;
  leastValue := Float(infinity);
  for i to Dimension(V) do
    if evalb(V[i] < leastValue) then
      leastValue := V[i];
      index := i
    end if:
  end do;
  index:
end proc:

Algorithm 24. An algorithm to determine the index of the minimum value in a
Vector in Maple.
FindInitialSolution := proc (A::Matrix, b::Vector, Aeq::Matrix, beq::Vector)::Vector;
  uses LinearAlgebra, Optimization:
  local N, c;
  N := max(ColumnDimension(A), ColumnDimension(Aeq));
  c := Vector([seq(0, i = 1 .. N)]);
  if ColumnDimension(A) = 0 and ColumnDimension(Aeq) = 0 then
    #Return c; it doesn't matter what you return. There are no constraints.
    c:
  elif ColumnDimension(A) = 0 then
    LPSolve(c, [NoUserValue, NoUserValue, Aeq, beq])[2]
  elif ColumnDimension(Aeq) = 0 then
    LPSolve(c, [A, b])[2]
  else
    LPSolve(c, [A, b, Aeq, beq])[2]
  end if:
end proc:

Algorithm 25. A simple routine for initializing the active set quadratic programming method.
If we expand this problem out, constructing the appropriate matrices, we see that:

    Q = [ −1   0 ]      c = [ 1 ]
        [  0  −1 ]          [ 1 ]
ReturnWorkingSet := proc (A::Matrix, b::Vector, x::Vector)::list;
  uses LinearAlgebra:
  local m, i, Ik;
  m := RowDimension(A);
  Ik := [];
  #The <> test below is a not-equal comparison.
  if evalb(m <> 0) then
    assert(Dimension(x) = ColumnDimension(A));
    assert(m = Dimension(b));
    for i to m do
      if evalb(Row(A, i).x = b[i]) then
        #Append any binding rows to Ik.
        Ik := [op(Ik), i]:
      end if:
    end do:
  end if:
  Ik:
end proc:

Algorithm 26. A routine to determine the binding constraints among the inequality constraints.
SolveConvexQP := proc (c::Vector, Q::Matrix, A::Matrix, b::Vector,
                       Aeq::Matrix, beq::Vector)::list;
  uses LinearAlgebra:
  local xnow, bDone, M, Ik, OUT, r, n, ynow, eps, alpha, alphak, cc,
        rebuildM, Ap, i, p, SOLN, lambda, pos, q, pA, pAeq;

  eps := 1/1000000;
  xnow := FindInitialSolution(A, b, Aeq, beq); #Initialize with Linear Programming!
  n := Dimension(xnow);

  rebuildM := true; #A variable that tells us whether we have to rebuild a matrix.
  Ik := ReturnWorkingSet(A, b, xnow); #Find the working (active) set. Only called once.
  bDone := false;

  #Return list...we return more than is necessary.
  #Specifically, the path we take.
  OUT := [];

  while evalb(not bDone) do
    OUT := [op(OUT), convert(xnow, list)];
    if rebuildM then #Rebuild the matrix that gets passed to the sub-QP problem.
      if nops(Ik) <> 0 and evalb(not IsEmpty(Aeq)) then
        #Create an augmented matrix...these are the binding constraints.
        #Some of the inequality constraints are binding.
        M := <SubMatrix(A, Ik, [1 .. n]); Aeq>:
      elif evalb(not IsEmpty(Aeq)) then
        M := Aeq
      elif nops(Ik) <> 0 then
        M := SubMatrix(A, Ik, [1 .. n])
      else
        M := Matrix(0, 0)
      end if;
    end if;
    cc := c + Q.xnow;
    r := Vector([seq(0, i = 1 .. RowDimension(M))]);
    #Here's where we solve the sub-problem.
    SOLN := SolveEQQP(cc, Q, M, r);
    p := SOLN[1];      #The direction to move.
    lambda := SOLN[2]; #The Lagrange multipliers.

    if Norm(p) < eps then
      if nops(Ik) = 0 then
        bDone := true
      else
        pos := MinimumIndex(SubVector(lambda, [1 .. nops(Ik)]));
        if 0 <= lambda[pos] then
          #We're done! All Lagrange multipliers that need to be positive are.
          bDone := true
        else
          #Delete the row corresponding to the "most offending" Lagrange multiplier.
          M := DeleteRow(M, pos);
          #The next line is a little tortured...it's to remove the right
          #row number from Ik.
          Ik := convert((convert(Ik, set) minus convert({Ik[pos]}, set)), list);
          rebuildM := false;
        end if
      end if
    else
      alphak := 1;
      q := -1;
      for i to RowDimension(A) do
        if evalb(not i in Ik) then
          Ap := Row(A, i).p;
          if 0 < Ap then #Otherwise, it doesn't matter.
            alpha := (b[i] - Row(A, i).xnow)/Ap;
            if alpha < alphak then
              alphak := alpha;
              q := i
            end if;
          end if;
        end if;
      end do;
      if q <> -1 and alphak <= 1 then
        Ik := [op(Ik), q]; #A new constraint becomes binding.
        rebuildM := true:
      else
        alphak := 1: #No new constraints become binding.
      end if;

      xnow := xnow + alphak*p:
    end if:
  end do;
  #We return two vectors of dual variables, one for the inequality constraints
  #and one for the equality constraints.
  pA := Vector(RowDimension(A));
  pAeq := Vector(RowDimension(Aeq));
  for i to nops(Ik) do
    pA[Ik[i]] := lambda[i]:
  end do;
  for i from nops(Ik)+1 to Dimension(lambda) do
    pAeq[i-nops(Ik)] := lambda[i]:
  end do;
  OUT := [[op(OUT), convert(xnow, list)], xnow, pA, pAeq]:
end proc:

Algorithm 27. The active set convex quadratic programming algorithm. The
algorithm uses all the sub-routines developed for quadratic programming.
Our matrices are:

    A = [  1   1 ]      b = [  1   ]      Aeq = []      beq = []
        [ −2  −1 ]          [ −1   ]
        [ −1   0 ]          [ −1/4 ]
        [  0  −1 ]          [ −1/4 ]
Since we have no equality constraints, we use empty matrices for these terms. Passing this
into our optimization algorithm yields an optimal solution of x = y = 1/2. The path the
algorithm takes is shown in Figure 10.5.
Figure 10.5. The path taken when solving the proposed quadratic programming
problem using the active set method. Notice we tend to hug the outside of the
polyhedral set.
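For reference, a usage sketch follows; it assumes Algorithms 22 through 27 have all been
executed in the current Maple session, with an empty matrix and vector standing in for the
absent equality constraints:

  with(LinearAlgebra):
  Q := Matrix([[-1, 0], [0, -1]]):
  c := Vector([1, 1]):
  A := Matrix([[1, 1], [-2, -1], [-1, 0], [0, -1]]):
  b := Vector([1, -1, -1/4, -1/4]):
  SOLN := SolveConvexQP(c, Q, A, b, Matrix(0, 0), Vector(0)):
  SOLN[2]; #The optimal point; here x = y = 1/2.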
Exercise 82. Work through the steps of the algorithm by hand for the problem:

(10.87)    max −½(x − 1)² − ½(y − 1)²
           s.t. x + y ≤ 1
                x, y ≥ 0
Remark 10.62. As we noted in the previous example, the active set method still tends
to hug the boundary of the polyhedral constraint set. Interior point methods are known to
converge more quickly. We will discuss these methods for both linear and quadratic programs
in the next chapter, after we introduce barrier methods.
CHAPTER 11
Penalty and Barrier Methods, Sequential Quadratic
Programming, Interior Point Methods
For simplicity in this chapter, we will not study maximization problems, since penalty
functions are more easily understood in terms of minimization (i.e., we will minimize a
penalty). Thus, we will focus on the following problem:
(11.1)    min f(x1, . . . , xn)
          s.t. g1(x1, . . . , xn) ≤ 0
               ...
               gm(x1, . . . , xn) ≤ 0
               h1(x1, . . . , xn) = 0
               ...
               hl(x1, . . . , xn) = 0
With a little work, it is easy to see that this problem has as its dual feasibility and
complementary slackness conditions:
(11.2)    ∇f(x∗) + Σ_{i=1}^m λi∇gi(x∗) + Σ_{j=1}^l µj∇hj(x∗) = 0
          λi gi(x∗) = 0,  i = 1, . . . , m
          λi ≥ 0,  i = 1, . . . , m
          µj unrestricted,  j = 1, . . . , l
1. Penalty Methods
Proposition 11.1. Let g : Rn → R and suppose that r : R → R is a strictly convex
function with minimum at x = 0 so that r(0) = 0. Then:

(11.3)    pI(x) = r(max{0, g(x)})

is 0 if and only if x satisfies g(x) ≤ 0. Furthermore:

(11.4)    pE(x) = r(g(x))

is 0 if and only if g(x) = 0.
Exercise 83. Prove Proposition 11.1.
Definition 11.2. Let g : Rn → R and suppose that r : R → R is a strictly convex
function with minimum at x = 0 so that r(0) = 0. For a constraint of type g(x) ≤ 0,
pI(x) defined as in Equation 11.3 is an inequality penalty function. For a constraint of type
g(x) = 0, pE(x) defined as in Equation 11.4 is an equality penalty function.
Theorem 11.3. Consider the inequality penalty function with r(x) = x² and inequality
g(x) ≤ 0. If g(x) is continuous and differentiable, then pI(x) is differentiable with derivative:

(11.5)    ∂pI(x)/∂xi = { 2g(x) ∂g/∂xi   if g(x) ≥ 0
                       { 0              otherwise
Exercise 84. Prove Theorem 11.3.
Remark 11.4. We can use penalty functions to transform constrained optimization problems into unconstrained optimization problems. To see this, consider a simple version of
Problem 11.1:

    min f(x1, . . . , xn)
    s.t. g(x1, . . . , xn) ≤ 0
         h(x1, . . . , xn) = 0

Then the penalized optimization problem is:

(11.6)    min fP(x1, . . . , xn) = f(x1, . . . , xn) + λI pI(x1, . . . , xn) + λE pE(x1, . . . , xn)

where λI, λE ∈ R+. If we choose r(x) = x² (e.g.), then as long as the constraints and
objective in the original problem are differentiable, fP(x) is differentiable and we can
apply unconstrained optimization methods that use derivatives. Otherwise, we must use
non-differentiable methods.
Remark 11.5. In a penalty function method, if λI or λE is chosen too large, then the
penalty functions will overwhelm the objective function. If these values are chosen too small,
then the resulting minimizer x∗ of fP may not be feasible.
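To illustrate Remarks 11.4 and 11.5, the following Maple fragment (a sketch on a
hypothetical problem) penalizes the single equality constraint x + y − 1 = 0 using r(x) = x²
and examines the unconstrained minimizer as λE grows:

  f := (x - 1)^2 + (y - 2)^2:  #Objective to be minimized.
  pE := (x + y - 1)^2:         #Equality penalty function with r(x) = x^2.
  fP := f + lambdaE*pE:        #The penalized objective of Equation 11.6.
  S := solve({diff(fP, x) = 0, diff(fP, y) = 0}, {x, y});
  limit(eval(x, S), lambdaE = infinity); #Returns 0.
  limit(eval(y, S), lambdaE = infinity); #Returns 1; (0, 1) is the constrained minimizer.

For any finite λE the minimizer works out to x = 1/(1 + 2λE) and y = 1 + 1/(1 + 2λE),
which is slightly infeasible; this is exactly the behavior described in Remark 11.5.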
Example 11.6. Consider the following simple
2. Sequential Quadratic Programming
3. Barrier Methods
4. Interior Point Simplex as a Barrier Method
5. Interior Point Methods for Quadratic Programs
Bibliography
[Ant94] H. Anton, Elementary linear algebra, 7 ed., John Wiley and Sons, 1994.
[Avr03] M. Avriel, Nonlinear programming: Analysis and methods, Dover Press, 2003.
[Ber99] D. P. Bertsekas, Nonlinear programming, 2 ed., Athena Scientific, 1999.
[BJS04] M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali, Linear programming and network flows, Wiley-Interscience, 2004.
[BSS06] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear programming: Theory and algorithms, John Wiley and Sons, 2006.
[Dem97] J. W. Demmel, Applied numerical linear algebra, SIAM Publishing, 1997.
[Ger31] S. Gerschgorin, Über die Abgrenzung der Eigenwerte einer Matrix, Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk 7 (1931), 749–754.
[GR01] C. Godsil and G. Royle, Algebraic graph theory, Springer, 2001.
[Gri11] C. Griffin, Linear programming: Penn State Math 484 lecture notes (v 1.8), http://www.personal.psu.edu/cxg286/Math484_V1.pdf, 2010–2011.
[HJ61] R. Hooke and T. A. Jeeves, Direct search solution of numerical and statistical problems, J. ACM (1961).
[MT03] J. E. Marsden and A. Tromba, Vector calculus, 5 ed., W. H. Freeman, 2003.
[Mun00] J. Munkres, Topology, Prentice Hall, 2000.
[NL89] J. Nocedal and D. C. Liu, On the limited memory BFGS method for large scale optimization, Math. Program. Ser. B 45 (1989), no. 3, 503–528.
[NW06] J. Nocedal and S. Wright, Numerical optimization, Springer, 2006.
[PTVF07] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical recipes 3rd edition: The art of scientific computing, 3 ed., Cambridge University Press, 2007.
[Rud76] W. Rudin, Principles of mathematical analysis, McGraw-Hill, 1976.
[Sah74] S. Sahni, Computationally related problems, SIAM J. Comput. 3 (1974), 262–279.
[WN99] L. A. Wolsey and G. L. Nemhauser, Integer and combinatorial optimization, Wiley-Interscience, 1999.
[ZBLN97] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization, ACM Trans. Math. Software 23 (1997), no. 4, 550–560.