
Numerical Optimization: Penn State Math 555 Lecture Notes


Version 1.0
Christopher Griffin © 2012
Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License
With Contributions By: Simon Miller, Douglas Mercer

Contents

List of Figures
Using These Notes
Chapter 1. Introduction to Optimization Concepts, Geometry and Matrix Properties
  1. Optimality and Optimization
  2. Some Geometry for Optimization
  3. Matrix Properties for Optimization
Chapter 2. Fundamentals of Unconstrained Optimization
  1. Mean Value and Taylor's Theorems
  2. Necessary and Sufficient Conditions for Optimality
  3. Concave/Convex Functions and Convex Sets
  4. Concave Functions and Differentiability
Chapter 3. Introduction to Gradient Ascent and Line Search Methods
  1. Gradient Ascent Algorithm
  2. Some Results on the Basic Ascent Algorithm
  3. Maximum Bracketing
  4. Dichotomous Search
  5. Golden Section Search
  6. Bisection Search
  7. Newton's Method
  8. Convergence of Newton's Method
Chapter 4. Approximate Line Search and Convergence of Gradient Ascent
  1. Approximate Search: Armijo Rule and Curvature Condition
  2. Algorithmic Convergence
  3. Rate of Convergence for Pure Gradient Ascent
  4. Rate of Convergence for Basic Ascent Algorithm
Chapter 5. Newton's Method and Corrections
  1. Newton's Method
  2. Convergence Issues in Newton's Method
  3. Newton Method Corrections
Chapter 6. Conjugate Direction Methods
  1. Conjugate Directions
  2. Generating Conjugate Directions
  3. The Conjugate Gradient Method
  4. Application to Non-Quadratic Problems
Chapter 7. Quasi-Newton Methods
  1. Davidon-Fletcher-Powell (DFP) Quasi-Newton Method
  2. Implementation and Example of DFP
  3. Derivation of the DFP Method
  4. Broyden-Fletcher-Goldfarb-Shanno (BFGS) Quasi-Newton Method
  5. Implementation of the BFGS Method
Chapter 8. Numerical Differentiation and Derivative Free Optimization
  1. Numerical Differentiation
  2. Derivative Free Methods: Powell's Method
  3. Derivative Free Methods: Hooke-Jeeves Method
Chapter 9. Linear Programming: The Simplex Method
  1. Linear Programming: Notation
  2. Polyhedral Theory and Linear Equations and Inequalities
  3. The Simplex Method
  4. Karush-Kuhn-Tucker (KKT) Conditions
  5. Simplex Initialization
  6. Revised Simplex Method
  7. Cycling Prevention
  8. Relating the KKT Conditions to the Tableau
  9. Duality
Chapter 10. Feasible Direction Methods and Quadratic Programming
  1. Preliminaries
  2. Frank-Wolfe Algorithm
  3. Farkas' Lemma and Theorems of the Alternative
  4. Preliminary Results: Feasible Directions, Improving Directions
  5. Fritz-John and Karush-Kuhn-Tucker Theorems
  6. Quadratic Programming and Active Set Methods
Chapter 11. Penalty and Barrier Methods, Sequential Quadratic Programming, Interior Point Methods
  1. Penalty Methods
  2. Sequential Quadratic Programming
  3. Barrier Methods
  4. Interior Point Simplex as a Barrier Method
  5. Interior Point Methods for Quadratic Programs
Bibliography

List of Figures

1.1 Plot with Level Sets Projected on the Graph of z. The level sets exist in R^2 while the graph of z exists in R^3. The level sets have been projected onto their appropriate heights on the graph.
1.2 Contour Plot of z = x^2 + y^2. The circles in R^2 are the level sets of the function. The lighter the circle hue, the higher the value of c that defines the level set.
1.3 A Line Function: The points in the graph shown in this figure are in the set produced using the expression x_0 + ht where x_0 = (2, 1) and h = (2, 2).
1.4 A Level Curve Plot with Gradient Vector: We've scaled the gradient vector in this case to make the picture understandable. Note that the gradient is perpendicular to the level set curve at the point (1, 1), where the gradient was evaluated. You can also note that the gradient is pointing in the direction of steepest ascent of z(x, y).
2.1 An illustration of the mean value theorem in one variable. The multi-variable mean value theorem is simply an application of the single variable mean value theorem applied to a slice of a function.
2.2 A convex function: A convex function satisfies the expression f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2) for all x_1 and x_2 and λ ∈ [0, 1].
2.3 (Left) A simple quartic function with two local maxima and one local minimum. (Right) A segment of the function that is locally concave.
3.1 Dichotomous Search iteratively refines the size of a bracket containing a maximum of the function φ by splitting the bracket in two.
3.2 A non-concave function with a maximum on the interval [0, 15].
3.3 The relative sizes of the interval and sub-interval lengths in a Golden Section Search.
3.4 A function for which Golden Section Search (and Dichotomous Search) might fail to find a global solution.
4.1 The function f(x, y) has a maximum at x = y = 0, where the function attains a value of 1.
4.2 A plot of φ(t) illustrates the function increases as we approach the global maximum of the function and then decreases.
4.3 The Wolfe Conditions are illustrated. Note the region accepted by the Armijo rule intersects with the region accepted by the curvature condition to bracket the (closest local) maximum for δ_k. Here σ_1 = 0.15 and σ_2 = 0.5.
4.4 We illustrate the failure of the gradient ascent method to converge to a stationary point when we do not use the Armijo rule or minimization.
4.5 Gradient ascent is illustrated on the function F(x, y) = −2x^2 − 10y^2 starting at x = 15, y = 5. The zig-zagging motion is typical of the gradient ascent algorithm in cases where λ_n and λ_1 are very different (see Theorem 4.20).
5.1 Newton's Method converges for the function F(x, y) = −2x^2 − 10y^4 in 11 steps, with minimal zigzagging.
5.2 A double peaked function with a local minimum between the peaks. This function also has saddle points.
5.3 A simple modification to Newton's method first used by Gauss. While H(x_k) is not negative definite, we use a gradient ascent to converge to the neighborhood of a stationary point (ideally a local maximum). We then switch to a Newton step.
5.4 Modified Newton's method uses the modified Cholesky decomposition and efficient linear solution methods to find an ascent direction in the case when the Hessian matrix is not negative definite. This algorithm converges superlinearly, as illustrated in this case.
6.1 The steps of the conjugate gradient algorithm applied to F(x, y).
6.2 In this example, the conjugate gradient method also converges in four total steps, with much less zig-zagging than the gradient descent method or even Newton's method.
7.1 The steps of the DFP algorithm applied to F(x, y).
7.2 The steps of the DFP algorithm applied to F(x, y).
8.1 A comparison of the BFGS method using numerical gradients vs. exact gradients.
8.2 Powell's Direction Set Method applied to a bimodal function and a variation of Rosenbrock's function. Notice the impact the valley has on the steps in Rosenbrock's method.
8.3 Hooke-Jeeves algorithm applied to a bimodal function.
8.4 Hooke-Jeeves algorithm applied to a bimodal function.
9.1 A hyperplane in 3 dimensional space: A hyperplane is the set of points satisfying an equation a^T x = b, where b is a constant in R, a is a constant vector in R^n and x is a variable vector in R^n. The equation is written as a matrix multiplication using our assumption that all vectors are column vectors.
9.2 Two half-spaces defined by a hyper-plane: A half-space is so named because any hyper-plane divides R^n (the space in which it resides) into two halves, the side "on top" and the side "on the bottom."
9.3 An Unbounded Polyhedral Set: This unbounded polyhedral set has many directions. One direction is [0, 1]^T.
9.4 Boundary Point: A boundary point of a (convex) set C is a point in the set so that every ball of any radius centered at the point contains some points inside C and some points outside C.
9.5 A Polyhedral Set: This polyhedral set is defined by five half-spaces and has a single degenerate extreme point located at the intersection of the binding constraints 3x_1 + x_2 ≤ 120, x_1 + 2x_2 ≤ 160 and (28/16)x_1 + x_2 ≤ 100. All faces are shown in bold.
9.6 Visualization of the set D: This set really consists of the set of points on the red line. This is the line where d_1 + d_2 = 1 and all other constraints hold. This line has two extreme points (0, 1) and (1/2, 1/2).
9.7 The Caratheodory Characterization Theorem: Extreme points and extreme directions are used to express points in a bounded and unbounded set.
9.8 The Simplex Algorithm: The path around the feasible region is shown in the figure. Each exchange of a basic and non-basic variable moves us along an edge of the polygon in a direction that increases the value of the objective function.
9.9 The Gradient Cone: At optimality, the cost vector c is obtuse with respect to the directions formed by the binding constraints. It is also contained inside the cone of the gradients of the binding constraints, which we will discuss at length later.
10.1 (a) The steps of the Frank-Wolfe Algorithm when maximizing −(x − 2)^2 − (y − 2)^2 over the set of (x, y) satisfying the constraints x + y ≤ 1 and x, y ≥ 0. (b) The steps of the Frank-Wolfe Algorithm when maximizing −7(x − 20)^2 − 6(y − 40)^2 over the set of (x, y) satisfying the constraints 3x + y ≤ 120, x + 2y ≤ 160, x ≤ 35 and x, y ≥ 0. (c) The steps of the Frank-Wolfe Algorithm when maximizing −7(x − 40)^2 − 6(y − 40)^2 over the set of (x, y) satisfying the constraints 3x + y ≤ 120, x + 2y ≤ 160, x ≤ 35 and x, y ≥ 0.
10.2 System 2 has a solution if (and only if) the vector c is contained inside the positive cone constructed from the rows of A.
10.3 System 1 has a solution if (and only if) the vector c is not contained inside the positive cone constructed from the rows of A.
10.4 An example of Farkas' Lemma: The vector c is inside the positive cone formed by the rows of A, but c′ is not.
10.5 The path taken when solving the proposed quadratic programming problem using the active set method. Notice we tend to hug the outside of the polyhedral set.

Using These Notes

Stop! This is a set of lecture notes. It is not a book. Go away and come back when you have a real textbook on Numerical Optimization. Okay, do you have a book?
Alright, let's move on then. This is a set of lecture notes for Math 555, Penn State's graduate Numerical Optimization course. Since I use these notes while I teach, there may be typographical errors that I noticed in class but did not fix in the notes. If you see a typo, send me an e-mail and I'll add an acknowledgement. There may be many typos; that's why you should have a real textbook.

The lecture notes are loosely based on Nocedal and Wright's book Numerical Optimization, Avriel's text on Nonlinear Optimization, Bazaraa, Sherali and Shetty's book on Nonlinear Programming, Bazaraa, Jarvis and Sherali's book on Linear Programming and several other books that are cited in these notes. All of the books mentioned are good books (some great). The problem is, some books don't cover things in enough depth. The other problem is that, for students taking this course, this may be the first time they're seeing optimization, so we have to cover some preliminaries. Our Math 555 course should really be two courses: one on theory and the other on practical algorithms. Apparently we're not that interested, so we offer everything in one course. This set of notes corrects some of the problems I mention by presenting the material in a format that can be used easily at Penn State in Math 555. These notes are probably really inappropriate if you have a strong Operations Research program. You'd be better off reading Nocedal and Wright's book directly.

In order to use these notes successfully, you should know something about multi-variable calculus. It also wouldn't hurt to have had an undergraduate treatment in optimization (in some form). At Penn State, the only prerequisite for this course is Math 456, which is a numerical methods course. That could be useful for some computational details, but I'll review everything that you'll need. That being said, I hope you enjoy using these notes!

One last thing: the algorithms in these notes were coded using Maple. I've also coded most of the algorithms in C++. The code will be posted (eventually; perhaps it's already there). Until then, you can e-mail me if you want the code. I can be reached at griffin 'at' ieee.org.

CHAPTER 1

Introduction to Optimization Concepts, Geometry and Matrix Properties

1. Optimality and Optimization

Definition 1.1. Let z : D ⊆ R^n → R. The point x* is a global maximum for z if for all x ∈ D, z(x*) ≥ z(x). A point x* ∈ D is a local maximum for z if there is a neighborhood S ⊆ D of x* (i.e., x* ∈ S) so that for all x ∈ S, z(x*) ≥ z(x). When the foregoing inequalities are strict, x* is called a strict global or local maximum.

Remark 1.2. Clearly Definition 1.1 is valid only for domains and functions where the concept of a neighborhood is defined and understood. In general, S must be a topologically connected set (as a neighborhood in R^n is) in order for this definition to be used, or at least we must be able to define the concept of neighborhood on the set.

Definition 1.3 (Maximization Problem). Let z : D ⊆ R^n → R; for i = 1, ..., m, g_i : D ⊆ R^n → R; and for j = 1, ..., l, h_j : D ⊆ R^n → R be functions. Then the general maximization problem with objective function z(x_1, ..., x_n), inequality constraints g_i(x_1, ..., x_n) ≤ b_i (i = 1, ..., m) and equality constraints h_j(x_1, ..., x_n) = r_j (j = 1, ..., l) is written as:

(1.1)   max  z(x_1, ..., x_n)
        s.t. g_1(x_1, ..., x_n) ≤ b_1
             ...
             g_m(x_1, ..., x_n) ≤ b_m
             h_1(x_1, ..., x_n) = r_1
             ...
             h_l(x_1, ..., x_n) = r_l
Remark 1.4. Expression 1.1 is also called a mathematical programming problem. Naturally, when constraints are involved, we define the global and local maxima for the objective function z(x_1, ..., x_n) in terms of the feasible region instead of the entire domain of z, since we are only concerned with values of x_1, ..., x_n that satisfy our constraints.

Remark 1.5. When there are no constraints (or the only constraint is that (x_1, ..., x_n) ∈ R^n), the problem is called an unconstrained maximization problem.

Example 1.6. Let's recall a simple optimization problem from differential calculus (Math 140): Goats are an environmentally friendly and inexpensive way to control a lawn when there are lots of rocks or lots of hills. (Seriously, both Google and some U.S. Navy bases use goats on rocky hills instead of paying lawn mowers!) Suppose I wish to build a pen to keep some goats. I have 100 meters of fencing and I wish to build the pen in a rectangle with the largest possible area. How long should the sides of the rectangle be? In this case, making the pen better means making it have the largest possible area. We can write this problem as:

(1.2)   max  A(x, y) = xy
        s.t. 2x + 2y = 100
             x ≥ 0
             y ≥ 0

Remark 1.7. It is easy to see that when we replace the objective function in Problem 1.1 with the objective function −z(x_1, ..., x_n), then the solution to this new problem minimizes z(x_1, ..., x_n) subject to the constraints of Problem 1.1. It therefore suffices to know how to solve either maximization or minimization problems, since one can be converted easily into the other.

2. Some Geometry for Optimization

Remark 1.8. We'll denote vectors in R^n in boldface. So x ∈ R^n is an n-dimensional vector and we have x = (x_1, ..., x_n). We'll always associate an n-dimensional vector with an n × 1 matrix (column vector) unless otherwise noted. Thus, when we write x ∈ R^n we also mean x ∈ R^(n×1) (the set of n × 1 matrices with entries from R).

Definition 1.9 (Dot Product). Recall that if x, y ∈ R^n are two n-dimensional vectors, then the dot product (scalar product) is:

(1.3)   x · y = Σ_{i=1}^{n} x_i y_i

where x_i is the ith component of the vector x. Clearly if x, y ∈ R^(n×1), then

(1.4)   x · y = x^T y

where x^T ∈ R^(1×n) is the transpose of x when treated as a matrix.

Lemma 1.10. Let x, y ∈ R^n and let θ be the angle between x and y. Then

(1.5)   x · y = ||x|| ||y|| cos θ  ∎

Remark 1.11. The preceding lemma can be proved using the law of cosines from trigonometry. The following small lemma follows and is proved as Theorem 1 of [MT03]:

Lemma 1.12. Let x, y ∈ R^n. Then the following hold:
(1) The angle between x and y is less than π/2 (i.e., acute) iff x · y > 0.
(2) The angle between x and y is exactly π/2 (i.e., the vectors are orthogonal) iff x · y = 0.
(3) The angle between x and y is greater than π/2 (i.e., obtuse) iff x · y < 0.  ∎

Lemma 1.13 (Schwarz Inequality). Let x, y ∈ R^n. Then:

(1.6)   (x^T y)^2 ≤ (x^T x) · (y^T y)

This is equivalent to:

(1.7)   (x^T y)^2 ≤ ||x||^2 ||y||^2

Definition 1.14 (Graph). Let z : D ⊆ R^n → R be a function; then the graph of z is the set of (n + 1)-tuples:

(1.8)   {(x, z(x)) ∈ R^(n+1) | x ∈ D}

When z : D ⊆ R → R, the graph is precisely what you'd expect. It's the set of pairs (x, y) ∈ R^2 so that y = z(x). This is the graph that you learned about back in Algebra 1.
Definition 1.15 (Level Set). Let z : R^n → R be a function and let c ∈ R. Then the level set of value c for function z is the set:

(1.9)   {x = (x_1, ..., x_n) ∈ R^n | z(x) = c} ⊆ R^n

Example 1.16. Consider the function z = x^2 + y^2. The level set of z at 4 is the set of points (x, y) ∈ R^2 such that:

(1.10)   x^2 + y^2 = 4

You will recognize this as the equation for a circle with radius 2. We illustrate this in the following two figures. Figure 1.1 shows the level sets of z as they sit on the 3D plot of the function, while Figure 1.2 shows the level sets of z in R^2. The plot in Figure 1.2 is called a contour plot.

Figure 1.1. Plot with Level Sets Projected on the Graph of z. The level sets exist in R^2 while the graph of z exists in R^3. The level sets have been projected onto their appropriate heights on the graph.

Figure 1.2. Contour Plot of z = x^2 + y^2. The circles in R^2 are the level sets of the function. The lighter the circle hue, the higher the value of c that defines the level set.

Definition 1.17 (Line). Let x_0, h ∈ R^n. Then the line defined by vectors x_0 and h is the function l(t) = x_0 + th. Clearly l : R → R^n. The vector h is called the direction of the line.

Example 1.18. Let x_0 = (2, 1) and let h = (2, 2). Then the line defined by x_0 and h is shown in Figure 1.3. The set of points on this line is the set L = {(x, y) ∈ R^2 : x = 2 + 2t, y = 1 + 2t, t ∈ R}.

Figure 1.3. A Line Function: The points in the graph shown in this figure are in the set produced using the expression x_0 + ht where x_0 = (2, 1) and h = (2, 2).

Definition 1.19 (Directional Derivative). Let z : R^n → R and let h ∈ R^n be a vector (direction) in n-dimensional space. Then the directional derivative of z at point x_0 ∈ R^n in the direction of h is

(1.11)   (d/dt) z(x_0 + th) |_{t=0}

when this derivative exists.

Proposition 1.20. The directional derivative of z at x_0 in the direction h is equal to:

(1.12)   lim_{t→0} [z(x_0 + th) − z(x_0)] / t

Exercise 1. Prove Proposition 1.20. [Hint: Use the definition of derivative for a univariate function and apply it to the definition of directional derivative and evaluate t = 0.]

Definition 1.21 (Gradient). Let z : R^n → R be a function and let x_0 ∈ R^n. Then the gradient of z at x_0 is the vector in R^n given by:

(1.13)   ∇z(x_0) = (∂z/∂x_1 (x_0), ..., ∂z/∂x_n (x_0))

Gradients are extremely important concepts in optimization (and vector calculus in general). Gradients have many useful properties that can be exploited. The relationship between the directional derivative and the gradient is of critical importance.

Theorem 1.22. If z : R^n → R is differentiable, then all directional derivatives exist. Furthermore, the directional derivative of z at x_0 in the direction of h is given by:

(1.14)   ∇z(x_0) · h

where · denotes the dot product of two vectors.

Proof. Let l(t) = x_0 + ht. Then l(t) = (l_1(t), ..., l_n(t)); that is, l(t) is a vector function whose ith component is given by l_i(t) = x_{0i} + h_i t. Apply the chain rule:

(1.15)   dz(l(t))/dt = (∂z/∂l_1)(dl_1/dt) + ··· + (∂z/∂l_n)(dl_n/dt)

Thus:

(1.16)   (d/dt) z(l(t)) = ∇z · (dl/dt)

Clearly dl/dt = h and l(0) = x_0. Thus:

(1.17)   (d/dt) z(x_0 + th) |_{t=0} = ∇z(x_0) · h  ∎

We now come to the two most important results about gradients: (i) they always point in the direction of steepest ascent with respect to the level curves of a function, and (ii) they are perpendicular (normal) to the level curves of a function. We can exploit these facts as we seek to maximize (or minimize) functions.
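Theorem 1.22 is easy to check numerically. The following minimal Maple sketch (not part of the notes' code; the direction h = (2, 1) and the step size 10^(-6) are arbitrary choices for illustration) compares the limit definition of the directional derivative with the gradient formula, using the function z(x, y) = x^4 + y^2 + 2xy that also appears in Remark 1.25 below:

z := (x, y) -> x^4 + y^2 + 2*x*y:
x0 := 1: y0 := 1:
h := <2, 1>:                                # an arbitrary direction (assumption)
g := <D[1](z)(x0, y0), D[2](z)(x0, y0)>:    # gradient at (1,1), namely (6, 4)
g . h;                                      # gradient formula: 6*2 + 4*1 = 16
t := 1e-6:
(z(x0 + t*h[1], y0 + t*h[2]) - z(x0, y0))/t;  # difference quotient, about 16

The two numbers agree to roughly the size of the step, as the theorem predicts.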
Theorem 1.23. Let z : R^n → R be differentiable, x_0 ∈ R^n. If ∇z(x_0) ≠ 0, then ∇z(x_0) points in the direction in which z is increasing fastest.

Proof. Recall ∇z(x_0) · h is the directional derivative of z in direction h at x_0. Assume that h is a unit vector. We know that:

(1.18)   ∇z(x_0) · h = ||∇z(x_0)|| cos θ

(because we assumed h was a unit vector) where θ is the angle between the vectors ∇z(x_0) and h. The function cos θ is largest when θ = 0, that is, when h and ∇z(x_0) are parallel vectors. (If ∇z(x_0) = 0, then the directional derivative is zero in all directions.)  ∎

Theorem 1.24. Let z : R^n → R be differentiable and let x_0 lie in the level set S defined by z(x) = k for fixed k ∈ R. Then ∇z(x_0) is normal to the set S in the sense that if h is a tangent vector at t = 0 of a path c(t) contained entirely in S with c(0) = x_0, then ∇z(x_0) · h = 0.

Remark 1.25. Before giving the proof, we illustrate this theorem in Figure 1.4. The function is z(x, y) = x^4 + y^2 + 2xy and x_0 = (1, 1). At this point ∇z(x_0) = (6, 4). We include the tangent line to the level set at the point (1, 1) to illustrate the normality of the gradient to the level curve at the point.

Figure 1.4. A Level Curve Plot with Gradient Vector: We've scaled the gradient vector in this case to make the picture understandable. Note that the gradient is perpendicular to the level set curve at the point (1, 1), where the gradient was evaluated. You can also note that the gradient is pointing in the direction of steepest ascent of z(x, y).

Proof. As stated, let c(t) be a curve in S. Then c : R → R^n and z(c(t)) = k for all t ∈ R. Let h be the tangent vector to c at t = 0; that is:

(1.19)   dc(t)/dt |_{t=0} = h

Differentiating z(c(t)) with respect to t using the chain rule and evaluating at t = 0 yields:

(1.20)   (d/dt) z(c(t)) |_{t=0} = ∇z(c(0)) · h = ∇z(x_0) · h = 0

Thus ∇z(x_0) is perpendicular to h and thus normal to the set S as required.  ∎

3. Matrix Properties for Optimization

Definition 1.26 (Definiteness). A matrix M ∈ R^(n×n) is positive semi-definite if for all x ∈ R^n with x ≠ 0, x^T M x ≥ 0. The matrix M is positive definite if for all x ∈ R^n with x ≠ 0, x^T M x > 0. The matrix M is negative definite if −M is positive definite and negative semi-definite if −M is positive semi-definite. If M satisfies none of these properties, then M is indefinite.

Remark 1.27. We note that this is not the most general definition of matrix definiteness. In general, matrix definiteness can be defined for complex matrices and has a specialization to Hermitian matrices.

Remark 1.28. Positive semi-definiteness is also called non-negative definiteness and negative semi-definiteness is also called non-positive definiteness.

Lemma 1.29. A matrix M ∈ R^(n×n) is positive definite if and only if for every vector x ∈ R^n, there is an α ∈ R_+ (that is, α > 0) such that x^T M x > α x^T x. (Thanks to Doug Mercer for pointing out that α needs to be positive.)

Exercise 2. Prove Lemma 1.29.

Lemma 1.30. Suppose that M is positive definite. If λ ∈ R is a real eigenvalue of M with real eigenvector x, then λ > 0.

Exercise 3. Prove Lemma 1.30.

Lemma 1.31. M ∈ R^(n×n) has eigenvalue λ if and only if M^(−1) has eigenvalue 1/λ.

Exercise 4. Prove Lemma 1.31.

Remark 1.32. The theory of eigenvalues and eigenvectors of matrices is deep and well understood. A substantial part of this theory should be covered in Math 436, for those interested.
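Lemmas 1.30 and 1.31 are easy to explore in Maple. A minimal sketch, with an arbitrarily chosen symmetric matrix:

with(LinearAlgebra):
M := Matrix([[2, -1], [-1, 2]]):      # an arbitrary symmetric test matrix (assumption)
Eigenvalues(M);                       # 1 and 3: all positive
Eigenvalues(MatrixInverse(M));        # 1 and 1/3: the reciprocals, as Lemma 1.31 predicts
x := <5, 7>:                          # an arbitrary nonzero vector
Transpose(x) . M . x;                 # 78 > 0, consistent with positive definiteness

Of course, a finite number of quadratic-form evaluations can never prove definiteness; the eigenvalue and Cholesky tests developed below do that.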
The following two results are proved in several different sources. One very nice proof is in Chapter 8 of [GR01]. Unfortunately, the proofs are well outside the scope of the class.

Theorem 1.33 (Spectral Theorem for Real Symmetric Matrices). Let M ∈ R^(n×n) be a symmetric matrix. Then the eigenvalues of M are all real.  ∎

Theorem 1.34 (Principal Axis Theorem). Let M ∈ R^(n×n) be a symmetric matrix. Then R^n has an orthogonal basis consisting of eigenvectors of M.  ∎

Lemma 1.35. Let M ∈ R^(n×n) be a symmetric matrix and suppose that λ_1 ≥ ··· ≥ λ_n are the eigenvalues of M. If x ∈ R^n is a vector with ||x|| = 1, then x^T M x ≤ λ_1 and x^T M x ≥ λ_n.

Proof. From the Principal Axis Theorem, we know that there is a basis B = {v_1, ..., v_n} consisting of eigenvectors of M. For any x ∈ R^n, we can write:

(1.21)   x = (x^T v_1) v_1 + ··· + (x^T v_n) v_n

That is, in the coordinates of B:

x = [(x^T v_1), ..., (x^T v_n)]^T

It follows then that:

Mx = (x^T v_1) Mv_1 + ··· + (x^T v_n) Mv_n = (x^T v_1) λ_1 v_1 + ··· + (x^T v_n) λ_n v_n = λ_1 (x^T v_1) v_1 + ··· + λ_n (x^T v_n) v_n

Pre-multiplying by x^T yields:

x^T M x = λ_1 (x^T v_1)^2 + ··· + λ_n (x^T v_n)^2

From Equation 1.21 and our assumption on ||x||, we see that (if this is not obvious, note that the length of a vector is invariant: it is the same no matter in which basis the vector is expressed):

1 = ||x||^2 = (x^T v_1)^2 + ··· + (x^T v_n)^2

Since λ_1 ≥ ··· ≥ λ_n, we have:

x^T M x = λ_1 (x^T v_1)^2 + ··· + λ_n (x^T v_n)^2 ≤ λ_1 (x^T v_1)^2 + ··· + λ_1 (x^T v_n)^2 = λ_1

The fact that x^T M x ≥ λ_n follows by a similar argument. This completes the proof.  ∎

Theorem 1.36. Suppose M ∈ R^(n×n) is symmetric. If every eigenvalue of M is positive, then M is positive definite.

Proof. By the spectral theorem, all the eigenvalues of M are real and thus we can write them in order as λ_1 ≥ ··· ≥ λ_n. By Lemma 1.35, we know that x^T M x ≥ λ_n for unit vectors x (the case of general x ≠ 0 follows by scaling). By our assumption λ_n > 0 and thus x^T M x > 0 for all x ≠ 0 in R^n.  ∎

Corollary 1.37. If M is a positive definite matrix, then its inverse is also positive definite.

Proof. Applying Lemma 1.31 and the preceding theorem yields the result.  ∎

Remark 1.38. The proofs of the preceding two results can be found in [Ant94] (and most likely later editions). These are fairly standard proofs though and are available in most complete linear programming texts.

Remark 1.39. Thus we've proved that a symmetric real matrix is positive definite if and only if every eigenvalue is positive. We can use this fact to obtain a simple test for positive definiteness. (Thanks to Xiao Chen for pointing out we needed the matrix to be symmetric.)

Definition 1.40 (Gerschgorin Disc). Let A ∈ C^(n×n) and let

R_i = Σ_{j≠i} |A_ij|

That is, R_i is the sum of the absolute values of the off-diagonal entries of row i. Let D(A_ii, R_i) be a closed disc centered at A_ii with radius R_i. This is a Gerschgorin disc. When A ∈ R^(n×n), this disc is a closed interval.

Remark 1.41. The following theorem was first proved in [Ger31] and is now a classical matrix theorem, used frequently in control theory.

Theorem 1.42 (Gerschgorin Disc Theorem). Let A ∈ C^(n×n). Every eigenvalue of A lies in a Gerschgorin disc.  ∎

Corollary 1.43. Let M ∈ R^(n×n) be symmetric. If

(1.22)   M_ii > Σ_{j≠i} |M_ij|

for every i, then M is positive definite.

Exercise 5. Prove Corollary 1.43.

Exercise 6. Construct an example in which the converse of Corollary 1.43 does not hold or prove that it must hold.
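The test in Corollary 1.43 is a one-line loop. A sketch (CheckDominant is a hypothetical helper written for illustration, not part of the notes' code):

CheckDominant := proc (M::Matrix, n::integer)::truefalse;
  local i, j;
  for i to n do
    # off-diagonal row sum: add all |M[i,j]| and subtract the diagonal term
    if M[i, i] <= add(abs(M[i, j]), j = 1 .. n) - abs(M[i, i]) then
      return false
    end if
  end do;
  true
end proc:
CheckDominant(Matrix([[8, -2, 7], [-2, 10, 5], [7, 5, 12]]), 3);  # false: row 1 has 8 < 9

Note that this particular matrix (which reappears in Example 1.48 below) fails the dominance test even though it turns out to be positive definite, which bears directly on Exercise 6: the condition is sufficient but not necessary.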
Remark 1.44. The last theorem of this section, which we will not prove, shows that real symmetric positive definite matrices have a special unique decomposition (called the Cholesky decomposition). (Actually, positive definite Hermitian matrices have a unique Cholesky decomposition and this generalizes the result on real symmetric matrices that are positive definite.) A proof can be found in Chapter 2 of [Dem97].

Theorem 1.45. A matrix M ∈ R^(n×n) is symmetric positive definite if and only if there is a (unique) lower triangular matrix H ∈ R^(n×n) such that M = HH^T.  ∎

Definition 1.46. The previous matrix factorization is called the Cholesky Decomposition. It serves as a test for positive definiteness in symmetric matrices that is less computationally complex than computing the eigenvalues of the matrix.

Remark 1.47. We note that Theorem 1.45 could be written to say there is an upper triangular matrix H so that M = HH^T. Some numerical systems implement this type of Cholesky decomposition, e.g., Mathematica. The example and code we provide compute the lower triangular matrix variation.

Example 1.48 (The Cholesky Decomposition). There is a classic algorithm for computing the Cholesky decomposition, which is available in [Dem97] (Algorithm 2.11). Instead of using the algorithm, we illustrate it with a more protracted example. Consider the matrix:

M = [ 8  -2   7]
    [-2  10   5]
    [ 7   5  12]

We can construct the LU decomposition using Gaussian elimination without pivoting to obtain M = LU with:

L = [ 8     0      0   ]     U = [1  -1/4   7/8 ]
    [-2   19/2     0   ]         [0    1   27/38]
    [ 7   27/4   41/38 ]         [0    0     1  ]

Notice the determinant of the U matrix is 1. The L matrix can be written as L = L_1 D, where D is a diagonal matrix whose determinant is equal to the determinant of M and L_1 is a lower triangular matrix with determinant 1. This yields:

L_1 = [  1     0    0]     D = [8    0      0  ]
      [-1/4    1    0]         [0  19/2     0  ]
      [ 7/8  27/38  1]         [0    0    41/38]

Notice L_1^T = U and, more importantly, notice that the diagonal of D is composed of ratios of successive leading principal minors of M. That is:

8 = det[8],   76 = det[8 -2; -2 10],   82 = det M

And we have 76/8 = 19/2 and 82/76 = 41/38. Thus, if we multiply the diagonal elements of the matrix D, we easily see that this yields 82, the determinant of M. Finally, define:

H = L_1 · diag(√8, √(19/2), √(41/38)) = [  2√2          0             0       ]
                                        [−(1/2)√2    (1/2)√38         0       ]
                                        [ (7/4)√2   (27/76)√38   (1/38)√1558  ]

This is the Cholesky factor, where the second matrix in the product is obtained by taking the square root of the diagonal elements of D. It is easy to see now that M = HH^T. The fact that the determinants of the leading principal minors of M are always positive is a direct result of the positive definiteness of M. This is naturally not the most efficient way to compute the Cholesky decomposition, but it is illustrative. The Maple code for the standard algorithm for computing the Cholesky decomposition is shown in Algorithm 1.

Exercise 7. Determine the computational complexity of Algorithm 1. Compare it to the computational complexity of the steps shown in Example 1.48. Use Algorithm 1 to confirm the Cholesky decomposition from Example 1.48.
CholeskyDecomp := proc (M::Matrix, n::integer)::list;
  local MM, H, ret, i, j, k;
  MM := Matrix(M);       # working copy of M
  H := Matrix(n, n);     # the lower triangular Cholesky factor
  ret := true;           # becomes false if M is not positive definite
  for i to n do
    if evalb(not ret) then ret := false; break: end if:
    if 0 <= MM[i, i] then
      H[i, i] := sqrt(MM[i, i])
    else
      ret := false; break;
    end if;
    for j from i+1 to n do
      if H[i, i] = 0 then ret := false; break end if;
      H[j, i] := MM[j, i]/H[i, i];
      for k from i+1 to j do
        MM[j, k] := MM[j, k] - H[j, i]*H[k, i]
      end do
    end do
  end do;
  if ret then [H, ret] else [M, ret] end if
end proc:

Algorithm 1. Cholesky Decomposition

CHAPTER 2

Fundamentals of Unconstrained Optimization

1. Mean Value and Taylor's Theorems

Remark 2.1. We state, without proof, some classic theorems from multi-variable calculus. The proofs of these theorems are not terribly hard, but they build on several other results from single variable calculus and reviewing all these results is tedious and outside the scope of the course.

Lemma 2.2 (Mean Value Theorem). Suppose that f : R^n → R is continuously differentiable (that is, f ∈ C^1). Let x_0, h ∈ R^n. Then there is a t ∈ (0, 1) such that:

(2.1)   f(x_0 + h) − f(x_0) = ∇f(x_0 + th)^T h  ∎

Remark 2.3. This is the natural generalization of the one-variable mean value theorem from calculus (Math 140 at Penn State). The expression x_0 + th is simply the line segment connecting x_0 and x_0 + h. If we imagine this being the "x-axis" and the corresponding slice of f(·), then we can see that we're just applying the single variable mean value theorem to this slice.

Figure 2.1. An illustration of the mean value theorem in one variable. The multi-variable mean value theorem is simply an application of the single variable mean value theorem applied to a slice of a function.

Lemma 2.4 (Second Order Mean Value Theorem). Suppose that f : R^n → R is twice continuously differentiable (that is, f ∈ C^2). Let x_0, h ∈ R^n. Then there is a t ∈ (0, 1) such that:

(2.2)   f(x_0 + h) − f(x_0) = ∇f(x_0)^T h + (1/2) h^T ∇^2 f(x_0 + th) h  ∎

Lemma 2.5 (Mean Value Theorem for Vector Valued Functions). Let f : R^n → R^n be a differentiable function and let x_0, h ∈ R^n. Then:

(2.3)   f(x_0 + h) − f(x_0) = ∫_0^1 Df(x_0 + th) h dt

where D is the Jacobian operator.  ∎

Remark 2.6. It should be noted that there is no exact analog of the mean value theorem for vector valued functions. The previous lemma is the closest thing to such an analog and it is generally referred to as such.

Corollary 2.7. Let f : R^n → R with f ∈ C^2 and let x_0, h ∈ R^n. Then:

(2.4)   ∇f(x_0 + h) − ∇f(x_0) = ∫_0^1 ∇^2 f(x_0 + th) h dt  ∎

Remark 2.8. The proof of this lemma rests on the single variable mean value theorem and the mean value theorem for integrals in single variable calculus.

Lemma 2.9 (Taylor's Theorem, Second Order). Let f : R^n → R with f ∈ C^2 and let x_0, h ∈ R^n. Then:

(2.5)   f(x_0 + h) = f(x_0) + ∇f(x_0)^T h + R_2(x_0, h)

where:

(2.6)   R_2(x_0, h) = (1/2) h^T ∇^2 f(x_0 + th) h

for some t ∈ (0, 1), or:

(2.7)   R_2(x_0, h) = ∫_0^1 (1 − t) h^T ∇^2 f(x_0 + th) h dt  ∎

Remark 2.10. Taylor's Theorem in the general case considers functions in C^k. One can convert between the two forms of the remainder using the mean value theorem, though this is not immediately obvious without some concentration. Most of the proofs of the remainder term use the mean value theorems. There is a very nice, very readable proof of Taylor's theorem in [MT03] (Chapter 3.2).

Remark 2.11. From Taylor's theorem, we obtain first and second order approximations for functions. That is, the first order approximation for f(x_0 + h) is:

(2.8)   f(x_0 + h) ≈ f(x_0) + ∇f(x_0)^T h

while the second order approximation is:

(2.9)   f(x_0 + h) ≈ f(x_0) + ∇f(x_0)^T h + (1/2) h^T ∇^2 f(x_0) h

We will use these approximations and Taylor's theorem repeatedly throughout this set of notes.
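A quick Maple sketch makes the quality of the two approximations in Remark 2.11 visible; the test function f(x, y) = e^(xy), the base point (1, 1) and the step h = (0.1, −0.2) are all arbitrary choices made for this illustration:

f := (x, y) -> exp(x*y):
g := <D[1](f)(1, 1), D[2](f)(1, 1)>:                    # gradient at x0 = (1,1)
H := Matrix([[D[1,1](f)(1, 1), D[1,2](f)(1, 1)],
             [D[2,1](f)(1, 1), D[2,2](f)(1, 1)]]):      # Hessian at x0
h := <0.1, -0.2>:
evalf(f(1.1, 0.8));                                     # true value, about 2.4109
evalf(f(1, 1) + g . h);                                 # first order, about 2.4465
evalf(f(1, 1) + g . h + (1/2)*(h . (H . h)));           # second order, about 2.4057

As expected, the second order approximation is noticeably closer to the true value, and both errors shrink rapidly as ||h|| decreases.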
2. Necessary and Sufficient Conditions for Optimality

Theorem 2.12. Suppose that f : D ⊆ R^n → R is differentiable in an open neighborhood of a local maximizer x* ∈ D. Then ∇f(x*) = 0.

Proof. By way of contradiction, suppose that ∇f(x*) ≠ 0. The differentiability of f implies that ∇f(x) is continuous in the open neighborhood of x*. Thus, for all ε, there is a δ so that if ||x* − x|| < δ, then ||∇f(x*) − ∇f(x)|| < ε. Let h = ∇f(x*). Trivially, h^T h > 0 and (by continuity), for some t ∈ (0, 1) (perhaps very small), we know that:

(2.10)   h^T ∇f(x* + th) > 0

Let p = th. From the mean value theorem, we know there is an s ∈ (0, 1) such that:

f(x* + p) − f(x*) = ∇f(x* + sp)^T p > 0

by our previous argument. Thus, x* is not a (local) maximum.  ∎

Exercise 8. In the last step of the previous proof, we assert the existence of a t ∈ (0, 1) so that Equation 2.10 holds. Explicitly prove such a t must exist. [Hint: Use a component by component argument with the continuity of ∇f.]

Exercise 9. Construct an analogous proof for the statement: Suppose that f : D ⊆ R^n → R is differentiable in an open neighborhood of a local minimizer x* ∈ D; then ∇f(x*) = 0.

Exercise 10. Construct an alternate proof of Theorem 2.12 by studying the single-variable function g(t) = f(x + th). You may use the fact from Math 140 that g′(t*) = 0 is a necessary condition for a local maximum in the one dimensional case. (Extra credit if you prove that fact as well.)

Theorem 2.13. Suppose that f : D ⊆ R^n → R is twice differentiable in an open neighborhood of a local maximizer x* ∈ D. Then ∇f(x*) = 0 and H(x*) = ∇^2 f(x*) is negative semidefinite.

Proof. From Theorem 2.12, ∇f(x*) = 0. Our assumption that f is twice differentiable implies that H(x) is continuous, and therefore for all ε there exists a δ so that if ||x* − x|| < δ then |H_ij(x*) − H_ij(x)| < ε for all i = 1, ..., n and j = 1, ..., n. We are just asserting pointwise continuity for the elements of the matrix H(x). Suppose we may choose a vector h so that h^T H(x*) h > 0, and thus H(x*) is not negative semidefinite. Then by our continuity argument, we can choose an h with norm small enough that h^T H(x* + h) h > 0. From Lemma 2.9 we have, for some t ∈ (0, 1):

(2.11)   f(x* + h) − f(x*) = (1/2) h^T H(x* + th) h

since ∇f(x*) = 0. Then it follows that f(x* + h) − f(x*) > 0 and thus we have found a direction in which the value of f increases, so x* cannot be a local maximum.  ∎

Theorem 2.14. Suppose that f : D ⊆ R^n → R is twice differentiable. If ∇f(x*) = 0 and H(x*) = ∇^2 f(x*) is negative definite, then x* is a local maximum.

Proof. Applying Lemma 2.9, we know for any h ∈ R^n there is a t ∈ (0, 1) so that:

(2.12)   f(x* + h) = f(x*) + (1/2) h^T ∇^2 f(x* + th) h

By the same argument as in the proof of Theorem 2.13, we know that there is an ε > 0 so that if ||h|| < ε then for all t ∈ (0, 1), ∇^2 f(x* + th) is negative definite (since ∇^2 f(x*) is negative definite). Let B_ε(x*) be the open ball centered at x* with radius ε.
Thus we can see that for all x ∈ B_ε(x*):

(1/2) h^T ∇^2 f(x) h < 0

where x = x* + th for some appropriately chosen h and t ∈ (0, 1). Equation 2.12 combined with the previous observation shows that for all x ∈ B_ε(x*), f(x) < f(x*) and thus x* is a local maximum. This completes the proof.  ∎

Exercise 11. Theorem 2.14 provides sufficient conditions for x* to be a strict local maximum. Give an example showing that the conditions are not necessary.

Example 2.15. The importance of the Cholesky Decomposition is now apparent. To confirm that a point is a local maximum, we must simply check that the Hessian at the point in question is negative definite, which can be done with the Cholesky Decomposition on the negative of the Hessian. Local minima can be verified by checking that the Hessian itself is positive definite. Consider the function:

(1 − x)^2 + 100 (y − x^2)^2

The gradient of this function is:

[−2 + 2x − 400(y − x^2)x,  200(y − x^2)]^T

while the Hessian of this function is:

[2 − 400y + 1200x^2    −400x]
[      −400x             200]

At the point x = 1, y = 1, it is easy to see that the gradient is 0, while the Hessian is:

[ 802   −400]
[−400    200]

This matrix has a Cholesky decomposition:

[     √802              0       ]
[−(200/401)√802    (10/401)√802 ]

Since the Hessian is positive definite, we know that x = 1, y = 1 must be a local minimum for this function. As it happens, this is a global minimum for this function.
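Assuming the CholeskyDecomp procedure of Algorithm 1 has already been read into the Maple session, the definiteness check in Example 2.15 can be reproduced directly (for a local maximum one would apply the same test to the negated Hessian):

H := Matrix([[802, -400], [-400, 200]]):
res := CholeskyDecomp(H, 2):
res[2];                                                    # true: H is positive definite
simplify(res[1] . LinearAlgebra:-Transpose(res[1]) - H);   # the zero matrix: H = LL^T

The second line returning true is exactly the statement that the factorization succeeded, which by Theorem 1.45 certifies positive definiteness.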
3. Concave/Convex Functions and Convex Sets

Definition 2.16 (Convex Set). Let X ⊆ R^n. Then the set X is convex if and only if for all pairs x_1, x_2 ∈ X we have λx_1 + (1 − λ)x_2 ∈ X for all λ ∈ [0, 1].

Theorem 2.17. The intersection of a finite number of convex sets in R^n is convex.

Proof. Let C_1, ..., C_m ⊆ R^n be a finite collection of convex sets. Let

(2.13)   C = ∩_{i=1}^{m} C_i

be the set formed from the intersection of these sets. Choose x_1, x_2 ∈ C and λ ∈ [0, 1]. Consider x = λx_1 + (1 − λ)x_2. We know that x_1, x_2 ∈ C_1, ..., C_m by definition of C. By convexity of each set, x ∈ C_1, ..., C_m. Therefore, x ∈ C. Thus C is a convex set.  ∎

Definition 2.18 (Convex Function). A function f : R^n → R is a convex function if it satisfies:

(2.14)   f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2)

for all x_1, x_2 ∈ R^n and for all λ ∈ [0, 1]. When the inequality is strict for λ ∈ (0, 1), the function is a strictly convex function.

Example 2.19. This definition is illustrated in Figure 2.2. When f is a univariate function, this definition can be shown to be equivalent to the definition you learned in Calculus I (Math 140) using first and second derivatives.

Figure 2.2. A convex function: A convex function satisfies the expression f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2) for all x_1 and x_2 and λ ∈ [0, 1].

Definition 2.20 (Concave Function). A function f : R^n → R is a concave function if it satisfies:

(2.15)   f(λx_1 + (1 − λ)x_2) ≥ λf(x_1) + (1 − λ)f(x_2)

for all x_1, x_2 ∈ R^n and for all λ ∈ [0, 1]. When the inequality is strict for λ ∈ (0, 1), the function is a strictly concave function. To visualize this definition, simply flip Figure 2.2 upside down.

The following theorem is a powerful tool that can be used to show sets are convex. Its proof is outside the scope of the class, but relatively easy.

Theorem 2.21. Let f : R^n → R be a convex function. Then the set C = {x ∈ R^n : f(x) ≤ c}, where c ∈ R, is a convex set.  ∎

Exercise 12. Prove Theorem 2.21.

Definition 2.22 (Linear Function). A function z : R^n → R is linear if there are constants c_1, ..., c_n ∈ R so that:

(2.16)   z(x_1, ..., x_n) = c_1 x_1 + ··· + c_n x_n

Definition 2.23 (Affine Function). A function z : R^n → R is affine if z(x) = l(x) + b where l : R^n → R is a linear function and b ∈ R.

Exercise 13. Prove that every affine function is both convex and concave.

Theorem 2.24. Suppose that g_1, ..., g_m : R^n → R are convex functions and h_1, ..., h_l : R^n → R are affine functions. Then the set:

(2.17)   X = {x ∈ R^n : g_i(x) ≤ 0 (i = 1, ..., m) and h_j(x) = 0 (j = 1, ..., l)}

is convex.

Exercise 14. Prove Theorem 2.24.

Theorem 2.25. Suppose that f : R^n → R is concave and that x* is a local maximizer of f. Then x* is a global maximizer of f.

Proof. Suppose x+ ∈ R^n has the property that f(x+) > f(x*). For any λ ∈ (0, 1) we know that:

f(λx* + (1 − λ)x+) ≥ λf(x*) + (1 − λ)f(x+)

Since x* is a local maximum, there is an ε > 0 so that for all x ∈ B_ε(x*), f(x*) ≥ f(x). Choose λ so that x = λx* + (1 − λ)x+ is in B_ε(x*). Let r = f(x+) − f(x*); by assumption r > 0. Then we have:

f(x) ≥ λf(x*) + (1 − λ)(f(x*) + r)

But this implies that:

f(x) ≥ f(x*) + (1 − λ)r > f(x*)

But x ∈ B_ε(x*) by choice of λ, which contradicts our assumption that x* is a local maximum. Thus, x* must be a global maximum.  ∎

Theorem 2.26. Suppose that f : R^n → R is strictly concave and that x* is a global maximizer of f. Then x* is the unique global maximizer of f.

Exercise 15. Prove Theorem 2.26. [Hint: Proceed by contradiction as in the proof of Theorem 2.25.]

Remark 2.27. We generally think of a function as either being convex (or concave) or not. However, it is often useful to think of functions being convex (or concave) on a subset of their domain. This can be important for determining the existence of (global) maxima on a constrained region.

Definition 2.28. Let f : D ⊆ R^n → R and suppose that X ⊆ D. Then f is convex on X if for all x_1, x_2 ∈ int(X) and λ ∈ [0, 1]:

(2.18)   f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2)

Example 2.29. Consider the function f(x) = −(x − 4)^4 + 3(x − 4)^3 + 6(x − 4)^2 − 3(x − 4) + 100. This function is neither convex nor concave. However, on a subset of its domain (say x ∈ [6, ∞)) the function is concave.

Figure 2.3. (Left) A simple quartic function with two local maxima and one local minimum. (Right) A segment of the function that is locally concave.

4. Concave Functions and Differentiability

Remark 2.30. The following theorem is interesting, but its proof is outside the scope of the course. There is a proof for the one-dimensional case in Rudin [Rud76]. The proof for the general case can be derived from this. The general proof can also be found in Appendix B of [Ber99].

Theorem 2.31. Every concave (convex) function is continuous on the interior of its domain.  ∎

Remark 2.32. The proofs of the next two theorems are variations on those found in [Ber99], with some details added for clarity.

Theorem 2.33. A differentiable function f : R^n → R is concave if and only if for all x_0, x ∈ R^n:

(2.19)   f(x) ≤ f(x_0) + ∇f(x_0)^T (x − x_0)

Proof. (⇐) Suppose Inequality 2.19 holds. Let x_1, x_2 ∈ R^n, let λ ∈ (0, 1) and let x_0 = λx_1 + (1 − λ)x_2.
We may write:

f(x_1) ≤ f(x_0) + ∇f(x_0)^T (x_1 − x_0)
f(x_2) ≤ f(x_0) + ∇f(x_0)^T (x_2 − x_0)

Multiplying the first inequality by λ and the second by 1 − λ and adding yields:

λf(x_1) + (1 − λ)f(x_2) ≤ λf(x_0) + (1 − λ)f(x_0) + λ∇f(x_0)^T (x_1 − x_0) + (1 − λ)∇f(x_0)^T (x_2 − x_0)

Simplifying the inequality, we have:

λf(x_1) + (1 − λ)f(x_2) ≤ f(x_0) + ∇f(x_0)^T (λ(x_1 − x_0) + (1 − λ)(x_2 − x_0))

which simplifies further to:

λf(x_1) + (1 − λ)f(x_2) ≤ f(x_0) + ∇f(x_0)^T (λx_1 + (1 − λ)x_2 − x_0) = f(x_0)

because we assumed that x_0 = λx_1 + (1 − λ)x_2.

(⇒) Now let x, x_0 ∈ R^n and let λ ∈ (0, 1). Define h = x − x_0 and let:

(2.20)   g(λ) = [f(x_0 + λh) − f(x_0)] / λ

Clearly, as λ approaches 0 from the right, g(λ) approaches the directional derivative of f at x_0 in the direction h.

Claim 1. The function g(λ) is monotonically decreasing.

Proof. Consider λ_1, λ_2 ∈ (0, 1) with λ_1 < λ_2 and let α = λ_1/λ_2. Define z = x_0 + λ_2 h. Note:

αz + (1 − α)x_0 = x_0 + α(z − x_0) = x_0 + (λ_1/λ_2)(λ_2 h) = x_0 + λ_1 h

By concavity of f:

(2.21)   f(x_0 + α(z − x_0)) ≥ αf(z) + (1 − α)f(x_0) = f(x_0) + αf(z) − αf(x_0)

Simplifying, we obtain:

(2.22)   [f(x_0 + α(z − x_0)) − f(x_0)] / α ≥ f(z) − f(x_0)

Since z = x_0 + λ_2 h, so that z − x_0 = λ_2 h, the left hand side can be rewritten:

(2.23)   [f(x_0 + (λ_1/λ_2)(λ_2 h)) − f(x_0)] / α ≥ f(z) − f(x_0)

which is:

(2.24)   [f(x_0 + λ_1 h) − f(x_0)] / (λ_1/λ_2) ≥ f(x_0 + λ_2 h) − f(x_0)

Lastly, dividing both sides by λ_2 yields:

(2.25)   [f(x_0 + λ_1 h) − f(x_0)] / λ_1 ≥ [f(x_0 + λ_2 h) − f(x_0)] / λ_2

Thus g(λ) is monotonically decreasing. This completes the proof of the claim.  ∎

Since g(λ) is monotonically decreasing, we must have:

(2.26)   lim_{λ→0+} g(λ) ≥ g(1)

But this implies that:

(2.27)   lim_{λ→0+} [f(x_0 + λh) − f(x_0)] / λ ≥ f(x_0 + h) − f(x_0) = f(x_0 + x − x_0) − f(x_0) = f(x) − f(x_0)

since h = x − x_0. Applying Theorem 1.22, the inequality becomes:

(2.28)   ∇f(x_0)^T h ≥ f(x) − f(x_0)

which can be rewritten as:

(2.29)   f(x) ≤ f(x_0) + ∇f(x_0)^T h = f(x_0) + ∇f(x_0)^T (x − x_0)

This completes the proof.  ∎

Exercise 16. Argue that for strictly concave functions, Inequality 2.19 is strict and the theorem still holds.

Exercise 17. State and prove a similar theorem for convex functions.

Theorem 2.34. Let f : R^n → R be a twice differentiable function. If f is concave, then ∇^2 f(x) is negative semidefinite.

Proof. Suppose that there is a point x_0 ∈ R^n and h ∈ R^n such that h^T ∇^2 f(x_0) h > 0. By the same argument as in the proof of Theorem 2.13, we may choose an h with a small enough norm that for every t ∈ [0, 1], h^T ∇^2 f(x_0 + th) h > 0 as well. By Lemma 2.9, for x = x_0 + h we have some t ∈ (0, 1) so that:

(2.30)   f(x) = f(x_0) + ∇f(x_0)^T h + (1/2) h^T ∇^2 f(x_0 + th) h

But since h^T ∇^2 f(x_0 + th) h > 0, we know that

(2.31)   f(x) > f(x_0) + ∇f(x_0)^T h

and thus by Theorem 2.33 (Inequality 2.19), f cannot be concave, a contradiction. This completes the proof.  ∎
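Inequality 2.19 is easy to spot-check numerically. A minimal Maple sketch (the concave test function, the base point and the sample points are all arbitrary choices for illustration):

f := (x, y) -> -x^2 - y^2:                      # concave: the Hessian is -2 times the identity
g := <D[1](f)(1, -2), D[2](f)(1, -2)>:          # gradient (-2, 4) at x0 = (1, -2)
for pt in [[0, 0], [3, 1], [-1, 5]] do
  L := f(pt[1], pt[2]);                         # left side of Inequality 2.19
  R := f(1, -2) + g . <pt[1] - 1, pt[2] + 2>;   # right side: tangent plane value
  print(evalb(L <= R));                         # true for each sample point
end do:

Geometrically, the loop is checking that the graph of a concave function lies below every one of its tangent planes, which is exactly what Theorem 2.33 says.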
Theorem 2.35. Let f : R^n → R be a twice differentiable function. If ∇^2 f(x) is negative semidefinite, then f is concave.

Proof. From Lemma 2.9, we know that for every x, x_0 ∈ R^n there is a t ∈ (0, 1) such that, with h = x − x_0:

(2.32)   f(x) = f(x_0) + ∇f(x_0)^T h + (1/2) h^T ∇^2 f(x_0 + th) h

Since ∇^2 f(x) is negative semidefinite, it follows that (1/2) h^T ∇^2 f(x_0 + th) h ≤ 0 and thus:

(2.33)   f(x) ≤ f(x_0) + ∇f(x_0)^T (x − x_0)

Thus by Theorem 2.33, f is concave.  ∎

Exercise 18. Suppose that f : R^n → R with f(x) = x^T Hx, where x ∈ R^n and H ∈ R^(n×n). Show that f(x) is convex if and only if H is positive semidefinite.

Exercise 19. State and prove a theorem on the nature of the Hessian matrix when f(x) is strictly concave.

Remark 2.36 (A Last Remark on Concave and Convex Functions). In this chapter, we've focused primarily on concave functions. It's clear there are similar theorems for convex functions, and many books focus on convex functions and minimization. In these notes, we'll focus on maximization, because the geometry of certain optimality conditions makes more sense on first exposure in a maximization problem. If you want to see the minimization side of the story, look at [Ber99] and [BSS06]. In reality, it doesn't matter: any maximization problem can be converted to a minimization problem and vice versa.

CHAPTER 3

Introduction to Gradient Ascent and Line Search Methods

1. Gradient Ascent Algorithm

Remark 3.1. In Theorem 1.23, we proved that the gradient points in the direction of fastest increase for a function f : R^n → R. If we are trying to identify a point x* ∈ R^n that is a local or global maximum for f, then a reasonable approach is to walk along some direction p so that ∇f(x)^T p > 0.

Definition 3.2. Let f : R^n → R be a continuous and differentiable function and let p ∈ R^n. If ∇f(x)^T p > 0, then p is called an ascent direction.

Remark 3.3. Care must be taken, however, since ∇f(x) represents only the direction of fastest increase at x. As a result, we only want to take a small step in the direction of p, then re-evaluate the gradient and continue until a stopping condition is reached.

Basic Ascent Algorithm
Input: f : R^n → R, a function to maximize; x_0 ∈ R^n, a starting position
Initialize: k = 0
(1) do
(2)   Choose p_k ∈ R^(n×1) and δ_k ∈ R_+ so that ∇f(x_k)^T p_k > 0.
(3)   x_{k+1} := x_k + δ_k p_k
(4) while some stopping criteria are not met.
Output: x_{k+1}

Algorithm 2. Basic Ascent Algorithm

Remark 3.4. There are some obvious ambiguities with Algorithm 2. We have neither specified how to choose p_k nor δ_k in Line (2) of Algorithm 2, nor have we defined specific stopping criteria for the while loop. More importantly, we'd like to prove that there is a way of choosing p_k so that when we use this method at Line (2), the algorithm both converges (i.e., at some point we exit the loop in Lines (1)-(4)) and, when we exit, we have identified a local maximum, or at least a point that satisfies the necessary conditions for a local maximum (see Theorem 2.12).

Remark 3.5. For the remainder of these notes, we will assume that:

p_k = B_k^(−1) ∇f(x_k)

where B_k ∈ R^(n×n) is some appropriately chosen symmetric and non-singular matrix.

Definition 3.6 (Gradient Ascent). When B_k = I_n (the n × n identity matrix), Algorithm 2 is called Gradient Ascent.

Definition 3.7 (Newton's Method). When B_k = −∇^2 f(x_k), for some δ_k ∈ R_+, Algorithm 2 is called Newton's Method, which we will study in the next chapter.
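Before analyzing Algorithm 2, here is a minimal Maple sketch of the gradient ascent case (Definition 3.6) with a fixed step size; GradientAscent is a hypothetical helper written for illustration, the fixed δ, the tolerance and the iteration cap are arbitrary choices, and the line search methods developed below are what choose δ_k properly:

GradientAscent := proc (gradf::operator, x0::Vector, delta::numeric,
                        tol::numeric, maxit::integer)::Vector;
  local x, g, k;
  x := Vector(x0);                             # working copy of the start point
  for k to maxit do
    g := gradf(x);                             # ascent direction p_k = grad f(x_k)
    if LinearAlgebra:-Norm(g, 2) < tol then break end if;
    x := x + delta*g                           # x_{k+1} := x_k + delta*p_k
  end do;
  x
end proc:
# Example: f(x,y) = -2x^2 - 10y^2 (the function of Figure 4.5), from (15, 5):
gradf := v -> <-4*v[1], -20*v[2]>:
GradientAscent(gradf, <15, 5>, 0.04, 1e-6, 10000);   # converges toward (0, 0)

Note how fragile the fixed step is: with δ = 0.04 the iteration contracts, but δ = 0.11 would make the y-component oscillate and diverge, which is precisely why the step size must be chosen carefully at each iteration.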
2. Some Results on the Basic Ascent Algorithm

Lemma 3.8. Suppose that B_k is positive definite. Then p_k = B_k^(−1) ∇f(x) is an ascent direction. That is, ∇f(x)^T p_k > 0.

Proof. By Corollary 1.37, B_k^(−1) is positive definite. Thus:

∇f(x)^T p_k = ∇f(x)^T B_k^(−1) ∇f(x) > 0

by definition.  ∎

Remark 3.9. Since B_k = I_n is always positive definite and non-singular, we need never worry that p_k = ∇f(x) is not a properly defined ascent direction in gradient ascent.

Remark 3.10. As we've noted, we cannot simply choose δ_k arbitrarily in Line (2) of Algorithm 2, since ∇f(x) is only the direction of greatest ascent at x and thus p_k = B_k^(−1) ∇f(x_k) is only an ascent direction in a neighborhood about x. If we define:

(3.1)   φ(δ_k) = f(x_k + δ_k p_k)

then our problem is to solve:

(3.2)   max  φ(δ_k)
        s.t. δ_k ≥ 0

This is an optimization problem in a single variable δ_k, and assuming its solution is δ_k* and we compute x_{k+1} := x_k + δ_k* p_k, then we will assuredly have increased (or at least not decreased) the value of f(x_{k+1}) compared to f(x_k).

Remark 3.11. As we'll see in Section 2, it's not enough to just increase the value of f(x_k) at each iteration k; we must increase it by a sufficient amount. One of the easiest, though not necessarily the most computationally efficient, ways to make sure this happens is to ensure that we identify a solution to the problem in Expression 3.2. In the following sections, we discuss several methods for determining a solution.

3. Maximum Bracketing

Remark 3.12. For the remainder of this section, let us assume that φ(x) is a one dimensional function that we wish to maximize. One problem with attempting to solve Problem 3.2 is that the half-interval δ_k ≥ 0 is unbounded, and thus we could have to search a very long way to find a solution. To solve this problem, we introduce the notion of a constrained search on an interval [a, c], where we assume that we have a third value b ∈ (a, c) such that φ(a) < φ(b) and φ(b) > φ(c). Such a triple (a, b, c) is called a bracket.

Proposition 3.13. For simplicity of notation, write φ(x) = φ_x ∈ R. The coefficients of the quadratic curve rx^2 + sx + t passing through the points (a, φ_a), (b, φ_b), (c, φ_c) are given by the expressions:

r = −(−bφ_a + bφ_c − aφ_c + φ_a c − φ_b c + φ_b a) / (b^2 c − b^2 a − bc^2 + ba^2 − a^2 c + ac^2)

s = (−a^2 φ_c + a^2 φ_b − φ_a b^2 + φ_a c^2 − c^2 φ_b + φ_c b^2) / ((−c + a)(ab − ac − b^2 + cb))

t = (−φ_b a^2 c + a^2 bφ_c + ac^2 φ_b − aφ_c b^2 − c^2 bφ_a + cφ_a b^2) / ((−c + a)(ab − ac − b^2 + cb))

Exercise 20. Prove Proposition 3.13. [Hint: Use a computer algebra system.]

Corollary 3.14. The maximum (or minimum) of the quadratic curve rx^2 + sx + t passing through the points (a, φ_a), (b, φ_b), (c, φ_c) is given by:

(3.3)   u = (1/2) (−a^2 φ_c + a^2 φ_b − φ_a b^2 + φ_a c^2 − c^2 φ_b + φ_c b^2) / (−bφ_a + bφ_c − aφ_c + φ_a c − φ_b c + φ_b a)

Remark 3.15. The Maple code for finding the maximum point for the curve passing through (a, φ_a), (b, φ_b), (c, φ_c) is shown below:

QuadraticTurn := proc (a::numeric, fa::numeric, b::numeric, fb::numeric,
                       c::numeric, fc::numeric)::list;
  local qf, r, s, t, SOLN;
  qf := proc (x) options operator, arrow; r*x^2 + s*x + t end proc;
  SOLN := solve([fa = qf(a), fb = qf(b), fc = qf(c)], [r, s, t]);
  [eval(-(1/2)*s/r, SOLN[1]),
   evalb(nops(SOLN) > 0 and eval(r < 0, SOLN[1]))]
end proc:

Note that we actually return a list whose second element is true if and only if there is a solution to the proposed problem and the point is the maximum of a quadratic approximation, rather than the minimum. Note also, for simplicity, we use fa, fb, and fc instead of φ_a, φ_b and φ_c.
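A quick sanity check of QuadraticTurn on a function whose turning point we know exactly (the test parabola is an arbitrary choice for this illustration):

phi := x -> -(x - 2)^2 + 5:                 # maximum at x = 2
QuadraticTurn(0, phi(0), 1, phi(1), 3, phi(3));   # returns [2, true]

Since the interpolating quadratic through three points of a parabola is the parabola itself, the returned turning point is exact here; for a general φ it is only the maximum of the local quadratic model.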
Remark 3.16. In our case, we know that a = 0 and we could choose an arbitrarily small value b = a + ε as our second value. The problem is identifying the point c. To solve this problem, we introduce a bracketing algorithm (see Algorithm 3). Essentially, the bracketing algorithm keeps moving the points a, b and c further and further to the right using a quadratic approximation of the function when possible. Note, this algorithm assumes that the local behavior of φ(x) is concave, which is true in our line search, especially when f(x) is concave.

Remark 3.17. Note that in Algorithm 3 our φ(x) is written f(x), and we use τ for the Golden ratio, which is used to define our initial value c = b + τ·(b − a). This is also used when the parabolic approximation of the function fails. In practice, a tiny term is usually added to the denominator of the fraction in Corollary 3.14 to prevent numerical instability (see [PTVF07], Page 491). Also note that we compute φ(x) many times; the number of times φ(x) is computed can be reduced by storing these values, as discussed in [PTVF07], Page 491.

ParabolicBracket := proc (phi::operator, a::numeric, b::numeric)::list;
  local aa, bb, cc, tau, ret, M, umax, u, bSkip, uu;
  tau := evalf(1/2 + (1/2)*sqrt(5));
  M := 10.0;
  aa := a; bb := b;
  if phi(bb) < phi(aa) and aa < bb then
    ret := []
  else
    cc := bb + tau*(bb - aa);
    ret := [aa, bb, cc]:
    while phi(bb) < phi(cc) do
      umax := (cc - bb)*M + cc;
      uu := QuadraticTurn(aa, phi(aa), bb, phi(bb), cc, phi(cc));
      u := uu[1];
      print(sprintf("a=%f\t\tb=%f\t\tc=%f\t\tu=%f", aa, bb, cc, u));
      if uu[2] then
        bSkip := false;
        if aa < u and u < bb then
          if phi(aa) < phi(u) and phi(bb) < phi(u) then
            ret := [aa, u, bb];
            bb := u; cc := bb
          else
            aa := bb; bb := cc; cc := bb + tau*(bb - aa);
            ret := [aa, bb, cc]
          end if:
        elif bb < u and u < cc then
          if phi(bb) < phi(u) and phi(cc) < phi(u) then
            ret := [bb, u, cc];
            aa := bb; bb := u
          else
            aa := bb; bb := u; cc := cc; cc := bb + tau*(cc - bb);
            ret := [aa, bb, cc]
          end if:
        elif cc < u and u < umax then
          if phi(cc) < phi(u) then
            aa := bb; bb := cc; cc := u; cc := bb + tau*(cc - bb);
            ret := [aa, bb, cc]
          else
            aa := bb; bb := cc; cc := bb + tau*(bb - aa);
            ret := [aa, bb, cc]
          end if:
        elif umax < u then
          aa := bb; bb := cc; cc := bb + tau*(bb - aa);
          ret := [aa, bb, cc]
        end if:
      else
        aa := bb; bb := cc; cc := bb + tau*(bb - aa);
        ret := [aa, bb, cc]
      end if:
    end do:
  end if:
  ret
end proc:

Algorithm 3. Bracketing Algorithm

Proposition 3.18. If φ(x) has a maximum value on R_+, then Algorithm 3 will identify a bracket containing a local maximum.

Proof. The proof is straightforward. The algorithm will continue to move the bracket (a, b, c) rightward, ensuring that φ(a) < φ(b). At some point, the algorithm will choose c so that φ(b) > φ(c) with certainty (since φ(x) has a maximum and its maximum is contained in R_+). When this occurs, the algorithm terminates.  ∎

Example 3.19. Consider the following function:

φ(x) = −(x − 4)^4 + 3(x − 4)^3 + 6(x − 4)^2 − 3(x − 4) + 100

The plot of this function is shown in Figure 2.3. When we execute the parabolic bracketing algorithm on this function starting at a = 0 and b = 0.25, we obtain the output:
"a=0.000000 b=0.250000 c=0.654508 u=1.580537"
"a=0.250000 b=0.654508 c=2.152854 u=2.095864"
"a=0.654508 b=2.095864 c=2.188076 u=2.461833"
"a=2.095864 b=2.188076 c=2.631024 u=2.733759"
"a=2.188076 b=2.631024 c=2.797253 u=2.839039"
"a=2.631024 b=2.797253 c=2.864864 u=2.889411"
"a=2.797253 b=2.864864 c=2.904582 u=2.899652"
Output = [2.864863884, 2.899652455, 2.904582162]
When we execute the algorithm starting at a = 4.3 and b = 4.5, we obtain the output:
"a=4.300000 b=4.500000 c=4.823607 u=4.234248"
"a=4.500000 b=4.823607 c=5.347214 u=4.235687"
"a=4.823607 b=5.347214 c=6.194427 u=3.781555"
"a=5.347214 b=6.194427 c=7.565248 u=7.317872"
Output = [6.194427192, 7.317871765, 7.565247585]
The actual (local) maxima for this function occur at x ≈ 2.900627632 and x ≈ 7.13 (the stationary point at x ≈ 4.217851814 between them is a local minimum), showing that each run brackets a local maximum as expected.

Remark 3.20. Parabolic fitting is often overly complicated, especially if the function φ(x) is locally concave. In this case, we can simply compute c = b + α(b − a) for some α > 0 and successively reassign a = b and b = c.

Exercise 21. Write out the algorithm discussed in Remark 3.20 and prove that if φ(x) has a maximum on [0, ∞), this algorithm will eventually bracket the maximum if you start with a = 0 and b = a + ǫ.

Exercise 22. Compare the bracket size obtained from Algorithm 3 vs. the algorithm you wrote in Exercise 21. Is there a substantial difference in the bracket size?

4. Dichotomous Search

Remark 3.21. Assuming we have a bracket around a (local) maximum of φ(x), it is now our objective to find a value close to the value x* such that φ(x*) yields that (local) maximum value. A dichotomous search is a popular (but not the most efficient) method for finding such a value. The Maple code for Dichotomous Search is shown in Algorithm 4.

Remark 3.22. Dichotomous Search begins with an interval around which we are certain there is a maximum. The interval is repeatedly reduced until the maximum is localized to a small enough interval. At Lines 11 and 12, two test points are computed. Depending upon the value of the function φ at these points, the interval is sub-divided to either the left or the right. This is illustrated in Figure 3.1. If the value of the function is the same at both test points, then the direction is chosen arbitrarily; in the implementation shown in Algorithm 4, the algorithm always chooses the right half of the interval. Convergence can be sped up by instead defining the test points as x1 = u − δ and x2 = u + δ for some small δ > 0.

Exercise 23. For some small δ (say 0.01), empirically compare the rate of convergence of the Dichotomous Search Algorithm shown in Algorithm 4 vs. the rate of the modified algorithm that uses the test points u ± δ.

1 DichotomousSearch := proc (phi::operator, a::numeric, b::numeric,
2 epsilon::numeric)::real;
3 local u, aa, bb, x;
4 if a < b then aa := a; bb := b
5 else bb := a; aa := b
6 end if:
7 u := (1/2)*aa+(1/2)*bb;
8 print("a\t\t\t\t\tb");
9 while epsilon < bb-aa do
10 print(sprintf("%f \t\t\t %f", aa, bb));
11 x[1] := (1/2)*aa+(1/2)*u;
12 x[2] := (1/2)*bb+(1/2)*u;
13 if phi(x[2]) < phi(x[1]) then
14 bb := x[2]; u := (1/2)*aa+(1/2)*bb
15 else
16 aa := x[1]; u := (1/2)*aa+(1/2)*bb
17 end if:
18 end do;
19 (1/2)*aa+(1/2)*bb
20 end proc

Algorithm 4. Dichotomous Search

Figure 3.1. Dichotomous Search iteratively refines the size of a bracket containing a maximum of the function φ by splitting the bracket in two.
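As a usage sketch (assuming Algorithm 4 has been loaded into a Maple session; the name phi below is ours), the following call searches the bracket [0, 10] for the maximizer of a concave test function:

phi := x -> 10 - (x - 5)^2;       # concave, with maximizer x* = 5
DichotomousSearch(phi, 0, 10, 0.01);  # prints the shrinking bracket; should
                                      # return a value within 0.01 of x* = 5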
Suppose φ(x) is strictly concave and attains its unique maximum on the (initial) interval [a, b] at x*. Then Dichotomous Search will converge to a point x+ such that |x+ − x*| < ǫ.

Proof. We will show by induction on the algorithm iteration that x* is always in the interval maintained by the algorithm. At initialization, this is clearly true. Suppose after n iterations we have the interval [l, r] and x* ∈ [l, r]. Let u = (l + r)/2 and, without loss of generality, suppose x* ∈ [u, r]. Let x1 and x2 be the two test points with x1 ∈ [l, u] and x2 ∈ [u, r]. If φ(x2) > φ(x1), then at the next iteration the interval is [x1, r] and clearly x* ∈ [x1, r]. Conversely, suppose that φ(x1) > φ(x2). Since φ(x) is strictly concave and must always lie above its secant, it follows that x* ∈ [x1, x2] (the function must reach its turning point between x1 and x2); combined with x* ∈ [u, r], this gives x* ∈ [u, x2]. The next interval when φ(x1) > φ(x2) is [l, x2], and thus x* ∈ [l, x2]. It follows by induction that x* is contained in the interval computed at each iteration of the while loop at Line 9. Suppose that [l*, r*] is the final interval upon termination of the while loop. We know that x* ∈ [l*, r*] and r* − l* < ǫ; since the returned point x+ also lies in [l*, r*], we have |x+ − x*| < ǫ. This completes the proof. ∎

Theorem 3.24. Suppose φ(x) is concave and attains a maximum on the (initial) interval [a, b] at x*. Assume that at each iteration, the interval of uncertainty around x* is reduced by a factor of r ∈ (0, 1). Then Dichotomous Search converges in
n = ⌈ log(ǫ/|b − a|) / log(r) ⌉
steps.

Exercise 24. Prove Theorem 3.24. Illustrate the correctness of your proof by implementing Dichotomous Search and maximizing the function φ(x) = 10 − (x − 5)² on the interval [0, 10].
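To make the bound in Theorem 3.24 concrete: in Algorithm 4 the test points sit at the quarter positions of the bracket, so each iteration retains three quarters of the current interval, i.e., r = 3/4. For the problem in Exercise 24, with |b − a| = 10 and ǫ = 0.01, the bound gives n = ⌈log(0.001)/log(0.75)⌉ = 25 iterations.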
Example 3.25. The assumption that φ(x) is concave (on [a, b]) is not necessary to ensure convergence to the maximum within [a, b]. Generally speaking, it is sufficient to assume φ(x) is unimodal on [a, b]. To see this, consider the function:
φ(x) = 2x + 1 if x < 2; 5 if 2 ≤ x ≤ 8; x − 3 if 8 < x ≤ 9; −x + 15 if x > 9
The function is illustrated in Figure 3.2.

Figure 3.2. A non-concave function with a maximum on the interval [0, 15].

If we execute the Dichotomous Search algorithm as illustrated in Algorithm 4, it will converge to the bracket [8.99, 9.00] even though this function is not concave on the interval. The output from the algorithm is shown below:
"a b"
"0.000000 15.000000"
"0.000000 11.250000"
"2.812500 11.250000"
"4.921875 11.250000"
"6.503906 11.250000"
"6.503906 10.063477"
"7.393799 10.063477"
"8.061218 10.063477"
"8.061218 9.562912"
"8.436642 9.562912"
"8.718209 9.562912"
"8.718209 9.351736"
"8.718209 9.193355"
"8.836996 9.193355"
"8.836996 9.104265"
"8.903813 9.104265"
"8.903813 9.054152"
"8.941398 9.054152"
"8.969586 9.054152"
"8.969586 9.033010"
"8.969586 9.017154"
"8.981478 9.017154"
"8.990397 9.017154"
"8.990397 9.010465"
"8.990397 9.005448"
"8.994160 9.005448"
Output = 9.001215067

5. Golden Section Search

Remark 3.26. In the Dichotomous search, each iteration requires two new evaluations of the value of φ. If φ is very costly to evaluate, it might be more efficient to construct an algorithm that evaluates φ(x) only once at each iteration by cleverly constructing the intervals. The Golden Section Search does exactly that.

Remark 3.27. At each iteration of the golden section search, there are four points under consideration, x1 < x2 < x3 < x4, where x1 and x4 are the left and right endpoints of the current interval. At the next iteration, if φ(x2) > φ(x3), then x1 remains the left end point, x3 becomes the right end point and x2 plays the role of x3 in the next iteration. On the other hand, if φ(x2) < φ(x3), then x4 remains the right endpoint, x2 becomes the left endpoint and x3 plays the role of x2 in the next iteration. The objective in the Golden Section Search is to ensure that the sizes of the intervals remain proportionally constant throughout algorithm execution.

Proposition 3.28. Let [x1, x4] be an initial bracket containing a maximum of the function φ(x). To ensure that the ratio of x2 − x1 to x4 − x2 is kept constant from iteration to iteration, we must take x2 = (1/τ)x1 + (1 − 1/τ)x4 and x3 = (1 − 1/τ)x1 + (1/τ)x4, where τ = (1/2)(1 + √5) is the Golden ratio.

Proof. Consider Figure 3.3, in which a = x2 − x1, b = x4 − x2 and c = x3 − x2.

Figure 3.3. The relative sizes of the interval and sub-interval lengths in a Golden Section Search.

In the initial interval, the ratio of x2 − x1 to x4 − x2 is a/b. At the next iteration, either φ(x2) < φ(x3) or not (in the case of a tie, an arbitrary left/right choice is made). If φ(x2) > φ(x3), then x2 becomes the new x3 point and thus we require:
a/b = c/a
On the other hand, if φ(x2) < φ(x3), then x3 becomes the new x2 point and we require:
c/(b − c) = a/b
From the first equation, we see that b = a²/c, and substituting this into the second equation yields:
(a/c)² − (a/c) − 1 = 0
which implies:
a/c = τ = b/a
Without loss of generality, suppose that x1 = 0 and x4 = 1. To compute x2, we have:
a/(1 − a) = 1/τ ⟹ a = 1/(1 + τ)
For the Golden Ratio, 1/(1 + τ) = 1 − 1/τ, and thus it follows x2 = (1/τ)x1 + (1 − 1/τ)x4. To compute x3, suppose a + c = r. Then we have:
(r − x2)/(1 − r) = 1/τ
When we replace x2 with 1/(1 + τ), we have:
r = (1 + 2τ)/(1 + τ)²
In the case of the Golden Ratio, the previous fraction reduces to r = 1/τ. Thus we have x3 = (1 − 1/τ)x1 + (1/τ)x4. This completes the proof. ∎

Remark 3.29. The Golden Section Search is illustrated in Algorithm 5. This is, essentially, the Golden Section search as it is implemented in [PTVF07] (Page 495).

1 GoldenSectionSearch := proc (phi::operator, a::numeric, b::numeric,
2 c::numeric, epsilon::numeric)::real;
3 local aa, bb, cc, dd, tau, ret;
4 tau := evalf(1/2+(1/2)*sqrt(5));
5 aa := a; dd := c;
6 if b-a < c-b then
7 bb := b; cc := b+(1-1/tau)*(c-b)
8 else
9 bb := b-(1-1/tau)*(b-a); cc := b
10 end if;
11 while epsilon < dd-aa do
12 print(sprintf("aa=%f\t\tbb=%f\t\tcc=%f\t\tdd=%f", aa, bb, cc, dd));
13 if phi(cc) < phi(bb) then
14 dd := cc; cc := bb;
15 bb := cc/tau+(1-1/tau)*aa;
16 else
17 aa := bb; bb := cc;
18 cc := bb/tau+(1-1/tau)*dd;
19 end if
20 end do;
21 if phi(cc) < phi(bb) then
22 ret := bb;
23 else
24 ret := cc;
25 end if; ret; end proc:

Algorithm 5. Golden Section Search

Theorem 3.30. Suppose φ(x) is strictly concave and attains its unique maximum on the (initial) interval [a, b] at x*. Then Golden Section Search will converge to a point x+ such that |x+ − x*| < ǫ.

Exercise 25. Prove Theorem 3.30.
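To see the single-evaluation property promised in Remark 3.26 numerically: with x1 = 0 and x4 = 1, Proposition 3.28 gives x2 = 1 − 1/τ ≈ 0.382 and x3 = 1/τ ≈ 0.618. If φ(x2) > φ(x3), the new bracket is [0, 0.618], and its interior points prescribed by Proposition 3.28 are (1 − 1/τ)(0.618) ≈ 0.236 and (1/τ)(0.618) ≈ 0.382. The old x2 is reused as the new x3, so only the point 0.236 requires a new evaluation of φ.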
Example 3.31. In Example 3.25, we saw an example of a non-concave function for which Dichotomous Search converged. In this example, we show an instance of a function for which the Golden Section Search does not converge to the global maximum of the function. Let:
φ(x) = 10x if x < 1/2; −10x + 10 if 1/2 ≤ x < 3/4; 5/2 if 3/4 ≤ x < 9; −x + 23/2 if x ≥ 9
The function in question is shown in Figure 3.4. If we use the bracket [0, 11], note that the global functional maximum occurs in a position very close to the left end of the bracket. As a result, the interval search misses the global maximum. This illustrates the importance of tight bracketing in line search algorithms.

Figure 3.4. A function for which Golden Section Search (and Dichotomous Search) might fail to find a global solution.

The output of the search is shown below.
"aa=0.000000 bb=4.201626 cc=6.798374 dd=11.000000"
"aa=4.201626 bb=6.798374 cc=8.403252 dd=11.000000"
"aa=6.798374 bb=8.403252 cc=9.395122 dd=11.000000"
"aa=6.798374 bb=7.790243 cc=8.403252 dd=9.395122"
"aa=7.790243 bb=8.403252 cc=8.782113 dd=9.395122"
"aa=8.403252 bb=8.782113 cc=9.016261 dd=9.395122"
"aa=8.403252 bb=8.637401 cc=8.782113 dd=9.016261"
"aa=8.637401 bb=8.782113 cc=8.871549 dd=9.016261"
"aa=8.782113 bb=8.871549 cc=8.926824 dd=9.016261"
"aa=8.871549 bb=8.926824 cc=8.960986 dd=9.016261"
"aa=8.926824 bb=8.960986 cc=8.982099 dd=9.016261"
"aa=8.960986 bb=8.982099 cc=8.995148 dd=9.016261"
"aa=8.982099 bb=8.995148 cc=9.003213 dd=9.016261"
"aa=8.982099 bb=8.990164 cc=8.995148 dd=9.003213"
"aa=8.990164 bb=8.995148 cc=8.998228 dd=9.003213"
Output = 8.998228439

Remark 3.32. There are other derivative-free line search methods. One of the simplest applies the parabolic approximation we used in the bracketing method in Algorithm 3 (see [BSS06] or [Ber99]). Another method, generally called Brent's method, combines parabolic approximation with the Golden Section method (see [PTVF07], Chapter 10 for an implementation of Brent's Method). These techniques extend and combine the derivative-free methods presented in this section, but do not break substantially new ground beyond their efficiency.

6. Bisection Search

Remark 3.33. We are now going to assume that we can compute the first and possibly second derivative of φ(x). In doing so, we will construct new methods for finding points that maximize the function.

Remark 3.34. Before proceeding any further, we warn the reader that methods requiring function derivatives should use the actual derivative of the function rather than an approximation. While approximate derivatives are necessary in multi-dimensional optimization and often used, line searches can be sensitive to incorrect derivatives. Thus, it is often safer to use a derivative-free method even though it may not be as efficient.

Remark 3.35. The bisection search assumes that for a given univariate function φ(x), we have a bracket [a, b] ⊂ R containing a (local) maximum and, furthermore, that φ(x) is differentiable on this closed interval with φ′(a) > 0 and φ′(b) < 0. As we will see, convergence can be ensured when φ(x) is concave.

Remark 3.36. The bisection search algorithm is illustrated in Algorithm 6. Given the assumption that φ′(a) > 0 and φ′(b) < 0, we choose a test point u ∈ [a, b]. In the implementation, we choose u = (a + b)/2. If φ′(u) > 0, then this point is presumed to lie on the left side of the maximum and we repeat on the new interval [u, b]. If φ′(u) < 0, then u is presumed to lie on the right side of the maximum and we repeat on the new interval [a, u]. If φ′(u) = 0, then we terminate our search and return u as x*, the maximizing value for φ(x).
Because of round off, we will never reach a condition in which φ′ (u) = 0, so instead we test to see if |φ′ (u)| < ǫ, where ǫ is a small constant provided by the user. Theorem 3.37. If φ(x) is concave and continuously differentiable on [a, b] with a maximum x∗ ∈ (a, b), then for all ǫ (an input parameter) there is a δ > 0 so that Bisection Search converges to u with the property that |x∗ − u| ≤ δ. Proof. The fact that φ(x) is concave on [a, b] combined with the fact that φ′ (a) > 0 and φ′ (b) < 0 implies that φ′ (x) is a monotonically decreasing continuous function on the interval [a, b] and therefore if there are two values x∗ and x+ maximizing φ(x) on [a, b] then for all λ ∈ [0, 1], x◦ = λx∗ + (1 − λ)x+ also maximizes φ(x) and by necessity φ′ (x◦ ) = φ′ (x∗ ) = φ′ (x+ ) = 0. Thus, for any iteration if we have the bracket [l, r] if φ′ (l) > 0 and φ′ (r) < 0 we can be sure that x∗ ∈ [l, r]. Suppose after n iterations we have interval [l, r] with the property that φ′ (l) > 0 and φ′ (r) < 0. Lines 15 and 17 of Bisection search ensure that at iteration n + 1 we continue to have a new interval [l′ , r′ ] ⊂ [l, r] with the property that φ′ (l′ ) > 0 and φ′ (r′ ) < 0, assuming that |φ′ (u)| > ǫ. Thus, at each iteration x∗ is in the bracket. Furthermore, since φ′ (x) is monotonically decreasing we can be sure that as each iteration shrinks the length of the interval [l, r], φ′ (u) will not increase. Thus at some point either φ′ (u) < ǫ or |r − l| < ǫ. If the latter occurs, then we know that |x∗ − u| < ǫ since u ∈ [l, r] at the last iteration and δ = ǫ. On the other hand, if φ′ (u) = 0, then u is a maximizer of φ(x) and we may assume that x∗ = u, so δ = 0. Finally, suppose 0 < φ′ (u) < ǫ. Then there is some non-zero δ < |r − l| so that u + δ = x∗ . Thus |x∗ − u| ≤ δ. The case when φ′ < 0 and |φ′ (u)| < ǫ is analogous.  Remark 3.38. Note, we do not put a detailed bound on the δ in the proof of the previous theorem. As a rule of thumb, we’d like, φ′ (x) to have a somewhat nice property like having a continuous inverse function, in which case we could make an exact ǫ-δ style argument to 6. BISECTION SEARCH 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 35 BisectionSearch := proc (phi::operator, a::numeric, b::numeric, epsilon::numeric) ::real; local u, aa, bb, x, y; if a < b then aa := a; bb := b else bb := a; aa := b end if; u := (1/2)*aa+(1/2)*bb; print("a\t\t\t\t\tb"); while epsilon < bb-aa do print(sprintf("%f \t\t\t %f", aa, bb)); y := eval(diff(phi(z), z), z = u); if epsilon < abs(y) then if y < 0 then bb := u; u := (1/2)*aa+(1/2)*bb else aa := u; u := (1/2)*aa+(1/2)*bb end if else break end if end do; u end proc: Algorithm 6. Bisection Search assert that if |φ′ (u)−φ′ (x∗ )| < ǫ, then there better be some small δ so that |u−x∗ | < δ, for an appropriately chosen maximizer x∗ (if the maximizer is not unique). Unfortunately, plain old continuity does not work for this argument. We could require φ′ (x) to be bilipschitz, which would also achieve this effect. See Exercise 27 for an alternative way to bound |u − x∗ | < δ in terms of ǫ. Exercise 26. A function f : R → R is called Lipschitz continuous if there is a K > 0 so that for all x1 and x2 in R we have |f (x1 ) − f (x2 )| ≤ K|x1 − x2 |. A function is bilipschitz if there is a K ≥ 1 so that: (3.4) 1 |x1 − x2 | ≤ |f (x1 ) − f (x2 )| ≤ K|x1 − x2 | K Suppose φ′ (x) is bilipschitz. Find an exact expression for δ. Exercise 27. 
This exercise will tell you something about the convergence proof for bisection search under a stronger assumption on φ(x). Suppose that φ : R → R is twice continuously differentiable and strictly concave on [a, b] with its unique maximum x* ∈ [a, b] (so φ′(x*) = 0). Assume that |φ′(u)| < ǫ. Use the Mean Value Theorem to find a bound on |u − x*|.

Remark 3.39. If φ(x) is twice differentiable, then there is a t between u and x* so that
(3.5) φ′(u) − φ′(x*) = φ″(t)(u − x*)
This implies that:
(3.6) |φ′(u) − φ′(x*)| = |φ″(t)| |u − x*|
From this we know that
(3.7) |φ′(u) − φ′(x*)| = |φ″(t)| |u − x*| < ǫ
Therefore:
(3.8) |u − x*| < ǫ/|φ″(t)|

Exercise 28. Modify the bisection search to detect minima. Test your algorithm on the function f(x) = (x − 5)².

Example 3.40. Below we show example output from Bisection Search when executed on the function φ(x) = 10 − (x − 5)², searching on the interval [0, 9] with ǫ = 0.01.
"a b"
"0.000000 9.000000"
"4.500000 9.000000"
"4.500000 6.750000"
"4.500000 5.625000"
"4.500000 5.062500"
"4.781250 5.062500"
"4.921875 5.062500"
"4.992188 5.062500"
"4.992188 5.027344"
"4.992188 5.009766"
Output = 5.000976562

7. Newton's Method

Remark 3.41. Newton's method is a popular method for root finding that can also be applied to the problem of univariate functional maximization (or minimization), provided the second derivative of the function can be computed. Since we will cover Newton's Method for multidimensional functions in the next chapter, we will not go into great detail on the theory here. Note that, in theory, we do not require a bracket for Newton's Method, just a starting point; a bracket is useful, though, for finding a starting point.

Remark 3.42. The algorithm assumes that the quadratic Taylor approximation is reasonable for the function φ(x) to be optimized about the current iterated value xk. That is:
(3.9) φ(x) ≈ φ(xk) + φ′(xk)(x − xk) + (1/2)φ″(xk)(x − xk)²
If we assume that φ″(xk) < 0, then differentiating and setting equal to zero will yield a maximum for the function approximation:
(3.10) xk+1 = −(φ′(xk) − φ″(xk)xk)/φ″(xk) = xk − φ′(xk)/φ″(xk)
Iteration continues until some termination criterion is achieved. An implementation of the algorithm is shown in Algorithm 7.

1 NewtonsMethod := proc (phi::operator, a::numeric, epsilon::numeric)::list;
2 local u, fp, fpp;
3 u := a;
4 fp := infinity;
5 while epsilon < abs(fp) do
6 print(sprintf("xk = %f", u));
7 fp := eval(diff(phi(z), z), z = u); #Take a first derivative at u.
8 fpp := eval(diff(phi(z), `$`(z, 2)), z = u); #Take a second derivative at u.
9 u := evalf(-(fp-fpp*u)/fpp)
10 end do;
11 [u, evalb(fpp < 0)]:
12 end proc:

Algorithm 7. Newton's Method in One Dimension

Remark 3.43. Note that our implementation of Newton's Algorithm checks to see if the final value of φ″(xk) < 0. This allows us to make sure we identify a maximum rather than a minimum.

Example 3.44. Suppose we execute our bracketing algorithm on the function:
φ(x) = −(x − 4)⁴ + 3(x − 4)³ + 6(x − 4)² − 3(x − 4) + 100
starting at 0 and 0.25. We obtain the bracket (2.864863884, 2.899652455, 2.904582162). If we begin Newton's method with the second point of the bracket, 2.899652455, we obtain the output:
"xk = 2.899652"
"xk = 2.900627"
Output = [2.900627632, true]

Exercise 29. Find a simple necessary condition for which Newton's Method converges in one iteration.

Exercise 30. Implement Newton's Method.
Using φ(x) = −(x − 4)4 + 3(x − 4)3 + 6(x − 4)2 − 3(x − 4) + 100 find a starting point for which Newton’s Method converges to a maximum and another starting point for which Newton’s Method converges to a minimum. 8. Convergence of Newton’s Method Remark 3.45. Most of this discussion is taken from Chapter 8.1 of [Avr03]. A slightly more general discussion can be found in [BSS06]. Definition 3.46. Let S = [a, b] ⊂ R. A contractor or contraction mapping on S is a continuous function f : S → S with the property that there is a q ∈ (0, 1) so that: (3.11) |f (x1 ) − f (x2 )| ≤ q|x1 − x2 | 38 3. INTRODUCTION TO GRADIENT ASCENT AND LINE SEARCH METHODS Remark 3.47. Definition 3.46 is a special form of Lipschitz continuity in which the Lipschitz constant is required to be less than 1. Lemma 3.48. Let f : S → S be a contraction mapping on S = [a, b] ⊂ R. Define x(k+1) = f (x(k) ) and suppose x(0) ∈ S. Then there is a unique fixed point x∗ to which the sequence {x(k) } converges and: (3.12) |x(k+1) − x∗ | ≤ q k+1 |x(0) − x∗ | k = 0, 1, . . . Proof. The fact that f (x) has a fixed point is immediate from a variation on Brower’s Fixed Point Theorem (see [Mun00], Page 351 - 353). Denote this point by x∗ . The fact that f is a contraction mapping implies that: (3.13) |f (x(0) ) − x∗ | ≤ q|x(0) − x∗ | which implies that: (3.14) |x(1) − x∗ | ≤ q|x(0) − x∗ | Now proceed by induction and assume for (3.15) |x(k) − x∗ | ≤ q k |x(0) − x∗ | k = 0, 1, . . . , K From Expression 3.15, we know that: (3.16) |f (x(K) ) − x∗ | ≤ q|x(K) − x∗ | ≤ q K+1 |x(0) − x∗ | Thus: (3.17) |x(K+1) − x∗ | ≤ q K+1 |x(0) − x∗ | Thus Expression 3.12, follows by induction. Finally, since q < 1, we see at once that {x(k) } converges to x∗ . To prove the uniqueness of x∗ , suppose there is a second, distinct, fixed point x+ . Then: (3.18) 0 < |x∗ − x+ | = |f (x∗ ) − f (x+ )| ≤ q|x∗ − x+ | As 0 < q < 1, this is a contradiction unless x∗ = x+ . Thus x∗ is unique. This completes the proof.  Lemma 3.49. Suppose that f is continuously differentiable on S = [a, b] ⊂ R and maps S into itself. If |f ′ (x)| < 1 for every x ∈ S, then f is a contraction mapping. Proof. Let x1 and x2 be two points in S. Then by the Mean Value Theorem (Theorem 2.2) we know that there is some x∗ ∈ (x1 , x2 ) so that: (3.19) f (x1 ) = f (x2 ) + f ′ (x∗ )(x1 − x2 ) Thus: (3.20) |f (x1 ) − f (x2 )| = |f ′ (x∗ )||(x1 − x2 )| By our assumption that |f ′ (x∗ )| < 1 for all x ∈ S, we can set q equal to the maximal value of |f ′ (x∗ )| on S. The fact that f is a contraction mapping follows from the definition.  8. CONVERGENCE OF NEWTON’S METHOD 39 Theorem 3.50. Let h and γ be continuously differentiable functions on S = [a, b] ⊂ R. Suppose further that: (3.21) h(a)h(b) < 0 and for all x ∈ S: (3.22) γ(x) > 0 (3.23) h′ (x) > 0 (3.24) 0 ≤ 1 − (γ(x)h(x))′ ≤ q < 1 Let: (3.25) x(k+1) = x(k) − γ(x(k) )h(x(k) ) k = 0, 1, . . . with x(0) ∈ S. Then the sequence {x(k) } converges to a solution x∗ with h(x∗ ) = 0. Proof. Define: (3.26) (3.27) ρ(x) = x − γ(x)h(x) ρ′ (x) = 1 − (γ(x)h(x))′ From Inequality 3.24, we see that 0 < ρ′ (x) ≤ q < 1 for all x ∈ S and ρ is monotonically increasing (non-decreasing) on S. Inequality 3.23 shows that h is monotonically increasing on S and therefore since h(a)h(b) < 0, it follows that h(a) < 0 and h(b) > 0. From this, and Equation 3.26 we conclude that ρ(a) > a and ρ(b) < b, since γ(x) > 0 for all x ∈ S.By the monotonicity of ρ, we can conclude then that a < ρ(x) < b for all x ∈ S. Thus, ρ maps S into itself. 
Moreover, Inequality 3.24 ensures that 0 < ρ′(x) ≤ q < 1, so |ρ′(x)| < 1 on S and, by Lemma 3.49, ρ(x) is a contraction mapping. By Lemma 3.48, the sequence {x(k)} converges to the unique fixed point x* of ρ; since x* = x* − γ(x*)h(x*) and γ(x*) > 0, it follows that h(x*) = 0. ∎

Corollary 3.51. Suppose φ : R → R is a univariate function to be maximized that is three-times continuously differentiable. Let h(x) = φ′(x) and let γ(x) = 1/φ″(x). Then ρ(x) = x − γ(x)h(x) is Newton's Method. If the hypotheses on h(x) and γ(x) from Theorem 3.50 hold, then Newton's method converges to a stationary point (φ′(x*) = 0) on S = [a, b] ⊂ R. ∎

Definition 3.52 (Rate of Convergence). Suppose a sequence {xk} ⊆ Rⁿ converges to x* ∈ Rⁿ, but the sequence does not attain x* for any finite k. If there is a p ∈ R and an α ∈ R with α ≠ 0 such that:
(3.28) lim_{k→∞} ||xk+1 − x*|| / ||xk − x*||^p = α
then p is the order of convergence of the sequence {xk}. If p = 1, then convergence is linear. If p > 1, then convergence is superlinear, and if p = 2, then convergence is quadratic.

Theorem 3.53. Suppose that the hypotheses of Theorem 3.50 and Corollary 3.51 hold, the sequence {xk} ⊂ R is generated by Newton's method, and this sequence converges to x* ∈ R. Then the rate of convergence is quadratic.

Proof. From Theorem 3.50 and Corollary 3.51, we are maximizing φ : S = [a, b] ⊂ R → R with h(x) = φ′(x) and γ(x) = 1/φ″(x), and φ is three-times continuously differentiable. From Theorem 3.50, we know that φ′(x*) = h(x*) = 0. Furthermore, we know that x* is a fixed point of the equation:
(3.29) f(xk) = xk − h(xk)/h′(xk)
if and only if h(x*) = 0. To see this, note that if h(x*) = 0, then f(x*) = x* and so x* is a fixed point. If x* is a fixed point, then:
(3.30) 0 = −h(x*)/h′(x*)
and thus h(x*) = 0. By the Mean Value Theorem (Theorem 2.2), for any iterate xk we know there is a t ∈ R between xk and x* so that:
(3.31) xk+1 − x* = f(xk) − f(x*) = f′(t)(xk − x*)
From this, we conclude that:
(3.32) |xk+1 − x*| = |f(xk) − f(x*)| = |f′(t)| |xk − x*|
We note that:
(3.33) f′(t) = 1 − (h′(t)² − h″(t)h(t))/h′(t)² = h″(t)h(t)/h′(t)²
We can rewrite Equation 3.32 as:
(3.34) |xk+1 − x*| = (|h″(t)| |h(t)| / h′(t)²) |xk − x*|
Since h(x*) = 0, we know that there is some value s between t and x* so that:
(3.35) |h(t)| = |h(t) − h(x*)| = |h′(s)| |t − x*|
again by the Mean Value Theorem (Theorem 2.2). But since s is between t and x* and t is between xk and x*, we conclude that:
(3.36) |h(t)| = |h(t) − h(x*)| = |h′(s)| |t − x*| ≤ |h′(s)| |xk − x*|
Combining Equations 3.34 and 3.36 yields:
(3.37) |xk+1 − x*| = (|h″(t)| |h(t)| / h′(t)²) |xk − x*| ≤ (|h″(t)| |h′(s)| / h′(t)²) |xk − x*|²
If
β = sup_k (|h″(t)| |h′(s)| / h′(t)²)
then
(3.38) |xk+1 − x*| ≤ β|xk − x*|²
The fact that f′(x) is bounded is sufficient to ensure that the supremum β exists. Thus the convergence rate of Newton's Method is quadratic. ∎

Exercise 31. Using your code for Newton's Method, illustrate empirically that convergence is quadratic.

CHAPTER 4
Approximate Line Search and Convergence of Gradient Ascent

1. Approximate Search: Armijo Rule and Curvature Condition

Remark 4.1. It is sometimes the case that optimizing the function φ(δk) is expensive. In this case, it would be nice to find a δk that is sufficient to guarantee convergence of gradient ascent (or a related algorithm). We can then stop our line search before convergence to an optimal solution and speed up our overall optimization algorithm.
The following conditions, usually called the Wolfe Conditions, can be used to ensure convergence of gradient ascent and related algorithms.

Definition 4.2 (Armijo Rule). Given a function f : Rⁿ → R and an ascent direction pk with constant σ1 ∈ (0, 1), the Armijo rule is satisfied if:
(4.1) f(xk + δk pk) − f(xk) ≥ σ1 δk ∇f(xk)ᵀpk

Remark 4.3. Recall φ(δk) = f(xk + δk pk); consequently, Equation 4.1 simply states that:
(4.2) φ(δk) − φ(0) ≥ σ1 δk ∇f(xk)ᵀpk = σ1 δk φ′(0)
which simply means there is a sufficient increase in the value of the function.

Definition 4.4 (Curvature Rule). Given a function f : Rⁿ → R and an ascent direction pk with constant σ2 ∈ (σ1, 1), the curvature condition is satisfied if:
(4.3) σ2 ∇f(xk)ᵀpk ≥ ∇f(xk + δk pk)ᵀpk

Remark 4.5. Recall from Proposition 1.20 and Theorem 1.22 that φ′(δk) = ∇f(xk + δk pk)ᵀpk, so that φ′(0) = ∇f(xk)ᵀpk. Thus we can write the curvature condition as:
(4.4) σ2 φ′(0) ≥ φ′(δk)
Remember, since ∇f(xk)ᵀpk > 0 (pk is an ascent direction), we have φ′(0) > 0. Thus, all the curvature condition is saying is that the slope of the univariate function φ at the chosen point δk is not as steep as it was at 0.

Example 4.6. Consider the function:
f(x, y) = exp(−(x² + y²)/10) cos(x² + y²)
The function is illustrated in Figure 4.1. Suppose we start at the point x0 = y0 = 1; then the function φ(t) is given by φ(t) = f((x0, y0) + t∇f(x0, y0)). A plot of φ(t) is shown in Figure 4.2. Armijo's Rule simply says:
(4.5) φ(t) ≥ φ(0) + σ1 φ′(0)t
Here σ1 ∈ (0, 1) is a constant we choose, and φ′(0) and φ(0) are computed from φ(t). This rule can be associated with the set of t so that φ(t) lies above the line φ(0) + σ1 φ′(0)t. This is illustrated in Figure 4.3. Likewise, the Curvature Condition asserts that:
(4.6) σ2 φ′(0) ≥ φ′(t)
This rule can be associated with the set of t so that φ′(t) is at most some constant, where σ2 is chosen between σ1 and 1 and φ′(0) is computed. This is also illustrated in Figure 4.3. When these two conditions are combined, it is clear that we have a good bracket of the maximum of φ(t).

Figure 4.1. The function f(x, y) has a maximum at x = y = 0, where the function attains a value of 1.

Figure 4.2. A plot of φ(t) illustrates that the function increases as we approach the global maximum of the function and then decreases.

Figure 4.3. The Wolfe Conditions are illustrated. Note the region accepted by the Armijo rule intersects with the region accepted by the curvature condition to bracket the (closest local) maximum for δk. Here σ1 = 0.15 and σ2 = 0.5.

Lemma 4.7. Suppose pk is an ascent direction at xk and that f(x) is bounded on the ray xk + tpk, where t ≥ 0. If 0 < σ1 < σ2 < 1, then there exists an interval of values for t satisfying the Wolfe Conditions.

Proof. Define φ(t) = f(xk + tpk). The fact that pk is an ascent direction means φ′(0) = ∇f(xk)ᵀpk > 0 by Definition 3.2. Thus, since 0 < σ1 < 1, the line l(t) = φ(0) + σ1 φ′(0)t is unbounded above and must intersect the graph of φ(t) at least once, because φ(t) is bounded above by assumption. (Note, φ(t) must lie above l(t) initially, since we chose σ1 < 1 and at t = 0 the tangent line to φ has slope φ′(0).) Let t′ > 0 be the least such value of t at which l(t) intersects φ(t). That is:
φ(t′) = φ(0) + σ1 φ′(0)t′
It follows that the interval (0, t′) satisfies the Armijo rule.
Applying the univariate Mean Value Theorem (Lemma 2.2), we see that there is a t′′ ∈ (0, t′ ) so that: φ(t′ ) − φ(0) = φ′ (t′′ )(t′ − 0) Combining the two previous equations yields: σ1 φ′ (0)t′ = φ′ (t′′ )t′ Since σ2 > σ1 , we conclude that: φ′ (t′′ )t′ = σ1 φ′ (0)t′ < σ2 φ′ (0)t′ Thus φ′ (t′′ ) < σ2 φ′ (0) and t′′ satisfies the Curvature Condition. Since we assumed that f was continuously differentiable, it follows that there is a neighborhood around t′′ satisfying the Curvature Condition as well. This completes the proof.  Remark 4.8. It turns out that the curvature condition is not necessary to ensure that gradient descent converges to a point x∗ at which ∇f (x∗ ) = 0. From this observation, we can derive an algorithm for identifying a value for δk that satisfies the Armijo rule. Remark 4.9. Algorithm 8 codifies our observation about the Armijo rule. Here, we choose a σ1 (called sigma in the algorithm) and a value β ∈ (0, 1) and an initial value t0 . We set t = β k t0 and iteratively test to see if (4.7) φ(t) ≥ φ(0) + σ1 tφ′ (0) 44 4. APPROXIMATE LINE SEARCH AND CONVERGENCE OF GRADIENT ASCENT When this holds, we terminate execution and return t = β k t0 for some k. The proof of Lemma 4.7 is sufficient to ensure this algorithm terminates with a point satisfying the Armijo rule. 1 2 3 4 5 6 7 8 BackTrace := proc (phi::operator, beta::numeric, t0::numeric, sigma)::numeric; local t, k, dphi; if beta < 0 or 1 < beta then t := -1 else k := 0; t := beta^k*t0; dphi := evalf(eval(diff(phi(x), x), x = 0)); 9 10 11 12 13 14 15 16 while evalf(phi(t)) < evalf(phi(0))+sigma*dphi*t do k := k+1; t := beta^k*t0 end do end if; t #return t. end proc Algorithm 8. The back-tracing algorithm Remark 4.10. In practice, σ1 is chosen small (between 10−5 and 0.1). The value β is generally between 0.1 and 0.5. Often t0 = 1, though this varies based on problem at hand [Ber99]. 2. Algorithmic Convergence Definition 4.11 (Gradient Related). Let f : Rn → R be a continuously differentiable function. A sequence of ascent directions {pk } is gradient related if for any subsequence K = {k1 , k2 , . . . } of {xk } that converges to a non-stationary point of f the corresponding subsequence {pk }k∈K is bounded and has the property that: (4.8) lim sup ∇f (xk )T pk > 0 k→∞,k∈K Theorem 4.12. Assume that δk is chosen to ensure the Armijo rule holds at each iteration and the ascent directions pk are chosen so they are gradient related. Then if Algorithm 2 converges, it converges to a point x∗ so that ∇f (x∗ ) = 0 and the stopping criteria: (4.9) ||∇f (xk )|| < ǫ for some small ǫ > 0 may be used. Proof. If the algorithm converges to a point x∗ at which ∇f (x∗ ) = 0, then the stopping criteria ||∇f (xk )|| < ǫ can clearly be used. We proceed by contradiction: Suppose that the sequence of points {xk } converges to a point x+ with the property that ∇f (x+ ) 6= 0. We construct {xk } so that {f (xk )} is a monotonically nondecreasing sequence and therefore, either {f (xk )} converges to some finite 2. ALGORITHMIC CONVERGENCE 45 value or diverges to +∞. By assumption, f (x) is continuous and since x+ is a limit point of {xk }, it follows f (x+ ) is a limit point of {f (xk )} and thus the sequence {f (xk )} must converge to this value. That is: (4.10) lim f (xk+1 ) − f (xk ) = 0 k→∞ From the Armijo rule we have: (4.11) f (xk+1 ) − f (xk ) ≥ σ1 δk ∇f (xk )T pk since xk+1 = xk + δk pk . 
Note the left hand side of this inequality goes to zero, while the right hand side is always nonnegative (pk are ascent directions and δk and σ1 are positive). Hence:
(4.12) lim_{k→∞} δk ∇f(xk)ᵀpk = 0
Let K = {k1, k2, . . . } index any subsequence of {xk} that converges to x⁺. Since {pk} is gradient related, we know that:
(4.13) lim sup_{k→∞, k∈K} ∇f(xk)ᵀpk > 0
Therefore, lim_{k∈K} δk = 0 (otherwise, Equation 4.12 could not hold). Consider the backtrace algorithm. Since lim_{k∈K} δk = 0, there is some k⁺ ∈ K so that if k ≥ k⁺ (k ∈ K):
(4.14) f(xk + (δk/β)pk) − f(xk) < σ1 (δk/β) ∇f(xk)ᵀpk
That is, if k ≥ k⁺, the while-loop in Algorithm 8 executes at least once, and it only stops once the Armijo inequality holds; therefore the last rejected step length δk/β must satisfy Inequality 4.14. Since {pk} is gradient related, the sequence {pk}, for k ∈ K, is bounded, and thus by the Bolzano-Weierstrass theorem there is a subsequence K⁺ ⊆ K so that {pk}, k ∈ K⁺, converges to a vector p⁺. From Equation 4.14:
(4.15) f(xk + δk⁺pk) − f(xk) < σ1 δk⁺ ∇f(xk)ᵀpk
where δk⁺ = δk/β. Let φ(δ) = f(xk + δpk); remember that φ′(δ) = ∇f(xk + δpk)ᵀpk. We can rewrite Expression 4.15 as:
(4.16) φ(δk⁺) − φ(0) < σ1 δk⁺ ∇f(xk)ᵀpk
Applying the Mean Value Theorem, there is a δ̄k ∈ (0, δk⁺) so that φ(δk⁺) − φ(0) = φ′(δ̄k)δk⁺, and hence:
(4.17) φ′(δ̄k) < σ1 ∇f(xk)ᵀpk
If we write φ′(δ̄k) in terms of the gradient of f and pk, we obtain:
(4.18) ∇f(xk + δ̄k pk)ᵀpk < σ1 ∇f(xk)ᵀpk
Taking the limit as k → ∞ (with k ∈ K⁺, noting δ̄k → 0), we see that:
(4.19) ∇f(x⁺)ᵀp⁺ ≤ σ1 ∇f(x⁺)ᵀp⁺
But since σ1 ∈ (0, 1) and ∇f(x⁺)ᵀp⁺ > 0 by gradient relatedness, this cannot be true, contradicting our assumption that ∇f(x⁺) ≠ 0. ∎

Corollary 4.13. Assume that δk is chosen by maximizing φ(δk) at each iteration and the ascent directions pk are chosen so they are gradient related. Then Algorithm 2 converges to a point x* so that ∇f(x*) = 0.

Corollary 4.14. Suppose that pk = Bk⁻¹∇f(xk) in each iteration of Algorithm 2 and every matrix Bk is symmetric, positive definite. Then Algorithm 2 converges to a stationary point of f.

Exercise 32. Prove Corollary 4.13 by arguing that if xk+1 is generated from xk by maximizing φ(δk), then:
(4.20) f(xk+1) − f(xk) ≥ f(x̂k+1) − f(xk) ≥ σ1 δ̂k ∇f(xk)ᵀpk
where x̂k+1 and δ̂k are generated by the Armijo rule.

Exercise 33. Prove Corollary 4.14. [Hint: Use Lemma 3.8.]

Example 4.15 (Convergence Failure). We illustrate a simple case in which we use neither maximization nor the Armijo rule to choose a step length. Consider the function:
f(x) = −(4/5)(1 − x)² + 2 − 2x if x > 1; −(4/5)(1 + x)² + 2 + 2x if x < −1; −x² + 1 otherwise
Suppose we start at x0 = −2 and fix the step length δk = 1. If we follow a gradient ascent, then we obtain the sequence of values {xk} (k = 1, 2, . . . ) shown in Figure 4.4. At each algorithmic step we have:
(4.21) xk+1 = xk + (df/dx)|_{x=xk}
Evaluation of Equation 4.21 yields the closed form:
(4.22) xk = (−1)^k (3/5)^k x0 + (−1)^{k+1} (1 − (3/5)^k), k = 0, 1, . . .
As k → ∞, we see this sequence oscillates between points approaching −1 and 1 and does not converge.

Figure 4.4. We illustrate the failure of the gradient ascent method to converge to a stationary point when we do not use the Armijo rule or maximization.
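A minimal Maple sketch of this iteration (the names f and xk below are ours) reproduces the oscillation:

f := x -> piecewise(x > 1, -(4/5)*(1-x)^2 + 2 - 2*x,
                    x < -1, -(4/5)*(1+x)^2 + 2 + 2*x,
                    -x^2 + 1);
xk := -2.0;
for k to 6 do
  xk := xk + evalf(eval(diff(f(z), z), z = xk));  # fixed step length delta_k = 1
  print(xk);
end do:

The printed values 1.6, -1.36, 1.216, -1.1296, . . . approach a 2-cycle between -1 and 1, in agreement with Equation 4.22.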
Remark 4.16. We state, but do not prove, the capture theorem, a result that quantifies, in some sense, why gradient methods tend to converge to unique limit points. The proof can be found in [Ber99] (Pages 50-51).

Theorem 4.17 (Capture Theorem). Let f : Rⁿ → R be continuously differentiable and let {xk} be a sequence satisfying f(xk+1) ≥ f(xk) for all k, generated as xk+1 = xk + δk pk by a method that is convergent in the sense that every limit point of every sequence it generates is a stationary point of f. Assume there exist scalars s > 0 and c > 0 so that for all k we have:
(4.23) δk ≤ s, ||pk|| ≤ c||∇f(xk)||
Let x* be a local maximum of f that is the only stationary point of f in some open neighborhood of x*. Then there is an open set S containing x* so that if xK ∈ S for some K ≥ 0, then xk ∈ S for all k ≥ K and {xk} (k ≥ K) converges to x*. Furthermore, given any scalar ǫ > 0, the set S can be chosen so that ||x − x*|| < ǫ for all x ∈ S. ∎

3. Rate of Convergence for Pure Gradient Ascent

Remark 4.18. We state, but do not prove, Kantorovich's Lemma. A sketch of the proof can be found on Page 76 of [Ber99]. A variation of the proof provided for Theorem 4.20 can be found in [Ber99] for gradient descent.

Lemma 4.19 (Kantorovich's Inequality). Let Q ∈ Rⁿˣⁿ be a symmetric positive definite matrix and suppose the eigenvalues of Q are 0 < λ1 ≤ λ2 ≤ · · · ≤ λn. For any x ∈ Rⁿˣ¹ with x ≠ 0 we have:
(4.24) (xᵀx)² / ((xᵀQx)(xᵀQ⁻¹x)) ≥ 4λ1λn / (λ1 + λn)²

Exercise 34. Prove that Kantorovich's Inequality also holds for negative definite matrices.
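As a quick numeric check of Lemma 4.19, take Q = diag(1, 4), so λ1 = 1, λn = 4 and the right hand side of Inequality 4.24 is 4·1·4/25 = 16/25. For x = (1, 1)ᵀ we have (xᵀx)² = 4, xᵀQx = 5 and xᵀQ⁻¹x = 5/4, so the left hand side is 4/(25/4) = 16/25 and the bound is tight. For x = (1, 0)ᵀ, the left hand side is 1 > 16/25.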
Theorem 4.20. Suppose f : Rⁿ → R is a strictly concave function defined by:
(4.25) f(x) = (1/2)xᵀQx
where Q ∈ Rⁿˣⁿ is symmetric (and negative definite). Suppose the sequence {xk} is generated using an exact line search method and x* is the (unique) global maximizer of f. Finally, suppose the eigenvalues of Q are 0 > λ1 ≥ λ2 ≥ · · · ≥ λn. Then:
(4.26) f(xk+1) ≥ ((λn − λ1)/(λn + λ1))² f(xk)

Remark 4.21. The value (λn − λ1)/(λn + λ1) is called the condition number of the optimization problem. Condition numbers close to 1 lead to slow convergence of gradient ascent (see Example 4.24).

Proof. Denote by gk = ∇f(xk) = Qxk. If gk = 0, then f(xk+1) = f(xk) and the theorem is proved. Suppose gk ≠ 0. Consider the function:
(4.27) φ(δ) = f(xk + δgk) = (1/2)(xk + δgk)ᵀQ(xk + δgk)
We see at once that:
(4.28) φ(δ) = f(xk) + (1/2)δ xkᵀQgk + (1/2)δ gkᵀQxk + δ² f(gk)
Note that since Q = Qᵀ, Qxk = gk and xkᵀQᵀ = xkᵀQ = gkᵀ, we have:
(4.29) φ(δ) = f(xk) + δ gkᵀgk + δ² f(gk)
Differentiating and setting φ′(δ) = 0 yields:
(4.30) gkᵀgk + 2δ f(gk) = 0 ⟹ δ = −gkᵀgk / gkᵀQgk
Notice that δ > 0 because Q is negative definite. Substituting this value into f(xk+1) yields:
(4.31) f(xk+1) = f(xk + δgk) = f(xk) + δ gkᵀgk + δ² f(gk) = f(xk) − (gkᵀgk)²/(gkᵀQgk) + (1/2)(gkᵀgk)²/(gkᵀQgk) = (1/2)xkᵀQxk − (1/2)(gkᵀgk)²/(gkᵀQgk)
Note that since Q is symmetric:
(4.32) xkᵀQxk = xkᵀQᵀQ⁻¹Qxk = gkᵀQ⁻¹gk
Therefore, we can write:
(4.33) f(xk+1) = (1/2)(xkᵀQxk − (gkᵀgk)²/(gkᵀQgk)) = (1/2)(1 − (gkᵀgk)²/((gkᵀQgk)(gkᵀQ⁻¹gk))) xkᵀQxk = (1 − (gkᵀgk)²/((gkᵀQgk)(gkᵀQ⁻¹gk))) f(xk)
Since Q is negative definite (and thus −Q is positive definite), we know three things:
(1) f(xk), f(xk+1) ≤ 0;
(2) (gkᵀgk)²/((gkᵀQgk)(gkᵀQ⁻¹gk)) > 0, since gkᵀQgk < 0 and gkᵀQ⁻¹gk < 0;
(3) thus, in order for the equality in Expression 4.33 to be true, it follows that:
0 < 1 − (gkᵀgk)²/((gkᵀQgk)(gkᵀQ⁻¹gk)) < 1
Lemma 4.19 (extended to negative definite matrices via Exercise 34) and our previous observations imply that:
(4.34) f(xk+1) ≥ (1 − 4λ1λn/(λ1 + λn)²) f(xk) = ((λ1² − 2λ1λn + λn²)/(λ1 + λn)²) f(xk) = ((λ1 − λn)/(λ1 + λn))² f(xk)
where the inequality reverses direction relative to Lemma 4.19 because f(xk) ≤ 0. This completes the proof. ∎

Remark 4.22. It is a little easier to see the proof of the previous theorem when we are minimizing, since all the quantities are positive instead of negative and we do not have to keep track of sign changes. Nevertheless, such exercises are good to ensure understanding of the proof.

Remark 4.23. A reference implementation for Gradient Ascent is shown in Algorithm 9. The LineSearch method called at Line 49 is a combination of the parabolic bracketing algorithm (Algorithm 3) and Golden Section Search (Algorithm 5), along with a simple backtrace (Algorithm 8) to find the second parameter that is passed into the parabolic bracketing algorithm (starting with a = 0 and ∆a = 1, we check to see if φ(a + ∆a) > φ(a); if not, we set ∆a := β∆a).

Example 4.24. Consider the function F(x, y) = −2x² − 10y². If we initialize our gradient ascent at x = 15, y = 5 and set ǫ = 0.001, then we obtain the sequence of points illustrated in Figure 4.5, which converges near the optimal point x* = 0, y* = 0. The zig-zagging motion is typical of the gradient ascent algorithm in cases where the largest and smallest eigenvalues of the matrix Q (in a pure quadratic function) are very different (see Theorem 4.20). In this case, F(x, y) = (1/2)xᵀQx with
Q = diag(−4, −20)
Thus, we can see that the condition number is (λn − λ1)/(λn + λ1) = (−20 + 4)/(−20 − 4) = 2/3, which is close enough to 1 to cause slow convergence, leading to the zig-zagging.

Figure 4.5. Gradient ascent is illustrated on the function F(x, y) = −2x² − 10y² starting at x = 15, y = 5. The zig-zagging motion is typical of the gradient ascent algorithm in cases where λn and λ1 are very different (see Theorem 4.20).

Exercise 35. Implement Gradient Ascent with inexact line search. Using F(x, y) from Example 4.24, draw a picture like Figure 4.5 to illustrate your algorithm's steps.

4. Rate of Convergence for Basic Ascent Algorithm

Theorem 4.25. Let f : Rⁿ → R be twice continuously differentiable and suppose that {xk} is generated by Algorithm 2 so that xk+1 = xk + δk pk. Suppose that {xk} converges to x*, ∇f(x*) = 0 and H(x*) = ∇²f(x*) is negative definite. Finally, assume that ∇f(xk) ≠ 0 for any k = 1, 2, . . . and:
(4.35) lim_{k→∞} ||pk + H⁻¹(x*)∇f(xk)|| / ||∇f(xk)|| = 0
Then if δk is chosen by means of the backtrace algorithm with t0 = 1 and σ1 < 1/2, we have:
(4.36) lim_{k→∞} ||xk+1 − x*|| / ||xk − x*|| = 0
Furthermore, there is an integer k̄ ≥ 0 so that δk = 1 for all k ≥ k̄ (that is, eventually the backtrace algorithm will converge with no iterations of the while-loop).

Remark 4.26. This theorem asserts that if we have a method for choosing our ascent direction pk so that, at large iteration numbers, the ascent direction approaches a Newton step, then the algorithm will converge super-linearly (since the ratio in Equation 4.36 goes to zero).

Remark 4.27. We note that in Newton's Method we have pk = −H⁻¹(xk)∇f(xk). Thus, Equation 4.35 tells us that in the limit as k approaches ∞, we are taking pure Newton steps.

Proof.
By the Second Order Mean Value Theorem (Lemma 2.4), we know that there is some t ∈ (0, 1) so that: 1 (4.37) f (xk + pk ) − f (xk ) = ∇f (xk )T pk + pTk H(xk + tpk )pk 2 If we can show that for k sufficiently large we have: 1 (4.38) ∇f (xk )T pk + pTk H(xk + tpk )pk ≥ σ1 ∇f (xk )T pk 2 then we will have established that there is a k̄ so that for any k ≥ k̄, the backtrace algorithm converges with no executions of the while loop. Equation 4.38 may equivalently be written as: 1 (4.39) (1 − σ1 )∇f (xk )T pk + pTk H(xk + tpk )pk ≥ 0 2 If we let: ∇f (xk ) qk = ||∇f (xk )|| pk rk = ||∇f (xk )|| Dividing Inequality 4.39 by ||∇f (xk )||2 , we obtain an equivalent inequality: 1 (4.40) (1 − σ1 )qTk rk + rTk H(xk + tpk )rk ≥ 0 2 4. RATE OF CONVERGENCE FOR BASIC ASCENT ALGORITHM 51 From Equation 4.35, we know that: (4.41) lim ||rk + H−1 (x∗ )qk || = 0 k→∞ The fact that H(x∗ ) is negative definite, and ||qk || = 1, it follows that rk must be a bounded sequence and since ∇f (xk ) converges to ∇f (x∗ ) = 0, it’s clear that pk must also converge to 0 and hence xk + pk converges to x∗ . From this, it follows that xk + tpk also converges to x∗ and H(xk + tpk ) converges to H(x∗ ). We can re-write Equation 4.35 as: (4.42) rk = −H−1 (x∗ )qk + zk where {zk } is a sequence of vectors that converges to 0. Substituting the previous equation into Equation 4.40 yields:  (4.43) (1 − σ1 )qTk −H−1 (x∗ )qk + zk +  T 1 −H−1 (x∗ )qk + zk H(xk + tpk ) −H−1 (x∗ )qk + zk ≥ 0 2 The preceding equation can be re-written as:  T 1 (4.44) −(1 − σ1 )qTk H−1 (x∗ )qk + H−1 (x∗ )qk H(xk + tpk ) H−1 (x∗ )qk + g(zk ) ≥ 0 2 where g(zk ) is a function of zk with the property that g(zk ) → 0 as k → ∞. Since H(x) is symmetric (and so is its inverse), for large enough k, H(xk + tpk ) ∼ H(x∗ ) and we can re-write the previous expression as: 1 (4.45) −(1 − σ1 )qTk H−1 (x∗ )qk + qTk H−1 (x∗ )qk + ḡ(zk ) ≥ 0 2 Where ḡ(zk ) is another function with the same properties as g(zk ). We can re-write this expression as:   1 − σ1 qTk H−1 (x∗ )qk + ḡ(zk ) ≥ 0 (4.46) − 2 Now, since σ1 < 12 and H−1 (x∗ ) is negative definite and ḡ(zk ) → 0, the inequality in Expression 4.46 must hold for very large k. Thus, Expression 4.39 must hold and therefore Expression 4.38 must hold. Therefore, it follows there is some k̄ so that for all k ≥ k̄, the unit step in the Armijo rule is sufficient and no iterations of the while-loop in the back-tracing algorithm are executed. We can re-write Expression 4.35 as: (4.47) pk + H−1 (x∗ )∇f (xk ) = ||∇f (xk )||yk where yk is a vector sequence that approaches 0 as k approaches infinity. Applying Taylor’s Theorem to the many-valued function ∇f (xk ), we see that: (4.48) ∇f (xk ) = ∇f (xk ) − ∇f (x∗ ) = H(x∗ )(xk − x∗ ) + o (||xk − x∗ ||) since ∇f (x∗ ) = 0, by assumption. This implies: (4.49) H−1 (x∗ )∇f (xk ) = (xk − x∗ ) + o (||xk − x∗ ||) We also see that: ||∇f (xk )|| = O(||xk − x∗ ||). These two relations show that: (4.50) pk + (xk − x∗ ) = o (||xk − x∗ ||) 52 4. APPROXIMATE LINE SEARCH AND CONVERGENCE OF GRADIENT ASCENT When k ≥ k̄, we know that: (4.51) xk+1 = xk + pk Combining this with Equation 4.50 yields: (4.52) xk+1 − x∗ = o (||xk − x∗ ||) Now, letting k approach infinity yields: o (||xk − x∗ ||) ||xk+1 − x∗ || = lim =0 (4.53) lim k→∞ ||xk − x∗ || k→∞ ||xk − x∗ || This completes the proof.  4. RATE OF CONVERGENCE FOR BASIC ASCENT ALGORITHM 1 2 3 4 GradientAscent := proc (F::operator, initVals::list, epsilon::numeric) :: list; # A function to execute Gradient Ascent! 
Pass in a function # and an initial point in the format [x = x0, y=y0, z=z0,...] # epsilon is the tolerance on the gradient. 5 6 local vars, xnow, i, G, normG, OUT, passIn, phi, LS, vals, ttemp; 7 8 9 10 11 12 13 14 15 # The first few lines are housekeeping... vars := []; xnow := []; vals := initVals; for i to nops(initVals) do vars := [op(vars), lhs(initVals[i])]; xnow := [op(xnow), rhs(initVals[i])] end do; 16 17 18 19 # Compute the gradient of the function using the VectorCalculus package # This looks nicer in "Maple" Format. G := Vector(Gradient](F(op(vars)), vars)); 20 21 22 23 # Compute the norm of the gradient at the initial point. This uses the Linear # Algebra package. normG := Norm(evalf(eval(G, initVals))); 24 25 26 27 # Output will be the path we take to get to the "optimal point." # You can just output the last point for simplicity. OUT := []; 28 29 30 # While we’re not at a stationary point... while evalb(epsilon < normG) do 31 32 33 # Update the output. OUT := [op(OUT), xnow]; 34 35 36 37 38 39 # Do some house keeping: This is x_next = x_now + s * G(x_now) # Essentially, we’re going to be solving for the distance to walk "s" below. # Most of this line is just getting Maple to treat things as floating point # and returning a list of the inputs we can pass to F. passIn := convert(Vector(xnow) + s * evalf(eval(G, vals)), list); 53 54 40 41 42 43 44 4. APPROXIMATE LINE SEARCH AND CONVERGENCE OF GRADIENT ASCENT # Create a local procedure to be the line function f(x_k + delta * p_k) # Here ’t’ is playing the role of delta. phi := proc (t) options operator, arrow; evalf(eval(F(op(passIn)), s = t)) end proc; 45 46 47 48 49 # Get the output from a line search function. I’m using the # ParabolicBracket code and the GoldenSection search # found in the notes. ttemp := LineSearch(phi); 50 51 52 # Compute the new value for xnow using the ’t’ we just found. xnow := evalf(eval(passIn, [s = ttemp])); 53 54 55 56 57 58 # Do some housekeeping. vals := []; for i to nops(vars) do vals := [op(vals), vars[i] = xnow[i]] end do; 59 60 61 62 # Evaluate the norm of the gradient at the new point. normG := Norm(evalf(eval(G, vals))) end do; 63 64 65 # Record the very last point you found (it’s the local max). OUT := [op(OUT), xnow]; 66 67 68 69 #Return the list of points. OUT end proc; Algorithm 9. Gradient Ascent Reference Implementation CHAPTER 5 Newton’s Method and Corrections 1. Newton’s Method Remark 5.1. Suppose f : Rn → R and f is twice continuously differentiable. Recall in our general ascent algorithm, when Bk = ∇2 f (xk ) = H(xk ), then we call the general ascent algorithm Newton’s method. If we fix δk = 1 for k = 1, 2, . . . , then the algorithm is pure Newton’s method. When δk is chosen in each iteration using a line search technique, then the technique is a variable step length Newton’s method. Remark 5.2. The general Newton’s Method algorithm (with variable step length) is shown in Algorithm 10. Example 5.3. Consider the function F (x, y) = −2x2 − 10y 4 . This is a concave function as illustrated by its Hessian matrix:   −4 0 H(x, y) = 0 −120y 2 This matrix is negative definite except when y = 0, at which point it is negative semi-definite. Clearly, x = y = 0 is the global maximizer for the function. If we begin execution of Newtons Method at x = 15, y = 5, Newton’s method converges in 11 steps. As illustrated in Figure 5.1 Figure 5.1. Newton’s Method converges for the function F (x, y) = −2x2 − 10y 4 in 11 steps, with minimal zigzagging. 55 56 1 2 3 5. 
NEWTON’S METHOD AND CORRECTIONS NewtonsMethod := proc (F::operator, initVals::list, epsilon::numeric)::list; local vars, xnow, i, G, normG, OUT, passIn, phi, LS, vals, ttemp, H; vars := []; xnow := []; vals := initVals; 4 5 6 7 8 9 #The first few lines are housekeeping, so we can pass variables around. for i to nops(initVals) do vars := [op(vars), lhs(initVals[i])]; xnow := [op(xnow), rhs(initVals[i])]; end do; 10 11 12 13 14 #Compute the gradient and hessian using the VectorCalculus package. #Store the Gradient as a Vector (that’s not the default). G := Vector(Gradient(F(op(vars)), vars)); H := Hessian(F(op(vars)), vars); 15 16 17 #Compute the current gradient norm. normG := Norm(evalf(eval(G, initVals))); 18 19 20 21 #Output will be a path to the optimal solution. #Your output could just be the optimal solution. OUT := []; 22 23 24 #While we’re not at a stationary point... while evalb(epsilon < normG) do 25 26 27 #Update the output. OUT := [op(OUT), xnow]; 28 29 30 31 32 33 34 #Do some housekeeping. This is x_next = x_now - s * H^(-1) * G. #Essentially, we’re going to be solving for the distance to walk "s" below. #Most of this line is just getting Maple to treat things as floating point and #returning a list of the inputs we can pass to F. passIn := convert(Vector(xnow) s*evalf(eval(H,vals))^(-1).evalf(eval(G,vals)), list); 35 36 37 38 39 40 41 42 #Build a univariate function that we can maximize. This is a little weird. #We defined our list (above) in terms of s, and we’re going to evaluate #s at the value t we pass in. phi := proc (t) options operator, arrow; evalf(eval(F(op(passIn)), s = t)) end proc; 43 44 45 #Find the optimal step using a line search. ttemp := LineSearch(phi); 2. CONVERGENCE ISSUES IN NEWTON’S METHOD 57 46 47 48 #Update xnow using ttemp. This is our next point. xnow := evalf(eval(passIn, [s = ttemp])); 49 50 51 52 53 54 #Store the information in xnow in the vals list. vals := []; for i to nops(vars) do vals := [op(vals), vars[i] = xnow[i]] end do; 55 56 57 58 59 #Compute the new norm of the gradient. Notice, we don’t have to recompute the #gradient, it’s symbolic. normG := Norm(evalf(eval(G, vals))) end do; 60 61 62 #Pass the last point in (the while loop doesn’t get executed the last time around. OUT := [op(OUT), xnow]; 63 64 65 66 #Return the optimal solution. OUT: end proc: Algorithm 10. Variable Step Newton’s Method Exercise 36. Implement Newton’s Method and test it on a variation of Rosenbrock’s Function: 2 (5.1) −(1 − x)2 − 100 y − x2 2. Convergence Issues in Newton’s Method Example 5.4. Consider the function:  2 2 (5.2) F (x, y) = x2 + 3 y 2 e1−x −y A plot of the function is shown in Figure 5.2. It has two global maxima, a local minima between the maxima and saddle points near the peaks. If we execute Newton’s Method starting at x = 1 and y = 0.5, we obtain an interesting phenomenon. The Hessian matrix at this point is: " # −1.947001959 −3.504603525 (5.3) H(1, 0.5) = −3.504603525 −1.362901370 This matrix is not positive definite, as illustrated by the fact that its (1, 1) element is negative (and therefore, the Cholesky decomposition algorithm will fail). In fact, this matrix is indefinite. The consequence of this is that the direction −sH(1, 0.5)∇F (1, 0.5) is not an ascent direction for any positive value of s, and thus the line search algorithm converges to 58 5. NEWTON’S METHOD AND CORRECTIONS Figure 5.2. A double peaked function with a local minimum between the peaks. This function also has saddle points. 
s = 0 when maximizing the function:
φ(s) = F([1, 0.5]ᵀ − sH(1, 0.5)⁻¹∇F(1, 0.5))
Thus, the algorithm fails to move from the initial point. Unless one adds a loop counter to the while-loop at Line 24 in Algorithm 10, the algorithm will cycle forever.

Exercise 37. Implement Pure Newton's method and try it on the previous example. (You should not see infinite cycling, but you should not see good convergence.)

Remark 5.5. We are now in a situation where we would like to know whether Newton's Method converges and, if so, how fast. Naturally, we have already foreshadowed the results presented below by our discussion of Newton's Method in one dimension (see Theorem 3.53).

Definition 5.6 (Matrix Norm). Let M ∈ Rⁿˣⁿ. Then the matrix norm of M is:
(5.4) ||M|| = max_{||x||=1} ||Mx||

Theorem 5.7. Let f : Rⁿ → R have a local maximum at x*. For δ > 0, let Sδ = {x : ||x − x*|| ≤ δ}. Suppose that H(x*) is invertible and that, within some sphere Sǫ, f is twice continuously differentiable. Then:
(1) There exists a δ > 0 such that if x0 ∈ Sδ, the sequence {xk} generated by the pure Newton's method:
(5.5) xk+1 = xk − H⁻¹(xk)∇f(xk)
is defined, belongs to Sδ and converges to x*. Furthermore, ||xk − x*|| converges superlinearly.
(2) If for some L > 0, M > 0 and δ > 0 and for all x1, x2 ∈ Sδ:
(5.6) ||H(x1) − H(x2)|| ≤ L||x1 − x2|| and
(5.7) ||H⁻¹(x1)|| ≤ M
then, if x0 ∈ Sδ, we have:
||xk+1 − x*|| ≤ (LM/2)||xk − x*||², ∀k = 0, 1, . . .
Thus, if LMδ/2 < 1 and x0 ∈ Sδ, ||xk − x*|| converges superlinearly with order at least 2.

Proof. To prove the first statement, choose δ > 0 so that H(x) exists and is invertible for all x ∈ Sδ. Define M > 0 so that ||H⁻¹(x)|| ≤ M for all x ∈ Sδ. Let h = xk − x*. Then from Lemma 2.7 (the Mean Value Theorem for Vector Valued Functions) we have:
(5.8) ∇f(x* + h) − ∇f(x*) = ∫₀¹ ∇²f(x* + th)h dt ⟹ ∇f(xk) = ∫₀¹ ∇²f(x* + th)h dt
We can write:
(5.9) ||xk+1 − x*|| = ||xk − H⁻¹(xk)∇f(xk) − x*|| = ||xk − x* − H⁻¹(xk)∇f(xk)||
We now factor H⁻¹(xk) out of the above equation to obtain:
(5.10) ||xk+1 − x*|| = ||H⁻¹(xk)[H(xk)(xk − x*) − ∇f(xk)]||
Using Equation 5.8, we can write the previous expression as:
(5.11) ||xk+1 − x*|| = ||H⁻¹(xk)[H(xk)(xk − x*) − ∫₀¹ ∇²f(x* + th)h dt]|| = ||H⁻¹(xk)[H(xk)(xk − x*) − (∫₀¹ ∇²f(x* + th)dt)(xk − x*)]||
because h = xk − x* and is not dependent on t. Factoring (xk − x*) from both terms inside the square brackets on the right hand side yields:
(5.12) ||xk+1 − x*|| = ||H⁻¹(xk)[H(xk) − ∫₀¹ ∇²f(x* + th)dt](xk − x*)|| = ||H⁻¹(xk)[H(xk) − ∫₀¹ H(txk + (1 − t)x*)dt](xk − x*)||
Thus we conclude:
(5.13) ||xk+1 − x*|| ≤ ||H⁻¹(xk)|| · ||H(xk) − ∫₀¹ H(txk + (1 − t)x*)dt|| · ||xk − x*|| ≤ M (∫₀¹ ||H(xk) − H(txk + (1 − t)x*)|| dt) ||xk − x*||
by our assumption on H(x) for x ∈ Sδ. Note, in the penultimate step, we moved H(xk) under the integral sign since ∫₀¹ H(xk)dt = H(xk). Our assumption on f(x) (namely, that it is twice continuously differentiable in Sδ) tells us we can take δ small enough to ensure that the integral term is sufficiently small. This establishes the superlinear convergence of the algorithm inside Sδ. To prove the second assertion, suppose that H(x) is Lipschitz inside Sδ.
Then Inequality 5.13 becomes:
(5.14) $\|x_{k+1} - x^*\| \le M\left(\int_0^1 L\|x_k - (tx_k + (1-t)x^*)\|\,dt\right)\|x_k - x^*\| = M\left(\int_0^1 L(1-t)\|x_k - x^*\|\,dt\right)\|x_k - x^*\| = \frac{LM}{2}\|x_k - x^*\|^2$
This completes the proof. □

Remark 5.8. Notice that at no time did we state that Newton's Method would be attracted to a local maximum. In fact, Newton's method tends to be attracted to stationary points of any kind.

Remark 5.9. What we've shown is that, when Newton's method gets close to a solution, it works extremely well. Unfortunately, there's no way to ensure that Newton's method – in its raw form – will get to a point where it works well. For this reason, a number of corrections have been devised.

3. Newton Method Corrections

Remark 5.10. The original "Newton's Method" correction was a simple one developed by Cauchy (and was the motivation for developing the method of steepest ascent). In essence, one executes a gradient ascent algorithm until the Hessian matrix becomes negative definite. At this point, we are close to a maximum and we use Newton's method.

Example 5.11. Consider again the function $F(x,y) = (x^2 + 3y^2)e^{1-x^2-y^2}$. Suppose that instead of using the variable step Newton's method algorithm, we take gradient ascent steps until $H(x)$ becomes negative definite, at which point we switch to variable length Newton steps. This can be accomplished by adding a test on H before line 29 of Algorithm 10 (a sketch of such a test follows Exercise 39 below). When we do this and start at $x_0 = 1$, $y_0 = 0.1$, we obtain convergence to $x^* = 1$, $y^* = 1$ in five steps, as illustrated in Figure 5.3.

Figure 5.3. A simple modification to Newton's method first used by Cauchy. While $H(x_k)$ is not negative definite, we use gradient ascent to converge to the neighborhood of a stationary point (ideally a local maximum). We then switch to a Newton step.

Exercise 38. Implement the modification to Newton's method discussed above and test the rate of convergence on Rosenbrock's function, compared to pure gradient ascent, when starting from $x_0 = 0$ and $y_0 = 0$. Prove that the resulting sequence $\{x_k\}$ generated by this algorithm is gradient related and therefore the algorithm converges to a stationary point.

Exercise 39. Prove that Cauchy's modification to Newton's algorithm converges superlinearly.
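The switching test in Example 5.11 only requires a negative definiteness check. A minimal sketch in Maple (assuming the symbolic Hessian H and substitution list vals from Algorithm 10 are in scope; IsDefinite is part of the LinearAlgebra package):

with(LinearAlgebra):
#Returns true when it is safe to take a Newton step; otherwise take a
#gradient ascent step instead.
UseNewtonStep := Hnow -> IsDefinite(Hnow, query = 'negative_definite'):
#e.g., inside the while loop: if UseNewtonStep(evalf(eval(H, vals))) then ...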
Remark 5.12. The most common correction to Newton's Method is the Modified Cholesky Decomposition approach. In this approach, we will be passing $-H(x)$ into a modification of the Cholesky decomposition. The goal is to correct the Hessian matrix so that it is negative definite (and thus its negative is positive definite). In this way, we will always take an ascent step. The problem is, we'd like to maintain as much of the Hessian structure as possible and also ensure that when the Hessian is naturally negative definite, we do not change it at all. A simple way to do this is to choose $\mu_1, \mu_2 \in \mathbb{R}_+$ with $\mu_1 < \mu_2$. In the Cholesky decomposition at Line 8, we check to see if $\mathrm{MM}_{ii} > \mu_1$. If not, we replace $H_{ii}$ with $\mu_2$ and continue. In this way, the modified Cholesky decomposition will always return a factorization for a positive definite matrix that has some resemblance to the Hessian matrix. A reference implementation is shown in Algorithm 11.

1  ModifiedCholeskyDecomp := proc (M::Matrix, n::integer,
2      mu1::numeric, mu2::numeric)::list;
3  local MM, H, ret, i, j, k;
4  MM := Matrix(M);
5  H := Matrix(n, n);
6  ret := true;
7  for i to n do
8    if mu1 <= MM[i, i] then
9      H[i, i] := sqrt(MM[i, i])
10   else
11     H[i, i] := mu2
12   end if;
13
14   for j from i+1 to n do
15     H[j, i] := MM[j, i]/H[i, i];
16
17     for k from i+1 to j do
18       MM[j, k] := MM[j, k] - H[j, i]*H[k, i]
19     end do:
20   end do:
21 end do:
22 if ret then
23   [H, ret]
24 else
25   [M, ret]
26 end if:
27 end proc:

Algorithm 11. The modified Cholesky algorithm always returns a factored matrix close to an input matrix, but that is positive definite.

Remark 5.13. Suppose that we have constructed $L \in \mathbb{R}^{n\times n}$ from the negative of the Hessian matrix for the function $f(x_k)$. That is:
(5.15) $-H(x_k) \sim LL^T$
where $\sim$ indicates that $-H(x_k)$ may be positive definite, or may just be loosely related to $LL^T$, depending on the steps executed in the modified Cholesky decomposition. We must solve the problem:
(5.16) $\left(LL^T\right)^{-1}\nabla f(x_k) = p$
where p is our direction. This can be accomplished efficiently by solving:
(5.17) $Lz = \nabla f(x_k)$
by forward substitution, and then solving:
(5.18) $L^T p = z$
by backward substitution. This is efficient because L is a lower triangular matrix and $L^T$ is an upper triangular matrix.

Example 5.14. Consider again the function $F(x,y) = (x^2 + 3y^2)e^{1-x^2-y^2}$ and suppose we start at position $x = 1$ and $y = 1/2$. Then the Hessian matrix is:
$H = \begin{bmatrix} -\frac{5}{2}e^{-1/4} & -\frac{9}{2}e^{-1/4} \\ -\frac{9}{2}e^{-1/4} & -\frac{7}{4}e^{-1/4} \end{bmatrix}$
As we noted, this matrix is not positive definite. When we perform the modified Cholesky decomposition, we obtain the matrix:
$L = \begin{bmatrix} 1 & 0 \\ -\frac{9}{2}e^{-1/4} & 1 \end{bmatrix}$
The gradient at this point is:
$\nabla f\left(1, \tfrac12\right) = \begin{bmatrix} -\frac{3}{2}e^{-1/4} \\ \frac{5}{4}e^{-1/4} \end{bmatrix}$
This leads to the first problem:
(5.19) $\begin{bmatrix} 1 & 0 \\ -\frac{9}{2}e^{-1/4} & 1 \end{bmatrix}\begin{bmatrix} z_1 \\ z_2 \end{bmatrix} = \begin{bmatrix} -\frac{3}{2}e^{-1/4} \\ \frac{5}{4}e^{-1/4} \end{bmatrix}$
Thus we see immediately that:
$z_1 = -\frac{3}{2}e^{-1/4}$
$z_2 = \frac{5}{4}e^{-1/4} + \frac{9}{2}e^{-1/4}\left(-\frac{3}{2}e^{-1/4}\right) = -\frac{27}{4}\left(e^{-1/4}\right)^2 + \frac{5}{4}e^{-1/4}$
We can now solve the problem:
(5.20) $\begin{bmatrix} 1 & -\frac{9}{2}e^{-1/4} \\ 0 & 1 \end{bmatrix}\begin{bmatrix} p_1 \\ p_2 \end{bmatrix} = \begin{bmatrix} -\frac{3}{2}e^{-1/4} \\ -\frac{27}{4}\left(e^{-1/4}\right)^2 + \frac{5}{4}e^{-1/4} \end{bmatrix}$
Thus:
$p_2 = -\frac{27}{4}\left(e^{-1/4}\right)^2 + \frac{5}{4}e^{-1/4}$
$p_1 = -\frac{3}{2}e^{-1/4} + \frac{9}{2}e^{-1/4}\,p_2 = -\frac{243}{8}\left(e^{-1/4}\right)^3 + \frac{45}{8}\left(e^{-1/4}\right)^2 - \frac{3}{2}e^{-1/4}$
If we compute $\nabla f\left(1, \tfrac12\right)^T p$, we see:
(5.21) $\nabla f\left(1,\tfrac12\right)^T p = -\frac{3}{2}e^{-1/4}\left(-\frac{243}{8}\left(e^{-1/4}\right)^3 + \frac{45}{8}\left(e^{-1/4}\right)^2 - \frac{3}{2}e^{-1/4}\right) + \frac{5}{4}e^{-1/4}\left(-\frac{27}{4}\left(e^{-1/4}\right)^2 + \frac{5}{4}e^{-1/4}\right) \approx 11.103 > 0$
Thus p is an ascent direction.
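As a quick sanity check of Example 5.14, the two triangular solves can be carried out numerically in Maple (a sketch using the L and gradient derived above):

with(LinearAlgebra):
L := Matrix([[1, 0], [-9/2*exp(-1/4), 1]]):
g := Vector([-3/2*exp(-1/4), 5/4*exp(-1/4)]):
z := LinearSolve(L, g):                 #forward substitution
p := LinearSolve(Transpose(L), z):      #backward substitution
evalf(Transpose(g) . p);                #about 11.103 > 0, so p is an ascent direction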
Remark 5.15. A reference implementation of the modified Cholesky correction to Newton's Method is shown in Algorithm 12.

ModifiedNewtonsMethod := proc (F::operator, initVals::list, epsilon::numeric,
    maxIter::integer)::list;
local vars, xnow, i, G, normG, OUT, passIn, phi, LS, vals, ttemp, H,
    count, L, X, P;
vars := []; xnow := []; vals := initVals;

#The first few lines are housekeeping, so we can pass variables around.
for i to nops(initVals) do
  vars := [op(vars), lhs(initVals[i])];
  xnow := [op(xnow), rhs(initVals[i])];
end do;

#Compute the gradient and hessian using the VectorCalculus package.
#Store the Gradient as a Vector (that's not the default).
G := Vector(Gradient(F(op(vars)), vars));
H := Hessian(F(op(vars)), vars);

#Compute the current gradient norm.
normG := Norm(evalf(eval(G, initVals)));

#Output will be a path to the optimal solution. Initialize the iteration counter.
OUT := [];
count := 0;

#While we're not at a stationary point...
while evalb(epsilon < normG and count < maxIter) do

  #Update the output.
  OUT := [op(OUT), xnow];

  #Update count
  count := count + 1;

  #Compute the modified Cholesky decomposition for the Hessian.
  L := ModifiedCholeskyDecomp(evalf(eval(-H, vals)), nops(vals), .1, 1)[1];

  #Now solve for the ascent direction P.
  X := LinearSolve(L, evalf(eval(G, vals)), method = 'subs');
  P := LinearSolve(Transpose(L), X, method = 'subs');

  #Do some housekeeping. This is x_next = x_now + s * P.
  #Essentially, we're going to be solving for the distance to walk "s" below.
  passIn := convert(Vector(xnow) + s*P, list);

  #Build a univariate function that we can maximize. This is a little weird.
  #We defined our list (above) in terms of s, and we're going to evaluate
  #s at the value t we pass in.
  phi := proc (t) options operator, arrow;
    evalf(eval(F(op(passIn)), s = t))
  end proc;

  #Find the optimal step using a line search.
  ttemp := LineSearch(phi);

  #Update xnow using ttemp. This is our next point.
  xnow := evalf(eval(passIn, [s = ttemp]));

  #Store the information in xnow in the vals list.
  vals := [];
  for i to nops(vars) do
    vals := [op(vals), vars[i] = xnow[i]]
  end do;

  #Compute the new norm of the gradient. Notice, we don't have to recompute the
  #gradient, it's symbolic.
  normG := Norm(evalf(eval(G, vals)))
end do;

#Pass in the last point (the while loop doesn't get executed the last time around).
OUT := [op(OUT), xnow];

#Return the optimal solution.
OUT:
end proc:

Algorithm 12. Variable Step Newton's Method using the Modified Cholesky Decomposition

Theorem 5.16. Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable. Suppose $0 < \mu_1 < \mu_2$. Let $\{x_k\}$ be the sequence of points generated by the modified Newton's method. Then this sequence converges to a stationary point $x^*$ (i.e., $\nabla f(x^*) = 0$).

Proof. By construction, $p_k = B_k^{-1}\nabla f(x_k)$ at each stage, where $B_k$ is positive definite and non-singular for each k. Thus, $\nabla f(x_k)^T p_k > 0$ precisely when $\nabla f(x_k) \neq 0$. Therefore $\{p_k\}$ is gradient related, the necessary conditions of Theorem 4.12 are satisfied, and $\{x_k\}$ converges to a stationary point. □

Theorem 5.17. Let $f : \mathbb{R}^n \to \mathbb{R}$ be twice continuously differentiable. Suppose that $x^*$ is the stationary point to which the sequence $\{x_k\}$ converges, when $\{x_k\}$ is the sequence of points generated by the modified Newton's method. If $H(x^*)$ is negative definite, then there is a $\mu_1 > 0$ so that the algorithm converges superlinearly when using the Armijo rule with $\sigma_1 < \tfrac12$ and $t_0 = 1$ in the backtrace algorithm.

Proof. Let $0 < \mu_1$ be any value that is less than the minimal diagonal value $\mathrm{MM}_{ii}$ at Line 8 of Algorithm 11 when executed on $-H(x^*)$. By the continuity of f, there is some neighborhood S of $x^*$ so that the modified Cholesky decomposition of $-H(x)$ is identical to the Cholesky decomposition of $-H(x)$ for all $x \in S$. Then, for all $x_k$ in S, the modified Newton's algorithm becomes Newton's method and thus:
$\lim_{k\to\infty} \frac{\|p_k + H^{-1}(x^*)\nabla f(x_k)\|}{\|\nabla f(x_k)\|} = 0$
In this case, the necessary conditions of Theorem 4.25 are satisfied and we see that:
$\lim_{k\to\infty} \frac{\|x_{k+1} - x^*\|}{\|x_k - x^*\|} = 0$
This completes the proof. □

Remark 5.18. The previous theorem can also be modified so that $\mu_1 = c\|\nabla f(x_k)\|$, where $c > 0$ is a fixed constant.

Exercise 40. Prove the superlinear convergence of the modified Newton's method when $\mu_1 = c\|\nabla f(x_k)\|$ and $c > 0$.
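An illustrative call of Algorithm 12 (a sketch; it assumes the listings above and the LineSearch routine of Chapter 3 are loaded in one worksheet, and the iteration cap of 25 is an arbitrary choice) reproduces the run in Example 5.20 below:

F := (x, y) -> (x^2 + 3*y^2)*exp(1 - x^2 - y^2):
path := ModifiedNewtonsMethod(F, [x = 1, y = 1/2], 0.001, 25);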
Remark 5.19. There is no scientific method for choosing $\mu_1$ and $\mu_2$ in the modified Newton's method. Various authors suggest different approaches. Most recommend beginning with a very small $\mu_1$. If $\mu_2$ is too small, the matrix $LL^T$ may be nearly singular. On the other hand, if $\mu_2$ is too large, then convergence will be slow, since the matrix will be strongly diagonally dominant. Generally, it is up to the user to find adequate values of $\mu_1$ and $\mu_2$, though [Ber99] suggests varying the size of $\mu_1$ and $\mu_2$ based on the largest value of the diagonal of the computed Hessian matrix.

Example 5.20. We illustrate the convergence of the modified Newton's method for the function $F(x,y) = (x^2+3y^2)e^{1-x^2-y^2}$, starting at position $x = 1$ and $y = 1/2$. We set $\mu_1 = 0.1$ and $\mu_2 = 1$. Algorithmic steps are shown in Figure 5.4. Notice that in this case we obtain convergence to a global maximum in 5 iterations of the algorithm.

Figure 5.4. Modified Newton's method uses the modified Cholesky decomposition and efficient linear solution methods to find an ascent direction in the case when the Hessian matrix is not negative definite. This algorithm converges superlinearly, as illustrated in this case.

CHAPTER 6

Conjugate Direction Methods

Remark 6.1. We begin this chapter with the study of conjugate gradient methods, which have the property that they converge to an optimal solution for a concave quadratic objective function $f : \mathbb{R}^n \to \mathbb{R}$ in n iterations, as we will see. These methods can be applied to non-quadratic functions (with some efficiency loss), but they are remarkably good for problems of very large dimensionality.

1. Conjugate Directions

Definition 6.2 (Conjugate Directions). Let $Q \in \mathbb{R}^{n\times n}$ be a negative (or positive) definite matrix. Directions $p_0, \dots, p_{n-1} \in \mathbb{R}^n$ are Q conjugate if for all $i \neq j$ we have:
(6.1) $p_i^T Q p_j = 0$

Remark 6.3. For the remainder of this section, when we say Q is definite we mean it is either positive or negative definite, and our results will apply in each case.

Lemma 6.4. If $Q \in \mathbb{R}^{n\times n}$ is definite and $p_0, \dots, p_{n-1} \in \mathbb{R}^n$ are Q conjugate, then $p_0, \dots, p_{n-1}$ are linearly independent.

Proof. By way of contradiction, suppose that:
(6.2) $p_{n-1} = \alpha_0 p_0 + \cdots + \alpha_{n-2}p_{n-2}$
Multiplying on the left by $p_{n-1}^T Q$ yields:
(6.3) $p_{n-1}^T Q p_{n-1} = \alpha_0\, p_{n-1}^T Q p_0 + \cdots + \alpha_{n-2}\, p_{n-1}^T Q p_{n-2} = 0$
by the definition of Q conjugacy. However, since Q is definite, we know that $p_{n-1}^T Q p_{n-1} \neq 0$. □

Definition 6.5 (Conjugate Direction Method). Suppose $Q \in \mathbb{R}^{n\times n}$ is a symmetric negative definite matrix with conjugate directions $p_0, \dots, p_{n-1} \in \mathbb{R}^n$ and that $f : \mathbb{R}^n \to \mathbb{R}$ is given by:
(6.4) $f(x) = \frac{1}{2}x^T Q x + b^T x$
with $b \in \mathbb{R}^n$. If $x_0 \in \mathbb{R}^n$, then the sequence generated by the conjugate direction method, $x_k$ ($k = 0, \dots, n-1$), is given by:
(6.5) $x_{k+1} = x_k + \delta_k p_k$
where $\delta_k$ solves the problem:
(6.6) $\max_{\delta_k}\ f(x_k + \delta_k p_k)$
That is, $\delta_k = \arg\max_{\delta_k} f(x_k + \delta_k p_k)$.

Lemma 6.6. For any k, in the conjugate direction method, $\delta_k$ is given by:
(6.7) $\delta_k = -\frac{p_k^T(Qx_k + b)}{p_k^T Q p_k} = -\frac{p_k^T \nabla f(x_k)}{p_k^T Q p_k}$

Exercise 41. Prove Lemma 6.6. Argue that when $p_k$ is an ascent direction, $\delta_k > 0$.
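A quick numeric check of the step length formula (6.7) is shown below (a sketch; the Q, b, starting point, and direction are illustrative, and the same data reappear in Example 6.8 below):

with(LinearAlgebra):
Q := Matrix([[-1, 0], [0, -1]]):  b := Vector([0, 0]):
x0 := Vector([1, 1]):  p0 := Vector([1, 0]):
#delta = -p^T (Q x + b) / (p^T Q p); here it evaluates to -1.
delta0 := -(Transpose(p0) . (Q . x0 + b))/(Transpose(p0) . Q . p0);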
Theorem 6.7. Let $M_k \subseteq \mathbb{R}^n$ be the subspace spanned by $p_0, \dots, p_{k-1}$. Then f(x) is maximized on $M_k$ by the conjugate direction method after k iterations, and therefore after n iterations the conjugate direction method converges to $x_n = x^*$, the global maximum of f(x).

Proof. Since $p_0, \dots, p_{k-1}$ are linearly independent and:
(6.8) $x_k = x_0 + \delta_0 p_0 + \cdots + \delta_{k-1}p_{k-1}$
it suffices to show that¹:
(6.9) $\frac{\partial f(x_k)}{\partial \delta_i} = 0 \quad \forall i = 0, \dots, k-1$
Observe first that:
(6.10) $\frac{\partial f(x_k)}{\partial \delta_{k-1}} = 0$
as a necessary condition for the conjugate direction step. That is, since $\delta_{k-1} = \arg\max_{\delta_{k-1}} f(x_{k-1} + \delta_{k-1}p_{k-1})$, Equation 6.10 follows at once. We also note for $i = 0, \dots, k-1$:
(6.11) $\frac{\partial f(x_k)}{\partial \delta_i} = \nabla f(x_k)^T p_i$
since $\partial f(x_k)/\partial \delta_i$ is a directional derivative. Thus we already know that:
(6.12) $\nabla f(x_k)^T p_{k-1} = 0$
Now, observe that:
(6.13) $x_k = x_i + \delta_i p_i + \delta_{i+1}p_{i+1} + \cdots + \delta_{k-1}p_{k-1} = x_i + \sum_{j=i}^{k-1}\delta_j p_j$
For $i = 0, \dots, k-2$:
(6.14) $\frac{\partial f(x_k)}{\partial \delta_i} = \nabla f(x_k)^T p_i = (Qx_k + b)^T p_i = x_k^T Q p_i + b^T p_i = \left(x_i + \sum_{j=i}^{k-1}\delta_j p_j\right)^T Q p_i + b^T p_i$
Since $p_j^T Q p_i = 0$ if $i \neq j$, the previous equation becomes:
(6.15) $\frac{\partial f(x_k)}{\partial \delta_i} = \left(x_i + \delta_i p_i\right)^T Q p_i + b^T p_i = \left(x_{i+1}^T Q + b^T\right)p_i$
since $x_{i+1} = x_i + \delta_i p_i$. But:
(6.16) $\nabla f(x_{i+1})^T = x_{i+1}^T Q + b^T$
Thus, we have proved that:
(6.17) $\frac{\partial f(x_k)}{\partial \delta_i} = \nabla f(x_{i+1})^T p_i = 0$
when we apply Equation 6.12 to i (instead of k). Thus we have shown Equation 6.9 holds for all k, and it follows that f(x) is maximized on $M_k$ by the conjugate direction method after k iterations; after n iterations, the conjugate direction method converges to $x_n = x^*$, the global maximum of f(x), since $p_0, \dots, p_{n-1}$ must be a basis for $\mathbb{R}^n$. □

¹There is something a little subtle going on here. We are actually arguing that a strictly concave function achieves a unique global maximum on a constraining set $M_k$. We have not shown that the sufficient condition given below is a sufficient condition outside of $\mathbb{R}^n$. This can be corrected, but for now, we assume it is clear to the reader.

Example 6.8. Consider the case when:
$Q = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}$
and suppose we are given the Q conjugate directions:
$p_0 = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \qquad p_1 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$
and begin at $x_1 = x_2 = 1$ (that is, at the point $(1,1)$). To compute $\delta_0$, we use Equation 6.7:
(6.18) $\delta_0 = -\frac{\begin{bmatrix}1 & 0\end{bmatrix}\begin{bmatrix}-1 & 0\\ 0 & -1\end{bmatrix}\begin{bmatrix}1\\1\end{bmatrix}}{\begin{bmatrix}1 & 0\end{bmatrix}\begin{bmatrix}-1 & 0\\ 0 & -1\end{bmatrix}\begin{bmatrix}1\\0\end{bmatrix}} = -\frac{-1}{-1} = -1$
Notice that $\delta_0 < 0$ because $p_0$ is not an ascent direction. Thus:
(6.19) $x_1 = \begin{bmatrix}1\\1\end{bmatrix} + (-1)\begin{bmatrix}1\\0\end{bmatrix} = \begin{bmatrix}0\\1\end{bmatrix}$
Repeating this logic with $p_1$, we see that $\delta_1 = -1$ as well and:
(6.20) $x_2 = \begin{bmatrix}0\\1\end{bmatrix} + (-1)\begin{bmatrix}0\\1\end{bmatrix} = \begin{bmatrix}0\\0\end{bmatrix}$
which is clearly the global maximum of the function.

Remark 6.9. All conjugate direction methods can be thought of as scaled versions of the previous example. Essentially, when transformed into an appropriate basis, we are simply maximizing along the various basis elements. The challenge is to determine a basis that is Q conjugate.

2. Generating Conjugate Directions

Remark 6.10. The principal problem we have yet to overcome is the generation of Q conjugate directions. We overcome this problem by applying the Graham-Schmidt procedure, which we define in the following theorem.

Theorem 6.11. Suppose that $d_0, \dots, d_{n-1} \in \mathbb{R}^n$ are arbitrarily chosen linearly independent directions (that span $\mathbb{R}^n$). Then the directions:
(6.21) $p_0 = d_0$
(6.22) $p_{i+1} = d_{i+1} + \sum_{j=0}^{i} c_j^{(i+1)} p_j$
where:
(6.23) $c_m^{(i+1)} = -\frac{d_{i+1}^T Q p_m}{p_m^T Q p_m} \quad m = 0,\dots,i,\ \forall i = 0,\dots,n-2$
are Q conjugate.

Proof. Assume that Equations 6.21 and 6.22 hold. At construction stage i + 1, to ensure Q conjugacy we require $p_{i+1}^T Q p_m = 0$ for $m = 0, \dots, i$. Using Equation 6.22 yields:
(6.24) $p_{i+1}^T Q p_m = \left(d_{i+1} + \sum_{j=0}^{i} c_j^{(i+1)}p_j\right)^T Q p_m = d_{i+1}^T Q p_m + \sum_{j=0}^{i} c_j^{(i+1)}\, p_j^T Q p_m$
If we apply induction, we know that only when $j = m$ is $p_j^T Q p_m \neq 0$. Thus, if $p_{i+1}^T Q p_m = 0$, then:
(6.25) $d_{i+1}^T Q p_m + c_m^{(i+1)}\, p_m^T Q p_m = 0 \quad m = 0,\dots,i$
That is, we have i + 1 equations with i + 1 unknowns and one unknown in each equation. Solving Equation 6.25 yields:
(6.26) $c_m^{(i+1)} = -\frac{d_{i+1}^T Q p_m}{p_m^T Q p_m} \quad m = 0,\dots,i,\ \forall i = 0,\dots,n-2$
This completes the proof. □

Corollary 6.12. The space spanned by $d_0, \dots, d_k$ is identical to the space spanned by $p_0, \dots, p_k$.
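A minimal Maple sketch of the construction in Theorem 6.11 follows; the Q and seed directions $d_i$ are illustrative assumptions:

with(LinearAlgebra):
Q := Matrix([[-4, 1], [1, -3]]):       #negative definite
d := [Vector([1, 0]), Vector([0, 1])]: #linearly independent seeds
p := [d[1]]:
for i from 2 to nops(d) do
  v := d[i]:
  for m to i - 1 do
    #Equation 6.23: c = -d^T Q p[m] / (p[m]^T Q p[m])
    c := -(Transpose(d[i]) . Q . p[m])/(Transpose(p[m]) . Q . p[m]):
    v := v + c*p[m]:
  end do:
  p := [op(p), v]:
end do:
Transpose(p[1]) . Q . p[2];  #returns 0: the directions are Q conjugate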
Exercise 42. Prove Corollary 6.12. [Hint: Note that Equations 6.21 and 6.22 show that di+1 is a linear combination of pi+1 , . . . , p0 .] Remark 6.13. The approach described in the preceding theorem is known as the GrahamSchmidt method. Exercise 43. In Example 6.8, show that the directions generated by the GrahamSchmidt method are identical to the provided directions. 3. The Conjugate Gradient Method Remark 6.14. One major problem to overcome with a conjugate direction method is obtaining a set of Q conjugate directions. This can be accomplished by using the gradients of f (x) at xk . Naturally, we must show that these vectors are linearly independent (and thus will form a basis of Rn after n iterations). 3. THE CONJUGATE GRADIENT METHOD 71 Lemma 6.15. Let f : Rn → R be defined as in Definition 6.5 and let xk be the sequence generated by the conjugate direction method using the Graham-Schmidt generated directions with dk = gk = ∇f (xk ). Then gk is orthogonal to p0 , . . . , pk−1 (and by extension g0 , . . . , gk−1 ). Proof. We proceed by induction on k. When k = 0, the set of vectors is just g0 and this is clearly a linearly independent set and the space spanned by g0 is identical to the space spanned by d0 from Corollary 6.12. Assume the statement is true up to stage k − 1. To see it is true for stage k, consider the following cases: (1) Suppose gk = 0. Then xk = xk−1 and we must be at a global maximum, and the theorem is true by assumption. (2) Suppose gk 6= 0. Then from Equations 6.10 and 6.11 and , we know that: ∇f (xk )T pi = 0 for i = 0, . . . , k − 1. Thus, gk is orthogonal to p0 , . . . , pk−1 and since (by Corollary 6.12) the space spanned by p0 , . . . , pk−1 is identical to the space spanned by g0 , . . . , gk−1 , we see that gk must be orthogonal to g0 , . . . , gk−1 .  Theorem 6.16. Let f : Rn → R be as in Definition 6.5 and let xk be the sequence generated by the conjugate direction method using the Graham-Schmidt generated directions with dk = gk = ∇f (xk ). If x∗ is the global maximum for f (x), then the sequence converges to x∗ after at most n steps and furthermore, we can write: (6.27) pk = gk + βk pk−1 where: (6.28) βk = gkT gk T gk−1 gk−1 Proof. From Equations 6.21 and 6.22, we know that: (6.29) (6.30) p0 = g0 pk+1 = gk+1 + k X (k+1) cj pj j=0 where: (6.31) c(k+1) = m T −gk+1 Qpm T pm Qpm m = 0, . . . , k, ∀k = 0, . . . , n − 2 Note that: (6.32) gm+1 − gm = Q(xm+1 − xm ) = αm Qpm and αm 6= 0 (for otherwise xm+1 = xm and the algorithm will have converged). This is because xm+1 = xm + αm pm , where αm is the optimal δm found by line search. Left T multiplying by −gk+1 yields: ( T −gk+1 gk+1 m = k T T (6.33) −αk gk+1 Qpm = −gk+1 (gm+1 − gm ) = 0 otherwise 72 6. CONJUGATE DIRECTION METHODS by Lemma 6.15. Thus we have: (6.34) c(k+1) m T −gk+1 Qpm = = T pm Qpm ( − α1k T gk+1 gk+1 pT Qp k k 0 m=k otherwise By a similar argument, we note that: 1 T (6.35) pTk Qpk = p (gk+1 − gk ) αk k Then substituting this into the previous Equation 6.35 into Equation 6.34 yields: (6.36) (k+1) βk+1 = ck = −gkT gk pTk (gk+1 − gk ) Substituting this into the Graham-Schmidt procedure yields: (6.37) pk+1 = gk+1 − T gk+1 gk+1 pk T pk (gk+1 − gk ) Written for k, rather than k + 1, we have: (6.38) pk = gk − gkT gk pk−1 pTk−1 (gk − gk−1 ) Finally, observe that pTk−1 gk = 0 and: (6.39) pk−1 = gk−1 + βk−1 pk−2 Thus: (6.40) T T −pTk−1 gk−1 = −gk−1 gk−1 − βk−1 pTk−2 gk−1 = −gk−1 gk−1 because pTk−2 gk−1 = 0. 
Finally we see that: (6.41) pk = gk + gkT gk pk−1 T gk−1 gk−1 Thus: (6.42) βk = gkT gk T gk−1 gk−1 The fact that this algorithm converges to x∗ in at most n iterations is a result of Theorem 6.7. This completes the proof.  Exercise 44. Decide whether the following is true and if so, prove it; if not, provide a counter-example: In the conjugate gradient method, pk is always an ascent direction and therefore βk > 0. Definition 6.17. Let f : Rn → R be as in Definition 6.5 and suppose that x = My where M ∈ Rn×n is symmetric and invertible. Then when the conjugate gradient method is applied to: 1 (6.43) h(y) = f (My) = yT MQMy − bT My 2 So that: yk+1 = yk + δk dk 4. APPLICATION TO NON-QUADRATIC PROBLEMS 73 where: d0 = ∇h(y0 ) and dk = ∇h(yk ) + βdk−1 ∇h(yk )T ∇h(yk ) ∇h(yk−1 )T ∇h(yk−1 ) When δk is obtained by line maximization, then the method is called the preconditioned conjugate gradient method. βk = Theorem 6.18. Consider the preconditioned conjugate gradient method with matrix M. Then when xk = Myk , then the preconditioned conjugate gradient method is equivalent to the conjugate gradient method where: (6.44) xk+1 = xk + δk pk (6.45) p0 = Mg0 (6.46) pk = Mgk + βpk−1 , (6.47) βk = gkT M2 gk T gk−1 M2 gk−1 k = 1, . . . , n − 1 and δk is obtained by line maximization. Furthermore, pk are Q conjugate. Exercise 45. Prove Theorem 6.18. 4. Application to Non-Quadratic Problems Remark 6.19. The conjugate gradient method can be applied to non-quadratic functions, usually in an attempt to exploit locally quadratic behavior. Assuming f : Rn → R is a differentiable function, the update rule is given by: ∇f (xk )T (∇f (xk ) − ∇f (xk−1 )) ∇f (xk−1 )T ∇f (xk−1 ) Alternatively we can also use: (6.48) βk = ∇f (xk )T ∇f (xk ) ∇f (xk−1 )T ∇f (xk−1 ) where the first formulation is called the Polak-Ribiéra formulation and is derived from Equation 6.33, while the second is the Fletcher-Reeves formulation and adapted from the straightforward transformation of the conjugate gradient method for quadratic functions. We note that the Polak-Ribiéra formulation is generally preferred. (6.49) βk = Exercise 46. Prove that the Polak-Ribiéra formulation and the Fletcher-Reeves formulation for βk are identical in the case when f (x) is a strictly concave quadratic function. Remark 6.20. Sample code for the general conjugate gradient method is shown in Algorithm 13. Remark 6.21. Notice that we have a while-loop encasing the steps of the conjugate gradient method. There are three ways to handle the general conjugate gradient method: 74 1 2 3 4 5 6. CONJUGATE DIRECTION METHODS ConjugateGradient := proc (F::operator, initVals::list, epsilon::numeric, maxIter::integer)::list; uses VectorCalculus,LinearAlgebra,Optimization: local vars, xnow, i, j, G, normG, OUT, passIn, phi, vals, ttemp, count, X, p, gnext, gnow, beta, pnext; 6 7 8 9 10 11 12 13 #The first few lines are house keeping. vars := []; xnow := []; vals := initVals; for i to nops(initVals) do vars := [op(vars), lhs(initVals[i])]; xnow := [op(xnow), rhs(initVals[i])]: end do; 14 15 16 17 #Compute the gradient and its norm, use the VectorCalculus package. G := Gradient(F(op(vars)), vars)); normG := Norm(evalf(eval(G, initVals))); 18 19 20 21 #Output will be the path we take to the "optimal" solution. #Your output could just be the optimal solution. OUT := []; 22 23 24 #An iteration counter. count := 0; 25 26 27 28 #While we’re not at a stationary point: while evalb(epsilon < normG and count < maxIter) do count := count+1; 29 30 31 #Compute the initial direction. 
This is p[0]. p := evalf(eval(G, vals)); 32 33 34 35 36 #Here’s the actual conjugate gradient search. for j to nops(initVals) do #Append the current position. OUT := [op(OUT), xnow]; 37 38 39 #Compute the gradient (wherever we are). This is g[k-1]. gnow := evalf(eval(G, vals)); 40 41 42 43 #Compute the next x-position, this will be x[k] #Do housekeeping first. passIn := convert(Vector(xnow)+s*p, list): 4. APPLICATION TO NON-QUADRATIC PROBLEMS 44 45 46 47 #Define the line function f(x+delta*p) phi := proc (t) options operator, arrow; evalf(eval(F(op(passIn)), s = t)) end proc; 48 49 50 #Run line search to find delta ttemp := LineSearch(phi); 51 52 53 #Compute the new position x + delta * p (This is x[k].) xnow := evalf(eval(passIn, [s = ttemp])); 54 55 56 57 58 59 #Do some house keeping. vals := []; for i to nops(vars) do vals := [op(vals), vars[i] = xnow[i]] end do; 60 61 62 #Compute the next gradient (this is g[k]). gnext := evalf(eval(G, vals)); 63 64 65 #Compute beta[k] beta := (Transpose(gnext).(gnext - gnow))/(Transpose(gnow).gnow); 66 67 68 69 #Compute p[k]. pnext := gnext + beta * p: p:=pnext: 70 71 72 end do; 73 74 75 #Append any remaining x positions. OUT := [op(OUT), xnow]; 76 77 78 #Compute the new gradient norm. normG := Norm(evalf(eval(G, vals))); 79 80 81 82 83 84 85 #Print out the gradient norm. print(sprintf("normg=%f", normG)) end do; #Return the output. OUT: end proc: Algorithm 13. Conjugate Gradient Method for non-quadratic functions with a simple loop. 75 76 6. CONJUGATE DIRECTION METHODS (1) The conjugate gradient sub-steps can be executed n times (when maximizing a function f : Rn → R). (2) The conjugate gradient sub-steps can be executed k < n times. (3) The conjugate gradient method can be executed n times or as long as: (6.50) ||∇f (xk )T ∇f (xk−1 )|| ≤ δ||∇f (xk−1 )||2 for δ ∈ (0, 1). Otherwise, restart with a gradient step. The previous equation acts as a test for conjugacy. F The idea behind these conjugate gradient methods is to execute a gradient ascent step every “few” iterations to ensure global convergence to a stationary point, while simultaneously trying to speed up convergence by using conjugate gradient steps in between. Example 6.22. Consider the execution of the conjugate gradient method to the function:  2 2 F (x, y) = x2 + 3 y 2 e1−x −y We obtain convergence in four iterations (8 steps of the conjugate gradient method) when we choose ǫ = 0.001 and begin at x = 1, y = 0.1. This is illustrated in Figure 6.1. Figure 6.1. The steps of the conjugate gradient algorithm applied to F (x, y). We can compare this behavior with the case when: F (x, y) = −2x2 − 10y 4 and we start from x = 15, y = 5. This is illustrated in Figure 6.2. In this example, the conjugate gradient method converges in two iterations (four steps), with much less zig-zagging than the gradient descent method or even Newton’s method. Notice also how the steps are highly normal to each other as we expect from Lemma 6.15. Example 6.23. Consider the function: (6.51) F (x, y) = −2x2 − 10y 2 The conjugate gradient method should converge in two steps for this function, but in practice (depending on the value of ǫ) it may not because of floating point error. In fact, if we run the conjugate gradient algorithm on this function, starting from x = 15, y = 5, we obtain 4. APPLICATION TO NON-QUADRATIC PROBLEMS 77 Figure 6.2. In this example, the conjugate gradient method also converges in four total steps, with much less zig-zagging than the gradient descent method or even Newton’s method. 
convergence after a surprising 4 steps when ǫ = 0.001 because of floating point error. This will not present itself if we use exact arithmetic. We can solve this problem in floating point by using the preconditioned conjugate gradient method. First, let us write: " #   −4 0 1 x x y (6.52) F (x, y) = y 2 0 −20 Consider the matrix: # " 1/2 0 √ (6.53) M = 0 1/10 5 If we let: " #    1/2 0 x z √ =M= (6.54) y w 0 1/10 5 Then: (6.55) because: (6.56)  1 z w h(z, w) = 2 MT QM = " −1 0 (6.57) Q= −4 0 0 −20 −1 0 # −1 where: " " # 0 #  z w −1 0 78 6. CONJUGATE DIRECTION METHODS If we execute the preconditioned conjugate gradient algorithmon h(z, w) starting from the √ position z = 30 and w = 10 5, where is: #−1     " 1/2 0 30 15 √ = √ 5 10 5 0 1/10 5 we obtain convergence to z ∗ = w∗ = 0 in two steps (as expected) and further we can see at once that x∗ = y ∗ = 0. Thus, M has stretched (and it could have twisted) our problem into the problem from Example 6.8. Exercise 47. Implement the conjugate gradient method. Try is on F (x, y) = −2x2 − 10y 2 . Remark 6.24. We state but do not prove one final theorem regarding the conjugate gradient method, which helps explain why the conjugate gradient method might only be executed k < n times before a restart. The proof is available in [Ber99]. Theorem 6.25. Assume Q has n − k eigenvalues on the interval [a, b] with b < 0 and the remaining eigenvalues are less than a. Then for every x0 , the vector xk+1 produced after k + 1 steps of the conjugate gradient method satisfies: 2  b−a (6.58) f (xk+1 ) ≥ f (x0 ) b+a  Remark 6.26. A similar condition as in Theorem 6.25 can be proved for pre-conditioned conjugate gradient method with appropriate transformations to the matrix. Thus, Theorem 6.25 helps explain conditions in which we may only be interested in executing the conjugate gradient loop k < n times. CHAPTER 7 Quasi-Newton Methods Remark 7.1. We noted in Chapter 5 that Newton’s method only works in a region around local maxima (or minima) and otherwise, the direction −H(xk )−1 ∇f (xk ) may not be an ascent direction. This is solved in modified Newton’s method approaches by adjusting the hessian matrix to ensure negative (or positive definiteness). Quasi-Newton methods, by contrast construct positive definite matrices Bk and use the update: (7.1) (7.2) rk = Bk ∇f (xk ) xk+1 = xk + δk rk These methods have the property that Bk approaches −H(xk )−1 as xk approaches a stationary point x∗ . This property can be showed exactly for quadratic functions. Exercise 48. Show that the Conjugate Gradient method with the Fletcher-Reeves rule can be thought of as like a quasi-Newton method when: (7.3) Bk = I n + (xk − xk−1 )∇f (xk )T ∇f (xk−1 )T ∇f (xk−1 ) while the Conjugate Gradient method with the Polak-Ribiéra rule can be thought of as like a quasi-Newton method when: (7.4) Bk = I n + (xk − xk−1 )(∇f (xk ) − ∇f (xk−1 ))T ∇f (xk−1 )T ∇f (xk−1 ) You do not have to prove Bk converges to H−1 (x∗ ). 1. Davidon-Fletcher-Powell (DFP) Quasi-Newton Method Definition 7.2 (Davidon-Fletcher-Powell Approximation). Suppose that f : Rn → R is a continuously differentiable function. Let: (7.5) (7.6) pk = xk+1 − xk = δk rk qk = ∇f (xk ) − ∇f (xk+1 ) Then the Davidon-Fletcher-Powell (DFP) inverse-Hessian approximation is: (7.7) Bk+1 = Bk + Bk qk qTk Bk pk pTk − pTk qk qTk Bk qk where B0 ∈ Rn×n is a symmetric positive definite matrix (usually In ). Remark 7.3. This formulation is useful for maximization problems. 
The DFP minimization formulation set qk = ∇f (xk+1 ) − ∇f (xk ) and rk = −Bk ∇f (xk ). 79 80 7. QUASI-NEWTON METHODS Lemma 7.4. Suppose that f : Rn → R, x0 ∈ Rn and B0 ∈ Rn×n is a symmetric and positive definite matrix and B1 , . . . , Bn are generated using Equation 7.7 and x1 , . . . , xn are generated by: (7.8) (7.9) rk = Bk ∇f (xk ) xk+1 = xk + δk rk with δk = arg max f (xk +δk rk ). If ∇f (xk ) 6= 0 for k = 0, . . . , n then B1 , . . . Bn are symmetric and positive definite. Thus, r0 , . . . , rn−1 are ascent directions. Proof. We proceed by induction. Clear B0 is symmetric and positive definite and thus p0 is an ascent direction since ∇f (x0 )T p0 = ∇f (x0 )T B0 ∇f (x0 ) > 0. Symmetry is ensured by the nature of DFP formula. Assume that statement is true for all k < n. We show the statement is true for Bk+1 . Consider yT Bk+1 y for any y ∈ Rn . We have: (7.10) yT Bk+1 y = yT Bk y + yT pk pTk y yT Bk qk qTk Bk y − pTk qk qTk Bk qk Note yT pk = pTk y and yT Bk qk = qTk Bk y because Bk is symmetric by the induction hypothesis. Furthermore, Bk has a Cholesky decomposition so that Bk = Lk LTk for some lower triangular matrix Lk . Let: (7.11) aT = y T Lk (7.12) bT = qTk Lk Then we can write: (7.13)  T  2 T T y p a b b a k yT Bk+1 y = aT a − = + bT b pTk qk     2 aT a bT b − aT b bT a y T pk + bT b pTk qk Recall the Schwartz Inequality (Lemma 1.13): (7.14) Thus, (aT b)(bT a) = (aT b)2 ≤ (aT a) · (bT b) aT a    bT b − aT b bT a >0 (7.15) bT b since bT b ≥ 0. As in the proof of Theorem 6.7, we know that:  ∂f (xk+1 ) = ∇f (xk+1 )T rk = ∇f (xk+1 )T pk = 0 ∂δk because δk = arg max f (xk + δk rk ). This means that: (7.16) (7.17) Thus: (7.18) pTk qk = pTk (∇f (xk ) − ∇f (xk+1 )) = pTk ∇f (xk ) = ∇f (xk )T BTk ∇f (xk ) > 0 2 yT pk >0 pTk qk 1. DAVIDON-FLETCHER-POWELL (DFP) QUASI-NEWTON METHOD 81 and yT Bk+1 y > 0. Thus, Bk+1 is positive definite and since rk+1 = Bk+1 ∇f (xk+1 ) is an ascent direction.  Exercise 49. Prove: Suppose f : Rn → R, x0 ∈ Rn and B0 ∈ Rn×n is a symmetric and positive definite matrix and B1 , . . . , Bn are generated using the DFP formula for minimization and x1 , . . . , xn are generated by: (7.19) (7.20) rk = −Bk ∇f (xk ) xk+1 = xk + δk rk with δk = arg min f (xk +δk rk ). If ∇f (xk ) 6= 0 for k = 0, . . . , n then B1 , . . . Bn are symmetric and positive definite. Thus, r0 , . . . , rn−1 are descent directions. Theorem 7.5. Suppose that f : Rn → R with: (7.21) 1 f (x) = xT Qx + bT x 2 where b ∈ Rn and Q ∈ Rn×n is symmetric and negative definite. Suppose that B1 , . . . , Bn−1 are generated using the DFP formulation (Equation 7.7) with x0 ∈ Rn and B0 ∈ Rn×n , symmetric and positive definite. If x1 , . . . , xn and r0 , . . . , rn−1 are generated by: (7.22) (7.23) rk = Bk ∇f (xk ) xk+1 = xk + δk rk with δk = arg max f (xk + δk rk ) and (7.24) (7.25) pk = xk+1 − xk = δk rk qk = ∇f (xk ) − ∇f (xk+1 ) then: (1) (2) (3) (4) p0 , . . . , pn−1 are Q conjugate and Bk+1 Qpj = −pj for j = 0, . . . , k Bn = −Q−1 and xn = x∗ is the global maximum of f (x). Proof. It suffices to prove (1) and (2) above. To see this note that if (1) holds, then the DFP quasi-Newton method is a conjugate direction method and therefore after n iterations, it must converge to the maximum x∗ of f (x). Furthermore, since p0 , . . . , pn−1 must be linearly independent by Lemma 6.4, they are a basis for Rn . Therefore, for each standard basis vector ei (i = 1, . . . , n) we see that there are α0 , . . . , αn−1 so that: ei = α0 p0 + · · · + αn−1 pn−1 and thus: Bk+1 Qei = −ei . 
The previous result holds for i = 1, . . . , n and therefore Bk+1 Q = −In , which means that Bn = −Q−1 . Recall that pk = xk+1 − xk and therefore: (7.26) Qpk = Q(xk+1 − xk ) = ∇f (xk+1 ) − ∇f (xk ) = −qk 82 7. QUASI-NEWTON METHODS Multiplying by Bk+1 we have: (7.27) Bk+1 Qpk = −Bk+1 qk =   Bk qk qTk Bk pk pTk qk Bk qk qTk Bk qk pk pTk − q = −B q − + = − Bk + T k k k pk qk qTk Bk qk pTk qk qTk Bk qk − Bk qk − pk + Bk qk = −pk We now proceed by induction to show that Bk+1 Qpj = −pj for j = 0, . . . , k. We will also prove Q conjugacy. We have just proved the based case when k = 0 and therefore, we assume this is true for all iterations up to some k. Note that: (7.28) ∇f (xk+1 ) = ∇f (xi+1 )+ (∇f (xi+2 ) − ∇f (xi+1 )) + (∇f (xi+3 ) − ∇f (xi+2 )) + · · · + (∇f (xk+1 ) − ∇f (xk )) = ∇f (xi+1 )+Qpi+1 +Qpi+2 +· · ·+Qpk = ∇f (xi+1 )+Q (pi+1 + pi+2 + · · · + pk ) By the induction hypothesis, we know that pi is orthogonal to pj where j = i + 1 . . . k and we know a fortiori that pi is orthogonal to ∇f (xi+1 ) (see Equation 7.16, as in the proof of Theorem 6.7). We conclude that: pTi ∇f (xk+1 ) = 0 for 0 ≤ i ≤ k. Applying the induction hypothesis again, we know that: Bk+1 Qpj = −pj for j = 0, . . . , k and therefore: (7.29) ∇f (xk+1 )T Bk+1 Qpj = −∇f (xk+1 )T pj = 0 for j = 0, . . . , k. But: (7.30) δk+1 rTk+1 = δk+1 ∇f (xk+1 )T Bk+1 = pk+1 Therefore: (7.31) pTk+1 Qpj = δk+1 rTk+1 Qpj = δk+1 ∇f (xk+1 )T Bk+1 Qpj = 0 and thus, p0 , . . . , pk+1 are Q conjugate. We now need only prove Bk+2 Qpj = −pj for j = 0, . . . , k + 1. We first note by the induction hypothesis that: (7.32) qTk+1 Bk+1 Qpi = −qTk+1 pi = (Qpk+1 )T pi = pTk+1 Qpi = 0 because Q is symmetric and we have established that p0 , . . . , pk+1 are Q conjugate. Then we write:   pk+1 pTk+1 Bk+1 qk+1 qTk+1 Bk+1 − qi = (7.33) Bk+2 qi = Bk+1 + T pk+1 qk+1 qTk+1 Bk+1 qk+1 Bk+1 qi + pk+1 pTk+1 qi Bk+1 qk+1 qTk+1 Bk+1 qi − pTk+1 qk+1 qTk+1 Bk+1 qk+1 Note that: Qpi = −qi and since pTk+1 Qpi = 0, we know −pTk+1 qi = −pk+1 Qpi = 0. Which implies pTk+1 qi = 0. Therefore, (7.34) pk+1 pTk+1 qi =0 pTk+1 qk+1 Finally since Bk+1 Qpi = −pi , we know: (7.35) qTk+1 Bk+1 qi = −qTk+1 Bk+1 Qpi = qTk+1 pi = −pTk+1 Qpi = 0 2. IMPLEMENTATION AND EXAMPLE OF DFP 83 Thus: (7.36) Bk+1 qk+1 qTk+1 Bk+1 qi =0 qTk+1 Bk+1 qk+1 Thus we conclude that: (7.37) Bk+2 qi = Bk+1 qi = −Bk+1 Qpi = pi and therefore: (7.38) Bk+2 Qpi = −pi So for i = 0, . . . , k we have Bk+2 Qpi = −pi . To prove that this is also true when i = k + 1 recall that Equation 7.27, handles this case. Thus we have shown that Bk+2 Qpi = −pi for i = 0, . . . , k + 1. This completes the proof.  2. Implementation and Example of DFP Example 7.6. Consider F (x, y) = −2x2 − 10y 2 . We know with exact arithmetic, the DFP method should converge in two steps (1 iteration). To see this, suppose B0 = I2 . If we start at x = 15 and y = 5. Then our first gradient and ascent direction are:   −60 (7.39) r0 = g0 = −100 Our first line function, then is: (7.40) φ0 (t) = −2 (15 − 60 t)2 − 10 (5 − 100 t)2 Solving φ′ (t) = 0 for t yields: 17 (7.41) t∗ = 268 17 This leads to our next point x1 = 15 + 268 (−60) and y1 = 5 + −90 y1 = 67 . We compute the gradient at this new point to obtain:  −3000  67 (7.42) g1 = 1800 67 This leads to the two expressions:  −255  67 (7.43) p0 = −425 67  −1020  67 (7.44) q0 = −8500 67 If we compute B1 we obtain: # " 15345 170353 − 169912 169912 (7.45) B1 = 15345 10337 − 169912 169912 We now compute r1 : " # − 15000 317 (7.46) r1 = 1800 317 17 (−100) 268 or x1 = 750 67 and 84 7. 
From $r_1$, we can compute our new line search function:
(7.47) $\phi_1(t) = -2\left(\frac{750}{67} - \frac{15000}{317}t\right)^2 - 10\left(-\frac{90}{67} + \frac{1800}{317}t\right)^2$
We can find $t^* = \frac{317}{1340}$ as before and compute our new position:
(7.48) $x_2 = \frac{750}{67} + \frac{317}{1340}\left(-\frac{15000}{317}\right) = 0$
(7.49) $y_2 = -\frac{90}{67} + \frac{317}{1340}\left(\frac{1800}{317}\right) = 0$
Thus showing that the DFP method converges in one iteration or two steps.

Remark 7.7. An implementation of the DFP quasi-Newton method is shown in Algorithm 14.¹

Example 7.8. We can apply the DFP method to $F(x,y) = (x^2+3y^2)e^{1-x^2-y^2}$. We obtain convergence in four iterations (8 steps of the conjugate gradient method) when we choose $\epsilon = 0.001$ and begin at $x = 1$, $y = 0.1$. This is illustrated in Figure 7.1. Notice that the steps are identical to those shown in the conjugate gradient method (see Figure 6.1). This is because the DFP method is a conjugate direction method.

Figure 7.1. The steps of the DFP algorithm applied to F(x, y).

¹Thanks to Simon Miller, who noticed that line 75 was missing in versions before 0.8.7. That's a big mistake!

1  DFPQuasiNewton := proc (F::operator, initVals::list, epsilon::numeric,
2    maxIter::integer)::list;
3
4  local vars, xnow, i, j, G, normG, OUT, passIn, phi, vals, ttemp, count,
5    X, p, gnext, gnow, beta, pnext, r, B, xnext, q;
6
7  #Do some house keeping.
8  vars := []; xnow := [];
9  vals := initVals;
10 for i to nops(initVals) do
11   vars := [op(vars), lhs(initVals[i])];
12   xnow := [op(xnow), rhs(initVals[i])]
13 end do;
14
15 #Define the gradient vector and the current norm of the gradient.
16 G := Vector(Gradient(F(op(vars)), vars));
17 normG := Norm(evalf(eval(G, initVals)));
18
19 #Output will be the path we take to an optimal point. We also define an
20 #iteration counter.
21 OUT := [];
22 count := 0;
23
24 #While we're not at a stationary point...
25 while evalb(epsilon < normG and count < maxIter) do
26   count := count + 1;
27
28   #Define B[0] and r[0], the first direction to walk. (It's a gradient step.)
29   B := IdentityMatrix(nops(vars), nops(vars));
30   r := B . evalf(eval(G, vals));
31
32
33   #Now go into our conjugate direction method.
34   for j to nops(initVals) do
35
36     #Append some output.
37     OUT := [op(OUT), xnow];
38
39     #Evaluate the gradient at this point. This is g[k-1].
40     gnow := evalf(eval(G, vals));
41
42     #Do some housekeeping and define the line function.
43     passIn := convert(Vector(xnow) + s*r, list);
44     phi := proc (t) options operator, arrow;
45       evalf(eval(F(op(passIn)), s = t))
46     end proc;
47
48     #Compute the optimal step length using parabolic bracketing and
49     #Golden section search.
50     ttemp := LineSearch(phi);
51
52     #Define the next x position and the p-vector in DFP.
53     xnext := evalf(eval(passIn, [s = ttemp]));
54     p := Vector(xnext - xnow);
55
56     #Do some housekeeping.
57     xnow := xnext;
58     vals := [];
59     for i to nops(vars) do
60       vals := [op(vals), vars[i] = xnow[i]]
61     end do;
62
63     #Evaluate the gradient at the next position.
64     gnext := evalf(eval(G, vals));
65
66     #Compute the q vector.
67     q := Vector(gnow - gnext);
68
69     #Update the B matrix B[k] (this is Equation 7.7).
70     B := B + (p . Transpose(p))/(Transpose(p) . q) -
71       (B . q . Transpose(q) . B)/(Transpose(q) . B . q);
72
73
74     #Compute the next direction.
75     r := B . gnext
76   end do;
77
78   #Compute the norm of the gradient.
79   normG := Norm(evalf(eval(G, vals)));
80 end do;
81 OUT
82 end proc:

Algorithm 14. Davidon-Fletcher-Powell quasi-Newton method.

3. Derivation of the DFP Method

Suppose we wish to derive the DFP update from first principles.
Recall we have defined: (7.50) (7.51) pk = xk+1 − xk = δk rk qk = ∇f (xk ) − ∇f (xk+1 ) = −Qpk 3. DERIVATION OF THE DFP METHOD 87 Proceeding inductively, suppose that at Step k, we know that p0 , . . . , pk are all: (1) Q conjugate and (2) Bk+1 Qpj = −pj for j = 0, . . . , k The second statement asserts that p0 , . . . , pk are to be eigenvectors of Bk+1 Q with eigenvalue 1. If this holds all the way through to k = n − 1, then clearly p0 , . . . , pn−1 are eigenvalues of Bn Q, each with eigenvalue −1, yielding precisely the fact that Bn Q = −In . If we want to ensure that this pattern, continues to hold, suppose that we want to write: (7.52) Bk+1 = Bk + Ck where Ck is a correction matrix that will ensure that the nice properties above are also true at k + 1. That is, it ensures that: p0 , . . . , pk are eigenvectors of Bk+1 Q with eigenvalue −1. This means we require: (7.53) Bk+1 Qpj = −pj =⇒ Bk+1 qj = pj j = 0, . . . , k Combining this with Equation 7.52 yields: (7.54) (Bk + Ck )qj = Bk qj + Ck qj = pj j = 0, . . . , k But: (7.55) Thus: (7.56) Bk qj = −Bk Qk pj = pj j = 0, . . . , k − 1 Bk qj + Ck qj = pj =⇒ pj + Ck qj = pj =⇒ Ck qj = 0 j = 0, . . . , k − 1 When k = j, we require: (7.57) C k q k = pk − B k q k from Equation 7.54. We are now free to use our imagination about the structure of Ck . Suppose that Ck had the rank-one term: (7.58) T1 = pk pTk pTk qk Then we see that T1 qk = pk and we could satisfy part of our requirement from Equation 7.57. On the other hand, if Ck had a second rank-one term: (7.59) T2 = − Bk qk qTk Bk qTk Bk qk then T2 qk = −Bk qk . Combining these we see: (7.60) Ck = T 1 + T 2 = pk pTk Bk qk qTk Bk − pTk qk qTk Bk qk as expected. Now, note that for any pj (j 6= k) we see that: (7.61) Ck q j = Bk qk qTk Bk qj pk pTk Qpj Bk qk qTk Bk Qpj pk pTk qj − = − + pTk qk qTk Bk qk pTk pk qTk Bk qk We know that: pk pTk Qpj =0 (7.62) pTk pk 88 7. QUASI-NEWTON METHODS by Q conjugacy. While: (7.63) Bk qk qTk Bk Qpj Bk qk pTk Qpj = =0 qTk Bk qk qTk Bk qk again by conjugacy. Note in the previous equation, we transformed qTk to −pTk Q and Bk Qpj to −pj . Thus, Ck qj = 0 are required. Remark 7.9. Since we are adding two rank-one corrections together to obtain Ck , the matrix Ck is sometimes called a rank-two correction. 4. Broyden-Fletcher-Goldfarb-Shanno (BFGS) Quasi-Newton Method Remark 7.10. Recall from our construction of the DFP correction that we really only require two things of the correction matrix Ck : Ck qj = 0 j = 0, . . . , k − 1 and Equation 7.57: C k q k = pk − B k q k This we could add any extra term Tk we like as long as Tk qj = 0 for j = 0, . . . , k. Definition 7.11 (Broyden Family of Updates). Suppose that f : Rn → R is a continuously differentiable function. Let: (7.64) (7.65) pk = xk+1 − xk = δk rk qk = ∇f (xk ) − ∇f (xk+1 ) Then the Brodyen Family of Updates for the inverse-Hessian approximation is: Bk+1 Bk qk qTk Bk pk pTk − + φk τk vk vkT = Bk + T T pk q k qk Bk qk (7.67) vk = Bk qk pk − T τk pk qk (7.68) τk = qTk Bk qk (7.69) 0 ≤ φk ≤ 1 (7.66) where: and B0 ∈ Rn×n is a symmetric positive definite matrix (usually In ). Exercise 50. Verify that when: (7.70) Tk = φk τk vk vkT then Tk qj = 0 for j = 0, . . . , k. Definition 7.12 (BFGS Update). If φk = 1 for all k, then Equation 7.66 is called the Broyden-Fletcher-Goldfarb-Shanno (BFGS) update. Remark 7.13. We state, but do not prove the following proposition, which can be verified by (extensive) matrix arithmetic. 4. BROYDEN-FLETCHER-GOLDFARB-SHANNO (BFGS) QUASI-NEWTON METHOD 89 Proposition 7.14. 
The BFGS Update is given by:
(7.71) $B_{k+1} = B_k + \left(1 + \frac{q_k^T B_k q_k}{p_k^T q_k}\right)\frac{p_k p_k^T}{p_k^T q_k} - \frac{B_k q_k p_k^T + p_k q_k^T B_k}{p_k^T q_k}$
Thus, if $C_B(\phi_k)$ is the Broyden family of updates correction matrix, $C_{\mathrm{DFP}}$ is the DFP correction matrix and $C_{\mathrm{BFGS}}$ is the BFGS correction matrix, then:
(7.72) $C_B = (1 - \phi_k)\,C_{\mathrm{DFP}} + \phi_k\, C_{\mathrm{BFGS}}$
(Note that $\phi_k = 0$ recovers the DFP update and $\phi_k = 1$ recovers the BFGS update, consistent with Definition 7.12.) □

Exercise 51. Prove Proposition 7.14.

Remark 7.15. The following lemma and theorem are proved in precisely the same way as the corresponding results for the DFP method, with appropriate allowances made for the extra term $T_k = \phi_k\tau_k v_k v_k^T$.

Lemma 7.16. Suppose that $f : \mathbb{R}^n \to \mathbb{R}$, $x_0 \in \mathbb{R}^n$ and $B_0 \in \mathbb{R}^{n\times n}$ is a symmetric and positive definite matrix and $B_1, \dots, B_n$ are generated using Equation 7.66 and $x_1, \dots, x_n$ are generated by:
(7.73) $r_k = B_k\nabla f(x_k)$
(7.74) $x_{k+1} = x_k + \delta_k r_k$
with $\delta_k = \arg\max f(x_k + \delta_k r_k)$. If $\nabla f(x_k) \neq 0$ for $k = 0, \dots, n$, then $B_1, \dots, B_n$ are symmetric and positive definite. Thus, $r_0, \dots, r_{n-1}$ are ascent directions. □

Theorem 7.17. Suppose that $f : \mathbb{R}^n \to \mathbb{R}$ with:
(7.75) $f(x) = \frac{1}{2}x^T Qx + b^T x$
where $b \in \mathbb{R}^n$ and $Q \in \mathbb{R}^{n\times n}$ is symmetric and negative definite. Suppose that $B_1, \dots, B_{n-1}$ are generated using the Broyden family of updates formulation (Equation 7.66) with $x_0 \in \mathbb{R}^n$ and $B_0 \in \mathbb{R}^{n\times n}$, symmetric and positive definite. If $x_1, \dots, x_n$ and $r_0, \dots, r_{n-1}$ are generated by:
(7.76) $r_k = B_k \nabla f(x_k)$
(7.77) $x_{k+1} = x_k + \delta_k r_k$
with $\delta_k = \arg\max f(x_k + \delta_k r_k)$ and
(7.78) $p_k = x_{k+1} - x_k = \delta_k r_k$
(7.79) $q_k = \nabla f(x_k) - \nabla f(x_{k+1})$
then:
(1) $p_0, \dots, p_{n-1}$ are Q conjugate,
(2) $B_{k+1}Qp_j = -p_j$ for $j = 0, \dots, k$,
(3) $B_n = -Q^{-1}$ and
(4) $x_n = x^*$ is the global maximum of f(x). □

Theorem 7.18. The sequence of iterates $\{x_k\}$ generated by any quasi-Newton algorithm in the Broyden family when applied to a maximization problem for a quadratic function $f(x) = \frac{1}{2}x^TQx + b^Tx$, where Q is negative definite, is identical to the sequence generated by the preconditioned conjugate gradient method with scaling matrix $B_0$.

Sketch of Proof. It is sufficient to show that $x_{k+1}$ maximizes f(x) over the subspace:
(7.80) $M_k = \{x : x = x_0 + \alpha_0 B_0\nabla f(x_0) + \cdots + \alpha_k B_0 \nabla f(x_k),\ \alpha_0,\dots,\alpha_k \in \mathbb{R}\}$
This can be proved when $B_0 = I_n$ by applying induction to show that for all k there are scalars $\beta_{ijk}$ such that:
(7.81) $B_k = I_n + \sum_{i=0}^{k}\sum_{j=0}^{k}\beta_{ijk}\nabla f(x_i)\nabla f(x_j)^T$
Therefore we can write:
(7.82) $p_k = B_k \nabla f(x_k) = \sum_{i=0}^{k}b_{ki}\nabla f(x_i)$
for some scalars $b_{ki} \in \mathbb{R}$. Thus for all i, $x_{i+1}$ lies in $M_i$. By Theorem 7.17, we know that any quasi-Newton method in the Broyden family is a conjugate direction method and therefore $x_{k+1}$ maximizes f(x) over $M_k$, and by the nature of f(x) this maximum must be unique. To see this for the case when $B_0 \neq I_n$, note that we can simply transform Q to a space in which it is $-I_n$, as in the preconditioned conjugate gradient method (see Example 6.23). □

Remark 7.19. This explains why Figure 7.1 is identical to Figure 6.1 and why we will see the same figure again when we apply the BFGS method to $F(x,y) = (x^2 + 3y^2)e^{1-x^2-y^2}$.

5. Implementation of the BFGS Method

Remark 7.20. The only reason that users prefer the BFGS method to the DFP method is that the BFGS method has better convergence properties in practice than the DFP method, which tends to produce singular matrices in a wider range of optimization problems. The BFGS method seems to be more stable.
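Before turning to the implementation, a one-step numeric sketch of the update (7.71) may help; the $B_0$, p, and q below are illustrative assumptions (chosen so that $p^Tq > 0$, as Lemma 7.16 requires):

with(LinearAlgebra):
B := IdentityMatrix(2):
p := Vector([1.0, 0.5]):
q := Vector([2.0, 1.5]):
pq := Transpose(p) . q:
#Equation 7.71; the result is symmetric and positive definite.
Bnext := B + (1 + (Transpose(q) . B . q)/pq) * (p . Transpose(p))/pq
           - ((B . q . Transpose(p)) + (p . Transpose(q) . B))/pq;
IsDefinite(Bnext);  #returns true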
Remark 7.21. We note also that the $\phi_k$ can be varied from iteration to iteration. This is not recommended (in fact, we usually either set it to 0 or 1), since it can be shown that for each iteration there is a specific $\phi_k$ that will generate a singular matrix at iteration k + 1, and it is better not to accidentally cause this to occur.

Remark 7.22. The BFGS method is implemented by replacing Lines 70 - 71 of Algorithm 14 with the lines:

1 #Update the B matrix B[k]
2 B := B + (1 + (Transpose(q).B.q)/(Transpose(p).q)) *
3   (p.Transpose(p))/(Transpose(p).q) -
4   (B.q.Transpose(p) + p.Transpose(q).B)/(Transpose(p).q);

Exercise 52. Show that the steps taken in Example 7.6 are identical for the BFGS method.

Example 7.23. We can apply the BFGS method to $F(x,y) = (x^2 + 3y^2)e^{1-x^2-y^2}$. We obtain convergence in four iterations when we choose $\epsilon = 0.001$ and begin at $x = 1$, $y = 0.1$ (as with the DFP method). This is illustrated in Figure 7.2. Notice that the steps are identical to those shown in the conjugate gradient method (see Figure 6.1) and the DFP method.

Figure 7.2. The steps of the BFGS algorithm applied to F(x, y).

Remark 7.24. There are several extensions of the BFGS method that are popular. The simplest is the limited memory BFGS (LBFGS) method [NL89], which does not require as much storage as the BFGS method since $B_k$ is never stored explicitly, but approximated from the previous m values of $p_k$ and $q_k$, where m < n. This makes it possible to apply the BFGS method to problems with many variables. The LBFGS method can also be extended for simple boundary constraints, in the LBFGS with bounding (LBFGSB) method [ZBLN97]. The code for these algorithms is available freely and often incorporated into popular commercial software. For example, Matlab uses a variation on BFGS for its fmin function. BFGS is also implemented in the GNU Scientific Library, which is available for free.

Remark 7.25. It is worth noting that implementations of the DFP and BFGS methods may produce singular or even negative definite matrices as a result of round-off errors. A check can be placed in the code (using the Cholesky decomposition) to determine this. When this happens, one can re-initialize $B_k$ to $I_n$ (e.g.) and continue with execution.

Exercise 53. Implement the BFGS algorithm with a check for singularity. Use it on Rosenbrock's Function:
(7.83) $-(1-x)^2 - 100\left(y - x^2\right)^2$
starting at point (0, 0).

CHAPTER 8

Numerical Differentiation and Derivative Free Optimization

The topics contained in this chapter are actually important, and each one could easily deserve a chapter on its own. Unfortunately, in order to get on to constrained problems, we must cut somewhere, and this is as reasonable a place to do it as any. If this were a real book, each of these sections would have its own chapter. Therefore, the concerned reader should see [NW06] for details.

1. Numerical Differentiation

Remark 8.1. Occasionally, it is non-trivial to compute a derivative of a function. In such a case, a numerical derivative can be used to compute an approximation of the gradient of the function at that point.

Definition 8.2 (Forward Difference Approximation). Let $f : \mathbb{R}^n \to \mathbb{R}$. The forward difference approximation is given by the formula:
(8.1) $\frac{\partial f(x_0)}{\partial x_i} \approx \frac{\delta f(x_0)}{\delta x_i} = \frac{f(x_0 + \epsilon e_i) - f(x_0)}{\epsilon}$
Here $\epsilon > 0$ is a tiny value and $e_i$ is the i-th standard basis vector for $\mathbb{R}^n$.
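A minimal sketch of the forward difference (8.1) in Maple; the function and step size are illustrative assumptions:

f := (x, y) -> (x^2 + 3*y^2)*exp(1 - x^2 - y^2):
eps := 1e-7:
#Approximate the partial derivative with respect to x at (1, 1/2).
dfdx := (f(1 + eps, 0.5) - f(1, 0.5))/eps;
#Compare with the exact symbolic value:
evalf(eval(diff(f(x, y), x), [x = 1, y = 0.5]));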
Remark 8.3. Note that by Taylor's Theorem (in one dimension):
(8.2) $f(x_0 + \epsilon e_i) = f(x_0) + \epsilon\frac{\partial f(x_0)}{\partial x_i} + O(\epsilon^2)$
Thus:
(8.3) $\frac{\delta f(x_0)}{\delta x_i} = \frac{\partial f(x_0)}{\partial x_i} + O(\epsilon)$
We know that $\delta x_i = \epsilon$. Assuming a floating point error of h when evaluating $\delta f(x_0)$, we see that:
(8.4) $\frac{h}{\epsilon} = O(\epsilon)$
or $h = O(\epsilon^2)$. If the machine precision for a given system is $h_{\min}$, then setting $\epsilon$ any smaller than $\sqrt{h_{\min}}$ could lead to numerical instability. Thus, a useful estimate is $\epsilon \ge \sqrt{h_{\min}}$. Note that, to a point, the smaller $\epsilon$, the better the estimate of the derivative.

Definition 8.4 (Central Difference Approximation). Let $f : \mathbb{R}^n \to \mathbb{R}$. The central difference approximation is given by the formula:
(8.5) $\frac{\partial f(x_0)}{\partial x_i} \approx \frac{\Delta f(x_0)}{\Delta x_i} = \frac{f(x_0 + \epsilon e_i) - f(x_0 - \epsilon e_i)}{2\epsilon}$
Here $\epsilon > 0$ is a tiny value and $e_i$ is the i-th standard basis vector for $\mathbb{R}^n$.

Remark 8.5. Note that by Taylor's Theorem (in one dimension):
(8.6) $f(x_0 - \epsilon e_i) = f(x_0) - \epsilon\frac{\partial f(x_0)}{\partial x_i} + \frac{1}{2}\epsilon^2\frac{\partial^2 f(x_0)}{\partial x_i^2} + O(\epsilon^3)$
A similar expression holds for $f(x_0 + \epsilon e_i)$. Thus, we can see that:
(8.7) $f(x_0 + \epsilon e_i) - f(x_0 - \epsilon e_i) = 2\epsilon\frac{\partial f(x_0)}{\partial x_i} + O(\epsilon^3)$
because the quadratic terms (of identical sign) cancel out. The result is:
(8.8) $\frac{f(x_0 + \epsilon e_i) - f(x_0 - \epsilon e_i)}{2\epsilon} = \frac{\partial f(x_0)}{\partial x_i} + O(\epsilon^2)$
Using a similar argument to the one above, we have $h = O(\epsilon^3)$. This yields a rule for choosing $\epsilon$ when using the central difference approximation: $\epsilon \ge h_{\min}^{1/3}$. We also note that the central difference formula is substantially more accurate (an order of magnitude in $\epsilon$) than the forward difference formula, but requires two functional evaluations rather than one.

Remark 8.6. From the previous observations, one might wonder when it is best to use the forward difference formula and when it is best to use the central difference formula. Bertsekas [Ber99] provides the practical advice that when the approximated derivative(s) are not very close to zero, the forward difference is fine. Once the derivative begins to approach zero (and the error may swamp the computation), a transition to the central difference is called for.

Remark 8.7. Code for computing a numerical gradient is shown in Algorithm 17. It uses the Forward Differencing Method shown in Algorithm 15 and the Central Differencing Method shown in Algorithm 16.

Exercise 54. Re-implement the BFGS algorithm using a finite difference approximation of the gradient. Compare your results to your original implementation on the function $F(x,y) = (x^2 + 3y^2)e^{1-x^2-y^2}$.

Remark 8.8. In a similar fashion, we can compute a numerical Hessian approximation. If $H(x_0) = \nabla^2 f(x_0)$, then:
(8.9) $H_{i,j} = \frac{\nabla f(x_0 + \epsilon e_j)_i - \nabla f(x_0)_i}{\epsilon}$
Here $\nabla f(x_0)_i$ is the i-th component of the gradient vector. A sketch of this computation follows Example 8.9 below.

Exercise 55. Implement a numerical Hessian computation algorithm and illustrate its effect on Newton's method.

Example 8.9. Numerical gradients have little impact on the convergence of quasi-Newton, conjugate gradient (or even modified Newton's) methods for well behaved functions. This is illustrated in Figure 8.1, in which we compare a BFGS run using numerical gradients with a BFGS run using exact gradients. The dashed black line uses symbolic gradients, while the solid red line uses numerical gradients. As you can see, there is very little difference detectable by inspection. You may face a time penalty for functions that are difficult to evaluate (because you are introducing additional functional evaluations). This is a practical issue that you must deal with.

Figure 8.1. A comparison of the BFGS method using numerical gradients vs. exact gradients.
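A sketch of the numerical Hessian of Equation 8.9, built from forward differences of a numerical gradient; it assumes the NumericalGradientF routine of Algorithm 15 (below) is loaded:

NumericalHessian := proc (f::operator, xnow::list, epsilon::numeric)::Matrix;
  local n, Hm, i, j, x, g0, g1;
  n := nops(xnow);
  Hm := Matrix(n, n);

  #The unperturbed gradient.
  g0 := NumericalGradientF(f, xnow, epsilon);

  for j to n do
    #Perturb the j-th coordinate and recompute the gradient.
    x := xnow;
    x[j] := x[j] + epsilon;
    g1 := NumericalGradientF(f, x, epsilon);

    #Column j of the Hessian approximation (Equation 8.9).
    for i to n do
      Hm[i, j] := (g1[i] - g0[i])/epsilon
    end do;
  end do;
  Hm
end proc: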
NumericalGradientF := proc (f::operator, xnow::list, epsilon::numeric)::Vector;
local G, fnow, i, x;

#Initialize a list to store output.
G := [];

#Passed in variables are non-modifiable in Maple.
x := xnow;

#We'll use this over and over, don't keep computing it.
fnow := f(op(xnow));

#For each index...
for i to nops(xnow) do

  #Perturb x[i].
  x[i] := x[i] + epsilon;

  #Store the forward difference in G.
  G := [op(G), (f(op(x)) - fnow)/epsilon];

  #Unperturb x[i].
  x[i] := x[i] - epsilon
end do;

#Return a gradient vector.
Vector(G)
end proc:

Algorithm 15. Forward Differencing Method for Numerical Gradients

NumericalGradientC := proc (f::operator, xnow::list, epsilon::numeric)::Vector;
local G, f1, f2, i, x;

#Initialize a list to store output.
G := [];

#Passed in variables are non-modifiable in Maple.
x := xnow;

#For each index...
for i to nops(xnow) do

  #Perturb x[i] forward and store the function value.
  x[i] := x[i] + epsilon;
  f1 := f(op(x));

  #Perturb x[i] backward and store the function value.
  x[i] := x[i] - 2*epsilon;
  f2 := f(op(x));

  #Store the central difference in G.
  G := [op(G), (f1 - f2)/(2*epsilon)];

  #Unperturb x[i].
  x[i] := x[i] + epsilon
end do;

#Return a gradient vector.
Vector(G)
end proc:

Algorithm 16. Central Differencing Method for Numerical Gradients

NumericalGradient := proc (f::operator, xnow::list, epsilon::numeric)::Vector;
local G;

#Compute the forward difference approximation.
G := NumericalGradientF(f, xnow, epsilon);

#Test to see how small the norm of G is. (You can supply your own rule.)
if Norm(G) <= 2*epsilon then
  G := NumericalGradientC(f, xnow, epsilon)
end if;

#Return the vector.
G
end proc:

Algorithm 17. A simple routine to compute the numerical gradient using forward and central differencing.

2. Derivative Free Methods: Powell's Method

Remark 8.10. We will discuss two derivative free methods for finding optimal points of functions. The first, Powell's method, is a conjugate direction method, while the second, the Hooke-Jeeves algorithm, is a type of optimization method called a pattern search. We will not prove convergence properties of either method, though as a conjugate direction method, Powell's method will have all the properties one should expect when applying it to a concave quadratic function.

Remark 8.11. Given a function $f : \mathbb{R}^n \to \mathbb{R}$, Powell's basic method works in the following way:
(1) Choose a starting position $x_0$ and n directions $p_0, \dots, p_{n-1}$.
(2) For each $k = 0, \dots, n-1$, let $\delta_k = \arg\max f(x_k + \delta_k p_k)$ and let $x_{k+1} = x_k + \delta_k p_k$.
(3) For each $k = 0, \dots, n-2$, set $p_k = p_{k+1}$.
(4) Set $p_{n-1} = x_n - x_0$.
(5) If $\|x_n - x_0\| < \epsilon$, stop.
(6) Solve $\delta_n = \arg\max f(x_n + \delta_n p_n)$. Set $x_0 = x_n + \delta_n p_n$. Goto (1).
Usually the directions $p_0, \dots, p_{n-1}$ are initialized as the standard basis vectors.
NumericalGradient := proc (f::operator, xnow::list, epsilon::numeric)::Vector;
  local G;

  #Compute the forward difference approximation.
  G := NumericalGradientF(f, xnow, epsilon);

  #Test to see how small the norm of G is. (You can supply your own rule.)
  if Norm(G) <= 2*epsilon then
    G := NumericalGradientC(f, xnow, epsilon)
  end if;

  #Return the vector.
  G
end proc:

Algorithm 17. A simple routine to compute the numerical gradient using forward and central differencing.

Figure 8.1. A comparison of the BFGS method using numerical gradients vs. exact gradients.

Remark 8.13 (Powell's Heuristic). Let:

(8.10) f0 = f(x0)
(8.11) fn = f(xn)
(8.12) fE = f(2xn − x0) = f(xn + (xn − x0))

Here fE is the value of f when we move from xn as far again along the direction xn − x0. Finally, define ∆f to be the magnitude of the largest increase along any single direction encountered in Step 2. Powell then provides the following rules:
(1) If fE ≤ f0, then keep the old set of directions for the next procedural iteration. That is, do not execute Steps 3, 4, or 6; simply check for convergence and return to Step 1 with x0 = xn.
(2) If 2(2fn + fE − f0)((fn − f0) − ∆f)² ≥ (fE − f0)²∆f, then keep the old set of directions for the next procedural iteration. That is, do not execute Steps 3, 4, or 6; simply check for convergence and return to Step 1 with x0 = xn.
(3) If neither of these situations occurs, replace the direction of greatest increase with xn − x0.

A reference implementation for this algorithm is shown in Algorithm 18. We remark that the BasisVector function called within this algorithm is a custom function that returns the ith standard basis vector; it is not part of the Maple baseline.

Example 8.14. We illustrate two examples of Powell's Method, the first on:

F(x, y) = (x² + 3y²) e^(1 − x² − y²)

and the second on a variation of Rosenbrock's Function:

G(x, y) = −(1 − x)² − 100(y − x²)²

The results are shown in Figure 8.2. We draw the reader's attention to the effect the valley of Rosenbrock's function has on the steps of Powell's method: while each step individually is an ascent, the zig-zagging pattern seems almost random.

Figure 8.2. Powell's Direction Set Method applied to a bimodal function and a variation of Rosenbrock's function. Notice the impact the valley has on the steps in Powell's method.

3. Derivative Free Methods: Hooke-Jeeves Method

Remark 8.15. The method of Hooke and Jeeves is a pattern search method, meaning it creates a search pattern and then replicates that pattern, expanding or contracting it when possible. The original discrete form has no proof of convergence [HJ61]. Bazaraa et al. present a version of the Hooke and Jeeves algorithm using line search, which they claim has convergence properties [BSS06].
PowellsMethod := proc (F::operator, initVals::list, epsilon::numeric, maxIter::integer)::list;
  local vars, xnow, i, P, L, vals, xdiff, r, OUT, passIn, phi, ttemp, xnext,
        p, count, Df, x0, Nr, f0, fN, fE, bestIndex;

  vars := []; xnow := []; vals := initVals;
  for i to nops(initVals) do
    vars := [op(vars), lhs(initVals[i])];
    xnow := [op(xnow), rhs(initVals[i])]
  end do;

  xdiff := 1;
  count := 0;
  OUT := [];
  L := [];
  bestIndex := -1;

  #Initialize the directions to the standard basis vectors.
  for i to nops(vals) do
    L := [op(L), BasisVector(i, nops(vals))]
  end do;

  while epsilon < xdiff and count < maxIter do
    count := count + 1;
    Df := 0;
    x0 := xnow;

    for i to nops(L) do
      OUT := [op(OUT), xnow];
      p := L[i];
      passIn := convert(Vector(xnow) + s*p, list);

      phi := proc (t) options operator, arrow;
        evalf(VectorCalculus:-eval(F(op(passIn)), s = t))
      end proc;

      ttemp := LineSearch(phi);

      if ttemp < epsilon then
        passIn := convert(Vector(xnow) - s*p, list);

        phi := proc (t) options operator, arrow;
          evalf(VectorCalculus:-eval(F(op(passIn)), s = t))
        end proc;
        ttemp := LineSearch(phi)
      end if;

      xnext := evalf(eval(passIn, [s = ttemp]));
      r := xnext - xnow;
      Nr := Norm(Vector(r));

      #Track the direction of greatest increase.
      if Df < Nr then
        Df := Nr;
        bestIndex := i
      end if;

      xnow := xnext
    end do;

    P := xnow - x0;
    f0 := evalf(F(op(x0)));
    fN := evalf(F(op(xnow)));
    fE := evalf(F(op(2*xnow - x0)));

    #Powell's heuristic: replace a direction only if neither test passes.
    if not (evalb(fE <= f0) or
            evalb(evalf(2*(2*fN + fE - f0)*((fN - f0) - Df)^2) >= evalf((fE - f0)^2*Df))) then
      L[bestIndex] := Vector(P)
    end if;
    xdiff := Norm(Vector(P))
  end do;
  OUT
end proc:

Algorithm 18. Powell's derivative free method of optimization with heuristic.

Remark 8.16. The method of Hooke and Jeeves attempts to find a movement pattern that increases the functional value and then repeats this pattern until there is evidence it is no longer working. Assume f : Rn → R. Like Powell's method, we begin with a set of (orthogonal) directions p0, . . . , pn−1, and scalars ǫ > 0, ∆ ≥ ǫ and α ∈ [0, 1]. We summarize the method as:
(1) Given a starting point x0 ∈ Rn, set y0 = x0. Set j = k = 0.
(2) For each j = 0, . . . , n − 1:
(a) If f(yj + ∆pj) > f(yj), the trial is a success and yj+1 = yj + ∆pj.
(b) Otherwise, the trial is a failure; in that case, if f(yj − ∆pj) > f(yj), the new trial is a success and yj+1 = yj − ∆pj.
(c) Otherwise, yj+1 = yj.
(3) If f(yn) > f(xk), then set xk+1 = yn and set y0 = xk+1 + α(xk+1 − xk). Goto (2).
(4) Otherwise: if ∆ < ǫ, terminate. If ∆ ≥ ǫ, set ∆ = ∆/2. Let xk+1 = xk. Let y0 = xk. Goto (2).

Remark 8.17. The Hooke-Jeeves pattern search essentially gropes around in each (cardinal) direction attempting to improve the value of f(x) as it goes. When it finds a point where improvement is not possible, it shrinks the distance it looks and tries again, until it reaches a stopping point. When y0 = xk+1 + α(xk+1 − xk), the algorithm is attempting to use the pattern of improvement just identified to move f(x) further uphill. A reference implementation for Hooke-Jeeves is shown in Algorithm 19.

Example 8.18. We can apply the Hooke-Jeeves algorithm to our bimodal function:

F(x, y) = (x² + 3y²) e^(1 − x² − y²)

When we start at (x0, y0) = (1, 0.1) with α = 0.5 and ∆ = 1, we get good convergence, illustrated in Figure 8.3.

Figure 8.3. Hooke-Jeeves algorithm applied to a bimodal function.
If we apply Hooke-Jeeves to a variation of Rosenbrock's Function:

G(x, y) = −(1 − x)² − 100(y − x²)²

we see different behavior. For ∆ = 1 and α = 0.5, we fail to converge to the optimal point (1, 1). If we adjust α to 1, we do converge, in 28 steps. This is shown in Figure 8.4. Thus we can see that the Hooke-Jeeves method may be very sensitive to the parameters supplied; it should only be used on functions where evaluation is very easy, since we may have to try a few parameter combinations before we identify a true maximum.

HookeJeeves := proc (F::operator, initVals::list, epsilon::numeric, DeltaIn::numeric, alpha::numeric)::list;
  local i, OUT, vars, xnow, L, vals, Delta, p, xlast, ynow, ytest;

  Delta := DeltaIn;
  vars := []; xnow := []; vals := initVals;
  for i to nops(initVals) do
    vars := [op(vars), lhs(initVals[i])];
    xnow := [op(xnow), rhs(initVals[i])]
  end do;

  OUT := [xnow];
  L := [];
  for i to nops(vals) do
    L := [op(L), BasisVector(i, nops(vals))]
  end do;

  ynow := xnow;

  while epsilon < Delta do
    #Exploratory moves in each direction.
    for i to nops(L) do
      p := L[i];
      ytest := convert(Vector(ynow) + Delta*p, list);
      if F(op(ynow)) < F(op(ytest)) then
        ynow := ytest
      else
        ytest := convert(Vector(ynow) - Delta*p, list);
        if F(op(ynow)) < F(op(ytest)) then
          ynow := ytest
        end if
      end if
    end do;

    #Pattern move if the sweep improved on xnow; otherwise contract.
    if F(op(xnow)) < F(op(ynow)) then
      xlast := xnow;
      xnow := ynow;
      ynow := xnow + alpha*(xnow - xlast);
      OUT := [op(OUT), xnow]
    else
      Delta := (1/2)*Delta;
      ynow := xnow
    end if
  end do;
  OUT
end proc:

Algorithm 19. Hooke-Jeeves derivative free algorithm.

Figure 8.4. Hooke-Jeeves algorithm applied to a variation of Rosenbrock's function.

CHAPTER 9

Linear Programming: The Simplex Method

The material covered in this chapter is a summary of my Math 484 lecture notes [Gri11], which go into more detail (and have proofs). If you're interested, I suggest you look there or get a good book on Linear Programming like [BJS04]. Also, under no circumstances do I recommend you implement the simplex method. There are too many "nits" for implementing an efficient algorithm. Instead, you should check out the GNU Linear Programming Kit (http://www.gnu.org/software/glpk/). It's very well implemented and free (in every sense of the word).

1. Linear Programming: Notation

Definition 9.1 (Linear Programming Problem). A linear programming problem is an optimization problem of the form:

(9.1) max z(x1, . . . , xn) = c1x1 + · · · + cnxn
      s.t. a11x1 + · · · + a1nxn ≤ b1
           ...
           am1x1 + · · · + amnxn ≤ bm
           h11x1 + · · · + h1nxn = r1
           ...
           hl1x1 + · · · + hlnxn = rl

Remark 9.2. You will recall from your matrices class (Math 220) that matrices can be used as a shorthand way to represent linear equations. Consider the following system of equations:

(9.2) a11x1 + a12x2 + · · · + a1nxn = b1
      a21x1 + a22x2 + · · · + a2nxn = b2
      ...
      am1x1 + am2x2 + · · · + amnxn = bm

Then we can write this in matrix notation as:

(9.3) Ax = b

where Aij = aij for i = 1, . . . , m, j = 1, . . . , n, x is a column vector in Rn with entries xj (j = 1, . . . , n), and b is a column vector in Rm with entries bi (i = 1, . . . , m).
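To make Remark 9.2 concrete, the following minimal Maple sketch (our own illustration, not part of the notes' algorithms) encodes a small system in the form Ax = b; the 2 × 3 system from Example 9.17 below serves as the data.

with(LinearAlgebra):
A := Matrix([[1, 2, 3], [4, 5, 6]]):   #Coefficient matrix, entries a_ij.
b := Vector([7, 8]):                   #Right-hand-side column vector.
x := Vector([x1, x2, x3]):             #Symbolic decision variables.

#Expanding A . x recovers the left-hand sides of the two equations.
A . x;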
Obviously, if we replace the equalities in Expression 9.2 with inequalities, we can also express systems of inequalities in the form:

(9.4) Ax ≤ b

Using this representation, we can write our general linear programming problem using matrix and vector notation. Expression 9.1 can be written as:

(9.5) max z(x) = cᵀx
      s.t. Ax ≤ b
           Hx = r

Definition 9.3. In Problem 9.5, if we restrict some of the decision variables (the xi's) to have only integer (or discrete) values, then the problem becomes a mixed integer linear programming problem. If all of the variables are restricted to integer values, the problem is an integer programming problem, and if every variable can only take on the values 0 or 1, the program is called a 0-1 integer programming problem. [WN99] is an excellent reference for Integer Programming.

Definition 9.4 (Canonical Form). A maximization linear programming problem is in canonical form if it is written as:

(9.6) max z(x) = cᵀx
      s.t. Ax ≤ b
           x ≥ 0

A minimization linear programming problem is in canonical form if it is written as:

(9.7) min z(x) = cᵀx
      s.t. Ax ≥ b
           x ≥ 0

Definition 9.5 (Standard Form). A linear programming problem is in standard form if it is written as:

(9.8) max z(x) = cᵀx
      s.t. Ax = b
           x ≥ 0

Remark 9.6. The following theorem is outside the scope of the course. You may cover it in Math 484 [Gri11].

Theorem 9.7. Every linear programming problem in canonical form can be put into standard form.

Exercise 56. Show that a minimization linear programming problem in canonical form can be rephrased as a maximization linear programming problem in canonical form. [Hint: Multiply the objective and constraints by −1. Define new matrices.]

Remark 9.8. To illustrate Theorem 9.7, we note that it is relatively easy to convert any inequality constraint into an equality constraint. Consider the inequality constraint:

(9.9) ai1x1 + ai2x2 + · · · + ainxn ≤ bi

We can add a new slack variable si to this constraint to obtain:

ai1x1 + ai2x2 + · · · + ainxn + si = bi

Obviously this slack variable satisfies si ≥ 0. The slack variable then becomes just another variable whose value we must discover as we solve the linear program for which Expression 9.9 is a constraint. We can deal with constraints of the form:

(9.10) ai1x1 + ai2x2 + · · · + ainxn ≥ bi

in a similar way. In this case we subtract a surplus variable si to obtain:

ai1x1 + ai2x2 + · · · + ainxn − si = bi

Again, we must have si ≥ 0.

Example 9.9. Consider the linear programming problem:

max z(x1, x2) = 2x1 − x2
s.t. x1 − x2 ≤ 1
     2x1 + x2 ≥ 6
     x1, x2 ≥ 0

This linear programming problem can be put into standard form by using both a slack and a surplus variable:

max z(x1, x2) = 2x1 − x2
s.t. x1 − x2 + s1 = 1
     2x1 + x2 − s2 = 6
     x1, x2, s1, s2 ≥ 0

Definition 9.10 (Row Rank). Let A ∈ Rm×n. The row rank of A is the size of the largest set of rows (viewed as vectors) of A that are linearly independent.

Remark 9.11. The column rank of a matrix A ∈ Rm×n is defined analogously on columns rather than rows. The following theorem relates the row and column rank. Its proof is outside the scope of the course.

Theorem 9.12. If A ∈ Rm×n is a matrix, then the row rank of A is equal to the column rank of A. Further, rank(A) ≤ min{m, n}.

Definition 9.13. Suppose that A ∈ Rm×n and let m ≤ n. Then A has full row rank if rank(A) = m.
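As a quick numerical check of Theorem 9.12 and Definition 9.13, Maple's LinearAlgebra package computes rank directly; the snippet below (our own, reusing the matrix from Example 9.17 in the next section) confirms full row rank.

with(LinearAlgebra):
A := Matrix([[1, 2, 3], [4, 5, 6]]):

#Row rank equals column rank (Theorem 9.12); here Rank(A) returns
#2 = m, so A has full row rank (Definition 9.13).
Rank(A);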
Remark 9.14. When dealing with linear programming problems in standard or canonical form, we will assume that the matrix A has full row rank; if it does not, we will adjust it so that this is true. The theorems that follow tell us what can happen in a linear programming problem.

2. Polyhedral Theory and Linear Equations and Inequalities

2.1. Solving Systems with More Variables than Equations. Suppose now that A ∈ Rm×n where m ≤ n. Let b ∈ Rm. Then the equation:

(9.11) Ax = b

has more variables than equations and is underdetermined; if A has full row rank, then the system will have an infinite number of solutions. We can formulate an expression to describe this infinite set of solutions. Since A has full row rank, we may choose any m linearly independent columns of A, corresponding to a subset of the variables, say xi1, . . . , xim. We can use these to form the matrix:

(9.12) B = [A·i1 · · · A·im]

from the columns A·i1, . . . , A·im of A, so that B is invertible. It should be clear at this point that B will be invertible precisely because we've chosen m linearly independent column vectors. We can then use elementary column operations to write the matrix A as:

(9.13) A = [B|N]

The matrix N is composed of the n − m other columns of A not in B. We can similarly sub-divide the column vector x and write:

(9.14) [B|N] [xB; xN] = b

where the vector xB holds the variables corresponding to the columns in B and the vector xN holds the variables corresponding to the columns of the matrix N.

Definition 9.15 (Basic Variables). For historical reasons, the variables in the vector xB are called the basic variables and the variables in the vector xN are called the non-basic variables.

We can use matrix multiplication to expand the left hand side of this expression as:

(9.15) BxB + NxN = b

The fact that B is composed of all linearly independent columns implies that applying Gauss-Jordan elimination to it will yield an m × m identity and thus that B is invertible. We can solve for the basic variables xB in terms of the non-basic variables:

(9.16) xB = B−1b − B−1NxN

We can find an arbitrary solution to the system of linear equations by choosing values for the non-basic variables and solving for the basic variable values using Equation 9.16.

Definition 9.16 (Basic Solution). When we assign xN = 0, the resulting solution for x is called a basic solution and:

(9.17) xB = B−1b

Example 9.17. Consider the problem:

(9.18) [1 2 3; 4 5 6] [x1; x2; x3] = [7; 8]

Then we can let x3 = 0 and:

(9.19) B = [1 2; 4 5]

We then solve1:

(9.20) [x1; x2] = B−1 [7; 8] = [−19/3; 20/3]

Other basic solutions could be formed by creating B out of columns 1 and 3 or columns 2 and 3.

1Thanks to Doug Mercer, who found a typo below that was fixed.

Exercise 57. Find the two other basic solutions in Example 9.17 corresponding to:

B = [2 3; 5 6]  and  B = [1 3; 4 6]

In each case, determine what the matrix N is. [Hint: Find the solutions any way you like. Make sure you record exactly which xi (i ∈ {1, 2, 3}) is equal to zero in each case.]

2.2. Polyhedral Sets. Important examples of convex sets are polyhedral sets, the multi-dimensional analogs of polygons in the plane. In order to understand these structures, we must first understand hyperplanes and half-spaces.

Definition 9.18 (Hyperplane). Let a ∈ Rn be a constant vector in n-dimensional space and let b ∈ R be a constant scalar. The set of points:

(9.21) H = {x ∈ Rn | aᵀx = b}

is a hyperplane in n-dimensional space.
Note the use of column vectors for a and x in this definition. Example 9.19. Consider the hyper-plane 2x1 +3x2 +x3 = 5. This is shown in Figure 9.1. This hyperplane is composed of the set of points (x1 , x2 , x3 ) ∈ R3 satisfying 2x1 +3x2 +x3 = 5. This can be plotted implicitly or explicitly by solving for one of the variables, say x3 . We can write x3 as a function of the other two variables as: (9.22) x3 = 5 − 2x1 − 3x2 Definition 9.20 (Half-Space). Let a ∈ Rn be a constant vector in n-dimensional space and let b ∈ R be a constant scalar. The sets of points  (9.23) Hl = x ∈ Rn |aT x ≤ b  (9.24) Hu = x ∈ Rn |aT x ≥ b are the half-spaces defined by the hyperplane aT x = b. Example 9.21. Consider the two dimensional hyperplane (line) x1 + x2 = 1. Then the two half-spaces associated with this hyper-plane are shown in Figure 9.2. A half-space is so named because the hyperplane aT x = b literally separates Rn into two halves: the half above the hyperplane and the half below the hyperplane. 1Thanks to Doug Mercer, who found a typo below that was fixed. 110 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD Figure 9.1. A hyperplane in 3 dimensional space: A hyperplane is the set of points satisfying an equation aT x = b, where k is a constant in R and a is a constant vector in Rn and x is a variable vector in Rn . The equation is written as a matrix multiplication using our assumption that all vectors are column vectors. (a) Hl (b) Hu Figure 9.2. Two half-spaces defined by a hyper-plane: A half-space is so named because any hyper-plane divides Rn (the space in which it resides) into two halves, the side “on top” and the side “on the bottom.” Definition 9.22 (Polyhedral Set). If P ⊆ Rn is the intersection of a finite number of half-spaces, then P is a polyhedral set. Formally, let a1 , . . . , am ∈ Rn be a finite set of constant vectors and let b1 , . . . , bm ∈ R be constants. Consider the set of half-spaces: Hi = {x|aTi x ≤ bi } Then the set: (9.25) P = m \ Hi i=1 is a polyhedral set. 2. POLYHEDRAL THEORY AND LINEAR EQUATIONS AND INEQUALITIES 111 It should be clear that we can represent any polyhedral set using a matrix inequality. The set P is defined by the set of vectors x satisfying: (9.26) Ax ≤ b, where the rows of A ∈ Rm×n are made up of the vectors a1 , . . . , am and b ∈ Rm is a column vector composed of elements b1 , . . . , bm . Theorem 9.23. Every polyhedral set is convex. Exercise 58. Prove Theorem 9.23. [Hint: You can prove this by brute force, verifying convexity. You can also be clever and use two results that we’ve proved in the notes.] 2.3. Directions of Polyhedral Sets. Recall the definition of a line (Definition 1.17 from Chapter 1). A ray is a one sided line. Definition 9.24 (Ray). Let x0 ∈ Rn be a point and and let d ∈ Rn be a vector called the direction. Then the ray with vertex x0 and direction d is the collection of points {x|x = x0 + λd, λ ≥ 0}. Definition 9.25 (Direction of a Convex Set). Let C be a convex set. Then d 6= 0 is a (recession) direction of the convex set if for all x0 ∈ C the ray with vertex x0 and direction d is contained entirely in C. Formally, for all x0 ∈ C we have: (9.27) {x : x = x0 + λd, λ ≥ 0} ⊆ C Remark 9.26. There is a unique relationship between the defining matrix A of a polyhedral set P and a direction of this set that is particularly useful when we assume that P is located in the positive orthant of Rn (i.e., x ≥ 0 are defining constraints of P ). Remark 9.27. A proof of the next theorem can be found in [Gri11]. Theorem 9.28. 
Suppose that P ⊆ Rn is a polyhedral set defined by: (9.28) P = {x ∈ Rn : Ax ≤ b, x ≥ 0} If d is a direction of P , then the following hold: (9.29) Ad ≤ 0, d ≥ 0, d 6= 0. Corollary 9.29. If (9.30) P = {x ∈ Rn : Ax = b, x ≥ 0} and d is a direction of P , then d must satisfy: (9.31) Ad = 0, d ≥ 0, d 6= 0. Exercise 59. Prove the corollary above. Example 9.30. Consider the polyhedral set defined by the equations: x1 − x2 ≤ 1 2x1 + x2 ≥ 6 x1 ≥ 0 x2 ≥ 0  112 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD This set is clearly unbounded as we showed in class and it has at least one direction. The direction d = [0, 1]T pointing directly up is a direction of this set. This is illustrated in Figure 9.3. In this example, we have: Figure 9.3. An Unbounded Polyhedral Set: This unbounded polyhedral set has many directions. One direction is [0, 1]T . (9.32)  1 −1 A= −2 −1  Note, the second inequality constraint was a greater-than constraint. We reversed it to a less-than inequality constraint −2x1 − x2 ≤ −6 by multiplying by −1. For our chosen direction d = [0, 1]T , we can see that:      −1 1 −1 0 ≤0 = (9.33) Ad = −1 −2 −1 1 Clearly d ≥ 0 and d 6= 0. 2.4. Extreme Points. Definition 9.31 (Extreme Point of a Convex Set). Let C be a convex set. A point x0 ∈ C is a extreme point of C if there are no points x1 and x2 (x1 6= x0 or x2 6= x0 ) so that x = λx1 + (1 − λ)x2 for some λ ∈ (0, 1).2 An extreme point is simply a point in a convex set C that cannot be expressed as a strict convex combination of any other pair of points in C. We will see that extreme points must be located in specific locations in convex sets. Definition 9.32 (Boundary of a set). Let C ⊆ Rn be (convex) set. A point x0 ∈ C is on the boundary of C if for all ǫ > 0, Bǫ (x0 ) ∩ C 6= ∅ and Bǫ (x0 ) ∩ Rn \ C 6= ∅ Example 9.33. A convex set, its boundary and a boundary point are illustrated in Figure 9.4. 2Thanks to Bob Pakzad-Hurson who fixed a typo in this definition in Version ≤ 1.4. 2. POLYHEDRAL THEORY AND LINEAR EQUATIONS AND INEQUALITIES 113 BOUNDARY POINT BOUNDARY INTERIOR Figure 9.4. Boundary Point: A boundary point of a (convex) set C is a point in the set so that for every ball of any radius centered at the point contains some points inside C and some points outside C. Lemma 9.34. Suppose C is a convex set. If x is an extreme point of C, then x is on the boundary of C. Exercise 60. Prove the previous lemma. Most important in our discussion of linear programming will be the extreme points of polyhedral sets that appear in linear programming problems. The following theorem establishes the relationship between extreme points in a polyhedral set and the intersection of hyperplanes in such a set. Theorem 9.35. Let P ⊆ Rn be a polyhedral set and suppose P is defined as: (9.34) P = {x ∈ Rn : Ax ≤ b} where A ∈ Rm×n and b ∈ Rm . A point x0 ∈ P is an extreme point of P if and only if x0 is the intersection of n linearly independent hyperplanes from the set defining P . Remark 9.36. The easiest way to see this as relevant to linear programming is to assume that (9.35) P = {x ∈ Rn : Ax ≤ b, x ≥ 0} In this case, we could have m < n. In that case, P is composed of the intersection of n + m half-spaces. The first m are for the rows of A and the second n are for the non-negativity constraints. An extreme point comes from the intersection of n of the hyperplanes defining these half-spaces. We might have m come from the constraints Ax ≤ b and the other n − m from x ≥ 0. Remark 9.37. A complete proof of the previous theorem can be found in [Gri11]. 
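Returning to Example 9.30, the direction conditions of Theorem 9.28 are easy to check numerically. The following is a minimal Maple sketch (our own illustration):

with(LinearAlgebra):
A := Matrix([[1, -1], [-2, -1]]):   #Constraints rewritten as Ax <= b.
d := Vector([0, 1]):                #Candidate direction.

#Theorem 9.28 requires Ad <= 0, d >= 0 and d <> 0.
A . d;                              #Returns the vector (-1, -1).

Since every entry of A . d is non-positive while d is non-negative and non-zero, d = [0, 1]ᵀ is indeed a direction of the polyhedral set, as argued in the example.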
Definition 9.38. Let P be the polyhedral set from Theorem 9.35. If x0 is an extreme point of P and more than n hyperplanes are binding at x0 , then x0 is called a degenerate extreme point. Definition 9.39 (Face). Let P be a polyhedral set defined by P = {x ∈ Rn : Ax ≤ b} where A ∈ Rm×n and b ∈ Rm . If X ⊆ P is defined by a non-empty set of binding linearly independent hyperplanes, then X is a face of P . 114 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD That is, there is some set of linearly independent rows Ai1 · , . . . Ail · with il < m so that when G is the matrix made of these rows and g is the vector of bi1 , . . . , bil then: (9.36) X = {x ∈ Rn : Gx = g and Ax ≤ b} In this case we say that X has dimension n − l. Remark 9.40. Based on this definition, we can easily see that an extreme point, which is the intersection n linearly independent hyperplanes is a face of dimension zero. Definition 9.41 (Edge and Adjacent Extreme Point). An edge of a polyhedral set P is any face of dimension 1. Two extreme points are called adjacent if they share n − 1 binding constraints. That is, they are connected by an edge of P . Example 9.42. Consider the polyhedral set defined by the system of inequalities: 3x1 + x2 ≤ 120 x1 + 2x2 ≤ 160 28 x1 + x2 ≤ 100 16 x1 ≤ 35 x1 ≥ 0 x2 ≥ 0 The polyhedral set is shown in Figure 9.5. The extreme points of the polyhedral set are shown Figure 9.5. A Polyhedral Set: This polyhedral set is defined by five half-spaces and has a single degenerate extreme point located at the intersection of the binding 28 constraints 3x1 + x2 ≤ 120, x1 + 2x2 ≤ 160 and 16 x1 + x2 <= 100. All faces are shown in bold. as large diamonds and correspond to intersections of binding constraints. Note the extreme point (16, 72) is degenerate since it occurs at the intersection of three binding constraints x + x2 <= 100. All the faces of the polyhedral set 3x1 + x2 ≤ 120, x1 + 2x2 ≤ 160 and 28 16 1 are shown in bold. They are locations where one constraint (or half-space) is binding. An 2. POLYHEDRAL THEORY AND LINEAR EQUATIONS AND INEQUALITIES 115 example of a pair of adjacent extreme points is (16, 72) and (35, 15), as they are connected by the edge defined by the binding constraint 3x1 + x2 ≤ 120. Exercise 61. Consider the polyhedral set defined by the system of inequalities: 4x1 + x2 ≤ 120 x1 + 8x2 ≤ 160 x1 + x2 ≤ 30 x1 ≥ 0 x2 ≥ 0 Identify all extreme points and edges in this polyhedral set and their binding constraints. Are any extreme points degenerate? List all pairs of adjacent extreme points. 2.5. Extreme Directions. Definition 9.43 (Extreme Direction). Let C ⊆ Rn be a convex set. Then a direction d of C is an extreme direction if there are no two other directions d1 and d2 of C (d1 6= d and d2 6= d) and scalars λ1 , λ2 > 0 so that d = λ1 d1 + λ2 d2 . We have already seen by Theorem 9.28 that is P is a polyhedral set in the positive orthant of Rn with form: P = {x ∈ Rn : Ax ≤ b, x ≥ 0} then a direction d of P is characterized by the set of inequalities and equations Ad ≤ 0, d ≥ 0, d 6= 0. Clearly two directions d1 and d2 with d1 = λd2 for some λ ≥ 0 may both satisfy this system. To isolate a unique set of directions, we can normalize and construct the set: (9.37) D = {d ∈ Rn : Ad ≤ 0, d ≥ 0, eT d = 1} here we are interested only in directions satisfying eT d = 1. This is a normalizing constraint that will chose only vectors whose components sum to 1. Theorem 9.44. A direction d ∈ D is an extreme direction of P if and only if d is an extreme point of D when D is taken as a polyhedral set. 
Remark 9.45. A complete proof of the previous theorem can be found in [Gri11]. Example 9.46. Let’s consider Example 9.30 again. The polyhedral set in this example was defined by the A matrix:   1 −1 A= −2 −1 and the b vector:   1 b= −6 If we assume that P = {x ∈ Rn : Ax ≤ b, x ≥ 0}, then the set of extreme directions of P is the same as the set of extreme points of the set D = {d ∈ Rn : Ad ≤ 0, d ≥ 0, eT d = 1} 116 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD Then we have the set of directions d = [d1 , d2 ]T so that: d1 − d2 ≤ 0 −2d1 − d2 ≤ 0 d1 + d2 = 1 d1 ≥ 0 d2 ≥ 0 The feasible region (which is really only the line d1 + d2 = 1) is shown in red in Figure 9.6. The critical part of this figure is the red line. It is the true set D. As a line, it has two Figure 9.6. Visualization of the set D: This set really consists of the set of points on the red line. This is the line where d1 + d2 = 1 and all other constraints hold. This line has two extreme points (0, 1) and (1/2, 1/2). extreme points: (0, 1) and (1/2, 1/2). Note that (0, 1) as an extreme point is one of the direction [0, 1]T we illustrated in Example 9.30. Exercise 62. Show that d = [1/2, 1/2]T is a direction of the polyhedral set P from Example 9.30. Now find a non-extreme direction (whose components sum to 1) using the feasible region illustrated in the previous example. Show that the direction you found is a direction of the polyhedral set. Create a figure like Figure 9.3 to illustrate both these directions. 2.6. Caratheodory Characterization Theorem. Remark 9.47. The theorem stated in this sub-section is critical to understanding the fundamental theorems of linear programming. Proofs can be found in [Gri11]. Lemma 9.48. The polyhedral set defined by: P = {x ∈ Rn : Ax ≤ b, x ≥ 0} has a finite, non-zero number of extreme points (assuming that A is not an empty matrix)3. 3Thanks to Bob Pakzah-Hurson for the suggestion to improve the statement of this lemma. 2. POLYHEDRAL THEORY AND LINEAR EQUATIONS AND INEQUALITIES 117 Lemma 9.49. Let P be a non-empty polyhedral set. Then the set of directions of P is empty if and only if P is bounded. Lemma 9.50. Let P be a non-empty unbounded polyhedral set. Then the number extreme directions of P is finite and non-zero. Theorem 9.51. Let P be a non-empty, unbounded polyhedral set defined by: P = {x ∈ Rn : Ax ≤ b, x ≥ 0} (where we assume A is not an empty matrix). Suppose that P has extreme points x1 , . . . , xk and extreme directions d1 , . . . , dl . If x ∈ P , then there exists constants λ1 , . . . , λk and µ1 , . . . , µl such that: x= k X λi x i + i=1 (9.38) k X l X µ j dj j=1 λi = 1 i=1 λi ≥ 0 i = 1, . . . , k µj ≥ 0 1, . . . , l Example 9.52. The Cartheodory Characterization Theorem is illustrated for a bounded and unbounded polyhedral set in Figure 9.7. This example illustrates simply how one could x2 λx2 + (1 − λ)x3 x3 x1 x x5 x4 x = µx5 + (1 − µ) (λx2 + (1 − λ)x3 ) x1 x x2 λx2 + (1 − λ)x3 d1 x3 x = λx2 + (1 − λ)x3 + θd1 Figure 9.7. The Cartheodory Characterization Theorem: Extreme points and extreme directions are used to express points in a bounded and unbounded set. construct an expression for an arbitrary point x inside a polyhedral set in terms of extreme points and extreme directions. 118 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD 3. The Simplex Method let For the remainder of this chapter, assume that A ∈ Rm×n with full row rank and b ∈ Rm (9.39) X = {x ∈ Rn : Ax ≤ b, x ≥ 0} be a polyhedral set over which we will maximize the objective function z(x1 , . . . 
, xn ) = cT x, where c, x ∈ Rn . That is, we will focus on the linear programming problem:  T   max c x s.t. Ax ≤ b (9.40) P   x≥0 Theorem 9.53. If Problem P has an optimal solution, then Problem P has an optimal extreme point solution. Proof. Applying the Cartheodory Characterization theorem, we know that any point x ∈ X can be written as: (9.41) x= k X λi x i + i=1 l X µi di i=1 where x1 , . . . xk are the extreme points of X and d1 , . . . , dl are the extreme directions of X and we know that (9.42) k X λi = 1 i=1 λi , µi ≥ 0 ∀i We can rewrite problem P using this characterization as: max k X T λi c x i + i=1 (9.43) s.t. k X l X µ i c T di i=1 λi = 1 i=1 λi , µi ≥ 0 ∀i If there is some i such that cT di > 0, then we can simply choose µi as large as we like, making the objective as large as we like, the problem will have no finite solution. Therefore, assume that cT di ≤ 0 for all i = 1, . . . , l (in which case, we may simply choose µi = 0, for all i). Since the set of extreme points x1 , . . . xk is finite, we can simply set λp = 1 if cT xp has the largest value among all possible values of cT xi , i = 1, . . . , k. This is clearly the solution to the linear programming problem. Since xp is an extreme point, we have shown that if P has a solution, it must have an extreme point solution.  Corollary 9.54. Problem P has a finite solution if and only if cT di ≤ 0 for all i = 1, . . . l when d1 , . . . , dl are the extreme directions of X. Proof. This is implicit in the proof of the theorem.  3. THE SIMPLEX METHOD 119 Corollary 9.55. Problem P has alternative optimal solutions if there are at least two extreme points xp and xq so that cT xp = cT xq and so that xp is the extreme point solution to the linear programming problem. Proof. Suppose that xp is the extreme point solution to P identified in the proof of the theorem. Suppose xq is another extreme point solution with cT xp = cT xq . Then every convex combination of xp and xq is contained in X (since X is convex). Thus every x with form λxp + (1 − λ)xq and λ ∈ [0, 1] has objective function value: λcT xp + (1 − λ)cT xq = λcT xp + (1 − λ)cT xp = cT xp which is the optimal objective function value, by assumption.  Exercise 63. Let X = {x ∈ Rn : Ax ≤ b, x ≥ 0} and suppose that d1 , . . . dl are the extreme directions of X (assuming it has any). Show that the problem: (9.44) min cT x s.t. Ax ≤ b x≥0 has a finite optimal solution if (and only if) cT dj ≥ 0 for k = 1, . . . , l. [Hint: Modify the proof above using the Cartheodory characterization theorem.] 3.1. Algorithmic Characterization of Extreme Points. In the previous sections we showed that if a linear programming problem has a solution, then it must have an extreme point solution. The challenge now is to identify some simple way of identifying extreme points. To accomplish this, let us now assume that we write X as: (9.45) X = {x ∈ Rn : Ax = b, x ≥ 0} Our work in the previous sections shows that this is possible. Recall we can separate A into an m × m matrix B and an m × (n − m) matrix N and we have the result: (9.46) xB = B−1 b − B−1 NxN We know that B is invertible since we assumed that A had full row rank. If we assume that xN = 0, then the solution (9.47) xB = B−1 b was called a basic solution (See Definition 9.16.) Clearly any basic solution satisfies the constraints Ax = b but it may not satisfy the constraints x ≥ 0. Definition 9.56 (Basic Feasible Solution). 
If xB = B−1 b and xN = 0 is a basic solution to Ax = b and xB ≥ 0, then the solution (xB , xN ) is called basic feasible solution. Theorem 9.57. Every basic feasible solution is an extreme point of X. Likewise, every extreme point is characterized by a basic feasible solution of Ax = b, x ≥ 0. Proof. Since Ax = BxB + NxN = b this represents the intersection of m linearly independent hyperplanes (since the rank of A is m). The fact that xN = 0 and xN contains n − m variables, then we have n − m binding, linearly independent hyperplanes in xN ≥ 0. Thus the point (xB , xN ) is the intersection of m + (n − m) = n linearly independent hyperplanes. By Theorem 9.35 we know that (xB , xN ) must be an extreme point of X. 120 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD Conversely, let x be an extreme point of X. Clearly x is feasible and by Theorem 9.35 it must represent the intersection of n hyperplanes. The fact that x is feasible implies that Ax = b. This accounts for m of the intersecting linearly independent hyperplanes. The remaining n − m hyperplanes must come from x ≥ 0. That is, n − m variables are zero. Let xN = 0 be the variables for which x ≥ 0 are binding. Denote the remaining variables xB . We can see that A = [B|N] and that Ax = BxB + NxN = b. Clearly, xB is the unique solution to BxB = b and thus (xB , xN ) is a basic feasible solution.  3.2. simplex (1) (2) (3) (4) The Simplex Algorithm–Algebraic Form. In this section, we will develop the algorithm algebraically. The idea behind the simplex algorithm is as follows: Convert the linear program to standard form. Obtain an initial basic feasible solution (if possible). Determine whether the basic feasible solution is optimal. If yes, stop. If the current basic feasible solution is not optimal, then determine which non-basic variable (zero valued variable) should become basic (become non-zero) and which basic variable (non-zero valued variable) should become non-basic (go to zero) to make the objective function value better. (5) Determine whether the problem is unbounded. If yes, stop. (6) If the problem doesn’t seem to be unbounded at this stage, find a new basic feasible solution from the old basic feasible solution. Go back to Step 3. Suppose we have a basic feasible solution x = (xB , xN ). We can divide the cost vector c into its basic and non-basic parts, so we have c = [cB |cN ]T . Then the objective function becomes: (9.48) cT x = cTB xB + cTN xN We can substitute Equation 9.46 into Equation 9.48 to obtain:   (9.49) cT x = cTB B−1 b − B−1 NxN + cN xN = cTB B−1 b + cTN − cTB B−1 N xN Let J be the set of indices of non-basic variables. Then we can write Equation 9.49 as: X  (9.50) z(x1 , . . . , xn ) = cTB B−1 b + cj − cTB B−1 A·j xj j∈J Consider now the fact xj = 0 for all j ∈ J . Further, we can see that: (9.51) ∂z = cj − cTB B−1 A·j ∂xj This means that if cj − cTB B−1 A·j > 0 and we increase xj from zero to some new value, then we will increase the value of the objective function. For historic reasons, we actually consider the value cTB B−1 A·j − cj , called the reduced cost and denote it as: (9.52) − ∂z = zj − cj = cTB B−1 A·j − cj ∂xj In a maximization problem, we chose non-basic variables xj with negative reduced cost to become basic because, in this case, ∂z/∂xj is positive. Assume we chose xj , a non-basic variable to become non-zero (because zj − cj < 0). We wish to know which of the basic variables will become zero as we increase xj away from zero. 
We must also be very careful that none of the variables becomes negative as we do this.

By Equation 9.46 we know that only the current basic variables will be affected by increasing xj. Let us focus explicitly on Equation 9.46 where we include only variable xj (since all other non-basic variables are kept zero). Then we have:

(9.53) xB = B−1b − B−1A·j xj

Let b̄ = B−1b be an m × 1 column vector and let āj = B−1A·j be another m × 1 column vector. Then we can write:

(9.54) xB = b̄ − āj xj

Let b̄ = [b̄1, . . . , b̄m]ᵀ and āj = [āj1, . . . , ājm]ᵀ; then we have:

(9.55) xB = [xB1; xB2; . . . ; xBm] = [b̄1; b̄2; . . . ; b̄m] − [āj1; āj2; . . . ; ājm] xj = [b̄1 − āj1 xj; b̄2 − āj2 xj; . . . ; b̄m − ājm xj]

We know (a priori) that b̄i ≥ 0 for i = 1, . . . , m. If āji ≤ 0, then as we increase xj, b̄i − āji xj ≥ 0 no matter how large we make xj. On the other hand, if āji > 0, then as we increase xj, we know that b̄i − āji xj will get smaller and eventually hit zero. In order to ensure that all variables remain non-negative, we cannot increase xj beyond a certain point. For each i (i = 1, . . . , m) such that āji > 0, the value of xj that will make xBi go to zero can be found by observing that:

(9.56) xBi = b̄i − āji xj

and if xBi = 0, then we can solve:

(9.57) 0 = b̄i − āji xj  =⇒  xj = b̄i/āji

Thus, the largest possible value we can assign xj while ensuring that all variables remain non-negative is:

(9.58) min { b̄i/āji : i = 1, . . . , m and āji > 0 }

Expression 9.58 is called the minimum ratio test. We are interested in which index i achieves the minimum ratio. Suppose that in executing the minimum ratio test, we find that xj = b̄k/ājk. The variable xj (which was non-basic) becomes basic and the variable xBk becomes non-basic. All other basic variables remain basic (and positive). In executing this procedure (of exchanging one basic variable and one non-basic variable) we have moved from one extreme point of X to another.

Theorem 9.58. If zj − cj ≥ 0 for all j ∈ J, then the current basic feasible solution is (locally) optimal.4

4We show later that this is globally optimal; however, we cannot do any better than a local argument for now.

Proof. We have already shown in Theorem 9.53 that if a linear programming problem has an optimal solution, then it occurs at an extreme point, and we have shown in Theorem 9.57 that there is a one-to-one correspondence between extreme points and basic feasible solutions. If zj − cj ≥ 0 for all j ∈ J, then ∂z/∂xj ≤ 0 for all non-basic variables xj. That is, we cannot increase the value of the objective function by increasing the value of any non-basic variable. Thus, since moving to another basic feasible solution (extreme point) will not improve the objective function, it follows we must be at a (locally) optimal solution.

Theorem 9.59. In a maximization problem, if āji ≤ 0 for all i = 1, . . . , m, and zj − cj < 0, then the linear programming problem is unbounded.

Proof. The fact that zj − cj < 0 implies that increasing xj will improve the value of the objective function. Since āji ≤ 0 for all i = 1, . . . , m, we can increase xj indefinitely without violating feasibility (no basic variable will ever go to zero). Thus the objective function can be made as large as we like.

Remark 9.60.
We should note that in executing the exchange of one basic variable and one non-basic variable, we must be very careful to ensure that the resulting basis consist of m linearly independent columns of the original matrix A. Specifically, we must be able to write the column corresponding to xj , the entering variable, as a linear combination of the columns of B so that: (9.59) α1 b1 + . . . αm bm = A·j and further if we are exchanging xj for xBi (i = 1, . . . , m), then αi 6= 0. We can see this from the fact that aj = B−1 A·j and therefore: Baj = A·j and therefore we have: A·j = B·1 aj1 + · · · + B·m ajm which shows how to write the column A·j as a linear combination of the columns of B. Exercise 64. Consider the linear programming problem given in Exercise 63. Under what conditions should a non-basic variable enter the basis? State and prove an analogous theorem to Theorem 9.58 using your observation. [Hint: Use the definition of reduced cost. Remember that it is −∂z/∂xj .] Example 9.61. Consider a simple Linear Programming Problem:  max z(x1 , x2 ) = 7x1 + 6x2       s.t. 3x1 + x2 ≤ 120    x1 + 2x2 ≤ 160  x1 ≤ 35     x1 ≥ 0     x2 ≥ 0 3. THE SIMPLEX METHOD 123 We can convert this problem to standard form by introducing the slack variables s1 , s2 and s3 :  max z(x1 , x2 ) = 7x1 + 6x2       s.t. 3x1 + x2 + s1 = 120 x1 + 2x2 + s2 = 160    x1 + s3 = 35    x1 , x2 , s1 , s2 , s3 ≥ 0 which yields the matrices     x1 7  x2  6      x =  s1  0 c=      s2  0 s3 0     120 3 1 1 0 0    A = 1 2 0 1 0 b = 160 35 1 0 0 0 1 We can begin with the matrices:     3 1 1 0 0 B =  0 1 0 N =  1 2  1 0 0 0 1 In this case we have:         s1 0 x1 7     x B = s2 x N = cB = 0 cN = x2 6 s3 0 and    3 1 120 B−1 b = 160 B−1 N = 1 2 1 0 35  Therefore:   cTB B−1 b = 0 cTB B−1 N = 0 0 Using this information, we can compute:   cTB B−1 N − cN = −7 −6 cTB B−1 A·1 − c1 = −7 cTB B−1 A·2 − c2 = −6 and therefore: ∂z ∂z = 7 and =6 ∂x1 ∂x2 Based on this information, we could chose either x1 or x2 to enter the basis and the value of the objective function would increase. If we chose x1 to enter the basis, then we must determine which variable will leave the basis. To do this, we must investigate the elements of B−1 A·1 and the current basic feasible solution B−1 b. Since each element of B−1 A·1 is 124 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD positive, we must perform the minimum ratio test on each element of B−1 A·1 . We know that B−1 A·1 is just the first column of B−1 N which is:   3 B−1 A·1 = 1 1 Performing the minimum ratio test, we see have:   120 160 35 , , min 3 1 1 In this case, we see that index 3 (35/1) is the minimum ratio. Therefore, variable x1 will enter the basis and variable s3 will leave the basis. The new basic and non-basic variables will be:         0 s1 s 0 3 c B =  0  cN = x B =  s2  x N = x2 6 x1 7 and the matrices  1  B= 0 0 become:    0 1 0 3 1 1 N =  0 2  1 0 0 1 Note we have simply swapped the column corresponding to x1 with the column corresponding to s3 in the basis matrix B and the non-basic matrix N. We will do this repeatedly in the example and we recommend the reader keep track of which variables are being exchanged and why certain columns in B are being swapped with those in N. 
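Before continuing the iterations, note that the matrix computations repeated at every step of this example (b̄ = B−1b, B−1N, and the reduced costs cBᵀB−1A·j − cj) are easy to script. The following is a minimal Maple sketch (our own; the names basis and rc are not part of the notes), shown for the basis {s1, s2, x1} reached after the first pivot.

with(LinearAlgebra):
A := Matrix([[3, 1, 1, 0, 0], [1, 2, 0, 1, 0], [1, 0, 0, 0, 1]]):
b := Vector([120, 160, 35]):
c := Vector([7, 6, 0, 0, 0]):

#Columns of A in the current basis: s1, s2 and x1.
basis := [3, 4, 1]:
B := A[1 .. 3, basis]:
Binv := MatrixInverse(B):

#Current basic variable values and the multipliers w = cB^T B^(-1).
xB := Binv . b;
w := Transpose(c[basis]) . Binv;

#Reduced cost z_j - c_j for a non-basic column j of A.
rc := j -> (w . A[1 .. 3, j]) - c[j]:
rc(2), rc(5);   #x2 gives -6 and s3 gives 7, matching the text.

Only x2 has a negative reduced cost here, so it is the only attractive entering variable, in agreement with the hand computation that follows.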
Using the new B and N matrices, the derived matrices are then:

B−1b = [15; 125; 35]   B−1N = [−3 1; −1 2; 1 0]

The cost information becomes:

cBᵀB−1b = 245   cBᵀB−1N = [7 0]

Using this information, we can compute:

cBᵀB−1N − cN = [7 −6]
cBᵀB−1A·5 − c5 = 7
cBᵀB−1A·2 − c2 = −6

Based on this information, we can only choose x2 to enter the basis to ensure that the value of the objective function increases. We can perform the minimum ratio test to figure out which basic variable will leave the basis. We know that B−1A·2 is just the second column of B−1N, which is:

B−1A·2 = [1; 2; 0]

Performing the minimum ratio test, we have:

min{15/1, 125/2}

In this case, we see that index 1 (15/1) gives the minimum ratio. Therefore, variable x2 will enter the basis and variable s1 will leave the basis. The new basic and non-basic variables will be:

xB = [x2; s2; x1]   xN = [s3; s1]   cB = [6; 0; 7]   cN = [0; 0]

and the matrices become:

B = [1 0 3; 2 1 1; 0 0 1]   N = [0 1; 0 0; 1 0]

The derived matrices are then:

B−1b = [15; 95; 35]   B−1N = [−3 1; 5 −2; 1 0]

The cost information becomes:

cBᵀB−1b = 335   cBᵀB−1N = [−11 6]   cBᵀB−1N − cN = [−11 6]

Based on this information, we can only choose s3 to (re-)enter the basis to ensure that the value of the objective function increases. We can perform the minimum ratio test to figure out which basic variable will leave the basis. We know that B−1A·5 is just the first column of B−1N, which is:

B−1A·5 = [−3; 5; 1]

Performing the minimum ratio test, we have:

min{95/5, 35/1}

In this case, we see that index 2 (95/5) gives the minimum ratio. Therefore, variable s3 will enter the basis and variable s2 will leave the basis. The new basic and non-basic variables will be:

xB = [x2; s3; x1]   xN = [s2; s1]   cB = [6; 0; 7]   cN = [0; 0]

and the matrices become:

B = [1 0 3; 2 0 1; 0 1 1]   N = [0 1; 1 0; 0 0]

The derived matrices are then:

B−1b = [72; 19; 16]   B−1N = [3/5 −1/5; 1/5 −2/5; −1/5 2/5]

The cost information becomes:

cBᵀB−1b = 544   cBᵀB−1N = [11/5 8/5]   cBᵀB−1N − cN = [11/5 8/5]

Since the reduced costs are now positive, no further improvement is possible and we conclude that we have obtained an optimal solution. The final solution then is:

xB∗ = [x2; s3; x1] = B−1b = [72; 19; 16]

Simply, we have x1 = 16 and x2 = 72. The path of extreme points we actually took in traversing the boundary of the polyhedral feasible region is shown in Figure 9.8.

Figure 9.8. The Simplex Algorithm: The path around the feasible region is shown in the figure. Each exchange of a basic and non-basic variable moves us along an edge of the polygon in a direction that increases the value of the objective function.

Exercise 65. Assume that a leather company manufactures two types of belts: regular and deluxe. Each belt requires 1 square yard of leather. A regular belt requires 1 hour of skilled labor to produce, while a deluxe belt requires 2 hours of labor. The leather company receives 40 square yards of leather each week and a total of 60 hours of skilled labor is available. Each regular belt nets $3 in profit, while each deluxe belt nets $5 in profit. The company wishes to maximize profit.
(1) Ignoring the divisibility issues, construct a linear programming problem whose solution will determine the number of each type of belt the company should produce.
(2) Use the simplex algorithm to solve the problem you stated above remembering to convert the problem to standard form before you begin. 4. KARUSH-KUHN-TUCKER (KKT) CONDITIONS 127 (3) Draw the feasible region and the level curves of the objective function. Verify that the optimal solution you obtained through the simplex method is the point at which the level curves no longer intersect the feasible region in the direction following the gradient of the objective function. There are a few ways to chose the entering variable: (1) Using Dantzig’s Rule, we choose the variable with the greatest (absolute value) reduced cost. (2) We can compute the next objective function value for each possible entering variable and chose the one that results in the largest objective function increase, in a greedy search. (3) We can choose the first variable we find that is an acceptable entering variable (i.e., has a negative reduced cost). (4) We can use a combination of these approaches. Most Simplex codes have a complex recipe for choosing entering variables. We will deal with the question of breaking ties among leaving variables in our section on degeneracy. 4. Karush-Kuhn-Tucker (KKT) Conditions Remark 9.62. The single most important thing to learn about Linear Programming (or optimization in general) is the Karush-Kuhn-Tucker optimality conditions. These conditions provide necessary and sufficient conditions for a point x ∈ Rn to be an optimal solution to a Linear Programming problem. We state the Karush-Kuhn-Tucker theorem, but do not prove it. A proof can be found in Chapter 8 of [Gri11]. Theorem 9.63. Consider the linear programming problem:    max cx s.t. Ax ≤ b (9.60) P   x≥0 with A ∈ Rm×n , b ∈ Rm and (row vector) c ∈ Rn . Then x∗ ∈ Rn if and only if there exists (row) vectors w∗ ∈ Rm and v∗ ∈ Rn and a slack variable vector s∗ ∈ Rm so that:  Ax∗ + s∗ = b (9.61) Primal Feasibility x∗ ≥ 0  ∗ ∗  w A − v = c w∗ ≥ 0 Dual Feasibility (9.62)   v∗ ≥ 0  ∗ w (Ax∗ − b) = 0 (9.63) Complementary Slackness v ∗ x∗ = 0 Remark 9.64. The vectors w∗ and v∗ are sometimes called dual variables for reasons that will be clear in the next section. They are also sometimes called Lagrange Multipliers. You may have encountered Lagrange Multipliers in your Math 230 or Math 231 class. These are the same kind of variables except applied to linear optimization problems. There is one 128 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD element in the dual variable vector w∗ for each constraint of the form Ax ≤ b and one element in the dual variable vector v∗ for each constraint of the form x ≥ 0. Example 9.65. Consider a simple Linear Programming Problem with Dual Variables (Lagrange Multipliers) listed next to their corresponding constraints:  max z(x1 , x2 ) = 7x1 + 6x2 Dual Variable      s.t. 3x1 + x2 ≤ 120 (w1 )     x1 + 2x2 ≤ 160 (w1 )  x1 ≤ 35 (w3 )      x1 ≥ 0 (v1 )    x2 ≥ 0 (v2 ) In this problem we have:     3 1 120    b = 160 A= 1 2 35 1 0   c= 7 6 Then the KKT conditions can be written as:     120 3 1     x1      160 1 2 ≤  x2 35 1 0 Primal Feasibility          x1 ≥ 0 0 x2    3 1           1 2  − v1 v2 = 7 6  w 1 w2 w3   1 0 Dual Feasibility       w w w 0 0 0 ≥  1 2 3       v1 v2 ≥ 0 0 Complementary Slackness        w1 v1       120 3 1  x w2 w3 1 2 1 − 160 = 0 x2 35 1 0   v2 x1 x2 = 0 Note, we are suppressing the slack variables s in the primal feasibility expression. Recall that at optimality, we had x1 = 16 and x2 = 72. 
The binding constraints in this case were:

3x1 + x2 ≤ 120
x1 + 2x2 ≤ 160

To see this, note that 3(16) + 72 = 120 and 16 + 2(72) = 160. Then we should be able to express c = [7 6] (the vector of coefficients of the objective function) as a positive combination of the gradients of the binding constraints:

∇(7x1 + 6x2) = [7 6]
∇(3x1 + x2) = [3 1]
∇(x1 + 2x2) = [1 2]

That is, we wish to solve the linear equation:

(9.64) [w1 w2] [3 1; 1 2] = [7 6]

The result is the system of equations:

3w1 + w2 = 7
w1 + 2w2 = 6

A solution to this system is w1 = 8/5 and w2 = 11/5. This fact is illustrated in Figure 9.9, which shows the gradient cone formed by the binding constraints at the optimal point for the toy maker problem. Since x1, x2 > 0, we must have v1 = v2 = 0. Moreover, since x1 < 35, we know that x1 ≤ 35 is not a binding constraint and thus its dual variable w3 is also zero. This leads to the conclusion:

[x1∗; x2∗] = [16; 72]   [w1∗ w2∗ w3∗] = [8/5 11/5 0]   [v1∗ v2∗] = [0 0]

and the KKT conditions are satisfied.

Figure 9.9. The Gradient Cone: At optimality, the cost vector c is obtuse with respect to the directions formed by the binding constraints. It is also contained inside the cone of the gradients of the binding constraints, which we will discuss at length later.

Exercise 66. Consider the problem:

max x1 + x2
s.t. 2x1 + x2 ≤ 4
     x1 + 2x2 ≤ 6
     x1, x2 ≥ 0

Write the KKT conditions for an optimal point for this problem. (You will have a vector w = [w1 w2] and a vector v = [v1 v2].) Draw the feasible region of the problem and use Matlab to solve the problem. At the point of optimality, identify the binding constraints and draw their gradients. Show that the KKT conditions hold. (Specifically, find w and v.)

Exercise 67. Find the KKT conditions for the problem:

(9.65) min cx
       s.t. Ax ≥ b
            x ≥ 0

[Hint: Remember, every minimization problem can be converted to a maximization problem by multiplying the objective function by −1, and the constraints Ax ≥ b are equivalent to the constraints −Ax ≤ −b.]

5. Simplex Initialization

So far we have investigated linear programming problems that had the form:

max cᵀx
s.t. Ax ≤ b
     x ≥ 0

In this case, we use slack variables to convert the problem to:

max cᵀx
s.t. Ax + Im xs = b
     x, xs ≥ 0

where xs are slack variables, one for each constraint. If b ≥ 0, then our initial basic feasible solution can be x = 0 and xs = b (that is, our initial basis matrix is B = Im). Suppose now we wish to investigate problems in which we do not have a problem structure that lends itself to easily identifying an initial basic feasible solution. The simplex algorithm requires an initial BFS to begin execution, so we must develop a method for finding one.

For the remainder of this chapter we will assume, unless told otherwise, that we are interested in solving a linear programming problem provided in standard form:

(9.66) P: max cᵀx
          s.t. Ax = b
               x ≥ 0

and that b ≥ 0. Our earlier work (Theorem 9.7) shows that any linear programming problem can be put in this form. Suppose that to each constraint Ai·x = bi we associate an artificial variable xai. We can replace constraint i with:

(9.67) Ai·x + xai = bi

Since bi ≥ 0, we will require xai ≥ 0. If xai = 0, then this is simply the original constraint.
Thus if we can find values for the ordinary decision variables x so that xai = 0, then constraint i is satisfied. If we can identify values for x so that all the artificial variables are zero and m variables of x are non-zero, then the modified constraints described by Equation 9.67 are satisfied and we have identified an initial basic feasible solution. Obviously, we would like to penalize non-zero artificial variables. This can be done by writing a new linear programming problem:

(9.68) P1: min eᵀxa
           s.t. Ax + Im xa = b
                x, xa ≥ 0

Remark 9.66. We can see that the artificial variables are similar to slack variables, but they should have zero value because they have no true meaning in the original problem P. They are introduced artificially to help identify an initial basic feasible solution to Problem P.

Theorem 9.67. The optimal objective function value in Problem P1 is bounded below by 0. Furthermore, if the optimal solution to Problem P1 has xa = 0, then the values of x form a feasible solution to Problem P.

6. Revised Simplex Method

Consider an arbitrary linear programming problem, which we will assume is written in standard form:

(9.69) P: max cᵀx
          s.t. Ax = b
               x ≥ 0

Consider the data we need at each iteration of the Simplex algorithm:
(1) Reduced costs: cBᵀB−1A·j − cj for each variable xj, where j ∈ J and J is the set of indices of non-basic variables.
(2) Right-hand-side values: b̄ = B−1b for use in the minimum ratio test.
(3) āj = B−1A·j for use in the minimum ratio test.
(4) z = cBᵀB−1b, the current objective function value.

The one value that is clearly critical to the computation is B−1, as it appears in each and every computation. It would be far more effective to keep only the values B−1, cBᵀB−1, b̄ and z and compute the reduced cost values and vectors āj as we need them. Let w = cBᵀB−1; then the pertinent information may be stored in a new revised simplex tableau with form:

(9.70) [w | z; B−1 | b̄]

where the rows of the lower block are labeled by the basic variables xB. The revised simplex algorithm is detailed in Algorithm 20. In essence, the revised simplex algorithm allows us to avoid computing āj until we absolutely need to do so.

Revised Simplex Algorithm
(1) Identify an initial basis matrix B and compute B−1, w, b̄ and z, and place these into a revised simplex tableau:
    [w | z; B−1 | b̄]
(2) For each j ∈ J use w to compute: zj − cj = wA·j − cj.
(3) Choose an entering variable xj (for a maximization problem, we choose a variable with negative reduced cost; for a minimization problem, we choose a variable with positive reduced cost):
    (a) If there is no entering variable, STOP: you are at an optimal solution.
    (b) Otherwise, continue to Step 4.
(4) Append the column āj = B−1A·j to the revised simplex tableau:
    [w | z | zj − cj; B−1 | b̄ | āj]
(5) Perform the minimum ratio test and determine a leaving variable (using any leaving variable rule you prefer).
    (a) If āj ≤ 0, STOP: the problem is unbounded.
    (b) Otherwise, assume that the leaving variable is xBr, which appears in row r of the revised simplex tableau.
(6) Use row operations and pivot on the leaving variable row of the column:
    [zj − cj; āj]
    transforming the revised simplex tableau into:
    [w′ | z′ | 0; B′−1 | b̄′ | er]
    where er is an identity column with a 1 in row r (the row that left). The variable xj is now the rth element of xB.
(7) Goto Step 2.

Algorithm 20. Revised Simplex Algorithm
In fact, if we do not apply Dantzig's entering variable rule and simply select the first acceptable entering variable, then we may be able to avoid computing a substantial number of columns in the tableau.

Example 9.68. Consider a software company that is developing a new program. The company has identified two types of bugs that remain in this software: non-critical and critical. The company's actuarial firm predicts that the risks associated with these bugs are uniform random variables with mean $100 per non-critical bug and mean $1000 per critical bug. The software currently has 50 non-critical bugs and 5 critical bugs. Assume that it requires 3 hours to fix a non-critical bug and 12 hours to fix a critical bug. For each day (8 hour period) beyond two business weeks (80 hours) that the company fails to ship its product, the actuarial firm estimates it will lose $500.

We can find the optimal number of bugs of each type the software company should fix, assuming it wishes to minimize its exposure to risk, using a linear programming formulation. Let x1 be the number of non-critical bugs corrected and x2 be the number of critical software bugs corrected. Define:
(9.71)  y1 = 50 − x1
(9.72)  y2 = 5 − x2
Here y1 is the number of non-critical bugs that are not fixed while y2 is the number of critical bugs that are not fixed. The time (in hours) it takes to fix these bugs is:
(9.73)  3x1 + 12x2
Let:
(9.74)  y3 = (1/8)(80 − 3x1 − 12x2)
Then y3 is a variable that is unrestricted in sign and determines the amount of time (in days) either over or under the two-week period that is required to ship the software. As an unrestricted variable, we can break it into two components:
(9.75)  y3 = z1 − z2
We will assume that z1, z2 ≥ 0. If y3 > 0, then z1 > 0 and z2 = 0. In this case, the software is completed ahead of the two-week deadline. If y3 < 0, then z1 = 0 and z2 > 0. In this case the software is finished after the two-week deadline. Finally, if y3 = 0, then z1 = z2 = 0 and the software is finished precisely on time. We can form the objective function as:
(9.76)  z = 100y1 + 1000y2 + 500z2
The linear programming problem is then:
(9.77)
min z = y1 + 10y2 + 5z2
s.t. x1 + y1 = 50
     x2 + y2 = 5
     (3/8)x1 + (3/2)x2 + z1 − z2 = 10
     x1, x2, y1, y2, z1, z2 ≥ 0
Notice we have modified the objective function by dividing by 100. This will make the arithmetic of the simplex algorithm easier. The matrix of coefficients for this problem is:
(9.78)
      x1   x2   y1   y2   z1   z2
    [  1    0    1    0    0    0 ]
    [  0    1    0    1    0    0 ]
    [ 3/8  3/2   0    0    1   −1 ]
Notice there is an identity matrix embedded inside the matrix of coefficients. Thus a good initial basic feasible solution is {y1, y2, z1}. The initial basis matrix is I3 and naturally, B−1 = I3 as a result. We can see that cB = [1 10 0]T. It follows that cTB B−1 = w = [1 10 0].

Our initial revised simplex tableau is thus:
(9.79)
    z  [ 1  10   0 | 100 ]
    y1 [ 1   0   0 |  50 ]
    y2 [ 0   1   0 |   5 ]
    z1 [ 0   0   1 |  10 ]
There are three variables that might enter at this point: x1, x2 and z2. We can compute the reduced costs for each of these variables using the columns of the A matrix, the coefficients of these variables in the objective function and the current w vector (in row 0 of the revised simplex tableau). We obtain:
z1 − c1 = wA·1 − c1 = [1 10 0] [1, 0, 3/8]T − 0 = 1
z2 − c2 = wA·2 − c2 = [1 10 0] [0, 1, 3/2]T − 0 = 10
z6 − c6 = wA·6 − c6 = [1 10 0] [0, 0, −1]T − 5 = −5
By Dantzig's rule, we enter variable x2.
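(Recall from Algorithm 20 that for a minimization problem the entering variable is chosen among those with positive reduced cost; Dantzig's rule takes the most positive, and z2 − c2 = 10 is the largest of the three.)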
We append B−1 A·2 and the reduced cost to the revised simplex tableau to obtain:    10 1 10 0 100 M RT z     y1  1 0 0 50   0  − (9.80) 5 y2  0 1 0 5   1  20/3 z1 0 0 1 10 3/2 After pivoting on the indicated element, we obtain the new tableau:   1 0 0 50 z y1  0 0 50   1  (9.81) x2  0 1 0 5  z1 0 −3/2 1 5/2 We can compute reduced costs for the non-basic variables (except for y2 , which we know will not re-enter the basis on this iteration) to obtain: z1 − c1 = wA·1 − c1 = 1 z6 − c6 = wA·6 − c6 = −5 In this case, x1  z y1   (9.82) x2  z1 will enter the basis and we augment our revised simplex tableau to obtain:   M RT 1 1 0 0 50    1 0 0 50   1  50 0 1 0 5  0  − 0 −3/2 1 5/2 3/8 20/3 7. CYCLING PREVENTION 135 Note that:     1 1 0 0 1 −1      1 0 0 = 0  B A·1 = 0 3/8 0 −3/2 1 3/8  This is the ā1 column that is appended to the right hand side of the tableau along with z1 − c1 = 1. After pivoting, the tableau becomes:   1 4 −8/3 130/3 z  y1   1 4 −8/3 130/3  (9.83) x2  0 1 0 5  x1 0 −4 8/3 20/3 We can now check our reduced costs. Clearly, z1 will not re-enter the basis. Therefore, we need only examine the reduced costs for the variables y2 and z2 . z4 − c4 = wA·4 − c4 = −6 z6 − c6 = wA·6 − c6 = −7/3 Since all reduced costs are now negative, no further minimization is possible and we conclude we have arrived at an optimal solution. Two things are interesting to note: first, the solution for the number of non-critical software bugs to fix is non-integer. Thus, in reality the company must fix either 6 or 7 of the non-critical software bugs. The second thing to note is that this economic model helps to explain why some companies are content to release software that contains known bugs. In making a choice between releasing a flawless product or making a quicker (larger) profit, a selfish, profit maximizer will always choose to fix only those bugs it must fix and release sooner rather than later. Exercise 68. Solve the following problem using the revised simplex algorithm. max x1 + x2 s.t. 2x1 + x2 ≤ 4 x1 + 2x2 ≤ 6 x 1 , x2 ≥ 0 7. Cycling Prevention Theorem 9.69. Consider Problem P (our linear programming problem). Let B ∈ Rm×m be a basis matrix corresponding to some set of basic variables xB . Let b = B−1 b. If bj = 0 for some j = 1, . . . , m, then xB = b and xN = 0 is a degenerate extreme point of the feasible region of Problem P . Degeneracy can cause us to take extra steps on our way from an initial basic feasible solution to an optimal solution. When the simplex algorithm takes extra steps while remaining at the same degenerate extreme point, this is called stalling. The problem can become much worse; for certain entering variable rules, the simplex algorithm can become locked in a cycle of pivots each one moving from one characterization of a degenerate extreme point to the next. The following example from Beale and illustrated in Chapter 4 of [BJS04] demonstrates the point. 136 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD Remark 9.70. In this section, for the sake of simplicity, we illustrate cycling with full tableau. That is, we show the entire table of reduced costs as the top row and the entire matrix B−1 A. The right-hand-side and z values are identical to the revised simplex method. Example 9.71. Consider the following linear programming problem: 1 3 min − x4 + 20x5 − x6 + 6x7 4 2 1 s.t x1 + x4 − 8x5 − x6 + 9x7 = 0 4 (9.84) 1 1 x2 + x4 − 12x5 − x6 + 3x7 = 0 2 2 x3 + x6 = 1 xi ≥ 0 i = 1, . . . 
, 7 It is conducive to  1  (9.85) A = 0 0 analyze the A matrix of the constraints of this problem. We have:  0 0 1/4 −8 −1 9 1 0 1/2 −12 −1/2 3 0 1 0 0 1 0 The fact that the A matrix contains an identity matrix embedded within it suggests that an initial basic feasible solution with basic variables x1 , x2 and x3 would be a good choice. This leads to a vector of reduced costs given by:   (9.86) cB T B−1 N − cN T = 3/4 −20 1/2 −6 These yield an initial  z x1  z  1 0 x1   0 1 x2  0 0 x3 0 0 tableau with structure:  x2 x3 x4 x5 x6 x7 RHS 0 0 3/4 −20 1/2 −6 0   0 0 1/4 −8 −1 9 0   1 0 1/2 −12 −1/2 3 0  0 1 0 0 1 0 1 If we apply an entering variable rule where we always chose the non-basic variable to enter with the most positive reduced cost (since this is a minimization problem), and we choose the leaving variable to be the first row that is in a tie, then we will obtain the following sequence of tableaux: Tableau I:   x5 x6 x7 RHS z x1 x2 x3 x4 1 0 0 0 3/4 −20 1/2 −6 0  z     x1  0 1 0 0 1/4 −8 −1 9 0   x2  0 0 1 0 1/2 −12 −1/2 3 0  x3 0 0 1 0 1 0 0 0 1 Tableau II: z x4 x2 x3        x6 x7 RHS z x1 x2 x3 x4 x5 1 −3 0 0 0 4 7/2 −33 0   0 4 0 0 1 −32 −4 36 0   4 3/2 −15 0  0 −2 1 0 0 0 0 0 1 0 0 1 0 1 7. CYCLING PREVENTION Tableau III:  z x4 x5 x3       x1 x2 x3 x4 x5 x6 x7 RHS z 1 −1 −1 0 0 0 2 −18 0   0 −12 8 0 1 0 8 −84 0   0 −1/2 1/4 0 0 1 3/8 −15/4 0  0 0 1 0 0 1 0 1 0       z x1 x2 x3 x4 x5 x6 x7 RHS 1 2 −3 0 −1/4 0 0 3 0   0 −3/2 1 0 1/8 0 1 −21/2 0   0  0 1/16 −1/8 0 −3/64 1 0 3/16 −1 1 −1/8 0 0 21/2 1 0 3/2 Tableau IV:  z x6 x5 x3 Tableau V: z x6 x7 x3       Tableau VI:  z x1 x7 x3      Tableau VII:  z x1 x2 x3      137  z x1 x2 x3 x4 x5 x6 x7 RHS 1 1 −1 0 1/2 −16 0 0 0   0 2 −6 0 −5/2 56 1 0 0   0 1/3 −2/3 0 −1/4 16/3 0 1 0  0 −2 6 1 5/2 −56 0 0 1 z x1 1 0 0 1 0 0 0 0 x2 2 −3 1/3 0  x3 x4 x5 x6 x7 RHS 0 7/4 −44 −1/2 0 0   0 −5/4 28 1/2 0 0   0 1/6 −4 −1/6 1 0  1 0 0 1 0 1  x5 x6 x7 RHS z x1 x2 x3 x4 1 0 0 0 3/4 −20 1/2 −6 0   0 1 0 0 1/4 −8 −1 9 0   0 0 1 0 1/2 −12 −1/2 3 0  0 0 0 1 0 0 1 0 1 We see that the last tableau (VII) is the same as the first tableau and thus we have constructed an instance where (using the given entering and leaving variable rules), the Simplex Algorithm will cycle forever at this degenerate extreme point. 7.1. The Lexicographic Minimum Ratio Leaving Variable Rule. Given the example of the previous section, we require a method for breaking ties in the case of degeneracy is required that prevents cycling from occurring. There is a large literature on cycling prevention rules, however the most well known is the lexicographic rule for selecting the entering variable. Definition 9.72 (Lexicographic Order). Let x = [x1 , . . . , xn ]T and y = [y1 , . . . , yn ]T be vectors in Rn . We say that x is lexicographically greater than y if: there exists m < n so that xi = yi for i = 1, . . . , m, and xm+1 > ym+1 . 138 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD Clearly, if there is no such m < n, then xi = yi for i = 1, . . . , n and thus x = y. We write x ≻ y to indicate that x is lexicographically greater than y. Naturally, we can write x  y to indicate that x is lexicographically greater than or equal to y. Lexicographic ordering is simply the standard order operation > applied to the individual elements of a vector in Rn with a precedence on the index of the vector. Definition 9.73. 
A vector x ∈ Rn is lexicographically positive if x ≻ 0 where 0 is the zero vector in Rn.

Lemma 9.74. Let x and y be two lexicographically positive vectors in Rn. Then x + y is lexicographically positive. Let c > 0 be a constant in R; then cx is a lexicographically positive vector.

Exercise 69. Prove Lemma 9.74.

Suppose we are considering a linear programming problem and we have chosen an entering variable xj according to a fixed entering variable rule. Assume further that we are given some current basis matrix B and, as usual, the right-hand-side vector of the constraints is denoted b, while the coefficient matrix is denoted A. Then the minimum ratio test asserts that we will choose as the leaving variable the basic variable that achieves the minimum ratio. Consider the following set:

(9.87)  I0 = { r : br/ajr = min { bi/aji : i = 1, . . . , m and aji > 0 } }

In the absence of degeneracy, I0 contains a single element: the row index that has the smallest ratio of bi to aji, where naturally b = B−1 b and aj = B−1 A·j. In this case, xj is swapped into the basis in exchange for xBr (the rth basic variable). When we have a degenerate basic feasible solution, then I0 is not a singleton set and contains all the rows that have tied in the minimum ratio test. In this case, we can form a new set:

(9.88)  I1 = { r : a1r/ajr = min { a1i/aji : i ∈ I0 } }

Here a1 = B−1 A·1, the first column of B−1 A. The elements of this (column) vector are then divided by the elements of the (column) vector aj on an index-by-index basis. If this set is a singleton, then basic variable xBr leaves the basis. If this set is not a singleton, we may form a new set I2 with column a2. In general, we will have the set:

(9.89)  Ik = { r : akr/ajr = min { aki/aji : i ∈ Ik−1 } }

Lemma 9.75. For any degenerate basis matrix B for any linear programming problem, we will ultimately find a k so that Ik is a singleton.

Exercise 70. Prove Lemma 9.75. [Hint: Assume that the tableau is arranged so that the identity columns are columns 1 through m. (That is, ai = ei for i = 1, . . . , m.) Show that this configuration will easily lead to a singleton Ik for k < m.]

In executing the lexicographic minimum ratio test, we can see that we are essentially comparing the tied rows in a lexicographic manner. If a set of rows ties in the minimum ratio test, then we execute a minimum ratio test on the first column of the tied rows. If there is a tie, then we move on, executing a minimum ratio test on the second column of the rows that tied in both previous tests. This continues until the tie is broken and a single row emerges as the leaving row.

Example 9.76. Let us consider the example from Beale again using the lexicographic minimum ratio test. Consider the tableau shown below.

Tableau I:
         z   x1  x2  x3   x4    x5    x6   x7   RHS
    z  [ 1    0   0   0   3/4  −20   1/2   −6 |  0 ]
    x1 [ 0    1   0   0   1/4   −8    −1    9 |  0 ]
    x2 [ 0    0   1   0   1/2  −12  −1/2    3 |  0 ]
    x3 [ 0    0   0   1    0     0     1    0 |  1 ]

Again, we choose to enter variable x4 as it has the most positive reduced cost. Variables x1 and x2 tie in the minimum ratio test. So we consider a new minimum ratio test on the first column of the tableau:

(9.90)  min { 1/(1/4), 0/(1/2) }

From this test, we see that x2 is the leaving variable and we pivot on element 1/2 as indicated in the tableau. Note, we only need to execute the minimum ratio test on variables x1 and x2 since those were the tied variables in the standard minimum ratio test. That is, I0 = {1, 2} and we construct I1 from these indexes alone.
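Explicitly, since x1 and x2 are basic, their columns in B−1A are the identity columns e1 and e2, so the tied rows have first-column entries a11 = 1 and a12 = 0; the lexicographic ratios are therefore 1/(1/4) = 4 and 0/(1/2) = 0, and the minimum occurs in row 2.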
In this case I1 = {2}. Pivoting yields the new tableau:

Tableau II:
         z   x1   x2   x3  x4   x5    x6     x7   RHS
    z  [ 1    0  −3/2   0   0   −2    5/4  −21/2 |  0 ]
    x1 [ 0    1  −1/2   0   0   −2   −3/4   15/2 |  0 ]
    x4 [ 0    0    2    0   1  −24    −1     6   |  0 ]
    x3 [ 0    0    0    1   0    0     1     0   |  1 ]

There is no question this time of the entering or leaving variable; clearly x6 must enter and x3 must leave and we obtain[5]:

Tableau III:
         z   x1   x2    x3   x4   x5   x6    x7    RHS
    z  [ 1    0  −3/2  −5/4   0   −2    0  −21/2 | −5/4 ]
    x1 [ 0    1  −1/2   3/4   0   −2    0   15/2 |  3/4 ]
    x4 [ 0    0    2     1    1  −24    0    6   |   1  ]
    x6 [ 0    0    0     1    0    0    1    0   |   1  ]

Since this is a minimization problem and the reduced costs of the non-basic variables are now all negative, we have arrived at an optimal solution. The lexicographic minimum ratio test successfully prevented cycling.

Remark 9.77. The following theorem is proved in [Gri11].

[5] Thanks to Ethan Wright for finding a small typo in this example, that is now fixed.

Theorem 9.78. Consider the problem:
P: max cT x
   s.t. Ax = b
        x ≥ 0
Suppose the following hold:
(1) Im is embedded in the matrix A and is used as the starting basis,
(2) a consistent entering variable rule is applied (e.g., largest reduced cost first), and
(3) the lexicographic minimum ratio test is applied as the leaving variable rule.
Then the simplex algorithm converges in a finite number of steps.

8. Relating the KKT Conditions to the Tableau

Consider a linear programming problem in Standard Form:
(9.91)
P: max cx
   s.t. Ax = b
        x ≥ 0
with A ∈ Rm×n, b ∈ Rm and (row vector) c ∈ Rn. The KKT conditions for a problem of this type assert that
wA − v = c
vx = 0
at an optimal point x for some vector w unrestricted in sign and v ≥ 0. (Note, for the sake of notational ease, we have dropped the ∗ notation.)

Suppose at optimality we have a basis matrix B corresponding to a set of basic variables xB and we simultaneously have non-basic variables xN. We may likewise divide v into vB and vN. Then we have:
(9.92)  wA − v = c  =⇒  w [B  N] − [vB  vN] = [cB  cN]
(9.93)  vx = 0  =⇒  [vB  vN] [xB ; xN] = 0
We can rewrite Expression 9.92 as:
(9.94)  [wB − vB   wN − vN] = [cB  cN]
This simplifies to:
wB − vB = cB
wN − vN = cN
Let w = cB B−1. Then we see that:
(9.95)  wB − vB = cB  =⇒  cB B−1 B − vB = cB  =⇒  cB − vB = cB  =⇒  vB = 0
Since we know that xB ≥ 0, we know that vB should be equal to zero to ensure complementary slackness. Thus, this is consistent with the KKT conditions. We further see that:
(9.96)  wN − vN = cN  =⇒  cB B−1 N − vN = cN  =⇒  vN = cB B−1 N − cN
Thus, the vN are just the reduced costs of the non-basic variables. (vB are the reduced costs of the basic variables.) Furthermore, dual feasibility requires that v ≥ 0. Thus we see that at optimality we require:
(9.97)  cB B−1 N − cN ≥ 0
This is precisely the condition for optimality in the simplex tableau.

We now can see the following facts are true about the Simplex Method:
(1) At each iteration of the Simplex Method, primal feasibility is satisfied. This is ensured by the minimum ratio test and the fact that we start at a feasible point.
(2) At each iteration of the Simplex Method, complementary slackness is satisfied. After all, the vector v is just the reduced cost vector (Row 0) of the Simplex tableau. If a variable xj is basic (and hence non-zero), then its reduced cost vj = 0. Otherwise, vj may be non-zero.
(3) At each iteration of the Simplex Algorithm, we may violate dual feasibility because we may not have v ≥ 0. It is only at optimality that we achieve dual feasibility and satisfy the KKT conditions.
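To see these facts numerically, the following Maple fragment (a sketch on hypothetical data, here a two-constraint variant of the toy maker problem in standard form) recovers w and v from an optimal basis:

    with(LinearAlgebra):
    # Hypothetical data: max 7 x1 + 6 x2 s.t. 3 x1 + x2 <= 120, x1 + 2 x2 <= 160,
    # put in standard form with slack variables in columns 3 and 4.
    A := Matrix([[3, 1, 1, 0], [1, 2, 0, 1]]):
    b := Vector([120, 160]):
    c := Vector[row]([7, 6, 0, 0]):
    Bidx := [1, 2]:                  # basic variables at optimality: x1, x2
    B  := A[.., Bidx]:               # basis matrix
    cB := c[Bidx]:                   # basic objective coefficients
    w  := cB . MatrixInverse(B):     # w = cB B^(-1)
    v  := w . A - c:                 # reduced cost vector
    xB := MatrixInverse(B) . b:      # basic solution xB = B^(-1) b

Here xB = [16, 72]T, w = [8/5, 11/5] and v = [0, 0, 8/5, 11/5]: the basic variables have zero reduced cost (complementary slackness), and v ≥ 0 is exactly the termination condition of the simplex algorithm.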
We can now prove the following theorem:

Theorem 9.79. Assuming an appropriate cycling prevention rule is used, the simplex algorithm converges in a finite number of iterations to an optimal solution to the linear programming problem.

Proof. Convergence is guaranteed by the proof of Theorem 9.78, in which we show that when the lexicographic minimum ratio test is used, the simplex algorithm will always converge. Our work above shows that at optimality, the KKT conditions are satisfied because the termination criteria for the simplex algorithm are precisely the same as the criteria in the Karush-Kuhn-Tucker conditions. This completes the proof. □

9. Duality

Remark 9.80. In this section, we show that to each linear programming problem (the primal problem) we may associate another linear programming problem (the dual linear programming problem). These two problems are closely related to each other and an analysis of the dual problem can provide deep insight into the primal problem.

Consider the linear programming problem
(9.98)
P: max cT x
   s.t. Ax ≤ b
        x ≥ 0
Then the dual problem for Problem P is:
(9.99)
D: min wb
   s.t. wA ≥ c
        w ≥ 0

Remark 9.81. Let v be a vector of surplus variables. Then we can transform Problem D into standard form as:
(9.100)
DS: min wb
    s.t. wA − v = c
         w ≥ 0
         v ≥ 0
Thus we already see an intimate relationship between duality and the KKT conditions. The feasible region of the dual problem (in standard form) is precisely the dual feasibility constraint of the KKT conditions for the primal problem. In this formulation, we see that we have assigned a dual variable wi (i = 1, . . . , m) to each constraint in the system of inequalities Ax ≤ b of the primal problem. Likewise, the dual variables v can be thought of as corresponding to the constraints in x ≥ 0.

Remark 9.82. The proof of the following lemma is outside the scope of the class, but it establishes an important fact about duality.

Lemma 9.83. The dual of the dual problem is the primal problem. □

Remark 9.84. Lemma 9.83 shows that the notion of dual and primal can be exchanged and that it is simply a matter of perspective which problem is the dual problem and which is the primal problem. Likewise, by transforming problems into canonical form, we can develop dual problems for any linear programming problem. The process of developing these formulations can be exceptionally tedious, as it requires enumeration of all the possible combinations of various linear and variable constraints. The following table summarizes the process of converting an arbitrary primal problem into its dual. This table can be found in Chapter 6 of [BJS04].

Example 9.85. Consider the problem of finding the dual problem when the primal problem is:
max 7x1 + 6x2
s.t. 3x1 + x2 + s1 = 120    (w1)
     x1 + 2x2 + s2 = 160    (w2)
     x1 + s3 = 35           (w3)
     x1, x2, s1, s2, s3 ≥ 0
Here we have placed dual variable names (w1, w2 and w3) next to the constraints to which they correspond. The primal problem variables in this case are all positive, so using Table 1 we know that the constraints of the dual problem will be greater-than-or-equal-to constraints. Likewise, we know that the dual variables will be unrestricted in sign since the primal problem constraints are all equality constraints.

    MINIMIZATION PROBLEM                MAXIMIZATION PROBLEM
    variables ≥ 0            <-->      constraints ≤
    variables ≤ 0            <-->      constraints ≥
    variables unrestricted   <-->      constraints =
    constraints ≥            <-->      variables ≥ 0
    constraints ≤            <-->      variables ≤ 0
    constraints =            <-->      variables unrestricted

Table 1.
Table of Dual Conversions: To create a dual problem, assign a dual variable to each constraint of the form Ax ◦ b, where ◦ represents a binary relation. Then use the table to determine the appropriate sign of the inequality in the dual problem as well as the nature of the dual variables. The coefficient matrix  3 1 1 0 A = 1 2 0 1 1 0 0 0 is:  0 0 1 Clearly we have:   c= 7 6 0 0 0   120  b = 160 35 Since w = [w1 w2 w3 ], we know that wA will be:   wA = 3w1 + w2 + w3 w1 + 2w2 w1 w2 w3 This vector will be related to c in the constraints of the dual problem. Remember, in this case, all variables in the primal problem are greater-than-or-equal-to zero. Thus we see that the constraints of the dual problem are: 3w1 + w2 + w3 w1 + 2w2 w1 w2 w3 ≥7 ≥6 ≥0 ≥0 ≥0 We also have the redundant set of constraints that tell us w is unrestricted because the primal problem had equality constraints. This will always happen in cases when you’ve introduced slack variables into a problem to put it in standard form. This should be clear from the definition of the dual problem for a maximization problem in canonical form. 144 9. LINEAR PROGRAMMING: THE SIMPLEX METHOD Thus the whole dual problem becomes: min 120w1 + 160w2 + 35w3 s.t. 3w1 + w2 + w3 ≥ 7 w1 + 2w2 ≥ 6 w1 ≥ 0 (9.101) w2 ≥ 0 w3 ≥ 0 w unrestricted Again, note that in reality, the constraints we derived from the wA ≥ c part of the dual problem make the constraints “w unrestricted” redundant, for in fact w ≥ 0 just as we would expect it to be if we’d found the dual of the Toy Maker problem given in canonical form. Exercise 71. Identify the dual problem for: max x1 + x2 s.t. 2x1 + x2 ≥ 4 x1 + 2x2 ≤ 6 x 1 , x2 ≥ 0 Exercise 72. Use the table or the definition of duality to determine the dual for the problem:    min cx s.t. Ax ≥ b (9.102)   x≥0 Remark 9.86. The following theorems are outside the scope of this course, but they can be useful to know and will help cement your understanding of the true nature of duality. Theorem 9.87 (Strong Duality Theorem). Consider Problem P and Problem D. Then (Weak Duality): cx∗ ≤ w∗ b, thus every feasible solution to the primal problem provides a lower bound for the dual and every feasible solution to the dual problem provides an upper bound to the primal problem. Furthermore exactly one of the following statements is true: (1) Both Problem P and Problem D possess optimal solutions x∗ and w∗ respectively and cx∗ = w∗ b. (2) Problem P is unbounded and Problem D is infeasible. (3) Problem D is unbounded and Problem P is infeasible. (4) Both problems are infeasible. 9. DUALITY 145 Theorem 9.88. Problem D has an optimal solution w∗ ∈ Rm if and only if there exists vectors x∗ ∈ Rn and s∗ ∈ Rm and a vector of surplus variables v∗ ∈ Rn such that:  ∗ w A≥c (9.103) Primal Feasibility w∗ ≥ 0  ∗ ∗   Ax + s = b x∗ ≥ 0 (9.104) Dual Feasibility   s∗ ≥ 0  ∗ (w A − c) x∗ = 0 (9.105) Complementary Slackness w ∗ s∗ = 0 Furthermore, these KKT conditions are equivalent to the KKT conditions for the primal problem. CHAPTER 10 Feasible Direction Methods and Quadratic Programming 1. Preliminaries Remark 10.1. In this chapter, we return to the problem we originally discussed in Chapter 1, namely: Let f : Rn → R; for i = 1, . . . , m, gi : Rn → R; and for j = 1, . . . , l hj : Rn → R be functions. Then the problem we consider is:    max f (x1 , . . . , xn )    s.t. g1 (x1 , . . . , xn ) ≤ b1      ..   .   gm (x1 , . . . , xn ) ≤ bm    h1 (x1 , . . . , xn ) = r1      ..   .     hl (x1 , . . . 
, xn) = rl

For simplicity, however, we will consider a minor variation of the problem, namely:
(10.1)
max f(x1, . . . , xn)
s.t. g1(x1, . . . , xn) ≤ 0
     ...
     gm(x1, . . . , xn) ≤ 0
     h1(x1, . . . , xn) = 0
     ...
     hl(x1, . . . , xn) = 0
It is easy to see that this is an equivalent formulation, as we have allowed g1, . . . , gm and h1, . . . , hl to be defined quite generally (the constants bi and rj can simply be absorbed into the functions). Unlike in other chapters, we may consider the minimization version of this problem when it is convenient for understanding. When we do so, the reader will be given sufficient warning. In addition, for the remainder of this chapter (unless otherwise stated), we will suppose that:
(10.2)  X = {x ∈ Rn : g1(x), . . . , gm(x) ≤ 0, h1(x) = · · · = hl(x) = 0}
is a closed, non-empty, convex set.

Remark 10.2. We begin our discussion of constrained optimization with a lemma, better known as Weierstrass' Theorem. We will not prove it, as the proof requires a bit more analysis than is required for the rest of the notes. The interested reader may consult a standard text on real analysis.

Lemma 10.3 (Weierstrass' Theorem). Let X be a non-empty closed and bounded set in Rn and let f : X → R be a continuous mapping. Then the optimization problem:
(10.3)
max f(x)
s.t. x ∈ X
has at least one solution x∗ ∈ X.

Remark 10.4. The proof of the next theorem is a simple modification to the proof of Theorem 2.25.

Theorem 10.5. Suppose that f : Rn → R is concave and x∗ is a local maximizer of
(10.4)
P: max f(x)
   s.t. x ∈ X
then x∗ is a global maximizer.

Proof. Suppose that x+ ∈ X has the property that f(x+) > f(x∗). For any λ ∈ (0, 1) we know that:
f(λx∗ + (1 − λ)x+) ≥ λf(x∗) + (1 − λ)f(x+)
Moreover, by the convexity of X, λx∗ + (1 − λ)x+ ∈ X. Since x∗ is a local maximum, there is an ǫ > 0 so that for all x ∈ Bǫ(x∗) ∩ X, f(x∗) ≥ f(x). Choose λ so that λx∗ + (1 − λ)x+ is in Bǫ(x∗) ∩ X and let x = λx∗ + (1 − λ)x+. Let r = f(x+) − f(x∗). By assumption r > 0. Then we have:
f(x) ≥ λf(x∗) + (1 − λ)(f(x∗) + r)
But this implies that:
f(x) ≥ f(x∗) + (1 − λ)r > f(x∗)
But x ∈ Bǫ(x∗) ∩ X by choice of λ, which contradicts our assumption that x∗ is a local maximum. Thus, x∗ must be a global maximum. □

Theorem 10.6. Suppose that f : Rn → R is strictly concave and that x∗ is a global solution of
(10.5)
P: max f(x)
   s.t. x ∈ X
Then x∗ is the unique global maximizer of f.

Exercise 73. Prove Theorem 10.6.

Theorem 10.7. Suppose f(x) is continuously differentiable on X. If x∗ is a local maximum of f(x), then:
(10.6)  ∇f(x∗)T(x − x∗) ≤ 0  ∀x ∈ X
Furthermore, if f(x) is a concave function, then this is a sufficient condition for a maximum.

Proof. Suppose that Inequality 10.6 does not hold, but x∗ is a local maximum. By the mean value theorem, there is some t ∈ (0, 1) so that:
(10.7)  f(x) = f(x∗) + ∇f(x∗ + t(x − x∗))T(x − x∗)
If Inequality 10.6 does not hold for x ∈ X, then ∇f(x∗)T(x − x∗) > 0. Choose y = x∗ + ǫ(x − x∗) where ǫ > 0 is small enough so that, by continuity,
(10.8)  ∇f(x∗ + t(y − x∗))T(y − x∗) > 0
since ∇f(x∗)T(x − x∗) > 0. Note that y is in the same direction as x relative to x∗, ensuring that ∇f(x∗)T(y − x∗) > 0. Thus:
(10.9)  f(y) − f(x∗) = ∇f(x∗ + t(y − x∗))T(y − x∗) > 0
and x∗ cannot be a local maximum.
If f(x) is concave, then by Theorem 2.19, we know that:
(10.10)  f(x) ≤ f(x∗) + ∇f(x∗)T(x − x∗)
If Inequality 10.6 holds, then for all x ∈ X we know that f(x) ≤ f(x∗). Thus, x∗ must be a (local) maximum. When f(x) is strictly concave, x∗ is a global maximum. □

Definition 10.8 (Stationary Point). Suppose that x∗ ∈ X satisfies Inequality 10.6. Then x∗ is a stationary point for the problem of maximizing f(x) subject to the constraints that x ∈ X.

Remark 10.9. This notion of stationarity takes the place of ∇f(x∗) = 0 used in unconstrained optimization. We will use it throughout the rest of this chapter. Not only that, but the definition of gradient related remains the same as before. See Definition 4.11.

2. Frank-Wolfe Algorithm

Remark 10.10. For the remainder of this chapter, we assume the following Problem P:
(10.11)
P: max f(x)
   s.t. x ∈ X
where X is a closed, convex and non-empty set.

Remark 10.11. Our study of Linear Programming in the previous chapter suggests that when f(x) in Problem P is non-linear, but g1, . . . , gm and h1, . . . , hl are all linear, we might be able to find a simple way of using linear programming to solve this non-linear programming problem. The simplest approach to this is the Frank-Wolfe algorithm, also known as the conditional gradient method. This method is (almost) the non-linear programming variation of the gradient ascent method, in that it works well initially, but slows down as the algorithm approaches an optimal point.

Remark 10.12. The Frank-Wolfe algorithm works by iteratively approximating the objective function as a linear function and then uses linear programming to choose an ascent direction. Univariate maximization is then used to choose a distance along the ascent direction. In essence, given problem P, at step k, with iterate xk, we solve:
(10.12)
L: max ∇f(xk)T x
   s.t. x ∈ X
Note that this is equivalent to maximizing the first order approximation:
(10.13)  f(xk) + ∇f(xk)T(x − xk)
If x+ is the solution to this problem, we then solve the problem:
(10.14)  max f(xk + t(x+ − xk)) over t ∈ [0, 1]
By the convexity of X, we see that restricting t to [0, 1] ensures that xk+1 = xk + t∗(x+ − xk) is in X, where t∗ is the solution to the univariate problem. Clearly, when X is a polyhedral set, Problem L is a linear programming problem. Iteration continues until a stopping criterion is reached. A simple one is ||xk+1 − xk|| < ǫ. A reference implementation for this algorithm is shown in Algorithm 21.

Example 10.13. We can see an example of the Frank-Wolfe algorithm when we attempt to solve the problem:
(10.15)
max −(x − 2)² − (y − 2)²
s.t. x + y ≤ 1
     x, y ≥ 0
In this case, the sequence of points when starting from x0 = 0, y0 = 0 is
(1) x1 = 0, y1 = 1
(2) x2 = 0.5, y2 = 0.5
when we set ǫ = 0.0001. This is illustrated in Figure 10.1(a).

Figure 10.1. (a) The steps of the Frank-Wolfe Algorithm when maximizing −(x − 2)² − (y − 2)² over the set of (x, y) satisfying the constraints x + y ≤ 1 and x, y ≥ 0. (b) The steps of the Frank-Wolfe Algorithm when maximizing −7(x − 20)² − 6(y − 40)² over the set of (x, y) satisfying the constraints 3x + y ≤ 120, x + 2y ≤ 160, x ≤ 35 and x, y ≥ 0. (c) The steps of the Frank-Wolfe Algorithm when maximizing −7(x − 40)² − 6(y − 40)² over the set of (x, y) satisfying the constraints 3x + y ≤ 120, x + 2y ≤ 160, x ≤ 35 and x, y ≥ 0.
FrankWolfe := proc (F::operator, LINCON::set, initVals::list,
                    epsilon::numeric, maxIter::integer)::list;
  # Requires the LinearAlgebra, Optimization (for LPSolve) and
  # Student[VectorCalculus] (for Gradient) packages to be loaded.
  local vars, xnow, xnext, ttemp, FLIN, SOLN, X, vals, i, G, dX,
        count, p, r, passIn, phi, OUT;

  vars := []; xnow := []; vals := initVals;
  for i to nops(initVals) do
    vars := [op(vars), lhs(initVals[i])];
    xnow := [op(xnow), rhs(initVals[i])]
  end do;

  #Compute the gradient.
  G := Gradient(F(op(vars)), vars);
  OUT := [];

  dX := 1;
  count := 0;

  while epsilon < dX and count < maxIter do
    count := count + 1;

    #Store output.
    OUT := [op(OUT), xnow];

    #A vector of variables, used to create the linear objective.
    X := Vector([seq(vars[i], i = 1 .. nops(vars))]);

    #Create the linear objective from the gradient at the current point.
    FLIN := Transpose(evalf(eval(G, vals))) . X;

    #Recover the actual solution.
    SOLN := LPSolve(FLIN, LINCON, maximize = true)[2];

    #Compute the direction.
    p := [];
    for i to nops(initVals) do
      p := [op(p), eval(vars[i], SOLN)]
    end do;
    r := Vector(p);
    passIn := convert(Vector(xnow) + s*(r - Vector(xnow)), list);

    #Define the line function.
    phi := proc (t) options operator, arrow;
      evalf(eval(F(op(passIn)), s = t))
    end proc;

    #GoldenSectionSearch is defined separately (see Chapter 3).
    ttemp := GoldenSectionSearch(phi, 0, .39, 1, 0.1e-3);

    xnext := evalf(eval(passIn, [s = ttemp]));
    dX := Norm(Vector(xnext - xnow));

    #Move and update the current values.
    xnow := xnext;
    vals := [];
    for i to nops(vars) do
      vals := [op(vals), vars[i] = xnow[i]]
    end do:
  end do;
  OUT
end proc:

Algorithm 21. Frank-Wolfe Algorithm for finding the maximum of a (concave) function with linear constraints.

By contrast, we can investigate the behavior of the Frank-Wolfe Algorithm when we attempt to solve the problem:
(10.16)
max −7(x − 20)² − 6(y − 40)²
s.t. 3x + y ≤ 120
     x + 2y ≤ 160
     x ≤ 35
     x, y ≥ 0
Here the optimal point is in the interior of the constraint set (as opposed to on its boundary) and the Frank-Wolfe algorithm exhibits behavior like Gradient Ascent. Convergence occurs in 73 iterations. This is shown in Figure 10.1(b). Finally, we can consider the objective function −7(x − 40)² − 6(y − 40)², with the same constraints as before. In this case, convergence takes over 2000 iterations, but we get very close to an approximate solution quickly. (See Figure 10.1(c).) This illustrates the power of the Frank-Wolfe method. It can be used to find a reasonably good solution quickly. If refinement is needed, then an algorithm with better convergence properties can take over.

Lemma 10.14. Each search direction (x+ − xk) in the Frank-Wolfe algorithm is an ascent direction.

Proof. By construction, x+ maximizes ∇f(xk)T x subject to the constraint that x ∈ X. This implies that for all x ∈ X:
(10.17)  ∇f(xk)T x+ ≥ ∇f(xk)T x
Thus:
(10.18)  ∇f(xk)T(x+ − x) ≥ 0  ∀x ∈ X
Thus, in particular, ∇f(xk)T(x+ − xk) ≥ 0 and thus x+ − xk is an ascent direction. □

Definition 10.15. Given f : Rn → R and a convex set X, a feasible direction method is one in which at each iteration k a direction pk is chosen and xk+1 = xk + δpk is constructed so that xk+1 ∈ X and, for some ǫ > 0, if 0 ≤ δ ≤ ǫ then xk + δpk ∈ X. Thus, pk is a feasible direction.

Lemma 10.16. Let {xk} be a sequence generated by a feasible direction method (e.g. the Frank-Wolfe Algorithm).
If δ is chosen by limited line search or the Armijo rule, then any point x∗ to which {xk } converges is a stationary point. Proof. The proof is identical (mutatis mutandis) to the proof of Theorem 4.12.  Theorem 10.17. Suppose that the Frank-Wolfe algorithm converges. Then it converges to a stationary point. Proof. Every direction generated by the Frank-Wolfe algorithm is an ascent direction and therefore the sequence of directions must be gradient related. By the previous lemma, the Frank-Wolfe algorithm converges to a stationary point.  3. Farkas’ Lemma and Theorems of the Alternative Lemma 10.18 (Farkas’ Lemma). Let A ∈ Rm×n and c ∈ Rn be a row vector. Suppose x ∈ Rn is a column vector and w ∈ Rm is a row vector. Then exactly one of the following systems of inequalities has a solution: (1) Ax ≥ 0 and cx < 0 or (2) wA = c and w ≥ 0 Remark 10.19. Before proceeding to the proof, it is helpful to restate the lemma in the following way: (1) If there is a vector x ∈ Rn so that Ax ≥ 0 and cx < 0, then there is no vector w ∈ Rm so that wA = c and w ≥ 0. (2) Conversely, if there is a vector w ∈ Rm so that wA = c and w ≥ 0, then there is no vector x ∈ Rn so that Ax ≥ 0 and cx < 0. Proof. We can prove Farkas’ Lemma using the fact that a bounded linear programming problem has an extreme point solution. Suppose that System 1 has a solution x. If System 2 also has a solution w, then (10.19) wA = c =⇒ wAx = cx. The fact that System 1 has a solution ensures that cx < 0 and therefore wAx < 0. However, it also ensures that Ax ≥ 0. The fact that System 2 has a solution implies that w ≥ 0. Therefore we must conclude that: (10.20) w ≥ 0 and Ax ≥ 0 =⇒ wAx ≥ 0. 154 10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING This contradiction implies that if System 1 has a solution, then System 2 cannot have a solution. Now, suppose that System 1 has no solution. We will construct a solution for System 2. If System 1 has no solution, then there is no vector x so that cx < 0 and Ax ≥ 0. Consider the linear programming problem:  min cx (10.21) PF s.t. Ax ≥ 0 Clearly x = 0 is a feasible solution to this linear programming problem and furthermore is optimal. To see this, note that the fact that there is no x so that cx < 0 and Ax ≥ 0, it follows that cx ≥ 0; i.e., 0 is a lower bound for the linear programming problem PF . At x = 0, the objective achieves its lower bound and therefore this must be an optimal solution. Therefore PF is bounded and feasible. We can covert PF to standard form through the following steps: (1) Introduce two new vectors y and z with y, z ≥ 0 and write x = y − z (since x is unrestricted). (2) Append a vector of surplus variables s to the constraints. This yields the new problem:    min cy − cz s.t. Ay − Az − Im s = 0 (10.22) PF′   y, z, s ≥ 0 Applying Theorems 9.53 and 9.57, we see we can obtain an optimal basic feasible solution for Problem PF′ in which the reduced costs for the variables are all negative (that is, zj −cj ≤ 0 for j = 1, . . . , 2n + m). Here we have n variables in vector y, n variables in vector z and m variables in vector s. Let B ∈ Rm×m be the basis matrix at this optimal feasible solution with basic cost vector cB . Let w = cB B−1 (as it was defined for the revised simplex algorithm). Consider the columns of the simplex tableau corresponding to a variable xk (in our original x vector). The variable xk = yk − zk . Thus, these two columns are additive inverses. That is, the column for yk will be B−1 A·k , while the column for zk will be B−1 (−A·k ) = −B−1 A·k . 
Furthermore, the objective function coefficient will be precisely opposite as well. Thus the fact that zj − cj ≤ 0 for all variables implies that: wA·k − ck ≤ 0 and −wA·k + ck ≤ 0 and That is, we obtain (10.23) wA = c since this holds for all columns of A. Consider the surplus variable sk . Surplus variables have zero as their coefficient in the objective function. Further, their simplex tableau column is simply B−1 (−ek ) = −B−1 ek . The fact that the reduced cost of this variable is non-positive implies that: (10.24) w(−ek ) − 0 = −wek ≤ 0 3. FARKAS’ LEMMA AND THEOREMS OF THE ALTERNATIVE 155 Since this holds for all surplus variable columns, we see that −w ≤ 0 which implies w ≥ 0. Thus, the optimal basic feasible solution to Problem PF′ must yield a vector w that solves System 2. Lastly, the fact that if System 2 does not have a solution, then System 1 does follows from contrapositive on the previous fact we just proved.  Exercise 74. Suppose we have two statements A and B so that: A ≡ System 1 has a solution. B ≡ System 2 has a solution. Our proof showed explicitly that NOT A =⇒ B. Recall that contrapositive is the logical rule that asserts that: (10.25) X =⇒ Y ≡ NOT Y =⇒ NOT X Use contrapositive to prove explicitly that if System 2 has no solution, then System 1 must have a solution. [Hint: NOT NOT X ≡ X.] 3.1. Geometry of Farkas’ Lemma. Farkas’ Lemma has a pleasant geometric interpretation1. Consider System 2: namely: wA = c and w ≥ 0 Geometrically, this states that c is inside the positive cone generated by the rows of A. That is, let w = (w1 , . . . , wm ). Then we have: (10.26) w1 A1· + · · · + wm Am· and wi ≥ 0 for i = 1, . . . , m. Thus c is a positive combination of the rows of A. This is illustrated in Figure 10.2. On the other hand, suppose System 1 has a solution. Then let A2· A1· Half-space c Am· Positive Cone of Rows of A cy > 0 Figure 10.2. System 2 has a solution if (and only if) the vector c is contained inside the positive cone constructed from the rows of A. y = −x. System 1 states that Ay ≤ 0 and cy > 0. That means that each row of A (as a vector) must be at a right angle or obtuse to y. (Since Ai· x ≥ 0.) Further, we know that the vector y must be acute with respect to the vector c. This means that System 1 has a solution only if the vector c is not in the positive cone of the rows of A or equivalently the intersection of the open half-space {y : cy > 0} and the set of vectors {y : Ai· y ≤ 0, i = 1, . . . m} is 1Thanks to Akinwale Akinbiyi for pointing out a typo in this discussion. 156 10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING non-empty. This set is the cone of vectors perpendicular to the rows of A. This is illustrated in Figure 10.3 Am· c Non-empty intersection A2· Half-space cy > 0 Cone of perpendicular vectors to Rows of A A1· Figure 10.3. System 1 has a solution if (and only if) the vector c is not contained inside the positive cone constructed from the rows of A. Example 10.20. Consider the matrix:   1 0 A= 0 1     and the vector c = 1 2 . Then clearly, we can see that the vector w = 1 2 will satisfy System 2 of Farkas’ Lemma, since w ≥ 0 and wA = c.    T Contrast this with c′ = 1 −1 . In this case, we can choose x = 0 1 . Then  T Ax = 0 1 ≥ 0 and c′ x = −1. Thus x satisfies System 1 of Farkas’ Lemma. These two facts are illustrated in Figure 10.4. Here, we see that c is inside the positive cone formed by the rows of A, while c′ is not. [0 1] [1 2] System 2 has a solution Positive cone [1 0] [1 -1] System 1 has a solution Figure 10.4. 
An example of Farkas’ Lemma: The vector c is inside the positive cone formed by the rows of A, but c′ is not. Exercise 75. Consider the following matrix:   1 0 A= 1 1   and the vector c = 1 2 . For this matrix and this vector, does System 1 have a solution or does System 2 have a solution? [Hint: Draw a picture illustrating the positive cone formed by the rows of A. Draw in c. Is c in the cone or not?] 3. FARKAS’ LEMMA AND THEOREMS OF THE ALTERNATIVE 157 3.2. Theorems of the Alternative. Farkas’ lemma can be manipulated in many ways to produce several equivalent statements. The collection of all such theorems are called Theorems of the Alternative and are used extensively in optimization theory in proving optimality conditions. We state two that will be useful to us. Corollary 10.21. Let A ∈ Rm×n and let c ∈ Rn be a row vector. Then exactly one of the following systems has a solution: (1) Ax < 0 and cx > 0 or (2) wA = w0 c and [w0 , w] ≥ 0, [w0 , w] 6= 0. Proof. First define:   −c (10.27) M =  −  A So that System 1 becomes Mx < 0 for some x ∈ Rn . Now, re-write this new System 1 as: (10.28) Mx + s1 ≤ 0 (10.29) s>0 where 1 is a vector of m ones. If we write:   (10.30) Q = M | 1   x  ≤0 − (10.31) y = s   (10.32) r = 0 0 · · · 0 1 Then System 1 becomes My ≤ 0 and ry > 0. Finally letting z = −y, we see that System 1 becomes Qz ≥ 0 and rz < 0, which is System 1 of Farkas Lemma. Thus, if System 1 has no solution, then there is a solution to: (10.33) wQ = r w≥0 This implies that: (10.34) vM = 0 (10.35) v1 = 1 (10.36) v ≥ 0 Suppose that v = [w0 , w1 , . . . , wm ] and w = [w1 , . . . , wm ]. Then:     −c (10.37) w0 w  −  = 0 A But this implies that: (10.38) −w0 c + wA = 0 Therefore: wA = w0 c and v1 = 1 and v ≥ 0 imply that [w0 , w] ≥ 0, [w0 , w] 6= 0. Exercise 76. Prove the following corollary to Farkas’ Lemma:  158 10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING Corollary 10.22. Let A ∈ Rm×n and c ∈ Rn be a row vector. Suppose d ∈ Rn is a column vector and w ∈ Rm is a row vector and v ∈ Rn is a row vector. Then exactly one of the following systems of inequalities has a solution: (1) Ad ≤ 0, d ≥ 0 and cd > 0 or (2) wA − v = c and w, v ≥ 0 [Hint: Write System 2 from this corollary as wA − In v = c and then re-write the system with an augmented vector [w v] with an appropriate augmented matrix. Let M be the augmented matrix you identified. Now write System 1 from Farkas’ Lemma using M and x. Let d = −x and expand System 1 until you obtain System 1 for this problem.] 4. Preliminary Results: Feasible Directions, Improving Directions Remark 10.23. In the previous chapter, we saw the Karush-Kuhn-Tucker (KKT) necessary and sufficient conditions for optimality for linear programming problems. In this chapter, we generalize these conditions to the case of non-linear programming problems. Remark 10.24. We return to the study of Problem P :  max f (x) P s.t. x ∈ X where X is a set with non-empty interior that is, ideally, closed and convex. Definition 10.25 (Cone of Feasible Directions). For Problem P , the cone of feasible directions at x ∈ X is the set: (10.39) D(x) = {p ∈ Rn : p 6= 0 and for all λ ∈ (0, δ) , x + λp ∈ X, for δ > 0} Definition 10.26 (Cone of Improving Directions). For Problem P , the cone of improving directions at x ∈ X is the set: (10.40) F (x) = {p ∈ Rn : for all λ ∈ (0, δ) , f (x + λp) > f (x), for δ > 0} Proposition 10.27. Let x ∈ Rn . If p ∈ Rn and ∇f (x)T p > 0, then p ∈ F (x). Thus: (10.41) F0 (x) = {p ∈ Rn : ∇f (x)T p > 0} ⊆ F (x) Exercise 77. 
Prove Proposition 10.27. Lemma 10.28. For Problem P , suppose that x∗ ∈ X is a local maximum. Then F0 (x∗ ) ∩ D(x∗ ) = ∅. Proof. Suppose this is not the case by contradiction. Choose p ∈ F0 (x∗ ) ∩ D(x∗ ). Let x = x∗ + λp for some appropriately chosen λ > 0 so taht x ∈ X. Such a λ exists because p ∈ D(x∗ ). Thus, d = x − x∗ and (10.42) ∇f (x∗ )T (x − x∗ ) = λ∇f (x∗ )T p > 0 because p ∈ F0 (x∗ ). This contradicts Theorem 10.7 and thus x∗ cannot have been a local maximum.  Proposition 10.29. If f : Rn → R is concave and F0 (x∗ ) ∩ D(x∗ ) = ∅, then x∗ is a local maximum of f on X. 4. PRELIMINARY RESULTS: FEASIBLE DIRECTIONS, IMPROVING DIRECTIONS 159 Exercise 78. Using the sufficiency argument in the proof of Theorem 10.7, prove Proposition 10.29. Remark 10.30. It is relatively straight-forward to prove that when f is concave, then F0 (x∗ ) = F (x∗ ) and when f is strictly concave, then: (10.43) F (x∗ ) = {p ∈ Rn : p 6= 0, ∇f (x∗ )T p ≥ 0} Lemma 10.31. Consider a simplified form of Problem P , called Problem P ′ :  max f (x) ′ (10.44) P s.t. gi (x) ≤ 0 i = 1, . . . , m Here, f : Rn → R and gi (x) :→ Rn are continuously differentiable at a point x0 . Denote the set I = {i ∈ {1, . . . , m} : gi (x∗ ) = 0}. Then: (10.45) G0 (x0 ) = {p ∈ Rn : ∇gi (x0 )T p < 0, ∀i ∈ I} ⊆ D(x0 ) Proof. Suppose p ∈ G0 (x0 ) and define x = x0 + λp. For all j 6∈ I, we know that gj (x0 ) < 0. By the continuity of gj , for all ǫ > 0, there exists a δ > 0 so that if ||x0 − x|| < δ, then |gj (x0 ) − gj (x)| < ǫ and thus for some choice of λ > 0 we can ensure that gj (x) < 0. Now since ∇gi (x0 )T p < 0, this is a descent direction of gi for i ∈ I. Thus (by a variation of Proposition 10.27), there is some λ′ > 0 so that gi (x0 + λ′ p) < gi (x0 ) = 0. Thus, if we choose t = min{λ, λ′ }, then gi (x0 + tp) ≤ 0 for i = 1, . . . , m and thus x0 + tp ∈ X. Thus it follows that p ∈ D.  Lemma 10.32. Consider Problem P ′ as in Lemma 10.31 and suppose that gi (i ∈ I) are strictly convex (as defined in Lemma 10.31). Then G0 (x0 ) = D(x0 ). Proof. Suppose that p ∈ D(x0 ) and suppose p 6∈ G0 (x0 ). Then ∇gi (x0 )T p ≥ 0. Thus, p is not a descent direction. Suppose that x = x0 + λp and define h = x − x0 . Then by strict convexity (and Exercises 16 and 17), for all λ > 0 we know: (10.46) gi (x0 + λh) − gi (x0 ) > ∇gi (x0 )T h = λ∇gi (x0 )T p ≥ 0 Thus, there is no λ > 0 so that x = x0 + λp ∈ X and p 6∈ D(x0 ).  Theorem 10.33. Consider Problem P ′ , where f : Rn → R and gi (x) :→ Rn are continuously differentiable at a local maximum x∗ . Denote the set I = {i ∈ {1, . . . , m} : gi (x∗ ) = 0}. Then F0 (x∗ ) ∩ G0 (x∗ ) = ∅. Conversely, if gi (i = 1, . . . , m) are strictly convex and f is concave and F0 (x∗ ) ∩ G0 (x∗ ) = ∅, then x∗ is a local maximum. Proof. Suppose that x∗ is a local maximum. Then we know from Lemma 10.28 that F0 (x∗ ) ∩ D(x∗ ) = ∅. Additionally from Lemma 10.31 we know that G0 (x∗ ) ⊆ D(x∗ ). Thus it follows that F0 (x∗ ) ∩ G0 (x∗ ) = ∅. Now suppose that gi (i = 1, . . . , m) are strictly convex and f is concave and F0 (x∗ ) ∩ G0 (x∗ ) = ∅. From Lemma 10.32, we know that D(x∗ ) = G0 (x∗ ) and thus we know at once that F0 (x∗ ) ∩ D(x∗ ) = ∅. It follows from Proposition 10.29 that x∗ is a local maximum over X.  Remark 10.34. All of these theorems can be generalized substantially, to include functions not defined over all Rn , functions that are only locally convex or concave or functions that exhibit generalized concavity/convexity properties. 
Full details can be found in Chapter 4 of [BSS06], on which most of this section is based. 160 10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING 5. Fritz-John and Karush-Kuhn-Tucker Theorems Theorem 10.35 (Fritz-John Theorem). Consider Problem P ′ , where f : Rn → R and gi : Rn → Rn are continuously differentiable at a local maximum x∗ . Denote the set I = {i ∈ {1, . . . , m} : gi (x∗ ) = 0}. Then there exists scalars, u0 , u1 , . . . , um so that:  m X  ∗  u0 ∇f (x ) − ui ∇gi (x∗ ) = 0     i=1  (10.47) ui gi (x∗ ) = 0 i = 1, . . . , m    u0 , ui ≥ 0 i = 1, . . . , m     [u0 , u1 , . . . , um ]T 6= 0 Proof. Let c = ∇f (x∗ )T and let A be the matrix whose rows are formed from ∇gi (x∗ )T (i ∈ I). From Theorem 10.33 we know that F0 (x∗ ) ∩ G0 (x∗ ) = ∅. This implies that there is no p satisfying cp > 0 and Ap < 0. Then it follows from Corollary 10.21 there is a scalars u0 and a vector u so that: uA = u0 c and [u0 , u] ≥ 0, [u0 , u] 6= 0. Thus we see that: X (10.48) uA = u0 c =⇒ u0 ∇f (x∗ ) − ui ∇gi (x∗ ) = 0 i∈I Now let uj = 0 for j 6∈ I. Then we see Equation 10.48 is equivalent to: (10.49) u0 ∇f (x∗ ) − m X i=1 ui ∇gi (x∗ ) = 0 The fact that u0 , ui ≥ 0 and [u0 , u1 , . . . , um ]T 6= 0 is also immediate from Corollary 10.21 as well. By definition of the ui , it is easy to see that if i ∈ I, then gi (x∗ ) = 0 and thus ui gi (x∗ ) = 0. On the other hand, if i 6∈ I, then ui = 0 and thus ui gi (x∗ ) = 0. Thus we have shown that Expression 10.47 must hold.  Theorem 10.36 (Karush-Kuhn-Tucker Necessary Conditions). Consider Problem P ′ , where f : Rn → R and gi : Rn → Rn are continuously differentiable at a local maximum x∗ . Denote the set I = {i ∈ {1, . . . , m} : gi (x∗ ) = 0}. Suppose that the set of vectors ∇gi (x∗ ) (i ∈ I) are linearly independent. Then there exist scalars λ1 , . . . , λm not all zero so that:  m X   ∗  λi ∇gi (x∗ ) = 0   ∇f (x ) − i=1 (10.50)  λi gi (x∗ ) = 0 i = 1, . . . , m     λi ≥ 0 i = 1, . . . , m Proof. Applying the Fritz-John Theorem, we know there are constants u0 , u1 , . . . , um so that:  m X  ∗  u0 ∇f (x ) − ui ∇gi (x∗ ) = 0     i=1  (10.51) ui gi (x∗ ) = 0 i = 1, . . . , m    u0 , ui ≥ 0 i = 1, . . . , m     [u0 , u1 , . . . , um ]T 6= 0 5. FRITZ-JOHN AND KARUSH-KUHN-TUCKER THEOREMS 161 Suppose that u0 = 0, then: m X (10.52) ui ∇gi (x∗ ) = 0 i=1 (10.53) [u1 , . . . , um ]T 6= 0 This is not possible by our assumption that ∇gi (x∗ ) (i ∈ I) are linearly independent and that uj = 0 for j 6∈ I. Thus u0 > 0. Therefore, let λi = ui /u0 . The result follows at once by dividing: m X ∗ u0 ∇f (x ) − ui ∇gi (x∗ ) = 0 i=1 by u0 . This completes the proof.  ∗ Remark 10.37. The assumption that the set of vectors ∇gi (x ) (i ∈ I) are linearly independent is called a constraint qualification. There are several alternative weaker (and stronger) constraint qualifications. These are covered in depth in a course on convex optimization or the theory of nonlinear programming. Theorem 10.38 (Fritz-John Sufficient Theorem). Consider Problem P ′ , where f : Rn → R is locally concave about x∗ and gi : Rn → Rn are locally strictly convex about x∗ . If there are scalars u0 , u1 , . . . , um so that:  m X  ∗  u0 ∇f (x ) − ui ∇gi (x∗ ) = 0     i=1  (10.54) ui gi (x∗ ) = 0 i = 1, . . . , m    u0 , ui ≥ 0 i = 1, . . . , m     [u0 , u1 , . . . , um ]T 6= 0 then x∗ is a local maximum of Problem P ′ . Moreover, if the concavity and convexity are global, then x∗ is a global maximum. Proof. 
Applying the same reasoning as in the proof of the Fritz-John theorem, we have that F0 (x∗ ) ∩ G0 (x∗ ) = ∅. Suppose we restrict our attention only to the set Bǫ (x∗ ) in which f is concave and gi (i = 1, . . . , m) are strictly convex. In doing so, we can (if necessary) redefine f and gi (i = 1, . . . , m) outside this ball so that they are globally concave and strictly convex as needed. Then we may apply Proposition 10.33 to see that x∗ is a local maximum. Now suppose that global concavity and strict convexity hold. From Lemma 10.32, we know that D(x∗ ) = G0 (x∗ ) and thus we know that F0 (x∗ )∩D(x∗ ) = ∅. Consider a relaxation of the problem in which only the constraints indexed in I are kept. Then we still have F0 (x∗ ) ∩ D(x∗ ) = ∅ in our relaxed problem. For any feasible solution x∗ in the relaxed feasible set we know that for all λ ∈ [0, 1]: (10.55) g(x∗ + λ(x − x∗ )) < (1 − λ)g(x∗ ) + λg(x) = 0 + λg(x) < 0 Thus p = x − x∗ ∈ D(x0 ); that is a (x − x∗ ) is a feasible direction. The fact that F0 (x∗ ) ∩ D(x∗ ) = ∅ means that ∇f (x∗ )T p ≤ 0 implies for all x in the (new) feasible region we have: ∇f (x∗ )T (x−x∗ ) ≤ 0. Thus, by Theorem 10.5, x∗ is a global maximum since the new feasible region must contain the original feasible region X.  162 10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING Theorem 10.39. Consider Problem P ′ , where f : Rn → R is locally concave about x∗ and gi : Rn → Rn are locally strictly convex about x∗ . Denote the set I = {i ∈ {1, . . . , m} : gi (x∗ ) = 0}. Suppose that the set of vectors ∇gi (x∗ ) (i ∈ I) are linearly independent. If there exist scalars λ1 , . . . , λm not all zero so that:  m X   ∗  λi ∇gi (x∗ ) = 0   ∇f (x ) − i=1 (10.56)  λi gi (x∗ ) = 0 i = 1, . . . , m     λi ≥ 0 i = 1, . . . , m then x∗ is a local maximum of Problem P ′ . Moreover, if the concavity and convexity are global, then x∗ is a global maximum. Exercise 79. Use the sufficiency condition of the Fritz-John Theorem to prove the previous sufficient condition. Remark 10.40. We can extend both these theorems to our general problem P by rewriting the constraints hj (x) = 0 (j = 1, . . . , l) as the pair of constraints hj (x) ≤ 0 and −hj (x) ≤ 0. For the sufficiency to hold, we require hj (x) to be affine, as otherwise both hj (x) and −hj (x) (j = 1, . . . , l) could not both be strictly convex. These constraints yield a pair of multipliers ρj ≥ 0 and νj ≥ 0 (j = 1, . . . , l) which satisfy: ∗ (10.57) ∇f (x ) − as well as: m X i=1 ∗ λi ∇gi (x ) − l X j=1 (ρj − νj )∇h(x∗ ) = 0 ρj hj (x∗ ) = 0 j = 1, . . . , l −νj hj (x∗ ) = 0 j = 1, . . . , l It is now a simple matter to define µj = (ρj − νj ) and note that it will be a value that is unrestricted in sign (since ρj ≥ 0 and νj ≥ 0). We also note that hj (x∗ ) = 0 (j = 1, . . . , l) and thus, ρj hj (x∗ ) = 0 and νhj (x∗ ) = 0. The following theorems follow at once. Theorem 10.41 (Karush-Kuhn-Tucker Necessary Conditions). Consider Problem P , where f : Rn → R and gi :→ Rn and hj : Rn → R are continuously differentiable at a local maximum x∗ . Denote the set I = {i ∈ {1, . . . , m} : gi (x∗ ) = 0}. Suppose that the set of vectors ∇gi (x∗ ) (i ∈ I) and hj (x∗ ) (j = 1, . . . , l) are linearly independent. Then there exist scalars λ1 , . . . , λm and µ1 , . . . , µl not all zero so that:  m l  X X  ∗ ∗   ∇f (x ) − λi ∇gi (x ) − µj ∇hj (x∗ ) = 0     i=1 j=1 (10.58) λi gi (x∗ ) = 0 i = 1, . . . , m     λi ≥ 0 i = 1, . . . , m     µj unrestricted j = 1, . . . , l Theorem 10.42. 
Consider Problem P , where f : Rn → R is locally concave about x∗ and gi : Rn → R are locally strictly convex about x∗ and hj : Rn → R are locally affine 6. QUADRATIC PROGRAMMING AND ACTIVE SET METHODS 163 about x∗ . Denote the set I = {i ∈ {1, . . . , m} : gi (x∗ ) = 0}. Suppose that the set of vectors ∇gi (x∗ ) (i ∈ I) and ∇hj (x∗ ) are linearly independent. If there exist scalars λ1 , . . . , λm and µ1 , . . . , µl not all zero so that:  m l  X X  ∗ ∗   ∇f (x ) − λ ∇g (x ) − µj ∇hj (x∗ ) = 0 i i     i=1 j=1 (10.59) λi gi (x∗ ) = 0 i = 1, . . . , m     λi ≥ 0 i = 1, . . . , m     µj unrestricted j = 1, . . . , l then x∗ is a local maximum of Problem P . Moreover, if the concavity and convexity are global and hj (j = 1, . . . , l) are globally affine, then x∗ is a global maximum. 6. Quadratic Programming and Active Set Methods Lemma 10.43. Consider the constrained optimization problem:   max cT x + 1 xT Qx (10.60) Q0 2  s.t. Ax = b where A ∈ Rm×n , c ∈ Rn×1 and Q ∈ Rn×n is symmetric. Then the KKT conditions for Problem Q0 are: (10.61) Qx − AT λ = −c (10.62) Ax = b  Exercise 80. Prove Lemma 10.43. Lemma 10.44. Suppose that A ∈ Rn×m has full row rank. Further, suppose that for all x ∈ Rn such that Ax = 0 and x 6= 0, we have xT Qx < 0, then:   Q −AT (10.63) M = A 0 is non-singular. Proof. Suppose that we have a solution pair (x, λ) so that    Q −AT x =0 (10.64) λ A 0 From this we conclude that: (10.65) Qx − AT λ = 0 (10.66) Ax = 0 Thus, x is in the null space of A. From Equation 10.65 we deduce: (10.67) xT Qx − xT AT λ = 0 164 10. FEASIBLE DIRECTION METHODS AND QUADRATIC PROGRAMMING But, xT AT = 0 and therefore xT Qx = 0. This can only be true if x = 0 by our assumption. Thus we have shown that necessarily x = 0. Given this, we see that: (10.68) −AT λ = 0 This implies that: λT A = 0T . Note that λT A is a linear combination of the rows of A, which we assumed to be linearly independent (since A had full row rank). Thus for Equation 10.68 to hold, we must have λ = 0. Thus, the null space of M consists only of the zero-vector and M is non-singular.  Lemma 10.45. Suppose the conditions of the Lemma 10.44 are satisfied. Then Problem Q0 has a unique solution. Proof. It is easy to see that any pair (x, λ) satisfying:      −c Q −AT x = (10.69) b λ A 0 is a KKT point. Suppose that (x∗ , λ∗ ) is a solution to Equation 10.69 and suppose that z(x+ ) ≥ z(x∗ ) where: 1 (10.70) z(x) = cT x + xT Qx 2 + ∗ Define p = x − x . Then Ap = Ax+ − Ax∗ = b − b = 0 and thus p is in the null space of A. Consider that: 1 (10.71) z(x+ ) = z(x∗ + p) = cT (x∗ + p) + (x∗ + p)T Q(x∗ + p) = 2  1 ∗T 1 z(x∗ ) + cT p + pT Qp + x Qp + pT Qx∗ 2 2 We’ve assumed Q is symmetric and thus:  1 ∗T (10.72) x Qp + pT Qx∗ = pT Qx∗ 2 By the necessity of the KKT conditions, we know that Qx∗ = −c + AT λ∗ . From this we conclude that: pT Qx∗ = −pT c + pT AT λ∗ We can substitute this quantity back into Equation 10.71 to obtain: 1 (10.73) z(x+ ) = z(x∗ ) + cT p + pT Qp − pT c + pT AT λ∗ 2 We’ve already observed that p is in the null space of A and thus pT AT = 0. Likewise, cT p = pT c and thus: 1 (10.74) z(x+ ) = z(x∗ ) + pT Qp 2 Finally, by our assumption on Q, the fact that Ap = 0 implies that pT Qp < 0 just in case p 6= 0. Thus: z(x+ ) < z(x∗ ) and x∗ must be a unique global optimal solution to Problem Q0 .  Remark 10.46. Maple code for solving equality constrained quadratic programming problems with concave objective functions is shown in Algorithm 23. 
Remark 10.46. Maple code for solving equality constrained quadratic programming problems with concave objective functions is shown in Algorithm 23. We also include a helper function to determine whether matrices or vectors are empty in Maple.

IsEmpty := overload([
  proc (A::Matrix)::boolean;
    option overload;
    uses LinearAlgebra:
    if evalb(RowDimension(A) = 0 or ColumnDimension(A) = 0) then
      true:
    else
      false:
    end if:
  end proc,

  proc (V::Vector)::boolean;
    option overload;
    uses LinearAlgebra:
    if evalb(Dimension(V) = 0) then
      true:
    else
      false:
    end if
  end proc
]):

Algorithm 22. An algorithm to determine whether a matrix or vector is empty in Maple. Notice this is an overloaded method, allowing us to pass in Vectors or Matrices.

Remark 10.47. We can use the previous lemmas to develop an algorithm for solving convex quadratic programming problems. Before proceeding, we state the following theorem, which we do not prove. See [Sah74].

Theorem 10.48. Let (1) Q ∈ R^{n×n}, (2) A ∈ R^{m×n}, (3) H ∈ R^{l×n}, (4) b ∈ R^{m×1}, (5) r ∈ R^{l×1} and (6) c ∈ R^{n×1}. Then finding a solution for the problem:

(10.75)   max  ½ x^T Q x + c^T x
          s.t. Ax ≤ b
               Hx = r

is NP-complete for arbitrary parameters. ∎

Remark 10.49. A general algorithm for solving arbitrary quadratic programming problems is given in [BSS06]. It is, essentially, a branch-and-bound style algorithm, befitting the fact that the problem is NP-complete.

SolveEQQP := proc (c::Vector, Q::Matrix, A::Matrix, b::Vector)::list;
  uses LinearAlgebra:
  local M, r, n, m, y;
  n := RowDimension(Q);
  m := RowDimension(A);
  assert(n = ColumnDimension(A));
  assert(n = Dimension(c));
  assert(m = Dimension(b));

  if IsEmpty(A) then
    M := Q:
    r := -c:
  else
    #Construct the augmented matrix and vector of Equation 10.69.
    M := <<Q | -Transpose(A)>; <A | ZeroMatrix(m, m)>>:
    r := <-c; b>
  end if;

  #More efficient than inverting.
  y := LinearSolve(M, r);
  if IsEmpty(A) then
    #Return the solution; there are no Lagrange multipliers.
    [y, []]:
  else
    #Return the solution and the Lagrange multipliers in a list.
    [SubVector(y, [1 .. n]), SubVector(y, [n+1 .. n+m])]
  end if:
end proc:

Algorithm 23. An algorithm to solve equality constrained quadratic programming problems, under the foregoing assumptions. Note we do not test for full row rank.
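Assuming IsEmpty and SolveEQQP are loaded as above, the same small instance used earlier can be passed through the routine directly; this usage check (ours, not from the text) should reproduce the answer obtained from the raw KKT system:

Q := Matrix([[-1, 0], [0, -1]]):
c := Vector([1, 1]):
A := Matrix([[1, 1]]):
b := Vector([1]):
SOLN := SolveEQQP(c, Q, A, b);
# SOLN[1] is the maximizer (1/2, 1/2); SOLN[2] holds the multiplier lambda = 1/2.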
Remark 10.50. We now consider an active set method for solving a convex quadratic programming problem, which will be used in the next chapter when developing sequential quadratic programming.

Remark 10.51. The algorithm assumes a problem with the following structure:

(10.76)   QP:  max  ½ x^T Q x + c^T x
               s.t. a_i^T x ≤ b_i,   i ∈ I
                    a_i^T x = b_i,   i ∈ E

Obviously, this is identical to the formulation of the problem in (10.75), except we have index sets I and E for the inequality and equality constraints, respectively. Furthermore, we will assume that Q is negative definite.

Lemma 10.52. Suppose that x_k is a feasible solution to QP. Let

W_k = {i ∈ I : a_i^T x_k = b_i} ∪ E

Consider the problem:

(10.77)   QP(x_k):  max  p^T (c + Q x_k) + ½ p^T Q p
                    s.t. a_i^T p = 0,   i ∈ W_k

If p_k solves this problem, then x_k + p_k solves:

(10.78)   max  c^T x + ½ x^T Q x
          s.t. a_i^T x = b_i,   i ∈ W_k

Proof. By Lemmas 10.44 and 10.45 and our assumption on Q, each of the problems in the statement of the lemma has a unique solution. Suppose p_k is the solution to the first problem. Since a_i^T p_k = 0 for all i ∈ W_k and a_i^T x_k = b_i for all i ∈ W_k, it follows that a_i^T (x_k + p_k) = b_i for all i ∈ W_k, and thus x_k + p_k is feasible to the second problem. Note that:

(10.79)   c^T (x_k + p_k) + ½ (x_k + p_k)^T Q (x_k + p_k) = z(x_k) + c^T p_k + ½ p_k^T Q p_k + p_k^T Q x_k

just as in Equation 10.71, where again z(x) = c^T x + ½ x^T Q x. Thus:

(10.80)   c^T (x_k + p_k) + ½ (x_k + p_k)^T Q (x_k + p_k) = z(x_k) + p_k^T (c + Q x_k) + ½ p_k^T Q p_k

By (10.80), the objectives of the two problems differ only by the constant z(x_k), so their orderings agree. Since p_k was chosen to maximize p^T (c + Q x_k) + ½ p^T Q p subject to a_i^T p = 0 for all i ∈ W_k, it follows that x_k + p_k must be the unique solution to the second problem; otherwise there would be a feasible x′ for the second problem with a larger objective value, and p′_k = x′ − x_k would satisfy a_i^T p′_k = 0 for all i ∈ W_k while producing a larger objective value in the first problem, contradicting the optimality of p_k. This completes the proof. ∎

Remark 10.53. Note that the sub-problem:

(10.81)   QP(x_k):  max  p^T (c + Q x_k) + ½ p^T Q p
                    s.t. a_i^T p = 0,   i ∈ W_k

can be solved using simple linear algebra, as illustrated in Lemma 10.45, to obtain not just p_k but also a set of Lagrange multipliers.

Lemma 10.54. Suppose that x_k is a feasible solution to QP. Let I_k = {i ∈ I : a_i^T x_k = b_i}, so that (as in Lemma 10.52) W_k = I_k ∪ E. Let λ be the Lagrange multipliers obtained for the problem QP(x_k) when finding the optimal solution p_k. If p_k = 0 and λ_i ≥ 0 for all i ∈ I_k, then x_k is the optimal solution to QP.

Exercise 81. Prove Lemma 10.54. [Hint: Clearly, λ can act as Lagrange multipliers for all the constraints indexed in W_k. Define the remaining Lagrange multipliers, for constraints with index not in W_k, to be zero, and argue that the KKT conditions for the whole problem are satisfied because they are satisfied for the problem max c^T x + ½ x^T Q x subject to a_i^T x = b_i for all i ∈ W_k.]

Lemma 10.55. Suppose that p_k ≠ 0. If x_k + p_k is feasible to QP, set x_{k+1} = x_k + p_k; otherwise, using a minimum ratio test, set:

(10.82)   α_k = min { (b_q − a_q^T x_k) / (a_q^T p_k) : q ∈ I \ I_k, a_q^T p_k > 0 }

and define x_{k+1} = x_k + α_k p_k. Then x_{k+1} is feasible to QP, z(x_{k+1}) > z(x_k), and I_{k+1} = I_k ∪ {q*}, where q* is the index identified in the minimum ratio test of Expression 10.82.

Proof. In the case when x_{k+1} = x_k + p_k is feasible to QP, the result follows immediately from Lemma 10.52. Suppose, therefore, that x_k + p_k is not feasible to QP. To prove feasibility of x_{k+1}, observe that a_i^T x_{k+1} = b_i for all i ∈ W_k by the proof of Lemma 10.52. It remains to ensure that:

(10.83)   a_i^T (x_k + α_k p_k) ≤ b_i,   i ∈ I \ I_k

For those i ∈ I \ I_k with a_i^T p_k > 0, this requires:

(10.84)   α_k ≤ (b_i − a_i^T x_k) / (a_i^T p_k)

Clearly Expression 10.83 can be made true by choosing α_k via the minimum ratio test given in Expression 10.82; if instead a_i^T p_k ≤ 0, then a_i^T x_k < b_i already ensures a_i^T x_{k+1} ≤ a_i^T x_k < b_i. To see that z(x_{k+1}) > z(x_k), recall we have assumed that Q is negative definite, so the objective function is strictly concave. We know from Lemma 10.52 that z(x_k + p_k) > z(x_k) (in particular, since p_k ≠ 0). Let y_k = x_k + p_k. Then there is some λ ∈ (0, 1), namely λ = 1 − α_k, so that λ x_k + (1 − λ) y_k = x_k + α_k p_k. From this we deduce that:

(10.85)   z(x_{k+1}) = z(x_k + α_k p_k) = z(λ x_k + (1 − λ) y_k)
                     > λ z(x_k) + (1 − λ) z(y_k)
                     > λ z(x_k) + (1 − λ) z(x_k) = z(x_k)

This completes the proof. ∎
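The minimum ratio test of Expression 10.82 translates directly into a loop over the non-binding inequality rows. The following standalone Maple sketch mirrors the loop used later in Algorithm 27; it assumes A, b, xk, pk and the binding index list Ik have already been defined:

# Minimum ratio test (10.82); a sketch under the stated assumptions.
alphak := 1:    # cap the step at 1 so a feasible x_k + p_k is taken whole
q := -1:
for i to RowDimension(A) do
  if evalb(not i in Ik) then
    Ap := Row(A, i) . pk;    # a_i^T p_k
    if 0 < Ap then           # only rows we move toward can become binding
      alpha := (b[i] - Row(A, i) . xk)/Ap;
      if alpha < alphak then
        alphak := alpha;
        q := i:              # q records the newly binding index q*
      end if;
    end if;
  end if;
end do:
xk1 := xk + alphak*pk:       # the next iterate x_{k+1}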
Remark 10.56. We are now in a position to sketch the active set algorithm for solving convex quadratic programming problems.

(1) Initialize with a feasible point x_0, derived possibly from a Phase I problem on the linear constraints. Set k = 0. Compute I_k and W_k.
(2) Solve QP(x_k) to obtain p_k and λ^k.
(3) If p_k = 0 and λ^k_i ≥ 0 for all i ∈ I_k, then stop; x_k is optimal.
(4) If p_k = 0 and λ^k_i < 0 for some i ∈ I_k, then set i* = arg min{λ^k_i : i ∈ I_k}. Remove i* from I_k to obtain I_{k+1} and set W_{k+1} = W_k \ {i*}. Set x_{k+1} = x_k. Set k = k + 1. Goto (2).
(5) If p_k ≠ 0 and x_k + p_k is feasible, set x_{k+1} = x_k + p_k. Compute I_{k+1} and W_{k+1}. Set k = k + 1. Goto (2).
(6) If p_k ≠ 0 and x_k + p_k is not feasible, then compute:

α_k = min { (b_q − a_q^T x_k) / (a_q^T p_k) : q ∈ I \ I_k, a_q^T p_k > 0 }

Set x_{k+1} = x_k + α_k p_k. Compute I_{k+1} and W_{k+1}. Set k = k + 1. Goto (2).

Note that I_{k+1} and W_{k+1} can be computed easily as a part of the minimum ratio test. A reference implementation for the algorithm in Maple is shown in Algorithm 27.

Theorem 10.57. The algorithm described in Remark 10.56 and illustrated in Algorithm 27 converges in a finite number of steps to the optimal solution of Problem 10.76, provided that Q is negative definite (equivalently, that z(x) = ½ x^T Q x + c^T x is strictly concave).

Sketch of Proof. It suffices to show that a working (active) set W_k is never repeated. To see this, note that if p_k = 0 and we have not achieved optimality, then a binding constraint is removed from W_k to obtain W_{k+1}. This process continues until p_k ≠ 0, at which point, by Lemma 10.55, the objective function strictly increases. Since the objective is monotonically non-decreasing, and whenever it fails to increase we force a change in W_k, it is easy to see we can never repeat a working set. Since there are only finitely many working sets, it follows that the algorithm must converge after a finite number of iterations to a solution with λ^k_i ≥ 0 for all i ∈ I_k. By Lemmas 10.52 and 10.54, it follows this must be an optimal solution for Problem 10.76. ∎

Remark 10.58. The “proof” for the convergence of the active set method given in [NW06] is slightly more complete than the sketch of the proof given above, though the major elements of the proof are identical.

Remark 10.59. The point of introducing this specialized quadratic programming algorithm is that it can be used very efficiently in implementing sequential quadratic programming (SQP) because, through judicious use of the BFGS method, we can ensure that we always have a convex quadratic sub-problem. The SQP method is a generally convergent approach for solving constrained non-linear programming problems that enjoys substantial commercial success. (Matlab's fmincon and Maple's Minimize functions both use it.) We introduce this technique in the next chapter.

Remark 10.60. We end this chapter by providing a Maple implementation of the active set method for solving convex quadratic programming problems. This algorithm is long and requires sub-routines for finding an initial point, determining the index of the minimum value in a vector, and determining the (initial) set of binding constraints I_0. The actual routine for solving quadratic programs is shown in Algorithm 27.
Example 10.61. Consider the following problem (written with a concave objective, consistent with Q below):

(10.86)   max  −½ (x − 1)² − ½ (y − 1)²
          s.t. x + y ≤ 1
               2x + y ≥ 1
               x, y ≥ ¼

MinimumIndex := proc (V::Vector)::integer;
  uses LinearAlgebra:
  local leastValue, index, i;
  leastValue := Float(infinity);
  for i to Dimension(V) do
    if evalb(V[i] < leastValue) then
      leastValue := V[i];
      index := i
    end if:
  end do;
  index:
end proc:

Algorithm 24. An algorithm to determine the index of the minimum value in a Vector in Maple.

FindInitialSolution := proc (A::Matrix, b::Vector, Aeq::Matrix, beq::Vector)::Vector;
  uses LinearAlgebra, Optimization:
  local N, c;
  N := max(ColumnDimension(A), ColumnDimension(Aeq));
  c := Vector([seq(0, i = 1 .. N)]);
  if ColumnDimension(A) = 0 and ColumnDimension(Aeq) = 0 then
    #Return c; it doesn't matter what you return. There are no constraints.
    c:
  elif ColumnDimension(A) = 0 then
    LPSolve(c, [NoUserValue, NoUserValue, Aeq, beq])[2]
  elif ColumnDimension(Aeq) = 0 then
    LPSolve(c, [A, b])[2]
  else
    LPSolve(c, [A, b, Aeq, beq])[2]
  end if:
end proc:

Algorithm 25. A simple routine for initializing the active set quadratic programming method.

If we expand this problem out, constructing the appropriate matrices, we see that:

Q = [ −1   0 ]      c = [ 1 ]
    [  0  −1 ]          [ 1 ]

ReturnWorkingSet := proc (A::Matrix, b::Vector, x::Vector)::list;
  uses LinearAlgebra:
  local m, i, Ik;
  m := RowDimension(A);
  Ik := [];
  #Note: <> is "not equal."
  if evalb(m <> 0) then
    assert(Dimension(x) = ColumnDimension(A));
    assert(m = Dimension(b));
    for i to m do
      if evalb(Row(A, i) . x = b[i]) then
        #Append any binding rows to Ik.
        Ik := [op(Ik), i]:
      end if:
    end do:
  end if:
  Ik:
end proc:

Algorithm 26. A routine to determine the binding constraints among the inequality constraints.
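As a quick sanity check of the helper (on a tiny hypothetical system of our own, not from the text), ReturnWorkingSet should report exactly the rows of A that hold with equality at the given point:

A := Matrix([[1, 1], [1, 0]]):
b := Vector([1, 0]):
x := Vector([0, 1]):
ReturnWorkingSet(A, b, x);   # both rows bind at (0, 1), so this returns [1, 2]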
SolveConvexQP := proc (c::Vector, Q::Matrix, A::Matrix, b::Vector, Aeq::Matrix, beq::Vector)::list;
  uses LinearAlgebra:
  local xnow, bDone, M, Ik, OUT, r, n, eps, alpha, alphak, cc, rebuildM,
        Ap, i, p, SOLN, lambda, pos, q, pA, pAeq;

  eps := 1/1000000;
  #Initialize with linear programming!
  xnow := FindInitialSolution(A, b, Aeq, beq);
  n := Dimension(xnow);

  #A flag that tells us whether we have to rebuild the matrix M.
  rebuildM := true;
  #Find the working (active) set. Only called once.
  Ik := ReturnWorkingSet(A, b, xnow);
  bDone := false;

  #Return list...we return more than is necessary.
  #Specifically, the path we take.
  OUT := [];

  while evalb(not bDone) do
    OUT := [op(OUT), convert(xnow, list)];
    if rebuildM then
      #Rebuild the matrix that gets passed to the sub-QP problem.
      if nops(Ik) <> 0 and evalb(not IsEmpty(Aeq)) then
        #Create an augmented matrix...these are the binding constraints.
        #Some of the inequality constraints are binding.
        M := <SubMatrix(A, Ik, [1 .. n]); Aeq>:
      elif evalb(not IsEmpty(Aeq)) then
        M := Aeq
      elif nops(Ik) <> 0 then
        M := SubMatrix(A, Ik, [1 .. n])
      else
        M := Matrix(0, 0)
      end if;
    end if;
    cc := c + Q.xnow;
    r := Vector([seq(0, i = 1 .. RowDimension(M))]);
    #Here's where we solve the sub-problem.
    SOLN := SolveEQQP(cc, Q, M, r);
    p := SOLN[1];       #The direction to move.
    lambda := SOLN[2];  #The Lagrange multipliers.

    if Norm(p) < eps then
      if nops(Ik) = 0 then
        bDone := true
      else
        pos := MinimumIndex(SubVector(lambda, [1 .. nops(Ik)]));
        if 0 <= lambda[pos] then
          #We're done! All Lagrange multipliers that need to be
          #non-negative are.
          bDone := true
        else
          #Delete the row corresponding to the "most offending" Lagrange
          #multiplier.
          M := DeleteRow(M, pos);
          #The next line is a little tortured...it removes the right
          #row number from Ik.
          Ik := convert((convert(Ik, set) minus convert({Ik[pos]}, set)), list);
          rebuildM := false;
        end if
      end if
    else
      alphak := 1;
      q := -1;
      for i to RowDimension(A) do
        if evalb(not i in Ik) then
          Ap := Row(A, i).p;
          if 0 < Ap then
            #Otherwise, it doesn't matter.
            alpha := (b[i] - Row(A, i).xnow)/Ap;
            if alpha < alphak then
              alphak := alpha;
              q := i
            end if;
          end if;
        end if;
      end do;
      if q <> -1 and alphak <= 1 then
        Ik := [op(Ik), q];  #A new constraint becomes binding.
        rebuildM := true:
      else
        alphak := 1:        #No new constraints become binding.
      end if;

      xnow := xnow + alphak*p:
    end if:
  end do;
  #We return two vectors of dual variables: one for the inequality constraints
  #and one for the equality constraints.
  pA := Vector(RowDimension(A));
  pAeq := Vector(RowDimension(Aeq));
  for i to nops(Ik) do
    pA[Ik[i]] := lambda[i]:
  end do;
  for i from nops(Ik)+1 to Dimension(lambda) do
    pAeq[i-nops(Ik)] := lambda[i]:
  end do;
  OUT := [[op(OUT), convert(xnow, list)], xnow, pA, pAeq]:
end proc:

Algorithm 27. The active set convex quadratic programming algorithm. The algorithm uses all the sub-routines developed for quadratic programming.

Our matrices are:

A = [  1   1 ]      b = [   1  ]
    [ −2  −1 ]          [  −1  ]
    [ −1   0 ]          [ −1/4 ]
    [  0  −1 ]          [ −1/4 ]

Aeq = []      beq = []

Since we have no equality constraints, we use empty matrices for these terms. Passing this into our optimization algorithm yields an optimal solution of x = y = ½.
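The call itself looks roughly as follows; we are assuming here that the empty equality data is passed as a 0 × 0 Matrix and an empty Vector, matching the IsEmpty convention above (the text does not show the exact invocation):

Q := Matrix([[-1, 0], [0, -1]]):
c := Vector([1, 1]):
A := Matrix([[1, 1], [-2, -1], [-1, 0], [0, -1]]):
b := Vector([1, -1, -1/4, -1/4]):
Aeq := Matrix(0, 0):  # no equality constraints (assumed convention)
beq := Vector([]):    # empty Vector (assumed convention)
SOLN := SolveConvexQP(c, Q, A, b, Aeq, beq);
# SOLN[2] should be the optimizer (1/2, 1/2); SOLN[1] records the path taken,
# and SOLN[3], SOLN[4] hold the dual variables for the two constraint blocks.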
The path the algorithm takes is shown in Figure 10.5.

Figure 10.5. The path taken when solving the proposed quadratic programming problem using the active set method. Notice we tend to hug the outside of the polyhedral set.

Exercise 82. Work through the steps of the algorithm by hand for the problem:

(10.87)   max  −½ (x − 1)² − ½ (y − 1)²
          s.t. x + y ≤ 1
               x, y ≥ 0

Remark 10.62. As we noted in the previous example, the active set method still tends to hug the boundary of the polyhedral constraint set. Interior point methods are known to converge more quickly. We will discuss these methods for both linear and quadratic programs in the next chapter, after we introduce barrier methods.

CHAPTER 11

Penalty and Barrier Methods, Sequential Quadratic Programming, Interior Point Methods

For simplicity, in this chapter we will not study maximization problems, since penalty functions are more easily understood in terms of minimization (i.e., we will minimize a penalty). Thus, we will focus on the following problem:

(11.1)   min  f(x_1, …, x_n)
         s.t. g_1(x_1, …, x_n) ≤ 0
              ⋮
              g_m(x_1, …, x_n) ≤ 0
              h_1(x_1, …, x_n) = 0
              ⋮
              h_l(x_1, …, x_n) = 0

With a little work, it is easy to see that this problem has as its dual feasibility and complementary slackness conditions:

(11.2)   ∇f(x*) + Σ_{i=1}^m λ_i ∇g_i(x*) + Σ_{j=1}^l μ_j ∇h_j(x*) = 0
         λ_i g_i(x*) = 0,   i = 1, …, m
         λ_i ≥ 0,   i = 1, …, m
         μ_j unrestricted,   j = 1, …, l

1. Penalty Methods

Proposition 11.1. Let g : R^n → R and suppose that r : R → R is a strictly convex function with minimum at x = 0 and r(0) = 0. Then:

(11.3)   p_I(x) = r(max{0, g(x)})

is 0 if and only if x satisfies g(x) ≤ 0. Furthermore,

(11.4)   p_E(x) = r(g(x))

is 0 if and only if g(x) = 0.

Exercise 83. Prove Proposition 11.1.

Definition 11.2. Let g : R^n → R and suppose that r : R → R is a strictly convex function with minimum at x = 0 and r(0) = 0. For a constraint of type g(x) ≤ 0, p_I(x) defined as in Equation 11.3 is an inequality penalty function. For a constraint of type g(x) = 0, p_E(x) defined as in Equation 11.4 is an equality penalty function.

Theorem 11.3. Consider the inequality penalty function with r(x) = x² and inequality g(x) ≤ 0. If g(x) is continuous and differentiable, then p_I(x) is differentiable with derivative:

(11.5)   ∂p_I(x)/∂x_i = 2 g(x) ∂g/∂x_i   if g(x) > 0
                      = 0                 otherwise (i.e., if g(x) ≤ 0)

Exercise 84. Prove Theorem 11.3.

Remark 11.4. We can use penalty functions to transform constrained optimization problems into unconstrained optimization problems. To see this, consider a simple version of Problem 11.1:

min  f(x_1, …, x_n)
s.t. g(x_1, …, x_n) ≤ 0
     h(x_1, …, x_n) = 0

Then the penalized optimization problem is:

(11.6)   min  f_P(x_1, …, x_n) = f(x_1, …, x_n) + λ_I p_I(x_1, …, x_n) + λ_E p_E(x_1, …, x_n)

where λ_I, λ_E ∈ R_+. If we choose r(x) = x² (e.g.), then as long as the constraints and objective in the original problem are differentiable, f_P(x) is differentiable and we can apply unconstrained optimization methods that use derivatives. Otherwise, we must use non-differentiable methods.

Remark 11.5. In a penalty function method, if λ_I or λ_E are chosen too large, then the penalty functions will overwhelm the objective function. If these values are chosen too small, then the resulting optimal value x* for f_P may not be feasible.
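To see the trade-off in Remark 11.5 concretely, here is a minimal Maple sketch (a toy instance of our own, not from the text) of the penalized problem (11.6) with r(x) = x², applied to min (x − 2)² subject to x ≤ 1:

f := x -> (x - 2)^2:
g := x -> x - 1:                  # g(x) <= 0 encodes the constraint x <= 1
pI := x -> max(0, g(x))^2:        # inequality penalty (11.3) with r(t) = t^2
lambdaI := 10:                    # penalty weight (chosen for illustration)
fP := x -> f(x) + lambdaI*pI(x):  # penalized objective as in (11.6)
# For x > 1 the penalty is active and fP(x) = (x-2)^2 + lambdaI*(x-1)^2;
# its stationary point is x = (2 + lambdaI)/(1 + lambdaI).
solve(diff(f(x) + lambdaI*g(x)^2, x) = 0, x);  # returns 12/11: slightly infeasible

The minimizer of f_P sits just outside the feasible region and approaches x = 1 as λ_I grows, while a very large λ_I makes the penalty term dominate f, illustrating both halves of Remark 11.5.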
Example 11.6. Consider the following simple

2. Sequential Quadratic Programming

3. Barrier Methods

4. Interior Point Simplex as a Barrier Method

5. Interior Point Methods for Quadratic Programs

Bibliography

[Ant94] H. Anton, Elementary linear algebra, 7 ed., John Wiley and Sons, 1994.
[Avr03] M. Avriel, Nonlinear programming: Analysis and methods, Dover Press, 2003.
[Ber99] D. P. Bertsekas, Nonlinear programming, 2 ed., Athena Scientific, 1999.
[BJS04] M. S. Bazaraa, J. J. Jarvis, and H. D. Sherali, Linear programming and network flows, Wiley-Interscience, 2004.
[BSS06] M. S. Bazaraa, H. D. Sherali, and C. M. Shetty, Nonlinear programming: Theory and algorithms, John Wiley and Sons, 2006.
[Dem97] J. W. Demmel, Applied numerical linear algebra, SIAM Publishing, 1997.
[Ger31] S. Gerschgorin, Über die Abgrenzung der Eigenwerte einer Matrix, Izv. Akad. Nauk. USSR Otd. Fiz.-Mat. Nauk 7 (1931), 749–754.
[GR01] C. Godsil and G. Royle, Algebraic graph theory, Springer, 2001.
[Gri11] C. Griffin, Linear programming: Penn State Math 484 lecture notes (v 1.8), http://www.personal.psu.edu/cxg286/Math484_V1.pdf, 2010–2011.
[HJ61] R. Hooke and T. A. Jeeves, Direct search solution of numerical and statistical problems, J. ACM 8 (1961), 212–229.
[MT03] J. E. Marsden and A. Tromba, Vector calculus, 5 ed., W. H. Freeman, 2003.
[Mun00] J. Munkres, Topology, Prentice Hall, 2000.
[NL89] D. C. Liu and J. Nocedal, On the limited memory BFGS method for large scale optimization, Math. Program. Ser. B 45 (1989), no. 3, 503–528.
[NW06] J. Nocedal and S. Wright, Numerical optimization, Springer, 2006.
[PTVF07] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical recipes 3rd edition: The art of scientific computing, 3 ed., Cambridge University Press, 2007.
[Rud76] W. Rudin, Principles of mathematical analysis, McGraw-Hill, 1976.
[Sah74] S. Sahni, Computationally related problems, SIAM J. Comput. 3 (1974), 262–279.
[WN99] L. A. Wolsey and G. L. Nemhauser, Integer and combinatorial optimization, Wiley-Interscience, 1999.
[ZBLN97] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization, ACM Trans. Math. Software 23 (1997), no. 4, 550–560.