An Optimization Primer

Download as pdf or txt
Download as pdf or txt
You are on page 1of 149
At a glance
Powered by AI
The document provides an introduction to optimization techniques for large scale problems arising in various fields like engineering, economics, etc.

The goal is to prepare readers to deal with a wide variety of applications involving optimal resource allocation and best estimates/fits.

The three main tracks covered are modeling, theoretical foundations, and some numerical experimentation.

AN OPTIMIZATION PRIMER

An Introduction to Linear, Nonlinear,


Large Scale, Stochastic Programming
and Variational Analysis
Roger J-B Wets
University of California, Davis

Graphics by Maria E. Wets

AMS Classification:
Date:

August 26, 2005

90C15, 90xxx, 49J99

PROLOGUE
The primordial objective of these lectures is to prepare the reader to deal
with a wide variety of applications that include optimal allocation of (limited) resources, finding best estimates or best-fits, etc. Optimization problems of this type arise in almost all areas of human industry: engineering,
economics, agriculture, logistics, ecology, finance, information and communication technology, and so on. Because the solution may have to respond
to an evolutionary system, in time and/or space, or must take into account
uncertainty about some of the problems data, we usually end up with having to solve a large scale optimization problems and this, to a large extent,
conditions our overall approach. Consequently, the layout of the material
doesnt follow the pattern of more traditional introduction to optimization
textbooks.
The main thrust wont be on the detailed analysis of specific algorithms,
but on setting up the tools to deal with these large scale applications. This
doesnt mean, that eventually, we wont describe, justify and even establish
convergence of some basic algorithmic procedures. To achieve our goals, we
proceed, more or less, on three parallel tracks:
(i) modeling,
(ii) theoretical foundations that will allow us to analyze the properties of
solutions as well as hone our modeling skills to help us build stable,
easier to solve optimization problems, and
(iii) some numerical experimentation that will highlight some of the difficulties inherent in numerical implementation, but mostly to illustrate the
use of elementary algorithmic procedures as building blocks of more
sophisticated solution schemes.
The lectures are designed to serve as an introduction to the field of optimization for students who have a background roughly equivalent to a bachelors degree in science, engineering or mathematics. More specifically, its

ii
expected that the reader has a good foundation in Differential Calculus,
Linear Algebra, and is familiar with the abstract notion of function1 . The
presentation also includes an introduction to a plain version of Probability
that will enable the non-initiated reader to follow the sections dealing with
stochastic programming.
A novel feature of this book is that decision making under uncertaintymodels are an integral part of the exposition. There are two basic reasons
for this. Stochastic optimization motivates the study of linear, non-linear
optimization, large scale optimization, non-differentiable functions and variational analysis. But, more significantly, given our concern with intelligent
modeling, one is bound to realize that very few important decision problems
do not involve some level of uncertainty about some of their parameters. Its
thus imperative, that from the outset, the reader be aware of the potential
pitfalls of simplifying a model so as to skirt uncertainty. Nonetheless, its
possible to skip the chapters or sections dealing with stochastic programming
without compromising the continuity of the presentation, but not without being shortchanged on the insight that comes from including uncertainty in the
modeling of optimization models.
There are 16 chapters, each one corresponds to what could be covered
in about a weeks lectures (three to four hours). Proofs, constructive in nature whenever possible, have been provided so that (i) an instructor doesnt
have to go through them in meticulous detail but can limit the discussion
to the main idea(s) accompanied with some relevant examples and counterexamples, and (ii) so that the argumentation can serve as guide to solving
the exercises. The theoretical side comes with almost no compromises, but
there are a few rare exceptions that would have required lengthy mathematical detours that are not germane to the subject at hand, and are more
appropriately dealt with in other texts or lectures.

Numerical software
As already mentioned earlier, although we end up describing a significant
number of algorithmic procedures, we dont concern ourselves directly with
1

An appendix provides a review of some standard notation and terminology as well as


some basic results in analysis that might not be familiar to an heterogeneous student-body,
a typical situation for such a course.

iii
implementation issues. This is best dealt with in specialized courses and textbooks such as [16, 10, 20]. For example, in the case of linear programming,
there is a description of both the simplex and the interior point methods
in Chapters 7 and 6, but from the outset on, its assumed that packages to
solve mathematical programs, of various types, including linear programs,
are available (C-Plex, IBM Solutions, LOQO, . . . ). To allow for some experimentation with these solution procedures, its assumed that the reader
has access to Matlab2 , in particular to the functions found in the Matlab
Optimization Toolbox and will be able to use them to solve the numerical
exercises. These Matlab functionalities were used to solve the examples, and
in a number of instances, the corresponding m-file has been supplied.

Matlab is distributed by The MathWorks, Inc.

iv

Contents
1 PRELUDE
1.1 Mathematical curtain rise . . . . . . .
1.2 Curve fitting I . . . . . . . . . . . . . .
1.3 Steepest Descent and Newton methods
1.4 The Quasi-Newton methods . . . . . .
1.5 Integral functionals . . . . . . . . . . .
1.6 In conclusion . . . . . . . . . . . . . . .
2 FORMULATION
2.1 A product mix problem . . . . . . . . .
2.2 Curve fitting II . . . . . . . . . . . . .
2.3 A network capacity expansion problem
2.4 Discrete decision variables . . . . . . .
2.5 The Broadway producer problem . . .
3 PRELIMINARIES
3.1 Variational analysis I . . . . . . . .
3.2 Variational analysis II . . . . . . .
3.3 Plain probability distributions . . .
3.4 Expectation functionals I . . . . . .
3.5 Analysis of the producers problem
4 LINEAR CONSTRAINTS
4.1 Linearly constrained programs
4.2 Variational analysis III . . . .
4.3 Variational analysis IV . . . .
4.4 Lagrange multipliers . . . . .
v

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.

.
.
.
.
.
.

1
1
4
6
14
17
21

.
.
.
.
.

23
23
29
35
40
42

.
.
.
.
.

47
47
58
68
75
79

.
.
.
.

83
85
87
94
98

vi

CONTENTS
4.5 Karush-Kuhn-Tucker conditions I . . . . . . . . . . . . . . . . 101

5 SIMPLE RECOURSE: RHS


5.1 Random right hand sides . . .
5.2 Aircraft allocation to routes I
5.3 Separable simple recourse I . .
5.4 Aircraft allocation to routes II
5.5 Approximations I . . . . . . .

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

.
.
.
.
.

6 LAGRANGIANS
6.1 Saddle functions . . . . . . . . . . . . . .
6.2 Primal and Dual problems . . . . . . . .
6.3 A primal-dual interior-point method . .
6.4 Monitoring functions . . . . . . . . . . .
6.5 Lake Stoopt I . . . . . . . . . . . . . . .
6.6 Separable simple recourse II . . . . . . .
6.7 The Lagrangian finite generation method
6.8 Lake Stoopt II . . . . . . . . . . . . . . .
7 POLYHEDRAL CONVEXITY
7.1 Polyhedral sets . . . . . . . .
7.2 Full duality: linear programs .
7.3 Variational analysis V . . . .
7.4 The simplex method . . . . .

.
.
.
.

.
.
.
.

.
.
.
.

8 OPTIMALITY & DUALITY


8.1 Variational analysis VI . . . . . . .
8.2 Separation Theorems . . . . . . . .
8.3 Variational analyis VIII . . . . . .
8.4 Karush-Kuhn-Tucker conditions II .
8.5 Variational analysis VII: Conjugacy
8.6 General duality theory . . . . . . .
8.7 Geometric programming . . . . . .
8.8 Semi-definite programming . . . . .

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.

107
. 108
. 111
. 114
. 119
. 122

.
.
.
.
.
.
.
.

127
. 127
. 131
. 134
. 138
. 143
. 145
. 147
. 152

.
.
.
.

157
. 157
. 165
. 167
. 172

.
.
.
.
.
.
.
.

183
. 183
. 188
. 193
. 197
. 200
. 205
. 210
. 210

CONTENTS

vii

9 LINEAR RECOURSE
9.1 Fixed recourse and fixed costs . . . . . .
9.2 A Manufacturing model . . . . . . . . .
9.3 Feasibility . . . . . . . . . . . . . . . . .
9.4 Stochastic linear programs with recourse
9.5 Optimality conditions . . . . . . . . . . .
9.6 Network capacity expansion II . . . . . .
9.7 Preprocessing . . . . . . . . . . . . . . .
9.8 A summary . . . . . . . . . . . . . . . .
9.9 Practical probability II . . . . . . . . . .
9.10 Expectation functionals II . . . . . . . .
9.11 Disintegration Principle . . . . . . . . .
9.12 Stochastic programming: duality . . . .

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

211
. 212
. 213
. 215
. 221
. 222
. 223
. 224
. 224
. 225
. 227
. 229
. 229

10 DECOMPOSITION
10.1 Lagrangian relaxation . . . . . . . .
10.2 Sequential linear programming . . .
10.3 The L-shaped method . . . . . . .
10.4 Dantzig-Wolfe decomposition . . .
10.5 An optimal control problem . . . .
10.6 A targetting problem . . . . . . . .
10.7 Linear-quadratic control models . .
10.8 A hydro-power generation problem

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

255
. 255
. 255
. 258
. 258
. 258
. 258
. 258
. 258
. 258
. 258
. 258
. 258

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

11 APPROXIMAION THEORY
11.1 Formulation . . . . . . . . . . . . . . . . . . . . .
11.2 Epi-convergence . . . . . . . . . . . . . . . . . . .
11.3 Barrier & Penalty methods, ecaxt? . . . . . . . .
11.4 Infinite dimensional theory . . . . . . . . . . . . .
11.5 Approximation of control problems . . . . . . . .
11.6 Approximation of stochastic programs . . . . . .
11.7 Approximation of statistical estimation problems
11.8 Augmented Lagrangians . . . . . . . . . . . . . .
11.9 Variational Analysis ?? . . . . . . . . . . . . . . .
11.10Proximal point algorithm . . . . . . . . . . . . . .
11.11Method of multipliers: equalities . . . . . . . . . .
11.12Method of multipliers: inequalities . . . . . . . .

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.
.
.

231
231
231
235
242
247
254
254
254

viii

CONTENTS
11.13Application to engineering design . . . . . . . . . . . . . . . . 258

12 NONLINEAR OPTIMIZATION
12.1 Statistical estimation: An introduction
12.1.1 The discrete case . . . . . . . .
12.2 Statistical estimation: parametric . . .
12.3 Satistical estimation: non-parametric .
12.4 Non-convex optimization . . . . . . . .
12.5 KKT-optimality conditions . . . . . . .
12.6 Sequential quadratic programming . .
12.7 Trust regions . . . . . . . . . . . . . .
13 EQUILIBIRUM PROBLEMS
13.1 Convex-type equilibrium problems .
13.2 Variational inequalities . . . . . . . .
13.3 Monotone Operators . . . . . . . . .
13.4 Complementarity problem . . . . . .
13.5 Application in Mechanics . . . . . . .
13.6 Pricing an American option . . . . .
13.7 Market Equilibrium: Walras . . . . .
13.8 Application to traffic, transportation
13.9 Non-cooperative games: Nash . . . .
13.10Energy? Communications pricing . .

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

14 NON-DIFFERENTIAL OPTIMIZATION
14.1 Bundle method . . . . . . . . . . . . . . . .
14.2 Example of the bundle method . . . . . . .
14.3 Stochastic quasi-gradient method . . . . . .
14.4 Application: urn problem . . . . . . . . . .
14.5 Sampled gradient (Burke, Lewis & Overton)
14.6 Eigenvalues calculations . . . . . . . . . . .
15 DYNAMIC PROBLEMS
15.1 Optimal control problems . . .
15.2 Hamiltonian, Pontryagins . . .
15.3 Polaks minmax approach . . .
15.4 Multistage stochastic programs

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.

.
.
.
.

.
.
.
.
.
.
.
.

259
. 259
. 262
. 270
. 274
. 281
. 282
. 282
. 287

.
.
.
.
.
.
.
.
.
.

289
. 289
. 289
. 289
. 289
. 289
. 289
. 289
. 289
. 289
. 289

.
.
.
.
.
.

291
. 291
. 291
. 291
. 291
. 291
. 291

.
.
.
.

293
. 293
. 293
. 293
. 293

CONTENTS
15.5
15.6
15.7
15.8

Progressive hedging algorithm . . .


Water reservoirs management . . .
Linear-quadratic stochastic control
Pricing a contingent claim . . . . .

ix
.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

.
.
.
.

16 TOPICS IN STOCHASTIC PROGRAMMING


16.1 The distribution problem . . . . . . . . . . . . . .
16.2 Application to robotics . . . . . . . . . . . . . . .
16.3 Chance constraints . . . . . . . . . . . . . . . . .
16.4 Reliability of networks . . . . . . . . . . . . . . .
16.5 Risk measures . . . . . . . . . . . . . . . . . . . .
16.6 Modeling decision making under uncertainty . . .

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.
.
.
.
.
.
.

.
.
.
.

.
.
.
.

293
293
293
293

.
.
.
.
.
.

295
. 295
. 295
. 295
. 295
. 295
. 295

A Notation and terminology


297
A.1 Existence: limits & minimizers . . . . . . . . . . . . . . . . . . 298
A.2 Function expansions . . . . . . . . . . . . . . . . . . . . . . . 299

CONTENTS

Chapter 1
PRELUDE
How simple and clear this is thought Pierre. How could I not have known
this before. War and Peace, Leonid Tolstoy.

Lets begin our journey in the classical landscape: All functions to be


minimized are smooth and there are no side constraints! The mathematical
foundations were laid down in the middle of the last millennium by two genial
mathematical dabblers. As we shall see, the rules they formulated provide
the guidelines for building an optimization theory, in the classical framework
as well as in the non-classical framework thats going to be our main concern
in this book. This chapter also covers the basic algorithmic procedures to
find, at least numerically, the minimizer(s) of smooth multivariate functions.
Although our interest will be primarily with the minimization of functions,
subject or not to constraints, defined on IRn , the last section shows that
the rules to identify minimizers are also applicable to functions defined on
infinite-dimensional spaces, like a space of arcs connecting two points, etc.

1.1

Mathematical curtain rise

The modern theory of optimization starts in the middle of the 14th Century
with Nicholas Oresme, (1323-1382), part-time mathematician and full-time
Bishop of Lisieux (France). In his treatise [15] on , he remarks that near
a minimum, the increment of a variable quantity becomes 0. A present day
1

CHAPTER 1. PRELUDE

version would read


Oresme Rule: x argmin f = df (x ; w) = 0, w IR.
where argmin f is the set of minimizers of f , i.e., the arguments that minimize
f , and
f (
x + w) f (
x)
df (
x; w) := lim
,
0

is the derivative1 of the function f : IR IR at the point x in direction w,


i.e., the limit of the incremental value of f at x, in direction w.

df(x; )

Figure 1.1: Derivative function identifying incremental changes at x.

About three centuries later, Pierre de Fermat (1601-1665), another parttime mathematician and full time ??-lawyer at the Royal Court in Toulouse
(France) while working on the long-standing tangent problem, observed that
for x to be a minimizer of a function
f , the tangent to the graph of the

function f at the point x , f (x ) must be parallel to the x-axis. In the
notation of Differential Calculus2 , one would express this as
Fermat Rule: x argmin f = f 0 (x ) = 0,
where
f 0 (
x) := lim 1 f (
x + ) f (
x)
0

called the G
ateaux derivative when its necessary to distinguish it from some alternative definitions of derivative.
2
whose development can be viewed as a continuation and a formalization of Fermats
work on the tangent problem

1.1. MATHEMATICAL CURTAIN RISE

(x*,f(x*))

x*

Figure 1.2: Horizontal tangent to the graph of a function at a minimum.

is the slope3 of the tangent at x. Implicit in the formulation of these optimality criteria is the assumption: f is smooth, i.e., continuously differentiable;
in those days, only smooth functions were considered to be of any interest.
And for smooth functions, one has
w IR :

df (
x; w) = f 0 (
x)w,

as is immediate from the definitions. Consequently, the Fermat rule can be


derived from Oresmes rule and vice verse.
To extend Oresmes rule to functions defined on IRn , again assuming
smoothness, one has to consider possible moves, or variations, not just to
the right or the left, but in every possible direction. And the rule becomes:
x argmin f = df (x ; w) = 0, w IRn ,
whereas Fermats rule now takes the form
x argmin f = f (x ) = 0.
Indeed, the slope
 of the tangent to the graph of a smooth function f at a
point x, f (
x) is given by the gradient of f at x:



f (
x),
f (
x), . . . ,
f (
x)
f (
x) =
x1
x2
xn
with the partial derivatives defined by
for i = 1, . . . , n,
3

f (
x + ei ) f (
x)

f (
x) := lim
= df (
x; ei ),

0
xi

In Differential Calculus, one usually refers to f 0 (


x) as the derivative of f at x
but we
want to reserve this term for the more malleable function df (
x, ).

CHAPTER 1. PRELUDE

where ei = (0, . . . , 1, . . . , 0) is the unit n-vector with a 1 in the ith position.


In the 1-dimensional case, one has f (
x) = f 0 (
x). Because f is smooth, it
follows from the preceding definitions that
w IRn :

df (
x; w) = hf (
x), wi =

n
X
f (
x)
j=1

xj

wj ,

and so, also for smooth functions defined on IRn , one can derive Fermats
rule from Oresmes rule and vice verse. However, the fact that one can rely
on either one of these rules to check for optimality turns out to be quite
efficient.

1.2

Curve fitting I

Given the values of an unknown function h : [0, 1] IR at a finite number


of distinct points, z1 , z2 , . . . , zL , one is interested in finding a polynomial
p : [0, 1] IR of degree n, i.e.,
p(x) = an xn + + a1 x + a0 ,
whose values at z1 , . . . , zL are as close as possible of those of h. There are a
number of ways to interpret as close as, but for now, let it have the meaning
of least squares, i.e., the sum of the squares of the distances between h(zl )
and p(zl ) is to be minimized. With a = (an , . . . , a1 , a0 ), the minimization
problem is:
L X
n
2
X
minaIRn+1
aj zlj h(zl ) .
l=1

With

and

z1n
z2n
Z=
. .
zLn

j=0

. . . z1
. . . z2
......
. . . zL

1
1

. .
1


y = h(z1 ), h(z2 ), . . . , h(zL ) ,

the least squares problem can be written as:

min hZa y, Za yi,

aIRn+1

1.2. CURVE FITTING I

or still mina f (a) = |Za y|2 , i.e., the least squares solution will minimize
the square of the norm of the error. Applying Fermats rule, we see that the
minimizer(s) a must satisfy
f (a ) = 2Z > (Za y) = 0,
or equivalently, the so-called normal equation,
Z >Za = Z > y.
If we assume, as might be expected, that n + 1 L, and recalling that
the points z1 , . . . , zL are distinct, the columns of the matrix Z are linearly
independent. Hence, Z >Z is invertible and
a = Z >Z)1 Z > y.
This is the solution calculated by the Matlab-function polyfit. In Figure
1.3, a 5th degree polynomial has been fitted to the given data points; the
plot has been obtained by polyval, another Matlab-function.
1
0.5
0
0.5
1
1.5

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

Figure 1.3: Fitting data with 5th degree polynomial

0.9

CHAPTER 1. PRELUDE

1.1 Exercise (polyfit and polyval functions). Let x = (0, 0.05, . . . , 0.95, 1)
and y = (0.95, 0.23, 0.61, 0.49, 0.89, 0.76, 0.46, 0.02, 0.82, 0.44, 0.62, 0.79, 0.92,
0.74, 0.18, 0.41, 0.94, 0.92, 0.41, 0.89, 0.06) use the functions polyfit to obtain
a polynomial fit of degree n = 1, 2, . . . , 11 and n = 21. For each n, calculate
the mean square error and plot you results so that you can visuallly see the
fit and the graph of the polynomial.
Guide. For given n, let p = polyfit(x, y, n) the coefficients of the polynomial of degree n. To graph the resulting polynomial and check the fit, use the
command: plot(z, polyval(p, z), x, y, 0 xm0 ) with z = (0.2 : 0.001 : 1.2)4 .

1.3

Steepest Descent and Newton methods

When we applied Fermats rule to the polynomial fit problem, the minimizer
could be found by solving a system of linear equations, and there are very
efficient procedures available to solve (n n)-linear systems. But, in general, the function to be minimized wont be quadratic, and consequently, the
Fermat rule may result in a system consisting of n nonlinear equations. For
example, when
f (x1 , x2 ) = (x2 x21 )2 + (1 x1 )2 ,
Fermats rule yields,
2x31 2x2 x1 + x1 = 1,

x21 + x2 = 0.

And that system is not any easier to solve than minimizing f . In fact, procedures to solve nonlinear equations and minimizing functions on IRn go hand
in hand5 . In this section, and the next, we outline algorithmic procedures
to find a point x that satisfy Fermats rule. Such points will minimize the
function f , at least locally, when the function f is locally convex; convexity
is dealt with in Chapter 3 and locally convex means that f is convex in the
neighborhood of x , say on a ball IB(x , ) with > 0.
4

In Matlab-Figure, use the Export functionality to obtain printable files, for example,
EPS-files.
5
Roughly speaking, a system of n equations, linear or nonlinear, in n variables can be
thought of as the gradient of some nonlinear function defined on IR n .

1.3. STEEPEST DESCENT AND NEWTON METHODS

1.2 Definition (local minimizers). For f : IRn IR, x is a local minimizer


if f (
x) f (x) for all x IB(
x, ) for some > 0, i.e.,
x argminxIB (x,) f (x).
A global minimizer is simply a minimizer of f on IRn .

  




































argmin
f


























 
 
 
 
 
 
 
  local minimizer
Figure 1.4: Local and global minimizers
The first step in the design of algorithmic procedures to minimize a function f is to identify directions of descent.
1.3 Lemma (direction of descent). Let f : IRn IR be smooth. Whenever
df (
x; d) < 0, d is a direction of descent for f at x, i.e.,
? > 0 such that (0, ? ) :

f (
x + d) < f (
x).

Because f is smooth at x,
every d IRn such that hf (
x), di < 0 is a direction of descent at x.
Proof. If df (
x; d) < 0, the definition of the derivative immediately implies
that for some ? > 0, f (
x + d) f (
x) < 0 for all (0, ? ). The assertion
involving the gradient simply follows from df (
x; d) = hf (
x), di, when f is
smooth.
Steepest Descent Method.
Step 0. x0 IRn , := 0.

CHAPTER 1. PRELUDE
Step 1. Stop if f (x ) = 0, otherwise, d := f (x ).


Step 2. := argmin0 f (x + d ) f (x ) .
Step 3. x+1 := x + d , + 1, go to Step 1.

1.4 Convergence (steepest descent). Suppose f : IRn IR is smooth.


Then, the Steepest Descent algorithm stops after a finite number of steps at a
point x where f (x ) = 0, or it generates a sequence of points {x , IN }
that either diverges, i.e., |x | % , or f (
x) = 0 for every cluster point x of
this sequence.
Proof. The algorithm stops only, in Step 1, when f (x ) = 0, necessarily
after a finite number of steps. Otherwise, excluding the case when the iterates
diverge, the sequence {x , IN } will have at least one cluster point, say x.
By restricting our attention to the convergent subsequence, we may as well
proceed as if x x, i.e., x is a limit point.
If f (
x) 6= 0, then d = f (
x) is a direction of descent and, by Lemma
1.3, f (
x + d) < f (
x) for all (0, ? ) for some ? > 0. Since f is smooth6 ,
f (x+1 ) f (x ) = |d |2 + o( ),
where d = f (x ), and x x implies
f (x+1 ) f (x ) 0,

2 6= 0.
|d |2 |d|

Hence, from the preceding identity, by letting , it follows that 0.


This means that eventually (0, ? ), and then, by definition of the step
size in Step2, one must have
f (x + d ) f (x ) f (x + d ) f (x ) = |d |2 + o(), (0, ? ).
2 < 0,
Letting and choosing close enough to 0, one obtains 0 |d|
a contradiction. Hence, f (
x) = 0.
Step 2 of the Steepest Descent algorithm implicitly assumes that we have
at our disposal a line minimization function that calculates precisely the
minimizer of

7 f (x + d ) f (x )
on [0, ).
6

o() designates a quantity such that o()/|| 0 as 0.

1.3. STEEPEST DESCENT AND NEWTON METHODS

Thats an unrealistic premise. The best one can hope for is an approximating
minimizer. When implementing this algorithm, the step size in Step 2 is
commonly calculated as follows: for parameters , (0, 1) selected at the
outset in Step 0,

A-Step 2. := maxk=0,1,...{ k f (x + k d ) f (x ) |d |2 k }.

One then refers to as the Armijo step size7 ; k is to the power k, so,
in particular, 0 = 1.
The convergence proof can easily be adjusted to

|d |2
f(x + d ) - f(x )
0





Figure 1.5: Calculating the Armijo step size


accommodate the Armijo step size selection.
One may interpret the iterations of the Steepest Descent method as follows: At x , the function is approximated by
f (x) = f (x ) + hf (x ), x x i + 12 |x x |2 ,
where > 0 is a parameter to be selected at some point. The next iterate
x+1 is obtained by minimizing f , i.e.,
x+1 = x f (x ).
If f turns out to be a good approximation of f , at least locally, one should
end up with a point x+1 near a local minimizer of f . However, its obvious
that this can only be the case for a very limited class of nonlinear functions
f . A more trust-worthy local approximation of f at x is provided by
f (x) = f (x ) + hf (x ), x x i + 21 hx x , 2 f (x )(x x )i,
7

more appropriately the Armijo-Goldstein step-size

10

CHAPTER 1. PRELUDE

assuming that f is twice differentiable with Hessian matrix at x,



n,n
2f
2
,
f (x) =
(x)
xi xj
i,j=1
again with > 0, a parameter to be selected appropriately. Assuming
furthermore, that 2 f (x ) is positive definite8 and thus also invertible, the
minimum of f is attained at
1
x+1 = x 2 f (x ) f (x ).
The suggested descent direction,

d = 2 f (x )

1

f (x ),

is known as the Newton


1 direction. Its certainly a direction of descent since
hf (x ), 2 f (x ) f (x )i > 0; the inverse of a positive definite matrix
is also positive definite. This lead us to the following variant of the Steepest
Descent method:
Newton Method (with line minimization).
Step
Step
Step
Step

0.
1.
2.
3.

x0 IRn , := 0.
1
Stop if f (x ) = 0, otherwise, d := 2 f (x ) f (x ).


:= argmin0 f (x + d ) f (x ) .
x+1 := x + d , + 1, go to Step 1.

The convergence proof for the Newton method is essentially the same as
that for the Steepest Descent method, the only difference is that a different
descent direction is calculated in Step 1. And when implementing the Newton
method, one would again replace Step 2 by A-Step 2, i.e., determine by
calculating the Armijo step size.
What really makes the Newton method very attractive is that it comes
with particularly desirable local convergence properties! The proof that follows is for the classical version of Newtons method that doesnt include a
line minimization step, i.e., with = 1.
8

A matrix C is positive definite if hx, Cxi > 0 for all x 6= 0. When the Hessian at x, of
a twice continuously differentiable function f : IR n IR is positive definite, f is strictly
convex on a neighborhood of x
, cf. 3.15. We kick off our study of convexity in Chapter 3.

1.3. STEEPEST DESCENT AND NEWTON METHODS

11

Newton direction

Steepest descent

 
 
 
 
 

 
 
 
 

 
 
 
 

 
 
 
 

f(x) =

Figure 1.6: Comparison of Newton and steepest descent directions

1.5 Convergence (Newton Method: local convergence). Let f : IR n


IR be twice continuously differentiable and x 7 2 f (x) locally Lipschitz
continuous9 , i.e., given any x,
> 0, 0, such that k2 f (x) 2 f (x0 )k |x x0 |, x, x0 IB(
x, ).
Let x be a local minimizer that satisfies the following second order sufficiency
condition: there are constants 0 < l u < such that
l |z|2 hz, 2 f (
x)zi u |z|2 ,

z IRn .

Then, there is a > 0 such that if the Newtons method is started at a point
x0 IB(
x, ), the iterates {x }IN will converge quadratically to x, i.e.,
0 lim

|x+1 x|
< .
|x x|2

Proof. From the assumptions follows the existence of > 0 and <
such that
k2 f (x0 ) 2 f (x)k |x0 x|,

(l /2)|z|2 hz, 2 f (x)zi 2u |z|2 ,

x, x0 IB(
x, ),

x IB(
x, ), z IRn ,

Hence, for all x IB(


x, ), 2 f (x) is positive definite, and consequently
1
invertible, and for all such x: k 2 f (x) k 2/l . From Step 1, for x
9

kAk =

P
m,n

i,j=1

a2ij

1/2

is the Frobenius norm of the matrix A.

12

CHAPTER 1. PRELUDE

IB(
x, ) and with z = x x,
2

f (x )(x

+1

x ) = f (x ) + f (
x) = z ,

1
0

2 f (x tz ) dt,

as follows from the Mean Value Theorem; recall f (


x) = 0. Because x+tz
IB(
x, ) for all t [0, 1], adding 2 f (x )z to both sides of the preceding
identity, yields
Z 1



+1

2
1
2 f (x tz ) dt, .
(x
x ) = f (x )
z ,
0

R
R
Since |x+1 x| |x x| + |x+1 x | and | g(t) dt| |g(t)| dt,
|x

+1

x| |z | k f (x )

1

|z |
|z |2 =
|z |
l
l

k2 f (x ) 2 f (
x + tz )k dt

Thus, if (/l )|z | < 1,


|x+1 x| < |z | = |x x| = x+1 IB(
x, ) whenever x IB(x , ).
For some (0, 1), set = min{, l /}. By induction, one has
x0 IB(
x, ) = x IB(
x, ), .
With = (l /)|z |, one can conclude
|x+1 x|

2
|z | = +1 2 ,
l

and thus, recursively, 02 for all , i.e., the iterates x converge to x


quadratically.
1.6 Exercise (Steepest Descent and Newton with Armijo step size). Implement the Steepest Descent and Newton methods with Armijo step size;
reasonably good values for both , 0.8. Then, minimize f (x1 , x2 ) =
100(x2 x21 )2 + (1 x1 )2 , the so-called Rosenbrock function. Make an iteration count for both methods.

1.3. STEEPEST DESCENT AND NEWTON METHODS

13

Guide. The Rosenbrock function has boomerang shaped level sets. The
minimum occurs at (1, 1). Starting at x0 = (1.2, 1), the Steepest Descent
method may require as many as a 1000 steps to reach the neighborhood of
the solution. As, can be observed, the method of Steepest Descent can not
only be quite inefficient but, due to numerical round-off, it might even get
stuck at a non-optimal point.
The Newton method, like the steepest descent method, can be viewed as
a procedure to find a solution to the system of n equations:
f
f
f
(x) = 0,
(x) = 0, . . . ,
(x) = 0.
x1
x2
xn
f
(x) have, in principle, no preassigned propBecause, these functions x 7 x
j
erties, one could have described Newtons method as one of solving a system
of n non-linear equations in n unknowns, say


G1 (x)
0
.. ..
G(x) = . = .
Gn (x)
0

with

G(x) =

h G

xj

(x)

in

i,j=1

the Jacobian of G at x. Generally, the algorithmic procedure, then known as


the Newton-Raphson method, doesnt involve a line minimization step but a
line-search could be included to avoid being lead in unfavorable directions.
Assuming that the Jacobian is invertible, a generic version of the method
involves the following steps:
Newton-Raphson Method.
Step 0. x0 IRn , := 0.
1
Step 1. Stop if G(x ) = 0, otherwise, d := G(x ) G(x ).
Step 2. x+1 := x + d , + 1, go to Step 1.
After making the appropriate change of notation, one can follow step by
step the proof of 1.5 to obtain a (local) quadratic convergence rate. Figure
1.7 illustrates the steps of the Newton-Raphson procedure when G : IR IR.

14

CHAPTER 1. PRELUDE

x2

x4

x5

x3

x1

Figure 1.7: Root finding via Newton-Raphson in dimension 1

Newton-Raphson for systems of linear/non-linear equations and Newtons


method for unconstrained optimization problems illustrate vividly the important connections between these two classes of problems, both in the design
of solutions procedures as well as in the analysis of their intrinsic properties.

1.4

The Quasi-Newton methods

The numerical implication of quadratic convergence is that once a small


enough neighborhood of a (local) minimizer is reached, each iteration will
double the number of correct digits in the approximating solutions x . Of
course, this doesnt take into consideration round-off errors! But, nonetheless, any method with theses characteristics has to be treasured and deserves
to be plagiarized. Unfortunately, the conditions under which one can apply
Newtons method are quite restrictive:
- 2 f (x ) might not be invertible. This can be repaired to some extent
by choosing appropriately a direction d that satisfies 2 f (x )d = f (x ).
- The approximation f for f at x might be poor, or more generally, only
valid in a very limited region. Again, this can be repaired to some extent by
restricting the step size to a trust region.
- The function f is not twice differentiable, or its Hessian is difficult to
compute. This cant be repaired in the framework of the Newton method.
It requires a different approach.

1.4. THE QUASI-NEWTON METHODS

15

Quasi-Newton Method10 .
Step 0. x0 IRn , := 0, pick B 0 (= I, for example).

Step 1. Stop if f (x ) = 0,
otherwise, choose d such that B d = f (x ).


Step 2. := argmin0 f (x + d ) f (x ) .

Step 3. x+1 := x + d , calculate B +1 , + 1, go to Step 1.


Clearly B plays here the role of the Hessian 2 f (x ) in the Newton
method, i.e., it tries to capture the adjustment that needs to be made in the
steepest descent on the basis of the local curvature of the function f . The
actual behavior of the method is determined by the choice one makes of the
updating mechanism for B . To guarantee, at least, local convergence, one
imposes the following condition:
The Quasi-Newton Condition. The curvature along the descent direction d (from x to x+1 ) should be approximated by
B +1 (x+1 x ) := (B + U )(x+1 x ) = f (x+1 ) f (x ),
or equivalently, with
s = x+1 x , c = f (x+1 ) f (x ) :

U s = c B s .

The updating matrix U = B +1 B must be chosen so that it satisfies


the preceding identity. This can always be achieved by means of a matrix of
rank 1 of the type U = u  v, where

u1 v 1 u1 v 2 . . . u 1 v n
u2 v 1 u2 v 2 . . . u 2 v n

uv =
. . . . . . . . . . . . . . . . . .
un v 1 un v 2 . . . u n v n
is the outer product of the vectors u and v. We must have
[ u  v ]s = hv, s i u = c B s .
10

Methods of this type are also known are variable metric methods. One can think of
the descent direction generated in Step 1 as the steepest descent but with respect to a
different metric on IRn than the usual Euclidean metric.

16

CHAPTER 1. PRELUDE

This means that u needs to be a multiple of (c B s ). Assuming c 6= B s ,


otherwise B itself satisfies the Quasi-Newton condition,
v such that hv, s i 6= 0,
one would set
u=

1
(c B s )
hv, s i

and

U =

1
[ (c B s )  v ].
hv, s i

In the preceding expression all quantities are fixed except for v and the only
restriction is that v shouldnt be orthogonal to s . One can choose v so
as to restrict B to a class of matrices that have some desirable properties.
For example, one may wish to have the matrices B symmetric; the Hessian
2 f (x ) is symmetric when it is defined. Choosing
v = c B s

yields

B +1 = B +

hv , s i

[ v  v ].

which is symmetric if B is symmetric.


One particularly useful property of these updates, is that their inverses
can be computed recursively. Indeed, if B is invertible and hv, B 1 ui 6= 1,
then
1
1
B +[uv]
= B 1
B 1 [u  v ] B 1 .
1 + hv, B 1 ui
To see this, simply observe that


1
B + [ u  v ] B 1
B
[u

v
]
B
= I,
1 + hv, B 1 ui
follows from

hv, B 1 ui [ u  v ] B 1 = [ u  v ] B 1 [ u  v ] B 1 = [ u  v ]hB 1 u, viB 1 .


Updating schemes based directly on (B + [ u  v ]) are not numerically
stable, by which one means that small errors in carrying our the numerical
operations might result in significant errors in the descent method. Two
popular, numerically reliable, updating schemes are
BFGS update

11

B +1 = B +
11



1
1  

c

c

B
s

B
s
.
hc , s i
hB s , s i

BFGS = Broyden-Fletcher-Goldfarb-Shano who, independently, proposed this formula


for the update.

1.5. INTEGRAL FUNCTIONALS


DFP update

12

17

: with D paying the role of (B )1 :

D +1 = D +



1  
1

s

s

D
c

D
c
.
hc , s i
hc , D c i

The Matlab-function fminunc (with option.LargeScale set off) implements the BFGS Quasi-Newton method. Again, the Rosenbrock function
could be used as test function and the results compared to those obtained in
Exercise 1.6.

1.5

Integral functionals

Lets go one step further in our analysis of classical optimality conditions


and consider integral functionals, as they arise in the Calculus of Variations13 ,
Z 1
f (x) =
L(t, x(t), x(t))

dt.
0

With , IR, the simplest problem of the Calculus of Variations is:



o
n

min f (x) x X, x(0) = , x(1) = ,

where X fcns([0, 1], IR) consists of all functions with some specified properties, for example X = ac-fcns([0.1], IR), the space of absolutely continuous
real-valued functions defined on [0, 1]. For simplicity sake, let assume that
X = C 1 ([0, 1]; IR) is the space of real-valued, continuously differentiable functions defined on [0, 1]. One has,
Oresme rule: x argmin f = df (x ; w) = 0, w W X,
where the set W of admissible variations is such that
x X, w W = IR : (x + w)(0) = , (x + w)(1) = ,
that is,
12
13




W = w X w(0) = w(1) = 0 .

DFP = Davidon-Fletcher-Powell who proposed this updating scheme.


Its customary to denote by x,
the derivative of x, with respect to time parameter t

18

CHAPTER 1. PRELUDE

It isnt straightforward to write down Fermats rule. There isnt a ready made
calculus to find the gradient of functions defined on an infinite dimensional
linear space. In the late 17th Century (?), the path going from Oresmes
to Fermats rule for this problem was pioneered by a trio of mathematical
superstars: the Bernoulli brothers, Johan and Jacob, and Isaac Newton.
Lets sketch out their approach when
L : [0, 1] IR IR IR
is a really nice function. Here, this means that the partial derivatives
Lx (t, x, v) =

L
(t, x, v),
x

Lx (t, x, v) =

L
(t, x, v)
v

have the continuity and differentiability properties required to validate the


operations carried out below. Then,
Z

1

df (x; w) = lim
L(t, (x + w)(t), (x + w)(t))

L(t, x(t), x(t))

dt
0
0
Z 1


=
lim 1 L(t, (x + w)(t), (x + w)(t))

L(t, x(t), x(t))

dt
0 0
Z 1h
i


=
Lx t, x(t), x(t)

w(t) + Lx t, x(t), x(t)

w(t)

dt
1

For a given function x X and for w W , integration by parts yields



Z t
Z 1
Z 1


Lx , x( ), x(
) d dt,
w(t)

w(t)Lx t, x(t), x(t)

dt = 0
0

and thus,

df (x; w) =

1
0



Z t


Lx , x( ), x(
) d dt.
w(t)

Lx t, x(t), x(t)

R1
Because w W implies 0 w(t)
dt = 0 and df (x ; w) = 0 must hold for all
such functions w: on [0, 1],


t 7 Lx t, x (t), x (t)

t
0

 
Lx , x ( ), x (t) d
must be constant.

1.5. INTEGRAL FUNCTIONALS

19

In other words, x must satisfy the ordinary differential equation




d
Lx t, x (t), x (t) = Lx t, x (t), x (t) for t [0, 1],
dt

known as the Euler equation. In addition, x must satisfy the boundary


conditions at t = 0, 1. Fermats rule then reads,
(
x (0) = , x (0) = ,
x argmin f =
x satisfies the Euler equation.
1.7 Example (the brachistochrone problem). The problem14 is to find the
path along which a particle will fall in the shortest time from A to B.
Detail. Lets pass a vertical plane through the points A and B with the
y-axis drawn vertically downward, A located at the origin (0, 0) and say, B
= (1, 2). So, we are to find a path y : [0, 1] [0, ) with y(0) = 0 and
y(1) = 2 that will minimize the elapsed time; the force acting on the particle
is gravity. From Newtons Law, one derives the following expression for the
1

g
2

Figure 1.8: Shortest time path for falling particle


the function to be minimized:
Z 1s
0

14

1 + y 0 (x)2
dx.
y(x)

originally formulated by Galileo, the name is derived from the Greek, brachistos for
shortest and chronos for time

20

CHAPTER 1. PRELUDE

Rather than the Euler equation itself, lets rely on a variant, namely,

d
L y 0 Ly 0 = L x ;
dx

carrying out the differentiation with respect to x, the preceding identity


d
d
Ly0 = Lx , or still y 0 (Ly dx
Ly0 ) = 0.
yields Lx + y 0 Ly + y 00 Ly0 y 00 Ly0 y 0 dx
Since in our problem, Lx = 0, this variant of the Euler equation implies that
x 7 (L y 0 Ly0 )(x) should be constant. Hence the optimal path must satisfy
the following boundary value problem:
2 = y(x)(1 + y 0 (x)2 ),

y(0) = 0, y(1) = 2.

where is a constant to be chosen so as to satisfy the boundary conditions.


The path described by the solution in the (x, y)-plane can be parametrized
with respect to time for t [0, 2], and one can verify that the cycloids,
x(t) = (t sin t),

y(t) = (1 cos t),

satisfy the preceding differential equation; somewhat informally


y0 =

dy dt
y
dy
=
= = (1 cos t)1 sin t.
dx
dt dx
x

At t = 0, x(0) = y(0) = 0, so there remains only to choose so that the


path passes through B at time t = t , our shortest time. For B at (1, 2),
that turns out to be = 2.4056 and the corresponding value for t is 1.4014;
these values were obtained with he help of the Matlab root-finder fzero.
1.8 Exercise Show that the straight line yields the shortest distance between two points, say a = (0.0) and b = (1, 2).
Guide. The length of a differentiable arc y : [0, 1] IR is
Z 1p
1 + y 0 (x)2 dx,
0

as follows from the theorem of Pythagoras and the definition of y 0 . Set up


and solve the Euler equation with the boundary conditions y(0) = 0 and
y(1) = 2.

1.6. IN CONCLUSION . . .

1.6

21

In conclusion . . .

We have seen that the rules of Oresme and Fermat, with the help of Differential Calculus, can be used effectively to identify potential minimizers
of a smooth function in a variety of situations. But, many interesting optimization problems involve non-differentiable functions and minimizers have
a predilection for being located at the cusps and kinks of such functions!
Moreover, the presence of constraints in a minimization problem come with
an intrinsic lack of smoothness. There is a sharp discontinuity between points
that are admissible and those that are not.
To deal with this more inclusive class of functions, we need to enrich our
calculus. Our task, on the mathematical side, will thus be to set up a Subdifferential15 Calculus with rules that mirror those of Differential Calculus,
and that culminates in versions of Oresmes and Fermats rules to ferret out
the minimizers of non-smooth, and even discontinuous, functions16 .

15

the prefix sub has the meaning: requiring less than differentiability.
For more comprehensive expositions of Subdifferential Calculus, one should consult
[4, 14, 1, 3, 18], our notation and terminology will be consistent with that of [18].
16

22

CHAPTER 1. PRELUDE

Chapter 2
FORMULATION
Lets begin with a few typical (constrained) optimization problems that fit
under the mathematical programming umbrella. In almost all of these examples, we start with a deterministic version and then switch to a more
realistic model that makes a place for the uncertainty about some of the
parameters.
When we allow for data uncertainty, not only do we gain credibility for
the modeling process but we are also lead to consider a number of issues
that are at the core of optimization theory and practice, namely, how to
deal with non-linearities, with lack of smoothness, and how to design solution procedures for large scale problems. In addition, due to the addition
of randomness (uncertainty) its also necessary to clarify a number of the
basic modeling issues, in particular, how stochastic programs differ from the
simpler, but less realistic, deterministic formulations.
For all these reasons, we are going to rely rather extensively, but by no
means exclusively, on stochastic programming examples to motivate both the
theoretical development and the design of algorithmic procedures.

2.1

A product mix problem

A furniture maker can manufacture and sell four different dressers. Each
dresser requires a certain number tcj of man-hours for carpentry, and a certain
number tf j of man-hours for finishing, j = 1, . . . , 4. In each period, there are
dc man-hours available for carpentry, and df available for finishing. There is
a (unit) profit cj per dresser of type j thats manufactured. The owners goal
23

24

CHAPTER 2. FORMULATION

is to maximize total profit, or equivalently, to minimize cost1 (= negative


profit). Let these cost coefficients be
c = (
c1 , c2 , c3 , c4 ) = (12, 25, 21, 40)
and

T =

tc1 tc2 tc3 tc4


tf 1 tf 2 tf 3 tf 4

4 9 7 10
1 1 3 40

d = (dc , df ) = (6000, 4000).

The furniture maker must choose (xj 0, j = 1, . . . , 4) to minimize


4
X
j=1

cj xj = 12x1 25x2 21x3 40x4 ,

subject to the constraints


4x1 + 9x2 + 7x3 + 10x4 6000,

x1 + x2 + 3x3 + 40x4 4000.

This is a linear program, i.e., an optimization problem in finitely many (realvalued) variables in which a linear function is to be minimized (or maximized) subject to a system of finitely many linear constraints: equations and
inequalities. A general formulation of a linear program could be
min
so that

n
X

j=1
n
X

cj xj over all x X IRn


ai,j xj ./ bi for i = 1, . . . , m,

j=1

where ./ stands for either , = or and the (internal) constraints x X


consist of some simple linear
on the the variables
xj such as:

 inequalities


- X is a box, i.e.,
X = x lj x j
uj , j = 1, . . . , n ,
n

- X = IR+ = x xj 0, j = 1, . . . , n , the non-negative orthant, etc.
1

This conversion to minimization is made in order to have a canonical formulation


of optimization problems. Generally, engineers and mathematicians prefer the minimization framework, whereas social scientists and business majors have a preference for the
maximization framework.

2.1. A PRODUCT MIX PROBLEM

25

The objective and the constraints of our product mix problem are linear
and it may be written compactly as:
min hc, xi

so that T x d,

x 0.

As part of the ensuing development, many of the properties of linear programs will be brought to the fore including optimality conditions, solution
procedures and the associated geometry. For now, lets simply posit that
such problems can be solved efficiently when they are not too large. The
(optimal) solution of our product mix problem is:
xd = (4000/3, 0, 0, 200/3)

with optimal value: $ -18,667.

Here is the Matlab-file used to calculate the solution; linprog is a function


in the Matlab Optimization Toolbox.
function [xopt,ObjVal] = prodmix
%
% data and solution of the product mix example
%
c = -[12 25 21 40]; d = [6000 4000];
T = [4 9 7 10; 1 1 3 40];
xlb = zeros(4,1); xub = ones(4,1)*10^9;
[xopt,ObjVal] = linprog(c,T,d,[],[],xlb,xub);
Now, lets get a bit more realistic and account for the fact that the number
of hours needed to produce each dresser type cant be known with certainty.
Then, each entry in T becomes a random variable2 . For simplicitys sake,
assume that each entry of T takes on four possible man-hour values with
equal probability (1/4) and that these entries are independent of one another.
entry
tc1 :
tc2 :
tc3 :
tc4 :
tf 1 :
tf 2 :
tf 3 :
tf 4 :
2

3.60
8.25
6.85
9.25
0.85
0.85
2.60
37.0

possible values
3.90
4.10
4.40
8.75
9.25
9.75
6.95
7.05
7.15
9.75 10.25 10.75
0.95
1.05
1.15
0.95
1.05
1.15
2.90
3.10
3.40
39.0
41.0
43.0

Bold face will be used for random variables with normal print for their possible values.

26

CHAPTER 2. FORMULATION

We have 8 random variables, each taking four possible values, that yields a
total of 48 = 65,536 possible T matrices (outcomes) and each one of these
has equal probability of occurring! In practice, this could be a discretization
that approximates a continuous distribution (e.g. a uniform distribution).
Lets denote the probability of a particular outcome T l by pl = (0.25)8 , for
l = 1, . . . , 65, 536.
Because the manufacturer must decide on the production plan before the
number of hours required for carpentry or finishing are known with certainty,
there is the possibility that they actually exceed the number of hours available. Therefore, the possibility of having to pay for overtime must be factored
in. The recourse costs are determined by: qc per extra carpentry hour and
qf per extra finishing hour, say q = (qc , qf ) = (5, 10).
This recourse decision will only enter into play after the production plan x
has been selected and the time required, T l , for each task, has been observed.
Our manufacturer will, at least potentially, make a different decision about
overtime when confronted with each one of these 65,536 possible different
outcomes for T . Let ycl and yfl denote the number of hours of overtime hours
hired for carpentry and finishing when the matrix T turns out to be T l . The
problem is then to choose (xj 0, j = 1, . . . , 4) that minimizes
4
X
j=1

65,536

c j xj +

X
l=1



pl qc ycl + qf yfl

so that
4
X
j=1

4
X
j=1

tcjl xj ycl dc ,

l = 1, . . . , 65, 536,

tfl j xj yfl df , l = 1, . . . , 65, 536,

ycl 0, yfl 0,

l = 1, . . . , 65, 536.

Notice that the objective now being minimized is the sum of the immediate costs (actually, the negative profits) and the expected futures costs since
one must consider 65,536 possible outcomes; the constraints involving random quantities are written out explicitly for all 65,536 possible outcomes.
In addition to non-negativity for the decision variables xj and the recourse

2.1. A PRODUCT MIX PROBLEM

27

variables ycl , yfl , the constraints say that the number of man-hours it takes
P
for the carpentry of all dressers ( 4j=1 tcjl xj ) must not exceed the total number of hours made available for carpentry (dc + ycl ), i.e., regular hours plus
overtime, and the same must hold for finishing.
Because there is the possibility of making a recourse decision y l = (ycl , yfl )
that will depend on the outcomes of the random elements, this type of problem is called a stochastic program with recourse. This class of problems will
be studied in more depth later. For now, it suffices to understand how the
decision/information process is evolving:
decision: x ; observation: T l ; recourse: y l
In summary, the manufacturer makes today, a decision x of how much
of each dresser type to produce based on the knowledge that he will be
able tomorrow, to observe how many man-hours T l it actually took to
manufacture the dressers, as well as to decide how much overtime labor y l
to hire based on this observation.
The problem is still a linear program, but of much larger size! Notice the
block-angular structure of the problem when written in the following way:
min
so that

+p1 hq, y 1 i +p2 hq, y 2 i + + p65536 hq, y 65536 i


y 1
d
2
y
d
..
..
.
.
T 65536 x
y 65536
d
x 0,
y 1 0,
y 2 0,

y 65536 0.

hc, xi
T 1x
T 2x
..
.

Later, we shall see how to solve these large scale linear programs by exploiting
their structure. For now, it is enough to observe that these are indeed, large
scale problems.
Oftentimes, there is more than one source of uncertainty in a problem. For
example, due to employee absences, the available man-hours for carpentry
and finishing may also have to be modeled as random variables, say
entry
dc :
df :

5,873
3,936

possible values
5,967 6,033 6,127
3,984 4,016 4,064

each with probability 1/4


each with probability 1/4

28

CHAPTER 2. FORMULATION

We now need to replace d by d l = (dcl , dfl ) and we must take into account
the 42 = 16 possible d l vectors, that gives a total of L = 410 = 1, 048, 576
possible (T , d) realizations. With pl = 1/L, the problem reads:
min
so that

+p1 hq, y 1 i +p2 hq, y 2i + + pL hq, y L i


y 1
d1
2
y
d2
..
..
.
.
L
L
T x
y
dL
x 0, y 1 0,
y 2 0,

y L 0.

hc, xi
T 1x
T 2x
..
.

The relatively small linear program we started out with, in the deterministic
setting, has now become almost enormous! Lets refer to this problem as the
(equivalent) extensive version of the (given) stochastic program.
The optimal solution is
x = (257, 0, 665.2, 33.8)

with total expected cost: $ -18,051.

Because of its large size, this problem is more difficult to solve than its
deterministic counterpart, and any efficient solution procedure must exploit
the problems special structure. But the solution x is robust, meaning that
it has examined all one million plus possibilities, and has taken into account
the resulting recourse costs for overtime and the associated probabilities of
having to pay these costs.
With xd = (4000/3, 0, 0, 200/3), the solution of the deterministic version,
the expected cost would have been $ -16,942; the expected overtime costs are
$ 1,725. Of course, xd is not an optimal solution of the stochastic program,
but more significantly, xd isnt getting us on the right track! The solution x
suggests that a large number of dressers of type 3 should be manufactured,
while the production plan suggested by xd doesnt even include any dresser of
type 3. This is exactly the information a decision maker would want to have,
viz, what are the activities that should be included in a (robust) optimal
solution.
2.1 Exercise (stochastic resources). Consider the product mix problem
when the only uncertainty is about the number of hours that will be available
for carpentry and finishing. Overtime will still be paid at the rates of $ 5 an

2.2. CURVE FITTING II

29

hour for carpentry and $ 10 and hour for finishing. Let




4 9 7 10
c = (12, 25, 21, 20), T =
.
1 1 3 40
and the random variables dc , df (independently) take on the values
entry
dc :
df :

4,800
3,936

possible values
5,500 6,050 6,150
3,984 4,016 4,064

each with probability 1/4


each with probability 1/4

Solve also the deterministic problem: minhc, xi so that T x d, x 0


with dc = 5, 625 and df = 4, 000, the expected values of dc and df . Compare
the solution with that of the stochastic program.
Guide. Here, L=16, is the number of possible outcomes, so in addition to the
non-negativity restrictions, the stochastic program will have 32 constraints.
One solution is x = (1, 072.6, 0, 251.4, 0) with optimal value $ -15,900. The
solution of the deterministic problem suggests manufacturing the same type
of dressers but in significantly different quantities. To compare the solutions,
for the deterministic solution one needs to evaluate not just its cost (= profit) but one must also calculate the recourse costs that might result when
one actually would implement the deterministic solution.

2.2

Curve fitting II

As in 1.2, we know the values of an unknown function h : [0, 1] IR at a


finite number of (distinct) points. Its also known that h is quite smooth
which in this context, we are going to interpret as meaning: h is twice differentiable with h00 bounded, i.e., for all t [0, 1], |h00 (t)| for a given
> 0. Unless is unusually small, our estimate for h is allowed to come
from a much richer class of functions than just polynomials of degree n, as
in 1.2. Its thus reasonable to expect that one should be able to come up
with a better fit.
Every twice differentiable z : [0, 1] IR can be written as
Z t Z
z(t) = z0 + v0 t +
d
ds a(s),
0

30

CHAPTER 2. FORMULATION

where z0 and v0 play the role of integration constants and a : [0, 1] IR,
the second derivative, is some function, not necessarily continuous. In fact,
to render the problem computationally manageable, lets restrict a to be
piecewise constant. Thats not as serious a limitation as might appear at first
glance since any piecewise continuous function on [0, 1] can be approximated
arbitrarily closely by such a piecewise constant function.
Getting down to specifics: Partition (0, 1] in N sub-intervals (tk1 , tk ], of
length = 1/N so that the points at which
h is known are some
 the function

of the end points of these intervals, say tl , l L . For k = 1, . . . , N , set
t (tk1 , tk ];

a(t) = xk , (a constant) ,

fixing x1 , . . . , xN , v0 and z0 completely determines the function z. Moreover,


by introducing bounds on the choice of the xk , one can control the rate of
change in both z 0 and, in the function z itself.
For t (tk1 , tk ], one has
0

z (t) = v0 +

a(s) ds = v0 +
0

k1
X
j=1

xj + (t tk1 )xk ,

and
z(t) = z0 +

t
0

z (s) ds = z0 +
0

= z0 + v0 t +

k1 Z
X
j=1

k1
X
j=1

tj

z (s) ds +
tj1

z 0 (s) ds
tk1

(t tj + /2)xj + 21 (t tk1 )2 xk .

In particular, when t = tk ,
z(tk ) = z0 + kv0 + 2

k
X
j=1

(k j + 21 )xj .

The curve fitting problem comes down to finding v0 , z0 and for k = 1, . . . , N ,


xk [, ] so that z is as close as possible to h in terms of a given criterion.
For example, one may be interested in minimizing the (square of the) `2 -norm
of the error,
X
|z(tl ) h(tl )|2 ,
lL

2.2. CURVE FITTING II

31


i.e., least squares
minimization. With z = zl = z(tl ), l L and h = hl =

h(tl ), l L , one ends up with the following formulation,
min |z h|2 =

X
lL

|zl hl |2

so that zl = z0 + lv0 + 2

l
X
k=1

xk ,

(l k + 12 )xk ,

l L,

k = 1, . . . , N.

Thats a quadratic program: the constraints are linear and the function to be
minimized is quadratic. One can write the equality constraints as
z = Ax

where

x = (z0 , v0 , x1 , . . . , xN );

the entries of A are the detached coefficients of these equations. Since,


|z h|2 = |Ax h|2 = hAx h, Ax hi,
the quadratic program can also be expressed as
min hx, A>Axi 2hA>h, xi so that lb x ub,
where lb and ub are, respectively, lower and upper bounds on the x-variables;
for z0 and v0 these bounds could be . Because the matrix A>A is positive semi-definite3 , it turns out that our quadratic program is a convex optimization problem, a property thats difficult to overvalue in an optimization
context, cf. 3.16 & Chapter 4.
To illustrate this approach, let consider again the same data as that used
in 1.2. We rely on the Matlab-function quadprog to solve the quadratic
program. Figure 2.1 displays the resulting curve when N = 400 (and is
relatively large). The resulting curve is traced in Figure 2.1, and as expected,
the fitting is significantly better than what resulted from a best polynomial
fit; compare Figures 1.3 and 2.1.
function z = lsCurveFit(xr,N,x,h,kappa)
% xr, N: range [0, xr]; partition in N subintervals
3

A matrix C is positive semi-definite if hx, Cxi 0 for all x. When C is positive


semi-definite, the quadratic form x 7 hx, Cxi is convex, see Example 3.16.

32

CHAPTER 2. FORMULATION
% (x,h): data points
% kappa: -lower and upper bound on 2nd derivatives
msh = xr/N; [m, m0] = size(x(:)); xidx = round(N*x);
N1 = N + 1; N2 = N + 2;
mx = 2+max(abs(h)); ub = [kappa*ones(1,N);100;mx]; lb = -ub;
%
generating the coefficients of matrix A
for i = 1:m
for j = 1:xidx(i)
A(i,j) = (xidx(i)-j+0.5)*msh^2;
end %for
A(i,N1) = xidx(i)*msh; A(i,N2) = 1;
end %for
xx = quadprog(A*A,-A*h(:),[],[],[],[],-ub,ub);
%
z-curve calculation
for l = 1:N
zd = 0;
for k = 1:l
zd = zd + (l-k+0.5)*xx(k);
end %for
z(l) = xx(N2) + xx(N1)*l*msh + zd*msh^2;
end %for
1.5
1
0.5
0
0.5
1
1.5

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Figure 2.1: Least squares fit of a smooth curve


Instead of minimizing the `2 -norm of the error (= least squares), one
could, for instance, choose to minimize the `1 -norm (= sum of the errors).
P
The function to be minimized is then lL |zl hl |. Since
|zl hl | = max {zl hl , hl zl },

2.2. CURVE FITTING II

33

P
one can find the minimum of lL |zl hl | by minimizing
X
l with l zl hl , l hl zl for l L.
lL

With the `1 -norm criterion, the curve fitting problem takes the form,
X
min
l
lL

so that l zl hl , l L,
l zl + hl , l L,
Xl
zl = z0 + lv0 + 2

k=1

xk ,

(l k + 21 )xk ,

l L,

k = 1, . . . , N.

This is a linear program, the constraints are linear and the function to be
minimized is also linear.
2.2 Example (yield curve tracing). The spot rate of a Treasury Note that
matures in t months always includes a risk premium as well as a forecast component that represent the markets perception of future interest rates. Such
spot rates are quoted for Treasury Notes with specific maturities, t = 3, 6, . . . .
To evaluate financial instruments that generate the cash flow (coupons, final
payments) at intermediate dates, one needs to have access to to a yield curve
that supplies the spot rate for every possible date.
15
14
13
12
11
10
9

50

100

150

200

250

Figure 2.2: Yield curve for Treasury Notes July 1982


Detail. Lets work with the (historical) rates quoted in July 1982:

300

34

CHAPTER 2. FORMULATION
term
rt

3
11.9

6
12.8

12
13.2

24
13.8

36
14.0

60
14.1

84
14.1

120
13.9

240
13.8

360
13.6

To trace the yield curve, the simplistic approach is to rely on linear interpolation. However, thats not really satisfactory. Financial markets make
continuous adjustments to the changing environment and this suggests that
the yield curve has to be quite smooth. Certainly, there shouldnt be an
abrupt change in the slope of the yield curve, and a fortiori, this shouldnt
occur at t = 3, 6, . . . . So, lets fit a smooth curve to the data. Because the
spot rates are nonnegative4 , one can express the yield curve as s(t) = ez(t)
in which case we need to search for a smooth z-curve that will fit the pairs
{(3, ln r3 ), (6, ln r6 ), . . . }. The following Matlab-file generates the coefficients of the linear program and then relies on linprog to calculate the
solution. Figure 2.2 graphs the (historical) yield curve calculated by our program.
function spots = YieldCurve(N,x,r,kappa)
% N: # of months, range [0, N]; % (x,r): data points
% kappa: -lower and upper bound on 2nd derivative
[m, m0] = size(x(:)); N1 = N + 1; N2 = N + 2;
ub = [kappa*ones(1,N);0;0;10*ones(1,m)];
lb = [-ub(1:N);-1;-3.25;zeros(1,m)];
%
generating the coefficients of linear program
for i = 1:m
i2 = 2*i; i1= i2-1;
b(i2) = log(r(i)); b(i1) = -b(i2);
A(i1,:) = zeros(1,N2+m);
for j = 1:x(i)
A(i1,j) = (x(i)-j+0.5);
end %for
A(i1,N1) = x(i); A(i1,N2) = 1; A(i1,N2+i) = -1;
A(i2,:) = -A(i1,:); A(i2,N2+i) = -1;
end %for
c = [ zeros(1,N2) ones(1,m)];
xx = linprog(c,A,b,[],[],lb,ub);
%
yield curve calculation
for l = 1:N
zd = 0;
4

and its expedient to have an expression for the spot rates that makes calculating
forward rates and discount factors particularly easy

2.3. A NETWORK CAPACITY EXPANSION PROBLEM

35

for k = 1:l
zd = zd + (l-k+0.5)*xx(k);
end %for
z(l) = xx(N2) + xx(N1)*l + zd;
end %for
spots = exp(-z);

2.3

A network capacity expansion problem

Lets consider a power transmission network, Figure 2.3, with ei be the external flow at node i, i.e., the difference between demand and supply at node i.
The internal flow yj on arc j is limited by its capacity j of the transmission
line. Total supply exceeds total demand but the capacity of the transmission
lines needs to be expanded from j to j + xj , with j an upper bound on xj ,
in order to render the problem feasible5 . The total cost of such an expansion
P
is nj=1 j (xj ).

e2

e1

|yj | <_ j

Figure 2.3: Power transmission network


5

In the 2001 California energy crisis, some of the blackouts were blamed on the lack of
capacity of the transmission lines between South and North California.

36

CHAPTER 2. FORMULATION
The deterministic version of this capacity expansion problem would be:
min

n
X

j (xj )

j=1

so that 0 xj j ,
j = 1, . . . , n,
|yj | j + xj , j = 1, . . . , n,
X
X
yj ei , i = 1, . . . , m;
yj
i

P
yj stands for the (internal) flow into i whereas i yj is the flow from
i to the other nodes. Since the constraint |yj | j + xj can be split in
the two linear constraints yj j + xj and yj j xj , this is again a
linear programming problem if the cost functions j are linear. Usually, the
functions j are nonlinear, and the problem then belongs to a more general
class of optimization problems.
A nonlinear program is an optimization problem in finitely many (realvalued) variables in which a function is to be minimized (or maximized)
subject to a system of finitely many constraints: equations and inequalities.
A general formulation of a nonlinear program could be

min f0 (x) over all x X IRn

so that fi (x) ./ 0 for i = 1, . . . , m,


where ./ stands for either , = or and the set X is usually a simple
set such as a box or an orthant but, in principle, could be any subset of
IRn . Depending on the properties of the functions f0 , {fi , i = 1, . . . , m}
and the set X, various labels are attached to nonlinear programs: quadratic,
geometric, convex, positive definite, etc. As we proceed, we shall develop
optimality conditions and study stability criteria for nonlinear programs as
well as describe a number of algorithmic procedures for solving certain classes
of nonlinear programs.
2.3 Example (capacity expansion example). Consider the (simple) capacity
expansion problem as defined by Figure 2.4 with no upper bounds on the
expansions xj and let 1 (x1 ) = x21 , 2 (x2 ) = 8x22 , 3 (x3 ) = 3x23 .
Detail. This is a quadratic program with solution
x = (0, 0.55, 1.45),

y = (2.45, 2.55, 6.45)

2.3. A NETWORK CAPACITY EXPANSION PROBLEM

37

9
y2 < 7+x 1

y3 < 5+x 3
1

y1 < 2+x 2

Figure 2.4: A capacity expansion example

obtained by quadprog, a Matlab-function in the Optimization Toolbox.


function xopt = capXmpl
%
% data and solution of the capacity expansion example
%
Q = 2*diag([1 8 3 0 0 0]);
xlb = [zeros(1,3) -10^8*ones(1,3)];
A = [-1 0 0 -1 0 0; -1 0 0 1 0 0; 0 -1 0 0 -1 0;
0 -1 0 0 1 0; 0 0 -1 0 0 -1; 0 0 -1 0 0 1;
0 0 0 1 0 1; 0 0 0 -1 1 0; 0 0 0 0 -1 -1];
b = [7 7 2 2 5 5 4 6 -9];
xopt = quadprog(Q,[],A,b,[],[],xlb,[]);

Due to unforeseen breakdowns in the supply generators and unexpected


variations in the demand, it isnt really possibly to validate a model that
assumes that external flows, e = (e1 , . . . , em ), are known with certainty. Certainly, a more believable approach would deal with the external flows as random, say = ( 1 , . . . , m ), with possible values in IRm and distribution
function P .
Because possibly all the demand may not be satisfied (at each node),
we are going to associate with each node i, a monitoring function, equivalently a penalty function, see Figure 2.5, that quantifies the level of demand

38

CHAPTER 2. FORMULATION

satisfaction:

if < 0,
(supply exceeds demand)
0
2
i ( ) = i /2i
if [0, i ], (low level excess demand)

i i i /2 if > i ,
(high level excess demand).

Lets suppose that the budget available for the capacity expansion is fixed,

2
2

Figure 2.5: Monitoring function for capacity expansion problem


say ; eventually, could be varied and the dependence of the solution on
could be (parametrically) analyzed. Suppose also that the random vector
takes on a (finite) number of possible values, say l for l = 1, . . . , L, and
= l with probability pl . The problem is then to find xj , j = 1, . . . , n that
solves
min
x,y l

so that

L h X
m
X
X i
X
yjl +
yjl
pl
i il
l=1

n
X
j=1

i=1

j (xj ) ,

0 xj j , j = 1, . . . , n,
yjl xj j , j = 1, . . . , n, l = 1, . . . , L,

yjl + xj j , j = 1, . . . , n, l = 1, . . . , L.

Its a nonlinear program with a nonlinear objective function, a number of


linear constraints, and one, possibly nonlinear, budget constraint. We are

2.3. A NETWORK CAPACITY EXPANSION PROBLEM

39

again minimizing an expected cost. We can make this even more explicit by
rewriting the problem in the following form:
min
x

so that

L
X


pl q( l , x)
l=1

n
X
j=1

j (xj ) ,

0 xj j , j = 1, . . . , n,
where
q(, x) = infn
yIR

m
nX
i=1

i i

yj +

X
i

o

yj |yj | j + xj , j = 1, . . . , n .

Like the product mix problem, this is a (two-stage) stochastic program with
recourse. The first stage decision x, what were really interested in, has to
be selected now, before the supply/demand situation can be observed. The
second stage or recourse decisions, i.e., the decision about flows y is made
after the supply/demand situation is known. The function q is called the
recourse cost function.
In another version of this problem, the joint distribution function P for
could actually be continuous with (joint) density function p : IR N IR+ .
For a function : IRN IR, the expected value of () is given by
Z
Z
Z
Z
()p() d =
d1
d2
dN ()p().
IRN

|
{z
}
N times

If this is the case, again with the recourse cost function q as defined above,
our optimization problem takes the form:
min
so that

IRN
n
X
j=1

q(, x)p() d
j (xj ) ,

0 xj j , j = 1, . . . , n,

40

CHAPTER 2. FORMULATION

R
Notice that we dont have an explicit expression for IRN q(, x)p()d as a
function of x. For each x, an N -dimensional integration must be performed,
and thats computationally expensive, and anyway we dont even have an
explicit representation for the recourse cost function q. The major challenge
in stochastic programming is to find ways to deal with this situation.

2.4

Discrete decision variables

The solution of an optimization model designed to assign airplanes and crews


to specific routes cant end up with flying a fraction of a plane on certain
routes. This means that in the formulation of the optimization model, one
must require that some variables take on integer values. Quite common
instances are those involving binary choices that get modeled by 0-1 variables.
For example, a given airplane and its crew, is either to fly, or not, a particular
route. Restricting some of the decision variables to integer values introduces
not only a new level of difficulty in the design of solution procedures, but
also requires a significant switch in the analysis of such problems because the
underlying mathematical structure is significantly altered. Although some of
the fundamental concepts developed in these lectures have some overlap with
those that come up in integer programming and combinatorial optimization,
these fields are better dealt with in a separate series of lectures. Here, we
shall content ourselves with a simple and famous example.

2.4 Example (the traveling salesman problem). Given a set of cities (nodes)
V = {1, . . . , n} and connecting routes (arcs) A with cij the travel time between cities i and j, the problem is to find a tour that starts and terminates
in city 1, visits all other cities exactly once, and takes the least total travel
time.

Details. Let xij = 1 if the tour includes the route from city i to city j,
otherwise xij = 0; note that we dont identify traveling from i to j with

2.4. DISCRETE DECISION VARIABLES

41

traveling from j to i. So, if x identifies a (feasible) tour, it must satisfy:


X

xij = 1, j V,
{i (i,j)A}
X

xij = 1, i V,
{j (i,j)A}
X

xij 1, U U ,
{(i,j)A iU, jV \U }
where U consists of the all subsets of V with at least 2 nodes and no more
than n 2 nodes. The two first constraints have to do with the requirement
that exactly one route is used to enter and leave each city on the tour. The
third constraint eliminates a tour consisting of subtours: it requires that
given any pair of potential subtours, say {1, 2, 3} and {4, . . . , n}, there will
be an arc connecting {1, 2, 3} to {4, . . . , n} and vice-versa. So far, these
constraints and the objective,
X
cij xij ,
min
{(i,j)A}

determine a linear program, but the binary restriction a route must be traveled or not hasnt been included in the formulation of the problem. So, one
must add the constraint:
xi,j {0, 1} (i, j) A.
The resulting problem is no longer a linear program, but a so-called integer
program; problems where some of the variables are restricted to be integervalued and others are allowed to be real-valued are mixed integer programs
(MIP).

A word of warning: Because integer programs are usually difficult to


solve, one should include an integer restriction on the variables only when
its really appropriate. For example, in our product mix problem of 2.1,
one might be tempted to ask for the solution to be integer-valued, after all
one wouldnt really be able to sell 1/3rd, or any other fraction, of a dresser.
However, the manager isnt really interested in a precise solution that will
be followed blindly. The manager concern is with a decision about activity
levels, i.e., to the manufacturing of which type of dressers should allocate the
available resources.

42

CHAPTER 2. FORMULATION

2.5 Exercise (mini-traveling tour) Formulate and solve, the following traveling salesman problem where the traveling time between cities {1, . . . , 4} is
given by the following matrix:
ij
1
C =
2
3
4

1
?
3
6
3.5

2
2
?
?
6

3
5
4
?
4

4
3
?
6
?

? indicates that there is no usable route from city i (row) to city j (column)
Guide. For the numerical solution use the Matlab function bintprog.

2.5

The Broadway producer problem

This chapters last example is probably the simplest, non-trivial, stochastic


programming problem6 ; as we shall see, the deterministic version is trivial.
The structure of the recourse cost function will allow us to obtain a closed
form solution that opens the door to complete analysis.
In a rental agreement for a theater on Broadway, a show producer must
commit herself to a minimum of x days for which she will be charged
dollars per day, i.e., a total of x dollars. If the run is successful, the theater
is available for additional shows on a day-to-day basis but the rental would
be significantly higher, at dollars per day ( > ). The length of the
successful run of the show is modeled as a random variable with known
distribution function P () = prob [ ]. The producers problem is to
choose x optimally, i.e., so as to minimize expected rental costs.
If is the length of the run, the actual cost is
(
x
if x,
x + ( x) if > x.
6

Of course, deterministic programs can be viewed as stochastic programs whose random


elements take on a single value with probability one.

2.5. THE BROADWAY PRODUCER PROBLEM

43

When P is a continuous distribution function, i.e., there is a density function


p : IR+ IR+ such that
Z
p() d().
P () =
0

The expected costs are


x +

( x)p()d.

In this case, a simple calculation shows that the optimal solution x must
satisfy
0 = [1 P (
x)].
More generally, if we define P ( ) := lim % P ( ), one must have
P (
x )

P (
x),

that allows for the possibility of a (discontinuous) jump in P at x, as could


happen when the random demand is discretely distributed. Figure 2.6
illustrates these possibilities.
1

1()

1()

x^

x^

Figure 2.6: Solution: discrete and continuous distributions.

2.6 Exercise (numerical solutions of the producer problem). With = 3,


= 7 and uniformly distributed on [10, 20], the producer problem has a

44

CHAPTER 2. FORMULATION

unique solution x = 15 75 . If has a discrete distribution with equal probability on the points {10, 11, . . . , 20}, write down an expression for the expected
costs. What is the distribution function P ? Show that x = 16 is an optimal
solution in this discrete case7 .
2.7 Example (the newsboy problem). A newsboy (a firm) must place an
order for x newspapers (perishable items) to meet a demand thats known
only in a probabilistic sense, i.e., is a random variable with known distribution function P () = prob [ ]. A paper costs cents and is sold for
cents; unsold papers cant be returned. The newsboy wants to choose x to
maximize expected profit.
Detail. The newsboy problem, a classic inventory problem, is a problem of
the same type as the producers except that it is one of maximizing instead
of minimizing. The optimality conditions are essentially the same as those
for the Broadway show producer problem.
The stochastic programming model of the producers problem starts from
a deterministic version of the problem:
min x

such that x =

where is an estimate of the shows run. Of course, this problem is trivial to


solve. But the actual decision problem isnt deterministic. It is quite likely
that the length of the shows run will not match exactly the projected run
Since the length of run is not known at the time of decision, the actual
.
costs to be attributed to the decision have to be evaluated after the decision
is made.
The model for the decision/evaluation process follows the overall schema:
decision: x ; observation: ; evaluation: x.
The objective function is thus made up of two terms: the first term x
corresponding to contractual costs that are known at the time of signing
7

Remark: The solution of the producers problem may not be unique. Uniqueness will
fail whenever there is an interval on which P takes the value ( )/. With = 3 and
= 7, again, and uniformly distributed on [10, 14][17, 20], the solution x
is any point in
the interval [14, 17]. Since can never take on any value between 14 and 17, one might be
tempted to choose the lowest possible value for x
, namely 14. But in terms of the stated
objective, minimization of expected costs, it doesnt matter which point gets selected.

2.5. THE BROADWAY PRODUCER PROBLEM

45

of the contract, and the second term evaluating the decision in terms of
expected costs to come after the random event is observed. The second term
is called the expected recourse cost:
(
0
when y 0,
E{q( x)} where q(y) =
y when y 0.
The recourse cost function is q( x).
q

Figure 2.7: The cost function q.

2.8 Exercise (alternative expression for recourse cost). Show that the cost
function q (with > 0), a function commonly used to define recourse costs,
admits the alternative representations:
q(y) = max [ 0, y ]
= min {y + | y + y = y, y + 0, y 0}.
With E{} denoting expectation with respect to the distribution function
P , the stochastic programming formulation of the producers problem is:


min x + E q( x) .
x

After integration, one obtains the deterministic equivalent of the stochastic


program, stated here in terms of a continuous distribution P with density p,
but valid in the more general case:
Z
min x +
( x)p() d.
x

46

CHAPTER 2. FORMULATION

This is precisely the problem that was solved earlier.


This formulation of the producers problem illustrates the important features of a stochastic program: the decision stages of the problem in relation
to the arrival of information, and the recourse costs being obtained as the
expected value of an optimization problem that will be solved after full information is available. Many issues of stochastic programming may be illustrated by this simple producers problem. There are many instances when
the developments to come can first be explored for this simple example, in
which everything is well understood, then the intuition gained can guide the
application to more complex decision problems.

Chapter 3
PRELIMINARIES
At the congenital level, one makes a distinction between two classes of optimization problems, namely those that are convex and those that are nonconvex1 . Fortunately, a major portion of the optimization problems that
have to be dealt with practice are convex; all examples in Chapter 2 fall in
this class. In the two first sections of this chapter, we build the basic tools to
analyze convex optimization problems, and in particular, whats needed to
generalize Oresme and Fermat rules. The three last sections, set up a minimal
probabilistic framework that will allow us to deal with (convex) stochastic
optimization problems and commence the study of expectation functionals.

3.1

Variational analysis I

The analysis of deterministic and stochastic programming models relies on


the tools and framework of Variational Analysis. Our concern at this point
is mainly, but not exclusively, with finite-valued convex functions, but the
exposition will already touch on the interplay between a function and its
epigraph that occupies such a pivotal role in Variational Analysis and, in
particular, in the theoretical foundations of optimization. The definitions and
accompanying notation are consistent with the extensions and generalizations
required in the sequel. General facts about convexity will be covered in this
1

Of course, non-convex problems have large subfamilies that possess particular properties that can be exploited effectively in the design of solution procedures, e.g., combinatorial optimization, optimization problems with integer variables, complementarity
problems, equilibrium problems, and so on.

47

48

CHAPTER 3. PRELIMINARIES

section. The next one will be devoted to a more detailed analysis of convex
functions and their (sub)differentiability properties.
A subset C of IRn is convex if for all x0 , x1 C, the line segment [x0 , x1 ]
C, i.e.,
x = (1 )x0 + x1 C for all [0, 1].
Note that x0 , x1 dont have to be distinct, and thus if C consists of a single
point its convex; the condition is also vacuously satisfied when C = , the
empty set. Balls, lines, line segments, cubes, planes are all examples of
convex sets. Sets with dents or holes are typical examples of sets that fail to
be convex, cf. Figure 3.1.

Figure 3.1: Convex and non-convex sets.


Given [0, 1], one refers to x as a convex combination of x0 and x1 .
More generally, given any collection x1 , . . . , xL IRn , then any point

x =

L
X

l x

for some

l=1

l 0, l = 1, . . . , L, such that

L
X

l = 1

l=1

is a convex combination of x1 , . . . , xL . The set


1

C = con(x , . . . , x ) = x =

L
X
l=1

L
X


l x
l = 1, l 0, l = 1, . . . , L
l

l=1

is the convex hull of x1 , . . . , xL ; cf. Figure 3.2.


Convexity is preserved under the following operations:

3.1. VARIATIONAL ANALYSIS I



49


x2

x1

xL



Figure 3.2: Convex hull of a finite collection of points

3.1 Exercise (intersections,


products and

linear combinations). Given a
collection of convex sets C i , i = 1, . . . , r one has:
T
(a) C = ri=1 C i is convex, actually, the intersection of an arbitrary collection of convex sets is still convex;
(b) C = C 1 C 2 C r is convex. For C 1 IRn1 , C 2 IRn2 ,



C 1 C 2 := (x1 , x2 ) IRn1 +n2 x1 C 1 , x2 C 2 .

 Pr

P
i i
i
(c) i IR, C = ri=1 i C i :=
is convex.
i=1 i x x C
C1
C2
C3

x0
x1

   


  
C1 
   

1C 1 + 2 C2

C2

!!!!!!!!!
!!!!!!!!!

Figure 3.3: Operations resulting in convex sets.

3.2 Exercise (affine transformation and projection). Given L : IRn IRm ,


an affine mapping, i.e., L(x) = Ax + b where A is a m n-marix and b IRm ,

50

CHAPTER 3. PRELIMINARIES




the set L(C) = z = Ax + b x C is convex whenever C IRn is convex.
In particular, L(C) is convex when its the projection of the convex set C on
a subspace of IRn .

%#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
&#
&#
&#
&#
&#
&#
&#
&#
&#
&#
&#
&%&%&%
%#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%#
%
%
%
%
%
%
%
%
%
%
%
%#
%
%
%
%
%
%
%
%
%
%
%
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
&%&%%
%#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
&#
&#
&#
&#
&#
&#
&#
&#
&#
&#
&#
%#
%
%
%
%
%
%
%
%
%
%
%
&&%&%
%#
%
%
%
%
%
%
%
%
%
%
%
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
%#
%
%
%
%
%
%
%
%
%
%
%
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
%#
%
%
%
%
%
%
%
%
%
%
%
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
&%%&%
C%#
%#
%#
%#
%#
%#
%#
%#
%#
%#
%#
%#
&
&
&
&
&
&
&
&
&
&
&
%#
%
%
%
%
%
%
%
%
%
%
%
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
#
&
%#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%&#
%#&#
% &#
% &#
% &#
% &#
% &#
% &#
% &#
% &#
% &#
% &#
% &&%&%
" $#
" $#
" $#
" $#
" $#
" $#
" $#
" $#
" S $#
" = I$"R2
$#
" $#
" $#
" $#
" $#
" $#
" $#
" $#
" $#
" $#
" $"
$#
"#
"#
$" "#
$" "#
$" "#
$" proj
$" "#
$" C"#
$" "#
$" "#
$" "#
$" "$"
$#
#
$
#
$
#
$
#
$
#
$
#
$
#
$
#
$
" $#
" $#
" $#
" $#
" $#
" $#
" $#
" $#
" $$"
$"#$#
$#

proj C
S

Figure 3.4: Projection of a convex set.

Guide. The first statement only requires a simple application of the definition of convexity. For the second, simply write the projection as an affine
mapping.
Warning: projections preserve convexity but the projection of a closed convex

x2
set is not necessarily
closed.
A
simple
example:
let
C
=
(x
,
x
)
1
2

1/x1 , x1 > 0 , then the projection of C on the x-axis is the open interval
(0, 1), This cant occur if C is also bounded, cf. Proposition 8.9.
A particularly important subclass of convex sets are those that are also
cones. A ray
from the origin, i.e., a set of the
is a closed half-line emanating
n

type x 0 for some 0 6= x IR . A set K IRn is a cone if 0 K
and x K for all x K and > 0. Aside from the zero cone {0}, the
cones K in IRn are characterized as the sets expressible as nonempty unions
of rays. The following sets are all convex cones:
{0}, IR + sn , IRn and (closed)

n
half-spaces, i.e., sets of the type x IR ha, xi 0 with a 6= 0.
3.3 Exercise (convex cones). A nonempty set C is a convex cone if and
only if x1 , x2 C implies x1 + x2 C.

3.1. VARIATIONAL ANALYSIS I

51

)(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )')'
)'()'()(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )')'
)(')'()(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )')'
)(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )(')(' )')'

+(*+(* +(*+(* +(*+(* +(*+(* +(*+(* +(*+(* +*+*


*(++*(*(++(* *(++(* *(++(* *(++(* *(++(* *++*
+(*+*(+(*+(* +(*+(* +(*+(* +(*+(* +(*+(* +*+*
+*(+(* +(*+(* +(*+(* +(*+(* +(*+(* +(*+(* +*+*
0

Figure 3.5: Cones: convex and non-convex.

A function f is convex relative to a convex set C if for all x0 , x1 C:



f (1 )x0 + x1 (1 )f (x0 ) + f (x1 ) for all [0, 1].
It is strictly convex relative to C if for all distinct x0 , x1 C:
f

Figure 3.6: Convex and non-convex functions.

f (1 )x0 + x1

< (1 )f (x0 ) + f (x1 ) for all (0, 1).

In particular, if C = IRn the preceding inequalities must be satisfied for all


x0 , x1 in IRn . A function f is concave relative to a convex set C if f is
convex relative to C.
One can bypass the need for constant reference to the set on which the
function f is defined if we adopt, as we will, the following framework: Rather
than real-valued functions, one considers extended real-valued functions with
the value assigned to those points that are outside their effective domain.
More specifically, with a function f0 : C IR with C IRn , one associates
a function f : IRn IR defined by
(
f0 (x) if x C,
f (x) =

otherwise.

52

CHAPTER 3. PRELIMINARIES

f
x

Figure 3.7: Strictly convex and not strictly convex functions.

The effective domain of a function f : IRn IR is denoted





dom f = x IRn f (x) < .

f0
f
C = dom f

Figure 3.8: Extended real-valued convex function.

Why ? This choice follows from the minimization set-up. Indeed,


argmin f = argmin f0 (x),
xC

inf f = inf f0 (x),


xC

i.e., the minimizers of the two functions are the same and so is the value of
their minimum (=infimum)2 . Addition and multiplication in the extended
2

If we had opted for maximization as our canonical set-up, then the extension would
have assigned the value to the points outside the effective domain.

3.1. VARIATIONAL ANALYSIS I

53

reals follows the usual rules, including 0 = 0, except for =


. This is going to be our extended arithmetic convention, but again this
convention is consistent with the view that points at which a function takes
the value lie outside its effective domain.
The convexity of a function isnt really affected by such an extension.
3.4 Exercise (convexity for extended real-valued functions). Show that f0 :
C IR is a convex function relative to the convex set C IR n if and only
if f : IRn IR is convex, where f = f0 on C and f on IRn \ C.
For any convex function f : IRn IR, dom f is convex.
We only need to adjust the definition of strict convexity, namely, a function
f : IRn IR is strictly convex if the restriction of f to dom f is strictly convex
relative to dom f .
Linear functions f (x) = ha, xi and affine functions f (x) = ha, xi + are
convex and concave. The exponential function ex and the absolute value
function |x| are examples of real-valued convex functions. The sine function
sin x can serve as a typical example of non-convexity.
The epigraph of a function f : IRn IR is the set



epi f := (x, ) IRn+1 f (x) ,
i.e., epi f consists of all points in IRn+1 that lie on or above the graph of
f . Linking the geometrical properties of the epigraph with the analytical
properties of functions is one of the most useful tools of Variational Analysis.

epi g

epi f
f

Figure 3.9: Epigraphs.


The hypograph of a function f : IRn IR is the set



hypo f := (x, ) IRn+1 f (x) ,

54

CHAPTER 3. PRELIMINARIES

i.e., hypo f consists of all points in IRn+1 that lie on or below the graph of f .
3.5 Proposition (convexity of epigraphs). A function f : IRn IR is convex if and only if epi f IRn+1 is convex.
Its concave if and only if hypo f IRn+1 is convex.
Proof. The convexity of epi f means that whenever (x0 , 0 ), (x1 , 1 ) epi f
and (0, 1), the point (x , ) := (1 )(x0 , 0 ) + (x1 , 1 ) belongs to
epi f . This is the same as saying that whenever f (x0 ) 0 and f (x1 ) 1 ,
one has f (x ) .
The assertion about concavity follows by symmetry, passing from f to
f .
The epigraph is not the only convex set associated with a convex function.

3.6 Exercise (convexity of level sets and argmin). Let f : IRn IR be


convex. Then, for all IR, the level sets



lev f := x IRn f (x) .



and the set of minimizers argmin f := x IRn f (x) inf f are convex.
Moreover, if f is strictly convex then argmin f is a single point whenever
its nonempty; for example, if f = ex , the function f is strictly convex but
argmin f = .
f

lev f

Figure 3.10: Level set of a convex function.


Guide. Use the convexity of epi f and IRn {}, and apply 3.1(a).

3.1. VARIATIONAL ANALYSIS I

55



3.7 Exercise (convexity of max-functions). Let f i , i I be a collection
of convex functions. Then the function f (x) = supiI f i (x) is convex.
f

f4

f3
f1

f2

Figure 3.11: Max-function.


Guide. Observe that epi f =

iI

epi f i and appeal to 3.1 and 3.5.

3.8 Example (convex indicator functions). The indicator function C :

epi C

Figure 3.12: An indicator function and its epigraph.


IRn [0, ) of a set C IRn is convex if and only if C is convex, where
(
0 if x C,
C (x) =
otherwise.

56

CHAPTER 3. PRELIMINARIES

Follows from 3.1(b), 3.5, and 3.6 since epi C = C IR+ and C = lev0 C .
3.9 Proposition (inf-projection of convex functions). Let f : IR n IR be
the inf-projection of the convex function g : IRm IRn IR, i.e., for all
x IRn ,
f (x) = inf uIRm g(u, x).
Then f is a convex function.

epi g
epi f
u

x
Figure 3.13: Inf-projection of a convex function
Proof. Follows from 3.2 and 3.5 since epi f is the vertical closure of the
projection,
(u, x, ) 7 (x, ),

of epi g IRm+n+1 on the subspace IRn+1 . By vertical closure one means


that (,
x) is included in epi f whenever (, x) epi f for all >
; its
immediate that taking vertical closure preserves convexity.
3.10 Exercise (convexity under linear transformations). Let f : IRm IR
be a convex function. For any m n-matrix A and a IRm , the function
g : IRn IR defined by g(x) = f (Ax + a) is a convex function.
Sublinear functions, a subclass of convex functions, play a central role in
the subdifferentiation of convex functions. functions. A function f : IRn

57

3.1. VARIATIONAL ANALYSIS I

Figure 3.14: Two sublinear functions.

IR is sublinear if f is convex and positively homogeneous, i.e., 0 dom f ,


f (x) = f (x) for all x IRn , > 0. Typical examples are f (x) = ha, xi,
f (x) = |x| and f (x) = supiI hai , xi, the supremum of a collection of linear
functions.
3.11 Exercise (sublinearity criteria). For f : IRn IR with f (0) = 0,
sublinearity is equivalent to either one of the following conditions:
(a) epi f is a convex cone,
(b) for all x1 , x2 IRn , f (x1 + x2 ) f (x1 ) + f (x2 ).
Our primordial interest in convexity, at least in Variational Analysis,
comes from the theorem below that relates local and global minimizers in
the presence of convexity; refer to 1.2 for the definition of local minimizers.
3.12 Theorem (local and global minimizers). Every local minimizer of a
convex function f : IRn IR is a global minimizer. Moreover, there is only
a single minimizer of f when its strictly convex.
Proof. If x0 and x1 are two points of C with f (x0 ) > f (x1 ), then x0 cannot
furnish a local minimum of f , cf. Definition 1.2: every ball centered at x0
contains points x = (1 )x0 + x1 with (0, 1) that satisfy f (x )
(1 )f (x0 ) + f (x1 ) < f (x0 ). Thus, there cant be any locally optimal
solutions outside of argmin f where global optimality is achieved.
If f is strictly convex, then x0 , x1 dom f cant be distinct points that
minimize f . In view of the preceding, one necessarily would have f (x0 ) =

58

CHAPTER 3. PRELIMINARIES

f (x1 ). But then, strict convexity would imply that f (x) < f (x0 ) for every
point x (x0 , x1 ).

3.2

Variational analysis II

This section continues the study of the properties of convex functions but we
are now mostly concerned with their (sub)differentiability properties. The
class of functions to which we can apply the classical optimality conditions
of Chapter 1, doesnt include many that come up in the mathematical programming context. Restricting the development to models involving only
differentiable (convex) functions would leave by the wayside all constrained
optimization models, and they include the large majority of the applications.
One needs a calculus that applies to functions that are not necessarily differentiable. Eventually, this will enable us to formulate Oresmes and Fermats
rules for convex functions that arent necessarily differentiable, or even continuous. This Subdifferential Calculus is introduced in the this section and
will be expanded throughout the ensuing development.
For the sake of exposition, and so that the readers can drill their intuition,
1-dimensional functions are featured prominently in this section. In some
instances, for the sake of simplicity, the proof of a statement is only provided
for 1-dimensional convex functions3 .
3.13 Proposition (continuity of convex functions). A real-valued convex
function f defined on IRn is continuous.
Proof. The proof is for n = 1. It will be sufficient to show that f : IR IR is
continuous at 0; continuity at any other point x follows from the continuity
at 0 of the function g(z) = f (z + x). By symmetry, it suffices to show
that f (0) = lim f (x ) for any sequence x & 0 with x (0, 1]. From the
convexity of f , one has for all :
f (x ) (1 x )f (0) + x f (1),
f (0)
3

x
1

f
(x
)
+
f (1),
x + 1
x + 1

A complete proof would have required additional background material that would let
us stray too far from the objectives of these lectures; for proofs in full generality, one can
consult [3], [18, Chapter 2], for example.

3.2. VARIATIONAL ANALYSIS II

59

that can also be written as


x f (1) + (x + 1)f (0) f (x ) (1 x )f (0) + x f (1).
The first and last terms both tend to f (0) as x & 0, implying that also
f (x ) f (0); that is, f is continuous at 0.
A number of criteria simplify checking for convexity. Some of these are
related to the properties of the slopes of convex functions. From the definition
of a convex function f : IR IR, for x0 < y < x1 , the following slope
inequalities hold:
f (x1 ) f (x0 )
f (x1 ) f (y)
f (y) f (x0 )

.
y x0
x1 x 0
x1 y

s1
x0

s2

s3
y

x1

Figure 3.15: Slope inequality for convex functions.

3.14 Proposition (one-dimensional derivative tests). When f is a differentiable function on an open interval O IR, each one of the following
conditions is both necessary and sufficient for f to be convex on O:
(a) f 0 is nondecreasing on O, i.e., f 0 (x0 ) f 0 (x1 ) when x0 < x1 in O;
(b) f (y) f (x) + f 0 (x)(y x) for all x and y in O;
(c) f 00 (x) 0 for all x in O (assuming twice differentiability).
Proof. The equivalence between (a) and (c) when f is twice differentiable
is well known from elementary calculus. The proof can be limited to showing

60

CHAPTER 3. PRELIMINARIES

that [convexity] (a) (b) [convexity]. If f is convex, one has for


x0 < x 1 ,
f (x0 ) f (x1 )
f (x1 ) f (x0 )
0 0
=
f 0 (x1 )
f (x )
1
0
0
1
x x
x x
from the inequalities for slopes of convex functions. On the other hand, if (a)
holds we have for any y O that the function gy (x) := f (x)f (y)f 0 (y)(x
y) has gy0 (x) 0 for all x (y, ) O but gy0 (x) 0 for all x (, y) O.
Then gy is nondecreasing to the right of y but non-increasing to the left, so
it attains its global minimum over O at y. Thus (b) holds. Starting now
from (b), consider the family of affine functions lx (y) = f (x) + f 0 (x)(y x)
indexed by x O. Then f (y) = maxxO lx (y) for all y O, so f is convex
on O by 3.7.
Functions whose convexity can be established by the criteria in Proposition 3.14, in particular by the second-derivative tests, are:
f (x) = ax2 + bx + c on (, ) when a 0;
f (x) = xr on (0, ) when r 1;
f (x) = eax on (, );
f (x) = log x on (0, ).
The counterpart of f 00 0 for functions defined on IRn is to have their
Hessian positive semi-definite. A (n n)-matrix Q is positive semi-definite
if hx, Qxi 0 for all x and positive definite if hx, Qxi > 0 for all x 6= 0. We
dont insist on having Q symmetric, although hx, Qxi depends only on the
symmetric part, i.e., the matrix 21 (Q + Q>). For every symmetric positive
semi-definite matrix Q one can find, via diagonalization, a matrix R such
that Q = R>R; when Q is positive definite R is invertible, but thats not
necessarily the case when Q is just positive semi-definite.
3.15 Theorem (higher-dimensional derivative tests). For a differentiable
function f : IRn IR, each of the following conditions is both necessary and
sufficient for f to be convex on O:
(a) hx1 x0 , f (x1 ) f (x0 )i 0 for all x0 , x1 in IRn ;
(b) f (y) f (x) + hf (x), y xi for all x, y IRn ;
(c) the Hessian matrix, 2 f (x), is positive semi-definite for all x (assuming
twice differentiability).

3.2. VARIATIONAL ANALYSIS II

61

A sufficient, but not necessary, condition for strict convexity is that the
Hessian matrix is positive definite for all x IRn .
Proof. From the definition of convexity, f is convex on IR n if and only if
its convex on every line segment. This is equivalent to the property that for
every choice of y and z, the function 7 g( ) = f (y + z) is convex, and
D
E
D
E
g 0 ( ) = z, f (y + z) , g 00 ( ) = z, 2 f (y + z)z .
The asserted conditions for convexity and strict convexity of f are equivalent
to requiring in each case that the corresponding condition in Proposition 3.14
is satisfied for all such functions g.

3.16 Example (quadratic functions). A function f is said to be quadratic


if f (x) = 21 hx, Qxi + hp, xi + where the (n n)-matrix Q is symmetric and
is a constant. Then f (x) = Qx + p and 2 f (x) Q, so f is convex if
and only if Q is positive semi-definite. Moreover, a function f of this type is
strictly convex if and only if Q is positive definite.
Detail. Note that the positive definiteness of the Hessian is being asserted
as necessary for the strict convexity of a quadratic function, even though it
was only listed in Theorem 3.15 merely as sufficient. The reason is that if Q
is positive semi-definite, but not positive-definite, theres a vector z 6= 0 such
that hz, Qzi = 0, and then along the line through the origin in the direction
of z its impossible for f to be strictly convex.
Another way to establish convexity is by showing that the function was
obtained from more elementary convex functions by operations that preserve
convexity. Among all such operations, none is used more often, especially in
stochastic programming, than taking positively (or non-negatively) weighted
sums of convex functions.
3.17 Exercise (convexity under addition and multiplication). Given any
nonnegative scalars {i 0, i I} and {f i : IRn IR, i I} a finite
P
collection of convex functions, then the function f = iI i f i is also convex.

In, Chapter 1, our mathematical prelude, we have seen that when f is


smooth, we have at our disposal an operational calculus that can be used to
identify points at which f is minimized. At points of non-differentiability,

62

CHAPTER 3. PRELIMINARIES

the notion of tangent, at least in the classical sense, isnt defined; slope
cant be defined uniquely. One then must be satisfied with something less
than derivatives and gradients, namely with what we are going to call subderivatives and subgradients4 .
3.18 Definition (subdifferentiation of convex functions). For f : IRn IR
convex, the subderivative at x dom f is the function df (
x; ) defined by
f (
x + w) f (
x)
;

&0

df (
x; w) = lim

df (x; w) is the subderivative of f at x in direction w. The set of subgradients


of f at x is



f (
x) := v IRn f (x) f (
x) + hv, x xi, x IRn .

f()

Figure 3.16: Function f (x) = x2 + [,] (x) and subgradients.


When the convex function f is differentiable at x, the derivative and the
subderivative coincide, indeed,
lim ( ) = lim ( ) = lim ( ) where ( ) =

&0

%0

f (
x + w) f (
x)
,

which implies that the subderivative is then linear. More generally one has:
4

The prefix sub must be interpreted as identifying something less than. Subderivatives and subgradients can be defined for any function, not just convex ones. The definitions are then somewhat more involved [18, Chapter 8], but in the convex case, they
coincide with those introduced here.

3.2. VARIATIONAL ANALYSIS II

63

f
_
x

_
df(x;. )

_
x

Figure 3.17: Subderivatives and subgradients.

3.19 Exercise (sublinearity of the subderivative). Given f : IR n IR convex and x such that f (
x) is finite, the subderivative w 7 df (
x; w) at x is
sublinear.
Hence, the epigraphs of the subderivatives of convex functions are convex
cones.
Guide. Show that df (
x; ) is convex and positively homogeneous.
Its immediate from the definition that v is subgradient of f at x if and
only if the affine function x 7 a(x) = hv, xi + [f (
x) hv, xi] is majorized by
f and a(
x) = f (
x), cf. Figure 3.18 The set of subgradients could be empty.
Indeed one could very well be at a point x at which there is no affine function
thats majorized bypf and takes on the value f (
x) at x. For example, the
2
function f (x) = 1 (x + 1) + [2,0] at x = 0 has infinite slope and
thus there is no affine function majorized by f that takes on the value 0 at
x = 0. Thus, f (0) = , cf. Figure 3.19; note df (0; 1) = , df (0; 1) =
while df (0; 0) = 0.
3.20 Proposition (subderivatives and subgradients). Let f : IRn IR be

64

CHAPTER 3. PRELIMINARIES

<v, .
>+

[f(x_
)

_
x

<v, _
x>]

_
x

Figure 3.18: Slope of affine functions majorized by f .

f
x

Figure 3.19: Empty set of subgradients at x = 0.

a convex function and f (


x) IR. Then,



f (
x) = v IRn hv, wi df (
x; w), w IRn
is a convex set and, assuming f (
x) 6= ,



w : df (
x; w) = maxv hv, wi v f (
x) .

Moreover, if f is differentiable at x with gradient f (


x),
f (
x) = f (
x);

df (
x; w) = hf (
x), wi w IRn .

Proof. If df (
x; w) = for some w, there is no v such that hv, wi
and consequently f (
x) must be empty. So, in the remainder of the proof

3.2. VARIATIONAL ANALYSIS II

65

lets posit that df (


x; ) > and begin with the following observation: The
convexity of f implies that for all w IRn , the ratio 1 [ f (
x + w) f (
x) ]
is (monotonically) non-increasing as & 0. So,
hv, wi df (
x; w) = lim 1 [ f (
x + w) f (
x) ] w IRn ,
&0

if and only if
f (
x + w) f (
x) + hv, x + w xi w IRn , 0,
or equivalently, if and only if
f (x) f (
x) + hv, x xi x IRn ,
since every x IRn can be written as x + w for some w IRn and 0.
And, this is the case if and only if v f (
x).
Certainly f (
x) is convex when its empty, otherwise from the definition
it follows readily that for all [0, 1], v = (1)v 0 +v 1 f (
x) whenever
0 1
v , v f (
x).
If f is differentiable at x, then for all w IRn ,
df (
x; w) = lim 1 [ f (
x + w) f (
x) ] = hf (
x), wi.
0

The preceding argument yields f (


x) = f (
x).
For 1-dimensional real-valued convex functions its possible to give explicit expressions for their subderivatives and subgradients in terms of the
left and right hand derivatives5 at x:
f (x) f (x )
,

&0

f 0 (x ) = lim

f (x + ) f (x)
.

&0

f 0 (x+ ) = lim

Since 1 [ f (
x + w) f (
x) ] is non-increasing as & 0, refer also to Fig0
ure 3.15, both f (x ) and f 0 (x+ ) are well defined since monotone sequences
always have limits in the extended reals, and f 0 (x ) f 0 (x+ ).
Because left and right dont have a meaning in IR n for n 2, the expression for
1-dimensional subderivatives and subgradients cant be trivially generalized to the ndimensional case.
5

66

CHAPTER 3. PRELIMINARIES

3.21 Exercise (1-dimensional subderivatives and subgradients). For a convex function f : IR IR,
df (x; 1) = f 0 (x ),

df (x; 1) = f 0 (x+ ),

and
f (
x) = [ f 0 (
x ), f 0 (
x+ ) ],
i.e., the set of subgradients is the interval spanned by the left and right hand
derivatives. Moreover, the subderivative is the max-function:

df (x; w) = max [ f 0 (x )w, f 0 (x+ )w ] = max [ vw v f (x) ].

If f is differentiable at x, w 7 df (x; w) = f 0 (x)w is linear. Finally, f is


differentiable at x if and only if the interval [ f 0 (
x ), f 0 (
x+ ) ], is reduced to a
0
single point, in which case f (
x) = {f (
x)}, the slope of f at x
Guide. Rely on 3.19 to obtain
(
|w|df (x; 1) = wf 0 (x ) if w 0,
df (x; w) =
wdf (x; 1) = wf 0 (x+ )
if w 0,
i.e., df (x; w) = max [ f 0 (x )w, f 0 (x + )w ] since f 0 (x ) f 0 (x+ ). One can also
write this as df (x; w) = max [ vw v f (x) ]. If f is differentiable at x,
f 0 (x ) = f 0 (x+ ) = f 0 (x).
For f : IR IR, for v f (
x), the subgradient inequality implies that
for all > 0,
f (
x + ) f (
x) + v(
x + x),

f (
x ) f (
x) + v(
x x),

that can be rearranged to read: for all > 0,


f (
x + ) f (
x)
f (
x) f (
x )
v
.

Passing to the limit as & 0, this becomes f 0 (


x ) v f 0 (
x+ ). From this it
0
0 +
will follow that v f (
x) if and only if v [ f (
x ), f (
x ) ].
Examples of subderivatives of well-known functions include:

w if x < 0,
f (x) = |x| : df (x; w) = |w| if x = 0,

w
if x > 0;

3.2. VARIATIONAL ANALYSIS II

67

f (x) = max [x2 2x, 3x2 +


4x]:

2(3x + 2)w

max [ 8w, 14w ]


df (x; w) = 2(x 1)w

max [ 2w, 4w ]

2(3x + 2)w
f (x) = ex :

if
if
if
if
if

x < 3,
x = 3,
3 < x < 0,
x = 0,
x > 0;

df (x; w) = ex w.

3.22 Example (subdifferentiation of sums, multiples). For f1 , f2 : IR IR


convex and 1 , 2 0, and f = 1 f1 + 2 f2 ,
df (x; w) = 1 df1 (x; w) + 2 df2 (x; w)
f (x) = 1 f1 (x) + 2 f2 (x).
Detail. The subderivative identity is a direct consequence of 3.21. The same
holds for subgradient after observing that f 0 (x ) = 1 f10 (x ) + 2 f20 (x ) and
f 0 (x+ ) = 1 f10 (x+ ) + 2 f20 (x+ ).
3.23 Proposition (subgradients of a separable function). For j = 1, . . . , n,
let fj : IR IR be convex and f (x) = f1 (x1 ) + f2 (x2 ) + + fn (xn ). The
function f is said to be separable. Then, for all x dom f with f (x) finite,
f (
x) =

n
Y

fj (
xj ).

j=1

In particular, when f is differentiable, one has



f (
x) = f (
x) = f10 (
x1 ), f20 (
x2 ), . . . , fn0 (
xn ) .

Proof. One has, v f (


x),

f (x) f (
x) + hv, x xi, x IRn ,
n
n
n
X
X
X

fj (xj )
fj (
xj ) +
vj (xj xj ), x IRn ,
j=1

for all j:
for all j:

j=1

j=1

fj (xj ) fj (
xj ) + vj (xj xj ), xj IR,
vj fj (
xj ).

68

CHAPTER 3. PRELIMINARIES

Clearly, the penultimate assertion implies the previous one. For the converse,
consider those x IRn that are such that for i 6= j, xi = xi and xj IR.
To complete the proof, when the functions fj are differentiable, one simply
appeals to 3.20 and 3.21.
What does all of this lead up to? Subdifferential Calculus allows us
characterize the minimizers of any convex functions. Moreover, in view of
Theorem 3.12, these conditions become necessary and sufficient for a point
x to be a minimizer.
3.24 Theorem (generalized Oresme and Fermat rules). Let f : IR n IR
be convex with f (x ) IR. Then
Oresme rule: x argmin f df (x ; ) 0,
One also has,
Fermat rule: x argmin f 0 f (x ).
Proof. Indeed, x argmin f f (x ) f (x)+h0, xx i for all x IRn
and this occurs if and only if df (x ; w) 0 for all w, and in view of 3.20, if
and only if 0 f (x ).

3.3

Plain probability distributions

Three of the examples of Chapter 2 ended up as the minimization of an


expectation functional :
Ef (x) := E{f (, x)},
subject to certain constraints. The expectation was defined in terms of a
discrete or continuous distribution function P for the random elements .
To return to the simplest case, for the producers problem (2.4), P is the
distribution function of the random run length , the integration domain is
some subset of IR and the function f is defined by
(
x
if x 0,
f (, x) =
x + ( x) if x > 0.

3.3. PLAIN PROBABILITY DISTRIBUTIONS

69

The analysis of expectation functionals is basic to the study of stochastic


programming models. It can also serve as fertile testground for the tools of
Variational Analysis that have just been introduced.
In this section, we are going to provide some simple concepts of probability
and integration on IRN that will be sufficient for the development that is to
follow.
Random quantities are displayed in bold face. Thus, refers to a random variable, whereas denotes a possible value of . Values assumed by
random variables are unknown, but the probability that a random variable
might appear in some subset of its range can be measured. For example,
next years total precipitation in the watershed of a large lake is a quantity
that is unknown. An analysis of historical data would give insight into the
plausible range of values, and if the record were detailed enough, would yield
a probability distribution P from which one could calculate the probability
that the total annual rainfall might appear in a given interval. One could
also consider a vector of values measuring more than one quantity, in which
case we would have a joint distribution in IRN , as in the network capacity
expansion problem. The symbol P denotes both the probability distribution:
for a set A IRN ,

P (A) = prob [ A],

and the (cumulative) distribution function


P : IRN [0, 1],

P (z) = prob[ z].

This expedient abuse of notation never creates any problems because in the
latter case the argument is a point in IRN , and in the former its a subset
of IRN . The support of a probability distribution P on IRN is the smallest
closed set IRN such that P () = 1; this will be made precise below.
Two important classes of probability distributions on IRN are the discrete
and the continuous distributions. We refer to a random variable as being discrete if its support consists of a countable number of points. The distribution
function Pd of such a random variable will have at most a countable number
of jumps and the random variable is said to be discretely distributed. For
example, a random variable with a Poisson distribution,
prob[ = k] = e

k
,
k!

k = 0, 1, . . . , for > 0 ,

70

CHAPTER 3. PRELIMINARIES
1

= prob[ < z]
P(z)

Figure 3.20: Plain distribution function.


0.25
0.2
0.15
0.1
0.05
0

10

12

14

16

18

20

Figure 3.21: Poisson-probability mass distribution, = 4

In the general case of discretely distributed random variable on IR N , the


support d is the closure of the set of points for which prob[ = ] > 0;
in IR, these are precisely the points at which the distribution function has a
jump. With
p := prob[ = ] = Pd ({}), d
one has
Pd (z) =

z, d

p ,

Pd (A) =

p .

Ad

When the support is a finite set, the random vector is said to be finitely
supported , or equivalently, to have finite support6 . For example, a random
6

In this book, whenever we deal with random variables that are discrete, they will
nearly always be finitely supported

3.3. PLAIN PROBABILITY DISTRIBUTIONS

71

variable with a binomial distribution has finite support;


 
n k
(1 )(nk) for k = 0, 1, . . . , n,
prob[ = k] =
k
and some (0, 1).
0.25
0.2
0.15
0.1
0.05
0

10

12

14

16

18

20

Figure 3.22: Binomial-probability mass distribution, n = 20, = 2/5


The distribution function Pc of a continuously distributed random vector
is the n-dimensional integral of a probability density function p : IR N IR+ ,
namely for z IRN ,
Z
Z z1
Z z2
Z zN
Pc (z) =
p() d =
d1
d2
dN p().
z

The support of Pc is the closure of the set where the density function is
positive

c := cl{ p() > 0}
In some instances, one may have to distinguish between random vectors with
bounded support, like a uniformly distributed random variable with
(
1/( ) if [, ],
p() =
0
otherwise,

and those with unbounded support, like an exponentially distributed with,


for some > 0,
(
e if 0,
p() =
0
if < 0,

72

CHAPTER 3. PRELIMINARIES
p

p
1

Figure 3.23: Uniform and exponential densities

or normally distributed random variable with


p() =

1
2
2
e() /2
2

for > 0 and IR.


0.4
0.3
0.2
0.1
0
3

Figure 3.24: Normal density, = 1, = 1.5


Most practical applications are covered by distribution functions P on IRN
that are either discrete or continuous. In our development, we are going to
limit the class of probability distributions that we consider to plain probability
distributions that are mixtures of discrete and continuous distributions, i.e.
of the type, cf Figure 3.20:
P := Pd + Pc ,

where

P () = Pd (d ) + Pc (c ) = 1,

and = d c is the union of the support of the discrete and continuous parts. The overriding reason for considering this class of probability

3.3. PLAIN PROBABILITY DISTRIBUTIONS

73

distributions is that it includes both the discrete and the continuous cases!
Consequently, anything proved for plain distributions applies to discrete as
well as to continuous distributions. Plain distributions are all whats needed
to deal with all the applications we have in mind. Moreover, its low technology! We dont need to go through to the full foundations of Probability
Theory7 , a superb, but substantial, monument.
All probability distributions on IR are plain, but that isnt the case for all
probability distributions on IRN ; simply think of the probability distribution
associated with a random variable uniformly distributed on the perimeter of
a circle of radius 1 in IR2 . Restricting ourselves to plain probability distributions simplifies the definition of what is to be understood by expectation and,
as already indicated, bypasses the measure theoretic technicalities required
to deal with the general case. So, in this book, all probability distributions
will be assumed to be of this (practical) plain type. With p the probability
associated with the points in the support of the discrete part and p() the
density function not necessarily summing up to 1 associated with the
continuous part, the probability of a set A is given by
Z
X
P (A) =
p +
p() d,
Ad

assuming that A is nice enough so that one can carry out the integration.
The expectation of a function h : IRN IR with respect to such a probability distribution P is then
Z
Z
X
h()p +
h()p() d
E{h()} =
h() P (d) :=
IRN

IRN

If h is continuous, the value of the integral of h with respect to Pc is based


on the familiar procedure from calculus. The expectation functional in the
producers problem, in particular, is continuous in for fixed x. However, in
the stochastic programming context, we do have to allow for certain discontinuities in h, viz. h is piecewise continuous with 1 , . . . , N , a finite number
of nice subsets of IRN that partition and on each piece l , h is continuous. Such functions belong to pc-fcns(; IR) that, with some terminology
7

All the results for expectation functionals defined by plain distributions, can actually
be proved for expectation functionals defined by (general) probability distributions, often
with the same arguments.

74

CHAPTER 3. PRELIMINARIES

abuse, we shall designate as the class of piecewise continuous functions. Of


course, integrating, or equivalently taking the expectation of, such functions,
doesnt require any special concern if the probability distribution P is discrete: the expectation is still obtained as a finite or countable sum. When the
distribution function P is continuous, with p the associated density function,
E{h()} :=

h() P (d) =
IRN

N Z
X
k=1

h()p() d;
k

in this expression, its taken for granted that those pieces k are nice (=
measurable in Measure Theory).
A function h pc-fcns(; IR) is said to be summable, with respect to
a plain distribution P , if E{h()} is finite. A vector-valued mapping v
pc-fcns(, IRn ) is summable, if for each index i, the function vi : IR is
summable.
3.25 Proposition
(properties of the integral).

Let P be a plain probability

measure and h, h pc-fcns(; IR), IN . Then


(a) Linearity: for 1 , 2 IR,
Z
Z
Z
1
2
1
[1 h () + 2 h ()] P (d) = 1 h () P (d) + 2 h2 () P (d),
(b) Order Preserving:
1

h h

h () P (d)

h2 () P (d).

(c) Dominated Convergence: Assume that |h | g for g a summable


function, and lim h () = h() for all . Then
Z
Z

lim
h () P (d) = h() P (d)

Proof. The assertions in (a) and (b) follow immediately from the properties
of sums when P is a discrete distribution, and of standard integration when
P is a continuous distribution. The combination of these two facts yields (a)
and (b) for plain distributions.
The same approach can be used to obtain the Dominated Convergence
(c) when P is a discrete distribution or when P is a continuous distribution

3.4. EXPECTATION FUNCTIONALS I

75

and h is continuous on . This still works when h and the h are piecewise
continuous functions and the pieces on which these functions are continuous
are the same, i.e., dont depend on . The statement remains valid without
this additional restriction, but the proof requires more background material
about integration than whats at our disposal.

3.4

Expectation functionals I

Expectation functionals are functions of the following type:


Z
x 7 Ef (x) = E{f (, x)} =
f (, x) P (d),

defined on IRn and, in general, with values in IR; here P is the distribution of
the random variable . In our present state of affairs, P is a plain probability
distribution and f (, x) pc-fcns(; IR). Usually, we shall denote expectation
functional by the name of the function being integrated, the integrand,
preceded by an E.
Until Chapter 9, we only deal with expectation functionals Ef that are
real-valued and whose integrands (, x) 7 f (, x) are convex in x for all
. This section is concerned with the convexity and the subdifferentiability properties of expectation functionals. We conclude with a remarkable
characterization of the minimizers of expectation functionals.
3.26 Proposition (convex expectation functionals). Let f : IR n IR.
Assume that for all , x 7 f (, x) is convex and that Ef is finite-valued,
where
Z
Ef (x) :=
f (, x) P (d),

Then, the expectation functional Ef : IRn IR is convex.

Proof. For any x0 , x1 with x = (1 )x0 + x1 , [0, 1], for all ,


one has
f (, x ) (1 )f (, x0 ) + f (, x1 ).

Integrating on both sides with respect to P yields Ef (x ) (1 )Ef (x0 ) +


Ef (x1 ), making use of the linearity and the order preserving property of
the integral, cf. Proposition 3.25.

76

CHAPTER 3. PRELIMINARIES
In Figure 3.25 Ef is the expectation function of the integrand
(
x
if x ,
f (, x) = 1
2
when x .
2 (x )

when is uniformly distributed on [0, 2].


5

f(.,1)

4
3

f(.,2)

4
3

f(.,1)

1
0
2

Ef(.)

0
2

Figure 3.25: Integrands and expectation functional

3.27 Proposition (subdifferentiation of expectation functionals). Let f :


IRn IR. Assume that for all , x 7 f (, x) is convex and that the
expectation functional Ef is finite-valued. Then
Z
dEf (x; w) =
df (, x; w) P (d)

and
Ef (x) =

nZ


o

v() P (d) v() f (, x), v : IRn summable .

Proof. As far as the identity involving the subderivative of Ef is concerned,


one needs only to show that
Z
1
lim
[f (, x + w) f (, x)] P (d)
&0

Z
= [ lim 1 (f (, x + w) f (, x)) ] P (d).
&0

3.4. EXPECTATION FUNCTIONALS I

77

f( . ,x )

v( .)

Figure 3.26: Subgradients of the integrand f (, x) and summable v()

This interchange of limits and integration is justified by the Dominated Convergence property
assumption.
Ef is finite-valued on IR, by
 R 3.25(c), since
n

Let D =
v() P (d) v() f (, x), v : IR summable .

To show that D Ef (
x), simply observe that v D implies Rthat there is
a summable vector-valued function v : IRn such that v = v() P (d)
and
hv(), x xi f (, x) f (, x
), x IRn , .
Integrating on both sides with respect to yields
h
v , x xi Ef (x) Ef (
x),

x IRn ,

that, in turn, implies that v Ef (


x).
The converse, viz., D Ef (
x), when n > 1, is not easy to prove without
appealing to relatively sophisticated tools, cf. [18, Chapter 14]. We have to
content ourselves, at this point, with a proof for the case n = 1. From the
definition for left and right derivatives and our formula for the subderivative
of Ef , one has
Z
Z
0
(Ef ) (x ) = dEf (x; 1) = df (, x; 1) P (d) = f 0 (, x )P (d),
Z
Z
0 +
(Ef ) (x ) = dEf (x; 1) = df (, x; 1) P (d) = f 0 (, x+ )P (d).

78

CHAPTER 3. PRELIMINARIES

Now relying on the characterization of the subgradients of 1-dimensional


convex functions in 3.21, if v [(Ef )0 (x ), (Ef )0 (x+ )] then
v = (1 )(Ef )0 (x ) + (Ef )0 (x+ ) for some [0, 1].

The function v defined by v () := (1 )f 0 (, x ) + f 0 (, x+ ) is such that


and

v () f (, x) = [f 0 (, x ), f 0 (, x+ )],

v () P (d) = v.

Oresme and Fermat rules lead us to the following characterization of the


minimizers of expectation functionals. Its one of the fundamental8 results in
stochastic optimization. A number of versions of this theorem will appear as
various stochastic optimization models are being analyzed. One remarkable
implications of this theorem is that one could find the minimizer of the expectation functional Ef by minimizing f (, x) perturbed by the appropriate
linear term hv(), xi for just one individual .
3.28 Theorem (minimizers of convex expectation functionals). Let f :
IRn IR and assume that for all , x 7 f (, x) is convex and that the
expectation functional Ef is finite-valued. Then x minimizes Ef on IRn if
and only if there exists a summable function
Z
n
v : IR such that
v() P (d) = 0, v() f (, x ), ,

R
n
or equivalently, v : IR such that v() P (d) = 0 and


: x argmin f (, x) hv(), xi .
xIRn

Proof. Certainly if such a function v exists, it follows from the preceding


proposition that 0 Ef (x ). By Fermats rule 3.24 this implies that x
argmin Ef .

On
R the other hand, if 0 Ef (x ), the existence of a function v such
that v() P (d) = 0 and v() f (, x ) is guaranteed by 3.27. The
equivalence between the two conditions x argminx f (, x) hv(), xi and
v() f (, x ) can be validated via 3.24.
8

To fully appreciate its significance, one really needs to see how it brings to light the
relationship between deterministic and stochastic optimization models that can then be
exploited, at least conceptually, to build a number of solutions procedures for stochastic
optimization models.

3.5. ANALYSIS OF THE PRODUCERS PROBLEM

3.5

79

Analysis of the producers problem

The characterizations of optimality discussed in the previous section can


now be applied to the Broadway producers problem; refer to 2.4 for the
formulation of the problem. As already pointed out in 2.4, the problem can
be formulated as the minimization of an expectation functional:
(
x
if < x,
min Ef (x) with f (, x) =
x0
( )x + if x.
As long as E{} is finite, as we now assume, Ef (x) is finite for all x. From
3.26 we know that the convexity of Ef will follow from that of f (, ). One
can write f as
f (, x) = x + q( x)

where

q(y) = max [ 0, y ].

The function q is a max-function and thus convex by 3.7. The convexity of


x 7 q( x) follows from 3.10. For all , f (, ) is the sum of two convex
functions and hence, also convex by 3.17.
To compute the subderivative and subgradients of Ef , one can rely on
3.20 and 3.19:

if x < ,
( )w
( )
df (, x; w) = max [ ( )w, w ], f (, x) = [( ), ]
if x = ,

if x > .
The subderivatives of the expectation functional Ef are then obtained by
integration using the formula in 3.27:
Z
X
dEf (x; w) =
df (, x; w)p + df (, x; w)p() d.
d

A similar formula can be written down for the subgradients, but its informative to go through a more detailed analysis. If
(i) x
/ d , then P ({x}) = 0 and the distribution function of is continuous at x:
Ef (x) = ( )(1 P (x)) + P (x)
= ( ) + P (x).

80

CHAPTER 3. PRELIMINARIES

f(, )

Figure 3.27: Subgradient of the cost function: producer problem.

In this case, the subgradient Ef at x is a singleton, thus the function Ef ,


being convex, is differentiable at x and Ef (x) = Ef (x).
(ii) x d , then P ({x}) > 0 and the distribution function of has a jump
at x: with P (x ) = lim % x P () = P (x) px . From the formula above for
the subgradients of f (, ) and 3.27, one has


Ef (x) = ( ) + P (x ), ( ) + P (x) .

In this situation, the expectation functional Ef is not differentiable at x.


Observe that the expression in (i) for the subgradient is just a special case
of that in (ii), thus one may as well just work with the latter.
Since Ef is convex, the condition 0 Ef (x ) is necessary and sufficient
for a point x to be a minimizer of Ef , i.e., an optimal solution of the
producers problem (assuming is a nonnegative random variable). Thus x
must satisfy the generalized equation:


0 ( ) + (P (x ) px ), ( ) + P (x )
that can also be expressed as

P (x ) px

P (x );

recall that px = P ({x }). Not surprisingly, this is the same optimality
condition as the one obtained in 2.4.

3.5. ANALYSIS OF THE PRODUCERS PROBLEM

81

3.29 Exercise (pointwise characterization of optimality). Let x, v be such


that
(
Z

if < x,
v() =
v(
x) [ , ],
v() P (d) = 0.
if > x,
Then x is an optimal solution of the producers problem.
Guide. Appeal to 3.28.
3.30 Exercise (expected monitoring costs). Let f (, x) = ( x) where
is the monitoring function introduced in 2.3, i.e.,

if < 0,
0
2
( ) = /2
if [0, ],

/2 if > .

Let be uniformly distributed on [0, 1], i.e., the density function is given by
(
1 if [0, 1],
p() =
0 if
/ [0, 1].
Find analytic expressions for the subderivatives and subgradients of Ef .
Conclude from these calculations that Ef is differentiable.

82

CHAPTER 3. PRELIMINARIES

Chapter 4
LINEAR CONSTRAINTS
Since one can always substitute the constraint fi (x) 0 for fi (x) 0, and
one can replace max f0 (x) by min f0 (x), with no loss of generality, we
choose
min f0 (x)
so that fi (x) 0,

i = 1, . . . , s,

fi (x) = 0, i = s + 1, . . . , m,
x X IRn ;

as our canonical formulation of a mathematical program, an optimization


problem with a finite number of variables and a finite number of constraints.
The feasible set is the subset S of IRn determined by the constraints,


S = x X fi (x) 0, i = 1, . . . , s,


fi (x) = 0, i = s + 1, . . . , m ,

and the effective objective, f : IRn IR, defined by


(
f0 (x) if x S,
f (x) =

otherwise.

captures in a single function, albeit now extended real-valued, the complete


definition of the problem. Our mathematical program can equivalently be
formulated as
min f (x) for x IRn ,
83

84

CHAPTER 4. LINEAR CONSTRAINTS

the unconstrained minimization of an extended real-valued function. For


purposes of analysis, as we shall see, its fruitful to study optimization problems in this condensed form.
Our canonical mathematical program is said to be a convex program, if
for i = 0, . . . , s, fi : IRn IR is convex,
for i = s + 1, . . . , m, fi : IRn IR is affine,
the set X IRn is a closed convex set.
The designation convex can be justified by pointing to the convexity of the
constraints and the objective but there is a more compelling reason.
4.1 Theorem (convexity of the effective objective function). For a convex
program, its feasible set S is closed and convex and its effective objective,
f : IRn IR, is a convex function whose epigraph is a closed set.
Proof. The convexity of the functions fi for i = 1, . . . , m and fi for i =
s + 1, . . . , m implies that these functions are continuous, cf. Proposition 3.13,
that in turn implies that their level sets lev 0 fi are closed sets. It also implies,
via 3.6, that these level sets are convex. Since
S=X

s
\

i=1

lev0 fi

m
 \

i=s+1

(lev0 fi lev0 (fi )) ,

its the intersection of a finite number of closed convex sets, and thus closed
and convex, cf. 3.1(a).
The epigraph of the objective function f is closed since its the intersection
of the epigraph of a continuous function and the set S IR = epi S . And its
convex for basically the same reason: the epigraph of f0 is convex, cf. 3.5,
and S IR is the product of two convex sets, cf. 3.1(b). Because its epigraph
is convex, the function f itself is convex, again by Proposition 3.5.
So, the condensed version of a convex program,
min f (x)

for

x IRn .

is simply the minimization of a convex function on IRn , a problem we have


already considered in 3.2. The convexity of f allows us to identify the

4.1. LINEARLY CONSTRAINED PROGRAMS

85

minimizers of f in terms of the generalized rules of Oresme and Fermat 3.24


and moreover these conditions are necessary and sufficient since every local
minimizer of a convex function is also a (global) minimizer, Theorem 3.12.
However, to translate Oresme and Fermat rules into conditions that are
verifiable, one needs a subdifferential calculus that allows us to compute
the subderivatives and subgradients of effective objective functions, or more
generally, of extended real-valued convex functions. Before we turn to this
task, lets first equip ourselves with a minimal taxonomy for convex programs.

4.1

Linearly constrained programs

The major distinction to be made between convex program types is between


those that are, and those that are not, linearly constrained. And this has
implications not just at the theoretical level but also in the design of numerical solution procedures. In fact, this distinction is important not just for
convex programs but for any type of optimization problem. Not unexpectedly, linearly constrained problems are easier to handle both theoretically
and computationally.
Fortunately, a large majority of the practical optimization problems turn
out to be linearly constrained. Although the motivating applications of the
examples in Chapter 2 were quite diverse, they were nearly all linearly constrained.
A mathematical program is linearly constrained if
all the constraints are linear constraints, i.e., for i = 1, . . . , m,
fi (x) = hai , xi i

are affine,

i.e. each fi is a linear function plus a constant.


X is a polyhedral set, i.e., a set that itself can be specified by a finite
number of linear constraints.
The feasible set of a linearly constrained problem,



S = x X hai , xi i , i = 1, . . . , s, hai , xi = i , i = s + 1, . . . , m

is itself a polyhedral set. Indeed, since X can be specified by a finite number


of linear constraints, the total number of (linear) constraints defining S is
finite.

86

CHAPTER 4. LINEAR CONSTRAINTS

/-/-0-// 0-// 0-// 0-// 0-// 0-// 0-// 0-// 0//


/-/-0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 00//
/-/-0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 00//
/-/-0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 00//
/-/-0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 0-0-// 00//
0-0-0-0-0-0-0-0-0

2-12-1 2-12-1 2-12-1 2-12-1 2-12-1 2-12-1 2121


1-21 1-21 1-21 1-21 1-21 1-21 121
2-2-1 2-2-1 2-2-1 2-2-1 2-2-1 2-2-1 221
21-21-2-12-1 2-12-1 2-12-1 2-12-1 2-12-1 2121
2-121-2-12-1 2-12-1 2-12-1 2-12-1 2-12-1 2121
21-2-1 2-1 2-1 2-1 2-1 21

.-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .,.,


.,-.,-.-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .,.,
.-,.,-.-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .,.,
.,-.-, .-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .-,.-, .,.,

4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4343


43-43-4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4343
4-343-4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4343
43-4-3 4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4-34-3 4343

Figure 4.1: Polyhedral sets

4.2 Proposition (polyhedral sets). Polyhedral sets are closed and convex.
Proof. A polyhedral set can be expressed as
C=

r
\
l=l

lev0 fl

m
 \

(lev 0 fl lev 0 (fl )) ,

l=r+1

where the functions fl are affine. Convexity now follows from 3.1 and 3.6.
Since affine functions are continuous their level sets are closed, and hence C
is closed since its the intersection of a finite number of closed sets.
4.3 Example (box constraints). A set X IRn is a box if its the product
of n closed intervals Xj IR, not necessarily bounded. For instance, the
nonnegative orthant



IRn+ := x = (x1 , . . . , xn ) xj 0 for all j = [0, )n
is a box; and so is IRn itself. Boxes are polyhedral sets.

Two major subclasses of linearly constrained convex programs are:


linear programs Also, the criterion f0 (x) = hc, xi is linear;
quadratic programs The criterion f0 (x) = 12 hx, Qxi + hc, xi is a quadratic
function with Q a symmetric, positive semi-definite matrix. The convexity of f0 follows directly from 3.16 and 3.17.
In addition, we want to recognize a class of mathematical programs, that
includes both linear and certain quadratic programs, but are not necessarily

4.2. VARIATIONAL ANALYSIS III

8698686
9 86
9 86
9 89
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98
8696
8 96
8 96
8 96
8 98

5 76
5 76
5 76
5 76
5 76
5 76
5 75
76
5 76
5 76
5 76
5 76
5 76
5 75
75676
5 76
5 76
5 76
5 76
5 76
5 75
75676
5 76
5 76
5 76
5 76
5 76
5 75
75676
5 76
5 76
5 76
5 76
5 76
5 75
75676

87

:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:
:6;6
: ;6
: ;6
: ;6
: ;6
: ;:

<6
= <=
=<6=<
=<6=<
=<6=<
=<6=<
=<6=<
=<6=<

Figure 4.2: Boxes

linearly constrained. A function f : IRn IR is said to be separable if f is of


the form

f (x) =

n
X
j=1

fj (xj ) with each function fj : IR IR.

Of course, affine functions are separable and so are quadratic functions when
the matrix Q is a diagonal matrix. A mathematical program (in canonical
form) is a
separable program if for i = 0, . . . , s, all the functions fi are separable,
P
i.e., are of the form fi (x) = nj=1 fij (xj ) with fij : IR IR and the
set X IRn is a box. A linear program is a separable program. For
a quadratic program, one can always find an equivalent separable program; this requires a change of variables predicated by the diagonalization of the positive (semi-)definite matrix, cf. the comments preceding
Theorem 3.15.

4.2

Variational analysis III

Our goal in this section and the next, is to develop the tools that will enable
us to write down optimality conditions for the linearly constrained convex

88

CHAPTER 4. LINEAR CONSTRAINTS

program:
min f0 (x)
so that hAi , xi bi ,

i = 1, . . . , s,

hAi , xi = bi , i = s + 1, . . . , m,
x X IRn ;

where X is a polyhedral set, not necessarily bounded. One can think of the
vector Ai as the ith row of a matrix A and of bi as the ith entry of a vector
b, and a compact version of this program would then read
min f0 (x) so that Ax ./ b, x X,
where ./ would consist of s less than or equal to-inequalities and m s
equalities. We already know that the feasible set S is polyhedral and that
the effective objective, f : IRn IR, with
f (x) = f0 (x) + S (x),
is convex. Recall that S denotes the indicator function of S; refer to 3.8 for
the properties of indicator functions.
We also know that a minimizer x of the effective objective f , i.e., an optimal solution of our linearly constrained convex program, must obey Oresmes
and Fermats rules 3.24:
df (x ; ) 0,

0 f (x ).

Thats straightforward enough! What we dont know, at this stage, is how


to compute df (x; ) and f (x). In particular, we dont know if df (x; ) =
df0 (x; ) + dS (x) and f (x) = f0 (x) + S (x); in other words, if subdifferentiation and addition can be interchanged. As we shall see, this question
is totally settled by the Moreau-Rockafellar sum rule, a key component of
Subdifferential Calculus. We are going to exclude from our calculus, the uninteresting class of functions that are said to be improper, namely, whose
effective domain is empty or that take on the value ; only seriously defective formulations of optimization problems could ever involve improper
functions. A function f : IRn IR is proper if dom f 6= and f > .

4.2. VARIATIONAL ANALYSIS III

89

4.4 Theorem (sum of subgradients: inclusion). Given any two proper convex functions f, g : IRn IR, for x dom(f + g):
df (x; ) + dg(x; ) = d(f + g)(x; ),
f (x) + g(x) (f + g)(x).

Proof. Because the ratio 1 f (x + w) f (x) monotonically decreases as
& 0, one has


f (x + w) f (x) + g(x + w) g(x)
d(f + g)(x; w) = lim & 0


g(x + w) g(x)
f (x + w) f (x)
+ lim & 0
= lim & 0

= df (x; w) + dg(x; w);


the extended arithmetic convention takes care of the cases when these limits
are not finite. Not only does the monotonicity guarantee the existence of the
limits, but it also excludes the possibility
of having one of ratios 1 f (x +

w) f (x) or 1 g(x + w) g(x) tend to and the other tend to
while the limit of their sum remains bounded.
The inclusion for the subgradients of the sum follows simply from the
Definition 3.18 of subgradient. Indeed,
u f (
x) x IRn :
v g(
x) x IRn :

hu, x xi f (x) f (
x),
hv, x xi g(x) g(
x),

imply
x IRn :

hu + v, x xi (f + g)(x) (f + g)(
x),

that immediately yields (u + v) (f + g)(


x).

For the subgradientspof a sum, equality doesnt hold in general. For


example, with f (x) = 1 (x + 1)2 + [2,0] and g(x) = [0,1] ,
f (0) = , g(0) = IR ,

but

(f + g)(0) = IR.

This simple example already pinpoints the cause of the problem, there is no
substantial overlap of dom f and dom g. Their intersection consists exclusively of boundary points of these sets. We need a sum rule that will exclude
such situations.

CHAPTER 4. LINEAR CONSTRAINTS

90

f+g
0

1
0

Figure 4.3: A situation when the sum rule doesnt apply.

To make more precise what is to be understood by substantial overlap,


lets recall that int C, the interior of a set C IRn , consists of all points x
in C that are such that for some > 0, arbitrarily small, also IB(x, ) C;
the boundary of C is bdry C := cl C \ int C where cl C is the closure of C
that consist of all points in C in addition to all limits points of (sequences
of) points in C.
4.5 Theorem (Moreau-Rockafellar sum rule). For f, g : IR n IR proper
convex functions such that int dom f dom g 6= ,
x dom(f + g) :

(f + g)(x) = f (x) + g(x);

so, in particular, equality certainly holds whenever f is finite-valued.


This is not the sharpest version of Moreau-Rockafellars sum rule, but its
all whats required to handle linearly constrained convex programs. At this
stage, we dont have the tools to prove this theorem, but we shall do so as a
consequence of the duality theory developed in Chapter 9.
4.6 Corollary (sum with an indicator function). Let f : IRn IR be convex
and C IRn a non-empty closed convex set. Then, for all x C,
(f + C )(x) = f (x) + C (x).
Proof. Apply Moreau-Rockafellars sum rule after noting that int dom f =
IRn , i.e., C int dom f = C which is non-empty by assumption.
.
Our next task is to characterize the subgradients of C , in particular when
C is a polyhedral set.

0.5

2
f

g 0.5
0

91

4.2. VARIATIONAL ANALYSIS III

0.5

Figure 4.4: The Moreau-Rockafellar rule applies at x = 0

4.7 Definition (normals and normal cone). Let C IRn be convex and
x C. A vector v is a normal to C at x, written v NC (
x), if
hv, x xi 0, x C.



T
Since NC (
x) = zCx v IRn hv, zi 0 is the intersection of closed
half-spaces, its a closed convex cone, called the normal cone to C at x.1

> @?
> @?
> @?
> @?
> @?
> @?
> @?
> @>
@?
> @?
> @?
> @?
> @?
> @?
> @?
> @>
@>?@?
> @?
> @?
> @?
> @?
> @?
> C@?
> @>
@>?@?
> @?
> @?
> @?
> @?
> @?
> @?
> @>
@>?@?

A B?
A B?
A B?
A B?
A B?
A B?
A BA
B?
A B?
A B?
A B?
A B?
A B?
A BA
BA?B?
A B?
A B?
A B?
A B?
A C B?
A BA
BA?B?
A B?
A B?
A B?
A B?
A B?
A BA
BA?B?
v

v
0

C D?
C D?
C D?
C D?
C DC
D?
C D?
C D?
C D?
C DC
DC?D?
C D?
C D?
C C D?
C DC
DC?D?
C D?
C D?
C D?
C DC
DC?D?
F?F?F?F?F
F?
F?
F?
F?
FE _
E?
E?
E?
E?
F?
?
F
?
F
?
F
E?
E?E?
E?FEF NC (x)
F?
E?F?
E?F?
E?F?
E?E
E?E?E?0 E?E

Figure 4.5: Convex sets and their normal cones.


1

One can also define a normal to a set that is not necessarily convex. Instead of
hv, x x
i 0, x C, one requires that hv, x x
i o(|x x|), x C, cf. [18, Chapter
6]. Its easy to show that this more general definition is consistent, in the convex case,
with that used here.

92

CHAPTER 4. LINEAR CONSTRAINTS

Moreover,
C (
x) = NC (
x).
Indeed, since C (
x) = 0,


C (
x) = v hv, x xi C (x), x IRn ,

The inequality is certainly going to be satisfied whenever x


/ C since then
C (x) = . So it suffices for the inequality
to be satisfied when
x C,


but then C (x) = 0, i.e., C (
x) = v hv, x xi 0, x C . Thus, the
set of subgradients of C at x coincides with NC (
x). One may reformulate
Corollary 4.6 in the following terms:
4.8 Corollary (sum with indicator function, 2nd version). Let f : IRn IR
be convex and C IRn , a non-empty closed convex set. Then, for all x C,

f + C (x) = f (x) + NC (x).
This brings us to the following variant of Fermats rule:

4.9 Proposition (minimizers of a convex function on a convex set). Let


f : IRn IR be convex and C IRn , a non-empty closed convex set. Then,
x argmin f (x) 0 f (x ) + NC (x ),
xC

equivalently,
x argmin f (x) v f (x ) such that v NC (x ).
xC

Proof. Because f + C is a convex function, Fermats rule 3.24 yields 0


(f + C )(x ). The optimality conditions conditions in our statement then
follow from Moreau-Rockafellars sum rule, in particular 4.8.
To deal with the composition of two or more function, we are also going
to need a chain rule, for now a simple one will do2 .
2

A versatile chain rule is at the heart of Subdifferential Calculus just like the standard
chain rule is the key to a rich Differential Calculus, but his doesnt come cheap, cf. [18,
Chapter 10]. Because we deal with convex functions and linear transformations, the simple
chain rule stated here will do.

4.2. VARIATIONAL ANALYSIS III

93

4.10 Theorem (chain rule: linear case). Let f (x) = g(Ax + b) where g :
IRm IR is proper, convex and A is a (m n)-matrix, b IRm . Then
f : IRn IR is convex and for x dom f and
(a) u g(Ax + b) = A>u f (x);

(b) A is invertible, v f (x) (A>)1 v g(Ax + b).


If g is also finite-valued, then for all x IRn ,
(c) f (x) = A>g(Ax + b).

Proof. Appeal to 3.10 for the convexity of f . For (a) and (b), one simply
goes back to the Definition 3.18 of the set of subgradients recalling that
hv, Axi = hA>v, xi.

The ingredients missing for a complete proof of (c) are the same as those
required for a proof of the Moreau-Rockafellar sum rule. By relying on 3.21
we could easily sketch a 1-dimensional proof, i.e., when f (x) = g(ax + b)
and a, b are scalars. But thats already covered by (b) when a 6= 0, and the
identity is trivially satisfied when a = 0.

To see that a complete identification between the subgradients of g and


those of f is not in the cards: Let g(y) = y and consider the linear transformation Ax + b = 0x1 + 0x2 + 1. Then, f (x1 , x2 ) 1 and f (x1 , x2 ) {(0, 0)}
while g(y) = {1} for all y.

g
y

G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IG
IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G _IH
G IH
G IH
G IG
IGHIH
G IH
G IH
G IH
G IH
G 1IH
G IH
G IH
G fIH
G =IH
G 1IH
G IH
G IG
IGHIH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IH
G IG
IGHIH
x2

Figure 4.6: The chain rule doesnt apply

x1

94

4.3

CHAPTER 4. LINEAR CONSTRAINTS

Variational analysis IV

When our convex program is linearly constrained, the feasible set C is polyhedral. This section is concerned with computing the normal cones of polyhedral sets. As we shall see, the optimality conditions of the next couple of
sections are pretty straightforward consequences of Proposition 4.9 and these
formulas for the normal cones of polyhedral sets.
4.11 Exercise (normals of hyperplanes and half-spaces). Let 0 6= a IR n
and IR. When,



C = x IRn ha, xi =
is a hyperplane and for x C:

v NC (
x) IR such that v = a.
When,




C = x IRn ha, xi

is a closed half-space and for x C:

v NC (
x) 0 such that v = a, (ha, x
i ) = 0.

POP

_
NC (x)

NMN

_
x

_
NC (x)

RQR

JKJK
L JK
L JK
L JK
L JK
L JK
L JL
JKLK
J LK
J LK
J LK
J LK
J LK
J LJ
JKLK
J LK
J LK
J LK
J LK
J CLK
J LJ
JKLK
J LK
J T LK
J LK
J LK
J LK
J LJ
JKLK
J LK
J STS LK
J LK
J LK
J LK
J LJ
JKLK
J LK
J _ LK
J LK
J LK
J LK
J LJ
JKLK
J LK
J x LK
J LK
J LK
J LK
J LJ

Figure 4.7: Normal cones to a hyperplane and a half-space at x


Guide. When C is a hyperplane with x C, observe that v NC (
x) if and
only if v NS (0) where S = C  x = x ha,
xi
=
0
,
a
linear
subspace
of

n


IR . Its immediate that S = a IR NS (0). There remains only
to show that if v
/ S , i.e. v = a + v for some R and 0 6= v S, there
is a x S such that hv, x
i > 0.

4.3. VARIATIONAL ANALYSIS IV

95

When x C and C is a half-space, one has to consider the two cases


(i) ha, xi < , and (ii) ha, xi = . In the first case, x int C and then
NC (
x) = {0}. In the second case, when = ha, x
i, certainly NC (
x) S , it
thus only remains to show that in fact a NC (
x) if and only if 0.
4.12 Exercise (normals of polyhedral sets.). Let C be the non-empty polyhedral set,


C = x hAi , xi bi , i = 1, . . . , s, hAi , xi = bi , i = s + 1, . . . , m,

and x C. Then,

v NC (
x) 1 0, . . . , s 0, s+1 IR, . . . m IR,
such that
v=

m
X

i A i

and for i = 1, . . . , s,

i=1

i (hAi , xi bi ) = 0.

Even more precisely, if x is such that among the s inequalities those with
index i I {1, . . . , s} are satisfied with equality, the remaining ones
holding with strict inequality, i.e., hAi , x
i < bi for i I = {1, . . . , s} \ I, then
{i IR, i > s},
v NC (
x) {i 0, i I}, {i = 0, i I},
such that
v=

m
X
i=1

i A i

and for i I,

i (hAi , xi bi ) = 0.

The constraints with index i I {s + 1 . . . , m} are said to be the active


constraints at x; those with index i I are said to be inactive.
Guide. Consider first the cases when C is the intersection of just (i) two
hyperplanes, (ii) two closed half-spaces and (iii) a hyperplane and a closed
half-space, and deal separately with the situations when x satisfies the inequality strictly or when equality holds (as in the second part of the Guide
for Exercise 4.11). Use a recursion argument to extend the result to an arbitrary, but finite, number of linear inequalities and equalities.

96

CHAPTER 4. LINEAR CONSTRAINTS

_
NC (x)

_
x
C

0
Figure 4.8: Normal cone to a polyhedral set at x

The conditions
i (hAi , xi bi ) = 0

for

i = 1, . . . , s,

are known as the complementary slackness conditions. They encapsulate the


requirements:
when the inequality hAi , xi bi is slack at x, i.e., hAi , x
i < bi , then
the corresponding inequality i 0 must be tight, i.e., i = 0;
on the other hand, for i to be (strictly) positive, i.e., i > 0, the
inequality hAi , xi bi must be tight at x: hAi , x
i = bi .
Of course, for i I or i > s, these complementary slackness conditions
are automatically satisfied. In the first instance, i = 0, and in the second
hAi , xi = bi . So, instead of requiring that the complementary slackness
conditions be satisfied for just i I, one wouldnt introduce any additional
requirements if one would require that they hold for all i = 1, . . . , m.
4.13 Exercise (normals of the intersection of polyhedral sets). Let C1 , C2
be polyhedral sets and x C1 C2 . Then, NC1 C2 (
x) = NC1 (
x) + NC2 (
x).
Guide. Express C1 and C2 in terms of the linear constraints defining them,
and then rely on 4.12.

4.3. VARIATIONAL ANALYSIS IV

97

4.14 Example (normals of an interval). For C = [, ] IR, a closed


interval and x C,

IR if x = ,
NC (
x) = 0
if < x < ,

IR+ if x = .

Hence,

(
u 0, u( x) = 0,
w N[,] (
x) u, v such that w = u + v,
v 0, v( x) = 0.

v
NC(x)

Figure 4.9: Graph of the normal cone as a function of x when C = [, ].


Detail. The expression for NC (
x) can be derived from 4.12, of course. But
its easier to go back to the definition of the normal cone. One way to
remember the formula is to observe that the pair (
x, v) must belong to the
graph of NC given by Figure 4.9. When = , the downward half-line
at is missing, and similarly, when = , the upward half-line at is
missing. In the degenerate case when = , the graph is just the vertical
line passing through .
Now that we have a description of N[,] (
x), one simply verifies the formula for the normals w by considering all three possibilities x = , x (, )
and x = .

98

CHAPTER 4. LINEAR CONSTRAINTS

4.15 Example (normals of a box). For C =


Cj IR is a closed interval and x C, one has

Qn

j=1

Cj a box where each

v NC (
x) j : vj NCj (
xj ).
In particular, if C = IRn+ ,
v NIRn+ (
x) j :

vj 0 and vj xj = 0.

Detail. Again this can be derived from 4.12. But more information is gained
from an alternative approach: use the identity NCj (
xj ) = Cj (
xj ) and
Qn
Proposition 3.23 to conclude that when C is a box, NC (
x) = j=1 NCj (
xj ).
The assertion then follow from 4.14.

4.4

Lagrange multipliers

Lets begin our analysis of linearly constrained problems with those that
only involve equality constraints. Optimality conditions are obtained first by
relying on the generalized Fermat rule and the calculus for normals developed
in the previous section. We then take a more classical approach based on
Fermats rule for smooth functions, see 1.1. This allows us to highlight the
demarcation between what can be dealt with in the classical framework and
what falls by the wayside.
4.16 Proposition (convex program: equality constraints). Consider the
convex program,
min f0 (x) so that Ax = b;
where A is an mn-matrix and f0 : IRn IR. Then x is an optimal solution
if and only if
(a) Ax = b,
and there is a y IRm such that
(b) A> y f0 (x ).
In particular, if f0 is also differentiable at x , one must have
(b) A> y = f0 (x ).


Proof. With C = x Ax = b , it follows from 4.9 that x minimizes f0
on C if and only if one can find v f0 (x ) such that v NC (x ). The

4.4. LAGRANGE MULTIPLIERS

99

normals to C, the intersections of m hyperplanes, can be computed with the


help of 4.12. One has,

v NC (x ) 1 IR, . . . , m IR, such that v =

m
X

i A i .

i=1

With yi = i , one has (b). When f0 is differentiable, f0 (x ) is the singleton


{f0 (x )}.

Our derivation of the preceding optimality conditions relies on the generalized Fermat rule, but its also possible to deduce these conditions from
the classical version of Fermats rule at least when f0 is smooth, m n, and
the rows Ai of A are linearly independent (the matrix A has rank m); this
last condition is always satisfied after deletion of the redundant equations.
When A is of rank m, one can always, by pivoting for example, find a system
equivalent to Ax = b of the form IxB + N xN = b where xB consists of m
of the n variables x1 , . . . , xn , and xN of the remaining n m variables. So,
without loss of generality, lets assume that actually A = [ I, N ] is of that
form and then xB = (x1 , . . . , xm ), xN = (xm+1 , . . . , xn ). Hence
    

xB
b
N
x=
=
+
xN .
xN
0
I
With d := (b, 0), D > := [ N > , I ] and g(xN ) = f (d+DxN ), our optimization
problem can be solved as follows: find xN that minimizes g on IRnm and set
x = d + DxN . Because f0 is smooth, so is g and
g(xN ) = D > f0 (d + DxN ) = D > f0 (x ).
Fermats rule tells us that for xN to minimize g, one must have
0 = g(xN ) = D > f0 (x ).
In other words, f0 (x ) must belong to the null space of D > and this can
only occur if f0 (x ) is a linear combination of the rows of [ I, N ] = A, i.e.,
if there exists y IRm such that f0 (x ) = A> y.
This is our condition 4.16(b), except that here we havent used the convexity
of f0 which allowed us to claim that Fermats rule yields a sufficient as well
as a necessary condition for optimality.

100

CHAPTER 4. LINEAR CONSTRAINTS

Except for the matrix notation, the tools used in our second derivation
were those available to J.L. Lagrange3 , when he came up with the preceding
optimality conditions. Actually, Lagrange considered a more general class of
optimization problems, namely,
min f0 (x) so that G(x) = 0,
where f0 : IRn IR and G : IRn IRm are smooth, m n and the mnJacobian of G,

m,n
Gi
G(x) =
,
(x)
xj
i,j=1
has (full) rank m at x . An argument almost identical to the one above4
leads us to the following necessary conditions for optimality:
(i) G(x ) = 0,
(ii) y IRm such that f0 (x ) = G(x )> y.
Taking into account our smoothness assumptions, condition (ii) can also be
written:
y IRm such that x argminx f0 (x) G(x)> y.
Each yi plays the role of a multiplier; the magnitude of yi suggesting how
much the ith constraint affects the choice of the optimal solution. To remind
us that Lagrange was responsible for this breakthrough in characterizing the
optimal solution of equality constrained problems, one refers to the multipliers, y1 , . . . , ym , as Lagrange multipliers.
Excluding some technical embellishments, this is as far as the classical
tools will take us: the minimization of a smooth function f0 on a manifold
M that is differentiable, has no boundaries and is determined
by a number of

smooth equality constraints, i.e., of the form M = x G(x) = 0 . Once we
have to cope with inequality constraints, or even with equality constraints
that dont describe a nice manifold, we have to switch to the more potent
tools of Subdifferential Calculus.
3

Joseph Louis Lagrange (1736-1813) was the leading mathematician of his time. Among
many other achievements one must count the key role he played in the development of the
Metric System.
4
relying on the Implicit Function Theorem to express m of the variables in terms of
the remaining nm variables

4.5. KARUSH-KUHN-TUCKER CONDITIONS I

4.5

101

Karush-Kuhn-Tucker conditions I

For historical reasons, the optimality conditions for mathematical programs


are known as the Karush-Kuhn-Tucker conditions, abbreviated as the KKT
conditions5 . These conditions, like those for the equality constrained problem, involve multipliers, that will now designated6 KKT-multipliers.
We are going to consider a number of linearly constrained problems, beginning with a couple of quite simple ones. All the optimality conditions are
obtained by applying the generalized Fermat rule 3.24, more precisely the
version of Proposition 4.9, in conjunction with our formulas for the normal
cones of polyhedral sets of 4.3. Its pure, straightforward calculus but the
implications are far reaching.
4.17 Example (convex program: non-negativity constraints). Consider the
convex program,
min f0 (x) so that x IRr+ IRnr ;
recall that f0 : IRn IR. Then, x is an optimal solution if and only if
(a) for j = 1, . . . , r: xj 0,
and there is a v f0 (x ) such that
(b) for j = 1, . . . , r: vj 0 and vj xj = 0,
(c) for j = r + 1, . . . , n: vj = 0.
If f0 is differentiable, conditions (b) and (c) can be stated as follows:
f0
f0
(x ) 0 and xj x
(x ) = 0,
(b) for j = 1, . . . , r:
xj
j
5

For a twenty five years span, these conditions were known as the Kuhn-Tucker conditions on the basis of a landmark paper of H. Kuhn and A. Tucker [13]. However, in
1976 H. Kuhn became aware that W. Karush, in his unpublished 1939 Master thesis [12],
at the University of Chicago, had actually scooped them. During the 1930s and 40s,
the University of Chicago was a hotbed for research in Optimization, in particular on the
Calculus of Variations. Quite a number of the leaders in the development of Optimal
Control Theory in the 50s and 60s had either been students at the University of Chicago
or had been closely associated with that school.
6
In the literature one finds reference, somewhat indiscriminately, to these multipliers
as KKT-multipliers or Lagrange multipliers. One of the reasons for making a distinction
between Lagrange and KKT-multipliers is to underscore the difference between results
that can be derived via the classical tools of Differential Calculus and those that require
the use of Subdifferential Calculus.

102

CHAPTER 4. LINEAR CONSTRAINTS

(c) for j = r + 1, . . . , n:

f0

Pnxj

(x ) = 0.

Finally, if in addition, f0 = j=1 f0j is separable, one can replace


0
f0j
(xj ) in the preceding conditions.

f0
(x )
xj

by

Detail. With C = IRr+ IRnr , it follows from 4.9 that x minimizes f0 on C


if and only if one can find v f0 (x ) such that v NC (x ). The normals
to the box C can be immediately be identified with the help of 4.15, namely,
(
vj 0 and vj xj = 0 for j = 1, . . . , r,
v NC (x )
vj = 0
for j = r + 1, . . . , n.
The second set of conditions follows from the first one; when f0 is differentiable f0 (x ) = {f0 (x )}.

4.18 Exercise (convex program: bounded variables). Consider the convex


program,
min f0 (x) so that lb x ub,
where f0 : IRn IR and for j = 1, . . . , n, lbj and ubj are lower and upper
bounds for xj . Then, x is an optimal solution if and only if
(a) for j = 1, . . . , n: lbj xj ubj ,
and there exist u, v IRn (
with u + v f0 (x ) such that
uj 0, uj (xj lbj ) = 0,
(b) for j = 1, . . . , n:
vj 0, vj (xj ubj ) = 0.
The Curve Fitting problem with least squares criterion, cf. 2.2, is a
problem of this type:
min hx, A>Axi 2hA>h, xi so that

lb x ub,

where the matrix A and the vectors h, lb, ub are defined in 2.2. Exploit the
optimality conditions to analyze the properties of the least squares fit.
Guide. To compute the normals of C =

Qn

j=1 [lbj ,

ubj ], use 4.15 and 4.14.

4.5. KARUSH-KUHN-TUCKER CONDITIONS I

103

All the results in the remainder of this section have to do with the linearly
constrained convex program:
min f0 (x)
so that hAi , xi bi ,

i = 1, . . . , s,

(P)

hAi , xi = bi , i = s + 1, . . . , m,
x X IRn .

When we refer to (P), it comes with the following blanket assumptions:


f0 : IRn IR is convex, and hence also continuous;
X is polyhedral set.
4.19 Theorem (linearly constrained: optimality conditions). x is an optimal solution of (P) if and only if one can find KKT-multipliers y IRm such
that
(a) hAi , x i bi , i = 1, . . . , s, hAi , x i = bi , i = s + 1, . . . , m,

(b) for i = 1, . . ., s: yi 0, yi (hA


i, x i

bi ) = 0,

>

(c) x argmin f0 (x) hA y, xi x X .
Another version of (c): v NX (x ) such that f0 (x ) 3 A>y + v.
Proof. Let


X0 = x IRn hAi , xi bi , i = 1, . . . , s,


hAi , xi = bi , i = s + 1, . . . , m .

This is a polyhedral set, so is X, and so is their intersection S = X0 X,


the feasible set. From 4.9 and 4.12, one has
x argminxS f0 (x) 0 f0 (x) A>y + NX (x ),
where for i = 1, . . . , s, yi 0 and yi (hAi , x i bi ) = 0 These are precisely
conditions (c) and (d).
The second version of (d) immediately follows from 4.9.
4.20 Corollary (optimality conditions: non-negativity constraints). Suppose that X = IRr+ IRnr . Then, x is an optimal solution of (P) if and
only if and one can find KKT-multipliers y IRm such that
(a) xj 0, j = 1, . . . , r,

104

CHAPTER 4. LINEAR CONSTRAINTS

(b) hAi , x i bi , i = 1, . . . , s, hAi , x i = bi , i = s + 1, . . . , m,


(c) for i = 1, . . . , s: yi 0, yi (hAi , xi bi ) = 0,
and there is v f0 (x ) such that
(d) for j = 1, . . . , r: vj hAj , yi 0 and xj (vj hAj , yi) = 0,
for j = r + 1, . . . , n: vj = hAj , yi.
Proof. One simply appeals to 4.17 to rewrite condition 4.19(d).
4.21 Corollary (optimality: quadratic and linear programs). When
f0 (x) = hc, xi + 12 hx, Qxi,

X = IRr+ IRnr

with Q symmetric, positive semi-definite, (P) is a quadratic program and x


is an optimal solution if and only if one can find KKT-multipliers y IRm
such that
(a) xj 0, j = 1, . . . , r,
(b) hAi , x i bi , i = 1, . . . , s, hAi , x i = bi , i = s + 1, . . . , m,
(c) for i = 1, . . . , s: yi 0, yi (hAi , x i bi ) = 0,
(d-qp) for j = 1, . . . , r: cj + hQj , x i hAj , yi 0 and
xj (cj + hQj , x i hAj , yi) = 0,
for j = r + 1, . . . , n: cj = hAj , yi hQj , x i.
When
f0 (x) = hc, xi, X = IRr+ IRnr ,
(P) is a linear program and x is an optimal solution if and only if one can
find KKT-multipliers y IRm such that the preceding conditions (a)-(c) are
satisfied and
(d-lp) for j = 1, . . . , r: hAj , yi cj and xj (cj hAj , yi) = 0,
for j = r + 1, . . . , n: hAj , yi = cj .
Proof. We apply Corollary 4.20 to the case when f0 (x ) = c + Qx ; for the
convexity of f0 , refer to 3.16. A linear program is just a quadratic program
with Q = 0.
Further special cases that one could analyze: f0 differentiable, f0 separable, and so on. Later on, we shall encounter a number of such situations, in
particular in the next couple of chapters that deal with stochastic programs
with simple recourse. In particular, we shall make use of the following result.

4.5. KARUSH-KUHN-TUCKER CONDITIONS I

105

From 2.8 we already know that there is more than one way to express
a piecewise linear convex function, at least when it consists of two pieces.
The fact that when , one could also write g(x) = max [ y, y ] as a the
optimal value of some minimization problem, namely,



g(x) = min y + y + y + y = x, y + 0, y 0 .

allows us to use this latter expression and integrate it, when convenient, in
a minimization problem; this will be convincingly illustrated in 5.4.
f
s0
sL
s1
1

Figure 4.10: A piecewise linear convex function.

4.22 Exercise (representation of piecewise linear functions). Let the piecewise linear convex function g : IR IR be defined by

0 + s0 x when x 1 ,
g(x) = l + sl x
when l x l+1 , l = 1, . . . , L 1, (L finite),

L + sL x when x L ,

where 1 < 2 < < L . Note that convexity implies: s0 s1 sL


and also, l = l1 +(sl1 sl )l , by continuity of g. Then, with 0 = s0 1 +0 ,
one has
L
L

n
o
X
X

g(x) = 0 + min
s 0 z0 +
sl zl x 1 = z0 +
zl ,
(z0 ,...,zL )S

l=1

l=1

where



S = (z0 , . . . , zL ) 0 z0 , 0 zl l+1 l for l = 1, . . . , L 1, zL 0 .

106

CHAPTER 4. LINEAR CONSTRAINTS

Guide. Show that, as a function of x, an optimal solution of the linear


program defining g is given by
x < 1 :

z0 = 1 x;

for l = 1, . . . , L 1,
l x < l+1 :
x L :

zl = 0, l = 1, . . . , L;

(
z0 = 0; zk = k+1 k , k = 1, . . . , l 1;
zl = x l ; zk = 0, k = l + 1, . . . , L;

z0 = 0; zL = x L ;
zl = l+1 l , l = 1, . . . , L 1.

Check that z is feasible and satisfies the optimality conditions with appropriately selected KKT-multipliers; for example, use 4.19 and to expand
condition 4.19(c) appeal to 4.18. Finally, show that the optimal value of this
linear program, as a function of x, matches the definition of g.

Chapter 5
SIMPLE RECOURSE: RHS
Ignoring, conveniently, uncertainty in the formulation of a decision model,
seldom comes without impunity! The solutions suggested by the simplified
deterministic version of the examples in Chapter 2, were misleading the decision maker, even about the type of activities that should be part of the
optimal activity mix. When uncertainty is included in the the formulation of
the product mix problem (2.1), the capacity expansion problem (2.3) and
the Broadway producer problem (2.5), one ends up with a problem whose
decision process fits the following pattern:
decision: x ; observation: ; recourse cost evaluation.
We refer to such problems as stochastic programs with recourse. The cost
evaluation may or may not be simple and its useful to make this distinction
in the selection of solutions procedures. In some instances, this cost evaluation may require solving another optimization problem. For example, in the
capacity expansion problem, the recourse costs were computed by solving a
certain network flow problem. If evaluating the recourse cost doesnt involve
costs associated with making additional (recourse) decisions, the problem is
said to be one with simple recourse. More precisely, a stochastic program
with simple recourse is an optimization problem of the following type:
min f0 (x) + E{Q(, x)}

xSIRn

where Q : IRm IR is explicitly given and relatively easy to evaluate. The


Broadway producer problem is of this type:
f0 (x) = x,

S = IR+ ,

Q(, x) = max [ 0, ( x) ].
107

108

CHAPTER 5. SIMPLE RECOURSE: RHS

5.1 Exercise (product mix problem: simple recourse). With = (T, d), the
product mix problem is a stochastic program with simple recourse. Set
X
f0 (x) = hc, xi, S = IR4+ , Q(, x) =
max [ 0, qi (di hTi , xi) ].
i=c,f

5.1

Random right hand sides

Lets start, as in Chapter 2, with a deterministic version of the decision model


that we are going to consider, namely,
x 0.
min hc, xi so that Ax = b, T x = ,
The coefficients of this linear program are assumed to be known with certainty, except for the vector that is believed to be an estimate of the
uncertain parameters . These parameters are only known in a probabilistic sense, and their actual values will be revealed after the decision x has
been selected. If there is any discrepancy between T x and , there will be
additional costs described by a function q that depends on T x.
The vector of random parameters takes values in IR m2 and P is
the (plain) associated probability distribution; is the support of .
This is the backdrop to our formulation of linear stochastic programs
with random right hand sides (rhs). In terms of the notation used earlier for
stochastic programs with simple recourse:
f0 = hc, xi is linear;



S = x IRn+ Ax = b is a polyhedral set, A is m1 n;
Q(, x) = q( T x), T a non-random m2 n matrix;
the recourse cost function q : IRm2 IR is convex;
the expectation functional EQ is given by EQ(x) =
A typical example of a recourse cost function is:
q(y) =

m2
X
i=1

max[ qi yi , qi+ yi ]

q( T x) P (d).

5.1. RANDOM RIGHT HAND SIDES

109

that, as long as q + q + 0, can also be written, cf. 2.8,




q(y) = +inf hq + , y +i + hq , y i | Iy + Iy = y, y + 0, y 0 .
y ,y

But more generally, q could be any finite-valued convex function. Usually,


q 0 and q(0) = 0, that one can interpret as follows: there is no penalty
if T x matches exactly the observed and but when there is a discrepancy
there is a, possibly strictly positive, penalty q( T x).
Our linear stochastic program with rhs, simple recourse, then reads:


min hc, xi + E Q(, x) so that Ax = b, x 0.
(SRrhs)
x

and the deterministic equivalent problem is:

min hc, xi + EQ(x) so that Ax = b, x 0.


x

As long as EQ is finite-valued, this is precisely a problem of the type analyzed


in Chapter 4. Its obviously linearly constrained and the convexity of EQ
follows from that of q via 3.10 and Proposition 3.26. We are well equipped
to characterize the optimal solutions of such problems.
5.2 Proposition (KKT-conditions: simple recourse, rhs). As long as EQ
is finite-valued, x is an optimal solution of (SRrhs) if and only if one can
find KKT-multipliers u IRm1 and summable KKT-multipliers v : IRm2
such that
(a) x 0, Ax = b;
(b) for all : v() q( T x );

(c) x argmin hc A>u T >v, xi x IRn+ where v = E{v()},

Proof. The optimality conditions follow immediately from Corollary 4.20,


Proposition 3.27 about the subgradients of expectation functionals and the
chain rule 4.10(c) that yields Q(, x) = T >q( T x).
There is another version of (SRrhs) that in some ways sticks closer to the
deterministic linear program we started off with and that has some desirable
properties both from an analytical as well as computational viewpoint. Lets
introduce the variable(s)
:= T x,

110

CHAPTER 5. SIMPLE RECOURSE: RHS

and refer to it as a tender. If represents demand for the output of certain


economic activities, then = T x is the vector being tendered by the decision
maker to match the random vector . Introducing this identity in the
formulation of our stochastic program, it takes the form:
min hc, xi + E{(, )} so that Ax = b, T x = , x 0,
x,

(SRtndr)

where
(, ) = q( ),

and with

E() =

(, ) P (d),

the deterministic equivalent problem becomes,


min hc, xi + E() so that Ax = b, T x = 0, x 0.
x,

The functions E and EQ essentially have the same properties since is just
a linear transformation of x. In particular, E is convex. Its finite-valued
when EQ is finite-valued and then,
E() = E{q( )},
as follows from 4.10(c) and 3.27. And, the optimality conditions for (SRtndr)
are easy translations of those for (SRrhs). So, why bother?
In many applications it turns out that , and thus also E, is separable
while EQ is not, i.e.,
(, ) =

m2
X
i=1

i (i , i ), =

m2
X
i=1

qi (i i ),

E() =

m2
X

Ei (i ).

i=1

Separability is quite common in practice because each i might represents the


demand for a specific product/service and the the penalty associated with
not meeting this demand will be product/service dependent, i.e. separable..
The Broadway producer problem falls, trivially, in this category. And,
so does the product mix problem when the only uncertainty is about the
number of hours of carpentry and finishing that are going to be available, cf.
Exercise 2.1. Indeed, with c = dc and f = df , since
X
q( T x) =
max [ 0, qi (i hTi , xi) ],
i=c,f

5.2. AIRCRAFT ALLOCATION TO ROUTES I

111

with i = hTi , xi, for i = c, f , one has


i (i , i ) = max [ 0, qi (i i ) ].
With this definition of , the product mix problem with stochastic rhs takes
the following form:
X
Ei (i ) so that hTi , xi = i , i = c, f, x 0.
min hc, xi +
i=c,f

Separability renders all operations that need to be carried out much easier because one is basically dealing with a juxtaposition of 1-dimensional
cases. Subgradients can be calculated with the help of 3.23 and 3.21, one
can get away with just 1-dimensional integration when evaluating integrals,
and one needs only information about the (marginal) distribution of the random variables i rather than about the joint distribution of , usually much
more difficult to estimate.
Before we pursue the analysis of linear stochastic programs with separable simple recourse and outline solution procedures, lets go through one
more example.

5.2

Aircraft allocation to routes I

This example is a historical one. It was the first non-trivial, published example of a stochastic program with recourse [9], and to top this, with a solution
procedure that the authors carried out, successfully, by hand1 !
Facing uncertain passenger demand, an airline wishes to allocate airplanes
of various types among its routes in such a way as to maximize expected
revenue. Equivalently, one can minimize (i) operating costs, assumed to be
independent of the number of passengers on the plane, plus (ii) expected
revenue loss from flying with less than a full load of passengers on any given
route. Let
- routes: j = 1, . . . , m2 .
- bi the number of aircraft of type i available, i = 1, . . . , m1 ;
1

We wont review their procedure that relies on the special transportation structure
of the problem for which highly efficient solution procedures are available, but its a very
elegant approach and [9] is certainly on the recommended reading list.

112

CHAPTER 5. SIMPLE RECOURSE: RHS

- j passenger demand on route j, j = 1, . . . , m2 ;


- tij passenger capacity of aircraft of type i on route j;
- aij = 1 if type-i aircraft can fly route j, otherwise aij = 0 ( tij = 0);
- ocij cost of operating aircraft of type i on route j;
- rj > 0 per passenger revenue on route j.
So, the cost (= - profit) from operating aircraft of type i on route j at full
capacity is
- cij = tij rj + ocij ;
and the cost of an empty seat on route j must be deducted from this profit.
Thus, the recourse cost functions are:
- qj (yj ) = max [rj yj , 0 ] for j = 1, . . . , m2 .
SEA
JFK

SFO

DNV

LAX

Figure 5.1: Airlinks to be served by the aircraft fleet


When the passenger demand is discretely distributed with prob[ = l ] = pl
for l = 1, . . . , L, one could follow an approach similar to that described in
2.1 for the product mix problem. The extensive version of this problem
would read:
Lj
m2 X
Xm1 ,m2
X
min
rj pl yjl
cij xij +
i,j=1

so that

Xm2

j=1
Xm1
i=1

aij xij

j=1 l=1

bi , i = 1, . . . , m1 ,

tij xij yjl jl , j = 1, . . . , m2 , l = 1, . . . , L,

xij 0, i = 1, . . . , m1 , j = 1, . . . , m2 ,
yjl 0, j = 1, . . . , m2 , l = 1, . . . , L.

5.2. AIRCRAFT ALLOCATION TO ROUTES I

113

This, highly structured, linear program has m1 + m2 L linear (inequality)


constraints and (m1 m2 + m2 L) variables. In the mini-example that will be
introduced in Exercise 5.10, m1 = 4, m2 = 5 and L = 750 (assuming that
the passenger demand on route j doesnt depend on that for route j 0 ). This
means, we would have to deal with a problem involving 3,756 constraints
and 3,770 variables. If we transpose this aircraft allocation problem in the
practical world of a major airline, the numbers would be of the following
magnitude: m1 25, m2 5, 000, L in the tens of millions, if not larger.
The resulting linear program would be truly gigantic.
From a computational viewpoint, the number of constraints of a linear
program is a rough, but pretty good, measure of how much time it will take
to find the solution and of the numerical reliability to attach to this solution.
The number of constraints affects, in a significant way, both numerical stability and the number of iterations required to solve the problem; numerical
stability issues can only be marginally improved by having access to larger
and faster computers. Anyway, if a linear program with 3,756 constraints
might be manageable, the practical world version of this problem certainly
isnt.
Fortunately, there is a very attractive alternative to this extensive version formulation that turns out to be also a linear program but with only
m1 + m2 linear (inequality) constraints: 9 constraints in the case of our
mini-example and 5, 000 for the practical world version. To arrive at this
alternative formulation, a further analysis of the shape and the properties of
the recourse costs is going to be needed.
With j (j , j ) = max [ rj (j j ), 0 ], the aircraft allocation problem
can also be formulated as
n Xm2
o
Xm1 ,m2
min
cij xij + E
j ( j , j )
i,j=1
j=1
Xm2
so that
aij xij
bi , i = 1, . . . , m1 ,
j=1
Xm1
tij xij j = 0, j = 1, . . . , m2 ,
i=1

xij 0, i = 1, . . . , m1 , j = 1, . . . , m2 .

The decision variables xij determine the number of aircraft of type i to allocate to route j. The variables j represent the capacity (= the total number
of seats) made available on route j. There is no revenue loss if the demand j

114

CHAPTER 5. SIMPLE RECOURSE: RHS

on route j exceeds the capacity j made available, but for each empty seat
there is a loss of potential revenue rj , or a cost rj .
In terms of our simple recourse model (SRtndr), the first group of inequalities corresponds to the deterministic constraints2 . In compact notation,
the deterministic equivalent program has the form:
min hc, xi +
x,

m2
X
j=1

Ej (j ) so that Ax b, T x = , x 0.

(AA)

This is a separable mathematical program, the constraints and the objective


are sums of terms involving just one variable, xij or j . The nonlinearity has
been absorbed in the 1-dimensional functions Ej .

5.3

Separable simple recourse I

Our model for a stochastic program with separable simple recourse, random
rhs, will be:
min hc, xi + E
x,

m2
nX

i ( i , i )

i=1

(SSRtndr)

so that Ax = b, T x = , x 0,
where for i = 1, . . . , m2 ,

i (i , i ) = qi (i i ),
qi : IR IR is convex, and Pi is the (marginal) probability distribution of
the random variable i , its support is i IR. With
Z
Ei (i ) = E{i (i , i )} =
i (, i ) Pi (d),
i

the deterministic equivalent problem is then,


Xm2
min hc, xi +
Ei (i ) so that Ax = b, T x = 0, x 0.
x,

i=1

If so desired, by adding a (non-negative) slack variable one can always turn an inequality constraint into an equality constraint. Namely, for the inequality ha, xi one
can substitute ha, xi + s = and s 0

5.3. SEPARABLE SIMPLE RECOURSE I

115

Like (SRrhs), due to the convexity of the functions qi , (SSRtndr) is a linearly


constrained convex program, but its also separable.
Assuming that the functions Ei are finite-valued, the formulas for their
subderivatives and subgradients follow immediately from our analysis of expectation functionals in 3.4. One relies on Propositions 4.10 and 3.27, and
on 3.21 to obtain explicit expressions for dEi (i ; ) and Ei . In particular,
one has

o
n Z

Ei (i ) = vi () Pi (d) i , vi () qi ( i ), vi summable .
i

5.3 Exercise (piecewise linear recourse cost function). The recourse cost
functions of the producer problem (2.4 & 3.5), the product mix problem
(2.1 & Exercise 5.1) and the aircraft allocation problem (5.2) are all of the
following type (ignoring subscripts, for now):
(
y if y 0,
(, ) = q( ) where q(y) = max [ y, y ] =
y if y 0;
with . Show that
Z
Z
E() =
( ) P (d) +

( ) P (d)
Z


= (E{} ) + ( ) P ()
P (d) ,

with the notation of 3.3: P () = prob { }, p = prob { = }.


E() = [ d() ( )p , d() ]

where

d() = ( )P () .

So, in particular, if p = 0, that is always the case when has a continuous


distribution, E is differentiable at and E() = {d()}.
Guide. One can essentially replicate the steps in 3.5.
5.4 Exercise (piecewise linear recourse, uniform distribution). Suppose
(, ) is as defined in 5.3 and is uniformly distributed on [, ]. Show that
E is a finite-valued convex function, quadratic on [, ] and linear outside
this interval. Compute dE() and E() as functions of .

116

CHAPTER 5. SIMPLE RECOURSE: RHS


E
j

Figure 5.2: Expected recourse cost: with bounded support.

Guide. Figure 5.2 graphs E when is uniform on [, ].


When the support of is unbounded, e.g. = [, ), the expected cost
is convex, and asymptotically linear as , as Figure 5.3 demonstrates.
E j
j

xj

j
j = [ j,)

Figure 5.3: Expected recourse costs: unbounded support.

5.5 Exercise (piecewise linear recourse, discrete distribution). As in Exercise 5.3, let (, ) = max [ y, y ] with with a discretely distributed
random variable such that
prob [ = l ] = pl ,

l = 1, . . . , L,

with

1 < 2 < < L,

5.3. SEPARABLE SIMPLE RECOURSE I


P
and let = E{}. Show that with P ( l ) = lk=1 pk
P
R( l ) = lk=1 pk k ,

[ ( )R( l )] + [( )P ( l ) ]
E() =

117
= prob[ l ] and
if < 1 ,
if l < l+1 ,
l = 1, . . . , L 1,
if L ,

i.e., E is a convex, piecewise linear function.

5.6 Example (linear-quadratic recourse cost function). Another common


type of recourse cost function, seen already in 2.3 (Figure 2.5), in connection
with the network capacity expansion problem, is


q(y) = max y 12 2 [ , ] }, > 0,

1
2

y 2 if y < ,
= 12 1 y 2
if y [ , ],

1
y 2 2 if y > .
Detail. With (, ) = q( ), one has

if < + ,

n d(, ) o
1
(, ) =
() = ( ) if + + ,

if > + .

For all , the convex function 7 (, ) is differential, the subgradient is


always a singleton. This, in turn, implies that E will be differentiable, its
derivative is obtained by straightforward integration of (, ), with respect
to . This wouldnt be the case if we allowed for = 0, in fact with = 0,
this q-function turns out to be that considered in the preceding exercises.
Both the latest q-function and that defined in 5.3 belong to the class of
convex linear-quadratic functions, P,Q : IRn IR, defined by



P,Q (y) = sup hy, xi 21 hx, Qxi x P
xIRn

where P is a polyhedral set and Q is a positive semi-definite matrix. The


convexity of P,Q follows from 3.7, since y 7 hy, xi 12 hx, Qxi is affine with

118

CHAPTER 5. SIMPLE RECOURSE: RHS

x playing the role of an index ( the index i in 3.7). Functions of this type
can be viewed as monitoring compliances with certain constraints. They
are special instances of a rich class of functions, called monitoring functions, that will be explored in 6.4. Of course, not all simple recourse costs
functions are linear-quadratic, e.g., reliability consideration might lead to
q(y) = max [ 0, (ey 1) ] with , > 0. In fact, in our separable simple recourse model, we only assume that the functions qi are convex and
finite-valued.
Anyway, returning to our stochastic program with separable simple recourse (SSRtndr), since by separability, cf. 3.23,

 Xm2

i=1

Ei (i ) =

m2
Y

E(i ),

i=1

the optimality conditions will follow from those derived in the preceding
chapter.
5.7 Proposition (KKT-conditions: separable simple recourse, rhs). As
long as for i = 1, . . . , m2 , Eqi is finite-valued, a pair (x , ) is an optimal
solution of (SSRtndr) if and only if one can find KKT-multipliers u IRm1
and for i = 1, . . . , m2 , summable KKT-multipliers vi : i IR such that
(a) Ax = b, T x = ;
(b) for i = 1, . . ., m2 : i , v i () q i ( i );
(c) x argmin hc A>u T >v, xi x RIRn+
where for i = 1, . . . , m2 , vi = E{vi (i )} = i vi () Pi(d).
Proof. Again we go straightforward to Corollary 4.20 making use of the
separability of the objective
R function and the (general) identity for the subgradients Ei (i ) = { i vi () Pi(d)} with vi () qi ( i ).

5.8 Exercise (finite-valued recourse cost). The condition Eqi finite-valued


will be satisfied if the support i of i is bounded, or if E{ i } is finite and
the function qi is asymptotically linear, i.e., there are , IR such that
qi ()/ as and qi ()/ as .
Guide. Since qi is a finite-valued convex function, it is also continuous (3.13)
and consequently bounded on bounded sets.

5.4. AIRCRAFT ALLOCATION TO ROUTES II

119

5.9 Corollary (optimality: simple recourse, piecewise linear, rhs). Lets


consider (SSRtndr) where for i = 1, . . . , m2 ,
qi (y) = max [ i y, i y ] with i i ,
E{i } is finite.
Then, a pair (x , ) is an optimal solution of (SSRtndr) if and only if one
can find KKT-multipliers u IRm1 and v IRm2 such that
(a) Ax = b, T x = ;
(b) for i = 1, . . . , m2 ,

i (i i )Pi (i ) vi i (i i ) Pi (i ) pi ,
(c) x 0, A>u + T >v c, and hc A>u T >v, x i = 0.

Proof. By 5.8, the functions Eqi are finite-valued hence, the optimality conditions of the proposition apply. Conditions (c) of the proposition and that
of the corollary are equivalent by 4.17. As to conditions (b), the equivalence
comes via Proposition 3.27 since for i ,
(
i if < i ,
and vi () [ i , i ] if = i ,
vi () =
i if > i ,
as follows from 5.3.
Lets note that condition (b) can also be stated: for i = 1, . . . , m2 ,
Pi (i ) pi

i vi
Pi (i ),
i i

which should bring to mind the optimality criterion in 2.4 for the Broadway
producer problem!

5.4

Aircraft allocation to routes II

Lets now return to the aircraft allocation problem of 5.2 and exploit what
we have learned about stochastic programs with separable simple recourse
to design a solution procedure when the passenger demand is, or is approximated by, a discrete distribution. The recourse costs are then given by
j (j , j ) = max [ rj (j j ), 0 ],

Ej (j ) =

Lj
X
l=1

j (jl , j )pjl ,

120

CHAPTER 5. SIMPLE RECOURSE: RHS

where for l = 1, . . . , Lj ,
prob [ j = jl ] = pjl

with

From 5.5, with j = E{ j }, Rj (jl ) =

Pl

k=1

0
Ej (j ) = rj Rj (jl ) + rj Pj (jl )j

rj j + rj j

j1 < j2 < < j j .


pjk jk , Pj (jl ) =

Pl

k=1

pjk , one has

if j < j1 ,
if jl j < jl+1 , l = 1, . . . , Lj 1,
L
if j j j .

For each j, the function Ej is finite-valued, convex and piecewise linear.


We can appeal to 4.22 to obtain the following alternative expression for Ej :


Lj
P

X
L
j
1 +
zjLj 0,
j
l=1 zjl
rj
Pj (jl )zjl j
Ej (j ) = min
0 zjl jl , l = 1, . . . , Lj 1,
z IRLj
j

l=1

where jl = jl+1 jl , Introducing this representation for Ej in our formulation of aircraft allocation problem AA in 5.2, and with
djl =

rj Pj (jl )

= rj

l
X

pjk ,

k=1

one has,
min
x,z

so that

Xm1 ,m2
i,j=1

Xm2

j=1

Xm1

i=1

cij xij +

Lj
m2 X
X

bi , i = 1, . . . , m1 ,

aij xij
tij xij

djl zjl

j=1 l=1

Lj
X
l=1

zjl j1 , j = 1, . . . , m2 ,

xij 0, i = 1, . . . , m1 , j = 1, . . . , m2 .
0 zjl jl , l = 1, . . . , Lj 1, zjLj 0,

j = 1, . . . , m2 .

This is a linear program with m1 + m2 linear (inequality) constraints plus


some upper and lower bounds on the variables. As already pointed out in

5.4. AIRCRAFT ALLOCATION TO ROUTES II

121

5.2, this is a much preferable situation than having to deal with the extenQ 2
sive version of the problem that involves m1 + m2 ( m
j=1 Lj ) constraints. In
fact, if we recklessly decided to replace the given problem by a deterministic version, say, by replacing the random variables j by their expectations
or some other estimates, the resulting linear program would also be one
with m1 + m2 linear constraints. So, our analysis has paid off extremely well.
Solving the stochastic version of the problem will essentially require the same
computational effort as solving a questionable deterministic version!
5.10 Exercise (aircraft allocation: numerical example). Set up and solve
the following mini-example3 of the aircraft allocation problem:
b = [10, 19, 25, 15],
aircraft
A
B
C
D
routes:

16
?
?
9
1

capacities: tij
15 28 23
10 14 15
5
?
7
11 22 17
2
3
4

81
57
29
55
5

r = [13, 13, 7, 7, 1],


aircraft
A
B
C
D
routes:

operating costs: ocij


18 21 18 16 10
? 15 16 14 9
? 10 ?
9
6
17 16 17 15 10
1
2
3
4
5

? indicates that aircraft of type i cant fly route j.


1
2
3
4
5

200 : 0.2
50 : 0.3
140 : 0.1
10 : 0.2
580 : 0.1

demand : probability
220 : 0.05 250 : 0.35 270 : 0.2
150 : 0.7
160 : 0.2
180 : 0.4 200 : 0.2
50 : 0.2
80 : 0.3
100 : 0.2
600 : 0.8
620 : 0.1

300 : 0.2
220 : 0.1
340 : 0.1

Guide. The optimal solution is


aircraft
A
B
C
D
routes:
3

10
?
?
7.34
1

solution: xij
0
0
0
12.85 0.82 5.33
4.31
?
0
0
7.66
0
2
3
4

0
0
20.69
0
5

This example was used by Ferguson and Dantzig to illustrate their method [9].

122

CHAPTER 5. SIMPLE RECOURSE: RHS

This solution assigns each available aircraft to some route, and the capacities
made available on routes j = 1, . . . , 5, are
= T x = [ 226.07, 150, 180, 80, 600 ].
Check the optimality conditions of Corollary 5.9 with
u = [ 138, 39.84, 17.42, 70.75 ],

v = [ 3.25, 7.52, 3.01, 3.41, 0.19 ],

the KKT-multipliers. These multipliers are part of the output available when
solving linear programs; the Matlab function linprog returns these multipliers as part of the lambda-structure.
One can compare the solution with that obtained when the random variables are replaced by their expectations, = E{}, and one solves the resulting linear program:



x0 .
xd = argmin hc, xi Ax b, T x = ,
P 2
d

The (expected) profit loss is hc, xd x i+ m


j=1 Ej (hTj , x i)Ej (hTj , x i)
when instead of x , one assigns the available aircrafts as suggested by xd .

5.5

Approximations I

Following our approach, it takes less than the blink of an eye, to set up
and solve the aircraft allocation problem of the previous section, of course,
not counting the time to enter the data. This naturally raises the following
question: If some or all of the random variables are not discretely distributed,
wouldnt it be possible to substitute for the given distribution functions,
approximating discrete distributions, so that the solution of (SSRtndr) with
these approximating distributions could be substituted, at little cost, for x .
This will be a recurring issue that will have to be dealt with at different
levels. For now, lets just deal with the specific situation at hand. Dropping
subscripts for now, let P : IR [0, 1] is a continuous distribution thats
strictly increasing on the interval IR, the support of . From 5.3, with
and = E{},
Z



E() := ( ) + ( ) P ()
P (d) ;

5.5. APPROXIMATIONS I

123

we also know from 5.3 that E is differentiable. In the optimality conditions


for (SSRtndr), the distribution function P plays an explicit role only through
a certain quantile, cf. 5.9(b). Indeed, one must have
P ( ) =

v
=: ,

or equivalently, since P is strictly increasing on and p = 0,


= P 1 ().
Thus, if the distribution function P is replaced by a distribution function Q,
discrete or not, that has the same -quantile, i.e.,
Q1 () = P 1 (),
the optimal solution remains unaffected! It suffices to observe that the optimality conditions of Corollary 5.9 remain satisfied with the same pair (x , )
and the same KKT-multipliers (u, v).
So, one might reasonably expect that if the distribution function Q is
quantile close to P , the error made by the substitution will be small. It
turns out that finding the discrete distribution Q with a prescribed number
of jumps, i.e., with a prescribed number of points in the support of the
random variable, that minimizes the quantile-distance between P and Q
is not a nice convex optimization problem4 . But, one could proceed as
suggested by Figure 5.4. For selected 0 < 1 < 2 < < L < 1, let


Q() = l when P 1 (l1 + l )/2 =: l1 < l := P 1 (l + l+1 )/2 ,
with the obvious adjustments at both ends.
Evidently, if we increase the number of quantiles {l , l = 1, . . . , L} so that
the mesh5 tends to 0, the resulting distribution functions Q will converge to
P with respect to the quantile-distance. In turn, this will imply that the
optimal solutions of the corresponding approximating problems will converge
to x , the optimal solution of (SSRtndr)6 .
4

The space of distribution functions with a fixed number of jumps is not a linear space.
mesh refers here to the maximum of the distance between two adjacent quantiles
6
This is not a proof but its certainly believable. One can actually prove that the
optimal solutions converge, but we wont go into this until Chapter ??
5

124

CHAPTER 5. SIMPLE RECOURSE: RHS

P
Q

Figure 5.4: Approximating discrete distribution function Q

What our analysis has also revealed is that approximating the given problem by replacing continuous distributions by discrete ones has its drawback.
For Q to be a good approximation of P with respect to the quantile-distance,
one should use a very fine mesh, and consequently Q will come with many
(small) jumps. And this will result in having to solve a linear program with
many variables, although still with only m1 + m2 linear constraints. So, that
doesnt really invalidate this approach, but one wonders if another approach
might not yield better results.
Instead of approximating P by a piecewise constant distribution, lets
consider a piecewise linear approximation as illustrated in Figure 5.5. Its
obvious that when we are limited to a prescribed number of pieces, we can
more easily match the quantiles of P with a piecewise linear Q that with
a piecewise constant one; in fact, dramatically so. But the question is now
what type of problem will we have to solve if the distribution functions are
piecewise linear. It turns out to be a separable quadratic program that can
be solved almost as efficiently as linear programs of the same size7 .

5.11 Example (piecewise linear recourse, piecewise linear distribution). Let


(, ) = max [ ( ), ( ) ] with , and P the continuous, piece7

using a primal-dual interior-point method, cf. 6.3 and [20]

5.5. APPROXIMATIONS I

125

P
Q

Figure 5.5: Approximating piecewise linear distribution function Q

wise linear, distribution function of given by

when 1 ,
0
P () = P (l ) + sl ( l ) when l l+1 ,

1
when L ,

l = 1, . . . , L 1,

with 1 < 2 < < L ; 1 , . . . , L are called the nodes of the distribution
functions P . Note that the coefficients sl are necessarily nonnegative, otherwise P would not be monotone nondecreasing. Let = E{}; its always
finite. Then, with = ( 1 ),

P

L
L1
= 1 z0 + Ll=1 zl ,
X

X

a l zl +
bl zl2
E() = + min
z0 0, zL 0,
z

l=0
l=1
0 zl l , l = 1, . . . , L 1,
where

al = + ( )P (l ), l = 0, . . . , L,
bl = ( )sl /2, l = 1, . . . , L 1,
l = l+1 l , l = 1, . . . , L 1.

Detail. One begins with 5.3 and then proceed as in 5.4 or 5.5. Finally,
one appeals to the same arguments as those used in 4.22 to obtain the representation of E as the optimal value of a separable quadratic program.
Furthermore, one can also find explicit expressions for dE and E.

126

CHAPTER 5. SIMPLE RECOURSE: RHS

More general approximations schemes for stochastic and other optimization problems will be explored in Chapters 11 and 16.

296

CHAPTER 5. SIMPLE RECOURSE: RHS

Appendix A
Notation and terminology
The notation is pretty much that of Analysis and Linear Algebra textbooks
with one possible exception: h, i for the vector-inner product.
IR the real numbers, IR = [ , ] the extended reals, IRn the ndimensional vector space.
IN = {1, 2, . . . } the natural numbers. Sequences are usually be indexed
by IN , subsequences by k with k as k .
Generally, lower case letters x, y, a, b, . . . are be reserved for vectors,
upper case letters A, T, W, . . . for matrices, and Greek letters , , . . .
for scalars.
The inner product of two vectors a, x IRn is denoted by ha, xi =
Pn
j=1 aj xj . By adopting this notation, one never has to specify if vectors
are row or column vectors and ha, xi = hx, ai.
Matrix multiplication: Ax is the vector in IRm resulting from multiplying the vector x IRn by the m n-matrix A. For IR, x is the
-multiple of x and the entries of A are those of A multiplied by .
Aj is the jth column, while Ai denotes the ith row of A; A> is the
transpose of A. One has hy, Axi = hA> y, xi.
Unless otherwise specified, the norm of x IRn is the euclidean norm,
i.e., |x| = hx, xi1/2 . The (closed) ball of radius centered at x is
297

298

APPENDIX A. NOTATION AND TERMINOLOGY




IB(x, ) = x0 |x0 x| ; IB is the shorthand notation for IB(0, 1),
the unit ball.

A sequence {x IRn , IN } converges to x if |x x| 0 and


then x is the limit (point) of this sequence; x is a cluster (point) of the
sequence if there is a subsequence xk x as k .
is the empty set. C IRn is a proper subset of IRn if IRn \ C 6= .
For any set C IRn , cl C denotes the closure of C and it consists
of all points in C as well as the all clusters points of any sequence of
points included in C. The interior, int C, consists of all points x C
that are such that IB(x, ) C for some > 0. The boundary of C,
bdry C = cl C \ int C, i.e., all points that are in the closure of C but
not in its interior.
A set C IRn is closed if C = cl C and open if C = int C. For
example,
the
ball
unit ball IB is closed, its interior is the (open) unit

n
x IRn |x| < 1 , its boundary
consists
of
all
points
x

I
R
such



that |x| = 1. When C = (x1 , x2 , x3 ) x1 = 0, x22 + x23 1 , the set C
is closed, but int C = and then bdry C = C.

A function f : IRn IR is continuous at x if f (x ) f (x) whenever


x x. Its continuous if this holds at every x IRn . Its Lipschitz
continuous if there exists a constant > 0 such that |f (x) f (x0 )|
|x x0 | hold for an pair x, x0 IRn ,

A.1

Existence: limits & minimizers

A.1 Definition A set C IRn is bounded if C IB(0, ) for some > 0.


Its said to be compact if its closed and bounded.
A.2 Theorem (Bolzano-Weierstra Theorem). Every sequence in a compact set, i.e., closed and bounded, C IRn has a cluster point that belongs
to C.
Proof. We may prove this theorem by a process that has been suggested as
a method for catching a lion in the Sahara Desert. We surround the desert by

A.2. FUNCTION EXPANSIONS

299

a fence and then bisect the desert by a fence running (say) north and south.
The lion is in one half or the other; bisect this half by a fence running east
to west. The lion is now is one of two quarters; bisect this by a fence and so
on: the lion ultimately becomes enclosed by an arbitrarily small enclosure.
in applying this idea to our problem is that a sequence
 The essential point

x C, IN lies in a bounded set (contained in C). Because C is
bounded it is enclosed in a (square) box whose sides are less than or equal
to > 0. Rather than bisecting C, let us split it in 2n parts, each one
contained in a box whose sides are less than or equal to /2. Then at least
one of the parts

must contain an infinite number of members of the sequence
x , IN . Let C1 be the portion of C containing an infinite number of
points x and split C1 . Again one of the parts, say C2 , must contain an
infinite number of points of the subsequence in C1 . Continue this process.
We obtain a nested sequence of subset of C (C C1 C2 ), each
containing an infinite number of points of the sequence and the sides of Ck of
diameter less than or equal to /2k . The low (coordinate-wise) end points of
all boxes Ck are bounded above (again coordinate-wise, using the fact that C
is bounded) and so there is a least upper bound (again coordinate-wise),
say x. Every neighborhood of x contains some Ck since the length of the
sides of the sets Ck goes to 0. That x is a cluster point of the sequence

x C, IN .
A.3 Theorem (Existence of a minimizer) Given any real-valued continuous
function f defined on a compact set C IRn one can find a point x C at
which the minimum of f on C is actually attained; i.e., such that f (x )
f (x) for all x C. One then writes x argminxC f (x).

A.2

Function expansions

A.4 Theorem (Taylor expansion) Let f : IR IR be of class C , then


f (x) =

X
f (k) (x0 )
k=0

k!

(x x0 )k

f 000 (x0 )
f 00 (x0 )
2
(x x0 ) +
(x x0 )3 +
= f (x0 ) + f (x0 )(x x0 ) +
2
6
0

300

APPENDIX A. NOTATION AND TERMINOLOGY

(a) Taylors formula for remainder. Let f : IR IR be of class C n+1


(continuously differentiable of order n + 1). Then for x 6= x0 ,
f 00 (x0 )
(x x0 )2 +
2
f (n) (x0 )
f (n+1) (x1 )
n
+
(x x0 ) +
(x x0 )n+1
n!
(n + 1)!

f (x) = f (x0 ) + f 0 (x0 )(x x0 ) +

for some x1 such that x0 < x1 < x or x < x1 < x0 (if x < x0 ).
(b) Multivariate Taylors formula. For f : IRn IR of class C r
f (x + y) = f (x) + hf (x), yi +

hy, 2 f (x)yi + + D r f (x + y)
2
r!

(0, ), and where


for some
r

D f (x) =

n X
n
X

i1 =1 i2 =1

n
X

ir =1

y i1 y i2 y ir

r f
(x).
xi1 xi2 xir

A.5 Theorem (Implicit Function Theorem). Let F : Rn IRd IRd be a


continuous mapping in the neighborhood of (x0 , y0 ) with continuous partial
derivatives with respect to the components of y. Let y F (x0 , y0 ), the Jacobian with respect to y, be invertible, and let z0 = F (x0 , y0 ) IRd . Then there
exists > 0, > 0 such that if x and z are fixed, |x x0 | < , |z z0 | < ,
then the equation
z = f (x, y)
has a unique solution y = (z, x) satisfying |y y0 | < . Furthermore, is
continuous for |xx0 | < , |z z0 | < and has continuous partial derivatives
with respect to the components of x.

Bibliography
[1] J.-P. Aubin and H. Frankowska. Set-Valued Analysis. Birkhauser Boston
Inc., 1990.
[2] R.G. Bland. New pivoting rules for the simplex method. Mathematics
of Operations Research, 2:103107, 1977.
[3] J.M. Borwein and A.S. Lewis. Convex Analysis and Nonlinear Optimization. Springer, 2000.
[4] F.H. Clarke. Optimization and Nonsmooth Analysis. Wiley, 1983.
[5] R.W. Cottle.
On dodging degeneracy in linear programming.
Manuscript, Stanford University, 200?
[6] R.W. Cottle, J.-S. Pang, and R.E. Stone. The Linear Complementarity
Problem. Academic Press, 1992.
[7] G.B. Dantzig. Linear Programming and Extensions. Princeton University Press, 1963.
[8] J.H. Dula and F.J. Lopez. Algorithms for the frame of a finitely generated unbounded polyhedra. Manuscript, University of Mississppi, 2002.
[9] A.R. Ferguson and G.B. Dantzig. The allocation of aircraft to routes. an
example of linear programming under uncertain demand. Management
Sciences, 1:4573, 1958.
[10] P.E. Gill, W. Murray, and M.H. Wright. Practical Optimization. Academic Press, 1981.
301

302

BIBLIOGRAPHY

[11] C. Hess. Contributions a


letude de la mesurabilite, de la loi de probabilite, et de la convergence de multifonctions. Th`ese detat, Universite
de Montpellier II, 1986.
[12] W. Karush. Minima of functions of several variables with inequalities
as side conditions. Masters thesis, University of Chicago, 1939.
[13] H.W. Kuhn and A.W. Tucker. Nonlinear programming. In J. Neyman,
editor, Proceedings of the Second Berkeley Symposium on Mathematical
Statistics and Probability, pages 481492. University of California Press,
Berkeley, 1951.
[14] B.S. Mordukhovich. Approximation Methods in Problems of Optimization and Control. Nauka-Moscow, 1988. (Russian).
[15] N. Oresme. Algorismus proportionun, 137? manuscript.
[16] E. Polak. Optimization: Algorithms and Consistent Approximations.
Springer, 1997.
[17] R.T. Rockafellar and R. J-B Wets. A lagrangian finite generation technique for solving linear-quadratic problems in stochastic programming.
Mathematical Programming Study, 28:6393, 1986.
[18] R.T. Rockafellar and R.J-B Wets. Variational Analysis. Springer, 1998.
[19] R. J-B Wets and C. Witzgall. Algorithms for frames and lineality spaces
of cones. J. of Research of the National Bureau of Standards, 71B:17,
1967.
[20] S.J. Wright. Primal-Dual Interior-Point Methods. SIAM, 1997.

Index
active constraints, 95
affine mapping, 49
affine set, 158, 181
argmin, 2
convex, 54

strictly, 51, 53
convex combination, 48
convex hull, 48, 161
convex polyhedron, 85
convex program, 84
convex system, 194
convex-concave bifunction, 128

ball, 298
barrier method, 134
bifunction, 128
convex-concave, 128
effective domain, 128
bivariate function, 128
Blands rule, 179
boundary, 90, 298
set, 183

density function, 71
derivative, 2
diag, 134
direction of descent, 7
discrete distribution, 69
discrete random variable, 69
dual problem, 131
effective domain, 52
bifunction, 128
effective objective, 83
epi-multiplication, 203
epi-polyhedral function, 169
epi-sum, 203
epigraph, 53
convex, 54
Euler equation, 19
expectation functional, 68, 75
extended arithmetic, 53
extensive version, 28, 112
extreme point, 173

closed
set, 183
closure, 90, 183, 298
cluster, 298
complementary slackness, 96
concave function, 51
cone, 50, 158
constraint qualification, 194
continuity, 185
continuous distribution, 71
convex
function, 51
locally, 6
set, 48

Farkas Lemma, 161


303

304
feasible set, 83
Fermat rule, 2, 3
finite generation, 162
finite support, 70
frame, 163
function
affine, 53, 85
continuous, 185, 298
convex, 51
epi-polyhedral, 169
improper, 88
inf-compact, 186
linear-quadratic, 117, 139
Lipschitz continuous, 298
monitoring, 118, 139
piecewise continuous, 74
proper, 88
quadratic, 61
separable, 67, 87
smooth, 3
sublinear, 57
summable, 74
support, 139

INDEX
interior, 90, 298
interior-point method, 134, 136
Jacobian, 13, 100
KKT-conditions, 101
KKT-multipliers, 101

half-space, 157
closed, 94
Hessian matrix, 10
hyperplane, 94, 157
hypograph, 53

Lagrange multipliers, 100


Lagrangian, 127
Lagrangian finite generation, 147
least squares, 4, 31
level set, 54
convex, 54
liminf, 185
limit, 298
lower, 185
upper, 185
limsup, 185
linear complementarity, 133
linear program, 24, 86, 104, 165
infeasible, 179
large scale, 27
unbounded, 175
linear-quadratic function, 117, 139
linearly constrained, 85
Lipschitz continuity, 298
lower limit, 185
lower semicontinuity, 183, 185
lsc
function, 183
lsc at a point, 184
lsc function, 185

indicator function, 55
inf-compact function, 186
inf-projection, 56
integrand, 75

marginal function, 170


max-function, 55
minimizer
global, 7

game
zero-sum, 130
gradient, 3

INDEX
local, 7
Minkowski theorem, 161
monitoring function, 37, 118, 139
Moreau-Rockafellar sum rule, 90
Newton direction, 10
Newton method, 10
Newton-Raphson method, 13
nonlinear program, 36
norm, 297
normal, 91
normal cone, 91
normal equation, 5
Oresme rule, 2, 3
outer product, 15
penalty function, 37
plain distribution, 72
polar, 160
polyhedral set, 85, 157
polytope, 162
positive definite, 60
positive hull, 158
positive semi-definite, 60
positively homogeneous, 57
primal problem, 131
primal-dual interior-point method,
134
probability distribution, 69
product
sets, 49
proper subsets, 298
quadratic program, 31, 36, 86, 104,
132
Quasi-Newton
condition, 15

305
method, 15
random technology, 145
ray, 50
saddle function, 129
saddle point, 129
semicontinuity
lower, 185
upper, 185
separable program, 87
set
boundary, 298
bounded, 298
closed, 298
compact, 298
interior, 298
open, 298
simple recourse, 107, 145
slack variable, 114
steepest descent, 7
step size
Armijo, 9
stochastic program
with recourse, 27
stochastic program with recourse,
39
subderivative, 62
subgradient, 62
sublinear function, 57
sublinear polyhedral, 169
sum rule
Moreau-Rockafellar, 90
support function, 139
unit ball, 298
upper limit, 185
upper semicontinuity, 184, 185

306
usc
function, 184
usc at a point, 185
usc function, 185
value function, 170
vertex, 173
vertical closure, 56
Weyl theorem, 158
zero-sum game, 130

INDEX

You might also like