Optimisation Notes
Written by Fabricio Oliveira
Associate Professor of Operations Research
Aalto University, School of Science
Contents (excerpt)

I Linear optimisation

1 Introduction
1.1 What is optimisation?
1.1.1 Mathematical programming and optimisation
1.1.2 Types of mathematical optimisation models
1.2 Linear programming applications
1.2.1 Resource allocation
1.2.2 Transportation problem
1.2.3 Production planning (lot-sizing)
1.3 The geometry of LPs - graphical method
1.3.1 The graphical method
1.3.2 Geometrical properties of LPs
1.4 Exercises

12 Introduction
12.1 What is optimisation?
12.1.1 Mathematical programming and optimisation
12.1.2 Types of mathematical optimisation models
12.2 Examples of applications
12.2.1 Resource allocation and portfolio optimisation
12.2.2 The pooling problem: refinery operations planning
12.2.3 Robust optimisation
12.2.4 Classification: support-vector machines
Part I

Linear optimisation
Chapter 1
Introduction
In general terms, finding an optimal solution involves two complementary tasks:

1. analysing properties of the function f(x) under specific domains and deriving the conditions that must be satisfied for a point x to be a candidate optimal point;

2. applying numerical methods that iteratively search for points satisfying these conditions.
This idea is central in several knowledge domains and is often described with area-specific nomenclature. Fields such as economics, engineering, statistics, machine learning and, perhaps more broadly, operations research, are intensive users and developers of optimisation theory and applications.
• the domain represents business rules or constraints, capturing logical relations, design or engineering limitations, requirements, and so on;
With these in mind, we can represent the decision problem as a mathematical programming model of the form (1.1), which can be solved using optimisation methods. From now on, we will refer to this specific class of models as mathematical optimisation models, or optimisation models for short. We will also use the term 'solving the problem' to refer to the task of finding optimal solutions to optimisation models.
This book mostly focuses on the optimisation techniques employed to find optimal solutions for these models. As we will see, depending on the nature of the functions f and g used to formulate the model, some methods might be more or less appropriate. Further complicating the issue, for models of a given nature, there might be alternative algorithms that can be employed, with no general consensus on whether one method performs better than the others. This is one of the aspects that make optimisation so exciting and multifaceted. I hope this will make more sense as we progress through the chapters.
The general form of a mathematical optimisation model, referenced above as (1.1), is

min. f(x)                              (1.1)
s.t.: gᵢ(x) ≤ 0, i = 1, . . . , m
      hᵢ(x) = 0, i = 1, . . . , l
      x ∈ X.

Depending on the nature of f, g, h, and X, optimisation models are typically classified as follows.
1. Unconstrained models: in these, the set X = Rn and m = l = 0. These are prominent in,
e.g., machine learning and statistics applications, where f represents a measure of model
fitness or prediction error.
2. Linear programming (LP): presumes a linear objective function f(x) = c⊤x and affine constraints g and h, i.e., of the form aᵢ⊤x − bᵢ, with aᵢ ∈ Rⁿ and bᵢ ∈ R. Normally, X = { x ∈ Rⁿ : xⱼ ≥ 0, j = 1, . . . , n }, enforcing that the decision variables are constrained to the nonnegative orthant.
3. Nonlinear programming (NLP): some or all of the functions f, g, and h are nonlinear.

4. Mixed-integer (linear) programming (MIP): consists of an LP in which some (or all, in which case we have simply integer programming) of the variables are constrained to be integers. In other words, X ⊆ Rᵏ × Zⁿ⁻ᵏ. Very frequently, the integer variables are constrained to be binary, i.e., xᵢ ∈ {0, 1} for i = 1, . . . , n − k, and are meant to represent true-or-false or yes-or-no conditions.

5. Mixed-integer nonlinear programming (MINLP): combines the features of MIPs and NLPs.
Remark: notice that we use the vector notation c⊤x = ∑_{j∈J} cⱼxⱼ, with J = {1, . . . , n}. This is just a convenience for keeping the notation compact. Here, c = [c1, . . . , cn]⊤ and x = [x1, . . . , xn]⊤ are n-sized vectors, and c⊤x denotes their inner (or dot) product. The transpose sign ⊤ is meant to reinforce that we see our vectors as column vectors, unless otherwise stated.
Next, we need to define constraints that state the conditions for a plan to be valid. In this context, a valid plan is one that does not utilise more than the amount of available resources bᵢ, ∀i ∈ I. This can be expressed as the collection (one for each i ∈ I) of affine (more often called linear, somewhat imprecisely, a convention we will also follow) inequalities

s.t.: ∑_{j∈J} aᵢⱼxⱼ ≤ bᵢ, ∀i ∈ I ⇒ Ax ≤ b,
where aᵢⱼ are the components of the m × n matrix A and b = [b1, . . . , bm]⊤. Furthermore, we must also require that xⱼ ≥ 0, ∀j ∈ J.
Combining the above, we obtain the generic formulation that will be used throughout this text to
represent linear programming models:
max. c⊤ x (1.2)
s.t.: Ax ≤ b (1.3)
x ≥ 0. (1.4)
Let us work on a more specific example that will be useful for illustrating some important concepts
related to the geometry of linear programming problems.
Let us consider a paint factory that produces exterior and interior paint from raw materials M1 and M2. The maximum demand for interior paint is 2 tons/day. Moreover, the amount of interior paint produced cannot exceed that of exterior paint by more than 1 ton/day.

Our goal is to determine the optimal paint production plan. Table 1.1 summarises the data to be considered; notice the constraints that must be imposed to represent the daily availability of the raw materials.

                      Exterior paint   Interior paint   Daily availability
M1 usage (tons/ton)         6                4                  24
M2 usage (tons/ton)         1                2                   6
Profit (per ton)            5                4                   -

Table 1.1: Paint factory problem data
The paint factory problem is an example of a resource allocation problem. Perhaps one aspect that
is somewhat dissimilar is the constraint representing the production rules regarding the relative
amounts of exterior and interior paint. Notice, however, that this type of constraint also has the
same format as the more straightforward resource allocation constraints.
Let x1 be the amount of exterior paint produced (in tons/day) and x2 the amount of interior paint. The complete model that optimises the daily production plan of the paint factory is:
max. z = 5x1 + 4x2 (1.5)
s.t.: 6x1 + 4x2 ≤ 24 (1.6)
x1 + 2x2 ≤ 6 (1.7)
x2 − x1 ≤ 1 (1.8)
x2 ≤ 2 (1.9)
x1 , x2 ≥ 0 (1.10)
Notice that the paint factory model can also be compactly represented as in (1.2)-(1.4), where c = [5, 4]⊤, x = [x1, x2]⊤, b = [24, 6, 1, 2]⊤, and

A = [  6  4 ]
    [  1  2 ]
    [ -1  1 ]
    [  0  1 ].
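To make this concrete, below is a minimal sketch of model (1.5)-(1.10) written in JuMP, the algebraic modelling tool mentioned later in this chapter; the choice of HiGHS as the underlying LP solver is an assumption, and any solver supported by JuMP would do.

```julia
using JuMP, HiGHS

model = Model(HiGHS.Optimizer)

# Daily production of exterior (x1) and interior (x2) paint, in tons
@variable(model, x1 >= 0)                # (1.10)
@variable(model, x2 >= 0)

@objective(model, Max, 5x1 + 4x2)        # (1.5)
@constraint(model, 6x1 + 4x2 <= 24)      # (1.6) raw material M1
@constraint(model, x1 + 2x2 <= 6)        # (1.7) raw material M2
@constraint(model, x2 - x1 <= 1)         # (1.8) production rule
@constraint(model, x2 <= 2)              # (1.9) demand limit

optimize!(model)
value(x1), value(x2), objective_value(model)  # expected: (3.0, 1.5, 21.0)
```

The expected optimum, x1 = 3 and x2 = 1.5 with objective value 21, is the point we will revisit when discussing the graphical method.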
Another important class of linear programming problems is that of transportation problems. These are often modelled using the abstraction of graphs, since they consider a network of nodes and arcs through which some flow must be optimised. Transportation problems have several important characteristics that can be exploited to design specialised algorithms, such as the so-called transportation simplex method. Although we will not discuss these specialised methods in this text, the simplex method (and its variant, the dual simplex method) will be at the centre of our developments later on. Also, modern solvers have increasingly deprioritised transportation simplex methods in their development, as the dual simplex method has consistently been shown to perform comparably in the context of transportation problems, despite being a far more general method.
The problem can be summarised as follows. We would like to plan the production and distribution
of a certain product, taking into account that the transportation cost is known (e.g., proportional
to the distance travelled), the factories (or source nodes) have a capacity limit, and the clients (or
demand nodes) have known demands. Figure 1.1 illustrates a small network with two factories,
located in San Diego and Seattle, and three demand points, located in New York, Chicago, and
Miami. Table 1.2 presents the data related to the problem.
Figure 1.1: Schematic illustration of a network with two source nodes (SD and SE) and three demand nodes (NY, CH, and MI)
Clients
Factory NY Chicago Miami Capacity
Seattle 2.5 1.7 1.8 350
San Diego 3.5 1.8 1.4 600
Demands 325 300 275 -
Table 1.2: Problem data: unit transportation costs, demands and capacities
Let xᵢⱼ be the amount transported from factory i ∈ I to client j ∈ J. The total cost to be minimised is ∑_{i∈I} ∑_{j∈J} cᵢⱼxᵢⱼ, where cᵢⱼ is the unit transportation cost from i to j. The problem has two types of constraints that must be observed, relating to the supply capacity and demand requirements. These can be stated as the following linear constraints

∑_{j∈J} xᵢⱼ ≤ Cᵢ, ∀i ∈ I                    (1.11)
∑_{i∈I} xᵢⱼ ≥ Dⱼ, ∀j ∈ J,                   (1.12)
where Cᵢ is the production capacity of factory i and Dⱼ is the demand of client j. Notice that the term on the lefthand side of (1.11) accounts for the total production at each of the source nodes i ∈ I. Analogously, in constraint (1.12), the term on the left accounts for the total demand satisfied at the demand nodes j ∈ J.
Using an optimality argument, we can see that any solution for which ∑_{i∈I} xᵢⱼ > Dⱼ for some j ∈ J can be improved by making ∑_{i∈I} xᵢⱼ = Dⱼ. This shows that, under these conditions, the constraint will always be satisfied as an equality at an optimal solution, and could thus equivalently be stated as an equality constraint.
The complete transportation model for the example above can be stated as

min. ∑_{i∈I} ∑_{j∈J} cᵢⱼxᵢⱼ
s.t.: ∑_{j∈J} xᵢⱼ ≤ Cᵢ, ∀i ∈ I
      ∑_{i∈I} xᵢⱼ ≥ Dⱼ, ∀j ∈ J
      xᵢⱼ ≥ 0, ∀i ∈ I, ∀j ∈ J.
One interesting aspect to notice regarding algebraic forms is that they allow one to represent the main structure of the model while remaining independent of the instance being considered. For example, regardless of whether the instance has 5 or 50 nodes, the algebraic formulation is the same, allowing the problem instance (in our case, the 5-node network) to be detached from the model itself. Moreover, most computational tools for mathematical programming modelling (hereinafter referred to simply as modelling, such as JuMP) empower the user to define the optimisation model using this algebraic representation.
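To illustrate this detachment between model and instance, below is a sketch of the transportation model written algebraically in JuMP, populated with the data of Table 1.2; a 50-node instance would reuse the same four model statements with different data. The use of HiGHS as the solver is again an assumption.

```julia
using JuMP, HiGHS

# Instance data (Table 1.2); the model statements below do not depend on it
I = ["Seattle", "San Diego"]
J = ["NY", "Chicago", "Miami"]
C = Dict("Seattle" => 350.0, "San Diego" => 600.0)
D = Dict("NY" => 325.0, "Chicago" => 300.0, "Miami" => 275.0)
c = Dict(("Seattle", "NY") => 2.5, ("Seattle", "Chicago") => 1.7,
         ("Seattle", "Miami") => 1.8, ("San Diego", "NY") => 3.5,
         ("San Diego", "Chicago") => 1.8, ("San Diego", "Miami") => 1.4)

model = Model(HiGHS.Optimizer)
@variable(model, x[i in I, j in J] >= 0)                       # flow from i to j
@objective(model, Min, sum(c[(i, j)] * x[i, j] for i in I, j in J))
@constraint(model, [i in I], sum(x[i, j] for j in J) <= C[i])  # capacity (1.11)
@constraint(model, [j in J], sum(x[i, j] for i in I) >= D[j])  # demand (1.12)
optimize!(model)
```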
Algebraic forms are the main form in which we will specify optimisation models. This abstraction is a peculiar aspect of mathematical programming and perhaps one of its main features: one must formulate a model for each specific setting, which can be done in multiple ways, and the choice of formulation might have consequences for how well an algorithm performs computationally. Further in this text, we will discuss this point in more detail.
Figure 1.2: A schematic representation of the lot-sizing problem over periods t = 1, . . . , T, with demands Dₜ, production amounts pₜ, and inventories kₜ. Each node represents the material balance at each time period t.
A few points must be considered carefully when dealing with lot-sizing problems. First, one must carefully consider the boundary conditions, that is, what the model decides at the final period t = T and what the initial inventory (carried from t = 0) is. The model sees the former as the "end of the world" and will thus realise that the optimal inventory level at period T must be zero, while the latter might considerably influence how much production is needed during the planning horizon. Both must be observed and handled accordingly.
Figure 1.3: The feasible region of the paint factory problem (in Figure 1.3b), represented as the intersection of the four closed half-spaces formed by each of the constraints (as shown in Figure 1.3a). Notice how the feasible region is a polyhedral set in R², as there are two decision variables (x1 and x2).
A constraint is active at a given point if, at that point, it is satisfied as an equality. For example, the constraints 6x1 + 4x2 ≤ 24 and x1 + 2x2 ≤ 6 are active at the optimum x* = (3, 1.5), since 6(3) + 4(1.5) = 24 and 3 + 2(1.5) = 6. An active constraint indicates that the resource (or requirement) represented by that constraint is being fully depleted (or minimally satisfied).
Analogously, inactive constraints are constraints that are satisfied as strict inequalities at the optimum. For example, the constraint −x1 + x2 ≤ 1 is inactive at the optimum, as −(3) + 1.5 < 1. In this case, an inactive constraint represents a resource (or requirement) that is not fully depleted (or is over-satisfied).
Figure 1.4: Graphical representation of some of the level curves of the objective function z = 5x1 + 4x2. Notice that the constant gradient vector ∇z = (5, 4)⊤ points in the direction in which the level curves increase in value. The optimal point is x* = (3, 1.5)⊤, with the furthermost level curve being that associated with the value z* = 21.
In principle, one could list all such candidate points, exhaustively test each of them, and select the best (i.e., that with the largest objective function value). The issue, however, is that the number of candidates increases exponentially with the number of constraints and variables of the problem, meaning this approach would quickly become computationally intractable.
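To make the idea tangible, the sketch below brute-forces the paint factory problem: every pair of constraints is taken as active, the corresponding 2×2 linear system is solved, infeasible intersection points are discarded, and the best remaining candidate is kept. This is purely didactic (and not how the simplex method operates); the data are those of constraints (1.6)-(1.9), with the nonnegativity bounds written as ordinary inequalities.

```julia
using LinearAlgebra

# Paint factory constraints in the form a'x <= b, including -x1 <= 0, -x2 <= 0
A = [6.0 4.0; 1.0 2.0; -1.0 1.0; 0.0 1.0; -1.0 0.0; 0.0 -1.0]
b = [24.0, 6.0, 1.0, 2.0, 0.0, 0.0]

function best_vertex(A, b, c)
    best_x, best_z = nothing, -Inf
    m = length(b)
    for i in 1:(m - 1), j in (i + 1):m
        rows = [i, j]
        rank(A[rows, :]) == 2 || continue      # skip parallel constraint pairs
        x = A[rows, :] \ b[rows]               # point where both are active
        all(A * x .<= b .+ 1e-9) || continue   # discard infeasible candidates
        z = dot(c, x)
        z > best_z && ((best_x, best_z) = (x, z))
    end
    return best_x, best_z
end

best_vertex(A, b, [5.0, 4.0])  # expected: ([3.0, 1.5], 21.0)
```

With 6 constraints there are only 15 pairs to test, but the count grows combinatorially with the problem size, which is exactly the issue raised above.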
As we will see, it turns out that this search idea can be made surprisingly efficient and is, in fact,
the underlying framework of the simplex method. However, there are indeed artificially engineered
worst-case settings where the method does need to consider every single vertex.
The simplex method exploits the above idea to heuristically search for solutions by selecting, from the m constraints available, n of them to be active. Starting from an initial selection of active constraints, it selects one inactive constraint to activate and one active constraint to deactivate, in a way that improvement in the objective function is observed while feasibility is maintained.
This process repeats until no improvement can be observed. When such is the case, the geometry
of the problem guarantees (global) optimality. In the following chapters, we will concentrate on
defining algebraic objects that we will use to develop the simplex method.
1.4 Exercises
[Figure: transportation network with plant nodes Buffalo and Austin and client nodes Chicago, Denver, and Erie]
Clients
Factory    Chicago   Denver   Erie   Capacity
Buffalo       4         9       8       25
Austin       10         7       9      600
Demands      15        12      13       -

Table 1.3: Problem data: unit transportation costs, demands and capacities
Consider the following network, where the supplies from the Austin and Buffalo nodes need to meet the demands in Chicago, Denver, and Erie. The data for the problem is presented in Table 1.3.
Solve the transportation problem, finding the minimum cost transportation plan.
The Finnish company Next has a logistics problem in which used oils serve as the raw material for a class of renewable diesel. The supply chain team needs to organise, at minimal cost, the acquisition of two types of used oils (products p1 and p2) from three suppliers (supply nodes s1, s2, and s3) to feed three of their used-oil processing factories (demand nodes d1, d2, and d3). As the used oils are a byproduct of the suppliers' core activities, the only requirement is that Next fetch the oil, paying the transportation costs alone.
The oils have specific conditioning and handling requirements, so transportation costs vary between p1 and p2. Additionally, not all the routes (arcs) between suppliers and factories are available, as some distances are not economically feasible. Table 1.4a shows the volume requirements for the two types of oil at each supply and demand node, and Table 1.4b shows the transportation costs per litre for each oil type and the arc capacities. Arcs with "-" for costs are not available as transportation routes.
Find the optimal oil acquisition plan for Next, i.e., solve its transportation problem to the minimum
cost.
node    p1     p2
s1      80    400
s2     200   1500
s3     200    300
d1      60    300
d2     100   1000
d3     200    500

(a) Volume requirements per node

       d1            d2            d3
       p1/p2 (cap)   p1/p2 (cap)   p1/p2 (cap)
s1     5/- (∞)       5/18 (300)    -/- (0)
s2     8/15 (300)    9/12 (700)    7/14 (600)
s3     -/- (0)       10/20 (∞)     8/- (∞)

(b) Transportation costs per litre for p1/p2 and arc capacities

Table 1.4: Problem data for the used-oil transportation problem
(a) The predictions are 100% accurate and the mean yields are the only possible realisations.

(b) There are three possible equiprobable scenarios (i.e., each one with probability equal to 1/3): a good, a fair, and a bad weather scenario. In the good weather scenario, the yield is 20% better than the expected yield, whereas in the bad weather scenario it is reduced by 20% relative to the mean yield. In the fair weather scenario, the yield for each crop keeps the historical mean: 2.5 T/acre, 3 T/acre, and 20 T/acre for wheat, corn, and sugar beets, respectively.

(c) What happens if we assume the same scenarios as in item (b), but with probabilities 25%, 25%, and 50% for good, fair, and bad weather, respectively? How does the production plan change, and why?
It is possible to store up to 100 of each product at a time at a cost of $0.5 per unit per month.
There are no stocks at present, but it is desired to have a stock of 50 of each type of product at
the end of June.
The factory works six days a week with two shifts of 8h each day. Assume that each month consists
of only 24 working days. Also, there are no penalties for unmet demands. What is the factory’s
production plan (how much of which product to make and when) in order to maximise the total
profit?
Chapter 2

Basics of Linear Algebra
Matrix inversion is the “kingpin” of linear (and nonlinear) optimisation. As we will see later on,
performing efficient matrix inversion operations (in reality, operations that are equivalent to matrix
inversion but that can exploit the matrix structure to be made faster) is of utmost importance for
developing a linear optimisation solver.
Another important concept is the notion of linear independence. We formally state when a collec-
tion of vectors is said to be linearly independent (or dependent) in Definition 2.2.
Definition 2.2 (Linearly independent vectors). The vectors x1, . . . , xk ∈ Rⁿ are linearly dependent if there exist real numbers a1, . . . , ak, with aᵢ ≠ 0 for at least one i ∈ {1, . . . , k}, such that

∑_{i=1}^{k} aᵢxᵢ = 0;

otherwise, x1, . . . , xk are linearly independent.
In essence, for a collection of vectors to be linearly independent, it must be so that none of the
vectors in the collection can be expressed as a linear combination (that is, multiplying the vectors
by nonzero scalars and adding them) of the others. Analogously, they are said to be linearly
dependent if one vector in the collection can be expressed as a linear combination of the others.
This is simpler to see in R2 . Two vectors are linearly independent if one cannot obtain one by
multiplying the other by a constant, which effectively means that they are not parallel. If the two
vectors are not parallel, then one of them must have a component in a direction that the other cannot achieve. The same idea can be generalised to any n-dimensional space, which also explains why one can only have up to n linearly independent vectors in Rⁿ. Figure 2.1 illustrates this idea.

Figure 2.1: Linearly independent (top) and dependent (bottom) vectors in R². Notice how, in the bottom picture, any of the vectors can be obtained by appropriately scaling and adding the other two.
Theorem 2.3 summarises results that we will utilise in the upcoming developments. These are
classical results from linear algebra and the proof is left as an exercise.
Theorem 2.3 (Inverses, linear independence, and solving Ax = b). Let A be an m × m matrix. Then, the following statements are equivalent:

1. A is invertible;
2. A⊤ is invertible;
3. the determinant of A is nonzero;
4. the rows of A are linearly independent;
5. the columns of A are linearly independent;
6. for every b ∈ Rᵐ, the linear system Ax = b has a unique solution;
7. there exists some b ∈ Rᵐ such that the linear system Ax = b has a unique solution.
Notice that Theorem 2.3 establishes important relationships between the geometry of the matrix A (its rows and columns) and the consequences for our ability to calculate its inverse A⁻¹ and, consequently, to solve the system Ax = b, whose solution is then obtained as x = A⁻¹b. Solving linear systems of equations will turn out to be the most important operation in the simplex method.
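As a computational side note, the inverse A⁻¹ is rarely formed explicitly when solving Ax = b; a factorisation of A is used instead. A small sketch in Julia (the language underlying JuMP), with an assumed example system:

```julia
using LinearAlgebra

A = [2.0 1.0; 1.0 3.0]
b = [5.0, 10.0]

x1 = inv(A) * b   # forms A⁻¹ explicitly: correct, but wasteful
x2 = A \ b        # preferred: solves Ax = b via an LU factorisation
F = lu(A)         # the factorisation can be reused for many right-hand sides
x3 = F \ b

@assert x1 ≈ x2 ≈ x3  # all three yield x = [1.0, 3.0]
```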
A (linear) subspace of Rⁿ is a nonempty set S ⊆ Rⁿ that is closed under linear combinations, i.e., such that S = { ax + by : x, y ∈ S; a, b ∈ R }.
A related concept is the notion of a span. The span of a collection of vectors x1, . . . , xk ∈ Rⁿ is the subspace of Rⁿ formed by all linear combinations of such vectors, i.e.,

span(x1, . . . , xk) = { y = ∑_{i=1}^{k} aᵢxᵢ : aᵢ ∈ R, i ∈ {1, . . . , k} }.
Notice how the two concepts are related: the span of a collection of vectors forms a subspace. Therefore, a subspace can be characterised by a collection of vectors whose span forms it. In other words, the span of a set of vectors is the subspace formed by all points we can represent by some linear combination of these vectors.
The missing part in this is the notion of a basis. A basis of a subspace S ⊆ Rⁿ is a collection of linearly independent vectors x1, . . . , xk ∈ Rⁿ such that span(x1, . . . , xk) = S.
Notice that a basis is a "minimal" set of vectors that forms a subspace. You can think of it in light of the definition of linearly independent vectors (Definition 2.2): if a vector is linearly dependent on the others, it is not needed for characterising the subspace that the vectors span, since it can be represented by a linear combination of the other vectors (and is thus in the subspace formed by their span).
The above leads us to some important realisations:
1. All bases of a given subspace S have the same number of vectors: any extra vector would be linearly dependent on those vectors that already span S. In that case, we say that the subspace has size (or dimension) k, the number of linearly independent vectors forming a basis of the subspace. We overload the notation dim(S) to represent the dimension of the subspace S.
2. If the subspace S ⊂ Rⁿ has a basis of size m < n, we say that S is a proper subspace with dim(S) = m, because it is not the whole Rⁿ itself, but a space contained within Rⁿ. For example, two linearly independent vectors form (i.e., span) a hyperplane in R³; this hyperplane is a proper subspace since dim(S) = m = 2 < 3 = n.
3. If a proper subspace has dimension m < n, then there are n − m directions in Rⁿ that are perpendicular to the subspace and to each other. That is, there are nonzero vectors aᵢ, orthogonal to each other and to S; equivalently, aᵢ⊤x = 0 for all x ∈ S and i = m + 1, . . . , n. Referring to R³, if m = 2, then there is a third direction that is perpendicular to (i.e., not in) S. Figure 2.2 can be used to illustrate this idea: notice how one can find a vector, say x3, that is perpendicular to S. This is because the whole space is R³, but S has dimension m = 2 (or dim(S) = 2).
Theorem 2.4 builds upon the previous points to guarantee the existence of bases and propose a procedure to form them.

Theorem 2.4 (Forming bases from linearly independent vectors). Suppose that S = span(x1, . . . , xk) has dimension m ≤ k. Then:

1. a basis of S can be formed from m of the vectors x1, . . . , xk;
2. if k′ ≤ m and x1, . . . , xk′ are linearly independent, then m − k′ of the remaining vectors can be added to them so that the m resulting vectors form a basis of S.
Proof. Notice that, if every vector xk′+1, . . . , xk can be expressed as a linear combination of x1, . . . , xk′, then every vector in S is also a linear combination of x1, . . . , xk′. Thus, x1, . . . , xk′ form a basis of S with m = k′. Otherwise, at least one of the vectors xk′+1, . . . , xk is linearly independent from x1, . . . , xk′. By picking one such vector, we now have k′ + 1 of the vectors x1, . . . , xk that are linearly independent. Repeating this process m − k′ times, we end up with a basis for S.

Figure 2.2: A proper subspace S of R³ spanned by the vectors x1 and x2.
Our interest in subspaces and bases spans from (pun intended!) their usefulness in explaining how
the simplex method works under a purely algebraic (as opposed to geometric) perspective. For
now, we can use the opportunity to define some “famous” subspaces which will often appear in
our derivations.
Let A be an m × n matrix as before. The column space of A is the subspace of Rᵐ spanned by the n columns of A (recall that each column has as many components as the number of rows and is thus an m-dimensional vector). Likewise, the row space of A is the subspace of Rⁿ spanned by the rows of A. Finally, the null space of A, often denoted null(A) = { x ∈ Rⁿ : Ax = 0 }, consists of the vectors that are perpendicular to the row space of A.
One important notion related to those subspaces is their dimension. Both the row and the column space have the same dimension, which is the rank of A. If A is full rank, then rank(A) = min{m, n}. Finally, the dimension of the null space of A is given by n − rank(A), which is in line with Theorem 2.4.
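These objects are easy to inspect numerically, and doing so confirms the dimension count above. A quick sketch, with an assumed example matrix:

```julia
using LinearAlgebra

A = [1.0 2.0 3.0;
     2.0 4.0 6.0]    # the second row is twice the first

rank(A)              # 1: the rows (and columns) span a line
N = nullspace(A)     # columns of N form an orthonormal basis of null(A)
size(N, 2)           # 2 = n - rank(A)
norm(A * N)          # ≈ 0: null(A) is orthogonal to the row space of A
```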
Figure 2.3: The affine subspace S generated by x0 and null(a)
Consider now a feasible region of the form S = { x ∈ Rⁿ : Ax = b }, where A is an m × n matrix; linear programming problems in this form typically have more variables than constraints, meaning that we will always have m < n. Now, assume that x0 ∈ Rⁿ is such that Ax0 = b. Then, for any x ∈ S, we have that

Ax = Ax0 = b ⇒ A(x − x0) = 0.
Thus, x ∈ S if and only if the vector (x − x0) belongs to null(A), the null space of A. Notice that the feasible region S can also be defined as

S = { x + x0 : x ∈ null(A) },

being thus an affine subspace with dimension n − m, provided that A has m linearly independent rows (i.e., rank(A) = m). This will have important implications for the way we can define multiple bases for S from the n vectors in the column space, and for what this process means geometrically. Figure 2.3 illustrates this concept for a single-row matrix a. For multiple rows, one would see S as the intersection of multiple hyperplanes.
Definition 2.5 (Polyhedral set). A polyhedral set is a set that can be described as

S = { x ∈ Rⁿ : Ax ≥ b },

where A is an m × n matrix and b ∈ Rᵐ.
One important thing to notice is that polyhedral sets, as defined in Definition 2.5, are formed by the intersection of multiple half-spaces. Specifically, let {aᵢ}i=1,...,m be the rows of A. Then, the set S can be described as

S = { x ∈ Rⁿ : aᵢ⊤x ≥ bᵢ, i = 1, . . . , m },                  (2.3)
which represents exactly the intersection of the half-spaces aᵢ⊤x ≥ bᵢ. Furthermore, notice that the hyperplanes aᵢ⊤x = bᵢ, ∀i ∈ {1, . . . , m}, are the boundaries of each half-space, and thus each may describe one of the facets of the polyhedral set. Figure 2.4 illustrates a hyperplane forming two half-spaces (also polyhedral sets) and how the intersection of five half-spaces forms a (bounded) polyhedral set.
Figure 2.4: A hyperplane and its respective half-spaces (left) and the polyhedral set { x ∈ R² : aᵢ⊤x ≤ bᵢ, i = 1, . . . , 5 } (right).
You might find authors referring to bounded polyhedral sets as polytopes. However, this is not used
consistently across references, sometimes with switched meanings (for example, using polytope to
refer to a set defined as in Definition 2.5 and using polyhedron to refer to a bounded version of
S). In this text, we will only use the term polyhedral set to refer to sets defined as in Definition
2.5 and use the term bounded whenever applicable.
Also, it may be useful to formally define some elements of polyhedral sets. For that, let us consider a hyperplane H = { x ∈ Rⁿ : a⊤x = b } and the set F = H ∩ S. This set is known as a face of the polyhedral set. If the face F has dimension zero, then F is called a vertex. Analogously, if dim(F) = 1, then F is called an edge. Finally, if dim(F) = dim(S) − 1, then F is called a facet. Notice that in R³, facets and faces are the same whenever the face is not an edge or a vertex.
Definition 2.6 (Convex set). A set S ⊆ Rⁿ is convex if, for any x, y ∈ S and any λ ∈ [0, 1], we have λx + (1 − λ)y ∈ S.

Definition 2.6 leads to a simple geometrical intuition: for a set to be convex, the line segment connecting any two points within the set must lie within the set. This is illustrated in Figure 2.5.

Figure 2.5: Two convex sets (left and middle) and one nonconvex set (right)

Associated with the notion of convex sets are two important elements we will refer to later, when we discuss linear problems that embed integrality requirements. The first is the notion of a convex combination, which is already contained in Definition 2.6 but can be generalised to an arbitrary number of points. The second consists of convex hulls, which are sets formed by combining the convex combinations of all elements within a given set. As one might suspect, convex hulls are always convex sets, regardless of whether the original set from which the points are drawn is convex or not. These concepts are formalised in Definition 2.7 and illustrated in Figure 2.6.

Figure 2.6: The convex hull of two points is the line segment connecting them (left); the convex hull of three (centre) and six (right) points in R²
Definition 2.7 (Convex combinations and convex hulls). Let x1, . . . , xk ∈ Rⁿ and λ1, . . . , λk ∈ R be such that λᵢ ≥ 0 for i = 1, . . . , k and ∑_{i=1}^{k} λᵢ = 1. Then

1. x = ∑_{i=1}^{k} λᵢxᵢ is a convex combination of x1, . . . , xk;

2. the convex hull of x1, . . . , xk, denoted conv(x1, . . . , xk), is the set of all convex combinations of x1, . . . , xk.
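As an aside, Definition 2.7 also yields a computational test: y ∈ conv(x1, . . . , xk) if and only if there exist weights λ ≥ 0 summing to one that reproduce y, which is a linear feasibility problem. A sketch of this test using JuMP, with HiGHS as an assumed solver:

```julia
using JuMP, HiGHS

# Decide whether y lies in the convex hull of the columns of X
function in_hull(X, y)
    k = size(X, 2)
    model = Model(HiGHS.Optimizer)
    set_silent(model)
    @variable(model, λ[1:k] >= 0)       # candidate convex-combination weights
    @constraint(model, sum(λ) == 1)
    @constraint(model, X * λ .== y)     # the weights must reproduce y
    optimize!(model)
    return termination_status(model) == MOI.OPTIMAL
end

X = [0.0 1.0 0.0;                       # columns: (0,0), (1,0), (0,1) in R²
     0.0 0.0 1.0]
in_hull(X, [0.25, 0.25])                # true: inside the triangle
in_hull(X, [1.0, 1.0])                  # false: outside it
```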
We are now ready to state the result that guarantees the convexity of polyhedral sets of the form S = { x ∈ Rⁿ : Ax ≤ b }.

Theorem 2.8 (Convexity of polyhedral sets). The following statements are true:

1. the intersection of convex sets is convex;
2. every polyhedral set is a convex set;
3. a convex combination of a finite number of elements of a convex set also belongs to that set;
4. the convex hull of a finite number of vectors is a convex set.

Proof.
1. Let S1 and S2 be convex sets, x, y ∈ S1 ∩ S2, and λ ∈ [0, 1]. Since S1 and S2 are convex, λx + (1 − λ)y ∈ S1 and λx + (1 − λ)y ∈ S2, and thus λx + (1 − λ)y ∈ S1 ∩ S2. The argument extends to any number of sets.
2. Let a ∈ Rn and b ∈ R. Let x, y ∈ Rn , such that a⊤ x ≥ b and a⊤ y ≥ b. Let λ ∈ [0, 1]. Then
a⊤ (λx + (1 − λ)y) ≥ λb + (1 − λ)b = b, showing that half-spaces are convex. The result
follows from combining this with (1).
3. By induction. Let S be a convex set and assume that any convex combination of x1, . . . , xk ∈ S also belongs to S. Consider k + 1 elements x1, . . . , xk+1 ∈ S and λ1, . . . , λk+1 with λᵢ ∈ [0, 1] for i = 1, . . . , k + 1, ∑_{i=1}^{k+1} λᵢ = 1, and λk+1 ≠ 1 (without loss of generality). Then

∑_{i=1}^{k+1} λᵢxᵢ = λk+1 xk+1 + (1 − λk+1) ∑_{i=1}^{k} (λᵢ / (1 − λk+1)) xᵢ.     (2.4)

Notice that ∑_{i=1}^{k} λᵢ/(1 − λk+1) = 1. Thus, using the induction hypothesis, ∑_{i=1}^{k} (λᵢ/(1 − λk+1)) xᵢ ∈ S. Considering that S is convex and using (2.4), we conclude that ∑_{i=1}^{k+1} λᵢxᵢ ∈ S, completing the induction.
4. Let S = conv(x1, . . . , xk). Let y = ∑_{i=1}^{k} αᵢxᵢ and z = ∑_{i=1}^{k} βᵢxᵢ be such that y, z ∈ S, αᵢ, βᵢ ≥ 0, and ∑_{i=1}^{k} αᵢ = ∑_{i=1}^{k} βᵢ = 1. Let λ ∈ [0, 1]. Then

λy + (1 − λ)z = λ ∑_{i=1}^{k} αᵢxᵢ + (1 − λ) ∑_{i=1}^{k} βᵢxᵢ = ∑_{i=1}^{k} (λαᵢ + (1 − λ)βᵢ)xᵢ.     (2.5)

Since ∑_{i=1}^{k} [λαᵢ + (1 − λ)βᵢ] = 1 and λαᵢ + (1 − λ)βᵢ ≥ 0 for i = 1, . . . , k, λy + (1 − λ)z is a convex combination of x1, . . . , xk and, thus, λy + (1 − λ)z ∈ S, showing the convexity of S.
Figure 2.7 illustrates some of the statements in the proof. For example, the intersection of convex sets is always a convex set; one should notice, however, that the same does not hold for the union of convex sets. Notice also that statement 2 proves that polyhedral sets as defined in Definition 2.5 are convex. Finally, the third figure (on the right) illustrates the convex hull of four points as a convex polyhedral set containing the line segments connecting any two points within the set.
We will halt our discussion about convexity for now and return to it in deeper detail in Part 2.
As it will become clearer then, the presence of convexity (which is a given in the context of linear
programming, as we have just seen) is what allows us to conclude that the solutions returned by
our optimisation algorithms are indeed optimal for the problem at hand.
[Figure 2.8: illustration of vertices and extreme points of polyhedral sets P, showing hyperplanes { y : c⊤y = c⊤x } and { y : c⊤y = c⊤w } (left) and points u, v, w, x, y, z used in the convex-combination characterisation (right)]
As discussed in the previous chapter, the optimum of a linear programming problem is generally located at a vertex of the feasible set. Furthermore, such vertices are formed by the intersection of n constraints (in an n-dimensional space) that are active (i.e., satisfied at the boundary of the half-space they define).
First, let us formally define the notions of vertex and extreme point. Although in general these can refer to different objects, we will see that in the case of linear programming problems, if a point is a vertex, then it is an extreme point as well, with the converse also being true.
Definition 2.9 (Vertex). Let P be a convex polyhedral set. The vector x ∈ P is a vertex of P if
there exists some c such that c⊤ x < c⊤ y for all y ∈ P with y ̸= x.
Definition 2.10 (Extreme points). Let P be a convex polyhedral set. The vector x ∈ P is an extreme point of P if there are no two vectors y, z ∈ P, both different from x, such that x = λy + (1 − λ)z for some λ ∈ [0, 1].
Figure 2.8 provides an illustration of the Definitions 2.9 and 2.10. Notice that the definition of a
vertex involves an additional hyperplane that, once placed on a vertex point, strictly contains the
whole polyhedral set in one of the half-spaces it defines, except for the vertex itself. On the other
hand, the definition of an extreme point only relies on convex combinations of elements in the set
itself.
Definition 2.9 also hints at an important consequence for linear programming problems. As we have seen in Theorem 2.8, P is convex, and the definition guarantees that P \ {x} is contained in the half-space { y : c⊤y > c⊤x }. This implies that c⊤x ≤ c⊤y, ∀y ∈ P, which is precisely the condition that x must satisfy to be the minimum of the problem min { c⊤x : x ∈ P }.
Now we focus on the description of active constraints from an algebraic standpoint. For that, let
us first generalise our setting by considering all possible types of linear constraints. That is, let us
consider the convex polyhedral set P ⊂ Rn , formed by the set of inequalities and equalities:
aᵢ⊤x ≥ bᵢ, i ∈ M1,
aᵢ⊤x ≤ bᵢ, i ∈ M2,
aᵢ⊤x = bᵢ, i ∈ M3.
Definition 2.11 (Active (or binding) constraints). If a vector x satisfies aᵢ⊤x = bᵢ for some i ∈ M1, M2, or M3, we say that the corresponding constraints are active (or binding).
For example, in Figure 2.9, while points A, B, C, and D have 3 active constraints, E only has 2 active constraints (x2 = 0 and x1 + x2 + x3 = 1).

Figure 2.9: Points A to E in a polyhedral set P ⊂ R³ and their active constraints
Theorem 2.12 establishes the link between a collection of active constraints forming a vertex and the possibility of describing the vertex via a basis of the subspace spanned by the vectors aᵢ defining these constraints. This link is what will allow us to characterise vertices by their forming active constraints.
Theorem 2.12 (Properties of active constraints). Let x ∈ Rⁿ and I = { i ∈ M1 ∪ M2 ∪ M3 : aᵢ⊤x = bᵢ }. Then, the following are equivalent:

1. there exist n vectors in the collection {aᵢ}i∈I that are linearly independent;
2. the vectors {aᵢ}i∈I span Rⁿ, that is, every x ∈ Rⁿ can be expressed as a linear combination of {aᵢ}i∈I;
3. the system of equations { aᵢ⊤x = bᵢ }i∈I has a unique solution.
Proof. Suppose that {aᵢ}i∈I spans Rⁿ, implying that span({aᵢ}i∈I) has dimension n. By Theorem 2.4 (part 1), n of these vectors form a basis for Rⁿ and are, thus, linearly independent. Moreover, they must span Rⁿ, and therefore every x ∈ Rⁿ can be expressed as a linear combination of {aᵢ}i∈I. This connects (1) and (2).

Assume that the system of equations { aᵢ⊤x = bᵢ }i∈I has multiple solutions, say x1 and x2. Then, the nonzero vector d = x1 − x2 satisfies aᵢ⊤d = 0 for all i ∈ I. As d is orthogonal to every aᵢ, i ∈ I, d cannot be expressed as a linear combination of {aᵢ}i∈I and, thus, {aᵢ}i∈I do not span Rⁿ. Conversely, if {aᵢ}i∈I do not span Rⁿ, choose d ∈ Rⁿ orthogonal to span({aᵢ}i∈I). If x satisfies { aᵢ⊤x = bᵢ }i∈I, then so does x + d, thus yielding multiple solutions. This connects (2) and (3).
Notice that Theorem 2.12 implies that there are (at least) n active constraints with linearly independent vectors aᵢ at x. This is the reason why we refer to x, and any vertex-forming solution, as a basic solution; among these, we will be interested in those that are feasible, i.e., that satisfy all constraints i ∈ M1 ∪ M2 ∪ M3. Definition 2.13 provides a formal definition of these concepts.
Definition 2.13 (Basic feasible solution (BFS)). Consider a convex polyhedral set P ⊂ Rn defined
by linear equality and inequality constraints, and let x ∈ Rn .
1. x is a basic solution if:
(a) all equality constraints are active, and
(b) out of the constraints active at x, n of them are linearly independent;

2. if x is a basic solution satisfying all constraints, we say that x is a basic feasible solution.
Figure 2.10 provides an illustration of the notion of basic solutions and shows how only a subset of the basic solutions are feasible. As one might infer, these will be the points of interest in our future developments, as they are the candidates for optimal solutions.
Figure 2.10: Points A to F are basic solutions; B, C, D, and E are BFSs.
We finalise by stating the main result of this chapter, which formally confirms the intuition we have developed so far: for convex polyhedral sets, the notions of vertex and extreme point coincide, and these points can be represented as basic feasible solutions. This is precisely the link that allows us to treat the feasible region of linear programming problems under a purely algebraic characterisation of the candidates for optimal solutions, each described uniquely by a subset of the problem's constraints that is assumed to be active.
Theorem 2.14 (BFS, extreme points and vertices). Let P ⊂ Rⁿ be a convex polyhedral set and let x̄ ∈ P. Then, the following are equivalent:

x̄ is a vertex ⇐⇒ x̄ is an extreme point ⇐⇒ x̄ is a BFS.
Proof. Let P = { x ∈ Rⁿ : aᵢ⊤x ≥ bᵢ, i ∈ M1; aᵢ⊤x = bᵢ, i ∈ M2 } and, for x̄ ∈ P, let I = { i ∈ M1 ∪ M2 : aᵢ⊤x̄ = bᵢ }.

1. (Vertex ⇒ Extreme point) Suppose x̄ is a vertex. Then, there exists some c ∈ Rⁿ such that c⊤x̄ < c⊤x for every x ∈ P with x ≠ x̄ (cf. Definition 2.9). Take y, z ∈ P with y, z ≠ x̄. Thus, c⊤x̄ < c⊤y and c⊤x̄ < c⊤z. For λ ∈ [0, 1], c⊤x̄ < c⊤(λy + (1 − λ)z), implying that x̄ ≠ λy + (1 − λ)z; x̄ is thus an extreme point (cf. Definition 2.10).

2. (Extreme point ⇒ BFS) Suppose x̄ ∈ P is not a BFS. Then, there are no n linearly independent vectors within {aᵢ}i∈I, and thus {aᵢ}i∈I lie in a proper subspace of Rⁿ. Let the nonzero vector d ∈ Rⁿ be such that aᵢ⊤d = 0 for all i ∈ I. Let ϵ > 0, y = x̄ + ϵd, and z = x̄ − ϵd. Notice that aᵢ⊤y = aᵢ⊤z = bᵢ for all i ∈ I. Moreover, for i ∉ I, aᵢ⊤x̄ > bᵢ and, provided that ϵ is sufficiently small (such that ϵ|aᵢ⊤d| < aᵢ⊤x̄ − bᵢ for all i ∉ I), we have that aᵢ⊤y ≥ bᵢ. Thus y ∈ P and, by a similar argument, z ∈ P. Now, by noticing that x̄ = (1/2)y + (1/2)z, we see that x̄ is not an extreme point.
3. (BFS ⇒ Vertex) Let x̄ be a BFS. Define c = ∑_{i∈I} aᵢ. Then

c⊤x̄ = ∑_{i∈I} aᵢ⊤x̄ = ∑_{i∈I} bᵢ.

Moreover, since aᵢ⊤x ≥ bᵢ for i ∈ M1 ∪ M2, we have c⊤x ≥ ∑_{i∈I} bᵢ = c⊤x̄ for any x ∈ P, with equality holding only if aᵢ⊤x = bᵢ for all i ∈ I; as x̄ is a BFS, this system has x̄ as its unique solution. Thus, c⊤x̄ < c⊤x for any x ∈ P with x ≠ x̄, making x̄ a vertex (cf. Definition 2.9).
Some interesting insights emerge from the proof of Theorem 2.14, upon which we will build our next developments. Once the equivalence between being a vertex (or extreme point) and being a BFS is established, it means that x̄ can be recovered as the unique solution of a system of linear equations, these equations being the constraints active at that vertex. This means that the list of all candidate points for an optimal solution can be obtained by simply looking at all possible combinations of n active constraints, discarding those that are infeasible. Consequently, the number of candidates for an optimal solution is finite and can be bounded by the binomial coefficient m-choose-n, where m = |M1 ∪ M2|.
2.4 Exercises
Exercise 2.1: Polyhedral sets [1]
Which of the following sets are polyhedral sets?
Note: this theorem is proved in the notes. Use this as an opportunity to revisit the
proof carefully, and try to take as many steps without consulting the text as you can. This is a
great exercise to help you internalise the proof and its importance in the context of this book. I
strongly advise against blindly memorising it, as I suspect you will never (in my courses, at least)
be requested to recite the proof literally.
Theorem (Inverses, linear independence, and solving Ax = b). Let A be an m × m matrix. Then, the following statements are equivalent:

1. A is invertible
2. A⊤ is invertible
3. The determinant of A is nonzero
4. The rows of A are linearly independent
5. The columns of A are linearly independent
6. For every b ∈ Rm , the linear system Ax = b has a unique solution
7. There exists some b ∈ Rm such that Ax = b has a unique solution.
Consider the convex polyhedral set P ⊂ Rⁿ formed by the constraints

aᵢ⊤x ≥ bᵢ, i ∈ M1,
aᵢ⊤x ≤ bᵢ, i ∈ M2,
aᵢ⊤x = bᵢ, i ∈ M3,

and, for x ∈ Rⁿ, let I = { i ∈ M1 ∪ M2 ∪ M3 : aᵢ⊤x = bᵢ }. Then, the following are equivalent:

1. there exist n vectors in the collection {aᵢ}i∈I that are linearly independent;
2. the vectors {aᵢ}i∈I span Rⁿ, that is, every x ∈ Rⁿ can be expressed as a linear combination of {aᵢ}i∈I;
3. the system of equations { aᵢ⊤x = bᵢ }i∈I has a unique solution.
Theorem (BFS, extreme points and vertices). Let P ⊂ Rⁿ be a convex polyhedral set and let x ∈ P. Then, the following are equivalent:

1. x is a vertex;
2. x is an extreme point;
3. x is a BFS.
max. 2x1 + x2
s.t.: 2x1 + 2x2 ≤ 9
2x1 − x2 ≤ 3
x1 − x2 ≤ 1
x1 ≤ 2.5
x2 ≤ 4
x1 , x2 ≥ 0.
Assess the following points relative to the polyhedron defined in R² by this system and classify them in terms of (i) which constraints are active at them, (ii) whether they are infeasible, basic, or basic feasible solutions, and (iii) whether they are extreme points, vertices, or lie outside the polyhedron. Use Theorem 2.12 and Theorem 2.14 to check whether your classification is correct.
a) (1.5, 0)
b) (1, 0)
c) (2, 1)
d) (1.5, 3)
Chapter 3

Basis, Extreme Points and Optimality in Linear Programming
where |u| represents the cardinality of the vector u. Another transformation that may be required consists of imposing the condition x ≥ 0. Let us assume that a polyhedral set P is such that (notice the absent nonnegativity condition)

P = { x ∈ Rⁿ : Ax = b }.
Standard-form linear programming problems require all variables to be nonnegative. To achieve that in this case, we can simply include two auxiliary variables, say x⁺ and x⁻, with the same dimension as x, and reformulate P as

P = { x⁺, x⁻ ∈ Rⁿ : A(x⁺ − x⁻) = b; x⁺, x⁻ ≥ 0 }.
These transformations, as we will see, will be required for employing the simplex method to solve
linear programming problems with inequality constraints and, inevitably, will always render stan-
dard form linear programming problems with more variables than constraints, or m < n.
The standard-form polyhedral set P always has, by definition, m active constraints, because of its equality constraints. To reach the total of n active constraints, n − m of the remaining constraints xᵢ ≥ 0, i = 1, . . . , n, must be activated, which is achieved by selecting n − m of those variables to be set as xᵢ = 0. These n active constraints (the original m plus the n − m variables set to zero) form a basic solution, as we saw in the last chapter. If the m equalities can still be satisfied while all constraints xᵢ ≥ 0, i = 1, . . . , n, hold, then we have a basic feasible solution (BFS). Theorem 3.1 summarises this process, guaranteeing that setting n − m variables to zero renders a basic solution.
Theorem 3.1 (Linear independence and basic solutions). Consider the constraints Ax = b and x ≥ 0, and assume that A has m linearly independent (LI) rows, indexed by M = {1, . . . , m}. A vector x ∈ Rⁿ is a basic solution if and only if Ax = b and there exist indices B(1), . . . , B(m) such that:

1. the columns AB(1), . . . , AB(m) are linearly independent;
2. if j ∉ {B(1), . . . , B(m)}, then xⱼ = 0.
Proof. Assume that (1) and (2) are satisfied. Then the active constraints xⱼ = 0 for j ∉ {B(1), . . . , B(m)} and Ax = b imply that

∑_{i=1}^{m} AB(i)xB(i) = ∑_{j=1}^{n} Aⱼxⱼ = Ax = b.

Since the columns {AB(i)}i∈M are LI, the values {xB(i)}i∈M are uniquely determined, and thus Ax = b has a unique solution, implying that x is a basic solution (cf. Theorem 2.14).
Conversely, assume that x is a basic solution. Let xB(1), . . . , xB(k) be the nonzero components of x. Thus, the system

∑_{i=1}^{n} Aᵢxᵢ = b and { xᵢ = 0 }i∉{B(1),...,B(k)}

has a unique solution, and so does ∑_{i=1}^{k} AB(i)xB(i) = b, implying that the columns AB(1), . . . , AB(k) are LI. Otherwise, there would be scalars λ1, . . . , λk, not all zeros, for which ∑_{i=1}^{k} AB(i)λᵢ = 0; this would imply that ∑_{i=1}^{k} AB(i)(xB(i) + λᵢ) = b, contradicting the uniqueness of x. Since AB(1), . . . , AB(k) are LI, k ≤ m. Also, since A has m LI rows, it must have m LI columns spanning Rᵐ. Using Theorem 2.4, we can obtain m − k additional columns AB(k+1), . . . , AB(m) so that AB(1), . . . , AB(m) are LI.

Finally, since k ≤ m, { xⱼ = 0 }j∉{B(1),...,B(m)} ⊂ { xⱼ = 0 }j∉{B(1),...,B(k)}, satisfying (1) and (2).
This proof highlights an important aspect of the process of generating basic solutions. Notice that once we set n − m variables to zero, the system of equations forming P becomes uniquely determined, i.e.,

∑_{i=1}^{m} AB(i)xB(i) = ∑_{j=1}^{n} Aⱼxⱼ = Ax = b.
You might have noticed that, in the proof of Theorem 3.1, the focus shifted to the columns of A rather than its rows. The reason is that, when we solve the system Ax = b, what we are truly doing is finding a vector x representing the linear combination of the columns of A that yields the vector b. This creates an association between the columns of A and the components of x (i.e., the variables). Notice, however, that the columns of A are not the variables per se, as they have dimension m (while x ∈ Rⁿ).
One important interpretation of Theorem 3.1 is that we form bases for the column space of A by choosing m components of the n-dimensional vector x to be (potentially) nonzero. Since m < n by definition and the rows of A are assumed LI, we have rank(A) = min{m, n} = m, meaning that both the column and the row spaces have dimension m; the selected columns thus form bases for the column space of A. Finally, finding the vector x is the same as finding how the vector b can be expressed as a linear combination of that basis. Notice that this is always possible, since the basis spans Rᵐ (as we have m LI column vectors) and b ∈ Rᵐ.
You will notice that, from here onwards, we will implicitly refer to the columns of A as variables (although we actually mean the weight associated with each column, represented by the respective component of x). Then, when we say that we are setting some (n − m) of the variables to zero, it means that we are ignoring the respective columns of A (the mapping between variables and columns being their indices: x1 referring to the first column, x2 to the second, and so forth), while using the remainder to form a (unique) combination that yields the vector b. The weights of this combination are precisely the solution x which, in turn, represents the coordinates in Rⁿ of the vertex formed by the n active constraints (the m equality constraints plus the n − m variables set to zero).
As we will see, this procedure will be at the core of the simplex method. Since we will often refer
to elements associated with this procedure, it will be useful to define some nomenclature.
We say that B = [AB(i)]i∈IB is a basis (or, perhaps more precisely, a basic matrix) with basic indices IB = {B(1), . . . , B(m)}. Consequently, we say that the variables xⱼ, for j ∈ IB, are basic variables. Somewhat analogously, we say that the variables chosen to be set to zero are the nonbasic variables xⱼ, for j ∈ IN, where IN = J \ IB, with J = {1, . . . , n} being the indices of all variables (and all columns of A).
Notice that the basic matrix B is invertible, since its columns are LI (cf. Theorem 2.3). For xB = (xB(1), . . . , xB(m)), the unique solution of BxB = b is

xB = B⁻¹b, where B = [AB(1) · · · AB(m)] and xB = [xB(1), . . . , xB(m)]⊤.
Let us consider the following numerical example, with the set P given by

P = { x ∈ R³ : x1 + x2 + 2x3 ≤ 8,
               x2 + 6x3 ≤ 12,
               x1 ≤ 4,
               x2 ≤ 6,
               x1, x2, x3 ≥ 0 },                                 (3.1)

which can be written in the standard form by adding slack variables x4, . . . , x7, yielding

P = { x ∈ R⁷ : x1 + x2 + 2x3 + x4 = 8,
               x2 + 6x3 + x5 = 12,
               x1 + x6 = 4,
               x2 + x7 = 6,
               x1, . . . , x7 ≥ 0 }.                              (3.2)

The system Ax = b can be represented as

[ 1 1 2 1 0 0 0 ]       [  8 ]
[ 0 1 6 0 1 0 0 ]  x =  [ 12 ]
[ 1 0 0 0 0 1 0 ]       [  4 ]
[ 0 1 0 0 0 0 1 ]       [  6 ].
Following our notation, we have m = 4 and n = 7. The rows of A are LI, meaning that rank(A) = 4. We can make arbitrary selections of n − m = 3 variables to be set to zero (i.e., to be nonbasic) and calculate the values of the remaining (basic) variables. For example:
• Let IB = {4, 5, 6, 7}; in that case xB = (8, 12, 4, 6) and x = (0, 0, 0, 8, 12, 4, 6), which is a
basic feasible solution (BFS), as x ≥ 0.
• For IB = {3, 5, 6, 7}, xB = (4, −12, 4, 6) and x = (0, 0, 4, 0, −12, 4, 6), which is basic but not
feasible, since x5 < 0.
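These calculations are straightforward to reproduce. The sketch below extracts the basic columns of A from (3.2) and solves BxB = b for the two basis choices above:

```julia
using LinearAlgebra

A = [1.0 1.0 2.0 1.0 0.0 0.0 0.0;
     0.0 1.0 6.0 0.0 1.0 0.0 0.0;
     1.0 0.0 0.0 0.0 0.0 1.0 0.0;
     0.0 1.0 0.0 0.0 0.0 0.0 1.0]
b = [8.0, 12.0, 4.0, 6.0]

# Set the nonbasic variables to zero and solve B * xB = b for the basic ones
function basic_solution(A, b, IB)
    x = zeros(size(A, 2))
    x[IB] = A[:, IB] \ b
    return x
end

basic_solution(A, b, [4, 5, 6, 7])      # (0,0,0,8,12,4,6): basic and feasible
x = basic_solution(A, b, [3, 5, 6, 7])  # (0,0,4,0,-12,4,6): basic, not feasible
all(x .>= 0)                            # false, since x5 = -12 < 0
```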
Definition 3.2 (Adjacent basic solutions). Two basic solutions are adjacent if they share n − 1 LI
active constraints. Alternatively, two bases B1 and B2 are adjacent if all but one of their columns
are the same.
For example, consider the polyhedral set P defined in (3.2). Our first BFS was defined by making x1 = x2 = x3 = 0 (nonbasic index set IN = {1, 2, 3}), so our basis was IB = {4, 5, 6, 7}. An adjacent basis was then formed by replacing the basic variable x4 with the nonbasic variable x3, rendering the new (not feasible) basis IB = {3, 5, 6, 7}.

Notice that the process of moving between adjacent bases has a simple geometrical interpretation. Since adjacent bases share all but one basic element, the two solutions must be connected by a line segment: in the case of the example, it would be the segment between (0, 8) and (4, 0) in the projection onto (x3, x4) ∈ R², or, equivalently, the segment between (0, 0, 0, 8, 12, 4, 6) and (0, 0, 4, 0, −12, 4, 6). (Notice how the coordinate x5 also changed value; this is necessary so that the movement is made along the edge of the polyhedral set.) This will become clearer when we analyse the simplex method in further detail in Chapter 4.
Theorem 3.3 (Redundant constraints). Let P = { x ∈ Rⁿ : aᵢ⊤x = bᵢ, i ∈ M } ≠ ∅, where M = {1, . . . , m}, and suppose that rank(A) = k < m, with the rows ai1, . . . , aik linearly independent. Let Q = { x ∈ Rⁿ : aij⊤x = bij, j = 1, . . . , k }. Then P = Q.

Proof. Assume, without loss of generality, that i1 = 1 and ik = k. Clearly, P ⊂ Q, since a solution satisfying the constraints forming P also satisfies those forming Q.

As rank(A) = k, the rows ai1, . . . , aik form a basis of the row space of A, and any row aᵢ, i ∈ M, can be expressed as aᵢ⊤ = ∑_{j=1}^{k} λᵢⱼaⱼ⊤ for some λᵢⱼ ∈ R. For y ∈ Q and i ∈ M, we thus have

aᵢ⊤y = ∑_{j=1}^{k} λᵢⱼaⱼ⊤y = ∑_{j=1}^{k} λᵢⱼbⱼ = bᵢ,

which implies that y ∈ P and that Q ⊂ P. Consequently, P = Q.
Theorem 3.3 implies that any linear programming problem in standard form can be reduced to
an equivalent problem with linearly independent constraints. It turns out that, in practice, most
professional-grade solvers (i.e., software that implements solution methods and can be used to find
optimal solutions to mathematical programming models) have preprocessing routines to remove
redundant constraints. This means that the problem is automatically treated to become smaller
by not incorporating unnecessary constraints.
Degeneracy is somewhat related to the notion of redundant constraints. We say that a given vertex is a degenerate basic solution if it is formed by the intersection of more than n active constraints (in Rⁿ). Effectively, this means that more than n − m variables (i.e., some of the basic variables) are set to zero, which is the main way to identify a degenerate BFS. Figure 3.1 illustrates a case in which degeneracy is present.

Notice that, while in the figure on the left the constraint causing degeneracy is redundant, that is not the case in the figure on the righthand side. That is, redundant constraints may cause degeneracy, but not all constraints causing degeneracy are redundant.
Figure 3.1: A is a degenerate basic solution, B and C are degenerate BFSs, and D is a BFS.
In practice, degeneracy might cause issues related to the way we identify vertices. Because more than n active constraints form the vertex, and yet we identify vertices by groups of n constraints taken to be active, we might have a collection of adjacent bases that, in fact, represent the same vertex in space. This means we might be "stuck" for a while in the same position while scanning through adjacent bases. The numerical example below illustrates this phenomenon.
Let us consider again the same example as before:

[ 1 1 2 1 0 0 0 ]       [  8 ]
[ 0 1 6 0 1 0 0 ]  x =  [ 12 ]
[ 1 0 0 0 0 1 0 ]       [  4 ]
[ 0 1 0 0 0 0 1 ]       [  6 ].
• Let IB = {1, 2, 3, 7}; this implies that x = (4, 0, 2, 0, 0, 0, 6). There are 4 zeros (instead of n − m = 3) in x, which indicates degeneracy.

• Now, let IB = {1, 3, 4, 7}. This also implies that x = (4, 0, 2, 0, 0, 0, 6). The two bases are adjacent, yet they represent the same point in R⁷.
As we will see, there are mechanisms that prevent the simplex method from being stuck on such vertices forever, an issue referred to as cycling. One final point to observe about degeneracy is that it can be caused by the chosen representation of the problem: two equivalent descriptions of the same polyhedral set may differ in whether a given point is a degenerate basic solution.
A natural starting point to devise an optimisation method is to define the optimality conditions we wish to satisfy. In other words, we must define the conditions that, once satisfied, mean that we can stop the algorithm and declare the current solution optimal.
Definition 3.4 (Containing a line). A polyhedral set P ⊂ Rⁿ contains a line if there exist a point x ∈ P and a nonzero vector d ∈ Rⁿ such that x + λd ∈ P for all λ ∈ R.

Figure 3.3 illustrates the notion of containing a line and the existence of extreme points. Notice that if a set has a "corner", then no line can be contained in the set between the edges that form that corner and, therefore, that corner is an extreme point.
Figure 3.3: P contains a line (left) and Q does not contain a line (right)
We are now ready to pose the result that utilises Definition 3.4 to provide the conditions for the
existence of extreme points.
Theorem 3.5 (Existence of extreme points). Let P = { x ∈ Rⁿ : aᵢ⊤x ≥ bᵢ, i = 1, . . . , m } ≠ ∅ be a polyhedral set. Then the following are equivalent:

1. P has at least one extreme point;
2. P does not contain a line;
3. there exist n linearly independent vectors within {a1, . . . , am}.
It turns out that linear programming problems in standard form do not contain a line, meaning that, when feasible, they always have at least one extreme point (or basic feasible solution). More generally, bounded polyhedral sets do not contain a line, and neither does the positive orthant.
We are now ready to state the result that proves the intuition we built when analysing the plots in Chapter 1: if a polyhedral set has at least one extreme point and at least one optimal solution, then there must be an optimal solution that is an extreme point.

Theorem 3.6 (Optimality of extreme points). Consider the problem min { c⊤x : x ∈ P } over a polyhedral set P ⊂ Rⁿ. Suppose that P has at least one extreme point and that an optimal solution exists. Then, there exists an optimal solution that is an extreme point of P.
Theorem 3.6 is posed in a somewhat general way, which might be a source of confusion. First, recall that, in the example in Chapter 1, we considered the possibility of the objective function level curve associated with the optimal value being parallel to one of the edges of the feasible region, meaning that, instead of a single optimal solution (a vertex), we would observe a line segment containing an infinite number of optimal solutions, of which exactly two would be extreme points. In a more general case (with n > 2), a whole facet of optimal solutions might be obtained. That is precisely the polyhedral set of all optimal solutions Q in the proof. Clearly, this polyhedral set does not contain a line and, therefore (cf. Theorem 3.5), has at least one extreme point.
Perhaps another important point is to notice that the result is posed assuming a minimisation problem, but it naturally holds for maximisation problems as well. In fact, maximising a function f(x) is the same as minimising −f(x), with the caveat that, although the optimal solution x* is the same in both cases, the optimal values are symmetric in sign (because of the additional minus sign we included in the problem being originally maximised).
This is important because we intend to design an algorithm that only inspects extreme points. This discussion guarantees that, even in the cases in which a whole set of optimal solutions exists, some elements in that set will be extreme points anyway, and thus identifiable by our method.
This very simple procedure happens to be the core idea of most optimisation methods. We will
concentrate on how to identify directions of improvement and, as a consequence of their absence,
how to identify optimality.
Starting from a point x ∈ P , we would like to move in a direction d that yields improvement.
Definition 3.7 provides a formalisation of this idea.
Definition 3.7 (Feasible directions). Let x ∈ P , where P ⊂ Rn is a polyhedral set. A vector
d ∈ Rn is a feasible direction at x if there exists θ > 0 for which x + θd ∈ P .
Figure 3.4 illustrates the concept. Notice that at extreme points, the relevant feasible directions
are those along the edges of the polyhedral set since those are the directions that can lead to other
extreme points.
Let us now devise a way of identifying feasible directions algebraically. For that, let A be a m × n
matrix, I = {1, . . . , m} and J = {1, . . . , n}. Consider the problem
min. {c⊤x : Ax = b, x ≥ 0}.
Let x be a basic feasible solution (BFS) with basis B = [A_B(1), . . . , A_B(m)]. Recall that the basic
variables x_B are given by x_B = B^{-1}b and that the remaining nonbasic variables x_N are such
that x_N = (x_j)_{j∈I_N} = 0, with I_N = J \ I_B.
Moving to a neighbouring solution can be achieved by simply moving between adjacent bases, which
can be accomplished without significant computational burden. This entails selecting a nonbasic
variable x_j, j ∈ I_N, and increasing it to a positive value θ.
Equivalently, we can define a feasible direction d = [d_N, d_B], where d_N represents the components
associated with nonbasic variables and d_B those associated with basic variables, and move from the
point x to the point x + θd. The components d_N associated with the nonbasic variables are thus
defined as

d_j = 1 and d_{j′} = 0, for all j′ ≠ j,
with j, j ′ ∈ IN . Notice that, geometrically, we are moving along a line in the dimension represented
by the nonbasic variable xj .
Now, feasibility might become an issue if we are not careful. To retain feasibility, we must observe
that A(x + θd) = b, implying that Ad = 0. This allows us to define the components d_B of the
direction vector d associated with the basic variables x_j, j ∈ I_B, since
0 = Ad = Σ_{j=1}^n A_j d_j = Σ_{i=1}^m A_B(i) d_B(i) + A_j = Bd_B + A_j
and thus dB = −B −1 Aj is the basic direction implied by the choice of the nonbasic variable xj ,
j ∈ IN , to become basic. The vector dB can be thought of as the adjustments that must be made
in the value of the other basic variables to accommodate the new variable becoming basic to retain
feasibility.
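To make the algebra concrete, the following is a minimal numpy sketch (an illustration added here, not part of the original derivation) that computes x_B and the basic direction d_B for the small standard-form example used later in this chapter; the data A, b and the basis choice I_B = {1, 2} are taken from that example.

```python
import numpy as np

# Example data: min c^T x  s.t.  x1 + x2 + x3 + x4 = 2,  2x1 + 3x3 + 4x4 = 2,  x >= 0
A = np.array([[1.0, 1.0, 1.0, 1.0],
              [2.0, 0.0, 3.0, 4.0]])
b = np.array([2.0, 2.0])

basic = [0, 1]                        # basis I_B = {1, 2} (0-indexed)
B = A[:, basic]                       # basic matrix
x_B = np.linalg.solve(B, b)           # basic variable values, here (1, 1)

j = 2                                 # nonbasic variable x3 chosen to enter
d_B = -np.linalg.solve(B, A[:, j])    # basic direction d_B = -B^{-1} A_j
print(x_B, d_B)                       # [1. 1.] and [-1.5  0.5]
```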
Figure 3.5 provides a schematic representation of this process, showing how the change between
adjacent bases can be seen as a movement between adjacent extreme points. Notice that it conveys
a schematic representation of an n = 5 dimensional problem, in which we ignore all the m dimensions
and concentrate on the (n − m)-dimensional projection of the feasible set. This implies that the only
constraints left are those associated with the nonnegativity of the variables, x ≥ 0, each associated
with an edge of this alternative representation. Thus, when we set n − m variables (the nonbasic
variables) to zero, we identify an associated extreme point. As n − m = 2, we can plot this
alternative representation on R².
Figure 3.5: Projection of the feasible set onto the space of nonbasic variables; the edges correspond to x_1 = 0, . . . , x_5 = 0, and the points A, B, and C are extreme points.
Clearly, overall feasibility, i.e., ensuring that x ≥ 0, can only be retained if θ > 0 is chosen
appropriately small. This can be achieved if the following is observed:
1. All the other nonbasic variables remain valued at zero, that is, xj ′ = 0 for j ′ ∈ IN \ {j}.
2. if x is a nondegenerate extreme point, then all xB > 0 and thus xB +θdB ≥ 0 for appropriately
small θ > 0.
3. if x is a degenerate extreme point, then d might not be a feasible direction since, for some B(i)
with d_B(i) < 0 and x_B(i) = 0, any θ > 0 will make x_B(i) < 0.
We will see later that we can devise a simple rule to define a value for θ that guarantees the above
is always observed. For now, we will put this discussion on hold and focus on the issue of how to
guide the choice of which nonbasic variable x_j, j ∈ I_N, to select to become basic.
Using the objective function c⊤x, if we move along the feasible direction d as previously defined,
the objective function value changes by

c⊤d = c_B^⊤d_B + c_j = c_j − c_B^⊤B^{-1}A_j,
where c_B = [c_B(1), . . . , c_B(m)]. The quantity c_j − c_B^⊤B^{-1}A_j can be used, for example, in a
greedy fashion, meaning that we choose the nonbasic variable index j ∈ I_N with the greatest
potential for improvement.
First, let us formally define this quantity, which is known as the reduced cost.
Definition 3.8 (Reduced cost). Let x be a basic solution associated with the basis B and let
c_B = [c_B(1), . . . , c_B(m)] be the objective function coefficients associated with the basis B. For each
nonbasic variable x_j, with j ∈ I_N, we define the reduced cost c̄_j as

c̄_j = c_j − c_B^⊤B^{-1}A_j.
The name reduced cost is motivated by the fact that it quantifies a cost change in the reduced
space of the basic variables. In effect, the reduced cost combines the change in the objective
function caused by a unit increase of the nonbasic variable x_j elected to become basic (the c_j
component) with the associated change caused by the accommodation of the basic variable values
to retain feasibility (the −c_B^⊤B^{-1}A_j component), as discussed in the previous section. Therefore,
the reduced cost can be understood as the marginal change in the objective function value
associated with each nonbasic variable.
Let us demonstrate this with a numerical example. Consider the following linear programming
problem
Thus, between d₃ and d₄, d₄ is a better direction, since its reduced cost indicates a reduction of 1 unit
of the objective function per unit of x₄. In contrast, the reduced cost associated with d₃ indicates
an increase of 0.5 units of the objective function per unit of x₃, indicating that this is a direction to
be avoided, as we are minimising. Clearly, the willingness to choose x_{j′}, j′ ∈ I_N, as the variable
to become basic will depend on whether its reduced cost c̄_{j′} = c_{j′} − c_B^⊤B^{-1}A_{j′} is negative (recall
that we want to minimise the problem, so the smaller the total cost, the better). Another point is
how large the reduced cost is in absolute value. Recall that the reduced cost is, in fact, a measure
of the marginal value associated with the increase in value of the nonbasic variable. Thus, the
larger it is in absolute value, the quicker the objective function value decreases per unit of increase
of the nonbasic variable value.

One interesting thing to notice is the reduced cost associated with basic variables. Recall that
B = [A_B(1), . . . , A_B(m)] and thus B^{-1}[A_B(1), . . . , A_B(m)] = I. Therefore, B^{-1}A_B(i) is the ith
column of I, denoted e_i, implying that

c̄_B(i) = c_B(i) − c_B^⊤B^{-1}A_B(i) = c_B(i) − c_B^⊤e_i = c_B(i) − c_B(i) = 0.
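As a quick numerical check of Definition 3.8 and of the zero reduced costs of basic variables, here is an illustrative numpy sketch using the same example data as before, with the cost vector c = (2, 0, 0, 0) that also appears in the step-size example of the next chapter.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0, 1.0],
              [2.0, 0.0, 3.0, 4.0]])
c = np.array([2.0, 0.0, 0.0, 0.0])
basic = [0, 1]                           # basis I_B = {1, 2}

B = A[:, basic]
p = np.linalg.solve(B.T, c[basic])       # simplex multipliers p^T = c_B^T B^{-1}
c_bar = c - A.T @ p                      # reduced costs of all variables
print(np.round(c_bar, 6))                # [0, 0, -3, -4]: basic entries are zero
```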
Theorem 3.9 (Optimality conditions). Consider a BFS x associated with the basis B, and let c̄
be the corresponding vector of reduced costs. Then (1) if c̄ ≥ 0, x is optimal; and (2) if x is optimal
and nondegenerate, then c̄ ≥ 0.

Proof. To prove (1), assume that c̄_j ≥ 0 for all j ∈ I_N, let y be a feasible solution to P, and let
d = y − x. We have that Ax = Ay = b and thus Ad = 0. Equivalently,

Bd_B + Σ_{j∈I_N} A_j d_j = 0 ⇒ d_B = −Σ_{j∈I_N} B^{-1}A_j d_j, implying that

c⊤d = c_B^⊤d_B + Σ_{j∈I_N} c_j d_j = Σ_{j∈I_N} (c_j − c_B^⊤B^{-1}A_j) d_j = Σ_{j∈I_N} c̄_j d_j.    (3.3)

Since x_j = 0 for j ∈ I_N and y ≥ 0, we have d_j = y_j ≥ 0 for j ∈ I_N, so every term in the last sum
of (3.3) is nonnegative. Hence

c⊤d ≥ 0 ⇒ c⊤(y − x) ≥ 0 ⇒ c⊤y ≥ c⊤x, i.e., x is optimal.

To prove (2) by contradiction, assume that x is optimal with c̄_j < 0 for some j ∈ I_N. Since x is
nondegenerate, a strictly positive step along the jth basic direction d is feasible, and we could thus
improve on x by moving along it, contradicting the optimality of x.
A couple of remarks are worth making at this point. First, notice that, in the presence of degeneracy,
it might be that x is optimal with c̄_j < 0 for some j ∈ I_N. Luckily, the simplex method
manages to get around this issue in an effective manner, as we will see in the next chapter. Another
point to notice is that, if c̄_j > 0, ∀j ∈ I_N, then x is the unique optimal solution. Analogously, if
c̄ ≥ 0 with c̄_j = 0 for some j ∈ I_N, then moving in that direction causes no change in the
objective function value, implying that both BFS are "equally optimal" and that the problem has
multiple optimal solutions.
3.3 Exercises
Exercise 3.1: Properties of basic solutions
Prove the following theorem:
Theorem (Linear independence and basic solutions). Consider the constraints Ax = b and x ≥ 0,
and assume that A has m linearly independent rows, indexed by M = {1, . . . , m}. A vector x ∈ R^n is a
basic solution if and only if Ax = b and there exist indices B(1), . . . , B(m) such that (i) the columns
A_B(1), . . . , A_B(m) are linearly independent, and (ii) x_j = 0 for every j ∉ {B(1), . . . , B(m)}.
Note: the theorem is proved in the notes. Use this as an opportunity to revisit the proof carefully,
and try to take as many steps without consulting the text as you can. This is a great exercise to
help you internalise the proof and its importance in the context of the material. I strongly advise
against blindly memorising it, as I suspect you will never (in my courses, at least) be requested to
recite the proof literally.
(a) Enumerate all basic solutions, and identify those that are basic feasible solutions.
(b) Draw the feasible region, and identify the extreme point associated with each basic feasible
solution.
(c) Consider a minimization problem with the cost vector c′ = (c1 , c2 , c3 , c4 ) = (−2, 1/2, 0, 0).
Compute the basic directions and the corresponding reduced costs of the nonbasic variables
at the basic solution x′ = (3, 3, 0, 0) with x′B = (x1 , x2 ) and x′N = (x3 , x4 ); either verify that
x′ is optimal, or move along a basic direction which leads to a better solution.
max 2x1 + x2
s.t.
2x1 + 2x2 ≤ 9
2x1 − x2 ≤ 3
x1 − x2 ≤ 1
4x1 − 3x2 ≤ 5
x1 ≤ 2.5
x2 ≤ 4
x1 , x2 ≥ 0
(a) Plot the constraints and find the degenerate basic feasible solutions.
(b) What are the bases forming the degenerate solutions?
Moving along the feasible direction d towards x + θd (with scalar θ > 0) makes x_j > 0 (i.e.,
j ∈ I_N enters the basis) while reducing the objective value at a rate of c̄_j. Thus, one should move
as far as possible (say, take a step of length θ∗) along the direction d, which incurs an objective
value change of θ∗(c⊤d) = θ∗c̄_j.

Moving as far along the feasible direction d as possible while retaining feasibility is equivalent to
setting

θ∗ = max {θ ≥ 0 : A(x + θd) = b, x + θd ≥ 0}.

Recall that, by construction of the feasible direction d, we have Ad = 0 and thus A(x + θd) =
Ax = b. Therefore, the only feasibility condition that can be violated when setting θ too large is
the nonnegativity of the variables, i.e., x + θd ≥ 0.
To prevent this from happening, every basic variable x_B(i), i ∈ I_B, whose component d_B(i) in the
basic direction is negative must be guaranteed to retain

x_B(i) + θd_B(i) ≥ 0 ⇒ θ ≤ −x_B(i)/d_B(i).

Therefore, the maximum step size θ∗ is the largest value θ can take until the first component of x_B
turns zero. Or, more precisely put,

θ∗ = min_{i∈I_B : d_B(i)<0} (−x_B(i)/d_B(i)).
Notice that we only need to consider those basic variables whose components d_B(i), i ∈ I_B, are
negative. This is because, if d_B(i) ≥ 0, then x_B(i) + θd_B(i) ≥ 0 holds for any value of θ > 0. This
means that the constraints associated with these basic variables (referring to the representation in
Figure 3.5) do not limit the increase in value of the selected nonbasic variable. Notice that this can
lead to a pathological case in which none of the constraints limits the increase in value of the
nonbasic variable, which indicates that the problem has an unbounded direction of decrease for the
objective function. In this case, we say that the problem is unbounded.
Another important point is the assumption of a nondegenerate BFS. The nondegeneracy of the
BFS implies that x_B(i) > 0, ∀i ∈ I_B, and thus θ∗ > 0. In the presence of degeneracy, the step size
may be zero, and its definition must be handled more carefully.
Let us consider the numerical example we used in Chapter 3 with a generic objective function.
min. c1 x1 + c2 x2 + c3 x3 + c4 x4
s.t.: x1 + x2 + x3 + x4 = 2
2x1 + 3x3 + 4x4 = 2
x1 , x2 , x3 , x4 ≥ 0.
Let c = (2, 0, 0, 0) and I_B = {1, 2}, so that x_B = B^{-1}b = (1, 1). The reduced cost of the nonbasic
variable x₃ is

c̄₃ = c₃ − c_B^⊤B^{-1}A₃ = 0 − (2, 0)·(3/2, −1/2) = −3,

where d_B = −B^{-1}A₃ = (−3/2, 1/2). As x₃ increases in value, only x₁ decreases, since d₁ < 0.
Therefore, the largest θ for which x₁ ≥ 0 is −(x₁/d₁) = 2/3. Notice that this is precisely the value
that makes x₁ = 0, i.e., nonbasic. The new basic variable is now x₃ = 2/3, and the new (adjacent,
as we will see next) extreme point is

x̄ = x + θ∗d = (0, 4/3, 2/3, 0).
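The whole step can be verified numerically. The following sketch (illustrative; it assumes numpy) reproduces the computation above: it builds the direction d, applies the ratio test, and recovers the new extreme point.

```python
import numpy as np

A = np.array([[1.0, 1.0, 1.0, 1.0],
              [2.0, 0.0, 3.0, 4.0]])
b = np.array([2.0, 2.0])
basic = [0, 1]                            # I_B = {1, 2}
j = 2                                     # x3 enters the basis

B = A[:, basic]
x = np.zeros(4)
x[basic] = np.linalg.solve(B, b)          # x = (1, 1, 0, 0)

d = np.zeros(4)
d[j] = 1.0
d[basic] = -np.linalg.solve(B, A[:, j])   # d = (-3/2, 1/2, 1, 0)

# Ratio test: theta* = min over basic components with d_i < 0 of -x_i / d_i
mask = d[basic] < 0
theta = np.min(-x[basic][mask] / d[basic][mask])   # 2/3

print(x + theta * d)                      # new extreme point (0, 4/3, 2/3, 0)
```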
By moving to the BFS x̄ = x + θ∗d, we are in fact moving from the basis B to an adjacent basis B̄,
defined as

B̄(i) = B(i), for i ∈ I_B \ {l}, and B̄(l) = j.
Notice that the new basis B̄ only has one pair of variables swapped between basic and nonbasic
when compared against B. Analogously, the basic matrix associated with B̄ is given by

B̄ = [A_B(1) . . . A_B(l−1) A_j A_B(l+1) . . . A_B(m)],

where the middle column indicates that the column A_B(l) has been replaced with A_j.
Theorem 4.1 provides the technical result that formalises our developments so far.

Theorem 4.1 (Adjacent bases). Let A_j be the column of the matrix A associated with the selected
nonbasic variable index j ∈ I_N, and let l be defined as in (4.1), with A_B(l) being its respective column
in A. Then

(1) the columns A_B̄(i), i = 1, . . . , m, are linearly independent and, thus, B̄ is a basic matrix;

(2) the vector x̄ = x + θ∗d is a basic feasible solution associated with B̄.
Proof. To show (1), assume by contradiction that the columns are linearly dependent, making the
vectors B^{-1}A_B̄(i) linearly dependent as well. However, B^{-1}A_B̄(i) = B^{-1}A_B(i) for i ∈ I_B \ {l},
and these are all unit vectors e_i with the lth component equal to zero. Now, B^{-1}A_j = −d_B, whose
lth component d_B(l) ≠ 0, is linearly independent from the vectors B^{-1}A_B(i), i ∈ I_B \ {l}. Thus,
{A_B̄(i)}_{i∈I_B} = {A_B(i)}_{i∈I_B\{l}} ∪ {A_j} are linearly independent, forming the contradiction.
Now we focus on (2). We have that x̄ ≥ 0, Ax̄ = b, and x̄_j = 0 for j ∈ I_N̄ = J \ I_B̄. This,
combined with {A_B̄(i)}_{i∈I_B} being linearly independent (cf. (1)), completes the proof.
We have finally compiled all the elements we need to state the simplex method pseudocode, which
is presented in Algorithm 1. One minor detail in the presentation of the algorithm is the use of the
auxiliary vector u = B^{-1}A_j = −d_B (notice the change of sign). Precalculating its components
allows us to test for unboundedness, that is, the lack of a constraint (and associated basic variable)
that can limit the increase of the chosen nonbasic variable.

The last missing element is proving that Algorithm 1 eventually converges to an optimal solution,
should one exist. This is formally stated in Theorem 4.2.
Theorem 4.2 (Convergence of the simplex method). Assume that P has at least one feasible
solution and that all BFS are nondegenerate. Then, the simplex method terminates after a finite
number of iterations, in one of the following states:
Proof. If the condition in Line 3 of Algorithm 1 is not met, then B and the associated BFS are
optimal, cf. Theorem 3.9. Otherwise, if the condition in Line 5 is met, then d is such that Ad = 0
and d ≥ 0, implying that x + θd ∈ P for all θ > 0, with the objective value reduced indefinitely
(θc̄_j → −∞ as θ → ∞); the problem is thus unbounded.

Finally, notice that steps θ > 0 are only taken along directions d satisfying c⊤d < 0. Thus, the
objective value of successive solutions is strictly decreasing, and no BFS can be visited twice. As
there is a finite number of BFS, the algorithm must eventually terminate.
the possibility of moving to the nondegenerate basis IB = {3, 4, 5}, i.e., y. Assuming that we
are minimising, any direction with a negative component in the y-axis direction will represent an
improvement in the objective function (notice the vector c, which is parallel to the y-axis and
points upwards). Thus, the method is only stalled by the degeneracy of x but does not cycle.
Figure 4.1: I_N = {4, 5} for x; f (x₅ > 0) and g (x₄ > 0) are basic directions. Making I_N = {2, 5}
leads to the new basic directions h (x₄ > 0) and −g (x₂ > 0).
• Greedy selection (or Dantzig's rule): choose x_j, j ∈ I_N, with the largest |c̄_j| among the
negative reduced costs. Prone to cycling (see the sketch after this list).

• Index-based order (or Bland's rule): choose x_j, j ∈ I_N, with the smallest index j. It prevents
cycling but is computationally inefficient.

• Reduced cost pricing: calculate θ∗ for all (or some) j ∈ I_N and pick the smallest θ∗c̄_j.
Calculating the actual observed change for all nonbasic variables is too computationally
expensive; partial pricing refers to the idea of only considering a subset of the nonbasic
variables when calculating θ∗c̄_j.
• Devex 2 and steepest-edge 3 : most commonly used by modern implementations of the simplex
method, available in professional-grade solvers.
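For illustration, here is a small, hypothetical helper (the name choose_entering and its interface are ours, not from the notes) contrasting the first two rules above on a vector of reduced costs.

```python
import numpy as np

def choose_entering(c_bar, rule="dantzig"):
    """Pick the index of the nonbasic variable to enter the basis.

    c_bar: array of reduced costs of the nonbasic variables; a negative
    entry indicates a direction of improvement (minimisation problem).
    """
    candidates = np.flatnonzero(c_bar < 0)
    if candidates.size == 0:
        return None                      # current basis is optimal
    if rule == "dantzig":                # greedy: most negative reduced cost
        return candidates[np.argmin(c_bar[candidates])]
    if rule == "bland":                  # smallest index: prevents cycling
        return candidates[0]
    raise ValueError(rule)

print(choose_entering(np.array([0.0, -2.0, -5.0, 1.0]), "dantzig"))  # 2
print(choose_entering(np.array([0.0, -2.0, -5.0, 1.0]), "bland"))    # 1
```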
where the lth column A_B(l) is precisely how the adjacent bases B and B̄ differ, with A_j replacing
A_B(l) in B̄.
We can devise an efficient manner of updating B^{-1} into B̄^{-1} utilising elementary row operations.
First, let us formally define the concept.
First, let us formally define the concept.
Definition 4.3 (Elementary row operations). Adding a constant multiple of one row to the same
or another row is called an elementary row operation.
Defining an elementary row operation is the same as devising a matrix Q = I + D to premultiply
B, where

D_{ij} = β, and D_{i′j′} = 0 for all (i′, j′) ≠ (i, j).
2 P. M. J. Harris (1973), Pivot Selection Methods of the Devex LP Code, Mathematical Programming, 5, 1–28.
3 J. Forrest & D. Goldfarb (1992), Steepest-Edge Simplex Algorithms for Linear Programming, Mathematical Programming, 57, 341–374.
Calculating QB = (I + D)B is the same as having the j th row of B multiplied by a scalar β and
then having the resulting j th row added to the ith row of B. Before we continue, let us utilise a
numerical example to clarify this procedure. Let
B = [1 2]
    [3 4]
    [5 6]

and suppose we would like to multiply the third row by 2 and have it then added to the first row.
That means that D₁₃ = 2 and that Q = I + D would be

Q = [1 0 2]
    [0 1 0]
    [0 0 1].
As a side note, we have that Q^{-1} exists since det(Q) = 1. Furthermore, a sequence of elementary
row operations 1, 2, . . . , k can be represented by the product Q = Q_k · · · Q₂Q₁, as each new
operation premultiplies the current matrix.
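The numerical example above is easy to reproduce. The sketch below (assuming numpy for the matrix algebra) builds Q = I + D and verifies the effect of the elementary row operation on B.

```python
import numpy as np

B = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Q = I + D with D_13 = 2: multiply row 3 by 2 and add it to row 1
Q = np.eye(3)
Q[0, 2] = 2.0

print(Q @ B)   # first row becomes (1, 2) + 2*(5, 6) = (11, 14)
```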
Going back to the purpose of updating B^{-1} into B̄^{-1}, notice the following. Since B^{-1}B = I, each
term B^{-1}A_B(i), i ≠ l, is the ith unit vector e_i (the ith column of the identity matrix), while the
lth column of B^{-1}B̄ is u = B^{-1}A_j. That is,

B^{-1}B̄ = [e₁ . . . e_{l−1} u e_{l+1} . . . e_m],

i.e., an identity matrix whose lth column has been replaced with u = (u₁, . . . , u_m)⊤. To turn this
matrix into the identity, and thereby B^{-1} into B̄^{-1}, it suffices to:
1. for each i ≠ l, multiply the lth row by −u_i/u_l and add it to the ith row; this replaces u_i with
zero for all i ∈ I \ {l};

2. divide the lth row by u_l, replacing u_l with one (a sketch of this update is given below).
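A minimal sketch of this update, under the assumption that u_l ≠ 0, follows; it applies exactly these elementary row operations to B^{-1} and checks the result against a direct inversion.

```python
import numpy as np

def update_inverse(B_inv, u, l):
    """Given B^{-1} and u = B^{-1} A_j, return Bbar^{-1} for the basis where
    column l (0-indexed) is replaced by A_j, via elementary row operations."""
    T = B_inv.copy()
    T[l, :] /= u[l]                     # make the pivot element equal to one
    for i in range(T.shape[0]):
        if i != l:
            T[i, :] -= u[i] * T[l, :]   # zero out the other components of u
    return T

# Sanity check on a random basis change
rng = np.random.default_rng(0)
B = rng.random((3, 3)) + np.eye(3)
A_j = rng.random(3)
B_inv = np.linalg.inv(B)
u = B_inv @ A_j
B_new = B.copy()
B_new[:, 1] = A_j                        # replace column l = 1
print(np.allclose(update_inverse(B_inv, u, 1), np.linalg.inv(B_new)))  # True
```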
We can restate the simplex method in its revised form. This is presented in Algorithm 2.
Notice that in Algorithm 2, apart from the initialisation step, no linear systems are directly solved.
Instead, elementary row operations (ERO) are performed, leading to computational savings.
The key feature of the revised simplex method is a matter of representation and, thus, memory
allocation savings. Algorithm 2 only requires keeping in memory a matrix of the form

[ p⊤      | p⊤b ]
[ B^{-1}  | u   ]

which, after each series of elementary row operations, yields not only B̄^{-1} but also the updated
simplex multipliers p⊤ = c_B^⊤B^{-1} and the value p⊤b = c_B^⊤B^{-1}b = c_B^⊤x_B, which represents the
objective function value of the new basic feasible solution x = [x_B, x_N]. These bookkeeping savings
will become apparent once we discuss the tabular (or non-revised) version of the simplex method.
Three main issues arise when considering the efficiency of implementations of the simplex method,
namely, matrix (re)inversion, representation in memory, and the use of matrix decomposition
strategies.
• Reinversion: localised updates have the side effect of accumulating truncation and round-
off error. To correct this, solvers typically rely on periodically recalculating B −1 , which,
although costly, can avoid numerical issues.
between each basic feasible solution. However, it is a helpful representation from a pedagogical
standpoint and will be useful for explaining further concepts in the upcoming chapters.
In contrast to the revised simplex method, instead of updating only B −1 , we consider the complete
matrix
B^{-1}[A | b] = [B^{-1}A₁, . . . , B^{-1}A_n | B^{-1}b].
This matrix is arranged as the simplex tableau

[ c⊤ − c_B^⊤B^{-1}A | −c_B^⊤B^{-1}b ]
[      B^{-1}A      |    B^{-1}b    ]

that is, the zeroth row displays the reduced costs c̄₁, . . . , c̄_n followed by −c_B^⊤x_B, while the rows
below display B^{-1}A₁, . . . , B^{-1}A_n followed by the basic variable values x_B(1), . . . , x_B(m) in the
rightmost column.
In this representation, we say that the jth column, associated with the nonbasic variable to become
basic, is the pivot column u. Notice that, since the tableau exposes the reduced costs c̄_j, j ∈ I_N, it
allows for trivially applying the greedy pricing strategy (by simply choosing the variable with the
negative reduced cost of largest absolute value).
The lth row, associated with the basic variable selected to leave the basis, is the pivot row. Again,
the tableau representation facilitates the calculation of the ratios used in choosing the basic variable
l to leave the basis, since it amounts to simply calculating the ratios between the elements of the
rightmost column and those in the pivot column, disregarding the zeroth row and any rows whose
pivot-column entries are less than or equal to zero. The row with the minimal ratio will be the row
associated with the current basic variable leaving the basis.
Once a pivot column and a pivot row have been defined, it is a matter of performing elementary
row operations utilising the pivot row to turn the pivot column into the unit vector e_l and to zero
out the respective zeroth-row element (recall that basic variables have zero reduced costs). This is
the same as using elementary row operations based on the pivot row to turn all elements in the
pivot column to zero, except for the pivot element u_l (the intersection of the pivot row and the
pivot column), which must be turned into one. The above highlights the main purpose of the
tableau representation, which is to facilitate hand calculation.
Notice that, as we have seen before, performing elementary row operations to convert the pivot
column u into e_l converts B^{-1}[A | b] into B̄^{-1}[A | b]. Analogously, turning the zeroth-row entry
of the pivot column to zero converts [c⊤ − c_B^⊤B^{-1}A | −c_B^⊤B^{-1}b] into
[c⊤ − c_B̄^⊤B̄^{-1}A | −c_B̄^⊤B̄^{-1}b].
Let us return to the paint factory example seen earlier, which in its standard form can be written
as

min. −5x₁ − 4x₂
s.t.: 6x₁ + 4x₂ + x₃ = 24
      x₁ + 2x₂ + x₄ = 6
      −x₁ + x₂ + x₅ = 1
      x₂ + x₆ = 2
      x₁, . . . , x₆ ≥ 0.

The sequence of tableaus below shows the progress of the simplex method applied to this problem.
x1 x2 x3 x4 x5 x6 RHS
z -5 -4 0 0 0 0 0
x3 6 4 1 0 0 0 24
x4 1 2 0 1 0 0 6
x5 -1 1 0 0 1 0 1
x6 0 1 0 0 0 1 2
x1 x2 x3 x4 x5 x6 RHS
z 0 -2/3 5/6 0 0 0 20
x1 1 2/3 1/6 0 0 0 4
x4 0 4/3 -1/6 1 0 0 2
x5 0 5/3 1/6 0 1 0 5
x6 0 1 0 0 0 1 2
x1 x2 x3 x4 x5 x6 RHS
z 0 0 3/4 1/2 0 0 21
x1 1 0 1/4 -1/2 0 0 3
x2 0 1 -1/8 3/4 0 0 3/2
x5 0 0 3/8 -5/4 1 0 5/2
x6 0 0 1/8 -3/4 0 1 1/2
The highlighted terms in the tableaus represent the pivot elements at each iteration, i.e., the
intersection of the pivot column and the pivot row. From the last tableau, we see that the optimal
solution is x∗ = (3, 3/2). Notice that we applied a change of sign to the objective function
coefficients, turning the problem into a minimisation; this also makes the objective function value
in the RHS column appear positive, although it should be negative, as we are in fact minimising
−5x₁ − 4x₂, for which x∗ = (3, 3/2) evaluates to −21. As we have seen, the tableau shows −c_B^⊤x_B,
hence why the optimal tableau displays 21 in the zeroth row of the RHS column.
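As a sanity check (an added illustration, not part of the original exposition), the solution can be reproduced with an off-the-shelf solver; the sketch below assumes scipy is available.

```python
from scipy.optimize import linprog

c = [-5, -4]                              # min -5x1 - 4x2 (the sign-flipped objective)
A_ub = [[6, 4], [1, 2], [-1, 1], [0, 1]]  # constraint rows before adding slacks
b_ub = [24, 6, 1, 2]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x, res.fun)                     # approx. [3, 1.5] and -21
```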
Notice that this is equivalent to assuming all original problem variables (i.e., those that are not
slack variables) to be initialised as zero (i.e., nonbasic) since this is a trivially available initial
feasible solution. However, this method does not work for constraints of the form Ax ≥ b, as in
this case, the transformation would take the form
Ax ≥ b ⇒ Ax − s = b ⇒ Ax − s + y = b.
Notice that making the respective slack variable basic would yield an initial value of −b (recall
that b ≥ 0 can be assumed without loss of generality), making the basic solution not feasible.
For more general problems, however, this might not be possible since simply setting the original
problem variables to zero might not yield a feasible solution that can be used as a BFS. To
circumvent that, we rely on artificial variables to obtain a BFS.
Let P : min. {c⊤x : Ax = b, x ≥ 0}, which can be achieved with appropriate transformations (i.e.,
adding nonnegative slacks to the inequality constraints) and assumed (without loss of generality)
to have b ≥ 0. Then, finding a BFS for P amounts to finding a zero-valued optimal solution to the
auxiliary problem
(AUX) : min. Σ_{i=1}^m y_i
s.t.: Ax + y = b
      x, y ≥ 0.
The auxiliary problem AU X is formed by including one artificial variable for each constraint in P ,
represented by the vector y of so-called artificial variables. Notice that the problem is represented
in a somewhat compact notation, in which we assume that all slack variables used to convert
inequalities into equalities have already been incorporated in the vector x and matrix A, with
the artificial variables y playing the role of “slacks” in AU X that can be assumed to be basic
and trivially yield an initial BFS for AU X. In principle, one does not need artificial variables
for the rows in which there is a positive signed slack variable (i.e., an originally less-or-equal-than
constraint), but this representation allows for compactness.
Solving AUX to optimality consists of trying to find a BFS in which the value of the artificial
variables is zero since, in practice, the value of the artificial variables measures the degree of
infeasibility of the current basis in the context of the original problem P. Reaching zero means
that a BFS in which the artificial variables play no role was found, and it can be used as an initial
BFS for solving P. On the other hand, if the optimal solution for AUX is such that some of the
artificial variables are nonzero, then there is no BFS for AUX in which the artificial variables are
all zero or, more specifically, there is no BFS for P, indicating that the problem P is infeasible.
Assuming that P is feasible and y = 0, two scenarios can arise. The first is when the optimal basis
B for AU X is composed only of columns Aj of the original matrix A, with no columns associated
with the artificial variables. Then B can be used as an initial starting basis without any issues.
The second scenario is somewhat more complicated. Often, AU X is a degenerate problem and the
optimal basis B may contain some of the artificial variables y. This then requires an additional
preprocessing step, which consists of the following:
(1) Let A_B(1), . . . , A_B(k) be the columns of A in B, which are linearly independent. We know
from earlier (cf. Theorem 2.4) that, A being full rank, we can choose additional columns
A_B(k+1), . . . , A_B(m) so that together they span R^m.
(2) Select an artificial variable y_l = 0 that is still basic, pick a component j of the lth row of
B^{-1}A that is nonzero, and use elementary row operations to include A_j in the basis. Repeat
this m − k times.
The procedure is based on several ideas we have seen before. Since Σ_{i=1}^m y_i is zero at the optimum,
there must be a BFS in which the artificial variables are nonbasic (which is what (1) refers to).
Thus, step (2) can be repeated until a basis including none of the artificial variables is formed.
Some interesting points are worth highlighting. First, notice that B^{-1}A_B(i) = e_i, i = 1, . . . , k.
Since k < l, the lth component of each of these vectors is zero and will remain so after performing
the elementary row operations. In turn, the lth entry of B^{-1}A_j is nonzero, and thus A_j is linearly
independent from A_B(1), . . . , A_B(k).
However, it might be that all entries in the lth row associated with the original problem variables
are zero, in which case the procedure fails. Let g⊤ be the lth row of B^{-1}, so that the failure means
g⊤A = 0. Then, for any feasible x, we have g⊤b = g⊤Ax = 0, implying that the constraint
g⊤Ax = g⊤b is redundant, i.e., the lth constraint is a linear combination of the others and can be
removed altogether.
This process of generating initial BFS is often referred to as Phase I of the two-phase simplex
method. Phase II consists of employing the simplex method as we developed it, utilising the BFS
found in Phase I as a starting basis.
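The construction of AUX is mechanical, as the following sketch illustrates using the small example from Chapter 3 (the use of scipy's linprog here is purely for illustration; in practice Phase I is run with the simplex machinery developed above).

```python
import numpy as np
from scipy.optimize import linprog

# Phase I sketch: find a BFS of {Ax = b, x >= 0} by minimising the sum of
# artificial variables y in Ax + y = b (assuming b >= 0).
A = np.array([[1.0, 1.0, 1.0, 1.0],
              [2.0, 0.0, 3.0, 4.0]])
b = np.array([2.0, 2.0])
m, n = A.shape

c_aux = np.concatenate([np.zeros(n), np.ones(m)])   # min sum(y)
A_aux = np.hstack([A, np.eye(m)])                   # Ax + Iy = b

res = linprog(c_aux, A_eq=A_aux, b_eq=b, bounds=[(0, None)] * (n + m))
print(res.fun)          # 0.0: the original problem is feasible
print(res.x[:n])        # a feasible point for the original constraints
```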
(P) : min. z
s.t.: Ax = b
      c⊤x = z
      Σ_{j=1}^n x_j = 1
      x ≥ 0.
In this reformulation, we make the objective function an auxiliary variable z, so it can be easily
represented on a real line, at the expense of adding the additional constraint c⊤x = z. Furthermore,
we normalise the decision variables so that they add up to one (notice that this implies a bounded
feasible region). The problem can then be rewritten as
(P) : min. z
s.t.: x₁ (A₁, c₁) + x₂ (A₂, c₂) + · · · + x_n (A_n, c_n) = (b, z)
      Σ_{j=1}^n x_j = 1
      x ≥ 0,

where (A_j, c_j) denotes the column vector A_j appended with the cost component c_j.
This second formulation exposes one interesting interpretation of the problem. Solving P is akin to
finding a set of weights x that makes a convex combination (c.f. Definition 2.7) of the columns of
A such that it constructs (or matches) b in a way that the resulting combination of the respective
components of the vector c is minimised. Now, let us define some nomenclature that will be useful
in what follows.
Definition 4.4 (k-dimensional simplex). A collection of vectors y1 , . . . , yk+1 are affinely indepen-
dent if k ≤ n and y1 − yk+1 , . . . , yk − yk+1 are linearly independent. The convex hull of k + 1
affinely independent vectors is a k-dimensional simplex.
Definition 4.4 is precisely the inspiration for the name of the simplex method. We know that only
m + 1 components of x will be different from zero, since that is the number of constraints we have
and, thus, the size of a basis in this case. Thus, a BFS is formed by m + 1 columns (A_i, 1), which
in turn are associated with m + 1 points (A_i, c_i).
Figure 4.2 provides an illustration of the concept. In this case, we have m = 2, so each column A_j
can be represented in the two-dimensional plane. Notice that a basis requires three points (A_i, c_i),
whose convex hull forms a two-dimensional simplex (a triangle). A BFS is a selection of three
points (A_i, c_i) such that b, also illustrated in the picture, can be formed by a convex combination
of the A_i forming the basis; this is possible precisely when b lies inside the triangle formed by
these A_i. For example, in Figure 4.2, the basis formed by columns {2, 3, 4} yields a BFS, while
the basis {1, 2, 3} does not.
Figure 4.2: The column geometry for m = 2: the points (A₁, c₁), . . . , (A₄, c₄) plotted above their projections A₁, . . . , A₄, with the requirement point b lying in the convex hull of A₂, A₃, and A₄.
We can now add a third dimension to the analysis, representing the value of z. For that, we will use
Figure 4.3. As can be seen, each selection of basis creates a tilted three-dimensional simplex, such
that the point b is met precisely at the height corresponding to its value on the z axis. This allows
us to compare bases according to their objective function value. And, since we are minimising, we
would like to find the basis whose simplex crosses b at the lowermost point.
Figure 4.3: The column geometry with the z axis: the simplices defined by the points B through I cross the vertical line through the requirement point b at their respective objective values.
Notice that in Figure 4.3, although each facet is a basic simplex, only three are feasible (BCD,
CDF, and DEF). We can also see what one iteration of the simplex method does under this
geometrical interpretation. Moving between adjacent bases means that we are replacing one vertex
(say, C) with another (say, E), considering the potential decrease in value along the z axis
(represented by the difference between points H and G projected onto the z axis). You can also see
the notion of pivoting: since we are moving between adjacent bases, two successive simplexes share
an edge in common and, consequently, pivot around that edge (think of the vertex C moving to
the point E while the edge DF remains fixed).
Figure 4.4: Pivots from initial basis [A3 , A6 ] to [A3 , A5 ] and to the optimal basis [A8 , A5 ]
Now we are ready to provide an insight into why the simplex method is often so efficient. The main
reason is the method's ability to skip bases in favour of those with the most promising improvement.
To see that, consider Figure 4.4, which is a two-dimensional schematic projection of Figure 4.3.
By using the reduced costs to guide the choice of the next basis, we tend to choose the steepest of
the simplexes that can provide reductions in the objective function value, which has the side effect
of skipping several bases that would otherwise have to be considered. This creates a "sweeping
effect" that allows the method to find optimal solutions in fewer pivots than there are vertices.
Clearly, this behaviour can be purposely prevented, as there are examples constructed to force the
method to consider every single vertex, but the situation illustrated in Figure 4.4 is by far the most
common in practice.
4.4 Exercises
Exercise 4.1: Properties of the simplex algorithms
Consider the simplex method applied to a standard form minimisation problem, and assume that
the rows of the matrix A are linearly independent. For each of the statements that follow, give
either a proof or a counterexample.
(a) An iteration of the simplex method might change the feasible solution while leaving the cost
unchanged.
(b) A variable that has just entered the basis cannot leave in the very next iteration.
(c) If there is a non-degenerate optimal basis, then there exists a unique optimal basis.
A feasible point for this problem is (x1 , x2 ) = (0, 3). Formulate the problem as a minimisation
problem in standard form and verify whether or not this point is optimal. If not, solve the problem
by using the simplex method.
x1 x2 x3 x4 x5 x6 x7 RHS
z 0 0 0 δ 3 γ ξ 0
x2 0 1 0 α 1 0 3 β
x3 0 0 1 -2 2 η -1 2
x1 1 0 0 0 -1 2 1 3
The entries α, β, γ, δ, η, and ξ in the tableau are unknown parameters. Furthermore, let B be the
basis matrix corresponding to having x₂, x₃, and x₁ (in that order) as the basic variables. For
each one of the following statements, find the ranges of values of the various parameters that will
make the statement true.
(a) Phase II of the Simplex method can be applied using this as an initial tableau.
(b) The corresponding basic solution is feasible, but we do not have an optimal basis.
(c) The corresponding basic solution is feasible and the first Simplex iteration indicates that the
optimal cost is −∞.
(d) The corresponding basic solution is feasible, x6 is a candidate for entering the basis, and
when x6 is the entering variable, x3 leaves the basis.
(e) The corresponding basic solution is feasible, x7 is a candidate for entering the basis, but if it
does, the objective value remains unchanged.
max. 5x1 + x2
s.t.: 2x1 + x2 ≥ 5
x2 ≥ 1
2x1 + 3x2 ≤ 12
x1 , x2 ≥ 0.
Chapter 5

Linear Programming Duality - Part I

5.1 Formulating duals

5.1.1 Motivation
Let us define the notation we will be using throughout the next chapters. As before, let c ∈ Rn ,
b ∈ Rm , A ∈ Rm×n , and P be the standard form linear programming problem
(P ) : min. c⊤ x
s.t.: Ax = b
x ≥ 0,
which we will refer to as the primal problem. In mathematical programming, we say that a
constraint has been relaxed if it has been removed from the set of constraints. With that in mind,
let us consider a relaxed version of P, where Ax = b is replaced with a violation penalty term
p⊤(b − Ax). This leads to the following problem:

g(p) = min_{x≥0} { c⊤x + p⊤(b − Ax) },
which has the benefit of not having the equality constraints explicitly represented, but only
implicitly, by means of a penalty term. This term is used to penalise the infeasibility of the
constraints in an attempt to steer the solution of the relaxed problem towards the solution of P.
Recalling that our main objective is to solve P, we are interested in the values (or prices, as they
are often called) p ∈ R^m that make P and g(p) equivalent.
Let x̄ be the optimal solution to P. Notice that, for any p ∈ R^m, we have that

g(p) = min_{x≥0} { c⊤x + p⊤(b − Ax) } ≤ c⊤x̄ + p⊤(b − Ax̄) = c⊤x̄,

i.e., g(p) is a lower bound on the optimal value c⊤x̄. The inequality holds because, although x̄ is
optimal for P, it might not be the minimiser of the relaxed problem for an arbitrary vector p. The
final equality is a consequence of x̄ ∈ P, i.e., the feasibility of x̄ implies Ax̄ = b and thus
p⊤(b − Ax̄) = 0.
We can use an optimisation-based approach to find an optimal lower bound, i.e., the tightest
possible lower bound for P. This can be achieved by solving the dual problem D, formulated as

(D) : max_{p∈R^m} g(p).

Notice that D is an unconstrained problem whose solution provides the tightest lower bound on
P (say, at p̄). Also, notice how the function g(p) : R^m → R has embedded in its evaluation the
solution of a linear programming problem with x ∈ R^n as decision variables for a fixed p, which is
the argument given to the function g. This is a new concept at this point and often a source of
confusion.

We will proceed in this chapter developing the analytical framework that allows us to pose the key
result in duality theory, which states that

g(p̄) = c⊤x̄.
That is, we will next develop the results that guarantee the equivalence between primal and dual
representations. This will be useful for interpreting properties associated with the optimal primal
solution x̄ in light of the associated optimal prices p̄. Furthermore, we will see in later chapters that
linear programming duality can be used as a framework for replacing constraints with equivalent
representations, which is a useful procedure in many settings, including for developing alternative
solution strategies also based on linear programming.
As x ≥ 0, the rightmost problem can only be bounded if c⊤ − p⊤A ≥ 0. This gives us a linear
constraint that can be used to enforce the existence of a solution for

min_{x≥0} (c⊤ − p⊤A)x.
(D) : max. p⊤ b
s.t.: p⊤ A ≤ c⊤ .
Notice that D is a linear programming problem with m variables (one per constraint of the primal
problem P) and n constraints (one per variable of P). As you might suspect, if you were to repeat
the analysis, looking at D as the "primal" problem, you would end up with a dual that is exactly P.
For this to become more apparent, let us first define more generally the rules that dictate what
kind of dual formulation is obtained for different types of primal problems in terms of their
original (i.e., not standard) form.
(P ) : min. c⊤ x
s.t.: Ax ≥ b.
(P ) : min. c⊤ x
s.t.: Ax − s = b
s ≥ 0.
Let us focus on the constraints in the reformulated version of P, which can be written as

[A | −I] (x, s)⊤ = b.
We will apply the same procedure as before, with [A | −I] as our constraint matrix in place of A,
and (x, s)⊤ as our vector of variables in place of x. Using analogous arguments, we now require
that c⊤ − p⊤A = 0 so that g(p) is finite. Notice that this is a slight deviation from before: in this
case we have x ∈ R^n (a free variable), so c⊤ − p⊤A = 0 is the only condition that allows the inner
problem in g(p) to have a finite solution. We then obtain the following conditions to be imposed
on our dual linear programming formulation:

p⊤[A | −I] ≤ [c⊤ | 0⊤] and c⊤ − p⊤A = 0,

where the first condition implies, in particular, that −p ≤ 0, i.e., p ≥ 0.
Combining them all and redoing the previous steps for obtaining a dual formulation, we arrive at
(D) : max. p⊤ b
s.t.: p⊤ A = c⊤
p ≥ 0.
Notice how the change in the type of constraints in the primal problem P leads to additional
nonnegativity constraints on the dual variables p. Similarly, the absence of explicit nonnegativity
constraints on the primal variables x leads to equality constraints in the dual problem D, as
opposed to inequalities.
Table 5.1 provides a summary which allows one to identify the resulting formulation of the dual
problem based on the primal formulation, in particular regarding its type (minimisation or max-
imisation), constraint types and variable domains.
For converting a minimisation primal problem into a (maximisation) dual, one must read the table
from left to right. That is, the independent terms (b) become the objective function coefficients,
greater or equal constraints become nonnegative variables, and so forth. However, if the primal
problem is a maximisation problem, the table must be read from right to left. For example, in
this case, less-or-equal-than constraints would become nonnegative variables instead, and so forth.
It takes a little practice to familiarise yourself with this table, but it is a really useful resource to
obtain dual formulations from primal problems.
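To see the conversion rules in action, the sketch below (illustrative; it assumes scipy is available) builds the dual of a small primal of the form min {c⊤x : Ax ≥ b, x ≥ 0} and checks that the optimal values coincide, as the duality results of the next section predict. The data happens to be the same small problem solved with the dual simplex method later in this chapter.

```python
from scipy.optimize import linprog

# Primal:  min c^T x  s.t.  Ax >= b, x >= 0  (solved as -Ax <= -b)
c = [1.0, 1.0]
A = [[1.0, 2.0], [1.0, 0.0]]
b = [2.0, 1.0]
primal = linprog(c, A_ub=[[-a for a in row] for row in A], b_ub=[-v for v in b])

# Dual (from the conversion rules): max p^T b s.t. p^T A <= c^T, p >= 0,
# written as min -b^T p for the solver.
A_T = [[A[i][j] for i in range(2)] for j in range(2)]   # A transposed
dual = linprog([-v for v in b], A_ub=A_T, b_ub=c)

print(primal.fun, -dual.fun)    # both 1.5: the optimal values coincide
```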
One remark to be made at this point is that, as is hopefully clearer now, the conversion of primal
problems into duals is symmetric, meaning that reapplying the rules in Table 5.1 would take you
from the obtained dual back to the original primal; in other words, the dual of the dual is the
primal. Another remark is that equivalent reformulations made in the primal lead to equivalent
duals. Specifically, transformations that replace variables x ∈ R with x⁺ − x⁻, where x⁺, x⁻ ≥ 0,
introduce nonnegative slack variables, or remove redundant constraints all lead to equivalent duals.
For example, recall that the dual formulation for the primal problem
(P ) : min. c⊤ x
s.t.: Ax ≥ b
x ∈ Rn
is given by
(D) : max. p⊤ b
s.t.: p ≥ 0
p⊤ A = c⊤ .
(P ′ ) : min. c⊤ x + 0⊤ s
s.t.: Ax − s = b
x ∈ Rn , s ≥ 0.
Then, using Table 5.1, we would obtain the following dual formulation, which is equivalent to D
(D′ ) : max. p⊤ b
s.t.: p ∈ Rm
p⊤ A = c⊤
−p ≤ 0.
(P ′′ ) : min. c⊤ x+ − c⊤ x−
s.t.: Ax+ − Ax− ≥ b
x+ ≥ 0, x− ≥ 0.
(D′′) : max. p⊤b
s.t.: p ≥ 0
      p⊤A ≤ c⊤
      −p⊤A ≤ −c⊤,

which is, again, equivalent to D.
Theorem 5.1 (Weak duality). Let x be a feasible solution to P and p a feasible solution to D.
Then p⊤b ≤ c⊤x.

Proof. Let I = {1, . . . , m} and J = {1, . . . , n}. For any x and p, define

u_i = p_i(a_i^⊤x − b_i) and v_j = (c_j − p⊤A_j)x_j.

Notice that u_i ≥ 0 for i ∈ I and v_j ≥ 0 for j ∈ J, since each pair of terms has the same sign (you
can see that from Table 5.1, assuming x_j to be the dual variable associated with p⊤A ≤ c⊤).
Thus, we have that

0 ≤ Σ_{i∈I} u_i + Σ_{j∈J} v_j = [p⊤Ax − p⊤b] + [c⊤x − p⊤Ax] = c⊤x − p⊤b.
Let us also pose some results that are direct consequences of Theorem 5.1, which are summarised
in Corollary 5.2.
Corollary 5.2 (Consequences of weak duality). The following are immediate consequences of
Theorem 5.1:

(1) if the optimal value of P is −∞ (i.e., P is unbounded), then D must be infeasible;

(2) if the optimal value of D is +∞ (i.e., D is unbounded), then P must be infeasible;

(3) let x and p be feasible to P and D, respectively, and suppose that p⊤b = c⊤x. Then x is
optimal to P and p is optimal to D.
Proof. To show (1) by contradiction, suppose that P has optimal value −∞ and that D has a
feasible solution p. By weak duality, p⊤b ≤ c⊤x for every feasible x, implying p⊤b = −∞, a
contradiction. Part (2) follows a symmetric argument.

Part (3): let y be any alternative feasible solution to P. From weak duality, we have c⊤y ≥ p⊤b =
c⊤x, which proves the optimality of x. The optimality of p follows a symmetric argument.
Notice that Theorem 5.1 provides us with a bounding technique for any linear programming prob-
lem. That is, for a given pair of primal and dual feasible solutions, x and p, respectively, we have
that
p⊤ b ≤ c⊤ x∗ ≤ c⊤ x,
where c⊤ x∗ is the optimal objective function value.
Corollary 5.2 also provides an alternative way of identifying infeasibility by means of linear pro-
gramming duals. One can always use the unboundedness of a given element of a primal-dual pair to
state the infeasibility of the other element in the pair. That is, an unbounded dual (primal) implies
an infeasible primal (dual). However, the reverse statement is not as conclusive. Specifically, an
infeasible primal (dual) does not necessarily imply that the dual (primal) is unbounded, but does
imply it to be either infeasible or unbounded.
Theorem 5.3 (Strong duality). If a linear programming problem P has an optimal solution, so
does its dual D, and the respective optimal values are equal.

Proof. Assume P is solved to optimality, with optimal solution x and basis B. Let x_B = B^{-1}b.
At the optimum, the reduced costs are nonnegative, i.e., c⊤ − c_B^⊤B^{-1}A ≥ 0. Let p⊤ = c_B^⊤B^{-1}.
We then have p⊤A ≤ c⊤, which shows that p is feasible to D. Moreover,

p⊤b = c_B^⊤B^{-1}b = c_B^⊤x_B = c⊤x,    (5.1)

and thus, by Corollary 5.2 (3), x and p are optimal to P and D, respectively.
The proof of Theorem 5.3 reveals something remarkable relating the simplex method and the dual
variables p. As can be seen in (5.1), for any primal basic feasible solution x, an associated dual
(not necessarily feasible) solution p can be immediately recovered. If the associated dual solution
is also feasible, then Theorem 5.3 guarantees that optimality ensues.
This means that we can interchangeably solve either a primal or a dual form of a given problem,
considering aspects related to convenience and computational ease. This is particularly useful
in the context of the simplex method since the prices p are readily available as the algorithm
progresses. In the next section, we will discuss several practical uses of this relationship in more
detail. For now, let us halt this discussion for a moment and consider a geometrical interpretation
of duality in the context of linear programming.
(P) : min. c⊤x
s.t.: a_i^⊤x ≥ b_i, ∀i ∈ I.
Imagine that there is a particle within the polyhedral set representing the feasible region of P and
that this particle is subjected to a force represented by the vector −c.
Notice that this is equivalent to minimising the function z = c⊤x within the polyhedral set
{x : a_i^⊤x ≥ b_i, i ∈ I} representing the feasible set of P. Assuming that the feasible set of P is
bounded in the direction −c, this particle
will eventually come to a halt after hitting the “walls” of the feasible set, at a point where the
pulling force −c and the reaction of these walls reach an equilibrium. We can think of x as this
stopping point. This is illustrated in Figure 5.1.
Figure 5.1: The pulling force −c balanced at the equilibrium point by the multiples p₁a₁ and p₂a₂ of the normal vectors of the active constraints.
We can then think of the dual variables p as the multipliers applied to the normal vectors asso-
ciated with the hyperplanes (i.e., the walls) that are in contact with the particle to achieve this
equilibrium. Hence, these multipliers p will be such that

c = Σ_{i∈I} p_i a_i, for some p_i ≥ 0, i ∈ I,

which is precisely the dual feasibility condition (i.e., constraint) associated with the dual of P,
given by

(D) : max. {p⊤b : p⊤A = c⊤, p ≥ 0}.
And dual feasibility, as we have seen before, implies the optimality of x, which in turn implies the
optimality of p (cf. Corollary 5.2 (3)). This geometrical insight leads to another key result for
linear programming duality, which is the notion of complementary slackness.
Theorem 5.4 (Complementary slackness). Let x be a feasible solution to

(P) : min. {c⊤x : Ax = b, x ≥ 0}

and let p be a feasible solution to its dual D. The vectors x and p are optimal solutions to P and
D, respectively, if and only if p_i(a_i^⊤x − b_i) = 0, ∀i ∈ I, and (c_j − p⊤A_j)x_j = 0, ∀j ∈ J.
Proof. From the proof of Theorem 5.1, c⊤x − p⊤b = Σ_{i∈I} u_i + Σ_{j∈J} v_j, with all u_i, v_j ≥ 0.
When Theorem 5.3 holds, this sum is zero, and thus every term must be zero, i.e.,

p_i(a_i^⊤x − b_i) = 0, ∀i ∈ I, and (c_j − p⊤A_j)x_j = 0, ∀j ∈ J.

In turn, if these hold, then c⊤x = p⊤b, and x and p are optimal (cf. Corollary 5.2 (3)).
For nondegenerate basic feasible solutions (BFS) (i.e., x_j > 0, ∀j ∈ I_B, where I_B is the set of basic
variable indices), complementary slackness determines a unique dual solution. That is, p must
satisfy

a_i^⊤x ≥ b_i, ∀i ∈ I (primal feasibility) (5.2)
p_i = 0, ∀i ∉ I⁰ (complementarity conditions) (5.3)
Σ_{i∈I⁰} p_i a_i = c (dual feasibility I) (5.4)
p_i ≥ 0, ∀i ∈ I (dual feasibility II) (5.5)

where I⁰ = {i ∈ I : a_i^⊤x = b_i} is the set of active constraints. From (5.2)–(5.5), we see that the
optimality of the primal-dual pair has two main requirements. The first is that x must be (primal)
feasible. The second, expressed as

Σ_{i∈I⁰} p_i a_i = c, p_i ≥ 0,

requires that c can be expressed as a nonnegative combination of the normal vectors of the
constraints active at x.
Figure 5.2: A is both primal and dual infeasible; B is primal feasible and dual infeasible; C is
primal and dual feasible; D is degenerate.
That is, assume that we have B^{-1}(b + d) > 0, noticing that nondegenerate feasibility holds.
Recall that the optimality condition c̄⊤ = c⊤ − c_B^⊤B^{-1}A ≥ 0 is not influenced by such a marginal
perturbation. That is, for a small change d, the optimal basis (i.e., the selection of basic variables)
is not disturbed. On the other hand, the optimal values of the basic variables are, and consequently,
so is the optimal value, which becomes

c_B^⊤B^{-1}(b + d) = p⊤(b + d).
x1 x2 x3 x4 x5 x6 RHS
z 0 0 3/4 1/2 0 0 21
x1 1 0 1/4 -1/2 0 0 3
x2 0 1 -1/8 3/4 0 0 3/2
x5 0 0 3/8 -5/4 1 0 5/2
x6 0 0 1/8 -3/4 0 1 1/2
where x3 and x4 were the slack variables associated with raw material M1 and M2, respectively.
In this case, we have that

B^{-1} = [ 1/4  −1/2  0  0 ]
         [−1/8   3/4  0  0 ]
         [ 3/8  −5/4  1  0 ]
         [ 1/8  −3/4  0  1 ]

and

p⊤ = c_B^⊤B^{-1} = (−5, −4, 0, 0) B^{-1} = (−3/4, −1/2, 0, 0).
Notice that these are the values of the entries in the z-row under the slack variables x₃, . . . , x₆,
except for the minus sign. This is because the z-row contains the entries of c⊤ − p⊤A and, for
each slack variable x_j, we have c_j = 0 and A_j = e_j, so the corresponding entry is −p_j. Also,
recall that the paint factory problem is originally a maximisation problem, so p represents the
decrease in the objective function value. In this, we see that removing one unit of M1 would
decrease the objective function by 3/4, and removing one unit of M2 would decrease it by 1/2.
Analogously, increasing M1 or M2 availability by one unit would increase the objective function
value by 3/4 and 1/2, respectively.
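The computation of p from the optimal basis is easily reproduced; the sketch below (an added illustration) solves B⊤p = c_B for the paint factory basis (x₁, x₂, x₅, x₆) in the minimisation form of the problem.

```python
import numpy as np

# Columns of A for the optimal basis (x1, x2, x5, x6), minimisation form
B = np.array([[ 6.0, 4.0, 0.0, 0.0],
              [ 1.0, 2.0, 0.0, 0.0],
              [-1.0, 1.0, 1.0, 0.0],
              [ 0.0, 1.0, 0.0, 1.0]])
c_B = np.array([-5.0, -4.0, 0.0, 0.0])

p = np.linalg.solve(B.T, c_B)      # p^T = c_B^T B^{-1}
print(p)                            # [-0.75, -0.5, 0, 0]
```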
optimality); or dual methods, where dual feasibility is maintained while seeking for primal feasibility
(i.e., dual optimality).
As we have seen in Chapter 4, the original (or primal) simplex method iterated from an initial
basic feasible solution (BFS) until the optimality condition

c̄⊤ = c⊤ − c_B^⊤B^{-1}A ≥ 0

was observed. Notice that this is precisely the dual feasibility condition p⊤A ≤ c⊤.
Being a dual method, the dual version of the simplex method, or the dual simplex method, considers
these conditions in reverse order. That is, it starts from an initial dual feasible solution and iterates
in a manner such that the primal feasibility condition B^{-1}b ≥ 0 is sought, while c̄ ≥ 0, or
equivalently p⊤A ≤ c⊤, is maintained.

To achieve that, one must revise the pivoting of the primal simplex method such that the variable
to leave the basis is some l ∈ I_B with x_B(l) < 0, while the variable chosen to enter the basis is
some j ∈ I_N such that c̄ ≥ 0 is maintained.
Consider the lth simplex tableau row, for which x_B(l) < 0, of the form [v₁, . . . , v_n, x_B(l)]; i.e., v_j is
the lth component of B^{-1}A_j. For each j ∈ I_N for which v_j < 0, we pick

j′ = argmin_{j∈I_N : v_j<0} c̄_j/|v_j|.
Pivoting is performed by employing elementary row operations to replace A_B(l) with A_{j′} in the
basis. This implies that c̄_j ≥ 0 is maintained, since

c̄_j/|v_j| ≥ c̄_{j′}/|v_{j′}| ⇒ c̄_j − (c̄_{j′}/|v_{j′}|)|v_j| ≥ 0 ⇒ c̄_j + (c̄_{j′}/|v_{j′}|)v_j ≥ 0, ∀j ∈ J.
Notice that this also justifies why we must only consider for entering the basis those variables for
which v_j < 0. Analogously to the primal simplex method, if we observe that v_j ≥ 0 for all j ∈ J,
then no limiting condition is imposed in terms of the increase of the nonbasic variable (i.e., the
dual is unbounded which, according to Corollary 5.2 (2), implies that the original problem is
infeasible).
Assuming that the dual is not unbounded, the termination of the dual simplex method is observed
when B^{-1}b ≥ 0 is achieved, at which point primal-dual optimal solutions have been found, with
x = (x_B, x_N) = (B^{-1}b, 0) (the primal solution) and p⊤ = c_B^⊤B^{-1} (the dual solution).
Algorithm 3 presents a pseudocode for the dual simplex method.
To clarify some of the previous points, let us consider a numerical example. Consider the problem

min. x₁ + x₂
s.t.: x₁ + 2x₂ ≥ 2
      x₁ ≥ 1
      x₁, x₂ ≥ 0.
The first thing we must do is convert the greater-or-equal-than inequalities into less-or-equal-than
inequalities and add the respective slack variables. This allows us to avoid the inclusion of artificial
variables, which are not required anymore since we can allow for primal infeasibility. This leads to
min. x1 + x2
s.t.: − x1 − 2x2 + x3 = −2
− x1 + x4 = −1
x1 , x2 , x3 , x4 ≥ 0.
Below is the sequence of tableaus after applying the dual simplex method to solve the problem.
The terms in bold font represent the pivot element (i.e., the intersection between the pivot row
and pivot column).
x1 x2 x3 x4 RHS
z 1 1 0 0 0
x3 -1 -2 1 0 -2
x4 -1 0 0 1 -1
x1 x2 x3 x4 RHS
z 1/2 0 1/2 0 -1
x2 1/2 1 -1/2 0 1
x4 -1 0 0 1 -1
x1 x2 x3 x4 RHS
z 0 0 1/2 1/2 -3/2
x2 0 1 -1/2 1/2 1/2
x1 1 0 0 -1 1
Figure 5.3 illustrates the progress of the algorithm both in the primal (Figure 5.3a) and in the dual
(Figure 5.3b) variable space. Notice how, in the primal space, the solution remains primal infeasible
until a primal feasible solution is reached, which is then optimal for the problem. Also, notice
that the coordinates of the dual variables can be extracted from the zeroth row of the simplex
tableau.
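To tie this example back to Theorem 5.4, the sketch below (illustrative) checks complementary slackness and the matching objective values for the primal solution x = (1, 1/2) and the dual solution p = (1/2, 1/2) read from the zeroth row under the slacks.

```python
import numpy as np

A = np.array([[1.0, 2.0], [1.0, 0.0]])
b = np.array([2.0, 1.0])
c = np.array([1.0, 1.0])
x = np.array([1.0, 0.5])    # primal solution from the final tableau
p = np.array([0.5, 0.5])    # dual solution from the zeroth row

print(p @ (A @ x - b))      # 0: p_i (a_i^T x - b_i) = 0 for all i
print((c - p @ A) @ x)      # 0: (c_j - p^T A_j) x_j = 0 for all j
print(c @ x, p @ b)         # both 1.5: matching objective values
```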
Some interesting features related to the progress of the dual simplex algorithm are worth
highlighting. First, notice that the objective function is monotonically increasing in this case.
(a) The primal-variable space (b) The dual-variable space
Figure 5.3: The progress of the dual simplex method in the primal and dual space.
Indeed, at each iteration, the quantity x_B(l)(c̄_{j′}/|v_{j′}|) is added to −c_B^⊤B^{-1}b and x_B(l) < 0,
meaning that the dual cost increases (recall the convention of having a minus sign so that the
zeroth row correctly represents the objective function value, given by the negative of the value
displayed in the rightmost column). This illustrates the gradual loss of optimality in the search for
(primal) feasibility. For a nondegenerate problem, this can also be used as an argument for eventual
convergence, since the dual objective value can only increase and is bounded by the primal optimal
value. However, in the presence of dual degeneracy, that is, c̄_j = 0 for some j ∈ I_N at the optimal
solution, the algorithm can suffer from cycling. As we have seen before, that is an indication that
the primal problem has multiple optima.
The dual simplex method is often the best choice of algorithm, because it typically precludes
the need for a Phase I type of method as it is often trivial to find initial dual feasible solutions
(the origin, for example, is typically dual feasible in minimisation problems with nonnegative
coefficients; similar trivial cases are also well known).
Moreover, the dual simplex method is the algorithm of choice for re-solving a linear programming
problem when, after an optimal solution has been found, the feasible region is modified. It turns
out that this procedure is at the core of the methods used to solve integer programming problems,
as well as of Benders decomposition, both topics we will explore later on. The dual simplex method
is also more successful than its primal counterpart on combinatorial optimisation problems, which
are typically plagued with degeneracy. As we have seen, primal degeneracy simply means multiple
dual optima, which are far less problematic from an algorithmic standpoint.
Most professional implementations of the simplex method use by default the dual simplex version.
This has several computational reasons, in particular related to more effective Phase I and pricing
methods for the dual counterpart.
5.5 Exercises
Exercise 5.1: Duality in the transportation problem
Recall the transportation problem from Chapter 1. Answer the following questions based on the
interpretation of the dual price.
We would like to plan the production and distribution of a certain product, taking into account
that the transportation cost is known (e.g., proportional to the distance travelled), the factories
(or source nodes) have a supply capacity limit, and the clients (or demand nodes) have known
demands. Table 5.2 presents the data related to the problem.
Clients
Factory NY Chicago Miami Capacity
Seattle 2.5 1.7 1.8 350
San Diego 3.5 1.8 1.4 600
Demands 325 300 275 -
Table 5.2: Problem data: unit transportation costs, demands and capacities
Additionally, we consider that the arcs (routes) from factories to clients have a maximum trans-
portation capacity, assumed to be 250 units for each arc. The problem formulation is then
min. z = Σ_{i∈I} Σ_{j∈J} c_ij x_ij
s.t.: Σ_{j∈J} x_ij ≤ C_i, ∀i ∈ I
      Σ_{i∈I} x_ij ≥ D_j, ∀j ∈ J
      x_ij ≤ A_ij, ∀i ∈ I, j ∈ J
      x_ij ≥ 0, ∀i ∈ I, j ∈ J,
where Ci is the supply capacity of factory i, Dj is the demand of client j and Aij is the trans-
portation capacity of the arc between i and j.
(a) What price would the company be willing to pay for increasing the supply capacity of a given
factory?
(b) What price would the company be willing to pay for increasing the transportation capacity
of a given arc?
(b) Write the dual formulation of the problem and use strong duality to verify that x and p are
optimal.
min. 2x1 + x3
s.t.: − 1/4x1 − 1/2x2 ≤ −3/4
8x1 + 12x2 ≤ 20
x1 + 1/2x2 − x3 ≤ −1/2
9x1 + 3x2 ≥ −6
x1 , x2 , x3 ≥ 0.
(P ) : min. c⊤ x
s.t.: Ax = b
x ≥ 0,
where A ∈ R^{m×n} and b ∈ R^m. Show that if P has a finite optimal solution, then the new problem
P̄, obtained from P by replacing the right-hand-side vector b with another vector b̄ ∈ R^m, cannot
be unbounded, no matter what values the components of b̄ take.
where A1,...,9 are matrices, b1,...,3 , c1,...,3 are column vectors, and y 1,...,3 are the dual variables
associated to each constraint.
(a) Construct the dual of the problem below and solve both the original problem and its dual.
(b) Use complementary slackness to verify that the primal and dual solutions are optimal.
P : min. c⊤ x
s.t.: Ax = b
x ≥ 0.
Chapter 6

Linear Programming Duality - Part II

As we have seen in Chapter 4, the optimal solution x with associated basis B satisfies the following optimality conditions: it is a basic feasible solution and, therefore, (i) B⁻¹b ≥ 0; and (ii) all reduced costs are nonnegative, that is, c⊤ − c_B⊤B⁻¹A ≥ 0.
We are interested in analysing aspects associated with the stability of the optimal solution x: how it changes with the inclusion of new decision variables and constraints, or with changes in the input data. Both cases are somewhat motivated by the realisation that problems typically emerge from dynamic settings. Thus, one must assess how stable a given plan (represented by x) is, or how it can be adapted in the face of changes in the original problem setting. This kind of analysis is generally referred to as sensitivity analysis in the context of linear programming.
First, we will consider the inclusion of new variables or new constraints after the optimal solution
x is obtained. This setting represents, for example, the inclusion of a new product or a new
production plant (referring to the context of resource allocation and transportation problems, as
discussed in Chapter 1) or the consideration of additional constraints imposing new (or previously
disregarded) requirements or conditions. The techniques we will consider here will also be relevant
in the following chapters. We will then discuss specialised methods for large-scale problems and
solution techniques for integer programming problems, both topics that heavily rely on the idea of
iteratively incrementing linear programming problems with additional constraints (or variables).
The second group of cases relates to changes in the input data. When utilising linear programming
models to optimise systems performance, one must bear in mind that there is inherent uncertainty
associated with the input data. Be it due to measurement errors or a lack of complete knowledge
about the future, one must accept that the input data of these models will, by definition, embed
some measure of error. One way of taking this into account is to try to understand the consequences
to the optimality of x in case of eventual changes in the input data, represented by the matrix A,
and the vectors c and b. We will achieve this by studying the ranges within which variations in
these terms do not compromise the optimality of x.
Consider, as an example, the problem

min. −5x1 − x2 + 12x3
s.t.: 3x1 + 2x2 + x3 = 10
      5x1 + 3x2 + x4 = 16
      x1, . . . , x4 ≥ 0,

whose optimal tableau is

      x1   x2   x3   x4 | RHS
  z |  0    0    2    7 |  12
 x1 |  1    0   -3    2 |   2
 x2 |  0    1    5   -3 |   2
Suppose we include a variable x5 , for which c5 = −1 and A5 = (1, 1). The modified problem then
becomes
min. − 5x1 − x2 + 12x3 − x5
s.t.: 3x1 + 2x2 + x3 + x5 = 10
5x1 + 3x2 + x4 + x5 = 16
x1 , . . . , x5 ≥ 0.
      x1   x2   x3   x4   x5 | RHS
  z |  0    0    2    7   -4 |  12
 x1 |  1    0   -3    2   -1 |   2
 x2 |  0    1    5   -3    2 |   2
Notice that this tableau now shows a primal feasible solution that is not optimal and can be further
iterated using primal simplex.
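To make the computation concrete, the following plain-Julia sketch verifies the reduced cost of x5 from the example data above:

# Computing the reduced cost of the new variable x5 from the example data
B  = [3.0 2.0; 5.0 3.0]     # basis: columns of x1 and x2 in the constraints
cB = [-5.0, -1.0]           # objective coefficients of the basic variables
A5 = [1.0, 1.0]             # constraint column of the new variable x5
c5 = -1.0                   # objective coefficient of x5

cbar5 = c5 - cB' * (B \ A5) # reduced cost c5 - cB'B⁻¹A5
println(cbar5)              # -4.0, matching the tableau above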
We can reuse the optimal basis B to form a new basis B̄ for the modified problem. Writing matrix rows separated by semicolons, this new basis has the form

B̄ = [B 0; a⊤ −1],

where a collects the components of a_{m+1} associated with the columns of A that formed B. Now, since we must have B̄B̄⁻¹ = I, it follows that

B̄⁻¹ = [B⁻¹ 0; a⊤B⁻¹ −1].
Notice, however, that the basic solution (x, a_{m+1}⊤x − b_{m+1}) associated with B̄ is not feasible, since we assumed that x did not satisfy the newly added constraint, i.e., a_{m+1}⊤x < b_{m+1}.
The reduced costs under the new basis B̄ then become

[c⊤ 0] − [c_B⊤ 0] [B⁻¹ 0; a⊤B⁻¹ −1] [A 0; a_{m+1}⊤ −1] = [c⊤ − c_B⊤B⁻¹A  0].
Notice that the new slack variable has a zero reduced cost, meaning that it does not violate the dual feasibility conditions. Thus, after adding a constraint that makes x infeasible, we still have a dual feasible solution that can be immediately used by the dual simplex method, again allowing for warm starting the solution of the new problem.
To build an initial solution in terms of the tableau representation of the simplex method, we must
simply add an additional row, which leads to a new tableau with the following structure
B̄⁻¹Ā = [B⁻¹A 0; a⊤B⁻¹A − a_{m+1}⊤ 1],

where Ā = [A 0; a_{m+1}⊤ −1] is the constraint matrix augmented with the new row and the new slack column.
Let us consider a numerical example again. Consider the same problem as in the previous example, but suppose we instead include the additional constraint x1 + x2 ≥ 5, which is violated by the optimal solution (2, 2, 0, 0). In this case, we have a_{m+1} = (1, 1, 0, 0) and a⊤B⁻¹A − a_{m+1}⊤ = [0, 0, 2, −1].
This modified problem then looks like
      x1   x2   x3   x4   x5 | RHS
  z |  0    0    2    7    0 |  12
 x1 |  1    0   -3    2    0 |   2
 x2 |  0    1    5   -3    0 |   2
 x5 |  0    0    2   -1    1 |  -1
Notice that this tableau indicates that we have a dual feasible solution that is not primal feasible
and thus suitable to be solved using dual simplex.
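The added tableau row can be computed directly; a small sketch reusing the example data:

# Building the warm-start row for the added constraint x1 + x2 >= 5
A   = [3.0 2.0 1.0 0.0; 5.0 3.0 0.0 1.0]   # original constraint matrix
b   = [10.0, 16.0]
B   = A[:, 1:2]                            # optimal basis {x1, x2}
a   = [1.0, 1.0]                           # components of a_{m+1} on the basic columns
am1 = [1.0, 1.0, 0.0, 0.0]                 # the full new constraint row a_{m+1}

row = a' * (B \ A) .- am1'                 # a'B⁻¹A − a_{m+1}' = [0 0 2 -1]
rhs = a' * (B \ b) - 5.0                   # a'B⁻¹b − b_{m+1} = -1.0
println(row, " | ", rhs)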
A final point to note is that these operations are related to each other via the equivalence between primal and dual formulations. That is, consider the dual of P, which is given by

(D) : max. p⊤b
 s.t.: p⊤A ≤ c⊤.

Then, adding a constraint of the form p⊤A_{n+1} ≤ c_{n+1} to D is equivalent to adding a variable to P, exactly as discussed in Section 6.1.1.
Suppose that some component bi changes and becomes bi + δ, with δ ∈ R. We are interested in
the range for δ within which the basis B remains optimal.
For the basis B to remain optimal, it must remain primal feasible, that is,

B⁻¹(b + δei) ≥ 0.
In other words, changing bi will incur changes in the value of the basic variables, and thus, we
must determine the range within which all basic variables remain nonnegative (i.e., feasible).
Let us consider a numerical example. Once again, consider the problem from Section 6.1.1. The
optimal tableau was given by
      x1   x2   x3   x4 | RHS
  z |  0    0    2    7 |  12
 x1 |  1    0   -3    2 |   2
 x2 |  0    1    5   -3 |   2
Suppose that b1 changes by δ in the constraint 3x1 + 2x2 + x3 = 10. Notice that the first column of B⁻¹ can be directly extracted from the optimal tableau and is given by (−3, 5). The optimal basis will remain feasible if 2 − 3δ ≥ 0 and 2 + 5δ ≥ 0, and thus −2/5 ≤ δ ≤ 2/3.
Notice that this means that we can calculate the change in the objective function value as a function
of δ ∈ [−2/5, 2/3]. Within this range, the optimal cost changes as
c_B⊤B⁻¹(b + δei) = p⊤b + δpi,

where p⊤ = c_B⊤B⁻¹ is the optimal dual solution. In case the variation falls outside that range, this
means that some of the basic variables will become negative. However, since the dual feasibility
conditions are not affected by changes in bi , one can still reutilise the basis B using dual simplex
to find a new optimal solution.
We now consider the case where variations are expected in the objective function coefficients.
Suppose that some component cj becomes cj + δ. In this case, optimality conditions become a
concern. Two scenarios can occur. First, it might be that the changing coefficient is associated
with a variable j ∈ J that happens to be nonbasic (j ∈ IN ) in the optimal solution. In this case,
we have that optimality will be retained as long as the nonbasic variable remains “not attractive”,
i.e., the reduced cost associated with j remains nonnegative. More precisely put, the basis B will
remain optimal if
(cj + δ) − c_B⊤B⁻¹Aj ≥ 0 ⇒ δ ≥ −c̄j, where c̄j = cj − c_B⊤B⁻¹Aj is the reduced cost of xj.
The second scenario concerns changes in variables that are basic in the optimal solution, i.e.,
j ∈ IB . In that case, the optimality conditions are directly affected, meaning that we have to
analyse the range of variation for δ within which the optimality conditions are maintained, i.e., the
reduced costs remain nonnegative.
Suppose cj is the coefficient of the l-th basic variable, that is, j = B(l). In this case, c_B becomes c_B + δel, meaning that all optimality conditions are simultaneously affected. Thus, we have to define a range for δ in which the condition

(c_B + δel)⊤B⁻¹Ai ≤ ci, ∀i ≠ j

holds. Notice that we do not need to consider j, since xj is a basic variable and, thus, its reduced cost is assumed to remain zero.

Considering the tableau representation, we can use the l-th row and examine the conditions for which δq_{li} ≤ c̄i, ∀i ≠ j, where q_{li} is the l-th entry of B⁻¹Ai and c̄i is the reduced cost of xi.
Let us once again consider the previous example, with optimal tableau
      x1   x2   x3   x4 | RHS
  z |  0    0    2    7 |  12
 x1 |  1    0   -3    2 |   2
 x2 |  0    1    5   -3 |   2
First, let us consider variations in the objective function coefficients of the variables x3 and x4. Since both variables are nonbasic in the optimal basis, the allowed variations are given by δ3 ≥ −c̄3 = −2 and δ4 ≥ −c̄4 = −7.
Two points are worth noticing. First, both intervals are one-sided. This means that one should only be concerned with variations that decrease the reduced cost value, since increases in its value can never affect the optimality conditions. Second, the allowed variation is trivially the negative of the reduced cost. For variations that turn the reduced costs negative, the current basis can be utilised as a starting point for the primal simplex.
Now, let us consider a variation in the basic variable x1. Notice that in this case we have to analyse the impact on all reduced costs, with the exception of x1 itself. Using the tableau, we have ql = [1, 0, −3, 2] and thus

δ1 q_{12} ≤ c̄2 ⇒ 0 ≤ 0
δ1 q_{13} ≤ c̄3 ⇒ δ1 ≥ −2/3
δ1 q_{14} ≤ c̄4 ⇒ δ1 ≤ 7/2,
implying that −2/3 ≤ δ1 ≤ 7/2. Like before, for a change outside this range, primal simplex can
be readily employed.
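Both rangings can be reproduced numerically from the tableau data; a small plain-Julia sketch (recovering −2/5 ≤ δ ≤ 2/3 and −2/3 ≤ δ1 ≤ 7/2):

# RHS ranging for b1 and cost ranging for the basic variable x1
Binv = [-3.0 2.0; 5.0 -3.0]          # B⁻¹, read off the x3 and x4 columns of the tableau
xB   = [2.0, 2.0]                    # current basic solution B⁻¹b
g    = Binv[:, 1]                    # change direction for a perturbation of b1

# feasibility requires xB + δ*g >= 0 componentwise
δmin = maximum(-xB[i] / g[i] for i in eachindex(g) if g[i] > 0)  # -2/5
δmax = minimum(-xB[i] / g[i] for i in eachindex(g) if g[i] < 0)  #  2/3

# optimality requires δ*q_{1i} <= reduced cost of xi for the nonbasic i
ql    = [1.0, 0.0, -3.0, 2.0]        # x1-row of the tableau (row 1 of B⁻¹A)
cred  = [0.0, 0.0, 2.0, 7.0]         # reduced costs from the z-row
δ1min = maximum(cred[i] / ql[i] for i in 3:4 if ql[i] < 0)       # -2/3
δ1max = minimum(cred[i] / ql[i] for i in 3:4 if ql[i] > 0)       #  7/2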
A cone C can be understood as a set that contains all nonnegative scalings of its elements. Notice that this implies 0 ∈ C. Often it will be the case that 0 is an extreme point of C, and in that case we say that C is pointed. As one might suspect, in the context of linear programming we will be mostly interested in a specific type of cone, known as polyhedral cones, which are sets of the form
P = {x ∈ Rn : Ax ≥ 0} .
Figure 6.1 illustrates a polyhedral cone in R3 formed by the intersection of three half-spaces.
Some interesting properties can be immediately concluded for polyhedral cones. First, they are convex sets, since they are polyhedral sets (cf. Theorem 2.8). Also, whenever the defining vectors span R^n, the origin is an extreme point and the cone is pointed. Furthermore, just like general polyhedral sets, a pointed cone C ⊆ R^n will always be associated with a collection of n linearly independent vectors. Corollary 6.2 summarises these points; notice that we pose it as a corollary because these are immediate consequences of Theorem 3.5.
Corollary 6.2. Let C ⊆ R^n be a polyhedral cone defined by the constraints ai⊤x ≥ 0, i = 1, . . . , m. Then the following are equivalent:

1. 0 is an extreme point of C;
Notice that 0 ∈ C is the unique extreme point of the polyhedral cone C. To see that, let 0 ≠ x ∈ C, and set x1 = (1/2)x and x2 = (3/2)x. Note that x1, x2 ∈ C and x ≠ x1 ≠ x2. Setting λ1 = λ2 = 1/2, we have λ1x1 + λ2x2 = x, and thus x is not an extreme point (cf. Definition 2.10).
Notice that the definition states that a recession cone comprises all directions d along which one can move from x ∈ P without ever leaving P. However, the definition does not depend on x, meaning that the recession cone is unique for the polyhedral set P, regardless of its "origin". Furthermore, notice that Definition 6.3 implies that recession cones of polyhedral sets are polyhedral cones.

We say that any direction d ∈ recc(P) is a ray. Thus, bounded polyhedra can alternatively be defined as polyhedral sets that contain no rays.

Figure 6.2 illustrates the concept of recession cones. Notice that the cone is purposely drawn at several places to illustrate its independence of the point x ∈ P.
Finally, the recession cone for a standard form polyhedral set P = {x ∈ Rn : Ax = b, x ≥ 0} is
given by
recc(P ) = {d ∈ Rn : Ad = 0, d ≥ 0} .
Notice that we are, strictly speaking, interested in extreme rays of the recession cone recc(P) of the polyhedral set P. However, it is typical to say that they are extreme rays of P itself. Figure 6.3 illustrates the concept of extreme rays in polyhedral cones.

Notice that, just like extreme points, the number of extreme rays is finite by definition. In fact, we say that two extreme rays are equivalent if one is a positive multiple of the other, corresponding to the same n − 1 linearly independent active constraints.
Figure 6.3: A polyhedral cone formed by the intersection of three half-spaces (the normal vector
a3 is perpendicular to the plane of the picture and cannot be seen). Directions d1 , d2 , and d3
represent extreme rays.
The existence of extreme rays can be used to verify unboundedness in linear programming problems.
The mere existence of extreme rays does not suffice since unboundedness is a consequence of the
extreme ray being a direction of improvement for the objective function. To demonstrate this, let us
first describe unboundedness in polyhedral cones, which we can then use to show the unboundedness
in polyhedral sets.
Theorem 6.5 (Unboundedness in polyhedral cones). Consider the problem P : min. {c⊤x : x ∈ C}, with C = {x ∈ R^n : ai⊤x ≥ 0, i = 1, . . . , m}. The optimal value is equal to −∞ if and only if some extreme ray d ∈ C satisfies c⊤d < 0.
Proof. If some extreme ray d satisfies c⊤d < 0, then P is unbounded, since c⊤(θd) → −∞ as θ → ∞ along d. Conversely, if the optimal value is −∞, there exists some x ∈ C with c⊤x < 0, which can be scaled so that c⊤x = −1.

Thus, let P̄ = {x ∈ R^n : ai⊤x ≥ 0, i = 1, . . . , m, c⊤x = −1}, which is then nonempty. Since 0 ∈ C is an extreme point, the vectors {ai}_{i=1}^m span R^n (cf. Theorem 3.5), and thus P̄ has at least one extreme point; let d be one of those. As we have n linearly independent active constraints at d, n − 1 of the constraints {ai⊤x ≥ 0}_{i=1}^m must be active (plus c⊤x = −1), and thus d is an extreme ray of C with c⊤d = −1 < 0.
We now focus on how this can be utilised in the context of the simplex method. It turns out that
once unboundedness is identified in the simplex method, one can extract the extreme ray causing
the said unboundedness. In fact, most professional-grade solvers are capable of returning extreme
(or unbounded) rays, which is helpful in the process of understanding the causes for unboundedness
in the model. We will also see in the next chapter that these extreme rays are also used in the
context of specialised solution methods.
To see that this is possible, let P : min. {c⊤x : x ∈ X} with X = {x ∈ R^n : Ax = b, x ≥ 0}, and assume that, for a given basis B, we conclude that the optimal value is −∞, that is, the problem is unbounded. In the context of the simplex method, this means we found a nonbasic variable xj for which the reduced cost c̄j < 0 and the column B⁻¹Aj has no positive component.
Nevertheless, we can still form the feasible direction d = [dB dN] as before, with

dB = −B⁻¹Aj and dN : dj = 1, di = 0, ∀i ∈ IN \ {j}.
This direction d is precisely an extreme ray of P. To see that, first notice that Ad = 0 and d ≥ 0, thus d ∈ recc(X). Moreover, there are n − 1 active constraints at d: m in Ad = 0 and n − m − 1 in di = 0 for i ∈ IN \ {j}. The last thing to notice is that c̄j = c⊤d < 0, which shows the unboundedness along the direction d.
Consider the sets

X = {x ∈ R^n : Ax = b, x ≥ 0} and
Y = {p ∈ R^m : p⊤A ≥ 0, p⊤b < 0}.

If there exists any p ∈ Y, then there is no x ∈ X for which Ax = b holds: for any x ≥ 0 we would have p⊤Ax ≥ 0, while p⊤b < 0, so p⊤Ax = p⊤b is impossible. Thus, X must be empty. Notice that this can be used to infer that a problem P with feasibility set X is infeasible prior to solving P itself, by means of solving the feasibility problem of finding a vector p ∈ Y.
We now pose this relationship more formally via a result generally known as the Farkas’ lemma.
Theorem 6.7 (Farkas' lemma). Let A be an m × n matrix and b ∈ R^m. Then exactly one of the following statements holds:

1. there exists x ≥ 0 such that Ax = b;
2. there exists p ∈ R^m such that p⊤A ≥ 0 and p⊤b < 0.
The Farkas' lemma has a nice geometrical interpretation that represents the mutually exclusive relationship between the two sets. For that, notice that we can think of b as being a conic combination of the columns Aj of A, for some x ≥ 0. If that cannot be the case, then there exists a hyperplane separating b from the cone formed by the columns of A, C = {y ∈ R^m : y = Ax, x ≥ 0}.
This is illustrated in Figure 6.4. Notice that the separation induced by such a hyperplane with normal vector p implies that p⊤Ax ≥ 0 for all x ≥ 0 while p⊤b < 0, i.e., the cone of the columns of A and the vector b are on opposite sides of the hyperplane.

Figure 6.4: The cone formed by the columns A1, A2, and A3, and a separating hyperplane with normal vector p.
Theorem 6.8 has an important consequence: it states that bounded polyhedra, i.e., polyhedral sets that have no extreme rays, can be represented as the convex hull of their extreme points. For now, let us look at an example that illustrates the concept.
Consider the polyhedral set P given by

P = {x1 − x2 ≥ −2, x1 + x2 ≥ 1, x1, x2 ≥ 0}.

The recession cone C = recc(P) is described by d1 − d2 ≥ 0 and d1 + d2 ≥ 0 (from Ad ≥ 0), together with d1, d2 ≥ 0, which can be simplified as

C = {(d1, d2) ∈ R² : 0 ≤ d2 ≤ d1}.
We can then conclude that the two vectors w1 = (1, 1) and w2 = (1, 0) are extreme rays of P .
Moreover, P has three extreme points: x1 = (0, 2), x2 = (0, 1), and x3 = (1, 0).
Figure 6.5 illustrates what is stated in Theorem 6.8. For example, one representation of the point y = (2, 2) ∈ P is given by

y = (2, 2) = (0, 1) + (1, 1) + (1, 0), that is, y = x^2 + w^1 + w^2,

while another is

y = (2, 2) = (1/2)(0, 1) + (1/2)(1, 0) + (3/2)(1, 1), that is, y = (1/2)x^2 + (1/2)x^3 + (3/2)w^1.

Notice that this implies that the representation of each point is not unique.
Figure 6.5: Example showing that every point of P = {x1 − x2 ≥ −2, x1 + x2 ≥ 1, x1, x2 ≥ 0} can be represented as a convex combination of its extreme points plus a nonnegative linear combination of its extreme rays.
6.4 Exercises
min. − 2x1 − x2 + x3
s.t.: x1 + 2x2 + x3 ≤ 8
− x1 + x2 − 2x3 ≤ 4
3x1 + x2 ≤ 10
x1 , x2 , x3 ≥ 0.
Its optimal tableau is

      x1   x2    x3    x4   x5    x6 | RHS
  z |  0    0   1.2   0.2    0   0.6 | -7.6
 x1 |  1    0  -0.2  -0.2    0   0.4 |  2.4
 x2 |  0    1   0.6   0.6    0  -0.2 |  2.8
 x5 |  0    0  -2.8  -0.8    1   0.6 |  3.6
(a) If you were to choose between increasing in 1 unit the right-hand side of any constraints,
which one would you choose, and why? What is the effect of the increase on the optimal
cost?
(b) Perform a sensitivity analysis on the model to discover the range of alteration in the RHS within which the same effect calculated in item (a) can be expected. HINT: JuMP (from version 0.21.6) includes the function lp_sensitivity_report() that you can use to help performing the analysis.
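For reference, a minimal sketch of how the model and the sensitivity report can be set up in JuMP follows; HiGHS is used as an illustrative solver (any LP solver exposing basis information would do):

using JuMP, HiGHS

model = Model(HiGHS.Optimizer)
@variable(model, x[1:3] >= 0)
@objective(model, Min, -2x[1] - x[2] + x[3])
@constraint(model, c1, x[1] + 2x[2] + x[3] <= 8)
@constraint(model, c2, -x[1] + x[2] - 2x[3] <= 4)
@constraint(model, c3, 3x[1] + x[2] <= 10)
optimize!(model)

report = lp_sensitivity_report(model)
for c in (c1, c2, c3)
    Δlo, Δhi = report[c]   # RHS perturbation range keeping the current basis optimal
    println(name(c), ": shadow price = ", shadow_price(c),
            ", RHS range = [", Δlo, ", ", Δhi, "]")
end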
(a) Let P = {(x1 , x2 ) : x1 − x2 = 0, x1 + x2 = 0}. What are the extreme points and the extreme
rays of P ?
(b) Let P = {(x1 , x2 ) : 4x1 + 2x2 ≥ 0, 2x1 + x2 ≤ 1}. What are the extreme points and the
extreme rays of P ?
(c) For the polyhedron of part (b), is it possible to express each one of its elements as a convex
combination of its extreme points plus a nonnegative linear combination of its extreme rays?
Is this compatible with the Resolution Theorem?
max. 2x1 + x2
s.t.: 2x1 + 2x2 ≤ 9 (p1 )
2x1 − x2 ≤ 3 (p2 )
x1 ≤ 3 (p3 )
x2 ≤ 4 (p4 )
x1 , x2 ≥ 0.
(a) Find the primal and dual optimal solutions. HINT: Once you have the primal optimum, you can use complementary slackness to find the dual optimal solution.
(b) Suppose we add the new constraint 6x1 − x2 ≤ 6. Classify the former primal and dual optimal points, stating whether they (i) remain optimal; (ii) remain feasible but not optimal; or (iii) become infeasible.
(c) Consider the new problem from item (b) and find the new dual optimal point through one
dual simplex iteration. After that, find the primal optimum.
Chapter 7
Decomposition methods
Consider a problem (P) : min. {c⊤x : x ∈ X}, where X = ∩_{k=1}^K Xk for some K > 0, and

Xk = {xk ∈ R^{nk}_+ : Dk xk = dk}, ∀k ∈ {1, . . . , K}.
That is, X is the intersection of K standard-form polyhedral sets. Our objective is to devise a way to break P into K separable parts that can be solved separately and recombined into a solution for P.
In this case, this can be straightforwardly achieved by noticing that P can be equivalently stated
as
(P) : min._x  c1⊤x1 + · · · + cK⊤xK
 s.t.:  Dk xk = dk, ∀k ∈ {1, . . . , K}
        xk ∈ R^{nk}_+, ∀k ∈ {1, . . . , K}
This formulation has a structure that immediately allows for separation. That is, P could be solved as the K independent problems

(Pk) : min. {ck⊤xk : xk ∈ Xk}

in parallel, and their individual solutions combined into a solution for P simply by making x = [xk]_{k=1}^K and c⊤x = Σ_{k=1}^K ck⊤xk. Notice that, if we were to assume that the solution time scales linearly with problem size (it does not; it grows faster than linearly) and K = 10, then solving P as K separate problems would be ten times faster (that is not exactly true; there are bottlenecks and other considerations to take into account, but the point stands).
Unfortunately, complicating structures often compromise this natural separability, preventing one from directly exploiting this idea. Specifically, two types of complicating structures can be observed. The first is the presence of complicating constraints, that is, constraints connecting variables from (some of) the subsets Xk. In this case, we would notice that P has an additional constraint of the form
A1 x1 + · · · + AK xK = b,
in which case the problem becomes

(P′) : min._x  c1⊤x1 + · · · + cK⊤xK
 s.t.:  A1x1 + · · · + AKxK = b
        Dk xk = dk, ∀k ∈ {1, . . . , K}
        xk ∈ R^{nk}_+, ∀k ∈ {1, . . . , K}.
The other type of complicating structure is the case in which the same set of decision variables is present in multiple constraints, or in multiple subsets Xk. In this case, the variables of a subproblem k ∈ {1, . . . , K} have nonzero coefficients in another subproblem k′ ≠ k, k′ ∈ {1, . . . , K}.
Hence, problem P takes the form of
(P′′) : min._x  c0⊤x0 + c1⊤x1 + · · · + cK⊤xK
 s.t.:  Ak x0 + Dk xk = dk, ∀k ∈ {1, . . . , K}
        x0, x1, . . . , xK ≥ 0.
The challenging aspect is that, depending on the type of complicating structure, a specific method becomes more suitable. Therefore, being able to identify these structures is one of the key success factors for the performance of the chosen method. As a general rule, problems with complicating constraints (as P′) are suitable to be solved by a delayed variable generation method such as column generation. Analogously, problems with complicating variables (as P′′) are better suited for delayed constraint generation methods such as Benders decomposition.
The development of professional-grade code employing decomposition methods is a somewhat recent phenomenon. The commercial solver CPLEX offers a Benders decomposition implementation that requires the user to specify the separable structure. On the other hand, although there are some frameworks available for implementing column generation-based methods, these tend to be more ad hoc implementations, yet they often reap impressive results.
We start with the Dantzig-Wolfe decomposition, which consists of an alternative approach for
reducing memory requirements when solving large-scale linear programming problems. Then, we
show how this can be expanded further with the notion of delayed variable generation to yield a
truly decomposed problem.
As before, let Pk = {xk ≥ 0 : Dk xk = dk}, with Pk ≠ ∅ for k ∈ {1, . . . , K}. Then, the problem P can be reformulated as

min.  Σ_{k=1}^K ck⊤xk
s.t.: Σ_{k=1}^K Ak xk = b
      xk ∈ Pk, ∀k ∈ {1, . . . , K}.
Notice that P has a complicating constraint structure due to the constraint Σ_{k=1}^K Ak xk = b. In order to devise a decomposition method for this setting, let us first assume that we have available, for each of the sets Pk, k ∈ {1, . . . , K}, (i) all extreme points, represented by x_k^j, ∀j ∈ Jk; and (ii) all extreme rays w_k^r, ∀r ∈ Rk. As one might suspect, this is in principle a demanding assumption, but one that we will be able to drop later on.
Using the Resolution theorem (Theorem 6.8), we know that any element of Pk can be represented as

xk = Σ_{j∈Jk} λ_k^j x_k^j + Σ_{r∈Rk} θ_k^r w_k^r,   (7.1)

where λ_k^j ≥ 0, ∀j ∈ Jk, are the coefficients of the convex combination of extreme points, meaning that we also observe Σ_{j∈Jk} λ_k^j = 1, and θ_k^r ≥ 0, ∀r ∈ Rk, are the coefficients of the conic combination of the extreme rays.
Using the identity represented in (7.1), we can reformulate P into the main problem PM as follows:

(PM) : min.  Σ_{k=1}^K ( Σ_{j∈Jk} λ_k^j ck⊤x_k^j + Σ_{r∈Rk} θ_k^r ck⊤w_k^r )
 s.t.:  Σ_{k=1}^K ( Σ_{j∈Jk} λ_k^j Ak x_k^j + Σ_{r∈Rk} θ_k^r Ak w_k^r ) = b   (7.2)
        Σ_{j∈Jk} λ_k^j = 1, ∀k ∈ {1, . . . , K}   (7.3)
        λ_k^j ≥ 0, ∀j ∈ Jk, θ_k^r ≥ 0, ∀r ∈ Rk, ∀k ∈ {1, . . . , K},
where ek denotes the k-th unit vector (i.e., with 1 in the k-th component and 0 otherwise), which represents the coefficients of λ_k^j in the convexity constraints (7.3). Notice that PM has as many variables as the total number of extreme points and extreme rays of the sets Pk, which is likely to be prohibitively large.
However, we can still solve it if we use a slightly modified version of the revised simplex method. To see that, let us consider that b is an m-dimensional vector. Then, a basis for PM would be of size m + K, since we have the original m constraints plus one convexity constraint for each subproblem k ∈ {1, . . . , K}. This means that we are effectively working with (m + K) × (m + K) matrices, i.e., the basic matrix B and its inverse B⁻¹. Another element we need is the vector of simplex multipliers p, which is of dimension m + K.
The issue with the representation adopted in PM arises when we are required to calculate the reduced costs of all the nonbasic variables, which is the critical point for its tractability. That is where the method provides a clever solution. To see that, notice that the vector p is formed by the components p⊤ = (q, r1, . . . , rK)⊤, where q represents the m dual variables associated with (7.2), and rk, ∀k ∈ {1, . . . , K}, are the dual variables associated with (7.3).
The reduced costs associated with the extreme-point variables λ_k^j, j ∈ Jk, are given by

ck⊤x_k^j − [q⊤ r1 . . . rK] [Ak x_k^j ; ek] = (ck⊤ − q⊤Ak)x_k^j − rk.   (7.4)
The main difference lies in how we assess the reduced costs of the nonbasic variables. Instead of explicitly calculating the reduced costs of all variables, we rely on an optimisation-based approach that considers them only implicitly. For that, we can use the subproblem

(Sk) : c̄k = min. {(ck⊤ − q⊤Ak)xk : xk ∈ Pk},
which can be solved in parallel for each subproblem k ∈ {1, . . . , K}. The subproblem Sk is known
as the pricing problem. For each subproblem k = 1, . . . , K, we have the following cases.
We might observe that c̄k = −∞. In this case, we have found an extreme ray w_k^r satisfying (ck⊤ − q⊤Ak)w_k^r < 0, and thus the reduced cost of the associated extreme-ray variable θ_k^r is negative. If that is the case, we must generate the column

[Ak w_k^r ; 0].
Otherwise, c̄k is finite and attained at an extreme point x_k^{*l}, which is handled in the following steps of Algorithm 5:

6:  if c̄k = ck⊤x_k^{*l} − q⊤(Ak x_k^{*l}) < rk then X̃k ← X̃k ∪ {x_k^{*l}}
7:  end if
8:  end for
9:  l ← l + 1
10: until c̄k ≥ rk for all k ∈ {1, . . . , K}
11: return λ^{*l}.
The (delayed) column generation method is presented in Algorithm 5. Notice in Line 6 the step that generates new columns in the main problem PM, represented by the statement X̃k ← X̃k ∪ {x_k^{*l}}. That is precisely when new variables λ_k^t are introduced in PM, with coefficients represented by
the column

[ck⊤x_k^{*l} ; Ak x_k^{*l} ; ek].
Notice that the unbounded case is not treated, to simplify the pseudocode, but the method could be trivially adapted to return extreme rays to be used in PM, like the previous variant presented in Algorithm 4. Also, notice that the method is assumed to be initialised with a collection of columns (i.e., extreme points) X̃k, which can normally be obtained by inspection or using a heuristic method.
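To make the pricing step concrete, below is a minimal sketch of Sk in JuMP; the function signature, the data arguments, and the tolerance are illustrative assumptions, and HiGHS stands in for any LP solver:

using JuMP, HiGHS

# One pricing step: minimise the reduced-cost expression over Pk = {x >= 0 : Dk x = dk}
function price_subproblem(ck, Ak, Dk, dk, q, rk)
    sub = Model(HiGHS.Optimizer)
    set_silent(sub)
    @variable(sub, x[1:length(ck)] >= 0)
    @constraint(sub, Dk * x .== dk)
    @objective(sub, Min, (ck' - q' * Ak) * x)
    optimize!(sub)
    if termination_status(sub) == MOI.DUAL_INFEASIBLE
        return (:ray, nothing)   # unbounded: an extreme ray should be retrieved and added
    end
    zk = objective_value(sub)
    # a new column enters only if its reduced cost zk - rk is negative
    return zk < rk - 1e-9 ? (:column, value.(x)) : (:none, nothing)
end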
We finalise by showing that the Dantzig-Wolfe and column generation methods can provide information related to their own convergence. This means that we have access to an optimality bound that can be used to monitor the convergence of the method and allow for preemptive termination at an acceptable tolerance. This bounding property is stated in Theorem 7.1.
Theorem 7.1. Suppose P is feasible with finite optimal value z. Let z̄ be the optimal cost associated with the (restricted) PM at a given iteration l of the Dantzig-Wolfe method. Also, let rk be the dual variable associated with the convexity constraint of the k-th subproblem and zk its optimal cost. Then

z̄ + Σ_{k=1}^K (zk − rk) ≤ z ≤ z̄.
Proof. We know that z ≤ z̄, because a solution of the restricted PM is primal feasible, and thus feasible for P. Now, consider the dual of PM:

(DM) : max.  q⊤b + Σ_{k=1}^K rk
 s.t.:  q⊤Ak x_k^j + rk ≤ ck⊤x_k^j, ∀j ∈ Jk, ∀k ∈ {1, . . . , K}
        q⊤Ak w_k^r ≤ ck⊤w_k^r, ∀r ∈ Rk, ∀k ∈ {1, . . . , K}.

We know that strong duality holds, and thus z̄ = q⊤b + Σ_{k=1}^K rk for the optimal dual variables (q, r1, . . . , rK) of the restricted problem. Now, since the zk are finite, we have min_{j∈Jk}(ck⊤x_k^j − q⊤Ak x_k^j) = zk and min_{r∈Rk}(ck⊤w_k^r − q⊤Ak w_k^r) ≥ 0, meaning that (q, z1, . . . , zK) is feasible for DM. By weak duality, we have that

z ≥ q⊤b + Σ_{k=1}^K zk = q⊤b + Σ_{k=1}^K rk + Σ_{k=1}^K (zk − rk) = z̄ + Σ_{k=1}^K (zk − rk).
This structure is sometimes referred to as block-angular, referring to the initial block of columns on the left (as many as there are components in x) and the diagonal structure representing the elements associated with the variables y. In this case, notice that if the variable x were to be removed, or fixed to a value x = x̄, the problem would become separable into K independent parts

(Sk) : min._y  fk⊤yk
 s.t.:  Dk yk = ek − Ck x̄
        yk ≥ 0.
Notice that these subproblems k ∈ {1, . . . , K} can be solved in parallel and, in certain contexts, might even have analytical closed-form solutions. The missing part is the development of a coordination mechanism that allows for iteratively updating the solution x based on information emerging from the solution of the subproblems k ∈ {1, . . . , K}.
To see how that can be achieved, let us reformulate P as
(PR) : min._x  c⊤x + Σ_{k=1}^K zk(x)
 s.t.:  Ax = b
        x ≥ 0,

where, for k ∈ {1, . . . , K},

zk(x) = min._y {fk⊤yk : Dk yk = ek − Ck x}.
Note that obtaining zk (x) requires solving Sk , which, in turn, depends on x. To get around this
difficulty, we can rely on linear programming duality, combined, once again, with the resolution
theorem (Theorem 6.8).
First, let us consider the dual formulation of the subproblems k ∈ {1, . . . , K}, which is given by

(Sk^D) : zk^D = max._{pk}  pk⊤(ek − Ck x)
 s.t.:  pk⊤Dk ≤ fk⊤.
The main advantage of utilising the equivalent dual formulation is to “move” the original decision
variable x to the objective function, a trick that will present its benefits shortly. Next, let us
denote the feasibility set of SkD as
Pk = {p : p⊤Dk ≤ fk⊤}, ∀k ∈ {1, . . . , K},   (7.6)

and assume that each Pk ≠ ∅, with at least one extreme point. Relying on the resolution theorem, we can define the sets of extreme points of Pk, given by p_k^i, i ∈ Ik, and of extreme rays w_k^r, r ∈ Rk.
As we assume that Pk ≠ ∅, two cases can occur when we solve Sk^D, k ∈ {1, . . . , K}. Either Sk^D is unbounded, meaning that the respective primal subproblem is infeasible, or Sk^D is bounded, meaning that zk^D < ∞.
For the first case, we can use Theorem 6.6 to conclude that primal feasibility (or a bounded dual value zk^D < ∞) is attained if and only if

(w_k^r)⊤(ek − Ck x) ≤ 0, ∀r ∈ Rk.   (7.7)
Furthermore, we know that if Sk^D has a solution, it must lie on a vertex of Pk. So, having available the set of all extreme points p_k^i, i ∈ Ik, we have that, whenever Sk^D can be solved, it can be equivalently represented as

min. θk   (7.8)
s.t.: θk ≥ (p_k^i)⊤(ek − Ck x), ∀i ∈ Ik.   (7.9)
(PR) : min._x  c⊤x + Σ_{k=1}^K θk
 s.t.:  Ax = b
        (p_k^i)⊤(ek − Ck x) ≤ θk, ∀i ∈ Ik, ∀k ∈ {1, . . . , K}   (7.10)
        (w_k^r)⊤(ek − Ck x) ≤ 0, ∀r ∈ Rk, ∀k ∈ {1, . . . , K}   (7.11)
        x ≥ 0.
Notice that, just like the reformulation used for the Dantzig-Wolfe method presented in Section 7.2, the formulation of PR is of little practical use, since it requires the complete enumeration of a (typically prohibitive) number of extreme points and rays, and is likely to be computationally intractable due to the large number of associated constraints. To address this issue, we can employ delayed constraint generation and iteratively generate only the constraints we observe to be violated.
By doing so, at a given iteration l, we have at hand a relaxed main problem PM^l, which comprises only some of the extreme points and rays obtained until iteration l. The relaxed main problem can be stated as
(PM^l) : z_{PM}^l = min._x  c⊤x + Σ_{k=1}^K θk
 s.t.:  Ax = b
        (p_k^i)⊤(ek − Ck x) ≤ θk, ∀i ∈ Ik^l, ∀k ∈ {1, . . . , K}
        (w_k^r)⊤(ek − Ck x) ≤ 0, ∀r ∈ Rk^l, ∀k ∈ {1, . . . , K}
        x ≥ 0,

where Ik^l ⊆ Ik represent subsets of extreme points p_k^i of Pk, and Rk^l ⊆ Rk subsets of extreme rays w_k^r of Pk.
One can iteratively obtain these extreme points and rays from the subproblems Sk, k ∈ {1, . . . , K}. To see that, let us first define that, at iteration l, we solve the main problem PM^l and obtain a solution

(x^l, θ1^l, . . . , θK^l) = argmin_{x,θ} PM^l.
We can then solve the subproblems Sk^l, k ∈ {1, . . . , K}, for that fixed solution x^l, and observe whether we can find additional constraints that would be violated had they been included in the relaxed main problem in the first place. In other words, we can identify whether the solution x^l allows for identifying additional extreme points p_k^i or extreme rays w_k^r of Pk not yet included in PM^l.
To identify those, first recall that the subproblem (in its primal form) is given by

(Sk^l) : min.  fk⊤yk
 s.t.:  Dk yk = ek − Ck x^l
        yk ≥ 0.
Then, two cases can lead to generating violated constraints that must be added to the relaxed main problem to form PM^{l+1}. The first is when Sk^l is feasible. In that case, a dual optimal basic feasible solution p_k^{il} is obtained. If (p_k^{il})⊤(ek − Ck x^l) > θk^l, then we can conclude that we have just formed a violated constraint of the form of (7.10). The second case is when Sk^l is infeasible; then an extreme ray w_k^{rl} of Pk is available such that (w_k^{rl})⊤(ek − Ck x^l) > 0, violating (7.11).
Notice that the above can also be accomplished by solving the dual subproblems Sk^D, k ∈ {1, . . . , K}, instead. In that case, the extreme point p_k^{il} is immediately available, and so are the extreme rays w_k^{rl} in case of unboundedness.
Algorithm 6 presents a pseudocode for the Benders decomposition. Notice that the method can benefit in terms of efficiency from the use of dual simplex, since we are iteratively adding violated constraints to the relaxed main problem PM^l. Likewise, the subproblem Sk^l only has its right-hand-side terms modified at each iteration and, in light of the discussion in Section 6.1.3, can also benefit from the use of dual simplex. Furthermore, the loop represented by Line 4 can be parallelised to provide further computational performance improvements.
Notice that the algorithm terminates when no violated constraint is found. In practice, this implies that (p_k^i)⊤(ek − Ck x) ≤ θk for all k ∈ {1, . . . , K}, and thus (x, {θk}_{k=1}^K) is optimal for P. In a way, if one considers the dual version of the subproblem, Sk^D, one can notice that it acts as an implicit search for values of p_k^i that can make (p_k^i)⊤(ek − Ck x) larger than θk, meaning that the current solution x violates (7.10) and is thus not feasible for PM^l.
Also, every time one solves PM^l, a dual (lower, for minimisation) bound LB^l = z_{PM}^l is obtained. This is simply because the relaxed main problem is a relaxation of the problem P, i.e., it contains fewer constraints than the original problem P. A primal (upper) bound can also be calculated at every iteration, which allows for keeping track of the progress of the algorithm in terms of convergence and for preemptively terminating it at any arbitrary optimality tolerance. That can be achieved by setting

UB^l = min{ UB^{l−1}, c⊤x^l + Σ_{k=1}^K fk⊤y_k^l }
     = min{ UB^{l−1}, z_{PM}^l − Σ_{k=1}^K θk^l + Σ_{k=1}^K zk^{Dl} },

where (x^l, {θk^l}_{k=1}^K) = argmin_{x,θ} PM^l, y_k^l = argmin_y Sk^l, and zk^{Dl} is the objective function value of the dual subproblem Sk^D at iteration l. Notice that, differently from the lower bound LB^l, there are no guarantees that the upper bound UB^l will decrease monotonically. Therefore, one must compare the bound obtained at a given iteration l using the solution (x^l, y1^l, . . . , yK^l) against an incumbent (or best-so-far) bound UB^{l−1}.
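To make the loop and the bound bookkeeping concrete, below is a minimal single-subproblem (K = 1) Benders sketch in JuMP; the instance data, the artificial bound on θ, and the error in place of a feasibility cut are illustrative assumptions:

using JuMP, HiGHS

# min c'x + f'y  s.t. D y = e - C x, x >= 0, y >= 0 (illustrative data below)
function benders(c, f, C, D, e; tol = 1e-6)
    master = Model(HiGHS.Optimizer); set_silent(master)
    @variable(master, x[1:length(c)] >= 0)
    @variable(master, θ >= -1e6)           # artificial bound so the first master is bounded
    @objective(master, Min, c' * x + θ)
    UB, LB = Inf, -Inf
    while UB - LB > tol
        optimize!(master)
        LB = objective_value(master)       # lower bound LB^l = z_PM^l
        xl = value.(x)
        # dual subproblem: max p'(e - C x) s.t. D'p <= f
        sub = Model(HiGHS.Optimizer); set_silent(sub)
        @variable(sub, p[1:length(e)])
        @constraint(sub, D' * p .<= f)
        @objective(sub, Max, p' * (e - C * xl))
        optimize!(sub)
        if termination_status(sub) != MOI.OPTIMAL
            error("unbounded dual: a feasibility cut (7.11) would be added here")
        end
        pl = value.(p)
        UB = min(UB, c' * xl + objective_value(sub))   # incumbent upper bound
        @constraint(master, pl' * (e - C * x) <= θ)    # optimality cut (7.10)
    end
    return value.(x), UB
end

x_opt, z_opt = benders([1.0], [2.0, 3.0], reshape([1.0, -1.0], 2, 1),
                       [1.0 0.0; 0.0 1.0], [3.0, 1.0])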
7.4 Exercises
Exercise 7.1: Dantzig-Wolfe decomposition
Consider the following linear programming problem:
We wish to solve this problem using Dantzig-Wolfe decomposition, where the constraint x11 +x23 ≤
15 is the only “coupling” constraint and the remaining constraints define a single subproblem.
(a) Consider the following two extreme points for the subproblem:
and
x2 = (0, 10, 10, 20, 0, 0).
Construct a main problem in which x is constrained to be a convex combination of x1 and
x2 . Find the optimal primal and dual solutions for the main problem.
(b) Using the dual variables calculated in part (a), formulate the subproblem and find its optimal solution.
(c) What is the reduced cost of the variable λ3 associated with the extreme point x3 obtained from solving the subproblem in part (b)?
(d) Compute a lower bound on the optimal cost.
• pi : production at supplier i
• tij : amount of products transported between i and j
• rj : amount of products received at distribution point j
• ljs : amount of products sold from the distribution point j in scenario s
• wjs : amount of products unsold (wasted) at distribution point j in scenario s
The model for minimising the cost (considering revenue as a negative cost) is given below,
min.  Σ_{i∈I} Bi pi + Σ_{i∈I,j∈J} Tij tij + Σ_{s∈S} Ps Σ_{j∈J} (−Rj ljs + Wj wjs)
s.t.: pi ≤ Ci, ∀i ∈ I
      pi = Σ_{j∈J} tij, ∀i ∈ I
      rj = Σ_{i∈I} tij, ∀j ∈ J
      rj = ljs + wjs, ∀j ∈ J, ∀s ∈ S
      ljs ≤ Djs, ∀j ∈ J, ∀s ∈ S
      pi ≥ 0, ∀i ∈ I
      rj ≥ 0, ∀j ∈ J
      tij ≥ 0, ∀i ∈ I, ∀j ∈ J
      ljs, wjs ≥ 0, ∀j ∈ J, ∀s ∈ S.
Solve an instance of the wholesaler’s distribution problem proposed using Benders decomposition.
Chapter 8

Integer programming models
(P) : min. c⊤ x
s.t.: Ax ≤ b
x ≥ 0,
where cj, j ∈ N, is a weight, and F is a family of feasible subsets. As the name suggests, in these problems we are trying to form combinations of elements such that a measure (i.e., an objective function) is optimised.
Figure 8.1: An illustration of all potential assignments as a graph (a) and an example of one possible assignment (b), with total cost C12 + C31 + C24 + C43
To represent the problem, let xij = 1 if worker i is assigned to job j, and xij = 0 otherwise. Let N = {1, . . . , n} be a set of indices for workers and jobs (we can use the same set, since there are equally many of each). The integer programming model that represents the assignment problem is given
by

(AP) : min.  Σ_{i∈N} Σ_{j∈N} Cij xij
 s.t.:  Σ_{j∈N} xij = 1, ∀i ∈ N
        Σ_{i∈N} xij = 1, ∀j ∈ N
        xij ∈ {0, 1}, ∀i, j ∈ N.
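A minimal JuMP sketch of AP follows; the cost matrix is made up for illustration, and HiGHS stands in for any MIP solver:

using JuMP, HiGHS

C = [2.0 4.0 3.0; 5.0 1.0 2.0; 4.0 2.0 3.0]   # illustrative Cij
n = size(C, 1)

ap = Model(HiGHS.Optimizer)
@variable(ap, x[1:n, 1:n], Bin)
@objective(ap, Min, sum(C[i, j] * x[i, j] for i in 1:n, j in 1:n))
@constraint(ap, [i in 1:n], sum(x[i, :]) == 1)  # each worker takes exactly one job
@constraint(ap, [j in 1:n], sum(x[:, j]) == 1)  # each job is taken by exactly one worker
optimize!(ap)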
Before we proceed, let us make a parallel to combinatorial problems. Clearly, the assignment problem is an example of a combinatorial problem: we can make N the set of all worker-job pairs (i, j), let F be the family of subsets S of pairs in which each worker i and each job j appears in exactly one pair, and let xS be such that xij = 1 for (i, j) ∈ S. Thus, the assignment problem is an example of a combinatorial optimisation problem that can be represented as an integer programming formulation.
Notice that the knapsack problem has variants in which an item can be selected a certain number of times, or in which multiple knapsacks must be considered simultaneously, both being generalisations of KP.

The knapsack problem is also a combinatorial optimisation problem: we can make N the set of all items {1, . . . , n}, let F be the family of subsets of items with total cost not greater than B, and let xS be such that xi = 1 for i ∈ S.
Figure 8.2: An example of a bin packing with total cost C11 + C12 + C23 + C44
In this setting, multiple items can be assigned to the same bin, or a bin might have no item assigned. In some contexts, this problem is also known as the bin packing problem.

In this case, we would like to assign all of the m items to n bins, observing that the capacity Bj of each bin j cannot be exceeded by the total weight of the items (with weights Ai) assigned to it. We know that assigning item i = 1, . . . , m to bin j = 1, . . . , n costs Cij. Our objective is to obtain a minimum-cost bin assignment (or packing) that does not exceed any of the bin capacities. Figure 8.2 illustrates a possible assignment of items to bins. Notice that the total number of bins does not necessarily need to be the same as the number of items.
To formulate the generalised assignment problem as an integer programming problem, let us define
xij = 1, if item i is packed into bin j, and xij = 0 otherwise. Moreover, let M = {1, . . . , m} be
the set of items and N = {1, . . . , n} be the set of bins. Then, the problem can be formulated as
follows.
(GAP) : min._x  Σ_{i∈M} Σ_{j∈N} Cij xij
 s.t.:  Σ_{j∈N} xij = 1, ∀i ∈ M
        Σ_{i∈M} Ai xij ≤ Bj, ∀j ∈ N
        xij ∈ {0, 1}, ∀i ∈ M, ∀j ∈ N.
Hopefully, the parallel to the combinatorial optimisation problem version is clear at this point; it is left for the reader as a thought exercise.
Our objective is to decide where to open the centres so that all regions are served and the total
opening cost is minimised.
Figure 8.3 illustrates an example of a set covering problem based on a fictitious map broken into
regions. Each of the cells represents a region that must be covered, i.e., M = {1, . . . , 20}. The
blue cells represent regions that can have a centre opened, that is, N = {3, 4, 7, 11, 12, 14, 19}.
Notice that N ⊂ M . In this case, we assume that if a centre is opened at a blue cell, then it
can serve the respective cell and all adjacent cells. Therefore, we have, e.g., that S3 = {1, 2, 3, 8},
S4 = {2, 4, 5, 6, 7}, and so forth.
Figure 8.3: The hive map illustrating the set covering problem. Our objective is to cover all of the
regions while minimising the total cost incurred by opening the centres at the blue cells
To model the set covering problem, and pretty much any other problem involving indexed subsets such as Sj, ∀j ∈ N, we need an auxiliary structure, often referred to as a 0-1 incidence matrix A, with entries

Aij = 1, if i ∈ Sj; Aij = 0, otherwise.
For example, referring to Figure 8.3, the first column of A would refer to j = 3 and would have
nonzero values at rows 1, 2, 3, and 8.
We are now ready to pose the set covering problem as an integer programming problem. For that, let xj = 1 if a facility is opened at location j, and xj = 0 otherwise. In addition, let M = {1, . . . , m} be the set of regions to be served and N = {1, . . . , n} the set of candidate places to have a centre opened. Then, the set covering problem can be formulated as follows.
(SCP) : min._x  Σ_{j∈N} Cj xj
 s.t.:  Σ_{j∈N} Aij xj ≥ 1, ∀i ∈ M
        xj ∈ {0, 1}, ∀j ∈ N.
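The incidence matrix construction and the model can be sketched as follows; the subsets and costs are small made-up data, not the hive-map instance of Figure 8.3:

using JuMP, HiGHS

M = 1:6                                     # regions to be covered (illustrative)
S = [[1, 2, 3], [2, 4, 5], [3, 5, 6]]       # Sj: regions served by candidate centre j
C = [3.0, 2.0, 2.0]                         # opening costs
N = 1:length(S)

A = [i in S[j] ? 1 : 0 for i in M, j in N]  # 0-1 incidence matrix: Aij = 1 iff i ∈ Sj

scp = Model(HiGHS.Optimizer)
@variable(scp, x[N], Bin)
@objective(scp, Min, sum(C[j] * x[j] for j in N))
@constraint(scp, [i in M], sum(A[i, j] * x[j] for j in N) >= 1)
optimize!(scp)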
As a final note, notice that this problem can also be posed as a combinatorial optimisation problem of the form

min._{T⊆N} { Σ_{j∈T} Cj : ∪_{j∈T} Sj = M },
To pose the problem as an integer programming model, let us define xij = 1 if city j is visited
directly after city i, xij = 0 otherwise. Let N = {1, . . . , n} be the set of cities. We assume that
xii is not defined for i ∈ N . A naive model for the travelling salesperson problem would be
(TSP) : min._x  Σ_{i∈N} Σ_{j∈N} Cij xij
 s.t.:  Σ_{j∈N\{i}} xij = 1, ∀i ∈ N
        Σ_{i∈N\{j}} xij = 1, ∀j ∈ N
        xij ∈ {0, 1}, ∀i, j ∈ N.
First, notice that the formulation of TSP is exactly the same as that of the assignment problem; however, this formulation has an issue. Although it guarantees that each city is visited exactly once, it cannot enforce an important feature of the problem: the tour cannot present disconnections, i.e., contain sub-tours. In other words, the salesperson must physically travel from city to city along the tour, and cannot "teleport" from one city to another. Figure 8.5 illustrates the concept of sub-tours.
In order to prevent subtours, we must include constraints that enforce the full connectivity of the tour. There are mainly two types of such constraints. The first is the cut-set constraints, of the form

Σ_{i∈S} Σ_{j∈N\S} xij ≥ 1, ∀S ⊂ N, S ≠ ∅.
Figure 8.5: A feasible solution for the naive TSP model. Notice the two sub-tours formed
The cut-set constraints act by guaranteeing that among any subset of nodes S ⊆ N there is always
at least one arc (i, j) connecting one of the nodes in S and a node not in S.
An alternative type of constraint is called subtour elimination constraint and is of the form
Σ_{i∈S} Σ_{j∈S} xij ≤ |S| − 1, ∀S ⊂ N, 2 ≤ |S| ≤ n − 1.
Differently from the cut-set constraints, the subtour elimination constraints prevent the number of arcs within each subset from matching the number of nodes in that subset, which would close a sub-tour.

For example, consider the subtours illustrated in Figure 8.5 and assume that we would like to prevent the subtour formed by S = {1, 2, 3}. Then the cut-set constraint would be of the form

Σ_{i∈{1,2,3}} Σ_{j∈{4,5,6}} xij ≥ 1.
There are some differences between these two types of constraints, and typically cut-set constraints are preferred for being stronger (we will discuss the notion of stronger constraints in the next chapters). In any case, both suffer from the same problem: the number of such constraints quickly becomes computationally prohibitive as the number of nodes increases, since one would have to generate a constraint for each possible subset of size 2 to n − 1.
A possible remedy consists of relying on delayed constraint generation. In this case, one can start from the naive formulation TSP and, from its solution, observe whether sub-tours are formed. If that is the case, only the constraints eliminating the observed sub-tours must be generated, and the problem can be warm-started. This procedure typically terminates far earlier than having all of the possible cut-set or subtour elimination constraints generated.
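Below is a hedged sketch of this procedure for the naive formulation, with a made-up cost matrix; each round adds one subtour elimination constraint for the sub-tour containing city 1 and re-solves:

using JuMP, HiGHS

# Detect the cycle through city 1 in an assignment-like solution
function cycle_through_1(xval, n)
    succ(i) = findfirst(j -> xval[i, j] > 0.5, 1:n)
    tour, i = [1], succ(1)
    while i != 1
        push!(tour, i)
        i = succ(i)
    end
    return tour
end

function solve_tsp(C)
    n = size(C, 1)
    m = Model(HiGHS.Optimizer); set_silent(m)
    @variable(m, x[1:n, 1:n], Bin)
    @constraint(m, [i in 1:n], x[i, i] == 0)
    @constraint(m, [i in 1:n], sum(x[i, j] for j in 1:n if j != i) == 1)
    @constraint(m, [j in 1:n], sum(x[i, j] for i in 1:n if i != j) == 1)
    @objective(m, Min, sum(C[i, j] * x[i, j] for i in 1:n, j in 1:n if i != j))
    while true
        optimize!(m)
        S = cycle_through_1(value.(x), n)
        length(S) == n && return value.(x)   # a single tour visits all cities: done
        # subtour elimination constraint over the detected set S
        @constraint(m, sum(x[i, j] for i in S, j in S if i != j) <= length(S) - 1)
    end
end

xsol = solve_tsp([0 2 9 10; 1 0 6 4; 15 7 0 8; 6 3 12 0.0])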
Figure 8.6: An illustration of the facility location problem and one possible solution with two facilities located
Figure 8.6 illustrates an instance of the problem and one possible configuration with two facilities. The optimal number of open facilities and the client-facility association depend on the trade-off between location and service costs.
To formulate the problem as a mixed-integer programming problem, let us define xij as the fraction
of the demand in i ∈ M being served by a facility located at j ∈ N . In addition, we define the
binary variable yj such that yj = 1, if a facility is located at j ∈ N and 0, otherwise. With those,
the uncapacitated facility location (or UFL) problem can be formulated as
(UFL) : min._{x,y}  Σ_{j∈N} Fj yj + Σ_{i∈M} Σ_{j∈N} Cij xij   (8.1)
 s.t.:  Σ_{j∈N} xij = 1, ∀i ∈ M   (8.2)
        Σ_{i∈M} xij ≤ m yj, ∀j ∈ N   (8.3)
        xij ≥ 0, ∀i ∈ M, ∀j ∈ N   (8.4)
        yj ∈ {0, 1}, ∀j ∈ N.   (8.5)
Some features of this model are worth highlighting. First, notice that the absolute values associated with the demand at nodes i ∈ M are implicitly represented in the cost parameter Cij. This is an important modelling feature that makes the formulation not only stronger, but also more numerically favourable (avoiding large coefficients). The demand at each node is therefore thought of as being 1 (or 100%), and 0 ≤ xij ≤ 1 represents the fraction of the demand at i ∈ M being served by a facility eventually located at j ∈ N. Second, notice how the variable xij is only allowed to be greater than zero if the variable yj is set to 1, due to (8.3). Notice that m, the number of clients, acts as an upper bound on the amount of demand served from facility j, which is at most m when only one facility is located. That constraint is precisely why the problem is called uncapacitated, since, in principle, there are no capacity limitations on how much demand is served from a facility.
Facility location problems are frequent in applications associated with supply chain planning prob-
lems, and can be specialised to a multitude of settings, including capacitated versions (both nodes
and arcs), versions where the arcs must also be located (or built), having multiple echelons, and
so forth.
Notice that the formulation of ULS is very similar to that seen in Chapter 1, with the exception of the variable yt, its associated fixed cost term Σ_{t∈T} Ft yt, and the constraint (8.6). This constraint is precisely what renders the "uncapacitated" nomenclature, and it is commonly known in the context of mixed-integer programming as a big-M constraint. Notice that the constant M is playing the role of +∞: the constraint only really becomes relevant when yt = 0, in which case pt ≤ 0 is enforced, making pt = 0. However, this interpretation has to be taken carefully. Big-M constraints are known for causing numerical issues and worsening the performance of mixed-integer programming solver methods. Thus, M must be set to the smallest possible value that does not artificially constrain the problem. Finding such values is often challenging and instance dependent. In the capacitated case, M can trivially take the value of the production capacity.
To understand what makes a formulation good, the first thing to realise is that solution methods for MIP models rely on successively solving linear programming models called linear programming (LP) relaxations. How exactly this happens will be the subject of the next chapters. For now, one can infer that a better formulation is one that requires the solution of fewer such LP relaxations. Indeed, it turns out that the number of LP relaxations that need to be solved (and, hence, performance) is strongly dependent on the quality of the formulation.
An LP relaxation simply consists of a version of the original MIP problem in which the integrality requirements are dropped. Most of the methods used to solve MIP models are based on LP relaxations.

There are several reasons why employing LP relaxations is a good strategy. First, we can solve LP problems efficiently. Second, the solution of the LP relaxation can be used to reduce the search space of the original MIP. However, simply rounding the solution of the LP relaxation will typically not lead to feasible, let alone optimal, solutions.
Let us illustrate the geometry of an integer programming model, so that the points we have been discussing become more evident. Consider the problem

(P) : max._x  x1 + (16/25)x2
 s.t.:  50x1 + 31x2 ≤ 250
        3x1 − 2x2 ≥ −4
        x1, x2 ∈ Z+.
The feasible region of problem P is represented in Figure 8.7. First, notice how in this case the feasible region is not a polyhedral set anymore, but rather a collection of discrete points (represented in blue) that happen to lie within the polyhedral set formed by the linear constraints. This is one of its main complicating features, because the premise of convexity no longer holds. Another point can be noticed from Figure 8.7: rounding the solution obtained from the LP relaxation would in most cases lead to infeasible solutions, except when x1 is rounded up and x2 rounded down, which leads to the suboptimal solution (2, 4). However, one can still graphically find the optimal solution using exactly the same procedure as that employed for linear programming problems, which leads to the optimal integer solution (5, 0).
Figure 8.7: The feasible region of problem P, with the LP relaxation optimum (376/193, 950/193) and the optimal integer solution (5, 0)
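The comparison can be verified computationally; a small sketch solving both P and its LP relaxation (HiGHS as an illustrative solver):

using JuMP, HiGHS

function solve_P(; relaxed::Bool)
    m = Model(HiGHS.Optimizer); set_silent(m)
    if relaxed
        @variable(m, x[1:2] >= 0)
    else
        @variable(m, x[1:2] >= 0, Int)
    end
    @objective(m, Max, x[1] + (16 / 25) * x[2])
    @constraint(m, 50x[1] + 31x[2] <= 250)
    @constraint(m, 3x[1] - 2x[2] >= -4)
    optimize!(m)
    return value.(x)
end

println(solve_P(relaxed = true))   # ≈ (1.948, 4.922) = (376/193, 950/193)
println(solve_P(relaxed = false))  # (5.0, 0.0)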
One aspect that can be noticed from Definition 8.1 is that the feasible region of an integer pro-
gramming problem is a collection of points, represented by X. This is illustrated in Figure 8.8,
where one can see three alternative formulations, P1 , P2 , and P3 for the same set X.
Figure 8.8: An illustration of three alternative formulations for X. Notice that P3 is an ideal
formulation, representing the convex hull of X.
The formulation P3 has a special feature: all of its extreme points belong to X. This has an important consequence: the solution of the original integer programming problem can be obtained by solving a single LP relaxation, since the solutions of both problems coincide. This is precisely what characterises an ideal formulation, i.e., one leading to a minimal (namely, one) number of required LP relaxation solutions, as solving an LP relaxation over an ideal P yields a solution x ∈ X for any cost vector c. This will only be the case if the formulation P is the convex hull of X.
This is the case because of two important properties relating the set X and its convex hull conv(X).
The first is that conv(X) is a polyhedral set and the second is that the extreme points of conv(X)
belong to X. We summarise those two facts in Proposition 8.2, to which we will refer shortly. Notice
that the proof for the proposition can be derived from Definition 2.7 and Theorem 2.8.
Proposition 8.2. conv(X) is a polyhedral set and all its extreme points belong to X.
Unfortunately, this is often not the case. Typically, with the exception of some specific cases, a description of conv(X) is not known, and deriving it algorithmically requires an exponential number of constraints. However, Proposition 8.2 allows us to define a structured way to compare formulations. This is summarised in Definition 8.3.
Definition 8.3. Given a set X ⊆ Zn × Rn and two formulations P1 and P2 for X, P1 is a better
formulation than P2 if P1 ⊂ P2 .
Definition 8.3 gives us a framework for demonstrating that a given formulation is better than another. If we can show that P1 ⊂ P2, then, by definition (literally), P1 is a better formulation than P2. Clearly, this is not a perfect framework: for example, it is not useful for comparing P1 and P2 in Figure 8.8 and, in fact, nothing can be said a priori about which of the two will render the better performance. Often, in the context of MIP, this sort of analysis can only rely on careful computational experimentation.
A final point to make is that sometimes one must compare formulations of distinct dimensions, that is, with different numbers of variables. When that is the case, one can resort to projection as a means to compare both formulations in the same space of variables.
8.4 Exercises
(a) Show that

proj_{x,s,y}(P_ULS-2^{(x,s,y,w)}) ⊆ P_ULS-1^{(x,s,y)},

where

P_ULS-2^{(w,y)} = { (w, y) : Σ_{i=1}^t wit = dt, ∀t ∈ N
                            wit ≤ dt yi, ∀i, t ∈ N : i ≤ t
                            wit ≥ 0, ∀i, t ∈ N : i ≤ t
                            0 ≤ yt ≤ 1, ∀t ∈ N },

with wit denoting the production in period i to be used in period t, and yt the setup in period t.
(b) The optimisation problems associated with the two ULS formulations are

(ULS-1) : min._{x,s,y}  Σ_{t∈N} (ft yt + pt xt + qt st)
          s.t.:  s_{t−1} + xt = dt + st, ∀t ∈ N
                 xt ≤ M yt, ∀t ∈ N
                 s0 = 0
                 st ≥ 0, xt ≥ 0, ∀t ∈ N
                 0 ≤ yt ≤ 1, ∀t ∈ N,

where xt is the production in period t, st the stock in period t, yt the setup in period t, dt the demand in period t, and M = Σ_{t∈N} dt the maximum production; and

(ULS-2) : min._{w,y}  Σ_{t∈N} [ ft yt + pt Σ_{i=t}^n wti + qt Σ_{i=1}^t ( Σ_{j=i}^n wij − di ) ]
          s.t.:  Σ_{i=1}^t wit = dt, ∀t ∈ N
                 wit ≤ dt yi, ∀i, t ∈ N : i ≤ t
                 wit ≥ 0, ∀i, t ∈ N : i ≤ t
                 0 ≤ yt ≤ 1, ∀t ∈ N,

where wit is the production in period i to be used in period t, and yt the setup in period t.
Consider a ULS problem instance over N = {1, . . . , 6} periods with demands d = (6, 7, 4, 6, 3, 8), set-up costs f = (12, 15, 30, 23, 19, 45), unit production costs p = (3, 4, 3, 4, 4, 5), unit storage costs q = (1, 1, 1, 1, 1, 1), and maximum production capacity M = Σ_{t=1}^6 dt = 34. Solve the problems ULS-1 and ULS-2 with Julia using JuMP to verify the result of part (a) computationally.
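As a starting point, below is a sketch of the (LP-relaxed) ULS-1 model for this instance; HiGHS is an illustrative solver choice, and ULS-2 can be written analogously with the w variables:

using JuMP, HiGHS

d = [6, 7, 4, 6, 3, 8]; f = [12, 15, 30, 23, 19, 45]
p = [3, 4, 3, 4, 4, 5]; q = ones(Int, 6)
M = sum(d)                             # 34

uls1 = Model(HiGHS.Optimizer); set_silent(uls1)
@variable(uls1, x[1:6] >= 0)           # production in period t
@variable(uls1, s[0:6] >= 0)           # stock at the end of period t
@variable(uls1, 0 <= y[1:6] <= 1)      # relaxed setup variable
@objective(uls1, Min, sum(f[t] * y[t] + p[t] * x[t] + q[t] * s[t] for t in 1:6))
@constraint(uls1, s[0] == 0)
@constraint(uls1, [t in 1:6], s[t-1] + x[t] == d[t] + s[t])
@constraint(uls1, [t in 1:6], x[t] <= M * y[t])
optimize!(uls1)
println(objective_value(uls1))         # LP-relaxation bound of ULS-1 for this instance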
where xij = 1 if city j ∈ N is visited immediately after city i ∈ N, and xij = 0 otherwise. Constraints (∗), with the variables ui ∈ R for all i ∈ N, are called Miller-Tucker-Zemlin (MTZ) subtour elimination constraints.

Hint: Formulation P_MTZ is otherwise similar to the formulation presented, except for the constraints (∗), which replace the cutset constraints

Σ_{i∈S} Σ_{j∈N\S} xij ≥ 1, ∀S ⊂ N, S ≠ ∅,

which are used to prevent subtours in TSP solutions. Thus, you have to show that:
You can prove 1. by contradiction. First, assume that a solution x ∈ P_MTZ has a subtour with k arcs (i1, i2), . . . , (i_{k−1}, ik), (ik, i1) and k nodes {i1, . . . , ik} ⊆ N \ {1}. Then, write the constraints (∗) for all arcs in this subtour and try to come up with a contradiction.

You can prove 2. by finding suitable values for each u2, . . . , un that satisfy the constraints (∗) for any TSP solution x. Recall that a TSP solution represents a tour that visits each of the N = {1, . . . , n} cities exactly once and returns to the starting city.
Implement the model and solve the problem instance with Julia using JuMP.
Hint: You can assume that the variables u2, . . . , un take unique integer values from the set {2, . . . , n}. That is, we have ui ∈ {2, . . . , n} for all i = 2, . . . , n, with ui ≠ uj for all i, j ∈ {2, . . . , n}. This holds for any TSP solution of problem MTZ, as we showed in Exercise 8.2. If we fix city 1 as the starting city, then the value of each ui represents the position of city i in the TSP tour, i.e., ui = t for t = 2, . . . , n if city i ≠ 1 is the t-th city in the tour. You have to check that each of the
inequalities (8.7) - (8.10) hold (individually) for any arc (i, j) ∈ A and city i ∈ N that are part of
the inequality, by checking that the following two cases are satisfied: either xij = 0 or xij = 1.
(b) Add all four sets of inequalities (8.7) - (8.10) to the MTZ formulation and compare the com-
putational performance against the model with no extra inequalities.
Chapter 9

Branch-and-bound method
9.2 Relaxations
Before we present the method itself, let us discuss the more general concept of relaxation. We have
visited the concept somewhat informally before, but now we will concentrate on a more concrete
definition.
Consider an integer programming problem of the form
z = min_x { c⊤x : x ∈ X ⊆ Zn }.
To prove that a given solution x∗ is optimal, we must rely on the notion of bounding. That is, we must provide a pair of upper and lower bounds that are as close (or tight) as possible. When these bounds coincide, and thus match the value z = c⊤x∗, we have available a certificate of optimality for x∗. This concept should be familiar to you: we already used a similar argument in Chapter 5, when we introduced the notion of dual bounds.
Most methods that can prove optimality work by bounding an optimal solution. In this context,
bounding means to construct an increasing sequence of lower bounds

z_1 < z_2 < · · · < z_s ≤ z

and a decreasing sequence of upper bounds

z̄_1 > z̄_2 > · · · > z̄_t ≥ z,

to obtain as tight as possible lower (z_s ≤ z) and upper (z̄_t ≥ z) bounds. Notice that the process can be arbitrarily stopped when z̄_t − z_s ≤ ϵ, where s and t are some positive integers and ϵ > 0 is a predefined (suitably small) tolerance. The term ϵ represents an absolute optimality gap, meaning that one can guarantee that the optimal value is at most ϵ units greater than z_s and at most ϵ units smaller than z̄_t. In other words, the optimal value must be either z_s, z̄_t, or a value in between.
This framework immediately poses the key challenge of deriving such bounds efficiently. It turns
out that this is a challenge that goes beyond the context of mixed-integer programming problems.
In fact, we have already seen this idea of bounding in Chapter 7, when we discussed decomposition
methods, which also generate lower and upper bounds during their execution.
Regardless of the context, bounds are typically of two types: primal bounds, which are obtained by evaluating a feasible solution (i.e., one that satisfies primal feasibility conditions); and dual bounds, which are typically attained when primal feasibility is allowed to be violated so that a dual feasible solution is obtained. In the context of minimisation, primal bounds are upper bounds (which we seek to minimise), while dual bounds are lower bounds (which we seek to maximise). Clearly, in the case of maximisation, the reverse holds.
Primal bounds can be obtained by means of a feasible solution. For example, one can heuristically
assemble a solution that is feasible by construction. On the other hand, dual bounds are typically
obtained by means of solving a relaxation of the original problem. We are ready now to provide
Definition 9.1, which formally states the notion of a relaxation.
Definition 9.1 (Relaxation). A problem

(RP) : z_RP = min_x { c̄⊤x : x ∈ X̄ ⊆ Rn }

is a relaxation of the problem

(P) : z = min_x { c⊤x : x ∈ X ⊆ Rn }

if X ⊆ X̄ and c̄⊤x ≤ c⊤x, ∀x ∈ X.
Definition 9.1 provides an interesting insight related to relaxations: they typically comprise an expansion of the feasible region, possibly combined with a bounding of the objective function. Thus, two main strategies to obtain relaxations are to enlarge the feasible set (e.g., by dropping constraints) and to replace the objective function with another of equal or smaller value. One might notice at this point that we used a very similar argument to define linear programming duals in Chapter 5. We will return to the relationship between relaxations and Lagrangian duality in a more general setting in Part II, when we discuss the notion of Lagrangian relaxation.
Clearly, for relaxations to be useful in the context of solving mixed-integer programming problems, they have to be easier to solve than the original problem. That being the case, we can then rely on two important properties of relaxations, which are crucial for using them as a means to generate dual bounds. These are summarised in Propositions 9.2 and 9.3.
Proposition 9.2. If RP is a relaxation of P, then zRP is a dual bound for z.
Proof. For any optimal solution x∗ of P, we have that x∗ ∈ X ⊆ X̄, which implies that x∗ is feasible for RP. Thus, we have that

z = c⊤x∗ ≥ c̄⊤x∗ ≥ z_RP.

The first inequality is due to Definition 9.1, and the second holds because x∗ is simply a feasible solution, but not necessarily an optimal one, for RP.
Proposition 9.3. Let RP be a relaxation of P. Then: (1) if RP is infeasible, then so is P; and (2) if an optimal solution x∗ of RP is such that x∗ ∈ X and c̄⊤x∗ = c⊤x∗, then x∗ is also optimal for P.

Proof. To prove (1), simply notice that if X̄ = ∅, then X = ∅. To show (2), notice that, as x∗ ∈ X, z ≤ c⊤x∗ = c̄⊤x∗ = z_RP. From Proposition 9.2, we have z ≥ z_RP. Thus, z = z_RP.
Notice that an LP relaxation is indeed a relaxation, since we are enlarging the feasible region by
dropping the integrality requirements while maintaining the same objective function (cf. Definition
9.1).
Let us consider a numerical example. Consider the integer programming problem

z = max_x 4x1 − x2
s.t.: 7x1 − 2x2 ≤ 14
      x2 ≤ 3
      2x1 − 2x2 ≤ 3
      x ∈ Z2+.
A dual (upper; notice the maximisation) bound for z can be obtained by solving its LP relaxation, which yields the bound z_LP = 59/7 ≈ 8.43 ≥ z. A primal (lower) bound can be obtained by evaluating any of the feasible solutions (e.g., (2, 1), for which 4x1 − x2 = 7 ≤ z). This is illustrated in Figure 9.1.
We can now briefly return to the discussion about better (or stronger) formulations for integer programming problems. Stronger formulations are those that yield stronger relaxations, or, more specifically, relaxations that are guaranteed to provide better (or tighter) dual bounds. This is formalised in Proposition 9.5.
Proposition 9.5. Assume P1 is a better formulation than P2 (i.e., P1 ⊂ P2). Let z_LP^i = min_x { c⊤x : x ∈ Pi } for i = 1, 2. Then z_LP^1 ≥ z_LP^2 for any cost vector c.
Figure 9.1: The feasible region of the example (represented by the blue dots) and the solution x0_LP of the LP relaxation, with objective function value z_LP = 59/7 ≈ 8.43
There is another important type of relaxation that is often exploited in the context of combinatorial optimisation. Specifically, a relaxation of a combinatorial optimisation problem that is itself a combinatorial optimisation problem is called a combinatorial relaxation. Efficient algorithms are known for some combinatorial optimisation problems, and this can be exploited in solution methods for problems whose combinatorial relaxation happens to be one of these efficiently solvable problems.
Let us illustrate the concept with a couple of examples. Consider the travelling salesperson problem (TSP). Recall that, by dropping the subtour elimination constraints, we recover the assignment problem. It so turns out that the assignment problem can be solved efficiently (for example, using the so-called Hungarian method) and thus can be used as a relaxation for the TSP, i.e.,
z_TSP = min_{T⊆A} { Σ_{(i,j)∈T} c_ij : T forms a tour } ≥
z_AP = min_{T⊆A} { Σ_{(i,j)∈T} c_ij : T forms an assignment }.
Still relating to the TSP, one can obtain even stronger combinatorial relaxations using 1-trees for the symmetric TSP. Let us first define some elements. Consider an undirected graph G = (V, E) with edge weights c_e for e ∈ E. The objective is to find a minimum-weight tour.
Now, notice the following: (i) a tour contains exactly two edges adjacent to the origin node (say, node 1) and a path through the nodes {2, . . . , |V|}; (ii) such a path is a special case of a (spanning) tree on those nodes, that is, a connected and cycle-free subset of edges that covers (touches at least once) all nodes v ∈ V.
We can now define what a 1-tree is. A 1-tree is a subgraph consisting of two edges adjacent to node 1 plus the edges of a tree on nodes {2, . . . , |V|}. Clearly, every tour is a 1-tree with the additional requirement (or constraint) that every node has exactly two incident edges. Thus, the problem of finding minimal 1-trees is a relaxation of the problem of finding optimal tours. Figure 9.2 illustrates a 1-tree for an instance with eight nodes.
Figure 9.2: A 1-tree on eight nodes, formed by the edges of a tree on nodes {2, . . . , 8} plus two edges from node 1

Once again, it so turns out that several efficient algorithms are known for forming minimal spanning trees, which can be efficiently utilised as a relaxation for the symmetric TSP, that is,

z_STSP = min_{T⊆E} { Σ_{e∈T} c_e : T forms a tour } ≥
z_1-TREE = min_{T⊆E} { Σ_{e∈T} c_e : T forms a 1-tree }.
The working principle behind this strategy is formalised in Proposition 9.6.
Proposition 9.6. Let K = {1, . . . , |K|} and let S = ⋃_{k∈K} S_k be a decomposition of S. Let z^k = max_x { c⊤x : x ∈ S_k } for each k ∈ K, and let z = max_x { c⊤x : x ∈ S }. Then

z = max_{k∈K} z^k.
Notice the use of the word decomposition in Proposition 9.6. Indeed, the principle is philosophically the same, and this connection will be exploited in later chapters when we discuss in more detail the available technology for solving mixed-integer programming problems.

Now, one challenging aspect related to divide-and-conquer approaches is that, in order to find a solution, one might need to repeat the strategy based on Proposition 9.6 several times, leading to a multi-layered collection of subproblems. To address this issue, such methods typically rely on tree structures called enumerative trees, which are simply a representation that allows for keeping track of the relationships (represented by branches) between subproblems (represented by nodes).
Figure 9.3 represents an enumerative tree for a generic problem S ⊆ {0, 1}³, in which one must define the value of a three-dimensional binary variable. The subproblems are formed by, at each level, fixing one of the components to zero or one, thus forming two subproblems. Any strategy that forms two subproblems (or children) at a time is called a binary branching.

Figure 9.3: An enumeration tree using binary branching for a problem with three binary variables

At the first level, S is decomposed as

S = S0 ∪ S1 = {x ∈ S : x1 = 0} ∪ {x ∈ S : x1 = 1},
which renders two subproblems. Then, each subproblem is again decomposed into two children, such that
Si = Si0 ∪ Si1 = {x ∈ S : x1 = i, x2 = 0} ∪ {x ∈ S : x1 = i, x2 = 1} .
Finally, once all of the variables are fixed, we arrive at what is called the leaves of the tree. These
are such that they cannot be further divided, since they immediately yield a candidate solution
for the original problem.
Notice that, by applying Proposition 9.6, we can recover an optimal solution to the problem.
First, notice that P is a maximisation problem, for which an upper bound is a dual bound, obtained from a relaxation, and a lower bound is a primal bound, obtained from a feasible solution.
Proposition 9.7 states that the best known primal (lower) bound can be applied globally to all of
the subproblems Sk , k ∈ K. On the other hand, dual (upper) bounds can only be considered valid
locally, since only the worst of the upper bounds can be guaranteed to hold globally.
Pruning branches is made possible by combining relaxations and global primal bounds. If, at any
moment of the search, the solution of a relaxation of Sk is observed to be worse than a known
global primal bound, then any further branching from that point onwards would be fruitless, since
no solution found from that subproblem could be better than the relaxation for Sk . Specifically,
we have that

z ≥ z̄^k ≥ z̄^{k′}, ∀k′ that is a descendant of k,

where z is the best known global primal (lower) bound and z̄^k the dual (upper) bound of subproblem S_k.
Notice that this implies that the subproblems will be disjoint (i.e., have no intersection), each with one additional constraint that eliminates the fractional part of the component around the solution of the LP relaxation.
Pruning can occur in three distinct ways. The first case is when the solution of the LP relaxation
happens to be integer and, therefore, optimal for the subproblem itself. In this case, no further
exploration along that subproblem is necessary and we say that the node has been pruned by
optimality.
Figure 9.4 illustrates the process of pruning by optimality. Each box denotes a subproblem, with the
interval denoting known lower (primal) and upper (dual) bounds for the problem and x denoting
the solution for the LP relaxation of the subproblem. In Figure 9.4, we see a pruning that is
caused because a solution to the original (integer) subproblem has been identified by solving its
(LP) relaxation, akin to the leaves in the enumerative tree represented in Figure 9.3. This can be
concluded because the solution of the LP relaxation of subproblem S1 is integer.
Figure 9.4: An example of pruning by optimality. Since the solution of the LP relaxation of subproblem S1 is integer, x = (2, 2) must be optimal for S1
Another type of pruning takes place when known global (primal) bounds can be used to prevent further exploration of a branch in the enumeration tree. Continuing the example in Figure 9.4, notice that the global lower (primal) bound z = 11 becomes available and can be transmitted to all subproblems. Now suppose we solve the LP relaxation of S2 and obtain the optimal value z̄² = 9.7. Notice that we are precisely in the situation described in Section 9.3.1. That is, the nodes descending from S2 can only yield solutions with objective function value worse than the dual bound of S2, which, in turn, is worse than a known global primal (lower) bound. Thus, any further exploration of the descendants of S2 would be fruitless in terms of yielding better solutions, and the node can be pruned. This is known as pruning by bound and is illustrated in Figure 9.5.
Figure 9.5: An example of pruning by bound. Notice that the newly found global bound holds for all subproblems. After solving the LP relaxation of S2, we notice that z̄² ≤ z, which renders the pruning
The third type of pruning is called pruning by infeasibility, which takes place whenever the branching constraint added to the subproblem renders its relaxation infeasible, implying that the subproblem itself is infeasible (cf. Proposition 9.3).

Algorithm 7 presents a pseudocode for an LP-based branch-and-bound method. Notice that the algorithm keeps a list L of subproblems to be solved and requires a rule for selecting which subproblem is solved next. This subproblem selection (often referred to as the search strategy) can have a considerable impact on the performance of the method. Similarly, in case multiple components are found to be fractional, one must be chosen. Defining such branching priorities also has consequences for performance. We will discuss these in more depth later on.
Also, recall that we have seen (in Chapter 6) how to efficiently resolve a linear programming problem from an optimal basis once we include an additional constraint. It so turns out that an efficient dual simplex method is the kingpin of an efficient branch-and-bound method for (mixed-)integer programming problems.
Finally, although we developed the method in the context of integer programming problems, the
method can be readily applied to mixed-integer programming problems, with the only difference
being that the branch-and-bound steps are only applied to the integer variables while the continuous
variables are naturally taken care of in the solution of the LP relaxations.
Let us finish by presenting a numerical example of the employment of the branch-and-bound method to solve an integer programming problem. Consider the problem:

z = max_x 4x1 − x2
s.t.: 7x1 − 2x2 ≤ 14
      x2 ≤ 3
      2x1 − 2x2 ≤ 3
      x ∈ Z2+.
We start by solving its LP relaxation, as represented in Figure 9.6. We obtain the solution x0_LP = (20/7, 3) with objective value z = 59/7. As the first component of x0_LP is fractional, we can generate subproblems by branching the node into subproblems S1 and S2, where

S1 = S ∩ {x : x1 ≤ 2}
S2 = S ∩ {x : x1 ≥ 3}.
The current enumerative (or branch-and-bound) tree representation is depicted in Figure 9.7.
Figure 9.6: The feasible region of the LP relaxation and its optimal solution x0_LP = (20/7, 3)
Suppose we arbitrarily choose to solve the relaxation of S1 next. Notice that this subproblem consists of the problem S with the added constraint x1 ≤ 2. The feasible region and the solution of the LP relaxation of S1 are depicted in Figure 9.8. Since we again obtain a fractional solution, x1_LP = (2, 1/2), we must branch on the second component, forming the subproblems

S11 = S1 ∩ {x : x2 = 0}
S12 = S1 ∩ {x : x2 ≥ 1}.
Notice that, at this point, our list of active subproblems is formed by L = {S2, S11, S12}. Our current branch-and-bound tree is represented in Figure 9.9.
Figure 9.7: The branch-and-bound tree after the first branching: the root node S, with bounds [−, 59/7] and LP solution (20/7, 3), is branched into S1 (x1 ≤ 2) and S2 (x1 ≥ 3), both with bounds [−, −]
Figure 9.8: The feasible region of the LP relaxation of S1 and its optimal solution x1_LP = (2, 1/2)
Suppose we arbitrarily choose to solve S2 first. One can see that this renders an infeasible subproblem, since the added constraint x1 ≥ 3 has no intersection with the original feasible region and, thus, S2 can be pruned by infeasibility.
Figure 9.9: The branch-and-bound tree after branching S1 into S11 and S12. The root S has bounds [−, 59/7] and solution (20/7, 3); S1 has bounds [−, 15/2] and solution (2, 1/2); S2, S11, and S12 remain unsolved with bounds [−, −]
Next, we choose to solve the LP relaxation of S12, which yields the integer solution x12_LP = (2, 1). Therefore, an optimal solution for S12 has been found, meaning that a global primal (lower) bound has been found and can be transmitted to the whole branch-and-bound tree. Solving S11 next, we obtain the solution x11_LP = (3/2, 0) with optimal value 6. Since a better global primal (lower) bound is known, we can prune S11 by bound. As there are no further nodes to be explored, the solution to the original problem is the best (and, in this case, the single) integer solution found in the process (cf. Proposition 9.6): x∗ = (2, 1), with z∗ = 7. Figure 9.10 illustrates the feasible regions of the subproblems and their respective optimal solutions, while Figure 9.11 presents the final branch-and-bound tree with all branches pruned.
Figure 9.10: LP relaxations of all subproblems. Notice that S11 and S12 include the constraint x1 ≤ 2 from the parent node S1
Figure 9.11: The final branch-and-bound tree with all branches pruned: S with bounds [7, 59/7] and solution (20/7, 3); S1 with [7, 15/2] and (2, 1/2); S2 with [7, −∞] (pruned by infeasibility); S11 with [7, 6] (pruned by bound); and S12 with [7, 7] and the optimal solution (2, 1)
Notice that, in this example, the order in which we solved the subproblems was crucial for pruning subproblem S11 by bound; this was only possible because we happened to solve the LP relaxation of S12 first, which happened to yield a feasible solution and an associated primal bound. This illustrates an important aspect of the branch-and-bound method: having good feasible solutions available early on in the process increases the likelihood of performing more pruning by bound, which is highly desirable in terms of computational savings (and thus performance). We will discuss the impacts of different search strategies in more detail later on, when we consider this and other aspects involved in the implementation of mixed-integer programming solvers.
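As a quick computational check of the example (a sketch, with HiGHS assumed as the solver), the model below should return x∗ = (2, 1) and z∗ = 7:

using JuMP, HiGHS

model = Model(HiGHS.Optimizer)
@variable(model, x[1:2] >= 0, Int)
@objective(model, Max, 4 * x[1] - x[2])
@constraint(model, 7 * x[1] - 2 * x[2] <= 14)
@constraint(model, x[2] <= 3)
@constraint(model, 2 * x[1] - 2 * x[2] <= 3)
optimize!(model)
println(value.(x), " ", objective_value(model))   # expect [2.0, 1.0] and 7.0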
9.4 Exercises
(a) Let N = {1, . . . , n} be a set of potential facilities and M = {1, . . . , m} a set of clients. Let
yj = 1 if facility j is opened, and yj = 0 otherwise. Moreover, let xij be the fraction of client i’s
demand satisfied from facility j. The UFL can be formulated as the mixed-integer problem:
(UFL-W) : min_{x,y} Σ_{j∈N} f_j y_j + Σ_{i∈M} Σ_{j∈N} c_ij x_ij        (9.1)
          s.t.: Σ_{j∈N} x_ij = 1, ∀i ∈ M,                              (9.2)
                Σ_{i∈M} x_ij ≤ m y_j, ∀j ∈ N,                          (9.3)
                x_ij ≥ 0, ∀i ∈ M, ∀j ∈ N,                              (9.4)
                y_j ∈ {0, 1}, ∀j ∈ N,                                  (9.5)
where fj is the cost of opening facility j, and cij is the cost of satisfying client i’s demand from facility j. Consider an instance of the UFL with opening costs f = (4, 3, 4, 4, 7) and client costs

(cij) = [ 12 13  6  0  1
           8  4  9  1  2
           2  6  6  0  1
           3  5  2  1  8
           8  0  5 10  8
           2  0  3  4  1 ].
Implement the model and solve the problem with Julia using JuMP.
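A minimal JuMP sketch of UFL-W with the instance data above (HiGHS assumed as the solver):

using JuMP, HiGHS

f = [4, 3, 4, 4, 7]
c = [12 13  6  0  1;
      8  4  9  1  2;
      2  6  6  0  1;
      3  5  2  1  8;
      8  0  5 10  8;
      2  0  3  4  1]
m, n = size(c)
M, N = 1:m, 1:n

model = Model(HiGHS.Optimizer)
@variable(model, x[M, N] >= 0)     # fraction of client i's demand served by facility j
@variable(model, y[N], Bin)        # facility j is opened
@objective(model, Min,
    sum(f[j] * y[j] for j in N) + sum(c[i, j] * x[i, j] for i in M, j in N))
@constraint(model, [i in M], sum(x[i, j] for j in N) == 1)         # (9.2)
@constraint(model, [j in N], sum(x[i, j] for i in M) <= m * y[j])  # (9.3)
optimize!(model)
println(objective_value(model))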
(b) An alternative formulation of the UFL is of the form
(UFL-S) : min_{x,y} Σ_{j∈N} f_j y_j + Σ_{i∈M} Σ_{j∈N} c_ij x_ij        (9.6)
          s.t.: Σ_{j∈N} x_ij = 1, ∀i ∈ M,                              (9.7)
                x_ij ≤ y_j, ∀i ∈ M, ∀j ∈ N,                            (9.8)
                x_ij ≥ 0, ∀i ∈ M, ∀j ∈ N,                              (9.9)
                y_j ∈ {0, 1}, ∀j ∈ N.                                  (9.10)
Linear programming (LP) relaxations of these problems can be obtained by relaxing the binary
constraints yj ∈ {0, 1} to 0 ≤ yj ≤ 1 for all j ∈ N . For the same instance as in part (a), solve the
LP relaxations of UFL-W and UFL-S and compare the optimal costs of the LP relaxations against
the optimal integer cost obtained in part (a).
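One convenient way of doing this in JuMP is via relax_integrality, which replaces yj ∈ {0, 1} with 0 ≤ yj ≤ 1 in place; a sketch, continuing from a model built as above:

undo = relax_integrality(model)   # drop the binary requirements
optimize!(model)
println("LP relaxation cost: ", objective_value(model))
undo()                            # restore the binary requirements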
Solve the IP problem by LP-based branch and bound, i.e., use LP relaxations to compute dual
(upper) bounds. Use Dual Simplex to efficiently solve the subproblem of each node starting from
the optimal basis of the previous node. Recall that the LP relaxation of IP is obtained by relaxing
the variables x1 , . . . , x5 ∈ Z+ to x1 , . . . , x5 ≥ 0.
Hint: The initial dual bound z̄ is obtained by solving the LP relaxation of IP at the root node S. Let [z, z̄] be the lower and upper bounds of each node. The optimal tableau and the initial branch-and-bound tree, with only the root node S (bounds [z, z̄] = [−∞, 59/7]), are shown below.

           x1   x2   x3    x4   x5   RHS
-z          0    0  -4/7  -1/7   0  -59/7
x1 = 20/7   1    0   1/7   2/7   0   20/7
x2 = 3      0    1   0     1     0   3
x5 = 23/7   0    0  -2/7  10/7   1   23/7
You can proceed by branching on the fractional variable x1, imposing either x1 ≤ 2 or x1 ≥ 3. This creates two new subproblems S1 = S ∩ {x1 ≤ 2} and S2 = S ∩ {x1 ≥ 3} in the branch-and-bound tree, which can be solved efficiently using the dual simplex method, starting from the optimal tableau of S shown above, by first adding the new constraint x1 ≤ 2 for S1 or x1 ≥ 3 for S2 to the optimal tableau. The dual simplex method can be applied immediately if the new constraint is first written in terms of the nonbasic variables before being added to the tableau as a new row, possibly multiplying the constraint by −1 if needed.
Plot (or draw) the feasible region of the linear programming (LP) relaxation of the problem IP, then solve the problems using the figure. Recall that the LP relaxation of IP is obtained by relaxing the integrality requirements on the variables.
(a) What is the optimal cost zLP of the LP relaxation of the problem IP ? What is the optimal
cost z of the problem IP ?
(b) Draw the border of the convex hull of the feasible solutions of the problem IP . Recall that
the convex hull represents the ideal formulation for the problem IP .
(c) Solve the problem IP by LP-relaxation based branch-and-bound. You can solve the LP
relaxations at each node of the branch-and-bound tree graphically. Start the branch-and-
bound procedure without any primal bound.
Chapter 10
Cutting-planes method
Furthermore, if we had available Ãx ≤ b̃, then we could solve IP by solving its linear programming
relaxation.
Cutting-plane methods are based on the idea of iteratively approximating the set of inequalities
Ãx ≤ b̃ by adding constraints to the formulation P of IP . These constraints are called valid
inequalities, a term we define more precisely in Definition 10.1.
Notice that the condition for an inequality to be valid is that it does not remove any of the points in the original integer set X. In light of the idea of gradually approximating conv(X), one can infer that good valid inequalities are those that can “cut off” some of the area defined by the polyhedral set P without removing any of the points in X. This is precisely where the name cut comes from. Figure 10.1 illustrates the process of adding a valid inequality to a formulation P. Notice how the inequality exposes one of the facets of the convex hull of X. Cuts like this are called “facet-defining” and are the strongest type of cut one can generate. We will postpone the discussion of stronger cuts to later in this chapter.
Figure 10.1: Illustration of a valid inequality being added to a formulation P. Notice how the inequality cuts off a portion of the polyhedral set P while not removing any of the feasible points in X (represented by the dots)
Proof. We can use linear programming duality to prove this statement. First, notice that, if the proposition holds, for u ≥ 0 and x ∈ P, we have

Ax ≤ b
u⊤Ax ≤ u⊤b
π⊤x ≤ u⊤Ax ≤ u⊤b ≤ π0,

and, thus, it implies the validity of the cut, i.e., that π⊤x ≤ π0, ∀x ∈ P. Now, consider the primal problem

max. π⊤x
s.t.: Ax ≤ b
      x ≥ 0,

whose dual is

min. u⊤b
s.t.: u⊤A ≥ π
      u ≥ 0.
Thus, u⊤ A ≥ π can be seen as a consequence of dual feasibility, which is guaranteed to hold for
some u since π ⊤ x is bounded. Furthermore, strong duality gives u⊤ b = π ⊤ x ≤ π0 , completing the
proof.
One thing to notice is that valid cuts in the context of polyhedral sets are somewhat redundant since, by definition, they do not alter the polyhedral set in any way. However, the concept can be combined with a simple yet powerful way of generating valid inequalities for integer sets by using rounding. This is stated in Proposition 10.3.
Proposition 10.3 (Valid inequalities for integer sets). Let X = {y ∈ Z1 : y ≤ b}. The inequality y ≤ ⌊b⌋ is valid for X.
The proof of Proposition 10.3 is somewhat straightforward and left as a thought exercise.
We can combine Propositions 10.2 and 10.3 into a single procedure to automatically generate valid inequalities. Let us start with a numerical example. Consider the set X = P ∩ Z2, where P is defined by

P = {x ∈ R2+ : 7x1 − 2x2 ≤ 14, x2 ≤ 3, 2x1 − 2x2 ≤ 3}.

First, let u = (2/7, 37/63, 0), which, for now, we can assume was arbitrarily chosen. We can then combine the constraints in P (the Ax ≤ b in Proposition 10.2), forming the constraint (equivalent to u⊤Ax ≤ u⊤b)

2x1 + (1/63)x2 ≤ 121/21.
Now, notice that the constraint would remain valid for P if we simply rounded down the coefficients on the left-hand side (as x ≥ 0 and all coefficients are positive). This would lead to the new constraint (notice that this yields the vector π in Proposition 10.2)

2x1 + 0x2 ≤ 121/21.
Finally, we can invoke Proposition 10.3 to generate a cut valid for X. This can be achieved by simply rounding down the right-hand side (yielding π0), obtaining

2x1 + 0x2 ≤ 5,
which is valid for X, but not for P. Notice that, apart from the vector of weights u used to combine the constraints, everything else in the procedure for generating the valid inequality for X is automated. This procedure is known as the Chvátal-Gomory procedure and can be formalised as follows.
Definition 10.4 (Chvátal-Gomory procedure). Consider the integer set X = P ∩ Zn, where P = {x ∈ Rn+ : Ax ≤ b}, A is an m × n matrix with columns {A1, . . . , An}, and u ∈ Rm+. The Chvátal-Gomory procedure consists of the following steps to generate valid inequalities for X:

1. Σ_{j=1}^{n} u⊤Aj xj ≤ u⊤b is valid for P, as u ≥ 0;
2. Σ_{j=1}^{n} ⌊u⊤Aj⌋xj ≤ u⊤b is valid for P, as x ≥ 0;
3. Σ_{j=1}^{n} ⌊u⊤Aj⌋xj ≤ ⌊u⊤b⌋ is valid for X, as the left-hand side is integer.
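As a small illustration, the three steps can be carried out in exact rational arithmetic; the sketch below reuses the data and multipliers of the numerical example above.

using LinearAlgebra

A = [7 -2; 0 1; 2 -2] .// 1        # constraint matrix of P
b = [14, 3, 3] .// 1
u = [2 // 7, 37 // 63, 0 // 1]     # the multipliers used above

lhs = A' * u                       # u'A: coefficients of the combined constraint
rhs = u' * b                       # u'b
println("combined: ", lhs, " x <= ", rhs)                           # [2, 1/63] x <= 121/21
println("CG cut:   ", floor.(Int, lhs), " x <= ", floor(Int, rhs))  # [2, 0] x <= 5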
Perhaps the most striking result in the theory of integer programming is that every valid inequality for an integer set X can be obtained by employing the Chvátal-Gomory procedure a finite number of times. This is formalised in Theorem 10.5.
Theorem 10.5. Every valid inequality for X can be obtained by applying the Chvátal-Gomory
procedure a finite number of times.
Notice that we have already discussed a cutting-plane method before, in Chapter 7, when we presented the Benders decomposition. In that case, the optimality and feasibility cuts form the family of valid inequalities F, while the separation problem was the subproblem responsible for finding the cuts violated by the current main problem solution.

As was the case in Benders decomposition, the motivation for cutting-plane algorithms lies in the belief that only a few of the |F| inequalities (assuming F is finite, which might not necessarily be the case) are needed, circumventing the computationally prohibitive need to generate all possible inequalities from F.
There are some other complicating aspects that must be observed when dealing with cutting-plane algorithms. First, it might be that a given family of valid inequalities F is not sufficient to expose the optimal solution x ∈ X, which might be the case, for example, if F cannot fully describe conv(X) or if the separation problem is unsolvable1. In that case, the algorithm will terminate with a solution to the LP relaxation that is not integer, i.e., xk_LP ∉ Zn.

1 The separation principle is a consequence of the separation theorem (Theorem 13.10) in Part II.
However, failing to converge to an integer solution is not a complete failure since, in the process,
we have improved the formulation P (cf. Definition 8.3). In fact, this idea plays a major role in
professional-grade implementations of mixed-integer programming solvers, as we will see later.
A = [B | N] and x = (xB, xN),

where xB are the basic components of the solution and xN = 0 the nonbasic components. The matrix N is formed by the columns of A associated with the nonbasic variables xN.
As we have discussed in Chapter 3, the system of equations Ax = b can be written as

BxB + NxN = b, or xB + B−1NxN = B−1b,

which is equivalent to B−1Ax = B−1b. Now, let aij be the element in row i and column j of B−1A, and let ai0 = (B−1b)i be the i-th component of B−1b. With that, we can represent the set of feasible solutions X as
xB(i) + Σ_{j∈IN} aij xj = ai0, ∀i ∈ I
xj ∈ Z+, ∀j ∈ J,
where I = {1, . . . , m}, J = {1, . . . , n}, IB ⊂ J are the indices of the basic variables, and IN = J \ IB the indices of the nonbasic variables. Notice that, at this point, we are simply recasting the formulation by performing permutations of columns, since basic feasible solutions for the LP relaxation do not necessarily translate into feasible solutions for X.
However, assume we solve the LP relaxation of the integer programming problem P and obtain an
optimal solution x = (xB , xN ) with associated optimal basis B. If x is fractional, then it means
that ai0 is fractional for some i.
From any of the rows i with fractional ai0, we can derive a valid inequality using the Chvátal-Gomory procedure. These inequalities, commonly referred to as CG cuts, take the form

xB(i) + Σ_{j∈IN} ⌊aij⌋xj ≤ ⌊ai0⌋.                                      (10.1)
As this is thought to be used in conjunction with the simplex method, we must be able to state (10.1) in terms of the nonbasic variables xj, ∀j ∈ IN. To do so, we can substitute xB(i) = ai0 − Σ_{j∈IN} aij xj, obtaining

Σ_{j∈IN} (aij − ⌊aij⌋)xj ≥ ai0 − ⌊ai0⌋,

which, by defining fij = aij − ⌊aij⌋, can be written in the more conventional form

Σ_{j∈IN} fij xj ≥ fi0.                                                  (10.2)
In the form of (10.2), this inequality is referred to as the Gomory (fractional) cut.
Notice that the inequality (10.2) is not satisfied by the optimal solution of the LP relaxation, since xj = 0, ∀j ∈ IN, and fi0 > 0. Therefore, a cutting-plane method using this idea benefits from the employment of the dual simplex method, in line with the discussion in Section 6.1.2.
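For concreteness, the quantities in (10.2) can be read directly off a tableau row; the small sketch below anticipates the first cut of the example that follows, generated from the row of x1 (with nonbasic variables x3 and x4).

row = [1 // 7, 2 // 7]          # ā_ij for the nonbasic variables (x3, x4)
rhs = 20 // 7                   # ā_i0
fij = row .- floor.(row)        # fractional parts of the row coefficients
fi0 = rhs - floor(rhs)
println(fij, " x >= ", fi0)     # (1/7) x3 + (2/7) x4 >= 6/7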
Let us present a numerical example illustrating the employment of Gomory's fractional cutting-plane algorithm for solving the following integer programming problem:

z = max_x 4x1 − x2
s.t.: 7x1 − 2x2 ≤ 14
      x2 ≤ 3
      2x1 − 2x2 ≤ 3
      x1, x2 ∈ Z+.
Figure 10.2 illustrates the feasible region of the problem and indicates the solution of its LP relaxation. Considering the tableau representation of the optimal basis of the LP relaxation, we have
x1 x2 x3 x4 x5 RHS
0 0 -4/7 -1/7 0 59/7
1 0 1/7 2/7 0 20/7
0 1 0 1 0 3
0 0 -2/7 10/7 1 23/7
Notice that the tableau indicates that the component x1 of the optimal solution is fractional. Thus, we can choose that row to generate a Gomory cut, which leads to the new constraint (with the respective slack variable s ≥ 0 added)

(1/7)x3 + (2/7)x4 − s = 6/7.
We can proceed to add this new constraint onto the problem, effectively adding an additional row
to the tableau. After multiplying it by -1 (so we have s as a basic variable complementing the
augmented basis), we obtain the new tableau
x1 x2 x3 x4 x5 s RHS
0 0 -4/7 -1/7 0 0 59/7
1 0 1/7 2/7 0 0 20/7
0 1 0 1 0 0 3
0 0 -2/7 10/7 1 0 23/7
0 0 -1/7 -2/7 0 1 -6/7
Notice that the solution remains dual feasible, which indicates the suitability of the dual simplex
method. Applying the dual simplex method leads to the optimal tableau
x1 x2 x3 x4 x5 s RHS
0 0 0 0 -1/2 -3 15/2
1 0 0 0 0 1 2
0 1 0 0 -1/2 1 1/2
0 0 1 0 -1 -5 1
0 0 0 1 1/2 -1 5/2
Notice that we still have a fractional component, this time associated with x2. We proceed in an analogous fashion, first generating the Gomory cut and adding the slack variable t ≥ 0, thus obtaining

(1/2)x5 − t = 1/2.
Then, adding it to the previous tableau and employing the dual simplex again, leads to the optimal
tableau
x1 x2 x3 x4 x5 s t RHS
0 0 0 0 0 -4 -1 7
1 0 0 0 0 1 0 2
0 1 0 0 0 0 -1 1
0 0 1 0 0 -7 -2 2
0 0 0 1 0 0 1 2
0 0 0 0 1 -2 -2 1
Notice that now all variables are integer and, thus, an optimal solution for the original integer programming problem has been found.
Some points are worth noticing. First, notice that, at the optimum, all variables, including the slacks, are integer. This is a consequence of having the Gomory cuts active at the optimal solution, since (10.1), and consequently (10.2), have both their left- and right-hand sides integer. Also, notice that at each iteration the problem increases in size due to the new constraint being added, which implies that the basis also increases in size. Though this is also an issue in the branch-and-bound method, it can be a more prominent computational issue in the context of cutting-plane methods.
We can also interpret the progress of the algorithm in graphical terms. First of all, notice that we can express the cuts in terms of the original variables (x1, x2) by noticing that the original formulation gives x3 = 14 − 7x1 + 2x2 and x4 = 3 − x2. Substituting x3 and x4 in the cut (1/7)x3 + (2/7)x4 − s = 6/7 gives x1 ≤ 2. More generally, cuts can be expressed using the original problem variables, as stated in Proposition 10.6.
Proposition 10.6. Let β be the row l of B−1 selected to generate the cut, and let qi = βi − ⌊βi⌋ for i ∈ {1, . . . , m}. Then the cut Σ_{j∈IN} flj xj ≥ fl0, written in terms of the original variables, is the Chvátal-Gomory inequality

Σ_{j=1}^{n} ⌊qAj⌋xj ≤ ⌊qb⌋.
Figure 10.2: Feasible region of the LP relaxation (polyhedral set) and of the integer programming problem (blue dots) at each of the three iterations taken to solve the problem. The inequalities in orange represent the Gomory cut added at each iteration
Let us illustrate the concept of dominance with a numerical example. Consider the inequalities 2x1 + 4x2 ≤ 9 and x1 + 3x2 ≤ 4, which are valid for P = conv(X), where

X = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (3, 0), (4, 0)}.
Figure 10.3: Illustration of dominance between constraints. Notice that x1 + 3x2 ≤ 4 dominates 2x1 + 4x2 ≤ 9 and is thus stronger
Notice that, taking u = 1/2, we have, for any x = (x1, x2), that x1 + 3x2 ≥ u(2x1 + 4x2) = x1 + 2x2 and that 4 ≤ 9u = 9/2. Thus, we say that x1 + 3x2 ≤ 4 dominates 2x1 + 4x2 ≤ 9. Figure 10.3 illustrates the two inequalities. Notice that x1 + 3x2 ≤ 4 is the stronger inequality, since it is more efficient at representing the convex hull of X than 2x1 + 4x2 ≤ 9.
Another related concept is the notion of redundancy. Clearly, in the presence of two constraints in which one dominates the other, the dominated constraint is redundant and can be safely removed from the formulation of the polyhedral set P. However, in some cases one might not be able to identify redundant constraints simply because no constraint is clearly dominated by another. Even then, there might be a way to identify weak (or redundant) constraints by combining two or more constraints into a dominating constraint. This is formalised in Definition 10.8.
Definition 10.8 (Redundancy). The inequality πx ≤ π0 is redundant for P if there exist k ≥ 1 valid inequalities πi x ≤ πi0 and k ≥ 1 vectors ui > 0, i ∈ {1, . . . , k}, such that Σ_{i=1}^{k} ui πi x ≤ Σ_{i=1}^{k} ui πi0 dominates πx ≤ π0.
Once again, let us illustrate the concept with a numerical example. Suppose we generate the inequality 5x1 − 2x2 ≤ 6, which is valid for the polyhedral set

P = {x ∈ R2+ : 6x1 − x2 ≤ 9, 9x1 − 5x2 ≤ 6}.

The inequality 5x1 − 2x2 ≤ 6 is not dominated by either of the inequalities forming P. However, if we set u = (1/3, 1/3), we obtain 5x1 − 2x2 ≤ 5, which in turn dominates 5x1 − 2x2 ≤ 6. Thus, we can conclude that the generated inequality is redundant and does not improve the formulation of P. This is illustrated in Figure 10.4.
As one might realise, checking whether a newly generated inequality improves the current formulation is a demanding task, as it requires finding the correct set of coefficients u for all constraints currently forming the polyhedral set P. Nonetheless, the notions of redundancy and dominance can be used to guide procedures that generate or improve existing inequalities. Let us discuss one such procedure in the context of 0-1 knapsack inequalities.
Figure 10.4: Illustration of a redundant inequality. Notice how the inequality 5x1 − 2x2 ≤ 6 (in orange) does not dominate any of the other inequalities
Let us consider the family of constraints known as knapsack constraints and see how they can be strengthened. For that, let us first define the knapsack set

X = {x ∈ {0, 1}n : Σ_{j=1}^{n} aj xj ≤ b}.
We assume that aj ≥ 0, j ∈ N = {1, . . . , n}, and b > 0. Let us start by defining the notion of a
minimal cover.
Definition 10.9 (Minimal cover). A set C ⊆ N is a cover if Σ_{j∈C} aj > b. A cover C is minimal if, for each j ∈ C, C \ {j} is not a cover.
Notice that a cover C refers to any selection of items that exceeds the budget b of the constraint, and this selection is said to be a minimal cover if, upon the removal of any item from the selection, the constraint becomes satisfied. This logic allows us to design a way of generating valid inequalities using covers. This is the main result in Proposition 10.10.

Proposition 10.10 (Cover inequality). If C ⊆ N is a cover, then the inequality Σ_{j∈C} xj ≤ |C| − 1 is valid for X.

Proof. Let xR ∈ {0, 1}n and R = {j ∈ N : xRj = 1}. If Σ_{j∈C} xRj > |C| − 1, then |R ∩ C| = |C| and, therefore, C ⊆ R. Thus, Σ_{j∈N} aj xRj = Σ_{j∈R} aj > b, which violates the knapsack constraint and implies that xR ∉ X.
The usefulness of Proposition 10.10 becomes evident if C is a minimal cover. Let us consider a numerical example to illustrate this. Consider the knapsack set

X = {x ∈ {0, 1}7 : 11x1 + 6x2 + 6x3 + 5x4 + 5x5 + 4x6 + x7 ≤ 19}.
Some examples of minimal cover inequalities valid for X are:
x1 + x2 + x3 ≤ 2
x1 + x2 + x6 ≤ 2
x1 + x5 + x6 ≤ 2
x3 + x4 + x5 + x6 ≤ 3
A cover inequality can be strengthened by extending the cover: for a cover C, define the extended cover E(C) = C ∪ {j ∈ N : aj ≥ ai, ∀i ∈ C}; then the inequality Σ_{j∈E(C)} xj ≤ |C| − 1 is also valid for X. We leave the proof as a thought exercise. Let us, however, illustrate this using the previous numerical example. For C = {3, 4, 5, 6}, we have E(C) = {1, 2, 3, 4, 5, 6}, yielding the inequality

x1 + x2 + x3 + x4 + x5 + x6 ≤ 3,

which is stronger than x3 + x4 + x5 + x6 ≤ 3, as the former dominates the latter (cf. Definition 10.7).
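The validity of the extended cover inequality can be verified by brute-force enumeration; a small sketch over the example set:

a, b = [11, 6, 6, 5, 5, 4, 1], 19
E = 1:6                                     # E(C) for C = {3, 4, 5, 6}
for bits in 0:(2^7 - 1)
    x = [(bits >> (j - 1)) & 1 for j in 1:7]
    if sum(a .* x) <= b && sum(x[E]) > 3    # feasible but violating the inequality?
        println("violated at ", x)          # never triggered: the inequality is valid
    end
end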
10.6 Exercises
Exercise 10.1: Chvátal-Gomory (C-G) procedure
Consider the set X = P ∩ Zn where P = {x ∈ Rn : Ax ≤ b, x ≥ 0} and in which A is an
m × n matrix with columns {A1 , . . . , An }. Let u ∈ Rm with u ≥ 0. The Chvátal-Gomory (C-G)
procedure to construct valid inequalities for X uses the following 3 steps:
1. Σ_{j=1}^{n} uAj xj ≤ ub is valid for P, as u ≥ 0 and Σ_{j=1}^{n} Aj xj ≤ b.

2. Σ_{j=1}^{n} ⌊uAj⌋xj ≤ ub is valid for P, as x ≥ 0.

3. Σ_{j=1}^{n} ⌊uAj⌋xj ≤ ⌊ub⌋ is valid for X, as any x ∈ X is integer and thus Σ_{j=1}^{n} ⌊uAj⌋xj is integer.
Show that every valid inequality for X can be obtained by applying the Chvátal-Gomory procedure
a finite number of times.
Hint: We show this for the 0-1 case. Thus, let P = {x ∈ Rn : Ax ≤ b, 0 ≤ x ≤ 1}, X = P ∩ Zn ,
and suppose that πx ≤ π0 with π, π0 ∈ Z is a valid inequality for X. We show that πx ≤ π0 can
be obtained by applying Chvátal-Gomory procedure a finite number of times. We do this in parts
by proving the following claims C1, C2, C3, C4, and C5.
C4. If

πx ≤ π0 + τ + Σ_{j∈T0∪{p}} xj + Σ_{j∈T1} (1 − xj)                      (10.5)

and

πx ≤ π0 + τ + Σ_{j∈T0} xj + Σ_{j∈T1∪{p}} (1 − xj)                      (10.6)

are valid inequalities for X, where τ ∈ Z+ and (T0, T1) is any partition of {1, . . . , p − 1}, then

πx ≤ π0 + τ + Σ_{j∈T0} xj + Σ_{j∈T1} (1 − xj)                           (10.7)

is also a valid inequality for X.
C5. If

πx ≤ π0 + τ + 1                                                         (10.8)

is a valid inequality for X with τ ∈ Z+, then

πx ≤ π0 + τ                                                             (10.9)

is also a valid inequality for X.
(i) x2 + x4 ≥ 1
(ii) x1 ≤ x2
(b) Consider the set X = {x ∈ B4 : xi + xj ≤ 1 for all i, j ∈ {1, . . . , 4} : i ̸= j}. Derive the clique
inequalities x1 + x2 + x3 ≤ 1 and x1 + x2 + x3 + x4 ≤ 1 as C-G inequalities.
s.t.: 4x1 + x2 ≤ 28
x1 + 4x2 ≤ 27
x1 − x2 ≤ 1
x1 , x2 ∈ Z+ .
s.t.: 4x1 + x2 + x3 = 28
x1 + 4x2 + x4 = 27
x1 − x2 + x5 = 1
x1 , x2 , x3 , x4 , x5 ≥ 0
The optimal Simplex tableau after solving the problem LP with primal Simplex is
x1 x2 x3 x4 x5 RHS
0 0 -1/5 -6/5 0 -38
1 0 4/15 -1/15 0 17/3
0 1 -1/15 4/15 0 16/3
0 0 -1/3 1/3 1 2/3
(a) Derive two fractional Gomory cuts from the rows of x1 and x5, and express them in terms of the original variables x1 and x2.

(b) Derive the same cuts as in part (a) as Chvátal-Gomory cuts. Hint: Use Proposition 10.6. Recall that the bottom-right part of the tableau corresponds to B−1A, where B−1 is the inverse of the optimal basis matrix and A is the original constraint matrix. You can thus obtain the matrix B−1 from the optimal Simplex tableau, since the last three columns of A form an identity matrix.
Solve the problem by adding Gomory cuts to the LP relaxation until you find an integer solution.
and a solution x̄ = (0, 2/3, 0, 1, 1, 1, 1) to its LP relaxation. Find a cover inequality cutting out
(violated by) the fractional solution x̄.
Chapter 11
Mixed-integer programming
solvers
Figure 11.1: The flowchart of a typical MIP solver, with nodes representing the phases of the algorithm: presolve, LP relaxation, cuts, heuristics, and branching
Along with presolve, this is likely the phase that differs most between implementations of MIP solvers. The cut phase consists of the employment of a cutting-plane method onto the current LP relaxation, with the aim of either obtaining an integer solution (and thus pruning the branch by optimality) or strengthening the formulation of the LP relaxation, as discussed in Chapter 10. Each solver will have its own families of cuts that are used in this phase, and typically a collection of them is used simultaneously. The heuristics phase is used in combination with the other phases to try to obtain primal feasible solutions from the LP relaxations (possibly augmented by cuts), so that primal bounds (integer and feasible solution values) can be obtained and broadcast to the whole search tree, hopefully fostering pruning by bound.
In what follows, we will discuss the main techniques in each of these phases.
Many techniques used in the preprocessing phase rely on the notion of constraint activity. Consider the constraint a⊤x ≤ b, with x ∈ Rn a decision variable vector, l ≤ x ≤ u, and b ∈ R, where (a, b, l, u) are given. The minimum and maximum activities of the constraint are given by

αmin = min {a⊤x : l ≤ x ≤ u} = Σ_{j: aj>0} aj lj + Σ_{j: aj<0} aj uj, and
αmax = max {a⊤x : l ≤ x ≤ u} = Σ_{j: aj>0} aj uj + Σ_{j: aj<0} aj lj.
Notice that the constraint activities simply capture the minimum and maximum values (respectively) that the left-hand side of a⊤x ≤ b can assume. They can be used in a number of ways. For example, if there is a constraint for which αmin > b, then the problem is trivially infeasible. On the other hand, if one observes that αmax ≤ b for a given constraint, then the constraint can be safely removed, since it is guaranteed to be redundant.
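A minimal sketch of these activity-based checks (the data below are illustrative):

function activities(a, l, u)
    αmin = sum(a[j] > 0 ? a[j] * l[j] : a[j] * u[j] for j in eachindex(a))
    αmax = sum(a[j] > 0 ? a[j] * u[j] : a[j] * l[j] for j in eachindex(a))
    return αmin, αmax
end

a, b = [2.0, -1.0, 3.0], 10.0
l, u = [0.0, 0.0, 0.0], [1.0, 2.0, 2.0]
αmin, αmax = activities(a, l, u)
αmin > b && println("constraint is trivially infeasible")
αmax <= b && println("constraint is redundant and can be removed")  # triggered: αmax = 8 ≤ 10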
Bound tightening
Another important presolving method is bound tightening, which, as the name suggests, tries to
tighten lower and upper bounds of variables, thus strengthening the LP relaxation formulation.
There are alternative ways that this can be done, and they typically trade off how much tightening
can be observed and how much computational effort they require.
One simple way of employing bound tightening is by noticing the following. Assume, for simplicity, that aj > 0, ∀j ∈ J. Then we have

αmin = Σ_{j∈J} aj lj = a⊤l ≤ a⊤x ≤ b.

Isolating a given variable xj on the left-hand side of a⊤x ≤ b gives

aj xj ≤ b − a⊤x + aj xj.

Moreover, from the definition of αmin, we have αmin − aj lj ≤ a⊤x − aj xj, and therefore

b − a⊤x + aj xj ≤ b − αmin + aj lj.

Combining this result with the first inequality, we obtain

aj xj ≤ b − a⊤x + aj xj ≤ b − αmin + aj lj,

which yields the (possibly tighter) upper bound xj ≤ lj + (b − αmin)/aj.

A more computationally demanding alternative is to solve, for each variable xj, an auxiliary linear programming problem LPxj that minimises xj subject to all of the problem constraints; this provides a lower bound for xj that considers all possible constraints at once. Analogously, solving LPxj as a maximisation problem yields an upper bound. Though this can be done somewhat efficiently, it clearly has steeper computational requirements.
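A sketch of the resulting bound update, under the same simplifying assumption that aj > 0 for all j:

function tighten_upper!(u, a, b, l)
    αmin = sum(a[j] * l[j] for j in eachindex(a))   # all a[j] > 0 assumed
    for j in eachindex(a)
        u[j] = min(u[j], l[j] + (b - αmin) / a[j])
    end
    return u
end

a, b = [2.0, 5.0], 10.0
l, u = [0.0, 1.0], [10.0, 10.0]
println(tighten_upper!(u, a, b, l))   # x1 <= 2.5, x2 <= 2.0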
Coefficient tightening

Differently from bound tightening, coefficient tightening techniques aim at improving the strength of existing constraints. The simplest form consists of the following. Let aj > 0 and xj ∈ {0, 1} be such that αmax − aj < b. If such a coefficient is identified, then the constraint

aj xj + Σ_{j': j'≠j} aj' xj' ≤ b

can be replaced with the tightened constraint

(αmax − b) xj + Σ_{j': j'≠j} aj' xj' ≤ αmax − aj.

Notice that the modified constraint is valid for the original integer set while dominating the original constraint (cf. Definition 10.7), since, on the left-hand side, we have αmax − b < aj and, on the right-hand side, we have αmax − aj < b.
Other methods
There is a wide range of methods employed in preprocessing, and they vary greatly among different solvers, and even different modelling languages. Thus, compiling an exhaustive list is no trivial feat. Some other common methods that are employed include:
• Merge of parallel rows and columns: methods implemented to identify pairs of rows (constraints) and columns (variables) with a constant proportionality coefficient (i.e., that are linearly dependent) and merge them into a single entity, thus reducing the size of the model.
• Domination tests between constraints: heuristics that test whether domination between
selected constraints can be asserted so that some constraints can be deemed redundant and
removed.
• Clique merging: a clique is a subset of the vertices of a graph that are fully connected. Assume that xj ∈ {0, 1} for j ∈ {1, 2, 3} and that the following three constraints hold:
x1 + x2 ≤ 1
x1 + x3 ≤ 1
x2 + x3 ≤ 1.
Then, one can think of these constraints as forming a clique between the imaginary nodes 1, 2, and 3, which renders the clique cut x1 + x2 + x3 ≤ 1. Many other ideas using this graph representation of implication constraints, known as conflict graphs, are implemented in presolvers.
• Greatest common divisor (GCD) reduction: we can use the GCD of the coefficients a = [a1, . . . , an] to generate or tighten inequalities. Let gcd(a) be the GCD of all coefficients aj in a. Then we can generate the valid inequality (see the small example after this list)

Σ_{j=1}^{n} (aj / gcd(a)) xj ≤ ⌊b / gcd(a)⌋.
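A small example of the GCD reduction: for integer x ≥ 0, the constraint 4x1 + 6x2 ≤ 9 can be tightened to 2x1 + 3x2 ≤ 4.

a, b = [4, 6], 9
g = gcd(a...)                          # gcd of all coefficients: 2
println(a .÷ g, " x <= ", fld(b, g))   # [2, 3] x <= 4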
Some final remarks are worth making. Most solvers might, at some point in the process of solving the MIP, perform something called a restart, which consists of reapplying some or all of the techniques associated with the preprocessing phase after a few iterations of the branch-and-cut process. This can be beneficial since, during the solution process, new constraints (cuts) are generated, which might lead to new opportunities for reducing the problem or further tightening bounds.
In addition, conflict graphs can contain information that can be exploited in a new round of preprocessing and transmitted across the whole search tree, a process known as propagation. Conflict graphs and propagation are techniques originally devised for constraint programming and satisfiability (SAT) problems, but they have made considerable inroads in MIP solvers as well.
and Chvátal-Gomory cuts for pure integer constraints, which are dependent on the heuristic or process used to define the values of the multipliers u in

Σ_{j=1}^{n} ⌊uAj⌋xj ≤ ⌊ub⌋.
For example, zero-half cuts (with u ∈ {0, 1/2}) and mod-k cuts (u ∈ {0, 1/k, . . . , (k − 1)/k}) are available in most solvers. Other cuts, such as knapsack (or cover) inequalities, mixed-integer rounding cuts, and clique cuts, are also common and, although they can be shown to be related to the Chvátal-Gomory procedure, are generated by means of heuristics (normally referred to as cut generation procedures).
Variable selection, commonly referred to as the branching strategy in most MIP solver implementations, refers to the decision of which of the currently fractional variables should be chosen to generate subproblems. There are three main methods most commonly used, which we discuss next. Furthermore, most MIP solvers allow the user to set priority weights for the variables, which define priority orders for variable selection. These can be useful, for example, when the user knows that the problem possesses a dependence structure between variables (e.g., location and allocation variables, where allocation can only happen if the location decision is made) that the solver cannot infer automatically.
Maximum infeasibility

The first branching strategy, sometimes called maximum infeasibility, consists of choosing the variable with fractional part as close to 0.5 as possible or, more precisely, selecting the variable j ∈ {1, . . . , n} as

argmax_{j∈{1,...,n}} min {fj, 1 − fj},

where fj = xj − ⌊xj⌋. In effect, this tries to reduce as much as possible the infeasibility of the LP relaxation solution, which in turn would more quickly lead to a feasible (i.e., integer) solution. An analogous form, called minimum infeasibility, is often available and, as the name suggests, focuses on selecting the variables that are closest to being integer valued.
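A toy illustration of the maximum infeasibility rule, using an assumed fractional LP solution:

xLP = [2.857, 3.0, 1.4, 0.5]                  # an assumed LP relaxation solution
frac = xLP .- floor.(xLP)                     # fractional parts f_j
j = argmax(min.(frac, 1 .- frac))             # maximum infeasibility choice
println("branch on x_", j)                    # here: x_4, with f_4 = 0.5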
Strong branching

Strong branching can be understood as an explicit look-ahead strategy. That is, to decide which variable to branch on, the method performs branching on all possible variables and chooses the one that provides the best improvement in the dual (LP relaxation) bound. Specifically, for each fractional variable xj, we solve the LP relaxations corresponding to the branching options xj ≤ ⌊xj_LP⌋ and xj ≥ ⌈xj_LP⌉ and then choose the fractional variable xj that leads to the subproblems with the best LP relaxation objective values.
As you might suspect, there is a trade-off between the observed reduction in the number of nodes explored, given by the more prominent improvement of the dual bound, and how computationally intensive the method is. There are, however, ideas that can exploit this trade-off more efficiently. First, the solution of the subproblems might yield information related to infeasibility and to pruning by bound, which can be used in favour of the method.
Another idea is to limit the number of simplex iterations performed when solving the subproblems
associated with each branching option. This allows for using approximate solution of the subprob-
lems and potential savings in computational efforts. Some solvers offer a parameter that allow the
user to set this iteration limit value.
Pseudo-cost branching

Pseudo-cost branching relies on the idea of using past information from the search process to estimate the gains from branching on a specific variable. Because of this reliance on past information, the method tends to be more reliable later in the search tree, where more information has been accumulated on the impact of choosing a variable for branching.

These improvement estimates are the so-called pseudo-costs, which compile an estimate of how much the dual (LP relaxation) bound has improved per fractional unit of the variable that has been branched on. For each fractional variable xj, let

fj− = xj_LP − ⌊xj_LP⌋ and fj+ = ⌈xj_LP⌉ − xj_LP,                        (11.1)

and let Ψj− and Ψj+ denote the pseudo-costs of branching down and up on xj, respectively. We can then compute

Δj− = fj− Ψj− and Δj+ = fj+ Ψj+,                                        (11.2)
which represent the estimated changes to be observed when selecting the variable xj for branching, based on the current fractional parts fj+ and fj−. In effect, these are considered in a branching score, with the branching variable being selected as, for example,

j = argmax_{j=1,...,n} { α min {Δj−, Δj+} + (1 − α) max {Δj−, Δj+} },
where α ∈ [0, 1]. Setting the value of α trades off two aspects. Assume a maximisation problem. Then, setting α closer to zero will slow down degradation, which refers to the decrease of the upper bound (notice that the dual bound is decreasing and, thus, Δ+ and Δ− are negative). This strategy improves the chances of finding a good feasible solution on the given branch, in turn potentially fostering pruning by bound. In contrast, setting α closer to one increases the rate of decrease (improvement) of the dual bound, which can be helpful for fostering pruning once a good global primal bound is available. Some solvers allow for alternative branching score functions to be considered.
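As a toy computation of this score, with assumed Δ estimates for a maximisation problem (both changes negative):

Δminus = [-0.8, -2.1, -0.3]       # assumed f⁻Ψ⁻ estimates per variable
Δplus  = [-1.5, -0.4, -2.9]       # assumed f⁺Ψ⁺ estimates per variable
α = 0.7
score = α .* min.(Δminus, Δplus) .+ (1 - α) .* max.(Δminus, Δplus)
println("branch on x_", argmax(score))   # here: x_1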
As one might suspect, it might take several iterations before reliable estimates of Ψ+ and Ψ− are available. The issue of unreliable pseudo-costs can be alleviated with the use of a hybrid strategy known as reliability branching1, in which variables deemed unreliable, for not having been selected for branching a minimum number of times η (typically η ∈ [4, 8]), have strong branching employed instead.
GUB branching

Constraints of the form Σ_{j=1}^{k} xj = 1 are referred to as special ordered sets of type 1 (or SOS1), which, under the assumption that xj ∈ {0, 1}, ∀j ∈ {1, . . . , k}, imply that only one variable can take a value different from zero. Notice that you may have SOS1 sets involving continuous variables, which, in turn, would require the use of binary variables to be modelled appropriately.
Branching on these variables might lead to unbalanced branch-and-bound trees. This is because the branch in which xj is set to a value different from zero immediately defines the other variables to be zero, leading to an early pruning by optimality or infeasibility. In turn, unbalanced trees are undesirable since they preclude the possibility of parallelisation and might lead to issues related to searches that focus on finding leaf nodes quickly.
1 Tobias Achterberg, Thorsten Koch, and Alexander Martin (2005), Branching rules revisited, Operations Re-
search Letters
To remedy this, the idea of using a generalised upper bound is employed, leading to what is referred to as GUB branching (with some authors referring to this as SOS1 branching). A generalised upper bound is an upper bound imposed on the sum of several variables. In GUB branching, branching for the binary variables x_{j1}, . . . , x_{jk} of an SOS1 set is imposed considering the following rule: one child node is formed by setting x_{ji} = 0 for i ∈ {1, . . . , r}, and the other by setting x_{ji} = 0 for i ∈ {r + 1, . . . , k}, for some 1 ≤ r < k, yielding a more balanced division of the feasible region. Constraints imposing generalised upper bounds on sums of general integer variables can also benefit from the use of GUB branching, with the term GUB being perhaps better suited in this case.

11.5 Node selection

When selecting the next subproblem to solve, a search strategy may seek to:

1. Focus on quickly finding good primal feasible solutions, which provide the global primal bounds used for pruning.
2. Alternatively, focus on improving the dual bound faster, hoping that once an integer solution
is found, more pruning by bound is possible.
3. Increase ramp-up, which means increase the number of unsolved nodes in the list of subprob-
lems so that these might be solved in parallel. For that, the nodes must be created, and the
faster they are opened, the earlier parallelisation can benefit the search.
4. Minimise computational effort by minimising the overhead associated with changing the subproblems to be solved. This means that child nodes are preferred over other nodes, so as to minimise the changes needed to assemble a starting basis for the dual simplex method.
You might notice that points 1 and 2 are conflicting: while the former would benefit from searches that dive deep into the tree looking for leaf nodes, the latter would benefit from a breadth-focused search (that is, having wider trees earlier rather than deeper ones). Points 3 and 4 pose exactly the same dilemma: the former benefits from breadth-focused searches, while the latter benefits from depth-focused searches.
The main strategies for node selection are, to a large extent, ideas to emphasise each or a combi-
nation of the above.
A depth-first search focuses on diving down the search tree, prioritising nodes at deeper levels. It has the effect of increasing the chances of finding leaves, and potentially primal feasible solutions, earlier. Furthermore, because the problems being successively solved are very similar, differing only by one additional branching constraint, the dual simplex method can be efficiently restarted, and often fewer iterations are needed to find the optimum of the child subproblem's relaxation. On the other hand, as a consequence of the way the search is carried out, it is slower in populating the list of subproblems.
In contrast, a breadth-first search gives priority to nodes at higher levels of the tree, ultimately causing a horizontal spread of the search tree. As a consequence, the dual bound improves faster, at the expense of potentially delaying the generation of primal feasible solutions. This also generates more subproblems quickly, fostering diversification (more subproblems and potentially more information to be re-utilised in repeated rounds of preprocessing) and opportunities for parallelisation.
Best bound
Best bound consists of choosing as the next node the one with the best dual (LP relaxation) bound. It leads to a breadth-first search pattern, but with the flexibility of allowing potentially good nodes lying at deeper levels of the tree to be selected.
Ultimately, this strategy fosters a faster improvement of the dual bound, but with a higher overhead in the setup of the subproblems, since they can be quite different from each other in terms of their constraints. One way to mitigate this overhead is to perform diving sporadically, which consists of, after choosing a node by best bound, temporarily switching to a depth-first search for a few iterations.
Best estimate
The best estimate strategy uses an approach similar to that employing pseudo-costs to choose which variable to branch on. However, instead of focusing on objective function values, it uses estimates of the node's progress towards feasibility relative to its bound degradation.
To see how this works, assume that the parent node has been solved, and a dual bound zD is
available. Now, using our estimates in (11.2), we can calculate an estimate of the potential primal
feasible solution value E, given by
E = z_D + Σ_{j=1}^{n} min{∆_j^-, ∆_j^+}.
The expression E is an estimate of the best possible value an integer solution could have if it were to be generated by rounding the LP relaxation solution. These estimates can also take into account feasibility per se, trying to estimate feasibility probabilities considering known feasible solutions and how fractional the subproblem LP relaxation solution is.
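As a toy numerical illustration (with hypothetical degradation estimates, in a minimisation setting), E could be computed in Julia as:

zD = 10.0                        # dual bound of the parent node
Δdown = [0.4, 1.2, 0.0]          # estimated degradations ∆⁻ per variable (hypothetical)
Δup   = [0.9, 0.3, 0.0]          # estimated degradations ∆⁺ per variable (hypothetical)
E = zD + sum(min.(Δdown, Δup))   # E = 10.0 + 0.4 + 0.3 + 0.0 = 10.7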
11.6 Primal heuristics
Primal heuristics are methods that are either (i) geared towards generating primal feasible solutions, typically from a solution obtained from a relaxation, or (ii) geared towards improving on previously known primal solutions. The former are often referred to as constructive heuristics, while the latter are called improvement heuristics.
The name heuristic refers to the fact that these methods are not guided by optimality certificates per se, but rather by repeatedly performing local (or nearby, according to a given metric of solution difference) improvements.
Primal heuristics play three main roles in MIP solver algorithms. First, they are employed in the preprocessing phase to verify whether the model can be proved feasible, by constructing a primal feasible solution. Second, constructive heuristics are very powerful in generating feasible solutions during the branch-and-bound phase, meaning that they can make primal feasible solutions available before these are found at leaf nodes pruned by optimality (i.e., nodes with an integer LP relaxation solution), therefore fostering early pruning by bound. Lastly, heuristics are a powerful way to obtain reasonably (often considerably) good solutions which, in practical cases, might be sufficient given computational or time limitations and precision requirements.
Diving heuristics are used in combination with node selection strategies that search in breadth (instead of depth). In simple terms, they consist of performing a local depth-first search at the node being considered, with no or very little backtracking, in the hope of reusing the subproblem structure while searching for primal feasible solutions. The main difference is that the subproblems are generated in an alternative tree, in which branching is based on rounding and fixing variables instead of the standard branching we have seen in Chapter 9.
Once the heuristic terminates, this structure is discarded, but the solution, if found, is kept. Notice that the diving can also be preemptively aborted if it either renders an infeasible subproblem or leads to a relaxation with a worse bound than a known primal bound from an incumbent solution. Another common termination criterion consists of limiting the total number of LP iterations spent solving the subproblems, or the total number of subproblems solved.
The most common types of rounding employed in diving heuristics include fractional diving, in which the variable selected for rounding is simply that with the smallest fractional component, i.e., x_j is chosen such that the index j is given by
j ∈ argmin_{j : x_j^LP ∉ Z} min{ x_j^LP − ⌊x_j^LP⌋, ⌈x_j^LP⌉ − x_j^LP }.
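A minimal Julia sketch of this selection rule, for a hypothetical relaxation solution vector xLP:

xLP = [0.2, 0.97, 0.5, 1.0]                # hypothetical LP relaxation solution
frac = xLP .- floor.(xLP)                  # fractional parts
dist = min.(frac, 1 .- frac)               # distance to the nearest integer
candidates = findall(d -> d > 1e-6, dist)  # skip (near-)integer components
j = candidates[argmin(dist[candidates])]   # j = 2: x₂ = 0.97 is closest to an integer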
Another common idea consists of selecting the variables to be rounded by considering a reference solution, which often is an incumbent primal feasible solution. This guided dive is then performed by choosing the variable whose fractional value is closest to its value in the reference solution.
A third idea consists of taking into account the number of locks associated with a variable. The locks refer to the number of constraints that are potentially made infeasible by rounding the variable up or down. This potential infeasibility stems from taking into account the coefficient of the variable, the type of the constraint, and whether rounding the variable up or down can potentially cause infeasibility. This is referred to as coefficient diving.
The relaxation-induced neighbourhood search (or RINS) is possibly the most common improvement heuristic available in modern implementations². The heuristic tries to find solutions that strike a balance between proximity to the current LP relaxation solution, hoping this would improve solution quality, and proximity to an incumbent (i.e., best-known primal feasible) solution, emphasising feasibility.
In a nutshell, the method consists of the following. After solving the LP relaxation of a node, suppose we obtain the solution x^LP. Also, assume we have at hand an incumbent solution x̄. Then, we form an auxiliary MIP problem in which we fix all variables coinciding between x^LP and x̄. This can be achieved by including the constraints
x_j = x̄_j, ∀j ∈ {1, . . . , n} : x̄_j = x_j^LP,
which, in effect, fix these variables to integer values and remove them from the problem, as they can be converted to parameters (or input data). Notice that this constrains the feasible space to lie in the (potentially large) neighbourhood of the incumbent solution. In later iterations, when more of the components of the relaxation solution x^LP are integer, this becomes a more local search, with fewer degrees of freedom. Finally, this additional MIP is solved and, in case an optimal solution is found, a new incumbent solution might become available.
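A minimal JuMP sketch of the RINS sub-MIP construction; the model m, its variables x, and the solution vectors xLP and x̄ are assumed given (all names are illustrative):

using JuMP

function rins_submip!(m, x, xLP, x̄; tol = 1e-6)
    for j in eachindex(x)
        if abs(xLP[j] - x̄[j]) <= tol   # components on which xLP and x̄ agree
            fix(x[j], x̄[j]; force = true)
        end
    end
    optimize!(m)                       # solve the restricted (auxiliary) MIP
    return termination_status(m)
end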
In contrast, relaxation-enforced neighbourhood search (or RENS) is a constructive heuristic that has not yet seen wider adoption in commercial-grade solvers, though it is available in CBC and SCIP³.
The main difference between RINS and RENS is that no incumbent solution is considered (hence the dropping of the term “induced”); rather, the LP relaxation solution x^LP fully defines the neighbourhood (explaining the name “enforced”).
Once again, let us assume we obtain the solution x^LP. And again, we fix all integer-valued variables:
x_j = x_j^LP, ∀j ∈ {1, . . . , n} : x_j^LP ∈ Z.
One key difference is how the remaining variables are treated. For those components that are fractional, the following bounds are imposed:
⌊x_j^LP⌋ ≤ x_j ≤ ⌈x_j^LP⌉, ∀j ∈ {1, . . . , n} : x_j^LP ∉ Z.
Notice that this in effect makes the neighbourhood considerably smaller around the solution x^LP. Then, the MIP subproblem with all these additional constraints is solved and a new incumbent solution may be found.
² Emilie Danna, Edward Rothberg, and Claude Le Pape (2005), Exploring relaxation induced neighborhoods to improve MIP solutions, Mathematical Programming.
Local branching
The idea of local branching is to allow the search to be performed in a neighbourhood of controlled size, which is achieved by the use of an L1-norm⁴. The size of the neighbourhood is controlled by a divergence parameter ∆ which, in the case of binary variables, amounts to the Hamming distance between the variable vectors.
In its most simple form, it can be seen as the following idea. From an incumbent solution x̄, one can generate and impose the following neighbourhood-inducing constraint:
Σ_{j=1}^{n} |x_j − x̄_j| ≤ ∆.
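For binary variables, this L1 term is linear in x; a standard reformulation (not stated explicitly above) of the constraint is
Σ_{j : x̄_j = 1} (1 − x_j) + Σ_{j : x̄_j = 0} x_j ≤ ∆.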
Feasibility pump
Feasibility pump is a constructive heuristic that, contrary to the previous heuristics, has made inroads into most professional-grade solvers and is often employed by default. Its focus is exclusively on finding a first primal feasible solution. The idea consists of, starting from the LP relaxation solution x^LP, performing alternating steps of rounding and solving a projection step, which happens to be an LP problem.
Starting from x^LP, the method first simply rounds its components to obtain an integer solution x̃. If this rounded solution is feasible, the algorithm terminates. Otherwise, we perform a projection step by replacing the LP relaxation objective function with
f^aux(x) = Σ_{j=1}^{n} |x_j − x̃_j|
and re-solving the problem. This is called a projection because it effectively finds the point in the feasible region of the LP relaxation that is closest to the integer solution x̃. This new solution x^LP is once again rounded, and the process repeats until a feasible integer solution is found.
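A minimal sketch of the feasibility pump for a pure binary problem, assuming a JuMP model m (with an LP optimizer attached) whose binary variables x have been relaxed to 0 ≤ x ≤ 1; the random perturbations discussed below are omitted, and all names are illustrative:

using JuMP

function feasibility_pump!(m, x; max_iter = 100)
    optimize!(m)
    x̃ = round.(value.(x))                  # initial rounding
    for _ in 1:max_iter
        # projection: minimise the L1 distance to x̃; for binaries this is
        # linear, since |x_j - x̃_j| equals x_j if x̃_j = 0, and 1 - x_j if x̃_j = 1
        @objective(m, Min, sum(x̃[j] == 1 ? 1 - x[j] : x[j] for j in eachindex(x)))
        optimize!(m)
        xLP = value.(x)
        all(abs.(xLP .- round.(xLP)) .<= 1e-6) && return round.(xLP)  # integral: done
        x̃ = round.(xLP)                     # otherwise, round again and repeat
    end
    return nothing                          # no feasible solution found
end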
4 Matteo Fischetti and Andrea Lodi (2003), Local branching, Mathematical Programming
It is known that the feasibility pump can suffer from cycling, that is, from repeatedly finding the same x^LP and x̃ solutions. This can be alleviated by performing random perturbations on some of the components of x̃.
Feasibility pump is an extremely powerful method and plays a central role in many professional-grade solvers. It is also useful in the context of mixed-integer nonlinear programming models. More recently, variants have been developed⁵ that take into account the quality of the projection (i.e., also considering the original objective function) and discuss the theoretical properties of the method and its convergence guarantees.
⁵ Timo Berthold, Andrea Lodi, and Domenico Salvagnin (2018), Ten years of feasibility pump, and counting, EURO Journal on Computational Optimization.
11.7 Exercises
Problem 11.1: Preprocessing and primal heuristics
(a) Tightening bounds and redundant constraints. Considering the problem
max. 2x1 + x2 − x3
s.t.: 5x1 − 2x2 + 8x3 ≤ 15
      8x1 + 3x2 − x3 ≤ 9
      x1 + x2 + x3 ≤ 6
      0 ≤ x1 ≤ 3
      0 ≤ x2 ≤ 1
      x3 ≥ 1,
derive tightened bounds for variables x1 and x3 from the first constraint and eliminate redundant constraints after that.
(b) Relaxation-based heuristics. Consider the uncapacitated facility location (UFL) problem, formulated as
min. Σ_{j∈N} f_j y_j + Σ_{i∈M} Σ_{j∈N} c_ij x_ij (11.3)
s.t.: Σ_{j∈N} x_ij = 1, ∀i ∈ M, (11.4)
      Σ_{i∈M} x_ij ≤ m y_j, ∀j ∈ N, (11.5)
      x_ij ≥ 0, ∀i ∈ M, ∀j ∈ N, (11.6)
      y_j ∈ {0, 1}, ∀j ∈ N, (11.7)
where f_j is the cost of opening facility j, and c_ij is the cost of satisfying client i's demand from facility j. Consider an instance of the UFL with opening costs f = (21, 16, 30, 24, 11) and client costs
         ⎡  6   9   3   4  12 ⎤
         ⎢  1   2   4   9   2 ⎥
(c_ij) = ⎢ 15   2   6   3  18 ⎥
         ⎢  9  23   4   8   1 ⎥
         ⎢  7  11   2   5  14 ⎥
         ⎣  4   3  10  11   3 ⎦
In the UFL problem, the facility production capacities are assumed to be large, and there is
no budget constraint on how many facilities can be built. The problem thus has a feasible so-
lution if at least one facility is opened. We choose an initial feasible solution ȳ = (1, 0, 0, 0, 0).
Try to improve the solution by using relaxation-induced neighbourhood search (RINS) and
construct a feasible solution using relaxation-enforced neighbourhood search (RENS).
Our objective is to define optimal routes such that the total distance travelled is minimised. We assume that the total distance is a proxy for the operation cost in this case. The structural elements and parameters that define the problem are described below:
• V is the set of nodes, representing a depot (node 1) and the clients (nodes i ∈ N ); thus V = {1} ∪ N ;
• A is the set of arcs, with A = {(i, j) ∈ V × V : i ̸= j};
• C_ij is the cost of travelling via arc (i, j) ∈ A.
Formulate the problem and solve it using JuMP with the instance found in the notebook for session 11. After solving it, test different solver parameters and compare the time needed to solve the problem.
Part II
Nonlinear optimisation
Chapter 12
Introduction
In a general sense, these problems can be solved by employing the following strategy:
1. Analysing properties of functions under specific domains and deriving the conditions that must be satisfied for a point x to be a candidate optimal point.
2. Applying numerical methods that iteratively search for points satisfying these conditions.
This idea is central in several domains of knowledge and is very often described using area-specific nomenclature. Fields such as economics, engineering, statistics, machine learning and, perhaps more broadly, operations research, are intensive users and developers of optimisation theory and applications.
• domain represents business rules or constraints, representing logic relations, design or engi-
neering limitations, requirements, and such;
With these in mind, we can represent the decision problem as a mathematical programming model
of the form of (12.1) that can be solved using optimisation methods. From now on, we will refer to
this specific class of models as mathematical optimisation models, or optimisation models for short.
We will also use the term to solve the problem to refer to the task of finding optimal solutions to
optimisation models.
This course mostly focuses on the optimisation techniques employed to find optimal solutions to these models. As we will see, depending on the nature of the functions f and g used to formulate the model, some methods might be more or less appropriate. Further complicating the issue, for models of a given nature, there might be alternative algorithms that can be employed, with no general consensus on whether one method performs better than another.
In general terms, an optimisation model has the form
min. f (x)
s.t.: g_i(x) ≤ 0, i = 1, . . . , m
      h_i(x) = 0, i = 1, . . . , l
      x ∈ X, (12.1)
1. Unconstrained models: in these, the set X = Rn and m = l = 0. These are prominent in,
e.g., machine learning and statistics applications, where f represents a measure of model
fitness or prediction error.
2. Linear programming (LP): presumes a linear objective function f (x) = c⊤x and affine constraints g and h, i.e., of the form a_i⊤x − b_i, with a_i ∈ R^n and b_i ∈ R. Normally, X = {x ∈ R^n : x_j ≥ 0, j = 1, . . . , n} enforces that the decision variables are constrained to the nonnegative orthant.
3. Nonlinear programming (NLP): some or all of the functions f , g, and h are nonlinear.
4. Mixed-integer (linear) programming (MIP): consists of an LP in which some (or all) of the variables are constrained to be integer. In other words, X ⊆ R^k × Z^{n−k}. Very frequently, the integer variables are binary, i.e., x_i ∈ {0, 1} for i = 1, . . . , n − k, and are meant to represent true-or-false or yes-or-no conditions.
5. Mixed-integer nonlinear programming (MINLP): problems lying in the intersection of MIPs and NLPs.
Remark: notice that we use the vector notation c⊤x = Σ_{j∈J} c_j x_j, with J = {1, . . . , N}. This is just a convenience for keeping the notation compact.
max. Σ_{j∈J} c_j x_j (12.2)
s.t.: Σ_{j∈J} a_ij x_j ≤ b_i, ∀i ∈ I, (12.3)
      x_j ≥ 0, ∀j ∈ J. (12.4)
Equation (12.2) represents the objective function, in which we maximise the total return obtained from a given production plan. Constraint (12.3) quantifies the resource requirements for a given production plan and enforces that such requirements do not exceed the resource availability. Finally, constraint (12.4) defines the domain of the decision variables.
Notice that, as posed, the resource allocation problem is linear. This is perhaps the most basic, and also the most widespread, setting for optimisation models, for which very reliable and mature technology is available. In this course, we will concentrate on methods that can solve variants of this model in which the objective function and/or the constraints are required to include nonlinear terms.
One classic variant of resource allocation that includes nonlinear terms is the portfolio optimisation problem. In this problem, we assume that a collection of assets j ∈ J = {1, . . . , N} is available for investment. In this case, capital is the single (actual) resource to be considered. Each asset has a random return R_j, with expected value E[R_j] = µ_j. Also, the covariance between two assets i, j ∈ J is given by σ_ij = E[(R_i − µ_i)(R_j − µ_j)], and these entries can be collected in the covariance matrix
    ⎡ σ_11 · · · σ_1N ⎤
Σ = ⎢   ⋮    ⋱    ⋮  ⎥
    ⎣ σ_N1 · · · σ_NN ⎦
Markowitz (1952) proposed using x⊤ Σx as a risk measure that captures the variability in the return
of the assets. Given the above, the optimisation model that provides the investment portfolio with
the least risk, given a minimum requirement ϵ in terms of expected returns is given by
min. x⊤Σx (12.5)
s.t.: µ⊤x ≥ ϵ, (12.6)
      0 ≤ x_j ≤ 1, ∀j ∈ J. (12.7)
Objective function (12.5) represents the portfolio risk to be minimised, while constraint (12.6) enforces that the expected return must be at least ϵ. Notice that ϵ can be seen as a resource that has to be (at least) completely depleted, if one wants to draw a parallel with the resource allocation structure discussed earlier. Constraint (12.7) defines the domain of the decision variables. Notice how the problem is posed in a scaled form, where x_j ∈ [0, 1] represents a percentage of a hypothetical available capital for investment.
In this example, the problem is nonlinear due to the quadratic nature of the objective function x⊤Σx = Σ_{i,j∈J} σ_ij x_i x_j. As we will see later on, there are efficient methods that can be employed to solve quadratic problems like this.
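To make this concrete, a minimal JuMP sketch of model (12.5)–(12.7) follows; the data µ, Σ and ϵ are hypothetical, included only to make the example self-contained:

using JuMP, Ipopt

μ = [0.05, 0.08, 0.12]              # expected returns (hypothetical)
Σ = [0.10 0.02 0.01;
     0.02 0.15 0.03;
     0.01 0.03 0.25]                # covariance matrix (hypothetical)
ϵ = 0.07                            # minimum required expected return

model = Model(Ipopt.Optimizer)
@variable(model, 0 <= x[1:3] <= 1)  # portfolio weights, constraint (12.7)
@constraint(model, μ' * x >= ϵ)     # expected return requirement (12.6)
@objective(model, Min, x' * Σ * x)  # portfolio risk (12.5)
optimize!(model)
value.(x)                           # optimal portfolio weights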
Robust optimisation is a subarea of mathematical programming concerned with models that support decision-making under uncertainty. Specifically, the idea is to devise a formulation mechanism that can guarantee the feasibility of the optimal solution in the face of variability, ultimately taking a risk-averse standpoint.
Consider the resource allocation problem from Section 12.2.1. Now, suppose that the parameters ã_i ∈ R^N associated with a given constraint i ∈ I = {1, . . . , M} are uncertain, with an unknown probability distribution. The resource allocation problem can then be formulated as
max. c⊤x
s.t.: ã_i⊤x ≤ b_i, ∀i ∈ I
      x_j ≥ 0, ∀j ∈ J.
Let us assume that the only information available is a set of observations â_i, from which we can estimate a nominal value ā_i. This is illustrated in Figure 12.1, in which 100 random observations are generated for ã_i = (ã_i1, ã_i2), with ã_i1 ∼ Normal(10, 2) and ã_i2 ∼ Normal(5, 3), for a single constraint i ∈ I. The nominal values are assumed to have coordinates given by the averages used in the Normal distributions. Our objective is to develop a model that incorporates a given level of protection in
Figure 12.1: One hundred random realisations for ã_i.
terms of feasibility guarantees. That is, we would like to develop a model that provides solutions that are guaranteed to remain feasible if the realisation of ã_i falls within an uncertainty set ϵ_i of size controlled by the parameter Γ_i. The idea is that the bigger the uncertainty set ϵ_i, the more robust the solution, which typically comes at the expense of accepting solutions with worse expected performance.
The tractability of robust optimisation models depends on the geometry of the uncertainty set. In what follows, we consider ellipsoidal uncertainty sets of the form
ϵ_i = { ā_i + P_i⊤u : ||u||_2 ≤ Γ_i }, (12.10)
and rely on the representation in (12.10).
We can now formulate the robust counterpart, which consists of a risk-averse version of the original resource allocation problem. In it, we try to anticipate the worst possible outcome and make decisions that are both optimal and guaranteed to remain feasible in this worst-case sense. This standpoint translates into the following optimisation model:
max. c⊤x
s.t.: max_{a_i∈ϵ_i} a_i⊤x ≤ b_i, ∀i ∈ I (12.11)
      x_j ≥ 0, ∀j ∈ J.
Notice how constraint (12.11) has an embedded optimisation problem, turning this into a bi-level optimisation problem. This highlights the issue associated with tractability, since solving the whole problem strongly depends on deriving tractable equivalent reformulations.
Assuming that the uncertainty set ϵ_i is an ellipsoid, the following result holds:
max_{a_i∈ϵ_i} a_i⊤x = ā_i⊤x + max_u { u⊤P_i x : ||u||_2 ≤ Γ_i } (12.12)
             = ā_i⊤x + Γ_i ||P_i x||_2. (12.13)
In (12.12), we recast the inner maximisation problem in terms of the ellipsoidal uncertainty set, i.e., in terms of the variable u. Since the only constraint is ||u||_2 ≤ Γ_i, in (12.13) we can derive a closed form for the inner optimisation problem.
With the closed form derived in (12.13), we can reformulate the original bi-level problem as a tractable single-level problem of the following form:
max. c⊤x
s.t.: ā_i⊤x + Γ_i ||P_i x||_2 ≤ b_i, ∀i ∈ I (12.14)
      x_j ≥ 0, ∀j ∈ J.
Notice how the term Γ_i ||P_i x||_2 creates a buffer for constraint (12.14), ultimately preventing the complete depletion of the resource. Clearly, this will lead to a suboptimal solution when compared to that of the original deterministic problem, at the expense of providing protection against deviations in the coefficients a_i. This difference is often referred to as the price of robustness.
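As an illustration, the single-constraint robust counterpart (12.14) can be written using JuMP's second-order cone support; the data below (c, b, ā, P, Γ) are hypothetical:

using JuMP, ECOS

c = [2.0, 3.0]; b = 15.0
ā = [10.0, 5.0]                          # nominal coefficients (hypothetical)
P = [2.0 0.0; 0.0 3.0]                   # ellipsoid shape matrix (hypothetical)
Γ = 1.0                                  # protection level

model = Model(ECOS.Optimizer)
@variable(model, x[1:2] >= 0)
@variable(model, t >= 0)                 # auxiliary variable: t ≥ ||P x||₂
@constraint(model, [t; P * x] in SecondOrderCone())
@constraint(model, ā' * x + Γ * t <= b)  # robust constraint (12.14)
@objective(model, Max, c' * x)
optimize!(model)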
In Figure 12.2, we show the ellipsoidal sets for two levels of Γ_i for a single constraint i. We define
ϵ_i = { (10, 5)⊤ + [2 0; 0 3] u : ||u||_2 ≤ Γ_i }, (12.15)
using the averages and standard deviations of the original distributions that generated the observations. We plot the ellipsoids for Γ_1 = 1 and Γ_2 = 1.5, illustrating how the protection level increases as Γ increases. This can be inferred since the larger uncertainty set covers more of the observations, and the formulation is such that feasibility is guaranteed for any observation within the uncertainty set.
Figure 12.2: One hundred random realisations for ã_i, along with the nominal value and the ellipsoidal uncertainty sets for Γ_1 = 1 and Γ_2 = 1.5.
This is an example in which the resource allocation structure within the optimisation model is not as obvious. Suppose we are given a data set D ⊂ R^n with |D| = N + M that can be divided into two disjoint sets I− = {x_1, . . . , x_N} and I+ = {x_{N+1}, . . . , x_{N+M}}.
Each element in D is an observation of a given set of n features, with values represented by a vector x ∈ R^n, that has been classified as belonging to either I− or I+. Because of the availability of labelled data, classification is said to be an example of supervised learning in the field of machine learning.
Figure 12.3 illustrates this situation for n = 2, in which the orange dots represent points classified
as belonging to I − (negative observations) and the blue dots represent points classified as belonging
to I + (positive observations).
Our task is to obtain a function f : R^n ↦ R, from a given family of functions, that is capable of classifying, for a given observed set of features x̂, whether it belongs to I− or I+. In other words, we want to calibrate f such that
f (x̂) < 0, if x̂ ∈ I−, and f (x̂) ≥ 0, if x̂ ∈ I+.
This function would then act as a classifier that could be employed to any new observation x̂ made. If f is presumed to be an affine function of the form f (x) = a⊤x − b, then we obtain a linear classifier.
Our objective is to obtain a ∈ R^n and b ∈ R such that the misclassification error is minimised. Let us
Figure 12.3: Two hundred observations for x_i classified as belonging to I− (orange) or I+ (blue).
Using this error measure, we can define constraints that capture the deviation in each measure by means of nonnegative slack variables. Let u_i ≥ 0, for i = 1, . . . , N, and v_i ≥ 0, for i = 1, . . . , M, be slack variables that measure the misclassification error for x_i ∈ I− and x_i ∈ I+, respectively.
The optimisation problem that finds optimal parameters a and b can be stated as
min. Σ_{i=1}^{N} u_i + Σ_{i=1}^{M} v_i (12.17)
s.t.: a⊤x_i − b − u_i ≤ 0, for x_i ∈ I−, i = 1, . . . , N (12.18)
      a⊤x_i − b + v_i ≥ 0, for x_i ∈ I+, i = 1, . . . , M (12.19)
      ||a||_2 = 1 (12.20)
      u_i ≥ 0, i = 1, . . . , N (12.21)
      v_i ≥ 0, i = 1, . . . , M (12.22)
      a ∈ R^n, b ∈ R. (12.23)
The objective function (12.17) accumulates the total misclassification error. Constraint (12.18) allows for capturing the misclassification error for each x_i ∈ I−. Notice that u_i = max{0, a⊤x_i − b} = e−(x_i ∈ I−; a, b). Likewise, constraint (12.19) guarantees that v_i = e+(x_i ∈ I+; a, b). To avoid
trivial solutions in which (a, b) = (0, 0), the normalisation constraint ||a||2 = 1 is imposed in
constraint (12.20), which turns the model nonlinear.
Solving the model (12.17)–(12.23) provides optimal (a, b) which translates into the classifier rep-
resented as the green line in Figure 12.4.
Figure 12.4: Two hundred observations for x_i classified as belonging to I− (orange) or I+ (blue), with a classifier (green).
Now consider error measures that also penalise correctly classified points lying too close to the classifier, namely inside the slab S = {x ∈ R^n : −1 ≤ a⊤x − b ≤ 1}:
e−(x_i ∈ I−; a, b) := 0, if a⊤x_i − b ≤ −1; |a⊤x_i − b|, if a⊤x_i − b > −1.
e+(x_i ∈ I+; a, b) := 0, if a⊤x_i − b ≥ 1; |b − a⊤x_i|, if a⊤x_i − b < 1.
By doing so, a penalty is applied not only to those points that were misclassified, but also to those correctly classified points that happen to be inside the slab S. To define an optimal robust classifier, one must trade off the size of the slab, which is inversely proportional to ||a||, and the number of observations that fall within the slab S. The formulation for the robust classifier then becomes
min. Σ_{i=1}^{N} u_i + Σ_{i=1}^{M} v_i + γ||a||_2² (12.24)
s.t.: a⊤x_i − b − u_i ≤ −1, for x_i ∈ I−, i = 1, . . . , N (12.25)
      a⊤x_i − b + v_i ≥ 1, for x_i ∈ I+, i = 1, . . . , M (12.26)
      u_i ≥ 0, i = 1, . . . , N (12.27)
      v_i ≥ 0, i = 1, . . . , M (12.28)
      a ∈ R^n, b ∈ R. (12.29)
Figure 12.5: Two hundred observations for x_i classified as belonging to I− (orange) or I+ (blue), with the original classifier and the robust classifier obtained with γ = 0.1.
Remark: robust classifiers are known in the machine learning literature as support vector ma-
chines, where the support vectors are the observations that support the slab.
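As an illustration, a minimal JuMP sketch of the robust classifier (12.24)–(12.29); the data matrices here are randomly generated and purely hypothetical:

using JuMP, Ipopt

X_neg = randn(100, 2) .- 2.0     # hypothetical observations in I⁻ (one per row)
X_pos = randn(100, 2) .+ 2.0     # hypothetical observations in I⁺
γ = 0.1
N, M, n = size(X_neg, 1), size(X_pos, 1), size(X_neg, 2)

model = Model(Ipopt.Optimizer)
@variable(model, a[1:n])
@variable(model, b)
@variable(model, u[1:N] >= 0)    # slacks for I⁻, constraint (12.27)
@variable(model, v[1:M] >= 0)    # slacks for I⁺, constraint (12.28)
@constraint(model, [i = 1:N], X_neg[i, :]' * a - b - u[i] <= -1)  # (12.25)
@constraint(model, [i = 1:M], X_pos[i, :]' * a - b + v[i] >= 1)   # (12.26)
@objective(model, Min, sum(u) + sum(v) + γ * sum(a .^ 2))         # (12.24)
optimize!(model)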
Chapter 13
Convex sets
... in fact, the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity. (R. Tyrrell Rockafellar)
The importance of convexity will become clear later in the course. In a nutshell, the presence of convexity allows us to infer global properties of a solution (i.e., properties that hold on the whole domain) by considering exclusively local information (such as gradients, for example). This is critical in the context of optimisation, since most of the methods known to perform well in practice are designed to find solutions that satisfy local optimality conditions. Once convexity is attested, one can then guarantee that these local solutions are in fact globally optimal without exhaustively exploring the solution space.
For a problem of the form
(P ) : min. f (x)
s.t.: x ∈ X
to be convex, we need to verify whether f is a convex function and X is a convex set. If both statements hold true, we can conclude that P is a convex problem. We start by looking into how to identify convex sets, since we can use the convexity of sets to infer the convexity of functions.
• An affine combination is a linear combination with the additional constraint that Σ_{j=1}^{k} λ_j = 1. That is,
{ x ∈ R^n : x = Σ_{j=1}^{k} λ_j x_j, Σ_{j=1}^{k} λ_j = 1, λ_j ∈ R for j = 1, . . . , k }. (13.2)
• A conic combination is a linear combination with the additional condition that λ_j ≥ 0 for j = 1, . . . , k. That is,
{ x ∈ R^n : x = Σ_{j=1}^{k} λ_j x_j, λ_j ≥ 0 for j = 1, . . . , k }. (13.3)
• And finally, a convex combination combines the conditions for affine and conic combinations, implying that λ_j ∈ [0, 1]. That is,
{ x ∈ R^n : x = Σ_{j=1}^{k} λ_j x_j, Σ_{j=1}^{k} λ_j = 1, λ_j ≥ 0 for j = 1, . . . , k }. (13.4)
We say that a set is convex if it contains all points formed by convex combinations of any pair of points in this set. This is equivalent to saying that the set contains the line segment between any two points belonging to the set.
Definition 13.1 (Convex sets). A set S ⊆ R^n is said to be convex if x = Σ_{j=1}^{k} λ_j x_j belongs to S, where Σ_{j=1}^{k} λ_j = 1, λ_j ≥ 0, and x_j ∈ S for j = 1, . . . , k.
Definition 13.1 is useful as it allows for showing that some set operations preserve convexity.
1. Intersection: S = S1 ∩ S2;
Figures 13.1 and 13.2 illustrate the concept behind some of these set operations. Showing that the sets resulting from the operations in Lemma 13.2 are convex typically entails showing that convex combinations of elements in the resulting set S also belong to S, which follows from the convexity of S1 and S2.
Figures 13.1 and 13.2: illustrations of set operations that preserve convexity (the sum S = S1 + S2 and the intersection S1 ∩ S2).
• The empty set ∅, any singleton {x}, and the whole space R^n;
• halfspaces: S = {x : p⊤x ≤ α} ⊂ R^n;
• hyperplanes: H = {x : p⊤x = α} ⊂ R^n, where p ̸= 0 is a normal vector and α ∈ R is a scalar. Notice that H can be equivalently represented as H = {x ∈ R^n : p⊤(x − x̄) = 0} for x̄ ∈ H.
Hyperplanes and halfspaces will play a central role in the developments we will see in this course. Therefore, let us take a moment to discuss some important aspects related to these convex sets.
First, notice that, geometrically, a hyperplane H ⊂ R^n can be interpreted as the set of points with a constant inner product with a given vector p ∈ R^n, while x̄ determines the offset of the hyperplane from the origin. That is,
H = {x : p⊤(x − x̄) = 0} ≡ x̄ + p^⊥,
where p^⊥ is the orthogonal complement of p, i.e., the set of vectors orthogonal to p, which is given by {x ∈ R^n : p⊤x = 0}.
Figure 13.3: A hyperplane H = {x ∈ R^n : p⊤(x − x̄) = 0} with normal vector p, displaced to x̄.
Analogously, a halfspace can be represented as S = {x ∈ R^n : p⊤(x − x̄) ≤ 0}, where p⊤x̄ = α defines the hyperplane that forms the boundary of the halfspace. This definition suggests a simple geometrical interpretation: the halfspace S consists of x̄ plus any vector forming an obtuse or right angle (i.e., greater than or equal to 90°) with the outward normal vector p.
Figure 13.4: A halfspace S = {x ∈ R^n : p⊤(x − x̄) ≤ 0} defined by the same hyperplane H. Notice how the vector p (or its translation to x̄, which is fundamentally the same vector) and the vectors x − x̄ form angles greater than or equal to 90°.
From Definition 13.3, one can show that the convex hull conv(S) can also be defined as the intersection of all convex sets containing S. Perhaps the easiest way to visualise this is to think of the infinitely many halfspaces containing S: their intersection yields precisely conv(S). Figure 13.6 illustrates the convex hull conv(S) of a nonconvex set S.
Figure 13.6: Example of an arbitrary set S (in solid blue) and its convex hull conv(S) (combined
blue and grey areas).
The notion of convex hulls is a powerful tool in optimisation. One important application is using conv(S) to obtain approximations of a nonconvex set S that can be exploited to solve an optimisation problem whose constraint set is defined by S. This is the underpinning technique in many important optimisation methods, such as branch-and-bound-based methods for nonconvex problems and decomposition methods (i.e., methods that solve large problems by breaking them into smaller parts that are presumably easier to solve).
Specifically, let us consider the convex hull of a finite collection of discrete points. Some of these sets are so important in optimisation that they have their own names.
Definition 13.4. Let S = {x1 , . . . , xn+1 } ⊂ Rn . Then conv(S) is called a polytope. If x1 , . . . , xn+1
are affinely independent (i.e., x2 − x1 , . . . , xn+1 − x1 are linearly independent) then conv(S) is
called a simplex with vertices x1 , . . . , xn+1 .
If S is the same as its own interior, then we say that S is open. Some authors say that S is solid if it has a nonempty interior (that is, int(S) ̸= ∅). Notice that the interior of S is a subset of S, that is, int(S) ⊆ S.
We say that S is bounded if there exists a neighbourhood Nϵ(x), with x ∈ R^n and ϵ > 0, such that S ⊂ Nϵ(x).
We say that a set is compact if it is both closed and bounded. Compact sets appear very frequently in real-world applications of optimisation, since typically one can assume the existence of bounds for decision variables (such as nonnegativity, maximum physical bounds or, in an extreme case, the smallest/largest computational constants). Another frequent example of a bounded set is the convex hull of a finite collection of discrete points, which some authors call a polytope (effectively a bounded polyhedral set).
Let us consider the following example. Let S = {(x1, x2) ∈ R² : x1² + x2² ≤ 1}. Then, we have that:
1. clo(S) = {(x1, x2) ∈ R² : x1² + x2² ≤ 1}. Since S = clo(S), S is closed.
2. int(S) = {(x1, x2) ∈ R² : x1² + x2² < 1}.
3. bou(S) = {(x1, x2) ∈ R² : x1² + x2² = 1}. Notice that S is also bounded.
Notice that, if S is closed, then bou(S) ⊂ S. That is, its boundary is part of the set itself. Moreover, it can be shown that clo(S) = bou(S) ∪ S is the smallest closed set containing S.
In case S is convex, one can infer the convexity of the interior int(S) and its closure clo(S). The
following theorem summarises this result.
Theorem 13.5. Let S ⊆ Rn be a convex set with int(S) ̸= ∅. Let x1 ∈ clo(S) and x2 ∈ int(S).
Then x = λx1 + (1 − λ)x2 ∈ int(S) for all λ ∈ (0, 1).
Theorem 13.5 is useful for inferring the convexity of the elements related to S. We summarise the
key results in the following corollary.
1. int(S) is convex;
2. clo(S) is convex;
3. clo(int(S)) = clo(S);
4. int(clo(S)) = int(S).
Let
(P ) : z = min. {f (x) : x ∈ S}
be our optimisation problem. If an optimal solution x* exists, then f (x*) ≤ f (x) for all x ∈ S and z = f (x*) = min {f (x) : x ∈ S}.
Notice the difference between min. (an abbreviation for minimise) and the operator min. The former represents the problem of minimising the function f over the domain S, while min is shorthand for minimum, in this case z, assuming that it is attainable.
It might be that an optimal solution is not attainable, but a bound can still be obtained for the optimal solution value. The greatest lower bound for z is its infimum (for maximisation problems, the least upper bound is the supremum), denoted by inf. That is, if z = inf {f (x) : x ∈ S}, then z ≤ f (x) for all x ∈ S and there is no z̄ > z such that z̄ ≤ f (x) for all x ∈ S. We might sometimes use the notation
(P ) : z = inf {f (x) : x ∈ S}
to represent optimisation problems for which one cannot be sure whether an optimal solution is attainable. The Weierstrass theorem describes the situations in which those minima (or maxima) are guaranteed to be attained, which is the case whenever S is compact.
Theorem 13.7 (Weierstrass theorem). Let S ̸= ∅ be a compact set, and let f : S → R be continuous on S. Then there exists an optimal solution x̄ ∈ S to min. {f (x) : x ∈ S}.
Figure 13.7 illustrates three examples. In the first (on the left), the domain [a, b] is compact, and thus the minimum of f is attained at b. In the other two, the domain [a, b) is not closed (and hence not compact), and therefore the Weierstrass theorem does not apply. In the middle example, one can still obtain inf f , which is not the case for the last example, on the right.
Figure 13.7: Examples of attainable minimum (left) and infimum (centre) and an example where
neither are attainable (right).
Theorem 13.8 (Closest-point theorem). Let S ̸= ∅ be a closed convex set in R^n and y ∉ S. Then, there exists a unique point x̄ ∈ S with minimum distance from y. In addition, x̄ is the minimising point if and only if
(y − x̄)⊤(x − x̄) ≤ 0, for all x ∈ S.
Simply put, if S is a closed convex set, then x̄ ∈ S will be the closest point to y ∉ S if the vector y − x̄ forms an angle greater than or equal to 90° with all other vectors x − x̄, for x ∈ S. Figure 13.8 illustrates this logic.
Figure 13.8: The closest-point theorem for a closed convex set (on the left). On the right, an illustration of how the absence of convexity invalidates the result.
Notice that, for any x, x̄ ∈ H, we have
p⊤(x − x̄) = p⊤x − p⊤x̄ = α − α = 0.
From that, the halfspaces defined by H can be equivalently stated as H+ = {x : p⊤(x − x̄) ≥ 0} and H− = {x : p⊤(x − x̄) ≤ 0}.
We can now define the separation of convex sets.
Definition 13.9. Let S1 and S2 be nonempty sets in R^n. The hyperplane H = {x : p⊤x = α} is said to separate S1 and S2 if p⊤x ≥ α for each x ∈ S1 and p⊤x ≤ α for each x ∈ S2. In addition, the following apply:
1. Proper separation: S1 ∪ S2 ̸⊂ H;
2. Strict separation: p⊤x > α for each x ∈ S1 and p⊤x < α for each x ∈ S2;
3. Strong separation: p⊤x ≥ α + ϵ for some ϵ > 0 and each x ∈ S1, and p⊤x ≤ α for each x ∈ S2.
Figure 13.10 illustrates the three types of separation in Definition 13.9. On the left, proper separa-
tion is illustrated, which is obtained by any hyperplane that does not contain both S1 and S2 , but
that might contain points from either or both. In the middle, sets S1 and S2 belong to two distinct
half-spaces in a strict sense. On the right, strict separation holds with an additional margin ϵ > 0,
which is defined as strong separation.
A powerful yet simple result that we will use later is that, for a closed convex set S, there always exists a hyperplane separating S from a point y that does not belong to S.
Theorem 13.10 (Separation theorem). Let S ̸= ∅ be a closed convex set in R^n and y ∉ S. Then, there exists a nonzero vector p ∈ R^n and α ∈ R such that p⊤x ≤ α for each x ∈ S and p⊤y > α.
Proof. Theorem 13.8 guarantees the existence of a unique minimising x̄ ∈ S such that (y − x̄)⊤(x − x̄) ≤ 0 for each x ∈ S. Let p = (y − x̄) ̸= 0 and α = x̄⊤(y − x̄) = p⊤x̄. Then we get p⊤x ≤ α for each x ∈ S, while p⊤y − α = (y − x̄)⊤(y − x̄) = ||y − x̄||² > 0.
This is the first proof we look at in these notes, and the reason for that is its importance in many of the results we will discuss further. The proof first poses the problem of finding a minimum-distance point as an optimisation problem and uses the Weierstrass theorem (our Theorem 13.8 is a consequence of the Weierstrass theorem stated in Theorem 13.7) to guarantee that such an x̄ exists. Being a minimum-distance point, we know from Theorem 13.8 that (y − x̄)⊤(x − x̄) ≤ 0 holds. Now, defining p and α as in the proof, one might notice that the inequality p⊤x ≤ α follows. The inequality p⊤y > α is demonstrated to hold in the final part by noticing that
p⊤y − α = (y − x̄)⊤y − x̄⊤(y − x̄)
        = y⊤(y − x̄) − x̄⊤(y − x̄)
        = (y − x̄)⊤(y − x̄) = ||y − x̄||² > 0.
Theorem 13.10 has interesting consequences. For example, one can apply it to every point in the boundary bou(S) to show that a closed convex set S is formed by the intersection of all halfspaces containing it. Another interesting result is the existence of strong separation: if y ∉ clo(conv(S)), then one can show that strong separation between y and S exists, since there will surely be a distance ϵ > 0 between y and S.
Theorem 13.11 (Farkas' theorem). Let A be an m × n matrix and c be an n-vector. Then exactly one of the following two systems has a solution:
(1) : Ax ≤ 0, c⊤x > 0, x ∈ R^n;
(2) : A⊤y = c, y ≥ 0, y ∈ R^m.
Proof. Suppose (2) has a solution. Let x be such that Ax ≤ 0. Then c⊤x = (A⊤y)⊤x = y⊤Ax ≤ 0. Hence, (1) has no solution.
Next, suppose (2) has no solution. Let S = {x ∈ R^n : x = A⊤y, y ≥ 0}. Notice that S is closed and convex and that c ∉ S. By Theorem 13.10, there exists p ∈ R^n and α ∈ R such that p⊤c > α and p⊤x ≤ α for x ∈ S.
As 0 ∈ S, α ≥ 0 and p⊤c > 0. Also, α ≥ p⊤A⊤y = y⊤Ap for all y ≥ 0. This implies that Ap ≤ 0, and thus p satisfies (1).
The first part of the proof shows that, if we assume that system (2) has a solution, then c⊤x > 0 cannot hold whenever Ax ≤ 0. The second part uses the separation theorem (Theorem 13.10) to show that c can be seen as a point not belonging to the closed convex set S, for which there exists a separating hyperplane, and that the existence of such a hyperplane implies that system (1) must hold. The set S is closed and convex since it is formed by the conic combinations of the rows a_i, for i = 1, . . . , m. Using the fact that 0 ∈ S, one can show that α ≥ 0. The last part uses the identity p⊤A⊤ = (Ap)⊤ and the fact that (Ap)⊤y = y⊤Ap. Notice that, since y can be arbitrarily large and α is a constant, y⊤Ap ≤ α can only hold if y⊤Ap ≤ 0 which, as y ≥ 0 is arbitrary, requires that Ap ≤ 0.
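As a small numerical illustration (with hypothetical data A and c), one can check which of the two systems holds by testing the feasibility of system (2) as a linear program in JuMP:

using JuMP, HiGHS

A = [1.0 0.0; 0.0 1.0; -1.0 -1.0]   # hypothetical 3 × 2 matrix
c = [1.0, 2.0]                      # hypothetical vector

m = Model(HiGHS.Optimizer)
@variable(m, y[1:3] >= 0)
@constraint(m, A' * y .== c)        # system (2): A⊤y = c, y ≥ 0
optimize!(m)
# OPTIMAL (feasible) means system (2) has a solution; INFEASIBLE means (1) does
println(termination_status(m))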
Farkas' theorem has an interesting geometrical interpretation arising from this proof, as illustrated in Figure 13.11. Consider the cone C formed by the rows of A:
C = { c ∈ R^n : c_j = Σ_{i=1}^{m} a_ij y_i, j = 1, . . . , n, y_i ≥ 0, i = 1, . . . , m }.
The polar cone of C, denoted C⁰, is formed by all vectors forming angles of 90° or more with the vectors in C. That is,
C⁰ = {x : Ax ≤ 0}.
Notice that (1) has a solution if the intersection between the polar cone C⁰ and the open halfspace H+ = {x ∈ R^n : c⊤x > 0} (with H+ as defined earlier) is not empty. If (2) has a solution, as in the beginning of the proof, then c ∈ C and the intersection C⁰ ∩ H+ = ∅. Now, if (2) does not have a solution, that is, c ∉ C, then one can see that C⁰ ∩ H+ cannot be empty, meaning that (1) has a solution.
Figure 13.11: Geometrical illustration of Farkas' theorem. On the left, system (2) has a solution; on the right, system (1) has a solution.
A hyperplane H = {x : p⊤(x − x̄) = 0} is said to support S at a boundary point x̄ ∈ bou(S) if either S ⊆ H+ (that is, p⊤(x − x̄) ≥ 0 for x ∈ S) or S ⊆ H−.
Figure 13.12 illustrates the concept of supporting hyperplanes. Notice that supporting hyperplanes
might not be unique, with the geometry of the set S playing an important role in that matter.
Let us define the function f (x) = p⊤x, with x ∈ S. One can see that the optimal solution x̄ given by
x̄ = argmax {f (x) : x ∈ S}
is a boundary point at which the hyperplane with normal vector p supports S.
Figure 13.12: Supporting hyperplanes for an arbitrary set. Notice how a single point might have multiple supporting hyperplanes (middle) and how different points might have the same supporting hyperplane (right).
The proof follows immediately from Theorem 13.10, without explicitly considering a point y ∉ S, and by noticing that bou(S) ⊂ clo(S). Figure 13.13 provides an illustration of the theorem.
Figure 13.13: Supporting hyperplanes for convex sets. Notice how every boundary point has at
least one supporting hyperplane
Chapter 14
Convex functions
(P ) : min. f (x)
s.t.: g(x) ≤ 0
      x ∈ X
Very often, we use the term convex to loosely refer to concave functions as well, which must be done with caution. In fact, if f is convex, then −f is concave, and we say that (P ) is a convex problem even if f is concave and we seek to maximise f instead. Also, linear functions are both convex and concave.
We say that a convex function is strictly convex if the inequality in Definition 14.1 holds strictly for each λ ∈ (0, 1) (notice the open interval). In practice, it means that the function is guaranteed not to present flatness around its minimum (or maximum, for concave functions).
1. f (x) = a⊤x + b;
2. f (x) = e^x;
Knowing that these common functions are convex is helpful for identifying convexity in more com-
plex functions formed by composition. By knowing that an operation between functions preserves
convexity, we can infer the convexity of more complicated functions. The following are convexity
preserving operations.
Figure 14.1 illustrates the lower level sets of two functions. The lower level set Sα can be seen as the projection of the function's image onto the domain for a given level α.
Figure 14.1: The lower level sets Sα (in blue) of two functions, given a value of α. Notice the nonconvexity of the level set of the nonconvex function (on the right).
Notice that, for convex functions, the lower level set presents no such discontinuities, making Sα convex. Lemma 14.3 states this property.
Figure 14.2: The epigraph epi(f ) of a convex function is a convex set (in grey, on the left).
Lemma 14.3. Let S ⊆ R^n be a nonempty convex set and f : S ↦ R a convex function. Then, any level set Sα, with α ∈ R, is convex.
Proof. Let x1, x2 ∈ Sα. Thus, x1, x2 ∈ S with f (x1) ≤ α and f (x2) ≤ α. Let λ ∈ (0, 1) and x = λx1 + (1 − λ)x2. Since S is convex, we have that x ∈ S. Now, by the convexity of f , we have
f (x) ≤ λf (x1) + (1 − λ)f (x2) ≤ λα + (1 − λ)α = α,
and thus x ∈ Sα.
Remark: notice that a convex lower level set does not necessarily mean that the function is
convex. In fact, as we will see later, there are nonconvex functions that have convex level sets (the
so-called quasiconvex functions).
Figure 14.2 illustrates the epigraphs of two functions. Notice that the second function (on the
right) is not convex, and nor is its epigraph. In fact, we can use the convexity of epigraphs (and
the technical results associated with the convexity of sets) to show the convexity of functions.
Theorem 14.5 (Convex epigraphs). Let S ⊆ R^n be a nonempty convex set and f : S ↦ R. Then f is convex if and only if epi(f ) is a convex set.
Proof. First, suppose f is convex and let (x1, y1), (x2, y2) ∈ epi(f ) and λ ∈ (0, 1). Then
λy1 + (1 − λ)y2 ≥ λf (x1) + (1 − λ)f (x2) ≥ f (λx1 + (1 − λ)x2),
and thus (λx1 + (1 − λ)x2, λy1 + (1 − λ)y2) ∈ epi(f ).
The proof starts with the implication “if f is convex, then epi(f ) is convex”. For that, it assumes that f is convex and uses the convexity of f to show that any convex combination of points in epi(f ) will also be in epi(f ), which is the definition of a convex set.
To prove the implication “if epi(f ) is convex, then f is convex”, we define a convex combination of points in epi(f ) and use the definition of epi(f ) to show that f is convex, by setting y = λf (x1) + (1 − λ)f (x2) and x = λx1 + (1 − λ)x2.
Inequality (14.1) is called the subgradient inequality and is going to be useful in several contexts later in this course. The set of subgradients ξ of f at x̄ is the subdifferential
∂f (x̄) = {ξ ∈ R^n : f (x) ≥ f (x̄) + ξ⊤(x − x̄), ∀x ∈ S}.
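As a simple illustration (an example added here for concreteness), consider f (x) = |x| on R. At the kink x̄ = 0, the subgradient inequality |x| ≥ ξx must hold for all x ∈ R, giving
∂f (0) = {ξ ∈ R : |x| ≥ ξx, ∀x ∈ R} = [−1, 1],
while at any x̄ ≠ 0 the function is differentiable and ∂f (x̄) = {sign(x̄)}.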
Every convex function f : S ↦ R has at least one subgradient at any point x̄ in the interior of the convex set S. Requiring that x̄ ∈ int(S) allows us to disregard boundary points of S, where ∂f (x̄) might be empty. Theorem 14.7 presents this result.
Theorem 14.7. Let S ⊆ R^n be a nonempty convex set and f : S ↦ R a convex function. Then, for all x̄ ∈ int(S), there exists ξ ∈ R^n such that
H = {(x, y) : y = f (x̄) + ξ⊤(x − x̄)}
supports epi(f ) at (x̄, f (x̄)). In particular,
f (x) ≥ f (x̄) + ξ⊤(x − x̄), ∀x ∈ S.
The proof consists of directly applying Theorem 14.5 and then using the supporting hyperplane theorem for convex sets from Chapter 13 to show that the subgradient inequality holds.
Notice that this definition is based on the existence of a first-order (Taylor series) expansion with an error term α. The definition is useful as it highlights the requirement that ∇f (x̄) exists and is unique at x̄, since the gradient is given by ∇f (x̄) = [∂f (x̄)/∂x_i]_{i=1,...,n}.
If f is differentiable in S, then its subdifferential ∂f (x) is a singleton (a set with a single element) for all x ∈ S. This is shown in Lemma 14.9.
Lemma 14.9. Let S ⊆ R^n be a nonempty convex set and f : S ↦ R a convex function. Suppose that f is differentiable at x̄ ∈ int(S). Then ∂f (x̄) = {∇f (x̄)}, i.e., the subdifferential ∂f (x̄) is a singleton with ∇f (x̄) as its unique element.
Proof. From Theorem 14.7, ∂f (x̄) ̸= ∅. Moreover, combining the existence of a subgradient ξ and the differentiability of f at x̄, we obtain, for a direction d and λ > 0,
f (x̄ + λd) ≥ f (x̄) + λξ⊤d (14.2)
f (x̄ + λd) = f (x̄) + λ∇f (x̄)⊤d + λ||d||α(x̄; λd). (14.3)
Subtracting (14.3) from (14.2), we get 0 ≥ λ(ξ − ∇f (x̄))⊤d − λ||d||α(x̄; λd). Dividing by λ > 0 and letting λ → 0+, we obtain (ξ − ∇f (x̄))⊤d ≤ 0. Now, by setting d = ξ − ∇f (x̄), it becomes clear that ξ = ∇f (x̄).
Notice that in the proof we use x̄ + λd to indicate a point displaced from x̄ in the direction d, scaled by λ > 0. The fact that ∂f (x̄) is a singleton comes from the uniqueness of the solution to (ξ − ∇f (x̄))⊤(ξ − ∇f (x̄)) = 0.
Figure 14.3 illustrates subdifferential sets at three distinct points of a piecewise linear function. The picture schematically represents a multidimensional space x as a one-dimensional projection (you can imagine this picture as a section along one of the x dimensions). At the points where the function is not differentiable, the subdifferential set contains an infinite number of subgradients. At points where the function is differentiable (any mid-segment point), the subgradient is unique (a gradient) and the subdifferential is a singleton.
If f : S ↦ R is a convex differentiable function, then Theorem 14.7 can be combined with Lemma 14.9 to express one of the most powerful results relating f and its affine (first-order) approximation at x̄.
Theorem 14.10 (Convexity of a function II). Let S ⊆ R^n be a nonempty convex open set, and let f : S ↦ R be differentiable on S. The function f is convex if and only if, for any x̄ ∈ S, we have
f (x) ≥ f (x̄) + ∇f (x̄)⊤(x − x̄), ∀x ∈ S.
The proof of this theorem follows that of Theorem 14.7 to obtain the subgradient inequality and then uses Lemma 14.9 to replace the subgradient with the gradient. To see how the opposite direction works (the subgradient inequality holding implying the convexity of f ), one should proceed as follows.
1. Take x1 and x2 from S. The convexity of S implies that x̄ = λx1 + (1 − λ)x2 is also in S.
2. Assume that the subgradient exists at x̄, and therefore the two relations hold:
f (x1) ≥ f (x̄) + ∇f (x̄)⊤(x1 − x̄) (14.4)
f (x2) ≥ f (x̄) + ∇f (x̄)⊤(x2 − x̄). (14.5)
Figure 14.3: A representation of the subdifferential (in grey) at nondifferentiable (x1 and x3) and differentiable (x2) points.
3. Multiply (14.4) by λ and (14.5) by (1 − λ), and add them together. One will then obtain
λf (x1) + (1 − λ)f (x2) ≥ f (x̄) + ∇f (x̄)⊤(λx1 + (1 − λ)x2 − x̄) = f (λx1 + (1 − λ)x2),
which is precisely the definition of convexity of f .
We say that the Hessian H(x̄) is positive semidefinite if x⊤H(x̄)x ≥ 0 for all x ∈ R^n. Having a positive semidefinite Hessian at every x̄ ∈ S implies that the function is convex in S.
Theorem 14.12. Let S ⊆ R^n be a nonempty convex open set, and let f : S ↦ R be twice differentiable on S. Then f is convex if and only if the Hessian matrix is positive semidefinite (PSD) at each point in S.
Proof. Suppose f is convex and let x̄ ∈ S. Since S is open, x̄ + λx ∈ S for a small enough |λ| ̸= 0. From Theorem 14.10 and the twice differentiability of f , we have
f (x̄ + λx) ≥ f (x̄) + λ∇f (x̄)⊤x, and
f (x̄ + λx) = f (x̄) + λ∇f (x̄)⊤x + (λ²/2)x⊤H(x̄)x + λ²||x||²α(x̄; λx).
Combining the two and dividing by λ² > 0, we obtain (1/2)x⊤H(x̄)x + ||x||²α(x̄; λx) ≥ 0; letting λ → 0, so that α(x̄; λx) → 0, we conclude that x⊤H(x̄)x ≥ 0.
The proof uses a trick we have seen before. First, we assume convexity and use the characterisation of convexity provided by Theorem 14.10, combined with an alternative representation of (x − x̄), to show that x⊤H(x̄)x ≥ 0. That is, instead of using the reference points x and x̄, we take a step of size λ from x̄ in the direction of x.
To show the other direction of the implication, that is, that x⊤H(x̄)x ≥ 0 implies convexity, we use the mean value theorem, which states that there must exist a point x̂ between x and x̄ for which the second-order approximation is exact. From this, we can recover the characterisation of convexity in Theorem 14.10.
Checking for positive semidefiniteness can be done using appropriate computational algebra methods, though it can be computationally expensive. It involves calculating the eigenvalues of H(x) and testing whether they are all nonnegative (positive), which implies the positive semidefiniteness (definiteness) of H(x). Some nonlinear solvers are capable of returning warnings (or even errors) pointing out the lack of convexity by testing (under a certain threshold) for positive semidefiniteness.
14.3 Quasiconvexity
Quasiconvexity can be seen as a generalisation of convexity to functions that are not convex but share similar properties, which allow for defining global optimality conditions. One class of such functions is named quasiconvex. Let us first technically define quasiconvex functions.
Definition 14.13 (quasiconvex functions). Let S ⊆ R^n be a nonempty convex set and f : S ↦ R. The function f is quasiconvex if, for each x1, x2 ∈ S and λ ∈ (0, 1), we have
f (λx1 + (1 − λ)x2) ≤ max {f (x1), f (x2)}. (14.8)
We say that, if f is quasiconvex, then −f is quasiconcave. Also, functions that are both quasiconvex and quasiconcave are called quasilinear. Quasiconvex functions are also called unimodal.
Figure 14.4 illustrates a quasiconvex function. Notice that, for any pair of points x1 and x2 in the
domain of f , the graph of the function is always below the maximum between f (x1 ) and f (x2 ).
This is precisely what renders the lower level sets of quasiconvex functions convex. Notice that, on the other hand, the epigraph epi(f ) is not a convex set.
Figure 14.4: A quasiconvex function with its epigraph (in grey) and a lower level set (in blue).
An important property of quasiconvex functions is that their level sets are convex.
Theorem 14.14. Let S ⊆ Rn be a nonempty convex set and f : S 7→ R. Function f is quasiconvex
if and only if Sα = {x ∈ S : f (x) ≤ α} is convex for all α ∈ R.
Proof. Suppose f is quasiconvex and let x1 , x2 ∈ Sα . Thus, x1 , x2 ∈ S and max {f (x1 ), f (x2 )} ≤
α. Let x = λx1 + (1 − λ)x2 for λ ∈ (0, 1). As S is convex, x ∈ S. By quasiconvexity of f ,
f (x) ≤ max {f (x1 ), f (x2 )} ≤ α. Hence, x ∈ Sα and Sα is convex.
Conversely, assume that Sα is convex for α ∈ R. Let x1 , x2 ∈ S, and let x = λx1 + (1 − λ)x2 for
λ ∈ (0, 1). Note that, for α = max {f (x1 ), f (x2 )}, we have x1 , x2 ∈ Sα . The convexity of Sα implies
that x ∈ Sα , and thus f (x) ≤ α = max {f (x1 ), f (x2 )}, which implies that f is quasiconvex.
The proof relies on the convexity of the domain S to show that a convex combination of points in the level set Sα also belongs to Sα. To show the converse, we simply need to define α = max {f (x1), f (x2)} to see that the convexity of the level set Sα implies that f is quasiconvex.
Quasiconvex functions have an interesting first-order condition that arises from the convexity of their level sets.
Theorem 14.15. Let S ⊆ R^n be a nonempty open convex set, and let f : S ↦ R be differentiable on S. Then f is quasiconvex if and only if, for x1, x2 ∈ S with f (x1) ≤ f (x2), we have ∇f (x2)⊤(x1 − x2) ≤ 0.
Figure 14.5: (a) A strictly quasiconvex function and (b) its level curves.
The condition in Theorem 14.15 is in fact sufficient for global optimality if one can show that f is strictly quasiconvex, that is, when (14.8) holds strictly. Figures 14.5a and 14.5b show an example of a strictly quasiconvex function and its level curves, illustrating that, despite the lack of convexity, the level sets are convex.
Strictly quasiconvex functions form a subset of a more general class of functions named pseudoconvex, for which the condition in Theorem 14.15 is sufficient for global optimality.
Chapter 15
Unconstrained optimality
conditions
1. a feasible solution is any point x ∈ S;
2. a local optimal solution is a feasible solution x̄ ∈ S for which there exists a neighbourhood Nϵ(x̄), with ϵ > 0, such that f (x̄) ≤ f (x) for all x ∈ S ∩ Nϵ(x̄);
3. a global optimal solution is a feasible solution x̄ ∈ S with f (x̄) ≤ f (x) for all x ∈ S; or, alternatively, a local optimal solution for which S ⊆ Nϵ(x̄).
Figure 15.1 illustrates the concepts above. Solution x1 is an unconstrained global minimum, but it is not a feasible solution when the feasibility set S is considered. Solution x2 is a local optimum for any neighbourhood Nϵ(x2) encompassing only points within the same plateau. Solution x3 is a local optimum, while x4 is neither a local nor a global optimum in the unconstrained case, but it is a global maximum in the constrained case. Finally, x5 is the global minimum in the constrained case.
Theorem 15.1 (global optimality of convex problems). Let S ⊆ R^n be a nonempty convex set and f : S ↦ R convex on S. Consider the problem (P ) : min. {f (x) : x ∈ S}. Suppose x̄ is a local optimal solution to P . Then x̄ is a global optimal solution.
Proof. Since x̄ is a local optimal solution, there exists Nϵ(x̄) such that, for each x ∈ S ∩ Nϵ(x̄), f (x̄) ≤ f (x). By contradiction, suppose x̄ is not a global optimal solution. Then, there exists a
Figure 15.1: Points of interest in optimisation. Points x1 , x2 and x3 are local optima in the
unconstrained problem. Once a constraint set S is imposed, x4 and x5 become points of interest
and x1 becomes infeasible.
solution x̂ ∈ S such that f (x̂) < f (x̄). Now, for any λ ∈ (0, 1], the convexity of f implies
f (λx̂ + (1 − λ)x̄) ≤ λf (x̂) + (1 − λ)f (x̄) < λf (x̄) + (1 − λ)f (x̄) = f (x̄).
However, for λ > 0 sufficiently small, λx̂ + (1 − λ)x̄ ∈ S ∩ Nϵ(x̄) due to the convexity of S, which contradicts the local optimality of x̄. Thus, x̄ is a global optimum.
The proof is built by contradiction. That is, we show that assuming that a local optimum of a convex problem is not a global optimum contradicts its local optimality, which is true by assumption. This is achieved using the convexity of f and showing that a convex combination of the hypothetically better solution x̂ and x̄ would have to be both in Nϵ(x̄) and better than x̄, contradicting the local optimality of x̄.
Note that Λ1 and Λ2 are convex. By the optimality of x̄, Λ1 ∩ Λ2 = ∅. Using the separation theorem, there exists a hyperplane defined by (ξ0, µ) ≠ 0 and α that separates Λ1 and Λ2:
Dividing (15.1) and (15.2) by −µ and denoting ξ = −ξ0/µ, we obtain:
Letting y = 0 in (15.4), we get ξ⊤(x − x̄) ≥ 0 for all x ∈ S. From (15.3), we can see that y > f(x) − f(x̄) and y ≥ ξ⊤(x − x̄). Thus, f(x) − f(x̄) ≥ ξ⊤(x − x̄), which is the subgradient inequality. Hence, ξ is a subgradient at x̄ with ξ⊤(x − x̄) ≥ 0 for all x ∈ S.
In the first part of the proof, we use the definition of convexity based on the subgradient inequality to show that ξ⊤(x − x̄) ≥ 0 implies that f(x̄) ≤ f(x) for all x ∈ S. The second part of the proof uses the separation theorem in a creative way to show that the subgradient inequality must hold if x̄ is optimal. This is achieved by using the two sets Λ1 and Λ2. Notice that x̄ being optimal implies that y > f(x) − f(x̄) ≥ 0, which leads to the conclusion that Λ1 ∩ Λ2 = ∅, demonstrating the existence of a separating hyperplane between them, as shown in (15.1) and (15.2). We can show that α in those has to be 0 by noticing that µϵ ≤ 0 must hold for ϵ > 0 to be a bounded constant.
The second part is dedicated to showing that µ < 0, so that we can divide (15.1) and (15.2) by −µ to obtain the subgradient inequality, as we have seen. We show this by contradiction, since µ = 0 would imply ξ0 = 0, which disagrees with the existence of (ξ0, µ) ≠ 0 in the separation theorem. Finally, as y > f(x) − f(x̄) and y ≥ ξ⊤(x − x̄) for any given y, we have that f(x) − f(x̄) ≥ ξ⊤(x − x̄)¹, which leads to the subgradient inequality.
Notice that this result provides necessary and sufficient conditions for optimality for convex prob-
lems. These conditions can be extended to the unconstrained case as well, which is presented in
Corollary 15.3.
Corollary 15.3 (optimality in open sets). Under the conditions of Theorem 15.2, if S is open, x̄ is an optimal solution to P if and only if 0 ∈ ∂f(x̄).
Notice that, if S is open, then the only way to attain the condition ξ⊤(x − x̄) ≥ 0 for all x ∈ S is with ξ = 0 itself. This is particularly relevant in the context of nondifferentiable functions, as we will see later. Another important corollary is the classic optimality condition ∇f(x̄) = 0, which we state below for completeness.
The proof of Corollary 15.4 is the same as that of Theorem 15.2 in a setting where ∂f(x̄) = {∇f(x̄)}, due to the differentiability of f.
Let us consider two examples. First, consider the problem
¹ Notice that, on the line of nonnegative reals, for the same y, f(x) − f(x̄) always lies to the 'right' of ξ⊤(x − x̄).
min. (x1 − 3/2)² + (x2 − 5)²
s.t.: −x1 + x2 ≤ 2
2x1 + 3x2 ≤ 11
x1 ≥ 0
x2 ≥ 0
Figure 15.2 presents a plot of the feasible region S, which is formed by the intersection of the halfspaces, and the level curves of the objective function, with some of the values indicated on the curves. Notice that this is a convex problem.
Figure 15.2: Example 1
The arrow shows the gradient ∇f(x̄) at x̄ = (1, 3). Notice that this point is special: at that point, no vector x − x̄ can be found forming an angle greater than 90° with ∇f(x̄); that is, ∇f(x̄)⊤(x − x̄) ≥ 0 for any x ∈ S, which means that x̄ is optimal. Since the problem is convex, this is in fact the global optimum for this problem.
Figure 15.3 shows a similar situation, but now with one of the constraints being nonlinear. Notice that, of the two points highlighted ((1, 2) in orange and (2, 1) in purple), the optimality condition only holds for (2, 1). For x̄ = (1, 2), the vector x − x̄ with x = (2, 1) forms an angle greater than 90° with the gradient ∇f(x̄), and thus the condition ∇f(x̄)⊤(x − x̄) ≥ 0 does not hold for all of S. The condition does hold for x̄ = (2, 1), as can be seen in Figure 15.3.
Figure 15.3: Example 2
We have developed most of the concepts required to state optimality conditions for unconstrained
optimisation problems, as presented in Corollaries 15.3 and 15.4. We now take an alternative route
in which we do not take into account the feasibility set, but only the differentiability of f . This
will be useful as it will allow us to momentarily depart from the assumption of convexity, which
was used to state Theorem 15.2.
(f(x̄ + λd) − f(x̄))/λ = ∇f(x̄)⊤d + ||d||α(x̄; λd).
Since ∇f(x̄)⊤d < 0 and α(x̄; λd) → 0 as λ → 0, for λ ∈ (0, δ) sufficiently small we must have f(x̄ + λd) − f(x̄) < 0.
The proof uses the first-order expansion around x̄ to show that, f being differentiable, the condition ∇f(x̄)⊤d < 0 implies that f(x̄ + λd) < f(x̄); put in words, a step in the direction d decreases the objective function value.
We can derive the first-order optimality condition in Corollary 15.4 as a consequence of Theorem 15.5. Notice, however, that since convexity is not assumed, all we can say is that this condition is necessary (but not sufficient) for local optimality.
Proof. By contradiction, suppose that ∇f(x̄) ≠ 0. Letting d = −∇f(x̄), we have that ∇f(x̄)⊤d = −||∇f(x̄)||² < 0. By Theorem 15.5, there exists a δ > 0 such that f(x̄ + λd) < f(x̄) for all λ ∈ (0, δ), thus contradicting the local optimality of x̄.
Notice that Corollary 15.6 only holds in one direction. The proof once again uses contradiction: we assume local optimality of x̄ and show that having ∇f(x̄) ≠ 0 contradicts this assumption. To do so, we simply show that having any descent direction d (we use −∇f(x̄), since in this setting it is guaranteed to exist, as ∇f(x̄) ≠ 0) would mean that a small step λ can reduce the objective function value, contradicting the local optimality of x̄.
We now derive necessary conditions for the local optimality of x̄ based on second-order differentiability. As we will see, they require that the Hessian H(x̄) of f at x̄ is positive semidefinite.
f(x̄ + λd) = f(x̄) + λ∇f(x̄)⊤d + (1/2)λ²d⊤H(x̄)d + λ²||d||²α(x̄; λd).
Since x̄ is a local minimum, Corollary 15.6 implies that ∇f(x̄) = 0 and f(x̄ + λd) ≥ f(x̄). Rearranging terms and dividing by λ² > 0, we obtain
(f(x̄ + λd) − f(x̄))/λ² = (1/2)d⊤H(x̄)d + ||d||²α(x̄; λd).
The second-order conditions can be used to attest to the local optimality of x̄. In the case where H(x̄) is positive definite, this second-order condition becomes sufficient for local optimality, since it implies that the function is 'locally convex' in a small enough neighbourhood Nϵ(x̄).
In case f is convex, the first-order condition ∇f(x̄) = 0 also becomes sufficient for attesting the global optimality of x̄. Recall that f is convex if and only if H(x) is positive semidefinite for all x ∈ Rn, meaning that in this case the second-order necessary conditions are also satisfied at x̄.
Theorem 15.8. Let f : Rn 7→ R be convex. Then x̄ is a global minimum if and only if ∇f(x̄) = 0.
Proof. From Corollary 15.6, if x̄ is a global minimum, then ∇f(x̄) = 0. Now, since f is convex, we have that
f(x) ≥ f(x̄) + ∇f(x̄)⊤(x − x̄), for all x ∈ Rn.
Notice that ∇f(x̄) = 0 implies that ∇f(x̄)⊤(x − x̄) = 0 for each x ∈ Rn, thus implying that f(x̄) ≤ f(x) for all x ∈ Rn.
Chapter 16
Unconstrained optimisation methods: part 1
Algorithm 22 has two main elements, namely the computation of the direction dk and the step size
λk at each iteration k. In what follows, we present some univariate optimisation methods that can
be employed to calculate step sizes λk . These methods are commonly referred to as line search
methods.
θ(λ) = f (x + λd).
Assuming differentiability, we can use the first-order necessary condition θ′ (λ) = 0 to obtain
optimal values for the step size λ. This means solving the system
θ′ (λ) = d⊤ ∇f (x + λd) = 0
which might pose challenges. First, d⊤ ∇f (x + λd) is often nonlinear in λ, with optimal solutions
not trivially resting at boundary points for an explicit domain of λ. Moreover, recall that θ′ (λ) = 0
is not a sufficient condition for optimality in general, unless properties such as convexity can be
inferred.
In what follows, we assume that strict quasiconvexity holds and therefore θ′ (λ) = 0 becomes
necessary and sufficient for optimality. In some contexts, unidimensional strictly quasiconvex
functions are called unimodal.
Theorem 16.1 establishes the mechanism underpinning line search methods. In it, we use the assumption that the function has a unique minimum (a consequence of being strictly quasiconvex) to successively reduce the search space until the optimum is contained in an interval smaller than an acceptable tolerance l.
Theorem 16.1 (Line search reduction). Let θ : R → R be strictly quasiconvex over the interval
[a, b], and let λ, µ ∈ [a, b] such that λ < µ. If θ(λ) > θ(µ), then θ(z) ≥ θ(µ) for all z ∈ [a, λ]. If
θ(λ) ≤ θ(µ), then θ(z) ≥ θ(λ) for all z ∈ [µ, b].
Figure 16.1: Applying Theorem 16.1 allows to iteratively reduce the search space.
Figure 16.1 provides an illustration of Theorem 16.1. The line below the x-axis illustrates how the
search space can be reduced between two successive iterations. In fact, most line search methods
will iteratively reduce the search interval (represented by [a, b]) until the interval is sufficiently
small to be considered “a point” (i.e., is smaller than a set threshold l).
Line searches are exact when optimal step sizes λ∗k are calculated at each iteration k, and inexact when arbitrarily good approximations of λ∗k are used instead. As we will see, there is a trade-off between the number of iterations required for convergence and the time taken per iteration that must be taken into account when choosing between exact and inexact line searches.
Uniform search
The uniform search consists of breaking the search domain [a, b] into N slices of uniform size δ = (b − a)/N. This leads to a one-dimensional grid with grid points an = a0 + nδ, n = 0, . . . , N, where a0 = a and aN = b. We can then set λ̂ to be
λ̂ = arg min_{n=0,...,N} θ(an).
From Theorem 16.1, we know that the optimal step size λ∗ ∈ [λ̂ − δ, λ̂ + δ]. The process can then be repeated by making a = λ̂ − δ and b = λ̂ + δ (see Figure 16.2) until |b − a| is less than a prespecified tolerance l. Without enough repetitions of the search, the uniform search is an inexact search.
This type of search is particularly useful when setting values for hyperparameters in algorithms (that is, user-defined parameters that influence the behaviour of the algorithm) or when performing any sort of search on a grid structure. One concept related to this type of search is what is known as the coarse-to-fine approach. Coarse-to-fine approaches use sequences of increasingly fine approximations (i.e., gradually increasing N) to obtain computational savings in terms of function evaluations. In fact, the number of function evaluations a line search method executes is one of the indicators of its efficiency.
Figure 16.2: Grid search with 5 points; note that θ(a2) = min_{n=0,...,N} θ(an).
Dichotomous search
The dichotomous search is an example of a sequential line search method, in which evaluations
of the function θ at a current iteration k are reused in the next iteration k + 1 to minimise the
number of function evaluations and thus improve performance.
The word dichotomous refers to the mutually exclusive parts into which the search interval [a, b] is divided at each iteration. We start by defining a distance margin ϵ and two reference points λ = (a + b)/2 − ϵ and µ = (a + b)/2 + ϵ. Using the function values θ(λ) and θ(µ), we proceed as follows.
1. If θ(λ) < θ(µ), then move to the left by making ak+1 = ak and bk+1 = µk;
2. Otherwise, if θ(λ) > θ(µ), then move to the right by making ak+1 = λk and bk+1 = bk.
Notice that the assumption of strict quasiconvexity implies that θ(λ) = θ(µ) cannot occur, but in a more general setting one must specify a criterion for resolving such ties. Once the new search
interval [ak+1 , bk+1 ] is updated, new reference points λk+1 and µk+1 are calculated and the process
is repeated until |a − b| ≤ l. The method is summarised in Algorithm 23. Notice that, at any given
iteration k, one can calculate what will be the size |ak+1 − bk+1 |, given by
bk+1 − ak+1 = (1/2^k)(b0 − a0) + 2ϵ(1 − 1/2^k).
This is useful in that it allows one to predict the number of iterations Algorithm 23 will require before convergence. Figure 16.3 illustrates the process for two distinct functions. Notice that employing the central point (a + b)/2 as the reference to define the points λ and µ makes the method robust in terms of the interval reduction achieved at each iteration.
Figure 16.3: Using the midpoint (a + b)/2 and Theorem 16.1 to reduce the search space.
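A minimal Python sketch of the dichotomous search follows; theta is a hypothetical strictly quasiconvex function, and ties are resolved by moving to the right, as discussed above.

def dichotomous_search(theta, a, b, eps=1e-7, l=1e-6):
    # Requires eps < l/2 so the interval length, which tends to 2*eps,
    # eventually falls below the tolerance l.
    while b - a > l:
        lam = (a + b) / 2 - eps
        mu = (a + b) / 2 + eps
        if theta(lam) < theta(mu):
            b = mu          # minimiser lies in [a, mu]
        else:
            a = lam         # minimiser lies in [lam, b]
    return (a + b) / 2

print(dichotomous_search(lambda t: (t - 0.7) ** 2, 0.0, 2.0))  # approx. 0.7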
Golden section search
The golden section search improves upon the dichotomous search by imposing two requirements:
1. the reduction in the search interval should not depend on whether θ(λk) > θ(µk) or vice-versa;
2. at each iteration, we perform a single function evaluation, thus making λk+1 = µk if θ(λk) > θ(µk), or vice-versa.
With λk = ak + (1 − α)(bk − ak) and µk = ak + α(bk − ak) for some α ∈ (1/2, 1), and assuming θ(λk) > θ(µk) (so that ak+1 = λk and bk+1 = bk), the requirement λk+1 = µk gives
ak+1 + (1 − α)(bk+1 − ak+1) = µk
(1 − α)[α(bk − ak)] = µk − λk
(α − α²)(bk − ak) = ak + α(bk − ak) − [ak + (1 − α)(bk − ak)]
α² + α − 1 = 0,
to which α = (√5 − 1)/2 = 0.618... = 1/φ is the positive solution. Clearly, the same result is obtained if one considers θ(λk) < θ(µk). Algorithm 24 summarises the golden section search. Notice that, at each iteration, only a single additional function evaluation is required.
Comparing the above methods for a given accuracy l, the required number of function evaluations n is the smallest n satisfying:
• uniform: n ≥ (b1 − a1)/(l/2) − 1;
• dichotomous: (1/2)^(n/2) ≤ l/(b1 − a1);
• golden section: (0.618)^(n−1) ≤ l/(b1 − a1).
For example, suppose we set [a, b] = [−10, 10] and l = 10⁻⁶. Then the number of function evaluations required for convergence is
• uniform: n = 4 × 10⁷;
• dichotomous: n = 49;
• golden section: n = 36.
A variant of the golden section method uses Fibonacci numbers to define the ratio of interval
reduction. Despite being marginally more efficient in terms of function evaluations, the overhead
of calculating Fibonacci numbers has to be taken into account.
Bisection search
Differently from the previous methods, the bisection search relies on derivative information to infer how the search interval should be reduced. For that, we assume that θ(λ) is differentiable and convex.
We proceed as follows. If θ′ (λk ) = 0, then λk is a minimiser. Otherwise
1. if θ′ (λk ) > 0, then, for λ > λk , we have θ′ (λk )(λ − λk ) > 0, which implies θ(λ) ≥ θ(λk ) since
θ is convex. Therefore, the new search interval becomes [ak+1 , bk+1 ] = [ak , λk ].
2. if θ′ (λk ) < 0, we have θ′ (λk )(λ − λk ) > 0 (and thus θ(λ) ≥ θ(λk )) for λ < λk . Thus, the new
search interval becomes [ak+1 , bk+1 ] = [λk , bk ].
As in the dichotomous search, we set λk = (ak + bk)/2, which provides robust guarantees of search interval reduction. Notice that the dichotomous search can be seen as a bisection search in which the derivative information is estimated using the difference of function evaluations at two distinct points. Algorithm 12 summarises the bisection method.
Armijo rule
The Armijo rule is a condition that is tested to decide whether a current step size λ is acceptable or not. The step size λ is considered acceptable if
f(x + λd) ≤ f(x) + αλ∇f(x)⊤d. (16.1)
One way of understanding the Armijo rule is to look at what it means in terms of the function θ(λ) = f(x + λd). Notice that, using the linear approximation of θ at λ = 0, the Armijo rule becomes
θ(λ) ≤ θ(0) + αλθ′(0).
That is, θ(λ) has to be less than the deflected linear extrapolation of θ at λ = 0. The deflection is
given by the pre-specified parameter α. In case λ does not satisfy the test in (16.1), λ is reduced
by a factor β ∈ (0, 1) until the test in (16.1) is satisfied.
Figure 16.4: At first λ0 = λ is not acceptable; after reducing the step size to λ1 = βλ, it enters
the acceptable range where θ(λk ) ≤ θapp (λk ) = θ(0) + αλk (θ′ (0)).
In Figure 16.4, we can see the acceptable region for the Armijo test. At first, λ does not satisfy condition (16.1), being then reduced to βλ, which, in turn, satisfies (16.1). In this case, λk would have been set to βλ. Suitable values for α are within (0, 0.5] and for β within (0, 1), trading off precision (higher values) against the number of tests before acceptance (lower values).
The Armijo rule is called backtracking in some contexts, due to the successive reduction of the step
size caused by the factor β ∈ (0, 1). Some variants might also include rules that prevent the step
size from becoming too small, such as θ(δλ) ≥ θ(0) + αδλθ′ (0), with δ > 1.
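A minimal Python sketch of the backtracking procedure under the Armijo test (16.1) is given below; f, grad_f, the point x, and the direction d are assumed supplied by the caller.

import numpy as np

def armijo_step(f, grad_f, x, d, alpha=0.1, beta=0.7, lam0=1.0):
    # Shrink lam by beta until the Armijo test (16.1) is satisfied:
    # f(x + lam*d) <= f(x) + alpha*lam*grad_f(x)@d.
    lam = lam0
    fx, slope = f(x), grad_f(x) @ d   # slope = theta'(0), assumed negative
    while f(x + lam * d) > fx + alpha * lam * slope:
        lam *= beta
    return lam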
Algorithm 13 summarises the general structure of the coordinate descent method. Notice that the
for-loop starting in Line 3 uses the cyclic variant of the coordinate descent method.
Figure 16.5 shows the progress of the algorithm when applied to solve
Figure 16.5: Coordinate descent method applied to f. Convergence is observed in 4 steps for a tolerance ϵ = 10⁻⁵.
f′(x; d) = lim_{λ→0⁺} (f(x + λd) − f(x))/λ = ∇f(x)⊤d.
Thus, d̄ = arg min_{||d||≤1} ∇f(x)⊤d = −∇f(x)/||∇f(x)||.
In the proof, we use differentiability to define the directional derivative of f in the direction d, that is, the change in the value of f from a move of size λ > 0 in the direction d, which is given by ∇f(x)⊤d. If we minimise this term in d for ||d||₂ ≤ 1, we observe that d̄ is a vector of length one pointing in the direction opposite to ∇f(x), thus d̄ = −∇f(x)/||∇f(x)||.
That provides us with the insight that we can use ∇f(x) to derive (potentially good) directions for optimising f. Notice that the direction employed is the opposite of the gradient for minimisation problems, and the gradient itself in the case of maximisation. That is the reason why the gradient method is called the steepest descent method in some references, though gradient and steepest descent might refer to different methods in specific contexts.
Using the gradient ∇f (x) is also a convenience as it allows for the definition of a straightforward
convergence condition. Notice that, if ∇f (x) = 0, then the algorithm stalls, as xk+1 = xk + λk dk =
xk . In other words, the algorithm converges to points x ∈ Rn that satisfy the first-order necessary
conditions ∇f (x) = 0.
The gradient method has many known variants that try to mitigate issues associated with the poor
convergence caused by the natural ’zigzagging’ behaviour of the algorithm (see, for example the
gradient method with momentum and the Nesterov method).
There are also variants that only consider the partial derivatives of some (and not all) of the
dimensions i = 1, . . . , n forming blocks of coordinates at each iteration. If these blocks are randomly
formed, these methods are known as stochastic gradient methods.
In Algorithm 14 we provide a pseudocode for the gradient method. In Line 2, the stopping condition for the while-loop is equivalent to testing ∇f(x) = 0 under a tolerance ϵ.
Figure 16.6 presents the progress of the gradient method using exact (bisection) and inexact (Armijo rule with α = 0.1 and β = 0.7) line searches. As can be expected, when an inexact line search is employed, the method slightly overshoots some of the steps, taking a few more iterations to converge.
Figure 16.6: Gradient method applied to f. Convergence is observed in 10 steps using exact line search and 19 using the Armijo rule (for ϵ = 10⁻⁵).
q(x) = f(xk) + ∇f(xk)⊤(x − xk) + (1/2)(x − xk)⊤H(xk)(x − xk).
The method uses as direction d that of the extremum of the quadratic approximation at xk, which can be obtained from the first-order condition ∇q(x) = 0. This renders
∇f(xk) + H(xk)(x − xk) = 0. (16.2)
Assuming that H⁻¹(xk) exists, we can use (16.2) to obtain the following update rule, which is known as the Newton step:
xk+1 = xk − H⁻¹(xk)∇f(xk). (16.3)
Notice that the "pure" Newton's method has the length of the step (i.e., the step size) embedded in its direction. In practice, the method uses d = −H⁻¹(xk)∇f(xk) as a direction combined with a line search to obtain optimal step sizes and prevent divergence (that is, convergence towards −∞) in cases where the second-order approximation might lead to it. Fixing λ = 1 renders the natural Newton's method, as derived in (16.3). Newton's method can also be seen as employing the Newton-Raphson method to solve the system of equations describing the first-order conditions of the quadratic approximation at xk.
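A minimal sketch of the pure Newton's method (λ = 1) is given below; hess_f is a hypothetical callable returning H(xk), and the Newton direction is obtained by solving a linear system rather than explicitly inverting H.

import numpy as np

def newton_method(grad_f, hess_f, x0, eps=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        d = np.linalg.solve(hess_f(x), -g)   # Newton direction, as in (16.3)
        x = x + d                            # fixed unit step; a line search could replace this
    return x

# For a quadratic, a single step reaches the minimum:
grad = lambda x: np.array([x[0], 10 * x[1]])
hess = lambda x: np.array([[1.0, 0.0], [0.0, 10.0]])
print(newton_method(grad, hess, [10.0, 1.0]))  # approx. [0, 0]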
Figure 16.7 shows the calculation of the direction d = −H⁻¹(xk)∇f(xk) for the first iteration of Newton's method. Notice that the direction is the same as that of the minimum of the quadratic approximation q(x) at xk. The employment of a line search allows overshooting the minimum of the approximation, making the search more efficient.
Figure 16.7: The calculation of the direction d = x∗ − x0 in the first two iterations of Newton's method with step size λ fixed to 1 (the pure Newton's method, in left-to-right, top-to-bottom order). Notice in blue the level curves of the quadratic approximation of the function at the current point xk and how it improves from one iteration to the next.
Newton's method might diverge if the initial point is too far from the optimum and fixed step sizes (such as λ = 1) are used, since the minimum of the quadratic approximation and the actual function minimum can become drastically and increasingly disparate. The Levenberg-Marquardt method and other trust-region-based variants address the convergence issues of Newton's method. As a general rule, combining the method with an exact line search or a criterion for step-size acceptance that requires improvement (such as employing the Armijo rule for defining the step sizes) is often sufficient to guarantee convergence. Figure 16.8 compares the convergence of the pure Newton's method and the method employing an exact line search.
Figure 16.8: A comparison of the trajectories of both Newton's method variants. Notice that in the method using the exact line search, while the direction d = x∗ − x0 is utilised, the step size taken in the first iteration is larger.
Algorithm 15 presents a pseudocode for the Newton’s method. Notice that in Line 3, an inversion
operation is required. One might be cautious about this operation, since as ∇f (xk ) tends to zero,
the Hessian H(xk ) tends to become singular, potentially causing numerical instabilities.
Figure 16.9 shows the progression of the Newton’s method for f with exact and inexact line
searches.
Figure 16.9: Newton's method applied to f. Convergence is observed in 4 steps using exact line search and 27 using the Armijo rule (ϵ = 10⁻⁵).
Chapter 17
Unconstrained optimisation methods: part 2
We will now discuss two variants of the gradient and Newton methods that try to exploit the computational simplicity of gradient methods while encoding curvature information as Newton's method does, but without explicitly relying on second-order derivatives (i.e., Hessian matrices).
The conjugate gradient method uses the notion of conjugacy to guide the search for optimal solutions. The original motivation for the method comes from quadratic problems, in which one can use conjugacy to separate the search for the optimum of f : Rn 7→ R into n exact steps.
Definition 17.1. Let H be an n × n symmetric matrix. The vectors d1, . . . , dn are called (H-)conjugate if they are linearly independent and di⊤Hdj = 0 for all i, j = 1, . . . , n such that i ≠ j.
Notice that H-conjugacy (or simply conjugacy) is a generalisation of orthogonality under the linear transformation imposed by the matrix H; in particular, orthogonal vectors are H-conjugate for H = I. Figure 17.1 illustrates the notion of conjugacy between two vectors d1 and d2 that are H-conjugate, where H is the Hessian of the underlying quadratic function. Notice how it allows one to generate, from direction d1, a direction d2 that, if used in combination with an exact line search, would take us to the centre of the level curves.
[Figure panels: f(x) = (x1 + 1)² + (x2 − 2)² (left) and f(x) = 12x2 + 4x1² + 4x2² + 4x1x2 (right), each showing the conjugate directions d1 and d2.]
Figure 17.1: d1 and d2 are H-conjugates; on the left, H = I.
One can use H-conjugate directions to find optimal solutions for the quadratic function f(x) = c⊤x + (1/2)x⊤Hx, where H is a symmetric matrix. Suppose we know directions d1, . . . , dn that are H-conjugate. Then, given an initial point x0, any point x can be described as x = x0 + ∑_{j=1}^n λjdj. We can then reformulate f(x) as a function of the step sizes λ = (λ1, . . . , λn), i.e.,
f(x) = F(λ) = c⊤(x0 + ∑_{j=1}^n λjdj) + (1/2)(x0 + ∑_{j=1}^n λjdj)⊤H(x0 + ∑_{j=1}^n λjdj)
= ∑_{j=1}^n [ c⊤(x0 + λjdj) + (1/2)(x0 + λjdj)⊤H(x0 + λjdj) ].
This separability of F(λ) into the terms
Fj(λj) = c⊤(x0 + λjdj) + (1/2)(x0 + λjdj)⊤H(x0 + λjdj)
follows from the H-conjugacy of the directions and is, ultimately, a consequence of the linear independence of the conjugate directions. Assuming that H is positive definite, and thus that first-order conditions are necessary and sufficient for optimality, we can then calculate the optimal λj for j = 1, . . . , n from
Fj′(λj) = c⊤dj + x0⊤Hdj + λjdj⊤Hdj = 0,
which gives
λj = −(c⊤dj + x0⊤Hdj)/(dj⊤Hdj), for all j = 1, . . . , n.
This result can be used to devise an iterative method that obtains optimal solutions for quadratic functions in exactly n iterations. From an initial point x0 and a collection of H-conjugate directions d1, . . . , dn, the method consists of successively executing the step
xk = xk−1 + λkdk, where λk = −(c⊤dk + xk−1⊤Hdk)/(dk⊤Hdk).
Figure 17.2: Optimising f with the conjugate method and coordinate descent (left); for H = I, both methods coincide (right).
Notice the resemblance this method holds with the coordinate descent method. In case H = I, the coordinate directions given by di = 1 and dj≠i = 0 are H-conjugate and thus the coordinate descent method converges in two iterations. Figure 17.2 illustrates this behaviour. Notice that, on the left, the conjugate method converges in exactly two iterations, while coordinate descent takes several steps before finding the minimum. On the right, both methods become equivalent since, when H = I, the coordinate directions are also conjugate to each other.
The missing part at this point is how one can generate H-conjugate directions. This can be done
efficiently using an adaptation of the Gram-Schmidt procedure, typically employed to generate
orthonormal bases.
We intend to build a collection of conjugate directions d0 , . . . , dn−1 , which can be achieved provided
that we have a collection of linearly independent vectors ξ0 , . . . , ξn−1 .
The method proceeds as follows.
1. Start by setting d0 = ξ0.
2. At a given iteration k + 1, we set the coefficients α^l_{k+1} such that dk+1 is H-conjugate to d0, . . . , dk and formed by adding ξk+1 to a linear combination of d0, . . . , dk, that is,
dk+1 = ξk+1 + ∑_{l=0}^k α^l_{k+1} dl.
Imposing H-conjugacy between dk+1 and each di, i = 0, . . . , k, we require
dk+1⊤Hdi = ξk+1⊤Hdi + (∑_{l=0}^k α^l_{k+1} dl)⊤Hdi = 0.
Since dl⊤Hdi = 0 for l ≠ i, this yields
α^i_{k+1} = −(ξk+1⊤Hdi)/(di⊤Hdi), for i = 0, . . . , k. (17.1)
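A small Python sketch of this conjugation procedure, using the coefficients in (17.1), is given below; the matrix H and the starting vectors ξ are hypothetical.

import numpy as np

def h_conjugate_directions(H, xi):
    # Build H-conjugate directions from the linearly independent columns
    # of xi, using the Gram-Schmidt-like coefficients in (17.1).
    n = xi.shape[1]
    d = [xi[:, 0].copy()]
    for k in range(1, n):
        new = xi[:, k].copy()
        for di in d:
            new -= (xi[:, k] @ H @ di) / (di @ H @ di) * di
        d.append(new)
    return np.column_stack(d)

H = np.array([[8.0, 4.0], [4.0, 8.0]])     # hypothetical positive definite H
D = h_conjugate_directions(H, np.eye(2))
print(D[:, 0] @ H @ D[:, 1])               # approx. 0, i.e., H-conjugate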
The next piece required for developing a method that could exploit conjugacy is the definition of
what collection of linearly independent vectors ξ0 , . . . , ξn−1 could be used to generate conjugate
directions. In the setting of developing an unconstrained optimisation method, the gradients
∇f (xk ) can play this part, which is the key result in Theorem 17.2.
The proof of this theorem is based on the idea that, for a given collection of conjugate directions d0, . . . , dk, xk is optimal in the space spanned by d0, . . . , dk, meaning that the partial derivatives of F(λ) along these directions are zero. This phenomenon is sometimes called the expanding manifold property, since at each iteration the subspace L(d0, . . . , dk) expands by one independent (conjugate) direction at a time. To verify the second point, notice that the optimality condition for λj ∈ arg min {Fj(λj)} is dj⊤∇f(x0 + λjdj) = 0.
We now have all the parts required for describing the conjugate gradient method. The method uses the gradients ∇f(xk) as the linearly independent vectors from which conjugate directions are generated, and these are then used as search directions dk.
Specifically, the method generates a sequence of iterates
xk+1 = xk + λkdk,
where d0 = −∇f(x0). Given a current iterate xk+1 with ∇f(xk+1) ≠ 0, we use the Gram-Schmidt procedure, in particular (17.1), to generate a conjugate direction dk+1 by taking the linearly independent vector ξk+1 = ∇f(xk+1). Thus, we obtain
dk+1 = −∇f(xk+1) + αkdk, with αk = (∇f(xk+1)⊤Hdk)/(dk⊤Hdk). (17.2)
Notice that, since ∇f(xk+1) − ∇f(xk) = H(xk+1 − xk) = λkHdk and dk = −∇f(xk) + αk−1dk−1, αk can be simplified to
αk = (∇f(xk+1)⊤Hdk)/(dk⊤Hdk)
= (∇f(xk+1)⊤(∇f(xk+1) − ∇f(xk)))/((−∇f(xk) + αk−1dk−1)⊤(∇f(xk+1) − ∇f(xk)))
= ||∇f(xk+1)||²/||∇f(xk)||²,
where the last relation follows from Theorem 17.2. Algorithm 23 summarises the conjugate gradient
method.
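Below is a Python sketch of the resulting method with the simplified coefficient αk = ||∇f(xk+1)||²/||∇f(xk)||² (the Fletcher-Reeves form); the backtracking Armijo line search from the previous chapter is repeated so the block is self-contained.

import numpy as np

def armijo_step(f, grad_f, x, d, alpha=0.1, beta=0.7):
    lam, fx, slope = 1.0, f(x), grad_f(x) @ d
    while f(x + lam * d) > fx + alpha * lam * slope:
        lam *= beta
    return lam

def conjugate_gradient(f, grad_f, x0, eps=1e-6, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    g = grad_f(x)
    d = -g                                  # d0 = -grad f(x0)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= eps:
            break
        x = x + armijo_step(f, grad_f, x, d) * d
        g_new = grad_f(x)
        alpha = (g_new @ g_new) / (g @ g)   # simplified coefficient from (17.2)
        d = -g_new + alpha * d
        g = g_new
    return x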
Figure 17.3: Conjugate gradient method applied to f. Convergence is observed in 24 steps using exact line search and 28 using the Armijo rule (ϵ = 10⁻⁶).
Notice that each new direction dk+1 combines the steepest descent direction −∇f(xk+1) and the previous direction αkdk, which naturally compensates for the curvature encoded in the original matrix H (which is the Hessian of the quadratic approximation).
pk = λk dk = xk+1 − xk
qk = ∇f (xk+1 ) − ∇f (xk ) = H(xk+1 − xk ) = Hpk .
Starting from an initial guess D0 , quasi-Newton methods progress by successively updating Dk+1 =
Dk +Ck , with Ck being such that it only uses the information in pk and qk and that, after n updates,
Dn converges to H −1 .
For that to be the case, we require that pj, j = 1, . . . , k, are eigenvectors of Dk+1H with unit eigenvalue, that is,
Dk+1Hpj = pj, j = 1, . . . , k. (17.3)
This condition guarantees that, at the last iteration, Dn = H⁻¹. To see that, notice that, since qj = Hpj, (17.3) is equivalent to
Dk+1qj = pj, j = 1, . . . , k,
and, writing Dk+1 = Dk + Ck,
Dkqj + Ckqj = pj, j = 1, . . . , k.
For j = 1, . . . , k − 1, the previous updates already satisfy DkHpj = pj, and thus
pj = DkHpj + Ckqj = pj + Ckqj, j = 1, . . . , k − 1,
implying that Ckqj = 0 for j = 1, . . . , k − 1. For j = k, the condition becomes
Dk+1qk = pk, i.e., Dkqk + Ckqk = pk, i.e., (Dk + Ck)qk = pk. (17.4)
Condition (17.4) is called the secant condition as a reference to the approximation to the second-
order derivative. Another way of understanding the role this condition has is by noticing the
following.
Dk+1qk = pk
Dk+1(∇f(xk+1) − ∇f(xk)) = xk+1 − xk
∇f(xk+1) = ∇f(xk) + Dk+1⁻¹(xk+1 − xk), (17.5)
where Dk+1⁻¹ can be seen as an approximation of the Hessian H, just as Dk+1 is an approximation of H⁻¹. Now, consider the second-order approximation of f at xk:
q(x) = f(xk) + ∇f(xk)⊤(x − xk) + (1/2)(x − xk)⊤H(xk)(x − xk).
We can now notice the resemblance that condition (17.5) holds with the first-order condition of this approximation at xk+1,
∇q(xk+1) = ∇f(xk) + H(xk)(xk+1 − xk).
In other words, at each iteration, the updates are made such that the optimality conditions in terms of the quadratic expansion remain valid.
The Davidon-Fletcher-Powell (DFP) method is a classical quasi-Newton method. It employs updates of the form
Dk+1 = Dk + CkDFP = Dk + pkpk⊤/(pk⊤qk) − Dkqkqk⊤Dk/(qk⊤Dkqk).
We can verify that CkDFP satisfies conditions (17.3) and (17.4). For that, notice that
(1) CkDFP qj = CkDFP Hpj = pk(pk⊤Hpj)/(pk⊤qk) − Dkqk(pk⊤HDkHpj)/(qk⊤Dkqk) = 0, for j = 1, . . . , k − 1;
(2) CkDFP qk = pk(pk⊤qk)/(pk⊤qk) − Dkqk(qk⊤Dkqk)/(qk⊤Dkqk) = pk − Dkqk.
The main difference between available quasi-Newton methods is the nature of the matrix C em-
ployed in the updates. Over the years, several ideas emerged in terms of generating updates
that satisfied the above properties. The most widely used quasi-Newton method is the Broyden-
Fletcher-Goldfarb-Shanno (BFGS), which has been widely shown to have remarkable practical
performance. BFGS is part of the Broyden family of updates, given by
CkB = CkDFP + ϕ τkvkvk⊤/(pk⊤qk),
where vk = pk − (1/τk)Dkqk, τk = qk⊤Dkqk/(pk⊤qk), and ϕ ∈ [0, 1]. The extra term in the Broyden family of updates is designed to help mitigate numerical difficulties arising from near-singular approximations.
It can be shown that all updates from the Broyden family also satisfy the quasi-Newton conditions
(17.3) and (17.4). The BFGS update is obtained for ϕ = 1, which renders
CkBFGS = (pkpk⊤/(pk⊤qk))(1 + qk⊤Dkqk/(pk⊤qk)) − (Dkqkpk⊤ + pkqk⊤Dk)/(pk⊤qk).
The BFGS method is often presented as explicitly approximating the Hessian H instead of its inverse, which is useful when using specialised linear algebra packages that rely on the "backslash" operator to solve linear systems of equations. Let Bk be the current approximation of H. Then Dk+1 = Bk+1⁻¹ = (Bk + C̄kBFGS)⁻¹, with
C̄kBFGS = qkqk⊤/(qk⊤pk) − Bkpkpk⊤Bk/(pk⊤Bkpk).
The update for the inverse Hessian H −1 can then be obtained using the Sherman-Morrison formula.
Figure 17.4 illustrates the behaviour of the BFGS method when applied to solve the example problem, using both exact and inexact line searches. Notice how the combination of imprecision in the approximation of H⁻¹ and in the line search makes the search noisy. This combination (BFGS combined with the Armijo rule) is, however, widely used in efficient implementations of several nonlinear optimisation methods.
A variant of BFGS, called the limited memory BFGS (l-BFGS) utilises efficient implementations
that do not require storing the whole approximation for the Hessian, but only a few most recent
pk and qk vectors.
Figure 17.4: BFGS method applied to f. Convergence is observed in 11 steps using exact line search and 36 using the Armijo rule (ϵ = 10⁻⁶).
We focus on three key properties that one should be aware of when employing the methods we have seen to solve optimisation problems. The first two, complexity and convergence, refer to the algorithm itself, but often involve considerations related to the function being optimised. Conditioning, on the other hand, is a characteristic exclusively related to the problem at hand. Knowing how the "three C's" can influence performance is central to making good choices in terms of which optimisation method to employ.
17.2.1 Complexity
Algorithm complexity analysis is a discipline from computer science that focuses on deriving worst-case guarantees in terms of the number of computational steps required for an algorithm to converge, given an input of known size. For that, we use the following definition to identify efficient, generally referred to as polynomial, algorithms.
Algorithm A is polynomial for a problem P if f∗A(n) = O(n^p) for some integer p.
Notice that this sort of analysis only renders bounds on worst-case performance. Though it can be informative in a general setting, there are several well-known examples in which experimental practice does not correlate with the complexity analysis. One famous example is the simplex method for linear optimisation problems, which, despite not being a polynomial algorithm, presents widely demonstrated reliable (polynomial-like) performance.
17.2.2 Convergence
In the context of optimisation, local analysis is typically more informative regarding the behaviour of optimisation methods. This analysis tends to disregard the initial steps, farther from the initial point, and concentrates on the behaviour of the sequence {xk} as it converges to a unique point x̄.
The convergence is analysed by means of rates of convergence associated with error functions e : Rn 7→ R such that e(x) ≥ 0. Typical choices for e include e(x) = ||x − x̄|| and e(x) = f(x) − f(x̄).
The sequence {e(xk)} is then compared to the geometric progression β^k, k = 1, 2, . . . , with β ∈ (0, 1). We say that a method presents linear convergence if there exist q > 0 and β ∈ (0, 1) such that e(xk) ≤ qβ^k for all k. An alternative way of posing this result is stating that
lim sup_{k→∞} e(xk+1)/e(xk) ≤ β.
We say that an optimisation method converges superlinearly if the rate of convergence tends to zero, that is, if there exist β ∈ (0, 1), q > 0, and p > 1 such that e(xk) ≤ qβ^(p^k) for all k. For p = 2, we say that the method presents quadratic convergence. Any p-order convergence is obtained if
lim sup_{k→∞} e(xk+1)/e(xk)^p < ∞, which is true if lim sup_{k→∞} e(xk+1)/e(xk) = 0.
Linear convergence is the most typical convergence rate for nonlinear optimisation methods, which is satisfactory if β is not too close to one. Certain methods are capable of achieving superlinear convergence for certain problems, Newton's method being an important example.
In light of this discussion, let us analyse the convergence rate of some of the methods presented earlier. We start by posing the convergence of the gradient method.
Theorem 17.4 (Convergence of the gradient method). Let f(x) = (1/2)x⊤Hx, where H is a positive definite symmetric matrix. Suppose f(x) is minimised with the gradient method using an exact line search. Let λmin = min_{i=1,...,n} λi and λmax = max_{i=1,...,n} λi, where the λi are the eigenvalues of H. Then, for all k,
f(xk+1)/f(xk) ≤ ((λmax − λmin)/(λmax + λmin))².
Theorem 17.4 implies that, under certain assumptions, the gradient method presents linear convergence. Moreover, this result shows that the convergence rate depends on the scaling of the function, since it depends on the ratio of the eigenvalues of H, which in turn can be modified by scaling f. This result exposes an important shortcoming of gradient methods: the dependence on the conditioning of the problem, which we will discuss shortly. Moreover, this result can be extended to incorporate functions other than quadratics, as well as inexact line searches.
The convergence of Newton’s method is also of interest since, under specific circumstances, it
presents a quadratic convergence rate. Theorem 17.5 summarises these conditions.
Theorem 17.5 (Convergence of Newton's method - general case). Let g : Rn 7→ Rn be differentiable, x̄ such that g(x̄) = 0, and let {e(xk)} = {||xk − x̄||}. Moreover, let Nδ(x̄) = {x : ||x − x̄|| ≤ δ} for some δ > 0. Then
Figure 17.5: Convergence comparison for the four methods
1. There exists δ > 0 such that if x0 ∈ Nδ(x̄), the sequence {xk} with xk+1 = xk − (∇g(xk)⊤)⁻¹g(xk) belongs to Nδ(x̄) and converges to x̄, while {e(xk)} converges superlinearly.
2. If for some L > 0, M > 0, and for all x, y ∈ Nδ(x̄), λ ∈ (0, δ],
Notice that the convergence of the method is analysed in two distinct phases. In the first phase, referred to as the 'damped' phase, superlinear convergence is observed within the neighbourhood Nδ(x̄) defined by δ. The second phase is where quadratic convergence is observed, and it happens when δ < 2/(LM), which in practice can only be interpreted as 'small enough', since the constants L (the Lipschitz constant) and M (a finite bound for the norm of the Hessian) cannot be easily estimated in practical applications.
However, it is interesting to notice that the convergence result for Newton's method does not depend on the scaling of the problem, unlike that of the gradient method. This property, called affine invariance, is one of the greatest features that Newton's method possesses.
Figure 17.5 compares the convergence of the four methods presented, considering f(x) = e^(−(x1−3)/2) + e^((4x2+x1)/10) + e^((−4x2+x1)/10), employing exact line search and using e(xk) = ||xk − x̄||. Notice how the quadratic convergence of Newton's method compares with the linear convergence of the gradient method. The other two, conjugate gradients and BFGS, present superlinear convergence.
17.2.3 Conditioning
The condition number of a symmetric matrix A is given by
κ = ||A||₂||A⁻¹||₂ = max_{i=1,...,n}{λi} / min_{i=1,...,n}{λi},
where the λi are the eigenvalues of A.
The condition number κ is an important measure in optimisation, since it can be used to predict how badly scaled a problem might be. Large κ values mean that numerical errors will be amplified after repeated iterations, in particular matrix inversions. Roughly speaking, having κ ≥ 10^k means that, at each iteration, k digits of accuracy may be lost. As a general rule, one would prefer smaller κ values, but what constitutes a good value is entirely problem dependent.
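To make the link between κ and convergence concrete, the sketch below computes the condition number of the quadratic shown in Figure 17.6 and the corresponding per-iteration bound from Theorem 17.4.

import numpy as np

H = np.array([[1.0, 0.0], [0.0, 10.0]])   # Hessian of f(x) = (1/2)(x1^2 + 10 x2^2)
eig = np.linalg.eigvalsh(H)
kappa = eig.max() / eig.min()             # condition number: 10
rate = ((kappa - 1) / (kappa + 1)) ** 2   # gradient-method bound from Theorem 17.4
print(kappa, rate)                        # 10.0 and approx. 0.669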
One way of understanding the role that the condition number κ has is to think of the role that the eigenvalues of the Hessian play in the shape of the level curves of quadratic approximations of a general function f : Rn 7→ R. First, consider the case where the Hessian H(x) at a given point x ∈ Rn is the identity matrix I, for which all eigenvalues are 1 and the eigenvectors are ei, i = 1, . . . , n, where ei is the vector with component 1 in position i and zero everywhere else. This means that, in the directions of the n eigenvectors, the ellipsoid formed by the level curves (specifically, the lower level sets) of f stretches by the same magnitude and, therefore, the level curves of the quadratic approximation are in fact circles. Now, suppose that in one of the dimensions i, the matrix H(x) has an eigenvalue greater than 1. What we would see is that the level curves of the quadratic approximation are more stretched in that dimension i than in the others. The reason is that the Hessian plays a role akin to that of the characteristic matrix of an ellipsoid (specifically, due to the second-order term (1/2)(x − xk)⊤H(xk)(x − xk) in the quadratic approximation).
Thus, a larger κ means that the ratio between the largest and smallest eigenvalues is larger, which in turn implies eccentricity in the lower level sets (i.e., the lower level sets are far wider in one direction than in the others). This ultimately makes first-order methods struggle, since the gradients often point in directions that only provide descent for small step sizes.
[Figure panels: gradient-method trajectories on the level curves of quadratics with different condition numbers, including f(x) = (1/2)(x1² + 10x2²), for which κ = 10.]
Figure 17.6: The gradient method with exact line search for different κ.
Figure 17.6 illustrates the effect of different condition numbers on the performance of the gradient method. As can be seen, the method requires more iterations for higher condition numbers, in accordance with the convergence result presented in Theorem 17.4.
Chapter 18
Constrained optimality conditions
In particular, we are interested in understanding the role that the feasibility set S has on the
optimality conditions of constrained optimisation problems in the form of P . Let us first define
two geometric elements that we will use to derive the optimality conditions for P .
Definition 18.1 (cone of feasible directions). Let S ⊆ Rn be a nonempty set, and let x̄ ∈ clo(S). The cone of feasible directions D at x̄ is given by
D = {d : d ≠ 0, and x̄ + λd ∈ S for all λ ∈ (0, δ), for some δ > 0}.
Definition 18.2 (cone of descent directions). Let S ⊆ Rn be a nonempty set, f : Rn → R, and x̄ ∈ clo(S). The cone of improving (i.e., descent) directions F at x̄ is
F = {d : f(x̄ + λd) < f(x̄) for all λ ∈ (0, δ), for some δ > 0}.
These cones are geometrical descriptions of the regions from which, starting at a given point x̄, one can obtain feasible (D) and improving (F) solutions. This is useful in that it allows us to express the optimality of x̄ as the condition F ∩ D = ∅. In other words, x̄ is optimal if there exists no feasible direction that can provide an improvement in the objective function value.
Although having a geometrical representation of such sets can be useful in solidifying the conditions under which a feasible solution is also optimal, we need an algebraic representation of them that can be used in computations. To reach that objective, let us start by defining an algebraic representation for F. For that, assume that f : S ⊂ Rn 7→ R is differentiable. Recall that d is a descent direction at x̄ if ∇f(x̄)⊤d < 0. Thus, we can define the set
F0 = {d : ∇f(x̄)⊤d < 0}
as an algebraic representation of F. Notice that F0 is an open half-space formed by the hyperplane with normal ∇f(x̄). Figure 18.1 illustrates the condition F0 ∩ D = ∅. Theorem 18.3 establishes that the condition F0 ∩ D = ∅ is necessary for optimality in constrained optimisation problems.
Theorem 18.3 (geometric necessary condition). Let S ⊆ Rn be a nonempty set, and let f : S → R be differentiable at x̄ ∈ S. If x̄ is a local optimal solution to
(P) : min. {f(x) : x ∈ S},
then F0 ∩ D = ∅, where F0 = {d : ∇f(x̄)⊤d < 0} and D is the cone of feasible directions.
Figure 18.1: Illustration of the cones F0 and D for the optimal point x̄. Notice that D is an open set.
The proof of this theorem consists of using the separation theorem to show that F0 ∩ D = ∅ implies that the first-order optimality condition ∇f(x̄)⊤d ≥ 0 holds.
As discussed earlier (in Lecture 4), in the presence of convexity, this condition becomes sufficient for optimality. Moreover, if f is strictly convex, then F = F0. If f is linear, it might be worth considering F0′ = {d ≠ 0 : ∇f(x̄)⊤d ≤ 0} to allow for orthogonal directions to be considered.
(P ) : min. f (x)
s.t.: gi (x) ≤ 0, i = 1, . . . , m
x ∈ X,
The use of G0 is a convenient algebraic representation, since it can be shown that G0 ⊆ D, as stated in Lemma 18.4. As F0 ∩ D = ∅ must hold for a local optimal solution x̄ ∈ S, it follows that F0 ∩ G0 = ∅ must also hold.
Lemma 18.4. Let S = {x ∈ X : gi(x) ≤ 0 for all i = 1, . . . , m}, where X ⊂ Rn is a nonempty open set and gi : Rn → R is a differentiable function for all i = 1, . . . , m. For a feasible point x̄ ∈ S, let I = {i : gi(x̄) = 0} be the index set of the binding (or active) constraints, and let
G0 = {d : ∇gi(x̄)⊤d < 0, i ∈ I}.
Then G0 ⊆ D.
In settings in which gi is affine for some i ∈ I, it might be worth considering G0′ = {d ≠ 0 : ∇gi(x̄)⊤d ≤ 0, i ∈ I}, so that orthogonal feasible directions can also be represented. Notice that in this case D ⊆ G0′.
Proof. Since x̄ solves P locally, Theorem 18.3 guarantees that there is no d such that ∇f(x̄)⊤d < 0 and ∇gi(x̄)⊤d < 0 for each i ∈ I. Let A be the matrix whose rows are ∇f(x̄)⊤ and ∇gi(x̄)⊤ for i ∈ I. Using Farkas' theorem, we can show that if Ad < 0 is inconsistent, then there exists a nonzero p ≥ 0 such that A⊤p = 0. Letting p = (u0, ui1, . . . , ui|I|) for I = {i1, . . . , i|I|} and making ui = 0 for i ∉ I, the result follows.
The proof considers that, if x̄ is optimal, then ∇f(x̄)⊤d ≥ 0 holds, and the matrix A formed by stacking the rows ∇f(x̄)⊤ and ∇gi(x̄)⊤ for i ∈ I = {i1, . . . , i|I|}, that is,
A = [∇f(x̄), ∇gi1(x̄), . . . , ∇gi|I|(x̄)]⊤,
admits no d with Ad < 0. This is used with a variant of Farkas' theorem (known as Gordan's theorem) to show that the alternative system A⊤p = 0, with p ≥ 0, has a nonzero solution. Setting p = [u0, ui1, . . . , ui|I|] and enforcing that the remaining gradients ∇gi(x̄), for i ∉ I, are removed by setting ui = 0 leads precisely to the Fritz-John conditions.
The multipliers ui, for i = 0, . . . , m, are named Lagrangian multipliers due to their connection with Lagrangian duality, as we will see later. Also, notice that for nonbinding constraints (gi(x̄) < 0 for i ∉ I), ui must be zero for the Fritz-John conditions to hold. This condition is named complementary slackness.
The Fritz-John conditions are unfortunately too weak, which is a problematic issue in some rather common settings. A point x̄ satisfies the Fritz-John conditions if and only if F0 ∩ G0 = ∅, which is trivially satisfied when G0 = ∅.
For example, the Fritz-John conditions are trivially satisfied at points where some of the gradients vanish (i.e., ∇f(x̄) = 0 or ∇gi(x̄) = 0 for some i = 1, . . . , m). Sets with no relative interior in the immediate vicinity of x̄ also trivially satisfy the Fritz-John conditions.
An interesting case is that of problems with equality constraints, as illustrated in Figure 18.2. In general, if the additional regularity condition that the gradients ∇gi(x̄) are linearly independent does not hold, x̄ trivially satisfies the Fritz-John conditions.
Figure 18.2: All points in the blue segment satisfy the FJ conditions, including the minimum x̄.
The Karush-Kuhn-Tucker (KKT) conditions can be understood as the Fritz-John conditions with an extra requirement of regularity for x̄ ∈ S. This regularity requirement is called constraint qualification and, in a general sense, is meant to prevent the trivial case G0 = ∅, thus making the optimality conditions stronger (i.e., more stringent).
This is achieved by making u0 = 1 in Theorem 18.5, which ultimately requires that the gradients ∇gi(x̄) for i ∈ I be linearly independent. This condition is called the linear independence constraint qualification (LICQ) and is one of several known constraint qualifications that can be used to guarantee the regularity of x̄ ∈ S.
Theorem 18.6 establishes the KKT conditions as necessary for the local optimality of x̄, assuming that LICQ holds. For notational simplicity, let us assume for now that
Proof. By Theorem 18.5, there exist (ûi), not all zero, for i ∈ {0} ∪ I such that
û0∇f(x̄) + ∑_{i∈I} ûi∇gi(x̄) = 0,
ûi ≥ 0, i ∈ {0} ∪ I.
Note that û0 > 0, as the linear independence of the ∇gi(x̄) for i ∈ I implies that ∑_{i∈I} ûi∇gi(x̄) ≠ 0 whenever the ûi, i ∈ I, are not all zero. Now, let ui = ûi/û0 for each i ∈ I and ui = 0 for all i ∉ I.
The proof builds upon the Fritz-John conditions: under the assumption that the gradients of the active constraints ∇gi(x̄), i ∈ I, are linearly independent, the multipliers ûi can be rescaled so that u0 = 1.
The general conditions, including both inequality and equality constraints, are posed as follows. Notice that the Lagrange multipliers vi associated with the equality constraints hi(x) = 0, i = 1, . . . , l, are unrestricted in sign, and that the complementary slackness condition is not explicitly stated for them, since it holds redundantly. These conditions can be obtained by replacing each equality constraint hi(x) = 0 with the two equivalent inequalities hi(x) ≤ 0 and −hi(x) ≤ 0 and writing the conditions in Theorem 18.6. Also, notice that, in the absence of constraints, the KKT conditions reduce to the unconstrained first-order condition ∇f(x̄) = 0.
∇f(x̄) + ∑_{i=1}^m ui∇gi(x̄) + ∑_{i=1}^l vi∇hi(x̄) = 0 (dual feasibility 1)
uigi(x̄) = 0, i = 1, . . . , m (complementary slackness)
x̄ ∈ X, gi(x̄) ≤ 0, i = 1, . . . , m (primal feasibility)
hi(x̄) = 0, i = 1, . . . , l (primal feasibility)
ui ≥ 0, i = 1, . . . , m (dual feasibility 2)
Figure 18.3: Graphical illustration of the KKT conditions at the optimal point x̄.
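As an illustration, the sketch below verifies the KKT conditions numerically at the point x̄ = (1, 3) of the first example from Chapter 15 (where both inequality constraints are active), solving the dual feasibility equation for the multipliers.

import numpy as np

x = np.array([1.0, 3.0])                                  # candidate point; both constraints active
grad_f = np.array([2 * (x[0] - 1.5), 2 * (x[1] - 5.0)])   # f = (x1 - 3/2)^2 + (x2 - 5)^2
grad_g1 = np.array([-1.0, 1.0])                           # g1 = -x1 + x2 - 2
grad_g2 = np.array([2.0, 3.0])                            # g2 = 2x1 + 3x2 - 11
# Solve grad_f + u1*grad_g1 + u2*grad_g2 = 0 for (u1, u2):
u = np.linalg.solve(np.column_stack([grad_g1, grad_g2]), -grad_f)
print(u)   # [1. 1.]: both multipliers nonnegative, so x satisfies the KKT conditions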
Specifically, constraint qualification can be seen as a certification that the geometry of the feasible region and the gradient information obtained from the constraints that form it are properly related at an optimal solution. Recall that gradients can only provide a first-order approximation of the feasible region, which might lead to mismatches. This is typically the case when the feasible region has cusps or consists of a single feasible point.
Constraint qualifications can be seen as certificates of a proper relationship between the set of feasible directions
G0′ = {d ≠ 0 : ∇gi(x̄)⊤d ≤ 0, i ∈ I}
and the tangent cone T of the feasible region at x̄; the condition T = G0′ is known as the Abadie constraint qualification.
Figure 18.4: CQ holds for 18.4a and 18.4b, since the tangent cone T and the cone of feasible
directions G′0 (denoted by the dashed black lines and grey area) match; for 18.4c, they do not
match, as T = ∅
In the presence of equality constraints, the condition becomes T = G0′ ∩ H0, with H0 = {d : ∇hi(x̄)⊤d = 0, i = 1, . . . , l}.
Figure 18.4 illustrates the tangent cone T and the cone of feasible directions G0′ for cases in which constraint qualification holds (Figures 18.4a and 18.4b), for which T = G0′, and a case in which it does not (Figure 18.4c, where T = ∅ and G0′ is given by the dashed black line).
The importance of the Abadie constraint qualification is that it allows for generalising the KKT conditions by replacing the requirement of linear independence of the gradients ∇gi(x̄) for i ∈ I. This allows us to state the KKT conditions as presented in Theorem 18.7.
Theorem 18.7 (Karush-Kuhn-Tucker necessary conditions II). Consider the problem
(P) : min. {f(x) : gi(x) ≤ 0, i = 1, . . . , m, x ∈ X}.
Let X ⊆ Rn be a nonempty open set, and let f : Rn → R and gi : Rn → R be differentiable for all i = 1, . . . , m. Additionally, for a feasible x̄, let I = {i : gi(x̄) = 0} and suppose that the Abadie CQ holds at x̄. If x̄ solves P locally, there exist scalars ui for i ∈ I such that
∇f(x̄) + ∑_{i=1}^m ui∇gi(x̄) = 0
uigi(x̄) = 0, i = 1, . . . , m
ui ≥ 0, i = 1, . . . , m.
Despite being a more general result, Theorem 18.7 is of little practical use, as Abadie's constraint qualification cannot be straightforwardly verified. Alternatively, we can rely on verifiable constraint qualification conditions that imply Abadie's constraint qualification. Examples include:
1. Linear independence (LI)CQ: holds at x̄ if ∇gi(x̄), for i ∈ I, as well as ∇hi(x̄), i = 1, . . . , l, are linearly independent.
2. Affine CQ: holds for all x ∈ S if gi , for all i = 1, . . . , m, and hi , for all i = 1, . . . , l, are
affine.
3. Slater’s CQ: holds for all x ∈ S if gi is a convex function for all i = 1, . . . , m, hi is an affine
function for all i = 1, . . . , l, and there exists x ∈ S such that gi (x) < 0 for all i = 1, . . . , m.
Slater's constraint qualification is the most frequently used, in particular in the context of convex optimisation problems. One important point to notice is the requirement of a nonempty relative interior, the neglect of which can be a source of error.
Consider, for example, P = min. {x1 : x1² + x2 ≤ 0, x2 ≥ 0}. Notice that P is convex, and therefore the KKT system for P at (0, 0) is
(1, 0) + u1(0, 1) + u2(0, −1) = (0, 0); u1, u2 ≥ 0,
which has no solution. Thus, the KKT conditions are not necessary for the global optimality of (0, 0). This is due to the lack of a CQ, since the feasible region is the single point (0, 0), and to the fact that, in the presence of convexity, the KKT conditions are only sufficient (not necessary).
Corollary 18.8 summarises the setting in which one should expect the KKT conditions to be necessary and sufficient for global optimality, i.e., convex optimisation.
Corollary 18.8 (Necessary and sufficient KKT conditions). Suppose that Slater's CQ holds. Then, if f is convex, the conditions of Theorem 18.7 are necessary and sufficient for x̄ to be a global optimal solution.
Chapter 19
Lagrangian duality
Specifically, PR is said to be a relaxation of P if fR(x) bounds f(x) from below (in a minimisation setting) for all x ∈ S and the enlarged feasible region SR contains S.
The motivation for using relaxations arises from the possibility of finding a solution to the original problem P by solving PR. Clearly, such a strategy only makes sense if PR possesses some attractive property or feature that we can use in our favour to, e.g., improve solution times or create separability that can be further exploited using parallelised computation (which we will discuss in more detail in the upcoming lectures). Theorem 19.2 presents the technical result that allows for using relaxations for solving P.
Theorem 19.2 (Relaxation theorem). Let us define
(P) : min. {f(x) : x ∈ S} and (PR) : min. {fR(x) : x ∈ SR}.
If PR is a relaxation of P , then the following hold:
1. if PR is infeasible, so is P ;
2. if xR is an optimal solution to PR such that xR ∈ S and fR (xR ) = f (xR ), then xR is optimal
to P as well.
Proof. Result (1) follows since S ⊆ SR . To show (2), notice that f (xR ) = fR (xR ) ≤ fR (x) ≤ f (x)
for all x ∈ S.
(P ) : min. f (x)
s.t.: g(x) ≤ 0
h(x) = 0
x ∈ X.
For a given set of dual variables (u, v) ∈ Rm+l with u ≥ 0, the Lagrangian relaxation (or Lagrangian dual function) of P is

θ(u, v) := inf_x {ϕ(x, u, v) : x ∈ X},

where

ϕ(x, u, v) := f(x) + u⊤g(x) + v⊤h(x)

is the Lagrangian function.
Notice that the Lagrangian dual function θ(u, v) has a built-in optimisation problem in x, meaning
that evaluating θ(u, v) still requires solving an optimisation problem, which amounts to finding the
minimiser x for ϕ(x, u, v), given (u, v).
The proof uses the fact that the infimum of the Lagrangian function ϕ(x, u, v), and in fact any value of ϕ(x, u, v) for primal feasible x and dual feasible u ≥ 0 (a condition for the Lagrangian relaxation to indeed be a relaxation), bounds f(x) from below. This arises from observing that g(x) ≤ 0 and h(x) = 0 for a feasible x. The Lagrangian dual problem is the problem used to obtain the best possible relaxation bound θ(u, v) for f(x), in light of Theorem 19.3. This can be achieved by optimising θ(u, v) in the space of the dual variables (u, v), that is,

(D) : sup_{u,v} {θ(u, v) : u ≥ 0}.
The use of Lagrangian dual problems is an alternative for dealing with constrained optimisation problems, as they allow one to convert the constrained primal into a (typically) unconstrained dual that is potentially easier to handle or that presents exploitable properties, such as separability, which can benefit specialised algorithms.
Employing Lagrangian relaxations to solve optimisation problems is possible due to the following
important results, which are posed as corollaries of Theorem 19.3.
Corollary 19.4 (Weak Lagrangian duality). sup_{u,v} {θ(u, v) : u ≥ 0} ≤ inf_x {f(x) : g(x) ≤ 0, h(x) = 0, x ∈ X}.

Proof. We have θ(u, v) ≤ f(x) for any feasible x and (u, v), thus implying sup_{u,v} {θ(u, v) : u ≥ 0} ≤ inf_x {f(x) : g(x) ≤ 0, h(x) = 0, x ∈ X}.
Corollary 19.5 (Strong Lagrangian duality). If f (x) = θ(u, v), u ≥ 0, and x ∈ {x ∈ X : g(x) ≤ 0, h(x) = 0},
then x and (u, v) are optimal solutions to P and D, respectively.
Proof. Use part (2) of Theorem 19.2 with D being a Lagrangian relaxation.
Notice that Corollary 19.5 implies that if the optimal values of the primal and the dual problems match, then the respective primal and dual solutions are optimal. However, to use Lagrangian relaxations to solve constrained optimisation problems, we need the converse to also hold; this property is called strong duality and, unfortunately, does not always hold.
To investigate the cases in which strong duality can hold, let us focus on a graphical interpretation
of Lagrangian dual problems. For that, let us first define some auxiliary elements.
For the sake of simplicity, consider (P ) : min. {f (x) : g(x) ≤ 0, x ∈ X} with f : Rn 7→ R, a single
constraint g : Rn 7→ R and X ⊆ Rn an open set.
Let us define the mapping G = {(y, z) : y = g(x), z = f (x), x ∈ X}, which consists of a mapping
of points x ∈ X to the (y, z)-space obtained using (f (x), g(x)). In this setting, solving P means
finding a point with minimum ordinate z for which y ≤ 0. Figure 19.1 illustrates this setting.
Figure 19.1: Illustration of the mapping G, in which one can see that solving P amounts to finding
the lowermost point on the vertical axis (the ordinate) that is still contained within G.
Figure 19.2: Solving the Lagrangian dual problem is the same as finding the coefficient u such that z = α − uy is a supporting hyperplane of G with the uppermost intercept α. Notice that, for the optimal u, the hyperplane supports G at the same point that solves P.
Notice that, for a given u, θ(u) = min_x {f(x) + ug(x) : x ∈ X} = min_{(y,z)∈G} {z + uy}, which can be represented by a hyperplane of the form z = α − uy. Therefore, optimising the Lagrangian dual problem (D) : sup_u {θ(u)} consists of finding the slope −u that achieves the maximum intercept α on the ordinate z while z = α − uy remains a supporting hyperplane for G. Figure 19.2 illustrates this effect. Notice that, in this case, the optimal values of the primal and dual problems
coincide. The perturbation function v(y) = min. {f (x) : g(x) ≤ y, x ∈ X} is an analytical tool
that plays an important role in understanding when strong duality holds, which, in essence, is the
underlying reason why the optimal values of the primal and dual problems coincide.
Specifically, notice that v(y) is the greatest monotone nonincreasing lower envelope of G. Moreover,
the reason why f (x) = θ(u) is related to the convexity of v(y), which implies that
v(y) ≥ v(0) − uy for all y ∈ R.
Notice that this is a consequence of Theorem 12 from Lecture 2 (which states that convex sets have supporting hyperplanes at all points on their boundary) and Theorem 5 from Lecture 3 (which states that convex functions have convex epigraphs).
A duality gap exists when the perturbation function v(y) does not have supporting hyperplanes everywhere in its domain; such hyperplanes are guaranteed to exist when v(y) is convex. Figure 19.3 illustrates a case in which v(y) is not convex and, therefore, θ(u) < f(x).
Figure 19.3: An example in which the perturbation function v(y) is not convex. Notice the
consequent mismatch between the intercept of the supporting hyperplane and the lowermost point
on the ordinate still contained in G.
Let us illustrate the above with two numerical examples. First, consider the problem

(P) : min. {x1² + x2² : −x1 − x2 + 4 ≤ 0, x ∈ R²}.
Figures 19.4a and 19.4b provide a graphical representation of the primal problem P and dual
problem D. As can be seen, both problems have as optimal value f (x1 , x2 ) = θ(u) = 8, with the
optimal solution x = (2, 2) for P and u = 4 for D.
To draw the (g, f) map of X, we proceed as follows. First, notice that

v(y) = min. {x1² + x2² : −x1 − x2 + 4 ≤ y},

which shows that (x1, x2) = (0, 0) if y ≥ 4. For y ≤ 4, v(y) can be equivalently rewritten as

v(y) = min. {x1² + x2² : −x1 − x2 + 4 = y}.
Let h(x) = −x1 − x2 + 4 and f(x) = x1² + x2². Now, the optimality conditions for x to be an optimum for v are such that

∇f(x) + u∇h(x) = 0 ⇒ 2x1 − u = 0 and 2x2 − u = 0 ⇒ x1 = x2 = u/2.
Figure 19.4: The primal problem P as a constrained optimisation problem, and the dual problem
D, as an unconstrained optimisation problem. Notice how the Lagrangian dual function is discontinuous, due to the implicit minimisation in x of θ(u) = inf_{x∈X} ϕ(x, u).
From the definition of h(x), we see that u = 4 − y, and thus x = ((4 − y)/2, (4 − y)/2), which, substituting in f(x), gives v(y) = (4 − y)²/2. Note that v(y) ≥ v(0) − uy holds for all y ∈ R, that is, v(y) is convex. Also, notice that the supporting hyperplane is exactly z = 8 − 4y.
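As a quick numerical check of this example, consider the sketch below (Python with NumPy; the function names and grid are ours, not part of the original formulation). It evaluates the closed-form dual function θ(u) = 4u − u²/2, obtained from the inner minimiser x1 = x2 = u/2, and verifies that it peaks at u = 4 with θ(4) = 8, matching the primal optimal value.

import numpy as np

def theta(u):
    # Inner minimiser of x1^2 + x2^2 + u*(-x1 - x2 + 4) over R^2 is x1 = x2 = u/2,
    # which gives theta(u) = 4u - u^2/2 (a concave function of u).
    return 4.0 * u - 0.5 * u ** 2

us = np.linspace(0.0, 10.0, 1001)
vals = theta(us)
print(us[np.argmax(vals)], vals.max())   # approximately u = 4.0 and theta = 8.0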
Now, let us consider a second example, in which the feasible set is not convex and, therefore, the
mapping G will not be convex either. For that, consider the problem
(P ) : min. − 2x1 + x2
s.t.: x1 + x2 = 3
(x1, x2) ∈ X,
where X = {(0, 0), (0, 4), (4, 4), (4, 0), (1, 2), (2, 1)}. The optimal point is x = (2, 1). The Lagrangian dual function is given by

θ(v) = min_{x∈X} {−2x1 + x2 + v(x1 + x2 − 3)}.
Figure 19.6a provides a graphical representation of the problem. Notice that, to obtain the Lagrangian dual function, one must simply take the lowermost segments of the hyperplanes obtained when considering each x ∈ X, which leads to a concave piecewise linear function, as represented in Figure 19.6b.
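The construction of θ(v) as a lower envelope is straightforward to reproduce; the short sketch below (Python with NumPy; names are ours) evaluates the dual function over a grid and recovers the dual optimum v = 2 with θ(v) = −6.

import numpy as np

X = [(0, 0), (0, 4), (4, 4), (4, 0), (1, 2), (2, 1)]

def theta(v):
    # Lower envelope of one affine function of v per point in X
    return min(-2 * x1 + x2 + v * (x1 + x2 - 3) for (x1, x2) in X)

vs = np.linspace(-2.0, 4.0, 601)
vals = [theta(v) for v in vs]
i = int(np.argmax(vals))
print(vs[i], vals[i])   # dual optimum: v = 2, theta(v) = -6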
Similarly to the previous example, we can plot the G mapping, which in this case consists of the
points x ∈ X mapped as (h(x), f (x)), with h(x) = x1 + x2 − 3 and f (x) = −2x1 + x2 . Notice
that v(y) in this case is discontinuous, represented by the three lowermost points. Clearly, v(y)
Figure 19.5: The G mapping for the first example.
does not have a supporting hyperplane at the minimum of P , which illustrates the existence of a
duality gap, as stated by the fact that −3 = f (x) > θ(v) = −6.
Strong duality
From the previous graphical interpretation and related examples, it becomes clear that there is a
strong tie between strong duality and the convexity of P . This is formally described in Theorem
19.6.
Theorem 19.6. Let X ⊆ Rn be a nonempty convex set. Moreover, let f : Rn → R and g : Rn →
Rm be convex functions, and let h : Rn → Rl be an affine function: h(x) = Ax − b. Suppose that
Slater’s constraint qualification holds true. Then
inf_x {f(x) : g(x) ≤ 0, h(x) = 0, x ∈ X} = sup_{u,v} {θ(u, v) : u ≥ 0},

where θ(u, v) = inf_{x∈X} {f(x) + u⊤g(x) + v⊤h(x)} is the Lagrangian dual function. Furthermore, if inf_x {f(x) : g(x) ≤ 0, h(x) = 0, x ∈ X} is finite and achieved at x, then sup_{u,v} {θ(u, v) : u ≥ 0} is achieved at (u, v) with u ≥ 0 and u⊤g(x) = 0.
The proof of the strong duality theorem follows this outline:
1. Let γ = inf x {f (x) : g(x) ≤ 0, h(x) = 0, x ∈ X}. Suppose that −∞ < γ < ∞, hence finite
(for unbounded problems, f (x) = −∞ implies θ(u, v) = −∞ since θ(u, v) ≤ f (x) from
Theorem 19.3; the right-hand side holds by assumption of the existence of a feasible point
from Slater’s constraint qualification).
2. Formulate the inconsistent system:
f (x) − γ < 0, g(x) ≤ 0, h(x) = 0, x ∈ X.
3. Use the separation theorem (or a variant form of Farkas’ theorem) to show that (u0, u, v) with u0 > 0 and u ≥ 0 exists such that, after scaling by u0, one obtains f(x) + u⊤g(x) + v⊤h(x) ≥ γ for all x ∈ X, and hence θ(u, v) ≥ γ; this step requires the assumption of Slater’s constraint qualification.
Figure 19.6: The primal problem P as a constrained optimisation problem and the dual problem
D. Notice how the Lagrangian dual function is concave and piecewise linear, despite the nonconvex
nature of P .
4. From weak duality (Theorem 19.3), we have that θ(u, v) ≤ γ, which combined with the
above, yields θ(u, v) = γ.
5. Finally, an optimal x solving the primal problem implies that g(x) ≤ 0, h(x) = 0, x ∈ X, and f(x) = γ. From 3, we have u⊤g(x) ≥ 0. As g(x) ≤ 0 and u ≥ 0, we also have u⊤g(x) ≤ 0; hence u⊤g(x) = 0.
The proof uses a variant of Farkas’ theorem that states the existence of a solution (u0, u, v) ≠ 0 for the system u0(f(x) − γ) + u⊤g(x) + v⊤h(x) ≥ 0 for all x ∈ X, which can be shown to be the case if Slater’s constraint qualification holds. This, combined with the weak duality stated in Theorem 19.3, yields strong duality.
Weak duality also provides a useful bound on the quality of a feasible solution x: for any dual feasible (u, v), the true optimal value lies between θ(u, v) and f(x), which is a consequence of f(x) ≥ θ(u, v) (i.e., weak duality). We say that x is ϵ-optimal, with ϵ = f(x) − θ(u, v). In essence, (u, v) is a certificate of (sub-)optimality of x, as (u, v) proves that x is ϵ-optimal. Moreover, in case strong duality holds, under the conditions of Theorem 19.6, one can expect ϵ to converge to zero.
Figure 19.7: The G mapping for the second example. The blue dots represent the perturbation function v(y), which is not convex and thus cannot be supported everywhere. Notice the duality gap represented by the difference between the intercept of z = −6 − 2y and the optimal value of P at (0, −3).
To see why this is the case, observe the following. First, as can be seen in Theorem 19.6, a consequence of strong duality is that the complementarity condition u⊤g(x) = 0 holds for an optimal primal-dual pair (x, (u, v)). Secondly, notice that, by definition, x and (u, v) are primal and dual feasible, respectively.
The last component missing is to notice that, if x is a minimiser for ϕ(x, u, v) = f(x) + u⊤g(x) + v⊤h(x), then we must have

∇f(x) + ∑_{i=1}^m ui∇gi(x) + ∑_{i=1}^l vi∇hi(x) = 0.
Combining the above, one can see that we have listed all of the KKT optimality conditions, which under the assumptions of Theorem 19.6 are known to be necessary and sufficient for global optimality. That is, in this case, any primal-dual pair for which the objective function values match will automatically be a point satisfying the KKT conditions and therefore globally optimal.
This provides an alternative avenue to search for optimal solutions, relying on Lagrangian dual
problems.
This insight allows for the development of methods that can alternatively solve the Lagrangian
dual problem in the space of primal variables x and dual variables (u, v) in a block-coordinate
descent fashion.
Theorem 19.7 establishes the relationship between the existence of saddle points for Lagrangian
dual problems and zero duality gaps.
Theorem 19.7 (Saddle point optimality and zero duality gap). A solution (x, u, v) with x ∈ X and u ≥ 0 is a saddle point for the Lagrangian function ϕ(x, u, v) = f(x) + u⊤g(x) + v⊤h(x) if and only if:

1. ϕ(x, u, v) = min_{x′∈X} ϕ(x′, u, v);
2. g(x) ≤ 0 and h(x) = 0; and
3. u⊤g(x) = 0.

Moreover, (x, u, v) is a saddle point if and only if x and (u, v) are optimal solutions for the primal (P) and dual (D) problems, respectively, with f(x) = θ(u, v).
From Theorem 19.7, it becomes clear that there is a strong connection between the existence of saddle points and the KKT conditions for optimality. Figure 19.8 illustrates the existence of a saddle point and the related zero duality gap.
Figure 19.8: Illustration of a saddle point (x, u, v) of the Lagrangian function, at which f(x) = ϕ(x, u, v) = θ(u, v).
19.3 Properties of Lagrangian functions

Lagrangian duals are a useful framework for devising solution methods for constrained optimisation problems if solving the dual problem can be done efficiently or exposes some exploitable structure.

One important property that Lagrangian dual functions present is that they are concave piecewise linear in the dual multipliers. Moreover, they are continuous and thus have subgradients everywhere. Notice, however, that they are typically not differentiable, requiring the employment of a nonsmooth optimisation method to be appropriately solved. Theorem 19.8 establishes the concavity of the Lagrangian dual function.

Theorem 19.8 (Concavity of Lagrangian dual functions). Let X ⊆ Rn be a nonempty compact set, and let f : Rn → R and β : Rn → Rm+l, with w⊤β(x) = u⊤g(x) + v⊤h(x), be continuous. Then

θ(w) = inf_x {f(x) + w⊤β(x) : x ∈ X}

is concave in Rm+l.
Proof. Since f and β are continuous and X is compact, θ is finite on Rm+l. Let w1, w2 ∈ Rm+l and λ ∈ (0, 1). Then

θ(λw1 + (1 − λ)w2) = inf_x {f(x) + [λw1 + (1 − λ)w2]⊤β(x) : x ∈ X}
= inf_x {λ[f(x) + w1⊤β(x)] + (1 − λ)[f(x) + w2⊤β(x)] : x ∈ X}
≥ λ inf_x {f(x) + w1⊤β(x) : x ∈ X} + (1 − λ) inf_x {f(x) + w2⊤β(x) : x ∈ X}
= λθ(w1) + (1 − λ)θ(w2).
The proof uses the fact that the Lagrangian dual function θ(w) is the infimum of affine functions in w and is therefore concave. An alternative approach to show the concavity of the Lagrangian dual function is to show that it has subgradients everywhere. This is established in Theorem 19.9.
Theorem 19.9. Let X ⊂ Rn be a nonempty compact set, and let f : Rn ↦ R and β : Rn ↦ Rm+l, with w⊤β(x) = u⊤g(x) + v⊤h(x), be continuous. If x̄ ∈ X(w̄) = {x ∈ X : x ∈ arg min {f(x) + w̄⊤β(x) : x ∈ X}}, then β(x̄) is a subgradient of θ(w) at w̄.

Proof. Since f and β are continuous and X is compact, X(w̄) ≠ ∅ for any w̄ ∈ Rm+l. Now, let w ∈ Rm+l and x̄ ∈ X(w̄). Then

θ(w) = inf_x {f(x) + w⊤β(x) : x ∈ X}
≤ f(x̄) + w⊤β(x̄)
= f(x̄) + (w − w̄)⊤β(x̄) + w̄⊤β(x̄)
= θ(w̄) + (w − w̄)⊤β(x̄).
Theorem 19.9 can be used to derive a simple optimisation method for Lagrangian dual functions using subgradient information, which is readily available from the term β(x̄).
One challenging aspect concerning the solution of Lagrangian dual functions is that very often they
are not differentiable. This requires an adaptation of the gradient method to consider subgradient
information instead.
The challenge with using subgradients (instead of gradients) is that subgradients are not guaranteed to be descent directions (as opposed to gradients, which are the steepest descent direction under an adequate norm). Nevertheless, for suitable step size choices, convergence can be observed. Figure 19.9 illustrates the fact that subgradients are not necessarily descent directions.
Algorithm 22 summarises the subgradient method. Notice that the stopping criterion emulates the optimality condition 0 ∈ ∂θ(wk), but, in practice, one also enforces more heuristically driven criteria, such as a maximum number of iterations or a given number of iterations without observable improvement in the value of θ(w).
Figure 19.9: One possible subgradient β(xk) that is a descent direction for a suitable step size. Notice that the subdifferential ∂θ(wk) also contains other subgradients that are not descent directions.
One critical aspect associated with the subgradient method is the step size update described in Step 5 of Algorithm 22. Theoretical convergence is guaranteed if Step 5 generates a sequence {λk} such that ∑_{k=0}^∞ λk = ∞ and lim_{k→∞} λk = 0. However, markedly different performance can be observed for distinct parametrisations of the method.
The classical step update rule employed for the subgradient method is known as the Polyak rule, which is given by

λk+1 = αk (LBk − θ(wk)) / ||β(xk)||²,
with αk ∈ (0, 2) and LBk being the best-available lower-estimate of θ(w). This rule is inspired by
the following result.
Proposition 19.10 (Improving step size). If wk is not optimal, then, for all optimal dual solutions w̄, we have

||wk+1 − w̄|| < ||wk − w̄||

for all step sizes λk such that

0 < λk < 2(θ(w̄) − θ(wk)) / ||β(xk)||².
The result follows from expanding

||wk+1 − w̄||² ≤ ||wk − w̄||² − 2λk(θ(w̄) − θ(wk)) + (λk)²||β(xk)||².

Parametrising the last two terms by γk = λk||β(xk)||² / (θ(w̄) − θ(wk)) leads to

||wk+1 − w̄||² ≤ ||wk − w̄||² − λk(2 − γk)(θ(w̄) − θ(wk)),

and, for λk within the stated interval, γk < 2, making the rightmost term strictly negative.
In practice, since θ(w̄) is not known, it must be replaced by a proxy LBk, which is chosen to be a lower bound on θ(w̄) so as to still satisfy the subgradient inequality. The factor αk is then reduced from the nominal value 2 to correct for the estimation error in the term θ(w̄) − θ(wk).
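To make the mechanics concrete, the sketch below (Python; a minimal illustration under our own naming, reusing the finite-X example above) maximises the dual function of the second example with subgradient steps and a Polyak-style rule. As a target estimate of the optimal dual value we simply plug in the known optimum −6; in practice, one would use the best available bound.

X = [(0, 0), (0, 4), (4, 4), (4, 0), (1, 2), (2, 1)]

def inner(v):
    # Evaluate theta(v); the constraint residual h(x) = x1 + x2 - 3 of the
    # inner minimiser is a subgradient of theta at v (Theorem 19.9).
    val, x = min((-2 * x1 + x2 + v * (x1 + x2 - 3), (x1, x2)) for (x1, x2) in X)
    return val, x[0] + x[1] - 3

v, target, alpha = 0.0, -6.0, 1.0      # alpha_k in (0, 2)
for k in range(50):
    theta_v, g = inner(v)
    if g == 0 or theta_v >= target:    # 0 is a subgradient, or target reached
        break
    lam = alpha * (target - theta_v) / g ** 2   # Polyak-style step size
    v = v + lam * g                    # ascent step along the subgradient
print(v, inner(v)[0])                  # approaches v = 2, theta(v) = -6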
Chapter 20
Penalty methods
Penalty methods consider solving the constrained problem P by means of a penalised unconstrained problem of the form min. {f(x) + µα(x) : x ∈ X}, where µ > 0 is a penalty term and α(x) : Rn ↦ R is a penalty function of the form
α(x) = ∑_{i=1}^m ϕ(gi(x)) + ∑_{i=1}^l ψ(hi(x)). (20.1)
For α(x) to be a suitable penalty function, ϕ : R ↦ R and ψ : R ↦ R must be continuous and satisfy

ϕ(y) = 0 if y ≤ 0 and ϕ(y) > 0 if y > 0;
ψ(y) = 0 if y = 0 and ψ(y) > 0 if y ≠ 0.

Typical options are ϕ(y) = ([y]⁺)^p with p ∈ Z₊ and ψ(y) = |y|^p with p = 1 or p = 2, where [y]⁺ = max{0, y}.
Figure 20.1 illustrates the solution of (P) : min. {x1² + x2² : x1 + x2 = 1, x ∈ R²} using a penalty-based approach. Using α(x1, x2) = (x1 + x2 − 1)², the penalised auxiliary problem Pµ becomes

(Pµ) : min. {x1² + x2² + µ(x1 + x2 − 1)² : x ∈ R²}.

Since fµ is convex and differentiable, the first-order conditions are necessary and sufficient for optimality, yielding

x1 + µ(x1 + x2 − 1) = 0
x2 + µ(x1 + x2 − 1) = 0,

which gives x1 = x2 = µ/(2µ + 1).
One can notice that, as µ increases, the solution of the unconstrained penalised problem, represented by the level curves, becomes closer to the optimum of the original constrained problem P, represented by the dot on the hyperplane defined by x1 + x2 = 1.
Figure 20.1: Solving the constrained problem P (top left) by gradually increasing the penalty term
µ (0.5, 1, and 5, in clockwise order)
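The behaviour in Figure 20.1 is easy to reproduce numerically. The sketch below (Python; our own minimal implementation, using the closed-form minimiser x1 = x2 = µ/(2µ + 1) derived above) increases µ until the infeasibility |x1 + x2 − 1| falls below a tolerance.

mu = 0.5
for _ in range(30):
    t = mu / (2 * mu + 1)          # minimiser of x1^2 + x2^2 + mu*(x1 + x2 - 1)^2
    infeas = abs(2 * t - 1)        # infeasibility |h(x_mu)| = |x1 + x2 - 1|
    print(f"mu = {mu:10.1f}  x = ({t:.6f}, {t:.6f})  |h(x)| = {infeas:.2e}")
    if infeas < 1e-6:
        break
    mu *= 4                         # increase the penalty term

Note how the iterates approach (1/2, 1/2) only asymptotically: |h(xµ)| = 1/(2µ + 1), so driving the infeasibility to zero requires µ → ∞, anticipating the convergence result discussed next.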
A similar geometrical analysis to that performed with the Lagrangian duals can be employed for understanding how penalised problems can obtain optimal solutions. For that, let us consider the problem from the previous example, (P) : min. {x1² + x2² : x1 + x2 = 1, x ∈ R²}. Let G be the mapping {[h(x), f(x)] : x ∈ R²}, and let v(ϵ) = min. {x1² + x2² : x1 + x2 − 1 = ϵ, x ∈ R²}. The optimal solution is x1 = x2 = (1 + ϵ)/2 with v(ϵ) = (1 + ϵ)²/2.
Minimising f(x) + µh(x)² consists of moving the curve f + µh² = k downwards until a single contact point ϵµ remains. One can notice that, as µ → ∞, f + µh² becomes sharper (µ2 > µ1 in Figure 20.2), and ϵµ converges to zero.
Figure 20.2: Geometric representation of penalised problems in the mapping G = [h(x), f (x)]
At the contact point, denoting kµ := f(xµ) + µh(xµ)², we have that

1. v(ϵµ) = f(xµ) = kµ − µϵµ², and
2. v′(ϵµ) = ∂/∂ϵ (kµ − µϵ²)|_{ϵ=ϵµ} = −2µϵµ.

Therefore, (h(xµ), f(xµ)) = (ϵµ, v(ϵµ)). That is, the parabolic function f = kµ − µϵ² matches v(ϵ) at ϵ = ϵµ and has slope −2µϵµ there, matching that of v(ϵ) at that point.
The penalty method consists of solving a sequence of problems of the form min. {f(x) + µkα(x) : x ∈ X}, where {µk} is an increasing sequence of penalty terms and α(x) is a penalty function as defined in (20.1). For that to be possible, we first need a convergence result guaranteeing that the penalised solutions approach an optimal solution of the original problem. In practice, that would mean that µk can be increased at each iteration k until a suitable tolerance is achieved. Theorem 20.1 states the convergence of penalty-based methods.

Theorem 20.1 (Convergence of penalty methods). Let f, g, and h be continuous functions and X ⊆ Rn a nonempty compact set, and let α(x) be a penalty function as in (20.1). Then

inf_x {f(x) : g(x) ≤ 0, h(x) = 0, x ∈ X} = sup_{µ≥0} θ(µ) = lim_{µ→∞} θ(µ),

where θ(µ) = min_x {f(x) + µα(x) : x ∈ X} = f(xµ) + µα(xµ). Also, the limit x of any convergent subsequence of {xµ} is optimal to the original problem, and µα(xµ) → 0 as µ → ∞.
Proof. We first show that θ(µ) is a nondecreasing function of µ. Let 0 < λ < µ. From the definition of θ(µ), we have that

f(xµ) + λα(xµ) ≥ f(xλ) + λα(xλ). (20.2)
Adding and subtracting µα(xµ) on the left side of (20.2), we conclude that θ(µ) ≥ θ(λ). Now, for x ∈ X with g(x) ≤ 0 and h(x) = 0, notice that α(x) = 0. This implies that

f(x) = f(x) + µα(x) ≥ θ(µ) for all µ ≥ 0, (20.3)

and, therefore, θ(µ) is bounded above, and thus supµ≥0 θ(µ) = limµ→∞ θ(µ). For that to be the case, we must have that µα(xµ) → 0 as µ → ∞. Moreover, we notice from (20.3) that

supµ≥0 θ(µ) ≤ inf_x {f(x) : g(x) ≤ 0, h(x) = 0, x ∈ X}. (20.4)

On the other hand, take any convergent subsequence {xµk} of {xµ} with limit x. Then

supµ≥0 θ(µ) ≥ θ(µk) = f(xµk) + µkα(xµk) ≥ f(xµk).

Since xµk → x as µk → ∞ and f is continuous, this implies that supµ≥0 θ(µ) ≥ f(x). Combined with (20.4), we have that f(x) = supµ≥0 θ(µ), and thus the result follows.
The proof starts by demonstrating the nondecreasing behaviour of θ(µ), which is required for convergence. By noticing that

θ(λ) ≤ f(xµ) + λα(xµ) = θ(µ) + (λ − µ)α(xµ)

and that λ − µ < 0, we can infer that θ(µ) ≥ θ(λ). It is also interesting to notice how the objective function f(x) and the infeasibility α(x) behave as we increase the penalty coefficient µ. For that, notice that, using the same trick as in the proof for two distinct values 0 < λ < µ, we have

f(xλ) + λα(xλ) ≤ f(xµ) + λα(xµ) (1)
f(xµ) + µα(xµ) ≤ f(xλ) + µα(xλ). (2)

Notice that in (1) we use the fact that xλ = arg minx θ(λ) = arg minx {f(x) + λα(x)}, whose value must therefore be less than or equal to f(xµ) + λα(xµ) for an arbitrary xµ ∈ X. The same logic is employed in (2), but with λ and µ reversed. Adding (1) and (2), we obtain (µ − λ)(α(xλ) − α(xµ)) ≥ 0 and conclude that α(xµ) ≤ α(xλ) for µ > λ, i.e., that α(xµ) is nonincreasing in µ.
Moreover, from the first inequality, we have that f(xµ) ≥ f(xλ). Notice how this goes in line with what one would expect from the method: as we increase the penalty coefficient µ, the optimal infeasibility, measured by α(xµ), decreases, while the objective function value f(xµ) worsens as it is slowly “forced” to be closer to the original feasible region.
Note that the assumption of compactness plays a central role in this proof, such that θ(µ) can be
evaluated for any µ as µ → ∞. Though this is a strong assumption, it tends to not be so restrictive
in practical cases, since variables typically lie within finite lower and upper bounds. Finally, notice
that α(x) = 0 implies that x is feasible for gi for i = 1, . . . , m, and hi for i = 1, . . . , l, and thus
optimal for P . This is stated in the following corollary.
Corollary 20.2. If α(xµ ) = 0 for some µ, then xµ is optimal for P .
A technical detail of the proof of Theorem 20.1 is that the convergence of such an approach is asymptotic, i.e., by making µ arbitrarily large, xµ can be made arbitrarily close to the true optimum x and θ(µ) can be made arbitrarily close to the optimal value f(x). In practice, this strategy tends to be prone to computational instability.
The computational instability arises from the influence that the penalty term exerts on some of the eigenvalues of the Hessian of the penalised problem. Let Hµ(xµ) be the Hessian of the penalised function at xµ. Recall that conditioning is measured by κ = max_{i=1,...,n} λi / min_{i=1,...,n} λi, where {λi}_{i=1,...,n} are the eigenvalues of Hµ(xµ). Since the influence is only on some of the eigenvalues, this affects the
conditioning of the problem and might lead to numerical instabilities. An indication of that can
be seen in Figure 20.1, where one can notice the elongated profile of the function as the penalty
term µ increases.
Consider the following example. Let the penalised function be fµ(x) = x1² + x2² + µ(x1 + x2 − 1)². The Hessian of fµ(x) is

∇²fµ(x) = [ 2(1 + µ)   2µ
            2µ         2(1 + µ) ].
Solving det(∇²fµ(x) − λI) = 0, we obtain λ1 = 2 and λ2 = 2(1 + 2µ), with eigenvectors (1, −1) and (1, 1), which gives κ = 1 + 2µ. This illustrates that one of the eigenvalues, and consequently the condition number, grows proportionally to the penalty term.
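A quick numerical confirmation (Python with NumPy; a minimal sketch under our own naming) shows the condition number growing linearly in µ:

import numpy as np

# Hessian of f_mu(x) = x1^2 + x2^2 + mu*(x1 + x2 - 1)^2 for growing mu
for mu in [0.5, 1.0, 5.0, 50.0, 500.0]:
    H = np.array([[2 * (1 + mu), 2 * mu],
                  [2 * mu, 2 * (1 + mu)]])
    eig = np.linalg.eigvalsh(H)            # eigenvalues 2 and 2*(1 + 2*mu)
    print(mu, eig, eig.max() / eig.min())  # condition number kappa = 1 + 2*mu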
Figure 20.3: Geometric representation of augmented Lagrangians in the mapping G = [h(x), f (x)]
Considering the geometrical interpretation in Figure 20.2, one might notice that a horizontal shift
in the penalty curve would allow for the extreme point of the curve to match the optimum on the
z ordinate.
Therefore, we consider a modified penalised problem of the form

fµ(x) = f(x) + µ ∑_{i=1}^l (hi(x) − θi)²,

where θi shifts the penalty associated with constraint i. Expanding the square and setting vi = −2µθi, minimising fµ is equivalent (up to a constant) to minimising the augmented Lagrangian function f(x) + v⊤h(x) + µ ∑_{i=1}^l hi(x)². This shift is what implies that the optimal solution x can be recovered using a finite penalty term µ, unlike with the previous penalty-based method. The existence of finite penalty terms µ > 0 that can recover optimality has an interesting geometrical interpretation, in light of what was previously discussed.
Consider the same setting from Figure 20.2, but now we consider curves of the form f +vh+µh2 = k.
This is illustrated in Figure 20.3.
Optimising the augmented Lagrangian function amounts to finding the curve f + vh + µh² = k for which v(ϵ) = k at the contact point. The expression for k can be conveniently rewritten as f = −µ[h + (v/2µ)]² + (k + v²/4µ), exposing that f is a parabola whose vertex is shifted to h = −v/2µ.
The augmented Lagrangian method of multipliers relies on strong duality to search for KKT points (or primal-dual pairs) (x, v) by iteratively operating in the primal (x) and dual (v) spaces. In particular, the strategy consists of alternating between minimising the augmented Lagrangian in x for fixed dual variables vk, which yields xk+1, and updating the dual variables while keeping xk+1 fixed. This strategy is akin to applying the subgradient method to solving the augmented Lagrangian dual. The update step for the dual variable is given by

vk+1 = vk + 2µh(xk+1).
The motivation for the dual step update stems from the following observation: the first-order optimality conditions of the penalised problem at xk+1 read

0 = ∇f(xk+1) + ∑_{i=1}^l (vik + 2µhi(xk+1))∇hi(xk+1).

That is, by employing vk+1 = vk + 2µh(xk+1), one retains optimality in the dual variable space for the Lagrangian function from the optimality conditions of the penalised function, which is a condition for x to be a KKT point.
Algorithm 22 summarises the augmented Lagrangian method of multipliers (ALMM).
The method can be specialised such that µ is individualised for each constraint and updated
proportionally to the observed infeasibility hi (x). Such a procedure is still guaranteed to converge,
as the requirement in Theorem 20.1 that µ → ∞ is still trivially satisfied.
One important point about the augmented Lagrangian method of multipliers is that linear convergence is to be expected, due to the gradient-like step taken to find the optimal dual variables. This is often the case with traditional Lagrangian-duality-based approaches.
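For concreteness, the sketch below (Python; our own minimal instance, reusing the example min x1² + x2² subject to x1 + x2 = 1, whose multiplier is v = −1) runs ALMM with a fixed, finite penalty µ; the x-step is available in closed form.

mu, v = 1.0, 0.0
for k in range(60):
    # x-step: minimise x1^2 + x2^2 + v*(x1 + x2 - 1) + mu*(x1 + x2 - 1)^2;
    # by symmetry x1 = x2 = t with 2t + v + 2*mu*(2t - 1) = 0
    t = (2 * mu - v) / (2 + 4 * mu)
    h = 2 * t - 1                     # constraint residual h(x)
    v = v + 2 * mu * h                # dual (multiplier) update
    if abs(h) < 1e-10:
        break
print(k, t, v)                        # x -> (0.5, 0.5), v -> -1, with finite mu

Notice that, unlike the pure penalty approach, the iterates become feasible with µ held fixed; the linear (geometric) convergence of the dual update is also visible in the residuals.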
Consider now a problem with an objective function that is separable as f(x) + g(y) and a coupling constraint Ax + By = c. We would like to be able to solve the problem separately for x and y, which could, in principle, be achieved using ALMM. However, the consideration of the penalty term prevents the problem from being completely separable. To see that, let

ϕµ(x, y, v) = f(x) + g(y) + v⊤(c − Ax − By) + µ||c − Ax − By||²

be the augmented Lagrangian function. One can notice that the penalty term µ||c − Ax − By||²
prevents the separation of the problem in terms of the x and y variables. However, separability can
be recovered if one employs a coordinate descent approach in which three blocks are considered:
x, y, and v. The ADMM is summarised in Algorithm 23.
Algorithm 23 ADMM
1: initialise. tolerance ϵ > 0, initial primal and dual solutions y0 and v0, k = 0
2: while ||c − Axk − Byk|| > ϵ or ||yk − yk−1|| > ϵ do
3: xk+1 = arg minx ϕµ(x, yk, vk)
4: yk+1 = arg miny ϕµ(xk+1, y, vk)
5: vk+1 = vk + 2µ(c − Axk+1 − Byk+1)
6: k = k + 1
7: end while
8: return (xk, yk).
One important feature regarding ADMM is that the coordinate descent steps are taken in a cyclic
order, not requiring more than one (x, y) update step. Variants consider more than one of these
steps, but no clear benefit in practice has been observed. Moreover, µ can be updated according
to the amount of infeasibility observed at iteration k, but no generally good update rule is known.
ADMM is particularly relevant as a method for (un)constrained problems in which it might expose a structure that can be exploited, such as when some of the block subproblems (Lines 3 and 4 in Algorithm 23) have solutions in closed form.
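The sketch below (Python; an assumed toy instance of our own, min x² + y² subject to x + y = 1, so that A = B = 1 and c = 1) illustrates the cyclic block updates; each block minimisation is a scalar quadratic with a closed form.

mu, x, y, v = 1.0, 0.0, 0.0, 0.0
for k in range(100):
    # Block updates: argmin of phi_mu(x, y, v) in each coordinate
    x = (v + 2 * mu * (1 - y)) / (2 + 2 * mu)   # Line 3
    y = (v + 2 * mu * (1 - x)) / (2 + 2 * mu)   # Line 4
    r = 1 - x - y                                # primal residual c - Ax - By
    v = v + 2 * mu * r                           # Line 5: dual update
    if abs(r) < 1e-10:
        break
print(k, x, y, v)                                # -> x = y = 0.5, v = 1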
Chapter 21
Barrier methods
Barrier methods consider solving a problem of the form

inf. θ(µ)
s.t.: µ > 0,

where θ(µ) = inf_x {f(x) + µB(x) : g(x) < 0, x ∈ X} and B(x) is a barrier function. The barrier function is such that its value approaches +∞ as the boundary of the region {x : g(x) ≤ 0} is approached from its interior. Notice that, in practice, this means that the constraint g(x) < 0 can be dropped, as it is automatically enforced by the barrier function.
The barrier function B : Rn → R is such that

B(x) = ∑_{i=1}^m ϕ(gi(x)), where ϕ(y) ≥ 0 if y < 0, and ϕ(y) → +∞ as y → 0⁻. (21.1)
Perhaps the most important barrier function is Frisch’s log barrier function, used in the highly successful primal-dual interior point methods. We will describe its use later. The log barrier is defined as

B(x) = −∑_{i=1}^m ln(−gi(x)).
Figure 21.1: The barrier function for different values of µ
Figure 21.1 illustrates the behaviour of the barrier function. Ideally, the barrier function B(x) plays the role of an indicator function, which is zero for all feasible solutions x ∈ {x : g(x) < 0} but assumes an infinite value if a solution is at the boundary g(x) = 0 or outside the feasible region. This ideal behaviour is illustrated by the dashed line in Figure 21.1. The barrier functions for different values of the barrier term µ illustrate how the log barrier mimics this behaviour, becoming more and more pronounced as µ decreases.
Theorem 21.1 (Convergence of barrier methods). Letting θ(µ) = f(xµ) + µB(xµ), where B(x) is a barrier function as described in (21.1), xµ ∈ X, and g(xµ) < 0, the limit x of {xµ} as µ → 0⁺ is optimal to P, and µB(xµ) → 0 as µ → 0⁺.
Proof. First, we show that θ(µ) is a nondecreasing function in µ. For µ > λ > 0 and x such that g(x) < 0 and x ∈ X, we have

f(x) + µB(x) ≥ f(x) + λB(x) ≥ θ(λ),

and thus θ(µ) ≥ θ(λ). From this, we conclude that limµ→0⁺ θ(µ) = inf {θ(µ) : µ > 0}. Now, let ϵ > 0. As x is optimal, by assumption there exists some x̂ ∈ X with g(x̂) < 0 such that f(x) + ϵ > f(x̂). Then, for µ > 0, we have

f(x) + ϵ + µB(x̂) > f(x̂) + µB(x̂) ≥ θ(µ).

Letting µ → 0⁺, it follows that f(x) + ϵ ≥ limµ→0⁺ θ(µ), which implies f(x) ≥ limµ→0⁺ θ(µ) = inf {θ(µ) : µ > 0}. Conversely, since B(xµ) ≥ 0 and g(xµ) < 0 for each µ > 0, we have

θ(µ) = f(xµ) + µB(xµ) ≥ f(xµ) ≥ inf_x {f(x) : g(x) ≤ 0, x ∈ X} = f(x).

Thus f(x) ≤ limµ→0⁺ θ(µ) = inf {θ(µ) : µ > 0}. Therefore, f(x) = limµ→0⁺ θ(µ) = inf {θ(µ) : µ > 0}.
The proof has three main steps. First, we show that θ(µ) is a nondecreasing function in µ, which implies that limµ→0⁺ θ(µ) = inf {θ(µ) : µ > 0}. This can be trivially shown as only feasible solutions x need to be considered. Next, we show the convergence of the barrier method by showing that infµ>0 θ(µ) = limµ→0⁺ θ(µ) = f(x), where x = arg min {f(x) : g(x) ≤ 0, x ∈ X}. The optimality of x implies that, for any ϵ > 0, there is a strictly feasible x̂ with f(x̂) − f(x) < ϵ. Moreover, B(x̂) ≥ 0 by definition. In the last part, we use the argument that including the boundary can only improve the objective function value, leading to the last inequality. It is worth highlighting that, to simplify the proof, we have assumed that the barrier function has the form described in (21.1). However, a proof in the vein of Theorem 21.1 can still be developed for the Frisch log barrier (for which B(x) is not necessarily nonnegative) since, essentially, (21.1) only needs to be observed in a neighbourhood of g(x) = 0.
The result in Theorem 21.1 allows us to design optimisation methods that, starting from a strictly feasible (interior) solution, successively reduce the barrier term until a solution with an arbitrarily small barrier term is obtained. Algorithm 22 presents a pseudocode for such a method. One important aspect to notice is that starting the algorithm requires a strictly feasible point, which, in some applications, might be challenging to obtain. This characteristic is what renders the name interior point methods for this class of algorithms.
Figure 21.2: Example 1: solving a one-dimensional problem with the barrier method
Consider the following example. Let (P) : min. {(x + 1)² : x ≥ 0} and let us use the barrier function B(x) = −ln(x). Then, the unconstrained barrier problem becomes

min. fµ(x) = (x + 1)² − µ ln(x).

Since this is a convex function, the first-order condition fµ′(x) = 0 is necessary and sufficient for optimality. Thus, solving 2(x + 1) − µ/x = 0, we obtain the positive root and unique solution

xµ = −1/2 + √(4 + 8µ)/4.

Figure 21.2 shows the behaviour of the problem as µ converges to zero. As can be seen, as µ → 0, the optimal solution xµ converges to the constrained optimum x = 0.
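The sketch below (Python; a minimal illustration with our own naming) reproduces this convergence by repeatedly shrinking the barrier term and evaluating the closed-form minimiser xµ.

from math import sqrt

mu = 10.0
while mu > 1e-10:
    x_mu = -0.5 + sqrt(4 + 8 * mu) / 4   # positive root of 2(x + 1) - mu/x = 0
    print(f"mu = {mu:10.3e}  x_mu = {x_mu:.6e}")
    mu *= 0.1                             # reduce the barrier term (beta = 0.1)

As µ shrinks by a factor of ten per iteration, xµ approaches the constrained optimum x = 0 roughly proportionally to µ.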
We now consider a more involved example. Let us consider the problem

(P) : min. {(x1 − 2)⁴ + (x1 − 2x2)² : x1² − x2 ≤ 0}

with B(x) = −1/(x1² − x2). We implemented Algorithm 22 and solved the problem with two distinct values for the penalty term µ and the reduction term β. Figure 21.3 illustrates the trajectory of the algorithm under each parametrisation, exemplifying how these parameters can affect the convergence of the method.
Figure 21.3: The trajectory of the barrier method for problem P . Notice how the parameters
influence the trajectory and number of iterations. The parameters on the left require 27 iterations
while those on the right require 40 iterations for convergence.
Ax = b, x > 0
A⊤v = c − µ(1/x1, . . . , 1/xn)⊤.
Notice that, since µ > 0 and x > 0, u = µ(1/x1, . . . , 1/xn)⊤ serves as an estimate for the Lagrangian dual variables. To further understand the relationship between the optimality conditions of BP and P, let us define the diagonal matrices X ∈ Rn×n and U ∈ Rn×n as

X = diag(x1, . . . , xn) and U = diag(u1, . . . , un),
and let e = [1, . . . , 1]⊤ be a vector of ones of suitable dimension. We can rewrite the KKT conditions
of BP as
Ax = b, x > 0 (21.3)
A⊤v + u = c (21.4)
u = µX⁻¹e ⇒ XUe = µe. (21.5)
Notice how the condition (21.5) resembles the complementary slackness from P , but relaxed to be
µ instead of zero. This is why this system is often referred to as the perturbed KKT system.
Theorem 21.11 guarantees that wµ = (xµ , vµ , uµ ) approaches the optimal primal-dual solution of
P as µ → 0+ . The trajectory formed by successive solutions {wµ } is called the central path, which
is due to the interiority enforced by the barrier function. When the barrier term µ is large enough,
the solution of the barrier problem is close to the analytic centre of the feasibility set. The analytic
centre of a polyhedral set S = {x ∈ Rn : Ax ≤ b} is given by

max._x { ∏_{i=1}^m (bi − ai⊤x) : x ∈ S },

which corresponds to finding the point that maximises the product of the distances to each of the hyperplanes forming the polyhedral set. This is equivalent to the convex problem

min._x { −∑_{i=1}^m ln(bi − ai⊤x) : x ∈ S }.
One particularly useful property of the barrier term µ in this setting is that it measures the duality gap at a given solution. That is, notice that

c⊤x = (A⊤v + u)⊤x = (A⊤v)⊤x + u⊤x = v⊤(Ax) + u⊤x = b⊤v + u⊤x,

and thus

c⊤x − b⊤v = u⊤x = ∑_{i=1}^n ui xi = ∑_{i=1}^n (µ/xi) xi = nµ,
which gives the total complementary slackness violation (the duality gap) and can be used to determine the convergence of the algorithm.
Figure 21.4: An illustrative representation of the central path and of how the IPM follows it approximately.
For example, let Nµ(θ) = {w : ||XUe − µe|| ≤ θµ}. Then, by selecting β = 1 − σ/√n, σ = θ = 0.1, and µ0 = (x⊤u)/n, successive Newton steps are guaranteed to remain within Nµ(θ).
To see how the setting works, let the perturbed KKT system (21.3) – (21.5) for each µ̂ be denoted
as H(w) = 0. Let J(w) be the Jacobian of H(w) at w.
To see how this still leads to primal and dual feasible solutions, consider the primal residuals (i.e.,
the amount of infeasibility) as rp (x, u, v) = Ax − b and the dual residuals rd (x, u, v) = A⊤ v + u − c.
Now, let r(w) = r(x, u, v) = (rp (x, u, v), rd (x, u, v)), recalling that wk = (x, v, u). The optimality
conditions can be expressed as requiring that the residuals must vanish, that is r(w) = 0.
Now, consider the first-order Taylor approximation of r at w for a step dw,

r(w + dw) ≈ r(w) + Dr(w)dw,

where Dr(w) is the derivative of r evaluated at w, given by the first two rows of the Newton system (21.7). The step dw for which the residual vanishes satisfies

Dr(w)dw = −r(w), (21.9)
which is the same as the Newton system (21.7) without the bottom equation. Now, if we consider the directional derivative of the square of the norm of r in the direction dw, we obtain

(d/dt) ||r(w + t dw)||²₂ |ₜ₌₀₊ = 2r(w)⊤Dr(w)dw = −2r(w)⊤r(w), (21.10)

which is strictly negative. That is, the step dw is such that it will make the residual decrease and eventually become zero. From that point onwards, the Newton system will take the form of (21.7).
The algorithm proceeds by iteratively solving the system (21.8) with µk+1 = βµk with β ∈ (0, 1)
until nµk is less than a specified tolerance. Algorithm 23 summarises a simplified form of the IPM.
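To make the scheme concrete, the sketch below (Python with NumPy; a minimal infeasible-start variant of our own, not the exact Algorithm 23) solves a small LP in standard form min {c⊤x : Ax = b, x ≥ 0} by repeatedly solving the Newton system on the perturbed KKT conditions (21.3)–(21.5). The LP encodes the example that follows, with surplus variables appended.

import numpy as np

A = np.array([[2.0, 1.0, -1.0, 0.0],    # 2x1 + x2 - s1 = 8
              [1.0, 2.0, 0.0, -1.0]])   # x1 + 2x2 - s2 = 10
b = np.array([8.0, 10.0])
c = np.array([1.0, 1.0, 0.0, 0.0])
m, n = A.shape

x, u, v = np.ones(n), np.ones(n), np.zeros(m)  # x, u > 0; feasibility not required
sigma = 0.1                                     # centring (mu-reduction) parameter

for k in range(100):
    mu = x @ u / n
    r_p, r_d = A @ x - b, A.T @ v + u - c       # primal and dual residuals
    if mu < 1e-9 and np.linalg.norm(r_p) + np.linalg.norm(r_d) < 1e-9:
        break
    # Newton system on the perturbed KKT conditions, unknowns (dx, dv, du)
    K = np.zeros((2 * n + m, 2 * n + m))
    K[:m, :n] = A                               # A dx = -r_p
    K[m:m + n, n:n + m] = A.T                   # A' dv + du = -r_d
    K[m:m + n, n + m:] = np.eye(n)
    K[m + n:, :n] = np.diag(u)                  # U dx + X du = sigma*mu*e - XUe
    K[m + n:, n + m:] = np.diag(x)
    rhs = np.concatenate([-r_p, -r_d, sigma * mu - x * u])
    d = np.linalg.solve(K, rhs)
    dx, dv, du = d[:n], d[n:n + m], d[n + m:]
    # Fraction-to-boundary step sizes to keep x and u strictly positive
    ax = min([1.0] + [-0.99 * xi / di for xi, di in zip(x, dx) if di < 0])
    au = min([1.0] + [-0.99 * ui / di for ui, di in zip(u, du) if di < 0])
    alpha = min(ax, au)
    x, v, u = x + alpha * dx, v + alpha * dv, u + alpha * du

print(x[:2], c @ x)   # -> approximately x = (2, 4) with optimal value 6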
Figure 21.5 illustrates the behaviour of the IPM when employed to solve the linear problem
min. x1 + x2
s.t.: 2x1 + x2 ≥ 8
x1 + 2x2 ≥ 10,
x1 , x2 ≥ 0
considering two distinct initial penalties µ. Notice how higher penalty values enforce a more
central convergence of the method. Some points are worth noticing concerning Algorithm 23.
Figure 21.5: IPM applied to an LP problem with two different barrier terms.
First, notice that in Line 4 a fixed step size is considered. A line search can be incorporated to prevent infeasibility and improve numerical stability. Typically, one uses

λk = min {α, mini {−xik/dik : dik < 0}}

with α < 1 but close to 1.
Also, even though the algorithm is initialised with a feasible solution w0, this might in practice not be necessary. Implementations of the infeasible IPM can efficiently handle primal and dual infeasibility.
Under specific conditions, the IPM can be shown to have complexity O(√n ln(1/ϵ)), which is polynomial and of much better worst-case performance than the simplex method, making it the algorithm of choice for solving large-scale LPs. Another important advantage is that the IPM can be modified with little effort to solve the wider class of conic optimisation problems.
Predictor-corrector methods are variants of the IPM that incorporate a two-phase direction calculation: a predictor direction d_w^pred, calculated by setting µ = 0, and a corrector direction d_w^cor, which is computed considering the impact that d_w^pred would have on the term XUe.

Let ∆X = diag(d_x^pred) and ∆U = diag(d_u^pred). Then

(X + ∆X)(U + ∆U)e = XUe + (U∆X + X∆U)e + ∆X∆Ue
= XUe + (0 − XUe) + ∆X∆Ue
= ∆X∆Ue. (21.11)

Using the last equation (21.11), the corrector Newton step becomes Udx + Xdu = µ̂e − ∆X∆Ue. Finally, d_w^k is set to be a combination of d_w^pred and d_w^cor.
Chapter 22
Primal methods
The key feature of feasible direction methods is the process of deriving such directions and associated step sizes that retain feasibility, even if only approximately. Similarly to the other methods we have discussed in the past lectures, these methods progress following two basic steps:

1. Derive an improving feasible direction dk and an associated step size λk;
2. Make xk+1 = xk + λkdk.
As a general rule, for a feasible direction method to perform satisfactorily, the calculation of the directions dk and step sizes λk must be simple enough. Often, these steps can be reduced to closed forms or, more frequently, to solving linear or quadratic programming problems, or even to posing modified Newton systems.
Problem DS consists of finding the furthest feasible point along the direction of steepest descent; that is, we move against the gradient under the conditions that we stop if the line search mandates so or if the search reaches the boundary of the feasible region. This is precisely what gives the method the name conditional gradient.
By letting x̄k = arg min_{x∈S} ∇f(xk)⊤(x − xk) and obtaining λk ∈ (0, 1] by employing a line search, the method iterates making

xk+1 = xk + λk(x̄k − xk).
One important condition to observe is that λk has to be constrained such that λk ∈ (0, 1] to guarantee feasibility, as x̄k is feasible by definition. Also, notice that the condition ∇f(xk) = 0 might never be achieved, since the unconstrained optimum might lie outside the feasible region S. In that case, after two successive iterations we will observe that x̄k = x̄k−1 and thus that dk = 0. This eventual stall of the algorithm will happen at a point xk satisfying first-order (constrained) optimality conditions. Therefore, the term ∇f(xk)⊤dk will become zero regardless of whether the minimum of the function belongs to S, and it is hence used as the stopping condition of the algorithm. Algorithm 22 summarises the Frank-Wolfe method.
Notice that, for a polyhedral feasibility set, the subproblems are linear programming problems, meaning that the Frank-Wolfe method can be warm-started fairly efficiently using the dual simplex method at each iteration.
Figure 22.1 shows the employment of the FW method for optimising a nonlinear function within the polyhedral feasibility set defined by 2x1 + 3x2 ≤ 8, x1 + 4x2 ≤ 6, and x ≥ 0, starting from (0, 0) and using an exact line search to set step sizes λ ∈ [0, 1]. Notice that the method can be utilised with inexact line searches as well.
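The sketch below (Python with NumPy and SciPy) illustrates the method on the same polyhedron. Since the original objective is not reproduced here, we use an assumed stand-in quadratic f(x) = (x1 − 2)² + (x2 − 2)², whose unconstrained minimum lies outside the feasible region; the line search is exact for this quadratic.

import numpy as np
from scipy.optimize import linprog

A_ub = np.array([[2.0, 3.0], [1.0, 4.0]])   # feasible region of Figure 22.1
b_ub = np.array([8.0, 6.0])

def grad(x):
    # Gradient of the assumed objective f(x) = (x1 - 2)^2 + (x2 - 2)^2
    return np.array([2 * (x[0] - 2), 2 * (x[1] - 2)])

x = np.zeros(2)                              # feasible starting point (0, 0)
for k in range(200):
    g = grad(x)
    # Direction subproblem: minimise the linearisation over the polyhedron
    res = linprog(g, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    d = res.x - x
    if g @ d > -1e-8:                        # first-order stationarity attained
        break
    lam = min(1.0, -(g @ d) / (2 * d @ d))   # exact line search, clipped to (0, 1]
    x = x + lam * d
print(x)                                     # slowly approaches roughly (1.76, 1.06)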
Figure 22.1: The Frank-Wolfe method applied to a problem with linear constraints. The algorithm
takes 2 steps using an exact line search (left) and 15 with Armijo line search (right).
The KKT conditions for P are given by the system W(x, v) = 0, where

W(x, v) : ∇f(x) + ∑_{i=1}^l vi∇hi(x) = 0
          hi(x) = 0, i = 1, . . . , l.

Using the Newton(-Raphson) method, we can solve W(x, v) = 0. Starting from (xk, vk), we successively employ Newton steps of the form

W(xk, vk) + ∇W(xk, vk) [ (x − xk) ; (v − vk) ] = 0. (22.1)
Upon closer inspection, one can notice that the term ∇W(xk, vk) is given by

∇W(xk, vk) = [ ∇²L(xk, vk)  ∇h(xk)⊤
               ∇h(xk)       0       ],

where

∇²L(xk, vk) = ∇²f(xk) + ∑_{i=1}^l vik∇²hi(xk).
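As an illustration of this Newton system, the sketch below (Python with NumPy; an assumed instance of our own, min x1² + x2² subject to x1x2 = 1, whose KKT point is x = (1, 1), v = −2) applies Newton steps directly to W(x, v) = 0.

import numpy as np

def W(z):
    # KKT residual for f(x) = x1^2 + x2^2, h(x) = x1*x2 - 1
    x1, x2, v = z
    return np.array([2 * x1 + v * x2, 2 * x2 + v * x1, x1 * x2 - 1])

def JW(z):
    # Jacobian of W: [Hessian of the Lagrangian, grad h; grad h', 0]
    x1, x2, v = z
    return np.array([[2.0, v, x2],
                     [v, 2.0, x1],
                     [x2, x1, 0.0]])

z = np.array([1.5, 0.8, -1.0])               # starting primal-dual guess
for k in range(20):
    z = z + np.linalg.solve(JW(z), -W(z))    # Newton step on (x, v)
    if np.linalg.norm(W(z)) < 1e-12:
        break
print(k, z)                                  # -> (1, 1, -2)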
Notice that QP is a linearly constrained quadratic programming problem, for which we have seen
several solution approaches. Moreover, notice that the optimality conditions of QP are given by
(22.2) and (22.3), where v is the dual variable associated with the constraints in (22.5), which, in
turn, represent first-order approximations of the original constraints.
The objective function in QP can be interpreted as a second-order approximation of f(x) enhanced with the term (1/2) ∑_{i=1}^l vik d⊤∇²hi(xk)d, which captures constraint curvature information. An alternative interpretation for the objective function of QP is to notice that it consists of the second-order approximation of the Lagrangian function L(x, v) = f(x) + ∑_{i=1}^l vihi(x) at (xk, vk), which is given by

L(x, v) ≈ L(xk, vk) + ∇xL(xk, vk)⊤d + (1/2) d⊤∇²L(xk, vk)d
= f(xk) + vk⊤h(xk) + (∇f(xk) + ∇h(xk)⊤vk)⊤d + (1/2) d⊤(∇²f(xk) + ∑_{i=1}^l vik∇²hi(xk))d.
To see this, notice that the terms f(xk) and vk⊤h(xk) are constants and that ∇h(xk)⊤d = 0 (from (22.5), as h(xk) = 0).
The general subproblem in the SQP method can be stated as

QP(xk, uk, vk) : min. ∇f(xk)⊤d + (1/2) d⊤∇²L(xk, uk, vk)d
s.t.: gi(xk) + ∇gi(xk)⊤d ≤ 0, i = 1, . . . , m
hi(xk) + ∇hi(xk)⊤d = 0, i = 1, . . . , l,
which includes the inequality constraints gi(x) ≤ 0 for i = 1, . . . , m in a linearised form and their respective associated Lagrangian multipliers ui, for i = 1, . . . , m. This is possible since we are using an optimisation setting rather than a Newton system, which only allows for equality constraints, even though the latter can be obtained by simply introducing slack variables. Clearly, there are several options that could be considered to handle this quadratic problem, including employing a primal/dual interior point method.
A pseudocode for the standard SQP method is presented in Algorithm 23.
Notice that in Line 4, dual variable values are retrieved from the constraints in QP(xk, uk, vk). Therefore, QP(xk, uk, vk) needs to be solved by an algorithm that can return these dual variables, such as the (dual) simplex method.
Figure 22.2 illustrates the behaviour of the SQP method on the problem

min. {2x1² + 2x2² − 2x1x2 − 4x1 − 6x2 : x1² − x2 ≤ 0, x1 + 5x2 ≤ 5, x1 ≥ 0, x2 ≥ 0}.

Notice how the trajectory might eventually become infeasible due to the consideration of linear approximations of the nonlinear constraint.
One important feature of the SQP method is that it closely mimics the convergence properties of Newton’s method and, therefore, under appropriate conditions, superlinear convergence can be observed. Moreover, the BFGS method can be used to approximate ∇²L(xk, vk), which makes the method dependent only on first-order derivatives.
Notice that, because the constraints are considered only implicitly in the subproblem QP(xk, uk, vk), one cannot readily devise a line search for the method, which, being based on successive quadratic approximations, thus presents a risk of divergence.
The l1 -SQP is a modern variant of SQP that addresses divergence issues arising in the SQP
method when considering solutions that are far away from the optimum, while presenting superior
computational performance.
In essence, l1-SQP relies on a principle similar to that of penalty methods, encoding a penalisation for infeasibility in the objective function of the quadratic subproblem. In the context of SQP algorithms, these functions are called “merit” functions. This allows for considering line searches (since feasibility becomes encoded in the objective function, with feasibility guaranteed at a minimum; cf. penalty methods) or, alternatively, for relying on a trust-region approach, ultimately preventing divergence issues.
Figure 22.2: The SQP method converges in 6 iterations with ϵ = 10⁻⁶.
Let us consider the trust-region based l1-penalty QP subproblem, which can be formulated as

l1-QP(xk, vk) : min. ∇f(xk)⊤d + (1/2) d⊤∇²L(xk, vk)d + µ [ ∑_{i=1}^m [gi(xk) + ∇gi(xk)⊤d]⁺ + ∑_{i=1}^l |hi(xk) + ∇hi(xk)⊤d| ]
s.t.: −∆k ≤ d ≤ ∆k,

where µ is a penalty term, [ · ]⁺ = max{0, ·}, and ∆k is a trust-region term. This variant is of particular interest because the subproblem l1-QP(xk, vk) can be recast as a QP problem with linear constraints:
l1-QP(xk, vk) : min. ∇f(xk)⊤d + (1/2) d⊤∇²L(xk, vk)d + µ [ ∑_{i=1}^m yi + ∑_{i=1}^l (zi⁺ + zi⁻) ]
s.t.: −∆k ≤ d ≤ ∆k
yi ≥ gi(xk) + ∇gi(xk)⊤d, i = 1, . . . , m
zi⁺ − zi⁻ = hi(xk) + ∇hi(xk)⊤d, i = 1, . . . , l
y, z⁺, z⁻ ≥ 0.
The subproblem l1-QP(xk, vk) enjoys the same benefits as the original form, meaning that it can be solved with efficient simplex-method-based solvers.
The trust-region variant of l1-SQP is globally convergent (does not diverge) and enjoys a superlinear convergence rate, as does the original SQP. The l1-penalty term is what is often called a merit function in the literature. Alternatively, one can disregard the trust region and employ a line search (either exact or inexact), which also enjoys global convergence properties.
(P) : min. f(x)
s.t.: Ax = b
x ≥ 0,
To ease the illustration, we assume linear programming nondegeneracy, i.e., that any m columns of A are linearly independent and that every extreme point of the feasible region has at least m positive components and at most n − m zero components. Accordingly, let us define a partition of A as A = (B, N) and x⊤ = (xB⊤, xN⊤), where B is an invertible m × m matrix with xB > 0 the basic variables and xN ≥ 0 the nonbasic variables. This implies that ∇f(x)⊤ can also be partitioned as ∇f(x)⊤ = (∇Bf(x)⊤, ∇Nf(x)⊤).
In this context, for d to be an improving feasible direction, we must observe that

1. ∇f(x)⊤d < 0;
2. Ad = 0, with dj ≥ 0 for those components j at which xj = 0.

We will show how to obtain a direction d that satisfies conditions 1 and 2. For that, let d⊤ = (dB⊤, dN⊤). We have that 0 = Ad = BdB + NdN for any dN, implying that dB = −B⁻¹NdN. Moreover,

∇f(x)⊤d = ∇Bf(x)⊤dB + ∇Nf(x)⊤dN = (∇Nf(x)⊤ − ∇Bf(x)⊤B⁻¹N)dN.
The term rN⊤ = ∇Nf(x)⊤ − ∇Bf(x)⊤B⁻¹N is referred to as the reduced gradient, as it expresses the gradient of the function in terms of the nonbasic directions only. Notice that the reduced gradient r holds a resemblance to the reduced costs from the simplex method. In fact,

r⊤ = (rB⊤, rN⊤) = ∇f(x)⊤ − ∇Bf(x)⊤B⁻¹A
= (∇Bf(x)⊤ − ∇Bf(x)⊤B⁻¹B, ∇Nf(x)⊤ − ∇Bf(x)⊤B⁻¹N)
= (0, ∇Nf(x)⊤ − ∇Bf(x)⊤B⁻¹N).
Wolfe’s rule for forming the nonbasic components of the direction is dj = −rj, if rj ≤ 0, and dj = −xjrj, if rj > 0. Notice that the rule is related to the direction of the optimisation. For rj ≤ 0, one wants to increase the value of xj in that coordinate direction, making dj nonnegative. On the other hand, if the reduced gradient is positive (rj > 0), one wants to reduce the value of xj, unless it is already zero, a safeguard created by the multiplication by xj in the definition of the direction d.
The following result guarantees the convergence of Wolfe’s reduced gradient method to a KKT point: a feasible x is a KKT point if and only if the direction d built according to Wolfe’s rule satisfies d = 0, where the basic components dB are defined over the index set IB of basic variables. By construction (using Wolfe’s rule), either d = 0 or ∇f(x)⊤d < 0, d being thus an improving direction. The characterisation follows from noticing that x is a KKT point if and only if there exist (uB⊤, uN⊤) ≥ (0, 0) and v satisfying the KKT system of P, which can be shown to hold exactly when d = 0.
Algorithm 24 presents a pseudocode for Wolfe’s reduced gradient method. A few implementation details stand out. First, notice that the basis is selected by choosing the components largest in value, which differs from the simplex method by allowing nonbasic variables to assume nonzero values. Moreover, notice that a line search is employed subject to bounds on the step size λ to guarantee that feasibility x ≥ 0 is retained.
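The computation of the reduced gradient and of Wolfe’s direction is compact; the sketch below (Python with NumPy; an assumed instance of our own, with x1 taken as the basic variable) verifies that the resulting d is improving and feasible.

import numpy as np

# min (x1-1)^2 + (x2-2)^2 + x3^2  s.t.  x1 + x2 + x3 = 3, x >= 0, at x = (2, 0.5, 0.5)
A = np.array([[1.0, 1.0, 1.0]])
x = np.array([2.0, 0.5, 0.5])
g = np.array([2 * (x[0] - 1), 2 * (x[1] - 2), 2 * x[2]])   # gradient at x

B, N = A[:, :1], A[:, 1:]                # basis: column of x1 (largest component)
r_N = g[1:] - g[:1] @ np.linalg.inv(B) @ N                 # reduced gradient
d_N = np.where(r_N <= 0, -r_N, -x[1:] * r_N)               # Wolfe's rule
d_B = -np.linalg.inv(B) @ N @ d_N
d = np.concatenate([d_B, d_N])
print(r_N, d, g @ d, A @ d)   # g@d < 0 (improving) and A@d = 0 (feasible)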
The generalised reduced gradient (GRG) method extends Wolfe’s reduced gradient method in two main directions:

1. Consideration of nonlinear constraints, which are handled via linearisation, with an additional restoration phase that has the purpose of recovering feasibility via projection or using Newton’s method.
2. Consideration of superbasic variables, whereby xN is further partitioned into xN⊤ = (xS⊤, xN′⊤). The superbasic variables xS (with index set JS, 0 ≤ |JS| ≤ n − m) are allowed to change value, while xN′ are kept at their current value. Hence, d⊤ = (dB⊤, dS⊤, dN′⊤), with dN′ = 0. From Ad = 0, we obtain dB = −B⁻¹SdS, where S collects the columns of A associated with xS. Thus d becomes

d = [ dB ; dS ; dN′ ] = [ −B⁻¹S ; I ; 0 ] dS.