Convex Optimization
Fall 2014
Nisheeth K. Vishnoi
Preface
These notes are largely based on lectures delivered by the author in the
Fall of 2014. The first three lectures are meant to be an introduction to
the core techniques in convex optimization which are increasingly being
deployed to design fast algorithms for discrete problems. Many thanks
to the scribes: Elisa Celis, Slobodan Mitrovic, Damian Straszak and
Jakub Tarnawski. The material presented in these lectures is inspired by several excellent sources such as Boyd and Vandenberghe's Convex Optimization, Nesterov's Introductory Lectures on Convex Optimization, and Arora et al.'s survey on the multiplicative weights update method.
Thanks to Sushant Sachdeva for many enlightening discussions about
Interior point methods which have influenced the last part of these
notes. Finally, these notes are quite preliminary and will be revised.
Please feel free to email your comments and suggestions.
Nisheeth K. Vishnoi
7 Oct. 2015, EPFL
nisheeth.vishnoi@epfl.ch
1 Basics, Gradient Descent and Its Variants
1.1 Overview
Convex optimization is about minimizing a convex function over a con-
vex set. For a convex set K, and a convex function f whose domain
contains K, the goal is to solve the following problem:
inf_{x∈K} f(x).
ago. And tools and techniques from convex optimization are playing
a bigger and bigger role in this goal. Since general methods of solv-
ing convex programs have runtimes which are far from linear, often
this goal requires adapting known methods from convex optimization
and, more often than not, coming up with novel problem-specific so-
lutions. Thus, techniques from convex optimization have become an
indispensable tool in the toolkit of any algorithm designer and in these
three lectures, we cover a lot of ground in this regard. The methods we
present include gradient descent and its many variants, multiplicative
weight update methods and their application to approximately solving
linear programs and semi-definite programs quickly, Newton’s method
and its application to path-following interior point methods. Through
these methods, we will see a very intimate connection between convex
optimization and online convex optimization. With a good understand-
ing of the material covered in these three lectures, a student should be
well-equipped to understand many recent breakthrough works in TCS
and, hopefully, push the state-of-the-art.
1.2.2 Convexity
We start by defining the basic notions of a convex set and a convex
function.
A set K ⊆ ℝⁿ is convex if

λx + (1 − λ)y ∈ K    (1.1)

for any x, y ∈ K and λ ∈ [0, 1]. Similarly, a function f is convex if

f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y)    (1.2)

for any x, y in its domain and λ ∈ [0, 1]. If inequality (1.2) is strict for all x ≠ y and λ ∈ (0, 1), then we say that f is strictly convex.
To extend the proof to many dimensions, just note that after fixing
points x, y ∈ K it is enough to restrict our attention to the line segment
connecting them.
∇²f(x) ⪰ 0
In the rest of this lecture we will most often use the definition of The-
orem 1.3.
and it yields

f(y) ≈ f(x) − η‖∇f(x)‖².
Thus we see that the norm of the gradient controls the rate at which
we can decrease f . More vividly, the gradient of f at x is the direction
in which f grows the fastest, so if we are looking to minimize f , it
makes the most sense to go in the opposite direction.
Any gradient descent method will work by constructing a sequence
of points x1 , x2 , . . . , xT with the objective of getting very close to the
optimum x? after a small number of iterations. Usually, it will try to
ensure that f(x₁) ≥ · · · ≥ f(x_T) (although the first method that we
will show does not guarantee this). We will not be able to find the exact
optimum; we can only hope to get ε-close to it, for a given accuracy
requirement ε.
The first nice property of convex functions that we can discover in
this context, and a reason why gradient descent does not work for a
general f , is the following: if we keep decreasing the value of f , we will
not get "stuck" in a local minimum which is not a global minimum.
This is guaranteed by the following fact, which again points out the
role of the gradient in optimizing f .
Proof. The direction (a) ⇒ (b) is trivial, and (b) ⇒ (c) holds for all f. For (c) ⇒ (a), just note that for any y ∈ K,

⟨∇f(x), y − x⟩ ≥ 0
where η_t is the step size (which will also depend on f and will be determined as per the setting). For a given ε, our objective is to get ε-close to the optimum f(x⋆) in as few iterations as possible. We want to understand what values of η_t will enable us to achieve this goal.
‖x₁ − x⋆‖ ≤ D.
|f(y) − f(x)| ≤ G · ‖y − x‖
In this case we can get the following bound, which we prove in Section 1.5.1:

f( (1/T) ∑_{t=1}^T x_t ) − f(x⋆) ≤ ε,   where   T = (DG/ε)².
Remark 1.9. We are using the Euclidean norm here and assuming that ‖∇f‖₂ = O(1). But sometimes we might find ourselves in a setting where ‖∇f‖_∞ = O(1), and a naive bound would give us just ‖∇f‖₂ = O(√n). We will see in later lectures how to deal with such situations (see also Section 1.7.6).
f(y) − f(x) − ⟨∇f(x), y − x⟩ ∈ [0, L · ‖x − y‖²]
This means that the distance of f (y) from its first-order Taylor approx-
imation at x is between 0 and L · kx − yk2 , which is to be thought of
as small; see Fig. 1.3. This distance is called the Bregman divergence
with respect to the `2 -norm.
f(x_T) − f(x⋆) ≤ ε   and   T = O(LD²/ε),

where D = ‖x₁ − x⋆‖.
f(x_T) − f(x⋆) ≤ ε   and   T = O(√L · D / √ε).
∇²f(x) ⪰ ℓI.
f(x_T) − f(x⋆) ≤ ε   and   T = O(G²/(ℓε)).
where we used the well-known equality 2⟨a, b⟩ = ‖a‖² + ‖b‖² − ‖a − b‖². We can now take advantage of our assumption about f and write

∑_{t=1}^T (f(x_t) − f(x⋆)) ≤ (1/2)TηG² + (1/(2η))‖x₁ − x⋆‖² − (1/(2η))‖x_{T+1} − x⋆‖².

We may simply bound the last term by zero. For the middle one, recall that we have introduced D, which bounds the distance from the starting point to the optimum. Thus, using ‖x₁ − x⋆‖² ≤ D², we obtain
∑_{t=1}^T (f(x_t) − f(x⋆)) ≤ (1/2)TηG² + (1/(2η))D² = (1/2)(TηG² + D²/η).

The minimizing choice of η here is the one which satisfies TηG² = D²/η, so that

η = D/(G√T).
We have bounded the quantity (1/T) ∑_{t=1}^T (f(x_t) − f(x⋆)). But from convexity of f we have that f((1/T) ∑_{t=1}^T x_t) ≤ (1/T) ∑_{t=1}^T f(x_t). Thus we get

f( (1/T) ∑_{t=1}^T x_t ) − f(x⋆) ≤ (1/(2T))(TηG² + D²/η) = DG/√T.

In order to get DG/√T ≤ ε we need to set T ≥ (DG/ε)².
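The analysis above translates directly into a procedure: run T = (DG/ε)² iterations with the fixed step size η = D/(G√T) and output the average iterate. A minimal sketch follows; the function handle grad_f and the constants D, G are assumptions supplied by the user rather than anything fixed by the text.

```python
import numpy as np

def gradient_descent_lipschitz(grad_f, x1, D, G, T):
    """Fixed-step gradient descent for a G-Lipschitz convex f with
    ||x1 - x_star|| <= D; returns the average iterate, whose suboptimality
    is at most DG/sqrt(T) by the analysis above."""
    eta = D / (G * np.sqrt(T))        # the minimizing choice of the step size
    x = np.array(x1, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(T):
        x = x - eta * grad_f(x)       # x_{t+1} = x_t - eta * grad f(x_t)
        avg += x
    return avg / T

# Example: f(x) = ||x - a||_2 (so G = 1); start at 0 with D = ||a||.
a = np.array([3.0, -4.0])
g = lambda x: (x - a) / max(np.linalg.norm(x - a), 1e-12)
x_avg = gradient_descent_lipschitz(g, np.zeros(2), D=5.0, G=1.0, T=10000)
```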
Once again, we will keep the step length η constant and take
xt+1 − xt = −η · ∇f (xt ).
Let us see how the value of f changes as we iterate. Use Theorem 1.12
to get
and if we bound ‖x_t − x⋆‖ by D (this follows since f(x_t) ≤ f(x₁)), then we get
‖∇f(x_t)‖ ≥ θ/D

and, by (1.9),

f(x_{t+1}) − f(x_t) ≤ −θ²/(4LD²).
Until our distance from the optimum goes down below θ/2, we will make a progress of Ω(θ²/(LD²)) at every step, so we will take at most O(LD²/θ) steps before this happens. Then this process of halving is repeated, and so on, until we reach the required distance ε. Bringing the distance down from θ/2^i to θ/2^{i+1} requires O(LD² · 2^i/θ) steps, so we get

∑_{i=0}^{log(θ/ε)} O(LD² · 2^i/θ) = O(LD² · 2^{log(θ/ε)}/θ) = O(LD²/ε)
∑_{t=1}^T t(f(x_t) − f(x⋆)) ≤ (G²/2) ∑_{t=1}^T tη_t + ∑_{t=2}^T ‖x_t − x⋆‖² · ( t/(2η_t) − ℓt/2 − (t−1)/(2η_{t−1}) )
    + ‖x₁ − x⋆‖² · ( 1/(2η₁) − ℓ/2 ) − ‖x_{T+1} − x⋆‖² · T/(2η_T).
As before, we bound the last term by just zero. Now, to make the sum telescoping, we would like to get t/(2η_t) − ℓt/2 − (t−1)/(2η_{t−1}) = 0 for every t = 2, . . . , T. As for the term 1/(2η₁) − ℓ/2, we would also prefer to remove it, so as not to have any dependence on ‖x₁ − x⋆‖. Solving this system of equations (beginning with 1/(2η₁) − ℓ/2 = 0) yields the following setting of the η_t's:

η_t = 2/(ℓ(t + 1)).
We are thus left with:

∑_{t=1}^T t(f(x_t) − f(x⋆)) ≤ (G²/2) ∑_{t=1}^T tη_t = ∑_{t=1}^T G²t/(ℓ(t + 1)) ≤ ∑_{t=1}^T G²/ℓ = G²T/ℓ.

The rest is straightforward: we normalize by (1 + · · · + T) = T(T + 1)/2 and use convexity of f to get

f( (2/(T(T + 1))) ∑_{t=1}^T t · x_t ) − f(x⋆) ≤ 2G²/(ℓ(T + 1)),
∇²f(x) = A ≻ 0

∇f(x) = Ax − b
as previously. However, this time we will not use the same step length
ηt for every iteration, but instead (exercise) analytically find a closed-
form expression for the best ηt , i.e.
where κ(A) = L/ℓ, the ratio between the largest and the smallest eigenvalue of A, is the condition number of the matrix A.³
³ For more information on this and the related Conjugate Gradient method, see Chapter 15 of the monograph [Vis13].
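For this quadratic case the exercise has a short answer: with g_t = Ax_t − b, minimizing f(x_t − ηg_t) over η gives η_t = ‖g_t‖²/(g_t^⊤Ag_t). A sketch of the resulting steepest-descent iteration, assuming f(x) = ½x^⊤Ax − b^⊤x with A symmetric positive definite (the tolerance and iteration cap are illustrative choices):

```python
import numpy as np

def steepest_descent_quadratic(A, b, x0, tol=1e-10, max_iters=10000):
    """Gradient descent with exact line search on f(x) = 0.5 x^T A x - b^T x,
    where A is symmetric positive definite.  The rate of convergence is
    governed by the condition number kappa(A) = L / ell."""
    x = np.array(x0, dtype=float)
    for _ in range(max_iters):
        g = A @ x - b                  # gradient at the current point
        if np.linalg.norm(g) < tol:
            break
        eta = (g @ g) / (g @ (A @ g))  # exact minimizer of f(x - eta * g)
        x = x - eta * g
    return x

# Example: a small well-conditioned system.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = steepest_descent_quadratic(A, b, np.zeros(2))
```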
1.7 Discussion
1.7.1 Gradient vs. Sub-gradient
1.7.2 Dependence on ε
We have seen in Theorems 1.8, 1.13, 1.14 and 1.17 that our methods converge to the ε-approximate solution in a number of iterations which is O(poly(1/ε)), with the exponent equal to 2, 1 or 1/2. However, a polynomial dependence on ε does not look very good if we hope to use these methods in computer science. For such applications, the right answer should be O(poly(log 1/ε)). And, if the function is both strongly convex
1.7.3 Dependence on n
Note that n does not appear anywhere in the oracle complexity of
the presented algorithms. This is an important feature of first-order
methods. (Of course, the time complexity of a single iteration will still
depend on n.)
and

T = O( (DG/ε)² · n ).
I.e., we compare the player’s loss to the minimum loss which could be
achieved if one knew all the functions ft beforehand, but were only
allowed to pick a single argument x for all of them.5 The player is
trying to minimize the regret, and the adversary is trying to maximize
it. This model has many practical applications, such as stock pricing
and portfolio selection, see [Haz14].
One can describe a strategy for the player which uses gradient de-
scent:
xt+1 = xt − ηt · ∇ft (xt )
for a suitable choice of ηt . Even though the functions ft change from
iteration to iteration, one can still obtain a similar guarantee to the
one from our first analysis. Namely, we are able to make the average
regret go to 0 at a rate inversely proportional to the square root of T :
Regret_T / T = O(1/√T).
5 For some choices of x1 , . . . , xt , f1 , . . . , ft , regret can be negative.
Indeed, note that in the proof of Section 1.5.1 we never used the fact
that the function f remains the same between iterations. We actually
proved the following:
T = (DG/ε)².
Moreover, it produces each point xt+1 knowing only the functions
f1 , . . . , ft .
Then we say that a function f is L-smooth with respect to ‖·‖ if for any x, y ∈ K,

‖∇f(x) − ∇f(y)‖_* ≤ L · ‖x − y‖.
⟨∇f(x_t), x_{t+1} − x_t⟩ = ⟨∇f(x_t), −η∇f(x_t)⟩ = −η‖∇f(x_t)‖₂².
and

T = (DG/ε)².
To compute each point xt , it uses one gradient query and one projection
query.
And for the L-smooth case we get the following analogue of Theo-
rem 1.13.
and

T = O( (LD² + f(x₁) − f(x⋆)) / ε ).
To compute each point xt , it uses one gradient query and one projection
query.
f(x_T) − f(x⋆) ≤ ε   and   T = O(LD²/ε).
To compute each point xt , it uses one gradient query and one call to a
routine which optimizes a linear function over K.
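A method with exactly this oracle pattern, one gradient query and one linear optimization over K per iteration, is the conditional gradient (Frank-Wolfe) scheme. The sketch below is one such instantiation; the step sizes 2/(t + 2) are a standard choice and an assumption here, since the theorem above does not specify them.

```python
import numpy as np

def conditional_gradient(grad_f, lin_opt, x1, T):
    """Each iteration uses one gradient query and one call to lin_opt(c),
    which returns argmin_{v in K} <c, v>.  Iterates stay in K because they
    are convex combinations of points of K."""
    x = np.array(x1, dtype=float)
    for t in range(1, T + 1):
        v = lin_opt(grad_f(x))         # linear optimization oracle over K
        gamma = 2.0 / (t + 2.0)        # a standard step-size choice (assumption)
        x = (1 - gamma) * x + gamma * v
    return x

# Example: K is the probability simplex, so lin_opt returns a vertex e_i.
def simplex_lin_opt(c):
    v = np.zeros_like(c); v[np.argmin(c)] = 1.0
    return v

grad = lambda x: 2 * (x - np.array([0.1, 0.2, 0.7]))   # f(x) = ||x - p||^2
x_T = conditional_gradient(grad, simplex_lin_opt, np.ones(3) / 3, T=200)
```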
2 Multiplicative Weights Update vs. Gradient Descent

2.1 Overview
In the previous lecture we discussed gradient descent-type methods
from convex optimization and showed how they can be straight-
forwardly translated into regret-minimizing algorithms in the online
convex optimization setting. The key observation was that several
gradient-descent type methods are oblivious to the fact that the same
function is used in every round. Today we turn this idea on its head:
namely, we begin with an algorithm that solves an online problem,
and end up with a (new) convex optimization method. As a result, we
can get improved dependence on the number of iterations when we are in settings other than the Euclidean space, or have guarantees on the function in a norm other than the Euclidean norm. For instance, suppose we know that ‖∇f(x)‖_∞ = O(1) which, at best, implies a bound ‖∇f(x)‖₂ = O(√n). We obtain a gradient-descent type algorithm that takes O(log n/ε²) iterations as opposed to O(√n/ε²) iterations if we use the algorithm from the previous lecture. The journey we take in deducing this result is long and rich. The central player is the old, famous and versatile Multiplicative Weights Update (MWU) method. In the process we also
m_i^t ≝ number of mistakes made by expert i until day t,

and

M^t ≝ number of mistakes made by us until day t.
Having these definitions in hand, our goal is to ensure that the regret

Regret^t ≝ M^t − m_⋆^t

is minimized, where m_⋆^t = min_i m_i^t. Note that M^t may even be smaller than m_⋆^t.
Let us first consider some obvious approaches to regret-
minimization given the advice of experts. One natural strategy is to
take advice from an expert who was right on the previous day. However,
one can very easily design examples where this strategy will perform
very poorly.¹ Alternatively, we can consider Majority Vote, in which we
simply see the advice for a given day, and take the majority option. In
other words, we will buy if and only if the majority of the experts say
to buy. However, in this case we can also design examples where this
algorithm performs poorly.
w_i^1 = 1, ∀i.
¹ For example, if half the experts are right on even days and wrong on odd days, and the other half of the experts are wrong on even days and right on odd days, we will always make the wrong choice.
Then, we set

w_i^{t+1} ≝ w_i^t (1 − ε f_i^t)
(1 − x) ≤ e^{−x}.

−ln(1 − x) ≤ x + x².
Theorem 2.3. After t steps, let mti denote the number of mistakes
made by expert i and M t the number of mistakes WMA had made,
and let 0 < ε ≤ 1/2. Then, for every i we have
M^t − 2(1 + ε)m_i^t ≤ (2 ln n)/ε.
Corollary 2.4. After T days, let mT? be the number of mistakes the
best expert has made and let M T be the number of mistakes WMA
has made. If T ≥ (2 ln n)/ε², where 0 < ε ≤ 1/2, then

(1/T)( M^T − 2(1 + ε)m_⋆^T ) ≤ ε.
Note that the corollary says that in the long run, i.e., for large T , the
algorithm essentially makes at most two times more mistakes than the
best expert in hindsight!
The proof is easy and relies on observing that every time the WMA
makes a mistake, the total weight on the experts reduces by a multi-
plicative factor of (1 − ε/2).
Therefore, every time we make a mistake the total weight of the experts decreases by the multiplicative factor of 1 − ε/2. Hence, we can upper bound Φ^{t+1} by n(1 − ε/2)^{M^t}. On the other hand, for every expert i,

Φ^{t+1} ≥ w_i^{t+1} = w_i^1 (1 − ε)^{m_i^t} = (1 − ε)^{m_i^t},    (2.2)
setting. As before, there are n experts, and following solely expert i’s
advice on day t incurs a cost fit where f t : [n] → [−1, 1]. However,
in many cases, as with stocks, we can take fractional decisions. Hence,
for a given vector p^t ∈ ∆_n which represents a convex combination of experts, we incur a loss of ⟨p^t, f^t⟩.
against the best convex combination of the experts, and are interested
in minimizing the regret
∑_{t=1}^T ⟨p^t, f^t⟩ − min_{p∈∆_n} ∑_{t=1}^T ⟨p, f^t⟩.
for a parameter ε > 0 which has the same intention as before. Since f_i^t could be negative or positive, the weight can increase or decrease. However, since ‖f^t‖_∞ ≤ 1 for all t, the weights always remain non-negative. Finally, since we need a probability distribution (or convex combination) p^t ∈ ∆_n, we normalize the weight vector w^t by the total weight Φ^t ≝ ∑_i w_i^t to obtain

p^t ≝ w^t / Φ^t.
We call this method the Multiplicative Weights Update (MWU) algo-
rithm. We are now ready to prove the main theorem of this section.
The only difference from Theorem 2.3 is that the factor of 2 goes away!
The proof is similar to that of Theorem 2.3.
Theorem 2.5. Assume that ‖f^t‖_∞ ≤ 1 for every t, and let 0 < ε ≤ 1/2. Then, the MWU algorithm produces a sequence of probability distributions p¹, . . . , p^T such that

∑_{t=1}^T ⟨p^t, f^t⟩ ≤ min_{p∈∆_n} ∑_{t=1}^T ⟨p, f^t⟩ + (ln n)/ε + εT.

Thus, the number of iterations after which the average regret becomes less than 2ε is at most (ln n)/ε².
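The MWU algorithm of Theorem 2.5 is only a few lines of code. A minimal sketch; loss_stream, which supplies the vectors f^t ∈ [−1, 1]^n, is an assumption standing in for whatever adversary or application produces the losses.

```python
import numpy as np

def mwu(loss_stream, n, eps):
    """Multiplicative Weights Update: maintain weights w, play p^t = w / sum(w),
    observe f^t in [-1, 1]^n, update w_i <- w_i * (1 - eps * f_i^t).
    The total loss exceeds the best fixed distribution's loss by at most
    ln(n)/eps + eps*T."""
    w = np.ones(n)
    total_loss = 0.0
    for f in loss_stream:              # each f is a length-n array in [-1, 1]
        p = w / w.sum()                # current probability distribution
        total_loss += p @ f
        w = w * (1 - eps * f)          # multiplicative update
    return total_loss

# Example usage: T = ln(n)/eps^2 rounds of random losses.
n, eps = 10, 0.1
T = int(np.log(n) / eps**2)
rng = np.random.default_rng(0)
losses = [rng.uniform(-1, 1, n) for _ in range(T)]
print(mwu(losses, n, eps))
```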
Proof. We have

Φ^{t+1} = ∑_i w_i^{t+1} = ∑_i w_i^t (1 − ε f_i^t).

Now, using the facts that p_i^t = w_i^t / Φ^t, ‖p^t‖₁ = 1 and p_i^t ≥ 0, we rewrite

Φ^{t+1} = Φ^t − εΦ^t ∑_i p_i^t f_i^t = Φ^t (1 − ε⟨p^t, f^t⟩).    (2.5)

Hence, using the fact that 1 − x ≤ e^{−x},

Φ^{t+1} ≤ Φ^t e^{−ε⟨p^t, f^t⟩}.
since Φ1 = n.
Our next step is to provide a lower bound on Φ^{T+1}. As before, we observe that Φ^t ≥ w_i^t for any t and any i, and obtain

Φ^{T+1} ≥ w_i^{T+1} = ∏_{t=1}^T (1 − ε f_i^t).
The last inequality is true since ‖f^t‖_∞ ≤ 1. Since this is true for any i, by convexity we obtain that for any p ∈ ∆_n,

∑_{t=1}^T ⟨p^t, f^t⟩ − ∑_{t=1}^T ⟨p, f^t⟩ ≤ (ln n)/ε + εT
illustrate the powerful idea of using the MWU algorithm. The method
has been used in numerous ways to speed up solutions to linear pro-
grams for fundamental problems. We consider the following feasibility problem: given an m × n matrix A and b ∈ ℝ^m, does there exist an x such that Ax ≥ b?
∃?x : Ax ≥ b. (2.8)
For this problem we give an algorithm that, given an error parameter
ε > 0, either outputs a point x such that Ax ≥ b − ε1 or proves that
there is no solution to this linear system of inequalities. We also assume
the existence of an oracle that, given a vector p ∈ ∆_m, solves the following relaxed problem

∃? x : p^⊤Ax ≥ p^⊤b.    (2.9)
Note that the above feasibility problem involves only one inequality.
This, often, may be significantly easier algorithmically than the original
feasibility problem. In any case, we will assume that such an oracle is available. Clearly, if there is an x that satisfies (2.8), then x satisfies (2.9) as well for all p ∈ ∆_m. On
the other hand, if there is no solution to (2.9), then the system (2.8)
is infeasible. Finally, we assume that when the oracle returns a feasible
solution for a p, the solution x that it returns is not arbitrary but has
bounded width:
max |(Ax)i − bi | ≤ 1.
i
In this setting, we can prove the following theorem:
The algorithm in the proof of the theorem is the MWU: we just have to
identify the loss function at time t. We maintain a probability distribu-
tion pt at any time t and pass it to the oracle. As is usual, the starting
probability distribution, p1 , is uniform. We pass this pt to the oracle.
If the oracle returns that the system (2.9) is infeasible, the algorithm
x̃ ≝ (1/T) ∑_t x^t.
Thus, the loss at every step is at least 0. Hence, from Theorem 2.5, for the choice of T = ln m/ε², we obtain that for every i, −(1/T) ∑_{t=1}^T f_i^t ≤ 2ε. Now, using the definition of x̃ = (1/T) ∑_t x^t, we obtain that for all i,

(Ax̃)_i ≥ b_i − 2ε.
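A sketch of this feasibility algorithm, with the oracle for (2.9) left abstract; the oracle function, its None convention for infeasibility, and the width-1 bound on its output are the interface assumed here.

```python
import numpy as np

def mwu_lp_feasibility(A, b, oracle, eps):
    """Approximate feasibility of Ax >= b (A is m x n) via MWU over the m
    constraints.  oracle(p) must return x with p^T A x >= p^T b and width
    |(Ax)_i - b_i| <= 1, or None if no such x exists.  After T = ln(m)/eps^2
    rounds the average point x_tilde satisfies (A x_tilde)_i >= b_i - 2*eps."""
    m = A.shape[0]
    T = int(np.ceil(np.log(m) / eps**2))
    w = np.ones(m)
    xs = []
    for _ in range(T):
        p = w / w.sum()
        x = oracle(p)                 # solve the single-inequality relaxation
        if x is None:
            return None               # certificate that Ax >= b is infeasible
        xs.append(x)
        f = A @ x - b                 # loss of constraint i is (Ax)_i - b_i
        w = w * (1 - eps * f)         # constraints that lag behind gain weight
    return np.mean(xs, axis=0)
```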
f (x) − f (x? ) ≤ ε.
(For negative entropy, this results in the choice of the vector where all the coordinates are the same, i.e., the uniform distribution over the experts, or the vector (1/n)·1). Let x^t ∈ K. Once we map x^t to ∇M(x^t) and do the gradient step, we may move out of K but should remain in K′. Let y^{t+1} be the point in K′ which corresponds to the resulting point in S. In other words, given x^t and ∇f, for t ≥ 1 we let y^{t+1} be the point such that

∇M(y^{t+1}) = ∇M(x^t) − η∇f(x^t).    (2.10)
divergence as follows:

x^{t+1} ≝ arg inf_{x∈K} D_M(x, y^{t+1}).
(In the MWU setting, the Bregman divergence of the negative entropy function corresponds to the Kullback-Leibler divergence, and projecting according to it is nothing but normalizing a non-negative vector to make its ℓ₁ norm 1.) In this setting, we can prove the following theorem, which is a generalization of the MWU method (and from which we can recover the MWU method):
where

T = (1/ℓ) · (DG/ε)².
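For the negative-entropy mirror map on the simplex, (2.10) becomes a coordinatewise multiplicative update followed by ℓ₁-normalization (the KL projection). A minimal sketch of this instance; the step size η and the number of iterations are assumptions to be set according to the theorem above.

```python
import numpy as np

def entropic_mirror_descent(grad_f, x1, eta, T):
    """Mirror descent with M(x) = sum_i x_i log x_i over the simplex:
    grad M(y^{t+1}) = grad M(x^t) - eta * grad f(x^t) gives
    y^{t+1} = x^t * exp(-eta * grad f(x^t)), and the Bregman (KL) projection
    back onto the simplex is just l1-normalization."""
    x = np.array(x1, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(T):
        y = x * np.exp(-eta * grad_f(x))  # step taken in the mirror (dual) space
        x = y / y.sum()                   # KL projection onto the simplex
        avg += x
    return avg / T
```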
at round t:

where A • B ≝ Tr(A^⊤B) is the usual matrix inner product. So if we denote P^t = E_{v∼µ}[vv^⊤], then the loss at round t is simply P^t • F^t. Observe that P^t is PSD and Tr(P^t) = 1. Our algorithm, instead of working directly with a distribution over S^{n−1}, will keep a matrix P^t. Our objective is to ensure that ∑_{t=1}^T P^t • F^t is as small as possible, compared to min_{‖w‖₂=1} ∑_{t=1}^T w^⊤F^t w. The latter is nothing but the smallest eigenvalue of ∑_{t=1}^T F^t. We proceed with the description of our
(1) Initialize Q¹ = I.
(2) In round t, use the matrix

    P^t = Q^t / Φ^t,   where Φ^t ≝ Tr(Q^t).

(3) Observe the loss matrix F^t, and update Q^t as follows:

    Q^{t+1} = e^{−ε ∑_{k=1}^t F^k},

where e^A ≝ ∑_{k=0}^∞ A^k/k!.
∑_{t=1}^T P^t • F^t − inf_{v∈S^{n−1}} ∑_{t=1}^T v^⊤F^t v ≤ (ln n)/ε + εT.    (2.11)
Thus, after T = (ln n)/ε²,

(1/T) ∑_{t=1}^T P^t • F^t ≤ λ_min( (1/T) ∑_{t=1}^T F^t ) + 2ε.
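A sketch of the matrix MWU update using a dense matrix exponential (scipy's expm); this is meant only to illustrate the algorithm, not to be an efficient implementation, and the symmetric loss matrices F^t are assumed to be supplied by the application.

```python
import numpy as np
from scipy.linalg import expm

def matrix_mwu(loss_matrices, eps):
    """Matrix MWU: P^t = Q^t / Tr(Q^t) with Q^{t+1} = exp(-eps * sum_{k<=t} F^k).
    Returns the total loss sum_t P^t . F^t together with the smallest eigenvalue
    of sum_t F^t, which it exceeds by at most ln(n)/eps + eps*T by (2.11)."""
    n = loss_matrices[0].shape[0]
    F_sum = np.zeros((n, n))
    total = 0.0
    for F in loss_matrices:            # each F is a symmetric n x n matrix
        Q = expm(-eps * F_sum)         # Q^t built from the accumulated losses
        P = Q / np.trace(Q)            # density matrix: PSD with trace 1
        total += np.sum(P * F)         # P . F = Tr(P^T F)
        F_sum += F
    return total, np.linalg.eigvalsh(F_sum)[0]
```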
Tr(e^{A+B}) ≤ Tr(e^A e^B).

e^{−A} ⪯ I − A + A².
Proof. As in the proof of Theorem 2.5 we try to obtain upper and lower bounds on the potential function Φ^{t+1}, which in our case is defined to be Φ^{t+1} ≝ Tr(Q^{t+1}). Let us start with the upper bound:

Φ^{t+1} = Tr(Q^{t+1}) = Tr( e^{−ε ∑_{k=1}^t F^k} )
        ≤ Tr( e^{−ε ∑_{k=1}^{t−1} F^k} e^{−εF^t} )        (Golden-Thompson)
        ≤ Tr( e^{−ε ∑_{k=1}^{t−1} F^k} (I − εF^t + ε²(F^t)²) )
        = Tr(Q^t) − εTr(Q^t F^t) + ε²Tr(Q^t (F^t)²).

Now,

Tr(Q^t F^t) = Tr(Q^t) · Tr(P^t F^t) = Φ^t (P^t • F^t).

Similarly:

Tr(Q^t (F^t)²) = Tr(Q^t) · Tr(P^t (F^t)²) = Φ^t (P^t • (F^t)²).
Expanding further,

Φ^{t+1} ≤ n e^{−ε ∑_{k=1}^t (P^k • F^k) + ε² ∑_{k=1}^t (P^k • (F^k)²)}.    (2.12)
It remains to obtain a suitable lower bound for Φ^{t+1}. To this end let us fix any vector v with ‖v‖₂ = 1. We use Fact 2.11 with A = −ε ∑_{k=1}^t F^k:

e^{−ε ∑_{k=1}^t v^⊤F^k v} ≤ Tr( e^{−ε ∑_{k=1}^t F^k} ) = Φ^{t+1}.    (2.13)
3 Newton's Method and the Interior Point Method

3.1 Overview
In the last of the three foundational lectures we continue our journey
towards faster (and better) algorithms for convex programs. The al-
gorithms introduced till now assume access only to an oracle for the
value of the convex function and its gradient at a specified point. In
this lecture we assume that we are also given a second order access to
f: namely, given vectors x and y, we can obtain (∇²f(x))^{-1}y. The resulting method is Newton's method for unconstrained optimization, and it has the property that if one starts close enough to the optimal solution, it converges in log log(1/ε) iterations! Finally, we present the application of Newton's method to solving constrained convex programs. This is achieved by moving from constrained to unconstrained optimization via a barrier function. The resulting methods are broadly termed interior point methods. We an-
alyze one such method, referred to as the primal path-following interior
point method, for linear programs. We end this lecture by a discussion
on self-concordant barrier functions which allow us to go beyond linear
programs to more general convex programs.
x₁ ≝ x₀ − g(x₀)/g′(x₀)

x_{k+1} ≝ x_k − g(x_k)/g′(x_k)   for all k ≥ 0.    (3.1)
0 = g′(x_k)(x_k − x_{k+1}) + (r − x_k)g′(x_k) + (1/2)(r − x_k)²g″(θ),

which implies that

g′(x_k)(r − x_{k+1}) = −(1/2)(r − x_k)²g″(θ).

This gives us the relation between the new distance from the root in terms of the old distance from it:

|r − x_{k+1}| = |g″(θ)/(2g′(x_k))| · |r − x_k|².
|r − x₁| ≤ M|r − x₀|²

where M = sup_{x∈[r,x₀]} |g″(θ)/(2g′(x))|.

‖x₁ − x⋆‖ ≤ M‖x₀ − x⋆‖²

for some constant M. For example M = L/(2h) will do.
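The one-dimensional iteration (3.1) is easy to try out; a minimal sketch (the functions g, g′ and the starting point are the user's assumptions, and no safeguarding against g′ vanishing is attempted).

```python
def newton_root(g, g_prime, x0, iters=10):
    """Newton's method for a root of g: x_{k+1} = x_k - g(x_k) / g'(x_k).
    Close enough to the root the error is roughly squared at every step
    (quadratic convergence), so ~log log(1/eps) iterations suffice."""
    x = x0
    for _ in range(iters):
        x = x - g(x) / g_prime(x)
    return x

# Example: square roots via g(x) = x^2 - 2.
r = newton_root(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=2.0)
```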
Proof. [Proof of Theorem 3.2] The basic idea of the proof is the same
as in 3.1. We need a similar tool as the Taylor expansion used in the
previous chapter. To obtain such, we consider the function φ : [0, 1] →
Rn , φ(t) = ∇f (x + t(y − x)). Applying the fundamental theorem of
calculus to φ (to every coordinate separately) yields:
φ(1) − φ(0) = ∫₀¹ φ′(t) dt,

that is,

∇f(y) − ∇f(x) = ∫₀¹ H(x + t(y − x))(y − x) dt.    (3.3)
form:

x₁ − x⋆ = x − x⋆ + n(x)
        = x − x⋆ − H(x)^{-1}∇f(x)
        = x − x⋆ + H(x)^{-1}(∇f(x⋆) − ∇f(x))
        = x − x⋆ + H(x)^{-1} ∫₀¹ H(x + t(x⋆ − x))(x⋆ − x) dt
        = H(x)^{-1} ∫₀¹ (H(x + t(x⋆ − x)) − H(x))(x⋆ − x) dt.
description of the linear program. In the linear programming setting, L^{O(1)} can be shown to bound the number of binary digits required to represent the coordinates of the optimal solution.
⁴ Actually we are not losing generality here. Every convex problem can be stated equiv-
boundary, which will force our steps to be short and thus inefficient. In case of K being a polytope, such a method would be equivalent to the simplex algorithm, which is known to have an exponential worst-case running time. For this reason we need to set some force which would repel us from the boundary of K. More formally, we want to move our constraints to the objective function and consider c^⊤x + F(x), where
F (x) can be regarded as a “fee” for violating constraints. F (x) should
become big for x close to ∂K. Of course, if we would like the methods
developed in the previous sections for unconstrained optimization to be
applicable here, we would also like f to be strongly convex. Thus, this
is another route we could take to convert a constrained minimization
problem into unconstrained minimization, but with a slightly altered
objective function. To formalize this approach we introduce the notion
of a Barrier Function. Instead of giving a precise definition, we list
some properties, which we wish to hold for a barrier function F :
{x⋆_η : η ≥ 0}

lim_{η→∞} x⋆_η = x⋆.
in the 60s. However, there was no analysis for the time needed for
convergence. In 1984 Narendra Karmarkar announced a polynomial-
time algorithm to solve linear programs using an IPM. At that point
there was already a known polynomial time algorithm for solving LPs,
namely the Ellipsoid Algorithm from 1979. Further, the method of
choice in practice, despite being known to be inefficient in the worst case, was
the Simplex method. However, in his paper, he also presented empirical
results which showed that his algorithm was consistently 50 times faster
than the simplex method. This event, which received publicity around
the world throughout the popular press and media, marks the beginning
of the interior-point revolution. For a nice historical perspective on this
revolution, we refer the reader to the survey of Wright [Wri05].
Karmarkar’s algorithm needed roughly O(m log 1/ε) iterations to
find a solution (which is optimal up to an additive error of ε) to a
linear program, where m is the number of constraints. Each such it-
eration involved solving a linear system of equations, which could be
easily performed in polynomial time. Thanks to the logarithmic depen-
dence on the error ε it can be used to find the exact, optimal solution
to a linear program in time polynomial with respect to the encoding
size of the problem. Thus, Karmarkar’s algorithm was the first efficient
algorithm with provable worst case polynomial running time. Subse-
quently, James Renegar proposed an interior point algorithm with re-
√
duced number of iterations: O( m log(1/ε)). Around the same time,
Nesterov and Nemirovski abstracted out the essence of interior point
methods and came up with the the notion of self-concordance, which
in turn was used to provide efficient, polynomial time algorithms for
many nonlinear convex problems such as semi-definite programs.
We begin our discussion by introducing the logarithmic barrier func-
tion and then Newton’s Method, which is at a core of all interior-
point algorithms. Then we present Renegar’s primal path following
method. We give a full analysis of this algorithm, i.e. we prove that after Õ(√m log(1/ε)) iterations it outputs a solution which is ε-close to the optimum.
It is easy to see that F (x) is well defined on int(P ) and tends to infinity
when approaching the boundary of P . Let us now write down formulas
for the first and second derivative of F , they will prove useful a couple
of times in the analysis. To simplify the notation, we will often write
s_i(x) for b_i − a_i^⊤x.

(1) ∇F(x) = ∑_{i=1}^m a_i / s_i(x)
(2) ∇²F(x) = ∑_{i=1}^m a_i a_i^⊤ / s_i(x)²
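These formulas translate directly into code; a small helper computing F, ∇F and ∇²F for P = {x : Ax ≤ b} (it assumes the given point x is strictly feasible).

```python
import numpy as np

def log_barrier(A, b, x):
    """F(x) = -sum_i log(s_i(x)) with s_i(x) = b_i - a_i^T x, together with
    grad F(x) = sum_i a_i / s_i and Hess F(x) = sum_i a_i a_i^T / s_i^2."""
    s = b - A @ x                                   # slacks, must be > 0
    F = -np.sum(np.log(s))
    grad = A.T @ (1.0 / s)
    hess = A.T @ np.diag(1.0 / s**2) @ A
    return F, grad, hess
```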
Lemma 3.4. For every η > 0 we have c^⊤x⋆_η − c^⊤x⋆ < m/η.

The point x⋆_η is the minimum of f_η, hence ∇f_η(x⋆_η) = 0 and so:
We will show even more: that for every two points x, y in the interior of P, we have ⟨g(x), y − x⟩ < m. This follows by a simple calculation:

⟨g(x), y − x⟩ = ∑_{i=1}^m a_i^⊤(y − x)/s_i(x)
             = ∑_{i=1}^m [ (b_i − a_i^⊤x) − (b_i − a_i^⊤y) ] / s_i(x)
             = ∑_{i=1}^m (s_i(x) − s_i(y))/s_i(x) = m − ∑_{i=1}^m s_i(y)/s_i(x) < m,

where in the last inequality we make use of the fact that our points x, y are strictly feasible, i.e. s_i(x), s_i(y) > 0 for all i.
later, for now let us assume that it is possible to provide some η0 > 0
and the corresponding point x0 = x?η0 .
One step of our algorithm will essentially correspond to one step
of Newton’s method. Before we give a full description of the algorithm
recall that n(x) is the Newton step at point x with respect to some
underlying function f . Whenever we write n(xk ), we have the function
f_{η_k}(x) = η_k c^⊤x + F(x) in mind. So

n(x_k) = −H(x_k)^{-1}∇f_{η_k}(x_k).

We now give the algorithm.
c^⊤x̂ ≤ c^⊤x⋆ + ε.
To prove this, we consider one iteration and show that, if started with ‖n(x_k)‖_{x_k} ≤ 1/2, then we will end up with ‖n(x_{k+1})‖_{x_{k+1}} ≤ 1/2. Every iteration consists of two steps: the Newton step w.r.t. f_{η_k} and the increase of η_k to η_{k+1}. We formulate another two lemmas explaining what happens when performing those steps.
Lemma 3.8. After taking one Newton step at point x w.r.t. the function f_η, the new point x′ satisfies:

‖n(x′)‖_{x′} ≤ ‖n(x)‖²_x.
It is easy to verify that the above two lemmas together imply that if ‖n(x_k)‖_{x_k} ≤ 1/2, then

‖n(x_{k+1})‖_{x_{k+1}} ≤ 1/4 + 1/8 + o(1) < 1/2.

Hence, Lemma 3.7 follows by induction.
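One possible implementation of this scheme is sketched below: each iteration takes one Newton step with respect to f_η and then multiplies η by (1 + δ). The choice δ = 1/(8√m) is an assumption consistent with the √m-type increase suggested by the analysis (the exact constant of Lemma 3.9 is not reproduced here), and the starting pair (η₀, x₀) with ‖n(x₀)‖_{x₀} ≤ 1/2 is assumed to be given.

```python
import numpy as np

def path_following_lp(A, b, c, x0, eta0, eps, delta=None):
    """Primal path-following IPM sketch for min c^T x over P = {x : Ax <= b}.
    Maintains a point close (in the local norm) to the minimizer of
    f_eta(x) = eta * c^T x + F(x) and multiplies eta by (1 + delta) each
    iteration until eta >= m / eps."""
    m = A.shape[0]
    if delta is None:
        delta = 1.0 / (8.0 * np.sqrt(m))      # one choice of increase rate (assumption)
    x, eta = np.array(x0, dtype=float), float(eta0)
    while eta < m / eps:
        s = b - A @ x                         # slacks of the current iterate
        grad = eta * c + A.T @ (1.0 / s)      # grad f_eta(x)
        H = A.T @ np.diag(1.0 / s**2) @ A     # Hessian of the barrier at x
        x = x + np.linalg.solve(H, -grad)     # one Newton step w.r.t. f_eta
        eta *= 1.0 + delta                    # advance along the central path
    return x
```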
It remains to prove Lemmas 3.8 and 3.9. We start with the latter,
since its proof is a bit simpler.
Let us stop here for a moment and try to understand the significance of the specific terms in the last expression. In our analysis of the algorithm, the term ‖H^{-1}(x)∇f_η(x)‖_x is a small constant. The goal is to bound the left hand side by a small constant as well. We should think of η′ as η(1 + δ) for some small δ > 0. In such a setting (η′/η)‖H^{-1}(x)∇f_η(x)‖_x will still be a small constant, so what prevents us from choosing a large δ is the second term (1 − η′/η)‖H^{-1}g(x)‖_x. We need to derive an upper bound on ‖H^{-1}g(x)‖_x. We show that ‖H^{-1}g(x)‖_x ≤ √m. Let us denote z ≝ H^{-1}g(x). We get:
‖z‖²_x = z^⊤g(x) = ∑_{i=1}^m z^⊤a_i/s_i(x) ≤ √m · √( ∑_{i=1}^m (z^⊤a_i)²/s_i(x)² ).    (3.13)

Moreover,

∑_{i=1}^m (z^⊤a_i)²/s_i(x)² = z^⊤( ∑_{i=1}^m a_i a_i^⊤/s_i(x)² )z = z^⊤H(x)z = ‖z‖²_x.    (3.14)

Putting (3.13) and (3.14) together we obtain that ‖z‖²_x ≤ √m · ‖z‖_x, so in fact ‖z‖_x ≤ √m.
Before we proceed with the proof of Lemma 3.8 let us remark that it
can be seen as an assertion that x belongs to the quadratic convergence
region of Newton’s method.
Proof. [Proof of Lemma 3.8] We know that the point x′ is the minimizer of the second order approximation of f_η at point x, hence:

∇f_η(x′) = ηc + ∑_{i=1}^m a_i/s_i(x′)
         = −( ∑_{i=1}^m a_i/s_i(x) + ∑_{i=1}^m a_i a_i^⊤(x′ − x)/s_i(x)² ) + ∑_{i=1}^m a_i/s_i(x′)
         = ∑_{i=1}^m ( a_i/s_i(x′) − a_i/s_i(x) − a_i a_i^⊤(x′ − x)/s_i(x)² )
         = ∑_{i=1}^m ( a_i a_i^⊤(x′ − x)/(s_i(x)s_i(x′)) − a_i a_i^⊤(x′ − x)/s_i(x)² )
         = ∑_{i=1}^m a_i (a_i^⊤(x′ − x))² / (s_i(x)² s_i(x′)).
Our goal is to show that ‖n(x′)‖_{x′} ≤ ‖n(x)‖²_x. Instead, we will prove that for every vector z, we have that ⟨z, n(x′)⟩_{x′} ≤ ‖z‖_{x′}‖n(x)‖²_x. Indeed:

⟨z, n(x′)⟩_{x′} = z^⊤∇f_η(x′) = ∑_{i=1}^m z^⊤a_i (a_i^⊤(x′ − x))² / (s_i(x)² s_i(x′))
    ≤ ( ∑_{i=1}^m (z^⊤a_i)²/s_i(x′)² )^{1/2} · ( ∑_{i=1}^m (a_i^⊤(x′ − x))⁴/s_i(x)⁴ )^{1/2}        (Cauchy-Schwarz)
    ≤ ‖z‖_{x′} · ∑_{i=1}^m (a_i^⊤(x′ − x))²/s_i(x)²  =  ‖z‖_{x′}‖n(x)‖²_x,

which completes the proof.
⁸ In this presentation we will not discuss some details, e.g. the full-dimensionality assumption. These are easy to deal with and can be found in any book on linear programming.
Suppose now we have some very small ν > 0 and a point x such that ‖n₀(x)‖_x ≤ 1/2. By performing two further Newton steps, we may assume ‖n₀(x)‖_x ≤ 1/16. We have:

n₀(x) = H(x)^{-1}(−νg(x₀) + g(x)),

hence:

‖H(x)^{-1}g(x)‖_x ≤ ν‖H(x)^{-1}g(x₀)‖_x + ‖n₀(x)‖_x ≤ ν‖H(x)^{-1}g(x₀)‖_x + 1/16.

So it turns out, by Lemma 3.10, that it is enough to make ν‖H(x)^{-1}g(x₀)‖_x smaller than a constant. We can make ν as small as we want, but what about ‖H(x)^{-1}g(x₀)‖_x? Let us present a technical claim, which we leave without proof:
Claim 3.11. All the calculations during the backward walk along Γ_d can be performed with 2^{O(L)} precision.

This means that we may assume that all the points x computed during the procedure have only O(L) digits after the decimal point, and are of absolute value at most 2^{O(L)}. This allows us to provide an upper bound on ‖H(x)^{-1}g(x₀)‖_x.

This follows from the fact that a solution to a linear system is of polynomial size. Now it remains to take ν so small that ν‖H(x)^{-1}g(x₀)‖_x ≤ 1/8. By our previous reasoning this is enough to get a suitable (η₀, x₀). To reach such a ν we need to perform Õ(√m log(1/ν)) iterations, but log(1/ν) is polynomial in L, so we are done.
To complete our considerations we show how to find a suitable x0 ∈
int(P ). To this end, we consider an auxiliary linear program:
min_{(t,x)∈ℝ^{n+1}} t
s.t. a_i^⊤x ≤ b_i + t   for all i.    (3.15)

Taking x = 0 and t big enough (t = 1 + ∑_i |b_i|), we get a strictly feasible solution. Having such, we may solve (3.15) up to an additive error of 2^{−O(L)} in polynomial time (by the IPM). This will give us a solution (t₀, x₀) with t₀ = −2^{−Ω(L)}, so x₀ ∈ int(P) is a suitable starting point for our walk along Γ_d.
barrier function for a convex set, one can design a path following IPM
for it.
then consider the slacks s_i(x) = b_i − ⟨a_i, x⟩ which capture the extent to which a constraint is satisfied. Clearly, if x is a feasible point, then s_i(x) ≥ 0 for all i. Note that, when clear from context, we will often write s_i for s_i(x) for conciseness. The log-barrier function and its differentials are:

F(x) = −∑_i log(s_i(x)),

∇F(x) = ∑_i a_i/s_i,   and

∇²F(x) = ∑_i a_i a_i^⊤/s_i².
We often write H(x) = ∇²F(x) for the Hessian at x, and note that H(x) ⪰ 0 (and in fact H(x) ≻ 0 when A is full-dimensional). Now, we can consider the ellipsoid centred at x and defined by H, namely

E_x(H(x)) = {y : (y − x)^⊤H(x)(y − x) ≤ 1}.

This is the Dikin ellipsoid centred at x.
¹⁰ In fact, this also holds when P is not bounded in certain cases, such as when P is the positive orthant.
Proof. This is easy to see since, for any y ∈ E_x, all slacks are non-negative, and hence y ∈ P. Formally, let y ∈ E_x(H(x)). Then,

(y − x)^⊤H(x)(y − x) ≤ 1
⇒ ∑_i ⟨a_i, y − x⟩²/s_i² ≤ 1
⇒ ⟨a_i, y − x⟩²/s_i² ≤ 1   ∀i   (since all summands are non-negative)
⇒ ( (s_i(x) − s_i(y))/s_i(x) )² ≤ 1   ∀i
⇒ |1 − s_i(y)/s_i(x)| ≤ 1   ∀i.

Hence, for all i we know that 0 ≤ s_i(y)/s_i(x) ≤ 2, and, in particular, s_i(y) ≥ 0 for all i.
Let us start with a very interesting lemma describing one crucial prop-
erty of the analytic center of P :
Proof. Let us denote r(x) = ∑_{i=1}^m s_i(x)/s_i(x_c). We know that r(x_c) = m. To show that a function is constant, it remains to find its derivative and argue that it is zero:

∇r(x) = ∇( ∑_{i=1}^m s_i(x)/s_i(x_c) ) = −∑_{i=1}^m a_i/s_i(x_c) = −∇F(x_c) = 0,

since x_c, the analytic center, is the minimizer of F.
Now we are ready to prove the second theorem describing the Dikin ellipsoid. This time we look at the Dikin ellipsoid centered at a specific point x_c:
(x − x_c)^⊤H(x_c)(x − x_c) = ∑_{i=1}^m s_i(x)²/s_i(x_c)² − m.

Observe that all the terms in the sum ∑_{i=1}^m s_i(x)²/s_i(x_c)² are nonnegative, so we can use the simple bound ∑_{i=1}^m s_i(x)²/s_i(x_c)² ≤ ( ∑_{i=1}^m s_i(x)/s_i(x_c) )² = m² (the last equality again by Lemma 3.16) and obtain finally:
The last theorem we want to prove about the Dikin ellipsoid asserts that when we restrict ourselves to symmetric polytopes (i.e. symmetric with respect to the center of P) then the Dikin ellipsoid at the center, inflated only √m times, contains the polytope.
Remark 3.19. Note that the Dikin ellipsoid depends not exactly on
P but rather on the description of P (on the set of constraints). In the
above theorem we assume that the description of P is reasonable. Since
P is symmetric (with respect to the origin), we can assume that for every constraint x^⊤a_i ≤ b_i there is a corresponding constraint −x^⊤a_i ≤ b_i.
Theorems 3.17 and 3.18 together with Theorem 3.15 demonstrate that the Dikin ellipsoid centered at x_c is a very good approximation of the polytope P. We obtain an ellipsoid with a blow-up ratio of m (that is, E_{x_c}(H(x_c)) ⊆ P ⊆ m·E_{x_c}(H(x_c))), or √m for the symmetric case. One can ask if this is the best we can do. It turns out that we can obtain a better ratio by taking the so-called John ellipsoid. It is defined as the largest-volume ellipsoid contained in the polytope P. When we inflate it by a factor of n, it contains P inside (similarly, √n is enough for symmetric polytopes). This means that the John ellipsoid achieves a better blow-up ratio, because n ≤ m (we need at least n linear inequalities to define a non-degenerate, bounded polytope in n-dimensional space). One can also prove that this is indeed tight: the best possible blow-up ratio for the n-dimensional simplex is exactly n.
3.6.1 The Dikin Algorithm and some Intuition for √m Iterations

The results above imply a particularly simple greedy IPM: start with the analytic centre x_c of a polytope P and move to the boundary of E_{x_c}(H(x_c)) in the direction of −c, where c is the cost vector. Repeat the same from the next point. First note that algorithmically this makes sense, as Theorem 3.15 implies that throughout the execution of the algorithm we are always inside P. The cost is also decreasing. However, we can only argue that the cost becomes a 1/√m fraction (in the symmetric case; 1/m in the non-symmetric case) of the optimal cost in the first iteration.
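One step of this greedy scheme has a closed form: the minimizer of c^⊤y over E_x(H(x)) is y = x − H(x)^{-1}c/√(c^⊤H(x)^{-1}c). A sketch of the update (the stopping rule is deliberately left out, matching the informal discussion above).

```python
import numpy as np

def dikin_step(A, b, c, x):
    """Move from x to the point of the Dikin ellipsoid E_x(H(x)) that minimizes
    the cost c^T y.  By Theorem 3.15 this point is still inside P."""
    s = b - A @ x
    H = A.T @ np.diag(1.0 / s**2) @ A      # Hessian of the log-barrier at x
    d = np.linalg.solve(H, c)              # H(x)^{-1} c
    return x - d / np.sqrt(c @ d)          # step to the ellipsoid boundary
```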
We have obtained:

Thus

(c^⊤x − c^⊤x⋆)(1 − ‖x − y‖_x) ≤ c^⊤x⋆_η − c^⊤x⋆.

By applying Lemma 3.4 we get the result.
Note that in the algorithm we never literally mention the condition that ‖x − x⋆_η‖_x is small. However, we will show that it follows from ‖n(x)‖_x being small. We will need the following simple fact about the logarithmic barrier. A proof appears in the notes on the Volumetric Barrier in the next lecture.
Fact 3.21. If x, z ∈ int(P) are close to each other, i.e. ‖z − x‖_z ≤ 1/4, then H(x) and H(z) are close as well:

(4/9)·H(x) ⪯ H(z) ⪯ 4·H(x).
Theorem 3.22. Let x be any point in int(P) and η > 0. Consider the Newton step n(x) at point x with respect to f_η. If ‖n(x)‖_x ≤ 1/18, then ‖x − x⋆_η‖_x ≤ 5‖n(x)‖_x.
by Cauchy-Schwarz. Next:

h^⊤∇²f_η(θ)h = h^⊤H(θ)h ≥ (4/9)·h^⊤H(x)h,    (3.19)

where we used Fact 3.21. Applying the bounds (3.18) and (3.19) to the expansion (3.17) results in:

f_η(x + h) ≥ f_η(x) − ‖h‖_x‖n(x)‖_x + (2/9)‖h‖²_x.    (3.20)
fη (x + h) ≥ fη (x)
which implies that x?η , the unique minimizer of fη , belongs to the above
mentioned ball and the theorem follows.
Combining Lemma 3.20 with Theorem 3.22, we get that if η ≥ m/ε and ‖n(x)‖_x ≤ 1/18, then c^⊤x − c^⊤x⋆ < 2ε.
4 Volumetric Barrier, Universal Barrier and John Ellipsoids
4.1 Overview

In this lecture we take a step towards improving the √m in the number of iterations of the path-following IPM presented in the previous lecture to √n. This will be achieved by considering more involved barrier functions rather than altering the basic framework of IPMs. The log-barrier function was particularly nice as working with it was computationally not harder than solving linear systems. However, it had many shortcomings. An obvious one is that it has no way to take into account that a constraint could be repeated many times. One way to handle this would be to work with weighted barrier functions which have the following form:

−∑_i w_i(x) log s_i(x),

where we could allow the weights to depend on the current point as well. Analyzing such methods, however, poses great technical challenges as now the gradients and Hessians become more unwieldy. Amazingly, Vaidya laid the foundations of analyzing such weighted barrier functions. He introduced one such function, the volumetric barrier, and showed how it allows us to improve the √m to (mn)^{1/4}. Computation-
Note that the first property captures the fact that, if two points x and z
are close in the local norm, then the Hessian does not change by much.
These two requirements along with differentiability are equivalent to
self-concordance with complexity parameter θ/ε. We leave the proof to
the reader.
We will show that for this function θ = √m and δ = O(1). Hence, via Theorem 4.1, it follows that this results in an IPM which converges in roughly √m iterations.
Let S be the diagonal matrix where S_ii = s_i(x). We can now restate the gradient and Hessian of F in this more convenient notation as ∇F(x) = A^⊤S^{-1}1 and ∇²F(x) = A^⊤S^{-2}A. Hence,

‖∇F(x)‖²_{(∇²F(x))^{-1}} = (∇F(x))^⊤(∇²F(x))^{-1}(∇F(x))
                        = (A^⊤S^{-1}1)^⊤(A^⊤S^{-2}A)^{-1}(A^⊤S^{-1}1)
                        = 1^⊤(S^{-1}A(A^⊤S^{-2}A)^{-1}A^⊤S^{-1})1.

Denote

Π ≝ S^{-1}A(A^⊤S^{-2}A)^{-1}A^⊤S^{-1},

and note that Π is a projection matrix, i.e., Π² = Π. Hence,

‖∇F(x)‖²_{(∇²F(x))^{-1}} = 1^⊤Π1 = 1^⊤Π²1 = ‖Π1‖²₂ ≤ ‖1‖²₂ = m.

Thus, ‖∇F(x)‖_{(∇²F(x))^{-1}} ≤ √m as desired.
H_x = A_x^⊤A_x = A^⊤S^{-2}A.
∇V_x = A_x^⊤σ_x    (4.1)

where

σ_{x,i} ≝ a_i^⊤H_x^{-1}a_i / s_{x,i}².
An informed reader could compare σ_{x,i} with leverage scores, or with the effective resistances that arise when dealing with graph Laplacians. We note some simple properties of σ_x which will allow us to build towards finding the θ and δ necessary for applying Theorem 4.1.
Fact 4.3.
(1) σ_x ≤ 1 (entrywise).
(2) σ_x^⊤1 = n.
Proof. The first fact follows from the fact that a_i^⊤H_x^{-1}a_i/s_{x,i}² ≤ 1 if and only if a_i a_i^⊤/s_{x,i}² ⪯ H_x. But the latter is true since H_x is the sum over all i of the
quantities on the left hand side. For the second part note that

σ_x^⊤1 = ∑_i a_i^⊤H_x^{-1}a_i / s_{x,i}²
       = Tr( H_x^{-1} ∑_i a_i a_i^⊤/s_{x,i}² )
       = Tr( H_x^{-1} H_x )
       = n.
Thus, each σx,i is at most one and they sum up to n. Thus, σx,i assigns
a relative importance to each constraint while maintaining a budget
of n. This is unlike in the setting of log-barrier where each constraint
has a weight of 1 and, hence, requires a budget of m. Further, note the
following straightforward fact:
Fact 4.4. σ_{x,i} = ∑_j (a_i^⊤H_x^{-1}a_j)² / (s_{x,i}² s_{x,j}²).
Proof.

σ_{x,i} = a_i^⊤H_x^{-1}a_i / s_{x,i}²
        = a_i^⊤H_x^{-1}H_x H_x^{-1}a_i / s_{x,i}²
        = a_i^⊤H_x^{-1}( ∑_j a_j a_j^⊤/s_{x,j}² )H_x^{-1}a_i / s_{x,i}²
        = ∑_j (a_i^⊤H_x^{-1}a_j)² / (s_{x,i}² s_{x,j}²).
σ_x = P_x^{(2)} 1.

Thus, P_x^{(2)} ≥ 0 (entrywise), Σ_x − P_x^{(2)} is symmetric and (Σ_x − P_x^{(2)})1 = 0. Thus, it is a graph Laplacian. As a consequence we obtain the following fact.

Fact 4.5. Λ_x ≝ Σ_x − P_x^{(2)} ⪰ 0.
Lemma 4.6.
Thus, using Fact 4.5, we obtain the following corollary, which allows us to think of the Hessian of the volumetric barrier as the log-barrier weighted with leverage scores. This lemma greatly simplifies the calculations to come.

Corollary 4.7.

A_x^⊤Σ_x A_x ⪯ ∇²V_x ⪯ 5A_x^⊤Σ_x A_x.

Λ_x ⪯ 2(Σ_x + P_x^{(2)}).
Thus,

We now let Q_x ≝ A_x^⊤Σ_x A_x. The next lemma relates Q_x back to the Hessian of the log-barrier function.

Lemma 4.8. (1/4m)·H_x ⪯ Q_x ⪯ H_x.
Proof. The upper bound follows straightforwardly from Fact 4.3, which states that Σ_x ⪯ I.

For the lower bound, we first break the sum into two parts as follows:

Q_x = ∑_i σ_{x,i} a_i a_i^⊤/s_{x,i}²
    = ∑_{i:σ_{x,i}≥1/2m} σ_{x,i} a_i a_i^⊤/s_{x,i}² + ∑_{i:σ_{x,i}<1/2m} σ_{x,i} a_i a_i^⊤/s_{x,i}²
    ⪰ ∑_{i:σ_{x,i}≥1/2m} σ_{x,i} a_i a_i^⊤/s_{x,i}².
Thus,

Q_x ⪰ (1/2m) ∑_{i:σ_{x,i}≥1/2m} a_i a_i^⊤/s_{x,i}² = (1/2m)·H_x − (1/2m) ∑_{i:σ_{x,i}<1/2m} a_i a_i^⊤/s_{x,i}².    (4.2)

Moreover, for every i,

a_i a_i^⊤/s_{x,i}² ⪯ σ_{x,i}·H_x.
Thus,

∑_{i:σ_{x,i}<1/2m} a_i a_i^⊤/s_{x,i}² ⪯ ∑_{i:σ_{x,i}<1/2m} σ_{x,i}·H_x ⪯ ∑_{i:σ_{x,i}<1/2m} (1/2m)·H_x ⪯ (1/2)·H_x.
Lemma 4.9. a_i^⊤Q_x^{-1}a_i / s_{x,i}² ≤ min{ 1/σ_{x,i}, 4mσ_{x,i} } ≤ 2√m.
Proof. From the lower bound estimate of Lemma 4.8, it follows that

a_i^⊤Q_x^{-1}a_i / s_{x,i}² ≤ 4m·a_i^⊤H_x^{-1}a_i / s_{x,i}² = 4mσ_{x,i}.
Theorem 4.10.

∇²V_x ≈_{O(1)} ∇²V_y.
= ⟨ζ, A_x^⊤Σ_x 1⟩²
= ⟨Σ_x^{1/2}A_x ζ, Σ_x^{1/2}1⟩²
≤ (ζ^⊤A_x^⊤Σ_x A_x ζ)(1^⊤Σ_x 1)
≤ (ζ^⊤Q_x ζ)·n
≤ n·(ζ^⊤∇²V_x ζ).
For the second part, we start by noting that ‖x − y‖_x ≤ δ is the same as (x − y)^⊤Q_x(x − y) ≤ δ². This in turn implies that

|a_i^⊤(x − y)|² ≤ δ²·a_i^⊤Q_x^{-1}a_i ≤ (1/4)·s_{x,i}².

Thus, an argument similar to what we did for the log-barrier implies that

|1 − s_{y,i}/s_{x,i}| ≤ 1/2.

This implies that H_x ≈ H_y and σ_{x,i} ≈ σ_{y,i} for all i. Thus, Q_x ≈ Q_y. This implies that ∇²V_x ≈ ∇²V_y, completing the proof.
rather than (1/4m)·H_x in the case of the volumetric barrier. This implies that the upper bound in Lemma 4.9 comes down to √(m/n). This then implies that, while the first part of Theorem 4.10 continues to hold with 4n instead of n, the second part holds with (mn)^{1/4} rather than √m. Hence, from Theorem 4.1, we obtain an (mn)^{1/4}-iteration IPM for LPs.

One can interpret Vaidya's technique as starting with a barrier F and replacing it with

(1/2)·log det ∇²F(x) + (n/m)·F(x).

One may ask if repeating this helps improve the bound of (mn)^{1/4}. In particular, it is an interesting question to understand the fixed point of this functional equation and to what extent it is self-concordant.
polar goes to +∞ and as x → 0, the left end of the polar goes to −∞.
In either case, the volume goes to infinity.
In fact this one-dimensional intuition goes a long way in understand-
ing F (x). This is due to the following simple observation which allows
us to think of K_x° as a collection of line segments. Let v ∈ S^{n−1} be a unit vector and let p(v) denote the maximum of v^⊤y over y ∈ K. Thus, we can rewrite the polar as a collection of the following line segments, each along a unit vector v:

K_x° = ⋃_{v∈S^{n−1}} [ 0, 1/(p(v) − v^⊤x) ]·v.
(2) D²F(x)[h, h] = (n + 1)(n + 2)·I₂(h)/I₀ − (n + 1)²·I₁(h)²/I₀².

(3) D³F(x)[h, h, h] = −(n + 3)(n + 2)(n + 1)·I₃(h)/I₀ + 3(n + 2)(n + 1)²·I₂(h)I₁(h)/I₀² − 2(n + 1)³·I₁(h)³/I₀³.
Proof. Recall that H_x = ∑_i a_i a_i^⊤/s_{x,i}² and V_x = (1/2)·log det H_x. Let ζ be a vector and let t ∈ ℝ be an infinitesimal. First note that

H_{x+tζ} − H_x = ∑_i a_i a_i^⊤/s_{x+tζ,i}² − ∑_i a_i a_i^⊤/s_{x,i}² = 2t ∑_i a_i a_i^⊤ (a_i^⊤ζ)/s_{x,i}³ + O(t²).
∇²V_x = A_x^⊤(3Σ_x − 2P_x^{(2)})A_x.
Definition 4.12. Given a bounded polytope P ≝ {x ∈ ℝⁿ : a_i^⊤x ≤ b_i for i = 1, . . . , m} and a point x ∈ int(P), the John ellipsoid of P at x is defined to be the ellipsoid of maximum volume which is centered at x and contained in P.
E_x = {y : ‖y − x‖²_{B_x} ≤ 1}.

E_x ⊆ {y : ⟨a_i, y⟩ ≤ b_i}.

max_{y∈E_x} ⟨a_i, y⟩ ≤ b_i

or equivalently

max_{y : y^⊤B_x y ≤ 1} ⟨a_i, y + x⟩ ≤ b_i

or equivalently

max_{y : y^⊤B_x y ≤ 1} ⟨a_i, y⟩ ≤ b_i − ⟨a_i, x⟩ = s_i(x).

It can be checked that the left-hand side is equal to ‖a_i‖_{B_x^{-1}}. Because s_i(x) ≥ 0, we can square both sides and obtain the constraint

a_i^⊤B_x^{-1}a_i ≤ s_i(x)².
4.6.1 Duality
Let us now write the dual program (John-Dual). We multiply the i-th
constraint by a multiplier wi ≥ 0, obtaining the Lagrangian
L(C, w) = −log det C + ∑_i w_i ( a_i^⊤Ca_i/s_i(x)² − 1 )
        = −log det C + C • ( ∑_i (w_i/s_i(x)²)·a_i a_i^⊤ ) − ∑_i w_i.
and thus

B = C^{-1} = ∑_i (w_i/s_i(x)²)·a_i a_i^⊤.

The dual objective function is

g(w) = inf_C L(C, w) = −log det( ∑_i (w_i/s_i(x)²)·a_i a_i^⊤ )^{-1} + n − ∑_i w_i,
Proof. The proof resembles the one of a similar statement for Dikin ellipsoids. Without loss of generality assume that all b_i = 1. Also, because P is symmetric, we assume that for every constraint ⟨a_i, x⟩ ≤ 1 there is a corresponding constraint ⟨−a_i, x⟩ ≤ 1. Now all s_i(x) = 1 and the John ellipsoid E is given by

B = ∑_i w_i a_i a_i^⊤,   E = {x : x^⊤Bx ≤ 1} = { x : ∑_i w_i ⟨a_i, x⟩² ≤ 1 }.
Pick a point x on the boundary of E. It is enough to show that √n·x is not inside K. Since x ∈ ∂E, we have

∑_i w_i ⟨a_i, x⟩² = 1.

If it were the case that √n·x ∈ int(P), then for every i we would have ⟨a_i, √n·x⟩ < 1 and ⟨−a_i, √n·x⟩ < 1, so that

⟨a_i, √n·x⟩² < 1,

a contradiction.
5 A Primal-Dual IPM for Linear Programs (without the Barrier)

5.1 Overview
In this lecture we will present an elementary proof of Ye's primal-dual interior point method (IPM) for solving linear programming (LP) problems. The primal-dual approach presented here can be cast in the language of the central path using a barrier function. The formulation and analysis here are simpler and make no reference to barrier functions, central paths, Newton steps, affine scaling, predictor steps, corrector steps, centering, etc.; however, an avid reader who is familiar with path-following methods is encouraged to see the similarities.
Primal (P):   min ⟨c, x⟩   s.t.   Bx = b,   x_e ≥ 0 for all e ∈ [m].

Dual (D):   max ⟨b, y⟩   s.t.   B^⊤y + s = c,   s_e ≥ 0 for all e ∈ [m].

Here, m is the number of variables and n is the number of constraints, i.e., B ∈ ℝ^{n×m}, b ∈ ℝⁿ, c ∈ ℝ^m, x ∈ ℝ^m, y ∈ ℝⁿ, s ∈ ℝ^m.
solution we output will be good enough for the purpose at hand (for
example, that it can be rounded to an optimal solution).2
Consider an optimal primal-dual solution (x, y, s). It satisfies all
four (sets of) conditions from the programs (P) and (D), as well as the
following complementary slackness condition:
∀e ∈ [m] xe se = 0.
Bx = b
B^⊤y + s = c
∀e ∈ [m]  x_e ≥ 0
∀e ∈ [m]  s_e ≥ 0
∀e ∈ [m]  x_e s_e = 0
² Extremal solutions to linear programming problems have the property that they are rational with denominators bounded by 2^{O(L)}, where L is the encoding length. Further, for many combinatorial polytopes, the vertices have all coordinates 0/1. Due to this, once the duality gap is small enough, of the order 2^{−O(L)}, one can round the solutions to get an actual vertex solution. For the MinCostUncapacitatedFlow problem, a duality gap of m^{−O(1)} suffices. This is left as an easy exercise.
Bx = b    (5.1)
B^⊤y + s = c    (5.2)
∀e ∈ [m]  x_e ≥ 0    (5.3)
∀e ∈ [m]  s_e ≥ 0    (5.4)
∀e ∈ [m]  x_e s_e = µ    (5.5)
Proof.

⟨x, c⟩ − ⟨b, y⟩ = ⟨x, B^⊤y + s⟩ − ⟨Bx, y⟩
             = ⟨x, B^⊤y⟩ + ⟨x, s⟩ − ⟨x, B^⊤y⟩
             = ⟨x, s⟩.
That is,

v(x, s, µ) = √( ∑_e ( x_e s_e/µ − 1 )² ).
Our main invariant is that we will keep this quantity bounded from
above by 1/2 at all times. In other words, we will always be close to the
actual solution of KKT(µ).4 Hence, the program KKT’(µ) that we will
be solving is the following:
Bx = b
B^⊤y + s = c
∀e ∈ [m]  x_e ≥ 0
∀e ∈ [m]  s_e ≥ 0
v(x, s, µ) ≤ 1/2
It can be checked that one can find an initial solution (x⁰, s⁰, y⁰) to the above problem for some µ⁰ = m^{O(1)}.⁵ Let us quickly verify that this
relaxation is good enough for our purposes.
⁴ The solution of KKT(µ) traces a path in the interior of the primal polytope and the dual polytope simultaneously as µ → 0 continuously. This is the primal-dual central path.
5 Actually, we will need to introduce a constant number of new x, y, s variables (with very
hence⁶

∑_e |x_e s_e − µ| / µ ≤ √m/2,

and hence,

∑_e (x_e s_e − µ) ≤ µ√m/2,

and, using Theorem 5.3, the duality gap is

∑_e x_e s_e ≤ µ√m/2 + µm = O(m^{-1}).
Invariant: x^t > 0,  s^t > 0.    (5.6)
⁶ We are using the basic fact that ‖v‖₁ ≤ √m‖v‖₂ (actually, a factor of m would suffice).
That is: in the first stage, we will try to bring v down as much as
possible by changing our solution (but keeping µ intact). In the second
stage we will decrease µ as much as possible while still keeping v below
1/2 (we decrease µ using the multiplicative update µ^{t+1} ≝ (1 − γ)µ^t).
Thus our task amounts to finding (for the first stage) a way to decrease
v, and (for the second stage) a maximum value of γ which will guarantee
v ≤ 1/2.
Let us focus on the first stage for now. We have a solution which
satisfies all constraints of KKT(µ) except for (5.5), and our goal is
to satisfy this constraint (as much as possible). We have denoted by
(∆x, ∆s, ∆y) the change in the solution. Let us write the old and new
linear equality constraints:
Bx = b
B^⊤y + s = c

B(x + ∆x) = b
B^⊤(y + ∆y) + s + ∆s = c
Straightforward subtraction shows that the two new ones will be sat-
isfied iff
B∆x = 0,
B^⊤∆y + ∆s = 0.
B∆x = 0    (5.7)
B^⊤∆y + ∆s = 0    (5.8)
∀e ∈ [m],  x_e s_e + ∆x_e s_e + x_e ∆s_e = µ.    (5.9)
Proof. Let⁸ L ≝ BXS^{-1}B^⊤ and z ≝ b − µBs^{-1}, where we denote by s^{-1} the vector with entries (s^{-1})_e = s_e^{-1}. Then ∆y can be found by solving

L∆y = z.

Given ∆y, we set

∆s = −B^⊤∆y,
(∆x)_e = (µ − x_e ∆s_e)/s_e − x_e.

We are left with verifying that B∆x = 0, which is an easy exercise.
⁷ This is the Newton step, which approximates the function by its first-order Taylor series and finds the best direction to move.
⁸ We use the notation that, for a vector x, X denotes the diagonal matrix with the entries of x on the diagonal.
Example 5.6. Coming back to Theorem 5.1, we can see that the ma-
trix BXS −1 B > from the proof of Theorem 5.5 is the Laplacian matrix
of a graph where each edge e has conductance x_e/s_e. Solving the system
for ∆y (which is the only computationally nontrivial step in solving the
system LS) amounts to computing an electrical flow in this graph. It
can be done in nearly-linear time using a fast approximate Laplacian
solver due to Spielman-Teng. It is important to note that the Lapla-
cian solver of Spielman-Teng can find a solution in Õ(m log 1/ε) time
where ε is the relative error of the solution. It can be checked that an
approximate solution with ε = 1/mO(1) suffices for our purpose. In sum-
mary, our method solves the MinCostUncapacitatedFlow problem
by computing a sequence of electrical flows where the resistances and
the demand vector depend on the current point.
Now let us fix a solution (∆x, ∆y, ∆s) and proceed to analyzing its
properties. We begin with a simple but very useful observation.
Lemma 5.8. For (x, s, µ) such that v(x, s, µ) < 1 and ∆x, ∆s that
satisfy LS, we have
v(x + ∆x, s + ∆s, µ) ≤ (1/2) · v(x, s, µ)²/(1 − v(x, s, µ)).
Theorem 5.7.
Proof. Define dx_e ≝ √(s_e/(µx_e))·∆x_e and ds_e ≝ √(x_e/(µs_e))·∆s_e. Thus,

dx_e + ds_e = −√(µ/(x_e s_e)) · (x_e s_e/µ − 1).

Hence, v(x + ∆x, s + ∆s, µ)² = ∑_{e∈[m]} dx_e² ds_e², which is at most

(1/4) ∑_{e∈[m]} (dx_e² + ds_e²)² ≤ (1/4) ( ∑_{e∈[m]} (dx_e² + ds_e²) )² ≤ (1/4) ( max_{e∈[m]} µ/(x_e s_e) )² · v(x, s, µ)⁴.
for the system of quadratic equations we started with and the condition that v ≤ 1/2
corresponds to the domain of quadratic convergence in Newton’s method.
Lemma 5.9. If (x, s, µ) are such that v(x + ∆x, s + ∆s, µ) < 1, where
∆x, ∆s satisfy LS, then x + ∆x > 0 and s + ∆s > 0.
Proof. v(x + ∆x, s + ∆s, µ) < 1 implies that for all e, −µ < ∆xe ∆se <
µ. Thus,
(xe + ∆xe )(se + ∆se ) = µ + ∆xe ∆se > 0.
Thus, for each e, both xe + ∆xe and se + ∆se are either both positive
or both negative. If they are both negative then,
Finally, the following lemma lets us analyze the second stage. It gives
an estimate of the rate of convergence of the whole method. That is,
it answers the question of what values of γ are guaranteed to keep v
below 1/2.
Lemma 5.10. For (x, s, µ) such that v(x, s, µ) < 1 and ∆x, ∆s that
satisfy LS we have
v(x + ∆x, s + ∆s, (1 − γ)µ) ≤ (1/(1 − γ)) · √( v(x + ∆x, s + ∆s, µ)² + γ²m ).
Proof.

v(x + ∆x, s + ∆s, (1 − γ)µ)² = ∑_{e∈[m]} ( (x_e + ∆x_e)(s_e + ∆s_e)/(µ(1 − γ)) − 1 )²
 = (1/(1 − γ)²) ∑_{e∈[m]} ( ∆x_e∆s_e/µ + γ )²
 = (1/(1 − γ)²) ( ∑_{e∈[m]} (∆x_e∆s_e/µ)² + ∑_{e∈[m]} 2γ∆x_e∆s_e/µ + mγ² ).

Since ∆x and ∆s are orthogonal (by (5.7) and (5.8)), the middle term vanishes, and we obtain

v(x + ∆x, s + ∆s, (1 − γ)µ)² = (1/(1 − γ)²) ∑_{e∈[m]} (∆x_e∆s_e/µ)² + mγ²/(1 − γ)².
Example. Coming back to Theorems 5.1 and 5.6, notice that our
result, together with a fast Laplacian solver, implies that we are able to
solve the MinCostUncapacitatedFlow problem in time Õ(m^{3/2}).
References
[AHK12] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights
update method: a meta-algorithm and applications. Theory of Computing,
8(1):121–164, 2012.
[Bub14] Sebastien Bubeck. Theory of convex optimization for machine learning.
Internet draft, 2014. URL: http://arxiv.org/abs/1405.4980.
[Haz14] Elad Hazan. Introduction to online convex optimization. Internet draft,
2014. URL: http://ocobook.cs.princeton.edu/.
[Vis13] Nisheeth K. Vishnoi. Lx = b. Foundations and Trends in Theoretical Com-
puter Science, 8(1-2):1–141, 2013. URL: http://research.microsoft.
com/en-us/um/people/nvishno/site/Lxb-Web.pdf.
[Wri05] Margaret H. Wright. The interior-point revolution in optimization: history, recent developments, and lasting consequences. Bull. Amer. Math. Soc. (N.S.), 42:39–56, 2005.