Continuous Optimization
[Figure: a mind map of the concepts in this chapter, connecting convex optimization, the convex conjugate and duality, linear programming, and quadratic programming, with their use in Chapter 12 (Classification).]
need to start at some value, say x0 = −10, and follow the gradient. The gradient indicates that we should go right, but not how far (this is called the step size). Furthermore, if we had started at the right side (e.g., x0 = 0), the gradient would have led us to the wrong minimum. Figure 7.2 illustrates the fact that for x > −1, the gradient points towards the minimum on the right of the figure, which has a larger objective value.

We will see in Section 7.3 a class of functions, called convex functions, that do not exhibit this tricky dependency on the starting point of the optimization algorithm. For convex functions, all local minima are global minima. It turns out that many machine learning objective functions are designed such that they are convex, and we will see an example in Chapter 12.
The discussion in this chapter so far was about a one-dimensional function, where we are able to visualize the ideas of gradients, descent directions, and optimal values. In the rest of this chapter we develop the same ideas in high dimensions. Unfortunately, we can only visualize the concepts in one dimension, and some concepts do not generalize directly to higher dimensions, so some care needs to be taken when reading.
Example 7.1
Consider a quadratic function in two dimensions

$$ f\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \frac{1}{2} \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} - \begin{bmatrix} 5 & 3 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{7.7} $$

with gradient

$$ \nabla f\left( \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \right) = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 20 \end{bmatrix} - \begin{bmatrix} 5 & 3 \end{bmatrix}. \tag{7.8} $$

Starting at the initial location x0 = [−3, −1]ᵀ, we iteratively apply (7.6) to obtain a sequence of estimates that converge to the minimum value (illustrated in Figure 7.3). We can see (both from the figure and by plugging x0 into (7.8)) that the gradient at x0 points north and east, leading to x1 = [−1.98, 1.21]ᵀ. Repeating that argument gives us x2 = [−1.32, −0.42]ᵀ, and so on.
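The iterates above can be reproduced in a few lines of Python. This is a minimal sketch; the step size γ = 0.085 is our assumption (it is not stated in this excerpt), chosen because it reproduces the quoted values of x1 and x2 exactly.

```python
import numpy as np

# Quadratic of Example 7.1: f(x) = 0.5 x^T A x - b^T x, cf. (7.7).
A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])

def grad_f(x):
    return A @ x - b  # gradient (7.8); A is symmetric, so A x - b as a column

x = np.array([-3.0, -1.0])  # initial location x0
gamma = 0.085               # step size: our choice, matching the iterates above

for i in range(1, 3):
    x = x - gamma * grad_f(x)
    print(f"x_{i} = {np.round(x, 2)}")
# Prints x_1 = [-1.98  1.21] and x_2 = [-1.32 -0.42], as in the example.
```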
Remark. Gradient descent can be relatively slow close to the minimum: its asymptotic rate of convergence is inferior to that of many other methods. Using the analogy of a ball rolling down a hill, when the surface is a long, thin valley, the problem is poorly conditioned (Trefethen and Bau III, 1997). For poorly conditioned convex problems, gradient descent increasingly "zigzags", as the gradients point nearly orthogonally to the shortest direction to a minimum point; see Figure 7.3. ♦
• When the function value increases after a gradient step, the step size was too large. Undo the step and decrease the step size.
• When the function value decreases, the step could have been larger. Try to increase the step size.

Although the "undo" step seems to be a waste of resources, using this heuristic guarantees monotonic convergence.
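A minimal sketch of this heuristic in Python follows; the multiplicative factors 1.1 and 0.5 for growing and shrinking the step size are our choice, not prescribed by the text.

```python
import numpy as np

def adaptive_gradient_descent(f, grad_f, x0, gamma=0.1, num_iters=100):
    """Gradient descent with the undo/retry step-size heuristic above."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(num_iters):
        x_new = x - gamma * grad_f(x)
        f_new = f(x_new)
        if f_new < fx:
            x, fx = x_new, f_new  # accept: the value decreased, ...
            gamma *= 1.1          # ... so be more ambitious next time
        else:
            gamma *= 0.5          # undo: keep the old x and shrink the step
    return x

# On the quadratic of Example 7.1, this converges to the solution of A x = b.
A = np.array([[2.0, 1.0], [1.0, 20.0]])
b = np.array([5.0, 3.0])
print(adaptive_gradient_descent(lambda x: 0.5 * x @ A @ x - b @ x,
                                lambda x: A @ x - b, [0.0, 0.0]))
```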
where α ∈ [0, 1]. Sometimes we will only know the gradient approximately. In such cases, the momentum term is useful, since it averages out different noisy estimates of the gradient. One particularly useful way to obtain an approximate gradient is a stochastic approximation, which we discuss next.
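The momentum update equation itself is not part of this excerpt; the sketch below uses the standard heavy-ball form, in which a fraction α of the previous step is added to the current gradient step.

```python
import numpy as np

def momentum_gradient_descent(grad_f, x0, gamma=0.085, alpha=0.9, num_iters=50):
    """Heavy-ball momentum (a standard form, assumed here):
    x_{i+1} = x_i - gamma * grad_f(x_i) + alpha * (x_i - x_{i-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev - gamma * grad_f(x_prev)  # plain first step: no history yet
    for _ in range(num_iters - 1):
        step = x - x_prev                # the memory term, Delta x_i
        x_prev, x = x, x - gamma * grad_f(x) + alpha * step
    return x
```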
where x_n ∈ ℝ^D are the training inputs, y_n are the training targets, and θ are the parameters of the regression model.

Standard gradient descent, as introduced previously, is a "batch" optimization method, i.e., optimization is performed using the full training set by updating the vector of parameters according to

$$ \theta_{i+1} = \theta_i - \gamma_i \sum_{n=1}^{N} \nabla L_n(\theta_i) \tag{7.15} $$
for a suitable step-size parameter γ_i. Evaluating the sum-gradient may require expensive evaluations of the gradients from all individual functions L_n. When the training set is enormous and/or no simple formulas exist, evaluating the sums of gradients becomes very expensive.

Consider the term $\sum_{n=1}^{N} \nabla L_n(\theta_i)$ in (7.15) above: we can reduce the amount of computation by taking a sum over a smaller set of L_n. In contrast to batch gradient descent, which uses all L_n for n = 1, …, N, we randomly choose a subset of the L_n for mini-batch gradient descent. In the extreme case, we randomly select only a single L_n to estimate the gradient. The key insight as to why taking a subset of the data is sensible is to realise that for gradient descent to converge, we only require that the gradient is an unbiased estimate of the true gradient. In fact, the term $\sum_{n=1}^{N} \nabla L_n(\theta_i)$ in (7.15) is an empirical estimate of the expected value (Section 6.4.1) of the gradient. Therefore, any other unbiased empirical estimate of the expected value, for example using any subsample of the data, would suffice for convergence of gradient descent.
Why should one consider using an approximate gradient? A major reason is practical implementation constraints, such as the size of CPU/GPU memory or limits on computational time. We can think of the size of the subset used to estimate the gradient in the same way that we thought of the size of a sample when estimating empirical means (Section 6.4.1). In practice, it is good to keep the size of the mini-batch as large as possible. Large mini-batches reduce the variance in the parameter update, which often leads to more stable convergence, since the gradient estimator is less noisy. Furthermore, large mini-batches take advantage of highly optimized matrix operations in vectorized implementations of the cost and gradient. However, when we choose the mini-batch size, we need to make sure it fits into CPU/GPU memory. Typical mini-batch sizes are 64, 128, 256, 512, or 1024, depending on the way computer memory is laid out and accessed.
Remark. When the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a local minimum (Bottou, 1998). ♦

If we keep the mini-batch size small, the noise in our gradient estimate will allow us to get out of some bad local optima, which we may otherwise get stuck in.
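The following is a minimal mini-batch SGD sketch for the regression setting above, assuming the squared loss L_n(θ) = (y_n − x_nᵀθ)² (the loss formula itself is not shown in this excerpt); each mini-batch gradient is an unbiased estimate of the full gradient in (7.15).

```python
import numpy as np

def minibatch_sgd(X, y, theta0, gamma=0.01, batch_size=64, num_epochs=20, seed=0):
    """Mini-batch SGD for least squares with L_n(theta) = (y_n - x_n^T theta)^2."""
    rng = np.random.default_rng(seed)
    theta, N = np.asarray(theta0, dtype=float).copy(), len(y)
    for _ in range(num_epochs):
        perm = rng.permutation(N)                    # reshuffle every epoch
        for idx in np.array_split(perm, max(1, N // batch_size)):
            Xb, yb = X[idx], y[idx]
            grad = -2.0 * Xb.T @ (yb - Xb @ theta)   # gradient of the batch loss
            theta -= gamma * grad / len(idx)         # averaging keeps the scale fixed
    return theta

# Sanity check on synthetic data: the estimate recovers the generating parameters.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(1000)
print(minibatch_sgd(X, y, np.zeros(3)))
```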
Stochastic gradient descent is very effective in large-scale machine learning problems (Bottou et al., 2018), such as training deep neural networks on millions of images (Dean et al., 2012), topic models (Hoffman et al., 2013), reinforcement learning (Mnih et al., 2015), or training large-scale Gaussian process models (Hensman et al., 2013; Gal et al., 2014).
[Figure 7.4: Illustration of constrained optimization. The unconstrained problem (indicated by the contour lines) has a minimum on the right side (indicated by the circle). The box constraints (−1 ≤ x ≤ 1 and −1 ≤ y ≤ 1) require that the optimal solution lies within the box, resulting in an optimal value indicated by the star.]
where f : ℝ^D → ℝ.

In this section, we have additional constraints. That is, for real-valued functions g_i : ℝ^D → ℝ, for i = 1, …, m, we consider the constrained optimization problem

$$ \min_{x} f(x) \quad \text{subject to} \quad g_i(x) \leq 0 \quad \text{for all } i = 1, \dots, m. \tag{7.17} $$
This gives infinite penalty if the constraint is not satisfied, and hence would provide the same solution. However, this infinite step function is equally difficult to optimize. We can overcome this difficulty by introducing Lagrange multipliers. The idea of Lagrange multipliers is to replace the step function with a linear function.

We associate to the problem (7.17) the Lagrangian, by introducing Lagrange multipliers λ_i ≥ 0 corresponding to each inequality constraint respectively (Boyd and Vandenberghe, 2004, Chapter 4):

$$ L(x, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) = f(x) + \lambda^\top g(x), \tag{7.20} $$

where in the last line we have concatenated all constraints g_i(x) into a vector g(x), and all the Lagrange multipliers into a vector λ ∈ ℝ^m.
We now introduce the idea of Lagrangian duality. In general, duality in optimization is the idea of converting an optimization problem in one set of variables x (called the primal variables) into another optimization problem in a different set of variables λ (called the dual variables). We introduce two different approaches to duality: in this section, we discuss Lagrangian duality, and in Section 7.3.3 we discuss Legendre-Fenchel duality.
Theorem 7.1. The problem in (7.17),

$$ \min_{x} f(x) \quad \text{subject to} \quad g_i(x) \leq 0 \quad \text{for all } i = 1, \dots, m, $$

is known as the primal problem, corresponding to the primal variables x. The associated Lagrangian dual problem is given by

$$ \max_{\lambda \in \mathbb{R}^m} D(\lambda) \quad \text{subject to} \quad \lambda \geq 0, $$

where λ are the dual variables and D(λ) = min_{x∈ℝ^d} L(x, λ).
Recall the minimax inequality: for any function of two arguments, swapping the order of the minimum and the maximum as above results in a smaller value, i.e.,

$$ \min_{x \in \mathbb{R}^d} \max_{\lambda \geq 0} L(x, \lambda) \;\geq\; \max_{\lambda \geq 0} \min_{x \in \mathbb{R}^d} L(x, \lambda). \tag{7.25} $$
This is also known as weak duality. Note that the inner part of the right-hand side is the dual objective function D(λ), and the theorem follows.

In contrast to the original optimization problem, which has constraints, min_{x∈ℝ^d} L(x, λ) is an unconstrained optimization problem for a given value of λ. If solving min_{x∈ℝ^d} L(x, λ) is easy, then the overall problem is easy to solve. The reason is that the dual objective D(λ) is a pointwise minimum of functions that are affine in λ, and hence D(λ) is concave, even though f(·) and g_i(·) may be nonconvex. The outer problem, maximization over λ, is therefore the maximization of a concave function and can be efficiently computed.

Assuming f(·) and g_i(·) are differentiable, we find the Lagrange dual problem by differentiating the Lagrangian with respect to x, setting the derivative to zero, and solving for the optimal value. We will discuss two concrete examples in Sections 7.3.1 and 7.3.2, where f(·) and g_i(·) are convex.
Remark (Equality constraints). Consider (7.17) with additional equality constraints:

$$ \min_{x} f(x) \quad \text{subject to} \quad g_i(x) \leq 0 \ \text{for all } i = 1, \dots, m, \quad h_j(x) = 0 \ \text{for all } j = 1, \dots, p. \tag{7.26} $$
[Figure 7.5: Example of a convex function, y = 3x² − 5x + 2.]
Convex functions are functions such that a straight line between any two points of the function lies above the function. Figure 7.2 shows a nonconvex function, and Figure 7.3 shows a convex function. Another convex function is shown in Figure 7.5.
Definition 7.2. A function f : ℝ^D → ℝ is a convex function if for all x, y in the domain of f, and for any scalar θ with 0 ≤ θ ≤ 1, we have

$$ f(\theta x + (1 - \theta) y) \;\leq\; \theta f(x) + (1 - \theta) f(y). \tag{7.27} $$

(Technically, the domain of the function f must also be a convex set.)

Remark. A concave function is the negative of a convex function. ♦
The constraints involving g(·) and h(·) in (7.26) truncate functions at a scalar value, resulting in sets. Another relation between convex functions and convex sets is to consider the set obtained by "filling in" a convex function. A convex function is a bowl-like object, and we imagine pouring water into it to fill it up. The resulting filled-in set, called the epigraph of the convex function, is a convex set. Convex sets are sets such that a straight line connecting any two elements of the set lies inside the set. Figures 7.6 and 7.7 illustrate convex and nonconvex sets, respectively.

Definition 7.3. A set C is a convex set if for any x, y ∈ C and for any scalar θ with 0 ≤ θ ≤ 1, we have

$$ \theta x + (1 - \theta) y \in C. \tag{7.28} $$
If a function f : ℝ^n → ℝ is differentiable, we can specify convexity in terms of its gradient ∇_x f(x) (Section 5.2). A function f(x) is convex if and only if for any two points x, y,

$$ f(y) \;\geq\; f(x) + \nabla_x f(x)^\top (y - x). \tag{7.29} $$

If we further know that a function f(x) is twice differentiable, that is, the Hessian (5.144) exists for all values in the domain of x, then the function f(x) is convex if and only if ∇²_x f(x) is positive semidefinite (Boyd and Vandenberghe, 2004).
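The second-order condition can be checked numerically via the eigenvalues of the Hessian. A small sketch for the quadratic of Example 7.1, whose Hessian is the constant matrix used there:

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x has constant Hessian A, so f is convex
# if and only if all eigenvalues of A are nonnegative.
A = np.array([[2.0, 1.0], [1.0, 20.0]])
eigenvalues = np.linalg.eigvalsh(A)  # eigvalsh: eigenvalues of a symmetric matrix
print(eigenvalues)                   # approx. [1.94, 20.06]: positive definite
assert np.all(eigenvalues >= 0), "Hessian is not PSD, so f is not convex"
```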
Example 7.3
The negative entropy f(x) = x log₂ x is convex for x > 0. A visualization of the function is shown in Figure 7.8, and we can see that the function is convex. To illustrate the above definitions of convexity, let us check the calculations for two points x = 2 and x = 4. Note that to prove convexity of f(x), we would need to check for all points x ∈ ℝ.

Recall Definition 7.2. Consider a point midway between the two points (that is, θ = 0.5); then the left-hand side is f(0.5 × 2 + 0.5 × 4) = 3 log₂ 3 ≈ 4.75. The right-hand side is 0.5(2 log₂ 2) + 0.5(4 log₂ 4) = 1 + 4 = 5. Therefore, the definition is satisfied.

Since f(x) is differentiable, we can alternatively use (7.29). Calculating the derivative of f(x), we obtain

$$ \nabla_x (x \log_2 x) = 1 \cdot \log_2 x + x \cdot \frac{1}{x \ln 2} \tag{7.30} $$
$$ = \log_2 x + \frac{1}{\ln 2}. \tag{7.31} $$

Using the same two test points x = 2 and x = 4, the left-hand side of (7.29) is given by f(4) = 8. The right-hand side is

$$ f(x) + \nabla_x f(x)^\top (y - x) = f(2) + \nabla f(2) \cdot (4 - 2) \tag{7.32} $$
$$ = 2 + \left(1 + \frac{1}{\ln 2}\right) \cdot 2 \approx 6.89, \tag{7.33} $$

which is less than f(4) = 8, so the first-order condition is satisfied as well.
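Both checks can be reproduced numerically; a brief sketch (the derivative below is log₂ x + 1/ln 2, as in (7.31)):

```python
import numpy as np

f = lambda x: x * np.log2(x)  # negative entropy, defined for x > 0

# Definition 7.2 at x = 2, y = 4, theta = 0.5: the chord dominates the function.
x, y, theta = 2.0, 4.0, 0.5
assert f(theta * x + (1 - theta) * y) <= theta * f(x) + (1 - theta) * f(y)
# (f(3) = 3 log2 3 ~ 4.75 on the left, 0.5 * 2 + 0.5 * 8 = 5 on the right.)

# First-order condition (7.29), using the derivative log2(x) + 1/ln(2).
grad = lambda t: np.log2(t) + 1.0 / np.log(2)
assert f(y) >= f(x) + grad(x) * (y - x)  # 8 >= 2 + 2 * (1 + 1/ln 2) ~ 6.89
```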
We can check that a function or set is convex from first principles by recalling the definitions. In practice, we often rely on operations that preserve convexity to check that a particular function or set is convex. Although the details are vastly different, this is again the idea of closure that we introduced in Chapter 2 for vector spaces.
Example 7.4
A nonnegative weighted sum of convex functions is convex. Observe that if f is a convex function and α ≥ 0 is a nonnegative scalar, then the function αf is convex. We can see this by multiplying both sides of the inequality in Definition 7.2 by α, and recalling that multiplying by a nonnegative number does not change the direction of the inequality.

If f₁ and f₂ are convex functions, then we have by the definition

$$ f_1(\theta x + (1 - \theta) y) \;\leq\; \theta f_1(x) + (1 - \theta) f_1(y) \tag{7.34} $$
$$ f_2(\theta x + (1 - \theta) y) \;\leq\; \theta f_2(x) + (1 - \theta) f_2(y). \tag{7.35} $$

Summing up both sides gives us

$$ f_1(\theta x + (1 - \theta) y) + f_2(\theta x + (1 - \theta) y) \;\leq\; \theta f_1(x) + (1 - \theta) f_1(y) + \theta f_2(x) + (1 - \theta) f_2(y), \tag{7.36} $$

where the right-hand side can be rearranged to

$$ \theta \left( f_1(x) + f_2(x) \right) + (1 - \theta) \left( f_1(y) + f_2(y) \right), \tag{7.37} $$

completing the proof that the sum of two convex functions is convex.

Combining the two facts above, we see that αf₁(x) + βf₂(x) is convex for α, β ≥ 0. This closure property can be extended using a similar argument to nonnegative weighted sums of more than two convex functions.
Remark. The inequality defining convex functions, see (7.27), is sometimes called Jensen's inequality. In fact, a whole class of inequalities for taking nonnegative weighted sums of convex functions are all called Jensen's inequality. ♦
In summary, a constrained optimization problem is called a convex optimization problem if

$$ \min_{x} f(x) \quad \text{subject to} \quad g_i(x) \leq 0 \ \text{for all } i = 1, \dots, m, \quad h_j(x) = 0 \ \text{for all } j = 1, \dots, p, \tag{7.38} $$

where all the functions f(x) and g_i(x) are convex functions, and all h_j(x) = 0 are convex sets.
[Figure 7.9: Illustration of a linear program. The unconstrained problem (indicated by the contour lines) has a minimum on the right side. The optimal value given the constraints is shown by the star.]
$$ \min_{x \in \mathbb{R}^d} \; c^\top x \quad \text{subject to} \quad Ax \leq b, \tag{7.39} $$

where A ∈ ℝ^{m×d} and b ∈ ℝ^m.

Example 7.5
Consider the linear program

$$ \min_{x \in \mathbb{R}^2} \; -\begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{7.40} $$

$$ \text{subject to} \quad \begin{bmatrix} 2 & 2 \\ 2 & -4 \\ -2 & 1 \\ 0 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\leq\; \begin{bmatrix} 33 \\ 8 \\ 5 \\ -1 \\ 8 \end{bmatrix}. \tag{7.41} $$
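This program can be handed directly to an off-the-shelf LP solver; a brief sketch using scipy (the minus signs in c encode the maximization of 5x₁ + 3x₂ implied by (7.40)):

```python
import numpy as np
from scipy.optimize import linprog

# Linear program (7.40)-(7.41): minimize -[5, 3]^T x subject to Ax <= b.
c = np.array([-5.0, -3.0])
A = np.array([[ 2.0,  2.0],
              [ 2.0, -4.0],
              [-2.0,  1.0],
              [ 0.0, -1.0],
              [ 0.0,  1.0]])
b = np.array([33.0, 8.0, 5.0, -1.0, 8.0])

# linprog assumes x >= 0 by default; lift that restriction with explicit bounds.
result = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None), (None, None)])
print(result.x)  # the optimal vertex of the feasible polytope (the star in Figure 7.9)
```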
The corresponding dual linear program is

$$ \max_{\lambda \in \mathbb{R}^m} \; -b^\top \lambda \quad \text{subject to} \quad c + A^\top \lambda = 0, \quad \lambda \geq 0. \tag{7.42} $$

This is also a linear program, but with m variables. We have the choice of solving the primal (7.39) or the dual (7.42) program, depending on whether m or d is larger. Recall that d is the number of variables and m is the number of constraints in the primal linear program.
Consider the quadratic program

$$ \min_{x \in \mathbb{R}^d} \; \frac{1}{2} x^\top Q x + c^\top x \quad \text{subject to} \quad Ax \leq b, \tag{7.43} $$

where A ∈ ℝ^{m×d}, b ∈ ℝ^m, and c ∈ ℝ^d. The square symmetric matrix Q ∈ ℝ^{d×d} is positive definite, and therefore the objective function is convex. This is known as a quadratic program. Observe that it has d variables and m linear constraints.
Example 7.6
An example of a quadratic program is illustrated in Figure 7.4, which has two variables. The objective function is quadratic with a positive semidefinite matrix Q, resulting in elliptical contour lines. The optimal value must lie in the shaded (feasible) region, and is indicated by the star:

$$ \min_{x \in \mathbb{R}^2} \; \frac{1}{2} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}^\top \begin{bmatrix} 2 & 1 \\ 1 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 5 \\ 3 \end{bmatrix}^\top \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \tag{7.44} $$
$$ \text{subject to} \quad \begin{bmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \;\leq\; \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}. \tag{7.45} $$
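Since the four linear constraints in (7.45) are exactly the box −1 ≤ x₁ ≤ 1, −1 ≤ x₂ ≤ 1, one way to solve this particular program numerically is to pass them to a bound-constrained solver; a minimal sketch:

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic program (7.44)-(7.45), with the constraints expressed as bounds.
Q = np.array([[2.0, 1.0], [1.0, 4.0]])
c = np.array([5.0, 3.0])

result = minimize(lambda x: 0.5 * x @ Q @ x + c @ x,  # objective (7.44)
                  x0=np.zeros(2),
                  jac=lambda x: Q @ x + c,            # its gradient
                  bounds=[(-1.0, 1.0), (-1.0, 1.0)],
                  method="L-BFGS-B")
print(result.x)  # approximately [-1.0, -0.5], the star in Figure 7.4
```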
a given point x₀ is the evaluation of the gradient of that function at that point, df(x)/dx evaluated at x = x₀. In summary, because convex sets can be equivalently described by their supporting hyperplanes, convex functions can be equivalently described by a function of their gradient. The Legendre transform formalizes this concept. (Physics students are often introduced to the Legendre transform as relating the Lagrangian and the Hamiltonian in classical mechanics.)

We begin with the most general definition, which unfortunately has a counterintuitive form, and look at special cases to try to relate the definition to the intuition above. The Legendre-Fenchel transform is a transformation (in the sense of a Fourier transform) from a convex differentiable function f(x) to a function that depends on the tangents s(x) = ∇_x f(x). It is worth stressing that this is a transformation of the function f(·), and not of the variable x or of the function evaluated at a value. The Legendre-Fenchel transform is also known as the convex conjugate (for reasons we will see soon) and is closely related to duality (Hiriart-Urruty and Lemaréchal, 2001, Chapter 5).
Definition 7.4. The convex conjugate of a function f : ℝ^D → ℝ is a function f* defined by

$$ f^*(s) = \sup_{x \in \mathbb{R}^D} \; \langle s, x \rangle - f(x). \tag{7.49} $$
Note that the convex conjugate definition above does not need the function f to be convex, nor differentiable. In the definition above, we have used a general inner product (Section 3.2), but in the rest of this section we will consider the standard dot product between finite-dimensional vectors (⟨s, x⟩ = sᵀx) to avoid too many technical details.
To understand the above definition in a geometric fashion, consider a nice simple one-dimensional convex and differentiable function, for example f(x) = x². (This derivation is easiest to understand by drawing the reasoning as it progresses.) Note that since we are looking at a one-dimensional problem, hyperplanes reduce to lines. Consider a line y = sx + c. Recall that we are able to describe convex functions by their supporting hyperplanes, so let us try to describe this function f(x) by its supporting lines. Fix the gradient of the line s ∈ ℝ and, for each point (x₀, f(x₀)) on the graph of f, find the minimum value of c such that the line still intersects (x₀, f(x₀)). Note that the minimum value of c is the place where a line with slope s "just touches" the function f(x) = x². The line passing through (x₀, f(x₀)) with gradient s is given by

$$ y - f(x_0) = s(x - x_0). \tag{7.50} $$

The y-intercept of this line is −sx₀ + f(x₀). The minimum of c for which y = sx + c intersects with the graph of f is therefore

$$ \inf_{x_0} \; -s x_0 + f(x_0). \tag{7.51} $$
The derivation above holds even though we chose a one-dimensional convex and differentiable function; the definition applies to functions f : ℝ^D → ℝ that are nonconvex and nondifferentiable.

Remark. Convex differentiable functions, such as the example f(x) = x², are a nice special case, where there is no need for the supremum and there is a one-to-one correspondence between a function and its Legendre transform. (The classical Legendre transform is defined on convex differentiable functions in ℝ^D.) Let us derive this from first principles. For a convex differentiable function, we know that at x₀ the tangent touches f(x₀), therefore

$$ f(x_0) = s x_0 + c. \tag{7.52} $$

Recall that we want to describe the convex function f(x) in terms of its gradient ∇_x f(x), and that s = ∇_x f(x₀). We rearrange to get an expression for −c, obtaining

$$ -c = s x_0 - f(x_0). \tag{7.53} $$

Note that −c changes with x₀, and therefore with s, which is why we can think of it as a function of s, which we call f*(s):

$$ f^*(s) = s x_0 - f(x_0). \tag{7.54} $$

Compare (7.54) with Definition 7.4, and observe that (7.54) is a special case (without the supremum). ♦
The conjugate function has nice properties; for example, for convex functions, applying the Legendre transform again gets us back to the original function. In the same way that the slope of f(x) is s, the slope of f*(s) is x. The following two examples show common uses of convex conjugates in machine learning.
Example 7.7
To illustrate the application of convex conjugates, consider the quadratic function based on a positive definite matrix K ∈ ℝ^{n×n}. We denote the primal variable by y ∈ ℝ^n and the dual variable by α ∈ ℝ^n:

$$ f(y) = \frac{1}{2} y^\top K^{-1} y. \tag{7.55} $$

Applying Definition 7.4, we obtain the function

$$ f^*(\alpha) = \sup_{y \in \mathbb{R}^n} \; \langle y, \alpha \rangle - \frac{1}{2} y^\top K^{-1} y. \tag{7.56} $$

Observe that the function is differentiable, and hence we can find the maximum by taking the derivative with respect to y and setting it to zero:

$$ \frac{\partial \left[ \langle y, \alpha \rangle - \frac{1}{2} y^\top K^{-1} y \right]}{\partial y} = (\alpha - K^{-1} y)^\top, \tag{7.57} $$

and hence the gradient is zero when y = Kα. Substituting into (7.56) yields

$$ f^*(\alpha) = \alpha^\top K \alpha - \frac{1}{2} \alpha^\top K \alpha = \frac{1}{2} \alpha^\top K \alpha. $$
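The closed form can be verified numerically; a small sketch that computes the supremum in (7.56) with a generic solver and compares it with ½αᵀKα (the random K and α here are our own test data):

```python
import numpy as np
from scipy.optimize import minimize

# Check Example 7.7: for f(y) = 0.5 y^T K^{-1} y, we expect f*(a) = 0.5 a^T K a.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
K = B @ B.T + 3.0 * np.eye(3)  # a positive definite matrix
alpha = rng.standard_normal(3)

# f*(alpha) = sup_y <y, alpha> - f(y); compute it by minimizing the negation.
neg = lambda y: -(alpha @ y - 0.5 * y @ np.linalg.solve(K, y))
sup_value = -minimize(neg, np.zeros(3)).fun
print(sup_value, 0.5 * alpha @ K @ alpha)  # the two values agree
```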
Example 7.8
In machine learning, we often use sums of functions; for example, the objective function of the training set includes a sum of the losses for each example in the training set. In the following, we derive the convex conjugate of a sum of losses ℓ(t), where ℓ : ℝ → ℝ. This also illustrates the application of the convex conjugate to the vector case. Let L(t) = Σ_{i=1}^n ℓ_i(t_i). Then

$$ L^*(z) = \sup_{t \in \mathbb{R}^n} \; \langle z, t \rangle - \sum_{i=1}^{n} \ell_i(t_i) \tag{7.60} $$
$$ = \sup_{t \in \mathbb{R}^n} \; \sum_{i=1}^{n} \left( z_i t_i - \ell_i(t_i) \right) \qquad \text{definition of dot product} \tag{7.61} $$
$$ = \sum_{i=1}^{n} \sup_{t_i \in \mathbb{R}} \; \left( z_i t_i - \ell_i(t_i) \right) \tag{7.62} $$
$$ = \sum_{i=1}^{n} \ell_i^*(z_i). \qquad \text{definition of conjugate} \tag{7.63} $$
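The separability in (7.62)-(7.63) is easy to check numerically. A brief sketch with an assumed square loss ℓ_i(t) = t², whose conjugate is ℓ*(z) = z²/4 (this particular loss is our choice for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Conjugate of L(t) = sum_i t_i^2, computed one coordinate at a time as in (7.62).
z = np.array([1.0, -2.0, 0.5])
per_coordinate = [-minimize_scalar(lambda t, zi=zi: -(zi * t - t**2)).fun
                  for zi in z]
print(np.array(per_coordinate))  # matches the closed form below
print(z**2 / 4.0)                # l*(z_i) = z_i^2 / 4 for the square loss
```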
Recall that in Section 7.2 we derived a dual optimization problem using Lagrange multipliers. Furthermore, for convex optimization problems we have strong duality, that is, the solutions of the primal and dual problems match. The Legendre-Fenchel transform described here can also be used to derive a dual optimization problem. Furthermore, when the function is convex and differentiable, the supremum is unique. To further investigate the relation between these two approaches, let us consider a linear equality constrained convex optimization problem.
Example 7.9
Let f(y) and g(x) be convex functions, and A a real matrix of appropriate dimensions such that Ax = y. Then

$$ \min_{x} \; f(Ax) + g(x) = \min_{Ax = y} \; f(y) + g(x). \tag{7.64} $$
where the last step of swapping max and min is due to the fact that f(y) and g(x) are convex functions. By splitting up the dot product term and collecting x and y,

$$ \max_{u} \min_{x, y} \; f(y) + g(x) + (Ax - y)^\top u \tag{7.67} $$
$$ = \max_{u} \left[ \min_{y} \; -y^\top u + f(y) + \min_{x} \; (Ax)^\top u + g(x) \right] \tag{7.68} $$
$$ = \max_{u} \left[ \min_{y} \; -y^\top u + f(y) + \min_{x} \; x^\top A^\top u + g(x) \right]. \tag{7.69} $$

Recall the convex conjugate (Definition 7.4) and the fact that dot products are symmetric (for general inner products, Aᵀ is replaced by the adjoint A*). Applying the conjugate to each inner minimization,

$$ \max_{u} \left[ \min_{y} \; -y^\top u + f(y) + \min_{x} \; x^\top A^\top u + g(x) \right] \tag{7.70} $$
$$ = \max_{u} \; -f^*(u) - g^*(-A^\top u). $$
The Legendre-Fenchel conjugate turns out to be quite useful for machine learning problems that can be expressed as convex optimization problems. In particular, for convex loss functions that apply independently to each example, the conjugate loss is a convenient way to derive a dual problem.
Second-order methods such as Newton methods use the Hessian to provide information about the curvature. Many of the choices for selecting step sizes, and ideas like momentum, arise by considering the curvature of the objective function (Goh, 2017; Bottou et al., 2018). Quasi-Newton methods such as L-BFGS try to use cheaper computational methods to approximate the Hessian (Nocedal and Wright, 2006). Recently there has been interest in other metrics for computing descent directions, resulting in approaches such as mirror descent (Beck and Teboulle, 2003) and natural gradient (Toussaint, 2012).
The second challenge is non-differentiable functions. Gradient methods are not well defined when there are kinks in the function. In these cases, subgradient methods can be used (Shor, 1985). For further information and algorithms for optimizing non-differentiable functions, we refer to the book by Bertsekas (1999). There is a vast amount of literature on different approaches for numerically solving continuous optimization problems, including algorithms for constrained optimization problems. Good starting points to appreciate this literature are Luenberger (1969) and Bonnans et al. (2006). A recent survey of continuous optimization is Bubeck (2015).
Modern applications of machine learning often mean that the size of datasets prohibits the use of batch gradient descent, and hence stochastic gradient descent is the current workhorse of large-scale machine learning methods. Recent surveys of the literature include Hazan (2015) and Bottou et al. (2018).
For duality and convex optimization, the book by Boyd and Vandenberghe (2004) includes lectures and slides online. A more mathematical treatment is provided by Bertsekas (2009). Convex optimization is based upon convex analysis, and the reader interested in more foundational results about convex functions is referred to Hiriart-Urruty and Lemaréchal (2001), Rockafellar (1970), and Borwein and Lewis (2006). Legendre-Fenchel transforms are also covered in the above books on convex analysis, but more beginner-friendly presentations are available in Zia et al. (2009) and Gonçalves (2014). The role of Legendre-Fenchel transforms in the analysis of convex optimization algorithms is surveyed in Polyak (2016).
Exercises
7.1 Consider the univariate function

$$ f(x) = x^3 + 2x^2 + 5x - 3. $$

Find its stationary points and indicate whether they are maximum, minimum, or saddle points.
7.2 Consider the update equation for stochastic gradient descent (Equation (7.15)). Write down the update when we use a mini-batch size of one.
7.3 Express the following optimization problem as a standard linear program in matrix notation:

$$ \max_{x \in \mathbb{R}^2, \; \xi \in \mathbb{R}} \; p^\top x + \xi $$