Script Convex Analysis
Script Convex Analysis
Lecture notes
Convex Analysis
in normed spaces with applications to image processing
Sommersemester 2024
Based on the script by Gabriele Steidl, which was modfied by Viktor Stein and Johannes Hertrich.
This script surely contains typos and I appreciate any hints for corrections and
improvements.
Last edited on July 23, 2024.
CONTENTS
Contents
1 Convex Sets 2
1.1 Basic Properties of Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Hahn-Banach Separation of Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 Convex Cones . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4 Orthogonal Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Convex Functions 26
2.1 Proper, Lower Semicontinuous Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3 Convexity for Differentiable Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4 Continuity Properties of Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5 Minima of Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.6 Infimal Projection and Infimal Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.7 Proximal Mappings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3 Subgradients 66
3.1 The Subdifferential . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.2 Subdifferential Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4 Proximal Algorithms 77
4.1 Fixed Point Algorithms and Averaged Operators . . . . . . . . . . . . . . . . . . . . . . . 77
4.2 First-order Algorithms for Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3 Application: Inverse Problems in Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Index 155
2
CONTENTS
Foreword
In recent years Convex Analysis has become an important tool in image and data processing.
This lecture introduces Convex Analysis and in particular optimization algorithms for solving
convex variational problems.
In contrast to the initial version of this lecture, we present many results in general normed
spaces, Banach spaces or Hilbert spaces and highlight the differences to the Euclidean set-
ting. To keep the notations simple, we restrict our attention to real vector spaces. Neverthe-
less, many theorems can be generalized further to (Hausdorff locally convex) topological
vector spaces and to vector spaces over the complex numbers.
In order to get familiar with the concepts of the lecture and to see the relation to image
processing tasks, we highly recommend the participation in the exercises accompanying the
lecture.
Good treatments of Convex Analysis in normed spaces are [AO16, MN22, Luc06, Z0̆2, NP19],
where the latter is more introductory than the others. In [ET99] and the more advanced
texts [BS13, Chp. 2] and [BP12, Chp. 1-3] locally convex (Hausdorff) topological vector
spaces are considered, in [Bon19] and [BV10] Banach spaces, and in [BC11] Hilbert spaces.
For Euclidean cases, we refer to [Roc70b, HUL93] as an overview.
1
1 CONVEX SETS
1 Convex Sets
In this chapter, we consider convex sets such as affine sets, halfspaces and hyperplanes, Lecture 1,
polyhedral sets, simplices and convex cones as well as their basic properties. Moreover, 03.04.2024
we consider separation and projection theorems. Unless specified otherwise, E and F are
always real vector spaces. Later, these are additionally quipped with norms.
A point x P E is a convex combination of pxk qnk“1 Ă E if there exist pλk qnk“1 Ă r0, 1s convex
such that combination
n
ÿ ÿn
x“ λ k xk and λk “ 1.
k“1 k“1
The convex hull of S Ă E is the set of all convex combinations of points in S: convex hull
# +
N
ÿ N
ÿ
convpSq :“ λk xk : N P N, pxk qN
k“1 Ă S, pλk qN
k“1 Ă r0, 1s, λk “ 1 .
k“1 k“1
2
y
λ= 3
1
3x + 23 y
Fig. 1.1: Left: Two points x, y P E “ R2 , the line segment connecting them and an
intermediary point. Right: A convex set and a non-convex set in R2 .
Example 1.1.2 The whole set E and the empty set H both are convex sets and so is any
singleton txu for x P E. The convex subsets of R are exactly the intervals. Open and closed
balls in pE, } ¨ }q are convex. The unit sphere tx P E : }x} “ 1u is never convex. ˛
2
1 CONVEX SETS
A set is convex if and only if it contains all convex combinations of its points, see Exer-
cise 1.1.20. In particular, convpSq Ă E is convex for any S Ă E.
Various combinations of convex sets are convex again. Here is an important example.
C1 ` C2 :“ tc1 ` c2 : c1 P C1 , c2 P C2 u.
Fig. 1.3: The red set is the Minkowski sum of the blue and green set.
The Minkowski sum of two convex sets is convex. If C1 is closed and C2 is compact, then
C1 ` C2 is closed. On the other hand, the Minkowski sum of two closed (but not compact)
sets is not necessarily closed as the following example shows. Let
3
1 CONVEX SETS
Remark 1.1.5 (Convex hull) By Theorem 1.1.4 4 , we have for S Ă E that Fig. 1.5: While the
intersection of con-
č vex sets is convex,
convpSq “ C
their union is not
C convex:SĂCĂE
necessarily convex.
is convex. In this sense, the convex hull is the smallest convex set that includes S. ˝
λx ` p1 ´ λqy P S @λ P R, @x, y P S.
A point x P E is an affine combination of pxk qnk“1 Ă E if there exist pλk qnk“1 Ă R with affine
combination
n
ÿ n
ÿ
x“ λk xk and λk “ 1. (1.2)
k“1 k“1
The affine hull of S Ă E is affpSq, the set of all affine combinations of points in S. The affine hull
closed affine hull of S Ă E is affpSq :“ affpSq. closed affine hull
4
1 CONVEX SETS
Fig. 1.6: A convex set C and its affine hull affpCq [Luc06, Fig. 1.3].
By definition, affine sets are convex. Moreover, any affine set S Ă E can be represented as
translation S “ L ` s of some linear subspace L Ă E with a vector s P E. In particular,
every linear subspace of E is an affine set and we have that affptxuq “ txu for all x P E.
The affine hull of two points is the unique line containing them both.
Remark 1.1.7 (Affine versus convex) Affine sets inherit many properties from convex
ones. For example, the properties 1 - 4 from Theorem 1.1.4 also hold if we replace
convex by affine. Moreover, a set is affine if and only if it contains all affine combinations
of its points and we have that
č
affpSq “ C,
C affine:SĂCĂE
i.e., affpSq is the smallest affine set containing S. However, not any convex set is affine and
consequently some properties do not coincide. For instance, if E is finite-dimensional, then
the affine hull is always closed, which is not true for the convex hull. ˝
Due to the relation of affine sets with linear subspaces, we can define an anlogous concept
to linear independence.
imply that λ1 “ . . . “ λn “ 0.
5
1 CONVEX SETS
x2 x2
x1 x1 x4
x2
x1
x3 x3 x3
Fig. 1.7: In the first plot, the points are affinely independent. In the others, they are not.
Affine and linear independence are closely related. More precisely, the points pxk qnk“1 Ă E
are affinely independent if and only if the elements txk ´ xr : k P t1, . . . , nuztruu are linearly
independent for any r P t1, . . . , nu, see Exercise 1.1.30.
Let pxk qnk“1 be affinely independent. Since (1.2) is equivalent to
n
ÿ n
ÿ
x ´ xr “ λk pxk ´ xr q and λk “ 1, (1.4)
k“1 k“1
k‰r
every x P aff pxk qnk“1 can be uniquely represented as affine combination of pxk qnk“1 , see
` ˘
Exercise 1.1.30.
In order to examine affine sets more in detail, we consider so-called affine functions.
f : E Ñ F, x ÞÑ gpxq ` y,
A function f is affine if and only if it commutes with affine combinations, i.e., f pλx ` p1 ´
λqyq “ λf pxq ` p1 ´ λqf pyq for all x, y P E and λ P R, see Exercise 1.1.32. This implies that
images and preimages of affine sets under affine functions are affine.
6
1 CONVEX SETS
} ¨ }˚ : E ˚ Ñ R, g ÞÑ sup x g, x y
xPE
}x}“1
defines a norm on the dual space E ˚ . We know from the lecture “Functional Analysis” that if
E is a Banach space, then also E ˚ is a Banach space. Moreover, if E is a Hilbert space,
then the dual pairing corresponds to the inner product by Riesz’s representation theorem.
where a :“ x p, x0 y.
Moreover, for a P R and p P E ˚ zt0u the closed halfspaces in E are defined as closed halfspace
`
Hp,a :“ tx P E : x p, x y ě au and ´
Hp,a :“ tx P E : x p, x y ď au.
By definition, the hyperplane Hp,a is the preimage of the closed affine set t0u under the
continuous affine function x ÞÑ x p, x y ´a. Hence, it is a closed affine set. In contrast, the
closed halfspaces Hp,a
˘
are the preimages of the closed non-affine sets r0, `8q and p´8, 0s
under the same function. Thus, they are closed and convex, but not affine. Moreover, we
can always assume that }p}˚ “ 1 and a ě 0 since it holds Hbp,ba “ Hp,a for all real b ą 0.
p˘q p˘q
7
1 CONVEX SETS
S “ tx P E : f pxq ď au,
Fig. 1.9: A polyhedral set defined by five linear inequalities is equal to the intersection of
five half-spaces with normal vectors a1 , . . . , a5 [BV04, Fig. 2.11].
In the case that S is full-dimensional in the sense that affpSq “ E, we have that ripSq “ S ˝ .
In general, this is not true as the following examples show.
• The relative interior of a point is the point itself.
• The relative interior of a line segment contains all points except the endpoints.
• A planar triangle sitting in R3 like in Fig. 1.6 has empty interior. However, the
relative interior captures the interior with respect to the lower-dimensional
affine subspace which the triangle occupies.
Remark 1.1.16 (Relative interior of convex sets)
If E is finite-dimensional, then every convex set C ‰ H Ă E has non-empty relative interior.
This is not true in, e.g., for C being the closed convex set of Lebesgue-a.e. non-positive
functions f : r0, 1s Ñ R in E “ L2 pr0, 1sq. The proof of these facts is left as an exercise.
Solutions can be found in [Luc06, Prop. 1.1.13], [Hol12, p. 9] and [BS13, p. 20]. ˝
8
1 CONVEX SETS
Lecture 2,
Proof. The inclusion “Ą” is clear since ripCq Ă C. For the other inclusion, let x P ripCq,
05.04.2024
y P C and t P r0, 1q. We want to show that w :“ p1 ´ tqx ` ty P ripCq. The case t “ 0 is
clear, so assume t ‰ 0. As x P ripCq, there exists a ε ą 0 such that
If a set C Ă pE, } ¨ }q is convex with C ˝ ‰ H, then we have that ripCq “ C ˝ . Therefore, the
lemma implies that tC ˝ ` p1 ´ tqC Ă C ˝ for all t P r0, 1q.
9
1 CONVEX SETS
Exercises
Exercise 1.1.19
Prove that the sets C1 :“ tpxn qnPN P ℓ2 : |xn | ď 2´n u and
are convex. ■
Exercise 1.1.20 (Convex hull and convex combination [Roc70b, Thm. 2.2])
Prove that a set is convex if and only if it contains all convex combinations of its points. ■
Exercise 1.1.21 (Convex hull is not open nor closed) Find a convex set that is nei-
ther closed nor open. Is the convex hull of an open set open? ■
Exercise 1.1.27 (Affine calculus [Z0̆2, p. 2]) We have affpA ` Bq “ affpAq ` affpBq,
affpA ˆ Cq “ affpAq ˆ affpCq and aff T pAq “ T affpAq for all A, B P E, C P F and linear
` ˘ ` ˘
Exercise 1.1.29 (aff pCq “ aff pCq) We have affpCq “ affpCq because aff is a closure
operator as well. ■
10
1 CONVEX SETS
Exercise 1.1.32 The affine functions are exactly the functions F with F λx ` p1 ´ λqy “
` ˘
Exercise 1.1.33 Prove that taking the affine hull comutes with affine functions, i.e., it
holds affpF pAqq “ F paffpAqq for any affine function F : E Ñ F and any set A Ă E. ■
11
1 CONVEX SETS
x p, x y ď a ď x p, y y @x P C, y P D.
x p, x y ă a ă x p, y y @x P C, y P D.
• properly separates C and D if it separates C and D and both sets are not fully
contained in Hp,a , i.e., if there exist x P C and y P D such that
x p, x y ă x p, y y .
Fig. 1.12: Left: Two proper separable sets C1 , C2 Ă R2 . Middle and Right: Two sets
C1 , C2 Ă Rk , k P t2, 3u which cannot be properly separated [HUL93, Fig. 4.1.3].
The following theorem can be proven using the Hahn-Banach theorem [BS13, Thm. 2.13].
12
1 CONVEX SETS
The inclusion “Ă” holds since C is a subset of the right hand side by definition and since
the right hand side is closed. Moreover, we find by Theorem 1.2.2 for any x P EzC some
p P E ˚ and a P R such that x R Hp,a
´
Ą C which implies the reverse direction “Ą”. ˝
Let C, D Ă E be non-empty, convex disjoint sets. Then there exist p P E ˚ zt0u and
a P R such that the hyperplane Hp,a separates C and D. If C is closed and D is
compact, then p and a can be chosen such that Hp,a separates C and D strongly.
Fig. 1.13: Left: Two convex compact sets are strongly separable by a hyperplane. Right:
Without compactness, two closed convex disjoint sets can be not strongly separable.
Proof. The set S :“ C ´ D does not contain the origin, because C and D are disjoint.
Applying Theorem 1.2.2 on S and t0u there exists p ‰ 0 P E ˚ such that
x p, x y ď x p, 0 y “ 0 @x P S. (1.7)
Since S “ C ´ D, we obtain
x p, c y ď x p, d y @c P C, d P D. (1.8)
Hence, the hyperplane Hp,a with a :“ 21 supcPC x p, c y ` inf dPD x p, d y separates C and D.
` ˘
If C is closed and D is compact, then S is closed and convex since the Minkowski sum
of a closed and a compact set is again closed. Hence we can chose p P E ˚ zt0u and ε ą 0
by Theorem 1.2.2 such that (1.7) and (1.8) hold with an additional “´ε” on the right hand
side. Thus, Hp,a separates C and D strongly in this case. l
Even though two convex sets can always be separated by the previous theorem, this sepa-
ration might not be proper as we can see in the following theorem.
13
1 CONVEX SETS
Proof. “ ùñ :” Let p P E ˚ zt0u and a P R such that Hp,a separates C and D properly
and assume that x0 P ripCq X ripDq. That is, it holds x p, z1 y ď a ď x p, z2 y for all z1 P C
and z2 P D and there exist x1 P C and x2 P D such that x p, x1 y ă x p, x2 y. Since
x0 P C X D we have that x p, x0 y “ a. Because x0 P ripCq X ripDq, we find by Theorem 1.1.17
3 numbers λ1 , λ2 ą 0 such that y1 :“ x0 ´ λpx1 ´ x0 q “ p1 ` λqx0 ´ λx1 P C and
y2 :“ x0 ´ λpx2 ´ x0 q “ p1 ` λqx0 ´ λx2 P D for all 0 ă λ ă minpλ1 , λ2 q. Consequently, we
have that
x p, y1 y “ x p, p1 ` λqx0 ´ λx1 y “ x p, p1 ` λqx0 y ´ x p, λx1 y
ą x p, p1 ` λqx0 y ´ x p, λx2 y “ x p, p1 ` λqx0 ´ λx2 y “ x p, y2 y .
x p, x1 ´ x2 y ď ´ε ô x p, x1 y ď x p, x2 y ´ε.
x p, x y ď 0 @x P ripC ´ Dq.
In particular, it holds
tx P F : x p, x y ď 0u Ą ripC ´ Dq.
such that
x p, x1 ´ x2 y ă 0 @x1 P ripCq, x2 P ripDq.
Hence, we obtain
14
1 CONVEX SETS
x p, x ´ y y “ lim xloooooomoooooon
p, xk ´ yk y ď 0.
kÑ8
ă0
Hp,a :“ tx P E : x p, x y “ x p, x0 y “: au
Fig. 1.14: Supporting hyperplanes to a convex set at different points [HUL93, Fig. 2.4.1].
15
1 CONVEX SETS
Fig. 1.15: Left: An open convex cone. Right: A closed non-convex cone.
The properties 1 - 4 of Theorem 1.1.4 remain true when “convex set” is replaced by
“cone”. Moreover, the closure of a cone remains a cone. However, the interior of a cone does
not necessarily remain one. Simple examples of cones are linear subspaces of E.
16
1 CONVEX SETS
Remark 1.3.4 (Plotting Sym` 2 pRq) Let Symd pRq be equipped the Frobenius norm }¨}F
and let | ¨ |2 be the Euclidean norm. We can visualize the cone Sym` 2 pRq of symmetric
positive definite 2 ˆ 2 matrices by using the linear (normal) isometry (Exercise!)
¨ ˛ ¨ ˛
˜ ¸ a´c x
` ˘ ` 3 ˘ a b 1 ˚
T : Sym2 pRq, } ¨ }F Ñ R , | ¨ |2 ÞÑ ? ˝ 2b ‚ “: ˝y ‚.
‹ ˚ ‹
b c 2
a`c z
1 ` a ˘
λ1,2 “ ? ˘ x2 ` y 2 ` z .
2
a 3
Hence, X P Sym` 2 pRq if and only if x2 ` y 2 ď z, so T Sym`
` ˘
a 2 pRq “ tpx, y, zq P R :
x2 ` y 2 ď zu is an “ice cream cone” (see Fig. 1.17). ˝
Fig. 1.17: The boundaries of the standard second order cone L3 (the “ice cream cone”) and
its rotated version L3r . Image source: https://docs.mosek.com/modeling- cookbook/cqo.html.
is closed and convex, where d ě 1. For d “ 2, the second-order cone is simply the “ice cream
cone” from Remark 1.3.4. Its rotated version is
17
1 CONVEX SETS
where K, K̃ are products of closed convex cones of the form Ld1 , Ldr 2 , Rd3 , Rd`4 or t0u, A is a
matrix and b and c are vectors. Software packages for SOCP are e.g. SeDuMi and MOSEK.
In Exercise 1.3.17, we see how SOCPs can be applied for solving quadratic problems. ˝
S 0 :“ p P E ˚ : x p, x y ď 0 @x P S .
␣ (
Fig. 1.18: Left. The polar cone C 0 of a non-convex set C Ă R2 . The angles between the
limiting rays of the conic hull of C and that of C 0 make right angles. Right. The dual cone
C0 (in the figure denoted as C ˚ ) of the same set C.
For analyzing polar and dual cone, we recap the definition of the canonical embedding of E
into its bidual space E ˚˚ :“ pE ˚ q˚ . The canonical embedding is defined by the continuous canonical
linear isometry ΛE : E Ñ E ˚˚ , where we define ΛE pxq for x P E by embedding
x ΛE pxq, f yE ˚ “ x f, x yE for all f P E ˚ .
If ΛE is surjective, the normed space E is called reflexive. In this case, ΛE is an isometric reflexive
isomorphism between E and E ˚˚ . Then, we identify x with ΛE pxq and write formally
x “ ΛE pxq and E “ E ˚˚ . Note that not any normed space is reflexive.
Now, we can rewrite the polar cone S 0 by
( č␣
S 0 :“ p P E ˚ : x p, x yE ď 0 @x P S “
␣ (
p P E ˚ : x p, x yE ď 0
xPS
č␣ ( č ´
“ p P E ˚ : x ΛE pxq, p yE ˚ ď 0 “ HΛE pxq,0 .
xPS xPS
18
1 CONVEX SETS
In particular, S 0 is an intersection of closed convex sets in E ˚ and hence closed and convex.
Example 1.3.7
1 For E “ Rd it holds rr0, 8qd s0 “ p´8, 0sd .
2 For a linear subspace S Ă E we have that the dual and polar cone coincide and are
given by the orthogonal complement of S, i.e., S 0 “ S0 “ S K , where the orthogonal
complement is defined as S K :“ tp P E ˚ : x p, x y “ 0 @x P Su. In particular pt0uq0 “
E ˚ and E 0 “ t0u Ă E ˚ .
3 For x P E, the cone K “ tλx : λ ě 0u has the polar cone K 0 :“ tp P E ˚ : x p, x y ď 0u.
‰0
Symd` pRq “ Symd´ pRq.
“
4
5 Equipped with the right topologies, the polar cone of the cone of nonnegative measures
M` pXq is the cone of nonpositive continuous functions C ´ pX; Rq.
‰0
Lp pΩ; Rě0 q “ Lq pΩ; Rď0 q, where p ě 1 and p1 ` 1q “ 1.
“
6
7 Let pE, } ¨ }q “ pRd , } ¨ }2 q. The polar cone of the finitely generated cone
K :“ tAy : y P Rm
ě0 u, A P Rd,m
K 0 “ tp P Rd : AT p ď 0u.
so that p P K 0 . Thus, S Ď K 0 .
Conversely, let p P K 0 , i.e., xp, Ayy ď 0 for all y ě 0. In particular, we obtain for
y “ ei that xp, ai y ď 0, i “ 1, . . . , m. Hence K 0 Ď S and in summary K 0 “ S. ˛
x p, x y
cospαq “ .
}p}˚ }x}
Using this definition, the polar cone S 0 zt0u consists exactly of those p P E ˚ such that for
all x P Szt0u the angle α between x and p fulfills cospαq ď 0, i.e., α ě π2 .
Lecture 3,
Lemma 1.3.9 (The bipolar cone)
10.04.2024
Assume that E is reflexive (where we identify E with E ˚˚ ). Then, for K ‰ H Ă E we have
` ˘
K 00 :“ pK 0 q0 “ cone convpKq . For closed convex cones K Ă E we get K 00 “ K.
K 00 “ tx̂ P E ˚˚ : x x̂, p yE ˚ ď 0 @p P K 0 u.
19
1 CONVEX SETS
By the identification x “ ΛE pxq and by inserting the definition of K 0 ,we get that
K 00 “ tx P E : x ΛE pxq, p yE ˚ ď 0 @p P K 0 u “ tx P E : x p, x y ď 0 @p P K 0 u.
(1.10)
` ˘
x p, x y ă x p, x0 y @x P cone convpKq .
x p, x0 y
x p, x y “ x p, y y “ x p, x0 y,
x p, y y
By definition, we have TS px0 q “ H for all x0 R S. It can be proven that the tangential cone
is always closed. If S Ă E is convex, the tangential cone can be defined equivalently by
TS px0 q “ td “ tpx ´ x0 q : x P S, t ě 0u
“ td P E : Dλ ą 0 such that x0 ` λd P Su “ conepS ´ x0 q.
In particluar, if S is convex, then TS px0 q is convex, too. However, in the case that S is
non-convex, TS px0 q is not necessarily convex, see Fig. 1.19 left.
Remark 1.3.11 (Clarke’s tangent cone) An alternative tangential cone is Clarke’s
tangent cone as proposed in [Cla90a, p. 51]. It is always convex even if S is non-convex. If
20
1 CONVEX SETS
S is convex, then Clarke’s tangent cone and the tangential cone due to Bouligand from
Definition 1.3.10 coincide. Since we are mainly interested in convex sets, we do not discuss
Clarke’s tangent cone within this lecture. ˝
For a convex set C Ă E we define the cone of all normals onto C in a point x0 P E. Since
normals live on the dual space E ˚ , the normal cone is a subset of E ˚ .
By definition normal cones are closed and it holds NC px0 q “ NC px0 q “ NripCq px0 q. The
right side of Fig. 1.19 illustrates the tangential and normal cone of a convex set.
Lemma 1.3.13
Let C Ă E be an affine set and let x0 P C. Then, it holds
NC px0 q “ tp P E ˚ : x p, x ´ x0 y “ 0, @x P Cu.
Proof. The inclusion “Ą” is clear by the definition. For “Ă” let p P NC px0 q and x P C.
Moreover, let pxn q8n“1 be a sequence in C such that xn Ñ x0 . Then we have by the affinity of
C that 2xn ´ x P C such that the definition of the normal cone yields x p, 2xn ´ x ´ x0 y ď 0.
Taking the limit n Ñ 8 and using the continuity of p, we obtain that x p, x0 ´ x y “
x p, 2x0 ´ x ´ x0 y ď 0. Since x P C the definition of the normal cone also yields that
x p, x ´ x0 y ď 0 such that we arrive at x p, x ´ x0 y “ 0. l
x0
C
x0+ Nc (x 0 )
x0+ T c(x0 )
TS (x0 )
Fig. 1.19: Left: Tangential cone (blue dotted) of a non-convex set S (black curve). Right:
Tangential cone and normal cone of a convex set C.
21
1 CONVEX SETS
Let C be convex and x0 P C. Then, NC px0 q is the polar cone to TC px0 q, i.e.,
x p, x ´ x0 y ď 0 @x P C,
Since d P TC px0 q was arbitrary chosen, we obtain that x p, d y ď 0 for all d P TC px0 q such
that p P pTC px0 qq0 . l
Exercises
Exercise 1.3.18 (Polar cones) Prove the identities from Example 1.3.7. ■
Exercise 1.3.19 (Dual cone of Lorentz cone) The second order cone Ld`1 from Re-
mark 1.3.5 is dual to itself. ■
22
1 CONVEX SETS
Theorem 1.4.1 does not remain true, whenever we drop some of the assumptions:
• If E is no longer a Hilbert space, the arg min from (1.11) might not be unique. For
example, let pE, } ¨ }q “ pR2 , } ¨ }8 q, C “ tpx, 0q : x P Ru and x “ p0, 1q. Then, it holds
arg minyPC }x ´ y} “ tpx, 0q : x P r´1, 1su.
• If C is not convex, the arg min from (1.11) might not be unique. For instance, consider
pE, } ¨ }q “ pR, | ¨ |q, C “ t´1, 1u and x “ 0. Then, it holds arg minyPC }x ´ y} “
t´1, 1u “ C. See also Fig. 1.20 for another example.
• If C is not closed, the arg min from (1.11) might be empty. For example, consider
pE, } ¨ }q “ pR, | ¨ |q, C “ r´1, 0q and x “ 1. Then, arg minyPC }x ´ y} “ H.
Fig. 1.20: The or-
Example 1.4.2 Let pE, } ¨ }q “ pRd , } ¨ }2 q. For y P Rd and Bp p0, λq “ tz P Rd : }z}p ď λu,
thogonal projection
we want to find the minimizer x̂ “ PBp p0,λq pyq of onto a non-convex
set is not unique
}y ´ x}22 Ñ min s.t. x P Bp p0, λq. [AM19, Fig. 3].
1 For p “ 8, the problem can be solved componentwise, i.e., we have to find xi with
|xi | ď λ such that the distance |yi ´ xi | becomes minimal. We obtain that
#
yi if |yi | ď λ,
x̂i “
λ sgnpyi q if |yi | ą λ.
23
1 CONVEX SETS
3 The case p “ 1 is more complicated and we refer to [DSSSC08, DFL08]. One can
prove that
# #
y if }y}1 ď λ, 0 if |yi | ď γ,
x̂ “ d and Sγ pyi q :“
pSµ pyi qqi“1 if }y}1 ą λ yi ´ γ sgnpyi q if |yi | ą γ
|y |`...`|y |´λ
with µ :“ πp1q m
πpmq
, where |yπp1q | ě . . . ě |yπpdq | ě 0 are the sorted absolute
values of the components of y and m ď d is the largest index with |yπpmq | ą 0 and
|yπp1q | ` . . . ` |yπpmq | ´ λ
ď |yπpmq |.
m
The computation of PB1 p0,λq pyq requires Opd log dq operations due to the sorting pro-
cedure. The function Sγ is known as soft-shrinkage with threshold γ, see [DJ98]. soft-shrinkage
Note that the example can be generalized to pE, } ¨ }q “ L2 pΩ, µq for some measure µ on
some set Ω and the sets
ż
2
Bp p0, λq :“ tf P L pΩ, µq : |f pxq|p dµpxq ď λp u, 1 ď p ă 8
Ω
.f
. λ ff
24
1 CONVEX SETS
In particular, this implies for x “ 0 that xx0 ´ PC px0 q, PC px0 qy ě 0 and for x “ 2PC px0 q
that xx0 ´ PC px0 q, PC px0 qy ď 0. Thus, xx0 ´ PC px0 q, PC px0 qy “ 0 and consequently xx0 ´
PC px0 q, xy ď 0 for all x P C, i.e., x0 ´ PC px0 q P C 0 . Since for all x P C 0 it holds
xx0 ´ px0 ´ PC px0 qq, x ´ px0 ´ PC px0 qqy “ xPC px0 q, x ´ x0 ` PC px0 qy
“ xPC px0 q, xy ` xPC px0 q, PC px0 q ´ x0 y ď 0,
. K*
25
2 CONVEX FUNCTIONS
2 Convex Functions
Real-valued convex functions are intimately connected with convex sets since they are pre-
cisely the functions such that all the points on and above its graph form a convex set. They
have very favorable properties when it comes to optimization.
After studying extended real-valued functions that are suitably well-behaved (“proper”) and
fulfill a semi-continuity requirement compatible with minimization procedures, we introduce
different types of convex functions. For differentiable functions, convexity is equivalent to
monotonicity of the derivative. Moreover, convex functions are continuous (on the relative
interior of their domain) under mild assumptions. In the last two sections of the chapter,
we study the problem of minimizing a convex function and related problems like proximal
operators. Throughout this chapter, pE, } ¨ }q is a real normed vector space, unless indicated
otherwise.
Textbooks: [ET99, Chp. 1.2], [Roc70b, Sec. 4], [AB99, Sec. 5.4], [BL11, Subsec. 6.2.1],
[DM93, Chp. 1] and [Hol12, Sec. 1.3].
26
2 CONVEX FUNCTIONS
Since f px0 q ě inf xPBε px0 q f pxq for every ε ą 0, we always have f px0 q ě supεą0 inf xPBε px0 q f pxq,
so the inequality in (2.1) can be replaced by equality.
27
2 CONVEX FUNCTIONS
The domain dompf q is given by the projection of epipf q Ă E ˆ R onto E, i.e., the image
of epipf q under the mapping px, aq ÞÑ x for px, aq P E ˆ R. Thus epipf q “ H if and only if
f ” 8. The set epipf q has vertical lines where f takes the value ´8.
The (strict) epigraph is a subset of E ˆ R. We equip the space E ˆ R with the norm
}px, aq}2 “ }x}2 ` a2 . Then, the dual space of E ˆ R is given by pE ˆ Rq˚ “ E ˚ ˆ R with
Using Lemma 2.1.4 one can prove a result similar to Theorem 1.1.4 [BL18, Lem. 6.14], which
among other things, states that the lower semicontinuous functions form a convex cone.
28
2 CONVEX FUNCTIONS
2 f ` g (if defined).
3 f ˝ L, where L : F Ñ E is a continuous function.
4 minpf1 , f2 q
5 φ ˝ f provided φ : R Ñ R is increasing and lower semicontinuous.
6 supiPI fi .
The proof is not hard and left as Exercise 2.1.11. Theorem 2.1.6 does not hold true for other
relations such as substraction or the infimum, see Exercise 2.1.12.
29
2 CONVEX FUNCTIONS
Fig. 2.4: Left: A proper non-lower semicontinuous function g1 . Right: Its closure clpg1 q.
Exercises
Fig. 2.5: The proper functions from Eqs. (2.2) and (2.3). In the first two plots, the blue
region depicts the epigraph of the functions.
30
2 CONVEX FUNCTIONS
Exercise 2.1.13 Show that the “ℓ0 -norm” (used by David Donoho for image processing
[Don01, Chp. 2]) defined as
} ¨ }0 : Rd Ñ t0, . . . , du,
␣ (
px1 , . . . , xd q ÞÑ # j P t1, . . . , nu : xj ‰ 0 ,
Exercise 2.1.15 Show that for f : E Ñ R we have clpf qpxq “ supthpxq : h is lsc, h ď f u
for all x P E. ■
31
2 CONVEX FUNCTIONS
(2.4)
` ˘
f tx ` p1 ´ tqy ď tf pxq ` p1 ´ tqf pyq @t P r0, 1s, @x, y P dompf q,
uniformly convex if there exists a non-decreasing function φ : r0, 8q Ñ r0, 8q with uniformly convex
φpxq “ 0 if and only if x “ 0 such that
` ˘
f tx ` p1 ´ tqy ď tf pxq ` p1 ´ tqf pyq ´ tp1 ´ tqφp}x ´ y}q @t P r0, 1s, @x, y P dompf q,
1
f tx ` p1 ´ tqy ď tf pxq ` p1 ´ tqf pyq ´ λtp1 ´ tq}x ´ y}2 @t P r0, 1s, @x, y P dompf q.
` ˘
2
Fig. 2.6: Left: A strictly convex function. Right: A convex, but not strictly convex
function.
32
2 CONVEX FUNCTIONS
By Exercise 1.1.32 affine functions are the only ones that are both convex and concave. If
f : E Ñ R is convex, then dompf q is convex too. In particular, for S Ă E, the indicator
function ιS is convex if and only if S is convex. Moreover, any convex f : S Ñ R for
some convex S Ă E can naturally be extended to a convex function f¯: E Ñ R by setting
f¯pxq “ f pxq for x P S and f¯pxq “ `8 otherwise.
We denote the set of convex and lower semicontinuous functions f : E Ñ R by ΓpEq. Its
subset of proper functions is denoted by Γ0 pEq. Note that ΓpEq and Γ0 pEq are convex cones
in the vector space of all functions from E to R.
Lemma 2.2.2
Let f : E Ñ R be convex. Then the following holds true.
• If f has a finite value in some point x0 P ripdom f q, then f is proper.
• If f is lsc, ripdompf qq ‰ H and f px0 q “ ´8 for some x0 P E, then f pxq P t˘8u for
all x P E.
Proof. For the first statement, we have to verify that f does not take the value ´8.
Assume that there exists x1 P dom f with f px1 q “ ´8. Since x0 P ripdom f q, there exists
x2 P ripdom f q such that x0 “ λx1 ` p1 ´ λqx2 , λ P p0, 1q. Then, we get the contradiction
Next, we prove the second statement. Due to the first statement, we know that f pxq “ ´8
for all x P ripdompf qq. Since f is lsc, this implies that f pxq “ ´8 for all x P dompf q. By
definition, f pxq “ `8 for x R dompf q. l
Lemma 2.2.3
Let E be a Hilbert space. Then, a function f is λ-convex if and only if the quadratic
perturbation f ´ 21 λ} ¨ }2 of f is convex.
Proof. Since inner products are bi-linear, it holds for any x, y P E that
Hence
1 1 1
λ}tx ` p1 ´ tqy}2 ´ tλ}x}2 ´ p1 ´ tqλ}y}2
2 2 2
1 1
“ λtpt ´ 1q}x} ` λtp1 ´ tq x x, y y ` λp1 ´ tq p1 ´ tq ´ 1 }y}2
2
“ ‰
2 2
1 2
“ ´ λtp1 ´ tq}x ´ y} .
2
Therefore, f is λ-convex, if and only if
1
f tx ` p1 ´ tqy ď tf pxq ` p1 ´ tqf pyq ´ λtp1 ´ tq}x ´ y}2
` ˘
2
1 1 1
“ tf pxq ` p1 ´ tqf pyq ` λ}tx ` p1 ´ tqy}2 ´ tλ}x}2 ´ p1 ´ tqλ}y}2
2 2 2
for all x, y P E. By substracting 1
2 λ}tx ` p1 ´ tqy}2 this is equivalent to convexity of
f ´ 12 λ} ¨ }2 . l
33
2 CONVEX FUNCTIONS
Note that we do not need the completeness for the proof. Hence, Lemma 2.2.3 also holds
true in non-complete inner product spaces. Lecture 5,
Example 2.2.4 17.04.2024
• Linear functionals and their absolute value are convex, but not strictly convex.
• Let E be a Hilbert space. Then, the function 12 } ¨ ´y}2 for y P E is 1-convex.
• Let pE, } ¨ }q “ pR, | ¨ |q. Then, f : R Ñ R with x ÞÑ signpxq |x| is not λ-convex for
a
any λ P R.
• Let E be a Hilbert space. Then, } ¨ }2 is strictly convex. This is no longer true on
Banach spaces. A counterexample is pRd , } ¨ }8 q for d ě 2. ˛
The following relations between convexity, level sets and epigraphs hold true.
• If f : E Ñ R is convex, then levα pf q is convex for every α P R.
• A function f : E Ñ R is convex if and only if epipf q is a convex set.
• If f is convex, then so are f and clpf q.
Proof. • Take x, y P levα pf q for α P R and λ P r0, 1s. Then λx ` p1 ´ λyq P levα pf q as
by the convexity of f we have
` ˘
f λx ` p1 ´ λyq ď λf pxq ` p1 ´ λqf pyq ď λα ` p1 ´ λqα “ α.
so
λpx, aq ` p1 ´ λqpy, bq “ pλx ` p1 ´ λqy, λa ` p1 ´ λqbq P epipf q.
“ ðù ”: Let epipf q be convex and x, y P dompf q such that f pxq, f pyq ‰ ´8. Since
px, f pxqq, py, f pyqq P epipf q, we have
λpx, f pxqq ` p1 ´ λqpy, f pyqq “ pλx ` p1 ´ λqy, λf pxq ` p1 ´ λqf pyqq P epipf q,
which implies f pλx ` p1 ´ λqyq ď λf pxq ` p1 ´ λqf pyq. The case f pxq “ ´8 follows
by considering the points px, ´N q and py, f pyqq for N Ñ 8.
34
2 CONVEX FUNCTIONS
• If f is convex, then epipf q is convex by the second part. By Theorem 1.1.4 epipf q “
epipf q is convex, so f is convex by the second part.
If f takes the value ´8, then clpf q ” ´8 is constant, hence convex and if not, then
clpf q ” f is convex by the previous paragraph. l
Hence the epigraph of a proper convex function is non-empty, convex and does not contain
any vertical lines. Moreover, if f : E Ñ R is proper and convex, then clpf q P Γ0 pEq.
Remark 2.2.7 (Quasiconvex functions)
The converse of the first statement in Theorem 2.2.6 is not true, consider e.g. | ¨ |. Func-
a
tions with convex level sets are called quasi-convex and can equivalently be characterized
by the relaxed inequality
` ˘
f λx ` p1 ´ λyq ď maxtf pxq, f pyqu @λ P r0, 1s, @x, y P dompf q. ˝
f pyq ě x p, y y `a @y P E.
Proof. Since clpf q ď f by Exercise 2.1.15, it suffices to show the assertion for lower semi-
continuous functions.
Let x0 P dompf q and b0 ă f px0 q. Then, px0 , b0 q R epipf q Ă E ˆ R, which is a closed and
convex set. By Theorem 1.2.2, there exists pp̄, ´āq P pE ˆ Rq˚ zt0u “ pE ˚ ˆ Rqzt0u with
35
2 CONVEX FUNCTIONS
By noting that px, f pxqq P epipf q for all x P dompf q, this implies
0 ă ā pf px0 q ´ b0 q
loooooomoooooon
ą0
such that ā ą 0. By dividing (2.5) by ´ā and setting p :“ p̄{ā this yields
Since f pxq “ 8 for x R dompf q, the equation is true for all x P E. Thus, setting a “
x p, x0 y `b0 yields the claim. l
Using similar techniques and Remark 1.2.3, one can prove the following corollary.
Corollary 2.2.9 (Lower description of convex lsc functions)
For a convex function f : E Ñ R and x0 P E
␣ (
clpf qpx0 q “ sup x p, x0 y `a : x p, x y `a ď f pxq @x P E .
pPE ˚ zt0u,
aPR
Analogously to Theorem 1.1.4, there are several operations preserving the convexity of func-
tions. Note that products of convex functions may not be convex. An example is the function
f : R2 Ñ R, px1 , x2 q ÞÑ x1 x2 .
36
2 CONVEX FUNCTIONS
Exercises
37
2 CONVEX FUNCTIONS
if it exists. If the limit in (2.6) exists for all v and if Df rx0 s : E Ñ F is linear and continuous,
we call Df rx0 s P LpE, F q the (linear) Gateaux derivative of f at x0 . In this case, we Gateaux
call f Gateaux differentiable at x0 . We call f : U Ñ F Gateaux differentiable, if it derivative
is Gateaux differentiable at x0 for all x0 P U ˝ . Then, the mapping Df : U ˝ Ñ LpE, F q,
x ÞÑ Df rxs is called the Gateaux differential.
Consider the case that pE, } ¨ }E q “ pRd , } ¨ }2 q and pF, } ¨ }E q “ pRn , } ¨ }2 q and let f : Rd Ñ
Rn be a Gateaux differentiable function. Then, f is by definition partially differentiable
and the Gateaux derivative Df rx0 s is given by the Jacobian matrix Jf rx0 s of f at x0
via the matrix-vector multiplication Df rx0 spvq “ Jf rx0 sv. Consequently, the Gateaux
derivative coincides with the partial derivative in Euclidean spaces. Vice versa, not any
partial differentiable function is Gateaux differentiable as the following example shows.
Example 2.3.2 The function
$
x2 y
&
x2 `y 2 for px, yq ‰ p0, 0q,
f : R2 Ñ R, px, yq ÞÑ
%0 else.
38
2 CONVEX FUNCTIONS
2xy 3 x2 px2 ´ y 2 q
fx px, yq “ and fy px, yq “
px2 ` y 2 q2 px2 ` y 2 q2
for px, yq ‰ p0, 0q and fx p0, 0q “ fy p0, 0q “ 0. Note that the fx is not continuous in
nÑ8
x0 :“ p0, 0q (choose pxn , yn q “ p n1 , n1 q ÝÝÝÑ p0, 0q). The directional derivative in x0 is not a
linear function since
f px0 ` thq ´ f px0 q t2 h2 th2 h2 h2 h2 h2
lim “ lim 2 2 1 2 2 “ lim 2 1 2 “ 2 1 2
tÑ0 t tŒ0 tpt h1 ` t h2 q tŒ0 h1 ` h2 h1 ` h2
and the last term is not linear in h. Thus, f is not Gâteaux differentiable in x0 . ˛
is a continuous linear operator from E to LpE, F q (i.e., D2 f rx0 s P LpE, LpE, F qq).
Analogously, higher-order Gateaux derivatives can be defined. Similar to first derivatives,
higher-order Gateaux derivatives are consistent with the usual definition of differentiability
in finite-dimensional cases. ˝
In this lecture, we are interested in functions mapping into the the real numbers, i.e., pF, } ¨
}F q “ pR, | ¨ |q. Here, we will need first- and second-order Gateaux derivatives. In this case,
the first Gateaux derivative Df rx0 s is an element of E ˚ . Then, we call it the Gateaux
gradient, denoted by ∇f px0 q :“ Df rx0 s P E ˚ , since the directional derivatives from (2.6) Gateaux
are given by the dual pairing gradient
for all v P E. Moreover, we call the second Gateaux derivative the Gateaux Hessian Gateaux
and denote it by ∇2 f rx0 s :“ D2 f rx0 s P LpE, E ˚ q. It defines a bilinear form E ˆ E Ñ R by Hessian
Example 2.3.4 Let pE, } ¨ }q “ pRnˆd , } ¨ }Fro q be the Hilbert space of all n ˆ d matrices Lecture 6,
equipped with Frobenius norm and associated inner product x A, B y “ tracepAT Bq. We 19.04.2024
compute the Gateaux gradient and Hessian of f pAq “ 12 }Ax ´ y}22 for fixed x P Rd , y P Rn .
39
2 CONVEX FUNCTIONS
In order to retain the usual properties of differentiable functions, we introduce the total
derivative or Fréchet derivative.
One often writes “f pxq “ f px0 q ` Df rx0 spx ´ x0 q ` op}x ´ x0 }q”. Similarly as before,
we call Df rx0 s P LpE, F q the Fréchet derivative of f at x0 . Further, f is Fréchet
differentiable if it is Fréchet differentiable at every x P U ˝ . In this case Df : U ˝ Ñ
LpE, F q is called the Fréchet differential of f .
40
2 CONVEX FUNCTIONS
Dptf qrx0 spvq “ t Df rx0 spvq, Dpf ` gqrx0 spvq “ Df rx0 spvq ` Dgrx0 spvq.
Proof. 1 - 3 These follow directly by definition and are left as an exercise to the reader.
4 The proof is analogously to the proof of the finite dimensional Schwarz theorem. For
x, y P E consider the expression
f px0 ` hx ` tyq ´ f px0 ` hxq ´ f px0 ` tyq ` f px0 q
Qph, tq :“
th
and note that
x ∇f px0 ` hxq ´ ∇f px0 q, y y
lim lim Qph, tq “ lim “ x ∇2 f rx0 spxq, y y . (2.8)
hÑ0 tÑ0 hÑ0 h
On the other hand, the (one-dimensional) mean value theorem applied on the function
g1 psq “ f px0 ` sx ` tyq ´ f px0 ` sxq yields that there exist some 0 ă c1 ph, tq ă h such that
1 1
Qph, tq “ pf px0 ` hx ` tyq ´ f px0 ` hxq ´ f px0 ` tyq ` f px0 qq “ pg1 phq ´ g1 p0qq
th th
1 A ∇f px ` c ph, tqx ` tyq ´ ∇f px ` c ph, tqxq E
0 1 0 1
“ g11 pc1 ph, tqq “ ,x .
t t
Applying again the mean value theorem onto g2 psq “ x ∇f px0 ` c1 ph, tqx ` syq, x y yields
that there exists some 0 ă c2 ph, tq ă t with
1
Qph, tq “ pg2 ptq ´ g2 p0qq “ g21 pc2 ph, tqq “ x ∇2 f rx0 ` c1 ph, tqx ` c2 ph, tqyspyq, x y .
t
Since ∇2 f is continuous and limh,tÑ0 c1 ph, tq “ 0 “ limh,tÑ0 c2 ph, tq, we obtain that
41
2 CONVEX FUNCTIONS
Together with (2.8), we arrive at x ∇2 f rx0 spxq, y y “ x ∇2 f rx0 spyq, x y “ x x, ∇2 f rx0 spyq y.
5 Let Df rx0 s be the Gateaux derivative. Assume that Df rx0 s is not a Fréchet deriva-
tive. Then, there exists ε ą 0 and a sequence pun qn in E such that }un }E Ñ 0 and
It is not clear if the last item of the theorem still holds true if E is infinite-dimensional.
42
2 CONVEX FUNCTIONS
Reformulating yields
1 1´λ
f px0 ` tvq ě f px0 ` λtvq ´ f px0 q.
λ λ
From this, we indeed obtain (2.9) as required
“1{λ
hkkkkikkkkj
1 1´λ
f px0 ` tvq ´ f px0 q λ f px0 ` λtvq ´ p λ ` 1q f px0 q f px0 ` λtvq ´ f px0 q
ě “ .
t t λt
2 Let v1 , v2 P dompDf rx0 sq. Then, we obtain for h ą 0 small enough and x0 ` hv1 , x0 `
hv2 P dompf q that
3 The second inequality follows from 1 for t “ 1. The first one is clear if f px0 ´vq “ `8 Lecture 7,
or Df rx0 spvq “ `8. Otherwise, there exists h̃ ą 0 with h1 pf px0 ` hvq ´ f px0 qq ă 8 24.04.2024
for all h P p0, h̃s. Hence, we have x0 ´ v, x0 ` hv P dompf q for all h P p0, h̃s. By the
convexity of f we obtain
ˆ ˙
1 h 1 h
f px0 q “ f px0 ` hvq ` px0 ´ vq ď f px0 ` hvq ` f px0 ´ vq.
1`h 1`h 1`h 1`h
43
2 CONVEX FUNCTIONS
pąq
x ∇f pxq ´ ∇f pyq, x ´ y y ě 0 @x, y P U px ‰ yq.
3
pąq
f pyq ´ f pxq ě x ∇f pxq, y ´ x y @x, y P U px ‰ yq.
This yields
pąq f px ` tpy ´ xqq ´ f pxq
f pyq ´ f pxq ě ě Df rxspy ´ xq “ x ∇f pxq, y ´ x y,
t
where we used Lemma 2.3.8 for the second inequality.
“ 3 ùñ 2 ”: By assumption, it holds that
“ 2 ùñ 1 ”: For x, y P E define
` ˘
φ : r0, 1s Ñ R, λ ÞÑ f x ` λpy ´ xq .
We show that φ is convex. To this end, observe that φ is differentiable on p0, 1q with
derivative
` ˘ ` ˘
φpλ ` hq ´ φpλq f x ` pλ ` hqpy ´ xq ´ f x ` λpy ´ xq
φ1 pλq “ lim “ lim
hÑ0
` h hÑ0
˘ ` h˘
f x ` λpy ´ xq ` hpy ´ xq ´ f x ` λpy ´ xq
“ lim
hÑ0 h
` ˘
“ Df rx ` λpy ´ xqspy ´ xq “ x ∇f x ` λpy ´ xq , y ´ x y .
1 pąq
φ1 pλ̃q ´ φ1 pλq “ x ∇f pz̃q ´ ∇f pzq, y ´ x y “ ∇f pz̃q ´ ∇f pzq, z̃ ´ z y ě 0.
xlooooooooooooooomooooooooooooooon
λ̃omo
lo ´ oλn
pąq
ě0 ě 0 by assumption
44
2 CONVEX FUNCTIONS
Now let t P p0, 1q. Then, we obtain by the fundamental theorem of differentiation and
integration together with the fact that φ1 is (strictly) increasing that
żt păq
ż1 pąq
φptq ´ φp0q “ φ1 psqds ď tφ1 ptq, φp1q ´ φptq “ φ1 psqds ě φ1 ptqp1 ´ tq.
0 t
Substracting t times the second equality from 1 ´ t times the first equality, we obtain by
c2 ą c1 that
păq
φptq ´ p1 ´ tqφp0q ´ tφp1q ď tp1 ´ tqpφ1 ptq ´ φ1 ptqq “ 0.
Rearranging the terms yields that
păq
φptq ď p1 ´ tqφp0q ` tφp1q.
The second statement is not true even in finite-dimensional Banach spaces. For instance,
let pE, } ¨ }q “ pRd , } ¨ }8 q and let x “ p´ 12 , 1q and y “ p 12 , 1q. Then, we have 12 }x}2 ` 12 }y}2 “
1 “ }p0, 1q}2 “ } 12 x ` 12 y}2 .
In the case that f is twice Gateaux differentiable, we obtain another criterion for convexity.
45
2 CONVEX FUNCTIONS
Dividing by t yields
∇f rx0 ` tzs ´ ∇f rx0 s
x , z y ě 0, @z P E with }z} ď 1.
t
Taking the limit for t Œ 0 and using the continuity of the dual pairing, we obtain
∇f rx0 ` tzs ´ ∇f rx0 s
x lim , z y ě 0, @z P E with }z} ď 1.
tŒ0 t
By the definition of the Gateaux Hessian as derivative of the Gateaux gradient this is
equal to
x ∇2 f rx0 spzq, z y ě 0, @z P E with }z} ď 1.
Using the positive homogenity of Gateaux derivatives and dual pairings, we obtain for
arbitrary x P E that
which is a contradiction. l
46
2 CONVEX FUNCTIONS
Example 2.3.12 Consider the function f pAq “ 12 }Ax ´ y}22 from Example 2.3.4. We found
that ∇2 f rAspHq “ HxxT . Hence, we have that
x ∇2 f rAspHq, H y “ x HxxT , H y “ tracepxxT H T Hq
“ tracepxT H T Hxq “ xT H T Hx “ }Hx}22 ě 0.
Hence, f is convex. ˛
Example 2.3.13 • The function f pxq “ x4 is strictly convex, but its Gateaux Hessian
(i.e., its second derivative) in 0 is not positive definite since f 2 pxq “ 12x2 for all x P R.
Consequently, the reverse statement of Theorem 2.3.11 2 is not true.
• Using Theorem 2.3.11 and Theorem 2.3.9, we can now verify (strict) convexity of
functions just by evaluating their (second) derivatives. For instance, we can verify
1
that the functions x ÞÑ ´ logpxq, x ÞÑ x logpxq and x ÞÑ xe x are proper, continuous
and strictly convex on p0, 8q.
• The negative geometric mean
˜ ¸ n1
n
ź
f : Rn Ñ R, x ÞÑ ´ xk ` ιr0,8qn pxq
k“1
is convex.
• Let ψ : R Ñ R be increasing and a P R. Then
$ş
& x ψptq dt, if x ě a,
a
φ : R Ñ p´8, 8s, x ÞÑ
% 8, else
is convex [BC11, Ex. 8.13]. ˛
Remark 2.3.14 (First- and second-order criterion for λ-convex functions) We can
extend of Theorem 2.3.9 and Theorem 2.3.11 for λ-convex functions. More precisely, let
U Ă E be open and let f : U Ñ R. Then, one can show that the following statements are
equivalent.
• The function f is λ-convex.
• For all x, y P U , it holds x ∇f pxq ´ ∇f pyq, x ´ y y ě λ}x ´ y}2 .
• For all x, y P U , it holds f pyq ´ f pxq ´ x ∇f pxq, y ´ x y ě 12 λ}x ´ y}2 .
• For all x0 P U and x P E it holds x ∇2 f rx0 spxq, x y ě λ}x}2 .
In the Euclidean case, the last statement is fulfilled if and only if all eigenvalues of the
Hessian matrix are greater or equal than λ.
The proof of the statements is analogously to the ones from Theorem 2.3.9 and Theo-
rem 2.3.11 and is left as Exercise 2.3.19. ˝
Exercises
47
2 CONVEX FUNCTIONS
Exercise 2.3.16 Let pE, } ¨ }E q “ pRd , } ¨ }q and pF, } ¨ }E q “ pRn , } ¨ }1 q be finitie dimensional
spaces, equiped with arbitrary (i.e., possibly non-Euclidiean) norms } ¨ } and } ¨ }1 and let
f : Rd Ñ Rn . Is the relation between the Gateaux derivative Df rx0 s and the Jacobian
matrix Jf rx0 s of f at x0 still true? ■
Exercise 2.3.20 (HM-AM inequality) The harmonic mean řn n 1 of pxk qnk“1 Ă p0, 8q
k“1 xk
řn
is smaller than or equal to its arithmetic mean n1 k“1 xk . ■
´ř ¯
d
Exercise 2.3.21 The log-exponential function (or: LogSumExp) x ÞÑ log k“1 e xk
is
convex, as is the log barrier ´ log ˝ det on the cone of symmetric positive definite matrices.■
48
2 CONVEX FUNCTIONS
M ´m
|f px1 q ´ f px2 q| ď }x1 ´ x2 } @x1 , x2 P Bδ px0 q.
δ
In particular, f is Lipschitz continuous on Bδ px0 q.
Proof. For x P B2δ px0 q we have x0 “ 21 x ` 12 p2x0 ´ xq and thus, by the convexity of f ,
1 1
f px0 q ď f pxq ` f p2x0 ´ xq. (2.10)
2 2
Since 2x0 ´ x P B2δ px0 q, we obtain the lower bound
(2.10)
f pxq ě 2f px0 q ´ f p2x0 ´ xq ě 2f px0 q ´ M :“ m.
x1 x2 y
x0
δ
2δ
49
2 CONVEX FUNCTIONS
The local boundedness assumption in Lemma 2.4.1 is essential. For example, the function
$
& ´?x, if x ě 0,
f pxq :“ (2.11)
% `8, else
is convex but not locally Lipschitz continuous in x “ 0.
Corollary 2.4.2 (Continuity of convex functions)
Fig. 2.10: The func-
Let f : E Ñ R be convex and x0 P dompf q. The following are equivalent.
tion (2.11).
1 f is locally Lipschitz continuous at x0 .
2 f is continuous at x0 .
3 f is locally bounded at x0 .
4 f is locally bounded above at x0 .
Theorem 2.4.3
Let E be finite-dimensional. A proper convex function on a finite-dimensional space
is continuous on pdompf qq˝ . In praticular, a finite convex function f : E Ñ R on a
finite-dimensional space E is continuous.
For a proof, we refer to [Luc06, Cor. 2.1.3] and [BS13, Cor. 2.109]. A finer analysis using
the same arguments shows that that a proper and convex function f : E Ñ R on a finite-
dimensional E is continuous on ripdompf qq relative to affpdompf qq.
The result is wrong for infinite dimensional spaces. Simple counterexamples are unbounded
(i.e., non-continuous) linear functionals. However, continuity can be recovered by assuming
that f is lower semicontinuous and that E is a Banach space.
50
2 CONVEX FUNCTIONS
Proof. • We first show that there exists a point û P dompf q and some δ ą 0 and R P R
such that f puq ă R for all u P Bδ pûq.
Let u0 P pdompf qq˝ and R P R such that f pu0 q ă R. Then, we define V :“ levR pf q.
Since f is convex and lsc, we obtain that V is convex and closed. We want to utilize
Baire’s category theorem (Corollary 1.8 in the Functional Analysis script or [BL18,
Thm. 2.14]) in order to show that V admits an interior point. To this end, we define
the scaled and translated versions of V defined by
" *
u ´ u0
Vn :“ u P E : u0 ` P V “ nV ´ pn ` 1qu0 .
n
By construction, V has an interior point if and only if Vn has an interior point for
some n. Further, by the properties of the Minkowski sum, Vn is closed, convex and
non-empty for all n. Thus, by Baire’s category theorem, it remains to show that
Ť
nPN Vn “ E.
which are convex due to the convexity of f . Since u0 P pdompf qq˝ , there exists some
t0 ą 0 such that fu is finite on r´t0 , t0 s. Thus, by the convexity of fu , we have for any
t P r´t0 , t0 s that
ˆ ˙
t0 ´ t t0 ` t ` ˘
fu ptq “ fu p´t0 q ` t0 ď max fu p´t0 q, fu pt0 q .
2t0 2t0
By Corollary 2.4.2 this implies that fu is continuous in 0. Since f pu0 q ă R this means
that there exists some n P N such that
u ´ u0 ˘ u ´ u0
“ fu p n1 q ă R, i.e. u0 `
`
f u0 ` P V.
n n
Hence it holds that u P Vn . Since u was arbitrary, this implies that nPN Vn “ E.
Ť
Consequently, Baire’s category theorem implies that one Vn has an interior point.
Since Vn is a scaled and translated version of V , this implies that there exists û P V ˝ ,
i.e., there exists some δ ą 0 such that f puq ă R for all u P Bδ pûq.
• Now let u P dompf q˝ be arbitrary. We show that f is locally Lipschitz continuous
in u. The mapping p0, 8q Ñ E given by λ ÞÑ λ1 pu ´ p1 ´ λqûq is continuous and and
has value u at λ “ 1. Since u P pdompf qq˝ , there exits some λ P p0, 1q such that
f p λ1 pu ´ p1 ´ λqûqq “: S ă 8 is finite.
For v P Bδp1´λq puq, set w “ û ` 1
1´λ pv ´ uq P Bδ pûq. Rearranging yields v “ u ´ p1 ´
51
2 CONVEX FUNCTIONS
Thus, f pvq ď S ` R for all v P Bδp1´λq puq, namely, f is locally bounded from above at
u. This implies by Corollary 2.4.2 that f is locally Lipschitz continuous at u. l
52
2 CONVEX FUNCTIONS
Similarly, we define the set of global maximizers as arg maxpf q :“ arg minp´f q.
The following properties state that a function grows sufficiently quickly at the “boundary”.
f pxq
lim inf “ 8.
}x}Ñ8 }x}
For example, the function f pxq “ x2 is supercoercive, but f pxq “ x is not. Note that not
every strictly convex and coercive function f : E Ñ R is supercoercive. For instance consider
f : R Ñ R, x ÞÑ |x| ´ logp|x| ` 1q.
In finite dimensions, a function f P Γ0 pRd q is even coercive if one level set is bounded, see
Exercise 2.5.8.
In the following, we will prove the existence of minimizers for coercive functions f P Γ0 pEq.
In finite-dimensions the proof works by a compactness argument for bounded sets. Unfortu-
nately, the unit ball is compact if and only if E is finite-dimensional such that the argument
fails in infinite dimensional spaces.
As a remedy, we use compactness with respect to weak convergence. Recall that pxn qn in
53
2 CONVEX FUNCTIONS
Lemma 2.5.3
Let E be a real normed space. Then, the following holds true.
• A convex set C Ă E is closed if and only if it is weakly sequentially closed.
• A convex function f : E Ñ R is lower semicontinuous if and only if it is weakly lower
semicontinuous.
• Let E be a reflexive Banach space, then the unit ball is weakly sequentially compact.
In particualar, any bounded sequence admits a weakly convergent subsequence.
Proof. • Let C Ă E be convex and weakly sequentially closed. Moreover, let pxn qn
be a sequence in C that converges strongly to some x P E. Since strong convergence
implies weak convergence, we obtain that xn á x. Because C is weakly sequentially
closed, we obtain x P C. Thus, C is closed.
For the reverse direction, note that any closed halfspace Hp,a
´
is weakly sequentially
closed. Indeed: let pxn qn be a sequence in Hp,a and let xn á x. Then, x p, x y “
´
Now, we prove existence of minimizers for convex functions under certain assumptions.
54
2 CONVEX FUNCTIONS
Proof. 1 Since f is proper, we have that there exists some x0 P E such that a :“ f px0 q P
R. Now, we choose a sequence pxn qn in E such that f pxn q ď a and limnÑ8 f pxn q “
inf xPE f pxn q. By definition, we have that xn P leva pf q. Since f is coercive, we obtain
that leva pf q and therefore the sequence pxn qn is bounded. Using that E is a reflexive
Banach space this yields by Lemma 2.5.3 that it admits a subsequence pxnk qk which
converges weakly to some x̂ P E. Finally, the lower semicontinuity of f implies by
Lemma 2.5.3 weak lower semicontinuity such that we have
This implies that ´8 ă f px̂q “ inf xPE f pxq such that x̂ P arg minpf q.
Because arg minpf q “ levf px̂q pf q, we obtain that arg minpf q is bounded (due to coer-
civity of f ) and closed (by Lemma 2.1.4 since f is lsc).
2 Let x̂ be a local minimizer which is not a global minimizer. Then, there exists some x̃
with f px̃q ă f px̂q. In particular, we have for any t P p0, 1q that
Since p1 ´ tqx̂ ` tx̃ Ñ x̂ as t Ñ 0 we obtain that for any δ ą 0 there exists some
x P Bpx̂, δq such that f pxq ă f px̂q which contradicts the assumption that x̂ is a local
minimizer.
3 Let x̂1 , x̂2 P arg minpf q and t P r0, 1s. Then, we have that
inf f pxq ď f pp1 ´ tqx̂1 ` tx̂2 q ď p1 ´ tqf px̂1 q ` tf px̂2 q “ inf f pxq,
xPE xPE
such that all ď are in fact equalities. Thus, f pp1 ´ tqx̂1 ` tx̂2 q “ inf xPE f pxq, which
implies p1 ´ tqx̂1 ` tx̂2 P arg minpf q which implies convexity.
4 Assume that x̂1 ‰ x̂2 are two elements of arg minpf q, i.e., f px̂1 q “ f px̂2 q “ inf xPE f pxq.
Then, it holds by strict convexity that
which is a contradiction. l
Using the same proof, we observe that part 1 of the theorem also holds true if f is only
quasi-convex.
Remark 2.5.5 Part 1 of the previous proof follows the so-called direct method in the
calculus of variations or Tonelli’s direct method for proving the existence of minimizers
of some functional f . It consists out of three steps.
55
2 CONVEX FUNCTIONS
Exercises
56
2 CONVEX FUNCTIONS
α epi f
epi p
Fig. 2.11: The name “epigraphical projection” is due to epipvq being the image of epipφq
Fig. 1–8. Epigraphical projection in parametric minimization.
under the projection px, u, aq ÞÑ pu, aq.
for Since
Proof. φ is set
a closed convex,
X ⊂forIRpx
n , u q P dompφq, i P t1, 2u, and λ P r0, 1s we have
i continuous functions fi : X × U #→ IR (for
i and
i ∈ {0} ∪ I1 ∪ I2 ). Suppose that for each ū ∈ U , ε > 0 and α ∈ IR the set of
λφpx1 , u1 q ` p1 ´ λqφpx2 , u2 q ě φpλx1 ` p1 ´ λqx2 , λu1 ` p1 ´ λqu2 qq
pairs (x, u) ∈ X × U satisfying |u − ū| ≤ ε and f0 (x, u) ≤ α, along with all the
vpλu1 ` p1 n 2 q. m
constraints indexed by I1 and I2 , isěbounded in´IRλqu × IR .
Then p is lsc on U , and for every u ∈ U
Taking the infimum at the left hand side over x1 and x2 we havewith p(u) < ∞ the set P (u)
is nonempty and compact. If only f0 depends on u, and the constraints are
satisfied by at least one1 qx,
λvpu ` then dom 2pq =
p1 ´ λqvpu ěU , and
vpλu 1 `pp1is´continuous
λqu2 q. relative to U .
ν ν ν
In that case, whenever x ∈ P (u ) with u → ū in U , all the cluster points of
Lastly,
thenoting that{x
sequence ν
}ν∈INisare
dompvq theinprojection
P (ū). of dompφq shows that v is convex.
For the secondThis
Guide. partisconsider
obtainedφpa,
from :“ a ` ιC pa,
uq Theorem uq.by taking f (x, u) = f0 (x, u) when
1.17 l
57
2 CONVEX FUNCTIONS
Example 2.6.3 (Infimal projection can be improper) Let f P Γ0 pRd q be finite and
φpx, uq :“ f pxq ` ιtex ďuu pxq. Then φp¨, uq P Γ0 pRd q. For f :“ id, the infimal projection of φ
is the improper, not lower semicontinuous function
$
& 8, u ď 0,
vpuq “
% ´8 u ą 0.
˛
Infimal Convolution
Closely related to infimal projections are infimal convolutions.
By definition, the infimal convolution is commutative (i.e., f1 □f2 “ f2 □f1 ) and it holds
dompf1 □f2 q “ dompf1 q ` dompf2 q. Moreover, we can analogously define the infimal convo-
lution of more than two functions by
n
ÿ
pf1 □ ¨ ¨ ¨ □fn qpxq “ inf f pxi q.
xi PE
x1 `¨¨¨`xn “x i“1
In the case that pf1 □f2 q and pf2 □f3 q are proper, the infimal convolution is associative ,i.e.,
f1 □pf2 □f3 q “ pf1 □f2 q□f3 “ f1 □f2 □f3 .
Example 2.6.5
• We have ιS1 □ιS2 “ ιS1 `S2 for S1 , S2 Ă E.
• For f1 :“ ιC and f2 :“ } ¨ } we have pf1 □f2 qpyq “ inf xPC }y ´ x}, which is the
distance of the point y to the set C ‰ H, and also the infimal projection of φpx, uq :“
}u ´ x} ` ιC pxq.
• The infimal convolution admits the neutral element ιt0u . ˛
Let f1 , f2 : E Ñ R be proper.
1 We have
where equality holds if and only if the infimum in (2.12) is attained for all
elements indompf1 □f2 q.
2 We have s-epipf1 □f2 q “ s-epipf1 q ` s-epipf2 q.
58
2 CONVEX FUNCTIONS
Proof. 1 Let px, aq P epi f1 ` epi f2 . Then there exist pxi , ai q P epi fi , i “ 1, 2 so that
x “ x1 ` x2 , a “ a1 ` a2 and f1 px1 q ` f2 px2 q ď a1 ` a2 “ a. Consequently,
Conversely, assume that we have equality in (2.13). Let f pxq :“ pf1 lf2 qpxq be finite
so that px, f pxqq P epipf1 lf2 q. Then there exist pxi , ai q P epi fi with px, f pxqq “
px1 , a1 q ` px2 , a2 q. This implies that f pxq “ a1 ` a2 ě f1 px1 q ` f2 px2 q. Since we have
on the other hand that f pxq ď f1 px1 q ` f2 px2 q we see that the infimum is attained for
the decomposition x “ x1 ` x2 . Lecture 9,
2 The inclusion “Ă” is analogously to 1 . 03.05.2024
The following example shows that the reverse inclusion of part 1 of the theorem is not true.
Moreover, the infimal convolution of two proper convex and lsc functions is not necessarily
proper or lsc.
Example 2.6.7
• Let p P R and consider the proper, continuous, convex functions f1 , f2 : R Ñ R with
f1 pxq :“ px and f2 pxq :“ ex . Then
$
& ppx ´ logppq ` 1q, p ą 0,
’
’
pf1 □f2 qpxq “ 0, p “ 0, ,
’
’
p ă 0.
%
´8,
59
2 CONVEX FUNCTIONS
so that for p ă 0, the infimal convolution is not proper. For p “ 0 we have f1 “ f1 □f2
and thus
Exercises
Exercise 2.6.8 (Computing some inf-convolutions [Luc06, Ex. 1.2.24]) Find f □g,
where
• f pxq :“ x and gpxq :“ 2x.
• f : E Ñ R and gpxq :“ ιtx0 u for some x0 P E.
• f : R Ñ R and gpxq :“ rιt0u for r P R.
• f pxq :“ 12 x2 and gpxq :“ ιr0,1s .
• f : E Ñ p´8, 8s is convex and proper and gpxq :“ ιBr p0q for some r ą 0.
• f pxq :“ 21 x2 and gpxq :“ x. ■
60
2 CONVEX FUNCTIONS
The Moreau envolope is convex as the infimal convolution of two convex functions.
In the case, that f : Rd Ñ R is differentiable, the proximal mapping corresponds to an
implicit gradient step. More precisely, let x̂ P proxλf pxq. Then, x̂ is a minimum and hence
a critical point of the function f ` 2λ
1
}x ´ ¨}2 . Consequently, it holds 0 “ λ1 px̂ ´ xq ` ∇f px̂q
which can be reformulated as x̂ “ x ´ λ∇f px̂q.
Remark 2.7.2 (Generalizations of proximal mappings)
The definition can be extended in several directions. For instance, we could consider proximal
mappings with respect to non-convex functions f . Moreover, we could replace the term
}x ´ y}22 by other “distance-like” functions gpx, yq : E ˆ E Ñ R. However, in this case many
of the properties of proximal mappings which we show in the sequel are not clear. ˝
which is the soft-shrinkage operator with threshold λ. The Moreau envelope of soft-shrinkage
operator
61
2 CONVEX FUNCTIONS
1 2
2λ x
Sλ
λ
2
−λ λ −λ λ
Fig. 2.12: Left: The soft-shrinkage function proxλf “ Sλ for f :“ | ¨ |. Middle: The
Moreau envelope λ f of f . (In this plot, λ “ 2.) Right: The absolute value function
(black) and its Moreau envelope (gray) of the absolute value function for λ P t1, 2, 4u.
For the multivariate analogue f := ‖·‖₁ of the function considered above, the minimization can be done componentwise, leading to

prox_{λf}(x) = arg min_{y∈R^d} Σ_{k=1}^{d} ( ½(x_k − y_k)² + λ|y_k| ) = ( S_λ(x_k) )_{k=1}^{d}.
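A minimal NumPy sketch of this componentwise soft-shrinkage (the function name is our own choice):

```python
import numpy as np

def soft_shrinkage(x, lam):
    # S_lam(x) = sign(x) * max(|x| - lam, 0): prox of lam*|.| componentwise,
    # hence also the prox of lam*||.||_1 on R^d.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```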
If E is a Hilbert space:
4 The proximal mapping consists of a unique element, i.e., there exists some x̂ ∈ E such that prox_{λf}(x) = {x̂}. In this case, we also write x̂ = prox_{λf}(x).
5 A point x̂ ∈ E minimizes (2.15) if and only if the variational inequality

(1/λ)⟨x − x̂, y − x̂⟩ + f(x̂) ≤ f(y)

holds for all y ∈ E.
∇(^λf)(x) = (1/λ) ( x − prox_{λf}(x) ).  (2.16)
(1/(2λ))‖x̂ − x̂‖² + f(x̂) = f(x̂) ≤ f(x) ≤ (1/(2λ))‖x̂ − x‖² + f(x).

Hence, x̂ ∈ prox_{λf}(x̂).
"⟸": Suppose x̂ ∈ prox_{λf}(x̂) and let x ∈ E. Then it holds

x̂ ∈ arg min_{y∈E} { (1/(2λ))‖x̂ − y‖² + f(y) }.

Moreover,

^λf(x₀) ≤ f(x₀) ≤ (1/(2λ))‖x̂ − x₀‖² + f(x₀) = ^λf(x̂) ≤ ^λf(x₀).

Consequently, the above inequalities are equalities such that ‖x̂ − x₀‖² = 0, i.e., x₀ = x̂. Therefore, we have x̂ ∈ prox_{λf}(x̂) such that x̂ is a minimizer of f by part 1.
4 Since E is a Hilbert space, we get by Corollary 2.3.10 that (1/(2λ))‖·‖² is strictly convex such that the claim follows by Theorem 2.5.4 4.
0 ≥ t f(x̂) − t f(y) + (t/λ)⟨x̂ − x, x̂ − y⟩ − (t²/(2λ))‖y − x̂‖².
Finally, dividing by t ą 0 and taking t Œ 0 we obtain the variational inequality.
“ ðù ”: Conversely, let the variational inequality hold true. Then we have for all y P E
that
f(x̂) + (1/(2λ))‖x̂ − x‖² ≤ f(y) − (1/λ)⟨x̂ − x, x̂ − y⟩ + (1/(2λ))‖x̂ − x‖²
≤ f(y) + (1/(2λ))‖x̂ − x‖² − (1/λ)⟨x̂ − x, x̂ − y⟩ + (1/(2λ))‖x̂ − y‖²
= f(y) + (1/(2λ))‖y − x‖²
by the binomial formula. Hence x̂ “ proxλf pxq.
6 We show that the gradient of ^λf exists and is given by (2.16). For an arbitrary x₀ ∈ E let

x̂₀ := prox_{λf}(x₀),  z := (1/λ)(x₀ − x̂₀).

To show that ^λf is differentiable at x₀ with ∇^λf(x₀) = z, we have to prove that r(u) := ^λf(x₀ + u) − ^λf(x₀) − ⟨z, u⟩ = o(‖u‖). It holds

^λf(x₀) = f(x̂₀) + (1/(2λ))‖x̂₀ − x₀‖²,
^λf(x₀ + u) = min_{x∈E} { f(x) + (1/(2λ))‖x − x₀ − u‖² } ≤ f(x̂₀) + (1/(2λ))‖x̂₀ − x₀ − u‖²,

so that

r(u) ≤ (1/(2λ))‖x̂₀ − x₀ − u‖² − (1/(2λ))‖x̂₀ − x₀‖² − (1/λ)⟨x₀ − x̂₀, u⟩ = (1/(2λ))‖u‖².
The uniqueness from Theorem 2.7.4 4 holds true for so-called “strictly convex” Banach
spaces.
Explicit forms of proximal mappings for specific (finite-dimensional) functions can be found at the website http://proximity-operator.net/ and in Appendix 9 of the script on convex analysis in Euclidean spaces by Gabriele Steidl (available in ISIS).
Exercises
Exercise 2.7.5 What is proxλf if f is constant? Show that every translation on E (that
is, a map z ÞÑ z ´ a) is a proximal operator for suitable f and g [Mor65, p. 278-279]. ■
3 Subgradients
The condition is often denoted by “f pxq´f px0 q ě x p, x´x0 y `op}x´x0 }q”. In the case that
f is convex, one can show that the regular subdifferential coincides with Definition 3.1.1.
In this lecture, we mainly consider convex functions. Nevertheless, many of the following
relations remain true for non-convex functions and the regular subdifferential. ˝
For convex functions the following lemma relates subgradients with one-sided directional
derivatives (which exist by Lemma 2.3.8).
that is, (1/h)( f(x₀ + hv) − f(x₀) ) ≥ ⟨p, v⟩. Letting h ↘ 0, we obtain the first implication.
“Ą”: Let p P tp P E ˚ : x p, v y ď Df rx0 spvq @v P Eu. By Lemma 2.3.8 3 we have
The lemma implies in particular that Df[x₀](v) = sup_{p∈∂f(x₀)} ⟨p, v⟩. By replacing v by −v in the statement of the lemma, we obtain that p ∈ ∂f(x₀) if and only if ⟨p, v⟩ ≥ −Df[x₀](−v) for all v ∈ E. Consequently, we can rewrite the subdifferential as
In the case E “ R, this yields that Bf px0 q “ tp P R : ´ Df rx0 sp´1q ď p ď Df rx0 sp1qu,
i.e., the subdifferential is the interval with the bounds being exactly the one-sided deriva-
tives of f at x0 . If the directional derivatives at x0 are finite, this simplifies to Bf px0 q “
r´ Df rx0 sp´1q, Df rx0 sp1qs.
Under the assumption that f is continuous at x0 , the next theorem shows that the subdif-
ferential is non-empty and bounded. By Theorem 2.4.3 and Theorem 2.4.4, this assumption
is particularly fulfilled if E is finite-dimensional or if E is a Banach space and f lsc.
Bounded: It holds by Lemma 2.3.8 for p ∈ ∂f(x₀) and x ∈ E with ‖x‖ ≤ 1 that

‖p‖_* = sup_{x∈E, ‖x‖≤1} ⟨p, x⟩ ≤ (c − f(x₀))/ε.
Since f is a convex function, we have that C1 is convex. Moreover, (3.3) yields that px0 , bq P
C1 for all b ą c. Consequently, C1 and C2 are disjoint, convex and non-empty. Hence,
they can be separated accordingly to Theorem 1.2.2. More precisely, there exists pp, aq P
pE ˆ Rq˚ zt0u “ pE ˚ ˆ Rqzt0u such that
i.e.,
x p, x ´ x0 y `apb ´ f px0 qq ď 0 @px, bq P epipf q˝ .
Since (3.3) yields that there exists b > c > f(x₀) such that (x₀, b) ∈ (epi f)°, this implies that

a (b − f(x₀)) < 0  with  b − f(x₀) > 0,

such that a < 0.
Hence, we obtain by dividing by a < 0 and setting p̄ := −p/a ∈ E* that
By Lemma 1.1.18 we have that epi(f) ⊂ cl(epi(f)) = cl((epi f)°) such that there exists for any x ∈ E some sequence (xₙ, bₙ)ₙ in (epi f)° with (xₙ, bₙ) → (x, f(x)). In particular, we have that
Proof. Without loss of generality, assume that 0 P affpdompf qq (otherwise consider gpxq “
f px ´ x0 q instead of f ). Then, F :“ affpdompf qq is a closed linear subspace of the Banach
space E and therefore complete. Now, we consider the restricted function f |F : F Ñ R de-
fined by f |F pxq “ f pxq for x P F . Since x0 P ripdompf qq, we obtain that x0 P pdompf |F qq˝ Ă
F . Because either f is lsc or E (and therefore F ) is finite-dimensional, we obtain by The-
orem 2.4.4 or Theorem 2.4.3 that f |F is continuous at x0 . Hence, by Theorem 3.1.5 there
exists q P Bf |F px0 q Ă F ˚ . Finally, the Hahn-Banach theorem implies that there exists
p P E ˚ with x p, x y “ x q, x y for all x P F such that
If E is a Banach space and f is lsc, the reverse direction holds true. That is, if
Bf px0 q “ tpu, then f is Gateaux differentiable at x0 and p “ ∇f px0 q holds.
x p, x y ď x ∇f px0 q, x y @x P E.
Now we show that x p, v y “ Df rx0 spvq for all v P E which implies that p “ ∇f px0 q is the
Gateaux gradient of f at x0 . Assume that there exists v P E such that x p, v y ă Df rx0 spvq.
Then, let F “ ttv : t P Ru be the linear subspace of E spanned by v and define the linear
functional q : F Ñ R by x q, tv y “ t Df rx0 spvq. Since F is finite-dimensional, we have that
q P F ˚ . In the following, we extend q to an element p̄ P E ˚ ztpu by the Hahn-Banach
theorem and show that p̄ P Bf px0 q.
Using the positive homogeneity of Df rx0 s, it holds that x q, tv y “ t Df rx0 spvq “ Df rx0 sptvq
for all t ě 0. Moreover, since p is a subgradient, we have for t ă 0 that
Hence, we have that x q, x y ď Df rx0 spxq for all x P F . Moreover Df rx0 s is subadditive since
it is positively homogeneous and it holds
Df[x₀](x + y) = 2 Df[x₀]( ½x + ½y ) ≤ 2 ( ½ Df[x₀](x) + ½ Df[x₀](y) ) = Df[x₀](x) + Df[x₀](y)
due to the convexity of Df rx0 s by Lemma 2.3.8. Thus, by the algebraic Hahn-Banach
theorem, there exists some p̄ P E 1 such that x p̄, tv y “ x q, tv y for all t P R and x p̄, x y ď
Df[x₀](x) for all x ∈ E. By Lemma 2.3.8 and (3.5) we obtain that for all x ∈ E with ‖x‖ ≤ 1

⟨p̄, x⟩ ≤ Df[x₀](x) ≤ ( f(x₀ + εx) − f(x₀) )/ε ≤ (c − f(x₀))/ε.
In particular, p̄ is bounded and therefore continuous, i.e., p̄ P E ˚ . Together with x p̄, x y ď
Df rx0 spxq for all x P E, this implies that p̄ P Bf px0 q. Since p̄ ‰ p (because x p̄, v y “
Df rx0 spvq ą x p, v y) this contradicts Bf px0 q “ tpu. l
As an example, consider the function

f : R → R̄,  x ↦ |x| − √(2 − x) for x ∈ [−2, 2],  and f(x) := ∞ else,  (3.6)

whose graph is shown in Fig. 3.2. Its one-sided directional derivatives are
Df[x](1) = { ∞, if x = 2;  1 + 1/(2√(2 − x)), if x ∈ [0, 2);  1/(2√(2 − x)) − 1, if x ∈ [−2, 0), }

Fig. 3.2: The function from (3.6).
and

Df[x](−1) = { −∞, if x = 2;  −1 − 1/(2√(2 − x)), if x ∈ (0, 2);  1 − 1/(2√(2 − x)), if x ∈ (−2, 0];  ∞, if x = −2. }
Thus, by ±p ≤ Df[x](±1), i.e., −Df[x](−1) ≤ p ≤ Df[x](1), we get

∂f(x) = { {1 + 1/(2√(2 − x))}, if x ∈ (0, 2);
  [1/(2√2) − 1, 1/(2√2) + 1], if x = 0;
  {1/(2√(2 − x)) − 1}, if x ∈ (−2, 0);
  (−∞, −3/4], if x = −2;
  ∅, otherwise. }
˛
0 P Bf px0 q.
Example 3.1.10
• The subdifferential of the norm ‖·‖ is given by

∂‖·‖(x) = { {p ∈ E* : ⟨p, x⟩ = ‖x‖, ‖p‖_* = 1}, if x ≠ 0;  B₁*(0), the closed unit ball of E*, else. }
Exercises
For f, g : E → R̄ we have

∂f(x₀) + ∂g(x₀) ⊂ ∂(f + g)(x₀) for all x₀ ∈ E,

with equality if f, g ∈ Γ₀(E) and there exists a point x̂ ∈ dom(f) ∩ dom(g) such that f is continuous at x̂.
Let F be a real normed space, A P LpE; F q and f P Γ0 pF q be continuous at Aŷ for
some ŷ P E. For every x0 P E we have
“ ‰
Bpf ˝ Aqpx0 q “ A˚ pBf qpAx0 q ,
Proof. We only prove the first statement. A proof for the second statement can be found
in [ET99, Prop. I.5.7].
“Ă”: Let p P Bf px0 q ` Bgpx0 q, i.e., p “ p1 ` p2 for p1 P Bf px0 q and p2 P Bgpx0 q. Then it holds
This implies

f(x) − f(x₀) − ⟨p, x − x₀⟩ ≥ g(x₀) − g(x),  (3.7)

where we abbreviate the left-hand side by φ₁(x) and the right-hand side by −φ₂(x).
Since φ1 px0 q “ φ2 px0 q “ 0, we have for all b2 ă 0 ă b1 that px0 , b1 q P C1 and px0 , b2 q P C2
and therefore
x q, x0 y ď β ď x q, x0 y, i.e., β “ x q, x0 y . (3.9)
Now let y ∈ E with ‖y‖ ≤ 1 be such that ⟨q, y⟩ > 0. Since f is continuous at x̂, there exists some ε > 0 such that B_{2ε}(x̂) ⊂ dom(f). In particular, we have that x̂ − εy ∈ dom(f) such that it holds by (3.10) that

⟨q, x̂⟩ ≤ ⟨q, x̂ − εy⟩ = ⟨q, x̂⟩ − ε⟨q, y⟩ < ⟨q, x̂⟩
We have

∂f(x) = { ∅, if x < 0;  (−∞, 0], if x = 0;  {0}, if x > 0, }   and   ∂g(x) = { ∅, if x ≥ 0;  {0}, if x < 0. }
Thus ∂f(x) + ∂g(x) = ∅ for all x ∈ R (as S + ∅ = ∅ for all sets S). Furthermore, for x ∈ R we have

f(x) + g(x) = { 1, if x = 0;  ∞, if x ≠ 0. }
For x “ 0 we obtain
Thus

∂(f + g)(x) = { ∅, if x ≠ 0;  R, if x = 0. }
˛
Combining Theorem 3.2.2 and Theorem 3.1.9 gives a characterization of minima in con- Lecture 12,
strained optimization problems. 15.05.2024
if and only if x̂ P C and 0 P Bf px̂q ` NC px̂q, i.e., there exists some p P Bf px̂q such that
x p, x ´ x̂ y ě 0 for all x P C.
In particular, if E is a Hilbert space, then it is a minimizer if and only if x̂ P C and there
exists a p P Bf px̂q such that
x p, x ´ x̂ y ě 0 @x P C,
There are also product, quotient and chain rules for subdifferentials. We refer to [FP03,
Prop. 7.1.9, Prop. 7.1.11] for detailed formulations and proofs in the case E “ Rn and to
[Cla90b, p. 42 - 48] for a more general case.
Finally, using subdifferential calculus, we can now present an easy proof for the variational
inequality for proximity operators from Theorem 2.7.4 5 .
x̂ ∈ arg min_{z∈E} { ½‖z − x‖² + λf(z) }.

By Fermat's rule (Theorem 3.1.9) and subdifferential calculus (Theorem 3.2.2) this is equivalent to

0 ∈ x̂ − x + λ∂f(x̂)  ⟺  (1/λ)(x − x̂) ∈ ∂f(x̂).

Inserting the definition of subdifferentials, this is equivalent to

(1/λ)⟨x − x̂, y − x̂⟩ ≤ f(y) − f(x̂)  for all y ∈ E.
This completes the proof. l
Exercises
4 Proximal Algorithms
In this section, we consider algorithms for solving the optimization problem

min_{x∈E} f(x).  (4.1)
In the case that f is differentiable, we know from a lecture on numerical methods that we
can employ a gradient descent scheme. Moreover, we have seen in the previous sections that
the proximal mapping generalizes the backward gradient step
Therefore, we will consider algorithms for solving (4.1), where the proximal mapping replaces
gradient descent steps. To analyze their convergence, we consider fixed point theorems and
averaged operators. Moreover, we assume that pE, } ¨ }q is a Hilbert space with inner
product x ¨, ¨ y. Indeed, many of the following theorems do not hold true in general Banach
spaces without additional assumptions.
More precisely, an operator T : E → E is called t-averaged for t ∈ (0, 1) if T = tI + (1 − t)R for some non-expansive operator R : E → E, where I is the identity. We say that T is averaged if it is t-averaged for some t ∈ (0, 1).
which is equivalent to
}T x ´ T y}2 ď x T x ´ T y, x ´ y y .
By definition, averaged and firmly non-expansive operators are non-expansive. The reverse
is not true. For instance, the identity is firmly non-expansive and t-averaged for all t P p0, 1q,
but is not a contraction.
Convex combinations and concatenations of non-expansive operators are non-expansive.
Moreover, if T is t-averaged for some t P p0, 1q, then it is also s-averaged for any 0 ă s ď t.
Further, the following lemma summarizes some relations between the above definitions.
Lemma 4.1.2
Let E be a Hilbert space. Then:
1 A contractive operator T : E → E with Lipschitz constant L < 1 is t-averaged for all t ∈ (0, ½(1 − L)] ⊂ (0, ½].
2 T is t-averaged for t ∈ (0, 1) if and only if

‖Tx − Ty‖² ≤ ‖x − y‖² − (t/(1 − t)) ‖(I − T)x − (I − T)y‖²  ∀x, y ∈ E.  (4.3)

3 If T_k is t_k-averaged for k ∈ {1, 2}, then T₂ ∘ T₁ is t₁t₂-averaged.
Proof. 1 Write T = tI + (1 − t)R with R := (1/(1 − t))(T − tI). Then, by the triangle inequality,

‖Rx − Ry‖ = (1/(1 − t)) ‖(T − tI)x − (T − tI)y‖ ≤ (1/(1 − t)) ‖Tx − Ty‖ + (t/(1 − t)) ‖x − y‖ ≤ ((L + t)/(1 − t)) ‖x − y‖,

and (L + t)/(1 − t) ≤ 1 precisely for t ≤ ½(1 − L), so R is non-expansive for these t.
2 Define R := (1/(1 − t)) T − (t/(1 − t)) I such that T = tI + (1 − t)R. Then, we have to show that R is non-expansive if and only if T fulfills (4.3).
Using the identity ‖αv + (1 − α)w‖² + α(1 − α)‖v − w‖² = α‖v‖² + (1 − α)‖w‖² in Hilbert spaces, we obtain with α = 1/(1 − t) (i.e., 1 − α = −t/(1 − t)), v = Tx − Ty and w = x − y that

‖Rx − Ry‖² = ‖(1/(1 − t))(Tx − Ty) − (t/(1 − t))(x − y)‖²
= (1/(1 − t)) ‖Tx − Ty‖² − (t/(1 − t)) ‖x − y‖² + (t/(1 − t)²) ‖Tx − Ty − (x − y)‖²
= (1/(1 − t)) ‖Tx − Ty‖² − (t/(1 − t)) ‖x − y‖² + (t/(1 − t)²) ‖(I − T)x − (I − T)y‖².
We conclude: R is non-expansive if and only if the left-hand side of the above equation is at most ‖x − y‖² for all x, y ∈ E, which after rearranging the right-hand side is the case if and only if T fulfills (4.3).
Inserting t “ 1
2 gives that T is firmly non-expansive if and only if it is 12 -averaged.
3 By assumption there exist non-expansive operators R_k : E → E and t_k ∈ (0, 1) such that T_k = t_k I + (1 − t_k)R_k for k ∈ {1, 2}. Then, for t := t₁t₂ we have for x ∈ E that

(T₂ ∘ T₁)(x) = t₂ T₁(x) + (1 − t₂)(R₂ ∘ T₁)(x)
= t₂ ( t₁x + (1 − t₁)R₁(x) ) + (1 − t₂)(R₂ ∘ T₁)(x)
= tx + (t₂ − t)R₁(x) + (1 − t₂)(R₂ ∘ T₁)(x)
= tx + (1 − t) ( ((t₂ − t)/(1 − t)) R₁(x) + ((1 − t₂)/(1 − t)) (R₂ ∘ T₁)(x) ) =: tx + (1 − t)R(x),

where R is non-expansive as a convex combination of the non-expansive operators R₁ and R₂ ∘ T₁ (note that (t₂ − t) + (1 − t₂) = 1 − t).
Next, we aim to examine the convergence behaviour of the fixed point iteration
xn`1 “ T pxn q
provided that T is an averaged operator. To this end, we need some more definitions.
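As a small numerical illustration of this iteration (the operator and all names below are our own choices, not part of the lecture):

```python
import numpy as np

def picard(T, x0, n_steps=100):
    # Picard iteration x_{n+1} = T(x_n).
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = T(x)
    return x

# R = rotation by 90 degrees: non-expansive with Fix(R) = {0}, but its Picard
# iterates cycle. The 1/2-averaged operator T = (I + R)/2 converges to 0.
R = lambda x: np.array([-x[1], x[0]])
T = lambda x: 0.5 * (x + R(x))
print(picard(T, [1.0, 0.0]))   # close to the fixed point (0, 0)
```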
Fix(T) := {x ∈ E : Tx = x}.

Moreover, we call T asymptotically regular if T^{k+1}x − T^k x → 0 as k → ∞ for all x ∈ E, where T^k denotes the k-fold composition T ∘ ··· ∘ T.
Example 4.1.5 In general, the fixed point iteration x_{n+1} = T(x_n) does not necessarily converge for non-expansive operators T even when Fix(T) ≠ ∅. Simple examples are T = −I or T : R² → R² defined by T(x₁, x₂) = (x₁, −x₂). ˛
In contrast to the operators from Example 4.1.5, averaged ones are asymptotically regular.
Proof. Suppose T is t-averaged with t ∈ (0, 1) and let y ∈ Fix(T). By Lemma 4.1.2 and Ty = y we have for all z ∈ E that

‖Tz − y‖² = ‖Tz − Ty‖² ≤ ‖z − y‖² − (t/(1 − t)) ‖(I − T)z − (I − T)y‖² = ‖z − y‖² − (t/(1 − t)) ‖Tz − z‖²,

which can be rearranged to

‖Tz − z‖² ≤ ((1 − t)/t) ( ‖z − y‖² − ‖Tz − y‖² ).  (4.4)
Now, let x P E be arbitrary. Since T is averaged, it is non-expansive, such that for any
y P FixpT q it holds
}T k`1 x ´ y} “ }T k`1 x ´ T y} ď }T k x ´ y}.
Therefore, the sequence p}T n x´y}qnPN is non-negative and decreasing such that it converges
to some d ě 0.
Plugging z = T^k x into (4.4) yields

‖T^{k+1}x − T^k x‖² ≤ ((1 − t)/t) ( ‖T^k x − y‖² − ‖T^{k+1}x − y‖² ) → 0  as k → ∞,

since the right-hand side is a difference of consecutive members of a convergent sequence. Hence T is asymptotically regular. □
For the next theorems, we need a technical lemma on weak convergence. Recall that x P E
is a weak sequential cluster point of a sequence pxk qk in E if there exists a subsequence
pxkj qj such that xkj á x as j Ñ 8.
Lecture 13,
17.05.2024
Proof. “ ùñ ”: Assume that pxk qk converges weakly to some x P E. Then, the sequence
x ΛE pxk q, p y is bounded for all p P E ˚ , i.e., there exists Mp ă 8 such that x ΛE pxk q, p y ă Mp
for all k P N. Thus, the uniform boundedness principle (also known as Banach–Steinhaus
theorem) yields that there exists some M ă 8 such that }xk } “ }ΛE pxk q}˚˚ ă M . Hence,
pxk qk is bounded. Since weak limits are unique, x is the only weak sequential cluster point,
and any subsequence of pxk qk converges weakly to x.
“ ðù ”: Let pxk qk be bounded and suppose that it has at most one weak sequential cluster-
point. Since E is a reflexive Banach space, there exists a subsequence converging to some
x P E. In particular, x is the unique weak sequential cluster point.
Now assume that pxk qk does not weakly converge to x. By definition, this implies that
there exists some p P E ˚ such that x p, xk y does not converge to x p, x y. Thus, there
exists a subsequence pxkj qj and some ε ą 0 such that | x p, xkj y ´ x p, x y | ą ε. Since
⟨p, x_k⟩ ≤ ‖p‖_* ‖x_k‖ is bounded, there exists a subsequence of (x_{k_j})_j (which we again denote by (x_{k_j})_j) and some a ∈ R such that ⟨p, x_{k_j}⟩ → a ≠ ⟨p, x⟩. Because (x_{k_j})_j is bounded and E is a
reflexive Banach space, it admits a weakly convergent subsequence pxkjl ql . Using that x is
the unique weak sequential cluster point of pxk qk , we obtain that xkjl á x. In particular,
we have that
x p, x y “ lim x p, xkjl y “ lim x p, xkj y “ a ‰ x p, x y .
lÑ8 jÑ8
This is a contradiction. l
The next theorem ensures that weak limits of the fixed point iteration xn`1 “ T pxn q for
non-expansive, asymptotically regular operators T are fixed points of T .
Proof. Using twice that }v ˘ w}2 “ }v}2 ˘ 2 x v, w y `}w}2 for all v, w P E yields for every
k P N that
}T x ´ x}2 “ }T x ´ xk }2 ´ }xk ´ x}2 ´ 2 x xk ´ x, x ´ T x y
“ }pT xk ´ xk q ` pT x ´ T xk q}2 ´ }xk ´ x}2 ´ 2 x xk ´ x, x ´ T x y
“ }T xk ´ xk }2 ` 2 x T xk ´ xk , T x ´ T xk y
` }T xk ´ T x}2 ´ }xk ´ x}2 ´ 2 x xk ´ x, x ´ T x y
Since T is non-expansive, this is smaller than or equal to

‖Tx_k − x_k‖² + 2⟨Tx_k − x_k, Tx − Tx_k⟩ − 2⟨x_k − x, x − Tx⟩
≤ ‖Tx_k − x_k‖² + 2 ‖Tx_k − x_k‖ ‖Tx − Tx_k‖ − 2⟨x_k − x, x − Tx⟩.
Now, the first two summands converge to zero by the assumption that T xk ´ xk Ñ 0 and
by the fact that xk converges weakly and is therefore bounded by Lemma 4.1.7. The last
summand converges to zero since pxk qk converges weakly to x. Thus, considering the limit
k Ñ 8, we obtain that }T x ´ x} “ 0. l
The following theorem was first proved by Zdzisław Opial in 1967 [Opi67, Thm. 1].
i.e., the sequence pxk qk is bounded and the sequence ak pyq :“ }xk ´ y}2 is decreasing.
Since any Hilbert space is reflexive, we obtain by Lemma 2.5.3 that pxk qk admits a sub-
sequence pxkj qj which converges weakly to some x̂ P E. Using that T is non-expansive and
asymptotically regular, it holds by Theorem 4.1.8 that x̂ P FixpT q. In the following, we show
that the whole sequence (x_k)_k converges weakly to x̂. Due to Lemma 4.1.7 it suffices to show
that x̂ is the only weak sequential cluster point of pxk qk . Assume that ŷ P E is another
weak sequential cluster point and let pxkl ql be a subsequence of pxk qk converging weakly to
ŷ. Analogously to x̂, we have that ŷ P FixpT q. Since ak px̂q and ak pŷq are decreasing and
positive, we obtain that they converge to some a(x̂) and a(ŷ). Then, it holds

2⟨x̂, x̂ − ŷ⟩ = lim_{j→∞} 2⟨x_{k_j}, x̂ − ŷ⟩ = lim_{j→∞} ( ‖x_{k_j} − ŷ‖² − ‖ŷ‖² + ‖x̂‖² − ‖x_{k_j} − x̂‖² )
= lim_{j→∞} ( a_{k_j}(ŷ) − ‖ŷ‖² + ‖x̂‖² − a_{k_j}(x̂) ) = a(ŷ) − ‖ŷ‖² + ‖x̂‖² − a(x̂).

The same computation along the subsequence (x_{k_l})_l yields 2⟨ŷ, x̂ − ŷ⟩ = a(ŷ) − ‖ŷ‖² + ‖x̂‖² − a(x̂). Subtracting both identities gives 2‖x̂ − ŷ‖² = 0, such that ŷ = x̂. Consequently, x̂ is the only weak sequential cluster point of the bounded
sequence pxk qk such that xk á x̂ as k Ñ 8 by Lemma 4.1.7. l
Combining Theorem 4.1.9 and Theorem 4.1.6 yields the following main result.
If dimpEq ă 8, weak and strong convergence coincide. Then, Theorem 4.1.10 yields strong
convergence of the sequence pT k x0 qkPN for any initial point x0 P E. Putting everything
together, we obtain the following diagram, which summarizes the results of this section.
[Diagram: L-contractive ⇒ a-averaged for a ≤ ½(1 − L); firmly non-expansive ⇔ ½-averaged (a = ½); averaged with Fix(T) ≠ ∅ ⇒ asymptotically regular; non-expansive and asymptotically regular with Fix(T) ≠ ∅ ⇒ (T^k x^(0))_k ⇀ x̂.]
Fig. 4.1: Different conditions ensuring the weak convergence of Picard iterates to a fixed
point x̂ P FixpT q belonging to an operator T : H Ñ H and the relationships between them.
Exercises
Exercise 4.1.13 (|·| not averaged) Show that the absolute value function |·| : R → R is not an averaged operator. ■
Exercise 4.1.14
Let E be a real normed space and let T : E Ñ E.
• Assume that pT k xqkPN converges for all x P E. Prove that T is asymptotically regular.
• Find an example of an asymptotically regular operator T and some x ∈ E such that the sequence (T^k x)_{k∈N} is unbounded (and consequently diverges). ■
Gradient descent
First, we recall the basic gradient descent algorithms from the lecture on numerical methods.
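As a reminder, a minimal sketch of such a gradient descent scheme with constant step size λ (our own illustration; grad_f is a hypothetical callable) reads:

```python
import numpy as np

def gradient_descent(grad_f, x0, lam, n_steps=1000):
    # x_{k+1} = x_k - lam * grad f(x_k); for convex f with L-Lipschitz
    # gradient this converges for step sizes 0 < lam < 2/L.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lam * grad_f(x)
    return x
```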
For proving convergence of Algorithm 1, we need the following theorem. It is known under
the name Baillon-Haddad theorem and can be shown in even more generality by using
duality theory. In particular, existence of the second derivative can be dropped.
Lecture 14,
22.05.2024
Theorem 4.2.1: Baillon-Haddad
Let E be a Hilbert space and suppose that f P Γ0 pEq is twice Gateaux differen-
tiable such that the Gateaux gradient is L-Lipschitz continuous and the Gateaux
Hessian is continuous. Then, the operator L1 ∇f : E Ñ E is firmly non-expansive.
The proof follows quite simple ideas. However, taking care of the infinite-dimensional setting
requires some technical effort. Therefore, we first state a finite-dimensional sketch of the
proof and present afterwards a general proof which includes all details.
Hence, using DI[x₀](x) = x, we obtain that DR[x₀] = D(2∇f − I)[x₀] fulfills for all x ∈ E that

−‖x‖² ≤ ⟨2∇²f[x₀](x) − x, x⟩ = ⟨DR[x₀](x), x⟩ ≤ ‖x‖².
Since ∇2 f rx0 s is self-adjoint, we have that also DR “ 2∇2 f rx0 s ´ I is self-adjoint (as sum of
two self-adjoint operators). In particular, it was proven in the lecture “Functional Analysis”
(SoSe 2022, Lem. 7.41) that
To show for fixed x₀, y₀ ∈ E that ‖R(x₀) − R(y₀)‖ ≤ ‖x₀ − y₀‖, we consider the function

φ(t) := ⟨R(x₀) − R((1 − t)x₀ + t y₀), R(x₀) − R(y₀)⟩.
By definition, we obtain that φp0q “ x Rpx0 q ´ Rpx0 q, Rpx0 q ´ Rpy0 q y “ 0 and φp1q “
x Rpx0 q ´ Rpy0 q, Rpx0 q ´ Rpy0 q y “ }Rpx0 q ´ Rpy0 q}2 . Moreover, φ is differentiable with
φ′(t) = lim_{h→0} ⟨ ( R((1 − t)x₀ + t y₀ + h(y₀ − x₀)) − R((1 − t)x₀ + t y₀) )/h , R(x₀) − R(y₀) ⟩
= ⟨ DR[(1 − t)x₀ + t y₀](y₀ − x₀), R(x₀) − R(y₀) ⟩.
In particular, we have by the mean value theorem and Cauchy-Schwarz that there exists
some 0 ă t ă 1 such that
}Rpx0 q ´ Rpy0 q}2 “ φp1q “ φp1q ´ φp0q “ φ1 ptq
“ x DRrp1 ´ tqx0 ` ty0 spy0 ´ x0 q, Rpx0 q ´ Rpy0 q y
ď } DRrp1 ´ tqx0 ` ty0 spy0 ´ x0 q}}Rpx0 q ´ Rpy0 q}
ď } DRrp1 ´ tqx0 ` ty0 s}}y0 ´ x0 }}Rpx0 q ´ Rpy0 q}
ď }y0 ´ x0 }}Rpx0 q ´ Rpy0 q}.
Now, dividing by }Rpx0 q ´ Rpy0 q} yields the claim. l
Proof. Let T : E → E be defined by T(x) = x − λ∇f(x). By Theorem 4.2.1, there exists some non-expansive operator R : E → E such that (1/L)∇f = ½I + ½R, i.e., ∇f(x) = (L/2)x + (L/2)R(x). In particular, T(x) = (1 − λL/2)x − (λL/2)R(x) is an averaged operator for λ ∈ (0, 2/L), since −R is non-expansive as well. Thus, (x_k)_k = (T^k x₀)_k converges weakly to some x̂ ∈ Fix(T). This can be reformulated as T(x̂) = x̂, i.e., ∇f(x̂) = 0. By Remark 2.5.6, this implies that x̂ ∈ arg min(f). □
ẋ(t) = −∇f(x(t)).
Convergence of this algorithm is guaranteed by the following lemma. For the proof, we refer
to [Bačá14, Thm. 3.4, Rem. 3.8].
Suitable step sizes λ_k would be, e.g., λ_k := λ₀/k for some λ₀ > 0. An important example of the cyclic proximal point algorithm is the computation of a point within the intersection of convex sets, as sketched below.
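For g_i = ι_{C_i} the proximal mappings are metric projections, so the scheme becomes the classical alternating projection method; a minimal sketch for two sets in R² (the disc and the half-space are our own choices):

```python
import numpy as np

# A = closed unit disc, B = half-space {x : x[0] >= 1}; A and B intersect
# exactly in the point (1, 0).
proj_A = lambda x: x / max(1.0, np.linalg.norm(x))
proj_B = lambda x: np.array([max(x[0], 1.0), x[1]])

x = np.array([3.0, 2.0])
for _ in range(100):
    x = proj_B(proj_A(x))   # prox of the indicator = projection
print(x)                    # tends to (1, 0)
```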
Fig. 4.2: The cyclic proximal point algorithm finds a point in the intersection of two closed
convex sets A and B in R2 by minimizing ιA ` ιB .
To make use of this observation, we split f into g ` h and apply a forward step (i.e., a
gradient descent step) on g and a backward step (i.e., a proximal mapping) on h. The
arising algorithm is called “forward-backward splitting” or “proximal gradient algorithm”.
for k “ 0, 1, . . . do
xk`1 “ proxλh pxk ´ λ∇gpxk qq
end
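A minimal NumPy sketch of this iteration (our own illustration; we pass the step size to the prox as a second argument):

```python
import numpy as np

def forward_backward(grad_g, prox_h, x0, lam, n_steps=500):
    # x_{k+1} = prox_{lam*h}(x_k - lam * grad g(x_k)), with a step size
    # lam of order 1/L for the Lipschitz constant L of grad g.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = prox_h(x - lam * grad_g(x), lam)
    return x
```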
Using again Fermat’s rule, this is equivalent to x P arg minpg ` hq “ arg minpf q. Summa-
rizing, this yields that FixpT q “ arg minpf q ‰ H.
Moreover, T₁ is averaged by Theorem 4.2.1 and T₂ is averaged since it is a proximal mapping. Hence T is a composition of averaged operators and therefore averaged by Lemma 4.1.2. Consequently, the sequence (x_k)_k converges weakly to some x̂ ∈ Fix(T) = arg min(f). □
Remark 4.2.10 Under the conditions we posed on g and h, the following (sublinear) convergence rate of Algorithm 4 can be achieved: for a minimizer x̂ ∈ arg min_{x∈H} f(x) of f we have

f(x_k) − f(x̂) = O(1/k),
becomes

x_{k+1} = S_{ηλ}( x_k − λ Kᵀ(K x_k − y) ).
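Reusing the soft_shrinkage and forward_backward sketches from above, this ISTA-type iteration can be instantiated on random toy data (K, y, η and all values are dummy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((20, 50))
y = rng.standard_normal(20)
eta = 0.1                             # weight of the 1-norm term
L = np.linalg.norm(K, 2) ** 2         # Lipschitz constant of grad g

grad_g = lambda x: K.T @ (K @ x - y)  # g(x) = 0.5 * ||Kx - y||^2
prox_h = lambda x, lam: soft_shrinkage(x, lam * eta)
x_hat = forward_backward(grad_g, prox_h, np.zeros(50), lam=1.0 / L)
```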
The FBS algorithm can be extended to non-convex functions, see e.g. [AB09, ABS13, BST13, CPR13, OCBP14]. The convergence analysis mainly relies on the assumption that the objective function f = g + h satisfies the Kurdyka-Łojasiewicz inequality, which is indeed fulfilled for a wide class of functions such as log-exp, semi-algebraic and subanalytic functions which are of interest in image processing.
Accelerated Algorithms
For large scale problems such as those arising in image processing a major concern is to find efficient algorithms with a reasonable runtime (Lecture 15, 24.05.2024). While each step in Algorithm 4 has low computational complexity, it may suffer from slow convergence. Using a simple extrapolation idea with appropriate relaxation parameters τ_k ∈ (0, 1), the convergence can often be accelerated by the extrapolation step

y_k = x_k + τ_k (x_k − x_{k−1}),
For k ∈ N we choose

τ_k := (k − 1)/(k + 2) = θ_k(1 − θ_{k−1})/θ_{k−1},  where θ_k := 2/(k + 2).
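A minimal sketch of the resulting accelerated (FISTA-type) iteration with τ_k = (k − 1)/(k + 2) (our own illustration, following the conventions of the sketches above):

```python
import numpy as np

def fista(grad_g, prox_h, x0, lam, n_steps=500):
    # Forward-backward step applied at the extrapolated point
    # y_k = x_k + tau_k (x_k - x_{k-1}) with tau_k = (k - 1)/(k + 2).
    x_prev = x = np.asarray(x0, dtype=float)
    for k in range(1, n_steps + 1):
        tau = (k - 1.0) / (k + 2.0)
        yk = x + tau * (x - x_prev)
        x_prev, x = x, prox_h(yk - lam * grad_g(yk), lam)
    return x
```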
Proof. First we consider the progress in one step of the algorithm. By the Lipschitz conti-
nuity of ∇g and since λ ă L1 we know by the descent lemma (Exercise 2.3.15) that for k P N
it holds
g(x_{k+1}) ≤ g(y_k) + ⟨∇g(y_k), x_{k+1} − y_k⟩ + (1/(2λ))‖x_{k+1} − y_k‖².  (4.7)
Moreover, by the variational inequality of the proximal operator (Theorem 2.7.4, 5) we have for all u ∈ E that

h(x_{k+1}) ≤ h(u) + (1/λ)⟨y_k − λ∇g(y_k) − x_{k+1}, x_{k+1} − u⟩
= h(u) − ⟨∇g(y_k), x_{k+1} − u⟩ + (1/λ)⟨y_k − x_{k+1}, x_{k+1} − u⟩.  (4.8)
Adding the inequalities (4.7) and (4.8) yields
(λ/θ_k²) ( f(x_{k+1}) − f(x̂) ) + ½‖z_{k+1} − x̂‖² ≤ (λ(1 − θ_k)/θ_k²) ( f(x_k) − f(x̂) ) + ½‖z_k − x̂‖².

Using the relation recursively on the right-hand side and regarding that (1 − θ_k)/θ_k² ≤ 1/θ_{k−1}² we obtain

(λ/θ_k²) ( f(x_{k+1}) − f(x̂) ) ≤ (λ(1 − θ₀)/θ₀²) ( f(x₀) − f(x̂) ) + ½‖z₀ − x̂‖² = ½‖x₀ − x̂‖².
Douglas-Rachford Splitting
Textbooks: [FP03, Subsubsec. 12.4.1].
Proof. By Fermat's rule and Theorem 3.2.2 we have that x̂ minimizes g + h if and only if

∃ t̂ ∈ E such that x̂ ∈ arg min_{y∈E} { ½‖y − t̂‖² + λg(y) } and x̂ = prox_{λh}(2x̂ − t̂).

By inserting the first condition into the second one, this is true if and only if

∃ t̂ ∈ E such that x̂ = prox_{λg}(t̂) and prox_{λg}(t̂) = prox_{λh}(2 prox_{λg}(t̂) − t̂),

i.e., if and only if

∃ t̂ ∈ E such that x̂ = prox_{λg}(t̂) and t̂ = t̂ + prox_{λh}(2 prox_{λg}(t̂) − t̂) − prox_{λg}(t̂) = T(t̂).
The lemma indicates in particular that it suffices to find a fixed point of T in order to find minimizers of g + h. This leads to the following algorithm, which implements the fixed point iteration t_{k+1} = T(t_k). It is known under the name Douglas-Rachford splitting algorithm (abbreviated as DRS or DRA).
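A minimal sketch of this fixed point iteration (our own illustration; prox_g and prox_h again take λ as a second argument):

```python
import numpy as np

def douglas_rachford(prox_g, prox_h, t0, lam, n_steps=500):
    # Fixed point iteration t_{k+1} = T(t_k) with
    # T = I + prox_{lam*h} o (2 prox_{lam*g} - I) - prox_{lam*g};
    # the candidate minimizer is x_k = prox_{lam*g}(t_k).
    t = np.asarray(t0, dtype=float)
    for _ in range(n_steps):
        x = prox_g(t, lam)
        t = t + prox_h(2 * x - t, lam) - x
    return prox_g(t, lam)
```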
Next, we prove that T from (4.9) is firmly nonexpansive such that the Douglas-Rachford
splitting algorithm converges.
Let E be a Hilbert space and let g, h P Γ0 pEq such that Bpg ` hq “ Bg ` Bh and
arg minpg ` hq ‰ H. Then, the iterates pxk qkPN produced by Algorithm 6 converge
weakly to some x̂ P arg minpg ` hq.
Proof. Since arg minpg ` hq ‰ H, we have by Lemma 4.2.14 that FixpT q ‰ H for T as
defined in (4.9). Moreover, since proxλh is firmly nonexpansive, we know that there exists
some nonexpansive R : E Ñ E such that proxλh “ 12 I ` 12 R. In particular, it holds that
2 proxλh ´I “ R is nonexpansive. Analogously, we obtain that 2 proxλg ´I is nonexpansive
such that p2 proxλh ´Iq ˝ p2 proxλg ´Iq is nonexpansive. This implies by the definition in
(4.9) that T is firmly non-expansive. Consequently, we obtain by Theorem 4.1.10 that there
exists t̂ P FixpT q such that tk “ T k t0 á t̂. In particular, we have that xk “ proxλg ptk q á
proxλg pt̂q P arg minpg ` hq by Lemma 4.2.14. l
Fig. 4.5: The Douglas-Rachford splitting Algorithm 6 finds a point in the intersection of
closed convex non-disjoint sets A, B Ă R2 by minimizing g ` h, where g :“ ιA and h :“ ιB .
In this case, it converges in two steps.
The DRS can also be applied for minimizing the sum f = Σ_{i=1}^{n} h_i for n > 2 with h_i ∈ Γ₀(E). (Lecture 16, 29.05.2024) To this end, consider the Hilbert space Eⁿ with inner product ⟨(x₁, ..., xₙ), (y₁, ..., yₙ)⟩ = Σ_{i=1}^{n} ⟨x_i, y_i⟩ and let

D := {x = (x₁, . . . , xₙ) : x₁ = . . . = xₙ ∈ E} ⊂ Eⁿ
Then, any minimizer of F is of the form x = (x̂, ..., x̂), where x̂ ∈ arg min(f). Now, we can apply the DRS for minimizing F = G + H. For the implementation, we note for x = (x₁, ..., xₙ) ∈ Eⁿ that

prox_{λH}(x) = arg min_{y=(y₁,...,yₙ)∈Eⁿ} { (1/(2λ))‖y − x‖² + H(y) }
= arg min_{y} Σ_{i=1}^{n} { (1/(2λ))‖y_i − x_i‖² + h_i(y_i) }
= ( arg min_{y_i∈E} { (1/(2λ))‖y_i − x_i‖² + h_i(y_i) } )_{i=1}^{n} = ( prox_{λh_i}(x_i) )_{i=1}^{n}.
with

x̂ ∈ arg min_{z∈E} ½ Σ_{i=1}^{n} ‖z − t_i‖² = (1/n) Σ_{i=1}^{n} t_i,
where the last equality follows by setting the derivative to zero. By putting these computa-
tions together, we obtain the following algorithm, which is called the Parallel Douglas-
Rachford splitting. The name is inspired by the fact that the inner loop can be com-
puted in parallel.
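Following the computations above, a minimal sketch of the parallel Douglas-Rachford splitting (our own illustration) could read:

```python
import numpy as np

def parallel_douglas_rachford(prox_list, t0, lam, n_steps=500):
    # DRS on E^n for G = indicator of the diagonal D and H(x) = sum_i h_i(x_i):
    # prox_{lam*H} acts componentwise, prox_{lam*G} averages over the copies.
    n = len(prox_list)
    t = [np.asarray(t0, dtype=float).copy() for _ in range(n)]
    for _ in range(n_steps):
        x = np.mean(t, axis=0)   # prox_{lam*G}: projection onto D
        # the proxes of the h_i can be computed in parallel:
        s = [prox(2 * x - t[i], lam) for i, prox in enumerate(prox_list)]
        t = [t[i] + s[i] - x for i in range(n)]
    return np.mean(t, axis=0)
```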
Exercises.
Exercise 4.2.17 (Variable proximal parameters) Let us consider the case n “ 2 and
suppose that in Algorithm 3 we choose λk ” λ ą 0 to be constant. With the same arguments
as in Algorithm 3, the iterates converge weakly to a fixed point of proxλf2 ˝ proxλf1 . Show
that this fixed point is not a minimizer of f but of the smoother function f2 ` λ f1 . ■
• Denoising: Reconstruct a clean image from a noisy version, see Fig. 4.6b.
• Deblurring: Sharpening of a blurred image, see Fig. 4.6c.
• Inpainting: Reconstruct missing pixels from an image, see Fig. 4.6d.
• Superresolution: Increase the resolution of a given image, see Fig. 4.6e.
All of these tasks can be formulated as so-called inverse problems. More precisely, we want to reconstruct (approximately) an unknown ground truth x̂ ∈ R^{n×m} from an observation y ∈ R^d generated by

y = F(x̂) + ξ,  (4.10)

where F : R^{n×m} → R^d is a function and ξ ∈ R^d is some unknown noise which arises from inexact measurements or other perturbations in the image acquisition. For the specific examples from above, the operator F is given as follows.
• Denoising: For denoising, we do not modify the image itself, i.e., we choose F pxq “ x.
On the other hand, we add a large pertubation ξ.
• Deblurring: The forward operator is given by a 2D convolution with kernel k = (k_{rs})_{r,s=−P}^{P}. That is, for x = (x_{ij})_{ij} we generate an output image y = F(x) = (y_{ij})_{ij}, where y_{ij} = Σ_{r,s=−P}^{P} k_{rs} x_{i−r,j−s}. Usually, we have that Σ_{r,s=−P}^{P} k_{rs} = 1 such that each pixel y_{ij} is a weighted average of nearby pixels of x.
• Inpainting: Here, the forward operator F : Rnˆm Ñ Rnˆm is given by the matrix
vector multiplication F pxq “ diagpe1 , ..., enm qx, where ei “ 1 if the ith pixel of x is
known and ei “ 0 otherwise.
• Superresolution: For superresolution the forward operator F is a concatenation of a blur operator and a downsampling operator. The blur operator is given similarly as in the deblurring example. An example of a downsampling operator is the operator which takes every second pixel in the image.
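To make the forward operators concrete, here is a minimal NumPy/SciPy sketch of the blur and inpainting operators described above (the box kernel, the boundary mode "nearest" and all function names are our own choices):

```python
import numpy as np
from scipy.ndimage import convolve

def blur(x, kernel):
    # 2D convolution y_ij = sum_{r,s} k_rs * x_{i-r,j-s} with a kernel
    # whose entries sum to one, so each pixel is an average of its neighbors.
    return convolve(x, kernel, mode="nearest")

def inpaint_forward(x, mask):
    # F(x) = diag(e) x: keeps known pixels (mask == 1), zeros the unknown ones.
    return mask * x

P = 2
kernel = np.full((2 * P + 1, 2 * P + 1), 1.0 / (2 * P + 1) ** 2)  # box blur
```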
Moreover, several image acquisition problems in medical applications can be formulated as inverse problems. This includes computed tomography (CT), magnetic resonance imaging (MRI) and electrical impedance tomography (EIT). Unfortunately, for the considered problems, the operator F is not invertible or highly ill-posed, i.e., small perturbations of y can lead to huge differences in the reconstruction F⁻¹(y).
A classical approach to reconstruct the image x from the observation y is the minimization of a variational functional given by

J(x) = D(F(x), y) + λ R(x).  (4.11)

Here D(F(x), y) is called the data-fidelity term and measures the distance between F(x) and y. Using that y = F(x̂) + ξ, we obtain that for "good" reconstructions x which are close to x̂ the distance between F(x) and y is small. A common choice of D is D(x, y) = ½‖x − y‖². The term R(x) is a regularizer. It incorporates some prior knowledge about the image. More precisely, R should be a function which is small if the image x looks "natural" and large otherwise.
Example 4.3.1 Using only the data-fidelity term is not sufficient. To see this, we consider the deblurring problem. Here, we minimize the data-fidelity term

J(x) = D(F(x), y) = ½‖F(x) − y‖²

via a gradient descent algorithm. The results after certain numbers of steps are given in Fig. 4.7. We see that there appear some unnatural artifacts. ˛
The following example summarizes different choices for the regularizer R in (4.11) and shows how they can be used for reconstructing the unknown ground truth in certain inverse problems.
Fig. 4.7: Minimizing only the data-fidelity term yields unnatural artifacts (left: after 200 gradient descent steps, right: after 2000 steps).
Example 4.3.2
• Tikhonov Regularization for Deblurring. The regularizer R(x) = ‖x‖² is called Tikhonov regularization. Together with the data-fidelity term D(F(x), y) = ½‖F(x) − y‖², the objective function J from the variational problem (4.11) reads as

J(x) = ½‖F(x) − y‖² + λ‖x‖².

In this case J is differentiable and can be minimized by applying Algorithm 1 (gradient descent). An application of this algorithm to deblurring is given in Fig. 4.8. We can see that it reduces the artifacts compared to the reconstruction without regularization.
Fig. 4.8: Deblurring by solving the variational problem (4.11) for different regularizers.
The regularizer can be rewritten as Rpxq “ }Dx}22 for a certain matrix D. To en-
sure that the known pixels have the correct value, we choose the data fidelity term
DpF pxq, yq “ ιtzPRnm :F pzq“yu pxq. This leads to the variational problem
J pxq “ ιtzPRnm :F pzq“yu pxq ` λRpxq.
In order to minimize J , we can use the forward-backward splitting, where g “ R is
differentiable. The proximity operator of h “ ιtzPRnm :F pzq“yu can be computed by
setting all known pixels to the correct value. The result is given in Fig. 4.9.
Fig. 4.9: Inpainting by solving the variational problem (4.11) for different regularizers.
• Total Variation. Finally, we consider the regularizer R which penalizes the 1-norm
of the differences of neighboring pixels. More precisely, we define R for an image
x = (x_{i,j})_{i,j} ∈ R^{n×m} by

R(x) = TV(x) = Σ_{i=2}^{n} Σ_{j=1}^{m} |x_{i−1,j} − x_{i,j}| + Σ_{i=1}^{n} Σ_{j=2}^{m} |x_{i,j−1} − x_{i,j}|.
It is called the total variation (TV) regularizer and can be rewritten as TV(x) = ‖Dx‖₁ for a certain difference matrix D. Together with the data-fidelity term D(F(x), y) = ½‖F(x) − y‖² we end up with the objective function

J(x) = ½‖F(x) − y‖² + λ TV(x).  (4.12)
Unfortunately, TV is not differentiable and no closed form of its proximity operator is known. Therefore, none of the algorithms from the previous section is directly applicable. As a remedy, we will study so-called dual functions within the next sections in order to derive an efficient algorithm for minimizing functions of the form f(x) = g(x) + h(Ax). By setting g(x) = ½‖F(x) − y‖², A = D and h(x) = λ‖x‖₁ this will enable us to minimize the functional from (4.12).
In Fig. 4.8 we show some reconstructions using the functional (4.12) for deblurring. In order to apply the TV regularizer together with the data-fidelity term D(F(x), y) = ι_{{z∈R^{nm} : F(z)=y}}(x) to an inpainting problem, we use the Douglas-Rachford splitting with g(x) = ι_{{z∈R^{nm} : F(z)=y}}(x) and h(x) = TV(x). For computing prox_{λTV},
we again refer to the duality theory at the end of this lecture. The result is given in
Fig. 4.9. ˛
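For completeness, the anisotropic TV value from the formula above can be computed directly; a minimal NumPy sketch (our own illustration):

```python
import numpy as np

def tv(x):
    # TV(x) = sum of absolute vertical and horizontal neighbor differences.
    return np.abs(np.diff(x, axis=0)).sum() + np.abs(np.diff(x, axis=1)).sum()
```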
Fig. 4.10: Deblurring with TV regularizer and different regularization parameters λ ∈ {0.0002, 0.002, 0.02, 0.2}. For larger values of λ, the reconstruction becomes more "cartoon-like".
In the case that all involved probability distributions have densities, we can search for the "most likely" reconstruction x by maximizing the density p_{X|Y=y} of P_{X|Y=y}. Assuming that Ξ ∼ N(0, σ²), this can be reformulated by Bayes' theorem as

arg max_x { log(p_{X|Y=y}(x)) } = arg max_x { log( p_{Y|X=x}(y) p_X(x) / p_Y(y) ) }
= arg max_x { log( exp( −‖F(x) − y‖²/(2σ²) ) ) + log(p_X(x)) }
= arg max_x { −‖F(x) − y‖²/(2σ²) + log(p_X(x)) }.
Exercises
5 Monotone Set-Valued Mappings
which is the subdifferential of f pxq “ |x|. The functions F, I ` λF , λ ą 0 and their inverses
F ´1 , pI `λF q´1 are depicted in Fig. 5.1. We see that pI `λF q´1 is a single-valued function.˛
[Fig. 5.1: The graphs of F and I + λF (left) and of their inverses F⁻¹ and (I + λF)⁻¹ (right).]
Fig. 5.2: The graphs of the set-valued maps F , I ` F and pI ` F q´1 , where F is induced by
the single-valued map f from (5.1).
The maps F and I ` F are monotone, but F is not maximally monotone: take px1 , y1 q :“
p0, 0q P R ˆ R. Then px1 , y1 q R grpF q because 0 R F p0q “ t´1u. However, for any px2 , y2 q P
grpF q it holds
x x1 ´ x2 , y1 ´ y2 y “ x x2 , y2 y ě 0.
Exercises
Exercise 5.1.5 Let E be a Hilbert space. What do linear monotone maps T : E → E look like? ■
Proof. 1 Let f be proper and convex. For p₁ ∈ ∂f(x₁) and p₂ ∈ ∂f(x₂) we have

⟨y − x_k, p_k⟩ ≤ f(y) − f(x_k)  ∀y ∈ E, k = 1, 2.

Choosing y = x₂ for k = 1 and y = x₁ for k = 2 and adding both inequalities yields

⟨x₁ − x₂, p₁ − p₂⟩ ≥ 0.
2 This goes back to Minty, a not too difficult proof can be found in [Roc66, Thm. 4] or
[Phe09, Thm. 3.25]. l
It can be shown that also the converse statement of the theorem is true: if f : E Ñ R is
a proper and lower semicontinuous and Bf is monotone, then f is convex. For a proof, we
refer to [Z0̆2, Thm. 3.2.7].
The following figure shows that lower semi-continuity of f is really necessary for Bf to be
maximal monotone.
Fig. 5.3: Left to right: f1 P Γ0 pRq, Bf1 is maximal monotone, the proper, convex function
f2 which is not lower semicontinuous, Bf2 is not maximal monotone.
is unique. Since ½‖· − x‖² is continuous, by Theorems 3.1.9 and 3.2.2, x_λ is a minimizer if and only if

0 ∈ λ∂f(x_λ) + x_λ − x.

This can be rewritten as

x ∈ (I + λ∂f)(x_λ),  i.e.,  x_λ ∈ (I + λ∂f)⁻¹(x).
In particular, we see that the so-called resolvent of λ∂f given by J_{λ∂f} := (I + λ∂f)⁻¹ coincides with the proximal mapping prox_{λf}. Hence, it is single-valued and firmly nonexpansive by Theorem 2.7.4. ˛
⟨x₁ − x₂, (y − x₁) − (y − x₂)⟩ ≥ 0,
i.e.,  ⟨x₁ − x₂, −x₁ + x₂⟩ = −‖x₁ − x₂‖² ≥ 0.
2 This is known as Minty's theorem; for a proof we refer to [Min62, Corollary on p. 344], [Roc70a, Cor. on p. 78] or [Z0̆2, Thm. 3.11.7]. □
Finally we characterize λ-convex functions via the subdifferential. The following lemma
generalizes Remark 2.3.14.
x p1 , x0 ´ x1 y ´ x p0 , x0 ´ x1 y `λ}x0 ´ x1 }2 ď 0,
Exercises
Exercise 5.2.6 Explicitly write down the functions f1 and f2 from Fig. 5.3, calculate their
subdifferentials and verify that Bf1 is maximal monotone, but Bf2 is not. ■
6 Conjugate Functions
Like in Functional Analysis, in Convex Analysis many mathematical objects can be paired
with their so-called duals. In this way, close connections between apparently dissimilar
properties of objects become transparent and the analysis of a given object can be treated
in an equivalent yet very different context.
Textbooks: [ET99, Sec. I.4], [Sim06, pp. 25–26], [BC11, Chp. 13], [BS13, Subsec. 2.4.2],
[Z0̆2, Sec. 2.3], [BP12, Subsec. 3.2.1] and [MN22, Sec. 4.1].
f* : E* → R̄,  p ↦ sup_{x∈E} { ⟨p, x⟩_E − f(x) }.

Moreover, we call the conjugate of the conjugate of f the biconjugate f** = (f*)*, defined by

f** : E** → R̄,  x ↦ sup_{p∈E*} { ⟨x, p⟩_{E*} − f*(p) }.
Note that f ˚ p0q “ ´ inf xPE f pxq. By definition, we have that for a given p P E ˚ the value
f ˚ ppq is the largest element in R such that it holds
f pxq ě x p, x y ´f ˚ ppq.
In other words, f ˚ ppq is the largest value such that the hyperplane Hpp,1q,´f ˚ ppq is below the
epigraph of f . As an example, we consider the case of f pxq “ x2 and choose p “ 2. Then,
the geometric interpretation from above is visualized on the left of Fig. 6.1.
Similarly, the function value f*(p) can be interpreted as the largest difference between the graphs of f and the function x ↦ ⟨p, x⟩. An illustration is given on the right of Fig. 6.1.
Remark 6.1.2 ((Bi)conjugates are lower semicontinuous, convex)
By definition, we have that

epi(f*) = {(p, a) ∈ E* × R : f*(p) ≤ a}
= {(p, a) ∈ E* × R : ⟨p, x⟩ − f(x) ≤ a ∀x ∈ E}
= ∩_{x∈E} {(p, a) ∈ E* × R : ⟨(p, a), (x, −1)⟩ ≤ f(x)}
= ∩_{x∈E} {(p, a) ∈ E* × R : ⟨Λ_{E×R}(x, −1), (p, a)⟩ ≤ f(x)}
= ∩_{x∈E} H^−_{Λ_{E×R}(x,−1), f(x)}.
In particular, epipf ˚ q is closed and convex such that f ˚ is convex and lower semicontinuous.
Consequently, also f ˚˚ “ pf ˚ q˚ and f ˚˚ ˝ ΛE are convex and lower semicontinuous. ˝
Fig. 6.1: Left: Geometric interpretation of f*(2) for f(x) := x². Right: f*(x) as supremum of the differences between the graph of f and ⟨·, x⟩.
using the convention 0 log(0) := lim_{t↘0} t log(t) = 0. The functions f and f* are visualized in Fig. 6.2. Similarly, one can verify f** = f. ˛
Fig. 6.2: Left: The family (x ↦ αe^x)_{α∈(0,1]}. Right: The family of conjugates.
The following lemmas collect some basic properties of conjugate functions (Lecture 18, 05.06.2024). In order to compare f and f**, we use the canonical embedding Λ_E : E → E** defined by ⟨Λ_E(x), p⟩ := ⟨p, x⟩ for all p ∈ E*. In the case that E is reflexive, we can identify E** with E such that Λ_E becomes the identity.
Lemma 6.1.4
Let f, g : E Ñ R.
1 If f ě g, then f ˚ ď g ˚ .
2 For all x P E and p P E ˚ such that either f pxq or f ˚ ppq are finite, we have the
Fenchel(-Young) inequality
f ˚ ppq ` f pxq ě x p, x y .
3 We have f ˚˚ ˝ ΛE ď f .
2 By definition it holds
f ˚ ppq ě x p, x y ´f pxq.
Adding f pxq on both sides yields the claim.
3 Let x ∈ E. Then it holds f**(Λ_E(x)) = sup_{p∈E*} { ⟨p, x⟩ − f*(p) } ≤ f(x) by the Fenchel inequality from 2. □
Lemma 6.1.5
Let f : E Ñ R.
1 If f is improper, then f ˚ ” ´8 or f ˚ ” 8.
2 If f is proper, then f ˚ cannot attain ´8.
3 If f is proper and convex , then f ˚ P Γ0 pE ˚ q. Moreover, if f is proper, then clpf q˚ “
f ˚.
Proof. 1 If there exists x₀ with f(x₀) = −∞, then it holds f*(p) ≥ ⟨p, x₀⟩ − f(x₀) = ∞ such that f* ≡ ∞. If f ≡ ∞, then it holds f*(p) = sup_{x∈E} { ⟨p, x⟩ − f(x) } = −∞ such that f* ≡ −∞.
2 Assume that f ˚ ppq “ ´8 for some p P E ˚ . Then, it holds for any x P E that
and so f is proper.
f*(p) ≥ cl(f)*(p). □
The converse of 1 is not true, see Example 6.1.3. The lemma implies in particular that the assumptions from Lemma 6.1.4 2 are fulfilled automatically if f is proper. Moreover, the statement of Lemma 6.1.4 3 is extended by the following theorem.
which is a contradiction.
• If a “ 0, then (6.2) implies that
x p0 , x y ď x p0 , x0 y ´ε @x P dompf q.
Adding the last two inequalities up yields for q P dompf ˚ q (which exists since f ˚ is
proper) and b ą 0 that
For x R dompf q, we have that f pxq “ 8 such that the above inequality is fulfilled
automatically. Thus, taking the supremum over all x P E yields that
⟨−(1/a) p₀, x₀⟩ − f** ∘ Λ_E(x₀) ≥ f*(−(1/a) p₀) − ε/a > f*(−(1/a) p₀).
Proof. Since f is proper and convex, we have that cl(f) ∈ Γ₀(E). Thus, applying Theorem 6.1.6 to cl(f) yields

f** ∘ Λ_E = (f*)* ∘ Λ_E = (cl(f)*)* ∘ Λ_E = cl(f)** ∘ Λ_E = cl(f),

where the second equality uses Lemma 6.1.5 and the last one Theorem 6.1.6. □
If E is a Hilbert space, we have E “ E ˚ such that we can compare the functions f and
f ˚ . The following lemma shows that there exists exactly one function fulfilling f “ f ˚ .
that is, f pxq ě upxq for x P dompf q. Since u only takes finite values, it holds that f pxq ě upxq
for all x P E. By Lemma 6.1.4 1 this implies for all p P E that
Proof. 1 Since f and g are proper, both f ˚ and g ˚ cannot attain the value ´8 anywhere
by Lemma 6.1.5 2 .
Using that suprema commute, we have for p ∈ E* that

(f □ g)*(p) = sup_{x∈E} { ⟨p, x⟩ − inf_{y∈E} ( f(y) + g(x − y) ) } = sup_{x,y∈E} { ⟨p, x⟩ − f(y) − g(x − y) }
= sup_{y∈E} { ⟨p, y⟩ − f(y) } + sup_{x∈E} { ⟨p, x − y⟩ − g(x − y) }
= sup_{y∈E} { ⟨p, y⟩ − f(y) } + sup_{z∈E} { ⟨p, z⟩ − g(z) } = f*(p) + g*(p).
If f, g : Rd Ñ R are convex and proper and dompf q X dompgq contains a point at which f
or g is continuous, then we have equality in the second part of the theorem, i.e., it holds
pf ` gq˚ “ f ˚ □g ˚ . For a proof we refer to [Roc70b, Thm. 16.4].
Next we are interested in relations between subgradients of functions and their conjugates.
Proof. 1 We have
2 Let additionally f px0 q “ f ˚˚ ˝ ΛE px0 q. Using the first part of the theorem, we observe
that both sides cannot hold true if f ˚ pp0 q R R. Therefore, we can assume that f ˚ pp0 q P
R such that we get analogously to the first part that
The following corollary deals with conjugates of infimal projections which will be used in
the next section on duality theory.
Let F be a normed space. Recall that the dual space of E ˆ F is given by pE ˆ F q˚ “
E ˚ ˆ F ˚ with dual pairing xpp, qq, px, yq yEˆF :“ x p, x yE ` x q, y yF for pp, qq P E ˚ ˆ F ˚ and
px, yq P E ˆ F .
Corollary 6.1.12 (Conjugates of Infimal Projections)
For φ : E ˆ F Ñ R, the infimal projection v : F Ñ R, u ÞÑ inf xPE φpx, uq has the conjugate
v ˚ “ φ˚ p0, ¨q.
If v(u) is finite and there exists an x₀ ∈ E such that v(u) = φ(x₀, u), then
Exercises
holds. ■
Let E be a Hilbert space and f ∈ Γ₀(E). Then, the Moreau decomposition holds true for all x ∈ E:
1 ^1f(x) + ^1(f*)(x) = ½‖x‖²;
2 prox_f(x) + prox_{f*}(x) = x;
3 there exist unique elements y, z ∈ E such that y + z = x and z ∈ ∂f(y). These elements are y = prox_f(x) and z = prox_{f*}(x).
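For f = |·| on R one has f* = ι_{[−1,1]}, so prox_{f*} is the projection onto [−1, 1] and part 2 states S₁(x) + proj_{[−1,1]}(x) = x. A quick numerical check (our own illustration):

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 7)
prox_f = np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)  # prox_f = S_1
prox_fstar = np.clip(x, -1.0, 1.0)                      # projection onto [-1, 1]
assert np.allclose(prox_f + prox_fstar, x)              # part 2 of the theorem
```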
0 ∈ x̂ − x + ∂f(x̂)  ⟺  x − x̂ ∈ ∂f(x̂).

Using Fermat's rule again, this is equivalent to x − x̂ = arg min_{y∈E} { ½‖y − x‖² + f*(y) } = prox_{f*}(x). Since we defined x̂ = prox_f(x), we arrive at

prox_f(x) + prox_{f*}(x) = x,

which proves 2. ˝
Lemma 6.2.3
Let E be a Hilbert space and assume that f ∈ Γ₀(E). Then

f* − ½‖·‖² is convex  ⟺  ½‖·‖² − f is convex.
Taking the dual and noting that the Moreau envelope is continuous by Theorem 2.7.4 6, we obtain that

f = f** = (^1g*)** = cl(^1g*) = ^1g*.
In the first case, we obtain f px2 q “ ´8 which contradicts the assumption that f is proper.
In the second case, we have f px0 q “ 8, which contradicts f px0 q P R. Consequently, it holds
dompf q “ domphq “ E such that Theorem 2.4.4 implies that f (and hence h) is continuous.
In particular, we have that h P Γ0 pEq such that h “ h˚˚ by Theorem 6.1.6.
Therefore, it holds that

f(x) = ½‖x‖² − h**(x) = ½‖x‖² − sup_{y∈E} { ⟨x, y⟩ − h*(y) } = inf_{y∈E} { ½‖x‖² − ⟨x, y⟩ + h*(y) }.

In particular, we obtain with the computation rule for conjugates from Theorem 6.1.9 that

f*(x) = sup_{y∈E} { ⟨x, y⟩ − inf_{z∈E} ( ½‖y‖² − ⟨y, z⟩ + h*(z) ) }
= sup_{z∈E} { ( ½‖·‖² − ⟨·, z⟩ )*(x) − h*(z) }
= sup_{z∈E} { ½‖x + z‖² − h*(z) } = ½‖x‖² + sup_{z∈E} { ⟨x, z⟩ + ½‖z‖² − h*(z) }
= ½‖x‖² + ( h* − ½‖·‖² )*(x).

We conclude that

f* − ½‖·‖² = ( h* − ½‖·‖² )*

is convex since conjugate functions are convex. □
Lemma 6.2.4
Let E be a Hilbert space and let T : E Ñ E be non-expansive such that there exists some
Φ : E Ñ R with T pxq P BΦpxq for all x P E. Then, the following holds true.
1 Φ is Frechet differentiable with gradient ∇Φ “ T .
2 The function g “ 21 Φ˚ ´ 12 } ¨ }2 is convex.
Proof. 1 Let x, x0 P E. Then, T px0 q P BΦpx0 q and T pxq P BΦpxq yields by the definition
of the subdifferential that
Φpxq ´ Φpx0 q ě x T px0 q, x ´ x0 y, and Φpx0 q ´ Φpxq ě x T pxq, x0 ´ x y
ô Φpxq ´ Φpx0 q ď x T pxq, x ´ x0 y .
This implies by Cauchy-Schwarz and since T is non-expansive that
0 ď Φpxq ´ Φpx0 q ´ x T px0 q, x ´ x0 y ď x T pxq ´ T px0 q, x ´ x0 y
ď }T pxq ´ T px0 q}}x ´ x0 } ď }x ´ x0 }2 .
In particular, we have that
|Φpxq ´ Φpx0 q ´ x T px0 q, x ´ x0 y |
lim ď lim }x ´ x0 } “ 0.
xÑx0 }x ´ x0 } xÑx0
such that
0 ď }z1 ´ z2 }2 ´ }z1 ´ z2 ´ px1 ´ x2 q}2
“ }z1 ´ z2 }2 ´ }z1 ´ z2 }2 ` 2 x z1 ´ z2 , x1 ´ x2 y ´}x1 ´ x2 }2
“ x 2z1 ´ 2z2 , x1 ´ x2 y ´ x x1 ´ x2 , x1 ´ x2 y
“ x 2z1 ´ x1 ´ p2z2 ´ x2 q, x1 ´ x2 y
“ x ∇gpx1 q ´ ∇gpx2 q, x1 ´ x2 y .
By the first order criterion from Theorem 2.3.9, this implies that g is convex. l
Now, we can prove that an operator is a proximal mapping if and only if it is non-expansive
and the gradient of a convex potential.
(^1Φ*)* = ( ½‖·‖² □ Φ* )* = ½‖·‖² + Φ** = ½‖·‖² + Φ,

where the last equality follows from the fact that Φ is convex and Frechet differentiable and therefore Φ ∈ Γ₀(E). This implies that

(2 ^1Φ*)* = 2 (^1Φ*)*(½ ·) = 2Φ(½ ·) + ‖½ ·‖² = 2Φ(½ ·) + ¼‖·‖².

Hence, we have that the function

½‖·‖² − (2 ^1Φ*)* = ¼‖·‖² − 2Φ(½ ·) = 2 ( ½‖½ ·‖² − Φ(½ ·) )

is convex. This implies that also ½‖·‖² − Φ is convex. Again by Lemma 6.2.3 this implies that f := Φ* − ½‖·‖² is convex such that

Φ = Φ** = ( ½‖·‖² + f )* = ½‖·‖² □ f* = ^1(f*).
Finally, this implies by Theorem 2.7.4 6 and the Moreau decomposition that T = ∇Φ = ∇(^1f*) = I − prox_{f*} = prox_f.
The proof implies in particular that the function f from the first item of Theorem 6.2.5 can be expressed by the potential Φ from the second and third item by

f(x) = Φ*(x) − ½‖x‖².
In the case that E “ R, the characterization of proximal mappings can be simplified.
Corollary 6.2.6
Let f : R Ñ R. Then, there exists some g P Γ0 pRq such that f “ proxg if and only if f is
non-expansive and monotone increasing.
Exercises
σ_S* ∘ Λ_E = ι_S** ∘ Λ_E = cl(ι_S) = ι_{cl(S)}.
In particular, we have that σS˚ ˝ΛE “ ιS if and only if S is additionally closed. Summarizing,
indicator and support function of a non-empty, bounded, closed, convex set are conjugate
to each other.
Lecture 21,
Example 6.3.2 Let K Ď E be a non-empty, closed, convex cone. Then we have that
14.06.2024
ι˚K ppq “ σK ppq “ ιK ˚ ppq.
The following lemma relates positively homogeneous functions with support functions.
Since f* = ι_{C_f} by Lemma 6.3.3, the equality on the right only holds if p ∈ C_f and f(x₀) = ⟨p, x₀⟩. □
f ˚ “ ιC}¨} ,
where
Exercises
Exercise 6.3.6 What is the support function of a linear subspace, a half-space or the kernel
of a linear operator? ■
7 Duality Theory
Textbooks: [BS13, Sec. 2.5], [BP12, Subsec. 3.2.1], [AE06, Sec. 4.4] and [ET99, Chp. III].
To a convex minimization problem we can associate a “dual” maximization problem using
the conjugates of the functions. The original problem is called the “primal” problem. We
examine the relationship between the optimal values and solutions of these two problems.
Throughout this section, we will assume that E and F are reflexive normed spaces.
solpPu q :“ arg min φpx, uq, solpP q :“ arg min φpx, 0q.
xPE xPE
solpDu q :“ arg max x p, u y ´φ˚ p0, pq, solpDq :“ arg min φ˚ p0, pq.
pPF ˚ pPF ˚
Proof. 1 Let u ∈ F with v**(u) ∈ R. Then v* is proper by Lemma 6.1.5 and therefore v* ∈ Γ₀(F*) such that v* = v***. Then, the last statement in Theorem 6.1.11 applied to f = v** yields

∂v**(u) = arg max_{p∈F*} { ⟨p, u⟩ − v***(p) } = arg max_{p∈F*} { ⟨p, u⟩ − v*(p) } = sol(D_u).

2 By 1 and applying (6.4) twice, we obtain

p̂ ∈ sol(D_u) = ∂v**(u)  ⟺  u ∈ ∂v***(p̂) = ∂v*(p̂)  ⟺  p̂ ∈ ∂v(u).
3 Let Bvpuq ‰ H. Then, by definition of the subdifferential, vpuq must be finite. Let
p̂ P Bvpuq. By the Fenchel–Young identity from Theorem 6.1.11 we get
⟨p̂, u⟩ − v*(p̂) = v(u) ≥ v**(u) = sup_{p∈F*} { ⟨p, u⟩ − v*(p) }.  (7.1)
4 Let u P F .
" i ùñ ii ": Suppose that vpuq “ v ˚˚ puq, x̂ P solpPu q and p̂ P solpDu q. Since φ is
proper, we have that φ˚ is proper such that
In particular, v(u) = v**(u) is finite. By 2 this implies that p̂ ∈ ∂v(u) and thus the
Fenchel–Young identity from Theorem 6.1.11 yields
In particular, φpx̂, uq and φ˚ p0, p̂q are finite. Since vpuq ď φpx̂, uq by definition, we
obtain
vpuq ` v ˚ pp̂q ď φpx̂, uq ` φ˚ p0, p̂q “ x p̂, u y .
Thus, by the Fenchel inequality from Lemma 6.1.4 2 we derive
By (7.2) and since it holds v ˚ pp̂q “ φ˚ p0, p̂q by Corollary 6.1.12 this yields
φpx̂, uq ` φ˚ p0, p̂q “ vpuq ` v ˚ pp̂q “ x p̂, u y “ xpx̂, uq, p0, p̂q y .
By the Fenchel–Young identity applied for φ this yields that p0, p̂q P Bφpx̂, uq
" iii ùñ ii ": Let p0, p̂q P Bφpx̂, uq. Then, in particular, φpx̂, uq is finite. Thus, the
Fenchel–Young identity yields that φpx̂, uq ` φ˚ p0, p̂q “ x p̂, u y. l
Since Young’s inequality entails φpx̂, uq ` φ˚ p0, p̂q ě x x̂, 0 y ` x u, p̂ y, the second condition
from 4 is sometimes called extremality condition. Part 4 of the theorem states for
the special case u “ 0 in particular, that f px̂q “ ´φ˚ pp̂q implies x̂ P arg minpf q.
The following example shows, that the statements of the theorem do not hold true any
longer if we drop some of the assumptions.
Example 7.1.3
Consider the function

φ : R × R → R̄,  (x, u) ↦ { e^x, if e^x ≤ u;  ∞, otherwise. }
and thus ∂v(0) = ∅ and ∂v(1) = {0} and v(0) = ∞, v(1) = 0. Furthermore,

v*(p) = sup_{x∈R} { px − ι_{R>0}(x) } = sup_{x>0} px = ι_{R≤0}(p).
We are mainly interested in the solutions of the primal problem (P ) for u “ 0. In this case,
the following corollary reformulates the findings from the previous theorem for φ P Γ0 pEˆF q.
Corollary 7.1.4 (Conjugate duality for the unperturbed problem)
Let φ P Γ0 pE ˆ F q.
1 If F is finite-dimensional and 0 P ripdompvqq or 0 P ripdompv˚ qq, then vp0q “ v ˚˚ p0q.
2 If F is finite-dimensional and 0 P ripdompvqq and vp0q is finite, then vp0q “ v ˚˚ p0q
and solpDq “ Bvp0q ‰ H.
3 If F is finite-dimensional and 0 P ripdompv˚ qq and v ˚˚ p0q is finite, then vp0q “ v ˚˚ p0q
and solpP q “ Bv˚ p0q ‰ H.
4 The following equivalence holds true:

( v(0) = v**(0), x̂ ∈ sol(P) and p̂ ∈ sol(D) )  ⟺  (x̂, 0) ∈ ∂φ*(0, p̂)  ⟺  (0, p̂) ∈ ∂φ(x̂, 0).
we obtain by Theorem 7.1.2 2 for v˚ instead of v that solpP q “ Bv˚ p0q, which
shows 3 .
v(0) = inf_{x∈E} { ⟨c, x⟩ + g(x) + h(b − Ax) }.  (7.4)
By the conjugate calculus rules from Theorem 6.1.9 4 and 5 we get for y ∈ E* and p ∈ F* that

φ*(y, p) = sup_{x∈E, u∈F} { ⟨y, x⟩ + ⟨p, u⟩ − ⟨c, x⟩ − g(x) − h(b − Ax + u) }
= sup_{x∈E} { ⟨y, x⟩ − ⟨c, x⟩ − g(x) + sup_{u∈F} { ⟨p, u⟩ − h(b − Ax + u) } }
= sup_{x∈E} { ⟨y, x⟩ − ⟨c, x⟩ − g(x) + h*(p) − ⟨p, b − Ax⟩ }
= h*(p) − ⟨p, b⟩ + sup_{x∈E} { ⟨y, x⟩ − ⟨c, x⟩ − g(x) + ⟨p, Ax⟩ }
= h*(p) − ⟨p, b⟩ + sup_{x∈E} { ⟨y, x⟩ − g(x) − ⟨c − A*p, x⟩ }
= h*(p) − ⟨p, b⟩ + g*(y − c + A*p),

where the inner supremum in the second line equals (h(b − Ax + ·))*(p) and the supremum in the last line equals (⟨c − A*p, ·⟩ + g)*(y).
Thus, the dual problems (D) and (Du ) can be rewritten as
3 If F is finite-dimensional and c ∈ ri( A* dom(h*) − dom(g*) ) and v**(0) is finite, then sol(P) = ∂v*(0) ≠ ∅.
4 The following equivalence holds true:

( v(0) = v**(0), x̂ ∈ sol(P) and p̂ ∈ sol(D) )  ⟺  (x̂, 0) ∈ ∂φ*(0, p̂)  ⟺  (0, p̂) ∈ ∂φ(x̂, 0).
We verify only the first relation; the second one follows analogously. By the definition of the relative interior we have 0 ∈ ri(dom(v)) if and only if 0 ∈ dom(v) and there exists an ε > 0 such that v(u) < ∞ for all u ∈ B_ε(0) ∩ aff(dom(v)). This is the case if and only if for any such u there is an x_u ∈ dom(g) with b − Ax_u + u ∈ dom(h), that is,

dom(h) ∩ ( b − A dom(g) + u ) ≠ ∅  ∀u ∈ B_ε(0) ∩ aff(dom(v)),

that is,

b + u ∈ A dom(g) + dom(h)  ∀u ∈ B_ε(0) ∩ aff(dom(v)).
Problems like this are called linear programs (in standard form) and play an important linear programs
role in discrete and combinatorial optimization. The constraint Ax ě b ensures that x is
contained in a polyhedron.
In our notation, we can rewrite the linear program equivalently as the unconstrained problem

inf_{x∈R^d} ⟨c, x⟩ + ι_{R^m_{≤0}}(b − Ax).

Its dual problem reads as

sup_{p∈R^m} ⟨p, b⟩ − ι_{R^m_{≥0}}(p) − ι_{{0}}(A*p − c),
which is exactly the dual form of the linear program (7.6) that is known from linear opti-
mization. ˛
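To see this primal-dual pair numerically, the following minimal sketch (ours, not part of the script; it assumes NumPy and SciPy are available) solves both linear programs on random data constructed to keep primal and dual feasible:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
b = A @ rng.standard_normal(3) - rng.random(5)  # primal feasible by construction
c = A.T @ rng.random(5)                         # dual feasible: c = A^T p0 with p0 >= 0

# Primal: min <c,x> s.t. Ax >= b  (linprog expects A_ub @ x <= b_ub, x free)
primal = linprog(c, A_ub=-A, b_ub=-b, bounds=(None, None))
# Dual:   max <p,b> s.t. A^T p = c, p >= 0  (minimize -<p,b>)
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=(0, None))

print(primal.fun, -dual.fun)  # equal up to solver tolerance: no duality gap
```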
Note that Gpxq :“ b ´ Ax leads again to a Fenchel problem (7.4) with c “ 0 handled in
the previous subsection.
To formulate the dual problem $(D_u)$ we calculate for $p\in F^*$, again using the conjugate calculus rule Theorem 6.1.9 4,
\[
\varphi^*(0,p)=\sup_{x\in E}\sup_{u\in F}\bigl\{\langle u,p\rangle-g(x)-h(G(x)+u)\bigr\}=h^*(p)-\inf_{x\in E}\bigl\{g(x)+\langle p,G(x)\rangle\bigr\}.
\]
The following theorem gives optimality conditions which ensure the existence of primal and
dual solutions. Later we will see that for a special setting these conditions coincide with the
so-called Karush-Kuhn-Tucker (KKT) conditions.
hold true. Conversely, if there exist x̂ and p̂ satisfying (OC), then they are solutions
of (L-P ) and (L-D) and there is no duality gap.
Proof. “$\Longrightarrow$”: Let $v(0)=v^{**}(0)$, $\hat x\in\operatorname{sol}$ (L-P) and $\hat p\in\operatorname{sol}$ (L-D). Then we obtain
\[
v(0)=g(\hat x)+h(G(\hat x))=v^{**}(0)=\inf_{x\in E}L(x,\hat p)-h^*(\hat p).
\]
Subtracting the last from the second term and adding $0=\langle\hat p,G(\hat x)\rangle-\langle\hat p,G(\hat x)\rangle$ yields
\[
\underbrace{g(\hat x)+\langle G(\hat x),\hat p\rangle-\inf_{x\in E}L(x,\hat p)}_{=L(\hat x,\hat p)-\inf_{x\in E}L(x,\hat p)\,\ge 0}
+\underbrace{h\bigl(G(\hat x)\bigr)+h^*(\hat p)-\langle\hat p,G(\hat x)\rangle}_{\ge 0\ \text{(Fenchel inequality)}}=0,
\]
This relates our minimization problem with saddle points as outlined in the next section.
Lecture 23, 21.06.2024
Corollary 7.3.4 (Optimality conditions for conic problem)
For the conic problem (P) the optimality conditions (OC) read as
Proof. By Example 3.1.10, the subdifferential of ιK is given by the normal cone BιK pxq “
NK pxq. Thus, the second optimality condition p̂ P BhpGpx̂qq in (OC) for x̂ P E and p̂ P F ˚
becomes $\hat p\in N_K\bigl(G(\hat x)\bigr)$. By the definition of the normal cone we have
\[
\langle\hat p,x\rangle\le\langle\hat p,G(\hat x)\rangle\qquad\forall x\in K. \tag{7.11}
\]
Because $G(\hat x)\in K$ and $K$ is a cone, we have $x:=\lambda G(\hat x)\in K$ for all $\lambda\ge 0$. Plugging this into (7.11) yields
\[
\lambda\langle\hat p,G(\hat x)\rangle\le\langle\hat p,G(\hat x)\rangle\qquad\forall\lambda\ge 0. \tag{7.12}
\]
Setting $\lambda=0$ yields $\langle\hat p,G(\hat x)\rangle\ge 0$. If $\langle\hat p,G(\hat x)\rangle>0$, we could divide (7.12) by it and obtain $\lambda\le 1$ for all $\lambda\ge 0$, which is a contradiction. Thus, we obtain that $\langle\hat p,G(\hat x)\rangle=0$. Plugging this into (7.11) yields $\langle\hat p,x\rangle\le 0$ for all $x\in K$, so that $\hat p\in K^*$. $\square$
Finally, we consider a special problem which arises frequently in practice. To this end, we
assume that the spaces E and F are finite-dimensional.
Example 7.3.5 (Finitely many convex inequality and affine equality constraints)
Consider the functions $f\in\Gamma_0(E)$, $G:=(g_1,\dots,g_s)\colon E\to\mathbb R^s$, where $g_1,\dots,g_k\in\Gamma_0(E)$ and $g_{k+1},\dots,g_s$ are affine functions for some $k\in\{1,\dots,s\}$, and $H:=(h_1,\dots,h_t)\colon E\to\mathbb R^t$, where $h_j$ is an affine function for every $j\in\{1,\dots,t\}$. We are interested in the problem
This is a special case of the above cone setting (C-P ) where G is replaced by pG, Hq : E Ñ
F :“ Rs ˆ Rt , Ω is given by the whole space E and K :“ Rsď0 ˆt0ut . Then, the polar cone is
given by K ˚ “ Rsě0 ˆ Rt . The primal (C-P ) and dual problems (C-D) can thus be written
as
\[
\inf_{x\in E}\ \sup_{p\ge 0,\,q\in\mathbb R^t} L(x,p,q)\qquad\text{and}\qquad\sup_{p\ge 0,\,q\in\mathbb R^t}\ \inf_{x\in E} L(x,p,q).
\]
The optimality conditions (OC) become
\[
\hat x\in\operatorname*{arg\,min}_{x\in E} L(x,\hat p,\hat q),\quad G(\hat x)\le 0,\quad H(\hat x)=0,\quad\hat p\ge 0,\quad\langle\hat p,G(\hat x)\rangle=0. \tag{7.15}
\]
If Lp¨, p̂, q̂q is differentiable, the first condition can be rewritten as 0 “ ∇x Lpx, p̂, q̂q. Together
these conditions are the original KKT conditions derived in 1951. ˛
Assume that the optimal value is finite. Then the solution fulfills
that is, we have strong duality and both a primal and a dual solution exist.
The KKT conditions can be also formulated for nonconvex smooth problems where they pro-
vide necessary conditions for the local minima if some qualification constraints are fulfilled.
We refer to [BS13, Spe93, Man94] for more information.
Example 7.3.7 (Discrepancy Principle: penalized vs. constrained problems)
Consider matrices A P Rnˆd and L P Rmˆd with kerpAq X kerpLq “ t0u and b P Rn . For
τ ą 0 consider the constrained minimization problem
This setting can be applied for solving inverse problems as discussed in Section 4.3. To
ensure that there exists a solution of (Pτ ) we require that
Since x ÞÑ }Lx}22 is continuous and coercive on the non-empty set tx P Rd : }Ax ´ b}22 ď τ u,
the problem (Pτ ) has a solution. If there exists x P kerpLq with }Ax ´ b}22 ď τ , then x is a
solution and the problem (Pτ ) becomes trivial. Therefore, we assume that
which implies by the assumption (7.17) that τ “ }Ax̂ ´ b}22 . Consequently, in both cases,
the minimum x̂ is attained on tx P Rd : }Ax ´ b}22 “ τ u. Now 1 implies that
\[
\hat x=:x_\lambda=(L^TL+\lambda A^TA)^{-1}\lambda A^Tb=\Bigl(\tfrac1\lambda L^TL+A^TA\Bigr)^{-1}A^Tb.
\]
\[
\sup_{p\ge 0}\inf_{x\in\mathbb R^d}\ \|Lx\|_2^2+p\bigl(\|Ax-b\|_2^2-\tau\bigr)=\sup_{p\ge 0}\ \|Lx_p\|_2^2+p\bigl(\|Ax_p-b\|_2^2-\tau\bigr).
\]
solves
x̂ “ arg min }Lx}22 ` λ}Ax ´ b}22 ,
x ˛
where λ is the solution p of (7.19).
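As a numerical companion to this example (a sketch of ours, assuming NumPy; the function names are hypothetical), one can compute $x_\lambda$ from 1 and tune $\lambda$ by bisection until the discrepancy $\|Ax_\lambda-b\|_2^2=\tau$ holds, using that the residual is nonincreasing in $\lambda$ and that $\tau$ lies strictly between the limiting residuals, cf. (7.17):

```python
import numpy as np

def x_of_lambda(A, L, b, lam):
    # Solution of the penalized problem min ||Lx||^2 + lam * ||Ax - b||^2
    return np.linalg.solve(L.T @ L + lam * A.T @ A, lam * A.T @ b)

def discrepancy_lambda(A, L, b, tau, lo=1e-8, hi=1e8, iters=100):
    # Bisection on a log scale for lambda with ||A x_lambda - b||^2 = tau
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        r = np.linalg.norm(A @ x_of_lambda(A, L, b, mid) - b) ** 2
        lo, hi = (mid, hi) if r > tau else (lo, mid)
    return np.sqrt(lo * hi)
```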
Exercises
There is the following close relation between saddle point problems and min-max problems.
[Fig. 7.1: A smooth function with saddle point at the origin.]
Lemma 7.4.2 (Saddle points and min-max problems)
Let Φ : Ω ˆ Ξ Ñ R.
1 The inequality
\[
\max_{y\in\Xi}\min_{x\in\Omega}\Phi(x,y)\le\min_{x\in\Omega}\max_{y\in\Xi}\Phi(x,y) \tag{7.21}
\]
holds true.
2 Suppose that all minima and maxima below are attained. Then $\Phi$ has a saddle point if and only if
\[
\max_{y\in\Xi}\min_{x\in\Omega}\Phi(x,y)=\min_{x\in\Omega}\max_{y\in\Xi}\Phi(x,y). \tag{7.22}
\]
3 Any saddle point $(\hat x,\hat y)$ of $\Phi$ has the function value $\Phi(\hat x,\hat y)=\min_{x\in\Omega}\max_{y\in\Xi}\Phi(x,y)=\max_{y\in\Xi}\min_{x\in\Omega}\Phi(x,y)$.
Proof. 1: For all $\tilde x\in\Omega$ and $\tilde y\in\Xi$ it holds
\[
\psi(\tilde y):=\min_{x\in\Omega}\Phi(x,\tilde y)\le\Phi(\tilde x,\tilde y)\le\max_{y\in\Xi}\Phi(\tilde x,y)=:\varphi(\tilde x). \tag{7.23}
\]
Hence we can take the maximum on the left side and the minimum on the right: $\max_{\tilde y\in\Xi}\psi(\tilde y)\le\min_{\tilde x\in\Omega}\varphi(\tilde x)$.
Lecture 24, 26.06.2024
2 “$\Longrightarrow$”: Let $(\hat x,\hat y)$ be a saddle point of $\Phi$. By definition this implies that
\[
\min_{x\in\Omega}\max_{y\in\Xi}\Phi(x,y)\le\max_{y\in\Xi}\Phi(\hat x,y)=\Phi(\hat x,\hat y)=\min_{x\in\Omega}\Phi(x,\hat y)\le\max_{y\in\Xi}\min_{x\in\Omega}\Phi(x,y).
\]
138
7 DUALITY THEORY
Together with (7.21) we arrive at the desired equality (7.22). Moreover, the above equation shows 3.
“$\Longleftarrow$”: Assume that (7.22) holds true. Let $\hat x=\operatorname*{arg\,min}_{x\in\Omega}\varphi(x)$ and $\hat y=\operatorname*{arg\,max}_{y\in\Xi}\psi(y)$. Then it holds
\begin{align*}
\max_{y\in\Xi}\min_{x\in\Omega}\Phi(x,y)&=\max_{y\in\Xi}\psi(y)=\psi(\hat y)=\min_{x\in\Omega}\Phi(x,\hat y)\le\Phi(\hat x,\hat y)\le\max_{y\in\Xi}\Phi(\hat x,y)\\
&=\varphi(\hat x)=\min_{x\in\Omega}\varphi(x)=\min_{x\in\Omega}\max_{y\in\Xi}\Phi(x,y)=\max_{y\in\Xi}\min_{x\in\Omega}\Phi(x,y). \tag{7.24}
\end{align*}
Since we have the same value on the left and on the right, all inequalities are in fact equalities. In particular, this yields for all $\tilde x\in\Omega$ and $\tilde y\in\Xi$ that $\Phi(\hat x,\tilde y)\le\Phi(\hat x,\hat y)\le\Phi(\tilde x,\hat y)$, i.e., $(\hat x,\hat y)$ is a saddle point of $\Phi$. $\square$
Example 7.4.3 Note that the claim of Lemma 7.4.2 in the first version of this script (which
is the same as in [Spe93, Satz 2.2.3] as well as in the finite-dimensional script) was not true.
More precisely, the condition
is not sufficient for showing that px̂, ŷq is a saddle point as the following counterexample
shows. Consider the function Φpx, yq “ x2 ´ y 2 and define px̂, ŷq “ p1, 1q. Then it holds
However, it is easy to check that p0, 0q is the unique saddle point of Φ. In particular,
px̂, ŷq “ p1, 1q is not a saddle point since Φp0, 1q “ ´1 ă 0 “ Φp1, 1q. ˛
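A quick numerical check of this counterexample (our own sketch, assuming NumPy) evaluates the two saddle point inequalities on a grid:

```python
import numpy as np

def Phi(x, y):
    return x**2 - y**2

grid = np.linspace(-2.0, 2.0, 401)

def is_saddle(xh, yh, tol=1e-12):
    # (xh, yh) is a saddle point iff Phi(xh, y) <= Phi(xh, yh) <= Phi(x, yh) for all x, y
    return (Phi(xh, grid) <= Phi(xh, yh) + tol).all() and \
           (Phi(xh, yh) <= Phi(grid, yh) + tol).all()

print(is_saddle(0.0, 0.0))  # True: the unique saddle point
print(is_saddle(1.0, 1.0))  # False, e.g. Phi(0, 1) = -1 < 0 = Phi(1, 1)
```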
By the following theorem a solution of the saddle point problem of the Lagrangian associated
with (7.13) gives rise to a solution of (7.13).
\[
f(\hat x)+\langle p,G(\hat x)\rangle+\langle q,H(\hat x)\rangle\le f(\hat x)+\langle\hat p,G(\hat x)\rangle+\langle\hat q,H(\hat x)\rangle\le f(x)+\langle\hat p,G(x)\rangle+\langle\hat q,H(x)\rangle, \tag{7.25}
\]
so that $H(\hat x)=0$.
2 On the other hand, setting $p=\hat p+e_j$ and $q=\hat q$ in (7.26), we get $\bigl(G(\hat x)\bigr)_j\le 0$ for every $j\in\{1,\dots,s\}$.
3 Putting p “ 0 P Rsě0 in (7.26) we see further that xp̂, Gpx̂qy ě 0 and since p̂ P Rsě0 and
Gpx̂q ď 0 we conclude xp̂, Gpx̂qy ď 0 and thus xp̂, Gpx̂qy “ 0.
4 Finally, using that p̂ ě 0 and Gpxq ď 0, Hpxq “ 0 for feasible points x P G, we obtain
from the right-hand side of (7.25) that
has no solution and by Lemma 7.4.6 there exist $(p_0,p)^T\in\mathbb R^{s+1}_{\ge 0}\setminus\{0\}$ and $q\in\mathbb R^t$ such that
Now $p_0=0$ and $p\neq 0$ would contradict the Slater condition. Thus $p_0\neq 0$ and dividing (7.27) by $p_0$ yields, with $\hat p:=\frac{1}{p_0}p$ and $\hat q:=\frac{1}{p_0}q$, that
\[
L(x,\hat p,\hat q)\ge f(\hat x)\qquad\forall x\in\Omega. \tag{7.28}
\]
By assumption, $\Lambda$ does not contain the origin $(0,0)\in\mathbb R^s\times\mathbb R^t$. Further, $\Lambda$ is convex, because $(y_i,z_i)\in\Lambda$, $i=1,2$, and the convexity of $\Omega$ and $G$ imply
\[
\langle p,u\rangle+\langle q,w\rangle\ge 0\qquad\forall (u,w)\in\Lambda.
\]
141
7 DUALITY THEORY
Now if
\[
\inf_{x\in\Omega}\bigl\{\langle p,G(x)\rangle+\langle q,H(x)\rangle\bigr\}=-\delta<0,
\]
\[
\langle q,Ax-b\rangle\ge 0\qquad\forall x\in\Omega,
\]
Since $\lambda$ can be chosen arbitrarily large, this results in $A^*q=0$. If the rows of $A$ are linearly independent, this implies the contradiction $q=0$. If the rows of $A$ are not linearly independent and $H(x)=0$, i.e., $Ax=b$ for some $x$, we may choose a maximal linearly independent row set of $A$ and cancel the other rows of the system without changing its solution set. $\square$
8 Primal-Dual Algorithms
In this section we only consider optimization methods over finite-dimensional spaces with the Euclidean norm.
The following minimization algorithms closely rely on the primal-dual formulation of prob-
lems. We consider functions f “ g ` hpA ¨q, where g P Γ0 pRd q, h P Γ0 pRm q, and A P Rm,d ,
and ask for the solution of the primal problem
Recall that $((\hat x,\hat y),\hat p)\in\mathbb R^{d+m}\times\mathbb R^m$ is a saddle point of the Lagrangian $L$ in (8.3) if
\[
L(\hat x,\hat y,p)\le L(\hat x,\hat y,\hat p)\le L(x,y,\hat p)\qquad\forall (x,y)\in\mathbb R^{d+m},\ \forall p\in\mathbb R^m.
\]
Using the results from the previous section, we obtain that $((\hat x,\hat y),\hat p)$ is a saddle point of $L$ if and only if $(\hat x,\hat y)$ is a solution of $(P)$ and $\hat p$ is a solution of $(D)$. However, the existence of a solution $(\hat x,\hat y)\in\mathbb R^{d+m}$ of the primal problem only implies under additional qualification constraints that there exists $\hat p$ such that $((\hat x,\hat y),\hat p)$ is a saddle point of $L$.
The main idea of the following algorithms is to search for saddle points of the Lagrangian
in order to solve the primal problem (8.1).
which is known as the general Uzawa method [AHU58]. Since for fixed $p$ the Lagrangian can be decomposed into terms only depending on one of the variables $x$ and $y$, this is equivalent to the iterations
\[
x^{(r+1)}\in\operatorname*{arg\,min}_{x\in\mathbb R^d} L(x,y^{(r)},p^{(r)}),\qquad
y^{(r+1)}\in\operatorname*{arg\,min}_{y\in\mathbb R^m} L(x^{(r+1)},y,p^{(r)}). \tag{8.6}
\]
Linear convergence can be proved under certain conditions [GL89]. In particular, these
conditions include strict convexity of the objective function f pxq “ gpxq ` hpAxq. In order
to relax these assumptions on $f$, we replace the Lagrangian by the so-called augmented Lagrangian
\[
L_\gamma(x,y,p):=g(x)+h(y)+\langle p,Ax-y\rangle+\frac{\gamma}{2}\|Ax-y\|_2^2
=g(x)+h(y)+\frac{\gamma}{2}\Bigl\|Ax-y+\frac{p}{\gamma}\Bigr\|_2^2-\frac{1}{2\gamma}\|p\|_2^2.
\]
Since any solution of the primal problem fulfills Ax “ y, we obtain that the primal problem
can be reformulated as
Consequently, it can be shown that any saddle point of L is a saddle point of Lγ for any
γ ą 0. A formal proof of this fact is left as Exercise 8.1.7.
Now, we replace in (8.5) the Lagrangian by the augmented Lagrangian with fixed pa-
rameter γ:
\[
(x^{(r+1)},y^{(r+1)})\in\operatorname*{arg\,min}_{x\in\mathbb R^d,\,y\in\mathbb R^m} L_\gamma(x,y,p^{(r)}), \tag{8.7}
\]
This augmented Lagrangian method is known as the method of multipliers [Hel69, Pow72, Roc76]. The improved convergence properties come at a cost: the rewriting of (8.5) to (8.6) is no longer possible when we use the augmented Lagrangian. Nevertheless, we consider the algorithm which alternates the minimization with respect to $x$ and $y$. This leads to the alternating direction method of multipliers (ADMM), which dates back to [Gab83, GM76, GM75]. Setting $b^{(r)}:=p^{(r)}/\gamma$ we obtain the following (scaled) ADMM:
\[
y^{(r+1)}=\operatorname*{arg\,min}_{y\in\mathbb R^m}\Bigl\{h(y)+\frac{\gamma}{2}\|b^{(r)}+Ax^{(r+1)}-y\|_2^2\Bigr\}=\operatorname{prox}_{\frac{1}{\gamma}h}\bigl(b^{(r)}+Ax^{(r+1)}\bigr);
\]
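In code, one sweep of these scaled ADMM steps can be sketched as follows (a minimal sketch of ours; the subproblem solvers and helper names are assumptions supplied by the caller, not part of the script):

```python
import numpy as np

def admm(argmin_x, prox_h, A, gamma, x, y, b, iters=200):
    """Scaled ADMM sketch for min_x g(x) + h(Ax).

    argmin_x(b, y): solves argmin_x g(x) + (gamma/2) * ||b + A @ x - y||^2
    prox_h(v):      evaluates prox_{(1/gamma) h}(v)
    """
    for _ in range(iters):
        x = argmin_x(b, y)       # x-update: first ADMM step
        y = prox_h(b + A @ x)    # y-update: prox of h at b + A x^{(r+1)}
        b = b + A @ x - y        # update of the scaled dual variable b = p / gamma
    return x, y, gamma * b       # gamma * b recovers the multiplier p
```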
A good overview of the ADMM algorithm and its applications is given in [BPC+11], where in particular the important issue of choosing the parameter $\gamma>0$ is addressed. Convergence of the ADMM under various assumptions was proved, e.g., in [GM76, HY98, LM79, Tse91]. A few bounds on the global convergence rate of the algorithm can be found in [EB90] (linear convergence for linear programs, depending on a variety of quantities) and [HL12] (linear convergence for sufficiently small step sizes); the local behaviour of a specific variant of the ADMM for quadratic programs is studied in [Bol14].
Lecture 25, 28.06.2024
Relation to Douglas-Rachford Splitting. In order to show convergence of the ADMM, we prove that it is equivalent to the Douglas-Rachford splitting algorithm applied to the dual problem, see [EB92, Ess09, Gab83, Set11].
Recall that the dual problem from (8.4) reads as
Since
\[
\inf_{y\in\mathbb R^m}\{h(y)-\langle p,y\rangle\}=-\sup_{y\in\mathbb R^m}\{\langle p,y\rangle-h(y)\}=-h^*(p)
\]
Then, the following theorem establishes a relation between ADMM and DRS. It was first
proven in [EB92, Gab83].
The ADMM sequences $(b^{(r)})_r$ and $(y^{(r)})_r$ are related to the sequences (8.11) generated by the DRS algorithm applied to the dual problem via $\eta=\gamma$ and
Applying ´ηA on both sides and using the chain rule implies
which is equivalent to the right-hand side of (8.13) by the definition of the resolvent.
Secondly, applying (8.13) to the first ADMM step with γ “ η and q :“ y prq ´ bprq yields
Assume that the ADMM and DRS iterates have the identification (8.12) up to some r P N.
Using this induction hypothesis it follows that
\[
\eta\bigl(b^{(r)}+Ax^{(r+1)}\bigr)=\operatorname{prox}_{\eta g^*\circ(-A^T)}\bigl(\underbrace{\eta(b^{(r)}-y^{(r)})}_{2p^{(r)}-t^{(r)}}\bigr)+\underbrace{\eta y^{(r)}}_{t^{(r)}-p^{(r)}}\overset{(8.11)}{=}t^{(r+1)}. \tag{8.14}
\]
By definition of bpr`1q we see that ηpbpr`1q ` y pr`1q q “ tpr`1q . Next we apply (8.13) in the
second ADMM step where we replace g by h and A by ´I and use q :“ ´bprq ´ Axpr`1q .
Together with (8.14) this gives
\[
\eta\bigl(-y^{(r+1)}+b^{(r)}+Ax^{(r+1)}\bigr)=\operatorname{prox}_{\eta h^*}\bigl(\underbrace{\eta(b^{(r)}+Ax^{(r+1)})}_{t^{(r+1)}}\bigr)\overset{(8.11)}{=}p^{(r+1)}. \tag{8.15}
\]
Using again the definition of bpr`1q we obtain that ηbpr`1q “ ppr`1q which completes the
proof. l
The theorem implies the following convergence result for the ADMM.
Corollary 8.1.2 (Convergence of ADMM)
Let g P Γ0 pRd q, h P Γ0 pRm q and A P Rm,d . Assume that the Lagrangian (8.3) has a saddle
point. Then, for r Ñ 8, the sequences pγbprq qr and py prq qr converge to some limits p̂ “ γ b̂
and ŷ, where p̂ is a solution of the dual problem. If additionally the first step (8.8) in the
ADMM algorithm has a unique solution, then pxprq qr converges to a solution of the primal
problem.
Proof. The convergence of the y prq to ŷ and of pprq to the dual solution p̂ follows from
Theorem 8.1.1 together with Theorem 4.2.15. For the convergence result of pxprq qr , we refer
to [Set09, Thm. 2.4.9]. l
If the first step of the ADMM has more than one solution, the sequence $(x^{(r)})_r$ generated by the ADMM can fail to converge, as the following example shows.
Example 8.1.3 Let $g(x)=0$, $h(y)=\iota_{\{0\}}(y)$ and let $A\in\mathbb R^{m,d}$ be the zero matrix. Then, the steps of the ADMM read as
\[
x^{(r+1)}\in\mathbb R^d\ \text{arbitrary},\qquad
y^{(r+1)}=\operatorname{prox}_{\iota_{\{0\}}}\bigl(b^{(r)}\bigr)=0,\qquad
p^{(r+1)}=p^{(r)}.
\]
Choosing $x^{(r)}=(-1)^r$ leads to a divergent sequence generated by the ADMM. ˛
With the help of the ADMM, we can now minimize the TV regularized functionals from
Example 4.3.2.
Example 8.1.4 Let $K\in\mathbb R^{n,d}$ and $D\in\mathbb R^{m,d}$ be matrices. For this example, we do not require a specific form of $K$ and $D$. However, our main application will be the matrix
D which maps a (vectorized) image x onto the vector Dx containing the horizontal and
vertical differences between neighboring pixels. In this case Rpxq “ }Dx}1 is the total
variation regularizer from Example 4.3.2. Moreover the matrix K could define the forward
operator of an inverse problem from Section 4.3 by F pxq “ Kx. Now, we aim to minimize
the functional
\[
\mathcal J(x)=\frac12\|Kx-a\|_2^2+\lambda\|Dx\|_1.
\]
In order to apply the ADMM we set $g(x)=\frac12\|Kx-a\|_2^2$, $h(y)=\lambda\|y\|_1$ and $A=D$. Then,
starting with initializations xp0q , y p0q , pp0q , the first step from the ADMM Algorithm 8 reads
as
\[
x^{(r+1)}\in\operatorname*{arg\,min}_{x\in\mathbb R^d}\Bigl\{g(x)+\frac{\gamma}{2}\|b^{(r)}+Ax-y^{(r)}\|_2^2\Bigr\}
=\operatorname*{arg\,min}_{x\in\mathbb R^d}\Bigl\{\frac12\|Kx-a\|_2^2+\frac{\gamma}{2}\|b^{(r)}+Dx-y^{(r)}\|_2^2\Bigr\}.
\]
By setting the derivative of the objective function to zero, we obtain that $x^{(r+1)}$ is the solution of the linear system
\[
(K^TK+\gamma D^TD)\,x=K^Ta+\gamma D^T\bigl(y^{(r)}-b^{(r)}\bigr),
\]
which can be computed, e.g., by a conjugate gradient method, see the lecture on numerical methods. The second step reads as
\[
y^{(r+1)}=\operatorname{prox}_{\frac{1}{\gamma}h}\bigl(b^{(r)}+Ax^{(r+1)}\bigr)=S_{\frac{\lambda}{\gamma}}\bigl(b^{(r)}+Dx^{(r+1)}\bigr),
\]
where $S_t$ denotes the soft-shrinkage operator from Example 2.7.3. Finally, the last step is given by
\[
b^{(r+1)}=b^{(r)}+Ax^{(r+1)}-y^{(r+1)}.
\]
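Putting the three steps together, a minimal dense-matrix sketch of ours (assuming NumPy; for large images the direct solve would be replaced by conjugate gradients) reads:

```python
import numpy as np

def soft_shrink(v, t):
    # soft-shrinkage operator S_t from Example 2.7.3
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def tv_admm(K, D, a, lam, gamma, iters=300):
    # ADMM sketch for min_x 0.5*||Kx - a||_2^2 + lam*||Dx||_1 (Example 8.1.4)
    x = np.zeros(K.shape[1])
    y = np.zeros(D.shape[0])
    b = np.zeros(D.shape[0])
    M = K.T @ K + gamma * D.T @ D                # system matrix of the x-update
    for _ in range(iters):
        x = np.linalg.solve(M, K.T @ a + gamma * D.T @ (y - b))
        y = soft_shrink(b + D @ x, lam / gamma)  # prox_{(1/gamma) h}
        b = b + D @ x - y
    return x
```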
For minimizing more than two summands, we can use a slight modification of the ADMM
summarized in the following remark. The idea is quite similar to the parallel DRS from
Algorithm 7.
Remark 8.1.5 The ADMM can be extended for minimizing functionals of the form
\[
f(x)=g(x)+\sum_{i=1}^n h_i(A_ix),
\]
where we set $H(y_1,\dots,y_n):=\sum_{i=1}^n h_i(y_i)$ and stack $A:=(A_1^T,\dots,A_n^T)^T$. Now we apply the ADMM to the functions $g$ and $H$, resulting in the following iterations for $x^{(r)}$, $y^{(r)}=(y_1^{(r)},\dots,y_n^{(r)})$ and $b^{(r)}=(b_1^{(r)},\dots,b_n^{(r)})$:
\begin{align*}
x^{(r+1)}&\in\operatorname*{arg\,min}_{x\in\mathbb R^d}\Bigl\{g(x)+\frac{\gamma}{2}\sum_{i=1}^n\|b_i^{(r)}+A_ix-y_i^{(r)}\|_2^2\Bigr\},\\
y_i^{(r+1)}&=\operatorname{prox}_{\frac{1}{\gamma}h_i}\bigl(b_i^{(r)}+A_ix^{(r+1)}\bigr)\qquad\text{for } i=1,\dots,n,\\
b_i^{(r+1)}&=b_i^{(r)}+A_ix^{(r+1)}-y_i^{(r+1)}\qquad\text{for } i=1,\dots,n. \qquad ˝
\end{align*}
The remark can be applied to simplify the calculations in Example 8.1.4. However, it cannot avoid the need to solve a linear system. Moreover, we can use the remark in order to apply the ADMM for minimizing the TV regularized functional from Example 4.3.2 with the non-differentiable data-fidelity term for the inpainting problem.
Example 8.1.6 Let K and D be given as in Example 8.1.4. We aim to minimize the
functional
\[
f(x)=\iota_{\{a\}}(Kx)+\|Dx\|_1.
\]
To this end, we set $g\equiv 0$, $h_1=\iota_{\{a\}}$, $A_1=K$, $h_2=\|\cdot\|_1$ and $A_2=D$. Then, the first step
of the ADMM is given by
\[
x^{(r+1)}\in\operatorname*{arg\,min}_{x\in\mathbb R^d}\bigl\{\|Kx+b_1^{(r)}-y_1^{(r)}\|_2^2+\|Dx+b_2^{(r)}-y_2^{(r)}\|_2^2\bigr\}.
\]
Setting the gradient of the objective to zero, we obtain that this is equivalent to setting
xpr`1q to a solution of the linear system
\[
(K^TK+D^TD)\,x=K^T\bigl(y_1^{(r)}-b_1^{(r)}\bigr)+D^T\bigl(y_2^{(r)}-b_2^{(r)}\bigr).
\]
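The remaining steps are the componentwise updates from Remark 8.1.5; a sketch of ours (assuming NumPy, and assuming the $y$- and $b$-updates follow the pattern of the remark) is:

```python
import numpy as np

def inpaint_admm(K, D, a, gamma, iters=300):
    # Multi-block ADMM sketch for min_x iota_{a}(Kx) + ||Dx||_1 (Example 8.1.6)
    x = np.zeros(K.shape[1])
    y1 = np.zeros(K.shape[0]); b1 = np.zeros(K.shape[0])
    y2 = np.zeros(D.shape[0]); b2 = np.zeros(D.shape[0])
    M = K.T @ K + D.T @ D
    for _ in range(iters):
        x = np.linalg.solve(M, K.T @ (y1 - b1) + D.T @ (y2 - b2))
        y1 = a                                   # prox of iota_{a} is the constant a
        v = b2 + D @ x
        y2 = np.sign(v) * np.maximum(np.abs(v) - 1.0 / gamma, 0.0)  # S_{1/gamma}
        b1 = b1 + K @ x - y1
        b2 = b2 + D @ x - y2
    return x
```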
Exercises
Using that
\[
h(Ax)=\sup_{p\in\mathbb R^m}\{\langle p,Ax\rangle-h^*(p)\},
\]
Similarly, we have seen in (8.10) that the dual problem can be rewritten as
In order to find a saddle point of $L$, we can use the so-called Arrow-Hurwicz method, i.e., we alternate the application of the proximal mapping of $L(\cdot,p)$ and $-L(x,\cdot)$. This results in the sequences
\begin{align*}
x^{(r+1)}&=\operatorname*{arg\,min}_{x\in\mathbb R^d}\Bigl\{g(x)+\langle p^{(r)},Ax\rangle+\frac{1}{2\tau}\|x-x^{(r)}\|_2^2\Bigr\}=\operatorname{prox}_{\tau g}\bigl(x^{(r)}-\tau A^Tp^{(r)}\bigr),\tag{8.16}\\
p^{(r+1)}&=\operatorname*{arg\,min}_{p\in\mathbb R^m}\Bigl\{h^*(p)-\langle p,Ax^{(r+1)}\rangle+\frac{1}{2\sigma}\|p-p^{(r)}\|_2^2\Bigr\}=\operatorname{prox}_{\sigma h^*}\bigl(p^{(r)}+\sigma Ax^{(r+1)}\bigr).\tag{8.17}
\end{align*}
There are several modifications of this basic primal-dual algorithm which improve its convergence properties, such as
- the predictor corrector proximal multiplier method [CT94],
- the primal dual hybrid gradient method (PDHG) [ZC08] with convergence proof in [BR12],
- the primal dual hybrid gradient method with extrapolation of the primal or dual variable [CP11a, PCCB09], a preconditioned version [CP11b] and a generalization [Con13],
- Douglas-Rachford-type algorithms [BH13, BH14] for solving inclusion equations, see also [CP12, Vũ13], as well as an extension allowing the operator A to be non-linear [Val13].
Here is the algorithm proposed by Chambolle, Cremers and Pock [CP11a, PCCB09], which
is the iteration (8.16) and (8.17) with an extrapolation step for the dual variable p.
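A sketch of the resulting iteration (ours; the prox oracles are assumed to be supplied by the caller, and θ = 1 reproduces the extrapolation p̄^{(r)} = 2p^{(r)} − p^{(r−1)} used in the convergence proof below):

```python
import numpy as np

def primal_dual(prox_tau_g, prox_sigma_hstar, A, tau, sigma, x, p, iters=500, theta=1.0):
    # Iteration (8.16)-(8.17) with an extrapolation step for the dual variable p
    p_bar = p.copy()
    for _ in range(iters):
        x = prox_tau_g(x - tau * (A.T @ p_bar))        # primal step (8.16)
        p_new = prox_sigma_hstar(p + sigma * (A @ x))  # dual step (8.17)
        p_bar = p_new + theta * (p_new - p)            # dual extrapolation
        p = p_new
    return x, p
```

For convergence, the step sizes should satisfy $\tau\sigma\|A\|_2^2<1$, cf. (8.19) below.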
In order to compute the proximity operator $\operatorname{prox}_{\sigma h^*}$, we apply the Moreau decomposition from Theorem 6.2.1. We use the identities $(\sigma h^*)^*(x)=\sigma h^{**}(\sigma^{-1}x)=\sigma h(\sigma^{-1}x)$ and $\operatorname{prox}_{\sigma h(\sigma^{-1}\cdot)}(x)=\sigma\operatorname{prox}_{\sigma^{-1}h}(\sigma^{-1}x)$ from Theorem 6.1.9 and Exercise 2.7.6. Then we obtain
\[
\operatorname{prox}_{\sigma h^*}(x)=x-\sigma\operatorname{prox}_{\sigma^{-1}h}(\sigma^{-1}x)
\]
in (8.18). ˝
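In code, this identity produces the prox of a conjugate from the prox of the original function; a small sketch of ours (assuming NumPy), illustrated with $h=\|\cdot\|_1$, whose conjugate prox is the projection onto the $\ell_\infty$ unit ball:

```python
import numpy as np

def prox_conj(prox_h, x, sigma):
    # Moreau decomposition: prox_{sigma h*}(x) = x - sigma * prox_{(1/sigma) h}(x / sigma);
    # prox_h(v, t) is assumed to evaluate prox_{t h}(v)
    return x - sigma * prox_h(x / sigma, 1.0 / sigma)

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # prox of t * ||.||_1
v = np.array([-2.0, 0.3, 1.5])
print(prox_conj(soft, v, 2.0))  # equals np.clip(v, -1, 1), independently of sigma
```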
Together with the extrapolation step from before, we arrive at the following equivalent
formulation of Algorithm 9.
For $A$ being the identity, $\theta=1$ and $\gamma=\sigma=1/\tau$, the PDHGMp algorithm corresponds to the ADMM as well as to the Douglas-Rachford splitting algorithm derived in Section 8.1.
The following theorem and convergence proof are based on [CP11a].
Lecture 26,
18.07.2024
Theorem 8.2.2: Convergence of PDHGMp
Let $\tau,\sigma>0$ with
\[
\tau\sigma<1/\|A\|_2^2, \tag{8.19}
\]
where $\|A\|_2$ denotes the spectral norm of $A$. Suppose that the Lagrangian $L(x,p):=g(x)-h^*(p)+\langle Ax,p\rangle$ has a saddle point $(x^*,p^*)$. Then the sequence $\{(x^{(r)},p^{(r)})\}_r$ produced by PDHGMp converges to a saddle point of the Lagrangian.
Proof. We restrict the proof to the case θ “ 1. By the update steps (8.16) and (8.17) it
holds
\[
x^{(r+1)}=(I+\tau\partial g)^{-1}\bigl(x^{(r)}-\tau A^T\bar p^{(r)}\bigr),\qquad
p^{(r+1)}=(I+\sigma\partial h^*)^{-1}\bigl(p^{(r)}+\sigma Ax^{(r+1)}\bigr),
\]
i.e.,
\[
\frac{x^{(r)}-x^{(r+1)}}{\tau}-A^T\bar p^{(r)}\in\partial g\bigl(x^{(r+1)}\bigr),\qquad
\frac{p^{(r)}-p^{(r+1)}}{\sigma}+Ax^{(r+1)}\in\partial h^*\bigl(p^{(r+1)}\bigr).
\]
By definition of the subdifferential we obtain for all x P Rd and all p P Rm that
\[
g(x)\ge g(x^{(r+1)})+\frac1\tau\langle x^{(r)}-x^{(r+1)},x-x^{(r+1)}\rangle-\langle A^T\bar p^{(r)},x-x^{(r+1)}\rangle,
\]
\[
h^*(p)\ge h^*(p^{(r+1)})+\frac1\sigma\langle p^{(r)}-p^{(r+1)},p-p^{(r+1)}\rangle+\langle p-p^{(r+1)},Ax^{(r+1)}\rangle,
\]
and by adding the equations, inserting $x=x^*$ and $p=p^*$ and using the saddle point property $L(x^{(r+1)},p^*)-L(x^*,p^{(r+1)})\ge 0$, we obtain
\begin{align*}
0&\ge g(x^{(r+1)})-h^*(p^*)-\bigl(g(x^*)-h^*(p^{(r+1)})\bigr)\\
&\quad-\langle A^T\bar p^{(r)},x^*-x^{(r+1)}\rangle+\langle p^*-p^{(r+1)},Ax^{(r+1)}\rangle\\
&\quad+\frac1\tau\langle x^{(r)}-x^{(r+1)},x^*-x^{(r+1)}\rangle+\frac1\sigma\langle p^{(r)}-p^{(r+1)},p^*-p^{(r+1)}\rangle\\
&=L(x^{(r+1)},p^*)-\langle p^*,Ax^{(r+1)}\rangle-L(x^*,p^{(r+1)})+\langle p^{(r+1)},Ax^*\rangle\\
&\quad-\langle\bar p^{(r)},A(x^*-x^{(r+1)})\rangle+\langle p^*-p^{(r+1)},Ax^{(r+1)}\rangle\\
&\quad+\frac1\tau\langle x^{(r)}-x^{(r+1)},x^*-x^{(r+1)}\rangle+\frac1\sigma\langle p^{(r)}-p^{(r+1)},p^*-p^{(r+1)}\rangle\\
&\ge-\langle p^{(r+1)}-\bar p^{(r)},A(x^{(r+1)}-x^*)\rangle\\
&\quad+\frac1\tau\langle x^{(r)}-x^{(r+1)},x^*-x^{(r+1)}\rangle+\frac1\sigma\langle p^{(r)}-p^{(r+1)},p^*-p^{(r+1)}\rangle.
\end{align*}
By the identity for inner products
\[
\langle x^{(r)}-x^{(r+1)},x^*-x^{(r+1)}\rangle=\frac12\Bigl(\|x^{(r)}-x^{(r+1)}\|_2^2+\|x^*-x^{(r+1)}\|_2^2-\|x^*-x^{(r)}\|_2^2\Bigr)
\]
where we have inserted the definition $\bar p^{(r)}=2p^{(r)}-p^{(r-1)}$. We estimate the last summand using the Cauchy-Schwarz inequality as follows:
Since
\[
2uv\le\alpha u^2+\frac1\alpha v^2,\qquad\alpha>0, \tag{8.20}
\]
we obtain
\begin{align*}
\|A\|_2\,\|x^{(r+1)}-x^{(r)}\|_2\,\|p^{(r)}-p^{(r-1)}\|_2
&\le\frac{\|A\|_2}{2}\Bigl(\alpha\|x^{(r+1)}-x^{(r)}\|_2^2+\frac1\alpha\|p^{(r)}-p^{(r-1)}\|_2^2\Bigr)\\
&=\frac{\|A\|_2\,\alpha\tau}{2\tau}\|x^{(r+1)}-x^{(r)}\|_2^2+\frac{\|A\|_2\,\sigma}{2\alpha\sigma}\|p^{(r)}-p^{(r-1)}\|_2^2.
\end{align*}
With $\alpha:=\sqrt{\sigma/\tau}$ the relation
\[
\|A\|_2\,\alpha\tau=\frac{\|A\|_2\,\sigma}{\alpha}=\|A\|_2\sqrt{\sigma\tau}<1
\]
holds true.
holds true. Thus, we get
1 ˚ 1 ˚
}x ´ xprq }22 ` }p ´ pprq }22
2τ 2σ
1 ˚ 1 ˚
ě }x ´ xpr`1q }22 ` }p ´ ppr`1q }22
2τ 2σ ?
1 ? 1 pr`1q }A}2 στ prq
` p1 ´ }A}2 στ q}xpr`1q ´ xprq }22 ` }p ´ pprq }22 ´ }p ´ ppr´1q }22
2τ 2σ 2σ
` xppr`1q ´ pprq , Apx˚ ´ xpr`1q qy ´ xpprq ´ ppr´1q , Apx˚ ´ xprq qy. (8.21)
By
\[
\langle p^{(N)}-p^{(N-1)},A(x^{(N)}-x^*)\rangle\ge-\frac{1}{2\sigma}\|p^{(N)}-p^{(N-1)}\|_2^2-\frac{\|A\|_2^2\sigma\tau}{2\tau}\|x^{(N)}-x^*\|_2^2
\]
this can be further estimated as
\begin{align*}
&\frac{1}{2\tau}\|x^*-x^{(0)}\|_2^2+\frac{1}{2\sigma}\|p^*-p^{(0)}\|_2^2\\
&\ge\frac{1}{2\tau}\bigl(1-\|A\|_2^2\sigma\tau\bigr)\|x^*-x^{(N)}\|_2^2+\frac{1}{2\sigma}\|p^*-p^{(N)}\|_2^2\\
&\quad+\bigl(1-\|A\|_2\sqrt{\sigma\tau}\bigr)\Bigl(\frac{1}{2\tau}\sum_{r=1}^{N}\|x^{(r)}-x^{(r-1)}\|_2^2+\frac{1}{2\sigma}\sum_{r=1}^{N-1}\|p^{(r)}-p^{(r-1)}\|_2^2\Bigr). \tag{8.22}
\end{align*}
By (8.22) we conclude that the sequence $\{(x^{(n)},p^{(n)})\}_n$ is bounded. Thus, there exists a subsequence $\{(x^{(n_j)},p^{(n_j)})\}_j$ which converges to some point $(\hat x,\hat p)$ as $j\to\infty$. Further, we see by (8.22) that
\[
\lim_{n\to\infty}\bigl(x^{(n)}-x^{(n-1)}\bigr)=0,\qquad\lim_{n\to\infty}\bigl(p^{(n)}-p^{(n-1)}\bigr)=0.
\]
Consequently,
\[
\lim_{j\to\infty}\bigl(x^{(n_j-1)}-\hat x\bigr)=0,\qquad\lim_{j\to\infty}\bigl(p^{(n_j-1)}-\hat p\bigr)=0
\]
holds true. Let $T$ denote the iteration operator of the PDHGMp cycles, i.e., $T(x^{(r)},p^{(r)})=(x^{(r+1)},p^{(r+1)})$. Since $T$ is the concatenation of affine operators and proximity operators, it is continuous. Now we have $T\bigl(x^{(n_j-1)},p^{(n_j-1)}\bigr)=\bigl(x^{(n_j)},p^{(n_j)}\bigr)$ and taking the limit for $j\to\infty$ we see that $T(\hat x,\hat p)=(\hat x,\hat p)$, so that $(\hat x,\hat p)$ is a fixed point of the iteration and thus a saddle point of $L$. Using this particular saddle point in (8.21) and summing up from $r=n_j$ to $N-1$, $N>n_j$, we obtain
\begin{align*}
&\frac{1}{2\tau}\|\hat x-x^{(n_j)}\|_2^2+\frac{1}{2\sigma}\|\hat p-p^{(n_j)}\|_2^2\\
&\ge\frac{1}{2\tau}\|\hat x-x^{(N)}\|_2^2+\frac{1}{2\sigma}\|\hat p-p^{(N)}\|_2^2\\
&\quad+\bigl(1-\|A\|_2\sqrt{\sigma\tau}\bigr)\Bigl(\frac{1}{2\tau}\sum_{r=n_j}^{N-1}\|x^{(r+1)}-x^{(r)}\|_2^2+\frac{1}{2\sigma}\sum_{r=n_j+1}^{N-1}\|p^{(r)}-p^{(r-1)}\|_2^2\Bigr)\\
&\quad+\frac{1}{2\sigma}\|p^{(N)}-p^{(N-1)}\|_2^2-\frac{\|A\|_2\sqrt{\sigma\tau}}{2\sigma}\|p^{(n_j)}-p^{(n_j-1)}\|_2^2\\
&\quad+\langle p^{(N)}-p^{(N-1)},A(\hat x-x^{(N)})\rangle-\langle p^{(n_j)}-p^{(n_j-1)},A(\hat x-x^{(n_j)})\rangle
\end{align*}
and further
\begin{align*}
&\frac{1}{2\tau}\|\hat x-x^{(n_j)}\|_2^2+\frac{1}{2\sigma}\|\hat p-p^{(n_j)}\|_2^2\\
&\ge\frac{1}{2\tau}\|\hat x-x^{(N)}\|_2^2+\frac{1}{2\sigma}\|\hat p-p^{(N)}\|_2^2\\
&\quad+\frac{1}{2\sigma}\|p^{(N)}-p^{(N-1)}\|_2^2-\frac{\|A\|_2\sqrt{\sigma\tau}}{2\sigma}\|p^{(n_j)}-p^{(n_j-1)}\|_2^2\\
&\quad+\langle p^{(N)}-p^{(N-1)},A(\hat x-x^{(N)})\rangle-\langle p^{(n_j)}-p^{(n_j-1)},A(\hat x-x^{(n_j)})\rangle.
\end{align*}
For $j\to\infty$ this implies that $(x^{(N)},p^{(N)})$ also converges to $(\hat x,\hat p)$ and we are done. $\square$
Similarly to the parallel DRS from Algorithm 7 and to Remark 8.1.5, PDHGMp can be applied for minimizing functionals with more than two terms.
Remark 8.2.3 We derive the PDHGMp algorithm for functionals of the form
\[
f(x)=g(x)+\sum_{i=1}^n h_i(A_ix).
\]
Using the primal-dual algorithm, we can now solve the L2 -TV problem as considered in
Example 4.3.2, Example 8.1.4 and Example 8.1.6 directly without solving a linear system.
Example 8.2.4 Let $K\in\mathbb R^{n,d}$ and $D\in\mathbb R^{m,d}$ be matrices. Then, the PDHGMp for minimizing the functional
\[
\mathcal J(x)=\underbrace{\tfrac12\|Kx-a\|_2^2}_{=:h_1(Kx)}+\underbrace{\lambda\|Dx\|_1}_{=:h_2(Dx)},\qquad g(x)=0,
\]
reads as
\begin{align*}
x^{(r+1)}&=\operatorname{prox}_{\tau g}\bigl(x^{(r)}-\tau\sigma A^T\bar b^{(r)}\bigr)=x^{(r)}-\tau\sigma K^T\bar b_1^{(r)}-\tau\sigma D^T\bar b_2^{(r)},\\
y_1^{(r+1)}&=\operatorname{prox}_{\frac{1}{\sigma}h_1}\bigl(b_1^{(r)}+Kx^{(r+1)}\bigr)=\frac{1}{\sigma+1}\Bigl(\sigma\bigl(b_1^{(r)}+Kx^{(r+1)}\bigr)+a\Bigr),\\
y_2^{(r+1)}&=\operatorname{prox}_{\frac{1}{\sigma}h_2}\bigl(b_2^{(r)}+Dx^{(r+1)}\bigr)=S_{\frac{\lambda}{\sigma}}\bigl(b_2^{(r)}+Dx^{(r+1)}\bigr).
\end{align*}
Since every step can be computed in closed form, this is a more efficient algorithm than the one derived in Example 8.1.4. Similarly, if $h_1=\iota_{\{a\}}$, then we replace the $y_1$-update by $y_1^{(r+1)}=a$. ˛
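For completeness, a sketch of ours of these iterations (assuming NumPy; the scaled dual updates $b_i\leftarrow b_i+A_ix-y_i$ with extrapolation $\bar b_i=2b_i^{\text{new}}-b_i^{\text{old}}$ are our assumption for the algorithm lines not displayed above; step sizes should satisfy $\tau\sigma\|(K^T,D^T)^T\|_2^2<1$, cf. (8.19)):

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def pdhgmp_l2tv(K, D, a, lam, tau, sigma, iters=500):
    # PDHGMp sketch for J(x) = 0.5*||Kx - a||^2 + lam*||Dx||_1 (Example 8.2.4)
    x = np.zeros(K.shape[1])
    b1 = np.zeros(K.shape[0]); bb1 = b1
    b2 = np.zeros(D.shape[0]); bb2 = b2
    for _ in range(iters):
        x = x - tau * sigma * (K.T @ bb1 + D.T @ bb2)    # prox of g = 0 is the identity
        y1 = (sigma * (b1 + K @ x) + a) / (sigma + 1.0)  # prox_{(1/sigma) h1}
        y2 = soft(b2 + D @ x, lam / sigma)               # prox_{(1/sigma) h2}
        b1n = b1 + K @ x - y1
        b2n = b2 + D @ x - y2
        bb1, bb2 = 2 * b1n - b1, 2 * b2n - b2            # extrapolation
        b1, b2 = b1n, b2n
    return x
```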
Index

A
affine combination, 4
affine function, 6
affine hull, 4
affine set, 4
affinely independent, 5
alternating direction method of multipliers, 144
alternating projection algorithm, 87
angle, 19
asymptotically regular, 80
augmented Lagrangian, 144

B
barycentric coordinates, 6

C
canonical embedding, 18
closed function, 29
closed halfspace, 7
closure, 29
closure operator, 10, 31
coercive, 53
cone, 16
contractive, 77
convex combination, 2
convex cone, 16
convex function, 32
convex hull, 2
convex set, 2

D
domain, 103
dual cone, 18
duality gap, 127

E
effective domain, 27
epigraph, 27

F
feasible set, 133
Fenchel-Young identity, 115
FISTA, 90
fixed point, 80
Fréchet differentiability, 40

G
Gateaux derivative, 38
Gateaux gradient, 39
Gateaux Hessian, 39
general Uzawa method, 144
global minimizer, 53
Gordon theorem, 141
graph, 103

H
hyperplane, 7

I
indicator function, 26
infimal convolution, 58
inverse problems, 97
iterative soft-thresholding algorithm, 89

K
Krasnoselskii-Mann iteration, 77

L
Lagrangian, 133, 143
λ-convex, 32
level set, 27
level-bounded, 53
linear programs, 132
local minimizer, 53
lower semicontinuous, 27
lower semicontinuous hull, 29

M
MAP estimator, 102
maximal monotone, 104
method of multipliers, 144
Minkowski sum, 3
monotone, 104
Moreau decomposition, 118
Moreau envelope, 61

N
NESTA, 90
Nesterov algorithm, 90
non-expansive, 77
normal cone, 21

O
one-sided directional derivative, 38
orthogonal projection, 23

P
Parallel DRS, 95
polar cone, 18
polyhedral set, 7
predictor corrector proximal multiplier method, 150
primal dual hybrid gradient method (PDHG), 150
proper function, 27
proximal mapping, 61

R
relative interior, 8
resolvent, 107

S
saddle point, 138, 143
simplex, 6
soft-shrinkage, 24
soft-shrinkage operator, 61
strict epigraph, 27
strictly convex function, 32
strong duality, 127
strongly convex, 32
strongly monotone, 104
subdifferential, 66
supercoercive, 53
support function, 123
supporting halfspace, 15
supporting hyperplane, 15

T
tangential cone, 20
Tikhonov regularization, 98
total variation, 100

U
uniformly convex function, 32
unit simplex, 6
References
[AB99] Charalambos Dionisios Aliprantis and Kim Border. Infinite Dimensional Anal-
ysis - A Hitchhiker’s Guide. Springer Science & Business Media, Berlin Heidel-
berg, 2 edition, 1999.
[AB09] H. Attouch and J. Bolte. On the convergence of the proximal algorithm for
nonsmooth functions involving analytic features. Math. Program., 116(1-2):5–
16, 2009.
[ABS13] H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Program. Series A, 137(1-2):91–129, 2013.
[AE06] Jean-Pierre Aubin and Ivar Ekeland. Applied nonlinear analysis. Courier Cor-
poration, 2006.
[AF09] Jean-Pierre Aubin and Hélène Frankowska. Set-valued analysis. Birkhäuser
Boston, MA, 1 edition, 2009.
[AHU58] K. J. Arrow, L. Hurwicz, and H. Uzawa. Studies in Linear and Nonlinear Programming. Stanford University Press, 1958.
[Alb96] Ya I Alber. Metric and generalized projection operators in Banach spaces: properties and applications. In Athanass Kartsatos, editor, Theory and Applications of Nonlinear Operators of Accretive and Monotone Type, volume 178, chapter 2, pages 15–50. CRC Press, 1996.
[AM19] Francisco J. Aragón Artacho and Paula Segura Martínez. Finding magic squares with the Douglas-Rachford algorithm, 2019.
[AO16] Aram V Arutyunov and Valeri Obukhovskii. Convex and Set-Valued Analysis.
De Gruyter, 2016.
[Bačá14] Miroslav Bačák. Computing medians and means in Hadamard spaces. SIAM Journal on Optimization, 24(3):1542–1566, 2014.
[Bar02] Alexander Barvinok. A course in convexity, volume 54. American Mathematical
Society, 2002.
[BB88] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA
Journal of Numerical Analysis, 8:141–148, 1988.
[BBC11] S. Becker, J. Bobin, and Emmanuel Jean Candès. NESTA: a fast and accurate first-order method for sparse recovery. SIAM Journal on Imaging Sciences, 4(1):1–39, 2011.
[BC11] Heinz H. Bauschke and Patrick Louis Combettes. Convex analysis and monotone
operator theory in Hilbert spaces, volume 408. Springer, 2011.
[BGLS95] Joseph Frédéric Bonnans, J. C. Gilbert, C. Lemaréchal, and C. A. Sagastizábal.
A family of variable metric proximal methods. Mathematical Programming,
68:15–47, 1995.
[BH13] Radu Ioan Boţ and Christopher Hendrich. A Douglas-Rachford type primal-dual
method for solving inclusions with mixtures of composite and parallel-sum type
monotone operators. SIAM Journal on Optimization, 23(4):2541–2565, 2013.
[BH14] Radu Ioan Boţ and Christopher Hendrich. Convergence analysis for a primal-
dual monotone + skew splitting algorithm with applications to total vari-
ation minimization. Journal of Mathematical Imaging and Vision, 2014.
arXiv:1211.1706.
[BL11] Kristian Bredies and Dirk Lorenz. Mathematische Bildverarbeitung, volume 1.
Springer, 2011.
[BL12] Heinz H. Bauschke and Yves Lucet. What is a Fenchel conjugate? Notices of the AMS, 59(1):44–46, 2012.
[BL18] Kristian Bredies and Dirk Lorenz. Mathematical Image Processing. Springer,
2018.
[Bol14] D. Boley. Local linear convergence of the alternating direction method of mul-
tipliers on quadratic or linear programs. SIAM J. Optim., 2014.
[Bon19] Joseph Frédéric Bonnans. Convex and Stochastic Optimization. Springer, 2019.
[Bou30] Georges Louis Bouligand. Sur les surfaces dépourvues de points hyperlimites.
Ann. Soc. Polon. Math, 9:32–41, 1930.
[BP12] Viorel Barbu and Teodor Precupanu. Convexity and optimization in Banach
spaces. Springer Science & Business Media, 2012.
[BPC+11] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):101–122, 2011.
[BQ99] J. V. Burke and M. Qian. A variable metric proximal point algorithm for mono-
tone operators. SIAM Journal on Control and Optimization, 37:353–375, 1999.
[BR12] Silvia Bonettini and Valeria Ruggiero. On the convergence of primal-dual hybrid
gradient algorithms for total variation image restoration. J. Math. Imaging Vis.,
44:236–253, 2012.
[Bre65] Lev Meerovich Bregman. Finding the common point of convex sets by the
method of successive projection. Doklady Akademii Nauk, 162(3):487–490, 1965.
[Bré11] Haïm Brézis. Functional analysis, Sobolev spaces and partial differential equa-
tions. Universitext. Springer, 2011.
[BS13] Joseph Frédéric Bonnans and Alexander Shapiro. Perturbation analysis of opti-
mization problems. Springer Science & Business Media, 2013.
[BST13] J. Bolte, S. Sabach, and M. Teboulle. Proximal alternating linearized mini-
mization for nonconvex and nonsmooth problems. Math. Program., Series A,
2013.
[BT09a] A. Beck and M. Teboulle. Fast gradient-based algorithms for constrained total
variation image denoising and deblurring. SIAM Journal on Imaging Sciences,
2:183–202, 2009.
[BT09b] Amir Beck and Marc Teboulle. Fast gradient-based algorithms for constrained
total variation image denoising and deblurring problems. IEEE transactions on
image processing, 18(11):2419–2434, 2009.
[BV04] Stephen Poythress Boyd and Lieven Vandenberghe. Convex optimization. Cam-
bridge university press, 2004.
[BV10] Jonathan M Borwein and Jon D. Vanderwerff. Convex functions: constructions,
characterizations and counterexamples, volume 172. Cambridge University Press
Cambridge, 2010.
[Byr04] Charles Byrne. A unified treatment of some iterative algorithms in signal pro-
cessing and image reconstruction. Inverse Problems, 20:103–120, 2004.
[CG59] Ward Cheney and Allen A. Goldstein. Proximity maps for convex sets. Proceed-
ings of the American Mathematical Society, 10(3):448–450, 1959.
[Cha09] Chidume Charles. Geometric properties of Banach spaces and nonlinear itera-
tions. Springer, 2009.
[Cla90a] Frank H. Clarke. Optimization and Nonsmooth Analysis. Society for Industrial
and Applied Mathematics, 1990.
[Cla90b] Frank H. Clarke. Optimization and Nonsmooth Analysis. Society for Industrial
and Applied Mathematics, 1990.
[Com04] Patrick Louis Combettes. Solving monotone inclusions via compositions of non-
expansive averaged operators. Optimization, 53(5–6):475–504, 2004.
[Con13] L. Condat. A primal-dual splitting method for convex optimization involving
Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl.,
158(2):460–479, 2013.
[CP11a] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex
problems with applications to imaging. Journal of Mathematical Imaging and
Vision, 40(1):120–145, 2011.
[CP11b] Antonin Chambolle and Thomas Pock. Diagonal preconditioning for first order
primal-dual algorithms in convex optimization. ICCV, pages 1762 – 1769, 2011.
[CP11c] Patrick Louis Combettes and Jean-Christophe Pesquet. Proximal splitting meth-
ods in signal processing. In Fixed-point algorithms for inverse problems in sci-
ence and engineering, pages 185–212. Springer, 2011.
[CP12] Patrick Combettes and Jean-Christoph Pesquet. Primal-dual splitting algorithm
for solving inclusions with mixture of composite, Lipschitzian, and parallel-sum
type monotone operators. Set-Valued and Variational Analysis, 20(2):307–330,
2012.
[CPR13] E. Chouzenoux, J.-C. Pesquet, and A. Repetti. Variable metric forward-