Lectures On AM
Fall Semester
I Applied Analysis
2 Complex Analysis
2.1 Complex Variables and Complex-valued Functions
2.1.1 The Cartesian Representation of Complex Variables
2.1.2 The Polar Representation of Complex Variables
2.1.3 Parameterization of Curves in the Complex Plane
2.1.4 Functions of a Complex Variable
2.1.5 Complex Exponentials
2.1.6 Multi-valued Functions and Branch Cuts
2.2 Analytic Functions and Integration along Contours
2.2.1 Analytic Functions
2.2.2 Integration along Contours
2.2.3 Cauchy's Theorem
2.2.4 Cauchy's Formula
2.2.5 Laurent Series
2.3 Residue Calculus
2.3.1 Singularities and Residues
2.3.2 Evaluation of Real-valued Integrals by Contour Integration
2.3.3 Contour Integration with Multi-valued Functions
2.4 Extreme-, Stationary- and Saddle-Point Methods ∗
3 Fourier Analysis
3.1 The Fourier Transform and Inverse Fourier Transform
3.2 Properties of the 1-D Fourier Transform
3.3 Dirac's δ-function
3.3.1 The δ-function as the limit of a δ-sequence
3.3.2 Properties of the δ-function
3.3.3 Using δ-functions to Prove Properties of Fourier Transforms
3.3.4 The δ-function in Higher Dimensions
3.3.5 Formal Differentiation: The Heaviside Function and the Derivatives of the δ-function
3.4 Closed form representations for select Fourier Transforms
3.4.1 Elementary examples of closed form representations
3.4.2 More advanced examples of closed form representations
3.4.3 Closed form representations in higher dimensions
3.5 Fourier Series: Introduction
3.6 Properties of the Fourier Series
3.7 Riemann-Lebesgue Lemma
3.8 Gibbs Phenomenon
3.9 Laplace Transform
3.9.1 Integral Representations and Asymptotics of Special Functions
3.10 From Differential to Algebraic Equations with FT, FS and LT
II Differential Equations
Every student in the Program for Applied Mathematics at the University of Arizona takes
the same three core courses during their first year of study. These three courses are called
Methods (Math 581), Theory (Math 584), and Algorithms (Math 589). Each course presents
a different expertise, or ‘toolbox’ of competencies, for approaching problems in modern
applied mathematics. The courses are designed to discuss many of the same topics, often
synchronously (Fig. 1.1). This allows them to better illustrate the potential contributions
of each toolbox, and also to provide a richer understanding of applied mathematics.
The material discussed in the courses includes topics that are taught in traditional applied
mathematics curricula (like differential equations) as well as topics that promote a modern
perspective of applied mathematics (like optimization, control and elements of computer
science and statistics). All the material is carefully chosen to reflect what we believe is most
relevant now and in the future.
Figure 1.1: Topics covered in Theory (blue), Methods (red) and Algorithms (green) during
the Fall semester (columns 1 & 2) and Spring semester (columns 3 & 4).

The essence of the core courses is to develop the different toolboxes available in applied
mathematics. When we're lucky, we can find exact solutions to a problem by applying
powerful (but typically very specialized) techniques, or methods. More often, we must
formulate solutions algorithmically, and find approximate solutions using numerical sim-
ulations and computation. Understanding the theoretical aspects of a problem motivates
better design and implementation of these methods and algorithms, and allows us to make
precise statements about when and how they will work.
The core courses discuss a wide array of mathematical content that represents some of
the most interesting and important topics in applied mathematics. The broad exposure
to different mathematical material often helps students identify specific areas for further
in-depth study within the program. The core courses do not (and cannot) satisfy the in-
depth requirements for a dissertation, and students must take more specialized courses and
conduct independent study in their areas of interest.
Furthermore, the courses do not (and cannot) cover all subjects comprising applied
mathematics. Instead, they provide a (somewhat!) minimal, self-consistent, and admittedly
subjective (due to our own expertise and biases) selection of the material that we believe
students will use most during and after their graduate work. In this introductory chapter
of the lecture notes, we aim to present our viewpoint on what constitutes modern applied
mathematics, and to do so in a way that unifies seemingly unrelated material.
Figure 1.2: The key components studied under the umbrella of applied mathematics: (1)
mathematical science (e.g. differential equations, real & functional analysis, optimization,
probability, numerical analysis), (2) domain-specific knowledge (e.g. physical sciences,
biological sciences, social sciences, engineering), and (3) a few 'math-adjacent' disciplines
(physics, statistics, computer science, data science).
1. Formulating the problem, first casually, i.e. in terms standard in sciences and engi-
neering, and then transitioning to a proper mathematical formulation;
2. Analyzing the problem by “all means available”, including theory, method and algo-
rithm toolboxes developed within applied mathematics;
Making contributions to a specific domain that are truly valuable requires more than
just mathematical expertise. Domain-specific understanding may change our perspective on
what constitutes a solution. For example, whenever system parameters are no longer 'nice'
but must be estimated from measurement or experimental data, it becomes more difficult
to find meaning in the solutions, and it becomes more important, and challenging, to
estimate the uncertainty in those solutions. Similarly, whenever a system couples many sub-
systems at scale, it may no longer be possible to interpret the exact expressions (if they
can be computed at all), and approximate, or 'effective', solutions may be more meaningful.
In every domain-specific application, it is important to know what problems are most urgent,
and what kinds of solutions are most valuable.
Mathematics is not the only field capable of making valuable contributions to other
domains, and we think specifically of physics, statistics and computer science as other fields
that have each developed their own frameworks, philosophies, and intuitions for describing
problems and their solutions. This is particularly evident with the recent developments in
data science. The recent deluge of data has brought a wealth of opportunity in engineering,
and in the physical, natural and social sciences where there have been many open problems
that could only be addressed empirically. Physics, statistics, and computer science have
become fundamental pillars of data science, in part, because each of these ’math-adjacent’
disciplines provides a way to analyze and interpret this data constructively. Nonetheless,
there are many unresolved challenges ahead, and we believe that a mixture of mathematical
insight and some intuition from these adjacent disciplines may help resolve these challenges.
Problem Formulation
We will rely on a diverse array of instructional examples from different areas of science and
engineering to illustrate how to translate a rather vaguely stated scientific or engineering
phenomenon into a crisply stated mathematical challenge. Some of these challenges will be
resolved, and some will stay open for further research. We will be referring to instructional
examples, such as the Kirchhoff and the Kuramoto-Sivashinsky equations for power systems,
the Navier-Stokes equations for fluid dynamics, network flow equations, the Fokker-Planck
equation from statistical mechanics, and constrained regression from data science.
Problem Analysis
We analyze problems extracted from applications by all means possible, which requires
both domain-specific intuition and mathematical knowledge. We can often make precise
statements about the solutions of a problem without actually solving the problem in the
mathematical sense. Dimensional analysis from physics is an example of this type of pre-
liminary analysis that is helpful and useful. We may also identify certain properties of the
solutions by analyzing any underlying symmetries and establishing the correct principal
behaviors expected from the solutions; some important examples involve oscillatory behav-
ior (waves), diffusive behavior, and dissipative/decaying vs. conservative behaviors. One
can also extract a lot from analyzing the different asymptotic regimes of a problem, say
when a parameter becomes small, making the problem easier to analyze. Matching different
asymptotic solutions can give a detailed, even though ultimately incomplete, description.
Solution Construction
Applied Analysis
Chapter 2
Complex Analysis
The real number system is somewhat “deficient” in the sense that not all operations are
allowed for all real numbers. For example, taking arbitrary roots of negative numbers is
not allowed in the real number system. This deficiency can be remedied by defining the
imaginary unit, i := √−1. A number that is a real multiple of the imaginary unit, for
example 3i, i/2 or −πi, is called an imaginary number. A number that has both a real and
an imaginary component is called a complex number.
Complex analysis is the branch of mathematics that investigates functions of complex
variables. A fundamental premise of complex analysis is that most operations between real
numbers have natural extensions to complex numbers, and that most real-valued functions
have natural extensions to complex-valued functions. Interestingly, the natural extensions of
even the most elementary functions can lead to a richness that often admits new techniques
for problem solving.
Complex analysis provides useful tools for many other areas of mathematics (both pure
and applied), as well as for physics (including the branches of hydrodynamics, thermody-
namics, and particularly quantum mechanics), and engineering fields (such as aerospace,
mechanical and electrical engineering).
Figure 2.1: Complex numbers, e.g. z1 = 1 + 2i and z2 = 4 − i, can be visualized as vectors
in R^2. By convention, the real component is plotted on the horizontal axis, and the
imaginary component is plotted on the vertical axis. The addition of two complex numbers
is reminiscent of vector addition in R^2.
The addition and subtraction of complex numbers are direct generalizations of their
real-valued counterparts.
The multiplication and division of complex numbers are also direct generalizations of
their real-valued counterparts with the additional definition, i^2 = −1.
(b) To compute z1 /z2 , we first multiply it by z2∗ /z2∗ , so that the denominator, z2 z2∗ , is a
real number,
z1/z2 = (z1 z2^*)/(z2 z2^*)
      = ((−1 + 2i)(4 + 3i))/((4 − 3i)(4 + 3i))
      = (−4 − 3i + 8i + 6i^2)/(16 − 12i + 12i + 9)
      = (−10 + 5i)/25
      = −2/5 + (1/5) i.
Complex conjugates
Theorem 2.1.4. Consider a sequence of algebraic operations (addition, multiplication,
division and exponentiation) over the n complex numbers z1, . . . , zn, with result w. If the
same operations are applied in the same order to z1^*, . . . , zn^*, then the result will be w^*.
Example 2.1.5. Let us illustrate Theorem 2.1.4 with the example of a quadratic equation,
az^2 + bz + c = 0, where the coefficients a, b and c are real. Direct application of Theorem
2.1.4 to this example shows that if the equation has a root, then its complex conjugate is
also a root, which is obviously consistent with the quadratic formula for the roots,
z_{1,2} = (−b ± √(b^2 − 4ac))/(2a).
Exercise 2.1. Use Theorem 2.1.4 to show that the roots of a polynomial of arbitrary order
with real-valued coefficients occur in complex conjugate pairs.
Example 2.1.6. Find all the roots of the polynomial, p(z) = z^4 − 6z^3 + 11z^2 − 2z − 10,
given that one of its roots is 2 − i.
Solution. We observe that p(z) has real-valued coefficients, so its roots occur in conjugate
pairs; given that z1 = 2 − i is a root, then z2 = 2 + i must also be a root, which we verify
by evaluation. We factorize p(z) as p(z) = (z − z1)(z − z2)r(z), where we find r(z) by
polynomial division, giving r(z) = z^2 − 2z − 2. Therefore, the remaining two roots of p(z)
are found by solving z^2 − 2z − 2 = 0, giving z_{3,4} = 1 ± √3.
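A quick numerical sanity check is possible here. The following sketch (in Python, assuming numpy is available; the choice of language and helper calls are ours, not prescribed by the text) recovers all four roots at once and confirms the given root:

```python
import numpy as np

# Coefficients of p(z) = z^4 - 6z^3 + 11z^2 - 2z - 10, highest degree first.
p = [1, -6, 11, -2, -10]

print(np.sort_complex(np.roots(p)))
# ~ [1 - sqrt(3), 2 - 1j, 2 + 1j, 1 + sqrt(3)]: the conjugate pair appears together.

# Verify that the given root 2 - i indeed annihilates p (up to rounding).
print(abs(np.polyval(p, 2 - 1j)))  # ~ 1e-14
```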
Example 2.1.7. Let z1 = x1 + iy1 and z2 = x2 + iy2 . Show that if ω = z1 /z2 , then
ω ∗ = z1∗ /z2∗ .
Figure 2.2: A complex number, z (e.g. z = 4 + 3i), has both a Cartesian representation
(shown in blue) and a polar representation (shown in orange). Its modulus, denoted by |z|
or r, is non-negative and satisfies |z|^2 = r^2 := z z^* = x^2 + y^2. Its argument, denoted
by θ, is the angle measured modulo 2π, counter-clockwise from the positive real axis, with
x = r cos(θ) and y = r sin(θ).
We must find ω^* and verify that it is equivalent to z1^*/z2^*. First compute ω,

ω = (x1 + iy1)/(x2 + iy2) = ((x1 + iy1)(x2 − iy2))/((x2 + iy2)(x2 − iy2))
  = (x1 x2 + y1 y2)/(x2^2 + y2^2) + i (x2 y1 − x1 y2)/(x2^2 + y2^2).

Now compute ω^*,

ω^* = (x1 x2 + y1 y2)/(x2^2 + y2^2) − i (x2 y1 − x1 y2)/(x2^2 + y2^2)
    = ((x1 − iy1)(x2 + iy2))/((x2 − iy2)(x2 + iy2)) = (x1 − iy1)/(x2 − iy2),

which is equivalent to z1^*/z2^*, as required.
When expressed in their polar representations, the multiplication of two complex numbers
is simplified by elementary trigonometric identities. For z1 = r1 cos θ1 + i r1 sin θ1 and
z2 = r2 cos θ2 + i r2 sin θ2, their product is

z1 z2 = r1 r2 (cos θ1 cos θ2 − sin θ1 sin θ2) + i r1 r2 (cos θ1 sin θ2 + sin θ1 cos θ2)
      = r1 r2 cos(θ1 + θ2) + i r1 r2 sin(θ1 + θ2).
We make two observations: first, the modulus of the product of two complex numbers
is the product of their moduli, that is, |z1 z2| = |z1| |z2|; and second, the argument of the
product is the sum of the arguments, that is, arg(z1 z2) = arg(z1) + arg(z2). These
observations, which are reminiscent of the multiplication of real-valued exponential
functions, motivate the definition of the complex-valued exponential function.
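These two observations are easy to confirm numerically; the sketch below (Python with numpy, our illustrative choice of tooling) checks them for a random pair of complex numbers:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
z1, z2 = rng.normal(size=2) + 1j * rng.normal(size=2)

# |z1 z2| = |z1| |z2|
assert np.isclose(abs(z1 * z2), abs(z1) * abs(z2))

# arg(z1 z2) = arg(z1) + arg(z2), modulo 2*pi; comparing on the unit circle
# avoids worrying about np.angle's (-pi, pi] convention.
assert np.isclose(np.exp(1j * np.angle(z1 * z2)),
                  np.exp(1j * (np.angle(z1) + np.angle(z2))))
```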
Euler's (famous) formula, e^{iπ} = −1, follows directly from this definition.
Example 2.1.10. Find r̃ and θ̃ such that the point, ω = 1 + 5i, can be written as
ω = −1 + r̃e^{iθ̃}.
Figure 2.3: Examples of curves in the complex plane. The first two curves (red and blue)
are open and the last two curves (green and orange) are closed. The first and fourth curves
(red and orange) are simple and the second and third curves (blue and green) are not simple
because they self-intersect at points other than the end points.
Solution. Given that 1 + 5i = −1 + r̃e^{iθ̃}, solve for r̃e^{iθ̃} to get 2 + 5i = r̃e^{iθ̃}.
Solve for r̃ and θ̃ to get r̃ = √((2 + 5i)(2 − 5i)) = √29 ≈ 5.39 and θ̃ = tan^{−1}(5/2) ≈ 1.19 rad.
Therefore, ω ≈ −1 + 5.39 e^{1.19i}.
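The same conversion can be done with numpy's abs and angle (a minimal sketch; the variable names are ours, for illustration only):

```python
import numpy as np

w = 1 + 5j
shifted = w + 1               # r*exp(i*theta) must equal w - (-1) = 2 + 5i
r, theta = abs(shifted), np.angle(shifted)
print(r, theta)               # ~ 5.385 (= sqrt(29)), ~ 1.190 rad
assert np.isclose(-1 + r * np.exp(1j * theta), w)
```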
Example 2.1.11. Express z := (2 + 2i)e^{−iπ/6} by its (a) Cartesian and (b) polar
representations.
Solution.
(a) z = (2 + 2i)(cos(−π/6) + i sin(−π/6))
     = (2 cos(−π/6) − 2 sin(−π/6)) + i (2 cos(−π/6) + 2 sin(−π/6))
     = (1 + √3) + i(√3 − 1).
(b) z = (2 + 2i)e^{−iπ/6} = 2√2 e^{iπ/4} e^{−iπ/6} = 2√2 e^{iπ/12}.
Definition 2.1.12. A curve in the complex plane is a set of points z(t), where a ≤ t ≤ b,
for some a ≤ b. We say that the curve is closed if z(a) = z(b), and simple if it does not
self-intersect, except possibly at the end-points; that is, the curve is simple if z(t) ≠ z(t′)
for t ≠ t′ and a < t, t′ < b. A curve is called a contour if it is continuous and piece-
wise smooth. By convention, all simple, closed contours are parameterized to be traversed
counter-clockwise, unless stated otherwise.
Figure 2.4: Parameterized curves for example 2.1.13. Red: the infinite 'vertical' line
passing through π/2 is parameterized by z(s) = π/2 + is for −∞ < s < ∞. Green: the
semi-infinite ray extending from the point z = −1 and passing through √3 i is parameterized
by z(ρ) = −1 + ρe^{iπ/3} for 0 ≤ ρ < ∞. Blue: the circular arc of radius ε centered at 0 is
parameterized by z(τ) = εe^{iτ} for 0 ≤ τ ≤ 2π.
Solution.
Complex numbers can be viewed as the completion of our notion of number: a class of
numbers that is closed under all possible algebraic operations. What this means is that any
algebraic operation between two complex numbers is guaranteed to return another complex number.
This is not generally true for other classes of numbers, for example,
i. The addition of two positive integers is guaranteed to be another positive integer, but
the subtraction of two positive integers is not necessarily a positive integer. Therefore,
we say that the positive integers are closed under addition but are not closed under
subtraction.
ii. The class of all integers is closed under subtraction and also multiplication. However
the integers are not closed under division because the quotient of two integers is not
necessarily another integer.
iii. The rational numbers are closed under division. However the process of taking limits
of rational numbers may lead to numbers that are not rational, so real numbers are
needed if we require a system that is closed under limits.
iv. Taking non-integer powers of negative numbers does not yield a real number. The
class of complex numbers must be introduced to have a system that is closed under
this operation.
Moreover, one finds that the class of complex numbers is also closed under the operations of
finding roots of algebraic equations, of taking logarithms, and others. We conclude with the
happy statement that the class of complex numbers is closed under all of these operations.
A function of a complex variable, w = f (z), maps the complex number z to the complex
number w. That is, f maps a point in the z-complex plane to a point (or points) in the
w-complex plane. Since both z and w have a Cartesian representation, this means that
every function of a complex variable can be expressed as two real-valued functions of two
real variables, f (z) := u(x, y) + iv(x, y).
In Eq. (2.1) we motivated the definition of the exponential function, f(z) = e^z, with the
intention to preserve the property that e^{z1+z2} = e^{z1} e^{z2}, and incidentally that
e^1 = 2.718… . This is not the only property we could have chosen to motivate the definition
of e^z. We could have chosen to preserve any of the following properties:
• the function's Taylor series, ∑_{n=0}^∞ z^n/n!;
• the fact that f(z) = e^z solves the ordinary differential equation, f′(z) = f(z),
subject to f(0) = 1.
We encourage the reader to verify that all these properties are preserved for the complex
exponential, and that any one of them could have motivated our definition and yielded the
same results.
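One such verification, sketched below in Python (numpy assumed; exp_taylor is our own illustrative helper, not part of the course materials), compares a truncated Taylor series against numpy's built-in exponential and checks the defining multiplicative property:

```python
import numpy as np

def exp_taylor(z, n_terms=60):
    """Partial sum of the Taylor series sum_{n>=0} z^n / n!."""
    total, term = 0.0 + 0.0j, 1.0 + 0.0j
    for n in range(n_terms):
        total += term
        term *= z / (n + 1)
    return total

z1, z2 = 0.3 + 1.2j, -0.7 + 0.5j
assert np.isclose(exp_taylor(z1), np.exp(z1))                # Taylor series
assert np.isclose(np.exp(z1 + z2), np.exp(z1) * np.exp(z2))  # e^{z1+z2} = e^{z1} e^{z2}
print(np.exp(1j * np.pi))                                    # ~ -1: Euler's formula
```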
Example 2.1.15. Evaluate the functions along the curves: (a) z ↦ sin z along the infinite
vertical line passing through π/2, (b) z ↦ exp(z + 1) along the semi-infinite ray extending
from the point z = −1 and passing through √3 i, and (c) z ↦ z^2 along the circular arc of radius
ε centered at 0. See example 2.1.13 and also Fig. 2.4.
Solution.
(a) Parameterize the vertical line passing through π/2 by z(s) = π/2 + is for −∞ < s < ∞. Then

f(π/2 + is) = sin(π/2 + is) = (e^{i(π/2 + is)} − e^{−i(π/2 + is)})/(2i)
            = (i e^{−s} + i e^{s})/(2i) = cosh(s), for −∞ < s < ∞.

(b) Parameterize the semi-infinite ray extending from the point z = −1 and passing through
√3 i by z(ρ) = −1 + ρe^{iπ/3} for 0 ≤ ρ < ∞. Then f(z(ρ)) = exp(ρe^{iπ/3}).
(c) Parameterize the circular arc of radius ε centered at 0 by z(τ) = εe^{iτ} for 0 ≤ τ ≤ 2π.
Then f(z(τ)) = ε^2 e^{2iτ}.
Not every complex function is single-valued. We often deal with functions that are multi-
valued, meaning that for some z, there exist two or more wi such that f(z) = wi. Recall the
parametrized curve in example 2.1.15 and consider part (c), where we evaluated the function
f(z) = z^2 along the circle of radius ε centered at the origin. Notice, in particular, that
the function returns to its original value, that is, f(εe^{0i}) = f(εe^{2πi}) = ε^2. It may seem
surprising, but there are functions where this is not the case.
Example 2.1.16. Consider the example of ω(z) = √z. When z is represented in polar
coordinates, z = r exp(iθ), we know that θ is defined up to a shift of 2πn, for any integer
n. For our example, this translates to ωn(z) = √r exp(iθ/2 + iπn), where different n
will result in (two) different values of √z, called two branches, ω1 = √r exp(iθ/2),
ω2 = √r exp(iθ/2 + iπ). If we choose one branch, say ω1, and walk in the complex plane around
z = 0 in a positive counter-clockwise direction (so that z = 0 always stays on the left)
changing θ from its original value, say θ = 0, to π/2, π, 3π/2 and eventually get to 2π, ω1
will transition to ω2. Making one more positive 2π swing will return us to ω1. In other words,
the two branches transition to each other after one makes a 2π turn. Per the definition below,
the point z = 0 is called a second order branch point of the two-valued function √z.
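The branch transition can be traced numerically. The sketch below (Python/numpy, an illustration under our own conventions) follows ω1 = √r e^{iθ/2} continuously along the unit circle and shows that one full turn lands on ω2, while numpy's principal branch jumps across its cut instead:

```python
import numpy as np

theta = np.linspace(0.0, 2 * np.pi, 401)   # one counter-clockwise turn around z = 0
w1 = np.exp(1j * theta / 2)                # branch omega_1 of sqrt(z) on |z| = 1

print(w1[0], w1[-1])   # +1 at the start, -1 after the full turn: we arrive at omega_2

# numpy's principal branch restricts theta to (-pi, pi], so it jumps across the
# cut rather than changing branch:
print(np.sqrt(np.exp(1j * theta[-1])))     # ~ +1 again
```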
A multi-valued function has the property that if we traverse a sufficiently small closed
contour around its branch point, we experience a discontinuity. One should note that the
location of this discontinuity is entirely dependent on where we choose to start and stop
the closed contour. To see this consider the following two closed contours:
Branch cuts are introduced in order to separate the branches of a multi-valued function.
For most multi-valued functions, this means that when we attempt to traverse a cut with
some closed contour, we end up experiencing a discontinuity. Really what we have is a
contour that is closed in the domain of f, but maps to an open contour in the image of f.
Definition 2.1.19. A branch cut is a curve in the complex plane along which a branch is
discontinuous.
Remark. Branch cuts are usually not unique, and are something that is defined by us, not
the multi-valued function in question. One branch is arbitrarily selected as the principal
branch. Most software packages employ a set of rules for selecting the principal branch of
a multi-valued function.
Example 2.1.23. Find the branch points of log(z − 1) and sketch a set of possible branch
cuts. Choose a branch cut and describe the resulting branches.
Solution. Parameterize the function as follows, log(z − 1) = log ρ + iφ, where z − 1 = ρ exp(iφ)
with ρ > 0 (positive real) and φ real. Since φ changes by multiples of 2π as we travel
on a closed path around z = 1, the point z = 1 is a branch point of log(z − 1). We can
observe that z = ∞ is also a branch point (an infinite branch point) by replacing z with
z = 1/w and observing that w = 0 is a branch point. Therefore a valid branch cut for the
function should connect the two branch points as illustrated in Fig. (2.7).
To describe branches of the function, let us choose (for concreteness) the branch cut
starting at z = 1 and moving along the x axis to z = +∞. Introduce the family
zn = 1 + ρ exp(iφ + 2iπn), n = 0, 1, ⋯. Given that f(z) = log(z − 1), the family
of zn translates into the following branches: fn(z) = log ρ + iφ + 2iπn. Observe that each
branch is distinct from the others and that each is a single-valued, analytic function in C
excluding the branch cut.
Figure 2.5: (a) Top row: The real (left) and imaginary (right) components of z ↦ z^{1/2}.
(b) Middle row: The representation, z = re^{iθ}, with 0 ≤ θ < 2π, gives a branch cut along
the positive real axis and a (single-valued) branch that is analytic everywhere except along
the branch cut. (c) Bottom row: The representation, z = re^{iθ}, with −π ≤ θ < π, gives
a branch cut along the negative real axis and a (single-valued) branch of z ↦ z^{1/2} that is
analytic everywhere except along the branch cut.
Figure 2.6: (a) Top row: The real (left) and imaginary (right) components of z ↦ log(z).
(b) Middle row: The representation, z = re^{iθ}, with 0 ≤ θ < 2π, gives a branch cut
along the positive real axis and a (single-valued) branch of z ↦ log(z) that is analytic
everywhere except along the branch cut. (c) Bottom row: The representation, z = re^{iθ},
with −π ≤ θ < π, gives a branch cut along the negative real axis and a (single-valued)
branch of z ↦ log(z) that is analytic everywhere except along the branch cut.
Figure 2.7: Polar parametrization of log(z − 1) (left) and three examples of branch cuts for
the function, each connecting its two branch points, at z = 1 and at z = ∞.
Figure 2.8
1. The function log(f(z)) has branch points at the zeros of f(z) and at the points where
f(z) is infinite, as well as (possibly) at the points where f(z) itself has branch points.
But be careful here: the zeros have to be zeros in the sense of analytic functions, and by
infinities we mean poles. Other types of (singular) behaviors in f(z) can lead to unexpected
results, e.g. check what happens at z = 0 when f(z) = exp(1/z).
2. The fact that a function f(z) or its derivatives may or may not have a (finite) value
at some point z = z0 is irrelevant to deciding the issue of whether or not z0 is a branch
point of f(z).
Exercise 2.3. Identify the branch points, introduce suitable branch cuts, and describe the
resulting branches for the functions (a) f(z) = √((z − a)(z − b)), and
(b) g(z) = log((z − 1)/(z − 2)).
As an example, let us find all solutions of sin z = 3, i.e.

(e^{iz} − e^{−iz})/(2i) = 3.

Multiply each side by 2i e^{iz} and rearrange to get

(e^{iz})^2 − 6i e^{iz} − 1 = 0.

This can be solved using the quadratic formula which, after some algebra, gives

e^{iz} = i(3 ± 2√2).

Now, take the natural log of both sides,

iz = ln(i) + ln(3 ± 2√2).

By ln(z) = ln(r) + i(θ + 2nπ), where n = 0, ±1, ±2, …, we know that
ln(i) = ln(1) + i(π/2 + 2nπ), so

z = π/2 + 2nπ ± i ln(3 + 2√2).
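A direct numerical substitution confirms the family of solutions; a minimal check in Python (numpy assumed, our illustrative choice) follows:

```python
import numpy as np

c = np.log(3 + 2 * np.sqrt(2))
for z in (np.pi / 2 + 1j * c,               # n = 0, plus sign
          np.pi / 2 - 1j * c,               # n = 0, minus sign
          np.pi / 2 + 2 * np.pi + 1j * c):  # n = 1
    print(np.sin(z))                        # each ~ (3 + 0j)
```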
The derivative of a real-valued function is defined at a point x via the limiting expression

f′(x) = lim_{Δx→0} (f(x + Δx) − f(x))/Δx,

and we say that the function is differentiable at x if the limit exists and is independent of
whether x is approached from above or below, as given by the sign of Δx.
Definition 2.2.1. The derivative of a complex function is defined via a limiting expression:

f′(z) = lim_{Δz→0} (f(z + Δz) − f(z))/Δz. (2.5)

This limit only exists if f′(z) is independent of the direction in the z-plane in which the
limit Δz → 0 is taken. (Note: there are infinitely many ways to approach a point z ∈ C.)
Taking the limit along the real direction (Δz = Δx) gives f′(z) = u_x + i v_x; taking it
along the imaginary direction (Δz = iΔy) gives f′(z) = −i u_y + v_y.
A consistent definition of a derivative requires that the two ways of taking the derivative
coincide, that is,

u_x = v_y,  u_y = −v_x. (2.6)
Notice that in the explanations which lead us to the Cauchy-Riemann theorem (2.2.2)
we only sketched one side of the proof – that it is necessary for the differentiability of f (z) to
have the theorem’s conditions satisfied. To complete the proof, one needs to also show that
Eq. (2.6) is sufficient for the differentiability of f (z). In other words, one needs to show that
any function u(x, y)+iv(x, y) is complex-differentiable if the Cauchy–Riemann equations
hold. The missing part of the proof follows from the following chain of transformations
Δf = f(z + Δz) − f(z)
   = (∂f/∂x) Δx + (∂f/∂y) Δy + O((Δx)^2, (Δy)^2, (Δx)(Δy))
   = (1/2)(∂f/∂x − i ∂f/∂y) Δz + (1/2)(∂f/∂x + i ∂f/∂y) Δz^* + O((Δx)^2, (Δy)^2, (Δx)(Δy))
   = (∂f/∂z) Δz + (∂f/∂z^*) Δz^* + O((Δx)^2, (Δy)^2, (Δx)(Δy))
   = (∂f/∂z + (∂f/∂z^*)(Δz^*/Δz)) Δz + O((Δx)^2, (Δy)^2, (Δx)(Δy)), (2.7)

where O((Δx)^2, (Δy)^2, (Δx)(Δy)) indicates that we have ignored terms of order two and
higher in Δx and Δy. In the transition to the last line of Eq. (2.7) we change variables
from (x, y) to (z, z^*), thus using

∂/∂x = (∂z/∂x) ∂/∂z + (∂z^*/∂x) ∂/∂z^* = ∂/∂z + ∂/∂z^*,
∂/∂y = (∂z/∂y) ∂/∂z + (∂z^*/∂y) ∂/∂z^* = i ∂/∂z − i ∂/∂z^*.
Observe that Δz^*/Δz takes different values depending on the direction in which we take
the respective Δz, Δz^* → 0 limit in the complex plane. Therefore, to ensure that the
derivative, f′(z), is well defined at any z, one needs to require that

∂f/∂z^* = 0, (2.8)

i.e. that f does not depend on z^*. It is straightforward to check that the "independence of
the complex conjugate" Eq. (2.8) is equivalent to Eq. (2.6).
Definition 2.2.3 ((Complex) Analyticity). A function f(z) is called (a) analytic (or holo-
morphic) at a point, z0, if it is differentiable (as a complex function) in a neighborhood of
z0; (b) analytic in a region of the complex plane if it is analytic at each point of the region.
Exercise 2.4. The isolines for a function, f(x, y) = u(x, y) + iv(x, y), are defined to be the
curves, u(x, y) = const and v(x, y) = const′. Show that the isolines of an analytic function
always cross at a right angle.
Example 2.2.4. Let f(z) = u(x, y) + iv(x, y) be analytic. Given that u(x, y) = x + x^2 − y^2
and f(0) = 0, find v(x, y).
Solution. We start by utilizing the Cauchy-Riemann conditions:

∂u/∂x = ∂v/∂y,  ∂u/∂y = −∂v/∂x.

This gives us the two differential equations

∂v/∂y = 2x + 1,  ∂v/∂x = 2y.

Integrating the first with respect to y gives v = 2xy + y + C1(x); integrating the second
with respect to x gives v = 2xy + C2(y). Based on these two solutions, C1(x) = k, for some
constant k, and C2(y) = y + k. Given the initial condition f(0) = 0, v(0, 0) = 0, and
therefore k = 0. So, we find the solution, v(x, y) = 2xy + y. Therefore
f(z) = x + x^2 − y^2 + i(2xy + y) = z + z^2.
Exercise 2.5. Let f(z) = u(x, y) + iv(x, y) be analytic. Given that v(x, y) = −2xy and
f(0) = 1, find u(x, y).
Example 2.2.5. Determine whether (and where) f (z) = z ∗ is analytic and compute its
derivative where it exists.
Solution. Recall: If z = x + iy, then z^* := x − iy, so u = x and v = −y. We first compute
u_x, v_y, u_y and v_x, and determine whether (and where) they are continuous:

u_x = 1,  v_y = −1,  u_y = 0,  v_x = 0.

We confirm that the partial derivatives are continuous everywhere in C. We now check
the Cauchy-Riemann conditions and find that they are not satisfied anywhere in C because
u_x = 1 ≠ −1 = v_y. Intuitively, the complex conjugate fails to be analytic because analytic
functions can be locally approximated by rotations and stretches of the complex plane,
whereas the complex conjugate function is a reflection.
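The contrast can also be seen numerically: the finite-difference sketch below (Python/numpy; cr_residual is a hypothetical helper of ours) measures the failure of the compact Cauchy-Riemann condition i ∂f/∂x = ∂f/∂y at a point:

```python
import numpy as np

def cr_residual(f, z, h=1e-6):
    """|df/dx - (1/i) df/dy| by central differences; ~0 iff the
    Cauchy-Riemann conditions hold at z."""
    dfdx = (f(z + h) - f(z - h)) / (2 * h)
    dfdy = (f(z + 1j * h) - f(z - 1j * h)) / (2 * h)
    return abs(dfdx - dfdy / 1j)

z0 = 0.7 - 0.4j
print(cr_residual(lambda z: z**2, z0))   # ~ 1e-10: analytic
print(cr_residual(np.conj, z0))          # = 2: the conjugate fails everywhere
```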
Example 2.2.6. Determine whether (and where) f (z) = z 1/2 is analytic and compute its
derivative where it exists.
Solution. We leave it to the reader to apply the chain rule and the trigonometric identity
sin^2(θ) + cos^2(θ) = 1 to verify that the Cauchy-Riemann equations in polar coordinates are

∂u/∂r = (1/r) ∂v/∂θ,
∂v/∂r = −(1/r) ∂u/∂θ.

We compute the relevant partial derivatives of z^{1/2} and observe that they are not defined at
z = 0. We also observe that they cannot be continuous at a branch cut because branches of
z^{1/2} are not continuous across the branch cut. The Cauchy-Riemann conditions are satisfied
everywhere else. In conclusion, a branch of z^{1/2} is analytic in any region of C \ {0} that
does not contain the branch cut. We leave it to the reader to show that the derivative
in polar representation is given by f′(z) = e^{−iθ}(u_r + i v_r). For our example, this gives
f′(z) = (1/2) r^{−1/2} e^{−iθ/2} = (1/2) z^{−1/2}.
Example 2.2.7. Determine whether (and where) the function f (z) = 1/z is analytic.
Solution. Note that f is not defined at z = 0 and that lim_{z→0} f(z) does not exist. Rationalize
the denominator to write f(z) = x/(x^2 + y^2) − iy/(x^2 + y^2). The relevant partial derivatives
are:
u_x = (y^2 − x^2)/(x^2 + y^2)^2,  v_y = (y^2 − x^2)/(x^2 + y^2)^2,
u_y = −2xy/(x^2 + y^2)^2,  v_x = 2xy/(x^2 + y^2)^2.
The partial derivatives exist and are continuous everywhere (except z = 0) and the
Cauchy-Riemann conditions are satisfied on C \ {0}. We evaluate the derivative f′(z) and
observe that lim_{z→0} f′(z) does not exist. We say that f has a simple pole at z = 0 because
(z − 0)f(z) is analytic in a neighborhood of 0. We will revisit this in section 2.2.5.
Example 2.2.8. Determine whether (and where) the functions (a) exp(z), (b) z exp(z̄) and
(c) (exp(z) − 1)/z are analytic and compute their derivatives where they exist.
Let us now make a detour into the subject of conformal maps: a conformal map is a function
of two variables, x, y, that locally preserves angles, but not necessarily lengths. We will
see, quite remarkably, that the rich family of conformal functions (maps) can be analyzed
based solely on the Cauchy-Riemann condition (2.6).
Indeed, the Cauchy-Riemann conditions (2.6) can be restated in the following compact
form

i ∂f/∂x = ∂f/∂y.

Then the Jacobian matrix of the function f : R^2 → R^2, i.e. of the (x, y) → (u, v) map, is

J = ( ∂u/∂x   ∂u/∂y )   =   ( ∂u/∂x    ∂u/∂y )
    ( ∂v/∂x   ∂v/∂y )       ( −∂u/∂y   ∂u/∂x ).
Geometrically, the off-diagonal (skew-symmetric) part of the matrix represents rotation and
the diagonal part of the matrix represents stretching/re-scaling. The Jacobian of a function
f (z) takes infinitesimal line segments at the intersection of two curves in z and rotates
them to the corresponding segments in f (z). Therefore, a function satisfying the Cauchy-
Riemann equations, with a nonzero derivative, preserves the angle between curves in the
plane. Such transformations, and the corresponding functions themselves, are called
conformal. That is, the Cauchy-Riemann equations are not only conditions for the analyticity
of a function, but are also conditions for a function to be conformal.
The following famous theorem of complex analysis builds a strong theoretical foundation
for conformal maps. (It is due to Bernhard Riemann, who stated it in his Ph.D. thesis in
1851. The first proof of the theorem was published in 1912 by Constantin Carathéodory.)
Figure 2.9: Exemplary functions (maps) from the unit disk. Screenshots from https://
demonstrations.wolfram.com/ConformalMappingOfTheUnitDisk/.
The theorem allows one to build various conformal maps from one simply connected domain
to another by reducing the problem to two maps, each taking the respective domain to the
unit disk.
It is useful for developing geometrical intuition to consider conformal maps associated
with elementary functions. See a number of illustrations in Fig. 2.9. Notice, however, that
even relatively simple Riemann mappings, such as a map from the unit disk to the interior
of a square, may not be expressible in terms of elementary functions, and in general one
needs to rely on approximate numerical methods to build the maps. (See the ConformalMaps
Julia package, https://github.com/sswatson/ConformalMaps.jl, for approximating the
Riemann map from a simply connected planar domain to a disk.)
Here we will make a fast jump to the end of the semester, where Partial Differential Equations
(PDEs) will be discussed in detail. Consider the solution of the Laplace equation in two
dimensions,

(∂_x^2 + ∂_y^2) f(x, y) = 0. (2.9)
Eq. (2.9) defines the so-called harmonic functions. We do it now, while studying complex
calculus, because, quite remarkably, an arbitrary analytic function is a solution of
Eq. (2.9). This statement is a straightforward corollary of the Cauchy-Riemann theorem
(2.2.2). To see it we recall that f = u + iv, and use the following chain of transformations
following from the Cauchy-Riemann conditions (2.6), also assuming that the function, f(z),
is analytic at z (which allows us to differentiate it one more time with respect to x and y):

{u_x = v_y, u_y = −v_x} ⇒ {u_xx = v_xy, u_yy = −v_xy} ⇒ u_xx + u_yy = 0,
{u_x = v_y, u_y = −v_x} ⇒ {u_xy = v_yy, u_xy = −v_xx} ⇒ v_xx + v_yy = 0.
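This corollary is easy to test numerically; the following sketch (Python/numpy, our illustrative choice) applies a five-point finite-difference Laplacian to the real and imaginary parts of the analytic function f(z) = z^3:

```python
import numpy as np

def laplacian(g, x, y, h=1e-4):
    """Five-point finite-difference approximation of g_xx + g_yy."""
    return (g(x + h, y) + g(x - h, y) + g(x, y + h) + g(x, y - h)
            - 4.0 * g(x, y)) / h**2

u = lambda x, y: ((x + 1j * y) ** 3).real   # u(x, y) = Re f(z)
v = lambda x, y: ((x + 1j * y) ** 3).imag   # v(x, y) = Im f(z)

print(laplacian(u, 1.3, -0.2))   # ~ 0, up to discretization/rounding error
print(laplacian(v, 1.3, -0.2))   # ~ 0
```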
The descriptor "harmonic" originates from the motion of a point on a taut string undergoing
periodic motion, which is pleasant-sounding and was thus coined harmonic by the ancient
Greeks. This type of motion can be written in terms of sines and cosines, functions which
are thus referred to as harmonics. Fourier analysis, which we will turn our attention to
soon, involves expanding periodic functions on the unit circle in terms of a series over
these harmonics. These functions satisfy the Laplace equation, and over time "harmonic"
was used to refer to all functions satisfying the Laplace equation.
The Laplace equation, and thus harmonic functions, arise most prominently in mathematical
physics, in particular in electromagnetism (electrostatics) and fluid mechanics (hydrostat-
ics).
For example, in electrostatics it describes the distribution of electrostatic potential within
a planar domain cut in a metal – a domain which is free of charge, i.e. free of singularities.
On the other hand, singular points of the harmonic functions above are expressed as "point
charges" and/or continuously distributed "charge densities". Placing a point charge at
the origin, z = x + iy = 0, results in the solution of the Laplace equation (2.9) which
is singular, i.e. non-analytic, at the origin. We will see later, in the PDE part of the
course, that the analytic function corresponding to a point charge placed at the origin
is f(z) = C log z, where C is a constant related to the value of the charge, and then
Re(f(z)) = u = C log r = (C/2) log(x^2 + y^2) is the corresponding electrostatic potential.
The family of harmonic (complex) functions is rich. Each harmonic function satisfying
Eq. (2.9) will yield another harmonic function when multiplied by a constant, rotated,
and/or shifted by a constant. The inversion of each function will yield another harmonic
function whose singularities are the images of the original singularities in a spherical
"mirror". Also, the sum of any two harmonic functions will yield another harmonic function.
where for each n, (ζk |k = 0, · · · , n) describes an ordered sequence of points along the path
breaking it into n intervals such that ζ0 = a, ζn = b and maxk |ζk+1 − ζk | → 0 as n → ∞.
Remark. It is now time to utilize the parametrization of curves in the complex plane
discussed earlier in the course. Let z(t) with a ≤ t ≤ b be a parameterization of C; then
definition 2.2.10 is equivalent to the Riemann integral of f(z(t)) z′(t) with respect to t.
Therefore,

∫_C f(z) dz = ∫_a^b f(z(t)) z′(t) dt. (2.11)
Example 2.2.11. In example 2.1.15 we evaluated the functions f1(z) = sin z, f2(z) =
exp(z + 1), and f3(z) = z^2 along the parameterized curves described in example 2.1.13.
Now compute (a) ∫_{C1} f1(z) dz, (b) ∫_{C2} f2(z) dz, and (c) ∫_{C3} f3(z) dz, where C1 is the
vertical line segment from π/2 − iM to π/2 + iM, C2 is the ray segment extending from
the point z = −1 to the point √3 i, and C3 is the circular arc of radius ε centered at 0.
Solution.
(c) Let z = εe^{iτ} for 0 ≤ τ ≤ 2π; then dz = iεe^{iτ} dτ. Therefore,

∫_{C3} z^2 dz = ∫_0^{2π} ε^2 e^{2iτ} iεe^{iτ} dτ = (1/3) ε^3 e^{3iτ} |_0^{2π}
             = (1/3) ε^3 e^{6πi} − (1/3) ε^3 e^0 = 0.
Exercise 2.6. Let C+ and C− represent the upper and lower unit semi-circles centered at
the origin and oriented from z = −1 to z = 1. Find the integrals of the functions (a) z;
(b) z^2; (c) 1/z; and (d) √z along C+ and C−. For √z, use the branch where z is represented
by re^{iθ} with 0 ≤ θ < 2π.
Example 2.2.12. Let C be the circular closed contour of radius R centered at the origin.
Show that

∮_C dz/z^m = 0, for m = 2, 3, …, (2.12)

by parameterizing the contour in polar coordinates.
Solution. One possible parameterization of the contour is z(θ) = Re^{iθ} for 0 ≤ θ < 2π.
Therefore, dz = iRe^{iθ} dθ. Changing the integral to polar coordinates gives:

∮_C dz/z^m = ∫_0^{2π} iRe^{iθ}/(Re^{iθ})^m dθ = iR^{1−m} ∫_0^{2π} e^{i(1−m)θ} dθ = 0,
for m = 2, 3, ….

A simple, but useful, continuation of this example would be confirming that the integral is
not 0 when m = 1.
Example 2.2.13. Use numerical integration to approximate the integrals in the examples
above and verify your results.
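One way to carry out such a verification is sketched below (Python/numpy; circle_integral is our own helper, not part of the course materials): the contour is sampled uniformly in the parameterization z = Re^{it}, and ∫ f(z(t)) z′(t) dt is approximated by the rectangle rule, which is highly accurate for periodic integrands.

```python
import numpy as np

def circle_integral(f, R=2.0, n=4096):
    """Rectangle-rule approximation of the counter-clockwise contour
    integral of f over the circle |z| = R."""
    t = np.arange(n) * 2 * np.pi / n
    z = R * np.exp(1j * t)
    dz = 1j * z                       # z'(t) = i R e^{it} = i z(t)
    return np.sum(f(z) * dz) * (2 * np.pi / n)

print(circle_integral(lambda z: 1 / z))     # ~ 2*pi*1j  (m = 1: not zero!)
print(circle_integral(lambda z: 1 / z**2))  # ~ 0        (m = 2, Eq. (2.12))
print(circle_integral(lambda z: z**2))      # ~ 0        (entire integrand)
```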
In general the integral along a path in the complex plane depends on the entire path and
not only on the position of the end points. The following fundamental question arises
naturally: is there a condition which makes the integral dependent only on the end points
of the path? The question is answered by the following famous theorem.
A more compact way of stating the theorem is to say that integrals of analytic functions
are path independent.
It is important to recognize that for Cauchy's theorem to hold in the case of multi-
valued functions one needs the integrand to be a single-valued function. The cuts introduced
in the preceding section are required for exactly this reason – to force the integration path
to stay within a single branch of a multi-valued function and thus to guarantee analyticity
(differentiability) of the function along the path.
The same theorem can be restated in the following form.
Theorem 2.2.15 (Cauchy's Theorem (closed contour version)). Let f(z) be analytic in a
simply connected region D and C be a closed contour that lies in the interior of D. Then
the integral of f along C is equal to zero: ∮_C f(z) dz = 0.
To make the transformation from the former formulation of Cauchy's theorem to the
latter one, we need to consider two paths connecting two points of the complex plane. From
Eq. (2.10), we see that paths are oriented and that changing the direction of the path
changes the value of the integral by a factor of −1. Therefore, of the two paths considered,
one needs to reverse its direction, thus leading us to the closed contour formulation of
Cauchy's theorem.
Let us now sketch the proof of the closed contour version of Cauchy's theorem. Consider
breaking the region of the complex plane bounded by the contour C into small squares
with the contours C_k, as well as the original contour C, oriented in the positive direction
(counter-clockwise). Then

∮_C f(z) dz = ∑_k ∮_{C_k} f(z) dz, (2.13)
where we have accounted for the fact that integrals over the inner sides of the small contours
cancel each other, as two of them (for each side) are running in opposite directions. Next,
pick a point, z_k, inside each contour C_k, and then approximate f(z) by expanding it in the
Taylor series around z_k,

f(z) = f(z_k) + f′(z_k)(z − z_k) + O(Δ^2), (2.14)

where, with Δ-squares, the length of C_k is at most 4Δ, and we have at most (L/Δ)^2 small
squares (L being the linear size of the region). Substituting Eq. (2.14) into Eq. (2.13) one
derives

∮_{C_k} f(z) dz = f(z_k) ∮_{C_k} dz + f′(z_k) ∮_{C_k} (z − z_k) dz + ∮_{C_k} O(Δ^2) dz
                = 0 + 0 + O(Δ^3). (2.15)

Summing over all the small squares bounded by C one arrives at the estimate
(L/Δ)^2 O(Δ^3) = O(Δ) → 0 in the Δ → 0 limit.
Disclaimer: We have just used a discretization of the integral. When dealing with inte-
grations of functions in the rest of the course we will always discuss them in the sense of a
limit, assuming that it exists, and not actually break the integration path into segments.
However, if any question on the details of the limiting procedure surfaces, one should go
back to the discretization and analyze the respective limiting procedure carefully.
One important consequence of Cauchy's theorem (more will be discussed in the following)
is that all integration rules known for standard, "interval", integrals apply to the contour
integrals. This is also facilitated by the following statement.
Theorem 2.2.16 (Triangle Inequality). (A: From Euclidean Geometry) |z1 + z2| ≤ |z1| +
|z2|, with equality iff (if and only if) z1 and z2 lie on the same ray from the origin. (B:
Integral over Interval) Suppose g(t) is a complex-valued function of a real variable, defined
on a ≤ t ≤ b; then

|∫_a^b g(t) dt| ≤ ∫_a^b |g(t)| dt, (2.16)

with equality iff the values of g(t) all lie on the same ray from the origin.
(C: Integral over Curve/Path) For any function f(z) and any curve γ, we have

|∫_γ f(z) dz| ≤ ∫_γ |f(z)| |dz|, (2.17)
Proof. We take the "Euclidean" geometry version (A) of the statement, extended to the sum
of complex numbers, as granted, and give a brief sketch of proofs for the integral formulations.
The interval version (B) of the triangle inequality follows by approximating the integral
Figure 2.10
as a Riemann sum,

|∫_a^b g(t) dt| ≈ |∑_k g(t_k) Δt| ≤ ∑_k |g(t_k)| Δt ≈ ∫_a^b |g(t)| dt,

where the middle inequality is just the standard triangle inequality for sums of complex
numbers. The contour version (C) of the Theorem follows immediately from the interval
version:

|∫_γ f(z) dz| = |∫_a^b f(γ(t)) γ′(t) dt| ≤ ∫_a^b |f(γ(t))| |γ′(t)| dt = ∫_γ |f(z)| |dz|.
Recall from definition 2.1.12 that a curve is called simple if it does not intersect itself, and
is called a contour if it is piece-wise smooth.
Theorem 2.2.17 (Cauchy's formula, 1831). Let f(z) be analytic on and interior to a simple
closed contour C. Then,

f(z) = (1/(2πi)) ∮_C f(ζ) dζ/(ζ − z). (2.18)
To illustrate Cauchy's formula consider the simplest, and arguably most important,
example of an integral over the complex plane, I = ∮ dz/z. For the integral over the
closed contour
Figure 2.11
shown in Fig. (2.10a), we parameterize the contour explicitly in polar coordinates and
derive I = 2πi.
Example 2.2.18. Compute, compare and discuss the difference (if any) between values of
the integral ∮ dz/z over the two distinct paths shown in Fig. (2.11).
Solution. Firstly, note that 1/z is not analytic at z = 0, and therefore neither contour can
be deformed to the other without passing through the non-analytic point. Both curves are
simple and closed, therefore Cauchy's formula applies. The curve on the left, C1, contains
z = 0 and makes one full turn around the origin, and therefore the integral will equal the
one over the circular curve in Figure 2.4, so ∮_{C1} dz/z = 2πi. The curve on the right, C2,
does not contain z = 0: it makes a full turn counter-clockwise around the origin, as well as
a clockwise turn around the origin on the "inner" part of the curve. So ∮_{C2} dz/z = 0.
C2
The “small square” construction used above to prove the closed contour version of
Cauchy’s Theorem, i.e. Theorem 2.2.15, is a useful tool for dealing with integrals over
awkward (difficult for direct computation) paths around singular points of the integrand.
However, it should not be thought that all such integrals will necessarily be zero. Consider

∮ dz/z^m, m = 2, 3, ⋯,

where the integrand is singular at z = 0. The respective indefinite integral (what is sometimes
called the "anti-derivative") is z^{−m+1}/(1 − m) + C, where C is a constant. Observe that the
indefinite integral is a single-valued function and thus its integral over a closed contour is
zero. (Notice that if m = 1 the indefinite integral is a multi-valued function within the
domain surrounding z = 0.)
Cauchy's formula can be extended to higher derivatives.

Theorem 2.2.19 (Cauchy's formula for derivatives, 1842). Under the same conditions as
in Theorem 2.2.17, higher derivatives are

f^{(n)}(z) = (n!/(2πi)) ∮_C f(ζ) dζ/(ζ − z)^{n+1}. (2.20)
Cauchy’s theorem and formulas have many powerful and far reaching consequences.
Theorem 2.2.20. Suppose f (z) is analytic on a region A. Then, f has derivatives of all
orders.
Proof. It follows directly from Cauchy’s formula for derivatives, Theorem 2.2.19 – that is we
have an explicit formula for all the derivatives, so, in particular, the derivatives all exist.
Theorem 2.2.21 (Cauchy Inequality). Let C_R be the circle |z − z0| = R. Assume that f(z)
is analytic on C_R and its interior, i.e. on the disk |z − z0| ≤ R. Finally, let
M_R = max |f(z)| over z on C_R. Then

∀n = 1, 2, ⋯ : |f^{(n)}(z0)| ≤ n! M_R / R^n.
Exercise 2.7. Prove the Cauchy Inequality Theorem utilizing Theorem 2.2.19. Provide
an alternative argument for the theorem's validity on the examples of exp(z) and cos(z),
using a circle that is centered at the origin (you are expected to argue informally and
without reference to Theorem 2.2.19 why the inequality holds).
Theorem 2.2.22 (Liouville Theorem). If f(z) is entire, i.e. analytic at all finite points of
the complex plane C, and bounded, then f is constant.
Proof. For any circle of radius R around z0, Cauchy's inequality (Theorem 2.2.21) states
that |f′(z0)| ≤ M/R, where M is a bound on |f|; but R can be arbitrarily large, thus
|f′(z0)| = 0 for every z0 ∈ C. And since the derivative is 0 everywhere, the function itself
is constant.
Note that P(z) = ∑_{k=0}^n a_k z^k, exp(z), and cos(z) are entire but not bounded.
Proof. The proof consists of two parts. First, we want to show that P(z) has at least one
root. (See the example below.) Second, suppose that P has degree n and let z0 be a root.
Factor P(z) = (z − z0)Q(z), where Q(z) has degree n − 1. If n − 1 > 0, then we can
apply the result to Q(z). We can continue this process until the degree of Q is 0.
Example 2.2.24. Prove that P(z) = ∑_{k=0}^n a_k z^k (of degree n ≥ 1) has at least one root.
Solution. We provide a hint and not the full solution: prove by contradiction and utilize
the Liouville Theorem 2.2.22.
Theorem 2.2.25 (Maximum modulus principle (over disk)). Suppose f(z) is analytic on
the closed disk, C_r, of radius r centered at z0, i.e. the set |z − z0| ≤ r. If |f| has a relative
maximum at z0, then f(z) is constant in C_r.
In order to prove the Theorem we will first prove the following statement.
Theorem 2.2.26 (Mean value property). Suppose f(z) is analytic on the closed disk of
radius r centered at z0, i.e. the set |z − z0| ≤ r. Then,

f(z0) = (1/2π) ∫_0^{2π} f(z0 + r exp(iθ)) dθ.

Proof. By Cauchy's formula,

f(z0) = (1/(2πi)) ∮_{C_r} f(z) dz/(z − z0)
      = (1/(2πi)) ∫_0^{2π} (f(z0 + re^{iθ})/(re^{iθ})) ire^{iθ} dθ
      = (1/2π) ∫_0^{2π} f(z0 + re^{iθ}) dθ.
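The mean value property is also easy to probe numerically; below is a minimal sketch (Python/numpy; circle_mean is an illustrative helper of ours) averaging an entire function over a circle:

```python
import numpy as np

def circle_mean(f, z0, r, n=4096):
    """Average of f over the circle |z - z0| = r (rectangle rule)."""
    theta = np.arange(n) * 2 * np.pi / n
    return np.mean(f(z0 + r * np.exp(1j * theta)))

z0 = 0.5 + 0.3j
print(np.exp(z0))                    # f(z0)
print(circle_mean(np.exp, z0, 1.5))  # matches f(z0) to rounding error
```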
Now back to Theorem 2.2.25. To sketch the proof we will use both the mean value
property, Theorem 2.2.26, and the triangle inequality, Theorem 2.2.16. Since z0 is a relative
maximum of |f|,

|f(z0)| = |(1/2π) ∫_0^{2π} f(z0 + re^{iθ}) dθ|  (mean value property)
        ≤ (1/2π) ∫_0^{2π} |f(z0 + re^{iθ})| dθ  (triangle inequality)
        ≤ (1/2π) ∫_0^{2π} |f(z0)| dθ  (|f(z0 + re^{iθ})| ≤ |f(z0)|, i.e. z0 is a local maximum)
        = |f(z0)|.
Since we start and end with |f(z0)|, all inequalities in the chain are equalities. The first
inequality can only be an equality if, for all θ, f(z0 + re^{iθ}) lies on the same ray from the
origin, i.e. the values have the same argument or are equal to zero. The second inequality
can only be an equality if all |f(z0 + re^{iθ})| = |f(z0)|. Thus, combining the two
observations, one gets that all f(z0 + re^{iθ}) have the same magnitude and the same argument,
i.e. they are all the same. Finally, if f(z) is constant along the circle and f(z0) is the
average of f(z) over the circle, then f(z) = f(z0), i.e. f is constant on C_r.
Two remarks are in order. First, based on the experience so far (starting from Theorem
2.2.22) it is plausible to expect that Theorem 2.2.25 generalizes from a disk C_r to any
simply connected domain. Second, one also expects that the maximum modulus can be achieved
at the boundary of a domain, in which case the function need not be constant within the
domain. Indeed, consider the example of exp(z) on the unit square, 0 ≤ x, y ≤ 1. The
maximum, |exp(x + iy)| = exp(x), is achieved at x = 1 and arbitrary y, 0 ≤ y ≤ 1, i.e. at
the boundary of the domain. These remarks and the example suggest the following extension
of Theorem 2.2.25.
Proof. Here is a sketch of the proof. Let us cover A by disks which are laid such that
their centers form a path from the point where |f(z)| is maximized to any other point in A,
while being totally contained within A. Existence of a maximum value of |f(z)| within A
implies, according to Theorem 2.2.25 applied to all the disks, that all the values of f(z) in
the domain are the same, thus f(z) is constant within A. Obviously the constancy of f(z)
is not required if the maximum of |f(z)| is achieved at ∂A.
Example 2.2.28. Find the maximum modulus of sin(z) on the square, 0 ≤ x, y ≤ 2π.
Solution.

sin(z) = sin(x + iy) = sin(x) cosh(y) + i cos(x) sinh(y).

Using the identities, cos^2(x) = 1 − sin^2(x) and cosh^2(y) − sinh^2(y) = 1, we get

|sin(z)| = √(sin^2(x) cosh^2(y) + (1 − sin^2(x)) sinh^2(y)) = √(sinh^2(y) + sin^2(x)).

This is maximized when sin^2(x) = 1, i.e. at odd multiples of π/2 (within our square,
x = π/2 or x = 3π/2), and when y = 2π, as this is the maximum value of y in our square. So
z = π/2 + 2πi or z = 3π/2 + 2πi, which lie on the boundary of our square, consistent with
the maximum modulus principle.
The Laurent series of a complex function f (z) about a point a is a representation of that
function by a power series that includes terms of both positive and negative degree.
where C is any contour that is contained within the annulus and encircling a.
Definition 2.2.30. The coefficient corresponding to the k = −1 term plays such a signif-
icant role in contour integration that it deserves a special name – the residue of f at z = a –
and is denoted by c_{−1} = Res(f; a).
Definition 2.3.3 (Simple Pole). Let a be a singular point of a function f, and let c_k be the
coefficients of the Laurent expansion of f about a. If c_{−1} ≠ 0, but c_k = 0 for all k < −1,
then we say that a is a first order pole or a simple pole of f. See Fig. 2.12(a) for an example
of a simple pole.
Figure 2.12: (a) Top row: The canonical example of a simple pole. The real (left) and
imaginary (right) components of z ↦ z^{−1}. (b) Bottom row: The canonical example of a
double pole. The real (left) and imaginary (right) components of z ↦ z^{−2}.
Definition 2.3.4 (Higher Order Pole). Let a be a singular point of a function f, and let c_k
be the coefficients of the Laurent expansion of f about a. If, for some positive N, c_{−N} ≠ 0
but c_k = 0 for all k < −N, then we say that a is an N-th order pole of f. See Fig. 2.12(b)
for an example of a double pole.
f(z) = g(z)/(z − a)^N, (2.26)
Example 2.3.5. Find the removable singularity and give the order of the poles of the
following function,

f(z) = (z − 1)/((z^4 − 1)(z + 1)) = (z − 1)/((z − 1)(z + 1)^2 (z + i)(z − i)).

• z = −i: It is similar to z = i.
• z = −1: For z ≠ −1, f(z) = g(z)/(z + 1)^2, where g(z) = (z − 1)/((z − 1)(z + i)(z − i)),
which is analytic in a neighborhood of z = −1. Therefore, z = −1 is a second order pole
(double pole) of f.
Solution. (a) The integrand is analytic everywhere within the domain surrounded by the
contour, therefore I = 0. (b) The integrand has a single first-order pole within the domain
surrounded by the contour, at z = 1. Notice that the direction of the contour is negative
(clockwise), therefore I = −2πi Res(exp(z^2)/(z − 1); 1) = −2πi exp(z^2)|_{z=1} = −2πie.
(c) Since the only singularity of the integrand is at z = 1, the contour can be reduced to
the contour in case (b), however traveled twice. Therefore, I = 2 · (−2πie) = −4πie.
Figure 2.14: The closed rectangular contour in the z-plane with vertices −a, a, a + iπ,
and −a + iπ.
I = 2πi Res(1/cosh z; iπ/2) = 2πi lim_{z→iπ/2} (z − iπ/2)/cosh z = 2πi/sinh(iπ/2)
  = 2π/sin(π/2) = 2π.
H
Exercise 2.8. Compute the integral dz/(ez −1) over the circle of radius 4 centered around
3i.
Example 2.3.8. Evaluate the following real-valued integral using contour integration:
Z∞
eikx dx
I= .
x2 + 1
−∞
Solution. Let f : C → C be given by f (z) = eikz /(z 2 + 1). Observe that f has simple poles
at z = ±i. Let C = C1 ∪ CR be the contour in the complex plane show in Figure 2.15.
CHAPTER 2. COMPLEX ANALYSIS 43
Our plan is to use the Cauchy’s formula to show that the integral along C is 2πiRes(f ; i),
eikz (z−i) e−k
where Res(f ; i) = limz→i z 2 +1
= 2i , and then, with the correct parameterization, we
can show that the integral along C1 converges to I as R → 0, and that the integral along
CR converges to 0 as R → ∞.
First, evaluate the integral along C1 . Parameterize C1 by z = x + 0i for −R < x < R.
Therefore, dz = dx.
Z ZR ZR
eikz dz eikz dz eikx dx
= = =
z2 + 1 z2 + 1 x2 + 1
C1 −R −R
Observe that in the limit R → 0, this is equivalent to the integral we must find.
Next evaluate the integral along C2 . Parameterize C2 by z = eRiθ = R cos(θ)+iR sin(θ).
Therefore, dz = RieRiθ dθ. This gives
Z Zπ
eikz dz eikR(cos(θ)+i sin(θ)) RieRiθ dθ
= .
z2 + 1 (Reiθ )2 + 1
C2 0
We must consider what happens to the magnitude of the above integral as R → ±∞.
Zπ Zπ
eikR(cos(θ)+i sin(θ)) RieRiθ dθ eikR(cos(θ)+i sin(θ)) RieRiθ
≤ dθ (by triangle inequality)
(Reiθ )2 + 1 (Reiθ )2 + 1
0 0
Zπ
eikR cos(θ)−kR sin(θ)
≤R dθ
R2 e2iθ + 1
0
Zπ
e−kR sin(θ)
≤R dθ (because eikR cos(θ) ≤ 1).
R2 e2iθ + 1
0
CHAPTER 2. COMPLEX ANALYSIS 44
1 1
Observe that for R → ∞, we have R2 e2iθ +1
≤ R2 −1
, because the term R2 e2iθ must lie
between −R2 and R2 .
Zπ
R
≤ 2 e−kR sin(θ) dθ
R −1
0
Zπ/2
2R
≤ 2 e−kR sin(θ) dθ
R −1
0
Zπ/2
2R 2θ
≤ 2 e−kR π dθ (because sin(θ) ≥ 2θ
π for all θ ∈ [0, π]).
R −1
0
Zπ/2
2R 2θ 2R π(1 − e−kR )
2
e−kR π dθ = 2
→0 as R → ∞.
R −1 R −1 2kR
0
Z
π
f (z)dz ≤ MR ,
a
CR
Z
+∞
cos(ωx)dx
I1 = , ω > 0.
1 + x2
−∞
Note: the respective indefinite integral is not expressible via elementary functions and one
needs an alternative way of evaluating the definite integral.
Solution. Observe that
Z
+∞
sin(ωx)dx
= 0,
1 + x2
−∞
just because the integrand is odd (skew-symmetric) over x. Combining the two formulas
above one derives
Z
+∞ Z
+∞ Z
+∞
cos(ωx)dx sin(ωx)dx exp(iωx)dx
I1 = + = .
1 + x2 1 + x2 1 + x2
−∞ −∞ −∞
On the other hand IR can be represented as a sum of two integrals, one over [−R, R], and
one over the semi-circle. Sending R → ∞ one observes that the later integral vanishes, thus
leaving us with the answer
I1 = π exp(−ω).
Z
+∞
dx
I= ,
cosh x
−∞
Solution. Consider contour shown in Fig. 2.14 at a → ∞. Integral along the real axis coin-
cides with the desired integral. Integrals over left (up) and right (down) vertical portions
give zero in the a → ∞ limit (because the respective integrands decays to zero exponen-
tially). Note that
exp(x + iπ) + exp(−x − iπ) exp(x) + exp(−x)
cosh(x + iπ) = =− = − cosh(x).
2 2
R
iπ−∞
Therefore the fourth part of the contour becomes, dx/ cosh(x + iπ) = I. Summing up
iπ+∞
the four pieces and utilizing the result of Example 2.3.7 one derives, I + 0 + 0 + I = 2π,
i.e. I = π. Obviously the integral can also be evaluated directly (via anti-derivative and
definite integral)
π π
I = 2 arctan(tanh x/2)|+∞
−∞ = − − = π.
2 2
Exercise 2.9. Evaluate the following integrals reducing them to contour integrals
Z∞
dx
(a) ,
1 + x4
0
Z∞
dx
(b) ,
1 + x3
0
Z∞
(c) exp ix2 dx,
0
CHAPTER 2. COMPLEX ANALYSIS 47
a b c d
R r
f
0
Figure 2.16
Z∞
exp(ikx)dx
(d) ,
cosh(x)
−∞
(Note that a lot of details in this chain of transformations are dropped. We advise the
reader to reconstruct these details. In particular, we suggest to check that the integrals
over two semi-circles in Fig. 2.16 decay to zero with r → 0 and R → ∞. For the latter, you
may either estimate asymptotic value of the integral yourself, or use the (Jordan’s) Lemma
2.3.9.)
The limiting process just explained is often refereed to as the (Cauchy) principal value
of the integral
Z∞ ZR
exp(ix)dx exp(ix)dx
PV = lim = iπ. (2.29)
x R→∞ x
−∞ −R
In general if the integrand, f (x), becomes infinite at a point x = c inside the range of
integration, so that the limit on the right of the following expression
c−ε
ZR Z ZR
f (x)dx = lim dxf (x) + dxf (x) , (2.30)
ε→0
−R −R c+ε
exists, we call it the principal value integral. (Notice that any of the terms inside the
brackets on the right if considered separately may result in a divergent integral.)
Consider another example
Zb
dx b
= log , (2.31)
x a
a
where we write the integral as a formal indefinite integral. However, if a < 0 and b > 0 the
integral diverges at x = 0. And we can still define
−ε
Zb Z Zb
dx dx dx ε b b
PV := lim + = lim log + log = log , (2.32)
x ε→0 x x ε→0 −a ε |a|
a a ε
excluding ε vicinity of 0. This example helps us to emphasize that the principal value is
R −ε R
unambiguous – the condition that the ε-dependent integration limits in and ε are
R −ε/2 R
taken with the same absolute value, and say not and ε , is essential.
If the complex variables were used, we could complete the path by a semicircle from −ε
to ε about the origin (zero), either above or below the real axis. If the upper semicircle were
chosen, there would be a contribution, −iπ, whereas if the lower semicircle were chosen, the
contribution to the integral would be, −iπ. Thus, according to the path permitted in the
Rb
complex plane we should have a dz/z = log(b/|a|) ± iπ. The principal value is the mean
of these two alternatives.
CHAPTER 2. COMPLEX ANALYSIS 49
!%
i
!$
!#
r
!"
R -i
We discuss below a number of examples of definite integrals which are reduced to contour
integrals avoiding branch cuts.
R R R R
where the integral is broken in four parts, + + + . Then resulting (assuming that
C1 C2 C3 C4
r → 0 and R → ∞) in
I Z Z Z∞
dz dz dz dx
√ 2 = √ 2 + √ 2 =2 √ .
z(z + 1) z(z + 1) z(z + 1) x(x2 + 1)
C1 C3 0
On the other hand the full close contour contains two poles of the integrand, at z = ±i, in
the interior, therefore
I
dz
√ 2 = πi (Res (at z = i) + Res (at z = i)) ,
z(z + 1)
1 exp(3πi/4)
Res (at z = i) = lim (f (z)(z − i)) = lim √ = ,
z→i z→i z(z + i) 2
1 exp(−3πi/4)
Res (at z = −i) = lim (f (z)(z + i)) = lim √ = .
z→i z→i z(z − i) 2
Summarizing one arrives at the following answer
Z∞
dx exp(3πi/4) exp(−3πi/4) π
√ 2
= πi − =√ .
x(x + 1) 2 2 2
0
Solution. Let us analyze contour integral with almost the same integrand
I I
dz dz
= , (2.34)
z 2/3 (z − 1)1/3 f (z)
and the contour, shown in Fig. (2.18a), surrounding the cut connecting two branching points
of f (z), at z = 0 and z = 1 (both points are the branching points of the 3rd order).
Recall that the cuts are introduced to make functions which are multi-valued in the
complex plain (thus the functions which are not entire, i.e. not analytic within the entire
complex plain) to become analytic within the complex plain excluding the cut. Cut also
sets the choice for the (originally multi-valued) function branches. In the case under con-
sideration f (z) := z 2/3 (z − 1)1/3 has the following parameterization as we go around the
cut (in the negative direction):
CHAPTER 2. COMPLEX ANALYSIS 51
r !"
!# !$
a b
0 1
d c
!%
!# !$ r !"
0 1
!%
R
Taking advantage of f (z) analyticity everywhere outside the [0, 1] cut and using Cauchy’s
integral theorem let us transform the integral over, C1 ∪ C2 ∪ C3 ∪ C4 , to the integral with
the same integrand over the contour C shown in Fig. (2.18)
Z Z Z Z Z
dz dz dz dz dz
+ + + = . (2.39)
f (z) f (z) f (z) f (z) f (z)
C1 C2 C3 C4 C
On the other hand the contour integral over C can be computed in the R → ∞ limit:
Z Z0 Z2π
dz iR exp(iθ)dθ
= →R→∞ = −i dθ = −2πi. (2.40)
f (z) R exp(2iθ/3)(R exp(iθ) − 1)1/3
2/3
C 2π 0
Summarizing Eqs. (2.33, 2.34,2.35,2.36,2.37,2.38,2.39,2.40) one arrives at
−2πi π 2π
I= = =√ . (2.41)
− exp(iπ/3) + exp(−iπ/3) sin(π/3) 3
It may be instructive to compare this derivation with an alternative derivation of the
integral discussed in [1].
Exercise 2.11. Evaluate the integral
Z1
dx
√ , (2.42)
(1 + x2 ) 1 − x2
−1
by suggesting and evaluating an equivalent contour integral.
CHAPTER 2. COMPLEX ANALYSIS 53
∗
2.4 Extreme-, Stationary- and Saddle-Point Methods
In this auxiliary ∗ Section, we study the family of related methods which allow to approx-
imate integrals dominated by contribution of a special point and its vicinity. Depending
on the case it is called extreme-point (which is also called Laplace method), stationary-
point or saddle-point method (which is also called steepest-descent method). We start
discussing the extreme-point version, corresponding to estimating real-valued integrals over
a real domain, then we turn to estimation of oscillatory (complex-valued) integrals over
a real interval (stationary-point method) and then generalize to complex-valued integrals
over complex path (saddle-point method, or steepest-descent method).
Extreme- (or maximal-) point method applies to the integral
Zb
I1 = dx exp (f (x)) , (2.43)
a
where the real-valued, continuous function f (x) achieves its maximum at a point x0 ∈]a, b[.
Then one approximates the function by the first terms of its Taylor series expansion around
the maximum
(x − x0 )2 00
f (x) = f (x0 ) + f (x0 ) + O (x − x0 )3 , (2.44)
2
where we assume f 0 (x0 )=0. Since x0 is the maximum, f 0 (x0 ) = 0 and f 00 (x0 ) ≤ 0, and
we consider the case of a general position, f 00 (x0 ) < 0. One substitutes Eq. (2.44) in
Eq. (2.43) and then drops the O((x − x0 )3 ) term and extends the integration over [a, b] to
] − ∞, ∞[. Evaluating the resulting Gaussian integral one arrives at the following extreme-
point estimation s
2π
I1 → exp (f (x0 )) . (2.45)
−f 00 (x0 )
This approximation is justified if |f 00 (x0 )| 1.
√
are f (0) = 0 and f (± α) = α2 /2, and we thus choose the dominating extreme point,
√
xs = ± α, for further evaluations. In fact, and since the two (dominant) extreme-points
are fully equivalent, we pick one of them and then multiply estimation for the integral by
two:
Z
+∞ Z
+∞
2 00 √ 2 2
I ≈ 2 exp(α /2) dx exp(f ( α)x /2) = 2 exp(α /2) dx exp(−2αx2 )
−∞ −∞
r
2
= exp(α2 /2) ,
απ
√
where we also took into account that f 00 (± α) = −4α.
The same idea, known under the name of the stationary-point method, works for highly
oscillatory integrals of the form
Zb
I2 = dx exp (if (x)) , (2.46)
a
where real-valued, continuous f (x) has a real stationary point x0 , f 0 (x0 ) = 0. Integrand
oscillates least at the stationary point, thus guaranteeing that the stationary point and its
vicinity make dominant contribution to the integral. The statement just made may be a
bit confusing because the integrand, considered as a function over x is oscillatory making,
formally, integral over x to be highly sensitive to positions of the ends of interval. To
make the statement sensible consider shifting the contour of integration into the complex
plain so that it crosses the real axis at x0 along a special direction where if 00 (x0 )(x − x0 )2
shows maximum at x0 then making the resulting integrand to decay fast (locally along the
contour) with |x − x0 | increase. One derives
Z
I2 ≈ exp (if (x0 )) dx exp if 00 (x0 )/2(x − x0 )2
s
2π
= 00
exp if (x0 ) + isign(f 00 (x0 ))π/4 ,
|f (x0 )|
where dependence on the interval’s end-points disappear (in the limit of sufficiently large
|f 00 (x0 )|).
Solution. We can re-use here results of the Example 2.4.1. The stationary points of the
√
integrand are the same: xs = 0 and xs = ± α. Values of f at the stationary points are
√
f (0) = 0 and f (± α) = α2 /2 resulting in 1 and exp(iα2 /2) contributions to the integrand.
Therefore, in the asymptotic (large α) estimation we should keep all three contributions to
the integral. Computing second derivatives at the three stationary points, f 00 (0) = 2α and
√
f 00 (± α) = −4α, estimating the three contributions to I2 according to Eqs. (2.48), and
finally summing them up we arrive at
Z Z
I2 ≈ 2 exp(iα /2) dx exp(−2iαx ) + dx exp(iαx2 )
2 2
r r
π 2π
= 2 exp(iα2 /2 − iπ/4) + exp(iπ/4).
2α α
Now in the most general case (of the saddle-point method, also called the steepest-
descent method) we consider the contour integral
Z
I3 = dz exp (f (z)) , (2.47)
C
assuming that f (z) is analytic along the contour, C, and also within a domain, D, of the
complex plain, the contour is embedded in. Let us also assume that there exists a point, z0 ,
within D where f 0 (z0 ) = 0. This point is called a saddle-point because iso-lines of f (z) in
the vicinity of z0 show a saddle – minimum and maximum along two orthogonal directions.
Deforming C such that it passes z0 along the “maximal” path (where f (z) reaches maximum
at z0 ) one arrives at the following saddle-point estimation
s
2π
I3 → exp (f (z0 )) , (2.48)
−f 00 (z0 )
where the square-root sign stand for its main (standard) branch, i.e. ∀θ ∈ [0, 2π] :
p
exp(iθ) = exp(iθ/2). In what concerns applicability of the saddle-point approximation
– the approximation is based on truncating the Taylor expansion of f (z) around z0 , which
is justified if f (z) changes significantly where the expansion applies, i.e. |f 00 (z0 )|R2 1,
where R is the radius of convergence of the Taylor series expansion of f (z) around z0 .
Two remarks are in order. First, let us emphasize that f (z0 ) and f 00 (z0 ) can both be
complex. Second, there may be a number (more than one) of saddle points in the region
of the f (z) analyticity. In this case one picks the saddle-point achieving maximal value (of
f (z0 )). In the case of degeneracy, i.e. when multiple saddle-points achieves the same value,
as was the case both in the Example 2.4.1 and Example 2.4.2, one deforms the contour to
pass through all the saddle-points then replacing right hand side in Eq. (2.48) by the sum
of the saddle-point contributions.
CHAPTER 2. COMPLEX ANALYSIS 56
Z
+∞
(a) dx cos αx2 − x3 /3 ,
−∞
Z
+∞
(b) dx exp −x4 /4 cos(αx).
−∞
Fourier Analysis
where k = (k1 , · · · , kd ) is the “wave-vector”, dk = dk1 · · · dkd , and fˆ(k) is the Fourier
transform of f (x), defined according to
Z
ˆ
f (k) := dx exp −ikT x f (x). (3.2)
Rd
57
CHAPTER 3. FOURIER ANALYSIS 58
Eq. (3.1) and Eq.(3.2) are inverses of each other (meaning, for example, that substituting
Eq. (3.2) into Eq. (3.1) will recover f (x)), and it is for this reason that the Fourier integral
is also called the Inverse Fourier Transform. Proofs that they are inverses, as well as
other important properties of the Fourier Transform, rely on Dirac’s δ-function which in
d-dimensions can be defined as
Z
1
δ(x) := dk exp(ikT x). (3.3)
(2π)d Rd
= a dx f (x)e −ikx
+ b dx g(x)e−ikx = afˆ(k) + bĝ(k). (3.4)
R R
CHAPTER 3. FOURIER ANALYSIS 59
Frequency Modulation: For any real number k0 , if h(x) = exp(ik0 x)f (x), then
Z Z Z
ĥ(k) = dx h(x)e−ikx
= dx f (x)e ik0 x −ikx
e = dx f (x)e−i(k−k0 )x = fˆ(k − k0 ).
R R R
(3.6)
The case a = −1 leads to the time-reversal property: if h(t) = f (−t), then ĥ(ω) = fˆ(−ω).
Complex Conjugation: If h(x) is a complex conjugate of f (x), that is, if h(x) = (f (x))∗ ,
then
Z Z Z ∗ ∗
ĥ(k) = −ikx
dx h(x)e = dx (f (x)) e ∗ −ikx
= dx f (x)eikx = fˆ(−k) .
R R R
(3.8)
(a) If f is real, then fˆ(−k) = (fˆ(k))∗ (this implies that fˆ is a Hermitian function.)
Exercise 3.2. Show that the Fourier transform of a radially symmetric function in two
variables, i.e. f (x1 , x2 ) = g(r), where r2 = x21 + x22 , is also radially symmetric, i.e.
fˆ(k1 , k2 ) = fˆ(ρ), where ρ2 = k12 + k22 . (We remind that in polar coordinates (r, θ) a
radially symmetric function does not depend on the angle θ.)
Differentiation: If h(x) = f 0 (x), then under the assumption that |f (x)| → 0 as x → ±∞,
Z Z h i∞ Z
ĥ(k) = dx h(x)e−ikx = dxf 0 (x)e−ikx = f (x)e−ikx − dx(−ik)f (x)e−ikx
R R −∞ R
= (ik)fˆ(k). (3.9)
CHAPTER 3. FOURIER ANALYSIS 60
R∞
Integration: Substituting k = 0 in the definition, we obtain fˆ(0) = −∞ f (x) dx. That is,
the evaluation of the Fourier transform at the origin, k = 0, equals the integral of f over
all its domain.
Proofs for the following two properties rely on the use of the δ-function (which will
not be addressed until Section 3.3), and require more careful consideration of integrability
(which is beyond the scope of this brief introduction). The following two properties are
added here so that a complete list of properties appears in a single location.
R
Unitarity [Parceval/Plancherel Theorem]: For any function f such that |f | dx < ∞
R
and |f |2 < ∞,
Z Z Z Z Z
∞
2
∞
∗
∞ ∞
dk1 ik1 x ˆ ∞
dk2 −ik2 x ˆ ∗
dx |f (x)| = dxf (x) (f (x)) = dx e f (k1 ) e f (k2 )
−∞ −∞ −∞ −∞ 2π −∞ 2π
Z ∞
1
= dk |fˆ(k)| .2
(3.10)
2π −∞
Definition 3.2.1. The integral convolution of the function f with the function g, is defined
as Z
(f ∗ g)(x) := dy g(x − y)f (y), (3.11)
R
Convolution: Suppose that h is the integral convolution of f with g, that is, h(x) =
(f ∗ g)(x), then
Z Z Z
ĥ(k) = dx h(x)e −ikx
= dx dy f (x − y)g(y)e−ikx = fˆ(k)ĝ(k). (3.12)
R R R
developed a rigorous theory for such ‘functions’, which became known as the theory of dis-
tributions. We usually denote this ‘function’ by δ(x) and call it the (Dirac) δ-function. See
[1](ch. 4) for more details.
We begin our study of Dirac’s δ-function by considering the sequence of functions given by
1/, |x| ≤ /2
δ (x) = . (3.13)
0, |x| > /2
The point-wise limit of δ is clearly zero for all x 6= 0, and therefore the integral of the limit
of δ must also be zero, that is,
Z∞
lim δ (x) = 0 ⇒ dx lim f (x) = 0. (3.14)
→0 →0
−∞
However, for any > 0, the integral of δ is clearly unity, and therefore the limit of the
integral of δ must also be unity
Z∞ Z∞
dx δ (x) = 1 ⇒ lim dx δ (x) = 1. (3.15)
→0
−∞ −∞
Although, Eq. (3.14) suggests that δ (x) may not be very interesting as a function, the
behavior demonstrated by Eq. (3.15) motivates the use of δ (x) as a functional a . For any
sufficiently nice function φ(x), define the functionals δ [φ] and δ[φ] by
Z∞
δ[φ] := lim δ [φ] := lim dx δ (x)φ(x). (3.16)
→0 →0
−∞
a
In casual terms, a function takes numbers as inputs, and gives numbers as outputs, whereas a functional
takes functions as inputs and gives numbers as outputs
CHAPTER 3. FOURIER ANALYSIS 62
Letting m and M represent the minimum and maximum values of φ(x) on the interval
−/2 < x < /2 gives the bounds
m ≤ δ [φ] ≤ M (3.18)
The point-wise limit δ̃ (x) is also zero for every x 6= 0, so as before, the integral of the limit
must be zero
Z∞
lim δ̃ (x) = 0 ⇒ dx lim δ̃ (x) = 0. (3.21)
→0 →0
−∞
A suitable trigonometric substitution shows that the integral of δ̃ (x) is also unity for each
> 0, and as before, the limit of the integrals must be unity:
Z∞ Z∞
dx δ̃ (x) = 1 ⇒ lim dx δ̃ (x) = 1. (3.22)
→0
−∞ −∞
As with δ (x), we can use δ̃ (x) to define the functionals δ̃ [φ(x)] and δ̃[φ] by
Z∞
δ̃[φ] := lim δ̃ [φ] := lim δ̃ (x)φ(x)dx. (3.23)
→0 →0
−∞
This time it takes a little more thought to find the appropriate bounds, but with some
effort, it can be shown that
δ̃[φ] = lim δ̃ [φ] = φ(0) (3.24)
→0
We have defined the δ-function in Eq. (3.3) as the limit of a particular δ-sequence, namely
the ‘top-hat’ function given in Eq. (3.14). One has to wonder whether there may be other
δ-sequences which give the same limit. For example, consider
2t2
δ(t) = lim . (3.25)
→0 π(t2 + 2 )2
To validate the suitability of Eq. (3.25) as an alternative definition of the δ-function one
R
needs to check first that δ(t) → 0 as → 0 for all t 6= 0, and second that dtδ(t) = 1. (It
is easy to evaluate this integral as the complex pole integral and closing the contour, for
example, over the upper part of the complex plane. Observing that the integrand has pole
of the second order at t = i, expanding it into Laurent series around i and keeping the
c = −1 coefficient, and then using the Cauchy formula for the contour integral, we confirm
that the integral is equal to unity.)
Exercise 3.3. Validate the following asymptotic representations for the δ-function
2
1 t
(a) δ(t) = lim √ exp − ,
→0 π
1 − cos(nt)
(b) δ(t) = lim .
n→∞ πnt2
In many applications we deal with periodic functions. In this case one needs to consider
relations hold within the interval. In view of the δ-function extreme locality (just explored),
all the relations discussed above extend to this case.
Example 3.3.1. Validate the following asymptotic representation for the δ-function on the
interval (−π, π)
1 − r2
δ(θ) = lim ,
r→1− 2π(1 − 2r cos(θ) + r2 )
Z π
ii. δr (θ) dθ = 1 for r < 1 (i.e. for > 0 where r = 1 − ).
−π
CHAPTER 3. FOURIER ANALYSIS 64
i. To show that the point-wise limit of δr (θ) is zero for each θ 6= 0, note that for any θ 6= 0,
limr→1− 1 − 2r cos(θ) + r2 > 0 but limr→1− 1 − r2 = 0.
ii. To show that δr (θ) integrates to unity for each r. There is a clever trick that evaluates
this integral by a complex-valued contour integral. Let C be the parameterization of the
unit circle z(θ) = 1eiθ for −π < θ < π. Therefore dz = ieiθ dθ = izdθ. Now consider
Z π Zπ
1 1 − r2
δr (θ)dθ = dθ
−π 2π 1 − 2r cos(θ) + r2
−π
Z
1 1 − r2 dz
= iθ −iθ 2
2π 1 − r(e + e ) + r iz
C
Z
1 1 − r2 dz
= 2
2πi 1 − r(z + 1/z) + r z
C
Z
1 1 − r2
= dz
2πi −r + (1 + r2 )z − rz 2
C
Z
1 1 − r2
= dz
2πi (1 − rz)(z − r)
C
The integrand has a simple pole inside the contour at z1 = r with residue equal to 1. (There
is also a simple pole at z2 = 1/r with residue r, but this is irrelevant because it is outside
Rπ
the contour.) Therefore, δr (θ)dθ = 1.
−π
Example 3.3.2. For b, c ∈ R, show that cδ(x − b)f (x) = cf (b)δ(x − b).
Solution. We need to show that (a) the two functions are zero at x 6= b (which is trivial)
R∞ R∞
and (b) that their integrals are equal: dx cδ(x − b)f (x) = c dy δ(y)f (y + b)
−∞ −∞
Z∞
dk
(a) δ(x) = exp(ikx)
2π
−∞
Solution. (a) We identify the expression on the RHS as the inverse Fourier transform of
R∞
the function fˆ(k) = 1: f (x) = 2π
1
dk 1eikx . Even though the constant function is not
−∞
integrable in the traditional sense, the theory of distributions (which are functionals of the
δ-function type) allows us to give meaning to this integral. We know that the δ-function
R
is defined so that for any suitable function φ(x), dx δ(x)φ(x) = φ(0). Even if we cannot
R
integrate f (x) directly, but can show that, dx f (x)φ(x) = φ(0), for every suitable test
function, φ(x), then we can assert that, f (x) = δ(x).
(b)
Z∞ Z∞ Z∞ Z∞ Z∞
dk −ikx dk −ikx dk
f [φ(x)] = dx φ(x) 1e = dx φ(x)e = φ̂(k)
2π 2π 2π
−∞ −∞ −∞ −∞ −∞
Z∞
dk
= φ̂(k)eik0 = φ(0).
2π
−∞
And since, f [φ] = φ(0), for every suitable test function φ, we say that f (x) = δ(x).
We now return to proving (1) that the Fourier Transform and the inverse Fourier Transfrom
are indeed inverses of each other, (2) Plancherel’s theorem and (3) the convolution property.
Proposition 3.3.6. The Fourier Transform of the convolution of the function f with the
function g is the product fˆ(k)ĝ(k)
Z∞ Z ∞
\
(f ∗ g)(k) = dx dy g(x − y)f (y)e−ikx
−∞
−∞
Z∞ Z ∞ Z ∞ Z ∞
dk1 dk2 ˆ
= dx dy f (k1 )ĝ(k2 )e−ikx+ik1 (x−y)+ik2 y
−∞ −∞ 2π −∞ 2π
−∞
Z∞ Z∞ Z∞ Z∞
dx −ikx+ik1 x dy −ik1 y+ik2 y
= dk1 dk2 fˆ(k1 )ĝ(k2 ) e e
2π 2π
−∞ −∞ −∞ −∞
Z∞ Z∞
= dk1 fˆ(k1 )δ(k − k1 ) dk2 ĝ(k2 )δ(k1 − k2 ) = fˆ(k)ĝ(k)
−∞ −∞
CHAPTER 3. FOURIER ANALYSIS 66
where in transition from the first to the second lines we exchange order of integrations
assuming that all the integrals involved are well-defined.
Remark. Using the δ-function as the convolution kernel yeilds the self-convolution property:
Z
f (x) = dyδ(x − y)f (y). (3.26)
Consider δ-function of a function, δ(f (x)). It can be transformed to the following sum
over zeros of f (x),
X 1
δ (f (x)) = δ(x − yn ).
n
|f 0 (y n )|
To prove the statement, one, first of all, recall that δ-function is equal to zero at all points
where its argument is nonzero. Just this observation suggest that the answer is a sum of
δ-functions and what is left is to establish weights associated with each term in the sum.
Pick a contribution associated with a zero of f (x) and integrating the resulting expression
over a small vicinity around the point, make the change of variable
Z Z
df
dxδ(f (x)) = 0
δ(f (x)).
f (x)
Because of the δ(f (x)) term in the integrand, which is nonzero only at the zero point of
f (x), we can replace f 0 (x) by f 0 evaluated at the zero and move it out from the integrand.
The remaining integral obviously depends on the sign of the derivative.
The d-dimensional δ-function, which was instrumental for introducing d-dimensional Fourier
transform in Section 3.1, is simply a product of one dimensional δ-functions, δ(x) =
δ(x1 ) · · · δ(xn ).
Solution. Functional expressions for the δ-function in the Cartesian frame and Polar frame
are
Z Z
f (r) = δ(r − r̃)f (r̃)dr̃ = δ(x − x̃)δ(y − ỹ)δ(z − z̃)f (x̃, ỹ, z̃)dx̃dỹdz̃, (3.27)
Z
= δ(θ − θ̃)δ(φ − φ̃)δ(r − r̃)f (r̃, θ̃, φ̃)dr̃dθ̃dφ̃. (3.28)
On the other hand the volume element transformation from the Cartesian frame to the
Polar frame is
The δ-function is not technically a well-defined function, as it only exists in the context
of being integrated against a well-defined function. However, formally, using integration
techniques, we can write down a well-defined notion for a “derivative” of the δ-function.
In fact, we can “differentiate” discontinuous or classically non-differentiable functions using
the same notion. Once again, we stress that this is not true differentiation, but rather
something that looks like differentiation in form. This technique is often referred to as
formal differentiation.
R∞
Substituting, f (x) = 1 into Eq. (3.26) we derive, −∞ dxδ(x) = 1. This motivates
introduction of a function associated with an incomplete integration of the δ(x)
Z y (
0, y < 0
θ(y) := dxδ(x) = (3.30)
−∞ 1, y > 0,
One gets, differentiating Eq. (3.30), that θ0 (x) = δ(x). We can also differentiate the
δ-function. Indeed, integrating Eq. (3.26) by parts, and assuming that the respective anti-
derivative is bounded, we arrive at
Z
dyδ 0 (y − x)f (y) = −f 0 (x) (3.32)
Expanding f (x) in the Taylor series around x = y, ignoring terms of the second order (and
higher) in (x − y), and utilizing Eq. (3.34) one arrives at
Notice that δ 0 (x) is skew-symmetric and f (x)δ 0 (x − y) is not equal to f (y)δ 0 (x − y).
We have assumed so far that δ 0 (x) is convolved with a continuous function. To extend it
to the case of piece-wise continuous functions with jumps and jumps in derivative, one need
to be more careful using integration by parts at the points of the function discontinuity. An
exemplary function of this type is the Heaviside function just discussed. This means that
if a function, f (x), shows a jump at x = y, its derivative allows the following expression
where, f (y + 0) − f (y − 0), represents value of the jump and g(x) is finite at x = y. Similar
representation (involving δ 0 (x)) can be build for a function with a jump in its derivative.
Then the δ(x) contribution is associated with the second derivative of f (x),
Solution. In corollary 3.3.5 where we showed that the inverse Fourier transform of unity
was δ(x). A similar calculation shows that 1̂(k) = 2πδ(k)
Example 3.4.3. Show that the Fourier transform of a square pulse function is a sinc
function:
b, |x| < a 2b
f (x) = ⇒ fˆ(k) = sin(ka)
0, |x| > a. k
Solution.
Z Z a
b a
b −ika
fˆ(k) = dxf (x)e −ikx
=b dxe−ikx = e−ikx = e − eika
R −a (−ik) −a −ik
2b
= sin(ka). (3.36)
k
Example 3.4.4. Show that the Fourier transform of a sinc function is a square pulse:
sin(ax) π/a, |k| < a
g(x) = ⇒ ĝ(k) =
ax 0, |k| > a.
There are a number of different solutions to this problem and it is instructive to look at
each one.
Solution.
Z∞ Z∞ Z∞ Z∞
sin(ax) −ikx eiax − e−iax −ikx e−i(k−a)x e−i(k+a)x
ĝ(k) = dx e = dx e = dx − dx
ax 2iax 2iax 2iax
−∞ −∞ −∞ −∞
| {z } | {z }
I II
π/a, |k| < a
= ,
0, |k| > a.
where the integrals I and II are computed by transforming them to contour integrals
analogous to Eq. (2.28) with contours (distinct for the two contributions) shown in Fig. 2.16.
For both integrals, there is a simple pole at z = 0 with residue 1/(2ia). The contribution
from the pole is ±(1/2)(2πi)/(2ia) where the (1/2) arises from the fact that the pole lies
on the contour of integration and the +/− is determined by the orientation of contour
(Contours that are closed in the upper-half plane do not need to be reversed (+) whereas
contours closed in the lower-half plane must be reversed (-)).
When k < a, the contour for I must be closed in the upper half plane and clockwise
traversal coincides with the orientation of I (i.e I = +π/(2a)). When k > a, the contour
for I must be closed in the lower half plane and clockwise traversal must be reversed to
coincide with the orientation of I, (i.e. I = −π/(2a)).
CHAPTER 3. FOURIER ANALYSIS 70
Similarly, when k < −a, the contour for II must be closed in the upper half plane and
clockwise traversal coincides with the orientation of II (i.e II = +π/(2a)). When k > −a,
the contour for II must be closed in the lower half plane and clockwise traversal must be
reversed to coincide with the orientation of II, (i.e. II = −π/(2a)).
Solution. This solution uses the technique of ‘differentiating under the integral’, more for-
mally known as Leibniz’s integration rule. The integral is tricky because of the x in the
denominator. If we consider the integrand as a function of two variables, x and a, and
differentiate with respect to a, the x will disappear. We can then integrate with respect to
x without concern. Taking the anti-derivative of this result with respect to a to ‘undo’ the
earlier differentiation will yield the final answer. Define a function I(a) such that
Z∞
sin(ax) −ikx
I(a) := ĝ(k) = dx e
ax
−∞
The
∞
Z Z∞ Z∞
d d sin(ax) −ikx ∂ sin(ax) −ikx
(aI(a)) = dx e = dx e = dx cos(ax)e−ikx
da da x ∂a x
−∞ −∞ −∞
= πδ(a − k) + πδ(a + k)
Solution.
Z Z∞
2
fˆ(k) = dxf (x)e −ikx
=a dxe−bx e−ikx
R −∞
2 Z∞ !
k ik 2
= a exp − dx exp −b x +
4b 2b
−∞
2 Z∞
a k 02
= √ exp − dx0 e−x
b 4b
−∞
2r
k π
= a exp − .
4b b
CHAPTER 3. FOURIER ANALYSIS 71
(a) If k < 0, the integral fˆ(k) can be computed with a complex-valued contour integral.
Consider the semi-circular contour in the upper-half plane. The contour encloses a
simple pole at z = ia with residue eka /(2ia).
Z
ˆ 1 1 1 ka π ak
f (k) = e−ikz dz = 2πi e = e .
z + ia z − ia 2ia a
C
Similarly, if k > 0, the integral fˆ(k) can be computed with the semi-circular contour
in the lower-half plane. The contour encloses a simple pole at z = −ia with residue
−e−ka /(2ia). For this contour, we must multiply to −1 to reverse the orientation of
the contour.
Z
1 1 1 −ka π −ka
fˆ(k) = e−ikz dz = 2πi e = e .
z + ia z − ia 2ia a
C
We can find closed form representations of other functions by combining the examples above
with the properties in Section 3.2.
CHAPTER 3. FOURIER ANALYSIS 72
2
Example 3.4.7. Let f (x) = e−x and define g(x) = (f ∗ f ∗ f ∗ f )(x). Find (a) ĝ(k) and
(b) g(x).
√ 2
Solution. From Example 3.4.5, we know that fˆ(k) = πe−k 4 . Therefore,
4
f ∗ f ∗ f )(k) = fˆ(k) · fˆ(k) · fˆ(k) · fˆ(k) = fˆ(k)
ĝ(k) = (f ∗ \
√ 2
4 2
= πe−k /4 = π 2 e−k .
We recognize ĝ(k) as a Gaussian function (see example 3.4.5), and we know that if ĝ(k) is
p 2 2
a Gaussian in the form a (π/b)e−k /(4b) , then g(x) is the Gaussian ae−bx . A little algebra
shows we can re-write ĝ(k) in this form by setting a = 12 π 3/2 and b = 41 :
1 2
g(x) = π 3/2 e−x /4 .
2
Example 3.4.8. Let g(x) = max(0, 1 − |x|) (sometimes called a ‘tent’ function). Compute
ĝ(k).
Solution. Let f (x) = 1 if |x| < 1/2 and 0 otherwise. Note that (f ∗ f )(x) = g(x). From
example 3.4.3, we know that the Fourier transform of a square pulse is a sinc function.
Therefore,
2
2 sin(k/2)
ĝ(k) = fˆ(k)fˆ(k) = = sinc2 (k/2).
k
Example 3.4.9. Let f (t) be given by
cos(ω0 t), |t| < A;
f (t) =
0, otherwise;
where ω0 , A ∈ R with A > 0. (a) Compute fˆ(k), the Fourier transform of f , as a function
of ω0 and A. (b) Identify the relationship between the continuity of f and ω0 and A, and
discuss how this affects the decay of the Fourier coefficients as |k| → ∞.
Solution.
where
2 sin(Ak)
cos
[ ω0 (k) = πδ(k − ω0 ) + πδ(k + ω0 ) and rect
[A (k) = .
k
Therefore,
2π sin(A(k − ω0 )) 2π sin(A(k + ω0 ))
fˆ(k) = cos
[ ω0 (k) ∗ RectA (k) =
\ + .
k − ω0 k + ω0
CHAPTER 3. FOURIER ANALYSIS 73
Definition
Z
+∞ Z
+∞
1
fˆ(k) = f (x)e−ikx dx ⇔ f (x) = fˆ(k)eikx dk
2π
−∞ −∞
(b) Some basic algebra shows that we can write fˆ(k) as follows:
ˆ 2k sin(Ak) cos(Aω0 ) − 2ω0 cos(Ak) sin(Aω0 )
f (k) = 2π .
k 2 − ω02
In general, the fˆ(k) ∼ 1/k as |k| → ∞, unless ω0 and A satisfy sin(Ak) cos(Aω0 ) = 0,
(i.e. Aω0 = nπ + π ), then fˆ(k) ∼ 1/k 2 .
2
2a
Exercise 3.7. Let a ∈ C with Re(a) > 0 and define fa (x) := . If also b ∈ C
a2 + (2πx)2
with Re(b) > 0, show that (fa ∗ fb )(x) = fa+b (x).
Example
q 3.4.10. Let x = (x1 , x2 , . . . , xd ) ∈ Rd , and use the notation |x| to represent
x21 + x22 + · · · + x2d . Find the Fourier transform of g(x) = exp(−|x|2 ).
CHAPTER 3. FOURIER ANALYSIS 74
Solution.
Z Z
−ikT x 2 Tx
ĝ(k) = g(x)e dx = e−|x| e−ik dx
Rd Rd
Z
2 2 2
= e−x1 −x2 −···−xd e−i(k1 x1 +k2 x2 +···+kd xd dx
Rd
Z Z Z
2 2 2
= ··· e−x1 e−x2 . . . e−xd e−ik1 x1 eik2 x2 . . . eikd xd dx1 dx2 . . . dxd
R R R
Z Z Z
−x21 −ik1 x1 −x22 −ik2 x2 2
= e e dx1 e e dx2 . . . e−xd . . . e−ikd xd dxd
R R R
√ √ √
= π exp(−k12 /4) π exp(−k22 /4) . . . π exp(−kd2 /4) = π d/2 exp(−|k|2 /4).
This, so-called Fourier series, representation of a periodic function immediately shows that
the Fourier Series is a particular case of the Fourier integral:
∞
X Z∞ Z∞ ∞
X
fn exp (2πinx/L) dk δ(k − n) = dk exp (2πikx/L) fn δ(k − n). (3.38)
n=−∞ −∞ −∞ n=−∞
Like in the case of the Fourier transform and inverse Fourier transform, we would like to
invert Eq. (3.37) and express fn via f (x). By analogy with the Fourier transform, consider
integrating the left hand side of Eq. (3.37) with the oscillating factor, exp(−2πikx/L)/L,
where k is an integer, over x ∈ [0, L]. Applying this integration to the right hand side of
Eq. (3.37), we run into the following easy to evaluate integral (for each term in the resulting
sum)
ZL (
dx ikx2π/L −inx2π/L ei(k−n)2π − 1 1, k = n
e e = = δk,n := , (3.39)
L i(k − n)2π 0, k 6= n
0
CHAPTER 3. FOURIER ANALYSIS 75
Definition
Z +∞ Z +∞
1
fˆ(k) = −ikx
f (x)e dx ⇔ f (x) = fˆ(k)eikx dk
−∞ 2π −∞
Transformations
Linearity a f (x) + b g(x) a fˆ(k) + b ĝ(k)
Space-Shifting f (x − x0 ) e−ikx0 fˆ(k)
k-Space-Shifting eik0 x f (x) fˆ(k − k0 )
Space Reversal f (−x) fˆ(−k)
Space Scaling f (ax) |a|−1 fˆ(k/a)
Calculus
Convolution g(x) ∗ h(x) ĝ(k) ĥ(k)
1
Multiplication g(x)h(x) 2π ĝ(k) ∗ ĥ(k)
Differentiation d
f (x) ik fˆ(k)
R xdx
Integration 1 ˆ
0 f (ξ) dξ ik f (k)
where expression in the middle is resolved via the LH́opitale rule and δk,m is the common
notation for the so-called Kronecker delta. We observe that only one term in the sum is
nonzero, therefore arriving at the desired formula
ZL
dx nx
fn = f (x) exp −2πi . (3.40)
L L
0
Notice that one may also consider Fourier Transform/Integral as a limit of the Fourier
Series. Indeed in the case when a typical scale of the f (x) change is much less than L, many
harmonics are significant and the Fourier series transforms to the Fourier integral
∞
X Z∞
L
··· → dk · · · . (3.41)
−∞
2π
−∞
Let us illustrate expansion of a function into Fourier series on example of f (x) = exp(αx)
considered on the interval, 0 < x < 2π. In this case the Fourier coefficients are
Z2π
dx 1 1
fn = exp(−inx + αx) = e2πα − 1 . (3.42)
2π 2π α − in
0
Exercise 3.9. Let f (x) = x and g(x) = |x| be defined on the interval −π < x < π. (a)
Expand both functions as a Fourier series. (b) Compare the dependence of the n-th Fourier
coefficient on n for the two functions.
We conclude this Section reminding the reader that our construction of the Fourier
Series assumed that the set of harmonic functions forms a complete set of basis functions.
Proving this assumption is outside the scope of this course. This proof (and many other
proofs) will be discussed in detail in the companion course, Math 584.
Definition
Scaling
Linearity af (x) + bg(x) afn + bgn
Space-Shifting f (x − x0 ) e2πinx0 /L fn
k-Space-Shifting e2πik/L f (x) fn−k
Space Reversal f (−x) f−n
Space Scaling f (ax) (with period L/a) fn
Calculus
Z L
Periodic Convolution f (ξ)g(x − ξ) dξ Lfn gn
0
∞
X
Multiplication f (x)g(x) fm gn−m
m=−∞
d 2πin
Differentiation Zdx xf (x) L fn
L
Integration dξf (ξ) 2πin fn
0
We will not prove the Riemann-Lebesgue lemma here but notice that a standard proof
is based on (a) showing that the lemma works for the case of the characteristic function
of a finite open interval in R1 , where f (x) is constant within ]a, b[ and zero otherwise, (b)
extending it to simple functions over R1 , that are functions which are piece-wise constant,
and then (c) building a sequence of simple functions (which are dense in L1 ) approximating
f (x) more and more accurately.
Let us mention the following useful corollary of the Riemann-Lebesgue Lemma: For
any periodic function f (x) with continuous derivatives up to order m, integration by parts
can be performed respective number of times to show that the n-th Fourier coefficient is
C
bounded at sufficiently large n according to |fn | ≤ |n|m+2
, where C = O(1).
In particular, and consistently with the example above, we observe that in the case
of a “jump”, corresponding to continuous anti-derivative, i.e. m = −1, |fn | is O(1/n)
asymptotically at n → ∞. In the case of a “ramp”, i.e. m = 0 with continuous function
but discontinuous derivative, |fn | becomes O(1/n2 ) at n → ∞. For the analytic function,
with all derivatives continuous, |fn | decays faster than polynomially as n increases.
Further details of the Lemma, as well as the general discussion of how the material of
this Section is related to material discussed in the theory course (Math 584) and also the
algorithm course (Math 589), will be given at an inter-core recitation session.
where definition of the function is in the first line and the second line describes expression
for the function in terms of the Fourier series. Notice that the 2π-periodic function jumps
at 2nπ by π/2.
Let us truncate the series in Eq. (3.44) and thus consider N -th partial Fourier Series
N
X sin((2n + 1)x)
SN (x) = . (3.45)
2n + 1
n=0
Gibbs phenomenon consists in the following observation: as N → ∞ the error of the approx-
imation around the jump-points is reduced in width and energy (integral), but converges
to a fixed height. See movie-style visualization (from wikipedia) of how SN (x) evolves with
N . (It is also reproduced in a julia-snippet available at the class D2L repository.)
Let us now back up this simulation by an analytic estimation and compute the limiting
value of the partial Fourier Series at the point of the jump. Notice that
X N
d sin(2(N + 1))
SN () = cos((2n + 1)) = , (3.46)
d 2 sin
n=0
where we have utilized formula for the sum of the geometric progression. Observe that
d
d SN () → N + 1 at → 0, that is the derivative is large (when N is large) and positive.
Therefore, SN () grows with to reach its (first close to = 0) maximum at ∗ = π/(2(N +
1)). Now we estimate the value of SN (∗ )
N sin (2n+1)π N
X 2(N +1) X sin nπ
N
SN (∗ ) = = + O(1/N )
2n + 1 2n
n=0 n=0
N →0
Zπ
1 sin t π
→ dt ≈ + 0.14, (3.47)
2 t 4
0
thus observing that at the point of the closest to zero maximum the partial sum systemat-
ically overshoots, f (0+ ) = π/4, by an O(1) amount.
Exercise 3.10. Generalize the two functions from Exercise 3.9 beyond the [−π, π) interval,
so they are 2π-periodic functions on [−5π, 5π). Compute the respective partial Fourier series
SN (x) for select N , and study numerically (or theoretically!) how the amplitude and the
width of the oscillations near the points x = mπ, m ∈ {−5, −4, . . . , 4} behave as N → ∞.
We complete our discussion of the Fourier Series by mentioning its arguably most signif-
icant application in the field of differential equations. Most differential equations of interest
can only be solved numerically. Mathematicians often find approximate solutions to a dif-
ferential equation by representing the solution as a Fourier series and then truncating the
series to a finite sum according to the desired accuracy. This method, called the spectral
method, will be discussed in the algorithm core course (Math 589).
CHAPTER 3. FOURIER ANALYSIS 80
We consider complex k and require that the integral on the right hand side of Eq. (3.48) is
converging (finite) at sufficiently large Re(k). In other words, f˜(k) is analytic at Re(k) > C,
where C is a positive constant.
Inverse Laplace Transform (ILT) is defined as a complex integral
Z
1
f (x) = dk exp (kx) f˜(k). (3.49)
2πi
C
over the so-called Bromwich contour, C, shown in Fig. (3.1). C can be deformed arbitrarily
within the domain, Re(k) > 0, of the f˜(k) analyticity. Note that by construction, and
consistently with the requirement imposed on f (x), the integral on the right hand side of
Eq. (3.49) is equal to zero at x < 0. Indeed, given that f˜(k) is analytic at Re(k) > 0 and
it approaches zero at k → ∞, contour C can be collapsed to surround ∞, which is also a
non-singular point for the integrand thus resulting in zero for the integral.
Properties of the Laplace Transform, following related properties of the Fourier Trans-
form discussed in Section 3.2 are listed formally in the Table below.
CHAPTER 3. FOURIER ANALYSIS 81
Im (k)
0
Re (k)
Figure 3.1: The Bromwich integration contour, C, in Eq. (3.48) is shown red. C is often
shown as straight line in the complex from ε − i∞ to ε + i∞, where ε is an infinitesimally
small positive number. Possible singularities of the LT Φ̃(k) may only be at the points with
negative real number, which are shown schematically as blue dots.
Scaling
Linearity af (x) + bg(x) af˜ + bg̃(k)
Space-Shifting f (x − x0 )θ(x − x0 ), x0 ∈ R>0 e−kx0 f˜(k)
k-Space-Shifting h(x) = exp(k0 x)f (x) h̃(k) = f˜(k − k0 )
Space Scaling h(x) = f (ax), ∀a ∈ R>0 h̃(k) = 1 f˜ k
a a
Differentiation h(x) = f 0 (x) h̃(k) = k f˜(k) − f (0− )
Rx
Integration h(x) = f (y)dy h̃(k) = k1 f˜(k)
0
Rx
Convolution h(x) = (f ∗ g)(x) = f (y)g(x − y)dy h̃(k) = f˜(k)g̃(k)
0
It is instructive to illustrate similarities and differences between the LT and the FT on
basic examples.
Consider, first, the one sided exponential
Exercise 3.12. Find the Inverse Laplace Transform of 1/(k 2 + a2 ). Show details.
The Laplace transform is often used to describe (and to manipulate) integral representations
of the special functions. Consider, f (x) = xν . Upto an elementary factor its Laplace
transform is related to the so-called Gamma, Γ, function (of the parameter ν)
Z∞ Z∞
Γ(ν + 1)
f˜(k) = ν −kx
dxx e =k −ν−1
dyy ν e−y = ,
k ν+1
0 0
where the integrals are well defined at k > 0. Then, the inverse Laplace transform returns
”definition” of the Γ function in terms of a contour integral
Z
ε+i∞
xν 1 ekx
= dk , (3.59)
Γ(ν + 1) 2πi k ν+1
ε−i∞
where ε > 0 to guarantee that we are on the right of the singularity at k = 0 which is a
pole if ν is positive integer and a branch point for noninteger ν.
In the general case of the branch point at k = 0, the second branch point of the integrand
in Eq. (4.24) is at k = ∞, and we ought to introduce the branch cut connecting the
branch points to guarantee analyticity of the integrand. We choose the branch cut along
CHAPTER 3. FOURIER ANALYSIS 83
!#$
!$ !&
!%
!"
!#%
Figure 3.2: Contour transformation for the integral representation of the Γ function via the
Inverse Laplace Transform. Integrand of Eq. (3.59) is analytic in the domain surrounded
by the (blue) contour, therefore returning zero for the integral.
R kx
the negative real axis. Then the integral, dk keν+1 , over C, shown (blue) in Fig. (3.2),
C
contains no singularities in the interior and it is therefore equal to zero according to the
+ −
Cauchy theorem. (Here, C = C1 + CR + C+ + Cr + C− + CR , R → ∞, r → 0 and
C1 =]ε − i∞, ε + i∞[.) We can show (using the Jordan’s lemma) that the integral along
±
CR and Cr tend to zero at R → ∞, r → 0. Therefore, the Bromwich contour in Eq. (3.59)
can be replaced by the integral over the −C+ − C− contour, i.e. the contour consisting of
first going along the negative part of the real axis from −∞ to 0 a tiny bit under the cut
and then returning back from 0 to −∞ above the cut. We arrive at
Z
1 1 ek
= dk ν+1 . (3.60)
Γ(ν + 1) 2πi k
−C− −C+
Eq. (3.60) just derived is quite useful for evaluating asymptotic expression of the function
f (x) at x → ∞ with f˜(k) which requires introduction of the branch cut (on the left from the
Bromwich contour). Indeed, consider the Inverse Laplace Transform formula (3.49). The
idea is that at x → +∞ the integral is dominated by the singularity furthest to the right in
the complex k plane. (The idea obviously applies not only to the cuts but also to the poles
of the integrand in Eq. (3.49).) Then, we can expand around the right-most singularity of
CHAPTER 3. FOURIER ANALYSIS 84
f˜(k), i.e.
X
f˜(k) ≈ cν (k − k0 )λν , (3.61)
ν
where k0 is the position of the right-most singularity of f˜(k) and λν may be non-integer.
A loop integral around the branch point at k0 results in an asymptotic series that can be
obtained integrating (3.61) term by term
Z Z !
1 1 X X cν
f (x) = dk f˜(k)ekx ≈ dk cν (k − k0 )λν ekx = ek0 x ,
2πi 2πi ν ν
Γ(−λν )xλν +1
Ck0 Ck0
where Ck0 is the contour surrounding the k0 singularity anti-clockwise around the cut and
we have used Eq. (3.60) for the Γ function.
which provides an explicit solution of the original differential equation stated in quadratures,
i.e. as an integral over the known integrand. Note, that in transition from the first line in
Eq. (3.65) to the second line we have exchanged the order of integration and evaluated the
pole integral at k = iq, also observing that the integral returns non-zero only at y < x.
In summary, when g(x) is well defined over the entire x ∈ R we solve the differential
Eq. (3.62) in three straightforward steps: (a) applying FT and arriving at (much simpler)
algebraic equation; (b) solving the algebraic equation; (c) applying the Inverse FT.
Let us now assume that g(x), f (x) 6= 0 only at x ≥ 0. In this case we repeat the logic
of the preceding derivation, however substituting the FT by the LT. We derive
Z
ε+i∞
θ(x) f (0+ ) + g̃(k) kx
f (x) = dk e
2πi q+k
ε−i∞
Z
ε+i∞ Z
+∞
θ(x) dk + kx
= f (0 )e + dyg(y)ek(x−y)
2πi q+k
ε−i∞ 0
Zx
where in transition from the first to the second line we have accounted for the fact that the
LT version of the Eq. (3.63) acquires the additional f (0+ ) factor, and then, in transition
from the second line to the third line we have exchanged the order of integration and
evaluated the pole integrals at k = −q. We also observe that Eq. (3.65) transitions into
Eq. (3.66) if we set f (x), g(x) to be nonzero only at x ≥ 0.
Finally, consider Eq. (3.62) in the case supporting 2π-periodicity of f (x), i.e. requiring
that q = im, where m ∈ Z, and 2π-periodic g(x). In this case we arrive, after application
of the FT to Eq. (3.62) at the discrete version of the algebraic Eq. (3.63), where fˆ(k) and
ĝ(k) are substituted by the Fourier coefficients, fk and gk , valid at k ∈ Z. Then the Fourier
Series version of the Eq. (3.65) becomes
+∞
X X+∞ X+∞ Zx
gk gk0 −m ik0 x 1 0
f (x) = eikx = e−imx 0
e = e−imx gk0 −m 0 + dyeik y
i(k + m) 0
ik 0
ik
k=−∞ k =−∞ k =−∞ 0
+∞
X Zx +∞
X Zx
gk 0
= + dyeim(y−x) eik (y−x) gk0 = f (0) + dyeim(y−x) g(y).
i(k + m)
k=−∞ 0 k0 =−∞ 0
(3.67)
Notice that in the 2π-periodic FS case, like in the semi-infinite LT case, final expressions,
given by Eq. (3.67) and Eq. (3.66) respectively, require providing an additional condition
CHAPTER 3. FOURIER ANALYSIS 86
Differential Equations
87
Chapter 4
A differential equation (DE) is an equation that relates an unknown function and its deriva-
tives to other known functions or quantities. Solving a DE amounts to determining the
unknown function. For a DE to be fully determined, it is necessary to define auxiliary
information, typically available in the form of an initial or boundary condition.
Often several DE’s may be coupled together in a system of DE’s. Since this is equivalent
to a DE of a vector-valued function, we will use the term “differential equation” to refer
to both single equations and systems of equations and the term “function” to refer to both
scalar- and vector-valued functions. We will distinguish between the singular and plural
only when relevant.
The function to be determined may be a function of a single independent variable, (e.g.
f = f (t) or f = f (x)) in which case the differential equation is known as an ordinary
differential equation, or it may be a function of two or more independent variables, (e.g.
f = f (x, y), or f = f (t, x, y, z)) in which case the differential equation is known as a partial
differential equation.
The order of a differential equation is defined as the largest integer n for which the nth
derivative of the unknown function appears in the differential equation.
Most general differential equation is equivalent to the condition that a nonlinear func-
tion of an unknown function and its derivatives is equal to zero. An ODE is linear if the
condition is linear in the function and its derivatives. We call the ODE linear, homoge-
neous if in addition the condition is both linear and homogeneous in the function and its
derivatives. It follows for the homogeneous linear ODE that, if f (t) is a solution, so is cf (t),
where c is a constant. A linear differential equation that fails the condition of homogeneity
is called inhomogeneous. For example, an nth order, inhomogeneous ordinary differential
equation is one that can be written as αn (t)f (n) (t) + · · · + α1 (t)f 0 (t) + α0 (t)g(t) = g(t),
88
CHAPTER 4. ORDINARY DIFFERENTIAL EQUATIONS. 89
where αi (t), i = 0, . . . n and g(t) are known functions. Typical methods for solving linear
differential equations often rely on the fact that the linear combination of two or more solu-
tions to the homogeneous DE is yet another solution, and hence the particular solution can
be constructed from from a basis of general solutions. This cannot be done for nonlinear
differential equations, and analytic solutions must often be tailor-made for each differential
equation, with no single method applicable beyond a fairly narrow class of nonlinear DEs.
Due to the difficulty in finding analytic solutions, we often rely on qualitative and/or ap-
proximate methods of analyzing nonlinear differential equations, e.g. through dimensional
analysis, phase plane analysis, perturbation methods or linearization. In general, linear dif-
ferential equations admit relatively simple dynamics, as compared to nonlinear differential
equations.
An ordinary differential equation (ODE) is a differential equation of one or more func-
tions of one independent variable, and of the derivatives of these functions. The term
ordinary is used in contrast with the term partial differential equation (PDE) where the
functions are with respect to more than one independent variables. PDEs will be discussed
in the Chapter 5.
A separable differential equation is a first order differential equation that can be written so
that the derivative function appears on one side of the equation, and the other side contains
the product or quotient of two functions, one of which is a function of the independent
variable, and the other a function of the dependent variable.
Z Z
dx f (t)
= ⇒ g(x)dx = f (t)dt ⇒ g(x)dx = f (t)dt. (4.1)
dt g(x)
Applying the method of separable differential equations (see Eq. (4.1)) and then recalling
the substitution (4.3), one arrives at
t τ
Z Z t Z
y(t) = exp dτ p(τ ) y0 + dτ g(τ ) exp − dτ p(τ ) .
t0
t0 t0
The method just applied to the first order differential equation is called the method
of variation of parameters because, c(t), in the derivation above can be considered as a
parameter which we vary, i.e. allow to depend on t.
Let us extend the idea of the parameter variation method to the case of the second order
inhomogeneous differential equation of a general position
d2 y dy
− p(t) − q(t)y(t) = g(t), (4.4)
dt2 dt
and try to find its general solution. Recall that the general solution of a linear inhomoge-
neous equation is a sum of the general solutions of the respective homogeneous equation
and of a particular solution of the inhomogeneous equation.
Consider, first, solution of the homogeneous equation, i.e. solution of Eq. (4.4) with zero
right hand side, g(t) = 0. This is a homogeneous differential equation of the second order
which should therefore have two independent solutions (we call them linearly independent
solutions). Let us denote the two solutions, y1 (t) and y2 (t) and form the so-called Wronskian
of the two
Next we compute the derivative of the Wronskian and use the fact that y1 and y2 are
solutions of the homogeneous version of Eq. (4.4). We derive
d
W = y1 y200(
+y(10 y(20 ( y20 y10 − y2 y100 = py1 y20 + ( 0
(
−( qy( 2−
1 y( (qy2 y1 − py2 y1 = pW.
( (( (
dt
Therefore, the Wronskian becomes
Zt
W (t) = exp dτ p(τ ) , (4.6)
t0
where t0 can be chosen arbitrarily. Moreover, given the relation (4.5), we can express one
of the two independent solutions via another one and the Wronskian (which we now know
explicitly).
Now the question becomes if we can find a particular solution (just one of many) of
the inhomogeneous Eq. (4.4)? Let us follow the idea of the method of the variation of
parameters and look for a particular solution of the Eq. (4.4) as a linear combination of
x1 (t) and x2 (t) multiplied by unknown parameters A(t) and B(t):
Recall that we are looking for a particular solution. Therefore, we can choose to relate the
two unknown coefficients, A(t) and B(t), as we find fit, leaving only one of them as a degree
of freedom. The form of Eq. (4.7) suggests to pick the relation so that the dependence on
p(t) in Eq. (4.8) disappears, i.e.
A0 y1 + B 0 y2 = 0. (4.9)
Notice that the order of derivatives in Eq. (4.10) is reduced in comparison with Eq.(4.8) we
started with. Furthermore, expressing B 0 via A0 , y1 , y2 according to Eq. (4.9) and substi-
tuting the result in Eq. (4.10) we arrive at
Zt
0 y2 (τ )g(τ )
W A + y2 g = 0 ⇒ A = − dτ . (4.11)
W (τ )
t0
CHAPTER 4. ORDINARY DIFFERENTIAL EQUATIONS. 92
Similarly expressing A0 via B 0 , y1 , y2 according to Eq. (4.9) and substituting the result in
Eq. (4.10) we derive the B-analog of Eq. (4.11)
Zt
y1 (τ )g(τ )
B= dτ . (4.12)
W (τ )
t0
• Find a homogeneous solution, x1 (t), and express its linearly independent counterpart,
x2 (t), via x1 (t) and the previously found Wronskian, W (t).
• Compute the time dependent factors A and B according to Eq. (4.11,4.12), therefore
presenting a particular solution of the original (inhomogeneous) equation according
to Eq. (4.7).
Example 4.1.2. Find the general solution to t2 x00 (t) + tx0 (t) − x(t) = t (where t 6= 0) given
that x(t) = t is a solution.
Solution. Set the leading coefficient to unity by dividing by t to get x00 + t−1 x0 − t−2 x = t−1
(where t 6= 0). Therefore p(t) = −t−1 . We compute the Wronskian
Z t Z t
W (t) = exp p(τ )dτ = exp −τ dτ = t−1
−1
t0 t0
The second linearly independent solution is found by
1 1
W (t) = y1 y20 − y2 y10 ⇒ = ty20 − y2 1 ⇒ y2 (t) = −
t 2t
Computing A and B
Z t Z t
y2 (τ )g(τ ) − 1 t−1 t−1 1
A(t) = − dτ =− dτ 2 −1 = log(t)
t0 W (τ ) t0 t 2
. Z t Z t
y1 (τ )g(τ ) t t−1 t2
B(t) = dτ = dτ −1 =
t0 W (τ ) t0 t 2
The general solution to the differential equation is
1 t
x(t) = c1 t + c2 t−1 + t log(t) −
2 4
CHAPTER 4. ORDINARY DIFFERENTIAL EQUATIONS. 93
Example 4.1.3. Find the general solution to r00 (θ) + r(θ) = tan(θ) for −π/2 < θ < π/2.
Solution. We compute the Wronskian
Z θ
0
W (θ) = exp 0 dθ = 1
θ0
Let r1 (θ) = cos(θ) be the first linearly independent solution. The second linearly indepen-
dent solution is found by
W (t) = r1 (θ)r20 (θ) − r2 (θ)r10 (θ) ⇒ 1 = cos(θ)r2 (θ) + r20 (θ) sin(θ) ⇒ r2 (θ) = sin(θ)
Computing A and B
Z θ 0 0 Z θ
0 r2 (θ )g(θ ) sin(θ) tan(θ)
A(θ) = − dθ 0
=− dθ0 = sin(θ) − log(sec(θ) + tan(θ))
θ0 W (θ ) θ0 1
.
Z t 0 0 Z θ
0 r1 (θ )g(θ ) cos(θ) tan(θ)
B(θ) = dθ 0
= dθ0 = − cos(θ)
θ0 W (θ ) θ0 1
The solution to the differential equation is
x(θ) = c1 cos(θ) + c2 sin(θ) + cos(θ) log(sec(θ) + tan(θ))
Exercise 4.1. (a) Find a general solution, x(t) to the following ODE,
dx f (t)
− λ(t)x = 2 ,
dt x
where λ(t) and f (t) are known functions of t. (b) Solve the following general second-order,
constant-coefficient, linear ODE
d2 d
τ02 2
y + τ1 y + y = g(t),
dt dt
d
with the initial conditions y(0) = y0 , dt y t=0 = v0 .
(Here and below we will start using bold-calligraphic notation, L, for differential operators.) Let us look for the general solution of Eq. (4.13) in the form of a linear combination
of exponentials,
x(t) = Σ_{k=1}^{n} c_k exp(λ_k t), (4.14)
where the c_k are constants. Substituting Eq. (4.14) into Eq. (4.13), one arrives at the condition
that the λ_k are roots of the characteristic polynomial:
Σ_{m=0}^{n} a_{n−m} (λ_k)^{n−m} = 0. (4.15)
Eq. (4.14) holds if the λ_k are not degenerate (that is, if there are n distinct roots). In the
case of degeneracy we generalize Eq. (4.14) to a sum of exponentials (for the non-degenerate
λ_k) and of polynomials in t multiplied by the respective exponentials (for the degenerate
λ_k), where the degrees of the polynomials are equal to the degrees of the respective root
degeneracies:
x(t) = Σ_{k=1}^{m} (Σ_{l=0}^{d_k} c_k^{(l)} t^l) exp(λ_k t), (4.16)
where d_k is the degree of the k-th root degeneracy.
Recall that if the particular solution is xp (t), and if x0 (t) is a generic solution of the homo-
geneous version of the equation, then a generic solution of Eq. (4.17) can be expressed as
x(t) = x0 (t) + xp (t).
Let us illustrate the utility of this simple but powerful statement with an example: find the general solution of
ẍ + ω_0²x = cos(3t). (4.18)
Solution. The general solution to the homogeneous equation, Lx = 0, is x_0(t) = c_1 cos(ω_0 t) +
c_2 sin(ω_0 t). For ω_0 ≠ 3, a particular solution to Eq. (4.18) is x_p(t) = cos(3t)/(ω_0² − 9), which
can be found by variation of parameters (Section 4.1.2). Therefore, for ω_0 ≠ 3, the solution
to the inhomogeneous Eq. (4.18) is
x(t) = c_1 cos(ω_0 t) + c_2 sin(ω_0 t) + cos(3t)/(ω_0² − 9).
When ω_0 = 3, the natural frequency of the system coincides with the forcing frequency of
the right hand side and the system resonates. We must look for a new particular solution
because the forcing is itself a solution of the homogeneous problem. This particular solution
can be found by variation of parameters (Section 4.1.2). Therefore, for ω_0 = 3 the solution
to the inhomogeneous Eq. (4.18) is
x(t) = c_1 cos(ω_0 t) + c_2 sin(ω_0 t) + (t/6) sin(3t).
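A one-line SymPy check (a sketch, assuming, as the solution above indicates, that Eq. (4.18) reads ẍ + ω_0²x = cos(3t)) confirms the resonant particular solution:

import sympy as sp

t = sp.symbols('t')
xp = t*sp.sin(3*t)/6  # candidate particular solution at omega_0 = 3
# residual of x'' + 9 x - cos(3t); simplifies to 0
print(sp.simplify(sp.diff(xp, t, 2) + 9*xp - sp.cos(3*t)))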
One can formally write a particular solution as x(t) = L⁻¹f(t) = ∫dτ G(t, τ)f(τ),
where G(t, τ) is the so-called Green function, which is to be determined. Formal manipulations show that
f(t) = LL⁻¹f(t) = L ∫dτ G(t, τ)f(τ) = ∫dτ [LG(t, τ)]f(τ).
We have already seen this equation as one of the properties of the δ-function. That is,
f(t) = ∫dτ [LG(t, τ)]f(τ) ⇔ LG(t, τ) = δ(t − τ).
The Green function for a differential operator, L, is the function G(t, τ) that solves the
differential equation LG(t, τ) = δ(t − τ) subject to the prescribed side conditions. The
Green function describes the 'response' of the system at time t to an 'impulse' applied at
time τ.
Notation. Technically, the Green function is a function of two variables, t and τ , where τ
represents the time of an impulse and t represents the time that we observe the system’s
response to the impulse. Notice that if L is a differential operator with constant (time-
independent) coefficients, then the response of the system to an impulse does not depend
on t and τ independently, but instead it only depends on the difference t − τ . In this
situation, G(t, τ ) reduces to the “homogeneous in time” or “time-invariant” G(t − τ ).
We will proceed exploring the method by revisiting the simple constant-coefficient case
of the linear scalar-valued first-order ODE (4.2),
dx/dt + γx = f(t), (4.19)
whose solution we write as the convolution
x(t) = ∫dτ G(t, τ)f(τ), (4.20)
where we have assumed that the evolution starts at t = −∞ with lim_{t→−∞} x(t) = 0, and
G(t, τ) is the Green function, which satisfies
(d/dt)G(t, τ) + γG(t, τ) = δ(t − τ), (4.21)
and δ(t) is the δ-function.
Notice that the evolutionary problem we discuss here is an initial value problem (also
called a Cauchy problem). Indeed, if we did not assume that back in the past (at t = −∞)
x is fixed, the solution of Eq. (4.19) would not be defined unambiguously. Indeed, suppose x_s(t)
is a particular solution of Eq. (4.19); then x_s(t) + C exp(−γt), where C is a constant,
describes a family of solutions of Eq. (4.19). The freedom, eliminated by fixing the initial
condition, is associated with the so-called zero mode of the differential operator, d/dt + γ.
Another remark is about causality, which may also be referred to, in this context, as the
"causality principle". It follows from Eq. (4.20) that, in defining the Green function, one also
enforces that G(t − τ) = 0 at t < τ. This formal observation is, of course, consistent with
the obvious: solutions of Eq. (4.19) at a particular moment in time t can only depend on
external driving sources f(τ) that occurred in the past, when τ ≤ t. The solution cannot
depend on external driving forces that will occur in the future, when τ > t.
Now back to solving Eq. (4.21). Since δ(t − τ ) = 0 at t > τ , one associates G(t − τ ) with
the zero mode of the aforementioned differential operator, G(t − τ ) = A exp(−γ(t − τ )),
where A is a constant. On the other hand, due to the causality principle, G(t − τ) = 0 at
t < τ. Integrating Eq. (4.21) over time from τ − ε, where 0 < ε ≪ 1, to τ + ε, we observe that
G(t − τ) should have a discontinuity (jump) at t = τ: G(t − τ) = A exp(−γ(t − τ))θ(t − τ),
where θ is the Heaviside function. Substituting this expression in Eq. (4.21) and integrating
the result (left and right hand sides of the resulting equality) over τ − ε < t < τ + ε, one
finds that A = 1. Substituting the expression into Eq. (4.20) one arrives at the solution
x(t) = ∫_{−∞}^{t} dτ exp(−γ(t − τ))f(τ). (4.22)
We observe that the system “forgets” the past at the rate γ per unit time.
Let's sketch out a few different ways to solve Eq. (4.21).
Method 1: Multiply by the appropriate integrating factor. For this problem, the integrating factor is e^{γt}:
LG(t, τ) = δ(t − τ),
(d/dt)[e^{γt}G(t, τ)] = e^{γt}δ(t − τ),
e^{γt}G(t, τ) = ∫_{−∞}^{t} dt' e^{γt'}δ(t' − τ) = θ(t − τ)e^{γτ},
so that G(t, τ) = θ(t − τ)e^{−γ(t−τ)}.
Method 2: Take the Fourier transform of both sides, solve the resulting algebraic
equation for Ĝ(k; τ), and then use a contour integral to compute the inverse Fourier transform:
F[Ġ(t, τ) + γG(t, τ)] = F[δ(t − τ)] ⇒ (ik + γ)Ĝ(k; τ) = e^{−ikτ} ⇒ G(t, τ) = (1/2π)∫_{−∞}^{∞} dk e^{ik(t−τ)}/(ik + γ).
In the last step, we compute the contour integral by closing the contour with
a semi-circular arc of radius R and taking the limit R → ∞. To ensure that the closing arc
has a vanishingly small contribution to the integral, the contour is closed in the upper
half plane for t > τ and in the lower half plane for t < τ. The integrand has a simple pole
at k = iγ with residue (1/2πi)e^{−γ(t−τ)} (including the 1/2π prefactor). Because the pole is in the upper half plane, the integral
equals e^{−γ(t−τ)} for t > τ; because there are no poles in the lower half plane, the
integral equals 0 for t < τ.
Method 3: Construct the Green function based on a number of properties that it must
satisfy. Given LG(t, τ) = δ(t − τ), we see that G(t, τ) must satisfy:
(i.) LG(t, τ) = 0 for all t ≠ τ;
(ii.) the prescribed side (here, initial) conditions;
(iii.) a jump of size unity at t = τ. This can be explained by integrating
from τ − ε to τ + ε and taking the limit ε → 0⁺. The calculation is as follows:
∫_{τ−ε}^{τ+ε} dt' Ġ(t' − τ) + γ∫_{τ−ε}^{τ+ε} dt' G(t' − τ) = ∫_{τ−ε}^{τ+ε} dt' δ(t' − τ) ⇒ G(τ + ε) − G(τ − ε) + 0 = 1.
Generalization of property (iii): in general, the Green function of an n-th order differential operator has a jump in the (n − 1)-st derivative at t = τ, while all lower derivatives are
continuous. The size of the jump is equal to the inverse of the leading coefficient.
Let's use these properties to construct the Green function.
Step 1. Find candidate solutions by solving (d/dt + γ)G(t, τ) = 0. We know there is only
one (non-trivial) linearly independent solution because the ODE is linear and first order.
This solution is Ae^{−γt}. There is also the trivial solution. To construct the Green function,
we must mix and match between the candidates G(t, τ) = 0 and G(t, τ) = Ae^{−γt}.
Step 2. Apply the initial condition. Our initial condition is lim_{t→−∞} G(t, τ) = 0. The only
candidate solution that satisfies the initial condition is the trivial solution. That is, G(t, τ) =
0 for t < τ. We are not yet prepared to say what happens for t > τ.
Step 3. Apply the jump condition. Given that G(t, τ ) = 0 for t < τ , we must now determine
G(t, τ ) for t > τ . We realize that Ae−γt is the only candidate that can produce a jump at
t = τ . Furthermore, to ensure the jump is size unity, we must set A = eγτ .
In summary, G(t, τ ) = θ(t − τ )e−γ(t−τ ) .
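A minimal numerical sketch (Python/SciPy; the choices f(t) = exp(−t²) and γ = 1 are ours) compares the Green-function convolution (4.22) with direct integration of the ODE started far in the past:

import numpy as np
from scipy.integrate import solve_ivp, quad

gamma = 1.0
f = lambda t: np.exp(-t**2)

def x_green(t):
    # Eq. (4.22): x(t) = int_{-inf}^{t} exp(-gamma*(t - tau)) f(tau) dtau
    val, _ = quad(lambda tau: np.exp(-gamma*(t - tau))*f(tau), -np.inf, t)
    return val

# direct integration of dx/dt = -gamma*x + f(t), with x ~ 0 far in the past
sol = solve_ivp(lambda t, x: -gamma*x + f(t), (-10, 3), [0.0],
                dense_output=True, rtol=1e-10, atol=1e-12)
for t in (0.0, 1.0, 2.0):
    print(t, x_green(t), sol.sol(t)[0])  # the two columns agree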
Exercise 4.2. Solve Eq. (4.19) at t > 0, where x(0) = 0 and f (t) = A exp(−αt). Analyze
the dependence on α and γ, including α → γ.
Recall that Eq. (4.21) assumes that the Green function depends on the difference between
t and τ , t − τ , and not on the two variables separately. This assumption is justified for the
case considered here; however, it will not be correct for situations where the decay coefficient
γ(t) depends on t. In this general case one needs to use the general form of the
Green function discussed above, G(t, τ). In the case of constant γ the Green function
depends only on the difference t − τ because of the symmetry of Eq. (4.21): invariance with
respect to time translation (time homogeneity), i.e. the equation does not change under the
time shift t → t + t_0.
Substituting the expansions into Eq. (4.23), one arrives at the n scalar-valued differential
equations
dy_i/dt + λ_i y_i = φ_i, (4.26)
therefore reducing the vector equation to a set of scalar equations of the type already considered
in Section 4.3.1.
If Γ̂ is not diagonalizable, it can still be brought to Jordan canonical form; this
occurs when a repeated eigenvalue has fewer linearly independent eigenvectors than its
algebraic multiplicity. As before, one introduces the
Green function Ĝ(t, τ), which satisfies
(d/dt + Γ̂)Ĝ(t, τ) = δ(t − τ)1̂. (4.27)
The explicit solution of Eq. (4.27) is
Ĝ(t, τ) = θ(t − τ) exp(−Γ̂(t − τ)), (4.28)
which allows us to state the solution of Eq. (4.23) in the following invariant form:
x(t) = ∫_{−∞}^{t} dτ Ĝ(t − τ)f(τ) = ∫_{−∞}^{t} dτ θ(t − τ) exp(−Γ̂(t − τ))f(τ). (4.29)
Notice that the matrix exponential, introduced in Eq. (4.28) and utilized in Eq. (4.29), is
a formal expression which may be interpreted in terms of the Taylor series
exp(−(t − τ)Γ̂) = Σ_{n=0}^{∞} (−(t − τ))ⁿ Γ̂ⁿ/n!, (4.30)
conveniently evaluated via the decomposition
Γ̂ = ÂĴÂ⁻¹, (4.31)
where Ĵ is the matrix of Jordan blocks formed from the eigenvalues of Γ̂, and the columns
of Â are the respective eigenvectors (and generalized eigenvectors) of Γ̂. Note that Γ̂ⁿ = ÂĴⁿÂ⁻¹.
To illustrate the peculiarity of the degenerate case, consider
Γ̂ = [λ 1; 0 λ], which can be re-written as λÎ + N̂, where N̂ := [0 1; 0 0],
which is the canonical form of the (2×2) Jordan matrix/block. Observe that N̂² = 0̂, which
indicates that N̂ is a (2×2) nilpotent matrix. The nilpotent property can be leveraged
when taking the matrix exponential,
exp(−(t − τ)Γ̂) = e^{−λ(t−τ)}(Î − (t − τ)N̂), (4.32)
where we have accounted for the nilpotent property of N̂. Incorporating Eq. (4.32) into
Eq. (4.29), the solution can therefore be expressed as
x(t) = ∫_{−∞}^{t} dτ θ(t − τ)e^{−λ(t−τ)}(Î − (t − τ)N̂)f(τ). (4.33)
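The nilpotent shortcut (4.32) is easily verified numerically; the following sketch (the values of λ and s are arbitrary choices of ours) compares it with a library matrix exponential:

import numpy as np
from scipy.linalg import expm

lam, s = 0.7, 1.3
N = np.array([[0.0, 1.0], [0.0, 0.0]])   # nilpotent part, N^2 = 0
Gamma = lam*np.eye(2) + N                # 2x2 Jordan block
lhs = expm(-s*Gamma)                     # matrix exponential
rhs = np.exp(-lam*s)*(np.eye(2) - s*N)   # Eq. (4.32)
print(np.allclose(lhs, rhs))             # True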
Writing this degenerate example out in components,
dx_1/dt + λx_1 + x_2 = f_1, dx_2/dt + λx_2 = f_2,
integrating the second equation, substituting the result into the first equation, and then changing
from x_1 to x̃_1 = x_1 + tx_2, one arrives at
dx̃_1/dt + λx̃_1 = f_1 + tf_2.
Note the emergence of a secular term (a polynomial in t) on the right hand side, which is
generic in the case of degeneracy; the resulting equation is then straightforward to integrate.
Note that vector-valued ODEs appear as the result of the "vectorization" of an n-th
order scalar-valued ODE for y(t). The vectorization occurs by setting x_1 = y, x_2 =
dy/dt, · · ·, x_n = d^{n−1}y/dt^{n−1}. Then dx/dt is expressed via the components of x and
the original equation, thus resulting in Eq. (4.23).
The Green function approach illustrated above can be applied to any inhomogeneous linear
differential equation. Let us see how it works in the case of a second-order differential
equation for a scalar. Consider
d²x/dt² + ω²x = f(t). (4.34)
To solve Eq. (4.34), note that its general solution can be expressed as a sum of a
particular solution and a solution of the homogeneous version of Eq. (4.34) with zero right
hand side. Let us choose a particular solution of Eq. (4.34) in the form of the convolution (4.20)
of the source term, f(t), with the Green function of Eq. (4.34), which satisfies
(d²/dt² + ω²)G(t) = δ(t). (4.35)
As established above, G(t) = 0 at t < 0. Integration of Eq. (4.35) from −ε to ε and checking
the balance of the integrated terms reveals that Ġ jumps at t = 0, and the value of the
jump is equal to unity. An additional integration over time around the singularity shows
that G(t) is continuous (and zero) at t = 0. Therefore, in the case of the second-order differential
equation considered here: G = 0 and Ġ = 1 at t = +0. Given that δ(t) = 0 for t > 0, these two
values can be considered as the initial conditions at t = +0 for the homogeneous version
(zero right hand side) of Eq. (4.35), defining G(t) at t > +0. Finally, we arrive at the
following result:
G(t) = θ(t) sin(ωt)/ω, (4.36)
where θ is the Heaviside function.
Furthermore, Eq. (4.20) gives the solution to Eq. (4.34) over the infinite time horizon;
however, one can also use the Green function to solve the respective Cauchy problem (initial
value problem). Since Eq. (4.34) is a second-order ODE, one just needs to fix two values
associated with x(t) evaluated at the initial time, t = 0, for example x(0) and ẋ(0). Then, taking
into account that G(+0) = 0 and Ġ(+0) = 1, one finds the following general solution of
the Cauchy problem for Eq. (4.34):
x(t) = ẋ(0)G(t) + x(0)Ġ(t) + ∫_{0}^{t} dt_1 G(t − t_1)f(t_1). (4.37)
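Formula (4.37) can again be checked numerically; the sketch below (with the arbitrary choices ω = 2, x(0) = 1, ẋ(0) = 0.5, f(t) = sin t) compares it with a direct ODE solve:

import numpy as np
from scipy.integrate import solve_ivp, quad

omega, x0, v0 = 2.0, 1.0, 0.5
f = lambda t: np.sin(t)
G = lambda t: np.sin(omega*t)/omega   # Eq. (4.36) for t > 0
Gdot = lambda t: np.cos(omega*t)

def x_formula(t):
    conv, _ = quad(lambda t1: G(t - t1)*f(t1), 0, t)
    return v0*G(t) + x0*Gdot(t) + conv  # Eq. (4.37)

sol = solve_ivp(lambda t, y: [y[1], -omega**2*y[0] + f(t)],
                (0, 5), [x0, v0], dense_output=True, rtol=1e-10)
for t in (1.0, 2.5, 5.0):
    print(t, x_formula(t), sol.sol(t)[0])  # the two columns agree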
This approach extends to the general case. Consider
Lx(t) = φ(t), (4.38)
where L is a linear differential operator of the n-th order with constant coefficients a_i,
L := Σ_{m=0}^{n} a_{n−m} d^{n−m}/dt^{n−m}, already discussed in Section 4.2. We build a particular solution of
Eq. (4.38) as the convolution (4.20) of the source term, φ(t), with the Green function, G(t),
of Eq. (4.38):
LG = δ(t). (4.39)
Observe that the solution to the respective homogeneous equation, Lx = 0 (the zero
modes of the operator L), can generally be presented as
x(t) = Σ_i b_i exp(z_i t), (4.40)
Example 4.3.1. Let γ, ν ∈ R with γ > 0. Find the Green function for the differential
operator and use it to solve the ODE
d²x/dt² + 2γ dx/dt + ν²x = f, subject to: lim_{t→−∞} x(t) = 0, lim_{t→−∞} ẋ(t) = 0. (4.43)
Solution. Fourier transforming the defining equation of the Green function, one obtains
F[LG(t, τ)] = F[δ(t − τ)] ⇒ (−ω² − 2iγω + ν²)Ĝ = e^{−iωτ} ⇒ Ĝ(ω; τ) = −e^{−iωτ}/((ω − ω_+)(ω − ω_−)),
where ω_± = −iγ ± √(ν² − γ²). The inverse Fourier transform of Ĝ is computed by a contour
integral, where the contour must be closed by a semi-circular arc of radius R in
the limit R → ∞. To ensure that the semi-circular arc has a vanishing contribution to the
integral, we must close the contour in the upper-half plane if t < τ and in the lower-half
plane if t > τ. The integrand has poles with the associated residues:
• If ν > γ: simple poles at ω = ω_± = −iγ ± √(ν² − γ²) (with both real and imaginary components), and
Res(f, ω_−) = (ω_− − ω_+)⁻¹ exp(−iω_−τ) = −(1/2)(ν² − γ²)^{−1/2} exp(−γτ) exp(+i√(ν² − γ²) τ),
Res(f, ω_+) = (ω_+ − ω_−)⁻¹ exp(−iω_+τ) = +(1/2)(ν² − γ²)^{−1/2} exp(−γτ) exp(−i√(ν² − γ²) τ).
• If ν < γ: simple poles at ω = ω_± = −iγ ± i√(γ² − ν²) (purely imaginary), and
Res(f, ω_−) = (ω_− − ω_+)⁻¹ exp(−iω_−τ) = +(i/2)(γ² − ν²)^{−1/2} exp(−γτ) exp(−√(γ² − ν²) τ),
Res(f, ω_+) = (ω_+ − ω_−)⁻¹ exp(−iω_+τ) = −(i/2)(γ² − ν²)^{−1/2} exp(−γτ) exp(+√(γ² − ν²) τ).
Solution. Shorter method: construct the Green function based on the properties it must
satisfy. We solve for the case ν > γ; the case ν < γ follows analogously. Given
LG(t, τ) = δ(t − τ), we see that G(t, τ) must satisfy the analogs of properties (i.)-(iii.) above; in particular,
(iii.) G(t, τ) must be continuous everywhere, including t = τ, and the derivative of G(t, τ)
must have a jump of magnitude unity at t = τ.
(Recall the generalization of property (iii): the Green function of an n-th order differential operator has a jump in the (n − 1)-st derivative at t = τ, all lower derivatives being
continuous, with the size of the jump equal to the inverse of the leading coefficient.)
Let's use these properties to construct the Green function.
Step 1. Find candidate solutions by solving (d²/dt² + 2γ d/dt + ν²)G(t, τ) = 0. We know
there are two linearly independent (non-trivial) solutions because the ODE is linear and
second order; for ν > γ they are e^{−γt}sin(√(ν² − γ²) t) and e^{−γt}cos(√(ν² − γ²) t).
Step 2. Apply the initial conditions. Our initial conditions are lim_{t→−∞} G(t, τ) = 0 and
lim_{t→−∞} Ġ(t, τ) = 0. The only candidate solution that satisfies the initial conditions is
the trivial solution. That is, G(t, τ) = 0 for t < τ. We are not yet prepared to say what
happens for t > τ. We can thus write the tentative Green function
G(t, τ) = 0 for t < τ; G(t, τ) = c_3 e^{−γ(t−τ)}sin(√(ν² − γ²)(t − τ)) + c_4 e^{−γ(t−τ)}cos(√(ν² − γ²)(t − τ)) for t > τ.
Step 3. Apply the continuity and jump conditions. Given that G(t, τ) = 0 for t < τ, we
must now determine G(t, τ) for t > τ. The continuity condition requires that c_4 = 0. We
compute ∂G/∂t = −c_3 γe^{−γ(t−τ)}sin(√(ν² − γ²)(t − τ)) + c_3 √(ν² − γ²)e^{−γ(t−τ)}cos(√(ν² − γ²)(t − τ)),
and find that lim_{t→τ⁺} ∂G/∂t = c_3 √(ν² − γ²). To ensure the jump is of size unity, we must
set c_3 = (ν² − γ²)^{−1/2}.
In summary, the Green function is given by
G(t − τ) = θ(t − τ) e^{−γ(t−τ)} (ν² − γ²)^{−1/2} sin(√(ν² − γ²)(t − τ)), if ν > γ,
and the solution to the ODE is
x(t) = ∫_{−∞}^{t} dτ e^{−γ(t−τ)} (ν² − γ²)^{−1/2} sin(√(ν² − γ²)(t − τ)) f(τ), if ν > γ.
A similar calculation can be used to find the Green function and the solution for ν < γ.
Exercise 4.4. Follow the logic of Example 4.3.1 and suggest two methods of finding the
Green function ((a) based on the Fourier transform, and (b) based on properties of the Green
function) for solving (d²/dt² + ν²)² x(t) = f(t), where (d²/dt² + ν²)² := (d²/dt² + ν²)(d²/dt² + ν²),
at t > 0, assuming that x is real-valued and x(0⁻) = (d/dt)x(0⁻) = (d²/dt²)x(0⁻) = (d³/dt³)x(0⁻) = 0.
So far we have solved linear ODEs by using the Green function approach and constructing
the Green function as a solution of the homogeneous equation with additionally prescribed
initial conditions (one fewer than the order of the differential equation). In this section
we discuss an alternative way of solving the problem: first, the application of the Laplace
transform introduced in Section 3.9 to linear ODEs with constant coefficients,
and then the so-called Laplace method for solving linear ODEs with coefficients
depending linearly on the (time/space) variable. The connection between the two is not only
via the name of Laplace, who contributed to developing both, but also due to the fact that
the Laplace method can be considered as utilizing a generalization of the Laplace transform.
The Laplace transform is natural for solving dynamic problems with causal structure.
Let us see how it works for finding the Green function defined by Eqs. (4.38, 4.39). We
apply the Laplace transform to Eq. (4.39), integrating it over time with the exp(−kt)
Laplace weight from a small positive value, ε, to ∞. In this case, the integral of the
right hand side is zero. Each term on the left hand side can be transformed through a
sequence of integrations by parts to a product of a monomial in k with G̃(k), the Laplace
transform of G(t). We also check all boundary terms which appear at t = ε and t = ∞.
Assuming that G(∞) = 0 (which is always the case for stable systems), all contributions at
t = +∞ are equal to zero. All t = ε boundary terms but one are equal to zero, because
d^m G(ε)/dt^m = 0 for all 0 ≤ m < n − 1. The only nonzero boundary contribution originates
from d^{n−1}G(ε)/dt^{n−1} = 1. Overall, one arrives at the following equation:
L(k)G̃(k) = 1, L(k) := Σ_{m=0}^{n} a_{n−m}(−k)^{n−m}. (4.44)
Therefore, we have just found that G̃(k) has poles (in the complex plane of k) associated with
the zeros of the polynomial L(k). To find G(t) one applies to G̃(k) the inverse Laplace transform
G(t) = ∫_{c−i∞}^{c+i∞} (dk/2πi) exp(kt) G̃(k). (4.45)
The Laplace method allows us to solve ODEs where the coefficients are linear in t.
Remark. Notice, again, that Laplace's method for differential equations is not to be
confused with the Laplace transform or with Laplace's method for approximating integrals.
They are not the same, even though they are related.
Consider an ODE that can be written as
Σ_{m=0}^{N} (a_m + b_m t) d^m y/dt^m = 0, (4.46)
and look for its solution in the form of the substitution
y(t) = ∫_C dk Z(k)e^{kt}, (4.47)
where Z(k) is a function of the complex variable k and C is a contour in the complex plane
of k that will depend on Z(k) but will not depend on t.
Remark. Notice that the substitution (4.47) is similar to the inverse Laplace transform
(4.45), however with an important difference: the contour C in the former does not
necessarily coincide with the contour used in the latter. Recall that the contour used
in the basic formula of the inverse Laplace transform, i.e. in Eq. (4.45), runs upward, to the
right of the imaginary axis.
The derivatives of y are computed from Eq. (4.47),
d^m y/dt^m = ∫_C dk Z(k) k^m e^{kt},
and are substituted into the left hand side of Eq. (4.46), which gives
∫_C dk Z(k) [P(k) + Q(k)t] e^{kt} = 0,
where we have introduced the notation P(k) := a_0 + a_1 k + · · · + a_N k^N and
Q(k) := b_0 + b_1 k + · · · + b_N k^N for convenience. We integrate by parts to get
0 = ∫_C dk Z(k)(P(k) + Q(k)t)e^{kt} = [Z(k)Q(k)e^{kt}]_{k_1}^{k_2} + ∫_C dk (Z(k)P(k) − (d/dk)[Z(k)Q(k)])e^{kt}, (4.48)
where k_1 and k_2 represent the two endpoints of C. If we can pick Z(k) and a contour C
such that both terms in Eq. (4.48) vanish, then we can use them to express the solution to the ODE (4.46)
by Eq. (4.47). In summary, we must find Z(k) and the contour C such that
(d/dk)[Q(k)Z(k)] − P(k)Z(k) = 0 and [Z(k)Q(k)e^{kt}]_{k_1}^{k_2} = 0.
The differential equation above for Z(k) can be solved either by finding an integrating
factor or by separation of variables, as follows:
d[QZ]/dk = PZ ⇒ d[QZ]/(QZ) = (P/Q)dk ⇒ ln(QZ) = ∫(P/Q)dk + const,
so that Z(k) is given by
Z(k) = (c/Q(k)) exp(∫ dk P(k)/Q(k)). (4.49)
Once Z(k) is determined, we must find a contour with endpoints k1 and k2 such that
Q(k1 )Z(k1 )ek1 t = Q(k2 )Z(k2 )ek2 t .
Example 4.3.2. Use Laplace's method to find the solution to the boundary
value problem
x d³u/dx³ + 2u = 0, u(0) = 1, u(∞) = 0.
Solution. For this problem, we compute
P(k) = 2, Q(k) = k³, Z(k) = (c/k³) e^{−1/k²}, Q(k)Z(k)e^{kx} = c e^{kx−1/k²}. (4.50)
We see that e^{kx−1/k²} → 0 as k → −∞ (for x > 0) and that e^{kx−1/k²} → 0 as k → 0⁻. Therefore, we take
the contour of integration to be along the negative real axis. The solution is
u(x) = −2c ∫_{−∞}^{0} dk e^{kx−1/k²}/k³;
the boundary condition u(∞) = 0 holds by construction, u(0) can be evaluated by the change
of variables z = k⁻², and the constant c = 1/2 was chosen to satisfy the boundary
condition u(0) = 1.
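By construction the boundary term [Q(k)Z(k)e^{kx}] vanishes on the chosen contour, so the integral must satisfy the ODE. A quick quadrature sketch checks this: differentiating under the integral sign, u'''(x) is proportional to I_3(x) and u(x) to I_0(x), where I_m(x) := ∫_{−∞}^{0} dk k^m e^{kx−1/k²}/k³ (the name I_m is ours), so x I_3 + 2 I_0 should vanish:

import numpy as np
from scipy.integrate import quad

def I(m, x):
    # I_m(x) = int_{-inf}^{0} k^m e^{k x - 1/k^2} / k^3 dk
    val, _ = quad(lambda k: k**m*np.exp(k*x - 1.0/k**2)/k**3, -np.inf, 0)
    return val

for x in (0.5, 1.0, 2.0):
    print(x, x*I(3, x) + 2*I(0, x))  # ~ 0, to quadrature accuracy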
Find a second order, linear differential equation which, when supplied with proper initial
conditions at x = 0, results in S(x) as a solution. Solve the initial value problem by the
Laplace method, therefore, representing S(x) as an integral.
Example 4.3.3. (a) Use Laplace's method to find a general solution to the Hermite equation
d²y/dt² − 2t dy/dt + 2ny = 0. (4.51)
(b) Simplify your result for the case where n is a non-negative integer.
Solution. (a) In this case we derive
P(k) = k² + 2n, Q(k) = −2k, Z(k) = −(c/(2k^{n+1})) e^{−k²/4}, Q(k)Z(k)e^{kt} = c e^{kt−k²/4}/kⁿ, (4.52)
thus resulting in the following explicit solution of Eq. (4.51), defined up to a multiplicative
constant:
y(t) = ∫_C dk e^{kt−k²/4}/k^{n+1}. (4.53)
Let us make the change of variables k → z according to k = 2(t − z), which gives
y(t) = e^{t²} ∫_{C'} dz e^{−z²}/(z − t)^{n+1}, (4.54)
where C' is a suitable contour in the complex plane of z, as yet undefined; that is, we
have freedom in choosing the contour (as above we had a choice of the contour C in the
complex plane of k).
(b) When n is a non-negative integer, the integrand in Eq. (4.54) has a pole of order n + 1
at z = t, and thus choosing the contour to encircle the pole both satisfies the requirement on
the boundary terms and allows us to evaluate the integral by residue calculus. Applying
Cauchy's formula to the resulting contour integral, one therefore arrives at the expression
for the so-called Hermite polynomials,
y(t) = H_n(t) = (−1)ⁿ e^{t²} (dⁿ/dtⁿ) e^{−t²}, (4.55)
dtn
where re-scaling (which is a degree of freedom in linear differential equations) is selected
according to the normalization constraint introduced in the following exercise.
Hermite polynomials will come back later in the context of the Sturm-Liouville problem
in Section 4.5.3.
Example 4.3.4. Consider another particular case of Eq. (4.46) that can be solved
by the Laplace method,
d²y/dt² − ty = 0. (4.56)
Here P(k) = k², Q(k) = −1, and Eq. (4.49) gives
Z(k) = c exp(−k³/3). (4.57)
Figure 4.1: Layout of contours in the complex plane of k needed for saddle-point estimations
of the Airy function described in Eq. (4.58).
According to Eq. (4.47), the general solution of Eq. (4.56) can then be represented as
y(t) = const ∫_C dk exp(kt − k³/3), (4.58)
where we choose an infinite integration path, shown in Fig. (4.1), such that the values of the
integrand at the two (infinite) end points coincide (and are equal to zero). Indeed, this choice
guarantees that the infinite end points of the contour lie in the regions where Re(k³) > 0
(shaded regions I, II, III in Fig. (4.1)). Moreover, by choosing a contour that starts in
region I and ends in region II (blue contour C in Fig. (4.1)), we guarantee that the
solution of Eq. (4.56) remains finite at t → +∞. Notice that the contour can
be shifted arbitrarily under the condition that the end points remain in sectors I and II. In
particular, one can shift the contour to coincide with the imaginary axis (in the complex k
plane shown in Fig. (4.1)); then Eq. (4.58) becomes (up to a constant) the so-called Airy
function
Ai(t) = (1/π) ∫_{0}^{∞} dz cos(z³/3 + zt) = (1/2π) Re ∫_{−∞}^{∞} dz exp(i(z³/3 + zt)). (4.59)
An asymptotic expression for the Airy function at t > 0, t ≫ 1, can be derived utilizing the
saddle-point method described in Section 2.4. At k = ±√t, the integrand in Eq. (4.58) has
an extremum along the direction of its "steepest descent" from the saddle point along the
imaginary axis. Since the contour end-points should stay in sectors I and II, we shift
the contour to the left of the imaginary axis while keeping it parallel to the imaginary
axis. (See C_1, shown in red in Fig. (4.1), which crosses the real axis at k = −√t.) The
integral is dominated by the saddle-point at k = −√t, thus resulting (after the substitution
k = −√t + iz, changing the integration variable from k to z, expanding over z, keeping the
quadratic term in z, ignoring higher order terms, and evaluating a Gaussian integral) in the
following asymptotic estimate for the Airy function:
t > 0, t ≫ 1 : Ai(t) ≈ (1/2π) ∫_{−∞}^{+∞} dz exp(−(2/3)t^{3/2} − √t z²) = exp(−2t^{3/2}/3)/(√(4π) t^{1/4}). (4.60)
−∞
(Notice that one can also provide an alternative argument and exclude the contribution of the
second, potentially dominating, saddle-point k = +√t simply by observing that the Gaussian
integral evaluated along the steepest descent path from this saddle-point gives zero contribution after evaluating the real part of the result, as required by Eq. (4.59).)
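The asymptotic (4.60) is easy to test against a library implementation of the Airy function (a sketch; SciPy's airy returns Ai together with its derivative and the second solution Bi):

import numpy as np
from scipy.special import airy

for t in (2.0, 5.0, 10.0):
    exact = airy(t)[0]  # Ai(t)
    asymptotic = np.exp(-2*t**1.5/3)/(np.sqrt(4*np.pi)*t**0.25)  # Eq. (4.60)
    print(t, exact, asymptotic, exact/asymptotic)  # the ratio tends to 1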
Poisson’s equation describes, in the case of electrostatics, the potential field caused by a
given charge distribution.
Let us discuss the function u(x) whose distribution over a finite spatial interval is described by the following set of equations:
d²u/dx² = f(x), ∀x ∈ (a, b), with u(a) = u(b) = 0. (4.61)
We introduce the Green function which satisfies
∀a < x, y < b : d²G(x; y)/dx² = δ(x − y), G(a; y) = G(b; y) = 0. (4.62)
Notice that the Green function now depends on both x and y.
According to Eq. (4.62), d²G(x; y)/dx² = 0 if x ≠ y; that is, G(x; y) is a linear function of
x for all x ≠ y. Enforcing the boundary conditions, one derives that G(x; y) is proportional
to (x − a) for x < y and to (x − b) for x > y.
Furthermore, given that the differential equation in (4.62) is of second order, G(x; y) should
be continuous at x = y and the jump of its first derivative at x = y should be equal to
unity. Summarizing, one finds
G(x; y) = (y − b)(x − a)/(b − a) for x < y, and G(x; y) = (y − a)(x − b)/(b − a) for x > y. (4.65)
The solution of Eq. (4.61) is then
u(x) = ∫_{a}^{b} dy G(x; y) f(y). (4.66)
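A short quadrature sketch checks (4.65)-(4.66) on the simplest source, f = 1 on (0, 1), for which the exact solution of (4.61) is u(x) = x(x − 1)/2:

import numpy as np
from scipy.integrate import quad

a, b = 0.0, 1.0
f = lambda y: 1.0

def G(x, y):  # Eq. (4.65)
    return (y - b)*(x - a)/(b - a) if x < y else (y - a)*(x - b)/(b - a)

for x in (0.25, 0.5, 0.75):
    u, _ = quad(lambda y: G(x, y)*f(y), a, b, points=[x])  # Eq. (4.66)
    print(x, u, (x - a)*(x - b)/2)  # the two columns agree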
As a further example, consider the operator L := −(d/dx)(x d/dx) on the interval 1 < x < 2
with the boundary conditions u(1) = 0 and u'(2) = 0, and construct G from the properties it must satisfy.
Property (i.): LG(x; y) = 0 at x ≠ y gives G(x; y) = c_1 + c_2 log(x) for x < y and
G(x; y) = c_3 + c_4 log(x) for x > y.
Property (ii.): The Green function satisfies the boundary conditions. Enforcing the boundary conditions u(1) = 0 and u'(2) = 0 gives c_1 = c_4 = 0, so that
G(x; y) = c_2 log(x) for x < y, and G(x; y) = c_3 for x > y.
Property (iii.): G(x; y) is continuous and G_x(x; y) has a jump of magnitude −1/y at x = y.
To ensure continuity at y, we must set c_2 log(y) = c_3. This ensures continuity, but does not
yet give the appropriate jump condition. Computing the derivative of G as x → y⁻ and
as x → y⁺ gives lim_{x→y⁻} G_x = c_2/y and lim_{x→y⁺} G_x = 0. That is, we must set c_2 = 1 to
ensure that the derivative has a jump of magnitude −1/y at x = y. In summary,
G(x; y) = log(x) for x < y, and G(x; y) = log(y) for x > y.
Remark. Explanation for Property (iii.): to determine the magnitude of the jump at x = y,
integrate the ODE on the interval y − ε < x < y + ε and take the limit ε → 0:
−(d/dx)(x du/dx) = δ(x − y),
lim_{ε→0} ∫_{y−ε}^{y+ε} dx [−(d/dx)(x du/dx)] = lim_{ε→0} ∫_{y−ε}^{y+ε} dx δ(x − y),
lim_{ε→0} [−x du/dx]_{y−ε}^{y+ε} = 1,
so that the jump of du/dx at x = y is equal to −1/y.
Exercise 4.6. Find the Green function for the equation Lu(x) = f(x), where the operator
L := −d²/dx² − κ², and the boundary conditions on u(x) are
Let us first review some basic properties of a Hilbert space, in particular the condition of its
completeness. (These will be discussed at greater length in the companion Math 527 course
of the AM core.) A linear (vector) space is called a Hilbert space, H, if
1. for any two elements f and g there exists a scalar product (f, g) which satisfies the
following properties:
(a) (f, g) = (g, f)*;
(b) linearity, (f, αg + βh) = α(f, g) + β(f, h) for any scalars α, β;
(c) positivity, (f, f) ≥ 0, with (f, f) = 0 if and only if f = 0;
2. the space is complete with respect to the norm ‖f‖ := (f, f)^{1/2}.
Remark. The Hilbert space defined above for complex-valued functions can also be considered over real-valued functions. In the following we will use the two interchangeably.
Any basis B can be turned into an ortho-normal basis with respect to a given scalar
product, i.e. x = Σ_{n=1}^{∞} (x, f_n)f_n and ‖x‖² = Σ_{n=1}^{∞} |(x, f_n)|². (For example, the Gram-Schmidt
process is a standard ortho-normalization procedure.)
One primary example of a Hilbert space is the L²(Ω) space of complex-valued functions
f(x) defined on the domain Ω ⊆ Rⁿ such that ∫_Ω dx |f(x)|² < ∞ (one may say, casually, that
the square modulus of the function is integrable). In this case the scalar product is defined
as
(f, g) := ∫_Ω dx f*(x)g(x).
Properties 1a-c from the definition of the Hilbert space above are satisfied by construction, and
property 2 can be proven (it is a standard proof in a course of mathematical analysis).
Consider a fixed infinite ortho-normal sequence of functions, {f_n(x) | n ∈ N}.
The sequence is a basis in L²(Ω) iff the following relation of completeness holds:
Σ_{n=1}^{∞} f_n*(x) f_n(y) = δ(x − y). (4.67)
As is customary for the δ-function (and other generalized functions), Eq. (4.67) should be understood as the equality of the integrals of the two sides of Eq. (4.67) against a function from
L²(Ω).
Consider a function from the Hilbert space L²(a, b) over the reals, i.e. a function of a single
variable, x ∈ R, over a bounded domain, a ≤ x ≤ b, with an integrable square modulus, and
a linear differential operator L̂ acting on the function.
It is clear from how the condition (4.68) was stated that it depends both on the class
of functions and on the operator L̂. For example, considering functions f and g with zero
boundary conditions, or functions which are periodic and whose derivatives are periodic too,
will result in the statement that the operator
L̂ = d²/dx² + U(x), (4.69)
where U(x) is a function mapping R to R, is Hermitian.
The natural generalization of the Schrödinger operator (4.69) is the Sturm-Liouville operator
L̂ = d²/dx² + Q(x) d/dx + U(x). (4.70)
The Sturm-Liouville operator is not Hermitian, i.e. Eq. (4.68) does not hold in this case.
However, it is straightforward to check that, with zero boundary conditions or periodic
boundary conditions imposed on the functions f(x) and g(x) and their derivatives, the
following generalization of Eq. (4.68) holds:
∫_a^b dx ρ(x) f(x) L̂g(x) = ∫_a^b dx ρ(x) g(x) L̂f(x), (4.71)
where dρ/dx = Qρ ⇒ ρ = exp(∫ dx Q). (4.72)
Consider now the eigenvalue problem
L̂f_n = λ_n f_n. (4.73)
As a corollary of this statement one also finds that in the Hermitian case the distinct
eigen-functions are orthogonal to each other with unitary weight, ρ = 1.
Let us check Eq. (4.74) on the example L̂_0 = d²/dx², where Q(x) = U(x) = 0, over
the functions which are 2π-periodic. Here cos(nx) and sin(nx), where n = 0, 1, · · ·, are distinct
eigen-functions with the eigen-values λ_n = −n². Then, for all m ≠ n,
∫_0^{2π} dx cos(nx)cos(mx) = ∫_0^{2π} dx sin(nx)sin(mx) = 0.
Note that the example just discussed has a degeneracy: cos(nx) and sin(nx) are two
distinct real eigen-functions corresponding to the same eigen-value. Therefore, any combination of the two is also an eigen-function corresponding to the same eigen-value. If we
chose any other pair of the degenerate eigen-functions, say cos(nx) and sin(nx) + cos(nx),
the two would not be orthogonal to each other. Therefore, what we see in this example is
that the eigen-functions corresponding to the same eigen-value must be specially selected
to be orthogonal to each other.
We say that the set of eigen-functions, {f_n(x) | n ∈ N}, of L̂ is complete over a given
class (of functions) if any function from the class can be expanded into a series over the
eigen-functions from the set:
f = Σ_n c_n f_n. (4.76)
Relating this property of the eigen-functions to the completeness of a Hilbert-space basis, one
observes that the eigen-functions of a self-adjoint (Hermitian) operator over L²(Ω) form an ortho-normal basis of L²(Ω).
Multiplying both sides of Eq. (4.76) by ρf_n, integrating over the domain, and applying
(4.74) to the right hand side, one derives
c_n = ∫dx ρ f_n f / ∫dx ρ (f_n)². (4.77)
Note that for the example L̂0 , Eq. (4.76) is a Fourier Series expansion of a periodic
function.
Returning to the general case and substituting Eq. (4.77) back into (4.76), one arrives
at
f(x) = ∫dy (Σ_n ρ(y) f_n(x) f_n(y) / ∫dx' ρ(x')(f_n(x'))²) f(y). (4.78)
If the set of functions {f_n(x) | n} is complete, relation (4.78) should be valid for any function
f from the considered class. Consistently with this statement, one observes that part
of the integrand in Eq. (4.78) is just δ(x − y), the generalized function whose
convolution with any function returns that function, i.e.
Σ_n f_n(x) f_n(y) / ∫dx' ρ(x')(f_n(x'))² = (1/ρ(y)) δ(x − y). (4.79)
Therefore, one concludes that Eq. (4.79) is equivalent to the statement that the set of functions
{f_n(x) | n} is complete.
Example 4.5.1. Check the validity of Eq. (4.79), and thus the completeness of the respective set
of eigen-functions, for our enabling example of L̂_0 = d²/dx² over the functions which are
2π-periodic.
Let us now depart from our enabling example and consider the case of Q(x) = −2x and
U(x) = 0, i.e.
L̂_2 = d²/dx² − 2x d/dx, ρ(x) = exp(−x²), (4.80)
over the class of functions mapping R to R which also decay sufficiently fast at
x → ±∞. That is, we are now discussing
L̂_2 f_n = λ_n f_n. (4.81)
Changing from f_n(x) to Ψ_n(x) = f_n(x)√ρ = f_n(x)e^{−x²/2}, one thus arrives at the following equation for
Ψ_n:
e^{−x²/2} L̂_2 f_n(x) = e^{−x²/2} L̂_2 [e^{x²/2} Ψ_n(x)] = d²Ψ_n/dx² + (1 − x²)Ψ_n = λ_n Ψ_n. (4.82)
Observe that when λn = −2n, Eq. (4.81) coincides with the Hermite Eq. (4.51).
Let us look for a solution of Eq. (4.81) in the form of a Taylor series around x = 0,
f_n(x) = Σ_{k=0}^{∞} a_k x^k. (4.83)
Substituting the series into the Hermite equation and then equating terms with the same
powers of x, one arrives at the following recursion for the expansion coefficients:
∀k = 0, 1, · · · : a_{k+2} = ((2k + λ_n)/((k + 2)(k + 1))) a_k. (4.84)
This results in the following two linearly independent solutions (even and odd, respectively, with respect to the x → −x transformation) of Eq. (4.81), represented in the form of
series:
f_n^{(e)}(x) = a_0 (1 + (λ_n/2!) x² + (λ_n(4 + λ_n)/4!) x⁴ + · · ·), (4.85)
f_n^{(o)}(x) = a_1 (x + ((2 + λ_n)/3!) x³ + ((2 + λ_n)(6 + λ_n)/5!) x⁵ + · · ·), (4.86)
where the first two coefficients of the series (4.83) are kept as free parameters. Observe that
the series (4.85) and (4.86) terminate if λ_n = −4n and λ_n = −4n − 2, respectively, where
n = 0, 1, · · ·; then the f_n are polynomials, in fact the Hermite polynomials. We combine
the two cases into one and use the standard notation, H_n(x), for the Hermite polynomial
of the n-th order, which satisfies the Hermite Eq. (4.51). Per the statement of Exercise 4.7, the Hermite
polynomials are normalized and orthogonal (weighted with ρ) to each other.
Exercise 4.7. (a) Prove that
∫_{−∞}^{+∞} dt e^{−t²} H_n(t) H_m(t) = 2ⁿ n! √π δ_{nm}. (4.87)
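A SymPy sketch can generate H_n directly from the formula (4.55) and verify the normalization (4.87) for the first few n, m:

import sympy as sp

t = sp.symbols('t')

def H(n):  # Eq. (4.55)
    return sp.simplify((-1)**n*sp.exp(t**2)*sp.diff(sp.exp(-t**2), t, n))

for n in range(4):
    for m in range(4):
        val = sp.integrate(sp.exp(-t**2)*H(n)*H(m), (t, -sp.oo, sp.oo))
        expected = 2**n*sp.factorial(n)*sp.sqrt(sp.pi) if n == m else 0
        assert sp.simplify(val - expected) == 0
print("orthogonality (4.87) verified for n, m < 4")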
4.5.4 Case study: Schrödinger Equation in 1d ∗
The Schrödinger equation,
d²Ψ(x)/dx² + (E − U(x))Ψ(x) = 0, (4.91)
describes the (complex-valued) wave function of a quantum particle delocalized over x ∈ R
with energy E in the potential U(x). We seek solutions with |Ψ(x)| → 0 at x → ±∞, and
our goal here is to describe the spectrum (allowed values of E) and the respective eigen-functions.
As a simple but instructive example, consider the case of a quantum particle in a
rectangular potential, i.e. U(x) = U_0 at x ∉ [0, a] and zero otherwise. The general solution of
Eq. (4.91) becomes:
U_0 > E > 0:
Ψ_E(x) = c_L exp(x√(U_0 − E)) for x < 0; Ψ_E(x) = a_+ exp(ix√E) + a_− exp(−ix√E) for x ∈ [0, a]; Ψ_E(x) = c_R exp(−x√(U_0 − E)) for x > a. (4.92)
U_0 < E:
Ψ_E(x) = c_{L+} exp(ix√(E − U_0)) + c_{L−} exp(−ix√(E − U_0)) for x < 0; Ψ_E(x) = a_+ exp(ix√E) + a_− exp(−ix√E) for x ∈ [0, a]; Ψ_E(x) = c_{R+} exp(ix√(E − U_0)) + c_{R−} exp(−ix√(E − U_0)) for x > a. (4.93)
Here we account for the fact that E cannot be negative (the ODE simply does not allow
such solutions) and, in the U_0 > E > 0 regime, we select the one solution (of the two linearly
independent solutions) which does not grow as x → ±∞.
The solutions in the three different intervals should be "glued" together, or, stated
less casually, Ψ and dΨ/dx should be continuous at all x ∈ R. These conditions, applied to
Eq. (4.92) or Eq. (4.93), result in algebraic "consistency" conditions for E. We expect
to get a continuous spectrum at E > U_0 and a discrete spectrum at U_0 > E > 0.
Example 4.5.2. Complete the calculations above for the case of U_0 > E > 0 and find the
allowed values of the discrete spectrum. What is the condition for the appearance of at least
one discrete level?
∗ This auxiliary Subsection can be dropped at the first reading. Material from the Subsection will not
contribute to the midterm and final exams.
Example 4.5.3. Find the eigen-functions and the energies of the stationary states of the Schrödinger
equation for an oscillator,
d²Ψ(x)/dx² + (E − x²)Ψ(x) = 0, (4.94)
where x ∈ R and Ψ : R → C.
Solution. As we already saw in the preceding section, analysis of Eq. (4.94) reduces to
studying the Hermite equation, with its spectral version described by Eq. (4.81). However, we
will follow another route here. Let us introduce the so-called "creation" and "annihilation"
operators,
â = (i/√2)(d/dx + x), â† = (i/√2)(d/dx − x). (4.95)
It is straightforward to check that the operator Ĥ is positive definite for all functions from
L²:
∫dx Ψ†(x)ĤΨ(x) = ∫dx Ψ†(x)â†âΨ(x) = ∫dx |âΨ(x)|² ≥ 0,
thus resulting in Ψ_0(x) = A exp(−x²/2) and E_0 = 1/4. We have just found the eigen-function and eigen-value corresponding to the lowest possible energy, the so-called ground state.
To find all other eigen-functions, corresponding to the so-called "excited" states, consider
the so-called commutation relations. Introduce Ψ_n(x) := (â†)ⁿ Ψ_0(x). Since âΨ_0(x) = 0, the commutation relations (4.98) show
immediately that
(2E − 1/2)Ψ_n(x) = ĤΨ_n(x) = â†âΨ_n(x) = nΨ_n(x).
We observe that the eigen-functions Ψ_n(x) of the states with energies 2E_n = n + 1/2 are
expressed via the Hermite polynomials, H_n(x), introduced in Eq. (4.55):
Ψ_n(x) = A_n ((i/√2)(d/dx − x))ⁿ exp(−x²/2) = A_n (iⁿ/2^{n/2}) exp(x²/2) (dⁿ/dxⁿ) exp(−x²),
where we have used the identity (d/dx − x)exp(x²/2) = exp(x²/2)(d/dx). From the condition of
the Hermite polynomials' orthogonality (4.87) one derives A_n = (n!√π)^{−1/2}.
Consider the equation describing the conservative dynamics of a particle of unit mass in a potential (conservative means there is no dissipation of energy),
ẍ = −dU/dx, (4.99)
which conserves the energy,
E = ẋ²/2 + U(x), (4.100)
so that
ẋ = ±√(2(E − U(x))), (4.101)
where ± on the right hand side is chosen according to the initial condition chosen for
ẋ(0) (there may be multiple solutions corresponding to the same energy). Eq. (4.101) is
separable, and it can thus be integrated, resulting in the following classic implicit expression
for the particle coordinate as a function of time:
∫_{x_0}^{x} dx'/√(2(E − U(x'))) = ±t, (4.102)
which depends on the particle’s initial position, x0 , and its energy, E which is conserved.
In the example above, E is an integral of motion or equivalently a first integral, which is
defined as a quantity that is conserved along solutions to the differential equation. In this
case E was constant along the trajectories x(t).
The idea of an integral of motion, or first integral, extends to conservative systems described by a system of ODEs. (Here and in the next section we follow [2, 1].) For example,
consider the situation where a quantity H, called the Hamiltonian, is a twice-differentiable
function of 2n variables, p_1, · · ·, p_n (momenta) and q_1, · · ·, q_n (coordinates). The corresponding system of dynamic equations is called Hamilton's canonical equations,
ṗ_i = −∂H/∂q_i, q̇_i = ∂H/∂p_i (∀i = 1, · · ·, n). (4.103)
Computing the rate of change of the Hamiltonian in time,
dH/dt = Σ_{i=1}^{n} ((∂H/∂p_i)ṗ_i + (∂H/∂q_i)q̇_i) = Σ_{i=1}^{n} (q̇_i ṗ_i − ṗ_i q̇_i) = 0, (4.104)
one finds that H itself is an integral of motion.
Following [2] and Section 1.3 of [1], we now turn to discussing the famous example of a
conservative (Hamiltonian) system with one degree of freedom, Eq. (4.99).
We have established above that the energy (Hamiltonian) is conserved, and it is thus
instructive to study isolines, or level curves, of the energy drawn in the two-dimensional
(x, v) space, {(x, v) | v²/2 + U(x) = E}. To draw a level curve of the energy we simply fix E and
evaluate how (x, v) evolves with t according to Eqs. (4.99).
Let us build some intuition for level curves using an analogy. Suppose that the potential
curve is the same shape as a length of wire (literally the same shape, i.e. if the potential
curve is a parabola the wire is shaped like a parabola). We will say that this wire is perfectly
rigid, and frictionless. Now imagine that there is a ball or bead which slides along the wire,
subject to gravity. One may start the bead at any position on the wire, with any initial
velocity (left or right). The path that the bead traces out in position-velocity space is
(qualitatively) a level curve of the corresponding potential function.
Figure 4.2: Phase portrait, i.e. (x, v) level-curves of the conservative system Eq. (4.99) with
the potential, U (x) = kx2 /2 with k > 0 (top) and k < 0 (bottom).
Figure 4.3: What is the appearance of the level curves (phase portrait) of the energy for each
of these potentials (panels a-d)?
Consider the quadratic potential, U(x) = kx²/2. The two cases of positive and negative
k are illustrated in Fig. (4.2); see the snippet Portrait.ipynb. We observe that, with the
exception of the equilibrium position (x, v) = (0, 0), the level curves of the energy are
smooth. Generalizing, we find that the exceptional points are critical, or stationary, points
of the Hamiltonian, which are points where the derivatives of the Hamiltonian with respect
to the canonical variables, q and p, are zero. Note that each level curve, which we draw
observing how a particle slides in the potential well U(x), also has a direction (not shown in
Fig. (4.2)).
Consider the case where k > 0, and fix the value of the energy E. Due to Eq. (4.100),
the coordinate of the particle, x, should lie within the set where the potential energy is less
than the energy, {x | U(x) ≤ E}. We observe that E ≥ 0, and that equality corresponds
to the particle sitting still at the minimum of the potential, which is called a critical point,
or fixed point. Furthermore, the larger the kinetic energy, the smaller the potential energy.
Any position where the particle changes its velocity from positive to negative or vice-versa is
called a turning point. For any E > 0, there are two turning points, x_± = ±√(2E/k). Testing
different values of E > 0, we sketch different level curves, resulting in different ellipses
centered around 0. This is the canonical example of an oscillator. The motion of the particle
is periodic, and its period, T, can be computed by evaluating Eq. (4.102) between the turning
points:
T := 2∫_{x_−}^{x_+} dx/√(2(E − U(x))) = 2∫_{−√(2E/k)}^{√(2E/k)} dx/√(2E − kx²) = 2π/√k. (4.105)
For this case, the period is a constant, and we note that it is independent of the energy E.
In the k < 0 case, where all values of the energy (positive and negative) are accessible,
x = v = E = 0 is again the critical point. When E > 0 there are no turning points (points
where the direction of the velocity changes). When E < 0 the particle may turn only once or
not at all. If x(0) ≠ 0 then, regardless of the sign of E, |x(t)| increases with t to become
unbounded at t → ∞. As seen in Fig. (4.2) (bottom), in this case the (x, v) phase space splits into
four quadrants, separated by the separatrices v = ±√(−k) x. The level curves of the energy
are hyperbolas centered around x = v = 0.
A qualitative study of the dynamics in more complex potentials U (x) can be conducted
by sketching the level curves in a similar way.
Example 4.6.1. Sketch the level curves of the energy for the Kepler potential, U(x) := −1/x + C/x²,
and for the potentials shown in Fig. (4.3).
Let us analyze the following simple but very instructive example of a system which deviates
very slightly from the quadratic potential with k = 1:
ẋ = v + εf(x, v), v̇ = −x + εg(x, v), 0 ≤ ε ≪ 1. (4.106)
At ε = 0 we calculate the energy and find that E = (x² + v²)/2 is conserved, and
so the system cycles with the period T = 2π.
The general case with 0 < ε ≪ 1 is not conservative. Let us examine how the energy
changes with time. One derives
dE/dt = xẋ + vv̇ = ε(xf + vg) = ε(x⁽⁰⁾f + v⁽⁰⁾g) + O(ε²), (4.107)
where (x⁽⁰⁾(t), v⁽⁰⁾(t)) denotes the unperturbed (ε = 0) trajectory.
Integrating over a period, one arrives at the following expression for the gain (or loss) of
energy:
ΔE = ε∫_0^{2π} dt (x⁽⁰⁾f + v⁽⁰⁾g) + O(ε²) = ε∮(−f dv + g dx) + O(ε²), (4.108)
where the integral is taken over the level curve, which is also the iso-energy cycle, of the unperturbed (ε = 0) system in the (x, v) space. Obviously, ΔE depends on x_0.
For the case of increasing energy, ∆E > 0, we see an unwinding spiral in the (x, v)
plane. For the case of decreasing energy, ∆E < 0, the spiral contracts to a stationary point.
There are also systems where the sign of ΔE depends on x_0. Consider for example the
van der Pol oscillator,
ẍ = −x + εẋ(1 − x²). (4.109)
As in Eq. (4.108), we integrate dE/dt over a period, which in this case gives
ΔE = ε∫_0^{2π} dt ẋ²(1 − x²) + O(ε²) = ε∫_0^{2π} dt x_0² sin²t (1 − x_0² cos²t) + O(ε²) = πε(x_0² − x_0⁴/4) + O(ε²). (4.110)
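The approach of the oscillation amplitude to the limit-cycle value 2, from either side, is easy to observe numerically (a sketch with the arbitrary choice ε = 0.1):

import numpy as np
from scipy.integrate import solve_ivp

eps = 0.1
rhs = lambda t, y: [y[1], -y[0] + eps*y[1]*(1 - y[0]**2)]  # Eq. (4.109)

for x0 in (0.5, 3.0):  # start inside and outside the limit cycle
    sol = solve_ivp(rhs, (0, 200), [x0, 0.0], max_step=0.05)
    late = sol.y[0][sol.t > 150]   # discard the transient
    print(x0, np.abs(late).max())  # ~ 2 in both cases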
The O(ε) part of this expression is zero when x_0 = 2, positive when x_0 < 2 and negative
when x_0 > 2. Therefore, if we start with x_0 < 2 the system will gain energy, and
the maximum value of x(t) within a period will approach the value 2. On the contrary, if
x_0 > 2 the system will lose energy, and the maximum value of x(t) over a period will
decrease, approaching the same value 2. This type of behavior, established for ΔE including
only O(ε) contributions (and thus ignoring all contributions of higher order in small ε), is
the signature of a stable limit cycle, characterized by
ΔE(x_0) = 0 and (d/dx_0)ΔE(x_0) < 0.
In summary, the van der Pol oscillator is an example of behavior where the perturbation
is singular, meaning that it is categorically different from the unperturbed case. Indeed, in
the unperturbed case the particle oscillates, cycling an orbit which depends on the initial
condition, while in the perturbed case the particle ends up moving along the same limit
cycle. In general:
a limit cycle is stable at x = x_0 if ΔE(x_0) = 0 and (d/dx_0)ΔE(x_0) < 0;
a limit cycle is unstable at x = x_0 if ΔE(x_0) = 0 and (d/dx_0)ΔE(x_0) > 0.
Suggest an example of perturbations, f and g, in Eq. (4.106) which leads to (a) an unstable
limit cycle at x0 = 2, and (b) one stable limit cycle at x0 = 2 and one unstable limit cycle
at x0 = 4. Illustrate your suggested perturbations by building a computational snippet.
Consider now the system
İ = ε(a + b cos(θ/ω)), θ̇ = ω, (4.111)
where ω, ε, a, b are constants, and the ε-term in the first of Eqs. (4.111) is a perturbation. When ε
is zero, I is an integral of motion (meaning that it is constant along solutions of the ODE),
and we think of θ as an angle in the phase space increasing linearly with the frequency ω.
Note that the unperturbed system is equivalent to the one described by Eq. (4.106).
Example 4.6.2.
(a) Show that one can transform the unperturbed (i.e. ε = 0) version of the system
described by Eq. (4.106) to the unperturbed version of the system described by
Eq. (4.111) via the following transformation (change of variables):
v = √(I/2) cos(θ/ω), x = √(I/2) sin(θ/ω). (4.112)
The transformation discussed in Example 4.6.2 is a so-called canonical transformation: it preserves the Hamiltonian structure of the equations. In this case the Hamiltonian,
which is generally a function of θ and I, depends only on I, H = Iω, and one can indeed
rewrite the unperturbed version of Eq. (4.111) as
θ̇ = ∂H/∂I = ω, İ = −∂H/∂θ = 0, (4.113)
therefore interpreting θ and I as the new coordinate and the new momentum respectively.
Averaging the perturbed Eq. (4.111) over one (2πω) angle revolution, as done in Section
4.6.3, one arrives at the following expression for the change in I over the 2π-period (of time):
ΔI = 2πεa. (4.114)
Taking many revolutions (2πnω in angle), replacing 2πn by t and ΔI by J, where the latter is the
action averaged over the time t, one arrives at the equation
J̇ = εa, (4.115)
and one can check for consistency that the solution of the averaged Eq. (4.115) indeed does
not deviate (with time) from the exact solution of Eq. (4.111).
In the general n-dimensional case one considers the perturbation of a bare (unperturbed)
integrable system,
İ = εg(I, θ, ε), θ̇ = ω(I) + εf(I, θ, ε), (4.119)
where f and g are 2πω-periodic functions of each of the components of θ. Since I changes
slowly, due to the smallness of ε, the perturbed system can be substituted by a much simpler
averaged system for the slow (adiabatic) variables, J(t) = I(t) + O(ε):
J̇ = εG(J), G(J) := ∮dθ g(I, θ, 0) / ∮dθ, (4.120)
where, as in Section 4.6.3, ∮ stands for averaging over the period (one rotation) in the phase
space. Notice that the procedure of averaging over the periodic motion may break down at higher
dimensions, n > 1, if the system has resonances, i.e. if Σ_i N_i ω_i = 0, where the N_i are integers.
If the perturbed system is Hamiltonian, with θ playing the role of the generalized coordinates and
I of the generalized momenta, then Eqs. (4.119) become
İ = −∂H/∂θ, θ̇ = ∂H/∂I. (4.121)
In this case averaging over θ the rhs of the first equation in Eq. (4.121) results in J˙ = 0.
This means that the slow variables, J1 , · · · , Jn , also called adiabatic invariants, do not
change with time. Notice that the main difficulty of applying this rather powerful approach
consists in finding proper variables which remain integrals of motion of the unperturbed
system.
Chapter 5
Partial Differential Equations
A partial differential equation (PDE) is a differential equation that contains one or more
unknown multivariate functions and their partial derivatives. We begin our discussion by
introducing first-order PDEs and how to reduce them to a system of ODEs by the method
of characteristics. We then utilize ideas from the method of characteristics to classify
(hyperbolic, elliptic and parabolic) linear, second-order PDEs in two dimensions (Section
5.2). We will discuss how to generalize and solve elliptic PDEs, normally associated with
static problems, in Section 5.3. Hyperbolic PDEs, discussed in Section 5.4, are normally
associated with waves; there we take a more general approach, originating from the intuition
associated with waves as phenomena (a wave solving a hyperbolic PDE, e.g. a sound wave,
is then a particular example). We will discuss the diffusion (also called heat) equation as the main example
of a parabolic PDE, generalized to higher dimensions, in Section 5.5.
Observe that, since the parametric curve was arbitrary, we may choose to define it by
dx/dt = a(x, y, u), dy/dt = b(x, y, u), du/dt = c(x, y, u). (5.3)
Substituting this into (5.2) gives us precisely (5.1). Thus we have a family of characteristic
curves from which we can construct our solution to (5.1). (It is a family of curves since we
never gave an initial condition for the system (5.3).)
Let u(x) : R^d → R be a function of a d-dimensional coordinate, x := (x_1, . . ., x_d).
Introduce the gradient vector, ∇_x u := (∂_{x_i} u; i = 1, . . ., d), and consider the following equation, linear
in ∇_x u:
(V · ∇_x u) := Σ_{i=1}^{d} V_i ∂_{x_i} u = f, (5.4)
where the velocity V(x) ∈ R^d and the forcing f(x) ∈ R are given functions of x.
First, consider the homogeneous version of Eq. (5.4)
(V · ∇x u) = 0. (5.5)
Introduce an auxiliary parameter (or dimension) t ∈ R, call it time, and then introduce the
characteristic equations
dx(t)
= V (x(t)), (5.6)
dt
describing the evolution of the characteristic trajectory x(t) in time according to the func-
d
tion V . A first integral is a function for which dt F (x(t)) = 0. Observe that any first
integral of Eqs. (5.6) is a solution to Eq. (5.5), and that any function of the first integrals
of Eqs. (5.6), g(F1 , . . . , Fk ), is also a solution to Eq. (5.5).
Indeed, a direct substitution of u = g into Eq. (5.5) leads to the following sequence of
equalities:
(V · ∇_x g) = Σ_{i=1}^{k} (∂g/∂F_i) Σ_{j=1}^{d} V_j (∂F_i/∂x_j) = Σ_{i=1}^{k} (∂g/∂F_i) (d/dt)F_i = 0. (5.7)
The system of equations (5.6) has d − 1 first integrals independent of t (directly). Then the
general solution to Eq. (5.5) is
u = g(F_1, . . ., F_{d−1}), (5.8)
where g is assumed to be sufficiently smooth (at least twice differentiable with respect to the first integrals).
Consider, for example, the PDE ∂_x u + y ∂_y u = 0. The characteristic equations are dx/dt = 1, dy/dt = y, with the general solution x(t) =
t + c_1, y = c_2 exp(t). The only first integral of the characteristic equations is F(x, y) =
y exp(−x); therefore u = g(F(x, y)), where g is an arbitrary function, is a general solution.
It is useful to visualize the flow along the characteristics in the (x, y) space.
Example 5.1.1. Find the characteristics of the following PDEs and use them to find the
general solutions to the PDEs. Verify your solutions by direct substitution.
(a) ∂_x u − y² ∂_y u = 0,
(b) x ∂_x u − y ∂_y u = 0,
(c) y ∂_x u − x ∂_y u = 0.
Solution.
(a) The goal is to find curves parameterized by t expressing the left hand side as a total
derivative, giving (d/dt)u(x(t), y(t)) = 0. By the chain rule, this is equivalent to ∂_x u ẋ(t) +
∂_y u ẏ(t) = 0, which is equivalent to our PDE if we set ẋ(t) = 1 and ẏ(t) = −y². These
are the characteristic equations. Their solutions are x(t) = t + c_1 and y(t) = (t + c_2)⁻¹.
Eliminating t gives c = y⁻¹ − x =: F(x, y) as the only first integral. General solutions
to the PDE are of the form u(x, y) = g(y⁻¹ − x), where g : R → R can be any function.
(b) Similarly, the left hand side becomes the total derivative (d/dt)u(x(t), y(t)) if we set
ẋ(t) = x and ẏ(t) = −y. These are the characteristic equations. Their solutions are x(t) = c_1 e^t and y(t) = c_2 e^{−t}.
Eliminating t gives c = xy =: F(x, y) as the only first integral. General solutions to
the PDE are of the form u(x, y) = g(xy), where g : R → R can be any function.
(c) Likewise, setting ẋ(t) = y and ẏ(t) = −x turns the left hand side into a total derivative. These
are the characteristic equations. Their solutions are x(t) = c sin(t) and y(t) = c cos(t).
Eliminating t gives c² = x² + y² =: F(x, y) as the only first integral. General solutions
to the PDE are of the form u(x, y) = g(x² + y²), where g : R → R can be any function.
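The three general solutions can be verified at once by direct substitution with SymPy (a sketch; g is an arbitrary symbolic function):

import sympy as sp

x, y = sp.symbols('x y')
g = sp.Function('g')

checks = [
    sp.diff(g(1/y - x), x) - y**2*sp.diff(g(1/y - x), y),          # (a)
    x*sp.diff(g(x*y), x) - y*sp.diff(g(x*y), y),                   # (b)
    y*sp.diff(g(x**2 + y**2), x) - x*sp.diff(g(x**2 + y**2), y),   # (c)
]
print([sp.simplify(c) for c in checks])  # [0, 0, 0]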
Consider the following initial value (boundary) Cauchy problem: solve Eq. (5.5) subject
to the boundary condition
u(x)|x0 ∈S = ϑ(x0 ), (5.9)
Example 5.1.2. Let us illustrate the solution to the Cauchy problem on an example whose
general solution is
u(x, y) = g(y exp(x)).
(a) Solve the Cauchy problem for the initial condition u(0, y) = y². (b) Explain why the same problem with the initial
condition u(0, y) = y is ill-posed. (c) Determine whether the same problem with the initial
condition u(1, y) = y² is ill-posed.
Example 5.1.3. Consider Liouville's equation for a function f(t, q, p),
∂_t f + {f, H} = 0, {f, H} := Σ_i ((∂H/∂p_i)(∂f/∂q_i) − (∂H/∂q_i)(∂f/∂p_i)),
where {f, H} is the so-called Poisson bracket of f and H. Find the characteristics of
Liouville's PDE.
Solution. We wish to find a V(t, q, p) such that the left hand side of the PDE can be
expressed in the form V · ∇f. This is satisfied for V = (1, ∂_p H, −∂_q H). We interpret
V as the vector field V = (dt/ds, dq/ds, dp/ds). Introducing V we also define a family of curves,
(t(s), q(s), p(s)), called the characteristic curves. A little algebra allows us to simplify the
curves to
dq_i/dt = ∂H/∂p_i, dp_i/dt = −∂H/∂q_i, ∀i = 1, · · ·, n.
Interpretation: In the next chapter, we will see that the Hamilton’s equations, dq/dt = ∂p H,
and dp/dt = −∂q H, describe the evolution of the state of the system in phase space. Since
the characteristic curves are precisely the solutions to the Hamilton’s equations, which
df
reduces Liouville’s PDE to dt = 0, we infer that for a Hamiltonian system any function of
the system’s state variables (q, p) does not change as the system evolves in time.
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 134
Now let us get back to the inhomogeneous Eq. (5.4). As is standard for linear equations,
the general solution to an inhomogeneous equation is constructed as the superposition of
the a particular solution and general solution to the respective homogeneous equation. To
find the former we transition to characteristics, then Eq. (5.4) becomes
d
(V · ∇x ) u = (ẋ · ∇x ) u = u = f (x(t)), (5.10)
dt
which can be integrated along the characteristic thus resulting in a desired particular solu-
tion to Eq. (5.4)
Zt
uinh = f (x(s))ds where x(s) satisfies (V · ∇x )u = (ẋ · ∇x )u. (5.11)
t0
Example 5.1.4. Solve the Cauchy problem for the following inhomogeneous equation
∂t u + u∂x u = 0, (5.13)
which, when u(t; x) refers to the velocity of a particle at location x and time t, describes the
one dimensional flow of non-interacting particles. The characteristic equations and initial
conditions are
ẋ = u, u̇ = 0, x(t = 0) = x0 , u(t = 0) = u0 (x0 ).
Direct integration produces, x = u0 (x0 )t + x0 giving the following implicit equation for u
u = u0 (x − ut). (5.14)
Under the specific conditions, u0 (x) = c(1 − tanh x), this results in the following (still
implicit) equation, u = c(1 − tanh(x − ut)). Computing partial derivative, one derives
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 135
∂x u = −c/(cosh2 (x − ut) − ct), which shows that it diverges in finite time at t∗ = 1/c and
x = ut. The phenomenon is called wave breaking, and has the physical interpretation of fast
particles catching slower ones and aggregating, leading to sharpening of the velocity profile
and eventual breakdown. This singularity is formal, meaning that the physical model is no
longer applicable when the singularity occurs. Introducing a small κ∂x2 u term to the right
hand side of Eq. (5.13) regularizes the non-physical breakdown, and explains creation of
shock. The regularized second-order PDE is called Burgers’ equation.
where all the coefficients may depend on the two independent variables x and y.
The method of characteristics, (which applies to first-order PDEs, for example, when
a11 = a12 = a21 = c = 0 in Eq. (5.15)), can inform the analysis of second-order PDEs.
Therefore, let us momentarily return to the first-order PDE,
b1 ∂x u + b2 ∂y u + f = 0, (5.16)
and interpret its solution as the variable transformation from the (x, y) pair of variables to
the new pair of variables, (η(x, y), ξ(x, y)), assuming that the Jacobian of the transformation
is neither zero no infinite anywhere within the domain of (x, y) of interest.
!
∂x η ∂ y η
J = det 6= 0, ∞. (5.17)
∂x ξ ∂ y ξ
Substituting u = w(η(x, y), ξ(x, y)) into the sum of the first derivative terms in Eq. (5.16)
one derives
Requiring that the second term in Eq. (5.18) is zero one observes that it is satisfied for all
x, y if ξ(y(x)), i.e. it does not depend on x explicitly but only via y(x) if the latter satisfies
the characteristic equation, b1 dy/dx + b2 = 0.
Let us now try the same logic, but now focusing on the sum of the second-order terms
in Eq. (5.15). We derive
a11 ∂x2 u + 2a12 ∂x ∂y u + a22 ∂y2 u = A∂ξ2 + 2B∂ξ ∂η + C∂η2 w, (5.19)
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 136
where
Let us now attempt, by analogy with the case of the first-order PDE, to force first and last
term on the rhs of Eq. (5.19) to zero, i.e. A = C = 0. This is achieved if we require that
ξ(y+ (x)) and η(y− (x)), where
√
dy± a12 ± D
= , where D := a212 − a11 a22 . (5.20)
dx a11
and D is called the discriminant. Eqs. (5.20) have in a general case distinct (first) integrals
ψ± (x, y) = const. Then, we can choose the new variables as ξ = ψ+ (x, y) and η = ψ− (x, y)
If D > 0 Eq. (5.15) is called a hyperbolic PDE. In this case, the characteristics are real,
and any real pair (x, y) is mapped to the real pair (η, ξ). Eq. (5.15) gets the following
canonical form
∂ξ ∂η u + b̃1 ∂ξ u + b̃2 ∂η u + c̃u + f˜ = 0. (5.21)
Notice that another (second) canonical form for the hyperbolic equation is derived if we
transition further from (ξ, η) to (α, β) := ((η + ξ)/2, (ξ − η)/2). Then Eq. (5.21) becomes
(2) (2)
∂α2 u − ∂β2 u + b̃1 ∂α u + b̃2 ∂β u + c̃(2) u + f˜(2) = 0. (5.22)
If D < 0 Eq. (5.15) is called an elliptic PDE. In this case, Eqs. (5.21) are complex
conjugate of each other and their first integrals are complex conjugate as well. To make the
map from old to new variables real, we choose in this case, α = Re(ψ+ (x, y)) = (ψ+ (x, y) +
ψ− (x, y))/2, β = Im(ψ+ (x, y)) = (ψ+ (x, y)−ψ− (x, y))/(2i). This change of variables results
in the following canonical form for the elliptic second-order PDE:
(e) (e)
∂α2 u + ∂β2 u + b1 ∂α u + b2 uβ + c(e) u + f (e) = 0. (5.23)
D = 0 is the degenerate case, ψ+ (x, y) = ψ− (x, y), and the resulting equation is a
parabolic PDE. Then we can choose β = ψ+ (x, y) and α = ϕ(x, y), where ϕ is an arbitrary
independent (of ψ+ (x, y)) function of x, y. In this case Eq. (5.15) gets the following canonical
parabolic form
(p) (p)
∂α2 u + b1 ∂α u + b2 ∂β u + c(p) u + f (p) = 0. (5.24)
Example 5.2.1. Define the type of equation and then perform change of variables reducing
it to the respective canonical form
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 137
Solution.
(a) The coefficients of the second order terms are a11 = 1, a12 = 1/2 and a22 = −2. The
discriminant is D = a212 − a11 a22 = 9/4. The equation is hyperbolic (everywhere)
because the discriminant is positive (everywhere). There are two families of char-
acteristics, defined by dy/dx = (a12 ± D)/a11 giving dy/dx = 2 and dy/dx = −1,
respectively. The general solutions of the two equations are
y = 2x + ξ, y = −x + η,
where ξ and η are arbitrary constants. Expressing, ξ and η via x and y one derives
ξ = y − 2x, η = y + x.
∂ξ ∂η u + ∂ξ u + 2∂η u + 3(η − ξ) = 0.
(b) The coefficients of the second order terms are a11 = 1, a12 = 1 and a22 = 5. The
discriminant is D = a212 − a11 a22 = −4. The equation is elliptic (everywhere) because
the discriminant is negative (everywhere). There are two families of the characteristics,
defined by dy/dx = (a12 ± D)/a11 giving dy/dx = 1 + 2i and dy/dx = 1 − 2i,
respectively. The general solutions of the two equations are
˜
y = (1 − 2i)x + ξ, y = (1 − 2i)x + η̃.
Setting ξ := (ξ˜+ η̃)/2 = y − x and η = (ξ˜− η̃)/2i = 2x as the new variables, we arrive
a the following canonical form
∂ξ2 u + ∂η2 u − 8u = 0.
(c) The coefficients of the second order terms are a11 = 1, a12 = −1 and a22 = 1. The
discriminant is D = a212 − a11 a22 = 0. The equation is parabolic (everywhere). We
have one characteristic, dy/dx = (a12 ± D)/a11 = −1, giving y = −x + ξ therefore one
of the new variables is the integral of the characteristic equations, ξ = x + y. We can
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 138
take any independent function of x and y as the second new variable. Let us pick,
η = x. Then the equation becomes,
∂η2 u + 2∂ξ u + ∂η u − u = 0.
(Notice that the condition of functional independence consists in the requirement that
Jacobian of the transformation is nonzero. Any other choice of second (independent)
variable will result in another canonical form.)
where we assume that it is not possible to eliminate at least one second derivative term from
the condition of the respective Cauchy problem. Notice that in d > 2 Eq. (5.25) cannot be
reduced to a canonical form (introduced, in the previous Section, in d = 2).
Our primary focus will be on the generalization where aij = δij in Eq. (5.25) in d ≥ 2,
and also on solving inhomogeneous equations, where a nontrivial (nonzero) solution is driven
by a nonzero source. It is natural to find solutions to these equations using Green functions.
We have discussed in Section 4.4.1 how to solve static linear one dimensional case of the
Poisson equation using the Green functions. Here we generalize and consider the Poisson
equation in d ≥ 2,
∇2r u = φ(r), (5.26)
The solution to Eq. (5.27) can be found by applying the Fourier transform, resulting in the
following algebraic equation, k 2 Ĝ(k) = −1, where k = |k|. Solving (trivially) for Ĝ and
applying the Inverse Fourier transform, one derives for d = 3,
Z Z Z∞
d3 k exp(i(k · r)) d2 k⊥ exp(ikk r)
G(r) = − =− dkk
(2π)3 k2 (2π)3 kk2 + k⊥
2
−∞
Z
d2 k⊥ π 1
=− exp(−k⊥ r) = − , (5.29)
(2π)3 k⊥ 4πr
where for each r, we change from the Cartesian to cylindrical representation associated with
r, i.e. k = (kk , k⊥ ), the one-dimensional kk = (k · r)/r is along r and the two dimensional
vector, k⊥ , stands for the remaining two components of k orthogonal to r. Substituting
Eq. (5.29) into Eq. (5.28) one derives
Z Z
φ(r) ρ(r)
u(r) = − d3 r 0 = d3 r 0 , (5.30)
4π|r − r 0 | |r − r 0 |
which is thus expression for the electrostatic potential of a given distribution of the charge
density in the space.
Note, that for d = 2, we usually write φ(r) = −2πρ(r). In this case the Green function
is found to be G(r − r 0 ) = ln(|r − r 0 |).
The homogeneous case of φ = 0 is often called the Laplace equation. We will distinguish
the two cases calling them the (inhomogeneous) Laplace equation and the homogeneous
Laplace equation respectively.
We will also discuss in the following the Debye equation
∇2r − κ2 u = φ(r), (5.31)
Example 5.3.1. Find the Green function for the Laplace equation in the region outside of
the sphere of radius R and zero boundary condition on the sphere, i.e. solve
Solution. The Green function can be constructed by recognizing that |r − r 00 |−1 solves
∇2r G(r; r 0 ) = 0 for all r 6= r 00 . The trick is to find for each r 0 ∈ D a fictitious image point
r 00 ∈
/ D such that G(r; r 0 ) = 0 whenever |r| = R. The problem distills down to finding
the correct strength and position of the image point to enforce the boundary condition at
every point on the boundary. Using symmetry, it is clear that r 0 and r 00 must be collinear.
Therefore, we can write r 00 = αr 0
1 A
Find A, α such that G(r; r 0 ) := − + = 0 whenever |r| = r = R,
4π|r − r | 4π|r − αr 0 |
0
p
For any r on the boundary and r 0 ∈ D, |r − r 0 | = R2 + |r 0 |2 − 2|r 0 |R cos(θ). Similarly,
p
|r − r 00 | = R2 + α2 |r 0 |2 − 2α|r 0 |R cos(θ) (See blue and orange triangles respectively in the
Fig. 5.2). To enforce the boundary condition, their contributions must cancel. That is,
1 A
− p + p = 0.
4π |r 0 |2 + R2 − 2|r 0 |R cos(θ) 4π α2 |r 0 |2 + R2 − 2α|r 0 |R cos(θ)
We are looking for values of A and α that are independent of θ. The algebra is a bit tedious,
but we find that A = R/|r 0 | and α = (R/|r 0 |)2 . Hence, the Green function is
1 R/|r 0 |
G(r; r 0 ) := − + ,
4π|r − r 0 | 4π|r − r 00 |
where the charge density, ρ(r) depends only on the distance from the origin (zero), i.e.
ρ(r = |r|). (Hint: Consider finding the Green function first, acting by analogy with how we
found the Green function of the Laplace equation above.)
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 141
r
r R
θ
r 00 r0 r 00 r0
0
α|r |
|r 0 | |r 0 |
(a) For each r 0 ∈ D, identify the associated (b) Contributions from r 0 and r 00 must cancel
image r 00 ∈
/ D to find response at any r to enforce the boundary condition
Figure 5.2: Method of images applied to the exterior of a sphere in the example 5.3.1.
∗
5.4 Waves in a Homogeneous Media: Hyperbolic PDE
Although hyperbolic PDEs are normally associated with waves ∗ , we begin our discussion by
developing intuition which generalizes to a broader class of an integro-differential equations
beyond hyperbolic PDEs. In other words, we act here in reverse to what may be considered
the standard mathematical process; we begin by describing properties of solutions associated
with waves, and then walk back to the equations which are describing such waves.
Consider the propagation of waves in homogeneous media, for example: electro-magnetic
waves, sound waves, spin-waves, surface-waves, electro-mechanical waves (in power sys-
tems), and so on. In spite of such a variety of phenomena, they all admit one rather
universal description. The wave process at a general position in d-dimensional space r and
time t is represented as the following integral over the wave vector k
Z
dk
u(t; r) = exp (i(k·r)) ψk (t)û(k), ψk (t) ≡ exp (−iω(k)t) , (5.33)
(2π)k
where ω(k) and û(k) are the dispersion law and wave amplitude dependent on the wave
vector k. (Notice the similarities and the differences with the Fourier integral.) In Eq. (5.33)
ψk (t) is a solution to the following first-order (in time) linear ODE
d
+ iω(k) ψk = 0, (5.34)
dt
or alternatively of the following second-order linear ODE
2
d 2
+ (ω(k)) ψk = 0. (5.35)
dt2
These are called the wave equations in Fourier representation. The linearity of the equations
is principal and is due to the fact that generally nonlinear dynamics is linearized. Waves
∗
This is an Auxiliary Section which can be dropped at the first reading. Material from the Section will
not contribute midterm and final exams.
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 142
may also interact with each other. The interaction of waves can only come from accounting
for nonlinearities in the original equations. In this analysis, we focus primarily on the linear
regime.
Dispersion Laws
Consider the case where ωk = c|k|, where c is a constant having dimensionality and sense
of velocity. In this case, the inverse Fourier transform version of Eq. (5.35) becomes
2
d 2 2
− c ∇ r ψ(t; r) = 0. (5.36)
dt2
Note that the two differential operators in Eq. (5.36), one in time and another in space,
have opposite signs. Therefore, we naturally arrive at the case which generalizes the hyper-
bolic PDE (5.22). It is a generalization because r is not one-dimensional but d-dimensional,
d ≥ 1.
Eq. (5.26) with c constant, explains a variety of important physical situations: as men-
tioned already, it describes propagation of sound in a homogeneous gas, liquid or crystal
media. In this case ψ describes the shift of an element of the matter from its equilibrium
position and c is the speed of sound in the material b .
Another example is given by the electro-magnetic waves, described by the Maxwell
equations on the electric, E, and magnetic, B, fields,
where × is the vector product in d = 3c , and c is the speed of light in the media. Differ-
entiating the first equation in the pair of Eqs. (5.37) over time, substituting the resulting
∂t ∇r × B by −c(∇r × (∇r × E)), consistently with the second equation in the pair, and
taking into account that for the divergence-free, E, (∇r × (∇r × E)) = ∇2r E, one arrives
at Eq. (5.36) for all components of the electric field, i.e. with ψ replaced by E.
b
Note, that there is a unique speed of sound in gas or liquid, while 3d crystal supports three different waves
(with different three different c) each associated with a distinct polarization. For example, in an isotropic
crystals there are longitudinal and transversal waves propagating along and, respectively, perpendicular to
the media shift.
c
(∇r × B)i = εijk ∇j Bk , where i, j, = 1, ·3 and εijk is the absolutely skew-symmetric tensor in d = 3
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 143
The dispersion law in the case of sound and light waves is linear, ω(k) = ±c|k|, however
there are other more complex examples. For example, surface waves propagating over the
surface of water (with air), are characterized by the following dispersion law
p
ω(k) = gk + (σ/ρ)k 3 , (5.39)
where g, σ and ρ are gravity coefficient, surface tension coefficient and density of the fluid,
respectively. Eq. (5.39) is so complex because it accounts for both capillary and gravita-
tional effects. Gravitational waves dominate at small q (large distances), where Eq. (5.39)
√
transforms to ω(q) = gq, while the capillary waves dominate in the opposite limit of large
q (small distances), where one gets asymptotically ω = (σ/ρ)1/2 q 3/2 .
Recall that Eq. (5.34) or Eq. (5.35) are stated in the Fourier k-representation. Transi-
tioning to the respective r-representation in the case of a nonlinear dispersion relation, for
example associated with Eq. (5.39), will NOT result in a PDE. We arrive in this general case
at an integro-differential equation, reflecting the fact that the nonlinear dispersion relation,
even though local in the k-space becomes nonlocal in r-space.
In general, propagation of waves in the homogeneous media is characterized by the
dispersion law dependent only of the absolute value, k = |k| of the wave vector, k. ω(k)/k
and dω(k)/dk, both having dimensionality of speed, are called, respectively, phase velocity
and group velocity.
Example 5.4.1. Solve the Cauchy (initial value) problem for amplitude of spin-waves which
satisfy the following PDE
∂t2 ψ = −(Ω − b∇2r )2 ψ, (5.40)
is the respective (spin wave) dispersion law. The Fourier transform of the initial condition
over k is, ψ̂(t = 0; k) = π 3/2 exp(−k 2 /4). Since dψ/dt(t = 0; r) = 0, the Fourier transform
of the initial condition is zero as well, that is, dψ̂/dt(t = 0; k) = 0. Then, the solution to
Eqs. (5.34,5.41) becomes ψ̂(t; k) = π 3/2 exp(−k 2 /4) cos((Ω + bk 2 )t). Evaluating the inverse
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 144
Example 5.4.2. Solve the Cauchy (initial value) problem for the wave Eq. (5.36) in d = 3,
where ψ(t = 0; r) = exp(−r2 ) and dψ/dt(t = 0; r) = 0.
So far we have discussed the free propagation of waves. Consider the inhomogeneous equa-
tion generalizing Eq. (5.35) that arises from a source term χ(t; r) on the right hand side:
2
d 2
+ (ω(−i∇ r )) ψ(t; r) = χ(t; r). (5.42)
dt2
where we have used −i∇r exp(ikr) = k exp(ikr). You may assume that the dispersion law,
ω(k) is continuous value of its argument (absolute value of the wave vector) so that the
operator ω(−i∇r ))2 is well defined in the sense of the function’s Taylor series.
The Green function for the PDE is defined as the solution to
2
d 2
+ (ω(−i∇r )) G(t; r) = δ(t)δ(r). (5.43)
dt2
The solution to the inhomogeneous PDE, Eq. (5.42), can be expressed as the convolution
of the source term χ(t1 ; r1 ) with the Green function, G(t; r)
Z
ψ(t; r) = dt1 dr1 G(t − t1 ; r − r1 )χ(t1 ; r1 ), (5.44)
The solution to Eq. (5.42) is expressed as sum of the forced solution (5.44) and a zero
mode of the respective free equation, i.e. Eq. (5.42) with zero right hand side.
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 145
To solve Eq. (5.43) for the Green function, or equivalently equation for its Fourier
transform
d2 2
+ (ω(k)) Ĝ(t; k) = δ(t). (5.45)
dt2
Recall that the inhomogeneous ODE. (5.45) was already discussed earlier in the course.
Indeed Eq. (4.36) solves Eq. (5.45). Then recalling that ω depends on k and applying the
inverse Fourier transform over k to Eq. (4.36) one arrives at
Z
d3 k sin(ω(k)t)
G(t; r) = θ(t) exp (i(k · r)) . (5.46)
(2π)3 ω(k)
Example 5.4.3. Show that the general expression (5.46) in the case of the linear dispersion
law, ω(k) = ck, becomes
θ(t)
G(t; r) = (δ(r − ct) − δ(r + ct)), (5.47)
4πcr
where r = |r|.
Solution. The linear dispersion law means that w(k) = ck, where k = |k|. Then
Z
sin ckt d3 k
G(t; r) = θ(t) exp(ir · k) .
R3 ck (2π)3
Compute the integral by rotating the coordinate system so that r is pointed in the z-
direction (i.e. r = (0, 0, r)> ) and then switch to spherical coordinates (i.e. (k1 , k2 , k3 ) 7→
(k cos θ sin φ, k sin θ sin φ, k cos φ)). The scalar product r ·k then evaluates to 0+0+rk cos φ.
Z 2π Z π Z ∞
θ(t) sin ckt
G(t; r) = 3
exp(irk cos φ)k 2 sin φ dk dφ dθ
(2π) 0 0 0 ck
Z ∞ ickt − e−ickt eikr − e−ikr
θ(t) e
= k2 · dk via the sub. u = − cos θ
(2π)2 0 2ick ikr
Z ∞ ik(r−ct)
θ(t) e + e−ik(r−ct) eik(r+ct) + e−ik(r+ct)
= − dk
(2π)2 cr 0 2 2
θ(t)
= (δ(r − ct) − δ(r + ct)),
4πcr
which is equivalent to the expression we were given.
Substituting Eq. (5.47) into Eq. (5.44) one derives the following expression for linear
dispersion (light or sound) radiation from a source
Z
1 dr1 R
ψ(t; r) = χ t − ; r1 . (5.48)
4πc2 R c
The solution suggests that action of the source is delayed by R/c correspondent to propa-
gation of light (or sound) from the source to the observation point.
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 146
Example 5.4.4. Solve the radiation Eq. (5.42) in the case of the linear dispersion law for
the case of a point harmonic source, χ(t; r) = cos(ωt)δ(r).
Solution.
Z Z
1 dr1
ψ(t; r) = dt1 dr1 G(t − t1 ; r − r1 )χ(t1 ; r1 ) = cos(ω(t − R/c))δ(r1 )
4πc2 R
1
= cos(ω(t − R/c)).
4πRc2
∂t u = κ∇2r u, (5.49)
where κ is the diffusion coefficient. The equation appears in a number of applications, for
example, this equation can be used to describe the evolution of the density of number of
particles, or the spatial variation of temperature. The same equation describes properties
of the basic stochastic process (Brownian motion).
Consider the Cauchy problem with u(t; r) given at t = 0. The Fourier transform over
r ∈ Rd is Z
û(t; k) = dy1 . . . dyd exp (ik · x) u(t; y). (5.50)
Integrating the equation over time, û(t; q) = exp(−q 2 t)û(0; k), and evaluating the inverse
Fourier transform over q of the result one arrives at
Z
dy1 , . . . dyd (x − y)2
u(t; x) = exp − u(0; y). (5.52)
(4πt)d/2 4t
If the initial field, u(0; x), is localized around some x, say around x = 0, that is if
u(0; x) decays with |x| increase sufficiently fast, then one may find a universal asymptotic
of u(t; x) at long times, t l2 , where l is the length scale on which u(0; x) is localized. At
these sufficiently large times dominant contribution to the integral in Eq. (5.52) is acquired
from the |y| ∼ l vicinity of the origin, and therefore in the leading order one can ignore
y-dependence of the diffusive kernel in the integrand of Eq. (5.52), i.e.
2 Z
A x
u(t; x) ≈ exp − , A = u(0; y)dy1 . . . dyd . (5.53)
(4πt)d/2 4t
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 147
→ Aδ(y)
Notice that the approximation (5.53) corresponds to the substitution of u(0, y) in
2
Eq. (5.52). Another interpretation of Eq. (5.53) corresponds to expanding, exp − (x−y)
4t ,
in the Taylor series in y, and then ignoring all but the leading order term, O(y 0 ), in the
expansion. If A = 0 one needs to account for the O(y 1 ) term, and drop the rest. In this
case the analog of Eq. (5.53) becomes
2 Z
(B · x) x
u(t; x) ≈ d/2+1
exp − , B = 2π yu(0; y)dy1 . . . dyd . (5.54)
(4πt) 4t
Exercise 5.3. Find asymptotic behavior of a one-dimensional diffusion equation at suffi-
ciently long times for the following initial conditions
x2
(a) u(0; x) = x exp − 2
2l
|x|
(b) u(0; x) = exp −
l
|x|
(c) u(0; x) = x exp −
l
1
(d) u(0; x) =
x2 + l 2
x
(e) u(0; x) =
(x2 + l 2 )2
Hint: Think about expanding the diffusion kernel in the integrand of Eq.(5.52) in a series
over y.
Our next step is to find the Green function of the heat equation, i.e. to solve
In fact, we have solved this problem already as Eq. (5.52) describes it with u(0; y) =
G(+0; x) = δ(x) set as the initial condition. The result is
2
1 x
G(t; x) = d/2
exp − . (5.56)
(4πt) 4t
As always, the Green function can be used to solve the inhomogeneous diffusion equation
Example 5.5.1. Solve Eq. (5.57) for φ(t; x) = θ(t) exp −x2 /(2l2 ) in the d = 4-dimensional
space.
This problem can be solved by the Fourier Method (also called the method of variable
separation), which is split in two steps.
First, we look for a particular solution which satisfy only boundary conditions over
one of the coordinates, x. We look for u(t, x) in the separable form u(t, x) = X(x)T (t).
Substituting this ansatz in Eq. (5.59) one arrives at
X 00 (x) T 00 (t)
= = −λ, (5.61)
X(x) T (t)
where λ is an arbitrary constant. General solution to the equation for X is
√ √
X = A cos( λx) + B sin( λx).
Require that X(x) satisfies the same boundary conditions as in Eq. (5.60). This is possible
√
only if A = 0 and L λ = nπ, n = 1, 2, . . . . From here we derive solution labeled by
integer n and respective spatial form of the solution
nπ 2 nπx
λn = , Xn (x) = sin .
L L
We are now ready to get back to Eq. (5.61) and resolve equation for T (t):
nπct nπct
Tn (t) = An cos + Bn sin ,
L L
where An , Bn are arbitrary constants. Xn (x) form a complete basis and therefore a general
solution can be written as a linear combination of the basis solutions:
∞
X
u(t, x) = Xn (x)Tn (t).
n=1
On the second step we fix An and Bn resolving the initial portion of the conditions
(5.60):
∞
X ∞
X
ϕ(x) = An Xn (x), ψ(x) = λn Bn Xn (x). (5.62)
n=1 n=1
CHAPTER 5. PARTIAL DIFFERENTIAL EQUATIONS. 149
Multiplying both Eqs. (5.62) on Xm (x), integrating them from 0 to L, and accounting for
the ortho-normality of the eigen-functions, one derives
ZL ZL
2 2
Am = dxϕ(x)Xm (x), Bm = dxψ(x)Xm (x). (5.63)
L λm L
0 0
Example 5.6.1. The equation describing deviation of a string from the straight line, u(t; x),
is ∂t2 u − c2 ∂x2 u = 0, where x is the position along the line, t, is the time, and, c, is a
constant (speed of sound). Assume that the string has at t = 0 a parabolic shape, u(0; x) =
4hx(L − x)/L2 , with both ends, at x = 0 and x = L, respectively, attached to the straight
line. Let us also assume that the speed of the string is equal to zero at t = 0, i.e. ∀x ∈
[0, L], ∂t u(0; x) = 0. Find dependence of the string deviation, u(t; x), on time, t, at a
position, x ∈ [0, L], along the straight line.
Let us now analyze the following parabolic boundary value problem over x ∈ [0, L]:
x, x < L/2
2 2
∂t u = a ∂x u, u(t, 0) = u(t, L) = 0, u(0, x) = (5.64)
L − x, x > L/2.
Here we follow the same Fourier method approach. In fact the spectral part of the
solution here is identical to the one just described above in the hyperbolic case, while
the temporal components are obviously different. One derives, Tn0 = −λn Tn , which has a
decaying solution
nπ 2 2
Tn = An exp − a t .
L
Expansion of the initial conditions in the Fourier series is equivalent to conducted above,
therefore resulting in
∞ 2 !
4L X (−1)n (2n + 1)π 2 2n + 1
u(t, x) = 2 exp − a t sin πx .
π (2n + 1)2 L L
n=0
Notice that the solution is symmetric with respect to the middle of the interval, u(t, x) =
u(t, L − x), as this symmetry is inherited from the initial conditions.
∗
5.7 Case study: Burgers’ Equation
Burgers’ equation ∗ is a generalization of the Hopf’s equation, Eq. (5.13), discussed when
illustrating the method of characteristics. Recall that the Hopf’s equation results in a wave
breaking which leads to a non-physical multi-valued solution. Modification of the Hopf’s
equation by adding dissipation/diffusion results in the Burgers’ equation:
Like practically every other nonlinear PDE, Burgers’ equation seems rather hopeless to
resolve at first glance. However, Burgers’ equation is in fact special. It allows the Cole-
Hopf’s transformation, from u(t; x) to Ψ(t; x)
∂x Ψ(t; x)
u(t; x) = −2 , (5.66)
Ψ(t; x)
reducing Burgers’ equation to the diffusion equation
∂t Ψ = ∂x2 Ψ. (5.67)
The solution to the Cauchy problem associated with Eq. (5.67) can be expressed as an inte-
gral convolving the initial profile Ψ(0; x), with the Green function of the diffusion equation
described in Eq. (5.56)
Z
dy (x − y)2
Ψ(t; x) = √ exp − Ψ(0; y). (5.68)
4πt 4t
This latter expression can be used to find some exact solutions to Burgers’ equation. Con-
sider, for example, Ψ(0; x) = cosh(ax). Substitution into Eq. (5.68) and conducting in-
tegration over y, one arrives at Ψ(t; x) = cosh(ax) exp(a2 t), which results, according to
Eq. (5.66), in stationary (time independent, i.e. standing) “shock” solution to Burgers’
equation, u(t; x) = −2a tanh(ax). Notice that the following more general solution to Burg-
ers’ equation corresponds to a shock moving with the constant speed u0
Example 5.7.1. Solve the diffusion equation Eq. (5.67) with the initial conditions Ψ(0, x) =
cosh(ax) + B cosh(bx). Reconstruct respective u(t; x) solving the Burgers Eq. (5.65). Ana-
lyze the result in the regime b > a and B 1 and also verify, by building a computational
snippet, that the resulting spatio-temporal dynamics corresponds to a large shock “eating”
a small shock.
∗
This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute
midterm and finals.
Part III
Optimization
151
Chapter 6
Calculus of Variations
The main theme of this chapter is the relation of equations to minimal principles. Over-
simplifying a bit: to minimize a function S(q) is to solve S 0 (q) = 0. For a quadratic,
S(q) = 21 q T Kq − q T g, where K is positive definite, one indeed has the minimum of S(q)
achieved at q∗ , which solves S 0 (q∗ ) = Kq∗ − g = 0.
In the example above, q is an n-(finite) dimensional vector, q ∈ Rn . Consider extending
the finite dimensional optimization to an infinite dimensional, continuous, problem where
q(x) is a function, say, q(x) : R → R, and S{q(x)} is a functional, typically an integral with
the integrand dependent on q(x) and its derivative, q 0 (x), for example
Z c
S{q(x)} = dx (q 0 (x))2 − g(x)q(x) .
2
The derivative of the functional over q(x) is called the variational derivative, and by analogy
with the finite dimensional example above, one finds that the Euler-Lagrange (EL) equation,
δS{q}
= 0,
δq(x)
solves the problem of minimizing the functional. The goal of this section is to understand
the variational derivative and other related concepts in theory and on examples.
6.1 Examples
To have a better understanding of the calculus of variations we start by describing four
examples.
Consider a robot navigating in the (x, y)-plane, and define the function y = q(x) so it
describes the path of the robot. For small δx, the arclength of the the robot’s path from
152
CHAPTER 6. CALCULUS OF VARIATIONS 153
p
(x, y) to (x + δx, y + δy) can be approximated by (δx)2 + (δy)2 by the Pythagorean
p
theorem, which simplifies to 1 + (q 0 (x))2 dx for infinitesimal δx . If the plane constitutes
a rugged terrain, then the robot’s speed may differ at each point in the plane, so we define
the scalar-valued positive function µ−1 (x, y) to describe the speed of the robot at each point
in the plane. The time it takes for the robot to move from (x, q(x)) to (x + dx, q(x + dx))
along the path q(x) is
p
L(x, q(x), q 0 (x)) := µ(x, q(x)) 1 + (q 0 (x))2 dx.
The total time taken to travel along the path which starts at an initial point (xi , yi ) := (0, 0)
and ends at a terminal point (xt , yt ) := (a, b), where a > 0 is
Za
S{q(x)} = dxL(x, q(x), q 0 (x)).
0
Subject to a few modest conditions on µ(x, y), the calculus of variations provides a way to
find the optimal path through the domain, that is, the path which minimizes the functional
S{q(x)}, subject to q(0) = 0 and q(a) = b.
Consider making a three-dimensional bubble by dipping a wire loop into soapy water, and
then asking whether there is an optimal bubble shape for a given loop. Physics suggests
that the bubble will form in whatever shape minimizes the surface area of the soap film.
We formalize this setting as follows. The surface of a bubble is described by the con-
tinuously differentiable function, q(x) : x = (x1 , x2 ) ∈ D → q(x) ∈ R, where D ⊂ R2 is
bounded. We also assume that at the boundary of D, denoted ∂D and representing a closed
line in the (x1 , x2 ) plane, q(x) is fixed/known, i.e., q(∂D) = g(∂D), where g(∂D) describes
the coordinate of the wire loop along the third dimension. Then the optimal bubble results
from minimizing the functional
Z p
S{q(x)} = dx 1 + |∇x q(x)|2 (6.1)
D
Example 6.1.1. Show that Eq. (6.1) represents a general formula for the surface area of
the graph of a continuously differentiable function, q(x), where x = (x1 , x2 ) ∈ D ⊂ R2 and
D is a region in the (x1 , x2 ) plane with the smooth boundary.
CHAPTER 6. CALCULUS OF VARIATIONS 154
C
B
A
Figure 6.1: Illustration for construction of an infinitesimal surface element in the Exam-
ple (6.1.1).
Solution. It is sufficient to derive the differential version of Eq. (6.1), i.e. to show that area
(0) (0)
of a surface element, represented by q(x) around a point, x(0) = (x1 , x2 ) ∈ D, is described
by
p p
1 + |∇x q(x)|2 dx = 1 + (∂x1 q(x1 , x2 ))2 + (∂x2 q(x1 , x2 ))2 dx1 dx2 ,
p
where 1 + |∇x q(x)|2 is evaluated at x = x(0) . In this (infinitesimal) case, we can represent
the surface of the infinitesimal element by the plane
(0)
in R3 , where x3 = q(x(0) ), a = ∂x1 q|x=x(0) and b = ∂x2 q|x=x(0) . Specifically, we can describe
the plane in terms of the following three points in the three dimensional space (see Fig. (6.1)
for the illustration)
(0) (0) (0)
A = x1 , x2 , x3 ,
(0) (0) (0)
B = x1 + dx1 , x2 , x3 + adx1 ,
(0) (0) (0)
C = x1 , x2 + dx2 , x3 + bdx2 .
CHAPTER 6. CALCULUS OF VARIATIONS 155
Then area of the surface of the infinitesimal element becomes the absolute value of the cross
(vector) product of the following two infinitesimal three dimensional vectors:
p
|u × v| = |(B − A) × (C − A)| = |(−adx1 dx2 , −bdx1 dx2 , dx1 dx2 )| = dx1 dx2 1 + a2 + b2 ,
where we have used standard vector calculus rules for the cross/vector product in three di-
P
mensions, (y×z)i = j,k=1,2,3 εijk yj zk , with εijk being Levi-Civita (absolute anti-symmetric)
tensor in three dimensions.
A gray-scale image is described by the function, q(x) : [0, 1]2 → [0, 1], mapping a location, x
within the square box, [0, 1]2 ∈⊂ R2 , into a real number between 0 (white), and 1 (black).
However, the true image is often corrupted by a noise, and we only observe this noisy image.
The task of image restoration is to restore the true image from the noisy observation.
Total Variation (TV) restoration [3] is a method built on the conjecture that the true
image is reconstructed from the noisy image, f (x), by minimization of the following func-
tional Z
S{q(x)} = dx (q(x) − f (x))2 + λ|∇x q(x)| , (6.2)
U =[0,1]2
subject to the Neumann boundary condition, n · ∇x q(x) = 0 for all x ∈ δU , where n is the
(unit) vector normal to δU , which is the boundary of the domain U ).
where L (t, q(t), q̇(t)) is the system Lagrangian, and q̇(t) = dq(t)/dt is the momentum, under
the condition that the values of the coordinate at the initial and final moment of time are
fixed, q(t1 ) = q1 , q(t2 ) = q2 . An exemplary Hamiltonian dynamics is that of a (unit mass)
particle in a potential, V (q), then
q̇ 2
L (t, q(t), q̇(t)) = − V (q). (6.4)
2
CHAPTER 6. CALCULUS OF VARIATIONS 156
over functions, q(x), with the fixed value at the boundary, x ∈ ∂D : q(x) = g(x), where D
is bounded with the known value at all points of the boundary, and the Lagrangian L is a
given function
L : D ⊆ Rn × Rd × Rd×n → R,
of the three variables. It will also be convenient in deriving further relations to consider
the three variables in the argument of L and then denoting the respective derivatives, Lx ,
Lq , and L∇q . (Note that the variables are: x ∈ D ⊆ Rn , q ∈ Rd , and ∇x q ∈ Rd×n , We will
assume in the following that both L and g are smooth.
Theorem 6.2.1 (Necessary condition for optimality). Suppose that q(x) is the minimizer
of S, that is
S{q̃(x)} ≥ S{q(x)} & ∀x ∈ ∂D q̃(x) = q(x) ∀x ∈ D, ∀q̃(x) ∈ C 2 (D̄ = D ∪ ∂D) ,
Sketch of the proof: Consider the perturbation q(x) → q(x) + sδ(x) = q̃(x), where s ∈ R
and δ(x) sufficiently smooth and such that is does not change the boundary condition, i.e.
δ(x) = 0 (∀x ∈ ∂D). Then according to the assumption
Then, exchanging the orders of differentiation and integration, applying the differentiation
(chain) rules to the Lagrangian, and evaluating one of the resulting integrals by parts and
CHAPTER 6. CALCULUS OF VARIATIONS 157
Since the resulting integral should be equal to zero for any δ(x) one arrives at the desired
statement.
Remark. Solutions to the EL Eqs. (6.5), q(x), are stationary curves of the functional,
S{q(x)}, and could be minimizers, maximizers or saddle-points. It’s for this reason that
theorem (6.2.1) provides a necessary, but not sufficient, condition for minimizing S{q(x)}.
where q : R → R.
Solution. In one dimension the Euler-Lagrange equation simplifies to
d ∂L ∂L
− = 0.
dx ∂q 0 ∂q
Since we are not asked to solve the equations this is only a matter of constructing the
respective equation for each case.
2q 00 − exp(q) = 0.
q 0 − q 0 = 0.
This implies that any function q satisfies the Euler-Lagrange equation, which in turn
implies that the functional does not have any minima or maxima.
x2 q 00 + 2xq 0 = 0.
Example 6.2.3. Consider the shortest path version of the fastest path problem set in
Section 6.1.1, that is the case of g(x, y) = 1:
Za p
min dx 1 + (q 0 (x))2 dx .
{q(x)|x∈[0,a]}
0 q(0)=0, q(a)=b
where at the last step we accounted for the boundary condition. The shortest (optimal)
path connects initial and final points by a straight line.
Exercise 6.1. (a) Write the Euler-Lagrange equation for the general case of the fastest
path problem formulated in Section 6.1.1. (b) Find an example of, µ(x, y), resulting in the
b 2
quadratic optimal path, i.e. q(x) = a2
x . (Your solution for µ(x, y) should be independent
of a and b.)
Example 6.2.4. Let us derive the Euler-Lagrange condition for the Minimal Surface prob-
lem introduced in Section 6.1.2:
Z p
min dx 1 + |∇x q(x)|2 .
{q(x)}
D q(∂D)=g(∂D)
!" ∆%"
(&
!# ∆%# !&
!&'"
(&'"
where θ is the angle in the (q, x) space between the tangent to q(x) and the x-axis.
It is instructive to derive Eq. (6.8) bypassing the variational calculus, taking instead
perspective of standard optimization, that is optimizing over a finite number of continuous
variables. To make this link we need, first, to discretize the action, S{q(x)}:
s
X X q(xk ) − q(xk−1 ) 2
S{q(x)} ≈ Sk (· · · , qk , · · · ) = µk ∆sk = µk 1 + ∆
∆
k k
s
X qk − qk−1 2
= µk 1 + ∆
∆
k
where ∆ is the size of a step in x. i.e. ∆ = xk+1 − xk , ∀k, and ∆sk is the length of the
k-th segment of the discretized curve, illustrated in Fig. (6.2). Then, second, we look for
extrema of Sk over qk , i.e. require that ∀ k : ∂qk Sk = 0. The result is the discretized
version of the Euler-Lagrange Eqs. (6.8):
µ (q − qk ) µk (qk − qk−1 )
∀k : rk+1 k+1 =r
qk+1 −qk 2 qk −qk−1 2
1+ ∆ 1+ ∆
The TV functional (6.2) is not differentiable at ∇x q(x) = 0, which creates difficulty for
variations. One way to bypass the problem is to smooth the Lagrangian, considering
Z
(q(x) − f (x))2 p
Sε {q} = dx 2 2
+ λ ε + (∇x q(x)) , (6.9)
2
[0,1]2
where ε is small and positive. The Euler-Lagrange equations for the smoothed action (6.9)
are
∇x q
∀x ∈ [0, 1]2 : q − λ∇x · p = f, (6.10)
ε + (∇x q(x))2
2
We will start this part with a disclaimer. The discussion below of the numerical procedure
for solving Eq. (6.10) is not fully comprehensive. We add it here for completeness, delegating
details to Math 589, and also aiming to emphasize connections between numerical PDE
analysis and optimization algorithms.
A standard numerical scheme for solving Eq. (6.10) originating from optimization of
the action is gradient descent. It is useful to think about the gradient descent algorithm
by introducing an extra “computational time” dimension, which will be discrete in imple-
mentation but can also be thought of (for the purpose of analysis and gaining intuition) as
∗
This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute
midterm and finals.
CHAPTER 6. CALCULUS OF VARIATIONS 162
where a is the damping coefficient. Acting by analogy with the diffusive case, let us make an
empirical estimate for the balanced choice of the spatial discertization step, ∆x, temporal
discretization step, ∆t, and of the damping coefficient. Linearising the nonlinear wave
Eq. (6.12) and then requiring that the ∂t2 (temporal oscillation) term, the a∂t (damping)
term and the (λ/ε)∇2x (diffusion) term are balanced one arrives at the following estimate
∆t ε(∆x)2
(∆t)2 ∼ ∼ ,
a λ
which results in a much less demanding linear scaling, ∆t ∼ ∆x.
This transition from the overdamped relaxation to balancing damping with oscillations
corresponds to the Polyak’s heavy-ball method [5] and Nesterov’s accelerated gradient de-
scent method [6], which are now used extensively (often with addition of a stochastic com-
ponent) in training of the modern Neural Networks. Both methods will be discussed later
in the course, and even more in the companion Math 575 course. Notice also that an addi-
tional material on modern, continuous-time interpretation of the acceleration method and
other related algorithms can be found in [7, 8]. See also Sections 2.3 and 3.6 of [4]
We will come back to the image-restoration problem one more time in Section 6.7.2
where we discuss an alternative, primal-dual algorithm.
CHAPTER 6. CALCULUS OF VARIATIONS 163
d
In section 6.2, we showed that if a function q(x) satisfies the EL equations, dx Lq
0 − Lq = 0,
then q(x) is a stationary curve of S and therefore a candidate for a minimizer of S. (See
also discussion of the EL Theorem 6.2.1 and the following remark.)
Notation. In the following, we abuse notation a little and use the symbol S to denote both
the action-functional Z x1
S{q(x)} = L(x, q(x), q̇(x)) dx
x0
and also the action-function
Z x1
S(A0 , A1 ) = S(x0 , q0 , x1 , q1 ) = min L(x, q(x), q̇(x)) dx.
q(x)∈F x0
Proof. Here (and as custom in this course) we will only sketch the proof. Let us focus,
without loss of generality, on the part of the theorem concerning derivatives with respect
to x1 and q1 , i.e. the final end-point of the critical path.
Let us first keep the final independent variable fixed at x1 but move the final position
by dq, as shown in Fig. 6.5 Left. The trajectory q(x) will vary by δq(x), where δq(x0 ) = 0
and δq(x1 ) = dq.
At each time x, we can estimate the value of L(x, q + δq, q 0 + δq 0 ) from the first-order
Taylor expansion
q(x)
(x1 , q1 + dq)
(x1 , q1 )
(x0 , q0 )
Figure 6.3: Critical curves from (x0 , q0 ) to (x1 , q1 ) (green) and to (x1 , q1 + dq) (purple).
(x1 , q1 )
(x0 , q0 )
Figure 6.4: Critical curves from (x0 , q0 ) to (x1 , q1 ) (green) and to (x1 + dx, q1 + dq) (purple)
dq
where dq = dx x dx.
1
q(x)
(x0 , q0 )
Figure 6.5: Critical curves from (x0 , q0 ) to (x1 , q1 ) (green) and to (x1 + dx, q1 ) (purple).
CHAPTER 6. CALCULUS OF VARIATIONS 165
The relation δq 0 = dδq/dx, together with the Euler-Lagrange Eqs. (6.21), allows us to
rewrite Eq. (6.18) as
Zx1 Zx1
d d x1
dS = dx Lq0 δq + δq Lq0 = d Lq0 δq = Lq0 δq x0
= Lq0 x1
dq. (6.19)
dx dx
x0 x0
Therefore, as we kept the final independent variable fixed, dS = ∂q1 Sdq, and one arrives at
the desired statement
∂S
= Lq0 x1
. (6.20)
∂q1
Now consider variation of the action extended from A1 = (x1 , q1 ) to (x1 + dx, q1 + q 0 (x1 )dx):
∂S ∂S 0 ∂S 0 ∂S 0
dS = Ldx = dx + q (x1 )dx = dx + Lq0 x1 q (x1 )dx = + q Lq 0 dx,
∂x1 ∂q1 ∂x1 ∂x1 x1
Exercise 6.2. Find all the critical function(s), q(x), of the functional
Za
S {q(x)} = q 2 + 2qq 0 + (q 0 )2 + 1 dx
0
where q(0) = 0 and at some positive but not fixed, a (i.e., such, a, that is by itself subject
to optimization), q(a) = 1.
In the case of the classical mechanics, introduced in Section 6.1.4, the Euler-Lagrange
Eqs. (6.5) are
d
Lq̇ − Lq = 0, (6.21)
dt
where L(t, q(t), q̇(t)) : R × Rd × Rd → R. Let us consider the case when the Lagrangian does
not depend explicitly on time. (It may still depend on time implicitly via q(t) and q̇(t),
i.e. L(q(t), q̇(t)).) In this case, and quite remarkably, the Euler-Lagrange equation can be
rewritten as a conservation law. Indeed,
d d d
(q̇ · Lq̇ − L) = q̈ · Lq̇ + q̇ · Lq̇ − Lq · q̇ − Lq̇ · q̈ = q̇ · Lq̇ − Lq = 0,
dt dt dt
• translational invariance, hs (q(t)) = q(t) + se, where e is the unit vector in Rn and s
is the distance of the transformation;
• rotational invariance, hs (q(t)) = Re (s)q(t), around the line through the origin defined
by the unit vector e;
Theorem 6.6.2 (Noether’s theorem (1915)). If the Lagrangian L is invariant under the
action of a one-parameter family of transformations, hs (q(t)), then the quantity,
d
I(q(t), q̇(t)) := Lq̇ ·
(hs (q(t)))s=0 , (6.22)
ds
is constant along any solution of the Euler-Lagrange Eq. (6.21). Such a constant quantity
is called an integral of motion.
(a) ∂t1 S(A0 ; A1 ) = (L − q̇Lq̇ )t=t1 ∂t0 S(A0 ; A1 ) = −(L − q̇Lq̇ )t=t0 ,
(b) ∂q1 S(A0 ; A1 ) = Lq̇ |t=t1 ∂q0 S(A0 ; A1 ) = −Lq̇ |t=t0 .
The assumptions of Noether’s theorem require that the Lagrangian is invariant under the
transformation q(t) → hs (q(t), which gives
Differentiating both sides of Eq. 6.23 with respect to s, applying Theorem 6.5.1, and eval-
uating the result at s = 0, leads us to
d d
0 = ∂q0 S · hs (q(t0 )) + ∂q1 S · hs (q(t1 ))
ds s=0 ds s=0
d d
= −Lq̇ (q(t0 ), q̇(t0 )) · hs (q(t0 )) s=0
+ Lq̇ (q(t1 ), q̇(t1 )) · hs (q(t1 )) s=0
ds ds
CHAPTER 6. CALCULUS OF VARIATIONS 168
Since t1 can be chosen arbitrarily, it proves that Eq. (6.22) is constant along the solution
of the Euler-Lagrange Eq. (6.21).
Exercise 6.3. For q(t) ∈ R3 , where t ∈ [t0 , t1 ], and each of the following families of
transformations, find the explicit form of the conserved quantity given by the Noether’s
theorem (assuming that the respective invariance of the Lagrangian holds)
Let us utilize the specific structure of the classical mechanics Lagrangian which is split,
according to Eq. (6.4), into a difference of the kinetic energy, q̇ 2 /2, and the potential energy,
V (q). Making the obvious observation, that the minimum of the functional
Z
dt 21 (q̇ − p)2 ,
over {p(t)} is achieved at ∀t : q̇ = p, and then stating the kinetic term of the classical
mechanics action, that is the first term in Eq. (6.4), in terms of an auxiliary optimization
Z Z
q̇ 2 p2
dt = max dt pq̇ − , (6.24)
2 {p(t)} 2
and substituting the result in Eqs. (6.3,6.4), one arrives at the following, alternative, vari-
ational formulation of the classical mechanics
Z
min max dt (pq̇ − H(q; p)) (6.25)
{q(t)} {p(t)}
p2
H(q; p) := + V (q), (6.26)
2
where p and H are defined as the momentum and Hamiltonian of the system. Turning
the second (Hamiltonian) principle of the classical mechanics into the equations (which,
like EL equations, are only sufficient conditions of optimality) one arrives at the so-called
Hamiltonian equations
∂H(q; p) ∂H(q; p)
q̇ = , ṗ = − . (6.27)
∂p ∂q
CHAPTER 6. CALCULUS OF VARIATIONS 169
Example 6.6.3. (a) [Conservation of Energy] Show that in the case of the time independent
Hamiltonian (i.e. in the case of H(q; p) considered so far), H, is also the energy which is
conserved along the solution of the Hamiltonian equations (6.27).
(b) [Conservation of Momentum] Show that if the Lagrangian does not depend explicitly on
one of the coordinates, say q (1) where q = (q (1) , · · · ), then the corresponding momentum,
∂L/∂ q̇ (1) , is constant along the physical trajectory, given by the solutions of either EL or
Hamiltonian equations.
Solution. (a) Full time derivative of the Hamiltonian is
dH ∂H ∂H
= q̇ + ṗ.
dt ∂q ∂p
The Hamiltonian equations are
∂H ∂H
q̇ = , ṗ = − .
∂p ∂q
Combining the two expressions
dH ∂H ∂H
= q̇ + ṗ = −ṗq̇ + q̇ ṗ = 0,
dt ∂q ∂p
we arrive at the desired statement.
(b) Suppose that L does not depend on q (1) , then ∂L/∂q (1) = 0. Then, the EL equations
d ∂L ∂L d ∂L
0= − (1) = ,
dt ∂ q̇ (1) ∂q dt ∂ q̇ (1)
Notice, that the Hamiltonian system of Eqs. (6.27) becomes even more elegant in the
vector form
! !
q 0 1
ż = −J∇z H(z) = −∇z JH(z), z := , J := , (6.28)
p −1 0
where the 2 × 2 matrix represents two-dimensional rotation (clock-wise in the (q, p)-space).
Let us work a bit more with the critical/optimal trajectory/path, {q(t); t ∈ [t0 = 0, t1 ]},
solving the Euler-Lagrange Eqs. (6.21), choosing the initial time to be equal to 0, fixing
the initial position, q(0) = q0 , then analyzing dependence of the action on the final time,
t1 , and position, q1 . That is we continue the thread of the Theorem 6.5.1 and consider the
action-function as a function of A1 = (t1 , q1 ), i.e., of the final position of the critical path.
CHAPTER 6. CALCULUS OF VARIATIONS 170
Indeed, let us re-derive in a bit different, but equivalent, form the main results of the
Theorem 6.5.1. Assuming that the action-function is a sufficiently smooth function of the
arguments, t1 and q1 , one would like to introduce (and interpret) derivatives of action over t1
and q1 , and then check if the derivatives are related to each other. Consider, first, derivative
of the action-function over t1 :
Zt1 Zt1
St1 := ∂t1 S(t1 ; q1 ) = ∂t dtL (q(t), q̇(t)) = L(t1 , q1 ) + dt (Lq ∂t1 q(t) − Lq̇ ∂t1 q̇(t))
0 0
Zt1
d
= L(t1 , q1 ) + dt∂t1 q(t) Lq + Lq̇ − Lq̇ ∂t q(t)|t01 = (L − Lq̇ q̇)t=t1 , (6.29)
dt
0
where we have made an integration by parts and used that, ∂t q(t)|t=0 = 0, ∂t1 q(t)|t=t1 =
d
q̇(t1 ), utilized the Euler-Lagrange equations Eq. (6.5), ∀t ∈ [0, t1 ] : Lq − dt Lq̇ = 0.
Next, let us evaluate the derivative of the action-function over the coordinate at the
final position, q1 :
Zt1 Zt1
Sq1 := ∂q S(t1 ; q1 ) = ∂q1 dtL (q(t), q̇(t)) = dt (Lq ∂q1 q(t) + Lq̇ ∂q1 q̇(t))
0 0
Zt1
d
= dt∂q1 q(t) Lq − Lq̇ + Lq̇ ∂q1 q(t)|t01 = Lq̇ |t=t1 . (6.30)
dt
0
In the case of the classical mechanics, when the Lagrangian is factorized into a difference
of the kinetic energy and the potential energy terms, the object on right hand sides of
Eq. (6.29) turns into the minus Hamiltonian, defined above in Eq. (6.26), and the right
hand side of Eq. (6.30) becomes the momentum, then p = q̇. In the case of a generic (not
factorizable) Lagrangian, one can use the right hand side of and Eq. (6.29) and Eq. (6.30)
as the definitions of the minus Hamiltonian of the system and of the system momentum,
respectively,
where the Hamiltonian is considered as a function of the current time, t, coordinate, q(t),
and momentum, p(t).
Combining Eqs. (6.29,6.30,6.31), that is (a) and (b) of the Theorem 6.5.1 and the def-
initions of the momentum and the Hamiltonian, one arrives at the Hamilton-Jacobi (HJ)
equation
which provides a nonlinear first order PDE representation of the classical mechanics.
It is important to stress that if one knows the initial (at t = 0) values of the action-
function, S, and of its derivative, ∂q S, and also of the explicit expression of the Hamiltonian
in terms of the time, coordinate and momentum at all moments of time, Eq. (6.32) combined
with the initial conditions represents a Cauchy initial value problem, therefore resulting in
solving of the HJ equation unambiguously. This is a rather remarkable and strong state-
ment with many important consequences and generalizations. The statement is remarkable
because because one gets the unique solution of the optimization problem in spite of the fact
that solution of the EL equation is not necessarily unique (remember it is a sufficient but
not necessary condition for the minimum action, i.e. there may be multiple solutions of the
EL equations). Consequences of the HJ equations will be seen later when we will discuss its
generalization to the case of theoptimal control, called the Bellman-Hamilton-Jacobi (BHJ)
equation. HJ equation, discussed here, and BHJ discussed in Section are linked ultimately
to the concept of the Dynamic Programming (DP), also discussed later in the course.
Let us re-emphasize, that the schematic derivation of the HJ-equation (just provided)
has revealed the meaning of the action derivative over (final) time and over the (final)
coordinate. We have learned that, ∂t1 S, is nothing but minus Hamiltonian, while ∂q1 S, is
simply momenta, p1 = p(t1 ) (also equal to velocity as in these notes we follow the convention
of unit mass).
Let us provide an alternative (and as simple) derivation of the HJ-equation, based
primarily on the differentials. Given transformation from representation of the action as a
functional, of {q(t); t ∈ [0, t1 ]}, to its representation as a function of t1 and q1 , S{q(t)} →
S(t1 ; q1 ), one rewrites Eqs. (6.3,6.4)
Z Z
S= p dq − H dt,
Example 6.6.4. Find and solve the HJ equation for a free particle.
In this case
p2
H= .
2
Therefore, the HJ equation becomes
(∂q S)2
= −∂t S.
2
√
Look for solution of the HJ equation in the form S = f (q)−Et. One derives f (q) = 2Eq−c,
and therefore the general solution of the HJ equation becomes
√
S(t; q) = 2Eq − Et − c.
Exercise 6.4. Find and solve the HJ equation for a two dimensional oscillator (unit mass
and unit elasticity) in spherical coordinates, i.e. for the Hamiltonian system with the action
functional Z
1 2 1
S{r(t), ϕ(t)} = dt ṙ + r2 ϕ̇2 − r2 .
2 2
We conclude this very brief discussion of the classical/Hamiltonian mechanics by men-
tioning that in addition to its relevance to the concepts of Optimal Control and Dynamic
Programming (to be discussed in Section 7), the HJ-equations are also most useful in estab-
lishing (and using in practical setting) the transformation from the pair of the coordinate-
momentum variables (q, p) to the so-called canonical variables for which paths of motion
reduce to single points, i.e. variables for which the (re-defined) Hamiltonian is simply zero.
∗
6.7 Legendre-Fenchel Transform
This Section ∗ is devoted to the Legendre-Fenchel (LF) transform, which was in fact used
in its relatively simple but functional (infinite dimensional) form in Eq. (6.24). Given LF
importance in variational calculus (already mentioned) and finite dimensional optimization
(yet to be discussed), we have decided to allocate a special section for this important
transformation and its consequences. We will also mention in the end of this Section two
applications of the LF transform: (a) to solving the image restoration problem by a primal-
dual algorithm, and (b) to estimating integrals with the Laplace method.
Often LF transform also refers to as “dual” transform. Then f ∗ (k) is dual to f (x).
Example 6.7.2. Find the LF transform of the quadratic function, f (x) = x · A · x/2 − b · x,
where A is symmetric positive definite matrix, A 0.
Solution: The following sequence of transformations show that the LF transform of the
positively define quadratic function is another positively defined quadratic function
1
sup x·k− x·A·x+b·x
x 2
1 −1 −1 1 −1
= sup − (x − (k + b) · A ) · A · (x − A (k + b)) + (b + k) · A · (b + k)
x 2 2
1
= (b + k) · A−1 · (b + k), (6.34)
2
where the maximum is achieved at x∗ = A−1 (k + b).
The combination of these two notions (the Legendre-Fenchel transform and the convex-
ity) results in the following bold statements (which we only state here, delegating proofs to
Math 527).
Once the formal definitions and statements are made, let us consider the one dimensional
case, n = 1, to develop intuition about the LF and convexity. In one dimension, the LF
transform has a very clear geometrical interpretation (see e.g. [10]) stated in terms of the
supporting lines.
Notice that as defined above supporting lines are defined locally, i.e. not globally for all
x ∈ R, but locally for a particular/fixed, x.
CHAPTER 6. CALCULUS OF VARIATIONS 174
!(#)
% & ' ( #
Example 6.7.6. Find f ∗ (k) and the supporting line(s) for f (x) = ax + b.
Solution: Notice that we cannot draw any straight line which do not cross f (x) unless they
have the same slope. Therefore, f (x) is the supporting line for itself. We also observe that
the LF transform of the straight line is finite only at a single point k = a, corresponding to
the slope of the line, i.e. (
−b, k=a
f ∗ (k) =
∞, otherwise.
Example 6.7.7. Consider the quadratic, f (x) = ax2 /2−bx. Find f ∗ (k), supporting line(s)
for f (x), and supporting line(s) for f ∗ (k).
Solution: The solution, given by one dimensional version of Eq. (6.34), is f ∗ (k) = (b +
k)2 /(2a), where the maximum (in the LF transform) is achieved at x∗ = (b + k)/a. We
observe that f ∗ (k) is well defined (finite) for all k ∈ R. Denote by fx (y) the supporting
line of, f (x), at x. In this case of a nice (smooth and convex) f (x), one derives, fx (x0 ) =
f (x) + f 0 (x)(x0 − x) = ax2 /2 − bx + (ax − b)(x0 − x), representing the Taylor series expansion
of, f (x), around, x = y, truncated at the first (linear) term. Similarly, fk∗ (k 0 ) = f ∗ (k) +
(f ∗ )0 (k)(k 0 − k) = (b + k)2 /(2a) + (b + k)(k 0 − k)/a.
What we see in this example generalizes into the following statements (given without
proof):
CHAPTER 6. CALCULUS OF VARIATIONS 175
Proposition 6.7.8. Assume that f (x) admits a supporting line at x and f 0 (x) exists at x,
then the slope of the supporting line at x should be f 0 (x), i.e. for a differentiable function
the supporting line is always a tangient line.
Theorem 6.7.9. If f (x) admits a supporting line at x with slope k, then f ∗ (k) admits
supporting line at k with the slope x.
Example 6.7.10. Draw supporting lines for the example of a smooth non-convex function
shown in Fig. (6.6).
Solution: Sketching supporting lines for this smooth, non-convex function, bounded from below and having two local minima, we arrive at the following observations:
• The point a admits a supporting line. The supporting line touches f at point a and
the touching line is beneath the graph of f (x), hence the term supporting is justified.
• The supporting line at a is strictly supporting because it touches the graph of f only
at x = a.
• The point b does not admit a supporting line, because any line passing through (b, f(b)) crosses the graph of f(x) at some other point.
• The point c admits a supporting line which is supporting, but not strictly supporting,
as it touches f (x) at another point, d. In this case c and d share the same supporting
line.
The supporting line analysis yields a number of other useful statements listed below
(without proof and only with limited discussion):
The last statement tells us, in particular, that the double transform is not always involutive: f∗∗ is always convex even for a non-convex f, in which case f ≠ f∗∗. The following two statements are particularly useful for visualization of f∗∗(x).
Theorem 6.7.17. f ∗∗ (x) is the largest convex function satisfying f ∗∗ (x) ≤ f (x).
Because of the last statement we call f ∗∗ (x) the convex envelope of f (x).
Figure 6.7: A function with a cusp singularity at x = x_c (left) and its LF transform (right).
Below we continue to illustrate the notion of supporting lines, as well as convexity and
duality, on illustrative examples.
• (Differentiable part of f(x):) Each point (x, f(x)) on the differentiable part of the curve (branches l and r in Fig. (6.7a)) admits a strict supporting line with slope f′(x) = k. These points map under the LF transformation into points (k, f∗(k)) admitting supporting lines of slopes (f∗)′(k) = x, shown as the l′ and r′ branches in Fig. (6.7b). Overall, the left (l) and right (r) branches in Fig. (6.7a) transform into the left (l′) and right (r′) branches in Fig. (6.7b).
• (The cusp of f(x) at x = x_c:) The nondifferentiable point x_c admits not one but infinitely many supporting lines, with slopes in the range [k₁, k₂]. This means that f∗(k) with k ∈ [k₁, k₂] must admit a supporting line with the constant slope x_c, shown as branch (c′) in Fig. (6.7b), i.e. the (c′) branch is linear (affine).
Figure 6.8: (a) An exemplary nonconvex function, f(x); (b) its LF transform, f∗(k); (c) its double LF transform, f∗∗(x).
Example 6.7.19. Show schematically f ∗ (k) and f ∗∗ (x) for f (x) shown in Fig. (6.6).
Solution: We split the curve of the function into three branches, (l)eft, (c)enter and (r)ight, and then build the LF and double-LF transforms separately for each of the branches, relying as before on the construction of supporting lines. The result is shown in Fig. (6.8) and the details are as follows.
• Branch (l) and branch (r) are strictly convex, thus admitting strict supporting lines. The LF transforms of the two branches are smooth. The double LF transform returns exactly the same function we started from.
• Branch (c) is not convex, and as a result none of the points within this branch, extending from x₁ to x₂, admits a supporting line. This means that the points of the branch are not represented in f∗(k): we see it in Fig. (6.8b) as a collapse of the branch, under the LF transform, to a point. The supporting line with slope k_c connects the end-points of the branch. This supporting line is not strict, and it translates in f∗(k) into a single point (k_c, f∗(k_c)), at which f∗(k) is not differentiable. Notice that f∗(k) is convex, as is f∗∗(x). The second LF transformation extends (k_c, f∗(k_c)) back into a straight line with slope k_c (shown in red in Fig. (6.8c)). This straight line may be thought of as a convex extrapolation, the envelope, of f(x) over its non-convex branch.
Example 6.7.20. (a) Find the supporting lines and build the LF transform of
$$
f(x) = \begin{cases} p_1 x + b_1, & x \le x_*, \\ p_2 x + b_2, & x \ge x_*, \end{cases}
$$
where x∗ = (b₂ − b₁)/(p₁ − p₂), and b₂ > b₁, p₂ > p₁; and find the respective f∗∗(x).
(b) Suggest an example of a convex function defined on a bounded domain with diverging
(infinite) slopes at the boundary. Show schematically f ∗ (k) and f ∗∗ (x) for the function.
Now we are ready to return to the image restoration problem set up in Section 6.1.3. Our task becomes to by-pass the ε-smoothing discussed in Section 6.4.2 by using the LF transform. This neat theoretical trick will then result in a computationally advantageous primal-dual algorithm. We will use Theorem 6.7.4 to accomplish this transformation-to-dual goal.
In fact, let us consider a more general set up than one discussed in Section 6.1.3. Assume
that f : Rn → R is convex and consider
$$
\min_{\{q(x)\}} \int_U dx\, \bigl( g(x, q(x)) + f(\nabla_x q(x)) \bigr) \Big|_{n^T\cdot\nabla_x q=0,\ \forall x\in\partial U}, \qquad (6.36)
$$
where q : U → R and as before n is the normal vector to ∂U . Let us now restate the
formulation in terms of the Legendre-Fenchel transform of f , thus utilizing Theorem 6.7.4:
$$
\min_{\{q(x)\}} \max_{\{p(x)\}} \int_U dx\, \bigl( g(x, q(x)) + p(x)\cdot\nabla_x q(x) - f^*(p(x)) \bigr) \Big|_{n^T\cdot\nabla_x q = n^T\cdot p = 0,\ \forall x\in\partial U}, \qquad (6.37)
$$
where p : U → Rⁿ and we also require that the vector p has no projection onto the boundary normal, i.e., like ∇x q, it is tangential to ∂U. q(x) is called the primal variable and p(x) the dual variable. We “dualize” only the second term in the integrand on the right hand side of Eq. (6.36), which is non-smooth, leaving the first (smooth) term unchanged.
The optimization problem (6.37) is also called saddle-point formulation, due to its min-max
structure.
Given the boundary conditions, we can apply integration by parts to the term in the
middle in Eq. (6.37) then arriving at
$$
\min_{\{q(x)\}} \max_{\{p(x)\}} \int_U dx\, \bigl( g(x, q(x)) - q(x)\,\nabla\cdot p(x) - f^*(p(x)) \bigr) \Big|_{n^T\cdot p = 0,\ \forall x\in\partial U}. \qquad (6.38)
$$
We can attempt to solve Eq. (6.37) or Eq. (6.38) by the primal-dual method which con-
sists in alternating minimization and maximization steps in either of the two optimizations.
Implementations may be, for example, via alternating gradient descent (for minimization)
and gradient ascent (for maximization).
However in the original problem we are trying to solve – the image restoration problem
defined in Section 6.1.3 – we can carry over the primal-dual min-max formulation further
by exploiting the structure of the argument (the effective action), evaluating the minimization over {q(x)} explicitly and thus arriving at the dual formulation. This is our plan for the remainder of the section.
The case of the Total Variation image restoration corresponds to setting
$$
g(x, q) = \frac{(q - f(x))^2}{2\lambda}, \qquad f(w = \nabla_x q(x)) = |w|,
$$
in Eq. (6.36) thus arriving at the following optimization
$$
\min_{q} \int_U dx \left( \frac{(q - f)^2}{2\lambda} + |\nabla_x q| \right) \Big|_{n^T\cdot\nabla_x q = 0,\ \forall x\in\partial U}. \qquad (6.39)
$$
Notice that f(w) = |w| is convex and thus, according to the high-dimensional generalization of what we have learned about the LF transform, f∗∗(w) = f(w). The LF dual of f(w) is easily computed:
$$
f^*(p) = \sup_{w\in\mathbb{R}^n} \bigl( p\cdot w - |w| \bigr) = \begin{cases} 0, & |p| \le 1, \\ \infty, & |p| > 1. \end{cases} \qquad (6.40)
$$
The convexity of f(w) = |w| then allows us, according to Theorem 6.7.4, to “invert” Eq. (6.40):
$$
f(w) = |w| = \sup_{p} \bigl( p\cdot w - f^*(p) \bigr) = \max_{|p|\le 1}\, p\cdot w. \qquad (6.41)
$$
Substituting Eq. (6.41) into Eq. (6.39) and integrating by parts (using the boundary conditions), we arrive at
$$
\min_q \max_{|p|\le 1} \int_U dx \left( \frac{(q-f)^2}{2\lambda} - q\,\nabla_x\cdot p \right) \Big|_{n^T\cdot p = 0,\ \forall x\in\partial U}. \qquad (6.42)
$$
Remarkably we can swap min and max in Eq. (6.42). This is guaranteed by the strong
convexity theorem (see Appendix)
$$
\max_{|p|\le 1} \min_q \int_U dx \left( \frac{(q-f)^2}{2\lambda} - q\,\nabla_x\cdot p \right) \Big|_{n^T\cdot p = 0,\ \forall x\in\partial U}. \qquad (6.43)
$$
This trick is very useful because the “ultra-local” optimization over q can be done explicitly. One finds that the minimum of the integrand in Eq. (6.43), which is quadratic in q, is achieved at
q = f + λ∇x · p, (6.44)
and then substituting the optimal value back in the objective we arrive at
$$
\max_{|p|\le 1} \int_U dx \left( f\,\nabla_x\cdot p - \frac{\lambda}{2}(\nabla_x\cdot p)^2 \right) \Big|_{n^T\cdot p = 0,\ \forall x\in\partial U}, \qquad (6.45)
$$
which is thus the optimization dual to the primal optimization (6.39). If we were to ignore the constraint in Eq. (6.45), the objective would be maximal at ∇·p = f/λ. To handle the constraint, [11] suggested the so-called projected gradient ascent algorithm
$$
\forall x:\quad p^{k+1}(x) = \frac{p^k + \tau\,\nabla_x(\nabla_x\cdot p^k - f/\lambda)}{1 + \tau\,|\nabla_x(\nabla_x\cdot p^k - f/\lambda)|}, \qquad (6.46)
$$
initiated with p⁰ satisfying the constraint, |p⁰| < 1, iterating in time with step τ > 0, and taking an appropriate spatial discretization of the ∇x· operation on a grid with spacing ∆x. The denominator in the ratio on the right hand side of Eq. (6.46) guarantees that the constraint, |pᵏ| < 1, is enforced along the iterations. When the iterations converge and the optimal p is found, the optimal pattern, u, is reconstructed from Eq. (6.44).
" ! + " ∗ % = %!
'()*+ = %
"(!)
%! " ∗ (%) !
Here we inject some additional geometric meaning into the LF transform, following [12]. We continue to draw our intuition and inspiration from a one-dimensional example.
First, notice that if the function f : R → R is strictly convex then f′(x) increases, monotonically and strictly, with x. This means, in particular, that the relation between the original variable, x, and the respective optimal dual variable, k, is one-to-one, providing an additional explanation for the self-inverse feature of the LF transform in the case of strict convexity (we know that it also holds in the merely convex case).
Second, consider the relation, illustrated in Fig. (6.9), between the original function, f(x), at a point x admitting a strict supporting line, and the respective LF transform, f∗(k), evaluated at k = f′(x), i.e. f∗(f′(x)):
$$
f(x) + f^*(k) = kx, \qquad k = f'(x). \qquad (6.47)
$$
As seen clearly in the figure, the LF relation explains f∗(k) as the complement of f(x) within kx (the latter term is associated with the supporting line). Notice the remarkable symmetry of Eq. (6.47) under the x ↔ k, f ↔ f∗ transformation, keeping in mind that the variables x and k are not independent: one of the two is selected to track the change while the other (conjugated) variable depends on the first one, according to k = f′(x) or x = (f∗)′(k).
The LF transform is also the key to understanding the relation between the Hamiltonian and the Lagrangian in classical mechanics. Let us illustrate it on a “no q” example, i.e. on the case when the Hamiltonian, generally dependent on t, q and p, depends only on p. Specifically, consider the example of a free relativistic particle, where H(p) = √(p² + m²), m is the particle mass, and the speed of light is set to unity, c = 1. In this case q̇ = ∂ₚH = dH/dp = p/√(p² + m²), according to the Hamilton equation, and the Lagrangian, which generally depends on q̇ and q but here depends only on q̇, is L(q̇) = pq̇ − H(p). This relation, rewritten in the symmetric form,
form,
pq̇ = L(q̇) + H(p),
should be compared with the LF relation Eq. (6.47). We observe that p and q̇, like x and k, are conjugated variables, while L should be viewed as the LF transform of the Hamiltonian, L = H∗, or vice versa, H = L∗.
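For completeness, the LF transform can be carried out explicitly in this example (a short derivation consistent with the relations above, not in the original text): eliminating p via the Hamilton equation recovers the familiar relativistic Lagrangian,
$$
\dot q = \frac{p}{\sqrt{p^2+m^2}} \;\Longrightarrow\; p = \frac{m\dot q}{\sqrt{1-\dot q^2}}, \qquad
L(\dot q) = p\dot q - H(p) = \frac{m\dot q^2}{\sqrt{1-\dot q^2}} - \frac{m}{\sqrt{1-\dot q^2}} = -m\sqrt{1-\dot q^2}.
$$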
See [12] for further discussion of other examples of the LF transform in physics, for example in statistical thermodynamics (where inverse temperature and energy are conjugated variables, while free energy is the LF dual of the entropy, and vice versa).
6.8 Second Variation ∗
Finding extrema of a function involves more than finding its critical points∗. A critical point may be a minimum, a maximum or a saddle-point. To determine the critical-point type one needs to compute the Hessian matrix of the function. A similar consideration applies to functionals when we want to characterize solutions of the Euler-Lagrange equations.
We naturally start the discussion of the second variation from the finite dimensional
case. Let f : U ⊂ Rn → R be a C2 function (with existing first and second derivatives).
The Hessian matrix of f at x is a symmetric bi-linear form (on the tangent vector space Rⁿₓ to Rⁿ at x) defined by
$$
\forall\, \epsilon, \eta \in \mathbb{R}^n_x:\quad \mathrm{Hess}_x(\epsilon, \eta) = \left.\frac{\partial^2 f(x + s\epsilon + w\eta)}{\partial s\, \partial w}\right|_{s=w=0}. \qquad (6.48)
$$
If the Hessian is positive-definite, i.e. if the respective matrix of second-derivatives has only
positive eigenvalues, then the critical point is the minimum.
Let us generalize the notion of the Hessian to the action, S = ∫ dt L(q, q̇), and the Lagrangian, L(q, q̇), where q(t) : R → Rⁿ is a C² function. The direct generalization of Eq. (6.48)
∗ This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute to the midterm and finals.
becomes
$$
\begin{aligned}
\mathrm{Hess}_q\{\epsilon(t), \eta(t)\} &= \left.\frac{\partial^2 S\{q(t)+s\epsilon(t)+w\eta(t)\}}{\partial s\,\partial w}\right|_{s=w=0} \qquad (6.49)\\
&= \frac{\partial}{\partial s}\left.\left(\left.\frac{\partial S\{q(t)+s\epsilon(t)+w\eta(t)\}}{\partial w}\right|_{w=0}\right)\right|_{s=0}\\
&= \left.\frac{\partial}{\partial s}\int dt \sum_{i=1}^n \left( \frac{\partial L(q+s\epsilon;\, \dot q+s\dot\epsilon)}{\partial q^i} - \frac{d}{dt}\frac{\partial L(q+s\epsilon;\, \dot q+s\dot\epsilon)}{\partial \dot q^i} \right)\eta^i\,\right|_{s=0}\\
&= \int dt \sum_{i,j=1}^n \left( \frac{\partial^2 L}{\partial q^j\partial q^i}\epsilon^j + \frac{\partial^2 L}{\partial \dot q^j\partial q^i}\dot\epsilon^j - \frac{d}{dt}\left( \frac{\partial^2 L}{\partial q^j\partial \dot q^i}\epsilon^j + \frac{\partial^2 L}{\partial \dot q^j\partial \dot q^i}\dot\epsilon^j \right) \right)\eta^i\\
&:= \int dt \sum_{i,j=1}^n J_{ij}\,\epsilon^j \eta^i,
\end{aligned}
$$
where J_{ij} is a matrix of differential operators called the Jacobi operator. Determining whether the bilinear form is positive definite is usually hard, but in some simple cases the question can be resolved.
Consider q : R → R, q ∈ C², and the quadratic action
$$
S\{q(t)\} = \int_0^T dt\, \bigl( \dot q^2 - q^2 \bigr), \qquad (6.50)
$$
with zero boundary conditions, q(0) = q(T) = 0. To get some intuition about what the landscape of the action (6.50) looks like, let us consider a subclass of functions, for example
oscillatory functions consisting of only one harmonic,
$$
\bar q(t) = a \sin\frac{n\pi t}{T}, \qquad (6.51)
$$
where a ∈ R (any real) and n ∈ Z \ 0 (any nonzero integer). Substituting Eq. (6.51) into Eq. (6.50) one derives
$$
S\{\bar q(t)\} = \frac{n^2\pi^2 a^2}{T^2}\int_0^T dt\, \cos^2\frac{n\pi t}{T} - a^2 \int_0^T dt\, \sin^2\frac{n\pi t}{T} = \frac{T a^2}{2}\left( \frac{n^2\pi^2}{T^2} - 1 \right).
$$
One observes that at T < π the action, S, considered on this special class of functions, is positive. However, some of these probe functions result in a negative action when T > π. This means that at T > π the functional quadratic form corresponding to the action (6.50) is certainly not positive definite.
One thus comes out of this “probe function” exercise with the following question: can it be that the functional quadratic form corresponding to the action (6.50) is not positive definite even at T < π? The analysis so far (restricted to the class of single-harmonic test functions) is not conclusive. Quite remarkably, one can prove that the action (6.50) is always positive (over the class of zero-boundary-condition, twice differentiable functions), and thus the respective quadratic form is positive definite, if T < π.
Example 6.8.1. Prove that the action S{q(t)} given by Eq. (6.50) is positive at, T < π, for
any twice differentiable function, q ∈ C2 with zero boundary conditions, q(0) = q(T ) = 0.
(Hint: Represent the function as Fourier Series and show that the action is a sum of squares.)
Solution. Consider the general Fourier series expansion of q(t), that is
$$
q(t) = \frac{a_0}{2} + \sum_{n=1}^{\infty}\left( a_n \cos\frac{2\pi n t}{T} + b_n \sin\frac{2\pi n t}{T} \right).
$$
Calculating ∫₀ᵀ (q̇² − q²)dt and noting the orthogonality of the terms in the expansion, one arrives at
$$
\int_0^T (\dot q^2 - q^2)\,dt = -T\frac{a_0^2}{4} + T\sum_{n=1}^{\infty}\frac{a_n^2 + b_n^2}{2}\left( \frac{4\pi^2 n^2}{T^2} - 1 \right).
$$
If we consider the ‘worst-case’ scenario of T = π, we then have to show that
$$
\sum_{n=1}^{\infty}\frac{a_n^2 + b_n^2}{2}\bigl(4n^2 - 1\bigr) \ge \frac{a_0^2}{4},
$$
to ensure positivity of the action. Note that the boundary conditions imply a₀/2 + Σₙ aₙ = 0. Without loss of generality let us scale q(t) such that a₀ = −2. Then it will suffice to show that
$$
\sum_{n=1}^{\infty}\frac{a_n^2}{2}\bigl(4n^2 - 1\bigr) \ge 1.
$$
We can do this by constructing the dual problem and demonstrating that the minimal value
of the left-hand-side is 1 when varying the an . Specifically, our problem is
$$
\min_{a=(a_1,\dots,a_k)} \sum_{n=1}^{k}\frac{a_n^2}{2}\bigl(4n^2-1\bigr), \qquad \text{s.t.}\ \sum_{n=1}^k a_n = 1,
$$
where we first consider the partial-sum case and then take the k → ∞ limit. The Lagrangian is
$$
\mathcal{L}(a, \mu) = \sum_{n=1}^{k}\frac{a_n^2}{2}\bigl(4n^2-1\bigr) - \mu\left( \sum_{n=1}^k a_n - 1 \right).
$$
We compute ∇ₐ𝓛 = 0 to get
$$
a_n\bigl(4n^2 - 1\bigr) - \mu = 0, \qquad \forall n = 1, \dots, k,
$$
i.e. aₙ = µ/(4n² − 1). Substituting this into the constraint and using Σₙ₌₁ᵏ 1/(4n² − 1) = k/(2k + 1) yields µ = (2k + 1)/k. Substituting back into the original objective function we arrive at
$$
\sum_{n=1}^{k}\frac{a_n^2}{2}\bigl(4n^2-1\bigr) = \frac{\mu}{2} = \frac{2k+1}{2k} \ge 1.
$$
Therefore in the k → ∞ limit the value is still no smaller than 1, thus proving positivity of the action.
Consider the shortest path problem discussed in Example 6.2.3, however now constrained by the area, A, as follows:
$$
\min_{\{q(x)|x\in[0,a]\}} \int_0^a dx\, \sqrt{1 + (q'(x))^2}\ \Big|_{q(0)=0,\ q(a)=b,\ \int_0^a q(x)dx = A}.
$$
Example 6.9.1. The principle of maximum entropy, also called the principle of the maximum likelihood (distribution), selects the probability distribution that maximizes the entropy, S = −∫_D dx P(x) log P(x), under the normalization condition ∫_D dx P(x) = 1.
• (a) Consider a bounded domain D. Find the optimal P(x).
• (b) Consider D = [a, b] ⊂ R. Find the optimal P(x), assuming that the mean of x is known, E_{P(x)}(x) := ∫_D dx x P(x) = µ.
Solution:
(a) The effective action is
$$
\tilde S = S + \lambda\left(1 - \int_D dx\, P(x)\right), \qquad \frac{\delta \tilde S}{\delta P(x)} = 0:\ -\log(P(x)) - 1 - \lambda = 0.
$$
Accounting for the normalization condition one finds that the optimum is achieved at the equidistribution:
$$
P(x) = \frac{1}{\|D\|},
$$
where ‖D‖ is the size of D.
(b) The effective action is
$$
\tilde S = S + \lambda\left(1 - \int_D dx\, P(x)\right) + \lambda_1\left(\mu - \int_D dx\, x P(x)\right),
$$
where λ and λ₁ are two (constant) Lagrangian multipliers. Variation of S̃ over P(x) results in the following EL equation:
$$
\frac{\delta \tilde S}{\delta P(x)} = 0:\quad -\log(P(x)) - 1 - \lambda - \lambda_1 x = 0 \ \to\ P(x) = e^{-1-\lambda}\exp(-\lambda_1 x).
$$
λ and λ₁ are constants which can be expressed via a, b and µ by resolving the normalization constraint and the constraint on the mean:
$$
e^{-1-\lambda}\left[-\frac{\exp(-\lambda_1 x)}{\lambda_1}\right]_a^b = 1, \qquad e^{-1-\lambda}\left[-\frac{x\exp(-\lambda_1 x)}{\lambda_1} - \frac{\exp(-\lambda_1 x)}{\lambda_1^2}\right]_a^b = \mu.
$$
Exercise 6.5. Consider the setting of Example 6.9.1b with a = −∞, b = ∞. Assuming that the mean and variance of the probability distribution are known, i.e. E_{P(x)}(x) = µ and E_{P(x)}(x²) = σ² + µ², find the P(x) which maximizes the entropy.
The method of Lagrange multipliers in the calculus of variations extends to other types of constrained optimizations, where the condition is not a functional, as in the cases discussed so far, but a function. Consider, for example, our standard one-dimensional example of the action functional,
$$
S\{q(t)\} = \int dt\, L(t; q(t); \dot q(t)), \qquad (6.52)
$$
minimized subject to a function constraint of the form
$$
\forall t:\quad G(t; q(t); \dot q(t)) = 0. \qquad (6.53)
$$
Let us also assume that L(t; q; q̇) and G(t; q; q̇) are sufficiently smooth functions of their last argument, q̇. The idea then becomes to introduce the following “modified” action
$$
\tilde S\{q(t), \lambda(t)\} = \int dt\, \bigl( L(t; q(t); \dot q(t)) - \lambda(t)\, G(t; q(t); \dot q(t)) \bigr), \qquad (6.54)
$$
which is now a functional of both q(t) and λ(t), and extremize it over both q(t) and λ(t). One can show that solutions of the EL equations, derived as variations of the action (6.54) over both q(t) and λ(t), give a necessary condition for the minimum of Eq. (6.52) constrained by Eq. (6.53).
Let us illustrate this scheme and derive the Euler-Lagrange equation for a Lagrangian L(q; q̇; q̈) which depends on the second derivative of a C³ function, q : R → R, and does not depend on t explicitly. Introducing v(t) := q̇(t), in full analogy with Eq. (6.54) the modified action in this case becomes
$$
\tilde S\{q(t), \lambda(t)\} = \int dt\, \bigl( L(q(t); \dot q(t); \dot v(t)) - \lambda(t)\bigl( v(t) - \dot q(t) \bigr) \bigr). \qquad (6.55)
$$
Eliminating λ and v one arrives at the desired modified EL equation, stated solely in terms of derivatives of the Lagrangian over q(t) and its derivatives:
$$
\frac{\partial L}{\partial q} - \frac{d}{dt}\frac{\partial L}{\partial \dot q} + \frac{d^2}{dt^2}\frac{\partial L}{\partial \ddot q} = 0. \qquad (6.57)
$$
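As a quick symbolic sanity check of Eq. (6.57) (a sketch assuming sympy; the sample Lagrangian L = q̈²/2 is an illustrative choice), the higher-order EL equation should reduce to d⁴q/dt⁴ = 0:

```python
# Symbolic verification of the second-order EL equation (6.57) for L = q̈²/2.
import sympy as sp

t = sp.symbols('t')
q = sp.Function('q')(t)
qd, qdd = sp.diff(q, t), sp.diff(q, t, 2)

L = qdd**2 / 2
EL = (sp.diff(L, q)
      - sp.diff(sp.diff(L, qd), t)
      + sp.diff(sp.diff(L, qdd), t, 2))
print(sp.simplify(EL))   # prints Derivative(q(t), (t, 4)), i.e. q'''' = 0
```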
Exercise 6.6. Find extrema of S{q(t)} = ∫₀¹ dt ‖q̇(t)‖ for q : [0, 1] → R³ subject to the norm constraint, ∀t : ‖q(t)‖² = 1, and with generic boundary conditions, i.e. q(0) = q₀ and q(1) = q₁, where both boundary points satisfy the norm constraint.
We will see more of the calculus of variations with (function) constraints later in the
optimal control section of the course.
Chapter 7

Optimal Control and Dynamic Programming
The optimal control problem can be considered as a special case of a general variational calculus problem, where the (vector) fields evolve in time, i.e. reside in a one-dimensional real space equipped with a direction, and are constrained by a system of ODEs, possibly with algebraic constraints added too. We will learn how to analyze these problems with the methods of variational calculus from Section 6, using optimization approaches, e.g. convex analysis and duality, described in Appendix A.1, and also adding to our arsenal of tools a new one, called “Dynamic Programming” (DP), in Section 7.4.
Let us start with an illustrative (as sufficiently simple) optimal control problem:
$$
\min_{\{u(\tau), q(\tau)\}} \int_0^t d\tau\, (q(\tau))^2 \ \Big|_{\forall\tau\in(0,t]:\ \dot q(\tau) = u(\tau),\ u(\tau)\le 1}. \qquad (7.1)
$$
that the resulting discontinuity of the optimal q(τ ) at τ = 0 is not a problem, as it was not
required in the problem formulation.)
The analysis in the case of q0 ≤ 0 is more elaborate. Let us exclude the control variable,
turning the pair of constraints in Eq. (7.1) into one, ∀τ : q̇ ≤ 1. Our next step will be to
use the Duality Theory and KKT conditions, discussed in Math 584 (see also Appendix A)
in the context of finite dimensional optimization. We can therefore view what follows as a
practical lesson (just a use case, no proofs) on how to extend this important approach from
the case of finite dimensional optimization to the variational calculus.
We introduce the Lagrangian function,
We find that,
where c is a constant, satisfy both the KKT conditions and the initial condition, q(0) = q0 .
Can we have another solution different from Eqs. (7.2) but satisfying the KKT conditions?
How about a discontinuous control? Consider the following probe functions, bringing q to
zero first with the maximal allowed control, and then switching off the control:
$$
q(\tau) = \begin{cases} q_0 + \tau, & 0 < \tau \le -q_0, \\ 0, & -q_0 < \tau \le t, \end{cases} \qquad \mu(\tau) = \begin{cases} \tau^2 + 2q_0\tau + q_0^2, & 0 < \tau \le -q_0, \\ 0, & -q_0 < \tau \le t. \end{cases} \qquad (7.3)
$$
We observe that, indeed, in the regime where the probe function is well defined, i.e. 0 < −q₀ < t, Eqs. (7.3) solve the KKT conditions, therefore providing an alternative to the solution (7.2). Comparing the objectives in Eq. (7.1) for the two alternatives one finds that at 0 < −q₀ < t the solution (7.3) is optimal, while the solution (7.2) is optimal if t < −q₀.
Exercise 7.1. Solve Example 7.0.1 with the condition u ≤ 1 replaced by |u| ≤ 1.
7.1 Linear Quadratic Control via Calculus of Variations ∗
Our next extra-curricular topic∗ is Linear Quadratic (LQ) control. Consider a d-dimensional real vector representing the evolution of the system state in time, {q(τ) ∈ R^d | τ ∈ [0, t]}, governed by the following system of linear ODEs:
$$
\forall \tau \in (0, t]:\quad \dot q(\tau) = A\, q(\tau) + B\, u(\tau), \qquad q(0) = q_0, \qquad (7.4)
$$
where A and B are constant (time-independent), square, nonsingular (invertible) and possibly asymmetric (thus A ≠ Aᵀ and B ≠ Bᵀ) real matrices, A, B ∈ R^{d×d}, and {u(τ) ∈ R^d | τ ∈ [0, t]} is a time-dependent control vector of the same dimensionality as q. Introduce a combined action, often called the cost-to-go, combining quadratic state and control costs with a multiplier term enforcing the dynamics,
$$
S\{q, u, \lambda\} = \int_0^t d\tau \left( \frac{1}{2} q^T Q\, q + \frac{1}{2} u^T R\, u + \lambda^T (A q + B u - \dot q) \right) + \frac{1}{2} q(t)^T Q_{fin}\, q(t),
$$
where {λ(τ)} is the time-dependent vector of the Lagrangian multipliers, also called the adjoint vector. The Euler-Lagrange (EL) equations and the primal feasibility equations following from variations of the action are:
∗ This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute to the midterm and finals.
$$
\begin{aligned}
\text{Euler-Lagrange:}\quad & \frac{\delta S\{q,u,\lambda\}}{\delta q} = 0: && \forall \tau \in (0, t]:\ Q q + \dot\lambda + A^T \lambda = 0, && (7.10)\\
& \frac{\delta S\{q,u,\lambda\}}{\delta u} = 0: && \forall \tau \in [0, t]:\ R u + B^T \lambda = 0, && (7.11)\\
\text{primal feasibility:}\quad & \frac{\delta S\{q,u,\lambda\}}{\delta \lambda} = 0: && \text{Eqs. (7.4)}, && (7.12)\\
\text{boundary condition at } \tau = t:\quad & \frac{\partial S\{q,u,\lambda\}}{\partial q(t)} = 0: && \lambda(t) = Q_{fin}\, q(t), && (7.13)
\end{aligned}
$$
derived by variation of the effective action over q at the final point, q(t). The simplest way to derive the boundary condition Eq. (7.13) is through discretization, turning temporal integrals into discrete sums, specifically
$$
\int_0^t d\tau\, \lambda^T(\tau)\dot q(\tau) \ \to\ \lambda^T(\Delta)\bigl(q(\Delta) - q(0)\bigr) + \cdots + \lambda^T(t)\bigl(q(t) - q(t-\Delta)\bigr), \qquad (7.14)
$$
where ∆ is the discretization step, and then looking for a stationary point over q(t).
Notice that alternatively the boundary condition can be derived from the “main theorem of classical mechanics” – Theorem 6.5.1 – which states that the gradient of S with respect to the terminal point (t₁, q₁) is ∇S = ⟨L − q̇L_{q̇}, L_{q̇}⟩|_{(t₁,q₁)}. This result suggests that, given that q(τ) is optimal, the action cannot be improved by variations of the terminal value, q(t), and therefore the boundary condition at τ = t is
$$
0 = \frac{\partial S\{q, u, \lambda\}}{\partial q(t)} = \lambda(t) + L_{\dot q}\big|_{\tau=t} = \lambda(t) - Q_{fin}\, q(t). \qquad (7.15)
$$
Observe that Eqs. (7.11) are algebraic, thus allowing us to express the control vector, u, via the adjoint vector, λ:
$$
u = -R^{-1} B^T \lambda. \qquad (7.16)
$$
Substituting it into Eqs. (7.10,7.12) one arrives at the following joint system of the original and adjoint equations:
$$
\begin{pmatrix} \dot q \\ \dot\lambda \end{pmatrix} = \begin{pmatrix} A & -B R^{-1} B^T \\ -Q & -A^T \end{pmatrix} \begin{pmatrix} q \\ \lambda \end{pmatrix}, \qquad \begin{pmatrix} q(0) \\ \lambda(t) \end{pmatrix} = \begin{pmatrix} q_0 \\ Q_{fin}\, q(t) \end{pmatrix}. \qquad (7.17)
$$
The system of ODEs (7.17) is a two-point Boundary Value Problem (BVP) because it
has two boundary conditions at the opposite ends of the time interval. In general, two-
point BVPs are solved by the shooting method, which requires multiple iterations forward
and backward in time (hoping for convergence). However for the LQ Control problems,
CHAPTER 7. OPTIMAL CONTROL AND DYNAMIC PROGRAMMING 193
the system of equations is linear, and we can solve it in one shot – with only one forward
iteration and one backward iteration. Indeed, integrating the linear ODEs (7.17) one derives
$$
\begin{pmatrix} q(\tau) \\ \lambda(\tau) \end{pmatrix} = W(\tau) \begin{pmatrix} q(0) \\ \lambda(0) \end{pmatrix}, \qquad (7.18)
$$
$$
W(\tau) = \begin{pmatrix} W^{1,1}(\tau) & W^{1,2}(\tau) \\ W^{2,1}(\tau) & W^{2,2}(\tau) \end{pmatrix} := \exp\left( \tau \begin{pmatrix} A & -B R^{-1} B^T \\ -Q & -A^T \end{pmatrix} \right). \qquad (7.19)
$$
Imposing the boundary condition λ(t) = Q_{fin} q(t) relates λ(0) to q₀, λ(0) = M q₀, where
$$
M := \bigl( W^{2,2}(t) - Q_{fin} W^{1,2}(t) \bigr)^{-1} \bigl( Q_{fin} W^{1,1}(t) - W^{2,1}(t) \bigr). \qquad (7.20)
$$
Substituting Eqs. (7.18,7.20) into Eq. (7.16) one arrives at the following expression for the optimal control via q₀:
$$
u(\tau) = -R^{-1} B^T \bigl( W^{2,1}(\tau) + W^{2,2}(\tau) M \bigr) q_0. \qquad (7.21)
$$
A control of this type, dependent on the initial state, is called open loop control. That
is, the control policy u(τ ) doesn’t explicitly depend on the current state q(τ ), and instead
only on the initial state q0 . While in ideal conditions this does not pose an issue, under
uncertainty it is often better for the control policy to depend on the current state of the
system q(τ ). Such a control scheme is called feedback loop control, which may also be called
the closed loop control. The feedback loop version of Eq. (7.21) requires us to express u(τ )
in terms of q(τ). To do this we seek a map from q(τ) to λ(τ), which for LQ control is a matrix P(τ) such that P(τ)q(τ) = λ(τ). This allows us to write the feedback control law, Eq. (7.22) below.
We derive this matrix by expressing λ(τ ) and q(τ ) via q0 according to Eq. (7.18,7.20) and
then substituting the result in Eq. (7.16):
$$
\begin{aligned}
\forall \tau \in (0, t]:\quad u(\tau) &= -R^{-1} B^T \lambda(\tau) = -R^{-1} B^T \bigl( W^{2,1}(\tau) + W^{2,2}(\tau) M \bigr) q_0 \\
&= -R^{-1} B^T \bigl( W^{2,1}(\tau) + W^{2,2}(\tau) M \bigr) \bigl( W^{1,1}(\tau) + W^{1,2}(\tau) M \bigr)^{-1} q(\tau) \\
&= -R^{-1} B^T P(\tau)\, q(\tau), \qquad (7.22)
\end{aligned}
$$
$$
P(\tau) := \bigl( W^{2,1}(\tau) + W^{2,2}(\tau) M \bigr) \bigl( W^{1,1}(\tau) + W^{1,2}(\tau) M \bigr)^{-1}. \qquad (7.23)
$$
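The construction (7.17)-(7.23) is easy to verify numerically. The sketch below (assuming scipy; all matrices are illustrative choices) builds W(τ) by matrix exponentiation, forms M from the boundary condition λ(t) = Q_fin q(t) as in Eq. (7.20), and checks that P(τ) of Eq. (7.23) satisfies the Riccati equation (7.24) of Example 7.1.1 below, up to finite-difference error:

```python
# Numerical check of the LQ construction: P(τ) from (7.23) solves the Riccati eq.
import numpy as np
from scipy.linalg import expm

d, t = 2, 1.0
A = np.array([[0.0, 1.0], [-1.0, 0.3]])
B, Q, R, Qfin = np.eye(d), np.eye(d), 0.5 * np.eye(d), np.eye(d)
H = np.block([[A, -B @ np.linalg.inv(R) @ B.T], [-Q, -A.T]])

def blocks(tau):                      # the four d×d blocks of W(τ) = exp(τH)
    W = expm(tau * H)
    return W[:d, :d], W[:d, d:], W[d:, :d], W[d:, d:]

W11, W12, W21, W22 = blocks(t)
M = np.linalg.solve(W22 - Qfin @ W12, Qfin @ W11 - W21)   # λ(0) = M q0, Eq. (7.20)

def P(tau):                           # Eq. (7.23)
    W11, W12, W21, W22 = blocks(tau)
    return (W21 + W22 @ M) @ np.linalg.inv(W11 + W12 @ M)

tau, h = 0.5, 1e-5
Pdot = (P(tau + h) - P(tau - h)) / (2 * h)                # central difference
resid = Pdot + A.T @ P(tau) + P(tau) @ A + Q \
        - P(tau) @ B @ np.linalg.inv(R) @ B.T @ P(tau)
print(np.abs(resid).max())            # should be small (finite-difference error)
```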
At any moment of time τ , i.e. as we go along, the feedback loop control responds to the
current measurement of the system state, q(τ ), at the same time, τ .
Notice that in the deterministic case without uncertainty/perturbation (and this is what
we have considered so far) the open loop and the feedback loop are equivalent. However,
CHAPTER 7. OPTIMAL CONTROL AND DYNAMIC PROGRAMMING 194
the two control schemes/policies give very different results in the presence of uncertain-
ty/perturbation (consistently with what was already mentioned above). We will investigate
this phenomenon and have a more extended comparison of the two controls in the proba-
bility/statistics/data science section of the course
Example 7.1.1. Show, utilizing the derivations and discussions above, that the matrix P(τ), defined in Eq. (7.23), satisfies the so-called Riccati equation:
$$
\dot P + A^T P + P A + Q = P B R^{-1} B^T P. \qquad (7.24)
$$
Solution: Differentiating the relation λ = P q over time and using Eqs. (7.10,7.16,7.17),
$$
\begin{aligned}
&\dot P q + P \dot q = \dot\lambda,\\
\Longrightarrow\ & \dot P q + P (A q - B R^{-1} B^T \lambda) = -Q q - A^T \lambda,\\
\Longrightarrow\ & \dot P q + P (A q - B R^{-1} B^T P q) = -Q q - A^T P q,\\
\Longrightarrow\ & (\dot P + A^T P + P A + Q) q = P B R^{-1} B^T P q,
\end{aligned}
$$
and since this holds for an arbitrary q, we recover Eq. (7.24).
where u ∈ R and A is a positive constant, A > 0. Design an LQ controller u(τ) = −P q(τ)/R that minimizes the action
$$
S\{q(\tau), u(\tau)\} = \int_0^\infty d\tau\, \bigl( q^2 + R u^2 \bigr).
$$
In the stationary regime, Ṗ = 0, the Riccati equation (7.24) reduces to
$$
2 A P + Q = \frac{B^2}{R} P^2.
$$
Since B = Q = 1, the quadratic equation for P becomes
$$
P = R A \pm \sqrt{R^2 A^2 + R},
$$
that is, it shows two branches. We must consider both of these branches in the context of the optimization problem. Substituting u in the ODE, we arrive at
$$
q(\tau) = \exp\bigl( (A - P/R)\tau \bigr) = \exp\left( \mp\sqrt{A^2 + 1/R}\ \tau \right).
$$
Observe that the cost is infinite if we do not take P = RA + √(R²A² + R). If we consider R → 0, the result is P → √R; however, if we consider R → ∞, the result is P → 2RA.
We can even write down the action explicitly:
$$
S = \int_0^\infty (q^2 + R u^2)\, d\tau = \int_0^\infty \left( 1 + 2A^2 + \frac{1}{R} - 2\sqrt{A^2 + \frac{1}{R}} \right) \exp\left( -2\tau\sqrt{A^2 + \frac{1}{R}} \right) d\tau = \frac{1 + 2A^2 + 1/R - 2\sqrt{A^2 + 1/R}}{2\sqrt{A^2 + 1/R}}.
$$
$$
\int_0^t d\tau \left( \frac{1}{2}\, u^T(\tau) u(\tau) + V(q(\tau)) \right), \qquad (7.26)
$$
over {u(τ)} which satisfies the ODE (7.25). Here in Eq. (7.26) we shorten notation and use (u(τ))² for uᵀ(τ)u(τ). Notice that the cost-to-go objective (7.26) is a sum of two
terms: (a) the cost of control, which is assumed quadratic in the control efforts, and (b) the
bounded from below “potential”, which defines preferences or penalties imposed on where
the particle may or may not go. The potential may be soft or hard. An exemplary soft
potential is the quadratic potential
$$
V(q) = \frac{1}{2}\, q^T \Lambda q = \frac{1}{2} \sum_{i,j=1}^d q_i \Lambda_{ij} q_j, \qquad (7.27)
$$
where Λ is a positive semi-definite matrix. This potential encourages q(τ ) to stay close to
the origin, q = 0, penalizing (but softly) for deviation from the origin. An exemplary hard
constraint may be
$$
V(q) = \begin{cases} 0, & |q| < a, \\ \infty, & |q| \ge a, \end{cases} \qquad (7.28)
$$
completely prohibiting q(τ ) to leave the ball of size a around the origin. Summarizing, we
discuss the optimal control problem:
$$
\min_{\{u(\tau), q(\tau)\}} \int_0^t d\tau \left( \frac{u^T(\tau) u(\tau)}{2} + V(q(\tau)) \right) \Big|_{\forall\tau\in[0,t]:\ \dot q(\tau) = f(q(\tau)) + u(\tau),\ q(0)=q_0,\ q(t)=q_t}, \qquad (7.29)
$$
where initial and final states of the system are assumed fixed.
In the following we restate Eq. (7.29) as an unconstrained variational calculus problem.
(Notice, that we do not count the boundary conditions as constraints.) We will assume that
all the functions involved in the formulation (7.29) are sufficiently smooth and derive re-
spective Euler-Lagrange (EL) equations, Hamiltonian equations and Hamilton-Jacobi (HJ)
equations.
To implement the plan, let us, first of all, exclude {u(τ )} from Eq. (7.29). The resulting
“q-only” formulation becomes
$$
\min_{\{q(\tau)\}} \int_0^t d\tau \left( \frac{(\dot q(\tau) - f(q(\tau)))^T (\dot q(\tau) - f(q(\tau)))}{2} + V(q(\tau)) \right) \Big|_{q(0)=q_0,\ q(t)=q_t}. \qquad (7.30)
$$
The corresponding action, Lagrangian, canonical momentum and Hamiltonian are
$$
S\{q(\tau), \dot q(\tau)\} = \int_0^t d\tau \left( \frac{(\dot q - f(q))^T (\dot q - f(q))}{2} + V(q) \right), \qquad (7.31)
$$
$$
L = \frac{(\dot q - f(q))^T (\dot q - f(q))}{2} + V(q), \qquad (7.32)
$$
$$
p \equiv \frac{\partial L}{\partial \dot q^T} = \dot q - f(q), \qquad (7.33)
$$
$$
H \equiv \dot q^T \frac{\partial L}{\partial \dot q^T} - L = \frac{\dot q^T \dot q}{2} - \frac{(f(q))^T f(q)}{2} - V(q) = \frac{p^T p}{2} + p^T f(q) - V(q). \qquad (7.34)
$$
Then the Euler-Lagrange equations are
$$
\forall i = 1, \cdots, d:\quad \frac{d}{dt}\frac{\partial L}{\partial \dot q_i} = \frac{\partial L}{\partial q_i}, \qquad (7.35)
$$
$$
\frac{d}{dt}\bigl( \dot q_i - f_i(q) \bigr) = -\sum_{j=1}^d (\dot q - f(q))_j\, \partial_{q_i} f_j(q) + \partial_{q_i} V(q),
$$
where we stated the vector equation by components for clarity. The Hamilton equations
are
$$
\forall i = 1, \cdots, d:\quad \dot q_i = \frac{\partial H}{\partial p_i} = p_i + f_i(q), \qquad (7.36)
$$
$$
\dot p_i = -\frac{\partial H}{\partial q_i} = -\sum_{j=1}^d p_j\, \partial_{q_i} f_j(q) + \partial_{q_i} V(q). \qquad (7.37)
$$
Considering the action, S, as a function (not a functional!) of the final time, t, and of the final position, q_t, and recalling that
$$
\frac{\partial S}{\partial t} = -H\big|_{\tau=t}, \qquad \frac{\partial S}{\partial q_t} = \frac{\partial L}{\partial \dot q}\Big|_{\tau=t} = p\big|_{\tau=t},
$$
one arrives at the Hamilton-Jacobi (HJ) equation for S(t, q_t). (At the beginning of the path we use, instead, the relations ∂_τ S = H|_τ and ∂_q S = −∂_{q̇}L|_τ; check Theorem 6.5.1 to recall how differentiation of the action with respect to time and coordinates at the beginning and at the end of a path are related to each other.)
Notice that the HJ equation, in the control formulation, is called the Bellman or Bellman-Hamilton-Jacobi (BHJ) equation, to commemorate the contribution to the field of Bellman, who formulated the problem and resolved it by deriving the BHJ equation.
In Section 7.4 we derive the BHJ equation in a more general setting.
Exercise 7.2. Consider one-dimensional optimal control problem where we know where
we want to arrive and we must learn how to steer there from any location, formally:
$$
\{u^*(\tau), q^*(\tau)\} = \arg\min_{\{u(\tau), q(\tau)\in\mathbb{R}\}} \int_0^t d\tau\, \frac{(u(\tau))^2 + \beta^2 (q(\tau))^2}{2}\ \Big|_{\forall\tau\in[0,t]:\ \dot q(\tau) = -\alpha q(\tau) + u(\tau),\ q(0)=q_0,\ q(t)=0}. \qquad (7.40)
$$
Use a substitution to eliminate the control variable, and derive the Euler-Lagrange equations, the Hamiltonian equations, and the Hamilton-Jacobi equations. Find the optimal trajectory q∗(τ) and verify that it is consistent with the three equations. Reconstruct the optimal controller u∗(τ) and express it via (a) q∗(τ) (closed loop control); (b) q₀ (open loop control). [Hint: In your solution, you may find it convenient to define γ² = β² + α².]
$$
\{u^*(\tau), q^*(\tau)\} = \arg\min_{\{u(\tau), q(\tau)\}} \left( \phi(q(t)) + \int_0^t d\tau\, L(\tau, q(\tau), u(\tau)) \right) \Bigg|_{\substack{\dot q(\tau) = f(\tau, q(\tau), u(\tau)),\ \forall\tau\in(0,t];\\ u(\tau)\in\mathcal{U}\subseteq\mathbb{R}^d,\ \forall\tau\in[0,t];\\ q(0)=q_0,}} \qquad (7.41)
$$
where the control {u(τ) ∈ 𝒰 ⊆ R^d | τ ∈ [0, t]} is restricted to the domain 𝒰 of the d-dimensional space at all times considered.
The analog of the standard variational calculus approach, consisting in the necessary
Euler-Lagrange (EL) conditions over {u(τ )} and {q(τ )}, is called Pontryagin Minimal Prin-
ciple (PMP), commemorating the contribution of Lev Pontryagin to the subject [13] (see
CHAPTER 7. OPTIMAL CONTROL AND DYNAMIC PROGRAMMING 199
also [14] for an extended discussion of the PMP bibliography, circa 1963). We present it here without much elaboration, as it follows the same variational logic repeated by now many times in this Section. Begin by introducing the effective action,
$$
\tilde S := \phi(q(t)) + \int_0^t d\tau\, L(\tau, q(\tau), u(\tau)) + \int_0^t d\tau\, \lambda(\tau)\bigl( f(\tau, q(\tau), u(\tau)) - \dot q(\tau) \bigr), \qquad (7.42)
$$
$$
\frac{\delta \tilde S}{\delta q(\tau)}\bigg|_{q(\tau)=q^*(\tau)} = 0: \qquad \dot\lambda^*(\tau) = -\frac{\partial}{\partial q^*}\Bigl( L(\tau, q^*(\tau), u^*(\tau)) + \lambda^*(\tau) f(\tau, q^*(\tau), u^*(\tau)) \Bigr), \qquad (7.44)
$$
$$
\tau = t: \quad \frac{\partial \tilde S}{\partial q(t)}\bigg|_{q(\tau)=q^*(\tau)} = 0: \qquad \lambda^*(t) = \partial \phi(q^*(t))/\partial q^*(t). \qquad (7.45)
$$
Notice that Eq. (7.45) is the result of variation of S̃ over q(t), providing the boundary condition at τ = t by relating q(t) and λ∗(t). (Its derivation is equivalent to the derivation of the respective boundary condition (7.15) at τ = t in the case of the LQ control discussed in Section 7.1.) The combination of Eqs. (7.43,7.44,7.45) with the (primal) dynamic equations and the initial condition on q(0) (from the first line of the conditions in Eq. (7.41)) completes the description of the PMP approach. This PMP system of equations, stated as a
Boundary Value (BV) problem, with two boundary conditions on the opposite ends of the
temporal interval, is too difficult to allow an analytic solution in the general case. The
system of equations is normally solved numerically by the shooting method. Solution of the
PMP system of equations is not guaranteed to be unique.
Exercise 7.3. Consider a rocket, modeled as a particle of constant (unit) mass moving
in zero gravity (empty) two dimensional space. Assume that the thrust/force acting on
the rocket, f (τ ) is a known (prescribed) function of time (dependent on, presumably pre-
calculated, rate of the fuel burn), and that the direction of the thrust can be controlled.
Then equations of motion for the controlled rocket are
(a) Assume that ∀τ ∈ [0, t], f(τ) > 0. Show that min_{u(τ),q₁(τ),q₂(τ)} φ(q(t)), where φ(q) is an arbitrary function, always results in an optimal control of the following, so-called bi-linear tangent, form:
$$
\tan\bigl( u^*(\tau) \bigr) = \frac{a + b\tau}{c + d\tau}.
$$
(b) Assume that the rocket starts at rest at the origin and that we want to drive it to a
given height, q2 (t) = q∗ , at the final moment of time t, such that the final velocity in the
horizontal direction, q̇1 (t), is maximized, while q̇2 (t) = 0:
where, performing the optimization over u_{n−1}, we took advantage of the locality in the causal structure of the objective in Eq. (7.46), therefore taking into account only the terms in the objective dependent on u_{n−1}. Repeating the same scheme by first excluding q_{n−1} and second optimizing over u_{n−2}, and then repeating the two sub-steps (by induction) n − 1 times (backwards in discrete time), we arrive at the following recurrent generalization of Eqs. (7.48,7.49), for k = n, · · · , 1:
$$
u^*_{k-1} := \arg\min_{u_{k-1}\in\mathcal{U}} \Bigl( S\bigl(k,\, q_{k-1} + \Delta f(\tau_{k-1}, q_{k-1}, u_{k-1})\bigr) + \Delta L(\tau_{k-1}, q_{k-1}, u_{k-1}) \Bigr), \qquad (7.50)
$$
$$
S(k-1, q_{k-1}) := S\bigl(k,\, q_{k-1} + \Delta f(\tau_{k-1}, q_{k-1}, u^*_{k-1})\bigr) + \Delta L(\tau_{k-1}, q_{k-1}, u^*_{k-1}), \qquad (7.51)
$$
where Eq. (7.47) sets the initial condition for the backward (in discrete time) iterations. It is now clear that S(0, q₀) is exactly the solution of Eq. (7.46). S(k, q_k), defined in Eq. (7.51), is called the cost-to-go, or the value function, evaluated at the (discrete) time τ_k. L(τ, q, u) and f(τ, q, u) are called the (incremental) reward and the (incremental) state correction. Eqs. (7.47,7.50,7.51) are summarized in Algorithm 1.
Algorithm 1 Dynamic Programming
1: S(n, q) ← φ(q)
2: for k = n − 1, · · · , 0 do
3:   u∗_k(q) ← arg min_u (∆L(τ_k, q, u) + S(k + 1, q + ∆f(τ_k, q, u))), ∀q
4:   S(k, q) ← ∆L(τ_k, q, u∗_k(q)) + S(k + 1, q + ∆f(τ_k, q, u∗_k(q))), ∀q
5: end for
Output: u∗_k(q), ∀q, k = n − 1, · · · , 0.
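A minimal numerical sketch of Algorithm 1 (not from the text; the state grid, control set and costs L, f, φ below are illustrative choices, with linear interpolation standing in for the continuum of states):

```python
# Backward DP (Algorithm 1) on a discretized scalar control problem.
import numpy as np

dt, n = 0.05, 40                       # time step ∆ and number of steps
states = np.linspace(-2.0, 2.0, 81)    # discretized state q
controls = np.linspace(-1.0, 1.0, 21)  # discretized control u ∈ U

phi = lambda q: q**2                   # terminal cost φ(q)
L = lambda q, u: 0.5 * u**2            # incremental reward ∆L
f = lambda q, u: u                     # dynamics: q̇ = u

S = phi(states)                        # S(n, q) ← φ(q)
policy = np.zeros((n, len(states)), dtype=int)
for k in range(n - 1, -1, -1):         # backward in time
    # candidate cost for every (state, control) pair; interp evaluates S(k+1, ·)
    q_next = states[:, None] + dt * f(states[:, None], controls[None, :])
    cand = dt * L(states[:, None], controls[None, :]) \
           + np.interp(q_next, states, S)          # clamps outside the grid
    policy[k] = np.argmin(cand, axis=1)
    S = cand[np.arange(len(states)), policy[k]]
print(S[len(states) // 2])             # cost-to-go from q0 = 0
```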
The scheme just explained and the resulting DP Algorithm 1 were introduced in the
famous paper of Richard Bellman from 1952 [15].
The DP algorithm construction (one step at a time, backward in time) is an example of what is called a greedy algorithm in Computer Science, that is, an algorithm that attempts to find a globally optimal solution by making choices at each step that are only locally optimal. In general, greedy algorithms offer only a heuristic, i.e. an approximate (sub-optimal) solution. However, the remarkable feature of the optimal control problem, which we just sketched a proof of (the sequence of transformations in Eqs. (7.47,7.50,7.51) results in the optimal solution of Eq. (7.46)), is that the greedy algorithm in this case is optimal/exact.
Taking the continuous limit of Eqs. (7.47,7.50,7.51) one arrives at the Bellman (also called Bellman-Hamilton-Jacobi) equation (already familiar from the discussion of Section 7.2, where it was derived in a special case):
$$
-\partial_\tau S(\tau, q) = \min_{u\in\mathcal{U}} \Bigl( L(\tau, q, u) + \partial_q S(\tau, q) \cdot f(\tau, q, u) \Bigr). \qquad (7.52)
$$
Then the expression for the optimal control, that is the continuous-time version of line 3 of Algorithm 1, is
$$
\forall \tau \in (0, t]:\quad u^*(\tau, q) = \arg\min_{u\in\mathcal{U}} \Bigl( L(\tau, q, u) + \partial_q S(\tau, q) \cdot f(\tau, q, u) \Bigr). \qquad (7.53)
$$
Notice that the substitution
$$
L(\tau, q, u) \to \frac{u^2}{2} + V(q), \qquad f(\tau, q, u) \to f(q) + u,
$$
with 𝒰 → R^d, leads, after explicit evaluation of the resulting quadratic optimization, to Eq. (7.39).
Example 7.4.1. Consider a unit mass on a spring, driven by a bounded control force:
$$
\forall \tau \in (0, t]:\quad \ddot x(\tau) + x(\tau) = u(\tau), \qquad |u(\tau)| \le 1, \qquad (7.54)
$$
where the particle and control trajectories are {x(τ) ∈ R | τ ∈ (0, t]} and {u(τ) ∈ R | τ ∈ (0, t]}. Given x(0) = x₀ and ẋ(0) = 0, i.e. the particle is initially at rest, find the control path {u(τ)} such that the particle position at the final moment, x(t), is maximal. (t is assumed known too.) Describe the optimal control and the optimal solution for the case of x(0) = 0 and t = 2π.
Solution. First, we change from a single second-order (in time) ODE to two first-order ODEs:
$$
\forall \tau \in (0, t]:\quad q = \begin{pmatrix} q_1 \\ q_2 \end{pmatrix} := \begin{pmatrix} x \\ \dot x \end{pmatrix}, \qquad \dot q = A q + B u, \qquad (7.55)
$$
$$
A := \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad B := \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \qquad (7.56)
$$
We arrive at the optimal control problem (7.41), where φ(q) = Cᵀq, Cᵀ := (−1, 0), L(t, q, u) = 0, f(t, q, u) = Aq + Bu. Then, Eq. (7.52) becomes
$$
-\partial_\tau S(\tau, q) = \min_{|u|\le 1} \bigl( \partial_q S \bigr)^T (A q + B u) = \bigl( \partial_q S \bigr)^T A q - \bigl| (\partial_q S)^T B \bigr|. \qquad (7.57)
$$
The absolute value here comes from the fact that the optimal value of u(τ) is at one of its extremes, either +1 or −1, depending on the sign of (∂_q S)ᵀB. Let us look for a solution by the (standard for HJ) method of variable separation, S(τ, q) = (ψ(τ))ᵀq + α(τ). Substituting the ansatz into Eq. (7.57) one derives
$$
\dot\psi = -A^T \psi, \qquad \dot\alpha = \bigl| \psi^T B \bigr|. \qquad (7.58)
$$
These equations must be solved for all τ, with the terminal/final conditions ψ(t) = C and α(t) = 0. Solving the first equation and then substituting the result into Eq. (7.53) one derives
$$
\forall \tau \in (0, t]:\quad \psi(\tau) = \begin{pmatrix} -\cos(\tau - t) \\ \sin(\tau - t) \end{pmatrix}, \qquad u(\tau, q) = -\mathrm{sign}\bigl( \psi_2(\tau) \bigr) = -\mathrm{sign}\bigl( \sin(\tau - t) \bigr), \qquad (7.59)
$$
that is the optimal control depends only on τ (does not depend on q) and it is ±1.
Consider, for example, q₁(0) = x(0) = 0 and t = 2π. In this case the optimal control is
$$
u(\tau) = \begin{cases} -1, & 0 < \tau < \pi, \\ 1, & \pi < \tau < 2\pi. \end{cases} \qquad (7.60)
$$
The solution consists in first pushing the mass down and then up, in both cases to the extremes, i.e. to u = −1 and u = 1, respectively. This type of control is called bang-bang control; it is observed in cases, like the one considered, without any (soft) cost associated with the control, only (hard) bounds.
Exercise 7.4. Consider a soft version of the problem discussed in Example 7.4.1:
$$
\min_{\{u(\tau)\},\{q(\tau)\}} \left( C^T q(t) + \frac{1}{2}\int_0^t d\tau\, (u(\tau))^2 \right) \Big|_{\forall\tau\in(0,t]:\ \dot q(\tau) = A q(\tau) + B u(\tau)}, \qquad (7.62)
$$
where (q(0))T = (x0 , 0) and A, B and C are defined above (in the formulation and solution
of the Example 7.4.1). Derive Bellman/BHJ equation, build a generic solution and illustrate
it on the case of t = 2π and q1 (0) = x0 = 0. Compare your result with solution of the
Example 7.4.1.
where 1 < j₁ < j₂ < · · · < j_l < n. We seek an optimal sequence that minimizes the total cost. To make the description of the problem complete, one needs to introduce a plausible way of “pricing” the line breaks. Let us define the total length of the line as the sum of all lengths (of words) in the sequence plus the number of words in the line minus one (corresponding to the number of spaces in the line before stretching). Then one requires the total length of the line (before stretching) to be less than the widest allowed line length L, and defines the cost to be a monotonically increasing function of the stretching factor, for example
$$
c(i, j) = \begin{cases} +\infty, & L < (j - i) + \sum_{k=i}^{j} w_k, \\[4pt] \left( \dfrac{L - (j - i) - \sum_{k=i}^{j} w_k}{j - i} \right)^{\!3}, & \text{otherwise.} \end{cases} \qquad (7.64)
$$
(The cubic dependence in Eq. (7.64) is an empirical way to introduce preference for smaller
stretching factors. Notice also that Eq. (7.64) assumes that j > i, i.e. any line contains
more than one word, and it does not take into account the last string in the paragraph.)
ᵇ The exemplary Dynamical Programming problem is borrowed from [16]. See Section 3.3.1.
At first glance the problem of finding the optimal sequence seems hard, that is, exponential in the number of words. Indeed, formally one has to make a decision whether or not to place a break after reading each word in the sequence, thus facing the problem of choosing an optimal sequence from 2^{n−1} possible options.
Is there a more efficient way of finding the optimal sequence? The answer to this question is in the affirmative, and in fact, as we will see below, the solution is of the Dynamic Programming (DP) type. The key insight is the relation between the optimal solution of the full problem and an optimal solution of a sub-problem consisting of an early portion of the full paragraph. One discovers that the optimal solution of the sub-problem is a sub-set of the optimal solution of the full problem. This means, in particular, that we can proceed in a greedy manner, looking for an optimal solution sequentially, solving a sequence of sub-problems, where each consecutive problem extends the preceding one incrementally. (In general, greedy algorithms follow this basic structure: first, we view solving the problem as making a sequence of “steps” such that every time we make a “step” we end up with a smaller version of the same basic problem; second, we follow an approach of always taking whichever “step” looks best at the moment, and we never back up and change a “step”.)
Let f(i) denote the minimum cost of formatting a sequence of words which starts from the word i and runs to the end of the paragraph. Then the minimum cost of the entire paragraph is
$$
f(1) = \min_j\bigl( c(1, j) + f(j + 1) \bigr). \qquad (7.65)
$$
Algorithm 2 Text justification by DP (with the convention f(n + 1) = 0)
1: for i = n, · · · , 1 do
2:   f(i) ← +∞
3:   for j = i, · · · , n do
4:     f(i) ← min (f(i), c(i, j) + f(j + 1))
5:   end for
6: end for
Output: f(i), ∀i = 1, · · · , n
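A compact Python sketch of Algorithm 2 with the cost (7.64). The zero cost for the last line and the guard against single-word lines are illustrative conventions (cf. the remark after Eq. (7.64), which assumes j > i and ignores the last line):

```python
# Text justification by backward DP, Eqs. (7.64)-(7.65).
def line_cost(w, i, j, L):
    """Cost c(i, j) of placing words i..j (1-based, inclusive) on one line."""
    length = (j - i) + sum(w[i - 1 : j])        # word lengths plus single spaces
    if length > L:
        return float('inf')
    if j == len(w):                             # last line: stretching is free
        return 0.0
    return ((L - length) / max(j - i, 1)) ** 3  # cubic stretching penalty

def justify(w, L):
    n = len(w)
    f = [0.0] * (n + 2)                         # f(n+1) = 0 terminates recursion
    brk = [0] * (n + 1)
    for i in range(n, 0, -1):                   # backward, as in Algorithm 2
        f[i], brk[i] = min((line_cost(w, i, j, L) + f[j + 1], j)
                           for j in range(i, n + 1))
    return f[1], brk                            # minimal cost and break table

words = [len(s) for s in "the quick brown fox jumps over the lazy dog".split()]
print(justify(words, L=15))
```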
Let us now discuss another problem. There is a number placed in each cell of a rectangular grid, N × M. One starts from the upper-left corner and aims to reach the lower-right corner. At every step one can move down or right, “paying a price” equal to the number written in the cell. What is the minimum amount needed to complete the task?
Solution: You can move to a particular cell (i, j) only from its left neighbor (i, j − 1) or its upper neighbor (i − 1, j). Let us solve the following sub-problem: find the minimal price p(i, j) of moving to the cell (i, j). The recursive formula (a Bellman equation again) is
$$
p(i, j) = \min\bigl( p(i - 1, j),\ p(i, j - 1) \bigr) + a(i, j), \qquad (7.66)
$$
where a(i, j) is the table of initial numbers. The final answer is the element p(N, M). Note that you can manually pad the table a(i, j) with an extra row and column, filled with numbers which are deliberately larger than the content of any cell (this helps as it allows one to avoid dealing with the boundary conditions). See Algorithm 3.
Algorithm performance is illustrated in Fig. (7.1).
Exercise 7.5. Consider a directed acyclic graph with weighted edges (a directed acyclic
graph is a directed graph with no cycles). One node is the start, and one node is the end.
Construct an algorithm to compute the maximum cost path from the start node to the
end node recursively. That is, beginning with the start node (say node 1), propagate by
adjacency and keep an updated list of the max cost path from node 1 to node j. Your
algorithm should not arbitrarily compute all possible paths. Provide pseudo-code for your
algorithm and test it (by hand or with a computer) on the graph given in Fig. 7.2
CHAPTER 7. OPTIMAL CONTROL AND DYNAMIC PROGRAMMING 207
Algorithm 3 Minimal path cost on a grid
1: for t = 2, · · · , N + M do
2:   for all i, j ≥ 1 with i + j = t do
3:     p(i, j) ← min (p(i − 1, j), p(i, j − 1)) + a(i, j)
4:   end for
5: end for
Output: p(i, j), ∀i = 1, · · · , N; j = 1, · · · , M.
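A direct implementation of Algorithm 3 / Eq. (7.66) (a sketch; the 3 × 4 cost table below is an illustrative example patterned after Fig. (7.1), and padding with +∞ replaces the boundary bookkeeping):

```python
# Minimal path cost on a grid by DP, Eq. (7.66).
import numpy as np

a = np.array([[1, 3, 7, 3],
              [8, 9, 2, 5],
              [4, 5, 2, 1]], dtype=float)      # illustrative cost table
N, M = a.shape
p = np.full((N + 1, M + 1), np.inf)            # +inf padding handles boundaries
p[1, 1] = a[0, 0]                              # start at the upper-left corner
for i in range(1, N + 1):
    for j in range(1, M + 1):
        if (i, j) != (1, 1):
            p[i, j] = min(p[i - 1, j], p[i, j - 1]) + a[i - 1, j - 1]
print(p[N, M])                                 # minimal total price, here 16.0
```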
Figure 7.1: Step-by-step illustration of Algorithm 3 on a 3 × 4 grid: the panels show the progressive filling of the table p(i, j) (first through sixth/final steps) and the resulting optimal path(s).
Figure 7.2: A directed acyclic graph with weighted edges for Exercise 7.5.
The range of optimization problems which can be solved efficiently with DP is remarkably broad. In particular, it turns out that the following combinatorial optimization problem over a binary n-dimensional variable, x,
$$
E := \min_{x\in\{\pm 1\}^n} \sum_{i=1}^{n-1} E_i(x_i, x_{i+1}), \qquad (7.67)
$$
which naively requires optimization over 2ⁿ possible states, can be solved efficiently by DP with effort linear in n. Here, in Eq. (7.67), Eᵢ(xᵢ, x_{i+1}) is an arbitrary, known, and possibly different for different i, real-valued function of its arguments, which are both binary. In the jargon of mathematical physics the problem just introduced is called “finding a ground state of the Ising model”.
To explain the DP algorithm for this example it is convenient to represent the problem
in terms of a linear graph (a chain) shown in Fig. (7.3). The components of x are associated
with nodes and the “energy” of “pair-wise interactions” between neighboring components
of x are associated with an edge, thus arriving at a linear graph (chain).
Let us illustrate the greedy, DP approach to solving optimization (7.67) on the example
in Fig. (7.3). The greedy essence of the approach suggests that we should minimize over
components sequentially, starting from one side of the chain and advancing to its opposite
end. Therefore, minimizing over x1 one derives
$$
\begin{aligned}
E &= \min_{x_2,\cdots,x_n} \left( \min_{x_1} E_1(x_1, x_2) + \sum_{i=2}^{n-1} E_i(x_i, x_{i+1}) \right) \\
&= \min_{x_2,\cdots,x_n} \left( \tilde E_2(x_2, x_3) + \sum_{i=3}^{n-1} E_i(x_i, x_{i+1}) \right), \qquad (7.68)
\end{aligned}
$$
$$
\tilde E_2(x_2, x_3) := E_2(x_2, x_3) + \min_{x_1} E_1(x_1, x_2), \qquad (7.69)
$$
where we took advantage of the factorization of the objective (into a sum of terms each involving only a pair of neighboring components). Notice that, computing min_{x₁} E₁(x₁, x₂), we need
Figure 7.3: Top: Example of a linear Graphical Model (chain). Bottom: Modified GM
(shorter chain) after one step of the DP algorithm.
to track the result for all possible (two) values of x₂. As the end result of the (greedy) minimization (over x₁) we arrive at a problem with exactly the same structure we started with, i.e. a chain. However, the chain is shorter by one node (and edge). The only change in the new structure (when compared with the original structure) is the “renormalization” of the pair-wise energy: E₂(x₂, x₃) → Ẽ₂(x₂, x₃). The graphical transformation associated with one greedy step is illustrated in Fig. (7.3): it shows the transition from the original chain to the reduced (one node and one edge shorter) chain. Therefore, repeating the process sequentially (by induction) we will get the desired answer in exactly n steps. The DP algorithm is shown below, where we also generalize by assuming that all components xᵢ are drawn from an arbitrary (and not necessarily binary) set, Σ, often called the “alphabet” in the Computer Science and Information Theory literature.
Consider the generalization of the combinatorial optimization problem (7.67) to the case of a single-connected tree, T = (V, E), e.g. the one shown in Fig. (7.4):
$$
E := \min_{x\in\Sigma^{|\mathcal{V}|}} \sum_{\{i,j\}\in\mathcal{E}} E_{i,j}(x_i, x_j), \qquad (7.70)
$$
where V and E are the sets of nodes and edges of the tree, respectively; |V| is the cardinality of the set of nodes (the number of nodes); and Σ is the set (alphabet) marking the possible (allowed) values of any component xᵢ, i ∈ V, of x.
Algorithm 4 DP for the chain model (7.67)
1: for i = 1, · · · , n − 2 do
2:   for x_{i+1}, x_{i+2} ∈ Σ do
3:     E_{i+1}(x_{i+1}, x_{i+2}) ← E_{i+1}(x_{i+1}, x_{i+2}) + min_{x_i} E_i(x_i, x_{i+1})
4:   end for
5: end for
Output: E = min_{x_{n−1}, x_n} E_{n−1}(x_{n−1}, x_n)
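A sketch of Algorithm 4 in Python (random pairwise energies over the binary alphabet are an illustrative choice); the brute-force enumeration is included only to confirm that the greedy chain elimination is exact:

```python
# Ground-state energy of a chain model (7.67) by DP (Algorithm 4).
import numpy as np

rng = np.random.default_rng(1)
n = 8
# E[i] is a 2x2 table: E_i(x_i, x_{i+1}), indexed over the binary alphabet
E = [rng.standard_normal((2, 2)) for _ in range(n - 1)]

def chain_ground_state_energy(E):
    Etil = E[0].copy()
    for i in range(1, len(E)):
        # fold min over the left variable into the next pairwise table;
        # Etil.min(axis=0) depends on x_{i+1}, broadcast over x_{i+2}
        Etil = E[i] + Etil.min(axis=0)[:, None]
    return Etil.min()

brute = min(sum(E[i][(x >> i) & 1, (x >> (i + 1)) & 1] for i in range(n - 1))
            for x in range(2 ** n))        # exhaustive check over 2^n states
print(chain_ground_state_energy(E), brute) # should agree (up to float rounding)
```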
Figure 7.4: An exemplary single-connected tree, T = (V, E), for the optimization problem (7.70).
Exercise 7.6. Generalize Algorithm 4 to the case of the GM optimization problem (7.70)
over a tree, that is compute E defined in Eq. (7.70). (Hint: one can start from any leaf
node of the tree, and use induction as in any other DP scheme.)
Part IV
Mathematics of Uncertainty
Chapter 8

Basic Concepts from Statistics
A probability (mass function), P(x), over a discrete state space, Σ, must satisfy
$$
\forall x:\ 0 \le P(x) \le 1, \qquad (8.1)
$$
$$
\sum_{x\in\Sigma} P(x) = 1. \qquad (8.2)
$$
It is often useful to work with state spaces that are quantitative. For example, the set of
possible outcomes of a single coin toss, {Tail, Head}, can be mapped to the quantitative
state space, Σ = {0, 1}, by asking how many heads are observed after a single toss. For this
example, the probability mass function associated with this binary discrete sample space is
completely determined by a single parameter, call it β. So if P (1) = β, then P (0) = 1 − β
(See also Fig. 8.1.)
Terminology. In the example of the coin toss, we defined a random variable to be the
number of heads after one coin toss. If we call this random variable X, then the notation
P (1) = β and P (0) = 1 − β is really shorthand for P (X = 1) = β and P (X = 0) = 1 − β.
Figure 8.1: Probability mass function (left column) and cumulative distribution functions
(right column) for a Bernoulli random variable with parameter β ≡ 1/2 (top) and for a
Bernoulli random variable with parameter β > 1/2 (bottom).
Another common notation is PX (1) = β and PX (0) = 1 − β. All three notations mean “The
probability that exactly one head is observed after a toss is β.”
We could also write
$$
P(X = k) = \begin{cases} 1 - \beta, & \text{for } k = 0; \\ \beta, & \text{for } k = 1. \end{cases} \qquad (8.3)
$$
Figure 8.2: Probability Mass Functions (left column) and cumulative distribution functions
(right column) for Binomial random variables with parameters n = 2 and β = 1/2 (top)
and with parameters n = 5 and β > 1/2 (bottom).
of heads and tails; for example the sequence (H, T, H, H, ..., T) is one possible outcome. If we define the random variable Xᵢ to be the number of heads showing on the i-th toss, then the sequence (Xᵢ)ᵢ₌₁ⁿ is a (quantitative) sequence of ones and zeros that represents the outcome of n tosses.
For this example, we say that the random variables Xi are independent because the
outcome of each coin toss does not depend on the previous tosses, and we say they are
identically distributed because the underlying principles that determine the outcome of a
toss do not change from toss to toss. Random variables that are both independent and
identically distributed are given the shorthand i.i.d.
Let us define a new random variable, Y, to be the number of heads after n coin tosses, so Y = X₁ + X₂ + · · · + Xₙ. In this situation the {Xᵢ} are i.i.d., so the probability of observing exactly k heads is the number of sequences containing exactly k heads, given by the binomial coefficient, times the probability of each such sequence:
$$
P(Y = k) = \binom{n}{k} \beta^k (1 - \beta)^{n-k}. \qquad (8.5)
$$
The probability distribution described by (8.5) is called a binomial distribution with parameters n and β (Fig. 8.2). A random variable Y that follows a binomial distribution is called a binomial random variable, and we say that Y ∼ B(n, β) or Y ∼ Binom(n, β).
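A quick empirical sanity check of Eq. (8.5) (a sketch assuming numpy; the parameters are illustrative): simulate Y = X₁ + · · · + Xₙ many times and compare observed frequencies with the binomial PMF:

```python
# Empirical check of the binomial distribution, Eq. (8.5).
import numpy as np
from math import comb

n, beta, trials = 5, 0.5, 100_000
rng = np.random.default_rng(0)
Y = rng.binomial(n, beta, size=trials)        # samples of Y = X1 + ... + Xn
for k in range(n + 1):
    empirical = np.mean(Y == k)               # observed frequency of {Y = k}
    exact = comb(n, k) * beta**k * (1 - beta)**(n - k)
    print(k, round(empirical, 4), round(exact, 4))
```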
Let us now discuss an example of an unbounded discrete state space. Consider some
event that occurs by chance, for example, observing a meteor in the night sky. Let K be the
random variable that counts the number of such occurrences during a given period of time.
Figure 8.3: Probability mass functions (left column) and cumulative distribution functions
(right column) for Poisson random variables with parameter λ = 0.5 (top) and with param-
eter λ = 2 (bottom).
Then Σ = {0, 1, 2, . . . } (i.e., it is possible that there are no occurrences, one occurrence, two occurrences, and so on). It can be shown that under certain conditions K will have the probability distribution
$$
P(K = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \qquad \text{for } k = 0, 1, 2, \dots \qquad (8.6)
$$
The probability distribution described by (8.6) is called a Poisson distribution with pa-
rameter λ (Fig. 8.3). A random variable K that follows a Poisson distribution is called a
Poisson random variable, and we say that K ∼ Pois(λ). (Check that the probability defined
in Eq. (8.6) satisfies Eq. (8.2).)
Other real-life examples of random processes associated with the Poisson distribution (called Poisson processes) are: the probability distribution of the number of phone calls received at a call center in an hour, the probability distribution of the number of customers arriving at a shop or bank, the probability distribution of the number of typing errors per page, and many more.
Example 8.1.1. Are the Bernoulli and Poisson distributions related? Can you “design” a Poisson process from a Bernoulli process?
Solution. Consider repeating the Bernoulli process n times independently, thus drawing a sequence consisting of zeros and ones from a Binomial distribution. Check only for the ones and record the times (indexes) associated with their occurrences. Analyzing the probability of observing k ones among the n trials with β = λ/n, and taking the n → ∞ limit at fixed λ, one recovers the Poisson distribution (8.6).
The state space Σ can also be continuous. Random variables on continuous state spaces are associated with a probability density function that must satisfy
$$
\forall x \in \Sigma:\ p(x) \ge 0, \qquad (8.7)
$$
$$
\int_\Sigma dx\, p(x) = 1. \qquad (8.8)
$$
It is customary to use lower case p for the Probability Density Function and upper case, P ,
to denote actual probabilities. The Probability Density Function (PDF) provides a means
to compute probabilities that an outcome occurs in a given set or interval.
For example, for A ⊂ Σ, the probability of observing an outcome in the set A is given by
$$
P(A) = \int_A p(x)\, dx.
$$
Consider the probability that a real-valued X will take a value less than or equal to x:
$$
P(X \le x) = \int_{-\infty}^{x} p(x')\, dx'. \qquad (8.9)
$$
Eq. (8.9) extends to the continuous-space setting the notion of the CDF (we remind the reader that the abbreviation stands for the Cumulative Distribution Function, already familiar from Section 8.1.1).
The setting can be extended from infinite to finite intervals. The uniform distribution
on the interval [a, b] is an example of a distribution on a bounded continuous state space:
$$
\forall x \in [a, b]:\quad p(x) = \frac{1}{b - a}. \qquad (8.10)
$$
A random variable X with a probability distribution given by equation (8.10) can be de-
scribed by the notation X ∼ Unif(a, b). Fig. 8.4 illustrates PDF and CDF of Unif(a, b).
The Gaussian distribution is perhaps the most common (and also the most important)
continuous distribution:
$$
\forall x \in \mathbb{R}:\quad p(x|\sigma, \mu) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). \qquad (8.11)
$$
Figure 8.4: PDF (left column) and CDF (right column) for uniform random variables on
(0, 1) (top) and on (0, π) (bottom).
Figure 8.5: Probability density function (left column) and cumulative distribution func-
tion (right column) for normally distributed random variables N (0, 1) (top) and N (1, 0.52 )
(bottom).
Another possible notation is p_{σ,µ}(x). The distribution is parameterized by the mean, µ, and
by the variance, σ². Standard notation for the Gaussian/normal distribution is N(µ, σ²) or
\mathcal{N}(µ, σ²). Fig. 8.5 illustrates the PDF and CDF of N(µ, σ²).
The probability distribution given by (8.11) is also called a normal distribution—where
“normality” refers to the fact that the Gaussian distribution is a “normal/natural” outcome
of summing up many random numbers, regardless of the distributions for the individual
contributions. (We will discuss the law of large numbers and central limit theorem shortly.)
Let us make a brief remark about notation. We will often write P(X = x), or the shortcut
P(x); in the literature one also sees P_X(x). By convention, upper case letters denote random
variables. A random variable takes on values in some domain, and a particular observation
of a random variable (that is, once it has been sampled and observed to have a particular
value in the domain) is a non-random value, denoted by a lower case letter, e.g. x.
X ∼ P(x) denotes the fact that the random variable X is drawn from the distribution P(x).
Also, when it is not confusing, we will streamline the notation (and thus abuse it a bit) and
use lower case variables across the board – for both a random variable and a deterministic
value the random variable takes.
Figure 8.6: The first cumulant describes the central tendency (mean) of a probability distri-
bution (green). The second cumulant describes the spread about the mean (orange). The
third cumulant describes the asymmetry of the distribution (blue). The fourth cumulant
describes whether extreme values are unusually rare or common (red).
Notation. Common notation for the expectation of a random variable X with probability
mass function P includes E[X], EP [X], hXi and hXiP .
Example 8.2.1. Consider the example of tossing a pair of fair coins. The set of possible
outcomes is {(T, T ), (T, H), (H, T ), (H, H)}, each outcome occurring with equal probability.
Define the random variable X to be the number of heads that are observed, so X ∼ B(2, 1/2)
and
P(X = x) = \begin{cases} 1/4, & x = 0,\\ 1/2, & x = 1,\\ 1/4, & x = 2. \end{cases}
The expected number of heads is
E[X] = \sum_{x ∈ \{0,1,2\}} x\, P(x) = 0 · 1/4 + 1 · 1/2 + 2 · 1/4 = 1.
The expectation can also be defined for functions of a random variable. Consider a
function, f (x), and its expectation over the probability, P (x):
E_P[f(x)] = \langle f(x) \rangle_P = \sum_{x ∈ Σ} f(x) P(x) \qquad \text{(discrete)}

E_p[f(x)] = \langle f(x) \rangle_p = \int_Σ f(x) p(x)\, dx \qquad \text{(continuous)}
Example 8.2.2. Consider a scenario where a gambler wins $200 for tossing a pair of heads,
but loses $100 for any other outcome. If we define f : Σ → R to be the earnings, then the
expectation of f is calculated to be E[f] = 200 · 1/4 − 100 · 3/4 = −25; the gambler loses
$25 per game on average.
The variance of a random variable measures its expected spread about its mean. Note
that the mean and the variance do not have the same units (the units of the variance
are the square of the units of the mean), so it can be difficult to meaningfully interpret
the variance. Consequently, it is common to consider the standard deviation of a random
variable, σ, which is defined as the square root of the variance.
Example 8.2.3. Compute the variance and the standard deviation of X and f for
Example 8.2.1 and Example 8.2.2.
Solution. Var[X] = (0 − 1)² · 1/4 + (1 − 1)² · 1/2 + (2 − 1)² · 1/4 = 1/2, and σ = 1/√2 ≈ 0.71.
Var[f(X)] = (−100 + 25)² · 3/4 + (200 + 25)² · 1/4 = 16875, and σ ≈ 129.9.
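These numbers are easy to check directly from the definitions; the following minimal sketch (assuming Python with numpy) recomputes the means and variances of Examples 8.2.1–8.2.3:

import numpy as np

# Outcomes of two fair coins: number of heads X and gambler's earnings f.
p = np.array([0.25, 0.5, 0.25])        # P(X = 0), P(X = 1), P(X = 2)
x = np.array([0.0, 1.0, 2.0])
f = np.array([-100.0, -100.0, 200.0])  # $200 only for a pair of heads (X = 2)

for vals in (x, f):
    mean = np.sum(p * vals)
    var = np.sum(p * (vals - mean) ** 2)
    print(mean, var, np.sqrt(var))
# Prints: 1.0, 0.5, 0.707...  and  -25.0, 16875.0, 129.9...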
Example 8.2.4. The Cauchy distribution plays an important role in physics, since it
describes resonance behavior, e.g. the spectral line shape of a laser. The probability
density function of a Cauchy distribution with parameters a ∈ R and γ > 0 is given by

p(x|a, γ) = \frac{1}{π} \frac{γ}{(x − a)^2 + γ^2}, \qquad −∞ < x < +∞. \tag{8.15}
Show that the probability distribution is properly normalized and find its mean. What can
you say about its variance?
Solution. To verify that equation (8.15) is properly normalized, we must show that the prob-
ability density integrates to unity, which can be done with the trig-substitution, tan(θ) =
(x − a)/γ.
The mean equals a by symmetry, E[X] = a, which is calculated using a principal value
integral from residue calculus. To compute the second moment, we attempt to evaluate
the integral

\text{variance} = \frac{γ}{π} \int_{−∞}^{+∞} \frac{(x − a)^2\, dx}{(x − a)^2 + γ^2},

and find that it is unbounded. Since this integral is unbounded, we conclude that the
variance of a Cauchy distribution does not exist (it is infinite).
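The absence of a finite variance has a striking practical consequence: empirical averages of Cauchy samples never settle down. A minimal simulation sketch (assuming numpy; the seed and sample sizes are arbitrary choices) illustrates this:

import numpy as np

rng = np.random.default_rng(0)
a, gamma = 0.0, 1.0
samples = a + gamma * rng.standard_cauchy(1_000_000)

# Running averages of, say, Gaussian samples would converge to the mean;
# Cauchy running averages keep jumping around no matter how large n is.
for n in [10**2, 10**4, 10**6]:
    print(n, samples[:n].mean())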
The concepts of expectation and variance can be generalized to the moments of a distribution.
For a discrete random variable with probability distribution P(x), the moments of P(x) are
defined as follows:

k = 0, 1, \cdots : \quad µ_k := E_P[X^k] = \langle X^k \rangle_P = \sum_{x ∈ Σ} x^k P(x). \tag{8.16}
For a continuous random variable X with probability density p(x) = p_X(x), the moments
of X are:

k = 0, 1, \cdots : \quad µ_k := E_p[X^k] = \langle X^k \rangle_p = \int_Σ dx\, x^k p(x). \tag{8.17}
From the definitions in equations (8.16) and (8.14), it follows that the first moment of a
random variable is equivalent to its mean, E[X] = µ_1, and the second moment is related to
the variance according to Var[X] = µ_2 − µ_1².
Example 8.2.5. Give a closed-form expression for the moments of a Bernoulli distribution
with parameter β. Use the first and second moment to find the mean and variance of the
Bernoulli distribution.
p(x) = βδ(1 − x) + (1 − β)δ(x). (8.18)
Solution.
k = 1, 2, \cdots : \quad µ_k = \langle X^k \rangle = \int_{−∞}^{∞} x^k p(x)\, dx = β. \tag{8.19}
The mean of a distribution is equal to its first moment: µ = µ_1. In this case µ = β. The
variance of a distribution is equal to the combination of its first two moments given by
σ 2 = µ2 − µ21 . In this case the variance is σ 2 = β − β 2 = β(1 − β).
Example 8.2.6. What is the mean number of events for a Poisson random variable,
Pois(λ)? What is the variance of the Poisson distribution?
Solution. For this example, we will compute the mean and the variance of the Poisson
distribution by first computing its first and second moments:
µ_1 = \sum_{k=0}^{∞} k\, P(k) = \sum_{k=0}^{∞} \frac{k λ^k}{k!} e^{−λ} = e^{−λ} \sum_{k=1}^{∞} \frac{λ^k}{(k−1)!} = λ e^{−λ} \sum_{n=0}^{∞} \frac{λ^n}{n!} = λ.
A similar calculation gives µ_2 = λ² + λ. Therefore, the mean (average) and the variance
of the number of events are

µ := µ_1 = λ \quad \text{and} \quad σ² := µ_2 − µ_1² = λ.
Note that the expectation and the variance of a Poisson distribution are both equal to the
same value, λ. This is not generally true for other distributions.
The moment generating function of a random variable X is defined as

M_X(t) := E[\exp(tX)], \tag{8.20}

where t ∈ R and all the integrals are assumed well defined. When \exp(tx) is expressed by its
Taylor series, we find that the moment generating function can be expressed as an infinite
sum involving the moments, µ_k, of the random variable:
M_X(t) = \int_{−∞}^{∞} dx\, p(x) \exp(tx) = \int_{−∞}^{∞} dx\, p(x) \sum_{k=0}^{∞} \frac{(tx)^k}{k!} = \sum_{k=0}^{∞} \frac{µ_k t^k}{k!}. \tag{8.21}
The name ‘moment generating function’ arises from the observation that differentiating
MX (t) k times and evaluating the result at t = 0 recovers the k th moment of X.
\frac{d^k}{dt^k} M_X(t) \bigg|_{t=0} = \frac{d^k}{dt^k} \left( \sum_{m=0}^{∞} \frac{µ_m t^m}{m!} \right) \bigg|_{t=0} = µ_k. \tag{8.22}
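Eq. (8.22) is easy to verify symbolically. As a sketch (assuming Python with sympy; the choice of the Poisson MGF, M_X(t) = exp(λ(e^t − 1)), anticipates Example 8.2.9 below with it → t), differentiate at t = 0:

import sympy as sp

t = sp.symbols('t')
lam = sp.symbols('lam', positive=True)
M = sp.exp(lam * (sp.exp(t) - 1))          # MGF of Pois(lam), a known closed form

for k in range(1, 4):
    mu_k = sp.diff(M, t, k).subs(t, 0)     # k-th derivative at t = 0, Eq. (8.22)
    print(k, sp.expand(mu_k))
# 1: lam;  2: lam**2 + lam;  3: lam**3 + 3*lam**2 + lam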
Example 8.2.7. Consider the Gibbs (Boltzmann) distribution, p(x) = e^{−βE(x)}/Z,
where β = 1/T is the inverse temperature and E(x) is a known function of x, called the
energy of the state x. The normalization factor Z is called the partition function. Suppose
we know the partition function, Z(β), as a function of the inverse temperature, β. Compute
the expected mean value and the variance of the energy.
Solution. The mean (average) value of the energy is
\langle E(X) \rangle = \sum_x p(x) E(x) = \frac{1}{Z} \sum_x E(x) e^{−βE(x)} = −\frac{1}{Z} \frac{∂Z}{∂β} = −\frac{∂ \ln Z}{∂β}. \tag{8.24}

Similarly, for the variance,

\text{Var}[E(X)] = \langle (E(X) − \langle E(X) \rangle)^2 \rangle = \frac{∂^2 \ln Z}{∂β^2}. \tag{8.25}
Notice that, up to a sign inversion of its argument, the partition function is equivalent to
the moment generating function (8.20): Z(β) = M_{E(X)}(−β).
The characteristic function of a random variable is defined as the Fourier transform of its
probability density:

G(t) := \int_{−∞}^{+∞} dx\, p(x) e^{itx}, \tag{8.26}

where i² = −1. The characteristic function exists for any real t and it obeys the following
relations
G(0) = 1, |G(t)| ≤ 1. (8.27)
The characteristic function contains information about all the moments µk . Moreover it
allows the Taylor series representation in terms of the moments:
G(t) = \sum_{k=0}^{∞} \frac{(it)^k}{k!} \langle X^k \rangle, \tag{8.28}
and thus
\langle X^k \rangle = \frac{1}{i^k} \frac{∂^k}{∂t^k} G(t) \bigg|_{t=0}. \tag{8.29}
This implies that the derivatives of G(t) at t = 0 exist up to the same order m as the moments µ_k do.
Example 8.2.8. Find the characteristic function of a Bernoulli distribution with parameter
β.
Solution. Substituting Eq. (8.18) into Eq. (8.26), one derives

G(t) = 1 − β + β e^{it}, \tag{8.30}

and thus

µ_k = \frac{1}{i^k} \frac{∂^k}{∂t^k} \left( 1 − β + β e^{it} \right) \bigg|_{t=0} = β. \tag{8.31}
The result is naturally consistent with Eq. (8.19).
Exercise 8.1. The exponential distribution has a probability density function given by
p(x) = \begin{cases} A e^{−λx}, & x ≥ 0,\\ 0, & x < 0. \end{cases} \tag{8.32}
8.2.5 Cumulants
The cumulants, κ_k, of a random variable are defined via the Taylor expansion of the
logarithm of its characteristic function,

\ln G(t) = \sum_{k=1}^{∞} κ_k \frac{(it)^k}{k!}. \tag{8.33}

According to Eq. (8.27), the Taylor series of G(t) in Eq. (8.33) starts from unity. Utilizing
Eqs. (8.28) and (8.33), one derives the following relations between the cumulants and the
moments:

κ_1 = µ_1, \tag{8.34}

κ_2 = µ_2 − µ_1^2 = σ^2. \tag{8.35}
Example 8.2.9. Find the characteristic function and the cumulants of the Poisson distri-
bution (8.6).
Solution. The respective characteristic function is
G(t) = \sum_{k=0}^{∞} e^{−λ} \frac{λ^k}{k!} e^{itk} = e^{−λ} \sum_{k=0}^{∞} \frac{(λ e^{it})^k}{k!} = \exp\left( λ (e^{it} − 1) \right), \tag{8.36}
and then

\ln G(t) = λ (e^{it} − 1). \tag{8.37}

Expanding the exponent in powers of it and comparing with Eq. (8.33), one finds that all
the cumulants of the Poisson distribution are equal: κ_k = λ for every k ≥ 1.
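Eq. (8.36) can also be confirmed by Monte Carlo, estimating G(t) = ⟨e^{itX}⟩ directly from samples (a minimal sketch assuming numpy; the values of λ, t and the sample size are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
lam = 2.0
samples = rng.poisson(lam, size=200_000)

for t in [0.3, 1.0, 2.5]:
    g_mc = np.mean(np.exp(1j * t * samples))        # empirical <exp(itX)>
    g_exact = np.exp(lam * (np.exp(1j * t) - 1.0))  # Eq. (8.36)
    print(t, abs(g_mc - g_exact))                   # small, ~1/sqrt(N) noise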
Example 8.2.10 (Birthday Problem). Assume that a year has 366 days. What is the
probability, p_m, that m people in a room all have different birthdays?
Solution. Let (b_1, b_2, . . . , b_m) be the list of the people's birthdays, b_i ∈ {1, 2, . . . , 366}.
There are 366^m different lists, all equiprobable. We should count the lists which have
b_i ≠ b_j, ∀i ≠ j. The number of such lists is \prod_{i=1}^{m} (366 − i + 1). Then, the
final answer is

p_m = \prod_{i=1}^{m} \left( 1 − \frac{i − 1}{366} \right). \tag{8.38}

The probability that at least 2 people in the room have the same birthday is 1 − p_m.
Note that 1 − p_23 > 0.5 and 1 − p_22 < 0.5.
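A few lines of code (a sketch in Python with numpy) reproduce Eq. (8.38) and the threshold between m = 22 and m = 23:

import numpy as np

def p_distinct(m, days=366):
    # Eq. (8.38): probability that m birthdays are all different
    i = np.arange(1, m + 1)
    return np.prod(1.0 - (i - 1) / days)

for m in (22, 23):
    print(m, 1.0 - p_distinct(m))
# 1 - p_22 ≈ 0.475 < 0.5 and 1 - p_23 ≈ 0.506 > 0.5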
Exercise. (not graded) Choose, at random, three points on the circle of unit radius. In-
terpret them as cuts that divide the circle into three arcs. Compute the expected length of
the arc that contains the point (1, 0).
• (Markov's inequality) For a non-negative random variable X and any a > 0,

P(X ≥ a) ≤ \frac{E[X]}{a}. \tag{8.39}
• (Chebyshev’s inequality)
P(|X − µ| ≥ b) ≤ \frac{σ^2}{b^2}, \tag{8.40}
where µ and σ² are the mean and the variance of X.
• (Chernoff bound)
P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ \frac{E[e^{tX}]}{e^{ta}}, \tag{8.41}
where X ∈ R and t ≥ 0.
We will get back to the discussion of some other useful probabilistic inequalities in the
lecture devoted to entropy and to how to compare probabilities.
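To build intuition for how these bounds compare, here is a small numeric sketch (assuming numpy/scipy; the choice X ∼ Binomial(100, 0.5) with the tail P(X ≥ 70) is only an illustration):

import numpy as np
from scipy.stats import binom

n, p, a = 100, 0.5, 70
mean, var = n * p, n * p * (1 - p)

markov = mean / a                            # Eq. (8.39), X >= 0
chebyshev = var / (a - mean) ** 2            # Eq. (8.40) with b = a - mean
ts = np.linspace(1e-3, 2.0, 2000)
mgf = (1 - p + p * np.exp(ts)) ** n          # E[exp(tX)] for a Binomial
chernoff = np.min(np.exp(-ts * a) * mgf)     # Eq. (8.41), optimized over t
exact = binom.sf(a - 1, n, p)                # true tail P(X >= a)

print(markov, chebyshev, chernoff, exact)
# Roughly: 0.71, 0.0625, 2.7e-4, 4e-5 -- the Chernoff bound is by far the tightest.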
Figure 8.7: The four possible outcomes of two successive coin flips: (H, H), (H, T), (T, H), (T, T).
In the preceding sections we often started from independent random variables X_1, X_2, · · ·
and then created a new variable by taking a function, e.g. a sum, of the original random
variables. The assumed independence was useful for reaching the goal of designing a new
random variable (e.g. for transitioning from a Bernoulli random variable to a Poisson
random variable); however, not all random variables are independent. In the coming
Section we will learn how to describe dependencies, or correlations (the opposite of
independence), in high-dimensional statistics.
Consider an n-component random vector, X, and let P(x) be the probability that the
state x is observed, where \sum_x P(x) = 1. (Recall: x (lower case) represents a particular
realization of the random variable X (upper case); so if each component X_i takes values
in Σ = {0, 1}, then X is a random vector of length n with entries zero or one, and x
might be (1, 1, 0, . . . , 1).) We wish to ask two related questions about P:
1. Marginalization: The probability of observing a state where one or more of the com-
ponents attain certain values.
2. Conditioning: The probability of observing a state given that the value(s) attained
by one or more component is known.
Example 8.4.1. Let Xi be the random variable for the number of heads observed after
flipping a fair coin on the ith toss. So Σ = {0, 1}, and P (Xi = 0) = 1/2 and P (Xi = 1) =
1/2. Let X = (X1 , X2 ) be the random vector showing the outcome of two successive coin
flips. So the probability of each possible outcome is
P (X = (0, 0)) = 1/4, P (X = (1, 0)) = 1/4, P (X = (0, 1)) = 1/4, P (X = (1, 1)) = 1/4,
See Fig. (8.7) for illustration. The following questions are examples of marginalization and
conditioning:

1. Marginalization: What is the probability that the outcome of the first toss is a “1”?

2. Conditioning: It is known that the outcome of the first toss is a “1”; what is the set
of possible outcomes and their respective probabilities?
Solution. 1. From the list of all possible outcomes, we see that there are two outcomes
with a “1” in the first entry (namely, (1, 0) and (1, 1)), each with probability 1/4.
Therefore, P(X_1 = 1) = 1/4 + 1/4 = 1/2.

2. Under the condition that the outcome of the first toss is “1”, all other outcomes must
have zero probability. The only two possible outcomes are (1, 0) and (1, 1), which occur
with equal probability. Therefore, P(X = (1, 0) | X_1 = 1) = P(X = (1, 1) | X_1 = 1) = 1/2.
Consider the n-component Ising model on a line: each component x_i takes values in
{−1, +1}, and the joint probability distribution is

P(x) = \frac{1}{Z} \exp\left( J \sum_{i=1}^{n−1} x_i x_{i+1} \right),

where J is a constant that determines the coupling strength between adjacent components
and Z is the normalization constant (see Figure 8.8). The normalization constant, also
called the partition function, is introduced to guarantee that the sum of the probabilities
over all the states is unity. (See also Example 8.2.7.)
For n = 2 one gets an example of a bi-variate probability distribution:

P(x) = P(x_1, x_2) = \frac{\exp(J x_1 x_2)}{4 \cosh(J)}. \tag{8.44}
P(x) is called a joint, or multivariate, probability distribution function of x = (x_1, · · · , x_n),
because it gives the probability of all the components, x_1, · · · , x_n, jointly.
Figure 8.8: Two possible realizations of the n-component Ising model on a line, e.g.
(+1, −1, −1, +1, +1, −1, +1, −1) and (−1, −1, +1, +1, +1, −1, +1, +1).
Figure 8.9: Set of all possible outcomes for the 2-component Ising model and their relative
probabilities: P(+1, +1) ∝ exp(+J), P(+1, −1) ∝ exp(−J), P(−1, +1) ∝ exp(−J),
P(−1, −1) ∝ exp(+J). The normalization constant, Z, is found by summing the
probabilities over all states.
The quantity

P(x_1 | x_2) = \frac{P(x_1, x_2)}{\sum_{x_1} P(x_1, x_2)} = \frac{\exp(J x_1 x_2)}{2 \cosh(J x_2)} \tag{8.45}

is the probability of observing x_1 under the condition that x_2 is known. Notice that
\sum_{x_1} P(x_1 | x_2) = 1, ∀x_2.
We can marginalize the multivariate (joint) distribution over a subset of variables. For
example,
P(x_1) = \sum_{x \setminus x_1} P(x) = \sum_{x_2, \cdots, x_n} P(x_1, \cdots, x_n). \tag{8.46}
Consider next the multivariate Gaussian distribution (here written for zero mean),
p(x) = \exp(−x^T A x / 2)/Z, with a symmetric, positive definite n × n matrix A. The
normalization constant is

Z = \frac{(2π)^{n/2}}{\sqrt{\det A}}. \tag{8.48}
The second moments of the Gaussian distribution are \langle x_i x_j \rangle = A^{−1}_{ij} = Σ_{ij},
where A^{−1}_{ij} denotes the (i, j) component of the inverse of the matrix A. The matrix Σ
(which is also symmetric and positive definite, as its inverse is by construction) is called
the covariance matrix. Standard notation for multivariate Gaussian statistics with mean
vector µ = (µ_i | i = 1, · · · , n) and covariance matrix Σ is N(µ, Σ) or N_n(µ, Σ).
The Gaussian distribution is remarkable because of its “invariance” properties.
Split an n-dimensional Gaussian vector x ∼ N(µ, Σ) into two blocks, x = (x_1, x_2), with

µ = \begin{pmatrix} µ_1 \\ µ_2 \end{pmatrix}, \qquad Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix}, \tag{8.50}

where thus µ_1 and µ_2 are p- and q-dimensional vectors and Σ_{11}, Σ_{12}, Σ_{21} and Σ_{22} are (p × p),
(p × q), (q × p) and (q × q) matrices. Then, the following two statements hold:

• Marginalization: p(x_1) := \int dx_2\, p(x_1, x_2) is the following Normal/Gaussian distribution: N(µ_1, Σ_{11}).

• Conditioning: p(x_1 | x_2) := \frac{p(x_1, x_2)}{p(x_2)} is the Normal/Gaussian distribution N(µ_{1|2}, Σ_{1|2}), where

µ_{1|2} = µ_1 + Σ_{12} Σ_{22}^{−1} (x_2 − µ_2), \qquad Σ_{1|2} = Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21}. \tag{8.51}
Proof of the theorem is recommended as a useful technical exercise (not graded), which
requires direct use of some basic linear algebra. (You will need to use or derive an explicit
formula for the inverse of a positive definite matrix split into four blocks, as in Eq. (8.50).)
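The theorem is also easy to test numerically. The sketch below (assuming numpy; the particular µ, Σ and conditioning value x_2 = 0.5 are arbitrary illustrations) compares the conditional formulas of Eq. (8.51) against a brute-force estimate from samples:

import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
xs = rng.multivariate_normal(mu, Sigma, size=2_000_000)

x2_star = 0.5
# Conditional law of x1 given x2 (the Schur-complement formulas of Eq. (8.51)):
mu_1g2 = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x2_star - mu[1])
var_1g2 = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

# Empirical check: keep the samples whose x2 lands near x2_star.
sel = np.abs(xs[:, 1] - x2_star) < 0.01
print(mu_1g2, xs[sel, 0].mean())    # conditional means agree
print(var_1g2, xs[sel, 0].var())    # conditional variances agree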
Take n random instances, also called samples, X_1, · · · , X_n, generated i.i.d. from a
distribution with mean µ and variance σ² > 0, and compute Y_n = \sum_{i=1}^{n} X_i / n.
What is Prob(Y_n)?
Theorem 8.4.3 (Weak Version of the Central Limit Theorem). \sqrt{n}(Y_n − µ) converges in
distribution to a Gaussian with mean zero and variance σ², i.e.

n → ∞ : \quad \sqrt{n} \left( \frac{1}{n} \sum_{i=1}^{n} X_i − µ \right) ∼ N(0, σ²). \tag{8.52}
Let us sketch the proof of the weak-CLT (8.52) in the simple case µ = 0, σ = 1. Obviously,
µ_1(Y_n \sqrt{n}) = 0. Compute

µ_2(Y_n \sqrt{n}) = E\left[ \left( \frac{X_1 + · · · + X_n}{\sqrt{n}} \right)^2 \right] = \frac{\sum_i E[X_i^2]}{n} + \frac{\sum_{i ≠ j} E[X_i X_j]}{n} = 1.
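A simulation sketch (assuming numpy; the uniform distribution, n and the number of trials are arbitrary choices) makes the statement concrete: rescaled averages of i.i.d. uniform samples look Gaussian already for moderate n:

import numpy as np

rng = np.random.default_rng(3)
n, trials = 1000, 20_000
# sqrt(n) * (Y_n - mu) for i.i.d. Uniform(0,1) samples (mu = 1/2, sigma^2 = 1/12)
z = np.array([np.sqrt(n) * (rng.uniform(size=n).mean() - 0.5)
              for _ in range(trials)])
print(z.mean(), z.var())   # close to 0 and 1/12 ≈ 0.0833, as in Eq. (8.52)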
Example 8.4.4 (Sum of Gaussian variables). Compute the probability density, p_n(y_n), of
the random variable Y_n = n^{−1} \sum_{i=1}^{n} X_i, where X_1, X_2, . . . , X_n are sampled i.i.d. from the
normal distribution
p(x) = N(µ, σ²) = \frac{1}{\sqrt{2π}\, σ} \exp\left( −\frac{(x − µ)^2}{2σ^2} \right),
exactly.
Solution. First, recall that the characteristic function of a Gaussian distribution is a Gaus-
sian
G(t) = \int_{\mathbb{R}} e^{itx} p(x)\, dx = \exp\left( iµt − \frac{σ^2 t^2}{2} \right).
Therefore, the characteristic function of Y_n is G_{Y_n}(t) = (G(t/n))^n = \exp(iµt − σ²t²/(2n));
that is, Y_n ∼ N(µ, σ²/n) exactly. Repeating the same computation for i.i.d. samples from
the Cauchy distribution (8.15), whose characteristic function is G(t) = \exp(iat − γ|t|),
gives G_{Y_n}(t) = (G(t/n))^n = \exp(iat − γ|t|). This expression shows that for any n the
random variable Y_n is Cauchy-distributed with exactly the same width parameter as the
individual samples. The CLT “fails” in this case because we have ignored an important
requirement/condition for the CLT to hold – existence of the variance. (See Example 8.2.4.)
Exercise 8.2. Assume that you play a dice game 100 times. Awards for the game are as
follows: $0.00 for rolling a 1, 3 or 5; $2.00 for a 2 or 4; and $26.00 for a 6.
Exercise (not graded). Experiment with the CLT for different distributions mentioned in
the lecture.
The CLT holds for independent but not necessarily identically distributed variables too.
(That is, one can use different distributions to generate different variables in the summed-up
sequence.)

We may be interested not only in deviations on the order of the standard deviation but
also in arbitrary (large) deviations.
Theorem 8.4.6 (Cramér theorem (strong version of the CLT)). The normalized sum,
Y_n = \sum_{i=1}^{n} X_i / n, of the i.i.d. variables X_i ∼ p_X(x) satisfies

∀x > µ : \quad \lim_{n→∞} \frac{1}{n} \log \text{Prob}(Y_n ≥ x) = −Φ^*(x), \tag{8.56}

Φ^*(x) := \sup_{t ∈ \mathbb{R}} (tx − Φ(t)), \tag{8.57}

Φ(t) := \log\left( E \exp(tX) \right), \tag{8.58}
where, Φ(t), is the cumulant generating function of pX (x) and Φ∗ (x) is the Legendre-Fenchel
transform of the cumulant generating function, also called the Cramér function.
Three comments are in order. First, an informal (“physical”) version of Eq. (8.56) is
Prob(Y_n ≈ x) ∼ \exp(−nΦ^*(x)).
Second, the cumulant generating function, Φ(t), is equal to the logarithm of the
characteristic function (8.26) evaluated at minus an imaginary argument, i.e.
Φ(t) = \log G(−it). Also, and third, the weak version of the CLT (8.52) is equivalent to
approximating the Cramér function (asymptotically exactly) by its quadratic expansion
around its minimum, i.e. by a Gaussian distribution centered there.
Exercise (not graded). Prove the strong-CLT (8.56, 8.57). [Hint: use the saddle-point/sta-
tionary-point method to evaluate the integrals.] Give an example of an expectation for
which not only the vicinity of the minimum but also other details of Φ^*(x) are significant
as n → ∞. More specifically, give an example of an object whose behavior is controlled
solely by the left/right tail of Φ^*(x), or by Φ^*(0) and its vicinity.
Example 8.4.7. Compute the Cramér function for the Bernoulli process, i.e. a (generally
unfair) coin toss:

X = \begin{cases} 0, & \text{with probability } 1 − β;\\ 1, & \text{with probability } β. \end{cases} \tag{8.60}
Solution. The cumulant generating function is

Φ(t) = \log\left( 1 − β + β e^{t} \right), \tag{8.61}

and maximizing tx − Φ(t) over t gives, for 0 < x < 1,

Φ^*(x) = x \log\frac{x}{β} + (1 − x) \log\frac{1 − x}{1 − β}. \tag{8.62}
Eqs. (8.61, 8.62) are notable for two reasons. First of all, they lead (after some algebraic
manipulations) to the famous Stirling formula for the asymptotics of the factorial,

n! = \sqrt{2πn} \left( \frac{n}{e} \right)^n (1 + O(1/n)).
(Do you see how?) Second, the x \log x structure is an “entropy”, which will appear a number
of times in the following lectures - stay tuned.
The Cramér theorem (8.4.6) gives a powerful asymptotic result; however, it says nothing
about the rate of convergence, i.e., the behavior at large but finite n. Quite remarkably, this
deficiency can be cured by the following application of the Chernoff bound (8.3.1).
Theorem 8.4.8 (Chernoff Bound Version of the Central Limit Theorem, adapted from
[17]). Let X_1, · · · , X_n be i.i.d. random variables with E[X_i] = µ and a well-defined (bounded)
Cramér function Φ^*(x) = \sup_t (tx − \log E[\exp(tX_i)]). Then,

P\left( \sum_{i=1}^{n} X_i ≥ nx \right) ≤ \exp(−nΦ^*(x)), \quad ∀x > µ,

P\left( \sum_{i=1}^{n} X_i ≤ nx \right) ≤ \exp(−nΦ^*(x)), \quad ∀x < µ.
Proof. Consider the case of x > µ:

P\left( \sum_{i=1}^{n} X_i ≥ nx \right) = P\left( e^{t \sum_{i=1}^{n} X_i} ≥ e^{tnx} \right) \quad \text{for any } t > 0 \text{ (to be chosen later)}

≤ e^{−tnx}\, E\left[ e^{t \sum_{i=1}^{n} X_i} \right] \quad \text{by Markov's inequality (its Chernoff-bound version)}

= e^{−tnx} \prod_{i=1}^{n} E[e^{tX_i}] = e^{−tnx + nΦ(t)} \quad \text{since the } X_i \text{ are i.i.d.}

≤ \exp\left( −n \sup_{t>0} (tx − Φ(t)) \right) = \exp\left( −n \sup_{t∈\mathbb{R}} (tx − Φ(t)) \right)

= \exp(−nΦ^*(x)).
Here, Φ(t) = \log(E[\exp(tX_i)]) is convex in t; Φ^*(x) is convex in x and achieves its
minimum at x = µ. Therefore, for x > µ the sup in Φ^*(x) = \sup_{t∈\mathbb{R}}(tx − Φ(t)) is achieved
at t > 0 (positive slope), and we can thus replace \sup_{t∈\mathbb{R}} by \sup_{t>0}. This completes the
proof for x > µ.

In the case of x < µ, one needs to pick t < 0; otherwise the proof is fully equivalent.
Exercise 8.3. Let X = \sum_{i=1}^{n} X_i, where the X_i are independent (not necessarily identically
distributed) Poisson random variables. That is, each X_i is independently drawn from a
Poisson distribution with parameter λ_i, i.e. X_i ∼ Pois(λ_i). Denote the characteristic
function of X by G_X(t) and the characteristic function of X_i by G_{X_i}(t). Show that

(1) G_X(t) = \prod_{i=1}^{n} G_{X_i}(t);

(2) X ∼ Pois(λ), where λ = \sum_{i=1}^{n} λ_i.
We already saw how to get the conditional probability distribution and the marginal
probability distribution from the joint probability distribution:
P(x|y) = \frac{P(x, y)}{P(y)}, \qquad P(y|x) = \frac{P(x, y)}{P(x)}. \tag{8.63}
Combining the two formulas to exclude the joint probability distribution, we arrive at the
famous Bayes formula

P(x|y) P(y) = P(y|x) P(x). \tag{8.64}
Here, in Eqs. (8.63, 8.64), both x and y may be multivariate. Rewriting Eq. (8.64) as

P(x|y) = \frac{P(y|x) P(x)}{P(y)}, \tag{8.65}
one often refers (in the field of so-called Bayesian inference/reconstruction) to P(x) as
the “prior” probability distribution, which measures the degree of the initial “belief” in X.
Then P(x|y), called the “posterior”, measures the degree of the (statistical) dependence
of x on y, and the quotient P(y|x)/P(y) represents the “support/knowledge” y provides about x.
A good visual illustration of the notion of the conditional probability can be found at
http://setosa.io/ev/conditional-probability/.
Example 8.4.9. Consider the three component Ising model. (a) Compute the normaliza-
tion constant. (b) Compute the marginal probability, P (x1 ). (c) Compute the conditional
probability, P (x3 |x1 ).
Solution. The set of all possible outcomes is shown in Figure 8.10. (a) The normalization
constant, Z, is found by summing the probabilities over all the states:

Z = 2 e^{2J} + 4 e^{0} + 2 e^{−2J} = 4 + 4 \cosh(2J).

Therefore, the probabilities of the states are P(1, 1, 1) = e^{2J} / (4 + 4 \cosh(2J)), etc.
(b) The marginal probability, P(x_1), is given by summing all the probabilities corresponding
to P(x_1 = +1) and all the probabilities corresponding to P(x_1 = −1):

P(x_1 = +1) = P(1, 1, 1) + P(1, 1, −1) + P(1, −1, −1) + P(1, −1, 1) = \frac{e^{2J} + e^{0} + e^{0} + e^{−2J}}{4 + 4\cosh(2J)} = \frac{1}{2},

P(x_1 = −1) = P(−1, −1, −1) + P(−1, −1, 1) + P(−1, 1, 1) + P(−1, 1, −1) = \frac{e^{2J} + e^{0} + e^{0} + e^{−2J}}{4 + 4\cosh(2J)} = \frac{1}{2}.
Figure 8.10: Set of all possible outcomes for the 3-component Ising model and their relative
probabilities: the two aligned configurations, (+1, +1, +1) and (−1, −1, −1), have
P(x) ∝ exp(+2J); the two alternating configurations, (+1, −1, +1) and (−1, +1, −1), have
P(x) ∝ exp(−2J); and the remaining four configurations have P(x) ∝ exp(0 · J). The
normalization constant, Z, is found by summing the probabilities over all states.
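Part (c), and in fact the whole example, can be cross-checked by brute-force enumeration of the 8 states. Here is a minimal sketch (assuming Python with numpy; the value J = 0.7 is arbitrary):

import itertools
import numpy as np

J = 0.7
states = list(itertools.product([-1, +1], repeat=3))
w = np.array([np.exp(J * (x1 * x2 + x2 * x3)) for (x1, x2, x3) in states])
Z = w.sum()
print(Z, 4 + 4 * np.cosh(2 * J))                 # the two agree

P = w / Z
# Marginal P(x1 = +1) and conditional P(x3 = +1 | x1 = +1):
p1 = sum(P[i] for i, s in enumerate(states) if s[0] == +1)
p31 = sum(P[i] for i, s in enumerate(states) if s[0] == +1 and s[2] == +1) / p1
print(p1, p31)   # p1 = 1/2; p31 = cosh(2J)/(1 + cosh(2J)), cf. part (c)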
Exercise 8.4. The joint probability density of two real random variables X1 and X2 is
∀x_1, x_2 ∈ \mathbb{R} : \quad p(x_1, x_2) = \frac{1}{Z} \exp(−x_1^2 − x_1 x_2 − x_2^2).
(1) Calculate the normalization constant Z.
Suppose we wish to quantify the amount of information gained by learning that a
particular outcome actually occurred. We suppose that the information content of an
outcome, x, which we denote h(x) and will also call the surprise, depends only on the
probability of the outcome.
The question becomes: how do we quantify the information content? Let us start formu-
lating a list of requirements that the information content (surprise) must satisfy:

1. Learning the outcome of a certain (deterministic) event provides no information. That is,
h(x) = 0 if P(x) = 1.

2. Learning that an unlikely outcome has occurred provides more information than learn-
ing that a likely outcome has occurred. The information content of an outcome must
be a strictly decreasing function of its probability. That is, h(x_1) < h(x_2) if P(x_1) > P(x_2).

3. The information content of two independent outcomes must be additive: if
P(x, y) = P(x) P(y), then h(x, y) = h(x) + h(y).
With a little work, it can be shown that only one family of continuous functions satisfies
this modest list of requirements. We are forced to define the information content of an
outcome to be the negative log of the probability:

h(x) = −\log P(x). \tag{8.66}
The base of the logarithm, or equivalently, the multiplicative scaling constant, can be chosen
arbitrarily. The convention, standard in information theory, is to use unit scaling
and log base 2, i.e. \log → \log_2 in Eq. (8.66).
Terminology. The standard scientific term for the information gained by learning the out-
come x, which we have also called the surprise of x, is the configurational entropy.
Consistent with all of the above, the entropy of the set of all possible outcomes, i.e. the
entropy of a random variable X, is defined as the expectation of the configurational
entropy over the outcomes:

H(X) = −E_{P(X)} \log P(X) = \sum_{x ∈ \mathcal{X}} P(x) h(x) = −\sum_{x ∈ \mathcal{X}} P(x) \log P(x), \tag{8.67}
Figure 8.11: The information content h(x) of an outcome x plotted against the probability of
x. Negative-logs are the only family of functions that satisfy the requirements of h(x). The
base of the logarithm (or equivalently, the multiplicative scaling constant) can be chosen
arbitrarily.
where x is drawn from the space \mathcal{X}. We can also say that the entropy is a measure of
uncertainty. In the case of a deterministic process, i.e. when there is only one outcome with
probability 1, the configurational entropy becomes equal to the entropy, and according
to Eq. (8.67) both are zero (with the convention 0 \log 0 = 0).
Terminology. Yet another term associated with the entropy of a random variable, X, is the
measure of uncertainty. Following the tradition of information theory, we use the symbol H
for entropy. However, be aware that an alternative notation, S, is customary in Statistical
Mechanics/Physics.
Let us familiarize ourselves with the concept of entropy on the example of the Bernoulli(β)
process (8.60). In this case, there are only two states, P(X = 1) = β and P(X = 0) = 1 − β,
and therefore

H(β) = −β \log_2 β − (1 − β) \log_2 (1 − β).

Some general properties of the entropy:

• H ≥ 0.
Figure 8.12: The entropy of a Bernoulli random variable as a function of β. The entropy
is zero when β = 0 or β = 1; this is precisely when one of the outcomes is certain and the
random variable is actually deterministic. The entropy is maximized at β = 1/2; this is
precisely when the two outcomes are equiprobable.
• Entropy is no greater than the average number of bits needed to describe the random
variable (equality is achieved for the uniform distribution).^a

• Entropy is the lower bound on the average length of the shortest description of a
random variable.
Example 8.5.1. The so-called Zipf's law states that the frequency of the n-th most frequent
word in a randomly chosen English document can be approximated by

p_n = \begin{cases} \dfrac{0.1}{n}, & \text{for } n ∈ \{1, . . . , 12367\},\\ 0, & \text{for } n > 12367. \end{cases} \tag{8.69}
Under the assumption that English documents are generated by picking words at random
according to Eq. (8.69), compute the per-word entropy of this made-up English.
^a Take the integers smaller than or equal to n and represent them in the binary system. We need
\log_2(n) binary variables (bits) to represent any of these integers. If all the integers are equally probable,
then \log_2(n) is exactly the entropy of the distribution. If the random variable is distributed non-uniformly,
then the entropy is less than this estimate.
Solution. Substituting the distribution (8.69) into the definition of entropy, one derives

H = −\sum_{n=1}^{12367} \frac{0.1}{n} \log_2 \frac{0.1}{n} ≈ \frac{0.1}{\ln 2} \int_{10}^{123670} \frac{\ln x}{x}\, dx = \frac{1}{20 \ln 2} \left( \ln^2 123670 − \ln^2 10 \right) ≈ 9.9 \text{ bits}.
It is known, from the famous work of Shannon [18], that the entropy of English text per
character is fairly low, ∼ 1 bit. Therefore, the per-character entropy of a typical English
text is much smaller than its per-word entropy. This result is intuitively clear: after the
first few letters one can often guess the rest of the word, but predicting the next word in
a sentence is a less trivial task.
The concepts of information content (surprise) and of entropy provide a number of useful
tools in probability. One of the most important is a method of comparing two proba-
bility distributions. For illustration, let X be a random variable taking values x ∈ \mathcal{X}, and
let P_1 be the probability distribution of X, which we consider as the ground truth. Assume
that P_1 is approximated, or modelled, by the probability distribution P_2(x); then the dif-
ference in the information content of x, as measured by the two probability distributions,
is

\log P_1(x) − \log P_2(x) ≡ \log \frac{P_1(x)}{P_2(x)}.
The Kullback-Leibler (KL) divergence is defined as the expectation of the difference in the
information content between the ground truth and its proxy (approximation), taken with
respect to the probability distribution of the former, P_1:

D(P_1 \| P_2) := \sum_{x ∈ \mathcal{X}} P_1(x) \log \frac{P_1(x)}{P_2(x)}. \tag{8.70}
Note that the KL divergence is not symmetric, i.e. D(P_1 \| P_2) ≠ D(P_2 \| P_1). Moreover, it
is not a proper metric of comparison, as it does not satisfy the so-called triangle inequality.
A metric, d(a, b), is a function mapping two elements a and b from the same space to \mathbb{R} that
satisfies (i) non-negativity, i.e. d(a, b) ≥ 0, with zero if and only if a = b; (ii)
symmetry, i.e. d(a, b) = d(b, a); and (iii) the triangle inequality, d(a, b) ≤ d(a, c) + d(c, b).
The last two conditions do not hold in the case of the KL divergence. However, an
infinitesimal version of the KL divergence, the Hessian of the KL divergence around its
minimum, also called the Fisher information, satisfies all the requirements of a metric.
Example 8.5.2. An illusionist has a biased coin that comes up heads 70% of the time.
Use the KL divergence to quantify the amount of information that would be lost if the
biased coin were modeled as a fair coin.
Solution. We regard the biased probability distribution as the ‘ground truth’, P_1, and the
fair probability distribution, P_2, as our approximation. The KL divergence between the two
becomes

D(P_1 \| P_2) = E_{P_1}[\log(P_1/P_2)] = \sum_x P_1(x) \log \frac{P_1(x)}{P_2(x)} = 0.7 \log_2 \frac{0.7}{0.5} + 0.3 \log_2 \frac{0.3}{0.5} ≈ 0.118.

We lose approximately 0.118 bits of information by modeling the biased coin as a fair coin.
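In code (a minimal Python sketch, assuming numpy):

import numpy as np

p1 = np.array([0.7, 0.3])   # biased coin, the ground truth
p2 = np.array([0.5, 0.5])   # fair-coin model
kl = np.sum(p1 * np.log2(p1 / p2))
print(kl)                    # ~0.1187 bits, matching Eq. (8.70)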
Exercise 8.5. Assume that a random variable X_2 is generated by the known probability
distribution P_2(x), where x ∈ \mathcal{X} and \mathcal{X} is finite. Consider the vector (P_1(x) | x ∈ \mathcal{X}) that
satisfies P_1(x) ≥ 0 for all x ∈ \mathcal{X} and \sum_{x ∈ \mathcal{X}} P_1(x) = 1. Show that D(P_1 \| P_2), as a function
of P_1(x), is non-negative and that it achieves its minimum when P_1(x) = P_2(x) ∀x ∈ \mathcal{X}, i.e.
\min_{P_1} D(P_1 \| P_2) = 0.
The notion of entropy naturally extends to multivariate statistics. If we have a pair of
discrete random variables, X and Y, taking values x ∈ \mathcal{X} and y ∈ \mathcal{Y} respectively, their joint
entropy is

H(X, Y) := −E \log P(X, Y) = −\sum_{x ∈ \mathcal{X}, y ∈ \mathcal{Y}} P(x, y) \log P(x, y). \tag{8.72}
One may ask whether H(X, Y) = H(X) + H(Y); that is, whether the expected
information in the entire system is equal to the sum of the expected information of X and
Y individually. To answer this question, we examine the expected amount of information in
the system beyond that which can be gained from X alone, that is, we examine the quantity
H(X, Y) − H(X):
H(X, Y) − H(X) = −\sum_{x ∈ \mathcal{X}, y ∈ \mathcal{Y}} P(x, y) \log P(x, y) + \sum_{x ∈ \mathcal{X}} P(x) \log P(x)

= −\sum_{x ∈ \mathcal{X}, y ∈ \mathcal{Y}} P(x, y) \log P(x, y) + \sum_{x ∈ \mathcal{X}, y ∈ \mathcal{Y}} P(x, y) \log P(x)

= −\sum_{x ∈ \mathcal{X}, y ∈ \mathcal{Y}} P(x, y) \log \frac{P(x, y)}{P(x)}.
Figure 8.13: Venn diagram illustrating the relationship between entropy, joint entropy and
conditional entropy. (It is customary in information theory to use Venn diagrams to illustrate
entropies, conditional entropies and mutual information. Be advised that the shapes in the
diagram do not actually represent sets of objects. See e.g. p. 141 of [19] for a detailed
discussion with examples.)
If X and Y are independent, then P(x, y) = P(x)P(y) and the result is H(Y). In general,
P(x, y)/P(x) = P(y|x) by the definition of conditional probability (cf. Bayes' theorem).
We define the conditional entropy H(Y|X) := H(X, Y) − H(X) and observe that it can be
computed by
H(Y|X) = −E \log P(Y|X) = −\sum_{x ∈ \mathcal{X}, y ∈ \mathcal{Y}} P(x, y) \log P(y|x). \tag{8.73}
The definitions of the joint and conditional entropies naturally lead to the following
relation between the two (the entropy chain rule):

H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y). \tag{8.74}
Figure 8.14: Venn diagram explaining the relations between the mutual information and the
respective entropies for two random variables. (It is customary in information theory to use
Venn diagrams to illustrate entropies, conditional entropies and mutual information. Be
advised that the shapes in the diagram do not actually represent sets of objects. See e.g.
p. 141 of [19] for a detailed discussion with examples.)
When comparing two information sources, say one tracking events x and the other tracking
events y, one rather drastic assumption may be that the probabilities are independent,
i.e. P(x, y) = P(x)P(y), and then P(x|y) = P(x) and P(y|x) = P(y). The mutual
information, which we are about to discuss, is zero in this case. Thus, naturally, the
mutual information is introduced as a measure of dependence:
I(X; Y) = E_{P(x,y)} \left[ \log \frac{P(x, y)}{P(x) P(y)} \right] = \sum_{x ∈ \mathcal{X}} \sum_{y ∈ \mathcal{Y}} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}. \tag{8.76}
Intuitively the mutual information measures the information that X and Y share. In other
words, it measures how much knowing one of these random variables reduces uncertainty
about the other. For example, if X and Y are independent, then knowing X does not give
any information about Y and vice versa - the mutual information is zero. In the other
extreme, if X is a deterministic function of Y then all information conveyed by X is shared
with Y . In this case the mutual information is the same as the uncertainty contained in X
itself (or Y itself), namely the entropy of X (or Y ).
The mutual information is obviously related to respective entropies,
I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X) = H(X) + H(Y ) − H(X, Y ). (8.77)
The relation is illustrated in Fig. (8.14). Mutual information also possesses the following
properties:

• Symmetry: I(X; Y) = I(Y; X).

• Non-negativity: I(X; Y) ≥ 0, with equality if and only if X and Y are independent.
The conditional mutual information between two random variables (two sources of in-
formation), X and Y, given another random variable, Z, is

I(X; Y | Z) := H(X|Z) − H(X|Y, Z) = E_{P(x,y,z)} \left[ \log \frac{P(x, y|z)}{P(x|z) P(y|z)} \right]. \tag{8.80}
The entropy chain rule (8.74), when applied to the mutual information of (X_1, · · · , X_n) ∼
P(x_1, · · · , x_n), results in

I(X_n, · · · , X_1; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i−1}, · · · , X_1). \tag{8.81}
See Fig. (8.14) for the Venn diagram illustration of Eq. (8.81).

We recommend that the reader check [19] for extended discussions of entropy, mutual
information and related notions.

The notions of joint entropy, conditional entropy and mutual information in the context
of two random variables are illustrated in the following three examples.
Example 8.5.3. Consider two Bernoulli random variables X and Y with a joint probability
mass function P (X, Y ) given by
y=0 y=1
x=0 0 0.2
x=1 0.8 0
Compute the entropy of X, the joint entropy of X and Y , and the conditional entropy of
Y given X. Discuss the results.
Solution. The joint probability mass function indicates that the outcome of Y is completely
determined by the outcome of X, and vice versa. Intuitively, we should expect that all the
information in the entire system is fully contained in X, and that once X is known, no
additional information can be gained from Y . Lets do the calculations to verify that our
intuition is correct. The entropy of X is
X
H(X) = − P (x) log2 P (x) = −0.2 log2 0.2 − 0.8 log2 0.8 = 0.722.
x∈X
The joint entropy is

H(X, Y) = −\sum_{x, y} P(x, y) \log_2 P(x, y) = −0.2 \log_2 0.2 − 0.8 \log_2 0.8 = 0.722

(the two zero-probability entries contribute nothing, by the convention 0 \log 0 = 0).

Figure 8.15: Schematic for example 8.5.3. The entropy of the whole system (blue) is the
same as the entropy of X (orange). The conditional entropy of Y given X is zero (illustrated
by the bar of ‘zero’ width at the end of the second row). The bottom row shows that the
entropy of Y (pink) also coincides with that of the entire system (and, incidentally, is fully
shared with that of X).
For this situation, the expected information content of the entire system is exactly the same
as the expected information content of X alone. No additional information can be expected
from (X, Y ) that cannot be expected from X. We anticipate (and verify) that there is no
additional information that can be expected from Y once X is known.
H(Y|X) = −\sum_{x ∈ \mathcal{X}, y ∈ \mathcal{Y}} P(x, y) \log_2 P(y|x) = −0 \log_2 0 − 0.8 \log_2 1 − 0.2 \log_2 1 − 0 \log_2 0 = 0.
See Fig. 8.15 for illustration. Comment: Similar calculations for H(Y ) and H(X|Y ) would
show that Y also contains all the expected information in the system, and that no additional
information can be expected from X once Y is known.
Example 8.5.4. Consider two Bernoulli random variables X and Y with a joint probability
mass function P (X, Y ) given by
y=0 y=1
x=0 0.45 0.45
x=1 0.05 0.05
Compute the entropies of X and of Y . Compute the joint entropy of X and Y . Compute
the conditional entropy of Y given X. Discuss the results.
Solution. The marginal distributions are:
P(X = x) = \begin{cases} 0.9, & x = 0,\\ 0.1, & x = 1; \end{cases} \qquad P(Y = y) = \begin{cases} 0.5, & y = 0,\\ 0.5, & y = 1. \end{cases}
We observe that X and Y are independent, so we should intuitively expect that there is
no ‘overlap’ in the information in X and Y . Let’s do the calculations to verify that our
intuition is correct. The entropy of X is
H(X) = −\sum_{x ∈ \mathcal{X}} P(x) \log_2 P(x) = −0.9 \log_2 0.9 − 0.1 \log_2 0.1 = 0.469.
The entropy of Y is

H(Y) = −\sum_{y ∈ \mathcal{Y}} P(y) \log_2 P(y) = −0.5 \log_2 0.5 − 0.5 \log_2 0.5 = 1.0.
The joint entropy is

H(X, Y) = −2 · 0.45 \log_2 0.45 − 2 · 0.05 \log_2 0.05 = 1.469.

For this situation, the expected information content of the entire system is more than the
expected information content of X alone. The additional expected information of the system
(X, Y) beyond that of X must be information expected from Y that is not contained in
X. We anticipate (and verify) that the additional information that can be expected from
Y when X is known is non-zero:
H(Y|X) = −\sum_{x ∈ \mathcal{X}, y ∈ \mathcal{Y}} P(x, y) \log_2 P(y|x) = −0.45 \log_2 0.5 − 0.45 \log_2 0.5 − 0.05 \log_2 0.5 − 0.05 \log_2 0.5 = 1.0.
Performing the similar calculation for H(X|Y) would show that H(X|Y) = H(X), i.e. no
additional information about X is gained by learning Y: there is no overlap between the
information in X and the information in Y. See Fig. 8.16 for illustration.
Example 8.5.5. Consider two Bernoulli random variables X and Y with a joint probability
mass function P (X, Y ) given by
Figure 8.16: Schematic for example 8.5.4 (H(X, Y) ≈ 1.469). The entropy of the whole
system (blue) is equal to the entropy of X (orange) plus the entropy of Y conditioned on X
(pink, shaded). Observe that in this example, the entropy of Y conditioned on X is equal
to the entropy of Y (pink) because there is no overlap in information between X and Y.
y=0 y=1
x=0 0.2 0.3
x=1 0 0.5
Compute the entropies of X and of Y . Compute the joint entropy of X and Y . Compute
the conditional entropy of Y given X. Discuss the results.
Solution. The marginal distributions are P(X = 0) = 0.5, P(X = 1) = 0.5 and
P(Y = 0) = 0.2, P(Y = 1) = 0.8. The entropy of X is

H(X) = −\sum_x P(x) \log_2 P(x) = −0.5 \log_2 0.5 − 0.5 \log_2 0.5 = 1.0.

The entropy of Y is

H(Y) = −\sum_y P(y) \log_2 P(y) = −0.2 \log_2 0.2 − 0.8 \log_2 0.8 = 0.722.

The conditional entropy of Y given X is

H(Y|X) = −\sum_{x, y} P(x, y) \log_2 P(y|x) = −0.2 \log_2 0.4 − 0.3 \log_2 0.6 − 0 · \log_2 0 − 0.5 \log_2 1 = 0.485.

(A similar calculation gives H(X|Y) = −0.2 \log_2 1 − 0.3 \log_2 0.375 − 0 · \log_2 0 − 0.5 \log_2 0.625 = 0.764.)
Figure 8.17: Schematic for example 8.5.5 (H(X, Y) ≈ 1.485). The entropy of the whole
system (blue) is equal to the entropy of X (orange) plus the entropy of Y conditioned on X
(pink, shaded). Observe that in this example, the entropy of Y conditioned on X is less
than the entropy of Y (pink) because some of the information content in Y overlaps with
that of X.
P(x, y)    x_1     x_2     x_3     x_4    |  P(y)
y_1        1/8     1/16    1/32    1/32   |  1/4
y_2        1/16    1/8     1/32    1/32   |  1/4
y_3        1/16    1/16    1/16    1/16   |  1/4
y_4        1/4     0       0       0      |  1/4
P(x)       1/2     1/4     1/8     1/8

Table 8.1: Exemplary joint probability distribution function P(x, y) and the marginal prob-
ability distributions, P(x), P(y), of the random variables x and y.
The joint entropy is

H(X, Y) = −0.2 \log_2 0.2 − 0.3 \log_2 0.3 − 0 · \log_2 0 − 0.5 \log_2 0.5 = 1.485.
See Fig. 8.17 for illustration. Comment: In this example, X and Y are not independent, and
therefore some information is shared between the two. This explains why the joint entropy
is less than the sum of the individual entropies, i.e. H(X, Y) < H(X) + H(Y). This also
explains why the information content of X conditioned on Y is less than the information
content of X alone, i.e. H(X|Y) < H(X). (Similarly, it explains why H(Y|X) < H(Y).)
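All of the entropies in Examples 8.5.3–8.5.5 (and those needed for Exercise 8.6) follow mechanically from the joint table. Here is a small helper sketch (assuming numpy; the convention 0 log 0 = 0 is enforced explicitly, and the helper names are of course hypothetical):

import numpy as np

def xlog2x(p):
    # p * log2(p) with the convention 0 * log(0) = 0
    return np.where(p > 0, p * np.log2(np.where(p > 0, p, 1.0)), 0.0)

def entropies(P):
    # P[i, j] = P(x = i, y = j); returns H(X), H(Y), H(X,Y), H(Y|X), I(X;Y)
    Px, Py = P.sum(axis=1), P.sum(axis=0)
    H_X, H_Y = -xlog2x(Px).sum(), -xlog2x(Py).sum()
    H_XY = -xlog2x(P).sum()
    return H_X, H_Y, H_XY, H_XY - H_X, H_X + H_Y - H_XY

P = np.array([[0.2, 0.3],
              [0.0, 0.5]])   # the table of Example 8.5.5 (rows: x, cols: y)
print(entropies(P))          # H(X)=1.0, H(Y)=0.722, H(X,Y)=1.485, H(Y|X)=0.485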
Exercise 8.6. The joint probability distribution P (x, y) of two random variables X and
Y is described in Table 8.1. Calculate the conditional probabilities P (x|y) and P (y|x),
marginal entropies H(X) and H(Y ), as well as the mutual information I(X; Y ).
Figure 8.18: Graphical illustration of the Jensen inequality.
Let us now discuss the case when a random one-dimensional variable, X, is drawn from the
space of reals, x ∈ R, with the probability density p(x). Now consider averaging a convex
function of X, f(X). One observes that the following statement, called the Jensen
inequality, holds:

E[f(X)] ≥ f(E[X]).

Obviously, the statement becomes an equality when p(x) = δ(x − x_0) for some fixed x_0.
To gain a bit more intuition, consider the case of the Bernoulli-like distribution,
p(x) = βδ(x − x_1) + (1 − β)δ(x − x_0). We derive

E[f(X)] = β f(x_1) + (1 − β) f(x_0) ≥ f(β x_1 + (1 − β) x_0) = f(E[X]),

where the critical inequality in the middle is simply an expression of the convexity of f(x)
(taken verbatim from the definition).
See also Fig. (8.18) for another (graphical) hint at the proof of the Jensen inequality.
In fact, the Jensen inequality holds over arbitrary spaces.
Notice that the entropy, considered as a function (or a functional in the continuous case)
of the probabilities of the particular states, is concave. This observation gives rise to
multiple consequences of the Jensen inequality (for the entropy and the mutual information):

• (Information Inequality) D(P_1 \| P_2) ≥ 0, with equality if and only if P_1 = P_2.

• (Log-Sum Inequality) For non-negative numbers a_1, · · · , a_n and b_1, · · · , b_n,

\sum_{i} a_i \log \frac{a_i}{b_i} ≥ \left( \sum_i a_i \right) \log \frac{\sum_i a_i}{\sum_i b_i},

with equality iff a_i/b_i is constant. Convention: 0 \log 0 = 0, a \log(a/0) = ∞ if a > 0, and
0 \log(0/0) = 0. Consequences of the Log-Sum theorem:
• (Concavity of Entropy) For X ∼ p(x) we have H(P ) := HP (X) (notations are ex-
tended) is a concave function of P (x).
• (Concavity of the mutual information in P (x)) Let (X, Y ) ∼ P (x, y) = P (x)P (y|x).
Then I(X; Y ) is a concave function of P (x) for fixed P (y|x).
• (Convexity of the mutual information in P(y|x)) Let (X, Y) ∼ P(x, y) = P(x)P(y|x).
Then I(X; Y) is a convex function of P(y|x) for fixed P(x).
We will see later (discussing Graphical Models) why the convexity/concavity properties of
the entropy-related objects are useful.
Example 8.5.6. Prove that H(X) ≤ log2 n, where n is the number of possible values of
the random variable x ∈ X.
Solution. The simplest proof is via Jensen's inequality. It states that if f is a convex
function and U is a random variable, then

E[f(U)] ≥ f(E[U]).

Let us define

f(u) = −\log_2 u, \qquad U = 1/P(x).

Then H(X) = E[\log_2(1/P(X))] = −E[f(U)] ≤ −f(E[U]) = \log_2 E[U] = \log_2 \sum_{x} P(x) \frac{1}{P(x)} = \log_2 n.
Note, in passing, that Jensen's inequality leads to a number of other useful expres-
sions for entropy, e.g. H(X|Y) ≤ H(X) with equality iff X and Y are independent, and
more generally, H(X_1, . . . , X_n) ≤ \sum_{i=1}^{n} H(X_i) with equality iff all the X_i are independent.
Chapter 9
Stochastic Processes
In Chapter 8, we discussed random vectors and their probability distributions (for example
X = (X1 , . . . , XN ) with X ∼ PX ). In Chapter 9, we will discuss stochastic processes,
which are a natural extension of our inquiry into random vectors. A stochastic process is a
collection of random variables, often written as a path, or sequence, {X_t | t = 1, · · · , T}, or
simply (X_t)_{t=1}^{T}, whose components X_t take values from the same state space Σ. Typically
we think of t as time, and we consider the cases where T is finite, T < ∞, and where it is
infinite, T → ∞. Both the state space, Σ, and the index set, t, may be either discrete or
continuous.
We will discuss three basic examples: (a) the Bernoulli process, which is discrete in both
space and time, (b) the Poisson process, which is discrete in space but continuous in time,
and (c) a continuous-space, continuous-time process described by the Langevin equation,
which is an example of a so-called Stochastic Differential Equation (SDE), in Section 9.1,
Section 9.2 and Section 9.3, respectively.
The three basic examples of the stochastic processes can also be classified in terms of
the amount of memory needed to generate them.
In our first two examples (Bernoulli processes and Poisson processes), the random vari-
ables X_t are independent, i.e. memory-less, meaning that the outcome of each X_t does
not influence, and is not influenced by, the outcomes of any of the other X_t. In general,
however, the components X_t within the path {X(t)} need not be independent. Thus, the
stochastic process described by the Langevin equation results in dependent, i.e. correlated
in time, X_t.
Time-correlations within a random path, {X(t)}, described by an SDE may be compli-
cated and difficult to analyze. Consequently, one often considers a discrete-time simplifica-
tion, called a Markov process, discussed in Section 9.4, where the memory holds only for a
single time step, i.e. Xt depends only on the outcome of the previous step, Xt−1 .
We conclude this Chapter with a brief discussion, in Section 9.5, of the Markov Decision
Process (MDP), which is a controlled formulation involving (conditioned on) a Markov
process. (Queuing theory, discussed in Section 9.6, is a bonus material.)
As we discussed in Eq. 8.5, the number of successes, k, in n trials follows the binomial
distribution

∀k = 0, · · · , n : \quad P(S = k | n, β) = \binom{n}{k} β^k (1 − β)^{n−k}. \tag{9.1}
The mean and variance are found by computing E[S] and E[(S − E[S])²], respectively:

\text{mean} : E[S] = nβ, \qquad \text{variance} : \text{var}(S) = E[(S − E[S])^2] = nβ(1 − β).
Let T_1 be the number of trials until the first success (including the success event itself). The
Probability Mass Function (PMF) for the time of the first success is the product of the
probabilities of (t − 1) failures and one success:

∀t = 1, 2, · · · : \quad P(T_1 = t | β) = β (1 − β)^{t−1}. \tag{9.4}
The distribution in Eq. 9.4 is called a geometric distribution because the calculation to
verify that the probability distribution is normalized involves summing up the geometric
sequence: \sum_{t=1}^{∞} β(1 − β)^{t−1} = β (1 − (1 − β))^{−1} = 1. The mean and variance of the geometric
distribution are
\text{mean} : E[T_1] = \frac{1}{β}, \tag{9.5}

\text{variance} : \text{var}(T_1) = E[(T_1 − E[T_1])^2] = \frac{1 − β}{β^2}. \tag{9.6}
The Bernoulli process is memoryless, meaning that each outcome is independent of the
past. If n trials have already occurred, the future sequence x_{n+1}, x_{n+2}, · · · is also a Bernoulli
process, and it is independent of the first n trials. Moreover, suppose we have observed the
process for n trials and no success has occurred. Then the PMF for the remaining arrival
time is also geometric:

∀t = 1, 2, · · · : \quad P(T_1 = n + t | T_1 > n) = β (1 − β)^{t−1}. \tag{9.7}
What about the k-th arrival? Let T_k be the number of trials until the k-th success (inclusive);
then we write

t = k, k + 1, · · · : \quad P(T_k = t | β) = \binom{t − 1}{k − 1} β^k (1 − β)^{t−k} \quad \text{[Pascal PMF]}. \tag{9.8}
The mean and variance are found by computing E[T_k] and E[(T_k − E[T_k])²], respectively:

\text{mean} : E[T_k] = \frac{k}{β}, \tag{9.9}

\text{variance} : \text{var}(T_k) = E[(T_k − E[T_k])^2] = \frac{k(1 − β)}{β^2}. \tag{9.10}
The combinatorial factor accounts for the number of configurations of k arrivals in Tk trials.
The following two sections discuss two different continuous-time limits of Bernoulli pro-
cesses, namely Poisson Processes and Brownian motion.
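A simulation sketch (assuming numpy; the parameter values are arbitrary) confirms the geometric and Pascal statistics of Eqs. (9.5)–(9.10):

import numpy as np

rng = np.random.default_rng(4)
beta, k, trials = 0.3, 3, 100_000

# The time of the k-th success is a sum of k i.i.d. Geometric(beta) waiting times.
t1 = rng.geometric(beta, size=trials)            # first-arrival times, Eq. (9.4)
tk = rng.geometric(beta, size=(trials, k)).sum(axis=1)

print(t1.mean(), 1 / beta)                       # Eq. (9.5)
print(t1.var(), (1 - beta) / beta**2)            # Eq. (9.6)
print(tk.mean(), k / beta)                       # Eq. (9.9)
print(tk.var(), k * (1 - beta) / beta**2)        # Eq. (9.10)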
Recall the Poisson distribution (8.6), here written with a dimensionless parameter λ̃:

∀k ∈ {0, 1, 2, . . . } : \quad P(N = k; λ̃) = \frac{λ̃^k e^{−λ̃}}{k!}. \tag{9.11}
In the following derivation, we will show that the outcomes of a Poisson process follow a
Poisson distribution with parameter λt.
The distribution for the Poisson Process can be derived by subdividing the interval [0, t]
into n subintervals of length ∆t := t/n. For sufficiently small ∆t, the probability of two
or more arrivals on any subinterval is negligible and the occurrence of arrivals on any two
subintervals are independent. Under these conditions, the probability of k arrivals in n
sub-intervals can be modeled by a binomial distribution. The probability of an arrival, β,
is proportional to the length of the sub-interval: β ∝ t/n, or equivalently β = λt/n where
λ is the constant of proportionality. In the limit as n → ∞, we get
P(N_t = k; λ) = \lim_{n→∞} \binom{n}{k} β^k (1 − β)^{n−k} = \lim_{n→∞} \frac{n!}{k!(n−k)!} \left( \frac{λt}{n} \right)^k \left( 1 − \frac{λt}{n} \right)^{n−k} \tag{9.12}

= \lim_{n→∞} \frac{n^k + O(n^{k−1})}{k!} \left( \frac{λt}{n} \right)^k \left( 1 − \frac{λt}{n} \right)^{n−k} \tag{9.13}

= \frac{(λt)^k e^{−λt}}{k!}. \tag{9.14}
where we have used the facts that (1 − λt/n)^n → e^{−λt} and (1 − λt/n)^{−k} → 1. Notice that the
dimensionless parameter λ̃ in Eq. (9.11) is replaced by λt in Eq. (9.14). Hence λ has the
dimension of inverse time: [λ] = [1/t].
• Collisions of high-energy beams at high frequency (10 MHz), where there is a small
chance of an actual collision on each crossing.
• Radioactive decay of a nucleus with the trial being to observe a decay within a small
time interval.
Figure 9.1: One realization of a Poisson process over the interval [0, t] (blue) in which
five “arrivals” occurred at the random times T1 , . . . , T5 , therefore N (t) = 5. Two other
realizations are shown (grey) in which 6 arrivals and 3 arrivals occurred on [0, t]. The
random variable Nt follows a Poisson distribution (right) with parameter λt where λ is a
fixed parameter.
A Poisson process is characterized by the following properties:

• Initialization: N(0) = 0.

• Independence: The numbers of arrivals that occur in two time intervals are inde-
pendent if and only if the two time intervals are disjoint.
• Distribution: The number of arrivals that occur on an interval depends only on the
length of the interval and not its location. In particular, in the limit as ∆t → 0,
P (N (∆t) = 1) → λ∆t and P (N (∆t) ≥ 2) = 0.
With a small amount of effort, it can be shown that these three properties are both necessary
and sufficient conditions to define a stochastic process whose components follow a Poisson
distribution with parameter λt.
A summary of the relationship between the Bernoulli process and the Poisson process
is given in table 9.1.
Let T1 be the (random) time of the first arrival. The probability density of T1 per unit time
can be found from Eq. (9.11) by (1) recalling that a PDF is the derivative of the CDF, and
(2) recognizing the equivalency P (T1 < t) = P (Nt ≥ 1) since the event that first arrival
                              Bernoulli       Poisson
Times of Arrival              Discrete        Continuous
Arrival Rate                  β per trial     λ per unit time
PMF of Number of Arrivals     Binomial        Poisson
PMF of Interarrival Time      Geometric       Exponential
PMF of k-th Arrival Time      Pascal          Erlang

Table 9.1: Comparison between the Bernoulli process and the Poisson process.
occurs before time t is equivalent to the event that the number of arrivals by time t is
greater than or equal to 1. Therefore,
P(T_1 = t) = \lim_{∆t→0} \frac{1}{∆t} \left[ P(T_1 < t + ∆t) − P(T_1 < t) \right] = \frac{d}{dt} P(T_1 < t)

= \frac{d}{dt} P(N_t ≥ 1) = \frac{d}{dt} \left[ 1 − P(N_t = 0) \right]

= \frac{d}{dt} \left[ 1 − e^{−λt} \right] = λ e^{−λt}.
Hence the time of the first arrival follows an exponential distribution with parameter λ.
The three properties above imply that, like the Bernoulli process, the Poisson process
is Memoryless and that it has Fresh Starts.
• Memoryless: if we observe the process for t seconds and no arrival has occurred, then
the density of the remaining time of arrival is exponential.
• Fresh Starts: the time of the next arrival is independent of the past, and hence is also
exponentially distributed with parameter λ.
The probability that the first arrival occurs before time t can therefore be found
by integration:

P(T_1 ≤ t) = \int_0^t dt'\, p_{T_1}(t') = \int_0^t dt'\, λ e^{−λt'} = 1 − \exp(−λt).
By extension, for the probability density of the time of the k-th arrival one derives

p(T_k = t; λ) = \frac{λ^k t^{k−1} \exp(−λt)}{(k − 1)!}, \quad t > 0 \qquad \text{(Erlang “of order” k)}.
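As a simulation sketch (assuming numpy; the rate, horizon and sample count are arbitrary), generating arrivals from i.i.d. exponential inter-arrival times reproduces both the Poisson counts and the Erlang law for T_k:

import numpy as np

rng = np.random.default_rng(5)
lam, t, trials = 2.0, 3.0, 100_000

gaps = rng.exponential(1 / lam, size=(trials, 50))  # inter-arrival times
arrivals = np.cumsum(gaps, axis=1)                  # arrival times T_1, T_2, ...

counts = (arrivals <= t).sum(axis=1)                # N_t for each realization
print(counts.mean(), counts.var(), lam * t)         # both ≈ lam*t (Poisson)

k = 3
print(arrivals[:, k - 1].mean(), k / lam)           # E[T_k] = k/lam (Erlang)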
One of the most important features shared by the Bernoulli and Poisson processes is their
invariance with respect to merging and splitting. We will show it on the example of the
Poisson process, but the same applies to the Bernoulli process.
Figure 9.2: Merging two independent Poisson streams, N_1 and N_2, into the combined
stream N_1 + N_2.
Merging: Let N_1(t) and N_2(t) be two independent Poisson processes with rates λ_1
and λ_2, respectively. Let us define N(t) = N_1(t) + N_2(t). This random process is derived
by combining the arrivals, as shown in Fig. (9.2). The claim is that N(t) is a Poisson process
with rate λ_1 + λ_2. To see this, we first note that N(0) = N_1(0) + N_2(0) = 0. Next,
since N_1(t) and N_2(t) are independent and have independent increments, their sum also
has independent increments. Finally, consider an interval of length τ, (t, t + τ]. The
numbers of arrivals in the interval are Poisson(λ_1 τ) and Poisson(λ_2 τ), and the two numbers
are independent. Therefore, the number of arrivals in the interval associated with N(t)
is Poisson((λ_1 + λ_2)τ), being the sum of two independent Poisson random variables
(cf. Exercise 8.3). We can obviously generalize the statement to a sum of many Poisson
processes. Note that in the case of the Bernoulli process the story is identical, provided
that a collision is counted as one arrival.
Splitting: Let N(t) be a Poisson process with rate λ. Here, we split N(t) into N_1(t)
and N_2(t), where the splitting is decided by coin tossing (a Bernoulli process): when an
arrival occurs, we toss a coin and, with probabilities β and 1 − β, add the arrival to N_1 or
N_2, respectively. The coin tosses are independent of each other and are independent of N(t).
Then, the following statement can be made: N_1(t) and N_2(t) are Poisson processes with
rates βλ and (1 − β)λ, respectively.
Example 9.2.1. Astronomers estimate that meteors above a certain size hit the Earth
on average once every 1000 years, and that the number of meteor hits follows a Poisson
distribution.
(a) What is the probability to observe at least one large meteor next year?
(b) What is the probability of observing no meteor hits within the next 1000 years?
(c) Calculate the probability density p(Tk ), where the random variable Tk represents the
appearance time of the k th meteor.
Solution. (a) With λ = 1/1000 per year, P(N_1 ≥ 1) = 1 − e^{−λ} = 1 − e^{−0.001} ≈ 0.001.
(b) P(N_{1000} = 0) = e^{−λ · 1000} = e^{−1} ≈ 0.37.
(c) The k-th meteor appears later than time t if and only if at most k − 1 meteors appear
within [0, t]. Therefore

\int_t^{∞} p(T_k)\, dT_k = \sum_{n=0}^{k−1} P(n | t),

and differentiating over t one arrives at

p(T_k) = \frac{λ^k T_k^{k−1}}{(k − 1)!} e^{−λT_k}.
Exercise 9.2. Customers arrive at a store with the Poisson rate of 5 per hour. 40%/60%
of arrivals are men/women.
(a) Compute probability that at least 10 customers have entered between 10 and 11 am.
(b) Compute probability that exactly 5 women entered between 10 and 11 am.
(c) Compute the expected inter-arrival time of men.
(d) Compute probability that no men arrive between 2 and 4 pm.
For a process Y = (Y_1, . . . , Y_n) of i.i.d. variables that take the outcomes ±1 with equal
probability, the stochastic process X, where X_j := \sum_{i=1}^{j} Y_i, is called a random walk
on the integers. The PMF of this random walk is a (shifted) binomial distribution. As a
direct result of the central limit theorem, the PMF converges (at large n) to a Gaussian
distribution with mean zero and variance n.
Exercise. Modify example 9.3.1 for the random starting location P (X0 = x) = 2−x /3.
Example 9.3.2. Consider a random walk on the line with n jumps that occur at times
t = ∆, 2∆, . . . , n∆, where ∆ = 1/n. Find the PMF of the random walk if the jumps are
±√∆ with equal probability, so P(X_{j+1} = X_j + √∆) = P(X_{j+1} = X_j − √∆) = 1/2, and
use your result to show the stochastic process has mean zero and unit variance when t = 1.
Solution. P(X_n = (n − 2k)√∆) = \binom{n}{k} \left( \frac{1}{2} \right)^n. The mean and variance are found by recognizing
that X_n = √∆ (2Y − n), where Y ∼ B(n, 1/2). Therefore, E[X_n] = √∆ (2E[Y] − n) = 0
and Var[X_n] = 4∆\, Var[Y] = 4∆ · n/4 = ∆n = 1.
Example 9.3.2 could be further extended by using the Central Limit Theorem to show
that the CDF of the random walk converges to the CDF of a standard normal distribution in
the limit as n → ∞. This result informs when and how to approximate a discrete-time
stochastic process by a continuous-time stochastic process and vice versa. (Continuous-time
processes are often easier to analyze, and discrete stochastic processes are often easier to
compute numerically.)
The central limit theorem further implies that the jumps need not be Bernoulli: every
random walk with i.i.d. steps will have the same limit, provided that the variance of the
individual steps scales as 1/n (i.e. the step size scales as 1/√n) as n → ∞. This limit is
Brownian Motion.
Consider the stochastic differential equation (the Langevin equation)

\frac{dx(t)}{dt} = v(x(t)) + ξ(t), \qquad \langle ξ(t) \rangle = 0, \quad \langle ξ(t) ξ(t') \rangle = 2D δ(t − t'), \tag{9.16}

and its discrete-time counterpart

x_{t+∆} = x_t + v(x_t) ∆ + \sqrt{2D∆}\, η_t, \qquad η_t ∼ N(0, 1) \text{ i.i.d.} \tag{9.17}

The first term on the rhs of the Stochastic Differential Equation (SDE) (9.16) determines the
(deterministic) drift, and the second term (called the Langevin term) determines the random
“noise”, which has mean zero and variance determined by D. The noise is considered
independent at each time step. These equations, also called the Langevin equations, describe
the evolution of a “particle” positioned at x ∈ R. The two terms on the rhs of Eq. (9.16)
correspond to the deterministic drift/advancement of the particle (also dependent on its position
at the previous time step) and, respectively, to a random correction/increment. The random
correction models the uncertainty of the environment the particle moves through. (We can also
think of it as representing random kicks by other “invisible” particles.) The uncertainty
is represented in a probabilistic way – therefore we will be talking about the probability
distribution function of paths, i.e. trajectories of the particle.
The square root on the rhs of Eq. (9.17) may seem mysterious; let us clarify its origin in the basic "no (deterministic) drift" example, v(x) = 0. (This will be the running example throughout this lecture.) In this case the Langevin equation describes Brownian motion. Direct integration of the linear equation with the inhomogeneous source results in
∀t ≥ 0 : x(t) = √(2D) ∫_0^t dt′ ξ(t′),   (9.18)
∀t ≥ 0 : ⟨x²(t)⟩ = ∫_0^t dt_1 ∫_0^t dt_2 2D δ(t_1 − t_2) = 2D ∫_0^t dt_1 = 2Dt,   (9.19)
where we also set x(0) = 0. The infinitesimal version of Eq. (9.19) is
⟨(dx(t))²⟩ = 2D dt,   (9.20)
which can thus be derived from the Brownian (no drift) version of Eq. (9.17).
Note that the notations introduced above for the SDE, in its continuous-time form (9.16) and in the discrete-time form (9.17), originate from physics and are customary in (at least some part of) applied mathematics. The notations are intuitive and simple; however, they are formally ambiguous as they depend on the notion of the δ-function, i.e. a generalized function. This is similar to the ambiguity associated with the use of the δ-function, for example, as a source term in a linear ODE describing the Green function. Recall that to resolve the ambiguity associated with the δ-function we should either regularize the δ-function or use it only under an integral. Therefore in theoretical mathematics, statistics and engineering, we more often see the SDE (9.16) restated in the differential form
dx(t) = v(x(t)) dt + √(2D) dW(t),   (9.21)
where W(t) denotes the so-called "standard Brownian motion", also called the Wiener process/term. (In some of the mathematics literature, dB(t) is used instead of dW(t).)
Formally:
Definition 9.3.3 (Wiener Process). The Wiener process, {W(t)}_{t≥0}, is a continuous-time stochastic process in R characterized by the following three properties:
1. W(0) = 0;
2. W(t) is almost surely continuous (i.e. with probability 1, W(t) is continuous in t);
3. W(t) has stationary independent increments that are normally distributed with mean zero and variance equal to the length of the increment, that is, W_t − W_{t′} ∼ N(0, t − t′) for 0 ≤ t′ ≤ t.
The differential form (9.21) of the Langevin equation is advantageous because it naturally leads to the following integral version of the Langevin equation, resolving the aforementioned ambiguity:
x(t + ∆) − x(t) = ∫_t^{t+∆} dx(t′) = ∫_t^{t+∆} v(x(t′)) dt′ + √(2D) ∫_t^{t+∆} dW(t′),   (9.22)
where the first term on the rhs is the standard (Lebesgue) integral, while the second term is the so-called Ito integral. According to Eqs. (9.18, 9.19), and consistently with the formal definition of the Wiener process above, the heuristic interpretation of the Ito integral is that as ∆ → 0 the increment x(t + ∆) − x(t) becomes Gaussian, normally distributed with zero mean and variance 2D∆.
The Langevin equation can also be viewed as relating the change in x(t), i.e. its dynamic increment, to the stochastic dynamics of the δ-correlated source ξ(t_n) = ξ_n, characterized by the Probability Density Function (PDF)
p(ξ_1, ···, ξ_N) = (2π)^{−N/2} exp(−Σ_{n=1}^N ξ_n²/2).   (9.23)
Note that Eqs. (9.16, 9.17, 9.23) are the starting points for our further derivations, but they should also be viewed as a recipe for simulating the Langevin equation on a computer by generating many paths at once, i.e. simultaneously. Notice, for completeness, that there are also other ways to simulate the Langevin equation, e.g. through the so-called telegraph process.
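For instance, here is a minimal Euler-Maruyama sketch of the discrete-time update (9.17) (our own illustration; the drift v, the value of D and all step sizes are arbitrary choices). With v = 0 the empirical ⟨x²(t)⟩ should approach 2Dt, reproducing Eq. (9.19).

using Statistics

# March many paths of x_{n+1} = x_n + Δ v(x_n) + sqrt(2DΔ) ξ_n at once.
function langevin_paths(v, D, dt, nsteps, npaths)
    x = zeros(npaths)
    for _ in 1:nsteps
        x .+= v.(x) .* dt .+ sqrt(2D * dt) .* randn(npaths)
    end
    return x
end

D, dt, nsteps = 0.5, 1e-3, 1000                       # total time t = 1
x = langevin_paths(x -> 0.0, D, dt, nsteps, 100_000)  # zero drift: Brownian motion
println(mean(x .^ 2))                                 # ≈ 2 D t = 1.0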
Let us now formally express ξ_n via x_n from Eq. (9.17) and substitute it into Eq. (9.23):
p(ξ_1, ···, ξ_{N−1}) → p(x_1, ···, x_N) = (4πD∆)^{−(N−1)/2} exp(−(1/(4D∆)) Σ_{n=1}^{N−1} (x_{n+1} − x_n − ∆ v(x_n))²).   (9.24)
One gets an explicit expression for the measure over a path, written in the discretized way. A typical way of stating it in the continuous form (e.g. as a notational shortcut) is
p{x(t)} ∝ exp(−(1/(4D)) ∫_0^T dt (ẋ − v(x))²).   (9.25)
This object is called (in physics and math) a "path integral" and/or the Feynman-Kac integral.
The Probability Density Function (PDF) of a path is a useful general object. However, we may also want to marginalize it, thus extracting the marginal PDF of being at the position x_N at the (temporal) step N from the joint PDF (of the path) conditioned on the initial position x_0 at the moment of time t_0, p(x_1, ···, x_N|x_0), and from the prior/initial distribution p_0(x_0), both assumed known:
p_N(x_N) = ∫ dx_0 ··· dx_{N−1} p(x_0, ···, x_N) = ∫ dx_0 ··· dx_{N−1} p(x_1, ···, x_N|x_0) p_0(x_0).   (9.26)
It is convenient to derive the relation between p_N(·) and p_0(·) in steps, i.e. through an induction/recurrence, integrating over dx_0, ···, dx_{N−1} sequentially. Let us proceed analyzing the case of Brownian motion, where v = 0. Then the first step of the induction becomes
p_1(x_1) = (4πD∆)^{−1/2} ∫ dx_0 exp(−(x_1 − x_0)²/(4D∆)) p_0(x_0)   (9.27)
= (4πD∆)^{−1/2} ∫ dε exp(−ε²/(4D∆)) p_0(x_1 + ε)   (9.28)
≈ (4πD∆)^{−1/2} ∫ dε exp(−ε²/(4D∆)) (p_0(x_1) + ε ∂_{x_1} p_0(x_1) + (ε²/2) ∂²_{x_1} p_0(x_1))   (9.29)
= p_0(x_1) + ∆D ∂²_{x_1} p_0(x_1),   (9.30)
where transitioning from Eq. (9.28) to Eq. (9.29) one makes a Taylor expansion in ε, assuming that ε ∼ √∆ and keeping only the leading terms in ∆. The resulting Gaussian integrations are straightforward. We arrive at the discretized (in time) version of the diffusion equation, whose continuous-time limit is
∂_t p(x|t) = D ∂²_x p(x|t),   (9.31)
where we write p(x|t) to emphasize that this is the probability of being at the position x at the moment of time t, i.e. the expression is conditioned on t, and thus ∀t : ∫ dx p(x|t) = 1. Of course it is not surprising that the case of Brownian motion has resulted in the diffusion equation for the marginal PDF. Restoring the deterministic drift term, v(x) (the derivation is straightforward), one arrives at the Fokker-Planck equation, generalizing the zero-drift diffusion equation:
∂_t p(x|t) = −∂_x (v(x) p(x|t)) + D ∂²_x p(x|t).   (9.32)
Here we only give a very brief and incomplete description of the properties of this distribution, whose analysis is of fundamental importance for Statistical Mechanics; see e.g. [20].
The Fokker-Planck equation (9.32) is a linear and deterministic Partial Differential Equation (PDE). It describes the evolution/flow of the probability density distribution, continuous in both phase space, x, and time, t.
The derivation was for a particle moving in 1d, R, but the same ideology and logic extend to higher dimensions, R^d, d = 1, 2, ···. There are also extensions of this consideration to compact continuous spaces; thus one can analyze dynamics on a circle, sphere or torus.
Analogs of the Fokker-Planck equation can be derived and analyzed for more complicated probabilities than just the marginal probability of the state (the path integral marginalized to a given time). An example here is the so-called first-passage, or "first-hitting", problem.
The temporal evolution is driven by two terms, often called "diffusion" and "advection". The terminology originates from fluid mechanics and describes how probabilities "flow" in the phase space. The diffusion originates from the stochastic source, while the advection is associated with a deterministic (possibly nonlinear) force.
Linearity of the Fokker-Planck equation does not imply that it is simpler than the original nonlinear problem. Deriving the Fokker-Planck equation we made a transition from a nonlinear, stochastic ODE to a linear PDE. This type of transition, from a nonlinear representation of many trajectories to a linear probabilistic representation, is typical in math/statistics/physics. The linear Fokker-Planck equation can be viewed as the continuous-time, continuous-space version of the discrete-time/discrete-space Master equation describing the evolution of a (finite-dimensional) probability vector in the case of a Markov Chain.
The Fokker-Planck Eq. (9.32) can be represented in the 'flux' form:
∂_t p(x|t) = −∂_x J(t; x),   J(t; x) = v(x) p(x|t) − D ∂_x p(x|t),   (9.33)
where J(t; x) is the flux of probability through the state-space point x at the moment of time t. The fact that the flux term in Eq. (9.33) has a gradient form corresponds to the global conservation of probability. Indeed, integrating Eq. (9.33) over the whole continuous domain of achievable x, and assuming that if the domain is bounded there is no injection (or dissipation) of probability on the boundary, one finds that the integral of the flux term is zero (according to the standard Gauss theorem of calculus) and thus ∂_t ∫ dx p(x|t) = 0. In the steady state, when ∂_t p(x|t) = 0 for all x (and not only as a result of integration over the entire domain), the flux is constant: it does not depend on x. The case of zero flux is the special case of so-called 'equilibrium' statistical mechanics.
(See some further comments below on the latter.)
If the initial probability distribution p(x|0) is known, then p(x|t) for any subsequent t is well defined, in the sense that the Fokker-Planck equation poses a Cauchy (initial value) problem with a unique solution.
Remarks about simulations. One can solve the PDE, but one can also analyze the stochastic ODE, approaching the problem in two complementary ways, corresponding to the Eulerian and Lagrangian analysis in Fluid Mechanics of "incompressible" flows in the probability space.
The main and simplest (already mentioned) example of the Langevin dynamics is Brownian motion, i.e. the case of zero drift, v = 0. Another example, principal for the so-called 'equilibrium statistical physics', is where the velocity (of the deterministic drift) is a spatial gradient of a potential U(x): v(x) = −∂_x U(x). Think, for example, of x representing an over-damped particle connected to the origin by a spring; U(x) is the potential/energy stored within the spring. In this case of gradient drift the stationary (i.e. time-independent) solution of the Fokker-Planck Eq. (9.32) can be found explicitly:
p(x|t) →_{t→∞} p_st(x) = Z^{−1} exp(−U(x)/D).   (9.34)
Example 9.3.4. Consider the motion of a Brownian particle in the parabolic potential U(x) = γx²/2. (The situation is typical for a particle located near a minimum or maximum of a potential.) The Langevin equation (9.16) in this case becomes
dx/dt + γx = √(2D) ξ(t),   ⟨ξ(t)⟩ = 0,   ⟨ξ(t_1)ξ(t_2)⟩ = δ(t_1 − t_2).   (9.35)
Write a formal solution of Eq. (9.35) for x(t) as a functional of ξ(t). Compute ⟨x²(t)⟩ as a function of t and interpret the results. Write the Kolmogorov-Fokker-Planck (KFP) equation for p(x|t), and solve it for the initial condition p(x|0) = δ(x).
Solution. Multiply Eq. (9.35) by the integrating factor e^{γt} to get
(d/dt)(x(t) e^{γt}) = √(2D) ξ(t) e^{γt},
which has the formal solution
x(t) e^{γt} = x(0) + √(2D) ∫_0^t ξ(t′) e^{γt′} dt′.
The formal solution simplifies to
x(t) = x(0) e^{−γt} + √(2D) ∫_0^t ξ(t′) e^{−γ(t−t′)} dt′.
We wish to find the mean and the variance of x(t). The first moment becomes
⟨x(t)⟩ = ⟨x(0) e^{−γt} + √(2D) ∫_0^t ξ(t′) e^{−γ(t−t′)} dt′⟩ = x(0) e^{−γt} + √(2D) ∫_0^t ⟨ξ(t′)⟩ e^{−γ(t−t′)} dt′ = x(0) e^{−γt}.
For the second moment, using ⟨ξ(t_1)ξ(t_2)⟩ = δ(t_1 − t_2), one finds
⟨x²(t)⟩ = x(0)² e^{−2γt} + 2D ∫_0^t e^{−2γ(t−t′)} dt′ = x(0)² e^{−2γt} + (D/γ)(1 − e^{−2γt}).
The interpretation of the solution is as follows: (i) The contribution to ⟨x²(t)⟩ from the initial condition decays at the rate 2γ. (ii) At the smallest times, t ≪ 1/γ, we expand the term (1 − e^{−2γt}) as a first-order Taylor polynomial to find the usual diffusion, ⟨x²(t)⟩ ≃ 2Dt, since the particle does not feel the potential. (iii) At larger time scales, t ≫ 1/γ, the dispersion saturates: ⟨x²(t)⟩ ≃ D/γ.
The Kolmogorov-Fokker-Planck equation, ∂_t p(x|t) = (γ ∂_x x + D ∂²_x) p(x|t), should be supplemented by the initial condition p(x|0) = δ(x). Then the solution (the Green function) is
p(x|t) = (2π⟨x²(t)⟩)^{−1/2} exp(−x²/(2⟨x²(t)⟩)).   (9.36)
The meaning of the expression is clear: the probability density p(x|t) is Normal/Gaussian with a time-dependent variance.
Example 9.3.5. Prove that the moments ⟨x^k(t)⟩ of a stochastic process x(t) in R obey the following recurrence equation
∂_t ⟨x^k(t)⟩ = −αk ⟨x^k(t)⟩ + D k(k − 1) ⟨x^{k−2}(t)⟩.   (9.39)
Write down the stochastic ODE for the underlying stochastic process x(t) and, given the initial condition p(x|0) = δ(x), compute the respective statistical moments ⟨x^k(t)⟩.
Solution. We propose that the corresponding Langevin equation is
ẋ = −αx + √(2D) ξ(t),
and verify our proposal by going through derivations similar to those shown in Example 9.3.4.
Let μ_k(t) be the k-th moment of the random process, μ_k(t) := ⟨x^k(t)⟩ = ∫ x^k p(x|t) dx. A differential equation for μ_k(t) can be derived from the KFP equation:
∂_t μ_k(t) = ∫_{−∞}^∞ x^k ∂_t p(x|t) dx
= ∫_{−∞}^∞ x^k (α ∂_x (x p(x|t)) + D ∂²_x p(x|t)) dx
= −kα ∫_{−∞}^∞ x^k p(x|t) dx + D k(k − 1) ∫_{−∞}^∞ x^{k−2} p(x|t) dx,
where the boundary terms from integration by parts vanish at ±∞ because p(x|t) and its derivatives decay faster than any polynomial as x → ±∞.
In the case where p(x|0) = δ(x), μ_0(t) = 1 and μ_1(t) = 0. Applying the recurrence relation, we get μ_{2k+1}(t) = 0 for all odd-order moments. The even-order moments are obtained by solving Eq. (9.39) sequentially. For example, the second moment can be found by solving the differential equation
∂_t μ_2(t) = −2α μ_2(t) + 2D,   μ_2(0) = 0,   so that   μ_2(t) = (D/α)(1 − e^{−2αt}).
Example 9.3.7. Consider the following expectation over the Langevin term, ξ(t):
Ψ(t; x) = ⟨exp(∫_0^t dτ Q(x(τ)))⟩_{x(t)=x, x(0)=0}   (9.40)
←_{N→∞} Ψ_N(x) = ∫ dx_0 ··· dx_{N−1} p(x, x_{N−1}, ···, x_1|x_0) δ(x_0) e^{∆(Q(x_{N−1}) + ··· + Q(x_0))}   (9.41)
= ∫ (dx_1/√(4πD∆)) ··· (dx_{N−1}/√(4πD∆)) exp(−((x − x_{N−1})² + ··· + x_1²)/(4D∆)) e^{∆(Q(x_{N−1}) + ··· + Q(0))},
where x(t) is the Brownian motion, thus satisfying ẋ(t) = ξ(t), with x(t) set to zero initially, x(0) = 0; Q(x(t)) is a given function of x, finite everywhere in R; and Eq. (9.40) and Eq. (9.41) show, respectively, the continuous-time and discrete-time versions of the same expectation of interest.
explicitly, i.e. bypassing solving the PDE (derived in (a)), which does not allow
explicit solutions for a general Q(x(t)).
Solution. (a) First of all notice that Ψ(0; x) = δ(x). Further, observe that when Q(x) = 0 the resulting PDE becomes the diffusion equation (the zero-drift case of Eq. (9.32)) with p(x|t) replaced by Ψ(t; x). We can rewrite Eq. (9.41) as a recurrence
∀k = 1, ···, N : Ψ_k(x) = ∫ (dx_{k−1}/√(4πD∆)) exp(−(x − x_{k−1})²/(4D∆) + ∆Q(x_{k−1})) Ψ_{k−1}(x_{k−1}),   (9.43)
where Ψ_0(x) = δ(x). Changing the integration variable, x_{k−1} → ε = x_{k−1} − x, keeping the Gaussian expression in the integrand intact and expanding all other terms in a Taylor series in ε, then evaluating the resulting Gaussian integrals, we arrive at the following differential version of the recurrence:
∀k = 1, ···, N : Ψ_k(x) = (1 + ∆Q(x) + ∆D∂²_x + O(∆²)) Ψ_{k−1}(x).
The continuous-time version of Eq. (9.43), and therefore the desired Cauchy (initial value) problem stated as a PDE supplemented with the initial condition, becomes:
∂_t Ψ(t; x) = Q(x) Ψ(t; x) + D ∂²_x Ψ(t; x),   Ψ(0; x) = δ(x).   (9.44)
(b) Let us substitute Q(x) in the expressions above, e.g. in the PDE (9.44), by δ·Q(x), expand the PDE (9.44) in a Taylor series in δ and write down the relations for the zeroth, first and second order terms in δ. We derive the following set of diffusion equations (the first homogeneous and the other two inhomogeneous):
∂_t ψ^{(0)}(t; x) = D ∂²_x ψ^{(0)}(t; x),   ψ^{(0)}(0; x) = δ(x),   (9.45)
∂_t ψ^{(1)}(t; x) = Q(x) ψ^{(0)}(t; x) + D ∂²_x ψ^{(1)}(t; x),   ψ^{(1)}(0; x) = 0,   (9.46)
∂_t ψ^{(2)}(t; x) = Q(x) ψ^{(1)}(t; x) + D ∂²_x ψ^{(2)}(t; x),   ψ^{(2)}(0; x) = 0,   (9.47)
where
ψ^{(n)}(t; x) = (1/n!) ⟨(∫_0^t dτ Q(x(τ)))^n⟩_{x(t)=x, x(0)=0}.
Exercise 9.3 (Self-propelled particle). The term "self-propelled particle" refers to an object capable of moving actively by gaining energy from the environment. Examples of such objects range from Brownian motors and motile cells to macroscopic animals and mobile robots. In the simplest two-dimensional model, the self-propelled particle moves in the xy-plane with fixed speed v_0. The Cartesian components of the particle velocity, ẋ(t) and ẏ(t), expressed in polar coordinates are
ẋ = v_0 cos ϕ,   ẏ = v_0 sin ϕ,
where the polar angle ϕ defines the direction of motion. Assume that ϕ evolves according to the stochastic equation
dϕ/dt = √(2D) ξ,   (9.48)
where ξ(t) is Gaussian white noise with zero mean and pair correlation function ⟨ξ(t_1)ξ(t_2)⟩ = δ(t_1 − t_2). The initial conditions are chosen to be ϕ(0) = 0, x(0) = 0 and y(0) = 0.
Hint: Derive the equation for the probability density of observing ϕ at the moment of time t, solve the equation and use the result. You may also consider using the first and second derivatives of the object of interest over t, as well as evaluations from Example 9.3.7.
and
∀i : Σ_{j:(j←i)∈E} p_{ji} = 1.   (9.50)
The combination of G and p := (p_{ji} | (j ← i) ∈ E) defines a MC. Mathematically we also say that the tuple (finite ordered set of elements) (V, E, p) defines the Markov chain.
We will mainly consider stationary Markov chains, where p_{ji} does not change in time. However, for many of the following statements/considerations the generalization to time-dependent processes is straightforward.
A useful interactive playground can be found at http://setosa.io/ev/markov-chains/.
Figure 9.3: A two-state Markov chain with self-loop probabilities 0.7 (state A) and 0.5 (state B), and transition probabilities 0.3 (A → B) and 0.5 (B → A).
One way to analyze a Markov chain is by generating sample trajectories. How does one relate the weighted, directed graph to samples? The relation, actually, has two sides. The direct side is about generating samples, which is done by first initializing the trajectory at a particular state, and then advancing the trajectory from the current state to a randomly selected adjacent state according to the transition probabilities. The inverse side is about verifying whether given samples were indeed generated according to the rather restrictive MC rules, and even reconstructing the characteristics of the underlying Markov chain.
Example 9.4.1. Describe how a sample trajectory may be generated for the Markov chain illustrated in Fig. 9.3. Assume the system begins in state A at time 0.
Solution. Since the system is in state A at time 0, set X_0 = A. At time 1, the system may either remain in state A with probability 0.7 or transition to state B with probability 0.3. Formally we write P(X_1 = A|X_0 = A) = 0.7 and P(X_1 = B|X_0 = A) = 0.3. One can generate a sample trajectory by first drawing a random number on the interval [0, 1) and then setting X_1 = A if the random number lies in [0, 0.7) and X_1 = B if it lies in [0.7, 1). By the Markov property, the state of the system at time 2 depends only on the state at time 1, so one can then generate a second random number on [0, 1) and define X_2 accordingly. Samples of the trajectory (x_t)_{t=0}^∞ can be generated efficiently in this manner and may look something like AABABBAA...
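The procedure translates directly into code; here is a minimal Julia sketch (our own, not from the course materials) for the chain of Fig. 9.3.

# Sample a trajectory of the two-state chain: stay in A w.p. 0.7, stay in B w.p. 0.5.
function sample_chain(nsteps)
    s, out = 'A', Char[]
    for _ in 1:nsteps
        push!(out, s)
        r = rand()
        s = (s == 'A') ? (r < 0.7 ? 'A' : 'B') : (r < 0.5 ? 'B' : 'A')
    end
    return String(out)
end

println(sample_chain(20))   # e.g. "AABABBAA..."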
The Markov chain defines a random (stochastic) dynamic process. Although time may flow continuously, Markov chains consider time to be discrete (which is sometimes a matter of convenient abstraction; sometimes, actually quite often, events do happen discretely). One uses t = 0, 1, 2, ··· for the times when jumps occur. Then a particular random trajectory/path/sample of the system will look like X_0 → X_1 → X_2 → ···, with each X_t ∈ V.
Definition 9.4.2 (Irreducible). A Markov chain is said to be irreducible if one can access any state from any state; formally,
∀i, j ∈ V, ∃t : P(X_t = j|X_0 = i) > 0.
The Markov chain in Fig. 9.3 is obviously irreducible. However, if we replace 0.3 → 0 and 0.7 → 1 it becomes reducible, because state B would no longer be accessible from state A.
Definition 9.4.3 (Aperiodicity). We say that state i has period k if every return to the state must occur at times that are multiples of k. Formally, the period of state i is
k = gcd{n > 0 : P(X_n = i|X_0 = i) > 0},
provided that the set is not empty (otherwise the period is not defined). If k = 1 then the state is said to be aperiodic. If all the states of a Markov Chain are aperiodic, then we say that the Markov chain is aperiodic.
For an irreducible MC, a single aperiodic state implies that all states are aperiodic; in particular, any irreducible MC with at least one self-loop is aperiodic. The Markov chain in Fig. 9.3 is obviously aperiodic. However, it becomes periodic, with period two, if the two self-loops are removed.
Example 9.4.4. Consider the Markov chain shown in Fig. 9.4. Is this Markov chain reducible or irreducible? Periodic or aperiodic?
Figure 9.4: A three-state Markov chain traversed cyclically, A → B → C → A, each transition with probability 1.0 (see Example 9.4.4).
Figure 9.5: A reducible three-state Markov chain in which state C is absorbing (see Example 9.4.5 and the transition matrix in Example 9.4.14).
Solution. The Markov chain shown in Fig. 9.4 is irreducible and periodic. The Markov chain is irreducible because each state is accessible from each other state. The Markov chain is periodic because if we start in state C, we can return to it only after 3, 6, 9, ... steps (the system never forgets its initial state); we say that state C has period 3. Recall that a Markov chain is aperiodic if and only if each state has period 1. One can make this Markov chain aperiodic by adding a self-loop to any of the three states.
Example 9.4.5. Consider the Markov chain shown in Fig. 9.5. Is this Markov chain reducible or irreducible? Periodic or aperiodic?
Solution. The Markov chain shown in Fig. 9.5 is reducible and periodic. The Markov chain is reducible because states A and B cannot be accessed from state C. Notice that once the system enters state C it remains there forever; it cannot escape from state C. This Markov chain is periodic because state A has period 2.
It is often of interest whether the system is guaranteed to return to a given state upon leaving it, and if so, how many steps we should expect this to take. Define the time of the first return as τ_1 := min{t ≥ 1 : X_t = X_0}. A state is called recurrent if the return is certain, P(τ_1 < ∞) = 1, and positive recurrent if, in addition, the expected return time E[τ_1] is finite.
Figure 9.6: An example of a Markov chain with countably many (thus an infinite number of) states. In this example, P(A_{n+1} ← A_n) = 1/2 and P(A_1 ← A_n) = 1/2. This Markov chain is positive recurrent. See Example 9.4.8.
Figure 9.7: An example of a Markov chain with countably many states. In this example, P(A_{n+1} ← A_n) = 1 − 1/n and P(A_1 ← A_n) = 1/n. This Markov chain is recurrent, but not positive recurrent. See Example 9.4.9.
Example 9.4.6. Consider the Markov chain illustrated in Fig. 9.3. (a) Compute the probability that the first return to state A is in exactly n steps. (b) Compute the expected return time to state A.
Solution.
(a) Observe that if the first return to state A is in exactly 1 step, then the state must have transitioned from A back to A (a self-loop). If the first return to state A is in exactly n steps (where n ≥ 2), then the state must have transitioned from A to B on step 1, then remained at B for n − 2 steps, and then returned from B to A on step n:
P(τ_1 = n|X_0 = A) = P(X_1 = B|X_0 = A) · P(X_k = B|X_{k−1} = B)^{n−2} · P(X_n = A|X_{n−1} = B)
= 0.7 if n = 1, and 0.3 · (0.5)^{n−2} · 0.5 otherwise.
(b) The expected return time is E[τ_1|X_0 = A] = 1 · 0.7 + Σ_{n≥2} n · 0.3 · (0.5)^{n−1} = 0.7 + 0.9 = 1.6, the inverse of the stationary probability of state A found below, 1/0.625.
Example 9.4.7. Consider the Markov chain illustrated in Fig. 9.4. (a) Compute the probability that the first return to state A occurs in exactly n steps. (b) Compute the expected return time to state A.
Solution.
(a) Observe that when the state transitions out of A it is guaranteed to transition to B, from there to C, and from there back to A:
P(τ_1 = n|X_0 = A) = 1 if n = 3, and 0 otherwise.
(b) The expected return time to state A is therefore exactly 3 steps.
Example 9.4.8. Consider the Markov chain illustrated in Fig. 9.6. (a) Compute the probability that the first return to state A_1 is in exactly n steps. (b) Compute the expected return time to state A_1.
Solution.
(a) Notice that if the first return time is n, then the state must have transitioned from A_1 to A_2 to A_3 and so on up to A_{n−1}, and then returned to A_1 on the n-th step:
P(τ_1 = n|X_0 = A_1) = (1/2) · (1/2) ··· (1/2) = 1/2^n.
(b) The expected return time to state A_1 is E[τ_1] = Σ_{n≥1} n/2^n = 2, which is finite, so the chain is positive recurrent.
Example 9.4.9. Consider the Markov chain illustrated in Fig. 9.7. (a) Compute the probability that the first return to state A_1 is in exactly n steps. (b) Compute the expected return time to state A_1.
Solution.
(a) Notice that if the first return time is n, then the state must have transitioned from A_1 to A_2 to A_3 and so on up to A_{n−1}, and then returned to A_1 on the n-th step:
P(τ_1 = n|X_0 = A_1) = (1/1) · (1/2) · (2/3) · (3/4) ··· ((n−2)/(n−1)) · (1/n) = 1/((n−1)n).
(b) The expected return time is E[τ_1] = Σ_{n≥2} n · 1/((n−1)n) = Σ_{n≥2} 1/(n−1), which diverges; the chain is recurrent (the probabilities sum to one) but not positive recurrent.
The distinction between recurrent and positive recurrent becomes important when an-
alyzing Markov chains on countable state spaces.
Exercise 9.4. Give an example of a Markov chain with an infinite number of states,
which is irreducible and aperiodic (prove it), but which does not converge to an equilibrium
probability distribution.
Figure 9.8: The hypercube graph of 3-bit strings: the 2³ vertices are the strings 000, ..., 111, and edges connect strings that differ in a single bit.
Definition 9.4.11 (Ergodic). A state is said to be ergodic if the state is aperiodic and
positive-recurrent. A Markov chain is said to be ergodic if it is irreducible and if every state
is ergodic.
Markov chains are often used to generate samples of a desired distribution. One can imagine a particle that travels over a graph according to the weights of the edges. If the Markov chain is ergodic, then after some time the probability distribution of the particle becomes stationary (one says that the chain has mixed), and from then on the trajectory of the particle represents samples of the distribution. Important information about the distribution, such as its moments or the expectation values of functions, can be extracted by analyzing the trajectory of the particle.
Imagine that you need to generate a random string of n bits. There are 2^n possible configurations. You can organize these configurations on a hypercube graph with 2^n vertices, where each vertex has n neighbors, corresponding to the strings that differ from it by a single bit, as in Fig. 9.8. Our Markov chain will walk along these edges, flipping one bit at a time. The trajectory, after a long time, will correspond to a series of random strings. The important question is how long we should wait before our Markov chain becomes mixed (loses memory of the initial condition). To answer this question we should look at the MC from a more mathematical point of view.
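A sketch of the bit-flip walk (our own illustration): each step picks a random coordinate and flips it. Note that the pure bit-flip walk is periodic (the parity of the number of ones alternates); a common remedy, mentioned here only as an aside, is the "lazy" version that stays put with probability 1/2.

# Single-bit-flip walk on the n-dimensional hypercube.
function hypercube_walk(n, nsteps)
    s = falses(n)                   # start from the all-zeros string
    for _ in 1:nsteps
        i = rand(1:n)               # pick a random coordinate...
        s[i] = !s[i]                # ...and flip it
    end
    return s
end

println(hypercube_walk(8, 1001))    # a pseudo-random 8-bit string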
We began the section by defining a Markov chain in terms of the weighted, directed graph (V, E, p) and attempting to analyze it by generating sample trajectories. A rigorous analysis involves the probability state vector, or simply the state vector, which is the vector whose i-th component represents the probability that the system is in state i at the moment of time t:
π(t) := (π_i(t))_{i∈V} where π_i(t) := P(X_t = i).   (9.54)
Thus, π_i ≥ 0 and Σ_{i∈V} π_i = 1.
The probability state vector evolves according to
∀i ∈ V, ∀t = 0, 1, ··· : π_i(t + 1) = Σ_{j:(i←j)∈E} p_{ij} π_j(t),   (9.55)
where p := {p_{ij}}, called the transition probability matrix, is the matrix whose (i, j) component is the probability of transitioning from state j to state i. It is a stochastic matrix (defined below).
Definition 9.4.12. A matrix is called stochastic if all of its components are nonnegative and each column sums to 1.
To analyze the Markov chain after k sequential steps, we consider repeated application of Eq. (9.56), which results in
π(k) = p^k π(0).   (9.57)
Example 9.4.13. Find the stochastic matrix associated with Fig. 9.3. Is the corresponding Markov chain reducible?
Solution. For Fig. 9.3 the stochastic matrix is
p = ( 0.7  0.5
      0.3  0.5 ).   (9.58)
A Markov chain is irreducible if each state is accessible from every other state. If π(k) is the vector of probabilities for each state at time k, given initial probabilities π(0), then for the initial conditions corresponding to each of the two states, π(0) = (1, 0)^T and π(0) = (0, 1)^T, we find that every entry of π(k) = p^k π(0) is non-zero. Therefore every state has non-zero probability at time k and we conclude that the Markov chain is irreducible.
Example 9.4.14. Find the stochastic matrix associated with Fig. 9.5. Is the corresponding Markov chain reducible?
Solution. For Fig. 9.5, the stochastic matrix is
p = ( 0.8  0.9  0.0
      0.2  0.0  0.0
      0.0  0.1  1.0 ).
Repeated matrix multiplication shows that the first two entries of the third column of p^k are zero for all k. This means that states A and B are inaccessible from state C. Therefore the Markov chain is reducible.
Definition 9.4.15 (Stationary Distribution). The probability state vector (if it exists) that satisfies
π* = p π*   (9.59)
is called the stationary distribution or invariant measure. (Recall that to be a state vector each component must be nonnegative and the components must sum to unity.)
Solving Eq. (9.59) for the example of Eq. (9.58) one finds
π* = (0.625, 0.375)^T.   (9.60)
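Numerically, π* can be extracted from the eigen-decomposition of p; a minimal sketch (our own code) for the chain of Fig. 9.3:

using LinearAlgebra

p = [0.7 0.5; 0.3 0.5]              # column-stochastic transition matrix
F = eigen(p)
i = argmax(real.(F.values))         # locate the eigenvalue 1
pstat = real.(F.vectors[:, i])
pstat ./= sum(pstat)                # normalize to a probability vector
println(pstat)                      # ≈ [0.625, 0.375]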
Assume that p is diagonalizable (has n = |V| linearly independent eigenvectors); then we can decompose p according to the eigen-decomposition
p = U Σ U^{−1}.   (9.62)
Let us represent the initial π(0) as an expansion over the normalized eigenvectors u_i, i = 1, ···, n, of p:
π(0) = Σ_{i=1}^n a_i u_i.   (9.64)
Then
π(t) = p^t π(0) = Σ_{i=1}^n a_i λ_i^t u_i.   (9.65)
Since lim_{t→∞} π(t) = π* = u_1, we get that a_1 = 1, and the second term on the rhs of Eq. (9.65) describes the rate of convergence of π(t) to the steady state as t → ∞. The convergence is exponential in t, with rate log(λ_1/|λ_2|) = −log|λ_2|.
Example 9.4.17. Find the eigenvalues for the MC shown in Fig. 9.9 with the transition matrix
p = ( 0    5/6  1/3
      5/6  0    1/3
      1/6  1/6  1/3 ).   (9.66)
What defines the speed of convergence of the MC to its steady state?
Solution. Let us start by noticing that p is stochastic. If the initial probability distribution is π(0), then the distribution after t steps is
π(t) = p^t π(0).   (9.67)
Figure 9.9: The three-state Markov chain of Example 9.4.17; the edge labels (5/6, 1/6, 1/3, ...) are the transition probabilities of Eq. (9.66).
In the steady state,
p π* = π*.   (9.68)
Thus, π* is an eigenvector of p with eigenvalue 1, with all components positive and normalized. The matrix (9.66) has three eigenvalues, λ_1 = 1, λ_2 = 1/6, λ_3 = −5/6, and the corresponding eigenvectors are
π* = (2/5, 2/5, 1/5)^T,   u_2 = (−1/2, −1/2, 1)^T,   u_3 = (−1, 1, 0)^T.   (9.69)
Suppose that we start in state A, i.e. π(0) = (1, 0, 0)^T. We can write the initial state as a linear combination of the eigenvectors,
π(0) = π* − u_2/5 − u_3/2,   (9.70)
and then
π(t) = p^t π(0) = π* − (λ_2^t/5) u_2 − (λ_3^t/2) u_3.   (9.71)
Since |λ_2| < 1 and |λ_3| < 1, in the limit t → ∞ we obtain π(t) → π*. The speed of convergence is defined by the eigenvalue (λ_2 or λ_3) with the greatest absolute value.
Note that the considered situation generalizes to the following powerful statement (see [16] for details):
Theorem 9.4.18 (Perron-Frobenius Theorem). An ergodic Markov chain with transition matrix p has a unique eigenvector π* with eigenvalue 1, and all its other eigenvectors have eigenvalues with absolute value less than 1.
A Markov chain is called reversible if its stationary distribution satisfies
∀{i, j} : p_{ij} π*_j = p_{ji} π*_i,   (9.72)
where {i, j} is our notation for the undirected edge, assuming that both directed edges (i ← j) and (j ← i) are elements of the set E.
In physics this property is also called Detailed Balance (DB). If one introduces the so-called ergodicity matrix
Q := (Q_{ij} = p_{ij} π*_j | (i ← j) ∈ E),   (9.73)
then DB translates into the statement that Q is symmetric, Q = Q^T. A MC for which the property does not hold is called irreversible: Q − Q^T is nonzero, i.e. Q is asymmetric, for an irreversible MC. The asymmetric component of Q is the matrix built from the currents/flows (of probability). Thus for the case shown in Fig. 9.3,
Q = ( 0.7 · 0.625  0.5 · 0.375 ;  0.3 · 0.625  0.5 · 0.375 ) = ( 0.4375  0.1875 ;  0.1875  0.1875 ).   (9.74)
Q is symmetric: even though p_{12} ≠ p_{21}, there is still no net flow of probability between states 1 and 2, because the "populations" of the two states, π*_1 and π*_2, compensate for the difference, Q_{12} − Q_{21} = 0. In fact, one observes that in the two-node situation the steady state of a MC is always in DB.
Note that if a steady distribution π* satisfies the DB condition (9.72) for a MC (V, E, p), it will also be a steady state of another MC (V, E, p̃) satisfying the more general Balance (or global balance) B-condition
Σ_{j:(j←i)∈E} p̃_{ji} π*_i = Σ_{j:(i←j)∈E} p̃_{ij} π*_j.   (9.75)
This suggests that many different MCs (many different dynamics) may result in the same steady state. Obviously DB is a particular case of the B-condition (9.75).
The difference between the DB- and B-conditions can be nicely interpreted in terms of flows (think water) in the state space. From the hydrodynamic point of view, a reversible MC corresponds to an irrotational probability flow, while irreversibility relates to a nonzero rotational part, e.g. corresponding to vortices contained in the flow. Putting it formally: in the irreversible case the skew-symmetric part of the ergodic flow matrix Q = (p̃_{ij} π*_j | (i ← j)) is nonzero, and it allows the following cycle decomposition,
Q_{ij} − Q_{ji} = Σ_α J_α (C^α_{ij} − C^α_{ji}),   (9.76)
where the index α enumerates cycles on the graph of states, with adjacency matrices C^α. Then J_α stands for the magnitude of the probability flux flowing over the cycle α.
One can use the cycle decomposition to modify a MC such that the steady distribution stays the same (is invariant). Of course, cycles should be added with care, e.g. to make sure that all the transition probabilities in the resulting p̃ are positive (stochasticity of the matrix will be guaranteed by construction). The procedure of "adding cycles", along with some additional tricks (e.g. the so-called lifting/replication), may help to improve mixing, i.e. speed up convergence to the steady state, which is a very desirable property for sampling π* efficiently.
Example 9.4.20. Given a stationary solution π* = (π*_1, π*_2, π*_3), construct a three-state Markov chain, i.e. present a (3 × 3) transition matrix p, (a) of a general position (satisfying global balance), (b) satisfying detailed balance. Are the constructions unique? Find the spectrum of the transition matrix in case (b) and verify that the Perron-Frobenius theorem 9.4.18 holds. In case (b) formulate and solve an example of the fastest-mixing MC. Can one generalize the solution and find the fastest-mixing MC of size n, given π* = (π*_1, ···, π*_n)? Return to the three-state MC and impose the constraint that all the diagonal elements of the transition probability matrix (corresponding to self-loops on the fully connected three-state graph) are zero, p(1, 1) = p(2, 2) = p(3, 3) = 0. Is the MC unique in this case? Is it ergodic?
Solution. See the Mathematica snippet MC-3nodes.nb (or the pdf printout MC-3nodes.pdf) posted at the D2L site of the course.
Example 9.4.21. Let Σ = {x_0, x_1, ···, x_{K−1}} be K equidistant points on the circle, i.e., x_k = e^{2πik/K}. Let α, β ∈ (0, 1) be constants that satisfy α + β + γ = 1, and consider the random walk (X_t) that moves one step counterclockwise with probability α, one step clockwise with probability β, and stays put with probability γ:
P(X_{t+1} = x_{k+1}|X_t = x_k) = α,   P(X_{t+1} = x_{k−1}|X_t = x_k) = β,   P(X_{t+1} = x_k|X_t = x_k) = γ   (indices mod K).
(a) For what values of α, β, and γ (and K) is the Markov chain ergodic?
(b) Find the stationary distribution of the chain.
(c) For what values of α and β does the Markov chain satisfy detailed balance?
(d) Let p denote the transition matrix. Find exact expressions for the eigenvalues of p. Hint: the linear transformation represented by p is a convolution operator, i.e., there is a g such that (pv)_k = (g ⋆ v)_k = Σ_ℓ g(x_{k−ℓ}) v(x_ℓ) for all K-vectors v (identifying K-vectors with functions on Σ), and thus it can be diagonalized by the discrete Fourier transform.
(e) The spectral gap of p is 1 − |λ_0|, where λ_0 is the second largest (in absolute value) eigenvalue of p. The size of the spectral gap determines how fast an ergodic chain converges to its stationary distribution: the larger the gap, the faster the convergence. Suppose γ = 0.98 and α = β. Use the result of the previous part to find the spectral gap of p to leading order in 1/K as K → ∞.
(f) Are there initial distributions that converge to the stationary distribution at a rate faster than the second largest eigenvalue? If so, give an example. If not, explain why not.
Solution.
(a) The Markov chain is irreducible if γ < 1 and aperiodic if γ > 0. (If K is odd, then α, β > 0 is also sufficient for aperiodicity.)
(b) If we set π(x) = 1/K, we see that this solves the stationarity equation. Since the chain is irreducible, the uniform distribution is therefore the unique stationary distribution.
(c) Detailed balance holds if and only if β = α. So the chain satisfies detailed balance whenever β = α = (1 − γ)/2.
(d) By the hint, the eigenvalues are given by the discrete Fourier transform of the convolution kernel:
λ_ℓ = γ + α e^{2πiℓ/K} + β e^{−2πiℓ/K} = γ + (α + β) cos(2πℓ/K) + i(α − β) sin(2πℓ/K).
(e) To find the second largest eigenvalue when α = β, we observe that |λ_ℓ|² = f(2πℓ/K), where
f(θ) = (γ + (1 − γ) cos θ)².   (9.85)
A simple calculation shows that the only critical points of f are θ = 0 and θ = π, with f(0) = 1 and f(π) = (2γ − 1)² ∈ (0, 1). By continuity, when K is large we expect the second largest eigenvalue to occur at ℓ = ±1, i.e., λ_1 = γ + (1 − γ) cos(2π/K), so the spectral gap is
1 − λ_1 = (1 − γ)(1 − cos(2π/K)) ≈ 2π²(1 − γ)/K² = 0.04 π²/K²
to leading order in 1/K.
(a) Write down the transition matrix P of the Markov chain thus defined. Is the Markov chain irreducible and aperiodic?
(b) Assume that we start with a hybrid rabbit. Let π(n) = (π_{GG}(n), π_{Gg}(n), π_{gg}(n)) be the probability distribution vector (state) of the character of the rabbit of the n-th generation. In other words, π_{GG}(n), π_{Gg}(n), π_{gg}(n) are the probabilities that the n-th generation rabbit is GG, Gg, or gg, respectively. Compute π(1), π(2), π(3). Is there some kind of law?
(c) Calculate P^n for general n. What can you say about π(n) for general n?
(d) Calculate the stationary distribution of the MC, π* = (π*_{GG}, π*_{Gg}, π*_{gg}). Does Detailed Balance hold in this case?
Figure 9.10: General Scheme of Markov Decision Process. Drawing is from [21].
• Given: a set of states, S; a set of actions, A; transition probabilities P(s, a, s′); rewards R(s, a, s′); and a discount factor γ ∈ (0, 1).
• Goal: find the policy π : S → A that maximizes the expected discounted reward accumulated over time.
Notice that in the present formulation R(·, ·, ·) and P(·, ·, ·) do not depend explicitly on time. (Generalizations are possible but we will not discuss them in the lectures, leaving them for independent study.)
The expectation on the right-hand side of Eq. (9.87) is called the global reward, i.e. the reward accumulated over the entire (infinite) time horizon under the given (not necessarily optimal) policy π. However, it makes sense to discuss not only the global reward but also the respective expected value of the reward, evaluated over the time horizon τ, called the value function:
∀τ ∈ [1, ···, ∞], ∀s_0 : V^π_τ(s_0) := E_{s_1,s_2,···}[Σ_{t=0}^{τ−1} γ^t R(s_t, π(s_t), s_{t+1})]   (9.88)
= Σ_{s_1,···,s_τ} (Π_{t′=0}^{τ−1} P(s_{t′}, π(s_{t′}), s_{t′+1})) Σ_{t=0}^{τ−1} γ^t R(s_t, π(s_t), s_{t+1}),
which depends on the initial state s_0 and the policy π(·). Observe that the right-hand side of Eq. (9.88) can also be expressed in terms of the value function evaluated at the preceding, τ − 1, step:
V^π_τ(s_0) = Σ_{s_1,···,s_τ} (Π_{t′=0}^{τ−1} P(s_{t′}, π(s_{t′}), s_{t′+1})) (R(s_0, π(s_0), s_1) + γ Σ_{t=0}^{τ−2} γ^t R(s_{t+1}, π(s_{t+1}), s_{t+2}))
= E_{s′}[R(s_0, π(s_0), s′) + γ V^π_{τ−1}(s′)].   (9.89)
The recursion, suggested by Bellman in his seminal work on dynamic programming, shows that the optimal solution of Eq. (9.87) can be found by solving the value-iteration equation (9.90),
∀τ, ∀s : V*_τ(s) = max_a Σ_{s′} P(s, a, s′)(R(s, a, s′) + γ V*_{τ−1}(s′)).   (9.90)
Moreover, once the optimal value at step τ is found, we can also find the so-called optimal policy:
∀τ, ∀s : π*_τ(s) = argmax_a Σ_{s′} P(s, a, s′)(R(s, a, s′) + γ V*_{τ−1}(s′)).   (9.91)
• Similarly to the state-dependent value function, one may also introduce the value of taking action a_0 in state s_0 under a policy π:
∀τ, s_0, a_0 : Q^π_τ(s_0, a_0) = E_{s_1,s_2,···}[R(s_0, a_0, s_1) + Σ_{t=1}^{τ−1} γ^t R(s_t, π(s_t), s_{t+1})] = E_{s′}[R(s_0, a_0, s′) + γ V^π_{τ−1}(s′)].   (9.92)
Therefore, instead of working with V^π_τ(s_0) we can re-state the entire DP approach and the optimization procedure in terms of Q^π_τ(s_0, a_0), the action-value function of the state s_0 and action a_0 under policy π, also called the Q-function. In particular, the action-value version of the optimality recursion becomes
∀τ, s_0, a_0 : Q*_τ(s_0, a_0) = E_{s′}[R(s_0, a_0, s′) + γ max_{a′} Q*_{τ−1}(s′, a′)].   (9.93)
• There exists an alternative iterative way of solving the optimization problem (9.87) via a policy-iteration algorithm. In this case the approach is to alternate between the following two steps until convergence: (a) policy evaluation: for the current (not necessarily optimal) policy, run the value-iteration algorithm solving Eqs. (9.89), either for a fixed number of steps (in τ) or until a pre-defined tolerance is achieved; (b) policy improvement: update the policy according to
∀τ, ∀s : π_τ(s) ← argmax_a Σ_{s′} P(s, a, s′)(R(s, a, s′) + γ V^π_{τ−1}(s′)).
MDP may be considered as an interactive probabilistic game one plays (against the computer). The game consists of choosing transition probabilities between the states to achieve certain objectives. Once optimal (or sub-optimal) rates are fixed, the implementation becomes just a Markov Process.
Let us play this 'Grid World' game with the rules illustrated in Fig. 9.11. An agent lives on a (3 × 4) grid. Walls block the agent's path. The agent's actions do not always go as planned: for example, 80% of the time the action 'North' takes the agent North (if there is no wall there), 10% of the time the action 'North' actually takes the agent West, and 10% East. If there is a wall where the agent would have been taken, she stays put. A big reward, +1, or penalty, −1, comes at the end: the game ends after reaching either of the two corresponding states. Visiting any other state does not result in a reward.
Figure 9.12: Optimal solution set of actions (arrows) for each state, for each time.
s ∈ {(1,1), (1,2), (1,3), (1,4), (2,1), (2,3), (2,4), (3,1), (3,2), (3,3), (3,4)},
a ∈ {↑, ↓, ←, →},
P((2,1)|(1,1), ↑) = 0.8,  P((1,1)|(1,1), ↑) = 0.1,  P((1,2)|(1,1), ↑) = 0.1,
P((1,2)|(1,1), →) = 0.8,  P((1,1)|(1,1), →) = 0.1,  P((2,1)|(1,1), →) = 0.1,  ···,
R(s, a, s′) = +1 if s = (3,4);  −1 if s = (2,4);  0 if s ≠ (3,4), (2,4).
Vτ∗ (s) is the expected sum of rewards accumulated when starting from state s and acting
optimally for a horizon of τ steps.
To find the optimal actions (policy), one may consider the Value Iteration algorithm (MDP Value Iteration). One can show that the algorithm outputs the solution of the value-iteration Eqs. (9.90). The MDP value-iteration algorithm is illustrated for the Grid World example in Fig. 9.13.
∀s : V*_0(s) = 0
for τ = 0, ···, H − 1 do
  ∀s : V*_{τ+1}(s) ← max_a Σ_{s′} P(s′|s, a)(R(s, a, s′) + γ V*_τ(s′))   [Bellman update/back-up: the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of τ + 1 steps]
end for
Some of the numbers (for the optimal value function) at the first three iterations are derived as follows:
V*_1((3,3)) = P((3,4)|(3,3), →) · V*_0((3,4)) · γ = 0.8 · 1 · 0.9 ≈ 0.72,
V*_2((3,3)) = (P((3,4)|(3,3), →) · V*_1((3,4)) + P((3,3)|(3,3), →) · V*_1((3,3))) · γ ≈ (0.8 · 1 + 0.1 · 0.72) · 0.9 ≈ 0.78,
V*_3((2,3)) = (P((3,3)|(2,3), ↑) · V*_2((3,3)) + P((2,4)|(2,3), ↑) · V*_2((2,4))) · γ ≈ (0.8 · 0.72 + 0.1 · (−1)) · 0.9 ≈ 0.43.
We also observe (empirically) that as τ → ∞ the optimal value functions approach a well-defined limit, i.e. the expected sum of rewards accumulated when acting optimally freezes (becomes stationary) in the limit.
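To make the procedure concrete, here is a hedged Julia sketch of value iteration (our own code, demonstrated on a hypothetical toy 2-state, 2-action MDP rather than the Grid World data; the array layouts P[s′, s, a] and R[s, a, s′] are our conventions).

# Value iteration: V_{τ+1}(s) = max_a Σ_{s′} P(s′|s,a) (R(s,a,s′) + γ V_τ(s′)).
function value_iteration(P, R, γ, H)
    nS, nA = size(P, 2), size(P, 3)
    V = zeros(nS)
    for _ in 1:H
        V = [maximum(sum(P[s′, s, a] * (R[s, a, s′] + γ * V[s′]) for s′ in 1:nS)
                     for a in 1:nA) for s in 1:nS]   # Bellman update
    end
    return V
end

P = zeros(2, 2, 2); R = zeros(2, 2, 2)
P[2, 1, 1] = 1.0; P[1, 1, 2] = 1.0   # from state 1: action 1 moves to 2, action 2 stays
P[1, 2, 1] = 1.0; P[2, 2, 2] = 1.0   # from state 2: action 1 moves to 1, action 2 stays
R[1, 1, 2] = 1.0                     # reward 1 for the transition 1 → 2 under action 1
println(value_iteration(P, R, 0.9, 100))  # ≈ [5.26, 4.74]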
Exercise 9.6. Consider the following modification of the Grid World example: no discount, γ = 1; the terminal penalty at s = (2,4) increases in absolute value (from −1) to −2, i.e.
R(s, a, s′) = +1 if s = (3,4);  −2 if s = (2,4);  0 if s ≠ (3,4), (2,4).
(a) Compute the optimal values of the expected sum of rewards for the first three iterations of Algorithm 5, i.e. find ∀s : V*_1(s), V*_2(s), V*_3(s).
(b) Find the optimal policy for the first three iterations, i.e. find ∀s : a*_1 = π*_1(s), a*_2 = π*_2(s), a*_3 = π*_3(s).
(c) Re-state the original MDP formulation (9.87) as a linear program. [Hint: take advantage of the linearity of the value-iteration way of solving the MDP according to Eqs. (9.90).]
(d) What we have discussed so far in this Section is the case of a deterministic policy, where π is a map/function from S to A. It may be advantageous to consider a stochastic policy, where in each state a number of actions can be taken with different probabilities, in which case π(s, a) becomes a function of two variables (state and action) describing the probability of taking action a when the state is s. Suggest a stochastic-policy modification of Eq. (9.87).
We will return to the MDP (and the grid world example) in Section ??, discussing reinforcement learning.
9.6 Queuing Networks∗
9.6.1 Queuing: a bit of History & Applications
There are a number of books written on the subject; the book of Frank Kelly and Elena Yudovina [22] is recommended.
Agner Krarup Erlang, a Danish engineer who worked for the Copenhagen Telephone Exchange, published the first paper on what would now be called queueing theory in 1909. He modeled the number of telephone calls arriving at an exchange by a Poisson process and solved the M/D/1/∞ queue in 1917 and the M/D/k/∞ queuing model in 1920.
The notations are now standard in Queuing theory, a discipline traditionally considered part of Operations Research, with deep connections to stochastic processes. In M/D/k/∞, for example,
• M stands for Markov or memoryless and means that arrivals occur according to a Poisson process. (Arrivals may also be deterministic, D.)
• D stands for deterministic and means that the jobs arriving at the queue require a fixed (deterministic) amount of service/processing. Processing can also be stochastic, Markovian (or non-Markovian, in which case it is customary to mark it G, for generic service; arrivals can also be G = generic).
• k describes the number of servers at the queueing node, k = 1, 2, .... If there are more jobs at the node than there are servers, then jobs queue and wait for service.
• ∞ stands for the allowed size of the queue (waiting room); in this case there is no limit to the waiting-room capacity (everybody arriving is admitted to the queue, not denied).
∗ This is an auxiliary section which can be dropped at first reading. Material from this section will not contribute to the midterm and final exams.
Figure 9.14: On the left: Markov Chain representation of the M/M/1 queue. In the standard
situation considered ∀i : λi = λ, µi = µ. On the right: reduced graphical description of a
single queue.
We will only be dealing with the case of an ∞ waiting room, thus dropping the last argument.
The M/M/1 queue is a simple model where a single server handles jobs that arrive according to a Poisson process and have exponentially distributed service requirements.
In an M/G/1 queue the G stands for general and indicates an arbitrary probability distribution of the service times.
Many mathematicians and engineers have contributed to the subject since the 1930s: Pollaczek, Khinchin, Kendall, Kingman, Jackson, Kelly and others.
Applications: call centers, logistics (at different scales), manufacturing, checkout at the supermarket, processing of electric vehicles at charging stations, etc. In general, any kind of practical system where arrivals (of whatever comes in units) and processing fit the framework. We are talking about design which would
Let us discuss M/M/1 in detail. We start by playing with the Java modeling tool, JMT (which can be downloaded from http://jmt.sourceforge.net/Download.html).
The process is also called a birth-death process; the name is clear from the Markov-Chain representation shown in Fig. 9.14. The MC has infinitely many states, each representing the number of customers in the system (waiting room). Arrivals of customers are modeled as a Poisson process with arrival rate λ. We assume that all customers are identical. Customers are taken from the waiting room based on the availability of the server, and services are completed at the rate µ of another Poisson process.
Everything is Poisson here (recall that merging and splitting of Poisson processes produce Poisson processes again).
Let us analyze this (relatively simple) system, starting from finding the steady state of the Markov Chain: ∀i = 0, ···, ∞, P_i, where P_i is the probability that the system is in the i-th state, i.e. with i customers in the queue.
The balance equations are
λP_0 = µP_1,   ∀n ≥ 1 : λP_{n−1} + µP_{n+1} = (λ + µ)P_n.   (9.96)
Resolving the equations (sequentially), and requiring that the total probability is normalized, Σ_{n=0}^∞ P_n = 1, we derive (with ρ := λ/µ)
P_n = (Π_{i=0}^{n−1} λ/µ) P_0 = (λ/µ)^n P_0 = ρ^n P_0,   (9.97)
1 = Σ_{n=0}^∞ P_n = P_0 Σ_{n=0}^∞ ρ^n = P_0/(1 − ρ),   (9.98)
P_n = (1 − ρ) ρ^n.   (9.99)
The average queue length is ⟨n⟩ = Σ_n n P_n = ρ/(1 − ρ); we observe that the average queue becomes infinite at ρ = 1, i.e. the steady state exists only when ρ < 1. This criterion (the existence of the steady state) can also be referred to as "stability".
Exercise: Consider a single M/M/m queue, i.e. the system where the number of servers is m. Derive the steady state. What is the modified stability criterion? Can a single-queue system with m = 2 be unstable?
In this simple queue system we can also study transient time dynamics. The steady-state system of Eqs. (9.96) transitions to
∀n : (d/dt) P_n = λP_{n−1} (arrival) + µP_{n+1} (departure) − (λ + µ) P_n.   (9.101)
Figure 9.15: An example of a queueing network: the nodes (stations) are labeled 1, ···, 4, the directed edges carry the rates λ_{ij}, and the label 0 marks the exterior (arrivals λ_{0i} and departures λ_{i0}).
Example 9.6.1. Derive Eq. (9.102) from Eq. (9.101). Compute the distribution of the duration of the busy period of the server. Assuming a first come-first served policy, compute the distribution of the waiting time and the distribution of the total time spent in the system.
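A hedged simulation sketch of the M/M/1 queue (our own code, using Gillespie-style event stepping), comparing the empirical occupation fractions with the steady state (9.99):

# Simulate the birth-death chain: arrivals at rate λ, departures at rate µ (if n > 0).
function mm1(λ, µ, T)
    n, t = 0, 0.0
    occ = Dict{Int,Float64}()                 # time spent with n customers
    while t < T
        rate = λ + (n > 0 ? µ : 0.0)
        dt = -log(rand()) / rate              # time to the next event
        occ[n] = get(occ, n, 0.0) + dt
        t += dt
        n += (rand() < λ / rate) ? 1 : -1     # arrival vs departure
    end
    return occ
end

λ, µ, T = 1.0, 2.0, 100_000.0
occ = mm1(λ, µ, T); ρ = λ / µ
for k in 0:4
    println(k, "  ", get(occ, k, 0.0) / T, "  ", (1 - ρ) * ρ^k)
end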
We can also write the dynamical Eq. (9.101) in the matrix form
(d/dt) P = P^{(tr)} P,   (9.103)
where P^{(tr)} is the tridiagonal transition-rate matrix whose n-th row is (···, 0, λ, −(λ + µ), µ, 0, ···), in agreement with Eq. (9.101).
Notice that in the steady state (achievable at ρ < 1) Detailed Balance (DB) holds, P^{(tr)}_{nm} P^{(st)}_m = P^{(tr)}_{mn} P^{(st)}_n, as it must for any one-dimensional birth-death process.
It appears that the description of a single queue can be extended to a network, e.g. of the type shown in Fig. 9.15. The λ's are the arrival and processing rates, now all denoted in the same way and indexed by the nodes/stations the process goes from and to (two indices). We study P(n_1, ···, n_N; t), the probability distribution over the state of the entire network, and, as in the single-queue case, write the balance (also called Master) equation, stated for any state of the network at any time:
∂_t P(n; t) = Σ_{(i,j)∈E} λ_{ij} ((n_i + 1) P(···, n_i + 1, ···, n_j − 1, ···; t) − n_i P(···, n_i, ···, n_j, ···; t))
+ Σ_{i∈V} λ_{0i} (P(···, n_i − 1, ···; t) − P(···, n_i, ···; t))
+ Σ_{i∈V} λ_{i0} ((n_i + 1) P(···, n_i + 1, ···; t) − n_i P(···, n_i, ···; t)).   (9.104)
(The first sum describes customers leaving station i for station j, the second external arrivals, and the third departures from the network.)
This equation is written here for the M/M/∞ case, when the number of servers at each station is infinite; this is the case when jobs do not wait but are taken for processing (by tellers who are always available) immediately.
Remarkably, the (complicated looking) Eq. (9.104) allows an explicit steady-state solution (for any graph!):
P(n) = Z^{−1} Π_{i∈V} (h_i^{n_i} / n_i!),
where the h_i solve the linear ("traffic") equations
∀i ∈ V : −h_i Σ_{j≠0, (i,j)∈E} λ_{ij} + Σ_{j≠0, (j,i)∈E} λ_{ji} h_j + λ_{0i} − λ_{i0} h_i = 0.
• h is a "single-customer" object.
Example 9.6.3. Generalize the steady-state formula and reformulate the stability criterion for the general M/M/m case.
Example 9.6.4. By analogy with what was discussed for the single-queue case, state the skewed detailed balance relation. Show how analysis of the skewed DB leads to the product-state solution for the steady state.
• The number of servers is fixed and the traffic intensity (utilization), λ/µ, approaches unity (from below). The queue-length approximation is the so-called "reflected Brownian motion".
• The traffic intensity is fixed and the number of servers and the arrival rate are increased to infinity. Here the queue-length limit converges to the normal distribution.
Let us give some intuitive picture and then pose a number of technical questions/challenges (some of these with answers not yet fully known).
When the system is congested, i.e. when most of the time it has many customers, an arriving customer will need to wait a long time. Assuming the FIFO (first in, first out) protocol, a customer joining a queue with L customers ahead will see the queue go down to zero before departing. However, in this time of L + 1 departures there will also be many arrivals. If the traffic intensity ρ is close to 1, the number of arrivals will be of order L as well. Thus, when the customer leaves, the queue left behind will be comparable to the queue observed when the customer arrived. For the system to go from average to empty will take more like L busy periods.
• How long does it take for a system to change from an average/typical filling to empty?
Hint (following from the preceding discussion): The time scale at which the system changes is much longer than the time scale of a single customer.
We therefore have a time-scale separation and may study a Q-system with many customers on two scales, fluid and diffusive. Let X(t) be some Q-system-related process. X̄_n(t) = n^{−1} X(nt) defines the fluid re-scaling by n. This means that we measure time in units of n and we measure the state (number of customers) in units of n. As n → ∞ we look for n^{−1} X(nt) → X̄(t), where X̄(t) is the fluid limit.
At this scale, as n → ∞ the arrival process and the service process have fluid limits λt and µt, which means that they are deterministic. As we said, queueing is the result of variability, and so on the fluid scale, when input and output are not variable, there will be no real queueing behavior in the system. We may see the queue length grow linearly indefinitely (ρ > 1), or go to zero linearly and then stay at 0 (ρ < 1), or we may see it constant (ρ = 1). For queueing networks we may observe piecewise-linear behavior of queue lengths. This captures changes in the queue on the fluid scale: the queue changes by O(n) in a time of order n. The stochastic fluctuations of a queue in steady state are scaled down to be identically 0, and are uninteresting at this scale.
The diffusion scaling looks at the difference between the process and its fluid limit, measuring time in units of n and the state (counts of customers) in units of √n. The diffusion re-scaling of A(t) by n is Â_n(t) = √n (Ā_n(t) − Ā(t)). As n → ∞ we look (in analogy with the Central Limit Theorem) for Â_n(t) converging in the sense of distributions to Â(t), describing the diffusion limit; it is a diffusion process, such as Brownian motion or reflected Brownian motion. The diffusion limit captures the random fluctuations of the system around its fluid limit.
Here is a (formal) statement of the heavy-traffic asymptotic for the waiting time (including both fluid and diffusive limits). Consider a sequence of G/G/1 queues indexed by j. For queue j, let T_j denote the random inter-arrival time and S_j the random service time; let ρ_j = λ_j/µ_j denote the traffic intensity, with λ_j = 1/E[T_j] and µ_j = 1/E[S_j]; let W_{q,j} denote the waiting time in queue for a customer in steady state; and let α_j = −E[S_j − T_j] and β_j² = Var[S_j − T_j]. If T_j → T and S_j → S in distribution, and ρ_j → 1, then (2α_j/β_j²) W_{q,j} → exp(1) in distribution, provided that: (a) Var[S − T] > 0, and (b) for some δ > 0, E[S_j^{2+δ}] and E[T_j^{2+δ}] are both less than some constant C, ∀j.
Chapter 10
Elements of Inference and Learning
Statistical Inference describes a set of (inference) tasks/operations over the statistical model of a phenomenon. These tasks, assuming knowledge of the statistical model, include: (a) sampling from the probability distribution; (b) computing marginal probabilities; (c) finding the most likely configuration/state. However, the statistical model may or may not be known. In the latter case one needs to learn the model before posing and resolving the challenge of inference.
In the following we start by discussing statistical inference and then shift our attention to the discussion of learning the statistical models we aim to infer.
This lecture should be read in parallel with the respective IJulia notebook file. Monte-Carlo (MC) methods refer to a broad class of algorithms that rely on repeated random sampling to obtain results. They are named after Monte Carlo, the city, which once was the capital of gambling, i.e. of playing with randomness. MC algorithms can be used for numerical integration, e.g. computing weighted sums of many contributions, expectations, marginals, etc. MC can also be used in optimization.
Sampling is a selection of a subset of individuals/configurations from within a statistical population to estimate characteristics of the whole population.
There are two basic flavors of sampling: Direct Sampling MC (mainly discussed in this lecture) and Markov Chain MC. DS-MC focuses on drawing independent samples from a distribution, while MCMC draws correlated samples (correlated according to the underlying Markov Chain).
Let us illustrate both on the simple example of the 'pebble' game: calculating the value of π by sampling the interior of a circle. In this simple example we construct a distribution which is uniform within a circle from another distribution which is uniform within a square containing the circle. We use a direct product of two rand() calls to generate samples within the square and then simply reject samples which are not in the interior of the circle.
In the respective MCMC we build a sample (parameterized by a pair of coordinates) by taking the previous sample and adding some random independent shifts to both variables, also making sure that when the sample crosses a side of the square it reappears on the opposite side. The sample "walks" the square, but to compute the area of the circle we count only samples which land within the circle (rejection again).
See IJulia notebook associated with this lecture for an illustration.
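As a concrete illustration, here is a minimal Julia sketch of both schemes (this is our own sketch with our own names, not the course notebook itself):

    # Direct sampling: independent uniform samples in [-1,1]^2, reject outside circle.
    function pi_direct(n)
        hits = 0
        for _ in 1:n
            x, y = 2rand() - 1, 2rand() - 1
            hits += (x^2 + y^2 <= 1)
        end
        return 4hits / n                # ratio of areas is pi/4
    end

    # MCMC: random-walk samples; crossing a side re-appears on the opposite side.
    function pi_mcmc(n; step=0.3)
        x = y = 0.0
        hits = 0
        for _ in 1:n
            x = mod(x + step * (2rand() - 1) + 1, 2) - 1
            y = mod(y + step * (2rand() - 1) + 1, 2) - 1
            hits += (x^2 + y^2 <= 1)    # rejection again: count only interior points
        end
        return 4hits / n
    end

    println(pi_direct(10^6), "  ", pi_mcmc(10^6))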
To generate a pair of independent standard Gaussian variables (x, y) from two independent uniform variables (ψ, θ) ∈ [0, 1]², pass to polar coordinates and substitute z = r²/2:

    ∫_{−∞}^{∞} ∫_{−∞}^{∞} (dx dy/2π) e^{−(x²+y²)/2} = ∫_0^∞ r dr ∫_0^{2π} (dϕ/2π) e^{−r²/2} = ∫_0^∞ dz e^{−z} ∫_0^{2π} dϕ/2π = ∫_0^1 dθ ∫_0^1 dψ = 1.

Thus, the desired mapping is (ψ, θ) → (x, y), where x = √(−2 log ψ) cos(2πθ) and y = √(−2 log ψ) sin(2πθ).
See IJulia notebook associated with this lecture for numerical illustrations.
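A minimal Julia sketch of the resulting sampler (the function name is ours):

    # Box-Muller: two independent uniforms (psi, theta) -> two independent N(0,1).
    function box_muller()
        psi, theta = 1 - rand(), rand()     # 1 - rand() avoids log(0)
        r = sqrt(-2 * log(psi))             # radius: z = r^2/2 is exponential
        return r * cos(2pi * theta), r * sin(2pi * theta)
    end

    x = [box_muller()[1] for _ in 1:10^5]
    println(sum(x) / length(x), "  ", sum(abs2, x) / length(x))  # ≈ 0 and ≈ 1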
CHAPTER 10. ELEMENTS OF INFERENCE AND LEARNING 305
Let us now show how to get a positive Gaussian (normal) random variable from an exponential random variable through rejection. We do it in two steps.
Note that the rejection algorithm has an advantage of being applicable even when the
probability densities are known only up to a multiplicative constant. (We will discuss
issues related to this constant, also called in the multivariate case the partition function,
extensively.)
See IJulia notebook associated with this lecture for numerical illustration.
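The two steps as displayed did not survive in these notes; the following is a hedged reconstruction of the standard construction, assuming the exponential density e^{−x} is the proposal: for the half-normal target p(x) ∝ e^{−x²/2}, x ≥ 0, the ratio p(x)/q(x) ∝ e^{x−x²/2} = e^{1/2} e^{−(x−1)²/2}, so one accepts a proposal x with probability e^{−(x−1)²/2}.

    # Rejection sampling of a positive (half-)normal from an exponential proposal.
    function halfnormal_by_rejection()
        while true
            x = -log(1 - rand())                        # step 1: x ~ e^{-x}
            rand() < exp(-(x - 1)^2 / 2) && return x    # step 2: accept/reject
        end
    end

    samples = [halfnormal_by_rejection() for _ in 1:10^5]
    println(sum(samples) / length(samples))   # ≈ sqrt(2/pi) ≈ 0.798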
Importance Sampling
When the overlap of f(x) and p(x) is small, a lot of MC samples drawn from p(x) will be 'wasted'.
Importance Sampling is the method which aims to fix the small-overlap problem. The method is based on adjusting the distribution function from p(x) to p̃(x) and then utilizing the following obvious formula:

    E_p[f(x)] = ∫ dx p(x) f(x) = ∫ dx p̃(x) ( f(x)p(x) / p̃(x) ) = E_p̃[ f(x)p(x) / p̃(x) ].
See the IJulia notebook associated with this lecture contrasting the DS example, p(x) = (1/√(2π)) exp(−x²/2) and f(x) = exp(−(x − 4)²/2), with IS where the choice of the proposal distribution is p̃(x) = (1/√π) exp(−(x − 2)²). This example shows that we are clearly wasting samples with DS.
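The same contrast can be reproduced in a few lines of Julia (a sketch with our own names; for this example the exact answer is E_p[f] = e^{−4}/√2 ≈ 0.0130):

    f(x) = exp(-(x - 4)^2 / 2)
    n = 10^6

    # Direct sampling from p = N(0,1): almost all samples land where f ≈ 0.
    ds = sum(f, randn(n)) / n

    # Importance sampling from the proposal p~ = N(2, 1/2); reweight by p/p~.
    p(x)  = exp(-x^2 / 2) / sqrt(2pi)
    pt(x) = exp(-(x - 2)^2) / sqrt(pi)
    ys = 2 .+ randn(n) ./ sqrt(2)
    is = sum(y -> f(y) * p(y) / pt(y), ys) / n

    println(ds, "  ", is)   # IS shows much smaller sample-to-sample scatter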
Note one big problem with IS: in a realistic multi-dimensional case it is not easy to guess the proposal distribution p̃(x) right. One way of fixing this problem is to search for a good p̃(x) adaptively.
A comprehensive review of the history and state of the art in Importance Sampling can be found in the lecture notes of A. Owen posted at his web page. Check also the available adaptive importance sampling packages.
The brute-force direct sampling algorithm relies on the availability of the uniform sampling routine on [0, 1], rand(). One splits the [0, 1] interval into pieces according to the weights of all possible states and then uses rand() to select the state, as sketched below. The algorithm is impractical, as it requires keeping in memory information about all possible configurations. The use of this construction is in providing a benchmark case useful for proving independence of samples.
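A minimal Julia sketch of this construction (our own naming):

    # Select a state by splitting [0,1] according to (possibly unnormalized) weights.
    function sample_state(weights)
        u = rand() * sum(weights)
        acc = 0.0
        for (state, w) in enumerate(weights)
            acc += w
            u <= acc && return state
        end
        return length(weights)      # guard against floating-point round-off
    end

    println(sample_state([0.1, 0.2, 0.7]))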
Suppose we have an oracle capable of computing the partition function (normalization) for a
multivariate probability distribution and also for any of the marginal probabilities. (Notice
that we are ignoring for now the issue of the oracle complexity.) Does it give us the power
to generate independent samples?
We get an affirmative answer to this question through the following decimation algorithm generating an independent sample x ∼ P(x), where x := (x_i | i = 1, · · · , N):
Algorithm 6 (Decimation):

1: x^(d) = ∅; I = ∅
2: while |I| < N do
3:   Pick i at random from {1, · · · , N} \ I.
4:   x^(I) = (x_j | j ∈ I)
5:   Compute P(x_i | x^(d)) := Σ_{x\x_i ; x^(I) = x^(d)} P(x) with the oracle.
6:   Generate random x_i ∼ P(x_i | x^(d)).
7:   I ← I ∪ i
8:   x^(d) ← x^(d) ∪ x_i
9: end while

Validity of the algorithm follows from the exact representation of the joint probability distribution function as a product of ordered conditional distribution functions (the chain rule for distributions):
P (x1 , · · · , xn ) = P (x1 )P (x2 |x1 )P (x3 |x1 , x2 ) · · · P (xn |x1 , · · · , xn−1 ). (10.1)
(The chain rule follows directly from the Bayes rule/formula. Notice also that the ordering of variables within the chain rule is arbitrary.) One way of proving that the algorithm produces an independent sample is to show that the algorithm outcome is equivalent to another algorithm for which the independence is already proven. The benchmark algorithm we can use to state that the Decimation algorithm (6) produces independent samples is the brute-force sampling algorithm described in the beginning of the lecture. The crucial point here is that the decimation algorithm can be interpreted in terms of splitting the [0, 1] interval hierarchically, first according to P(x_1), then subdividing pieces for different x_1 according to P(x_2|x_1), etc. This gedanken experiment results in the desired proof.
Note that in general the effort of the partition function oracle is exponential in the problem size. However, in some special cases the partition function can be computed efficiently (polynomially in the number of steps).
In the following exercise we suggest testing the performance of the direct sampling algorithm on the example of the Ising model. Recall that in the case of the Ising model the probability of a binary vector (spin configuration), x, is given by
    p(x) = exp(−βE(x)) / Z,   E(x) = −(1/2) Σ_{{i,j}∈E} x_i J_ij x_j + Σ_{i∈V} h_i x_i, (10.2)

    Z = Σ_x exp(−βE(x)). (10.3)
Exercise 10.1. Consider the example of the Ising model with zero singleton term, h = 0, and uniform pair-wise term, Jβ = −1, i.e. J_ij β = −1, ∀{i, j} ∈ E, over an n × n grid-graph with nearest-neighbor interaction. Construct (write down a pseudo-algorithm and then code) the decimation algorithm (6). Compare the performance of the direct sampling for n = 2, 3, 4, 5 – that is, find out how the time required to generate the next i.i.d. sample depends on n – and explain.
Markov Chain Monte Carlo (MCMC) methods belong to the class of algorithms for sampling from a probability distribution which are based on constructing a Markov chain that converges to the target stationary distribution.

Examples and flavors of MCMC are many (and some are quite similar) – heat bath, Glauber dynamics, Gibbs sampling, Metropolis-Hastings, cluster algorithms, the worm algorithm, etc. In all these cases we only need to know the transition probability between states, while the actual stationary distribution may not be known or, more accurately, is known only up to the normalization factor, also called the partition function. Below, we will discuss in detail two key examples: Gibbs sampling and Metropolis-Hastings.
Gibbs Sampling
Assume that direct sampling is not feasible because of the high dimensionality of the state vector (too many components): computation of the joint probability distribution and of its marginalizations (over a small number of components) is of "exponential" complexity (more on this below). The main point of Gibbs sampling is that, even though sampling from the joint probability distribution is not feasible, sampling from the probability distribution of a few components (conditioned on the rest of the state vector) can be done efficiently. We utilize this remarkable feature of the marginal probability distributions and create the Gibbs Sampling algorithm (Algorithm 7).
The algorithm starts from a current sample of the vector x, picks a component at random, computes the probability for this component/variable conditioned on the rest (the other components of the state vector), and samples from this conditional distribution. As mentioned above, the conditional distribution is over a single component, and it is therefore easy to compute. We continue the process till convergence, which can be identified empirically, for example, by checking whether the estimate of the histogram or of the observable(s) has stopped changing.
Example 10.1.1. Describe Gibbs sampling on the example of a general Ising model. Build the respective Markov chain. Show that the algorithm obeys Detailed Balance.
Solution. Starting from a state, x^(t), we pick a random node, i, and compare two candidate states (x_i = +1 and x_i = −1). Then we calculate the corresponding conditional (all spins except i are fixed) probabilities p(x_i = +1|x^(t)_∼i) and p(x_i = −1|x^(t)_∼i), which we denote p_±, respectively. By construction,

    p_+ + p_− = 1,   p_+/p_− = e^{−β∆E},   ∆E = E(x_i = +1, x^(t)_∼i) − E(x_i = −1, x^(t)_∼i), (10.4)

where ∆E is the energy difference between the two configurations. Next, one accepts the configuration x_i = +1 with the probability p_+ or the configuration x_i = −1 with the probability p_−.
The MC corresponding to the algorithm is defined over the N-dimensional hypercube, where N is the number of spins and 2^N is the number of states. To check the DB condition, assume that the conditional probabilities, p_±, correspond to the steady state and compute the probability fluxes from the state x^(t) to the states (x_i = ±1, x^(t)_∼i). We derive that the flux into (x_i = +1, x^(t)_∼i) is proportional to π(x_i = −1, x^(t)_∼i) p_+, while the reverse flux is proportional to π(x_i = +1, x^(t)_∼i) p_−; the two are equal because, by Eq. (10.4), p_+/p_− = e^{−β∆E} = π(x_i = +1, x^(t)_∼i)/π(x_i = −1, x^(t)_∼i). Thus Detailed Balance holds.
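For concreteness, here is a hedged Julia sketch of the Gibbs sweep for a nearest-neighbor Ising model on a grid, written in the convention p(x) ∝ exp(β(J Σ_⟨ij⟩ x_i x_j + h Σ_i x_i)); the function name, the free boundary conditions and the parameter values are illustrative assumptions, not the course notebook:

    # One Gibbs-sampling pass: pick a spin, compute its conditional given the rest,
    # and resample it from that conditional.
    function gibbs_ising!(x::Matrix{Int}, beta, J, h; sweeps=1)
        n, m = size(x)
        for _ in 1:sweeps, _ in 1:(n * m)
            i, j = rand(1:n), rand(1:m)
            s = 0                                 # local field of the neighbors
            i > 1 && (s += x[i-1, j]);  i < n && (s += x[i+1, j])
            j > 1 && (s += x[i, j-1]);  j < m && (s += x[i, j+1])
            p_plus = 1 / (1 + exp(-2 * beta * (J * s + h)))  # p(x_ij = +1 | rest)
            x[i, j] = rand() < p_plus ? 1 : -1
        end
        return x
    end

    x = rand([-1, 1], 16, 16)
    gibbs_ising!(x, 0.4, 1.0, 0.0; sweeps=200)
    println(sum(x) / length(x))                   # magnetization estimate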
Metropolis-Hastings Sampling
Metropolis-Hastings (MH) sampling is an MCMC method which is built, like Gibbs sampling, assuming that the desired stationary distribution, π̃(x), is known explicitly up to the normalization constant (also called the partition function); thus the normalized distribution is π(x) = π̃(x)/Z with Z := Σ_x π̃(x). Let us also introduce the so-called "proposal" distribution, p(x′|x), and assume that drawing a sample proposal x′ from the
current sample x is (computationally) easy. The combination of π̃(x) and p(x′|x), as well as an arbitrary initialization of the state vector, x^(t), sets the MH Algorithm 8. Starting with x^(t) the algorithm draws a sample, x′, according to the proposal distribution, p(x′|x^(t)), and then accepts or rejects the proposed state, x′, with the probability

    min{ 1, ( p(x^(t)|x′) π̃(x′) ) / ( p(x′|x^(t)) π̃(x^(t)) ) }. (10.6)
Note that Gibbs sampling, introduced before, can be considered as Metropolis-Hastings without rejection, where the proposal distribution is chosen specifically as the respective conditional probability distribution. (That is, Gibbs sampling should be considered a special case of the more general MH algorithm.)
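A minimal Julia sketch of MH with a symmetric Gaussian random-walk proposal, for which the proposal ratio in Eq. (10.6) cancels; log_pi_tilde (the log of the unnormalized target) and the other names are our illustrative assumptions:

    function metropolis_hastings(log_pi_tilde, x0, nsteps; step=0.5)
        x = x0
        chain = zeros(nsteps)
        for t in 1:nsteps
            xp = x + step * randn()              # propose x'
            # accept with probability min(1, pi~(x')/pi~(x)); otherwise keep x
            if log(rand()) < log_pi_tilde(xp) - log_pi_tilde(x)
                x = xp
            end
            chain[t] = x
        end
        return chain
    end

    chain = metropolis_hastings(x -> -x^2 / 2, 0.0, 10^5)   # target: N(0,1)
    println(sum(chain) / length(chain))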
Example 10.1.2. Consider the MC shown in Fig. (10.1). Show that this MC is ergodic and can be viewed as a particular MH Algorithm 8 for the two-spin Ising model. What is the proposal distribution in this case? What is the resulting stationary distribution? Does it obey Detailed Balance? Will the steady distribution change if the rejection is removed from consideration? How?

Figure 10.1: Example of the MC induced by Metropolis-Hastings for a two-spin example.
Solution. The cardinality (size) of the state space is 2² = 4: x = (x_1 = ±1, x_2 = ±1). Inspecting the MC we observe that it is aperiodic and positive-recurrent, thus ergodic. Observe that some of the transition probabilities are exactly 1/2. Therefore, it is natural to associate these with the cases when Eq. (10.6) returns unity. Next, we link self-loops in the MC to rejections, i.e. the cases when a proposed state is rejected. The two observations combined suggest that we are dealing with an instance of the MH Algorithm 8 where
    π̃(x_1, x_2) = exp(−βE(x)),   E(x = (x_1, x_2)) = −x_1 x_2 / 2,   p(x′|x) = exp(−β(E(x′) − E(x))).
That is, the stationary distribution is of the attractive (ferromagnetic) type with unit pair-wise strength and without bias (with zero magnetic field). The proposal distribution, p(x′|x), is such that only a single spin flip is allowed (one cannot flip two spins in one step) and the proposal may be accepted only if the resulting energy gain is positive, i.e. when the algorithm proposes to move from the state where the spins are misaligned to the aligned state. Removal of the rejection translates into setting β to zero (the temperature becomes infinite). In this case the resulting distribution is uniform (so-called paramagnetic).
Exercise 10.2 (Spanning Trees.). Let G be an undirected complete graph. The following
MCMC algorithm results in a uniform stationary probability distribution of all the spanning
trees of G: Start with some spanning tree; add uniformly-at-random some edge from G (so
that a cycle forms); remove uniformly-at-random an edge from this cycle; repeat. Suppose
now that the graph G is positively weighted, i.e., each edge e has some cost ce > 0. Suggest
an MCMC algorithm that samples from the set of spanning trees of G, with the stationary
probability distribution proportional to the overall weight of the spanning tree for the
following cases:
(i) the weight of any spanning tree of G is the sum of costs of its edges;
(ii) the weight of any spanning tree of G is the product of costs of its edges. In addition,
(iii) estimate the average weight of a spanning tree using the algorithm of uniform sampling.
(iv) implement all the algorithms on a (4×4) square lattice with randomly assigned weights.
Verify that the algorithm converges to the right value.
For useful additional reading on sampling and computations for the Ising model see
https://www.physik.uni-leipzig.de/~janke/Paper/lnp739_079_2008.pdf.
An MCMC algorithm is called (casually) exact if one can show that the generated distribution "converges" to the desired stationary distribution. However, "convergence" may mean different things.

The strongest form of convergence – called the exact independence test (warning: this is our 'custom' term) – states that at each step we generate an independent sample from the target distribution. To prove this statement means to show that the empirical correlation of consecutive samples vanishes in the limit as the number of samples N → ∞:
    lim_{N→+∞} (1/N) Σ_{n=0}^{N} f(x_n) g(x_{n−1}) = E[f(x)] E[g(x)], (10.8)
where f(x) and g(x) are arbitrary functions (such, however, that the respective expectations on the rhs of Eq. (10.8) are well-defined).
A weaker statement – call it asymptotic convergence – suggests that in the limit of
N → ∞ we reconstruct the target distribution (and all the respective existing moments):
    lim_{N→+∞} (1/N) Σ_{n=0}^{N} f(x_n) = E[f(x)], (10.9)
where f (x) is an arbitrary function such that the expectation on the rhs is well defined.
Finally, the weakest statement – call it parametric convergence – corresponds to the case when one arrives at the target estimate only in a special limit with respect to a special parameter. It is common, e.g. in statistical/theoretical physics and computer science, to study the so-called thermodynamic limit, where the number of degrees of freedom (for example, the number of spins) tends to infinity.
For additional mathematical (but also intuitive, as written for applied mathematicians, engineers and physicists) reading on MCMC (and MC in general) convergence, see the article "The mathematics of mixing things up" by Persi Diaconis, and also [16].
(This part of the lecture is bonus material – we discuss it only if time permits.)

The material follows Chapter 32 of the book by D.J.C. MacKay [19]. An extensive set of modern references, discussions and codes is also available at the website [23] on perfectly random sampling with Markov Chains.
As mentioned already, the main problem with MCMC methods is that one needs to wait (sometimes for too long) to make sure that the generated samples (from the target distribution) are i.i.d. If one starts to form a histogram (empirical distribution) too early, it will deviate from the target distribution. One important question in this regard is: for how long shall one run the Markov Chain before it has 'converged'? Answering this question rigorously is very difficult, and in many cases not possible. However, there is a technique which allows one to check the exact convergence, for some cases, and to do it on the fly - as we run the MCMC.
This smart technique is the Propp-Wilson exact sampling method, also called coupling
from the past. The technique is based on a combination of three ideas:
• The main idea is related to the notion of trajectory coalescence. Let us observe that if, starting from different initial conditions, the MCMC chains share a single random number generator, then their trajectories in the phase space can coalesce; and having coalesced, they will not separate again. This is clearly an indication that the initial conditions are forgotten.

Will running all the initial conditions forward in time till coalescence generate an exact sample? Apparently not. One can show (it is sufficient to do it for a simple example) that the point of coalescence does not represent an exact sample.
• However, one can still achieve the goal by simulating from a time T_0 in the past, up to the present. If the coalescence has occurred, the present sample is an unbiased sample; and if not, we restart the simulation from a time T_0 further into the past, reusing the same random numbers. The simulation is repeated till a coalescence occurs at a time before the present. One can show that the resulting sample at the present is exact.
• One problem with the scheme is that we need to test it for all the initial conditions - which are too many to track. Is there a way to reduce the number of necessary trials? Remarkably, it appears possible for a sub-class of probabilistic models, the so-called 'attractive' models. Loosely speaking, and using 'physics' jargon, these are 'ferromagnetic' models - models where, for a stand-alone pair of variables, the preferred configuration is the one with the same values of the two variables. In the case of an attractive model, monotonicity (sub-modularity) of the underlying model implies that the paths do not cross. This allows one to study only the limiting trajectories and to deduce the properties of all the other trajectories from the limiting cases.
Here is a brief reminder of what we have learned so far about the Ising Model. It is fully described by Eqs. (10.2, 10.3). The weight of a "spin" configuration is given by Eq. (10.2). Let us not pay much attention for now to the normalization factor Z, and observe that the weight is nicely factorized. Indeed, it is a product of pair-wise terms. Each term describes an "interaction" between spins. Obviously, we can represent the factorization through a graph. For example, if our spin system consists of only three spins connected to each other, then the respective graph is a triangle. Spins are associated with nodes of the graph, and "interactions", which may also be called (pair-wise) factors, are associated with edges.
It is useful, for resolving this and other factorized problems, to introduce a somewhat more general representation – in terms of graphs where both factors and variables are associated with nodes/vertices. The transformation to the factor-graph representation for the three-spin example is shown in Fig. (10.2).
Ising Model, as well as other models discussed later in the lectures, can thus be stated
Figure 10.2: Factor-graph representation for the (simple) case with pair-wise factors only. In the case of the Ising model: f_12(x_1, x_2) = exp(−J_12 x_1 x_2 + h_1 x_1 + h_2 x_2).
First, let us discuss decoding of a graphical code. (Our description here is terse, and we advise the interested reader to check the book by Richardson and Urbanke [24] for more details.) A message word consisting of L information bits is encoded in an N-bit-long code word, N > L. In the case of the binary, linear coding discussed here, a convenient representation of the code is given by M ≥ N − L constraints, often called parity checks or simply checks.
Formally, ς = (ς_i ∈ {0, 1} | i = 1, · · · , N) is one of the 2^L code words iff Σ_{i∼α} ς_i = 0 (mod 2) for all checks α = 1, · · · , M, where i ∼ α indicates that the bit i contributes to the check α, and α ∼ i indicates that the check α contains the bit i. The relation between bits and checks
is often described in terms of the M × N parity-check matrix H consisting of ones and zeros: H_iα = 1 if i ∼ α and H_iα = 0 otherwise. The set of the codewords is thus defined as Ξ^(cw) = {ς | Hς = 0 (mod 2)}.

Figure 10.3: Tanner graph of a linear code with N = 10 bits, M = 5 checks, and L = N − M = 5 information bits. This code selects 2^5 codewords from 2^10 possible patterns. The adjacency (parity-check) matrix of this code is given by Eq. (10.12).

A bipartite graph representation of H, with bits marked as
circles, checks marked as squares, and edges corresponding to respective nonzero elements
of H, is usually called (in the coding theory) the Tanner graph of the code, or parity-check
graph of the code. (Notice that, fundamentally, a code is defined in terms of the set of its codewords, and there are many parity-check matrices/graphs parameterizing the same code. We ignore this ambiguity here, choosing one convenient parametrization H for the code.)
Therefore the bi-partite Tanner graph of the code is defined as G = (G_0, G_1), where the set of nodes is the union of the sets associated with variables and checks, G_0 = G_{0;v} ∪ G_{0;e}, and only edges connecting variables and checks contribute to G_1.
For a simple example with 10 bits and 5 checks, the parity-check (adjacency) matrix of the code with the Tanner graph shown in Fig. (10.3) is

          ( 1 1 1 1 0 1 1 0 0 0 )
          ( 0 0 1 1 1 1 1 1 0 0 )
      H = ( 0 1 0 1 0 1 0 1 1 1 )   (10.12)
          ( 1 0 1 0 1 0 0 1 1 1 )
          ( 1 1 0 0 1 0 1 0 1 1 )
Another example of a bigger code and the respective parity-check matrix is shown in Fig. (10.4). For this example, N = 155, L = 64, M = 91, and the Hamming distance, defined as the minimum l_0-distance between two distinct codewords, is 20.
Assume that each bit of the transmitted signal is changed (the effect of the channel noise) independently of the others.

Figure 10.4: Tanner graph and parity-check matrix of the (155, 64, 20) Tanner code, where N = 155 is the length of the code (size of the code word), L = 64, and the Hamming distance of the code is d = 20.

This happens with some known conditional probability, p(x|σ), where σ = 0, 1 is the value of the bit before transmission, and x ∈ R is its changed/distorted image. Once x = (x_i | i = 1, · · · , N) is measured, the task of Maximum-A-Posteriori (MAP) decoding becomes to reconstruct the most probable codeword consistent with the measurement:
    σ^(MAP) = arg max_{σ∈Ξ^(cw)} Π_{i=1}^{N} p(x_i|σ_i), (10.13)
where Z(x) is the partition function dependent on the detected vector x. One may also consider the signal (bit-wise) MAP decoder

    ∀i :   ς_i^(s−MAP) = arg max_{ς_i} Σ_{ς\ς_i : ς∈Ξ^(cw)} P(ς|x). (10.15)
where x = (x_i ∈ {0, 1} | i ∈ V_n). Here we assume that the alphabet of the elementary random variable is binary; however, generalization to the case of a larger alphabet is straightforward. We are interested to 'marginalize' Eq. (10.11) over a subset of variables, for example over all the elementary/nodal variables but one:
    P(x_i) := Σ_{x\x_i} P(x). (10.17)
The expectation of x_i computed with the probability Eq. (10.17) is also called (in physics) the 'magnetization' of the variable.
Example 10.2.1. Is a partition function oracle sufficient for computing P (xi )? What is
the relation in the case of the Ising model between P (xi ) and Z(h)?
Solution. There are two ways one can relate P(x_i) to the partition function. First, introduce an auxiliary graphical model, derived from the original one simply by fixing the value at the node i to x_i. Then P(x_i) is simply the ratio of the partition function of the newly derived graphical model to that of the original graphical model. Second, we can modify the original graphical model by introducing the multiplicative factor exp(x_i h_i) and denoting the resulting partition function by Z(h). Then log Z(h) is also a moment generating function of P(x_i).
Another object of interest is the so-called Maximum Likelihood (ML) state. Stated formally, it is the most probable state of all the states represented in Eq. (10.11):

    x* := arg max_x P(x). (10.18)
All these objects are difficult to compute. "Difficulty" - still stated casually - means that the number of operations needed is exponential in the system size (e.g. the number of variables/spins in the Ising model). This is the case in general, i.e. for a GM in general position. However, for some special cases, or even special classes of cases, the computations may be much easier than in the worst case. Thus, the ML state (10.18) for the case of the so-called ferromagnetic (attractive, sub-modular) Ising model can be computed with effort polynomial in the system size. Note that the partition function computation (at any nonzero temperature) is still exponential even in this case, thus illustrating the general statement - computing Z or P(x_i) is a more difficult problem than computing x*.
We will go from counting (computing the partition function is a problem of weighted counting) to optimization by changing the description from states to probabilities of the states, which we will also call beliefs. b(x) will be a belief - our probabilistic guess - for the probability of state x. Consider it on the example of the triangle system shown in Fig. (10.2). There are 2³ states in this case, (x_1 = ±1, x_2 = ±1, x_3 = ±1), which can occur with the probabilities b(x_1, x_2, x_3). All the beliefs are positive and together should sum to unity. We would like to compare a particular assignment of the beliefs with P(x), generally described by Eq. (10.11). Let us recall a tool which we already used to compare probabilities - the Kullback-Leibler (KL) divergence (of probabilities) discussed in Lecture #2:
    D(b‖P) = Σ_x b(x) log( b(x) / P(x) ). (10.19)
Note that the KL divergence (10.19) is a convex function of the beliefs (remember, there are 2³ beliefs in our enabling three-node example) within the following polytope – the domain in the space of beliefs bounded by linear constraints:

    ∀x :  b(x) ≥ 0, (10.20)
    Σ_x b(x) = 1. (10.21)
where F(b), considered as a function of all the beliefs, is called the (configurational) free energy (a configuration being an assignment of the beliefs). The terminology originates from statistical physics.
To summarize, we did manage to reduce the counting problem to an optimization problem. This is great; however, so far it is just a reformulation – the number of variational degrees of freedom (beliefs) is as large as the number of terms in the original sum (the partition function). Indeed, it is not the formula itself but (as we will see below) its further use for approximations which will be extremely useful.
The main idea is to reduce the search space from exploration of the (2^N − 1)-dimensional space of beliefs to a lower-dimensional proxy/approximation, i.e. one parameterized with fewer variables. What kind of factorization can one suggest for the multivariate (N-spin) probabilities/beliefs? The idea of postulating independence of all the N variables/spins comes to mind:
    b(x) → b^MF(x) = Π_i b_i(x_i), (10.24)
    ∀i ∈ V, ∀x_i :  b_i(x_i) ≥ 0, (10.25)
    ∀i ∈ V :  Σ_{x_i} b_i(x_i) = 1. (10.26)
Clearly, b_i(x_i) is interpreted within this substitution as the single-node marginal belief (an estimate of the single-node marginal probability). Substituting b by b^MF in Eq. (10.23) one arrives at the MF estimate of the partition function.
Example 10.2.2. Show that Z ≥ Zmf , and that F(bmf ) is a strictly convex function of
its (vector) argument. Write down equations defining the stationary point of L(bmf ).
Solution. Z_mf is a lower bound because we optimize over the class of belief functions b_mf(x) which is strictly contained within the class of allowed b(x). Convexity is proven simply by utilizing the fact that the optimization objective decomposes into a sum of convex functions.
The fact that Z_mf (see the example above) gives a lower bound on Z is good news. However, in general the approximation is very crude, i.e. the gap between the bound and the actual value is large. The main reason for this is clear: by assuming that the variables are independent we have ignored significant correlations.
In the next lecture we will analyze what very frequently provides a much better approximation for ML inference - the so-called Belief Propagation approach.

We will mainly focus on Belief Propagation and the related theory and techniques. In addition to discussing inference with Belief Propagation we will also give a brief discussion of (with pointers to) the respective inverse problem – learning with Graphical Models.
Consider the Ising model over a linear chain of n spins shown in Fig. 10.5a. The partition function is

    Z = Σ_{x_n} Z(x_n), (10.29)

where Z(x_n) is a newly introduced object representing the sum over all but the last spin in the chain, labeled by n. Z(x_n) can be expressed as follows:

    Z(x_n) = Σ_{x_{n−1}} exp(J_{n,n−1} x_n x_{n−1} + h_n x_n) Z_{(n−1)→(n)}(x_{n−1}), (10.30)
where Z_{(n−1)→(n)}(x_{n−1}) is the partial partition function for the subtree (a shorter chain in this case) rooted at n − 1 and built excluding the branch/link directed towards n. The newly introduced partially summed partition function contains summation over one fewer spin than the original chain. In fact, this partially summed object can be defined recursively:

    Z_{(i−1)→(i)}(x_{i−1}) = Σ_{x_{i−2}} exp(J_{i−1,i−2} x_{i−1} x_{i−2} + h_{i−1} x_{i−1}) Z_{(i−2)→(i−1)}(x_{i−2}), (10.31)

that is, expressing one partially summed object via the partially summed object computed at the previous step. The advantage of this recursive approach is obvious – it allows one to replace summation over the exponentially many spin configurations by summing up only two terms at each step of the recursion.
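The recursion (10.31) is only a few lines of Julia (a sketch under our own conventions; J[i] couples spins i and i+1):

    # Carry Z_i(x_i), the partial partition function with spin i fixed, as a
    # 2-vector over x_i = -1, +1; each step sums over the previous spin only.
    function chain_partition(J, h)
        n = length(h)
        spins = (-1, 1)
        Z = [exp(h[1] * x) for x in spins]
        for i in 2:n
            Z = [sum(exp(J[i-1] * x * xp + h[i] * x) * Z[k]
                     for (k, xp) in enumerate(spins)) for x in spins]
        end
        return sum(Z)               # final sum over the last spin, Eq. (10.29)
    end

    println(chain_partition(ones(4), zeros(5)))   # 5-spin chain: 2*(2cosh(1))^4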
What should also be obvious is that the method just described is an adaptation of the Dynamic Programming (DP) methods, which we discussed in the optimization part of the course, to the problem of statistical inference.
It is also clear that the approach just explained allows generalization from the case of the linear chain to the case of a general tree. In the general case Z(x_i) is the partition function of the entire tree with the value of the spin at the site/node i fixed. We derive

    Z(x_i) = e^{h_i x_i} Π_{j∈∂i} ( Σ_{x_j} e^{J_ij x_i x_j} Z_{j→i}(x_j) ). (10.32)
The partition function, partially summed and conditioned on the spin value x_4, is

    Z(x_4) = e^{h_4 x_4} ( Σ_{x_5} e^{J_45 x_4 x_5} Z_{5→4}(x_5) ) ( Σ_{x_6} e^{J_46 x_4 x_6} Z_{6→4}(x_6) ) ( Σ_{x_3} e^{J_34 x_3 x_4} Z_{3→4}(x_3) ), (10.35)

where

    Z_{3→4}(x_3) = e^{h_3 x_3} ( Σ_{x_1} e^{J_13 x_1 x_3} Z_{1→3}(x_1) ) ( Σ_{x_2} e^{J_23 x_2 x_3} Z_{2→3}(x_2) ). (10.36)
Exercise 10.3. Consider the Ising model on a graph G = (V, E), with spins x. Let V_0 ⊂ V, and let V̄_0 = {i ∈ V \ V_0 : (i, j) ∈ E for some j ∈ V_0}. (You may think of V̄_0 as the "boundary" of V_0 within V.) Show that the spins on V_0 are conditionally independent of all other spins, given the values of the spins on V̄_0.
It appears that in the case of a general pair-wise graphical model over a tree the joint distribution function over all variables can be expressed solely via single-node marginals and pair-wise marginals over all pairs of graph-neighbors. To illustrate this important factorization property, let us consider the examples shown in Fig. 10.6. In the case of the two-node example of Fig. 10.6a the statement is obvious, as it follows directly from the Bayes formula

    P(x_1, x_2) = P(x_1)P(x_2|x_1). (10.37)

For the three-node chain one similarly writes P(x_1, x_2, x_3) = P(x_1)P(x_2|x_1)P(x_3|x_2) = P(x_1, x_2)P(x_2, x_3)/P(x_2), where the conditional independence of x_3 from x_1, P(x_3|x_1, x_2) = P(x_3|x_2), was used.
Next, let us work it out on the example of the pair-wise graphical model shown in Fig. 10.6. Here one uses the reductions P(x_4|x_1, x_2, x_3) = P(x_4|x_2) and P(x_3|x_1, x_2) = P(x_3|x_2), related to the respective independence properties.
Finally, it is easy to verify that the joint probability distribution corresponding to the model in Fig. 10.6d is

    P(x_1, x_2, x_3, x_4, x_5, x_6) = P(x_1)P(x_2|x_1)P(x_3|x_2)P(x_4|x_2)P(x_5|x_2)P(x_6|x_5)
      = P(x_1, x_2)P(x_2, x_3)P(x_2, x_4)P(x_2, x_5)P(x_5, x_6) / ( P³(x_2) P(x_5) ). (10.40)
Exercise 10.4. In the case of a general tree-like graphical model the joint probability distribution can be stated in terms of the pair-wise and singleton marginals as follows:

    P(x_1, x_2, . . . , x_n) = Π_{(i,j)∈E} P(x_i, x_j) / Π_{i∈V} P^{q_i−1}(x_i), (10.41)

where q_i is the degree of the i-th node. Use mathematical induction to prove Eq. (10.41).
As discussed above, Dynamic Programming is a provably exact approach for inference when the graph is a tree. It also provides an empirically good approximation for a very broad family of problems stated on loopy graphs.

The approximation is usually called Bethe-Peierls or Belief Propagation (BP is the abbreviation which works for both); Loopy BP is another popular term. See the original paper [25], a comprehensive review [26], and the respective lecture notes for advanced/additional reading.
Instead of Eq. (10.24) one uses the following BP substitution:

    b(x) → b^bp(x) = Π_a b_a(x_a) / Π_i (b_i(x_i))^{q_i−1}, (10.42)
    ∀a ∈ V_f, ∀x_a :  b_a(x_a) ≥ 0, (10.43)
    ∀i ∈ V_n, ∀a ∼ i :  b_i(x_i) = Σ_{x_a\x_i} b_a(x_a), (10.44)
    ∀i ∈ V_n :  Σ_{x_i} b_i(x_i) = 1, (10.45)
where q_i stands for the degree of node i. The physical meaning of the factor q_i − 1 on the rhs of Eq. (10.42) is straightforward: by placing beliefs on the factor-nodes connected by an edge with a node i, we over-count the contribution of an individual variable q_i times, and thus the denominator term in Eq. (10.42) comes as a correction for this over-counting.
Substitution of Eq. (10.42) into Eq. (10.23) results in what is called the Bethe Free Energy (BFE) (10.46), where E_bp is the so-called self-energy (physics jargon) and H_bp is the BP-entropy (this name should be clear in view of what we have discussed about entropy so far). Thus the BP version of the KL-divergence minimization becomes the optimization (10.49).
The ML (zero-temperature) version of Eq. (10.49) results from the corresponding optimization. Let us restate Eq. (10.49) as an unconstrained optimization. We use the standard method of Lagrangian multipliers to achieve this. The resulting Lagrangian is
    L_bp(b, η, λ) := Σ_a Σ_{x_a} b_a(x_a) log f_a(x_a) − Σ_a Σ_{x_a} b_a(x_a) log b_a(x_a) + Σ_i Σ_{x_i} (q_i − 1) b_i(x_i) log b_i(x_i)
      − Σ_i Σ_{a∼i} Σ_{x_i} η_{ia}(x_i) ( b_i(x_i) − Σ_{x_a\x_i} b_a(x_a) ) + Σ_i λ_i ( Σ_{x_i} b_i(x_i) − 1 ), (10.52)
where η and λ are the dual (Lagrangian) variables associated with the conditions Eqs. (10.44, 10.45), respectively. Then Eq. (10.49) becomes a min-max problem (10.53). Changing the order of optimizations in Eq. (10.53) and then minimizing over η one arrives at the following expressions for the beliefs via messages (check the derivation details):
    ∀a, ∀x_a :  b_a(x_a) ∼ f_a(x_a) exp( Σ_{i∼a} η_{ia}(x_i) ) := f_a(x_a) Π_{i∼a} n_{i→a}(x_i) := f_a(x_a) Π_{i∼a} Π_{b∼i, b≠a} m_{b→i}(x_i), (10.54)

    ∀i, ∀x_i :  b_i(x_i) ∼ exp( ( Σ_{a∼i} η_{ia}(x_i) ) / (q_i − 1) ) := Π_{a∼i} m_{a→i}(x_i), (10.55)
where, as usual, ∼ for beliefs means equality up to a constant which guarantees that the sum of the respective beliefs is unity, and we have also introduced the auxiliary variables m and n, called messages, related to the Lagrangian multipliers η via Eqs. (10.56, 10.57).
Combining Eqs. (10.54, 10.55, 10.56, 10.57) with Eq. (10.44) results in the following BP equations stated in terms of the message variables:

    ∀i, ∀a ∼ i, ∀x_i :  n_{i→a}(x_i) = Π_{b∼i, b≠a} m_{b→i}(x_i), (10.58)

    ∀a, ∀i ∼ a, ∀x_i :  m_{a→i}(x_i) = Σ_{x_a\x_i} f_a(x_a) Π_{j∼a, j≠i} n_{j→a}(x_j). (10.59)
Note that if the Bethe Free Energy (10.46) is non-convex, there may be multiple fixed points of Eqs. (10.58, 10.59). The following iterative, so-called Message Passing (MP), algorithm (10) is used to find a fixed-point solution of the BP Eqs. (10.58, 10.59); a pairwise sketch follows.
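As an illustration, here is a hedged Julia sketch of the MP iteration specialized to pairwise factors, where Eqs. (10.58, 10.59) combine into a single variable-to-variable update; the names, the binary alphabet and the example factor are our assumptions, not Algorithm 10 itself:

    # Sum-product messages m[(j,i)] on directed edges of a pairwise binary model.
    function belief_propagation(n, edges, f; iters=50)
        nbrs = [Int[] for _ in 1:n]
        for (i, j) in edges
            push!(nbrs[i], j); push!(nbrs[j], i)
        end
        m = Dict((j, i) => [0.5, 0.5]
                 for (a, b) in edges for (i, j) in ((a, b), (b, a)))
        for _ in 1:iters
            for key in collect(keys(m))
                j, i = key
                msg = [sum(f(xi, xj) * prod(m[(k, j)][xj] for k in nbrs[j] if k != i;
                                            init=1.0) for xj in 1:2) for xi in 1:2]
                m[key] = msg / sum(msg)       # normalize the message
            end
        end
        b = [[prod(m[(k, i)][xi] for k in nbrs[i]; init=1.0) for xi in 1:2] for i in 1:n]
        return [bi / sum(bi) for bi in b]     # single-node beliefs
    end

    # Three-spin ferromagnetic chain; states 1,2 encode spins -1,+1.
    f(xi, xj) = exp((2xi - 3) * (2xj - 3))
    println(belief_propagation(3, [(1, 2), (2, 3)], f))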
So far we have been discussing the direct (inference) GM problem. In the remainder of this lecture we will briefly talk about inverse problems. This subject will also be discussed (on the example of trees) in the following.

Stated casually, the inverse problem is about 'learning' a GM from data/samples. Think about the two-room setting: in one room a GM is known and many samples are generated. The samples, but not the GM (!), are passed to the second room. The task becomes to reconstruct the GM from the samples.
The first question we should ask is whether this is possible in principle, even if we have an infinite number of samples. The very powerful notion of sufficient statistics helps to answer this question.
Consider the Ising model (not for the first time in this course):

    P(x) = (1/Z(θ)) exp( Σ_{i∈V} h_i x_i + Σ_{{i,j}∈E} J_ij x_i x_j ) = exp{ θ^T φ(x) − log Z(θ) }, (10.60)

where x_i ∈ {−1, 1}, θ := h ∪ J = (h_i | i ∈ V) ∪ (J_ij | {i, j} ∈ E), and the partition function Z(θ) serves to normalize the probability distribution. In fact, Eq. (10.60) describes what is called an exponential family - emphasizing the 'exponential' dependence on the parameters θ.
Notice that any pairwise GM over binary variables can be represented as an Ising model.
Consider the collection of all first and second moments (but only these two, and not higher moments) of the spin variables, µ^(1) := (µ_i = E[x_i] | i ∈ V) and µ^(2) := (µ_ij = E[x_i x_j] | {i, j} ∈ E). The sufficient statistics statement is that to reconstruct θ, fully defining the GM, it is sufficient to know µ^(1) and µ^(2).
Let us turn the sufficiency into a constructive statement – Maximum-Likelihood estimation over an exponential family of GMs. First, notice that (according to the definition of µ)

    ∀i :  ∂_{h_i} log Z(θ) = µ_i,   ∀{i, j} :  ∂_{J_ij} log Z(θ) = µ_ij. (10.61)

This leads to the following statement: if we know how to compute the log-partition function for any value of θ, then reconstructing the 'correct' θ is a convex optimization problem (over θ), namely the maximization of the concave log-likelihood θ^T µ − log Z(θ). Since computing log Z(θ) is hard in general, one can:
• Limit consideration to the class of models for which computation of the partition function can be done efficiently for any values of the parameters. We will discuss such a case below – the so-called tree (Chow-Liu) learning. (In fact, the partition function can also be computed efficiently in the case of the Ising model over planar graphs and generalizations; see [27] and references therein for details.)
• Rely on approximations, such as variational approximations (MF, BP, and others), MCMC, or approximate elimination (approximate Dynamic Programming).
• There also exists a very innovative new approach which allows one to learn a GM efficiently, however using more information than suggested by the notion of sufficient statistics. As one of the scientists contributing to this line of research put it – 'the sufficient statistics is not sufficient'. This is a fascinating novel subject, which is however beyond the scope of this course. Check [28] and references therein.
Eq. (10.41) suggests that knowing the structure of a tree-based graphical model allows one to express the joint probability distribution in terms of the single-node and pairwise (edge-related) marginals. Below we will utilize this statement to pose and solve an inverse problem. Specifically, we attempt to reconstruct a tree representing correlations between multiple (ideally, infinitely many) snapshots of the discrete random variables x_1, x_2, . . . , x_n.

A straightforward strategy to achieve this goal is as follows. First, one estimates all possible single-node and pairwise marginal probability distributions, P(x_i) and P(x_i, x_j), from the infinite set of snapshots. Then we may similarly estimate the joint distribution function and verify, for a possible tree layout, whether the relations (10.41) hold. However, this strategy is not feasible, as it requires (in the worst, unlucky case) testing exponentially many, n^{n−2}, possible spanning trees. Luckily, a smart and computationally efficient way of solving the problem was suggested by Chow and Liu in 1968 [29].
Consider the candidate probability distribution P_T(x = (x_1, . . . , x_n)) over a tree T = (V, E) (where V and E are the sets of nodes and edges of the tree, respectively) which is tree-factorized according to Eq. (10.41) via the marginal (pair-wise and single-variable) probabilities as follows:

    P_T(x_1, x_2, . . . , x_n) = Π_{(i,j)∈E} P(x_i, x_j) / Π_{i∈V} P^{q_i−1}(x_i). (10.63)
"Distance" between the actual (correct) joint probability distribution P and the candidate tree-factorized probability distribution P_T can be measured in terms of the Kullback-Leibler divergence,

    D(P‖P_T) = Σ_x P(x) log( P(x) / P_T(x) ). (10.64)

As discussed in Section 8.5, the KL divergence is always positive if P and P_T are different, and is zero if these distributions are identical. Then we are looking for the tree that minimizes the KL divergence.
Substituting (10.63) into Eq. (10.64) one arrives at the following chain of explicit transformations:

    D(P‖P_T) = Σ_x P(x) ( log P(x) − Σ_{(i,j)∈E} log P(x_i, x_j) + Σ_{i∈V} (q_i − 1) log P(x_i) )
      = Σ_x P(x) log P(x) − Σ_{(i,j)∈E} Σ_{x_i,x_j} P(x_i, x_j) log P(x_i, x_j) + Σ_{i∈V} (q_i − 1) Σ_{x_i} P(x_i) log P(x_i)
      = − Σ_{(i,j)∈E} Σ_{x_i,x_j} P(x_i, x_j) log( P(x_i, x_j) / ( P(x_i)P(x_j) ) ) + Σ_x P(x) log P(x) − Σ_{i∈V} Σ_{x_i} P(x_i) log P(x_i), (10.65)

where the following nodal and edge marginalization relations were used: ∀i ∈ V : P(x_i) = Σ_{x\x_i} P(x), and ∀(i, j) ∈ E : P(x_i, x_j) = Σ_{x\{x_i,x_j}} P(x), respectively. One observes
that the Kullback-Leibler divergence becomes

    D(P‖P_T) = − Σ_{(i,j)∈E} I(X_i, X_j) + Σ_{i∈V} S(X_i) − S(X), (10.66)

where

    I(X_i, X_j) := Σ_{x_i,x_j} P(x_i, x_j) log( P(x_i, x_j) / ( P(x_i)P(x_j) ) ) (10.67)

is the mutual information of the pair, and S(·) denotes the (Shannon) entropy.
Based on this observation, Chow and Liu suggested using the following (standard in computer science) Kruskal maximum spanning tree reconstruction algorithm (notice that the algorithm is greedy):
Table 10.1: Information available about an exemplary probability distribution of four binary variables discussed in Exercise 10.5. Each row gives a configuration x1 x2 x3 x4, the joint probability P(x1, x2, x3, x4), and two factorized candidates.

    x1x2x3x4   P(x1,x2,x3,x4)   P(x1)P(x2|x1)P(x3|x2)P(x4|x1)   P(x1)P(x2)P(x3)P(x4)
    0000       0.100            0.130                           0.046
    0001       0.100            0.104                           0.046
    0010       0.050            0.037                           0.056
    0011       0.050            0.030                           0.056
    0100       0.000            0.015                           0.056
    0101       0.000            0.012                           0.056
    0110       0.100            0.068                           0.068
    0111       0.050            0.054                           0.068
    1000       0.050            0.053                           0.056
    1001       0.100            0.064                           0.056
    1010       0.000            0.015                           0.068
    1011       0.000            0.018                           0.068
    1100       0.050            0.033                           0.068
    1101       0.050            0.040                           0.068
    1110       0.150            0.149                           0.083
    1111       0.150            0.178                           0.083
• (step 1) Sort the edges of G into decreasing order by weight = mutual information, i.e. I(X_i, X_j) for the candidate edge (i, j).

• (step 2) Let E_T be the set of edges comprising the maximum-weight spanning tree; set E_T = ∅.

• (step 3) Add the next edge to E_T if and only if it does not form a cycle in E_T.

• (step 4) If E_T has n − 1 edges (where n is the number of nodes in G), stop and output E_T. Otherwise go to step 3.
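A hedged Julia sketch of these steps, taking as input a symmetric matrix MI of pairwise mutual informations (all names are ours; cycle detection via a simple union-find):

    function chow_liu_tree(MI)
        n = size(MI, 1)
        edges = sort([(MI[i, j], i, j) for i in 1:n for j in (i+1):n]; rev=true)  # step 1
        parent = collect(1:n)                                                     # step 2
        find(i) = parent[i] == i ? i : (parent[i] = find(parent[i]))
        tree = Tuple{Int,Int}[]
        for (w, i, j) in edges                                                    # step 3
            ri, rj = find(i), find(j)
            if ri != rj                       # adding (i,j) does not form a cycle
                parent[ri] = rj
                push!(tree, (i, j))
                length(tree) == n - 1 && break                                    # step 4
            end
        end
        return tree
    end

    println(chow_liu_tree([0 3.0 1; 3.0 0 2; 1 2 0]))   # expect edges (1,2), (2,3)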
Eq. (10.41) is exact only when it is guaranteed that the Graphical Model we attempt to recover forms a tree. However, the same tree ansatz can be used to recover the best tree approximation for a graphical model defined over a graph with loops. How should one choose the optimal (best-approximation) tree in this case? To answer this question within the aforementioned Kullback-Leibler paradigm one needs to compare the tree ansatz (10.41) with the empirical joint distribution. This reconstruction of the optimal tree is based on the
Chow-Liu algorithm.
Exercise 10.5. Find the Chow-Liu optimal spanning tree approximation for the joint probability distribution of four random binary variables with the statistical information presented in Table 10.1. [Hint: Estimate the empirical (i.e. based on the data) pair-wise mutual information and then utilize the Chow-Liu-Kruskal algorithm (see the description above in the lecture notes) to reconstruct the optimal tree.]
Theorem 10.4.1 (Universal Approximation Theorem (Cybenko 1989; Hornik 1991; Pinkus 1999)). Let ρ : R → R be any continuous function (called the activation function). Let NN_n^ρ represent the class of feed-forward NNs with activation function ρ, with n neurons in the input layer, one neuron in the output layer, and one hidden layer with an arbitrary number of neurons. Let K ⊂ R^n be compact. Then NN_n^ρ is dense in C(K) (the class of continuous functions on K) if and only if ρ is not a polynomial.
This famous theorem was an instigator of the NN revolution. In its original form, stated above, it applies to the regime of bounded depth and arbitrary width; it was also extended recently to the complementary case of bounded width and arbitrary depth. (See [31] and references therein. Notice that the term "deep" in the name Deep Learning refers to the large depth of the respective NNs.)
Even though Theorem 10.4.1 is agnostic to the type of the activation function, some activation functions are used more frequently in practice. The choice which has succeeded far beyond expectations is the nonlinear function called the Rectified Linear Unit, or simply, ReLU(x) = x_+ = max(x, 0).
A one-hidden-layer ReLU network mapping an input v ∈ R^p to an output in R^m is built as follows (a minimal sketch follows the list):

• Choose a (q, p)-dimensional matrix A_1 and a q-dimensional vector b_1, and set to zero all negative components of A_1 v + b_1, i.e. introduce ReLU(A_1 v + b_1) = (A_1 v + b_1)_+ acting component-wise, where each component is associated with a neuron of the hidden layer.

• Choose an (m, q)-dimensional matrix A_2 and apply it to the hidden-layer vector: A_2 (A_1 v + b_1)_+.
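A minimal sketch of this construction in Julia (random weights; the names are ours):

    relu(z) = max.(z, 0)                    # (z)_+ acting component-wise

    p, q, m = 3, 8, 2                       # input, hidden, output dimensions
    A1, b1 = randn(q, p), randn(q)
    A2 = randn(m, q)

    nn(v) = A2 * relu(A1 * v + b1)          # one hidden layer, ReLU activation
    println(nn(randn(p)))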
Introducing depth in a NN allows one to construct more and more expressive continuous piece-wise linear functions, containing more and more linear pieces. One standard way to do it is to use composition, as in Eq. (10.69).
Example 10.4.2 (Counting the Number of Pieces). Consider an example of NN(v) with ReLU activation function, a one-dimensional input layer, one 5-dimensional hidden layer, and a one-dimensional output layer, and count the number of pieces in the resulting continuous piece-wise linear function.
A key element of fitting a function with a NN is the so-called loss function, which is the term commonly used in data science to describe the objective of the underlying optimization formulation. A standard choice of the loss function is a norm, e.g. the l_1, l_2 or l_∞ norm, of the error between the function, F(v), and its NN approximation evaluated at the available samples, s = 1, · · · , S. Then, for the example of the l_p-norm min-error, the resulting optimization becomes

    min_θ Σ_{s=1}^{S} ‖F(v_s) − NN(v_s|θ)‖_p, (10.70)

where θ is the vector of the NN parameters, e.g. θ = (A_L, b_L, · · · , A_1, b_1) in the case of the continuous piece-wise linear NN given by Eq. (10.69).
(Here R denotes the ReLU acting component-wise: w = v_2 = b_2 + A_2 v_1 = b_2 + A_2 R(b_1 + A_1 v_0).)
Over-fitting occurs when a trained network performs very accurately on the given data but cannot generalize well to new data. If F does poorly even on the training set, we say that the NN is under-fit. Over-fitting occurs when F does well on the training set but gives a much larger error on the validation data. Colloquially, it occurs when our function is not sufficiently smooth – filling gaps (unnecessarily) between the training data points. One standard recommendation to avoid over-fitting: introduce regularization into the loss function. Another recommendation: stop training before you over-fit.
Example 10.4.3. Consider a Neural Network (NN) with L = 2, with one hidden layer consisting of a single node (neuron), and with tanh activation functions. Therefore the model is

    w = v_2 = tanh( a_2 tanh( a_1 v + b_1 ) + b_2 ),

and assume that the weights are currently set at (a_1, b_1) = (1.0, 0.5) and (a_2, b_2) = (−0.5, 0.3). What is the gradient of the Mean Square Error (MSE) cost for the observation (v, w) = (2, −0.5)? What is the optimal MSE and the optimal values of the parameters?
Solution. Evaluating the function (forward pass) with the initial parameters gives z_1 = a_1 v + b_1 = 2.5, v_1 = tanh(z_1) ≈ 0.9866, z_2 = a_2 v_1 + b_2 ≈ −0.1933, and ŵ = tanh(z_2) ≈ −0.1909, where ŵ stands for the NN estimate of the output, as opposed to the actual, i.e. sample, output w.
The MSE loss function is L = (ŵ − w)². The gradient of the loss function follows from the chain rule:

    ∂L/∂a_2 = (∂L/∂ŵ)(∂ŵ/∂z_2)(∂z_2/∂a_2),
    ∂L/∂b_2 = (∂L/∂ŵ)(∂ŵ/∂z_2)(∂z_2/∂b_2),
    ∂L/∂a_1 = (∂L/∂ŵ)(∂ŵ/∂z_2)(∂z_2/∂v_1)(∂v_1/∂z_1)(∂z_1/∂a_1),
    ∂L/∂b_1 = (∂L/∂ŵ)(∂ŵ/∂z_2)(∂z_2/∂v_1)(∂v_1/∂z_1)(∂z_1/∂b_1),

with ∂L/∂ŵ = 2(ŵ − w), ∂ŵ/∂z_2 = 1 − ŵ², ∂z_2/∂a_2 = v_1, ∂z_2/∂b_2 = 1, ∂z_2/∂v_1 = a_2, ∂v_1/∂z_1 = 1 − v_1², ∂z_1/∂a_1 = v, and ∂z_1/∂b_1 = 1.
(b) There is only one data point and four parameters; therefore the problem is under-determined: the optimal MSE is zero, but iterating gradient-descent-type methods does not guarantee returning a unique solution.
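The chain-rule gradient above can be checked numerically in a few lines of Julia (a sketch; the variable names are ours, the parameter values are those of the example):

    w_obs, v = -0.5, 2.0
    loss(a1, b1, a2, b2) = (tanh(a2 * tanh(a1 * v + b1) + b2) - w_obs)^2
    theta = [1.0, 0.5, -0.5, 0.3]              # (a1, b1, a2, b2)

    # analytic gradient via the chain rule
    z1 = theta[1] * v + theta[2];  v1 = tanh(z1)
    z2 = theta[3] * v1 + theta[4]; wh = tanh(z2)
    g2 = 2 * (wh - w_obs) * (1 - wh^2)         # dL/dz2
    g1 = g2 * theta[3] * (1 - v1^2)            # dL/dz1
    analytic = [g1 * v, g1, g2 * v1, g2]       # (dL/da1, dL/db1, dL/da2, dL/db2)

    # central finite differences
    h = 1e-6
    numeric = map(1:4) do k
        tp = copy(theta); tp[k] += h
        tm = copy(theta); tm[k] -= h
        (loss(tp...) - loss(tm...)) / (2h)
    end
    println(analytic, "  ", numeric)           # the two should agree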
(a) Describe the complexity of the class of functions representable by this NN.

(c) Build an example where this NN outputs a continuous piece-wise linear function with two linear intervals.
Appendix A

Convex and Non-Convex Optimization ∗
This Appendix was originally prepared by Dr. Yury Maximov from Los Alamos National Laboratory (and edited by MC). The material was presented in 2020 in 6 lectures cross-cut between Math 581 (then Math 583), Math 584 (then Math 527) and Math 589 (then Math 575). In 2021 the material of the Appendix was mainly covered in Math 584 (then Math 527), with a brief summary included in Math 581 (then Math 583) via 1.5 lectures and two exercises. In 2022 we do not include the material in Math 581 at all. The Appendix should thus be viewed as a ∗-Section of the main text – suggested for extra-curricular study ∗.
The Appendix is split into four Sections. Sections A.1 and A.2 discuss basic convex and non-convex finite-dimensional optimization. Then, in Sections A.3 and A.4, we turn to discussing iterative optimization methods for the optimization formulations, of constrained and unconstrained types, set up in Sections A.1 and A.2.
The most general problem we start our discussion from in Section A.1 consists in the minimization of a function f : S ⊆ R^n → R:

    min_{x∈S⊆R^n} f(x). (A.1)
Iterative algorithms, discussed in Sections A.3 and A.4, will be designed to solve Eq. (A.1).
Each step of such an algorithm will consist in updating the current estimate, xk , using
xj ,f (xj ), j ≤ k, possibly its vector of derivatives ∇f (x), and possibly the Hessian matrix,
∇2 f (x), such that the optimum is achieved in the limit, limk→+∞ f (xk ) = inf x∈S⊆Rn f (x).
Different iterative algorithms can be classified depending on the information available,
as follows:
• Zero-order algorithm, where at each iteration step one has an access to the value of
f (x) at a given point x (but no information on ∇f (x) and ∇2 f (x) is available);
• First-order optimization, where at each iteration step one has an access to the value
of f (x) and ∇f (x);
• Second-order algorithm, where at each iteration step one has an access to the value of
f (x), ∇f (x) and ∇2 f (x);
• Higher-order algorithm where at each iteration step one has an access to the value of
the objective function, its first, second and higher-order derivatives.
We will not discuss higher-order algorithms in these notes, focusing in Sections A.3 and A.4 primarily on the first-order and second-order algorithms.
An important class of functions one can efficiently minimize are convex functions, which were introduced earlier in Definition 6.7.3. We restate it here for convenience.

Definition A.1.3. Let a function f : R^n → R have a smooth gradient. Then f is convex iff

    ∀x ∈ R^n :  ∇²f(x) ⪰ 0,

that is, the Hessian of the function is a positive semi-definite matrix at any point. (Recall that a real symmetric n × n matrix H is positive semi-definite iff x^T H x ≥ 0 for any x ∈ R^n.)
APPENDIX A. CONVEX AND NON-CONVEX OPTIMIZATION ∗ 340
Lemma A.1.4. Prove that the definitions above are equivalent for sufficiently smooth functions.

Proof. Assume that the function is convex according to Definition A.1.1. Then for any h ∈ R^n, λ ∈ [0, 1], one has, according to Definition A.1.1,

    f(x + λh) ≤ (1 − λ) f(x) + λ f(x + h).

That is,

    ( f(x + λh) − f(x) ) / λ ≤ f(x + h) − f(x).

Then, taking the limit λ → 0, one has ∇f(x)^T h ≤ f(x + h) − f(x), ∀h ∈ R^n, which is exactly Def. A.1.2. Vice versa, if ∀x, y : f(y) ≥ f(x) + ∇f(x)^T (y − x), one has for z = λx + (1 − λ)y, and any λ ∈ [0, 1]:

    f(x) ≥ f(z) + ∇f(z)^T (x − z),   f(y) ≥ f(z) + ∇f(z)^T (y − z);

summing up the inequalities above with the weights λ and 1 − λ one gets f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y). Thus Def. A.1.1 and Def. A.1.2 are equivalent.

Further, if f is sufficiently smooth, one has according to the Taylor expansion

    f(y) = f(x) + ∇f(x)^T (y − x) + (1/2)(y − x)^T ∇²f(x)(y − x) + o(‖y − x‖²₂).

Taking y → x one gets from Definition A.1.2 to Definition A.1.3 and vice versa.
To establish some properties of sub-gradients (elements of the sub-differential), let us introduce the notion of a convex set, i.e. a set that contains the segment between any two of its points.
Definition A.1.7. A set S is convex iff for any x_1, x_2 ∈ S and θ ∈ [0, 1] one has x_1 θ + x_2 (1 − θ) ∈ S. In other words, a set S is convex if for any points x_1, x_2 in it, the set contains the line segment [x_1, x_2].
Theorem A.1.8. For any convex function f : R^n → R and any point x ∈ R^n, the sub-differential ∂f(x) is a convex set. In other words, for any g_1, g_2 ∈ ∂f(x) and θ ∈ [0, 1] one has θg_1 + (1 − θ)g_2 ∈ ∂f(x). Moreover, ∂f(x) = {∇f(x)} if f is smooth.
Proof. Let g_1, g_2 ∈ ∂f(x); then f(y) ≥ f(x) + g_1^T (y − x) and f(y) ≥ f(x) + g_2^T (y − x). That is, for any λ ∈ [0, 1] one has f(y) ≥ f(x) + (λg_1 + (1 − λ)g_2)^T (y − x), and λg_1 + (1 − λ)g_2 is a sub-gradient as well. We conclude that the set of all the sub-gradients is convex. Moreover, if f is smooth, according to the Taylor expansion formula one has f(x + h) = f(x) + ∇f(x)^T h + O(‖h‖²₂). Assume that there exists a sub-gradient g ∈ ∂f(x) other than ∇f(x) (note ∇f(x) ∈ ∂f(x) by the definition of convex functions, A.1.2). Then f(x) + g^T h ≤ f(x + h) = f(x) + ∇f(x)^T h + O(‖h‖²₂), and similarly f(x) − g^T h ≤ f(x − h) = f(x) − ∇f(x)^T h + O(‖h‖²₂); hence (g − ∇f(x))^T h ≤ O(‖h‖²₂) for all h, which forces g = ∇f(x).
• The sub-differential of f(x) = |x| is

    ∂f(x) = { {1} if x > 0;  {−1} if x < 0;  [−1, 1] if x = 0 }.
a) x^p (for x > 0), with p ≥ 1 or p ≤ 0, is convex; x^p with 0 ≤ p ≤ 1 is concave;
One can also extend the statement to non-smooth and multidimensional functions.
d) LogSumExp, also called soft-max, log( Σ_{i=1}^{n} exp(x_i) ), is convex in x ∈ R^n. The soft-max function plays a very important role because it bridges smooth and non-smooth optimization (a quick numerical check is sketched after this list):

    max(x_1, x_2, . . . , x_n) ≈ (1/λ) log( Σ_{i=1}^{n} exp(λ x_i) ),   λ → +∞. (A.2)
f) The vector norm ‖x‖_p := ( Σ_i |x_i|^p )^{1/p}, x ∈ R^n, also called the p-norm or ℓ_p-norm, is convex when p ≥ 1.
g) Dual norm k · k∗ to k · k is kyk∗ := supkxk≤1 x> y. The dual norm is always convex.
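A quick numerical check of the soft-max bound (A.2) in Julia (a sketch):

    x = [0.3, 1.7, -0.2, 1.5]
    for lam in (1.0, 10.0, 100.0)
        println(lam, "  ", log(sum(exp.(lam .* x))) / lam)   # -> maximum(x) = 1.7
    end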
2. The image of a convex set S under an affine mapping, S̄ = {Ax + b : x ∈ S}, is convex.
3. The image (and inverse image) of a convex set S under the perspective mapping P : R^{n+1} → R^n, P(x, t) = x/t, dom P = {(x, t) : t > 0}, is convex.

Indeed, consider y_1, y_2 ∈ P(S), so that y_1 = x_1/t_1 and y_2 = x_2/t_2. We need to show that for any λ ∈ [0, 1]

    y = λ y_1 + (1 − λ) y_2 = λ x_1/t_1 + (1 − λ) x_2/t_2 = ( θ x_1 + (1 − θ) x_2 ) / ( θ t_1 + (1 − θ) t_2 ),

which holds for θ = λ t_2 / ( λ t_2 + (1 − λ) t_1 ). The proof of the inverse statement is similar.
Exercise A.2. Check that all the functions and all the sets above are convex using Definition A.1.1 of a convex function (or the equivalent Definitions A.1.2, A.1.3) and Definition A.1.7 of a convex set.
In the further analysis we introduce a special subclass of convex functions for which one can guarantee much faster convergence than for the minimization of a general convex function.

Definition A.1.11. A function f : R^n → R is µ-strongly convex with respect to the norm ‖·‖, for some µ > 0, iff

1. ∀x, y :  f(y) ≥ f(x) + ∇f(x)^T (y − x) + (µ/2) ‖y − x‖².
• Is it strongly convex/concave in the ℓ_1, ℓ_2, ℓ_∞ norms? Hint: to prove that a function is µ-strongly convex in a given norm it is sufficient to show that f(x) − (µ/2)‖x‖² is convex.

The optimization problem min_{x∈S} f(x) is convex if f(x) and S are convex. The complexity of an iterative algorithm initiated with x_0 to solve the optimization problem is measured in the number of iterations required to get a point x_k such that |f(x_k) − inf_{x∈S⊆R^n} f(x)| < ε. Each iteration means an update of x_k. The complexity classification is as follows:
• linear: the number of iterations is k = O(log(1/ε)); in other words, f(x_{k+1}) − inf_{x∈S} f(x) ≤ c (f(x_k) − inf_{x∈S} f(x)) for some constant c, 0 < c < 1. Roughly, after each iteration we increase the number of correct digits in our answer by one.

• quadratic: k = O(log log(1/ε)), and f(x_{k+1}) − inf_{x∈S} f(x) ≤ c (f(x_k) − inf_{x∈S} f(x))² for some constant c, 0 < c < 1. That is, after each iteration we double the number of correct digits in our answer.
• sub-linear, that is characterized by the rate slower than O(log(1/ε)). In convex op-
timization, it is often the case that the convergence rate for different methods is
√
k = O(1/ε), O(1/ε2 ), or O(1/ ε) depending on the properties of function f .
Consider the constrained optimization problem

f(x) → min_{x∈Rn}
s.t.: g(x) ≤ 0,
h(x) = 0.

If the inequality constraint g(x) is convex and the equality constraint is affine, h(x) = Ax + b, the feasible set of this problem, S = {x : g(x) ≤ 0 and h(x) = 0}, is convex; this follows immediately from the definitions of a convex set and a convex function. As we will see later in the lectures, in contrast to non-convex problems, the convex ones admit very efficient and scalable solutions.
Exercise A.6. Let Π_C^{ℓ_p}(x) be the projection of a point x onto a convex compact set C in the ℓ_p norm, i.e.

Π_C^{ℓ_p}(x) = arg min_{y∈C} ‖x − y‖_p.
A.2 Duality
Duality is a very powerful tool which allows one (1) to design efficient (tractable) algorithms to approximate non-convex problems; (2) to build efficient algorithms for convex and non-convex problems with constraints (algorithms which are often of a much smaller dimensionality than the original formulations); (3) to formulate necessary and sufficient conditions of optimality for convex and non-convex optimization problems.
Lagrangian

Consider the optimization problem

f(x) → min_{x∈Rn}
s.t.: g_i(x) ≤ 0, 1 ≤ i ≤ m,
h_j(x) = 0, 1 ≤ j ≤ p, (A.3)

with the optimal value p* (which is possibly −∞). Let S be the feasible set of this problem, that is, the set of all x for which all the constraints are satisfied.
Compose the so-called Lagrangian function L : Rn × Rm × Rp → R:

L(x, λ, µ) = f(x) + Σ_{i=1}^m λ_i g_i(x) + Σ_{j=1}^p µ_j h_j(x) = f(x) + λ^⊤g(x) + µ^⊤h(x), λ ≥ 0, (A.4)
which is a weighted combination of the objective and the constraints. Lagrange multipliers,
λ and µ, can be viewed as penalties for violation of inequality and equality constraints.
The Lagrangian function (A.4) allows us to formulate the constrained optimization, Eq. (A.3), as a min-max (also called saddle point) optimization problem:

p* = min_{x∈Rn} max_{λ≥0,µ} L(x, λ, µ). (A.5)
Let us consider the saddle point problem (A.5) in greater detail. For any feasible point x ∈ S ⊆ Rn and any λ ≥ 0 one has f(x) ≥ L(x, λ, µ). Thus

L(λ, µ) := min_{x∈S} L(x, λ, µ) ≤ min_{x∈S} f(x) = p* ⟹ max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) ≤ p* = min_{x∈S} max_{λ≥0,µ} L(x, λ, µ),
where L(λ, µ) = inf_{x∈Rn} L(x, λ, µ) = inf_{x∈Rn} ( f(x) + λ^⊤g(x) + µ^⊤h(x) ) is called the Lagrange dual function. One can restate the bound above as d* := max_{λ≥0,µ} L(λ, µ) ≤ p*.
The original optimization, min_{x∈S} f(x) = min_{x∈S} max_{λ≥0,µ} L(x, λ, µ), is called the Lagrange primal optimization, while max_{λ≥0,µ} L(λ, µ) = max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) is called the Lagrange dual optimization.

Note that max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) = max_{λ≥0,µ} min_{x∈Rn} L(x, λ, µ), regardless of what S is. This is because for x̂ ∉ S one has max_{λ≥0,µ} L(x̂, λ, µ) = +∞, thus allowing us to perform the unconstrained minimization of L(x, λ, µ) over x much more efficiently.
Let us describe a number of important features of the dual optimization:

1. Concavity of the dual function. The dual function L(λ, µ) is always concave. Indeed, for (λ̄, µ̄) = θ(λ_1, µ_1) + (1 − θ)(λ_2, µ_2) one has

L(λ̄, µ̄) = inf_{x∈Rn} ( θ(f(x) + λ_1^⊤g(x) + µ_1^⊤h(x)) + (1 − θ)(f(x) + λ_2^⊤g(x) + µ_2^⊤h(x)) ) ≥ θL(λ_1, µ_1) + (1 − θ)L(λ_2, µ_2),

since the infimum of a sum is bounded from below by the sum of the infima; in other words, L(λ, µ) is a point-wise infimum of functions affine in (λ, µ), and is therefore concave.
3. Weak duality: for any optimization problem d* ≤ p*. Indeed, for any feasible x and any (λ, µ) with λ ≥ 0 we have f(x) ≥ L(x, λ, µ) ≥ L(λ, µ), thus p* = min_{x∈S} f(x) ≥ max_{λ≥0,µ} L(λ, µ) = d*.
4. Strong duality: we say that strong duality holds if p* = d*. Convexity of the objective function and of the feasible set S is neither a sufficient nor a necessary condition for strong duality (see the following example).
Example A.2.1. Convexity alone is not sufficient for strong duality. Find the dual problem and the duality gap p* − d* for the following optimization:

exp(−x) → min_{y>0,x}
s.t.: x²/y ≤ 0.

The optimal value is p* = 1, which is achieved at x = 0 and any positive y. The dual function is

L(λ) = inf_{y>0,x} ( exp(−x) + λx²/y ) = 0.

That is, the dual problem is max_{λ≥0} 0 = 0, and the duality gap is p* − d* = 1.
Theorem A.2.2 (Slater (sufficient) conditions). Consider the optimization (A.3) where
all the equality constraints are affine and all the inequality constraints and the objective
function are convex. The strong duality holds if there exists an x∗ such that x∗ is strictly
feasible, i.e. all constraints are satisfied and the nonlinear constraints are satisfied with
strict inequalities.
The Slater conditions imply that the set of optimal solutions of the dual problem is non-empty and bounded, therefore making the conditions sufficient for the strong duality of the optimization.
Optimality Conditions
Another notable feature of the Lagrangian function is due to its role in establishing nec-
essary and sufficient conditions for a triplet (x, λ, µ) to be the solution of the saddle-point
optimization (A.5). First, let us formulate necessary conditions of optimality for
f(x) → min
s.t.: g_i(x) ≤ 0, 1 ≤ i ≤ m,
h_j(x) = 0, 1 ≤ j ≤ p,
x ∈ S ⊆ Rn,
where the Lagrangian is defined in Eq. (A.4). The following conditions, called the Karush-Kuhn-Tucker (KKT) conditions, are necessary for a triplet (x*, λ*, µ*) to be optimal:

1. Primal feasibility: x* ∈ S, g_i(x*) ≤ 0 (1 ≤ i ≤ m), h_j(x*) = 0 (1 ≤ j ≤ p).

2. Dual feasibility: λ* ≥ 0.

3. Stationarity: ∇_x L(x*, λ*, µ*) = 0.

4. Complementary slackness: λ_i* g_i(x*) = 0, 1 ≤ i ≤ m.
Note that the KKT conditions generalize (the finite dimensional version of) the Euler-Lagrange conditions introduced in the variational calculus. Let us now investigate when the conditions are sufficient.
The KKT conditions are sufficient if the problem allows the strong duality, for which (as we saw above) the Slater conditions are sufficient. Indeed, assume that the strong duality holds and a point (x*, λ*, µ*) satisfies the KKT conditions. Then

L(λ*, µ*) = L(x*, λ*, µ*) = f(x*),

where the first equality holds because of the problem stationarity (the convex function L(·, λ*, µ*) attains its minimum at the stationary point x*), and the second equality holds because of the complementary slackness (together with primal feasibility). Hence d* ≥ L(λ*, µ*) = f(x*) ≥ p*, and, with weak duality, the triplet is optimal.
Example A.2.3. Find the duality gap and solve the dual problem for the following minimization:

min_x (x_1 − 3)² + (x_2 − 2)²
s.t.: x_1 + 2x_2 = 4,
x_1² + x_2² ≤ 5.

Note that the problem is convex and the Slater conditions are satisfied; therefore the minimum is unique and there is no duality gap. The Lagrangian is

L(x, λ, µ) = (x_1 − 3)² + (x_2 − 2)² + µ(x_1 + 2x_2 − 4) + λ(x_1² + x_2² − 5), λ ≥ 0.

Stationarity over x_1 and x_2 gives 2(1 + λ)x_1 = 6 − µ and (1 + λ)x_2 = 2 − µ; eliminating µ, (1 + λ)(2x_1 − x_2) = 4, and using the primal feasibility constraint one derives

x_1 = (12 + 4λ)/(5(1 + λ)), x_2 = (4 + 8λ)/(5(1 + λ)).

The dual problem becomes

L(λ) = 5 − 9λ/5 − 16/(5(1 + λ)) → max_{λ≥0}.

Looking for a stationary point of L(λ), we arrive at λ = 1/3 and λ = −7/3. However, given that λ* ≥ 0, we get λ* = 1/3. Finally, the saddle point is (x_1*, x_2*, λ*, µ*) = (2, 1, 1/3, 2/3).
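The saddle point can also be verified numerically; the following sketch (the grid resolution is an arbitrary choice) maximizes the dual function L(λ) derived above on a grid and recovers the primal point x(λ):

import numpy as np

lam = np.linspace(0.0, 2.0, 200001)
L = 5 - 9 * lam / 5 - 16 / (5 * (1 + lam))        # dual objective from the example
k = int(np.argmax(L))
lam_star = lam[k]
x1 = (12 + 4 * lam_star) / (5 * (1 + lam_star))   # primal point recovered from lambda
x2 = (4 + 8 * lam_star) / (5 * (1 + lam_star))
print(lam_star, L[k], x1, x2)                     # ~ 1/3, 2.0, 2.0, 1.0 (zero duality gap)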
3x + 7y + z → min
s.t.: x + 5y = 2,
x + y ≥ 3,
z ≥ 0;

find the dual problem, the optimal values of the primal and dual objectives, as well as the optimal solutions for the primal and for the dual variables. Describe all the steps in detail.
Solution: Since z enters the objective with a positive coefficient and is only constrained by z ≥ 0, one sets z* = 0 and the problem reduces to

3x + 7y → min
s.t.: x + 5y = 2,
x + y ≥ 3.

With the Lagrangian L(x, y, µ, λ) = 3x + 7y − µ(x + 5y − 2) + λ(3 − x − y), λ ≥ 0, the stationarity conditions are

(d/dx) L(x, y, µ, λ) = 3 − µ − λ = 0,
(d/dy) L(x, y, µ, λ) = 7 − 5µ − λ = 0,

therefore resulting in µ = 1 and λ = 2. One observes that the Lagrange multipliers are feasible, meaning that there exists at least one point on the intersection of the equality and inequality constraints. Since λ = 2 > 0, complementary slackness,

λ(3 − x − y) = 0,

forces the inequality constraint to be active. Solving

x + 5y = 2 and x + y = 3,

one derives the optimal values of the primal variables, (x, y, z) = (3.25, −0.25, 0), with the primal objective 3 · 3.25 + 7 · (−0.25) + 0 = 8.
Dual problem. The dual objective is

L(λ, µ) = inf_{x,y} L(x, y, µ, λ) = { 2µ + 3λ, if 3 − µ − λ = 0 and 7 − 5µ − λ = 0; −∞, otherwise, }

so the dual problem reads

2µ + 3λ → max
s.t.: 3 − λ − µ = 0,
7 − 5µ − λ = 0.

The duality gap is 0, as this problem is linear (Slater's condition is satisfied by definition); indeed, the dual optimum is 2µ + 3λ = 2 · 1 + 3 · 2 = 8, equal to the primal one.
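As a sanity check one may also hand the primal LP to an off-the-shelf solver; the sketch below uses scipy.optimize.linprog (an illustrative choice of solver), rewriting x + y ≥ 3 in the ≤ form the routine expects:

from scipy.optimize import linprog

# minimize 3x + 7y + z  s.t.  x + 5y = 2,  x + y >= 3 (as -x - y <= -3),  z >= 0
res = linprog(c=[3, 7, 1],
              A_ub=[[-1, -1, 0]], b_ub=[-3],
              A_eq=[[1, 5, 0]], b_eq=[2],
              bounds=[(None, None), (None, None), (0, None)])
print(res.x, res.fun)   # ~ [3.25, -0.25, 0.0] and 8.0, matching the analysis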
Exercise A.7. For the primal optimization problems stated below, find the dual problem, the optimal values of the primal and dual objectives, as well as the optimal solutions for the primal and for the dual variables. Describe all the steps in detail.
Examples of Duality
Example A.2.5 (Duality and Legendre-Fenchel Transform). Let us discuss the relation between the transformation from the Lagrange function to the dual (Lagrange) function and the Legendre-Fenchel (LF) transform (or conjugate function),

f*(y) = sup_x ( y^⊤x − f(x) ),

introduced in the Variational Calculus Section 6 of the course. One of the principal conclusions of the LF analysis is f(x) ≥ f**(x). The inequality is directly linked to the statement of duality, specifically to the fact that the dual optimization lower-bounds the primal one. To illustrate the relationship between the maximization of f** and the dual problem, consider

f(x) → min
s.t.: x = b.

Weak duality gives

f(b) = min_x max_µ { f(x) + µ^⊤(b − x) } ≥ max_µ min_x { f(x) + µ^⊤(b − x) } = max_µ { µ^⊤b − f*(µ) } = f**(b).

Minimizing the expression over all b ∈ Rn one arrives at min_{x∈Rn} f(x) ≥ min_{x∈Rn} f**(x).
Example A.2.6 (Duality in Linear Programming (LP)). Consider the following problem:

c^⊤x → min
s.t.: Ax ≤ b.

We define the Lagrangian, L(x, λ) = c^⊤x + λ^⊤(Ax − b), λ ≥ 0, and arrive at the following dual objective:

L(λ) = inf_x L(x, λ) = { −b^⊤λ, if A^⊤λ + c = 0; −∞, otherwise. }
Example A.2.7 (Non-convex problems with strong duality). Consider the following quadratic minimization:

x^⊤Ax + 2b^⊤x → min_x
s.t.: x^⊤x ≤ 1,

where A is not necessarily positive semi-definite. Its Lagrange dual can be stated as

−t − λ → max
s.t.: t ≥ b^⊤(A + λI)^+ b,
A + λI ⪰ 0,
b ∈ Im(A + λI),

or, equivalently (by the Schur complement), as the semi-definite program

−t − λ → max
s.t.: [ A + λI   b
        b^⊤      t ] ⪰ 0.
Example A.2.8 (Dual to binary Quadratic Programming (QP)). Consider the following binary quadratic optimization:

x^⊤Ax → max
s.t.: x_i² = 1, 1 ≤ i ≤ n.

Its Lagrange dual is the conic optimization

Σ_{i=1}^n µ_i → min (A.7)
s.t.: Diag(µ) ⪰ A.

Note that the optimization (A.7) is convex and it provides a non-trivial (upper) bound on the primal maximization. This bound is called the Semi-Definite Programming (SDP) relaxation.
The original formulation is convex and Slater's condition is obviously satisfied (any x which is sufficiently small in the p-norm is feasible); therefore the strong duality holds and we can reverse the order of the optimizations in Eq. (A.8):

max_{µ≥0} min_x { λ^⊤x + µ(‖x‖_p − 1) }. (A.9)

Next let us write the KKT conditions. The stationarity condition over the primal variable for the objective in Eq. (A.9) is

∀i = 1, ···, d : λ_i + µ x_i* |x_i*|^{p−2} / ( |x_1*|^p + ··· + |x_d*|^p )^{1−1/p} = 0, (A.10)

and the complementary slackness condition is

µ ( ‖x*‖_p − 1 ) = 0. (A.11)

Assume that λ ≠ 0 (otherwise the result is trivially zero); then µ ≠ 0 according to Eq. (A.10), ‖x*‖_p = 1 by Eq. (A.11), and Eq. (A.10) becomes: ∀i = 1, ···, d : λ_i = −µ x_i* |x_i*|^{p−2}. Combining the two equations (and using dual feasibility, µ ≥ 0) we derive

µ = ( Σ_i |λ_i|^{p/(p−1)} )^{(p−1)/p} = ‖λ‖_{p/(p−1)}, (A.12)

λ^⊤x* = −µ Σ_i |x_i*|^p = −µ, (A.13)

therefore proving the relation after substituting the optimal values back into the objective.
• (b) Find the dual of Eq. (A.14), restating the ℓ∞-constraint as a convex quadratic constraint. Is the duality gap zero at L ⪰ 0?

• (c) Show that if bb^⊤ ⪰ εL ⪰ 0 for some ε > 0, then L = c bb^⊤, where c is a constant.

• (d) Assuming that the conditions in (c) are satisfied, solve Eq. (A.14) analytically. (Hint: transition to a scalar variable and show that the problem reduces to a one-dimensional quadratic, concave optimization over a bounded domain.)
c^⊤x → min
s.t.: Ax = b,
x ∈ K,

where K ⊆ Rn is a proper cone, that is:

1. K is convex;

2. K is closed;

3. K is solid (has a non-empty interior) and pointed (contains no lines).

Conic optimization problems form an important class. In Example A.2.8 you have already seen a conic optimization problem over the cone of positive semi-definite matrices (the problem dual to the binary quadratic optimization).
The dual cone K* of K is defined as K* = {c : ⟨c, x⟩ ≥ 0, ∀x ∈ K}.
Exercise A.9. Show that the following sets are self-dual cones (that is, K∗ = K).
The Lagrangian of the conic problem is

L(x, λ, µ) = c^⊤x + µ^⊤(b − Ax) − λ^⊤x,

where the last term stands for the constraint x ∈ K. From the definition of the dual cone one derives

max_{λ∈K*} ( −λ^⊤x ) = { 0, x ∈ K; +∞, x ∉ K. }

Therefore

p* = min_x max_{λ∈K*,µ} L(x, λ, µ) ≥ max_{λ∈K*,µ} min_x L(x, λ, µ) = d*.
And finally,

d* = max_{λ∈K*,µ} µ^⊤b
s.t.: c − A^⊤µ − λ = 0,

or, after eliminating λ = c − A^⊤µ,

µ^⊤b → max
s.t.: c − A^⊤µ ∈ K*.
Σ_{i=1}^n µ_i → min
s.t.: Diag(µ) ⪰ A.

⟨A, X⟩ → max
s.t.: X ∈ S^n_+,
X_ii = 1, ∀i.
In the remainder of the Section we will study iterative algorithms to solve the optimization problems discussed so far. It will be convenient to think about iterations in terms of "discrete (algorithmic) time", and also to consider the "continuous time" limit when the change in the values per iteration is sufficiently small and the number of iterations is sufficiently large. In the continuous time analysis of the algorithms we utilize the language of differential equations, as it helps both with intuition (familiar from the first semester studies of differential equations) and with analysis. However, to reach some of the rigorous conclusions we may also go back to the original, discrete, language.
We consider the unconstrained problem

f(x) → min_{x∈Rn},

and focus on first-order optimization methods. That is, we assume that the objective function, as well as the gradient of the objective function, can both be evaluated efficiently. Note that the first order methods described in this Section are the most popular methods/algorithms currently in use to solve the majority of practical machine learning, data science and, more generally, applied mathematics problems.
We assume that the function f is smooth, that is,

‖∇f(x) − ∇f(y)‖ ≤ β‖x − y‖, ∀x, y ∈ Rn,

for some positive constant β. Choosing the ℓ2 norm, ‖·‖ = ‖·‖_2, one derives

f(y) ≤ f(x) + ∇f(x)^⊤(y − x) + (β/2)‖y − x‖_2², ∀x, y ∈ Rn.

To simplify the description we will thus omit "w.r.t. the norm ‖·‖" in the following when discussing the ℓ2 norm.
Smooth Optimization
Gradient Descent. Gradient Descent (GD) is the simplest and arguably the most popular method/algorithm for solving convex (and non-convex) optimization problems. An iteration of the GD algorithm is

x_{k+1} = x_k − η_k ∇f(x_k) = arg min_x h_{η_k}(x), h_{η_k}(x) := f(x_k) + ∇f(x_k)^⊤(x − x_k) + (1/(2η_k))‖x − x_k‖_2², η_k ≤ 1/β,
where we assume that f is β-smooth with respect to the ℓ2 norm. If η_k ≤ 1/β, each step of the GD becomes equivalent to the minimization of the convex quadratic upper bound h_{η_k}(x) of f(x).
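A minimal implementation sketch of the GD iteration in Python (the quadratic test function below is an assumed example, not taken from the text):

import numpy as np

def gradient_descent(grad_f, x0, eta, num_iters):
    # x_{k+1} = x_k - eta * grad f(x_k); for a beta-smooth f take eta <= 1/beta
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - eta * grad_f(x)
    return x

# f(x) = 0.5 x^T A x - b^T x, so grad f(x) = A x - b and beta = ||A||_2 (A is PSD here)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = gradient_descent(lambda z: A @ z - b, np.zeros(2), 1.0 / np.linalg.norm(A, 2), 500)
print(x, np.linalg.solve(A, b))   # the two should nearly coincide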
Theorem A.3.2. GD with a fixed step size η_k = η ≤ 1/β, applied to a convex β-smooth function f, converges to the optimum according to

f(x_{k+1}) − f* ≤ ‖x_0 − x*‖_2²/(2ηk). (A.18)

We will provide the continuous time proof of the Theorem, as well as its discrete time version, where the former relies on the notion of a Lyapunov function.

Definition A.3.3. A Lyapunov function, V(x(t)), of the differential equation ẋ(t) = f(x(t)), is a function that is non-negative and does not increase along the trajectories of the equation, (d/dt)V(x(t)) ≤ 0.

From now on, we will use the capitalized notation, X(t), for the continuous time version of (x_k | k = 1, ···).
Proof of Theorem A.3.2: Continuous time. The GD algorithm can be viewed as a discretization of the first-order differential equation

Ẋ(t) = −∇f(X(t)), X(0) = x_0.

Introduce the following Lyapunov function for this ODE: V(X(t)) = ‖X(t) − x*‖_2²/2. Then

(d/dt)V(X(t)) = (X(t) − x*)^⊤Ẋ(t) = −∇f(X(t))^⊤(X(t) − x*) ≤ −(f(X(t)) − f*), (A.19)

where the last inequality is due to the convexity of f. Integrating Eq. (A.19) over time, one derives

V(X(t)) − V(X(0)) ≤ tf* − ∫_0^t f(X(τ)) dτ.

Since V ≥ 0 and f(X(t)) does not increase along the trajectory ((d/dt)f(X(t)) = −‖∇f(X(t))‖_2² ≤ 0), it follows that t(f(X(t)) − f*) ≤ ∫_0^t (f(X(τ)) − f*) dτ ≤ V(X(0)), i.e. f(X(t)) − f* ≤ ‖x_0 − x*‖_2²/(2t), the continuous time counterpart of Eq. (A.18).
Proof of Theorem A.3.2: Discrete time. The condition of smoothness applied to y = x − η∇f(x) results in

f(y) ≤ f(x) + ∇f(x)^⊤(y − x) + (β/2)‖y − x‖_2²
= f(x) + ∇f(x)^⊤(x − η∇f(x) − x) + (β/2)‖x − η∇f(x) − x‖_2²
= f(x) − η‖∇f(x)‖_2² + (βη²/2)‖∇f(x)‖_2²
= f(x) − η(1 − βη/2)‖∇f(x)‖_2².

As η ≤ 1/β, one derives 1 − βη/2 ≥ 1/2, and

f(y) ≤ f(x) − (η/2)‖∇f(x)‖_2². (A.20)
Note that Eq. (A.20) does not require convexity of the function; however, if the function is convex, one derives, by choosing y = x* in the convexity inequality f(y) ≥ f(x) + ∇f(x)^⊤(y − x),

f(x) − f(x*) ≤ ∇f(x)^⊤(x − x*).

Plugging the last inequality into the smoothness-based bound, one derives for y = x − η∇f(x):

f(y) − f(x*) ≤ ∇f(x)^⊤(x − x*) − (η/2)‖∇f(x)‖_2²
= (1/(2η)) ( ‖x − x*‖_2² − ‖x − η∇f(x) − x*‖_2² )
= (1/(2η)) ( ‖x − x*‖_2² − ‖y − x*‖_2² ).

Summing these inequalities along the iterations (x = x_j, y = x_{j+1}), the right-hand side telescopes:

Σ_{j≤k} (f(x_{j+1}) − f(x*)) ≤ (1/(2η)) Σ_{j≤k} ( ‖x_j − x*‖_2² − ‖x_{j+1} − x*‖_2² )
= (1/(2η)) ( ‖x_1 − x*‖_2² − ‖x_{k+1} − x*‖_2² )
≤ R²/(2η) = βR²/2,

where R ≥ ‖x_1 − x*‖_2 and the last equality corresponds to the step size η = 1/β. Since f(x_j) does not increase with j by Eq. (A.20), the left-hand side is at least k(f(x_{k+1}) − f(x*)), which yields the claim of the Theorem.
One obviously would like to choose the step size in GD which results in the fastest convergence. However, this problem – of choosing the best, or simply a good, step size – is hard and remains open. This also means that finding a good stopping criterion for the iterations is hard as well. Here are practical/empirical strategies for choosing the step size in GD:
• Polyak's step-size rule. If the optimal value f* of the function is known, one can suggest a better step-size policy. Minimization of the right-hand side of

‖x_{k+1} − x*‖_2² ≤ ‖x_k − x*‖_2² − 2η_k (f(x_k) − f(x*)) + η_k² ‖g_k‖_2² → min_{η_k}

results in Polyak's rule, η_k = (f(x_k) − f(x*))/‖g_k‖_2², which is known to be useful, in particular, for solving underdetermined systems of linear equations, Ax = b.
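A short sketch of Polyak's rule in action, for the assumed test function f(x) = ‖x‖_1, whose optimal value f* = 0 is known:

import numpy as np

x = np.array([3.0, -1.5, 0.5])
for _ in range(50):
    f_val = np.abs(x).sum()          # f(x_k)
    if f_val == 0.0:
        break                        # already at the optimum
    g = np.sign(x)                   # a sub-gradient of the l1 norm
    eta = f_val / (g @ g)            # Polyak's step size with f* = 0
    x = x - eta * g
print(x)   # approaches the origin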
Exercise A.11. Recall that GD minimizes the convex quadratic upper bound h_{η_k}(x) of f(x). Consider a modified GD, where the step size is η = (2 + ε)/β, with ε chosen positive. (Notice that the step size used in the conditions of Theorem A.3.2 was η ≤ 1/β.) Derive a modified version of Eq. (A.18). Can one find a quadratic convex function for which the modified algorithm fails to converge?
f(x) → min,
s.t.: ‖x − x*‖ ≤ ε, x ∈ Rn,

where x* is the global and unique minimum of the β-smooth function f. Moreover, let

∀x ∈ Rn : (1/2)‖∇f(x)‖_2² ≥ µ(f(x) − f(x*)).

Is it true that for some small ε > 0 the GD with the step-size η_k = 1/β converges to the optimum? How does ε depend on β and µ?
Exercise A.13 (not graded - difficult). In many optimization problems it is often the case that the exact value of the gradient is polluted, i.e. only its noisy version is observed. In this case one may consider the following "inexact oracle" optimization: f(x) → min, x ∈ Rn, assuming that for any x one can only compute approximations f̂(x) and ∇̂f(x) of the objective and of its gradient, and seek an algorithm to solve it. Propose and analyze a modification of GD solving the "inexact oracle" optimization.
GD performed in the ℓ_p norm, with η_k ≤ 1/β_p, β_p ≥ sup_x ‖g(x)‖_p, and a properly chosen p, can converge much faster than GD in ℓ2. GD in ℓ1 is particularly popular.
Exercise A.14. Restate and prove the discrete time version of Theorem A.3.2 for GD in the ℓ_p norm. (Hint: consider the following Lyapunov function: ‖x − x*‖_p².)
Theorem A.3.4. GD for a µ-strongly convex, β-smooth function f and a fixed step-size policy converges linearly,

f(x_{k+1}) − f* ≤ c (f(x_k) − f*),

where c ≤ 1 − µ/β.
Fast Gradient Descent. GD is simple and efficient in practice. However, it may also be slow if the gradient is small. It may also oscillate about the point of optimality if the gradient points in a direction with a small projection onto the optimal direction (pointing at the optimum). The following two modifications of the GD algorithm were introduced to cure these problems: the heavy-ball method,

x_{k+1} = x_k − η∇f(x_k) + µ(x_k − x_{k−1}), (A.21)

and the Fast Gradient Method (FGM),

x_k = y_{k−1} − η_k∇f(y_{k−1}), y_k = x_k + γ_k (x_k − x_{k−1}), (A.22)

with appropriately chosen momentum coefficients γ_k.
The last term in Eqs. (A.21, A.22) is called the "momentum" or "inertia" term, to emphasize the relation to the respective phenomena in classical mechanics. The inertia term, added to the original GD term (which may be associated with "damping" or "friction"), aims to force the hypothetical "ball" to roll towards the optimum faster. In spite of their seemingly minor difference, the convergence rates of the FGM and of the heavy-ball method differ rather dramatically, as the heavy ball can lead to an overshoot (not enough "friction").
Exercise A.16. Construct a convex function f with a piece-wise linear gradient such that the heavy ball algorithm (A.21) with some fixed µ and η fails to converge.
Consider a slightly modified (less general, two-step recurrence) version of the FGM (A.22):

x_k = y_{k−1} − η∇f(y_{k−1}), y_k = x_k + ((k − 1)/(k + 2)) (x_k − x_{k−1}), (A.23)

which can be re-stated in continuous time as follows:

Ẍ(t) + (3/t)Ẋ(t) + ∇f(X(t)) = 0. (A.24)
Indeed, assuming t ≈ k√η and re-scaling, one derives from Eq. (A.23)

(x_{k+1} − x_k)/√η = ((k − 1)/(k + 2)) (x_k − x_{k−1})/√η − √η ∇f(y_k). (A.25)

Let x_k ≈ X(k√η), so that

X(t) ≈ x_{t/√η} = x_k, X(t + √η) ≈ x_{(t+√η)/√η} = x_{k+1};

then, Taylor-expanding both sides of Eq. (A.25) to order √η, one arrives at

Ẋ(t) + (√η/2)Ẍ(t) + o(√η) = (1 − 3√η/t) ( Ẋ(t) − (√η/2)Ẍ(t) ) − √η ∇f(X(t)) + o(√η),

resulting in Eq. (A.24).
To analyze the convergence rate of the FGM (A.24) we introduce the following Lyapunov function:

V(X(t)) = t²(f(X(t)) − f*) + 2‖X + tẊ/2 − x*‖_2².

The time derivative of the Lyapunov function is

V̇ = 2t(f(X) − f*) + t²∇f(X)^⊤Ẋ + 4(X + tẊ/2 − x*)^⊤( (3/2)Ẋ + (t/2)Ẍ ).

Given that, according to Eq. (A.24), (3/2)Ẋ + (t/2)Ẍ = −(t/2)∇f(X), and also utilizing the convexity of f, one derives (the terms t²∇f(X)^⊤Ẋ cancel)

V̇ = 2t(f(X) − f*) − 4(X − x*)^⊤(t∇f(X)/2) = 2t(f(X) − f*) − 2t(X − x*)^⊤∇f(X) ≤ 0.

Hence V(X(t)) ≤ V(X(0)) = 2‖x_0 − x*‖_2², and f(X(t)) − f* ≤ 2‖x_0 − x*‖_2²/t², which is the continuous time counterpart of the following statement.
Theorem A.3.5. Fast GD for f(x) → min_{x∈Rn}, where f(x) is a β-smooth convex function, with the update rule

x_k = y_{k−1} − η∇f(y_{k−1}), y_k = x_k + ((k − 1)/(k + 2)) (x_k − x_{k−1}),

converges to the optimum as

f(x_{k+1}) − f* ≤ 2‖x_0 − x*‖_2²/(ηk²).
As always, turning the continuous time sketch of the proof into the actual (discrete time) proof takes some additional technical effort.
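A minimal Python sketch of the update (A.23) (the quadratic test problem is the same assumed example as for plain GD above):

import numpy as np

def fast_gd(grad_f, x0, eta, num_iters):
    # x_k = y_{k-1} - eta * grad f(y_{k-1}); y_k = x_k + (k-1)/(k+2) * (x_k - x_{k-1})
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    for k in range(1, num_iters + 1):
        x = y - eta * grad_f(y)
        y = x + (k - 1.0) / (k + 2.0) * (x - x_prev)
        x_prev = x
    return x_prev

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = fast_gd(lambda z: A @ z - b, np.zeros(2), 1.0 / np.linalg.norm(A, 2), 200)
print(x, np.linalg.solve(A, b))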
Exercise A.18. Show that the FGM method, described by Eq. (A.23), transitions to
Eq. (A.22) at some ηk .
Non-Smooth Problems
Consider the sub-gradient (SG) iteration

x_{k+1} = x_k − η_k g_k, g_k ∈ ∂f(x_k), (A.26)

which is just the original GD with the gradient replaced by a sub-gradient to deal with non-smooth f. Note, however, that it is not proper to call the algorithm (A.26) SG descent, because over the iterations f(x_{k+1}) may become larger than f(x_k). To fix the problem one may keep track of the best point, or substitute the result by an average of the points seen in the iterations so far (a finite horizon portion of the past). For example, one may augment Eq. (A.26) at each k with

f_best^{(k)} = min{ f_best^{(k−1)}, f(x_k) }.
Assume in addition that the sub-gradients are bounded, ‖g_k‖_2 ≤ L_2. This condition follows, for example, from the Lipschitz condition, |f(x) − f(y)| ≤ L‖x − y‖, imposed on f. Let x* be the optimal point of f(x) → min_{x∈Rn}; then
‖x_{k+1} − x*‖_2² = ‖x_k − η_k g_k − x*‖_2² = ‖x_k − x*‖_2² − 2η_k g_k^⊤(x_k − x*) + η_k²‖g_k‖_2²
≤ ‖x_k − x*‖_2² − 2η_k (f(x_k) − f(x*)) + η_k²‖g_k‖_2², (A.27)

where the last inequality is due to the convexity of f, i.e. f(x*) ≥ f(x_k) + g_k^⊤(x* − x_k). Applying the inequality (A.27) recursively,

‖x_{k+1} − x*‖_2² ≤ ‖x_1 − x*‖_2² − 2 Σ_{j≤k} η_j (f(x_j) − f(x*)) + Σ_{j≤k} η_j²‖g_j‖_2²,

one derives

2 Σ_{j≤k} η_j (f_best^{(k)} − f(x*)) ≤ 2 Σ_{j≤k} η_j (f(x_j) − f(x*)) ≤ ‖x_1 − x*‖_2² + Σ_{j≤k} η_j²‖g_j‖_2²,

which becomes

f_best^{(k)} − f(x*) = min_{j≤k} f(x_j) − f* ≤ ( ‖x_1 − x*‖_2² + L_2² Σ_{j≤k} η_j² ) / ( 2 Σ_{j≤k} η_j ).
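A compact sketch of the sub-gradient method (A.26) with the best-point bookkeeping described above; the test function f(x) = ‖x − c‖_1 and the diminishing step size η_j = 1/√j are assumed illustrative choices:

import numpy as np

c = np.array([1.0, -2.0, 0.5])
f = lambda x: np.abs(x - c).sum()         # non-smooth convex objective
x = np.zeros(3)
f_best, x_best = f(x), x.copy()
for j in range(1, 5001):
    x = x - np.sign(x - c) / np.sqrt(j)   # sub-gradient step
    if f(x) < f_best:                     # keep track of the best point seen
        f_best, x_best = f(x), x.copy()
print(x_best, f_best)                     # approaches c, where f = 0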
Proximal Gradient Method. In multiple machine learning (and more generally statistics) applications we deal with a function built as a sum over samples. Inspired by this application, consider the following composite optimization:

f(x) = g(x) + h(x) → min_{x∈Rn}, (A.29)

where g is convex and β-smooth while h is convex but possibly non-smooth. Define the proximal operator/function,

prox_h(x) = arg min_u ( h(u) + (1/2)‖x − u‖_2² ),

which will soon be linked to the composite optimization. Standard examples of the proximal operator/function are:

1. h(x) = I_C(x), that is, h(x) is the indicator of a convex set C. Then the proximal function,

prox_h(x) = arg min_{u∈C} ‖x − u‖_2²,

is the projection of x onto C.
The examples suggest using the proximal operator to smooth out non-smooth functions entering the respective optimizations. Having this use of the proximal operator in mind, we introduce the Proximal Gradient Descent (PGD) algorithm:

x_{k+1} = prox_{η_k h}( x_k − η_k∇g(x_k) ) = arg min_u ( (1/2)‖x_k − η_k∇g(x_k) − u‖_2² + η_k h(u) )
= arg min_u ( g(x_k) + ∇g(x_k)^⊤(u − x_k) + (1/(2η_k))‖u − x_k‖_2² + h(u) ),

where η_k ≤ 1/β, and g is a β-smooth function in the ℓ2 norm.
Note that, as in the case of the GD algorithm, at each step of the PGD we minimize a convex upper bound of the objective function. We find that the PGD algorithm has the same convergence rate (measured in the number of iterations) as the GD algorithm.
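As an illustration of the PGD step, take h(x) = λ‖x‖_1 (an assumed choice, not worked out in the text); its proximal operator is the coordinate-wise soft-thresholding map, yielding the classical iterative soft-thresholding sketch below:

import numpy as np

def soft_threshold(v, tau):
    # prox of tau*||.||_1: shrink each coordinate towards zero by tau
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(grad_g, x0, eta, lam, num_iters):
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = soft_threshold(x - eta * grad_g(x), eta * lam)   # prox_{eta h} of the GD step
    return x

# LASSO-type example: g(x) = 0.5*||Ax - b||^2, h(x) = lam*||x||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
eta = 1.0 / np.linalg.norm(A.T @ A, 2)   # 1/beta for this g
x = proximal_gradient(lambda z: A.T @ (A @ z - b), np.zeros(5), eta, 1.0, 500)
print(x)   # some coordinates are driven exactly to zero by the l1 prox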
Finally, we are ready to connect the PGD algorithm to the composite optimization (A.29).

Theorem A.3.6. The PGD algorithm with a fixed step-size policy (η_k = η ≤ 1/β) converges to the optimal solution f* of the composite optimization (A.29) according to

f(x_{k+1}) − f* ≤ ‖x_0 − x*‖_2²/(2ηk).
The proof of Theorem A.3.6 repeats the logic we used to prove Theorem A.3.2 for the GD algorithm. Moreover, one can also accelerate the PGD, similarly to how we have accelerated GD. The accelerated version of the PGD is

x_k = prox_{η_k h}( y_{k−1} − η_k∇g(y_{k−1}) ), y_k = x_k + ((k − 1)/(k + 2)) (x_k − x_{k−1}).
We naturally arrive at the PGD version of Theorem A.3.5 for the composite problem f(x) = g(x) + h(x) → min_{x∈Rn}.
PGD is one possible approach developed to deal with non-smooth objectives. Another
sound alternative is discussed next.
Consider, e.g., the minimization of a maximum of convex functions, which is one of the most common non-smooth optimizations. Recall that a smooth and convex approximation to the maximum function is provided by the soft-max function (A.2), which can then be minimized by the accelerated GD (with the convergence rate O(1/√ε), in contrast to O(1/ε²) for non-smooth functions). An accurate choice of λ (the parameter within the soft-max) normally allows one to speed up the algorithms to O(1/ε).
Another option is the projected iteration,

x_{k+1} = Π_C( x_k − η_k∇f(x_k) ),

where Π_C is the Euclidean projection onto the convex set C, Π_C(y) = arg min_{x∈C} ‖x − y‖_2². This PGD has the same convergence rate as GD. The proof is similar to the one of the gradient descent, taking into account that projection does not lead to an expansion, i.e.

‖Π_C(x) − Π_C(y)‖_2 ≤ ‖x − y‖_2.
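A minimal sketch of the projected iteration with C taken to be the unit Euclidean ball (an assumed example for which the projection has a simple closed form):

import numpy as np

def projected_gd(grad_f, x0, eta, num_iters):
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        y = x - eta * grad_f(x)
        x = y / max(1.0, np.linalg.norm(y))   # Euclidean projection onto the unit ball
    return x

c = np.array([2.0, 0.0])   # f(x) = ||x - c||^2; the minimizer over the ball is [1, 0]
x = projected_gd(lambda z: 2 * (z - c), np.zeros(2), 0.25, 200)
print(x)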
Exercise A.19. (Alternating Projections.) Consider two convex sets C, D ⊆ Rn and pose
the question of finding x ∈ C ∩ D. One starts from, x0 ∈ C, and applies PGD
In contrast to the PGD algorithm (A.31), which makes a projection at each iteration, the Frank-Wolfe (FW) algorithm solves the following linear problem on C:

y_k ∈ arg min_{y∈C} ∇f(x_k)^⊤y, x_{k+1} = γ_k y_k + (1 − γ_k) x_k. (A.33)

In this case (e.g. when C is the unit simplex) the update y_k of the FW algorithm is a unit vector corresponding to the extremal coordinate of the gradient. The overall time to update x_k is O(n), therefore resulting in a significant acceleration in comparison with the PGD algorithm.

The FW algorithm has an edge over the other algorithms considered so far because it has a reliable stopping criterion. Indeed, convexity of the objective guarantees that

max_{y∈C} ∇f(x_k)^⊤(x_k − y) ≥ ∇f(x_k)^⊤(x_k − x*) ≥ f(x_k) − f(x*).

The value on the left of the inequality, max_{y∈C} ∇f(x_k)^⊤(x_k − y), gives us an easy to compute stopping criterion.
The following statement characterizes convergence of the FW algorithm.
Theorem A.4.1. Given that f(x) in Eq. (A.32) is a convex β-smooth function and C is a bounded, convex, compact set, Eq. (A.33) converges to the optimal solution, f*, of Eq. (A.32) as

f(x_k) − f* ≤ 2βD²/(k + 2),

where D² ≥ max_{x,y∈C} ‖x − y‖_2².
That is f (xk ) − f (x∗ ) ≤ ∇f (xk )> (xk − x∗ ). This inequality, in combination with the
second sub-step in the FW algorithm, xk+1 = γk yk + (1 − γk )xk , results in the following
transformations
The conditional GD (that is, FW) is slower than the FGM method in terms of the number of iterations. However, it is often favorable in practice, especially when minimizing a convex function over sufficiently simple objects (like a norm-ball or a polytope), as it does not require implementing an explicit projection onto the constraining set.
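A sketch of the FW iteration (A.33) with C taken to be the probability simplex (an assumed example): the linear sub-problem is solved by inspecting the gradient coordinates, so no projection is ever needed:

import numpy as np

def frank_wolfe_simplex(grad_f, x0, num_iters):
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        g = grad_f(x)
        y = np.zeros_like(x)
        y[np.argmin(g)] = 1.0      # simplex vertex minimizing g^T y
        gamma = 2.0 / (k + 2.0)    # standard step-size schedule
        x = gamma * y + (1.0 - gamma) * x
    return x

# minimize ||x - c||^2 over the simplex; the solution is the projection of c
c = np.array([0.1, 0.5, 0.7])
x = frank_wolfe_simplex(lambda z: 2 * (z - c), np.ones(3) / 3, 500)
print(x, x.sum())   # iterates stay on the simplex by construction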
Consider the equality-constrained problem

f(x) → min,
s.t.: Ax = b, x ∈ Rn,

its augmented version

f(x) + (ρ/2)‖Ax − b‖_2² → min,
s.t.: Ax = b,

as well as the inequality-constrained problem

f(x) → min,
s.t.: g_i(x) ≤ 0, 1 ≤ i ≤ m.
The convergence analysis of the PDG algorithm repeats all the steps involved in the analysis of the original GD. The Lyapunov function here is V(x, λ) = ‖x − x*‖_2² + ‖λ − λ*‖_2².
Exercise A.20. Analyze convergence of the PDG algorithm for convex optimization with
inequality constraints assuming that all the functions involved (in the objective, f , and in
the constraints, gi ) are convex and β-smooth.
Our previous analysis was mostly focused on the case where the objective function f is smooth in the ℓ2 norm and the distance from the starting point, where we initiate the algorithm, to the optimal point is measured in the ℓ2 norm as well. From the perspective of the GD, the optimization over a unit simplex and the optimization over a unit Euclidean sphere are equivalent computational complexity-wise. On the other hand, the volume of the unit simplex is exponentially smaller than the volume of the unit sphere. The Mirror Descent (MD) algorithm allows one to exploit the geometry of the domain, thus providing a faster algorithm in the case of the simplex. The acceleration is up to a factor of ∼ √d, where d is the dimensionality of the underlying space.
We start with the convex optimization problem

f(x) → min,
s.t.: x ∈ S ⊆ Rn,

and recall the GD update,

x_{k+1} = x_k − η_k∇f(x_k).
From the mathematical perspective, this update sums up objects from different spaces: x belongs to the primal space, while the space where ∇f(x) resides, called the dual (conjugate) space, may be different. To overcome this "inconsistency", Nemirovski and Yudin proposed in 1978 the following algorithm:

x_{k+1} = ∇φ*( ∇φ(x_k) − η_k∇f(x_k) ),

where φ(x) is a strongly convex function defined on Rn with ∇φ(Rn) = Rn, and φ*(y) = sup_{x∈Rn}(y^⊤x − φ(x)) is the Legendre-Fenchel (LF) transform (conjugate function) of φ(x). The function φ is also called the mirror map function.

D_φ(u, v) = φ(u) − φ(v) − ∇φ(v)^⊤(u − v)

is the so-called Bregman divergence, which measures (for a strictly convex function φ) the distance between φ(u) and its linear approximation, φ(v) + ∇φ(v)^⊤(u − v), evaluated at v.
Exercise A.21. Let φ(x) be a strongly convex function on Rn . Using the definition of the
conjugate function prove that ∇φ∗ (∇φ(x)) = x, where φ∗ is a conjugate function to φ.
• Convexity in the first argument. The Bregman divergence D_φ(u, v) is convex in its first argument. (Notice that it is not necessarily convex in the second argument.)

• Linearity with respect to non-negative coefficients. In other words, for any strictly convex φ and ψ, and any c ≥ 0, we observe:

D_{φ + cψ}(x, y) = D_φ(x, y) + c D_ψ(x, y).
• Euclidean norm. Let φ(x) = ‖x‖_2²; then D_φ(x, y) = ‖x‖_2² − ‖y‖_2² − 2y^⊤(x − y) = ‖x − y‖_2².
• Negative entropy. φ(x) = Σ_{i=1}^n x_i ln x_i, φ : R^n_{++} → R. Then

D_φ(x, y) = Σ_{i=1}^n x_i ln(x_i/y_i) − Σ_{i=1}^n x_i + Σ_{i=1}^n y_i = D_{KL}(x‖y),

the Kullback-Leibler divergence.
• Lower and upper bounds. Let φ be µ-strongly convex (respectively, β-smooth) with respect to a norm ‖·‖; then

D_φ(x, y) ≥ (µ/2)‖x − y‖² (respectively, D_φ(x, y) ≤ (β/2)‖x − y‖²).
The following statement represents an important fact which will be used below to analyze
the MD algorithm.
Theorem A.4.2 (Pinsker Inequality). For any x, y such that Σ_{i=1}^n x_i = Σ_{i=1}^n y_i = 1, x ≥ 0, y ≥ 0, one gets the following KL divergence estimate: D_{KL}(x‖y) ≥ (1/2)‖x − y‖_1².
An immediate corollary of the Theorem is that φ(x) = Σ_{i=1}^n x_i ln x_i is 1-strongly convex in the ℓ1 norm (on the unit simplex):

φ(y) ≥ φ(x) + ∇φ(x)^⊤(y − x) + D_{KL}(y‖x) ≥ φ(x) + ∇φ(x)^⊤(y − x) + (1/2)‖x − y‖_1².
The proximal form of the MD algorithm is

x_{k+1} = Π_S^{D_φ}( arg min_{x∈Rn} ( f(x_k) + ∇f(x_k)^⊤(x − x_k) + (1/η_k) D_φ(x, x_k) ) ),

where Π_S^{D_φ}(y) = arg min_{x∈S} D_φ(x, y) is the Bregman projection onto S.
Example A.4.3. Consider the following optimization problem over the unit simplex:

f(x) → min_{x∈Rn},
s.t.: Σ_{i=1}^n x_i = 1, x ≥ 0.
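As an illustration (a sketch under the assumption that φ is the negative entropy): the MD update over the simplex then reduces to the multiplicative "exponentiated gradient" update, which keeps the iterates on the simplex automatically:

import numpy as np

def mirror_descent_simplex(grad_f, x0, eta, num_iters):
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        w = x * np.exp(-eta * grad_f(x))   # grad phi* applied to grad phi(x) - eta grad f(x)
        x = w / w.sum()                    # normalization = Bregman (KL) projection onto S
    return x

# minimize c^T x over the simplex; the minimum sits at the smallest c_i
c = np.array([0.9, 0.2, 0.5])
x = mirror_descent_simplex(lambda z: c, np.ones(3) / 3, 0.5, 300)
print(x)   # mass concentrates on the second coordinate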
Let us sketch the continuous time analysis of the MD algorithm in the case of β-smooth convex functions. In contrast with the GD analysis, it is more appropriate to work in this case with a Lyapunov function in the dual space, V(Z(t)) = φ(x*) + φ*(Z(t)) − Z(t)^⊤x* = D_φ(x*, X(t)), where X(t) = ∇φ*(Z(t)). Then

(d/dt)V(Z(t)) = −∇f(X(t))^⊤(X(t) − x*) ≤ −(f(X(t)) − f*).
Integrating both sides of the inequality one arrives at

V(Z(0)) − V(Z(t)) ≥ ∫_0^t f(X(τ)) dτ − tf* ≥ t ( f( (1/t)∫_0^t X(τ) dτ ) − f* ),
where the last transformation is due to the Jensen inequality. Therefore, similarly to the
case of GD, the convergence rate of the MD algorithm is O(1/k). The resulting MD ODE
is
X(t) = ∇φ∗ (Z(t))
Ż(t) = −∇f (X(t))
X(0) = x0 , Z(0) = z0 with ∇φ∗ (z0 ) = x0 .
The behavior of the MD, when applied to a non-smooth convex function, repeats that of the GD: the convergence rate is O(1/√k) in this case.