
Lecture Notes on the

Principles and Methods of Applied Mathematics

Michael (Misha) Chertkov


(lecturer)
and Colin Clark
(recitation instructor for this and other core classes)

Graduate Program in Applied Mathematics,


University of Arizona, Tucson

August 19, 2020


Contents

Applied Math Core Courses vii

I Applied Analysis 1

1 Complex Analysis 2
1.1 Complex Variables and Complex-valued Functions . . . . . . . . . . . . . . 2
1.1.1 Complex Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Functions of a Complex Variable . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Multi-valued Functions and Branch Cuts . . . . . . . . . . . . . . . 8
1.2 Analytic Functions and Integration along Contours . . . . . . . . . . . . . . 11
1.2.1 Analytic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.2 Integration along Contours . . . . . . . . . . . . . . . . . . . . . . . 14
1.2.3 Cauchy’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.4 Cauchy’s Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.2.5 Laurent Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3 Residue Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3.1 Singularities and Residues . . . . . . . . . . . . . . . . . . . . . . . . 23
1.3.2 Evaluation of Real-valued Integrals by Contour Integration . . . . . 23
1.3.3 Contour Integration with Multi-valued Functions . . . . . . . . . . . 26
1.4 Extreme-, Stationary- and Saddle-Point Methods . . . . . . . . . . . . . . . 30

2 Fourier Analysis 33
2.1 The Fourier Transform and Inverse Fourier Transform . . . . . . . . . . . . 33
2.2 Properties of the 1-D Fourier Transform . . . . . . . . . . . . . . . . . . . . 34
2.3 Dirac’s δ-function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3.1 The δ-function as the limit of a δ-sequence . . . . . . . . . . . . . . 37
2.3.2 Using δ-functions to Prove Properties of Fourier Transforms . . . . . 40


2.3.3 The δ-function in Higher Dimensions . . . . . . . . . . . . . . . . . . 41


2.3.4 The Heaviside Function and the Derivatives of the δ-function . . . . 42
2.4 Closed form representation for select Fourier Transforms . . . . . . . . . . . 43
2.4.1 Elementary examples of closed form representations . . . . . . . . . 43
2.4.2 More complex examples of closed form representations . . . . . . . . 44
2.4.3 Closed form representations in higher dimensions . . . . . . . . . . . 45
2.5 Fourier Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.6 Riemann-Lebesgue Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.7 Gibbs Phenomenon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.8 Laplace Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

II Differential Equations 51

3 Ordinary Differential Equations. 52


3.1 ODEs: Simple cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.1.1 Separable Differential Equations . . . . . . . . . . . . . . . . . . . . 53
3.1.2 Method of Parameter Variation . . . . . . . . . . . . . . . . . . . . . 53
3.1.3 Integrals of Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Phase Space Dynamics for Conservative and Perturbed Systems . . . . . . . 55
3.2.1 Phase Portrait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2.2 Small Perturbation of a Conservative System . . . . . . . . . . . . . 58
3.3 Direct Methods for Solving Linear ODEs . . . . . . . . . . . . . . . . . . . . 61
3.3.1 Homogeneous ODEs with Constant Coefficients . . . . . . . . . . . . 61
3.3.2 Inhomogeneous ODEs . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Linear Dynamics via the Green Function . . . . . . . . . . . . . . . . . . . . 62
3.4.1 Evolution of a linear scalar . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 Evolution of a vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.4.3 Higher Order Linear Dynamics . . . . . . . . . . . . . . . . . . . . . 66
3.4.4 Laplace’s Method for Dynamic Evolution . . . . . . . . . . . . . . . 68
3.5 Linear Static Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.1 One-Dimensional Poisson Equation . . . . . . . . . . . . . . . . . . . 72
3.6 Sturm–Liouville (spectral) theory . . . . . . . . . . . . . . . . . . . . . . . . 72
3.6.1 Hilbert Space and its completeness . . . . . . . . . . . . . . . . . . . 73
3.6.2 Hermitian and non-Hermitian Differential Operators . . . . . . . . . 74
3.6.3 Hermite Polynomials, Expansions . . . . . . . . . . . . . . . . . . . . 76
3.6.4 Schrödinger Equation in 1d . . . . . . . . . . . . . . . . . . . . . . . 78

4 Partial Differential Equations. 80


4.1 First-Order PDE: Method of Characteristics . . . . . . . . . . . . . . . . . . 80
4.2 Classification of linear second-order PDEs: . . . . . . . . . . . . . . . . . . . 84
4.3 Elliptic PDEs: Method of Green Function . . . . . . . . . . . . . . . . . . . 86
4.4 Waves in a Homogeneous Media: Hyperbolic PDE . . . . . . . . . . . . . . 89
4.5 Diffusion Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.6 Boundary Value Problems: Fourier Method . . . . . . . . . . . . . . . . . . 95
4.7 Exemplary Nonlinear PDE: Burgers' Equation . . . . . . . . . . . . . . . 97

III Optimization 99

5 Calculus of Variations 100


5.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1.1 Fastest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.1.2 Minimal Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.1.3 Image Restoration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.1.4 Classical Mechanics . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2 Euler-Lagrange Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Phase-Space Intuition and Relation to Optimization . . . . . . . . . . . . . 105
5.4 Towards Numerical Solutions of the Euler-Lagrange Equations . . . . . . . 106
5.4.1 Smoothing Lagrangian . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.4.2 Gradient Descent and Acceleration . . . . . . . . . . . . . . . . . . . 107
5.5 Variational Principle of Classical Mechanics . . . . . . . . . . . . . . . . . . 108
5.5.1 Noether's Theorem & time-invariance of space-time derivatives of action 109
5.5.2 Hamiltonian and Hamilton Equations: the case of Classical Mechanics 112
5.5.3 Hamilton-Jacobi equation . . . . . . . . . . . . . . . . . . . . . . . . 113
5.6 Legendre-Fenchel Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.6.1 Geometric Interpretation: Supporting Lines, Duality and Convexity 117
5.6.2 Primal-Dual Algorithm and Dual Optimization . . . . . . . . . . . . 121
5.6.3 More on Geometric Interpretation of the LF transform . . . . . . . . 123
5.6.4 Hamiltonian-to-Lagrangian Duality in Classical Mechanics . . . . . . 124
5.6.5 LF Transformation and Laplace Method . . . . . . . . . . . . . . . . 125
5.7 Second Variation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.8 Methods of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . 127
5.8.1 Functional Constraint(s) . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.8.2 Function Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6 Convex and Non-Convex Optimization 130


6.1 Convex Functions, Convex Sets and Convex Optimization Problems . . . . 131
6.2 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.3 Unconstrained First-Order Convex Minimization . . . . . . . . . . . . . . . 147
6.4 Constrained First-Order Convex Minimization . . . . . . . . . . . . . . . . 157

7 Optimal Control and Dynamic Programming 165


7.1 Linear Quadratic (LQ) Control via Calculus of Variations . . . . . . . . . . 166
7.2 From Variational Calculus to Bellman-Hamilton-Jacobi Equation . . . . . . 170
7.3 Pontryagin Minimal Principle . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.4 Dynamic Programming in Optimal Control . . . . . . . . . . . . . . . . . . 174
7.4.1 Discrete Time Optimal Control . . . . . . . . . . . . . . . . . . . . . 174
7.4.2 Continuous Time & Space Optimal Control . . . . . . . . . . . . . . 175
7.5 Dynamic Programming in Discrete Mathematics . . . . . . . . . . . . . . . 177
7.5.1 LaTeX Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.5.2 Shortest Path over Grid . . . . . . . . . . . . . . . . . . . . . . . . . 180
7.5.3 DP for Graphical Model Optimization . . . . . . . . . . . . . . . . . 180

IV Mathematics of Uncertainty 185

8 Basic Concepts from Statistics 186


8.1 Random Variables: Characterization & Description. . . . . . . . . . . . . . 186
8.1.1 Probability of an event . . . . . . . . . . . . . . . . . . . . . . . . . . 186
8.1.2 Sampling. Histograms. . . . . . . . . . . . . . . . . . . . . . . . . . . 188
8.1.3 Moments. Generating Function. . . . . . . . . . . . . . . . . . . . . 188
8.1.4 Probabilistic Inequalities. . . . . . . . . . . . . . . . . . . . . . . . . 193
8.2 Random Variables: from one to many. . . . . . . . . . . . . . . . . . . . . . 193
8.2.1 Law of Large Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.2.2 Multivariate Distribution. Marginalization. Conditional Probability. 197
8.2.3 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.3 Information-Theoretic View on Randomness . . . . . . . . . . . . . . . . . . 200
8.3.1 Entropy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.3.2 Independence, Dependence, and Mutual Information. . . . . . . . . . 202
8.3.3 Probabilistic Inequalities for Entropy and Mutual Information . . . 204

9 Stochastic Processes 209


9.1 Markov Chains [discrete space, discrete time] . . . . . . . . . . . . . . . . . 209
9.1.1 Transition Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.1.2 Properties of Markov Chains . . . . . . . . . . . . . . . . . . . . . . 211
9.1.3 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9.1.4 Steady State Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.1.5 Spectrum of the Transition Matrix & Speed of Convergence to the
Stationary Distribution . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.1.6 Reversible & Irreversible Markov Chains. . . . . . . . . . . . . . . . 216
9.1.7 Detailed Balance vs Global Balance. Adding cycles to accelerate mixing. 217
9.2 Bernoulli and Poisson Processes [discrete space, discrete & continuous time] 218
9.2.1 Bernoulli Process: Definition . . . . . . . . . . . . . . . . . . . . . . 219
9.2.2 Bernoulli: Number of Successes . . . . . . . . . . . . . . . . . . . . . 219
9.2.3 Bernoulli: Distribution of Arrivals . . . . . . . . . . . . . . . . . . . 219
9.2.4 Poisson Process: Definition . . . . . . . . . . . . . . . . . . . . . . . 220
9.2.5 Poisson: Arrival Time . . . . . . . . . . . . . . . . . . . . . . . . . . 221
9.2.6 Merging and Splitting Processes . . . . . . . . . . . . . . . . . . . . 222
9.3 Space-time Continuous Stochastic Processes . . . . . . . . . . . . . . . . . . 224
9.3.1 Langevin equation in continuous time and discrete time . . . . . . . 224
9.3.2 From the Langevin Equation to the Path Integral . . . . . . . . . . . 225
9.3.3 From the Path Integral to the Fokker-Planck (through sequential Gaussian integrations) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.3.4 Analysis of the Fokker-Planck Equation: General Features and Ex-
amples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.3.5 MDP: Grid World Example . . . . . . . . . . . . . . . . . . . . . . . 230
9.3.6 Recitation. Dynamic Programming. . . . . . . . . . . . . . . . . . . 233

10 Elements of Inference and Learning 234


10.1 Exact and Approximate Inference and Learning . . . . . . . . . . . . . . . . 234
10.1.1 Monte-Carlo Algorithms: General Concepts and Direct Sampling . . 234
10.1.2 Markov-Chain Monte-Carlo . . . . . . . . . . . . . . . . . . . . . . . 240
10.2 Graphical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.3.1 Single Neuron and Supervised Learning . . . . . . . . . . . . . . . . 264
10.3.2 Hopfield Networks and Boltzmann Machines . . . . . . . . . . . . . 265

Projects
If you are interested in making a project presentation, please pick one of the subjects below.
Please communicate your choice of subject and discuss the content with the instructor as
soon as possible: first come, first served. You may also suggest your own project on material
which is relevant to the course but not covered in class. You will need to prepare a
Jupyter notebook presentation (in IPython or IJulia) for 10+5 minutes. We will have two
presentation sessions, scheduled for Oct 22 and Dec 1 respectively, during the regular class
time. Oct 11 and November 22 are the last days to claim a project for the first and second
sessions respectively.
List of suggested projects for the first session (Complex Analysis & Fourier Analysis):

1.1 Numerical Conformal mapping.

1.2 Complex numbers & analysis: AC electric circuit applications.

1.3 Laplace transform in systems engineering: linear-time-invariant and linear-time-varying systems.

1.4 Mellin transform and its applications.

1.5 Wavelets.

List of suggested projects for the second session (ODEs & PDEs):

2.1 Linear Stability/Instability in Fluid Mechanics: Kelvin-Helmholtz.

2.2 Susceptible-Infected-Susceptible (SIS) and Susceptible-Infected-Removed (SIR) models of epidemiology.

2.3 Sturm-Liouville Problem: Fokker-Planck equation (of statistical mechanics).

2.4 Wave equations and Eikonal (WKB) approximation of Classical Optics.

2.5 Nonlinear Schrödinger equation: solitons and integrability.


Applied Math Core Courses

Every student in the Program in Applied Mathematics at the University of Arizona takes
the same three core courses during their first year of study. These three courses are called
Methods (Math 583), Theory (Math 527), and Algorithms (Math 575). Each course presents
a different expertise, or 'toolbox' of competencies, for approaching problems in modern
applied mathematics. The courses are designed to discuss many of the same topics, often
synchronously (Fig. 1). This allows them to better illustrate the potential contributions
of each toolbox, and also to provide a richer understanding of applied mathematics.
The material discussed in the courses includes topics that are taught in traditional applied
mathematics curricula (like differential equations) as well as topics that promote a modern
perspective of applied mathematics (like optimization, control, and elements of computer
science and statistics). All the material is carefully chosen to reflect what we believe is most
relevant now and in the future.
The essence of the core courses is to develop the different toolboxes available in applied
mathematics. When we’re lucky, we can find exact solutions to a problem by applying
powerful (but typically very specialized) techniques, or methods. More often, we must
formulate solutions algorithmically, and find approximate solutions using numerical simulations and computation.
[Figure 1: Topics covered in Theory (blue), Methods (red) and Algorithms (green) during the Fall semester (columns 1 & 2) and Spring semester (columns 3 & 4).
Theory: Metric, Normed & Topological Spaces; Measure Theory & Integration; Convex Optimization; Probability & Statistics.
Methods: Complex Analysis, Fourier Analysis; Differential Equations; Calculus of Variations & Control; Probability & Stochastic Processes.
Algorithms: Numerical Linear Algebra; Numerical Differential Equations; Numerical Optimization; Monte Carlo, Inference & Learning.]


Understanding the theoretical aspects of a problem motivates better design and implementation of these methods and algorithms, and allows us to make precise statements about when and how they will work.
The core courses discuss a wide array of mathematical content that represents some of
the most interesting and important topics in applied mathematics. The broad exposure
to different mathematical material often helps students identify specific areas for further
in-depth study within the program. The core courses do not (and cannot) satisfy the in-
depth requirements for a dissertation, and students must take more specialized courses and
conduct independent study in their areas of interest.
Furthermore, the courses do not (and cannot) cover all subjects comprising applied
mathematics. Instead, they provide a (somewhat!) minimal, self-consistent, and admittedly
subjective (due to our own expertise and biases) selection of the material that we believe
students will use most during and after their graduate work. In this introductory chapter
of the lecture notes, we aim to present our viewpoint on what constitutes modern applied
mathematics, and to do so in a way that unifies seemingly unrelated material.

What is Applied Mathematics?


We study and develop mathematics as it applies to model, optimize and control various
physical, biological, engineering and social systems. Applied mathematics is a combination
of (1) mathematical science, (2) knowledge and understanding from a particular domain
of interest, and often (3) insight from a few ‘math-adjacent’ disciplines (Fig. 2). In our
program, the core courses focus on the mathematical foundations of applied math. The
more specialized mathematics and the domain-specific knowledge are developed in other
coursework, independent research and internship opportunities.
Applying mathematics to real-world problems requires mathematical approaches that
have evolved to stand up to the many demands and complications of real-world problems.
In some applications, a relatively simple set of governing mathematical expressions is able
to describe the relevant phenomena. In these situations, problems often require very accurate
solutions, and the mathematical challenge is to develop methods that are efficient (and
sometimes also adaptable to variable data) without losing accuracy. In other applications,
there is no set of governing mathematical expressions (either because we do not know them,
or because they may not exist). Here, the challenge is to develop better mathematical
descriptions of the phenomena by processing, interpreting and synthesizing imperfect observations.
In terms of the general methodology maintained throughout the core courses, we
devote a considerable amount of time to:

[Figure 2: The key components studied under the umbrella of applied mathematics: (1) mathematical science (e.g. differential equations, real & functional analysis, optimization, probability, numerical analysis); (2) domain knowledge (e.g. physical sciences, biological sciences, social sciences, engineering); and (3) a few 'math-adjacent' disciplines (physics, statistics, computer science, data science).]

1. Formulating the problem, first casually, i.e. in terms standard in sciences and engi-
neering, and then transitioning to a proper mathematical formulation;

2. Analyzing the problem by “all means available”, including theory, method and algo-
rithm toolboxes developed within applied mathematics;

3. Identifying what kinds of solutions are needed, and implementing an appropriate method to find such a solution.

Making contributions to a specific domain that are truly valuable requires more than
just mathematical expertise. Domain-specific understanding may change our perspective on
what constitutes a solution. For example, whenever system parameters are no longer 'nice'
but must be estimated from measurement or experimental data, it becomes more difficult
to find meaning in the solutions, and it becomes more important, and challenging, to
estimate the uncertainty in solutions. Similarly, whenever a system couples many sub-systems
at scale, it may no longer be possible to interpret the exact expressions (if they
can be computed at all), and approximate, or 'effective', solutions may be more meaningful.
In every domain-specific application, it is important to know what problems are most urgent,
and what kinds of solutions are most valuable.

Mathematics is not the only field capable of making valuable contributions to other
domains, and we think specifically of physics, statistics and computer science as other fields
that have each developed their own frameworks, philosophies, and intuitions for describing
problems and their solutions. This is particularly evident with the recent developments in
data science. The recent deluge of data has brought a wealth of opportunity in engineering,
and in the physical, natural and social sciences where there have been many open problems
that could only be addressed empirically. Physics, statistics, and computer science have
become fundamental pillars of data science, in part, because each of these 'math-adjacent'
disciplines provides a way to analyze and interpret this data constructively. Nonetheless,
there are many unresolved challenges ahead, and we believe that a mixture of mathematical
insight and some intuition from these adjacent disciplines may help resolve these challenges.

Problem Formulation

We will rely on a diverse array of instructional examples from different areas of science and
engineering to illustrate how to translate a rather vaguely stated scientific or engineering
phenomenon into a crisply stated mathematical challenge. Some of these challenges will be
resolved, and some will stay open for further research. We will be referring to instructional
examples, such as the Kirchhoff and the Kuramoto-Sivashinsky equations for power systems,
the Navier-Stokes equations for fluid dynamics, network flow equations, the Fokker-Planck
equation from statistical mechanics, and constrained regression from data science.

Problem Analysis

We analyze problems extracted from applications by all means possible, which requires
both domain-specific intuition and mathematical knowledge. We can often make precise
statements about the solutions of a problem without actually solving the problem in the
mathematical sense. Dimensional analysis from physics is an example of this type of pre-
liminary analysis that is helpful and useful. We may also identify certain properties of the
solutions by analyzing any underlying symmetries and establishing the correct principal
behaviors expected from the solutions; some important examples involve oscillatory behavior
(waves), diffusive behavior, and dissipative/decaying vs. conservative behaviors. One
can also extract a lot from analyzing the different asymptotic regimes of a problem, say
when a parameter becomes small, making the problem easier to analyze. Matching different
asymptotic solutions can give a detailed, even though ultimately incomplete, description.

Solution Construction

As previously mentioned, one component of applied mathematics is a collection of specialized
techniques for finding analytic solutions. These techniques are not always feasible,
and developing computational intuition should help us to identify proper methods of
numerical (or mixed analytic-numerical) analysis, i.e. a specific toolbox that helps to unravel
the problem.
Part I

Applied Analysis

Chapter 1

Complex Analysis

Complex analysis is the branch of mathematics that investigates functions of complex variables.
A fundamental premise of complex analysis is that most binary operations have
natural extensions from real numbers to complex numbers. Furthermore, real-valued functions
have natural extensions to complex-valued functions. Natural extensions of even the
most elementary functions can lead to new and interesting behavior.
Complex-valued functions exhibit a richness that often admits new techniques for problem
solving. Complex analysis provides useful tools for many other areas of mathematics
(both pure and applied), as well as for physics (including hydrodynamics, thermodynamics,
and particularly quantum mechanics) and engineering fields (such as aerospace, mechanical
and electrical engineering).

1.1 Complex Variables and Complex-valued Functions


1.1.1 Complex Variables

The real number system is somewhat "deficient" in the sense that not all operations are
allowed for all real numbers. For example, taking arbitrary roots of negative numbers is
not allowed in the real number system. This deficiency can be remedied by defining the
imaginary unit, i := √−1. An imaginary number is any number that is a real multiple
of the imaginary unit, for example 3i, i/2 or −πi. A complex number is any number that
has both a real and an imaginary component, and can therefore be represented by two real
numbers, x and y, which we often write as z = x + iy.
The addition and subtraction of complex numbers are direct generalizations of their
real-valued counterparts.

Example 1.1.1. Let z1 = 4 + 3i and z2 = −2 + 5i. Compute (a) z1 + z2 and (b) z1 − z2 .


Solution.

(a) z1 + z2 = (4 + 3i) + (−2 + 5i) = (4 + (−2)) + (3 + 5)i = 2 + 8i

(b) z1 − z2 = (4 + 3i) − (−2 + 5i) = (4 − (−2)) + (3 − 5)i = 6 − 2i

Because the behavior of addition and subtraction is reminiscent of translating vectors
in R², we often visualize complex numbers as points on a cartesian plane by associating
the real and imaginary components of the complex number with the x- and y-coordinates
respectively.

Definition 1.1.1. The complex conjugate of a complex number z, denoted by z ∗ or z̄, is


the complex number with an equal real part and an imaginary part equal in magnitude but
opposite in sign. That is, if z = x + iy then z ∗ := x − iy.

The multiplication and division of complex numbers are also direct generalizations of
their real-valued counterparts with the application of the definition i2 = −1.

Example 1.1.2. Let z1 = 4 − 3i and z2 = −2 + 5i. Compute (a) z1 z2 , (b) 1/z1 , (c) 1/z2 ,
and (d) z1 /z2 .
Solution.

(a) z1 z2 = (4 − 3i)(−2 + 5i) = −8 + 20i + 6i − 15i² = 7 + 26i

(b) Note: z1* = 4 + 3i, and z1 z1* = (4 − 3i)(4 + 3i) = 16 + 12i − 12i − 9i² = 25.
Therefore, 1/z1 = (1/z1)(z1*/z1*) = z1*/(z1 z1*) = (4 + 3i)/25 = 4/25 + (3/25)i

(c) Note: z2* = −2 − 5i, and z2 z2* = (−2 + 5i)(−2 − 5i) = 4 + 10i − 10i − 25i² = 29.
Therefore, 1/z2 = (1/z2)(z2*/z2*) = z2*/(z2 z2*) = (−2 − 5i)/29 = −2/29 − (5/29)i

(d) z1/z2 = (z1/z2)(z2*/z2*) = (z1 z2*)/(z2 z2*) = (−23 − 14i)/29 = −23/29 − (14/29)i
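These manipulations are easy to check numerically. Below is a minimal sketch using Python's built-in complex type (the language of the course's suggested notebook presentations); the printed values match examples 1.1.1 and 1.1.2.

    # Check the arithmetic of examples 1.1.1 and 1.1.2 with Python's
    # built-in complex type; 1j denotes the imaginary unit.
    z1, z2 = 4 + 3j, -2 + 5j            # example 1.1.1
    print(z1 + z2)                      # (2+8j)
    print(z1 - z2)                      # (6-2j)

    z1, z2 = 4 - 3j, -2 + 5j            # example 1.1.2
    print(z1 * z2)                      # (7+26j)
    print(1 / z1)                       # (0.16+0.12j)      = 4/25 + (3/25)i
    print(1 / z2)                       # ~(-0.069-0.172j)  = -2/29 - (5/29)i
    print(z1 / z2)                      # ~(-0.793-0.483j)  = -23/29 - (14/29)i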

In addition to their cartesian representation, complex numbers can also be represented
by their polar representation with components r and θ. Here r is called the modulus of z
and satisfies r² = |z|² := zz* = x² + y² ≥ 0, and θ is called the argument of z or sometimes
the polar angle. Note that θ = arg(z) is defined only for |z| > 0, and only modulo addition
of 2π.

x + iy ⇔ r cos θ + i r sin θ, where r = √(x² + y²), θ = tan⁻¹(y/x)

The application of trigonometric identities shows that the product of two complex num-
bers is the complex number whose modulus is the product of the moduli of its factors, and
whose argument is the sum of the arguments of its factors. That is, if z1 = r1 cos θ1 +
i r1 sin θ1 and z2 = r2 cos θ2 + i r2 sin θ2, then z1 z2 = r1 r2 cos(θ1 + θ2) + i r1 r2 sin(θ1 + θ2).
This summation of arguments whenever two complex numbers are multiplied together is reminiscent
of multiplying exponential functions. The polar representation is simplified by defining the
complex-valued exponential function.

Definition 1.1.2. The exponential function is defined for imaginary arguments by

r e^{iθ} := r cos(θ) + i r sin(θ) = x + iy. (1.1)

Euler's famous formula, e^{iπ} = −1, follows directly from this definition.

Example 1.1.3. Compute the polar representations of (a) z1 = 4−3i and (b) z2 = −2+5i.
Solution.

(a) r1 = √(z1 z1*) = 5, θ1 = tan⁻¹(−3/4) ≈ −0.64 ⇒ z1 = 5e^{−0.64i}

(b) r2 = √(z2 z2*) = √29, θ2 = tan⁻¹(5/(−2)) + π ≈ 1.95 ⇒ z2 = √29 e^{1.95i}
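The standard-library cmath module can serve as a quick check of polar representations; a small sketch whose output matches example 1.1.3:

    import cmath

    # cmath.polar returns (r, theta) with theta = arg(z) in (-pi, pi].
    for z in (4 - 3j, -2 + 5j):
        r, theta = cmath.polar(z)
        print(z, '->', r, theta)     # 5.0, -0.6435...; then 5.3851..., 1.9513...
        print(cmath.rect(r, theta))  # reconstructs x + iy from (r, theta)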

Sometimes it is convenient to express a complex number using a mixture of cartesian


and polar representations.

Example 1.1.4. Find r̃ and θ̃ such that the point ω = 1 + 5i can be written as ω = −1 + r̃e^{iθ̃}.
Solution. Given that 1 + 5i = −1 + r̃e^{iθ̃}, solve for r̃e^{iθ̃} to get 2 + 5i = r̃e^{iθ̃}. Solve for r̃
and θ̃ to get r̃ = √((2 + 5i)(2 − 5i)) = √29 ≈ 5.39 and θ̃ = tan⁻¹(5/2) ≈ 1.19 rad. Therefore,
ω ≈ −1 + 5.39e^{1.19i}.

Example 1.1.5. Express z := (2 + 2i)e^{−iπ/6} by its (a) cartesian and (b) polar representations.
Solution.

(a) z = (2 + 2i)(cos(−π/6) + i sin(−π/6)) = (2 cos(−π/6) − 2 sin(−π/6)) + i(2 cos(−π/6) + 2 sin(−π/6)) = (1 + √3) + i(√3 − 1)

(b) (2 + 2i)e^{−iπ/6} = 2√2 e^{iπ/4} e^{−iπ/6} = 2√2 e^{iπ/12}

Definition 1.1.3. A curve in the complex plane is a set of points z(t) where a ≤ t ≤ b
for some a ≤ b. We say that the curve is closed if z(a) = z(b), and simple if it does not
self-intersect, that is, the curve is simple if z(t) ≠ z(t′) for t ≠ t′. A curve is called a contour
if it is continuous and piecewise smooth. By convention, all simple, closed contours are
parameterized to be traversed counter-clockwise unless stated otherwise.

Example 1.1.6. Parameterize the following curves:



(a) The infinite horizontal line passing through 0 + iπ.



(b) The semi-infinite ray extending from the point z = −1 and passing through √3 i.

(c) The circular arc of radius ε centered at 0.

Solution.

(a) x + πi for −∞ < x < ∞

(b) −1 + ρe^{iπ/3} for 0 < ρ < ∞

(c) εe^{iθ} for 0 ≤ θ ≤ 2π

The Complex Number System

Complex numbers can be considered as the completion of the number system: a class of
numbers that is closed under all possible algebraic operations. What this means is that any
algebraic operation between two complex numbers is guaranteed to return another complex
number. This is not generally true for other classes of numbers; for example,

i. The addition of two positive integers is guaranteed to be another positive integer, but
the subtraction of two positive integers is not necessarily a positive integer. Therefore,
we say that the positive integers are closed under addition but are not closed under
subtraction.

ii. The class of all integers is closed under subtraction and also multiplication. However
the integers are not closed under division because the quotient of two integers is not
necessarily another integer.

iii. The rational numbers are closed under division. However the process of taking limits
of rational numbers may lead to numbers that are not rational, so real numbers are
needed if we require a system that is closed under limits.

iv. Taking non-integer powers of negative numbers does not yield a real number. The
class of complex numbers must be introduced to have a system that is closed under
this operation.

Moreover, one finds that the class of complex numbers is also closed under the operations of
finding roots of algebraic equations, taking logarithms, and others. We conclude with the
happy statement that the class of complex numbers is closed under all of these operations.

1.1.2 Functions of a Complex Variable

A function of a complex variable, w = f (z), maps the complex number z to the complex
number w. That is, f maps a point in the z-complex plane to a point (or points) in the
w-complex plane. Since both z and w have a cartesian representation, this means that
every function of a complex variable can be expressed as two real-valued functions of two
real variables, f (z) =: u(x, y) + iv(x, y).

Example 1.1.7. Let f (z) = exp(iz) where z = x + iy. Express f as the sum u + iv where
u and v are real-valued functions of x and y.
Solution.

f (z) = exp(i(x + iy)) = exp(ix − y) = exp(−y) exp(ix)


= exp(−y) cos(x) + i exp(−y) sin(x)

In equation (1.1) we motivated the definition of the exponential function f(z) = e^z
with the intention to preserve the property that e^{z1+z2} = e^{z1} e^{z2}, and incidentally that e^1 =
2.718 . . . . This is not the only property we could have chosen to motivate the definition of e^z.
We could have chosen to preserve any of the following properties:

• the function represented by the Taylor series Σ_{n≥0} z^n/n!,

• the limiting expression lim_{n→∞} (1 + z/n)^n,

• the solution to the ODE z′(t) = z(t) subject to z(0) = 1.

We encourage the reader to verify that all these properties are preserved for the complex
exponential, and that any one of them could have motivated our definition and yielded the
same results.
An immediate consequence is that the natural definitions of the
complex-valued trigonometric functions are

cos(z) := (e^{iz} + e^{−iz})/2 and sin(z) := (e^{iz} − e^{−iz})/(2i). (1.2)
Exercise 1.1.8. Find all values of z ∈ C satisfying the equation sin(z) = 3.
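A quick numeric sanity check for exercise 1.1.8 (not a substitute for finding all solutions by hand): from definition (1.2), sin(π/2 + iy) = cosh(y), so z = π/2 + i arccosh(3) should be one solution. A minimal sketch in Python:

    import cmath, math

    z0 = math.pi / 2 + 1j * math.acosh(3)  # candidate solution of sin(z) = 3
    print(z0)                              # ~ (1.5708+1.7627j)
    print(cmath.sin(z0))                   # ~ (3+0j), as required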

Exercise 1.1.9. Investigate the asymptotic behavior of the complex-valued functions (a)
f (z) = exp(z), (b) f (z) = sin(z), (c) f (z) = cos(z).

Example 1.1.10. Evaluate the functions (i) f (z) = z 2 and (ii) g(z) = exp(z + 1) along the
parameterized curves described in example 1.1.6.

Solution.

(a) For the infinite horizontal line passing through 0 + iπ:

(i) f(x + iπ) = (x + iπ)² = x² − π² + 2πix for −∞ < x < ∞.

(ii) g(x + iπ) = exp(x + iπ + 1) = −e^{x+1} for −∞ < x < ∞.

(b) For the semi-infinite ray extending from the point z = −1 and passing through √3 i:

(i) f(−1 + ρe^{iπ/3}) = (−1 + ρe^{iπ/3})² = 1 − 2ρe^{iπ/3} + ρ²e^{i2π/3} for 0 < ρ < ∞.

(ii) g(−1 + ρe^{iπ/3}) = exp(−1 + ρe^{iπ/3} + 1) = exp(ρ cos(π/3) + iρ sin(π/3)) = e^{ρ/2}(cos(ρ√3/2) + i sin(ρ√3/2)) for 0 < ρ < ∞.

(c) For the circular arc of radius ε centered at 0:

(i) f(εe^{iθ}) = (εe^{iθ})² = ε²e^{2iθ} for 0 ≤ θ ≤ 2π.

(ii) g(εe^{iθ}) = exp(εe^{iθ} + 1) = . . .

Complex conjugates

Theorem 1.1.4. For algebraic operations including addition, multiplication, division and
exponentiation, consider a sequence of algebraic operations over the n complex numbers
z1 , . . . , zn with the result w. If the same actions are applied in the same order to z1∗ , . . . , zn∗ ,
then the result will be w∗ .

Example 1.1.11. Let us illustrate theorem 1.1.4 on the example of a quadratic equation,
az² + bz + c = 0, where the coefficients a, b and c are real. Direct application of theorem
1.1.4 to this example results in the fact that if the equation has a root, then its complex
conjugate is also a root, which is obviously consistent with the quadratic root formula,
z_{1,2} = (−b ± √(b² − 4ac))/(2a).

Exercise 1.1.12. Use theorem 1.1.4 to show that the roots of a polynomial with real-valued
coefficients of arbitrary order occur in complex conjugate pairs.

Exercise 1.1.13. Find all the roots of the polynomial z⁴ − 6z³ + 11z² − 2z − 10, given
that one of its roots is 2 − i.

Exercise 1.1.14. Let z1 = x1 + iy1 and z2 = x2 + iy2 . Show that if ω = z1 /z2 , then
ω ∗ = z1∗ /z2∗ .

1.1.3 Multi-valued Functions and Branch Cuts

Not every complex function is single-valued. We often deal with functions that are multi-valued,
meaning that for some z, there exist two or more values w_i such that f(z) = w_i. Recall
how we demonstrated parameterizing curves in the complex plane in example 1.1.6
and evaluating a function along a parameterized curve in example 1.1.10. Consider
example 1.1.10(c)(i), where we evaluated the function f(z) = z² along the circle of radius ε
centered at the origin. Notice in particular that the function returns to its original value,
that is, f(εe^{0i}) = f(εe^{2πi}) = ε². It may seem surprising, but there are functions where
this is not the case.

Example 1.1.15. Consider the example of ω = √z. When z is represented in polar
coordinates, z = r exp(iθ), we know that θ is defined up to a shift by 2πn, for any integer n.
For √z, this translates into √r exp(iθ/2 + iπn), where even and odd n will therefore result in
(two) different values of √z, called two branches: ω1 = √r exp(iθ/2) and ω2 = √r exp(iθ/2 + iπ).
If we choose one branch, say ω1, and walk in the complex plane around z = 0 in a positive
(counter-clockwise, so that z = 0 always stays on the left) direction, changing θ from its
original value, say θ = 0, to π/2, π, 3π/2 and eventually 2π, then ω1 will transition to ω2.
Making one more positive 2π swing will return us to ω1. In other words, the two branches
transition to each other after one makes a 2π turn. The point z = 0 is called a branch point
of the second order of the two-valued √z function.

Example 1.1.16. The generalization of example 1.1.15 to ω = z^{1/n} is straightforward.
This function has n branches and thus z = 0 is an nth order branch point.

Example 1.1.17. Another important example is ω = log(z). We can represent z by
its polar representation, z = re^{i(θ+2πn)}, to show that log is a multi-valued function with
infinitely many (but countably many) values, ω_n = log(r) + i(θ + 2nπ), n = 0, ±1, . . . .
In this case, z = 0 is an infinite order branch point.

To separate the branches one introduces cuts – lines which are forbidden to cross. After
the introduction of appropriate branch cuts, each branch of a multi-valued, analytic function
defines a single-valued function that is analytic everywhere except at the branch cut, where
it is discontinuous. The choice of branch cuts need not be unique.
Remark. One branch is arbitrarily selected as the principal branch. Most software packages
employ a set of rules for selecting the principal branch of a multi-valued function.
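For instance, Python's cmath module places the branch cuts of √z and log z along the negative real axis and returns the principal branch with arg(z) ∈ (−π, π]; a small sketch illustrating the jump across the cut:

    import cmath

    eps = 1e-12
    # Approach the cut at z = -1 on the negative real axis from above and below:
    print(cmath.sqrt(complex(-1,  eps)))  # ~ +1j  (theta -> +pi, principal branch)
    print(cmath.sqrt(complex(-1, -eps)))  # ~ -1j  (the value jumps across the cut)
    print(cmath.log(complex(-1,  eps)))   # ~ +pi*1j
    print(cmath.log(complex(-1, -eps)))   # ~ -pi*1j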

Definition 1.1.5. A multi-valued function w(z) has a branch point at z0 ∈ C if w(z)
varies continuously along a sufficiently small circuit surrounding z0, but does not
return to its starting value after one full circuit.

Definition 1.1.6. A branch of a multi-valued function w(z) is a single-valued function that
is obtained by restricting the image of w(z).

Definition 1.1.7. A branch cut is a curve in the complex plane along which a branch is
discontinuous.

Example 1.1.18. Find the branch points of log(z − 1), and sketch a set of possible branch
cuts.

Solution. Parameterize the function as follows: log(z − 1) = log ρ + iφ, where z − 1 = ρ exp(iφ)
with ρ > 0 (positive real) and φ real. Since φ changes by multiples of 2π as we travel
on a closed path around z = 1, the point z = 1 is a branch point of log(z − 1). Similarly,
we observe that z = ∞ is also a branch point (the branch point at infinity), and there are no
others. Therefore a valid branch cut for the function should connect the two branch points,
as illustrated in Fig. (1.1).

Example 1.1.19. Next consider log(z² − 1) = log(z − 1) + log(z + 1). As we travel around
z = 1, log(z − 1), and therefore also log(z² − 1), changes by 2πi. Therefore z = 1 is a branch point
of log(z² − 1). Similarly, z = −1 and z = ∞ are two other branch points of log(z² − 1).
Fig. (1.2) shows two examples of branch cuts for log(z² − 1).

Two important general remarks are in order.

1. The function log(f(z)) has branch points at the zeros of f(z) and at the points where
f(z) is infinite, as well as (possibly) at the points where f(z) itself has branch points.
But be careful with this latter possibility: the zeros have to be zeros in the sense of
analytic functions, and by infinities we mean poles. Other types of (singular) behaviors
in f(z) can lead to unexpected results; e.g., check what happens at z = 0 when
f(z) = exp(1/z).

2. The fact that a function g(z) or its derivatives may or may not have a (finite) value
at some point z = z0 is irrelevant to deciding whether or not z0 is
a branch point of g(z).

Exercise 1.1.20. Identify the branch points, introduce suitable branch cuts, and describe
the resulting branches for the functions (a) f(z) = √((z − a)(z − b)), and (b) g(z) = log((z − 1)/(z − 2)).

The graphs of complex multi-valued functions are in general two-dimensional manifolds
in the space R⁴. These manifolds are called Riemann surfaces. Riemann surfaces are
visualized in three-dimensional space with parallel projection, and the image of the surface
in three-dimensional space is rendered on the screen. (See http://matta.hut.fi/matta/
mma/SKK_MmaJournal.pdf for details and visualization with Mathematica.)

[Figure 1.1: Polar parametrization of log(z − 1) and three examples of branch cuts for the function, connecting its two branch points at z = 1 and z = ∞.]

[Figure 1.2: Two examples of branch cuts for log(z² − 1).]
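As a rough alternative to the Mathematica visualizations referenced above, here is a matplotlib sketch (assuming numpy and matplotlib are available) that renders Im √z over two full turns of θ, so that both branches ω1 and ω2 of example 1.1.15 appear as one continuous surface:

    import numpy as np
    import matplotlib.pyplot as plt

    # Follow sqrt(z) continuously as theta runs over two full turns (0..4*pi),
    # covering both branches omega_1 and omega_2 of example 1.1.15.
    r, theta = np.meshgrid(np.linspace(0.05, 1, 40),
                           np.linspace(0, 4 * np.pi, 160))
    z = r * np.exp(1j * theta)
    w = np.sqrt(r) * np.exp(1j * theta / 2)

    ax = plt.figure().add_subplot(projection='3d')
    ax.plot_surface(z.real, z.imag, w.imag, cmap='coolwarm')  # height = Im(sqrt(z))
    ax.set_xlabel('Re z'); ax.set_ylabel('Im z'); ax.set_zlabel('Im sqrt(z)')
    plt.show()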

1.2 Analytic Functions and Integration along Contours


1.2.1 Analytic functions

The derivative of a real-valued function is defined at a point x via the limiting expression

f′(x) = lim_{∆x→0} [f(x + ∆x) − f(x)] / ∆x,

and we say that the function is differentiable at x if the limit exists and is independent of
whether x is approached from above or below, as given by the sign of ∆x.

Definition 1.2.1. The derivative of a complex function is defined via the limiting expression

f′(z) = lim_{∆z→0} [f(z + ∆z) − f(z)] / ∆z. (1.3)

This limit only exists if f′(z) is independent of the direction in the z-plane along which the
limit ∆z → 0 is taken. (Note: there are infinitely many ways to approach a point z ∈ C.)

If one sets ∆z = ∆x, Eq. (1.3) results in

f′(z) = u_x + i v_x,

where f = u + iv. However, setting ∆z = i∆y results in

f′(z) = −i u_y + v_y.

A consistent definition of the derivative requires that the two ways of taking it coincide, that is,

u_x = v_y,   u_y = −v_x, (1.4)

and this gives a necessary condition for the following theorem.

Theorem 1.2.2 (Cauchy-Riemann Theorem). The function f(z) = u(x, y) + iv(x, y) is
differentiable at the point z = x + iy iff (if and only if) the partial derivatives u_x, u_y, v_x, v_y
are continuous and the Cauchy-Riemann conditions (1.4) are satisfied in a neighborhood of
z.

Notice that in the explanations which led us to the Cauchy-Riemann theorem (1.2.2)
we only sketched one side of the proof – that it is necessary for the differentiability of f(z)
to have the theorem's conditions satisfied. To complete it, one needs to show that Eq. (1.4)
is sufficient for the differentiability of f(z). In other words, one needs to show that any
function u(x, y) + iv(x, y) is complex-differentiable if the Cauchy-Riemann equations
hold. The missing part of the proof follows from the following chain of transformations:

∆f = f(z + ∆z) − f(z) = (∂f/∂x) ∆x + (∂f/∂y) ∆y + O((∆x)², (∆y)², (∆x)(∆y))
= (1/2)(∂f/∂x − i ∂f/∂y) ∆z + (1/2)(∂f/∂x + i ∂f/∂y) ∆z* + O((∆x)², (∆y)², (∆x)(∆y))
= (∂f/∂z) ∆z + (∂f/∂z*) ∆z* + O((∆x)², (∆y)², (∆x)(∆y))
= (∂f/∂z + (∂f/∂z*)(∆z*/∆z)) ∆z + O((∆x)², (∆y)², (∆x)(∆y)), (1.5)

where O((∆x)², (∆y)², (∆x)(∆y)) indicates that we have ignored terms of order two or higher
in ∆x and ∆y. In the transition to the last line of Eq. (1.5) we changed variables
from (x, y) to (z, z*), thus using

∂/∂x = (∂z/∂x)(∂/∂z) + (∂z*/∂x)(∂/∂z*) = ∂/∂z + ∂/∂z*,
∂/∂y = (∂z/∂y)(∂/∂z) + (∂z*/∂y)(∂/∂z*) = i ∂/∂z − i ∂/∂z*,

and its inverse (known as the "Wirtinger derivatives")

∂/∂z = (1/2)(∂/∂x − i ∂/∂y),   ∂/∂z* = (1/2)(∂/∂x + i ∂/∂y).
Observe that ∆z*/∆z takes different values depending on the direction in which we take the
respective ∆z, ∆z* → 0 limit in the complex plane. Therefore, to ensure that the derivative
f′(z) is well defined at any z, one needs to require that

∂f/∂z* = 0, (1.6)

i.e. that f does not depend on z*. It is straightforward to check that the "independence of
the complex conjugate" condition, Eq. (1.6), is equivalent to Eq. (1.4).

Definition 1.2.3 (Analyticity). A function f(z) is called (a) analytic (or holomorphic) at
a point z0 if it is differentiable in a neighborhood of z0; (b) analytic in a region of the
complex plane (in the entire complex plane) if it is analytic at each point of the region (of
the entire plane).

Exercise 1.2.1. Verify whether the functions (a) exp(z), (b) z̄ := x − iy, (c) z exp(z̄), and
(d) 1/(1 + z) are analytic.

Exercise 1.2.2. The isolines of a function f(x, y) = u(x, y) + iv(x, y) are defined to be the
curves u(x, y) = const and v(x, y) = const′. Show that the isolines of an analytic function
always cross at a right angle.

Exercise 1.2.3. Let f(z) = u(x, y) + iv(x, y) be analytic. Given that u(x, y) = x + x² − y²
and f(0) = 0, find v(x, y).

Exercise 1.2.4. Let f(z) = u(x, y) + iv(x, y) be analytic. Given that v(x, y) = −2xy and
f(0) = 1, find u(x, y).
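These checks can also be automated symbolically. A sketch using sympy (one choice among computer algebra systems) that tests the Cauchy-Riemann conditions (1.4) for parts (a) and (b) of exercise 1.2.1:

    import sympy as sp

    x, y = sp.symbols('x y', real=True)

    # f = u + i v for f(z) = exp(z) and for f(z) = conj(z) = x - i y
    for name, f in [('exp(z)', sp.exp(x + sp.I * y)), ('conj(z)', x - sp.I * y)]:
        u, v = sp.re(f), sp.im(f)
        cr = (sp.simplify(sp.diff(u, x) - sp.diff(v, y)) == 0 and
              sp.simplify(sp.diff(u, y) + sp.diff(v, x)) == 0)
        print(name, 'satisfies Cauchy-Riemann:', cr)  # True, then False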

The Cauchy-Riemann theorem 1.2.2 has a couple of other complementary interpretations


discussed below.

Conformal Mappings

The Cauchy-Riemann condition (1.4) can be re-stated in the following compact form:

i ∂f/∂x = ∂f/∂y. (1.7)

Then the Jacobian matrix of the function f : R² → R², i.e. of the (x, y) → (u, v) map, is

J = ( ∂u/∂x   ∂u/∂y
      ∂v/∂x   ∂v/∂y ) = ( ∂u/∂x    ∂u/∂y
                          −∂u/∂y   ∂u/∂x ). (1.8)

Geometrically, the off-diagonal (skew-symmetric) part of the matrix represents rotation


and the diagonal part of the matrix represents scaling. The Jacobian of a function f (z)
takes infinitesimal line segments at the intersection of two curves in z and rotates them to
the corresponding segments in f (z). Therefore, a function satisfying the Cauchy-Riemann
equations, with a nonzero derivative, preserves the angle between curves in the plane. Trans-
formations corresponding to such functions and functions themselves are called conformal.
That is, the Cauchy-Riemann equations are the conditions for a function to be conformal.

Harmonic functions

Here we will make a fast jump to the end of the semester where Partial Differential Equations
(PDEs) will be discussed in detail. Consider the solution of the Laplace equation in two
dimensions
(∂²_x + ∂²_y) f(x, y) = 0. (1.9)

Eq. (1.9) defines the so-called harmonic functions. We do it now, while studying complex
calculus, because, quite remarkably, an arbitrary analytic function is a solution of
Eq. (1.9). This statement is a straightforward corollary of the Cauchy-Riemann theorem
(1.2.2).
The descriptor "harmonic" originates from the motion of a point on a taut string
undergoing periodic, pleasant-sounding motion, which the ancient Greeks coined
harmonic. This type of motion can be written in terms of sines and cosines, functions which
are thus referred to as harmonics. Fourier analysis, which we will turn our attention to soon,
involves expanding periodic functions on the unit circle in terms of a series over these
harmonics. These functions satisfy the Laplace equation, and over time "harmonic" came to
be used for all functions satisfying the Laplace equation.

1.2.2 Integration along Contours

Complex integration is defined along an oriented contour C in the complex plane.

Definition 1.2.4 (Complex Integration). Let f(z) be analytic in the neighborhood of a
contour C. The integral of f(z) along C is

∫_C f(z) dz := lim_{n→∞} Σ_{k=0}^{n−1} f(ζ_k)(ζ_{k+1} − ζ_k), (1.10)

where for each n, {ζ_k}_{k=0}^{n} is an ordered sequence of points along the path, breaking the path
into n intervals, such that ζ_0 = a, ζ_n = b, and max_k |ζ_{k+1} − ζ_k| → 0 as n → ∞.

Remark. Let z(t) with a ≤ t ≤ b be a parameterization of C; then definition 1.2.4 is
equivalent to the Riemann integral of f(z(t)) z′(t) with respect to t. Therefore,

∫_C f(z) dz = ∫_a^b f(z(t)) z′(t) dt. (1.11)

Example 1.2.5. In example 1.1.10 we evaluated the functions (i) f(z) = z² and (ii) g(z) =
exp(z + 1) along the parameterized curves described in example 1.1.6. Now compute (i)
∫_C f(z) dz and (ii) ∫_C g(z) dz along the contours (a) Ca: the horizontal line segment from
−M + iπ to M + iπ; (b) Cb: the ray segment extending from the point z = −1 to the
point √3 i; and (c) Cc: the circular arc of radius ε centered at 0.
Solution.

(a) Let z = x + iπ along Ca; then dz = dx for −M ≤ x ≤ M.

(i) ∫_{Ca} z² dz = ∫_{−M}^{M} (x + iπ)² dx = [x³/3 − π²x + πix²]_{−M}^{M} = (2/3)M³ − 2π²M

(ii) ∫_{Ca} e^{z+1} dz = ∫_{−M}^{M} e^{x+1+iπ} dx = [e^{x+1} e^{iπ}]_{−M}^{M} = −e^{M+1} + e^{−M+1}

(b) Let z = −1 + ρe^{iπ/3} for 0 ≤ ρ ≤ 2. Then dz = e^{iπ/3} dρ.

(i) ∫_{Cb} z² dz = ∫_0^2 (−1 + ρe^{iπ/3})² e^{iπ/3} dρ = [ρe^{iπ/3} − ρ²e^{i2π/3} + (1/3)ρ³e^{i3π/3}]_0^2 = 1/3 − i√3

(ii) ∫_{Cb} e^{z+1} dz = . . .

(c) Let z = εe^{iθ} for 0 ≤ θ < 2π; then dz = iεe^{iθ} dθ.

(i) ∫_{Cc} z² dz = ∫_0^{2π} (εe^{iθ})² iεe^{iθ} dθ = [(ε³/3) e^{3iθ}]_0^{2π} = 0

(ii) ∫_{Cc} exp(z + 1) dz = . . .

Exercise 1.2.6. Let C+ and C− represent the upper and lower unit semi-circles centered
at the origin and oriented from z = −1 to z = 1. Find the integrals of the functions (a) z²;
(b) 1/z; and (c) √z along C+ and C−. For √z, use the branch where z is represented by
re^{iθ} with 0 ≤ θ < 2π. Suggest why the results are the same in (a) and different in (b) and
(c). (You may look ahead to the next section for a hint.)

Exercise 1.2.7. Let C be the circular closed contour of radius R centered at the origin.
Show that

∮_C dz/z^m = 0, for m = 2, 3, . . . (1.12)

by parameterizing the contour in polar coordinates.

Exercise 1.2.8. Use numerical integration to approximate the integrals in the exercises
above and verify your results.
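A minimal numpy sketch of such a numerical check (the contour and function choices are ours): on a closed contour, the Riemann sum of definition (1.10) over a uniform parameter grid coincides with the periodic trapezoidal rule and converges quickly.

    import numpy as np

    # Approximate a contour integral by its defining Riemann sum, Eq. (1.10).
    def contour_integral(f, zeta):
        return np.sum(f(zeta[:-1]) * np.diff(zeta))

    t = np.linspace(0, 2 * np.pi, 2001)
    circle = 0.5 * np.exp(1j * t)   # closed circle of radius 1/2 around z = 0

    print(contour_integral(lambda z: z**2, circle))   # ~ 0 (Cauchy's theorem)
    print(contour_integral(lambda z: 1 / z, circle))  # ~ 2*pi*1j (Eq. (1.17) below)
    print(contour_integral(lambda z: z**-2, circle))  # ~ 0 (exercise 1.2.7)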

1.2.3 Cauchy’s Theorem

In general, the integral along a path in the complex plane depends on the entire path and
not only on the position of the end points. The following fundamental question arises
naturally: is there a condition which makes the integral depend only on the end points
of the path? The question is answered by the following famous theorem.

Theorem 1.2.5 (Cauchy's Theorem, 1825). If f(z) is analytic in a simply connected region
D of the complex plane, then for all paths C lying in this region and having the same end
points, the integral ∫_C f(z) dz has the same value.

It is important to recognize what the use of Cauchy's theorem requires concerning integration
of a multi-valued function. For Cauchy's theorem to hold, one needs the integrand to be
a single-valued function. The cuts introduced in the preceding section are required for exactly
this reason – they force the integration path to stay within a single branch of a multi-valued
function and thus guarantee analyticity (differentiability) of the function along the path.
The same theorem can be restated in the following form.

Theorem 1.2.6 (Cauchy's Theorem (closed contour version)). Let f(z) be analytic in a
simply connected region D and let C be a closed contour that lies in the interior of D. Then
the integral of f along C is equal to zero: ∮_C f(z) dz = 0.

To pass from the former formulation of Cauchy's theorem to the latter, we need to
consider two paths connecting two points of the complex plane. From Eq. (1.10), we see
that paths are oriented and that changing the direction of a path changes the value of the
integral by a factor of −1. Therefore, one of the two paths considered needs its direction
reversed, leading us to the closed contour formulation of Cauchy's theorem.
Let us now sketch the proof of the closed contour version of Cauchy's theorem. Consider
breaking the region of the complex plane bounded by the contour C into small squares with
contours C_k, oriented, as the original contour C, in the positive direction (counter-clockwise).
Then

∮_C f(z) dz = Σ_k ∮_{C_k} f(z) dz, (1.13)

where we have accounted for the fact that integrals over the inner sides of the small contours
cancel each other, as two of them (for each side) run in opposite directions. Next,
pick inside a contour C_k a point z_k, and then approximate f(z) by expanding it in a Taylor
series around z_k:

f(z) = f(z_k) + f′(z_k)(z − z_k) + O(∆²), (1.14)

where, for squares of side ∆, the length of C_k is at most 4∆, and we have at most (L/∆)²
small squares, L being the linear size of the region bounded by C. Substituting Eq. (1.14)
into Eq. (1.13) one derives

∮_{C_k} f(z) dz = f(z_k) ∮_{C_k} dz + f′(z_k) ∮_{C_k} (z − z_k) dz + ∮_{C_k} O(∆²) dz = 0 + 0 + O(∆³). (1.15)

Summing over all the small squares bounded by C, one arrives at the estimate (L/∆)² O(∆³) = O(L²∆) → 0
in the ∆ → 0 limit.
Disclaimer: We have just used a discretization of the integral. When dealing with
integration of functions in the rest of the course, we will always discuss it in the sense of a
limit, assuming that it exists, and not actually break the integration path into segments.
However, if any question on the details of the limiting procedure surfaces, one should go
back to the discretization and analyze the respective limiting procedure carefully.
One important consequence of Cauchy’s theorem (there will be more discussed in the
following) is that all integration rules known for standard, “interval”, integrals apply to the
contour integrals. This is also facilitated by the following statement.

Theorem 1.2.7 (Triangle Inequality). (A: From Euclidean Geometry) |z1 + z2| ≤ |z1| + |z2|,
with equality iff (if and only if) z1 and z2 lie on the same ray from the origin. (B:
Integral over Interval) Suppose g(t) is a complex-valued function of a real variable, defined
on a ≤ t ≤ b; then

|∫_a^b g(t) dt| ≤ ∫_a^b |g(t)| dt,

with equality iff the values of g(t) all lie on the same ray from the origin.
(C: Integral over Curve/Path) For any function f(z) and any curve γ, we have

|∫_γ f(z) dz| ≤ ∫_γ |f(z)| |dz|,

where dz = γ′(t) dt and |dz| = |γ′(t)| dt.

Proof. We take the "Euclidean" geometry version (A) of the statement, extended to sums
of complex numbers, as granted, and give a brief sketch of the proofs for the integral formulations.
The interval version (B) of the triangle inequality follows by approximating the integral
as a Riemann sum:

|∫_a^b g(t) dt| ≈ |Σ_k g(t_k) ∆t| ≤ Σ_k |g(t_k)| ∆t ≈ ∫_a^b |g(t)| dt,

where the middle inequality is just the standard triangle inequality for sums of complex
numbers. The contour version (C) of the theorem follows immediately from the interval
version:

|∫_γ f(z) dz| = |∫_a^b f(γ(t)) γ′(t) dt| ≤ ∫_a^b |f(γ(t))| |γ′(t)| dt = ∫_γ |f(z)| |dz|.

[Figure 1.3: Closed contours around the origin in the complex plane (axes Re(z) = x, Im(z) = y).]

1.2.4 Cauchy’s Formula

Recall from definition 1.1.3 that a curve is called simple if it does not intersect itself, and
is called a contour if it consists of a finite number of connected smooth curves.

Theorem 1.2.8 (Cauchy's formula, 1831). Let f(z) be analytic on and interior to a simple
closed contour C. Then,

f(z) = (1/(2πi)) ∮_C f(ζ)/(ζ − z) dζ. (1.16)
To illustrate Cauchy's formula, consider the simplest, and arguably most important,
example of an integral over the complex plane, I = ∮ dz/z. For the integral over the closed contour
shown in Fig. (1.3a), we parameterize the contour explicitly in polar coordinates and derive

I = ∮ dz/z = ∫_0^{2π} d(r e^{iθ}) / (r e^{iθ}) = ∫_0^{2π} (r e^{iθ} i dθ) / (r e^{iθ}) = i ∫_0^{2π} dθ = 2πi. (1.17)

The integral is not zero.


Next, recall that the respective standard indefinite integral is ∫ dz/z = log z. This
formula is naturally consistent with both Eq. (1.17) and with the fact that log(z) is
a multi-valued function. Indeed, consider the integral over a path between two points of the
complex plane, e.g. z = 1 and z = 2. We can go from z = 1 to z = 2 straight, or we can do
it, for example, by first making a counter-clockwise turn around 0. We can generalize, going
clockwise as well, and making as many turns as we want. It is straightforward to
check that the integral depends on how many times and in which direction we go around 0.
The answers will differ by the result of Eq. (1.17), i.e. by 2πi multiplied by an integer;
otherwise the integral will not depend on the path.

[Figure 1.4: Two distinct closed paths around the origin in the complex plane.]

Exercise 1.2.9. Compute, compare and discuss the difference (if any) between the values of
the integral ∮ dz/z over the two distinct paths shown in Fig. (1.4).

The "small square" construction used above to prove the closed contour version of
Cauchy's Theorem, i.e. Theorem 1.2.6, is a useful tool for dealing with integrals over
awkward (difficult for direct computation) paths around singular points of the integrand.
However, it should not be thought that all such integrals will necessarily be zero. Consider

∮ dz/z^m, m = 2, 3, · · · ,

where the integrand is singular at z = 0. The respective indefinite integral (what is sometimes
called the "anti-derivative") is z^{−m+1}/(1 − m) + C, where C is a constant. Observe that the
indefinite integral is a single-valued function and thus its integral over a closed contour is
zero. (Notice that if m = 1 the indefinite integral is a multi-valued function within the
domain surrounding z = 0.)
Cauchy’s formula can be extended to higher derivatives.

Theorem 1.2.9 (Cauchy’s formula for derivatives, 1842). Under the same conditions as in
Theorem 1.2.8, the higher derivatives are
$$f^{(n)}(z) = \frac{n!}{2\pi i}\oint_C \frac{f(\zeta)\,d\zeta}{(\zeta - z)^{n+1}}. \qquad (1.18)$$
Figure 1.5: three examples of contours in the complex plane, each drawn relative to the point z = 1 on the real axis (axes Re(z), Im(z)).

1.2.5 Laurent Series

The Laurent series of a complex function f (z) about a point a is a representation of that
function by a power series that includes terms of both positive and negative degree.

Theorem 1.2.10. A function f(z) that is analytic in the annulus R₁ ≤ |z − a| ≤ R₂ may
be represented by the power series
$$f(z) = \sum_{n=-\infty}^{+\infty} c_n (z - a)^n \qquad (1.19)$$
in the (possibly smaller) annulus R₁ < R̃₁ ≤ |z − a| ≤ R̃₂ < R₂, where
$$c_n = \frac{1}{2\pi i}\oint_C \frac{f(z)}{(z - a)^{n+1}}\,dz, \qquad (1.20)$$
and C is any contour that is contained in the region of analyticity and encircles a.

Suppose one needs to compute I


f (z)dz,

where the contour surrounds z = a in the positive (counter-clockwise) direction such that
it contains no other singular points of f (z). Then, we substitute f (z) by its Laurent series,
and observe that according to Cauchy’s formula the only nonzero contribution will come
from the k = −1 term I I
c−1 dz
f (z)dz = = 2πic−1 .
z−a
Due to this significance of the c−1 term, it has a special name, the residue of f at z = a,
and is often denoted by c−1 = Res(f, a).
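The coefficient formula (1.20) with n = −1 also gives a practical numerical recipe for residues: integrate f over a small circle around a. A minimal sketch (the helper below is our own, not part of the notes):

```python
import numpy as np

def residue(f, a, r=0.5, num=4001):
    """Approximate Res(f, a) = c_{-1} via Eq. (1.20) with n = -1,
    integrating over the circle z = a + r*exp(i*theta)."""
    theta = np.linspace(0.0, 2.0 * np.pi, num)
    z = a + r * np.exp(1j * theta)
    dz_dtheta = 1j * r * np.exp(1j * theta)
    return np.trapz(f(z) * dz_dtheta, theta) / (2.0j * np.pi)

# Sanity check: f(z) = exp(z)/z has residue exp(0) = 1 at z = 0.
print(residue(lambda z: np.exp(z) / z, 0.0))  # ~ (1 + 0j)
```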

Theoretical Implications of Cauchy’s Theorem & Cauchy’s Formulas

Cauchy’s theorem and formulas have many powerful and far reaching consequences.

Theorem 1.2.11. Suppose f (z) is analytic on a region A. Then, f has derivatives of all
orders.

Proof. It follows directly from Cauchy’s formula for derivatives, Theorem 1.2.9: we
have an explicit formula for all the derivatives, so in particular the derivatives all exist.

Theorem 1.2.12 (Cauchy Inequality). Let C_R be the circle |z − z₀| = R. Assume that f(z)
is analytic on C_R and its interior, i.e. on the disk |z − z₀| ≤ R. Finally, let M_R = max |f(z)|
over z on C_R. Then
$$\forall n = 1, 2, \cdots: \qquad \left|f^{(n)}(z_0)\right| \le \frac{n!\, M_R}{R^n}.$$
Exercise 1.2.10. Prove Cauchy’s Inequality Theorem utilizing Theorem 1.2.9. Illustrate
the theorem on the example of cos(z).

Theorem 1.2.13 (Liouville Theorem). If f(z) is entire, i.e. analytic at all finite points of
the complex plane C, and bounded, then f is constant.

Proof. For any circle of radius R around z₀, Cauchy’s inequality (Theorem 1.2.12, with n = 1)
states that |f′(z₀)| ≤ M/R, where M bounds |f| on C. But R can be arbitrarily large; thus
|f′(z₀)| = 0 for every z₀ ∈ C. And since the derivative is 0 everywhere, the function itself is
constant.

Note that P(z) = ∑_{k=0}^{n} a_k z^k, exp(z), and cos(z) are entire but not bounded.

Theorem 1.2.14 (Fundamental Theorem of Algebra). Any polynomial P of degree n ≥ 1,
i.e. P(z) = ∑_{k=0}^{n} a_k z^k with a_n ≠ 0, has exactly n roots (solutions of P(z) = 0, counted
with multiplicity).

Proof. The proof consists of two parts. First, we show that P(z) has at least one
root. (See the exercise below.) Second, we count the roots: let z₀ be one of the roots and
factor P(z) = (z − z₀)Q(z), where Q(z) has degree n − 1. If n − 1 > 0, then we can
apply the result to Q(z). We continue this process until the degree of Q is 0.

Exercise 1.2.11. Prove that P(z) = ∑_{k=0}^{n} a_k z^k, n ≥ 1, has at least one root. (Hint: Prove by
contradiction and utilize the Liouville Theorem 1.2.13.)

Theorem 1.2.15 (Maximum modulus principle (over disk)). Suppose f(z) is analytic on
the closed disk, C_r, of radius r centered at z₀, i.e. the set |z − z₀| ≤ r. If |f| has a relative
maximum at z₀, then f(z) is constant in C_r.

In order to prove the Theorem we will first prove the following statement.

Theorem 1.2.16 (Mean value property). Suppose f(z) is analytic on the closed disk of
radius r centered at z₀, i.e. the set |z − z₀| ≤ r. Then,
$$f(z_0) = \frac{1}{2\pi}\int_0^{2\pi} d\theta\, f\!\left(z_0 + r e^{i\theta}\right).$$

Proof. Call C_r the boundary of the set |z − z₀| ≤ r, and parameterize it as γ(θ) = z₀ + re^{iθ},
0 ≤ θ ≤ 2π, with γ′(θ) = ire^{iθ}. Then, according to Cauchy’s formula,
$$f(z_0) = \frac{1}{2\pi i}\oint_{C_r} \frac{f(z)\,dz}{z - z_0} = \frac{1}{2\pi i}\int_0^{2\pi} d\theta\, \frac{f(z_0 + re^{i\theta})}{re^{i\theta}}\, ire^{i\theta} = \frac{1}{2\pi}\int_0^{2\pi} d\theta\, f(z_0 + re^{i\theta}).$$

Now back to Theorem 1.2.15. To sketch the proof we use both the mean value
property (Theorem 1.2.16) and the triangle inequality (Theorem 1.2.7). Since z₀ is a relative
maximum of |f| on C_r, we have |f(z)| ≤ |f(z₀)| for z ∈ C_r. Therefore, by the mean value
property and the triangle inequality, one derives
$$|f(z_0)| = \left|\frac{1}{2\pi}\int_0^{2\pi} d\theta\, f(z_0 + re^{i\theta})\right| \qquad \text{(mean value property)}$$
$$\le \frac{1}{2\pi}\int_0^{2\pi} d\theta\, \left|f(z_0 + re^{i\theta})\right| \qquad \text{(triangle inequality)}$$
$$\le \frac{1}{2\pi}\int_0^{2\pi} d\theta\, |f(z_0)| \qquad \left(|f(z_0 + re^{i\theta})| \le |f(z_0)|,\ \text{i.e. } z_0 \text{ is a local maximum}\right)$$
$$= |f(z_0)|.$$
Since we start and end with |f(z₀)|, all inequalities in the chain are equalities. The first
inequality can only be an equality if, for all θ, f(z₀ + re^{iθ}) lies on the same ray from the
origin, i.e. the values all have the same argument or are equal to zero. The second inequality
can only be an equality if |f(z₀ + re^{iθ})| = |f(z₀)| for all θ. Thus, combining the two
observations, one gets that all f(z₀ + re^{iθ}) have the same magnitude and the same argument,
i.e. they are all the same. Finally, if f(z) is constant along the circle and f(z₀) is the average
of f(z) over the circle, then f(z) = f(z₀), i.e. f is constant on C_r.
Two remarks are in order. First, based on the experience so far (starting from Theorem
1.2.13), it is plausible to expect that Theorem 1.2.15 generalizes from a disk C_r to any
bounded connected domain. Second, one also expects that the maximum modulus can be
achieved at the boundary of a domain without the function being constant within the domain.
Indeed, consider the example of exp(z) on the unit square, 0 ≤ x, y ≤ 1. The maximum,
|exp(x + iy)| = exp(x), is achieved at x = 1 and arbitrary y, 0 ≤ y ≤ 1, i.e. at the
boundary of the domain. These remarks and the example suggest the following extension
of Theorem 1.2.15.

Theorem 1.2.17 (Maximum modulus principle (general)). Suppose f(z) is analytic on A,
which is a bounded, connected, open set, and is continuous on Ā = A ∪ ∂A, where ∂A is
the boundary of A. Then either f(z) is a constant or the maximum of |f(z)| on Ā occurs
on ∂A.

Proof. Here is a sketch of the proof. Suppose the maximum of |f(z)| is attained at an
interior point of A. Cover A by disks laid out so that their centers form a path from the
point where |f(z)| is maximized to any other point in A, while staying totally contained
within A. The existence of an interior maximum of |f(z)| implies, according to Theorem
1.2.15 applied to each of the disks in turn, that all the values of f(z) in the domain are the
same; thus f(z) is constant within A. Obviously, the constancy of f(z) is not required if the
maximum of |f(z)| is achieved only at ∂A.

Exercise 1.2.12. Find the maximum modulus of sin(z) on the square, 0 ≤ x, y ≤ 2π.

1.3 Residue Calculus


1.3.1 Singularities and Residues

Exercise 1.3.1. Use Cauchy’s formula to compute
$$\oint \frac{\exp(z^2)\,dz}{z - 1}, \qquad (1.21)$$
for the three contour examples shown in Fig. 1.5.

Exercise 1.3.2. Compute the integral ∮ dz/(e^z − 1) over the circle of radius 4 centered
at 3i.

1.3.2 Evaluation of Real-valued Integrals by Contour Integration

Example 1.3.3. Evaluate the integral
$$I_1 = \int_{-\infty}^{+\infty} \frac{\cos(\omega x)\,dx}{1 + x^2}, \qquad \omega > 0.$$
Note: the respective indefinite integral is not expressible via elementary functions, and one
needs an alternative way of evaluating the definite integral.
Solution. Observe that
$$\int_{-\infty}^{+\infty} \frac{\sin(\omega x)\,dx}{1 + x^2} = 0,$$
just because the integrand is odd (skew-symmetric) in x. Combining the two formulas
above, one derives
$$I_1 = \int_{-\infty}^{+\infty} \frac{\exp(i\omega x)\,dx}{1 + x^2}.$$
Consider the auxiliary integral
$$I_R = \oint \frac{\exp(i\omega z)\,dz}{1 + z^2}, \qquad \omega > 0,$$
where the contour consists of the half-circle of radius R and the straight line along the real
axis from −R to R, shown in Fig. (1.7). Since the function in the integrand has two poles of
the first order, at z = ±i, and only one of these poles lies within the contour, one derives
$$I_R = 2\pi i\,\mathrm{Res}\!\left(\frac{\exp(i\omega z)}{1 + z^2},\, +i\right) = 2\pi i\,\frac{\exp(i\omega\cdot i)}{2i} = \pi\exp(-\omega).$$
On the other hand, I_R can be represented as a sum of two integrals, one over [−R, R] and
one over the semi-circle. Sending R → ∞, one observes that the latter integral vanishes, thus
leaving us with the answer
$$I_1 = \pi\exp(-\omega).$$
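Results of this type are easy to cross-check numerically. Here is a minimal sketch (Python with scipy; the value of ω is an arbitrary choice of ours):

```python
import numpy as np
from scipy.integrate import quad

omega = 1.7
numeric, _ = quad(lambda x: np.cos(omega * x) / (1.0 + x**2),
                  -np.inf, np.inf)
print(numeric, np.pi * np.exp(-omega))  # the two values agree
```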

Exercise 1.3.4. Evaluate the following integrals:
$$\text{(a)}\quad \int_0^\infty \frac{dx}{1 + x^4},$$
$$\text{(b)}\quad \int_0^\infty \frac{dx}{1 + x^3},$$
$$\text{(c)}\quad \int_0^\infty \frac{\exp(ikx)\,dx}{x^4 + a^4},$$
$$\text{(d)}\quad \int_0^\infty \exp\!\left(ix^2\right) dx,$$
$$\text{(e)}\quad \int_{-\infty}^{\infty} \frac{\exp(ikx)\,dx}{\cosh(x)}.$$

Cauchy Principal Value

Consider the integral
$$\int_0^\infty \frac{\sin(ax)\,dx}{x}, \qquad (1.22)$$
where a > 0. As has become customary in this part of the course, let us evaluate it by
constructing and evaluating a contour integral. Since sin(az)/z is analytic near z = 0 (recall,
or google, L’Hôpital’s rule), we build the contour around the origin as shown in Fig. (1.6).
Then, going through the following chain of evaluations, we arrive at
$$\int_0^\infty \frac{\sin(ax)\,dx}{x} = \frac{1}{2}\int_{[a\to b\to c\to d]} \frac{\sin(az)}{z}\,dz \qquad (1.23)$$
$$= \frac{1}{4i}\int_{[a\to b\to c\to d]} \left(\frac{\exp(iaz)}{z} - \frac{\exp(-iaz)}{z}\right) dz$$
$$= \frac{1}{4i}\int_{[a\to b\to c\to d\to e\to a]} \frac{\exp(iaz)}{z}\,dz - \frac{1}{4i}\int_{[a\to b\to c\to d\to f\to a]} \frac{\exp(-iaz)}{z}\,dz = \frac{1}{4i}\,(2\pi i - 0) = \frac{\pi}{2}.$$
Figure 1.6: contour along the real axis through the points a, b, c, d, with a small semicircle of radius r around the origin and closure at radius R.

(Note that a lot of details in this chain of transformations are dropped. We advise the
reader to reconstruct these details. In particular, we suggest checking that the integrals
over the two semi-circles in Fig. (1.6) decay to zero as r → 0 and R → ∞. For the latter,
you may either estimate the asymptotic value of the integral yourself, or use Jordan’s lemma.)
The limiting process just explained is often referred to as the (Cauchy) Principal Value
of the integral:
$$\mathrm{PV}\int_{-\infty}^{\infty} \frac{\exp(ix)\,dx}{x} = \lim_{R\to\infty}\int_{-R}^{R} \frac{\exp(ix)\,dx}{x} = i\pi. \qquad (1.24)$$
In general, if the integrand f(x) becomes infinite at a point x = c inside the range of
integration, and the limit on the right of the following expression exists,
$$\lim_{\varepsilon\to 0}\int_{-R}^{R} f(x)\,dx = \lim_{\varepsilon\to 0}\left(\int_{-R}^{c-\varepsilon} dx\, f(x) + \int_{c+\varepsilon}^{R} dx\, f(x)\right), \qquad (1.25)$$
we call it the principal value integral. (Notice that either of the terms inside the
brackets on the right, considered separately, may result in a divergent integral.)
Consider another example,
$$\int_a^b \frac{dx}{x} = \log\frac{b}{a}, \qquad (1.26)$$
where we write the integral as a formal indefinite integral. However, if a < 0 and b > 0, the
integral diverges at x = 0. We can still define
$$\mathrm{PV}\int_a^b \frac{dx}{x} \doteq \lim_{\varepsilon\to 0}\left(\int_a^{-\varepsilon}\frac{dx}{x} + \int_{\varepsilon}^{b}\frac{dx}{x}\right) = \lim_{\varepsilon\to 0}\left(\log\frac{\varepsilon}{-a} + \log\frac{b}{\varepsilon}\right) = \log\frac{b}{|a|}, \qquad (1.27)$$
Figure 1.7: contour consisting of the segment [−R, R] on the real axis, closed by a semi-circle of radius R in the upper half-plane.

excluding the ε-vicinity of 0. This example helps us to emphasize that the principal value is
unambiguous only if the ε-dependent integration limits in ∫^{−ε} and ∫_ε are taken with the
same absolute value; taking, say, ∫^{−ε/2} and ∫_ε instead would change the answer.
If complex variables were used, we could complete the path by a semicircle from −ε
to ε about the origin (zero), either above or below the real axis. If the upper semicircle were
chosen, there would be a contribution −iπ, whereas if the lower semicircle were chosen, the
contribution to the integral would be +iπ. Thus, according to the path permitted in the
complex plane, we should have ∫_a^b dz/z = log(b/|a|) ± iπ. The principal value is the mean
of these two alternatives.
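The symmetric-excision prescription of Eq. (1.27) can be illustrated numerically. A minimal sketch (a, b and the ε sequence are our choices):

```python
import numpy as np
from scipy.integrate import quad

# PV of int_a^b dx/x with a < 0 < b, excising the symmetric interval (-eps, eps).
a, b = -1.0, 2.0
for eps in [1e-1, 1e-2, 1e-3]:
    left, _ = quad(lambda x: 1.0 / x, a, -eps)
    right, _ = quad(lambda x: 1.0 / x, eps, b)
    print(eps, left + right)  # -> log(b/|a|) = log(2) ~ 0.6931
```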

1.3.3 Contour Integration with Multi-valued Functions

Contour integrals can be used to evaluate certain definite integrals.

Integrals involving Branch Cuts

We discuss below a number of examples of definite integrals which are reduced to contour
integrals avoiding branch cuts.
Consider the following standard integral and its contour version:
$$\int_0^\infty \frac{dx}{\sqrt{x}(x^2 + 1)} \;\to\; \oint \frac{dz}{\sqrt{z}(z^2 + 1)} = \oint dz\, f(z). \qquad (1.28)$$
Figure 1.8: keyhole contour for Eq. (1.28): sub-contours C₁–C₄ around the branch cut along the positive real axis, with small radius r, large radius R, and poles at z = ±i.


The square root in the integrand, √z = exp((log z)/2), is a multi-valued function, so it must
be treated with a contour containing a branch cut. We consider the contour shown in
Fig. (1.8); then the ∮ in Eq. (1.28) becomes ∫_{C₁} + ∫_{C₂} + ∫_{C₃} + ∫_{C₄}. The contour is
chosen to guarantee that
$$r \to 0: \qquad \int_{C_2} \frac{dz}{\sqrt{z}(z^2 + 1)} \to 0, \qquad (1.29)$$
$$R \to \infty: \qquad \int_{C_4} \frac{dz}{\sqrt{z}(z^2 + 1)} \to 0, \qquad (1.30)$$
thus resulting (in the r → 0 and R → ∞ limits) in
$$\oint \frac{dz}{\sqrt{z}(z^2 + 1)} = \int_{C_1} \frac{dz}{\sqrt{z}(z^2 + 1)} + \int_{C_3} \frac{dz}{\sqrt{z}(z^2 + 1)} = 2\int_0^\infty \frac{dx}{\sqrt{x}(x^2 + 1)}. \qquad (1.31)$$
On the other hand, the (full) contour surrounds the two poles of the integrand, at z = ±i; thus
$$\oint \frac{dz}{\sqrt{z}(z^2 + 1)} = 2\pi i\left(\mathrm{Res}\,(\text{at } z = i) + \mathrm{Res}\,(\text{at } z = -i)\right), \qquad (1.32)$$
where, with the branch fixed by the cut along the positive real axis (√z = exp((log z)/2), arg z ∈ [0, 2π)),
$$\mathrm{Res}\,(\text{at } z = i) = \lim_{z\to i}\left(f(z)(z - i)\right) = \lim_{z\to i}\frac{1}{\sqrt{z}(z + i)} = \frac{\exp(-3\pi i/4)}{2}, \qquad (1.33)$$
$$\mathrm{Res}\,(\text{at } z = -i) = \lim_{z\to -i}\left(f(z)(z + i)\right) = \lim_{z\to -i}\frac{1}{\sqrt{z}(z - i)} = \frac{\exp(-\pi i/4)}{2}. \qquad (1.34)$$
Summarizing, and using Eq. (1.31), one arrives at the following answer:
$$\int_0^\infty \frac{dx}{\sqrt{x}(x^2 + 1)} = \pi i\left(\frac{\exp(-3\pi i/4)}{2} + \frac{\exp(-\pi i/4)}{2}\right) = \frac{\pi}{\sqrt{2}}. \qquad (1.35)$$
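A direct quadrature check of Eq. (1.35) (a sketch; scipy's quad handles the integrable 1/√x endpoint singularity):

```python
import numpy as np
from scipy.integrate import quad

numeric, _ = quad(lambda x: 1.0 / (np.sqrt(x) * (x**2 + 1.0)), 0.0, np.inf)
print(numeric, np.pi / np.sqrt(2.0))  # both ~ 2.2214
```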

Exercise 1.3.5. Evaluate the following integral:
$$\int_1^\infty \frac{dx}{x\sqrt{x - 1}}.$$

Aiming to compute the following integral along the real axis (notice the asymptotics at
x → 0 and x → 1),
$$I = \int_0^1 \frac{dx}{x^{2/3}(1 - x)^{1/3}}, \qquad (1.36)$$
let us introduce and analyze a contour integral with almost the same integrand,
$$\oint \frac{dz}{z^{2/3}(z - 1)^{1/3}} = \oint \frac{dz}{f(z)}, \qquad (1.37)$$

where we introduce the contour, shown in Fig. (1.9a), surrounding the cut connecting the
two branch points of f(z), at z = 0 and z = 1 (both points are branch points of the
3rd order).
Recall that cuts are introduced to make a function which is multi-valued in the complex
plane (thus not entire, i.e. not analytic within the entire complex plane) analytic within the
complex plane excluding the cut. The cut also defines the choice of the branch of the
(originally multi-valued) function. Thus, in the case under consideration, f(z) ≐ z^{2/3}(z − 1)^{1/3}
has the following parameterization as we go around the cut (in the negative direction):

Sub-contour          Parametrization of z                    Evaluation of f(z)
C₁ ≐ [a → b]         x₁,  x₁ ∈ [r, 1 − r]                    x₁^{2/3} |1 − x₁|^{1/3} exp(iπ/3)
C₂ ≐ [b → c]         1 + r exp(iθ₂),  θ₂ ∈ [π, −π]           r^{1/3} exp(iθ₂/3)
C₃ ≐ [c → d]         x₃,  x₃ ∈ [1 − r, r]                    x₃^{2/3} |1 − x₃|^{1/3} exp(−iπ/3)
C₄ ≐ [d → a]         r exp(iθ₄),  θ₄ ∈ [2π, 0]               r^{2/3} exp(i2θ₄/3 + iπ/3)

Next we compute the integrals of the same integrand over the sub-contours C₁, C₂, C₃, C₄:
$$\int_{C_1} \frac{dz}{f(z)} = \int_0^1 \frac{dx_1}{x_1^{2/3}(1 - x_1)^{1/3}\exp(i\pi/3)} = \exp(-i\pi/3)\, I, \qquad (1.38)$$
$$\int_{C_2} \frac{dz}{f(z)} = \int_{\pi}^{-\pi} \frac{i r\exp(i\theta_2)\,d\theta_2}{(1 + r\exp(i\theta_2))^{2/3}\,(r\exp(i\theta_2))^{1/3}} \;\xrightarrow{\,r\to 0\,}\; 0, \qquad (1.39)$$
$$\int_{C_3} \frac{dz}{f(z)} = \int_1^0 \frac{dx_3}{x_3^{2/3}|1 - x_3|^{1/3}\exp(-i\pi/3)} = -\exp(i\pi/3)\, I, \qquad (1.40)$$
$$\int_{C_4} \frac{dz}{f(z)} = \int_{2\pi}^{0} \frac{i r\exp(i\theta_4)\,d\theta_4}{(r\exp(i\theta_4))^{2/3}\,(r\exp(i\theta_4) - 1)^{1/3}} \;\xrightarrow{\,r\to 0\,}\; 0. \qquad (1.41)$$
Figure 1.9: (a) the contour C₁ ∪ C₂ ∪ C₃ ∪ C₄ (through the points a, b, c, d) surrounding the cut between z = 0 and z = 1; (b) the same cut enclosed by the large circle C of radius R.

Finally, taking advantage of the analyticity of f(z) everywhere outside the [0, 1] cut and
using Cauchy’s integral theorem, one transforms the integral over C₁ ∪ C₂ ∪ C₃ ∪ C₄ into
the same integral over the contour C shown in Fig. (1.9):
$$\int_{C_1} \frac{dz}{f(z)} + \int_{C_2} \frac{dz}{f(z)} + \int_{C_3} \frac{dz}{f(z)} + \int_{C_4} \frac{dz}{f(z)} = \int_C \frac{dz}{f(z)}. \qquad (1.42)$$

On the other hand, the contour integral over C can be computed in the R → ∞ limit:
$$\int_C \frac{dz}{f(z)} = \int_{2\pi}^{0} \frac{i R\exp(i\theta)\,d\theta}{R^{2/3}\exp(2i\theta/3)\,(R\exp(i\theta) - 1)^{1/3}} \;\xrightarrow{\,R\to\infty\,}\; -i\int_0^{2\pi} d\theta = -2\pi i. \qquad (1.43)$$

Summarizing Eqs. (1.36)–(1.43), one arrives at
$$I = \frac{-2\pi i}{-\exp(i\pi/3) + \exp(-i\pi/3)} = \frac{\pi}{\sin(\pi/3)} = \frac{2\pi}{\sqrt{3}}. \qquad (1.44)$$
It may be instructive to compare this derivation with an alternative derivation discussed
in [1].
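Again, Eq. (1.44) admits a quick numerical check (a sketch; both endpoint singularities are integrable):

```python
import numpy as np
from scipy.integrate import quad

numeric, _ = quad(lambda x: x**(-2.0 / 3.0) * (1.0 - x)**(-1.0 / 3.0), 0.0, 1.0)
print(numeric, 2.0 * np.pi / np.sqrt(3.0))  # both ~ 3.6276
```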

Exercise 1.3.6. Evaluate the integral
$$\int_{-1}^{1} \frac{dx}{(1 + x^2)\sqrt{1 - x^2}}, \qquad (1.45)$$
by identifying and evaluating an equivalent contour integral.

1.4 Extreme-, Stationary- and Saddle-Point Methods


In this section we study a family of methods which allow one to approximate integrals
dominated by the contribution of a special point and its vicinity. Depending on the case,
the method is called the extreme-point, stationary-point or saddle-point method. We start
by discussing the extreme-point version, corresponding to estimating real-valued integrals
over a real domain, then turn to the estimation of oscillatory (complex-valued) integrals over
a real interval (stationary-point method), and then generalize to complex-valued integrals
over a complex path (saddle-point method).
The extreme- (or maximal-) point method applies to the integral
$$I_1 = \int_a^b dx\,\exp(f(x)), \qquad (1.46)$$
where the real-valued, continuous function f(x) achieves its maximum at a point x₀ ∈ ]a, b[.
Then one approximates the function by the first terms of its Taylor series expansion around
the maximum,
$$f(x) = f(x_0) + \frac{(x - x_0)^2}{2}\, f''(x_0) + O\!\left((x - x_0)^3\right), \qquad (1.47)$$

where we have used that f′(x₀) = 0. Since x₀ is an interior maximum, f′(x₀) = 0 and
f″(x₀) ≤ 0, and we consider the case of general position, f″(x₀) < 0. One substitutes
Eq. (1.47) into Eq. (1.46), drops the O((x − x₀)³) term, and extends the integration over
[a, b] to ]−∞, ∞[. Evaluating the resulting Gaussian integral, one arrives at the following
extreme-point estimate:
$$I_1 \to \sqrt{\frac{2\pi}{-f''(x_0)}}\,\exp(f(x_0)). \qquad (1.48)$$
This approximation is justified if |f″(x₀)| ≫ 1.

Example 1.4.1. Estimate the following integral
$$I = \int_{-\infty}^{+\infty} dx\,\exp(S(x)), \qquad S(x) = \alpha x^2 - x^4/2, \qquad (1.49)$$
at sufficiently large positive α using the saddle-point approximation.

Solution: Let us find all stationary points of S(x) (saddle points of the integrand).
Solving S′(x_s) = 0, one gets that either x_s = 0 or x_s = ±√α. The values of S at the saddle
points are S(0) = 0 and S(±√α) = α²/2, and we thus choose the dominating saddle points,
x_s = ±√α, for further evaluation. Since the two (dominant) saddle points are fully
equivalent, we pick one and then multiply the estimate for the integral by two; with
S″(±√α) = −4α one finds
$$I \approx 2\exp(\alpha^2/2)\int_{-\infty}^{+\infty} dx\,\exp\!\left(S''(\sqrt{\alpha})\,x^2/2\right) = 2\exp(\alpha^2/2)\int_{-\infty}^{+\infty} dx\,\exp(-2\alpha x^2) = \exp(\alpha^2/2)\,\sqrt{\frac{2\pi}{\alpha}}.$$
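One can check how good this estimate is at moderate α by comparing against brute-force quadrature (a sketch; α and the integration window are our choices):

```python
import numpy as np

alpha = 8.0
x = np.linspace(-6.0, 6.0, 200001)  # the integrand is negligible outside this range
numeric = np.trapz(np.exp(alpha * x**2 - x**4 / 2.0), x)
estimate = np.exp(alpha**2 / 2.0) * np.sqrt(2.0 * np.pi / alpha)
print(numeric / estimate)  # -> 1 as alpha grows
```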
The same idea works for highly oscillatory integrals of the form
$$I_2 = \int_a^b dx\,\exp(i f(x)), \qquad (1.50)$$
where the real-valued, continuous f(x) has a real stationary point x₀, f′(x₀) = 0. The
integrand oscillates least at the stationary point, thus guaranteeing that the stationary point
and its vicinity make the dominant contribution to the integral. The statement just made
may be a bit confusing because the integrand, considered as a function of x, is oscillatory,
making, formally, the integral over x highly sensitive to the positions of the ends of the
interval. To make the statement sensible, consider shifting the contour of integration into the
complex plane so that it crosses the real axis at x₀ along the special direction where
if″(x₀)(x − x₀)² has a maximum at x₀, thus making the resulting integrand decay fast (locally along the

contour) as |x − x₀| increases. One derives
$$I_2 \approx \exp(i f(x_0))\int dx\,\exp\!\left(i f''(x_0)(x - x_0)^2/2\right) = \sqrt{\frac{2\pi}{|f''(x_0)|}}\,\exp\!\left(i f(x_0) + i\,\mathrm{sign}(f''(x_0))\,\pi/4\right),$$
where the dependence on the interval’s end-points disappears (in the limit of sufficiently large
|f″(x₀)|).
Now, in the most general case (of the saddle-point method), we consider the contour
integral
$$I_3 = \int_C dz\,\exp(f(z)), \qquad (1.51)$$
assuming that f(z) is analytic along the contour, C, and also within a domain, D, of the
complex plane in which the contour is embedded. Let us also assume that there exists a
point, z₀, within D where f′(z₀) = 0. This point is called a saddle-point because the iso-lines
of f(z) in the vicinity of z₀ show a saddle: a minimum and a maximum along two orthogonal
directions. Deforming C such that it passes through z₀ along the “maximal” path (where
Re f(z) reaches its maximum at z₀), one arrives at the following saddle-point estimate:
$$I_3 \to \sqrt{\frac{2\pi}{-f''(z_0)}}\,\exp(f(z_0)). \qquad (1.52)$$

Regarding the applicability of the saddle-point approximation: the approximation is
based on truncating the Taylor expansion of f(z) around z₀, which is justified if f(z)
changes significantly over the region where the expansion applies, i.e. |f″(z₀)|R² ≫ 1, where
R is the radius of convergence of the Taylor series expansion of f(z) around z₀.
Two remarks are in order. First, let us emphasize that f(z₀) and f″(z₀) can both be
complex. Second, there may be a number (more than one) of saddle points in the region of
analyticity of f(z). In this case one picks the saddle-point achieving the maximal value of
Re f(z₀). In the case of degeneracy, i.e. when multiple saddle-points achieve the same value,
as in Example 1.4.1, one deforms the contour to pass through all of the saddle-points, then
replacing the rhs of Eq. (1.52) by the sum of the saddle-point contributions.

Exercise 1.4.2. Estimate the following integrals
$$\text{(a)}\quad \int_{-\infty}^{+\infty} dx\,\cos\!\left(\alpha x^2 - x^3/3\right),$$
$$\text{(b)}\quad \int_{-\infty}^{+\infty} dx\,\exp\!\left(-x^4/4\right)\cos(\alpha x),$$
at sufficiently large positive α through the saddle-point approximation.


Chapter 2

Fourier Analysis

Fourier analysis is the study of the way functions may be represented or approximated by an
integral, or a sum, of oscillatory basis functions. The process of decomposing a function into
its oscillatory components, and the inverse process of recomposing the function from these
components, are two themes of Fourier analysis. When the oscillatory components take a
continuous range of wave-numbers (or frequencies), the decomposition and recomposition
is achieved by integration, and is referred to as the Fourier transform and inverse Fourier
transform. When the oscillatory components take a discrete range of wave-numbers (or
frequencies), the decomposition and recomposition is achieved by summation, and is referred
to as a Fourier Series.
Fourier analysis grew from the study of Fourier series which is credited to Joseph Fourier
for showing that the study of heat transfer is greatly simplified by representing a function
as a sum of trigonometric basis functions. The original concept of Fourier analysis has been
extended over time to apply to more general and abstract situations, and the field is now
often called harmonic analysis.

2.1 The Fourier Transform and Inverse Fourier Transform


Certain functions f(x) can be expressed by the representation, known as the Fourier inte-
gral,
$$f(x) = \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} dk\,\exp\!\left(i k^T x\right)\hat{f}(k), \qquad (2.1)$$
where k = (k₁, · · · , k_d) is the “wave-vector”, dk = dk₁ · · · dk_d, and f̂(k) is the Fourier
transform of f(x), defined according to
$$\hat{f}(k) := \int_{\mathbb{R}^d} dx\,\exp\!\left(-i k^T x\right) f(x). \qquad (2.2)$$


Eq. (2.1) and Eq.(2.2) are inverses of each other (meaning, for example, that substituting
Eq. (2.2) into Eq. (2.1) will recover f (x)), and it is for this reason that the Fourier integral
is also called the Inverse Fourier Transform. Proofs that they are inverses, as well as
other important properties of the Fourier Transform, rely on Dirac’s δ-function which in
d-dimensions can be defined as
$$\delta(x) := \frac{1}{(2\pi)^d}\int_{\mathbb{R}^d} dk\,\exp(i k^T x). \qquad (2.3)$$

We will discuss Dirac’s δ-function in section 2.3, primarily for d = 1.


At first glance, it might appear that the appropriate class of functions for which Eq. (2.1)
is defined is one where both f (x) and fˆ(k) are integrable. We will demonstrate how the
definition of the δ-function permits Eq. (2.1) to be defined over a wider class of functions
in section 2.4. More careful consideration of the function spaces to which f (x) and fˆ(k)
belong will be addressed in the Theory course (Math527).
In the interest of maintaining compact notation and clear explanations, important prop-
erties for the Fourier Transform will be presented for the one dimensional case (section 2.2),
but each property applies to the more general d-dimensional Fourier transform. There are
only a few functions whose Fourier transform can be expressed by a closed-form
representation; see section 2.4.
Remark. There are alternative definitions for the Fourier transform and its inverse; some
authors place the multiplicative constant of (2π)−d in the definition of fˆ(k), other authors
prefer the ‘symmetric’ definition where both f (x) and fˆ(k) are multiplied by (2π)−d/2 , and
still others place a 2π in the complex exponential. It is important to read widely during
graduate school, but be warned that the specific results you find will depend on the exact
definitions used by the author.

2.2 Properties of the 1-D Fourier Transform


In the d = 1 case, x may play the role of the spatial coordinate or of time. When x is the
spatial coordinate, the spectral variable k is often called the wave number, which is the one
dimensional version of the wave vector. When x is time, k is often called frequency and
given the symbol ω. The spatial and temporal terminologies are interchangeable.
Linearity: Let h(x) = af(x) + bg(x), where a, b ∈ C; then
$$\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x)e^{-ikx} = \int_{\mathbb{R}} dx\,(af(x) + bg(x))e^{-ikx} = a\int_{\mathbb{R}} dx\, f(x)e^{-ikx} + b\int_{\mathbb{R}} dx\, g(x)e^{-ikx} = a\hat{f}(k) + b\hat{g}(k). \qquad (2.4)$$

Spatial/Temporal Translation: Let h(x) = f(x − x₀), where x₀ ∈ ℝ; then, substituting x′ = x − x₀,
$$\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x)e^{-ikx} = \int_{\mathbb{R}} dx\, f(x - x_0)e^{-ikx} = \int_{\mathbb{R}} dx'\, f(x')\,e^{-ikx' - ikx_0} = e^{-ikx_0}\hat{f}(k). \qquad (2.5)$$

Frequency Modulation: For any real number k₀, if h(x) = exp(ik₀x)f(x), then
$$\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x)e^{-ikx} = \int_{\mathbb{R}} dx\, f(x)\,e^{ik_0 x}e^{-ikx} = \int_{\mathbb{R}} dx\, f(x)\,e^{-i(k - k_0)x} = \hat{f}(k - k_0). \qquad (2.6)$$

Spatial/Temporal Rescaling: For a non-zero real number a, if h(x) = f(ax), then, with x′ = ax,
$$\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x)e^{-ikx} = \int_{\mathbb{R}} dx\, f(ax)e^{-ikx} = |a|^{-1}\int_{\mathbb{R}} dx'\, f(x')\,e^{-ikx'/a} = |a|^{-1}\hat{f}(k/a). \qquad (2.7)$$

The case a = −1 leads to the time-reversal property: if h(t) = f (−t), then ĥ(ω) = fˆ(−ω).
Complex Conjugation: If h(x) is the complex conjugate of f(x), that is, if h(x) = (f(x))*,
then
$$\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x)e^{-ikx} = \int_{\mathbb{R}} dx\,(f(x))^{*}e^{-ikx} = \left(\int_{\mathbb{R}} dx\, f(x)\,e^{ikx}\right)^{*} = \left(\hat{f}(-k)\right)^{*}. \qquad (2.8)$$

Exercise 2.2.1. Verify the following consequences of complex conjugation:

(a) If f is real, then fˆ(−k) = (fˆ(k))* (this implies that fˆ is a Hermitian function).

(b) If f is purely imaginary, then fˆ(−k) = −(fˆ(k))*.

(c) If h(x) = Re(f(x)), then ĥ(k) = ½ (fˆ(k) + (fˆ(−k))*).

(d) If h(x) = Im(f(x)), then ĥ(k) = (1/2i) (fˆ(k) − (fˆ(−k))*).

Exercise 2.2.2. Show that the Fourier transform of a radially symmetric function in two
variables, i.e. f(x₁, x₂) = g(r) where r² = x₁² + x₂², is also radially symmetric, i.e.
fˆ(k₁, k₂) = fˆ(ρ) where ρ² = k₁² + k₂².

Differentiation: If h(x) = f′(x), then, under the assumption that |f(x)| → 0 as x → ±∞,
$$\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x)e^{-ikx} = \int_{\mathbb{R}} dx\, f'(x)e^{-ikx} = \left[f(x)e^{-ikx}\right]_{-\infty}^{\infty} - \int_{\mathbb{R}} dx\,(-ik)f(x)e^{-ikx} = (ik)\hat{f}(k). \qquad (2.9)$$

Integration: Substituting k = 0 in the definition, we obtain fˆ(0) = ∫_{−∞}^{∞} f(x) dx. That is,
the evaluation of the Fourier transform at the origin, k = 0, equals the integral of f over
its entire domain.
Proofs for the following two properties rely on the use of the δ-function (which will
not be addressed until section 2.3), and require more careful consideration of integrability
(which is beyond the scope of this brief introduction). The following two properties are
added here so that a complete list of properties appears in a single location.

Unitarity [Parseval/Plancherel Theorem]: For any function f such that ∫|f| dx < ∞
and ∫|f|² dx < ∞,
$$\int_{-\infty}^{\infty} dx\,|f(x)|^2 = \int_{-\infty}^{\infty} dx\, f(x)\overline{f(x)} = \int_{-\infty}^{\infty} dx\left(\int_{-\infty}^{\infty} \frac{dk_1}{2\pi}\, e^{ik_1 x}\hat{f}(k_1)\right)\left(\int_{-\infty}^{\infty} \frac{dk_2}{2\pi}\, e^{-ik_2 x}\,\overline{\hat{f}(k_2)}\right) = \frac{1}{2\pi}\int_{-\infty}^{\infty} dk\,|\hat{f}(k)|^2. \qquad (2.10)$$

Definition 2.2.1. The integral convolution of the function f with the function g is defined
as
$$(g * f)(x) := \int_{\mathbb{R}} dy\, g(x - y) f(y). \qquad (2.11)$$

Convolution: Suppose that h is the integral convolution of f with g, that is, h(x) = (g ∗ f)(x); then
$$\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x)e^{-ikx} = \int_{\mathbb{R}} dx\int_{\mathbb{R}} dy\, g(x - y)f(y)\,e^{-ikx} = \hat{g}(k)\hat{f}(k). \qquad (2.12)$$
(The middle step is carried out in detail in Proposition 2.3.3 below.)
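A classic computational illustration of Eq. (2.11) is the moving average: convolving a noisy signal with a short, normalized box kernel smooths it. A minimal sketch (the signal and kernel width are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 500)
signal = np.sin(x) + 0.3 * rng.standard_normal(x.size)

w = 25
kernel = np.ones(w) / w  # normalized box: convolution = moving average
smoothed = np.convolve(signal, kernel, mode="same")

# The moving average suppresses the noise while tracking sin(x).
print(np.mean(np.abs(signal - np.sin(x))), np.mean(np.abs(smoothed - np.sin(x))))
```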

The convolution of a function f with a kernel g is defined in Eq. (2.11). Consider
whether there exists a convolution kernel g resulting in the projection of a function to itself.
That is, can we find a g such that (f ∗ g) = f for arbitrary functions f? If such a g were to
exist, what properties would it have?
Heuristically, we could argue that such a function would have to be both localized and
unbounded. Localized because, for the convolution ∫ dy g(x − y)f(y) to “pick out” f(x),
g(x − y) must be zero for all x ≠ y. Unbounded because we also need g(x − y) to be
sufficiently large at x = y to ensure that the integral on the RHS of Eq. (2.11) could be
nonzero.

Such a degree of ‘un-boundedness’ at such a localized point is impossible under the
traditional theory of functions, but nonetheless such a g(x) was introduced by Paul Dirac
in the context of quantum mechanics. It was not until the 1940s that Laurent Schwartz
developed a rigorous theory for such ‘functions’, which became known as the theory of
distributions. We usually denote this ‘function’ by δ(x) and call it the (Dirac) δ-function.
See [1] (ch. 4) for more details.

2.3 Dirac’s δ-function.


2.3.1 The δ-function as the limit of a δ-sequence

We begin our study of Dirac’s δ-function by considering the sequence of functions given by
$$f_\epsilon(x) = \begin{cases} 1/\epsilon, & |x| \le \epsilon/2, \\ 0, & |x| > \epsilon/2. \end{cases} \qquad (2.13)$$

The pointwise limit of f_ε is clearly zero for all x ≠ 0, and therefore the integral of the limit
of f_ε must also be zero:
$$\lim_{\epsilon\to 0} f_\epsilon(x) = 0 \;\Rightarrow\; \int_{-\infty}^{\infty} dx\,\lim_{\epsilon\to 0} f_\epsilon(x) = 0. \qquad (2.14)$$
However, for any ε > 0, the integral of f_ε is clearly unity, and therefore the limit of the
integrals of f_ε must also be unity:
$$\int_{-\infty}^{\infty} dx\, f_\epsilon(x) = 1 \;\Rightarrow\; \lim_{\epsilon\to 0}\int_{-\infty}^{\infty} dx\, f_\epsilon(x) = 1. \qquad (2.15)$$

Although Eq. (2.14) suggests that f_ε(x) may not be very interesting as a function, the
behavior demonstrated by Eq. (2.15) motivates the use of f_ε(x) as a functional¹. For any
sufficiently nice function φ(x), define the functionals f_ε[φ] and f[φ] by
$$f[\phi] := \lim_{\epsilon\to 0} f_\epsilon[\phi] := \lim_{\epsilon\to 0}\int_{-\infty}^{\infty} dx\, f_\epsilon(x)\phi(x). \qquad (2.16)$$
The behavior of f[φ] can be demonstrated by approximating the corresponding integrals
f_ε[φ] for each ε > 0:
$$f_\epsilon[\phi] = \int_{-\infty}^{\infty} dx\, f_\epsilon(x)\phi(x) = \frac{1}{\epsilon}\int_{-\epsilon/2}^{\epsilon/2} dx\,\phi(x).$$
¹ In casual terms, a function takes numbers as inputs and gives numbers as outputs, whereas a functional
takes functions as inputs and gives numbers as outputs.

Letting m and M represent the minimum and maximum values of φ(x) on the interval
−ε/2 < x < ε/2 gives the bounds
$$m \le f_\epsilon[\phi] \le M.$$
If φ is continuous at x = 0, the limit of f_ε[φ] as ε → 0 is given by
$$f[\phi] = \lim_{\epsilon\to 0} f_\epsilon[\phi] = \phi(0).$$
In summary, f[φ] evaluates its argument at the point x = 0.
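Numerically, one can watch f_ε[φ] converge to φ(0) as ε shrinks. A minimal sketch (the test function φ is our choice):

```python
import numpy as np
from scipy.integrate import quad

phi = lambda x: np.cos(x) * np.exp(-x**2)  # smooth test function; phi(0) = 1

for eps in [1.0, 0.1, 0.01]:
    integral, _ = quad(phi, -eps / 2.0, eps / 2.0)
    print(eps, integral / eps)  # f_eps[phi] -> phi(0) = 1
```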


Now compare f_ε(x) to the sequence of functions given by
$$g_\epsilon(x) = \frac{1}{\pi}\,\frac{\epsilon}{x^2 + \epsilon^2}.$$
The pointwise limit of g_ε(x) is also zero for every x ≠ 0, so, as before, the integral of the
limit must be zero:
$$\lim_{\epsilon\to 0} g_\epsilon(x) = 0 \;\Rightarrow\; \int_{-\infty}^{\infty} dx\,\lim_{\epsilon\to 0} g_\epsilon(x) = 0.$$
A suitable trigonometric substitution shows that the integral of g_ε(x) is also unity for each
ε > 0, and, as before, the limit of the integrals must be unity:
$$\int_{-\infty}^{\infty} dx\, g_\epsilon(x) = 1 \;\Rightarrow\; \lim_{\epsilon\to 0}\int_{-\infty}^{\infty} dx\, g_\epsilon(x) = 1.$$
As with f_ε(x), we can use g_ε(x) to define the functionals g_ε[φ] and g[φ] by
$$g[\phi] := \lim_{\epsilon\to 0} g_\epsilon[\phi] := \lim_{\epsilon\to 0}\int_{-\infty}^{\infty} g_\epsilon(x)\phi(x)\,dx.$$
This time it takes a little more thought to find the appropriate bounds, but with some
effort, it can be shown that
$$g[\phi] = \lim_{\epsilon\to 0} g_\epsilon[\phi] = \phi(0).$$
That is, g[φ] also evaluates its argument at the point x = 0.
The sequences f_ε(x) and g_ε(x) both have the same limiting behavior as functionals, and
are examples of what is known as a δ-sequence. Their limiting behavior leads us to the
definition of the δ-function, which is defined as the functional δ[φ] = φ(0).
Remark. The δ-function only makes sense in the context of an integral. Although it is
common practice to write expressions like δ(x)f(x), such expressions should always be
considered as ∫_ℝ dx δ(x)f(x).

Example 2.3.1. For b, c ∈ ℝ, show that cδ(x − b)f(x) = cf(b).
$$c\delta(x - b)f(x) = \int_{-\infty}^{\infty} dx\, c\delta(x - b)f(x) = c\int_{-\infty}^{\infty} dx'\,\delta(x')f(x' + b) = cf(b). \qquad (2.17)$$

Example 2.3.2. For a ∈ ℝ, a ≠ 0, show that δ(ax)f(x) = f(0)/|a|.
$$\delta(ax)f(x) = \int_{-\infty}^{\infty} dx\,\delta(ax)f(x) = \int_{-\infty}^{\infty} \frac{dx'}{|a|}\,\delta(x')f(x'/a) = f(0)/|a|. \qquad (2.18)$$

Corollary 2.3.1. Show that the Fourier transform of a δ-function is a constant.

Solution.
$$\hat{\delta}(k) = \int_{-\infty}^{\infty} dx\,\delta(x)e^{-ikx} = e^{-ik\cdot 0} = 1. \qquad (2.19)$$

Corollary 2.3.2. Show that
$$\delta(x) = \int_{-\infty}^{\infty} \frac{dk}{2\pi}\exp(ikx),$$
i.e. that the Fourier transform of a constant is a δ-function.
Solution. We identify the expression on the RHS as the inverse Fourier transform of the
function fˆ(k) = 1:
$$f(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} dk\, 1\cdot e^{ikx}. \qquad (2.20)$$
The constant function is not integrable in the traditional sense; the theory of distributions
allows us to give meaning to this integral. We know that δ is defined so that, for any suitable
function φ(x), ∫ dx δ(x)φ(x) = φ(0). Even though we cannot integrate f(x) directly, if
we can show that ∫ dx f(x)φ(x) = φ(0), then we can assert that f(x) = δ(x):
$$f[\phi] = \int_{-\infty}^{\infty} dx\,\phi(x)\int_{-\infty}^{\infty} \frac{dk}{2\pi}\, e^{ikx} = \int_{-\infty}^{\infty} \frac{dk}{2\pi}\int_{-\infty}^{\infty} dx\,\phi(x)\,e^{-ikx} = \int_{-\infty}^{\infty} \frac{dk}{2\pi}\,\hat{\phi}(k)\,e^{ik\cdot 0} = \phi(0), \qquad (2.21)$$
where in the second equality we changed the integration variable k → −k (which leaves the
integral unchanged) and exchanged the order of integration; the last equality is the inverse
Fourier transform of φ̂ evaluated at x = 0.
Since f[φ] = φ(0) for every suitable test function φ, we say that f(x) = δ(x).

Alternative Definitions of the δ-function

We have defined the δ-function as the limit of a particular δ-sequence, namely
the ‘top-hat’ function given in Eq. (2.13). One has to wonder whether there may be other
δ-sequences which give the same limit. For example, consider
$$\delta(t) = \lim_{\epsilon\to 0} \frac{2\epsilon t^2}{\pi(t^2 + \epsilon^2)^2}. \qquad (2.22)$$
To validate the suitability of Eq. (2.22) as an alternative definition of the δ-function, one
needs to check, first, that 2εt²/(π(t² + ε²)²) → 0 as ε → 0 for all t ≠ 0, and, second, that its
integral over t equals unity. (It is easy to evaluate this integral as a complex pole integral,
closing the contour, for example, over the upper half of the complex plane. Observing that
the integrand has a pole of the second order at t = iε, expanding it into a Laurent series
around iε and keeping the c₋₁ coefficient, and then using the Cauchy formula for the contour
integral, we confirm that the integral is equal to unity.)

Exercise 2.3.3. Validate the following asymptotic representations of the δ-function:
$$\text{(a)}\quad \delta(t) = \lim_{\epsilon\to 0}\frac{1}{\sqrt{\pi\epsilon}}\exp\!\left(-\frac{t^2}{\epsilon}\right), \qquad (2.23)$$
$$\text{(b)}\quad \delta(t) = \lim_{n\to\infty}\frac{1 - \cos(nt)}{\pi n t^2}. \qquad (2.24)$$
In many applications we deal with periodic functions. In this case one needs to consider
relations holding within a single period. In view of the extreme locality of the δ-function
(just explored), all the relations discussed above extend to this case.

Exercise 2.3.4. Prove that, for x on the interval (−π, π),
$$\lim_{r\to 1-0}\frac{1 - r^2}{2\pi\left(1 - 2r\cos(x) + r^2\right)} = \delta(x).$$

2.3.2 Using δ-functions to Prove Properties of Fourier Transforms

We now return to proving (1) that the Fourier Transform and the inverse Fourier Transform
are indeed inverses of each other, (2) Plancherel’s theorem, and (3) the convolution property.

Proposition 2.3.3. The Fourier Transform of the convolution of the function f with the
function g is the product fˆ(k)ĝ(k):
$$\widehat{g * f}(k) = \int_{-\infty}^{\infty} dx\int_{-\infty}^{\infty} dy\, g(x - y)f(y)\,e^{-ikx} \qquad (2.25)$$
$$= \int_{-\infty}^{\infty} dx\int_{-\infty}^{\infty} dy\int_{-\infty}^{\infty} \frac{dk_1}{2\pi}\int_{-\infty}^{\infty} \frac{dk_2}{2\pi}\,\hat{g}(k_1)\hat{f}(k_2)\exp\!\left(-ikx + ik_1(x - y) + ik_2 y\right)$$
$$= \int_{-\infty}^{\infty} dk_1\int_{-\infty}^{\infty} dk_2\,\hat{g}(k_1)\hat{f}(k_2)\,\frac{1}{2\pi}\int_{-\infty}^{\infty} dx\, e^{i(k_1 - k)x}\;\frac{1}{2\pi}\int_{-\infty}^{\infty} dy\, e^{i(k_2 - k_1)y}$$
$$= \int_{-\infty}^{\infty} dk_1\,\hat{g}(k_1)\,\delta(k_1 - k)\int_{-\infty}^{\infty} dk_2\,\hat{f}(k_2)\,\delta(k_2 - k_1) = \hat{g}(k)\hat{f}(k), \qquad (2.26)$$
where in the transition from the first to the second line we exchanged the order of integrations,
assuming that all the integrals involved are well-defined.

Proposition 2.3.4. Unitarity [Parseval/Plancherel Theorem]:
$$\int_{-\infty}^{\infty} dx\,|f(x)|^2 = \int_{-\infty}^{\infty} dx\, f(x)\overline{f(x)} = \int_{-\infty}^{\infty} dx\left(\int_{-\infty}^{\infty} \frac{dk_1}{2\pi}\, e^{ik_1 x}\hat{f}(k_1)\right)\left(\int_{-\infty}^{\infty} \frac{dk_2}{2\pi}\, e^{-ik_2 x}\,\overline{\hat{f}(k_2)}\right)$$
$$= \frac{1}{2\pi}\int_{-\infty}^{\infty} dk_1\int_{-\infty}^{\infty} dk_2\,\hat{f}(k_1)\,\overline{\hat{f}(k_2)}\,\frac{1}{2\pi}\int_{-\infty}^{\infty} dx\,\exp(ix(k_1 - k_2))$$
$$= \frac{1}{2\pi}\int_{-\infty}^{\infty} dk_1\int_{-\infty}^{\infty} dk_2\,\hat{f}(k_1)\,\overline{\hat{f}(k_2)}\,\delta(k_1 - k_2) = \frac{1}{2\pi}\int_{-\infty}^{\infty} dk\,|\hat{f}(k)|^2. \qquad (2.27)$$

Remark. Using the δ-function as the convolution kernel yields the self-convolution property:
$$f(x) = \int dy\,\delta(x - y)f(y). \qquad (2.28)$$

Consider a δ-function of a function, δ(f(x)). It can be transformed into the following sum
over the zeros of f(x):
$$\delta(f(x)) = \sum_n \frac{1}{|f'(y_n)|}\,\delta(x - y_n). \qquad (2.29)$$
To prove the statement, first of all recall that the δ-function is equal to zero at all points
where its argument is nonzero. Just this observation suggests that the answer is a sum of
δ-functions, and what is left is to establish the weights associated with each term in the sum.
Pick a contribution associated with a zero of f(x) and, integrating the resulting expression
over a small vicinity of the point, make the change of variable x → f:
$$\int dx\,\delta(f(x)) = \int \frac{df}{f'(x)}\,\delta(f).$$
Because of the δ(f) term in the integrand, which is nonzero only at the zero of f(x), we can
replace f′(x) by f′ evaluated at the zero and move it out of the integrand. The remaining
integral obviously depends on the sign of the derivative (the orientation of the df integration
flips when f′ < 0), which produces the absolute value |f′(y_n)| in Eq. (2.29).

2.3.3 The δ-function in Higher Dimensions

The d-dimensional δ-function, which was instrumental for introducing the d-dimensional
Fourier transform in Section 2.1, is simply a product of one-dimensional δ-functions:
δ(x) = δ(x₁) · · · δ(x_d).

Example 2.3.5. Compute the δ-function in polar spherical coordinates.

Example 2.3.5. Compute the δ-function in polar spherical coordinates.



2.3.4 The Heaviside Function and the Derivatives of the δ-function


One also derives from Eq. (2.28) that ∫_{−∞}^{∞} dx δ(x) = 1. This motivates the introduction
of a function associated with an incomplete integration of δ(x),
$$\theta(y) := \int_{-\infty}^{y} dx\,\delta(x) = \begin{cases} 0, & y < 0, \\ 1, & y > 0, \end{cases} \qquad (2.30)$$
called the Heaviside- or step-function.

Exercise 2.3.6. Prove the relation
$$\left(\frac{d^2}{dt^2} - \gamma^2\right)\exp(-\gamma|t|) = -2\gamma\,\delta(t). \qquad (2.31)$$
Hint: Yes, the step function will be useful in the proof.

One also gets, differentiating Eq. (2.30), that θ′(x) = δ(x). We can also differentiate
the δ-function. Indeed, integrating Eq. (2.28) by parts, and assuming that the respective
boundary terms vanish, one arrives at
$$\int dy\,\delta'(y - x)f(y) = -f'(x). \qquad (2.32)$$
Substituting f(x) = x g(x) into Eq. (2.32), one derives
$$x\,\delta'(x) = -\delta(x). \qquad (2.33)$$
Expanding f(x) in a Taylor series around x = y, ignoring terms of the second order (and
higher) in (x − y), and utilizing Eq. (2.33), one arrives at
$$f(x)\,\delta'(x - y) = f(y)\,\delta'(x - y) - f'(y)\,\delta(x - y). \qquad (2.34)$$
Notice that δ′(x) is skew-symmetric and that f(x)δ′(x − y) is not equal to f(y)δ′(x − y).
We have assumed so far that δ′(x) is convolved with a continuous function. To extend
this to the case of piece-wise continuous functions, with jumps or jumps in the derivative,
one needs to be more careful using integration by parts at the points of discontinuity of the
function. An exemplary function of this type is the Heaviside function just discussed. This
means that, if a function f(x) has a jump at x = y, its derivative allows the following
representation:
$$f'(x) = \left(f(y + 0) - f(y - 0)\right)\delta(x - y) + g(x), \qquad (2.35)$$
where f(y + 0) − f(y − 0) represents the value of the jump, and g(x) is finite at x = y. A
similar representation (involving δ′(x)) can be built for a function with a jump in its
derivative; then the δ(x) contribution is associated with the second derivative of f(x).

Exercise 2.3.7. Express tδ″(t) via δ′(t).



2.4 Closed form representation for select Fourier Transforms


There are a few functions for which the Fourier transforms can be written in closed form.

2.4.1 Elementary examples of closed form representations

Example 2.4.1. Show that the Fourier Transform of a δ-function is a constant.


Solution. See corollary 2.3.1 where we showed δ̂(k) = 1.

Example 2.4.2. Show that the Fourier Transform of a constant is a δ-function.


Solution. In corollary 2.3.2 we showed that the inverse Fourier transform of unity
is δ(x). A similar calculation shows that 1̂(k) = 2πδ(k).

Example 2.4.3. Show that the Fourier transform of a square pulse function is a sinc
function:
$$f(x) = \begin{cases} b, & |x| < a, \\ 0, & |x| > a, \end{cases} \qquad\Rightarrow\qquad \hat{f}(k) = \frac{2b}{k}\sin(ka).$$
Solution.
$$\hat{f}(k) = \int_{\mathbb{R}} dx\, f(x)e^{-ikx} = b\int_{-a}^{a} dx\, e^{-ikx} = \frac{b}{-ik}\left[e^{-ikx}\right]_{-a}^{a} = \frac{b}{-ik}\left(e^{-ika} - e^{ika}\right) = \frac{2b}{k}\sin(ka). \qquad (2.36)$$
Example 2.4.4. Show that the Fourier transform of a sinc function is a square pulse:
$$g(x) = \frac{\sin(ax)}{ax} \qquad\Rightarrow\qquad \hat{g}(k) = \begin{cases} \pi/a, & |k| < a, \\ 0, & |k| > a. \end{cases}$$

Example 2.4.5. Find the Fourier transform of a Gaussian function

f (x) = a exp(−bx2 ), a, b > 0.

Solution.
$$\hat{f}(k) = \int_{\mathbb{R}} dx\, f(x)e^{-ikx} = a\int_{-\infty}^{\infty} dx\, e^{-bx^2}e^{-ikx} = a\exp\!\left(-\frac{k^2}{4b}\right)\int_{-\infty}^{\infty} dx\,\exp\!\left(-b\left(x + \frac{ik}{2b}\right)^2\right)$$
$$= \frac{a}{\sqrt{b}}\exp\!\left(-\frac{k^2}{4b}\right)\int_{-\infty}^{\infty} dx'\, e^{-x'^2} = a\exp\!\left(-\frac{k^2}{4b}\right)\sqrt{\frac{\pi}{b}}. \qquad (2.37)$$
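Eq. (2.37) is easy to verify numerically (a sketch; a, b, k are arbitrary choices of ours, and the imaginary part of the transform vanishes by symmetry):

```python
import numpy as np
from scipy.integrate import quad

a, b, k = 1.3, 0.7, 2.0
numeric, _ = quad(lambda x: a * np.exp(-b * x**2) * np.cos(k * x),
                  -np.inf, np.inf)
exact = a * np.exp(-k**2 / (4.0 * b)) * np.sqrt(np.pi / b)
print(numeric, exact)  # the two values agree
```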

Exercise 2.4.6. Find the Fourier transform of f(x) = 1/(x⁴ + a⁴).
x4 + a4
Exercise 2.4.7. Find the Fourier transform of f (x) = sech(ax).

Exercise 2.4.8. Verify the following Fourier transform pair:

(a) Let a > 0. Show that
$$f(x) := \frac{1}{x^2 + a^2} \qquad\Rightarrow\qquad \hat{f}(k) = \frac{\pi}{a}\, e^{-a|k|}.$$

(b) Let a > 0. Show that
$$g(x) := e^{-a|x|} \qquad\Rightarrow\qquad \hat{g}(k) = \frac{2a}{k^2 + a^2}.$$

2.4.2 More complex examples of closed form representations

We can find closed form representations of other functions by combining the examples above
with the properties in section 2.2.

Example 2.4.9. This problem is fantastically difficult. Let f(t) be given by
$$f(t) = \begin{cases} \cos(\omega_0 t), & |t| < A, \\ 0, & \text{otherwise}, \end{cases}$$
where ω₀ and A are fixed, and A > 0.

(a) Compute fˆ(k), the Fourier transform of f, as a function of ω₀ and A.

(b) Identify the relationship between the continuity of f and ω₀ and A, and discuss how
this affects the decay of the Fourier coefficients as |k| → ∞.

Solution (sketch). Writing cos(ω₀t) = (e^{iω₀t} + e^{−iω₀t})/2 and integrating each exponential
over [−A, A] gives
$$\hat{f}(k) = \frac{\sin((k - \omega_0)A)}{k - \omega_0} + \frac{\sin((k + \omega_0)A)}{k + \omega_0}.$$
For part (b): f is continuous at |t| = A iff cos(ω₀A) = 0. When cos(ω₀A) ≠ 0, f has jumps
at t = ±A and fˆ(k) decays only as 1/|k|; at the special values ω₀A = π/2 + πn the jumps
disappear and the decay is faster (cf. the Riemann-Lebesgue discussion in section 2.6).

Exercise 2.4.10. Let
$$f_a(x) := \frac{2a}{a^2 + (4\pi x)^2}$$
for a ∈ C with Re(a) > 0. If also b ∈ C with Re(b) > 0, show that
$$f_a * f_b = f_{a+b}.$$

Exercise 2.4.11. Show the following:

(a) Show that the Fourier transform of
$$g(x) := \exp(iax)f(bx) \quad\text{is}\quad \hat{g}(k) = \frac{1}{|b|}\,\hat{f}\!\left(\frac{k - a}{b}\right).$$

(b) Show that the Fourier transform of
$$f(x) = \frac{\sin^2(x)}{x} \quad\text{is}\quad \hat{f}(k) = -\frac{i\pi}{2}\left(\Pi(k - 1) - \Pi(k + 1)\right),$$
where
$$\Pi(k) = \begin{cases} 1, & |k| \le 1, \\ 0, & |k| > 1. \end{cases}$$

2.4.3 Closed form representations in higher dimensions

Exercise 2.4.12. Let x = (x₁, x₂, . . . , x_d) ∈ ℝ^d, and use the notation |x| to represent
√(x₁² + x₂² + · · · + x_d²). Find the Fourier transform of

(a) g(x) = exp(−|x|²).

(b) (Bonus) h(x) = exp(−|x|) for d = 3 (i.e. in three dimensions).

2.5 Fourier Series


Fourier Series is a version of the Fourier Integral which is used when the function is periodic
or of a finite support (nonzero within a finite interval). As in the case of the Fourier
Integral/Transform, we will mainly focus on the one-dimensional case. Generalization of
the Fourier Series approach to a multi-dimensional case is, typically, straightforward.
Consider a periodic function with period L. We can represent it in the form of a series
over the following standard set of periodic exponentials (harmonics), exp(2πinx/L):
$$f(x) = \sum_{n=-\infty}^{\infty} f_n \exp(2\pi i n x/L). \qquad (2.38)$$

This, so-called Fourier series, representation of a periodic function immediately shows that
the Fourier Series is a particular case of the Fourier integral. Indeed, a periodic function
can be viewed as the convolution of a function with finite support in [0, L] with a sum of
δ-functions, so that its Fourier density is a sum of δ-functions:
$$f(x) = \sum_{n=-\infty}^{\infty} f_n\exp(2\pi i n x/L)\int_{-\infty}^{\infty} dk\,\delta(k - n) = \int_{-\infty}^{\infty} dk\,\exp\!\left(\frac{2\pi i k x}{L}\right)\sum_{n=-\infty}^{\infty} f_n\,\delta(k - n). \qquad (2.39)$$

One can also consider a function with finite support on [0, L], i.e. one which is equal
to zero outside of the interval. The Fourier transform of this function and its (standard)
Inverse Fourier Transform are
$$\hat{f}(k) = \int_0^L \frac{dx}{2\pi}\, f(x)\exp(-ikx), \qquad f(x) = \int_{-\infty}^{\infty} dk\, e^{ikx}\hat{f}(k). \qquad (2.40)$$
Obviously, if x ∈ [0, L], the assumption of periodicity and the assumption of finite support
are equivalent. Then, comparing Eq. (2.39) and Eq. (2.40), one arrives at
$$f_n = \int_0^L \frac{dx}{L}\, f(x)\exp\!\left(-2\pi i\frac{nx}{L}\right), \qquad (2.41)$$
which is the inverse Fourier Series relation for periodic and/or finite support functions.
Notice that one may also consider the Fourier Transform/Integral as a limit of the Fourier
Series. Indeed, in the case when the typical scale of variation of f(x) is much less than L,
many harmonics are significant, and the Fourier series turns into the Fourier integral:
$$\sum_n \cdots \;\to\; \frac{L}{2\pi}\int_{-\infty}^{\infty} dk\,\cdots. \qquad (2.42)$$

Let us illustrate the expansion of a function into a Fourier series on the example of
f(x) = exp(αx), considered on the interval 0 < x < 2π. In this case the Fourier coefficients
are
$$f_n = \int_0^{2\pi} \frac{dx}{2\pi}\,\exp(-inx + \alpha x) = \frac{1}{2\pi}\,\frac{1}{\alpha - in}\left(e^{2\pi\alpha} - 1\right). \qquad (2.43)$$
Notice that, as n → ∞, f_n ∼ 1/n. As discussed in more detail in the following section,
the slow decay of the Fourier coefficients is associated with the fact that f(x), when
considered as a periodic function over the reals with period 2π, has discontinuities (jumps)
at 0, ±2π, ±4π, · · · .
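A short numerical confirmation of Eq. (2.43) and of the 1/n decay (a sketch; α and the quadrature grid are our choices):

```python
import numpy as np

alpha = 0.5
x = np.linspace(0.0, 2.0 * np.pi, 8001)
f = np.exp(alpha * x)

# Fourier coefficients f_n by direct quadrature vs. the closed form (2.43).
for n in [1, 10, 100]:
    fn = np.trapz(f * np.exp(-1j * n * x), x) / (2.0 * np.pi)
    exact = (np.exp(2.0 * np.pi * alpha) - 1.0) / (2.0 * np.pi * (alpha - 1j * n))
    print(n, abs(fn), abs(exact))  # |f_n| ~ 1/n at large n
```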

Exercise 2.5.1. Expand (a) f (x) = x, and (b) g(x) = |x|, both defined on the interval
−π < x < π, in the Fourier series. Describe the difference between (a) and (b) in the
dependence of the n-th Fourier coefficient on n.

Let us conclude this section by recalling that, in constructing the Fourier Series (and also
Fourier Integrals), we assume that the set of harmonic functions forms a complete basis
for properly integrable functions. Proving this assumption requires extra work which is not
done in this course. Instead this proof (as well as many other proofs) is left for detailed
discussion in the companion Math 525 course of the core AM series.

2.6 Riemann-Lebesgue Lemma


The Fourier series is infinite (contains an infinite number of terms), and is thus computationally
prohibitive; one common approximation approach consists in truncating it.
The Riemann-Lebesgue Lemma helps to justify the truncation. The Lemma states that,
for any integrable function f, the Fourier coefficients f_n must decay as n → ∞.

Theorem 2.6.1 (Riemann-Lebesgue Lemma). If f(x) ∈ L¹, i.e. if the Lebesgue integral
of |f| is finite, then lim_{n→∞} f_n = 0.

We will not prove the Riemann-Lebesgue lemma here, but notice that a standard proof is
based on (a) showing that the lemma works for the case of the characteristic function of a
finite open interval in ℝ¹, where f(x) is constant within ]a, b[ and zero otherwise, (b) extending
it to simple functions over ℝ¹, i.e. functions which are piece-wise constant, and then (c)
building a sequence of simple functions (which are dense in L¹) approximating f(x) more
and more accurately.
Let us mention the following useful corollary of the Riemann-Lebesgue Lemma: for
any periodic function f(x) with continuous derivatives up to order m, integration by parts
can be performed the respective number of times to show that the n-th Fourier coefficient
is bounded at sufficiently large n according to |f_n| ≤ C/|n|^{m+2}, where C = O(1).
In particular, and consistently with the example above, we observe that in the case
of a “jump”, corresponding to continuous anti-derivative, i.e. m = −1, |fn | is O(1/n)
asymptotically at n → ∞. In the case of a “ramp”, i.e. m = 0 with continuous function
but discontinuous derivative, |fn | becomes O(1/n2 ) at n → ∞. For the analytic function,
with all derivatives continuous, |fn | decays faster than polynomially as n increases.
Further details of the Lemma, as well as the general discussion of how the material of
this Section is related to material discussed in the theory course (Math 527) and also the
algorithm course (Math 575), will be given at an inter-core recitation session.

2.7 Gibbs Phenomenon


One also needs to be careful with Fourier Series truncation because of the so-called
Gibbs phenomenon, named after J. Willard Gibbs, who described it in 1899. (Apparently,
the phenomenon was discovered earlier, in 1848, by Henry Wilbraham.) The phenomenon
represents an unusual behavior of a truncated Fourier Series built to represent a piece-wise
continuous periodic function. The Gibbs phenomenon involves both the fact that Fourier
sums overshoot at a jump discontinuity, and that this overshoot does not die out as more
terms are added to the sum.

Consider the following classic example of a square wave:
$$f(x) = \begin{cases} \pi/4, & \text{if } 2n\pi \le x \le (2n + 1)\pi, \quad n = 0, 1, 2, \ldots \\ -\pi/4, & \text{if } (2n + 1)\pi \le x \le (2n + 2)\pi, \quad n = 0, 1, 2, \ldots \end{cases} \qquad (2.44)$$
$$= \sum_{n=0}^{\infty} \frac{\sin((2n + 1)x)}{2n + 1}, \qquad (2.45)$$
where the definition of the function is in the first line and the second line gives the expression
for the function in terms of its Fourier series. Notice that the 2π-periodic function jumps
at 2nπ by π/2.
Let us truncate the series in Eq. (2.45) and thus consider the N-th partial Fourier Series
$$S_N(x) = \sum_{n=0}^{N} \frac{\sin((2n + 1)x)}{2n + 1}. \qquad (2.46)$$

The Gibbs phenomenon consists in the following observation: as N → ∞, the error of the
approximation around the jump-points is reduced in width and energy (integral), but its
height converges to a fixed value. See the movie-style visualization (from Wikipedia) of how
S_N(x) evolves with N. (It is also reproduced in a julia-snippet available at the class D2L
repository.)
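For readers without access to the class repository, here is a minimal Python analogue of that snippet (our own sketch; the Julia version in the D2L repository is the reference):

```python
import numpy as np

def S_N(x, N):
    """Partial Fourier sum (2.46) of the square wave (2.44)."""
    n = np.arange(N + 1)
    return np.sum(np.sin(np.outer(x, 2 * n + 1)) / (2 * n + 1), axis=1)

x = np.linspace(1e-4, np.pi / 2.0, 20000)
for N in [10, 100, 1000]:
    overshoot = S_N(x, N).max() - np.pi / 4.0
    print(N, overshoot)  # stays near ~0.14 instead of decaying
```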
Let us now back up this simulation by an analytic estimate and compute the limiting
value of the partial Fourier Series at the point of the jump. Notice that
$$\frac{d}{d\epsilon} S_N(\epsilon) = \sum_{n=0}^{N} \cos((2n + 1)\epsilon) = \frac{\sin(2(N + 1)\epsilon)}{2\sin\epsilon}, \qquad (2.47)$$
where we have utilized the formula for the sum of a geometric progression. Observe that
dS_N(ε)/dε → N + 1 as ε → 0; that is, the derivative is large (when N is large) and positive.
Therefore, S_N(ε) grows with ε to reach its first (closest to ε = 0) maximum at
ε* = π/(2(N + 1)). Now we estimate the value of S_N(ε*):
$$S_N(\epsilon_*) = \sum_{n=0}^{N} \frac{\sin\!\left(\frac{(2n + 1)\pi}{2(N + 1)}\right)}{2n + 1} \;\xrightarrow{\,N\to\infty\,}\; \frac{1}{2}\int_0^\pi \frac{\sin t}{t}\, dt \approx \frac{\pi}{4} + 0.14, \qquad (2.48)$$
where the limit is the Riemann-sum limit of the sum. We thus observe that, at the maximum
closest to zero, the partial sum systematically overshoots the limiting value, f(0⁺) = π/4,
by an O(1) amount.

Exercise 2.7.1. Generalize the two functions from Exercise 2.5.1 beyond the [−π, π) inter-
val, so they are 2π-periodic function on [−5π, 5π). Compute the respective partial Fourier

Figure 2.1: The integration contour, C, in Eq. (2.50) is shown in red. C is often drawn as a
straight line in the complex plane from c − i∞ to c + i∞, where c is an infinitesimally small
positive number. Possible singularities of the LT Φ̃(k) may only be at points with negative
real part, shown schematically as blue dots.

series SN (x) for select N , and study numerically (or theoretically!) how the amplitude
and the width of the oscillations near the points x = mπ, m ∈ {−5, −4, . . . , 4} behave as
N → ∞.

We complete the discussion of the Fourier Series by mentioning its, arguably, most significant
application: the field of differential equations. Even though some differential equations can
be analyzed or even solved analytically (this is the prime focus of the next two chapters of
the course), most differential equations of interest can only be solved numerically. Looking
for the solution of an ODE or PDE in terms of a Fourier series, and then truncating the
series to a finite sum, represents one of the most powerful numerical methods in the arsenal
of Applied Mathematics. This, so-called spectral method, is to be discussed in the algorithm
(Math 575) course of the core series.

2.8 Laplace Transform


The Laplace Transform (LT) may be considered as a Fourier transform applied to functions
which are nonzero only at t ≥ 0. The LT is defined as
$$\tilde{\Phi}(k) = \int_0^\infty dt\,\exp(-kt)\,\Phi(t). \qquad (2.49)$$
We consider complex k and require that the integral on the right hand side of Eq. (2.49) is
convergent (finite) at sufficiently large Re(k). In other words, Φ̃(k) is analytic at Re(k) > C,
where C is a positive constant.

The Inverse Laplace Transform (ILT) is defined as the complex integral
$$\Phi(t) = \frac{1}{2\pi i}\int_C dk\,\exp(kt)\,\tilde{\Phi}(k) \qquad (2.50)$$
over the contour, C, shown in Fig. (2.1). C can be deformed arbitrarily within the domain,
Re(k) > 0, of Φ̃(k) analyticity. Note that, by construction, and consistently with the
requirement imposed on Φ(t), the integral on the right hand side of Eq. (2.50) is equal to
zero at t < 0. Indeed, given that Φ̃(k) is analytic at Re(k) > 0 and approaches zero as
k → ∞, the contour C can be collapsed to surround ∞, which is also a non-singular point
for the integrand, thus resulting in zero for the integral.
It is instructive to illustrate similarities and differences between Laplace and Fourier
transforms on examples.
Consider the one-sided exponential:
$$f(t) = \theta(t)\exp(-\alpha t), \quad \alpha > 0, \qquad (2.51)$$
$$\hat{f}(\omega) = \frac{\alpha - i\omega}{\alpha^2 + \omega^2}, \qquad (2.52)$$
$$\tilde{f}(s) = \frac{1}{s + \alpha}, \quad \mathrm{Re}(s + \alpha) > 0, \qquad (2.53)$$
which turns into the step function, θ(t), at α → 0⁺:
$$f(t) = \theta(t), \qquad (2.54)$$
$$\hat{f}(\omega) = \pi\delta(\omega) - \frac{i}{\omega}, \qquad (2.55)$$
$$\tilde{f}(s) = \frac{1}{s}. \qquad (2.56)$$
Shifting and rescaling the step-function, we arrive at the following expressions for the sign
function:
$$f(t) = \mathrm{sign}(t), \qquad (2.57)$$
$$\hat{f}(\omega) = -\frac{2i}{\omega}, \qquad (2.58)$$
$$\tilde{f}(s) = \frac{1}{s}. \qquad (2.59)$$
Exercise 2.8.1. Find the Laplace Transform of (a) Φ(t) = exp(−λt), (b) Φ(t) = tⁿ, (c)
Φ(t) = cos(νt), (d) Φ(t) = cosh(λt), (e) Φ(t) = 1/√t. Show details.

Exercise 2.8.2. Find the Inverse Laplace Transform of 1/(k² + a²). Show details.
Part II

Differential Equations

Chapter 3

Ordinary Differential Equations.

A differential equation (DE) is an equation that relates an unknown function and its deriva-
tives to other known functions or quantities. Solving a DE amounts to determining the
unknown function. For a DE to be fully determined, it is necessary to define auxiliary
information, typically available in the form of initial or boundary data.
Often several DE’s may be coupled together in a system of DE’s. Since this is equivalent
to a DE of a vector-valued function, we will use the term “differential equation” to refer
to both single equations and systems of equations and the term “function” to refer to both
scalar- and vector-valued functions. We will distinguish between the singular and plural
only when relevant.
The function to be determined may be a function of a single independent variable, (e.g.
u = u(t) or u = u(x)) in which case the differential equation is known as an ordinary
differential equation, or it may be a function of two or more independent variables, (e.g.
u = u(x, y), or u = u(t, x, y, z)) in which case the differential equation is known as a partial
differential equation.
The order of a differential equation is defined as the largest integer n for which the nth
derivative of the unknown function appears in the differential equation.
The most general differential equation is equivalent to the condition that a nonlinear func-
tion of an unknown function and its derivatives is equal to zero. An ODE is linear if the
condition is linear in the function and its derivatives. We call the ODE linear homogeneous
if, in addition, the condition is both linear and homogeneous in the function and its deriva-
tives. It follows, for a homogeneous linear ODE, that if f(x) is a solution, so is cf(x),
where c is a constant. A linear differential equation that fails the condition of homogeneity
is called inhomogeneous. For example, an nth order, inhomogeneous ordinary differential
equation is one that can be written as α_n(t)u^{(n)}(t) + · · · + α_1(t)u′(t) + α_0(t)u(t) = f(t),


where α_i(t), i = 0, . . . , n, and f(t) are known functions. Typical methods for solving linear
differential equations often rely on the fact that a linear combination of two or more solu-
tions to the homogeneous DE is yet another solution, and hence the particular solution can
be constructed from a basis of general solutions. This cannot be done for nonlinear
differential equations, and analytic solutions must often be tailor-made for each differential
equation, with no single method applicable beyond a fairly narrow class of nonlinear DEs.
Due to the difficulty in finding analytic solutions, we often rely on qualitative and/or ap-
proximate methods of analyzing nonlinear differential equations, e.g. through dimensional
analysis, phase plane analysis, perturbation methods or linearization. In general, linear dif-
ferential equations admit relatively simple dynamics, as compared to nonlinear differential
equations.
An ordinary differential equation (ODE) is a differential equation for one or more func-
tions of one independent variable, involving the derivatives of these functions. The term
ordinary is used in contrast with the term partial differential equation (PDE), where the
functions depend on more than one independent variable. PDEs will be discussed in
Chapter 4.

3.1 ODEs: Simple cases


For a warm up let us recall cases of simple ODEs which can be integrated directly.

3.1.1 Separable Differential Equations

A separable differential equation is a first order differential equation that can be written so
that the derivative function appears on one side of the equation, and the other side contains
the product or quotient of two functions, one of which is a function of the independent
variable, and the other a function of the dependent variable.
$$\frac{dx}{dt} = \frac{f(t)}{g(x)} \;\Rightarrow\; g(x)\,dx = f(t)\,dt \;\Rightarrow\; \int g(x)\,dx = \int f(t)\,dt. \qquad (3.1)$$

3.1.2 Method of Parameter Variation

To solve the following linear, inhomogeneous ODE

dy/dt − p(t)y(t) = g(t), y(t0 ) = y0 , (3.2)

let us substitute

y(t) = c(t) exp(∫_{t0}^{t} dt′ p(t′)),   (3.3)

where the exponential factor on the right is the solution of the homogeneous version
of Eq. (3.2), i.e. dy/dt = p(t)y(t), and the prefactor c(t), which would be a
constant in the homogeneous case, is promoted to a function of t. This results in the
following equation for the t-dependent c(t):

(dc(t)/dt) exp(∫_{t0}^{t} dt′ p(t′)) = g(t).

Applying the method of separable differential equations (see Eq. (3.1)) and then recalling
the substitution (3.3), one arrives at

y(t) = exp(∫_{t0}^{t} dt′ p(t′)) ( y0 + ∫_{t0}^{t} dt′ g(t′) exp(−∫_{t0}^{t′} dt′′ p(t′′)) ).
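As a sanity check, the quadrature above can be compared with direct numerical integration of Eq. (3.2); a minimal sketch, where the concrete choices p(t) = −1, g(t) = sin t, and y0 = 1 are purely illustrative:

    import numpy as np
    from scipy.integrate import quad, solve_ivp

    p = lambda s: -1.0          # illustrative p(t)
    g = lambda s: np.sin(s)     # illustrative g(t)
    t0, y0 = 0.0, 1.0

    def y_quadrature(t):
        # y(t) = e^{P(t)} ( y0 + int_{t0}^{t} g(t') e^{-P(t')} dt' ),  P(t) = int_{t0}^{t} p
        P = lambda s: quad(p, t0, s)[0]
        inner = quad(lambda s: g(s) * np.exp(-P(s)), t0, t)[0]
        return np.exp(P(t)) * (y0 + inner)

    # Direct integration of dy/dt = p(t) y + g(t), Eq. (3.2)
    sol = solve_ivp(lambda s, y: p(s) * y + g(s), (t0, 5.0), [y0], rtol=1e-10, atol=1e-12)
    print(y_quadrature(5.0), sol.y[0, -1])  # the two numbers should agree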

Exercise 3.1.1. Solve dx/dt − λ(t)x = f (t)/x2 , where λ(t) and f (t) are known functions
of t.

3.1.3 Integrals of Motion

Consider the conservative version of Eqs. (??) (conservative means there is no dissipation
of energy)
ẋ = v, v̇ = −∂x U (x), (3.4)

describing the dynamics of a particle of unit mass in the potential, U (x). The energy of the
particle is
E = ẋ²/2 + U(x),   (3.5)
which consists of the kinetic energy (the first term), and the potential energy (the second
term). It is straightforward to check that the energy is constant, that is dE/dt = 0.
Therefore,
ẋ = ±√(2(E − U(x))),   (3.6)

where ± on the right hand side is chosen according to the initial condition chosen for ẋ(0)
(there may be multiple solutions, corresponding to the same energy). Eq. (3.6) is separable,
and it can thus be integrated, resulting in the following classic implicit expression for the
particle coordinate as a function of time:

∫_{x0}^{x} dx′/√(2(E − U(x′))) = ±t,   (3.7)

which depends on the particle's initial position, x0, and its energy, E, which is conserved.
In the example above, E is an integral of motion or equivalently a first integral, which is
defined as a quantity that is conserved along solutions to the differential equation. In this
case E was constant along the trajectories x(t).

The idea of an integral of motion or first integral extends to conservative systems de-
scribed by a system of ODEs. (Here and in the next section we follow [2, 1].) For ex-
ample, consider a quantity H, called the Hamiltonian: a twice-
differentiable function of 2n variables, p1, · · · , pn (momenta) and q1, · · · , qn (coordinates),
which satisfy the following system of equations, called Hamilton's canonical equations:

∀i = 1, · · · , n :   ṗ_i = −∂H/∂q_i,   q̇_i = ∂H/∂p_i.   (3.8)
Computing the rate of change of the Hamiltonian in time,

dH/dt = Σ_{i=1}^{n} ( (∂H/∂p_i) ṗ_i + (∂H/∂q_i) q̇_i ) = Σ_{i=1}^{n} ( −q̇_i ṗ_i + ṗ_i q̇_i ) = 0,   (3.9)

we observe that H is constant, that is, H is an integral of motion.


The one degree of freedom system (3.4) is an example of Hamilton’s canonical system
where the energy (3.5), considered as a function of x and v, is the Hamiltonian and x and
v correspond to (scalar) q and p respectively. We will continue exploring the one degree of
freedom system in section 3.2.

3.2 Phase Space Dynamics for Conservative and Perturbed Systems

3.2.1 Phase Portrait

Here we will follow material of [2] and Section 1.3 of [1]. Our starting point (and main
example) will be the conservative (Hamiltonian) system with one degree of freedom (3.4).
We have established that the energy (Hamiltonian) is conserved, and it is thus instructive
to study isolines, or level curves, of the energy drawn in the two-dimensional (x, v) space,
{(x, v) | v²/2 + U(x) = E}. To draw a level curve of the energy we simply fix E and evaluate
how (x, v) evolves with t according to Eqs. (3.4).
Consider the quadratic potential, U(x) = kx²/2. The two cases of positive and negative
k are illustrated in Fig. (3.1), see the snippet Portrait.ipynb. We observe that with the
exception of the equilibrium position (x, v) = (0, 0), the level curves of the energy are
smooth. Generalizing, we find that the exceptional points are critical, or stationary, points
of the Hamiltonian, which are points where the derivatives of the Hamiltonian with respect
to the canonical variables, q and p, are zero. Note that each level curve, which we draw
observing how a particle slides in a potential well, U (x), also has a direction (not shown in
Fig. (3.1)).

Figure 3.1: Phase portrait, i.e. (x, v) level-curves of the conservative system Eq. (3.4) with
the potential, U (x) = kx2 /2 with k > 0 (top) and k < 0 (bottom).
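(A minimal plotting sketch in the spirit of the Portrait.ipynb snippet referenced above; the grid and parameter values here are our own illustrative choices.)

    import numpy as np
    import matplotlib.pyplot as plt

    x, v = np.meshgrid(np.linspace(-2, 2, 400), np.linspace(-2, 2, 400))
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, k in zip(axes, (+1.0, -1.0)):
        E = v**2 / 2 + k * x**2 / 2        # energy, Eq. (3.5), with U(x) = k x^2 / 2
        ax.contour(x, v, E, levels=15)     # level curves of the energy
        ax.set(title=f'k = {k:+.0f}', xlabel='x', ylabel='v')
    plt.show()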

Figure 3.2: Four potentials U(x), panels (a)-(d). What is the appearance of the level curves
(phase portrait) of the energy for each of these potentials?

Consider the case where k > 0, and fix the value of the energy E. Due to Eq. (3.5), the
coordinate of the particle, x, should lie within the set where the potential energy does not
exceed the energy, {x | U(x) ≤ E}. We observe that E ≥ 0, and that equality corresponds to the
particle sitting still at the minimum of the potential, which is called a critical point, or fixed
point. Furthermore, the larger the kinetic energy, the smaller the potential energy. Any
position where the particle changes its velocity from positive to negative or vice-versa is
called a turning point. For any E > 0, there are two turning points, x± = ±√(2E/k). Testing
different values of E > 0, we sketch different level curves, resulting in different ellipses
centered around 0. This is the canonical example of an oscillator. The motion of the particle
is periodic, and its period, T, can be computed by evaluating Eq. (3.7) between the
turning points (the factor of 2 accounts for the return trip):

T := 2 ∫_{x−}^{x+} dx/√(2(E − U(x))) = 2 ∫_{−√(2E/k)}^{√(2E/k)} dx/√(2E − kx²) = 2π/√k.   (3.10)

For this case the period is a constant, independent of the energy E (for k = 1 it equals 2π).
In the k < 0 case, where all values of the energy (positive and negative) are accessible,
x = v = E = 0 is again the critical point. When E > 0 there are no turning points (points
where the direction of the velocity changes). When E < 0 the particle may turn only once or
not at all. If x(0) ≠ 0 then, regardless of the sign of E, |x(t)| increases with t to become
unbounded at t → ∞. As seen in Fig. (3.1)b, in this case the (x, v) phase space splits into
four quadrants, separated by the separatrices v = ±√(−k) x. The level curves of the energy
are hyperbolas centered around x = v = 0.
A qualitative study of the dynamics in more complex potentials U (x) can be conducted
by sketching the level curves in a similar way.

Exercise 3.2.1. Sketch level curves of the energy for the Kepler potential, U(x) := −1/x + C/x²,
and for the potentials shown in Fig. (3.2).

3.2.2 Small Perturbation of a Conservative System

Let us analyze the following simple but very instructive example of a system which deviates
very slightly from the quadratic potential with k = 1:

ẋ = v + εf (x, v), v̇ = −x + εg(x, v), (3.11)

in the regime where ε ≪ 1 and x² + v² ≤ R².


For ε = 0, and assuming that x^(0)(0) = x0 and v^(0)(0) = 0, one derives

x^(0)(t) = x0 cos(t),   v^(0)(t) = −x0 sin(t).

We calculate the energy and find that E = ((x^(0))² + (v^(0))²)/2 = x0²/2, which is obviously conserved,
and so the system cycles with the period given by T = 2π.
The general case where 0 < ε ≪ 1 is not conservative. Let us examine how the energy
changes with time. One derives
dE/dt = xẋ + vv̇ = ε(xf + vg) = ε(x^(0) f + v^(0) g) + O(ε²).   (3.12)
Integrating over a period, one arrives at the following expression for the gain (or loss) of
energy
∆E = ε ∫_0^{2π} dt (x^(0) f + v^(0) g) + O(ε²) = ε ∮ (−f dv + g dx) + O(ε²),   (3.13)

where the integral is taken over the level curve, which is also the iso-energy cycle, of the unper-
turbed (ε = 0) system in the (x, v) space. Obviously ∆E depends on x0.
For the case of increasing energy, ∆E > 0, we see an unwinding spiral in the (x, v)
plane. For the case of decreasing energy, ∆E < 0, the spiral contracts to a stationary point.
There are also systems where the sign of ∆E depends on x0 . Consider for example the
van der Pol oscillator
ẍ = −x + εẋ(1 − x²).   (3.14)

As in Eq. (3.13), we integrate dE/dt over a period, which in this case gives

∆E = ε ∫_0^{2π} ẋ²(1 − x²) dt + O(ε²) = ε x0² ∫_0^{2π} sin²t (1 − x0² cos²t) dt + O(ε²)
   = π (x0² − x0⁴/4) ε + O(ε²).   (3.15)

The O(ε) part of this expression is zero when x0 = 2, positive when x0 < 2, and negative
when x0 > 2. Therefore, if we start with x0 < 2 the system will be gaining energy, and
the maximum value of x(t) within a period will approach the value 2. On the contrary, if
x0 > 2 the system will lose energy, and the maximum value of x(t) over a period will
decrease, approaching the same value 2. This type of behavior is called a stable
limit cycle, characterized by

∆E(x0) = 0   and   (d/dx0) ∆E(x0) < 0.

In summary, the van der Pol oscillator is an example of behavior where the perturbation
is singular, meaning that the perturbed behavior is qualitatively different from the unperturbed
case. Indeed, in the unperturbed case the particle oscillates along an orbit which depends on
the initial condition, while in the perturbed case the particle ends up moving along the same limit
cycle regardless of the initial condition.
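The approach of the maximum of x(t) to the value 2, from either side, is easy to confirm numerically; a minimal sketch, in which the value ε = 0.1 and the integration horizon are our illustrative choices:

    import numpy as np
    from scipy.integrate import solve_ivp

    eps = 0.1
    # van der Pol oscillator, Eq. (3.14), written as a first-order system
    rhs = lambda t, y: [y[1], -y[0] + eps * y[1] * (1 - y[0]**2)]

    for x0 in (0.5, 3.0):  # start inside and outside the limit cycle
        sol = solve_ivp(rhs, (0, 300), [x0, 0.0], rtol=1e-9, dense_output=True)
        t_late = np.linspace(250, 300, 5000)         # after transients have died out
        print(x0, np.abs(sol.sol(t_late)[0]).max())  # amplitude -> 2 in both cases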

Exercise 3.2.2. Recall two properties of stable / unstable limit cycles:

Stable Limit Cycle at x = x0:     ∆E(x0) = 0  and  (d/dx0) ∆E(x0) < 0,
Unstable Limit Cycle at x = x0:   ∆E(x0) = 0  and  (d/dx0) ∆E(x0) > 0.

Suggest an example of perturbations, f and g, in Eq. (3.11) which leads to (a) an unstable
limit cycle at x0 = 2, and (b) one stable limit cycle at x0 = 2 and one unstable limit cycle
at x0 = 3. Illustrate your suggested perturbations by building a computational snippet.

Consider another ODE example

İ = ε (a + b cos θ),   θ̇ = ω,   (3.16)

where ω, ε, a, b are constants, and the ε-term in the first of Eqs. (3.16) is a perturbation. When ε
is zero, I is an integral of motion (meaning that it is constant along solutions of the ODE),
and we think of θ as an angle in the phase space increasing linearly with the frequency ω.
Note that the unperturbed system is equivalent to the unperturbed version of the system
described by Eq. (3.11).

Exercise 3.2.3. (a) Show that one can transform the unperturbed (i.e. ε = 0) version of
the system described by Eq. (3.11) to the unperturbed version of the system described by
Eq. (3.16) via the following transformation (change of variables)
v = √(I/2) cos(θ/ω),   x = √(I/2) sin(θ/ω).   (3.17)

(b) Restate Eq. (3.16) in the (x, v) variables.



The transformation discussed in the Exercise 3.2.3 is an example of the so-called canon-
ical transformation that preserves the Hamiltonian structure of the equations. In this case
the Hamiltonian, which is generally a function of θ and I, depends only on I, H = Iω, and
one can indeed rewrite the unperturbed version of Eq. (3.16) as
θ̇ = ∂H/∂I = ω,   İ = −∂H/∂θ = 0,   (3.18)
therefore interpreting θ and I as the new coordinate and the new momentum respectively.
Averaging the perturbed Eq. (3.16) over one angle revolution (which takes time 2π/ω), as done
earlier in Section 3.2.2, one arrives at

∆J = 2πεa/ω.   (3.19)

Taking many, n, revolutions and replacing 2πn/ω by t in the limit, one arrives at the
following equation for the averaged (over the period) action

J̇ = εa,   (3.20)

which has the solution J(t) = J0 + εat.


In fact, Eqs. (3.16) can also be solved exactly:

I(t) = εat + εb sin(ωt)/ω,   (3.21)

and one can check that the solution of the averaged Eq. (3.20) indeed does not deviate (with
time) from the exact solution of Eq. (3.16):

ω ≠ 0 :   |J(t) − I(t)| ≤ O(1)ε.   (3.22)
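A quick numeric confirmation of the bound (3.22); the parameter values below are our own illustrative choices:

    import numpy as np
    from scipy.integrate import solve_ivp

    eps, a, b, omega = 0.01, 1.0, 1.0, 1.0
    rhs = lambda t, y: [eps * (a + b * np.cos(y[1])), omega]  # Eq. (3.16), y = (I, theta)
    sol = solve_ivp(rhs, (0, 1000), [0.0, 0.0], rtol=1e-10, dense_output=True)

    t = np.linspace(0, 1000, 2000)
    I = sol.sol(t)[0]
    J = eps * a * t                   # averaged solution, Eq. (3.20), with J(0) = 0
    print(np.abs(J - I).max() / eps)  # stays O(1), consistent with Eq. (3.22)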

In a general n-dimensional case one considers the following system of bare (unperturbed)
differential equations

İ = 0,   θ̇ = ω(I),   I := (I1, · · · , In),   θ := (θ1, · · · , θn),   (3.23)

where each component of I is thus an integral of motion of the unperturbed system of
equations. The perturbed version of Eq. (3.23) becomes

İ = εg(I, θ, ε),   θ̇ = ω(I) + εf(I, θ, ε),   (3.24)

where f and g are 2π-periodic functions of each of the components of θ. Since I changes
slowly, due to the smallness of ε, the perturbed system can be substituted by a much simpler
averaged system for the slow (adiabatic) variables, J(t) = I(t) + O(ε):
J̇ = εG(J),   G(J) := ∮ g(J, θ, 0) dθ / ∮ dθ,   (3.25)


where, as above, ∮ stands for averaging over the period (one rotation) in the phase
space. Notice that the procedure of averaging over the periodic motion may break down in higher
dimensions, n > 1, if the system has resonances, i.e. if Σ_i N_i ω_i = 0, where the N_i are integers.
If the perturbed system is Hamiltonian, with θ playing the role of generalized coordinates and
I of generalized momenta, then Eqs. (3.24) become

İ = −∂H/∂θ,   θ̇ = ∂H/∂I.   (3.26)

In this case, averaging the rhs of the first equation in Eq. (3.26) over θ results in J̇ = 0. This
means that the slow variables, J1, · · · , Jn, also called adiabatic invariants, do not change
with time. Notice that the main difficulty in applying this rather powerful approach consists
in finding proper variables which remain integrals of motion of the unperturbed system.

3.3 Direct Methods for Solving Linear ODEs


We continue our exploration of linear ODEs by gradually increasing the complexity of the problems
and by developing more technical methods.

3.3.1 Homogeneous ODEs with Constant Coefficients

Consider the n-th order homogeneous ODE with constant coefficients

Lx(t) = 0,   where   L ≡ Σ_{m=0}^{n} a_{n−m} d^{n−m}/dt^{n−m}.   (3.27)

(Here and below we will start using bold-calligraphic notation, L, for the differential opera-
tors.) Let us look for the general solution of Eq. (3.27) in the form of a linear combination
of exponentials
x(t) = Σ_{k=1}^{n} c_k exp(λ_k t),   (3.28)

where ck are constants. Substituting Eq.(3.28) into Eq.(3.27), one arrives at the condition
that the λk are roots of the characteristic polynomial:
Σ_{m=0}^{n} a_{n−m} (λ_k)^{n−m} = 0.   (3.29)

Eq. (3.28) holds if the λ_k are not degenerate (that is, if there are n distinct roots). In the
case of degeneracy we generalize Eq. (3.28) to a sum of exponentials (for the non-degenerate
λ_k) and of polynomials in t multiplied by the respective exponentials (for the degenerate
λ_k), where the degree of each polynomial is equal to the degeneracy of the respective root:

x(t) = Σ_{k=1}^{m} Σ_{l=0}^{d_k} c_k^{(l)} t^l exp(λ_k t),   (3.30)

where d_k is the degree of the k-th root's degeneracy.
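A minimal symbolic sketch of this recipe, for the illustrative (non-degenerate) equation x''' + 6x'' + 11x' + 6x = 0 of our own choosing:

    import sympy as sp

    t, lam = sp.symbols('t lambda')

    # Characteristic polynomial, Eq. (3.29), for x''' + 6 x'' + 11 x' + 6 x = 0
    roots = sp.solve(lam**3 + 6 * lam**2 + 11 * lam + 6, lam)
    print(roots)  # [-3, -2, -1]: three distinct (non-degenerate) roots

    # Any combination of exponentials, Eq. (3.28), solves the ODE:
    x = sum(sp.Symbol(f'c{k}') * sp.exp(r * t) for k, r in enumerate(roots))
    residual = sp.diff(x, t, 3) + 6 * sp.diff(x, t, 2) + 11 * sp.diff(x, t) + 6 * x
    print(sp.simplify(residual))  # -> 0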

3.3.2 Inhomogeneous ODEs

Consider an inhomogeneous version of a generic linear ODE

Lx(t) = f (t). (3.31)

Recall that if the particular solution is xp (t), and if x0 (t) is a generic solution of the homo-
geneous version of the equation, then a generic solution of Eq. (3.31) can be expressed as
x(t) = x0 (t) + xp (t).
Let us illustrate the utility of this simple but powerful statement on an example:

ẍ + ω0² x = cos(3t).   (3.32)

A generic solution of the homogeneous version of Eq. (3.32) is x0(t) = c exp(iω0 t), where c is
a complex-valued constant, and a particular solution of Eq. (3.32) is x_p(t) = cos(3t)/(ω0² − 9).
Therefore, a general solution of Eq. (3.32) is

x(t) = c exp(iω0 t) + cos(3t)/(ω0² − 9).

3.4 Linear Dynamics via the Green Function


Let us recall some of the empirical lessons of Section 3.2. If a system is in equilibrium, its
state does not change in time. If the system is perturbed away from a stable equilibrium,
the perturbation is small and the system is dissipative, so it relaxes back to the equilibrium.
The relaxation may not be monotonic, and the system may show some oscillations. In
the following we discuss the relaxation of a system back to its equilibrium state in response
to a small perturbation. This type of relaxation is modeled by linear differential equations.
The method of Green functions, or “response” functions, will be the workhorse of our
analysis of linear dynamics. It offers a powerful and intuitive approach which also extends
to the case of PDEs. We will start exploring the method by revisiting the simple constant
coefficient case of the linear scalar-valued first-order equation (3.2).

3.4.1 Evolution of a linear scalar

Consider the simplest example of scalar relaxation,

dx/dt + γx = φ(t),   (3.33)

where γ is a constant and φ(t) is a known function of t. This model appears, for example, when
we consider an over-damped driving of a polymer through a medium, where the equation
describes the balance of forces where φ(t) is the driving force, γx is the elastic (returning)
force for a polymer with one end positioned at the origin and another at the position x;
and ẋ represents friction of the polymer against the medium. The general solution of this
equation is
x(t) = ∫_0^{∞} ds G(s) φ(t − s),   (3.34)

where we have assumed that the evolution starts in the infinite past with x(−∞) = 0, and G(t) is the
so-called Green function which satisfies
dG/dt + γG = δ(t),   (3.35)
and δ(t) is the δ-function.
Notice that the evolutionary problem we discuss here is an initial value problem (also
called a Cauchy problem). Indeed, if we did not fix x back in the past (at t = −∞),
the solution of Eq. (3.33) would not be defined unambiguously: if x_s(t)
is a particular solution of Eq. (3.33), then x_s(t) + C exp(−γt), where C is a constant,
describes a family of solutions of Eq. (3.33). The freedom, eliminated by fixing the initial
condition, is associated with the so-called zero mode of the differential operator, d/dt + γ.
Another remark is about causality, which may also be referred to, in this context, as
the “causality principle”. It follows from Eq. (3.34) that, in defining the Green function, one
also enforces G(t) = 0 at t < 0. This formal observation is, of course, consistent with
the obvious: the solution of Eq. (3.33) at a particular moment in time t can only depend on
external driving sources φ(t′) that occurred in the past, when t′ ≤ t, and cannot depend on
external driving forces that will occur in the future, when t′ > t.
Now back to solving Eq. (3.35). Since δ(t) = 0 at t > 0, one associates G(t) with the
zero mode of the aforementioned differential operator, G(t) = A exp(−γt), where A is a
constant. On the other hand, due to the causality principle, G(t) = 0 at t < 0. Integrating
Eq. (3.35) over time from −ε, where 0 < ε ≪ 1, to τ, we observe that G(t) should have
a discontinuity (jump) at t = 0: G(t) = A exp(−γt)θ(t), where θ is the Heaviside function.
Substituting this expression in Eq. (3.35) and integrating the result (left and right hand
sides of the resulting equality) over −ε < t < ε, one finds that A = 1. Substituting the
expression into Eq. (3.34), one arrives at the solution

x(t) = ∫_{−∞}^{t} ds exp(−γ(t − s)) φ(s).   (3.36)

We observe that the system “forgets” the past at the rate γ per unit time.
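A minimal numeric check of Eq. (3.36) against direct integration of Eq. (3.33); γ and the drive φ below are our illustrative choices, and the lower limit −∞ is truncated to a finite value:

    import numpy as np
    from scipy.integrate import quad, solve_ivp

    gamma = 0.5
    phi = lambda t: np.cos(2 * t)

    # Green-function solution, Eq. (3.36); -200 stands in for the lower limit -infinity
    x_green = lambda t: quad(lambda s: np.exp(-gamma * (t - s)) * phi(s),
                             -200.0, t, limit=2000)[0]

    # Direct integration of Eq. (3.33), started far in the past with x ~ 0
    sol = solve_ivp(lambda t, x: -gamma * x + phi(t), (-200.0, 10.0), [0.0],
                    rtol=1e-10, atol=1e-12)
    print(x_green(10.0), sol.y[0, -1])  # the two numbers should agree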

Exercise 3.4.1. Solve Eq. (3.33) at t > 0, where x(0) = 0 and φ(t) = A exp(−αt). Analyze
the dependence on α and γ, including α → γ.

Notice that Eq. (3.35) assumes that the Green function depends on the difference be-
tween t and s, t − s, and not on the two variables separately. This assumption is justified
for the case considered here; however, it will not be correct for situations where the decay
coefficient γ(t) depends on t. In that general case one needs to consider the general form
of the Green function, G(t; s), depending on the two times separately. In the case of constant γ
the Green function depends only on the difference because of the symmetry of Eq. (3.35) with
respect to time translations (time homogeneity): the form of the equation does not change
under the time shift t → t + t0.

3.4.2 Evolution of a vector

Let us now generalize and consider

dy/dt + Γ̂ y = χ(t),   (3.37)

where y and χ are n-dimensional vectors and Γ̂ is an n × n time-independent matrix.
Note that this type of vector ODE appears as the result of “vectorization” of an n-th
order ODE for a scalar variable x, where y1 = x, y2 = dx/dt, · · · , yn = d^{n−1}x/dt^{n−1}.
Then dy/dt is expressed via the components of y and the original equation, thus resulting
in Eq. (3.37).
Consider the following auxiliary linear algebra problem: find the eigen-set of the matrix Γ̂,

Γ̂ a_i = λ_i a_i,   (3.38)

where the λ_i are the eigenvalues of Γ̂.


Let us assume, first, that the eigenvalue problem is not degenerate. Then we expand
y and χ over the {a_i | i} basis,

y = Σ_i x_i a_i,   χ = Σ_i φ_i a_i.   (3.39)

Substituting the expansions into Eq. (3.37), one arrives at

dx_i/dt + λ_i x_i = φ_i,   (3.40)
therefore reducing the vector equation to the set of scalar equations of the already considered
type Eqs. (3.33).
To make this transformation invariant, and also extendable to the degenerate case (when
at least two eigenvalues of Γ̂ are equal), one introduces the Green function Ĝ, which satisfies

(d/dt + Γ̂) Ĝ(t) = δ(t) 1̂.   (3.41)
The explicit solution of Eq. (3.41) is

Ĝ(t) = θ(t) exp(−Γ̂ t),   (3.42)

which allows us to state the solution of Eq. (3.37) in the following invariant form
y(t) = ∫_{−∞}^{t} ds Ĝ(t − s) χ(s) = ∫_{−∞}^{t} ds θ(t − s) exp(−Γ̂(t − s)) χ(s).   (3.43)

Notice that the matrix exponential, introduced in Eq. (3.42) and utilized in Eq. (3.43), is
a formal expression which may be interpreted in terms of the Taylor series

exp(−tΓ̂) = Σ_{n=0}^{∞} (−t)^n Γ̂^n / n!,   (3.44)

which is always convergent (for a matrix Γ̂ with finite elements).
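Eq. (3.43) is straightforward to evaluate with a library matrix exponential; a minimal sketch, in which the 2 × 2 matrix and the drive are our illustrative choices and the convolution is approximated by a crude Riemann sum:

    import numpy as np
    from scipy.linalg import expm
    from scipy.integrate import solve_ivp

    Gamma = np.array([[1.0, 0.3],
                      [0.0, 2.0]])
    chi = lambda t: np.array([np.sin(t), 1.0])

    def y_green(t, t_min=-40.0, n=4000):
        # y(t) = int_{-inf}^{t} exp(-Gamma (t-s)) chi(s) ds, Eq. (3.43), truncated at t_min
        s = np.linspace(t_min, t, n)
        ds = s[1] - s[0]
        return sum(expm(-Gamma * (t - si)) @ chi(si) for si in s) * ds

    sol = solve_ivp(lambda t, y: -Gamma @ y + chi(t), (-40.0, 5.0), [0.0, 0.0], rtol=1e-9)
    print(y_green(5.0), sol.y[:, -1])  # agree to the accuracy of the Riemann sum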


To relate the invariant expression (3.43) to the eigenvalue decomposition of Eqs. (3.39, 3.40),
one introduces the eigen-decomposition

Γ̂ = Â Λ̂ Â^{−1},   (3.45)

where Λ̂ is the diagonal matrix formed from the eigenvalues of Γ̂ and the columns of Â are
the respective eigenvectors of Γ̂. Note that Γ̂^n = Â Λ̂^n Â^{−1}.
To illustrate the peculiarity of the degenerate case, consider

Γ̂ = λ1̂ + N̂,   N̂ ≡ ( 0  1
                      0  0 ),

which is the canonical form of the (2 × 2) Jordan matrix/block, where N̂ is the (2 × 2) nilpotent
matrix, i.e. N̂² = 0̂. Writing Eqs. (3.37) in components,

dy1/dt + λy1 + y2 = χ1,   dy2/dt + λy2 = χ2,

integrating the second equation, substituting the result into the first equation, and then changing
from y1 to y = y1 + t y2, one arrives at

dy/dt + λy = χ1 + t χ2.

Note the emergence of a secular term (a polynomial in t) on the right hand side, which is
generic in the case of degeneracy; the resulting equation is then straightforward to integrate.
Consistently, the expression for the matrix exponential also shows a secular term,

exp(−t(λ1̂ + N̂)) = e^{−λt} (1̂ − tN̂),

where we have accounted for the nilpotent property of N̂.

Exercise 3.4.2. Find the Green function of Eq. (3.37) for

Γ̂ = ( λ  1  0
      0  λ  1
      0  0  λ ).

3.4.3 Higher Order Linear Dynamics

The Green function approach illustrated above can be applied to any inhomogeneous linear
differential equation. Let us see how it works in the case of the second-order differential
equation for a scalar. Consider
d²x/dt² + ω²x = φ(t).   (3.46)
To solve Eq. (3.46) note that its general solution can be expressed as a sum of its
particular solution and solution of the homogeneous version of Eq. (3.46) with zero right
hand side. Let us choose a particular solution of Eq. (3.46) in the form of convolution (3.34)
of the source term, φ(t), with the Green function of Eq. (3.46), which satisfies

(d²/dt² + ω²) G(t) = δ(t).   (3.47)

As established above, G(t) = 0 at t < 0. Integration of Eq. (3.47) from −ε to τ and checking
the balance of the integrated terms reveals that Ġ jumps at t = 0, and the value of the
jump is equal to unity. An additional integration over time around the singularity shows
that G(t) is continuous (and zero) at t = 0. Therefore, in the case of the second order differential
equation considered here: G = 0 and Ġ = 1 at t = +0. Given that δ(+0) = 0, these two
values can be considered as the initial conditions at t = +0 for the homogeneous version

(zero right hand side) of Eq. (3.47), defining G(t) at t > +0. Finally, we arrive at the
following result:

G(t) = θ(t) sin(ωt)/ω,   (3.48)
where θ is the Heaviside function.
Furthermore, Eq. (3.34) gives the solution to Eq. (3.46) over the infinite time horizon;
however, one can also use the Green function to solve the respective Cauchy problem (initial
value problem). Since Eq. (3.46) is a second order ODE, one just needs to fix two values
associated with x(t) evaluated at the initial time, t = 0, for example x(0) and ẋ(0). Then, taking
into account that G(+0) = 0 and Ġ(+0) = 1, one finds the following general solution of
the Cauchy problem for Eq. (3.46):

x(t) = ẋ(0) G(t) + x(0) Ġ(t) + ∫_0^{t} dt1 G(t − t1) φ(t1).   (3.49)
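A numeric check of the Cauchy formula (3.49); the frequency, initial data, and drive below are our illustrative choices:

    import numpy as np
    from scipy.integrate import quad, solve_ivp

    w = 2.0
    phi = lambda t: np.exp(-t) * np.sin(3 * t)
    x0, v0 = 0.5, -1.0

    G  = lambda t: np.sin(w * t) / w   # Green function, Eq. (3.48), for t > 0
    Gd = lambda t: np.cos(w * t)       # its time derivative

    def x_cauchy(t):  # Eq. (3.49)
        conv = quad(lambda t1: G(t - t1) * phi(t1), 0.0, t)[0]
        return v0 * G(t) + x0 * Gd(t) + conv

    sol = solve_ivp(lambda t, y: [y[1], -w**2 * y[0] + phi(t)], (0.0, 4.0), [x0, v0],
                    rtol=1e-10, atol=1e-12)
    print(x_cauchy(4.0), sol.y[0, -1])  # the two numbers should agree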

Let us now generalize and consider

Lx = φ(t),   L ≡ Σ_{k=0}^{n} a_{n−k} d^{n−k}/dt^{n−k},   (3.50)

where ai are constants and L is the linear differential operator of the n-th order with
constant coefficients, already discussed in Section 3.3. We build a particular solution of
Eq. (3.50) as the convolution (3.34) of the source term, φ(t), with the Green function, G(t),
of Eq. (3.50)
LG = δ(t), (3.51)
where G(t) = 0 at t < 0.
Observe that the solution to the respective homogeneous equation, Lx = 0 (the zero
modes of the operator L), can be generally presented as

x(t) = Σ_i b_i exp(z_i t),   (3.52)

where the b_i are arbitrary constants.


Let us now use the general representation (3.54) to construct the Green function solving
Eq. (3.51). Recall that, considering first and second order differential equations in the
preceding sections, we transitioned from the inhomogeneous equation for the
Green function to the homogeneous equation supplemented with initial conditions.
A direct extension of the “integration around zero” approach (doing it n times) reveals that
the initial conditions one needs to set at t = +0 in the general case of the n-th order differential
equation are

d^{n−1}G(0+)/dt^{n−1} = 1,   ∀ 0 ≤ m < n − 1 :  d^m G(0+)/dt^m = 0.   (3.53)

Consider, formally, L as a polynomial in z, where z is the elementary differential op-
erator, z = d/dt, i.e. L(z). Then, at t > 0+, the Green function satisfies the homogeneous
equation, L(d/dt)G = 0. The solution of the homogeneous equation can generally be presented
as

t > 0+ :   G(t) = Σ_i b_i exp(z_i t),   (3.54)

where the b_i are constants which are defined unambiguously from the system of alge-
braic equations one derives by substituting Eq. (3.54) into the initial conditions (3.53).

Exercise 3.4.3. Find the Green function of

(a)  d²x/dt² + 2γ dx/dt + ν²x = φ,

(b)  d⁴x/dt⁴ + 4ν² d²x/dt² + 3ν⁴x = φ,

(c)  (d²/dt² + ν²)² x = φ.

3.4.4 Laplace’s Method for Dynamic Evolution

So far we have solved linear ODEs by using the Green function approach, constructing
the Green function as a solution of the homogeneous equation with additionally prescribed
initial conditions (n conditions for an equation of order n, cf. Eq. (3.53)). In this section we
discuss an alternative way of solving the problem via application of the Laplace transform
introduced in Section 2.8.
Laplace’s method is natural for solving dynamic problems with causal structure. Let us
see how it works for finding the Green function defined by Eq. (3.51). We apply the Laplace
transform to Eq. (3.51), integrating it over time with the exp(−kt) Laplace weight from a
small positive value, , to ∞. In this case integral of the right hand side is zero. Each term
on the left hand side can be transformed through a sequence of integrations by parts to a
product of a monomial in k with G̃(k), the Laplace transform of G(t). We also check all
boundary terms which appear at t =  and t = ∞. Assuming that G(∞) = 0 (which is
always the case for stable systems), all contributions at t = +∞ are equal to zero. All t = 
boundary terms, but one, are equal to zero, because ∀0 ≤ m < n − 1, dm G()/dtm = 0.
The only nonzero boundary contribution originates from dn−1 G()/dtn−1 = 1. Overall, one
arrives at the following equation
n
. X
L(k)G̃(k) = 1, L(k) = an−k (−k)n−1 . (3.55)
k=0

Therefore, we have just found that G̃(k) has poles (in the complex plane of k) associated with
the zeros of the polynomial L(k). To find G(t) one applies to G̃(k) the inverse Laplace transform

G(t) = ∫_{c−i∞}^{c+i∞} (dk/2πi) exp(kt) G̃(k).   (3.56)

The Laplace method also allows us to solve ODEs of the following type,

Σ_{m=0}^{N} (a_m + b_m x) d^m Y/dx^m = 0,   (3.57)

where the coefficients are linear in x.


Let us look for a solution of Eq. (3.57) in the form

Y(x) = ∫_C dt Z(t) exp(xt),   (3.58)

where C is a contour in the complex plane of t selected in such a way that the integral is
finite and nonzero. Substituting Eq. (3.58) into Eq. (3.57), using
the relation x e^{xt} = d e^{xt}/dt, and assuming that the contour of integration in Eq. (3.58) is
such that no “contact” term appears after the integration by parts (this is satisfied, e.g.,
when the contour is closed and the integrand is single-valued along the contour), one arrives
at

d(QZ)/dt = P Z,   where   P(t) = Σ_{m=0}^{N} a_m t^m,   Q(t) = Σ_{m=0}^{N} b_m t^m,   (3.59)
which is solved by

Z(t) = (1/Q) exp(∫ dt P/Q),   (3.60)

where the integral is defined simply as the anti-derivative.
This is a generic recipe - let us now apply it to the particular case of the so-called Hermite
equation

d²Y/dx² − 2x dY/dx + 2nY = 0.   (3.61)
dx dx
In this case we derive

P = t² + 2n,   Q = −2t,   Z = −exp(−t²/4)/(2t^{n+1}),   (3.62)
thus resulting in the following explicit solution of Eq. (3.61) (written in quadrature, and
defined up to a multiplicative constant)

Y(x) = ∫_C e^{xt − t²/4} dt/t^{n+1} = e^{x²} ∫ e^{−u²} du/(u − x)^{n+1},   (3.63)


Figure 3.3: Layout of contours in the complex plane of t needed for saddle-point estimations
of the Airy function described in Eq. (3.68).

where we also changed variables t → u according to t = 2(x − u).


When n is a nonnegative integer, the integrand in Eq. (3.63) has a pole of order n + 1 at
u = x, and thus choosing the contour to go around the pole works (in the sense of satisfying
the “no contact term” requirement). Applying the Cauchy formula to the resulting contour
integral, one therefore arrives at the expression for the so-called Hermite polynomials,

Y(x) = H_n(x) = (−1)^n e^{x²} d^n/dx^n e^{−x²},   (3.64)

where the re-scaling (which is a degree of freedom in linear differential equations) is selected
according to the normalization constraint introduced in the following exercise.

Exercise 3.4.4. Prove that

∫_{−∞}^{+∞} dx e^{−x²} H_n(x) H_m(x) = 2^n n! √π δ_{nm},   (3.65)

where δ_{nm} is unity when n = m and zero otherwise (the Kronecker symbol).
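(A numeric spot-check, not a proof, of the orthogonality relation (3.65), using Gauss-Hermite quadrature; the recursion below generates the physicists' Hermite polynomials.)

    import numpy as np
    from math import factorial, pi, sqrt
    from numpy.polynomial.hermite import hermgauss

    nodes, weights = hermgauss(60)  # integrates e^{-x^2} * (polynomial) essentially exactly

    def H(n, x):
        # physicists' Hermite polynomials via H_{k+1} = 2x H_k - 2k H_{k-1}
        if n == 0:
            return np.ones_like(x)
        hm, h = np.ones_like(x), 2 * x
        for k in range(1, n):
            hm, h = h, 2 * x * h - 2 * k * hm
        return h

    for n, m in ((2, 2), (3, 3), (2, 3)):
        lhs = np.sum(weights * H(n, nodes) * H(m, nodes))
        rhs = 2**n * factorial(n) * sqrt(pi) * (n == m)   # Eq. (3.65)
        print(n, m, lhs, rhs)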

Hermite polynomials will come back later in the context of the Sturm-Liouville problem.
Consider another example of an equation which can be solved by the Laplace method:

d²Y/dx² − xY = 0.   (3.66)

Following the general Laplace method we derive

P = t²,   Q = −1,   Z = −exp(−t³/3).   (3.67)

According to Eq. (3.58), the general solution of Eq. (3.66) can be represented as

Y(x) = const ∫_C exp(xt − t³/3) dt,   (3.68)

where we choose an infinite integration path, shown in Fig. (3.3), such that the values of the
integrand at the two (infinite) end points coincide (and equal zero). Indeed, this choice
guarantees that the infinite end points of the contour lie in the regions where Re(t³) > 0
(shaded regions I, II, III in Fig. (3.3)). Moreover, by choosing the contour to start in
region I and end in region II (blue contour C in Fig. (3.3)) we guarantee that the solution
of Eq. (3.66) given by Eq. (3.68) remains finite at x → +∞. Notice that the contour can
be shifted arbitrarily under the condition that the end points remain in the sectors I and II. In
particular, one can shift the contour to coincide with the imaginary axis (in the complex t
plane shown in Fig. (3.3)); then Eq. (3.68) becomes (up to a constant) the so-called Airy
function

Ai(x) = (1/π) ∫_0^{∞} cos(u³/3 + xu) du = (1/2π) Re ∫_{−∞}^{∞} exp(i u³/3 + i x u) du.   (3.69)

The asymptotic expression for the Airy function at x > 0, x ≫ 1, can be derived utilizing the
saddle-point method described in Section 1.4. At t = ±√x, the integrand in Eq. (3.68) has
an extremum along the direction of its “steepest descent” from the saddle point, parallel to the
imaginary axis. Since the contour end-points should stay in the sectors I and II, we shift
the contour to the left from the imaginary axis while keeping it parallel to the imaginary
axis. (See C1 shown in red in Fig. (3.3), which crosses the real axis at t = −√x.) The
integral is dominated by the saddle-point at t = −√x, thus resulting (after the substitution
t = −√x + iu, changing the integration variable from t to u, expanding in u, keeping the
quadratic term in u, ignoring higher order terms, and evaluating a Gaussian integral) in
the following asymptotic estimate for the Airy function:

x > 0, x ≫ 1 :   Ai(x) ≈ (1/2π) ∫_{−∞}^{+∞} exp(−(2/3) x^{3/2} − √x u²) du = exp(−2x^{3/2}/3) / (2√π x^{1/4}).   (3.70)

(Notice that one can also provide an alternative argument and exclude the contribution of the
second, potentially dominating, saddle-point at t = +√x simply by observing that the Gaussian
integral evaluated along the steepest descent path from this saddle-point gives zero contri-
bution after evaluating the real part of the result, as required by Eq. (3.69).)
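The estimate (3.70) is easy to compare against a library implementation of Ai(x); a minimal sketch:

    import numpy as np
    from scipy.special import airy

    for x in (2.0, 5.0, 10.0):
        ai_exact = airy(x)[0]  # scipy's airy returns the tuple (Ai, Ai', Bi, Bi')
        ai_asym = np.exp(-2 * x**1.5 / 3) / (2 * np.sqrt(np.pi) * x**0.25)  # Eq. (3.70)
        print(x, ai_exact, ai_asym, ai_asym / ai_exact)  # the ratio tends to 1 as x grows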

3.5 Linear Static Problems


We will now turn to problems which normally appear in the static case. In many natural
and engineered systems, a dynamic system that reaches equilibrium may have spatial char-
acteristics that are non-trivial and worthy of analysis. Here we discuss a number of linear
spatially one-dimensional problems that are relevant to applications.

3.5.1 One-Dimensional Poisson Equation

Poisson’s equation describes, in the case of electrostatics, the potential field caused by a
given charge distribution.
Let us discuss a function f(x) whose distribution over a finite spatial interval is described
by the following set of equations:

d²f/dx² = ψ(x),  ∀x ∈ (a, b),   with f(a) = f(b) = 0.   (3.71)

We introduce the Green function, which satisfies

∀ a < x, y < b :   d²G(x; y)/dx² = δ(x − y),   G(a; y) = G(b; y) = 0.   (3.72)
Notice that the Green function now depends on both x and y.
According to Eq. (3.72), d²G(x; y)/dx² = 0 if x ≠ y. Then, enforcing the boundary condi-
tions, one derives

x > y :   G(x; y) = B(x − b),   (3.73)
y > x :   G(x; y) = A(x − a).   (3.74)

Furthermore, given that the differential equation in (3.72) is of second order, G(x; y) should
be continuous at x = y and the jump of its first derivative at x = y should be equal to
unity. Summarizing, one finds

G(x; y) = (1/(b − a)) × { (y − b)(x − a),  x < y;   (y − a)(x − b),  x > y }.   (3.75)

The solution of Eq. (3.71) is given by the convolution

f(x) = ∫_a^b dy G(x; y) ψ(y).   (3.76)
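A minimal numeric check of Eqs. (3.75)-(3.76); the interval and the source ψ below are our illustrative choices (for this ψ the exact solution is −sin(2πx)/(2π)²):

    import numpy as np
    from scipy.integrate import quad

    a, b = 0.0, 1.0
    psi = lambda y: np.sin(2 * np.pi * y)   # illustrative source

    def G(x, y):  # Green function, Eq. (3.75)
        return ((y - b) * (x - a) if x < y else (y - a) * (x - b)) / (b - a)

    def f(x):  # convolution, Eq. (3.76); split at y = x, where G has a kink
        return (quad(lambda y: G(x, y) * psi(y), a, x)[0]
                + quad(lambda y: G(x, y) * psi(y), x, b)[0])

    x0 = 0.37
    print(f(x0), -np.sin(2 * np.pi * x0) / (2 * np.pi)**2)  # should agree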

Exercise 3.5.1. Find Green function of the operator d2 /dx2 + κ2 for periodic functions
with the period 2π.

3.6 Sturm–Liouville (spectral) theory


We enter the study of differential operators, which map a function to another function, and
it is therefore imperative to first discuss the Hilbert space where the functions reside.

3.6.1 Hilbert Space and its completeness

Let us first review some basic properties of a Hilbert space and, in particular, the condition
of its completeness. (These will be discussed at greater length in the companion Math 527
course of the AM core.) A linear (vector) space is called a Hilbert space, H, if

1. For any two elements, f and g, there exists a scalar product (f, g) which satisfies the
following properties:

(a) linearity with respect to the second argument,

(f, αg1 + βg2) = α(f, g1) + β(f, g2),

for any f, g1,2 ∈ H and α, β ∈ C;

(b) self-conjugation (Hermitian symmetry),

(f, g) = (g, f)*;

(c) non-negativity of the norm, ‖f‖² := (f, f) ≥ 0, where (f, f) = 0 means f = 0.

2. H has a countable basis, B, i.e. a countable set of elements, B := {f_n, n =
1, · · · , ∞}, such that any element g ∈ H can be represented as a linear
combination of the f_n; that is, for any g ∈ H, there exist coefficients c_n such that
g = Σ_n c_n f_n.

Remark. The Hilbert space defined above for complex-valued functions can also be consid-
ered over real-valued functions. In the following we will use the two interchangeably.
Any basis B can be turned into an ortho-normal basis with respect to a given scalar
product, i.e. x = Σ_{n=1}^{∞} (x, f_n) f_n, ‖x‖² = Σ_{n=1}^{∞} |(x, f_n)|². (For example, the Gram-
Schmidt process is a standard ortho-normalization procedure.)
One primary example of a Hilbert space is the L²(Ω) space of complex-valued functions
f(x) defined on a domain Ω ⊂ R^n such that ∫_Ω dx |f(x)|² < ∞ (one may say, casually, that
the square modulus of the function is integrable). In this case the scalar product is defined
as

(f, g) := ∫ dx f*(x) g(x).

Properties 1a-c from the definition of the Hilbert space above are satisfied by construction,
and property 2 can be proven (it is a standard proof in a course of mathematical analysis).
Consider a fixed infinite ortho-normal sequence of functions

{f_n, n = 1, · · · , ∞,  (f_n, f_m) = δ_{nm}}.



The sequence is a basis in L²(Ω) iff the following relation of completeness holds:

Σ_{n=1}^{∞} f_n*(x) f_n(y) = δ(x − y).   (3.77)

As is customary for the δ-function (and other generalized functions), Eq. (3.77) should be under-
stood as equality of the two sides integrated against a function from L²(Ω).

3.6.2 Hermitian and non-Hermitian Differential Operators

Consider functions from the Hilbert space L²(a, b) over the reals, i.e. functions of a single
variable, x ∈ R, over a bounded domain, a ≤ x ≤ b, with integrable square modulus, and
a linear differential operator L̂ acting on such functions.
A differential operator is called Hermitian (self-conjugate) if for any two functions
(from a certain class of interest, e.g. from L²(a, b)) the following relation holds:

(f, L̂g) := ∫_a^b dx f(x) L̂g(x) = ∫_a^b dx g(x) L̂f(x) = (g, L̂f).   (3.78)

It is clear from how the condition (3.78) was stated that it depends on both the class
of functions and on the operator L̂. For example, considering functions f and g with zero
boundary conditions, or functions which are periodic and whose derivatives are periodic too,
one finds that the operator

L̂ = d²/dx² + U(x),   (3.79)

where U(x) is a function mapping from R to R, is Hermitian.
A natural generalization of the Schrödinger operator (3.79) is the Sturm-Liouville operator

L̂ = d²/dx² + Q(x) d/dx + U(x).   (3.80)
The Sturm-Liouville operator is not Hermitian, i.e. Eq. (3.78) does not hold in this case.
However, it is straightforward to check that, with zero boundary conditions or periodic
boundary conditions imposed on the functions f(x) and g(x) and their derivatives, the
following generalization of Eq. (3.78) holds:

∫_a^b dx ρ(x) f(x) L̂g(x) = ∫_a^b dx ρ(x) g(x) L̂f(x),   (3.81)

where   dρ/dx = Qρ  ⇒  ρ = exp(∫ dx Q).   (3.82)

Consider now the eigen-functions f_n of the operator L̂, which satisfy

L̂ f_n = λ_n f_n,   (3.83)

where λ_n is the spectral parameter (eigenvalue) of the eigen-function f_n of the Sturm-
Liouville operator (3.80), indexed by n. (We assume that ∀n ≠ m : λ_n ≠ λ_m.)
Notice that the value of λ_n is not specified in Eq. (3.83); finding the values of
λ_n for which there exists a non-trivial solution satisfying the respective boundary conditions
(describing the class of functions considered) is an instrumental part of the Sturm-Liouville
problem.
Observe that the conditions (3.81, 3.82) translate into

∫ dx ρ f_n L̂ f_m = λ_m ∫ dx ρ f_n f_m = λ_n ∫ dx ρ f_n f_m,   (3.84)

which becomes the following eigen-function orthogonality condition:

∀n ≠ m :   ∫ dx ρ f_n f_m = 0.   (3.85)

As a corollary of this statement one also finds that in the Hermitian case the distinct
eigen-functions are orthogonal to each other with unit weight, ρ = 1.
Let us check Eq. (3.85) on the example L̂0 = d²/dx², where Q(x) = U(x) = 0, over
the functions which are 2π-periodic. Here cos(nx) and sin(nx), where n = 0, 1, · · · , are distinct
eigen-functions with the eigenvalues λ_n = −n². Then, for all m ≠ n,

∫_0^{2π} dx cos(nx) cos(mx) = ∫_0^{2π} dx cos(nx) sin(mx) = ∫_0^{2π} dx sin(nx) sin(mx) = 0.   (3.86)

Note that the example just discussed has a degeneracy: cos(nx) and sin(nx) are two
distinct real eigen-functions corresponding to the same eigenvalue. Therefore, any combina-
tion of the two is also an eigen-function corresponding to the same eigenvalue. If we had
chosen any other pair of the degenerate eigen-functions, say cos(nx) and sin(nx) + cos(nx),
the two would not be orthogonal to each other. Therefore, what we see in this example is
that the eigen-functions corresponding to the same eigenvalue should be specially selected
to be orthogonal to each other.
We say that the set of eigen-functions, {fn (x)|n ∈ N}, of L̂ is complete over a given
class (of functions) if any function from the class can be expanded into the series over the
eigen-functions from the set
f = Σ_n c_n f_n.   (3.87)

Relating this property of the eigen-functions to the completeness of a Hilbert space basis, one
observes that the eigen-vectors of a self-adjoint (Hermitian) operator over L²(Ω) form an ortho-
normal basis of L²(Ω).
Multiplying both sides of Eq. (3.87) by ρ f_n, integrating over the domain, and applying
(3.85) to the right hand side, one derives

c_n = ∫ dx ρ f_n f / ∫ dx ρ (f_n)².   (3.88)
Note that for the example L̂0 , Eq. (3.87) is a Fourier Series expansion of a periodic
function.
Returning to the general case and substituting Eq. (3.88) back into (3.87), one arrives
at

f(x) = ∫ dy ρ(y) ( Σ_n f_n(x) f_n(y) / ∫ dx ρ(x)(f_n(x))² ) f(y).   (3.89)

If the set of functions {f_n(x)|n} is complete, relation (3.89) should be valid for any function
f from the considered class. Consistently with this statement, one observes that the bracketed
part of the integrand in Eq. (3.89), multiplied by ρ(y), is just δ(x − y), the special function
whose convolution with a function returns the function itself, i.e.

Σ_n f_n(x) f_n(y) / ∫ dx ρ(x)(f_n(x))² = δ(x − y)/ρ(y).   (3.90)

Therefore, one concludes that Eq. (3.90) is equivalent to the statement of completeness of the
set of functions {f_n(x)|n}.

Exercise 3.6.1. Check validity of Eq. (3.90), and thus completeness of the respective set
of eigen-functions, for our enabling example of L̂0 = d2 /dx2 over the functions which are
2π-periodic.

3.6.3 Hermite Polynomials, Expansions

Let us now depart from our enabling example and consider the case of Q(x) = −2x and
U(x) = 0 over the class of functions mapping from R to R which decay sufficiently fast at
x → ±∞:

L̂2 = d²/dx² − 2x d/dx,   ρ(x) = exp(−x²).   (3.91)
That is, we are now discussing
L̂2 fn = λn fn . (3.92)

Changing from f_n(x) to Ψ_n(x) = f_n(x)√ρ, one thus arrives at the following equation for
Ψ_n:

e^{−x²/2} L̂2 f_n(x) = e^{−x²/2} L̂2 (e^{x²/2} Ψ_n(x)) = d²Ψ_n/dx² + (1 − x²)Ψ_n = λ_n Ψ_n.   (3.93)

Observe that when λn = −2n, Eq. (3.92) coincides with the Hermite Eq. (3.61).
Let us look for a solution of Eq. (3.92) in the form of a Taylor series around x = 0:

f_n(x) = Σ_{k=0}^{∞} a_k x^k.   (3.94)
Substituting the series into Eq. (3.92) and then equating terms with the same
powers of x, one arrives at the following recursion for the expansion coefficients:

∀k = 0, 1, · · · :   a_{k+2} = (2k + λ_n)/((k + 2)(k + 1)) a_k.   (3.95)
This results in the following two linearly independent solutions (even and odd, respec-
tively, with respect to the x → −x transformation) of Eq. (3.92), represented in the form of
series:

f_n^{(e)}(x) = a0 ( 1 + (λ_n/2!) x² + (λ_n(4 + λ_n)/4!) x⁴ + · · · ),   (3.96)
f_n^{(o)}(x) = a1 ( x + ((2 + λ_n)/3!) x³ + ((2 + λ_n)(6 + λ_n)/5!) x⁵ + · · · ),   (3.97)

where the two first coefficients in the series (3.94) are kept as free parameters. Observe that
the series (3.96) and (3.97) terminate if λ_n = −4n and λ_n = −4n − 2, respectively, where
n = 0, 1, · · · ; then the f_n are polynomials – in fact, the Hermite polynomials. We combine
the two cases into one and use the standard notation, H_n(x), for the Hermite polynomial
of the n-th order, which satisfies Eq. (3.61). Per the statement of Exercise 3.4.4, the Hermite
polynomials are normalized and orthogonal (weighted with ρ) to each other.
Exercise 3.6.2. Verify that the set of functions

{ Ψ_n(x) = (π^{1/4} √(2^n n!))^{−1} exp(−x²/2) H_n(x) | n = 0, 1, · · · }   (3.98)

satisfies

Σ_{n=0}^{∞} Ψ_n(x) Ψ_n(y) = δ(x − y).   (3.99)

[Hint: use the following identity:

d^n/dx^n exp(−x²) = √π ∫_{−∞}^{+∞} (dq/2π) (iq)^n exp(−q²/4 + iqx).]

The statement of Exercise 3.6.2, combined with the statement of Exercise 3.4.4, results
in the statement of “completeness”: the set of functions (3.98) forms an orthogonal basis
of the Hilbert space of functions f(x) ∈ L², i.e. satisfying ∫_{−∞}^{∞} |f(x)|² dx < ∞. (A bit
more formally, an orthogonal basis for the L² functions is a complete orthogonal set. For
an orthogonal set, completeness is equivalent to the fact that the 0 function is the only
function f ∈ L² which is orthogonal to all functions in the set.)

3.6.4 Schrödinger Equation in 1d

The Schrödinger equation,

d²Ψ(x)/dx² + (E − U(x))Ψ(x) = 0,   (3.100)

governs the so-called (complex-valued) wave function Ψ, describing the de-localization of a quantum
particle at x ∈ R with energy E in the potential U(x). We seek solutions with
|Ψ(x)| → 0 at x → ±∞, and our goal here is to describe the spectrum (allowed values of E)
and the respective eigen-functions.
As a simple but instructive example, consider the case of a quantum particle in a
rectangular potential, i.e. U(x) = U0 at x ∉ [0, a] and zero otherwise. The general solution of
Eq. (3.100) becomes

U0 > E > 0 :
Ψ_E(x) = { c_L exp(x√(U0 − E)),                           x < 0,
           a_+ exp(ix√E) + a_− exp(−ix√E),                x ∈ [0, a],
           c_R exp(−x√(U0 − E)),                          x > a,     (3.101)

U0 < E :
Ψ_E(x) = { c_{L+} exp(ix√(E − U0)) + c_{L−} exp(−ix√(E − U0)),   x < 0,
           a_+ exp(ix√E) + a_− exp(−ix√E),                       x ∈ [0, a],
           c_{R+} exp(ix√(E − U0)) + c_{R−} exp(−ix√(E − U0)),   x > a,   (3.102)

where we account for the fact that E cannot be negative (ODE simply does not allow
such solutions) and in the U0 > E > 0 regime we select one solution (of the two linearly
independent solutions) which does not grow with x → ±∞.
The solutions in the three different intervals should be ”glued” together - or stating it
less casually Ψ and dΨ/dx should be continuous at all x ∈ R. These conditions applied to
Eq. (3.101) or Eq. (3.102) result in an algebraic “consistency” conditions for E. We expect
to get a continuous spectrum at E > U0 and discrete at U0 > E > 0.

Exercise 3.6.3. Complete calculations above for the case of U0 > E > 0 and find the
allowed values of the discrete spectrum. What is the condition for appearance of at least
one discrete level?

Consider another example.

Example 3.6.4. Find the eigen-functions and energies of the stationary states of the Schrödinger
equation for an oscillator:

d²Ψ(x)/dx² + (E − x²)Ψ(x) = 0,   (3.103)

where x ∈ R and Ψ : R → C.
As we already saw in the preceding section, the analysis of Eq. (3.103) reduces to studying
the Hermite equation, with its spectral version described by Eq. (3.92). However, we will
follow another route here. Let us introduce the so-called “creation” and “annihilation”
operators,

â = (i/√2)(d/dx + x),   ↠= (i/√2)(d/dx − x),   (3.104)

and then rewrite the Schrödinger Eq. (3.103) as

Ĥ Ψ(x) = ↠â Ψ(x) = ((E − 1)/2) Ψ(x).   (3.105)
It is straightforward to check that the operator Ĥ is positive semi-definite on functions from
L²:

∫ dx Ψ*(x) Ĥ Ψ(x) = ∫ dx Ψ*(x) ↠â Ψ(x) = ∫ dx |â Ψ(x)|² ≥ 0,

where the equality is achieved only if

â Ψ0(x) = (i/√2)(d/dx + x) Ψ0(x) = 0,

thus resulting in Ψ0(x) = A exp(−x²/2) and E0 = 1. We have just found the eigen-
function and eigenvalue corresponding to the lowest possible energy, the so-called ground state.
To find all other eigen-functions, corresponding to the so-called “excited” states, consider
the commutation relations

â ↠Ψ(x) = ↠â Ψ(x) + Ψ(x),   (3.106)

(↠â)(â†)^n Ψ(x) = (â†)² â (â†)^{n−1} Ψ(x) + (â†)^n Ψ(x)
               = n (â†)^n Ψ(x) + (â†)^{n+1} â Ψ(x).   (3.107)
Introduce Ψ_n(x) := (â†)^n Ψ0(x). Since â Ψ0(x) = 0, the commutation relation (3.107) shows
immediately that

((E_n − 1)/2) Ψ_n(x) = Ĥ Ψ_n(x) = ↠â Ψ_n(x) = n Ψ_n(x),
so that the eigen-functions Ψ_n(x) of the states with energies E_n = 2n + 1 are
expressed via the Hermite polynomials, H_n(x), introduced in Eq. (3.64):

Ψ_n(x) = A_n ((i/√2)(d/dx − x))^n exp(−x²/2)
       = A_n (i^n/2^{n/2}) exp(x²/2) d^n/dx^n exp(−x²),

where we have used the operator identity (d/dx − x) exp(x²/2) = exp(x²/2) d/dx. From the
orthogonality condition for the Hermite polynomials (3.65) one derives A_n = (n! √π)^{−1/2}.
Chapter 4

Partial Differential Equations.

A partial differential equation (PDE) is a differential equation that contains one or more
unknown multivariate functions and their partial derivatives. We begin our discussion by
introducing first-order PDEs, and how to reduce them to a system of ODEs by the method
of characteristics. We then utilize ideas from the method of characteristics to classify
(as hyperbolic, elliptic or parabolic) linear, second-order PDEs in two dimensions (section
4.2). We will discuss how to generalize and solve elliptic PDEs, normally associated with
static problems, in section 4.3. Hyperbolic PDEs, discussed in section 4.4, are normally
associated with waves. Here, we take a more general approach originating from intuition
associated with waves as phenomena (a wave solving a hyperbolic PDE, e.g. a sound wave,
is then a particular example). We will discuss the diffusion (also heat) equation as the main example
of a generalized (to higher dimensions) parabolic PDE in Section 4.5.

4.1 First-Order PDE: Method of Characteristics


The method of characteristics reduces a PDE to multiple ODEs. The method applies mainly
to first-order PDEs (meaning PDEs which contain only first-order derivatives) which are,
moreover, linear in the first-order derivatives.
Let θ(x) : R^d → R be a function of the d-dimensional coordinate, x := (x1, . . . , xd).
Introduce the gradient vector, ∇_x θ := (∂_{x_i}θ; i = 1, . . . , d), and consider the following
equation, linear in ∇_x θ:

(V · ∇_x θ) := Σ_{i=1}^{d} V_i ∂_{x_i} θ = f,   (4.1)

where the velocity, V (x) ∈ Rd and forcing, f (x) ∈ R are given functions of x.


First, consider the homogeneous version of Eq. (4.1)

(V · ∇x θ) = 0. (4.2)

Introduce an auxiliary parameter (or dimension) t ∈ R, call it time, and then introduce the
characteristic equations

dx(t)/dt = V(x(t)),   (4.3)

describing the evolution of the characteristic trajectory x(t) in time according to the func-
tion V. A first integral is a function F for which dF(x(t))/dt = 0. Observe that any first
integral of Eqs. (4.3) is a solution to Eq. (4.2), and that any function of the first integrals
of Eqs. (4.3), g(F1, . . . , Fk), is also a solution to Eq. (4.2).
Indeed, a direct substitution of θ = g into Eq. (4.2) leads to the following sequence of
equalities:

(V · ∇_x g) = Σ_{i=1}^{k} (∂g/∂F_i) Σ_{j=1}^{d} V_j (∂F_i/∂x_j) = Σ_{i=1}^{k} (∂g/∂F_i) dF_i/dt = 0.   (4.4)

The system of equations (4.3) has d − 1 first integrals which do not depend on t directly. Then a
general solution to Eq. (4.2) is

θ(x) = g(F1(x), . . . , F_{d−1}(x)),   (4.5)

where g is assumed to be sufficiently smooth (at least twice differentiable over the first inte-
grals).
Eq. (4.2) has a nice geometrical/flow interpretation. If we think of V, the d-dimensional
vector of the coefficients of ∇_x θ, as a “velocity”, then Eq. (4.2) means that the
derivative of θ over x projected onto the vector V is equal to zero. Therefore, solving the PDE
by the method of characteristics reduces to reconstructing the integral curves from the
vectors V(x), defined at every point x of the space, which are tangent to the curves. The
solution θ(x) is then constant along the curves. If, in the vicinity of each point x of the space,
one changes variables, x → (t, F1, . . . , F_{d−1}), where t is considered as a parameter along an
integral curve, and if the transformation is well defined (i.e. the Jacobian of the transformation
is not zero), then Eq. (4.2) becomes dθ/dt = 0 along the characteristic.
Let us illustrate how to find a characteristic on the example of the following homogeneous
PDE:

∂_x θ(x, y) + y ∂_y θ(x, y) = 0.

The characteristic equations are dx/dt = 1, dy/dt = y, with the general solution x(t) =
t + c1, y = c2 exp(t). The only first integral of the characteristic equations is F(x, y) =
y exp(−x); therefore θ = g(F(x, y)), where g is an arbitrary function, is a general solution.
It is useful to visualize the flow along the characteristics in the (x, y) space.

Exercise 4.1.1. Find and visualize in the (x, y) plane characteristics of

(a) ∂_x θ − y² ∂_y θ = 0,

(b) x∂x θ − y∂y θ = 0,

(c) y∂x θ − x∂y θ = 0.

Find general solutions to these PDEs, and verify your solutions by direct substitution.

Consider the following initial value (boundary) Cauchy problem: solve Eq. (4.2) subject
to the boundary condition

θ(x)|_{x0 ∈ S} = ϑ(x0),   (4.6)

where S is a surface (boundary) of dimension d − 1. This Cauchy problem has a
well-defined solution in at least some vicinity of S if S is not tangent to a characteristic
of Eq. (4.2). Consistently with what was described above, the solution to Eq. (4.2) with the
initial/boundary condition Eq. (4.6) can be thought of as a change of variables.
Let us illustrate the solution to the Cauchy problem on the example

∂_x θ = y ∂_y θ,   θ(0, y) = cos(y).

The characteristic equations, ẋ = 1, ẏ = −y, have solutions x(t) = t − t1, y(t) = exp(t2 − t),
and one first integral,

F(x, y) = y exp(x) = constant;

therefore

θ(x, y) = g(y exp(x)),

where g is an arbitrary function, is a general solution. The boundary/initial conditions are given
at the straight line, x = 0, which is not tangent to any of the characteristics, y = exp(−x + x1).
Therefore, substituting the general solution into the boundary condition, one finds the particular
form of the function g for this specific Cauchy problem:

θ(0, y) = g(y) = cos(y).

This results in the desired solution: θ(x, y) = cos(y exp(x)).
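A quick symbolic check of this solution:

    import sympy as sp

    x, y = sp.symbols('x y')
    theta = sp.cos(y * sp.exp(x))   # the solution found above

    # It satisfies the PDE  d theta/dx = y d theta/dy  and the boundary condition:
    print(sp.simplify(sp.diff(theta, x) - y * sp.diff(theta, y)))  # -> 0
    print(theta.subs(x, 0))                                        # -> cos(y)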

Exercise 4.1.2. (a) Solve


y∂x θ − x∂y θ = 0,

for initial condition, θ(0, y) = y 2 . (b) Explain why the same problem with the initial
condition θ(0, y) = y is ill-posed. (c) Discuss if the same problem with the initial condition,
θ(1, y) = y 2 , is ill posed or not.

Exercise 4.1.3 (not graded). Show that the characteristic equations for the Liouville PDE

∂_t f + {H, f} = 0,   {H, f} := Σ_{i=1}^{N} ( ∂H/∂q_i ∂f/∂p_i − ∂H/∂p_i ∂f/∂q_i ),

where H(p; q) is the Hamilton function and {H, f} is the Poisson bracket of H and f, are
the Hamilton Eqs. (3.8).

Now let us get back to the inhomogeneous Eq. (4.1). As is standard for linear equations,
the general solution to an inhomogeneous equation is constructed as the superposition of
a particular solution and the general solution to the respective homogeneous equation. To
find the former we transition to characteristics; then Eq. (4.1) becomes

(V · ∇_x)θ = (ẋ · ∇_x)θ = dθ/dt = f(x(t)),   (4.7)

which can be integrated along the characteristic, thus resulting in the desired particular solu-
tion to Eq. (4.1):

θ_inh = ∫_{t0}^{t} f(x(s)) ds.   (4.8)

Notice that this solution is not constant along characteristics.

Exercise 4.1.4. Solve the Cauchy problem for the following inhomogeneous equation

∂x θ − y∂y θ = y, θ(0, y) = sin(y).

The method of characteristics can also be generalized to quasi-linear first-order PDEs
(first-order PDEs (4.1) where V and f depend not only on the vector of coordinates, x, but
also on the function θ(x)). In this case the characteristic equations become

dx/dt = V(x, θ),   dθ/dt = f(x, θ).   (4.9)

The general solution to a quasi-linear PDE is given by g(F1, F2, . . . , Fn) = 0, where g is an
arbitrary function of n first integrals of Eq. (4.9).
Consider the example of the Hopf equation in d = 1

∂t u + u∂x u = 0, (4.10)

which, when u(t; x) refers to the velocity of a particle at location x and time t, describes the
one dimensional flow of non-interacting particles. The characteristic equations and initial
conditions are
ẋ = u, u̇ = 0, x(t = 0) = x0 , u(t = 0) = u0 (x0 ).

Direct integration produces x = u0(x0)t + x0, giving the following implicit equation for u:

u = u0 (x − ut). (4.11)

Under the specific conditions, u0 (x) = c(1 − tanh x), this results in the following (still
implicit) equation, u = c(1 − tanh(x − ut)). Computing partial derivative, one derives
∂x u = −c/(cosh2 (x − ut) − ct), which shows that it diverges in finite time at t∗ = 1/c and
x = ut. The phenomenon is called wave breaking, and has the physical interpretation of fast
particles catching slower ones and aggregating, leading to sharpening of the velocity profile
and eventual breakdown. This singularity is formal, meaning that the physical model is no
longer applicable when the singularity occurs. Introducing a small κ∂x2 u term to the right
hand side of Eq. (4.10) regularizes the non-physical breakdown, and explains creation of
shock. The regularized second-order PDE is called Burger’s equation.
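The steepening of the profile can also be observed numerically. The following sketch (a hypothetical snippet; c = 1 and the probe times are our choices) solves the implicit relation (4.11) by bisection and tracks the growth of ∂x u as t approaches t∗ = 1/c:

import numpy as np
from scipy.optimize import brentq

c = 1.0
u0 = lambda x: c * (1.0 - np.tanh(x))

def u_implicit(x, t):
    # u is bracketed by the range of u0, namely (0, 2c)
    return brentq(lambda u: u - u0(x - u * t), 0.0, 2.0 * c)

dx = 1e-4
for t in [0.0, 0.5, 0.9, 0.99]:       # breaking time is t* = 1/c = 1
    x0 = c * t                         # the forming shock sits at x = u t with u = c
    slope = (u_implicit(x0 + dx, t) - u_implicit(x0 - dx, t)) / (2 * dx)
    print(f"t = {t:5.2f}   du/dx at the front = {slope:10.3f}")
    # the slope is -c/(1 - ct) and blows up as t -> t*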

4.2 Classification of linear second-order PDEs


Consider the most general linear second-order PDE over two independent variables:

a11 ∂x2 u + 2a12 ∂x ∂y u + a22 ∂y2 u + b1 ∂x u + b2 ∂y u + cu + f = 0, (4.12)

where all the coefficients may depend on the two independent variables x and y.
The method of characteristics (which applies to first-order PDEs, for example when
a11 = a12 = a22 = c = 0 in Eq. (4.12)) can inform the analysis of second-order PDEs.
Therefore, let us momentarily return to the first-order PDE,

b1 ∂x u + b2 ∂y u + f = 0, (4.13)

and interpret its solution as a variable transformation from the pair (x, y) to
the new pair of variables (η(x, y), ξ(x, y)), assuming that the Jacobian of the transformation
is neither zero nor infinite anywhere within the domain of (x, y) of interest:

J = det ( ∂x η   ∂y η ; ∂x ξ   ∂y ξ ) ≠ 0, ∞. (4.14)

Substituting u = w(η(x, y), ξ(x, y)) into the sum of the first derivative terms in Eq. (4.13)
one derives

b1 ∂x u + b2 ∂y u = b1 (∂x η∂η w + ∂x ξ∂ξ w) + b2 (∂y η∂η w + ∂y ξ∂ξ w)


= (b1 ∂x η + b2 ∂y η) ∂η w + (b1 ∂x ξ + b2 ∂y ξ) ∂ξ w. (4.15)

Requiring that the second term in Eq. (4.15) is zero, one observes that this is satisfied for all
x, y if ξ(x, y) is constant along the curves y(x) which satisfy
the characteristic equation, b1 dy/dx − b2 = 0.
Let us now try the same logic, but now focusing on the sum of the second-order terms
in Eq. (4.12). We derive

a11 ∂x²u + 2a12 ∂x∂y u + a22 ∂y²u = (A ∂ξ² + 2B ∂ξ∂η + C ∂η²) w + lower order terms, (4.16)

where

A := a11 (∂x ξ)2 + 2a12 (∂x ξ)(∂y ξ) + a22 (∂y ξ)2


B := a11 (∂x ξ)(∂x η) + a12 (∂x ξ∂y η + ∂y ξ∂x η) + a22 (∂y ξ)(∂y η)
C := a11 (∂x η)2 + 2a12 (∂x η)(∂y η) + a22 (∂y η)2 .

Let us now attempt, by analogy with the case of the first-order PDE, to force the first and last
terms on the rhs of Eq. (4.16) to zero, i.e. A = C = 0. This is achieved if we require that
ξ and η are constant along the curves y+(x) and y−(x), respectively, where

dy±/dx = (a12 ± √D)/a11, where D := a12² − a11 a22, (4.17)

and D is called the discriminant. Eqs. (4.17) have, in the general case, distinct (first) integrals
ψ±(x, y) = const. Then we can choose the new variables as ξ = ψ+(x, y) and η = ψ−(x, y).
If D > 0, Eq. (4.12) is called a hyperbolic PDE. In this case the characteristics are real,
and any real pair (x, y) is mapped to a real pair (ξ, η). Eq. (4.12) takes the following
canonical form:

∂ξ∂η u + b̃1 ∂ξ u + b̃2 ∂η u + c̃ u + f̃ = 0. (4.18)

Notice that another (second) canonical form for the hyperbolic equation is derived if we
transition further from (ξ, η) to (α, β) := ((ξ + η)/2, (ξ − η)/2). Then Eq. (4.18) becomes

∂α²u − ∂β²u + b̃1^(2) ∂α u + b̃2^(2) ∂β u + c̃^(2) u + f̃^(2) = 0. (4.19)

If D < 0, Eq. (4.12) is called an elliptic PDE. In this case the two characteristic
Eqs. (4.17) are complex conjugates of each other and their first integrals are complex conjugate
as well. To make the map from old to new variables real, we choose in this case α = Re(ψ+(x, y)) = (ψ+(x, y) +
ψ−(x, y))/2, β = Im(ψ+(x, y)) = (ψ+(x, y) − ψ−(x, y))/(2i). This change of variables results
in the following canonical form for the elliptic second-order PDE:

∂α²u + ∂β²u + b1^(e) ∂α u + b2^(e) ∂β u + c^(e) u + f^(e) = 0. (4.20)

D = 0 is the degenerate case, ψ+(x, y) = ψ−(x, y), and the resulting equation is a
parabolic PDE. Then we can choose β = ψ+(x, y) and α = ϕ(x, y), where ϕ is an arbitrary
function of x, y independent of ψ+(x, y). In this case Eq. (4.12) takes the following canonical
parabolic form:

∂α²u + b1^(p) ∂α u + b2^(p) ∂β u + c^(p) u + f^(p) = 0. (4.21)

Exercise 4.2.1. Determine the type of each equation and then perform the change of variables reducing
it to the respective canonical form:

(a) ∂x2 u + ∂x ∂y u − 2∂y2 u − 3∂x u − 15∂y u + 27x = 0,

(b) ∂x2 u + 2∂x ∂y u + 5∂y2 u − 32u = 0,

(c) ∂x2 u − 2∂x ∂y u + ∂y2 u + ∂x u + ∂y u − u = 0.

4.3 Elliptic PDEs: Method of Green Function


Elliptic PDEs often originate from the description of static phenomena in two or more
dimensions.
Let us first clarify the higher-dimensional generalization. We generalize Eq. (4.20)
to

Σ_{i,j=1}^{d} aij ∂xi ∂xj u(x) + lower order terms = 0, (4.22)

where it is assumed that it is not possible to eliminate at least one second-derivative term
using the conditions of the respective Cauchy problem. Notice that in d > 2 Eq. (4.22) cannot,
in general, be reduced to a canonical form (introduced, in the previous section, in the d = 2 case).
Our prime focus here will be on the d ≥ 2 cases where aij in Eq. (4.22) is ∼ δij, and
also on solving inhomogeneous equations, where a nontrivial (nonzero) solution is driven
by an actual nonzero source. It is natural to approach solving these equations with the
Green function method.
We have discussed in Section 3.5.1 how to solve the static, linear, one-dimensional case of the
Poisson equation using Green functions. Here we generalize and consider the Poisson equation
in a space of higher dimension, specifically in d = 2 and d = 3:

∇2r f = φ(r), (4.23)

where ∇2r = ∆r is the Laplacian operator, which is ∂x2 + ∂y2 in d = 2, r = (x, y) ∈ R2 , and
∂x2 + ∂y2 + ∂z2 in d = 3, r = (x, y, z) ∈ R3 .

The Poisson Eq. (4.23) has many applications; in particular, it describes the electrostatic
potential of a charge distributed in space with density ρ(r), in which case
φ(r) = −4πρ(r). Note that the homogeneous case of ρ = 0 is called the Laplace equation.
We will distinguish the two cases by calling them the (inhomogeneous) Laplace equation
and the homogeneous Laplace equation, respectively.
We will also discuss in the following the Debye equation

(∇²r − κ²) f = φ(r), (4.24)

which describes the distribution of charge ρ(r) in plasma for φ(r) = −4πρ(r).


Functions which satisfy the homogeneous Laplace equation are called harmonic. Notice
that there exists no nonzero harmonic function defined in the entire R² and approaching
0 as |r| → ∞. Indeed, applying the Fourier transform to the homogeneous Laplace equation,
one derives q² f̂(q) = 0, which results in f̂(q) ∼ δ(q), and then (applying the inverse Fourier
transform) f(r) = const. Finally, requiring that f → 0 as r → ∞, one observes that the
constant is zero.
Let us stress that these arguments extend to any dimension, and also apply to the
Debye equation: there exists no solution to the Debye equation defined in the entire space
and decaying to zero at r → ∞.
Therefore, a nonzero harmonic function should be defined in a bounded domain, where
the homogeneous Laplace equation should be supplemented with some kind of boundary
conditions. For example, one can fix f(x) at the boundary.
To solve the Laplace problem let us define the Green function, which is a solution to
the inhomogeneous equation with a point source on the right hand side,

∇2r G = δ(r). (4.25)

Then the solution to Eq. (4.23) becomes

f(r) = ∫ dr′ G(r − r′) φ(r′). (4.26)

The solution to Eq. (4.25) can be found by applying the Fourier transform, resulting
in the following algebraic equation: q² Ĝ(q) = −1. Resolving it (trivially) and applying the
inverse Fourier transform one derives


G(r) = −∫ d³q/(2π)³ · exp(i(q·r))/q²
     = −∫ d²q⊥/(2π)³ ∫_{−∞}^{∞} dq∥ exp(i q∥ r)/(q∥² + q⊥²)
     = −∫ d²q⊥/(2π)³ · (π/q⊥) exp(−q⊥ r)
     = −1/(4πr). (4.27)
Substituting Eq. (4.27) into Eq. (4.26) one derives

f(r) = −∫ d³r′ φ(r′)/(4π|r − r′|) = ∫ d³r′ ρ(r′)/|r − r′|, (4.28)

which is thus the expression for the electrostatic potential for a given distribution of the
charge density in space.

Example 4.3.1. Find the Green function for the Laplace equation in the region outside
the sphere of radius R with zero boundary condition on the sphere, i.e. solve

∇²r G(r; r′) = δ(r − r′), (4.29)

for R ≤ r′, r (here r = |r| and r′ = |r′|), under the condition that G(r; r′) = 0 at r = R ≤ r′.

The Green function in this case is inhomogeneous, i.e. G(r; r′) ≠ G(r − r′). It is direct
to check that the solution is given by

G = −1/(4π|r − r′|) + R/(4π r′ |r − r″|),

where r″ := r′R²/(r′)². It is also clear that the solution is equivalent to placing two point
sources (charges), one at r′ and another one, of opposite sign and with magnitude rescaled by
the factor R/r′, at the image point r″, as if there were no zero boundary condition fixed at the
surface, i.e. choosing ρ(r) ∝ δ(r − r′) − (R/r′) δ(r − r″)
in Eq. (4.28). (This method of solving the Laplace equation with zero condition at a
surface is called, naturally, the method of (mirror) images.)
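A quick numeric sanity check of the image construction (a hypothetical snippet; the source location is our choice) confirms that G vanishes on the sphere |r| = R:

import numpy as np

rng = np.random.default_rng(0)
R = 1.0
rp = np.array([0.0, 0.0, 2.0])            # source at r', with r' = 2 > R
rpp = rp * R**2 / np.dot(rp, rp)          # image point r'' = r' R^2 / r'^2
w = R / np.linalg.norm(rp)                # image weight R/r'

def G(r):
    return (-1.0 / (4 * np.pi * np.linalg.norm(r - rp))
            + w / (4 * np.pi * np.linalg.norm(r - rpp)))

# sample random points on the sphere |r| = R
v = rng.normal(size=(5, 3))
pts = R * v / np.linalg.norm(v, axis=1, keepdims=True)
print([abs(G(p)) < 1e-12 for p in pts])   # all True: G = 0 on the sphere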
Let us now turn to the Debye equation and find its Green function, defined by

(∇²r − κ²) G = δ(r). (4.30)

To solve Eq. (4.30) we act in the same way as above in the case of the Laplace equation.
The equation for the Fourier transform of the Green function is (q² + κ²) Ĝ(q) = −1;
resolving the algebraic equation and applying the inverse Fourier transform to the result one finds

G(r) = −∫ d³q/(2π)³ · exp(i(q·r))/(q² + κ²)
     = −∫ d²q⊥/(2π)³ ∫_{−∞}^{∞} dq∥ exp(i q∥ r)/(q∥² + q⊥² + κ²)
     = −∫ d²q⊥/(2π)³ · (π/√(q⊥² + κ²)) exp(−√(q⊥² + κ²) r)
     = −exp(−κr)/(4πr). (4.31)
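Both Green functions can be verified symbolically away from the origin, using the radial form of the Laplacian in d = 3 (a hypothetical sympy snippet):

import sympy as sp

r, kappa = sp.symbols('r kappa', positive=True)
# for radial f(r) in d = 3 the Laplacian is (1/r^2) d/dr (r^2 df/dr)
lap = lambda f: sp.diff(r**2 * sp.diff(f, r), r) / r**2

G_coulomb = -1 / (4 * sp.pi * r)
G_debye = -sp.exp(-kappa * r) / (4 * sp.pi * r)

print(sp.simplify(lap(G_coulomb)))                      # 0: harmonic for r > 0
print(sp.simplify(lap(G_debye) - kappa**2 * G_debye))   # 0: solves the Debye equation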
Exercise 4.3.2. Find the general solution to the inhomogeneous Debye equation

(∇²r − κ²) f = −4πρ(r),

where the charge density ρ(r) depends only on the distance r = |r| from the origin.

4.4 Waves in Homogeneous Media: Hyperbolic PDE


Although hyperbolic PDEs are normally associated with waves, we begin our discussion by
developing intuition which generalizes to a broader class of integro-differential equations
beyond hyperbolic PDEs. In other words, we act here in reverse to what may be considered
the standard mathematical process: we begin by describing properties of solutions associated
with waves, and then walk back to the equations which describe such waves.
Consider the propagation of waves in homogeneous media, for example: electro-magnetic
waves, sound waves, spin-waves, surface-waves, electro-mechanical waves (in power sys-
tems), and so on. In spite of such a variety of phenomena, they all admit one rather
universal description. The wave process at a general position in d-dimensional space r and
time t is represented as the following integral over the wave vector k
u(t; r) = ∫ dk/(2π)^d exp(i(k·r)) ψk(t) û(k),   ψk(t) ≡ exp(−iω(k)t), (4.32)
where ω(k) and û(k) are the dispersion law and wave amplitude dependent on the wave
vector k. (Notice the similarities and the differences with the Fourier integral.) In Eq. (4.32)
ψk(t) is a solution to the following first-order (in time) linear ODE,

(d/dt + iω(k)) ψk = 0, (4.33)

or alternatively of the following second-order linear ODE,

(d²/dt² + (ω(k))²) ψk = 0. (4.34)

These are called the wave equations in Fourier representation. The linearity of the equations
is essential; it reflects the fact that the generally nonlinear underlying dynamics has been
linearized. Waves may also interact with each other; the interaction of waves can only come
from accounting for nonlinearities in the original equations. In this analysis, we focus primarily
on the linear regime.

Dispersion Laws

Consider the case where ω(k) = c|k|, where c is a constant having the dimensionality and sense
of a velocity. In this case, the inverse Fourier transform version of Eq. (4.34) becomes

(d²/dt² − c²∇²r) ψ(t; r) = 0. (4.35)

Note that the two differential operators in Eq. (4.35), one in time and another in space,
have opposite signs. Therefore, we naturally arrive at the case which generalizes the hyper-
bolic PDE (4.19). It is a generalization because r is not one-dimensional but d-dimensional,
d ≥ 1.
Eq. (4.35) with constant c describes a variety of important physical situations: as mentioned
already, it describes propagation of sound in a homogeneous gas, liquid or crystal
medium. In this case ψ describes the shift of an element of the matter from its equilibrium
position and c is the speed of sound in the material. Note that there is a unique speed
of sound in a gas or liquid, while a 3d crystal supports three different waves (with three
different values of c), each associated with a distinct polarization. For example, in an isotropic
crystal there are longitudinal and transversal waves propagating along and, respectively,
perpendicular to the media shift.
Another example is given by the electro-magnetic waves, described by the Maxwell
equations on the electric, E, and magnetic, B, fields,

∂t E = c∇r × B, ∂t B = −c∇r × E, (4.36)

supplemented by the divergence-free conditions,

(∇r · E) = (∇r · B) = 0, (4.37)

where × is the vector product in d = 3 (in components, (∇r × B)i = εijk ∇j Bk, where
i, j, k = 1, 2, 3 and εijk is the absolutely skew-symmetric tensor in d = 3), and c is the speed
of light in the media. Differentiating the first equation in the pair of Eqs. (4.36) over time,
substituting the resulting ∂t∇r × B by −c(∇r × (∇r × E)), consistently with the second
equation in the pair, and taking into account that for the divergence-free E,
(∇r × (∇r × E)) = −∇²r E, one arrives at Eq. (4.35) for all components of the electric
field, i.e. with ψ replaced by E.
The dispersion law in the case of sound and light waves is linear, ω(k) = ±c|k|, however
there are other more complex examples. For example, surface waves propagating over the
surface of water (with air), are characterized by the following dispersion law
ω(k) = √( gk + (σ/ρ)k³ ), (4.38)

where g, σ and ρ are the gravitational acceleration, the surface tension coefficient and the
density of the fluid, respectively. Eq. (4.38) is relatively involved because it accounts for both
capillary and gravitational effects. Gravity waves dominate at small k (large distances), where
Eq. (4.38) transforms to ω(k) = √(gk), while capillary waves dominate in the opposite limit
of large k (small distances), where one gets asymptotically ω = (σ/ρ)^{1/2} k^{3/2}.
Recall that Eq. (4.33) or Eq. (4.34) are stated in the Fourier k-representation. Transitioning
to the respective r-representation in the case of a nonlinear dispersion relation, for
example that of Eq. (4.38), will NOT result in a PDE. We arrive in this general case
at an integro-differential equation, reflecting the fact that a nonlinear dispersion relation,
even though local in k-space, becomes nonlocal in r-space.
In general, propagation of waves in a homogeneous medium is characterized by a
dispersion law dependent only on the absolute value, k = |k|, of the wave vector k. The
quantities ω(k)/k and dω(k)/dk, both having the dimensionality of speed, are called, respectively,
the phase velocity and the group velocity.

Example 4.4.1. Solve the Cauchy (initial value) problem for the amplitude of spin waves, which
satisfies the following PDE
∂t2 ψ = −(Ω − b∇2r )2 ψ, (4.39)

in d = 3, where ψ(t = 0; r) = exp(−r2 ) and dψ/dt(t = 0; r) = 0.

Note, first, that applying the Fourier transform over r to Eq. (4.39) one arrives at
Eq. (4.34), where

ω(k) = Ω + bk², (4.40)

is the respective (spin wave) dispersion law. The Fourier transform of the initial condition
is ψ̂(t = 0; k) = π^{3/2} exp(−k²/4). Since dψ/dt(t = 0; r) = 0, the Fourier transform
of the initial time derivative is zero as well, that is, dψ̂/dt(t = 0; k) = 0. Then the solution to
Eqs. (4.34,4.40) becomes ψ̂(t; k) = π^{3/2} exp(−k²/4) cos((Ω + bk²)t). Evaluating the inverse
Fourier transform one derives

ψ(t; r) = π^{3/2} ∫ d³k/(2π)³ e^{−k²/4} cos((Ω + bk²)t) exp(i(k · r))
        = ∫_0^∞ (k dk/(2π^{1/2} r)) e^{−k²/4} cos((Ω + bk²)t) sin(kr)
        = −(1/(2π^{1/2} r)) (d/dr) ∫_0^∞ dk e^{−k²/4} cos((Ω + bk²)t) cos(kr)
        = −Re[ (exp(iΩt)/(4π^{1/2} r)) (d/dr) ∫_{−∞}^{∞} dk exp( −((1 − 4ibt)/4) k² + ikr ) ]
        = Re[ exp( iΩt − r²/(1 − 4ibt) ) / (1 − 4ibt)^{3/2} ].
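The closed form can be cross-checked numerically against the radial k-integral from the second line of the derivation (a hypothetical snippet; the parameter values are arbitrary):

import numpy as np
from scipy.integrate import quad

Omega, b, t, rr = 0.7, 0.3, 1.1, 1.5

# psi(t; r) = 1/(2 sqrt(pi) r) * int_0^inf k exp(-k^2/4) cos((Omega+b k^2)t) sin(k r) dk
integrand = lambda k: k * np.exp(-k**2 / 4) * np.cos((Omega + b * k**2) * t) * np.sin(k * rr)
numeric = quad(integrand, 0, np.inf)[0] / (2 * np.sqrt(np.pi) * rr)

z = 1 - 4j * b * t
closed = (np.exp(1j * Omega * t - rr**2 / z) / z**1.5).real
print(numeric, closed)   # the two numbers agree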

Exercise 4.4.2. Solve the Cauchy (initial value) problem for the wave Eq. (4.35) in d = 3,
where ψ(t = 0; r) = exp(−r2 ) and dψ/dt(t = 0; r) = 0.

Stimulated Waves: Radiation

So far we have discussed the free propagation of waves. Consider the inhomogeneous equation
generalizing Eq. (4.34) that arises from a source term χ(t; r) on the right hand side:

(d²/dt² + (ω(−i∇r))²) ψ(t; r) = χ(t; r), (4.41)

where we have used −i∇r exp(i(k·r)) = k exp(i(k·r)). You may assume that the dispersion law
ω(k) is a continuous function of its argument (the absolute value of the wave vector), so that the
operator (ω(−i∇r))² is well defined in the sense of the function's Taylor series.
The Green function for the PDE is defined as the solution to

(d²/dt² + (ω(−i∇r))²) G(t; r) = δ(t)δ(r). (4.42)

The solution to the inhomogeneous PDE, Eq. (4.41), can be expressed as the convolution
of the source term χ(t1; r1) with the Green function G(t; r):

ψ(t; r) = ∫ dt1 dr1 G(t − t1; r − r1) χ(t1; r1). (4.43)

The general solution to Eq. (4.41) is expressed as the sum of the forced solution (4.43) and a zero
mode of the respective free equation, i.e. Eq. (4.41) with zero right hand side.

To solve Eq. (4.42) for the Green function, we consider the equivalent equation for its Fourier
transform:

(d²/dt² + (ω(k))²) Ĝ(t; k) = δ(t). (4.44)

Recall that the inhomogeneous ODE (4.44) was already discussed earlier in the course.
Indeed, Eq. (3.48) solves Eq. (4.44). Then, recalling that ω depends on k and applying the
inverse Fourier transform over k to Eq. (3.48), one arrives at

G(t; r) = θ(t) ∫ d³k/(2π)³ exp(i(k·r)) sin(ω(k)t)/ω(k). (4.45)

Exercise 4.4.3 (not graded). Show that the general expression (4.45) in the case of the
linear dispersion law, ω(k) = ck, becomes

G(t; r) = (θ(t)/(4πcr)) δ(r − ct), (4.46)

where r = |r|.

Substituting Eq. (4.46) into Eq. (4.43) one derives the following expression for linear
dispersion (light or sound) radiation from a source:

ψ(t; r) = (1/(4πc²)) ∫ (dr1/R) χ(t − R/c; r1),   R := |r − r1|. (4.47)

The solution shows that the action of the source is delayed by R/c, corresponding to the
propagation of light (or sound) from the source to the observation point.

Exercise 4.4.4 (not graded). Solve the radiation Eq. (4.41) in the case of the linear
dispersion law for a point harmonic source, χ(t; r) = cos(ωt)δ(r).

4.5 Diffusion Equation


The most common example of a multi-dimensional generalization of the parabolic equation
Eq. (4.21) is the homogeneous diffusion equation

∂t u = κ∇²r u, (4.48)

where κ is the diffusion coefficient. The equation appears in a number of applications; for
example, it can be used to describe the evolution of the number density of
particles, or the spatial variation of temperature. The same equation describes properties
of the basic stochastic process (Brownian motion). In what follows we set κ = 1, which can
always be achieved by rescaling time.

Consider the Cauchy problem with u(t; y) given at t = 0. The Fourier transform over
y ∈ R^d is

û(t; q) = ∫ dy1 . . . dyd exp(i(q · y)) u(t; y). (4.49)

Integrating Eq. (4.48) with the Fourier weight one arrives at

∂t û(t; q) = −q² û(t; q). (4.50)

Integrating the equation over time, û(t; q) = exp(−q²t) û(0; q), and evaluating the inverse
Fourier transform over q of the result, one arrives at

u(t; x) = ∫ (dy1 · · · dyd / (4πt)^{d/2}) exp( −(x − y)²/(4t) ) u(0; y). (4.51)

If the initial field u(0; x) is localized around some x, say around x = 0, that is, if
u(0; x) decays sufficiently fast as |x| increases, then one may find a universal asymptotic
of u(t; x) at long times, t ≫ l², where l is the length scale on which u(0; x) is localized. At
these sufficiently large times the dominant contribution to the integral in Eq. (4.51) is acquired
from the |y| ∼ l vicinity of the origin, and therefore in the leading order one can ignore the
y-dependence of the diffusive kernel in the integrand of Eq. (4.51), i.e.

u(t; x) ≈ (A/(4πt)^{d/2}) exp( −x²/(4t) ),   A = ∫ u(0; y) dy1 . . . dyd. (4.52)

Notice that the approximation (4.52) corresponds to the substitution u(0, y) → Aδ(y) in
Eq. (4.51). Another interpretation of Eq. (4.52) corresponds to expanding exp( −(x−y)²/(4t) )
in a Taylor series in y, and then ignoring all but the leading order term, O(y⁰), in the
expansion. If A = 0 one needs to account for the O(y¹) term, and drop the rest. In this
case the analog of Eq. (4.52) becomes

u(t; x) ≈ ((B · x)/(4πt)^{d/2+1}) exp( −x²/(4t) ),   B = 2π ∫ y u(0; y) dy1 . . . dyd. (4.53)
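The asymptotic (4.53) is easy to test numerically in d = 1 for odd initial data (so that A = 0); the snippet below (hypothetical; the initial profile is our choice) compares the exact convolution (4.51) with the asymptotic at a long time:

import numpy as np
from scipy.integrate import quad

u0 = lambda y: y * np.exp(-y**2 / 2)        # odd initial data, hence A = 0
B = 2 * np.pi * quad(lambda y: y * u0(y), -np.inf, np.inf)[0]

def u_exact(t, x):
    kern = lambda y: np.exp(-(x - y)**2 / (4 * t)) * u0(y) / np.sqrt(4 * np.pi * t)
    return quad(kern, -np.inf, np.inf)[0]

t, x = 200.0, 10.0                           # long time, x ~ sqrt(t)
asym = B * x * np.exp(-x**2 / (4 * t)) / (4 * np.pi * t)**1.5
print(u_exact(t, x), asym)                   # close, and closer as t grows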

Exercise 4.5.1. Find the asymptotic behavior of the one-dimensional diffusion equation at
sufficiently long times for the following initial conditions:

(a) u(0; x) = x exp( −x²/(2l²) );

(b) u(0; x) = exp( −|x|/l );

(c) u(0; x) = x exp( −|x|/l );

(d) u(0; x) = 1/(x² + l²);

(e) u(0; x) = x/(x² + l²)².
Hint: Think about expanding the diffusion kernel in the integrand of Eq. (4.51) in a series
over y.

Our next step is to find the Green function of the heat equation, i.e. to solve

∂t G − ∇²x G = δ(t)δ(x). (4.54)

In fact, we have solved this problem already, as Eq. (4.51) describes it with the initial
condition u(0; y) = G(+0; y) = δ(y). The result is

G(t; x) = (1/(4πt)^{d/2}) exp( −x²/(4t) ). (4.55)

As always, the Green function can be used to solve the inhomogeneous diffusion equation

∂t u − ∇²x u = φ(t; x), (4.56)

whose solution is expressed via the Green function as follows:

u(t; x) = ∫_0^∞ dt′ ∫ dy G(t′; y) φ(t − t′; x − y), (4.57)

where we assume that u vanishes in the far past, u(−∞; x) = 0.

Exercise 4.5.2 (not graded). Solve Eq. (4.56) for φ(t; x) = θ(t) exp( −x²/(2l²) ) in the
d = 4-dimensional space.

4.6 Boundary Value Problems: Fourier Method


Consider the boundary value problem associated with sound waves:

∂t2 u(t; x) − c2 ∂x2 u(t; x) = 0, (4.58)


0 ≤ x ≤ L, u(t, 0) = u(t, L) = 0, u(0, x) = ϕ(x), ∂t u(0, x) = ψ(x). (4.59)

This problem can be solved by the Fourier method (also called the method of separation of
variables), which splits into two steps.

First, we look for particular solutions which satisfy only the boundary conditions in
the coordinate x. We look for u(t, x) in the separable form u(t, x) = X(x)T(t).
Substituting this ansatz into Eq. (4.58) one arrives at

X″(x)/X(x) = T″(t)/(c² T(t)) = −λ, (4.60)

where λ is an arbitrary constant. The general solution to the equation for X is

X = A cos(√λ x) + B sin(√λ x).

Requiring that X(x) satisfies the same boundary conditions as in Eq. (4.59), one finds that this is possible
only if A = 0 and √λ L = nπ, n = 1, 2, . . . . From here we derive the solutions labeled by
the integer n and the respective spatial form of the solution:

λn = (nπ/L)²,   Xn(x) = sin(nπx/L).

We are now ready to get back to Eq. (4.60) and resolve the equation for T(t):

Tn(t) = An cos(nπct/L) + Bn sin(nπct/L),

where An, Bn are arbitrary constants. The Xn(x) form a complete basis and therefore a general
solution can be written as a linear combination of the basis solutions:

u(t, x) = Σ_{n=1}^{∞} Xn(x) Tn(t).

In the second step we fix An and Bn by resolving the initial-condition portion of Eqs. (4.59):

ϕ(x) = Σ_{n=1}^{∞} An Xn(x),   ψ(x) = Σ_{n=1}^{∞} c√λn Bn Xn(x). (4.61)

Notice that the eigenfunctions Xn(x) are orthogonal:

∫_0^L dx Xn(x) Xm(x) = (L/2) δnm.

Multiplying both Eqs. (4.61) by Xm(x), integrating from 0 to L, and accounting for
the orthogonality of the eigenfunctions, one derives

Am = (2/L) ∫_0^L dx ϕ(x) Xm(x),   Bm = (2/(c√λm L)) ∫_0^L dx ψ(x) Xm(x). (4.62)

Exercise 4.6.1. The equation describing the deviation of a string from the straight line,
u(t; x), is ∂t2 u − c2 ∂x2 u = 0, where x is position along the line, t, is the time, and, c, is
a constant (speed of sound). Assume that the string has at t = 0 a parabolic shape,
u(0; x) = 4hx(L − x)/L2 , with both ends, at x = 0 and x = L, respectively, attached to the
straight line. Let us also assume that the speed of the string is equal to zero at t = 0, i.e.
∀x ∈ [0, L], ∂t u(0; x) = 0. Find dependence of the string deviation, u(t; x), on time, t, at a
position, x ∈ [0, L], along the straight line.

Let us now analyze the following parabolic boundary value problem over x ∈ [0, L]:

∂t u = a²∂x²u,   u(t, 0) = u(t, L) = 0,   u(0, x) = x for x < L/2, and u(0, x) = L − x for x > L/2. (4.63)

Here we follow the same Fourier method approach. In fact, the spectral part of the
solution here is identical to the one just described above in the hyperbolic case, while
the temporal components are obviously different. One derives Tn′ = −a²λn Tn, which has a
decaying solution

Tn = An exp( −(nπ/L)² a² t ).

Expansion of the initial condition in the Fourier series is analogous to the one conducted above,
therefore resulting in

u(t, x) = (4L/π²) Σ_{n=0}^{∞} ( (−1)ⁿ/(2n+1)² ) exp( −((2n+1)π/L)² a² t ) sin( (2n+1)πx/L ).

Notice that the solution is symmetric with respect to the middle of the interval, u(t, x) =
u(t, L − x), as this symmetry is inherited from the initial conditions.
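The series is straightforward to evaluate numerically; the sketch below (hypothetical; we set L = a = 1) checks that at t = 0 it reproduces the triangular initial profile of Eq. (4.63):

import numpy as np

L, a = 1.0, 1.0

def u_series(t, x, nmax=200):
    n = np.arange(nmax)
    coef = 4 * L / np.pi**2 * (-1)**n / (2 * n + 1)**2
    decay = np.exp(-(((2 * n + 1) * np.pi / L)**2) * a**2 * t)
    return np.sum(coef * decay * np.sin((2 * n + 1) * np.pi * x / L))

triangle = lambda x: x if x < L / 2 else L - x
for x in [0.1, 0.25, 0.5, 0.8]:
    print(x, u_series(0.0, x), triangle(x))   # the series matches the hat profile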

Exercise 4.6.2. Solve the following boundary value problem:

∂t u = a²∂x²u − βu,   u(t, 0) = u(t, L) = 0,   u(0, x) = sin(2πx/L).

4.7 Exemplary Nonlinear PDE: Burgers Equation

The Burgers equation is a generalization of the Hopf equation, Eq. (4.10), discussed when
illustrating the method of characteristics. Recall that the Hopf equation results in wave
breaking, which leads to a non-physical multi-valued solution. Modification of the Hopf
equation by adding dissipation/diffusion results in the Burgers equation:

∂t u + u∂x u = ∂x²u. (4.64)



Like practically every other nonlinear PDE, the Burgers equation seems rather hopeless to
resolve at first glance. However, the Burgers equation is in fact special. It allows the Cole-Hopf
transformation, from u(t; x) to Ψ(t; x),

u(t; x) = −2 ∂x Ψ(t; x)/Ψ(t; x), (4.65)

reducing the Burgers equation to the diffusion equation

∂t Ψ = ∂x²Ψ. (4.66)

The solution to the Cauchy problem associated with Eq. (4.66) can be expressed as an integral
convolving the initial profile Ψ(0; x) with the Green function of the diffusion equation
described in Eq. (4.55):

Ψ(t; x) = ∫ (dy/√(4πt)) exp( −(x − y)²/(4t) ) Ψ(0; y). (4.67)

This latter expression can be used to find some exact solutions to the Burgers equation. Consider,
for example, Ψ(0; x) = cosh(ax). Substituting into Eq. (4.67) and conducting the integration
over y, one arrives at Ψ(t; x) = cosh(ax) exp(a²t), which results, according to
Eq. (4.65), in a stationary (time independent, i.e. standing) “shock” solution to the Burgers
equation, u(t; x) = −2a tanh(ax). Notice that the following more general solution to the Burgers
equation corresponds to a shock moving with the constant speed u0:

u(t; x) = u0 − 2a tanh(a(x − x0 − u0t)).
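The standing shock can be verified numerically by combining Eqs. (4.67) and (4.65); the snippet below (hypothetical; the parameter values are arbitrary) evolves Ψ(0; x) = cosh(ax) and checks that the reconstructed u stays equal to −2a tanh(ax):

import numpy as np
from scipy.integrate import quad

aa, t, x = 0.8, 0.5, 0.7

def Psi(t, x):
    # heat-kernel convolution (4.67) of the initial profile cosh(aa*y)
    kern = lambda y: np.exp(-(x - y)**2 / (4 * t)) * np.cosh(aa * y) / np.sqrt(4 * np.pi * t)
    return quad(kern, -np.inf, np.inf)[0]

dx = 1e-5
u = -2 * (Psi(t, x + dx) - Psi(t, x - dx)) / (2 * dx) / Psi(t, x)   # Cole-Hopf (4.65)
print(u, -2 * aa * np.tanh(aa * x))   # the two values coincide: the shock stands still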

Exercise 4.7.1 (not graded). Solve the diffusion equation Eq. (4.66) with the initial con-
ditions Ψ(0, x) = cosh(ax) + B cosh(bx). Reconstruct respective u(t; x) solving the Burgers
Eq. (4.64). Analyze the result in the regime b > a and B  1 and also verify, by building a
computational snippet, that the resulting spatio-temporal dynamics corresponds to a large
shock “eating” a small shock.
Part III

Optimization

Chapter 5

Calculus of Variations

The main theme of this section is the relation of equations to minimal principles. Oversimplifying
a bit: to minimize a function S(q) is to solve S′(q) = 0. For a quadratic,
S(q) = (1/2)qᵀKq − qᵀg, where K is positive definite, the minimum of S(q) is indeed
achieved at q∗, which solves S′(q∗) = Kq∗ − g = 0.
In the example above q is an n-(finite) dimensional vector, q ∈ Rⁿ. Consider extending
the finite dimensional optimization to an infinite dimensional, continuous, problem where
q(x) is a function, say q(x) : R → R, and S{q(x)} is a functional, typically an integral with
the integrand dependent on q(x) and its derivative q′(x), for example

S{q(x)} = ∫ dx ( (c/2)(q′(x))² − g(x)q(x) ).

The derivative of the functional over q(x) is called the variational derivative, and then,
by analogy with the finite dimensional example above, one finds that the Euler-Lagrange
equation,

δS{q}/δq(x) = 0,

provides the (necessary) condition for a minimum of the functional. The goal of this section is to understand
the variational derivative and other related concepts in theory and on examples.

5.1 Examples
To gain a better understanding of the calculus of variations we start by describing four examples.


5.1.1 Fastest Path

Consider a robot navigating within the (x, y)-plane. We can describe the robot's path as
y = q(x). Assume that the plane constitutes a rugged terrain, so that the difficulty of traversal
(effectively, the inverse of the robot's speed) at a point on the plane is characterized by a scalar
positive function, g(x, y). Then the time it takes for the robot to move from x to x + dx along the
path q(x), where dx is small, is

L(x, q(x), q′(x)) dx = g(x, q(x)) √(1 + (q′(x))²) dx.

The total travel time along the path which starts at (x, y) = (0, 0) and ends at (x, y) = (a, b),
where a > 0, is

S{q(x)} = ∫_0^a dx L(x, q(x), q′(x)).

We would like to find the path, q(x), which minimizes the functional S{q(x)}, subject to
q(0) = 0 and q(a) = b.

5.1.2 Minimal Surface

Consider making a three-dimensional bubble by dipping a wire loop into soapy water, and
then asking the question what is the optimal shape of the bubble for the given loop. Physics
suggests that the shape of the bubble minimizes area of the soap film.
We formalize this setting as follows. The surface of a bubble is described by the function
q : D → R, where D ⊂ R² is bounded and is the projection
of the bubble onto the R² plane, and q(∂D) = g(∂D), where ∂D is the boundary of D (a closed
line in the R² plane), and g(∂D) describes the coordinate of the wire loop along the third
dimension. Then the optimal bubble results from minimizing the functional

S{q(x)} = ∫_D dx √(1 + |∇x q(x)|²),

over q(x), subject to q(∂D) = g(∂D).

5.1.3 Image Restoration

A gray-scale image is described by a function q(x) : [0, 1]² → [0, 1], mapping a location
x within the square box [0, 1]² ⊂ R² into a real number between white, 0, and black,
1. However, often only an image corrupted by noise is observed. The task of image
restoration is to restore the true image from the noisy observation.

Total Variation (TV) restoration [3] is a method built on the conjecture that the true
image is reconstructed from the noisy signal, f(x), by minimization of the following functional:

S{q(x)} = ∫_{U=[0,1]²} dx ( (q(x) − f(x))² + λ|∇x q(x)| ), (5.1)

subject to the Neumann boundary condition, n · ∇x q = 0, x ∈ ∂U, where n is the (unit)
vector normal to ∂U (the boundary of the domain U).

5.1.4 Classical Mechanics

Classical mechanics is described in terms of the function q(t) : R → R^d, mapping a time t ∈
R into a d-dimensional real-valued coordinate q ∈ R^d. The evolution of the coordinate in
time is described in Hamiltonian mechanics by the minimal action (also called Hamiltonian)
principle: the trajectory, understood as describing the evolution of the coordinate in
time, is governed by the minimum of the action,

S{q} := ∫_{t1}^{t2} dt L(t, q(t), q̇(t)), (5.2)

where L(t, q(t), q̇(t)) is the system Lagrangian and q̇(t) = dq(t)/dt is the velocity, under
the condition that the values of the coordinate at the initial and final moments of time are
fixed, q(t1) = q1, q(t2) = q2. An exemplary Hamiltonian dynamics is that of a (unit mass)
particle in a potential V(q); then

L(t, q(t), q̇(t)) = q̇²/2 − V(q). (5.3)

5.2 Euler-Lagrange Equations


All the examples can be stated as the minimization of the functional

S{q(x)} = ∫_{D⊂Rⁿ} dx L(x, q(x), ∇x q(x)),

over functions q(x) with a fixed value at the boundary, x ∈ ∂D : q(x) = g(x), where D
is bounded, the value of q is known at all points of the boundary, and the Lagrangian L is a
given function

L : D × R^d × R^{d×n} → R,

of the three variables. It will also be convenient in deriving further relations to consider the
three variables in the argument of L, denoting the respective derivatives Lx, Lq,
and L∇q. (Note that the variables are x ∈ D ⊂ Rⁿ, q ∈ R^d, and p ∈ R^{d×n}, and thus the
dimensionalities of Lx, Lq, and L∇q are n, d and d × n, respectively.) We will assume in the
following that both L and g are smooth.

Theorem 5.2.1 (Euler-Lagrange theorem (necessary condition for optimality)). Suppose
that q(x) is a minimizer of S, that is,

∀ v(x) ∈ C²(D̄ = D ∪ ∂D), with v(x) = g(x) on ∂D : S{v} ≥ S{q};

then L satisfies

∇x · (L∇q(x, q(x), ∇x q(x))) − Lq(x, q(x), ∇x q(x)) = 0 in D. (5.4)

Sketch of the proof: Consider the perturbation q(x) → q(x) + sδ(x) = q̃(x), where s ∈ R
and δ(x) is sufficiently smooth and such that it does not change the boundary condition, i.e.
δ(x) = 0 on ∂D. Then according to the assumption

S{q} ≤ S{q̃} = S{q + sδ(x)} ∀s ∈ R.

This means that

(d/ds) S{q + sδ(x)} |_{s=0} = 0.

Notice that

S{q + sδ(x)} = ∫_D dx L(x, q(x) + sδ(x), ∇x q(x) + s∇x δ(x)).

Then, exchanging the orders of differentiation and integration, applying the differentiation
(chain) rules to the Lagrangian, and evaluating one of the resulting integrals by parts and
removing the boundary term (because δ(x) = 0 on ∂D), one derives

(d/ds) S{q + sδ(x)} |_{s=0} = ∫_D dx (d/ds) L(x, q(x) + sδ(x), ∇x q(x) + s∇x δ(x)) |_{s=0} (5.5)
  = ∫_D dx ( Lq(x, q(x), ∇x q(x)) · δ(x) + L∇q(x, q(x), ∇x q(x)) · ∇x δ(x) )
  = ∫_D dx ( Lq(x, q(x), ∇x q(x)) − ∇x · L∇q(x, q(x), ∇x q(x)) ) · δ(x).

Since the resulting integral should be equal to zero for any δ(x) one arrives at the desired
statement.

Exercise 5.2.1. Find the Euler-Lagrange equations (conditions) for

(a) ∫ dx ( (q′(x))² + exp(q(x)) ),

(b) ∫ dx q(x) q′(x),

(c) ∫ dx x² (q′(x))²,

where q : R → R.

Example 5.2.2. Consider the shortest path version of the fastest path problem set in
Section 5.1.1, that is, the case of g(x, y) = 1:

min_{q(x), x∈[0,a]; q(0)=0, q(a)=b} ∫_0^a dx √(1 + (q′(x))²).

Find the Euler-Lagrange (EL) condition on q(x).


Solution:
The Euler-Lagrange condition on q(x) becomes

0 = ∇x(L∇q(x, q(x), ∇x q(x))) − Lq(x, q(x), ∇x q(x)) = d/dx ( q′(x)/√(1 + (q′(x))²) ) − 0
→ q′(x)/√(1 + (q′(x))²) = constant
→ q′(x) = constant
→ q(x) = (b/a) x,
where at the last step we accounted for the boundary condition. The shortest (optimal)
path connects initial and final points by a straight line.
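The same Euler-Lagrange condition can be obtained symbolically; the snippet below is a hypothetical illustration using sympy's euler_equations helper:

import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.Symbol('x')
q = sp.Function('q')
L = sp.sqrt(1 + sp.diff(q(x), x)**2)   # the shortest-path Lagrangian

print(euler_equations(L, q(x), x))
# the single equation reduces to q''(x) = 0, i.e. the straight line found above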

Exercise 5.2.3. (a) Write the Euler-Lagrange equation for the general case of the fastest
path problem formulated in Section 5.1.1. (b) Find an example of a metric, g(x, y), resulting
in the quadratic optimal path, i.e. q(x) = (b/a²) x².

Example 5.2.4. Let us derive the Euler-Lagrange condition for the Minimal Surface problem
introduced in Section 5.1.2:

min_{q(x); q(∂D)=g(∂D)} ∫_D dx √(1 + |∇x q(x)|²).

In this case Eq. (5.4) becomes

0 = ∇x(L∇q(x, q(x), ∇x q(x))) − Lq(x, q(x), ∇x q(x)) = ∇x · ( ∇x q(x)/√(1 + |∇x q(x)|²) )
→ −(∇x q(x)) · (∇x ∇x q) · (∇x q(x)) + (1 + |∇x q(x)|²) ∇²x q = 0. (5.6)

Exercise 5.2.5. Show that, q(x) = a · x + b, where a is a real n-dimensional vector and
b is a real scalar solves a Minimal Surface Euler-Lagrange Eq. (5.6) on D = (−π/2, π/2)2 .
(Hint: Do not worry about the boundary conditions. We do not ask about them in the
exercise.)

5.3 Phase-Space Intuition and Relation to Optimization


(finite dimensional, not functional)


Figure 5.1: Variational Calculus via Discretization and Optimization.

Consider the special case of the fastest path problem of Section 5.1.1, which is still more
general than the shortest path problem discussed in Example 5.2.2: the case where the metric
g(x) depends only on x. In this case the action is

S{q(x)} = ∫_0^a dx g(x) √(1 + (q′(x))²) = ∫ ds g(x),

where ds is the element of arc-length of the curve q(x):

ds = √(1 + (q′(x))²) dx = √(dx² + dq²).

The Lagrangian and its partial derivatives are L(x; q(x); q′(x)) = g(x)√(1 + (q′(x))²), Lq =
0, Lq′ = g(x)q′/√(1 + (q′)²). Then the Euler-Lagrange equation becomes

d/dx ( g(x) q′(x)/√(1 + (q′(x))²) ) = 0,

which results in

g(x) q′(x)/√(1 + (q′(x))²) = g(x) sin(θ) = constant, (5.7)
where θ is the angle in the (q, x) space between the tangent to q(x) and the x-axis.
It is instructive to derive Eq. (5.7) bypassing the variational calculus, taking instead the
perspective of standard optimization, that is, optimizing over a finite number of continuous
variables. To make this link we need, first, to discretize the action S{q(x)}:

S{q(x)} ≈ S∆(· · · , qk, · · · ) = Σ_k gk ∆sk = Σ_k gk ∆ √(1 + ((qk − qk−1)/∆)²),

where ∆ is the size of a step in x, i.e. ∆ = xk+1 − xk, ∀k, and ∆sk is the length of the
k-th segment of the discretized curve, illustrated in Fig. (5.1). Then, second, we look for
extrema of S∆ over qk, i.e. require that ∀k : ∂qk S∆ = 0. The result is the discretized
version of the Euler-Lagrange Eq. (5.7):

∀k : gk+1 (qk+1 − qk)/√(∆² + (qk+1 − qk)²) = gk (qk − qk−1)/√(∆² + (qk − qk−1)²)
→ gk+1 sin θk+1 = gk sin θk.
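The discrete Snell law is easy to confirm numerically; the sketch below (hypothetical; the metric g(x) = 1 + x and the end points are our choices) minimizes the discretized action with scipy and checks that gk sin θk is constant along the optimal broken line:

import numpy as np
from scipy.optimize import minimize

a, b, K = 1.0, 0.5, 40
xs = np.linspace(0.0, a, K + 1)
g = lambda x: 1.0 + x
Delta = xs[1] - xs[0]

def action(q_inner):
    q = np.concatenate(([0.0], q_inner, [b]))   # enforce q(0) = 0, q(a) = b
    dq = np.diff(q)
    # g evaluated at segment midpoints; ds = sqrt(Delta^2 + dq^2)
    return np.sum(g(0.5 * (xs[:-1] + xs[1:])) * np.sqrt(Delta**2 + dq**2))

res = minimize(action, np.linspace(0, b, K + 1)[1:-1], method="BFGS")
q = np.concatenate(([0.0], res.x, [b]))
dq = np.diff(q)
sin_theta = dq / np.sqrt(Delta**2 + dq**2)
print(g(0.5 * (xs[:-1] + xs[1:])) * sin_theta)   # approximately constant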

5.4 Towards Numerical Solutions of the Euler-Lagrange Equations

Here we discuss the image restoration problem set up in Section 5.1.3. We will derive the
Euler-Lagrange equations and observe that the resulting equations are difficult to solve. We
will then use this case to illustrate the theoretical part (philosophy) of solving the Euler-Lagrange
equations numerically. Following [4], we will use the example to discuss gradient
descent in this Section, and then also the primal-dual method below in Section 5.6.

5.4.1 Smoothing Lagrangian

The TV functional (5.1) is not differentiable at ∇x q(x) = 0, which creates difficulty for
variations. One way to bypass the problem is to smooth the Lagrangian, considering

Sε{q} = ∫_{[0,1]²} dx ( (q(x) − f(x))²/2 + λ √(ε² + (∇x q(x))²) ), (5.8)

where ε is small and positive. The Euler-Lagrange equation for the smoothed action (5.8)
is

∀x ∈ [0, 1]² : q − λ∇x · ( ∇x q / √(ε² + (∇x q(x))²) ) = f, (5.9)

with the homogeneous Neumann boundary conditions, ∀x ∈ ∂[0, 1]² : ∂q(x)/∂n = 0,
where n denotes the normal to the boundary of the [0, 1]² domain. Finding analytical solutions
to Eq. (5.9) for an arbitrary f is not possible. We will discuss ways to solve Eq. (5.9)
numerically in the following.

5.4.2 Gradient Descent and Acceleration

We will start this part with a disclaimer. The discussion below of the numerical procedure
for solving Eq. (5.9) is not fully comprehensive. We add it here for completeness, delegating
details to Math 575, and also aiming to emphasize connections between numerical PDE
analysis and forthcoming discussion (largely within 575) of the optimization algorithms.
A standard numerical scheme for solving Eq. (5.9) originating from optimization of the
action is gradient descent. It is useful to think about the gradient descent algorithm by
introducing an extra “computational time” dimension, which will be discrete in implemen-
tation but can also be thought of (for the purpose of analysis and gaining intuition) as
continuous. Consider the following equation
∇x υ
∀x ∈ [0, 1]2 , t > 0 : ∂t υ + υ − λ∇x · p = f, (5.10)
ε2 + (∇x υ(x))2
for, υ(t; x), representing estimation at the computational time t for q(x) solving Eq. (5.9),
with the initial conditions, ∀x : υ(0; x) = f (x), and the boundary conditions, ∀x ∈
∂[0, 1]2 : ∂υ(x)/∂n = 0. Eq. (5.10) is a nonlinear heat equation. Close to the equilibrium
the equation can be linearized. Discretizing the linear diffusion equation on the spatio-
temporal grid with spacing, ∆t, and, ∆x, and looking for the dynamic (time-derivative)
term balancing the diffusion term (containing second order spatial-derivative) one arrives
at the following rough empirical estimation
ε(∆x)2
∆t ∼ .
λ

The estimate suggests that the temporal step needs to be really small (the square of the
spatial step) for the explicit numerical scheme to remain stable (the problem is stiff). The condition
becomes even more demanding as the regularization parameter ε decreases.
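To make the stiffness concrete, here is a minimal 1-d sketch of the gradient-descent dynamics (5.10) (hypothetical; the notes treat the 2-d image case, and the step signal, noise level and parameters are our choices). Note how small the stable time step is:

import numpy as np

N, lam, eps = 100, 0.1, 0.1
rng = np.random.default_rng(1)
clean = np.where(np.arange(N) < N // 2, 0.0, 1.0)
f = clean + 0.05 * rng.normal(size=N)          # noisy step signal

dx = 1.0 / N
dt = 0.4 * eps * dx**2 / lam                   # the stiff time-step estimate above
v = f.copy()
for _ in range(100_000):
    dv = np.diff(v) / dx                       # v' at the cell interfaces
    flux = dv / np.sqrt(eps**2 + dv**2)
    flux = np.concatenate(([0.0], flux, [0.0]))    # zero-flux (Neumann) ends
    v -= dt * (v - f - lam * np.diff(flux) / dx)

print(np.abs(f - clean).mean(), np.abs(v - clean).mean())   # v is the denoised estimate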
One way to improve the gradient scheme (to make it less stiff) is to replace the diffusion
Eq. (5.10) by the (damped) wave equation

∀x ∈ [0, 1]², t > 0 : ∂t²υ + a∂tυ + υ − λ∇x · ( ∇x υ / √(ε² + (∇x υ(x))²) ) = f, (5.11)

where a is the damping coefficient. Acting by analogy with the diffusive case, let us make an
empirical estimate for the balanced choice of the spatial discretization step ∆x, the temporal
discretization step ∆t, and the damping coefficient. Linearizing the nonlinear wave
Eq. (5.11) and then requiring that the ∂t² (temporal oscillation) term, the a∂t (damping)
term and the (λ/ε)∇²x (diffusion) term are balanced, one arrives at the following estimate:

(∆t)² ∼ ∆t/a ∼ ε(∆x)²/λ,

which results in a much less demanding linear scaling, ∆t ∼ ∆x.
This transition from overdamped relaxation to balancing damping with oscillations
corresponds to Polyak's heavy-ball method [5] and Nesterov's accelerated gradient descent
method [6], which are now used extensively (often with the addition of a stochastic component)
in training of modern Neural Networks. Both methods will be discussed later
in the course, and even more in the companion Math 575 course. Notice also that additional
material on the modern, continuous-time interpretation of the acceleration method and
other related algorithms can be found in [7, 8]. See also Sections 2.3 and 3.6 of [4].
We will come back to the image-restoration problem one more time in Section 5.6.2
where we discuss an alternative, primal-dual algorithm.

5.5 Variational Principle of Classical Mechanics


Here we apply the variational principle (also called the Hamiltonian principle) to the classical
mechanics highlighted in Section 5.1.4. See also [9], whose logic we follow in this Section.
To streamline the notations of this Section (and unless specified otherwise) we will discuss
the dynamics of a particle in one dimension, i.e. ∀t ∈ R : q(t) ∈ R. Generalization of all the
formulas discussed to higher dimensions is straightforward.

5.5.1 Noether’s Theorem & time-invariance of space-time derivatives of


action

In the case of the classical mechanics introduced in Section 5.1.4, the Euler-Lagrange
Eqs. (5.4) are

(d/dt) Lq̇ = Lq, (5.12)

where L(t, q(t), q̇(t)) : R × R^d × R^d → R. Let us consider the case when the Lagrangian does
not depend explicitly on time. (It may still depend on time implicitly via q(t) and q̇(t),
i.e. L(q(t), q̇(t)).) In this case, and quite remarkably, the Euler-Lagrange equation can be
rewritten as a conservation law. Indeed,

(d/dt)(q̇ · Lq̇ − L) = q̈ · Lq̇ + q̇ · (d/dt)Lq̇ − Lq · q̇ − Lq̇ · q̈ = q̇ · ( (d/dt)Lq̇ − Lq ) = 0,

where the last equality is due to Eq. (5.12).


We have just introduced the Hamiltonian, H = q̇ · Lq̇ − L, representing the energy stored
within the mechanical system instantaneously, and proved that if the Lagrangian (and thus
the Hamiltonian) does not have explicit dependence on time, the Hamiltonian (and energy) is
conserved. This is a particular case of Noether's famous theorem.
Notice that symmetry under a parametrically continuous change, such as the one just
explored (consisting in invariance of the Lagrangian under the time shift), is generally a
stronger property than a conservation law.
To state a more general version of Noether's theorem we need the following definition.

Definition 5.5.1 (Invariance of the Lagrangian). Consider a family of transformations of R^d,
hs(q) : R^d → R^d, where s ∈ R, hs(q) is continuous in both q and the parameter s, and
h0(q) = q. We say that a Lagrangian L(q(t), q̇(t)) : R^d × R^d → R is invariant under the
action of the family of transformations hs(q) if L(q, q̇) does not change
when q(t) is replaced by hs(q(t)), i.e. if for any function q(t) we have

L( hs(q(t)), (d/dt) hs(q(t)) ) = L( q(t), (d/dt) q(t) ).

Common examples of hs(q(t)) in classical mechanics include

• translational invariance, hs(q(t)) = q(t) + se, where e is a unit vector in R^d and s
is the distance of the transformation;

• rotational invariance, hs(q(t)) = Re(s)q(t), around the line through the origin defined
by the unit vector e;

• combination of translational invariance and rotational invariance (cork-screw motion):


hs (q(t)) = aes + Re (s)q(t), where a is a constant.

Theorem 5.5.2 (Noether’s theorem (1915)). If the Lagrangian L is invariant under the
action of a one-parameter family of transformations, hs (u(t)), then the quantity,

d
I(q(t), q̇(t)) ≡ Lq̇ · (hs (q(t)))s=0 , (5.13)
ds
is constant along any solution of the Euler-Lagrange Eq. (5.12). Such a constant quantity
is called an integral of motion.

Figure 5.2: End point variation of a critical path.

Proof of Noether’s theorem, which will be sketched below, is linked to analysis of the
action viewed as a function (not functional!) of the end points of the critical path. Con-
sider the critical/optimal path, {q(t)}, corresponding to the solution of the Euler-Lagrange
Eq. (5.12), substitute it into the action-functional, S{q(t)}, and then consider the action
. .
as a function of the end points, A0 = (t0 , q(t0 ) = u0 ) and A1 = (t1 , q(t1 ) = q1 ). With
a little abuse of notation we express this dependence of the action on A0 and A1 as,
S(A0 ; A1 ) = S(t0 , q0 ; t1 , q1 ). The following statement (sometimes presented as the main
theorem of the Hamiltonian mechanics [9]) gives a very intuitive, geometrical interpretation
for the derivatives of the action over the end-point parameters

Theorem 5.5.3 (End-point derivatives of the action).

(a) ∂t1 S(A0; A1) = (L − q̇Lq̇)|_{t=t1},   ∂t0 S(A0; A1) = −(L − q̇Lq̇)|_{t=t0}, (5.14)
(b) ∂q1 S(A0; A1) = Lq̇|_{t=t1},   ∂q0 S(A0; A1) = −Lq̇|_{t=t0}. (5.15)

Proof. Here (as is custom in this course) we will only sketch the proof. Let us focus,
without loss of generality, on the part of the theorem concerning derivatives with respect
to t1 and q1, i.e. the final end point of the critical path.
Let us first keep the final time fixed at t1 but move the final position by dq, as shown
in Fig. (5.2). The trajectory q(t) will vary by δq(t), where δq(t0) = 0 and δq(t1) = dq.
The variation of the action is

dS = ∫_{t0}^{t1} dt ( Lq̇ δq̇ + Lq δq ). (5.16)

One uses the relation δq̇ = d(δq)/dt, and also the Euler-Lagrange Eq. (5.12), to rewrite
Eq. (5.16):

dS = ∫_{t0}^{t1} dt ( Lq̇ (d/dt)δq + δq (d/dt)Lq̇ ) = ∫_{t0}^{t1} dt (d/dt)( Lq̇ δq ) = ( Lq̇ δq )|_{t0}^{t1} = Lq̇|_{t1} dq. (5.17)

Therefore, as we kept the final time fixed, dS = ∂q1 S dq, and one arrives at the desired
statement:

∂S/∂q1 = Lq̇|_{t1}. (5.18)

We compute the variation of the action over the final time similarly. Consider the variation of the
action extended from A1 = (q1, t1) to (q1 + dq, t1 + dt) along the trajectory, so that dq = q̇ dt:

dS = L dt = (∂S/∂t1) dt + (∂S/∂q1) dq = (∂S/∂t1) dt + Lq̇|_{t1} dq = ( ∂S/∂t1 + q̇Lq̇|_{t1} ) dt,

where we utilized Eq. (5.18). Finally, we derive

∂S/∂t1 = (L − q̇Lq̇)|_{t1}.

We are now ready to sketch the proof of the Noether Theorem 5.5.2.

Proof. (of the Noether theorem) By the assumption of the theorem,

S(t0, hs(q0); t1, hs(q1)) = S(t0, q0; t1, q1), ∀s.

Differentiating both sides of the equality with respect to s at s = 0, and using Theorem
5.5.3, results in

0 = ∂q0 S · (d/ds) hs(q0)|_{s=0} + ∂q1 S · (d/ds) hs(q1)|_{s=0}
  = −Lq̇(q(t0), q̇(t0)) · (d/ds) hs(q0)|_{s=0} + Lq̇(q(t1), q̇(t1)) · (d/ds) hs(q1)|_{s=0}.
Since t1 can be chosen arbitrarily, it proves that Eq. (5.13) is constant along the solution
of the Euler-Lagrange Eq. (5.12).

Exercise 5.5.1. For q(t) ∈ R3 and each of the following families of transformations find
the explicit form of the conserved quantity given by Noether’s theorem (assuming that
respective invariance of the Lagrangian holds)

• (a) space translation in the direction, e: hs (q(t)) = q(t) + se.

• (b) rotation through angle s around the vector, e ∈ R3 : hs (q(t)) = Re (s)q(t).

• (c) helical symmetry, hs (q(t)) = aes + Re (s)q(t), where a is a constant.

5.5.2 Hamiltonian and Hamilton Equations: the case of Classical Mechanics

Let us utilize the specific structure of the classical mechanics Lagrangian, which is split,
according to Eq. (5.3), into a difference of the kinetic energy, q̇²/2, and the potential energy,
V(q). Making the obvious observation that the minimum of the functional

∫ dt (1/2)(q̇ − p)²

over {p(t)} is achieved at ∀t : q̇ = p, and then stating the kinetic term of the classical
mechanics action, that is the first term in Eq. (5.3), in terms of an auxiliary optimization,

∫ dt q̇²/2 = max_{p(t)} ∫ dt ( pq̇ − p²/2 ), (5.19)

and substituting the result in Eqs. (5.2,5.3), one arrives at the following, alternative, variational
formulation of the classical mechanics:

min_{q(t)} max_{p(t)} ∫ dt ( pq̇ − H(q; p) ), (5.20)

H(q; p) := p²/2 + V(q), (5.21)

where p and H are defined as the momentum and the Hamiltonian of the system. Turning
the second (Hamiltonian) principle of the classical mechanics into equations (which,
like the EL equations, are only necessary conditions of optimality) one arrives at the so-called
Hamilton equations

q̇ = ∂H(q; p)/∂p,   ṗ = −∂H(q; p)/∂q. (5.22)

Exercise 5.5.2. (a) [Conservation of Energy] Show that in the case of the time independent
Hamiltonian (i.e. in the case of H(q; p) considered so far), H, is also the energy which is
conserved along the solution of the Hamiltonian equations (5.22).
(b) [Conservation of Momentum] Show that if the Lagrangian does not depend explicitly
on one of the coordinates, say q1 , then the corresponding momentum, ∂L/∂ q̇1 , is constant
along the physical trajectory, given by the solutions of either EL or Hamiltonian equations.

The Hamiltonian system of equations becomes even more elegant in vector form:

ż = J ∇z H(z),   z := (q; p),   J := ( 0  1 ; −1  0 ), (5.23)

where the 2 × 2 matrix J generates rotation (clockwise in the (q, p)-space).
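The rotational/symplectic structure of Eq. (5.23) is what well-designed integrators preserve. The leapfrog sketch below (hypothetical; the harmonic Hamiltonian is our choice) integrates Eqs. (5.22) and shows that the energy error stays bounded over many periods:

import numpy as np

def leapfrog(q, p, dVdq, dt, nsteps):
    for _ in range(nsteps):
        p -= 0.5 * dt * dVdq(q)   # half kick
        q += dt * p               # drift (unit mass: dq/dt = p)
        p -= 0.5 * dt * dVdq(q)   # half kick
    return q, p

H = lambda q, p: 0.5 * p**2 + 0.5 * q**2     # oscillator: V(q) = q^2/2
q, p = 1.0, 0.0
q1, p1 = leapfrog(q, p, lambda q: q, dt=0.05, nsteps=10_000)
print(H(q, p), H(q1, p1))   # energies agree to O(dt^2) after many periods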

5.5.3 Hamilton-Jacobi equation

Let us work a bit more with the critical/optimal trajectory/path {q(t′); t′ ∈ [0, t]} solving
the Euler-Lagrange Eq. (5.12), given the initial and final conditions for the position of the
particle, q(0) and q(t). That is, we continue the thread of Section 5.5.1, and specifically
Theorem 5.5.3, and consider the action as a function of A1 = (t1, q1), the final position of
the critical path.
Let us re-derive in a slightly different, but equivalent, form the main results of Theorem
5.5.3. Assuming that the action is a sufficiently smooth function of the arguments t and
q, one would like to introduce (and interpret) the derivatives of the action over t and q, and then
check if the derivatives are related to each other. Consider, first, the derivative of the action
over t:

St := ∂t S(t; q) = ∂t ∫_0^t dt′ L( q(t′), q̇(t′) ) = L + ∫_0^t dt′ ( Lq ∂t q(t′) + Lq̇ ∂t q̇(t′) )
   = L + ∫_0^t dt′ ∂t q(t′) ( Lq − (d/dt′) Lq̇ ) + ( Lq̇ ∂t q(t′) )|_{t′=0}^{t′=t} = L − Lq̇ q̇, (5.24)

where we have used that ∂t q(t′)|_{t′=0} = 0 and ∂t q(t′)|_{t′=t} = −q̇(t) (the latter follows from
keeping the end-point value q(t) = q fixed as t varies), and utilized the Euler-Lagrange
equations, ∀t′ ∈ [0, t] : ( Lq − (d/dt′)Lq̇ )|_{t′} = 0.

Next, let us evaluate the derivative of the action over the coordinate, q:

Sq := ∂q S(t; q) = ∂q ∫_0^t dt′ L( q(t′), q̇(t′) ) = ∫_0^t dt′ ( Lq ∂q(t) q(t′) + Lq̇ ∂q(t) q̇(t′) )
   = ∫_0^t dt′ ∂q(t) q(t′) ( Lq − (d/dt′) Lq̇ ) + ( Lq̇ ∂q(t) q(t′) )|_{t′=0}^{t′=t} = Lq̇. (5.25)

In the case of classical mechanics, when the Lagrangian splits into a difference
of the kinetic energy and the potential energy terms, the object on the right hand side of
Eq. (5.24) turns into minus the Hamiltonian, defined above in Eq. (5.21), and the right
hand side of Eq. (5.25) becomes the momentum, p = q̇. In the case of a generic (not
separable) Lagrangian, one can use the right hand sides of Eq. (5.24) and Eq. (5.25)
as the definitions of minus the Hamiltonian of the system and of the system momentum,
respectively:

p ≡ Lq̇,   H(t; q; p) := Lq̇ q̇ − L, (5.26)

where the Hamiltonian is considered a function of time t, coordinate q(t), and momentum
p(t).
Combining Eqs. (5.24,5.25,5.26), that is (a) and (b) of the Theorem 5.5.3 and the def-
initions of the momentum and the Hamiltonian, one arrives at the Hamilton-Jacobi (HJ)
equation

St + H(q; ∂q S) = 0, (5.27)

which provides a nonlinear first order PDE representation of classical mechanics.


It is important to stress that if one knows the initial (t = 0) value of the action, the
explicit expression of the Hamiltonian in terms of the time, coordinate and momentum, and
the initial value of ∂q S at t = 0 at all values of q, then Eq. (5.27) represents a Cauchy
initial value problem, therefore resolving the minimum action problem unambiguously.
This is a rather remarkable and strong statement with many important consequences
and generalizations. The statement is remarkable because one gets a unique solution
for the optimization problem in spite of the fact that the solution of the EL equation is not
necessarily unique (remember, it is a necessary but not sufficient condition for the minimum
of the action, i.e. there may be multiple solutions of the EL equations). Consequences of the HJ
equations will be seen later when we discuss their generalization to the case of optimal
control, called the Bellman-Hamilton-Jacobi (BHJ) equation. The HJ equation, discussed here,
and the BHJ equation, discussed in Section 7, are linked ultimately to the concept of Dynamic Programming
(DP), also discussed later in the course.

Let us re-emphasize that the schematic derivation of the HJ equation (just provided)
has revealed the meaning of the action derivatives over time and over the coordinate. We
have learned that ∂t S is nothing but minus the Hamiltonian, while ∂q S is simply the momentum
(also equal to the velocity, as in these notes we follow the convention of unit mass).
Let us provide an alternative (and equally simple) derivation of the HJ equation, based
primarily on differentials. Given the transformation from the representation of the action as a
functional of {q(t′); t′ ∈ [0, t]} to its representation as a function of t and q(t), S{q(t′)} →
S(t; q), one rewrites Eqs. (5.2,5.3) as

S = ∫ p dq − ∫ H dt,

which then implies the differential form

dS = (∂S/∂t) dt + (∂S/∂q) dq,

so that

∂t S = −H,   ∂q S = p,

resulting (in combination) in the HJ Eq. (5.27).
Example 5.5.3. Find and solve the HJ equation for a free particle.
In this case

H = p²/2.

Therefore, the HJ equation becomes

(∂q S)²/2 = −∂t S.

Look for a solution of the HJ equation in the form S = f(q) − Et. One derives f(q) = √(2E) q − c,
and therefore the general solution of the HJ equation becomes

S(t; q) = √(2E) q − Et − c.
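A one-line symbolic confirmation (hypothetical sympy snippet) that this S indeed satisfies the HJ equation:

import sympy as sp

t, q, E = sp.symbols('t q E', positive=True)
S = sp.sqrt(2 * E) * q - E * t
print(sp.simplify(sp.diff(S, t) + sp.diff(S, q)**2 / 2))   # 0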

Exercise 5.5.4. Find and solve the HJ equation for a two-dimensional oscillator (unit mass
and unit elasticity) in polar coordinates, i.e. for the Hamiltonian system with the action
functional

S{r(t), ϕ(t)} = ∫ dt ( (1/2)( ṙ² + r²ϕ̇² ) − (1/2) r² ).
We conclude this very brief discussion of classical/Hamiltonian mechanics by mentioning
that, in addition to its relevance to the concepts of Optimal Control and Dynamic
Programming (to be discussed in Section 7), the HJ equations are also most useful in establishing
(and using in practical settings) the transformation from the original variables (q, p)
to the so-called canonical variables, for which paths of motion reduce to single points, i.e.
variables for which the (re-defined) Hamiltonian is simply zero.

5.6 Legendre-Fenchel Transform


This section is devoted to the Legendre-Fenchel (LF) transform, which was in fact already used,
in its relatively simple but functional (infinite dimensional) form, in Eq. (5.19). Given the LF
transform's importance in variational calculus (already mentioned) and in finite dimensional optimization
(yet to be discussed), we have decided to allocate a special section to this important
transformation and its consequences. We will also mention at the end of this Section two
applications of the LF transform: (a) to solving the image restoration problem by a primal-dual
algorithm, and (b) to estimating integrals with the Laplace method.

Definition 5.6.1 (Legendre-Fenchel (LF) transform). The Legendre-Fenchel transform of a
function Φ : Rⁿ → R is

Φ*(k) := sup_{x∈Rⁿ} ( x · k − Φ(x) ). (5.28)

The LF transform is often also referred to as the "dual" transform; then Φ*(k) is dual to Φ(x).

Example 5.6.1. Find the LF transform of the quadratic function f(x) = x·A·x/2 − b·x,
where A is a symmetric positive definite matrix, A ≻ 0.
Solution: The following sequence of transformations shows that the LF transform of a
positive definite quadratic function is another positive definite quadratic function:

    sup_x ( x·k − x·A·x/2 + b·x )
      = sup_x ( −(x − A⁻¹(k + b))·A·(x − A⁻¹(k + b))/2 + (b + k)·A⁻¹·(b + k)/2 )
      = (b + k)·A⁻¹·(b + k)/2,   (5.29)

where the maximum is achieved at x* = A⁻¹(k + b).
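A brute-force numerical check of this calculation is straightforward; the following sketch (one-dimensional, assuming numpy; the values of a, b and k are illustrative) approximates the sup in Eq. (5.28) by a maximum over a grid and compares it with the closed form (5.29):

    import numpy as np

    a, b = 2.0, 1.0
    x = np.linspace(-20.0, 20.0, 200001)     # grid over which the sup is taken
    f = 0.5 * a * x**2 - b * x               # f(x) = a x^2/2 - b x

    def lf_transform(k):
        # Phi*(k) = sup_x (k x - Phi(x)), approximated on the grid
        return np.max(k * x - f)

    for k in [-1.0, 0.0, 2.0]:
        exact = (b + k)**2 / (2 * a)         # one-dimensional version of Eq. (5.29)
        print(k, lf_transform(k), exact)     # grid value matches the closed form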

Definition 5.6.2 (Convex function over Rn ). A function, u : Rn → R is convex if

∀x, y ∈ Rn , λ ∈ (0, 1) : u(λx + (1 − λ)y) ≤ λu(x) + (1 − λ)u(y). (5.30)

The combination of these two notions (the Legendre-Fenchel transform and the convex-
ity) results in the following bold statements (which we only state here, delegating proofs to
Math 527).

Theorem 5.6.3 (Convexity and Involution of Legendre-Fenchel). The Legendre-Fenchel


transform of a convex function is convex, and it is also an involution, i.e. (Φ∗ )∗ = Φ.

5.6.1 Geometric Interpretation: Supporting Lines, Duality and Convexity

Once the formal definitions and statements are made, let us consider the one-dimensional
case, n = 1, to develop intuition about the LF transform and convexity. In one dimension the LF
transform has a very clear geometric interpretation (see e.g. [?]), stated in terms of
supporting lines.

Definition 5.6.4 (Supporting Lines). f : R → R admits a supporting line at x ∈ R if there exists a slope α ∈ R such that

    ∀x′ ∈ R : f(x′) ≥ f(x) + α(x′ − x).

If the inequality is strict at all x′ ≠ x, the line is called strictly supporting.

Notice that, as defined above, supporting lines are defined locally, i.e. not globally for all
x ∈ R, but at a particular/fixed x.

Figure 5.3: Geometric interpretation of supporting lines; the curve f(x) with marked points a, b, c, d.

Example 5.6.2. Find f*(k) and the supporting line(s) for f(x) = ax + b.
Solution: Notice that we cannot draw any straight line that does not cross f(x) unless it
has the same slope. Therefore, f(x) is the supporting line for itself. We also observe that
the LF transform of the straight line is finite only at a single point, k = a, corresponding to
the slope of the line, i.e.

    f*(k) = −b if k = a,  and ∞ otherwise.

Example 5.6.3. Consider the quadratic f(x) = ax²/2 − bx. Find f*(k), the supporting line(s)
for f(x), and the supporting line(s) for f*(k).
Solution: The solution, given by the one-dimensional version of Eq. (5.29), is f*(k) = (b +
k)²/(2a), where the maximum (in the LF transform) is achieved at x* = (b + k)/a. We
observe that f*(k) is well defined (finite) for all k ∈ R. Denote by f_x(x′) the supporting
line of f(x) at x. In this case of a nice (smooth and convex) f(x), one derives f_x(x′) =
f(x) + f′(x)(x′ − x) = ax²/2 − bx + (ax − b)(x′ − x), i.e. the Taylor expansion
of f around x truncated at the first (linear) term. Similarly, f*_k(k′) = f*(k) +
(f*)′(k)(k′ − k) = (b + k)²/(2a) + (b + k)(k′ − k)/a.

What we see in this example generalizes into the following statements (given without
proof):

Proposition 5.6.5. Assume that f(x) admits a supporting line at x and that f′(x) exists;
then the slope of the supporting line at x must be f′(x), i.e. for a differentiable function
the supporting line is always a tangent line.

Theorem 5.6.6. If f(x) admits a supporting line at x with slope k, then f*(k) admits a
supporting line at k with slope x.

Example 5.6.4. Draw supporting lines for the example of a smooth non-convex function
shown in Fig. (5.3).
Solution: Sketching supporting lines for this smooth, non-convex function, bounded from below
and with two local minima, we arrive at the following observations:

• The point a admits a supporting line. The supporting line touches f at point a and
the touching line is beneath the graph of f (x), hence the term supporting is justified.

• The supporting line at a is strictly supporting because it touches the graph of f only
at x = a.

• The point b does not admit a supporting line, because any line passing through (b, f(b))
crosses the graph of f(x) at some other point.

• The point c admits a supporting line which is supporting, but not strictly supporting,
as it touches f (x) at another point, d. In this case c and d share the same supporting
line.

The supporting line analysis yields a number of other useful statements listed below
(without proof and only with limited discussion):

Theorem 5.6.7. f ∗ (k) is always convex in k.

Corollary 5.6.8. f ∗∗ (x) is always convex in x.

The last statement tells us, in particular, that the LF transform is not always an involution,
because f** is always convex even for non-convex f, in which case f ≠ f**. This observation generalizes to

Theorem 5.6.9. f ∗∗ (x) = f (x) iff f (x) admits a supporting line at x.

The following two statements are immediate corollaries of the theorem.

Corollary 5.6.10. f ∗∗ = f if f is convex.

Corollary 5.6.11. If f ∗ (k) is differentiable for all k then f ∗∗ (x) = f (x).

The following two statements are particularly useful for visualization of f**(x):

Corollary 5.6.12. A convex function can always be written as the LF transform of another function.

Theorem 5.6.13. f ∗∗ (x) is the largest convex function satisfying f ∗∗ (x) ≤ f (x).

Because of the last statement we call f ∗∗ (x) the convex envelope of f (x).
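The convex envelope can be visualized numerically by applying the LF transform twice on a grid; a minimal sketch assuming numpy (the double well f(x) = (x² − 1)² is an illustrative stand-in for a non-convex function like the one in Fig. 5.3):

    import numpy as np

    x = np.linspace(-2.5, 2.5, 1001)
    f = (x**2 - 1)**2                          # non-convex double well
    k = np.linspace(-30.0, 30.0, 1001)

    # f*(k) = max_x (k x - f(x));  f**(x) = max_k (k x - f*(k))
    f_star = np.max(k[:, None] * x[None, :] - f[None, :], axis=1)
    f_env = np.max(x[:, None] * k[None, :] - f_star[None, :], axis=1)

    # f** flattens the non-convex region: ~0 on [-1, 1], ~f outside it
    print(np.max(np.abs(f_env[np.abs(x) <= 1.0])))                     # close to 0
    mask = (np.abs(x) >= 1.5) & (np.abs(x) <= 2.0)
    print(np.max(np.abs((f_env - f)[mask])))                           # close to 0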

Figure 5.4: Function having a singularity cusp (left) and its LF transform (right).

Below we continue to illustrate the notion of supporting lines, as well as convexity and
duality, with illustrative examples.

Example 5.6.5. Consider a function containing a non-differentiable point (cusp), as shown
in Fig. (5.4a). Utilizing the notion of supporting lines, draw and explain f*(k). Is f**(x) =
f(x)?
Solution: When a function has a non-differentiable point it is natural to split the analysis
in two, discussing the differentiable and non-differentiable parts separately.

• (Differentiable part of f(x):) Each point (x, f(x)) on the differentiable part of the
function curve (branches l and r in Fig. (5.4a)) admits a strict supporting line with slope
f′(x) = k. Under the LF transformation these points map into points (k, f*(k))
admitting supporting lines of slopes (f*)′(k) = x, shown as the l′ and r′ branches in
Fig. (5.4b). Overall, the left (l) and right (r) branches in Fig. (5.4a) transform into the left
(l′) and right (r′) branches in Fig. (5.4b).

• (The cusp of f(x) at x = x_c:) The non-differentiable point x_c admits not one but
infinitely many supporting lines, with slopes in the range [k₁, k₂]. This means that
f*(k) with k ∈ [k₁, k₂] must admit a supporting line with the constant slope x_c,
shown as branch (c′) in Fig. (5.4b), i.e. the (c′) branch is linear (affine).

Figure 5.5: (a) An exemplary non-convex function, f(x); (b) its LF transform, f*(k); (c) its
double LF transform, f**(x).

Example 5.6.6. Show schematically f*(k) and f**(x) for the f(x) shown in Fig. (5.3).
Solution: We split the curve of the function into three branches, (l)eft, (c)enter and (r)ight,
and then build the LF and double-LF transforms separately for each of the branches, relying
as before on the construction of supporting lines. The result is shown in Fig. (5.5), and
the details are as follows.

• Branches (l) and (r) are strictly convex, thus admitting strict supporting lines.
The LF transforms of the two branches are smooth. The double LF transform returns exactly
the function we started from.

• Branch (c) is not convex, and as a result none of the points within this branch, extending
from x₁ to x₂, admits a supporting line. This means that the points of the branch
are not represented in f*(k); we see this in Fig. (5.5b) as a collapse of the branch to a point
under the LF transform. The supporting line with slope k_c connects the end-points of
the branch. This supporting line is not strict, and it translates in f*(k) into a single
point (k_c, f*(k_c)), at which f*(k) is not differentiable. Notice that f*(k) is
convex, as is f**(x). The second LF transformation extends (k_c, f*(k_c)) into a straight line
with slope k_c (shown red in Fig. (5.5c)). This straight line may be thought of as a convex
extrapolation, the envelope, of f(x) over its non-convex branch.

Exercise 5.6.7. (a) Find the supporting lines and build the LF transform of

    f(x) = p₁x + b₁ for x ≤ x*,  and  f(x) = p₂x + b₂ for x ≥ x*,

where x* = (b₂ − b₁)/(p₁ − p₂), b₂ > b₁ and p₂ > p₁; and find the respective f**(x).
(b) Suggest an example of a convex function defined on a bounded domain with diverging
(infinite) slopes at the boundary. Show schematically f*(k) and f**(x) for the function.

5.6.2 Primal-Dual Algorithm and Dual Optimization

Now we are ready to return to the image restoration problem set up in Section 5.1.3.
Our task becomes to by-pass the ε-smoothing discussed in Section 5.4.2 by using the LF transform.
This neat theoretical trick will then result in a computationally advantageous primal-
dual algorithm. We will use Theorem 5.6.3 to accomplish this transformation-to-dual
goal.
In fact, let us consider a more general setup than the one discussed in Section 5.1.3. Assume
that Φ : Rⁿ → R is convex and consider

    min_{u(x)} ∫_U dx ( Ψ(x, u(x)) + Φ(∇x u(x)) ),   (5.31)

where u : U → R. Let us now restate the formulation in terms of the Legendre-Fenchel
transform of Φ, thus utilizing Theorem 5.6.3:

    min_{u(x)} max_{p(x)} ∫_U dx ( Ψ(x, u(x)) + p(x)·∇x u(x) − Φ*(p(x)) ),   (5.32)

where p : U → Rⁿ. (Notice the difference with u : U → R.) u(x) is called the primal
variable and p(x) the dual variable. We “dualized” only the second term in the inte-
grand on the right hand side of Eq. (5.31), which is non-smooth, leaving the first (smooth)
term unchanged. The optimization problem (5.32) is also called a saddle-point formulation,
due to its min-max structure.

Given the boundary condition u p·n = 0 on ∂U, we can apply integration by parts to
the middle term in Eq. (5.32), then arriving at

    min_{u(x)} max_{p(x)} ∫_U dx ( Ψ(x, u(x)) − u(x)∇·p(x) − Φ*(p(x)) ).   (5.33)

We can attempt to solve Eq. (5.32) or Eq. (5.33) by the primal-dual method which con-
sists in alternating minimization and maximization steps in either of the two optimizations.
Implementations may be, for example, via alternating gradient descent (for minimization)
and gradient ascent (for maximization).
However, in the original problem we are trying to solve – the image restoration problem
defined in Section 5.1.3 – we can carry the primal-dual min-max formulation further
by exploiting the structure of the argument (effective action), evaluating the minimization over
{u(x)} explicitly and thus arriving at the dual formulation. This is our plan for the remain-
der of the section.
The case of Total Variation image restoration corresponds to setting

    Ψ(x, u) = (u − f(x))²/(2λ),   Φ(w = ∇x u(x)) = |w|,

in Eq. (5.31), thus arriving at the following optimization:

    min_u ∫_U dx ( (u − f)²/(2λ) + |∇x u| ),  subject to n·∇x u = 0 for x ∈ ∂U.   (5.34)

Notice that Φ(w) = |w| is convex and thus, according to the high-dimensional generalization
of what we have learned about the LF transform, Φ**(w) = Φ(w). The LF dual of Φ(w) can
be easily computed:

    Φ*(p) = sup_{w∈Rⁿ} ( p·w − |w| ) = 0 if |p| ≤ 1,  and ∞ if |p| > 1.   (5.35)

And then the convexity of Φ(w) = |w| allows us, according to Theorem 5.6.3, to “invert”
Eq. (5.35):

    Φ(w) = |w| = sup_p ( p·w − Φ*(p) ) = max_{|p|≤1} p·w.   (5.36)

Then the min-max Eq. (5.33) becomes

    min_u max_{|p|≤1} ∫_U dx ( (u − f)²/(2λ) − u ∇x · p ),  subject to n·p = 0 for x ∈ ∂U.   (5.37)

Remarkably, we can swap min and max in Eq. (5.37). This is guaranteed by the strong
convexity theorem (yet to be discussed in the optimization part of the course/notes):

    max_{|p|≤1} min_u ∫_U dx ( (u − f)²/(2λ) − u ∇x · p ),  subject to n·p = 0 for x ∈ ∂U.   (5.38)

This trick is very useful because the optimization over u can be done explicitly. One finds
that the minimum of the (quadratic in u) integrand of the objective in Eq. (5.38) is achieved at

    u(p) = f + λ∇·p,   (5.39)

and then, substituting the optimal value back into the objective, we arrive at

    max_{|p|≤1} ∫_U dx ( f ∇x · p − (λ/2)(∇x · p)² ),  subject to n·p = 0 for x ∈ ∂U,   (5.40)

which is thus the optimization dual to the primal optimization (5.34). If we were to ignore the
constraint in Eq. (5.40), the objective would be maximal at ∇x · p = f/λ. To handle the constraint,
[10] suggested the so-called projected gradient ascent algorithm

    ∀x : p_{k+1}(x) = ( p_k + τ ∇x (∇x · p_k − f/λ) ) / ( 1 + τ |∇x (∇x · p_k − f/λ)| ),   (5.41)

initiated with a p₀ satisfying the constraint, |p₀| < 1, iterating in time with step τ > 0, and
taking an appropriate spatial discretization of the ∇x · operation on a grid with spacing ∆x.
The denominator on the right hand side of Eq. (5.41) guarantees that the condition |p_k| < 1
is enforced along the iterations. When the iterations converge and the
optimal p is found, the optimal pattern u is reconstructed from Eq. (5.39).
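A minimal one-dimensional sketch of this procedure, assuming numpy (the 2-D image is replaced by a 1-D signal; the forward-difference grad and its negative adjoint div below are one schematic choice of discretization, and the signal, noise level and parameters λ, τ are all illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    truth = np.where(np.arange(n) < n // 2, 0.0, 1.0)  # piecewise-constant signal
    f = truth + 0.1 * rng.standard_normal(n)           # noisy observation

    def grad(u):   # forward difference (zero at the right boundary)
        return np.append(u[1:] - u[:-1], 0.0)

    def div(p):    # negative adjoint of grad (up to boundary details)
        return np.append(p[0], p[1:] - p[:-1])

    lam, tau = 0.3, 0.2
    p = np.zeros(n)
    for _ in range(500):                               # iterations of Eq. (5.41)
        g = grad(div(p) - f / lam)
        p = (p + tau * g) / (1.0 + tau * np.abs(g))
    u = f + lam * div(p)                               # reconstruction, Eq. (5.39)
    print(np.abs(u - truth).mean(), np.abs(f - truth).mean())  # TV-denoised vs raw error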

5.6.3 More on Geometric Interpretation of the LF transform

Here we inject some additional geometric meaning into the LF transform, following [11]. We
continue to draw our intuition/inspiration from a one-dimensional example.
First, notice that if the function f : R → R is strictly convex then f′(x) increases,
monotonically and strictly, with x. This means, in particular, that the relation between
the original variable, x, and the respective optimal dual variable, k, is one-to-one, therefore
providing an additional explanation for the involution (self-inverse) feature of the LF transform
in the case of strict convexity (but we know that the feature also holds in the general convex
case).

" ! + " ∗ % = %!
'()*+ = %

"(!)
%! " ∗ (%) !

Figure 5.6: Graphic representation of the LF transform.

Second, consider the relation, illustrated in Fig. (5.6), between the original function f(x),
at a point x admitting a strict supporting line, and the respective LF transform f*(k), evaluated at
k = f′(x), i.e. f*(f′(x)):

    ∀x : kx = f(x) + f*(k),  where k = f′(x).   (5.42)

As seen clearly in the figure, the LF relation splits the product kx between f(x) and f*(k)
(with the latter term associated with the supporting line). Notice the remarkable symmetry of
Eq. (5.42) under the transformation x ↔ k and f ↔ f*, keeping in mind that the variables x
and k are not independent: one of the two is selected as tracking the change, while
the other (conjugate) variable depends on the first one, according to k = f′(x) or
x = (f*)′(k).

5.6.4 Hamiltonian-to-Lagrangian Duality in Classical Mechanics

The LF transform is also the key to understanding the relation between the Hamiltonian and the Lagrangian
in classical mechanics. Let us illustrate it on a “no q” example, i.e. on the case when the
Hamiltonian, generally dependent on t, q and p, depends only on p. Specifically, consider the
example of a free relativistic particle, where H(p) = √(p² + m²), m is the particle mass and
the speed of light is set to unity, c = 1. In this case q̇ = ∂p H = dH/dp = p/√(p² + m²),
according to the Hamilton equation, and the Lagrangian, which generally depends on q̇ and q
but now depends only on q̇, is L(q̇) = pq̇ − H(p). This relation, rewritten in the symmetric
form,

    pq̇ = L(q̇) + H(p),

should be compared with the LF relation Eq. (5.42). We observe that p and q̇, like x and k,
are conjugate variables, while L should be viewed as the LF transform of the Hamiltonian,
L = H*, or vice versa, H = L*.
See [11] for further discussion of other examples of the LF transform in physics, for exam-
ple in statistical thermodynamics (where inverse temperature and energy are conjugate
variables, while free energy is the LF dual of the entropy, and vice versa).
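One can check numerically, in the spirit of Eq. (5.28), that the LF transform of the relativistic Hamiltonian reproduces the well-known closed form L(q̇) = −m√(1 − q̇²) (valid for |q̇| < 1); a minimal sketch assuming numpy:

    import numpy as np

    m = 1.0
    p = np.linspace(-50.0, 50.0, 200001)
    H = np.sqrt(p**2 + m**2)                     # free relativistic particle, c = 1

    for qdot in [0.0, 0.3, 0.9]:
        L_num = np.max(qdot * p - H)             # L = H*, sup over the p grid
        L_exact = -m * np.sqrt(1.0 - qdot**2)    # known closed form, |qdot| < 1
        print(qdot, L_num, L_exact)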

5.6.5 LF Transformation and Laplace Method

Consider the integral

    F(k, n) = ∫_R dx exp( n (kx − f(x)) ).

When n → ∞, the Laplace method of approximating the integral (discussed in Math 583a
in the fall) gives

    log F(k, n) = n sup_{x∈R} (kx − f(x)) + o(n),

i.e. the LF transform of f controls the leading asymptotics of the integral.
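For instance, for f(x) = x²/2 (so that the sup equals k²/2) the convergence of (1/n) log F(k, n) to the LF transform can be observed directly; a sketch assuming numpy and scipy, with an illustrative value of k:

    import numpy as np
    from scipy.special import logsumexp

    f = lambda x: x**2 / 2                    # test function, sup_x (kx - f(x)) = k^2/2
    x = np.linspace(-20.0, 20.0, 400001)
    k = 1.5
    for n in [10, 100, 1000]:
        # log F(k, n) evaluated stably via log-sum-exp over the quadrature grid
        logF = logsumexp(n * (k * x - f(x))) + np.log(x[1] - x[0])
        print(n, logF / n, k**2 / 2)          # the ratio approaches the sup as n grows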

5.7 Second Variation


Finding extrema of a function involves more than finding its critical points. A critical point
may be a minimum, a maximum or a saddle-point. To determine the critical point type
one needs to compute the Hessian matrix of the function. Similar consideration applies to
functionals when we want to characterize solutions of the Euler-Lagrange equations.
We naturally start the discussion of the second variation from the finite-dimensional
case. Let f : U ⊂ Rⁿ → R be a C² function (with existing first and second derivatives).
The Hessian of f at x is a symmetric bilinear form (on the tangent vector space Rⁿₓ
to Rⁿ at x) defined by

    ∀ξ, η ∈ Rⁿₓ : Hess_x(ξ, η) = ∂²f(x + sξ + wη)/∂s∂w |_{s=w=0}.   (5.43)

If the Hessian is positive definite, i.e. if the respective matrix of second derivatives has only
positive eigenvalues, then the critical point is a minimum.
Let us generalize the notion of the Hessian to the action, S = ∫ dt L(q, q̇), with the
Lagrangian L(q, q̇), where q(t) : R → Rⁿ is a C² function. The direct generalization of Eq. (5.43)
becomes

    Hess_q{ξ(t), η(t)} = ∂²S{q(t) + sξ(t) + wη(t)}/∂s∂w |_{s=w=0}   (5.44)
      = ∂/∂s ( ∂S{q(t) + sξ(t) + wη(t)}/∂w |_{w=0} ) |_{s=0}
      = ∂/∂s ∫ dt Σ_{i=1..n} ( ∂L(q + sξ; q̇ + sξ̇)/∂qⁱ − (d/dt) ∂L(q + sξ; q̇ + sξ̇)/∂q̇ⁱ ) ηⁱ |_{s=0}
      = ∫ dt Σ_{i,j=1..n} ( (∂²L/∂qʲ∂qⁱ) ξʲ + (∂²L/∂q̇ʲ∂qⁱ) ξ̇ʲ − (d/dt)( (∂²L/∂qʲ∂q̇ⁱ) ξʲ + (∂²L/∂q̇ʲ∂q̇ⁱ) ξ̇ʲ ) ) ηⁱ
      =: ∫ dt Σ_{i,j=1..n} J_{ij} ξʲ ηⁱ,

where Jij is the matrix of differential operators called the Jacobi operator. To determine if
the bilinear form is positive definite is usually hard, but in some simple cases the question
can be resolved.
Consider q : R → R, q ∈ C², and the quadratic action

    S{q(t)} = ∫₀ᵀ dt ( q̇² − q² ),   (5.45)

with zero boundary conditions, q(0) = q(T) = 0. To get some intuition about what the
landscape of the action (5.45) looks like, let us consider a subclass of functions, for example
oscillatory functions consisting of only one harmonic,

    q̄(t) = a sin( nπt/T ),   (5.46)

where a ∈ R (any real) and n ∈ Z \ 0 (any nonzero integer). Substituting Eq. (5.46) into
Eq. (5.45), one derives

    S{q̄(t)} = (n²π²a²/T²) ∫₀ᵀ dt cos²( nπt/T ) − a² ∫₀ᵀ dt sin²( nπt/T ) = (T a²/2) ( n²π²/T² − 1 ).

One observes that at T < π the action S, considered on this special class of functions,
is positive. However, some of these probe functions result in a negative action
when T > π. This means that at T > π the functional quadratic form corresponding to
the action (5.45) is certainly not positive definite.

One thus comes out of this “probe function” exercise with the following question: can
it be that, at T < π, the functional quadratic form corresponding to the action (5.45) is still not positive
definite? The analysis so far (restricted to the class of single-harmonic test functions) is not
conclusive. Quite remarkably, one can prove that the action (5.45) is always positive at T < π (over
the class of twice differentiable functions with zero boundary conditions), and thus the respective
quadratic form is indeed positive definite if T < π.
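The sign change on the probe functions (5.46) is easy to observe numerically; a sketch assuming numpy, with an illustrative unit amplitude and the lowest harmonic:

    import numpy as np

    a, n = 1.0, 1
    for T in [2.0, np.pi, 4.0]:
        t = np.linspace(0.0, T, 200001)
        q = a * np.sin(n * np.pi * t / T)                        # probe, Eq. (5.46)
        qdot = a * (n * np.pi / T) * np.cos(n * np.pi * t / T)
        S = np.sum(qdot**2 - q**2) * (t[1] - t[0])               # action, Eq. (5.45)
        closed = 0.5 * T * a**2 * ((n * np.pi / T)**2 - 1.0)     # closed form above
        print(T, S, closed)   # positive for T < pi, negative for T > pi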

Exercise 5.7.1. Prove that the action S{q(t)} given by Eq. (5.45) is positive at, T < π, for
any twice differentiable function, q ∈ C2 with zero boundary conditions, q(0) = q(T ) = 0.
(Hint: Represent the function as Fourier Series and show that the action is a sum of squares.)

5.8 Methods of Lagrange Multipliers


So far we have only discussed unconstrained variational formulations. This Section is de-
voted to generalizations where variational problems with constraints are formulated and
resolved.

5.8.1 Functional Constraint(s)

Consider the shortest path problem discussed in Example 5.2.2, however constrained by the
area A as follows:

    min_{q(x)|x∈[0,a]} ∫₀ᵃ dx √(1 + (q′(x))²),  subject to q(0) = 0, q(a) = b, ∫₀ᵃ q(x)dx = A.

The area constraint can be built into the optimization by adding

    λ ( ∫₀ᵃ dx q(x) − A )

to the optimization objective, where λ is the Lagrange multiplier. The Euler-Lagrange
equation for this “extended” action is

    0 = ∇x ( L_{∇q}(x, q(x), ∇x q(x)) ) − L_q(x, q(x), ∇x q(x)) − λ
      = (d/dx) ( q′(x)/√(1 + (q′(x))²) ) − 0 − λ
    →  q′(x)/√(1 + (q′(x))²) = constant + λx.

Example 5.8.1. The principle of maximum entropy, also called the principle of the maximum
likelihood (distribution), selects the probability distribution that maximizes the entropy,
S = −∫_D dx P(x) log P(x), under the normalization condition ∫_D dx P(x) = 1.

• (a) Consider D ⊂ Rⁿ. Find the optimal P(x).

• (b) Consider D = [a, b] ⊂ R. Find the optimal P(x), assuming that the mean of x is
known, E_{P(x)}(x) ≡ ∫_D dx x P(x) = µ.

Solution:
(a) The effective action is

    S̃ = S + λ ( 1 − ∫_D dx P(x) ),

where λ is the (constant, i.e. not dependent on x) Lagrange multiplier. Variation of S̃
over P(x) results in the following EL equation:

    δS̃/δP(x) = 0 :  −log(P(x)) − 1 − λ = 0.

Accounting for the normalization condition, one finds that the optimum is achieved at the
equi-distribution

    P(x) = 1/‖D‖,

where ‖D‖ is the size (volume) of D.
(b) The effective action is

    S̃ = S + λ ( 1 − ∫_D dx P(x) ) + λ₁ ( µ − ∫_D dx x P(x) ),

where λ and λ₁ are two (constant) Lagrange multipliers. Variation of S̃ over P(x) results
in the following EL equation:

    δS̃/δP(x) = 0 :  −log(P(x)) − 1 − λ − λ₁x = 0  →  P(x) = e^{−1−λ} exp(−λ₁x).

λ and λ₁ are constants which can be expressed via a, b and µ by resolving the normalization
constraint and the constraint on the mean:

    e^{−1−λ} [ −exp(−λ₁x)/λ₁ ]ₐᵇ = 1,   e^{−1−λ} [ −x exp(−λ₁x)/λ₁ − exp(−λ₁x)/λ₁² ]ₐᵇ = µ.
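In practice λ₁ is found numerically from the mean constraint (the normalization then fixes λ); a sketch assuming numpy and scipy, with the illustrative values a = 0, b = 1, µ = 0.3:

    import numpy as np
    from scipy.optimize import brentq

    a, b, mu = 0.0, 1.0, 0.3                 # illustrative interval and target mean

    def mean(lam1):
        x = np.linspace(a, b, 20001)
        w = np.exp(-lam1 * (x - a))          # un-normalized P(x); the shift aids stability
        return np.sum(x * w) / np.sum(w)     # normalization cancels in the ratio

    lam1 = brentq(lambda l: mean(l) - mu, -50.0, 50.0)
    print(lam1, mean(lam1))                  # the mean constraint holds at this lambda_1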

Exercise 5.8.2. Consider the setting of Example 5.8.1b with a = −∞ and b = ∞. Assuming
additionally that the variance of the probability distribution is known, E_{P(x)}(x²) = σ²,
find the P(x) which maximizes the entropy.



5.8.2 Function Constraints

The method of Lagrange multipliers in the calculus of variations extends to other types of
constrained optimization, where the condition is not a functional, as in the cases discussed
so far, but a function. Consider, for example, our standard one-dimensional example of the
action functional

    S{q(t)} = ∫ dt L(t; q(t); q̇(t)),   (5.47)

over q : R → R, however constrained by the function

    ∀t : G(t; q(t); q̇(t)) = 0.   (5.48)

Let us also assume that L(t; q; q̇) and G(t; q; q̇) are sufficiently smooth functions of their last
argument, q̇. The idea then becomes to introduce the following “modified” action,

    S̃{q(t), λ(t)} = ∫ dt ( L(t; q(t); q̇(t)) − λ(t)G(t; q(t); q̇(t)) ),   (5.49)

which is now a functional of both q(t) and λ(t), and to extremize it over both q(t) and λ(t).
One can show that solutions of the EL equations, derived as variations of the action (5.49)
over both q(t) and λ(t), give a necessary condition for the minimum of Eq. (5.47)
constrained by Eq. (5.48).
Let us illustrate this scheme and derive the Euler-Lagrange equation for a Lagrangian
L(q; q̇; q̈) which depends on the second derivative of a C³ function, q : R → R, and does
not depend on t explicitly. In full analogy with Eq. (5.49), the modified action in this case
becomes

    S̃{q(t), v(t), λ(t)} = ∫ dt ( L(q(t); q̇; v̇) − λ(t)(v(t) − q̇(t)) ).   (5.50)

Then the modified Euler-Lagrange equations are

    ∂L/∂q = (d/dt)( ∂L/∂q̇ + λ ),   −λ = (d/dt) ∂L/∂v̇,   v = q̇.   (5.51)

Eliminating λ and v, one arrives at the desired modified EL equation stated solely in terms
of derivatives of the Lagrangian over q(t) and its derivatives:

    ∂L/∂q − (d/dt) ∂L/∂q̇ + (d²/dt²) ∂L/∂q̈ = 0.   (5.52)
Exercise 5.8.3. Find the extrema of S{q(t)} = ∫₀¹ dt ‖q̇(t)‖ for q : [0, 1] → R³ subject to
∀t : ‖q(t)‖² = 1.

We will see more of the calculus of variations with (function) constraints later in the
optimal control section of the course.
Chapter 6

Convex and Non-Convex Optimization

This Section was prepared by Dr. Yury Maximov from Los Alamos National Laboratory
(and edited by MC). The material was presented in 6 lectures cross-cut between Math 583,
Math 527 and Math 575. In the future the material will mainly be moved to Math 527 and
only a brief (one-two lecture) summary will be kept within Math 583. The Section stays
here for now, but may become an Appendix later on.
This Section is split into four Subsections. Sections 6.1 and 6.2 discuss basic
convex and non-convex optimization. (We focus primarily on the finite-dimensional case,
noticing that the generalization of the basic methods to the infinite-dimensional case, e.g.
corresponding to the variational calculus, is straightforward.) Then in Sections 6.3 and 6.4
we turn to discussing iterative optimization methods for the optimization formulations,
set in Sections 6.1 and 6.2, of constrained and unconstrained types.
The most general problem, from which we start our discussion in Section 6.1, consists in
the minimization of a function f : S ⊆ Rⁿ → R:

    f(x) → min   (6.1)
    s.t.: x ∈ S ⊆ Rⁿ.

Notice the variability in notations – an absolutely equivalent alternative expression is

    min_{x∈S⊆Rⁿ} f(x).

Section 6.1 should be viewed as introductory (setting notations), leading us to the discussion of
the notion of (optimization) duality in Section 6.2.
Iterative algorithms, discussed in Sections 6.3 and 6.4, will be designed to solve Eq. (6.1).
Each step of such an algorithm will consist in updating the current estimate, x_k, using
x_j, f(x_j), j ≤ k, possibly the vector of derivatives ∇f(x), and possibly the Hessian matrix,
∇²f(x), such that the optimum is achieved in the limit, lim_{k→+∞} f(x_k) = inf_{x∈S⊆Rⁿ} f(x).
Different iterative algorithms can be classified depending on the information available,
as follows:

• Zero-order algorithms, where at each iteration step one has access to the value of
f(x) at a given point x (but no information on ∇f(x) and ∇²f(x) is available);

• First-order algorithms, where at each iteration step one has access to the values
of f(x) and ∇f(x);

• Second-order algorithms, where at each iteration step one has access to the values of
f(x), ∇f(x) and ∇²f(x);

• Higher-order algorithms, where at each iteration step one has access to the value of
the objective function and its first, second and higher-order derivatives.

We will not discuss in these notes higher-order algorithms, focusing in
Sections 6.3 and 6.4 primarily on the first-order and second-order algorithms.

6.1 Convex Functions, Convex Sets and Convex Optimization Problems
Calculus of Convex Functions and Sets

An important class of functions one can efficiently minimize is that of convex functions, which were
introduced earlier in Definition 5.6.2. We restate the definition here for convenience.

Definition 6.1.1 (Definition 5.6.2). A function, f : Rn → R is convex if

∀x, y ∈ Rn , λ ∈ (0, 1) : f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y).

If a function is smooth, one can give an equivalent definition of convexity.

Definition 6.1.2. A smooth function f (x) : Rn → R is convex, if

∀x, y ∈ Rn : f (y) ≥ f (x) + ∇f (x)> (y − x).

Definition 6.1.3. Let a function f : Rⁿ → R have a smooth gradient. Then f is convex iff

    ∀x : ∇²f(x) := ( ∂xᵢ∂xⱼ f(x); i, j = 1, …, n ) ⪰ 0,

that is, the Hessian of the function is a positive semi-definite matrix at every point. (Recall
that a real symmetric n × n matrix H is positive semi-definite iff xᵀHx ≥ 0 for any x ∈ Rⁿ.)

Lemma 6.1.4. The definitions above are equivalent for sufficiently smooth functions.

Proof. Assume that the function is convex according to Definition 6.1.1. Then for any
h ∈ Rⁿ and λ ∈ [0, 1] one has, according to Definition 6.1.1,

    f(λ(x + h) + (1 − λ)x) − f(x) = f(x + λh) − f(x) ≤ λ(f(x + h) − f(x)).

That is,

    f(x + h) − f(x) ≥ ( f(x + λh) − f(x) )/λ = ∇f(x)ᵀh + O(λ),  ∀λ ∈ (0, 1].

Then, taking the limit λ → 0, one has ∇f(x)ᵀh ≤ f(x + h) − f(x) for all h ∈ Rⁿ, which is
exactly Def. 6.1.2. Vice versa, if ∀x, y : f(y) ≥ f(x) + ∇f(x)ᵀ(y − x), one has for z = λx + (1 − λ)y
and any λ ∈ [0, 1]:

    f(y) ≥ f(z) + ∇f(z)ᵀ(y − z) = f(z) + λ∇f(z)ᵀ(y − x),

    f(x) ≥ f(z) + ∇f(z)ᵀ(x − z) = f(z) + (1 − λ)∇f(z)ᵀ(x − y).

Summing up the inequalities above with the weights (1 − λ) and λ, one gets f(λx + (1 − λ)y) ≤
λf(x) + (1 − λ)f(y). Thus Def. 6.1.1 and Def. 6.1.2 are equivalent.
Further, if f is sufficiently smooth, one has, according to the Taylor expansion,

    f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½(y − x)ᵀ∇²f(x)(y − x) + o(‖y − x‖₂²).

Taking y → x one gets from Definition 6.1.2 to Definition 6.1.3 and vice versa.

Definition 6.1.5. Function f (x) is concave iff −f (x) is convex.

Definition 6.1.2 is probably the most practical. To generalize it to non-smooth functions,


we introduce the notion of sub-gradient.

Definition 6.1.6. A vector g ∈ Rⁿ is a sub-gradient of the convex function f : Rⁿ → R
at the point x iff

    ∀y ∈ Rⁿ : f(y) ≥ f(x) + gᵀ(y − x).

The set ∂f(x) is the set of all sub-gradients of the function f at the point x.

To establish some properties of the sub-gradients (which can also be called sub-differentials),
let us introduce the notion of a convex set, i.e. a set which contains, together with any pair
of its points, the segment between them.

Definition 6.1.7. A set S is convex iff for any x₁, x₂ ∈ S and θ ∈ [0, 1] one has x₁θ + x₂(1 −
θ) ∈ S. In other words, the set S is convex if for any points x₁, x₂ in it, the set contains the line
segment [x₁, x₂].

Theorem 6.1.8. For any convex function f , f : Rn → R, and any point x ∈ Rn the
sub-differential ∂f (x) is a convex set. In other words, for any g1 , g2 ∈ ∂f (x) one has
θg1 + (1 − θ)g2 ∈ ∂f (x). Moreover, ∂f (x) = {∇f (x)} if f is smooth.

Proof. Let g₁, g₂ ∈ ∂f(x); then f(y) ≥ f(x) + g₁ᵀ(y − x) and f(y) ≥ f(x) + g₂ᵀ(y − x).
That is, for any λ ∈ [0, 1] one has f(y) ≥ f(x) + (λg₁ + (1 − λ)g₂)ᵀ(y − x), and λg₁ +
(1 − λ)g₂ is a sub-gradient as well. We conclude that the set of all the sub-gradients
is convex. Moreover, if f is smooth, according to the Taylor expansion formula one has
f(x + h) = f(x) + ∇f(x)ᵀh + O(‖h‖₂²). Assume that there exists a sub-gradient g ∈ ∂f(x)
other than ∇f(x) (note that ∇f(x) ∈ ∂f(x) by Definition 6.1.2 of convex functions). Then
f(x) + gᵀh ≤ f(x + h) = f(x) + ∇f(x)ᵀh + O(‖h‖₂²), and similarly f(x) − gᵀh ≤ f(x − h) =
f(x) − ∇f(x)ᵀh + O(‖h‖₂²), so that

    gᵀh ≤ ∇f(x)ᵀh + O(‖h‖₂²)  and  gᵀh ≥ ∇f(x)ᵀh + O(‖h‖₂²),

which implies g = ∇f(x), therefore concluding the proof.

Let us illustrate the sub-gradient calculus on the following examples:

• The sub-differential of f(x) = |x| is

    ∂f(x) = {1} if x > 0,  {−1} if x < 0,  and [−1, 1] if x = 0.

• The sub-differential of f(x) = max{f₁(x), f₂(x)} is

    ∂f(x) = {∇f₁(x)} if f₁(x) > f₂(x),  {∇f₂(x)} if f₁(x) < f₂(x),
    and {θ∇f₁(x) + (1 − θ)∇f₂(x), θ ∈ [0, 1]} if f₁(x) = f₂(x),

if f₁ and f₂ are smooth functions on Rⁿ.


Exercise 6.1.1 (Math 575). Consider f(x, y) = √(x² + 4y²). Prove that f is convex. Sketch
the level curves of f. Find the sub-differential ∂f(0, 0).

Example 6.1.2. Examples of convex functions include:



a) xp , p ≥ 1 or p ≤ 0 is convex; xp , 0 ≤ p ≤ 1 is concave;

b) exp(x), x ∈ R and − log x, x ∈ R++ , are convex;

c) g(x) = f(h(x)), where f : R → R and h : R → R, is convex if

(a) f is convex and non-decreasing, and h is convex; or

(b) f is convex and non-increasing, and h is concave.

To prove the statement for smooth functions, consider

    g″(x) = f″(h(x))(h′(x))² + f′(h(x))h″(x).

One can also extend the statement to non-smooth and multidimensional functions.

d) LogSumExp, also called soft-max, log( Σᵢ₌₁ⁿ exp(xᵢ) ), is convex in x ∈ Rⁿ as a com-
position of a convex non-decreasing function and convex functions. The soft-max function
plays a very important role because it bridges smooth and non-smooth optimizations:

    max(x₁, x₂, …, xₙ) ≈ (1/λ) log( Σᵢ₌₁ⁿ exp(λxᵢ) ),  λ → +∞   (6.2)

(see the numerical sketch after this list).

e) The ratio of a quadratic function of one variable to a linear function of another variable,
e.g. f(x, y) = x²/y, is jointly convex in x and y for y > 0;

f) The vector norm ‖x‖_p := ( Σᵢ |xᵢ|ᵖ )^{1/p}, x ∈ Rⁿ, also called the p-norm or `p-norm when p ≥ 1,
is convex;

g) The dual norm ‖·‖* to ‖·‖ is ‖y‖* := sup_{‖x‖≤1} xᵀy. The dual norm is always convex;

h) The indicator function of a convex set, I_S(x), is convex:

    I_S(x) = 0 if x ∈ S,  and +∞ if x ∉ S.
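The soft-max bridge (6.2) referenced in item d) is easy to observe numerically; a sketch assuming numpy and scipy, with an illustrative vector x:

    import numpy as np
    from scipy.special import logsumexp

    x = np.array([0.3, 1.0, 0.98])
    for lam in [1.0, 10.0, 100.0, 1000.0]:
        print(lam, logsumexp(lam * x) / lam)   # approaches max(x) = 1.0 as lam grows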

Example 6.1.3. Examples of convex sets:


1. If (any number of) sets {Sᵢ}ᵢ are convex, then ∩ᵢ Sᵢ is convex;

2. The affine image of a convex set is convex:

    S̄ = {Ax + b : x ∈ S};

3. The image (and inverse image) of a convex set S under the perspective mapping P : Rⁿ⁺¹ →
Rⁿ, P(x, t) = x/t, dom P = {(x, t) : t > 0}, is convex.
Indeed, consider y₁, y₂ ∈ P(S), so that y₁ = x₁/t₁ and y₂ = x₂/t₂. We need to show
that for any λ ∈ [0, 1]

    y = λy₁ + (1 − λ)y₂ = λ x₁/t₁ + (1 − λ) x₂/t₂ = ( θx₁ + (1 − θ)x₂ )/( θt₁ + (1 − θ)t₂ )

for some θ ∈ [0, 1], which holds for θ = λt₂/(λt₂ + (1 − λ)t₁). The proof of the inverse statement is similar.

4. The image of a convex set under the linear-fractional function f : Rⁿ⁺¹ → Rⁿ, f(x) =
(Ax + b)/(cᵀx + d), dom f = {x : cᵀx + d > 0}, is convex. Indeed, f(x) is a perspective transform of an
affine function.

Exercise 6.1.4. Check that all functions and all sets above are convex using Definition
6.1.1 of a convex function (or the equivalent Definitions 6.1.2, 6.1.3) and Definition 6.1.7
of a convex set.
In further analysis, we introduce a special subclass of convex functions for which one
can guarantee much faster convergence than for minimization of a general convex function.
Definition 6.1.9. A function f : Rⁿ → R is µ-strongly convex with respect to a norm ‖·‖, for
some µ > 0, iff

1. ∀x, y : f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) + (µ/2)‖y − x‖²;

2. if f is sufficiently smooth, the strong convexity condition in the `2 norm is equivalent to
∀x : ∇²f(x) ⪰ µI.
As we will see later, the generalization of the strong convexity Definition 6.1.9 to a general
`p norm allows one to design more efficient algorithms in various cases. (Concavity, strong
concavity and convexity in `p are defined by analogy.)
Exercise 6.1.5 (Math 527). Find a subset of R3 containing (0, 0, 0) such that f (u) =
sin(x + y + z) is (a) convex; (b) strongly convex.
Exercise 6.1.6 (Math 527). Is it true that the functions f(x) = x²/2 − sin x and g(x) =
√(1 + xᵀx), x ∈ Rⁿ, are convex? Are the functions strongly convex?
Exercise 6.1.7 (Math 527). Check whether the function Σᵢ₌₁ⁿ xᵢ log xᵢ, defined on Rⁿ₊₊, is

• convex/concave/strongly convex/strongly concave?

• strongly convex/concave in `1, `2, `∞?

Hint: to prove that the function is strongly convex in the `p norm it is sufficient to show that

    hᵀ ∇²f(x) h ≥ ‖h‖_p².



Convex Optimization Problems

The optimization problem

    f(x) → min_{x∈S⊆Rⁿ}

is convex if f(x) and S are convex. The complexity of an iterative algorithm, initiated with x₀,
to solve the optimization problem is measured in the number of iterations required to get
a point x_k such that |f(x_k) − inf_{x∈S⊆Rⁿ} f(x)| < ε. Each iteration means an update of x_k.
Complexity classification is as follows

• linear, that is, the number of iterations is k = O(log(1/ε)); in other words, f(x_{k+1}) −
inf_{x∈S} f(x) ≤ c(f(x_k) − inf_{x∈S} f(x)) for some constant c, 0 < c < 1. Roughly, after
each iteration we increase the number of correct digits in our answer by one.

• quadratic, that is, k = O(log log(1/ε)), and f(x_{k+1}) − inf_{x∈S} f(x) ≤ c(f(x_k) − inf_{x∈S} f(x))²
for some constant c, 0 < c < 1. That is, after each iteration we double the number of correct
digits in our answer.

• sub-linear, that is, characterized by a rate slower than O(log(1/ε)). In convex op-
timization it is often the case that the convergence rate for different methods is
k = O(1/ε), O(1/ε²), or O(1/√ε), depending on the properties of the function f.

Consider an optimization problem

    f(x) → min_{x∈Rⁿ}
    s.t.: g(x) ≤ 0,  h(x) = 0.

If the inequality constraint g(x) is convex and the equality constraint is affine, h(x) = Ax + b,
then the feasible set of this problem, S = {x : g(x) ≤ 0 and h(x) = 0}, is convex, which follows
immediately from the definitions of a convex set and a convex function. As we will see later in
the lectures, in contrast to non-convex problems, convex ones admit very efficient and
scalable solutions.
Exercise 6.1.8 (Math 527). Let Π_C^{`p}(x) be the projection of a point x onto a convex compact
set C in the `p norm, i.e.

    Π_C^{`p}(x) = arg min_{y∈C} ‖x − y‖_p.

Find the `1, `2, `∞ projections of x = {1, 1/2, 1/3, …, 1/n} ∈ Rⁿ onto the unit simplex S = {x :
Σᵢ₌₁ⁿ |xᵢ| = 1}. Which of the `1, `2, `∞ projections of an arbitrary point x ∈ Rⁿ onto the unit
simplex is easier to compute?

6.2 Duality
Duality is a very powerful tool which allows one (1) to design efficient (tractable) algorithms
to approximate non-convex problems; (2) to build efficient algorithms for convex and non-
convex problems with constraints (which are often of a much smaller dimensionality than
the original formulations); (3) to formulate necessary and sufficient conditions of optimality
for convex and non-convex optimization problems.

Lagrangian

Consider the following constrained (not necessarily convex) optimization problem:

    f(x) → min   (6.3)
    s.t.: gᵢ(x) ≤ 0, 1 ≤ i ≤ m;  hⱼ(x) = 0, 1 ≤ j ≤ p;  x ∈ Rⁿ,

with the optimal value p* (which is possibly −∞). Let S be the feasible set of this problem,
that is, the set of all x for which all the constraints are satisfied.
Compose the so-called Lagrangian function L : Rⁿ × Rᵐ × Rᵖ → R:

    L(x, λ, µ) = f(x) + Σᵢ₌₁ᵐ λᵢgᵢ(x) + Σⱼ₌₁ᵖ µⱼhⱼ(x) = f(x) + λᵀg(x) + µᵀh(x),  λ ≥ 0,   (6.4)

which is a weighted combination of the objective and the constraints. The Lagrange multipliers,
λ and µ, can be viewed as penalties for violation of the inequality and equality constraints.
The Lagrangian function (6.4) allows us to formulate the constrained optimization,
Eq. (6.3), as a min-max (also called saddle-point) optimization problem:

    p* = min_{x∈S⊆Rⁿ} max_{λ≥0,µ} L(x, λ, µ),   (6.5)

where the optimum of Eq. (6.3) is achieved at p*.

Weak and Strong Duality

Let us consider the saddle-point problem (6.5) in greater detail. For any feasible point
x ∈ S ⊆ Rⁿ one has f(x) ≥ L(x, λ, µ) for λ ≥ 0. Thus

    g(λ, µ) = min_{x∈S} L(x, λ, µ) ≤ min_{x∈S} f(x) = p*  ⇒  max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) ≤ p* = min_{x∈S} max_{λ≥0,µ} L(x, λ, µ),

where g(λ, µ) = inf_{x∈Rⁿ} L(x, λ, µ) = inf_{x∈Rⁿ} ( f(x) + λᵀg(x) + µᵀh(x) ) is called the La-
grange dual function. One can restate this as

    d* = max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) ≤ min_{x∈S} max_{λ≥0,µ} L(x, λ, µ) = p*.

The original optimization, min_{x∈S} f(x) = min_{x∈S} max_{λ≥0,µ} L(x, λ, µ), is called the Lagrange
primal optimization, while max_{λ≥0,µ} g(λ, µ) = max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) is called the
Lagrange dual optimization.
Note that max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) = max_{λ≥0,µ} min_{x∈Rⁿ} L(x, λ, µ), regardless of what
S is. This is because for x̂ ∉ S one has max_{λ≥0,µ} L(x̂, λ, µ) = +∞, thus allowing us to perform
the unconstrained minimization of L(x, λ, µ) over x, which is much more efficient.
Let us describe a number of important features of the dual optimization:

1. Concavity of the dual function. The dual function g(λ, µ) is always concave. Indeed,
for (λ̄, µ̄) = θ(λ₁, µ₁) + (1 − θ)(λ₂, µ₂) one has

    g(λ̄, µ̄) = min_x L(x, λ̄, µ̄) = min_x { θL(x, λ₁, µ₁) + (1 − θ)L(x, λ₂, µ₂) }
             ≥ θ min_x L(x, λ₁, µ₁) + (1 − θ) min_x L(x, λ₂, µ₂) = θg(λ₁, µ₁) + (1 − θ)g(λ₂, µ₂).

The dual (maximization) problem max_{λ≥0,µ} g(λ, µ) is thus equivalent to the minimization
of the convex function −g(λ, µ) over the convex set λ ≥ 0.

2. Lower bound property. g(λ, µ) ≤ p∗ for any λ ≥ 0.

3. Weak duality: For any optimization problem d* ≤ p*. Indeed, for any feasible (x, λ, µ)
we have f(x) ≥ L(x, λ, µ) ≥ g(λ, µ), thus p* = min_{x∈S} f(x) ≥ max_{λ≥0,µ} g(λ, µ) = d*.

4. Strong duality: We say that strong duality holds if p* = d*. Convexity of the objective
function and convexity of the feasible set S are neither sufficient nor necessary conditions
for strong duality (see the following example).

Example 6.2.1. Convexity alone is not sufficient for strong duality. Find the dual
problem and the duality gap p* − d* for the following optimization:

    exp(−x) → min_{y>0, x}
    s.t.: x²/y ≤ 0.

The optimal value is p* = 1, which is achieved at x = 0 and any positive y. The dual
function is

    g(λ) = inf_{y>0, x} ( exp(−x) + λx²/y ) = 0.

That is, the dual problem is max_{λ≥0} 0 = 0, and the duality gap is p* − d* = 1.



Theorem 6.2.1 (Slater (sufficient) conditions). Consider the optimization (6.3) where
all the equality constraints are affine and all the inequality constraints and the objective
function are convex. The strong duality holds if there exists an x∗ such that x∗ is strictly
feasible, i.e. all constraints are satisfied and the nonlinear constraints are satisfied with
strict inequalities.

The Slater conditions imply that the set of optimal solutions of the dual problem is non-empty,
therefore making the conditions sufficient for the strong duality of the optimization.

Optimality Conditions

Another notable feature of the Lagrangian function is due to its role in establishing nec-
essary and sufficient conditions for a triplet (x, λ, µ) to be the solution of the saddle-point
optimization (6.5). First, let us formulate necessary conditions of optimality for

    f(x) → min
    s.t.: gᵢ(x) ≤ 0, 1 ≤ i ≤ m;  hⱼ(x) = 0, 1 ≤ j ≤ p;  x ∈ S ⊆ Rⁿ.

According to Eq. (6.5) the optimization is equivalent to

    min_{x∈S} max_{λ≥0,µ} L(x, λ, µ),

where the Lagrangian is defined in Eq. (6.4). The following conditions, called Karush-Kuhn-
Tucker (KKT) conditions, are necessary for a triplet (x∗ , λ∗ , µ∗ ) to become optimal:

1. Primal feasibility: x∗ ∈ S.

2. Dual feasibility: λ∗ ≥ 0.

3. Vanishing gradient: ∇x L(x∗ , λ∗ , µ∗ ) = 0 for smooth functions, and 0 ∈ ∂L(x∗ , λ∗ , µ∗ )


for non-smooth functions. Indeed for the optimal (λ∗ , µ∗ ), L should attain its mini-
mum at x∗ .

4. Complementary slackness conditions: λᵢ*gᵢ(x*) = 0. Otherwise, if gᵢ(x*) < 0 and
λᵢ* > 0, one could reduce the Lagrange multiplier and increase the dual objective.

Note, that the KKT conditions generalize (the finite dimensional version of) the Euler-
Lagrange conditions introduced in the variational calculus. Let us now investigate when
the conditions are sufficient.

The KKT conditions are sufficient if the problem enjoys strong duality, for which (as
we saw above) the Slater conditions are sufficient. Indeed, assume that strong duality
holds and a point (x*, λ*, µ*) satisfies the KKT conditions. Then

    g(λ*, µ*) = f(x*) + g(x*)ᵀλ* + h(x*)ᵀµ* = f(x*),   (6.6)

where the first equality holds because of stationarity, and the second equality
holds because of complementary slackness.

Example 6.2.2. Find the duality gap and solve the dual problem for the following minimiza-
tion:

    (x₁ − 3)² + (x₂ − 2)² → min
    s.t.: x₁ + 2x₂ = 4,  x₁² + x₂² ≤ 5.

Note that the problem is (strongly) convex and Slater’s condition is satisfied, therefore
the minimum is unique. The Lagrangian is

    L(x, λ, µ) = (x₁ − 3)² + (x₂ − 2)² + µ(x₁ + 2x₂ − 4) + λ(x₁² + x₂² − 5),  λ ≥ 0.

The dual problem becomes

    g(λ, µ) = inf_{x∈Rⁿ} L(x, λ, µ).

The stationarity (KKT) conditions are

    ∇ₓL = ( 2(x₁ − 3) + µ + 2λx₁,  2(x₂ − 2) + 2µ + 2λx₂ ) = 0.

Therefore (1 + λ)(2x₁ − x₂) = 4, and using the primal feasibility constraint one derives
x₁ = (12 + 4λ)/(5(1 + λ)), x₂ = (4 + 8λ)/(5(1 + λ)). The dual problem becomes

    g(λ) = (9 + 16λ − 9λ²)/(5(1 + λ)²) → max_{λ≥0}.

Finally, the saddle point is (x₁*, x₂*, µ*, λ*) = (2, 1, 2/3, 1/3).

Example 6.2.3. For the primal problem

    3x + 7y + z → min
    s.t.: x + 5y = 2,  x + y ≥ 3,  z ≥ 0,

find the dual problem, the optimal values of the primal and dual objectives, as well as the
optimal solutions for the primal variables and for the dual variables. Describe all the steps
in detail.
Solution:

1. Note that the problem is equivalent to

    3x + 7y → min
    s.t.: x + 5y = 2,  x + y ≥ 3,

as x, y are independent of z, and the objective attains its minimum at z = 0.

2. Introduce the Lagrangian:

L(x, y, µ, λ) = 3x + 7y + µ(2 − x − 5y) + λ(3 − x − y)

3. State the KKT (stationarity) conditions for ∇L(x, y, µ, λ):

    (d/dx) L(x, y, µ, λ) = 3 − µ − λ = 0,
    (d/dy) L(x, y, µ, λ) = 7 − 5µ − λ = 0,

therefore resulting in µ = 1 and λ = 2. One observes that the Lagrange multipliers
are feasible, meaning that there exists at least one point on the intersection of the
equality and inequality constraints.

4. The complementary slackness condition (for the inequality) is

    λ(3 − x − y) = 0.

Since λ = 2, the respective inequality constraint is active: x + y = 3.

5. Using the primal feasibility one derives:

x + 5y = 2 and x + y = 3,

resulting in y = −0.25 and x = 3.25.

6. Optimal values of the primal variables are (x, y, z) = (3.25, −0.25, 0).

Dual problem.

1. The Lagrangian function is

    L(x, y, µ, λ) = 3x + 7y + µ(2 − x − 5y) + λ(3 − x − y) = 2µ + 3λ + x(3 − µ − λ) + y(7 − 5µ − λ).

The dual objective is

    g(λ, µ) = inf_{x,y} L(x, y, µ, λ) = 2µ + 3λ if 3 − µ − λ = 0 and 7 − 5µ − λ = 0,  and −∞ otherwise.

2. Thus, the dual problem is

    2µ + 3λ → max
    s.t.: 3 − λ − µ = 0,  7 − 5µ − λ = 0.

3. The duality gap is 0, as this problem is linear (Slater’s condition is satisfied by
definition).
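These hand calculations can be cross-checked with an LP solver; a sketch assuming scipy (x and y are free variables; the inequality is flipped to the ≤ form linprog expects):

    import numpy as np
    from scipy.optimize import linprog

    # Primal: min 3x + 7y + z  s.t.  x + 5y = 2,  x + y >= 3,  z >= 0
    res = linprog(c=[3.0, 7.0, 1.0],
                  A_ub=[[-1.0, -1.0, 0.0]], b_ub=[-3.0],      # -(x + y) <= -3
                  A_eq=[[1.0, 5.0, 0.0]], b_eq=[2.0],
                  bounds=[(None, None), (None, None), (0.0, None)])
    print(res.x, res.fun)   # [3.25, -0.25, 0.0] and 8.0 = 2*mu + 3*lambda at (1, 2)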

Exercise 6.2.4. (Math 583) For the primal optimization problems stated below find the
dual problem, the optimal values of the primal and dual objectives, as well as the optimal
solutions for the primal variables and for the dual variables. Describe all the steps in
detail.

1. min 4x + 5y + 7z, s.t.: 2x + 7y + 5z + d = 9 and x, y, z, d ≥ 0. [Hint: try to drop
an inequality constraint, find the optimal value, and check after finding the optimal
solution whether the dropped inequality is satisfied.]

2. min { (x₁ − 5/2)² + 7x₂² − x₃² }, s.t.: x₁² − x₂ ≤ 0 and x₃² + x₂ ≤ 4.

Examples of Duality

Example 6.2.5 (Duality and Legendre-Fenchel Transform). Let us discuss the relation
between the transformation from the Lagrange function to the dual (Lagrange) function and
the Legendre-Fenchel (LF) transform (or conjugate function),

    f*(y) = sup_{x∈Rⁿ} ( yᵀx − f(x) ),

introduced in the variational calculus Section (of the Math 586 course). One of the principal
conclusions of the LF analysis is f(x) ≥ f**(x). The inequality is directly linked to the
statement of duality, specifically to the fact that the dual optimization lower-bounds the primal

one. To illustrate the relationship between the maximization of f** and the dual problem,
consider

    f(x) → min
    s.t.: x = b,

where b is a parameter. Then

    min_x max_µ { f(x) + µᵀ(b − x) } ≥ max_µ min_x { f(x) + µᵀ(b − x) }
      = max_µ { µᵀb − max_x ( µᵀx − f(x) ) } = max_µ { µᵀb − f*(µ) } = f**(b).

Minimizing the expression over all b ∈ Rⁿ, one arrives at min_{x∈Rⁿ} f(x) ≥ min_{x∈Rⁿ} f**(x).

Example 6.2.6 (Duality in Linear Programming (LP)). Consider the following problem:

    cᵀx → min
    s.t.: Ax ≤ b.

We define the Lagrangian L(x, λ) = cᵀx + λᵀ(Ax − b), λ ≥ 0, and arrive at the following dual
objective:

    g(λ) = inf_{x∈Rⁿ} L(x, λ) = inf_{x∈Rⁿ} { xᵀ(c + Aᵀλ) − bᵀλ } = −bᵀλ if c + Aᵀλ = 0,  and −∞ otherwise.

The resulting dual optimization is

    g(λ) = −bᵀλ → max
    s.t.: c + Aᵀλ = 0,  λ ≥ 0.

Example 6.2.7 (Non-convex problems with strong duality). Consider the following quadratic
minimization:

    xᵀAx + 2bᵀx → min
    s.t.: xᵀx ≤ 1,

where A is not positive semi-definite, A ⋡ 0 (so the problem is non-convex). Its dual objective is

    g(λ) = inf_{x∈Rⁿ} L(x, λ) = inf_{x∈Rⁿ} { xᵀ(A + λI)x + 2bᵀx − λ }
         = −∞ if A + λI ⋡ 0, or if A + λI ⪰ 0 and b ∉ Im(A + λI);
         = −bᵀ(A + λI)⁺b − λ otherwise.

The resulting dual optimization is

    −bᵀ(A + λI)⁺b − λ → max
    s.t.: A + λI ⪰ 0,  b ∈ Im(A + λI).

Let us restate the optimization in a convex form by introducing an extra variable t:

    −t − λ → max
    s.t.: t ≥ bᵀ(A + λI)⁺b,  A + λI ⪰ 0,  b ∈ Im(A + λI).

Finally (via the Schur complement) one arrives at

    −t − λ → max
    s.t.: ( A + λI  b ; bᵀ  t ) ⪰ 0.

Example 6.2.8 (Dual to binary Quadratic Programming (QP)). Consider the following
binary quadratic optimization:

    xᵀAx → max
    s.t.: xᵢ² = 1,  1 ≤ i ≤ n,

with A ⪰ 0. The dual optimization is

    min_{x∈Rⁿ} { −xᵀAx + Σᵢ₌₁ⁿ µᵢ(xᵢ² − 1) } = min_{x∈Rⁿ} { xᵀ(Diag(µ) − A)x } − Σᵢ₌₁ⁿ µᵢ → max_µ,

that is,

    Σᵢ₌₁ⁿ µᵢ → min   (6.7)
    s.t.: Diag(µ) ⪰ A.

Note that the optimization (6.7) is convex and it provides a non-trivial upper bound on the
primal (maximization) problem. The bound is called the Semi-Definite Programming (SDP)
relaxation.
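For small n the quality of the bound (6.7) can be checked by brute-force enumeration; a sketch assuming numpy and the cvxpy modeling package with an SDP-capable solver installed (the random 4 × 4 matrix is illustrative):

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    A = B + B.T                                   # symmetric objective matrix
    mu = cp.Variable(4)
    prob = cp.Problem(cp.Minimize(cp.sum(mu)), [cp.diag(mu) - A >> 0])
    prob.solve()                                  # the SDP relaxation, Eq. (6.7)

    # enumerate all x in {-1, +1}^4 for the exact primal maximum
    xs = np.array(np.meshgrid(*([[-1.0, 1.0]] * 4))).reshape(4, -1)
    brute = max(x @ A @ x for x in xs.T)
    print(brute, prob.value)                      # brute-force value <= SDP bound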

Exercise 6.2.9. Find the dual problem, and estimate the duality gap, for the following prob-
lem:

    min −½ xᵀLx + bᵀx
    s.t.: ‖x‖∞ ≤ 1,

if bbᵀ ⪯ εL for some small ε > 0. Consider the cases L ⪰ 0 and L ⋡ 0. Is it true that for
sufficiently small ε > 0 one can solve the problem just stated exactly if L ⪰ 0?

Conic Duality (additional material)

Standard formulation of the conic optimization is:

    cᵀx → min_x   (6.8)
    s.t.: Ax = b,  x ∈ K,
where K is a proper cone, i.e. a set which satisfies

1. K is a convex cone, that is for any x, y ∈ K one has αx + βy ∈ K, α, β ≥ 0;

2. K is closed;

3. K is solid, meaning it has nonempty interior;

4. K is pointed, meaning if x ∈ K, and −x ∈ K then x = 0.

Conic optimization problems are important in optimization. In Example 6.2.8 you have already
seen the problem (dual to the binary quadratic optimization) which is a conic optimization
problem over the cone of positive semi-definite matrices.
The dual cone K* of K is defined as K* = {c : ⟨c, x⟩ ≥ 0, ∀x ∈ K}.

Exercise 6.2.10. Show that the following sets are self-dual cones (that is, K* = K):

1. the set of positive semi-definite matrices, Sⁿ₊;

2. the positive orthant, Rⁿ₊;

3. the second-order cone, Qⁿ = {(x, t) ∈ Rⁿ⁺¹ : t ≥ ‖x‖₂}.


Note that in the case of semi-definite matrices cᵀx = Σᵢⱼ cᵢⱼxᵢⱼ (i.e. the element-wise, Hadamard,
product of the matrices summed over all entries). The Lagrangian of Problem (6.8) is given by

    L = cᵀx + µᵀ(b − Ax) − λᵀx,



where the last term stands for the constraint x ∈ K. From the definition of the dual cone one derives

    max_{λ∈K*} ( −λᵀx ) = 0 if x ∈ K,  and +∞ if x ∉ K.

Therefore

    p* = min_x max_{λ∈K*,µ} L(x, λ, µ) ≥ d* = max_{λ∈K*,µ} min_x L(x, λ, µ).

And the dual function is

    g(λ, µ) = min_x { cᵀx + µᵀ(b − Ax) − λᵀx } = µᵀb if c − Aᵀµ − λ = 0,  and −∞ otherwise.

And finally

    d* = max µᵀb
    s.t.: c − Aᵀµ − λ = 0,  λ ∈ K*.

Eliminating λ, one has

    µᵀb → max
    s.t.: c − Aᵀµ ∈ K*.

Exercise 6.2.11. Find the dual problem (see Example 6.2.8) to

    1ᵀµ = Σᵢ₌₁ⁿ µᵢ → min
    s.t.: Diag(µ) ⪰ A.

Ensure that your dual problem is equivalent to

    ⟨A, X⟩ → max
    s.t.: X ∈ Sⁿ₊,  Xᵢᵢ = 1 ∀i.

In the remainder of the Section we study iterative algorithms to solve the optimiza-
tion problems discussed so far. It will be convenient to think about iterations in terms of
“discrete (algorithmic) time”, and also to consider the “continuous time” limit when the changes
in the values per iteration are sufficiently small and the number of iterations is sufficiently
large. In the continuous-time analysis of the algorithms we utilize the language of dif-
ferential equations, as it helps both for intuition (familiar from first-semester studies of
differential equations) and for analysis. However, to reach some of the rigorous
conclusions we may also go back to the original, discrete, language.

6.3 Unconstrained First-Order Convex Minimization


In this lecture, we will consider an unconstrained convex minimization problem

    f(x) → min_{x∈Rⁿ},

and focus on first-order optimization methods. That is, we assume that the objective
function, as well as the gradient of the objective function, can both be evaluated efficiently.
Note that the first-order methods described in this Section are the most popular methods/algo-
rithms currently in use to solve the majority of practical machine learning, data science and,
more generally, applied mathematics problems.
We assume that the function f is smooth, that is,

    ∀x, y : ‖∇f(x) − ∇f(y)‖* ≤ β‖x − y‖,   (6.9)

for some positive constant β. Choosing the `2 norm, ‖·‖ = ‖·‖₂, one derives

    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (β/2)‖y − x‖₂²,  ∀x, y ∈ Rⁿ.

To simplify the description, we will omit in the following “w.r.t. norm ‖·‖” when
discussing the `2 norm.

Smooth Optimization

Gradient Descent. Gradient Descent (GD) is the simplest and arguably the most popular
method/algorithm for solving convex (and non-convex) optimization problems. The iteration of
the GD algorithm is

    x_{k+1} = x_k − η_k ∇f(x_k) = arg min_x h_{η_k}(x),
    h_{η_k}(x) := f(x_k) + ∇f(x_k)ᵀ(x − x_k) + (1/(2η_k))‖x − x_k‖₂²,   η_k ≤ 1/β,

where we assume that f is β-smooth with respect to the `2 norm. If η_k ≤ 1/β, each step of
GD becomes equivalent to minimization of the convex quadratic upper bound h_{η_k}(x)
of f(x).

Definition 6.3.1. A function f : Rⁿ → R is β-smooth w.r.t. a norm ‖·‖ if

    ‖∇f(x) − ∇f(y)‖* ≤ β‖x − y‖  ∀x, y.

If ‖·‖ = ‖·‖₂, we simply call the function β-smooth.

Theorem 6.3.2. Assume that a function f : Rⁿ → R is convex and β-smooth. Then
repeating the GD step k times/iterations with a fixed step-size η ≤ 1/β results in an f(x_k)
which satisfies

    f(x_k) − f(x*) ≤ ‖x₁ − x*‖₂²/(2ηk),   η ≤ 1/β,   (6.10)

where x* is the optimal solution.

We will provide the continuous time proof of the Theorem, as well as its discrete time
version, where the former will rely on the notion of the Lyapunov function.

Definition 6.3.3. Lyapunov function, V (x(t)), of the differential equation, ẋ(t) = f (x(t)),
is a function that

1. decreases monotonically along (discrete or continuous time) trajectory, V̇ (x(t)) < 0.

2. converges to zero at t → ∞, i.e. V (x(∞)) = 0, where x∗ = x(∞).

From now on, we will use capitalized notation, X(t), for the continuous time version of
(xk |k = 1, · · · ).

Proof of Theorem 6.3.2: Continuous time. The GD algorithm can be viewed as a discretiza-
tion of the first-order differential equation

    Ẋ(t) = −∇f(X(t)).

Introduce the following Lyapunov function for this ODE: V(X(t)) = ‖X(t) − x*‖₂²/2. Then

    (d/dt)V(t) = (X(t) − x*)ᵀẊ(t) = −∇f(X(t))ᵀ(X(t) − x*) ≤ −(f(X(t)) − f*),   (6.11)

where the last inequality is due to the convexity of f. Integrating Eq. (6.11) over time, one
derives

    V(X(t)) − V(X(0)) ≤ t f* − ∫₀ᵗ f(X(τ)) dτ.

Utilizing (a) Jensen’s inequality,

    f( (1/t)∫₀ᵗ X(τ) dτ ) ≤ (1/t)∫₀ᵗ f(X(τ)) dτ,

which is valid for all convex functions, and (b) the non-negativity of V(t), one derives

    f( (1/t)∫₀ᵗ X(τ) dτ ) − f* ≤ (1/t)∫₀ᵗ f(X(τ)) dτ − f* ≤ V(X(0))/t.

The proof is complete after setting t ≈ k/β and recalling that f is smooth.

Proof of Theorem 6.3.2: Discrete time. The condition of smoothness applied to y = x − η∇f(x)
results in

    f(y) ≤ f(x) + ∇f(x)ᵀ(y − x) + (β/2)‖y − x‖₂²
        = f(x) + ∇f(x)ᵀ(x − η∇f(x) − x) + (β/2)‖x − η∇f(x) − x‖₂²
        = f(x) − η‖∇f(x)‖₂² + (βη²/2)‖∇f(x)‖₂²
        = f(x) − (1 − βη/2) η‖∇f(x)‖₂².

As η ≤ 1/β, one derives 1 − βη/2 ≥ 1/2, and

    f(y) ≤ f(x) − (η/2)‖∇f(x)‖₂².   (6.12)
2
Note that Eq. (6.12) does not require convexity of the function; however, if the function is
convex one derives

    f(x*) ≥ f(x) + ∇f(x)ᵀ(x* − x),

by choosing y = x*. Plugging the last inequality into the smoothness inequality, one derives
for y = x − η∇f(x):

    f(y) − f(x*) ≤ ∇f(x)ᵀ(x − x*) − (η/2)‖∇f(x)‖₂²
               = (1/(2η)) ( ‖x − x*‖₂² − ‖x − η∇f(x) − x*‖₂² )
               = (1/(2η)) ( ‖x − x*‖₂² − ‖y − x*‖₂² ).

Summing over the iterations, the right hand side telescopes:

    Σ_{j≤k} ( f(x_{j+1}) − f(x*) ) ≤ (1/(2η)) Σ_{j≤k} ( ‖x_j − x*‖₂² − ‖x_{j+1} − x*‖₂² )
        = (1/(2η)) ( ‖x₁ − x*‖₂² − ‖x_{k+1} − x*‖₂² ) ≤ R₂²/(2η) = βR₂²/2,

where R₂² ≥ ‖x₁ − x*‖₂² and the step-size is η = 1/β. Finally,

    min_j f(x_j) − f(x*) ≤ f(x̄) − f(x*) ≤ βR₂²/(2k),

where x̄ = (1/k)Σ_{j≤k} x_j.
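The bound (6.10) can be observed directly on a simple quadratic; a sketch assuming numpy, with an illustrative diagonal A (so that β is the largest eigenvalue and x* = 0):

    import numpy as np

    A = np.diag([1.0, 10.0])                 # f(x) = x'Ax/2, beta = 10, x* = 0
    f = lambda x: 0.5 * x @ A @ x
    eta = 1.0 / 10.0                         # fixed step-size eta = 1/beta
    x1 = np.array([5.0, 5.0])
    x = x1.copy()
    for k in range(1, 201):
        x = x - eta * (A @ x)                # GD step
        bound = np.sum(x1**2) / (2 * eta * k)  # right side of Eq. (6.10), f(x*) = 0
        assert f(x) <= bound                 # the 1/k bound holds along the run
    print(f(x), bound)                       # actual gap decays well below the bound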

One obviously would like to choose the step size in GD which results in the fastest
convergence. However, this problem – of choosing the best, or simply a good, step size – is hard
and remains open. The statement also means that finding a good stopping criterion for the
iterations is hard as well. Here are practical/empirical strategies for choosing the step size
in GD:
• Exact line search. Choose η_k so that

    η_k = arg min_η { f(x_k − η∇f(x_k)) }.

• Backtracking line search. Choose the step size $\eta_k$ so that
$$f(x_k - \eta_k\nabla f(x_k)) \le f(x_k) - \frac{\eta_k}{2}\|\nabla f(x_k)\|_2^2.$$
As the difference between the right-hand side and the left-hand side of the inequality above is monotone in $\eta_k$, one can start with some $\eta$ and then update $\eta \to b\eta$, $0 < b < 1$ (see the sketch after this list).

• Polyak's step-size rule. If the optimal value $f^*$ of the function is known, one can suggest a better step-size policy. Minimization of the right-hand side of
$$\|x_{k+1} - x^*\|_2^2 \le \|x_k - x^*\|_2^2 - 2\eta_k(f(x_k) - f(x^*)) + \eta_k^2\|g_k\|_2^2 \to \min_{\eta_k}$$
results in Polyak's rule, $\eta_k = (f(x_k) - f(x^*))/\|g_k\|_2^2$, which is known to be useful, in particular, for solving an underdetermined system of linear equations, $Ax = b$.
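A minimal sketch of the backtracking rule referenced in the list above (the function name, test objective and shrink factor $b = 0.5$ are our own illustrative choices):

```python
import numpy as np

def backtracking_step(f, grad_f, x, eta0=1.0, b=0.5):
    """Shrink eta until the sufficient-decrease condition
    f(x - eta*g) <= f(x) - (eta/2)*||g||^2 holds, then take the step."""
    g = grad_f(x)
    eta = eta0
    while f(x - eta * g) > f(x) - 0.5 * eta * np.dot(g, g):
        eta *= b                      # update eta -> b*eta, 0 < b < 1
    return x - eta * g, eta

# Example: one step on f(x) = 0.5*||x||^2
f = lambda x: 0.5 * np.dot(x, x)
grad_f = lambda x: x
x_new, eta = backtracking_step(f, grad_f, np.array([3.0, -4.0]))
print(x_new, eta)
```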
Exercise 6.3.1. Recall that GD minimizes the convex quadratic upper bound $h_{\eta_k}(x)$ of $f(x)$. Consider a modified GD, where the step size is $\eta = (2 + \varepsilon)/\beta$, with $\varepsilon$ chosen positive. (Notice that the step size used in the conditions of Theorem 6.3.2 was $\eta \le 1/\beta$.) Derive a modified version of Eq. (6.10). Can one find a quadratic convex function for which the modified algorithm fails to converge?
Exercise 6.3.2 (not graded - difficult). Consider minimization of the following (non-convex) function $f$:
$$f(x) \to \min, \quad \text{s.t.: } \|x - x^*\| \le \varepsilon, \quad x \in \mathbb{R}^n,$$
where $x^*$ is the global and unique minimum of the $\beta$-smooth function $f$. Moreover, let
$$\forall x \in \mathbb{R}^n : \quad \frac{1}{2}\|\nabla f(x)\|_2^2 \ge \mu(f(x) - f(x^*)).$$
Is it true that for some small $\varepsilon > 0$ the GD with a step size $\eta_k = 1/\beta$ converges to the optimum? How does $\varepsilon$ depend on $\beta$ and $\mu$?

Exercise 6.3.3 (not graded - difficult). In many optimization problems it is often the case that the exact value of the gradient is polluted, i.e. only its noisy version is observed. In this case one may consider the following "inexact oracle" optimization: $f(x) \to \min$, $x \in \mathbb{R}^n$, assuming that for any $x$ one can compute $\hat f(x)$ and $\widehat{\nabla f}(x)$ so that
$$\forall x : \quad |f(x) - \hat f(x)| \le \delta, \quad \text{and} \quad \|\nabla f(x) - \widehat{\nabla f}(x)\|_2 \le \varepsilon,$$
and seek an algorithm to solve it. Propose and analyze a modification of GD solving the "inexact oracle" optimization.

Gradient Descent in $\ell_p$. GD in the $\ell_p$ norm,
$$x_{k+1} = \arg\min_{x\in S\subset\mathbb{R}^n}\left\{f(x_k) + \nabla f(x_k)^\top(x - x_k) + \frac{1}{2\eta_k}\|x - x_k\|_p^2\right\},$$
where $\eta_k \le 1/\beta_p$, $\beta_p \ge \sup_x\|g(x)\|_p$, with a properly chosen $p$ can converge much faster than in $\ell_2$. GD in $\ell_1$ is particularly popular.

Exercise 6.3.4. Restate and prove the discrete time version of Theorem 6.3.2 for GD in the $\ell_p$ norm. (Hint: Consider the following Lyapunov function: $\|x - x^*\|_p^2$.)

Gradient Descent for Strongly Convex, Smooth Functions.

Theorem 6.3.4. GD for a $\mu$-strongly convex, $\beta$-smooth function $f$ and a fixed step-size policy,
$$x_{k+1} = x_k - \eta\nabla f(x_k), \quad \eta = 1/\beta,$$
converges to the optimal solution as
$$f(x_{k+1}) - f(x^*) \le c^k(f(x_1) - f(x^*)),$$
where $c \le 1 - \mu/\beta$.

Exercise 6.3.5. (not graded) Extend the proof of Theorem 6.3.2 to Theorem 6.3.4.

Fast Gradient Descent. GD is simple and efficient in practice. However, it may also be slow if the gradient is small. It may also oscillate about the point of optimality if the gradient points in a direction with a small projection on the optimal direction (pointing at the optimum). The following two modifications of the GD algorithm were introduced to cure these problems:

(1964) Polyak's heavy-ball rule:
$$x_{k+1} = x_k - \eta_k\nabla f(x_k) + \mu_k(x_k - x_{k-1}); \qquad (6.13)$$
(1983) Nesterov Fast Gradient Method (FGM):
$$x_{k+1} = x_k - \eta_k\nabla f(x_k + \mu_k(x_k - x_{k-1})) + \mu_k(x_k - x_{k-1}). \qquad (6.14)$$

The last term in Eqs. (6.13,6.14) is called the "momentum" or "inertia" term, to emphasize the relation to the respective phenomena in classical mechanics. The inertia term, added to the original GD term (which may be associated with "damping" or "friction"), aims to force the hypothetical "ball" to roll towards the optimum faster. In spite of their seemingly minor difference, the convergence rates of the FGM and of the heavy-ball method differ rather dramatically, as the heavy ball can lead to an overshoot (not enough "friction").

Exercise 6.3.6. (not graded) Construct a convex function $f$ with a piece-wise linear gradient such that the heavy-ball algorithm (6.13) with some fixed $\mu$ and $\eta$ fails to converge.

Consider a slightly modified (less general, two-step recurrence) version of the FGM (6.14):
$$x_k = y_{k-1} - \eta\nabla f(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1}), \qquad (6.15)$$
which can be re-stated in continuous time as follows:
$$\ddot X(t) + \frac{3}{t}\dot X(t) + \nabla f(X) = 0. \qquad (6.16)$$
Indeed, assuming $t \approx k\sqrt{\eta}$ and re-scaling, one derives from Eq. (6.15)
$$\frac{x_{k+1} - x_k}{\sqrt{\eta}} = \frac{k-1}{k+2}\,\frac{x_k - x_{k-1}}{\sqrt{\eta}} - \sqrt{\eta}\,\nabla f(y_k). \qquad (6.17)$$
Let $x_k \approx X(k\sqrt{\eta})$; then
$$X(t) \approx x_{t/\sqrt{\eta}} = x_k, \qquad X(t + \sqrt{\eta}) \approx x_{(t+\sqrt{\eta})/\sqrt{\eta}} = x_{k+1},$$
and utilizing the Taylor expansion,
$$\frac{x_{k+1} - x_k}{\sqrt{\eta}} = \dot X(t) + \frac{1}{2}\ddot X(t)\sqrt{\eta} + o(\sqrt{\eta}), \qquad \frac{x_k - x_{k-1}}{\sqrt{\eta}} = \dot X(t) - \frac{1}{2}\ddot X(t)\sqrt{\eta} + o(\sqrt{\eta}),$$
one arrives at
$$\dot X(t) + \frac{1}{2}\ddot X(t)\sqrt{\eta} + o(\sqrt{\eta}) = \left(1 - \frac{3\sqrt{\eta}}{t}\right)\left(\dot X(t) - \frac{1}{2}\ddot X(t)\sqrt{\eta} + o(\sqrt{\eta})\right) - \sqrt{\eta}\,\nabla f(X(t)) + o(\sqrt{\eta}),$$
resulting in Eq. (6.16).
To analyze the convergence rate of the FGM (6.16) we introduce the following Lyapunov function:
$$V(X(t)) = t^2(f(X(t)) - f^*) + 2\|X + t\dot X/2 - x^*\|_2^2.$$
The time derivative of the Lyapunov function is
$$\dot V(X(t)) = 2t(f(X(t)) - f^*) + t^2\nabla f(X(t))^\top\dot X(t) + 4(X(t) + t\dot X(t)/2 - x^*)^\top\left(3\dot X(t)/2 + t\ddot X(t)/2\right).$$
Given that, by Eq. (6.16), $3\dot X/2 + t\ddot X/2 = -t\nabla f(X)/2$, and also utilizing the convexity of $f$, one derives
$$\dot V = 2t(f(X) - f^*) + t^2\nabla f(X)^\top\dot X - 2t(X - x^*)^\top\nabla f(X) - t^2\nabla f(X)^\top\dot X = 2t(f(X) - f^*) - 2t(X - x^*)^\top\nabla f(X) \le 0.$$
Making use of the monotonicity of $V$ and of the non-negativity of $\|X + t\dot X/2 - x^*\|$, one finds
$$f(X(t)) - f^* \le \frac{V(t)}{t^2} \le \frac{V(0)}{t^2} = \frac{2\|x_0 - x^*\|_2^2}{t^2}.$$
Finally, substituting $t \approx k\sqrt{\eta}$, one derives
$$f(x_k) - f^* \le \frac{2\|x_0 - x^*\|_2^2}{\eta k^2}, \quad \eta \le 1/\beta.$$
We have just sketched a proof of the following statement.

Theorem 6.3.5. Fast GD for $f(x) \to \min_{x\in\mathbb{R}^n}$, where $f(x)$ is a $\beta$-smooth convex function, with the update rule
$$x_k = y_{k-1} - \eta\nabla f(y_{k-1}), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1}),$$
converges to the optimum as
$$f(x_{k+1}) - f^* \le \frac{2\|x_0 - x^*\|_2^2}{\eta k^2}.$$
As always, turning the continuous time sketch of the proof into the actual (discrete
time) proof takes some additional technical efforts.
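To make the two-step recurrence concrete, here is a minimal sketch (assuming NumPy; the ill-conditioned quadratic and all parameters are our illustrative choices) that runs Eq. (6.15) next to plain GD:

```python
import numpy as np

A = np.diag([1.0, 100.0])            # ill-conditioned quadratic; beta = 100
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
eta = 1.0 / 100.0                    # eta <= 1/beta

x = np.array([1.0, 1.0]); y = x.copy()   # FGM state: x_{k-1} and y_{k-1}
z = x.copy()                              # plain GD iterate, for comparison
for k in range(1, 101):
    x_new = y - eta * grad(y)             # x_k = y_{k-1} - eta*grad f(y_{k-1})
    y = x_new + (k - 1.0) / (k + 2.0) * (x_new - x)   # y_k of Eq. (6.15)
    x = x_new
    z = z - eta * grad(z)                 # plain GD step
print("FGM gap:", f(x), " plain GD gap:", f(z))
```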

Exercise 6.3.7. (not graded) Consider the following differential equation,
$$\ddot X(t) + \frac{r}{t}\dot X(t) + \nabla f(X) = 0,$$
at some positive $r$. Derive the respective discrete time algorithm, analyze its convergence, and show that if $r \le 2$, the convergence rate of the algorithm is $O(1/k^2)$.

Exercise 6.3.8. (not graded) Show that the FGM method described by Eq. (6.15) transitions to Eq. (6.14) at some $\eta_k$.

Non-Smooth Problems

Sub-Gradient Method. We start the discussion of Sub-Gradient (SG) methods with the simplest, and arguably most popular, SG algorithm:
$$x_{k+1} = x_k - \eta_k g_k, \quad g_k \in \partial f(x_k), \qquad (6.18)$$
which is just the original GD with the gradient replaced by a sub-gradient, to deal with non-smooth $f$. Note, however, that it is not proper to call the algorithm (6.18) SG descent, because in the process of iterations $f(x_{k+1})$ may become larger than $f(x_k)$. To fix the problem one may keep track of the best point, or substitute the result by an average of the points seen in the iterations so far (a finite horizon portion of the past). For example, one may augment Eq. (6.18) at each $k$ with
$$f_{\text{best}}^{(k)} = \min\{f_{\text{best}}^{(k-1)}, f(x_k)\}.$$

We assume that the SG of $f(x)$ is bounded, that is,
$$\forall x : \quad \|g(x)\| \le L, \quad g(x) \in \partial f(x).$$
This condition follows, for example, from the Lipschitz condition, $|f(x) - f(y)| \le L\|x - y\|$, imposed on $f$. Let $x^*$ be the optimal point of $f(x) \to \min_{x\in\mathbb{R}^n}$; then
$$\|x_{k+1} - x^*\|_2^2 = \|x_k - \eta_k g_k - x^*\|_2^2 = \|x_k - x^*\|_2^2 - 2\eta_k g_k^\top(x_k - x^*) + \eta_k^2\|g_k\|_2^2 \le \|x_k - x^*\|_2^2 - 2\eta_k(f(x_k) - f(x^*)) + \eta_k^2\|g_k\|_2^2, \qquad (6.19)$$
where the last inequality is due to the convexity of $f$, i.e. $f(x^*) \ge f(x_k) + g_k^\top(x^* - x_k)$. Applying the inequality (6.19) recursively,
$$\|x_{k+1} - x^*\|_2^2 \le \|x_1 - x^*\|_2^2 - 2\sum_{j\le k}\eta_j(f(x_j) - f(x^*)) + \sum_{j\le k}\eta_j^2\|g_j\|_2^2,$$
one derives
$$2\Big(\sum_{j\le k}\eta_j\Big)\left(f_{\text{best}}^{(k)} - f(x^*)\right) \le 2\sum_{j\le k}\eta_j(f(x_j) - f(x^*)) \le \|x_1 - x^*\|_2^2 + \sum_{j\le k}\eta_j^2\|g_j\|_2^2,$$
which becomes
$$f_{\text{best}}^{(k)} - f(x^*) = \min_{j\le k}f(x_j) - f^* \le \frac{\|x_1 - x^*\|_2^2 + L_2^2\sum_{j\le k}\eta_j^2}{2\sum_{j\le k}\eta_j},$$

where we assume that the SG of $f$ is bounded by $L_2$ in the $\ell_2$ norm. Therefore, if $R_2^2 \ge \|x_1 - x^*\|_2^2$, one arrives at
$$\min_{j\le k}f(x_j) - f^* \le \min_\eta\frac{R_2^2 + L_2^2\sum_{j\le k}\eta_j^2}{2\sum_{j\le k}\eta_j} = \frac{RL}{\sqrt{k}}, \qquad (6.20)$$
where the step size is $\eta_k = R/(L\sqrt{k})$. Note that the $\sim 1/\sqrt{k}$ scaling in Eq. (6.20) is much worse than the one we got above, $\sim 1/k^2$, for smooth functions. In the following we discuss this result in more detail and suggest a number of ways to improve the convergence.
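A minimal sketch of the sub-gradient iteration (6.18) with the step size $\eta_k = R/(L\sqrt{k})$ derived above, applied to the non-smooth $f(x) = \|x\|_1$ (our illustrative choice; $\mathrm{sign}(x)$ is a valid element of the sub-differential there):

```python
import numpy as np

f = lambda x: np.sum(np.abs(x))       # non-smooth convex; f* = 0 at x* = 0
subgrad = lambda x: np.sign(x)        # a valid sub-gradient of ||x||_1

x = np.array([2.0, -3.0, 1.0])        # x_1, the starting point
R = np.linalg.norm(x)                 # R >= ||x_1 - x*||_2
L = np.sqrt(len(x))                   # ||sign(x)||_2 <= sqrt(n) bounds the SG
f_best = f(x)
for k in range(1, 1001):
    eta = R / (L * np.sqrt(k))        # step size eta_k = R/(L*sqrt(k))
    x = x - eta * subgrad(x)          # SG step, Eq. (6.18)
    f_best = min(f_best, f(x))        # keep track of the best point
print("f_best after 1000 iterations:", f_best)   # decays ~ R*L/sqrt(k)
```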

Proximal Gradient Method. In multiple machine learning (and, more generally, statistics) applications we deal with a function built as a sum over samples. Inspired by this application, consider the following composite optimization:
$$f(x) = g(x) + h(x) \to \min_{x\in\mathbb{R}^n}, \qquad (6.21)$$
where we assume that $g : \mathbb{R}^n \to \mathbb{R}$ is a convex and smooth function on $\mathbb{R}^n$, and $h : \mathbb{R}^n \to \mathbb{R}$ is a closed, convex and possibly non-smooth function on $\mathbb{R}^n$. One of the most frequently used composite optimizations is the Lasso minimization:
$$f(x) = \|Ax - b\|_2^2 + \lambda\|x\|_1 \to \min_{x\in\mathbb{R}^n}. \qquad (6.22)$$
Notice that the $\|x\|_1$ term is not smooth at $x = 0$.


Let us now introduce the so-called proximal operator,
$$\text{prox}_h(x) = \arg\min_{u\in\mathbb{R}^n}\left\{h(u) + \frac{1}{2}\|u - x\|_2^2\right\},$$
which will soon be linked to the composite optimization. Standard examples of the proximal operator/function are:

1. $h(x) = I_C(x)$, that is $h(x)$ is the indicator of a convex set $C$. Then the proximal function,
$$\text{prox}_h(x) = \arg\min_{u\in C}\|x - u\|_2^2,$$
is the projection of $x$ onto $C$.

2. $h(x) = \lambda\|x\|_1$; then the proximal function acts as a soft threshold:
$$\text{prox}_h(x)_i = \begin{cases} x_i - \lambda, & x_i \ge \lambda, \\ x_i + \lambda, & x_i \le -\lambda, \\ 0, & \text{otherwise.} \end{cases}$$
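Both examples above are one-liners in code. A minimal sketch (assuming NumPy; the function names are our own):

```python
import numpy as np

def prox_l1(x, lam):
    """Soft threshold: the prox of h(x) = lam * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_indicator_ball(x, radius=1.0):
    """Euclidean projection onto C = {u : ||u||_2 <= radius},
    i.e. the prox of the indicator I_C."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

print(prox_l1(np.array([2.0, -0.3, 0.7]), lam=0.5))   # -> [1.5, 0., 0.2]
print(prox_indicator_ball(np.array([3.0, 4.0])))       # -> [0.6, 0.8]
```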

The examples suggest using the proximal operator to smooth out non-smooth functions entering the respective optimizations. Having this use of the proximal operator in mind, we introduce the Proximal Gradient Descent (PGD) algorithm:
$$x_{k+1} = \text{prox}_{\eta_k h}(x_k - \eta_k\nabla g(x_k)) = \arg\min_u\left\{\frac{1}{2}\|x_k - \eta_k\nabla g(x_k) - u\|_2^2 + \eta_k h(u)\right\} = \arg\min_u\left\{g(x_k) + \nabla g(x_k)^\top(u - x_k) + \frac{1}{2\eta_k}\|u - x_k\|_2^2 + h(u)\right\},$$
where $\eta_k \le 1/\beta$ and $g$ is a $\beta$-smooth function in the $\ell_2$ norm.
Note that, as in the case of the GD algorithm, at each step of the PGD we minimize a convex upper bound of the objective function. We find that the PGD algorithm has the same convergence rate (measured in the number of iterations) as the GD algorithm. Finally, we are ready to connect the PGD algorithm to the composite optimization (6.21).
Finally, we are ready to connect PGD algorithm to the composite optimization (6.21).

Theorem 6.3.6. The PGD algorithm,
$$x_{k+1} = \text{prox}_{\eta h}(x_k - \eta\nabla g(x_k)), \quad \eta \le 1/\beta,$$
with a fixed step-size policy converges to the optimal solution $f^*$ of the composite optimization (6.21) according to
$$f(x_{k+1}) - f^* \le \frac{\|x_0 - x^*\|_2^2}{2\eta k}.$$
The proof of Theorem 6.3.6 repeats the logic used to prove Theorem 6.3.2 for the GD algorithm. Moreover, one can also accelerate the PGD, similarly to how we have accelerated GD. The accelerated version of the PGD is
$$x_k = \text{prox}_{\eta_k h}(y_{k-1} - \eta_k\nabla g(y_{k-1})), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1}).$$
We naturally arrive at the PGD version of Theorem 6.3.5:

Theorem 6.3.7. PGD for the composite convex optimization (6.21), with the update rule
$$x_k = \text{prox}_{\eta h}(y_{k-1} - \eta\nabla g(y_{k-1})), \qquad y_k = x_k + \frac{k-1}{k+2}(x_k - x_{k-1}),$$
converges as
$$f(x_{k+1}) - f^* \le \frac{2\|x_0 - x^*\|_2^2}{\eta k^2},$$
for any $\beta$-smooth convex $g$.
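As an illustration of Theorem 6.3.6, here is a minimal (non-accelerated) proximal-gradient sketch for the Lasso problem (6.22); the random data, $\lambda$ and iteration count are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[0] = 3.0            # sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)
lam = 1.0
beta = 2 * np.linalg.norm(A, 2) ** 2               # smoothness of g(x) = ||Ax-b||^2
eta = 1.0 / beta

soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x = np.zeros(100)
for _ in range(500):
    # gradient step on g, followed by prox_{eta*h} (soft threshold)
    x = soft(x - eta * 2 * A.T @ (A @ x - b), eta * lam)
print("nonzeros in the Lasso solution:", np.sum(np.abs(x) > 1e-8))
```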

PGD is one possible approach developed to deal with non-smooth objectives. Another
sound alternative is discussed next.

Smoothing Out Non-Smooth Objectives

Consider the following min-max optimization,
$$\max_{1\le i\le n}f_i(x) \to \min_{x\in\mathbb{R}^n},$$
which is one of the most common non-smooth optimizations. Recall that a smooth and convex approximation to the maximum function is provided by the soft-max function (6.2), which can then be minimized by the accelerated GD (that has a convergence rate $O(1/\sqrt{\varepsilon})$, in contrast to $1/\varepsilon^2$ for non-smooth functions). An accurate choice of $\lambda$ (the parameter within the soft-max) normally allows one to speed up algorithms to $O(1/\varepsilon)$.

6.4 Constrained First-Order Convex Minimization

Projected Gradient Descent

The Projected Gradient Descent (PGD) is
$$x_{k+1} = \Pi_C(x_k - \eta_k\nabla f(x_k)) = \arg\min_y\left\{f(x_k) + \nabla f(x_k)^\top(y - x_k) + \frac{1}{2\eta_k}\|y - x_k\|_2^2 + I_C(y)\right\} = \text{prox}_{I_C}(x_k - \eta_k\nabla f(x_k)), \qquad (6.23)$$
where $\Pi_C$ is the Euclidean projection onto the convex set $C$, $\Pi_C(y) = \arg\min_{x\in C}\|x - y\|_2^2$. PGD has the same convergence rate as GD. The proof is similar to the one for gradient descent, taking into account that the projection does not lead to an expansion, i.e.
$$\|x_{k+1} - x^*\|_2^2 \le \|x_k - \eta_k\nabla f(x_k) - x^*\|_2^2 \quad \text{as } x^* \in C.$$
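A minimal projected-GD sketch, minimizing a quadratic over the unit Euclidean ball (the objective and all parameters are our illustrative choices):

```python
import numpy as np

# minimize f(x) = 0.5*||x - c||^2 over {||x||_2 <= 1}; for ||c|| > 1 the
# constrained optimum is the radial projection c/||c||.
c = np.array([2.0, 1.0])
grad = lambda x: x - c

def proj_ball(y):                      # Euclidean projection Pi_C
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

x = np.zeros(2)
eta = 0.5                              # f is 1-smooth, so any eta <= 1 works
for _ in range(100):
    x = proj_ball(x - eta * grad(x))   # PGD step, Eq. (6.23)
print(x, c / np.linalg.norm(c))        # both approx [0.894, 0.447]
```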

Exercise 6.4.1. (Alternating Projections.) Consider two convex sets $C, D \subseteq \mathbb{R}^n$ and pose the question of finding $x \in C \cap D$. One starts from $x_0 \in C$ and applies PGD:
$$y_k = \Pi_C(x_k), \qquad x_{k+1} = \Pi_D(y_k).$$
How many iterations are required to guarantee
$$\max\left\{\inf_{x\in C}\|x_k - x\|,\ \inf_{x\in D}\|x_k - x\|\right\} \le \varepsilon?$$

Frank-Wolfe Algorithm (Conditional Gradient)

The Frank-Wolfe algorithm solves the following optimization problem:
$$f(x) \to \min, \quad \text{s.t.: } x \in C. \qquad (6.24)$$
In contrast to the PGD algorithm (6.23), which makes a projection at each iteration, the Frank-Wolfe (FW) algorithm solves the following linear problem on $C$:
$$y_k = \arg\min_{y\in C}y^\top\nabla f(x_k), \qquad x_{k+1} = (1 - \gamma_k)x_k + \gamma_k y_k, \qquad \gamma_k = 2/(k + 1). \qquad (6.25)$$
To illustrate, consider the case when $C$ is a simplex:
$$f(x) \to \min \quad \text{s.t.: } x \in C = \{x : x \ge 0,\ x^\top\mathbf{1} = 1\}.$$
In this case the update $y_k$ of the FW algorithm is the unit vector corresponding to the minimal coordinate of the gradient. The overall time to update $x_k$ is $O(n)$, therefore resulting in a significant acceleration in comparison with the PGD algorithm.
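A minimal FW sketch on the simplex (the random convex quadratic is our illustrative choice), using the duality gap of Eq. (6.26) below as the stopping criterion:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((30, 30))
A = M.T @ M                            # convex quadratic objective
c = rng.standard_normal(30)
grad = lambda x: A @ x + c

n = 30
x = np.ones(n) / n                     # feasible start on the simplex
for k in range(1, 201):
    g = grad(x)
    y = np.zeros(n); y[np.argmin(g)] = 1.0   # linear step: vertex at min gradient coordinate
    gap = g @ (x - y)                  # duality gap, Eq. (6.26)
    if gap < 1e-6:
        break
    gamma = 2.0 / (k + 1)
    x = (1 - gamma) * x + gamma * y    # convex combination stays in the simplex
print("iterations:", k, " duality gap:", gap)
```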
The FW algorithm has an edge over the other algorithms considered so far because it has a reliable stopping criterion. Indeed, convexity of the objective guarantees that
$$f(y) \ge f(x_k) + \nabla f(x_k)^\top(y - x_k);$$
minimizing both sides of the inequality over $y \in C$, one derives that
$$f^* \ge f(x_k) + \min_{y\in C}\nabla f(x_k)^\top(y - x_k),$$
where $f^*$ is the optimal solution of Eq. (6.24), then leading to
$$\max_{y\in C}\nabla f(x_k)^\top(x_k - y) \ge f(x_k) - f^*. \qquad (6.26)$$
The value on the left of the inequality, $\max_{y\in C}\nabla f(x_k)^\top(x_k - y)$, gives us an easy-to-compute stopping criterion.
The following statement characterizes the convergence of the FW algorithm.

Theorem 6.4.1. Given that $f(x)$ in Eq. (6.24) is a convex $\beta$-smooth function and $C$ is a bounded, convex, compact set, Eq. (6.25) converges to the optimal solution, $f^*$, of Eq. (6.24) as
$$f(x_k) - f^* \le \frac{2\beta D^2}{k + 2},$$
where $D^2 \ge \max_{x,y\in C}\|x - y\|_2^2$.

Proof. Convexity of $f$ means that
$$f(x) \ge f(x_k) + \nabla f(x_k)^\top(x - x_k), \quad \forall x \in C.$$
Minimizing both sides of the inequality, one derives
$$f(x^*) \ge f(x_k) + \nabla f(x_k)^\top(y_k - x_k),$$
that is, $f(x_k) - f(x^*) \le \nabla f(x_k)^\top(x_k - y_k)$. This inequality, in combination with the second sub-step of the FW algorithm, $x_{k+1} = \gamma_k y_k + (1 - \gamma_k)x_k$, and the $\beta$-smoothness of $f$, results in the following transformations:
$$\begin{aligned}
f(x_{k+1}) - f(x^*) &\le f(x_k) + \nabla f(x_k)^\top(x_{k+1} - x_k) + \frac{\beta}{2}\|x_{k+1} - x_k\|_2^2 - f(x^*) \\
&= f(x_k) + \gamma_k\nabla f(x_k)^\top(y_k - x_k) + \frac{\beta\gamma_k^2}{2}\|y_k - x_k\|_2^2 - f(x^*) \\
&\le f(x_k) - f(x^*) - \gamma_k(f(x_k) - f(x^*)) + \frac{\beta\gamma_k^2}{2}D^2,
\end{aligned}$$
and finally
$$f(x_{k+1}) - f^* \le (1 - \gamma_k)(f(x_k) - f^*) + \frac{\beta\gamma_k^2 D^2}{2}.$$
Utilizing the inequality in a chain of inductive relations over $k$, starting from $k = 1$, one can show that $f(x_k) - f^* \le 2\beta D^2/(k + 2)$.

The conditional GD is slower than the FGM method in terms of the number of iterations. However, it is often favorable in practice, especially when minimizing a convex function over sufficiently simple objects (like a norm ball or a polytope), as it does not require implementing an explicit projection onto the constraining set.

Primal-Dual Gradient Algorithm

Consider the following smooth convex optimization problem:
$$f(x) \to \min, \quad \text{s.t.: } Ax = b, \quad x \in \mathbb{R}^n.$$
It is good practice to work with the equivalent augmented problem,
$$f(x) + \frac{\rho}{2}\|Ax - b\|_2^2 \to \min, \quad \text{s.t.: } Ax = b,$$
where $\rho > 0$. Let us define the augmented Lagrangian,
$$L(x, \mu) = f(x) + \mu^\top(Ax - b) + \frac{\rho}{2}\|Ax - b\|_2^2.$$
We say that a point (in the extended, augmented space), $(x, \mu)$, is primal-dual optimal iff
$$0 = \nabla_x L(x, \mu) = \nabla f(x) + A^\top\mu + \rho A^\top(Ax - b), \qquad 0 = -\nabla_\mu L(x, \mu) = b - Ax.$$

One can also re-state the primal-dual optimality condition as
$$T(x, \mu) = 0, \qquad T(x, \mu) = \begin{pmatrix} \nabla_x L(x, \mu) \\ -\nabla_\mu L(x, \mu) \end{pmatrix}.$$
The operator/function $T$ is often called the Karush-Kuhn-Tucker (KKT) operator. (We may call $T$ an operator to emphasize that it maps a function, $f(x)$, to another function, $\nabla_x L$.)
We are now ready to state the Primal-Dual Gradient (PDG) algorithm:
$$\begin{pmatrix} x \\ \mu \end{pmatrix}_{k+1} = \begin{pmatrix} x \\ \mu \end{pmatrix}_k - \eta_k T(x_k, \mu_k).$$
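A minimal PDG sketch for a toy equality-constrained problem (the objective, data and step size are our own illustrative choices):

```python
import numpy as np

# minimize f(x) = 0.5*||x||^2  s.t.  Ax = b
A = np.array([[1.0, 1.0]]); b = np.array([1.0])
rho = 1.0

def T(x, mu):
    """KKT operator: (grad_x L, -grad_mu L) of the augmented Lagrangian."""
    r = A @ x - b
    gx = x + A.T @ mu + rho * A.T @ r
    return gx, -r

x, mu = np.zeros(2), np.zeros(1)
eta = 0.2
for _ in range(500):
    gx, gmu = T(x, mu)
    x, mu = x - eta * gx, mu - eta * gmu   # joint primal-dual gradient step
print(x)   # -> approx [0.5, 0.5], the minimizer of 0.5*||x||^2 on {x1 + x2 = 1}
```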

A similar construction works if inequality constraints are added:
$$f(x) \to \min, \quad \text{s.t.: } g_i(x) \le 0, \quad 1 \le i \le m.$$
The augmented problem, accounting for the inequalities, becomes
$$f(x) + \frac{\rho}{2}\sum_{i=1}^m(g_i(x))_+^2 \to \min, \quad \text{s.t.: } g_i(x) \le 0, \quad 1 \le i \le m.$$
The respective augmented Lagrangian is
$$L(x, \lambda) = f(x) + \lambda^\top F(x) + \frac{\rho}{2}\|F(x)\|_2^2,$$
where $F(x)_i = (g_i(x))_+$. We say that the pair $(x, \lambda)$ is primal-dual optimal iff
$$0 = \nabla_x L(x, \lambda) = \nabla f(x) + \sum_{i=1}^m\left(\lambda_i + \rho(g_i(x))_+\right)\nabla(g_i(x))_+, \qquad 0 = -\nabla_\lambda L(x, \lambda) = -F(x).$$

The PDG algorithm accounting for the inequality constraints is
$$\begin{pmatrix} x \\ \lambda \end{pmatrix}_{k+1} = \begin{pmatrix} x \\ \lambda \end{pmatrix}_k - \eta_k T(x_k, \lambda_k).$$
The convergence analysis of the PDG algorithm repeats all the steps involved in the analysis of the original GD. The Lyapunov function here is $V(x, \lambda) = \|x - x^*\|_2^2 + \|\lambda - \lambda^*\|_2^2$.

Exercise 6.4.2. Analyze the convergence of the PDG algorithm for convex optimization with inequality constraints, assuming that all the functions involved (in the objective, $f$, and in the constraints, $g_i$) are convex and $\beta$-smooth.

Mirror Descent Algorithm

Our previous analysis was mostly focused on the case where the objective function $f$ is smooth in the $\ell_2$ norm, and the distance from the starting point, where we initiate the algorithm, to the optimal point is measured in the $\ell_2$ norm as well. From the perspective of GD, optimization over a unit simplex and optimization over a unit Euclidean sphere are equivalent computational-complexity-wise. On the other hand, the volume of the unit simplex is exponentially smaller than the volume of the unit sphere. The Mirror Descent (MD) algorithm allows one to exploit the geometry of the domain, thus providing a faster algorithm for the case of the simplex. The acceleration is up to a $\sim\sqrt{d}$ factor, where $d$ is the dimensionality of the underlying space.
We start with the constrained convex optimization problem:
$$f(x) \to \min, \quad \text{s.t.: } x \in S \subseteq \mathbb{R}^n.$$

Consider in more detail an elementary iteration of the GD algorithm,
$$x_{k+1} = x_k - \eta_k\nabla f(x_k).$$
From the mathematical perspective we sum up objects from different spaces: $x$ belongs to the primal space, while the space where $\nabla f(x)$ resides, called the dual (conjugate) space, may be different. To overcome this "inconsistency", Nemirovski and Yudin proposed in 1978 the following algorithm:
$$\begin{aligned}
y_k &= \nabla\phi(x_k), &&\text{map the point to a point in the dual space,} \\
y_{k+1} &= y_k - \eta_k\nabla f(x_k), &&\text{update the point in the dual space,} \\
\bar x_{k+1} &= (\nabla\phi)^{-1}(y_{k+1}) = \nabla\phi^*(y_{k+1}), &&\text{map the point back to the primal space,} \\
x_{k+1} &= \Pi_C^{D_\phi}(\bar x_{k+1}) = \arg\min_{x\in C}D_\phi(x, \bar x_{k+1}), &&\text{project the point onto the feasible set,}
\end{aligned}$$
where $\phi(x)$ is a strongly convex function defined on $\mathbb{R}^n$ with $\nabla\phi(\mathbb{R}^n) = \mathbb{R}^n$, and $\phi^*(y) = \sup_{x\in\mathbb{R}^n}(y^\top x - \phi(x))$ is the Legendre-Fenchel (LF) transform (conjugate function) of $\phi(x)$. The function $\phi$ is also called the mirror map function.
$$D_\phi(u, v) = \phi(u) - \phi(v) - \nabla\phi(v)^\top(u - v)$$
is the so-called Bregman divergence, which measures (for a strictly convex function $\phi$) the distance between $\phi(u)$ and its linear approximation $\phi(v) + \nabla\phi(v)^\top(u - v)$ evaluated at $v$.

Exercise 6.4.3. Let φ(x) be a strongly convex function on Rn . Using the definition of the
conjugate function prove that ∇φ∗ (∇φ(x)) = x, where φ∗ is a conjugate function to φ.

The Bregman divergence has a number of attractive properties:

• Non-negativity. $D_\phi(u, v) \ge 0$ for any convex function $\phi$.

• Convexity in the first argument. The Bregman divergence $D_\phi(u, v)$ is convex in its first argument. (Notice that it is not necessarily convex in the second argument.)

• Linearity with respect to non-negative coefficients. In other words, for any strictly convex $\phi$ and $\psi$, and non-negative $\lambda$, $\mu$, we observe:
$$D_{\lambda\phi+\mu\psi}(u, v) = \lambda D_\phi(u, v) + \mu D_\psi(u, v).$$

• Duality. Let the function $\phi$ have a convex conjugate $\phi^*$; then
$$D_{\phi^*}(u^*, v^*) = D_\phi(v, u), \quad \text{with } u^* = \nabla\phi(u) \text{ and } v^* = \nabla\phi(v).$$

Examples of the Bregman divergence are:

• Euclidean norm. Let $\phi = \|x\|_2^2$; then $D_\phi(x, y) = \|x\|_2^2 - \|y\|_2^2 - 2y^\top(x - y) = \|x - y\|_2^2$.

• Negative entropy. $\phi(x) = \sum_{i=1}^n x_i\ln x_i$, $\phi : \mathbb{R}^n_{++} \to \mathbb{R}$. Then
$$D_\phi(x, y) = \sum_{i=1}^n x_i\ln(x_i/y_i) - \sum_{i=1}^n x_i + \sum_{i=1}^n y_i = D_{KL}(x\|y),$$
where $D_{KL}(x\|y)$ is the so-called Kullback-Leibler (KL) divergence.

• Lower and upper bounds. Let $\phi$ be a $\mu$-strongly convex function with respect to a norm $\|\cdot\|$; then $D_\phi(x, y) \ge \frac{\mu}{2}\|x - y\|^2$. If $\phi$ is $\beta$-smooth, then $D_\phi(x, y) \le \frac{\beta}{2}\|x - y\|^2$.

The following statement represents an important fact which will be used below to analyze the MD algorithm.

Theorem 6.4.2 (Pinsker Inequality). For any $x, y$ such that $\sum_{i=1}^n x_i = \sum_{i=1}^n y_i = 1$, $x \ge 0$, $y \ge 0$, one gets the following KL divergence estimate: $D_{KL}(x\|y) \ge \frac{1}{2}\|x - y\|_1^2$.

An immediate corollary of the Theorem is that $\phi(x) = \sum_{i=1}^n x_i\ln x_i$ is 1-strongly convex in the $\ell_1$ norm:
$$\phi(y) = \phi(x) + \nabla\phi(x)^\top(y - x) + D_{KL}(y\|x) \ge \phi(x) + \nabla\phi(x)^\top(y - x) + \frac{1}{2}\|x - y\|_1^2.$$
The proximal form of the MD algorithm is
$$x_{k+1} = \Pi_C^{D_\phi}\left(\arg\min_{x\in\mathbb{R}^n}\left\{f(x_k) + \nabla f(x_k)^\top(x - x_k) + \frac{1}{\eta_k}D_\phi(x, x_k)\right\}\right),$$
where $\Pi_S^{D_\phi}(y) = \arg\min_{x\in S}D_\phi(x, y)$.

Example 6.4.4. Consider the following optimization problem over the unit simplex:
$$f(x) \to \min_{x\in\mathbb{R}^n}, \quad \text{s.t.: } x \in S = \{x : x^\top\mathbf{1} = 1,\ x \in \mathbb{R}^n_{++}\}.$$
Let the distance generating function $\phi(x)$ be the negative entropy, $\phi(x) = \sum_{i=1}^n x_i\ln x_i$. Then the MD algorithm update becomes
$$x_{k+1} = \Pi_S^{D_\phi}\left(\arg\min_x\left\{f(x_k) + \nabla f(x_k)^\top(x - x_k) + \frac{1}{\eta_k}D_\phi(x, x_k)\right\}\right),$$
where $D_\phi(x, y) = \sum_{i=1}^n x_i\ln(x_i/y_i) - (x_i - y_i)$. The inner $\arg\min$, $y$, satisfies
$$\nabla\phi(y) = \nabla\phi(x_k) - \eta_k\nabla f(x_k), \quad \text{that is} \quad y_i = (x_k)_i\exp(-\eta_k\nabla f(x_k)_i).$$
One observes that the Bregman projection onto the simplex is a renormalization, $\Pi_S^{D_\phi}(y) = y/\|y\|_1$. This results in the following expression for the MD update:
$$(x_{k+1})_i = \frac{(x_k)_i\exp(-\eta_k\nabla f(x_k)_i)}{\sum_{j=1}^n(x_k)_j\exp(-\eta_k\nabla f(x_k)_j)}.$$
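A minimal sketch of this entropic MD (multiplicative, "exponentiated gradient") update on the simplex; the linear objective is our own illustrative choice:

```python
import numpy as np

# minimize f(x) = c^T x over the probability simplex; the optimum puts all
# mass on the smallest coordinate of c.
c = np.array([0.3, 0.1, 0.7, 0.5])
grad = lambda x: c

x = np.ones(4) / 4          # uniform start, strictly inside the simplex
eta = 0.5
for _ in range(200):
    y = x * np.exp(-eta * grad(x))   # multiplicative (dual-space) update
    x = y / y.sum()                  # Bregman projection = renormalization
print(x)   # -> mass concentrates on index 1, the argmin of c
```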

Let us sketch the continuous time analysis of the MD algorithm in the case of $\beta$-smooth convex functions. In contrast with the GD analysis, it is more appropriate to work in this case with a Lyapunov function in the dual space:
$$V(Z(t)) = D_{\phi^*}(Z(t), z^*), \qquad Z(t) = \nabla\phi(X(t)),$$
where $\phi$ is a strongly convex distance generating function. According to the definition of the Bregman divergence, one derives
$$\frac{d}{dt}V(Z(t)) = \frac{d}{dt}D_{\phi^*}(Z(t), z^*) = \frac{d}{dt}\left\{\phi^*(Z(t)) - \phi^*(z^*) - \nabla\phi^*(z^*)^\top(Z(t) - z^*)\right\} = \left(\nabla\phi^*(Z(t)) - \nabla\phi^*(z^*)\right)^\top\dot Z(t) = (X(t) - x^*)^\top\dot Z(t).$$
Given that $\dot Z(t) = -\nabla f(X)$, one derives
$$\frac{d}{dt}V(Z(t)) = -\nabla f(X(t))^\top(X(t) - x^*) \le -(f(X(t)) - f^*).$$
Integrating both sides of the inequality, one arrives at
$$V(Z(0)) - V(Z(t)) \ge \int_0^t f(X(\tau))\,d\tau - tf^* \ge t\left(f\left(\frac{1}{t}\int_0^t X(\tau)\,d\tau\right) - f^*\right),$$
where the last transformation is due to the Jensen inequality. Therefore, similarly to the case of GD, the convergence rate of the MD algorithm is $O(1/k)$. The resulting MD ODE is
$$X(t) = \nabla\phi^*(Z(t)), \qquad \dot Z(t) = -\nabla f(X(t)), \qquad X(0) = x_0, \quad Z(0) = z_0 \text{ with } \nabla\phi^*(z_0) = x_0.$$
The behavior of MD, when applied to a non-smooth convex function, repeats that of GD: the convergence rate is $O(1/\sqrt{k})$ in this case.
Chapter 7

Optimal Control and Dynamic Programming

The optimal control problem shall be considered as a special case of a general variational calculus problem, where the (vector) fields evolve in time, i.e. reside in a one-dimensional real space equipped with a direction, and are constrained by a system of ODEs, possibly with algebraic constraints added too. We will learn how to analyze these problems by the methods of the variational calculus from Section 5, using optimization approaches, e.g. convex analysis and duality, described in Section 6.1, and also adding to the arsenal of tools a new one, called "Dynamic Programming" (DP), in Section 7.4.
Let us start with an illustrative (as sufficiently simple) optimal control problem.

Example 7.0.1. Consider the trajectory of a particle in one dimension, $\{q(\tau) : [0, t] \to \mathbb{R}\}$, which is subject to control $\{u(\tau) : [0, t] \to \mathbb{R}\}$. Solve the following constrained problem of the variational calculus type:
$$\min_{\{u(\tau), q(\tau)\}}\int_0^t d\tau\,(q(\tau))^2, \quad \text{s.t.: } \forall\tau \in (0, t] : \dot q(\tau) = u(\tau), \quad u(\tau) \le 1, \qquad (7.1)$$
where $t > 0$ and the initial position, $q(0) = q_0$, are known (fixed).
Solution:
If $q_0 > 0$, one can guess the optimal solution right away: jump to $q = 0$ immediately (at $\tau = 0^+$) and then stay at zero. To justify the solution, one first drops all the constraints in Eq. (7.1), observes that the minimal solution of the unconstrained problem is $q(\tau) = u(\tau) = 0$ for $\tau \in (0, t]$, and then verifies that the dropped constraints are satisfied. (Notice that the resulting discontinuity of the optimal $q(\tau)$ at $\tau = 0$ is not a problem, as continuity was not required in the problem formulation.)

The analysis in the case of $q_0 \le 0$ is more elaborate. Let us exclude the control variable, turning the pair of constraints in Eq. (7.1) into one: $\forall\tau : \dot q \le 1$. Then, following the logic of Section 6, we introduce the Lagrangian function,
$$L(q(\tau), \mu(\tau)) = q^2 + \mu(\dot q - 1),$$
and then write the KKT conditions, extended from the world of finite dimensional optimization discussed in the previous section to the world of infinite dimensional (variational calculus) optimization. Specifically, the four KKT conditions are:

1. KKT-1: Primal Feasibility: $\dot q(\tau) \le 1$ for $\tau \in (0, t]$.

2. KKT-2: Dual Feasibility: $\mu(\tau) \ge 0$ for $\tau \in (0, t]$.

3. KKT-3: Stationary point in primal variables, which is simply the Euler-Lagrange condition of the variational calculus: $2q = \dot\mu$ for $\tau \in (0, t]$.

4. KKT-4: Complementary Slackness: $\mu(\tau)(\dot q(\tau) - 1) = 0$ for $\tau \in (0, t]$.

We find that
$$q(\tau) = \tau + q_0, \qquad \mu(\tau) = \tau^2 + 2q_0\tau + c, \qquad (7.2)$$
where $c$ is a constant, satisfy both the KKT conditions and the initial condition, $q(0) = q_0$. Can we have another solution, different from Eqs. (7.2) but satisfying the KKT conditions? How about a discontinuous control? Consider the following probe functions, bringing $q$ to zero first with the maximal allowed control, and then switching off the control:
$$q(\tau) = \begin{cases} q_0 + \tau, & 0 < \tau \le -q_0, \\ 0, & -q_0 < \tau \le t, \end{cases} \qquad \mu(\tau) = \begin{cases} \tau^2 + 2q_0\tau + q_0^2, & 0 < \tau \le -q_0, \\ 0, & -q_0 < \tau \le t. \end{cases} \qquad (7.3)$$
We observe that, indeed, in the regime where the probe function is well defined, i.e. $0 < -q_0 < t$, Eqs. (7.3) solve the KKT conditions, therefore providing an alternative to the solution (7.2). Comparing the objectives in Eq. (7.1) for the two alternatives, one finds that at $0 < -q_0 < t$ the solution (7.3) is optimal, while the solution (7.2) is optimal if $t < -q_0$.

Exercise 7.0.2. Solve Example 7.0.1 with the condition u ≤ 1 replaced by |u| ≤ 1.

7.1 Linear Quadratic (LQ) Control via Calculus of Variations

Consider a $d$-dimensional real vector representing the evolution of the system state in time, $\{q(\tau) \in \mathbb{R}^d\,|\,\tau \in [0, t]\}$, governed by the following system of linear ODEs:
$$\forall\tau \in (0, t] : \quad \dot q(\tau) = Aq(\tau) + Bu(\tau), \qquad q(0) = q_0, \qquad (7.4)$$
where $A$ and $B$ are constant (time independent), square, nonsingular (invertible) and possibly asymmetric (thus $A \ne A^T$ and $B \ne B^T$) real matrices, $A, B \in \mathbb{R}^{d\times d}$, and $\{u(\tau) \in \mathbb{R}^d\,|\,\tau \in [0, t]\}$ is a time-dependent control vector of the same dimensionality as $q$. Introduce a combined action, often called cost-to-go:
$$S\{q(\tau), u(\tau)\} \doteq S_{eff}\{u(\tau)\} + S_{des}\{q(\tau)\} + S_{fin}(q(t)), \qquad (7.5)$$
$$S_{eff}\{u(\tau)\} \doteq \frac{1}{2}\int_0^t d\tau\,u^T(\tau)Ru(\tau), \qquad (7.6)$$
$$S_{des}\{q(\tau)\} \doteq \frac{1}{2}\int_0^t d\tau\,q^T(\tau)Qq(\tau), \qquad (7.7)$$
$$S_{fin}(q(t)) \doteq \frac{1}{2}q^T(t)Q_{fin}q(t), \qquad (7.8)$$
where $S_{eff}$, dependent only on $\{u(\tau)\}$, represents the required control effort; $S_{des}$, dependent only on $\{q(\tau)\}$, expresses the cost of maintaining the desired state of the system proper; and $S_{fin}$, dependent only on $q(t)$, expresses the cost of achieving the final state, $q(t)$. We assume that $R$, $Q$ and $Q_{fin}$ are symmetric real positive definite matrices. We aim to optimize the cost-to-go over $\{q(\tau)\}$ and $\{u(\tau)\}$, constrained by the governing ODEs and the respective initial condition in Eqs. (7.4).
As is custom in the variational calculus with function constraints, let us extend the action (7.5) with a Lagrangian multiplier function associated with the ODE constraints (7.4), and then formulate necessary conditions for optimality, stated as an unconstrained variation of the following effective action:
$$S\{q, u, \lambda\} \doteq S\{q, u\} + \int_0^t d\tau\,\lambda^T(\tau)\left(-\dot q + Aq + Bu\right), \qquad (7.9)$$
where $\{\lambda(\tau)\}$ is the time-dependent vector of the Lagrangian multipliers, also called the adjoint vector. The Euler-Lagrange (EL) equations and the primal feasibility equations, following from variations of the effective action (7.9) over $q$, $u$ and $\lambda$, are
$$\text{Euler-Lagrange:} \quad \frac{\delta S\{q,u,\lambda\}}{\delta q} = 0 : \quad \forall\tau \in (0, t] : \; Qq + \dot\lambda + A^T\lambda = 0, \qquad (7.10)$$
$$\frac{\delta S\{q,u,\lambda\}}{\delta u} = 0 : \quad \forall\tau \in [0, t] : \; Ru + B^T\lambda = 0, \qquad (7.11)$$
$$\text{primal feasibility:} \quad \frac{\delta S\{q,u,\lambda\}}{\delta\lambda} = 0 : \quad \text{Eqs. (7.4).} \qquad (7.12)$$
The equations should also be complemented with the boundary condition,
$$\text{boundary condition at } \tau = t, \quad \frac{\partial S\{q, u, \lambda\}}{\partial q(t)} = 0 : \quad \lambda(t) = Q_{fin}q(t), \qquad (7.13)$$

derived by variation of the effective action over $q$ at the final point, $q(t)$. The simplest way to derive the boundary condition Eq. (7.13) is through discretization: turning temporal integrals into discrete sums, specifically
$$\int_0^t d\tau\,\lambda^T(\tau)\dot q(\tau) \to \lambda^T(\Delta)(q(\Delta) - q(0)) + \cdots + \lambda^T(t)(q(t) - q(t - \Delta)), \qquad (7.14)$$
where $\Delta$ is the discretization step, and then looking for a stationary point over $q(t)$. Observe that Eqs. (7.11) are algebraic, thus allowing one to express the control vector, $u$, via the adjoint vector, $\lambda$:
$$u = -R^{-1}B^T\lambda. \qquad (7.15)$$
Substituting it into Eqs. (7.10,7.12), one arrives at the following joint system of the original and adjoint equations:
$$\begin{pmatrix} \dot q \\ \dot\lambda \end{pmatrix} = \begin{pmatrix} A & -BR^{-1}B^T \\ -Q & -A^T \end{pmatrix}\begin{pmatrix} q \\ \lambda \end{pmatrix}, \qquad \begin{pmatrix} q(0) \\ \lambda(t) \end{pmatrix} = \begin{pmatrix} q_0 \\ Q_{fin}q(t) \end{pmatrix}. \qquad (7.16)$$

The system of ODEs (7.16) is a two-point Boundary Value Problem (BVP), because it has two boundary conditions at the opposite ends of the time interval. In general, two-point BVPs are solved by the shooting method, which requires multiple iterations forward and backward in time (hoping for convergence). However, for LQ Control problems the system of equations is linear, and we can solve it in one shot, with only one forward iteration and one backward iteration. Indeed, integrating the linear ODEs (7.16), one derives
$$\begin{pmatrix} q(\tau) \\ \lambda(\tau) \end{pmatrix} = W(\tau)\begin{pmatrix} q(0) \\ \lambda(0) \end{pmatrix}, \qquad (7.17)$$
$$W(\tau) = \begin{pmatrix} W^{1,1}(\tau) & W^{1,2}(\tau) \\ W^{2,1}(\tau) & W^{2,2}(\tau) \end{pmatrix} \doteq \exp\left(\tau\begin{pmatrix} A & -BR^{-1}B^T \\ -Q & -A^T \end{pmatrix}\right), \qquad (7.18)$$
which allows one to express $\lambda(0)$ via $q(0) = q_0$:
$$\lambda(0) = Mq_0, \qquad M \doteq -\left(W^{2,2}(t) - Q_{fin}W^{1,2}(t)\right)^{-1}\left(W^{2,1}(t) - Q_{fin}W^{1,1}(t)\right). \qquad (7.19)$$
Substituting Eqs. (7.17,7.19) into Eq. (7.15), one arrives at the following expression for the optimal control via $q_0$:
$$u(\tau) = -R^{-1}B^T\left(W^{2,1}(\tau) + W^{2,2}(\tau)M\right)q_0. \qquad (7.20)$$

A control of this type, dependent on the initial state, is called open loop control. The name suggests that at any moment of time, $\tau > 0$, we set the control based only on the information about the initial state of the system at $\tau = 0$. The open loop control is normally juxtaposed with the so-called feedback loop control, which may also be called closed loop control. The feedback loop version of Eq. (7.20) is derived by expressing $\lambda(\tau)$ and $q(\tau)$ via $q_0$ according to Eqs. (7.17,7.19) and then substituting the result into Eq. (7.15):
$$\forall\tau \in (0, t] : \quad u(\tau) = -R^{-1}B^TP(\tau)q(\tau), \qquad (7.21)$$
$$P(\tau) \doteq \lambda(\tau)q^{-1}(\tau) \qquad (7.22)$$
$$= \left(W^{2,1}(\tau) + W^{2,2}(\tau)M\right)\left(W^{1,1}(\tau) + W^{1,2}(\tau)M\right)^{-1}. \qquad (7.23)$$
The feedback loop control, $u(\tau)$, at any moment of time $\tau$, i.e. as we go along, responds to the current measurement of the system state, $q(\tau)$, at the same time $\tau$.
Notice that in the deterministic case without uncertainty/perturbation (and this is what we have considered so far) the open loop and the feedback loop controls are equivalent. However, the two control schemes/policies give very different results in the presence of uncertainty/perturbation. We will investigate this phenomenon and present a more extended comparison of the two controls in the probability/statistics/data science section of the course.
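A minimal numerical sketch of Eqs. (7.17)-(7.20) for a scalar system, $d = 1$ (assuming NumPy and SciPy for the matrix exponential; all parameter values are our own illustrative choices):

```python
import numpy as np
from scipy.linalg import expm

# scalar LQ problem: A, B, R, Q, Qfin are 1x1 matrices
A = np.array([[0.5]]); B = np.array([[1.0]])
R = np.array([[1.0]]); Q = np.array([[1.0]]); Qfin = np.array([[1.0]])
t, q0 = 1.0, np.array([1.0])

H = np.block([[A, -B @ np.linalg.inv(R) @ B.T],
              [-Q, -A.T]])                        # generator in Eq. (7.18)
W = lambda tau: expm(tau * H)

Wt = W(t)
W11, W12 = Wt[:1, :1], Wt[:1, 1:]
W21, W22 = Wt[1:, :1], Wt[1:, 1:]
M = -np.linalg.inv(W22 - Qfin @ W12) @ (W21 - Qfin @ W11)   # Eq. (7.19)

def u_open_loop(tau):
    Wtau = W(tau)                                  # open-loop control, Eq. (7.20)
    return -np.linalg.inv(R) @ B.T @ (Wtau[1:, :1] + Wtau[1:, 1:] @ M) @ q0

print(u_open_loop(0.0), u_open_loop(t))
```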

Exercise 7.1.1. Show, utilizing the derivations and discussions above, that the matrix $P(\tau)$, defined in Eq. (7.22), satisfies the so-called Riccati equation,
$$\dot P + A^TP + PA + Q = PBR^{-1}B^TP, \qquad (7.24)$$
supplemented with the terminal/final ($\tau = t$) condition, $P(t) = Q_{fin}$.

Exercise 7.1.2. Consider an unstable one-dimensional process,
$$\tau \in [0, \infty) : \quad \dot q(\tau) = Aq(\tau) + u(\tau),$$
where $u \in \mathbb{R}$ and $A$ is a positive constant, $A > 0$. Design an LQ controller $u(\tau) = Pq(\tau)/R$ that minimizes the action
$$S\{q(\tau), u(\tau)\} = \int_0^\infty d\tau\left(q^2 + Ru^2\right),$$
where $P$ is a constant (to be found) and $R$ is a positive known constant. Discuss/explain what happens with $P$ when $R \to 0$ or $R \to \infty$. [Hint: Analyze the Riccati Eq. (7.24) in the steady, $t \to \infty$, regime.]

7.2 From Variational Calculus to Bellman-Hamilton-Jacobi Equation

Next we consider an optimal control problem which is more general, in terms of governing equations and optimization objective, than what was considered so far. We study a controlled dynamical system which is nonlinear in our primal variable, $\{q(\tau) : [0, t] \to \mathbb{R}^d\}$, but still linear in the control variable, $\{u(\tau) : [0, t] \to \mathbb{R}^d\}$:
$$\forall\tau \in [0, t] : \quad \dot q(\tau) = f(q(\tau)) + u(\tau). \qquad (7.25)$$
As above, we will formulate the control problem as an optimization. We aim to minimize the objective
$$\int_0^t d\tau\left(\frac{1}{2}u^T(\tau)u(\tau) + V(q(\tau))\right) \qquad (7.26)$$
over $\{u(\tau)\}$ satisfying the ODE (7.25). Here in Eq. (7.26) we shortcut notations and use $(u(\tau))^2$ for $u^T(\tau)u(\tau)$. Notice that the cost-to-go objective (7.26) is a sum of two terms: (a) the cost of control, which is assumed quadratic in the control efforts, and (b) the bounded-from-below "potential", which defines preferences or penalties imposed on where the particle may or may not go. The potential may be soft or hard. An exemplary soft potential is the quadratic potential
$$V(q) = \frac{1}{2}q^T\Lambda q = \frac{1}{2}\sum_{i,j=1}^d q_i\Lambda_{ij}q_j, \qquad (7.27)$$
where $\Lambda$ is a positive semi-definite matrix. This potential encourages $q(\tau)$ to stay close to the origin, $q = 0$, penalizing (but softly) deviations from the origin. An exemplary hard constraint may be
$$V(q) = \begin{cases} 0, & |q| < a, \\ \infty, & |q| \ge a, \end{cases} \qquad (7.28)$$
completely prohibiting $q(\tau)$ from leaving the ball of radius $a$ around the origin. Summarizing, we discuss the optimal control problem
$$\min_{\{u(\tau), q(\tau)\}}\int_0^t d\tau\left(\frac{u^T(\tau)u(\tau)}{2} + V(q(\tau))\right), \quad \text{s.t.: } \forall\tau \in [0, t] : \dot q(\tau) = f(q(\tau)) + u(\tau), \quad q(0) = q_0, \quad q(t) = q_t, \qquad (7.29)$$
where the initial and final states of the system are assumed fixed.

In the following we restate Eq. (7.29) as an unconstrained variational calculus problem. (Notice that we do not count the boundary conditions as constraints.) We will assume that all the functions involved in the formulation (7.29) are sufficiently smooth, and derive the respective Euler-Lagrange (EL) equations, Hamiltonian equations and Hamilton-Jacobi (HJ) equations.
To implement the plan, let us, first of all, exclude $\{u(\tau)\}$ from Eq. (7.29). The resulting "q-only" formulation becomes
$$\min_{\{q(\tau)\}}\int_0^t d\tau\left(\frac{(\dot q(\tau) - f(q(\tau)))^T(\dot q(\tau) - f(q(\tau)))}{2} + V(q(\tau))\right), \quad q(0) = q_0, \; q(t) = q_t. \qquad (7.30)$$
Following the Lagrangian and Hamiltonian approaches, described in detail in the variational calculus portion of the course (see Section 5), one identifies the action, Lagrangian, momentum and Hamiltonian for the functional optimization (7.30) as follows:
$$S\{q(\tau), \dot q(\tau)\} = \int_0^t d\tau\left(\frac{(\dot q - f(q))^T(\dot q - f(q))}{2} + V(q)\right), \qquad (7.31)$$
$$L = \frac{(\dot q - f(q))^T(\dot q - f(q))}{2} + V(q), \qquad (7.32)$$
$$p \equiv \frac{\partial L}{\partial\dot q^T} = \dot q - f(q), \qquad (7.33)$$
$$H \equiv \dot q^T\frac{\partial L}{\partial\dot q^T} - L = \frac{\dot q^T\dot q}{2} - \frac{(f(q))^Tf(q)}{2} - V(q) = \frac{p^Tp}{2} + p^Tf(q) - V(q). \qquad (7.34)$$
Then the Euler-Lagrange equations are
$$\forall i = 1, \cdots, d : \quad \frac{d}{dt}\frac{\partial L}{\partial\dot q_i} = \frac{\partial L}{\partial q_i} : \quad \frac{d}{dt}(\dot q_i - f_i(q)) = -\sum_{j=1}^d(\dot q - f(q))_j\partial_{q_i}f_j(q) + \partial_{q_i}V(q), \qquad (7.35)$$
where we stated the vector equation by components for clarity. The Hamilton equations are
$$\forall i = 1, \cdots, d : \quad \dot q_i = \frac{\partial H}{\partial p_i} = p_i + f_i(q), \qquad (7.36)$$
$$\dot p_i = -\frac{\partial H}{\partial q_i} = -\sum_{j=1}^d p_j\partial_{q_i}f_j(q) + \partial_{q_i}V(q). \qquad (7.37)$$

Considering the action, $S$, as a function (not a functional!) of the final time, $t$, and of the final position, $q_t$, and recalling that
$$\frac{\partial S}{\partial t} = -H|_{\tau=t}, \qquad \frac{\partial S}{\partial q_t} = \frac{\partial L}{\partial\dot q}\Big|_{\tau=t} = p|_{\tau=t},$$
one arrives at the Hamilton-Jacobi (HJ) equation
$$\frac{\partial S}{\partial t} = -H\left(q_t, \frac{\partial S}{\partial q_t}\right) = -\frac{1}{2}\left(\frac{\partial S}{\partial q_t}\right)^T\left(\frac{\partial S}{\partial q_t}\right) - \left(\frac{\partial S}{\partial q_t}\right)^Tf(q_t) + V(q_t). \qquad (7.38)$$
We will see later on that it may be useful to consider the HJ equations backwards in time. In this case we consider the action, $S = \int_\tau^t d\tau'\,L$, as a function of $\tau$ and $q(\tau) = q$. This results in the following (backwards in time) modification of Eq. (7.38):
$$-\frac{\partial S}{\partial\tau} = -\frac{1}{2}\left(\frac{\partial S}{\partial q}\right)^T\left(\frac{\partial S}{\partial q}\right) + f(q)^T\frac{\partial S}{\partial q} + V(q), \qquad (7.39)$$
where we used the relations $\partial_\tau S = H|_\tau$ and $\partial_q S = -\partial_{\dot q}L|_\tau$. (Check Theorem 5.5.3 to recall how differentiation of the action with respect to time and coordinates at the beginning and at the end of a path are related to each other.)
Notice that the HJ equations, in the control formulation, are called Bellman or Bellman-Hamilton-Jacobi (BHJ) equations, and sometimes just Bellman equations, to commemorate the contribution of Bellman to the field, who formulated the problem and resolved it by deriving the BHJ equations.
In Section 7.4 we derive the BHJ equations in a more general setting.

7.3 Pontryagin Minimal Principle

Let us now consider the following (almost) most general optimal control problem, formulated for a dynamical system in a state $q(\tau) \in \mathbb{R}^d$ evolving in time $\tau \in [0, t]$:
$$\min_{\{u(\tau), q(\tau)\}}\left(\phi(q(t)) + \int_0^t d\tau\,L(\tau, q(\tau), u(\tau))\right), \quad \text{s.t.: } \forall\tau \in (0, t] : \dot q(\tau) = f(\tau, q(\tau), u(\tau)), \quad q(0) = q_0; \quad \forall\tau \in [0, t] : u(\tau) \in U \subset \mathbb{R}^d, \qquad (7.40)$$
where the control $u(\tau)$ is restricted to the domain $U$ of the $d$-dimensional space at all the times considered.
The analog of the standard variational calculus approach, consisting in the necessary Euler-Lagrange (EL) conditions over $\{u\}$ and $\{q\}$, is called the Pontryagin Minimal Principle (PMP), commemorating the contribution of Lev Pontryagin to the subject [12] (see also [13] for an extended discussion of the PMP bibliography, circa 1963). We present it here without much elaboration (as it follows straightforwardly the same variational logic, repeated by now many times in this Section). Introduce the effective action,
$$\tilde S \doteq S + \int_0^t d\tau\,\lambda^T(\tau)\left(f(\tau, q(\tau), u(\tau)) - \dot q(\tau)\right),$$

where {λ(τ )} is a Lagrangian multiplier (function) and then optimizing over {u} and {q},
we arrive at the expression for the optimal control candidate, u∗ , and at the adjoint (dual)
equations, respectively

∀τ ∈ [0, t] : min S̃ : u(τ ) = arg min (L (τ, q(τ ), ũ(τ )) + λ(τ )f (τ, q(τ ), ũ(τ ))) (7.41)
{u} ũ

δ S̃ ∂
= 0 : λ̇(τ ) = − (L (τ, q(τ ), u(τ )) + λ(τ )f (τ, q(τ ), u(τ ))) , (7.42)
δq(τ ) ∂q
∂ S̃
τ =t = 0 : λ(t) = ∂φ(q(t))/∂q(t). (7.43)
∂q(t)

Notice that Eq. (7.43) is the result of the variation of $\tilde S$ over $q(t)$, providing the boundary condition at $\tau = t$ by relating $q(t)$ and $\lambda(t)$. The derivation of Eq. (7.43) is equivalent to the derivation of the respective boundary condition (7.13) at $\tau = t$ in the case of LQ control. The combination of Eqs. (7.41,7.42,7.43) with the (primal) dynamic equations, supplemented by the initial condition on $q(0)$ (which are the top conditions in Eq. (7.40)), completes the description of the PMP approach. This PMP system of equations, stated as a Boundary Value (BV) problem with two boundary conditions on the opposite ends of the temporal interval, is too difficult to allow an analytic solution in the general case. The system of equations is normally solved numerically by the shooting method.

Exercise 7.3.1. Consider a rocket, modeled as a particle of constant (unit) mass moving in zero gravity (empty) two-dimensional space. Assume that the thrust/force acting on the rocket, $f(\tau)$, is a known (prescribed) function of time (dependent on the, presumably pre-calculated, rate of the fuel burn), and that the direction of the thrust can be controlled. Then the equations of motion (of the controlled rocket) are
$$\forall\tau \in (0, t] : \quad \ddot q_1 = f(\tau)\cos u(\tau), \qquad \ddot q_2 = f(\tau)\sin u(\tau).$$
(a) Assume that $\forall\tau \in [0, t]$, $u(\tau) > 0$. Show that $\min_{\{u\}}\phi(q(t))$, where $\phi(q)$ is an arbitrary function, always results in the optimal control stated in the following, so-called bi-linear tangent, form:
$$\tan(u^*(\tau)) = \frac{a + b\tau}{c + d\tau}.$$
(b) Assume that the rocket is at rest initially, i.e. $q_1(0) = q_2(0) = 0$, and we aim to land the rocket at the furthest longitudinal position away from the origin, i.e. the optimization problem is
$$\max_{\{q\}}q_2(t)\Big|_{q_1(t)=0}.$$
Show that the optimal control in this case is of the following "linear tangent" type:
$$\tan(u(\tau)) = a + b\tau.$$

7.4 Dynamic Programming in Optimal Control

7.4.1 Discrete Time Optimal Control

Discretizing Eq. (7.40) in time, one arrives at
$$\min_{u_{0:n-1}, q_{1:n}}\left(\phi(q_n) + \sum_{k=0}^{n-1}L(\tau_k, q_k, u_k)\right), \quad \text{s.t.: } k = 0, \cdots, n-1 : \; q_{k+1} = q_k + \Delta f(\tau_k, q_k, u_k), \qquad (7.44)$$
where $\Delta \doteq t/n$, $\tau_k \doteq k\Delta$, $q_k \doteq q(\tau_k)$, $u_k \doteq u(\tau_k)$, and $q_0$ is assumed fixed.
The main idea of Dynamic Programming (DP) consists in carrying out the optimization in Eq. (7.44) not over all the variables at once, but sequentially, one after another, that is, in a greedy fashion. Specifically, let us first optimize in Eq. (7.44) over $q_n$ and $u_{n-1}$. In fact, the optimization over $q_n$ consists simply in the substitution of $q_n$ by $q_{n-1} + \Delta f(\tau_{n-1}, q_{n-1}, u_{n-1})$, according to the condition in Eq. (7.44) evaluated at $k = n - 1$. One derives
$$S(n, q_n) \doteq \phi(q_n), \qquad (7.45)$$
$$u^*_{n-1} \doteq \arg\min_{u_{n-1}\in U}\left(S(n, q_{n-1} + \Delta f(\tau_{n-1}, q_{n-1}, u_{n-1})) + L(\tau_{n-1}, q_{n-1}, u_{n-1})\right), \qquad (7.46)$$
$$S(n-1, q_{n-1}) \doteq S\left(n, q_{n-1} + \Delta f(\tau_{n-1}, q_{n-1}, u^*_{n-1})\right) + L(\tau_{n-1}, q_{n-1}, u^*_{n-1}), \qquad (7.47)$$
where, making the optimization over $u_{n-1}$, we took advantage of the Markovian, causal structure of the objective in Eq. (7.44), therefore taking into account only the terms in the objective dependent on $u_{n-1}$. Repeating the same scheme (first excluding $q_{n-1}$, second optimizing over $u_{n-2}$), and then repeating the two sub-steps (by induction) $n - 1$ times (backwards in discrete time), we arrive at the following generalization of Eqs. (7.46,7.47):
$$k = n, \cdots, 1 : \quad u^*_{k-1} \doteq \arg\min_{u_{k-1}\in U}\left(S(k, q_{k-1} + \Delta f(\tau_{k-1}, q_{k-1}, u_{k-1})) + L(\tau_{k-1}, q_{k-1}, u_{k-1})\right), \qquad (7.48)$$
$$S(k-1, q_{k-1}) \doteq S\left(k, q_{k-1} + \Delta f(\tau_{k-1}, q_{k-1}, u^*_{k-1})\right) + L(\tau_{k-1}, q_{k-1}, u^*_{k-1}), \qquad (7.49)$$
where Eq. (7.45) sets the initial condition for the backward in (discrete) time iterations. It is now clear that $S(0, q_0)$ is exactly the solution of Eq. (7.44). $S(k, q_k)$, defined in Eq. (7.49), is called the cost-to-go, or value function, evaluated at the (discrete) time $\tau_k$. Eqs. (7.45,7.48,7.49) are summarized in Algorithm 1.

Algorithm 1 Dynamic Programming [Backward in time Value Iteration]

Input: $L(\tau, q, u)$, $f(\tau, q, u)$ return the value of the reward and the vector of incremental state corrections, $\forall\tau, q, u$.

1: $S(n, q) = \phi(q)$
2: for $k = n - 1, \cdots, 0$ do
3:   $u^*_k(q) = \arg\min_u\left(L(\tau_k, q, u) + S(k + 1, q + \Delta f(\tau_k, q, u))\right)$, $\forall q$
4:   $S(k, q) = L(\tau_k, q, u^*_k(q)) + S(k + 1, q + \Delta f(\tau_k, q, u^*_k(q)))$, $\forall q$
5: end for

Output: $u^*_k(q)$, $\forall q$, $k = n - 1, \cdots, 0$.

The scheme just explained and the resulting DP Algorithm 1 were introduced in the famous paper of Richard Bellman from 1952 [14].
In accordance with the greedy nature of the DP construction (one step at a time, backward in time), it gives an example of what is called a greedy algorithm in Computer Science, that is, an algorithm that makes a locally optimal choice at each step. In general, greedy algorithms offer only a heuristic, i.e. an approximate (sub-optimal) solution. However, the remarkable feature of the optimal control problem, which we just sketched a proof of (through the sequence of transformations in Eqs. (7.45,7.48,7.49) resulting in the optimal solution of Eq. (7.44)), is that the greedy algorithm in this case is optimal/exact.

7.4.2 Continuous Time & Space Optimal Control

Taking the continuous limit of Eqs. (7.45,7.48,7.49), one arrives at the Bellman, or Bellman-Hamilton-Jacobi, equation, already familiar from Section 7.2:
$$-\partial_\tau S(\tau, q) = \min_{u\in U}\left(L(\tau, q, u) + f(\tau, q, u)^T\partial_q S(\tau, q)\right). \qquad (7.50)$$
Then the expression for the optimal control, that is, the continuous time version of line 3 in Algorithm 1, is
$$\forall\tau \in (0, t] : \quad u^*(\tau, q) = \arg\min_{u\in U}\left(L(\tau, q, u) + \partial_q S(\tau, q)^Tf(\tau, q, u)\right). \qquad (7.51)$$
Notice that the special case considered in Section 7.2, where
$$L(\tau, q, u) \to \frac{u^2}{2} + V(q), \qquad f(\tau, q, u) \to f(q) + u,$$
and $U \to \mathbb{R}^d$, leads, after explicit evaluation of the resulting quadratic optimization, to Eq. (7.39).

Example 7.4.1 (Bang-Bang control of an oscillator). Consider a particle of unit mass on a spring, subject to a bounded amplitude control:
$$\tau \in (0, t] : \quad \ddot x(\tau) = -x(\tau) + u(\tau), \qquad |u(\tau)| < 1, \qquad (7.52)$$
where the particle and control trajectories are $\{x(\tau) \in \mathbb{R}\,|\,\tau \in (0, t]\}$ and $\{u(\tau) \in \mathbb{R}\,|\,\tau \in (0, t]\}$. Given $x(0) = x_0$ and $\dot x(0) = 0$, i.e. the particle is at rest initially, find the control path $\{u(\tau)\}$ such that the particle position at the final moment, $x(t)$, is maximal. ($t$ is assumed known too.) Describe the optimal control and optimal solution for the case of $x(0) = 0$ and $t = 2\pi$.
Solution:
First, we change from a single second order (in time) ODE to two first order ODEs:
$$\forall\tau \in (0, t] : \quad q \doteq \begin{pmatrix} q_1 \\ q_2 \end{pmatrix} = \begin{pmatrix} x \\ \dot x \end{pmatrix}, \qquad \dot q = Aq + Bu, \qquad (7.53)$$
$$A \doteq \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \qquad B \doteq \begin{pmatrix} 0 \\ 1 \end{pmatrix}. \qquad (7.54)$$
We arrive at the optimal control problem (7.40), where $\phi(q) = C^Tq$, $C^T = (-1, 0)$, $L(t, q, u) = 0$, $f(t, q, u) = Aq + Bu$. Then Eq. (7.50) becomes
$$\forall\tau \in (0, t] : \quad -\partial_\tau S = (\partial_q S)^TAq - |(\partial_q S)^TB|. \qquad (7.55)$$
Let us look for a solution by the (standard for HJ) method of variable separation, $S(\tau, q) = (\psi(\tau))^Tq + \alpha(\tau)$. Substituting the ansatz into Eq. (7.55), one derives
$$\forall\tau \in (0, t] : \quad \dot\psi = -A^T\psi, \qquad \dot\alpha = |\psi^TB|. \qquad (7.56)$$
These equations must be solved for all $\tau$, with the terminal/final conditions $\psi(t) = C$ and $\alpha(t) = 0$. Solving the first equation and then substituting the result into Eq. (7.51), one derives
$$\forall\tau \in (0, t] : \quad \psi(\tau) = \begin{pmatrix} -\cos(\tau - t) \\ \sin(\tau - t) \end{pmatrix}, \qquad u(\tau, q) = -\text{sign}(\psi_2(\tau)) = -\text{sign}(\sin(\tau - t)), \qquad (7.57)$$
that is, the optimal control depends only on $\tau$ (it does not depend on $q$) and it is $\pm 1$.
Consider, for example, $q_1(0) = x(0) = 0$ and $t = 2\pi$. In this case the optimal control is
$$u(\tau) = \begin{cases} -1, & 0 < \tau < \pi, \\ 1, & \pi < \tau < 2\pi, \end{cases} \qquad (7.58)$$
and the optimal trajectory is
$$q = (q_1, q_2)^T = \begin{cases} (\cos(\tau) - 1, -\sin(\tau)), & 0 < \tau < \pi, \\ (3\cos(\tau) + 1, -3\sin(\tau)), & \pi < \tau < 2\pi. \end{cases} \qquad (7.59)$$
The solution consists in, first, pushing the mass down and then up, in both cases to the extremes, i.e. to $u = -1$ and $u = 1$, respectively. This type of control is called bang-bang control; it is observed in cases, like the one considered, without any (soft) cost associated with the control, only (hard) bounds.
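A minimal sanity check of the solution (7.57)-(7.59), integrating $\ddot x = -x + u$ under the bang-bang control by forward Euler (the step size is our own choice):

```python
import numpy as np

t_final, dt = 2 * np.pi, 1e-4
x, v = 0.0, 0.0                          # x(0) = 0, xdot(0) = 0
tau = 0.0
while tau < t_final:
    u = -np.sign(np.sin(tau - t_final))  # optimal control, Eq. (7.57)
    x, v = x + dt * v, v + dt * (-x + u) # forward Euler step
    tau += dt
print(x)   # -> approx 4, in agreement with q1(2*pi) = 3*cos(2*pi) + 1
```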

Exercise 7.4.2. Consider a soft version of the problem discussed in Example 7.4.1:
$$\min_{\{u(\tau)\},\{q(\tau)\}}\left(C^Tq(t) + \frac{1}{2}\int_0^t d\tau\,(u(\tau))^2\right), \quad \text{s.t.: } \forall\tau \in (0, t] : \dot q(\tau) = Aq(\tau) + Bu(\tau), \qquad (7.60)$$
where $(q(0))^T = (x_0, 0)$ and $A$, $B$ and $C$ are defined above (in the formulation and solution of Example 7.4.1). Derive the Bellman/BHJ equation, build a generic solution, and illustrate it for the case of $t = 2\pi$ and $q_1(0) = x_0 = 0$. Compare your result with the solution of Example 7.4.1.

7.5 Dynamic Programming in Discrete Mathematics

Let us take a look at Dynamic Programming (DP) from the perspective of discrete mathematics, usually associated with combinations of variables (thus combinatorics) and graphs (thus graph theory). In the following we start exploring this very rich and modern field of applied mathematics through examples.

7.5.1 LATEX Engine

Consider a sequence of words of varying lengths, $w_1, \ldots, w_n$, and pose the question of choosing locations $j_1, j_2, \cdots$ for breaking the sequence into multiple lines. Once the breaking is chosen, the spaces between words are stretched, so that the left and right margins are aligned. We are interested in placing the line breaks in a way which would be most pleasing to the eye. We turn this informally stated goal into an optimization, requiring that the word stretching resulting from the line breaking is minimal.
To formalize the notion of minimal stretching, consider a sequence of words labeled by the index $i = 1, \cdots, n$. Each word is characterized by its length, $w_i > 0$. Assume that the cost of fitting all the words between $i$ and $j$, where $j > i$, in a row is $c(i, j)$. Then the total cost of placing $n$ words in a (presumably) nice looking text consisting of $l + 1$ rows is
$$c(1, j_1) + c(j_1 + 1, j_2) + \cdots + c(j_l + 1, n), \qquad (7.61)$$
where $1 < j_1 < j_2 < \cdots < j_l < n$. We will seek an optimal sequence minimizing the total cost. To make the description of the problem complete, one needs to introduce a plausible way of "pricing" the breaks. Let us define the total length of a line as the sum of all the lengths (of words) in the sequence plus the number of words in the line minus one (corresponding to the number of spaces in the line before stretching). Then one requires that the total length of the line (before stretching) be less than the widest allowed margin, $L$, and defines the cost to be a monotonically increasing function of the stretching factor, for example
$$c(i, j) = \begin{cases} +\infty, & L < (j - i) + \sum_{k=i}^j w_k, \\[4pt] \left(\dfrac{L - (j - i) - \sum_{k=i}^j w_k}{j - i}\right)^3, & \text{otherwise.} \end{cases} \qquad (7.62)$$
(The cubic dependence in Eq. (7.62) is an empirical way to introduce a preference for smaller stretching factors. Notice also that Eq. (7.62) assumes that $j > i$, i.e. any line contains more than one word, and it does not take into account the last line of the paragraph.)
At first glance, the problem of finding the optimal sequence seems hard, that is, exponential in the number of words. Indeed, formally one has to decide whether or not to place a break after reading each word in the sequence, thus facing the problem of choosing an optimal sequence from $2^{n-1}$ possible options.
Is there a more efficient way of finding the optimal sequence? Apparently the answer to this question is affirmative, and in fact, as we will see below, the solution is of the Dynamic Programming (DP) type. The key insight is the relation between the optimal solution of the full problem and an optimal solution of a sub-problem consisting of an early portion of the full paragraph. One discovers that the optimal solution of the sub-problem is a sub-set of the optimal solution of the full problem. This means, in particular, that we can proceed in a greedy manner, looking for an optimal solution sequentially, solving a sequence of sub-problems, where each consecutive problem extends the preceding one incrementally.
Let $f(i)$ denote the minimum cost of formatting the sequence of words which starts from word $i$ and runs to the end of the paragraph. Then the minimum cost of the entire paragraph is
$$f(1) = \min_j(c(1, j) + f(j + 1)), \qquad (7.63)$$
while a partial cost satisfies the following recursive relation,
$$\forall i : \quad f(i) = \min_{j : i \le j}(c(i, j) + f(j + 1)), \qquad (7.64)$$
which we also supplement with the boundary condition $f(n + 1) = 0$, stating formally that no word is available for formatting when we reach the end of the paragraph. Eq. (7.64) is a full analog of the Bellman equation (7.49). Algorithm 2 is a recursive algorithm for $f(i)$ implementing Eq. (7.64).

Algorithm 2 Dynamic Programming for LATEX Engine

Input: $c(i, j)$, $\forall i, j = 1, \cdots, n$, e.g. according to Eq. (7.62). $f(n + 1) = 0$.

1: for $i = n, \cdots, 1$ do
2:   $f_{min} = +\infty$
3:   for $j = i, \cdots, n$ do
4:     $f_{min} = \min(f_{min}, c(i, j) + f(j + 1))$
5:   end for
6:   $f(i) = f_{min}$
7: end for

Output: $f(i)$, $\forall i = 1, \cdots, n$

Algorithm 2 answers the formatting question in a way smarter than the naive check mentioned above. However, a naive recursive implementation is still not efficient, as it recomputes the same values of $f$ many times, thus wasting effort. For example, the algorithm calculates $f(4)$ whenever it calculates $f(1)$, $f(2)$, $f(3)$. To avoid this unnecessary step, one should save the values already calculated, placing each result just computed into memory. Then, by storing the results, we win by calling, computing and storing the functions $f(i)$ sequentially. Since we have $n$ different values of $i$ and the loop runs through $O(n)$ values of $j$, the total running time of the algorithm, relying on the previously stored values, is $O(n^2)$.
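Following this discussion, here is a minimal memoized implementation of the recursion (7.64); the word lengths and the line width $L$ are our own illustrative choices:

```python
from functools import lru_cache

w = [3, 2, 4, 3, 5, 1, 4, 2]      # word lengths (illustrative)
L = 10                             # widest allowed line
n = len(w)

def c(i, j):
    """Cost of putting words i..j (1-indexed, inclusive) on one line, Eq. (7.62)."""
    slack = L - (j - i) - sum(w[i - 1:j])
    if slack < 0 or j == i:        # overflow, or a one-word line (cost +inf)
        return float('inf')
    return (slack / (j - i)) ** 3

@lru_cache(maxsize=None)           # memoization: each f(i) computed once
def f(i):
    """Minimal cost of formatting words i..n, Eq. (7.64); f(n+1) = 0."""
    if i == n + 1:
        return 0.0
    return min(c(i, j) + f(j + 1) for j in range(i, n + 1))

print(f(1))   # minimum total cost of the paragraph
```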

7.5.2 Shortest Path over Grid

Let us now discuss another problem. There is a number placed in each cell of a rectangular grid, $N \times M$. One starts from the top-left corner and aims to reach the bottom-right corner. At every step one can move down or right, "paying a price" equal to the number written in the cell. What is the minimum amount needed to complete the task?
Solution: One can move to a particular cell $(i, j)$ only from its left, $(i - 1, j)$, or up, $(i, j - 1)$, neighbor. Let us solve the following sub-problem: find the minimal price $p(i, j)$ of moving to the cell $(i, j)$. The recursive formula (the Bellman equation again) is
$$p(i, j) = \min(p(i - 1, j), p(i, j - 1)) + a(i, j),$$
where $a(i, j)$ is the table of initial numbers. The final answer is the element $p(N, M)$. Note that one can manually pad the table with an extra row and column, filled with numbers which are deliberately larger than the content of any cell (this helps, as it allows one to avoid dealing with the boundary conditions explicitly). See Algorithm 3.

Algorithm 3 Dynamic Programming for Shortest Path over Grid

Input: Costs assigned: $a(i, j)$, $\forall i = 1, \cdots, N$; $\forall j = 1, \cdots, M$. Boundary conditions fixed: $p(i, 0) = +\infty$, $\forall i = 1, \cdots, N$; $p(0, j) = +\infty$, $\forall j = 1, \cdots, M$. Initialization: $p(1, 1) = 0$.

1: for $t = 3, \cdots, N + M$ do
2:   for $i + j = t$, $i, j \ge 1$ do
3:     $p(i, j) = \min(p(i - 1, j), p(i, j - 1)) + a(i, j)$
4:   end for
5: end for

Output: $p(i, j)$, $\forall i = 1, \cdots, N$; $j = 1, \cdots, M$.

Algorithm performance is illustrated in Fig. (7.1).
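A minimal implementation of the recursion of Algorithm 3; the cost table reproduces the $4 \times 4$ example of Fig. 7.1, and we sweep row by row, which visits each cell after its up and left neighbors (equivalent to the anti-diagonal sweep of Algorithm 3):

```python
import numpy as np

# Cost table from the 4x4 example of Fig. 7.1 (cell (1,1) carries no cost).
a = np.array([[0, 3, 1, 4],
              [1, 3, 7, 3],
              [8, 9, 2, 5],
              [4, 5, 2, 1]], dtype=float)
N, M = a.shape

p = np.full((N + 1, M + 1), np.inf)   # padded infinite boundary, as in Algorithm 3
p[1, 1] = 0.0                         # start cell is free
for i in range(1, N + 1):
    for j in range(1, M + 1):
        if (i, j) == (1, 1):
            continue
        p[i, j] = min(p[i - 1, j], p[i, j - 1]) + a[i - 1, j - 1]
print(p[N, M])   # -> 16.0, the optimal-path cost shown in Fig. 7.1
```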

7.5.3 DP for Graphical Model Optimization

The range of optimization problems which can be solved efficiently with DP is remarkably broad. In particular, it appears that the following combinatorial optimization problem over a binary $n$-dimensional variable, $x$,
$$E \doteq \min_{x\in\{\pm 1\}^n}\sum_{i=1}^{n-1}E_i(x_i, x_{i+1}), \qquad (7.65)$$

[Figure 7.1 panels: (a) A sample path; (b) Initialization step; (c) First step; (d) Second step; (e) Third step; (f) Fourth step; (g) Fifth step; (h) Sixth (final) step; (i) Optimal path(s).]

Figure 7.1: Step-by-step illustration of the Shortest-Path Algorithm 3 for an exemplary $4 \times 4$ grid. The number in the corner of each cell (except cell $(1, 1)$) is the respective $a_{ij}$. Values in the green circles are the respective final $p_{ij}$, corresponding to the cost of the optimal path from $(1, 1)$ to $(i, j)$.

which requires optimization over 2n possible states, can be solved efficiently by DP in efforts
linear in n. In the jargon of mathematical physics the problem just introduced is called
“finding a ground state of the Ising model”.

(" (!" , !# ) (# (!# , !$ ) ($ (!$ , !% ) (% (!% , !& ) (& (!& , !' )

1 2 3 4 5 6

!" !# !$ !% !& !'

(,# (!# , !$ ) ($ (!$ , !% ) (% (!% , !& ) (& (!& , !' )

2 3 4 5 6

!# !$ !% !& !'

Figure 7.2: Top: Example of a linear Graphical Model (chain). Bottom: Modified GM
(shorter chain) after one step of the DP algorithm.

To explain the DP algorithm for this example it is convenient to represent the problem
in terms of the linear graph (chain) shown in Fig. (7.2): components of x are associated
with the nodes, and the “energies” of “pair-wise interactions” between neighboring components
of x are associated with the edges.
Let us illustrate the greedy, DP approach to solving the optimization (7.65) on the example
in Fig. (7.2). The greedy essence of the approach suggests that we should minimize over the
components sequentially, starting from one side of the chain and advancing to its opposite
end. Minimizing over x_1 first, one derives

E = \min_{x_2,\cdots,x_n} \left( \min_{x_1} E_1(x_1, x_2) + \sum_{i=2}^{n-1} E_i(x_i, x_{i+1}) \right)
  = \min_{x_2,\cdots,x_n} \left( \tilde E_2(x_2, x_3) + \sum_{i=3}^{n-1} E_i(x_i, x_{i+1}) \right),   (7.66)

\tilde E_2(x_2, x_3) \doteq E_2(x_2, x_3) + \min_{x_1} E_1(x_1, x_2),   (7.67)

where we took advantage of the objective's factorization (into a sum of terms each involving
only a pair of neighboring components). Notice that as a result of the minimization over x_1 we
arrive at a problem with exactly the same structure we started from, i.e. a chain, which
is however shorter by one node (and one edge). The only change is a “renormalization” of the
pair-wise energy: E_2(x_2, x_3) → Ẽ_2(x_2, x_3). The graphical transformation associated with one
greedy step is illustrated in Fig. (7.2) by the transition from the original chain to the reduced
(one node and one edge shorter) chain. Therefore, repeating the process sequentially (by
induction) we get the desired answer in exactly n steps. The DP algorithm is shown
below, where we also generalize by assuming that all components x_i are drawn from an
arbitrary (and not necessarily binary) set, Σ, often called an “alphabet” in the Computer
Science and Information Theory literature.

Algorithm 4 DP for Combinatorial Optimization over Chain


Input: Pair-wise energies, E_i(x_i, x_{i+1}), ∀i = 1, · · · , n − 1.

1: for i = 1, · · · , n − 2 do
2:   for x_{i+1}, x_{i+2} ∈ Σ do
3:     E_{i+1}(x_{i+1}, x_{i+2}) = E_{i+1}(x_{i+1}, x_{i+2}) + \min_{x_i} E_i(x_i, x_{i+1})
4:   end for
5: end for
Output: E = \min_{x_{n-1}, x_n} E_{n-1}(x_{n-1}, x_n)
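The renormalization step of Algorithm 4 is easy to code. Below is a hedged Julia sketch (the function name and the random energies are made up; energies are stored as |Σ| × |Σ| matrices with the alphabet values encoded as indices):

```julia
# Sketch of Algorithm 4: W[i][x, y] stores E_i(x_i = x, x_{i+1} = y).
function chain_min_energy(E::Vector{Matrix{Float64}})
    W = copy(E)                                # keep the input energies intact
    for i in 1:length(W)-1
        # renormalization: absorb min over x_i into the next pair-wise energy
        m = vec(minimum(W[i], dims=1))         # m[y] = min_x W_i(x, y)
        W[i+1] = W[i+1] .+ m                   # added along the x_{i+1} (row) index
    end
    return minimum(W[end])                     # final min over (x_{n-1}, x_n)
end

E = [randn(2, 2) for _ in 1:4]   # made-up chain of n = 5 binary variables
println(chain_min_energy(E))     # can be checked against brute force over 2^5 states
```

The cost is O(n |Σ|²) operations, to be compared with the |Σ|^n states of the brute-force search.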

Consider a generalization of the combinatorial optimization problem (7.65) to the case of
a singly-connected tree, T = (V, E), e.g. the one shown in Fig. (7.3):

E \doteq \min_{x\in\Sigma^{|V|}} \sum_{\{i,j\}\in E} E_{i,j}(x_i, x_j),   (7.68)

where V and E are the sets of nodes and edges of the tree, respectively; |V| is the cardinality
of the set of nodes (the number of nodes); and Σ is the set (alphabet) of possible (allowed)
values for any component x_i, i ∈ V, of x.

Exercise 7.5.1. Generalize Algorithm 4 to the case of the GM optimization problem (7.68)
over a tree, that is compute E defined in Eq. (7.68). (Hint: one can start from any leaf
node of the tree, and use induction as in any other DP scheme.)

[Figure 7.3 here: a tree-structured Graphical Model on nine nodes, with variables x_i on the nodes and pair-wise energies E_{i,j}(x_i, x_j) on the edges.]
Figure 7.3: Example of a tree-like Graphical Model.


Part IV

Mathematics of Uncertainty

Chapter 8

Basic Concepts from Statistics

8.1 Random Variables: Characterization & Description.


8.1.1 Probability of an event

Consider events drawn from a sample space, Σ. In general Σ may be continuous, e.g.
embedded into R^n; however, let us start by discussing the simple case of a discrete binary
space: Σ = {0, 1}. Let us draw a sequence of random variables from the space, for example
by tossing a coin. Given that any new toss of the coin does not depend on the previous
tosses, and also assuming that the law/rule of tossing does not change as we progress,
we arrive at the so-called Bernoulli i.i.d. (independent and identically distributed) process
described by the probability of being in the state ς:

∀ς ∈ Σ :  Prob(ς) = P(ς),   (8.1)
0 ≤ P(ς) ≤ 1,   (8.2)
\sum_{ς∈Σ} P(ς) = 1,   (8.3)

where thus P(1) = β and P(0) = 1 − β. If β ≠ 1/2 the coin is biased.


Another important i.i.d. discrete event distribution is the Poisson distribution. An event
can occur k = 0, 1, 2, · · · times in an interval. The average number of events in an interval
is λ̃, called the event rate. The probability of observing k events within the interval is

∀k ∈ Z_* = {0} ∪ Z_+ :  P(k) = \frac{λ̃^k e^{−λ̃}}{k!}.   (8.4)

(Check that the probability is properly normalized, in the sense of Eq. (8.3). Notice also
that λ̃ is dimensionless. Later we will also be discussing the Poisson process, where a related,
but dimensional, object λ will be introduced; λ stands for the rate of arrivals per unit time.)


(Because of the exponential factor, the distribution is sometimes loosely said to be of
exponential form — look at the expression; it should not be confused with the exponential
distribution of Exercise 8.1.7 below.)
Standard notations for the Bernoulli and Poisson distributions are Bernoulli(β) and
Poisson(λ̃), respectively.

Example 8.1.1. Are the Bernoulli and Poisson distributions related? Can you “design” Pois-
son from Bernoulli? Can you give an example of a Poisson process from life/science?
Solution:
Consider repeating Bernoulli trials, each time independently, thus drawing a Bernoulli process.
You get a sequence of zeros and ones. Then check only for ones and record the times/slots
associated with the arrivals of ones. Study the probability distribution of the number of
arrivals in n steps, and then analyze n → ∞, to get the Poisson distribution. (The statement,
also called in the literature the Poisson Limit Theorem, will be discussed in detail in one of
the following lectures.) Some examples of processes associated with the Poisson distribution
(what we call Poisson processes) are: the distribution of the number of phone calls received
by a call center per hour, the distribution of customer arrivals at a shop/bank, the distribution
of the number of meteors greater than 1 meter in diameter that strike Earth in a year, the
distribution of the number of typing errors per page, and many others.

The domain, Σ, can also be continuous, bounded or unbounded. An example of an i.i.d.
distribution with a bounded domain is the uniform distribution on the [0, 1] interval:

∀x ∈ [0, 1] :  p(x) = 1,   (8.5)
\int_0^1 dx\, p(x) = 1,   (8.6)

where p(x) is the probability density. (It is customary to use the lower-case p for the
probability density and the upper-case P to denote actual probabilities.) The Gaussian
distribution is the most important (also most frequently used) continuous distribution:

∀x ∈ R :  p(x|σ, µ) = \frac{1}{σ\sqrt{2π}} \exp\left(−\frac{(x − µ)^2}{2σ^2}\right).   (8.7)

p_{σ,µ}(x) is another possible notation. It is also called the “normal distribution”, where “nor-
mality” refers to the fact that the Gaussian distribution is a “normal/natural” outcome
of summing up many random numbers, regardless of the distributions of the individual
contributions. (We will discuss the law of large numbers and the closely related central limit
theorem shortly.) The distribution is parameterized by the mean, µ, and by the variance, σ².
The standard mathematical notation for the Gaussian/normal distribution is N(µ, σ²).

There are many more ‘standard’ distributions (i.i.d. or not) beyond the golden three —
Bernoulli, Poisson and Gaussian. In fact one can generate practically any other distribution
from the ‘golden set’ (possibly extended with the uniform distribution).
Let us make a brief remark about notations. We will often write P(X = x), or the
short-cut P(x), and sometimes you will see in the literature P_X(x). By convention, upper-
case variables denote random variables, e.g. X. A random variable takes on values in
some domain, and if we want to consider a particular instantiation, thus an instance/sam-
ple, of the random variable (that is, it has been sampled and observed to have a partic-
ular value in the domain), then that non-random value is denoted by lower case, e.g. x.
E(f(x)) = E_X(f(x)) = E_{P_X}(f(x)) = ⟨f(x)⟩ are all different notations used for
averaging a function f(x) of the variable x over the probability distribution P(x), that
is \sum_{x∈Σ} f(x)P(x). Finally, x ∼ P(x) denotes the fact that the random variable x is drawn
from the distribution P(x).

8.1.2 Sampling. Histograms.

Random processes are generated/sampled. Any computational package/software contains a
random number generator (usually a number of these). Designing a good random number
generator is important; in this course, however, we will mainly be using the (in fact pseudo-
random) generators already created by others.
Histogram. To show a distribution graphically, one may “bin” it over the domain —
thus generating a histogram, which is a convenient way of visualizing p(x) (see plots in the
attached Julia notebook, illustrating breaking the [0, 1] interval into N > 1 bins).
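For concreteness, here is a minimal Julia sketch of the binning procedure (hedged: the sample size, seed and bin count are made up; real code would rather call a library histogram routine):

```julia
# Sample the uniform distribution on [0,1] and bin the samples by hand.
using Random
Random.seed!(1234)                    # for reproducibility
nsamples, nbins = 10_000, 20
x = rand(nsamples)                    # i.i.d. uniform samples on [0,1]
counts = zeros(Int, nbins)
for v in x
    b = min(nbins, 1 + floor(Int, v * nbins))   # bin index of the sample v
    counts[b] += 1
end
density = counts ./ (nsamples / nbins)   # normalized bin heights
println(density)                         # each entry ≈ 1, the uniform density
```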

8.1.3 Moments. Generating Function.

Expectations:

E_p[A(ς)] = ⟨A(ς)⟩_p = \sum_{ς∈Σ} A(ς) p(ς).

Examples: the mean,
E[ς],
and the variance,
Var[ς] = E[(ς − E[ς])^2].

We have already discussed these for the Gaussian process.

Example 8.1.2. What is the average number of events in the Poisson process, Pois(λ),
described by the probability distribution function (8.4)? What are the second moment and
the variance of the Poisson distribution?

Solution:
The average number of events in the interval is

µ_1 = \sum_{k=0}^{∞} \frac{k λ^k}{k!} e^{−λ} = \sum_{k=1}^{∞} \frac{λ^k}{(k−1)!} e^{−λ} = λ \sum_{n=0}^{∞} \frac{λ^n}{n!} e^{−λ} = λ.

The second moment is

µ_2 = \sum_{k=0}^{∞} \frac{k^2 λ^k}{k!} e^{−λ} = \sum_{k=1}^{∞} \frac{k λ^k}{(k−1)!} e^{−λ} = λ \sum_{n=0}^{∞} \frac{(n+1) λ^n}{n!} e^{−λ} = λ(λ + 1),

and then the variance is σ² = µ_2 − µ_1² = λ. Note that the expectation value and the variance
of the Poisson distribution are both equal to the same value, λ.

Example 8.1.3. Consider the Cauchy distribution. (It plays an important role in physics,
since it describes resonance behavior, e.g. the spectral line shape of a laser.) The
probability density function of the distribution is

p(x) = \frac{1}{π} \frac{γ}{(x − a)^2 + γ^2},  −∞ < x < +∞.   (8.8)

Show that the probability distribution is properly normalized and find its first moment.
What can you say about the second moment?
Solution:
The first moment is

µ_1 = \frac{γ}{π} \int_{−∞}^{+∞} \frac{x\, dx}{(x − a)^2 + γ^2} = a.   (8.9)

(Recall that this integral is an example of the “principal value integral” we studied in
the fall.) The second moment µ_2 is not defined (infinite).

Moments of a probability distribution P(ς) are defined as follows:

k = 0, 1, \cdots :  m_k(Σ) \doteq E_P[ς^k] = ⟨ς^k⟩_P = \sum_{ς∈Σ} ς^k P(ς).   (8.10)

We can also extend the definition to a probability density p(x) = p_X(x) over continuous-
valued X:

k = 0, 1, \cdots :  µ_k \doteq E_p[x^k] = ⟨x^k⟩_p = \int dx\, x^k p(x).   (8.11)

Example 8.1.4. Find the variance and the moments of the Bernoulli distribution, Bernoulli(β),
with the probability density function

p(x) = βδ(1 − x) + (1 − β)δ(x).   (8.12)



Solution:

k = 1, 2, \cdots :  µ_k = ⟨X^k⟩ = \int_{−∞}^{∞} x^k p(x)\, dx = β.   (8.13)

In this case the variance is σ² = µ_2 − µ_1² = β − β² = β(1 − β).

Moment Generating and Characteristic Function

The moment generating function is defined by

M_X(t) = E[\exp(tx)] = \int_{−∞}^{∞} dx\, p(x) \exp(tx) = \int_{−∞}^{∞} dx\, p(x) \sum_{k=0}^{∞} \frac{(tx)^k}{k!} = \sum_{k=0}^{∞} \frac{µ_k t^k}{k!},   (8.14)

where t ∈ R and all the integrals are assumed well defined.

Example 8.1.5. Consider the standard example of the Boltzmann distribution from statistical
mechanics, where the probability, p(s), of a state s is

p(s) = \frac{1}{Z} e^{−βE(s)},\qquad Z(β) = \sum_s e^{−βE(s)},   (8.15)

where β = 1/T is the inverse temperature and E(s) is a known function of s, called the
energy of the state s. The normalization factor Z is called the partition function. Suppose
we know the partition function, Z(β), as a function of the inverse temperature, β. (Notice
that up to a sign inversion of the argument the partition function is equivalent to the moment
generating function (8.14), Z(β) = M_X(−β).) Compute the expected mean value and the
variance of the energy.
Solution:
The mean value (average) of the energy is

⟨E⟩ = \sum_s p(s) E(s) = \frac{1}{Z} \sum_s E(s) e^{−βE(s)} = −\frac{1}{Z}\frac{∂Z}{∂β} = −\frac{∂ \ln Z}{∂β}.   (8.16)

The variance of the energy (the energy fluctuations) is

ΔE² = ⟨(E − ⟨E⟩)²⟩ = \frac{∂² \ln Z}{∂β²}.   (8.17)
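These two identities are easy to verify numerically. A hedged Julia sketch (the four state energies and the inverse temperature are made up; derivatives of ln Z are estimated by finite differences):

```julia
# Check ⟨E⟩ = -∂_β ln Z and ΔE² = ∂²_β ln Z for a made-up four-state system.
Es = [0.0, 1.0, 2.5, 4.0]                      # made-up state energies
logZ(β) = log(sum(exp.(-β .* Es)))
function direct_moments(β)
    w = exp.(-β .* Es); p = w ./ sum(w)        # Boltzmann probabilities
    meanE = sum(p .* Es)
    varE  = sum(p .* (Es .- meanE).^2)
    return meanE, varE
end
β, h = 0.7, 1e-4
meanE, varE = direct_moments(β)
println(meanE, "  ", -(logZ(β + h) - logZ(β - h)) / (2h))               # agree: ⟨E⟩
println(varE,  "  ", (logZ(β + h) - 2 * logZ(β) + logZ(β - h)) / h^2)   # agree: ΔE²
```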
The characteristic function is a related object, defined as the Fourier transform of the proba-
bility density:

G(k) \doteq E_p[\exp(ikx)] = \int_{−∞}^{+∞} dx\, p(x) \exp(ikx),   (8.18)

where i² = −1. The characteristic function exists for any real k and it obeys the following
relations:

G(0) = 1,\qquad |G(k)| ≤ 1.   (8.19)

The characteristic function contains information about all the moments µ_m. Moreover it
allows the Taylor series representation in terms of the moments,

G(k) = \sum_{m=0}^{∞} \frac{(ik)^m}{m!} ⟨x^m⟩,   (8.20)

and thus

⟨x^m⟩ = \frac{1}{i^m} \left.\frac{∂^m G(k)}{∂k^m}\right|_{k=0}.   (8.21)

This implies that the derivatives of G(k) at k = 0 exist up to the same m as the moments µ_m.

Example 8.1.6. Find the characteristic function of the Bernoulli distribution, Bernoulli(β).


Solution:
Substituting Eq. (8.12) into Eq. (8.18) one derives

G(k) = 1 − β + βe^{ik},   (8.22)

and thus

µ_m = \left.\frac{∂^m}{∂(ik)^m}\left[1 − β + βe^{ik}\right]\right|_{k=0} = β.   (8.23)

The result is naturally consistent with Eq. (8.13).

Exercise 8.1.7. The probability density function of the so-called exponential distribution
is

p(x) = \begin{cases} A e^{−λx}, & x ≥ 0,\\ 0, & x < 0, \end{cases}   (8.24)

where the parameter λ > 0. Calculate

(1) The normalization constant A of the distribution.

(2) The mean value and the variance of the probability distribution.

(3) The characteristic function G(k) of the exponential distribution.

(4) The m-th moment of the distribution (utilizing G(k)).

Cumulants
The cumulants are defined via the characteristic function as follows:

\ln G(k) = \sum_{m=1}^{∞} κ_m \frac{(ik)^m}{m!}.   (8.25)

According to Eq. (8.19) the Taylor series in Eq. (8.25) starts at m = 1 (there is no constant
term, since G(0) = 1). Utilizing Eqs. (8.20) and (8.25), one derives the following relations
between the cumulants and the moments:

κ_1 = µ_1,   (8.26)
κ_2 = µ_2 − µ_1² = σ².   (8.27)

The procedure naturally extends to higher order moments and cumulants.


Notice that moments determine the cumulants in the sense that any two probability dis-
tributions whose moments are identical will have identical cumulants as well, and similarly
the cumulants determine the moments. In some cases theoretical treatments of problems
in terms of cumulants are simpler than those using moments.

Example 8.1.8. Find the characteristic function and the cumulants of the Poisson distribution
(8.4).
Solution:
The respective characteristic function is

G(p) = \sum_{k=0}^{∞} e^{−λ} e^{ipk} \frac{λ^k}{k!} = e^{−λ} \sum_{k=0}^{∞} \frac{(λe^{ip})^k}{k!} = \exp\left(λ(e^{ip} − 1)\right),   (8.28)

and then

\ln G(p) = λ(e^{ip} − 1).   (8.29)

Next, using the definition (8.25), one finds that κ_m = λ, m = 1, 2, . . . .
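The statement κ_1 = κ_2 = λ is easy to probe by simulation. A hedged Julia sketch (the sampler implements Knuth's classical multiplication method, adequate for moderate λ; the value of λ and the sample size are made up):

```julia
# Sample Poisson(λ) with Knuth's method and check that mean ≈ variance ≈ λ.
using Random, Statistics
Random.seed!(7)
function poisson_knuth(λ)
    L, k, p = exp(-λ), 0, 1.0
    while true
        p *= rand()          # multiply uniforms until the product drops below e^{-λ}
        p < L && return k
        k += 1
    end
end
λ = 3.5
ks = [poisson_knuth(λ) for _ in 1:200_000]
println(mean(ks), "  ", var(ks))   # both ≈ λ = 3.5
```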

Example 8.1.9 (Birthday Problem). Assume that a year has 366 days. What is the
probability, p_m, that m people in a room all have different birthdays?
Solution: Let (b_1, b_2, . . . , b_m) be the list of the people's birthdays, b_i ∈ {1, 2, . . . , 366}. There
are 366^m different lists, and all of them are equiprobable. We should count
the lists which have b_i ≠ b_j, ∀i ≠ j. The number of such lists is \prod_{i=1}^{m} (366 − i + 1). Then,
the final answer is

p_m = \prod_{i=1}^{m} \left(1 − \frac{i − 1}{366}\right).   (8.30)

The probability that at least 2 people in the room have the same birthday is 1 − p_m.
Note that 1 − p_{23} > 0.5 and 1 − p_{22} < 0.5.

Exercise 8.1.10. (not graded) Choose, at random, three points on the circle of unit radius.
Interpret them as cuts that divide the circle into three arcs. Compute the expected length
of the arc that contains the point (1, 0).

8.1.4 Probabilistic Inequalities.

Here are some useful probabilistic inequalities, which we present mainly for reference.
(Proofs of the inequalities will be discussed in Math 527. See also http://jeremykun.com/
2013/04/15/probabilistic-bounds-a-primer/.)

• (Markov inequality) For a non-negative random variable x and any c > 0,

P(x ≥ c) ≤ \frac{E[x]}{c}.   (8.31)

• (Chebyshev inequality)

P(|x − µ| ≥ k) ≤ \frac{σ²}{k²}.   (8.32)

• (Chernoff bound) For any t > 0,

P(x ≥ a) = P(e^{tx} ≥ e^{ta}) ≤ \frac{E[e^{tx}]}{e^{ta}},   (8.33)

where µ and σ² are the mean and the variance of x.


We will get back to a discussion of these and some other useful probabilistic inequalities
in the lecture devoted to entropy and to how to compare probabilities.

Exercise 8.1.11 (not graded). Play, e.g. in IJulia (notebook linked to the lecture), checking
the three inequalities for the distributions mentioned throughout the lecture. Provide
examples of distributions for which the three inequalities are saturated (become equalities).

8.2 Random Variables: from one to many.


8.2.1 Law of Large Numbers

Take n samples x_1, · · · , x_n generated i.i.d. from a distribution with mean µ and variance
σ² > 0, and compute y_n = \sum_{i=1}^{n} x_i / n. What is Prob(y_n)? The quantity \sqrt{n}(y_n − µ)
converges in distribution to a zero-mean Gaussian with variance σ²:

\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} x_i − µ\right) → N(0, σ²).   (8.34)

This is the so-called weak version of the Central Limit Theorem (CLT). (The strong version,
discussed below, is closely related to the theory of large deviations.)

Let us sketch the proof of the weak CLT (8.34) in the simple case µ = 0, σ = 1, working
with the rescaled sum \sqrt{n}\, y_n = (x_1 + · · · + x_n)/\sqrt{n}. Obviously, m_1(\sqrt{n}\, y_n) = 0. Compute

m_2(\sqrt{n}\, y_n) = E\left[\left(\frac{x_1 + · · · + x_n}{\sqrt{n}}\right)^2\right] = \frac{\sum_i E[x_i^2]}{n} + \frac{\sum_{i≠j} E[x_i x_j]}{n} = 1.

Now the third moment:

m_3(\sqrt{n}\, y_n) = E\left[\left(\frac{x_1 + · · · + x_n}{\sqrt{n}}\right)^3\right] = \frac{\sum_i E[x_i^3]}{n^{3/2}} → 0,

at n → ∞, assuming E[x_i^3] = O(1). Can you guess what will happen with the fourth
moment? m_4(\sqrt{n}\, y_n) → 3 = 3 m_2², which is the Gaussian value. This is related to the
so-called Wick's theorem (see discussion in the next lecture).
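A quick numerical illustration in Julia (hedged: uniform samples, for which µ = 1/2 and σ² = 1/12, with made-up sample sizes):

```julia
# Moments of √n (y_n - µ) for i.i.d. uniform x_i approach those of N(0, σ²).
using Random, Statistics
Random.seed!(11)
n, trials = 500, 20_000
y = [sqrt(n) * (mean(rand(n)) - 0.5) for _ in 1:trials]
println(mean(y))                 # ≈ 0
println(var(y))                  # ≈ σ² = 1/12 ≈ 0.0833
println(mean(y .^ 4) / var(y)^2) # ≈ 3, the Gaussian value (Wick's theorem)
```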

Example 8.2.1 (Sum of Gaussian variables). Compute, exactly, the probability density
p_n(y_n) of the random variable y_n = n^{−1}\sum_{i=1}^{n} x_i, where x_1, x_2, . . . , x_n are sampled
i.i.d. from the normal distribution

p(x) = N(µ, σ²) = \frac{1}{σ\sqrt{2π}} \exp\left(−\frac{(x − µ)^2}{2σ^2}\right).

Solution:
Recall that moments over p(x) can be calculated via the characteristic function

G(k) \doteq \int e^{ikx} p(x)\, dx = \exp\left(iµk − \frac{σ²k²}{2}\right),

resulting, e.g. at µ = 0, in

⟨x^{2n}⟩ = \frac{1}{i^{2n}} \left.\frac{∂^{2n} G(k)}{∂k^{2n}}\right|_{k=0} = \frac{(2n)!}{2^n n!} σ^{2n},\qquad ⟨x^{2n+1}⟩ = 0.

Then the characteristic function of the distribution p_n(y_n) is

G_n(k) = (G(k/n))^n = \exp\left(iµk − \frac{σ²k²}{2n}\right).   (8.35)

The inverse Fourier transform of G_n(k) results in

p_n(y_n) = \int_{−∞}^{+∞} \frac{dk}{2π} G_n(k) e^{−iky_n} = \int_{−∞}^{+∞} \frac{dk}{2π} \exp\left(−ik(y_n − µ) − \frac{σ²k²}{2n}\right)   (8.36)
= \frac{\sqrt{n}}{σ\sqrt{2π}} \exp\left(−\frac{n(y_n − µ)^2}{2σ^2}\right).   (8.37)

Example 8.2.2 (Violation of the central limit theorem). Calculate the probability density
of the random variable y_n = n^{−1}\sum_{i=1}^{n} x_i, where x_1, x_2, . . . , x_n are independently
chosen from the Cauchy distribution with the probability density

p(x) = \frac{γ}{π} \frac{1}{x² + γ²},   (8.38)

and show that the CLT does not hold in this case. Explain why.
Solution:
The characteristic function of the Cauchy distribution is

G(k) = \int_{−∞}^{+∞} e^{ikx} \frac{γ}{π} \frac{dx}{x² + γ²} = e^{−γ|k|}.   (8.39)

The resulting characteristic function of y_n is

G_n(k) = (G(k/n))^n = G(k).   (8.40)

This expression shows that for any n the variable y_n is Cauchy-distributed with exactly the
same width parameter as the individual samples. The CLT is “violated” in this case because
we have ignored an important requirement/condition for the CLT to hold — the existence
of the variance. (See Example 8.1.3.)

Exercise 8.2.3. Assume that you play a dice game 100 times. The awards for the game are
as follows: $0.00 for 1, 3 or 5; $2.00 for 2 or 4; and $26.00 for 6.

(1) What is the expected value of your winnings?

(2) What is the standard deviation of your winnings?

(3) What is the probability that you win at least $200?

Exercise 8.2.4 (not graded). Check the Julia notebook for the lecture and experiment with
the law of large numbers for different distributions mentioned in the lecture.

The CLT holds for independent but not necessarily identically distributed variables too.
(That is, one can use different distributions generating different variables in the summed-up
sequence.)
If one is interested not only in the asymptotic, n → ∞, by itself but also in how the
asymptotic is approached, the so-called strong version of the CLT (also known under the name
of the Cramér theorem) states, for the normalized sum y_n = \sum_{i=1}^{n} x_i / n of the i.i.d. variables

x_i ∼ p_X(x),

∀z > µ :  \lim_{n→∞} \frac{1}{n} \log \text{Prob}(y_n ≥ z) = −Φ^*(z),   (8.41)
Φ^*(z) \doteq \sup_{λ∈R} (λz − Φ(λ)),   (8.42)
Φ(λ) \doteq \log\left(E \exp(λx)\right).   (8.43)

Here Φ(λ) is the logarithm of the moment generating function (8.14) of p_X(x), and Φ^*(z) is
its Legendre–Fenchel transform, also called the Cramér function. This was a formal
(mathematical) statement. A less formal (“physical”) version of Eq. (8.41) is

n → ∞ :  \text{Prob}(y_n) ∝ \exp(−nΦ^*(y_n)).   (8.44)

Note that the weak version of the CLT (8.34) is equivalent to approximating the Cramér
function (asymptotically exactly) by a quadratic around its minimum.

Exercise 8.2.5 (not graded). Prove the strong CLT (8.41, 8.42). [Hint: use the saddle-
point/stationary-point method to evaluate the integrals.] Give an example of an expectation
for which not only the vicinity of the minimum but also other details of Φ^*(z) are significant
at n → ∞. More specifically, give an example of an object whose behavior is controlled solely
by the left/right tail of Φ^*(z), and one controlled by Φ^*(0) and its vicinity.

Example 8.2.6. Compute the Cramér function for the Bernoulli process, i.e. a (generally
unfair) coin toss:

x = \begin{cases} 0 & \text{with probability } 1 − β\\ 1 & \text{with probability } β \end{cases}   (8.45)

Solution:

Φ(λ) = \log(βe^{λ} + 1 − β),   (8.46)
0 < z < 1 :  Φ^*(z) = z \log\frac{z}{β} + (1 − z) \log\frac{1 − z}{1 − β}.   (8.47)

Eqs. (8.46, 8.47) are noticeable for two reasons. First of all, they lead (after some alge-
braic manipulations) to the famous Stirling formula for the asymptotic of the factorial,

n! = \sqrt{2πn}\, n^n e^{−n} (1 + O(1/n)).

(Do you see how?) Second, the z log z structure is an “entropy” which will appear a number
of times in the following lectures — stay tuned.

Exercise 8.2.7. Consider n independent Poisson processes, i = 1, · · · , n : X_i ∼ Pois(λ_i),
each distributed according to its own rate, λ_i > 0. (That is, each of the random numbers in
the sequence is independent of the others, but they are not identically distributed.) Show that
the sum, Y = \sum_{i=1}^{n} X_i, is distributed according to the Poisson distribution with the rate
λ = \sum_{i=1}^{n} λ_i, i.e. Y ∼ Pois(λ).

8.2.2 Multivariate Distribution. Marginalization. Conditional Probability.

Consider an n-component vector built of components each taking a value from a set Σ:
ς = (ς_i ∈ Σ | i = 1, · · · , n). Σ may be discrete, e.g. Σ = {0, 1}, or continuous, e.g. Σ = R.
Assume that any state ς occurs with probability P(ς), where \sum_ς P(ς) = 1.
Consider a statistical version of the Ising model discussed in Section 7.5 in the discrete
optimization setting. (We used it back then to illustrate the application of Dynamic
Programming in combinatorial optimization.) We introduce the following probability
distribution over the 2^n states ς:

ς = (ς_i = ±1 | i = 1, · · · , n) :  P(ς) = Z^{−1} \prod_{i=1}^{n−1} \exp(Jς_iς_{i+1}),   (8.48)
Z = \sum_{ς} \prod_{i=1}^{n−1} \exp(Jς_iς_{i+1}),   (8.49)

where Z is the normalization constant, also called the partition function, introduced to
guarantee that the sum over all the states is unity. For n = 2 one gets an example of a
bi-variate probability distribution,

P(ς) = P(ς_1, ς_2) = \frac{\exp(Jς_1ς_2)}{4\cosh(J)}.   (8.50)
P(ς) is also called the joint probability distribution function of the ς vector components,
ς_1, · · · , ς_n. It is also useful to consider the conditional distribution; for the example above
with n = 2,

P(ς_1|ς_2) = \frac{P(ς_1, ς_2)}{\sum_{ς_1} P(ς_1, ς_2)} = \frac{\exp(Jς_1ς_2)}{2\cosh(J)}   (8.51)

is the probability to observe ς_1 under the condition that ς_2 is known. Notice that
\sum_{ς_1} P(ς_1|ς_2) = 1, ∀ς_2.
We can also marginalize the multivariate (joint) distribution over a subset of variables.
For example,

P(ς_1) = \sum_{ς \setminus ς_1} P(ς) = \sum_{ς_2, \cdots, ς_n} P(ς_1, \cdots, ς_n).   (8.52)

Multivariate Gaussian (Normal) distribution


Now let us consider n zero-mean random variables x_1, x_2, . . . , x_n drawn jointly from a
generic Gaussian distribution,

p(x_1, . . . , x_n) = \frac{1}{Z} \exp\left(−\frac{1}{2} \sum_{i,j=1,\cdots,n} x_i A_{ij} x_j\right),   (8.53)

where A is a symmetric, A = A^T, positive definite, A ≻ 0, matrix. If the matrix is diag-
onal then the probability distribution (8.53) decomposes into a product of terms, each
dependent on one of the variables; this is the special case when each of the random vari-
ables x_1, · · · , x_n is statistically independent of the others. Z in Eq. (8.53) is the normalization
factor, called the partition function, which is

Z = \frac{(2π)^{n/2}}{\sqrt{\det A}}.   (8.54)

The moments of the Gaussian distribution (allowing also for a general mean vector µ, i.e.
with x − µ replacing x in the exponent of Eq. (8.53)) are

∀i :  E[x_i] = µ_i;\qquad ∀i, j :  E[(x_i − µ_i)(x_j − µ_j)] = (A^{−1})_{ij} \doteq Σ_{ij},   (8.55)

where (A^{−1})_{ij} = Σ_{ij} denotes the (i, j) component of the inverse of the matrix A. The matrix Σ
(which is also symmetric and positive definite, as its inverse is by construction) is called
the covariance matrix. The standard notation for the multi-variate Gaussian with mean vector
µ = (µ_i | i = 1, · · · , n) and covariance matrix Σ is N(µ, Σ) or N_n(µ, Σ).
Gaussian distribution is remarkable because of its “invariance” properties.

Theorem 8.2.1 (Invariance of the Normal/Gaussian distribution under conditioning and marginal-
ization). Consider x ∼ N_n(µ, Σ) and split the n-dimensional random vector into two com-
ponents, x = (x_1, x_2), where x_1 is a p-component sub-vector of x and x_2 is a q-component
sub-vector of x, p + q = n. Assume also that the mean vector, µ, and the covariance matrix, Σ, are
split into components as follows:

µ = (µ_1, µ_2);\qquad Σ = \begin{pmatrix} Σ_{11} & Σ_{12}\\ Σ_{21} & Σ_{22} \end{pmatrix},   (8.56)

where thus µ_1 and µ_2 are p- and q-dimensional vectors and Σ_{11}, Σ_{12}, Σ_{21} and Σ_{22} are (p × p),
(p × q), (q × p) and (q × q) matrices. Then, the following two statements hold:

• Marginalization: p(x_1) \doteq \int dx_2\, p(x_1, x_2) is the Normal/Gaussian distribu-
tion N(µ_1, Σ_{11}).

• Conditioning: p(x_1|x_2) \doteq \frac{p(x_1, x_2)}{p(x_2)} is the Normal/Gaussian distribution N(µ_{1|2}, Σ_{1|2}),
where

µ_{1|2} \doteq µ_1 + Σ_{12}Σ_{22}^{−1}(x_2 − µ_2),\qquad Σ_{1|2} \doteq Σ_{11} − Σ_{12}Σ_{22}^{−1}Σ_{21}.   (8.57)

Proof of the theorem is recommended as a useful technical exercise (not graded) which
requires direct use of some basic linear algebra. (You will need to use or derive the explicit
formula for the inverse of a positive definite matrix split into four blocks, as in Eq. (8.56).)
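A hedged Julia sketch for experimenting with the theorem (the mean, covariance and sample size are made up; samples of N(µ, Σ) are produced with the standard Cholesky trick):

```julia
# Sample x ~ N(µ, Σ) via the Cholesky factor of Σ and check the empirical
# mean and covariance; sub-blocks of Σ can then be compared with Eq. (8.57).
using LinearAlgebra, Random, Statistics
Random.seed!(3)
µ = [1.0, -2.0]
Σ = [2.0 0.6; 0.6 1.0]                  # symmetric, positive definite
L = cholesky(Σ).L
M = µ .+ L * randn(2, 100_000)          # each column is one sample
println(mean(M, dims=2))                # ≈ µ
println(cov(M, dims=2))                 # ≈ Σ
```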

8.2.3 Bayes Theorem

We already saw how to get the conditional and marginal probability distributions from the
joint probability distribution:

P(x|y) = \frac{P(x, y)}{P(y)},\qquad P(y|x) = \frac{P(x, y)}{P(x)}.   (8.58)

Combining the two formulas to exclude the joint probability distribution we arrive at the
famous Bayes formula

P(x|y)P(y) = P(y|x)P(x).   (8.59)

Here, in Eqs. (8.58, 8.59), both x and y may be multivariate. Rewriting Eq. (8.59) as

P(x|y) = \frac{P(y|x)P(x)}{P(y)},   (8.60)

one often refers (in the field of the so-called Bayesian inference/reconstruction) to P(x) as
the “prior” probability distribution, which measures the degree of the initial “belief” in x.
Then P(x|y), called the “posterior”, measures the degree of the (statistical) dependence
of x on y, and the quotient P(y|x)/P(y) represents the “support/knowledge” y provides about x.
A good visual illustration of the notion of conditional probability can be found at
http://setosa.io/ev/conditional-probability/
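As a sanity check, the Bayes formula can be verified numerically on the bi-variate Ising example (8.50). A hedged Julia sketch (the coupling J is made up):

```julia
# Verify P(ς1|ς2) P(ς2) = P(ς2|ς1) P(ς1) for the joint distribution (8.50).
J = 0.8
states = (-1, 1)
P(s1, s2) = exp(J * s1 * s2) / (4 * cosh(J))      # joint, Eq. (8.50)
P1(s1) = sum(P(s1, s2) for s2 in states)          # marginal of ς1
P2(s2) = sum(P(s1, s2) for s1 in states)          # marginal of ς2
cond12(s1, s2) = P(s1, s2) / P2(s2)               # P(ς1|ς2)
cond21(s2, s1) = P(s1, s2) / P1(s1)               # P(ς2|ς1)
ok = all(cond12(a, b) * P2(b) ≈ cond21(b, a) * P1(a)
         for a in states, b in states)
println(ok)   # true: both sides equal the joint P(ς1, ς2)
```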

Exercise 8.2.8. The joint probability density of two real random variables X_1 and X_2 is

∀x_1, x_2 ∈ R :  p(x_1, x_2) = \frac{1}{Z} \exp(−x_1² − x_1x_2 − x_2²).   (8.61)

(1) Calculate the normalization constant Z.

(2) Calculate the marginal probability p(x_1).

(3) Calculate the conditional probability p(x_1|x_2).

(4) Calculate the moments E[X_1²X_2²], E[X_1X_2³], E[X_1⁴X_2²] and E[X_1⁴X_2⁴].

8.3 Information-Theoretic View on Randomness


8.3.1 Entropy.

Entropy is defined as the expectation of the (−log)-probability:

H = −E_{P(X)}[\log(P(X))] = −\sum_{x∈X} P(x) \log(P(x)),   (8.62)

where x is drawn from the space X. Intuitively, entropy is a measure of uncertainty. The
entropy of a deterministic process, that is a process where a state takes a value, say x_0, with
probability 1, is zero. Indeed, in Eq. (8.62) we use the convention 0 log 0 = 0.
One remark on notations before we proceed further. The notation used in Eq. (8.62)
should be considered a shortcut. A more accurate notation would be H(X) on the
left-hand side of Eq. (8.62), where thus X is the random variable which can take a value
x ∈ X. Following the tradition of information theory, we use H for entropy. Beware of an
alternative notation, S, customary in Statistical Physics.
Importantly, the logarithm of the probability distribution is chosen as the measure
of information in the definition of entropy (the logarithm and not some other function)
because it is additive for independent sources.
Let us familiarize ourselves with the concept of entropy on the example of the Bernoulli {0, 1}
process (8.45). In this case there are only two states, P(X = 1) = β and P(X = 0) = 1−β,
and therefore

H = −β \log β − (1 − β) \log(1 − β).   (8.63)

Notice that H, considered as a function of β, has a bell-like shape with the maximum at
β = 1/2. Therefore β = 1/2, corresponding to the fair coin in the process of coin flipping,
is the most uncertain case (maximum entropy). The entropy is zero at β = 0 and β = 1, as
both of these cases are deterministic, i.e. fully certain and thus least uncertain. (See the
accompanying IJulia file.)
The expression for entropy (8.62) has the following properties (some of these can be
interpreted as alternative definitions):

• H ≥ 0.

• H = 0 iff the process is deterministic, i.e. ∃x s.t. P(x) = 1.

• H ≤ log(|X|), and H = log(|X|) iff x is distributed uniformly over the set X.

• The choice of the logarithm base is a matter of convention — just a re-scaling. (Base 2 is
customary in information theory, when dealing with binary variables.)

Figure 8.1: Venn diagram(s) explaining the chain rule for computing multivariate entropy.

• Entropy is the measure of average uncertainty.

• Entropy is no larger than the average number of bits needed to describe the random variable
(the equality is achieved for the uniform distribution). (*)

• Entropy is a lower bound on the average length of the shortest description of a
random variable.

(*) requires a clarification. Take the integers less than or equal to n and represent
them in the binary system. We need about log₂(n) binary variables (bits) to represent any
of the integers. If all the integers are equally probable then log₂(n) is exactly the entropy
of the distribution. If the random variable is distributed non-uniformly then the entropy is
less than this estimate.
The notion of entropy naturally extends to multivariate statistics. If we have a pair
of discrete random variables, X and Y, taking values x ∈ X and y ∈ Y respectively, their
joint entropy is

H(X, Y) \doteq −\sum_{x∈X, y∈Y} P(x, y) \log(P(x, y)),   (8.64)

and the conditional entropy is

H(Y|X) \doteq −E_{P(X,Y)}[\log(P(Y|X))] = −\sum_{x∈X, y∈Y} P(x, y) \log(P(y|x)).   (8.65)

Note that H(Y|X) ≠ H(X|Y).



The definitions of the joint and conditional entropies naturally lead to the following relation
between the two,

H(X, Y) = H(X) + H(Y|X),   (8.66)

derived from the Bayes theorem. (Checking it is a good exercise.) Eq. (8.66) is also called
the chain rule.
One can naturally extend the chain rule from the bi-variate to the multi-variate case,
(X_1, · · · , X_n) ∼ P(x_1, · · · , x_n), as follows:

H(X_n, · · · , X_1) = \sum_{i=1}^{n} H(X_i|X_{i−1}, · · · , X_1).   (8.67)

Notice that the choice of the order in the chain is arbitrary. The name “chain rule” should
become clear from (8.67). See also Fig. (8.1) for the Venn diagram illustration of the chain
rule.

8.3.2 Independence, Dependence, and Mutual Information.

The essence of our next theme is comparing random numbers, or more accurately their
probabilities. The Kullback–Leibler (KL) divergence offers a convenient way of doing the com-
parison:

D(P_1‖P_2) \doteq \sum_{x∈X} P_1(x) \log\frac{P_1(x)}{P_2(x)}.   (8.68)

Note that the KL divergence is not symmetric, i.e. D(P_1‖P_2) ≠ D(P_2‖P_1). Moreover it is
not a proper metric of comparison, as it does not satisfy the so-called triangle inequality.
Any proper metric, d_{ab}, for elements a and b from a space should be a) non-negative (for all
elements of the space), b) zero when comparing identical states, i.e. d_{aa} = 0; c) symmetric,
i.e. d_{ab} = d_{ba}; and d) satisfy the triangle inequality, d_{ab} ≤ d_{ac} + d_{bc}. The last two conditions
do not hold in the case of the KL divergence. However, an infinitesimal version of the KL
divergence — the Hessian of the KL divergence around its minimum, also called the Fisher
information — constitutes a proper metric.

Exercise 8.3.1. Assume that a random variable X_2 is generated by a known probability
distribution P_2(x), where x ∈ X and X is finite. Consider the KL divergence, D(P_1‖P_2),
as a function of a vector (P_1(x)|x ∈ X), with all the |X| components non-negative and
related to each other via the probability normalization condition, \sum_{x∈X} P_1(x) = 1. Show

Figure 8.2: Venn diagram explaining relations between the mutual information and respec-
tive entropies.

that D(P_1‖P_2) is non-negative and achieves its minimum at ∀x ∈ X : P_1(x) = P_2(x), i.e.

\arg\min_{(P_1(x)|x∈X):\ \sum_{x∈X} P_1(x) = 1,\ ∀x∈X: P_1(x) ≥ 0} D(P_1‖P_2) = (P_2(x)|x ∈ X).   (8.69)

Comparing two information sources, say tracking events x and y, one assumption,
which is rather dramatic, may be that the probabilities are independent, i.e. P(x, y) =
P(x)P(y), and then P(x|y) = P(x) and P(y|x) = P(y). Mutual information, which we
are about to discuss, will be zero in this case. Thus, naturally, the mutual information is
introduced as a measure of dependence:

I(X; Y) = E_{P(x,y)}\left[\log\frac{P(x, y)}{P(x)P(y)}\right] = \sum_{x∈X}\sum_{y∈Y} P(x, y) \log\frac{P(x, y)}{P(x)P(y)}.   (8.70)

Intuitively the mutual information measures the information that X and Y share. In other
words, it measures how much knowing one of these random variables reduces uncertainty
about the other. For example, if X and Y are independent, then knowing X does not give
any information about Y and vice versa — the mutual information is zero. In the other
extreme, if X is a deterministic function of Y then all information conveyed by X is shared
with Y. In this case the mutual information is the same as the uncertainty contained in X
itself (or Y itself), namely the entropy of X (or Y).
The mutual information is obviously related to the respective entropies,

I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X) = H(X) + H(Y) − H(X, Y).   (8.71)

The relation is illustrated in Fig. (8.2). Mutual information also possesses the following
properties:

I(X; Y) = I(Y; X)  (symmetry)   (8.72)
I(X; X) = H(X)  (self-information)   (8.73)

Figure 8.3: Venn diagram explaining the chain rules for mutual information.

The conditional mutual information between X and Y given Z is

I(X; Y|Z) \doteq H(X|Z) − H(X|Y, Z) = E_{P(x,y,z)}\left[\log\frac{P(x, y|z)}{P(x|z)P(y|z)}\right].   (8.74)

The entropy chain rule (8.66), when applied to the mutual information of (X_1, · · · , X_n) ∼
P(x_1, · · · , x_n), results in

I(X_n, · · · , X_1; Y) = \sum_{i=1}^{n} I(X_i; Y|X_{i−1}, · · · , X_1).   (8.75)

See Fig. (8.3) for the Venn diagram illustration of Eq. (8.75).
See [15] for extra discussions of entropy, mutual information and related notions.

8.3.3 Probabilistic Inequalities for Entropy and Mutual Information

Let us now discuss the case when a one-dimensional random variable, X, is drawn from the
space of reals, x ∈ R, with probability density p(x). Now consider averaging a convex
function of X, f(X). One observes that the following statement, called the Jensen inequality,
holds:

E[f(X)] ≥ f(E[X]).   (8.76)

Obviously the statement becomes an equality when p(x) is a point mass, p(x) = δ(x − x_0).
To gain a bit more intuition, consider the case of a Bernoulli-like distribution,
p(x) = βδ(x − x_1) + (1 − β)δ(x − x_0). We derive

f(E[X]) = f(x_1β + x_0(1 − β)) ≤ βf(x_1) + (1 − β)f(x_0) = E[f(X)],   (8.77)

where the critical inequality in the middle is simply the expression of the convexity of the
function f(x) (taken verbatim from the definition).

Figure 8.4

See also Fig. (8.4) for another (graphical) hint at the proof of the Jensen inequality.
In fact, the Jensen inequality holds over any space. A mathematically accurate proof of
the Jensen inequality will be discussed in Math 527.
Notice that the entropy, considered as a function (or a functional in the continuous case)
of the probabilities of the particular states, is concave. This observation gives rise to multiple
consequences of the Jensen inequality (for the entropy and the mutual information):

• (Information Inequality)

D(p‖q) ≥ 0, with equality iff p = q.

• (Conditioning reduces entropy)

H(X|Y) ≤ H(X), with equality iff X and Y are independent.

• (Independence Bound on Entropy)

H(X_1, · · · , X_n) ≤ \sum_{i=1}^{n} H(X_i), with equality iff the X_i are independent.

Another useful inequality is the [Log-Sum Theorem]:

\sum_{i=1}^{n} a_i \log\frac{a_i}{b_i} ≥ \left(\sum_{i=1}^{n} a_i\right) \log\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},   (8.78)

with equality iff a_i/b_i is constant. Conventions: 0 log 0 = 0, a log(a/0) = ∞ if a > 0, and
0 log(0/0) = 0. Consequences of the Log-Sum theorem:

• (Convexity of Relative Entropy) D(p‖q) is convex in the pair p and q.

• (Concavity of Entropy) For X ∼ P(x), H(P) \doteq H_P(X) (notations are ex-
tended) is a concave function of P(x).

• (Concavity of the mutual information in P(x)) Let (X, Y) ∼ P(x, y) = P(x)P(y|x).
Then I(X; Y) is a concave function of P(x) for fixed P(y|x).

• (Convexity of the mutual information in P(y|x)) Let (X, Y) ∼ P(x, y) = P(x)P(y|x).
Then I(X; Y) is a convex function of P(y|x) for fixed P(x).

We will see later (discussing Graphical Models) why the convexity/concavity properties of
the entropy-related objects are useful.

Example 8.3.2. Prove that H(X) ≤ log₂ n, where n is the number of possible values of
the random variable x ∈ X.
Solution. The simplest proof is via the Jensen inequality. It states that if f is a convex
function and u is a random variable then

E[f(u)] ≥ f(E[u]).   (8.79)

Let us define

f(u) = −\log₂ u,\qquad u = 1/P(x).

Obviously, f(u) is convex. In accordance with (8.79) one obtains

E[\log₂ P(x)] ≥ −\log₂ E[1/P(x)],

where E[\log₂ P(x)] = −H(X) and E[1/P(x)] = n, so H(X) ≤ log₂ n.

Note, in passing, that the Jensen inequality leads to a number of other useful expres-
sions for entropy, e.g. H(X|Y) ≤ H(X) with equality iff X and Y are independent, and,
more generally, H(X_1, . . . , X_n) ≤ \sum_{i=1}^{n} H(X_i) with equality iff all the X_i are independent.
P

Example 8.3.3. The so-called Zipf law states that the frequency of the n-th most frequent
word in a randomly chosen English document can be approximated by

p_n = \begin{cases} \frac{0.1}{n}, & \text{for } n ∈ 1, . . . , 12367\\ 0, & \text{for } n > 12367 \end{cases}   (8.80)

Under the assumption that English documents are generated by picking words at random
according to Eq. (8.80), compute the entropy of this made-up English per word and also per
character. Interpret the results.
Solution. Substituting the distribution (8.80) into the definition of entropy one derives

H = −\sum_{n=1}^{12367} \frac{0.1}{n} \log_2\frac{0.1}{n} ≈ \int_{10}^{123670} dx\, \frac{0.1}{\ln 2}\, \frac{\ln x}{x} = \frac{1}{20 \ln 2}\left(\ln^2 123670 − \ln^2 10\right) ≈ 9.9 \text{ bits}.

Let us now estimate the entropy of this English per character. The resulting entropy
is fairly low, ∼ 1 bit. Thus, the character-based entropy of a typical English text is much
smaller than its entropy per word. This result is intuitively clear: after the first few letters
one can often guess the rest of the word, but prediction of the next word in the sentence is
a less trivial task.
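The per-word entropy is also easy to evaluate exactly in Julia (a one-line check of both the normalization and the integral estimate above):

```julia
# Exact entropy of the Zipf distribution (8.80): the direct sum gives
# ≈ 9.7 bits, consistent with the ≈ 9.9 integral estimate above.
p = [0.1 / n for n in 1:12367]
println(sum(p))                    # ≈ 1.000: the distribution is normalized
H = -sum(x * log2(x) for x in p)
println(H)                         # ≈ 9.7 bits per word
```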

Exercise 8.3.4. The joint probability distribution P (x, y) of two random variables X and
Y is described in Table 8.1. Calculate the marginal probabilities P (x) and P (y), conditional
probabilities P (x|y) and P (y|x), marginal entropies H(X) and H(Y ), as well as the mutual
information I(X; Y ).

P(x, y)      x1      x2      x3      x4      P(y)
    y1       1/8     1/16    1/32    1/32    1/4
    y2       1/16    1/8     1/32    1/32    1/4
    y3       1/16    1/16    1/16    1/16    1/4
    y4       1/4     0       0       0       1/4
    P(x)     1/2     1/4     1/8     1/8

Table 8.1: Exemplary joint probability distribution function P(x, y) and the marginal prob-
ability distributions, P(x), P(y), of the random variables x and y.
Chapter 9

Stochastic Processes

9.1 Markov Chains [discrete space, discrete time]


9.1.1 Transition Probabilities

So far we have studied random variables and events, often assuming that these are i.i.d. —
independent and identically distributed. However, in the real world we “jump” from one random
state to another, so that the consecutive states are dependent. The memory may last for
more than one jump; however, there is also a big family of interesting random processes
which do not have long memory — only the current state influences where we jump to. This is
the class of random processes described by Markov Chains (MCs).
Before we define a Markov chain, it is a good idea to watch the introductory video (linked
to the lecture), which explains the origin of Markov chains and briefly describes what they are.
A Markov chain is a stochastic process with no memory other than its current state.
We can think of a Markov chain as a random walk on a directed graph, where vertices
correspond to states and edges correspond to transitions between states. Each edge i → j
is associated with the probability p(j ← i) of going from the state i to the state j. A useful
interactive playground can be found online (link in the lecture materials).
MCs can be explained in terms of directed graphs, G = (V, E), where the set of vertices,
V = (i), is associated with the set of states, and the set of directed edges, E = (j ← i),
corresponds to possible transitions between the states. Note that we may also have “self-
loops”, (i ← i), included in the set of edges. To make the description complete we need to
associate with each edge a transition probability, p_{j←i} = p_{ji}, from the state i to the state j.
Since p_{ji} is a probability, ∀(j ← i) ∈ E : p_{ji} ≥ 0, and

∀i :  \sum_{j:(j←i)∈E} p_{ji} = 1.   (9.1)


Then the combination of G and p \doteq (p_{ji} | (j ← i) ∈ E) defines an MC. Mathematically we also
say that the tuple (finite ordered set of elements), (V, E, p), defines the Markov chain. In the
following we will mainly consider stationary Markov chains, i.e. those with p_{ji} constant
— not changing in time. However, for many of the following statements/considerations
the generalization to time-dependent processes is straightforward.
An MC generates a random (stochastic) dynamic process. Time flows continuously; however,
as a matter of convenient abstraction we consider discrete times (and sometimes, actually
quite often, events do happen discretely). One uses t = 0, 1, 2, · · · for the times when jumps
occur. Then a particular random trajectory/path/sample of the system will look like

i_1(0), i_2(1), · · · , i_k(t_k),  where i_1, · · · , i_k ∈ V.

We can also generate many samples (many trajectories),

n = 1, · · · , N :  i_1^{(n)}(0), i_2^{(n)}(1), · · · , i_k^{(n)}(t_k),  where i_1, · · · , i_k ∈ V,

where N is the number of trajectories.


How does one relates the directed graph with weights (associated to the transition
probabilities) to samples? The relation, actually, has two sides. The direct one - is about
how one generates samples. The samples are generated by advancing the trajectory from
the current-time state flipping coin according to the transition probability pij . The inverse
side is about reconstructing characteristics of Markov chain from samples or verifying if the
samples where indeed generated according to (rather restrictive) MC rules.
Now let us get back to the direct problem, where an MC is described in terms of (V, E, p).
Instead of characterizing the system in terms of trajectories/paths/samples,
we can pose the question of following the evolution of the “state probability vector”, or simply
the “state vector”:

∀i ∈ V, ∀t = 0, · · · :  π_i(t + 1) = \sum_{j:(i←j)∈E} p_{ij} π_j(t).   (9.2)

Here π(t) \doteq (π_i(t) ≥ 0 | i ∈ V) is the vector built of components each representing the probability
for the system to be in the state i at the moment of time t. Thus \sum_{i∈V} π_i(t) = 1. We can
also rewrite Eq. (9.2) in the vector/matrix form

π(t + 1) = pπ(t),   (9.3)

where π(t) is the state column-vector and p is the transition-probability matrix, which satisfies
the so-called “stochasticity” property (9.1), needed to preserve the total probability.

[Figure 9.1 here: two states, 1 and 2, with self-loop probabilities 0.7 and 0.5, and transition probabilities 0.3 (from 1 to 2) and 0.5 (from 2 to 1).]
Figure 9.1: An exemplary Markov Chain (MC).

Definition 9.1.1. A matrix is called stochastic if all of its components are nonnegative
and each column sums to 1.

Sequential application of Eq. (9.3) results in

π(t + k) = p^k π(t),   (9.4)

and we are interested in analyzing the properties of p^k, characterizing the Markov chain acting
for k sequential periods.
Let us first study it on the example of the simple MC illustrated in Fig. (9.1). In this case
p^k is a 2 × 2 matrix which depends on k as follows:

p^1 = \begin{pmatrix} 0.7 & 0.5\\ 0.3 & 0.5 \end{pmatrix},\quad p^2 = \begin{pmatrix} 0.64 & 0.6\\ 0.36 & 0.4 \end{pmatrix},\quad p^{10} ≈ p^{100} ≈ \begin{pmatrix} 0.625 & 0.625\\ 0.375 & 0.375 \end{pmatrix}.   (9.5)
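These powers are immediate to reproduce in Julia (a short check; the matrix is the one of Fig. (9.1)):

```julia
# Powers of the column-stochastic transition matrix of Fig. (9.1):
# columns of p^k converge to the stationary vector (0.625, 0.375).
using LinearAlgebra
p = [0.7 0.5; 0.3 0.5]
println(p^2)      # ≈ [0.64 0.6; 0.36 0.4]
println(p^10)     # ≈ [0.625 0.625; 0.375 0.375]
println(p^100)    # the same, to machine precision
```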

9.1.2 Properties of Markov Chains

Definition 9.1.2 (Irreducibility of an MC). An MC is irreducible if one can access any state
from any state; formally,

∀i, j ∈ V :  ∃k ≥ 1 s.t. (p^k)_{ij} > 0.   (9.6)

The example of Eq. (9.5) is obviously irreducible. However, if we replace 0.3 → 0 and
0.7 → 1 the MC becomes reducible — state 2 is no longer accessible from state 1.

13
A C
1

23 1
1

A C

1
B

Figure 9.2: Some examples of Markov chains.

Definition 9.1.3 (Aperiodicity of an MC). A state i has period k if any return to the state
must occur in multiples of k. Formally, the period of state i is

k = greatest common divisor {n > 0 : Prob(x_n = i|x_0 = i) > 0},

provided that the set is not empty (otherwise the period is not defined). If k = 1 then the
state is aperiodic. An MC is aperiodic if all its states are aperiodic.

An irreducible MC only needs one aperiodic state to imply that all states are aperiodic. Any
MC with at least one self-loop is aperiodic. The example of Eq. (9.5) is obviously aperiodic;
however, it becomes periodic with period two if the two self-loops are removed.

Exercise 9.1.1 (Not graded.). Consider the two MC examples shown in Fig. 9.2. Are these
MCs reducible or irreducible? Periodic or aperiodic?

Definition 9.1.4. A state i is said to be transient if, given that we start in state i, there is a
non-zero probability that we will never return to i. State i is recurrent if it is not transient.
State i is positive-recurrent if the expected return time to the state is finite.

Notice that positive recurrence is an important feature for the analysis of MCs over infinite graphs.

Exercise 9.1.2 (not graded). Give an example of a Markov chain with an infinite number
of states, which is irreducible and aperiodic (prove it), but which does not converge to an
equilibrium probability distribution.

Definition 9.1.5 (Ergodicity of an MC). A state is ergodic if the state is aperiodic and
positive-recurrent. If all states in an irreducible MC are ergodic then the MC is ergodic. An
MC is ergodic if there is a finite number k* such that any state can be reached from any
other state in exactly k* steps.

[Figure 9.3 here: (a) a two-state Bernoulli Markov chain with transition probabilities 0.3 and 0.7; (b) walking on a hypercube.]

Figure 9.3: Illustration of sampling.

For the example of Eq. (9.5) k∗ = 2.


Note that there are other (alternative) descriptions of ergodicity. A particularly intuitive
one is: the MC is ergodic if it is aperiodic and irreducible. Notice that if
we replace positive-recurrence by irreducibility in the definition of ergodicity, the ergodicity
still holds. However, the combination of irreducibility and positive-recurrence (without
aperiodicity) does not guarantee ergodicity. In this course we will not go into the related
mathematical formalities and details, largely considering generic, i.e. ergodic, MCs.
Practical consequences of ergodicity are that the steady state is unique and universal.
Universality means that the steady state does not depend on the initial condition.

9.1.3 Sampling

As already mentioned above, MCs are widely used to generate samples of some distribution.
One can imagine a particle which travels over a graph according to the edges' weights. After
some time (for an ergodic chain) the probability distribution of the particle becomes stationary
(one says that the chain is mixed), and then the trajectory of the particle represents a
sample of the distribution. Analyzing the trajectory you can say a lot about the distribution, e.g.
calculate moments and expectation values of functions.
In Figure 9.3a we see a Markov chain which corresponds to the Bernoulli distribution
with probability of success equal to 0.7. A more complicated example is shown in
Figure 9.3b. Imagine that you need to generate a random string of n bits. There are 2^n
possible configurations. You can organize these configurations in a hypercube graph. The
hypercube has 2^n vertices and each vertex has n neighbors, corresponding to the strings
that differ from it at a single bit. Our Markov chain will walk along these edges and flip
one bit at a time. The trajectory after a long time will correspond to a series of random
strings. The important question is how long we should wait before our Markov chain
becomes mixed (loses memory of the initial condition). To answer this question we should
look at the MC from a more mathematical point of view.

9.1.4 Steady State Analysis

Theorem 9.1.6 (Existence of a Stationary Distribution). A component-wise positive, normal-
ized vector π* is called a stationary distribution (invariant measure) if

π* = pπ*.   (9.7)

An irreducible MC has a stationary distribution iff all of its states are positive recurrent.

Proof of this (and the other statements used in this Section) will be discussed in Math 527.
Solving Eq. (9.7) for the example of Eq. (9.5) one finds

π* = \begin{pmatrix} 0.625\\ 0.375 \end{pmatrix},   (9.8)

which is naturally consistent with Eq. (9.5). In general,

π* = \frac{e}{\sum_i e_i},   (9.9)

where e is the eigenvector with eigenvalue 1. And what about the other eigenvalues of the
transition matrix?

9.1.5 Spectrum of the Transition Matrix & Speed of Convergence to the
Stationary Distribution

Assume that p is diagonalizable (i.e. it has n = |V| linearly independent eigenvectors); then we
can decompose p according to the eigen-decomposition

p = U^{−1}ΣU,   (9.10)

where Σ = diag(λ_1, · · · , λ_n), with 1 = |λ_1| > |λ_2| ≥ |λ_3| ≥ · · · ≥ |λ_n|, and the columns of
U^{−1} are the right eigenvectors of p (each normalized to unit l_2 norm). Then

π^{(k)} = p^k π_0 = (U^{−1}ΣU)^k π_0 = U^{−1}Σ^k U π_0.   (9.11)

13
13

16
C B

56
13

56
16

Figure 9.4: Illustration of the Detailed Balance (DB).

Let us represent π_0 as an expansion over the normalized eigenvectors u_i, i = 1, · · · , n:

π_0 = \sum_{i=1}^{n} a_i u_i.   (9.12)

One then derives

π^{(k)} = λ_1^k \left(a_1 u_1 + a_2 \left(\frac{λ_2}{λ_1}\right)^k u_2 + · · · + a_n \left(\frac{λ_n}{λ_1}\right)^k u_n\right).   (9.13)

Since π^{(k)} → π* ∝ u_1 as k → ∞, the second term on the rhs of Eq. (9.13) describes the rate of
convergence of π^{(k)} to the steady state. The convergence is exponential, with the rate set by
log(|λ_1|/|λ_2|).

Example 9.1.3. Find the eigenvalues for the MC shown in Fig. (9.4), with the transition
matrix

p = \begin{pmatrix} 0 & 5/6 & 1/3\\ 5/6 & 0 & 1/3\\ 1/6 & 1/6 & 1/3 \end{pmatrix}.   (9.14)

What defines the speed of the MC convergence to the steady state?
Solution:
Let us start by noticing that p is stochastic. If the initial probability distribution is π(0),
then the distribution after t steps is

π(t) = p^t π(0).   (9.15)



As t increases, π(t) approaches a stationary distribution π* (since the Markov chain is
ergodic — this property is easy to check for this MC), such that

pπ* = π*.   (9.16)

Thus π* is the eigenvector of p with eigenvalue 1, with all components positive and normal-
ized. The matrix (9.14) has three eigenvalues, λ_1 = 1, λ_2 = 1/6, λ_3 = −5/6, and the correspond-
ing eigenvectors are

π* = \left(\frac{2}{5}, \frac{2}{5}, \frac{1}{5}\right)^T,\quad u_2 = \left(−\frac{1}{2}, −\frac{1}{2}, 1\right)^T,\quad u_3 = (−1, 1, 0)^T.   (9.17)

Suppose that we start in the state “A”, i.e. π(0) = (1, 0, 0)^T. We can write the initial state
as a linear combination of the eigenvectors,

π(0) = π* − \frac{u_2}{5} − \frac{u_3}{2},   (9.18)

and then

π(t) = p^t π(0) = π* − \frac{λ_2^t}{5} u_2 − \frac{λ_3^t}{2} u_3.   (9.19)

Since |λ_2| < 1 and |λ_3| < 1, in the limit t → ∞ we obtain π(t) → π*. The speed of
convergence is defined by the eigenvalue (λ_2 or λ_3) which has the greatest absolute value.
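This spectral computation is easy to verify with the LinearAlgebra standard library (a hedged sketch; the ordering of the eigenvalues returned by `eigen` is not guaranteed, hence the explicit search for the unit eigenvalue):

```julia
# Verify the spectrum of Eq. (9.14) and recover the stationary state.
using LinearAlgebra
p = [0 5/6 1/3; 5/6 0 1/3; 1/6 1/6 1/3]
F = eigen(p)
println(F.values)                           # ≈ -5/6, 1/6, 1 (in some order)
v = F.vectors[:, argmax(real.(F.values))]   # eigenvector of λ = 1
println(v ./ sum(v))                        # ≈ [0.4, 0.4, 0.2] = π*
```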

Note that the considered situation generalizes to the following powerful statement (see
[16] for details):

Theorem 9.1.7 (Perron–Frobenius Theorem). An ergodic Markov chain with transition ma-
trix p has a unique eigenvector π* with eigenvalue 1, and all its other eigenvectors have
eigenvalues with absolute value less than 1.

9.1.6 Reversible & Irreversible Markov Chains.

An MC is called reversible if there exists π* s.t.

∀{i, j} ∈ E :  p_{ji}π_i^* = p_{ij}π_j^*,   (9.20)

where {i, j} is our notation for the undirected edge, assuming that both directed edges
(i ← j) and (j ← i) are elements of the set E. In physics this property is also called
Detailed Balance (DB). If one introduces the so-called ergodicity matrix

Q \doteq (Q_{ji} = p_{ji}π_i^* | (j ← i) ∈ E),   (9.21)

then DB translates into the statement that Q is symmetric, Q = Q^T. An MC for which
the property does not hold is called irreversible: for an irreversible MC, Q − Q^T is nonzero,
i.e. Q is asymmetric.

The asymmetric component of Q is the matrix built from the currents/flows (of probability).
Thus for the case shown in Fig. (9.1),

Q = \begin{pmatrix} 0.7 · 0.625 & 0.5 · 0.375\\ 0.3 · 0.625 & 0.5 · 0.375 \end{pmatrix} = \begin{pmatrix} 0.4375 & 0.1875\\ 0.1875 & 0.1875 \end{pmatrix}.   (9.22)

Q is symmetric, i.e. even though p_{12} ≠ p_{21}, there is still no flow of probability from 1 to 2,
as the “populations” of the two states, π_1^* and π_2^* respectively, are different: Q_{12} − Q_{21} = 0.
In fact, one observes that in the two-node situation the steady state of an MC is always in
DB.
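A two-line Julia check of the detailed balance above (the chain and its stationary vector are those of Fig. (9.1)):

```julia
# Ergodicity matrix Q[j, i] = p[j, i] * πstar[i] for the two-state chain;
# its symmetry is exactly the detailed balance, reproducing Eq. (9.22).
p = [0.7 0.5; 0.3 0.5]
πstar = [0.625, 0.375]
Q = p .* πstar'          # broadcasts πstar over the columns
println(Q)               # [0.4375 0.1875; 0.1875 0.1875]
println(Q ≈ Q')          # true: Q is symmetric, DB holds
```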

9.1.7 Detailed Balance vs Global Balance. Adding cycles to accelerate
mixing.

Note that if a steady distribution π* satisfies the DB condition (9.20) for an MC, (V, E, p),
it will also be the steady state of another MC, (V, E, p̃), satisfying the more general Balance
(or global balance, B-) condition

\sum_{j:(j←i)∈E} p̃_{ji}π_i^* = \sum_{j:(i←j)∈E} p̃_{ij}π_j^*.   (9.23)

This suggests that many different MCs (many different dynamics) may result in the same
steady state. Obviously DB is a particular case of the B-condition (9.23).
The difference between the DB- and B-conditions can be nicely interpreted in terms of flows
(think water) in the state space. From the hydrodynamic point of view a reversible MC corre-
sponds to an irrotational probability flow, while irreversibility relates to a nonzero rotational
part, e.g. corresponding to vortices contained in the flow. Putting it formally, in the irre-
versible case the antisymmetric part of the ergodic flow matrix, Q = (p̃_{ij}π_j^* | (i ← j)), is nonzero
and it allows the following cycle decomposition:

Q_{ij} − Q_{ji} = \sum_{α} J_α \left(C_{ij}^α − C_{ji}^α\right),   (9.24)

where the index α enumerates cycles on the graph of states with adjacency matrices C^α,
and J_α stands for the magnitude of the probability flux flowing over the cycle α.
One can use the cycle decomposition to modify an MC such that the steady distribution
stays the same (invariant). Of course, cycles should be added with care, e.g. to make sure
that all the transition probabilities in the resulting p̃ are positive (stochasticity of the
matrix is guaranteed by construction). The procedure of “adding cycles”, along with
some additional tricks (e.g. the so-called lifting/replication), may help to improve mixing,
i.e. speed up convergence to the steady state — a very desirable property for
sampling π* efficiently.

Exercise 9.1.4 (not graded). Construct a Markov chain which mixes in the shortest time
regardless of the initial state, and which obeys the following properties: the state space
contains N states, and the desired stationary distribution is such that the probability to be
in a state i equals a known value p_i. What can you say about the eigenvalues of the corresponding
transition matrix? Construct the transition matrix explicitly.

Exercise 9.1.5 (Hardy-Weinberg Law). Consider an experiment of mating rabbits. We watch the evolution of a particular gene that appears in two types, G or g. A rabbit has a pair of genes, either GG (dominant), Gg (hybrid; the order is irrelevant, so gG is the same as Gg) or gg (recessive). In mating two rabbits, the offspring inherits a gene from each of its parents with equal probability. Thus, if we mate a dominant (GG) with a hybrid (Gg), the offspring is dominant with probability 1/2 or hybrid with probability 1/2. Start with a rabbit of given character (GG, Gg, or gg) and mate it with a hybrid. The offspring produced is again mated with a hybrid, and the process is repeated through a number of generations, always mating with a hybrid.
Note: The first experiment of such kind was conducted in 1858 by Gregor Mendel. He
started to breed garden peas in his monastery garden and analyzed the offspring of these
matings.
1) Write down the transition matrix P of the Markov chain thus defined. Is the Markov
chain irreducible and aperiodic?
2) Assume that we start with a hybrid rabbit. Let µn be the probability distribution
of the character of the rabbit of the n-th generation. In other words, µn (GG), µn (Gg),
µn (gg) are the probabilities that the n-th generation rabbit is GG, Gg, or gg, respectively.
Compute µ1 , µ2 , µ3 . Is there some kind of law?
3) Calculate P n for general n. What can you say about µn for general n?
4) Calculate the stationary distribution of the MC. Does the Detailed Balance hold in
this case?

9.2 Bernoulli and Poisson Processes [discrete space, discrete & continuous time]

The two processes discussed in the following are among the simplest dynamic random processes, and they serve as building blocks for others. Simplicity here is related to the fact that the processes are defined by the least number of characteristics. We will focus on important features of the processes, such as memorylessness (also called the Markov property), and will work out interesting (and rather general) questions one may ask (and answer).

9.2.1 Bernoulli Process: Definition

The Bernoulli process is defined as a sequence of independent Bernoulli trials, i.e. at each trial P(success) = P(x = 1) = β and P(failure) = P(x = 0) = 1 − β. The Bernoulli process can be represented as a simple MC (two nodes + two self-loops; please draw one). A sample sequence looks like 00101010001 = ∗∗S∗S∗S∗∗∗S, where S stands for "success".
Examples:

• Sequence of discrete updates – ups and downs (stock market).

• Sequence of lottery wins.

• Arrivals of buses at a station, checked every 1/5/? minutes.

9.2.2 Bernoulli: Number of Successes

As we discussed earlier in the course, the number of successes, k, in n trials follows the binomial distribution

∀k = 0, · · · , n : P(S = k|n) = \binom{n}{k} β^k (1 − β)^{n−k} (9.25)
mean : E[S] = nβ (9.26)
variance : var(S) = E[(S − E[S])²] = nβ(1 − β) (9.27)

Let us now discuss dynamic characteristics of the Bernoulli process.

9.2.3 Bernoulli: Distribution of Arrivals

Call T1 the number of trials until the first success (including the success trial itself). The Probability Mass Function (PMF) for the time of the first success is

t = 1, 2, · · · : P(T1 = t) = β(1 − β)^{t−1}  [Geometric PMF] (9.28)

The answer is the product of the probabilities of (t − 1) failures and one success (thus memoryless). It is called geometric because checking that the probability distribution is normalized involves summing up a geometric sequence (progression); naturally, Σ_{t=1}^∞ (1 − β)^{t−1} = 1/β. The mean and variance of the geometric distribution are

mean : E[T1] = 1/β (9.29)
variance : var(T1) = E[(T1 − E[T1])²] = (1 − β)/β² (9.30)

More on the memoryless property: given n, the future sequence xn+1, xn+2, · · · is also a Bernoulli process and is independent of the past. Moreover, suppose we have observed the process for n trials and no success has occurred. Then the PMF for the remaining arrival time is also geometric:

P(T − n = k|T > n) = β(1 − β)^{k−1} (9.31)

And how about the k-th arrival? Let yk be the number of trials until the k-th success (inclusive); then we write

t = k, k + 1, · · · : P(yk = t) = \binom{t−1}{k−1} β^k (1 − β)^{t−k}  [Pascal PMF] (9.32)
mean : E[yk] = k/β (9.33)
variance : var(yk) = E[(yk − E[yk])²] = k(1 − β)/β² (9.34)

The combinatorial factor accounts for the number of configurations of the "k arrivals in yk trials" type.
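A quick numerical check of Eqs. (9.29)-(9.30) may be done in Julia along the following lines (a sketch; β = 0.3 is an arbitrary choice):

using Statistics

function first_success(β)
    t = 1
    while rand() ≥ β    # failure: keep trying
        t += 1
    end
    return t
end

β = 0.3
T = [first_success(β) for _ in 1:10^6]
println((mean(T), 1/β))             # both ≈ 3.33
println((var(T), (1 - β)/β^2))      # both ≈ 7.78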

Exercise 9.2.1 (not graded). Define Tk = yk − yk−1, k = 2, 3, · · ·, where Tk is thus the inter-arrival time between the (k − 1)-th and the k-th arrivals. Write down the probability distribution function for the k-th inter-arrival time, Tk.

9.2.4 Poisson Process: Definition

Examples:

• All examples from the Bernoulli case considered in continuous time.

• E-mail arrivals with infrequent check.

• High-energy beams collide at a high frequency (10 MHz) with a small chance of a
good event (actual collision).

• Radioactive decay of a nucleus with the trial being to observe a decay within a small
time interval.

• Spin flip in a magnetic field.

COVID-19 challenge: suggest an example of a Poisson process event inspired by our daily
“infected” life.
Let us first recall the definition of the Poisson distribution we had in Section 8.1.1, and specifically the relation between the Bernoulli distribution and the Poisson distribution.
The Poisson distribution, describing the arrival of k customers in an interval (of unspecified duration), was defined as

∀k ∈ Z∗ ≐ {0} ∪ N : Pois(k|λ̃) = λ̃^k e^{−λ̃} / k!. (9.35)

Then we notice that if we take the binomial distribution (9.25), describing the probability of k arrivals in n intervals, with each arrival being independent and occurring with probability β, and consider it in the limit n → ∞, β → λ̃/n, we arrive at Eq. (9.35).
Now we would like to inject into consideration the notion of the duration of the time interval, and thus transition to continuous time. Then the continuous-time version of Eq. (9.35), describing the probability density (per unit time) of getting one arrival at time t, becomes

pT1(t) = λ exp(−λt), (9.36)

where the normalization is chosen properly, i.e. ∫₀^∞ dt pT1(t) = 1. Notice that with some minor abuse of notation we change from the dimensionless parameter λ̃ in Eq. (9.35) to λ in Eq. (9.36), where the latter has the dimension of inverse time, [λ] = [1/t].
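The binomial-to-Poisson limit can also be checked numerically; the sketch below assumes the Distributions.jl package is available in the Julia environment.

using Distributions

lam, n = 3.0, 10_000
binom, pois = Binomial(n, lam/n), Poisson(lam)
for k in 0:8                         # compare the two PMFs term by term
    println((k, pdf(binom, k), pdf(pois, k)))
end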

9.2.5 Poisson: Arrival Time

Then the probability that the first arrival has occurred by time t is

P(T1 ≤ t) = ∫₀^t dt′ pT1(t′) = 1 − exp(−λt).

By extension (generalizing), for the probability density of the time of the k-th arrival one derives

pTk(t) = λ^k t^{k−1} exp(−λt) / (k − 1)!,  t > 0  (Erlang "of order" k)
Like the Bernoulli process, the Poisson process shows the following two key properties:

• Fresh-start property: the time of the next arrival is independent of the past.

• Memoryless property: suppose we observe the process for t seconds and no success has occurred. Then the density of the remaining time of arrival is exponential.

The relations between the Bernoulli process and the Poisson process are summarized in the table below.
Figure 9.5: Merging two Poisson processes.

                              Bernoulli        Poisson
  Times of Arrival            Discrete         Continuous
  Arrival Rate                β per trial      λ per unit time
  PMF of Number of Arrivals   Binomial         Poisson
  PMF of Inter-arrival Time   Geometric        Exponential
  PMF of k-th Arrival Time    Pascal           Erlang

9.2.6 Merging and Splitting Processes

The most important feature shared by the Bernoulli and Poisson processes is their invariance with respect to merging and splitting. We will show it on the example of the Poisson process, but the same applies to the Bernoulli process.
Merging: Let N1(t) and N2(t) be two independent Poisson processes with rates λ1 and λ2 respectively, and define N(t) = N1(t) + N2(t). This random process is derived by combining the arrivals as shown in Fig. (9.5). The claim is that N(t) is a Poisson process with rate λ1 + λ2. To see it, we first note that N(0) = N1(0) + N2(0) = 0. Next, since N1(t) and N2(t) are independent and have independent increments, their sum also has independent increments. Finally, consider an interval of length τ, (t, t + τ]. The numbers of arrivals in this interval are Poisson(λ1 τ) and Poisson(λ2 τ), and the two numbers are independent. Therefore the number of arrivals in the interval associated with N(t) is Poisson((λ1 + λ2)τ), as the sum of two independent Poisson random variables. We can obviously generalize the statement to a sum of many Poisson processes. Note that in the case of the Bernoulli process the story is identical, provided that a collision is counted as one arrival.
Splitting: Let N(t) be a Poisson process with rate λ. Here we split N(t) into N1(t) and N2(t), where the splitting is decided by coin tossing (a Bernoulli process): when an arrival occurs we toss a coin and, with probabilities β and 1 − β, add the arrival to N1 or N2 respectively. The coin tosses are independent of each other and are independent of N(t). Then the following statements can be made:

• N1 is a Poisson process with rate λβ.

• N2 is a Poisson process with rate λ(1 − β).

• N1 and N2 are independent Poisson processes.
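A minimal sketch of both statements: exponential inter-arrival times generate the two streams, the merged stream has rate λ1 + λ2, and a β-coin thinning of it has rate β(λ1 + λ2). The rates and the observation window below are arbitrary choices.

λ1, λ2, n = 2.0, 3.0, 10^5
arrivals(λ, n) = cumsum(-log.(rand(n)) ./ λ)      # Poisson arrival times of rate λ
t = sort(vcat(arrivals(λ1, n), arrivals(λ2, n)))  # merging
T = 0.9 * min(n/λ1, n/λ2)                         # window where both streams run
tt = filter(<(T), t)
println(length(tt) / T)                           # empirical rate ≈ λ1 + λ2 = 5
β = 0.4
println(count(rand(length(tt)) .< β) / T)         # splitting: ≈ β(λ1 + λ2) = 2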

Example 9.2.2. Astronomers estimate that meteors above a certain size hit the earth on average once every 1000 years, and that the number of meteor hits follows a Poisson distribution.

(1) What is the probability to observe at least one large meteor next year?

(2) What is the probability of observing no meteor hits within the next 1000 years?

(3) Calculate the probability distribution P (tn ), where the random variable tn represents
the appearance time of the n-th meteor.

Solution:
The probability of observing n meteors in a time interval t is given by

P(n, t) = ((λt)^n / n!) e^{−λt}, (9.37)
where λ = 0.001 (events per year) is the average hitting rate.

(1) P (n > 0 meteors next year) = 1 − P (0, 1) = 1 − e−0.001 ≈ 0.001.

(2) P (n = 0 meteors next 1000 years) = P (0, 1000) = e−1 ≈ 0.37.

(3) It is intuitively clear that

(probability that tn > t) = (probability to get at most n − 1 arrivals in the interval [0, t]).

Therefore

∫_t^∞ p(tn) dtn = Σ_{k=0}^{n−1} P(k, t),

and differentiating with respect to t (the sum telescopes) one finds

p(tn) = λ^n t_n^{n−1} e^{−λt_n} / (n − 1)!.

Exercise 9.2.3. Customers arrive at a store at the Poisson rate of 10 per hour; 40%/60% of the arrivals are male/female.

(1) Compute the probability that at least 20 customers have entered between 10 and 11 am.

(2) Compute the probability that exactly 10 women entered between 10 and 11 am.

(3) Compute the expected inter-arrival time of men.

(4) Compute the probability that there are no male customers between 2 and 4 pm.

9.3 Space-time Continuous Stochastic Processes


In this lecture we discuss the stochastic dynamics of continuous variables governed by the Langevin equation. We discuss how to derive the so-called Fokker-Planck equation, describing the temporal evolution of the probability of a state. We then go into some additional detail for a basic example of stochastic dynamics in free space (no potential), describing Brownian motion, where the Fokker-Planck equation becomes the diffusion equation.

9.3.1 Langevin equation in continuous time and discrete time

A stochastic process in 1d is described in continuous-time and discrete-time forms as follows:

ẋ = −F(x) + √D ξ(t),  ⟨ξ(t)⟩ = 0, ⟨ξ(t1)ξ(t2)⟩ = δ(t1 − t2), (9.38)
x_{n+1} − x_n = −∆F(xn) + √(D∆) ξ(tn),  ⟨ξ(tn)⟩ = 0, ⟨ξ(tn)ξ(tk)⟩ = δ_{kn}. (9.39)

The first and second terms on the rhs of Eq. (9.38) stand for the force and the "noise", respectively. The noise is considered independent at each time step. These equations, also called Langevin equations, describe the evolution of a "particle" positioned at x ∈ R. The two terms on the rhs of Eq. (9.39) correspond, respectively, to a deterministic advancement of the particle (dependent on its position at the previous time step) and to a random correction/increment. The random correction models the uncertainty of the environment the particle moves through. (We can also think of it as representing random kicks by other "invisible" particles.) The uncertainty is represented in a probabilistic way; therefore we will be talking about the probability distribution function of paths, i.e. trajectories of the particle.
The square root on the rhs of Eq. (9.39) may seem mysterious; let us clarify its origin on the basic (no force/potential) example of F(x) = 0. (This will be the running example throughout this lecture.) In this case the Langevin equation describes Brownian motion. Direct integration of the linear equation with the inhomogeneous source results in this case in

∀t ≥ 0 : x(t) = √D ∫₀^t dt′ ξ(t′), (9.40)
∀t ≥ 0 : ⟨x²(t)⟩ = ∫₀^t dt1 ∫₀^t dt2 D δ(t1 − t2) = D ∫₀^t dt1 = Dt, (9.41)

where we also set x(0) = 0. The infinitesimal version of Eq. (9.41) is

δx = √(D∆), (9.42)

which is thus the Brownian (no-force) version of Eq. (9.39).
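The discrete-time rule (9.39) with F = 0 is straightforward to simulate; the sketch below checks that the variance of x(t) grows as Dt, in line with Eq. (9.41). The numerical values of D, ∆ and the number of paths are arbitrary choices.

using Statistics

function brownian(D, Δ, nsteps, npaths)
    x = zeros(npaths)
    for _ in 1:nsteps
        x .+= sqrt(D * Δ) .* randn(npaths)   # increment of size sqrt(DΔ), Eq. (9.42)
    end
    return x
end

D, Δ, nsteps = 1.0, 1e-3, 1000
x = brownian(D, Δ, nsteps, 10^4)
println((var(x), D * Δ * nsteps))            # both ≈ D t = 1.0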

9.3.2 From the Langevin Equation to the Path Integral

The Langevin equation can also be viewed as relating the change in x(t), i.e. the dynamics of interest, to the stochastic dynamics of the δ-correlated source ξ(tn) = ξn, characterized by the Probability Density Function (PDF)

p(ξ1, · · · , ξN) = (2π)^{−N/2} exp(−Σ_{n=1}^N ξn²/2) (9.43)

Eqs. (9.38, 9.39, 9.43) are the starting points for our further derivations, but they should also be viewed as a way to simulate the Langevin equation on a computer by generating many paths at once, i.e. simultaneously. Notice, for completeness, that there are also other ways to simulate the Langevin equation, e.g. through the telegraph process.
Let us express ξn via xn from Eq. (9.39) and substitute it into Eq. (9.43):

p(ξ1, · · · , ξN−1) → p(x1, · · · , xN) = (2πD∆)^{−(N−1)/2} exp(−(1/(2D∆)) Σ_{n=1}^{N−1} (x_{n+1} − x_n + ∆F(xn))²) (9.44)
One gets an explicit expression for the measure over a path written in the discretized way. And here is the typical way we state it in the continuous form (e.g. as a notational shortcut):

p{x(t)} ∝ exp(−(1/(2D)) ∫₀^T dt (ẋ + F(x))²) (9.45)

This object is called (in physics and math) the "path integral" and/or the Feynman-Kac integral.

9.3.3 From the Path Integral to the Fokker-Planck (through sequential Gaussian integrations)

The Probability Density Function of a path is a useful general object. However, we may also want to marginalize it, extracting the marginal PDF of being at position xN at the (temporal) step N from the joint probability density (of the path) conditioned on being at the initial position, x1, at the moment of time t1, p(x2, · · · , xN |x1), and from the prior/initial distribution p1(x1), both assumed known:

pN(xN) = ∫ dx1 · · · dxN−1 p(x2, · · · , xN |x1) p1(x1). (9.46)

It is convenient to derive the relation between pN(·) and p1(·) in steps, i.e. through a recurrence, integrating over dx1, · · · sequentially. Let us proceed by analyzing the case of Brownian motion, where F = 0. Then the first step of the induction becomes

p2(x2) = (2πD∆)^{−1/2} ∫ dx1 exp(−(x2 − x1)²/(2D∆)) p1(x1) (9.47)
       = (2πD∆)^{−1/2} ∫ dε exp(−ε²/(2D∆)) p1(x2 − ε) (9.48)
       ≈ (2πD∆)^{−1/2} ∫ dε exp(−ε²/(2D∆)) [p1(x2) − ε ∂x p1(x2) + (ε²/2) ∂x² p1(x2)] (9.49)
       = p1(x2) + ∆ (D/2) ∂x² p1(x2), (9.50)
where, transitioning from Eq. (9.48) to Eq. (9.49), one makes a Taylor expansion in ε, assuming that ε ∼ √∆ and keeping only the leading terms in ∆. The resulting Gaussian integrations are straightforward. We arrive at the discretized (in time) version of the diffusion equation

∂t pt(x) = (D/2) ∂x² pt(x). (9.51)
Of course it is not surprising that the case of Brownian motion has resulted in the diffusion equation for the marginal PDF. Restoring the force term (the derivation is straightforward), one arrives at the Fokker-Planck equation, generalizing the zero-force diffusion equation:

∂t pt(x) − ∂x (F(x) pt(x)) = (D/2) ∂x² pt(x). (9.52)

9.3.4 Analysis of the Fokker-Planck Equation: General Features and Examples

Here we only give a very brief and incomplete description of the properties of the distribution, whose analysis is of fundamental importance for Statistical Mechanics. See e.g. [17].

The Fokker-Planck equation (9.52) is a linear and deterministic Partial Differential Equation (PDE). It describes the evolution/flow of the probability density distribution, continuous in the phase space, x, and in time, t.
The derivation was for a particle moving in 1d, R, but the same ideology and logic extend to higher dimensions, R^d, d = 1, 2, · · ·. There are also extensions of this consideration to compact continuous spaces; thus one can analyze dynamics on a circle, sphere or torus.
Analogs of the Fokker-Planck equation can be derived and analyzed for more complicated probabilities than just the marginal probability of the state (the path integral marginalized to a given time). An example here is the so-called first-passage, or "first-hitting", problem.
The temporal evolution is driven by two terms, "diffusion" and "advection"; the terminology is from fluid mechanics, and indeed not only fluids but also probabilities can flow. The flow of probability takes place in the phase space. The diffusion originates from the stochastic source, while the advection is associated with a deterministic (possibly nonlinear) force.
Linearity of the Fokker-Planck equation does not imply that it is simpler than the original nonlinear problem. Deriving the Fokker-Planck equation, we made a transition from a nonlinear, stochastic, but ordinary differential equation to a linear PDE. This type of transition, from a nonlinear representation of many trajectories to a linear probabilistic representation, is typical in math/statistics/physics. The linear Fokker-Planck equation can be viewed as the continuous-time, continuous-space version of the discrete-time/discrete-space Master equation describing the evolution of a (finite-dimensional) probability vector in the case of a Markov Chain.
The Fokker-Planck Eq. (9.52) can be represented in the 'flux' form:

∂t pt + ∂x Jt(x) = 0, (9.53)

where Jt(x) is the flux of probability through the state-space point x at the moment of time t. The fact that the second (flux) term in Eq. (9.53) has a gradient form corresponds to the global conservation of probability. Indeed, integrating Eq. (9.53) over the whole continuous domain of achievable x, and assuming that if the domain is bounded there is no injection (or dissipation) of probability on the boundary, one finds that the integral of the second term is zero (according to the standard Gauss theorem of calculus) and thus ∂t ∫ dx pt(x) = 0.
In the steady state, when ∂t pt = 0 for all x (and not only as a result of integration over the entire domain), the flux is constant, i.e. it does not depend on x. The case of zero flux is the special case of so-called 'equilibrium' statistical mechanics. (See some further comments on the latter below.)
If the initial probability distribution, pt=0(x), is known, then pt(x) for any subsequent t is well defined, in the sense that the Fokker-Planck equation poses a Cauchy (initial value) problem with a unique solution.

Remarks about simulations. One can solve the PDE, but one can also analyze the stochastic ODE, approaching the problem in two complementary ways, corresponding to the Eulerian and Lagrangian analyses in Fluid Mechanics describing "incompressible" flows in the probability space.
The main and simplest (already mentioned) example of the Langevin dynamics is Brownian motion, i.e. the case of F = 0. Another example, principal for so-called 'equilibrium statistical physics', is that of a potential force, F = ∂xU(x), where U(x) is a potential. Think, for example, of x representing a particle connected to the origin by a spring; U(x) is then the potential energy stored within the spring. In this case of a gradient force, the stationary (i.e. time-independent) solution of the Fokker-Planck Eq. (9.52) can be found explicitly:

pst(x) = Z^{−1} exp(−U(x)/D). (9.54)

This solution is called the Gibbs distribution, or equilibrium distribution.

Brownian Motion

Example 9.3.1. Consider the motion of a Brownian particle in the parabolic potential U(x) = γx²/2. (This situation is typical for a particle located near a minimum or a maximum of a potential.) The Langevin equation (9.38) in this case becomes

dx/dt + γx = √D ξ(t),  ⟨ξ(t)⟩ = 0, ⟨ξ(t1)ξ(t2)⟩ = δ(t1 − t2). (9.55)

Write a formal solution of Eq. (9.55) for x(t) as a functional of ξ(t). Compute ⟨x²(t)⟩ as a function of t and interpret the results. Write the Fokker-Planck (FP) equation for n(t, x), and solve it for the initial condition n(0, x) = δ(x).
Solution:
Eq. (9.55) has the formal solution

x(t) = x(0) e^{−γt} + ∫₀^t ξ(t′) e^{−γ(t−t′)} dt′. (9.56)

For simplicity, we assume x(0) = 0. Then E[x(t)] = 0 and

⟨x²(t)⟩ = ∫₀^t ∫₀^t dt′ dt″ ⟨ξ(t′)ξ(t″)⟩ e^{−γ(t−t′)} e^{−γ(t−t″)} = 2D e^{−2γt} ∫₀^t ∫₀^t dt′ dt″ δ(t′ − t″) e^{γ(t′+t″)} = (D/γ)(1 − e^{−2γt}). (9.57)
At small time scales, t ≪ 1/γ, we deal with the usual diffusion, ⟨x²(t)⟩ ≃ 2Dt, since the particle does not feel the potential, while at larger time scales, t ≫ 1/γ, the dispersion saturates: ⟨x²(t)⟩ ≃ D/γ.

The Fokker-Planck equation, ∂t n = (γ ∂x x + D ∂x²) n, should be supplemented by the initial condition n(0, x) = δ(x). Then the solution (the Green function) is

n(t, x) = (2π⟨x²(t)⟩)^{−1/2} exp(−x²/(2⟨x²(t)⟩)). (9.58)

The meaning of the expression is clear: the probability function n(t, x) is Gaussian, but with a time-dependent dispersion.
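The example can be reproduced by discretizing Eq. (9.55). One caveat: to hit the saturation value D/γ quoted in Eq. (9.57), the sketch below assumes the noise normalization ⟨ξ(t1)ξ(t2)⟩ = 2δ(t1 − t2), i.e. increments of size √(2D∆); this is an assumption made to match Eq. (9.57), not a statement about a unique convention.

using Statistics

function ou(D, γ, Δ, nsteps, npaths)
    x = zeros(npaths)
    for _ in 1:nsteps
        x .+= -γ .* x .* Δ .+ sqrt(2 * D * Δ) .* randn(npaths)
    end
    return x
end

D, γ, Δ = 1.0, 0.5, 1e-3
x = ou(D, γ, Δ, 20_000, 10^4)    # t = 20 >> 1/γ, the saturated regime
println((var(x), D/γ))           # both ≈ 2.0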

Exercise 9.3.2 (High order moments, not graded). Prove that the moments E[x^{2k}(t)] for Brownian motion in R¹ obey the following recurrence:

∂t ⟨x^{2k}⟩ = 2k(2k − 1) D ⟨x^{2(k−1)}⟩. (9.59)

Solve this equation for a particle starting from x = 0 at t = 0.

Exercise 9.3.3 (Brownian motion in parabolic potential, not graded). The concentration field, n(t, x) : R+ × R → R+, for a Brownian particle in the potential U(x) = αx²/2 is described by the advection-diffusion equation

D ∂x² n + α ∂x (x n) = ∂t n. (9.60)

Write down the stochastic ODE for the underlying stochastic process, x(t), and, given the initial condition for the concentration field, n(0, x) = δ(x), compute the respective statistical moments ⟨x^k(t)⟩. [Hint: Reconstruct the stochastic ODE corresponding to the PDE (9.60) and then follow the logic/strategy of Example 9.3.1.]

Exercise 9.3.4 (Self-propelled particle). The term "self-propelled particle" refers to an object capable of moving actively by gaining energy from the environment. Examples of such objects range from Brownian motors and motile cells to macroscopic animals and mobile robots. In the simplest two-dimensional model the self-propelled particle moves in the plane (x, y) with a fixed speed v0. The Cartesian components of the particle velocity, vx and vy, in polar coordinates are

vx = v0 cos ϕ, vy = v0 sin ϕ, (9.61)

where the polar angle ϕ defines the direction of motion. Assume that ϕ evolves according to the stochastic equation

dϕ/dt = ξ, (9.62)

where ξ(t) is Gaussian white noise with zero mean and pair correlator ⟨ξ(t1)ξ(t2)⟩ = 2Dδ(t1 − t2). The initial conditions are chosen to be ϕ(0) = 0, x(0) = 0 and y(0) = 0.

Figure 9.6: Optimal solution: the set of actions (arrows) for each state, at each time.

• Calculate hx(t)i, hy(t)i.

• Calculate hr2 (t)i = hx2 (t)i + hy 2 (t)i.

(Hint: Derive the equation for the probability density of observing ϕ at the moment of time t, solve the equation, and use the result.)

Solving an MDP (Markov Decision Process) means finding the optimal a, i.e. a set of actions for each state at each moment of time, as illustrated on the GridWorld example (to be discussed next) in Fig. 9.6.
Our description here is intentionally terse/introductory. For a more colloquial, detailed and mathematical exposition of MDPs, check the lecture notes of Pieter Abbeel (UC Berkeley) http://www.cs.berkeley.edu/~pabbeel/cs287-fa12/slides/mdps-exact-methods.pdf from the Berkeley AI course. In fact, the Berkeley course on AI also contains a very good repository of materials at http://ai.berkeley.edu/lecture_videos.html. Our running 'Grid World' example/illustration of MDP (it comes next) is used intensively in the lecture series; see http://aima.cs.berkeley.edu/demos.html and also http://www2.hawaii.edu/~chenx/ics699rl/grid/.

9.3.5 MDP: Grid World Example

An MDP can be considered as an interactive probabilistic game one plays against a computer (random number generator). The game consists in defining transition rates between the states so as to achieve certain objectives. Once optimal (or suboptimal) rates are fixed, the implementation becomes just a Markov Process of the type we have studied already.

Figure 9.7: Canonical example of MDP from the 'Grid World' game.

Let us play this 'Grid World' game a bit. The rules are introduced in Fig. (9.7). An agent lives on a 3 × 4 grid. Walls block the agent's path. The agent's actions do not always go as planned: 80% of the time the action 'North' takes the agent North (if there is no wall there), 10% of the time the action 'North' actually takes the agent West, and 10% East. If there is a wall where the agent would have been taken, she stays put. A big reward, +1, or penalty, −1, comes at the end. We will come back to this example many times during this lecture.
We will consider the following Value Iteration algorithm¹:

Algorithm 5 MDP – Value Iteration

Input: Set of states, S; set of actions, A; transition probabilities between states, P(s′|s, a); rewards/costs, R(s, a, s′); discount factor γ.

∀s : V∗_0(s) = 0
for i = 0, · · · , H − 1 do
    ∀s : V∗_{i+1}(s) ← max_a Σ_{s′} P(s′|s, a) [R(s, a, s′) + γ V∗_i(s′)]   [Bellman update/backup: the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i + 1 steps]
end for

The Grid World implementation of the algorithm is illustrated in Fig. (9.8).


¹The algorithm is justified through standard Dynamic Programming arguments, of the type discussed above.
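A compact implementation of Algorithm 5 may look as follows; the tiny random MDP at the bottom is a made-up usage example, not the GridWorld of Fig. (9.7).

function value_iteration(P, R, γ, H)
    nS, nA = size(P, 1), size(P, 3)     # P[s2, s, a] = P(s2 | s, a)
    V = zeros(nS)
    for _ in 1:H                        # H Bellman backups
        V = [maximum(sum(P[s2, s, a] * (R[s2, s, a] + γ * V[s2]) for s2 in 1:nS)
                     for a in 1:nA) for s in 1:nS]
    end
    return V
end

nS, nA = 2, 2
P = rand(nS, nS, nA); P ./= sum(P, dims=1)   # normalize: columns sum to 1
R = randn(nS, nS, nA)
println(value_iteration(P, R, 0.9, 100))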

Figure 9.8: Value Iteration in Grid World (noise = 0.2, γ = 0.9, two terminal states with R = +1 and −1).



9.3.6 Recitation. Dynamic Programming.


Chapter 10

Elements of Inference and Learning


10.1 Exact and Approximate Inference and Learning


10.1.1 Monte-Carlo Algorithms: General Concepts and Direct Sampling

This lecture should be read in parallel with the respective IJulia notebook file. Monte-Carlo (MC) methods refer to a broad class of algorithms that rely on repeated random sampling to obtain results. They are named after Monte Carlo (the city), which once was the capital of gambling, i.e. of playing with randomness. MC algorithms can be used for numerical integration, e.g. computing weighted sums of many contributions, expectations, marginals, etc. MC can also be used in optimization.
Sampling is a selection of a subset of individuals/configurations from within a statistical population, used to estimate characteristics of the whole population.
There are two basic flavors of sampling: Direct Sampling MC (DS-MC), mainly discussed in this lecture, and Markov Chain MC (MCMC). DS-MC focuses on drawing independent samples from a distribution, while MCMC draws correlated samples (correlated according to the underlying Markov Chain).
Let us illustrate both on the simple example of the 'pebble game': calculating the value of π by sampling the interior of a circle.

Direct-Sampling by Rejection vs MCMC for ‘pebble game’

In this simple example we construct a distribution which is uniform within a circle from another distribution which is uniform within a square containing the circle. We use the direct product of two rand() calls to generate samples within the square and then simply reject samples which are not in the interior of the circle.
In the respective MCMC we build a sample (parameterized by a pair of coordinates) by taking the previous sample and adding random independent shifts to both variables, also making sure that when the sample crosses a side of the square it reappears on the opposite side. The sample "walks" the square, but to compute the area of the circle we count only samples which are within the circle (rejection again).
See the IJulia notebook associated with this lecture for an illustration.
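A minimal sketch of the two estimators of π (the MCMC step size of 0.1 is a made-up choice):

inside(x, y) = (x - 0.5)^2 + (y - 0.5)^2 ≤ 0.25    # circle inscribed in [0,1]²

ds_pi(n) = 4 * count(_ -> inside(rand(), rand()), 1:n) / n   # direct sampling

function mcmc_pi(n; step = 0.1)
    x, y, hits = 0.5, 0.5, 0
    for _ in 1:n
        x = mod(x + step * (2 * rand() - 1), 1.0)   # periodic boundary conditions
        y = mod(y + step * (2 * rand() - 1), 1.0)
        hits += inside(x, y)                        # count only samples in the circle
    end
    return 4 * hits / n
end

println((ds_pi(10^6), mcmc_pi(10^6)))   # both ≈ π, MCMC with a larger error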

Direct Sampling by Mapping

Direct Sampling by Mapping consists in the application of a deterministic function to samples from a distribution you know how to sample from. The method is exact, i.e. it produces independent random samples distributed according to the new distribution. (We will discuss formal criteria for independence in the next lecture.)
For example, suppose we want to generate exponential samples, yi ∼ ρ(y) = exp(−y), the one-dimensional exponential distribution over [0, ∞), provided that a one-dimensional uniform oracle, which generates independent samples xi from [0, 1], is available. Then yi = − log(xi) generates the desired (exponentially distributed) samples.
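In code this is a one-liner (a sketch: the unit exponential has mean and variance 1):

using Statistics
y = -log.(rand(10^6))        # map uniform samples to exponential ones
println((mean(y), var(y)))   # both ≈ 1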
Another example of DS-MC by mapping is given by the Box-Muller algorithm, which is a smart way to map a two-dimensional random variable distributed uniformly within a box to a two-dimensional Gaussian (normal) random variable:

∫_{−∞}^∞ ∫_{−∞}^∞ (dx dy / 2π) e^{−(x²+y²)/2} = ∫₀^∞ r dr ∫₀^{2π} (dϕ/2π) e^{−r²/2} = ∫₀^∞ dz e^{−z} ∫₀^{2π} (dϕ/2π) = ∫₀^1 dθ ∫₀^1 dψ = 1.

Thus, the desired mapping is (ψ, θ) → (x, y), where x = √(−2 log ψ) cos(2πθ) and y = √(−2 log ψ) sin(2πθ).
See the IJulia notebook associated with this lecture for numerical illustrations.
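A sketch of the Box-Muller map itself:

using Statistics

function box_muller()
    ψ, θ = rand(), rand()
    r = sqrt(-2 * log(ψ))                      # radial part
    return r * cos(2π * θ), r * sin(2π * θ)    # two independent N(0,1) samples
end

xy = [box_muller() for _ in 1:10^6]
x = first.(xy)
println((mean(x), var(x)))                     # ≈ (0, 1)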

Direct Sampling by Rejection (another example)

Let us now show how to get a positive Gaussian (normal) random variable from an exponential random variable through rejection. We do it in two steps.

• First, one samples from the exponential distribution:

x ∼ ρ0(x) = e^{−x} for x > 0, and ρ0(x) = 0 otherwise.

• Second, aiming to get a sample from the positive half of the Gaussian,

ρ(x) = √(2/π) exp(−x²/2) for x > 0, and ρ(x) = 0 otherwise,

one accepts the generated sample with probability

p(x) = (1/M) √(2/π) exp(x − x²/2),

where M is a constant which should be larger than max_x ρ(x)/ρ0(x) = √(2/π) e^{1/2} ≈ 1.32 to guarantee that p(x) ≤ 1 for all x > 0.

Note that the rejection algorithm has the advantage of being applicable even when the probability densities are known only up to a multiplicative constant. (We will discuss issues related to this constant, also called, in the multivariate case, the partition function, extensively.)
See the IJulia notebook associated with this lecture for a numerical illustration.
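A sketch of the two-step procedure (the acceptance constant M is the bound computed above):

using Statistics

const M = sqrt(2/π) * exp(0.5)      # ≥ max ρ(x)/ρ0(x) ≈ 1.32
function half_gaussian()
    while true
        x = -log(rand())                        # exponential proposal
        p = sqrt(2/π) * exp(x - x^2/2) / M      # acceptance probability
        rand() < p && return x
    end
end

s = [half_gaussian() for _ in 1:10^6]
println((mean(s), sqrt(2/π)))   # the half-Gaussian mean is sqrt(2/π) ≈ 0.798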
We also recommend

• Introduction to Direct Sampling, a chapter of the Monte Carlo Lecture Notes by J. Goodman (NYU)

• the Lecture on Monte Carlo Sampling, from the Berkeley course of M. Jordan on Bayesian Modeling and Inference

for additional reading on DS-MC.

Importance Sampling

One important application of MC is in computing sums, integrals and expectations. Suppose we want to compute the expectation of a function, f(x), over the distribution ρ(x), i.e. ∫ dx ρ(x) f(x), in the regime where f(x) and ρ(x) are concentrated around very different x. In this case the overlap of f(x) and ρ(x) is small and, as a result, a lot of MC samples drawn from ρ(x) will be 'wasted'.
Importance Sampling (IS) is the method which aims to fix the small-overlap problem. The method is based on adjusting the distribution function from ρ(x) to ρa(x) and then utilizing the following obvious formula:

E_ρ[f(x)] = ∫ dx ρ(x) f(x) = ∫ dx ρa(x) (f(x) ρ(x) / ρa(x)) = E_{ρa}[f(x) ρ(x) / ρa(x)]
See the IJulia notebook associated with this lecture contrasting the DS example, ρ(x) = (1/√(2π)) exp(−x²/2) and f(x) = exp(−(x − 4)²/2), with IS, where the choice of the proposal distribution is ρa(x) = (1/√π) exp(−(x − 2)²). This example shows that we are clearly wasting samples with DS.
Note one big problem with IS: in a realistic multi-dimensional case it is not easy to guess the proposal distribution, ρa(x), right. One way of fixing this problem is to search for a good ρa(x) adaptively.
A comprehensive review of the history and state of the art in Importance Sampling can be found in multiple lecture notes of A. Owen posted at his web page; for example, follow this link. Check also the adaptive importance sampling package.
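For the example quoted above, the importance-sampled estimator can be written in a few lines (a sketch; for this pair of ρ and f the exact value of the expectation is exp(−4)/√2 ≈ 0.0130):

using Statistics

f(x)  = exp(-(x - 4)^2 / 2)
ρ(x)  = exp(-x^2 / 2) / sqrt(2π)
ρa(x) = exp(-(x - 2)^2) / sqrt(π)           # proposal: N(2, 1/2)

n = 10^6
direct = mean(f, randn(n))                  # DS: most samples contribute ≈ 0
xa = 2 .+ randn(n) ./ sqrt(2)               # samples from ρa
is_est = mean(@. f(xa) * ρ(xa) / ρa(xa))    # importance-weighted average
println((direct, is_est))                   # IS has a much smaller error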

Direct Brute-force Sampling

This algorithm relies on the availability of the uniform sampling algorithm from [0, 1], rand(). One splits the [0, 1] interval into pieces according to the weights of all possible states and then uses rand() to select the state. The algorithm is impractical as it requires keeping in memory information about all possible configurations. The use of this construction is in providing the benchmark case useful for proving independence of samples.

Direct Sampling from a multi-variate distribution with a partition function oracle

Suppose we have an oracle capable of computing the partition function (normalization) for a multivariate probability distribution and also for any of the marginal probabilities. (Notice that we are ignoring for now the issue of the oracle complexity.) Does it give us the power to generate independent samples?
We get an affirmative answer to this question through the following decimation algorithm, generating an independent sample x ∼ P(x), where x ≐ (xi | i = 1, · · · , N):
The validity of the algorithm follows from the exact representation of the joint probability distribution function as a product of ordered conditional distribution functions (the chain rule for distributions):

P(x1, · · · , xn) = P(x1) P(x2|x1) P(x3|x1, x2) · · · P(xn|x1, · · · , xn−1). (10.1)

(The chain rule follows directly from the Bayes rule/formula. Notice also that the ordering of variables within the chain rule is arbitrary.) One way of proving that the algorithm produces an independent sample is to show that the algorithm's outcome is equivalent to another algorithm for which independence is already proven. The benchmark algorithm

Algorithm 6 Decimation Algorithm

Input: P(x) (expression). Partition function oracle.

1: x(d) = ∅; I = ∅
2: while |I| < N do
3:    Pick i at random from {1, · · · , N} \ I.
4:    x(I) = (xj | j ∈ I)
5:    Compute P(xi | x(d)) = Σ_{x\xi : x(I)=x(d)} P(x) with the oracle.
6:    Generate a random xi ∼ P(xi | x(d)).
7:    I ← I ∪ i
8:    x(d) ← x(d) ∪ xi
9: end while

Output: x(d) is an independent sample from P(x).

we can use to state that the Decimation Algorithm (6) produces independent samples is the brute-force sampling algorithm described in the beginning of the lecture. The crucial point here is that the decimation algorithm can be interpreted in terms of splitting the [0, 1] interval hierarchically, first according to P(x1), then subdividing the pieces for different x1 according to P(x2|x1), etc. This gedanken experiment results in the desired proof.
In general the effort required of the partition function oracle is exponential. However, in some special cases the partition function can be computed efficiently (polynomially in the number of steps). For example, this is the case for the (glassy) Ising model without magnetic field over a planar graph. See the report and references therein for details.

Ising Model

Let us digress and consider the Ising model, which is, in fact, an example of a larger class of important/interesting multivariate statistics often referred to (in theoretical engineering) as Graphical Models (GM). We will study GM later in the course. Consider a system of spins or pixels (binary variables) on a graph, G = (V, E), where V is a set of nodes/vertices and E is the set of edges. The graph may be a 1d chain, a tree, a 2d lattice ... or any other graph. (The cases of regular lattices are prevalent in physics, while the graphs of relevance to various engineering disciplines are, generally, richer.) Consider binary variables residing at every node of the graph, ∀i ∈ V : σi = ±1; we call them "spins". If there are N spins in the system, 2^N is the number of possible configurations of spins; notice the exponential scaling with N, meaning, in particular, that just counting the number of configurations is "difficult". If we are able to do the counting in an algebraic/polynomial number of steps, we would call it "easy", or rather "theoretically easy", while the practically easy case, which is the goal, corresponds to the case when the "complexity" of, say, counting is O(N), i.e. linear in N as N → ∞. (Btw, o(N) is the notation used to state that the behavior is actually slower than O(N), say ∼ √N as N → ∞, i.e. asymptotically o(N) ≪ O(N).)
In magnetism (the field of physics where magnetic materials are studied) the probability of a spin configuration (vector) is

p(σ) = exp(−βE(σ))/Z,  E(σ) = −(1/2) Σ_{{i,j}∈E} σi Jij σj + Σ_{i∈V} hi σi, (10.2)
Z = Σ_σ exp(−βE(σ)). (10.3)

E(σ) is the energy of a given spin configuration, σ. The first term in E(σ) is the pair-wise (wrt nodal spins) spin exchange/interaction term. The last term in E(σ) stands for the (potentially node-dependent) contribution of the magnetic field, h = (hi | i ∈ V), acting on individual spins. Z is the partition function, which is the weighted sum over the spin configurations. Formally, the partition function is just the normalization introduced to enforce Σ_σ p(σ) = 1. For a general graph with arbitrary values of J and h, Z is a difficult object to compute, i.e. the complexity of computing Z is O(2^N). (Notice that for some special cases, such as the case of a tree, or when the graph is planar and h = 0, computing the partition function becomes easy.) Moreover, computing other important characteristics, such as the most probable configuration of spins

σML = arg max_σ p(σ), (10.4)

also called the Maximum Likelihood and the Ground State in information sciences and physics respectively, or the (so-called marginal) probability of observing a particular node in the state σi (which can be + or −),

pi(σi) = Σ_{σ\σi} p(σ), (10.5)

are also difficult problems. (Wrt notation: arg max, pronounced "argmax", stands for the particular σ at which the maximum in Eq. (10.4) is reached; σ \ σi in the argument of the sum in Eq. (10.5) means that we sum over all σ consistent with the fixed value of σi at the node i.)
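To make the O(2^N) statement concrete, here is a brute-force evaluation of Eqs. (10.2)-(10.3) for a tiny system; the open chain with unit couplings and h = 0 is an assumed example.

function ising_Z(J, h, β)
    N = length(h)
    Z = 0.0
    for state in 0:(2^N - 1)                 # all 2^N spin configurations
        σ = [(state >> (i - 1)) & 1 == 1 ? 1 : -1 for i in 1:N]
        E = -0.5 * (σ' * J * σ) + h' * σ     # energy, Eq. (10.2)
        Z += exp(-β * E)
    end
    return Z
end

N = 10
J = [abs(i - j) == 1 ? 1.0 : 0.0 for i in 1:N, j in 1:N]   # open chain
println(ising_Z(J, zeros(N), 0.5))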

Exercise 10.1.1. Consider the Ising model on a square 4 × 4 lattice; construct (write down on paper and code) and compare the performance of two direct sampling algorithms: one by rejection, and the decimation algorithm (6).

10.1.2 Markov-Chain Monte-Carlo

Markov Chain Monte Carlo (MCMC) methods belong to the class of algorithms for sampling from a probability distribution based on constructing a Markov chain that converges to the target steady distribution.
Examples and flavors of MCMC are many (and some are quite similar): heat bath, Glauber dynamics, Gibbs sampling, Metropolis-Hastings, cluster algorithms, the Worm algorithm, etc. In all these cases we only need to know the transition probabilities between states, while the actual stationary distribution may not be known or, more accurately, may be known only up to the normalization factor, also called the partition function. Below we discuss in detail two key examples: Gibbs sampling and Metropolis-Hastings.

Gibbs Sampling

Assume that direct sampling is not feasible (because there are too many variables and computations are of "exponential" complexity; more on this later). The main point of Gibbs sampling is that, given a multivariate distribution, it is simpler to sample from a conditional distribution than to marginalize by integrating over a joint distribution. Then we create a chain: start from a current sample of the vector x, pick a component at random, compute the probability for this component/variable conditioned on the rest, and sample from this conditional distribution. (The conditional distribution is for a single component and thus it is easy.) We continue the process till convergence, which can be identified (empirically) by checking whether the estimate of the histogram or of the observable(s) has stopped changing.

Algorithm 7 Gibbs Sampling

Input: Given p(xi | x∼i = x \ xi), ∀i ∈ {1, · · · , N}. Start with a sample x(t).

loop till convergence
    Draw an i.i.d. index i from {1, · · · , N}.
    Generate a random xi ∼ p(xi | x(t)∼i).
    x(t+1)_i = xi.
    ∀j ∈ {1, · · · , N} \ i : x(t+1)_j ← x(t)_j.
    Output x(t+1) as the next sample.
end loop

Example 10.1.2. Describe Gibbs sampling for the example of a general Ising model. Build the respective Markov chain. Show that the algorithm obeys Detailed Balance.

Solution: Starting from a state, we pick a random node i and compare the two candidate states (si = 1 and si = −1). Then we calculate the corresponding conditional (all spins except i are fixed) probabilities p+ and p−, obeying p+ + p− = 1, p+/p− = e^{−β∆E}, where ∆E is the energy difference between the two configurations. Next, one accepts the configuration si = 1 with probability p+ or the configuration si = −1 with probability p−.
The Markov chain corresponding to the algorithm is defined on the hypercube. To check the DB condition, compute the probability flux from the state with si = +1 to the state with si = −1: it is Q−+ = (1/Z) e^{−βE(si=+1)} p−, and the reversed probability flux is Q+− = (1/Z) e^{−βE(si=−1)} p+. One finds that, indeed, DB is satisfied, since Q−+ = Q+−.

Metropolis-Hastings Sampling

Metropolis-Hastings sampling is an MCMC method which exploits the DB condition, i.e. the reversibility of the underlying Markov Chain. The algorithm also uses sampling from the conditional probabilities and a smart use of the rejection strategy. Assume that the probability of any state x from which one wants to sample (call it the target distribution) is explicitly known up to the normalization constant Z, i.e. π(x) = p̄(x)/Z, where Z = Σ_x p̄(x). Let us also introduce the so-called proposal distribution, p(x′|x), and assume that drawing a sample proposal x′ from the current sample x is (computationally) easy.

Algorithm 8 Metropolis-Hastings Sampling

Input: Given π(x) and p(x′|x). Start with a sample xt.

1: loop till convergence
2:    Draw a random x′ ∼ p(x′|xt).
3:    Compute α = (p(xt|x′) π(x′)) / (p(x′|xt) π(xt)).
4:    Draw a random β ∈ U([0, 1]), uniform i.i.d. from [0, 1].
5:    if β < min{1, α} then
6:       xt ← x′ [accept]
7:    else
8:       x′ is ignored [reject]
9:    end if
10:   xt is recorded as a new sample
11: end loop

Figure 10.1: Metropolis-Hastings Markov chain example for two spins.

Note that the Gibbs sampling previously introduced can be considered as Metropolis-Hastings without rejection (thus it is a particular case).
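For a concrete feel, here is a generic sketch of Algorithm 8 with a symmetric Gaussian random-walk proposal (so the proposal ratio in α cancels); the unnormalized target below is a made-up example.

using Statistics

ptarget(x) = exp(-x^4 + x^2)      # target, known only up to normalization
function mh(n; step = 1.0)
    x, out = 0.0, Float64[]
    for _ in 1:n
        x2 = x + step * randn()                               # propose
        rand() < min(1, ptarget(x2)/ptarget(x)) && (x = x2)   # accept/reject
        push!(out, x)             # the current state is recorded as the sample
    end
    return out
end

s = mh(10^6)
println((mean(s), var(s)))        # mean ≈ 0 by the symmetry of the target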

Example 10.1.3. Consider the Markov chain example representing the MH algorithm for two spins. Show that the Markov chain corresponds to an ergodic process. Describe the algorithm. Show that the algorithm obeys the DB condition. What is the resulting stationary distribution? The MH algorithm contains a rejection step; what is the resulting steady distribution if the rejected step is removed from consideration, as in the case of direct sampling by rejection?
Solution: We start from an arbitrary initial state and then perform a random walk in the state space, flipping one spin at a time. Think of the algorithm as of a Markov chain defined over the 2^N vertices of the hypercube. The algorithm works as follows: at each step one first chooses a random site i, then computes the probabilities of keeping the spin value or flipping it (while the other spins are kept unchanged), then flips the spin with the following
probability

p = 1 if ∆E < 0, and p = e^{−β∆E} if ∆E ≥ 0. (10.6)
Since our Markov chain is irreducible and aperiodic (it contains self-loops), it is ergodic and thus has a unique stationary distribution. DB is checked directly. The algorithm converges to the Boltzmann/Gibbs distribution, Peq(s) = (1/Z) e^{−βE(s)}, Z = Σ_s e^{−βE(s)}.
If the spin flip is rejected, one accepts the current state as a new configuration. This is an important difference from direct sampling by rejection. If the rejected state is removed, the resulting distribution is uniform.

The proposals (conditional probabilities) may vary, and the details are critical (they change the mixing time), especially for a large system. There is a (heuristic) rule of thumb giving a lower bound on the number of iterations of MH: if the largest distance between the states is L, the MH will mix in a time

T ≈ (L/ε)², (10.7)

where ε is the typical step size of the random walk.


Mixing may be extremely slow if the proposal distribution is not selected carefully. Let us illustrate how slow MCMC can be on a simple example. (See Section 29 of [15] for details.) Consider the following target distribution over N states,

π(x) = 1/N for x ∈ {0, · · · , N − 1}, and π(x) = 0 otherwise, (10.8)

and the proposal distribution over N + 2 states (extended by −1 and N),

p(x′|x) = 1/2 for x′ = x ± 1, and p(x′|x) = 0 otherwise. (10.9)

Notice that a rejection can only occur when the proposed state is x′ = −1 or x′ = N.
A more sophisticated example, the Glauber algorithm (a version of MH) applied to the Ising Model, is discussed next.

Glauber Sampling of Ising Model

Let us return to the special version of the Gibbs algorithm (and thus also a special case of the MH algorithm) developed specifically for the Ising model, the Glauber dynamics/algorithm:

Algorithm 9 Glauber Sampling

Input: Ising model on a graph. Start with a sample σ.

1: loop till convergence
2:    Pick a node i at random.
3:    σi ← −σi [propose a flip]
4:    Compute α = exp(σi Σ_{j∈V:{i,j}∈E} Jij σj − 2hi).
5:    Draw a random β ∈ U([0, 1]), uniform i.i.d. from [0, 1].
6:    if α < β < 1 then
7:       σi ← −σi [reject: flip back]
8:    end if
9:    Output σ as a sample.
10: end loop

Exercise 10.1.4 (not graded). (a) What is the proposal distribution turning the MH sampling into the Glauber sampling (for the Ising model)? (b) Consider running parallel dynamics based on the Glauber algorithm, i.e. at every moment of time update all variables in parallel according to the Glauber sampling rule applied to the previous state. What is the resulting stationary distribution? Is it different from the Ising model? Does the algorithm satisfy the DB condition?

Exercise 10.1.5 (Spanning Trees, not graded). Let G be an undirected complete graph. A simple MCMC algorithm to sample uniformly from the set of spanning trees of G is as follows: start with some spanning tree; add uniformly at random some edge from G (so that a cycle forms); remove a uniformly-at-random edge from this cycle; repeat. Suppose now that the graph G is positively weighted, i.e., each edge e has some cost ce > 0. Suggest an MCMC algorithm that samples from the set of spanning trees of G with probability proportional to the overall weight of the spanning tree for the following cases:
(i) the weight of any sub-graph of G is the sum of the costs of its edges;
(ii) the weight of any sub-graph of G is the product of the costs of its edges.
In addition, (iii) estimate the average weight of a spanning tree using the algorithm of uniform sampling; and (iv) implement all the algorithms on some small (but non-trivial) weighted graph of your choice. Verify that the algorithm converges to the right value.

For useful additional reading on sampling and computations for the Ising model see
https://www.physik.uni-leipzig.de/~janke/Paper/lnp739_079_2008.pdf.

Exactness and Convergence

An MCMC algorithm is called (casually) exact if one can show that the generated distribution "converges" to the desired stationary distribution. However, "convergence" may mean different things.
The strongest form of convergence, called the exact independence test (warning: this is our 'custom' term), states that at each step we generate an independent sample from the target distribution. To prove this statement means to show that the empirical correlation of consecutive samples vanishes in the limit where the number of samples, N, goes to infinity:
lim_{N→+∞} (1/N) Σ_{n=1}^N f(xn) g(xn−1) → E[f(x)] E[g(x)], (10.10)

where f (x) and g(x) are arbitrary functions (however such that respective expectations on
the rhs of Eq. (10.10) are well-defined).
A weaker statement, call it asymptotic convergence, suggests that in the limit of N → ∞ we reconstruct the target distribution (and all the respective existing moments):

lim_{N→+∞} (1/N) Σ_{n=1}^N f(xn) → E[f(x)], (10.11)

where f (x) is an arbitrary function such that the expectation on the rhs is well defined.
Finally, the weakest statement, call it parametric convergence, corresponds to the case when one arrives at the target estimate only in a special limit with respect to a special parameter. It is common, e.g. in statistical/theoretical physics and computer science, to study the so-called thermodynamic limit, where the number of degrees of freedom (for example, the number of spins/variables in the Ising model) becomes infinite:

lim_{s→s∗} lim_{N→+∞} (1/N) Σ_{n=1}^N fs(xn) → E[fs∗(x)]. (10.12)

For additional mathematical (but also intuitive, as it is written for applied mathematicians, engineers and physicists) reading on MCMC (and MC in general) convergence, see the article "The mathematics of mixing things up" by Persi Diaconis, and also [16].

Exact Monte Carlo Sampling (Did it converge yet?)

(This part of the lecture is bonus material; we discuss it only if time permits.)
The material follows Chapter 32 of the book by D.J.C. MacKay [15]. An extensive set of modern references, discussions and codes is also available at the website on perfectly random sampling with Markov chains.

As mentioned already, the main problem with MCMC methods is that one needs to wait (sometimes for too long) to make sure that the generated samples (from the target distribution) are i.i.d. If one starts to form a histogram (empirical distribution) too early, it will deviate from the target distribution. One important question in this regard is: for how long shall one run the Markov Chain before it has 'converged'? Answering this question rigorously (with a proof) is very difficult, and in many cases not possible. However, there is a technique which allows one to check exact convergence, in some cases, and to do it on the fly as we run the MCMC.
This smart technique is the Propp-Wilson exact sampling method, also called coupling from the past. The technique is based on a combination of three ideas:

• The main idea is related to the notion of trajectory coalescence. Let us observe that if, starting from different initial conditions, the MCMC chains share a single random number generator, then their trajectories in the phase space can coalesce; and, having coalesced, they will not separate again. This is clearly an indication that the initial conditions have been forgotten.
Will running all the initial conditions forward in time till coalescence generate an exact sample? Apparently not. One can show (it is sufficient to do it for a simple example) that the point of coalescence does not represent an exact sample.

• However, one can still achieve the goal by sampling from a time T0 in the past up to the present. If the coalescence has occurred, the present sample is an unbiased sample; if not, we restart the simulation from a time T0 further into the past, reusing the same random numbers. The simulation is repeated till a coalescence occurs at a time before the present. One can show that the resulting sample at the present is exact.

• One problem with the scheme is that we need to test it for all the initial conditions, which are too many to track. Is there a way to reduce the number of necessary trials? Remarkably, this appears possible for a sub-class of probabilistic models, the so-called 'attractive' models. Loosely speaking, and using 'physics' jargon, these are 'ferromagnetic' models: models where, for a stand-alone pair of variables, the preferred configuration is the one with the same values of the two variables. In the case of an attractive model, the monotonicity (sub-modularity) of the underlying model implies that the paths do not cross. This allows one to study only the limiting trajectories and to deduce the interesting properties of all the other trajectories from the limiting cases.
Figure 10.2: Factor-graph representation for the (simple) case with pair-wise factors only. In the case of the Ising model: f12(σ1, σ2) = exp(−J12σ1σ2 + h1σ1 + h2σ2).

10.2 Graphical Models


This lecture largely follows the material of the mini-course on Graphical Models of Statistical Inference: Belief Propagation & Beyond. See links to slides and lecture notes at the following web-site.

From Ising Model to (Factor) Graphical Models

A brief reminder of what we have learned so far about the Ising Model: it is fully described by Eqs. (10.2, 10.3), and the weight of a "spin" configuration is given by Eq. (10.2). Let us not pay much attention for now to the normalization factor Z, and observe instead that the weight is nicely factorized. Indeed, it is a product of pair-wise terms, each describing an "interaction" between spins. Obviously, we can represent the factorization through a graph. For example, if our spin system consists only of three spins connected to each other, then the respective graph is a triangle. Spins are associated with nodes of the graph, and "interactions", which may also be called (pair-wise) factors, are associated with edges.
It is useful, for resolving this and other factorized problems, to introduce a slightly more general representation, in terms of graphs where both factors and variables are associated with nodes/vertices. The transformation to the factor-graph representation for the three-spin example is shown in Fig. (10.2).
Figure 10.3: Tanner graph of a linear code with N = 10 bits, M = 5 checks, and L = N − M = 5 information bits. This code selects 2^5 codewords out of 2^{10} possible patterns. The adjacency (parity-check) matrix of the code is given by Eq. (10.14).

The Ising Model, as well as the other models discussed later in the lectures, can thus be stated in terms of the general factor-graph framework/model

P(σ) = Z^{−1} Π_{a∈Vf} fa(σa),  σa ≐ (σi | i ∈ Vn, (i, a) ∈ E), (10.13)

where (Vf, Vn, E) is the bi-partite graph built of factors and nodes.


The factor-graph language (representation) is more general. We will see this next, discussing another interesting problem from Information Theory: the decoding of error-correction codes.

Decoding of Graphical Codes as a Factor Graph problem

First, let us discuss the decoding of a graphical code. (Our description here is terse, and we advise the interested reader to check the book by Richardson and Urbanke [?] for more details.) A message word consisting of L information bits is encoded in an N-bit-long code word, N > L. In the case of the binary, linear coding discussed here, a convenient representation of the code is given by M ≥ N − L constraints, often called parity checks or, simply, checks. Formally, ς = (ςi = 0, 1 | i = 1, · · · , N) is one of the 2^L code words iff Σ_{i∼α} ςi = 0 (mod 2) for all checks α = 1, · · · , M, where i ∼ α indicates that the bit i contributes to the check α, and α ∼ i indicates that the check α contains the bit i. The relation between bits and checks is often described in terms of the M × N parity-check matrix H, consisting of ones and zeros: Hαi = 1 if i ∼ α and Hαi = 0 otherwise. The set of the codewords is thus defined as Ξ(cw) = (ς | Hς = 0 (mod 2)). A bipartite graph representation of H, with bits marked as circles, checks marked as squares, and edges corresponding to the respective nonzero elements of H, is usually called (in coding theory) the Tanner graph of the code, or the parity-check graph of the code. (Notice that, fundamentally, a code is defined in terms of the set of its codewords, and there are many parity-check matrices/graphs parameterizing the same code. We ignore this ambiguity here, choosing one convenient parametrization H for the code.) Therefore the bi-partite Tanner graph of the code is defined as G = (G0, G1), where the set of nodes is the union of the sets associated with variables and checks, G0 = G0;v ∪ G0;c, and only edges connecting variables and checks contribute to G1.
For a simple example with 10 bits and 5 checks, the parity-check (adjacency) matrix of the code with the Tanner graph shown in Fig. (10.3) is

    H = [ 1 1 1 1 0 1 1 0 0 0
          0 0 1 1 1 1 1 1 0 0
          0 1 0 1 0 1 0 1 1 1      (10.14)
          1 0 1 0 1 0 0 1 1 1
          1 1 0 0 1 0 1 0 1 1 ]

Another example, of a bigger code and the respective parity-check matrix, is shown in Fig. (10.4). For this example, N = 155, L = 64, M = 91, and the Hamming distance, defined as the minimum l_0-distance between two distinct codewords, is 20.
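To make these definitions concrete, here is a minimal brute-force sketch (ours, not part of the original notes): it enumerates the codewords Ξ^{(cw)} = {ς | Hς = 0 (mod 2)} of the matrix H from Eq. (10.14). If the five checks are linearly independent over GF(2), one expects 2^{10−5} = 32 codewords.

import itertools
import numpy as np

# Parity-check matrix H of Eq. (10.14): rows are checks, columns are bits.
H = np.array([[1, 1, 1, 1, 0, 1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1, 1, 1, 1, 0, 0],
              [0, 1, 0, 1, 0, 1, 0, 1, 1, 1],
              [1, 0, 1, 0, 1, 0, 0, 1, 1, 1],
              [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]])

# A word is a codeword iff every check is satisfied modulo 2.
codewords = [w for w in itertools.product((0, 1), repeat=10)
             if not (H.dot(w) % 2).any()]
print(len(codewords))  # 32 = 2^5, assuming the checks are independent over GF(2)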
Assume that each bit of the transmitted signal is changed (an effect of the channel noise) independently of the others. This happens with some known conditional probability, p(x|σ), where σ = 0, 1 is the value of the bit before transmission, and x is its changed/distorted image. Once x = (x_i | i = 1, · · · , N) is measured, the task of Maximum-A-Posteriori (MAP) decoding becomes to reconstruct the most probable codeword consistent with the measurement:

σ^{(MAP)} = arg max_{σ∈Ξ^{(cw)}} ∏_{i=1}^{N} p(x_i | σ_i).      (10.15)

More generally, the probability of a codeword ς ∈ Ξ^{(cw)} to be a pre-image of x is

P(ς|x) = (Z(x))^{-1} ∏_{i∈G_{0;v}} g^{(ch)}(x_i | ς_i),   Z(x) = ∑_{ς∈Ξ^{(cw)}} ∏_{i∈G_{0;v}} g^{(ch)}(x_i | ς_i),      (10.16)

where Z(x) is thus the partition function dependent on the detected vector x. One may also consider the signal (bit-wise) MAP decoder

∀i :   ς_i^{(s-MAP)} = arg max_{ς_i} ∑_{ς\ς_i : ς∈Ξ^{(cw)}} P(ς | x).      (10.17)

Figure 10.4: Tanner graph and parity-check matrix of the (155, 64, 20) Tanner code, where N = 155 is the length of the code (the size of a code word), L = 64 is the number of information bits, and d = 20 is the Hamming distance of the code.

Partition Function. Marginal Probabilities. Maximum Likelihood.

The partition function in Eq. (10.13) is the normalization factor

Z = ∑_σ ∏_{a∈V_f} f_a(σ_a),   σ_a ≐ (σ_i | i ∈ V_n, (i, a) ∈ E),      (10.18)

where σ = (σ_i ∈ {0, 1} | i ∈ V_n). Here we assume that the alphabet of the elementary random variables is binary; however, the generalization to a larger alphabet is straightforward.
We are interested in 'marginalizing' Eq. (10.13) over a subset of variables, for example over all the elementary/nodal variables but one:

P(σ_i) ≐ ∑_{σ\σ_i} P(σ).      (10.19)

The expectation of σ_i computed with the probability Eq. (10.19) is also called (in physics) the 'magnetization' of the variable.
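As a sanity check of these definitions, the following sketch (ours; the couplings and fields are hypothetical, not values from the notes) computes Z of Eq. (10.18) and the marginal P(σ_1) of Eq. (10.19) by direct enumeration for a three-spin Ising triangle as in Fig. (10.2).

import itertools
import math

J = {(0, 1): 0.3, (1, 2): -0.5, (0, 2): 0.2}  # hypothetical pairwise couplings
h = [0.1, 0.0, -0.2]                          # hypothetical magnetic fields

def weight(s):
    # unnormalized weight exp(sum_ij J_ij s_i s_j + sum_i h_i s_i)
    return math.exp(sum(v * s[i] * s[j] for (i, j), v in J.items())
                    + sum(hi * si for hi, si in zip(h, s)))

states = list(itertools.product((-1, 1), repeat=3))
Z = sum(weight(s) for s in states)                        # Eq. (10.18)
P1 = {v: sum(weight(s) for s in states if s[0] == v) / Z  # Eq. (10.19)
      for v in (-1, 1)}
print(Z, P1, P1[1] - P1[-1])  # the last number is the magnetization of spin 1

Both computations cost O(2^N) operations; the point of the rest of the lecture is to do better.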

Exercise 10.2.1 (not graded). Is a partition-function oracle sufficient for computing P(σ_i)? In the case of the Ising model, what is the relation between P(σ_i) and Z(h)?

Another object of interest is the so-called Maximum Likelihood (ML) state. Stated formally, it is the most probable of all the states represented in Eq. (10.13):

σ_* = arg max_σ P(σ).      (10.20)

All these objects are difficult to compute. “Difficulty”, still stated casually, means that the number of operations needed is exponential in the system size (e.g., the number of variables/spins in the Ising model). This is true in general, i.e. for a GM in general position. However, for some special cases, or even special classes of cases, the computations may be much easier than in the worst case. Thus, the ML state (10.20) for the so-called ferromagnetic (attractive, sub-modular) Ising model can be computed with effort polynomial in the system size. Note that the partition-function computation (at any nonzero temperature) is still exponential even in this case, thus illustrating the general statement: computing Z or P(σ_i) is a more difficult problem than computing σ_*.
A curious fact: the Ising model (ferromagnetic, anti-ferromagnetic or glassy) with zero “magnetic field”, h = 0, on a planar graph represents a unique class of problems for which even computation of Z is easy. In this case the partition function is expressed via the determinant of a finite matrix, while computing the determinant of a size-N matrix is a problem of O(N^3) complexity (actually O(N^{3/2}) in the planar case).
In the general (difficult) case we will need to rely on approximations to make the computations scalable; some of these approximations will be discussed later in the lecture. However, let us first prepare for that by restating the most general problem discussed so far, computation of the partition function Z, as an optimization problem.

Kullback-Leibler Formulation & Probability Polytope

We will go from counting (computing the partition function is a problem of weighted counting) to optimization by changing the description from states to probabilities of the states, which we will also call beliefs. b(σ) will be a belief, i.e. our probabilistic guess, for the probability of state σ. Consider it on the example of the triangle system shown in Fig. (10.2). There are 2^3 states in this case, (σ_1 = ±1, σ_2 = ±1, σ_3 = ±1), which occur with probabilities b(σ_1, σ_2, σ_3). All the beliefs are nonnegative and together should sum to unity. We would like to compare a particular assignment of the beliefs with P(σ), generally described by Eq. (10.13). Let us recall a tool which we already used to compare probabilities, the Kullback-Leibler (KL) divergence discussed in Lecture #2:
 
D(b ‖ P) = ∑_σ b(σ) log ( b(σ) / P(σ) ).      (10.21)

Note that the KL divergence (10.21) is a convex function of the beliefs (remember, there are 2^3 beliefs in our enabling three-node example) within the following polytope, i.e. the domain in the space of beliefs bounded by the linear constraints:

∀σ :   b(σ) ≥ 0,      (10.22)
∑_σ b(σ) = 1.      (10.23)

Moreover, it is straightforward to check (please do it at home!) that the unique minimum of D(b ‖ P) is achieved at b = P, where the KL divergence is zero:

P = arg min_b D(b ‖ P),   min_b D(b ‖ P) = 0.      (10.24)

Substituting Eq. (10.13) into Eq. (10.24) one derives

log Z = − min_b F(b),   F(b) ≐ ∑_σ b(σ) log ( b(σ) / ∏_a f_a(σ_a) ),      (10.25)

where F(b), considered as a function of all the beliefs, is called the (configurational) free energy (a configuration here is an assignment of the beliefs). The terminology originates from statistical physics.
To summarize, we did manage to reduce the counting problem to an optimization problem. This is great; however, so far it is just a reformulation, since the number of variational degrees of freedom (beliefs) is as large as the number of terms in the original sum (the partition function). Indeed, it is not the formula itself but (as we will see below) its further use for approximations which will be extremely useful.
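A quick numerical illustration of Eq. (10.25) (our sketch, on a hypothetical two-spin model): for any belief vector b within the polytope (10.22,10.23), F(b) ≥ −log Z, with equality at b = P.

import itertools
import math
import random

def f(s):  # product of factors for a hypothetical two-spin model
    return math.exp(0.7 * s[0] * s[1] + 0.2 * s[0] - 0.1 * s[1])

states = list(itertools.product((-1, 1), repeat=2))
Z = sum(f(s) for s in states)
P = [f(s) / Z for s in states]

def F(b):
    # F(b) = sum_s b(s) log(b(s) / prod_a f_a(s)) = D(b||P) - log Z
    return sum(bi * math.log(bi / f(s)) for bi, s in zip(b, states) if bi > 0)

r = [random.random() for _ in states]
b = [x / sum(r) for x in r]
print(-math.log(Z), F(P), F(b))  # F(P) equals -log Z; F(b) is never smaller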

Variational Approximations. Mean Field.

The main idea is to reduce the search space from exploration of the (2^N − 1)-dimensional space of beliefs to a lower-dimensional proxy/approximation, i.e. one parameterized with fewer variables. What kind of factorization can one suggest for the multivariate (N-spin) probabilities/beliefs? The idea of postulating independence of all the N variables/spins comes to mind:

b(σ) → b_{MF}(σ) = ∏_i b_i(σ_i),      (10.26)
∀i ∈ V_n, ∀σ_i :   b_i(σ_i) ≥ 0,      (10.27)
∀i ∈ V_n :   ∑_{σ_i} b_i(σ_i) = 1.      (10.28)

Clearly, b_i(σ_i) is interpreted within this substitution as the single-node marginal belief (an estimate of the single-node marginal probability).

Substituting b by b_{MF} in Eq. (10.25), one arrives at the MF estimate of the partition function:

log Z_{mf} = − min_{b_{mf}} F(b_{mf}),
F(b_{mf}) ≐ − ∑_a ∑_{σ_a} ( ∏_{i∼a} b_i(σ_i) ) log f_a(σ_a) + ∑_i ∑_{σ_i} b_i(σ_i) log b_i(σ_i).      (10.29)

Solving the variational problem (10.29) constrained by Eqs. (10.26,10.27,10.28) is equivalent to searching for the (unique) stationary point of the following MF Lagrangian:

L(b_{mf}, λ) ≐ F(b_{mf}) + ∑_i λ_i ( ∑_{σ_i} b_i(σ_i) − 1 ).      (10.30)

Exercise 10.2.2 (not graded). Show that Z_{mf} ≤ Z, and that F(b_{mf}) is a strictly convex function of its (vector) argument. Write down the equations defining the stationary point of L(b_{mf}). Suggest an iterative algorithm converging to the stationary point of L(b_{mf}).

The fact that Z_{mf} (see the exercise above) gives a bound on Z (a lower bound, with the conventions used here) is good news. However, in general the approximation is very crude, i.e. the gap between the bound and the actual value is large. The main reason for this is clear: by assuming that the variables are independent we have ignored significant correlations.
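For pairwise Ising factors, the stationarity conditions of L(b_mf) reduce to the familiar self-consistency relations m_i = tanh(h_i + ∑_j J_ij m_j) for the magnetizations m_i = b_i(+1) − b_i(−1). A damped fixed-point iteration, sketched below with hypothetical parameters (our illustration, not part of the original notes):

import math

J = {(0, 1): 0.3, (1, 2): -0.4, (0, 2): 0.2}  # hypothetical symmetric couplings
h = [0.1, 0.0, -0.2]                          # hypothetical fields
n = len(h)
Jm = [[0.0] * n for _ in range(n)]
for (i, j), v in J.items():
    Jm[i][j] = Jm[j][i] = v

m = [0.0] * n
for _ in range(200):
    # damped update of m_i = tanh(h_i + sum_j J_ij m_j)
    m = [0.5 * m[i] + 0.5 * math.tanh(h[i] + sum(Jm[i][j] * m[j] for j in range(n)))
         for i in range(n)]
print(m)  # mean-field magnetizations; b_i(+1) = (1 + m_i) / 2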
In the next lecture we will analyze what very frequently provides a much better approximation for ML inference, the so-called Belief Propagation approach, along with the related theory and techniques. In addition to discussing inference with Belief Propagation, we will also give brief pointers to the respective inverse problem, learning with Graphical Models.

Dynamic Programming for Inference over Trees

Consider the Ising model over a linear chain of n spins, shown in Fig. 10.5a; the partition function is

Z = ∑_{σ_n} Z(σ_n),      (10.31)

where Z(σ_n) is a newly introduced object representing the sum over all but the last spin in the chain, labeled by n. Z(σ_n) can be expressed as follows:

Z(σ_n) = ∑_{σ_{n−1}} exp( J_{n,n−1} σ_n σ_{n−1} + h_n σ_n ) Z_{(n−1)→(n)}(σ_{n−1}),      (10.32)

where Z_{(n−1)→(n)}(σ_{n−1}) is the partial partition function for the subtree (a shorter chain in this case) rooted at n − 1 and built excluding the branch/link directed towards n.

Figure 10.5: Exemplary interaction/factor graphs which are trees.

The newly introduced, partially summed partition function contains a summation over one fewer spin than the original chain. In fact, this partially summed object can be defined recursively,

Z_{(i−1)→(i)}(σ_{i−1}) = ∑_{σ_{i−2}} exp( J_{i−1,i−2} σ_{i−1} σ_{i−2} + h_{i−1} σ_{i−1} ) Z_{(i−2)→(i−1)}(σ_{i−2}),      (10.33)

that is, expressing one partially summed object via the partially summed object computed at the previous step. The advantage of this recursive approach is obvious: it allows us to replace the summation over exponentially many spin configurations by the summation of only two terms at each step of the recursion.
What should also be obvious is that the method just described is an adaptation of the Dynamic Programming (DP) methods, discussed in the optimization part of the course, to the problem of statistical inference.
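Here is the recursion (10.31)-(10.33) in code (our sketch, with hypothetical chain parameters), verified against brute-force enumeration:

import itertools
import math

n = 6
J = [0.5, -0.3, 0.2, 0.4, -0.1]          # J[i] couples spins i and i+1 (hypothetical)
h = [0.1 * (-1) ** i for i in range(n)]  # hypothetical fields

# forward sweep: msg[s] = Z_{i -> i+1}(sigma_i = s), Eq. (10.33)
msg = {s: math.exp(h[0] * s) for s in (-1, 1)}
for i in range(1, n):
    msg = {s: sum(math.exp(J[i - 1] * s * t + h[i] * s) * msg[t] for t in (-1, 1))
           for s in (-1, 1)}
Z_dp = sum(msg.values())                 # Eq. (10.31): O(n) work instead of O(2^n)

Z_bf = sum(math.exp(sum(J[i] * s[i] * s[i + 1] for i in range(n - 1))
                    + sum(h[i] * s[i] for i in range(n)))
           for s in itertools.product((-1, 1), repeat=n))
print(Z_dp, Z_bf)                        # the two values coincide up to rounding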
It is also clear that the approach just explained allows generalization from the case of the linear chain to the case of a general tree. In the general case, Z(σ_i) is the partition function of the entire tree with the value of the spin at site/node i fixed. We derive

Z(σ_i) = e^{h_i σ_i} ∏_{j∈∂i} ( ∑_{σ_j} e^{J_{ij} σ_i σ_j} Z_{j→i}(σ_j) ),      (10.34)

where ∂i denotes the set of neighbors of the i-th spin and

Z_{j→i}(σ_j) = e^{h_j σ_j} ∏_{k∈∂j\i} ( ∑_{σ_k} e^{J_{kj} σ_k σ_j} Z_{k→j}(σ_k) )      (10.35)

is the partition function of the subtree rooted at node j.


Let us illustrate the general scheme on the example of the tree in Fig. (10.5b). One obtains

Z = ∑_{σ_4} Z(σ_4).      (10.36)

The partition function, partially summed and conditioned on the value σ_4 of the spin at node 4, is

Z(σ_4) = e^{h_4 σ_4} ( ∑_{σ_5} e^{J_{45} σ_4 σ_5} Z_{5→4}(σ_5) ) ( ∑_{σ_6} e^{J_{46} σ_4 σ_6} Z_{6→4}(σ_6) ) ( ∑_{σ_3} e^{J_{34} σ_3 σ_4} Z_{3→4}(σ_3) ),      (10.37)

where

Z_{3→4}(σ_3) = e^{h_3 σ_3} ( ∑_{σ_1} e^{J_{13} σ_1 σ_3} Z_{1→3}(σ_1) ) ( ∑_{σ_2} e^{J_{23} σ_2 σ_3} Z_{2→3}(σ_2) ).      (10.38)

Exercise 10.2.3 (not graded). Demonstrate that the i-th spin is conditionally independent of all the other spins, given fixed values of the spins neighboring the i-th spin, i.e.

p(σ_i | σ\σ_i) = p(σ_i | σ_j ∼ σ_i),      (10.39)

where p(σ_i | σ\σ_i) is the probability distribution of the i-th spin conditioned on the values of all the other spins, and p(σ_i | σ_j ∼ σ_i) is the probability distribution of the i-th spin conditioned on the spin values of its neighbors.

Properties of Undirected Tree-Structured Graphical Models

It appears that in the case of a general pair-wise graphical model over a tree, the joint distribution function over all variables can be expressed solely via single-node marginals and pair-wise marginals over all pairs of graph-neighbors. To illustrate this important factorization property, let us consider the examples shown in Fig. 10.6. In the case of the two-node example of Fig. 10.6a the statement is obvious, as it follows directly from the Bayes formula

P(x_1, x_2) = P(x_1) P(x_2 | x_1),      (10.40)

or, equivalently, P(x_1, x_2) = P(x_2) P(x_1 | x_2).


For the pair-wise graphical model shown in Fig. 10.6b one obtains

P(x_1, x_2, x_3) = P(x_1, x_2) P(x_3 | x_1, x_2) = P(x_1, x_2) P(x_3 | x_2) = P(x_1) P(x_2 | x_1) P(x_3 | x_2) = P(x_1, x_2) P(x_2, x_3) / P(x_2),      (10.41)

where the conditional independence of x_3 from x_1, P(x_3 | x_1, x_2) = P(x_3 | x_2), was used.
Next, let us work out the example of the pair-wise graphical model shown in Fig. 10.6c:

P(x_1, x_2, x_3, x_4) = P(x_1, x_2, x_3) P(x_4 | x_1, x_2, x_3) = P(x_1, x_2, x_3) P(x_4 | x_2) = P(x_1, x_2) P(x_3 | x_1, x_2) P(x_4 | x_2) = P(x_1, x_2) P(x_3 | x_2) P(x_4 | x_2) = P(x_1) P(x_2 | x_1) P(x_3 | x_2) P(x_4 | x_2) = P(x_1, x_2) P(x_2, x_3) P(x_2, x_4) / P^2(x_2).      (10.42)

Figure 10.6: Examples of undirected tree-structured graphical models.

Here one uses the reductions P(x_4 | x_1, x_2, x_3) = P(x_4 | x_2) and P(x_3 | x_1, x_2) = P(x_3 | x_2), which express the respective conditional-independence properties.
Finally, it is easy to verify that the joint probability distribution corresponding to the model in Fig. 10.6d is

P(x_1, x_2, x_3, x_4, x_5, x_6) = P(x_1) P(x_2 | x_1) P(x_3 | x_2) P(x_4 | x_2) P(x_5 | x_2) P(x_6 | x_5) = P(x_1, x_2) P(x_2, x_3) P(x_2, x_4) P(x_2, x_5) P(x_5, x_6) / ( P^3(x_2) P(x_5) ).      (10.43)

In general, the joint probability distribution of a tree-like graphical model can be written as follows:

P(x_1, x_2, . . . , x_n) = ∏_{(i,j)∈E} P(x_i, x_j) / ∏_{i∈V} P^{q_i−1}(x_i),      (10.44)

where q_i is the degree of the i-th node. Eq. (10.44) can be proven by induction.
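Eq. (10.44) is also easy to verify numerically. The sketch below (ours; a hypothetical random positive distribution that is Markov on the chain x_1 − x_2 − x_3 of Fig. 10.6b) checks the factorization P(x_1, x_2, x_3) = P(x_1, x_2) P(x_2, x_3) / P(x_2):

import itertools
import numpy as np

rng = np.random.default_rng(0)
f12, f23 = rng.random((2, 2)), rng.random((2, 2))  # hypothetical pairwise factors
P = np.einsum('ab,bc->abc', f12, f23)
P /= P.sum()                                       # a joint distribution Markov on the chain

P12, P23 = P.sum(axis=2), P.sum(axis=0)            # pairwise marginals
P2 = P.sum(axis=(0, 2))                            # node 2 marginal; its degree is q_2 = 2
for a, b, c in itertools.product((0, 1), repeat=3):
    assert abs(P[a, b, c] - P12[a, b] * P23[b, c] / P2[b]) < 1e-12
print("Eq. (10.44) verified on the chain of Fig. 10.6b")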

Bethe Free Energy & Belief Propagation

As discussed above, Dynamic Programming is a provably exact approach for inference when the graph is a tree. It also provides an empirically good approximation for a very broad family of problems stated on loopy graphs.


The approximation is usually called Bethe-Peierls or Belief Propagation (BP is an abbreviation that works for both); Loopy BP is another popular term. See the original paper [?], a comprehensive review [?], and the respective lecture notes for advanced/additional reading.
Instead of Eq. (10.26) one uses the following BP substitution:

b(σ) → b_{bp}(σ) = ∏_a b_a(σ_a) / ∏_i ( b_i(σ_i) )^{q_i−1},      (10.45)
∀a ∈ V_f, ∀σ_a :   b_a(σ_a) ≥ 0,      (10.46)
∀i ∈ V_n, ∀a ∼ i :   b_i(σ_i) = ∑_{σ_a\σ_i} b_a(σ_a),      (10.47)
∀i ∈ V_n :   ∑_{σ_i} b_i(σ_i) = 1,      (10.48)

where q_i stands for the degree of node i. The physical meaning of the factor q_i − 1 on the rhs of Eq. (10.45) is straightforward: by placing beliefs on the factor-nodes connected by edges to a node i, we over-count the contribution of an individual variable q_i times, and thus the denominator in Eq. (10.45) comes as a correction for this over-counting.
Substitution of Eq. (10.45) into Eq. (10.25) results in what is called the Bethe Free Energy (BFE):

F_{bp} ≐ E_{bp} − H_{bp},      (10.49)
E_{bp} ≐ − ∑_a ∑_{σ_a} b_a(σ_a) log f_a(σ_a),      (10.50)
H_{bp} = − ∑_a ∑_{σ_a} b_a(σ_a) log b_a(σ_a) + ∑_i ∑_{σ_i} (q_i − 1) b_i(σ_i) log b_i(σ_i),      (10.51)

where E_{bp} is the so-called self-energy (physics jargon) and H_{bp} is the BP-entropy (this name should be clear in view of what we have discussed about entropy so far). Thus the BP version of the KL-divergence minimization becomes

arg min_{b_a, b_i : Eqs. (10.46,10.47,10.48)} F_{bp},      (10.52)

min_{b_a, b_i : Eqs. (10.46,10.47,10.48)} F_{bp},      (10.53)

where the minimizer (10.52) gives the BP beliefs and the minimal value (10.53) gives the BP estimate, −log Z_{bp}.
Question: Is F_{bp} a convex function (of its arguments)? [Not always; however, for some graphs and/or some factor functions convexity does hold.]
The ML (zero temperature) version of Eq. (10.52) results from the following optimization:

min_{b_a, b_i : Eqs. (10.46,10.47,10.48)} E_{bp}.      (10.54)

Note that this optimization is a Linear Programming (LP) problem: minimizing a linear objective over a set of linear constraints.

Belief Propagation & Message Passing

Let us restate Eq. (10.52) as an unconditional optimization. We use the standard method of Lagrange multipliers to achieve this. The resulting Lagrangian is

L_{bp}(b, η, λ) ≐ − ∑_a ∑_{σ_a} b_a(σ_a) log f_a(σ_a) + ∑_a ∑_{σ_a} b_a(σ_a) log b_a(σ_a) − ∑_i ∑_{σ_i} (q_i − 1) b_i(σ_i) log b_i(σ_i)
+ ∑_i ∑_{a∼i} ∑_{σ_i} η_{ia}(σ_i) ( b_i(σ_i) − ∑_{σ_a\σ_i} b_a(σ_a) ) + ∑_i λ_i ( ∑_{σ_i} b_i(σ_i) − 1 ),      (10.55)

where η and λ are the dual (Lagrange) variables associated with the conditions Eqs. (10.47) and (10.48), respectively. Then Eq. (10.52) becomes the following min-max problem:

min_b max_{η,λ} L_{bp}(b, η, λ).      (10.56)

Changing the order of optimizations in Eq. (10.56) and then optimizing over the beliefs, one arrives at the following expressions for the beliefs via messages (check the derivation details):

∀a, ∀σ_a :   b_a(σ_a) ∼ f_a(σ_a) exp( ∑_{i∼a} η_{ia}(σ_i) ) ≐ f_a(σ_a) ∏_{i∼a} n_{i→a}(σ_i) = f_a(σ_a) ∏_{i∼a} ∏_{b∼i, b≠a} m_{b→i}(σ_i),      (10.57)

∀i, ∀σ_i :   b_i(σ_i) ∼ exp( ∑_{a∼i} η_{ia}(σ_i) / (q_i − 1) ) ≐ ∏_{a∼i} m_{a→i}(σ_i),      (10.58)

where, as usual, ∼ for beliefs means equality up to a constant that guarantees that the sum of the respective beliefs is unity, and where we have also introduced the auxiliary variables m and n, called messages, related to the Lagrange multipliers η as follows:

∀i, ∀a ∼ i :   n_{i→a}(σ_i) ≐ exp( η_{ia}(σ_i) ),      (10.59)
∀a, ∀i ∼ a :   m_{a→i}(σ_i) ≐ exp( η_{ia}(σ_i) / (q_i − 1) ).      (10.60)

Combining Eqs. (10.57,10.58,10.59,10.60) with Eq. (10.47) results in the following BP equations stated in terms of the message variables:

∀i, ∀a ∼ i, ∀σ_i :   n_{i→a}(σ_i) = ∏_{b∼i, b≠a} m_{b→i}(σ_i),      (10.61)
∀a, ∀i ∼ a, ∀σ_i :   m_{a→i}(σ_i) = ∑_{σ_a\σ_i} f_a(σ_a) ∏_{j∼a, j≠i} n_{j→a}(σ_j).      (10.62)

Note that if the Bethe Free Energy (10.49) is non-convex there may be multiple fixed points of Eqs. (10.61,10.62). The following iterative, so-called Message Passing (MP), algorithm (Algorithm 10) is used to find a fixed-point solution of the BP Eqs. (10.61,10.62).

Algorithm 10 Message Passing, Sum-Product Algorithm [factor-graph representation]

Input: The graph. The factors.
1: ∀a, ∀i ∼ a, ∀σ_i : m_{a→i}(σ_i) = 1 [initialize factor-to-variable messages]
2: ∀i, ∀a ∼ i, ∀σ_i : n_{i→a}(σ_i) = 1 [initialize variable-to-factor messages]
3: loop till convergence within an error [or proceed with a fixed number of iterations]
4:   ∀i, ∀a ∼ i, ∀σ_i : n_{i→a}(σ_i) ← ∏_{b∼i, b≠a} m_{b→i}(σ_i)
5:   ∀a, ∀i ∼ a, ∀σ_i : m_{a→i}(σ_i) ← ∑_{σ_a\σ_i} f_a(σ_a) ∏_{j∼a, j≠i} n_{j→a}(σ_j)
6: end loop
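A compact runnable sketch of Algorithm 10 (ours, with hypothetical random pairwise factors on a small tree factor graph); the beliefs (10.58) are compared against the exact marginals, to which BP is exact on trees:

import numpy as np

rng = np.random.default_rng(1)
# variables 0, 1, 2; pairwise factors 'a' on (0, 1) and 'b' on (1, 2)
factors = {'a': ((0, 1), rng.random((2, 2))),
           'b': ((1, 2), rng.random((2, 2)))}
nbrs = {0: ['a'], 1: ['a', 'b'], 2: ['b']}  # factors adjacent to each variable

m = {(a, i): np.ones(2) for a, (vs, _) in factors.items() for i in vs}  # factor -> variable
nmsg = {(i, a): np.ones(2) for i in nbrs for a in nbrs[i]}              # variable -> factor

for _ in range(5):                          # a few sweeps suffice on a tree
    for i, alist in nbrs.items():           # Eq. (10.61)
        for a in alist:
            prod = np.ones(2)
            for b in alist:
                if b != a:
                    prod = prod * m[(b, i)]
            nmsg[(i, a)] = prod / prod.sum()
    for a, (vs, f) in factors.items():      # Eq. (10.62), pairwise factors only
        m[(a, vs[0])] = f @ nmsg[(vs[1], a)]
        m[(a, vs[1])] = f.T @ nmsg[(vs[0], a)]
        m[(a, vs[0])] = m[(a, vs[0])] / m[(a, vs[0])].sum()
        m[(a, vs[1])] = m[(a, vs[1])] / m[(a, vs[1])].sum()

P = np.einsum('ab,bc->abc', factors['a'][1], factors['b'][1])
P /= P.sum()                                # exact joint for comparison
exact = [P.sum(axis=(1, 2)), P.sum(axis=(0, 2)), P.sum(axis=(0, 1))]
for i, alist in nbrs.items():               # beliefs b_i of Eq. (10.58)
    b = np.ones(2)
    for a in alist:
        b = b * m[(a, i)]
    print(i, b / b.sum(), exact[i])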

Exercise 10.2.4 (not graded). Derive the T = 0 version of the message-passing equations above. [Hint: the iterative equations should contain alternating min- and sum- steps, hence the name min-sum algorithm.] Study the performance of the message-passing algorithm on the example of decoding a small code; for example, check this student midterm paper for a discussion of the decoding of a binary (3, 6) code over the Binary-Erasure Channel (BEC). Show how BP decodes, and contrast BP decoding against MAP decoding. What is the (best) complexity of the MAP decoder for a code over the BEC channel? [Hint: use Gaussian elimination over GF(2).]

Sufficient Statistics

So far we have been discussing the direct (inference) GM problem. In the remainder of this lecture we will briefly talk about inverse problems; this subject will also be discussed (on the example of trees) in what follows.
Stated casually, the inverse problem is about 'learning' a GM from data/samples. Think of the following two-room setting. In one room a GM is known and many samples are generated. The samples, but not the GM (!!!), are passed to the second room. The task becomes to reconstruct the GM from the samples.
The first question we should ask is whether this is possible in principle, even given an infinite number of samples. The very powerful notion of sufficient statistics helps to answer this question.
Consider the Ising model (not for the first time in this course), using slightly different notation than before:

P(σ) = (1/Z(θ)) exp( ∑_{i∈V} θ_i σ_i + ∑_{{i,j}∈E} θ_{ij} σ_i σ_j ) = exp{ θ^T φ(σ) − log Z(θ) },      (10.63)

where σ_i ∈ {−1, 1} and the partition function Z(θ) serves to normalize the probability distribution. In fact, Eq. (10.63) describes what is called an exponential family, emphasizing the 'exponential' dependence on the parameters θ.

Exercise 10.2.5 (not graded). Show that any pairwise GM over binary variables can be
represented as an Ising model.

Consider the collection of all first and second moments (but only these two) of the spin variables, μ^{(1)} ≐ (μ_i = E[σ_i], i ∈ V) and μ^{(2)} ≐ (μ_{ij} = E[σ_i σ_j], {i, j} ∈ E). The sufficient-statistics statement is that, to reconstruct the θ fully defining the GM, it is sufficient to know μ^{(1)} and μ^{(2)}.

Maximum-Likelihood Estimation/Learning of GM

Let us turn the sufficiency into a constructive statement: the Maximum-Likelihood estimation over an exponential family of GMs.
First, notice that (according to the definition of μ)

∀i :   ∂_{θ_i} log Z(θ) = μ_i,   ∀i, j :   ∂_{θ_{ij}} log Z(θ) = μ_{ij}.      (10.64)

This leads to the following statement: if we know how to compute the log-partition function for any value of θ, then reconstructing the 'correct' θ is a convex optimization problem (over θ):

θ_* = arg max_θ { μ^T θ − log Z(θ) }.      (10.65)

If P represents the empirical distribution of a set of independent identically-distributed (i.i.d.) samples {σ^{(s)}, s = 1, . . . , S}, then μ are the corresponding empirical moments, e.g. μ_{ij} = (1/S) ∑_s σ_i^{(s)} σ_j^{(s)}.
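The statement can be tested end-to-end on a toy model. The sketch below (ours; a three-spin Ising model with hypothetical ground-truth parameters) draws samples, computes the empirical moments μ, and runs gradient ascent on μ^T θ − log Z(θ), whose gradient is μ − E_θ[φ(σ)] by Eq. (10.64); the partition function is computed by exact enumeration, which is feasible only because the model is tiny.

import itertools
import numpy as np

states = np.array(list(itertools.product((-1, 1), repeat=3)))
def phi(s):  # sufficient statistics (s1, s2, s3, s1*s2, s2*s3, s1*s3)
    return np.array([s[0], s[1], s[2], s[0] * s[1], s[1] * s[2], s[0] * s[2]])

Phi = np.array([phi(s) for s in states])
theta_true = np.array([0.2, -0.1, 0.0, 0.5, -0.3, 0.1])  # hypothetical ground truth

rng = np.random.default_rng(2)
p = np.exp(Phi @ theta_true)
p /= p.sum()
samples = states[rng.choice(len(states), size=20000, p=p)]
mu = np.array([phi(s) for s in samples]).mean(axis=0)     # empirical moments

theta = np.zeros(6)
for _ in range(5000):
    w = np.exp(Phi @ theta)
    w /= w.sum()                                          # exact P_theta by enumeration
    theta += 0.1 * (mu - w @ Phi)                         # ascend along mu - E_theta[phi]
print(np.round(theta, 2), theta_true)                     # close, up to sampling noise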

General Remarks about GM Learning. The ML parameter estimation (10.65) is the best we can do. It is fundamental for the task of Machine Learning, and in fact it generalizes beyond the case of the Ising model.
Unfortunately, there are only very few nontrivial cases when the partition function can be calculated efficiently for all values of θ (or of the parameters, if we work with a more general class of GMs than those described by the Ising model).
Therefore, to make the task of parameter estimation practical one needs to rely on one of the following approaches:

• Limit consideration to the class of models for which computation of the partition function can be done efficiently for any values of the parameters. We will discuss such a case below: the so-called tree (Chow-Liu) learning. (In fact, the partition function can also be computed efficiently in the case of the Ising model over planar graphs and generalizations; see this recent paper for details.)

• Rely on approximations, such as variational approximations (MF, BP, and others), MCMC, or approximate elimination (approximate Dynamic Programming).

• There also exists a very innovative approach which allows one to learn a GM efficiently, however using more information than suggested by the notion of sufficient statistics. As one of the scientists contributing to this line of research put it, 'the sufficient statistics is not sufficient'. This is a fascinating novel subject, which is however beyond the scope of this course; check this article and references therein if interested.

Learning Spanning Tree

Eq. (10.44) suggests that knowing the structure of a tree-based graphical model allows one to express the joint probability distribution in terms of the single-node and pairwise (edge-related) marginals. Below we will utilize this statement to pose and solve an inverse problem. Specifically, we attempt to reconstruct a tree representing the correlations between multiple (ideally, infinitely many) snapshots of the discrete random variables x_1, x_2, . . . , x_n.
A straightforward strategy to achieve this goal is as follows. First, one estimates all possible single-node and pairwise marginal probability distributions, P(x_i) and P(x_i, x_j), from the infinite set of snapshots. Then, we may similarly estimate the joint distribution function and verify, for each possible tree layout, whether the relations (10.44) hold. However, this strategy is not feasible, as it requires (in the worst, unlucky case) testing exponentially many, n^{n−2}, possible spanning trees. Luckily, a smart and computationally efficient way of solving the problem was suggested by Chow and Liu in 1968.

Consider the candidate probability distribution P_T(x_1, . . . , x_n) over a tree T = (V, E) (where V and E are the sets of nodes and edges of the tree, respectively), which is tree-factorized according to Eq. (10.44) via the marginal (pair-wise and single-variable) probabilities as follows:

P_T(x_1, x_2, . . . , x_n) = ∏_{(i,j)∈E} P(x_i, x_j) / ∏_{i∈V} P^{q_i−1}(x_i).      (10.66)
The 'distance' between the actual (correct) joint probability distribution P and the candidate tree-factorized probability distribution P_T can be measured in terms of the Kullback-Leibler (KL) divergence

D(P ‖ P_T) = ∑_{x⃗} P(x⃗) log ( P(x⃗) / P_T(x⃗) ).      (10.67)

As discussed in Section 8.3, the KL divergence is positive if P and P_T are different, and is zero if the distributions are identical. We are thus looking for the tree that minimizes the KL divergence.
Substituting (10.66) into Eq. (10.67), one arrives at the following chain of explicit transformations:

D(P ‖ P_T) = ∑_{x⃗} P(x⃗) ( log P(x⃗) − ∑_{(i,j)∈E} log P(x_i, x_j) + ∑_{i∈V} (q_i − 1) log P(x_i) ) =
= ∑_{x⃗} P(x⃗) log P(x⃗) − ∑_{(i,j)∈E} ∑_{x_i, x_j} P(x_i, x_j) log P(x_i, x_j) + ∑_{i∈V} (q_i − 1) ∑_{x_i} P(x_i) log P(x_i) =
= − ∑_{(i,j)∈E} ∑_{x_i, x_j} P(x_i, x_j) log ( P(x_i, x_j) / ( P(x_i) P(x_j) ) ) + ∑_{x⃗} P(x⃗) log P(x⃗) − ∑_{i∈V} ∑_{x_i} P(x_i) log P(x_i),      (10.68)

where the nodal and edge marginalization relations, ∀i ∈ V : P(x_i) = ∑_{x⃗\x_i} P(x⃗) and ∀(i, j) ∈ E : P(x_i, x_j) = ∑_{x⃗\{x_i,x_j}} P(x⃗), were used. One observes that the Kullback-Leibler divergence becomes

D(P ‖ P_T) = − ∑_{(i,j)∈E} I(X_i, X_j) + ∑_{i∈V} S(X_i) − S(X⃗),      (10.69)

where S denotes the entropy and

I(X_i, X_j) ≐ ∑_{x_i, x_j} P(x_i, x_j) log ( P(x_i, x_j) / ( P(x_i) P(x_j) ) )      (10.70)

is the mutual information of the pair of random variables x_i and x_j.



Since the entropies S(X_i) and S(X⃗) do not depend on the choice of the tree, minimizing the Kullback-Leibler divergence is equivalent to maximizing the following sum over the branches of the tree:

∑_{(i,j)∈E} I(X_i, X_j).      (10.71)

Based on this observation, Chow and Liu suggested using the following (standard in computer science) Kruskal algorithm for maximum-weight spanning-tree reconstruction (notice that the algorithm is greedy); a short implementation sketch follows the list below:

• (step 1) Sort the edges of G into decreasing order by weight = mutual information, i.e. I(X_i, X_j) for the candidate edge (i, j). Let E_T be the set of edges comprising the maximum-weight spanning tree. Set E_T = ∅.

• (step 2) Add the first edge to E_T.

• (step 3) Add the next edge to E_T if and only if it does not form a cycle in E_T.

• (step 4) If E_T has n − 1 edges (where n is the number of nodes in G), stop and output E_T. Otherwise go to step 3.
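A sketch of the whole pipeline (ours; the joint table is a hypothetical random distribution, not the one of Table 10.1): estimate the pairwise mutual informations (10.70) and greedily grow the maximum-weight spanning tree.

import itertools
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((2, 2, 2, 2))
P /= P.sum()                 # hypothetical joint distribution over binary x1..x4
n = 4

def mutual_info(i, j):       # Eq. (10.70) from the pairwise and single-node marginals
    Pij = P.sum(axis=tuple(k for k in range(n) if k not in (i, j)))
    Pi, Pj = Pij.sum(axis=1), Pij.sum(axis=0)
    return sum(Pij[a, b] * np.log(Pij[a, b] / (Pi[a] * Pj[b]))
               for a in (0, 1) for b in (0, 1) if Pij[a, b] > 0)

edges = sorted(((mutual_info(i, j), i, j)                 # step 1: sort by weight
                for i, j in itertools.combinations(range(n), 2)), reverse=True)

parent = list(range(n))      # union-find used for the cycle test of step 3
def root(x):
    while parent[x] != x:
        x = parent[x]
    return x

tree = []
for w, i, j in edges:        # steps 2-4: take heaviest edges that close no cycle
    ri, rj = root(i), root(j)
    if ri != rj:
        parent[ri] = rj
        tree.append((i, j, round(w, 4)))
print(tree)                  # the n - 1 = 3 branches of the Chow-Liu tree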

Eq. (10.44) is exact only when it is guaranteed that the graphical model we attempt to recover forms a tree. However, the same tree ansatz can be used to recover the best tree approximation for a graphical model defined over a graph with loops. How does one choose the optimal (best-approximation) tree in this case? To answer this question within the aforementioned Kullback-Leibler paradigm, one needs to compare the tree ansatz (10.44) with the empirical joint distribution; the reconstruction of the optimal tree is again based on the Chow-Liu algorithm.

Exercise 10.2.6. Find the Chow-Liu optimal spanning-tree approximation for the joint probability distribution of four random binary variables with the statistical information presented in Table 10.1. [Hint: Estimate the empirical (i.e. based on the data) pair-wise mutual information and then utilize the Chow-Liu-Kruskal algorithm (see the description above in the lecture notes) to reconstruct the optimal tree.]

Table 10.1: Information available about an exemplary probability distribution of four binary variables, as discussed in Exercise 10.2.6.

x1 x2 x3 x4 P (x1 , x2 , x3 , x4 ) P (x1 )P (x2 |x1 )P (x3 |x2 )P (x4 |x1 ) P (x1 )P (x2 )P (x3 )P (x4 )
0000 0.100 0.130 0.046
0001 0.100 0.104 0.046
0010 0.050 0.037 0.056
0011 0.050 0.030 0.056
0100 0.000 0.015 0.056
0101 0.000 0.012 0.056
0110 0.100 0.068 0.068
0111 0.050 0.054 0.068
1000 0.050 0.053 0.056
1001 0.100 0.064 0.056
1010 0.000 0.015 0.068
1011 0.000 0.018 0.068
1100 0.050 0.033 0.068
1101 0.050 0.040 0.068
1110 0.150 0.149 0.083
1111 0.150 0.178 0.083

10.3 Neural Networks


This Section is a work in progress. If time permits, we plan to follow material from Chapter V of the “Information Theory, Inference, and Learning Algorithms” book by David MacKay [15], devoted to Neural Networks. Some useful material can also be found in the recent book “Linear Algebra and Learning from Data” by Gilbert Strang [18], specifically in Chapter VII, “Learning from Data”; and also in the lecture on “Deep Learning and Graphical Models” by Eric Xing.

10.3.1 Single Neuron and Supervised Learning

Exercise 10.3.1. Consider a Neural Network (NN) with two layers, each with only one node. Assume that the network computes

ŷ = tanh( w_2 tanh( w_1 x + b_1 ) + b_2 ),

and assume that the weights are currently set at (w_1, b_1) = (1.0, 0.5) and (w_2, b_2) = (−0.5, 0.3). What is the gradient of the Mean Square Error (MSE) cost for the observation (x, y) = (2, −0.5)? What are the optimal MSE and the optimal values of the parameters?
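A worked sketch for the first question (ours; we adopt the convention C = (ŷ − y)^2 for the MSE cost of a single observation, noting that whether a factor of 1/2 is included is a matter of convention), with a finite-difference check of one component:

import math

w1, b1, w2, b2 = 1.0, 0.5, -0.5, 0.3
x, y = 2.0, -0.5

def forward(w1, b1, w2, b2):
    a1 = math.tanh(w1 * x + b1)           # hidden activation
    return a1, math.tanh(w2 * a1 + b2)    # (a1, yhat)

a1, yhat = forward(w1, b1, w2, b2)
d2 = 2 * (yhat - y) * (1 - yhat ** 2)     # dC/dz2, where z2 = w2 * a1 + b2
d1 = d2 * w2 * (1 - a1 ** 2)              # dC/dz1, where z1 = w1 * x + b1 (chain rule)
grad = {'w2': d2 * a1, 'b2': d2, 'w1': d1 * x, 'b1': d1}
print(grad)

eps = 1e-6                                # finite-difference check of dC/dw1
num = ((forward(w1 + eps, b1, w2, b2)[1] - y) ** 2
       - (forward(w1 - eps, b1, w2, b2)[1] - y) ** 2) / (2 * eps)
print(num, grad['w1'])                    # the two values agree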

10.3.2 Hopfield Networks and Boltzmann Machines


Bibliography

[1] M. Tabor, Principles and Methods of Applied Mathematics. University of Arizona Press, 1999.

[2] V. Arnold, Ordinary Differential Equations. The MIT Press, 1973.

[3] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal
algorithms,” Physica D: Nonlinear Phenomena, vol. 60, no. 1, pp. 259–268, 1992.

[4] J. Calder, “The calculus of variations (lecture notes),” http://www-users.math.umn.edu/~jwcalder/CalculusOfVariations.pdf, 2019.

[5] B. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, 1964.

[6] Y. E. Nesterov, “A method for solving the convex programming problem with convergence rate O(1/k^2),” Dokl. Akad. Nauk SSSR, vol. 269, pp. 543–547, 1983.

[7] W. Su, S. Boyd, and E. J. Candes, “A Differential Equation for Modeling Nesterov’s
Accelerated Gradient Method: Theory and Insights,” arXiv:1503.01243, 2015.

[8] A. C. Wilson, B. Recht, and M. I. Jordan, “A Lyapunov Analysis of Momentum Methods in Optimization,” arXiv:1611.02635, 2016.

[9] M. Levi, Classical Mechanics with Calculus of Variations and Optimal Control: An
Intuitive Introduction. AMS, 2014.

[10] A. Chambolle, “An algorithm for total variation minimization and applications,” Jour-
nal of Mathematical Imaging and Vision, vol. 20, pp. 89–97, 2004.

[11] R. K. P. Zia, E. F. Redish, and S. R. McKay, “Making sense of the Legendre transform,” American Journal of Physics, vol. 77, no. 7, pp. 614–622, Jul 2009. [Online]. Available: http://dx.doi.org/10.1119/1.3119512


[12] L. Pontryagin, V. Boltyanskii, R. Gamkrelidze, and E. Mishchenko, The Mathematical Theory of Optimal Processes (translated from Russian in 1962). Wiley, 1956.

[13] A. T. Fuller, “Bibliography of Pontryagin's maximum principle,” Journal of Electronics and Control, vol. 15, no. 5, pp. 513–517, 1963.

[14] R. Bellman, “On the theory of dynamic programming,” PNAS, vol. 38, no. 8, p. 716,
1952.

[15] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[16] C. Moore and S. Mertens, The Nature of Computation. New York, NY, USA: Oxford
University Press, 2011.

[17] N. G. van Kampen, Stochastic Processes in Physics and Chemistry. North Holland, 2007.

[18] G. Strang, Linear Algebra and Learning from Data. Wellesley-Cambridge Press, 2019.
