
Lecture Notes on the

Principles and Methods of Applied Mathematics

Michael (Misha) Chertkov


(lecturer)
and Colin Clark
(recitation instructor for this and other core classes)

Graduate Program in Applied Mathematics,


University of Arizona, Tucson

March 17, 2023


Contents

1 Applied Math Core Courses 1

Fall Semester 5

I Applied Analysis 6

2 Complex Analysis 7
2.1 Complex Variables and Complex-valued Functions . . . . . . . . . . . . . . 7
2.1.1 The Cartesian Representation of Complex Variables . . . . . . . . . 7
2.1.2 The Polar Representation of Complex Variables . . . . . . . . . . . . 10
2.1.3 Parameterization of Curves in the Complex Plane . . . . . . . . . . 12
2.1.4 Functions of a Complex Variable . . . . . . . . . . . . . . . . . . . . 14
2.1.5 Complex Exponentials . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.6 Multi-valued Functions and Branch Cuts . . . . . . . . . . . . . . . 16
2.2 Analytic Functions and Integration along Contours . . . . . . . . . . . . . . 22
2.2.1 Analytic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.2 Integration along Contours . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.3 Cauchy’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2.4 Cauchy’s Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2.5 Laurent Series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.3 Residue Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.1 Singularities and Residues . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.2 Evaluation of Real-valued Integrals by Contour Integration . . . . . 42
2.3.3 Contour Integration with Multi-valued Functions . . . . . . . . . . . 49
2.4 Extreme-, Stationary- and Saddle-Point Methods ∗ . . . . . . . . . . . . . . 53


3 Fourier Analysis 57
3.1 The Fourier Transform and Inverse Fourier Transform . . . . . . . . . . . . 57
3.2 Properties of the 1-D Fourier Transform . . . . . . . . . . . . . . . . . . . . 58
3.3 Dirac’s δ-function. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3.1 The δ-function as the limit of a δ-sequence . . . . . . . . . . . . . . 61
3.3.2 Properties of the δ-function . . . . . . . . . . . . . . . . . . . . . . . 64
3.3.3 Using δ-functions to Prove Properties of Fourier Transforms . . . . . 65
3.3.4 The δ-function in Higher Dimensions . . . . . . . . . . . . . . . . . . 66
3.3.5 Formal Differentiation: The Heaviside Function and the Derivatives
of the δ-function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4 Closed form representation for select Fourier Transforms . . . . . . . . . . . 68
3.4.1 Elementary examples of closed form representations . . . . . . . . . 68
3.4.2 More advanced examples of closed form representations . . . . . . . 71
3.4.3 Closed form representations in higher dimensions . . . . . . . . . . . 73
3.5 Fourier Series: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.6 Properties of the Fourier Series . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.7 Riemann-Lebesgue Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.8 Gibbs Phenomenon . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.9 Laplace Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.9.1 Integral Representations and Asymptotics of Special Functions . . . 82
3.10 From Differential to Algebraic Equations with FT, FS and LT . . . . . . . . 84

II Differential Equations 87

4 Ordinary Differential Equations. 88


4.1 ODEs: Simple cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.1.1 Separable Differential Equations . . . . . . . . . . . . . . . . . . . . 89
4.1.2 Variation of Parameters . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.2 Direct Methods for Solving Linear ODEs . . . . . . . . . . . . . . . . . . . . 93
4.2.1 Homogeneous ODEs with Constant Coefficients . . . . . . . . . . . . 93
4.2.2 Inhomogeneous ODEs . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.3 Linear Dynamics via the Green Function . . . . . . . . . . . . . . . . . . . . 95
4.3.1 Evolution of a linear scalar . . . . . . . . . . . . . . . . . . . . . . . 96
4.3.2 Evolution of a vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.3.3 Higher Order Linear Dynamics . . . . . . . . . . . . . . . . . . . . . 101
4.3.4 Laplace Transform and Laplace Method . . . . . . . . . . . . . . . . 106

4.4 Linear Static Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111


4.4.1 One-Dimensional Poisson Equation . . . . . . . . . . . . . . . . . . . 111
4.5 Sturm–Liouville (spectral) theory . . . . . . . . . . . . . . . . . . . . . . . . 113
4.5.1 Hilbert Space and its completeness . . . . . . . . . . . . . . . . . . . 113
4.5.2 Hermitian and non-Hermitian Differential Operators . . . . . . . . . 114
4.5.3 Hermite Polynomials. . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.5.4 Case study: Schrödinger Equation in 1d ∗ . . . . . . . . . . . . . . . 119
4.6 Phase Space Dynamics for Conservative and Perturbed Systems . . . . . . . 121
4.6.1 Integrals of Motion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.6.2 Phase Portrait . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.6.3 Small Perturbation of a Conservative System . . . . . . . . . . . . . 125

5 Partial Differential Equations. 129


5.1 First-Order PDE: Method of Characteristics . . . . . . . . . . . . . . . . . . 129
5.2 Classification of linear second-order PDEs: . . . . . . . . . . . . . . . . . . . 135
5.3 Elliptic PDEs: Method of Green Function . . . . . . . . . . . . . . . . . . . 138
5.4 Waves in a Homogeneous Media: Hyperbolic PDE ∗ . . . . . . . . . . . . . 141
5.5 Diffusion Equation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.6 Boundary Value Problems: Fourier Method . . . . . . . . . . . . . . . . . . 148
5.7 Case study: Burgers’ Equation ∗ . . . . . . . . . . . . . . . . . . . . . . . . 150

Spring Semester 150

III Optimization 151

6 Calculus of Variations 152


6.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.1.1 Fastest Path . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.1.2 Minimal Surface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.1.3 Image Restoration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.1.4 Classical Mechanics . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.2 Euler-Lagrange Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.3 Phase-Space Intuition and Relation to Optimization . . . . . . . . . . . . . 159
6.4 Towards Numerical Solutions of the Euler-Lagrange Equations ∗ . . . . . . 161
6.4.1 Smoothing Lagrangian . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.4.2 Gradient Descent and Acceleration . . . . . . . . . . . . . . . . . . . 161

6.5 Dependence of the action on the end-points . . . . . . . . . . . . . . . . . . 163


6.6 Variational Principle of Classical Mechanics . . . . . . . . . . . . . . . . . . 166
6.6.1 Noether’s Theorem & time-invariance of space-time derivatives of action . . . 166
6.6.2 Hamiltonian and Hamilton Equations: the case of Classical Mechanics 168
6.6.3 Hamilton-Jacobi equation . . . . . . . . . . . . . . . . . . . . . . . . 169
6.7 Legendre-Fenchel Transform ∗ . . . . . . . . . . . . . . . . . . . . . . . . . . 172
6.7.1 Geometric Interpretation:
Supporting Lines, Duality and Convexity . . . . . . . . . . . . . . . 173
6.7.2 Example of Dual Optimization in Variational Calculus . . . . . . . . 178
6.7.3 More on Geometric Interpretation of the LF transform . . . . . . . . 180
6.7.4 Hamiltonian-to-Lagrangian Duality in Classical Mechanics . . . . . . 181
6.7.5 LF Transformation and Laplace Method . . . . . . . . . . . . . . . . 182
6.8 Second Variation ∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
6.9 Methods of Lagrange Multipliers . . . . . . . . . . . . . . . . . . . . . . . . 185
6.9.1 Functional Constraint(s) . . . . . . . . . . . . . . . . . . . . . . . . . 185
6.9.2 Function Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

7 Optimal Control and Dynamic Programming 189


7.1 Linear Quadratic Control via Calculus of Variations ∗ . . . . . . . . . . . . 191
7.2 From Variational Calculus to Bellman-Hamilton-Jacobi Equation . . . . . . 195
7.3 Pontryagin Minimal Principle . . . . . . . . . . . . . . . . . . . . . . . . . . 198
7.4 Dynamic Programming in Optimal Control and Beyond . . . . . . . . . . . 200
7.4.1 Discrete Time Optimal Control . . . . . . . . . . . . . . . . . . . . . 200
7.4.2 Continuous Time Optimal Control . . . . . . . . . . . . . . . . . . . 202
7.5 Dynamic Programming in Discrete Mathematics . . . . . . . . . . . . . . . 204
7.5.1 LaTeX Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7.5.2 Cheapest Path over Grid . . . . . . . . . . . . . . . . . . . . . . . . 206
7.5.3 DP for Graphical Model Optimization . . . . . . . . . . . . . . . . . 208

IV Mathematics of Uncertainty 212

8 Basic Concepts from Statistics 213


8.1 Distributions and Random Variables . . . . . . . . . . . . . . . . . . . . . . 213
8.1.1 Discrete Random Variables . . . . . . . . . . . . . . . . . . . . . . . 213
8.1.2 Continuous Random Variables . . . . . . . . . . . . . . . . . . . . . 217
8.1.3 Sampling. Histograms. . . . . . . . . . . . . . . . . . . . . . . . . . . 219

8.2 Moments & Cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219


8.2.1 Expectation & Variance . . . . . . . . . . . . . . . . . . . . . . . . . 219
8.2.2 Higher Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
8.2.3 Moment Generating Functions. . . . . . . . . . . . . . . . . . . . . . 223
8.2.4 Characteristic Functions . . . . . . . . . . . . . . . . . . . . . . . . . 224
8.2.5 Cumulants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
8.3 Probabilistic Inequalities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
8.4 Random Variables: from one to many. . . . . . . . . . . . . . . . . . . . . . 227
8.4.1 Multivariate Distributions. Marginalization. Conditional Probability. 228
8.4.2 Central Limit Theorem . . . . . . . . . . . . . . . . . . . . . . . . . 232
8.4.3 Bayes Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
8.5 Information-Theoretic View on Randomness . . . . . . . . . . . . . . . . . . 237
8.5.1 Entropy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
8.5.2 Comparing Probability Distributions: Kullback-Leibler Divergence . 241
8.5.3 Joint and Conditional Entropy . . . . . . . . . . . . . . . . . . . . . 242
8.5.4 Independence, Dependence, and Mutual Information. . . . . . . . . . 244
8.5.5 Probabilistic Inequalities for Entropy and Mutual Information . . . 250

9 Stochastic Processes 253


9.1 Bernoulli Process (Discrete Space, Discrete Time) . . . . . . . . . . . . . . 254
9.1.1 Probability distribution of the total number of successes . . . . . . . 254
9.1.2 Probability distribution of the 1st success . . . . . . . . . . . . . . . 254
9.1.3 Probability distribution of the k-th success . . . . . . . . . . . . . . . 255
9.2 Poisson Process (Discrete Space, Continuous Time) . . . . . . . . . . . . . . 256
9.3 Stochastic Processes that are Continuous in Space-time . . . . . . . . . . . 260
9.3.1 Random Walks on the Integers . . . . . . . . . . . . . . . . . . . . . 261
9.3.2 From Random Walks to Brownian Motion . . . . . . . . . . . . . . . 261
9.3.3 Langevin equation in continuous time and discrete time . . . . . . . 262
9.3.4 The Wiener Process: A Rigorous Definition of Brownian Motion . . . 263
9.3.5 From the Langevin Equation to the Path Integral . . . . . . . . . . . 264
9.3.6 From the Path Integral to the Fokker-Planck (through sequential Gaus-
sian integrations) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
9.3.7 Analysis of the Kolmogorov-Fokker-Planck Equation: General Fea-
tures and Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
9.3.8 Examples and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . 267
9.4 Markov Process [discrete space, discrete time] . . . . . . . . . . . . . . . . . 272

9.4.1 Transition Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . 272


9.4.2 Sample Trajectories and Analysis by Simulation . . . . . . . . . . . 273
9.4.3 Evolution of the Probability State Vector . . . . . . . . . . . . . . . 280
9.5 Stochastic Optimal Control: Markov Decision Process . . . . . . . . . . . . 289
9.5.1 Bellman Equation & Dynamic Programming . . . . . . . . . . . . . 290
9.5.2 MDP: Grid World Example . . . . . . . . . . . . . . . . . . . . . . . 292
9.6 Queuing Networks ∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
9.6.1 Queuing: a bit of History & Applications . . . . . . . . . . . . . . . 296
9.6.2 Single Open Queue = Birth/Death process. Markov Chain represen-
tation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
9.6.3 Generalization to (Jackson) Networks. Product Solution for the Steady
State. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
9.6.4 Heavy Traffic Limit . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

10 Elements of Inference and Learning 303


10.1 Statistical Inference: Sampling and Stochastic Algorithms . . . . . . . . . . 303
10.1.1 Monte-Carlo Algorithms: General Concepts and Direct Sampling . . 303
10.1.2 Inference via Markov-Chain Monte-Carlo . . . . . . . . . . . . . . . 308
10.2 Statistical Inference: General Relations, Calculus of Variations and Trees . 315
10.2.1 From Ising Model to (Factor) Graphical Models . . . . . . . . . . . . 315
10.2.2 Decoding of Graphical Codes as a Factor Graph problem . . . . . . 316
10.2.3 Partition Function. Marginal Probabilities. Maximum Likelihood. . 319
10.2.4 Kullback-Leibler Formulation & Probability Polytope . . . . . . . . 320
10.2.5 Variational Approximation: Mean Field . . . . . . . . . . . . . . . . 321
10.2.6 Dynamic Programming for (Exact) Inference over Trees . . . . . . . 322
10.2.7 Properties of Undirected Tree-Structured Graphical Models . . . . . 324
10.2.8 Bethe Free Energy & Belief Propagation . . . . . . . . . . . . . . . . 326
10.3 Theory of Learning: Sufficient Statistics and Maximum Likelihood Estimation . . 328
10.3.1 Sufficient Statistics: infinitely many samples . . . . . . . . . . . . . . 328
10.3.2 Maximum-Likelihood Estimation/Learning of Graphical Models . . 329
10.3.3 Learning Spanning Tree . . . . . . . . . . . . . . . . . . . . . . . . . 330
10.4 Function Approximation with Neural Networks . . . . . . . . . . . . . . . . 333
10.4.1 Fitting a Function with NN as an Optimization . . . . . . . . . . . . 334
10.4.2 Automatic Differentiation, Back-Propagation and the Chain Rules . 335
10.4.3 Avoiding Over-fitting . . . . . . . . . . . . . . . . . . . . . . . . . . 336

A Convex and Non-Convex Optimization ∗ 338


A.1 Convex Functions, Sets and Optimizations . . . . . . . . . . . . . . . . . . . 339
A.2 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
A.3 Unconstrained First-Order Convex Minimization . . . . . . . . . . . . . . . 356
A.4 Constrained First-Order Convex Minimization . . . . . . . . . . . . . . . . 366
Chapter 1

Applied Math Core Courses

Every student in the Program for Applied Mathematics at the University of Arizona takes
the same three core courses during their first year of study. These three courses are called
Methods (Math 581), Theory (Math 584), and Algorithms (Math 589). Each course presents
a different expertise, or ‘toolbox’ of competencies, for approaching problems in modern
applied mathematics. The courses are designed to discuss many of the same topics, often
synchronously (Fig. 1.1). This allows them to better illustrate the potential contributions
of each toolbox, and also to provide a richer understanding of applied mathematics.
The material discussed in the courses includes topics that are taught in traditional applied
mathematics curricula (like differential equations) as well as topics that promote a modern
perspective of applied mathematics (like optimization, control and elements of computer
science and statistics). All the material is carefully chosen to reflect what we believe is most
relevant now and in the future.
Figure 1.1: Topics covered in Theory (blue), Methods (red) and Algorithms (green) during
the Fall semester (columns 1 & 2) and Spring semester (columns 3 & 4).

The essence of the core courses is to develop the different toolboxes available in applied
mathematics. When we’re lucky, we can find exact solutions to a problem by applying
powerful (but typically very specialized) techniques, or methods. More often, we must
formulate solutions algorithmically, and find approximate solutions using numerical
simulations and computation. Understanding the theoretical aspects of a problem motivates
better design and implementation of these methods and algorithms, and allows us to make
precise statements about when and how they will work.
The core courses discuss a wide array of mathematical content that represents some of
the most interesting and important topics in applied mathematics. The broad exposure
to different mathematical material often helps students identify specific areas for further
in-depth study within the program. The core courses do not (and cannot) satisfy the in-
depth requirements for a dissertation, and students must take more specialized courses and
conduct independent study in their areas of interest.
Furthermore, the courses do not (and cannot) cover all subjects comprising applied
mathematics. Instead, they provide a (somewhat!) minimal, self-consistent, and admittedly
subjective (due to our own expertise and biases) selection of the material that we believe
students will use most during and after their graduate work. In this introductory chapter
of the lecture notes, we aim to present our viewpoint on what constitutes modern applied
mathematics, and to do so in a way that unifies seemingly unrelated material.

What is Applied Mathematics?


We study and develop mathematics as it applies to model, optimize and control various
physical, biological, engineering and social systems. Applied mathematics is a combination
of (1) mathematical science, (2) knowledge and understanding from a particular domain
of interest, and often (3) insight from a few ‘math-adjacent’ disciplines (Fig. 1.2). In our
program, the core courses focus on the mathematical foundations of applied math. The
more specialized mathematics and the domain-specific knowledge are developed in other
coursework, independent research and internship opportunities.
Applying mathematics to real-world problems requires mathematical approaches that
have evolved to stand up to the many demands and complications of real-world problems.
In some applications, a relatively simple set of governing mathematical expressions are able
to describe the relevant phenomena. In these situations, problems often require very accu-
rate solutions, and the mathematical challenge is to develop methods that are efficient (and
sometimes also adaptable to variable data) without losing accuracy. In other applications,
there is no set of governing mathematical expressions (either because we do not know them,
or because they may not exist). Here, the challenge is to develop better mathematical
descriptions of the phenomena by processing, interpreting and synthesizing imperfect
observations.

Figure 1.2: The key components studied under the umbrella of applied mathematics: (1)
mathematical science, (2) domain-specific knowledge, and (3) a few ‘math-adjacent’
disciplines.

In terms of the general methodology maintained throughout the core courses, we devote a
considerable amount of time to:

1. Formulating the problem, first casually, i.e. in terms standard in sciences and engi-
neering, and then transitioning to a proper mathematical formulation;

2. Analyzing the problem by “all means available”, including theory, method and algo-
rithm toolboxes developed within applied mathematics;

3. Identifying what kinds of solutions are needed, and implementing an appropriate
method to find such a solution.

Making contributions to a specific domain that are truly valuable requires more than
just mathematical expertise. Domain-specific understanding may change our perspective on
what constitutes a solution. For example, whenever system parameters are no longer ‘nice’
but must be estimated from measurement or experimental data, it becomes more difficult
to find meaning in the solutions, and it becomes more important, and challenging, to
estimate the uncertainty in those solutions. Similarly, whenever a system couples many
sub-systems at scale, it may no longer be possible to interpret the exact expressions (if they
can be computed at all), and approximate, or ‘effective’, solutions may be more meaningful.
In every domain-specific application, it is important to know what problems are most urgent,
and what kinds of solutions are most valuable.
Mathematics is not the only field capable of making valuable contributions to other
domains, and we think specifically of physics, statistics and computer science as other fields
that have each developed their own frameworks, philosophies, and intuitions for describing
problems and their solutions. This is particularly evident with the recent developments in
data science. The recent deluge of data has brought a wealth of opportunity in engineering,
and in the physical, natural and social sciences where there have been many open problems
that could only be addressed empirically. Physics, statistics, and computer science have
become fundamental pillars of data science, in part, because each of these ’math-adjacent’
disciplines provides a way to analyze and interpret this data constructively. Nonetheless,
there are many unresolved challenges ahead, and we believe that a mixture of mathematical
insight and some intuition from these adjacent disciplines may help resolve these challenges.

Problem Formulation

We will rely on a diverse array of instructional examples from different areas of science and
engineering to illustrate how to translate a rather vaguely stated scientific or engineering
phenomenon into a crisply stated mathematical challenge. Some of these challenges will be
resolved, and some will stay open for further research. We will be referring to instructional
examples, such as the Kirchhoff and the Kuramoto-Sivashinsky equations for power systems,
the Navier-Stokes equations for fluid dynamics, network flow equations, the Fokker-Planck
equation from statistical mechanics, and constrained regression from data science.

Problem Analysis

We analyze problems extracted from applications by all means possible, which requires
both domain-specific intuition and mathematical knowledge. We can often make precise
statements about the solutions of a problem without actually solving the problem in the
mathematical sense. Dimensional analysis from physics is an example of this type of pre-
liminary analysis that is helpful and useful. We may also identify certain properties of the
solutions by analyzing any underlying symmetries and establishing the correct principal
behaviors expected from the solutions; some important examples involve oscillatory behavior
(waves), diffusive behavior, and dissipative/decaying vs. conservative behaviors. One
can also extract a lot from analyzing the different asymptotic regimes of a problem, say
when a parameter becomes small, making the problem easier to analyze. Matching different
asymptotic solutions can give a detailed, even though ultimately incomplete, description.

Solution Construction

As previously mentioned, one component of applied mathematics is a collection of
specialized techniques for finding analytic solutions. These techniques are not always
feasible, and developing computational intuition should help us to identify proper methods of
numerical (or mixed analytic-numerical) analysis, i.e. a specific toolbox, to help unravel
the problem.
Part I

Applied Analysis

Chapter 2

Complex Analysis

The real number system is somewhat “deficient” in the sense that not all operations are
allowed for all real numbers. For example, taking arbitrary roots of negative numbers is
not allowed in the real number system. This deficiency can be remedied by defining the
imaginary unit, i := √−1. A number that is a real multiple of the imaginary unit, for
example 3i, i/2 or −πi, is called an imaginary number. A number that has both a real and
an imaginary component is called a complex number.
Complex analysis is the branch of mathematics that investigates functions of complex
variables. A fundamental premise of complex analysis is that most operations between real
numbers have natural extensions to complex numbers, and that most real-valued functions
have natural extensions to complex-valued functions. Interestingly, the natural extensions of
even the most elementary functions can lead to a richness that often admits new techniques
for problem solving.
Complex analysis provides useful tools for many other areas of mathematics (both pure
and applied), as well as for physics (including the branches of hydrodynamics, thermody-
namics, and particularly quantum mechanics), and engineering fields (such as aerospace,
mechanical and electrical engineering).

2.1 Complex Variables and Complex-valued Functions


2.1.1 The Cartesian Representation of Complex Variables

For two complex numbers, z1 = a1 + ib1 and z2 = a2 + ib2, we have

• Addition: z1 + z2 = (a1 + ib1) + (a2 + ib2) = (a1 + a2) + i(b1 + b2),

• Multiplication: z1 z2 = (a1 + ib1)(a2 + ib2) = a1 a2 + i(a1 b2 + b1 a2) + i² b1 b2 = (a1 a2 − b1 b2) + i(a1 b2 + b1 a2).

Figure 2.1: Complex numbers can be visualized as vectors in R². By convention, the real
component is plotted on the horizontal axis, and the imaginary component is plotted on
the vertical axis. The addition of two complex numbers is reminiscent of vector addition in
R².

The addition and subtraction of complex numbers are direct generalizations of their
real-valued counterparts.

Example 2.1.1. Let z1 = 1 + 2i and z2 = 4 − i. Compute (a) z1 + z2 and (b) z1 − z2 .


Solution.

(a) z1 + z2 = (1 + 2i) + (4 − i) = (1 + 4) + (2 − 1)i = 5 + i

(b) z1 − z2 = (1 + 2i) − (4 − i) = (1 − 4) + (2 + 1)i = −3 + 3i

Because the behavior of addition and subtraction is reminiscent of translating vectors in
R², we often visualize complex numbers as points on a Cartesian plane by associating the
real and imaginary components of the complex number with the x- and y-coordinates
respectively.

Definition 2.1.2. The complex conjugate of a complex number z, denoted by z* or z̄, is
the complex number with an equal real part and an imaginary part equal in magnitude but
opposite in sign. That is, if z = x + iy then z* := x − iy.

The multiplication and division of complex numbers are also direct generalizations of
their real-valued counterparts with the additional definition, i² = −1.

Example 2.1.3. Let z1 = −1 + 2i and z2 = 4 − 3i. Compute (a) z1 z2 , (b) z1 /z2 .


Solution.

(a) z1 z2 = (−1 + 2i)(4 − 3i) = −4 + 3i + 8i − 6i² = 2 + 11i.

(b) To compute z1/z2, we first multiply it by z2*/z2*, so that the denominator, z2 z2*, is a real number:

z1/z2 = (z1 z2*)/(z2 z2*) = ((−1 + 2i)(4 + 3i))/((4 − 3i)(4 + 3i)) = (−4 − 3i + 8i + 6i²)/(16 − 12i + 12i + 9) = (−10 + 5i)/25 = −2/5 + (1/5)i.

Complex conjugates

Theorem 2.1.4. For algebraic operations including addition, multiplication, division and
exponentiation, consider a sequence of algebraic operations over the n complex numbers
z1, . . . , zn with the result w. If the same actions are applied in the same order to z1*, . . . , zn*,
then the result will be w*.

Example 2.1.5. Let us illustrate Theorem 2.1.4 on the example of a quadratic equation,
az² + bz + c = 0, where the coefficients a, b and c are real. Direct application of Theorem
2.1.4 to this example results in the fact that if the equation has a root, then its complex
conjugate is also a root, which is obviously consistent with the quadratic formula for the
roots, z1,2 = (−b ± √(b² − 4ac))/(2a).

Exercise 2.1. Use Theorem 2.1.4 to show that the roots of a polynomial with real-valued
coefficients of arbitrary order occur in complex conjugate pairs.

Example 2.1.6. Find all the roots of the polynomial, p(z) = z⁴ − 6z³ + 11z² − 2z − 10,
given that one of its roots is 2 − i.
Solution. We observe that p(z) has real-valued coefficients, so its roots occur in conjugate
pairs; given that z1 = 2 − i is a root, then z2 = 2 + i must also be a root, which we verify
by evaluation. We factorize p(z) as p(z) = (z − z1)(z − z2)r(z), where we find r(z) by
polynomial division, giving r(z) = z² − 2z − 2. Therefore, the four roots of p(z) are found
by solving

0 = z⁴ − 6z³ + 11z² − 2z − 10 = (z − z1)(z − z2)(z² − 2z − 2).

Solving z² − 2z − 2 = 0 by the quadratic formula gives z3,4 = 1 ± √3. Thus, the four roots
of p(z) are:

z1 = 2 − i, z2 = 2 + i, z3 = 1 + √3, z4 = 1 − √3.
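
As a quick numerical sanity check (a sketch in Python using numpy, which is our choice of tool here, not part of the text), one can confirm both the four roots and the conjugate-pair structure forced by the real coefficients:

    import numpy as np

    # Coefficients of p(z) = z^4 - 6 z^3 + 11 z^2 - 2 z - 10, highest degree first.
    p = [1, -6, 11, -2, -10]
    roots = np.roots(p)
    print(np.sort_complex(roots))  # the four roots: 2 +/- i and 1 +/- sqrt(3)

    # Real coefficients: every root's conjugate must also be a root.
    for r in roots:
        assert min(abs(roots - np.conj(r))) < 1e-9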

Example 2.1.7. Let z1 = x1 + iy1 and z2 = x2 + iy2. Show that if ω = z1/z2, then
ω* = z1*/z2*.
Figure 2.2: A complex number, z, has both a Cartesian representation (shown in blue) and
a polar representation (shown in orange). Its modulus, denoted by |z| or r, is non-negative
and satisfies |z|² = r² := zz* = x² + y². Its argument, denoted by θ, is the angle measured
modulo 2π, counter-clockwise from the positive real axis.

Solution. From the definition of a complex conjugate, we have

z1* = x1 − iy1, and z2* = x2 − iy2.

We must find ω* and verify that it is equivalent to z1*/z2*. First compute ω:

ω = (x1 + iy1)/(x2 + iy2) = ((x1 + iy1)(x2 − iy2))/((x2 + iy2)(x2 − iy2)) = (x1 x2 + y1 y2)/(x2² + y2²) + i (x2 y1 − x1 y2)/(x2² + y2²).

Now compute ω*:

ω* = (x1 x2 + y1 y2)/(x2² + y2²) − i (x2 y1 − x1 y2)/(x2² + y2²) = ((x1 − iy1)(x2 + iy2))/((x2 − iy2)(x2 + iy2)) = (x1 − iy1)/(x2 − iy2),

which is equivalent to z1*/z2*, as required.

2.1.2 The Polar Representation of Complex Variables

In addition to their Cartesian representation, complex numbers can also be represented by
their polar representation with components r and θ. Here r is called the modulus of z and
satisfies r² = |z|² := zz* = x² + y² ≥ 0, and θ is called the argument of z or sometimes the
polar angle. Note that θ = arg(z) is defined only for |z| > 0, and modulo 2π:

x + iy = r cos θ + i r sin θ, for r = √(x² + y²), and θ = arctan(y, x).

The multiplication of two complex numbers, when expressed in their polar representation,
is simplified by elementary trigonometric identities. For z1 = r1 cos θ1 + i r1 sin θ1 and
z2 = r2 cos θ2 + i r2 sin θ2, their product is

z1 z2 = r1 r2 cos θ1 cos θ2 − r1 r2 sin θ1 sin θ2 + i r1 r2 cos θ1 sin θ2 + i r1 r2 sin θ1 cos θ2
= r1 r2 cos(θ1 + θ2) + i r1 r2 sin(θ1 + θ2).

That is, the product of two complex numbers is the complex number whose modulus is the
product of the moduli of its factors, and whose argument is the sum of the arguments of its
factors. We make two observations: first, the modulus of the product of two complex numbers
is the product of their moduli, that is, |z1 z2| = |z1| |z2|; and second, the argument of the
product is the sum of the arguments, that is, arg(z1 z2) = arg(z1) + arg(z2). This summation
of arguments whenever two numbers are multiplied together is reminiscent of the
multiplication of real-valued exponential functions, and motivates the definition of the
complex-valued exponential function.

Definition 2.1.8. The exponential function is defined for imaginary arguments by

r e^{iθ} := r cos(θ) + i r sin(θ) = x + iy. (2.1)

Euler’s (famous) formula, e^{iπ} = −1, follows directly from this definition.

Example 2.1.9. Convert z1 = −1 + 2i and z2 = 4 − 3i to their polar representations and
compute (a) their product and (b) their quotient. Compare your answer to Example 2.1.3.
Solution. The polar representations of z1 and z2 are:

r1 = √(z1 z1*) = √5, θ1 = tan⁻¹(2/(−1)) ≈ 2.03 (second quadrant), z1 = √5 e^{2.03i};
r2 = √(z2 z2*) = 5, θ2 = tan⁻¹(−3/4) ≈ −0.64, z2 = 5 e^{−0.64i}.

Their product and quotient are:

(a) z1 z2 ≈ √5 e^{2.03i} · 5 e^{−0.64i} = 5√5 e^{1.39i} = 5√5 cos(1.39) + i 5√5 sin(1.39) ≈ 2 + 11i.

(b) z1/z2 ≈ √5 e^{2.03i} / (5 e^{−0.64i}) = (1/√5) e^{2.67i} = (1/√5) cos(2.67) + (i/√5) sin(2.67) ≈ −0.4 + 0.2i.
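
The rule “multiply the moduli, add the arguments” is easy to verify numerically; below is a minimal sketch using Python’s standard cmath module (our choice of tool):

    import cmath

    z1, z2 = -1 + 2j, 4 - 3j

    r1, th1 = cmath.polar(z1)  # ~ (2.236, 2.034): the values used above
    r2, th2 = cmath.polar(z2)  # ~ (5.0, -0.644)

    # Rebuild the product and quotient from the polar data and compare.
    print(cmath.rect(r1 * r2, th1 + th2), z1 * z2)  # both ~ 2 + 11j
    print(cmath.rect(r1 / r2, th1 - th2), z1 / z2)  # both ~ -0.4 + 0.2j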

Sometimes it is convenient to express a complex number using a mixture of Cartesian
and polar representations.

Example 2.1.10. Find r̃ and θ̃ such that the point ω = 1 + 5i can be written as
ω = −1 + r̃ e^{iθ̃}.
Figure 2.3: Examples of curves in the complex plane. The first two curves (red and blue)
are open and the last two curves (green and orange) are closed. The first and fourth curves
(red and orange) are simple and the second and third curves (blue and green) are not simple
because they self-intersect at points other than the end points.

Solution. Given that 1 + 5i = −1 + r̃ e^{iθ̃}, solve for r̃ e^{iθ̃} to get 2 + 5i = r̃ e^{iθ̃}. Solve for r̃
and θ̃ to get r̃ = √((2 + 5i)(2 − 5i)) = √29 ≈ 5.39 and θ̃ = tan⁻¹(5/2) ≈ 1.19 rad. Therefore,
ω ≈ −1 + 5.39 e^{1.19i}.

Example 2.1.11. Express z := (2 + 2i) e^{−iπ/6} by its (a) Cartesian and (b) polar representations.
Solution.

(a) z = (2 + 2i)(cos(−π/6) + i sin(−π/6)) = 2 cos(−π/6) − 2 sin(−π/6) + i (2 cos(−π/6) + 2 sin(−π/6))
= (1 + √3) + i(√3 − 1).

(b) z = (2 + 2i) e^{−iπ/6} = 2√2 e^{iπ/4} e^{−iπ/6} = 2√2 e^{iπ/12}.

2.1.3 Parameterization of Curves in the Complex Plane

Definition 2.1.12. A curve in the complex plane is a set of points z(t) where a ≤ t ≤ b,
for some a ≤ b. We say that the curve is closed if z(a) = z(b), and simple if it does not
self-intersect, except possibly at the end-points. That is, the curve is simple if z(t) ≠ z(t′)
for t ≠ t′ and a < t, t′ < b. A curve is called a contour if it is continuous and piece-
wise smooth. By convention, all simple, closed contours are parameterized to be traversed
counter-clockwise, unless stated otherwise.

Example 2.1.13. Parameterize the following curves:

(a) The infinite ‘vertical’ line passing through π/2.



(b) The semi-infinite ray extending from the point z = −1 and passing through √3 i.

(c) The circular arc of radius ε centered at 0.


Figure 2.4: Parameterized curves for Example 2.1.13. Red: The infinite ‘vertical’ line
passing through π/2 is parameterized by z(s) = π/2 + is for −∞ < s < ∞. Green: The
semi-infinite ray extending from the point z = −1 and passing through √3 i is parameterized
by z(ρ) = −1 + ρ e^{iπ/3} for 0 ≤ ρ < ∞. Blue: The circular arc of radius ε centered at 0 is
parameterized by z(τ) = ε e^{iτ} for 0 ≤ τ ≤ 2π.

Solution.

(a) z(s) = π/2 + is for −∞ < s < ∞.

(b) z(ρ) = −1 + ρ e^{iπ/3} for 0 ≤ ρ < ∞.

(c) z(τ) = ε e^{iτ} for 0 ≤ τ ≤ 2π.
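
These parameterizations are easy to sample numerically; the following sketch (Python with numpy, our choice of tool) generates points on each curve and confirms that the ray of part (b) passes through √3 i at ρ = 2:

    import numpy as np

    s   = np.linspace(-5, 5, 101)         # parameter of the vertical line
    rho = np.linspace(0, 5, 101)          # parameter of the ray
    tau = np.linspace(0, 2 * np.pi, 101)  # parameter of the circular arc
    eps = 0.5                             # an arbitrary radius for the arc

    line = np.pi / 2 + 1j * s
    ray  = -1 + rho * np.exp(1j * np.pi / 3)
    arc  = eps * np.exp(1j * tau)

    print(-1 + 2 * np.exp(1j * np.pi / 3))  # ~ 0 + 1.732j = sqrt(3) i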

The Complex Number System

Complex numbers can be viewed as the resolution of the search for a class of numbers that
is closed under all possible algebraic operations. What this means is that any algebraic
operation between two complex numbers is guaranteed to return another complex number.
This is not generally true for other classes of numbers, for example,

i. The addition of two positive integers is guaranteed to be another positive integer, but
the subtraction of two positive integers is not necessarily a positive integer. Therefore,
we say that the positive integers are closed under addition but are not closed under
subtraction.

ii. The class of all integers is closed under subtraction and also multiplication. However
the integers are not closed under division because the quotient of two integers is not
necessarily another integer.

iii. The rational numbers are closed under division. However the process of taking limits
of rational numbers may lead to numbers that are not rational, so real numbers are
needed if we require a system that is closed under limits.

iv. Taking non-integer powers of negative numbers does not yield a real number. The
class of complex numbers must be introduced to have a system that is closed under
this operation.

Moreover, one finds that the class of complex numbers is also closed under the operations of
finding roots of algebraic equations, of taking logarithms, and others. We conclude with the
happy statement that the class of complex numbers is closed under all of these operations.

2.1.4 Functions of a Complex Variable

A function of a complex variable, w = f (z), maps the complex number z to the complex
number w. That is, f maps a point in the z-complex plane to a point (or points) in the
w-complex plane. Since both z and w have a Cartesian representation, this means that
every function of a complex variable can be expressed as two real-valued functions of two
real variables, f (z) := u(x, y) + iv(x, y).

2.1.5 Complex Exponentials

In Eq. (2.1) we motivated the definition of the exponential function, f(z) = e^z, with the
intention to preserve the property that e^{z1+z2} = e^{z1} e^{z2}, and incidentally that e^1 = 2.718 . . . .
This is not the only property we could have chosen to motivate the definition of e^z. We
could have chosen to preserve any of the following properties:
• the function’s Taylor series, Σ_{n=0}^{∞} z^n/n!;

• the function’s limiting expression, lim_{n→∞} (1 + z/n)^n;

• the fact that f(z) = e^z solves the Ordinary Differential Equation, f′(z) = f(z),
subject to f(0) = 1.

We encourage the reader to verify that all these properties are preserved for the complex
exponential, and that any one of them could have motivated our definition and yielded the
same results.
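
The following sketch (Python; the truncation order and sample point are arbitrary choices) checks the Taylor series and the limiting expression against the built-in exponential:

    import cmath

    z = 0.3 + 1.7j

    # Truncated Taylor series: sum of z^n / n! for n = 0 .. 29.
    taylor, term = 0, 1
    for n in range(1, 31):
        taylor += term
        term *= z / n

    # Limiting expression (1 + z/n)^n for a large n.
    n = 10**6
    limit = (1 + z / n) ** n

    print(cmath.exp(z), taylor, limit)  # all three agree to several digits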

An immediate consequence of this observation is that the natural definitions of the
complex-valued trigonometric functions are

cos(z) := (e^{iz} + e^{−iz})/2 and sin(z) := (e^{iz} − e^{−iz})/(2i). (2.2)
Example 2.1.14. Let f(z) = exp(iz) where z = x + iy. Express f(z) as the sum u(x, y) +
iv(x, y), where u and v are real-valued functions of x and y.
Solution.

f(z) = exp(i(x + iy)) = exp(ix − y) = exp(−y) exp(ix) = exp(−y) cos(x) + i exp(−y) sin(x).

Example 2.1.15. Evaluate the following functions along curves (see Example 2.1.13 and
also Fig. 2.4): (a) z ↦ sin z along the infinite vertical line passing through π/2; (b)
z ↦ exp(z + 1) along the semi-infinite ray extending from the point z = −1 and passing
through √3 i; and (c) z ↦ z² along the circular arc of radius ε centered at 0.
Solution.

(a) Parameterize the vertical line passing through π/2 by π/2 + is for −∞ < s < ∞. Then

f(π/2 + is) = sin(π/2 + is) = (e^{iπ/2−s} − e^{−iπ/2+s})/(2i) = (i e^{−s} + i e^{s})/(2i)
= cosh(s) for −∞ < s < ∞.

(b) Parameterize the semi-infinite ray extending from the point z = −1 and passing
through √3 i by −1 + ρ e^{iπ/3} for 0 ≤ ρ < ∞. Then

f(−1 + ρ e^{iπ/3}) = exp(ρ e^{iπ/3}) = exp(ρ cos(π/3) + i ρ sin(π/3))
= e^{ρ/2} (cos(ρ√3/2) + i sin(ρ√3/2)) for 0 ≤ ρ < ∞.

(c) Parameterize the circular arc of radius ε centered at 0 by ε e^{iτ} for 0 ≤ τ ≤ 2π. Then

f(ε e^{iτ}) = (ε e^{iτ})² = ε² e^{2iτ} for 0 ≤ τ ≤ 2π.

Exercise 2.2. Investigate the asymptotic behavior at |z| → ∞ of the complex-valued


functions (a) f (z) = exp(z), (b) f (z) = sin(z), (c) f (z) = cos(z). Hint: There are many
different ways that |z| can go to infinity. For example, we could write z = x + iy and let
x → ±∞ for fixed (or variable) y or let y → ±∞ for fixed (or variable) x. We could also
write z = r e^{iθ} and let r → ∞ for fixed (or variable) θ. We are asking you to consider each
function from whichever perspectives are most informative for determining what happens
as |z| → ∞.

2.1.6 Multi-valued Functions and Branch Cuts

Not every complex function is single-valued. We often deal with functions that are multi-
valued, meaning that for some z there exist two or more values wᵢ such that f(z) = wᵢ.
Recall the parametrized curves in Example 2.1.15 and consider part (c), where we evaluated
the function f(z) = z² along the circle of radius ε centered at the origin. Notice, in particular,
that the function returns to its original value, that is, f(ε e^{0i}) = f(ε e^{2πi}) = ε². It may seem
surprising, but there are functions where this is not the case.

Example 2.1.16. Consider the example of ω(z) = √z. When z is represented in polar
coordinates, z = r exp(iθ), we know that θ is defined up to a shift by 2πn, for any integer
n. For our example, this translates to ω_n(z) = √r exp(iθ/2 + iπn), where different n
result in (two) different values of √z, called two branches: ω1 = √r exp(iθ/2) and ω2 =
√r exp(iθ/2 + iπ). If we choose one branch, say ω1, and walk in the complex plane around
z = 0 in the positive, counter-clockwise direction (so that z = 0 always stays on the left),
changing θ from its original value, say θ = 0, to π/2, π, 3π/2 and eventually to 2π, ω1
will transition to ω2. Making one more positive 2π swing will return us to ω1. In other words,
the two branches transition to each other after one makes a 2π turn. Per the definition below,
the point z = 0 is called a second order branch point of the two-valued function √z.

Definition 2.1.17. A multi-valued function w(z) has a branch point at z₀ ∈ C if w(z)
varies continuously along a sufficiently small circuit surrounding z₀, but does not return to
its starting value after one full circuit.

Definition 2.1.18. A branch of a multi-valued function w(z) is a single-valued function
that is obtained by restricting the image of w(z) and disregarding all but one set of values.

A multi-valued function has the property that if we traverse a sufficiently small closed
contour around its branch point, we experience a discontinuity. One should note that the
location of this discontinuity is entirely dependent on where we choose to start and stop
the closed contour. To see this, consider the following two closed contours:

α(θ) = e^{iθ}, 0 ≤ θ ≤ 2π, (2.3)

β(φ) = e^{iφ}, −π ≤ φ ≤ π. (2.4)

If we traverse these two contours with the function f(z) = √z, we see that the discon-
tinuity occurs at θ = 0, 2π in the first case, and φ = −π, π in the second case. In truth, the
location of this discontinuity was dependent on our choice of contour. We can expand on
this idea by introducing the notion of a branch cut. A branch cut is something we pick in
order to separate the branches of a multi-valued function. For most multi-valued functions,
this means that when we attempt to traverse a cut with some closed contour, we end up
experiencing a discontinuity. Really, what we have is a contour that is closed in the domain
of f, but maps to an open contour.

Definition 2.1.19. A branch cut is a curve in the complex plane along which a branch is
discontinuous.

Remark. Branch cuts are usually not unique, and are something that is defined by us, not
the multi-valued function in question. One branch is arbitrarily selected as the principal
branch. Most software packages employ a set of rules for selecting the principal branch of
a multi-valued function.
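
For instance, numpy (like Python’s cmath) selects the principal branch with arg(z) ∈ (−π, π], which places the branch cut of z ↦ z^{1/2} along the negative real axis. The sketch below (our own illustration; the helper sqrt_branch is a hypothetical name) contrasts the principal branch with a branch whose cut lies along the positive real axis:

    import numpy as np

    z_above = -4 + 1e-12j  # just above the negative real axis
    z_below = -4 - 1e-12j  # just below it
    print(np.sqrt(z_above), np.sqrt(z_below))  # ~ +2j vs ~ -2j: a jump across the cut

    def sqrt_branch(z, theta0=0.0):
        # Branch of z**(1/2) with arg(z) taken in [theta0, theta0 + 2*pi).
        r, th = np.abs(z), np.angle(z)
        th = np.mod(th - theta0, 2 * np.pi) + theta0
        return np.sqrt(r) * np.exp(1j * th / 2)

    # With the cut moved to the positive real axis, this branch is continuous here:
    print(sqrt_branch(z_above), sqrt_branch(z_below))  # both ~ +2j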

Example 2.1.20. The generalization of Example 2.1.16 to ω(z) = z^{1/n} is straightforward.
This function has n branches, ω1(z), . . . , ωn(z), and thus z = 0 is called an nth order branch
point of the n-valued function z^{1/n}.

Example 2.1.21. Another important example is ω(z) = log(z). We can represent z by
its polar representation, z = r e^{i(θ+2πn)}, to show that log is a multi-valued function with
infinitely many (but countably many) values, ω_n = log(r) + i(θ + 2nπ), n = 0, ±1, . . . .
In this case, z = 0 is an infinite order branch point.

Definition 2.1.22 (Branch points at z = ∞). Consider a multi-valued function f(z). We
say that f has a branch point at z = ∞ if the function g(w) = f(1/w) has a branch point
at w = 0.

Example 2.1.23. Find the branch points of log(z − 1) and sketch a set of possible branch
cuts. Choose a branch cut and describe the resulting branches.

Solution. Parameterize the function as follows: log(z − 1) = log ρ + iφ, where z − 1 = ρ exp(iφ)
with ρ > 0 (positive real) and φ real. Since φ changes by multiples of 2π as we travel
on a closed path around z = 1, the point z = 1 is a branch point of log(z − 1). We can
observe that z = ∞ is also a branch point (thus an infinite branch point) by replacing z with
z = 1/w and observing that w = 0 is a branch point. Therefore a valid branch cut for the
function should connect the two branch points as illustrated in Fig. (2.7).

To describe branches of the function, let us choose (for concreteness) the branch cut
starting at z = 1 and moving along the x axis to z = +∞. Introduce potential branches
of z by z_n = 1 + ρ exp(iφ + 2iπn), n = 0, 1, · · · . Given that f(z) = log(z − 1), the family
of z_n translates into the following branches: f_n(z) = log ρ + iφ + 2iπn. Observe that each
branch is distinct from the others and that each is a single-valued, analytic function in C
excluding the branch cut.
Figure 2.5: (a) Top row: The real (left) and imaginary (right) components of z ↦ z^{1/2}.
(b) Middle row: The representation z = r e^{−iθ}, with 0 ≤ θ < 2π, gives a branch cut along
the positive real axis and a (single-valued) branch that is analytic everywhere except along
the branch cut. (c) Bottom row: The representation z = r e^{−iθ}, with −π ≤ θ < π, gives
a branch cut along the negative real axis and a (single-valued) branch of z ↦ z^{1/2} that is
analytic everywhere except along the branch cut.
Figure 2.6: (a) Top row: The real (left) and imaginary (right) components of z ↦ log(z).
(b) Middle row: The representation z = r e^{−iθ}, with 0 ≤ θ < 2π, gives a branch cut
along the positive real axis and a (single-valued) branch of z ↦ log(z) that is analytic
everywhere except along the branch cut. (c) Bottom row: The representation z = r e^{−iθ},
with −π ≤ θ < π, gives a branch cut along the negative real axis and a (single-valued)
branch of z ↦ log(z) that is analytic everywhere except along the branch cut.
Figure 2.7: Polar parametrization of log(z − 1) (left) and three examples of branch cuts for
the function, connecting its two branch points at z = 1 and at z = ∞.
Figure 2.8: Two examples of branch cuts for the function log(z² − 1).

Example 2.1.24. Consider log(z² − 1) = log(z − 1) + log(z + 1). As we travel around
z = 1, log(z − 1), and hence also log(z² − 1), changes by 2πi. Therefore z = 1 is a branch
point of log(z² − 1). Similarly, z = −1 and z = ∞ are two other branch points of log(z² − 1).
Fig. (2.8) shows two branch cut examples for log(z² − 1).

Two important general remarks are in order.

1. The function log(f(z)) has branch points at the zeros of f(z) and at the points where
f(z) is infinite, as well as (possibly) at the points where f(z) itself has branch points.
But be careful with this (latter) possibility: the zeros have to be zeros in the sense of
analytic functions and by infinities we mean poles. Other types of (singular) behaviors
in f(z) can lead to unexpected results, e.g. check what happens at z = 0 when
f(z) = exp(1/z).

2. The fact that a function f(z) or its derivatives may or may not have a (finite) value
at some point z = z₀ is irrelevant as far as deciding whether or not z₀ is a branch
point of f(z).

Exercise 2.3. Identify the branch points, introduce suitable branch cuts, and describe the
resulting branches for the functions (a) f(z) = √((z − a)(z − b)), and (b) g(z) = log((z −
1)/(z − 2)).
The graphs of complex multi-valued functions are in general two-dimensional manifolds
in the space R⁴. These manifolds are called Riemann surfaces. Riemann surfaces can be
visualized by projecting into three-dimensional space and rendering the image surface
on the screen. (See http://matta.hut.fi/matta/mma/SKK_MmaJournal.pdf for details and
visualization with Mathematica.)

Example 2.1.25. Find all values of z ∈ C satisfying the equation sin(z) = 3.

Solution. Start with the definition of the complex-valued sin:

(e^{iz} − e^{−iz})/(2i) = 3.

Multiply each side by 2i e^{iz} to obtain

(e^{iz})² − 6i e^{iz} − 1 = 0.

This can be solved using the quadratic formula which, after some algebra, gives

e^{iz} = i(3 ± 2√2).

Now take the natural log of both sides,

iz = ln(i) + ln(3 ± 2√2).

By ln(z) = ln(r) + i(θ + 2nπ), where n = 0, ±1, ±2, . . . , we know that ln(i) = ln(1) + i(π/2 + 2nπ),
so

z = π/2 + 2nπ ± i ln(3 + 2√2).
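
One can confirm this whole family of solutions numerically with Python’s cmath (a minimal sketch; the range of n is truncated arbitrarily):

    import cmath, math

    c = math.log(3 + 2 * math.sqrt(2))
    for n in (-1, 0, 1):
        for sign in (1, -1):
            z = math.pi / 2 + 2 * math.pi * n + sign * 1j * c
            print(z, cmath.sin(z))  # sin(z) ~ 3 + 0j in every case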

2.2 Analytic Functions and Integration along Contours


2.2.1 Analytic functions

The derivative of a real-valued function is defined at a point x via the limiting expression

f′(x) = lim_{∆x→0} [f(x + ∆x) − f(x)]/∆x,

and we say that the function is differentiable at x if the limit exists and is independent of
whether x is approached from above or below, as given by the sign of ∆x.

Definition 2.2.1. The derivative of a complex function is defined via a limiting expression:

f′(z) = lim_{∆z→0} [f(z + ∆z) − f(z)]/∆z. (2.5)

This limit only exists if f′(z) is independent of the direction in the z-plane along which the
limit ∆z → 0 is taken. (Note: there are infinitely many ways to approach a point z ∈ C.)

If one sets ∆z = ∆x, Eq. (2.5) results in

f′(z) = u_x + i v_x,

where f = u + iv. However, setting ∆z = i∆y results in

f′(z) = −i u_y + v_y.

A consistent definition of a derivative requires that the two ways of taking the derivative
coincide, that is,

u_x = v_y, u_y = −v_x, (2.6)

and this gives a necessary condition for the following statement.

Theorem 2.2.2 (Cauchy-Riemann Theorem). The function f(z) = u(x, y) + iv(x, y) is
differentiable at the point z = x + iy iff (if and only if) the partial derivatives u_x, u_y, v_x, v_y
are continuous and the Cauchy-Riemann conditions (2.6) are satisfied in a neighborhood of
z.

Notice that in the explanations which led us to the Cauchy-Riemann theorem (2.2.2)
we only sketched one side of the proof: that it is necessary for the differentiability of f(z) to
have the theorem’s conditions satisfied. To complete the proof, one needs to also show that
Eq. (2.6) is sufficient for the differentiability of f(z). In other words, one needs to show that
any function u(x, y) + iv(x, y) is complex-differentiable if the Cauchy-Riemann equations
hold. The missing part of the proof follows from the following chain of transformations:

∆f = f(z + ∆z) − f(z) = (∂f/∂x) ∆x + (∂f/∂y) ∆y + O((∆x)², (∆y)², (∆x)(∆y))
   = (1/2)(∂f/∂x − i ∂f/∂y) ∆z + (1/2)(∂f/∂x + i ∂f/∂y) ∆z* + O((∆x)², (∆y)², (∆x)(∆y))
   = (∂f/∂z) ∆z + (∂f/∂z*) ∆z* + O((∆x)², (∆y)², (∆x)(∆y))
   = (∂f/∂z + (∂f/∂z*)(∆z*/∆z)) ∆z + O((∆x)², (∆y)², (∆x)(∆y)), (2.7)

where O((∆x)², (∆y)², (∆x)(∆y)) indicates that we have ignored terms of order two or
higher in ∆x and ∆y. In the transition to the last line of Eq. (2.7) we changed variables
from (x, y) to (z, z*), thus using

∂/∂x = (∂z/∂x) ∂/∂z + (∂z*/∂x) ∂/∂z* = ∂/∂z + ∂/∂z*,
∂/∂y = (∂z/∂y) ∂/∂z + (∂z*/∂y) ∂/∂z* = i ∂/∂z − i ∂/∂z*,

and their inverse (known as the “Wirtinger derivatives”)

∂/∂z = (1/2)(∂/∂x − i ∂/∂y), ∂/∂z* = (1/2)(∂/∂x + i ∂/∂y).

Observe that ∆z*/∆z takes different values depending on the direction along which we take
the respective ∆z, ∆z* → 0 limit in the complex plane. Therefore, to ensure that the derivative,
f′(z), is well defined at any z, one needs to require that

∂f/∂z* = 0, (2.8)

i.e. that f does not depend on z*. It is straightforward to check that the “independence of
the complex conjugate” Eq. (2.8) is equivalent to Eq. (2.6).
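
The direction-dependence of ∆z*/∆z is easy to observe numerically. The sketch below (Python; step size and test point are arbitrary choices) compares the difference quotients along ∆z = h and ∆z = ih for an analytic function and for f(z) = z*, which depends on z*:

    def quotients(f, z, h=1e-6):
        along_x = (f(z + h) - f(z)) / h              # dz taken along the real axis
        along_y = (f(z + 1j * h) - f(z)) / (1j * h)  # dz taken along the imaginary axis
        return along_x, along_y

    print(quotients(lambda z: z ** 2, 1 + 2j))         # both ~ 2z = 2 + 4j
    print(quotients(lambda z: z.conjugate(), 1 + 2j))  # 1 vs -1: no derivative exists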

Definition 2.2.3 ((Complex) Analyticity). A function f(z) is called (a) analytic (or holo-
morphic) at a point z₀ if it is differentiable (as a complex function) in a neighborhood of
z₀; (b) analytic in a region of the complex plane if it is analytic at each point of the region.

Exercise 2.4. The isolines of a function, f(x, y) = u(x, y) + iv(x, y), are defined to be the
curves u(x, y) = const and v(x, y) = const′. Show that the isolines of an analytic function
always cross at a right angle.

Example 2.2.4. Let f(z) = u(x, y) + iv(x, y) be analytic. Given that u(x, y) = x + x² − y²
and f(0) = 0, find v(x, y).
Solution. We start by utilizing the Cauchy-Riemann conditions:

∂u/∂x = ∂v/∂y, ∂u/∂y = −∂v/∂x.

This gives us the two differential equations

∂v/∂y = 2x + 1, ∂v/∂x = 2y,

with solutions

v(x, y) = 2xy + y + C₁(x), v(x, y) = 2xy + C₂(y).

Based on these two solutions, C₁(x) = k, for some constant k, and C₂(y) = y + k. Given
the initial condition f(0) = 0, v(0, 0) = 0, and therefore k = 0. So, we find the solution
v(x, y) = 2xy + y. Therefore

f(z) = (x + x² − y²) + i(2xy + y).
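
A symbolic check of this solution (a sketch using sympy, our choice of tool; the compact form f(z) = z + z², which reproduces the u and v above on expansion, is an extra observation, not from the text):

    import sympy as sp

    x, y = sp.symbols('x y', real=True)
    u = x + x**2 - y**2
    v = 2*x*y + y

    # Cauchy-Riemann conditions: u_x = v_y and u_y = -v_x.
    print(sp.simplify(sp.diff(u, x) - sp.diff(v, y)))  # 0
    print(sp.simplify(sp.diff(u, y) + sp.diff(v, x)))  # 0

    # Expanding f(z) = z + z^2 at z = x + i y recovers u and v.
    expr = sp.expand((x + sp.I * y) + (x + sp.I * y) ** 2)
    print(sp.re(expr), sp.im(expr))  # x**2 + x - y**2 and 2*x*y + y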

Exercise 2.5. Let f(z) = u(x, y) + iv(x, y) be analytic. Given that v(x, y) = −2xy and
f(0) = 1, find u(x, y).

Examples of functions that are not analytic

Example 2.2.5. Determine whether (and where) f(z) = z* is analytic and compute its
derivative where it exists.
Solution. Recall: if z = x + iy, then z* := x − iy. We first compute u_x, v_y, u_y and v_x, and
determine whether (and where) they are continuous:

u_x = 1, v_y = −1, u_y = 0, v_x = 0.

We confirm that the partial derivatives are continuous everywhere in C. We now check
the Cauchy-Riemann conditions and find that they are not satisfied anywhere in C because
u_x = 1 ≠ −1 = v_y. Intuitively, the complex conjugate fails to be analytic because analytic
functions can be locally approximated by rotations and stretches of the complex plane,
whereas the complex conjugate function is a reflection.

Example 2.2.6. Determine whether (and where) f(z) = z^{1/2} is analytic and compute its
derivative where it exists.
Solution. We leave it to the reader to apply the chain rule and the trigonometric identity
sin²(θ) + cos²(θ) = 1 to verify that the Cauchy-Riemann equations in polar coordinates are

∂u/∂r = (1/r) ∂v/∂θ,
∂v/∂r = −(1/r) ∂u/∂θ.

We compute the relevant partial derivatives of z^{1/2} and observe that they are not defined at
z = 0. We also observe that they cannot be continuous at a branch cut because branches of
z^{1/2} are not continuous across the branch cut. The Cauchy-Riemann conditions are satisfied
everywhere else. In conclusion, a branch of z^{1/2} is analytic in any region of C \ {0} that
does not contain the branch cut. We leave it to the reader to show that the derivative
in polar representation is given by f′(z) = e^{−iθ}(u_r + i v_r). For our example, this gives
f′(z) = (1/2) r^{−1/2} e^{−iθ/2} = (1/2) z^{−1/2}.

Example 2.2.7. Determine whether (and where) the function $f(z) = 1/z$ is analytic.
Solution. Note that $f$ is not defined at $z = 0$ and that $\lim_{z\to 0} f(z)$ does not exist. Rationalize the denominator to write $f(z) = x/(x^2 + y^2) - iy/(x^2 + y^2)$. The relevant partial derivatives are:
\[
u_x = \frac{y^2 - x^2}{(x^2 + y^2)^2}, \quad
v_y = \frac{y^2 - x^2}{(x^2 + y^2)^2}, \quad
u_y = \frac{-2xy}{(x^2 + y^2)^2}, \quad
v_x = \frac{2xy}{(x^2 + y^2)^2}.
\]
The partial derivatives exist and are continuous everywhere (except $z = 0$) and the Cauchy-Riemann conditions are satisfied on $\mathbb{C} \setminus \{0\}$. We evaluate the derivative $f'(z)$ and observe that $\lim_{z\to 0} f'(z)$ does not exist. We say that $f$ has a simple pole at $z = 0$ because $(z - 0)f(z)$ is analytic in a neighborhood of $0$. We will revisit this in section 2.2.5.

Example 2.2.8. Determine whether (and where) the functions (a) exp(z), (b) z exp(z̄) and
(c) (exp(z) − 1)/z are analytic and compute their derivatives where they exist.

The Cauchy-Riemann theorem 2.2.2 has two complementary interpretations, one geometrical and one from the world of partial differential equations, discussed below.

“Geometry” of Complex: Conformal Mapping

Let us now make a detour into the subject of conformal maps: functions of two variables, $x, y$, that locally preserve angles, but not necessarily lengths. We will see, quite remarkably, that the rich family of conformal functions (maps) can be analyzed based solely on the Cauchy-Riemann condition (2.6).
Indeed, the Cauchy-Riemann conditions (2.6) can be restated in the following compact form:
\[
i\frac{\partial f}{\partial x} = \frac{\partial f}{\partial y}.
\]
Then the Jacobian matrix of the function $f : \mathbb{R}^2 \to \mathbb{R}^2$, i.e. of the $(x, y) \to (u, v)$ map, is
\[
J = \begin{pmatrix} \dfrac{\partial u}{\partial x} & \dfrac{\partial u}{\partial y} \\[1ex] \dfrac{\partial v}{\partial x} & \dfrac{\partial v}{\partial y} \end{pmatrix}
  = \begin{pmatrix} \dfrac{\partial u}{\partial x} & \dfrac{\partial u}{\partial y} \\[1ex] -\dfrac{\partial u}{\partial y} & \dfrac{\partial u}{\partial x} \end{pmatrix}.
\]

Geometrically, the off-diagonal (skew-symmetric) part of the matrix represents rotation and the diagonal part represents stretching/re-scaling. The Jacobian of a function $f(z)$ takes infinitesimal line segments at the intersection of two curves in $z$ and rotates them to the corresponding segments in $f(z)$. Therefore, a function satisfying the Cauchy-Riemann equations, with a nonzero derivative, preserves the angle between curves in the plane. Transformations corresponding to such functions, and the functions themselves, are called conformal. That is, the Cauchy-Riemann equations are not only conditions for analyticity of a function, but also conditions for the function to be conformal.
The following famous theorem of complex analysis builds a strong theoretical foundation for conformal maps. (It is due to Bernhard Riemann, who stated it in his Ph.D. thesis in 1851. The first proof of the theorem was published in 1912 by Constantin Carathéodory.)

Figure 2.9: Exemplary functions (maps) from the unit disk. Screenshots from https://demonstrations.wolfram.com/ConformalMappingOfTheUnitDisk/.

Theorem 2.2.9 (Riemann Mapping Theorem). If $A$ is a non-empty simply connected open subset of $\mathbb{C}$ which is not all of $\mathbb{C}$, $A \subset \mathbb{C}$, then there exists a holomorphic (complex analytic) function $f(z)$ mapping $A$ to the unit disk, $D \equiv \{z \in \mathbb{C} : |z| < 1\}$, $f : A \to D$. Moreover, $f^{-1} : D \to A$ is also holomorphic.

The theorem allows one to build various conformal maps from one simply connected domain to another by reducing the problem to two maps, each sending the respective domain to the unit disk.
It is useful, for developing geometrical intuition, to consider conformal maps associated with elementary functions. See a number of illustrations in Fig. 2.9. Notice, however, that even relatively simple Riemann mappings, such as a map from the unit disk to the interior of a square, may not be expressible in terms of elementary functions, and in general one needs to rely on approximate numerical methods to build the maps. (See the ConformalMaps Julia package https://github.com/sswatson/ConformalMaps.jl for approximating the Riemann map from a simply connected planar domain to a disk.)
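Angle preservation itself is easy to probe numerically. The sketch below (our illustration, assuming numpy) pushes two curve directions through a point $z_0$ via the analytic map $f(z) = z^2$ and compares the angle between them before and after; any analytic $f$ with $f'(z_0) \neq 0$ behaves the same way:

import numpy as np

f = lambda z: z**2                          # an analytic map; f'(z) = 2z vanishes only at 0
z0 = 0.5 + 0.3j                             # test point away from z = 0
d1, d2 = np.exp(1j*0.3), np.exp(1j*1.2)     # two tangent directions at z0
t = 1e-7

# tangent directions of the image curves, via finite differences
w1 = (f(z0 + t*d1) - f(z0)) / t
w2 = (f(z0 + t*d2) - f(z0)) / t

print(np.angle(d2/d1), np.angle(w2/w1))     # the two angles agree to ~1e-7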

“Physics” of Complex: Harmonic functions

Here we will make a fast jump to the end of the semester, where Partial Differential Equations (PDEs) will be discussed in detail. Consider solutions of the Laplace equation in two dimensions:
\[
(\partial_x^2 + \partial_y^2) f(x, y) = 0. \qquad (2.9)
\]
Eq. (2.9) defines the so-called harmonic functions. We discuss them now, while studying complex calculus, because, quite remarkably, an arbitrary analytic function is a solution of Eq. (2.9) (its real and imaginary parts are each harmonic). This statement is a straightforward corollary of the Cauchy-Riemann theorem (2.2.2). To see it we recall that $f = u + iv$, and use the following transformations, which follow from the Cauchy-Riemann conditions (2.6), also assuming that the function $f(z)$ is analytic at $z$ (which allows us to differentiate it one more time with respect to $x$ and $y$):
\[
\begin{cases} u_x = v_y \\ u_y = -v_x \end{cases}
\;\Rightarrow\;
\begin{cases} u_{xx} = v_{xy} \\ u_{yy} = -v_{xy} \end{cases}
\;\Rightarrow\; u_{xx} + u_{yy} = 0,
\]
\[
\begin{cases} u_x = v_y \\ u_y = -v_x \end{cases}
\;\Rightarrow\;
\begin{cases} u_{xy} = v_{yy} \\ u_{xy} = -v_{xx} \end{cases}
\;\Rightarrow\; v_{xx} + v_{yy} = 0.
\]

The descriptor “harmonic” originates with the ancient Greeks: a point on a taut string undergoing periodic motion produces a pleasant sound, and such motion was thus called harmonic. This type of motion can be written in terms of sines and cosines, functions which are therefore referred to as harmonics. Fourier analysis, to which we will turn our attention soon, involves expanding periodic functions on the unit circle in a series over these harmonics. These functions satisfy the Laplace equation, and over time “harmonic” came to refer to all functions satisfying the Laplace equation.
The Laplace equation, and thus harmonic functions, arise most prominently in mathematical physics, in particular in electromagnetism (electrostatics) and fluid mechanics (hydrostatics).
For example, in electrostatics it describes the distribution of the electrostatic potential within a planar domain cut in a metal – a domain which is free of charge, i.e. free of singularities. Singular points of harmonic functions, on the other hand, are expressed as “point charges” and/or continuously distributed “charge densities”. Placing a point charge at the origin, $z = x + iy = 0$, results in a solution of the Laplace equation (2.9) which is singular, i.e. non-analytic, at the origin. We will see later, in the PDE part of the course, that the analytic function corresponding to a point charge placed at the origin is $f(z) = C \log z$, where $C$ is a constant related to the value of the charge; then $\mathrm{Re}(f(z)) = u = C \log r = (C/2) \log(x^2 + y^2)$ is the corresponding electrostatic potential.
The family of harmonic (complex) functions is rich. Each harmonic function satisfying Eq. (2.9) yields another harmonic function when multiplied by a constant, rotated, and/or shifted by an additive constant. The inversion of each function yields another harmonic function whose singularities are the images of the original singularities in a spherical “mirror”. Also, the sum of any two harmonic functions is again a harmonic function.
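A quick numerical check (our illustration, assuming numpy) confirms that the point-charge potential $u = \mathrm{Re}(\log z) = \frac{1}{2}\log(x^2 + y^2)$ is harmonic away from the origin, using the standard five-point stencil for the Laplacian:

import numpy as np

u = lambda x, y: 0.5*np.log(x**2 + y**2)   # Re(log z), the point-charge potential
x0, y0, h = 0.7, -0.4, 1e-3                # arbitrary test point away from z = 0

# five-point-stencil approximation of (d^2/dx^2 + d^2/dy^2) u
lap = (u(x0 + h, y0) + u(x0 - h, y0) +
       u(x0, y0 + h) + u(x0, y0 - h) - 4*u(x0, y0)) / h**2
print(lap)   # ~ 0, up to O(h^2) discretization error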

2.2.2 Integration along Contours

Complex integration is defined along an oriented contour C in the complex plane.

Definition 2.2.10 (Complex Integration). Let $f(z)$ be analytic in the neighborhood of a contour $C$ running from $a$ to $b$. The integral of $f(z)$ along $C$ is
\[
\int_C f(z)\, dz := \lim_{n \to \infty} \sum_{k=0}^{n-1} f(\zeta_k)(\zeta_{k+1} - \zeta_k), \qquad (2.10)
\]
where for each $n$, $(\zeta_k \,|\, k = 0, \cdots, n)$ is an ordered sequence of points along the path, breaking it into $n$ intervals, such that $\zeta_0 = a$, $\zeta_n = b$ and $\max_k |\zeta_{k+1} - \zeta_k| \to 0$ as $n \to \infty$.

Remark. It is now time to utilize the parameterization of complex curves discussed earlier in the course. Let $z(t)$ with $a \le t \le b$ be a parameterization of $C$; then definition 2.2.10 is equivalent to the Riemann integral of $f(z(t))\, z'(t)$ with respect to $t$. Therefore,
\[
\int_C f(z)\, dz = \int_a^b f(z(t))\, z'(t)\, dt. \qquad (2.11)
\]

Example 2.2.11. In example 2.1.13 we evaluated the functions $f_1(z) = \sin z$, $f_2(z) = \exp(z + 1)$, and $f_3(z) = z^2$ along the parameterized curves described in example 2.1.15. Now compute (a) $\int_{C_1} f_1(z)\, dz$, (b) $\int_{C_2} f_2(z)\, dz$, and (c) $\int_{C_3} f_3(z)\, dz$, where $C_1$ is the vertical line segment from $\pi/2 - iM$ to $\pi/2 + iM$, $C_2$ is the ray segment extending from the point $z = -1$ to the point $\sqrt{3}\, i$, and $C_3$ is the circular arc of radius $\varepsilon$ centered at $0$.
Solution.

(a) Let $z = \pi/2 + is$, then $dz = i\, ds$ for $-M < s < M$:
\[
\int_{C_1} \sin(z)\, dz = \int_{-M}^{+M} \frac{e^{i\pi/2 - s} - e^{-i\pi/2 + s}}{2i}\, i\, ds = \int_{-M}^{+M} i \cosh(s)\, ds = 2i \sinh(M).
\]

(b) Let $z = -1 + \rho e^{i\pi/3}$ for $0 \le \rho \le 2$. Then $dz = e^{i\pi/3}\, d\rho$:
\[
\int_{C_2} e^{z+1}\, dz = \int_0^2 e^{\rho e^{i\pi/3}} e^{i\pi/3}\, d\rho = \left. e^{\rho e^{i\pi/3}} \right|_0^2 = e^{2\cos(\pi/3) + i 2\sin(\pi/3)} - 1 = e \cos(\sqrt{3}) - 1 + i\, e \sin(\sqrt{3}).
\]

(c) Let $z = \varepsilon e^{i\tau}$ for $0 \le \tau < 2\pi$, then $dz = i\varepsilon e^{i\tau}\, d\tau$. Therefore,
\[
\int_{C_3} z^2\, dz = \int_0^{2\pi} \varepsilon^2 e^{2i\tau}\, i\varepsilon e^{i\tau}\, d\tau = \left. \tfrac{1}{3}\varepsilon^3 e^{3i\tau} \right|_0^{2\pi} = \tfrac{1}{3}\varepsilon^3 e^{6\pi i} - \tfrac{1}{3}\varepsilon^3 e^{0} = 0.
\]

Exercise 2.6. Let $C_+$ and $C_-$ represent the upper and lower unit semi-circles centered at the origin and oriented from $z = -1$ to $z = 1$. Find the integrals of the functions (a) $z$; (b) $z^2$; (c) $1/z$; and (d) $\sqrt{z}$ along $C_+$ and $C_-$. For $\sqrt{z}$, use the branch where $z$ is represented by $r e^{i\theta}$ with $0 \le \theta < 2\pi$.

Example 2.2.12. Let $C$ be the closed circular contour of radius $R$ centered at the origin. Show that
\[
\oint_C \frac{dz}{z^m} = 0, \quad \text{for } m = 2, 3, \ldots \qquad (2.12)
\]
by parameterizing the contour in polar coordinates.
Solution. One possible parameterization of the contour is $z(\theta) = R e^{i\theta}$ for $0 \le \theta < 2\pi$. Therefore, $dz = iR e^{i\theta}\, d\theta$. Changing the integral to polar coordinates gives:
\[
\oint_C \frac{dz}{z^m} = \int_0^{2\pi} \frac{iR e^{i\theta}}{(R e^{i\theta})^m}\, d\theta.
\]

Because $m = 2, 3, 4, \ldots$, we know that $m - 1 > 0$. Therefore,
\[
\int_0^{2\pi} \frac{iR e^{i\theta}}{(R e^{i\theta})^m}\, d\theta = \frac{i}{R^{m-1}} \int_0^{2\pi} e^{-i\theta(m-1)}\, d\theta = \frac{i}{R^{m-1}} \left[ \frac{e^{-i\theta(m-1)}}{-i(m-1)} \right]_0^{2\pi} = 0.
\]
A simple, but useful, continuation of this example would be confirming that the integral is not $0$ when $m = 1$.

Example 2.2.13. Use numerical integration to approximate the integrals in the examples
above and verify your results.
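One way to carry this out (a sketch of ours, assuming numpy; any quadrature rule would do) is to discretize Eq. (2.11) directly, integrating $f(z(t))\, z'(t)$ with the trapezoidal rule:

import numpy as np

def contour_integral(f, z, zprime, a, b, n=20001):
    # trapezoidal-rule discretization of Eq. (2.11)
    t = np.linspace(a, b, n)
    return np.trapz(f(z(t)) * zprime(t), t)

# the circle of radius R about the origin, as in Example 2.2.12
R = 2.0
z  = lambda t: R*np.exp(1j*t)
zp = lambda t: 1j*R*np.exp(1j*t)

print(contour_integral(lambda w: 1/w,    z, zp, 0, 2*np.pi))  # ~ 2*pi*i  (m = 1)
print(contour_integral(lambda w: 1/w**2, z, zp, 0, 2*np.pi))  # ~ 0, cf. Eq. (2.12)
print(contour_integral(lambda w: w**2,   z, zp, 0, 2*np.pi))  # ~ 0, cf. Example 2.2.11(c)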

2.2.3 Cauchy’s Theorem

In general the integral along a path in the complex plane depends on the entire path and not only on the positions of the end points. The following fundamental question arises naturally: is there a condition which makes the integral depend only on the end points of the path? The question is answered by the following famous theorem.

Theorem 2.2.14 (Cauchy's Theorem, 1825). If $f(z)$ is analytic in a simply connected region $D$ of the complex plane, then for all paths, $C$, lying in this region and having the same end points, the integral $\int_C f(z)\, dz$ has the same value.

A more compact way of stating the theorem is to say that integrals of analytic functions are path independent.
It is important to recognize that for Cauchy's theorem to hold in the case of multi-valued functions, one needs the integrand to be a single-valued function. The cuts introduced in the preceding section are required for exactly this reason – to force the integration path to stay within a single branch of a multi-valued function and thus to guarantee analyticity (differentiability) of the function along the path.
The same theorem can be restated in the following form.

Theorem 2.2.15 (Cauchy's Theorem (closed contour version)). Let $f(z)$ be analytic in a simply connected region $D$ and $C$ be a closed contour that lies in the interior of $D$. Then the integral of $f$ along $C$ is equal to zero: $\oint_C f(z)\, dz = 0$.

To make the transformation from the former formulation of Cauchy's theorem to the latter one, we need to consider two paths connecting two points of the complex plane. From Eq. (2.10), we see that paths are oriented and that changing the direction of a path changes the value of the integral by a factor of $-1$. Therefore, reversing the direction of one of the two paths considered leads us to the closed contour formulation of Cauchy's theorem.
Let us now sketch the proof of the closed contour version of Cauchy's theorem. Consider breaking the region of the complex plane bounded by the contour $C$ into small squares with contours $C_k$, all oriented, like the original contour $C$, in the positive direction (counter-clockwise). Then
\[
\oint_C f(z)\, dz = \sum_k \oint_{C_k} f(z)\, dz, \qquad (2.13)
\]

where we have accounted for the fact that integrals over the inner sides of the small contours
cancel each other, as two of them (for each side) are running in opposite directions. Next,

pick inside each contour $C_k$ a point, $z_k$, and then approximate $f(z)$ by expanding it in the Taylor series around $z_k$,
\[
f(z) = f(z_k) + f'(z_k)(z - z_k) + O(\Delta^2), \qquad (2.14)
\]
where $\Delta$ is the side of the small squares, so that the length of $C_k$ is at most $4\Delta$ and we have at most $(L/\Delta)^2$ small squares, $L$ being the linear size of the region. Substituting Eq. (2.14) into Eq. (2.13) one derives
\[
\oint_{C_k} f(z)\, dz = f(z_k) \oint_{C_k} dz + f'(z_k) \oint_{C_k} (z - z_k)\, dz + \oint_{C_k} O(\Delta^2)\, dz = 0 + 0 + O(\Delta^3). \qquad (2.15)
\]
Summing the $O(\Delta^3)$ contributions over the at most $(L/\Delta)^2$ small squares bounded by $C$, one arrives at the estimate $O(L^2 \Delta)$, which vanishes in the $\Delta \to 0$ limit.
Disclaimer: We have just used a discretization of the integral. When dealing with integration of functions in the rest of the course we will always discuss it in the sense of a limit, assuming that the limit exists, and not actually break the integration path into segments. However, if any question on the details of the limiting procedure surfaces, one should go back to the discretization and analyze the respective limiting procedure carefully.
One important consequence of Cauchy's theorem (there will be more discussed in the following) is that all integration rules known for standard, “interval”, integrals apply to contour integrals. This is also facilitated by the following statement.

Theorem 2.2.16 (Triangle Inequality). (A: From Euclidean Geometry) $|z_1 + z_2| \le |z_1| + |z_2|$, with equality iff (if and only if) $z_1$ and $z_2$ lie on the same ray from the origin. (B: Integral over Interval) Suppose $g(t)$ is a complex-valued function of a real variable, defined on $a \le t \le b$; then
\[
\left| \int_a^b g(t)\, dt \right| \le \int_a^b |g(t)|\, dt, \qquad (2.16)
\]
with equality iff the values of $g(t)$ all lie on the same ray from the origin. (C: Integral over Curve/Path) For any function $f(z)$ and any curve $\gamma$, we have
\[
\left| \int_\gamma f(z)\, dz \right| \le \int_\gamma |f(z)|\, |dz|, \qquad (2.17)
\]
where $dz = \gamma'(t)\, dt$ and $|dz| = |\gamma'(t)|\, dt$.

Proof. We take the “Euclidean” geometry version (A) of the statement, extended to the sum of complex numbers, for granted, and give a brief sketch of proofs for the integral formulations. The interval version (B) of the triangle inequality follows by approximating the integral as a Riemann sum:
\[
\left| \int_a^b g(t)\, dt \right| \approx \left| \sum_k g(t_k)\, \Delta t \right| \le \sum_k |g(t_k)|\, \Delta t \approx \int_a^b |g(t)|\, dt,
\]
where the middle inequality is just the standard triangle inequality for sums of complex numbers. The contour version (C) of the Theorem follows immediately from the interval version:
\[
\left| \int_\gamma f(z)\, dz \right| = \left| \int_a^b f(\gamma(t))\, \gamma'(t)\, dt \right| \le \int_a^b |f(\gamma(t))|\, |\gamma'(t)|\, dt = \int_\gamma |f(z)|\, |dz|.
\]

Figure 2.10: Closed contour(s) around the origin in the complex plane (axes $\mathrm{Re}(z) = x$, $\mathrm{Im}(z) = y$); see Eq. (2.19).

2.2.4 Cauchy’s Formula

Recall from definition 2.1.12 that a curve is called simple if it does not intersect itself, and
is called a contour if it is piece-wise smooth.

Theorem 2.2.17 (Cauchy's formula, 1831). Let $f(z)$ be analytic on and interior to a simple closed (counter-clockwise oriented) contour $C$, and let $z$ be a point in the interior of $C$. Then,
\[
f(z) = \frac{1}{2\pi i} \oint_C \frac{f(\zeta)\, d\zeta}{\zeta - z}. \qquad (2.18)
\]
C

To illustrate Cauchy's formula, consider the simplest, and arguably most important, example of an integral over the complex plane, $I = \oint dz/z$. For the integral over the closed contour shown in Fig. (2.10a), we parameterize the contour explicitly in polar coordinates and derive
\[
I = \oint \frac{dz}{z} = \int_0^{2\pi} \frac{d\,(r e^{i\theta})}{r e^{i\theta}} = \int_0^{2\pi} \frac{r e^{i\theta}\, i\, d\theta}{r e^{i\theta}} = i \int_0^{2\pi} d\theta = 2\pi i. \qquad (2.19)
\]
The integral is not zero.

Figure 2.11: The two closed contours considered in Example 2.2.18.


Next, recall the respective standard indefinite integral, $\int dz/z = \log z$. This formula is naturally consistent both with Eq. (2.19) and with the fact that $\log(z)$ is a multi-valued function. Indeed, consider the integral over a path between two points of the complex plane, e.g. $z = 1$ and $z = 2$. We can go from $z = 1$ to $z = 2$ straight, or we can, for example, first make a counter-clockwise turn around $0$. We can generalize: go clockwise instead, and make as many turns as we want. It is straightforward to check that the integral depends on how many times, and in which direction, we go around $0$: the answers differ by the result of Eq. (2.19), i.e. by $2\pi i$ multiplied by an integer; beyond that, the integral does not depend on the path.

Example 2.2.18. Compute, compare and discuss the difference (if any) between the values of the integral $\oint dz/z$ over the two distinct paths shown in Fig. (2.11).
Solution. Firstly, note that $1/z$ is not analytic at $z = 0$, and therefore neither contour can be deformed into the other without passing through the non-analytic point. Both paths are closed contours. The curve on the left, $C_1$, contains $z = 0$ and makes one full turn around the origin, so the integral reproduces the computation in Eq. (2.19):
\[
\oint_{C_1} \frac{dz}{z} = 2\pi i.
\]
The curve on the right, $C_2$, does not contain $z = 0$: it makes a full counter-clockwise turn around the origin as well as a clockwise turn around the origin on the “inner” part of the curve, so
\[
\oint_{C_2} \frac{dz}{z} = 0.
\]
C2

The “small square” construction used above to prove the closed contour version of Cauchy's Theorem, i.e. Theorem 2.2.15, is a useful tool for dealing with integrals over awkward (difficult for direct computation) paths around singular points of the integrand. However, it should not be thought that all such integrals will necessarily be nonzero. Consider
\[
\oint \frac{dz}{z^m}, \qquad m = 2, 3, \cdots,
\]
where the integrand is singular at $z = 0$. The respective indefinite integral (what is sometimes called the “anti-derivative”) is $z^{-m+1}/(1 - m) + C$, where $C$ is a constant. Observe that the indefinite integral is a single-valued function, and thus its increment over a closed contour – the value of the contour integral – is zero. (Notice that if $m = 1$ the indefinite integral is a multi-valued function within the domain surrounding $z = 0$.)
Cauchy's formula can be extended to higher derivatives.

Theorem 2.2.19 (Cauchy's formula for derivatives, 1842). Under the same conditions as in Theorem 2.2.17, the higher derivatives are
\[
f^{(n)}(z) = \frac{n!}{2\pi i} \oint_C \frac{f(\zeta)\, d\zeta}{(\zeta - z)^{n+1}}. \qquad (2.20)
\]

Theoretical Implications of Cauchy’s Theorem & Cauchy’s Formulas

Cauchy’s theorem and formulas have many powerful and far reaching consequences.

Theorem 2.2.20. Suppose f (z) is analytic on a region A. Then, f has derivatives of all
orders.

Proof. It follows directly from Cauchy’s formula for derivatives, Theorem 2.2.19 – that is we
have an explicit formula for all the derivatives, so, in particular, the derivatives all exist.

Theorem 2.2.21 (Cauchy Inequality). Let $C_R$ be the circle $|z - z_0| = R$. Assume that $f(z)$ is analytic on $C_R$ and its interior, i.e. on the disk $|z - z_0| \le R$. Finally, let $M_R = \max |f(z)|$ over $z$ on $C_R$. Then
\[
\forall n = 1, 2, \cdots : \quad |f^{(n)}(z_0)| \le \frac{n!\, M_R}{R^n}.
\]
Exercise 2.7. Prove the Cauchy Inequality Theorem utilizing Theorem 2.2.19. Provide an alternative argument for the Theorem's validity on the examples of $\exp(z)$ and $\cos(z)$, using a circle centered at the origin (you are expected to argue informally, without reference to Theorem 2.2.19, why the inequality holds).

Theorem 2.2.22 (Liouville Theorem.). If f (z) is entire, i.e. analytic at all finite points of
the complex plane C, and bounded then f is constant.

Proof. For any circle of radius $R$ around $z_0$, the Cauchy inequality (Theorem 2.2.21) with $n = 1$ states that $|f'(z_0)| \le M_R/R \le M/R$, where $M$ is a global bound on $|f|$; but $R$ can be arbitrarily large, thus $|f'(z_0)| = 0$ for every $z_0 \in \mathbb{C}$. And since the derivative is $0$ everywhere, the function itself is constant.
Note that $P(z) = \sum_{k=0}^n a_k z^k$, $\exp(z)$, $\cos(z)$ are entire but not bounded.

Theorem 2.2.23 (Fundamental Theorem of Algebra). Any polynomial $P$ of degree $n \ge 1$, i.e. $P(z) = \sum_{k=0}^n a_k z^k$ with $a_n \neq 0$, has exactly $n$ roots (solutions of $P(z) = 0$, counted with multiplicity).

Proof. The proof consists of two parts. First, we show that $P(z)$ has at least one root. (See the example below.) Second, we proceed inductively: let $z_0$ be a root and factor $P(z) = (z - z_0) Q(z)$, where $Q(z)$ has degree $n - 1$. If $n - 1 > 0$, we apply the result to $Q(z)$, and we can continue this process until the degree of $Q$ is $0$.
Example 2.2.24. Prove that $P(z) = \sum_{k=0}^n a_k z^k$, of degree $n \ge 1$, has at least one root.
Solution. We provide a hint rather than the full solution: prove by contradiction, utilizing the Liouville Theorem 2.2.22.

Theorem 2.2.25 (Maximum modulus principle (over disk)). Suppose $f(z)$ is analytic on the closed disk, $C_r$, of radius $r$ centered at $z_0$, i.e. the set $|z - z_0| \le r$. If $|f|$ has a relative maximum at $z_0$, then $f(z)$ is constant in $C_r$.

In order to prove the Theorem we will first prove the following statement.

Theorem 2.2.26 (Mean value property). Suppose $f(z)$ is analytic on the closed disk of radius $r$ centered at $z_0$, i.e. the set $|z - z_0| \le r$. Then,
\[
f(z_0) = \frac{1}{2\pi} \int_0^{2\pi} d\theta\, f\big(z_0 + r \exp(i\theta)\big).
\]

Proof. Call $C_r$ the boundary of the $|z - z_0| \le r$ set, and parameterize it as $z_0 + r e^{i\theta}$, $0 \le \theta \le 2\pi$. Then, according to Cauchy's formula,
\[
f(z_0) = \frac{1}{2\pi i} \oint_{C_r} \frac{f(z)\, dz}{z - z_0} = \frac{1}{2\pi i} \int_0^{2\pi} d\theta\, \frac{f(z_0 + r e^{i\theta})}{r e^{i\theta}}\, i r e^{i\theta} = \frac{1}{2\pi} \int_0^{2\pi} d\theta\, f(z_0 + r e^{i\theta}).
\]

Now back to Theorem 2.2.25. To sketch the proof we will use both the mean value property (Theorem 2.2.26) and the triangle inequality (Theorem 2.2.16). Since $z_0$ is a relative maximum of $|f|$, we have $|f(z)| \le |f(z_0)|$ for $z$ in the disk. Therefore, by the mean value property and the triangle inequality, one derives
\begin{align*}
|f(z_0)| &= \left| \frac{1}{2\pi} \int_0^{2\pi} d\theta\, f(z_0 + r e^{i\theta}) \right| && \text{(mean value property)} \\
&\le \frac{1}{2\pi} \int_0^{2\pi} d\theta\, |f(z_0 + r e^{i\theta})| && \text{(triangle inequality)} \\
&\le \frac{1}{2\pi} \int_0^{2\pi} d\theta\, |f(z_0)| && \text{($|f(z_0 + r e^{i\theta})| \le |f(z_0)|$, i.e. $z_0$ is a local maximum)} \\
&= |f(z_0)|.
\end{align*}
Since we start and end with $|f(z_0)|$, all inequalities in the chain are equalities. The first inequality can only be an equality if, for all $\theta$, the values $f(z_0 + r e^{i\theta})$ lie on the same ray from the origin, i.e. have the same argument or are equal to zero. The second inequality can only be an equality if $|f(z_0 + r e^{i\theta})| = |f(z_0)|$ for all $\theta$. Combining the two observations, one gets that all the $f(z_0 + r e^{i\theta})$ have the same magnitude and the same argument, i.e. they are all the same. Finally, since $f$ is constant along the circle and $f(z_0)$ is the average of $f(z)$ over the circle, $f(z) = f(z_0)$; applying the argument to every radius $r' \le r$ shows that $f$ is constant on $C_r$.
Two remarks are in order. First, based on the experience so far (starting from Theorem 2.2.22), it is plausible to expect that Theorem 2.2.25 generalizes from a disk to any connected domain. Second, one also expects that if the maximum modulus is achieved at the boundary of a domain, then the function need not be constant within the domain. Indeed, consider the example of $\exp(z)$ on the unit square, $0 \le x, y \le 1$. The maximum of $|\exp(x + iy)| = \exp(x)$ is achieved at $x = 1$ and arbitrary $y$, $0 \le y \le 1$, i.e. at the boundary of the domain. These remarks and the example suggest the following extension of Theorem 2.2.25.

Theorem 2.2.27 (Maximum modulus principle (general)). Suppose $f(z)$ is analytic on $A$, which is a bounded, connected, open set, and is continuous on $\bar{A} = A \cup \partial A$, where $\partial A$ is the boundary of $A$. Then either $f(z)$ is a constant, or the maximum of $|f(z)|$ on $\bar{A}$ occurs on $\partial A$.

Proof. Here is a sketch of the proof. Suppose the maximum of $|f(z)|$ is attained at an interior point of $A$. Cover $A$ by disks laid so that their centers form a path from the point where $|f(z)|$ is maximized to any other point of $A$, while staying totally contained within $A$. Existence of a maximum of $|f(z)|$ within $A$ then implies, according to Theorem 2.2.25 applied to all the disks in turn, that all the values of $f(z)$ in the domain are the same, thus $f(z)$ is constant within $A$. Obviously, constancy of $f(z)$ is not required if the maximum of $|f(z)|$ is achieved only at $\partial A$.

Example 2.2.28. Find the maximum modulus of $\sin(z)$ on the square $0 \le x, y \le 2\pi$.
Solution.
\[
\sin(z) = \sin(x + iy) = \sin(x)\cosh(y) + i \cos(x)\sinh(y).
\]
We want to find the maximum modulus of $\sin(z)$:
\[
|\sin(x + iy)| = \sqrt{\sin^2(x)\cosh^2(y) + \cos^2(x)\sinh^2(y)}.
\]
Using the identities $\cos^2(x) = 1 - \sin^2(x)$ and $\cosh^2(y) - \sinh^2(y) = 1$, we get
\[
|\sin(z)| = \sqrt{\sin^2(x)\cosh^2(y) + (1 - \sin^2(x))\sinh^2(y)} = \sqrt{\sinh^2(y) + \sin^2(x)}.
\]
This is maximized when $\sin^2(x) = 1$, i.e. $x = \pi/2$ or $x = 3\pi/2$, and $y = 2\pi$, as this is the maximum value of $y$ in our square. So,
\[
\max |\sin(z)| = \sqrt{\sinh^2(2\pi) + 1} = \cosh(2\pi) \approx 267.74.
\]
The maximizer, e.g. $z = \pi/2 + i 2\pi$, is on the boundary of our square, in agreement with the maximum modulus principle.
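A brute-force scan over a grid (our illustration, assuming numpy) confirms both the value and the location of the maximum:

import numpy as np

x = np.linspace(0, 2*np.pi, 501)
y = np.linspace(0, 2*np.pi, 501)
X, Y = np.meshgrid(x, y)
M = np.abs(np.sin(X + 1j*Y))

i, j = np.unravel_index(np.argmax(M), M.shape)
print(M[i, j], X[i, j], Y[i, j])   # ~267.74 at x = pi/2 (or 3*pi/2), y = 2*pi
print(np.cosh(2*np.pi))            # the exact maximum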

2.2.5 Laurent Series

The Laurent series of a complex function $f(z)$ about a point $a$ is a representation of that function by a power series that includes terms of both positive and negative degree.

Theorem 2.2.29. A function $f(z)$ that is analytic on the annulus $R_1 \le |z - a| \le R_2$ may be represented there by a power series, called a Laurent Series, that converges on the interior of the annulus:
\[
f(z) = \sum_{k=-\infty}^{+\infty} c_k (z - a)^k. \qquad (2.21)
\]
The coefficients of the Laurent series are given by
\[
c_k = \frac{1}{2\pi i} \oint_C \frac{f(z)}{(z - a)^{k+1}}\, dz, \qquad (2.22)
\]
where $C$ is any contour that is contained within the annulus and encircles $a$.
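Formula (2.22) is easy to test numerically. In the sketch below (our illustration, assuming numpy; the function and expected coefficients are chosen for the example) we sample $f(z) = 1/(z(z-2))$ on the circle $|z| = 1$, inside the annulus $0 < |z| < 2$, where the geometric expansion gives $c_{-1} = -1/2$, $c_0 = -1/4$, $c_1 = -1/8$:

import numpy as np

f = lambda z: 1/(z*(z - 2))

t = np.linspace(0, 2*np.pi, 4097)
z = np.exp(1j*t)                          # the contour |z| = 1, with dz = i z dt
for k in (-1, 0, 1):
    ck = np.trapz(f(z) / z**(k + 1) * 1j*z, t) / (2j*np.pi)   # Eq. (2.22) with a = 0
    print(k, ck.real)                     # -0.5, -0.25, -0.125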

Suppose one needs to compute
\[
\oint_C f(z)\, dz,
\]
where the contour $C$ circles $z = a$ in the positive (counter-clockwise) direction and there are no singular points of $f(z)$ in the interior of $C$, except possibly at $z = a$. If we represent $f(z)$ by its Laurent series, then the only nonzero contribution comes from the $k = -1$ term:
\[
\oint_C f(z)\, dz = \oint_C \sum_{k=-\infty}^{\infty} c_k (z - a)^k\, dz = c_{-1} \oint_C \frac{dz}{z - a} = 2\pi i\, c_{-1}. \qquad (2.23)
\]

Definition 2.2.30. The coefficient corresponding to the $k = -1$ term plays such a significant role in contour integration that it deserves a special name – the residue of $f$ at $z = a$ – and is denoted by $c_{-1} = \mathrm{Res}(f; a)$.

Notice that if $f(z)$ has a simple pole at $z = a$, then
\[
c_{-1} = \mathrm{Res}(f; a) = \lim_{z \to a} \big( f(z)(z - a) \big). \qquad (2.24)
\]

2.3 Residue Calculus


2.3.1 Singularities and Residues

Definition 2.3.1 (Singularity). Let $f : \mathbb{C} \to \mathbb{C}$ and consider $a \in \mathbb{C}$. If $f$ is not analytic at $a$ (meaning that $f'(a)$ does not exist), then we say that $a$ is a singular point of $f$. If $f$ is not analytic at $a$, but is analytic in the region $0 < |z - a| < R$, then we say that $a \in \mathbb{C}$ is an isolated singular point of $f$.

Definition 2.3.2 (Removable Singularity). Let a be a singular point of a function f , and


let ck be the coefficients of the Laurent expansion of f about a. If ck = 0 for all k < 0,
then we say that a is a removable singularity of f . (Note that f would be analytic if f were
redefined at a single point, a.)

Definition 2.3.3 (Simple Pole). Let $a$ be a singular point of a function $f$, and let $c_k$ be the coefficients of the Laurent expansion of $f$ about $a$. If $c_{-1} \neq 0$, but $c_k = 0$ for all $k < -1$, then we say that $a$ is a first order pole or a simple pole of $f$. See Fig. 2.12(a) for an example of a simple pole.

If $f$ has a simple pole at $z = a$, we can represent $f$ in the form
\[
f(z) = \frac{g(z)}{z - a}, \qquad (2.25)
\]
where $g(z)$ is analytic in a neighborhood of $a$ with $g(a) = c_{-1} \neq 0$.

Figure 2.12: (a) Top row: The canonical example of a simple pole. The real (left) and
imaginary (right) components of z 7→ z −1 . (b) Bottom row: The canonical example of a
double pole. The real (left) and imaginary (right) components of z 7→ z −2 .

Definition 2.3.4 (Higher Order Pole). Let $a$ be a singular point of a function $f$, and let $c_k$ be the coefficients of the Laurent expansion of $f$ about $a$. If, for some positive $N$, $c_{-N} \neq 0$ but $c_k = 0$ for all $k < -N$, then we say that $a$ is an $N$-th order pole of $f$. See Fig. 2.12(b) for an example of a double pole.

If $f$ has an $N$-th order pole at $z = a$, we can represent $f$ in the form
\[
f(z) = \frac{g(z)}{(z - a)^N}, \qquad (2.26)
\]
where $g(z)$ is analytic in a neighborhood of $a$ with $g(a) = c_{-N} \neq 0$.

Figure 2.13: The three contours used in Example 2.3.6.

Example 2.3.5. Find the removable singularity and give the order of the poles of the following function:
\[
f(z) = \frac{z - 1}{(z^4 - 1)(z + 1)} = \frac{z - 1}{(z - 1)(z + 1)^2 (z + i)(z - i)}.
\]

Solution. Observe that $z = 1$, $z = -1$, $z = i$, and $z = -i$ are singular points of $f$ because $f$ is not defined at these points.

• $z = 1$: For $z \neq 1$, $f(z) = g(z)$ where $g(z) = \frac{1}{(z+1)^2 (z+i)(z-i)}$, which is analytic in a neighborhood of $z = 1$. Therefore, $z = 1$ is a removable singularity of $f$.

• $z = i$: For $z \neq i$, $f(z) = \frac{g(z)}{z - i}$ where $g(z) = \frac{z-1}{(z-1)(z+1)^2 (z+i)}$, which is analytic in a neighborhood of $z = i$. Therefore, $z = i$ is a first order pole (or simple pole) of $f$.

• $z = -i$: Similar to $z = i$.

• $z = -1$: For $z \neq -1$, $f(z) = \frac{g(z)}{(z+1)^2}$, where $g(z) = \frac{z-1}{(z-1)(z+i)(z-i)}$, which is analytic in a neighborhood of $z = -1$. Therefore, $z = -1$ is a second order pole (double pole) of $f$.

In summary, there is a removable singularity at $z = 1$, first-order poles at $z = \pm i$ and a second-order pole at $z = -1$.

Example 2.3.6. Use Cauchy's formula to compute
\[
I = \oint \frac{\exp(z^2)\, dz}{z - 1},
\]
for the three contour examples shown in Fig. 2.13.

Solution. (a) The integrand is analytic everywhere within the domain surrounded by the contour, therefore $I = 0$. (b) The integrand has a single first-order pole within the domain surrounded by the contour, at $z = 1$. Notice that the direction of the contour is negative (clockwise), therefore $I = -2\pi i\, \mathrm{Res}\big(\exp(z^2)/(z - 1), 1\big) = -2\pi i \exp(z^2)|_{z=1} = -2\pi i e$. (c) Since the only singularity of the integrand is at $z = 1$, the contour can be reduced to the contour of case (b), however traveled twice. Therefore, $I = 2 \cdot (-2\pi i e) = -4\pi i e$.
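Case (b) is easy to double-check numerically (our illustration, assuming numpy): integrating over a counter-clockwise circle around $z = 1$ and dividing by $2\pi i$ should recover the residue $e$:

import numpy as np

t = np.linspace(0, 2*np.pi, 4097)
z = 1 + 0.5*np.exp(1j*t)                             # CCW circle of radius 1/2 about z = 1
I = np.trapz(np.exp(z**2)/(z - 1) * 1j*(z - 1), t)   # dz = i (z - 1) dt
print(I / (2j*np.pi))                                # ~ e = 2.71828..., so case (b) is -2*pi*i*e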

Figure 2.14: Rectangular contour in the $z$ plane with vertices $-a$, $a$, $a + i\pi$, $-a + i\pi$; see Example 2.3.7.

Example 2.3.7. Use Cauchy's formula to compute
\[
I = \oint \frac{dz}{\cosh z}
\]
over the contour shown in Fig. 2.14, where $a$ is a positive real.
Solution. To identify possible singularities of the integrand we need to solve $\cosh z_* = 0$, resulting in $z_* = i\pi(n + 1/2)$, where $n = 0, \pm 1, \cdots$. We observe that all the singularities are first-order poles. Only one of the poles, $z_* = i\pi/2$, is within the domain surrounded by the contour. Therefore, according to Cauchy's formula, the residue formula (2.24), and L'Hôpital's rule,
\[
I = 2\pi i\, \mathrm{Res}(1/\cosh z,\, i\pi/2) = 2\pi i \lim_{z \to i\pi/2} \frac{z - i\pi/2}{\cosh z} = \frac{2\pi i}{\sinh(i\pi/2)} = \frac{2\pi}{\sin(\pi/2)} = 2\pi.
\]
Exercise 2.8. Compute the integral $\oint dz/(e^z - 1)$ over the circle of radius $4$ centered at $3i$.

2.3.2 Evaluation of Real-valued Integrals by Contour Integration

Example 2.3.8. Evaluate the following real-valued integral (for $k > 0$) using contour integration:
\[
I = \int_{-\infty}^{\infty} \frac{e^{ikx}\, dx}{x^2 + 1}.
\]
Solution. Let $f : \mathbb{C} \to \mathbb{C}$ be given by $f(z) = e^{ikz}/(z^2 + 1)$. Observe that $f$ has simple poles at $z = \pm i$. Let $C = C_1 \cup C_R$ be the contour in the complex plane shown in Figure 2.15.

Consider the integral
\[
\oint_C \frac{e^{ikz}\, dz}{z^2 + 1} = \int_{C_1} \frac{e^{ikz}\, dz}{z^2 + 1} + \int_{C_R} \frac{e^{ikz}\, dz}{z^2 + 1}.
\]
Our plan is to use Cauchy's formula to show that the integral along $C$ is $2\pi i\, \mathrm{Res}(f; i)$, where $\mathrm{Res}(f; i) = \lim_{z \to i} \frac{e^{ikz}(z - i)}{z^2 + 1} = \frac{e^{-k}}{2i}$, and then, with the correct parameterization, to show that the integral along $C_1$ converges to $I$ as $R \to \infty$, while the integral along $C_R$ converges to $0$ as $R \to \infty$.
First, evaluate the integral along $C_1$. Parameterize $C_1$ by $z = x + 0i$ for $-R < x < R$; therefore $dz = dx$:
\[
\int_{C_1} \frac{e^{ikz}\, dz}{z^2 + 1} = \int_{-R}^{R} \frac{e^{ikx}\, dx}{x^2 + 1}.
\]
Observe that in the limit $R \to \infty$, this is the integral we must find.
Next, evaluate the integral along $C_R$. Parameterize $C_R$ by $z = R e^{i\theta} = R\cos(\theta) + iR\sin(\theta)$; therefore $dz = iR e^{i\theta}\, d\theta$. This gives
\[
\int_{C_R} \frac{e^{ikz}\, dz}{z^2 + 1} = \int_0^{\pi} \frac{e^{ikR(\cos(\theta) + i\sin(\theta))}\, iR e^{i\theta}}{(R e^{i\theta})^2 + 1}\, d\theta.
\]
We must consider what happens to the magnitude of the above integral as $R \to \infty$:
\begin{align*}
\left| \int_0^{\pi} \frac{e^{ikR(\cos(\theta) + i\sin(\theta))}\, iR e^{i\theta}}{(R e^{i\theta})^2 + 1}\, d\theta \right|
&\le \int_0^{\pi} \left| \frac{e^{ikR(\cos(\theta) + i\sin(\theta))}\, iR e^{i\theta}}{(R e^{i\theta})^2 + 1} \right| d\theta && \text{(by the triangle inequality)} \\
&\le R \int_0^{\pi} \frac{|e^{ikR\cos(\theta)}|\, e^{-kR\sin(\theta)}}{|R^2 e^{2i\theta} + 1|}\, d\theta \\
&= R \int_0^{\pi} \frac{e^{-kR\sin(\theta)}}{|R^2 e^{2i\theta} + 1|}\, d\theta && \text{(because $|e^{ikR\cos(\theta)}| = 1$)}.
\end{align*}

Observe that for $R > 1$ we have $\frac{1}{|R^2 e^{2i\theta} + 1|} \le \frac{1}{R^2 - 1}$, because $|R^2 e^{2i\theta} + 1| \ge R^2 - 1$ by the reverse triangle inequality. Hence the bound continues as
\begin{align*}
&\le \frac{R}{R^2 - 1} \int_0^{\pi} e^{-kR\sin(\theta)}\, d\theta \\
&= \frac{2R}{R^2 - 1} \int_0^{\pi/2} e^{-kR\sin(\theta)}\, d\theta && \text{(by the symmetry of $\sin\theta$ about $\theta = \pi/2$)} \\
&\le \frac{2R}{R^2 - 1} \int_0^{\pi/2} e^{-kR\frac{2\theta}{\pi}}\, d\theta && \text{(because $\sin(\theta) \ge \tfrac{2\theta}{\pi}$ for all $\theta \in [0, \pi/2]$)}.
\end{align*}
This final integral is one that we can evaluate:
\[
\frac{2R}{R^2 - 1} \int_0^{\pi/2} e^{-kR\frac{2\theta}{\pi}}\, d\theta = \frac{2R}{R^2 - 1}\, \frac{\pi(1 - e^{-kR})}{2kR} \to 0 \quad \text{as } R \to \infty.
\]

All in all, we see that
\[
\int_{C_R} \frac{e^{ikz}\, dz}{z^2 + 1} \to 0 \quad \text{as } R \to \infty.
\]
From here, we use Cauchy's formula, which gives us $\oint_C \frac{e^{ikz}\, dz}{z^2 + 1} = 2\pi i\, \mathrm{Res}(f, i) = (2\pi i)(e^{-k}/2i) = \pi e^{-k}$. This implies that our final answer is
\[
I = \int_{-\infty}^{\infty} \frac{e^{ikx}\, dx}{x^2 + 1} = \frac{\pi}{e^k}.
\]
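The answer can be sanity-checked by brute-force quadrature on a large finite interval (our illustration, assuming numpy; the integrand decays like $1/x^2$, so the truncation error for cutoff $X$ is of order $1/X$):

import numpy as np

k = 2.0
x = np.linspace(-2000, 2000, 2_000_001)
I = np.trapz(np.exp(1j*k*x)/(x**2 + 1), x)
print(I.real, np.pi*np.exp(-k))   # both ~0.4253, agreeing to ~1e-3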

Observe that Example 2.3.8 is an application of Jordan's lemma, formally stated in Lemma 2.3.9, with $g(z) = \frac{1}{z^2 + 1}$.

Lemma 2.3.9 (Jordan's Lemma). Let $C_R$ be the semicircular contour of radius $R$ in the upper half of the complex plane (this is the semicircular piece shown in Figure 2.15). Let $f(z)$ be a function of the form $f(z) = e^{iaz} g(z)$ with $a > 0$, and suppose $\lim_{R\to\infty} |g(R e^{i\theta})| = 0$. Then,
\[
\left| \int_{C_R} f(z)\, dz \right| \le \frac{\pi}{a}\, M_R,
\]
where $M_R = \max_{\theta \in [0,\pi]} |g(R e^{i\theta})|$.



Example 2.3.10. Evaluate the integral
\[
I_1 = \int_{-\infty}^{+\infty} \frac{\cos(\omega x)\, dx}{1 + x^2}, \qquad \omega > 0.
\]
Note: the respective indefinite integral is not expressible via elementary functions, and one needs an alternative way of evaluating the definite integral.
Solution. Observe that
\[
\int_{-\infty}^{+\infty} \frac{\sin(\omega x)\, dx}{1 + x^2} = 0,
\]
simply because the integrand is odd (skew-symmetric) in $x$. Combining the two formulas above one derives
\[
I_1 = \int_{-\infty}^{+\infty} \frac{\cos(\omega x)\, dx}{1 + x^2} + i \int_{-\infty}^{+\infty} \frac{\sin(\omega x)\, dx}{1 + x^2} = \int_{-\infty}^{+\infty} \frac{\exp(i\omega x)\, dx}{1 + x^2}.
\]

Consider an auxiliary integral
\[
I_R = \oint \frac{\exp(i\omega z)\, dz}{1 + z^2}, \qquad \omega > 0,
\]
where the contour consists of the half-circle of radius $R$ and the straight segment of the real axis from $-R$ to $R$, shown in Fig. 2.15. Since the integrand has two poles of the first order, at $z = \pm i$, and only one of these poles lies within the contour, one derives
\[
I_R = 2\pi i\, \mathrm{Res}\left( \frac{\exp(i\omega z)}{1 + z^2},\, +i \right) = 2\pi i\, \frac{\exp(i\omega i)}{2i} = \pi \exp(-\omega).
\]
On the other hand, $I_R$ can be represented as a sum of two integrals, one over $[-R, R]$ and one over the semi-circle. Sending $R \to \infty$, one observes that the latter integral vanishes, thus leaving us with the answer
\[
I_1 = \pi \exp(-\omega).
\]

Example 2.3.11. Evaluate the integral
\[
I = \int_{-\infty}^{+\infty} \frac{dx}{\cosh x},
\]
reducing it to a contour integral.

Figure 2.15: See Example 2.3.10.

Solution. Consider the contour shown in Fig. 2.14 at $a \to \infty$. The integral along the real axis coincides with the desired integral. The integrals over the left and right vertical portions give zero in the $a \to \infty$ limit (because the respective integrands decay to zero exponentially). Note that
\[
\cosh(x + i\pi) = \frac{\exp(x + i\pi) + \exp(-x - i\pi)}{2} = -\frac{\exp(x) + \exp(-x)}{2} = -\cosh(x).
\]
Therefore the fourth part of the contour becomes $\int_{i\pi + \infty}^{i\pi - \infty} dz/\cosh(z) = I$. Summing up the four pieces and utilizing the result of Example 2.3.7, one derives $I + 0 + 0 + I = 2\pi$, i.e. $I = \pi$. Obviously, the integral can also be evaluated directly (via the anti-derivative and the definite integral):
\[
I = 2 \arctan(\tanh(x/2)) \Big|_{-\infty}^{+\infty} = \frac{\pi}{2} - \left( -\frac{\pi}{2} \right) = \pi.
\]
Exercise 2.9. Evaluate the following integrals, reducing them to contour integrals:
\[
\text{(a)}\ \int_0^{\infty} \frac{dx}{1 + x^4}, \qquad
\text{(b)}\ \int_0^{\infty} \frac{dx}{1 + x^3}, \qquad
\text{(c)}\ \int_0^{\infty} \exp(ix^2)\, dx,
\]

Figure 2.16: The contour used in Eq. (2.28): segments $a \to b$ and $c \to d$ along the real axis, joined by semicircles of radii $r$ (around the origin) and $R$, with $e$ and $f$ labeling the closing arcs in the upper and lower half-planes.

\[
\text{(d)}\ \int_{-\infty}^{\infty} \frac{\exp(ikx)\, dx}{\cosh(x)}.
\]

Cauchy Principal Value

Consider the integral
\[
\int_0^{\infty} \frac{\sin(ax)\, dx}{x}, \qquad (2.27)
\]
where $a > 0$. As has become customary in this part of the course, let us evaluate it by constructing and evaluating a contour integral. Since $\sin(az)/z$ is analytic near $z = 0$ (recall, or google, L'Hôpital's rule), we build the contour around the origin as shown in Fig. 2.16. Then, going through the following chain of evaluations, we arrive at
\begin{align*}
\int_0^{\infty} \frac{\sin(ax)\, dx}{x} &= \frac{1}{2} \int_{[a \to b \to c \to d]} \frac{\sin(az)}{z}\, dz \qquad (2.28) \\
&= \frac{1}{4i} \int_{[a \to b \to c \to d]} \left( \frac{\exp(iaz)}{z} - \frac{\exp(-iaz)}{z} \right) dz \\
&= \frac{1}{4i} \int_{[a \to b \to c \to d \to e \to a]} \frac{\exp(iaz)}{z}\, dz - \frac{1}{4i} \int_{[a \to b \to c \to d \to f \to a]} \frac{\exp(-iaz)}{z}\, dz \\
&= \frac{1}{4i}\, (2\pi i - 0) = \frac{\pi}{2}.
\end{align*}

(Note that a lot of details in this chain of transformations are dropped. We advise the reader to reconstruct them. In particular, we suggest checking that the integrals over the two semi-circles in Fig. 2.16 decay to zero as $r \to 0$ and $R \to \infty$. For the latter, you may either estimate the asymptotic value of the integral yourself, or use Jordan's Lemma 2.3.9.)
The limiting process just explained is often referred to as the (Cauchy) principal value of the integral:
\[
\mathrm{P.V.} \int_{-\infty}^{\infty} \frac{\exp(ix)\, dx}{x} = \lim_{\substack{R \to \infty \\ \varepsilon \to 0}} \left( \int_{-R}^{-\varepsilon} + \int_{\varepsilon}^{R} \right) \frac{\exp(ix)\, dx}{x} = i\pi. \qquad (2.29)
\]

In general, if the integrand, $f(x)$, becomes infinite at a point $x = c$ inside the range of integration, and the limit on the right of the following expression exists,
\[
\mathrm{P.V.} \int_{-R}^{R} f(x)\, dx = \lim_{\varepsilon \to 0} \left( \int_{-R}^{c - \varepsilon} f(x)\, dx + \int_{c + \varepsilon}^{R} f(x)\, dx \right), \qquad (2.30)
\]
we call it the principal value integral. (Notice that each of the terms inside the brackets on the right, if considered separately, may result in a divergent integral.)
Consider another example:
\[
\int_a^b \frac{dx}{x} = \log \frac{b}{a}, \qquad (2.31)
\]
where the result is written via the formal indefinite integral. However, if $a < 0$ and $b > 0$ the integral diverges at $x = 0$. We can still define
\[
\mathrm{P.V.} \int_a^b \frac{dx}{x} := \lim_{\varepsilon \to 0} \left( \int_a^{-\varepsilon} \frac{dx}{x} + \int_{\varepsilon}^b \frac{dx}{x} \right) = \lim_{\varepsilon \to 0} \left( \log \frac{\varepsilon}{-a} + \log \frac{b}{\varepsilon} \right) = \log \frac{b}{|a|}, \qquad (2.32)
\]
excluding the $\varepsilon$ vicinity of $0$. This example helps us to emphasize what makes the principal value unambiguous – the condition that the $\varepsilon$-dependent integration limits in $\int^{-\varepsilon}$ and $\int_{\varepsilon}$ are taken with the same absolute value (and not, say, $\int^{-\varepsilon/2}$ and $\int_{\varepsilon}$) is essential.
If complex variables were used, we could complete the path by a semicircle from $-\varepsilon$ to $\varepsilon$ about the origin, either above or below the real axis. If the upper semicircle were chosen, there would be a contribution $-i\pi$, whereas if the lower semicircle were chosen, the contribution to the integral would be $+i\pi$. Thus, according to the path permitted in the complex plane, we should have $\int_a^b dz/z = \log(b/|a|) \pm i\pi$. The principal value is the mean of these two alternatives.
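The symmetric exclusion of Eq. (2.29) is also easy to watch numerically (our illustration, assuming numpy). For symmetric limits, the two half-lines combine into $2i \int_\varepsilon^R \sin(x)\, dx / x$, a convergent (Dirichlet) integral:

import numpy as np

eps, R = 1e-6, 2000.0
x = np.linspace(eps, R, 2_000_001)
# exp(ix)/x over [eps, R] plus exp(ix)/x over [-R, -eps] equals 2i sin(x)/x over [eps, R]
I = 2j*np.trapz(np.sin(x)/x, x)
print(I, 1j*np.pi)   # agree up to the oscillatory tail, ~1e-3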

Figure 2.17: Contour made of the pieces $C_1, \ldots, C_4$, with inner radius $r$ and outer radius $R$, wrapping the branch cut along the positive real axis; the poles $\pm i$ lie inside. See Example 2.3.12.

2.3.3 Contour Integration with Multi-valued Functions

Contour integrals can be used to evaluate certain definite integrals.

Integrals involving Branch Cuts

We discuss below a number of examples of definite integrals which are reduced to contour
integrals avoiding branch cuts.

Example 2.3.12. Evaluate the integral
\[
\int_0^{\infty} \frac{dx}{\sqrt{x}\,(x^2 + 1)},
\]
reducing it to a contour integral.



Solution. The square root in the integrand, $\sqrt{z} = \exp((\log z)/2)$, is a multi-valued function, therefore it must be treated with a contour that respects a branch cut. Consider
\[
\oint \frac{dz}{\sqrt{z}\,(z^2 + 1)},
\]
where the contour is shown in Fig. 2.17. The contour is chosen to guarantee that
\[
r \to 0: \ \int_{C_2} \frac{dz}{\sqrt{z}\,(z^2 + 1)} \to 0, \qquad
R \to \infty: \ \int_{C_4} \frac{dz}{\sqrt{z}\,(z^2 + 1)} \to 0,
\]

where the integral is broken into four parts, $\int_{C_1} + \int_{C_2} + \int_{C_3} + \int_{C_4}$, resulting (in the limits $r \to 0$ and $R \to \infty$) in
\[
\oint \frac{dz}{\sqrt{z}\,(z^2 + 1)} = \int_{C_1} \frac{dz}{\sqrt{z}\,(z^2 + 1)} + \int_{C_3} \frac{dz}{\sqrt{z}\,(z^2 + 1)} = 2 \int_0^{\infty} \frac{dx}{\sqrt{x}\,(x^2 + 1)}.
\]

On the other hand, the full closed contour contains the two poles of the integrand, at $z = \pm i$, in its interior; therefore, with the branch fixed by $0 \le \arg z < 2\pi$,
\[
\oint \frac{dz}{\sqrt{z}\,(z^2 + 1)} = 2\pi i \left( \mathrm{Res}\,(\text{at } z = i) + \mathrm{Res}\,(\text{at } z = -i) \right),
\]
\[
\mathrm{Res}\,(\text{at } z = i) = \lim_{z \to i} \big( f(z)(z - i) \big) = \lim_{z \to i} \frac{1}{\sqrt{z}\,(z + i)} = \frac{\exp(-3\pi i/4)}{2},
\]
\[
\mathrm{Res}\,(\text{at } z = -i) = \lim_{z \to -i} \big( f(z)(z + i) \big) = \lim_{z \to -i} \frac{1}{\sqrt{z}\,(z - i)} = \frac{\exp(-i\pi/4)}{2}.
\]
Summarizing, one arrives at the following answer:
\[
\int_0^{\infty} \frac{dx}{\sqrt{x}\,(x^2 + 1)} = \pi i \left( \frac{\exp(-3\pi i/4)}{2} + \frac{\exp(-i\pi/4)}{2} \right) = \frac{\pi i \cdot (-i\sqrt{2})}{2} = \frac{\pi}{\sqrt{2}}.
\]
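The result can be verified without any contours (our illustration, assuming numpy): the substitution $x = t^2$ removes the integrable singularity at $x = 0$ and turns the integral into $2\int_0^{\infty} dt/(1 + t^4)$:

import numpy as np

t = np.linspace(0.0, 200.0, 2_000_001)
I = np.trapz(2.0/(t**4 + 1), t)
print(I, np.pi/np.sqrt(2))   # both ~2.2214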

Exercise 2.10. Evaluate the following integral:
\[
\int_1^{\infty} \frac{dx}{x \sqrt{x - 1}}.
\]

Example 2.3.13. Compute the following integral, reducing it to a contour integral:
\[
I = \int_0^1 \frac{dx}{x^{2/3} (1 - x)^{1/3}}. \qquad (2.33)
\]

Solution. Let us analyze the contour integral with almost the same integrand,
\[
\oint \frac{dz}{z^{2/3} (z - 1)^{1/3}} = \oint \frac{dz}{f(z)}, \qquad (2.34)
\]
with the contour, shown in Fig. (2.18a), surrounding the cut connecting the two branch points of $f(z)$, at $z = 0$ and $z = 1$ (both points are branch points of the 3rd order).
Recall that cuts are introduced to make functions which are multi-valued in the complex plane (thus functions which are not entire, i.e. not analytic within the entire complex plane) analytic within the complex plane excluding the cut. The cut also fixes the choice of the (originally multi-valued) function's branch. In the case under consideration, $f(z) := z^{2/3} (z - 1)^{1/3}$ has the following parameterization as we go around the cut (in the negative direction):

Figure 2.18: See Example 2.3.13. Top: the contour $C_1 \cup C_2 \cup C_3 \cup C_4$ (through the points $a$, $b$, $c$, $d$) surrounding the cut $[0, 1]$, with small circles of radius $r$ around the branch points. Bottom: the large contour $C$ of radius $R$.



Sub-contour          Parametrization of $z$                                 Evaluation of $f(z)$
$C_1 := [a \to b]$   $x_1$, $x_1 \in [r, 1 - r]$                            $x_1^{2/3}\, |1 - x_1|^{1/3} \exp(i\pi/3)$
$C_2 := [b \to c]$   $1 + r\exp(i\theta_2)$, $\theta_2 \in [\pi, -\pi]$     $r^{1/3} \exp(i\theta_2/3)$
$C_3 := [c \to d]$   $x_3$, $x_3 \in [1 - r, r]$                            $x_3^{2/3}\, |1 - x_3|^{1/3} \exp(-i\pi/3)$
$C_4 := [d \to a]$   $r\exp(i\theta_4)$, $\theta_4 \in [2\pi, 0]$           $r^{2/3} \exp(i 2\theta_4/3 + i\pi/3)$
Next we compute the integrals of the same integrand over the sub-contours $C_1, C_2, C_3, C_4$:
\begin{align*}
\int_{C_1} \frac{dz}{f(z)} &= \int_0^1 \frac{dx_1}{x_1^{2/3} (1 - x_1)^{1/3} \exp(i\pi/3)} = \exp(-i\pi/3)\, I, && (2.35) \\
\int_{C_2} \frac{dz}{f(z)} &= \int_{\pi}^{-\pi} \frac{i r \exp(i\theta_2)\, d\theta_2}{(1 + r\exp(i\theta_2))^{2/3} (r\exp(i\theta_2))^{1/3}} \;\xrightarrow{r \to 0}\; 0, && (2.36) \\
\int_{C_3} \frac{dz}{f(z)} &= \int_1^0 \frac{dx_3}{x_3^{2/3} |1 - x_3|^{1/3} \exp(-i\pi/3)} = -\exp(i\pi/3)\, I, && (2.37) \\
\int_{C_4} \frac{dz}{f(z)} &= \int_{2\pi}^0 \frac{i r \exp(i\theta_4)\, d\theta_4}{(r\exp(i\theta_4))^{2/3} (r\exp(i\theta_4) - 1)^{1/3}} \;\xrightarrow{r \to 0}\; 0. && (2.38)
\end{align*}

Taking advantage of the analyticity of the integrand everywhere outside the $[0, 1]$ cut, and using Cauchy's integral theorem, let us transform the integral over $C_1 \cup C_2 \cup C_3 \cup C_4$ into the integral of the same integrand over the contour $C$ shown in Fig. (2.18):
\[
\int_{C_1} \frac{dz}{f(z)} + \int_{C_2} \frac{dz}{f(z)} + \int_{C_3} \frac{dz}{f(z)} + \int_{C_4} \frac{dz}{f(z)} = \int_C \frac{dz}{f(z)}. \qquad (2.39)
\]
On the other hand, the contour integral over $C$ can be computed in the $R \to \infty$ limit:
\[
\int_C \frac{dz}{f(z)} \;\xrightarrow{R \to \infty}\; \int_{2\pi}^0 \frac{i R \exp(i\theta)\, d\theta}{R^{2/3} \exp(2i\theta/3)\, (R\exp(i\theta) - 1)^{1/3}} = -i \int_0^{2\pi} d\theta = -2\pi i. \qquad (2.40)
\]
Summarizing Eqs. (2.33)–(2.40), one arrives at
\[
I = \frac{-2\pi i}{-\exp(i\pi/3) + \exp(-i\pi/3)} = \frac{\pi}{\sin(\pi/3)} = \frac{2\pi}{\sqrt{3}}. \qquad (2.41)
\]
It may be instructive to compare this derivation with an alternative derivation of the
integral discussed in [1].
Exercise 2.11. Evaluate the integral
\[
\int_{-1}^1 \frac{dx}{(1 + x^2)\sqrt{1 - x^2}}, \qquad (2.42)
\]
by suggesting and evaluating an equivalent contour integral.


2.4 Extreme-, Stationary- and Saddle-Point Methods
In this auxiliary Section$^*$, we study a family of related methods which allow one to approximate integrals dominated by the contribution of a special point and its vicinity. Depending on the case it is called the extreme-point method (also called the Laplace method), the stationary-point method, or the saddle-point method (also called the steepest-descent method). We start by discussing the extreme-point version, corresponding to estimating real-valued integrals over a real domain; we then turn to the estimation of oscillatory (complex-valued) integrals over a real interval (stationary-point method), and then generalize to complex-valued integrals over a complex path (saddle-point method, or steepest-descent method).

$^*$Here and below we mark auxiliary Sections with $^*$. These Sections can be dropped at a first reading. Material from the auxiliary Sections will not contribute to the midterm and final exams.
The extreme- (or maximal-) point method applies to the integral
\[
I_1 = \int_a^b dx\, \exp(f(x)), \qquad (2.43)
\]
where the real-valued, continuous function $f(x)$ achieves its maximum at a point $x_0 \in\, ]a, b[$. One then approximates the function by the first terms of its Taylor series expansion around the maximum,
\[
f(x) = f(x_0) + \frac{(x - x_0)^2}{2}\, f''(x_0) + O\big((x - x_0)^3\big), \qquad (2.44)
\]
where, since $x_0$ is the maximum, $f'(x_0) = 0$ and $f''(x_0) \le 0$; we consider the case of general position, $f''(x_0) < 0$. One substitutes Eq. (2.44) into Eq. (2.43), drops the $O((x - x_0)^3)$ term and extends the integration over $[a, b]$ to $]-\infty, \infty[$. Evaluating the resulting Gaussian integral one arrives at the following extreme-point estimate:
\[
I_1 \to \sqrt{\frac{2\pi}{-f''(x_0)}}\, \exp(f(x_0)). \qquad (2.45)
\]
This approximation is justified if $|f''(x_0)| \gg 1$.

Example 2.4.1. Estimate the following integral,
\[
I = \int_{-\infty}^{+\infty} dx\, \exp(f(x)), \qquad f(x) = \alpha x^2 - x^4/2,
\]
at sufficiently large positive $\alpha$, using the extreme-point method.

Solution. Let us find all stationary points of $f(x)$ (extreme points of the integrand). Solving $f'(x_s) = 0$, one gets that either $x_s = 0$ or $x_s = \pm\sqrt{\alpha}$.


The values of $f$ at the extreme points are $f(0) = 0$ and $f(\pm\sqrt{\alpha}) = \alpha^2/2$, and we thus choose the dominating extreme points, $x_s = \pm\sqrt{\alpha}$, for further evaluation. In fact, since the two (dominant) extreme points are fully equivalent, we pick one of them and then multiply the estimate of the integral by two:
\[
I \approx 2 \exp(\alpha^2/2) \int_{-\infty}^{+\infty} dx\, \exp\big(f''(\sqrt{\alpha})\, x^2/2\big) = 2 \exp(\alpha^2/2) \int_{-\infty}^{+\infty} dx\, \exp(-2\alpha x^2) = \exp(\alpha^2/2) \sqrt{\frac{2\pi}{\alpha}},
\]
where we also took into account that $f''(\pm\sqrt{\alpha}) = -4\alpha$.
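A numerical comparison (our illustration, assuming numpy) shows how accurate the estimate already is at moderate $\alpha$:

import numpy as np

a = 12.0                                    # alpha
x = np.linspace(-6, 6, 2_000_001)           # the integrand is negligible outside
I_num = np.trapz(np.exp(a*x**2 - x**4/2), x)
I_est = np.exp(a**2/2)*np.sqrt(2*np.pi/a)   # the extreme-point estimate
print(I_num/I_est)                          # close to 1, and -> 1 as alpha grows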

The same idea, known under the name of the stationary-point method, works for highly oscillatory integrals of the form
\[
I_2 = \int_a^b dx\, \exp(i f(x)), \qquad (2.46)
\]
where the real-valued, continuous $f(x)$ has a real stationary point $x_0$, $f'(x_0) = 0$. The integrand oscillates least at the stationary point, which guarantees that the stationary point and its vicinity make the dominant contribution to the integral. The statement just made may be a bit confusing, because the integrand, considered as a function of $x$, is oscillatory, formally making the integral highly sensitive to the positions of the ends of the interval. To make the statement sensible, consider shifting the contour of integration into the complex plane so that it crosses the real axis at $x_0$ along the special direction where $\mathrm{Re}\big(i f''(x_0)(x - x_0)^2\big)$ attains its maximum at $x_0$, thus making the resulting integrand decay fast (locally along the contour) as $|x - x_0|$ increases. One derives
\[
I_2 \approx \exp(i f(x_0)) \int dx\, \exp\big(i f''(x_0)(x - x_0)^2/2\big) = \sqrt{\frac{2\pi}{|f''(x_0)|}}\, \exp\big(i f(x_0) + i\,\mathrm{sign}(f''(x_0))\, \pi/4\big),
\]
where the dependence on the interval's end points disappears (in the limit of sufficiently large $|f''(x_0)|$).

Example 2.4.2. Estimate the following integral,
\[
I_2 = \int_{-\infty}^{+\infty} dx\, \exp(i f(x)), \qquad f(x) = \alpha x^2 - x^4/2,
\]
at sufficiently large positive $\alpha$, using the stationary-point method.


Solution. We can re-use here the results of Example 2.4.1. The stationary points of the integrand are the same: $x_s = 0$ and $x_s = \pm\sqrt{\alpha}$. The values of $f$ at the stationary points are $f(0) = 0$ and $f(\pm\sqrt{\alpha}) = \alpha^2/2$, resulting in $1$ and $\exp(i\alpha^2/2)$ contributions to the integrand. Therefore, in the asymptotic (large $\alpha$) estimation we should keep all three contributions to the integral. Computing the second derivatives at the three stationary points, $f''(0) = 2\alpha$ and $f''(\pm\sqrt{\alpha}) = -4\alpha$, estimating the three contributions to $I_2$ according to the stationary-point formula above, and finally summing them up, we arrive at
\[
I_2 \approx 2 \exp(i\alpha^2/2) \int dx\, \exp(-2i\alpha x^2) + \int dx\, \exp(i\alpha x^2)
= 2\sqrt{\frac{\pi}{2\alpha}}\, \exp(i\alpha^2/2 - i\pi/4) + \sqrt{\frac{\pi}{\alpha}}\, \exp(i\pi/4).
\]
Now, in the most general case (of the saddle-point method, also called the steepest-descent method), we consider the contour integral
\[
I_3 = \int_C dz\, \exp(f(z)), \qquad (2.47)
\]
assuming that $f(z)$ is analytic along the contour, $C$, and also within a domain, $D$, of the complex plane in which the contour is embedded. Let us also assume that there exists a point, $z_0$, within $D$ where $f'(z_0) = 0$. This point is called a saddle-point because the iso-lines of $f(z)$ in the vicinity of $z_0$ show a saddle – a minimum and a maximum along two orthogonal directions. Deforming $C$ so that it passes through $z_0$ along the “maximal” path (where $f(z)$ reaches its maximum at $z_0$), one arrives at the following saddle-point estimate:
\[
I_3 \to \sqrt{\frac{2\pi}{-f''(z_0)}}\, \exp(f(z_0)), \qquad (2.48)
\]

where the square-root sign stands for its main (standard) branch, i.e. $\forall \theta \in [0, 2\pi] : \sqrt{\exp(i\theta)} = \exp(i\theta/2)$. In what concerns the applicability of the saddle-point approximation: the approximation is based on truncating the Taylor expansion of $f(z)$ around $z_0$, which is justified if $f(z)$ changes significantly where the expansion applies, i.e. $|f''(z_0)|\, R^2 \gg 1$, where $R$ is the radius of convergence of the Taylor series expansion of $f(z)$ around $z_0$.
Two remarks are in order. First, let us emphasize that $f(z_0)$ and $f''(z_0)$ can both be complex. Second, there may be a number (more than one) of saddle points in the region of analyticity of $f(z)$. In this case one picks the saddle-point achieving the maximal value (of $\mathrm{Re}\, f(z_0)$). In the case of degeneracy, i.e. when multiple saddle-points achieve the same value, as was the case in both Example 2.4.1 and Example 2.4.2, one deforms the contour to pass through all the saddle-points and replaces the right-hand side of Eq. (2.48) by the sum of the saddle-point contributions.

Exercise 2.12.$^*$ Estimate the following integrals,
\[
\text{(a)}\ \int_{-\infty}^{+\infty} dx\, \cos\big(\alpha x^2 - x^3/3\big), \qquad
\text{(b)}\ \int_{-\infty}^{+\infty} dx\, \exp\big(-x^4/4\big) \cos(\alpha x),
\]
at sufficiently large positive $\alpha$, through the saddle-point approximation.

In summary, the important lesson (take-away) of the extreme-, stationary- and saddle-point analysis of complex integrals is our principal ability to search for the regions dominating the integrals, enabled by the analyticity of the integrand. We achieve this by shifting the integration contour – (a) making it go through the point(s) where the absolute value of the integrand maxes out, and (b) forcing the contour to ascend to and descend from the point(s) along the steepest direction, thereby exploiting the fact that each of the max-points is a saddle-point. The approach allows one to extract the asymptotic behavior of the integral in a regime where some parameter makes the ascent/descent infinitely steep in the limit.
Two Homework Assignments associated with the Chapter 2 are:

• HW1: Exercises 2.1-2.6.

• HW2: Exercises 2.7-2.12.


Chapter 3

Fourier Analysis

Fourier analysis is the study of how functions may be represented or approximated by


their oscillatory components. Decomposing a function into its oscillatory components (or
basis functions), which requires computing the correct coefficients for each component, is
achieved by computing an integral. Similarly, recomposing the function from its orthogo-
nal basis functions is achieved by computing a sum or an integral. When the oscillatory
components take a continuous range of wave-numbers (or frequencies), the decomposition
and recomposition are referred to as the Fourier transform and inverse Fourier transform.
When the oscillatory components take a discrete range of wave-numbers (or frequencies),
the decomposition and recomposition are referred to as a Fourier Series.
Fourier analysis grew from the study of Fourier series, which is credited to Joseph Fourier
for showing that the study of heat transfer is greatly simplified by representing a function
as a sum of trigonometric basis functions. The original concept of Fourier analysis has been
extended over time and now applies to more general and more abstract situations. The field
is often called harmonic analysis.

3.1 The Fourier Transform and Inverse Fourier Transform


Certain functions $f(x)$ can be expressed by the representation, known as the Fourier integral,
\[
f(x) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} dk\, \exp(i k^T x)\, \hat{f}(k), \qquad (3.1)
\]
where $k = (k_1, \cdots, k_d)$ is the “wave-vector”, $dk = dk_1 \cdots dk_d$, and $\hat{f}(k)$ is the Fourier transform of $f(x)$, defined according to
\[
\hat{f}(k) := \int_{\mathbb{R}^d} dx\, \exp(-i k^T x)\, f(x). \qquad (3.2)
\]


Eq. (3.1) and Eq. (3.2) are inverses of each other (meaning, for example, that substituting Eq. (3.2) into Eq. (3.1) will recover $f(x)$), and it is for this reason that the Fourier integral is also called the Inverse Fourier Transform. Proofs that they are inverses, as well as of other important properties of the Fourier Transform, rely on Dirac's $\delta$-function, which in $d$ dimensions can be defined as
\[
\delta(x) := \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} dk\, \exp(i k^T x). \qquad (3.3)
\]


At first glance, it might appear that the appropriate class of functions for which Eq. (3.1)
is defined is one where both f (x) and fˆ(k) are integrable. We will demonstrate how the
definition of the δ-function permits Eq. (3.1) to be defined over a wider class of functions
in Section 3.4. More careful consideration of the function spaces to which f (x) and fˆ(k)
belong will be addressed in the Theory course (Math 584).
In the interest of maintaining compact notation and clear explanations, important prop-
erties for the Fourier Transform will be presented for the one dimensional case (Section 3.2),
but each property applies to the more general $d$-dimensional Fourier transform. There are only a few functions whose Fourier transform can be expressed by a closed-form representation; see Section 3.4.
Remark. There are alternative definitions for the Fourier transform and its inverse; some
authors place the multiplicative constant of (2π)−d in the definition of fˆ(k), other authors
prefer the ‘symmetric’ definition where both f (x) and fˆ(k) are multiplied by (2π)−d/2 , and
still others place a 2π in the complex exponential. It is important to read widely during
graduate school, but be warned that the specific results you find will depend on the exact
definitions used by the author.

3.2 Properties of the 1-D Fourier Transform


In the d = 1 case, x may play the role of the spatial coordinate or of time. When x is the
spatial coordinate, the spectral variable k is often called the wave number, which is the one
dimensional version of the wave vector. When x is time, k is often called frequency and
given the symbol ω. The spatial and temporal terminologies are interchangeable.
Linearity: Let $h(x) = a f(x) + b g(x)$, where $a, b \in \mathbb{C}$; then
\[
\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x) e^{-ikx} = \int_{\mathbb{R}} dx\, \big(a f(x) + b g(x)\big) e^{-ikx}
= a \int_{\mathbb{R}} dx\, f(x) e^{-ikx} + b \int_{\mathbb{R}} dx\, g(x) e^{-ikx} = a \hat{f}(k) + b \hat{g}(k). \qquad (3.4)
\]

Spatial/Temporal Translation: Let $h(x) = f(x - x_0)$, where $x_0 \in \mathbb{R}$; then
\[
\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x) e^{-ikx} = \int_{\mathbb{R}} dx\, f(x - x_0) e^{-ikx} = \int_{\mathbb{R}} dx'\, f(x') e^{-ikx' - ikx_0} = e^{-ikx_0} \hat{f}(k). \qquad (3.5)
\]
Frequency Modulation: For any real number $k_0$, if $h(x) = \exp(i k_0 x) f(x)$, then
\[
\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x) e^{-ikx} = \int_{\mathbb{R}} dx\, f(x) e^{i k_0 x} e^{-ikx} = \int_{\mathbb{R}} dx\, f(x) e^{-i(k - k_0)x} = \hat{f}(k - k_0). \qquad (3.6)
\]
Spatial/Temporal Rescaling: For a non-zero real number $a$, if $h(x) = f(ax)$, then
\[
\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x) e^{-ikx} = \int_{\mathbb{R}} dx\, f(ax) e^{-ikx} = |a|^{-1} \int_{\mathbb{R}} dx'\, f(x') e^{-ikx'/a} = |a|^{-1} \hat{f}(k/a). \qquad (3.7)
\]
The case $a = -1$ leads to the time-reversal property: if $h(t) = f(-t)$, then $\hat{h}(\omega) = \hat{f}(-\omega)$.
Complex Conjugation: If $h(x)$ is the complex conjugate of $f(x)$, that is, if $h(x) = (f(x))^*$, then
\[
\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x) e^{-ikx} = \int_{\mathbb{R}} dx\, (f(x))^* e^{-ikx} = \left( \int_{\mathbb{R}} dx\, f(x) e^{ikx} \right)^* = \big( \hat{f}(-k) \big)^*. \qquad (3.8)
\]

Exercise 3.1. Verify the following consequences of complex conjugation:

(a) If $f$ is real, then $\hat{f}(-k) = (\hat{f}(k))^*$ (this implies that $\hat{f}$ is a Hermitian function).

(b) If $f$ is purely imaginary, then $\hat{f}(-k) = -(\hat{f}(k))^*$.

(c) If $h(x) = \mathrm{Re}(f(x))$, then $\hat{h}(k) = \frac{1}{2}\big( \hat{f}(k) + (\hat{f}(-k))^* \big)$.

(d) If $h(x) = \mathrm{Im}(f(x))$, then $\hat{h}(k) = \frac{1}{2i}\big( \hat{f}(k) - (\hat{f}(-k))^* \big)$.

Exercise 3.2. Show that the Fourier transform of a radially symmetric function of two variables, i.e. $f(x_1, x_2) = g(r)$, where $r^2 = x_1^2 + x_2^2$, is also radially symmetric, i.e. $\hat{f}(k_1, k_2) = \hat{f}(\rho)$, where $\rho^2 = k_1^2 + k_2^2$. (Recall that in polar coordinates $(r, \theta)$ a radially symmetric function does not depend on the angle $\theta$.)

Differentiation: If $h(x) = f'(x)$, then under the assumption that $|f(x)| \to 0$ as $x \to \pm\infty$,
\[
\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x) e^{-ikx} = \int_{\mathbb{R}} dx\, f'(x) e^{-ikx} = \left[ f(x) e^{-ikx} \right]_{-\infty}^{\infty} - \int_{\mathbb{R}} dx\, (-ik) f(x) e^{-ikx} = (ik) \hat{f}(k). \qquad (3.9)
\]
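The properties above are straightforward to verify by direct quadrature of Eq. (3.2). The sketch below (our illustration, assuming numpy) uses the Gaussian $f(x) = e^{-x^2/2}$, whose transform in this convention is $\hat{f}(k) = \sqrt{2\pi}\, e^{-k^2/2}$:

import numpy as np

x = np.linspace(-30, 30, 600_001)
f = np.exp(-x**2/2)
ft = lambda g, k: np.trapz(g*np.exp(-1j*k*x), x)   # direct quadrature of Eq. (3.2)

k, x0 = 1.3, 0.7
fhat = np.sqrt(2*np.pi)*np.exp(-k**2/2)            # closed-form transform of f

print(ft(f, k), fhat)                                          # the definition itself
print(ft(np.exp(-(x - x0)**2/2), k), np.exp(-1j*k*x0)*fhat)    # translation, Eq. (3.5)
print(ft(-x*f, k), 1j*k*fhat)                                  # differentiation, Eq. (3.9), since f'(x) = -x f(x)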

Integration: Substituting $k = 0$ in the definition, we obtain $\hat{f}(0) = \int_{-\infty}^{\infty} f(x)\, dx$. That is, the evaluation of the Fourier transform at the origin, $k = 0$, equals the integral of $f$ over all of its domain.
Proofs for the following two properties rely on the use of the δ-function (which will
not be addressed until Section 3.3), and require more careful consideration of integrability
(which is beyond the scope of this brief introduction). The following two properties are
added here so that a complete list of properties appears in a single location.

Unitarity [Parseval/Plancherel Theorem]: For any function $f$ such that $\int |f|\, dx < \infty$ and $\int |f|^2\, dx < \infty$,
\[
\int_{-\infty}^{\infty} dx\, |f(x)|^2 = \int_{-\infty}^{\infty} dx\, f(x) (f(x))^*
= \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} \frac{dk_1}{2\pi}\, e^{i k_1 x} \hat{f}(k_1) \int_{-\infty}^{\infty} \frac{dk_2}{2\pi}\, e^{-i k_2 x} \big( \hat{f}(k_2) \big)^*
= \frac{1}{2\pi} \int_{-\infty}^{\infty} dk\, |\hat{f}(k)|^2. \qquad (3.10)
\]

Definition 3.2.1. The integral convolution of the function $f$ with the function $g$ is defined as
\[
(f * g)(x) := \int_{\mathbb{R}} dy\, g(x - y) f(y). \qquad (3.11)
\]
Convolution: Suppose that $h$ is the integral convolution of $f$ with $g$, that is, $h(x) = (f * g)(x)$; then
\[
\hat{h}(k) = \int_{\mathbb{R}} dx\, h(x) e^{-ikx} = \int_{\mathbb{R}} dx \int_{\mathbb{R}} dy\, f(x - y) g(y) e^{-ikx} = \hat{f}(k) \hat{g}(k). \qquad (3.12)
\]

The convolution of a function f with a kernel g is defined in Eq. (3.11). Consider whether
there exists a convolution kernel g resulting in the projection of a function to itself. That is,
can we find a g such that (g ∗ f) = f for arbitrary functions f? If such a g were to exist, what
properties would it have?
Heuristically, we could argue that such a function would have to be both localized and
unbounded. Localized, because for the convolution ∫ dy g(x − y)f(y) to “pick out” f(x),
g(x − y) must be zero for all x ≠ y. Unbounded, because we also need g(x − y) to be
sufficiently large at x = y to ensure that the integral in the middle of Eq. (3.12) can be
nonzero.
Such a degree of ‘unboundedness’ concentrated at a single point is impossible under the
traditional theory of functions; nonetheless, such an unbounded g(x) was introduced by Paul
Dirac in the context of quantum mechanics. It was not until the 1940s that Laurent Schwartz
developed a rigorous theory for such ‘functions’, which became known as the theory of
distributions. We usually denote this ‘function’ by δ(x) and call it the (Dirac) δ-function. See
[1] (ch. 4) for more details.

3.3 Dirac’s δ-function.


3.3.1 The δ-function as the limit of a δ-sequence

We begin our study of Dirac’s δ-function by considering the sequence of functions given by
\[
\delta_\epsilon(x) = \begin{cases} 1/\epsilon, & |x| \le \epsilon/2, \\ 0, & |x| > \epsilon/2. \end{cases} \tag{3.13}
\]
The point-wise limit of δ_ε is clearly zero for all x ≠ 0, and therefore the integral of the limit
of δ_ε must also be zero, that is,
\[
\lim_{\epsilon \to 0} \delta_\epsilon(x) = 0 \;\;\Rightarrow\;\; \int_{-\infty}^{\infty} dx\, \lim_{\epsilon \to 0} \delta_\epsilon(x) = 0. \tag{3.14}
\]
However, for any ε > 0, the integral of δ_ε is clearly unity, and therefore the limit of the
integral of δ_ε must also be unity:
\[
\int_{-\infty}^{\infty} dx\, \delta_\epsilon(x) = 1 \;\;\Rightarrow\;\; \lim_{\epsilon \to 0} \int_{-\infty}^{\infty} dx\, \delta_\epsilon(x) = 1. \tag{3.15}
\]
Although Eq. (3.14) suggests that δ_ε(x) may not be very interesting as a function, the
behavior demonstrated by Eq. (3.15) motivates the use of δ_ε(x) as a functional. (In casual
terms, a function takes numbers as inputs and gives numbers as outputs, whereas a functional
takes functions as inputs and gives numbers as outputs.) For any sufficiently nice function
φ(x), define the functionals δ_ε[φ] and δ[φ] by
\[
\delta[\phi] := \lim_{\epsilon \to 0} \delta_\epsilon[\phi] := \lim_{\epsilon \to 0} \int_{-\infty}^{\infty} dx\, \delta_\epsilon(x)\, \phi(x). \tag{3.16}
\]
The behavior of δ[φ] can be demonstrated by approximating the corresponding integrals,
δ_ε[φ], for each ε > 0:
\[
\delta_\epsilon[\phi] = \int_{-\infty}^{\infty} dx\, \delta_\epsilon(x)\, \phi(x) = \frac{1}{\epsilon} \int_{-\epsilon/2}^{\epsilon/2} dx\, \phi(x). \tag{3.17}
\]
Letting m and M represent the minimum and maximum values of φ(x) on the interval
−ε/2 < x < ε/2 gives the bounds
\[
m \le \delta_\epsilon[\phi] \le M. \tag{3.18}
\]
If φ is continuous at x = 0, the limit of δ_ε[φ] as ε → 0 is given by
\[
\delta[\phi] = \lim_{\epsilon \to 0} \delta_\epsilon[\phi] = \phi(0). \tag{3.19}
\]
In summary, δ[φ] evaluates its argument at the point x = 0.
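To make the functional limit (3.19) concrete, here is a minimal numerical sketch in Julia
(the language used for the class snippets); the test function φ(x) = cos(x) and the grid sizes
are illustrative choices added here, not part of the notes.

# Numerical check that δ_ε[φ] → φ(0) as ε → 0, using the 'top-hat'
# δ-sequence of Eq. (3.13). φ and the grid size are illustrative choices.
φ(x) = cos(x)                      # smooth test function with φ(0) = 1

# δ_ε[φ] = (1/ε) ∫_{-ε/2}^{ε/2} φ(x) dx, approximated by a Riemann sum
function delta_functional(φ, ε; n = 10_000)
    xs = range(-ε/2, ε/2; length = n)
    sum(φ.(xs)) * step(xs) / ε
end

for ε in (1.0, 0.1, 0.01, 0.001)
    println("ε = ", ε, ":  δ_ε[φ] ≈ ", delta_functional(φ, ε))
end
# output approaches φ(0) = 1 as ε shrinks

The same experiment with the Lorentzian sequence of Eq. (3.20) below gives the same limit,
illustrating that different δ-sequences define the same functional.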


Now compare δ_ε(x) to the sequence of functions given by
\[
\tilde{\delta}_\epsilon(x) = \frac{1}{\pi}\, \frac{\epsilon}{x^2 + \epsilon^2}. \tag{3.20}
\]
The point-wise limit of δ̃_ε(x) is also zero for every x ≠ 0, so, as before, the integral of the
limit must be zero:
\[
\lim_{\epsilon \to 0} \tilde{\delta}_\epsilon(x) = 0 \;\;\Rightarrow\;\; \int_{-\infty}^{\infty} dx\, \lim_{\epsilon \to 0} \tilde{\delta}_\epsilon(x) = 0. \tag{3.21}
\]
A suitable trigonometric substitution shows that the integral of δ̃_ε(x) is also unity for each
ε > 0, and, as before, the limit of the integrals must be unity:
\[
\int_{-\infty}^{\infty} dx\, \tilde{\delta}_\epsilon(x) = 1 \;\;\Rightarrow\;\; \lim_{\epsilon \to 0} \int_{-\infty}^{\infty} dx\, \tilde{\delta}_\epsilon(x) = 1. \tag{3.22}
\]
As with δ_ε(x), we can use δ̃_ε(x) to define the functionals δ̃_ε[φ] and δ̃[φ] by
\[
\tilde{\delta}[\phi] := \lim_{\epsilon \to 0} \tilde{\delta}_\epsilon[\phi] := \lim_{\epsilon \to 0} \int_{-\infty}^{\infty} dx\, \tilde{\delta}_\epsilon(x)\, \phi(x). \tag{3.23}
\]
This time it takes a little more thought to find the appropriate bounds, but with some effort
it can be shown that
\[
\tilde{\delta}[\phi] = \lim_{\epsilon \to 0} \tilde{\delta}_\epsilon[\phi] = \phi(0). \tag{3.24}
\]
That is, δ̃[φ] also evaluates its argument at the point x = 0.
The sequences δ_ε(x) and δ̃_ε(x) both have the same limiting behavior as functionals, and
are examples of what is known as a δ-sequence. Their limiting behavior leads us to the
definition of the δ-function, defined by δ[φ] = ∫_R dx δ(x)φ(x) = φ(0).

Remark. The δ-function only makes sense in the context of an integral. Although it is
common practice to write expressions like δ(x)f(x), as above, such expressions should always
be understood as ∫_R dx δ(x)f(x).

Alternative Definitions of the δ-function

We have defined the δ-function as the limit of a particular δ-sequence, namely the ‘top-hat’
sequence given in Eq. (3.13). One has to wonder whether there may be other δ-sequences
which give the same limit. For example, consider
\[
\delta(t) = \lim_{\epsilon \to 0} \frac{2\epsilon\, t^2}{\pi (t^2 + \epsilon^2)^2}. \tag{3.25}
\]
To validate the suitability of Eq. (3.25) as an alternative definition of the δ-function, one
needs to check, first, that δ_ε(t) → 0 as ε → 0 for all t ≠ 0 and, second, that ∫ dt δ_ε(t) = 1.
(It is easy to evaluate this integral by promoting it to a contour integral and closing the
contour, for example, over the upper half of the complex plane. Observing that the integrand
has a pole of the second order at t = iε, expanding it into a Laurent series around iε and
keeping the c₋₁ coefficient, and then using the Cauchy formula for the contour integral, we
confirm that the integral is equal to unity.)

Exercise 3.3. Validate the following asymptotic representations for the δ-function:
\[
\text{(a)}\;\; \delta(t) = \lim_{\epsilon \to 0} \frac{1}{\sqrt{\pi \epsilon}} \exp\left(-\frac{t^2}{\epsilon}\right),
\qquad
\text{(b)}\;\; \delta(t) = \lim_{n \to \infty} \frac{1 - \cos(nt)}{\pi n t^2}.
\]

In many applications we deal with periodic functions. In this case one needs to consider
relations that hold within a single period. In view of the extreme locality of the δ-function
(just explored), all the relations discussed above extend to this case.

Example 3.3.1. Validate the following asymptotic representation for the δ-function on the
interval (−π, π):
\[
\delta(\theta) = \lim_{r \to 1^-} \frac{1 - r^2}{2\pi (1 - 2r\cos(\theta) + r^2)},
\]
where r → 1⁻ means r = 1 − ε, ε > 0, ε → 0.


Solution. For 0 < r < 1, define δ_r(θ) = (1 − r²)/(2π(1 − 2r cos(θ) + r²)). To show that δ_r(θ)
is a δ-sequence, we must show:

i. lim_{r→1⁻} δ_r(θ) = 0 for each θ ≠ 0;

ii. ∫_{−π}^{π} δ_r(θ) dθ = 1 for each r < 1 (i.e. for ε > 0, where r = 1 − ε).

i. To show that the point-wise limit of δ_r(θ) is zero for each θ ≠ 0, note that for any θ ≠ 0,
lim_{r→1⁻} (1 − 2r cos(θ) + r²) = 2(1 − cos(θ)) > 0, while lim_{r→1⁻} (1 − r²) = 0.

ii. To show that δ_r(θ) integrates to unity for each r, there is a clever trick that evaluates this
integral as a complex-valued contour integral. Let C be the parameterization of the unit
circle z(θ) = e^{iθ} for −π < θ < π, so that dz = ie^{iθ} dθ = iz dθ. Now consider
\[
\int_{-\pi}^{\pi} \delta_r(\theta)\, d\theta
= \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{1 - r^2}{1 - 2r\cos(\theta) + r^2}\, d\theta
= \frac{1}{2\pi} \oint_C \frac{1 - r^2}{1 - r(e^{i\theta} + e^{-i\theta}) + r^2}\, \frac{dz}{iz}
\]
\[
= \frac{1}{2\pi i} \oint_C \frac{1 - r^2}{1 - r(z + 1/z) + r^2}\, \frac{dz}{z}
= \frac{1}{2\pi i} \oint_C \frac{1 - r^2}{-r + (1 + r^2)z - rz^2}\, dz
= \frac{1}{2\pi i} \oint_C \frac{1 - r^2}{(1 - rz)(z - r)}\, dz.
\]
The integrand has a simple pole inside the contour at z₁ = r with residue equal to 1. (There
is also a simple pole at z₂ = 1/r, but it is irrelevant because it lies outside the contour.)
Therefore, ∫_{−π}^{π} δ_r(θ) dθ = 1.
−π

3.3.2 Properties of the δ-function

Example 3.3.2. For b, c ∈ R, show that cδ(x − b)f(x) = cf(b)δ(x − b).

Solution. We need to show that (a) the two expressions vanish at x ≠ b (which is trivial) and
(b) that their integrals are equal:
∫_{−∞}^{∞} dx cδ(x − b)f(x) = c ∫_{−∞}^{∞} dy δ(y)f(y + b) = cf(b) = ∫_{−∞}^{∞} dx cf(b)δ(x − b).

Example 3.3.3. For a ∈ R, a ≠ 0, show that δ(ax)f(x) = δ(x)f(0)/|a|.

Solution. We need to show that (a) the two expressions vanish at x ≠ 0 (which is trivial) and
(b) that their integrals are equal:
∫_{−∞}^{∞} dx δ(ax)f(x) = ∫_{−∞}^{∞} (dy/|a|) δ(y)f(y/a) = f(0)/|a| = ∫_{−∞}^{∞} dx δ(x)f(0)/|a|.

Example 3.3.4. Show that the Fourier transform of a δ-function is a constant.

Solution. δ̂(k) = ∫_{−∞}^{∞} dx δ(x)e^{−ikx} = e^{−ik·0} = 1.

Example 3.3.5. Show that

(a) δ(x) = ∫_{−∞}^{∞} (dk/2π) exp(ikx);

(b) the Fourier transform of a constant is a δ-function.

Solution. (a) We identify the expression on the right-hand side as the inverse Fourier transform
of the function fˆ(k) = 1: f(x) = (1/2π) ∫_{−∞}^{∞} dk e^{ikx}. Even though the constant function
is not integrable in the traditional sense, the theory of distributions (which are functionals of
the δ-function type) allows us to give meaning to this integral. We know that the δ-function
is defined so that for any suitable function φ(x), ∫ dx δ(x)φ(x) = φ(0). Even if we cannot
integrate f(x) directly, if we can show that ∫ dx f(x)φ(x) = φ(0) for every suitable test
function φ(x), then we can assert that f(x) = δ(x).

(b)
\[
f[\phi] = \int_{-\infty}^{\infty} dx\, \phi(x) \int_{-\infty}^{\infty} \frac{dk}{2\pi}\, e^{-ikx}
= \int_{-\infty}^{\infty} \frac{dk}{2\pi} \int_{-\infty}^{\infty} dx\, \phi(x)\, e^{-ikx}
= \int_{-\infty}^{\infty} \frac{dk}{2\pi}\, \hat{\phi}(k)\, e^{ik\cdot 0} = \phi(0).
\]
And since f[φ] = φ(0) for every suitable test function φ, we say that f(x) = δ(x).

3.3.3 Using δ-functions to Prove Properties of Fourier Transforms

We now return to proving (1) that the Fourier Transform and the inverse Fourier Transform
are indeed inverses of each other, (2) Plancherel’s theorem, and (3) the convolution property.

Proposition 3.3.6. The Fourier transform of the convolution of the function f with the
function g is the product fˆ(k)ĝ(k):
\[
\widehat{(f * g)}(k) = \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy\, g(x - y)\, f(y)\, e^{-ikx}
= \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} dy \int_{-\infty}^{\infty} \frac{dk_1}{2\pi} \int_{-\infty}^{\infty} \frac{dk_2}{2\pi}\, \hat{g}(k_1)\, \hat{f}(k_2)\, e^{-ikx + ik_1(x - y) + ik_2 y}
\]
\[
= \int_{-\infty}^{\infty} \frac{dk_1}{2\pi} \int_{-\infty}^{\infty} \frac{dk_2}{2\pi}\, \hat{g}(k_1)\, \hat{f}(k_2) \int_{-\infty}^{\infty} dx\, e^{i(k_1 - k)x} \int_{-\infty}^{\infty} dy\, e^{i(k_2 - k_1)y}
\]
\[
= \int_{-\infty}^{\infty} dk_1\, \hat{g}(k_1)\, \delta(k_1 - k) \int_{-\infty}^{\infty} dk_2\, \hat{f}(k_2)\, \delta(k_2 - k_1) = \hat{f}(k)\, \hat{g}(k),
\]
where in the transition from the first to the second line we exchanged the order of integration,
assuming that all the integrals involved are well-defined, and then used ∫ dx e^{i(k₁−k)x} = 2πδ(k₁ − k).

Proposition 3.3.7. Unitarity [Parseval/Plancherel Theorem]:
\[
\int_{-\infty}^{\infty} dx\, |f(x)|^2 = \int_{-\infty}^{\infty} dx\, f(x)\, \big(f(x)\big)^{*}
= \int_{-\infty}^{\infty} dx \int_{-\infty}^{\infty} \frac{dk_1}{2\pi}\, e^{ik_1 x}\, \hat{f}(k_1) \int_{-\infty}^{\infty} \frac{dk_2}{2\pi}\, e^{-ik_2 x}\, \big(\hat{f}(k_2)\big)^{*}
\]
\[
= \frac{1}{2\pi} \int_{-\infty}^{\infty} dk_1 \int_{-\infty}^{\infty} dk_2\, \hat{f}(k_1)\, \big(\hat{f}(k_2)\big)^{*}\, \frac{1}{2\pi} \int_{-\infty}^{\infty} dx\, e^{ix(k_1 - k_2)}
= \frac{1}{2\pi} \int_{-\infty}^{\infty} dk_1 \int_{-\infty}^{\infty} dk_2\, \hat{f}(k_1)\, \big(\hat{f}(k_2)\big)^{*}\, \delta(k_1 - k_2)
\]
\[
= \frac{1}{2\pi} \int_{-\infty}^{\infty} dk\, |\hat{f}(k)|^2.
\]

Remark. Using the δ-function as the convolution kernel yields the self-convolution property:
\[
f(x) = \int dy\, \delta(x - y)\, f(y). \tag{3.26}
\]

Consider the δ-function of a function, δ(f(x)). It can be transformed into the following sum
over the zeros of f(x):
\[
\delta(f(x)) = \sum_n \frac{1}{|f'(y_n)|}\, \delta(x - y_n),
\]
where the yₙ are the zeros of f(x).

To prove the statement, first recall that the δ-function is equal to zero at all points where
its argument is nonzero. This observation alone suggests that the answer is a sum of
δ-functions, and what is left is to establish the weight associated with each term in the sum.
Pick the contribution associated with a zero of f(x) and, integrating the resulting expression
over a small vicinity of that point, make the change of variable from x to f:
\[
\int dx\, \delta(f(x)) = \int \frac{df}{f'(x)}\, \delta(f(x)).
\]
Because of the δ(f(x)) term in the integrand, which is nonzero only at the zero of f(x), we
can replace f′(x) by f′ evaluated at the zero and move it out of the integrand. The sign of
the remaining integral depends on the sign of the derivative, which accounts for the absolute
value |f′(yₙ)| in the weights.
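As a concrete illustration of this composition rule (an example added here for definiteness),
take f(x) = x² − a² with a ≠ 0, whose simple zeros at ±a have |f′(±a)| = 2|a|:
\[
\delta\big(x^2 - a^2\big) = \frac{\delta(x - a) + \delta(x + a)}{2|a|}, \qquad a \neq 0.
\]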

3.3.4 The δ-function in Higher Dimensions

The d-dimensional δ-function, which was instrumental for introducing the d-dimensional
Fourier transform in Section 3.1, is simply a product of one-dimensional δ-functions: δ(x) =
δ(x₁) · · · δ(x_d).

Example 3.3.8. Compute the δ-function in spherical polar coordinates.

Solution. Functional expressions for the δ-function in the Cartesian frame and the spherical
polar frame are
\[
f(\mathbf{r}) = \int \delta(\mathbf{r} - \tilde{\mathbf{r}})\, f(\tilde{\mathbf{r}})\, d\tilde{\mathbf{r}}
= \int \delta(x - \tilde{x})\, \delta(y - \tilde{y})\, \delta(z - \tilde{z})\, f(\tilde{x}, \tilde{y}, \tilde{z})\, d\tilde{x}\, d\tilde{y}\, d\tilde{z} \tag{3.27}
\]
\[
= \int \delta(\theta - \tilde{\theta})\, \delta(\phi - \tilde{\phi})\, \delta(r - \tilde{r})\, f(\tilde{r}, \tilde{\theta}, \tilde{\phi})\, d\tilde{r}\, d\tilde{\theta}\, d\tilde{\phi}. \tag{3.28}
\]
On the other hand, the volume element transformation from the Cartesian frame to the
spherical polar frame is
\[
d\tilde{\mathbf{r}} = d\tilde{x}\, d\tilde{y}\, d\tilde{z} = \tilde{r}^2 \sin\tilde{\theta}\, d\tilde{r}\, d\tilde{\theta}\, d\tilde{\phi}. \tag{3.29}
\]
Combining Eqs. (3.27), (3.28), and (3.29), we derive
\[
\delta(\mathbf{r} - \tilde{\mathbf{r}}) = \frac{\delta(\theta - \tilde{\theta})\, \delta(\phi - \tilde{\phi})\, \delta(r - \tilde{r})}{\tilde{r}^2 \sin\tilde{\theta}}.
\]

3.3.5 Formal Differentiation: The Heaviside Function and the Derivatives of the δ-function

The δ-function is not technically a well-defined function, as it only exists in the context
of being integrated against a well-defined function. However, formally, using integration
techniques, we can write down a well-defined notion for a “derivative” of the δ-function.
In fact, we can “differentiate” discontinuous or classically non-differentiable functions using
the same notion. Once again, we stress that this is not true differentiation, but rather
something that looks like differentiation in form. This technique is often referred to as
formal differentiation.
Substituting f(x) = 1 into Eq. (3.26), we derive ∫_{−∞}^{∞} dx δ(x) = 1. This motivates the
introduction of a function associated with an incomplete integration of δ(x),
\[
\theta(y) := \int_{-\infty}^{y} dx\, \delta(x) = \begin{cases} 0, & y < 0, \\ 1, & y > 0, \end{cases} \tag{3.30}
\]
called the Heaviside or step function.

Exercise 3.4. Prove the relation
\[
\left(\frac{d^2}{dt^2} - \gamma^2\right) \exp(-\gamma |t|) = -2\gamma\, \delta(t). \tag{3.31}
\]
Hint: Yes, the step function will be useful in the proof.


Differentiating Eq. (3.30), one gets θ′(x) = δ(x). We can also differentiate the δ-function.
Indeed, integrating Eq. (3.26) by parts, and assuming that the respective anti-derivative is
bounded, we arrive at
\[
\int dy\, \delta'(y - x)\, f(y) = -f'(x). \tag{3.32}
\]
Substituting f(x) = x g(x) into Eq. (3.32), we derive
\[
x\, \delta'(x) = -\delta(x). \tag{3.33}
\]
Expanding f(x) in a Taylor series around x = y, ignoring terms of the second order (and
higher) in (x − y), and utilizing Eq. (3.33), one arrives at
\[
f(x)\, \delta'(x - y) = f(y)\, \delta'(x - y) - f'(y)\, \delta(x - y). \tag{3.34}
\]
Notice that δ′(x) is skew-symmetric and f(x)δ′(x − y) is not equal to f(y)δ′(x − y).
We have assumed so far that δ′(x) is convolved with a continuous function. To extend this
to the case of piece-wise continuous functions, with jumps in the function or in its derivative,
one needs to be more careful using integration by parts at the points of discontinuity. An
exemplary function of this type is the Heaviside function just discussed. This means that if
a function f(x) has a jump at x = y, its derivative admits the representation
\[
f'(x) = \big(f(y + 0) - f(y - 0)\big)\, \delta(x - y) + g(x), \tag{3.35}
\]
where f(y + 0) − f(y − 0) represents the value of the jump and g(x) is finite at x = y. A
similar representation (involving δ′(x)) can be built for a function with a jump in its
derivative; then the δ(x) contribution is associated with the second derivative of f(x).

Exercise 3.5. Express tδ 00 (t) via δ 0 (t).

3.4 Closed form representation for select Fourier Transforms


There are a few functions for which the Fourier transforms can be written in closed form.

3.4.1 Elementary examples of closed form representations

Example 3.4.1. Show that the Fourier transform of a δ-function is a constant.

Solution. See Example 3.3.4, where we showed δ̂(k) = 1.

Example 3.4.2. Show that the Fourier transform of a constant is a δ-function.

Solution. See Example 3.3.5, where we showed that the inverse Fourier transform of unity is
δ(x). A similar calculation shows that 1̂(k) = 2πδ(k).

Example 3.4.3. Show that the Fourier transform of a square pulse function is a sinc function:
\[
f(x) = \begin{cases} b, & |x| < a, \\ 0, & |x| > a, \end{cases}
\qquad\Rightarrow\qquad
\hat{f}(k) = \frac{2b}{k}\, \sin(ka).
\]
Solution.
\[
\hat{f}(k) = \int_{\mathbb{R}} dx\, f(x)\, e^{-ikx} = b \int_{-a}^{a} dx\, e^{-ikx}
= \frac{b}{(-ik)} \Big[ e^{-ikx} \Big]_{-a}^{a} = \frac{b}{-ik}\, \big(e^{-ika} - e^{ika}\big)
= \frac{2b}{k}\, \sin(ka). \tag{3.36}
\]
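As a quick numerical cross-check of Eq. (3.36), here is a minimal Julia sketch (the values of
a, b and the grid size are arbitrary illustrative choices added here):

# Direct numerical check of Eq. (3.36): the FT of the square pulse is
# (2b/k) sin(ka). The parameters and the grid are illustrative choices.
a, b = 1.0, 2.0
f(x) = abs(x) < a ? b : 0.0

# f̂(k) = ∫ f(x) e^{-ikx} dx, approximated by a Riemann sum on [-a, a]
function ft(k; n = 200_000)
    xs = range(-a, a; length = n)
    sum(f.(xs) .* exp.(-im * k .* xs)) * step(xs)
end

for k in (0.5, 1.0, 3.0)
    println("k = ", k, ":  numeric ≈ ", real(ft(k)),
            ",  exact = ", 2b * sin(k * a) / k)
end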
Example 3.4.4. Show that the Fourier transform of a sinc function is a square pulse:
\[
g(x) = \frac{\sin(ax)}{ax}
\qquad\Rightarrow\qquad
\hat{g}(k) = \begin{cases} \pi/a, & |k| < a, \\ 0, & |k| > a. \end{cases}
\]
There are a number of different solutions to this problem and it is instructive to look at
each one.

Solution.
\[
\hat{g}(k) = \int_{-\infty}^{\infty} dx\, \frac{\sin(ax)}{ax}\, e^{-ikx}
= \int_{-\infty}^{\infty} dx\, \frac{e^{iax} - e^{-iax}}{2iax}\, e^{-ikx}
= \underbrace{\int_{-\infty}^{\infty} dx\, \frac{e^{-i(k - a)x}}{2iax}}_{I} - \underbrace{\int_{-\infty}^{\infty} dx\, \frac{e^{-i(k + a)x}}{2iax}}_{II}
= \begin{cases} \pi/a, & |k| < a, \\ 0, & |k| > a, \end{cases}
\]
where the integrals I and II are computed by transforming them into contour integrals
analogous to Eq. (2.28), with contours (distinct for the two contributions) shown in Fig. 2.16.
For both integrals, there is a simple pole at z = 0 with residue 1/(2ia). The contribution
from the pole is ±(1/2)(2πi)/(2ia), where the (1/2) arises from the fact that the pole lies on
the contour of integration and the +/− is determined by the orientation of the contour
(contours closed in the upper-half plane keep the positive orientation (+), whereas contours
closed in the lower-half plane must be reversed (−)).
When k < a, the contour for I must be closed in the upper half plane, whose counter-clockwise
traversal coincides with the orientation of I (i.e. I = +π/(2a)). When k > a, the contour for
I must be closed in the lower half plane and the traversal must be reversed to coincide with
the orientation of I (i.e. I = −π/(2a)).
Similarly, when k < −a, the contour for II must be closed in the upper half plane, whose
counter-clockwise traversal coincides with the orientation of II (i.e. II = +π/(2a)). When
k > −a, the contour for II must be closed in the lower half plane and the traversal must be
reversed to coincide with the orientation of II (i.e. II = −π/(2a)).
Solution. This solution uses the technique of ‘differentiating under the integral’, more
formally known as Leibniz’s integration rule. The integral is tricky because of the x in the
denominator. If we consider the integrand as a function of two variables, x and a, and
differentiate with respect to a, the x will disappear. We can then integrate with respect to
x without concern. Taking the anti-derivative of this result with respect to a to ‘undo’ the
earlier differentiation will yield the final answer. Define a function I(a) such that
\[
I(a) := \hat{g}(k) = \int_{-\infty}^{\infty} dx\, \frac{\sin(ax)}{ax}\, e^{-ikx}.
\]
Then
\[
\frac{d}{da}\big(a\, I(a)\big) = \frac{d}{da} \int_{-\infty}^{\infty} dx\, \frac{\sin(ax)}{x}\, e^{-ikx}
= \int_{-\infty}^{\infty} dx\, \frac{\partial}{\partial a}\left(\frac{\sin(ax)}{x}\right) e^{-ikx}
= \int_{-\infty}^{\infty} dx\, \cos(ax)\, e^{-ikx} = \pi\delta(a - k) + \pi\delta(a + k).
\]
So we know that (a I(a))′ = πδ(a − k) + πδ(a + k). Taking the anti-derivative with respect to
a to find a I(a) will give a square pulse function plus a constant of integration. The constant
of integration is determined to be zero by observing that lim_{a→0} a I(a) = 0. Therefore
\[
I(a) = \frac{1}{a} \int_0^a \big(\pi\delta(\tilde{a} - k) + \pi\delta(\tilde{a} + k)\big)\, d\tilde{a}
= \begin{cases} \pi/a, & |k| < a, \\ 0, & |k| > a. \end{cases}
\]
Example 3.4.5. Find the Fourier transform of a Gaussian function,
\[
f(x) = a \exp(-bx^2), \qquad a, b > 0.
\]
Solution.
\[
\hat{f}(k) = \int_{\mathbb{R}} dx\, f(x)\, e^{-ikx} = a \int_{-\infty}^{\infty} dx\, e^{-bx^2} e^{-ikx}
= a \exp\left(-\frac{k^2}{4b}\right) \int_{-\infty}^{\infty} dx\, \exp\left(-b\left(x + \frac{ik}{2b}\right)^2\right)
\]
\[
= \frac{a}{\sqrt{b}} \exp\left(-\frac{k^2}{4b}\right) \int_{-\infty}^{\infty} dx'\, e^{-x'^2}
= a \exp\left(-\frac{k^2}{4b}\right) \sqrt{\frac{\pi}{b}}.
\]
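The Gaussian pair can be sanity-checked numerically as well; a short Julia sketch follows
(the parameter values, cutoff L and grid size are illustrative assumptions, with a Riemann
sum standing in for the exact integral):

# Numerical check of Example 3.4.5: for f(x) = a e^{-bx²} one has
# f̂(k) = a sqrt(π/b) e^{-k²/(4b)}. All parameter values are illustrative.
a, b = 1.3, 0.7
fhat_exact(k) = a * sqrt(π / b) * exp(-k^2 / (4b))

function fhat_numeric(k; L = 20.0, n = 400_000)
    xs = range(-L, L; length = n)
    sum(a .* exp.(-b .* xs.^2 .- im * k .* xs)) * step(xs)
end

for k in (0.0, 1.0, 2.5)
    println("k = ", k, ":  numeric ≈ ", real(fhat_numeric(k)),
            ",  exact = ", fhat_exact(k))
end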

Example 3.4.6. Let a > 0. Show that
\[
\text{(a)}\;\; f(x) = \frac{1}{x^2 + a^2} \;\Rightarrow\; \hat{f}(k) = \frac{\pi}{a}\, e^{-a|k|};
\qquad
\text{(b)}\;\; g(x) := e^{-a|x|} \;\Rightarrow\; \hat{g}(k) = \frac{2a}{k^2 + a^2}.
\]
Solution.

(a) If k < 0, the integral fˆ(k) can be computed with a complex-valued contour integral.
Consider the semi-circular contour closed in the upper-half plane. The contour encloses a
simple pole at z = ia with residue e^{ka}/(2ia):
\[
\hat{f}(k) = \oint_C \frac{1}{(z + ia)(z - ia)}\, e^{-ikz}\, dz = 2\pi i\, \frac{e^{ka}}{2ia} = \frac{\pi}{a}\, e^{ak}.
\]
Similarly, if k > 0, the integral fˆ(k) can be computed with the semi-circular contour closed
in the lower-half plane. This contour encloses a simple pole at z = −ia with residue
−e^{−ka}/(2ia), and we must multiply by −1 to account for the reversed (clockwise)
orientation of the contour:
\[
\hat{f}(k) = \oint_C \frac{1}{(z + ia)(z - ia)}\, e^{-ikz}\, dz = (-1)\, 2\pi i \left(-\frac{e^{-ka}}{2ia}\right) = \frac{\pi}{a}\, e^{-ka}.
\]
Combining these results gives fˆ(k) = (π/a) e^{−a|k|}.

(b) Split the integral into x < 0 and x > 0:
\[
\hat{g}(k) = \int_{-\infty}^{\infty} dx\, e^{-a|x|} e^{-ikx}
= \int_{-\infty}^{0} dx\, e^{ax} e^{-ikx} + \int_{0}^{\infty} dx\, e^{-ax} e^{-ikx}
= -\frac{1}{ik - a} + \frac{1}{ik + a} = \frac{2a}{k^2 + a^2}.
\]

Exercise 3.6. Find the Fourier transform of

(a) f(x) = 1/(x⁴ + a⁴),

(b) f(x) = sech(ax).

3.4.2 More advanced examples of closed form representations

We can find closed form representations of other functions by combining the examples above
with the properties in Section 3.2.
Example 3.4.7. Let f(x) = e^{−x²} and define g(x) = (f ∗ f ∗ f ∗ f)(x). Find (a) ĝ(k) and
(b) g(x).

Solution. From Example 3.4.5, we know that fˆ(k) = √π e^{−k²/4}. Therefore,
\[
\hat{g}(k) = \widehat{(f * f * f * f)}(k) = \hat{f}(k)\cdot\hat{f}(k)\cdot\hat{f}(k)\cdot\hat{f}(k) = \big(\hat{f}(k)\big)^4
= \left(\sqrt{\pi}\, e^{-k^2/4}\right)^4 = \pi^2\, e^{-k^2}.
\]
We recognize ĝ(k) as a Gaussian function (see Example 3.4.5), and we know that if ĝ(k) is a
Gaussian of the form a√(π/b) e^{−k²/(4b)}, then g(x) is the Gaussian a e^{−bx²}. A little
algebra shows we can re-write ĝ(k) in this form by setting a = π^{3/2}/2 and b = 1/4:
\[
g(x) = \frac{1}{2}\, \pi^{3/2}\, e^{-x^2/4}.
\]

Example 3.4.8. Let g(x) = max(0, 1 − |x|) (sometimes called a ‘tent’ function). Compute
ĝ(k).

Solution. Let f(x) = 1 if |x| < 1/2 and 0 otherwise. Note that (f ∗ f)(x) = g(x). From
Example 3.4.3, we know that the Fourier transform of a square pulse is a sinc function.
Therefore,
\[
\hat{g}(k) = \hat{f}(k)\, \hat{f}(k) = \left(\frac{2\sin(k/2)}{k}\right)^2 = \operatorname{sinc}^2(k/2).
\]
Example 3.4.9. Let f(t) be given by
\[
f(t) = \begin{cases} \cos(\omega_0 t), & |t| < A, \\ 0, & \text{otherwise}, \end{cases}
\]
where ω₀, A ∈ R with A > 0. (a) Compute fˆ(k), the Fourier transform of f, as a function of
ω₀ and A. (b) Identify the relationship between the continuity of f and ω₀ and A, and
discuss how this affects the decay of the Fourier transform as |k| → ∞.

Solution.

(a) By the convolution theorem (in its multiplication form: a product in t corresponds to a
convolution in k divided by 2π),
\[
f(t) = \cos(\omega_0 t)\, \mathrm{rect}_A(t) \;\;\Rightarrow\;\; \hat{f}(k) = \frac{1}{2\pi}\left(\widehat{\cos_{\omega_0}} * \widehat{\mathrm{rect}_A}\right)(k),
\]
where
\[
\widehat{\cos_{\omega_0}}(k) = \pi\delta(k - \omega_0) + \pi\delta(k + \omega_0)
\qquad\text{and}\qquad
\widehat{\mathrm{rect}_A}(k) = \frac{2\sin(Ak)}{k}.
\]
Therefore,
\[
\hat{f}(k) = \frac{\sin(A(k - \omega_0))}{k - \omega_0} + \frac{\sin(A(k + \omega_0))}{k + \omega_0}.
\]
(b) Some basic algebra shows that we can write fˆ(k) as follows:
\[
\hat{f}(k) = \frac{2k \sin(Ak)\cos(A\omega_0) - 2\omega_0 \cos(Ak)\sin(A\omega_0)}{k^2 - \omega_0^2}.
\]
In general, fˆ(k) ∼ 1/k as |k| → ∞, unless ω₀ and A satisfy cos(Aω₀) = 0 (i.e.
Aω₀ = nπ + π/2, which is precisely the condition for f to be continuous at t = ±A), in which
case fˆ(k) ∼ 1/k².

For reference, we collect the Fourier transform pairs used in this chapter in the following table.

Definition
\[
\hat{f}(k) = \int_{-\infty}^{+\infty} f(x)\, e^{-ikx}\, dx
\qquad\Leftrightarrow\qquad
f(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \hat{f}(k)\, e^{ikx}\, dk
\]

Fourier Transform Pairs

f(x) = δ(x)                              fˆ(k) = 1
f(x) = (2a)⁻¹ H(1 − |x/a|)               fˆ(k) = sinc(ak)
f(x) = (2a)⁻¹ exp(−|x/a|)                fˆ(k) = (1 + (ak)²)⁻¹
f(x) = (√(2π) a)⁻¹ exp(−(x/a)²/2)        fˆ(k) = exp(−(ak)²/2)
f(x) = (√(2π) a)⁻¹ sech(πx/2a)           fˆ(k) = sech(πak/2)
f(x) = (πa)⁻¹ (1 + (x/a)²)⁻¹             fˆ(k) = exp(−|ak|)
f(x) = (πa)⁻¹ sinc(x/a)                  fˆ(k) = H(1 − |ak|)
f(x) = 1                                 fˆ(k) = 2π δ(k)
Exercise 3.7. Let a ∈ C with Re(a) > 0 and define f_a(x) := 2a/(a² + (2πx)²). If also b ∈ C
with Re(b) > 0, show that (f_a ∗ f_b)(x) = f_{a+b}(x).

Exercise 3.8. Show that

(a) g(x) = exp(iax) f(bx)  ⇒  ĝ(k) = (1/|b|) fˆ((k − a)/b);

(b) f(x) = sin²(x)/x  ⇒  fˆ(k) = −(iπ/2)( Π(k − 1) − Π(k + 1) ),

where
\[
\Pi(k) = \begin{cases} 1, & |k| \le 1, \\ 0, & |k| > 1. \end{cases}
\]

3.4.3 Closed form representations in higher dimensions

Example 3.4.10. Let x = (x₁, x₂, . . . , x_d) ∈ R^d, and use the notation |x| to represent
√(x₁² + x₂² + · · · + x_d²). Find the Fourier transform of g(x) = exp(−|x|²).

Solution.
\[
\hat{g}(k) = \int_{\mathbb{R}^d} g(x)\, e^{-i k^{T} x}\, dx = \int_{\mathbb{R}^d} e^{-|x|^2}\, e^{-i k^{T} x}\, dx
= \int_{\mathbb{R}^d} e^{-x_1^2 - x_2^2 - \cdots - x_d^2}\, e^{-i(k_1 x_1 + k_2 x_2 + \cdots + k_d x_d)}\, dx
\]
\[
= \int_{\mathbb{R}} e^{-x_1^2} e^{-ik_1 x_1}\, dx_1 \int_{\mathbb{R}} e^{-x_2^2} e^{-ik_2 x_2}\, dx_2 \cdots \int_{\mathbb{R}} e^{-x_d^2} e^{-ik_d x_d}\, dx_d
\]
\[
= \sqrt{\pi}\, e^{-k_1^2/4}\, \sqrt{\pi}\, e^{-k_2^2/4} \cdots \sqrt{\pi}\, e^{-k_d^2/4} = \pi^{d/2} \exp\left(-|k|^2/4\right).
\]

3.5 Fourier Series: Introduction


The Fourier Series is a version of the Fourier Integral which is used when the function is
periodic or of finite support (nonzero within a finite interval). As in the case of the Fourier
Integral/Transform, we will mainly focus on the one-dimensional case. Generalization of the
Fourier Series approach to the multi-dimensional case is straightforward.
Consider a periodic function with period L. We can represent it in the form of a series
over the following standard set of periodic exponentials (harmonics), exp(2πinx/L):
\[
f(x) = \sum_{n=-\infty}^{\infty} f_n \exp\left(2\pi i n x / L\right). \tag{3.37}
\]
This so-called Fourier series representation of a periodic function immediately shows that the
Fourier Series is a particular case of the Fourier integral:
\[
f(x) = \sum_{n=-\infty}^{\infty} f_n \exp\left(2\pi i n x / L\right) \int_{-\infty}^{\infty} dk\, \delta(k - n)
= \int_{-\infty}^{\infty} dk\, \exp\left(2\pi i k x / L\right) \sum_{n=-\infty}^{\infty} f_n\, \delta(k - n). \tag{3.38}
\]
As in the case of the Fourier transform and inverse Fourier transform, we would like to invert
Eq. (3.37) and express f_n via f(x). By analogy with the Fourier transform, consider
integrating the left hand side of Eq. (3.37) against the oscillating factor exp(−2πikx/L)/L,
where k is an integer, over x ∈ [0, L]. Applying this integration to the right hand side of
Eq. (3.37), we run into the following easy-to-evaluate integral (for each term in the resulting
sum):
\[
\int_0^L \frac{dx}{L}\, e^{ikx 2\pi/L}\, e^{-inx 2\pi/L} = \frac{e^{i(k - n)2\pi} - 1}{i(k - n)2\pi} = \delta_{k,n} := \begin{cases} 1, & k = n, \\ 0, & k \ne n, \end{cases} \tag{3.39}
\]
where the expression in the middle is resolved via L’Hôpital’s rule and δ_{k,n} is the common
notation for the so-called Kronecker delta. We observe that only one term in the sum is
nonzero, therefore arriving at the desired formula
\[
f_n = \int_0^L \frac{dx}{L}\, f(x) \exp\left(-2\pi i \frac{n x}{L}\right). \tag{3.40}
\]

For reference, the definition and properties of the Fourier transform discussed in Section 3.2
are summarized in the following table.

Definition
\[
\hat{f}(k) = \int_{-\infty}^{+\infty} f(x)\, e^{-ikx}\, dx
\qquad\Leftrightarrow\qquad
f(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \hat{f}(k)\, e^{ikx}\, dk
\]

Property                 Original Function          Fourier Transform

Transformations
Linearity                a f(x) + b g(x)            a fˆ(k) + b ĝ(k)
Space-Shifting           f(x − x₀)                  e^{−ikx₀} fˆ(k)
k-Space-Shifting         e^{ik₀x} f(x)              fˆ(k − k₀)
Space Reversal           f(−x)                      fˆ(−k)
Space Scaling            f(ax)                      |a|⁻¹ fˆ(k/a)

Calculus
Convolution              g(x) ∗ h(x)                ĝ(k) ĥ(k)
Multiplication           g(x) h(x)                  (1/2π) ĝ(k) ∗ ĥ(k)
Differentiation          (d/dx) f(x)                ik fˆ(k)
Integration              ∫₀ˣ f(ξ) dξ                (1/ik) fˆ(k)

Real-valued vs Complex-valued functions
Conjugation              f∗(x)                      fˆ∗(−k)
Real and Even            f(x) real and even         fˆ(k) real and even
Real and Odd             f(x) real and odd          fˆ(k) imaginary and odd

Parseval’s Theorem / Unitarity
\[
\int_{-\infty}^{+\infty} |f(x)|^2\, dx = \frac{1}{2\pi} \int_{-\infty}^{+\infty} |\hat{f}(k)|^2\, dk
\]

Notice that one may also consider the Fourier Transform/Integral as a limit of the Fourier
Series. Indeed, in the case when the typical scale of variation of f(x) is much smaller than L,
many harmonics are significant, and the Fourier series transforms into the Fourier integral:
\[
\sum_{n=-\infty}^{\infty} \cdots \;\longrightarrow\; L \int_{-\infty}^{\infty} dk\, \cdots\,. \tag{3.41}
\]

Let us illustrate the expansion of a function into a Fourier series on the example of
f(x) = exp(αx), considered on the interval 0 < x < 2π. In this case the Fourier coefficients are
\[
f_n = \int_0^{2\pi} \frac{dx}{2\pi} \exp(-inx + \alpha x) = \frac{1}{2\pi}\, \frac{1}{\alpha - in}\, \left(e^{2\pi\alpha} - 1\right). \tag{3.42}
\]
Notice that as n → ∞, fn ∼ 1/n. As discussed in more detail in the following section, the
slow decay of the Fourier coefficients is associated with the fact that f(x), when considered
as a periodic function over the reals with period 2π, has discontinuities (jumps) at
0, ±2π, ±4π, · · · .
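The coefficients (3.42) are easy to play with numerically; here is a minimal Julia sketch (the
value of α, the evaluation point and the truncation levels are illustrative choices added here):

# Partial sums of the Fourier series of f(x) = e^{αx} on (0, 2π), using
# the exact coefficients of Eq. (3.42). α and the point x are illustrative.
α = 0.5
fn(n) = (exp(2π * α) - 1) / (2π * (α - im * n))
SN(x, N) = real(sum(fn(n) * exp(im * n * x) for n in -N:N))

x = 2.0                  # interior point, away from the jumps at 0 and 2π
for N in (10, 100, 1000)
    println("N = ", N, ":  S_N(x) ≈ ", SN(x, N), ",  exact = ", exp(α * x))
end
# Convergence is slow because f_n ~ 1/n: the 2π-periodic extension of
# e^{αx} has jumps at integer multiples of 2π.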

Exercise 3.9. Let f (x) = x and g(x) = |x| be defined on the interval −π < x < π. (a)
Expand both functions as a Fourier series. (b) Compare the dependence of the n-th Fourier
coefficient on n for the two functions.

We conclude this Section by reminding the reader that our construction of the Fourier Series
assumed that the set of harmonic functions forms a complete basis. Proving this assumption
is outside the scope of this course. This proof (and many other proofs) will be discussed in
detail in the companion course, Math 584.

3.6 Properties of the Fourier Series


Properties of the Fourier series follow from the related properties of Fourier transforms
discussed in Section 3.2. We list the properties in the table below. The presentation is
formal, without proofs, and is made mainly for reference convenience.
Definition

Let f(x) be periodic with period L.
\[
f(x) = \sum_{n=-\infty}^{+\infty} f_n \exp\left(2\pi i \frac{n x}{L}\right)
\qquad\Leftrightarrow\qquad
f_n = \int_0^L \frac{dx}{L}\, f(x) \exp\left(-2\pi i \frac{n x}{L}\right)
\]

Property                 Periodic Function                 Fourier Series Coefficients

Scaling
Linearity                a f(x) + b g(x)                   a fn + b gn
Space-Shifting           f(x − x₀)                         e^{−2πinx₀/L} fn
k-Space-Shifting         e^{2πikx/L} f(x)                  f_{n−k}
Space Reversal           f(−x)                             f_{−n}
Space Scaling            f(ax) (with period L/a)           fn

Calculus
Periodic Convolution     ∫₀ᴸ f(ξ) g(x − ξ) dξ              L fn gn
Multiplication           f(x) g(x)                         Σ_{m=−∞}^{∞} f_m g_{n−m}
Differentiation          (d/dx) f(x)                       (2πin/L) fn
Integration              ∫₀ˣ f(ξ) dξ                       (L/(2πin)) fn

Real-valued vs Complex-valued functions
Conjugation              f∗(x)                             f∗_{−n}
Real and Even            f(x) real and even                fn real; fn = f_{−n}
Real and Odd             f(x) real and odd                 fn imaginary; fn = −f_{−n}

Parseval’s Theorem / Unitarity
\[
\frac{1}{L} \int_0^L dx\, |f(x)|^2 = \sum_{n=-\infty}^{\infty} |f_n|^2
\]

3.7 Riemann-Lebesgue Lemma


In general, the Fourier series of a function is an infinite series (that is, it contains an infinite
number of terms). This means that it is computationally prohibitive to represent an
arbitrary function exactly, but a function may be approximated by truncating its Fourier
series. The Riemann-Lebesgue Lemma helps to justify the truncation. The Lemma states
that for any integrable function f, the Fourier coefficients fn must decay as n → ∞.

Theorem 3.7.1 (Riemann-Lebesgue Lemma). If f(x) ∈ L¹, i.e. if the Lebesgue integral of
|f| is finite, then lim_{n→∞} fn = 0.

We will not prove the Riemann-Lebesgue lemma here, but notice that a standard proof is
based on (a) showing that the lemma works for the case of the characteristic function of a
finite open interval in R¹, where f(x) is constant within ]a, b[ and zero otherwise, (b)
extending it to simple functions over R¹, that is, functions which are piece-wise constant,
and then (c) building a sequence of simple functions (which are dense in L¹) approximating
f(x) more and more accurately.
Let us mention the following useful corollary of the Riemann-Lebesgue Lemma: for any
periodic function f(x) with continuous derivatives up to order m, integration by parts can be
performed the respective number of times to show that the n-th Fourier coefficient is
bounded at sufficiently large n according to |fn| ≤ C/|n|^{m+2}, where C = O(1).
In particular, and consistently with the example above, we observe that in the case of a
“jump”, corresponding to a continuous anti-derivative, i.e. m = −1, |fn| is O(1/n)
asymptotically as n → ∞. In the case of a “ramp”, i.e. m = 0 with a continuous function
but discontinuous derivative, |fn| becomes O(1/n²) as n → ∞. For an analytic function,
with all derivatives continuous, |fn| decays faster than polynomially as n increases.
Further details of the Lemma, as well as the general discussion of how the material of
this Section is related to material discussed in the theory course (Math 584) and also the
algorithm course (Math 589), will be given at an inter-core recitation session.

3.8 Gibbs Phenomenon


One also needs to be careful with Fourier Series truncation because of the so-called Gibbs
phenomenon, named after J. Willard Gibbs, who described it in 1899. (Apparently, the
phenomenon was discovered earlier, in 1848, by Henry Wilbraham.) The phenomenon
represents an unusual behavior of a truncated Fourier Series built to represent a piece-wise
continuous periodic function. The Gibbs phenomenon involves both the fact that Fourier
sums overshoot at a jump discontinuity, and that this overshoot does not die out as more
terms are added to the sum.
Consider the following classic example of a square wave:
\[
f(x) = \begin{cases} \;\;\, \pi/4, & 2n\pi < x < (2n + 1)\pi, \quad n \in \mathbb{Z}, \\ -\pi/4, & (2n + 1)\pi < x < (2n + 2)\pi, \quad n \in \mathbb{Z}, \end{cases} \tag{3.43}
\]
\[
f(x) = \sum_{n=0}^{\infty} \frac{\sin((2n + 1)x)}{2n + 1}, \tag{3.44}
\]
where the definition of the function is in the first line and the second line gives the expression
for the function in terms of its Fourier series. Notice that the 2π-periodic function jumps at
2nπ by π/2.
Let us truncate the series in Eq. (3.44) and thus consider the N-th partial Fourier Series
\[
S_N(x) = \sum_{n=0}^{N} \frac{\sin((2n + 1)x)}{2n + 1}. \tag{3.45}
\]
The Gibbs phenomenon consists of the following observation: as N → ∞, the error of the
approximation around the jump points is reduced in width and in energy (integral), but its
height converges to a fixed value. See the movie-style visualization (from Wikipedia) of how
S_N(x) evolves with N. (It is also reproduced in a julia-snippet available at the class D2L
repository.)
Let us now back up this simulation by an analytic estimation and compute the limiting value
of the partial Fourier Series at the point of the jump. Notice that
\[
\frac{d}{d\epsilon} S_N(\epsilon) = \sum_{n=0}^{N} \cos((2n + 1)\epsilon) = \frac{\sin(2(N + 1)\epsilon)}{2\sin\epsilon}, \tag{3.46}
\]
where we have utilized the formula for the sum of a geometric progression. Observe that
(d/dε) S_N(ε) → N + 1 as ε → 0; that is, the derivative is large (when N is large) and
positive. Therefore, S_N(ε) grows with ε to reach its first maximum (the one closest to ε = 0)
at ε∗ = π/(2(N + 1)). Now we estimate the value of S_N(ε∗):
\[
S_N(\epsilon_*) = \sum_{n=0}^{N} \frac{1}{2n + 1} \sin\left(\frac{(2n + 1)\pi}{2(N + 1)}\right)
= \sum_{n=0}^{N} \frac{1}{2n} \sin\left(\frac{n\pi}{N}\right) + O(1/N)
\;\underset{N\to\infty}{\longrightarrow}\; \frac{1}{2} \int_0^{\pi} dt\, \frac{\sin t}{t} \approx \frac{\pi}{4} + 0.14, \tag{3.47}
\]
thus observing that at the point of the maximum closest to zero the partial sum
systematically overshoots f(0⁺) = π/4 by an O(1) amount.
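The estimate (3.47) is easy to reproduce numerically; here is a short Julia sketch (the list of
truncation levels N is an arbitrary illustrative choice):

# Gibbs overshoot for the square wave of Eq. (3.43): S_N has its first
# maximum near ε* = π/(2(N+1)), and S_N(ε*) tends to π/4 + 0.14 rather
# than to f(0⁺) = π/4.
SN(x, N) = sum(sin((2n + 1) * x) / (2n + 1) for n in 0:N)

for N in (10, 100, 1000, 10_000)
    xstar = π / (2 * (N + 1))
    println("N = ", N, ":  S_N(ε*) ≈ ", SN(xstar, N),
            ",  overshoot ≈ ", SN(xstar, N) - π / 4)
end
# the overshoot saturates near 0.14 instead of vanishing as N grows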
Exercise 3.10. Generalize the two functions from Exercise 3.9 beyond the [−π, π) interval,
so they are 2π-periodic functions on [−5π, 5π). Compute the respective partial Fourier series
SN (x) for select N , and study numerically (or theoretically!) how the amplitude and the
width of the oscillations near the points x = mπ, m ∈ {−5, −4, . . . , 4} behave as N → ∞.
We complete our discussion of the Fourier Series by mentioning its arguably most signif-
icant application in the field of differential equations. Most differential equations of interest
can only be solved numerically. Mathematicians often find approximate solutions to a dif-
ferential equation by representing the solution as a Fourier series and then truncating the
series to a finite sum according to the desired accuracy. This method, called the spectral
method, will be discussed in the algorithm core course (Math 589).

3.9 Laplace Transform


Recall that to evaluate the Fourier transforms fˆ(k) of functions f(x) which do not decay as
x → ±∞, like exp(ix) or exp(−x), we needed to consider k with a nonzero imaginary part to
make the resulting Fourier integrals finite.
Let us discuss the two cases, of the oscillatory, exp(ix), and the exponential, exp(−x),
asymptotic behaviors separately. The Fourier Series, just discussed, may be considered as a
way to avoid the integrability difficulty in the case of a periodic function. Indeed, recall that
the coefficients of the Fourier Series, described by Eq. (3.40), required evaluation of the
integral only over a domain of finite support, like [0, L].
The Laplace transform, which we are about to discuss, suggests an elegant way around for
another important class of functions, those which decay as x → +∞, like exp(−x). Instead
of working with functions defined over all real x, as in the case of the Fourier Transform, or
with periodic functions (or functions defined over a finite support interval) as in the case of
the Fourier Series, we turn to a discussion of the so-called Laplace Transform (LT), operating
on functions defined over the semi-infinite interval x ∈ R≥0 = [0, +∞).
Equivalently, we may also introduce the Laplace Transform as a Fourier transform applied to
functions which are nonzero only at x ≥ 0:
\[
\tilde{f}(k) = \int_0^{\infty} dx\, \exp(-kx)\, f(x). \tag{3.48}
\]
We consider complex k and require that the integral on the right hand side of Eq. (3.48)
converges (is finite) at sufficiently large Re(k). In other words, f̃(k) is analytic at Re(k) > C,
where C is a positive constant.
The Inverse Laplace Transform (ILT) is defined as the complex integral
\[
f(x) = \frac{1}{2\pi i} \int_C dk\, \exp(kx)\, \tilde{f}(k) \tag{3.49}
\]
over the so-called Bromwich contour, C, shown in Fig. 3.1. C can be deformed arbitrarily
within the domain of analyticity of f̃(k), Re(k) > 0. Note that, by construction and
consistently with the requirement imposed on f(x), the integral on the right hand side of
Eq. (3.49) is equal to zero at x < 0. Indeed, given that f̃(k) is analytic at Re(k) > 0 and
approaches zero as k → ∞, the contour C can be collapsed to surround ∞, which is also a
non-singular point of the integrand, thus resulting in zero for the integral.
Properties of the Laplace Transform, following related properties of the Fourier Trans-
form discussed in Section 3.2 are listed formally in the Table below.
[Figure 3.1: The Bromwich integration contour, C, in Eq. (3.49) is shown in red. C is often
drawn as a straight line in the complex plane from ε − i∞ to ε + i∞, where ε is an
infinitesimally small positive number. Possible singularities of the LT f̃(k) may only be at
points with negative real part, shown schematically as blue dots.]

Property              Function                                          Laplace Transform

Scaling
Linearity             a f(x) + b g(x)                                   a f̃(k) + b g̃(k)
Space-Shifting        f(x − x₀) θ(x − x₀), x₀ ∈ R>0                     e^{−kx₀} f̃(k)
k-Space-Shifting      h(x) = exp(k₀x) f(x)                              h̃(k) = f̃(k − k₀)
Space Scaling         h(x) = f(ax), a ∈ R>0                             h̃(k) = (1/a) f̃(k/a)
Differentiation       h(x) = f′(x)                                      h̃(k) = k f̃(k) − f(0⁻)
Integration           h(x) = ∫₀ˣ f(y) dy                                h̃(k) = (1/k) f̃(k)
Convolution           h(x) = (f ∗ g)(x) = ∫₀ˣ f(y) g(x − y) dy          h̃(k) = f̃(k) g̃(k)
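The differentiation row, for instance, follows from a single integration by parts (a short
derivation added here; we assume e^{−kx}f(x) → 0 as x → ∞, and the table's f(0⁻) stands
for f(0), the 0⁻ convention accommodating possible distributional contributions at the
origin):
\[
\tilde{h}(k) = \int_0^{\infty} dx\, e^{-kx} f'(x)
= \Big[ e^{-kx} f(x) \Big]_0^{\infty} + k \int_0^{\infty} dx\, e^{-kx} f(x)
= k\, \tilde{f}(k) - f(0^{-}).
\]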
It is instructive to illustrate the similarities and differences between the LT and the FT with
basic examples.
Consider, first, the one-sided exponential:

\[
f(x) = \theta(x) \exp(-\alpha x), \qquad \alpha > 0, \tag{3.50}
\]
\[
\hat{f}(k) = \frac{\alpha - ik}{\alpha^2 + k^2}, \tag{3.51}
\]
\[
\tilde{f}(k) = \frac{1}{k + \alpha}, \qquad \mathrm{Re}(k + \alpha) > 0. \tag{3.52}
\]
Considered in the limit α → 0⁺, Eqs. (3.50,3.51,3.52) turn into the following set of relations
for the step function θ(x):
\[
f(x) = \theta(x), \tag{3.53}
\]
\[
\hat{f}(k) = \pi\delta(k) - \frac{i}{k}, \tag{3.54}
\]
\[
\tilde{f}(k) = \frac{1}{k}. \tag{3.55}
\]
Shifting and re-scaling the theta-function, we transform Eqs. (3.53,3.54,3.55) into the
following expressions for the sign function:
\[
f(x) = 2\theta(x) - 1 = \mathrm{sign}(x), \tag{3.56}
\]
\[
\hat{f}(k) = -\frac{2i}{k}, \tag{3.57}
\]
\[
\tilde{f}(k) = \frac{1}{k}. \tag{3.58}
\]
Exercise 3.11. Find the Laplace Transform of (a) f(x) = exp(−λx) where Re(λ) > 0,
(b) f(x) = xⁿ where n ∈ Z≥0, (c) f(x) = cos(νx) where ν ∈ R, (d) f(x) = cosh(λx) where
Re(λ) > 0, (e) f(x) = 1/√x. Show details.

Exercise 3.12. Find the Inverse Laplace Transform of 1/(k 2 + a2 ). Show details.

3.9.1 Integral Representations and Asymptotics of Special Functions

The Laplace transform is often used to describe (and to manipulate) integral representations
of special functions. Consider f(x) = x^ν. Up to an elementary factor, its Laplace transform
is related to the so-called Gamma function, Γ (of the parameter ν):
\[
\tilde{f}(k) = \int_0^{\infty} dx\, x^{\nu} e^{-kx} = k^{-\nu - 1} \int_0^{\infty} dy\, y^{\nu} e^{-y} = \frac{\Gamma(\nu + 1)}{k^{\nu + 1}},
\]
where the integrals are well defined at k > 0. Then the inverse Laplace transform returns a
“definition” of the Γ function in terms of a contour integral,
\[
\frac{x^{\nu}}{\Gamma(\nu + 1)} = \frac{1}{2\pi i} \int_{\varepsilon - i\infty}^{\varepsilon + i\infty} dk\, \frac{e^{kx}}{k^{\nu + 1}}, \tag{3.59}
\]
where ε > 0 guarantees that we stay to the right of the singularity at k = 0, which is a pole
if ν is a non-negative integer and a branch point for non-integer ν.
In the general case of a branch point at k = 0, the second branch point of the integrand in
Eq. (3.59) is at k = ∞, and we ought to introduce a branch cut connecting the branch points
to guarantee analyticity of the integrand. We choose the branch cut along the negative real
axis.

[Figure 3.2: Contour transformation for the integral representation of the Γ function via the
Inverse Laplace Transform. The integrand of Eq. (3.59) is analytic in the domain surrounded
by the (blue) contour, therefore returning zero for the integral.]

R kx
the negative real axis. Then the integral, dk keν+1 , over C, shown (blue) in Fig. (3.2),
C
contains no singularities in the interior and it is therefore equal to zero according to the
+ −
Cauchy theorem. (Here, C = C1 + CR + C+ + Cr + C− + CR , R → ∞, r → 0 and
C1 =]ε − i∞, ε + i∞[.) We can show (using the Jordan’s lemma) that the integral along
±
CR and Cr tend to zero at R → ∞, r → 0. Therefore, the Bromwich contour in Eq. (3.59)
can be replaced by the integral over the −C+ − C− contour, i.e. the contour consisting of
first going along the negative part of the real axis from −∞ to 0 a tiny bit under the cut
and then returning back from 0 to −∞ above the cut. We arrive at
Z
1 1 ek
= dk ν+1 . (3.60)
Γ(ν + 1) 2πi k
−C− −C+

Eq. (3.60) just derived is quite useful for evaluating the asymptotics of a function f(x) at
x → ∞ when f̃(k) requires the introduction of a branch cut (to the left of the Bromwich
contour). Indeed, consider the Inverse Laplace Transform formula (3.49). The idea is that at
x → +∞ the integral is dominated by the singularity furthest to the right in the complex k
plane. (The idea obviously applies not only to cuts but also to poles of the integrand in
Eq. (3.49).) Then, we can expand around the right-most singularity of f̃(k), i.e.
\[
\tilde{f}(k) \approx \sum_{\nu} c_{\nu} (k - k_0)^{\lambda_{\nu}}, \tag{3.61}
\]
where k₀ is the position of the right-most singularity of f̃(k) and the λᵥ may be non-integer.
A loop integral around the branch point at k₀ results in an asymptotic series that can be
obtained by integrating (3.61) term by term:
\[
f(x) = \frac{1}{2\pi i} \int_{C_{k_0}} dk\, \tilde{f}(k)\, e^{kx}
\approx \frac{1}{2\pi i} \int_{C_{k_0}} dk \left( \sum_{\nu} c_{\nu} (k - k_0)^{\lambda_{\nu}} \right) e^{kx}
= e^{k_0 x} \sum_{\nu} \frac{c_{\nu}}{\Gamma(-\lambda_{\nu})\, x^{\lambda_{\nu} + 1}},
\]
where C_{k₀} is the contour surrounding the k₀ singularity anti-clockwise around the cut and
we have used Eq. (3.60) for the Γ function.

3.10 From Differential to Algebraic Equations with FT, FS and LT

The Fourier transform, the Fourier Series and the Laplace transform, introduced and
discussed in the current Chapter of the notes, will be utilized extensively in the next
Chapters, especially in the following one, where we discuss differential equations.
To facilitate the transition, consider the simplest possible differential equation, relating a
scalar function f(x) of x ∈ R linearly to its derivative:
\[
\frac{d}{dx} f(x) + q f(x) = g(x), \tag{3.62}
\]
where q is a positive constant, q > 0, and g(x) is a known scalar function.
Assume that the function g(x) on the right hand side of Eq. (3.62) has a well-defined Fourier
Transform (FT). Then, applying the FT to the differential equation, we arrive at a much
simpler algebraic equation,
\[
ik \hat{f}(k) + q \hat{f}(k) = \hat{g}(k), \tag{3.63}
\]
which is solved trivially:
\[
\hat{f}(k) = \frac{\hat{g}(k)}{q + ik}. \tag{3.64}
\]
Applying the Inverse Fourier Transform to the result, we derive
\[
f(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} dk\, \frac{\hat{g}(k)}{q + ik}\, e^{ikx}
= \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{dk}{q + ik} \int_{-\infty}^{+\infty} dy\, g(y)\, e^{ik(x - y)}
= \int_{-\infty}^{x} dy\, g(y)\, e^{-q(x - y)}, \tag{3.65}
\]
which provides an explicit solution of the original differential equation in quadratures, i.e. as
an integral over a known integrand. Note that in the transition from the first expression of
Eq. (3.65) to the last we have exchanged the order of integration and evaluated the pole
integral at k = iq, also observing that the result is non-zero only at y < x.
In summary, when g(x) is well defined over the entire x ∈ R, we solve the differential
Eq. (3.62) in three straightforward steps: (a) applying the FT and arriving at a (much
simpler) algebraic equation; (b) solving the algebraic equation; (c) applying the Inverse FT.
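As a sanity check of this logic and of Eq. (3.65), here is a minimal Julia sketch (the choice
g(x) = exp(−x²), the value of q, the cutoff L and the test point are illustrative assumptions
added here): it evaluates the quadrature solution numerically and verifies that it satisfies
Eq. (3.62).

# Check of the quadrature solution (3.65) of f' + q f = g:
# f(x) = ∫_{-∞}^{x} dy g(y) e^{-q(x-y)}.
q = 1.5
g(y) = exp(-y^2)

function fsol(x; L = 20.0, n = 200_000)      # Riemann sum over (-L, x]
    ys = range(-L, x; length = n)
    sum(g.(ys) .* exp.(-q .* (x .- ys))) * step(ys)
end

# finite-difference check of the ODE residual f'(x) + q f(x) - g(x)
x, h = 0.7, 1e-4
fprime = (fsol(x + h) - fsol(x - h)) / (2h)
println("residual ≈ ", fprime + q * fsol(x) - g(x))   # ≈ 0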
Let us now assume that g(x), f(x) ≠ 0 only at x ≥ 0. In this case we repeat the logic of the
preceding derivation, however substituting the LT for the FT. We derive
\[
f(x) = \frac{\theta(x)}{2\pi i} \int_{\varepsilon - i\infty}^{\varepsilon + i\infty} dk\, \frac{f(0^+) + \tilde{g}(k)}{q + k}\, e^{kx}
= \frac{\theta(x)}{2\pi i} \int_{\varepsilon - i\infty}^{\varepsilon + i\infty} \frac{dk}{q + k} \left( f(0^+)\, e^{kx} + \int_0^{+\infty} dy\, g(y)\, e^{k(x - y)} \right)
\]
\[
= \theta(x) \left( f(0^+)\, e^{-qx} + \int_0^{x} dy\, g(y)\, e^{-q(x - y)} \right), \tag{3.66}
\]

where, in the transition to the first line, we have accounted for the fact that the LT version
of Eq. (3.63) acquires the additional f(0⁺) term (from the LT of the derivative), and then, in
the transition from the second line to the third, we have exchanged the order of integration
and evaluated the pole integrals at k = −q. We also observe that Eq. (3.65) transitions into
Eq. (3.66) if we set f(x), g(x) to be nonzero only at x ≥ 0.
Finally, consider Eq. (3.62) in the case supporting 2π-periodicity of f(x), i.e. requiring that
q = im, where m ∈ Z, and that g(x) is 2π-periodic. In this case we arrive, after application
of the FT to Eq. (3.62), at the discrete version of the algebraic Eq. (3.63), where fˆ(k) and
ĝ(k) are substituted by the Fourier coefficients f_k and g_k, valid at k ∈ Z. Then the Fourier
Series version of Eq. (3.65) becomes
\[
f(x) = \sum_{k=-\infty}^{+\infty} \frac{g_k}{i(k + m)}\, e^{ikx}
= e^{-imx} \sum_{k'=-\infty}^{+\infty} \frac{g_{k'-m}}{ik'}\, e^{ik'x}
= e^{-imx} \sum_{k'=-\infty}^{+\infty} g_{k'-m} \left( \frac{1}{ik'} + \int_0^x dy\, e^{ik'y} \right)
\]
\[
= e^{-imx} f(0) + \int_0^x dy\, e^{im(y - x)} \sum_{k'=-\infty}^{+\infty} g_{k'-m}\, e^{i(k'-m)y}
= e^{-imx} f(0) + \int_0^x dy\, e^{im(y - x)}\, g(y). \tag{3.67}
\]

Notice that in the 2π-periodic FS case, as in the semi-infinite LT case, the final expressions,
given by Eq. (3.67) and Eq. (3.66) respectively, require providing an additional condition on
f(x), chosen here to be imposed at x = 0. This is a reflection of the more general fact that an
n-th order ODE requires fixing n conditions. The FT expression, Eq. (3.65), does not show
this dependence (on f(−∞)) only because, for the Fourier integral to be well defined, we
effectively required that f(−∞) = 0.
Part II

Differential Equations
Chapter 4

Ordinary Differential Equations.

A differential equation (DE) is an equation that relates an unknown function and its deriva-
tives to other known functions or quantities. Solving a DE amounts to determining the
unknown function. For a DE to be fully determined, it is necessary to define auxiliary
information, typically available in the form of an initial or boundary condition.
Often several DE’s may be coupled together in a system of DE’s. Since this is equivalent
to a DE of a vector-valued function, we will use the term “differential equation” to refer
to both single equations and systems of equations and the term “function” to refer to both
scalar- and vector-valued functions. We will distinguish between the singular and plural
only when relevant.
The function to be determined may be a function of a single independent variable, (e.g.
f = f (t) or f = f (x)) in which case the differential equation is known as an ordinary
differential equation, or it may be a function of two or more independent variables, (e.g.
f = f (x, y), or f = f (t, x, y, z)) in which case the differential equation is known as a partial
differential equation.
The order of a differential equation is defined as the largest integer n for which the nth
derivative of the unknown function appears in the differential equation.
The most general differential equation is equivalent to the condition that a nonlinear
function of an unknown function and its derivatives is equal to zero. An ODE is linear if the
condition is linear in the function and its derivatives. We call the ODE linear homogeneous
if, in addition, the condition is both linear and homogeneous in the function and its
derivatives. It follows for the homogeneous linear ODE that, if f(t) is a solution, so is cf(t),
where c is a constant. A linear differential equation that fails the condition of homogeneity
is called inhomogeneous. For example, an n-th order, inhomogeneous ordinary differential
equation is one that can be written as αₙ(t)f⁽ⁿ⁾(t) + · · · + α₁(t)f′(t) + α₀(t)f(t) = g(t),
where αᵢ(t), i = 0, . . . , n, and g(t) are known functions. Typical methods for solving linear
differential equations often rely on the fact that a linear combination of two or more
solutions to the homogeneous DE is yet another solution, and hence the particular solution
can be constructed from a basis of general solutions. This cannot be done for nonlinear
differential equations, and analytic solutions must often be tailor-made for each differential
equation, with no single method applicable beyond a fairly narrow class of nonlinear DEs.
Due to the difficulty in finding analytic solutions, we often rely on qualitative and/or ap-
proximate methods of analyzing nonlinear differential equations, e.g. through dimensional
analysis, phase plane analysis, perturbation methods or linearization. In general, linear dif-
ferential equations admit relatively simple dynamics, as compared to nonlinear differential
equations.
An ordinary differential equation (ODE) is a differential equation of one or more functions of
one independent variable, and of the derivatives of these functions. The term ordinary is
used in contrast with the term partial differential equation (PDE), where the functions
depend on more than one independent variable. PDEs will be discussed in Chapter 5.

4.1 ODEs: Simple cases


For a warm-up, let us recall some simple ODEs which can be integrated directly.

4.1.1 Separable Differential Equations

A separable differential equation is a first order differential equation that can be written so
that the derivative function appears on one side of the equation, and the other side contains
the product or quotient of two functions, one of which is a function of the independent
variable, and the other a function of the dependent variable.
Z Z
dx f (t)
= ⇒ g(x)dx = f (t)dt ⇒ g(x)dx = f (t)dt. (4.1)
dt g(x)

Example 4.1.1. Solve the differential equation ẋ(t) = ax(t)t2 .


Solution.
\[
\frac{dx(t)}{dt} = a x(t)\, t^2 \;\Rightarrow\; \frac{dx}{x} = a t^2\, dt \;\Rightarrow\; \int^x \frac{d\xi}{\xi} = \int^t a\tau^2\, d\tau \;\Rightarrow\; \log(x) + c_1 = \frac{a}{3} t^3 + c_2
\;\Rightarrow\; x(t) = c\, e^{a t^3/3}.
\]

4.1.2 Variation of Parameters

To solve the following linear, inhomogeneous ODE,
\[
\frac{dy}{dt} - p(t)\, y(t) = g(t), \qquad y(t_0) = y_0, \tag{4.2}
\]
let us substitute
\[
y(t) = c(t) \exp\left( \int_{t_0}^{t} d\tau\, p(\tau) \right), \tag{4.3}
\]
where the exponential factor is selected based on the solution of the homogeneous version of
Eq. (4.2), i.e. dy/dt = p(t)y(t), and the prefactor c(t), which would be a constant in the
homogeneous case, is made a function of t. This results in the following equation for the
t-dependent c(t):
\[
\frac{dc(t)}{dt} \exp\left( \int_{t_0}^{t} d\tau\, p(\tau) \right) = g(t).
\]
Applying the method of separable differential equations (see Eq. (4.1)) and then recalling
the substitution (4.3), one arrives at
\[
y(t) = \exp\left( \int_{t_0}^{t} d\tau\, p(\tau) \right) \left( y_0 + \int_{t_0}^{t} d\tau\, g(\tau) \exp\left( -\int_{t_0}^{\tau} d\tau'\, p(\tau') \right) \right).
\]
The method just applied to the first order differential equation is called the method of
variation of parameters because c(t) in the derivation above can be considered as a
parameter which we vary, i.e. allow to depend on t.
Let us extend the idea of the parameter variation method to the case of a general
second-order inhomogeneous differential equation,
\[
\frac{d^2 y}{dt^2} - p(t)\, \frac{dy}{dt} - q(t)\, y(t) = g(t), \tag{4.4}
\]
and try to find its general solution. Recall that the general solution of a linear
inhomogeneous equation is a sum of the general solution of the respective homogeneous
equation and a particular solution of the inhomogeneous equation.
Consider, first, the solution of the homogeneous equation, i.e. of Eq. (4.4) with zero right
hand side, g(t) = 0. This is a homogeneous differential equation of the second order, which
should therefore have two independent solutions (we call them linearly independent
solutions). Let us denote the two solutions y₁(t) and y₂(t) and form the so-called Wronskian
of the two:
\[
W(t) = y_1 y_2' - y_2 y_1'. \tag{4.5}
\]


Next we compute the derivative of the Wronskian and use the fact that y₁ and y₂ are
solutions of the homogeneous version of Eq. (4.4). We derive
\[
\frac{d}{dt} W = y_1 y_2'' - y_2 y_1'' = y_1 \big(p\, y_2' + q\, y_2\big) - y_2 \big(p\, y_1' + q\, y_1\big) = p\, W.
\]
Therefore, the Wronskian becomes
\[
W(t) = \exp\left( \int_{t_0}^{t} d\tau\, p(\tau) \right), \tag{4.6}
\]
where t₀ can be chosen arbitrarily. Moreover, given the relation (4.5), we can express one of
the two independent solutions via the other one and the Wronskian (which we now know
explicitly).
Now the question becomes whether we can find a particular solution (just one of many) of
the inhomogeneous Eq. (4.4). Let us follow the idea of the method of variation of parameters
and look for a particular solution of Eq. (4.4) as a linear combination of y₁(t) and y₂(t)
multiplied by unknown parameters A(t) and B(t):
\[
y(t) = A(t)\, y_1(t) + B(t)\, y_2(t). \tag{4.7}
\]
Substituting Eq. (4.7) into Eq. (4.4), we derive
\[
A'' y_1 + B'' y_2 + 2A' y_1' + 2B' y_2' - p \big( A' y_1 + B' y_2 \big) = g. \tag{4.8}
\]
Recall that we are looking for a particular solution. Therefore, we can choose to relate the
two unknown coefficients, A(t) and B(t), as we see fit, leaving only one of them as a degree
of freedom. The form of Eq. (4.8) suggests picking the relation so that the dependence on
p(t) in Eq. (4.8) disappears, i.e.
\[
A' y_1 + B' y_2 = 0. \tag{4.9}
\]
Then (differentiating Eq. (4.9) and using the result to eliminate the second derivatives of A
and B) Eq. (4.8) becomes
\[
A' y_1' + B' y_2' = g. \tag{4.10}
\]

Notice that the order of derivatives in Eq. (4.10) is reduced in comparison with Eq. (4.8) we
started with. Furthermore, expressing B′ via A′, y₁, y₂ according to Eq. (4.9) and
substituting the result into Eq. (4.10), we arrive at
\[
W A' + y_2\, g = 0 \;\;\Rightarrow\;\; A = -\int_{t_0}^{t} d\tau\, \frac{y_2(\tau)\, g(\tau)}{W(\tau)}. \tag{4.11}
\]
Similarly, expressing A′ via B′, y₁, y₂ according to Eq. (4.9) and substituting the result into
Eq. (4.10), we derive the B-analog of Eq. (4.11):
\[
B = \int_{t_0}^{t} d\tau\, \frac{y_1(\tau)\, g(\tau)}{W(\tau)}. \tag{4.12}
\]

In summary, to construct the solution of Eq. (4.4) we follow these steps:

• Find the Wronskian, W(t), given by Eq. (4.6).

• Find a homogeneous solution, y₁(t), and express its linearly independent counterpart,
y₂(t), via y₁(t) and the previously found Wronskian, W(t).

• Compute the time-dependent factors A and B according to Eqs. (4.11,4.12), thereby
obtaining a particular solution of the original (inhomogeneous) equation according to
Eq. (4.7).

• The resulting general solution is the sum of y₁ and y₂, each multiplied by a
time-independent coefficient, with the particular solution of the inhomogeneous equation
(just found) added to the sum.

The general scheme is illustrated in the following two examples.

Example 4.1.2. Find the general solution to t²x″(t) + tx′(t) − x(t) = t (where t ≠ 0), given
that x(t) = t is a solution of the homogeneous equation.

Solution. Set the leading coefficient to unity by dividing by t² to get x″ + t⁻¹x′ − t⁻²x = t⁻¹
(where t ≠ 0). Therefore, in the notation of Eq. (4.4), p(t) = −t⁻¹ and g(t) = t⁻¹. We
compute the Wronskian
\[
W(t) = \exp\left( \int_{t_0}^{t} p(\tau)\, d\tau \right) = \exp\left( -\int_{t_0}^{t} \tau^{-1}\, d\tau \right) = t^{-1},
\]
up to an overall constant fixed by the choice of t₀. The second linearly independent solution
is found from
\[
W(t) = y_1 y_2' - y_2 y_1' \;\Rightarrow\; \frac{1}{t} = t\, y_2' - y_2 \;\Rightarrow\; y_2(t) = -\frac{1}{2t}.
\]
Computing A and B:
\[
A(t) = -\int_{t_0}^{t} d\tau\, \frac{y_2(\tau)\, g(\tau)}{W(\tau)} = -\int_{t_0}^{t} d\tau\, \frac{(-\tfrac{1}{2\tau})\, \tau^{-1}}{\tau^{-1}} = \frac{1}{2}\log(t),
\qquad
B(t) = \int_{t_0}^{t} d\tau\, \frac{y_1(\tau)\, g(\tau)}{W(\tau)} = \int_{t_0}^{t} d\tau\, \frac{\tau\, \tau^{-1}}{\tau^{-1}} = \frac{t^2}{2}.
\]
The general solution to the differential equation is
\[
x(t) = c_1 t + c_2 t^{-1} + \frac{1}{2}\, t \log(t) - \frac{t}{4}.
\]
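A quick numerical residual check of this general solution (a Julia sketch added here; the
constants c1, c2 and the evaluation point t0 are arbitrary illustrative choices):

# Verify that x(t) = c1 t + c2/t + (t/2) log(t) - t/4 satisfies
# t² x'' + t x' - x = t, using central finite differences.
c1, c2 = 0.3, -1.1
x(t) = c1 * t + c2 / t + t * log(t) / 2 - t / 4

t0, h = 2.0, 1e-4
xp  = (x(t0 + h) - x(t0 - h)) / (2h)               # x'(t0)
xpp = (x(t0 + h) - 2x(t0) + x(t0 - h)) / h^2       # x''(t0)
println("residual ≈ ", t0^2 * xpp + t0 * xp - x(t0) - t0)
# ≈ 0 up to finite-difference error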

Example 4.1.3. Find the general solution to r″(θ) + r(θ) = tan(θ) for −π/2 < θ < π/2.

Solution. Here p(θ) = 0, so we compute the Wronskian
\[
W(\theta) = \exp\left( \int_{\theta_0}^{\theta} 0\, d\theta' \right) = 1.
\]
Let r₁(θ) = cos(θ) be the first linearly independent solution. The second linearly independent
solution is found from
\[
W(\theta) = r_1(\theta) r_2'(\theta) - r_2(\theta) r_1'(\theta) \;\Rightarrow\; 1 = \cos(\theta)\, r_2'(\theta) + \sin(\theta)\, r_2(\theta) \;\Rightarrow\; r_2(\theta) = \sin(\theta).
\]
Computing A and B:
\[
A(\theta) = -\int_{\theta_0}^{\theta} d\theta'\, \frac{r_2(\theta')\, g(\theta')}{W(\theta')} = -\int_{\theta_0}^{\theta} d\theta'\, \sin(\theta')\tan(\theta') = \sin(\theta) - \log(\sec(\theta) + \tan(\theta)),
\]
\[
B(\theta) = \int_{\theta_0}^{\theta} d\theta'\, \frac{r_1(\theta')\, g(\theta')}{W(\theta')} = \int_{\theta_0}^{\theta} d\theta'\, \cos(\theta')\tan(\theta') = -\cos(\theta).
\]
The particular solution is A r₁ + B r₂ = −cos(θ) log(sec(θ) + tan(θ)), so the general solution
to the differential equation is
\[
r(\theta) = c_1 \cos(\theta) + c_2 \sin(\theta) - \cos(\theta) \log(\sec(\theta) + \tan(\theta)).
\]

Exercise 4.1. (a) Find the general solution, x(t), of the following ODE,
\[
\frac{dx}{dt} - \lambda(t)\, x = \frac{f(t)}{x^2},
\]
where λ(t) and f(t) are known functions of t. (b) Solve the following general second-order,
constant-coefficient, linear ODE,
\[
\tau_0^2\, \frac{d^2}{dt^2} y + \tau_1\, \frac{d}{dt} y + y = g(t),
\]
with the initial conditions y(0) = y₀, (dy/dt)|_{t=0} = v₀.

4.2 Direct Methods for Solving Linear ODEs


We continue our exploration of linear ODEs by gradually increasing the complexity of the
problems and by developing more technical methods.

4.2.1 Homogeneous ODEs with Constant Coefficients

Consider the n-th order homogeneous ODE with constant coefficients
\[
\mathcal{L} x(t) = 0, \qquad \text{where} \quad \mathcal{L} \equiv \sum_{m=0}^{n} a_{n-m}\, \frac{d^{n-m}}{dt^{n-m}}. \tag{4.13}
\]
(Here and below we use bold-calligraphic notation, 𝓛, for differential operators.) Let us look
for the general solution of Eq. (4.13) in the form of a linear combination of exponentials
\[
x(t) = \sum_{k=1}^{n} c_k \exp(\lambda_k t), \tag{4.14}
\]
where the c_k are constants. Substituting Eq. (4.14) into Eq. (4.13), one arrives at the
condition that the λ_k are roots of the characteristic polynomial:
\[
\sum_{m=0}^{n} a_{n-m}\, (\lambda_k)^{n-m} = 0. \tag{4.15}
\]
Eq. (4.14) holds if the λ_k are not degenerate (that is, if there are n distinct roots). In the
case of degeneracy, we generalize Eq. (4.14) to a sum of exponentials for the non-degenerate
λ_k, and of polynomials in t multiplied by the respective exponentials for the degenerate λ_k,
where the degree of each polynomial equals the degree of the respective root’s degeneracy:
\[
x(t) = \sum_{k=1}^{m} \left( \sum_{l=0}^{d_k} c_k^{(l)}\, t^l \right) \exp(\lambda_k t), \tag{4.16}
\]
where m is the number of distinct roots and dk is the degree of the k-th root degeneracy.
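For instance (a standard illustration added here for concreteness):
\[
\ddot{x} - 3\dot{x} + 2x = 0: \quad \lambda^2 - 3\lambda + 2 = (\lambda - 1)(\lambda - 2) = 0
\;\Rightarrow\; x(t) = c_1 e^{t} + c_2 e^{2t};
\]
\[
\ddot{x} - 2\dot{x} + x = 0: \quad (\lambda - 1)^2 = 0 \;\;\text{(double root)}
\;\Rightarrow\; x(t) = (c_1 + c_2 t)\, e^{t}.
\]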

4.2.2 Inhomogeneous ODEs

Consider an inhomogeneous version of a generic linear ODE

Lx(t) = f (t). (4.17)

Recall that if the particular solution is xp (t), and if x0 (t) is a generic solution of the homo-
geneous version of the equation, then a generic solution of Eq. (4.17) can be expressed as
x(t) = x0 (t) + xp (t).
Let us illustrate the utility of this simple but powerful statement with an example:

Example 4.2.1. For (a) ω₀ ≠ 3 and for (b) ω₀ = 3, solve
\[
\mathcal{L} x := \ddot{x} + \omega_0^2 x = \cos(3t). \tag{4.18}
\]
Solution. The general solution to the homogeneous equation, 𝓛x = 0, is x₀(t) = c₁cos(ω₀t) +
c₂sin(ω₀t). For ω₀ ≠ 3, a particular solution to Eq. (4.18) is x_p(t) = cos(3t)/(ω₀² − 9),
which can be found by variation of parameters (Section 4.1.2). Therefore, for ω₀ ≠ 3, the
solution to the inhomogeneous Eq. (4.18) is
\[
x(t) = c_1 \cos(\omega_0 t) + c_2 \sin(\omega_0 t) + \frac{\cos(3t)}{\omega_0^2 - 9}.
\]
When ω0 = 3, the natural frequency of the system coincides with the forcing frequency of the right hand side and the system resonates. We must look for a new particular solution because the one found above diverges as ω0 → 3: at resonance, cos(3t) solves the homogeneous equation and is therefore already represented in x0(t). A particular solution can again be found by variation of parameters (Section 4.1.2). Therefore, for ω0 = 3 the solution to the inhomogeneous Eq. (4.18) is

x(t) = c1 cos(3t) + c2 sin(3t) + (1/6) t sin(3t).
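Both cases can be reproduced with a computer algebra system. A quick check (added here for illustration; it assumes sympy is available):

import sympy as sp

t = sp.symbols('t')
x = sp.Function('x')

# Resonant case omega_0 = 3: expect the secular term t*sin(3t)/6.
sol = sp.dsolve(sp.Eq(x(t).diff(t, 2) + 9*x(t), sp.cos(3*t)), x(t))
print(sol.rhs)   # C1*sin(3*t) + C2*cos(3*t) + t*sin(3*t)/6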

4.3 Linear Dynamics via the Green Function


So far our analysis of ODEs was formal. Often, though not always, we can associate an ODE with the dynamics of a system, hence the term “dynamical system”. In this case the ODE describes the evolution of the system variable, x, in time t, i.e. it studies x as a function of t, x(t).
The dynamic considerations are very rich and in this course we will only scratch the surface of the interesting phenomena they cover. For example, in Section ?? we discuss the so-called “conservative” dynamics. An ODE may also describe a “dissipative” system which relaxes to an “equilibrium”. If a dissipative system is in equilibrium, its state does not change in time. If the system is perturbed away from a stable equilibrium and the perturbation is small, the system relaxes back to the equilibrium. The relaxation may not be monotonic, and the system may show some oscillations. If the relaxational (dissipative with possible oscillations) dynamics is close to the equilibrium we model it by a linear ODE. There are also many interesting situations when linear ODEs explain oscillations which do not decay.
The method of Green functions, or “response” functions, will be the workhorse of our analysis for such dynamics, when the ODE is linear. The Green function method offers a powerful and intuitive approach which also extends (in the next chapter) to the case of PDEs.
Let us start with the following general consideration. Given a linear differential equation, Lx(t) = f(t), the goal is to find an operator, L^{−1}, such that “x(t) = L^{−1}f(t)”. Since L is a differential operator, it is reasonable to expect L^{−1} to be an integral operator, which can be expressed as

L^{−1}f(t) = ∫ dτ G(t, τ)f(τ),

where G(t, τ) is the so-called Green function which is to be determined. Formal manipulations show that

f(t) = L L^{−1}f(t) = L ∫ dτ G(t, τ)f(τ) = ∫ dτ [L G(t, τ)] f(τ).

We have already seen this equation as one of the properties of the δ-function. That is,

f(t) = ∫ dτ [L G(t, τ)] f(τ) ⇔ L G(t, τ) = δ(t − τ).

The Green function for a differential operator, L, is the function G(t, τ) that solves the differential equation LG(t, τ) = δ(t − τ) subject to the prescribed side conditions. The Green function describes the “response” of the system at time t to an “impulse” applied at time τ.
Notation. Technically, the Green function is a function of two variables, t and τ , where τ
represents the time of an impulse and t represents the time that we observe the system’s
response to the impulse. Notice that if L is a differential operator with constant (time-
independent) coefficients, then the response of the system to an impulse does not depend
on t and τ independently, but instead it only depends on the difference t − τ . In this
situation, G(t, τ ) reduces to the “homogeneous in time” or “time-invariant” G(t − τ ).
We proceed to explore the method by revisiting the simple constant-coefficient case of the linear scalar-valued first-order ODE (4.2).

4.3.1 Evolution of a linear scalar

Consider the simplest example of the scalar relaxation

dx/dt + γx = f(t), (4.19)

where γ is a constant and f(t) is a known function of t. This model appears, for example, when we consider over-damped driving of a polymer through a medium: the equation describes the balance of forces, where f(t) is the driving force, γx is the elastic (returning) force for a polymer with one end positioned at the origin and the other at position x, and ẋ represents friction of the polymer against the medium. The general solution of this equation (recall the discussion of the integral operator above, and also notice the time-homogeneous form of the Green function) is

x(t) = ∫_{−∞}^{t} dτ G(t − τ)f(τ), (4.20)

where we have assumed that the evolution starts at t = −∞ with lim_{t→−∞} x(t) = 0; and G(t, τ) is the Green function which satisfies

dG(t, τ)/dt + γG(t, τ) = δ(t − τ), (4.21)

and δ(t) is the δ-function.

Notice that the evolutionary problem we discuss here is an initial value problem (also called a Cauchy problem). Indeed, if we did not fix x back in the past (at t = −∞), the solution of Eq. (4.19) would not be defined unambiguously: if xs(t) is a particular solution of Eq. (4.19), then xs(t) + C exp(−γt), where C is a constant, describes a whole family of solutions of Eq. (4.19). The freedom, eliminated by fixing the initial condition, is associated with the so-called zero mode of the differential operator, d/dt + γ.
Another remark is about causality, which may also be referred to, in this context, as the “causality principle”. It follows from Eq. (4.20) that, in defining the Green function, one also enforces G(t − τ) = 0 at t < τ. This formal observation is, of course, consistent with the obvious fact that solutions of Eq. (4.19) at a particular moment in time t can only depend on external driving sources f(τ) that occurred in the past, when τ ≤ t. The solution cannot depend on external driving forces that will occur in the future, when τ > t.
Now back to solving Eq. (4.21). Since δ(t − τ) = 0 at t > τ, one associates G(t − τ) with the zero mode of the aforementioned differential operator, G(t − τ) = A exp(−γ(t − τ)), where A is a constant. On the other hand, due to the causality principle, G(t − τ) = 0 at t < τ. Integrating Eq. (4.21) over time from τ − ε to τ + ε, where 0 < ε ≪ 1, we observe that G(t − τ) should have a discontinuity (jump) at t = τ: G(t − τ) = A exp(−γ(t − τ))θ(t − τ), where θ is the Heaviside function. Substituting this expression in Eq. (4.21) and integrating the result (left and right hand sides of the resulting equality) over τ − ε < t < τ + ε, one finds that A = 1. Substituting the expression into Eq. (4.20), one arrives at the solution

x(t) = ∫_{−∞}^{t} dτ exp(−γ(t − τ))f(τ). (4.22)

We observe that the system “forgets” the past at the rate γ per unit time.
Let us sketch out a few different ways to solve Eq. (4.21).
Method 1: Multiply by the appropriate integrating factor. For this problem, the integrating factor is e^{γt}:

L G(t, τ) = δ(t − τ)
d/dt [e^{γt} G(t, τ)] = e^{γt} δ(t − τ)
e^{γt} G(t, τ) = ∫_{−∞}^{t} e^{γt′} δ(t′ − τ) dt′ = θ(t − τ) e^{γτ}
G(t, τ) = θ(t − τ) e^{−γ(t−τ)}.

Method 2: Take the Fourier transform of both sides, solve the resulting algebraic equation for Ĝ(k, τ), and then use a contour integral to compute the inverse Fourier transform:

F[Ġ(t, τ) + γG(t, τ)] = F[δ(t − τ)]
ik Ĝ(k, τ) + γ Ĝ(k, τ) = e^{−ikτ}
Ĝ(k, τ) = e^{−ikτ}/(γ + ik)
G(t, τ) = (1/2π) ∫_{−∞}^{∞} dk e^{ikt} e^{−ikτ}/(γ + ik)
G(t, τ) = θ(t − τ) e^{−γ(t−τ)},

where, in the last line, we have computed the contour integral by closing the contour with a semicircular arc of radius R and taking the limit R → ∞. To ensure that the closing arc has a vanishingly small contribution to the integral, the contour is closed in the upper half plane for t > τ and in the lower half plane for t < τ. The integrand has a simple pole at k = iγ with residue (1/2πi) e^{−γ(t−τ)}. Because the pole is in the upper half plane, the integral is equal to e^{−γ(t−τ)} for t > τ, and because there are no poles in the lower half plane, the integral is equal to 0 for t < τ.
Method 3: Construct the Green function from a number of properties that it must satisfy. Given LG(t, τ) = δ(t − τ), we see that G(t, τ) must satisfy:

(i.) G(t, τ) solves LG(t, τ) = 0 whenever t ≠ τ.

(ii.) G(t, τ) must satisfy the initial condition.

(iii.) G(t, τ) must have a jump of size unity at t = τ. This can be seen by integrating the equation over τ − ε < t′ < τ + ε and taking the limit ε → 0+:

∫_{τ−ε}^{τ+ε} Ġ(t′ − τ)dt′ + γ ∫_{τ−ε}^{τ+ε} G(t′ − τ)dt′ = ∫_{τ−ε}^{τ+ε} δ(t′ − τ)dt′ ⇒ G(ε) − G(−ε) + O(ε) = 1.

Generalization of property (iii): In general, the Green function of an n-th order differential operator has a jump in the (n − 1)-st derivative at t = τ; all lower-order derivatives are continuous there. The size of the jump is equal to the inverse of the leading coefficient.
Let us use these properties to construct the Green function.
Step 1. Find candidate solutions by solving (d/dt + γ)G(t, τ) = 0. We know there is only one (non-trivial) linearly independent solution because the ODE is linear and first order. This solution is Ae^{−γt}. There is also the trivial solution. To construct the Green function, we must mix and match between the candidates G(t, τ) = 0 and G(t, τ) = Ae^{−γt}.
Step 2. Apply the initial condition. Our initial condition is lim_{t→−∞} G(t, τ) = 0. The only candidate solution that satisfies the initial condition is the trivial solution. That is, G(t, τ) = 0 for t < τ. We are not yet prepared to say what happens for t > τ.
Step 3. Apply the jump condition. Given that G(t, τ) = 0 for t < τ, we must now determine G(t, τ) for t > τ. We realize that Ae^{−γt} is the only candidate that can produce a jump at t = τ. Furthermore, to ensure the jump is of size unity, we must set A = e^{γτ}.
In summary, G(t, τ) = θ(t − τ)e^{−γ(t−τ)}.
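To make the result concrete, here is a small numerical check (an addition for illustration; the forcing f and the parameter values are arbitrary choices, and numpy/scipy are assumed): the convolution (4.20) with G(t − τ) = θ(t − τ)e^{−γ(t−τ)} is compared against a direct numerical integration of Eq. (4.19).

import numpy as np
from scipy.integrate import solve_ivp

gamma = 0.7
f = lambda t: np.sin(2.0 * t)            # arbitrary smooth forcing

# direct numerical solution of dx/dt + gamma*x = f(t), started far in the past
sol = solve_ivp(lambda t, x: -gamma * x + f(t), (-60.0, 10.0), [0.0],
                dense_output=True, rtol=1e-9, atol=1e-12)

# Green-function solution x(t) = int_{-inf}^{t} exp(-gamma*(t-s)) f(s) ds
t = 5.0
s = np.linspace(-60.0, t, 200001)
x_green = np.trapz(np.exp(-gamma * (t - s)) * f(s), s)

print(x_green, sol.sol(t)[0])            # the two numbers should agree closely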

Exercise 4.2. Solve Eq. (4.19) at t > 0, where x(0) = 0 and f (t) = A exp(−αt). Analyze
the dependence on α and γ, including α → γ.

Recall that Eq. (4.21) assumes that the Green function depends on the difference between t and τ, t − τ, and not on the two variables separately. This assumption is justified for the case considered here; however it will not be correct for situations where the decay coefficient γ(t) depends on t. In that general case one needs to consider the general form of the Green function discussed above, G(t, τ). In the case of constant γ, the Green function depends only on the difference because of the symmetry of Eq. (4.21): invariance with respect to time translation (time homogeneity), i.e. the equation does not change under the shift t → t + t0.

4.3.2 Evolution of a vector

Let us now generalize and consider

dx/dt + Γ̂x = f(t), (4.23)

where x = (x1, …, xn)ᵀ and f = (f1, …, fn)ᵀ are n-dimensional vector-valued functions of t and Γ̂ is an n × n time-independent matrix. We consider the two possible cases for Γ̂: first, when Γ̂ is either diagonal or diagonalizable, and second, when it is not diagonalizable.
If Γ̂ is a diagonal matrix, the vector-valued differential equation decouples into n scalar-valued differential equations ẋi(t) + γi xi(t) = fi(t), where γ1, …, γn are the n diagonal entries of Γ̂. Each of the n scalar-valued differential equations can be solved independently of the others, as discussed in Section 4.3.1.
If Γ̂ is a diagonalizable matrix (but not necessarily diagonal), then we find the eigen-set of Γ̂,

Γ̂ai = λi ai, (4.24)

and expand x and f over the basis {ai | i},

x = Σ_i yi ai,  f = Σ_i φi ai. (4.25)

Substituting the expansions into Eq. (4.23), one arrives at the n scalar-valued differential equations

dyi/dt + λi yi = φi, (4.26)

therefore reducing the vector equation to a set of scalar equations of the type already considered in Section 4.3.1.
If Γ̂ is not diagonalizable, it can still be brought to Jordan canonical form. This occurs when a repeated eigenvalue has fewer linearly independent eigenvectors than its algebraic multiplicity. As before, one introduces the Green function Ĝ(t, τ), which satisfies

(d/dt + Γ̂) Ĝ(t, τ) = δ(t − τ)1̂. (4.27)

The explicit solution of Eq. (4.27) is

Ĝ(t, τ) = θ(t − τ) exp(−Γ̂(t − τ)), (4.28)

which allows us to state the solution of Eq. (4.23) in the following invariant form

x(t) = ∫_{−∞}^{t} dτ Ĝ(t − τ)f(τ) = ∫_{−∞}^{t} dτ θ(t − τ) exp(−Γ̂(t − τ)) f(τ). (4.29)

Notice that the matrix exponential, introduced in Eq. (4.28) and utilized in Eq. (4.29), is a formal expression which may be interpreted in terms of the Taylor series

exp(−(t − τ)Γ̂) = Σ_{n=0}^{∞} (−(t − τ))ⁿ Γ̂ⁿ / n!, (4.30)

which is always convergent (for a matrix Γ̂ with finite elements).


To relate the invariant expression (4.29) to the eigenvalue decomposition of Eqs. (4.25, 4.26), one introduces the decomposition

Γ̂ = ÂĴÂ^{−1}, (4.31)

where Ĵ is the matrix of Jordan blocks formed from the eigenvalues of Γ̂, and the columns of  are the respective (generalized) eigenvectors of Γ̂. Note that Γ̂ⁿ = ÂĴⁿÂ^{−1}.
To illustrate the peculiarity of the degenerate case consider

Γ̂ = ( λ 1 ; 0 λ ), which can be re-written as λÎ + N̂, where N̂ ≡ ( 0 1 ; 0 0 ),

which is the canonical form of the (2 × 2) Jordan matrix/block. Observe that N̂² = 0̂, which indicates that N̂ is a (2 × 2) nilpotent matrix. The nilpotent property can be leveraged when taking the matrix exponential,

exp(−(t − τ)Γ̂) = e^{−λ(t−τ)} (Î − (t − τ)N̂), (4.32)

where we have accounted for the nilpotent property of N̂. Incorporating Eq. (4.32) into Eq. (4.29), the solution can therefore be expressed as

x(t) = ∫_{−∞}^{t} dτ θ(t − τ) e^{−λ(t−τ)} (Î − (t − τ)N̂) f(τ). (4.33)
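Identity (4.32) is easy to verify numerically (a sketch added here; it assumes scipy, and the parameter values are arbitrary):

import numpy as np
from scipy.linalg import expm

lam, dt = 0.8, 1.3                       # arbitrary eigenvalue and time lag t - tau
N = np.array([[0.0, 1.0], [0.0, 0.0]])   # nilpotent part, N @ N = 0
Gamma = lam * np.eye(2) + N              # 2x2 Jordan block

lhs = expm(-dt * Gamma)                               # matrix exponential
rhs = np.exp(-lam * dt) * (np.eye(2) - dt * N)        # right hand side of Eq. (4.32)
print(np.allclose(lhs, rhs))                          # True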

Alternatively, we could write Eq. (4.23) in components,

dx1/dt + λx1 + x2 = f1,  dx2/dt + λx2 = f2.

Changing from x1 to x̃1 = x1 + t x2 and using both equations, one arrives at

dx̃1/dt + λx̃1 = f1 + t f2.

Note the emergence of a secular term (a polynomial in t) on the right hand side, which is generic in the degenerate case; the resulting scalar equation is then straightforward to integrate.

Exercise 4.3. Find the Green function of Eq. (4.23) for

Γ̂ = ( λ 1 0 ; 0 λ 1 ; 0 0 λ ).

Note that vector-valued ODEs appear as the result of the “vectorization” of an n-th order scalar-valued ODE for y(t). The vectorization occurs by setting x1 = y, x2 = dy/dt, …, xn = d^{n−1}y/dt^{n−1}. Then dx/dt is expressed via the components of x and the original equation, thus resulting in Eq. (4.23).

4.3.3 Higher Order Linear Dynamics

The Green function approach illustrated above can be applied to any inhomogeneous linear differential equation. Let us see how it works in the case of a second-order differential equation for a scalar. Consider

d²x/dt² + ω²x = f(t). (4.34)

To solve Eq. (4.34) note that its general solution can be expressed as a sum of a particular solution and a solution of the homogeneous version of Eq. (4.34) with zero right hand side. Let us choose a particular solution of Eq. (4.34) in the form of the convolution (4.20) of the source term, f(t), with the Green function of Eq. (4.34),

(d²/dt² + ω²) G(t) = δ(t). (4.35)

As established above, G(t) = 0 at t < 0. Integration of Eq. (4.35) over −ε < t < ε and checking the balance of the integrated terms reveals that Ġ jumps at t = 0, and the value of the jump is equal to unity. An additional integration over time around the singularity shows that G(t) is continuous (and zero) at t = 0. Therefore, in the case of the second order differential equation considered here: G = 0 and Ġ = 1 at t = +0. Since δ(t) = 0 at t > 0, these two values can be considered as the initial conditions at t = +0 for the homogeneous version (zero right hand side) of Eq. (4.35), which defines G(t) at t > +0. Finally, we arrive at the following result

G(t) = θ(t) sin(ωt)/ω, (4.36)

where θ is the Heaviside function.
Furthermore, Eq. (4.20) gives the solution to Eq. (4.34) over the infinite time horizon; however one can also use the Green function to solve the respective Cauchy problem (initial value problem). Since Eq. (4.34) is a second order ODE, one just needs to fix two values associated with x(t) at the initial time, t = 0, for example x(0) and ẋ(0). Then, taking into account that G(+0) = 0 and Ġ(+0) = 1, one finds the following general solution of the Cauchy problem for Eq. (4.34)

x(t) = ẋ(0)G(t) + x(0)Ġ(t) + ∫_0^t dt1 G(t − t1)f(t1). (4.37)
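A quick numerical sanity check of Eq. (4.37) (added here; the forcing and parameter values are arbitrary, and numpy/scipy are assumed):

import numpy as np
from scipy.integrate import solve_ivp

omega, x0, v0 = 2.0, 0.5, -1.0
f = lambda t: np.cos(0.7 * t)                  # arbitrary forcing

G = lambda t: np.where(t > 0, np.sin(omega * t) / omega, 0.0)
Gdot = lambda t: np.where(t > 0, np.cos(omega * t), 0.0)

t = 3.0
s = np.linspace(0.0, t, 100001)
x_formula = v0 * G(t) + x0 * Gdot(t) + np.trapz(G(t - s) * f(s), s)

sol = solve_ivp(lambda t, y: [y[1], -omega**2 * y[0] + f(t)],
                (0.0, t), [x0, v0], rtol=1e-10, atol=1e-12)
print(x_formula, sol.y[0, -1])                 # should agree to high accuracy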

Let us now generalize and consider

Lx = f(t),  L ≡ Σ_{m=0}^{n} a_{n−m} d^{n−m}/dt^{n−m}, (4.38)

where the ai are constants and L is the linear differential operator of n-th order with constant coefficients, already discussed in Section 4.2. We build a particular solution of Eq. (4.38) as the convolution (4.20) of the source term, f(t), with the Green function, G(t), of Eq. (4.38),

LG = δ(t), (4.39)

where G(t) = 0 at t < 0.



Observe that the solution to the respective homogeneous equation, Lx = 0 (the zero modes of the operator L), can generally be presented as

x(t) = Σ_i bi exp(zi t), (4.40)

where the bi are arbitrary constants.


Let us now use the general representation (4.40) to construct the Green function solving Eq. (4.39). Recall that, considering first and second order differential equations in the preceding Sections, we transitioned from the inhomogeneous equation for the Green function to the homogeneous equation supplemented with initial conditions. Direct extension of the “integration around zero” approach (doing it n times) reveals that the initial conditions one needs to set at t = +0 in the general case of the n-th order differential equation are

a_n d^{n−1}G(0^+)/dt^{n−1} = 1,  ∀0 ≤ m < n − 1: d^m G(0^+)/dt^m = 0. (4.41)

Consider, formally, L as a polynomial in z, where z is the elementary differential operator, z = d/dt, i.e. L(z). Then, at t > 0^+ the Green function satisfies the homogeneous equation, L(d/dt)G = 0. The solution of the homogeneous equation can generally be presented as

t > 0^+: G(t) = Σ_i bi exp(zi t), (4.42)

where the zi are the roots of L(z) and the bi are constants which are defined unambiguously from the system of algebraic equations one derives by substituting Eq. (4.42) into Eq. (4.41).
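The algebra behind Eqs. (4.41)–(4.42) is a small linear solve. A minimal sketch (added for illustration, assuming numpy and simple roots of L(z)):

import numpy as np

def green_coeffs(a):
    """Coefficients b_i of G(t) = sum_i b_i exp(z_i t) for L G = delta(t),
    with L = a[0] d^n/dt^n + ... + a[-1]; assumes the roots z_i are simple."""
    n = len(a) - 1
    z = np.roots(a)
    M = np.vander(z, n, increasing=True).T   # rows M[m, i] = z_i^m, m = 0..n-1
    rhs = np.zeros(n, dtype=complex)
    rhs[-1] = 1.0 / a[0]                     # jump condition from Eq. (4.41)
    return z, np.linalg.solve(M, rhs)

# Example: d^2/dt^2 + omega^2 gives G(t) = sin(omega t)/omega, cf. Eq. (4.36)
omega = 2.0
z, b = green_coeffs([1.0, 0.0, omega**2])
t = 0.9
print(np.real(np.sum(b * np.exp(z * t))), np.sin(omega * t) / omega)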

Example 4.3.1. Let γ, ν ∈ R with γ > 0. Find the Green function for the differential operator and use it to solve the ODE

d²x/dt² + 2γ dx/dt + ν²x = f, subject to: lim_{t→−∞} x(t) = 0, lim_{t→−∞} ẋ(t) = 0. (4.43)

Consider the cases ν < γ and ν > γ.

Notation. As before, G(t, τ) is a function of both t and τ, where the variable τ represents the time that an impulse is applied. For any fixed τ, G(t, τ) is the response to the impulse as a function of t. Its Fourier transform, Ĝ(ω, τ), is the decomposition of G(t, τ) into its oscillatory modes.
Solution. Longer method: Solve LG = δ(t − τ) by taking the Fourier transform of both sides and inverting it by contour integration. Here it is convenient to use the convention Ĝ(ω; τ) := ∫ dt e^{iωt} G(t, τ), so that d/dt → −iω:

(−ω² − 2iγω + ν²) Ĝ(ω; τ) = e^{iωτ} ⇒ Ĝ(ω; τ) = −e^{iωτ}/((ω − ω+)(ω − ω−)),

where ω± = −iγ ± √(ν² − γ²). The inverse Fourier transform,

G(t, τ) = (1/2π) ∫_{−∞}^{∞} dω e^{−iωt} Ĝ(ω; τ) = −(1/2π) ∫_{−∞}^{∞} dω e^{−iω(t−τ)}/((ω − ω+)(ω − ω−)),

is computed by closing the contour with a semicircular arc of radius R in the limit R → ∞. To ensure that the arc has a vanishingly small contribution to the integral, we must close the contour in the upper-half plane if t < τ and in the lower-half plane if t > τ. Writing h(ω) := e^{−iω(t−τ)}/((ω − ω+)(ω − ω−)), the integrand has simple poles with the associated residues:

• If ν > γ: simple poles at ω± = −iγ ± √(ν² − γ²) (with both real and imaginary parts):
Res(h, ω±) = ±(2√(ν² − γ²))^{−1} exp(−γ(t − τ)) exp(∓i√(ν² − γ²)(t − τ)).

• If ν < γ: simple poles at ω± = −iγ ± i√(γ² − ν²) (purely imaginary):
Res(h, ω±) = ±(2i√(γ² − ν²))^{−1} exp(−(γ ∓ √(γ² − ν²))(t − τ)).

Both poles lie in the lower half plane, so for t > τ (contour closed there, clockwise)

G(t, τ) = −(1/2π)(−2πi)[Res(h, ω+) + Res(h, ω−)] = i [Res(h, ω+) + Res(h, ω−)].

Because there are no singularities in the upper half plane, G(t, τ) = 0 if t < τ. Physically, this means that the system has no response to an impulse that will happen in the future (causality). Simplifying the algebra (expressing the complex exponentials as sines or hyperbolic sines where appropriate), we get

G(t − τ) = θ(t − τ) (e^{−γ(t−τ)}/√|ν² − γ²|) × { sin(√(ν² − γ²)(t − τ)), γ < ν
                                                  sinh(√(γ² − ν²)(t − τ)), γ > ν }

Finally, the solution to the ODE is given by

x(t) = ∫_{−∞}^{t} (e^{−γ(t−τ)}/√(ν² − γ²)) sin(√(ν² − γ²)(t − τ)) f(τ) dτ,  if γ < ν,

x(t) = ∫_{−∞}^{t} (e^{−γ(t−τ)}/√(γ² − ν²)) sinh(√(γ² − ν²)(t − τ)) f(τ) dτ,  if γ > ν.

Solution. Shorter method: Construct the Green function from the properties it must satisfy. We solve the case ν > γ; the case ν < γ follows analogously. Given LG(t, τ) = δ(t − τ), we see that G(t, τ) must satisfy:

(i.) G(t, τ) solves LG(t, τ) = 0 whenever t ≠ τ.

(ii.) G(t, τ) must satisfy the initial conditions.

(iii.) G(t, τ) must be continuous everywhere, including t = τ, and the derivative of G(t, τ) must have a jump of magnitude unity at t = τ.

Generalization of property (iii): In general, the Green function of an n-th order differential operator has a jump in the (n − 1)-st derivative at t = τ; all lower-order derivatives are continuous there. The size of the jump is equal to the inverse of the leading coefficient.
Let us use these properties to construct the Green function.
Step 1. Find candidate solutions by solving (d²/dt² + 2γ d/dt + ν²)G(t, τ) = 0. We know there are two linearly independent solutions because the ODE is linear and second order. The solutions are A1 e^{−γt−i√(ν²−γ²)t} and A2 e^{−γt+i√(ν²−γ²)t}, where A1 and A2 are functions of τ. The solution in this form is correct, but not very helpful. We take linear combinations of the two solutions and express them as c e^{−γ(t−τ)} sin(√(ν² − γ²)(t − τ)) and c e^{−γ(t−τ)} cos(√(ν² − γ²)(t − τ)). To construct the Green function, we must mix and match the two linearly independent candidate solutions. Tentatively, we can write

G(t, τ) = { c1 e^{−γ(t−τ)} sin(√(ν²−γ²)(t−τ)) + c2 e^{−γ(t−τ)} cos(√(ν²−γ²)(t−τ)), if t < τ
            c3 e^{−γ(t−τ)} sin(√(ν²−γ²)(t−τ)) + c4 e^{−γ(t−τ)} cos(√(ν²−γ²)(t−τ)), if t > τ }

Step 2. Apply the initial conditions. Our initial conditions are lim_{t→−∞} G(t, τ) = 0 and lim_{t→−∞} Ġ(t, τ) = 0. The only candidate solution that satisfies them is the trivial one. That is, G(t, τ) = 0 for t < τ. We are not yet prepared to say what happens for t > τ. We can improve our tentative Green function to

G(t, τ) = { 0, if t < τ
            c3 e^{−γ(t−τ)} sin(√(ν²−γ²)(t−τ)) + c4 e^{−γ(t−τ)} cos(√(ν²−γ²)(t−τ)), if t > τ }

Step 3. Apply the continuity and jump conditions. Given that G(t, τ) = 0 for t < τ, we must now determine G(t, τ) for t > τ. The continuity condition requires that c4 = 0. We compute ∂tG(t, τ) = −c3 γ e^{−γ(t−τ)} sin(√(ν²−γ²)(t−τ)) + c3 √(ν²−γ²) e^{−γ(t−τ)} cos(√(ν²−γ²)(t−τ)), and find that lim_{t→τ+} ∂tG = c3 √(ν²−γ²). To ensure the jump is of size unity, we must set c3 = 1/√(ν²−γ²).
In summary, the Green function is given by

G(t − τ) = θ(t − τ) (e^{−γ(t−τ)}/√(ν²−γ²)) sin(√(ν²−γ²)(t − τ)), if γ < ν,

and the solution to the ODE is

x(t) = ∫_{−∞}^{t} (e^{−γ(t−τ)}/√(ν²−γ²)) sin(√(ν²−γ²)(t − τ)) f(τ) dτ, if γ < ν.

A similar calculation can be used to find the Green function and the solution for ν < γ.
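A numerical cross-check of the underdamped Green function (an added sketch; the parameters and the forcing are arbitrary, and numpy/scipy are assumed):

import numpy as np
from scipy.integrate import solve_ivp

gamma, nu = 0.3, 2.0                     # underdamped case: gamma < nu
s = np.sqrt(nu**2 - gamma**2)
f = lambda t: np.exp(-0.1 * t**2)        # arbitrary localized forcing

t = 4.0
tau = np.linspace(-40.0, t, 200001)
x_green = np.trapz(np.exp(-gamma * (t - tau)) / s * np.sin(s * (t - tau)) * f(tau), tau)

sol = solve_ivp(lambda t, y: [y[1], -2*gamma*y[1] - nu**2*y[0] + f(t)],
                (-40.0, t), [0.0, 0.0], rtol=1e-9, atol=1e-12)
print(x_green, sol.y[0, -1])             # the two values should agree closely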
Exercise 4.4. Follow the logic of Example 4.3.1 and suggest two methods of finding the Green function ((a) based on the Fourier transform, and (b) based on the properties of the Green function) for solving

(d²/dt² + ν²)² x(t) = f(t), where (d²/dt² + ν²)² := (d²/dt² + ν²)(d²/dt² + ν²),

at t > 0, assuming that x is real-valued and x(0^−) = (d/dt)x(0^−) = (d²/dt²)x(0^−) = (d³/dt³)x(0^−) = 0.

4.3.4 Laplace Transform and Laplace Method

So far we have solved linear ODEs by using the Green function approach, constructing the Green function as a solution of the homogeneous equation with additionally prescribed initial conditions (one less than the order of the differential equation). In this section we discuss an alternative way of solving the problem: first, the application of the Laplace transform introduced in Section 3.9 to linear ODEs with constant coefficients, and then the so-called Laplace method for solving linear ODEs with coefficients that depend linearly on the (time/space) variable. The connection between the two is not only via the name of Laplace, who contributed to the development of both, but also due to the fact that the Laplace method can be considered as utilizing a generalization of the Laplace transform.
The Laplace transform is natural for solving dynamic problems with causal structure. Let us see how it works for finding the Green function defined by Eqs. (4.38, 4.39). We apply the Laplace transform to Eq. (4.39), integrating it over time with the exp(−kt) Laplace weight from a small positive value, ε, to ∞. In this case, the integral of the right hand side is zero. Each term on the left hand side can be transformed through a sequence of integrations by parts to a product of a monomial in k with G̃(k), the Laplace transform of G(t). We also check all boundary terms which appear at t = ε and t = ∞. Assuming that G(∞) = 0 (which is always the case for stable systems), all contributions at t = +∞ are equal to zero. All t = ε boundary terms, but one, are equal to zero, because ∀0 ≤ m < n − 1: d^mG(ε)/dt^m = 0. The only nonzero boundary contribution originates from a_n d^{n−1}G(ε)/dt^{n−1} = 1. Overall, one arrives at the following equation

L(k)G̃(k) = 1,  L(k) := Σ_{m=0}^{n} a_{n−m} k^{n−m}. (4.44)

Therefore, we have just found that G̃(k) has poles (in the complex plane of k) associated with the zeros of the polynomial L(k). To find G(t), one applies to G̃(k) the inverse Laplace transform

G(t) = ∫_{c−i∞}^{c+i∞} (dk/2πi) exp(kt) G̃(k). (4.45)

The Laplace method allows us to solve ODEs where the coefficients are linear in t.
Remark. Notice, again, that Laplace's method for differential equations is not to be confused with the Laplace transform or with Laplace's method for approximating integrals. They are related, but not the same.
Consider an ODE that can be written as

Σ_{m=0}^{N} (a_m + b_m t) d^m y/dt^m = 0. (4.46)

We look for solutions in the form of the integral

y(t) = ∫_C dk Z(k) e^{kt}, (4.47)

where Z(k) is a function of the complex variable k and C is a contour in the complex plane of k that will depend on Z(k) but that will not depend on t.
Remark. Notice that the substitution (4.47) is similar to the inverse Laplace transform (4.45), with the important difference that the contour C in the former does not necessarily coincide with the contour used in the latter. Recall that the contour used in the basic formula of the inverse Laplace transform, i.e. in Eq. (4.45), runs upward, parallel to and to the right of the imaginary axis.
The derivatives of y are computed from Eq. (4.47),

d^m y/dt^m = ∫_C dk Z(k) k^m e^{kt},

and are substituted into the left hand side of Eq. (4.46), which gives

∫_C dk Z(k) [(a0 + a1k + ⋯ + aN k^N) + (b0 + b1k + ⋯ + bN k^N) t] e^{kt} = 0,  P(k) := a0 + a1k + ⋯ + aN k^N,  Q(k) := b0 + b1k + ⋯ + bN k^N,

where we have introduced the notation P(k) and Q(k) for convenience. We integrate by parts to get

0 = ∫_C dk Z(k)(P(k) + Q(k)t)e^{kt}
  = [Z(k)Q(k)e^{kt}]_{k1}^{k2} + ∫_C dk (Z(k)P(k) − (d/dk)[Z(k)Q(k)]) e^{kt}, (4.48)

where k1 and k2 represent the two endpoints of C. If we can pick Z(k) and a contour C such that Eq. (4.48) holds, then we can use them to express the solution to the ODE (4.46) by Eq. (4.47). In summary, we must find Z(k) and the contour C such that

(d/dk)[Q(k)Z(k)] − P(k)Z(k) = 0 and [Z(k)Q(k)e^{kt}]_{k1}^{k2} = 0.
The differential equation above for Z(k) can be solved either by finding an integrating factor, or by separation of variables, as follows:

d[QZ] = PZ dk ⇒ d[QZ]/(QZ) = (P/Q) dk ⇒ ln(QZ) = ∫ (P/Q) dk + const.

Z(k) is thus given by

Z(k) = (c/Q(k)) exp(∫ dk P(k)/Q(k)). (4.49)

Once Z(k) is determined, we must find a contour with endpoints k1 and k2 such that Q(k1)Z(k1)e^{k1t} = Q(k2)Z(k2)e^{k2t}.

Example 4.3.2. Use Laplace's method to find the solution to the boundary value problem:

x d³u/dx³ + 2u = 0, u(0) = 1, u(∞) = 0.

Solution. For this problem, we compute

P(k) = 2, Q(k) = k³, Z(k) = (c/k³) e^{−1/k²}, Q(k)Z(k)e^{kx} = c e^{kx − 1/k²}. (4.50)

We see that e^{kx−1/k²} → 0 as k → −∞ (for x > 0) and that e^{kx−1/k²} → 0 as k → 0^−. Therefore, we take the contour of integration along the negative real axis. The solution is

u(x) = c ∫_{−∞}^{0} dk e^{kx−1/k²}/k³,

which can be expressed as

u(x) = ∫_0^∞ dz e^{−x/√z − z}

by the change of variables z = k^{−2}. The constant c = −2 was chosen to satisfy the boundary condition u(0) = 1.
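The closed form is easy to probe numerically (a sketch added here, assuming scipy; the finite-difference check of the ODE at x = 1 is an ad hoc choice):

import numpy as np
from scipy.integrate import quad

u = lambda x: quad(lambda z: np.exp(-x / np.sqrt(z) - z), 0.0, np.inf)[0]

print(u(0.0))                             # 1.0, the boundary condition

# check x*u''' + 2u = 0 at x = 1 with a finite-difference third derivative
x, h = 1.0, 1e-2
u3 = (u(x + 2*h) - 2*u(x + h) + 2*u(x - h) - u(x - 2*h)) / (2 * h**3)
print(x * u3 + 2 * u(x))                  # approximately 0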

Exercise 4.5. Consider the sum

S(x) = Σ_{n=0}^{∞} xⁿ/(n!)².

Find a second order, linear differential equation which, when supplied with proper initial conditions at x = 0, results in S(x) as a solution. Solve the initial value problem by the Laplace method, therefore representing S(x) as an integral.

Example 4.3.3. (a) Use Laplace's method to find a general solution to the Hermite equation

d²y/dt² − 2t dy/dt + 2ny = 0. (4.51)

(b) Simplify your result for the case where n is a non-negative integer.
Solution. (a) In this case we derive

P(k) = k² + 2n, Q(k) = −2k, Z(k) = −(c/2) e^{−k²/4}/k^{n+1}, Q(k)Z(k)e^{kt} = c e^{kt−k²/4}/kⁿ, (4.52)

thus resulting in the following explicit solution of Eq. (4.51), defined up to a multiplicative constant:

y(t) = ∫_C dk e^{kt−k²/4}/k^{n+1}. (4.53)

Let us make the change of variables k → z according to k = 2(t − z), which gives

y(t) = e^{t²} ∫_{C′} dz e^{−z²}/(z − t)^{n+1}, (4.54)

where C′ is a suitable contour in the complex plane of z, which is yet undefined; that is, we have a freedom in choosing the contour (as above we had a choice of the contour C in the complex plane of k).
(b) When n is a non-negative integer, the integrand in Eq. (4.54) has a pole of order n + 1 at z = t, and thus choosing the contour to go around the pole both satisfies the requirement on the boundary terms and allows us to evaluate the integral by residue calculus. Applying Cauchy's formula to the resulting contour integral, one therefore arrives at the expression for the so-called Hermite polynomials

y(t) = Hn(t) = (−1)ⁿ e^{t²} dⁿ/dtⁿ e^{−t²}, (4.55)

where the re-scaling (which is a degree of freedom in linear differential equations) is selected according to the normalization constraint introduced in the following exercise.

Hermite polynomials will come back later in the context of the Sturm-Liouville problem
in Section 4.5.3.
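Eq. (4.55) can be verified symbolically (a sketch added here; it assumes sympy and picks n = 4 arbitrarily):

import sympy as sp

t = sp.symbols('t')
n = 4                                                # any non-negative integer

# Rodrigues-type formula (4.55)
H = sp.simplify((-1)**n * sp.exp(t**2) * sp.diff(sp.exp(-t**2), t, n))
print(sp.expand(H))                                  # 16*t**4 - 48*t**2 + 12

# check the Hermite equation (4.51): H'' - 2t H' + 2n H = 0
print(sp.simplify(sp.diff(H, t, 2) - 2*t*sp.diff(H, t) + 2*n*H))   # 0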

Example 4.3.4. Consider another particular case of Eq. (4.46) that can be solved by the Laplace method,

d²y/dt² − ty = 0. (4.56)
Figure 4.1: Layout of contours in the complex plane of k needed for saddle-point estimations
of the Airy function described in Eq. (4.58).

Solution. Following the general Laplace method described above we derive

P(k) = k², Q(k) = −1, Z(k) = −c exp(−k³/3). (4.57)

According to Eq. (4.47), the general solution of Eq. (4.56) can be represented as

y(t) = const ∫_C dk exp(kt − k³/3), (4.58)
C

where we choose an infinite integration path, shown in Fig. (4.1), such that the values of the integrand at the two (infinite) end points coincide (and are equal to zero). Indeed, this choice guarantees that the infinite end points of the contour lie in the regions where Re(k³) > 0 (shaded regions I, II, III in Fig. (4.1)). Moreover, by choosing a contour that starts in region I and ends in region II (blue contour C in Fig. (4.1)), we guarantee that the solution of Eq. (4.56) remains finite at t → +∞. Notice that the contour can be shifted arbitrarily under the condition that the end points remain in sectors I and II. In particular, one can shift the contour to coincide with the imaginary axis (in the complex k plane shown in Fig. (4.1)); then Eq. (4.58) becomes (up to a constant) the so-called Airy function

Ai(t) = (1/π) ∫_0^∞ dz cos(z³/3 + zt) = (1/2π) Re ∫_{−∞}^{∞} dz exp(i z³/3 + izt). (4.59)

The asymptotic expression for the Airy function at t > 0, t ≫ 1, can be derived utilizing the saddle-point method described in Section 2.4. At k = ±√t, the integrand in Eq. (4.58) has an extremum along the direction of its “steepest descent” from the saddle point. Since the contour end-points should stay in sectors I and II, we shift the contour to the left of the imaginary axis while keeping it parallel to the imaginary axis. (See C1, shown in red in Fig. (4.1), which crosses the real axis at k = −√t.) The integral is dominated by the saddle-point at k = −√t, thus resulting (after the substitution k = −√t + iz, changing the integration variable from k to z, expanding in z, keeping the quadratic term in z, ignoring higher order terms, and evaluating a Gaussian integral) in the following asymptotic estimate for the Airy function

t > 0, t ≫ 1: Ai(t) ≈ (1/2π) ∫_{−∞}^{+∞} dz exp(−(2/3)t^{3/2} − √t z²) = exp(−2t^{3/2}/3)/(√(4π) t^{1/4}). (4.60)

(Notice that one can also provide an alternative argument and exclude the contribution of the second, potentially dominating, saddle-point k = +√t simply by observing that the Gaussian integral evaluated along the steepest descent path from this saddle-point gives zero contribution after evaluating the real part of the result, as required by Eq. (4.59).)
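Comparing Eq. (4.60) with scipy's built-in Airy function (an added sketch, assuming scipy):

import numpy as np
from scipy.special import airy

for t in (2.0, 5.0, 10.0):
    ai = airy(t)[0]                                       # Ai(t)
    asym = np.exp(-2.0 * t**1.5 / 3.0) / (np.sqrt(4.0 * np.pi) * t**0.25)
    print(t, ai, asym, asym / ai)                         # the ratio tends to 1 as t grows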

4.4 Linear Static Problems


We will now turn to problems which normally appear in the static case. In many natural
and engineered systems, a dynamic system that reaches equilibrium may have spatial char-
acteristics that are non-trivial and worthy of analysis. Here we discuss a number of linear
spatially one-dimensional problems that are relevant to applications.

4.4.1 One-Dimensional Poisson Equation

Poisson's equation describes, in the case of electrostatics, the potential field caused by a given charge distribution.
Let us discuss the function u(x) whose distribution over a finite spatial interval is described by the following set of equations

d²u(x)/dx² = f(x), ∀x ∈ (a, b), with u(a) = u(b) = 0. (4.61)

We introduce the Green function which satisfies

∀a < x, y < b: d²G(x; y)/dx² = δ(x − y), G(a; y) = G(b; y) = 0. (4.62)

Notice that the Green function now depends on both x and y.
According to Eq. (4.62), d²G(x; y)/dx² = 0 if x ≠ y. That is, G(x; y) is a linear function of x for all x ≠ y. Then, enforcing the boundary conditions, one derives

x > y: G(x; y) = B(x − b), (4.63)
y > x: G(x; y) = A(x − a). (4.64)
Furthermore, given that the differential equation in (4.62) is of second order, G(x; y) should be continuous at x = y and the jump of its first derivative at x = y should be equal to unity. Summarizing, one finds

G(x; y) = (1/(b − a)) { (y − b)(x − a), x < y
                        (y − a)(x − b), x > y } (4.65)

The solution of Eq. (4.61) is given by the convolution

u(x) = ∫_a^b dy G(x; y)f(y). (4.66)
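A numerical illustration of Eqs. (4.65)–(4.66) (added here; the source term is an arbitrary choice, and numpy is assumed): compare the Green-function convolution with a finite-difference solve of the boundary value problem.

import numpy as np

a, b, n = 0.0, 1.0, 400
x = np.linspace(a, b, n + 1)
h = x[1] - x[0]
f = np.sin(3 * np.pi * x)                       # arbitrary source

# Green-function solution, Eq. (4.66)
X, Y = np.meshgrid(x, x, indexing='ij')
G = np.where(X > Y, (Y - a) * (X - b), (Y - b) * (X - a)) / (b - a)
u_green = np.trapz(G * f[None, :], x, axis=1)

# finite-difference solve of u'' = f with u(a) = u(b) = 0
A = (np.diag(-2.0 * np.ones(n - 1)) + np.diag(np.ones(n - 2), 1)
     + np.diag(np.ones(n - 2), -1)) / h**2
u_fd = np.zeros(n + 1)
u_fd[1:-1] = np.linalg.solve(A, f[1:-1])

print(np.max(np.abs(u_green - u_fd)))           # small (discretization error)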

Example 4.4.1. Find the Green function for the equation Lu(x) = f(x), where the operator L = −(d/dx)(x (d/dx)), and the boundary conditions on u(x) are u(1) = 0 and u′(2) = 0.
Solution. We construct the Green function by finding a function that satisfies the necessary properties. Property (i.): The Green function satisfies LG(x; y) = 0 for x ≠ y. Two linearly independent solutions to the homogeneous equation are u(x) = const and u(x) = log(x). Therefore, for any y with 1 < y < 2, we can write

G(x; y) = { c1 + c2 log(x), x < y
            c3 + c4 log(x), x > y }

Property (ii.): The Green function satisfies the boundary conditions. Enforcing the boundary conditions u(1) = 0 and u′(2) = 0 gives c1 = c4 = 0,

G(x; y) = { c2 log(x), x < y
            c3,        x > y }

Property (iii.): G(x; y) is continuous and Gx(x; y) has a jump of magnitude −1/y at x = y. To ensure continuity at y, we must set c2 log(y) = c3. This ensures continuity, but does not yet give the appropriate jump condition. Computing the derivative of G as x → y^− and as x → y^+ gives lim_{x→y^−} Gx = c2/y and lim_{x→y^+} Gx = 0. That is, we must set c2 = 1 to ensure that the derivative has a jump of magnitude −1/y at x = y:

G(x; y) = { log(x), x < y
            log(y), x > y }
Remark. Explanation for Property (iii.): To determine the magnitude of the jump at x = y, integrate the ODE over the interval y − ε < x < y + ε and take the limit ε → 0:

−(d/dx)(x du/dx) = δ(x − y)
lim_{ε→0} ∫_{y−ε}^{y+ε} −(d/dx)(x du/dx) dx = lim_{ε→0} ∫_{y−ε}^{y+ε} δ(x − y) dx
lim_{ε→0} [−x du/dx]_{y−ε}^{y+ε} = 1
⇒ u′(y^+) − u′(y^−) = −1/y.

Exercise 4.6. Find the Green function for the equation Lu(x) = f(x), where the operator L = −d²/dx² − κ², and the boundary conditions on u(x) are

(a) u(0) = u′(1) = 0;

(b) u(x) is periodic with period 2π.

4.5 Sturm–Liouville (spectral) theory


We enter the study of differential operators, which map a function to another function, and it is therefore imperative to first discuss the Hilbert space where the functions reside.

4.5.1 Hilbert Space and its completeness

Let us first review some basic properties of a Hilbert space, in particular, the condition of its completeness. (These will be discussed at greater length in the companion Math 527 course of the AM core.) A linear (vector) space is called a Hilbert space, H, if

1. For any two elements, f and g, there exists a scalar product (f, g) which satisfies the following properties:

(a) linearity with respect to the second argument,

(f, αg1 + βg2) = α(f, g1) + β(f, g2),

for any f, g1,2 ∈ H and α, β ∈ C;

(b) self-conjugation (Hermitian symmetry),

(f, g) = (g, f)*;

(c) non-negativity of the norm, ||f||² := (f, f) ≥ 0, where (f, f) = 0 means f = 0.

2. H has a countable basis, B, i.e. a countable set of elements, B := {fn, n = 1, …, ∞}, such that any element g ∈ H can be represented as a linear combination of the fn; that is, for any g ∈ H, there exist coefficients cn such that g = Σn cn fn.

Remark. The Hilbert space defined above for complex-valued functions can also be consid-
ered over real-valued functions. In the following we will use the two interchangeably.
Any basis B can be turned into an ortho-normal basis with respect to a given scalar product, i.e. x = Σ_{n=1}^{∞} (x, fn)fn, ||x||² = Σ_{n=1}^{∞} |(x, fn)|². (For example, the Gram-Schmidt process is a standard ortho-normalization procedure.)
One primary example of a Hilbert space is the L²(Ω) space of complex-valued functions f(x) defined on a domain Ω ⊆ Rⁿ such that ∫ dx |f(x)|² < ∞ (one may say, casually, that the square modulus of the function is integrable). In this case the scalar product is defined as

(f, g) := ∫ dx f*(x)g(x).

Properties 1a-c from the definition of the Hilbert space above are satisfied by construction, and property 2 can be proven (it is a standard proof in a course of mathematical analysis).
Consider a fixed infinite ortho-normal sequence of functions

{fn, n = 1, …, ∞, (fn, fm) = δnm}.

The sequence is a basis in L²(Ω) iff the following relation of completeness holds

Σ_{n=1}^{∞} fn*(x)fn(y) = δ(x − y). (4.67)

As customary for the δ-function (and other generalized functions), Eq. (4.67) should be understood as equality of the integrals of the two sides of Eq. (4.67) against a function from L²(Ω).

4.5.2 Hermitian and non-Hermitian Differential Operators

Consider functions from the Hilbert space L²(a, b) over the reals, i.e. functions of a single variable, x ∈ R, over a bounded domain, a ≤ x ≤ b, with an integrable square modulus, and a linear differential operator L̂ acting on these functions.
A differential operator is called Hermitian (self-conjugate) if for any two functions (from a certain class of interest, e.g. from L²(a, b)) the following relation holds:

(f, L̂g) := ∫_a^b dx f(x)L̂g(x) = ∫_a^b dx g(x)L̂f(x) = (g, L̂f). (4.68)

It is clear from how the condition (4.68) was stated that it depends on both the class of functions and the operator L̂. For example, considering functions f and g with zero boundary conditions, or functions which are periodic and whose derivatives are periodic too, results in the statement that the operator

L̂ = d²/dx² + U(x), (4.69)

where U(x) is a function mapping from R to R, is Hermitian.
The natural generalization of the Schrödinger operator (4.69) is the Sturm-Liouville operator

L̂ = d²/dx² + Q(x) d/dx + U(x). (4.70)

The Sturm-Liouville operator is not Hermitian, i.e. Eq. (4.68) does not hold in this case. However, it is straightforward to check that with zero boundary conditions or periodic boundary conditions imposed on the functions, f(x) and g(x), and their derivatives, the following generalization of Eq. (4.68) holds

∫_a^b dx ρ(x)f(x)L̂g(x) = ∫_a^b dx ρ(x)g(x)L̂f(x), (4.71)

where dρ/dx = Qρ ⇒ ρ = exp(∫ dx Q). (4.72)

Consider now the eigen-functions fn of the operator L̂, which satisfy

L̂fn = λn fn, (4.73)

where λn is the spectral parameter (eigenvalue) of the eigen-function, fn, of the Sturm-Liouville operator (4.70), indexed by n. (We assume that ∀n ≠ m: λn ≠ λm.)
Notice that the value of λn is not specified in Eq. (4.73); finding the values of λn for which there exists a non-trivial solution satisfying the respective boundary conditions (describing the class of functions considered) is an instrumental part of the Sturm-Liouville problem.
Observe that the conditions (4.71, 4.72) translate into

∫ dx ρ fn L̂fm = λm ∫ dx ρ fn fm = λn ∫ dx ρ fn fm,

which becomes the following eigen-function orthogonality condition

∫ dx ρ fn fm = 0 (∀n ≠ m). (4.74)

As a corollary of this statement one also finds that in the Hermitian case the distinct eigen-functions are orthogonal to each other with unit weight, ρ = 1.
Let us check Eq. (4.74) on the example L̂0 = d²/dx², where Q(x) = U(x) = 0, over the functions which are 2π-periodic. cos(nx) and sin(nx), where n = 0, 1, …, are distinct eigen-functions with the eigenvalues λn = −n². Then, for all m ≠ n,

∫_0^{2π} dx cos(nx)cos(mx) = ∫_0^{2π} dx cos(nx)sin(mx) = ∫_0^{2π} dx sin(nx)sin(mx) = 0. (4.75)

Note that the example just discussed has a degeneracy: cos(nx) and sin(nx) are two distinct real eigen-functions corresponding to the same eigenvalue. Therefore, any combination of the two is also an eigen-function corresponding to the same eigenvalue. Had we chosen another pair of the degenerate eigen-functions, say cos(nx) and sin(nx) + cos(nx), the two would not be orthogonal to each other. What we see in this example is that the eigen-functions corresponding to the same eigenvalue should be specially selected to be orthogonal to each other.
We say that the set of eigen-functions, {fn(x) | n ∈ N}, of L̂ is complete over a given class (of functions) if any function from the class can be expanded into a series over the eigen-functions from the set,

f = Σn cn fn. (4.76)

Relating this property of the eigen-functions to the completeness of a Hilbert-space basis, one observes that the eigen-vectors of a self-adjoint (Hermitian) operator over L²(Ω) form an ortho-normal basis of L²(Ω).
Multiplying both sides of Eq. (4.76) by ρfn, integrating over the domain, and applying (4.74) to the right-hand side, one derives

cn = ∫ dx ρ fn f / ∫ dx ρ (fn)². (4.77)
Note that for the example L̂0 , Eq. (4.76) is a Fourier Series expansion of a periodic
function.
Returning to the general case and substituting Eq. (4.77) back into (4.76), one arrives at

f(x) = Σn ∫ dy ρ(y) [fn(x)fn(y) / ∫ dx ρ(x)(fn(x))²] f(y). (4.78)
If the set of functions {fn(x) | n} is complete, relation (4.78) should be valid for any function f from the considered class. Consistently with this statement, one observes that part of the integrand in Eq. (4.78) is just δ(x − y), the special function whose convolution with a function returns the function itself, i.e.

Σn fn(x)fn(y) / ∫ dx ρ(x)(fn(x))² = δ(x − y)/ρ(y). (4.79)

Therefore, one concludes that Eq. (4.79) is equivalent to the statement of completeness of the set of functions {fn(x) | n}.

Example 4.5.1. Check validity of Eq. (4.79), and thus completeness of the respective set
of eigen-functions, for our enabling example of L̂0 = d2 /dx2 over the functions which are
2π-periodic.

4.5.3 Hermite Polynomials.

Let us now depart from our enabling example and consider the case of Q(x) = −2x and U(x) = 0, i.e.

L̂2 = d²/dx² − 2x d/dx,  ρ(x) = exp(−x²), (4.80)

over the class of functions mapping from R to R which also decay sufficiently fast at x → ±∞. That is, we are now discussing

L̂2 fn = λn fn. (4.81)

Changing from fn(x) to Ψn(x) = fn(x)√ρ, one thus arrives at the following equation for Ψn:

e^{−x²/2} L̂2 fn(x) = e^{−x²/2} L̂2 [e^{x²/2} Ψn(x)] = d²Ψn/dx² + (1 − x²)Ψn = λn Ψn. (4.82)

Observe that when λn = −2n, Eq. (4.81) coincides with the Hermite Eq. (4.51).
Let us look for a solution of Eq. (4.81) in the form of a Taylor series around x = 0,

fn(x) = Σ_{k=0}^{∞} ak x^k. (4.83)

Substituting the series into the Hermite equation and equating terms with the same powers of x, one arrives at the following recursion for the expansion coefficients:

∀k = 0, 1, …: a_{k+2} = (2k + λn) ak / ((k + 2)(k + 1)). (4.84)
This results in the following two linearly independent solutions (even and odd, respectively, with respect to the x → −x transformation) of Eq. (4.81), represented in the form of series:

fn^{(e)}(x) = a0 (1 + (λn/2!)x² + (λn(4 + λn)/4!)x⁴ + ⋯), (4.85)

fn^{(o)}(x) = a1 (x + ((2 + λn)/3!)x³ + ((2 + λn)(6 + λn)/5!)x⁵ + ⋯), (4.86)

where the two first coefficients of the series (4.83) are kept as free parameters. Observe that the series (4.85) and (4.86) terminate if λn = −4n and λn = −4n − 2, respectively, where n = 0, 1, …; then the fn are polynomials, in fact the Hermite polynomials. We combine the two cases into one and use the standard notation, Hn(x), for the Hermite polynomial of n-th order, which satisfies Eq. (4.51). Per the statement of Exercise 4.7, the Hermite polynomials are normalized and orthogonal (weighted with ρ) to each other.
Exercise 4.7. (a) Prove that

∫_{−∞}^{+∞} dt e^{−t²} Hn(t)Hm(t) = 2ⁿ n! √π δnm, (4.87)

where δnm is unity when n = m and zero otherwise (the Kronecker symbol).

(b) Verify that the set of functions

{ Ψn(x) = (π^{1/4} √(2ⁿ n!))^{−1} exp(−x²/2) Hn(x) | n = 0, 1, … }, (4.88)

satisfies

Σ_{n=0}^{∞} Ψn(x)Ψn(y) = δ(x − y). (4.89)

Hint: The following identity may be useful:

dⁿ/dxⁿ exp(−x²) = √π ∫_{−∞}^{+∞} (dq/2π) (iq)ⁿ exp(−q²/4 + iqx).
A corollary of Exercise 4.7 is the statement of “completeness”: the set of functions (4.88) forms an orthogonal basis of the Hilbert space of functions f(x) ∈ L², i.e. satisfying ∫_{−∞}^{∞} |f(x)|² dx < ∞. (A bit more formally, an orthogonal basis for the L² functions is a complete orthogonal set. For an orthogonal set, completeness is equivalent to the fact that the 0 function is the only function f ∈ L² which is orthogonal to all functions in the set.)
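A numerical spot-check of the orthogonality relation (4.87) (added here, assuming numpy; the pairs (n, m) tested are arbitrary):

import numpy as np
from numpy.polynomial.hermite import hermgauss
from math import factorial, pi, sqrt

# Gauss-Hermite quadrature integrates exactly against the weight exp(-t^2)
t, w = hermgauss(60)

def H(n, x):
    c = np.zeros(n + 1); c[n] = 1.0
    return np.polynomial.hermite.hermval(x, c)    # physicists' Hermite H_n

for n, m in [(2, 2), (3, 3), (2, 3), (1, 4)]:
    integral = np.sum(w * H(n, t) * H(m, t))
    expected = 2**n * factorial(n) * sqrt(pi) if n == m else 0.0
    print(n, m, integral, expected)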
Note that the last equality in Eq. (4.82) is the spectral version of the Schrödinger PDE (in imaginary time) in the quadratic potential,

∂x²Ψ(t; x) + (1 − x²)Ψ(t; x) = −∂tΨ(t; x), (4.90)

discussed next.
4.5.4 Case study: Schrödinger Equation in 1d

The Schrödinger equation∗

d²Ψ(x)/dx² + (E − U(x))Ψ(x) = 0, (4.91)

describes the (complex-valued) wave function of a quantum particle de-localized in x ∈ R, with energy E, in the potential U(x). We seek solutions with |Ψ(x)| → 0 at x → ±∞, and our goal here is to describe the spectrum (allowed values of E) and the respective eigen-functions.
As a simple, but instructive, example consider the case of a quantum particle in a rectangular potential, i.e. U(x) = U0 at x ∉ [0, a] and zero otherwise. The general solution of Eq. (4.91) becomes

U0 > E > 0:
ΨE(x) = { cL exp(x√(U0 − E)), x < 0
          a+ exp(ix√E) + a− exp(−ix√E), x ∈ [0, a]
          cR exp(−x√(U0 − E)), x > a } (4.92)

U0 < E:
ΨE(x) = { cL+ exp(ix√(E − U0)) + cL− exp(−ix√(E − U0)), x < 0
          a+ exp(ix√E) + a− exp(−ix√E), x ∈ [0, a]
          cR+ exp(ix√(E − U0)) + cR− exp(−ix√(E − U0)), x > a } (4.93)

where we account for the fact that E cannot be negative (the ODE simply does not allow such solutions), and in the U0 > E > 0 regime we select, on each side, the one solution (of the two linearly independent solutions) which does not grow with x → ±∞.
The solutions in the three different intervals should be “glued” together, or, stating it less casually, Ψ and dΨ/dx should be continuous at all x ∈ R. These conditions applied to Eq. (4.92) or Eq. (4.93) result in algebraic “consistency” conditions for E. We expect to get a continuous spectrum at E > U0 and a discrete one at U0 > E > 0.

Example 4.5.2. Complete the calculations above for the case of U0 > E > 0 and find the allowed values of the discrete spectrum. What is the condition for the appearance of at least one discrete level?

Consider another example.

∗ This auxiliary Subsection can be dropped at a first reading. Material from this Subsection will not contribute to the midterm and final exams.
Example 4.5.3. Find the eigen-functions and the energies of the stationary states of the Schrödinger equation for an oscillator:

d²Ψ(x)/dx² + (E − x²)Ψ(x) = 0, (4.94)

where x ∈ R and Ψ : R → C.
Solution. As we already saw in the preceding section, the analysis of Eq. (4.94) reduces to studying the Hermite equation, with its spectral version described by Eq. (4.81). However, we will follow another route here. Let us introduce the so-called “creation” and “annihilation” operators

â = (i/√2)(d/dx + x),  ↠= (i/√2)(d/dx − x), (4.95)

and then rewrite the Schrödinger Eq. (4.94) as

ĤΨ(x) = ↠â Ψ(x) = ((E − 1)/2) Ψ(x). (4.96)

It is straightforward to check that the operator Ĥ is positive semi-definite on L²:

∫ dx Ψ*(x)ĤΨ(x) = ∫ dx Ψ*(x)↠âΨ(x) = ∫ dx |âΨ(x)|² ≥ 0,

where the equality is achieved only if

âΨ0(x) = (i/√2)(d/dx + x)Ψ0(x) = 0,

thus resulting in Ψ0(x) = A exp(−x²/2) and E0 = 1. We have just found the eigen-function and the eigenvalue corresponding to the lowest possible energy, the so-called ground state.
To find all other eigen-functions, corresponding to the so-called “excited” states, consider the commutation relations

â↠Ψ(x) = ↠â Ψ(x) + Ψ(x), (4.97)

↠â (â†)ⁿ Ψ(x) = (â†)² â (â†)^{n−1} Ψ(x) + (â†)ⁿ Ψ(x) = ⋯ = n (â†)ⁿ Ψ(x) + (â†)^{n+1} â Ψ(x). (4.98)

Introduce Ψn(x) := (â†)ⁿ Ψ0(x). Since âΨ0(x) = 0, the commutation relation (4.98) shows immediately that

((En − 1)/2) Ψn(x) = ĤΨn(x) = ↠â Ψn(x) = n Ψn(x),

i.e. En = 2n + 1.

We observe that the eigen-functions Ψn(x) of the states with energies En = 2n + 1 are expressed via the Hermite polynomials, Hn(x), introduced in Eq. (4.55):

Ψn(x) = An ((i/√2)(d/dx − x))ⁿ exp(−x²/2) = An (iⁿ/2^{n/2}) exp(x²/2) dⁿ/dxⁿ exp(−x²),

where we have used the operator identity (d/dx − x) exp(x²/2) = exp(x²/2) d/dx. From the condition of the Hermite-polynomial orthogonality (4.87) one derives An = (n!√π)^{−1/2}.

4.6 Phase Space Dynamics for Conservative and Perturbed Systems

4.6.1 Integrals of Motion

Consider the equation describing the conservative dynamics of a particle of unit mass in a potential (conservative means there is no dissipation of energy),

ẋ = v,  v̇ = −∂xU(x). (4.99)

The energy of the particle is

E = ẋ²/2 + U(x), (4.100)

which consists of the kinetic energy (the first term) and the potential energy (the second term). It is straightforward to check that the energy is constant, that is, dE/dt = 0. Therefore,

ẋ = ±√(2(E − U(x))), (4.101)

where the ± on the right hand side is chosen according to the initial condition for ẋ(0) (there may be multiple solutions corresponding to the same energy). Eq. (4.101) is separable, and it can thus be integrated, resulting in the following classic implicit expression for the particle coordinate as a function of time,

∫_{x0}^{x} dx′/√(2(E − U(x′))) = ±t, (4.102)

which depends on the particle's initial position, x0, and its energy, E, which is conserved.
In the example above, E is an integral of motion, or equivalently a first integral, which is defined as a quantity that is conserved along solutions to the differential equation. In this case E is constant along the trajectories x(t).
The idea of an integral of motion, or first integral, extends to conservative systems described by a system of ODEs. (Here and in the next section we follow [2, 1].) For example, consider the situation where a quantity H, called the Hamiltonian, is a twice-differentiable function of 2n variables, p1, …, pn (momenta) and q1, …, qn (coordinates). The corresponding system of dynamic equations is called Hamilton's canonical equations,

ṗi = −∂H/∂qi,  q̇i = ∂H/∂pi  (∀i = 1, …, n). (4.103)

Computing the rate of change of the Hamiltonian in time,

dH/dt = Σ_{i=1}^{n} (∂H/∂pi ṗi + ∂H/∂qi q̇i) = Σ_{i=1}^{n} (q̇i ṗi − ṗi q̇i) = 0, (4.104)

we observe that H is constant, that is, H is an integral of motion.


This dynamical system with a single degree of freedom (”single” particle), described by
(4.99), is an example of a canonical Hamilton system. Energy (4.100) of the Hamiltonian
system, considered as a function of x and v, is the Hamiltonian. x and v correspond to
(scalar) q and p respectively. We continue exploring a single particle, N = 1, Hamiltonian
system in the next Subsection – Section 4.6.2.
We will also discuss Hamiltonian systems, as derived from the variational principle,
in the optimization part of the course (early second semester). We reiterate that reader
interested in a broader and comprehensive mathematical introduction into the subject of
the Hamiltonian dynamics is advised to consult with [2].

4.6.2 Phase Portrait

Following [2] and Section 1.3 of [1], we now turn to discussing the famous example of a conservative (Hamiltonian) system with one degree of freedom, Eq. (4.99).
We have established above that the energy (Hamiltonian) is conserved, and it is thus instructive to study isolines, or level curves, of the energy drawn in the two-dimensional (x, v) space, {(x, v) | v²/2 + U(x) = E}. To draw a level curve of the energy, we simply fix E and evaluate how (x, v) evolves with t according to Eqs. (4.99).
Let us build some intuition for level curves using an analogy. Suppose that the potential curve is the same shape as a length of wire (literally the same shape, i.e. if the potential curve is a parabola the wire is shaped like a parabola). We will say that this wire is perfectly rigid and frictionless. Now imagine that there is a ball or bead which slides along the wire, subject to gravity. One may start the bead at any position on the wire, with any initial velocity (left or right). The path that the bead traces out in position-velocity space is (qualitatively) a level curve of the corresponding potential function.
Figure 4.2: Phase portrait, i.e. (x, v) level-curves of the conservative system Eq. (4.99) with the potential U(x) = kx²/2, with k > 0 (top) and k < 0 (bottom).
Figure 4.3: Four potentials, panels (a)-(d). What is the appearance of the level curves (phase portrait) of the energy for each of these potentials?

Consider the quadratic potential, U(x) = kx²/2. The two cases of positive and negative k are illustrated in Fig. (4.2); see the snippet Portrait.ipynb. We observe that, with the exception of the equilibrium position (x, v) = (0, 0), the level curves of the energy are smooth. Generalizing, we find that the exceptional points are critical, or stationary, points of the Hamiltonian, which are points where the derivatives of the Hamiltonian with respect to the canonical variables, q and p, are zero. Note that each level curve, which we draw observing how a particle slides in the potential well, U(x), also has a direction (not shown in Fig. (4.2)).
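In the spirit of the Portrait.ipynb snippet mentioned above, here is a minimal sketch of such a phase portrait (an addition for illustration, assuming matplotlib):

import numpy as np
import matplotlib.pyplot as plt

k = 1.0                                    # try k = -1.0 for the hyperbolic portrait
x = np.linspace(-2, 2, 400)
v = np.linspace(-2, 2, 400)
X, V = np.meshgrid(x, v)
E = V**2 / 2 + k * X**2 / 2                # energy (4.100) for U(x) = k x^2 / 2

plt.contour(X, V, E, levels=15)            # level curves of E form the phase portrait
plt.xlabel('x'); plt.ylabel('v')
plt.title('Level curves of E for U(x) = k x^2/2')
plt.show()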
Consider the case where k > 0, and fix the value of the energy E. Due to Eq. (4.100), the coordinate of the particle, x, should lie within the set where the potential energy does not exceed the energy, {x | U(x) ≤ E}. We observe that E ≥ 0, and that equality corresponds to the particle sitting still at the minimum of the potential, which is called a critical point, or fixed point. Furthermore, the larger the kinetic energy, the smaller the potential energy. Any position where the particle changes its velocity from positive to negative or vice-versa is called a turning point. For any E > 0, there are two turning points, x± = ±√(2E/k). Testing different values of E > 0, we sketch different level curves, resulting in different ellipses centered around 0. This is the canonical example of an oscillator. The motion of the particle is periodic, and its period, T, can be computed by evaluating Eq. (4.102) between the turning points,

T := 2 ∫_{x−}^{x+} dx/√(2(E − U(x))) = 2 ∫_{−√(2E/k)}^{√(2E/k)} dx/√(2E − kx²) = 2π/√k. (4.105)

For this case the period is constant, 2π/√k, and we note that it is independent of the energy E (i.e. of the oscillation amplitude).

In the $k < 0$ case, where all values of the energy (positive and negative) are accessible, $x = v = E = 0$ is the critical point again. When $E > 0$ there are no turning points (points where the direction of the velocity changes); when $E < 0$ the particle turns exactly once. If $x(0) \ne 0$ then, regardless of the sign of $E$, $|x(t)|$ grows with $t$ and becomes unbounded as $t \to \infty$. As seen in Fig. (4.2)b, in this case the $(x, v)$ phase space splits into four quadrants, separated by the $v = \pm\sqrt{|k|}\,x$ separatrices. The level curves of the energy are hyperbolas centered around $x = v = 0$.
A qualitative study of the dynamics in more complex potentials U (x) can be conducted
by sketching the level curves in a similar way.

Example 4.6.1. Sketch level curves of the energy for the Kepler potential, $U(x) := -\frac{1}{x} + \frac{C}{x^2}$, and for the potentials shown in Fig. (4.3).

4.6.3 Small Perturbation of a Conservative System

Let us analyze the following simple but very instructive example of a system which deviates very slightly from the quadratic potential with $k = 1$:
$$\dot x = v + \varepsilon f(x, v), \qquad \dot v = -x + \varepsilon g(x, v), \tag{4.106}$$
in the regime where $\varepsilon \ll 1$ and $x^2 + v^2 \le R^2$.


For $\varepsilon = 0$, and assuming that $x(0) = x_0$ and $v(0) = 0$, one derives
$$x(t) = x_0\cos(t), \qquad v(t) = -x_0\sin(t).$$
We calculate the energy and find that $E = (x^2 + v^2)/2$ is conserved in this limit, so the system cycles with the period $T = 2\pi$.
The general case where $0 < \varepsilon \ll 1$ is not conservative. Let us examine how the energy changes with time. One derives
$$\frac{d}{dt}E = x\dot x + v\dot v = \varepsilon\,(xf + vg) = \varepsilon\left(x^{(0)}f + v^{(0)}g\right) + O(\varepsilon^2), \tag{4.107}$$
where $x^{(0)}, v^{(0)}$ denote the unperturbed solution. Integrating over a period, one arrives at the following expression for the gain (or loss) of energy,
$$\Delta E = \varepsilon\int_0^{2\pi} dt\left(x^{(0)}f + v^{(0)}g\right) + O(\varepsilon^2) = \varepsilon\oint\left(-f\,dv + g\,dx\right) + O(\varepsilon^2), \tag{4.108}$$
where the integral is taken over the level curve, which is also the iso-energy cycle, of the unperturbed ($\varepsilon = 0$) system in the $(x, v)$ space. Obviously $\Delta E$ depends on $x_0$.
For the case of increasing energy, ∆E > 0, we see an unwinding spiral in the (x, v)
plane. For the case of decreasing energy, ∆E < 0, the spiral contracts to a stationary point.

There are also systems where the sign of $\Delta E$ depends on $x_0$. Consider, for example, the van der Pol oscillator:
$$\ddot x = -x + \varepsilon\dot x(1 - x^2). \tag{4.109}$$
As in Eq. (4.108), we integrate $\frac{d}{dt}E$ over a period, which in this case gives
$$\Delta E = \varepsilon\int_0^{2\pi}\dot x^2(1 - x^2)\,dt + O(\varepsilon^2) = \varepsilon x_0^2\int_0^{2\pi}\sin^2 t\left(1 - x_0^2\cos^2 t\right)dt + O(\varepsilon^2) = \pi\varepsilon\left(x_0^2 - \frac{x_0^4}{4}\right) + O(\varepsilon^2). \tag{4.110}$$

The $O(\varepsilon)$ part of this expression is zero when $x_0 = 2$, positive when $x_0 < 2$ and negative when $x_0 > 2$. Therefore, if we start with $x_0 < 2$ the system will be gaining energy, and the maximum value of $x(t)$ within a period will approach the value $2$. On the contrary, if $x_0 > 2$ the system will lose energy, and the maximum value of $x(t)$ over a period will decrease, approaching the same value $2$. This type of behavior, established for $\Delta E$ including only $O(\varepsilon)$ contributions (and thus ignoring all contributions of higher order in the small $\varepsilon$), is characterized as a stable limit cycle, defined by
$$\Delta E(x_0) = 0 \quad\text{and}\quad \frac{d}{dx_0}\Delta E(x_0) < 0.$$
In summary, the van der Pol oscillator is an example of behavior where the perturbation is singular, meaning that it is categorically different from the unperturbed case. Indeed, in the unperturbed case the particle oscillates along an orbit which depends on the initial condition, while in the perturbed case the particle ends up moving along the same limit cycle.

Exercise 4.8. Recall the properties of stable / unstable limit cycles:
$$\text{the limit cycle is stable at } x = x_0 \text{ if } \Delta E(x_0) = 0 \text{ and } \frac{d}{dx_0}\Delta E(x_0) < 0,$$
$$\text{the limit cycle is unstable at } x = x_0 \text{ if } \Delta E(x_0) = 0 \text{ and } \frac{d}{dx_0}\Delta E(x_0) > 0.$$
Suggest an example of perturbations, $f$ and $g$, in Eq. (4.106) which leads to (a) an unstable limit cycle at $x_0 = 2$, and (b) one stable limit cycle at $x_0 = 2$ and one unstable limit cycle at $x_0 = 4$. Illustrate your suggested perturbations by building a computational snippet, e.g. along the lines of the sketch below.
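A minimal sketch one could start from follows. It integrates Eq. (4.106) with the van der Pol choice $f = 0$, $g = v(1 - x^2)$ from Eq. (4.109) as a placeholder; replace $f$ and $g$ with your suggested perturbations. The parameter values and integration time are illustrative choices.

```python
# A sketch of the requested computational snippet; f and g below are the van
# der Pol placeholder (Eq. 4.109), to be replaced by your own perturbations.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

eps = 0.05
f = lambda x, v: 0.0
g = lambda x, v: v * (1 - x**2)

def rhs(t, y):
    x, v = y
    return [v + eps * f(x, v), -x + eps * g(x, v)]   # Eq. (4.106)

t = np.linspace(0, 400, 20000)
for x0 in (0.5, 3.0):  # start inside and outside the expected limit cycle x0 = 2
    sol = solve_ivp(rhs, (t[0], t[-1]), [x0, 0.0], t_eval=t, rtol=1e-8)
    plt.plot(sol.y[0], sol.y[1], lw=0.5, label=f"x(0) = {x0}")
plt.xlabel("x"); plt.ylabel("v"); plt.legend(); plt.gca().set_aspect("equal")
plt.show()
```

Both trajectories should spiral onto the same closed curve of amplitude (approximately) $2$, in agreement with the $\Delta E$ analysis above.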

Consider another ODE example:
$$\dot I = \varepsilon\left(a + b\cos(\theta/\omega)\right), \qquad \dot\theta = \omega, \tag{4.111}$$

where $\omega, \varepsilon, a, b$ are constants, and the $\varepsilon$-term in the first of Eqs. (4.111) is a perturbation. When $\varepsilon$ is zero, $I$ is an integral of motion (meaning that it is constant along solutions of the ODE), and we think of $\theta$ as an angle in the phase space increasing linearly with the frequency $\omega$. Note that the unperturbed system is equivalent to the one described by Eq. (4.106).

Example 4.6.2.
(a) Show that one can transform the unperturbed (i.e. $\varepsilon = 0$) version of the system described by Eq. (4.106) into the unperturbed version of the system described by Eq. (4.111) via the following transformation (change of variables):
$$v = \sqrt{I/2}\,\cos(\theta/\omega), \qquad x = \sqrt{I/2}\,\sin(\theta/\omega). \tag{4.112}$$
(b) Restate Eq. (4.111) in the $(x, v)$ variables.

The transformation discussed in Example 4.6.2 is a so-called canonical transformation, one that preserves the Hamiltonian structure of the equations. In this case the Hamiltonian, which is generally a function of $\theta$ and $I$, depends only on $I$, $H = I\omega$, and one can indeed rewrite the unperturbed version of Eq. (4.111) as
$$\dot\theta = \frac{\partial H}{\partial I} = \omega, \qquad \dot I = -\frac{\partial H}{\partial\theta} = 0, \tag{4.113}$$
therefore interpreting $\theta$ and $I$ as the new coordinate and the new momentum respectively.
Averaging the perturbed Eq. (4.111) over one ($2\pi\omega$) angle revolution, as done in Section 4.6.3, one arrives at the following expression for the change in $I$ over the $2\pi$-period (of time):
$$\Delta I = 2\pi\varepsilon a. \tag{4.114}$$
Taking many, $2\pi n\omega$, revolutions, replacing $2\pi n$ by $t$, and $\Delta I$ by $J$, where the latter is the action averaged over time $t$, one arrives at the following equation,
$$\dot J = \varepsilon a, \tag{4.115}$$
which has the solution $J(t) = J_0 + \varepsilon a t$.


In fact Eqs. (4.111) can also be solved exactly,
$$I(t) = \varepsilon a t + \varepsilon b\sin t, \tag{4.116}$$
and one can check the consistency: the solution of the averaged Eq. (4.115) indeed does not deviate (with time) from the exact solution of Eq. (4.111),
$$\omega \ne 0: \qquad |J(t) - I(t)| \le O(1)\,\varepsilon. \tag{4.117}$$
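A quick numerical check of the estimate (4.117) is sketched below (the parameter values are arbitrary illustrative choices): it integrates Eqs. (4.111) and compares the exact $I(t)$ with the averaged $J(t) = \varepsilon a t$.

```python
# A sketch verifying Eq. (4.117): the exact action I(t) of Eq. (4.111) stays
# within O(eps) of the averaged (adiabatic) solution J(t) = eps*a*t.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

eps, a, b, omega = 0.01, 1.0, 2.0, 1.5   # illustrative constants

def rhs(t, y):
    I, theta = y
    return [eps * (a + b * np.cos(theta / omega)), omega]

t = np.linspace(0, 200, 4000)
sol = solve_ivp(rhs, (t[0], t[-1]), [0.0, 0.0], t_eval=t, rtol=1e-10)
J = eps * a * t                                    # averaged solution, J(0) = 0
print("max |J - I| / eps =", np.max(np.abs(J - sol.y[0])) / eps)  # stays O(1)
plt.plot(t, sol.y[0], label="exact I(t)")
plt.plot(t, J, "--", label="averaged J(t)")
plt.xlabel("t"); plt.legend(); plt.show()
```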



In a general $n$-dimensional case one considers the following system of bare (unperturbed) differential equations,
$$\dot I = 0, \quad \dot\theta = \omega(I), \qquad I := (I_1, \cdots, I_n), \quad \theta := (\theta_1, \cdots, \theta_n), \tag{4.118}$$
where thus each component of $I$ is an integral of motion of the unperturbed system of equations. The perturbed version of Eq. (4.118) becomes
$$\dot I = \varepsilon g(I, \theta, \varepsilon), \qquad \dot\theta = \omega(I) + \varepsilon f(I, \theta, \varepsilon), \tag{4.119}$$

where $f$ and $g$ are $2\pi\omega$-periodic functions of each of the components of $\theta$. Since $I$ changes slowly, due to the smallness of $\varepsilon$, the perturbed system can be substituted by a much simpler averaged system for the slow (adiabatic) variables, $J(t) = I(t) + O(\varepsilon)$:
$$\dot J = \varepsilon G(J), \qquad G(J) := \frac{\oint g(J, \theta, 0)\,d\theta}{\oint d\theta}, \tag{4.120}$$
where, as in Section 4.6.3, $\oint$ stands for averaging over the period (one rotation) in the phase space. Notice that the procedure of averaging over the periodic motion may break down in higher dimensions, $n > 1$, if the system has resonances, i.e. if $\sum_i N_i\omega_i = 0$, where the $N_i$ are integers.
If the perturbed system is Hamiltonian, with $\theta$ playing the role of the generalized coordinates and $I$ of the generalized momenta, then Eqs. (4.119) become
$$\dot I = -\frac{\partial H}{\partial\theta}, \qquad \dot\theta = \frac{\partial H}{\partial I}. \tag{4.121}$$
In this case, averaging the rhs of the first equation in Eq. (4.121) over $\theta$ results in $\dot J = 0$. This means that the slow variables, $J_1, \cdots, J_n$, also called adiabatic invariants, do not change with time. Notice that the main difficulty in applying this rather powerful approach consists in finding the proper variables which remain integrals of motion of the unperturbed system.
Chapter 5

Partial Differential Equations.

A partial differential equation (PDE) is a differential equation that contains one or more unknown multivariate functions and their partial derivatives. We begin our discussion by introducing first-order PDEs, and how to reduce them to a system of ODEs by the method of characteristics. We then utilize ideas from the method of characteristics to classify (hyperbolic, elliptic and parabolic) linear, second-order PDEs in two dimensions (Section 5.2). We will discuss how to generalize and solve elliptic PDEs, normally associated with static problems, in Section 5.3. Hyperbolic PDEs, discussed in Section 5.4, are normally associated with waves. There we take a more general approach originating from the intuition associated with waves as phenomena (a wave solving a hyperbolic PDE, e.g. a sound wave, is then a particular example). We will discuss the diffusion (also heat) equation as the main example of a (generalized to higher dimensions) parabolic PDE in Section 5.5.

5.1 First-Order PDE: Method of Characteristics


The method of characteristics reduces a PDE to a family of ODEs. The method applies mainly to first-order PDEs (meaning PDEs which contain only first-order derivatives) which are moreover linear in the first-order derivatives.
To motivate this technique we will first consider a function $u$ of two independent variables $(x, y)$, $u(x, y)$. Suppose that $u(x, y)$ solves
$$a(x, y, u)\frac{\partial u}{\partial x} + b(x, y, u)\frac{\partial u}{\partial y} = c(x, y, u). \tag{5.1}$$
Now consider some arbitrary differentiable parametric curve $(x(t), y(t))$ and consider the total derivative $\frac{du}{dt}$. By the chain rule we have
$$\frac{du}{dt} = \frac{dx}{dt}\frac{\partial u}{\partial x} + \frac{dy}{dt}\frac{\partial u}{\partial y}. \tag{5.2}$$


Observe that, since the parametric curve was arbitrary, we may choose to define it by
$$\frac{dx}{dt} = a(x, y, u), \qquad \frac{dy}{dt} = b(x, y, u), \qquad \frac{du}{dt} = c(x, y, u). \tag{5.3}$$
Substituting this into (5.2) gives us precisely (5.1). Thus we have a family of characteristic curves from which we can construct our solution to (5.1). (It is a family of curves since we never gave an initial condition for the system (5.3).)
Let $u(x) : \mathbb{R}^d \to \mathbb{R}$ be a function of a $d$-dimensional coordinate, $x := (x_1, \dots, x_d)$. Introduce the gradient vector, $\nabla_x u := (\partial_{x_i} u;\ i = 1, \dots, d)$, and consider the following equation, linear in $\nabla_x u$:
$$(V \cdot \nabla_x u) := \sum_{i=1}^{d} V_i\,\partial_{x_i} u = f, \tag{5.4}$$
where the velocity, $V(x) \in \mathbb{R}^d$, and the forcing, $f(x) \in \mathbb{R}$, are given functions of $x$.
First, consider the homogeneous version of Eq. (5.4),
$$(V \cdot \nabla_x u) = 0. \tag{5.5}$$
Introduce an auxiliary parameter (or dimension) $t \in \mathbb{R}$, call it time, and then introduce the characteristic equations,
$$\frac{dx(t)}{dt} = V(x(t)), \tag{5.6}$$
describing the evolution of the characteristic trajectory $x(t)$ in time according to the function $V$. A first integral is a function for which $\frac{d}{dt}F(x(t)) = 0$. Observe that any first integral of Eqs. (5.6) is a solution to Eq. (5.5), and that any function of the first integrals of Eqs. (5.6), $g(F_1, \dots, F_k)$, is also a solution to Eq. (5.5).
Indeed, a direct substitution of $u = g$ in Eq. (5.5) leads to the following sequence of equalities:
$$(V \cdot \nabla_x g) = \sum_{i=1}^{k}\frac{\partial g}{\partial F_i}\sum_{j=1}^{d} V_j\frac{\partial F_i}{\partial x_j} = \sum_{i=1}^{k}\frac{\partial g}{\partial F_i}\frac{d}{dt}F_i = 0. \tag{5.7}$$
The system of equations (5.6) has $d - 1$ first integrals that do not (directly) depend on $t$. Then a general solution to Eq. (5.5) is
$$u(x(t)) = g\left(F_1(x(t)), \dots, F_{d-1}(x(t))\right), \tag{5.8}$$
where $g$ is assumed to be sufficiently smooth (at least twice differentiable with respect to the first integrals).

Eq. (5.5) has a nice geometrical/flow interpretation. If we think of $V$, which is the $d$-dimensional vector of the coefficients of $\nabla_x u$, as a "velocity", then Eq. (5.5) means that the derivative of $u$ over $x$, projected onto the vector $V$, is equal to zero. Therefore, solving a PDE by the method of characteristics reduces to reconstructing the integral curves from the vectors $V(x)$, defined at every point $x$ of the space, which are tangent to the curves. Then, the solution $u(x)$ is constant along the curves. If, in the vicinity of each point $x$ of the space, one changes variables, $x \to (t, F_1, \dots, F_{d-1})$, where $t$ is considered as a parameter along an integral curve, and if the transformation is well defined (i.e. the Jacobian of the transformation is not zero), then Eq. (5.5) becomes $du/dt = 0$ along the characteristic.
Let us illustrate how to find a characteristic on the example of the following homogeneous PDE:
$$\partial_x u(x, y) + y\,\partial_y u(x, y) = 0.$$
The characteristic equations are $dx/dt = 1$, $dy/dt = y$, with the general solution $x(t) = t + c_1$, $y(t) = c_2\exp(t)$. The only first integral of the characteristic equations is $F(x, y) = y\exp(-x)$; therefore $u = g(F(x, y))$, where $g$ is an arbitrary function, is a general solution. It is useful to visualize the flow along the characteristics in the $(x, y)$ space.

Example 5.1.1. Find the characteristics of the following PDEs and use them to find the general solutions to the PDEs. Verify your solutions by direct substitution.

(a) $\partial_x u - y^2\partial_y u = 0$,
(b) $x\partial_x u - y\partial_y u = 0$,
(c) $y\partial_x u - x\partial_y u = 0$.
Visualize the characteristics in the $(x, y)$-plane.


Solution.
(a) The goal is to find curves parameterized by $t$ expressing the left hand side as a total derivative, giving $\frac{d}{dt}u(x(t), y(t)) = 0$. By the chain rule, this is equivalent to $\partial_x u\,\dot x(t) + \partial_y u\,\dot y(t) = 0$, which is equivalent to our PDE if we set $\dot x(t) = 1$ and $\dot y(t) = -y^2$. These are the characteristic equations. Their solutions are $x(t) = t + c_1$ and $y(t) = (t + c_2)^{-1}$. Eliminating $t$ gives $c = y^{-1} - x =: F(x, y)$ as the only first integral. General solutions to the PDE are of the form $u(x, y) = g(y^{-1} - x)$, where $g : \mathbb{R} \to \mathbb{R}$ can be any function.

(b) The goal is to find curves parameterized by $t$ expressing the left hand side as a total derivative, giving $\frac{d}{dt}u(x(t), y(t)) = 0$. By the chain rule, this is equivalent to $\partial_x u\,\dot x(t) + \partial_y u\,\dot y(t) = 0$, which is equivalent to our PDE if we set $\dot x(t) = x$ and $\dot y(t) = -y$. These are the characteristic equations. Their solutions are $x(t) = c_1 e^t$ and $y(t) = c_2 e^{-t}$. Eliminating $t$ gives $c = xy =: F(x, y)$ as the only first integral. General solutions to the PDE are of the form $u(x, y) = g(xy)$, where $g : \mathbb{R} \to \mathbb{R}$ can be any function.
(c) The goal is to find curves parameterized by $t$ expressing the left hand side as a total derivative, giving $\frac{d}{dt}u(x(t), y(t)) = 0$. By the chain rule, this is equivalent to $\partial_x u\,\dot x(t) + \partial_y u\,\dot y(t) = 0$, which is equivalent to our PDE if we set $\dot x(t) = y$ and $\dot y(t) = -x$. These are the characteristic equations. Their solutions are $x(t) = c\sin(t)$ and $y(t) = c\cos(t)$. Eliminating $t$ gives $x^2 + y^2 = c^2 =: F(x, y)$ as the only first integral. General solutions to the PDE are of the form $u(x, y) = g(x^2 + y^2)$, where $g : \mathbb{R} \to \mathbb{R}$ can be any function.

Figure 5.1: Characteristic curves for the PDEs in Example 5.1.1: (a) $1/y - x = \mathrm{const}$, (b) $xy = \mathrm{const}$, (c) $x^2 + y^2 = \mathrm{const}$.
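The characteristic curves of Fig. 5.1 can be visualized directly as level sets of the first integrals. Below is a minimal sketch; the grid ranges and contour levels are arbitrary illustrative choices.

```python
# A sketch reproducing the level sets of the first integrals in Fig. 5.1.
import numpy as np
import matplotlib.pyplot as plt

x, y = np.meshgrid(np.linspace(-3, 3, 400), np.linspace(-3, 3, 400))
with np.errstate(divide="ignore"):  # 1/y blows up near y = 0; harmless here
    firsts = {"1/y - x": 1 / y - x, "x y": x * y, "x^2 + y^2": x**2 + y**2}

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, (name, F) in zip(axes, firsts.items()):
    ax.contour(x, y, F, levels=np.linspace(-4, 4, 17))  # curves F(x, y) = const
    ax.set(title=f"{name} = const", xlabel="x", ylabel="y")
plt.tight_layout()
plt.show()
```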

Consider the following initial value (boundary) Cauchy problem: solve Eq. (5.5) subject to the boundary condition
$$u(x)\big|_{x_0 \in S} = \vartheta(x_0), \tag{5.9}$$
where $S$ is a surface (boundary) of dimension $d - 1$. This Cauchy problem has a well-defined solution in at least some vicinity of $S$ if $S$ is not tangent to a characteristic of Eq. (5.5). Consistently with what was described above, the solution to Eq. (5.6) with the initial/boundary condition Eq. (5.9) can be thought of as a change of variables.

Example 5.1.2. Let us illustrate the solution to the Cauchy problem on the example
$$\partial_x u = y\,\partial_y u, \qquad u(0, y) = \cos(y).$$
Solution. The characteristic equations, $\dot x = 1$, $\dot y = -y$, have solutions $x(t) = t - t_1$, $y(t) = \exp(t_2 - t)$, and one first integral,
$$F(x, y) = y\exp(x) = \mathrm{const};$$
therefore
$$u(x, y) = g(y\exp(x)),$$
where $g$ is an arbitrary function, is a general solution. The boundary/initial conditions are given at the straight line, $x = 0$, which is not tangent to any of the characteristics, $y = \exp(-x + x_1)$. Therefore, substituting the general solution in the boundary condition one finds the particular form of the function $g$ for this specific Cauchy problem:
$$u(0, y) = g(y) = \cos(y).$$
This results in the desired solution: $u(x, y) = \cos(y\exp(x))$.

Exercise 5.1. (a) Solve
$$y\,\partial_x u - x\,\partial_y u = 0,$$
for the initial condition $u(0, y) = y^2$. (b) Explain why the same problem with the initial condition $u(0, y) = y$ is ill-posed. (c) Determine whether the same problem with the initial condition $u(1, y) = y^2$ is ill-posed.

Example 5.1.3. Let $(q, p) = (q_1, \dots, q_n, p_1, \dots, p_n)$ be a set of canonical coordinates for a Hamiltonian system with Hamiltonian $H(q, p)$, and let $f = f(t, q, p)$ be any function of $t$ (time), $q$ and $p$. Liouville's theorem states that
$$\partial_t f + \{f, H\} = 0, \qquad\text{where}\quad \{f, H\} := \sum_{i=1}^{n}\left(\frac{\partial f}{\partial q_i}\frac{\partial H}{\partial p_i} - \frac{\partial f}{\partial p_i}\frac{\partial H}{\partial q_i}\right)$$
is the so-called Poisson bracket of $f$ and $H$. Find the characteristics of the Liouville PDE.
Solution. We wish to find a $V(t, q, p)$ such that the left hand side of the PDE can be expressed in the form $V \cdot \nabla f$. This is satisfied for $V = (1, \partial_p H, -\partial_q H)$. We interpret $V$ as the vector field $V = (\frac{dt}{ds}, \frac{dq}{ds}, \frac{dp}{ds})$. Introducing $V$ we also define a family of curves, $(t(s), q(s), p(s))$, called the characteristic curves. Since $dt/ds = 1$, we may use $t$ itself as the parameter, and a little algebra allows us to simplify the characteristic equations to
$$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \dot p_i = -\frac{\partial H}{\partial q_i}, \qquad \forall i = 1, \cdots, n.$$
Interpretation: these are precisely Hamilton's equations, $dq/dt = \partial_p H$ and $dp/dt = -\partial_q H$ (cf. Section 4.6.1), which describe the evolution of the state of the system in phase space. Since the characteristic curves are the solutions of Hamilton's equations, along which Liouville's PDE reduces to $\frac{df}{dt} = 0$, we infer that any solution $f(t, q, p)$ of Liouville's equation stays constant along the trajectories of the system in phase space.

Now let us get back to the inhomogeneous Eq. (5.4). As is standard for linear equations, the general solution to an inhomogeneous equation is constructed as the superposition of a particular solution and the general solution to the respective homogeneous equation. To find the former we transition to characteristics; then Eq. (5.4) becomes
$$(V \cdot \nabla_x)u = (\dot x \cdot \nabla_x)u = \frac{d}{dt}u = f(x(t)), \tag{5.10}$$
which can be integrated along the characteristic, thus resulting in the desired particular solution to Eq. (5.4),
$$u_{\mathrm{inh}} = \int_{t_0}^{t} f(x(s))\,ds, \qquad\text{where } x(s) \text{ satisfies the characteristic equations (5.6)}. \tag{5.11}$$
Notice that this solution is not constant along characteristics.

Example 5.1.4. Solve the Cauchy problem for the following inhomogeneous equation:
$$\partial_x u - y\,\partial_y u = y, \qquad u(0, y) = \sin(y).$$

The method of characteristics can also be generalized to quasi-linear first-order PDEs (first-order PDEs (5.4) where $V$ and $f$ depend not only on the vector of coordinates, $x$, but also on the function $u(x)$). In this case the characteristic equations become
$$\frac{dx}{dt} = V(x, u), \qquad \frac{du}{dt} = f(x, u). \tag{5.12}$$
The general solution to a quasi-linear PDE is given by $g(F_1, F_2, \dots, F_d) = 0$, where $g$ is an arbitrary function of the $d$ first integrals of Eq. (5.12).
Consider the example of the Hopf equation in $d = 1$,
$$\partial_t u + u\,\partial_x u = 0, \tag{5.13}$$
which, when $u(t; x)$ refers to the velocity of a particle at location $x$ and time $t$, describes the one-dimensional flow of non-interacting particles. The characteristic equations and initial conditions are
$$\dot x = u, \quad \dot u = 0, \qquad x(t = 0) = x_0, \quad u(t = 0) = u_0(x_0).$$
Direct integration produces $x = u_0(x_0)t + x_0$, giving the following implicit equation for $u$:
$$u = u_0(x - ut). \tag{5.14}$$
Under the specific conditions $u_0(x) = c(1 - \tanh x)$, this results in the following (still implicit) equation, $u = c(1 - \tanh(x - ut))$. Computing the partial derivative, one derives

$\partial_x u = -c/(\cosh^2(x - ut) - ct)$, which shows that $\partial_x u$ diverges in finite time, at $t_* = 1/c$ and $x = ut$. The phenomenon is called wave breaking, and has the physical interpretation of fast particles catching slower ones and aggregating, leading to sharpening of the velocity profile and eventual breakdown. This singularity is formal, meaning that the physical model is no longer applicable when the singularity occurs. Introducing a small $\kappa\partial_x^2 u$ term on the right hand side of Eq. (5.13) regularizes the non-physical breakdown, and explains the creation of shocks. The regularized second-order PDE is called Burgers' equation.
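The wave-breaking mechanism is easy to see by plotting the straight characteristics $x(t) = u_0(x_0)t + x_0$: faster characteristics (launched from small $x_0$) overtake slower ones, and crossings first appear at $t_* = 1/c$. A minimal sketch (with the illustrative choice $c = 1$) follows.

```python
# A sketch of the wave-breaking picture for the Hopf equation: straight
# characteristics x(t) = u0(x0) t + x0, with u0(x) = c (1 - tanh x), begin to
# cross at t* = 1/c, where the velocity profile becomes multi-valued.
import numpy as np
import matplotlib.pyplot as plt

c = 1.0
u0 = lambda x: c * (1 - np.tanh(x))
t = np.linspace(0, 2.5 / c, 200)
for x0 in np.linspace(-3, 3, 31):
    plt.plot(x0 + u0(x0) * t, t, "k", lw=0.5)     # one characteristic per x0
plt.axhline(1 / c, color="r", ls="--", label="t* = 1/c")
plt.xlabel("x"); plt.ylabel("t"); plt.legend(); plt.show()
```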

5.2 Classification of linear second-order PDEs


Consider the most general linear second-order PDE over two independent variables:
$$a_{11}\partial_x^2 u + 2a_{12}\partial_x\partial_y u + a_{22}\partial_y^2 u + b_1\partial_x u + b_2\partial_y u + cu + f = 0, \tag{5.15}$$
where all the coefficients may depend on the two independent variables $x$ and $y$.
The method of characteristics (which applies to first-order PDEs, for example when $a_{11} = a_{12} = a_{22} = c = 0$ in Eq. (5.15)) can inform the analysis of second-order PDEs.
Therefore, let us momentarily return to the first-order PDE,
$$b_1\partial_x u + b_2\partial_y u + f = 0, \tag{5.16}$$
and interpret its solution as a variable transformation from the $(x, y)$ pair of variables to the new pair of variables, $(\eta(x, y), \xi(x, y))$, assuming that the Jacobian of the transformation is neither zero nor infinite anywhere within the domain of $(x, y)$ of interest:
$$J = \det\begin{pmatrix}\partial_x\eta & \partial_y\eta \\ \partial_x\xi & \partial_y\xi\end{pmatrix} \ne 0, \infty. \tag{5.17}$$
Substituting $u = w(\eta(x, y), \xi(x, y))$ into the sum of the first-derivative terms in Eq. (5.16) one derives
$$b_1\partial_x u + b_2\partial_y u = b_1(\partial_x\eta\,\partial_\eta w + \partial_x\xi\,\partial_\xi w) + b_2(\partial_y\eta\,\partial_\eta w + \partial_y\xi\,\partial_\xi w) = (b_1\partial_x\eta + b_2\partial_y\eta)\partial_\eta w + (b_1\partial_x\xi + b_2\partial_y\xi)\partial_\xi w. \tag{5.18}$$
Requiring that the second term in Eq. (5.18) be zero, one observes that this is satisfied for all $x, y$ if $\xi$ is constant along curves $y(x)$ solving the characteristic equation, $b_1\,dy/dx - b_2 = 0$, i.e. if $\xi$ depends on $x$ only via such $y(x)$.
Let us now try the same logic, but now focusing on the sum of the second-order terms in Eq. (5.15). We derive
$$a_{11}\partial_x^2 u + 2a_{12}\partial_x\partial_y u + a_{22}\partial_y^2 u = \left(A\partial_\xi^2 + 2B\partial_\xi\partial_\eta + C\partial_\eta^2\right)w, \tag{5.19}$$

where
$$A := a_{11}(\partial_x\xi)^2 + 2a_{12}(\partial_x\xi)(\partial_y\xi) + a_{22}(\partial_y\xi)^2,$$
$$B := a_{11}(\partial_x\xi)(\partial_x\eta) + a_{12}(\partial_x\xi\,\partial_y\eta + \partial_y\xi\,\partial_x\eta) + a_{22}(\partial_y\xi)(\partial_y\eta),$$
$$C := a_{11}(\partial_x\eta)^2 + 2a_{12}(\partial_x\eta)(\partial_y\eta) + a_{22}(\partial_y\eta)^2.$$

Let us now attempt, by analogy with the case of the first-order PDE, to force the first and last terms on the rhs of Eq. (5.19) to zero, i.e. $A = C = 0$. This is achieved if we require that $\xi$ and $\eta$ are constant along the respective families of curves $y_+(x)$ and $y_-(x)$, where
$$\frac{dy_\pm}{dx} = \frac{a_{12} \pm \sqrt{D}}{a_{11}}, \qquad\text{where}\quad D := a_{12}^2 - a_{11}a_{22} \tag{5.20}$$
is called the discriminant. Eqs. (5.20) have, in the general case, distinct (first) integrals $\psi_\pm(x, y) = \mathrm{const}$. Then we can choose the new variables as $\xi = \psi_+(x, y)$ and $\eta = \psi_-(x, y)$.
If $D > 0$, Eq. (5.15) is called a hyperbolic PDE. In this case the characteristics are real, and any real pair $(x, y)$ is mapped to a real pair $(\eta, \xi)$. Eq. (5.15) takes the following canonical form:
$$\partial_\xi\partial_\eta u + \tilde b_1\partial_\xi u + \tilde b_2\partial_\eta u + \tilde c u + \tilde f = 0. \tag{5.21}$$
Notice that another (second) canonical form for the hyperbolic equation is derived if we transition further from $(\xi, \eta)$ to $(\alpha, \beta) := ((\eta + \xi)/2, (\xi - \eta)/2)$. Then Eq. (5.21) becomes
$$\partial_\alpha^2 u - \partial_\beta^2 u + \tilde b_1^{(2)}\partial_\alpha u + \tilde b_2^{(2)}\partial_\beta u + \tilde c^{(2)}u + \tilde f^{(2)} = 0. \tag{5.22}$$

If $D < 0$, Eq. (5.15) is called an elliptic PDE. In this case the two characteristic Eqs. (5.20) are complex conjugates of each other, and their first integrals are complex conjugate as well. To make the map from the old to the new variables real, we choose in this case $\alpha = \mathrm{Re}(\psi_+(x, y)) = (\psi_+(x, y) + \psi_-(x, y))/2$, $\beta = \mathrm{Im}(\psi_+(x, y)) = (\psi_+(x, y) - \psi_-(x, y))/(2i)$. This change of variables results in the following canonical form for the elliptic second-order PDE:
$$\partial_\alpha^2 u + \partial_\beta^2 u + b_1^{(e)}\partial_\alpha u + b_2^{(e)}\partial_\beta u + c^{(e)}u + f^{(e)} = 0. \tag{5.23}$$

$D = 0$ is the degenerate case, $\psi_+(x, y) = \psi_-(x, y)$, and the resulting equation is a parabolic PDE. Then we can choose $\beta = \psi_+(x, y)$ and $\alpha = \varphi(x, y)$, where $\varphi$ is an arbitrary function of $x, y$, independent of $\psi_+(x, y)$. In this case Eq. (5.15) takes the following canonical parabolic form:
$$\partial_\alpha^2 u + b_1^{(p)}\partial_\alpha u + b_2^{(p)}\partial_\beta u + c^{(p)}u + f^{(p)} = 0. \tag{5.24}$$

Example 5.2.1. Determine the type of each of the following equations, and then perform a change of variables reducing it to the respective canonical form:

(a) $\partial_x^2 u + \partial_x\partial_y u - 2\partial_y^2 u - 3\partial_x u - 15\partial_y u + 27x = 0$,
(b) $\partial_x^2 u + 2\partial_x\partial_y u + 5\partial_y^2 u - 32u = 0$,
(c) $\partial_x^2 u - 2\partial_x\partial_y u + \partial_y^2 u + \partial_x u + \partial_y u - u = 0$.

Solution.
(a) The coefficients of the second-order terms are $a_{11} = 1$, $a_{12} = 1/2$ and $a_{22} = -2$. The discriminant is $D = a_{12}^2 - a_{11}a_{22} = 9/4$. The equation is hyperbolic (everywhere) because the discriminant is positive (everywhere). There are two families of characteristics, defined by $dy/dx = (a_{12} \pm \sqrt{D})/a_{11}$, giving $dy/dx = 2$ and $dy/dx = -1$, respectively. The general solutions of the two equations are
$$y = 2x + \xi, \qquad y = -x + \eta,$$
where $\xi$ and $\eta$ are arbitrary constants. Expressing $\xi$ and $\eta$ via $x$ and $y$ one derives
$$\xi = y - 2x, \qquad \eta = y + x.$$
Transitioning to the new variables (and dividing the resulting equation by $-9$) one derives
$$\partial_\xi\partial_\eta u + \partial_\xi u + 2\partial_\eta u + (\xi - \eta) = 0.$$

(b) The coefficients of the second-order terms are $a_{11} = 1$, $a_{12} = 1$ and $a_{22} = 5$. The discriminant is $D = a_{12}^2 - a_{11}a_{22} = -4$. The equation is elliptic (everywhere) because the discriminant is negative (everywhere). There are two families of characteristics, defined by $dy/dx = (a_{12} \pm \sqrt{D})/a_{11}$, giving $dy/dx = 1 + 2i$ and $dy/dx = 1 - 2i$, respectively. The general solutions of the two equations are
$$y = (1 - 2i)x + \tilde\xi, \qquad y = (1 + 2i)x + \tilde\eta.$$
Setting $\xi := (\tilde\xi + \tilde\eta)/2 = y - x$ and $\eta := (\tilde\xi - \tilde\eta)/(2i) = 2x$ as the new variables, we arrive at the following canonical form:
$$\partial_\xi^2 u + \partial_\eta^2 u - 8u = 0.$$

(c) The coefficients of the second-order terms are $a_{11} = 1$, $a_{12} = -1$ and $a_{22} = 1$. The discriminant is $D = a_{12}^2 - a_{11}a_{22} = 0$. The equation is parabolic (everywhere). We have one characteristic, $dy/dx = (a_{12} \pm \sqrt{D})/a_{11} = -1$, giving $y = -x + \xi$; therefore one of the new variables is the integral of the characteristic equation, $\xi = x + y$. We can take any independent function of $x$ and $y$ as the second new variable. Let us pick $\eta = x$. Then the equation becomes
$$\partial_\eta^2 u + 2\partial_\xi u + \partial_\eta u - u = 0.$$
(Notice that the condition of functional independence consists in the requirement that the Jacobian of the transformation is nonzero. Any other choice of the second (independent) variable will result in another canonical form.)
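The pointwise classification by the discriminant is trivially automated. The following small helper (a sketch; the function name is our own) reproduces the three verdicts of Example 5.2.1.

```python
# A small helper (a sketch) classifying Eq. (5.15) from its second-order
# coefficients via the discriminant D of Eq. (5.20).
def classify(a11, a12, a22):
    D = a12**2 - a11 * a22          # the discriminant
    if D > 0:
        return "hyperbolic"
    return "elliptic" if D < 0 else "parabolic"

print(classify(1, 1/2, -2))   # (a): hyperbolic, D = 9/4
print(classify(1, 1, 5))      # (b): elliptic,   D = -4
print(classify(1, -1, 1))     # (c): parabolic,  D = 0
```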

5.3 Elliptic PDEs: Method of Green Function


Elliptic PDEs often originate from the description of static phenomena in two or more dimensions.
Let us first clarify the higher-dimensional generalization aspect. We generalize Eq. (5.23) to
$$\sum_{i,j=1}^{d} a_{ij}\partial_{x_i}\partial_{x_j}u(x) + \text{lower order terms} = 0, \tag{5.25}$$
where we assume that no second-derivative term can be eliminated by the conditions of the respective Cauchy problem. Notice that in $d > 2$ Eq. (5.25) cannot, in general, be reduced to a canonical form (introduced, in the previous Section, in $d = 2$).
Our primary focus will be on the generalization where $a_{ij} = \delta_{ij}$ in Eq. (5.25) in $d \ge 2$, and also on solving inhomogeneous equations, where a nontrivial (nonzero) solution is driven by a nonzero source. It is natural to find solutions to these equations using Green functions. We have discussed in Section 4.4.1 how to solve the static, linear, one-dimensional case of the Poisson equation using Green functions. Here we generalize and consider the Poisson equation in $d \ge 2$,
$$\nabla_r^2 u = \phi(r), \tag{5.26}$$

where $\nabla_r^2 := \Delta_r$ is the Laplace operator. In $d = 2$, $r = (x, y) \in \mathbb{R}^2$ and $\Delta_r = \partial_x^2 + \partial_y^2$. In $d = 3$, $r = (x, y, z) \in \mathbb{R}^3$ and $\Delta_r = \partial_x^2 + \partial_y^2 + \partial_z^2$. The Poisson Eq. (5.26) has many applications; for example, its solution $u(r)$ describes the electrostatic potential of charge distributed in $\mathbb{R}^d$ with density $\rho(r)$, in which case $\phi(r) = -4\pi\rho(r)$.
The Poisson equation, defined in all of $\mathbb{R}^d$, can be solved by the method of Green functions. Recall that the Green function is the solution to the inhomogeneous equation with a point source on the right hand side,
$$\nabla_r^2 G = \delta(r). \tag{5.27}$$



Then the solution to Eq. (5.26) becomes
$$u(r) = \int dr'\,G(r - r')\,\phi(r'). \tag{5.28}$$

The solution to Eq. (5.27) can be found by applying the Fourier transform, resulting in the following algebraic equation, $k^2\hat G(k) = -1$, where $k = |k|$. Solving (trivially) for $\hat G$ and applying the inverse Fourier transform, one derives, for $d = 3$,
$$G(r) = -\int\frac{d^3k}{(2\pi)^3}\frac{\exp(i(k\cdot r))}{k^2} = -\int\frac{d^2k_\perp}{(2\pi)^3}\int_{-\infty}^{\infty}dk_\parallel\,\frac{\exp(ik_\parallel r)}{k_\parallel^2 + k_\perp^2} = -\int\frac{d^2k_\perp}{(2\pi)^3}\frac{\pi}{k_\perp}\exp(-k_\perp r) = -\frac{1}{4\pi r}, \tag{5.29}$$
where for each $r$ we change from the Cartesian to the cylindrical representation associated with $r$, i.e. $k = (k_\parallel, k_\perp)$: the one-dimensional $k_\parallel = (k\cdot r)/r$ is along $r$, and the two-dimensional vector $k_\perp$ stands for the remaining two components of $k$, orthogonal to $r$. Substituting Eq. (5.29) into Eq. (5.28) one derives
$$u(r) = -\int d^3r'\,\frac{\phi(r')}{4\pi|r - r'|} = \int d^3r'\,\frac{\rho(r')}{|r - r'|}, \tag{5.30}$$
which is thus the expression for the electrostatic potential of a given distribution of the charge density in space.
Note that for $d = 2$ we usually write $\phi(r) = -2\pi\rho(r)$. In this case the Green function, solving Eq. (5.27), is found to be $G(r - r') = \ln(|r - r'|)/(2\pi)$.
The homogeneous case of $\phi = 0$ is often called the Laplace equation. We will distinguish the two cases, calling them the (inhomogeneous) Laplace equation and the homogeneous Laplace equation respectively.
We will also discuss in the following the Debye equation,
$$\left(\nabla_r^2 - \kappa^2\right)u = \phi(r), \tag{5.31}$$
which describes the distribution of charge $\rho(r)$ in plasma, for $\phi(r) = -4\pi\rho(r)$.


Functions that satisfy the homogeneous Laplace equation are called harmonic functions. Notice that there exists no nonzero harmonic function defined on the whole of $\mathbb{R}^2$ that approaches $0$ as $|r| \to \infty$. This can be seen by applying the Fourier transform to the homogeneous Laplace equation. One derives $k^2\hat u(k) = 0$, which results in $\hat u(k) \sim \delta(k)$, and then (applying the inverse Fourier transform), $u(r) = \mathrm{const}$. Finally, requiring that $u \to 0$ as $r \to \infty$, one observes that the constant is zero. This argument extends to any dimension, and it also applies to the Debye equation: there exists no nonzero solution to the (homogeneous) Debye equation defined in the entire space that decays to zero as $r \to \infty$.

Consequently, nonzero harmonic functions must be defined in a bounded domain. For many physical applications, the homogeneous Laplace equation is supplemented with some form of boundary conditions. For example, $u$, or the component of its gradient normal to the boundary, $\nabla u \cdot n$, may be fixed at the boundary.

Example 5.3.1. Find the Green function for the Laplace equation in the region outside of the sphere of radius $R$, with zero boundary condition on the sphere, i.e. solve
$$\nabla_r^2 G(r; r') = \delta(r - r'), \tag{5.32}$$
for $R \le |r|, |r'|$, with the boundary condition $G(r; r') = 0$ for $|r| = R \le |r'|$.

Solution. The Green function can be constructed by recognizing that $|r - r''|^{-1}$ solves $\nabla_r^2 G(r; r') = 0$ for all $r \ne r''$. The trick is to find, for each $r' \in D$, a fictitious image point $r'' \notin D$ such that $G(r; r') = 0$ whenever $|r| = R$. The problem distills down to finding the correct strength and position of the image point to enforce the boundary condition at every point on the boundary. Using symmetry, it is clear that $r'$ and $r''$ must be collinear. Therefore, we can write $r'' = \alpha r'$ and
$$\text{find } A, \alpha \text{ such that } G(r; r') := -\frac{1}{4\pi|r - r'|} + \frac{A}{4\pi|r - \alpha r'|} = 0 \text{ whenever } |r| = R.$$
For any $r$ on the boundary and $r' \in D$, $|r - r'| = \sqrt{R^2 + |r'|^2 - 2|r'|R\cos(\theta)}$. Similarly, $|r - r''| = \sqrt{R^2 + \alpha^2|r'|^2 - 2\alpha|r'|R\cos(\theta)}$ (see the blue and orange triangles, respectively, in Fig. 5.2). To enforce the boundary condition, the two contributions must cancel. That is,
$$-\frac{1}{4\pi\sqrt{|r'|^2 + R^2 - 2|r'|R\cos(\theta)}} + \frac{A}{4\pi\sqrt{\alpha^2|r'|^2 + R^2 - 2\alpha|r'|R\cos(\theta)}} = 0.$$
We are looking for values of $A$ and $\alpha$ that are independent of $\theta$. The algebra is a bit tedious, but we find that $A = R/|r'|$ and $\alpha = (R/|r'|)^2$. Hence, the Green function is
$$G(r; r') := -\frac{1}{4\pi|r - r'|} + \frac{R/|r'|}{4\pi|r - r''|},$$
where $r'' = (R/|r'|)^2\,r'$.
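A quick numerical sanity check (a sketch; the source position and sample boundary points are arbitrary choices) confirms that this image construction indeed makes $G$ vanish on the sphere $|r| = R$.

```python
# A numerical sanity check (a sketch) that the image construction gives G = 0
# on the sphere |r| = R, for a sample source point r' outside the sphere.
import numpy as np

R = 1.0
rp = np.array([0.0, 0.0, 2.5])                  # source r', |r'| > R (assumed)
rpp = (R / np.linalg.norm(rp))**2 * rp          # image r'' = (R/|r'|)^2 r'
A = R / np.linalg.norm(rp)                      # image strength A = R/|r'|

def G(r):
    return (-1 / (4 * np.pi * np.linalg.norm(r - rp))
            + A / (4 * np.pi * np.linalg.norm(r - rpp)))

phi = np.linspace(0, np.pi, 7)                  # sample points on the sphere
boundary = R * np.stack([np.sin(phi), np.zeros_like(phi), np.cos(phi)], axis=1)
print([f"{G(r):.2e}" for r in boundary])        # all ~ 0 up to rounding error
```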

Exercise 5.2. Find general solutions to the inhomogeneous Debye equation,
$$\left(\nabla_r^2 - \kappa^2\right)f = -4\pi\rho(r),$$
where the charge density $\rho(r)$ depends only on the distance from the origin, i.e. $\rho(r) = \rho(|r|)$. (Hint: Consider finding the Green function first, acting by analogy with how we found the Green function of the Laplace equation above.)

Figure 5.2: Method of images applied to the exterior of a sphere in Example 5.3.1. (a) For each $r' \in D$, identify the associated image $r'' \notin D$ to find the response at any $r$. (b) The contributions from $r'$ and $r''$ must cancel to enforce the boundary condition.


5.4 Waves in a Homogeneous Media: Hyperbolic PDE
Although hyperbolic PDEs are normally associated with waves$^*$, we begin our discussion by developing intuition which generalizes to a broader class of integro-differential equations beyond hyperbolic PDEs. In other words, we act here in reverse to what may be considered the standard mathematical process: we begin by describing properties of solutions associated with waves, and then walk back to the equations which describe such waves.
Consider the propagation of waves in homogeneous media, for example: electro-magnetic waves, sound waves, spin-waves, surface-waves, electro-mechanical waves (in power systems), and so on. In spite of such a variety of phenomena, they all admit one rather universal description. The wave process at a general position $r$ in $d$-dimensional space and time $t$ is represented as the following integral over the wave vector $k$:
$$u(t; r) = \int\frac{dk}{(2\pi)^d}\exp(i(k\cdot r))\,\psi_k(t)\,\hat u(k), \qquad \psi_k(t) \equiv \exp(-i\omega(k)t), \tag{5.33}$$
where $\omega(k)$ and $\hat u(k)$ are the dispersion law and the wave amplitude, both dependent on the wave vector $k$. (Notice the similarities and the differences with the Fourier integral.) In Eq. (5.33)
$\psi_k(t)$ is a solution to the following first-order (in time) linear ODE,
$$\left(\frac{d}{dt} + i\omega(k)\right)\psi_k = 0, \tag{5.34}$$
or, alternatively, of the following second-order linear ODE,
$$\left(\frac{d^2}{dt^2} + (\omega(k))^2\right)\psi_k = 0. \tag{5.35}$$
These are called the wave equations in the Fourier representation. The linearity of the equations is fundamental; it reflects the fact that the generally nonlinear dynamics has been linearized. Waves

$^*$This is an auxiliary Section which can be dropped at the first reading. Material from this Section will not contribute to the midterm and final exams.

may also interact with each other. The interaction of waves can only come from accounting
for nonlinearities in the original equations. In this analysis, we focus primarily on the linear
regime.

Dispersion Laws

Consider the case where $\omega(k) = c|k|$, where $c$ is a constant having the dimensionality and sense of a velocity. In this case, the inverse Fourier transform version of Eq. (5.35) becomes
$$\left(\frac{d^2}{dt^2} - c^2\nabla_r^2\right)\psi(t; r) = 0. \tag{5.36}$$

Note that the two differential operators in Eq. (5.36), one in time and another in space,
have opposite signs. Therefore, we naturally arrive at the case which generalizes the hyper-
bolic PDE (5.22). It is a generalization because r is not one-dimensional but d-dimensional,
d ≥ 1.
Eq. (5.36) with constant $c$ explains a variety of important physical situations: as mentioned already, it describes the propagation of sound in a homogeneous gas, liquid or crystal medium. In this case $\psi$ describes the shift of an element of the matter from its equilibrium position, and $c$ is the speed of sound in the material$^b$.
Another example is given by the electro-magnetic waves, described by the Maxwell
equations on the electric, E, and magnetic, B, fields,

∂t E = c∇r × B, ∂t B = −c∇r × E, (5.37)

supplemented by the divergence-free conditions,

(∇r · E) = (∇r · B) = 0, (5.38)

where $\times$ is the vector product in $d = 3$$^c$, and $c$ is the speed of light in the medium. Differentiating the first equation in the pair of Eqs. (5.37) over time, substituting the resulting $\partial_t\nabla_r\times B$ by $-c(\nabla_r\times(\nabla_r\times E))$, consistently with the second equation in the pair, and taking into account that for the divergence-free $E$, $(\nabla_r\times(\nabla_r\times E)) = -\nabla_r^2 E$, one arrives at Eq. (5.36) for all components of the electric field, i.e. with $\psi$ replaced by $E$.
$^b$Note that there is a unique speed of sound in a gas or liquid, while a 3d crystal supports three different waves (with three different values of $c$), each associated with a distinct polarization. For example, in an isotropic crystal there are longitudinal and transversal waves, propagating along and, respectively, perpendicular to the media shift.
$^c$$(\nabla_r\times B)_i = \varepsilon_{ijk}\nabla_j B_k$, where $i, j, k = 1, \dots, 3$ and $\varepsilon_{ijk}$ is the absolutely skew-symmetric tensor in $d = 3$.

The dispersion law in the case of sound and light waves is linear, $\omega(k) = \pm c|k|$; however, there are other, more complex examples. For example, surface waves propagating over the surface of water (with air) are characterized by the following dispersion law:
$$\omega(k) = \sqrt{gk + (\sigma/\rho)k^3}, \tag{5.39}$$
where $g$, $\sigma$ and $\rho$ are the gravity coefficient, the surface tension coefficient and the density of the fluid, respectively. Eq. (5.39) is so complex because it accounts for both capillary and gravitational effects. Gravitational waves dominate at small $k$ (large distances), where Eq. (5.39) transforms to $\omega(k) = \sqrt{gk}$, while the capillary waves dominate in the opposite limit of large $k$ (small distances), where one gets asymptotically $\omega = (\sigma/\rho)^{1/2}k^{3/2}$.
Recall that Eq. (5.34) and Eq. (5.35) are stated in the Fourier ($k$-) representation. Transitioning to the respective $r$-representation in the case of a nonlinear dispersion relation, for example the one associated with Eq. (5.39), will NOT result in a PDE. We arrive in this general case at an integro-differential equation, reflecting the fact that the nonlinear dispersion relation, even though local in $k$-space, becomes nonlocal in $r$-space.
In general, the propagation of waves in a homogeneous medium is characterized by a dispersion law dependent only on the absolute value, $k = |k|$, of the wave vector $k$. $\omega(k)/k$ and $d\omega(k)/dk$, both having the dimensionality of speed, are called, respectively, the phase velocity and the group velocity.

Example 5.4.1. Solve the Cauchy (initial value) problem for the amplitude of spin-waves, which satisfies the following PDE in $d = 3$:
$$\partial_t^2\psi = -(\Omega - b\nabla_r^2)^2\psi, \tag{5.40}$$
where $\psi(t = 0; r) = \exp(-r^2)$ and $d\psi/dt(t = 0; r) = 0$.


Solution. Note, first, that applying the Fourier transform over $r$ to Eq. (5.40) one arrives at Eq. (5.35), where
$$\omega(k) = \Omega + bk^2 \tag{5.41}$$
is the respective (spin wave) dispersion law. The Fourier transform of the initial condition is $\hat\psi(t = 0; k) = \pi^{3/2}\exp(-k^2/4)$. Since $d\psi/dt(t = 0; r) = 0$, the Fourier transform of the initial time-derivative is zero as well, that is, $d\hat\psi/dt(t = 0; k) = 0$. Then the solution to Eqs. (5.35, 5.41) becomes $\hat\psi(t; k) = \pi^{3/2}\exp(-k^2/4)\cos((\Omega + bk^2)t)$. Evaluating the inverse

Fourier transform, one derives
$$\psi(t; r) = \pi^{3/2}\int\frac{d^3k}{(2\pi)^3}\,e^{-k^2/4}\cos((\Omega + bk^2)t)\exp(i(k\cdot r))$$
$$= \int_0^\infty\frac{k\,dk}{2\pi^{1/2}r}\,e^{-k^2/4}\cos((\Omega + bk^2)t)\sin(kr)$$
$$= -\frac{1}{2\pi^{1/2}r}\frac{d}{dr}\int_0^\infty dk\,e^{-k^2/4}\cos((\Omega + bk^2)t)\cos(kr)$$
$$= -\mathrm{Re}\left[\frac{\exp(i\Omega t)}{4\pi^{1/2}r}\frac{d}{dr}\int_{-\infty}^{\infty}dk\,\exp\left(-\frac{1 - 4ibt}{4}k^2 + ikr\right)\right]$$
$$= \mathrm{Re}\left[\frac{\exp\left(i\Omega t - \frac{r^2}{1 - 4ibt}\right)}{(1 - 4ibt)^{3/2}}\right].$$

Example 5.4.2. Solve the Cauchy (initial value) problem for the wave Eq. (5.36) in d = 3,
where ψ(t = 0; r) = exp(−r2 ) and dψ/dt(t = 0; r) = 0.

Stimulated Waves: Radiation

So far we have discussed the free propagation of waves. Consider the inhomogeneous equation generalizing Eq. (5.35) that arises from a source term $\chi(t; r)$ on the right hand side:
$$\left(\frac{d^2}{dt^2} + (\omega(-i\nabla_r))^2\right)\psi(t; r) = \chi(t; r), \tag{5.42}$$
where we have used $-i\nabla_r\exp(i(k\cdot r)) = k\exp(i(k\cdot r))$. You may assume that the dispersion law $\omega(k)$ is a continuous function of its argument (the absolute value of the wave vector), so that the operator $(\omega(-i\nabla_r))^2$ is well defined in the sense of the function's Taylor series.
The Green function for the PDE is defined as the solution to
$$\left(\frac{d^2}{dt^2} + (\omega(-i\nabla_r))^2\right)G(t; r) = \delta(t)\delta(r). \tag{5.43}$$

The solution to the inhomogeneous PDE, Eq. (5.42), can be expressed as the convolution of the source term $\chi(t_1; r_1)$ with the Green function $G(t; r)$:
$$\psi(t; r) = \int dt_1\,dr_1\,G(t - t_1; r - r_1)\,\chi(t_1; r_1). \tag{5.44}$$
The general solution to Eq. (5.42) is expressed as the sum of the forced solution (5.44) and a zero mode of the respective free equation, i.e. Eq. (5.42) with zero right hand side.

To solve Eq. (5.43) for the Green function, we pass to the equivalent equation for its Fourier transform,
$$\left(\frac{d^2}{dt^2} + (\omega(k))^2\right)\hat G(t; k) = \delta(t). \tag{5.45}$$
Recall that the inhomogeneous ODE (5.45) was already discussed earlier in the course; indeed, Eq. (4.36) solves Eq. (5.45). Then, recalling that $\omega$ depends on $k$ and applying the inverse Fourier transform over $k$ to Eq. (4.36), one arrives at
$$G(t; r) = \theta(t)\int\frac{d^3k}{(2\pi)^3}\frac{\sin(\omega(k)t)}{\omega(k)}\exp(i(k\cdot r)). \tag{5.46}$$
Example 5.4.3. Show that the general expression (5.46), in the case of the linear dispersion law $\omega(k) = ck$, becomes
$$G(t; r) = \frac{\theta(t)}{4\pi cr}\left(\delta(r - ct) - \delta(r + ct)\right), \tag{5.47}$$
where $r = |r|$.
Solution. The linear dispersion law means that $\omega(k) = ck$, where $k = |k|$. Then
$$G(t; r) = \theta(t)\int_{\mathbb{R}^3}\frac{\sin(ckt)}{ck}\exp(i(r\cdot k))\frac{d^3k}{(2\pi)^3}.$$
Compute the integral by rotating the coordinate system so that $r$ points in the $z$-direction (i.e. $r = (0, 0, r)^\top$) and then switching to spherical coordinates (i.e. $(k_1, k_2, k_3) \mapsto (k\cos\theta\sin\phi, k\sin\theta\sin\phi, k\cos\phi)$). The scalar product $r\cdot k$ then evaluates to $rk\cos\phi$:
$$G(t; r) = \frac{\theta(t)}{(2\pi)^3}\int_0^{2\pi}\int_0^{\pi}\int_0^{\infty}\frac{\sin(ckt)}{ck}\exp(irk\cos\phi)\,k^2\sin\phi\,dk\,d\phi\,d\theta$$
$$= \frac{\theta(t)}{(2\pi)^2}\int_0^{\infty}k^2\,\frac{e^{ickt} - e^{-ickt}}{2ick}\cdot\frac{e^{ikr} - e^{-ikr}}{ikr}\,dk \qquad\text{(via the substitution } u = -\cos\phi)$$
$$= \frac{\theta(t)}{(2\pi)^2 cr}\int_0^{\infty}\left(\frac{e^{ik(r - ct)} + e^{-ik(r - ct)}}{2} - \frac{e^{ik(r + ct)} + e^{-ik(r + ct)}}{2}\right)dk$$
$$= \frac{\theta(t)}{4\pi cr}\left(\delta(r - ct) - \delta(r + ct)\right),$$
which is equivalent to the expression we were given.

Substituting Eq. (5.47) into Eq. (5.44), one derives the following expression for linear-dispersion (light or sound) radiation from a source:
$$\psi(t; r) = \frac{1}{4\pi c^2}\int\frac{dr_1}{R}\,\chi\left(t - \frac{R}{c};\,r_1\right), \qquad R := |r - r_1|. \tag{5.48}$$
The solution shows that the action of the source is delayed by $R/c$, corresponding to the propagation time of light (or sound) from the source to the observation point.

Example 5.4.4. Solve the radiation Eq. (5.42), in the case of the linear dispersion law, for a point harmonic source, $\chi(t; r) = \cos(\omega t)\delta(r)$.
Solution.
$$\psi(t; r) = \int dt_1\,dr_1\,G(t - t_1; r - r_1)\chi(t_1; r_1) = \frac{1}{4\pi c^2}\int\frac{dr_1}{R}\cos(\omega(t - R/c))\,\delta(r_1) = \frac{1}{4\pi rc^2}\cos(\omega(t - r/c)),$$
where $r = |r|$.

5.5 Diffusion Equation


The most common example of a multi-dimensional generalization of the parabolic equation Eq. (5.24) is the homogeneous diffusion equation,
$$\partial_t u = \kappa\nabla_r^2 u, \tag{5.49}$$
where $\kappa$ is the diffusion coefficient. The equation appears in a number of applications; for example, it can be used to describe the evolution of the number density of particles, or the spatial variation of temperature. The same equation describes the properties of the basic stochastic process (Brownian motion).
Consider the Cauchy problem with $u(t; r)$ given at $t = 0$. The Fourier transform over $y \in \mathbb{R}^d$ is
$$\hat u(t; k) = \int dy_1 \dots dy_d\,\exp(i(k\cdot y))\,u(t; y). \tag{5.50}$$
Integrating Eq. (5.49) with the Fourier weight (and setting $\kappa = 1$ in the following, which can always be achieved by rescaling time) one arrives at
$$\partial_t\hat u(t; k) = -k^2\hat u(t; k). \tag{5.51}$$
Integrating the equation over time, $\hat u(t; k) = \exp(-k^2 t)\hat u(0; k)$, and evaluating the inverse Fourier transform over $k$ of the result, one arrives at
$$u(t; x) = \int\frac{dy_1 \dots dy_d}{(4\pi t)^{d/2}}\exp\left(-\frac{(x - y)^2}{4t}\right)u(0; y). \tag{5.52}$$
If the initial field, $u(0; x)$, is localized around some $x$, say around $x = 0$, that is, if $u(0; x)$ decays sufficiently fast as $|x|$ increases, then one may find a universal asymptotic of $u(t; x)$ at long times, $t \gg l^2$, where $l$ is the length scale on which $u(0; x)$ is localized. At these sufficiently large times the dominant contribution to the integral in Eq. (5.52) is acquired from the $|y| \sim l$ vicinity of the origin, and therefore in the leading order one can ignore the $y$-dependence of the diffusive kernel in the integrand of Eq. (5.52), i.e.
$$u(t; x) \approx \frac{A}{(4\pi t)^{d/2}}\exp\left(-\frac{x^2}{4t}\right), \qquad A = \int u(0; y)\,dy_1 \dots dy_d. \tag{5.53}$$

Notice that the approximation (5.53) corresponds to the substitution $u(0, y) \to A\delta(y)$ in Eq. (5.52). Another interpretation of Eq. (5.53) corresponds to expanding $\exp\left(-\frac{(x - y)^2}{4t}\right)$ in a Taylor series in $y$, and then ignoring all but the leading-order term, $O(y^0)$, in the expansion. If $A = 0$ one needs to account for the $O(y^1)$ term, and drop the rest. In this case the analog of Eq. (5.53) becomes
$$u(t; x) \approx \frac{(B\cdot x)}{(4\pi t)^{d/2 + 1}}\exp\left(-\frac{x^2}{4t}\right), \qquad B = 2\pi\int y\,u(0; y)\,dy_1 \dots dy_d. \tag{5.54}$$
(4πt) 4t
Exercise 5.3. Find the asymptotic behavior of a one-dimensional diffusion equation at sufficiently long times for the following initial conditions:
(a) $u(0; x) = x\exp\left(-\frac{x^2}{2l^2}\right)$,
(b) $u(0; x) = \exp\left(-\frac{|x|}{l}\right)$,
(c) $u(0; x) = x\exp\left(-\frac{|x|}{l}\right)$,
(d) $u(0; x) = \frac{1}{x^2 + l^2}$,
(e) $u(0; x) = \frac{x}{(x^2 + l^2)^2}$.
Hint: Think about expanding the diffusion kernel in the integrand of Eq. (5.52) in a series over $y$.

Our next step is to find the Green function of the heat equation, i.e. to solve
$$\partial_t G - \kappa\nabla_r^2 G = \delta(t)\delta(x). \tag{5.55}$$
In fact, we have solved this problem already, as Eq. (5.52) describes it with $u(0; y) = G(+0; x) = \delta(x)$ set as the initial condition. The result (with the Heaviside factor $\theta(t)$ enforcing causality) is
$$G(t; x) = \frac{\theta(t)}{(4\pi t)^{d/2}}\exp\left(-\frac{x^2}{4t}\right). \tag{5.56}$$
As always, the Green function can be used to solve the inhomogeneous diffusion equation,
$$\partial_t u - \kappa\nabla_x^2 u = \phi(t; x), \tag{5.57}$$
whose solution is expressed via the Green function as follows:
$$u(t; x) = \int_{-\infty}^{t}dt'\int dy\,G(t - t'; x - y)\,\phi(t'; y), \tag{5.58}$$
where we assume that $u(t \to -\infty; x) = 0$.




Example 5.5.1. Solve Eq. (5.57) for $\phi(t; x) = \theta(t)\exp\left(-x^2/(2l^2)\right)$ in the $d = 4$-dimensional space.

5.6 Boundary Value Problems: Fourier Method


Consider the boundary value problem associated with sound waves:
$$\partial_t^2 u(t; x) - c^2\partial_x^2 u(t; x) = 0, \tag{5.59}$$
$$0 \le x \le L, \qquad u(t, 0) = u(t, L) = 0, \qquad u(0, x) = \varphi(x), \qquad \partial_t u(0, x) = \psi(x). \tag{5.60}$$

This problem can be solved by the Fourier method (also called the method of separation of variables), which splits into two steps.
First, we look for particular solutions which satisfy only the boundary conditions in the coordinate $x$. We look for $u(t, x)$ in the separable form $u(t, x) = X(x)T(t)$. Substituting this ansatz in Eq. (5.59) one arrives at
$$\frac{X''(x)}{X(x)} = \frac{T''(t)}{c^2 T(t)} = -\lambda, \tag{5.61}$$
where $\lambda$ is an arbitrary constant. The general solution to the equation for $X$ is
$$X = A\cos(\sqrt{\lambda}x) + B\sin(\sqrt{\lambda}x).$$
Require that $X(x)$ satisfies the same boundary conditions as in Eq. (5.60). This is possible only if $A = 0$ and $L\sqrt{\lambda} = n\pi$, $n = 1, 2, \dots$. From here we derive solutions labeled by the integer $n$, with the respective spatial form
$$\lambda_n = \left(\frac{n\pi}{L}\right)^2, \qquad X_n(x) = \sin\left(\frac{n\pi x}{L}\right).$$
We are now ready to get back to Eq. (5.61) and resolve the equation for $T(t)$:
$$T_n(t) = A_n\cos\left(\frac{n\pi ct}{L}\right) + B_n\sin\left(\frac{n\pi ct}{L}\right),$$
where $A_n, B_n$ are arbitrary constants. The $X_n(x)$ form a complete basis, and therefore a general solution can be written as a linear combination of the basis solutions:
$$u(t, x) = \sum_{n=1}^{\infty}X_n(x)T_n(t).$$
In the second step we fix $A_n$ and $B_n$, resolving the initial portion of the conditions (5.60):
$$\varphi(x) = \sum_{n=1}^{\infty}A_n X_n(x), \qquad \psi(x) = \sum_{n=1}^{\infty}c\sqrt{\lambda_n}\,B_n X_n(x). \tag{5.62}$$

Notice that the eigenfunctions, $X_n(x)$, are mutually orthogonal:
$$\int_0^L dx\,X_n(x)X_m(x) = \frac{L}{2}\delta_{nm}.$$
Multiplying both Eqs. (5.62) by $X_m(x)$, integrating them from $0$ to $L$, and accounting for the orthogonality of the eigenfunctions, one derives
$$A_m = \frac{2}{L}\int_0^L dx\,\varphi(x)X_m(x), \qquad B_m = \frac{2}{cL\sqrt{\lambda_m}}\int_0^L dx\,\psi(x)X_m(x). \tag{5.63}$$

Example 5.6.1. The equation describing the deviation of a string from the straight line, $u(t; x)$, is $\partial_t^2 u - c^2\partial_x^2 u = 0$, where $x$ is the position along the line, $t$ is the time, and $c$ is a constant (the speed of sound). Assume that at $t = 0$ the string has a parabolic shape, $u(0; x) = 4hx(L - x)/L^2$, with both ends, at $x = 0$ and $x = L$ respectively, attached to the straight line. Let us also assume that the speed of the string is equal to zero at $t = 0$, i.e. $\forall x \in [0, L]: \partial_t u(0; x) = 0$. Find the dependence of the string deviation, $u(t; x)$, on time, $t$, at a position, $x \in [0, L]$, along the straight line.
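A minimal sketch of the numerics behind this example follows: the coefficients $A_n$ of Eq. (5.63) are computed by quadrature ($B_n = 0$ here, since $\partial_t u(0; x) = 0$), and the truncated series is summed. The truncation order and parameter values are illustrative choices.

```python
# A sketch evaluating the Fourier-method solution of Example 5.6.1.
import numpy as np

L, c, h, N = 1.0, 1.0, 0.2, 50                        # illustrative parameters
x = np.linspace(0, L, 501)
phi = 4 * h * x * (L - x) / L**2                      # initial shape; psi = 0

n = np.arange(1, N + 1)
Xn = np.sin(np.outer(n, np.pi * x / L))               # eigenfunctions X_n(x)
An = (2 / L) * np.trapz(Xn * phi[None, :], x, axis=1) # Eq. (5.63); B_n = 0

def u(t):
    Tn = An * np.cos(n * np.pi * c * t / L)           # temporal factors T_n(t)
    return Tn @ Xn                                    # truncated series for u(t, x)

print(np.max(np.abs(u(0.0) - phi)))                   # series reproduces phi(x)
```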

Let us now analyze the following parabolic boundary value problem over $x \in [0, L]$:
$$\partial_t u = a^2\partial_x^2 u, \qquad u(t, 0) = u(t, L) = 0, \qquad u(0, x) = \begin{cases}x, & x < L/2, \\ L - x, & x > L/2.\end{cases} \tag{5.64}$$
Here we follow the same Fourier method approach. In fact, the spectral part of the solution is identical to the one just described above in the hyperbolic case, while the temporal components are obviously different. One derives $T_n' = -a^2\lambda_n T_n$, which has a decaying solution,
$$T_n = A_n\exp\left(-\left(\frac{n\pi}{L}\right)^2 a^2 t\right).$$
Expansion of the initial conditions in the Fourier series is carried out as above, therefore resulting in
$$u(t, x) = \frac{4L}{\pi^2}\sum_{n=0}^{\infty}\frac{(-1)^n}{(2n + 1)^2}\exp\left(-\left(\frac{(2n + 1)\pi}{L}\right)^2 a^2 t\right)\sin\left(\frac{2n + 1}{L}\pi x\right).$$

Notice that the solution is symmetric with respect to the middle of the interval, u(t, x) =
u(t, L − x), as this symmetry is inherited from the initial conditions.

Exercise 5.4. Solve the following boundary value problem:
$$\partial_t u = a^2\partial_x^2 u - \beta u, \qquad u(t, 0) = u(t, L) = 0, \qquad u(0, x) = \sin\left(\frac{2\pi x}{L}\right).$$


5.7 Case study: Burgers’ Equation
Burgers’ equation ∗ is a generalization of the Hopf’s equation, Eq. (5.13), discussed when
illustrating the method of characteristics. Recall that the Hopf’s equation results in a wave
breaking which leads to a non-physical multi-valued solution. Modification of the Hopf’s
equation by adding dissipation/diffusion results in the Burgers’ equation:

∂t u + u∂x u = ∂x2 u. (5.65)

Like practically every other nonlinear PDE, Burgers’ equation seems rather hopeless to
resolve at first glance. However, Burgers’ equation is in fact special. It allows the Cole-
Hopf’s transformation, from u(t; x) to Ψ(t; x)
∂x Ψ(t; x)
u(t; x) = −2 , (5.66)
Ψ(t; x)
reducing Burgers’ equation to the diffusion equation

∂t Ψ = ∂x2 Ψ. (5.67)

The solution to the Cauchy problem associated with Eq. (5.67) can be expressed as an inte-
gral convolving the initial profile Ψ(0; x), with the Green function of the diffusion equation
described in Eq. (5.56)
Z  
dy (x − y)2
Ψ(t; x) = √ exp − Ψ(0; y). (5.68)
4πt 4t
This latter expression can be used to find some exact solutions to Burgers’ equation. Con-
sider, for example, Ψ(0; x) = cosh(ax). Substitution into Eq. (5.68) and conducting in-
tegration over y, one arrives at Ψ(t; x) = cosh(ax) exp(a2 t), which results, according to
Eq. (5.66), in stationary (time independent, i.e. standing) “shock” solution to Burgers’
equation, u(t; x) = −2a tanh(ax). Notice that the following more general solution to Burg-
ers’ equation corresponds to a shock moving with the constant speed u0

u(t; x) = u0 − 2a tanh(a(x − x0 − u0 t)).

Example 5.7.1. Solve the diffusion equation Eq. (5.67) with the initial condition $\Psi(0, x) = \cosh(ax) + B\cosh(bx)$. Reconstruct the respective $u(t; x)$ solving the Burgers Eq. (5.65). Analyze the result in the regime $b > a$ and $B \ll 1$, and also verify, by building a computational snippet (such as the sketch below), that the resulting spatio-temporal dynamics corresponds to a large shock "eating" a small shock.
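A minimal version of the requested snippet might look as follows. It uses the fact that each term of $\Psi(0; x)$ evolves under Eq. (5.67) by a simple factor, $\Psi(t; x) = \cosh(ax)e^{a^2 t} + B\cosh(bx)e^{b^2 t}$, and plots $u = -2\partial_x\Psi/\Psi$; the parameter values and plotting times are illustrative choices.

```python
# A sketch of the computational snippet requested in Example 5.7.1: the
# b-shock (amplitude 2b) progressively replaces the a-shock as t grows.
import numpy as np
import matplotlib.pyplot as plt

a, b, B = 1.0, 2.0, 1e-8                 # b > a, B << 1 (illustrative)
x = np.linspace(-15, 15, 2001)
for t in (0.0, 3.0, 6.0, 9.0):
    # each cosh term solves the diffusion equation (5.67) exactly
    Psi = np.cosh(a * x) * np.exp(a**2 * t) + B * np.cosh(b * x) * np.exp(b**2 * t)
    dPsi = a * np.sinh(a * x) * np.exp(a**2 * t) + B * b * np.sinh(b * x) * np.exp(b**2 * t)
    plt.plot(x, -2 * dPsi / Psi, label=f"t = {t}")    # Cole-Hopf, Eq. (5.66)
plt.xlabel("x"); plt.ylabel("u"); plt.legend(); plt.show()
```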

$^*$This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute to the midterm and final exams.
Part III

Optimization

Chapter 6

Calculus of Variations

The main theme of this chapter is the relation of equations to minimal principles. Oversimplifying a bit: to minimize a function $S(q)$ is to solve $S'(q) = 0$. For a quadratic, $S(q) = \frac{1}{2}q^T K q - q^T g$, where $K$ is positive definite, one indeed has the minimum of $S(q)$ achieved at $q_*$, which solves $S'(q_*) = Kq_* - g = 0$.
In the example above, $q$ is an $n$-(finite) dimensional vector, $q \in \mathbb{R}^n$. Consider extending the finite-dimensional optimization to an infinite-dimensional, continuous, problem where $q(x)$ is a function, say $q(x) : \mathbb{R} \to \mathbb{R}$, and $S\{q(x)\}$ is a functional, typically an integral with the integrand dependent on $q(x)$ and its derivative, $q'(x)$, for example
$$S\{q(x)\} = \int dx\left(\frac{c}{2}(q'(x))^2 - g(x)q(x)\right).$$
The derivative of the functional over $q(x)$ is called the variational derivative, and by analogy with the finite-dimensional example above, one finds that the Euler-Lagrange (EL) equation,
$$\frac{\delta S\{q\}}{\delta q(x)} = 0,$$
solves the problem of minimizing the functional. The goal of this section is to understand the variational derivative and other related concepts, in theory and on examples.

6.1 Examples
To have a better understanding of the calculus of variations we start by describing four
examples.

6.1.1 Fastest Path

Consider a robot navigating in the $(x, y)$-plane, and define the function $y = q(x)$ so that it describes the path of the robot. For small $\delta x$, the arclength of the robot's path from $(x, y)$ to $(x + \delta x, y + \delta y)$ can be approximated by $\sqrt{(\delta x)^2 + (\delta y)^2}$ by the Pythagorean theorem, which simplifies to $\sqrt{1 + (q'(x))^2}\,dx$ for infinitesimal $\delta x$. If the plane constitutes a rugged terrain, then the robot's speed may differ at each point in the plane, so we define the scalar-valued positive function $\mu^{-1}(x, y)$ to describe the speed of the robot at each point in the plane. The time it takes for the robot to move from $(x, q(x))$ to $(x + dx, q(x + dx))$ along the path $q(x)$ is
$$L(x, q(x), q'(x))\,dx := \mu(x, q(x))\sqrt{1 + (q'(x))^2}\,dx.$$
The total time taken to travel along the path which starts at an initial point $(x_i, y_i) := (0, 0)$ and ends at a terminal point $(x_t, y_t) := (a, b)$, where $a > 0$, is
$$S\{q(x)\} = \int_0^a dx\,L(x, q(x), q'(x)).$$
Subject to a few modest conditions on $\mu(x, y)$, the calculus of variations provides a way to find the optimal path through the domain, that is, the path which minimizes the functional $S\{q(x)\}$, subject to $q(0) = 0$ and $q(a) = b$.

6.1.2 Minimal Surface

Consider making a three-dimensional bubble by dipping a wire loop into soapy water, and then asking whether there is an optimal bubble shape for a given loop. Physics suggests that the bubble will form in whatever shape minimizes the surface area of the soap film.
We formalize this setting as follows. The surface of the bubble is described by a continuously differentiable function, $q(x) : x = (x_1, x_2) \in D \to q(x) \in \mathbb{R}$, where $D \subset \mathbb{R}^2$ is bounded. We also assume that at the boundary of $D$, denoted $\partial D$ and representing a closed line in the $(x_1, x_2)$ plane, $q(x)$ is fixed/known, i.e. $q(\partial D) = g(\partial D)$, where $g(\partial D)$ describes the coordinate of the wire loop along the third dimension. Then the optimal bubble results from minimizing the functional
$$S\{q(x)\} = \int_D dx\,\sqrt{1 + |\nabla_x q(x)|^2} \tag{6.1}$$
over $q(x)$, subject to $q(\partial D) = g(\partial D)$.

Example 6.1.1. Show that Eq. (6.1) represents the general formula for the surface area of the graph of a continuously differentiable function, $q(x)$, where $x = (x_1, x_2) \in D \subset \mathbb{R}^2$ and $D$ is a region in the $(x_1, x_2)$ plane with a smooth boundary.
Figure 6.1: Illustration of the construction of an infinitesimal surface element (with vertices $A$, $B$, $C$) in Example 6.1.1.

Solution. It is sufficient to derive the differential version of Eq. (6.1), i.e. to show that the area of a surface element, represented by $q(x)$ around a point $x^{(0)} = (x_1^{(0)}, x_2^{(0)}) \in D$, is described by
$$\sqrt{1 + |\nabla_x q(x)|^2}\,dx = \sqrt{1 + (\partial_{x_1}q(x_1, x_2))^2 + (\partial_{x_2}q(x_1, x_2))^2}\,dx_1 dx_2,$$
where $\sqrt{1 + |\nabla_x q(x)|^2}$ is evaluated at $x = x^{(0)}$. In this (infinitesimal) case, we can represent the surface of the infinitesimal element by the plane
$$x_3 - x_3^{(0)} = a(x_1 - x_1^{(0)}) + b(x_2 - x_2^{(0)})$$
in $\mathbb{R}^3$, where $x_3^{(0)} = q(x^{(0)})$, $a = \partial_{x_1}q|_{x = x^{(0)}}$ and $b = \partial_{x_2}q|_{x = x^{(0)}}$. Specifically, we can describe the plane in terms of the following three points in three-dimensional space (see Fig. (6.1) for the illustration):
$$A = \left(x_1^{(0)}, x_2^{(0)}, x_3^{(0)}\right), \qquad B = \left(x_1^{(0)} + dx_1, x_2^{(0)}, x_3^{(0)} + a\,dx_1\right), \qquad C = \left(x_1^{(0)}, x_2^{(0)} + dx_2, x_3^{(0)} + b\,dx_2\right).$$

Then the area of the surface of the infinitesimal element becomes the absolute value of the cross (vector) product of the following two infinitesimal three-dimensional vectors:
$$|u \times v| = |(B - A) \times (C - A)| = |(-a\,dx_1 dx_2, -b\,dx_1 dx_2, dx_1 dx_2)| = dx_1 dx_2\sqrt{1 + a^2 + b^2},$$
where we have used the standard vector calculus rule for the cross/vector product in three dimensions, $(y \times z)_i = \sum_{j,k = 1,2,3}\varepsilon_{ijk}y_j z_k$, with $\varepsilon_{ijk}$ being the Levi-Civita (absolutely antisymmetric) tensor in three dimensions.

6.1.3 Image Restoration

A gray-scale image is described by a function, q(x) : [0, 1]² → [0, 1], mapping a location, x,
within the square box, [0, 1]² ⊂ R^2, into a real number between 0 (white) and 1 (black).
However, the true image is often corrupted by noise, and we only observe this noisy image.
The task of image restoration is to restore the true image from the noisy observation.
    Total Variation (TV) restoration [3] is a method built on the conjecture that the true
image is reconstructed from the noisy image, f(x), by minimization of the following func-
tional,

    S{q(x)} = ∫_U dx ((q(x) − f(x))² + λ|∇x q(x)|),   U = [0, 1]²,    (6.2)

subject to the Neumann boundary condition, n · ∇x q(x) = 0 for all x ∈ ∂U, where n is the
(unit) vector normal to ∂U, the boundary of the domain U.

6.1.4 Classical Mechanics

Classical mechanics is described in terms of a function, q(t) : R → R^d, mapping a time,
t ∈ R, into a d-dimensional real-valued spatial coordinate, q ∈ R^d. The evolution of the
coordinate in time is described in Hamiltonian mechanics by the principle of minimal action
(also called Hamilton's principle): the trajectory, understood as describing the evolution of
the coordinate in time, is governed by the minimum of the action,

    S{q} := ∫_{t1}^{t2} dt L(t, q(t), q̇(t)),    (6.3)

where L(t, q(t), q̇(t)) is the system Lagrangian, and q̇(t) = dq(t)/dt is the velocity (equal to
the momentum under the unit-mass convention of these notes), under the condition that the
values of the coordinate at the initial and final moments of time are fixed, q(t1) = q1,
q(t2) = q2. An exemplary Hamiltonian dynamics is that of a (unit-mass) particle in a
potential, V(q), in which case

    L(t, q(t), q̇(t)) = q̇²/2 − V(q).    (6.4)

6.2 Euler-Lagrange Equations

All the examples above can be stated as the minimization of a functional,

    S{q(x)} = ∫_{D⊆R^n} dx L(x, q(x), ∇x q(x)),

over functions, q(x), with a fixed value at the boundary, x ∈ ∂D : q(x) = g(x), where D
is bounded and the value of q is known at all points of the boundary, and the Lagrangian L is
a given function,

    L : D ⊆ R^n × R^d × R^{d×n} → R,

of three variables. It will also be convenient in deriving further relations to treat the three
variables in the argument of L separately, denoting the respective derivatives Lx, Lq, and L∇q.
(Note that the variables are: x ∈ D ⊆ R^n, q ∈ R^d, and ∇x q ∈ R^{d×n}.) We will
assume in the following that both L and g are smooth.

Theorem 6.2.1 (Necessary condition for optimality). Suppose that q(x) is the minimizer
of S, that is,

    S{q̃(x)} ≥ S{q(x)},   ∀q̃(x) ∈ C^2(D̄ = D ∪ ∂D) such that q̃(x) = q(x), ∀x ∈ ∂D;

then L satisfies the so-called Euler-Lagrange (EL) equations

    ∇x · (L∇q (x, q(x), ∇x q(x))) − Lq (x, q(x), ∇x q(x)) = 0   (∀x ∈ D).    (6.5)

Sketch of the proof: Consider the perturbation q(x) → q̃(x) = q(x) + sδ(x), where s ∈ R
and δ(x) is sufficiently smooth and such that it does not change the boundary condition, i.e.,
δ(x) = 0 (∀x ∈ ∂D). Then, according to the assumption,

    S{q(x)} ≤ S{q̃(x)} = S{q(x) + sδ(x)}   (∀s ∈ R).

This means that

    (d/ds) S{q(x) + sδ(x)}|_{s=0} = 0.

Notice that

    S{q(x) + sδ(x)} = ∫_D dx L(x, q(x) + sδ(x), ∇x q(x) + s∇x δ(x)).

Then, exchanging the orders of differentiation and integration, applying the differentiation
(chain) rules to the Lagrangian, and evaluating one of the resulting integrals by parts and
removing the boundary term (because δ(x) = 0 on ∂D), one derives

    (d/ds) S{q(x) + sδ(x)}|_{s=0}
        = ∫_D dx (d/ds) L(x, q(x) + sδ(x), ∇x q(x) + s∇x δ(x))|_{s=0}    (6.6)
        = ∫_D dx (Lq(x, q(x), ∇x q(x)) · δ(x) + L∇q(x, q(x), ∇x q(x)) · ∇x δ(x))
        = ∫_D dx (Lq(x, q(x), ∇x q(x)) − ∇x · L∇q(x, q(x), ∇x q(x))) · δ(x).

Since the resulting integral should be equal to zero for any δ(x), one arrives at the desired
statement.

Remark. Solutions of the EL Eqs. (6.5), q(x), are stationary curves of the functional,
S{q(x)}, and could be minimizers, maximizers or saddle-points. It is for this reason that
Theorem 6.2.1 provides a necessary, but not sufficient, condition for minimizing S{q(x)}.

Example 6.2.2. Find the Euler-Lagrange equations (conditions) for

    (a) ∫ dx ((q′(x))² + exp(q(x))),
    (b) ∫ dx q(x)q′(x),
    (c) ∫ dx x²(q′(x))²,

where q : R → R.
Solution. In one dimension the Euler-Lagrange equation simplifies to

    (d/dx)(∂L/∂q′) − ∂L/∂q = 0.

Since we are not asked to solve the equations, this is only a matter of constructing the
respective equation for each case.

(a) We identify L(x, q, q′) = (q′)² + exp(q) and derive

    (d/dx)(∂L/∂q′) = (d/dx)(2q′) = 2q″,
    ∂L/∂q = exp(q).

Then the Euler-Lagrange equation is

    2q″ − exp(q) = 0.

(b) We identify L(x, q, q′) = qq′ and derive

    (d/dx)(∂L/∂q′) = (d/dx)(q) = q′,
    ∂L/∂q = q′.

Then the Euler-Lagrange equation is

    q′ − q′ = 0.

This implies that any function q satisfies the Euler-Lagrange equation, which in turn
implies that the functional does not have any minima or maxima.

(c) We identify L(x, q, q′) = x²(q′)² and derive

    (d/dx)(∂L/∂q′) = (d/dx)(2x²q′) = 2x²q″ + 4xq′,
    ∂L/∂q = 0.

Then the Euler-Lagrange equation is

    x²q″ + 2xq′ = 0.
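
These conditions can also be cross-checked with a computer algebra system; a minimal sketch (our addition) using SymPy's euler_equations, which writes the equations in the equivalent form Lq − (d/dx)Lq′ = 0:

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
q = sp.Function('q')

# Cases (a), (b), (c) of Example 6.2.2
for L in ((sp.Derivative(q(x), x)**2 + sp.exp(q(x))),   # (a)
          (q(x) * sp.Derivative(q(x), x)),              # (b) degenerates to 0 = 0
          (x**2 * sp.Derivative(q(x), x)**2)):          # (c)
    print(euler_equations(L, q(x), x))
```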

Example 6.2.3. Consider the shortest-path version of the fastest-path problem set in
Section 6.1.1, that is, the case of µ(x, y) = 1:

    min_{q(x) | x∈[0,a], q(0)=0, q(a)=b} ∫_0^a dx √(1 + (q′(x))²).

Find the Euler-Lagrange (EL) condition on q(x).


Solution. The Euler-Lagrange condition on q(x) becomes

    0 = ∇x (L∇q (x, q(x), ∇x q(x))) − Lq (x, q(x), ∇x q(x))
      = (d/dx) [q′(x)/√(1 + (q′(x))²)] − 0
    → q′(x)/√(1 + (q′(x))²) = constant
    → q′(x) = constant
    → q(x) = (b/a) x,

where at the last step we accounted for the boundary conditions. The shortest (optimal)
path connects the initial and final points by a straight line.

Exercise 6.1. (a) Write the Euler-Lagrange equation for the general case of the fastest-
path problem formulated in Section 6.1.1. (b) Find an example of µ(x, y) resulting in a
quadratic optimal path, i.e., q(x) = (b/a²) x². (Your solution for µ(x, y) should be independent
of a and b.)

Example 6.2.4. Let us derive the Euler-Lagrange condition for the Minimal Surface prob-
lem introduced in Section 6.1.2:

    min_{q(x): q(∂D)=g(∂D)} ∫_D dx √(1 + |∇x q(x)|²).

Solution. In this case Eq. (6.5) becomes

    0 = ∇x · (L∇q (x, q(x), ∇x q(x))) − Lq (x, q(x), ∇x q(x))
      = ∇x · [∇x q(x)/√(1 + |∇x q(x)|²)]
    → (1 + |∇x q(x)|²) ∇x² q − ∇x q(x) · (∇x ∇x q) · ∇x q(x) = 0,    (6.7)

where in the first term ∇x² q denotes the Laplacian of q and in the second term ∇x ∇x q
denotes the Hessian matrix of q.

6.3 Phase-Space Intuition and Relation to Optimization (finite dimensional, not functional)

Consider the special case of the fastest-path problem of Section 6.1.1, which is still more
general than the shortest-path problem discussed in Example 6.2.3, where µ(x) depends
only on x. In this case the action is

    S{q(x)} = ∫_0^a dx µ(x) √(1 + (q′(x))²) = ∫_0^a ds µ(x),

where ds is the element of arc-length of the curve q(x):

    ds = √(1 + (q′(x))²) dx = √(dx² + dq²).

The Lagrangian and its partial derivatives are L(x; q(x); q′(x)) = µ(x)√(1 + (q′(x))²),
Lq = 0, Lq′ = µ(x)q′/√(1 + (q′)²). Then the Euler-Lagrange equation becomes

    (d/dx) [µ(x)q′(x)/√(1 + (q′(x))²)] = 0,

which results in

    µ(x)q′(x)/√(1 + (q′(x))²) = µ(x) sin(θ) = constant,    (6.8)

!" ∆%"
(&
!# ∆%# !&
!&'"
(&'"

Figure 6.2: Variational Calculus via Discretization and Optimization.

where θ is the angle in the (q, x) space between the tangent to q(x) and the x-axis.
    It is instructive to derive Eq. (6.8) bypassing the variational calculus, taking instead the
perspective of standard optimization, that is, optimizing over a finite number of continuous
variables. To make this link we need, first, to discretize the action, S{q(x)}:

    S{q(x)} ≈ S_Δ(· · · , qk, · · ·) = Σ_k µk Δsk = Σ_k µk √(1 + ((qk − qk−1)/Δ)²) Δ,

where qk := q(xk), Δ is the size of a step in x, i.e., Δ = xk+1 − xk, ∀k, and Δsk is the length
of the k-th segment of the discretized curve, illustrated in Fig. (6.2). Then, second, we look
for extrema of S_Δ over the qk, i.e., require that ∀k : ∂qk S_Δ = 0. The result is the discretized
version of the Euler-Lagrange Eq. (6.8):

    ∀k :  µk+1 (qk+1 − qk)/√(1 + ((qk+1 − qk)/Δ)²) = µk (qk − qk−1)/√(1 + ((qk − qk−1)/Δ)²)
       →  µk+1 sin θk+1 = µk sin θk.
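
The discretize-then-optimize route also gives a practical numerical method. The sketch below (our addition; the terrain µ is a hypothetical example) minimizes the discretized action over the interior nodes qk with a generic finite-dimensional optimizer.

```python
import numpy as np
from scipy.optimize import minimize

a, b, K = 1.0, 1.0, 60
dx = a / K
x = np.linspace(0.0, a, K + 1)
# Hypothetical terrain: a slow horizontal band near y = 0.5
mu = lambda x, y: 1.0 + 5.0 * np.exp(-20.0 * (y - 0.5)**2)

def action(q_free):
    q = np.concatenate(([0.0], q_free, [b]))   # enforce q(0)=0, q(a)=b
    dq = np.diff(q)
    ds = np.sqrt(dx**2 + dq**2)                # segment lengths Delta s_k
    xm, ym = (x[:-1] + x[1:]) / 2, (q[:-1] + q[1:]) / 2
    return np.sum(mu(xm, ym) * ds)             # sum_k mu_k Delta s_k

q0 = np.linspace(0.0, b, K + 1)[1:-1]          # straight line as a start
res = minimize(action, q0, method="BFGS")
# res.x approximates the optimal interior nodes q_1, ..., q_{K-1}
```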



6.4 Towards Numerical Solutions of the Euler-Lagrange Equations

Here ∗ we discuss the image restoration problem set up in Section 6.1.3. We will derive the
Euler-Lagrange equations and observe that the resulting equations are difficult to solve. We
will then use this case to illustrate the theoretical part (philosophy) of solving the Euler-
Lagrange equations numerically. Following [4], we will use the example to discuss gradient
descent in this Section, and then also the primal-dual method below in Section 6.7.

6.4.1 Smoothing Lagrangian

The TV functional (6.2) is not differentiable at ∇x q(x) = 0, which creates difficulty for
variations. One way to bypass the problem is to smooth the Lagrangian, considering

    Sε{q} = ∫_{[0,1]²} dx [ (q(x) − f(x))²/2 + λ √(ε² + |∇x q(x)|²) ],    (6.9)

where ε is small and positive. The Euler-Lagrange equations for the smoothed action (6.9)
are

    ∀x ∈ [0, 1]² :  q − λ ∇x · [∇x q/√(ε² + |∇x q|²)] = f,    (6.10)

with the homogeneous Neumann boundary conditions, ∀x ∈ ∂[0, 1]² : ∂q(x)/∂n = 0,
where n denotes the normal to the boundary of the [0, 1]² domain. Finding analytical solutions
to Eq. (6.10) for an arbitrary f is not possible. We will discuss ways to solve Eq. (6.10)
numerically in the following.

6.4.2 Gradient Descent and Acceleration

We will start this part with a disclaimer. The discussion below of the numerical procedure
for solving Eq. (6.10) is not fully comprehensive. We add it here for completeness, delegating
details to Math 589, and also aiming to emphasize connections between numerical PDE
analysis and optimization algorithms.
    A standard numerical scheme for solving Eq. (6.10), originating from optimization of
the action, is gradient descent. It is useful to think about the gradient descent algorithm
by introducing an extra "computational time" dimension, which will be discrete in imple-
mentation but can also be thought of (for the purpose of analysis and gaining intuition) as

∗ This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute
to the midterm and finals.

continuous. Consider the following equation,

    ∀x ∈ [0, 1]², t > 0 :  ∂t υ + υ − λ ∇x · [∇x υ/√(ε² + |∇x υ|²)] = f,    (6.11)

for υ(t; x), representing the estimation at computational time t of q(x) solving Eq. (6.10),
with the initial conditions, ∀x : υ(0; x) = f(x), and the boundary conditions, ∀x ∈
∂[0, 1]² : ∂υ(x)/∂n = 0. Eq. (6.11) is a nonlinear heat equation. Close to the equilibrium
the equation can be linearized. Discretizing the linear diffusion equation on the spatio-
temporal grid with spacings ∆t and ∆x, and looking for the dynamic (time-derivative)
term balancing the diffusion term (containing the second-order spatial derivative), one
arrives at the following rough empirical estimate:

    ∆t ∼ ε(∆x)²/λ.

The estimate suggests that the temporal step needs to be very small (square of the
spatial step) to guarantee that the numerical scheme is proper (not stiff). The condition
becomes even more demanding with decrease of the regularization parameter, ε.
    One way to improve the gradient scheme (to make it less stiff) is to replace the diffusion
Eq. (6.11) by the (damped) wave equation

    ∀x ∈ [0, 1]², t > 0 :  ∂t² υ + a ∂t υ + υ − λ ∇x · [∇x υ/√(ε² + |∇x υ|²)] = f,    (6.12)

where a is the damping coefficient. Acting by analogy with the diffusive case, let us make an
empirical estimate for the balanced choice of the spatial discretization step, ∆x, the temporal
discretization step, ∆t, and the damping coefficient. Linearizing the nonlinear wave
Eq. (6.12) and then requiring that the ∂t² (temporal oscillation) term, the a∂t (damping)
term and the (λ/ε)∇x² (diffusion) term are balanced, one arrives at the following estimate:

    (∆t)² ∼ ∆t/a ∼ ε(∆x)²/λ,

which results in a much less demanding linear scaling, ∆t ∼ ∆x.
    This transition from overdamped relaxation to balancing damping with oscillations
corresponds to Polyak's heavy-ball method [5] and Nesterov's accelerated gradient de-
scent method [6], which are now used extensively (often with the addition of a stochastic
component) in training of modern Neural Networks. Both methods will be discussed later
in the course, and even more in the companion Math 575 course. Notice also that additional
material on the modern, continuous-time interpretation of the acceleration method and
other related algorithms can be found in [7, 8]. See also Sections 2.3 and 3.6 of [4].
We will come back to the image-restoration problem one more time in Section 6.7.2
where we discuss an alternative, primal-dual algorithm.
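
For concreteness, here is a minimal sketch (our addition, not the notes' reference implementation) of the overdamped scheme: forward-Euler stepping of Eq. (6.11) on a uniform grid, with Neumann boundary conditions imposed by edge padding. Note how small the stable time step is, which is exactly the stiffness discussed above.

```python
import numpy as np

def tv_restore(f, lam=0.1, eps=1e-2, n_steps=2000):
    """Explicit gradient descent for the smoothed TV action, Eq. (6.11)."""
    v = f.copy()
    h = 1.0 / f.shape[0]
    dt = 0.2 * eps * h**2 / lam        # stiff: dt ~ eps * h^2 / lam
    for _ in range(n_steps):
        vp = np.pad(v, 1, mode="edge")             # Neumann BCs
        gx = (vp[2:, 1:-1] - vp[:-2, 1:-1]) / (2 * h)
        gy = (vp[1:-1, 2:] - vp[1:-1, :-2]) / (2 * h)
        den = np.sqrt(eps**2 + gx**2 + gy**2)
        px, py = gx / den, gy / den                # the regularized flux
        pxp, pyp = np.pad(px, 1, mode="edge"), np.pad(py, 1, mode="edge")
        div = (pxp[2:, 1:-1] - pxp[:-2, 1:-1]) / (2 * h) \
            + (pyp[1:-1, 2:] - pyp[1:-1, :-2]) / (2 * h)
        v += dt * (f - v + lam * div)              # d(v)/dt = f - v + lam*div
    return v

# Usage on a hypothetical noisy synthetic image
rng = np.random.default_rng(0)
truth = np.zeros((64, 64)); truth[16:48, 16:48] = 1.0
noisy = truth + 0.2 * rng.standard_normal(truth.shape)
restored = tv_restore(noisy)   # converges, but slowly (tiny dt)
```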

6.5 Dependence of the action on the end-points

Consider x ∈ R and let the points A0 := (x0, q0) and A1 := (x1, q1) be given, and let F
be the family of continuously differentiable functions on [x0, x1] that satisfy q(x0) = q0 and
q(x1) = q1. For a given Lagrangian L(x, q, q′), let S : F → R be the functional

    S{q(x)} = ∫_{x0}^{x1} L(x, q(x), q′(x)) dx.    (6.13)

In Section 6.2, we showed that if a function q(x) satisfies the EL equations, (d/dx)Lq′ − Lq = 0,
then q(x) is a stationary curve of S and therefore a candidate for a minimizer of S. (See
also the discussion of the EL Theorem 6.2.1 and the following remark.)

Notation. In the following, we abuse notation a little and use the symbol S to denote both
the action-functional

    S{q(x)} = ∫_{x0}^{x1} L(x, q(x), q′(x)) dx

and also the action-function

    S(A0, A1) = S(x0, q0, x1, q1) = min_{q(x)∈F} ∫_{x0}^{x1} L(x, q(x), q′(x)) dx.

Therefore, S(A0, A1) corresponds to S{q(x)} evaluated at solutions q(x) of the associated
EL equations and should thus be understood as a function of the end-points, A0 and A1, of
q(x).
    The following statement gives a very intuitive, geometrical interpretation of the deriva-
tives of the action over the end-point parameters.

Theorem 6.5.1 (End-point derivatives of the action).

    (a)  ∂x1 S(A0; A1) = (L − q′Lq′)|x=x1,   ∂x0 S(A0; A1) = −(L − q′Lq′)|x=x0,    (6.14)
    (b)  ∂q1 S(A0; A1) = Lq′|x=x1,            ∂q0 S(A0; A1) = −Lq′|x=x0.           (6.15)

Proof. Here (as is customary in this course) we will only sketch the proof. Let us focus,
without loss of generality, on the part of the theorem concerning derivatives with respect
to x1 and q1, i.e., the final end-point of the critical path.
    Let us first keep the final independent variable fixed at x1 but move the final position
by dq, as shown in Fig. 6.3. The trajectory q(x) will vary by δq(x), where δq(x0) = 0
and δq(x1) = dq.
    At each x, we can estimate the value of L(x, q + δq, q′ + δq′) from the first-order
Taylor expansion,

    L(x, q + δq, q′ + δq′) ≈ L(x, q, q′) + Lq(x, q, q′) · δq + Lq′(x, q, q′) · δq′.    (6.16)



Figure 6.3: Critical curves from (x0 , q0 ) to (x1 , q1 ) (green) and to (x1 , q1 + dq) (purple).

Figure 6.4: Critical curves from (x0, q0) to (x1, q1) (green) and to (x1 + dx, q1 + dq) (purple),
where dq = (dq/dx)|_{x1} dx.


Figure 6.5: Critical curves from (x0 , q0 ) to (x1 , q1 ) (green) and to (x1 + dx, q1 ) (purple).

Therefore, the variation of the Lagrangian is given by

    δL = Lq δq + Lq′ δq′,    (6.17)

and the variation of the action is

    dS = ∫_{x0}^{x1} dx (Lq′ δq′ + Lq δq).    (6.18)

The relation δq′ = d(δq)/dx, together with the Euler-Lagrange Eqs. (6.21), allows us to
rewrite Eq. (6.18) as

    dS = ∫_{x0}^{x1} dx [ Lq′ (d/dx)δq + δq (d/dx)Lq′ ] = ∫_{x0}^{x1} d(Lq′ δq) = (Lq′ δq)|_{x0}^{x1} = Lq′|_{x1} dq.    (6.19)

Therefore, as we kept the final independent variable fixed, dS = ∂q1 S dq, and one arrives at
the desired statement:

    ∂S/∂q1 = Lq′|_{x1}.    (6.20)

Now consider the variation of the action extended from A1 = (x1, q1) to (x1 + dx, q1 + q′(x1)dx):

    dS = L dx = (∂S/∂x1) dx + (∂S/∂q1) q′(x1) dx = (∂S/∂x1) dx + Lq′|_{x1} q′(x1) dx,

where we utilized Eq. (6.20). Finally, we derive

    ∂S/∂x1 = (L − q′Lq′)|_{x1}.

Example 6.5.2. Find the minimizers of the functional

    S{q(x)} = ∫_0^1 (q′² + xq) dx

for the case where (a) q(0) = 1 and q(1) = 0, and where (b) q(0) = 1 and q(1) is free.
Solution. The stationary curve (by definition) satisfies the EL equations:

    (d/dx)Lq′ − Lq = 0  ⇒  (d/dx)(2q′) − x = 0  ⇒  2q″ − x = 0,

thus resulting in q(x) = x³/12 + c1 x + c2. For part (a), the values of c1 and c2 are determined
from the requirements that q(0) = 1 and q(1) = 0, giving c1 = −13/12 and c2 = 1. For part (b),
given that the value q(1) is free, we must find the optimal value of q(1) by solving ∂q1 S = 0:

    0 = ∂S/∂q1 = Lq′|_{x=1} = (2q′)|_{x=1}.

Therefore, in this case the optimal value of q(1) occurs when q′(1) = 0. In this case, the
corresponding values of c1 and c2 are c1 = −1/4 and c2 = 1.
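
The two solutions can be verified symbolically; a small SymPy sketch (our addition):

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
q = sp.Function('q')
L = sp.Derivative(q(x), x)**2 + x * q(x)

el = euler_equations(L, q(x), x)[0]      # x - 2*q''(x) = 0, i.e. q'' = x/2
print(sp.dsolve(el, q(x), ics={q(0): 1, q(1): 0}))
# (a): q(x) = x**3/12 - 13*x/12 + 1
print(sp.dsolve(el, q(x), ics={q(0): 1, q(x).diff(x).subs(x, 1): 0}))
# (b): q(x) = x**3/12 - x/4 + 1
```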

Exercise 6.2. Find all the critical function(s), q(x), of the functional

    S{q(x)} = ∫_0^a (q² + 2qq′ + (q′)² + 1) dx,

where q(0) = 0 and q(a) = 1 at some positive but not fixed a (i.e., a is itself subject to
optimization).

6.6 Variational Principle of Classical Mechanics

In this Section we apply the principle of minimal action (also called the variational principle,
or Hamilton's principle) to the case of classical mechanics, already highlighted in Section
6.1.4. (See also [9], which we follow in this Section.)

6.6.1 Noether's Theorem & time-invariance of space-time derivatives of action

In the case of classical mechanics, introduced in Section 6.1.4, the Euler-Lagrange
Eqs. (6.5) are

    (d/dt)Lq̇ − Lq = 0,    (6.21)

where L(t, q(t), q̇(t)) : R × R^d × R^d → R. Let us consider the case when the Lagrangian does
not depend explicitly on time. (It may still depend on time implicitly via q(t) and q̇(t),
i.e., L(q(t), q̇(t)).) In this case, and quite remarkably, the Euler-Lagrange equation can be
rewritten as a conservation law. Indeed,

    (d/dt)(q̇ · Lq̇ − L) = q̈ · Lq̇ + q̇ · (d/dt)Lq̇ − Lq · q̇ − Lq̇ · q̈ = q̇ · ((d/dt)Lq̇ − Lq) = 0,

where the last equality is due to Eq. (6.21).
    We have just introduced the Hamiltonian, H = q̇ · Lq̇ − L, representing the energy stored
within the mechanical system instantaneously, and proved that if the Lagrangian (and thus
the Hamiltonian) does not have an explicit dependence on time, the Hamiltonian (and energy)
is conserved. This is a particular case of Noether's theorem.
    Notice that symmetry under a parametrically continuous change, such as the one just
explored (consisting in invariance of the Lagrangian under the time shift), is generally a
stronger property than a conservation law.
    To state theorems expressing the invariance(s) we need the following definition.

Definition 6.6.1 (Invariance of the Lagrangian). Consider a family of transformations of
R^d, hs(q) : R^d → R^d, where s ∈ R, hs(q) is continuous in both q and (the parameter) s,
and h0(q) = q. We say that a Lagrangian, L(q(t), q̇(t)) : R^d × R^d → R, is invariant
under the action of the family of transformations hs(q) if L(q, q̇) does not change when
q(t) is replaced by hs(q(t)), i.e., if for any function q(t) we have

    L(hs(q(t)), (d/dt)hs(q(t))) = L(q(t), (d/dt)q(t)).

Common examples of hs(q(t)) in classical mechanics include

  • translational invariance, hs(q(t)) = q(t) + se, where e is a unit vector in R^d and s
    is the distance of the transformation;

  • rotational invariance, hs(q(t)) = Re(s)q(t), where Re(s) is the rotation through angle s
    around the line through the origin defined by the unit vector e;

  • a combination of translational and rotational invariance (cork-screw motion):
    hs(q(t)) = aes + Re(s)q(t), where a is a constant.

Theorem 6.6.2 (Noether's theorem (1915)). If the Lagrangian L is invariant under the
action of a one-parameter family of transformations, hs(q(t)), then the quantity

    I(q(t), q̇(t)) := Lq̇ · (d/ds)(hs(q(t)))|_{s=0}    (6.22)

is constant along any solution of the Euler-Lagrange Eq. (6.21). Such a constant quantity
is called an integral of motion.

Proof. Following the discussion of Section 6.5, we consider the action-function, S(t0, q0, t1, q1),
i.e., the minimum value of the action-functional, analyzed as a function of the end points,
(t0, q0) and (t1, q1). Theorem 6.5.1, applied to the case of classical mechanics, states that

    (a)  ∂t1 S(A0; A1) = (L − q̇Lq̇)|t=t1,   ∂t0 S(A0; A1) = −(L − q̇Lq̇)|t=t0,
    (b)  ∂q1 S(A0; A1) = Lq̇|t=t1,            ∂q0 S(A0; A1) = −Lq̇|t=t0.

The assumptions of Noether's theorem require that the Lagrangian is invariant under the
transformation q(t) → hs(q(t)), which gives

    S(t0, hs(q0); t1, hs(q1)) = S(t0, q0; t1, q1), ∀s.    (6.23)

Differentiating both sides of Eq. (6.23) with respect to s, applying Theorem 6.5.1, and
evaluating the result at s = 0, leads us to

    0 = ∂q0 S · (d/ds)hs(q(t0))|_{s=0} + ∂q1 S · (d/ds)hs(q(t1))|_{s=0}
      = −Lq̇(q(t0), q̇(t0)) · (d/ds)hs(q(t0))|_{s=0} + Lq̇(q(t1), q̇(t1)) · (d/ds)hs(q(t1))|_{s=0}.

Since t1 can be chosen arbitrarily, this proves that Eq. (6.22) is constant along the solution
of the Euler-Lagrange Eq. (6.21).

Exercise 6.3. For q(t) ∈ R^3, where t ∈ [t0, t1], and each of the following families of
transformations, find the explicit form of the conserved quantity given by Noether's
theorem (assuming that the respective invariance of the Lagrangian holds):

(a) space translation in the direction e: hs(q(t)) = q(t) + se.

(b) rotation through angle s around the vector e ∈ R^3: hs(q(t)) = Re(s)q(t).

(c) helical symmetry, hs(q(t)) = aes + Re(s)q(t), where a is a constant.

6.6.2 Hamiltonian and Hamilton Equations: the case of Classical Mechanics

Let us utilize the specific structure of the classical mechanics Lagrangian, which is split,
according to Eq. (6.4), into a difference of the kinetic energy, q̇²/2, and the potential energy,
V(q). Making the obvious observation that the minimum of the functional

    ∫ dt ½(q̇ − p)²

over {p(t)} is achieved at ∀t : q̇ = p, and then stating the kinetic term of the classical
mechanics action, that is the first term in Eq. (6.4), in terms of an auxiliary optimization,

    ∫ dt q̇²/2 = max_{p(t)} ∫ dt (pq̇ − p²/2),    (6.24)

and substituting the result in Eqs. (6.3,6.4), one arrives at the following, alternative,
variational formulation of classical mechanics:

    min_{q(t)} max_{p(t)} ∫ dt (pq̇ − H(q; p)),    (6.25)

    H(q; p) := p²/2 + V(q),    (6.26)

where p and H are defined as the momentum and Hamiltonian of the system. Turning
the second (Hamiltonian) principle of classical mechanics into equations (which, like the
EL equations, are only necessary conditions of optimality), one arrives at the so-called
Hamiltonian equations

    q̇ = ∂H(q; p)/∂p,   ṗ = −∂H(q; p)/∂q.    (6.27)

Example 6.6.3. (a) [Conservation of Energy] Show that in the case of the time-independent
Hamiltonian (i.e., in the case of H(q; p) considered so far), H is also the energy, which is
conserved along the solution of the Hamiltonian equations (6.27).
(b) [Conservation of Momentum] Show that if the Lagrangian does not depend explicitly on
one of the coordinates, say q^(1) where q = (q^(1), · · ·), then the corresponding momentum,
∂L/∂q̇^(1), is constant along the physical trajectory, given by the solutions of either the EL
or the Hamiltonian equations.
Solution. (a) The full time derivative of the Hamiltonian is

    dH/dt = (∂H/∂q) q̇ + (∂H/∂p) ṗ.

The Hamiltonian equations are

    q̇ = ∂H/∂p,   ṗ = −∂H/∂q.

Combining the two expressions,

    dH/dt = (∂H/∂q) q̇ + (∂H/∂p) ṗ = −ṗq̇ + q̇ṗ = 0,

we arrive at the desired statement.
(b) Suppose that L does not depend on q^(1); then ∂L/∂q^(1) = 0. Then, the EL equations,

    0 = (d/dt)(∂L/∂q̇^(1)) − ∂L/∂q^(1) = (d/dt)(∂L/∂q̇^(1)),

imply that ∂L/∂q̇^(1) is constant in time.
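
Energy conservation is easy to observe numerically. The sketch below (our addition, with a made-up quartic potential) integrates Eqs. (6.27) with the leapfrog scheme, a symplectic method that keeps H nearly constant over long times.

```python
import numpy as np

# H = p^2/2 + V(q) with the hypothetical potential V(q) = q^4/4
V  = lambda q: q**4 / 4
dV = lambda q: q**3

q, p, dt = 1.0, 0.0, 1e-3
H0 = p**2 / 2 + V(q)
for _ in range(100_000):
    p -= 0.5 * dt * dV(q)     # half kick:  p' = -dH/dq
    q += dt * p               # drift:      q' =  dH/dp = p
    p -= 0.5 * dt * dV(q)     # half kick
print(abs(p**2 / 2 + V(q) - H0))   # ~ O(dt^2), not growing in time
```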

    Notice that the Hamiltonian system of Eqs. (6.27) becomes even more elegant in the
vector form

    ż = J ∇z H(z),   z := (q, p)ᵀ,   J := [ 0  1 ; −1  0 ],    (6.28)

where the 2 × 2 matrix J generates two-dimensional rotation (clockwise in the (q, p)-space).

6.6.3 Hamilton-Jacobi equation

Let us work a bit more with the critical/optimal trajectory/path, {q(t); t ∈ [t0 = 0, t1]},
solving the Euler-Lagrange Eqs. (6.21), choosing the initial time to be equal to 0, fixing
the initial position, q(0) = q0, and then analyzing the dependence of the action on the final
time, t1, and position, q1. That is, we continue the thread of Theorem 6.5.1 and consider the
action-function as a function of A1 = (t1, q1), i.e., of the final position of the critical path.

    Indeed, let us re-derive in a slightly different, but equivalent, form the main results of
Theorem 6.5.1. Assuming that the action-function is a sufficiently smooth function of its
arguments, t1 and q1, one would like to introduce (and interpret) the derivatives of the action
over t1 and q1, and then check if the derivatives are related to each other. Consider, first,
the derivative of the action-function over t1:

    St1 := ∂t1 S(t1; q1) = ∂t1 ∫_0^{t1} dt L(q(t), q̇(t))
         = L|_{t=t1} + ∫_0^{t1} dt (Lq ∂t1 q(t) + Lq̇ ∂t1 q̇(t))
         = L|_{t=t1} + ∫_0^{t1} dt ∂t1 q(t) (Lq − (d/dt) Lq̇) + Lq̇ ∂t1 q(t)|_{t=0}^{t=t1} = (L − Lq̇ q̇)|_{t=t1},    (6.29)

where we have integrated by parts, used that ∂t1 q(t)|_{t=0} = 0 and ∂t1 q(t)|_{t=t1} = −q̇(t1)
(the latter because the final position, q(t1) = q1, is held fixed as t1 varies), and utilized the
Euler-Lagrange equations, ∀t ∈ [0, t1] : Lq − (d/dt) Lq̇ = 0.
    Next, let us evaluate the derivative of the action-function over the coordinate at the
final position, q1:

    Sq1 := ∂q1 S(t1; q1) = ∂q1 ∫_0^{t1} dt L(q(t), q̇(t)) = ∫_0^{t1} dt (Lq ∂q1 q(t) + Lq̇ ∂q1 q̇(t))
         = ∫_0^{t1} dt ∂q1 q(t) (Lq − (d/dt) Lq̇) + Lq̇ ∂q1 q(t)|_{t=0}^{t=t1} = Lq̇|_{t=t1},    (6.30)

where we used ∂q1 q(t)|_{t=0} = 0 and ∂q1 q(t)|_{t=t1} = 1.

    In the case of classical mechanics, when the Lagrangian factorizes into a difference of
kinetic and potential energy terms, the object on the right-hand side of Eq. (6.29) turns into
minus the Hamiltonian, defined above in Eq. (6.26), and the right-hand side of Eq. (6.30)
becomes the momentum, p = q̇. In the case of a generic (not factorizable) Lagrangian, one
can use the right-hand sides of Eq. (6.29) and Eq. (6.30) as the definitions of minus the
Hamiltonian of the system and of the system momentum, respectively:

    ∀t :  p := Lq̇,   H(t; q; p) := Lq̇ q̇ − L,    (6.31)

where the Hamiltonian is considered as a function of the current time, t, coordinate, q(t),
and momentum, p(t).
    Combining Eqs. (6.29,6.30,6.31), that is, (a) and (b) of Theorem 6.5.1 and the
definitions of the momentum and the Hamiltonian, one arrives at the Hamilton-Jacobi (HJ)
equation

    St1 + H(t1; q1; ∂q1 S) = 0,    (6.32)



which provides a nonlinear first-order PDE representation of classical mechanics.
    It is important to stress that if one knows the initial (at t = 0) values of the action-
function, S, and of its derivative, ∂q S, and also the explicit expression of the Hamiltonian
in terms of the time, coordinate and momentum at all moments of time, then Eq. (6.32),
combined with the initial conditions, represents a Cauchy initial value problem, therefore
resulting in solving the HJ equation unambiguously. This is a rather remarkable and strong
statement with many important consequences and generalizations. The statement is remarkable
because one gets the unique solution of the optimization problem in spite of the fact that the
solution of the EL equation is not necessarily unique (remember it is a necessary but not
sufficient condition for the minimum of the action, i.e., there may be multiple solutions of
the EL equations). Consequences of the HJ equations will be seen later when we discuss their
generalization to the case of optimal control, called the Bellman-Hamilton-Jacobi (BHJ)
equation. The HJ equation, discussed here, and the BHJ equation, discussed in Section 7, are
linked ultimately to the concept of Dynamic Programming (DP), also discussed later in the
course.
    Let us re-emphasize that the schematic derivation of the HJ equation (just provided)
has revealed the meaning of the action derivative over the (final) time and over the (final)
coordinate. We have learned that ∂t1 S is nothing but minus the Hamiltonian, while ∂q1 S is
simply the momentum, p1 = p(t1) (also equal to the velocity, as in these notes we follow the
convention of unit mass).
    Let us provide an alternative (and as simple) derivation of the HJ equation, based
primarily on differentials. Given the transformation from the representation of the action as
a functional of {q(t); t ∈ [0, t1]} to its representation as a function of t1 and q1, S{q(t)} →
S(t1; q1), one rewrites Eqs. (6.3,6.4) as

    S = ∫ (p dq − H dt),

which then implies the differential form

    dS = (∂S/∂t1) dt1 + (∂S/∂q1) dq1,

so that

    ∂t1 S = −H(t1; q1; p1),   ∂q1 S = p1,

resulting (in combination) in the HJ Eq. (6.32).


    So far it was important to distinguish the current moment of time, t ∈ [0, t1], from the
final moment of time, t1. However, once the HJ equations are derived we may return,
simplifying notations (when it does not lead to confusion), to using t interchangeably for
both.

Example 6.6.4. Find and solve the HJ equation for a free particle.

In this case

    H = p²/2.

Therefore, the HJ equation becomes

    (∂q S)²/2 = −∂t S.

Look for a solution of the HJ equation in the form S = f(q) − Et. One derives f(q) = √(2E) q − c,
and therefore the general solution of the HJ equation becomes

    S(t; q) = √(2E) q − Et − c.

Exercise 6.4. Find and solve the HJ equation for a two-dimensional oscillator (unit mass
and unit elasticity) in polar coordinates, i.e., for the Hamiltonian system with the action
functional

    S{r(t), ϕ(t)} = ∫ dt [ ½(ṙ² + r²ϕ̇²) − ½r² ].
    We conclude this very brief discussion of classical/Hamiltonian mechanics by mentioning
that, in addition to its relevance to the concepts of Optimal Control and Dynamic
Programming (to be discussed in Section 7), the HJ equations are also most useful in
establishing (and using in practical settings) the transformation from the pair of coordinate-
momentum variables (q, p) to the so-called canonical variables, for which paths of motion
reduce to single points, i.e., variables for which the (re-defined) Hamiltonian is simply zero.


6.7 Legendre-Fenchel Transform

This Section ∗ is devoted to the Legendre-Fenchel (LF) transform, which was in fact already
used, in its relatively simple but functional (infinite-dimensional) form, in Eq. (6.24). Given
the importance of the LF transform in variational calculus (already mentioned) and in finite-
dimensional optimization (yet to be discussed), we have decided to allocate a special section
to this important transformation and its consequences. We will also mention at the end of
this Section two applications of the LF transform: (a) to solving the image restoration
problem by a primal-dual algorithm, and (b) to estimating integrals with the Laplace method.

Definition 6.7.1 (Legendre-Fenchel (LF) transform). The Legendre-Fenchel transform of a
function, f : R^n → R, is

    f*(k) := sup_{x∈R^n} (x · k − f(x)).    (6.33)

∗ This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute
to the midterm and finals.

The LF transform is often also referred to as the "dual" transform; f*(k) is then said to be dual to f(x).

Example 6.7.2. Find the LF transform of the quadratic function, f(x) = x · A · x/2 − b · x,
where A is a symmetric positive-definite matrix, A ≻ 0.
Solution: The following sequence of transformations shows that the LF transform of a
positive-definite quadratic function is another positive-definite quadratic function:

    sup_x ( x · k − ½ x · A · x + b · x )
      = sup_x ( −½ (x − A⁻¹(k + b)) · A · (x − A⁻¹(k + b)) + ½ (b + k) · A⁻¹ · (b + k) )
      = ½ (b + k) · A⁻¹ · (b + k),    (6.34)

where the maximum is achieved at x* = A⁻¹(k + b).
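
A quick numerical sanity check of this example (our addition): compute the supremum in Eq. (6.33) with a generic optimizer and compare against the closed form.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)          # symmetric, positive definite
b = rng.standard_normal(3)
k = rng.standard_normal(3)

f = lambda x: 0.5 * x @ A @ x - b @ x
res = minimize(lambda x: f(x) - x @ k, np.zeros(3))   # minimize -(x.k - f)
print(-res.fun)                                        # numerical  f*(k)
print(0.5 * (b + k) @ np.linalg.solve(A, b + k))       # closed form (6.34)
```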

Definition 6.7.3 (Convex function over Rn ). A function, u : Rn → R is convex if

∀x, y ∈ Rn , λ ∈ (0, 1) : u(λx + (1 − λ)y) ≤ λu(x) + (1 − λ)u(y). (6.35)

The combination of these two notions (the Legendre-Fenchel transform and the convex-
ity) results in the following bold statements (which we only state here, delegating proofs to
Math 527).

Theorem 6.7.4 (Convexity and Involution of Legendre-Fenchel). The Legendre-Fenchel


transform of a convex function is convex, and it is also an involution, i.e. (f ∗ )∗ = f .

6.7.1 Geometric Interpretation: Supporting Lines, Duality and Convexity

Once the formal definitions and statements are made, let us consider the one-dimensional
case, n = 1, to develop intuition about the LF transform and convexity. In one dimension,
the LF transform has a very clear geometrical interpretation (see e.g. [10]) stated in terms of
supporting lines.

Definition 6.7.5 (Supporting Lines). f : R → R has a supporting line of slope α ∈ R at
x ∈ R if

    ∀x′ ∈ R : f(x′) ≥ f(x) + α(x′ − x).

If the inequality is strict at all x′ ≠ x, the line is called strictly supporting.

    Notice that supporting lines as defined above are defined locally, i.e., not globally for all
x ∈ R, but locally at a particular/fixed x.

Figure 6.6: Geometric interpretation of supporting lines.

Example 6.7.6. Find f*(k) and the supporting line(s) for f(x) = ax + b.
Solution: Notice that we cannot draw any straight line which does not cross the graph of f(x)
unless it has the same slope a. Therefore, f(x) is the supporting line for itself. We also
observe that the LF transform of the straight line is finite only at the single point k = a,
corresponding to the slope of the line, i.e.,

    f*(k) = −b if k = a;   f*(k) = ∞ otherwise.

Example 6.7.7. Consider the quadratic f(x) = ax²/2 − bx. Find f*(k), the supporting line(s)
for f(x), and the supporting line(s) for f*(k).
Solution: The solution, given by the one-dimensional version of Eq. (6.34), is f*(k) =
(b + k)²/(2a), where the maximum (in the LF transform) is achieved at x* = (b + k)/a. We
observe that f*(k) is well defined (finite) for all k ∈ R. Denote by f_x(x′) the supporting
line of f(x) at x. In this case of a nice (smooth and convex) f(x), one derives f_x(x′) =
f(x) + f′(x)(x′ − x) = ax²/2 − bx + (ax − b)(x′ − x), representing the Taylor series expansion
of f(x) around x, truncated at the first (linear) term. Similarly, f*_k(k′) = f*(k) +
(f*)′(k)(k′ − k) = (b + k)²/(2a) + (b + k)(k′ − k)/a.

What we see in this example generalizes into the following statements (given without
proof):

Proposition 6.7.8. Assume that f(x) admits a supporting line at x and f′(x) exists at x;
then the slope of the supporting line at x should be f′(x), i.e., for a differentiable function
the supporting line is always a tangent line.

Theorem 6.7.9. If f(x) admits a supporting line at x with slope k, then f*(k) admits a
supporting line at k with slope x.

Example 6.7.10. Draw supporting lines for the example of a smooth non-convex function
shown in Fig. (6.6).
Solution: Sketching supporting lines for this smooth, non-convex and bounded-from-below
example of a function with two local minima, we arrive at the following observations:

• The point a admits a supporting line. The supporting line touches f at the point a and
the touching line is beneath the graph of f(x); hence the term supporting is justified.

• The supporting line at a is strictly supporting because it touches the graph of f only
at x = a.

• The point b does not admit a supporting line, because any line passing through (b, f(b))
crosses the graph of f(x) at some other point.

• The point c admits a supporting line which is supporting, but not strictly supporting,
as it touches f(x) at another point, d. In this case c and d share the same supporting
line.

    The supporting-line analysis yields a number of other useful statements listed below
(without proof and only with limited discussion):

Theorem 6.7.11. f*(k) is always convex in k.

Corollary 6.7.12. f**(x) is always convex in x.

    The last statement tells us, in particular, that the LF transform is not always an involution,
because f** is always convex even for non-convex f, in which case f ≠ f**. This observation
generalizes to

Theorem 6.7.13. f**(x) = f(x) iff f(x) admits a supporting line at x.

    The following two statements are immediate corollaries of the theorem.

Corollary 6.7.14. f** = f if f is convex.

Corollary 6.7.15. If f*(k) is differentiable for all k, then f**(x) = f(x).

    The following two statements are particularly useful for visualization of f**(x):

Corollary 6.7.16. A convex function can always be written as a LF transform of another


function.

Theorem 6.7.17. f ∗∗ (x) is the largest convex function satisfying f ∗∗ (x) ≤ f (x).

Because of the last statement we call f ∗∗ (x) the convex envelope of f (x).
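
The convex envelope can be visualized by computing the double transform on a grid; below is a small sketch (our addition, with a hypothetical double-well f) using a brute-force discrete LF transform.

```python
import numpy as np

def lf_transform(x, fx, k):
    """Discrete Legendre-Fenchel transform: f*(k) = max_x (k*x - f(x))."""
    return np.max(np.outer(k, x) - fx[None, :], axis=1)

x = np.linspace(-2.0, 2.0, 1001)
f = (x**2 - 1.0)**2                 # nonconvex, two minima at x = +-1
k = np.linspace(-40.0, 40.0, 2001)

f_star      = lf_transform(x, f, k)       # convex in k
f_star_star = lf_transform(k, f_star, x)  # f** on the x grid
# f_star_star equals f outside [-1, 1] and flattens to 0 across the
# nonconvex region -- the convex envelope of Theorem 6.7.17.
```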

Figure 6.7: Function having a singularity cusp (left) and its LF transform (right).

    Below we continue to illustrate the notion of supporting lines, as well as convexity and
duality, on illustrative examples.

Example 6.7.18. Consider a function containing a non-differentiable point (cusp), as shown
in Fig. (6.7a). Utilizing the notion of supporting lines, draw and explain f*(k). Is f**(x) =
f(x)?
Solution: When a function has a non-differentiable point it is natural to split the analysis
in two, discussing the differentiable and non-differentiable parts separately.

• (Differentiable part of f(x):) Each point (x, f(x)) on the differentiable part of the
function curve (the l and r branches in Fig. (6.7a)) admits a strict supporting line with
slope f′(x) = k. These points map under the LF transformation into points (k, f*(k))
admitting supporting lines of slopes (f*)′(k) = x, shown as the l′ and r′ branches in
Fig. (6.7b). Overall, the left (l) and right (r) branches in Fig. (6.7a) transform into the
left (l′) and right (r′) branches in Fig. (6.7b).

• (The cusp of f(x) at x = xc:) The non-differentiable point xc admits not one but
infinitely many supporting lines with slopes in the range [k1, k2]. This means that
f*(k) with k ∈ [k1, k2] must admit a supporting line with the constant slope xc,
shown as branch (c′) in Fig. (6.7b), i.e., the (c′) branch is linear (affine).

    The example is convex; therefore, according to Corollary 6.7.14, f**(x) = f(x).



Figure 6.8: (a) An exemplary nonconvex function, f(x); (b) its LF transform, f*(k); (c) its
double LF transform, f**(x).

Example 6.7.19. Show schematically f*(k) and f**(x) for the f(x) shown in Fig. (6.6).
Solution: We split the curve of the function into three branches, (l-left), (c-center) and
(r-right), and then build the LF and double-LF transforms separately for each of the branches,
as before relying on the construction of supporting lines. The result is shown in Fig. (6.8)
and the details are as follows.

• Branch (l) and branch (r) are strictly convex, thus admitting strict supporting lines.
The LF transforms of the two branches are smooth. The double LF transform returns
exactly the same function we started from.

• Branch (c) is not convex, and as a result none of the points within this branch,
extending from x1 to x2, admits a supporting line. This means that the points of the
branch are not represented in f*(k). We see this in Fig. (6.8b) as a collapse of the branch
under the LF transform to a point. A supporting line with slope kc connects the end-points
of the branch. The supporting line is not strict, and it translates in f*(k) into a single
non-differentiable point, (kc, f*(kc)). Notice that f*(k) is convex, as well as f**(x).
The second LF transformation extends (kc, f*(kc)) into a straight line with slope kc
(shown in red in Fig. (6.8c)). This straight line may be thought of as a convex
extrapolation, the envelope, of f(x) in its non-convex branch.

Example 6.7.20. (a) Find the supporting lines and build the LF transform of

    f(x) = p1 x + b1 for x ≤ x*;   f(x) = p2 x + b2 for x ≥ x*,

where x* = (b2 − b1)/(p1 − p2), and b2 > b1, p2 > p1; and find the respective f**(x).
(b) Suggest an example of a convex function defined on a bounded domain with diverging
(infinite) slopes at the boundary. Show schematically f*(k) and f**(x) for the function.

6.7.2 Example of Dual Optimization in Variational Calculus

Now we are ready to return to the image restoration problem set up in Section 6.1.3.
Our task becomes to bypass the ε-smoothing discussed in Section 6.4.2 by using the LF
transform. This neat theoretical trick will then help in developing a computationally
advantageous primal-dual algorithm. We will use Theorem 6.7.4 to accomplish this
transformation-to-dual goal.
    In fact, let us consider a more general set-up than the one discussed in Section 6.1.3.
Assume that f : R^n → R is convex and consider

    min_{q(x): nᵀ·∇x q=0, ∀x∈∂U} ∫_U dx (g(x, q(x)) + f(∇x q(x))),    (6.36)

where q : U → R and, as before, n is the normal vector to ∂U. Let us now restate the
formulation in terms of the Legendre-Fenchel transform of f, thus utilizing Theorem 6.7.4:

    min_{q(x)} max_{p(x)} ∫_U dx (g(x, q(x)) + p(x) · ∇x q(x) − f*(p(x))),   nᵀ·∇x q = nᵀ·p = 0, ∀x ∈ ∂U,    (6.37)

where p : U → R^n and we also require that p (which is a vector) does not have a projection
onto the boundary, i.e., it is aligned with ∇x q. Here q(x) is called the primal variable and
p(x) is called the dual variable. We "dualize" only the second term in the integrand on the
right-hand side of Eq. (6.36), which is non-smooth, leaving the first (smooth) term unchanged.
The optimization problem (6.37) is also called a saddle-point formulation, due to its min-max
structure.
    Given the boundary conditions, we can apply integration by parts to the middle term in
Eq. (6.37), arriving at

    min_{q(x)} max_{p(x): nᵀ·p=0, ∀x∈∂U} ∫_U dx (g(x, q(x)) − q(x)∇x · p(x) − f*(p(x))).    (6.38)

    We can attempt to solve Eq. (6.37) or Eq. (6.38) by the primal-dual method, which
consists in alternating minimization and maximization steps in either of the two optimizations.
Implementations may be, for example, via alternating gradient descent (for minimization)
and gradient ascent (for maximization).
    However, in the original problem we are trying to solve – the image restoration problem
defined in Section 6.1.3 – we can carry the primal-dual min-max formulation further

by exploring the structure of the argument (the effective action), evaluating the minimization
over {q(x)} explicitly and thus arriving at the dual formulation. This is our plan for the
remainder of the section.
    The case of Total Variation image restoration corresponds to setting

    g(x, q) = (q − f(x))²/(2λ),   f(w = ∇x q(x)) = |w|,

in Eq. (6.36), thus arriving at the following optimization:

    min_{q: nᵀ·∇x q=0, ∀x∈∂U} ∫_U dx [ (q − f)²/(2λ) + |∇x q| ].    (6.39)

Notice that f(w) = |w| is convex, and thus, according to the high-dimensional generalization
of what we have learned about the LF transform, f**(w) = f(w). The LF dual of f(w) can
be easily computed:

    f*(p) = sup_{w∈R^n} (p · w − |w|) = 0 if |p| ≤ 1;   ∞ if |p| > 1.    (6.40)

The convexity of f(w) = |w| then allows us, according to Theorem 6.7.4, to "invert" Eq. (6.40):

    f(w) = |w| = sup_p (p · w − f*(p)) = max_{|p|≤1} p · w.    (6.41)

Then the min-max Eq. (6.38) becomes

    min_q max_{|p|≤1: nᵀ·p=0, ∀x∈∂U} ∫_U dx [ (q − f)²/(2λ) − q ∇x · p ].    (6.42)

Remarkably, we can swap min and max in Eq. (6.42). This is guaranteed by the strong
convexity theorem (see Appendix):

    max_{|p|≤1: nᵀ·p=0, ∀x∈∂U} min_q ∫_U dx [ (q − f)²/(2λ) − q ∇x · p ].    (6.43)

This trick is very useful because the "ultra-local" optimization over q can be done explicitly.
One finds that the minimum over q of the quadratic function in the integrand of the objective
in Eq. (6.43) is achieved at

    q = f + λ∇x · p,    (6.44)

and then, substituting the optimal value back into the objective, we arrive at

    max_{|p|≤1: nᵀ·p=0, ∀x∈∂U} ∫_U dx [ f ∇x · p − (λ/2)(∇x · p)² ],    (6.45)

which is thus the optimization dual to the primal optimization (6.39). If we were to ignore
the constraint in Eq. (6.45), the objective would be maximal at ∇x · p = f/λ. To handle the
constraint, [11] suggested using the so-called projected gradient ascent algorithm,

    ∀x :  pk+1(x) = [pk + τ ∇x(∇x · pk − f/λ)] / [1 + τ |∇x(∇x · pk − f/λ)|],    (6.46)

initiated with p0 satisfying the constraint, |p0| < 1, iterating in time with step τ > 0 and
taking an appropriate spatial discretization of the ∇x · operation on a grid with spacing ∆x.
The introduction of the denominator in the ratio on the right-hand side of Eq. (6.46)
guarantees that the condition |pk| < 1 is enforced along the iterations. When the iterations
converge and the optimal p is found, the optimal pattern, q, is reconstructed from Eq. (6.44).
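
A compact implementation of the iteration (6.46) is sketched below (our addition, following the scheme of [11] with grid spacing ∆x = 1; the forward-difference gradient and its adjoint backward-difference divergence are standard discretization choices, and τ = 1/8 is a conservative step).

```python
import numpy as np

def grad(u):
    """Forward-difference gradient (zero flux at the far boundary)."""
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div(px, py):
    """Backward-difference divergence, the negative adjoint of grad."""
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]; dx[-1, :] = -px[-2, :]
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]; dy[:, -1] = -py[:, -2]
    return dx + dy

def tv_dual(f, lam=0.1, tau=0.125, n_steps=500):
    """Projected gradient ascent, Eq. (6.46); returns q = f + lam*div(p)."""
    px = np.zeros_like(f); py = np.zeros_like(f)
    for _ in range(n_steps):
        gx, gy = grad(div(px, py) - f / lam)
        norm = np.sqrt(gx**2 + gy**2)
        px = (px + tau * gx) / (1.0 + tau * norm)   # keeps |p| < 1
        py = (py + tau * gy) / (1.0 + tau * norm)
    return f + lam * div(px, py)                    # Eq. (6.44)
```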

6.7.3 More on Geometric Interpretation of the LF transform

" ! + " ∗ % = %!
'()*+ = %

"(!)
%! " ∗ (%) !

Figure 6.9: Graphic representation of the LF transform.

    Here we inject some additional geometric meaning into the LF transform, following [12].
We continue to draw our intuition/inspiration from a one-dimensional example.
    First, notice that if the function f : R → R is strictly convex, then f′(x) is increasing,
monotonically and strictly, with x. This means, in particular, that the relation between
the original variable, x, and the respective optimal dual variable, k, is one-to-one, therefore
providing an additional explanation for the self-inverse feature of the LF transform in the
case of strict convexity (we know that it also holds in the general convex case).
    Second, consider the relation, illustrated in Fig. (6.9), between the original function, f(x),
at a point x admitting a strict supporting line, and the respective LF transform, f*(k),
evaluated at k = f′(x), i.e., f*(f′(x)):

    ∀x :  kx = f(x) + f*(k),  where k = f′(x).    (6.47)

As seen clearly in the figure, the LF relation explains f*(k) as f(x) extended by kx (where
the latter term is associated with the supporting line). Notice the remarkable symmetry of
Eq. (6.47) under the x ↔ k and f ↔ f* transformation, also assuming that the variables, x
and k, are not independent - one of the two is to be selected as tracking the change while
the other (conjugate) variable will depend on the first one, according to k = f′(x) or
x = (f*)′(k).

6.7.4 Hamiltonian-to-Lagrangian Duality in Classical Mechanics

The LF transform is also the key to understanding the relation between the Hamiltonian and
the Lagrangian in classical mechanics. Let us illustrate it on a "no q" example, i.e., on the
case when the Hamiltonian, generally dependent on t, q and p, depends only on p. Specifically,
consider the example of a free relativistic particle, where H(p) = √(p² + m²), m is the particle
mass and the speed of light is set to unity, c = 1. In this case, q̇ = ∂p H = dH/dp = p/√(p² + m²),
according to the Hamilton equation, and the Lagrangian, which generally depends on q̇ and q
but now depends only on q̇, is L(q̇) = pq̇ − H(p). This relation, rewritten in the symmetric
form,

    pq̇ = L(q̇) + H(p),

should be compared with the LF relation Eq. (6.47). We observe that p and q̇, like x and k,
are conjugate variables, while L should be viewed as the LF transform of the Hamiltonian,
L = H*, or vice versa, H = L*.
    See [12] for further discussion of other examples of the LF transform in physics, for
example in statistical thermodynamics (where inverse temperature and energy are conjugate
variables, while free energy is the LF dual of the entropy, and vice versa).

6.7.5 LF Transformation and Laplace Method

Consider the integral

    F(k, n) = ∫_R dx exp(n(kx − f(x))).

When n → ∞, the Laplace method of approximating the integral (discussed in Math 583a
in the fall) gives

    log F(k, n) = n sup_{x∈R} (kx − f(x)) + o(n) = n f*(k) + o(n),

i.e., the LF transform of f controls the leading asymptotics of the integral.
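
A numerical check of this statement (our addition; f(x) = x⁴ and k = 1 are hypothetical choices):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

f, k = lambda x: x**4, 1.0
f_star = -minimize_scalar(lambda x: f(x) - k * x).fun   # sup_x (kx - f(x))

for n in [10, 100, 1000]:
    F, _ = quad(lambda x: np.exp(n * (k * x - f(x))), -5.0, 5.0)
    print(n, np.log(F) / n, f_star)   # log(F)/n approaches f*(k)
```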


6.8 Second Variation

Finding extrema of a function involves more than finding its critical points ∗. A critical
point may be a minimum, a maximum or a saddle-point. To determine the critical-point
type one needs to compute the Hessian matrix of the function. A similar consideration
applies to functionals when we want to characterize solutions of the Euler-Lagrange equations.
    We naturally start the discussion of the second variation from the finite-dimensional
case. Let f : U ⊂ R^n → R be a C^2 function (with existing first and second derivatives).
The Hessian matrix of f at x is a symmetric bi-linear form (on the tangent vector space R^n_x
to R^n at x) defined by

    ∀ξ, η ∈ R^n_x :  Hess_x(ξ, η) = ∂²f(x + sξ + wη)/∂s∂w |_{s=w=0}.    (6.48)

If the Hessian is positive-definite, i.e., if the respective matrix of second derivatives has only
positive eigenvalues, then the critical point is a minimum.
    Let us generalize the notion of the Hessian to the action, S = ∫ dt L(q, q̇), with the
Lagrangian, L(q, q̇), where q(t) : R → R^n is a C^2 function. The direct generalization of
Eq. (6.48)

∗ This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute
to the midterm and finals.

becomes

    Hess_q{ξ(t), η(t)} = ∂²S{q(t) + sξ(t) + wη(t)}/∂s∂w |_{s=w=0}    (6.49)
        = (∂/∂s) [ (∂/∂w) S{q(t) + sξ(t) + wη(t)} |_{w=0} ] |_{s=0}
        = (∂/∂s) ∫ dt Σ_{i=1}^n [ ∂L(q + sξ; q̇ + sξ̇)/∂q^i − (d/dt) ∂L(q + sξ; q̇ + sξ̇)/∂q̇^i ] η^i |_{s=0}
        = ∫ dt Σ_{i,j=1}^n [ (∂²L/∂q^j∂q^i) ξ^j + (∂²L/∂q̇^j∂q^i) ξ̇^j − (d/dt)( (∂²L/∂q^j∂q̇^i) ξ^j + (∂²L/∂q̇^j∂q̇^i) ξ̇^j ) ] η^i
        =: ∫ dt Σ_{i,j=1}^n J_{ij} ξ^j η^i,

where J_{ij} is the matrix of differential operators called the Jacobi operator. Determining
whether the bilinear form is positive definite is usually hard, but in some simple cases the
question can be resolved.
    Consider q : R → R, q ∈ C^2, and the quadratic action

    S{q(t)} = ∫_0^T dt (q̇² − q²),    (6.50)

with zero boundary conditions, q(0) = q(T) = 0. To get some intuition about how the
landscape of the action (6.50) looks, let us consider a subclass of functions, for example
oscillatory functions consisting of only one harmonic,

    q̄(t) = a sin(nπt/T),    (6.51)

where a ∈ R (any real) and n ∈ Z \ 0 (any nonzero integer). Substituting Eq. (6.51) into
Eq. (6.50) one derives

    S{q̄(t)} = (n²π²a²/T²) ∫_0^T dt cos²(nπt/T) − a² ∫_0^T dt sin²(nπt/T) = (Ta²/2)(n²π²/T² − 1).

One observes that at T < π, the action, S, considered on this special class of functions,
is positive. However, some of these probe functions result in a negative action when T > π
(e.g., the n = 1 harmonic). This means that at T > π, the functional quadratic form
corresponding to the action (6.50) is certainly not positive definite.

    One thus comes out of this "probe function" exercise with the following question: can
it be that the functional quadratic form corresponding to the action (6.50) is not positive
definite also at T < π? The analysis so far (restricted to the class of single-harmonic test
functions) is not conclusive. Quite remarkably, one can prove that the action (6.50) is always
positive (over the class of zero-boundary-condition, twice-differentiable functions), and thus
the respective quadratic form is positive definite, if T < π.

Example 6.8.1. Prove that the action S{q(t)} given by Eq. (6.50) is positive at T < π for
any twice-differentiable function, q ∈ C^2, with zero boundary conditions, q(0) = q(T) = 0.
(Hint: Represent the function as a Fourier Series and show that the action is a sum of squares.)
Solution. Consider the general Fourier series expansion of q(t), that is,

    q(t) = a0/2 + Σ_{n=1}^∞ [ an cos(2πnt/T) + bn sin(2πnt/T) ].

Calculating ∫_0^T (q̇² − q²) dt and noting the orthogonality of the terms in the expansion,
one arrives at

    ∫_0^T (q̇² − q²) dt = −T a0²/4 + T Σ_{n=1}^∞ [(an² + bn²)/2] (4π²n²/T² − 1).

If we consider the 'worst-case' scenario of T = π, we then have to show that

    Σ_{n=1}^∞ [(an² + bn²)/2] (4n² − 1) ≥ a0²/4,

to ensure positivity of the action. Note that from the boundary conditions we have
a0/2 + Σ_n an = 0. Without loss of generality let us scale q(t) such that a0 = −2. Then it
will suffice for us to show that

    Σ_{n=1}^∞ (an²/2)(4n² − 1) ≥ 1.

We can do this by constructing the dual problem and demonstrating that the minimal value
of the left-hand side is 1 when varying the an. Specifically, our problem is

    min_{a=(a1,...,ak)} Σ_{n=1}^k (an²/2)(4n² − 1),
    s.t.  Σ_{n=1}^k an = 1,

where we first consider the partial-sum case and then take the k → ∞ limit. The Lagrangian
is

    L(a, µ) = Σ_{n=1}^k (an²/2)(4n² − 1) − µ ( Σ_{n=1}^k an − 1 ).

We compute ∇a L = 0 to get

    an (4n² − 1) − µ = 0,  ∀n = 1, . . . , k.

This implies an = µ/(4n² − 1). If the equality constraint is enforced, we derive

    Σ_{n=1}^k µ/(4n² − 1) = 1.

One can compactly represent the partial sums as

    Σ_{n=1}^k µ/(4n² − 1) = µk/(2k + 1),

which yields µ = (2k + 1)/k. Substituting this back into the original objective function, we
arrive at

    Σ_{n=1}^k (an²/2)(4n² − 1) = (2k + 1)/(2k) ≥ 1.

Therefore in the k → ∞ limit we are still at least 1, thus proving positivity of the
action.
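
The same conclusion can be reached numerically: discretizing the quadratic form (6.50) with Dirichlet boundary conditions yields a tridiagonal matrix whose smallest eigenvalue changes sign at T = π. A sketch (our addition):

```python
import numpy as np

# The form int (q'^2 - q^2) dt corresponds to the operator -d^2/dt^2 - 1,
# whose continuum eigenvalues are (n*pi/T)^2 - 1; the smallest is positive
# exactly for T < pi.
def smallest_eigenvalue(T, n=400):
    h = T / n
    A = (np.diag(np.full(n - 1, 2.0 / h**2 - 1.0))
         + np.diag(np.full(n - 2, -1.0 / h**2), 1)
         + np.diag(np.full(n - 2, -1.0 / h**2), -1))
    return np.linalg.eigvalsh(A)[0]

print(smallest_eigenvalue(3.0))   # > 0 : positive definite (T < pi)
print(smallest_eigenvalue(3.3))   # < 0 : indefinite        (T > pi)
```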

6.9 Methods of Lagrange Multipliers

So far we have only discussed unconstrained variational formulations. This Section is
devoted to generalizations where variational problems with constraints are formulated and
resolved.

6.9.1 Functional Constraint(s)

Consider the shortest-path problem discussed in Example 6.2.3, however constrained by the
area, A, as follows:

    min_{q(x)|x∈[0,a]} ∫_0^a dx √(1 + (q′(x))²),   q(0) = 0, q(a) = b, ∫_0^a q(x) dx = A.

The area constraint can be built into the optimization by adding

    λ ( ∫_0^a dx q(x) − A )

to the optimization objective, where λ is the Lagrange multiplier. The Euler-Lagrange
equations for this "extended" action are

    0 = ∇x (L∇q (x, q(x), ∇x q(x))) − Lq (x, q(x), ∇x q(x)) − λ
      = (d/dx) [q′(x)/√(1 + (q′(x))²)] − 0 − λ
    → q′(x)/√(1 + (q′(x))²) = constant + λx.

Example 6.9.1. The principle of maximum entropy, also called the principle of the maximum-
likelihood (distribution), selects the probability distribution that maximizes the entropy,
S = −∫_D dx P(x) log P(x), under the normalization condition, ∫_D dx P(x) = 1.

  • (a) Consider a bounded D ⊂ R^n. Find the optimal P(x).

  • (b) Consider D = [a, b] ⊂ R. Find the optimal P(x), assuming that the mean of x is
    known, E_{P(x)}(x) := ∫_D dx x P(x) = µ.

Solution:
(a) The effective action is

    S̃ = S + λ ( 1 − ∫_D dx P(x) ),

where λ is the (constant, i.e., not dependent on x) Lagrange multiplier. Variation of S̃
over P(x) results in the following EL equation:

    δS̃/δP(x) = 0 :  −log(P(x)) − 1 − λ = 0.

Accounting for the normalization condition, one finds that the optimum is achieved at the
equidistribution (uniform distribution):

    P(x) = 1/‖D‖,

where ‖D‖ is the size (volume) of D.
(b) The effective action is

    S̃ = S + λ ( 1 − ∫_D dx P(x) ) + λ1 ( µ − ∫_D dx x P(x) ),

where λ and λ1 are two (constant) Lagrange multipliers. Variation of S̃ over P(x) results
in the following EL equation:

    δS̃/δP(x) = 0 :  −log(P(x)) − 1 − λ − λ1 x = 0  →  P(x) = e^{−1−λ} exp(−λ1 x).

λ and λ1 are constants which can be expressed via a, b and µ by resolving the normalization
constraint and the constraint on the mean:

    e^{−1−λ} [ −exp(−λ1 x)/λ1 ]_a^b = 1,
    e^{−1−λ} [ −x exp(−λ1 x)/λ1 − exp(−λ1 x)/λ1² ]_a^b = µ.
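
In practice λ1 is found numerically from the mean constraint, with λ then fixed by normalization. A sketch (our addition, with hypothetical values of a, b and µ):

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

a, b, mu = 0.0, 1.0, 0.3   # hypothetical interval and target mean

def mean_of(lam1):
    """Mean of P(x) ~ exp(-lam1*x) on [a, b], normalization by quadrature."""
    Z, _ = quad(lambda x: np.exp(-lam1 * x), a, b)
    m, _ = quad(lambda x: x * np.exp(-lam1 * x), a, b)
    return m / Z

lam1 = brentq(lambda l: mean_of(l) - mu, -50.0, 50.0)
print(lam1, mean_of(lam1))   # mean matches mu; lam1 > 0 since mu < (a+b)/2
```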

Exercise 6.5. Consider the setting of Example 6.9.1b with a = −∞, b = ∞. Assuming
that the mean and variance of the probability distribution are known, i.e., E_{P(x)}(x) = µ
and E_{P(x)}(x²) = σ² + µ², find the P(x) which maximizes the entropy.

6.9.2 Function Constraints

The method of Lagrange multipliers in the calculus of variations extends to other types of
constrained optimizations, where the condition is not a functional as in the cases discussed
so far but a function. Consider, for example, our standard one-dimensional example of the
action functional, Z
S{q(t)} = dtL(t; q(t); q̇(t)), (6.52)

over q : R → R, however constrained by the functional,

∀t : G(t; q(t); q̇(t)) = 0. (6.53)

Let us also assume that L(t; q; q̇) and G(t; q; q̇) are sufficiently smooth functions of their last argument, q̇. The idea then becomes to introduce the following "modified" action
$$\tilde S\{q(t),\lambda(t)\} = \int dt\,\left(L(t;q(t);\dot q(t)) - \lambda(t)\, G(t;q(t);\dot q(t))\right), \tag{6.54}$$
which is now a functional of both q(t) and λ(t), and extremize it over both q(t) and λ(t). One can show that solutions of the EL equations, derived as variations of the action (6.54) over both q(t) and λ(t), give a necessary condition for the minimum of Eq. (6.52) constrained by Eq. (6.53).
Let us illustrate this scheme and derive the Euler-Lagrange equation for a Lagrangian L(q; q̇; q̈) which depends on the second derivative of a C³ function, q : R → R, and does not depend on t explicitly. Introducing the auxiliary variable v = q̇, in full analogy with Eq. (6.54), the modified action in this case becomes
$$\tilde S\{q(t),\lambda(t)\} = \int dt\,\left(L(q(t);\dot q(t);\dot v(t)) - \lambda(t)\left(v(t) - \dot q(t)\right)\right). \tag{6.55}$$

Then the modified Euler-Lagrange equations are
$$\frac{\partial L}{\partial q} = \frac{d}{dt}\left(\frac{\partial L}{\partial \dot q} + \lambda\right),\qquad -\lambda = \frac{d}{dt}\frac{\partial L}{\partial \dot v},\qquad v = \dot q. \tag{6.56}$$

Eliminating λ and v one arrives at the desired modified EL equation stated solely in terms of derivatives of the Lagrangian over q(t) and its derivatives:
$$\frac{\partial L}{\partial q} - \frac{d}{dt}\frac{\partial L}{\partial \dot q} + \frac{d^2}{dt^2}\frac{\partial L}{\partial \ddot q} = 0. \tag{6.57}$$
Exercise 6.6. Find extrema of $S\{q(t)\} = \int_0^1 dt\,\|\dot q(t)\|$ for q : [0, 1] → R³ subject to the norm constraint, ∀t : ‖q(t)‖² = 1, and with generic boundary conditions, i.e. q(0) = q0 and q(1) = q1, where both boundary points satisfy the norm constraint.

We will see more of the calculus of variations with (function) constraints later in the
optimal control section of the course.
Chapter 7

Optimal Control and Dynamic Programming

The optimal control problem will be considered as a special case of a general variational calculus problem, where the (vector) fields evolve in time, i.e. reside in a one-dimensional real space equipped with a direction, and are constrained by a system of ODEs, possibly with algebraic constraints added too. We will learn how to analyze these problems by the methods of the variational calculus from Section 6, using optimization approaches, e.g. convex analysis and duality, described in Appendix A.1, and also adding to our arsenal of tools a new one, called "Dynamic Programming" (DP), in Section 7.4.
Let us start with an illustrative (as sufficiently simple) optimal control problem.

Example 7.0.1. Consider the trajectory of a particle in one dimension, {q(τ) : [0, t] → R}, which is subject to a control {u(τ) : [0, t] → R}. Specifically, the control means that at any moment in time the velocity of the particle can be set to any value less than or equal to one, that is q̇(τ) = u(τ) where u(τ) ≤ 1. Solve the following constrained problem of the variational calculus type:
$$\min_{\{u(\tau),q(\tau)\}} \int_0^t d\tau\, (q(\tau))^2 \Bigg|_{\tau\in(0,t]:\;\dot q(\tau)=u(\tau),\; u(\tau)\le 1} \tag{7.1}$$
where t > 0 and the initial position, q(0) = q0, is known (fixed).


Solution. If q0 > 0, one can guess the optimal solution right away: jump to q = 0 immediately (at τ = 0+) and then stay at zero. To justify the solution, one first drops all the constraints in Eq. (7.1), observes that the minimal solution of the unconstrained problem is, τ ∈ (0, t] : q(τ) = u(τ) = 0, and then verifies that the constraints dropped are satisfied. (Notice that the resulting discontinuity of the optimal q(τ) at τ = 0 is not a problem, as continuity was not required in the problem formulation.)
The analysis in the case of q0 ≤ 0 is more elaborate. Let us exclude the control variable,
turning the pair of constraints in Eq. (7.1) into one, ∀τ : q̇ ≤ 1. Our next step will be to
use the Duality Theory and KKT conditions, discussed in Math 584 (see also Appendix A)
in the context of finite dimensional optimization. We can therefore view what follows as a
practical lesson (just a use case, no proofs) on how to extend this important approach from
the case of finite dimensional optimization to the variational calculus.
We introduce the Lagrangian function,

L(q(τ ), µ(τ )) = q 2 + µ(q̇ − 1),

and then write the following four KKT conditions:

1. KKT-1: Primal Feasibility: q̇(τ) ≤ 1 for τ ∈ (0, t].

2. KKT-2: Dual Feasibility: µ(τ) ≥ 0 for τ ∈ (0, t].

3. KKT-3: Stationary point in primal variables (which is simply the Euler-Lagrange condition of the variational calculus): 2q = µ̇ for τ ∈ (0, t].

4. KKT-4: Complementary Slackness: µ(τ)(q̇(τ) − 1) = 0 for τ ∈ (0, t].

We find that,

q(τ ) = τ + q0 , µ(τ ) = τ 2 + 2q0 τ + c, (7.2)

where c is a constant, satisfy both the KKT conditions and the initial condition, q(0) = q0 .
Can we have another solution different from Eqs. (7.2) but satisfying the KKT conditions?
How about a discontinuous control? Consider the following probe functions, bringing q to
zero first with the maximal allowed control, and then switching off the control:
$$q(\tau) = \begin{cases} q_0 + \tau, & 0 < \tau \le -q_0 \\ 0, & -q_0 < \tau \le t\end{cases},\qquad \mu(\tau) = \begin{cases} \tau^2 + 2q_0\tau + q_0^2, & 0 < \tau \le -q_0 \\ 0, & -q_0 < \tau \le t\end{cases}. \tag{7.3}$$

We observe that, indeed, in the regime where the probe function is well defined, i.e. 0 < −q0 < t, Eqs. (7.3) solve the KKT conditions, therefore providing an alternative to the solution (7.2). Comparing the objectives in Eq. (7.1) for the two alternatives one finds that for 0 < −q0 < t the solution (7.3) is optimal, while the solution (7.2) is optimal if t < −q0.

Exercise 7.1. Solve Example 7.0.1 with the condition u ≤ 1 replaced by |u| ≤ 1.


7.1 Linear Quadratic Control via Calculus of Variations
Our next extra-curricular topic∗ is Linear Quadratic (LQ) control. Consider a d-dimensional real vector representing the evolution of the system state in time, {q(τ) ∈ R^d | τ ∈ [0, t]}, governed by the following system of linear ODEs
$$\forall \tau \in (0,t]:\quad \dot q(\tau) = A q(\tau) + B u(\tau),\qquad q(0) = q_0, \tag{7.4}$$
where A and B are constant (time independent), square, nonsingular (invertible) and possibly asymmetric (thus A ≠ A^T and B ≠ B^T) real matrices, A, B ∈ R^{d×d}, and {u(τ) ∈ R^d | τ ∈ [0, t]} is a time-dependent control vector of the same dimensionality as q. Introduce a combined action, often called the cost-to-go:

$$S\{q(\tau),u(\tau)\} := S_{eff}\{u(\tau)\} + S_{des}\{q(\tau)\} + S_{fin}(q(t)), \tag{7.5}$$
$$S_{eff}\{u(\tau)\} := \frac{1}{2}\int_0^t d\tau\, u^T(\tau)\, R\, u(\tau), \tag{7.6}$$
$$S_{des}\{q(\tau)\} := \frac{1}{2}\int_0^t d\tau\, q^T(\tau)\, Q\, q(\tau), \tag{7.7}$$
$$S_{fin}(q(t)) := \frac{1}{2}\, q^T(t)\, Q_{fin}\, q(t), \tag{7.8}$$
where S_eff, dependent only on {u(τ)}, represents the required efforts of control; S_des, dependent only on {q(τ)}, expresses the cost of maintaining the desired state of the system along the trajectory {q(τ)}; and S_fin, dependent only on q(t), expresses the cost of achieving the final state, q(t). We assume that R, Q and Q_fin are symmetric real positive definite matrices. We aim to optimize the cost-to-go over {q(τ)} and {u(τ)}, constrained by the governing ODEs and the respective initial condition in Eqs. (7.4).
As is custom in the variational calculus with function constraints, let us extend the action (7.5) with a Lagrangian multiplier function associated with the ODE constraints (7.4), and then formulate necessary conditions for optimality, stated as an unconstrained variation of the following effective action
$$S\{q,u,\lambda\} := S\{q,u\} + \int_0^t d\tau\, \lambda^T(\tau)\left(-\dot q + A q + B u\right), \tag{7.9}$$

where {λ(τ )} is the time-dependent vector of the Lagrangian multipliers, also called the
adjoint vector. Euler-Lagrange (EL) equations and the primal feasibility equations following

∗ This auxiliary Section can be dropped at the first reading. Material from this Section will not contribute to the midterm and finals.

from variations of the effective action (7.9) over q, u and λ are
$$\text{Euler-Lagrange:}\quad \frac{\delta S\{q,u,\lambda\}}{\delta q} = 0:\quad \forall\tau\in(0,t]:\quad Qq + \dot\lambda + A^T\lambda = 0, \tag{7.10}$$
$$\frac{\delta S\{q,u,\lambda\}}{\delta u} = 0:\quad \forall\tau\in[0,t]:\quad Ru + B^T\lambda = 0, \tag{7.11}$$
$$\text{primal feasibility:}\quad \frac{\delta S\{q,u,\lambda\}}{\delta \lambda} = 0:\quad \text{Eqs. (7.4).} \tag{7.12}$$

The equations should also be complemented with the boundary condition,
$$\text{boundary condition at } \tau = t:\quad \frac{\partial S\{q,u,\lambda\}}{\partial q(t)} = 0:\quad \lambda(t) = Q_{fin}\, q(t), \tag{7.13}$$

derived by variations of the effective action over q at the final point, q(t). The simplest way to derive the boundary condition Eq. (7.13) is through discretization: turning the temporal integrals into discrete sums, specifically
$$\int_0^t d\tau\, \lambda^T(\tau)\dot q(\tau) \;\to\; \lambda^T(\Delta)\left(q(\Delta)-q(0)\right) + \cdots + \lambda^T(t)\left(q(t)-q(t-\Delta)\right), \tag{7.14}$$
where ∆ is the discretization step, and then looking for a stationary point over q(t).
Notice that alternatively the boundary condition can be derived from the "main theorem of classical mechanics", Theorem 6.5.1, which proves that the gradient of S with respect to the terminal point (t1, q1) is ∇S = ⟨ L − q̇L_q̇ , L_q̇ ⟩|(t1,q1). This result suggests that, given that q(τ) is optimal, the action cannot be improved by variations in the terminal value, q(t), and therefore the boundary condition at τ = t is:
$$\text{boundary condition at } \tau = t:\quad 0 = \frac{\partial S\{q,u,\lambda\}}{\partial q(t)} = \lambda(t) + L_{\dot q}\big|_{\tau=t} = \lambda(t) - Q_{fin}\, q(t). \tag{7.15}$$

Observe that Eqs. (7.11) are algebraic, thus allowing us to express the control vector, u, via the adjoint vector, λ:
$$u = -R^{-1} B^T \lambda. \tag{7.16}$$

Substituting it into Eqs. (7.10,7.12) one arrives at the following joint system of the original and adjoint equations
$$\begin{pmatrix}\dot q\\ \dot\lambda\end{pmatrix} = \begin{pmatrix} A & -BR^{-1}B^T\\ -Q & -A^T\end{pmatrix}\begin{pmatrix}q\\ \lambda\end{pmatrix},\qquad \begin{pmatrix}q(0)\\ \lambda(t)\end{pmatrix} = \begin{pmatrix}q_0\\ Q_{fin}\,q(t)\end{pmatrix}. \tag{7.17}$$

The system of ODEs (7.17) is a two-point Boundary Value Problem (BVP) because it
has two boundary conditions at the opposite ends of the time interval. In general, two-
point BVPs are solved by the shooting method, which requires multiple iterations forward
and backward in time (hoping for convergence). However for the LQ Control problems,

the system of equations is linear, and we can solve it in one shot – with only one forward
iteration and one backward iteration. Indeed, integrating the linear ODEs (7.17) one derives
$$\begin{pmatrix}q(\tau)\\ \lambda(\tau)\end{pmatrix} = W(\tau)\begin{pmatrix}q(0)\\ \lambda(0)\end{pmatrix}, \tag{7.18}$$
$$W(\tau) = \begin{pmatrix}W^{1,1}(\tau) & W^{1,2}(\tau)\\ W^{2,1}(\tau) & W^{2,2}(\tau)\end{pmatrix} := \exp\left(\tau \begin{pmatrix}A & -BR^{-1}B^T\\ -Q & -A^T\end{pmatrix}\right), \tag{7.19}$$
which allows us to express λ(0) via q(0) = q0 by enforcing the terminal condition λ(t) = Q_fin q(t) from Eqs. (7.17):
$$\lambda(0) = M q_0, \qquad M := -\left(W^{2,2}(t) - Q_{fin}W^{1,2}(t)\right)^{-1}\left(W^{2,1}(t) - Q_{fin}W^{1,1}(t)\right). \tag{7.20}$$

Substituting Eqs. (7.18,7.20) into Eq. (7.16) one arrives at the following expression for the optimal control via q0:
$$u(\tau) = -R^{-1} B^T \left(W^{2,1}(\tau) + W^{2,2}(\tau)\, M\right) q_0. \tag{7.21}$$

A control of this type, dependent on the initial state, is called open loop control. That
is, the control policy u(τ ) doesn’t explicitly depend on the current state q(τ ), and instead
only on the initial state q0 . While in ideal conditions this does not pose an issue, under
uncertainty it is often better for the control policy to depend on the current state of the
system q(τ ). Such a control scheme is called feedback loop control, which may also be called
the closed loop control. The feedback loop version of Eq. (7.21) requires us to express u(τ )
in terms of q(τ ). To do this we seek a map from q(τ ) to λ(τ ), which for LQ control is a
matrix P (τ ) such that P (τ )q(τ ) = λ(τ ). This allows us to write

u(τ ) = −R−1 B T λ(τ ) = −R−1 B T P (τ )q(τ ).

We derive this matrix by expressing λ(τ) and q(τ) via q0 according to Eqs. (7.18,7.20) and then substituting the result in Eq. (7.16):
$$\forall\tau\in(0,t]:\quad u(\tau) = -R^{-1}B^T\lambda(\tau) = -R^{-1}B^T\left(W^{2,1}(\tau) + W^{2,2}(\tau)M\right)q_0$$
$$= -R^{-1}B^T\left(W^{2,1}(\tau) + W^{2,2}(\tau)M\right)\left(W^{1,1}(\tau) + W^{1,2}(\tau)M\right)^{-1} q(\tau) = -R^{-1}B^T P(\tau)\, q(\tau), \tag{7.22}$$
$$P(\tau) := \left(W^{2,1}(\tau) + W^{2,2}(\tau)M\right)\left(W^{1,1}(\tau) + W^{1,2}(\tau)M\right)^{-1}. \tag{7.23}$$

At any moment of time τ , i.e. as we go along, the feedback loop control responds to the
current measurement of the system state, q(τ ), at the same time, τ .
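The formulas (7.18)-(7.23) are straightforward to evaluate numerically. The sketch below (all system matrices and the horizon are assumed for illustration, not taken from the text) builds W(τ) with a matrix exponential, forms M, and recovers the feedback gain P(τ); by the terminal condition, P(t) should reproduce Q_fin:

```python
# Illustrative LQ control via Eqs. (7.18)-(7.23); A, B, R, Q, Qfin, t assumed.
import numpy as np
from scipy.linalg import expm

d = 2
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
B = np.eye(d)
R, Q, Qfin = np.eye(d), np.eye(d), np.eye(d)
t, q0 = 1.0, np.array([1.0, 0.0])

# Hamiltonian block matrix of Eq. (7.17)
H = np.block([[A, -B @ np.linalg.inv(R) @ B.T],
              [-Q, -A.T]])

def W(tau):
    return expm(tau * H)  # Eq. (7.19)

Wt = W(t)
W11, W12 = Wt[:d, :d], Wt[:d, d:]
W21, W22 = Wt[d:, :d], Wt[d:, d:]
# M enforces the terminal condition lambda(t) = Qfin q(t), Eq. (7.20)
M = -np.linalg.solve(W22 - Qfin @ W12, W21 - Qfin @ W11)

def P(tau):
    Wtau = W(tau)
    return (Wtau[d:, :d] + Wtau[d:, d:] @ M) @ np.linalg.inv(
        Wtau[:d, :d] + Wtau[:d, d:] @ M)      # Eq. (7.23)

print(P(t))                                   # should be close to Qfin
print(-np.linalg.inv(R) @ B.T @ P(0.0) @ q0)  # feedback control at tau = 0
```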
Notice that in the deterministic case without uncertainty/perturbation (and this is what we have considered so far) the open loop and the feedback loop controls are equivalent. However, the two control schemes/policies give very different results in the presence of uncertainty/perturbation (consistently with what was already mentioned above). We will investigate this phenomenon and give a more extended comparison of the two controls in the probability/statistics/data science section of the course.

Example 7.1.1. Show, utilizing the derivations and discussions above, that the matrix, P(τ), defined in Eq. (7.23), satisfies the so-called Riccati equation:
$$\dot P + A^T P + P A + Q = P B R^{-1} B^T P, \tag{7.24}$$
supplemented with the terminal/final (τ = t) condition, P(t) = Q_fin.


Solution. To solve this problem we will not actually be using the explicit expression for P given in (7.23). Instead we will use the important relation P q = λ. First, we note that
$$\dot q = Aq - BR^{-1}B^T\lambda, \qquad \dot\lambda = -Qq - A^T\lambda.$$
With these three relations we may write
$$\dot P q + P\dot q = \dot\lambda$$
$$\Rightarrow\quad \dot P q + P\left(Aq - BR^{-1}B^T\lambda\right) = -Qq - A^T\lambda$$
$$\Rightarrow\quad \dot P q + P\left(Aq - BR^{-1}B^T P q\right) = -Qq - A^T P q$$
$$\Rightarrow\quad \dot P q + A^T P q + P A q + Qq = P B R^{-1} B^T P q$$
$$\Rightarrow\quad \left(\dot P + A^T P + P A + Q\right)q = P B R^{-1} B^T P q.$$
Since this holds for arbitrary q we must have that
$$\dot P + A^T P + P A + Q = P B R^{-1} B^T P.$$
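Eq. (7.24) also suggests a direct numerical cross-check: integrating the Riccati ODE backward in time from the terminal condition P(t) = Q_fin must reproduce the same gain as the matrix-exponential construction (7.23). A minimal sketch with simple Euler steps, for the same assumed system as in the previous sketch:

```python
# Backward Euler integration of the Riccati ODE (7.24); system matrices assumed.
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])
B = np.eye(2)
Rinv = np.eye(2)                 # R = I, so R^{-1} = I
Q, Qfin = np.eye(2), np.eye(2)
t, n = 1.0, 10_000
dt = t / n

P = Qfin.copy()                  # terminal condition P(t) = Qfin
for _ in range(n):
    # from (7.24): Pdot = P B R^{-1} B^T P - A^T P - P A - Q
    Pdot = P @ B @ Rinv @ B.T @ P - A.T @ P - P @ A - Q
    P -= dt * Pdot               # step backward from tau to tau - dt
print(P)                         # P(0); matches P(0.0) from the previous sketch
```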

Example 7.1.2. Consider an unstable one-dimensional process,
$$\tau\in[0,\infty):\quad \dot q(\tau) = A q(\tau) + u(\tau),$$
where u ∈ R and A is a positive constant, A > 0. Design an LQ controller u(τ) = −P q(τ)/R that minimizes the action
$$S\{q(\tau),u(\tau)\} = \int_0^\infty d\tau\left(q^2 + Ru^2\right),$$
where P is a constant (to be found) and R is a positive known constant. Discuss/explain what happens with P when R → 0 or R → ∞.

Solution. We note that since P is a constant, Ṗ = 0. Moreover, since we are in one dimension the Riccati Eq. (7.24) can be simplified to
$$2AP + Q = \frac{B^2}{R}\,P^2.$$
Since B = Q = 1, the quadratic equation for P gives
$$P = RA \pm \sqrt{R^2A^2 + R},$$
that is, it shows two branches. We must consider both of these branches in the context of the optimization problem. Substituting u into the ODE, we arrive at
$$q(\tau) = q(0)\exp\left((A - P/R)\tau\right) = q(0)\exp\left(\mp\sqrt{A^2 + 1/R}\,\tau\right).$$
Observe that the cost is infinite if we do not take $P = RA + \sqrt{R^2A^2 + R}$. If we consider R → 0, the result is $P \approx \sqrt{R} \to 0$. However, if we consider R → ∞ the result is P → 2RA.
We can even write down the action explicitly (for q(0) = 1):
$$S = \int_0^\infty (q^2 + Ru^2)\, d\tau = \left(2 + 2RA^2 + 2RA\sqrt{A^2+1/R}\right)\int_0^\infty \exp\left(-2\tau\sqrt{A^2+1/R}\right) d\tau$$
$$= \frac{1 + RA^2 + RA\sqrt{A^2+1/R}}{\sqrt{A^2+1/R}} = RA + \sqrt{R^2A^2+R} = P,$$
i.e. the optimal cost equals P for a unit initial condition.

7.2 From Variational Calculus to the Bellman-Hamilton-Jacobi Equation
Next we consider an optimal control problem which is more general, in terms of governing equations and optimization objective, than what was considered so far. We study a controlled dynamical system which is nonlinear in our primal variable, {q(τ) : [0, t] → R^d}, but still linear in the control variable, {u(τ) : [0, t] → R^d}:
$$\forall\tau\in[0,t]:\quad \dot q(\tau) = f(q(\tau)) + u(\tau). \tag{7.25}$$

As above, we will formulate the control problem as an optimization. We aim to minimize the objective
$$\int_0^t d\tau\left(\frac{1}{2}u^T(\tau)u(\tau) + V(q(\tau))\right), \tag{7.26}$$

over {u(τ )} which satisfies the ODE (7.25). Here in Eq. (7.26) we shortcut notations and
use (u(τ ))2 for uT (τ )u(τ ). Notice that the cost-to-go objective (7.26) is a sum of two
terms: (a) the cost of control, which is assumed quadratic in the control efforts, and (b) the
bounded from below “potential”, which defines preferences or penalties imposed on where
the particle may or may not go. The potential may be soft or hard. An exemplary soft
potential is the quadratic potential
$$V(q) = \frac{1}{2}\, q^T \Lambda q = \frac{1}{2}\sum_{i,j=1}^d q_i \Lambda_{ij} q_j, \tag{7.27}$$

where Λ is a positive semi-definite matrix. This potential encourages q(τ ) to stay close to
the origin, q = 0, penalizing (but softly) for deviation from the origin. An exemplary hard
constraint may be
$$V(q) = \begin{cases} 0, & |q| < a\\ \infty, & |q| \ge a\end{cases}, \tag{7.28}$$

completely prohibiting q(τ ) to leave the ball of size a around the origin. Summarizing, we
discuss the optimal control problem:

$$\min_{\{u(\tau),q(\tau)\}} \int_0^t d\tau\left(\frac{u^T(\tau)u(\tau)}{2} + V(q(\tau))\right)\Bigg|_{\substack{\forall\tau\in[0,t]:\;\dot q(\tau)=f(q(\tau))+u(\tau)\\ q(0)=q_0,\; q(t)=q_t}} \tag{7.29}$$

where initial and final states of the system are assumed fixed.
In the following we restate Eq. (7.29) as an unconstrained variational calculus problem.
(Notice, that we do not count the boundary conditions as constraints.) We will assume that
all the functions involved in the formulation (7.29) are sufficiently smooth and derive re-
spective Euler-Lagrange (EL) equations, Hamiltonian equations and Hamilton-Jacobi (HJ)
equations.
To implement the plan, let us, first of all, exclude {u(τ )} from Eq. (7.29). The resulting
“q-only” formulation becomes

$$\min_{\{q(\tau)\}} \int_0^t d\tau\left(\frac{(\dot q(\tau)-f(q(\tau)))^T(\dot q(\tau)-f(q(\tau)))}{2} + V(q(\tau))\right)\Bigg|_{q(0)=q_0,\; q(t)=q_t}. \tag{7.30}$$

Following the Lagrangian and Hamiltonian approaches, described in detail in the variational calculus portion of the course (see Section 6), one identifies the action, Lagrangian, momentum and Hamiltonian for the functional optimization (7.30) as follows:
$$S\{q(\tau),\dot q(\tau)\} = \int_0^t d\tau\left(\frac{(\dot q - f(q))^T(\dot q - f(q))}{2} + V(q)\right), \tag{7.31}$$
$$L = \frac{(\dot q - f(q))^T(\dot q - f(q))}{2} + V(q), \tag{7.32}$$
$$p \equiv \frac{\partial L}{\partial \dot q^T} = \dot q - f(q), \tag{7.33}$$
$$H \equiv \dot q^T\frac{\partial L}{\partial \dot q^T} - L = \frac{\dot q^T\dot q}{2} - \frac{(f(q))^T f(q)}{2} - V(q) = \frac{p^T p}{2} + p^T f(q) - V(q). \tag{7.34}$$
Then the Euler-Lagrange equations are
$$\forall i = 1,\cdots,d:\quad \frac{d}{dt}\frac{\partial L}{\partial \dot q_i} = \frac{\partial L}{\partial q_i}, \tag{7.35}$$
$$\frac{d}{dt}\left(\dot q_i - f_i(q)\right) = -\sum_{j=1}^d \left(\dot q - f(q)\right)_j\, \partial_{q_i} f_j(q) + \partial_{q_i} V(q),$$

where we stated the vector equation by components for clarity. The Hamilton equations
are
$$\forall i = 1,\cdots,d:\quad \dot q_i = \frac{\partial H}{\partial p_i} = p_i + f_i(q), \tag{7.36}$$
$$\dot p_i = -\frac{\partial H}{\partial q_i} = -\sum_{j=1}^d p_j\, \partial_{q_i} f_j(q) + \partial_{q_i} V(q). \tag{7.37}$$
Considering the action, S, as a function (not a functional!) of the final time, t, and of the final position, q_t, and recalling that
$$\frac{\partial S}{\partial t} = -H\big|_{\tau=t},\qquad \frac{\partial S}{\partial q_t} = \frac{\partial L}{\partial \dot q}\bigg|_{\tau=t} = p\big|_{\tau=t},$$
one arrives at the Hamilton-Jacobi (HJ) equation
$$\frac{\partial S}{\partial t} = -H\left(q_t, \frac{\partial S}{\partial q_t}\right) = -\frac{1}{2}\left(\frac{\partial S}{\partial q_t}\right)^T\left(\frac{\partial S}{\partial q_t}\right) - (f(q_t))^T\frac{\partial S}{\partial q_t} + V(q_t). \tag{7.38}$$
We will see later on that it may be useful to consider the HJ equation backwards in time. In this case we consider the action, $S = \int_\tau^t d\tau'\, L$, as a function of τ and q(τ) = q. This results in the following (backwards in time) modification of Eq. (7.38):
$$-\frac{\partial S}{\partial\tau} = -\frac{1}{2}\left(\frac{\partial S}{\partial q}\right)^T\left(\frac{\partial S}{\partial q}\right) + (f(q))^T\frac{\partial S}{\partial q} + V(q), \tag{7.39}$$

where we use the relations ∂τS = H|τ and ∂qS = −∂q̇L|τ. (Check Theorem 6.5.1 to recall how differentiation of the action with respect to time and coordinates at the beginning and at the end of a path are related to each other.)
Notice that the HJ equation, in the control formulation, is called the Bellman or Bellman-Hamilton-Jacobi (BHJ) equation, to commemorate the contribution of Bellman, who formulated the problem and resolved it by deriving the BHJ equation.
In Section 7.4 we derive the BHJ equation in a more general setting.

Exercise 7.2. Consider a one-dimensional optimal control problem where we know where we want to arrive and we must learn how to steer there from any location, formally:
$$\{u^*(\tau), q^*(\tau)\} = \arg\min_{\{u(\tau),q(\tau)\in\mathbb{R}\}} \int_0^t d\tau\, \frac{(u(\tau))^2 + \beta^2(q(\tau))^2}{2}\Bigg|_{\substack{\forall\tau\in[0,t]:\;\dot q(\tau)=-\alpha q(\tau)+u(\tau)\\ q(0)=q_0,\; q(t)=0}} \tag{7.40}$$
Use a substitution to eliminate the control variable, and derive the Euler-Lagrange equations, the Hamiltonian equations, and the Hamilton-Jacobi equation. Find the optimal trajectory q*(τ) and verify that it is consistent with the three equations. Reconstruct the optimal controller u*(τ) and express it via (a) q*(τ) (closed loop control); (b) q0 (open loop control). [Hint: In your solution, you may find it convenient to define γ² = β² + α².]

7.3 Pontryagin Minimal Principle


Let us now consider the following even more general optimal control problem, formulated for a dynamical system in a state, {q(τ) ∈ R^d | ∀τ ∈ [0, t]}:
$$\{u^*(\tau), q^*(\tau)\} = \arg\min_{\{u(\tau),q(\tau)\}}\left(\phi(q(t)) + \int_0^t d\tau\, L(\tau, q(\tau), u(\tau))\right)\Bigg|_{\substack{\dot q(\tau)=f(\tau,q(\tau),u(\tau)),\;\forall\tau\in(0,t];\\ u(\tau)\in U\subseteq\mathbb{R}^d,\;\forall\tau\in[0,t];\\ q(0)=q_0}} \tag{7.41}$$
where the control {u(τ) ∈ U ⊆ R^d | τ ∈ [0, t]} is restricted to the domain U of the d-dimensional space at all times considered.
The analog of the standard variational calculus approach, consisting in the necessary Euler-Lagrange (EL) conditions over {u(τ)} and {q(τ)}, is called the Pontryagin Minimal Principle (PMP), commemorating the contribution of Lev Pontryagin to the subject [13] (see also [14] for an extended discussion of the PMP bibliography, circa 1963). We present it here without much elaboration, as it follows the same variational logic repeated by now many times in this Section. Begin by introducing the effective action,
$$\tilde S := \phi(q(t)) + \int_0^t d\tau\, L(\tau,q(\tau),u(\tau)) + \int_0^t d\tau\, \lambda(\tau)\left(f(\tau,q(\tau),u(\tau)) - \dot q(\tau)\right), \tag{7.42}$$

where {λ(τ)} is a Lagrangian multiplier that is a function of τ. Then, by optimizing over {u} and {q}, we arrive at the following set of variational equations for the candidate solution of Eq. (7.41):
$$\forall\tau\in[0,t]:\quad \min_{\{u\}}\tilde S:\quad u^*(\tau) = \arg\min_{\tilde u(\tau)}\left(L(\tau,q^*(\tau),\tilde u(\tau)) + \lambda^*(\tau)\, f(\tau,q^*(\tau),\tilde u(\tau))\right), \tag{7.43}$$
$$\frac{\delta\tilde S}{\delta q(\tau)}\bigg|_{q(\tau)=q^*(\tau)} = 0:\quad \dot\lambda^*(\tau) = -\frac{\partial}{\partial q^*}\left(L(\tau,q^*(\tau),u^*(\tau)) + \lambda^*(\tau)\, f(\tau,q^*(\tau),u^*(\tau))\right), \tag{7.44}$$
$$\tau = t:\quad \frac{\partial\tilde S}{\partial q(t)}\bigg|_{q(\tau)=q^*(\tau)} = 0:\quad \lambda^*(t) = \partial\phi(q^*(t))/\partial q^*(t). \tag{7.45}$$

Notice that Eq. (7.45) is the result of the variation of S̃ over q(t), providing the boundary condition at τ = t by relating q(t) and λ*(t). (The derivation of Eq. (7.45) is equivalent to the derivation of the respective boundary condition (7.15) at τ = t in the case of the LQ control discussed in Section 7.1.) The combination of Eqs. (7.43,7.44,7.45) with the (primal) dynamic equations and the initial condition on q(0) (from the first line in the conditions of Eq. (7.41)) completes the description of the PMP approach. This PMP system of equations, stated as a Boundary Value (BV) problem, with two boundary conditions on the opposite ends of the temporal interval, is too difficult to allow an analytic solution in the general case. The system of equations is normally solved numerically by the shooting method. The solution of the PMP system of equations is not guaranteed to be unique.

Exercise 7.3. Consider a rocket, modeled as a particle of constant (unit) mass moving
in zero gravity (empty) two dimensional space. Assume that the thrust/force acting on
the rocket, f (τ ) is a known (prescribed) function of time (dependent on, presumably pre-
calculated, rate of the fuel burn), and that the direction of the thrust can be controlled.
Then equations of motion for the controlled rocket are

∀τ ∈ (0, t] : q̈1 = f (τ ) cos u(τ ), q̈2 = f (τ ) sin u(τ ).



(a) Assume that ∀τ ∈ [0, t], f(τ) > 0. Show that $\min_{\{u(\tau),q_1(\tau),q_2(\tau)\}}\phi(q(t))$, where φ(q) is an arbitrary function, always results in an optimal control stated in the following, so-called bi-linear tangent, form:
$$\tan(u^*(\tau)) = \frac{a + b\tau}{c + d\tau}.$$

(b) Assume that the rocket starts at rest at the origin and that we want to drive it to a given height, q2(t) = q∗, at the final moment of time t, such that the final velocity in the horizontal direction, q̇1(t), is maximized, while q̇2(t) = 0:
$$\max_{\{q_1(\tau),q_2(\tau),u(\tau)\}} \dot q_1(t)\Big|_{q_1(0)=q_2(0)=0,\; q_2(t)=q_*,\; \dot q_2(t)=0}.$$

Show that the optimal control reduces to a linear tangent law,
$$\tan(u^*(\tau)) = a + b\tau.$$

7.4 Dynamic Programming in Optimal Control and Beyond


7.4.1 Discrete Time Optimal Control

Discretizing Eq. (7.41) in time one arrives at
$$\min_{u_{0:n-1},\, q_{1:n}}\left(\phi(q_n) + \Delta\sum_{k=0}^{n-1} L(\tau_k, q_k, u_k)\right)\Bigg|_{k=0,\cdots,n-1:\; q_{k+1} = q_k + \Delta f(\tau_k, q_k, u_k)} \tag{7.46}$$
where, for k = 1, ..., n: τ_k := kt/n, q_k := q(τ_k), u_{k−1} := u(τ_k), ∆ := t/n, and q0 is assumed fixed.
The main idea of Dynamic Programming (DP) consists in performing the optimization
in Eq. (7.46), not over all the variables at once, but sequentially, one after another, that is
in a greedy fashion. Specifically, let us first optimize over qn and un−1 . In fact, optimization
over qn consists simply in the substitution of qn by qn−1 + ∆f (τn−1 , qn−1 , un−1 ), according
to the condition in Eq. (7.46) evaluated at k = n − 1. One derives

S(n, qn ) := φ(qn ), (7.47)


u∗n−1 := arg min S (n, qn−1 + ∆f (τn−1 , qn−1 , un−1 )) + ∆L (τn−1 , qn−1 , un−1 ) , (7.48)
un−1 ∈U
 
S(n − 1, qn−1 ) := S n, qn−1 + ∆f τn−1 , qn−1 , u∗n−1 + ∆L τn−1 , qn−1 , u∗n−1 , (7.49)

where making optimization over un−1 we took advantage of the locality in the causal struc-
ture of the objective in Eq. (7.46), therefore taking into account only terms in the objective
CHAPTER 7. OPTIMAL CONTROL AND DYNAMIC PROGRAMMING 201

dependent on un−1 . Repeating the same scheme by first excluding, qn−1 , and second opti-
mizing over un−2 , and then repeating the two sub-steps (by induction) n − 1 times (back-
wards in discreet time) we arrive at the following recurrent generalization of Eqs. (7.48,7.49),
k = n, · · · , 1:

$$u^*_{k-1} := \arg\min_{u_{k-1}\in U}\left(S\left(k,\, q_{k-1} + \Delta f(\tau_{k-1}, q_{k-1}, u_{k-1})\right) + \Delta L(\tau_{k-1}, q_{k-1}, u_{k-1})\right), \tag{7.50}$$
$$S(k-1, q_{k-1}) := S\left(k,\, q_{k-1} + \Delta f(\tau_{k-1}, q_{k-1}, u^*_{k-1})\right) + \Delta L(\tau_{k-1}, q_{k-1}, u^*_{k-1}), \tag{7.51}$$

where Eq. (7.47) sets the initial condition for the backward in (discrete) time iterations. It is now clear that S(0, q0) is exactly the solution of Eq. (7.46). S(k, q_k), defined in Eq. (7.51), is called the cost-to-go, or the value function, evaluated at the (discrete) time τ_k. L(τ, q, u) and f(τ, q, u) are called the (incremental) reward and the (incremental) state correction. Eqs. (7.47,7.50,7.51) are summarized in Algorithm 1.

Algorithm 1 Dynamic Programming [Backward in time Value Iteration]

Input: L(τ, q, u), f(τ, q, u), ∀τ, q, u.

1: S(n, q) ← φ(q), ∀q
2: for k = n − 1, ..., 0 do
3:   u*_k(q) ← arg min_u (∆L(τ_k, q, u) + S(k + 1, q + ∆f(τ_k, q, u))), ∀q
4:   S(k, q) ← ∆L(τ_k, q, u*_k(q)) + S(k + 1, q + ∆f(τ_k, q, u*_k(q))), ∀q
5: end for
Output: u*_k(q), ∀q, k = n − 1, ..., 0.
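Below is a toy Python instance of Algorithm 1 (the state and control grids, the running cost L = q² + u², the dynamics f = u, and the zero terminal cost φ are all assumptions made purely for illustration). The continuous state is discretized, and S(k + 1, ·) is linearly interpolated on the grid:

```python
# Toy backward value iteration (Algorithm 1); all problem data assumed.
import numpy as np

t, n = 1.0, 50
dt = t / n
qs = np.linspace(-2.0, 2.0, 81)        # discretized state space
us = np.linspace(-2.0, 2.0, 41)        # discretized control set U

S = np.zeros_like(qs)                  # S(n, q) = phi(q) = 0
policy = np.zeros((n, len(qs)))
for k in range(n - 1, -1, -1):         # backward in time, as in Algorithm 1
    S_new = np.empty_like(S)
    for i, q in enumerate(qs):
        q_next = np.clip(q + dt * us, qs[0], qs[-1])
        # Delta*L(q, u) + S(k + 1, q + Delta*f(q, u)), interpolated on the grid
        vals = dt * (q**2 + us**2) + np.interp(q_next, qs, S)
        j = np.argmin(vals)
        S_new[i], policy[k, i] = vals[j], us[j]
    S = S_new
print(S[np.argmin(np.abs(qs - 1.0))])  # cost-to-go from q0 = 1 at tau = 0
```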

The scheme just explained and the resulting DP Algorithm 1 were introduced in the famous paper of Richard Bellman from 1952 [15].
In accordance with its greedy construction (one step at a time, backward in time), the DP algorithm is an example of what is called a greedy algorithm in Computer Science, that is, an algorithm that attempts to find a globally optimal solution by making choices at each step that are only locally optimal. In general, greedy algorithms offer only a heuristic, i.e. an approximate (sub-optimal), solution. However, the remarkable feature of the optimal control problem, of which we have just sketched a proof (the sequence of transformations in Eqs. (7.47,7.50,7.51) results in the optimal solution of Eq. (7.46)), is that the greedy algorithm in this case is optimal/exact.

7.4.2 Continuous Time Optimal Control

Taking the continuous limit of Eqs. (7.47,7.50,7.51) one arrives at the Bellman (also called Bellman-Hamilton-Jacobi) equation (which is already familiar from the discussion of Section 7.2, where it was derived in a special case):
$$-\partial_\tau S(\tau, q) = \min_{u\in U}\left(L(\tau, q, u) + \partial_q S(\tau, q)\cdot f(\tau, q, u)\right). \tag{7.52}$$

Then the expression for the optimal control, that is, the continuous time version of line 3 in Algorithm 1, is
$$\forall\tau\in(0,t]:\quad u^*(\tau, q) = \arg\min_{u\in U}\left(L(\tau, q, u) + \partial_q S(\tau, q)\cdot f(\tau, q, u)\right). \tag{7.53}$$

Notice that the special case considered in Section 7.2, where
$$L(\tau, q, u) \to \frac{u^2}{2} + V(q),\qquad f(\tau, q, u) \to f(q) + u,$$
and U → R^d, leads, after explicit evaluation of the resulting quadratic optimization, to Eq. (7.39).

Example 7.4.1 (Bang-Bang control of an oscillator). Consider a particle of unit mass on a spring, subject to a bounded-amplitude control:
$$\tau\in(0,t]:\quad \ddot x(\tau) = -x(\tau) + u(\tau),\qquad |u(\tau)| \le 1, \tag{7.54}$$
where the particle and control trajectories are {x(τ) ∈ R | τ ∈ (0, t]} and {u(τ) ∈ R | τ ∈ (0, t]}. Given x(0) = x0 and ẋ(0) = 0, i.e. the particle is at rest initially, find the control path {u(τ)} such that the particle position at the final moment, x(t), is maximal. (t is assumed known too.) Describe the optimal control and the optimal solution for the case of x(0) = 0 and t = 2π.
Solution. First, we change from a single second order (in time) ODE to two first order ODEs:
$$\forall\tau\in(0,t]:\quad q = \begin{pmatrix}q_1\\ q_2\end{pmatrix} := \begin{pmatrix}x\\ \dot x\end{pmatrix},\qquad \dot q = Aq + Bu, \tag{7.55}$$
$$A := \begin{pmatrix}0 & 1\\ -1 & 0\end{pmatrix},\qquad B := \begin{pmatrix}0\\ 1\end{pmatrix}. \tag{7.56}$$
We arrive at the optimal control problem (7.41) where φ(q) = C^T q, C^T := (−1, 0), L(t, q, u) = 0, and f(t, q, u) = Aq + Bu. Then, Eq. (7.52) becomes
$$\forall\tau\in(0,t]:\quad -\partial_\tau S = (\partial_q S)^T A q - \left|(\partial_q S)^T B\right|. \tag{7.57}$$


The absolute value here comes from the fact that the optimal value for u(τ) is at one of its extremes, either +1 or −1, depending on the sign of (∂_q S)^T B. Let us look for a solution by the (standard for HJ) method of variable separation, S(τ, q) = (ψ(τ))^T q + α(τ). Substituting the ansatz into Eq. (7.57) one derives
$$\forall\tau\in(0,t]:\quad \dot\psi = -A^T\psi,\qquad \dot\alpha = |\psi^T B|. \tag{7.58}$$
These equations must be solved for all τ, with the terminal/final conditions: ψ(t) = C and α(t) = 0. Solving the first equation and then substituting the result in Eq. (7.53) one derives
$$\forall\tau\in(0,t]:\quad \psi(\tau) = \begin{pmatrix}-\cos(\tau - t)\\ \sin(\tau - t)\end{pmatrix},\qquad u(\tau, q) = -\mathrm{sign}(\psi_2(\tau)) = -\mathrm{sign}\left(\sin(\tau - t)\right), \tag{7.59}$$
that is, the optimal control depends only on τ (it does not depend on q) and is ±1.
Consider for example q1(0) = x(0) = 0 and t = 2π. In this case the optimal control is
$$u(\tau) = \begin{cases} -1, & 0 < \tau < \pi\\ 1, & \pi < \tau < 2\pi\end{cases}, \tag{7.60}$$
and the optimal trajectory is
$$q = (q_1, q_2)^T = \begin{cases} \left(\cos(\tau) - 1,\; -\sin(\tau)\right), & 0 < \tau < \pi\\ \left(3\cos(\tau) + 1,\; -3\sin(\tau)\right), & \pi < \tau < 2\pi\end{cases} \tag{7.61}$$

The solution consists in first pushing the mass down, and then up, in both cases to the extremes, i.e. to u = −1 and u = 1, respectively. This type of control is called bang-bang control; it is observed in cases, like the one considered, without any (soft) cost associated with the control but only (hard) bounds.

Exercise 7.4. Consider a soft version of the problem discussed in Example 7.4.1:
$$\min_{\{u(\tau)\},\{q(\tau)\}}\left(C^T q(t) + \frac{1}{2}\int_0^t d\tau\,(u(\tau))^2\right)\Bigg|_{\forall\tau\in(0,t]:\;\dot q(\tau) = Aq(\tau) + Bu(\tau)} \tag{7.62}$$
where (q(0))^T = (x0, 0) and A, B and C are defined above (in the formulation and solution of Example 7.4.1). Derive the Bellman/BHJ equation, build a generic solution and illustrate it on the case of t = 2π and q1(0) = x0 = 0. Compare your result with the solution of Example 7.4.1.

7.5 Dynamic Programming in Discrete Mathematics


Let us take a look at Dynamic Programming (DP) from the perspective of discrete mathematics, usually associated with combinations of variables (thus combinatorics) and graphs (thus graph theory). In the following we start exploring this very rich and modern field of applied mathematics through examples.

7.5.1 LaTeX Engine

Consider a sequence of words of varying lengths, w1, ..., wn, and pose the question of choosing locations j1, j2, ... for breaking the sequence into multiple lines. Once the sequence of breaks is chosen, the spaces between words are stretched, so that the left and right margins are aligned. We are interested in placing the line breaks in a way which would be most pleasing for the eye. We turn this informally stated goal into an optimization, requiring that the word stretching resulting from the line breaking is minimal.^b
To formalize the notion of the minimal stretching consider a sequence of words labeled
by index i = 1, · · · , n. Each word is characterized by its length, wi > 0. Assume that the
cost of fitting all words in between i and j, where j > i, in a row is, c(i, j). Then the total
cost of placing n words in (presumably) nice looking text consisting of l rows is

c(1, j1 ) + c(j1 + 1, j2 ) + · · · + c(jl + 1, n), (7.63)

where 1 < j1 < j2 < ... < jl < n. We seek an optimal sequence that minimizes the total cost. To make the description of the problem complete, one needs to introduce a plausible way of "pricing" the line breaks. Let us define the total length of the line as the sum of all the lengths (of words) in the line plus the number of words in the line minus one (corresponding to the number of spaces in the line before stretching). Then, one requires the total length of the line (before stretching) to be less than the widest allowed line length L, and defines the cost to be a monotonically increasing function of the stretching factor, for example
$$c(i,j) = \begin{cases} +\infty, & L < (j-i) + \sum_{k=i}^j w_k\\[4pt] \left(\dfrac{L - (j-i) - \sum_{k=i}^j w_k}{j-i}\right)^3, & \text{otherwise} \end{cases} \tag{7.64}$$

(The cubic dependence in Eq. (7.64) is an empirical way to introduce preference for smaller
stretching factors. Notice also that Eq. (7.64) assumes that j > i, i.e. any line contains
more than one word, and it does not take into account the last string in the paragraph.)
^b This exemplary Dynamic Programming problem is borrowed from [16]; see Section 3.3.1 there.

At first glance the problem of finding the optimal sequence seems hard, that is, exponential in the number of words. Indeed, formally one has to make a decision whether to place a break (or not) after reading each word in the sequence, thus facing the problem of choosing an optimal sequence from 2^{n−1} possible options.
Is there a more efficient way of finding the optimal sequence? Apparently the answer to this question is in the affirmative, and in fact, as we will see below, the solution is of the Dynamic Programming (DP) type. The key insight is the relation between the optimal solution of the full problem and the optimal solution of a sub-problem consisting of an early portion of the full paragraph. One discovers that the optimal solution of the sub-problem is a subset of the optimal solution of the full problem. This means, in particular, that we can proceed in a greedy manner, looking for an optimal solution sequentially, solving a sequence of sub-problems where each consecutive problem extends the preceding one incrementally. (In general, greedy algorithms follow this basic structure: first, we view the solving of the problem as making a sequence of "steps" such that every time we make a "step" we end up with a smaller version of the same basic problem; second, we follow an approach of always taking whichever "step" looks best at the moment, and we never back up and change a "step".)
Let f(i) denote the minimum cost of formatting a sequence of words which starts from the word i and runs to the end of the paragraph. Then, the minimum cost of the entire paragraph is
$$f(1) = \min_j\left(c(1,j) + f(j+1)\right), \tag{7.65}$$
while a partial cost satisfies the following recursive relation
$$\forall i:\quad f(i) = \min_{j:\, i\le j}\left(c(i,j) + f(j+1)\right), \tag{7.66}$$

which we also supplement by the boundary condition, f(n + 1) = 0, stating formally that no word is available for formatting when we reach the end of the paragraph. Eq. (7.66) is a full analog of the Bellman equation (7.51). Algorithm 2 is a recursive algorithm for f(i) implementing Eq. (7.66).
Algorithm 2 answers the formatting question in a way smarter than the naive check mentioned above. However, a direct recursive implementation is still not efficient, as it recomputes the same values of f many times, thus wasting effort. For example, the algorithm calculates f(4) whenever it calculates f(1), f(2), or f(3). To avoid this unnecessary work, one should save the values already calculated, by placing each result just computed into memory. Then, by computing and storing the values f(i) sequentially, each value is computed only once. Since we have n different values of i and the loop runs through O(n) values of j, the total running time of the algorithm, relying on the previously stored values, is O(n²).

Algorithm 2 Dynamic Programming for the LaTeX Engine


Input: c(i, j), ∀i, j = 1, · · · , n, e.g. according to Eq. (7.64). f (n + 1) = 0.

1: for i = n, · · · , 1 do
2: f (i) = +∞
3: for j = i, · · · , n do
4: f (i) ← min (f (i), c(i, j) + f (j + 1))
5: end for
6: end for
Output: f (i), ∀i = 1, · · · , n
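A direct Python implementation sketch of Algorithm 2 with the cost (7.64) follows; the word lengths and the line width L are made-up inputs, chosen only to exercise the code:

```python
# Line-breaking DP (Algorithm 2) with the cost of Eq. (7.64); inputs assumed.
import math

def cost(w, i, j, L):
    # cost of placing words i..j (inclusive, 0-based) on one line
    total = (j - i) + sum(w[i:j + 1])   # word lengths plus inter-word spaces
    if L < total or j == i:             # overfull line, or a one-word line,
        return math.inf                 # which Eq. (7.64) does not allow
    return ((L - total) / (j - i)) ** 3

def min_total_cost(w, L):
    n = len(w)
    f = [0.0] * (n + 1)                 # boundary condition f(n + 1) = 0
    for i in range(n - 1, -1, -1):      # fill f backwards, storing each value
        f[i] = min(cost(w, i, j, L) + f[j + 1] for j in range(i, n))
    return f[0]                         # Eq. (7.65): cost of the paragraph

print(min_total_cost([3, 1, 4, 1, 5, 2, 2, 3], L=12))
```

Because each f(i) is stored once computed, the double loop runs in O(n²) minimizations, as discussed above (the naive cost evaluation via sum() adds another factor, which could be removed with prefix sums).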

7.5.2 Cheapest Path over a Grid

Let us now discuss another problem. There is a number placed in each cell of a rectangular grid, N × M. One starts from the top-left corner and aims to reach the bottom-right corner. At every step one can move down or right, "paying a price" equal to the number written in the cell one moves into. What is the minimum amount needed to complete the task?
Solution: One can move to a particular cell (i, j) only from its left neighbor (i − 1, j) or its upper neighbor (i, j − 1). Let us solve the following sub-problem: find the minimal price p(i, j) of moving to the cell (i, j). The recursive formula (the Bellman equation again) is:
$$p(i,j) = \min\left(p(i-1,j),\; p(i,j-1)\right) + a(i,j),$$
where a(i, j) is the table of initial numbers. The final answer is the element p(N, M). Note that one can manually pad the table a(i, j) with an extra first column and row, filled with numbers which are deliberately larger than the content of any cell (this helps as it allows one to avoid dealing with the boundary conditions explicitly). See Algorithm 3.
The algorithm's performance is illustrated in Fig. (7.1).

Exercise 7.5. Consider a directed acyclic graph with weighted edges (a directed acyclic graph is a directed graph with no cycles). One node is the start, and one node is the end. Construct an algorithm to compute the maximum cost path from the start node to the end node recursively. That is, beginning with the start node (say node 1), propagate by adjacency and keep an updated list of the max cost path from node 1 to node j. Your algorithm should not arbitrarily compute all possible paths. Provide pseudo-code for your algorithm and test it (by hand or with a computer) on the graph given in Fig. 7.2.

Algorithm 3 Dynamic Programming for Minimum Cost Path over Grid

Input: Costs assigned: a(i, j), ∀i = 1, ..., N; ∀j = 1, ..., M. Boundary conditions fixed: p(i, 0) ← +∞, ∀i = 1, ..., N; p(0, j) ← +∞, ∀j = 1, ..., M. Initialization: p(1, 1) ← 0.

1: for t = 2, ..., N + M do
2:   for all (i, j) with i + j = t, 1 ≤ i ≤ N, 1 ≤ j ≤ M do
3:     p(i, j) ← min(p(i − 1, j), p(i, j − 1)) + a(i, j)
4:   end for
5: end for
Output: p(i, j), ∀i = 1, ..., N; j = 1, ..., M.
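The sketch below runs Algorithm 3 in Python on the 4 × 4 example of Fig. 7.1 (the grid entries are transcribed from the figure; cell (1, 1) carries no cost, so a(1, 1) = 0 here). Note that a simple row-by-row sweep respects the same dependencies as the anti-diagonal sweep of Algorithm 3:

```python
# Cheapest path over a grid (Algorithm 3); grid transcribed from Fig. 7.1.
import numpy as np

a = np.array([[0, 3, 1, 4],
              [1, 3, 7, 3],
              [8, 9, 2, 5],
              [4, 5, 2, 1]], dtype=float)
N, M = a.shape
p = np.full((N + 1, M + 1), np.inf)   # +inf padding replaces boundary checks
p[1, 1] = 0.0                          # initialization, as in Algorithm 3
for i in range(1, N + 1):              # row-by-row order also respects the
    for j in range(1, M + 1):          # (i-1, j) and (i, j-1) dependencies
        if (i, j) == (1, 1):
            continue
        p[i, j] = min(p[i - 1, j], p[i, j - 1]) + a[i - 1, j - 1]
print(p[N, M])                         # 16.0, the value in Fig. 7.1(i)
```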

[Figure 7.1, panels (a)-(i): (a) a sample path; (b) the initialization step; (c)-(h) the first through sixth steps of the recursion, each applying p(i, j) = min(p(i − 1, j), p(i, j − 1)) + a(i, j) along an anti-diagonal; (i) the optimal path(s).]

Figure 7.1: Step-by-step illustration of the Cheapest-Path Algorithm 3 for an exemplary 4 × 4 grid. The number in the corner of each cell (except cell (1, 1)) is the respective a_{ij}. The values in the green circles are the respective final p_{ij}, corresponding to the cost of the optimal path from (1, 1) to (i, j).


Figure 7.2: Example of a weighted directed acyclic graph.

7.5.3 DP for Graphical Model Optimization

The number of optimization problems which can be solved efficiently with DP is remarkably broad. In particular, it appears that the following combinatorial optimization problem, over a binary n-dimensional variable, x:
$$E := \min_{x\in\{\pm 1\}^n}\sum_{i=1}^{n-1} E_i(x_i, x_{i+1}), \tag{7.67}$$

which naively requires optimization over 2^n possible states, can be solved efficiently by DP with effort linear in n. Here, in Eq. (7.67), E_i(x_i, x_{i+1}) is an arbitrary, known, and possibly different for different i, real-valued function of its arguments, which are both binary. In the jargon of mathematical physics the problem just introduced is called "finding a ground state of the Ising model".
To explain the DP algorithm for this example it is convenient to represent the problem in terms of a linear graph (a chain), shown in Fig. (7.3). The components of x are associated with the nodes, and the "energies" of "pair-wise interactions" between neighboring components of x are associated with the edges, thus arriving at a linear graph (chain).
Let us illustrate the greedy, DP approach to solving the optimization (7.67) on the example in Fig. (7.3). The greedy essence of the approach suggests that we should minimize over the components sequentially, starting from one side of the chain and advancing to its opposite end. Therefore, minimizing over x1 one derives
$$E = \min_{x_2,\cdots,x_n}\left(\min_{x_1} E_1(x_1,x_2) + \sum_{i=2}^{n-1} E_i(x_i,x_{i+1})\right) = \min_{x_2,\cdots,x_n}\left(\tilde E_2(x_2,x_3) + \sum_{i=3}^{n-1} E_i(x_i,x_{i+1})\right), \tag{7.68}$$
$$\tilde E_2(x_2,x_3) := E_2(x_2,x_3) + \min_{x_1} E_1(x_1,x_2), \tag{7.69}$$
where we took advantage of the objective factorization (into a sum of terms, each involving only a pair of neighboring components). Notice that, computing $\min_{x_1} E_1(x_1,x_2)$, we need

Figure 7.3: Top: Example of a linear Graphical Model (chain). Bottom: Modified GM
(shorter chain) after one step of the DP algorithm.

to track the result for all possible (two) values of x2. As the end result of the (greedy) minimization over x1, we arrive at a problem with exactly the same structure we started with, i.e. a chain. However, the chain is shorter by one node (and one edge). The only change in the new structure (when compared with the original structure) is the "renormalization" of the pair-wise energy: E2(x2, x3) → Ẽ2(x2, x3). The graphical transformation associated with one greedy step is illustrated in Fig. (7.3), which shows the transition from the original chain to the reduced (one node and one edge shorter) chain. Therefore, repeating the process sequentially (by induction), we get the desired answer in a number of steps linear in n. The DP algorithm is shown below, where we also generalize, assuming that all components x_i are drawn from an arbitrary (and not necessarily binary) set, Σ, often called the "alphabet" in the Computer Science and Information Theory literature.
Consider the generalization of the combinatorial optimization problem (7.67) to the case of a singly-connected tree, T = (V, E), e.g. the one shown in Fig. (7.4):
$$E := \min_{x\in\Sigma^{|V|}}\sum_{\{i,j\}\in E} E_{i,j}(x_i, x_j), \tag{7.70}$$
where V and E are the sets of nodes and edges of the tree, respectively; |V| is the cardinality of the set of nodes (the number of nodes); and Σ is the set (alphabet) marking the possible (allowed) values for any component x_i, i ∈ V, of x.

Algorithm 4 DP for Combinatorial Optimization over a Chain

Input: Pair-wise energies, E_i(x_i, x_{i+1}), ∀i = 1, ..., n − 1.

1: for i = 1, ..., n − 2 do
2:   for x_{i+1}, x_{i+2} ∈ Σ do
3:     E_{i+1}(x_{i+1}, x_{i+2}) ← E_{i+1}(x_{i+1}, x_{i+2}) + min_{x_i} E_i(x_i, x_{i+1})
4:   end for
5: end for
Output: E = min_{x_{n−1}, x_n} E_{n−1}(x_{n−1}, x_n)
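A compact Python sketch of Algorithm 4 for a chain with (assumed) random pairwise energies over the binary alphabet, verified against brute-force enumeration over all 2^n configurations:

```python
# Chain DP (Algorithm 4) versus brute force; energies drawn at random.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 8, (-1, +1)
# E[i][(x, y)] stores the pairwise energy E_i(x_i, x_{i+1})
E = [{(x, y): rng.normal() for x in sigma for y in sigma} for _ in range(n - 1)]

# DP sweep: absorb min over x_i into E_{i+1}, the renormalization of Eq. (7.69)
Ecur = [dict(d) for d in E]
for i in range(n - 2):
    for y in sigma:
        m = min(Ecur[i][(x, y)] for x in sigma)
        for z in sigma:
            Ecur[i + 1][(y, z)] += m
E_dp = min(Ecur[n - 2][(x, y)] for x in sigma for y in sigma)

# brute force over all |Sigma|^n configurations, for verification
E_bf = min(sum(E[i][(c[i], c[i + 1])] for i in range(n - 1))
           for c in itertools.product(sigma, repeat=n))
print(np.isclose(E_dp, E_bf))          # True: the greedy DP is exact here
```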

Figure 7.4: Example of a tree-like Graphical Model.



Exercise 7.6. Generalize Algorithm 4 to the case of the GM optimization problem (7.70)
over a tree, that is compute E defined in Eq. (7.70). (Hint: one can start from any leaf
node of the tree, and use induction as in any other DP scheme.)
Part IV

Mathematics of Uncertainty

Chapter 8

Basic Concepts from Statistics

8.1 Distributions and Random Variables


Consider a system that can exist in a number of different states. State spaces can be either continuous or discrete. An example of a continuous state space is the angle between the two hands of a clock, measured clockwise from the hour hand, so Σ = [0, 2π); an example of a discrete state space is the number showing on the top of a die, so Σ = {1, 2, 3, 4, 5, 6}. If the state of the system is influenced by a source of randomness, then each state x ∈ Σ is associated with a probability, P(x), which describes the likelihood that state x is observed.

8.1.1 Discrete Random Variables

For discrete state spaces, P must satisfy
$$\forall x:\quad 0 \le P(x) \le 1, \tag{8.1}$$
$$\sum_{x\in\Sigma} P(x) = 1. \tag{8.2}$$

It is often useful to work with state spaces that are quantitative. For example, the set of
possible outcomes of a single coin toss, {Tail, Head}, can be mapped to the quantitative
state space, Σ = {0, 1}, by asking how many heads are observed after a single toss. For this
example, the probability mass function associated with this binary discrete sample space is
completely determined by a single parameter, call it β. So if P (1) = β, then P (0) = 1 − β
(See also Fig. 8.1.)
Terminology. In the example of the coin toss, we defined a random variable to be the
number of heads after one coin toss. If we call this random variable X, then the notation
P (1) = β and P (0) = 1 − β is really shorthand for P (X = 1) = β and P (X = 0) = 1 − β.



Figure 8.1: Probability mass function (left column) and cumulative distribution functions
(right column) for a Bernoulli random variable with parameter β ≡ 1/2 (top) and for a
Bernoulli random variable with parameter β > 1/2 (bottom).

Another common notation is P_X(1) = β and P_X(0) = 1 − β. All three notations mean "The probability that exactly one head is observed after a toss is β."
We could also write
$$P(X = k) = \begin{cases} 1 - \beta, & \text{for } k = 0;\\ \beta, & \text{for } k = 1. \end{cases} \tag{8.3}$$

The probability distribution described by (8.3) is called the Bernoulli distribution with parameter β. A random variable X that follows the Bernoulli distribution is called a Bernoulli random variable, and we write X ∼ Bernoulli(β).
Eq. (8.3) and Fig. 8.1 describe the Bernoulli distribution by its Probability Mass Function (PMF). A PMF is written in the form P(X = x) = ... because it defines the probability that a random variable takes certain values. Distributions can also be described by their Cumulative Distribution Functions (CDF). A CDF is written in the form P(X ≤ x) = ... because it defines the probability that a random variable is less than or equal to a certain value. For example, the CDF of the Bernoulli distribution (8.3) is
$$P(X \le k) = \begin{cases} 1 - \beta, & \text{for } k = 0,\\ 1, & \text{for } k = 1. \end{cases} \tag{8.4}$$
See Fig. 8.1 for an illustration.


We can take our example further and ask what happens when we toss the coin n times.
The set of possible outcomes of our experiment is the set of sequences of length n consisting
CHAPTER 8. BASIC CONCEPTS FROM STATISTICS 215


Figure 8.2: Probability Mass Functions (left column) and cumulative distribution functions
(right column) for Binomial random variables with parameters n = 2 and β = 1/2 (top)
and with parameters n = 5 and β > 1/2 (bottom).

of heads and tails, for example the sequence (H, T, H, H, ..., T ) is one possible outcome.
If we define the random variable Xi to be the number of heads showing on the ith toss,
then the sequence (Xi )ni=1 is a (quantitative) sequence of ones and zeros that represents the
outcome of n tosses.
For this example, we say that the random variables Xi are independent because the
outcome of each coin toss does not depend on the previous tosses, and we say they are
identically distributed because the underlying principles that determine the outcome of a
toss do not change from toss to toss. Random variables that are both independent and
identically distributed are given the shorthand i.i.d.
Let us define a new random variable, Y, to be the number of heads after n coin tosses, so Y = X1 + X2 + ... + Xn. In this situation, {Xi} are i.i.d., so each particular sequence with exactly k heads occurs with probability β^k(1 − β)^{n−k}, and the number of such sequences is given by the binomial coefficient, giving
$$P(Y = k) = \binom{n}{k}\beta^k(1-\beta)^{n-k}. \tag{8.5}$$
The probability distribution described by (8.5) is called a binomial distribution with parameters n and β (Fig. 8.2). A random variable Y that follows a binomial distribution is called a binomial random variable, and we say that Y ∼ B(n, β) or Y ∼ Binom(n, β).
Let us now discuss an example of an unbounded discrete state space. Consider some
event that occurs by chance, for example, observing a meteor in the night sky. Let K be the
random variable that counts the number of such occurrences during a given period of time.
CHAPTER 8. BASIC CONCEPTS FROM STATISTICS 216


Figure 8.3: Probability mass functions (left column) and cumulative distribution functions
(right column) for Poisson random variables with parameter λ = 0.5 (top) and with param-
eter λ = 2 (bottom).

Then Σ = {0, 1, 2, ...} (i.e., it is possible that there are no occurrences, one occurrence, two occurrences, and so on). It can be shown that under certain conditions, K will have the probability distribution
$$P(K = k) = \frac{\lambda^k e^{-\lambda}}{k!},\qquad \text{for } k = 0, 1, 2, \ldots \tag{8.6}$$
The probability distribution described by (8.6) is called a Poisson distribution with parameter λ (Fig. 8.3). A random variable K that follows a Poisson distribution is called a Poisson random variable, and we say that K ∼ Pois(λ). (Check that the probability defined in Eq. (8.6) satisfies Eq. (8.2).)
Other real-life examples of random processes associated with the Poisson distribution (called Poisson processes) are: the probability distribution of the number of phone calls received at a call center in an hour, the probability distribution of the number of customers arriving at a shop or bank, the probability distribution of the number of typing errors per page, and many more.

Example 8.1.1. Are the Bernoulli and Poisson distributions related? Can you "design" a Poisson process from a Bernoulli process?
Solution. Consider repeating the Bernoulli process n times independently, thus drawing a sequence consisting of zeros and ones whose number of ones follows a Binomial distribution. Check only for the ones and record the times (indexes) associated with their occurrences. Analyzing the probability distribution of k arrivals in n steps in the limit n → ∞, and assuming that nβ converges to a constant, i.e., nβ → λ, one recovers the Poisson distribution. (This statement, also called in the literature the Poisson Limit Theorem, will be discussed in more detail in the following lectures.)
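A small numerical illustration of this statement (the value λ = 2 is assumed): the Binomial(n, λ/n) probabilities approach the Poisson(λ) probabilities as n grows.

```python
# Poisson limit of the binomial distribution: Binom(n, lam/n) -> Pois(lam).
import math

lam, k_max = 2.0, 6

def binom_pmf(k, n, beta):
    return math.comb(n, k) * beta**k * (1 - beta) ** (n - k)

def pois_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

for n in (10, 100, 1000):
    err = max(abs(binom_pmf(k, n, lam / n) - pois_pmf(k, lam))
              for k in range(k_max + 1))
    print(f"n = {n:5d}: max PMF deviation = {err:.5f}")
```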

8.1.2 Continuous Random Variables

The state space Σ can also be continuous. Random variables on continuous state spaces are associated with a probability density function that must satisfy
$$\forall x\in\Sigma:\quad p(x) \ge 0, \tag{8.7}$$
$$\int_\Sigma dx\, p(x) = 1. \tag{8.8}$$

It is customary to use lower case p for the Probability Density Function and upper case, P ,
to denote actual probabilities. The Probability Density Function (PDF) provides a means
to compute probabilities that an outcome occurs in a given set or interval.
For example, for A ⊂ Σ, the probability of observing an outcome in the set A is given by
$$P(A) = \int_A p(x)\, dx.$$
Consider the probability that a real-valued X will take a value less than or equal to x:
$$P(X \le x) = \int_{-\infty}^x p(x')\, dx'. \tag{8.9}$$
Eq. (8.9) extends the notion of the CDF (the Cumulative Distribution Function, already familiar from Section 8.1.1) to the continuous case.
The setting can be extended from infinite to finite intervals. The uniform distribution on the interval [a, b] is an example of a distribution on a bounded continuous state space:
$$\forall x\in[a,b]:\quad p(x) = \frac{1}{b-a}. \tag{8.10}$$
A random variable X with a probability distribution given by equation (8.10) can be described by the notation X ∼ Unif(a, b). Fig. 8.4 illustrates the PDF and CDF of Unif(a, b).
The Gaussian distribution is perhaps the most common (and also the most important) continuous distribution:
$$\forall x\in\mathbb{R}:\quad p(x|\sigma,\mu) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \tag{8.11}$$


Figure 8.4: PDF (left column) and CDF (right column) for uniform random variables on
(0, 1) (top) and on (0, π) (bottom).


Figure 8.5: Probability density function (left column) and cumulative distribution func-
tion (right column) for normally distributed random variables N (0, 1) (top) and N (1, 0.52 )
(bottom).

Here p_{σ,µ}(x) is another possible notation. The distribution is parameterized by the mean, µ, and by the variance, σ². The standard notation for the Gaussian/normal distribution is N(µ, σ²). Fig. 8.5 illustrates the PDF and CDF of N(µ, σ²).
The probability distribution given by (8.11) is also called a normal distribution—where
“normality” refers to the fact that the Gaussian distribution is a “normal/natural” outcome
of summing up many random numbers, regardless of the distributions for the individual
contributions. (We will discuss the law of large numbers and central limit theorem shortly.)
Let us make a brief remark about notation. We will often write P(X = x), or the short-cut P(x), and sometimes you see in the literature P_X(x). By convention, upper case variables denote random variables. A random variable takes on values in some domain, and a particular observation of a random variable (that is, one that has been sampled and observed to have a particular value in the domain) is then a non-random value and is denoted by lower case, e.g. x. X ∼ P(x) denotes the fact that the random variable X is drawn from the distribution P(x). Also, when it is not confusing, we will streamline the notations (and thus abuse them a bit) and use lower case variables across the board, for both a random variable and for the deterministic value the random variable takes.

8.1.3 Sampling. Histograms.

Random process generation. A random process is generated/sampled numerically. Any computational package/software contains a random number generator (usually even a number of these). Designing a good random number generator is important. In this course, however, we will mainly be using the random number generators (in fact, pseudo-random generators) already created by others.
Histogram. To show a distribution graphically, you may also "bin" the samples in the domain, thus generating a histogram, which is a convenient way of approximating p(x) (see the plots in the attached Julia notebook, illustrating the breaking of the [0, 1] interval into N > 1 bins).
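For readers who prefer Python to the Julia notebook, here is a minimal sampling-and-histogram sketch (the sample size and the number of bins are arbitrary choices):

```python
# Sample Unif(0, 1) and bin the samples into N bins to approximate the PDF.
import numpy as np

rng = np.random.default_rng(1)
samples = rng.uniform(0.0, 1.0, size=100_000)
N = 20                                         # number of bins on [0, 1]
counts, edges = np.histogram(samples, bins=N, range=(0.0, 1.0))
density = counts / (len(samples) * (1.0 / N))  # normalize to a density
print(density.round(2))                        # each entry should be near 1
```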

8.2 Moments & Cumulants


8.2.1 Expectation & Variance

It is often useful to use as few numbers as possible to describe a probability distribution as meaningfully as possible.
The cumulants of a probability distribution are among the most common descriptors of the distribution. The first two cumulants are well known: the mean measures the central tendency of a distribution and the variance measures the spread of a distribution about the mean. The third and fourth cumulants are less well known: the skewness measures asymmetry about the mean and the kurtosis measures whether extreme values are unusually rare or common (Fig. 8.6).
The cumulants of a distribution are often found by taking combinations of the distri-
bution’s moments, which can be computed directly by summation or integration of certain
quantities (see below). Alternatively, the moments and the cumulants of a distribution can
be derived from its moment generating function and its characteristic function (see below
for details).
The moment generating function and the characteristic function are also frequently used
in more theoretical analysis, which is beyond the scope of this course.


Figure 8.6: The first cumulant describes the central tendency (mean) of a probability distribution (green). The second cumulant describes the spread about the mean (orange). The third cumulant describes the asymmetry of the distribution (blue). The fourth cumulant describes whether extreme values are unusually rare or common (red).

For a random (discrete or continuous) variable X, the expectation of X is defined as
$$\mathbb{E}[X] := \sum_{x\in\Sigma} x\, P(x)\quad \text{(discrete)}, \tag{8.12}$$
$$\mathbb{E}[X] := \int_{x\in\Sigma} dx\, x\, p(x)\quad \text{(continuous)}. \tag{8.13}$$

Notation. Common notation for the expectation of a random variable X with probability
mass function P includes E[X], EP [X], hXi and hXiP .

Example 8.2.1. Consider the example of tossing a pair of fair coins. The set of possible
outcomes is {(T, T ), (T, H), (H, T ), (H, H)}, each outcome occurring with equal probability.
Define the random variable X to be the number of heads that are observed, so X ∼ B(2, 1/2)
and

P(X = x) = { 1/4, for x = 0;   1/2, for x = 1;   1/4, for x = 2 }.
The expected number of heads is
E[X] = \sum_{x∈{0,1,2}} x P(x) = 0 · 1/4 + 1 · 1/2 + 2 · 1/4 = 1.

The expectation can also be defined for functions of a random variable. Consider a
function, f (x), and its expectation over the probability, P (x):
E_P[f(X)] = ⟨f(x)⟩_P = \sum_{x∈Σ} f(x) P(x)   (discrete)

E_p[f(X)] = ⟨f(x)⟩_p = \int_{x∈Σ} f(x) p(x) dx   (continuous)

Example 8.2.2. Consider a scenario where a gambler wins $200 for tossing a pair of heads,
but loses $100 for any other outcome. If we define f : Σ → R to be the earnings, then the
expectation of f is calculated to be

E[f ] = −100 · 3/4 + 200 · 1/4 = −25

The variance of a random variable is defined as


 
Var[X] := E[(X − E[X])²]   (8.14)

The variance of a random variable measures its expected spread about its mean. Note
that the mean and the variance do not have the same units (the units of the variance
are the square of the units of the mean), so it can be difficult to meaningfully interpret
the variance. Consequently, it is common to consider the standard deviation of a random
variable, σ, which is defined as the square root of the variance.

Example 8.2.3. Compute the variance and the standard deviation of X and f for
Example 8.2.1 and Example 8.2.2.
Solution. Var[X] = (0 − 1)² · 1/4 + (1 − 1)² · 1/2 + (2 − 1)² · 1/4 = 1/2, and σ = 1/√2 ≈ 0.71.
Var[f(X)] = (−100 + 25)² · 3/4 + (200 + 25)² · 1/4 = 16875, and σ ≈ 129.9.
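
A quick Monte Carlo sanity check of these numbers (a sketch using only Julia's standard
library; the sample size is an arbitrary choice):

    using Random, Statistics

    Random.seed!(2)
    n = 10^6
    X = [(rand() < 0.5) + (rand() < 0.5) for _ in 1:n]   # heads in two fair tosses
    f = [x == 2 ? 200.0 : -100.0 for x in X]             # earnings from Example 8.2.2

    println(mean(X), "  ", var(X))   # ≈ 1 and ≈ 1/2
    println(mean(f), "  ", std(f))   # ≈ -25 and ≈ 129.9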

Example 8.2.4. The Cauchy distribution plays an important role in physics, since it
describes resonance behavior, e.g. the shape of the spectral line of a laser. The probability
density function of a Cauchy distribution with parameters a ∈ R and γ > 0 is given by

p(x|a, γ) = \frac{1}{π} \frac{γ}{(x − a)² + γ²},   −∞ < x < +∞.   (8.15)

Show that the probability distribution is properly normalized and find its mean. What can
you say about its variance?
Solution. To verify that equation (8.15) is properly normalized, we must show that the prob-
ability density integrates to unity, which can be done with the trig-substitution, tan(θ) =
(x − a)/γ.

To compute the mean of the Cauchy distribution, we evaluate


mean = \frac{γ}{π} \int_{−∞}^{+∞} \frac{x dx}{(x − a)² + γ²} = a,

which is calculated using a principal value integral from residue calculus. To compute
the second moment, we attempt to evaluate the integral
variance = \frac{γ}{π} \int_{−∞}^{+∞} \frac{(x − a)² dx}{(x − a)² + γ²},

and find that it is unbounded. Since this integral is unbounded, we conclude that the
variance of a Cauchy distribution does not exist (it is infinite).
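
The missing variance can also be observed numerically: the sample variance of Cauchy draws
never settles as the sample grows. A minimal sketch, with samples generated by inverting the
Cauchy CDF (the parameter values are arbitrary):

    using Random, Statistics

    Random.seed!(3)
    a, γ = 0.0, 1.0
    cauchy() = a + γ * tan(π * (rand() - 0.5))   # inverse-CDF sampling of Cauchy(a, γ)

    for n in (10^3, 10^4, 10^5, 10^6)
        s = [cauchy() for _ in 1:n]
        println(n, "  sample variance = ", var(s))   # keeps jumping instead of converging
    end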

8.2.2 Higher Moments

The concepts of expectation and variance can be generalized to the moments of a distribution.
For a discrete random variable with probability distribution P(x), the moments of P(x) are
defined as follows:

k = 0, 1, · · · :   µ_k := E_P[X^k] = ⟨X^k⟩_P = \sum_{x∈Σ} x^k P(x).   (8.16)

For a continuous random variable X with probability density p(x) = p_X(x), the moments
of X are:

k = 0, 1, · · · :   µ_k := E_p[X^k] = ⟨X^k⟩_p = \int_Σ dx x^k p(x).   (8.17)
From the definitions in equations (8.16) and (8.14), it follows that the first moment of a
random variable is equivalent to its mean, E[X] = µ_1, and the second moment is related to
the variance according to Var[X] = µ_2 − µ_1².

Example 8.2.5. Give a closed-form expression for the moments of the Bernoulli distribution
with parameter β. Use the first and second moments to find the mean and variance of the
Bernoulli distribution.

p(x) = βδ(1 − x) + (1 − β)δ(x).   (8.18)

Solution.
k = 1, 2, · · · :   µ_k = ⟨X^k⟩ = \int_{−∞}^{∞} x^k p(x) dx = β.   (8.19)
The mean of a distribution is equal to its first moment: µ = µ_1. In this case µ = β. The
variance of a distribution is given by the combination of its first two moments,
σ² = µ_2 − µ_1². In this case the variance is σ² = β − β² = β(1 − β).

Example 8.2.6. What is the mean number of events in the Poisson process Pois(λ)?
What is the variance of the Poisson distribution?
Solution. For this example, we will compute the mean and the variance of the Poisson
distribution by first computing its first and second moments:

µ_1 = \sum_{k=0}^{∞} k P(k) = \sum_{k=0}^{∞} \frac{k λ^k}{k!} e^{−λ} = e^{−λ} \sum_{k=1}^{∞} \frac{λ^k}{(k−1)!} = λ e^{−λ} \sum_{n=0}^{∞} \frac{λ^n}{n!} = λ.

The second moment is

µ_2 = \sum_{k=0}^{∞} k² P(k) = \sum_{k=0}^{∞} \frac{k² λ^k}{k!} e^{−λ} = e^{−λ} \sum_{k=1}^{∞} \frac{k λ^k}{(k−1)!} = λ e^{−λ} \sum_{n=0}^{∞} \frac{(n+1) λ^n}{n!} = λ(λ + 1).

Therefore, the mean (average) and the variance of the number of events is

µ := µ_1 = λ   and   σ² := µ_2 − µ_1² = λ.

Note that the expectation and the variance of a Poisson distribution are both equal to the
same value, λ. This is not generally true for other distributions.
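
A numerical cross-check of µ_1 = λ and σ² = λ (a sketch; the sampler below is Knuth's
classical method, adequate for moderate λ, and is an implementation choice rather than
anything prescribed in these notes):

    using Random, Statistics

    Random.seed!(4)

    # Knuth's method: multiply uniforms until the product drops below e^{-λ}
    function rand_poisson(λ)
        L, k, p = exp(-λ), 0, 1.0
        while true
            p *= rand()
            p < L && return k
            k += 1
        end
    end

    λ = 3.0
    s = [rand_poisson(λ) for _ in 1:10^6]
    println(mean(s), "  ", var(s))   # both ≈ λ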

8.2.3 Moment Generating Functions.

The moment generating function of a random variable is defined as


M_X(t) = E[exp(tX)] = \int_{−∞}^{∞} dx p(x) exp(tx).   (8.20)

where t ∈ R and all integrals are assumed well defined. When exp(tx) is expressed by its
Taylor series, we find that the moment generating function can be expressed as an infinite
sum involving the moments, µk , of the random variable.
M_X(t) = \int_{−∞}^{∞} dx p(x) exp(tx) = \int_{−∞}^{∞} dx p(x) \sum_{k=0}^{∞} \frac{(tx)^k}{k!} = \sum_{k=0}^{∞} \frac{µ_k t^k}{k!}.   (8.21)

The name ‘moment generating function’ arises from the observation that differentiating
M_X(t) k times and evaluating the result at t = 0 recovers the k-th moment of X:

\frac{d^k}{dt^k} M_X(t) \Big|_{t=0} = \frac{d^k}{dt^k} \Big( \sum_{m=0}^{∞} \frac{µ_m t^m}{m!} \Big) \Big|_{t=0} = µ_k.   (8.22)

Example 8.2.7. Consider the standard example of the Boltzmann distribution from
statistical mechanics, where the probability density, p(x), of a random state (variable) X is

p(x) = \frac{1}{Z} e^{−βE(x)},   Z(β) = \sum_x e^{−βE(x)},   (8.23)

where β = 1/T is the inverse temperature and E(x) is a known function of x, called energy
of the state x. The normalization factor Z is called the partition function. Suppose we
know the partition function, Z(β) as a function of the inverse temperature, β. Compute
the expected mean value and the variance of the energy.
Solution. The mean (average) value of the energy is
⟨E(X)⟩ = \sum_x p(x) E(x) = \frac{1}{Z} \sum_x E(x) e^{−βE(x)} = −\frac{1}{Z} \frac{∂Z}{∂β} = −\frac{∂ \ln Z}{∂β}.   (8.24)

The variance of the energy (energy fluctuations) is

Var[E(X)] = ⟨(E(X) − ⟨E(X)⟩)²⟩ = \frac{∂² \ln Z}{∂β²}.   (8.25)

Notice that, up to a sign inversion of its argument, the partition function is equivalent to
the moment generating function (8.20): Z(β) = M_{E(X)}(−β).

8.2.4 Characteristic Functions

The characteristic function of a random variable is defined as the Fourier transform of its
probability density:

G(t) := E_p[exp(itX)] = \int_{−∞}^{+∞} dx p(x) exp(itx),   (8.26)

where i² = −1. The characteristic function exists for any real t and it obeys the following
relations:
G(0) = 1,   |G(t)| ≤ 1.   (8.27)

The characteristic function contains information about all the moments µ_k. Moreover, it
admits a Taylor series representation in terms of the moments,

G(t) = \sum_{k=0}^{∞} \frac{(it)^k}{k!} ⟨X^k⟩,   (8.28)

and thus

⟨X^k⟩ = \frac{1}{i^k} \frac{∂^k}{∂t^k} G(t) \Big|_{t=0}.   (8.29)
Derivatives of G(t) at t = 0 exist up to the same order k as the moments µ_k exist.

Example 8.2.8. Find the characteristic function of a Bernoulli distribution with parameter
β.

Solution. Substituting Eq. (8.18) into Eq. (8.26) one derives

G(t) = 1 − β + βe^{it},   (8.30)

and thus

µ_k = \frac{1}{i^k} \frac{∂^k}{∂t^k} \big(1 − β + βe^{it}\big) \Big|_{t=0} = β.   (8.31)
The result is naturally consistent with Eq. (8.19).

Exercise 8.1. The exponential distribution has a probability density function given by

p(x) = { Ae^{−λx}, x ≥ 0;   0, x < 0, }   (8.32)

where the parameter λ > 0. Calculate

(a) The normalization constant A of the distribution.

(b) The mean and the variance of the probability distribution.

(c) The characteristic function G(t) of the exponential distribution.

(d) The k th moment of the distribution (utilizing G(t)).

8.2.5 Cumulants

The cumulants κ_k of a random variable X are defined via the characteristic function as
follows:

\ln G(t) = \sum_{k=1}^{∞} \frac{(it)^k}{k!} κ_k.   (8.33)

According to Eq. (8.27), G(0) = 1, so the Taylor series in Eq. (8.33) starts from k = 1.
Utilizing Eqs. (8.28) and (8.33), one derives the following relations between the cumulants
and the moments:

κ_1 = µ_1,   (8.34)
κ_2 = µ_2 − µ_1² = σ².   (8.35)

The procedure naturally extends to higher order moments and cumulants.


Notice that the moments determine the cumulants, in the sense that if all the moments of
two probability distributions are identical then all the cumulants are identical as well;
similarly, the cumulants determine the moments. In some cases theoretical treatments of
problems in terms of cumulants are simpler than those using moments.

Example 8.2.9. Find the characteristic function and the cumulants of the Poisson distri-
bution (8.6).
Solution. The respective characteristic function is

G(t) = e^{−λ} \sum_{k=0}^{∞} e^{itk} \frac{λ^k}{k!} = e^{−λ} \sum_{k=0}^{∞} \frac{(λe^{it})^k}{k!} = \exp\big(λ(e^{it} − 1)\big),   (8.36)

and then
ln G(t) = λ(eit − 1). (8.37)

Next, using the definition (8.33), one finds that κk = λ, k = 1, 2, . . . .

Example 8.2.10. Birthday Problem Assume that a year has 366 days. What is the
probability, pm , that m people in a room all have different birthdays?
Solution. Let (b_1, b_2, . . . , b_m) be the list of the people's birthdays, b_i ∈ {1, 2, . . . , 366}. There
are 366^m different lists, all equiprobable. We should count the lists with b_i ≠ b_j, ∀i ≠ j.
The number of such lists is \prod_{i=1}^{m}(366 − i + 1). Then, the final answer is

p_m = \prod_{i=1}^{m} \Big( 1 − \frac{i − 1}{366} \Big).   (8.38)

The probability that at least 2 people in the room share a birthday is 1 − p_m. Note that
1 − p_{23} > 0.5 while 1 − p_{22} < 0.5.
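
The product (8.38) is trivial to evaluate; a one-line sketch in Julia:

    # p(m): probability that m people all have distinct birthdays in a 366-day year, Eq. (8.38)
    p(m) = prod(1 - (i - 1) / 366 for i in 1:m)

    println(1 - p(22))   # < 0.5
    println(1 - p(23))   # > 0.5: with 23 people a shared birthday is more likely than not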

Exercise. (not graded) Choose, at random, three points on the circle of unit radius. In-
terpret them as cuts that divide the circle into three arcs. Compute the expected length of
the arc that contains the point (1, 0).

8.3 Probabilistic Inequalities.


Here are some useful probabilistic inequalities.

• (Markov inequality) For any non-negative random variable X and any a > 0,

P(X ≥ a) ≤ \frac{E[X]}{a}.   (8.39)

• (Chebyshev's inequality)

P(|X − µ| ≥ b) ≤ \frac{σ²}{b²},   (8.40)

where µ and σ² are the mean and the variance of X.

• (Chernoff bound)

P(X ≥ a) = P(e^{tX} ≥ e^{ta}) ≤ \frac{E[e^{tX}]}{e^{ta}},   (8.41)

where X ∈ R and t ≥ 0.

Example 8.3.1. Prove Markov, Chebyshev and Chernoff inequalities/bounds.


Solution. We prove the Markov inequality (8.39) in two steps. First, let us introduce the
indicator function 1(y), which returns unity if y ≥ 0 and zero otherwise, and observe
that the left hand side of Eq. (8.39) can be restated as E[1(X − a)]. Second, notice that
∀X ≥ 0: 1(X − a) ≤ X/a. Taking the expectation of (averaging) this inequality, we arrive
at the desired result (8.39).
Notice that 1(X − a) = X/a only if X = 0 or X = a. Therefore, the Markov inequality
becomes an equality iff P(X ∈ {0, a}) = 1.
The Chebyshev inequality (8.40) and the Chernoff bound (8.41) are corollaries of the
Markov inequality (8.39).
Indeed, to prove the Chebyshev inequality (8.40) we consider an unbounded random variable
X ∈ R and apply the Markov inequality (8.39) to the auxiliary random variable
Y = (X − E[X])² = (X − µ)², which is non-negative by construction. Then, substituting
a by b², and observing that E[Y] = σ² and P(Y ≥ b²) = P(|X − µ| ≥ b), we arrive at the
Chebyshev inequality (8.40).
Similarly, to prove the Chernoff bound (8.41) we consider Z = exp(tX), where X is an
unbounded random variable, X ∈ R, and t ≥ 0. Z is positive by construction. Therefore,
observing that P(X ≥ a) = P(e^{tX} ≥ e^{ta}) and applying the Markov inequality (8.39)
to Z, with a substituted by e^{ta}, we arrive at the Chernoff bound (8.41).
Notice that the Chernoff bound (8.41) can be viewed as the Markov inequality applied
to the moment generating function.
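
As a minimal numerical illustration of the Markov and Chebyshev inequalities (a sketch
using uniform samples on [0, 1], for which µ = 1/2 and σ² = 1/12; the thresholds a and b
are arbitrary):

    using Random, Statistics

    Random.seed!(5)
    X = rand(10^6)        # uniform on [0, 1]: E[X] = 1/2, Var[X] = 1/12
    a, b = 0.8, 0.4

    println(mean(X .>= a), " <= ", 0.5 / a)                    # Markov:    0.2 <= 0.625
    println(mean(abs.(X .- 0.5) .>= b), " <= ", (1/12) / b^2)  # Chebyshev: 0.2 <= 0.52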

We will get back to some other useful probabilistic inequalities in the lecture devoted to
entropy and to comparing probabilities.

8.4 Random Variables: from one to many.


The transition from one to many random variables is natural. In this Section we review the
notions which are instrumental for this transition. It is also useful to note that we have
already touched on multivariate probability distributions in the preceding Section, when we
constructed more complex (but still single-variate) probability distributions from simpler
ones. In particular, we have generated a sequence of independent

H H    H T    T H    T T

P(H, H) = 1/4   P(H, T) = 1/4   P(T, H) = 1/4   P(T, T) = 1/4

Figure 8.7: The four equiprobable outcomes of two successive fair coin tosses.

random variables X_1, X_2, · · · and then created a new variable by taking a function, e.g. a
sum, of the original random variables. The assumed independence was useful for reaching
the goal of designing a new random variable (e.g. for transitioning from a Bernoulli random
variable to a Poisson random variable); however, not all random variables are independent.
In the coming Section we will learn how to describe dependencies, or correlations (the
opposite of independence), in high-dimensional statistics.

8.4.1 Multivariate Distributions. Marginalization. Conditional Probability.

Consider an n-component random vector, X, and let P(x) be the probability that the
state x is observed, where \sum_x P(x) = 1. (Recall: x (lower-case) represents a particular
realization of the random variable X (upper-case). So if each component X_i takes values
in Σ = {0, 1}, then X is a random vector of length n with entries zero or one, and x
might be (1, 1, 0, . . . , 1).) We wish to ask two related questions about P:

1. Marginalization: The probability of observing a state where one or more of the com-
ponents attain certain values.

2. Conditioning: The probability of observing a state given that the value(s) attained
by one or more component is known.

Example 8.4.1. Let Xi be the random variable for the number of heads observed after
flipping a fair coin on the ith toss. So Σ = {0, 1}, and P (Xi = 0) = 1/2 and P (Xi = 1) =
1/2. Let X = (X1 , X2 ) be the random vector showing the outcome of two successive coin
flips. So the probability of each possible outcome is

P (X = (0, 0)) = 1/4, P (X = (1, 0)) = 1/4, P (X = (0, 1)) = 1/4, P (X = (1, 1)) = 1/4,

See Fig. (8.7) for illustration. The following questions are examples of marginalization and
conditioning:

1. Marginalization: What is the probability of observing a state where the outcome of


the first toss was a “1”?

2. Conditioning: It is known that the outcome of the first toss is a “1”, what is the set
of possible outcomes and their respective probabilities?

Solution. 1. From the list of all possible outcomes, we see that there are two outcomes
with a “1” in the first entry (namely, (1, 0) and (1, 1)) each with probability 1/4.

P (X1 = 1) = P (X = (1, 0)) + P (X = (1, 1)) = 1/4 + 1/4 = 1/2.

2. Under the condition that the outcome of the first toss is “1”, all other outcomes must
have zero probability. The only two possible outcomes are (1, 0) and (1, 1) which each
occur with equal probability. Therefore,

P (X2 = 0 | X1 = 1) = 1/2, P (X2 = 1 | X1 = 1) = 1/2.

Example 8.4.1 is relatively straightforward because X_1 and X_2 are independent. The
next example will be more interesting.
Consider the statistical version of the Ising model, which was discussed in Section 7.5 in
the context of discrete optimization. (We used it back then to illustrate the application
of Dynamic Programming to combinatorial optimization.) Historically, the Ising model
was first used in physics to describe the polarization of a magnet at different temperatures.
In the present context, it might be used to model predictions for outbreaks of an epidemic
in different neighborhoods.
We introduce the following probability distribution over the space Σ of cardinality 2^n:

x = (x_i = ±1 | i = 1, · · · , n):   P(x) = Z^{−1} \exp\Big( J \sum_{i=1}^{n−1} x_i x_{i+1} \Big),   (8.42)

Z = \sum_x \exp\Big( J \sum_{i=1}^{n−1} x_i x_{i+1} \Big),   (8.43)

where J is a constant that determines the coupling strength between adjacent components
and Z is the normalization constant (See figure 8.8.). The normalization constant, also
called the partition function, is introduced to guarantee that the sum of probabilities over
all the states is unity. (See also Example 8.2.7.)
For n = 2 one gets the example of a bi-variate probability distribution

P(x) = P(x_1, x_2) = \frac{\exp(J x_1 x_2)}{4 \cosh(J)}.   (8.44)
P (x) is called a joint or multivariate probability distribution function of x = (x1 , · · · , xn ),
because it shows probability of all the components, x1 , · · · , xn , together.

+1 −1 −1 +1 +1 −1 +1 −1

−1 −1 +1 +1 +1 −1 +1 +1

Figure 8.8: Two possible realizations of the n-component Ising model on a line.

+1 +1 +1 −1 −1 +1 −1 −1

P (+1, +1) ∝ exp(+J) P (+1, −1) ∝ exp(−J) P (−1, +1) ∝ exp(−J) P (−1, −1) ∝ exp(+J)

Figure 8.9: Set of all possible outcomes for the 2-component Ising model and their relative
probabilities. The normalization constant, Z is found by summing the probabilities over all
states.

It is useful to consider the conditional probability distribution. For the example above
with n = 2,

P(x_1|x_2) = \frac{P(x_1, x_2)}{\sum_{x_1} P(x_1, x_2)} = \frac{\exp(J x_1 x_2)}{2 \cosh(J x_2)}   (8.45)

is the probability of observing x_1 under the condition that x_2 is known. Notice that
\sum_{x_1} P(x_1|x_2) = 1, ∀x_2.
We can marginalize the multivariate (joint) distribution over a subset of variables. For
example,
P(x_1) = \sum_{x \setminus x_1} P(x) = \sum_{x_2, · · · , x_n} P(x_1, · · · , x_n).   (8.46)

Multivariate Gaussian (Normal) distribution


Now let us consider n zero-mean random variables X_1, X_2, . . . , X_n jointly sampled from
a generic (multivariate) Gaussian distribution

p(x_1, . . . , x_n) = \frac{1}{Z} \exp\Big( −\frac{1}{2} \sum_{i,j=1,··· ,n} x_i A_{ij} x_j \Big),   (8.47)

where A is a symmetric (A = Aᵀ), positive definite (A ≻ 0) matrix. If the matrix is
diagonal, then the probability distribution (8.47) decomposes into a product of terms,
each dependent on one of the variables. This is the special case when each of the random

variables, X_1, · · · , X_n, is statistically independent of the others. Z in Eq. (8.47) is the
normalization factor (recall that, consistently with the examples above, we call it the
partition function), which is

Z = \frac{(2π)^{n/2}}{\sqrt{\det A}}.   (8.48)

The first two moments of the Gaussian distribution (now allowing a general mean vector µ,
obtained by the shift x → x − µ in Eq. (8.47)) are

∀i:   E[X_i] = µ_i;   ∀i, j:   E[(X_i − µ_i)(X_j − µ_j)] = (A^{−1})_{ij} := Σ_{ij},   (8.49)

where (A^{−1})_{ij} = Σ_{ij} denotes the (i, j) component of the inverse of the matrix A. The
matrix Σ (which is also symmetric and positive definite, as its inverse is by construction) is
called the covariance matrix. Standard notation for multivariate Gaussian statistics with
mean vector µ = (µ_i | i = 1, · · · , n) and covariance matrix Σ is N(µ, Σ) or N_n(µ, Σ).
The Gaussian distribution is remarkable because of its “invariance” properties.

Theorem 8.4.2 (Invariance of the Normal/Gaussian distribution under conditioning and
marginalization). Consider X ∼ N_n(µ, Σ) and split the n-dimensional random vector into
two components, X = (X_1, X_2), where X_1 is a p-component sub-vector of X and X_2 is a
q-component sub-vector of X, p + q = n. Assume also that the mean vector, µ, and the
covariance matrix, Σ, are split into components as follows:

µ = (µ_1, µ_2);   Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix},   (8.50)

where µ_1 and µ_2 are p- and q-dimensional vectors and Σ_{11}, Σ_{12}, Σ_{21} and Σ_{22} are (p × p),
(p × q), (q × p) and (q × q) matrices. Then, the following two statements hold:
• Marginalization: p(x_1) := \int dx_2 p(x_1, x_2) is the Normal/Gaussian distribution
N(µ_1, Σ_{11}).

• Conditioning: p(x_1|x_2) := p(x_1, x_2)/p(x_2) is the Normal/Gaussian distribution
N(µ_{1|2}, Σ_{1|2}), where

µ_{1|2} := µ_1 + Σ_{12} Σ_{22}^{−1} (x_2 − µ_2),   Σ_{1|2} := Σ_{11} − Σ_{12} Σ_{22}^{−1} Σ_{21}.   (8.51)

Proof of the theorem is recommended as a useful technical exercise (not graded) which
requires direct use of some basic linear algebra. (You will need to use or derive an explicit
formula for the inverse of a positive definite matrix split into four blocks, as in Eq. (8.50).)
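
Short of the full proof, the conditional-mean part of Eq. (8.51) is easy to check numerically.
A sketch for a 2-dimensional example (the particular µ, Σ, sample size and conditioning
tolerance below are arbitrary choices):

    using Random, LinearAlgebra, Statistics

    Random.seed!(6)
    μ = [1.0, -1.0]
    Σ = [2.0 0.8; 0.8 1.0]
    L = cholesky(Σ).L
    X = μ .+ L * randn(2, 10^6)      # columns are i.i.d. samples of X ~ N(μ, Σ)

    x2 = 0.0                         # condition on X2 ≈ x2
    sel = abs.(X[2, :] .- x2) .< 0.02
    μ12 = μ[1] + Σ[1, 2] / Σ[2, 2] * (x2 - μ[2])   # µ_{1|2} from Eq. (8.51)
    println(mean(X[1, sel]), "  vs  ", μ12)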

8.4.2 Central Limit Theorem

Take n random instances, also called samples, X_1, · · · , X_n, generated i.i.d. from a distribu-
tion with mean µ and variance σ² > 0, and compute Y_n = \sum_{i=1}^{n} X_i / n. What is
Prob(Y_n)?

Theorem 8.4.3 (Weak Version of the Central Limit Theorem). \sqrt{n}(Y_n − µ) converges in
distribution to a Gaussian with mean 0 and variance σ², i.e.

n → ∞:   \sqrt{n} \Big( \frac{1}{n} \sum_{i=1}^{n} X_i − µ \Big) ∼ N(0, σ²).   (8.52)

Let us sketch the proof of the weak CLT (8.52) in the simple case µ = 0, σ = 1. Obviously,
µ_1(Y_n\sqrt{n}) = 0. Compute

µ_2(Y_n\sqrt{n}) = E\Big[ \Big( \frac{X_1 + · · · + X_n}{\sqrt{n}} \Big)² \Big] = \frac{\sum_i E[X_i²]}{n} + \frac{\sum_{i≠j} E[X_i X_j]}{n} = 1.

Now the third moment:

µ_3(Y_n\sqrt{n}) = E\Big[ \Big( \frac{X_1 + · · · + X_n}{\sqrt{n}} \Big)³ \Big] = \frac{\sum_i E[X_i³]}{n^{3/2}} → 0,

as n → ∞, assuming E[X_i³] = O(1). Can you guess what happens with the fourth
moment? It is µ_4(Y_n\sqrt{n}) → 3 = 3 µ_2(Y_n\sqrt{n})².

Example 8.4.4 (Sum of Gaussian variables). Compute exactly the probability density,
p_n(y_n), of the random variable Y_n = n^{−1} \sum_{i=1}^{n} X_i, where X_1, X_2, . . . , X_n are
sampled i.i.d. from the normal distribution

p(x) = N(µ, σ²) = \frac{1}{\sqrt{2π} σ} \exp\Big( −\frac{(x − µ)²}{2σ²} \Big).
Solution. First, recall that the characteristic function of a Gaussian distribution is a Gaus-
sian:

G(t) = \int_R e^{itx} p(x) dx = \exp\Big( iµt − \frac{σ²t²}{2} \Big).

Let us now evaluate the characteristic function for p_n(y_n):

G_n(t) = \int_{R^n} dx_1 · · · dx_n \exp\Big( i \frac{t}{n} \sum_{i=1}^{n} x_i \Big) p(x_1) · · · p(x_n) = (G(t/n))^n = \exp\Big( iµt − \frac{σ²t²}{2n} \Big).

The inverse Fourier transform of G_n(t) results in

p_n(y_n) = \int_{−∞}^{+∞} \frac{dt}{2π} G_n(t) e^{−ity_n} = \int_{−∞}^{+∞} \frac{dt}{2π} \exp\Big( −it(y_n − µ) − \frac{σ²t²}{2n} \Big) = \frac{\sqrt{n}}{\sqrt{2π} σ} \exp\Big( −\frac{n(y_n − µ)²}{2σ²} \Big).
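
For distributions without such a closed form, the weak CLT is easy to observe numerically.
A sketch with uniform samples (any i.i.d. distribution with finite variance would do; the
sizes below are arbitrary):

    using Random, Statistics

    Random.seed!(7)
    n, trials = 100, 10^5
    μ, σ = 0.5, sqrt(1/12)                       # mean and std of Uniform(0, 1)
    Z = [(mean(rand(n)) - μ) * sqrt(n) / σ for _ in 1:trials]   # standardized sums

    println(mean(Z), "  ", var(Z))   # ≈ 0 and ≈ 1
    println(mean(Z .<= 1.0))         # ≈ 0.841, the standard normal CDF at 1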
Example 8.4.5 (Failure of the central limit theorem). Calculate the probability density
of the random variable Y_n = n^{−1} \sum_{i=1}^{n} X_i, where X_1, X_2, . . . , X_n are indepen-
dently chosen from the Cauchy distribution with the following probability density

p(x) = \frac{γ}{π} \frac{1}{x² + γ²},   (8.53)
and show that the CLT does not hold in this case. Explain why.
Solution. The characteristic function of the Cauchy distribution is

G(k) = \int_{−∞}^{+∞} e^{ikx} \frac{γ}{π} \frac{dx}{x² + γ²} = e^{−γ|k|}.   (8.54)

The resulting characteristic function of Y_n is

G_n(k) = (G(k/n))^n = G(k).   (8.55)

This expression shows that for any n the random variable Y_n is Cauchy-distributed with
exactly the same width parameter as the individual samples. The CLT “fails” in this
case because we have ignored an important requirement/condition for the CLT to hold:
the existence of the variance. (See Example 8.2.4.)

Exercise 8.2. Assume that you play a dice game 100 times. Awards for the game are as
follows: $0.00 for 1, 3 or 5, $2.00 for 2 or 4 and $26.00 for 6.

(1) What is the expected value of your winnings?

(2) What is the standard deviation of your winnings?

(3) Estimate the probability that you win at least $400.

Exercise (not graded). Experiment with the CLT for different distributions mentioned in
the lecture.

The CLT also holds for independent but not necessarily identically distributed variables.
(That is, one can use different distributions to generate the different variables in the summed
sequence.)
We may be interested not only in deviations on the order of the standard deviation, but
also in arbitrary deviations.

Theorem 8.4.6 (Cramér theorem (strong version of the CLT)). The normalized sum,
Y_n = \sum_{i=1}^{n} X_i / n, of the i.i.d. variables X_i ∼ p_X(x) satisfies

∀x > µ:   \lim_{n→∞} \frac{1}{n} \log \text{Prob}(Y_n ≥ x) = −Φ^*(x),   (8.56)

Φ^*(x) := \sup_{t∈R} (tx − Φ(t)),   (8.57)

Φ(t) := \log(E[\exp(tX)]),   (8.58)

where, Φ(t), is the cumulant generating function of pX (x) and Φ∗ (x) is the Legendre-Fenchel
transform of the cumulant generating function, also called the Cramér function.

Three comments are in order. First, an informal (“physical”) version of Eq. (8.56) is

n → ∞:   \text{Prob}(Y_n ≈ x) ∝ \exp(−nΦ^*(x)).   (8.59)

Second, the cumulant generating function Φ(t) is equal to the logarithm of the characteristic
function (8.26) evaluated at minus an imaginary argument, i.e., Φ(t) = \log G(−it). Third,
the weak version of the CLT (8.52) is equivalent to the (asymptotically exact) quadratic
approximation of the Cramér function around its minimum.

Exercise (not graded). Prove the strong CLT (8.56, 8.57). [Hint: use the saddle-point/sta-
tionary-point method to evaluate the integrals.] Give an example of an expectation for
which not only the vicinity of the minimum but also other details of Φ^*(x) are significant
at n → ∞. More specifically, give an example of an object whose behavior is controlled
solely by the left/right tail of Φ^*(x), or by Φ^*(0) and its vicinity.

Example 8.4.7. Compute the Cramér function for the Bernoulli process, i.e. a (generally
unfair) coin toss:

X = { 0, with probability 1 − β;   1, with probability β. }   (8.60)

Solution.

Φ(t) = \log(βe^t + 1 − β),   (8.61)

0 < x < 1:   Φ^*(x) = x \log\frac{x}{β} + (1 − x) \log\frac{1 − x}{1 − β}.   (8.62)

Eqs. (8.61, 8.62) are noticeable for two reasons. First of all, they lead (after some alge-
braic manipulations) to the famous Stirling formula for the asymptotic of the factorial,

n! = \sqrt{2πn} \Big( \frac{n}{e} \Big)^n (1 + O(1/n)).

(Do you see how?) Second, the x log x structure is an “entropy”, which will appear a number
of times in the following lectures; stay tuned.
The Cramér theorem (8.4.6) gives a powerful asymptotic result; however, it says nothing
about the rate of convergence, i.e., the behavior at large but finite n. Quite remarkably, this
deficiency can be cured by the following application of the Chernoff bound (8.41).

Theorem 8.4.8 (Chernoff Bound Version of the Central Limit Theorem, adapted from
[17]). Let X_1, · · · , X_n be i.i.d. random variables with E[X_i] = µ and a well-defined (bounded)
Cramér function Φ^*(x) = \sup_t (tx − \log(E[\exp(tX_i)])). Then,

P\Big( \sum_{i=1}^{n} X_i ≥ nx \Big) ≤ \exp(−nΦ^*(x)),   ∀x > µ,

P\Big( \sum_{i=1}^{n} X_i ≤ nx \Big) ≤ \exp(−nΦ^*(x)),   ∀x < µ.
i=1
Proof. Consider the case of x > µ:

P\Big( \sum_{i=1}^{n} X_i ≥ nx \Big) = P\big( e^{t \sum_{i=1}^{n} X_i} ≥ e^{tnx} \big)   for any t > 0 (to be chosen later)

≤ e^{−tnx} E\big[ e^{t \sum_{i=1}^{n} X_i} \big]   by Markov's inequality (its Chernoff-bound version)

= e^{−tnx} \prod_{i=1}^{n} E[e^{tX_i}] = e^{−tnx + nΦ(t)}   since the X_i are i.i.d.

≤ \exp\Big( −n \sup_{t>0} (tx − Φ(t)) \Big) = \exp\Big( −n \sup_{t∈R} (tx − Φ(t)) \Big) = \exp(−nΦ^*(x)).

Here, Φ(t) = \log(E[\exp(tX_i)]) is convex in t; Φ^*(x) is convex in x and achieves its
minimum at x = µ. Therefore, for x > µ the sup in Φ^*(x) = \sup_{t∈R}(tx − Φ(t)) is achieved
at t > 0 (positive slope), and we can thus replace \sup_{t∈R} by \sup_{t>0}. This completes the
proof for x > µ.
In the case of x < µ one needs to pick t < 0; otherwise the proof is fully equivalent.

Exercise 8.3. Let X = \sum_{i=1}^{n} X_i, where the X_i are independent (not necessarily identically
distributed) Poisson random variables. That is, each X_i is independently drawn from a
Poisson distribution with parameter λ_i, i.e. X_i ∼ Pois(λ_i). Denote the characteristic
function of X by G_X(t) and the characteristic function of X_i by G_{X_i}(t). Show that

(1) G_X(t) = \prod_{i=1}^{n} G_{X_i}(t);

(2) X ∼ Pois(λ), where λ = \sum_{i=1}^{n} λ_i.

8.4.3 Bayes Theorem

We already saw how to get conditional probability distribution and marginal probability
distribution from the joint probability distribution
P(x|y) = \frac{P(x, y)}{P(y)},   P(y|x) = \frac{P(x, y)}{P(x)}.   (8.63)
Combining the two formulas to exclude the joint probability distribution we arrive at the
famous Bayes formula

P (x|y)P (y) = P (y|x)P (x). (8.64)

Here, in Eqs. (8.63, 8.64), both x and y may be multivariate. Rewriting Eq. (8.64) as

P(x|y) = \frac{P(y|x) P(x)}{P(y)},   (8.65)
one often refers (in the field of so-called Bayesian inference/reconstruction) to P(x) as
the “prior” probability distribution, which measures the degree of the initial “belief” in X.
Then P(x|y), called the “posterior”, measures the degree of the (statistical) dependence of
x on y, and the quotient P(y|x)/P(y) represents the “support/knowledge” y provides
about x.
A good visual illustration of the notion of the conditional probability can be found at
http://setosa.io/ev/conditional-probability/.

Example 8.4.9. Consider the three component Ising model. (a) Compute the normaliza-
tion constant. (b) Compute the marginal probability, P (x1 ). (c) Compute the conditional
probability, P (x3 |x1 ).
Solution. The set of all possible outcomes is shown in Figure 8.10. (a) The normalization
constant, Z, is found by summing the probabilities over all the states:

Z = e^{2J} + e^{0} + e^{0} + e^{−2J} + e^{2J} + e^{0} + e^{0} + e^{−2J} = 4 + 4 \cosh(2J).

Therefore, the probabilities of the states are P(1, 1, 1) = e^{2J}/(4 + 4 \cosh(2J)), etc.
(b) The marginal probability, P (x1 ), is given by summing all the probabilities corresponding
to P (x1 = +1) and all the probabilities corresponding to P (x1 = −1):

P(x_1 = +1) = P(1, 1, 1) + P(1, 1, −1) + P(1, −1, −1) + P(1, −1, 1)
            = \frac{e^{2J} + e^{0} + e^{0} + e^{−2J}}{4 + 4 \cosh(2J)} = \frac{1}{2},

P(x_1 = −1) = P(−1, −1, −1) + P(−1, −1, 1) + P(−1, 1, 1) + P(−1, 1, −1)
            = \frac{e^{2J} + e^{0} + e^{0} + e^{−2J}}{4 + 4 \cosh(2J)} = \frac{1}{2}.

(+1, +1, +1): P(x) ∝ e^{+2J}   (+1, +1, −1): P(x) ∝ e^{0}   (+1, −1, −1): P(x) ∝ e^{0}   (+1, −1, +1): P(x) ∝ e^{−2J}

(−1, −1, −1): P(x) ∝ e^{+2J}   (−1, −1, +1): P(x) ∝ e^{0}   (−1, +1, +1): P(x) ∝ e^{0}   (−1, +1, −1): P(x) ∝ e^{−2J}

Figure 8.10: Set of all possible outcomes for the 3-component Ising model and their relative
probabilities. The normalization constant, Z, is found by summing the probabilities over
all states.

The conditional probability is found by

P(x_3 = +1|x_1 = +1) = \frac{P(x_3 = +1, x_1 = +1)}{P(x_1 = +1)} = \frac{Z^{−1}(e^{2J} + e^{−2J})}{Z^{−1}(e^{2J} + 1 + 1 + e^{−2J})} = \frac{\cosh(2J)}{1 + \cosh(2J)},

P(x_3 = +1|x_1 = −1) = \frac{P(x_3 = +1, x_1 = −1)}{P(x_1 = −1)} = \frac{Z^{−1}(e^{0} + e^{0})}{Z^{−1}(e^{2J} + 1 + 1 + e^{−2J})} = \frac{1}{1 + \cosh(2J)},

P(x_3 = −1|x_1 = +1) = \frac{P(x_3 = −1, x_1 = +1)}{P(x_1 = +1)} = \frac{Z^{−1}(e^{0} + e^{0})}{Z^{−1}(e^{2J} + 1 + 1 + e^{−2J})} = \frac{1}{1 + \cosh(2J)},

P(x_3 = −1|x_1 = −1) = \frac{P(x_3 = −1, x_1 = −1)}{P(x_1 = −1)} = \frac{Z^{−1}(e^{2J} + e^{−2J})}{Z^{−1}(e^{2J} + 1 + 1 + e^{−2J})} = \frac{\cosh(2J)}{1 + \cosh(2J)}.

Exercise 8.4. The joint probability density of two real random variables X1 and X2 is
∀x_1, x_2 ∈ R:   p(x_1, x_2) = \frac{1}{Z} \exp(−x_1² − x_1 x_2 − x_2²).
(1) Calculate the normalization constant Z.

(2) Calculate the marginal probability density, p(x1 ).

(3) Calculate the conditional probability density, p(x1 |x2 ).

8.5 Information-Theoretic View on Randomness


8.5.1 Entropy.

Consider a random variable X that takes outcomes x ∈ X. The goal is to develop a
systematic and meaningful way to quantify the amount of information gained when we learn

that a particular outcome actually occurred. We suppose that the information content of
an outcome x, which we denote h(x) and will also call the surprise, depends only on the
probability of the outcome.
The question becomes: how to quantify the information content? Let us start formu-
lating a list of requirements that the information content (surprise) must satisfy:

1. Deterministic outcomes provide no information. If an outcome is certain to occur,
then its information content, h(x), must be zero. That is,

h(x) = 0 if P(x) = 1.

2. Learning that an unlikely outcome has occurred provides more information than learn-
ing that a likely outcome has occurred. The information content of an outcome must
be a strictly decreasing function of its probability. That is,

h(x1 ) > h(x2 ) for P (x1 ) < P (x2 ).

3. Independent events provide additive information. If two independent events occur,
then the information content of the pair of outcomes must be the sum of the infor-
mation content of each individual outcome. That is,

h(x, y) = h(x) + h(y) provided that P(x, y) = P(x)P(y).

With a little work, it can be shown that only one family of continuous functions satisfies
this modest list of requirements. We are forced to define the information content of an
outcome to be the negative log of the probability:

h(x) = −\log P(x).   (8.66)

The base of the logarithm or, equivalently, the multiplicative scaling constant, can be chosen
arbitrarily. The convention, standard in information theory, is to use unit scaling and
log base 2, i.e. log → log_2 in Eq. (8.66).
Terminology. Standard scientific term used for the information gained by learning the out-
come x, which we also called the surprise of x, is the configurational entropy.
Consistently with all of the above, the entropy of all possible outcomes, i.e. the entropy of
a random variable X, is defined as the expectation of the configurational entropy over the
outcomes:

H(X) = −E_{P(X)}[\log P(X)] = \sum_{x∈X} P(x) h(x) = −\sum_{x∈X} P(x) \log P(x),   (8.67)

Figure 8.11: The information content h(x) of an outcome x plotted against the probability of
x. Negative-logs are the only family of functions that satisfy the requirements of h(x). The
base of the logarithm (or equivalently, the multiplicative scaling constant) can be chosen
arbitrarily.

where x is drawn from the space X. We can also say that the entropy is a measure of
uncertainty. In the case of a deterministic process, i.e. when there is only one outcome,
with probability 1, the configurational entropy equals the entropy and, according to
Eq. (8.67) (with the convention 0 log 0 = 0), both are zero.
Terminology. Yet another term associated with the entropy of a random variable, X, is the
measure of uncertainty. Following the tradition of information theory, we use the symbol H
for entropy. However, be aware that an alternative notation, S, is customary in Statistical
Mechanics/Physics.
Let us familiarize ourselves with the concept of entropy on the example of the Bernoulli(β)
process (8.60). In this case there are only two states, P(X = 1) = β and P(X = 0) = 1 − β,
and therefore

H = −β \log β − (1 − β) \log(1 − β).   (8.68)

Notice that H, considered as a function of β, is concave and has a global maximum at
β = 1/2 (Fig. 8.12). Therefore β = 1/2, corresponding to a fair coin in the process of
coin flipping, is the most uncertain case (maximum entropy). The entropy is zero at β = 0
and β = 1, as both of these cases are deterministic, i.e. fully certain. (See the accompanying
ijulia file.)
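
A two-line sketch of Eq. (8.68) in Julia (entropy in bits, with the convention 0 log 0 = 0):

    # entropy (in bits) of a Bernoulli(β) random variable, Eq. (8.68)
    H(β) = β == 0 || β == 1 ? 0.0 : -β * log2(β) - (1 - β) * log2(1 - β)

    for β in (0.0, 0.1, 0.3, 0.5, 0.9, 1.0)
        println(β, "  ", H(β))   # zero at the deterministic ends, maximal at β = 1/2
    end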
The expression for entropy (8.67) has the following properties (some of these can be
interpreted as alternative definitions):

• H≥0

Figure 8.12: The entropy of a Bernoulli random variable as a function of β. The entropy
is zero when β = 0 or β = 1; this is precisely when one of the outcomes is certain and the
random variable is actually deterministic. The entropy is maximized at β = 1/2; this is
precisely when the two outcomes are equiprobable.

• H = 0 iff the process is deterministic, i.e. ∃x s.t. P (x) = 1.

• H ≤ log(|X |) and H = log(|X |) iff x is distributed uniformly over the set X .

• Entropy is the measure of average uncertainty.

• Entropy is less than or equal to the average number of bits needed to describe the random
variable (the equality is achieved for the uniform distribution). [a]

• Entropy is the lower bound on the average length of the shortest description of a
random variable

Example 8.5.1. The so-called Zipf's law states that the frequency of the n-th most frequent
word in a randomly chosen English document can be approximated by

p_n = { 0.1/n, for n = 1, . . . , 12367;   0, for n > 12367. }   (8.69)

Under the assumption that English documents are generated by picking words at random
according to Eq. (8.69), compute the per-word entropy of this made-up English.
[a] Take the integers smaller than or equal to n and represent them in the binary system. We will need
log_2(n) binary variables (bits) to represent any of these integers. If all the integers are equally probable,
then log_2(n) is exactly the entropy of the distribution. If the random variable is distributed non-uniformly,
then the entropy is less than this estimate.

Solution. Substituting the distribution (8.69) into the definition of entropy one derives

H = −\sum_{n=1}^{12367} \frac{0.1}{n} \log_2 \frac{0.1}{n} ≈ \frac{0.1}{\ln 2} \int_{10}^{123670} \frac{\ln x}{x} dx = \frac{1}{20 \ln 2} \big( \ln² 123670 − \ln² 10 \big) ≈ 9.5 bits;

the discrete sum itself evaluates to ≈ 9.7 bits (see the sketch below).

It is known, from the famous work of Shannon [18], that the entropy of English per
character is fairly low, ∼ 1 bit. Therefore, the character-based entropy of a typical English
text is much smaller than its entropy per word. This result is intuitively clear: after the
first few letters one can often guess the rest of a word, but predicting the next word in
a sentence is a less trivial task.
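
The discrete sum in the solution above is also cheap to evaluate exactly (a sketch):

    # exact per-word entropy (in bits) of the Zipf distribution (8.69)
    p(n) = 0.1 / n
    H = -sum(p(n) * log2(p(n)) for n in 1:12367)
    println(H)   # ≈ 9.7 bits; the continuum approximation slightly underestimates the sum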

8.5.2 Comparing Probability Distributions: Kullback-Leibler Divergence

The concepts of information content (surprise) and of entropy provide a number of useful
tools in probability. One of the most important is a method of comparing two probability
distributions. For illustration, let X be a random variable taking values x ∈ X, and let
P_1 be the probability distribution of X, which we consider as the ground truth. Assume
that P_1 is approximated or modelled by the probability distribution P_2(x); then the
difference in the information content of x, as measured by the two probability distributions,
is

\log P_1(x) − \log P_2(x) ≡ \log \frac{P_1(x)}{P_2(x)}.
The Kullback-Leibler (KL) divergence is defined as the expectation of this difference in
information content between the ground truth and its proxy (approximation), taken with
respect to the probability distribution of the former, P_1:

D(P_1‖P_2) := \sum_{x∈X} P_1(x) \log \frac{P_1(x)}{P_2(x)}.   (8.70)

Note that the KL divergence is not symmetric, i.e. D(P_1‖P_2) ≠ D(P_2‖P_1). Moreover, it
is not a proper metric of comparison, as it does not satisfy the so-called triangle inequality.
A metric, d(a, b), is a function mapping two elements a and b from the same space to R that
satisfies (i) non-negativity, i.e. d(a, b) ≥ 0, with zero if and only if a = b; (ii) symmetry,
i.e. d(a, b) = d(b, a); and (iii) the triangle inequality, d(a, b) ≤ d(a, c) + d(c, b).
The last two conditions do not hold in the case of the KL divergence. However, an
infinitesimal version of the KL divergence, the Hessian of the KL divergence around its
minimum, also called the Fisher information, satisfies all the requirements of a metric.

Example 8.5.2. An illusionist has a biased coin that comes up heads 70% of the time.
Use the KL divergence to quantify the amount of information that would be lost if the
biased coin were modeled as a fair coin.

Solution. We regard the biased probability distribution as the ‘ground truth’, P_1, and the
fair probability distribution, P_2, as our approximation. The KL divergence between the two
is

D(P_1‖P_2) = E_{P_1}[\log(P_1/P_2)] = \sum_x P_1(x) \log\big(P_1(x)/P_2(x)\big)
           = 0.3 \log_2(0.3/0.5) + 0.7 \log_2(0.7/0.5) ≈ 0.119.

We lose approximately 0.119 bits of information by modeling the biased coin as a fair coin.
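
A minimal sketch of this computation, written so that the asymmetry of the divergence is
visible as well:

    # KL divergence in bits between two distributions given as probability vectors
    D(P1, P2) = sum(p1 * log2(p1 / p2) for (p1, p2) in zip(P1, P2) if p1 > 0)

    println(D([0.3, 0.7], [0.5, 0.5]))   # ≈ 0.119: biased coin modeled as fair
    println(D([0.5, 0.5], [0.3, 0.7]))   # a different number: D is not symmetric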

Exercise 8.5. Assume that a random variable X_2 is generated by the known probability
distribution P_2(x), where x ∈ X and X is finite. Consider the vector (P_1(x) | x ∈ X) that
satisfies P_1(x) ≥ 0 for all x ∈ X and \sum_{x∈X} P_1(x) = 1. Show that D(P_1‖P_2), as a function
of P_1(x), is non-negative and that it achieves its minimum when P_1(x) = P_2(x) ∀x ∈ X, i.e.

\arg\min_{(P_1(x)|x∈X):\ \sum_{x∈X} P_1(x) = 1,\ ∀x∈X: P_1(x) ≥ 0} D(P_1‖P_2) = (P_2(x) | x ∈ X).   (8.71)

8.5.3 Joint and Conditional Entropy

The notion of entropy naturally extends to multivariate statistics. If we have a pair of
discrete random variables, X and Y, taking values x ∈ X and y ∈ Y respectively, their joint
entropy is

H(X, Y) := −E[\log P(X, Y)] = −\sum_{x∈X, y∈Y} P(x, y) \log P(x, y).   (8.72)

One may ask whether H(X, Y) = H(X) + H(Y), that is, whether the expected
information in the entire system is equal to the sum of the expected information of X and
Y individually. To answer this question, we examine the expected amount of information in
the system beyond that which can be gained from X, that is, we examine the quantity
H(X, Y) − H(X):
H(X, Y) − H(X) = −\sum_{x∈X, y∈Y} P(x, y) \log P(x, y) + \sum_{x∈X} P(x) \log P(x)

               = −\sum_{x∈X, y∈Y} P(x, y) \log P(x, y) + \sum_{x∈X, y∈Y} P(x, y) \log P(x)

               = −\sum_{x∈X, y∈Y} P(x, y) \log \frac{P(x, y)}{P(x)}
[Bar diagram: the total entropy H(X, Y) decomposed as H(X) + H(Y|X) in the top row,
and as H(X|Y) + H(Y) in the bottom row.]

Figure 8.13: Venn diagram illustrating the relationship between entropy, joint entropy and
conditional entropy. (It is customary in information theory to use venn diagrams to illustrate
entropies, conditional entropies and mutual information. Be advised that the shapes in the
diagram do not actually represent sets of objects. See e.g. pp 141 of [19] for a detailed
discussion with examples.).

If X and Y are independent, then P(x, y) = P(x)P(y) and the result is H(Y). In general,
P(x, y)/P(x) = P(y|x) by Bayes theorem. We define the conditional entropy
H(Y|X) := H(X, Y) − H(X) and observe that it can be computed by

H(Y|X) = −E[\log P(Y|X)] = −\sum_{x∈X, y∈Y} P(x, y) \log P(y|x).   (8.73)

The definitions of the joint and conditional entropies naturally lead to the following
relation between the two

H(X, Y ) = H(X) + H(Y |X), (8.74)

called the chain rule (Fig. 8.13).


One can naturally extend the chain rule from the bi-variate to the multi-variate case,
(X_1, · · · , X_n) ∼ P(x_1, · · · , x_n), as follows:

H(X_n, · · · , X_1) = \sum_{i=1}^{n} H(X_i | X_{i−1}, · · · , X_1).   (8.75)

Notice that the choice of the order in the chain is arbitrary.



[Bar diagram: H(X, Y) decomposed as H(X|Y) + I(X; Y) + H(Y|X).]

Figure 8.14: Venn diagram explaining relations between the mutual information and re-
spective entropies for two random variables. (It is customary in information theory to use
venn diagrams to illustrate entropies, conditional entropies and mutual information. Be
advised that the shapes in the diagram do not actually represent sets of objects. See e.g.
pp 141 of [19] for a detailed discussion with examples.)

8.5.4 Independence, Dependence, and Mutual Information.

Comparing two information sources, say those tracking events x and y, one assumption,
which is rather dramatic, is that the probabilities are independent, i.e. P(x, y) = P(x)P(y),
and then P(x|y) = P(x) and P(y|x) = P(y). The mutual information, which we are about
to discuss, is zero in this case. Thus, naturally, the mutual information is introduced as
the measure of dependence:

I(X; Y) = E_{P(x,y)}\Big[ \log \frac{P(x, y)}{P(x) P(y)} \Big] = \sum_{x∈X} \sum_{y∈Y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}.   (8.76)

Intuitively the mutual information measures the information that X and Y share. In other
words, it measures how much knowing one of these random variables reduces uncertainty
about the other. For example, if X and Y are independent, then knowing X does not give
any information about Y and vice versa - the mutual information is zero. In the other
extreme, if X is a deterministic function of Y then all information conveyed by X is shared
with Y . In this case the mutual information is the same as the uncertainty contained in X
itself (or Y itself), namely the entropy of X (or Y ).
The mutual information is obviously related to respective entropies,

I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X) = H(X) + H(Y ) − H(X, Y ). (8.77)

The relation is illustrated in Fig. (8.14). Mutual Information also possesses the following
properties

I(X; Y) = I(Y; X)   (symmetry)   (8.78)

I(X; X) = H(X)   (self-information)   (8.79)

The conditional mutual information between two random variables (two sources of in-
formation), X and Y, given another random variable, Z, is

I(X; Y|Z) := H(X|Z) − H(X|Y, Z) = E_{P(x,y,z)}\Big[ \log \frac{P(x, y|z)}{P(x|z) P(y|z)} \Big]   (8.80)

The entropy chain rule (8.74), when applied to the mutual information of (X_1, · · · , X_n) ∼
P(x_1, · · · , x_n), results in

I(X_n, · · · , X_1; Y) = \sum_{i=1}^{n} I(X_i; Y | X_{i−1}, · · · , X_1)   (8.81)

See Fig. (8.14) for the Venn diagram illustration of Eq. (8.81).
We recommend that the reader check [19] for extended discussions of entropy, mutual
information and related notions.
The notions of joint entropy, conditional entropy and mutual information in the context
of two random variables are illustrated in the following three examples.

Example 8.5.3. Consider two Bernoulli random variables X and Y with a joint probability
mass function P (X, Y ) given by

y=0 y=1
x=0 0 0.2
x=1 0.8 0

Compute the entropy of X, the joint entropy of X and Y , and the conditional entropy of
Y given X. Discuss the results.
Solution. The joint probability mass function indicates that the outcome of Y is completely
determined by the outcome of X, and vice versa. Intuitively, we should expect that all the
information in the entire system is fully contained in X, and that once X is known, no
additional information can be gained from Y. Let's do the calculations to verify that our
intuition is correct. The entropy of X is
H(X) = −\sum_{x∈X} P(x) \log_2 P(x) = −0.2 \log_2(0.2) − 0.8 \log_2(0.8) = 0.722.

The joint entropy of X and Y is

H(X, Y) = −\sum_{x,y} P(x, y) \log_2 P(x, y) = −0 \log_2(0) − 0.8 \log_2(0.8) − 0.2 \log_2(0.2) − 0 \log_2(0) = 0.722.

H(X, Y ) ≈ 0.722

H(X) ≈ 0.722 H(Y |X) = 0

H(X|Y ) = 0 H(Y ) ≈ 0.722

Figure 8.15: Schematic for example 8.5.3. The entropy of the whole system (blue) is the
same as the entropy of X (orange). The conditional entropy of Y given X is zero (illustrated
by the bar of ‘zero’ width at the end of the second row). The bottom row shows that the
entropy of Y (pink) also coincides with that of the entire system (and, incidentally, is fully
shared with that of X).

For this situation, the expected information content of the entire system is exactly the same
as the expected information content of X alone. No additional information can be expected
from (X, Y ) that cannot be expected from X. We anticipate (and verify) that there is no
additional information that can be expected from Y once X is known.
H(Y|X) = −\sum_{x,y} P(x, y) \log_2 P(y|x) = −0 \log_2(0) − 0.8 \log_2(1) − 0.2 \log_2(1) − 0 \log_2(0) = 0.

See Fig. 8.15 for illustration. Comment: Similar calculations for H(Y ) and H(X|Y ) would
show that Y also contains all the expected information in the system, and that no additional
information can be expected from X once Y is known.

Example 8.5.4. Consider two Bernoulli random variables X and Y with a joint probability
mass function P (X, Y ) given by

y=0 y=1
x=0 0.45 0.45
x=1 0.05 0.05

Compute the entropies of X and of Y . Compute the joint entropy of X and Y . Compute
the conditional entropy of Y given X. Discuss the results.
Solution. The marginal distributions are

P(X = x) = { 0.9, x = 0;   0.1, x = 1 }   and   P(Y = y) = { 0.5, y = 0;   0.5, y = 1 }.

We observe that X and Y are independent, so we should intuitively expect that there is
no ‘overlap’ between the information in X and in Y. Let's do the calculations to verify that
our intuition is correct. The entropy of X is

H(X) = −\sum_{x∈X} P(x) \log_2 P(x) = −0.9 \log_2(0.9) − 0.1 \log_2(0.1) = 0.469.

The entropy of Y is

H(Y) = −\sum_{y∈Y} P(y) \log_2 P(y) = −0.5 \log_2(0.5) − 0.5 \log_2(0.5) = 1.0.

The joint entropy of X and Y is

H(X, Y) = −\sum_{x,y} P(x, y) \log_2 P(x, y) = −0.45 \log_2(0.45) − 0.45 \log_2(0.45) − 0.05 \log_2(0.05) − 0.05 \log_2(0.05) = 1.469.

For this situation, the expected information content of the entire system is more than the
expected information content of X alone. The additional expected information of the system
(X, Y ) beyond that of X must be information expected from Y that is not contained in
X. We anticipate (and verify) that the additional information that can be expected from
Y when X is known is non-zero.
H(Y|X) = −\sum_{x,y} P(x, y) \log_2 P(y|x) = −0.45 \log_2(0.5) − 0.45 \log_2(0.5) − 0.05 \log_2(0.5) − 0.05 \log_2(0.5) = 1.0.

Furthermore, X and Y are independent, so X actually contains no information about Y.
We anticipate (and verify) that all the expected information content of Y contributes to
the expected information of the entire system:

H(Y) = −\sum_{y∈Y} P(y) \log_2 P(y) = −0.5 \log_2(0.5) − 0.5 \log_2(0.5) = 1.0.

Performing the similar calculation for H(X|Y) would show that H(X|Y) = H(X): since
X and Y are independent, no information about X is gained by learning Y. See Fig. 8.16
for illustration.

Example 8.5.5. Consider two Bernoulli random variables X and Y with a joint probability
mass function P (X, Y ) given by

H(X, Y ) ≈ 1.469

H(X) ≈ 0.469 H(Y |X) = 1.0

H(X|Y ) ≈ 0.469 H(Y ) = 1.0

Figure 8.16: Schematic for example 8.5.4. The entropy of the whole system (blue) is equal to
the entropy of X (orange) plus the entropy of Y conditioned on X (pink, shaded). Observe
that in this example, the entropy of Y conditioned on X is equal to the entropy of Y (pink)
because there is no overlap in information between X and Y.

y=0 y=1
x=0 0.2 0.3
x=1 0 0.5

Compute the entropies of X and of Y . Compute the joint entropy of X and Y . Compute
the conditional entropy of Y given X. Discuss the results.
Solution. The entropy of X is

H(X) = −\sum_x P(x) \log_2 P(x) = −0.5 \log_2(0.5) − 0.5 \log_2(0.5) = 1.0.

The conditional entropy of Y given X is

H(Y|X) = −\sum_{x,y} P(x, y) \log_2 P(y|x) = −0.2 \log_2(0.4) − 0.3 \log_2(0.6) − 0 · \log_2(0) − 0.5 \log_2(1) = 0.485.

The entropy of Y is

H(Y) = −\sum_y P(y) \log_2 P(y) = −0.2 \log_2(0.2) − 0.8 \log_2(0.8) = 0.722.

The conditional entropy of X given Y is

H(X|Y) = −\sum_{x,y} P(x, y) \log_2 P(x|y) = −0.2 \log_2(1) − 0.3 \log_2(0.375) − 0 · \log_2(0) − 0.5 \log_2(0.625) = 0.764.

H(X, Y ) ≈ 1.485

H(X) ≈ 1.0 H(Y |X) ≈ 0.485

H(X|Y ) ≈ 0.764 H(Y ) ≈ 0.722

Figure 8.17: Schematic for example 8.5.5. The entropy of the whole system (blue) is equal to
the entropy of X (orange) plus the entropy of Y conditioned on X (pink, shaded). Observe
that in this example, the entropy of Y conditioned on X is less than the entropy of Y (pink)
because some of the information content in Y overlaps with that of X.

P(x, y)     x_1     x_2     x_3     x_4   |  P(y)
   y_1      1/8    1/16    1/32    1/32   |  1/4
   y_2      1/16   1/8     1/32    1/32   |  1/4
   y_3      1/16   1/16    1/16    1/16   |  1/4
   y_4      1/4    0       0       0      |  1/4
  P(x)      1/2    1/4     1/8     1/8    |

Table 8.1: Exemplary joint probability distribution function P (x, y) and the marginal prob-
ability distributions, P (x), P (y), of the random variables x and y.

The joint entropy is

H(X, Y) = −\sum_{x,y} P(x, y) \log_2 P(x, y) = −0.2 \log_2(0.2) − 0.3 \log_2(0.3) − 0 · \log_2(0) − 0.5 \log_2(0.5) = 1.485.

See Fig. 8.17 for illustration. Comment: In this example, X and Y are not independent, and
therefore some information is shared between the two. This explains why the joint entropy
is less than the sum of the individual entropies, i.e. H(X, Y) < H(X) + H(Y). It also
explains why the information content of X conditioned on Y is less than the information
content of X alone, i.e. H(X|Y) < H(X). (Similarly, it explains why H(Y|X) < H(Y).)
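
All the entropies in Examples 8.5.3-8.5.5 follow mechanically from the joint PMF; a sketch
for the table of Example 8.5.5 (rows index x, columns index y):

    # entropies for the joint PMF of Example 8.5.5
    P = [0.2 0.3; 0.0 0.5]

    plogp(p) = p > 0 ? p * log2(p) : 0.0        # convention: 0 log 0 = 0
    H(Q) = -sum(plogp, Q)

    Px, Py = vec(sum(P, dims=2)), vec(sum(P, dims=1))
    println("H(X,Y) = ", H(P))                  # ≈ 1.485
    println("H(X)   = ", H(Px), "   H(Y) = ", H(Py))
    println("H(Y|X) = ", H(P) - H(Px))          # ≈ 0.485, the chain rule (8.74)
    println("I(X;Y) = ", H(Px) + H(Py) - H(P))  # shared information, Eq. (8.77)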

Exercise 8.6. The joint probability distribution P (x, y) of two random variables X and
Y is described in Table 8.1. Calculate the conditional probabilities P (x|y) and P (y|x),
marginal entropies H(X) and H(Y ), as well as the mutual information I(X; Y ).

Figure 8.18: A graphical hint for the proof of the Jensen inequality (8.82).

8.5.5 Probabilistic Inequalities for Entropy and Mutual Information

Let us now discuss the case when a one-dimensional random variable, X, is drawn from the
space of reals, x ∈ R, with probability density p(x). Now consider averaging a convex
function of X, f(X). One observes that the following statement, called the Jensen inequality,
holds:

E[f(X)] ≥ f(E[X]).   (8.82)

Obviously, the statement becomes an equality when X is deterministic, p(x) = δ(x − x_0).
To gain a bit more intuition, consider the case of a Bernoulli-like distribution,
p(x) = βδ(x − x_1) + (1 − β)δ(x − x_0). We derive

f(E[X]) = f(x_1 β + x_0 (1 − β)) ≤ βf(x_1) + (1 − β)f(x_0) = E[f(X)],   (8.83)



where the critical inequality in the middle is simply an expression of the convexity of the
function f(x) (taken verbatim from the definition).
See also Fig. (8.18) for another (graphical) hint on the proof of the Jensen inequality.
In fact, the Jensen inequality holds over any space.
Notice that p log p, considered as a function of the probability p of a particular state, is
convex (so that the entropy, a sum of −p log p terms, is concave). This observation gives
rise to multiple consequences of the Jensen inequality for the entropy and the mutual
information:

• (Information Inequality)

D(pkq) ≥ 0, with equality iff p = q

• (conditioning reduces entropy)

H(X|Y ) ≤ H(X) with equality iff X and Y are independent

• (Independence Bound on Entropy)

H(X_1, · · · , X_n) ≤ \sum_{i=1}^{n} H(X_i), with equality iff the X_i are independent

Another useful inequality is the [Log-Sum Theorem]:

\sum_{i=1}^{n} a_i \log \frac{a_i}{b_i} ≥ \Big( \sum_{i=1}^{n} a_i \Big) \log \frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i},   (8.84)

with equality iff a_i/b_i is constant. Convention: 0 log 0 = 0, a log(a/0) = ∞ if a > 0 and
0 log(0/0) = 0. Consequences of the Log-Sum theorem:

• (Convexity of Relative Entropy) D(pkq) is convex in the pair p and q

• (Concavity of Entropy) For X ∼ p(x) we have H(P ) := HP (X) (notations are ex-
tended) is a concave function of P (x).

• (Concavity of the mutual information in P(x)) Let (X, Y) ∼ P(x, y) = P(x)P(y|x).
Then I(X; Y) is a concave function of P(x) for fixed P(y|x).

• (Convexity of the mutual information in P(y|x)) Let (X, Y) ∼ P(x, y) = P(x)P(y|x).
Then I(X; Y) is a convex function of P(y|x) for fixed P(x).

We will see later (discussing Graphical Models) why the convexity/concavity properties of
the entropy-related objects are useful.

Example 8.5.6. Prove that H(X) ≤ log2 n, where n is the number of possible values of
the random variable x ∈ X.
Solution. The simplest proof is via Jensen's inequality, which states that if f is a convex
function and U is a random variable then

E[f (U )] ≥ f (E[U ]). (8.85)

Let us define

f(u) = −\log_2 u,   u = 1/P(x).

Obviously, f (u) is convex. In accordance with (8.85) one obtains

E[log2 P (X)] ≥ − log2 E[1/P (X)],

where E[log2 P (X)] = −H(X) and E[1/P (X)] = n, so H(X) ≤ log2 n.

Note, in passing, that Jensen's inequality leads to a number of other useful expressions
for entropy, e.g. H(X|Y) ≤ H(X), with equality iff X and Y are independent, and, more
generally, H(X_1, . . . , X_n) ≤ \sum_{i=1}^{n} H(X_i), with equality iff all the X_i are independent.
Chapter 9

Stochastic Processes

In Chapter 8, we discussed random vectors and their probability distributions (for example
X = (X1 , . . . , XN ) with X ∼ PX ). In Chapter 9, we will discuss stochastic processes,
which are a natural extension of our inquiry into random vectors. A stochastic process is a
collection of random variables, often written as a path, or sequence, {Xt |t = 1, · · · , T }, or
simply (X_t)_{t=1}^{T}, whose components X_t take values from the same state space Σ. Typically
we think of t as time, and we consider the cases where T is finite, T < ∞, and where it is
infinite, T → ∞. Both the state space, Σ, and the index set, t, may be either discrete or
continuous.
We will discuss three basic examples: (a) the Bernoulli process, which is discrete in both
space and time; (b) the Poisson process, which is discrete in space but continuous in time;
and (c) a continuous-space, continuous-time process described by the Langevin equation,
which is an example of a so-called Stochastic Differential Equation (SDE); see Section 9.1,
Section 9.2 and Section 9.3, respectively.
The three basic examples of the stochastic processes can also be classified in terms of
the amount of memory needed to generate them.
In our first two examples (Bernoulli and Poisson processes), the random variables X_t
are independent, i.e. memory-less, meaning that the outcome of each X_t does not influence,
and is not influenced by, the outcomes of any of the other X_t. In general, however, the
components X_t within the path {X(t)} need not be independent. Thus, the stochastic
process described by the Langevin equation results in dependent, i.e. correlated in time, X_t.
Time correlations within a random path, {X(t)}, described by an SDE may be compli-
cated and difficult to analyze. Consequently, one often considers a discrete-time simplifica-
tion, called a Markov process, discussed in Section 9.4, where the memory holds only for a


single time step, i.e. Xt depends only on the outcome of the previous step, Xt−1 .
We conclude this Chapter with a brief discussion in Section 9.5 of the Markov Decision Process (MDP), which is a controlled formulation built on (conditioned to) a Markov process. (Queuing theory, discussed in Section 9.6, is bonus material.)

9.1 Bernoulli Process (Discrete Space, Discrete Time)


A Bernoulli process is a sequence of independent Bernoulli random variables, often called events or trials. When each event can take only one of two outcomes, say “success” or “failure”, a typical sample path of a Bernoulli process may look like ∗ ∗ S ∗ S ∗ S ∗ ∗ ∗ S, where S stands for “success”, or equivalently 00101010001. We will discuss only stationary Bernoulli processes, meaning that the probability of success is the same for each Xt , that is P (success) = P (Xt = 1) = β and P (failure) = P (Xt = 0) = 1 − β for each t.
Examples of processes that can be modeled by a Bernoulli process include the number of “arrivals” when checked at fixed intervals, such as the “arrival” of a monsoon on each day of a Tucson summer, or any sequence of discrete updates, such as the (random) ups and downs of the stock market.

9.1.1 Probability distribution of the total number of successes

As we discussed in Eq. 8.5, the number of successes, k, in n trials follows the binomial distribution

∀k = 0, · · · , n : P (S = k|n, β) = \binom{n}{k} β^k (1 − β)^{n−k} ,    (9.1)
The mean and variance are found by computing E[S] and E[(S − E[S])^2 ] respectively:

mean : E[S] = nβ,    (9.2)
variance : var(S) = E[(S − E[S])^2 ] = nβ(1 − β).    (9.3)
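A minimal simulation sketch confirming Eqs. (9.2)–(9.3) (all parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta, trials = 50, 0.3, 100_000
# each row is one Bernoulli process of length n; S counts the successes per path
S = (rng.random((trials, n)) < beta).sum(axis=1)
print(S.mean(), n * beta)                  # ~15.0 vs 15.0, cf. Eq. (9.2)
print(S.var(), n * beta * (1 - beta))      # ~10.5 vs 10.5, cf. Eq. (9.3)
```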

9.1.2 Probability distribution of the 1st success

Let T1 be the number of trials until the first success (including the success event itself). The Probability Mass Function (PMF) for the time of the first success is the product of the probabilities of (t − 1) failures and one success:

t = 1, 2, · · · : P (T1 = t|β) = β(1 − β)^{t−1}    [Geometric PMF]    (9.4)



The distribution in Eq. 9.4 is called a geometric distribution because the calculation to verify that the probability distribution is normalized involves summing the geometric sequence: ∑_{t=1}^∞ β(1 − β)^{t−1} = β(1 − (1 − β))^{−1} = 1. The mean and variance of the geometric distribution are

mean : E[T1 ] = 1/β,    (9.5)
variance : var(T1 ) = E[(T1 − E[T1 ])^2 ] = (1 − β)/β^2 .    (9.6)
The Bernoulli process is memoryless, meaning that each outcome is independent of the past. If n trials have already occurred, the future sequence xn+1 , xn+2 , · · · is also a Bernoulli process, independent of the first n trials. Moreover, suppose we have observed the process n times and no success has occurred. Then the PMF for the remaining time to the first arrival is still geometric,

P (T1 − n = k|T1 > n, β) = β(1 − β)^{k−1} .    (9.7)
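Eqs. (9.5)–(9.7) are easy to check by simulation; a minimal sketch (numpy's geometric sampler uses the same support t = 1, 2, . . . as Eq. (9.4); the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
beta, trials = 0.2, 200_000
T1 = rng.geometric(beta, size=trials)      # samples of the first-success time
print(T1.mean(), 1 / beta)                 # ~5.0 vs 5.0, cf. Eq. (9.5)
print(T1.var(), (1 - beta) / beta**2)      # ~20.0 vs 20.0, cf. Eq. (9.6)
# memorylessness, Eq. (9.7): given T1 > 3, the residual time T1 - 3 is again geometric
resid = T1[T1 > 3] - 3
print(resid.mean())                        # ~5.0 again
```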

9.1.3 Probability distribution of the k th success

What about the k th arrival? Let Tk be the number of trials until the k th success (inclusive); then

∀t = k, k + 1, · · · : P (Tk = t|β) = \binom{t − 1}{k − 1} β^k (1 − β)^{t−k}    [Pascal PMF],    (9.8)

The mean and variance are found by computing E[Tk ] and E[(Tk − E[Tk ])^2 ] respectively:

mean : E[Tk ] = k/β,    (9.9)
variance : var(Tk ) = E[(Tk − E[Tk ])^2 ] = k(1 − β)/β^2 .    (9.10)
The combinatorial factor accounts for the number of configurations of k arrivals in Tk trials.

Exercise 9.1. Define τk = Tk − Tk−1 , k = 2, 3, · · · , so that τk is the inter-arrival time between the (k − 1)st and k th arrivals. Write down the probability mass function for the k th inter-arrival time, τk .
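The Pascal statistics (9.9)–(9.10) can likewise be probed by simulating full Bernoulli paths and recording the trial at which the k th success occurs (a sketch; the path length n is chosen large enough that k successes virtually always occur):

```python
import numpy as np

rng = np.random.default_rng(3)
beta, k, n, trials = 0.3, 4, 200, 100_000
paths = rng.random((trials, n)) < beta             # Bernoulli sample paths
# Tk = (1-based) index of the k-th success along each path
Tk = np.argmax(paths.cumsum(axis=1) == k, axis=1) + 1
print(Tk.mean(), k / beta)                         # ~13.3 vs 13.33, cf. Eq. (9.9)
print(Tk.var(), k * (1 - beta) / beta**2)          # ~31.1 vs 31.11, cf. Eq. (9.10)
```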

The following two sections discuss two different continuous-time limits of Bernoulli pro-
cesses, namely Poisson Processes and Brownian motion.

9.2 Poisson Process (Discrete Space, Continuous Time)


We will build on the material of the previous Section, where the discrete-time, discrete-space Bernoulli process was discussed, and extend it to continuous time, thus arriving at the Poisson process.
Formally, a Poisson process is defined in terms of a collection of random variables, {Nt }, indexed by time, that count the number of independent ‘arrivals’ on the interval [0, t]. Recall the Poisson distribution which was defined in Section 8.1.1:

∀k ∈ {0, 1, 2, . . . } : P(N = k; λ̃) = λ̃^k e^{−λ̃} / k! .    (9.11)
In the following derivation, we will show that the outcomes of a Poisson process follow a
Poisson distribution with parameter λt.
The distribution for the Poisson Process can be derived by subdividing the interval [0, t]
into n subintervals of length ∆t := t/n. For sufficiently small ∆t, the probability of two
or more arrivals on any subinterval is negligible and the occurrence of arrivals on any two
subintervals are independent. Under these conditions, the probability of k arrivals in n
sub-intervals can be modeled by a binomial distribution. The probability of an arrival, β,
is proportional to the length of the sub-interval: β ∝ t/n, or equivalently β = λt/n where
λ is the constant of proportionality. In the limit as n → ∞, we get
P (Nt = k; λ) = lim_{n→∞} \binom{n}{k} β^k (1 − β)^{n−k} = lim_{n→∞} [n!/(k!(n − k)!)] (λt/n)^k (1 − λt/n)^{n−k}    (9.12)
= lim_{n→∞} [(n^k + O(n^{k−1}))/k!] (λt/n)^k (1 − λt/n)^{n−k}    (9.13)
= (λt)^k e^{−λt} / k! ,    (9.14)

where we have used the facts that (1 − λt/n)^n → e^{−λt} and (1 − λt/n)^{−k} → 1. Notice that the dimensionless parameter λ̃ in Eq. (9.11) is replaced by λt in Eq. (9.14). Hence λ has the dimension of inverse time: [λ] = [1/t].
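The limit (9.12)–(9.14) can be watched numerically; a minimal sketch (the values of λ, t and k are arbitrary choices):

```python
import math

lam, t, k = 2.0, 1.5, 4
poisson = (lam * t)**k * math.exp(-lam * t) / math.factorial(k)
for n in (10, 100, 10_000):
    beta = lam * t / n                             # per-subinterval arrival probability
    binom = math.comb(n, k) * beta**k * (1 - beta)**(n - k)
    print(n, binom)                                # 0.200..., 0.171..., 0.168...
print('Poisson limit:', poisson)                   # ~0.168
```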

Common examples of Poisson processes:

• E-mail arrivals with infrequent check.

• Collisions of high-energy beams at high frequency (10 MHz), where there is a small chance of an actual collision on each crossing.

• Radioactive decay of a nucleus, with the trial being to observe a decay within a small time interval.

• Spin flips in a magnetic field.


Figure 9.1: One realization of a Poisson process over the interval [0, t] (blue) in which
five “arrivals” occurred at the random times T1 , . . . , T5 , therefore N (t) = 5. Two other
realizations are shown (grey) in which 6 arrivals and 3 arrivals occurred on [0, t]. The
random variable Nt follows a Poisson distribution (right) with parameter λt where λ is a
fixed parameter.


Properties of Poisson Processes

The Poisson process has the following key properties:

• Initialization: No arrivals have occurred at t = 0, that is, N (0) = 0.

• Independence: The numbers of arrivals that occur on two time intervals are independent if and only if the two time intervals are disjoint.

• Distribution: The number of arrivals that occur on an interval depends only on the length of the interval and not on its location. In particular, in the limit ∆t → 0, P (N (∆t) = 1) = λ∆t + o(∆t) and P (N (∆t) ≥ 2) = o(∆t).

With a small amount of effort, it can be shown that these three properties are both necessary
and sufficient conditions to define a stochastic process whose components follow a Poisson
distribution with parameter λt.
A summary of the relationship between the Bernoulli process and the Poisson process is given in Table 9.1.

Probability Distributions of the 1st and the k th Arrival Times

Let T1 be the (random) time of the first arrival. The probability density of T1 per unit time
can be found from Eq. (9.11) by (1) recalling that a PDF is the derivative of the CDF, and
(2) recognizing the equivalency P (T1 < t) = P (Nt ≥ 1) since the event that first arrival

                                     Bernoulli      Poisson
Times of Arrival                     Discrete       Continuous
Arrival Rate                         β per trial    λ per unit time
Distribution of Number of Arrivals   Binomial       Poisson
Distribution of Interarrival Time    Geometric      Exponential
Distribution of k th Arrival Time    Pascal         Erlang

Table 9.1: Comparison between the Bernoulli process and the Poisson process

occurs before time t is equivalent to the event that the number of arrivals by time t is
greater than or equal to 1. Therefore,
pT1 (t) = lim_{∆t→0} [ P (T1 < t + ∆t) − P (T1 < t) ] / ∆t = (d/dt) P (T1 < t)
= (d/dt) P (Nt ≥ 1) = (d/dt) [ 1 − P (Nt = 0) ]
= (d/dt) [ 1 − e^{−λt} ] = λ e^{−λt} .

Hence the time of the first arrival follows an exponential distribution with parameter λ.
The three properties above imply that, like the Bernoulli process, the Poisson process
is Memoryless and that it has Fresh Starts.

• Memoryless: if we observe the process for t seconds and no arrival has occurred, then
the density of the remaining time of arrival is exponential.

• Fresh Starts: the time of the next arrival is independent of the past, and hence is also
exponentially distributed with parameter λ.

The probability that the first arrival occurs before time t can therefore be recovered by integration:

P (T1 ≤ t) = ∫_0^t dt′ pT1 (t′ ) = ∫_0^t dt′ λe^{−λt′} = 1 − exp(−λt).

By extension, one derives the probability density of the time of the k th arrival:

p(Tk = t; λ) = λ^k t^{k−1} exp(−λt) / (k − 1)! ,   t > 0    (Erlang “of order” k).
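A simulation sketch of these arrival-time laws (rates and horizons are arbitrary choices): drawing the inter-arrival gaps as independent exponentials and accumulating them reproduces both the exponential law of T1 and the Erlang mean k/λ.

```python
import numpy as np

rng = np.random.default_rng(4)
lam, k, trials = 0.5, 3, 100_000
gaps = rng.exponential(1 / lam, size=(trials, k))  # i.i.d. exponential inter-arrival times
T = gaps.cumsum(axis=1)                            # T[:, j-1] is the j-th arrival time
print(T[:, 0].mean(), 1 / lam)                     # first arrival: ~2.0 vs 2.0
print((T[:, 0] <= 1.0).mean(), 1 - np.exp(-lam))   # P(T1 <= 1): ~0.393
print(T[:, -1].mean(), k / lam)                    # k-th (Erlang) arrival: ~6.0 vs 6.0
```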

Merging and Splitting Processes

One of the most important features shared by the Bernoulli and Poisson processes is their invariance with respect to merging and splitting. We will demonstrate it for the Poisson process, but the same applies to the Bernoulli process.


Figure 9.2: Merging and Splitting Poisson Processes

Merging: Let N1 (t) and N2 (t) be two independent Poisson processes with rates λ1 and λ2 respectively, and define N (t) = N1 (t) + N2 (t). This random process is derived by combining the arrivals as shown in Fig. (9.2). The claim is that N (t) is a Poisson process with rate λ1 + λ2 . To see it, we first note that N (0) = N1 (0) + N2 (0) = 0. Next, since N1 (t) and N2 (t) are independent and have independent increments, their sum also has independent increments. Finally, consider an interval of length τ , (t, t + τ ]. The numbers of arrivals of the two processes in this interval are Poisson(λ1 τ ) and Poisson(λ2 τ ), and the two numbers are independent. Therefore the number of arrivals in the interval associated with N (t) is Poisson((λ1 + λ2 )τ ), as the sum of two independent Poisson random variables. We can obviously generalize the statement to a sum of many Poisson processes. Note that in the case of the Bernoulli process the story is identical, provided that a simultaneous arrival (collision) is counted as a single arrival.
Splitting: Let N (t) be a Poisson process with rate λ. Here, we split N (t) into N1 (t) and N2 (t), where the splitting is decided by coin tossing (a Bernoulli process): when an arrival occurs we toss a coin and add the arrival to N1 with probability β or to N2 with probability 1 − β. The coin tosses are independent of each other and of N (t). Then the following statements can be made (a numerical check follows the list):

• N1 is a Poisson process with rate λβ.

• N2 is a Poisson process with rate λ(1 − β).

• N1 and N2 are independent of each other.
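A minimal numerical check of the splitting statements (the rate λ, horizon t and splitting probability β are arbitrary choices). Conditionally on N arrivals of the parent process, each arrival is assigned to N1 independently with probability β, so N1 |N ∼ Binomial(N, β):

```python
import numpy as np

rng = np.random.default_rng(5)
lam, beta, t, trials = 3.0, 0.4, 10.0, 50_000
N = rng.poisson(lam * t, size=trials)      # parent-process arrivals on [0, t]
N1 = rng.binomial(N, beta)                 # coin-splitting of each arrival
N2 = N - N1
print(N1.mean(), lam * beta * t)           # ~12.0 vs 12.0: rate is lam * beta
print(N1.var())                            # ~12.0: variance = mean, as for a Poisson
print(np.cov(N1, N2)[0, 1])                # ~0: N1 and N2 are uncorrelated
```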

Example 9.2.1. Astronomers estimate that meteors above a certain size hit the Earth on average once every 1000 years, and that the number of meteor hits follows a Poisson distribution.

(a) What is the probability of observing at least one large meteor next year?

(b) What is the probability of observing no meteor hits within the next 1000 years?

(c) Calculate the probability density p(Tk ), where the random variable Tk represents the
appearance time of the k th meteor.

Solution. The probability of observing k meteors in a time interval [0, t] is given by

P (k|t) = (λt)^k e^{−λt} / k! ,    (9.15)

where λ = 0.001 (events per year) is the average hitting rate and we simplify notations, P (k|t, λ) → P (k|t).

(a) P (k > 0 meteors next year) = 1 − P (0|1) = 1 − e−0.001 ≈ 0.001.

(b) P (k = 0 meteors next 1000 years) = P (0|1000) = e−1 ≈ 0.37.

(c) The k th meteor appears during [t, t + dt] if and only if exactly k − 1 meteors arrived during [0, t] and one more arrives during [t, t + dt]. Therefore

p(Tk ) dTk = P (k − 1|Tk ) λ dTk ,

p(Tk ) = λ^k Tk^{k−1} e^{−λTk} / (k − 1)! .
(k − 1)!

Exercise 9.2. Customers arrive at a store according to a Poisson process with rate 5 per hour; 40% of the arrivals are men and 60% are women.

(a) Compute the probability that at least 10 customers entered between 10 and 11 am.
(b) Compute the probability that exactly 5 women entered between 10 and 11 am.
(c) Compute the expected inter-arrival time of men.
(d) Compute the probability that no men arrive between 2 and 4 pm.

9.3 Stochastic Processes that are Continuous in Space-time


The stochastic processes discussed so far were memory-less. In this Section we discuss the stochastic dynamics of continuous variables governed by the Langevin equation. We discuss how to derive the Fokker-Planck equations, which describe the temporal evolution of the probability of a state. We then go into some additional detail for the foundational example of stochastic dynamics in free space (no drift) that describes Brownian motion, where the Fokker-Planck equations simplify to the diffusion equation.

9.3.1 Random Walks on the Integers

For a binomial process Y = (Y1 , . . . , Yn ) whose components take the outcomes ±1 with equal probability, the stochastic process X, where Xj := ∑_{i=1}^j Yi , is called a random walk on the integers. The PMF of this random walk is a (shifted) binomial distribution, and it converges to a Gaussian distribution with mean zero and variance n, a direct result of the central limit theorem.

Example 9.3.1. A grasshopper is dropped on a number-line and proceeds to take a random


walk by making either a unit jump to the right with probability β or a unit jump to the left
with probability 1 − β. The starting location of the grasshopper is random, and is X0 = 0
with probability 0.7, and X0 = 5 with probability 0.3. Find the probability distribution of
the grasshopper at time t.
Solution. Begin by examining the case where X0 = 0. The possible outcomes after one jump are either X1 = −1 (by jumping left) or X1 = +1 (by jumping right). The possible outcomes after two jumps are X2 = −2 (by making two jumps to the left), X2 = 0 (by either jumping left then right, or right then left), or X2 = +2 (by making two jumps to the right). Define the function F (n, k) to be the number of possible combinations of n left and right jumps ending at Xn = k. Specifically, F (n, k) = \binom{n}{(n+k)/2} if (n + k)/2 is an integer between 0 and n, and F (n, k) = 0 otherwise. A path from 0 to k in n steps contains (n + k)/2 right jumps and (n − k)/2 left jumps, so the probability of reaching k in n steps from 0 is F (n, k) β^{(n+k)/2} (1 − β)^{(n−k)/2} . Repeating for X0 = 5 and applying the law of total probability gives:

P (Xn = k) = P (Xn = k|X0 = 0)P (X0 = 0) + P (Xn = k|X0 = 5)P (X0 = 5)
= 0.7 F (n, k) β^{(n+k)/2} (1 − β)^{(n−k)/2} + 0.3 F (n, k − 5) β^{(n+k−5)/2} (1 − β)^{(n−k+5)/2}

Exercise. Modify example 9.3.1 for the random starting location P (X0 = x) = 2^{−x}/3.

9.3.2 From Random Walks to Brownian Motion

Example 9.3.2. Consider a random walk on the line with n jumps that occur at times t = ∆, 2∆, . . . , n∆, where ∆ = 1/n. Find the PMF of the random walk if the jumps are ±√∆ with equal probability, so P (Xj+1 = Xj + √∆) = P (Xj+1 = Xj − √∆) = 1/2, and use your result to show the stochastic process has mean zero and unit variance when t = 1.
Solution. P (Xn = (n − 2k)√∆) = \binom{n}{k} (1/2)^n , k = 0, · · · , n. The mean and variance are found by recognizing that Xn = √∆(2Y − n), where Y ∼ B(n, 1/2). Therefore, E[Xn ] = √∆(2E[Y ] − n) = 0 and Var[Xn ] = 4∆Var[Y ] = 1.
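A simulation sketch of Example 9.3.2 (the number of jumps n and of sample paths are our own choices):

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials = 1000, 100_000
delta = 1.0 / n
steps = rng.choice([-1.0, 1.0], size=(trials, n)) * np.sqrt(delta)
X1 = steps.sum(axis=1)                     # the walk evaluated at t = n * delta = 1
print(X1.mean(), X1.var())                 # ~0.0 and ~1.0, as derived above
```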

Example 9.3.2 could be further extended by using the Central Limit Theorem to show that the CDF of the random walk converges to the CDF of a standard normal distribution in the limit as n → ∞. This result informs when and how to approximate a discrete-time stochastic process by a continuous-time stochastic process and vice versa. (Continuous-time processes are often easier to analyze, while discrete stochastic processes are often easier to compute numerically.)
The central limit theorem further implies that the jumps need not be Bernoulli: every random walk with i.i.d. steps will have the same limit, provided that the variance of the steps scales as 1/n as n → ∞. This limit is Brownian Motion.

9.3.3 Langevin equation in continuous time and discrete time

Many stochastic processes in 1d can be described in the continuous-time and discrete-time forms as follows:

ẋ = v(x) + √(2D) ξ(t),   ⟨ξ(t)⟩ = 0, ⟨ξ(t1 )ξ(t2 )⟩ = δ(t1 − t2 ),    (9.16)

x_{n+1} − x_n = ∆ v(x_n ) + √(2D∆) ξ(t_n ),   ⟨ξ(t_n )⟩ = 0, ⟨ξ(t_n )ξ(t_k )⟩ = δ_{kn} .    (9.17)

The first term on the rhs of the Stochastic Differential Equation (SDE) (9.16) determines the (deterministic) drift, and the second term (called the Langevin term) determines the random “noise”, which has mean zero and variance determined by D. The noise is considered independent at each time step. These equations, also called the Langevin equations, describe the evolution of a “particle” positioned at x ∈ R. The two terms on the rhs of Eq. (9.16) correspond, respectively, to the deterministic drift/advancement of the particle (dependent on its position at the previous time step) and to a random correction/increment. The random correction models the uncertainty of the environment the particle moves through. (We can also think of it as representing random kicks by other “invisible” particles.) The uncertainty is represented in a probabilistic way – therefore we will be talking about the probability distribution function of paths, i.e. trajectories of the particle.
The square root on the rhs of Eq. (9.17) may seem mysterious; let us clarify its origin on the basic “no (deterministic) drift” example of v(x) = 0. (This will be the running example throughout this lecture.) In this case the Langevin equation describes Brownian motion. Direct integration of the linear equation with the inhomogeneous source results in

∀t ≥ 0 : x(t) = √(2D) ∫_0^t dt′ ξ(t′ ),    (9.18)

∀t ≥ 0 : ⟨x^2 (t)⟩ = ∫_0^t dt1 ∫_0^t dt2 2D δ(t1 − t2 ) = 2D ∫_0^t dt1 = 2Dt,    (9.19)

where we also set x(0) = 0. The infinitesimal version of Eq. (9.19) is

⟨(x_{n+1} − x_n )^2 ⟩ = 2D∆,    (9.20)



which can thus be derived from the Brownian (no drift) version of Eq. (9.17).
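Note that Eq. (9.17) is, verbatim, a simulation recipe (the Euler-Maruyama scheme). A minimal Python sketch for the drift-free case (all parameter values are our own choices) reproduces ⟨x^2 (t)⟩ = 2Dt of Eq. (9.19):

```python
import numpy as np

rng = np.random.default_rng(7)
D, dt, nsteps, paths = 0.5, 1e-3, 1000, 20_000
v = lambda x: 0.0 * x                      # zero deterministic drift (Brownian motion)
x = np.zeros(paths)
for _ in range(nsteps):
    # Eq. (9.17): x_{n+1} = x_n + dt * v(x_n) + sqrt(2 D dt) * xi_n
    x += dt * v(x) + np.sqrt(2 * D * dt) * rng.standard_normal(paths)
print(x.var(), 2 * D * dt * nsteps)        # ~1.0 vs 1.0, cf. Eq. (9.19)
```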

9.3.4 The Wiener Process: A Rigorous Definition of Brownian Motion

Note that the notations introduced above for the SDE, in its continuous-time form (9.16) and in its discrete-time form (9.17), originate from physics and are customary in (at least some part of) applied mathematics. The notations are intuitive and simple; however, they are formally ambiguous, as they depend on the notion of the δ-function, i.e. a generalized function. This is similar to the ambiguity associated with the use of the δ-function, for example, as a source term in a linear ODE describing the Green function. Recall that to resolve the ambiguity associated with the δ-function we should either regularize the δ-function or use it only under an integral. Therefore in theoretical mathematics, statistics and engineering, we more often see the SDE (9.16) restated in the differential form

dx(t) = v(x(t))dt + √(2D) dW (t),    (9.21)

where W (t) denotes the so-called “standard Brownian motion”, also called the Wiener process/term. (In some of the mathematics literature, dB(t) is used instead of dW (t).) Formally:
Definition 9.3.3 (Wiener Process). The Wiener process, {W (t)}_{t≥0} , is a continuous-time stochastic process in R that is characterized by the following three properties:
1. W (0) = 0;

2. W (t) is almost surely continuous (i.e. with probability 1, W (t) is continuous in t);

3. W (t) has stationary independent increments that are normally distributed with mean zero and variance equal to the length of the increment, that is W_t − W_{t′} ∼ N (0, t − t′ ) (for 0 ≤ t′ ≤ t).
The differential form (9.21) of the Langevin equation is advantageous because it naturally leads to the following integral version of the Langevin equation, resolving the aforementioned ambiguity:

x(t + ∆) − x(t) = ∫_t^{t+∆} dx(t′ ) = ∫_t^{t+∆} v(x(t′ ))dt′ + √(2D) ∫_t^{t+∆} dW (t′ ),    (9.22)

where the first term on the rhs is the standard (Lebesgue) integral, while the second term is the so-called Ito integral. According to Eqs. (9.18,9.19), and consistently with the formal definition of the Wiener process above, the heuristic interpretation of the Ito integral is that when ∆ → 0, the increment, x(t + ∆) − x(t), becomes Gaussian: zero-mean, normally distributed with variance 2D∆.

9.3.5 From the Langevin Equation to the Path Integral

The Langevin equation can also be viewed as relating the change in x(t), i.e. its dynamic increment, to the stochastic dynamics of the δ-correlated source ξ(t_n ) = ξ_n characterized by the Probability Density Function (PDF)

p(ξ1 , · · · , ξN ) = (2π)^{−N/2} exp( − ∑_{n=1}^N ξ_n^2 / 2 ).    (9.23)

Note that Eqs. (9.16,9.17,9.23) are the starting points for our further derivations, but they should also be viewed as a way to simulate the Langevin equation on a computer by generating many paths at once, i.e. simultaneously. Note, for completeness, that there are also other ways to simulate the Langevin equation, e.g. through the so-called telegraph process.
Let us now formally express ξn via xn from Eq. (9.17) and substitute it into Eq. (9.23):

p(ξ1 , · · · , ξ_{N−1} ) → p(x1 , · · · , xN ) = (4πD∆)^{−(N−1)/2} exp( − (1/(4D∆)) ∑_{n=1}^{N−1} (x_{n+1} − x_n − ∆v(x_n ))^2 ).    (9.24)
One gets an explicit expression for the measure over a path, written in a discretized way. A typical way of stating it in the continuous form (e.g. as a notational shortcut) is

p{x(t)} ∝ exp( − (1/4D) ∫_0^T dt (ẋ − v(x))^2 ).    (9.25)

This object is called (in physics and math) the “path integral” and/or the Feynman-Kac integral.

9.3.6 From the Path Integral to the Fokker-Planck (through sequential Gaussian integrations)

The Probability Density Function (PDF) of a path is a useful general object. However, we may also want to marginalize it, extracting the marginal PDF of being at the position xN at the (temporal) step N from the joint PDF (of the path) conditioned to being at the initial position, x0 , at the initial moment of time, p(x1 , · · · , xN |x0 ), and from the prior/initial distribution p0 (x0 ) – both assumed known:

pN (xN ) = ∫ dx0 · · · dxN −1 p(x0 , · · · , xN ) = ∫ dx0 · · · dxN −1 p(x1 , · · · , xN |x0 ) p0 (x0 ).    (9.26)

It is convenient to derive the relation between pN (·) and p0 (·) in steps, i.e. through an induction/recurrence, integrating over dx0 , · · · , dxN −1 sequentially.
Let us proceed analyzing the case of Brownian motion, where v = 0. Then the first step of the induction becomes

p1 (x1 ) = (4πD∆)^{−1/2} ∫ dx0 exp( − (x1 − x0 )^2 /(4D∆) ) p0 (x0 )    (9.27)
= (4πD∆)^{−1/2} ∫ dε exp( − ε^2 /(4D∆) ) p0 (x1 + ε)    (9.28)
≈ (4πD∆)^{−1/2} ∫ dε exp( − ε^2 /(4D∆) ) [ p0 (x1 ) + ε ∂_{x1} p0 (x1 ) + (ε^2 /2) ∂^2_{x1} p0 (x1 ) ]    (9.29)
= p0 (x1 ) + ∆D ∂^2_{x1} p0 (x1 ),    (9.30)

where, transitioning from Eq. (9.28) to Eq. (9.29), one makes a Taylor expansion in ε, assuming that ε ∼ √∆ and keeping only the leading terms in ∆. The resulting Gaussian integrations are straightforward. Eq. (9.30) is the time-discretized version of the diffusion equation; sending ∆ → 0, we arrive at

∂_t p(x|t) = D ∂_x^2 p(x|t),    (9.31)

where we write p(x|t) to emphasize that this is the probability of being at the position x at the moment of time t, i.e. the expression is conditioned on t and thus ∀t : ∫ dx p(x|t) = 1. Of course it is not surprising that the case of Brownian motion has resulted in the diffusion equation for the marginal PDF. Restoring the deterministic drift term, v(x) (the derivation is straightforward), one arrives at the Fokker-Planck equation, generalizing the zero-drift diffusion equation:

∂_t p(x|t) + ∂_x (v(x)p(x|t)) = D ∂_x^2 p(x|t).    (9.32)
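Eq. (9.32) can also be integrated numerically. Below is a minimal explicit finite-difference sketch (the grid, time step, linear drift v(x) = −x and initial condition are our own choices; an explicit scheme of this kind is stable only for a small enough time step):

```python
import numpy as np

D, L, nx = 1.0, 8.0, 401
x = np.linspace(-L, L, nx)
dx = x[1] - x[0]
dt = 0.2 * dx**2 / D                       # conservative explicit time step
v = -x                                     # linear drift, v(x) = -x
p = np.exp(-(x - 2.0)**2)                  # smooth, off-center initial condition
p /= p.sum() * dx                          # normalize: sum(p) * dx = 1
for _ in range(20_000):
    flux = v * p - D * np.gradient(p, dx)  # probability flux J = v p - D dp/dx
    p -= dt * np.gradient(flux, dx)        # Eq. (9.32) in the form dp/dt = -dJ/dx
print(p.sum() * dx)                        # total probability is conserved, ~1
# at long times p relaxes to the Gaussian profile ~ exp(-x^2/2), cf. Eq. (9.34) below
```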

9.3.7 Analysis of the Kolmogorov-Fokker-Planck Equation: General Features and Examples

Here we give only a very brief and incomplete description of the properties of the distribution, whose analysis is of fundamental importance for Statistical Mechanics. See e.g. [20].
The Fokker-Planck equation (9.32) is a linear and deterministic Partial Differential Equation (PDE). It describes the evolution/flow of the probability density distribution, continuous in both phase space, x, and time, t.
The derivation was for a particle moving in 1d, R, but the same ideology and logic extend to higher dimensions, R^d , d = 1, 2, · · · . There are also extensions of this consideration to compact continuous spaces; thus one can analyze dynamics on a circle, sphere or torus.
Analogs of the Fokker-Planck equation can be derived and analyzed for more complicated probabilities than just the marginal probability of the state (the path integral marginalized to a given time). An example is the so-called first-passage, or “first-hitting”, problem.

The temporal evolution is driven by two terms, often called “diffusion” and “advection”. The terminology originates from fluid mechanics and describes how probabilities “flow” in the phase space. The diffusion originates from the stochastic source, while the advection is associated with the deterministic (possibly nonlinear) force.
Linearity of the Fokker-Planck equation does not imply that it is simpler than the original nonlinear problem. Deriving the Fokker-Planck equation, we made a transition from a nonlinear but stochastic ODE to a linear PDE. This type of transition, from a nonlinear representation of many trajectories to a linear probabilistic representation, is typical in math/statistics/physics. The linear Fokker-Planck equation can be viewed as the continuous-time, continuous-space version of the discrete-time/discrete-space Master equation describing the evolution of a (finite-dimensional) probability vector in the case of a Markov Chain.
The Fokker-Planck Eq. (9.32) can be represented in the ‘flux’ form:

∂t p(x|t) + ∂x J(t; x) = 0 (9.33)

where J(t; x) is the flux of probability through the space-state point x at the moment of time t. The fact that the second (flux) term in Eq. (9.33) has a gradient form corresponds to the global conservation of probability. Indeed, integrating Eq. (9.33) over the whole continuous domain of achievable x, and assuming that if the domain is bounded there is no injection (or dissipation) of probability on the boundary, one finds that the integral of the second term is zero (according to the standard Gauss theorem of calculus) and thus ∂_t ∫ dx p(x|t) = 0. In the steady state, when ∂_t p(x|t) = 0 for all x (and not only as the result of integration over the entire domain), the flux is constant – it does not depend on x. The case of zero flux is the special case of so-called ‘equilibrium’ statistical mechanics.
(See some further comments below on the latter.)
If the initial probability distribution, p(x|0), is known, then p(x|t) for any subsequent t is well defined, in the sense that the Fokker-Planck equation poses a Cauchy (initial value) problem with a unique solution.
Remarks about simulations. One can solve the PDE, but one can also analyze the stochastic ODE, approaching the problem in two complementary ways – corresponding to the Eulerian and Lagrangian descriptions in Fluid Mechanics of “incompressible” flows in the probability space.
The main and simplest (already mentioned) example of the Langevin dynamics is Brownian motion, i.e. the case of zero drift, v = 0. Another example, principal for so-called ‘equilibrium statistical physics’, is where the velocity (of the deterministic drift) is a spatial gradient of a potential, U (x): v(x) = −∂_x U (x). Think, for example, of x representing an over-damped particle connected to the origin by a spring; U (x) is the potential/energy stored within the spring. In this case of the gradient drift, the stationary (i.e. time-independent) solution of the Fokker-Planck Eq. (9.32) can be found explicitly:

p(x|t) → p_st (x) = Z^{−1} exp( − U (x)/D )   as t → ∞.    (9.34)

9.3.8 Examples and Exercises

Example 9.3.4. Consider the motion of a Brownian particle in the parabolic potential, U (x) = γx^2 /2. (The situation is typical for a particle located near a minimum or a maximum of a potential.) The Langevin equation (9.16) in this case becomes

dx/dt + γx = √(2D) ξ(t),   ⟨ξ(t)⟩ = 0, ⟨ξ(t1 )ξ(t2 )⟩ = δ(t1 − t2 ).    (9.35)

Write a formal solution of Eq. (9.35) for x(t) as a functional of ξ(t). Compute ⟨x^2 (t)⟩ as a function of t and interpret the results. Write the Kolmogorov-Fokker-Planck (KFP) equation for p(x|t), and solve it for the initial condition p(x|0) = δ(x).
Solution. Multiply Eq. (9.35) by the integrating factor e^{γt} to get

(d/dt) ( x(t)e^{γt} ) = √(2D) ξ(t) e^{γt} ,

which has the formal solution

x(t)e^{γt} = x(0) + √(2D) ∫_0^t ξ(t′ ) e^{γt′} dt′ .

The formal solution simplifies to

x(t) = x(0)e^{−γt} + √(2D) ∫_0^t ξ(t′ ) e^{−γ(t−t′ )} dt′ .

We wish to find the mean and the variance of x(t). The first two moments, ⟨x(t)⟩ and ⟨x^2 (t)⟩, become

⟨x(t)⟩ = ⟨ x(0)e^{−γt} + √(2D) ∫_0^t ξ(t′ ) e^{−γ(t−t′ )} dt′ ⟩
= x(0)e^{−γt} + √(2D) ∫_0^t ⟨ξ(t′ )⟩ e^{−γ(t−t′ )} dt′
= x(0)e^{−γt} ,

where we have used that ⟨ξ(t)⟩ = 0, and

⟨x^2 (t)⟩ = ⟨ ( x(0)e^{−γt} + √(2D) ∫_0^t ξ(t′ ) e^{−γ(t−t′ )} dt′ ) ( x(0)e^{−γt} + √(2D) ∫_0^t ξ(t″ ) e^{−γ(t−t″ )} dt″ ) ⟩
= x(0)^2 e^{−2γt} + 2D ∫_0^t ∫_0^t ⟨ξ(t′ )ξ(t″ )⟩ e^{−γ(t−t′ )} e^{−γ(t−t″ )} dt′ dt″
= x(0)^2 e^{−2γt} + 2D e^{−2γt} ∫_0^t ∫_0^t δ(t′ − t″ ) e^{γ(t′ +t″ )} dt′ dt″
= x(0)^2 e^{−2γt} + (D/γ) ( 1 − e^{−2γt} ).

The interpretation of the solution is as follows: (i) The contribution to ⟨x^2 (t)⟩ from the initial condition decays as e^{−2γt} . (ii) At the smallest times, t ≪ 1/γ, we expand the term (1 − e^{−2γt} ) as a first-order Taylor polynomial to find the usual diffusion, ⟨x^2 (t)⟩ ≃ 2Dt, since the particle does not yet feel the potential. (iii) At the larger time scale, t ≫ 1/γ, the dispersion saturates, ⟨x^2 (t)⟩ ≃ D/γ.
The Kolmogorov-Fokker-Planck equation, ∂_t p(x|t) = (γ∂_x x + D∂_x^2 )p(x|t), should be supplemented by the initial condition p(x|0) = δ(x). Then the solution (the Green function) is

p(x|t) = (2π⟨x^2 (t)⟩)^{−1/2} exp( − x^2 /(2⟨x^2 (t)⟩) ).    (9.36)

The meaning of the expression is clear: the probability density p(x|t) is Normal/Gaussian with a time-dependent variance.
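A simulation sketch checking the variance formula of Example 9.3.4 (all parameter values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(8)
gamma, D, dt, nsteps, paths = 1.0, 0.5, 1e-3, 3000, 20_000
x = np.full(paths, 2.0)                    # deterministic start, x(0) = 2
for _ in range(nsteps):
    # Euler-Maruyama step for Eq. (9.35): dx/dt = -gamma x + sqrt(2D) xi
    x += -gamma * x * dt + np.sqrt(2 * D * dt) * rng.standard_normal(paths)
t = nsteps * dt
theory = 2.0**2 * np.exp(-2 * gamma * t) + (D / gamma) * (1 - np.exp(-2 * gamma * t))
print((x**2).mean(), theory)               # both ~0.51
```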

Example 9.3.5. Prove that the moments ⟨x^{2k} (t)⟩ for Brownian motion in R obey the following recurrence:

∂_t ⟨x^{2k} ⟩ = 2k(2k − 1)D ⟨x^{2(k−1)} ⟩.    (9.37)

Solve this equation for a particle starting from x = 0 at t = 0.


Solution. Recall the definition of the k th moment or a random variable and recall Eq.(9.31).
Z +∞ Z +∞ Z +∞
2k 2k 2k
∂t hx i = ∂t x p(x|t) dt = x ∂t p(x|t) dt = x2k D∂xx p(x|t) dt
−∞ −∞ −∞
Z +∞ Z +∞
=D ∂xx x2k p(x|t) dt = D 2k(2k − 1)x2k−2 p(x|t) dt = 2k(2k − 1)Dhx2(k−1) i.
−∞ −∞

Example 9.3.6 (Brownian motion in parabolic potential). The conditional probability distribution for a Brownian particle in a parabolic potential, U (x) = αx^2 /2, is described by the advection-diffusion equation

D ∂_x^2 p + α ∂_x (xp) = ∂_t p.    (9.38)



Write down the stochastic ODE for the underlying stochastic process, x(t), and, given the initial condition p(x|0) = δ(x), compute the respective statistical moments ⟨x^k (t)⟩.
Solution. We propose that the corresponding Langevin equation is

ẋ = −αx + √(2D) ξ(t),

and verify our proposal by going through derivations similar to those shown in Example 9.3.4.
Let µk (t) be the k th moment of the random process, so µk (t) := ⟨x^k (t)⟩ = ∫ x^k p(x|t) dx. A differential equation for µk (t) can be derived from the KFP equation:

∂_t µk (t) = ∂_t ∫_{−∞}^∞ x^k p(x|t) dx
= ∫_{−∞}^∞ x^k ( α ∂_x (x p(x|t)) + D ∂_x^2 p(x|t) ) dx
= −kα ∫_{−∞}^∞ x^k p(x|t) dx + D k(k − 1) ∫_{−∞}^∞ x^{k−2} p(x|t) dx
= −αk µk (t) + D k(k − 1) µ_{k−2} (t),    (9.39)

where the boundary terms from the integrations by parts vanish at ±∞ because p(x|t) and its derivatives decay faster than any polynomial as x → ±∞.
In the case where p(x|0) = δ(x), µ0 (t) = 1 and µ1 (t) = 0. Applying the recursive relationship, we get µ_{2k+1} (t) = 0 for all odd-order moments. The even-order moments are obtained by solving Eq. (9.39) recursively. For example, the second moment can be found by solving the differential equation

∂_t µ2 (t) = −2αµ2 (t) + 2D,

hence µ2 (t) = (D/α)(1 − e^{−2αt} ).

Example 9.3.7. Consider the following expectation over the Langevin term, ξ(t):

Ψ(t; x) = ⟨ exp( ∫_0^t dτ Q(x(τ )) ) ⟩_{x(t)=x, x(0)=0}    (9.40)

←_{N→∞} Ψ_N (x) = ∫ dx0 · · · dx_{N−1} p(x, x_{N−1} , · · · , x1 |x0 ) δ(x0 ) e^{∆(Q(x_{N−1} )+···+Q(x0 ))}    (9.41)

= ∫ [dx1 /√(4πD∆)] · · · [dx_{N−1} /√(4πD∆)] exp( − [ (x − x_{N−1} )^2 + · · · + x1^2 ] / (4D∆) ) e^{∆(Q(x_{N−1} )+···+Q(0))} ,

where x(t) is the Brownian motion, thus satisfying ẋ(t) = √(2D) ξ(t), with x(t) set to zero initially, x(0) = 0; Q(x(t)) is a given function of x, which is finite everywhere in R; and Eq. (9.40) and Eq. (9.41) show, respectively, the continuous-time and discrete-time versions of the same expectation of interest.

• (a) Derive the partial differential equation governing the “evolution” of Ψ(t; x) in t and x.

• (b) Suggest a scheme allowing one to compute the first and second moments of ∫_0^t dτ Q(x(τ )),

φ^{(1)} (t) := ⟨ ∫_0^t dτ Q(x(τ )) ⟩_{x(0)=0} ,   φ^{(2)} (t) := ⟨ ( ∫_0^t dτ Q(x(τ )) )^2 ⟩_{x(0)=0} ,    (9.42)

explicitly, i.e. bypassing solving the PDE (derived in (a)), which does not allow explicit solutions for a general Q(x(t)).

Solution. (a) First of all, notice that Ψ(0; x) = δ(x). Further, observe that when Q(x) = 0 the resulting PDE is simply the diffusion Eq. (9.31), with p(t; x) substituted by Ψ(t; x). We can rewrite Eq. (9.41) as a recurrence

∀k = 1, · · · , N : Ψk (x) = ∫ [dx_{k−1} /√(4πD∆)] exp( − (x − x_{k−1} )^2 /(4D∆) + ∆Q(x_{k−1} ) ) Ψ_{k−1} (x_{k−1} ),    (9.43)
(9.43)
where Ψ0 (x) = δ(x). Changing the integration variable, x_{k−1} → ε = x_{k−1} − x, keeping the Gaussian expression in the integrand intact and expanding all other terms in the Taylor series in ε, then evaluating the resulting Gaussian integrals, we arrive at the following differential version of the recurrence:

∀k = 1, · · · , N : Ψk (x) = ( 1 + ∆Q(x) + ∆D∂_x^2 + O(∆^2 ) ) Ψ_{k−1} (x).

The continuous-time version of Eq. (9.43), and therefore the desired Cauchy (initial value) problem stated as a PDE supplemented with an initial condition, becomes:

∂_t Ψ(t; x) = Q(x)Ψ(t; x) + D ∂_x^2 Ψ(t; x),   Ψ(0; x) = δ(x).    (9.44)

(b) Let us substitute Q(x) in the expressions above, e.g. in the PDE (9.44), by δ · Q(x) with a bookkeeping parameter δ, expand the PDE (9.44) in a Taylor series in δ, and write down relations for the zeroth-, first- and second-order terms in δ. We derive the following set of diffusion equations (the first homogeneous and the other two inhomogeneous):

∂_t ψ^{(0)} (t; x) = D ∂_x^2 ψ^{(0)} (t; x),   ψ^{(0)} (0; x) = δ(x),    (9.45)
∂_t ψ^{(1)} (t; x) = Q(x) ψ^{(0)} (t; x) + D ∂_x^2 ψ^{(1)} (t; x),   ψ^{(1)} (0; x) = 0,    (9.46)
∂_t ψ^{(2)} (t; x) = Q(x) ψ^{(1)} (t; x) + D ∂_x^2 ψ^{(2)} (t; x),   ψ^{(2)} (0; x) = 0,    (9.47)

where

n = 0, 1, · · · : ψ^{(n)} (t; x) := (1/n!) ⟨ ( ∫_0^t dτ Q(x(τ )) )^n ⟩_{x(t)=x, x(0)=0}

(the 1/n! factor makes the ψ^{(n)} consistent with Eqs. (9.45,9.46,9.47)).

Eqs. (9.45,9.46,9.47) can be resolved explicitly:

ψ^{(0)} (t; x) = (4πDt)^{−1/2} exp( − x^2 /(4Dt) ),

ψ^{(1)} (t; x) = ∫_0^t dτ ∫_{−∞}^∞ dξ (4πD(t − τ ))^{−1/2} exp( − (x − ξ)^2 /(4D(t − τ )) ) Q(ξ) ψ^{(0)} (τ ; ξ),

ψ^{(2)} (t; x) = ∫_0^t dτ ∫_{−∞}^∞ dξ (4πD(t − τ ))^{−1/2} exp( − (x − ξ)^2 /(4D(t − τ )) ) Q(ξ) ψ^{(1)} (τ ; ξ).

Finally, observe that

k = 0, 1, · · · : φ^{(k)} (t) = k! ∫ dx ψ^{(k)} (t; x).

Exercise 9.3 (Self-propelled particle). The term “self-propelled particle” refers to an object capable of moving actively by gaining energy from the environment. Examples of such objects range from Brownian motors and motile cells to macroscopic animals and mobile robots. In the simplest two-dimensional model, the self-propelled particle moves in the xy-plane with fixed speed v0 . The Cartesian components of the particle velocity, ẋ(t) and ẏ(t), expressed through the polar angle are

ẋ = v0 cos ϕ,   ẏ = v0 sin ϕ,

where the polar angle ϕ defines the direction of motion. Assume that ϕ evolves according to the stochastic equation

dϕ/dt = √(2D) ξ,    (9.48)

where ξ(t) is Gaussian white noise with zero mean and pair correlation function ⟨ξ(t1 )ξ(t2 )⟩ = δ(t1 − t2 ). The initial conditions are chosen to be ϕ(0) = 0, x(0) = 0 and y(0) = 0.

(a) Calculate ⟨x(t)⟩, ⟨y(t)⟩.

(b) Calculate ⟨r^2 (t)⟩ = ⟨x^2 (t)⟩ + ⟨y^2 (t)⟩.

Hint: Derive the equation for the probability density of observing ϕ at the moment of time t, solve the equation and use the result. You may also consider using the first and second derivatives over t of the object of interest, as well as evaluations from Example 9.3.7.

9.4 Markov Process [discrete space, discrete time]


It may be tempting when studying stochastic processes to assume that the random events are independent and identically distributed (i.i.d.). However, complex systems in the real world often “jump” from one random state to another in such a way that previous states influence future states. In general, the memory may last for more than one jump, but there is also a large family of interesting random processes which do not have long memory: the jumps are directly influenced only by the current state, and not by any previous states. More precisely, the Markovian simplification is P (Xt+1 |Xt , Xt−1 , Xt−2 , . . . ) = P (Xt+1 |Xt ). These random processes are called Markov Processes (MPs) or, equivalently, Markov Chains (MCs).

9.4.1 Transition Probabilities

The Markovian simplification allows us to think of a Markov chain as a random walk on a directed graph. The vertices of the graph correspond to the various states, and the edges correspond to transitions between the states and are associated with the probabilities of transitioning between the corresponding states.^a
The graphical representation of a MC is as follows: introduce a directed graph, G = (V, E), where the set of vertices, V = {i}, is associated with the set of states, and the set of directed edges, E = {j ← i}, corresponds to possible transitions between the states. Note that we may also have “self-loops”, {i ← i}, included in the set of edges. To complete the description we need to associate with each directed edge the probability P (Xt = j|Xt−1 = i) of transitioning from the state i to the state j, which we write as p_{j←i} or p_{ji} . Since p_{ji} is a probability, it must satisfy

∀(j ← i) ∈ E : p_{ji} ≥ 0    (9.49)

and

∀i : ∑_{j:(j←i)∈E} p_{ji} = 1.    (9.50)

The combination of G and p := (pji |(j ← i) ∈ E) defines a MC. Mathematically we also say
that the tuple (finite ordered set of elements), (V, E, p), defines the Markov chain.
We will mainly consider stationary Markov chains, where pji does not change in time.
However, for many of the following statements/considerations generalization to the time-
dependent processes is straightforward.
^a A useful interactive playground can be found here: http://setosa.io/ev/markov-chains/

[Diagram: a two-state chain with states A and B; A → A with probability 0.7, A → B with 0.3, B → A with 0.5, B → B with 0.5.]

Figure 9.3: An example of a two-state Markov chain.

9.4.2 Sample Trajectories and Analysis by Simulation

One way to analyze a Markov chain is by generating sample trajectories. How does one relate the weighted, directed graph to samples? The relation, actually, has two sides. The direct side is about generating samples, which is done by first initializing the trajectory at a particular state, and then advancing the trajectory from the current state to a randomly selected adjacent state according to the transition probabilities. The inverse side is about verifying whether given samples were indeed generated according to the rather restrictive MC rules, and even reconstructing the characteristics of the underlying Markov chain.

Example 9.4.1. Describe how a sample trajectory may be generated for the Markov chain
illustrated in Fig. 9.3. Assume the system begins in state A at time 0.
Solution. Since the system is in state A at time 0, set X0 = A. At time 1, the system may either remain in state A with probability 0.7 or transition to state B with probability 0.3. Formally we write P (X1 = A|X0 = A) = 0.7 and P (X1 = B|X0 = A) = 0.3. One can generate a sample trajectory by first drawing a random number on the interval [0, 1) and then setting X1 = A if the random number lies in [0, 0.7) and setting X1 = B if it lies in [0.7, 1). By the Markov property, the state of the system at time 2 depends only on the state at time 1, so one can then generate a second random number on [0, 1) and define X2 accordingly. Samples of the trajectory (x_t )_{t=0}^∞ can be generated efficiently in this manner and may look something like AABABBAA....

The Markov chain defines a random (stochastic) dynamic process. Although time may
flow continuously, Markov chains consider time to be discrete (which is sometimes a matter
of convenient abstraction and sometimes, actually quite often, events do happen discretely).
One uses t = 0, 1, 2, · · · for the times when jumps occur. Then a particular random trajec-
tory/path/sample of the system will look like

i1 (0), i2 (1), · · · , ik (tk ), where i1 , · · · , ik ∈ V



We can also generate many samples (many trajectories),

n = 1, · · · , N :  i_1^{(n)} (0), i_2^{(n)} (1), · · · , i_k^{(n)} (t_k ),  where i_1 , · · · , i_k ∈ V,

where N is the number of trajectories.


It can be interesting to ask about various statistics of a trajectory, for example, (i)
the proportion of time spent in a particular state, or (ii) the probability that the system
takes longer than k steps to return to a particular state once leaving it. Although it may
be tempting to assume that the statistics measured from a particular trajectory are representative of those that would be measured from any other trajectory, this is not true for all Markov chains. In the following section, we will introduce the necessary property, called ergodicity, which guarantees that individual trajectories are actually representative of the Markov chain.

Basic Properties of Markov Chains

Definition 9.4.2 (Irreducible). A Markov chain is said to be irreducible if one can access any state from any state, formally

∀i, j ∈ V : ∃n ≥ 1, s.t. P (Xn = j|X0 = i) > 0.    (9.51)

The Markov chain in Fig. (9.3) is obviously irreducible. However, if we replace 0.3 → 0 and 0.7 → 1 it becomes reducible, because state B would no longer be accessible from state A.

Definition 9.4.3 (Aperiodicity). We say that state i has period k if every return to the state must occur at times that are multiples of k. Formally, the period of a state is

k = greatest common divisor {n > 0 : P (Xn = i|X0 = i) > 0} ,

provided that the set is not empty (otherwise the period is not defined). If k = 1 then the state is said to be aperiodic. If all the states of a Markov Chain are aperiodic, then we say that the Markov chain is aperiodic.

An irreducible MC needs only one aperiodic state for all of its states to be aperiodic; in particular, any irreducible MC with at least one self-loop is aperiodic. The Markov chain in Fig. (9.3) is obviously aperiodic. However, it becomes periodic with period two if the two self-loops are removed.

Example 9.4.4. Consider the Markov chain shown in Fig. 9.4. Is this Markov chain reducible or irreducible? Periodic or aperiodic?

[Diagram: states A, B and C arranged in a directed cycle, A → B → C → A, each transition having probability 1.0.]

Figure 9.4: An example of a three-state periodic Markov chain.

[Diagram: a three-state chain on states A, B and C in which C is absorbing (self-loop with probability 1.0) and states A and B cannot be reached from C.]

Figure 9.5: An example of a three-state reducible Markov chain.

Solution. The Markov chain shown in Fig. (9.4) is irreducible and periodic. The Markov chain is irreducible because each state is accessible from each other state. The Markov chain is periodic because if we start in state C, we can return to it only after 3, 6, 9, . . . steps (the system never forgets its initial state); we say that state “C” has period 3. Recall that a Markov chain is aperiodic if and only if each state has period 1. One can make this Markov chain aperiodic by adding a self-loop to any of the three states.

Example 9.4.5. Consider the Markov chain shown in Fig. 9.5. Is this Markov chain reducible or irreducible? Periodic or aperiodic?
Solution. The Markov chain shown in Fig. (9.5) is reducible and periodic. The Markov
chain is reducible because states “A” and “B” cannot be accessed from state “C”. Notice
that once the system enters state “C”, it will remain there forever. It cannot escape from
state “C”. This Markov chain is periodic because state “A” has period 2.

It is often of interest whether the system is guaranteed to return to a given state upon
leaving it, and if so, how many steps we should expect this to take. Define the time of the


Figure 9.6: An example of a Markov chain with countably many (thus infinite number of)
states. In this example, P (An+1 ← An ) = 1/2 and P (A1 ← An ) = 1/2. This Markov chain
is positive recurrent. See Example 9.4.8.


Figure 9.7: An example of a Markov chain with countably many states. In this example,
P (An+1 ← An ) = 1 − 1/n and P (A1 ← An ) = 1/n. This Markov chain is recurrent, but
not positive recurrent. See Example 9.4.9.

first return to a state by

τ1 = inf_{n≥1} {n : Xn = X0 }.    (9.52)

The expected time of return to a particular state is

E[τ1 |X0 ] = ∑_{n=1}^∞ n P (τ1 = n).    (9.53)

Example 9.4.6. Consider the Markov chain illustrated in Fig. 9.3. (a) Compute the probability that the first return to state A is in exactly n steps. (b) Compute the expected return time to state A.
Solution.

(a) Observe that if the first return to state A is in exactly 1 step, then the state must have transitioned from A back to A (a self-loop). If the first return to state A is in exactly n steps (where n ≥ 2), then the state must have transitioned from A to B on step 1, then remained at B n − 2 times, and then returned from B to A on step n:

P (τ1 = n|X0 = A) = P (X1 = B|X0 = A) · ( P (Xk = B|Xk−1 = B) )^{n−2} · P (Xn = A|Xn−1 = B)
= { 0.7, if n = 1;  0.3 · (0.5)^{n−2} · 0.5, otherwise }.

(b) The expected return time to state A is

E[τ1 |X0 = A] = 1 · P (τ1 = 1) + 2 · P (τ1 = 2) + 3 · P (τ1 = 3) + · · ·
= 1 (0.7) + 2 (0.3)(0.5) + 3 (0.3)(0.5)(0.5) + · · · = 1.6 .

Example 9.4.7. Consider the Markov chain illustrated in Fig. 9.4. (a) Compute the probability that the first return to the state A occurs in exactly n steps. (b) Compute the expected return time to the state A.
Solution.

(a) Observe that when the state transitions out of A it is guaranteed to transition to B, and from there to C, and from there back to A:

P (τ1 = n|X0 = A) = { 1, if n = 3;  0, otherwise }.

(b) The expected return time to state A is

E[τ1 |X0 = A] = 1 · P (τ1 = 1) + 2 · P (τ1 = 2) + 3 · P (τ1 = 3) + · · · = 1 (0) + 2 (0) + 3 (1) + 4 (0) + · · · = 3.

Example 9.4.8. Consider the Markov chain illustrated in Fig. 9.6. (a) Compute the probability that the first return to state A1 is in exactly n steps. (b) Compute the expected return time to state A1 .

Solution.

(a) Notice that if the first return time is n, then the state must have transitioned from A1 to A2 to A3 and so on up to A_{n−1} , and then returned to A1 on the n th step:

P (τ1 = n|X0 = A1 ) = (1/2) · (1/2) · · · (1/2) = 1/2^n .

(b) The expected return time to state A1 is

E[τ1 |X0 = A1 ] = 1 · P (τ1 = 1) + 2 · P (τ1 = 2) + 3 · P (τ1 = 3) + · · ·
= 1 (1/2) + 2 (1/4) + 3 (1/8) + 4 (1/16) + · · · = 2.

Example 9.4.9. Consider the Markov chain illustrated in Fig. 9.7. (a) Compute the probability that the first return to state A1 is in exactly n steps. (b) Compute the expected return time to the state A1 .
Solution.

(a) Notice that if the first return time is n, then the state must have transitioned from A1 to A2 to A3 and so on up to A_{n−1} , and then returned to A1 on the n th step:

P (τ1 = n|X0 = A1 ) = (1/1) · (1/2) · (2/3) · (3/4) · · · ((n − 2)/(n − 1)) · (1/n) = 1/((n − 1)n).

(b) The expected return time to state A1 is

E[τ1 |X0 = A1 ] = 1 · P (τ1 = 1) + 2 · P (τ1 = 2) + 3 · P (τ1 = 3) + · · ·
= 1 (0) + 2 (1)(1/2) + 3 (1/2)(1/3) + 4 (1/3)(1/4) + · · · → ∞.

Definition 9.4.10 (Transient, Recurrent, Positive Recurrent). A state i is said to be tran-


sient if, given that we start in state i, there is a non-zero probability that we never return
to i. A state i is said to be recurrent if the probability that we never return to i is zero
(even if the expected return time is infinite). A state i is said to be positive-recurrent if the
expected return time is finite.

The distinction between recurrent and positive recurrent becomes important when an-
alyzing Markov chains on countable state spaces.

Exercise 9.4. Give an example of a Markov chain with an infinite number of states,
which is irreducible and aperiodic (prove it), but which does not converge to an equilibrium
probability distribution.

[Diagram: the 3-bit hypercube with vertices 000, 001, . . . , 111; edges connect strings that differ in a single bit.]

Figure 9.8: Sampling on a Hypercube

Definition 9.4.11 (Ergodic). A state is said to be ergodic if the state is aperiodic and
positive-recurrent. A Markov chain is said to be ergodic if it is irreducible and if every state
is ergodic.

Ergodicity can be restated as follows: a MC is ergodic if it is aperiodic and if there is a finite number k∗ such that any state can be reached from any other state in exactly k∗ steps. (For the example of Eq. (9.58), k∗ = 2.)
There are other (alternative) descriptions of ergodicity. A particularly intuitive one is: the MC is ergodic if it is aperiodic and irreducible. (Notice that this characterization still works if we replace positive-recurrence by irreducibility, but the combination of irreducibility and positive-recurrence without aperiodicity does not guarantee ergodicity.) In this course we will not go into the related mathematical formalities and details, largely considering generic, i.e. ergodic, MCs.

Ergodicity and Sampling

Markov chains are often used to generate samples of a desired distribution. One can imagine a particle that travels over a graph according to the weights of the edges. If the Markov chain is ergodic, then after some time the probability distribution of the particle becomes stationary (one says that the chain is mixed), and from then on the trajectory of the particle represents a sample of the distribution. Important information about the distribution, such as its moments or the expectation values of functions, can be obtained by analyzing the trajectory of the particle.
Imagine that you need to generate a random string of n bits. There are 2^n possible configurations. You can organize these configurations on a hypercube graph with 2^n vertices, where each vertex has n neighbors, corresponding to the strings that differ from it by a single bit, as in Fig. 9.8. Our Markov chain will walk along these edges and flip one bit at a time. The trajectory after a long time will correspond to a series of random strings. The important question is how long we should wait before our Markov chain becomes mixed (loses memory of the initial condition). To answer this question we should look at the MC from a more mathematical point of view.

9.4.3 Evolution of the Probability State Vector

We began the section by defining a Markov chain in terms of the weighted, directed graph (V, E, p) and analyzing it by generating sample trajectories. A rigorous analysis involves the probability state vector, or simply the state vector, whose i th component represents the probability that the system is in the state i at the moment of time t:

π(t) := (πi (t))_{i∈V}  where  πi (t) := P (Xt = i).    (9.54)

Thus, πi ≥ 0 and ∑_{i∈V} πi = 1.
The probability state vector evolves according to

∀i ∈ V, ∀t = 0, · · · : πi (t + 1) = ∑_{j:(i←j)∈E} p_{ij} πj (t).    (9.55)

We can also rewrite Eq. (9.55) in the vector/matrix form

π(t + 1) = p π(t),    (9.56)

where p := {p_{ji} }, called the transition probability matrix, is the matrix whose (i, j) component is the probability of transitioning from state j to state i. It is a stochastic matrix (defined below).

Definition 9.4.12. A matrix is called stochastic if all of its components are nonnegative
and each column sums to 1.

To analyze the Markov chain after k sequential steps, we consider repeated application of Eq. (9.56), which results in

π(t + k) = p^k π(t).    (9.57)

We are therefore interested in analyzing the properties of the matrix p^k .

Example 9.4.13. Find the stochastic matrix associated with Fig. 9.3. Is the corresponding Markov chain reducible?

Solution. For Fig. (9.3), the stochastic matrix is:

p = [ 0.7 0.5
      0.3 0.5 ]

To determine whether the Markov chain is irreducible, we examine p^k for large k:

p^2 = [ 0.64 0.6        p^{10} ≈ p^{100} ≈ [ 0.625 0.625
        0.36 0.4 ] ,                          0.375 0.375 ] .    (9.58)

A Markov chain is irreducible if each state is accessible from every other state. If π(k) is the vector of probabilities for each state at time k, given initial probabilities π(0), then we observe that for the initial conditions corresponding to each of the two states, π(0) = (1, 0)^T and π(0) = (0, 1)^T , every entry of π(k) = p^k π(0) is non-zero. Therefore every state has non-zero probability at time k, and we conclude that the Markov chain is irreducible.
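The matrix computations of this example take a few lines of numpy (a sketch):

```python
import numpy as np

p = np.array([[0.7, 0.5],
              [0.3, 0.5]])
print(np.linalg.matrix_power(p, 2))        # [[0.64, 0.6], [0.36, 0.4]]
print(np.linalg.matrix_power(p, 100))      # both columns -> (0.625, 0.375)
# every entry of p^k is strictly positive, consistent with irreducibility
```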

Example 9.4.14. Find the stochastic matrix associated with Fig. 9.5. Is the corresponding Markov chain reducible?
Solution. For Fig. (9.5), the stochastic matrix is:

p = [ 0.8 0.9 0.0
      0.2 0.0 0.0
      0.0 0.1 1.0 ] .

Subsequent powers of p are:

p^2 = [ 0.82 0.72 0.00        p^{10} ≈ [ 0.71 0.65  0.00
        0.16 0.18 0.00                    0.14 0.133 0.00
        0.02 0.10 1.00 ] ,  and           0.14 0.21  1.00 ] .

Repeated matrix multiplication shows that the first two entries of the third column are zero for all k. This means that states “A” and “B” are inaccessible from state “C”. Therefore the Markov chain is reducible.

Steady State Analysis

Definition 9.4.15 (Stationary Distribution). The probability state vector (if it exists) that satisfies

π ∗ = pπ ∗    (9.59)

is called the stationary distribution or invariant measure. (Recall, to be a state vector, each component must be nonnegative, and the components must sum to unity.)

Theorem 9.4.16 (Existence of a Stationary Distribution). A Markov chain has a unique stationary distribution, to which it converges from any initial condition, iff it is ergodic. (Equivalently, iff it is irreducible, aperiodic and all of its states are positive recurrent.)

Solving Eq. (9.59) for the example of Eq. (9.58), one finds

π ∗ = ( 0.625
        0.375 ) ,    (9.60)

which is naturally consistent with Eq. (9.58).


In general, the stochastic matrix for an ergodic Markov chain has one eigenvalue sat-
isfying λ∗ = 1. The stationary distribution π ∗ is the `1 normalized eigenvector associated
with the unit eigenvalue
e
π∗ = P , (9.61)
i ei

And how about other eigenvalues of the transition matrix?


An important practical consequence of the ergodicity is that the steady state is unique
and it is universal. Universality means that the steady state does not depend on the initial
condition. It may now be timely to ask: why do we care about uniqueness of the steady state,
invariance with respect to the initial condition and ergodicity? The most straightforward
answer is because it allows us to design powerful techniques to explore complicated phase
space. Markov Chain Monte Carlo (MCMC), which will be discussed in chapter 10, is
one such technique. At the moment, it is important to appreciate that understanding
different properties of MC (and latter MCMC algorithms) allows us to use Markov chains
to solve complicated inference and (machine) learning problems in Data Science and related
disiplines efficiently. When we turn from analysis of a particular MC to the design of
a desirable MC we will be stating the “desires” in terms of uniqueness, invariance and
ergodicity. Ergodicity will ensure convergence to a unique desirable probability distribution.
Moreover, MCMC allows us to generate samples which are drawn independently from ANY
probability distribution. Generating independent samples is generally difficult, but Markov
chains helps us to solve the problem bypassing independence and creating much easier to
generate dependent samples. We will eventually make the samples independent if we repeat
MC many times and show that the sample/state we have started with is forgotten after
sufficiently many steps.

Spectrum of the Transition Matrix & Speed of Convergence to the Stationary Distribution

Assume that p is diagonalizable (has n = |V | linearly independent eigenvectors); then we can decompose p according to the eigen-decomposition

p = U ΣU^{−1} ,    (9.62)

where Σ = diag(λ1 , · · · , λn ), 1 = λ1 ≥ |λ2 | ≥ · · · ≥ |λn |, and U is the matrix whose columns are the right eigenvectors of p (each normalized to unit l2 norm). Then the evolution of an initial stochastic vector, π(0), in discrete time t = 1, · · · , is given by

π(t) = p^t π(0) = (U ΣU^{−1} )^t π(0) = U Σ^t U^{−1} π(0).    (9.63)

Let us represent the initial π(0) as an expansion over the normalized eigenvectors, uᵢ,
i = 1, · · · , n, of p:

π(0) = Σ_{i=1}^{n} aᵢ uᵢ . (9.64)

Taking into account orthonormality of the eigenvectors one derives

π(t) = λ1^t ( a1 u1 + a2 (λ2/λ1)^t u2 + · · · + an (λn/λ1)^t un ). (9.65)

Since lim_{t→∞} π(t) = π∗ = u1, we get that a1 = 1, and the second term on the rhs of
Eq. (9.65) describes the rate of convergence of π(t) to the steady state as t → ∞. The
convergence is exponential in t with the rate log(λ1/|λ2|) = −log|λ2|.

Example 9.4.17. Find the eigenvalues for the MC shown in Fig. (9.9) with the transition
matrix

p = (  0    5/6  1/3
      5/6    0   1/3  . (9.66)
      1/6   1/6  1/3 )

What defines the speed of convergence of the MC to a steady state?
Solution. Let us start by noticing that p is stochastic. If the initial probability distribution
is π(0), then the distribution after t steps is

π(t) = p^t π(0). (9.67)



Figure 9.9: Illustration of the Detailed Balance (DB): the three-state MC with the transition probabilities of Eq. (9.66).

As t increases, π(t) approaches a stationary distribution π∗ (since the Markov chain is
ergodic — a property which is easy to check for this MC), such that

pπ∗ = π∗. (9.68)

Thus, π∗ is an eigenvector of p with eigenvalue 1, with all components positive and normalized.
The matrix (9.66) has three eigenvalues, λ1 = 1, λ2 = 1/6, λ3 = −5/6, and the corresponding
eigenvectors are

π∗ = (2/5, 2/5, 1/5)^T , u2 = (−1/2, −1/2, 1)^T , u3 = (−1, 1, 0)^T . (9.69)

Suppose that we start in the state “A”, i.e. π(0) = (1, 0, 0)^T. We can write the initial state
as a linear combination of the eigenvectors,

π(0) = π∗ − u2/5 − u3/2, (9.70)

and then

π(t) = p^t π(0) = π∗ − (λ2^t/5) u2 − (λ3^t/2) u3. (9.71)

Since |λ2| < 1 and |λ3| < 1, in the limit t → ∞ we obtain π(t) → π∗. The speed
of convergence is defined by the eigenvalue (λ2 or λ3) which has the greatest absolute
value.
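
A short Julia sketch verifying this example numerically (the numbers should match Eq. (9.69)):

using LinearAlgebra

# Eigen-decomposition of the transition matrix (9.66) and the convergence
# of π(t) = pᵗ π(0) to π* = (2/5, 2/5, 1/5).
p = [0    5/6  1/3;
     5/6  0    1/3;
     1/6  1/6  1/3]
vals, vecs = eigen(p)
println("eigenvalues: ", vals)              # expect -5/6, 1/6, 1 (in some order)
e = vecs[:, argmax(real.(vals))]            # eigenvector of the unit eigenvalue
println("π* = ", real.(e) ./ sum(real.(e))) # ℓ1-normalization, Eq. (9.61)
π0 = [1.0, 0.0, 0.0]                        # start in state "A"
for t in (1, 5, 20)
    println("π($t) = ", round.(p^t * π0, digits = 4))
end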

Note that the considered situation generalizes to the following powerful statement (see
[16] for details):

Theorem 9.4.18 (Perron-Frobenius Theorem). An ergodic Markov chain with transition
matrix p has a unique eigenvector π∗ with eigenvalue 1, and all of its other eigenvectors have
eigenvalues with absolute value less than 1.

Some Additional Properties of Markov Chains

Definition 9.4.19 (Reversible). A MC is called reversible if there exists π∗ such that

∀{i, j} ∈ E : pji πi∗ = pij πj∗ , (9.72)

where {i, j} is our notation for the undirected edge, assuming that both directed edges
(i ← j) and (j ← i) are elements of the set E.

In physics this property is also called Detailed Balance (DB). If one introduces the
so-called ergodicity matrix

Q := (Qji = pji πi∗ | (j ← i) ∈ E), (9.73)

then DB translates into the statement that Q is symmetric, Q = Q^T. A MC for which
the property does not hold is called irreversible: for an irreversible MC, Q − Q^T is nonzero,
i.e. Q is asymmetric. The antisymmetric component of Q is the matrix built from the
currents/flows (of probability). Thus, for the case shown in Fig. (9.3),

Q = ( 0.7 · 0.625   0.5 · 0.375 ) = ( 0.4375  0.1875 ) . (9.74)
    ( 0.3 · 0.625   0.5 · 0.375 )   ( 0.1875  0.1875 )

Q is symmetric, i.e. even though p12 ≠ p21, there is still no flow of probability from 1 to 2:
as the “populations” of the two states, π1∗ and π2∗ respectively, are different, Q12 − Q21 = 0.
In fact, one observes that in the two-node situation the steady state of the MC is always in
DB.
Note that if a steady distribution, π∗, satisfies the DB condition (9.72) for a MC,
(V, E, p), it will also be a steady state of another MC, (V, E, p̃), satisfying the more general
Balance (or global balance) B-condition

Σ_{j:(j←i)∈E} p̃ji πi∗ = Σ_{j:(i←j)∈E} p̃ij πj∗ . (9.75)

This suggests that many different MCs (many different dynamics) may result in the same
steady state. Obviously, DB is a particular case of the B-condition (9.75).
The difference between the DB- and B-conditions can be nicely interpreted in terms of flows
(think water) in the state space. From the hydrodynamic point of view, reversible MCMC
corresponds to irrotational probability flows, while irreversibility relates to a nonzero
rotational part, e.g. corresponding to vortices contained in the flow. Putting it formally, in
the irreversible case the skew-symmetric part of the ergodic flow matrix, Q = (p̃ij πj∗ | (i ← j) ∈ E),
is nonzero, and it allows the following cycle decomposition,

Qij − Qji = Σ_α Jα ( C^α_ij − C^α_ji ), (9.76)

where the index α enumerates cycles on the graph of states with the adjacency matrices C^α,
and Jα stands for the magnitude of the probability flux flowing over the cycle α.
One can use the cycle decomposition to modify a MC such that the steady distribution
stays the same (invariant). Of course, cycles should be added with care, e.g. to make sure
that all the transition probabilities in the resulting p̃ are positive (stochasticity of the
matrix will be guaranteed by construction). The procedure of “adding cycles”, along with
some additional tricks (e.g. the so-called lifting/replication), may help to improve mixing,
i.e. speed up convergence to the steady state — which is a very desirable property for
sampling π∗ efficiently.

Example 9.4.20. Given a stationary solution, π∗ = (π1∗, π2∗, π3∗), construct a three-state
Markov chain, i.e. present a (3 × 3) transition matrix, p, that (a) is of a general position
(satisfies global balance); (b) satisfies detailed balance. Are the constructions unique? Find
the spectrum of the transition matrix in case (b) and verify that the Perron-Frobenius
theorem 9.4.18 holds. In case (b) formulate and solve an example of the fastest mixing
MC. Can one generalize the solution and find the fastest mixing MC of size n, given π∗ =
(π1∗, · · · , πn∗)? Return to the three-state MC and impose the constraint that all the
diagonal elements of the transition probability matrix (corresponding to self-loops on the fully
connected three-state graph) are zero, p(1, 1) = p(2, 2) = p(3, 3) = 0. Is the MC unique in
this case? Is it ergodic?
Solution. See Mathematica snippet MC-3nodes.nb (or pdf-printout MC-3nodes.pdf) posted
at the D2L site of the course.
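
Since the Mathematica snippet is not reproduced here, the following hedged Julia sketch illustrates one standard (by no means unique) answer to part (b): the Metropolis construction, which yields a transition matrix in detailed balance with any prescribed π∗. It is offered as an illustration, not necessarily the construction used in MC-3nodes.nb.

# A Metropolis-type construction of a column-stochastic matrix p in
# detailed balance with a prescribed π*.
function metropolis_chain(w::Vector{Float64})
    n = length(w)
    q = 1 / (n - 1)                       # uniform proposal of any other state
    p = zeros(n, n)
    for j in 1:n, i in 1:n
        i == j || (p[i, j] = q * min(1.0, w[i] / w[j]))  # accept w.p. min(1, πᵢ*/πⱼ*)
    end
    for j in 1:n
        p[j, j] = 1 - sum(p[:, j])        # rejections go to the self-loop
    end
    return p
end

w = [0.5, 0.3, 0.2]                       # a target π*
p = metropolis_chain(w)
Q = p .* w'                               # Qᵢⱼ = pᵢⱼ πⱼ*; detailed balance ⇔ Q = Qᵀ
println("detailed balance: ", Q ≈ Q', ";  stationarity pπ* = π*: ", p * w ≈ w)

Detailed balance immediately implies stationarity, which the second check confirms; the construction is clearly not unique, since any symmetric proposal would do.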

Example 9.4.21. Let Σ = {x0, x1, · · · , xK−1} be K equidistant points on the circle, i.e.,
xk = e^{2πik/K}. Let α, β ∈ (0, 1) and γ be constants that satisfy α + β + γ = 1, and consider
the random walk (Xt) defined by

P(Xt+1 = xk+1 | Xt = xk) = α, (9.77a)
P(Xt+1 = xk−1 | Xt = xk) = β, (9.77b)
P(Xt+1 = xk | Xt = xk) = γ, (9.77c)

with the indices understood mod K.

Let π be the unique stationary distribution.



(a) For what values of α, β, and γ (and K) is the Markov chain ergodic?

(b) What is the stationary distribution? (intuitive arguments preferred.)

(c) For what values of α and β does the Markov chain satisfy detailed balance?

(d) Let p denote the transition matrix. Find exact expressions for the eigenvalues of p.
Hint: the linear transformation represented by p is a convolution operator, i.e., there
is a g such that (pv)ₖ = (g ⋆ v)ₖ = Σ_ℓ g(x_{k−ℓ}) v(x_ℓ) for all K-vectors v (identifying
K-vectors with functions on Σ), and thus p can be diagonalized by the discrete Fourier
transform.

(e) The spectral gap of p is 1 − |λ′|, where λ′ is the second largest (in absolute value)
eigenvalue of p. The size of the spectral gap determines how fast an ergodic chain
converges to its stationary distribution: the larger the gap, the faster the convergence.
Suppose γ = 0.98 and α = β. Use the result of the previous part to find the spectral
gap of p to leading order in 1/K as K → ∞.

(f) Are there initial distributions that converge to the stationary distribution at a rate
faster than the second largest eigenvalue? If so, give an example. If not, explain why
not.

Solution.

(a) The Markov chain is irreducible if γ < 1 and aperiodic if γ > 0. (If K is odd, then
α, β > 0 is also sufficient for aperiodicity.)

(b) First, observe that the stationary distribution satisfies

βπ(xk+1) + απ(xk−1) + γπ(xk) = π(xk). (9.78)

If we set π(x) = 1/K, we see that this solves the equation. Since the chain is
irreducible, the uniform distribution is therefore the unique stationary distribution.

(c) To satisfy detailed balance, we must have

απ(xk) = βπ(xk+1). (9.79)

This holds if and only if β = α. So the chain satisfies detailed balance whenever
β = α = (1 − γ)/2.

(d) The transition matrix has the form

p = ( γ  α  0  0  · · ·  β
      β  γ  α  0  · · ·  0
      0  β  γ  α  · · ·  0
      ⋮        ⋱  ⋱      ⋮
      0  · · ·  0  β  γ  α
      α  · · ·  0  0  β  γ ) , (9.80)

i.e., p is “circulant.” The matrix acts by convolution on K-vectors, and so can be
diagonalized by the discrete Fourier transform. That is, if we let U be the K × K
matrix with elements

U_{kℓ} = exp(2πikℓ/K), k, ℓ ∈ {0, · · · , K − 1}, (9.81)

then p = UΛU⁻¹, where Λ is diagonal. We can check this directly: let u_ℓ denote the
ℓth column of U. Then

p u0 = (1, 1, · · · , 1)^T ,
p u1 = (γ + αe^{2πi/K} + βe^{−2πi/K}) · (1, e^{2πi/K}, · · · , e^{2πi(K−1)/K})^T , · · · . (9.82)

In general, the eigenvalues are

λ_ℓ = γ + αe^{2πiℓ/K} + βe^{−2πiℓ/K} = γ + (α + β) cos(2πℓ/K) + i(α − β) sin(2πℓ/K).

(e) From the above, we have

|λ_ℓ|² = (γ + (α + β) cos(2πℓ/K))² + (α − β)² sin²(2πℓ/K). (9.83)

Setting α = β = (1 − γ)/2, we have

|λ_ℓ|² = (γ + (1 − γ) cos(2πℓ/K))². (9.84)

To find the second largest eigenvalue, we observe that |λ_ℓ|² = f(2πℓ/K) where

f(θ) = (γ + (1 − γ) cos θ)². (9.85)

A simple calculation shows that the only critical points of f are θ = 0 and θ = π,
with f(0) = 1 and f(π) = (2γ − 1)² ∈ (0, 1). By continuity, when K is large we expect
the second largest eigenvalue to occur at ℓ = ±1, i.e.,

λ±1 = γ + (1 − γ) cos(2π/K). (9.86)

To leading order in 1/K, this is 1 − 2π²(1 − γ)/K², and thus the gap is 2π²(1 − γ)/K².



(f) If we take the eigenvector u_ℓ for ℓ ∉ {0, 1, K − 1} (recall that |λ_{K−1}| = |λ_1|), then
|λ_ℓ| < |λ_1|. By orthogonality, the entries of u_ℓ sum to 0, and Re(u_ℓ) lies in the
eigenspace of λ_ℓ. Therefore any initial distribution of the form π + a Re(u_ℓ) (with a
small enough that all entries remain non-negative) would converge to the stationary
distribution at a rate given by |λ_ℓ|.
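
The following hedged Julia sketch checks parts (d)–(e) numerically (the helper circulant_p and the chosen values of K are ours), comparing the exact spectral gap with the leading-order asymptotic derived above:

using LinearAlgebra

# Circulant random walk on K points: build p, compute the spectrum, and
# compare 1 - |λ'| with the asymptotic 2π²(1-γ)/K² (γ = 0.98, α = β).
function circulant_p(K, α, β, γ)
    p = zeros(K, K)
    for k in 1:K
        p[k, k] = γ
        p[mod1(k + 1, K), k] = α        # x_k → x_{k+1}
        p[mod1(k - 1, K), k] = β        # x_k → x_{k-1}
    end
    return p
end

γ = 0.98; α = β = (1 - γ) / 2
for K in (20, 100, 500)
    λ = abs.(eigvals(circulant_p(K, α, β, γ)))
    gap = 1 - sort(λ, rev = true)[2]    # 1 minus the second largest |λ|
    println("K = $K: gap = $gap,  asymptotic = ", 2π^2 * (1 - γ) / K^2)
end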

Exercise 9.5 (Hardy-Weinberg Law). Consider an experiment of mating rabbits. We follow
the inheritance of a particular gene that appears in two types, G or g. A rabbit has a pair
of genes, either GG (dominant), Gg (hybrid — the order is irrelevant, so gG is the same
as Gg) or gg (recessive). In mating two rabbits, the offspring inherits a gene from each
of its parents with equal probability. Thus, if we mate a dominant (GG) with a hybrid
(Gg), the offspring is dominant with probability 1/2 or hybrid with probability 1/2. Start
with a rabbit of given character (GG, Gg, or gg) and mate it with a hybrid. The offspring
produced is again mated with a hybrid, and the process is repeated through a number of
generations, always mating with a hybrid.
Note: The first experiment of this kind was conducted in 1858 by Gregor Mendel. He
started to breed garden peas in his monastery garden and analyzed the offspring of these
matings.

(a) Write down the transition matrix P of the Markov chain thus defined. Is the Markov
chain irreducible and aperiodic?
(b) Assume that we start with a hybrid rabbit. Let π(n) = (πGG(n), πGg(n), πgg(n)) be
the probability distribution vector (state) of the character of the rabbit of the n-th
generation. In other words, πGG(n), πGg(n), πgg(n) are the probabilities that the n-th
generation rabbit is GG, Gg, or gg, respectively. Compute π(1), π(2), π(3). Is there
some kind of law?
(c) Calculate P^n for general n. What can you say about π(n) for general n?
(d) Calculate the stationary distribution of the MC, π∗ = (π∗GG, π∗Gg, π∗gg). Does
Detailed Balance hold in this case?

9.5 Stochastic Optimal Control: Markov Decision Process

In the preceding sections of this chapter we discussed Markov processes. Two chapters ago
we also discussed Optimal Control. Let us now merge the two topics and consider
what comes under the name of the Markov Decision Process (MDP).
Specifically, an MDP represents a stochastic, discrete-space, discrete-time version of the
stochastic optimal control problem, which is stated as follows [21]:

Figure 9.10: General Scheme of Markov Decision Process. Drawing is from [21].

• Given:

– Set of states, S [nodes of a graph, or squares if we play the example of the 3 × 4
Grid World discussed in the following].
– Set of actions, A [associated with arrows connecting nodes/squares].
– P : S × A × S → [0, 1] – transition probabilities between states, dependent on
actions and sliced (sequentially) in time, P(s, a, s′) = P(st+1 = s′ | st = s, at = a).
– R : S × A × S → R – rewards/costs, R(s, a, s′), where st+1 = s′ is the next state,
st = s the current state, and at the current action taken.
– We consider the problem over the infinite time horizon.
– γ ∈ (0, 1] – the discount factor (a reward collected at time t is discounted by γ^t,
i.e. less reward as time progresses).

• Goal:

– to maximize the expected sum of rewards over the policy, π : S → A, defined as
a function mapping st to at:

π∗ = argmax_{π(·)} E[ Σ_{t=0}^{∞} γ^t R(st, π(st), st+1) ], (9.87)

where the averaging is over the random transitions, governed by P(s, a, s′).

Notice that in the present formulation R(·, ·, ·) and P(·, ·, ·) do not depend explicitly on time.
(Generalizations are possible, but we will not discuss them in the lectures, leaving them for
independent study.)

9.5.1 Bellman Equation & Dynamic Programming

The expectation on the right hand side of Eq. (9.87) is called the global reward, i.e. the reward
accumulated over the entire (infinite) time horizon under the given (not necessarily optimal)
policy, π. However, it makes sense to discuss not only the global reward but also the respective
expected reward evaluated over the time horizon τ, called the value function:

∀τ ∈ [1, · · · , ∞], ∀s0 : Vτ^π(s0) := E_{s1,s2,···}[ Σ_{t=0}^{τ−1} γ^t R(st, π(st), st+1) ] (9.88)
= Σ_{s1,···,sτ} ( Π_{t′=0}^{τ−1} P(st′, π(st′), st′+1) ) Σ_{t=0}^{τ−1} γ^t R(st, π(st), st+1),

which depends on the initial state, s0, and the policy, π(·). Observe that the right hand
side of Eq. (9.88) can also be expressed in terms of the value function evaluated at the
preceding, τ − 1, step:

= Σ_{s1,···,sτ} ( Π_{t′=0}^{τ−1} P(st′, π(st′), st′+1) ) ( R(s0, π(s0), s1) + γ Σ_{t=0}^{τ−2} γ^t R(st+1, π(st+1), st+2) )
= E_{s′}[ R(s0, π(s0), s′) + γ Vτ−1^π(s′) ]. (9.89)

Let us now introduce the current optimal (over policy) value

∀τ, ∀s : Vτ∗(s) := max_{π(·)} Vτ^π(s).

Then, optimizing both sides of Eq. (9.89) over the policy, we derive

∀τ, ∀s : Vτ∗(s) = max_a E_{s′}[ R(s, a, s′) + γ Vτ−1∗(s′) ]. (9.90)

The recursion, suggested by Bellman in his seminal work of the early 1950s, shows that the
optimal solution of Eq. (9.87) can be found by solving Eq. (9.90). Moreover, once the optimal
value at the step τ is found we can also find the so-called optimal policy:

∀τ, ∀s : πτ∗(s) = argmax_a Σ_{s′} P(s, a, s′) [ R(s, a, s′) + γ Vτ−1∗(s′) ], (9.91)

defined as a function/map from S to A. These equations, (9.90) and (9.91), called the Bellman
equations, represent yet another example of what also comes under the (already familiar)
name of Dynamic Programming.
A few important remarks are in order:
• The recurrent Eqs. (9.90) translate into the value-iteration Algorithm 5, which we
illustrate on the example of the Grid World below.

• Similarly to the state-dependent value function, one may also introduce the value of
taking action a0 in a state s0 under a policy π:

∀τ, s0, a0 : Qτ^π(s0, a0) = E_{s1,s2,···}[ R(s0, a0, s1) + Σ_{t=1}^{τ−1} γ^t R(st, π(st), st+1) ]
= E_{s′}[ R(s0, a0, s′) + γ Vτ−1^π(s′) ]. (9.92)

Therefore, instead of working with Vτ^π(s0) we can re-state the entire DP approach
and the optimization procedure in terms of Qτ^π(s0, a0) – the action-value function of
the state s0 and action a0 under policy π – also called the Q-function. In particular,
the action-value version of Eq. (9.90) becomes

∀τ, s0, a0 : Qτ∗(s0, a0) = E_{s′}[ R(s0, a0, s′) + γ max_{a′} Qτ−1∗(s′, a′) ]. (9.93)

• There exists an alternative iterative way of solving the optimization problem (9.87)
via a policy-iteration algorithm. In this case the approach is to alternate between
the following two steps till convergence: (a) policy evaluation: for the current (not
necessarily optimal) policy run the value-iteration algorithm, solving Eqs. (9.89) either
for a fixed number of steps (in τ) or till a pre-defined tolerance is achieved; (b) policy
improvement: update the policy according to

∀τ, ∀s : πτ(s) ← argmax_a Σ_{s′} P(s, a, s′) [ R(s, a, s′) + γ Vτ−1^π(s′) ].

This policy-iteration approach may be algorithmically advantageous when the policy
“freezes” faster than the value/cost.

9.5.2 MDP: Grid World Example

An MDP may be considered as an interactive probabilistic game one plays (against the
computer). The game consists of defining transition rates between the states to achieve
certain objectives. Once optimal (or sub-optimal) rates are fixed, the implementation becomes
just a Markov Process.
Let us play this ’Grid World’ game with the rules illustrated in Fig. (9.11). An agent
lives on the (3 × 4) grid. Walls block the agent’s path. The agent’s actions do not always
go as planned: for example, 80% of the time the action ’North’ takes the agent North (if there
is no wall there), 10% of the time the action ’North’ actually takes the agent West, and 10%
East. If there is a wall where the agent would have been taken, she stays put. A big reward, +1,
or penalty, −1, comes at the end — the game ends after reaching either of the two terminal
states. Visiting any other state does not result in a reward.

Figure 9.11: Canonical example of MDP from ’Grid World’ game.

Figure 9.12: Optimal solution set of actions (arrows) for each state, for each time.

Figure 9.13: Value Iteration in Grid World (noise = 0.2, γ = 0.9, two terminal states with R = +1 and −1).

Let us now translate these rules into formulas:

s ∈ {(1, 1), (1, 2), (1, 3), (1, 4), (2, 1), (2, 3), (2, 4), (3, 1), (3, 2), (3, 3), (3, 4)}
(the cell (2, 2) is a wall),
a ∈ {↑, ↓, ←, →},
P((2, 1)|(1, 1), ↑) = 0.8, P((1, 1)|(1, 1), ↑) = 0.1, P((1, 2)|(1, 1), ↑) = 0.1,
P((1, 2)|(1, 1), →) = 0.8, P((1, 1)|(1, 1), →) = 0.1, P((2, 1)|(1, 1), →) = 0.1, · · · ,

R(s, a, s′) = { +1, s = (3, 4);  −1, s = (2, 4);  0, s ≠ (3, 4), (2, 4) }.

Vτ∗(s) is the expected sum of rewards accumulated when starting from state s and acting
optimally for a horizon of τ steps.
To find the optimal actions (policy), one may consider the Value Iteration algorithm
(MDP – Value Iteration, Algorithm 5). One can show that the algorithm outputs the solution
of the value iteration Eqs. (9.90).
The MDP value iteration algorithm is illustrated for the Grid World example in Fig. (9.13).

Algorithm 5 MDP – Value Iteration (finite horizon version)

Input: Set of states, S; set of actions, A; transition probabilities between states, P(s′|s, a);
rewards/costs, R(s, a, s′); discount factor γ.

∀s : V0∗(s) = 0
for τ = 0, · · · , H − 1 do
  ∀s : Vτ+1∗(s) ← max_a Σ_{s′} P(s′|s, a) [R(s, a, s′) + γVτ∗(s′)] – [Bellman update/back-up]
  – the expected sum of rewards accumulated when starting from state s and acting
  optimally for a horizon of τ + 1 steps
end for

Some of the numbers (for the optimal value function) at the first three iterations are derived
as follows:

V1∗((3, 3)) = P((3, 4)|(3, 3), →) · V0∗((3, 4)) · γ = 0.8 · 1 · 0.9 = 0.72,
V2∗((3, 3)) = (P((3, 4)|(3, 3), →) · V1∗((3, 4)) + P((3, 3)|(3, 3), →) · V1∗((3, 3))) · γ
  ≈ (0.8 · 1 + 0.1 · 0.72) · 0.9 ≈ 0.78,
V3∗((2, 3)) = (P((3, 3)|(2, 3), ↑) · V2∗((3, 3)) + P((2, 4)|(2, 3), ↑) · V2∗((2, 4))) · γ
  ≈ (0.8 · 0.72 + 0.1 · (−1)) · 0.9 ≈ 0.43,

where the terminal values are kept fixed at Vτ∗((3, 4)) = 1 and Vτ∗((2, 4)) = −1.

We also observe (empirically) that as τ → ∞ the optimal value functions show a well defined
limit, i.e. the expected sum of rewards accumulated when acting optimally freezes (becomes
stationary) in the limit.
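
To experiment with these numbers, here is a hedged Julia sketch of Algorithm 5 on this Grid World. The conventions below are assumptions on our part (textbook treatments differ in how they handle terminals): terminal cells keep their reward values, all other transitions carry zero reward, the wall occupies cell (2, 2), and noise = 0.2, γ = 0.9 as in Fig. 9.13.

const ROWS, COLS = 3, 4
const WALL  = (2, 2)
const TERMS = Dict((3, 4) => 1.0, (2, 4) => -1.0)
const MOVES = Dict(:N => (1, 0), :S => (-1, 0), :E => (0, 1), :W => (0, -1))
const SLIPS = Dict(:N => (:W, :E), :S => (:W, :E), :E => (:N, :S), :W => (:N, :S))

instate(s) = 1 <= s[1] <= ROWS && 1 <= s[2] <= COLS && s != WALL
move(s, m) = (s2 = s .+ MOVES[m]; instate(s2) ? s2 : s)   # bump into a wall: stay

# Σ_{s'} P(s'|s,a) V(s'): 80% intended direction, 10% each orthogonal slip
backup(V, s, a) = 0.8 * V[move(s, a)] + sum(0.1 * V[move(s, sl)] for sl in SLIPS[a])

γ = 0.9
V = Dict(s => get(TERMS, s, 0.0)
         for s in Iterators.product(1:ROWS, 1:COLS) if instate(s))
for τ in 1:3
    Vnew = copy(V)
    for s in keys(V)
        haskey(TERMS, s) && continue          # terminal values stay pinned
        Vnew[s] = γ * maximum(backup(V, s, a) for a in keys(MOVES))
    end
    global V = Vnew
    println("τ = $τ: V(3,3) = ", round(V[(3, 3)], digits = 2),
            ", V(2,3) = ", round(V[(2, 3)], digits = 2))
end

Running this reproduces V1∗((3, 3)) = 0.72 and V2∗((3, 3)) ≈ 0.78; the τ = 3 value at (2, 3) comes out slightly above the rough estimate in the text, which dropped subleading terms.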

Exercise 9.6. Consider the following modification of the Grid World example: no discount,
γ = 1; the terminal penalty at s = (2, 4) increases in absolute value (from −1) to −2, i.e.

R(s, a, s′) = { +1, s = (3, 4);  −2, s = (2, 4);  0, s ≠ (3, 4), (2, 4) }.

(a) Compute the optimal values of the expected sum of rewards for the first three iterations of
Algorithm 5, i.e. find ∀s : V1∗(s), V2∗(s), V3∗(s).
(b) Find the optimal policy for the first three iterations, i.e. find ∀s : a∗1 = π1∗(s), a∗2 =
π2∗(s), a∗3 = π3∗(s).
(c) Re-state the original MDP formulation (9.87) as a linear program. [Hint: take
advantage of the linearity of the value-iteration way of solving the MDP according to
Eqs. (9.90).]

(d) What we have discussed so far in this Section is the case of a deterministic policy, where
π is a map/function from S to A. It may be advantageous to consider a stochastic policy,
where in each state a number of actions can be taken with different probabilities; in this
case π(s, a) becomes a function of two variables (state and action) describing the probability
of taking action a when the state is s. Suggest a stochastic-policy modification of Eq. (9.87).

We will return to the MDP (and the grid world example) in Section ??, discussing
reinforcement learning.


9.6 Queuing Networks
9.6.1 Queuing: a bit of History & Applications ∗
There are a number of books written on the subject. The book of Frank Kelly and Elena
Yudovina [22] is recommended.
Agner Krarup Erlang, a Danish engineer who worked for the Copenhagen Telephone
Exchange, published the first paper on what would now be called queueing theory in 1909.
He modeled the number of telephone calls arriving at an exchange by a Poisson process,
solved the M/D/1/∞ queue in 1917, and solved the M/D/k/∞ queuing model in 1920.
The notations are now standard in Queuing theory – a discipline traditionally considered
as a part of Operations Research, with deep connections to stochastic processes. In
M/D/k/∞, for example,

• M stands for Markov or memoryless, and it means that arrivals occur according to a
Poisson process. (Arrivals may also be deterministic, D.)

• D stands for deterministic and means that the jobs arriving at the queue require a
fixed (deterministic) amount of service/processing. Processing can also be stochastic,
Markovian (or non-Markovian, in which case it is customary to mark it as G – generic
service; arrivals can also be G = generic).

• k describes the number of servers at the queueing node, k = 1, 2, .... If there are more
jobs at the node than there are servers, then jobs will queue and wait for service.

• ∞ stands for the allowed size of the queue (waiting room) – in this case there is no limit
to the waiting-room capacity (everybody arriving is admitted to the queue – nobody is
denied).

∗ This is an auxiliary section which can be dropped at the first reading. Material from this section will
not contribute to the midterm and final exams.

Figure 9.14: On the left: Markov Chain representation of the M/M/1 queue. In the standard
situation considered ∀i : λi = λ, µi = µ. On the right: reduced graphical description of a
single queue.

We will only be dealing with the case of an ∞ waiting room, thus dropping the last
argument.
The M/M/1 queue is a simple model where a single server handles jobs that arrive
according to a Poisson process and have exponentially distributed service requirements.
In an M/G/1 queue the G stands for general and indicates an arbitrary probability
distribution.
Many mathematicians and math-engineers have contributed to the subject since 1930 —
Pollaczek, Khinchin, Kendall, Kingman, Jackson, Kelly and others.
Applications: call centers, logistics (at different scales), manufacturing, checkout at the
supermarket, processing of electric vehicles at charging stations, etc. In general, any
kind of practical system where arrivals (of whatever units are coming in) and processing fit
the framework. We are talking about design which would

• Manage the queue (control its size).

• Keep processing units busy (good utilization).

• Keep the waiting time in the queue under control.

9.6.2 Single Open Queue = Birth/Death Process. Markov Chain Representation.

Let us discuss the M/M/1 queue in detail. We start by playing with the Java modeling tool –
JMT (it can be downloaded from http://jmt.sourceforge.net/Download.html).
The process is also called a birth-death process — the name is clear from the Markov-Chain
representation shown in Fig. (9.14). The MC has infinitely many states, each representing the
number of customers in the system (waiting room). Arrival of customers is modeled as a Poisson
process with the arrival rate λ. We assume that all customers are identical. The customers
are taken from the waiting room based on the availability of the server, and the service is
completed with the rate µ of the other Poisson process.

Everything is Poisson here (recall that merging and splitting of Poisson processes
is Poisson again).
Let us analyze this (relatively simple) system. We start by finding the steady state
of the Markov Chain: ∀i = 0, · · · , ∞, Pi, where Pi is the probability that the system is in
the i-th state, i.e. with i customers in the queue.
The balance equations are

# 0 customers: µP1 = λP0, (9.94)
# 1 customer: λP0 + µP2 = (λ + µ)P1, (9.95)
# n customers: λPn−1 + µPn+1 = (λ + µ)Pn, (9.96)

where in each equation the terms λPn−1 and µPn+1 describe arrivals and departures, respectively.

Resolving the equations (sequentially), and requiring that the total probability is
normalized, Σ_{i=0}^{∞} Pi = 1, we derive

Pn = ( Π_{i=0}^{n−1} λ/µ ) P0 = (λ/µ)^n P0 = ρ^n P0, (9.97)

1 = Σ_{n=0}^{∞} Pn = P0 Σ_{n=0}^{∞} ρ^n = P0/(1 − ρ), (9.98)

Pn = (1 − ρ) ρ^n, (9.99)

where ρ := λ/µ is the traffic intensity.


The average size of the queue is

Σ_{n=0}^{∞} n Pn = (1 − ρ) Σ_{n=0}^{∞} n ρ^n = ρ/(1 − ρ). (9.100)

We observe that the average queue becomes infinite at ρ = 1, i.e. the steady state exists
only when ρ < 1. This criterion (existence of the steady state) can also be referred to as
“stability”.
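
A hedged Julia sketch that simulates the M/M/1 birth-death dynamics and checks the steady-state prediction (9.100):

# Sanity check of Eqs. (9.99)-(9.100): the time-averaged M/M/1 queue size
# should approach ρ/(1-ρ) for ρ < 1.
function mm1_mean_queue(λ, μ, T)
    n, t, area = 0, 0.0, 0.0
    while t < T
        rate = λ + (n > 0 ? μ : 0.0)
        dt = -log(rand()) / rate            # exponential holding time
        area += n * dt
        t += dt
        n += (rand() < λ / rate ? 1 : -1)   # arrival vs departure
    end
    return area / t                         # time-averaged queue size
end

λ, μ = 0.8, 1.0
ρ = λ / μ
println("simulated: ", mm1_mean_queue(λ, μ, 1e6), ",  theory ρ/(1-ρ) = ", ρ / (1 - ρ))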
Exercise: Consider a single M/M/m queue, i.e. the system where the number of servers
is m. Derive the steady state. What is the modified stability criterion? Can a single-queue
system with m = 2 be unstable?
In this simple queue system we can also study the transient time dynamics. The steady-state
system of Eqs. (9.96) transitions to

∀n : (d/dt) Pn = λPn−1 + µPn+1 − (λ + µ)Pn, (9.101)

with the arrival term λPn−1 and the departure term µPn+1.

11
12
2
23
30
1
24 3
01 14
41 43
4
33
40

Figure 9.15: Example of a Queuing network.

The solution of this system can be found in analytic form:

Pk(t) = e^{−(λ+µ)t} ( ρ^{(k−i)/2} I_{k−i}(at) + ρ^{(k−i−1)/2} I_{k+i+1}(at)
        + (1 − ρ) ρ^k Σ_{j=k+i+2}^{∞} ρ^{−j/2} I_j(at) ), (9.102)

where a := 2√(λµ), I_k(x) is the modified Bessel function of the first kind, and it is assumed
that the system was in the state i at t = 0.

Example 9.6.1. Derive Eq. (9.102) from Eq. (9.101). Compute the distribution of the
busy period of the server. Assuming the first-come-first-served policy, compute the distribution
of the waiting time and the distribution of the total time in the system.

We can also write the dynamical Eq. (9.101) in the following matrix form:

(d/dt) P = P^{(tr)} P, (9.103)

where P^{(tr)} is the tridiagonal matrix whose n-th row reads (· · · 0 λ −(λ + µ) µ 0 · · ·),
with λ in column n − 1, −(λ + µ) on the diagonal and µ in column n + 1.
Notice that in the steady state (achievable at ρ < 1) the Detailed Balance (DB) does not
hold, P^{(tr)}_{nm} P^{(st)}_m ≠ P^{(tr)}_{mn} P^{(st)}_n.

9.6.3 Generalization to (Jackson) Networks. Product Solution for the Steady State.

It appears that the description of a single queue can be extended to a network, e.g. of the type
shown in Fig. (9.15). The λ’s are the arrival and processing rates – now denoted in the same way
and indexed by the nodes/stations from which the process is coming and to which it is going
(two indexes). We study P(n1, · · · , nN; t), the probability distribution over the state of the
entire network, and, like in the single-queue case, write the balance (also called Master)
equation – stated for any state of the network at any time:

∂P(n; t)/∂t = Σ_{(i,j)∈E} λij [ (ni + 1) P(· · · , ni + 1, · · · , nj − 1, · · · ; t) − ni P(· · · , ni, · · · , nj, · · · ; t) ]
+ Σ_{i∈V} λ0i [ P(· · · , ni − 1, · · · ; t) − P(· · · , ni, · · · ; t) ]
+ Σ_{i∈V} λi0 [ (ni + 1) P(· · · , ni + 1, · · · ; t) − ni P(· · · , ni, · · · ; t) ], (9.104)

where in the first sum the first term describes customers leaving i for j and the second
describes customers staying at i.

This equation is written here for the case of M/M/∞, i.e. when the number of servers at
each node is infinite — this is the case when jobs do not wait but are taken for processing
(by tellers which are always available) immediately.

Example 9.6.2. Write down an M/M/m version of Eq. (9.104).

Remarkably, the (complicated looking) Eq. (9.104) allows an explicit steady-state solution
(for any graph!):

P(n) = Z⁻¹ Π_{i∈V} hi^{ni}/ni! ,

∀i ∈ V : −hi Σ_{j≠0} λij + Σ_{j≠0} λji hj + λ0i − λi0 hi = 0,

which is also called the product solution/factorization (the name reflects the structure).


A few important things to mention here:

• This is a product-form solution which IS NOT a Gibbs (equilibrium) distribution.
[Remember our discussions of the Fokker-Planck equation.]

• The system is stable (the solution is finite) if ∀i ∈ V : hi < mi.

• h is a “single-customer” object.

Example 9.6.3. Generalize the steady-state formula and reformulate the stability criterion
for the general M/M/m case.

Example 9.6.4. By analogy with what was discussed for the single-queue case, state the
skewed detailed balance relation. Show how analysis of the skewed DB leads to the product-state
solution for the steady state.

9.6.4 Heavy Traffic Limit

Our discussion here is (mainly) based on the material from
http://www.columbia.edu/~ww2040/A1a.html.
The heavy traffic limit applies when either of the following two cases (or some special
combination of the two, which we will not discuss here) takes place (we discuss the
single-queue case, to make the arguments simpler):

• The number of servers is fixed and the traffic intensity (utilization), λ/µ, approaches
unity (from below). The queue-length approximation is the so-called “reflected Brownian
motion”.

• The traffic intensity is fixed and the number of servers and the arrival rate are increased
to infinity. Here the queue-length limit converges to the normal distribution.

Let us give some intuitive picture and then pose a number of technical questions/challenges
(some of these with answers not yet fully known).
When the system is congested, i.e. when most of the time it has many customers, an
arriving customer will need to wait a long time. Assuming the FIFO (first in, first out)
protocol, the customer joining a queue with L customers will see the queue go down to zero
before departing. However, in this time of L + 1 departures, there will also be many arrivals.
If the traffic intensity, ρ, is close to 1, the number of arrivals will be of order L as well. Thus,
when the customer leaves, the queue left behind will be comparable to the queue observed
when the customer arrived. For the system to go from average to empty will take more like
L busy periods.

Example 9.6.5. Assume that 1 − ρ ≪ 1 and estimate:

• How much time does a typical customer spend in the system?

• How long does it take for the system to change from an average/typical filling to empty?

Hint (following from the preceding discussion): the time scale at which the system
changes is much longer than the time scale of a single customer.

We therefore have a time-scale separation and may study a Q-system with
many customers on two scales, fluid and diffusive. Let X(t) be some Q-system-related
process. X̄n(t) = n⁻¹X(nt) defines the fluid re-scaling by n. This means that we measure
time in units of n and we measure the state (# of customers) in units of n. As
n → ∞ we shall look for n⁻¹X(nt) → X̄(t), where X̄(t) is the fluid limit.
At this scale, as n → ∞, the arrival process and the service process have fluid limits
λt and µt, which means that they are deterministic. As we said, queueing is the result of
variability, and so on the fluid scale, when input and output are not variable, there
will be no real queueing behavior in the system. We may see the queue length grow
linearly indefinitely (ρ > 1), or go to zero linearly and then stay at 0 (ρ < 1), or we may see
it constant (ρ = 1). For queueing networks we may observe piecewise-linear behavior of
queue lengths. This will capture changes in the queue on the fluid scale: the queue changes
by n in a time of order n. The stochastic fluctuations of a queue in steady state are scaled
down to be identically 0 and uninteresting.
The diffusion scaling looks at the difference between the process and its fluid limit, and
measures the time in units of n and the state (counts of customers) in units of √n. The
diffusion re-scaling of A(t) by n is Ân(t) = √n (Ān(t) − Ā(t)). As n → ∞ we
shall look (in analogy with the Central Limit Theorem) for Ân(t) converging in the sense of
distribution to Â(t), the diffusion limit — a diffusion process, such as Brownian
motion or reflected Brownian motion. The diffusion limit captures the random fluctuations
of the system around its fluid limit.
Here is a (formal) statement on the heavy-traffic asymptotic for the waiting time (including
both fluid and diffusive limits). Consider a sequence of G/G/1 queues indexed by j. For queue
j let Tj denote the random inter-arrival time and Sj the random service time; let ρj = λj/µj
denote the traffic intensity, with λj = 1/E(Tj) and µj = 1/E(Sj); let Wq,j denote the waiting
time in queue for a customer in steady state; and let αj = −E[Sj − Tj] and βj² = var[Sj − Tj].
If Tj → T and Sj → S in distribution, and ρj → 1, then (2αj/βj²) Wq,j converges in
distribution to an exp(1) random variable, provided that: (a) Var[S − T] > 0, and (b) for
some δ > 0, E[Sj^{2+δ}] and E[Tj^{2+δ}] are both less than some constant C, ∀j.
Chapter 10

Elements of Inference and Learning

Statistical Inference describes a set of (inference) tasks/operations over the statistical model
of a phenomenon. These tasks, assuming knowledge of the statistical model, include:
(a) sampling from the probability distribution; (b) computing marginal probabilities; (c)
finding the most likely configuration/state. However, the statistical model may or may not
be known. In the latter case one needs to learn the model before posing and resolving the
challenge of inference.
In the following we will first discuss statistical inference and then shift our attention
to the discussion of learning the statistical models we aim to infer.

10.1 Statistical Inference: Sampling and Stochastic Algorithms
10.1.1 Monte-Carlo Algorithms: General Concepts and Direct Sampling

This lecture should be read in parallel with the respective IJulia notebook file. Monte-Carlo
(MC) methods refer to a broad class of algorithms that rely on repeated random
sampling to obtain results. They are named after Monte Carlo — the city — which once was
the capital of gambling, i.e. of playing with randomness. MC algorithms can be used for
numerical integration, e.g. computing weighted sums of many contributions, expectations,
marginals, etc. MC can also be used in optimization.
Sampling is a selection of a subset of individuals/configurations from within a statistical
population to estimate characteristics of the whole population.
There are two basic flavors of sampling: Direct Sampling MC — mainly discussed in this
lecture — and Markov Chain MC. DS-MC focuses on drawing independent samples from a
distribution, while MCMC draws samples which are correlated (according to the underlying
Markov Chain).
Let us illustrate both on the simple example of the ’pebble game’ — calculating the value
of π by sampling the interior of a circle.

Direct-Sampling by Rejection vs MCMC for the ‘pebble game’

In this simple example we construct a distribution which is uniform within a circle from
another distribution which is uniform within a square containing the circle. We use the
direct product of two rand() calls to generate samples within the square and then simply reject
the samples which are not in the interior of the circle.
In the respective MCMC we build a sample (parameterized by a pair of coordinates) by
taking the previous sample and adding some random independent shifts to both variables, also
making sure that when the sample crosses a side of the square it reappears on the opposite
side. The sample ”walks” the square, but to compute the area of the circle we count only
the samples which are within the circle (rejection again).
See the IJulia notebook associated with this lecture for an illustration.
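
Since the notebook is not reproduced here, the following hedged Julia sketch shows both estimators in a few lines:

# 'Pebble game' estimates of π: direct sampling by rejection vs an MCMC
# walker on the unit square with periodic boundaries.
function pi_direct(N)
    hits = 0
    for _ in 1:N
        x, y = rand(), rand()
        hits += ((x - 0.5)^2 + (y - 0.5)^2 <= 0.25)   # inside the inscribed circle
    end
    return 4hits / N
end

function pi_mcmc(N; δ = 0.1)
    x, y, hits = rand(), rand(), 0
    for _ in 1:N
        x = mod(x + δ * (2rand() - 1), 1.0)   # the walker re-enters on the
        y = mod(y + δ * (2rand() - 1), 1.0)   # opposite side of the square
        hits += ((x - 0.5)^2 + (y - 0.5)^2 <= 0.25)
    end
    return 4hits / N
end

println("direct: ", pi_direct(10^6), ",  mcmc: ", pi_mcmc(10^6))

Both estimators converge to π, but the MCMC samples are correlated, so for the same N the MCMC estimate fluctuates more — a preview of the mixing-time discussion below.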

Direct Sampling by Mapping

Direct Sampling by Mapping consists in the application of a deterministic function to samples
from a distribution you know how to sample from. The method is exact, i.e. it produces
independent random samples distributed according to the new distribution. (We will discuss
formal criteria for independence in the next lecture.)
For example, suppose we want to generate exponential samples, yi ∼ p(y) = exp(−y) —
the one-dimensional exponential distribution over [0, ∞) — provided that a one-dimensional
uniform oracle, which generates independent samples xi from [0, 1], is available. Then
yi = − log(xi) generates the desired (exponentially distributed) samples.
Another example of DS-MC by mapping is given by the Box-Muller algorithm, which is
a smart way to map a two-dimensional random variable distributed uniformly within a box
to a two-dimensional Gaussian (normal) random variable:

∫∫ (dx dy/2π) e^{−(x²+y²)/2} = ∫₀^{2π} (dϕ/2π) ∫₀^{∞} r dr e^{−r²/2}
  = ∫₀^{2π} (dϕ/2π) ∫₀^{∞} dz e^{−z} = ∫₀¹ dθ ∫₀¹ dψ = 1,

with z = r²/2, θ = ϕ/(2π), ψ = e^{−z}. Thus, the desired mapping is (ψ, θ) → (x, y), where
x = √(−2 log ψ) cos(2πθ) and y = √(−2 log ψ) sin(2πθ).
See the IJulia notebook associated with this lecture for numerical illustrations.
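
A hedged Julia sketch of the mapping (the notebook presumably contains a fuller treatment):

using Statistics

# Box-Muller: two independent uniforms (ψ, θ) ↦ a standard normal pair (x, y).
function box_muller()
    ψ, θ = rand(), rand()
    r = sqrt(-2 * log(ψ))
    return (r * cos(2π * θ), r * sin(2π * θ))
end

xs = [box_muller()[1] for _ in 1:10^6]
println("mean ≈ ", round(mean(xs), digits = 3), ", var ≈ ", round(var(xs), digits = 3))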
CHAPTER 10. ELEMENTS OF INFERENCE AND LEARNING 305

Direct Sampling by Rejection (another example)

Let us now show how to get a positive Gaussian (normal) random variable from an exponential
random variable through rejection. We do it in two steps.

• First, one samples from the exponential distribution:

x ∼ p0(x) = { e^{−x}, x > 0;  0, otherwise }.

• Second, aiming to get a sample from the positive half of the Gaussian,

x ∼ p(x) = { √(2/π) exp(−x²/2), x > 0;  0, otherwise },

one accepts the generated sample with the probability

p(x)/(M p0(x)) = (1/M) √(2/π) exp(x − x²/2),

where M is a constant which should be larger than max_x(p(x)/p0(x)) = √(2/π) e^{1/2} ≈
1.32, to guarantee that the acceptance probability is ≤ 1 for all x > 0.

Note that the rejection algorithm has the advantage of being applicable even when the
probability densities are known only up to a multiplicative constant. (We will discuss
issues related to this constant, also called in the multivariate case the partition function,
extensively.)
See the IJulia notebook associated with this lecture for a numerical illustration.
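
The two steps above translate into the following hedged Julia sketch:

using Statistics

# Rejection sampling of the positive half-Gaussian from exponential proposals.
const M = sqrt(2 / π) * exp(0.5)     # ≈ 1.32, the bound computed above

function half_gaussian()
    while true
        x = -log(rand())                              # x ~ p₀(x) = e^{-x}, x > 0
        accept = sqrt(2 / π) * exp(x - x^2 / 2) / M   # = p(x) / (M p₀(x)) ≤ 1
        rand() < accept && return x
    end
end

xs = [half_gaussian() for _ in 1:10^5]
println("mean ≈ ", mean(xs), "  (theory √(2/π) ≈ ", sqrt(2 / π), ")")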
We also recommend, for additional reading on DS-MC:

• Introduction to Direct Sampling, a chapter of the Monte Carlo Lecture Notes by
J. Goodman (NYU);

• Lecture on Monte Carlo Sampling, from the Berkeley course of M. Jordan on Bayesian
Modeling and Inference.

Importance Sampling

One important application of MC is in computing sums, integrals and expectations. Suppose
we want to compute the expectation of a function, f(x), over the distribution, p(x), i.e.
∫dx p(x) f(x), in the regime where f(x) and p(x) are concentrated around very different x.
In this case the overlap of f(x) and p(x) is small, and as a result a lot of MC samples drawn
from p(x) will be ’wasted’.
Importance Sampling (IS) is the method which aims to fix the small-overlap problem. The
method is based on adjusting the distribution function from p(x) to p̃(x) and then utilizing
the following obvious formula:

E_p[f(x)] = ∫ dx p(x) f(x) = ∫ dx p̃(x) (f(x)p(x)/p̃(x)) = E_{p̃}[ f(x)p(x)/p̃(x) ].

See the IJulia notebook associated with this lecture contrasting the DS example,
p(x) = (1/√(2π)) exp(−x²/2) and f(x) = exp(−(x − 4)²/2), with IS where the choice of the
proposal distribution is p̃(x) = (1/√π) exp(−(x − 2)²). This example shows that we are
clearly wasting samples with DS.
Note one big problem with IS: in a realistic multi-dimensional case it is not easy to
guess the proposal distribution, p̃(x), right. One way of fixing this problem is to search for
a good p̃(x) adaptively.
A comprehensive review of the history and state of the art in Importance Sampling can
be found in multiple lecture notes of A. Owen posted at his web page, for example follow
this link. Check also the adaptive importance sampling package.
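
A hedged Julia sketch of the comparison described above (by Gaussian integration the exact answer is e⁻⁴/√2 ≈ 0.0129):

using Statistics

# DS vs IS for E_p[f] with p = N(0,1), f(x) = exp(-(x-4)²/2), and the
# proposal q = N(2, 1/2) suggested above.
f(x) = exp(-(x - 4)^2 / 2)
p(x) = exp(-x^2 / 2) / sqrt(2π)
q(x) = exp(-(x - 2)^2) / sqrt(π)

is_sample() = (x = 2 + randn() / sqrt(2); f(x) * p(x) / q(x))  # draw x ~ q, re-weight by p/q

N = 10^5
ds = mean(f(randn()) for _ in 1:N)        # direct sampling from p
is = mean(is_sample() for _ in 1:N)       # importance sampling from q
println("DS: $ds,  IS: $is,  exact: ", exp(-4) / sqrt(2))

In fact, for this particular choice f(x)p(x)/p̃(x) is constant (equal to e⁻⁴/√2), so the IS estimator has zero variance — illustrating that the ideal proposal is always p̃ ∝ f·p.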

Direct Brute-force Sampling

This algorithm relies on the availability of the uniform sampling algorithm from [0, 1], rand().
One splits the [0, 1] interval into pieces according to the weights of all possible states and
then uses rand() to select the state. The algorithm is impractical, as it requires keeping in
memory information about all possible configurations. The use of this construction is
in providing the benchmark case useful for proving independence of samples.

Direct Sampling from a multi-variate distribution with a partition function oracle

Suppose we have an oracle capable of computing the partition function (normalization) for a
multivariate probability distribution and also for any of its marginal probabilities. (Notice
that we are ignoring for now the issue of the oracle complexity.) Does it give us the power
to generate independent samples?
We get an affirmative answer to this question through the following decimation algorithm,
generating an independent sample x ∼ P(x), where x := (xi | i = 1, · · · , N).
Validity of the algorithm follows from the exact representation of the joint probability
distribution function as a product of ordered conditional distribution functions (the chain rule

Algorithm 6 Decimation Algorithm

Input: P(x) (expression). Partition function oracle.

1: x^(d) = ∅; I = ∅
2: while |I| < N do
3:   Pick i at random from {1, · · · , N} \ I.
4:   x^(I) = (xj | j ∈ I)
5:   Compute P(xi | x^(d)) := Σ_{x\xi : x^(I) = x^(d)} P(x) with the oracle.
6:   Generate random xi ∼ P(xi | x^(d)).
7:   I ← I ∪ i
8:   x^(d) ← x^(d) ∪ xi
9: end while

Output: x^(d) is an independent sample from P(x).

for distributions):

P (x1 , · · · , xn ) = P (x1 )P (x2 |x1 )P (x3 |x1 , x2 ) · · · P (xn |x1 , · · · , xn−1 ). (10.1)

(The chain rule follows directly from the Bayes rule/formula. Notice also that the ordering of
variables within the chain rule is arbitrary.) One way of proving that the algorithm produces
an independent sample is to show that the algorithm’s outcome is equivalent to another
algorithm for which the independence is already proven. The benchmark algorithm we can
use to state that the Decimation algorithm (6) produces independent samples is the brute-force
sampling algorithm described in the beginning of the lecture. The crucial point here
is that the decimation algorithm can be interpreted in terms of splitting the [0, 1] interval
hierarchically, first according to P(x1), then subdividing the pieces for different x1 according
to P(x2|x1), etc. This gedanken experiment results in the desired proof.
Note that in general the effort of the partition function oracle is exponential in the problem
size. However, in some special cases the partition function can be computed efficiently
(polynomially in the number of steps).
In the following exercise we suggest testing the performance of the direct sampling algorithm
on the example of the Ising model. We remind the reader that in the case of the Ising model
the probability of a binary vector (spin configuration), x, is given by

p(x) = exp(−βE(x))/Z,  E(x) = −(1/2) Σ_{{i,j}∈E} xi Jij xj + Σ_{i∈V} hi xi, (10.2)

Z = Σ_x exp(−βE(x)). (10.3)

Exercise 10.1. Consider the example of the Ising model with zero singleton term, h = 0, and
uniform pair-wise term, Jβ = −1, i.e. Jij β = −1, ∀{i, j} ∈ E, over the n × n grid-graph with
nearest-neighbor interaction. Construct (write down a pseudo-algorithm and then code)
the decimation algorithm (6).
Compare the performance of the direct sampling for n = 2, 3, 4, 5 — that is, find out how the
time required to generate the next i.i.d. sample depends on n — and explain.

10.1.2 Inference via Markov-Chain Monte-Carlo

Markov Chain Monte Carlo (MCMC) methods belong to the class of algorithms for sampling
from a probability distribution which are based on constructing a Markov chain that converges
to the target steady distribution.
Examples and flavors of MCMC are many (and some are quite similar) — heat bath,
Glauber dynamics, Gibbs sampling, Metropolis-Hastings, the Cluster algorithm, the Worm
algorithm, etc. In all these cases we only need to know the transition probabilities between
states, while the actual stationary distribution may be unknown or, more accurately, known
only up to a normalization factor, also called the partition function. Below we will discuss in
detail two key examples: Gibbs sampling and Metropolis-Hastings.

Gibbs Sampling

Assume that direct sampling is not feasible because of the high dimensionality of the
state vector (too many components): computation of the joint probability distribution and
its marginalizations (over a small number of components) is of ”exponential” complexity
(more on this below). The main point of Gibbs sampling is that, even though sampling
from the joint probability distribution is not feasible, sampling from the probability
distribution of a few components (conditioned on the rest of the state vector) can be done
efficiently. We utilize this remarkable feature of the conditional probability distributions and
create the Gibbs Sampling algorithm (Algorithm 7). The algorithm starts from a current
sample of the vector x, picks a component at random, computes the probability distribution
for this component/variable conditioned on the rest (the other components of the state
vector), and samples from this conditional distribution. As mentioned above, the conditional
distribution is over a single component, and it is therefore easy to compute. We continue
the process till convergence, which can be identified empirically, for example by checking
whether the estimation of the histogram or of observable(s) has stopped changing.

Example 10.1.1. Describe Gibbs sampling on the example of a general Ising model. Build
respective Markov chain. Show that the algorithm obeys Detailed Balance.

Algorithm 7 Gibbs Sampling

Input: Given p(xi | x∼i := x \ xi), ∀i ∈ {1, · · · , N}. Start with a sample x^(t).

loop Till convergence
  Draw an i.i.d. i from {1, · · · , N}.
  Generate a random xi ∼ p(xi | x^(t)∼i).
  x^(t+1)_i = xi.
  ∀j ∈ {1, · · · , N} \ i : x^(t+1)_j ← x^(t)_j.
  Output x^(t+1) as the next sample.
end loop

Solution. Starting from a state, x^(t), we pick a random node, i, and compare two candidate
states (xi = +1 and xi = −1). Then we calculate the corresponding conditional (all spins
except i are fixed) probabilities, p(xi = +1 | x^(t)∼i) and p(xi = −1 | x^(t)∼i), which we
denote p+ and p−, respectively. By construction,

p+ + p− = 1,  p+/p− = e^{−β∆E},  ∆E = E(xi = +1, x^(t)∼i) − E(xi = −1, x^(t)∼i), (10.4)

where ∆E is the energy difference between the two configurations. Next, one accepts
the configuration xi = +1 with the probability p+ or the configuration xi = −1 with the
probability p−.
The MC corresponding to the algorithm is defined over the N-dimensional hypercube,
where N is the number of spins and 2^N is the number of states. To check the DB condition,
assume that the conditional probabilities, p±, correspond to the steady state and compute
the probability fluxes from the state x^(t) to the states (xi = ±1, x^(t)∼i). We derive

Q−+ = (1/Z) e^{−βE(xi=−1, x^(t)∼i)} p+,  Q+− = (1/Z) e^{−βE(xi=+1, x^(t)∼i)} p−. (10.5)

Combining Eqs. (10.4) and (10.5), we observe that Q−+ = Q+−, i.e. the detailed
balance holds.

Metropolis-Hastings Sampling

Metropolis-Hastings (MH) sampling is an MCMC method which is built, like Gibbs sampling,
assuming that the desired stationary distribution, π̃(x), is known explicitly up to
the normalization constant (also called the partition function); thus the normalized
distribution is π(x) = π̃(x)/Z with Z := Σ_x π̃(x). Let us also introduce the so-called
”proposal” distribution, p(x′|x), and assume that drawing a sample proposal x′ from the
current sample x is (computationally) easy. The combination of π̃(x) and p(x′|x), as well as
an arbitrary initialization of the state vector, x^(t), set up the MH Algorithm 8. Starting with
x^(t), the algorithm draws a sample, x′, according to the proposal distribution, p(x′|x^(t)), but
then accepts or rejects the proposed state, x′, according to the probability

min{ 1, p(x^(t)|x′) π̃(x′) / ( p(x′|x^(t)) π̃(x^(t)) ) }. (10.6)

The procedure is repeated till (empirical) convergence.
Observe that by construction

∀x, x′ : min{1, p(x|x′)π̃(x′)/(p(x′|x)π̃(x))} p(x′|x) π̃(x) = min{1, p(x′|x)π̃(x)/(p(x|x′)π̃(x′))} p(x|x′) π̃(x′), (10.7)

where on the two sides the product of the min-factor and the proposal is the MH transition
probability (x′ ← x) and (x ← x′), respectively; i.e. the algorithm satisfies the Detailed
Balance condition.

Algorithm 8 Metropolis-Hastings Sampling

Input: Given π̃(x) and p(x′|x). Start with a sample x^(t).

1: loop Till convergence
2:   Draw a random x′ ∼ p(x′|x^(t)).
3:   Compute α = p(x^(t)|x′) π̃(x′) / ( p(x′|x^(t)) π̃(x^(t)) ).
4:   Draw a random β ∈ U([0, 1]), uniform i.i.d. from [0, 1].
5:   if β < min{1, α} then
6:     x^(t) ← x′ [accept]
7:   else
8:     x′ is ignored [reject]
9:   end if
10:  x^(t) is recorded as a new sample
11: end loop
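
A generic hedged Julia sketch of Algorithm 8 for a one-dimensional continuous state; with a symmetric random-walk proposal (our choice here) the ratio p(x|x′)/p(x′|x) in α cancels:

using Statistics

# Metropolis-Hastings with target π̃ known only up to normalization.
πt(x) = exp(-x^2 / 2) + 0.5 * exp(-(x - 3)^2 / 2)   # unnormalized bimodal target

function mh(πt, N; σ = 1.0)
    x = 0.0
    xs = Vector{Float64}(undef, N)
    for n in 1:N
        x′ = x + σ * randn()              # symmetric proposal
        α = πt(x′) / πt(x)                # acceptance ratio of Eq. (10.6)
        if rand() < min(1.0, α)
            x = x′                        # accept
        end                               # otherwise reject: keep the old x
        xs[n] = x                         # the current state is the next sample
    end
    return xs
end

xs = mh(πt, 10^5)
println("sample mean ≈ ", mean(xs))       # expect (0·1 + 3·0.5)/1.5 = 1.0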

Note that Gibbs sampling, introduced before, can be considered as Metropolis-Hastings
without rejection, where the proposal distribution is chosen specifically to be the respective
conditional probability distribution. (That is, Gibbs sampling should be considered as a
special case of the more general MH algorithm.)

Example 10.1.2. Consider the MC shown in Fig. (10.1). Show that this MC is ergodic and
can be viewed as a particular MH Algorithm 8 for the two-spin Ising model. What is the
proposal distribution in this case? What is the resulting stationary distribution? Does it
obey Detailed Balance? Will the steady distribution change if the rejection is removed from
the consideration? How?

Figure 10.1: Example of the MC induced by Metropolis-Hastings for a two-spin example.
Solution. The cardinality (size) of the state space is 2² = 4: x = (x1 = ±1, x2 = ±1).
Inspecting the MC we observe that it is aperiodic and positive-recurrent, thus ergodic.
Observe that some of the transition probabilities are exactly 1/2. Therefore, it is natural
to associate these with the cases when Eq. (10.6) returns unity. Next, we link self-loops
in the MC to rejections, that is the cases when a proposed state is rejected. The two
observations combined suggest that we are dealing with an instance of the MH Algorithm
8 where

π̃(x1, x2) = exp(−βE(x)),  E(x = (x1, x2)) = −x1 x2/2,  p(x′|x) = exp(−β(E(x′) − E(x))).

That is, the stationary distribution is of the attractive (ferromagnetic) type with unit
pair-wise strength and without bias (with zero magnetic field). The proposal distribution,
p(x′|x), is such that only a single spin flip is allowed (one cannot flip two spins in one step), and
the proposal may be accepted only if the resulting energy gain is positive, i.e. when the
algorithm proposes to move from the state where the spins are misaligned to the aligned
state. Removal of the rejection translates into setting β to zero (the temperature becomes
infinite). In this case the resulting distribution is uniform (so-called paramagnetic).

The freedom in selecting the MH proposal distribution translates into strong variations in
the resulting algorithm’s mixing time. Typically we would like to find a proposal which mixes
fast. Even though evaluating the mixing time of a proposal is a challenge, one can estimate
it empirically with the following heuristic: if the largest distance between the states
(measured in the number of the algorithm’s elementary steps) is L, the MH algorithm
executes a random walk, i.e. it advances diffusively, covering the distance L in T ∼ L² steps.
This is then a lower-bound estimate of the mixing time. Notice that the actual mixing
time, i.e. the time to arrive at a sample which is almost independent of the initial sample,
may be significantly longer, e.g. due to rejections (if these are frequent). This slow (diffusive)
exploration of the phase space by the MH algorithm is linked to DB.
The particular form of the MH proposal, illustrated in Example 10.1.2 for the two-spin
Ising model, generalizes to the so-called Glauber (dynamics) Algorithm 9. (Check the
snippet illustrating the performance of the Glauber algorithm on a 128 × 128 square lattice.)

Algorithm 9 Glauber Sampling

Input: Ising model on a graph. (See Eq. (10.2).) Start with a sample x.

1: loop Till convergence
2:   Pick a node i at random.
3:   xi ← −xi [tentatively flip the spin]
4:   Compute α = exp( 2xi ( Σ_{j∈V:{i,j}∈E} Jij xj − hi ) ), where xi is the already
     flipped value (the inverse temperature β is absorbed into J and h).
5:   Draw a random u ∈ U([0, 1]), uniform i.i.d. from [0, 1].
6:   if α < u then
7:     xi ← −xi [reject: flip the spin back]
8:   end if
9:   Output: x as a sample
10: end loop
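
For concreteness, a compact hedged Julia version of Algorithm 9 for the ferromagnetic Ising model on a periodic n × n lattice with h = 0 (we absorb βJ into a single coupling constant K; cf. the 128 × 128 course snippet):

# Glauber/Metropolis single-spin-flip dynamics for the 2D Ising model.
function glauber!(x, K, nsweeps)
    n = size(x, 1)
    nxt(i) = mod1(i + 1, n); prv(i) = mod1(i - 1, n)
    for _ in 1:(nsweeps * n^2)
        i, j = rand(1:n), rand(1:n)
        nb = x[nxt(i), j] + x[prv(i), j] + x[i, nxt(j)] + x[i, prv(j)]
        ΔE = 2K * x[i, j] * nb           # (β-absorbed) energy change of a flip
        if rand() < exp(-ΔE)             # accept with probability min(1, e^{-ΔE})
            x[i, j] = -x[i, j]
        end
    end
    return x
end

n = 32
x = rand((-1, 1), n, n)                  # random initial spin configuration
glauber!(x, 0.5, 200)                    # K = 0.5 > K_c = ln(1+√2)/2 ≈ 0.44
println("magnetization per spin: ", sum(x) / n^2)   # should drift toward ±1

Note that the proposal flips a single uniformly chosen spin, so the proposal ratio in Eq. (10.6) cancels and only the Boltzmann factor survives.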

Exercise 10.2 (Spanning Trees). Let G be an undirected complete graph. The following
MCMC algorithm results in the uniform stationary probability distribution over all the spanning
trees of G: start with some spanning tree; add uniformly-at-random some edge from G (so
that a cycle forms); remove uniformly-at-random an edge from this cycle; repeat. Suppose
now that the graph G is positively weighted, i.e., each edge e has some cost ce > 0. Suggest
an MCMC algorithm that samples from the set of spanning trees of G with the stationary
probability distribution proportional to the overall weight of the spanning tree, for the
following cases:
(i) the weight of any spanning tree of G is the sum of the costs of its edges;
(ii) the weight of any spanning tree of G is the product of the costs of its edges. In addition,
(iii) estimate the average weight of a spanning tree using the algorithm of uniform sampling;
(iv) implement all the algorithms on a (4 × 4) square lattice with randomly assigned weights.
Verify that the algorithm converges to the right value.

For useful additional reading on sampling and computations for the Ising model see
https://www.physik.uni-leipzig.de/~janke/Paper/lnp739_079_2008.pdf.

Exactness and Convergence

An MCMC algorithm is called (casually) exact if one can show that the generated distribution
”converges” to the desired stationary distribution. However, ”convergence” may mean
different things.
The strongest form of convergence — called the exact independence test (warning: this
is our ‘custom’ term) — states that at each step we generate an independent sample from
the target distribution. To prove this statement means to show that the empirical correlation
of consecutive samples is zero in the limit when the number of samples N → ∞:

lim_{N→+∞} (1/N) Σ_{n=0}^{N} f(xn) g(xn−1) → E[f(x)] E[g(x)], (10.8)

where f(x) and g(x) are arbitrary functions (such that the respective expectations on
the rhs of Eq. (10.8) are well-defined).
A weaker statement — call it asymptotic convergence — suggests that in the limit of
N → ∞ we reconstruct the target distribution (and all the respective existing moments):

lim_{N→+∞} (1/N) Σ_{n=0}^{N} f(xn) → E[f(x)], (10.9)

where f(x) is an arbitrary function such that the expectation on the rhs is well defined.
Finally, the weakest statement — call it parametric convergence — corresponds to the
case when one arrives at the target estimate only in a special limit with respect to a special
parameter. It is common, e.g. in statistical/theoretical physics and computer science, to
study the so-called thermodynamic limit, where the number of degrees of freedom (for
example, the number of spins/variables in the Ising model) becomes infinite:

lim_{s→s∗} lim_{N→+∞} (1/N) Σ_{n=0}^{N} f_s(xn) → E[f_{s∗}(x)]. (10.10)

For additional mathematical (but also intuitive, as it is written for applied mathematicians, engineers and physicists) reading on MCMC (and, more generally, MC) convergence see the article “The mathematics of mixing things up” by Persi Diaconis and also [16].

Exact Monte Carlo Sampling (Did it converge yet?)

(This part of the lecture is bonus material – we discuss it only if time permits.)
The material follows Chapter 32 of the book by D.J.C. MacKay [19]. An extensive set of modern references, discussions and codes is also available at the website [23] on perfectly random sampling with Markov Chains.
As already mentioned, the main problem with MCMC methods is that one needs to wait (sometimes for too long) to make sure that the generated samples (from the target distribution) are i.i.d. If one starts to form a histogram (empirical distribution) too early, it will deviate from the target distribution. One important question in this regard is: for how long shall one run the Markov Chain before it has ‘converged’? Answering this question rigorously is very difficult, and in many cases not possible. However, there is a technique which allows one to check exact convergence, in some cases, and to do it on the fly – as the MCMC runs.
This smart technique is the Propp-Wilson exact sampling method, also called coupling
from the past. The technique is based on a combination of three ideas:

• The main idea is related to the notion of trajectory coalescence. Let us observe that if, starting from different initial conditions, the MCMC chains share a single random number generator, then their trajectories in the phase space can coalesce; and, having coalesced, they will not separate again. This is clearly an indication that the initial conditions have been forgotten.
Will running all the initial conditions forward in time till coalescence generate an exact sample? Apparently not. One can show (it is sufficient to do so for a simple example) that the point of coalescence does not represent an exact sample.

• However, one can still achieve the goal by sampling from a time T0 in the past, up to the present. If the coalescence has occurred, the present sample is an unbiased sample; if not, we restart the simulation from a time T0 further into the past, reusing the same random numbers. The simulation is repeated till a coalescence occurs at a time before the present. One can show that the resulting sample at the present is exact.

• One problem with the scheme is that we need to test it for all the initial conditions – which are too many to track. Is there a way to reduce the number of necessary trials? Remarkably, this appears possible for a sub-class of probabilistic models, the so-called ‘attractive’ models. Loosely speaking, and using ‘physics’ jargon, these are ‘ferromagnetic’ models: models where, for a stand-alone pair of variables, the preferred configuration is the one with the same values of the two variables. In the case of an attractive model, monotonicity (sub-modularity) of the underlying model guarantees that the paths do not cross. This allows one to study only the limiting trajectories and deduce interesting properties of all the other trajectories from the limiting cases.
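A minimal sketch of coupling from the past for a toy monotone chain – a reflecting random walk on {0, ..., K}; the chain and all its parameters are illustrative choices, not taken from the lecture. It shows the three ideas at work: shared randomness, restarting further into the past with the same random numbers, and (by monotonicity) tracking only the two extreme trajectories:

import numpy as np

def cftp_birth_death(K=10, p=0.45, seed=0):
    # Exact sampling by coupling from the past for a reflecting walk on
    # {0,...,K}: step up with prob p, down otherwise (hypothetical chain).
    rng = np.random.default_rng(seed)
    us = []          # shared random numbers, reused on every restart
    T = 1
    while True:
        while len(us) < T:
            us.append(rng.random())
        lo, hi = 0, K                     # extreme initial conditions at time -T
        for t in range(T - 1, -1, -1):    # run from time -T up to time 0
            step = 1 if us[t] < p else -1 # same randomness for both trajectories
            lo = min(max(lo + step, 0), K)
            hi = min(max(hi + step, 0), K)
        if lo == hi:                      # coalesced: the value at time 0 is exact
            return lo
        T *= 2                            # otherwise restart further in the past

print("exact sample:", cftp_birth_death())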
10.2 Statistical Inference: General Relations, Calculus of Variations and Trees

This lecture largely follows the material of the mini-course on Graphical Models of Statistical Inference: Belief Propagation & Beyond. See links to slides and lecture notes at the following web-site.

10.2.1 From Ising Model to (Factor) Graphical Models

A brief reminder of what we have learned so far about the Ising Model: it is fully described by Eqs. (10.2,10.3). The weight of a “spin” configuration is given by Eq. (10.2). Let us not pay much attention for now to the normalization factor Z, and observe that the weight is nicely factorized. Indeed, it is a product of pair-wise terms. Each term describes the “interaction” between spins. Obviously, we can represent the factorization through a graph. For example, if our spin system consists of only three spins connected to each other, then the respective graph is a triangle. Spins are associated with nodes of the graph and “interactions”, which may also be called (pair-wise) factors, are associated with edges.
It is useful, for resolving this and other factorized problems, to introduce a bit more
general representation — in terms of graphs where both factors and variables are associated
with nodes/vertices. Transformation to the factor-graph representation for the three spin
example is shown in Fig. (10.2).
Ising Model, as well as other models discussed later in the lectures, can thus be stated
Figure 10.2: Factor-graph representation for the (simple) case with pair-wise factors only: the three variables σ1, σ2, σ3 sit on variable nodes, and each edge carries a pair-wise factor such as f23(σ2, σ3). In the case of the Ising model: f12(x1, x2) = exp(−J12 x1 x2 + h1 x1 + h2 x2).

in terms of the general factor-graph framework/model

P(x) = Z^{−1} Π_{a∈V_f} f_a(x_a),    x_a := (x_i | i ∈ V_n, (i, a) ∈ E),    (10.11)

where (V_f, V_n, E) is the bi-partite graph built of factors and nodes.

The factor-graph language (representation) is more general. We will see this next, discussing another problem which originates from Information Theory, and specifically from the theory of error correction.

10.2.2 Decoding of Graphical Codes as a Factor Graph problem

First, let us discuss decoding of a graphical code. (Our description here is terse, and we advise the interested reader to check the book by Richardson and Urbanke [24] for more details.) A message word consisting of L information bits is encoded in an N-bit long code word, N > L. In the case of the binary, linear coding discussed here, a convenient representation of the code is given by M ≥ N − L constraints, often called parity checks or, simply, checks. Formally, ς = (ς_i = 0, 1 | i = 1, ..., N) is one of the 2^L code words iff Σ_{i∼α} ς_i = 0 (mod 2) for all checks α = 1, ..., M, where i ∼ α indicates that bit i contributes to check α, and α ∼ i indicates that check α contains bit i. The relation between bits and checks is often described in terms of the M × N parity-check matrix H consisting of ones and zeros: H_{iα} = 1 if i ∼ α and H_{iα} = 0 otherwise.

Figure 10.3: Tanner graph of a linear code with N = 10 bits, M = 5 checks, and L = N − M = 5 information bits. This code selects 2^5 codewords out of the 2^10 possible patterns. The adjacency (parity-check) matrix of the code is given by Eq. (10.12).

The set of the codewords is thus defined as
Ξ^{(cw)} = {ς | Hς = 0 (mod 2)}. A bipartite graph representation of H, with bits marked as circles, checks marked as squares, and edges corresponding to the respective nonzero elements of H, is usually called (in coding theory) the Tanner graph of the code, or the parity-check graph of the code. (Notice that, fundamentally, a code is defined in terms of the set of its codewords, and there are many parity-check matrices/graphs parameterizing the same code. We ignore this ambiguity here, choosing one convenient parametrization H for the code.)
Therefore the bi-partite Tanner graph of the code is defined as G = (G_0, G_1), where the set of nodes is the union of the sets associated with variables and checks, G_0 = G_{0;v} ∪ G_{0;c}, and only edges connecting variables and checks contribute to G_1.
For a simple example with 10 bits and 5 checks, the parity-check (adjacency) matrix of the code with the Tanner graph shown in Fig. (10.3) is

H = [ 1 1 1 1 0 1 1 0 0 0
      0 0 1 1 1 1 1 1 0 0
      0 1 0 1 0 1 0 1 1 1        (10.12)
      1 0 1 0 1 0 0 1 1 1
      1 1 0 0 1 0 1 0 1 1 ].
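As a quick check of the counting above, one may enumerate the codewords of Eq. (10.12) by brute force (feasible only for such a small N); assuming the five checks of H are linearly independent over GF(2), the count is 2^5 = 32:

import numpy as np
from itertools import product

# Parity-check matrix H of Eq. (10.12): N = 10 bits, M = 5 checks.
H = np.array([[1,1,1,1,0,1,1,0,0,0],
              [0,0,1,1,1,1,1,1,0,0],
              [0,1,0,1,0,1,0,1,1,1],
              [1,0,1,0,1,0,0,1,1,1],
              [1,1,0,0,1,0,1,0,1,1]])

# A binary pattern is a codeword iff H @ pattern = 0 (mod 2).
codewords = [v for v in product([0, 1], repeat=10)
             if not (H @ np.array(v) % 2).any()]
print(len(codewords))  # 32 = 2^5, when the five checks are independent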
Another example of a bigger code and the respective parity-check matrix is shown in Fig. (10.4). For this example, N = 155, L = 64, M = 91, and the Hamming distance, defined as the minimum l0-distance between two distinct codewords, is 20.
Figure 10.4: Tanner graph and parity-check matrix of the (155, 64, 20) Tanner code, where N = 155 is the length of the code (the size of the code word), L = 64, and the Hamming distance of the code is d = 20.

Assume that each bit of the transmitted signal is changed (an effect of the channel noise) independently of the others. This happens with some known conditional probability, p(x|σ), where σ = 0, 1 is the value of the bit before transmission, and x ∈ R is its changed/distorted image. Once x = (x_i | i = 1, ..., N) is measured, the task of Maximum-A-Posteriori (MAP) decoding becomes to reconstruct the most probable codeword consistent with the measurement:

σ^{(MAP)} = arg max_{σ∈Ξ^{(cw)}} Π_{i=1}^{N} p(x_i|σ_i).    (10.13)

More generally, the probability of a codeword ς ∈ Ξ^{(cw)} to be a pre-image for x is

P(ς|x) = (Z(x))^{−1} Π_{i∈G_{0;v}} g^{(ch)}(x_i|ς_i),    Z(x) = Σ_{ς∈Ξ^{(cw)}} Π_{i∈G_{0;v}} g^{(ch)}(x_i|ς_i),    (10.14)

where Z(x) is thus the partition function dependent on the detected vector x. One may also consider the signal (bit-wise) MAP decoder

∀i:  ς_i^{(s−MAP)} = arg max_{ς_i} Σ_{ς\ς_i, ς∈Ξ^{(cw)}} P(ς|x).    (10.15)
10.2.3 Partition Function. Marginal Probabilities. Maximum Likelihood.

The partition function in Eq. (10.11) is the normalization factor

Z = Σ_x Π_{a∈V_f} f_a(x_a),    x_a := (x_i | i ∈ V_n, (i, a) ∈ E),    (10.16)

where x = (x_i ∈ {0, 1} | i ∈ V_n). Here we assume that the alphabet of the elementary random variables is binary; however, the generalization to the case of a larger alphabet is straightforward.
We are interested in ‘marginalizing’ Eq. (10.11) over a subset of variables, for example over all the elementary/nodal variables but one:

P(x_i) := Σ_{x\x_i} P(x).    (10.17)

The expectation of x_i computed with the probability Eq. (10.17) is also called (in physics) the ‘magnetization’ of the variable.

Example 10.2.1. Is a partition function oracle sufficient for computing P(x_i)? What is the relation, in the case of the Ising model, between P(x_i) and Z(h)?
Solution. There are two ways one can relate P(x_i) to the partition function. First, introduce an auxiliary graphical model, derived from the original one simply by fixing the value at the node i to x_i. Then P(x_i) is simply the ratio of the partition function of the newly derived graphical model to that of the original graphical model. Second, we can modify the original graphical model by introducing the multiplicative factor exp(x_i h_i) and denoting the resulting partition function by Z(h). Then log Z(h) is, up to an additive constant, the cumulant-generating function of P(x_i) (equivalently, Z(h)/Z(0) is its moment-generating function).

Another object of interest is the so-called Maximum Likelihood (ML) state. Stated formally, it is the most probable of all the states represented in Eq. (10.11):

x* = arg max_x P(x).    (10.18)

All these objects are difficult to compute. “Difficulty” – still stated casually – means that the number of operations needed is exponential in the system size (e.g. the number of variables/spins in the Ising model). This is so in general, i.e. for a GM in general position. However, for some special cases, or even special classes of cases, the computations may be much easier than in the worst case. Thus, the ML state (10.18) for the so-called ferromagnetic (attractive, sub-modular) Ising model can be computed with effort polynomial in the system size. Note that the partition function computation (at any nonzero temperature) is still exponential even in this case, thus illustrating the general statement: computing Z or P(x_i) is a more difficult problem than computing x*.
A curious fact. The Ising model (ferromagnetic, anti-ferromagnetic or glassy) with zero “magnetic field”, h = 0, on a planar graph represents a very unique class of problems for which even the computation of Z is easy. In this case the partition function is expressed via the determinant of a finite matrix, while computing the determinant of a size-N matrix is a problem of O(N³) complexity (actually O(N^{3/2}) in the planar case).
In the general (difficult) case we will need to rely on approximations to make computations scalable. Some of these approximations will be discussed later in the lecture. However, let us first prepare for that by restating the most general problem discussed so far – computation of the partition function, Z – as an optimization problem.

10.2.4 Kullback-Leibler Formulation & Probability Polytope

We will go from counting (computing the partition function is a problem of weighted counting) to optimization by changing the description from states to probabilities of the states, which we will also call beliefs. b(x) will be a belief – our probabilistic guess – for the probability of state x. Consider it on the example of the triangle system shown in Fig. (10.2). There are 2³ states in this case: (x1 = ±1, x2 = ±1, x3 = ±1), which can occur with probabilities b(x1, x2, x3). All the beliefs are non-negative and together should sum to unity. We would like to compare a particular assignment of the beliefs with P(x), generally described by Eq. (10.11). Let us recall a tool which we have already used to compare probabilities – the Kullback-Leibler (KL) divergence (of probabilities) discussed in Lecture #2:

D(b‖P) = Σ_x b(x) log( b(x)/P(x) ).    (10.19)

Note that the KL divergence (10.19) is a convex function of the beliefs (remember, there are 2³ beliefs in our enabling three-node example) within the following polytope – a domain in the space of beliefs bounded by linear constraints:

∀x:  b(x) ≥ 0,    (10.20)
Σ_x b(x) = 1.    (10.21)

Moreover, it is straightforward to check (please do it at home!) that the unique minimum of D(b‖P) is achieved at b = P, where the KL divergence is zero:

P = arg min_b D(b‖P),    min_b D(b‖P) = 0.    (10.22)

Substituting Eq. (10.11) into Eq. (10.22) one derives

log Z = − min_b F(b),    F(b) := Σ_x b(x) log( b(x) / Π_a f_a(x_a) ),    (10.23)
where F(b), considered as a function of all the beliefs, is called the (configurational) free energy (where a configuration is an assignment of the beliefs). The terminology originates from statistical physics.
To summarize, we did manage to reduce the counting problem to an optimization problem. This is great; however, so far it is just a reformulation – the number of variational degrees of freedom (beliefs) is as large as the number of terms in the original sum (the partition function). Indeed, it is not the formula itself but (as we will see below) its further use for approximations which will be extremely useful.

10.2.5 Variational Approximation: Mean Field

The main idea is to reduce the search space from exploration of the (2^N − 1)-dimensional space of beliefs to a lower-dimensional (i.e. parameterized with fewer variables) proxy/approximation. What kind of factorization can one suggest for the multivariate (N-spin) probabilities/beliefs? The idea of postulating independence of all the N variables/spins comes to mind:

b(x) → b_MF(x) = Π_i b_i(x_i),    (10.24)
∀i ∈ V_n, ∀x_i:  b_i(x_i) ≥ 0,    (10.25)
∀i ∈ V_n:  Σ_{x_i} b_i(x_i) = 1.    (10.26)

Clearly, b_i(x_i) is interpreted within this substitution as the single-node marginal belief (an estimate of the single-node marginal probability).
Substituting b by b_MF in Eq. (10.23) one arrives at the MF estimate of the partition function:

log Z_mf = − min_{b_mf} F(b_mf),
F(b_mf) := − Σ_a Σ_{x_a} ( Π_{i∼a} b_i(x_i) ) log f_a(x_a) + Σ_i Σ_{x_i} b_i(x_i) log b_i(x_i).    (10.27)

Solving the variational problem (10.27), constrained by Eqs. (10.24,10.25,10.26), is equivalent to searching for the (unique) stationary point of the following MF Lagrangian:

L(b_mf) := F(b_mf) + Σ_i λ_i ( Σ_{x_i} b_i(x_i) − 1 ).    (10.28)

Example 10.2.2. Show that Z ≥ Zmf , and that F(bmf ) is a strictly convex function of
its (vector) argument. Write down equations defining the stationary point of L(bmf ).
Solution. Z_mf is a lower bound because we optimize over the class of belief functions b_mf(x), which lies strictly within the class of allowed b(x). Convexity is proven simply by utilizing the fact that the optimization objective decomposes into a sum of convex functions.
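For the Ising parametrization b_i(x_i) = (1 + m_i x_i)/2 with x_i = ±1 and the convention P(x) ∝ exp(Σ_{{i,j}} J_ij x_i x_j + Σ_i h_i x_i), the stationary conditions of the Lagrangian (10.28) reduce to the familiar MF self-consistency equations m_i = tanh(h_i + Σ_j J_ij m_j). A minimal sketch of the fixed-point iteration, with hypothetical couplings:

import numpy as np

def mean_field_ising(J, h, n_iter=500, damping=0.5):
    # Damped fixed-point iteration of m_i = tanh(h_i + sum_j J_ij m_j),
    # the stationary condition of the MF Lagrangian for this convention.
    m = np.zeros(len(h))
    for _ in range(n_iter):
        m = (1 - damping) * m + damping * np.tanh(h + J @ m)
    return m

# toy three-spin 'triangle' example with hypothetical couplings and fields
J = 0.3 * (np.ones((3, 3)) - np.eye(3))
h = np.array([0.1, -0.2, 0.0])
print(mean_field_ising(J, h))  # MF estimates of the magnetizations E[x_i]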
The fact that Z_mf (see the example above) gives a lower bound on Z is good news. However, in general the approximation is very crude, i.e. the gap between the bound and the actual value is large. The main reason is clear: by assuming that the variables are independent we have ignored significant correlations.
In the next lecture we will analyze what very frequently provides a much better approximation for ML inference – the so-called Belief Propagation approach. We will mainly focus on Belief Propagation and the related theory and techniques. In addition to discussing inference with Belief Propagation we will also give brief pointers to the respective inverse problem – learning with Graphical Models.

10.2.6 Dynamic Programming for (Exact) Inference over Trees

Figure 10.5: Exemplary interaction/factor graphs which are trees.

Consider the Ising model over a linear chain of n spins shown in Fig. 10.5a; the partition function is

Z = Σ_{x_n} Z(x_n),    (10.29)

where Z(x_n) is a newly introduced object representing the sum over all but the last spin in the chain, labeled by n. Z(x_n) can be expressed as follows:

Z(x_n) = Σ_{x_{n−1}} exp(J_{n,n−1} x_n x_{n−1} + h_n x_n) Z_{(n−1)→(n)}(x_{n−1}),    (10.30)

where Z_{(n−1)→(n)}(x_{n−1}) is the partial partition function for the subtree (a shorter chain in this case) rooted at n − 1 and built excluding the branch/link directed towards n. The newly introduced partially summed partition function contains a summation over one fewer spin than the original chain. In fact, this partially summed object can be defined recursively,

Z_{(i−1)→(i)}(x_{i−1}) = Σ_{x_{i−2}} exp(J_{i−1,i−2} x_{i−1} x_{i−2} + h_{i−1} x_{i−1}) Z_{(i−2)→(i−1)}(x_{i−2}),    (10.31)

that is, expressing one partially summed object via the partially summed object computed at the previous step. The advantage of this recursive approach is obvious – it allows one to replace the summation over exponentially many spin configurations by a summation of only two terms at each step of the recursion.
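A minimal sketch of the recursion (10.30)-(10.31) for a chain, checked against brute-force summation; the couplings J and fields h below are hypothetical:

import numpy as np
from itertools import product

def chain_partition_function(J, h):
    # Dynamic programming: Z holds the partially summed partition function
    # of the first i spins with spin i held fixed, as in Eqs. (10.30)-(10.31).
    n = len(h)
    Z = {x0: np.exp(h[0] * x0) for x0 in (-1, 1)}      # chain of length 1
    for i in range(1, n):
        Z = {xi: sum(np.exp(J[i-1] * xi * xp + h[i] * xi) * Z[xp]
                     for xp in (-1, 1)) for xi in (-1, 1)}
    return Z[-1] + Z[1]                                # sum over the last spin

# hypothetical couplings and fields for a 10-spin chain
J = 0.5 * np.ones(9); h = 0.1 * np.ones(10)
dp = chain_partition_function(J, h)
brute = sum(np.exp(sum(J[i] * x[i] * x[i+1] for i in range(9))
                   + sum(h[i] * x[i] for i in range(10)))
            for x in product([-1, 1], repeat=10))
print(dp, brute)  # the two values agree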
What should also be obvious is that the method just described is an adaptation of the Dynamic Programming (DP) methods, which we discussed in the optimization part of the course, to the problem of statistical inference.
It is also clear that the approach just explained allows a generalization from the case of the linear chain to the case of a general tree. In the general case Z(x_i) is the partition function of the entire tree with the value of the spin at site/node i fixed. We derive

Z(x_i) = e^{h_i x_i} Π_{j∈∂i} ( Σ_{x_j} e^{J_{ij} x_i x_j} Z_{j→i}(x_j) ),    (10.32)

where ∂i denotes the set of neighbors of the i-th spin and

Z_{j→i}(x_j) = e^{h_j x_j} Π_{k∈∂j\i} ( Σ_{x_k} e^{J_{kj} x_k x_j} Z_{k→j}(x_k) )    (10.33)

is the partition function of the subtree rooted at the node j.
Let us illustrate the general scheme on the example of the tree in Fig. (10.5b). One obtains

Z = Σ_{x_4} Z(x_4),    (10.34)

where the partition function, partially summed and conditioned on the spin value x_4, is

Z(x_4) = e^{h_4 x_4} ( Σ_{x_5} e^{J_{45} x_4 x_5} Z_{5→4}(x_5) ) ( Σ_{x_6} e^{J_{46} x_4 x_6} Z_{6→4}(x_6) ) ( Σ_{x_3} e^{J_{34} x_3 x_4} Z_{3→4}(x_3) ),    (10.35)

where

Z_{3→4}(x_3) = e^{h_3 x_3} ( Σ_{x_1} e^{J_{13} x_1 x_3} Z_{1→3}(x_1) ) ( Σ_{x_2} e^{J_{23} x_2 x_3} Z_{2→3}(x_2) ).    (10.36)

Exercise 10.3. Consider the Ising model on a graph G = (V, E), with spins x. Let V_0 ⊂ V, and let V̄_0 = {i ∈ V \ V_0 : (i, j) ∈ E for some j ∈ V_0}. (You may think of V̄_0 as the “boundary” of V_0 within V.) Show that the spins on V_0 are conditionally independent of all other spins, given the values of the spins on V̄_0.
10.2.7 Properties of Undirected Tree-Structured Graphical Models

It appears that in the case of a general pair-wise graphical model over a tree the joint distribution function over all variables can be expressed solely via single-node marginals and pair-wise marginals over all pairs of graph-neighbors. To illustrate this important factorization property, let us consider the examples shown in Fig. 10.6. In the case of the two-node example of Fig. 10.6a the statement is obvious, as it follows directly from the Bayes formula
P (x1 , x2 ) = P (x1 )P (x2 |x1 ), (10.37)

or, equivalently, P(x1, x2) = P(x2)P(x1|x2).
For the pair-wise graphical model shown in Fig. 10.6b one obtains

P(x1, x2, x3) = P(x1, x2)P(x3|x1, x2) = P(x1, x2)P(x3|x2)
= P(x1)P(x2|x1)P(x3|x2) = P(x1, x2)P(x2, x3)/P(x2),    (10.38)

where the conditional independence of x3 from x1, P(x3|x1, x2) = P(x3|x2), was used.
Next, let us work it out on the example of the pair-wise graphical model shown in Fig. 10.6c:

P(x1, x2, x3, x4) = P(x1, x2, x3)P(x4|x1, x2, x3) = P(x1, x2, x3)P(x4|x2)
= P(x1, x2)P(x3|x1, x2)P(x4|x2) = P(x1, x2)P(x3|x2)P(x4|x2)
= P(x1)P(x2|x1)P(x3|x2)P(x4|x2) = P(x1, x2)P(x2, x3)P(x2, x4) / P²(x2).    (10.39)

Here one uses the reductions P(x4|x1, x2, x3) = P(x4|x2) and P(x3|x1, x2) = P(x3|x2), related to the respective conditional independence properties.
Finally, it is easy to verify that the joint probability distribution corresponding to the model in Fig. 10.6d is

P(x1, x2, x3, x4, x5, x6) = P(x1)P(x2|x1)P(x3|x2)P(x4|x2)P(x5|x2)P(x6|x5)
= P(x1, x2)P(x2, x3)P(x2, x4)P(x2, x5)P(x5, x6) / (P³(x2)P(x5)).    (10.40)

Exercise 10.4. In the case of a general tree-like graphical model the joint probability distribution can be stated in terms of the pair-wise and singleton marginals as follows:

P(x_1, x_2, ..., x_n) = Π_{(i,j)∈E} P(x_i, x_j) / Π_{i∈V} P(x_i)^{q_i − 1},    (10.41)

where q_i is the degree of the i-th node. Use mathematical induction to prove Eq. (10.41).

Figure 10.6: Examples of undirected tree-structured graphical models.
10.2.8 Bethe Free Energy & Belief Propagation

As discussed above, Dynamic Programming is a provably exact approach to inference when the graph is a tree. It also provides an empirically good approximation for a very broad family of problems stated on loopy graphs.
The approximation is usually called Bethe-Peierls or Belief Propagation (BP is the abbreviation which works for both); Loopy BP is another popular term. See the original paper [25], a comprehensive review [26], and the respective lecture notes for advanced/additional reading.
Instead of Eq. (10.24) one uses the following BP substitution:

b(x) → b_bp(x) = Π_a b_a(x_a) / Π_i (b_i(x_i))^{q_i − 1},    (10.42)
∀a ∈ V_f, ∀x_a:  b_a(x_a) ≥ 0,    (10.43)
∀i ∈ V_n, ∀a ∼ i:  b_i(x_i) = Σ_{x_a\x_i} b_a(x_a),    (10.44)
∀i ∈ V_n:  Σ_{x_i} b_i(x_i) = 1,    (10.45)

where q_i stands for the degree of node i. The physical meaning of the factor q_i − 1 on the rhs of Eq. (10.42) is straightforward: by placing beliefs associated with the factor-nodes connected by an edge to a node i, we over-count the contribution of an individual variable q_i times, and thus the denominator in Eq. (10.42) comes as a correction for this over-counting.
Substitution of Eq. (10.42) into Eq. (10.23) results in what is called the Bethe Free Energy (BFE):

F_bp := E_bp − H_bp,    (10.46)
E_bp := − Σ_a Σ_{x_a} b_a(x_a) log f_a(x_a),    (10.47)
H_bp := − Σ_a Σ_{x_a} b_a(x_a) log b_a(x_a) + Σ_i Σ_{x_i} (q_i − 1) b_i(x_i) log b_i(x_i),    (10.48)

where E_bp is the so-called self-energy (physics jargon) and H_bp is the BP-entropy (this name should be clear in view of what we have discussed about entropy so far). Thus the BP version of the KL-divergence minimization becomes

arg min_{b_a, b_i : Eqs. (10.43,10.44,10.45)} F_bp,    (10.49)
min_{b_a, b_i : Eqs. (10.43,10.44,10.45)} F_bp.    (10.50)

Question: Is F_bp a convex function (of its arguments)? [Not always; however, for some graphs and/or some factor functions the convexity holds.]
The ML (zero-temperature) version of Eq. (10.49) results in the following optimization:

min_{b_a, b_i : Eqs. (10.43,10.44,10.45)} E_bp.    (10.51)

Note that this optimization is a Linear Program (LP) – minimization of a linear objective over a set of linear constraints.

Belief Propagation & Message Passing

Let us restate Eq. (10.49) as an unconditional optimization. We use the standard method of Lagrange multipliers to achieve this. The resulting Lagrangian is

L_bp(b, η, λ) := − Σ_a Σ_{x_a} b_a(x_a) log f_a(x_a) + Σ_a Σ_{x_a} b_a(x_a) log b_a(x_a) − Σ_i Σ_{x_i} (q_i − 1) b_i(x_i) log b_i(x_i)
    + Σ_i Σ_{a∼i} Σ_{x_i} η_ia(x_i) ( b_i(x_i) − Σ_{x_a\x_i} b_a(x_a) ) + Σ_i λ_i ( Σ_{x_i} b_i(x_i) − 1 ),    (10.52)

where η and λ are the dual (Lagrange) variables associated with the conditions Eqs. (10.44) and (10.45), respectively. Then Eq. (10.49) becomes the following min-max problem:

min_b max_{η,λ} L_bp(b, η, λ).    (10.53)

Changing the order of optimizations in Eq. (10.53) and then optimizing over the beliefs one arrives at the following expressions for the beliefs via messages (check the derivation details):

∀a, ∀x_a:  b_a(x_a) ∼ f_a(x_a) exp( Σ_{i∼a} η_ia(x_i) ) := f_a(x_a) Π_{i∼a} n_{i→a}(x_i) := f_a(x_a) Π_{i∼a} Π_{b∼i, b≠a} m_{b→i}(x_i),    (10.54)

∀i, ∀x_i:  b_i(x_i) ∼ exp( Σ_{a∼i} η_ia(x_i) / (q_i − 1) ) := Π_{a∼i} m_{a→i}(x_i),    (10.55)

where, as usual, ∼ for beliefs means equality up to a constant which guarantees that the sum of the respective beliefs is unity, and we have also introduced the auxiliary variables m and n, called messages, related to the Lagrange multipliers η as follows:

∀i, ∀a ∼ i:  n_{i→a}(x_i) := exp( η_ia(x_i) ),    (10.56)
∀a, ∀i ∼ a:  m_{a→i}(x_i) := exp( η_ia(x_i) / (q_i − 1) ).    (10.57)
Combining Eqs. (10.54,10.55,10.56,10.57) with Eq. (10.44) results in the following BP-
equations stated in terms of the message variables
b6Y
=a
∀i, ∀a ∼ i, ∀xi : ni→a (xi ) = ma→i (xi ) (10.58)
b∼i
X j6=i
Y
∀a, ∀i ∼ a, ∀xi : ma→i (xi ) = fa (xa ) nj→a (xj ). (10.59)
xa \xi j∼a

Combining Eqs. (10.54,10.55,10.56,10.57) with Eq. (10.44) results in the following BP equations stated in terms of the message variables:

∀i, ∀a ∼ i, ∀x_i:  n_{i→a}(x_i) = Π_{b∼i, b≠a} m_{b→i}(x_i),    (10.58)
∀a, ∀i ∼ a, ∀x_i:  m_{a→i}(x_i) = Σ_{x_a\x_i} f_a(x_a) Π_{j∼a, j≠i} n_{j→a}(x_j).    (10.59)

Note that if the Bethe Free Energy (10.46) is non-convex there may be multiple fixed points of Eqs. (10.58,10.59). The following iterative, so-called Message Passing (MP), Algorithm 10 is used to find a fixed-point solution of the BP Eqs. (10.58,10.59).

Algorithm 10 Message Passing, Sum-Product Algorithm [factor graph representation]
Input: The graph. The factors.
1: ∀a, ∀i ∼ a, ∀x_i: m_{a→i}(x_i) = 1 [initialize factor-to-variable messages]
2: ∀i, ∀a ∼ i, ∀x_i: n_{i→a}(x_i) = 1 [initialize variable-to-factor messages]
3: loop till convergence within an error [or proceed with a fixed number of iterations]
4:   ∀i, ∀a ∼ i, ∀x_i: n_{i→a}(x_i) ← Π_{b∼i, b≠a} m_{b→i}(x_i)
5:   ∀a, ∀i ∼ a, ∀x_i: m_{a→i}(x_i) ← Σ_{x_a\x_i} f_a(x_a) Π_{j∼a, j≠i} n_{j→a}(x_j)
6: end loop
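A minimal sketch of Algorithm 10 for the special case of pairwise factors over binary variables, where the variable-to-factor and factor-to-variable messages collapse into a single node-to-node message; singleton factors are assumed absorbed into the pairwise tables, and all names below are illustrative. On a tree, the parallel (“flooding”) schedule converges after a number of iterations equal to the diameter of the graph:

import numpy as np

def sum_product(factors, n_nodes, n_states=2, n_iter=20):
    # `factors` maps an edge (i, j) to an (n_states x n_states) table f_ij;
    # m[(i, j)] is the message from node i to node j.
    edges = list(factors)
    m = {(i, j): np.ones(n_states) for (i, j) in edges}
    m.update({(j, i): np.ones(n_states) for (i, j) in edges})
    for _ in range(n_iter):
        new = {}
        for (i, j) in m:
            f = factors[(i, j)] if (i, j) in factors else factors[(j, i)].T
            # product of messages flowing into i from all neighbors except j
            prod = np.ones(n_states)
            for (k, l) in m:
                if l == i and k != j:
                    prod *= m[(k, l)]
            msg = f.T @ prod        # sum over x_i of f(x_i, x_j) * prod(x_i)
            new[(i, j)] = msg / msg.sum()
        m = new
    # single-node beliefs: product of all incoming messages, normalized
    b = np.ones((n_nodes, n_states))
    for (k, l) in m:
        b[l] *= m[(k, l)]
    return b / b.sum(axis=1, keepdims=True)

# three-node chain with hypothetical pairwise tables
g = np.array([[2.0, 1.0], [1.0, 3.0]])
print(sum_product({(0, 1): g, (1, 2): g}, n_nodes=3))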
10.3 Theory of Learning: Sufficient Statistics and Maximum Likelihood Estimation

10.3.1 Sufficient Statistics: infinitely many samples

So far we have been discussing the direct (inference) GM problem. In the remainder of this lecture we will briefly talk about inverse problems. This subject will also be discussed (on the example of trees) in the following.
Stated casually, the inverse problem is about ‘learning’ a GM from data/samples. Think of a two-room setting: in one room a GM is known and many samples are generated; the samples, but not the GM (!!!), are passed to the second room; the task becomes to reconstruct the GM from the samples.
The first question we should ask is whether this is possible in principle, even if we have an infinite number of samples. The very powerful notion of sufficient statistics helps to answer this question.
Consider the Ising model (not for the first time in this course)

P(x) = (1/Z(θ)) exp( Σ_{i∈V} h_i x_i + Σ_{{i,j}∈E} J_ij x_i x_j ) = exp{ θᵀφ(x) − log Z(θ) },    (10.60)

where x_i ∈ {−1, 1}, θ := h ∪ J = (h_i | i ∈ V) ∪ (J_ij | {i,j} ∈ E), and the partition function Z(θ) serves to normalize the probability distribution. In fact, Eq. (10.60) describes what is called an exponential family – emphasizing the ‘exponential’ dependence on the factors θ. Notice that any pairwise GM over binary variables can be represented as an Ising model.
Consider the collection of all first and second moments (but only these two, and not higher moments) of the spin variables, µ⁽¹⁾ := (µ_i = E[x_i], i ∈ V) and µ⁽²⁾ := (µ_ij = E[x_i x_j], {i,j} ∈ E). The sufficient statistics statement is that to reconstruct θ, fully defining the GM, it is sufficient to know µ⁽¹⁾ and µ⁽²⁾.

10.3.2 Maximum-Likelihood Estimation/Learning of Graphical Models

Let us turn the sufficiency into a constructive statement – Maximum-Likelihood estimation over an exponential family of GMs.
First, notice that (according to the definition of µ)

∀i ∈ V:  ∂_{h_i} log Z(θ) = µ_i,    ∀{i,j} ∈ E:  ∂_{J_ij} log Z(θ) = µ_ij.    (10.61)

This leads to the following statement: if we know how to compute the log-partition function for any value of θ, then reconstructing the ‘correct’ θ is a convex optimization problem (over θ):

θ* = arg max_θ { µᵀθ − log Z(θ) }.    (10.62)

If P represents the empirical distribution of a set of independent identically-distributed (i.i.d.) samples {x^{(s)}, s = 1, ..., S}, then µ are the corresponding empirical moments, e.g. µ_ij = (1/S) Σ_s x_i^{(s)} x_j^{(s)}.
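A minimal sketch of the convex ML estimation (10.62) for a small Ising chain, where log Z(θ) and the moments are computed by brute-force enumeration (feasible only for small n); the ground-truth parameters below are hypothetical:

import numpy as np
from itertools import product

def learn_ising_chain(mu, n, lr=0.2, n_steps=2000):
    # Gradient ascent on mu^T theta - log Z(theta); the gradient is
    # mu - E_theta[phi], by Eq. (10.61).
    theta = np.zeros(2 * n - 1)
    states = np.array(list(product([-1, 1], repeat=n)))
    # sufficient statistics phi(x) = (x_1..x_n, x_1 x_2, ..., x_{n-1} x_n)
    phi = np.hstack([states, states[:, :-1] * states[:, 1:]])
    for _ in range(n_steps):
        w = np.exp(phi @ theta)
        w /= w.sum()                  # exact Boltzmann weights (small n only)
        theta += lr * (mu - w @ phi)
    return theta

# hypothetical ground truth: generate exact moments, then re-learn theta
n = 4
true_theta = np.array([0.2, -0.1, 0.3, 0.0, 0.5, -0.4, 0.7])
states = np.array(list(product([-1, 1], repeat=n)))
phi = np.hstack([states, states[:, :-1] * states[:, 1:]])
p = np.exp(phi @ true_theta); p /= p.sum()
print(learn_ising_chain(p @ phi, n))  # recovers true_theta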
General Remarks about GM Learning. The ML parameter estimation (10.62) is the best we can do. It is fundamental for the task of Machine Learning, and in fact it generalizes beyond the case of the Ising model.
Unfortunately, there are only very few nontrivial cases when the partition function can be calculated efficiently for any value of θ (or of the parameters, if we work with a more general class of GMs than the Ising models).
Therefore, to make the task of parameter estimation practical one may utilize one of the following approaches:
• Limit consideration to the class of models for which computation of the partition function can be done efficiently for any values of the parameters. We will discuss such a case below – the so-called tree (Chow-Liu) learning. (In fact, the partition function can also be computed efficiently in the case of the Ising model over planar graphs and generalizations; see [27] and references therein for details.)

• Rely on approximations, e.g. variational approximations (MF, BP and others), MCMC, or approximate elimination (approximate Dynamic Programming).

• There also exists a very innovative new approach which allows one to learn a GM efficiently, however using more information than suggested by the notion of sufficient statistics. As one of the scientists contributing to this line of research put it, ‘the sufficient statistics is not sufficient’. This is a fascinating novel subject which is, however, beyond the scope of this course. Check [28] and references therein.

10.3.3 Learning Spanning Trees

Eq. (10.41) suggests that knowing the structure of a tree-based graphical model allows one to express the joint probability distribution in terms of the single-node and pairwise (edge-related) marginals. Below we will utilize this statement to pose and solve an inverse problem. Specifically, we attempt to reconstruct a tree representing correlations between multiple (ideally, infinitely many) snapshots of the discrete random variables x_1, x_2, ..., x_n.
A straightforward strategy to achieve this goal is as follows. First, one estimates all possible single-node and pairwise marginal probability distributions, P(x_i) and P(x_i, x_j), from the infinite set of snapshots. Then we may similarly estimate the joint distribution function and verify, for each possible tree layout, whether the relations (10.41) hold. However, this strategy is not feasible, as it requires (in the worst, unlucky case) testing exponentially many, n^{n−2}, possible spanning trees. Luckily, a smart and computationally efficient way of solving the problem was suggested by Chow and Liu in 1968 [29].
Consider a candidate probability distribution, P_T(x = (x_1, ..., x_n)), over a tree T = (V, E) (where V and E are the sets of nodes and edges of the tree, respectively) which is tree-factorized according to Eq. (10.41) via marginal (pair-wise and single-variable) probabilities as follows:

P_T(x_1, x_2, ..., x_n) = Π_{(i,j)∈E} P(x_i, x_j) / Π_{i∈V} P(x_i)^{q_i − 1}.    (10.63)
The “distance” between the actual (correct) joint probability distribution P and the candidate tree-factorized probability distribution P_T can be measured in terms of the Kullback-Leibler (KL) divergence

D(P‖P_T) = Σ_x P(x) log( P(x)/P_T(x) ).    (10.64)

As discussed in Section 8.5, the KL divergence is always positive if P and P_T are different, and is zero if these distributions are identical. Then, we are looking for a tree that minimizes the KL divergence.
Substituting (10.63) into Eq. (10.64) one arrives at the following chain of explicit transformations:

D(P‖P_T) = Σ_x P(x) ( log P(x) − Σ_{(i,j)∈E} log P(x_i, x_j) + Σ_{i∈V} (q_i − 1) log P(x_i) )
  = Σ_x P(x) log P(x) − Σ_{(i,j)∈E} Σ_{x_i,x_j} P(x_i, x_j) log P(x_i, x_j) + Σ_{i∈V} (q_i − 1) Σ_{x_i} P(x_i) log P(x_i)
  = − Σ_{(i,j)∈E} Σ_{x_i,x_j} P(x_i, x_j) log( P(x_i, x_j) / (P(x_i)P(x_j)) ) + Σ_x P(x) log P(x) − Σ_{i∈V} Σ_{x_i} P(x_i) log P(x_i),    (10.65)

where the nodal and edge marginalization relations, ∀i ∈ V: P(x_i) = Σ_{x\x_i} P(x) and ∀(i,j) ∈ E: P(x_i, x_j) = Σ_{x\{x_i,x_j}} P(x), respectively, were used. One observes that the Kullback-Leibler divergence becomes

D(P‖P_T) = − Σ_{(i,j)∈E} I(X_i, X_j) + Σ_{i∈V} S(X_i) − S(X),    (10.66)

where S(·) denotes the entropy and

I(X_i, X_j) := Σ_{x_i,x_j} P(x_i, x_j) log( P(x_i, x_j) / (P(x_i)P(x_j)) )    (10.67)

is the mutual information of the pair of random variables x_i and x_j.

Since the entropies S(X_i) and S(X) do not depend on the choice of the tree, minimizing the Kullback-Leibler divergence is equivalent to maximizing the following sum over the branches of the tree:

Σ_{(i,j)∈E} I(X_i, X_j).    (10.68)

Based on this observation, Chow and Liu suggested to use the following (standard in computer science) Kruskal maximum spanning tree reconstruction algorithm (notice that the algorithm is greedy):
Table 10.1: Information available about an exemplary probability distribution of four binary
variables discussed in the Exercise 10.5.
x1 x2 x3 x4 P (x1 , x2 , x3 , x4 ) P (x1 )P (x2 |x1 )P (x3 |x2 )P (x4 |x1 ) P (x1 )P (x2 )P (x3 )P (x4 )
0000 0.100 0.130 0.046
0001 0.100 0.104 0.046
0010 0.050 0.037 0.056
0011 0.050 0.030 0.056
0100 0.000 0.015 0.056
0101 0.000 0.012 0.056
0110 0.100 0.068 0.068
0111 0.050 0.054 0.068
1000 0.050 0.053 0.056
1001 0.100 0.064 0.056
1010 0.000 0.015 0.068
1011 0.000 0.018 0.068
1100 0.050 0.033 0.068
1101 0.050 0.040 0.068
1110 0.150 0.149 0.083
1111 0.150 0.178 0.083

• (step 1) Sort the edges of G into decreasing order by weight = mutual information, i.e. I(X_i, X_j) for the candidate edge (i, j). Let E_T be the set of edges comprising the maximum-weight spanning tree; set E_T = ∅.

• (step 2) Add the first edge to E_T.

• (step 3) Add the next edge to E_T if and only if it does not form a cycle in E_T.

• (step 4) If E_T has n − 1 edges (where n is the number of nodes in G), stop and output E_T. Otherwise go to step 3.
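A minimal sketch of the Chow-Liu-Kruskal procedure: empirical pairwise mutual informations followed by a greedy maximum-weight spanning tree built with a union-find structure; the data-generating process at the bottom is hypothetical:

import numpy as np
from itertools import combinations

def chow_liu(samples, n_states=2):
    # samples: (S x n) integer array with entries in {0,...,n_states-1}
    S, n = samples.shape
    def mi(i, j):
        pij = np.zeros((n_states, n_states))
        for s in range(S):
            pij[samples[s, i], samples[s, j]] += 1.0 / S
        pi, pj = pij.sum(1), pij.sum(0)
        nz = pij > 0
        return (pij[nz] * np.log(pij[nz] / np.outer(pi, pj)[nz])).sum()
    weights = sorted(((mi(i, j), i, j) for i, j in combinations(range(n), 2)),
                     reverse=True)                    # step 1
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for w, i, j in weights:                           # steps 2-4
        ri, rj = find(i), find(j)
        if ri != rj:                                  # adding (i, j) forms no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# hypothetical data: x2 copies x1 with 10% noise, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.integers(2, size=1000)
data = np.stack([x1, (x1 + (rng.random(1000) < 0.1)) % 2,
                 rng.integers(2, size=1000)], axis=1)
print(chow_liu(data))  # the strong edge (0, 1) is selected first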
Eq. (10.41) is exact only in the case when it is guaranteed that the Graphical Model we attempt to recover forms a tree. However, the same tree ansatz can be used to recover the best tree approximation for a graphical model defined over a graph with loops. How should one choose the optimal (best approximation) tree in this case? To answer this question within the aforementioned Kullback-Leibler paradigm one needs to compare the tree ansatz (10.41) with the empirical joint distribution. This reconstruction of the optimal tree is again based on the Chow-Liu algorithm.

Exercise 10.5. Find the Chow-Liu optimal spanning tree approximation for the joint probability distribution of four random binary variables with the statistical information presented in Table 10.1. [Hint: Estimate the empirical (i.e. based on the data) pair-wise mutual information and then utilize the Chow-Liu-Kruskal algorithm (see the description above in the lecture notes) to reconstruct the optimal tree.]

10.4 Function Approximation with Neural Networks

The material presented in this lecture is relatively new. Some of the theoretical results discussed below are 30 years old, but the practical power of these results, and of course the flow of new results and approaches, are very recent – 5 years old, or even younger. In this situation there are not that many books written on the topic yet, especially books focusing on the applied-mathematics aspects of function approximation with Neural Networks (NN). The book of Gilbert Strang [30] on “Linear Algebra and Learning from Data”, which we highly recommend, stands alone. Part VII of the book, entitled “Learning from Data”, is especially relevant to this Section.
NNs, and especially Deep Neural Networks (DNNs), are the newest most important tool of applied mathematics, which can be used universally to fit a function.
The mathematical foundation of the methodology is established by the following Theorem:

Theorem 10.4.1 (Universal Approximation Theorem (Cybenko 1989; Hornik 1991; Pinkus 1999)). Let ρ : R → R be any continuous function (called the activation function). Let NN^ρ_n represent the class of feed-forward NNs with activation function ρ, with n neurons in the input layer, one neuron in the output layer, and one hidden layer with an arbitrary number of neurons. Let K ⊂ Rⁿ be compact. Then NN^ρ_n is dense in C(K) (the class of continuous functions on K) if and only if ρ is not a polynomial.

This famous theorem was an initiator of the NN revolution. In its original form, stated above, it applies to the regime of bounded depth and arbitrary width; it was recently extended to the complementary case of bounded width and arbitrary depth. (See [31] and references therein. Notice that the term “deep” in the name Deep Learning refers to the large depth of the respective NN.)
Even though Theorem 10.4.1 is agnostic to the type of the activation function, some activation functions are used more frequently in practice than others. The choice which has succeeded far beyond expectations is the nonlinear function called the Rectified Linear Unit, or simply, ReLU(x) = x₊ = max(x, 0).
Obviously, NNs are not limited to representing maps R → R. For example, we can use ReLU to construct the following piecewise-linear function mapping a p-dimensional data vector, v, to an m-dimensional output:

• Choose a (q, p)-dimensional matrix A1 and a q-dimensional vector b1 and set to zero all negative components of A1 v + b1, i.e. introduce ReLU(A1 v + b1) = (A1 v + b1)₊ acting component-wise, where each component is associated with a neuron of a hidden layer.

• Choose an (m, q)-dimensional matrix A2 and apply it to the hidden-layer vector: A2 (A1 v + b1)₊.

Introducing depth in a NN allows one to construct more and more expressive (as containing more piecewise-linear pieces) continuous piecewise-linear functions. One standard way to do it is to use composition:

NN(v) = F_L(F_{L−1}(··· F_2(F_1(v)))) = (F_L ◦ F_{L−1} ◦ ··· ◦ F_2 ◦ F_1)(v),    (10.69)

where ∀l = 1, ..., L:  F_l(x) = (A_l x + b_l)₊.

While composition may be considered the key operation in the construction of Deep NNs, the loss function, the chain rule and the associated automatic differentiation and back-propagation are the other key ingredients of DNNs, which we discuss in the following, one after another.

Example 10.4.2 (Counting the Number of Pieces). Consider an example of NN(v) with ReLU activation function, a one-dimensional input layer, one 5-dimensional hidden layer, and a one-dimensional output layer, and count the number of pieces in the resulting continuous piecewise-linear function. (A numerical sketch follows.)
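A numerical sketch for Example 10.4.2, with hypothetical random weights: each hidden ReLU neuron switches on/off at the point where a1_i v + b1_i = 0, so generically the 1 → 5 → 1 network has 5 breakpoints and hence 6 linear pieces:

import numpy as np

rng = np.random.default_rng(1)
a1, b1 = rng.normal(size=5), rng.normal(size=5)   # hidden layer (hypothetical)
a2, b2 = rng.normal(size=5), rng.normal()         # output layer (hypothetical)

def nn(v):
    # 1 -> 5 -> 1 ReLU network
    return a2 @ np.maximum(a1 * v + b1, 0.0) + b2

breaks = np.sort(-b1 / a1)                 # candidate breakpoints of the pieces
probes = np.concatenate([[breaks[0] - 1],  # one probe inside each interval
                         (breaks[:-1] + breaks[1:]) / 2, [breaks[-1] + 1]])
eps = 1e-6
slopes = [(nn(p + eps) - nn(p - eps)) / (2 * eps) for p in probes]
print("pieces:", len(np.unique(np.round(slopes, 6))))   # 6, generically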
10.4.1 Fitting a Function with an NN as an Optimization

The key element of fitting a function with an NN is the so-called loss function, which is the term commonly used in data science to describe the objective of the underlying optimization formulation. A standard choice of the loss function is a norm, e.g. the l1, l2 or l∞ norm, of the error between the function, F(v), and its NN approximation evaluated at the available samples, s = 1, ..., S. Then, for the example of the lp-norm min-error, the resulting optimization becomes

min_θ Σ_{s=1}^{S} ‖F(v_s) − NN(v_s|θ)‖_p,    (10.70)

where θ is the vector of the NN parameters, e.g. θ = (A_L, b_L, ..., A_1, b_1) in the case of the continuous piecewise-linear NN given by Eq. (10.69).
A standard way to solve the optimization is to evaluate the partial derivatives (components of the gradient) of the loss function over the parameters and require that all of them are zero. This zero-gradient solution of the optimization problem is normally found by the gradient descent algorithm (in one of its many versions).
In all of its variants, and according to Eq. (10.70), gradient descent needs to compute derivatives of NN(v_s|θ) over the components of the parameter vector θ. Automatic differentiation, or its particular version back-propagation, is a method to compute these derivatives quickly.

10.4.2 Automatic Differentiation, Back-Propagation and the Chain Rule

The computational efficiency of Automatic Differentiation is associated with the so-called chain rule, illustrated here on an example:

dg/dx = d/dx ( g_3(g_2(g_1(x))) ) = (dg_3/dg_2)(g_2(g_1(x))) · (dg_2/dg_1)(g_1(x)) · (dg_1/dx)(x).

The steps of automatic differentiation can be arranged in two ways – forward mode and backward mode. We choose the forward mode (as the much faster one) if we have many functions g depending on a few inputs (components of x), and vice versa we choose the backward mode if we have fewer functions and a higher-dimensional input.
Since in Deep Learning we have one loss function depending on many weights, the choice of the back-propagation mode (of automatic differentiation) is natural for this application.
x above is the vector of weights consisting of all the matrices, A_1, ..., A_L, where L is the number of layers, and all the bias vectors, b_1, ..., b_L. The input-output pair of vectors, (v, w), are associated with the 0-th and L-th layers of the NN. In the supervised learning discussed here, v = v_0 and w = v_L are the training data represented through S samples, (v^{(s)}, w^{(s)}), s = 1, ..., S. The NN maps inputs to outputs, w = F(x, v_0). Each new layer is a map from the previous layer, v_n = R_n(b_n + A_n v_{n−1}), where R_n is the activation function of the n-th layer, for example ReLU.
Consider the example of one hidden layer, L = 2, with R_2 set to the identity function:

w = v_2 = b_2 + A_2 v_1 = b_2 + A_2 R(b_1 + A_1 v_0).

Let us compute the derivatives over A_1 following the chain rule:

∂w/∂A_1 = A_2 R′(b_1 + A_1 v_0) ∂(b_1 + A_1 v_0)/∂A_1.

Notice that in computing the derivative we go backwards, from w = v_2 to v = v_0 – that is, we follow the backward-propagation rule, starting from the output and moving to the input.
10.4.3 Avoiding Over-fitting

Over-fitting occurs when a trained network performs very accurately on the given data but cannot generalize well to new data. If F does poorly on the training set, we say that the NN is under-fit. Over-fitting occurs when F does well on the training set but gives a much larger error on the validation data. Colloquially, it occurs when our function is not sufficiently smooth – filling gaps (unnecessarily) between the training data points. One standard recommendation: to avoid over-fitting, introduce regularization into the loss function. Another recommendation: stop training before you over-fit.

Example 10.4.3. Consider a Neural Network (NN) with L = 2, with one hidden layer consisting of a single node (neuron), and with tanh activation functions. Therefore, the model is

w = v_2 = tanh( a_2 tanh(a_1 v + b_1) + b_2 ),

and assume that the weights are currently set at (a_1, b_1) = (1.0, 0.5) and (a_2, b_2) = (−0.5, 0.3). What is the gradient of the Mean Square Error (MSE) cost for the observation (v, w) = (2, −0.5)? What is the optimal MSE, and what are the optimal values of the parameters?
Solution. Evaluating the function (forward pass) with the initial parameters gives:

ŵ = tanh(−0.5 tanh(1.0 · 2 + 0.5) + 0.3) = −0.1909.

Define and compute the intermediary variables z_1, v_1, z_2 as follows:

z_1 := a_1 v + b_1 = 2.500,    v_1 := tanh(z_1) = 0.9866,
z_2 := a_2 v_1 + b_2 = −0.1933,    ŵ = v_2 = tanh(z_2) = −0.1909,

where ŵ stands for the NN estimate of the output, as opposed to the actual, i.e. sample, output w.
The MSE loss function is L = (ŵ − w)². The gradient of the loss function is, by the chain rule,

∇L = ( ∂L/∂a_2, ∂L/∂b_2, ∂L/∂a_1, ∂L/∂b_1 )
   = ( (∂L/∂ŵ)(∂ŵ/∂z_2)(∂z_2/∂a_2),
       (∂L/∂ŵ)(∂ŵ/∂z_2)(∂z_2/∂b_2),
       (∂L/∂ŵ)(∂ŵ/∂z_2)(∂z_2/∂v_1)(∂v_1/∂z_1)(∂z_1/∂a_1),
       (∂L/∂ŵ)(∂ŵ/∂z_2)(∂z_2/∂v_1)(∂v_1/∂z_1)(∂z_1/∂b_1) ).
Evaluating each partial derivative gives:

∂L/∂ŵ = 2(ŵ − w) = 2((−0.1909) − (−0.5)) = 0.6182,
∂ŵ/∂z_2 = 1 − tanh²(z_2) = 1 − (−0.1909)² = 0.9636,
∂z_2/∂a_2 = v_1 = 0.9866,
∂z_2/∂b_2 = 1,
∂z_2/∂v_1 = a_2 = −0.5,
∂v_1/∂z_1 = 1 − tanh²(z_1) = 1 − (0.9866)² = 0.0266,
∂z_1/∂a_1 = v = 2.0,
∂z_1/∂b_1 = 1.

Putting this all together, we get (values rounded):

∇L = ( ∂L/∂a_2, ∂L/∂b_2, ∂L/∂a_1, ∂L/∂b_1 ) = ( 0.5877, 0.5957, −0.0158, −0.0079 ).

(b) There is only one data point and four parameters; the problem is therefore under-determined. The optimal MSE is zero, since a single observation can be fit exactly, but iterating gradient-descent-type methods does not guarantee the return of a unique solution.
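The chain-rule arithmetic above can be sanity-checked against central finite differences; a minimal sketch (the ordering of the parameters in theta0 is a choice made here for illustration):

import numpy as np

v, w = 2.0, -0.5
theta0 = np.array([1.0, 0.5, -0.5, 0.3])   # (a1, b1, a2, b2)

def loss(theta):
    a1, b1, a2, b2 = theta
    w_hat = np.tanh(a2 * np.tanh(a1 * v + b1) + b2)
    return (w_hat - w) ** 2

eps = 1e-6
grad = np.array([(loss(theta0 + eps * e) - loss(theta0 - eps * e)) / (2 * eps)
                 for e in np.eye(4)])
print(grad)   # approx (dL/da1, dL/db1, dL/da2, dL/db2)
              # = (-0.0158, -0.0079, 0.5877, 0.5957)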
Exercise 10.6. Consider the following two-layer (L = 2) NN map, v → w, where v, w ∈ R, built from three ReLU neurons:

v_1^i = ReLU(a_i v + b_i), ∀i = 1, 2,    v_1 = (v_1^1, v_1^2) ∈ R²,    w = v_2 = ReLU(A_3 · v_1ᵀ + b_3),

where, thus, A_3 ∈ R^{1×2} and a_1, a_2, b_1, b_2, b_3 ∈ R are the parameters.

(a) Describe the complexity of the class of functions represented by this NN.

(b) What is the minimal number P of non-degenerate samples, (v^{(p)}, w^{(p)}), p = 1, ..., P, needed for exact (!) reconstruction of the NN's parameters?

(c) Build an example where this NN outputs a continuous piecewise-linear function with two linear intervals.
Appendix A

Convex and Non-Convex Optimization ∗

This Appendix was originally prepared by Dr. Yury Maximov from Los Alamos National Laboratory (and edited by MC). The material was presented in 2020 in 6 lectures cross-cut between Math 581 (then Math 583), Math 584 (then Math 527) and Math 589 (then Math 575). In 2021 the material of the Appendix was mainly covered in Math 584 (then Math 527), with a brief summary included in Math 581 (then Math 583) via 1.5 lectures and two exercises. In 2022 we do not include the material in Math 581 at all. The Appendix should thus be viewed as a ∗-Section of the main text – suggested for extra-curricular study.
The Appendix is split into four Sections. Sections A.1 and A.2 discuss basic convex and non-convex finite-dimensional optimizations. Then, in Sections A.3 and A.4, we turn to discussing iterative optimization methods for the optimization formulations, set in Sections A.1 and A.2, which are of constrained and unconstrained types.
The most general problem we start our discussion from in Section A.1 consists in the minimization of a function, f : S ⊆ Rⁿ → R:

f(x) → min,    s.t.: x ∈ S ⊆ Rⁿ.    (A.1)

Notice the variability in notations – an absolutely equivalent alternative expression is

min_{x∈S⊆Rⁿ} f(x).

Section A.1 should be viewed as introductory (setting notations), leading us to a discussion of the notion of (optimization) duality in Section A.2.

This auxiliary Appendix can be dropped at a first reading. Material from the Appendix will not contribute to the midterm and final exams of Math 581.
Iterative algorithms, discussed in Sections A.3 and A.4, will be designed to solve Eq. (A.1). Each step of such an algorithm consists in updating the current estimate, x_k, using x_j, f(x_j), j ≤ k, possibly the vector of derivatives ∇f(x), and possibly the Hessian matrix, ∇²f(x), such that the optimum is achieved in the limit, lim_{k→+∞} f(x_k) = inf_{x∈S⊆Rⁿ} f(x).
Different iterative algorithms can be classified depending on the information available, as follows:

• Zero-order algorithms, where at each iteration step one has access to the value of f(x) at a given point x (but no information on ∇f(x) and ∇²f(x) is available);

• First-order algorithms, where at each iteration step one has access to the values of f(x) and ∇f(x);

• Second-order algorithms, where at each iteration step one has access to the values of f(x), ∇f(x) and ∇²f(x);

• Higher-order algorithms, where at each iteration step one has access to the value of the objective function and its first, second and higher-order derivatives.

We will not discuss higher-order algorithms in these notes, focusing in Sections A.3 and A.4 primarily on the first-order and second-order algorithms.

A.1 Convex Functions, Sets and Optimizations

Calculus of Convex Functions and Sets

An important class of functions one can efficiently minimize are the convex functions, which were introduced earlier in Definition 6.7.3. We restate it here for convenience.

Definition A.1.1 (Definition 6.7.3). A function f : Rⁿ → R is convex if

∀x, y ∈ Rⁿ, λ ∈ (0, 1):  f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y).

If a function is smooth, one can give an equivalent definition of convexity.

Definition A.1.2. A smooth function f : Rⁿ → R is convex if

∀x, y ∈ Rⁿ:  f(y) ≥ f(x) + ∇f(x)ᵀ(y − x).

Definition A.1.3. Let the function f : Rⁿ → R have a smooth gradient. Then f is convex iff

∀x:  ∇²f(x) (:= ∂_{x_i}∂_{x_j} f(x); ∀i, j = 1, ..., n) ⪰ 0,

that is, the Hessian of the function is a positive semi-definite matrix at every point. (Recall that a real symmetric n × n matrix H is positive semi-definite iff xᵀHx ≥ 0 for any x ∈ Rⁿ.)
Lemma A.1.4. Prove that the definitions above are equivalent for sufficiently smooth functions.

Proof. Assume that the function is convex according to Definition A.1.1. Then for any h ∈ Rⁿ, λ ∈ (0, 1], one has, according to Definition A.1.1,

f(λ(x + h) + (1−λ)x) − f(x) = f(x + λh) − f(x) ≤ λ(f(x + h) − f(x)).

That is,

f(x + h) − f(x) ≥ (f(x + λh) − f(x))/λ = ∇f(x)ᵀh + O(λ),    ∀λ ∈ (0, 1].

Then, taking the limit λ → 0, one has ∇f(x)ᵀh ≤ f(x + h) − f(x), ∀h ∈ Rⁿ, which is exactly Def. A.1.2. Vice versa, if ∀x, y: f(y) ≥ f(x) + ∇f(x)ᵀ(y − x), one has for z = λx + (1−λ)y and any λ ∈ [0, 1]:

f(y) ≥ f(z) + ∇f(z)ᵀ(y − z) = f(z) + λ∇f(z)ᵀ(y − x),
f(x) ≥ f(z) + ∇f(z)ᵀ(x − z) = f(z) + (1−λ)∇f(z)ᵀ(x − y).

Summing up the inequalities above with the coefficients 1−λ and λ, one gets f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y). Thus Def. A.1.1 and Def. A.1.2 are equivalent.
Further, if f is sufficiently smooth, one has, according to the Taylor expansion,

f(y) = f(x) + ∇f(x)ᵀ(y − x) + ½ (y − x)ᵀ∇²f(x)(y − x) + o(‖y − x‖²₂).

Taking y → x one gets from Definition A.1.2 to Definition A.1.3, and vice versa.

Definition A.1.5. A function f(x) is concave iff −f(x) is convex.

Definition A.1.2 is probably the most practical. To generalize it to non-smooth functions, we introduce the notion of a sub-gradient.

Definition A.1.6. A vector g ∈ Rⁿ is a sub-gradient of the convex function f : Rⁿ → R at point x iff

∀y ∈ Rⁿ:  f(y) ≥ f(x) + gᵀ(y − x).

The set ∂f(x) is the set of all sub-gradients of the function f at point x.

To establish some properties of the sub-gradients (which can also be called sub-differentials), let us introduce the notion of a convex set, i.e. a set which contains the segment between any two of its points.
Definition A.1.7. A set S is convex iff for any x₁, x₂ ∈ S and θ ∈ [0, 1] one has x₁θ + x₂(1−θ) ∈ S. In other words, the set S is convex if for any points x₁, x₂ in it, the set contains the line segment [x₁, x₂].

Theorem A.1.8. For any convex function f : Rⁿ → R and any point x ∈ Rⁿ the sub-differential ∂f(x) is a convex set. In other words, for any g₁, g₂ ∈ ∂f(x) one has θg₁ + (1−θ)g₂ ∈ ∂f(x). Moreover, ∂f(x) = {∇f(x)} if f is smooth.

Proof. Let g₁, g₂ ∈ ∂f(x); then f(y) ≥ f(x) + g₁ᵀ(y − x) and f(y) ≥ f(x) + g₂ᵀ(y − x). That is, for any λ ∈ [0, 1] one has f(y) ≥ f(x) + (λg₁ + (1−λ)g₂)ᵀ(y − x), and λg₁ + (1−λ)g₂ is a sub-gradient as well. We conclude that the set of all the sub-gradients is convex. Moreover, if f is smooth, according to the Taylor expansion formula one has f(x + h) = f(x) + ∇f(x)ᵀh + O(‖h‖²₂). Assume that there exists a sub-gradient g ∈ ∂f(x) other than ∇f(x) (note that ∇f(x) ∈ ∂f(x) by Definition A.1.2). Then f(x) + gᵀh ≤ f(x + h) = f(x) + ∇f(x)ᵀh + O(‖h‖²₂) and, similarly, f(x) − gᵀh ≤ f(x − h) = f(x) − ∇f(x)ᵀh + O(‖h‖²₂), so that

gᵀh ≤ ∇f(x)ᵀh + O(‖h‖²₂)  and  gᵀh ≥ ∇f(x)ᵀh + O(‖h‖²₂),

which implies g = ∇f(x), therefore concluding the proof.

Let us illustrate the sub-gradient calculus with the following examples:

• The sub-differential of f(x) = |x| is

∂f(x) = {1} if x > 0;  {−1} if x < 0;  [−1, 1] if x = 0.

• The sub-differential of f(x) = max{f₁(x), f₂(x)} is

∂f(x) = {∇f₁(x)} if f₁(x) > f₂(x);  {∇f₂(x)} if f₁(x) < f₂(x);  {θ∇f₁(x) + (1−θ)∇f₂(x), θ ∈ [0, 1]} if f₁(x) = f₂(x),

if f₁ and f₂ are smooth functions on Rⁿ.

Exercise A.1. Consider f(x, y) = √(x² + 4y²). Prove that f is convex. Sketch the level curves of f. Find the sub-differential ∂f(0, 0).

Example A.1.9. Examples of convex functions include:

a) x^p, p ≥ 1 or p ≤ 0, is convex (on x > 0); x^p, 0 ≤ p ≤ 1, is concave;

b) exp(x), x ∈ R, and −log x, x ∈ R₊₊, are convex;

c) f(h(x)), where f : R → R, h : R → R, is convex if
   (a) f(x) is convex and non-decreasing, and h(x) is convex;
   (b) or f(x) is convex and non-increasing, and h(x) is concave.
   To prove the statement for smooth functions we consider
   g″(x) = f″(h(x))(h′(x))² + f′(h(x))h″(x).
   One can also extend the statement to non-smooth and multidimensional functions.

d) LogSumExp, also called soft-max, log( Σ_{i=1}^n exp(x_i) ), is convex in x ∈ Rⁿ as a composition of a convex non-decreasing and a convex function. The soft-max function plays a very important role because it bridges smooth and non-smooth optimizations:

   max(x₁, x₂, ..., xₙ) ≈ (1/λ) log( Σ_{i=1}^n exp(λx_i) ),    λ → +∞.    (A.2)

   (A numerical illustration is given after this list.)

e) The ratio of a quadratic function of one variable to a linear function of another variable, e.g. f(x, y) = x²/y, is jointly convex in x and y for y > 0;

f) The vector norm ‖x‖_p := ( Σ_i |x_i|^p )^{1/p}, x ∈ Rⁿ, also called the p-norm or ℓ_p-norm, is convex when p ≥ 1;

g) The dual norm ‖·‖* to ‖·‖ is ‖y‖* := sup_{‖x‖≤1} xᵀy. The dual norm is always convex;

h) The indicator function of a convex set, I_S(x), is convex:

   I_S(x) = 0 if x ∈ S;  +∞ if x ∉ S.
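A quick numerical illustration of the soft-max approximation (A.2): the smooth quantity (1/λ) log Σ_i exp(λx_i) approaches max_i x_i from above as λ grows (the test vector below is arbitrary):

import numpy as np

x = np.array([0.3, -1.2, 2.0, 1.9])
for lam in [1.0, 10.0, 100.0]:
    smax = np.log(np.sum(np.exp(lam * x))) / lam
    print(lam, smax)      # decreases toward max(x) = 2.0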
Example A.1.10. Examples of convex sets:

1. If (any number of) sets {S_i}_i are convex, then ∩_i S_i is convex;

2. The affine image of a convex set is convex:

   S̄ = {Ax + b : x ∈ S};

3. The image (and the inverse image) of a convex set S under the perspective mapping P : Rⁿ⁺¹ → Rⁿ, P(x, t) = x/t, dom P = {(x, t) : t > 0}.
   Indeed, consider y₁, y₂ ∈ P(S), so that y₁ = x₁/t₁ and y₂ = x₂/t₂. We need to prove that for any λ ∈ [0, 1]

   y = λy₁ + (1−λ)y₂ = λ x₁/t₁ + (1−λ) x₂/t₂ = (θx₁ + (1−θ)x₂) / (θt₁ + (1−θ)t₂),

   which holds for θ = λt₂/(λt₂ + (1−λ)t₁). The proof of the inverse statement is similar.

4. The image of a convex set under a linear-fractional function, f : Rⁿ⁺¹ → Rⁿ, f(x) = (Ax + b)/(cᵀx + d), dom f = {x : cᵀx + d > 0}. Indeed, f(x) is a perspective transform of an affine function.

Exercise A.2. Check that all functions and all sets above are convex using Definition A.1.1
of the convex function (or equivalent Definitions A.1.2,A.1.3) and the Definition A.1.7 of
the complex set.
In further analysis, we introduce a special subclass of convex functions for which one can guarantee much faster convergence than for the minimization of a general convex function.

Definition A.1.11. A function f : Rⁿ → R is µ-strongly convex with respect to a norm ‖·‖ for some µ > 0, iff

1. ∀x, y : f(y) ≥ f(x) + ∇f(x)⊤(y − x) + (µ/2)‖y − x‖².

2. If f is sufficiently smooth, the strong convexity condition in the ℓ₂ norm is equivalent to ∀x : ∇²f(x) ⪰ µI.

As we will see later, generalization of the strong convexity definition A.1.11 to a general ℓp norm allows one to design more efficient algorithms in various cases. (Concavity, strong concavity and strong convexity in ℓp are defined by analogy.)
Exercise A.3. Find a subset of R³ containing (0, 0, 0) on which f(u) = sin(x + y + z), u = (x, y, z), is (a) convex; (b) strongly convex.

Exercise A.4. Is it true that the functions f(x) = x²/2 − sin x, x ∈ R, and g(x) = 1 + x⊤x, x ∈ Rⁿ, are convex? Are the functions strongly convex?
Exercise A.5. Check whether the function Σ_{i=1}^n xᵢ log xᵢ, defined on Rⁿ₊₊, is

• convex/concave/strongly convex/strongly concave;

• strongly convex/concave in ℓ₁, ℓ₂, ℓ∞.

Hint: to prove that a function is strongly convex in the ℓp norm it is sufficient to show that h⊤∇²f(x)h ≥ µ‖h‖ₚ² for some µ > 0.



Convex Optimization Problems

The optimization problem

f(x) → min, x ∈ S ⊆ Rⁿ,

is convex if f(x) and S are convex. The complexity of an iterative algorithm, initiated at x₀, for solving the optimization problem is measured by the number of iterations required to reach a point x_k such that |f(x_k) − inf_{x∈S} f(x)| < ε. Each iteration means an update of x_k. The complexity classification is as follows:

• linear: the number of iterations is k = O(log(1/ε)); in other words, f(x_{k+1}) − inf_{x∈S} f(x) ≤ c(f(x_k) − inf_{x∈S} f(x)) for some constant 0 < c < 1. Roughly, each iteration increases the number of correct digits in the answer by one.

• quadratic: k = O(log log(1/ε)), and f(x_{k+1}) − inf_{x∈S} f(x) ≤ c(f(x_k) − inf_{x∈S} f(x))² for some constant 0 < c < 1. That is, each iteration doubles the number of correct digits in the answer.

• sub-linear: characterized by a rate slower than O(log(1/ε)). In convex optimization, the convergence rate of different methods is often k = O(1/ε), O(1/ε²), or O(1/√ε), depending on the properties of the function f.

Consider an optimization problem

f(x) → min, x ∈ Rⁿ,
s.t.: g(x) ≤ 0,
      h(x) = 0.

If the inequality constraint g(x) is convex and the equality constraint is affine, h(x) = Ax + b, the feasible set of this problem, S = {x : g(x) ≤ 0 and h(x) = 0}, is convex, which follows immediately from the definitions of a convex set and a convex function. As we will see later in the lectures, in contrast to non-convex problems, convex ones admit very efficient and scalable solutions.
Exercise A.6. Let Π_C^{ℓp}(x) denote the projection of a point x onto a convex compact set C in the ℓp norm, i.e.

Π_C^{ℓp}(x) = arg min_{y∈C} ‖x − y‖ₚ.

Find the ℓ₁, ℓ₂, ℓ∞ projections of x = (1, 1/2, 1/3, . . . , 1/n) ∈ Rⁿ onto the unit simplex S = {x : Σ_{i=1}^n |xᵢ| = 1}. Which of the ℓ₁, ℓ₂, ℓ∞ projections of an arbitrary point x ∈ Rⁿ onto the unit simplex is easier to compute?

A.2 Duality
Duality is a very powerful tool which allows one (1) to design efficient (tractable) algorithms to approximate non-convex problems; (2) to build efficient algorithms for convex and non-convex problems with constraints (the dual formulations are often of a much smaller dimensionality than the original ones); (3) to formulate necessary and sufficient conditions of optimality for convex and non-convex optimization problems.

Lagrangian

Consider the following constrained (not necessarily convex) optimization problem:

f(x) → min (A.3)
s.t.: gᵢ(x) ≤ 0, 1 ≤ i ≤ m,
      hⱼ(x) = 0, 1 ≤ j ≤ p,
      x ∈ Rⁿ,

with the optimal value p* (which is possibly −∞). Let S be the feasible set of this problem, that is, the set of all x for which all the constraints are satisfied.
Compose the so-called Lagrangian function L : Rⁿ × Rᵐ × Rᵖ → R:

L(x, λ, µ) = f(x) + Σ_{i=1}^m λᵢgᵢ(x) + Σ_{j=1}^p µⱼhⱼ(x) = f(x) + λ⊤g(x) + µ⊤h(x), λ ≥ 0, (A.4)

which is a weighted combination of the objective and the constraints. The Lagrange multipliers, λ and µ, can be viewed as penalties for violation of the inequality and equality constraints.
The Lagrangian function (A.4) allows us to formulate the constrained optimization, Eq. (A.3), as a min-max (also called saddle-point) optimization problem:

p* = min_{x∈S⊆Rⁿ} max_{λ≥0,µ} L(x, λ, µ), (A.5)

where p* is the optimal value of Eq. (A.3).

Weak and Strong Duality

Let us consider the saddle-point problem (A.5) in greater detail. For any feasible point x ∈ S ⊆ Rⁿ one has f(x) ≥ L(x, λ, µ) for λ ≥ 0. Thus

L(λ, µ) = min_{x∈S} L(x, λ, µ) ≤ min_{x∈S} f(x) = p*  ⇒  max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) ≤ p* = min_{x∈S} max_{λ≥0,µ} L(x, λ, µ),


where L(λ, µ) = inf_{x∈Rⁿ} L(x, λ, µ) = inf_{x∈Rⁿ} (f(x) + λ⊤g(x) + µ⊤h(x)) is called the Lagrange dual function. One can restate this as

d* = max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) ≤ min_{x∈S} max_{λ≥0,µ} L(x, λ, µ) = p*.

The original optimization, min_{x∈S} f(x) = min_{x∈S} max_{λ≥0,µ} L(x, λ, µ), is called the Lagrange primal optimization, while max_{λ≥0,µ} L(λ, µ) = max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) is called the Lagrange dual optimization.
Note that max_{λ≥0,µ} min_{x∈S} L(x, λ, µ) = max_{λ≥0,µ} min_{x∈Rⁿ} L(x, λ, µ), regardless of what S is. This is because for x̂ ∉ S one has max_{λ≥0,µ} L(x̂, λ, µ) = +∞, thus allowing us to perform the unconstrained minimization of L(x, λ, µ) over x much more efficiently.
Let us describe a number of important features of the dual optimization:

1. Concavity of the dual function. The dual function L(λ, µ) is always concave. Indeed, for (λ̄, µ̄) = θ(λ₁, µ₁) + (1 − θ)(λ₂, µ₂) one has

L(λ̄, µ̄) = min_x L(x, λ̄, µ̄) = min_x {θL(x, λ₁, µ₁) + (1 − θ)L(x, λ₂, µ₂)}
≥ θ min_x L(x, λ₁, µ₁) + (1 − θ) min_x L(x, λ₂, µ₂) = θL(λ₁, µ₁) + (1 − θ)L(λ₂, µ₂).

The dual (maximization) problem max_{λ≥0,µ} L(λ, µ) is therefore equivalent to the minimization of the convex function −L(λ, µ) over the convex set λ ≥ 0.

2. Lower bound property: L(λ, µ) ≤ p* for any λ ≥ 0.

3. Weak duality: for any optimization problem d* ≤ p*. Indeed, for any feasible (x, λ, µ) we have f(x) ≥ L(x, λ, µ) ≥ L(λ, µ), thus p* = min_{x∈S} f(x) ≥ max_{λ≥0,µ} L(λ, µ) = d*.

4. Strong duality: we say that strong duality holds if p* = d*. Convexity of the objective function and of the feasible set S is neither a sufficient nor a necessary condition for strong duality (see the following example).

Example A.2.1. Convexity alone is not sufficient for strong duality. Find the dual problem and the duality gap p* − d* for the following optimization:

exp(−x) → min_{y>0, x}
s.t.: x²/y ≤ 0.

The optimal value is p* = 1, achieved at x = 0 and any positive y. The dual function is

L(λ) = inf_{y>0, x} (exp(−x) + λx²/y) = 0.

That is, the dual problem is max_{λ≥0} 0 = 0, and the duality gap is p* − d* = 1.

Theorem A.2.2 (Slater (sufficient) conditions). Consider the optimization (A.3) where all the equality constraints are affine and all the inequality constraints and the objective function are convex. Strong duality holds if there exists an x* which is strictly feasible, i.e. all constraints are satisfied and the nonlinear constraints are satisfied with strict inequalities.

The Slater conditions imply that the set of optimal solutions of the dual problem is non-empty, therefore making the conditions sufficient for strong duality of the optimization.

Optimality Conditions

Another notable feature of the Lagrangian function is its role in establishing necessary and sufficient conditions for a triplet (x, λ, µ) to be a solution of the saddle-point optimization (A.5). First, let us formulate the necessary conditions of optimality for

f(x) → min
s.t.: gᵢ(x) ≤ 0, 1 ≤ i ≤ m,
      hⱼ(x) = 0, 1 ≤ j ≤ p,
      x ∈ S ⊆ Rⁿ.

According to Eq. (A.5), the optimization is equivalent to

min_{x∈S} max_{λ≥0, µ} L(x, λ, µ),

where the Lagrangian is defined in Eq. (A.4). The following conditions, called the Karush-Kuhn-Tucker (KKT) conditions, are necessary for a triplet (x*, λ*, µ*) to be optimal:

1. Primal feasibility: x* ∈ S.

2. Dual feasibility: λ* ≥ 0.

3. Vanishing gradient: ∇ₓL(x*, λ*, µ*) = 0 for smooth functions, and 0 ∈ ∂ₓL(x*, λ*, µ*) for non-smooth functions. Indeed, for the optimal (λ*, µ*), L should attain its minimum at x*.

4. Complementary slackness: λᵢ*gᵢ(x*) = 0. Otherwise, if gᵢ(x*) < 0 and λᵢ* > 0, one can reduce the Lagrange multiplier and thereby increase the objective.

Note that the KKT conditions generalize (the finite-dimensional version of) the Euler-Lagrange conditions introduced in the variational calculus. Let us now investigate when the conditions are sufficient.

The KKT conditions are sufficient if the problem obeys strong duality, for which (as we saw above) the Slater conditions are sufficient. Indeed, assume that strong duality holds and a point (x*, λ*, µ*) satisfies the KKT conditions. Then

L(λ*, µ*) = f(x*) + g(x*)⊤λ* + h(x*)⊤µ* = f(x*), (A.6)

where the first equality holds because of stationarity, and the second equality holds because of complementary slackness.

Example A.2.3. Find the duality gap and solve the dual problem for the following minimization:

min (x₁ − 3)² + (x₂ − 2)²
s.t.: x₁ + 2x₂ = 4,
      x₁² + x₂² ≤ 5.

Note that the problem is convex and Slater's conditions are satisfied, therefore the minimum is unique and there is no duality gap. The Lagrangian is

L(x, λ, µ) = (x₁ − 3)² + (x₂ − 2)² + µ(x₁ + 2x₂ − 4) + λ(x₁² + x₂² − 5), λ ≥ 0.

The dual function is

L(λ, µ) = inf_{x∈Rⁿ} L(x, λ, µ).

The stationarity (KKT) conditions are

∇ₓL = (2(x₁ − 3) + µ + 2λx₁, 2(x₂ − 2) + 2µ + 2λx₂)⊤ = 0.

Therefore (1 + λ)(2x₁ − x₂) = 4, and using the primal feasibility constraint one derives x₁ = (12 + 4λ)/(5(1 + λ)), x₂ = (4 + 8λ)/(5(1 + λ)). The dual problem becomes

L(λ) = 5 − 9λ/5 − 16/(5(1 + λ)) → max_{λ≥0}.

Looking for a stationary point of L(λ), we arrive at λ = 1/3 and λ = −7/3. However, given that λ* ≥ 0, we get λ* = 1/3. Finally, the saddle point is (x₁*, x₂*, λ*, µ*) = (2, 1, 1/3, 2/3).
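A quick numerical cross-check of this saddle point; the sketch below assumes the SciPy library is available, and the starting point is an arbitrary choice:

```python
import numpy as np
from scipy.optimize import minimize

# primal problem of Example A.2.3
obj = lambda x: (x[0] - 3)**2 + (x[1] - 2)**2
cons = [
    {"type": "eq",   "fun": lambda x: x[0] + 2 * x[1] - 4},    # x1 + 2*x2 = 4
    {"type": "ineq", "fun": lambda x: 5 - x[0]**2 - x[1]**2},  # x1^2 + x2^2 <= 5
]
res = minimize(obj, x0=np.zeros(2), constraints=cons)
print(res.x)    # expect approximately (2, 1)
print(res.fun)  # expect (2-3)^2 + (1-2)^2 = 2
```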

Example A.2.4. For the primal problem

3x + 7y + z → min
s.t.: x + 5y = 2,
      x + y ≥ 3,
      z ≥ 0,

find the dual problem, the optimal values of the primal and dual objectives, as well as the optimal solutions for the primal and dual variables. Describe all the steps in detail.
Solution:
Solution:

1. Note that the problem is equivalent to

3x + 7y → min
s.t.: x + 5y = 2,
      x + y ≥ 3,

as x, y are independent of z, and the objective attains its minimum at z = 0.

2. Introduce the Lagrangian:

L(x, y, µ, λ) = 3x + 7y + µ(2 − x − 5y) + λ(3 − x − y), λ ≥ 0.

3. State the stationarity part of the KKT conditions, ∇L(x, y, µ, λ) = 0:

(d/dx) L(x, y, µ, λ) = 3 − µ − λ = 0,
(d/dy) L(x, y, µ, λ) = 7 − 5µ − λ = 0,

therefore resulting in µ = 1 and λ = 2. One observes that the Lagrange multipliers are feasible (λ ≥ 0), meaning that there exists at least one point on the intersection of the equality and inequality constraints.

4. The complementary slackness condition (for the inequality) is

λ(3 − x − y) = 0.

Since λ = 2 > 0, the respective inequality constraint is active: x + y = 3.

5. Using the primal feasibility one derives:

x + 5y = 2 and x + y = 3,

resulting in y = −0.25 and x = 3.25.

6. Optimal values of the primal variables are (x, y, z) = (3.25, −0.25, 0).

Dual problem.

1. The Lagrangian function is

L(x, y, µ, λ) = 3x + 7y + µ(2 − x − 5y) + λ(3 − x − y) = 2µ + 3λ + x(3 − µ − λ) + y(7 − 5µ − λ).

The dual objective is

L(λ, µ) = inf_{x,y} L(x, y, µ, λ) = 2µ + 3λ, if 3 − µ − λ = 0 and 7 − 5µ − λ = 0;  −∞, otherwise.

2. Thus, the dual problem is

2µ + 3λ → max
s.t.: 3 − µ − λ = 0,
      7 − 5µ − λ = 0,
      λ ≥ 0.

3. The duality gap is 0, as the problem is linear (Slater's condition is satisfied by definition). Indeed, the optimal dual objective is 2µ + 3λ = 2 · 1 + 3 · 2 = 8, matching the primal optimum 3 · 3.25 + 7 · (−0.25) + 0 = 8.
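The same primal-dual pair can also be verified numerically; here is a sketch assuming SciPy's linprog is available (note that linprog minimizes, so the dual objective is negated):

```python
from scipy.optimize import linprog

# primal: min 3x + 7y + z, s.t. x + 5y = 2, x + y >= 3, z >= 0
res_p = linprog(
    c=[3, 7, 1],
    A_ub=[[-1, -1, 0]], b_ub=[-3],           # -(x + y) <= -3
    A_eq=[[1, 5, 0]], b_eq=[2],
    bounds=[(None, None), (None, None), (0, None)],
)
# dual: max 2*mu + 3*lam, s.t. mu + lam = 3, 5*mu + lam = 7, lam >= 0
res_d = linprog(
    c=[-2, -3],                               # negate: linprog minimizes
    A_eq=[[1, 1], [5, 1]], b_eq=[3, 7],
    bounds=[(None, None), (0, None)],
)
print(res_p.fun, -res_d.fun)  # both should equal 8: zero duality gap
```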

Exercise A.7. For the primal optimization problems stated below find the dual problem, the optimal values of the primal and dual objectives, as well as the optimal solutions for the primal and dual variables. Describe all the steps in detail.

1. min 4x + 5y + 7z, s.t.: 2x + 7y + 5z + d = 9, and x, y, z, d ≥ 0. [Hint: try to drop an inequality constraint, find the optimal value, and check, after finding the optimal solution, whether the dropped inequality is satisfied.]

2. min {(x₁ − 5/2)² + 7x₂² − x₃²}, s.t.: x₁² − x₂ ≤ 0, and x₃² + x₂ ≤ 4.

Examples of Duality

Example A.2.5 (Duality and Legendre-Fenchel Transform). Let us discuss the relation between the transformation from the Lagrange function to the dual (Lagrange) function and the Legendre-Fenchel (LF) transform (or conjugate function),

f*(y) = sup_{x∈Rⁿ} (y⊤x − f(x)),

introduced in the Variational Calculus Section 6 of the course. One of the principal conclusions of the LF analysis is f(x) ≥ f**(x). The inequality is directly linked to the statement of duality, specifically to the fact that the dual optimization lower-bounds the primal one. To illustrate the relationship between f** and the dual problem, consider

f(x) → min
s.t.: x = b,

where b is a parameter; the optimal value is simply f(b). Then

f(b) = min_x max_µ {f(x) + µ⊤(b − x)} ≥ max_µ min_x {f(x) + µ⊤(b − x)}
     = max_µ {µ⊤b − max_x (µ⊤x − f(x))} = max_µ {µ⊤b − f*(µ)} = f**(b).

Since this holds for all b ∈ Rⁿ, one arrives at f(x) ≥ f**(x), and in particular min_{x∈Rⁿ} f(x) ≥ min_{x∈Rⁿ} f**(x).

Example A.2.6 (Duality in Linear Programming (LP)). Consider the following problem:

c⊤x → min
s.t.: Ax ≤ b.

We define the Lagrangian L(x, λ) = c⊤x + λ⊤(Ax − b), λ ≥ 0, and arrive at the following dual objective:

L(λ) = inf_{x∈Rⁿ} L(x, λ) = inf_{x∈Rⁿ} {x⊤(c + A⊤λ) − b⊤λ} = −b⊤λ, if c + A⊤λ = 0;  −∞, otherwise.

The resulting dual optimization is

L(λ) = −b⊤λ → max, s.t.: c + A⊤λ = 0, λ ≥ 0.

Example A.2.7 (Non-convex problems with strong duality). Consider the following quadratic minimization:

x⊤Ax + 2b⊤x → min
s.t.: x⊤x ≤ 1,

where A ⋡ 0 (so the problem is non-convex). Its dual objective is

L(λ) = inf_{x∈Rⁿ} L(x, λ) = inf_{x∈Rⁿ} {x⊤(A + λI)x + 2b⊤x − λ}
     = −∞, if A + λI ⋡ 0;  −∞, if A + λI ⪰ 0 and b ∉ Im(A + λI);  −b⊤(A + λI)⁺b − λ, otherwise.

The resulting dual optimization is

−b⊤(A + λI)⁺b − λ → max
s.t.: A + λI ⪰ 0,
      b ∈ Im(A + λI).

Let us restate the optimization in a convex form by introducing an extra variable t:

−t − λ → max
s.t.: t ≥ b⊤(A + λI)⁺b,
      A + λI ⪰ 0,
      b ∈ Im(A + λI).

Finally, using the Schur complement, one arrives at

−t − λ → max
s.t.: [ A + λI   b ;  b⊤   t ] ⪰ 0.

Example A.2.8 (Dual to binary Quadratic Programming (QP)). Consider the following binary quadratic optimization:

x⊤Ax → max
s.t.: xᵢ² = 1, 1 ≤ i ≤ n,

with A ⪰ 0. The dual optimization is

min_{x∈Rⁿ} {−x⊤Ax + Σ_{i=1}^n µᵢ(xᵢ² − 1)} = min_{x∈Rⁿ} {x⊤(Diag(µ) − A)x} − Σ_{i=1}^n µᵢ → max_µ,

that is,

Σ_{i=1}^n µᵢ → min (A.7)
s.t.: Diag(µ) ⪰ A.

Note that the optimization (A.7) is convex and it provides a non-trivial upper bound on the primal (maximization) problem. The bound is called the Semi-Definite Programming (SDP) relaxation.
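To make (A.7) concrete, here is a sketch of the SDP relaxation assuming the CVXPY package is available; the random test matrix and its size are arbitrary choices:

```python
import numpy as np
import cvxpy as cp

n = 6
rng = np.random.default_rng(0)
B = rng.standard_normal((n, n))
A = B @ B.T                                   # a positive semi-definite test matrix

mu = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.sum(mu)),
                  [cp.diag(mu) - A >> 0])     # Diag(mu) - A is PSD
prob.solve()

# brute-force the primal maximum over x in {-1, +1}^n for comparison
best = -np.inf
for k in range(2 ** n):
    x = np.array([1.0 if (k >> i) & 1 else -1.0 for i in range(n)])
    best = max(best, x @ A @ x)
print(best, prob.value)                       # prob.value upper-bounds best
```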

Example A.2.9. Show that min_{x: ‖x‖ₚ≤1} λ⊤x = −‖λ‖_{p/(p−1)}, where x, λ ∈ R^d, p ≥ 1, and ‖x‖ₚ := (|x₁|^p + · · · + |x_d|^p)^{1/p} is the p-norm of x.

Solution: The Lagrangian (saddle-point) formulation of the problem is

min_x max_{µ≥0} {λ⊤x + µ(‖x‖ₚ − 1)}. (A.8)

The original formulation is convex and Slater's condition is obviously satisfied (any x which is sufficiently small in the p-norm is strictly feasible), therefore strong duality holds and we can reverse the order of the optimizations in Eq. (A.8):

max_{µ≥0} min_x {λ⊤x + µ(‖x‖ₚ − 1)}. (A.9)

Next, let us write the KKT conditions. The stationarity condition over the primal variable for the objective in Eq. (A.9) is

∀i = 1, · · · , d :  λᵢ + µ x*ᵢ |x*ᵢ|^{p−2} / (|x*₁|^p + · · · + |x*_d|^p)^{1−1/p} = 0. (A.10)

The complementary slackness condition is

µ (‖x*‖ₚ − 1) = 0. (A.11)

Assume that λ ≠ 0 (otherwise the result is trivially zero); then µ ≠ 0 according to Eq. (A.10), ‖x*‖ₚ = 1, and Eq. (A.10) becomes ∀i = 1, · · · , d : λᵢ = −µ x*ᵢ |x*ᵢ|^{p−2}. Combining the two equations we derive

µ = (Σᵢ |λᵢ|^{p/(p−1)})^{(p−1)/p} = ‖λ‖_{p/(p−1)}, (A.12)
λ⊤x* = −µ Σᵢ |x*ᵢ|^p = −µ, (A.13)

therefore proving the relation after substituting the optimal values back into the objective.

Exercise A.8. For the quadratically constrained optimization problem

min_{x∈R^d} −(1/2) x⊤Lx + b⊤x (A.14)
s.t.: ‖x‖∞ ≤ 1, (A.15)

• (a) Describe conditions on L and b guaranteeing convexity of Eq. (A.14).

• (b) Find the dual of Eq. (A.14), restating the ℓ∞-constraint as a convex quadratic constraint. Is the duality gap zero for L ⪰ 0?

• (c) Show that if bb⊤ ⪰ εL ⪰ 0 for some ε > 0, then L = cbb⊤, where c is a constant.

• (d) Assuming that the conditions in (c) are satisfied, solve Eq. (A.14) analytically. (Hint: transition to a scalar variable and show that the problem reduces to a one-dimensional quadratic, concave optimization over a bounded domain.)

Conic Duality (additional material)

The standard formulation of conic optimization is:

c⊤x → min_x (A.16)
s.t.: Ax = b,
      x ∈ K,

where K is a proper cone, i.e. a set which satisfies:

1. K is a convex cone, that is, for any x, y ∈ K one has αx + βy ∈ K for all α, β ≥ 0;

2. K is closed;

3. K is solid, meaning it has a nonempty interior;

4. K is pointed, meaning that if x ∈ K and −x ∈ K, then x = 0.

Conic optimization problems are important in optimization. In Example A.2.8 you have already seen a conic optimization problem (the dual to the binary quadratic optimization), posed over the cone of positive semi-definite matrices.
The dual cone K* of K is defined as K* = {c : ⟨c, x⟩ ≥ 0, ∀x ∈ K}.

Exercise A.9. Show that the following sets are self-dual cones (that is, K* = K):

1. the set of positive semi-definite matrices, Sⁿ₊;

2. the positive orthant, Rⁿ₊;

3. the second-order cone, Qₙ = {(x, t) ∈ Rⁿ⁺¹ : t ≥ ‖x‖₂}.

Note that in the case of semi-definite matrices c⊤x stands for Σ_{i,j=1}^n cᵢⱼxᵢⱼ (i.e., the sum of the entries of the entry-wise, Hadamard, product of the matrices). The Lagrangian of Problem (A.16) is given by

L = c⊤x + µ⊤(b − Ax) − λ⊤x,

where the last term accounts for x ∈ K. From the definition of the dual cone one derives

max_{λ∈K*} (−λ⊤x) = 0, if x ∈ K;  +∞, if x ∉ K.

Therefore

p* = min_x max_{λ∈K*, µ} L(x, λ, µ) ≥ d* = max_{λ∈K*, µ} min_x L(x, λ, µ).

And the dual function is

L(λ, µ) = min_x {c⊤x + µ⊤(b − Ax) − λ⊤x} = µ⊤b, if c − A⊤µ − λ = 0;  −∞, otherwise.

And finally

d* = max µ⊤b
s.t.: c − A⊤µ − λ = 0,
      λ ∈ K*.

Eliminating λ, one has

µ⊤b → max
s.t.: c − A⊤µ ∈ K*.

Exercise A.10. Find the dual problem (see Example A.2.8) to

1⊤µ = Σ_{i=1}^n µᵢ → min
s.t.: Diag(µ) ⪰ A.

Ensure that your dual problem is equivalent to

⟨A, X⟩ → max
s.t.: X ∈ Sⁿ₊,
      Xᵢᵢ = 1 ∀i.

In the remainder of this Section we will study iterative algorithms for solving the optimization problems discussed so far. It will be convenient to think about iterations in terms of “discrete (algorithmic) time”, and also to consider the “continuous time” limit when the change of the values per iteration is sufficiently small and the number of iterations is sufficiently large. In the continuous time analysis of the algorithms we utilize the language of differential equations, as it helps both with intuition (familiar from first-semester studies of differential equations) and with analysis. However, to reach some of the rigorous conclusions we may also return to the original, discrete, language.

A.3 Unconstrained First-Order Convex Minimization

In this lecture, we will consider an unconstrained convex minimization problem,

f(x) → min_{x∈Rⁿ},

and focus on first-order optimization methods. That is, we assume that the objective function, as well as the gradient of the objective function, can be evaluated efficiently. Note that the first-order methods described in this Section are the most popular methods/algorithms currently in use for solving the majority of practical machine learning, data science and, more generally, applied mathematics problems.
We assume that the function f is smooth, that is,

∀x, y : ‖∇f(x) − ∇f(y)‖* ≤ β‖x − y‖, (A.17)

for some positive constant β. Choosing the ℓ₂ norm, ‖·‖ = ‖·‖₂, one derives

f(y) ≤ f(x) + ∇f(x)⊤(y − x) + (β/2)‖y − x‖₂², ∀x, y ∈ Rⁿ.

To simplify the presentation we will omit in the following the clause “w.r.t. the norm ‖·‖” when discussing the ℓ₂ norm.

Smooth Optimization

Gradient Descent. Gradient Descent (GD) is the simplest and arguably the most popular method/algorithm for solving convex (and non-convex) optimization problems. An iteration of the GD algorithm is

x_{k+1} = x_k − η_k∇f(x_k) = arg min_x {f(x_k) + ∇f(x_k)⊤(x − x_k) + (1/2η_k)‖x − x_k‖₂²} =: arg min_x h_{η_k}(x), η_k ≤ 1/β,

where we assume that f is β-smooth with respect to the ℓ₂ norm. If η_k ≤ 1/β, each step of the GD becomes equivalent to minimization of the convex quadratic upper bound h_{η_k}(x) of f(x).

Definition A.3.1. A function f : Rⁿ → R is β-smooth w.r.t. a norm ‖·‖ if

‖∇f(x) − ∇f(y)‖* ≤ β‖x − y‖ ∀x, y.

If ‖·‖ = ‖·‖₂, we simply call the function β-smooth.

Theorem A.3.2. Assume that a function f : Rⁿ → R is convex and β-smooth. Then repeating the GD step k times with a fixed step-size η ≤ 1/β results in x_k which satisfies

f(x_k) − f(x*) ≤ ‖x₁ − x*‖₂²/(2ηk), η ≤ 1/β, (A.18)

where x* is the optimal solution.

We will provide the continuous time proof of the Theorem, as well as its discrete time
version, where the former will rely on the notion of the Lyapunov function.

Definition A.3.3. A Lyapunov function, V(x(t)), of the differential equation ẋ(t) = f(x(t)) is a function that

1. decreases monotonically along the (discrete or continuous time) trajectory, V̇(x(t)) < 0;

2. converges to zero as t → ∞, i.e. V(x(∞)) = 0, where x* = x(∞).

From now on, we will use capitalized notation, X(t), for the continuous time version of
(xk |k = 1, · · · ).

Proof of Theorem A.3.2: Continuous time. The GD algorithm can be viewed as a discretization of the first-order differential equation

Ẋ(t) = −∇f(X(t)).

Introduce the following Lyapunov function for this ODE: V(X(t)) = ‖X(t) − x*‖₂²/2. Then

(d/dt)V(t) = (X(t) − x*)⊤Ẋ(t) = −∇f(X(t))⊤(X(t) − x*) ≤ −(f(X(t)) − f*), (A.19)

where the last inequality is due to the convexity of f. Integrating Eq. (A.19) over time, one derives

V(X(t)) − V(X(0)) ≤ tf* − ∫₀ᵗ f(X(τ)) dτ.

Utilizing (a) Jensen's inequality,

f((1/t)∫₀ᵗ X(τ) dτ) ≤ (1/t)∫₀ᵗ f(X(τ)) dτ,

which is valid for all convex functions, and (b) the non-negativity of V(t), one derives

f((1/t)∫₀ᵗ X(τ) dτ) − f* ≤ (1/t)∫₀ᵗ f(X(τ)) dτ − f* ≤ V(X(0))/t.

The proof is complete after setting t ≈ k/β (k steps of size η ≤ 1/β) and recalling that f is smooth.

Proof of Theorem A.3.2: Discrete time. The condition of smoothness, applied to y = x − η∇f(x), results in

f(y) ≤ f(x) + ∇f(x)⊤(y − x) + (β/2)‖y − x‖₂²
     = f(x) + ∇f(x)⊤(x − η∇f(x) − x) + (β/2)‖x − η∇f(x) − x‖₂²
     = f(x) − η‖∇f(x)‖₂² + (βη²/2)‖∇f(x)‖₂²
     = f(x) − (1 − βη/2) η‖∇f(x)‖₂².

As η ≤ 1/β, one derives 1 − βη/2 ≥ 1/2, and

f(y) ≤ f(x) − (η/2)‖∇f(x)‖₂². (A.20)
2
Note that Eq. (A.20) does not require convexity of the function; however, if the function is convex one derives

f(x*) ≥ f(x) + ∇f(x)⊤(x* − x),

by choosing y = x*. Plugging the last inequality into Eq. (A.20), one derives for y = x − η∇f(x):

f(y) − f(x*) ≤ ∇f(x)⊤(x − x*) − (η/2)‖∇f(x)‖₂²
            = (1/2η)(‖x − x*‖₂² − ‖x − η∇f(x) − x*‖₂²)
            = (1/2η)(‖x − x*‖₂² − ‖y − x*‖₂²).

Summing over the iterations,

Σ_{j≤k} (f(x_{j+1}) − f(x*)) ≤ (1/2η) Σ_{j≤k} (‖x_j − x*‖₂² − ‖x_{j+1} − x*‖₂²)
                            = (1/2η)(‖x₁ − x*‖₂² − ‖x_{k+1} − x*‖₂²)
                            ≤ R₂²/(2η) = βR₂²/2,

where R₂² ≥ ‖x₁ − x*‖₂² and the step-size is η = 1/β. Finally, dividing by k,

min_{j≤k} f(x_j) − f(x*) ≤ f(x̄) − f(x*) ≤ βR₂²/(2k),

where x̄ = Σ_{j≤k} x_j / k.

One obviously would like to choose the step size in GD which results in the fastest convergence. However, this problem, of choosing the best (or simply a good) step size, is hard and remains open. The statement also means that finding a good stopping criterion for the iterations is hard as well. Here are practical/empirical strategies for choosing the step size in GD:

• Exact line search. Choose η_k so that

η_k = arg min_η {f(x_k − η∇f(x_k))}.

• Backtracking line search. Choose the step-size η_k so that

f(x_k − η_k∇f(x_k)) ≤ f(x_k) − (η_k/2)‖∇f(x_k)‖₂².

As the difference between the right-hand side and the left-hand side of the inequality above is monotone in η_k, one can start with some η and then update η → bη, 0 < b < 1, until the inequality holds (see the sketch after this list).

• Polyak's step-size rule. If the optimal value f* of the function is known, one can suggest a better step-size policy. Minimization of the right-hand side of

‖x_{k+1} − x*‖₂² ≤ ‖x_k − x*‖₂² − 2η_k(f(x_k) − f(x*)) + η_k²‖g_k‖₂² → min_{η_k}

results in Polyak's rule, η_k = (f(x_k) − f(x*))/‖g_k‖₂², which is known to be useful, in particular, for solving underdetermined systems of linear equations, Ax = b.
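A minimal sketch of GD with backtracking line search; the test function, constants and stopping rule below are illustrative choices, not prescriptions from the lecture:

```python
import numpy as np

def gd_backtracking(f, grad, x0, eta0=1.0, b=0.5, max_iter=200, tol=1e-8):
    """Gradient descent; the step is shrunk until the sufficient-decrease
    inequality f(x - eta*g) <= f(x) - (eta/2)*||g||^2 holds."""
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        eta = eta0
        while f(x - eta * g) > f(x) - 0.5 * eta * g @ g:
            eta *= b                      # backtrack: eta -> b*eta
        x = x - eta * g
    return x

# example: a simple ill-conditioned quadratic
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
print(gd_backtracking(f, grad, np.array([5.0, 3.0])))  # approaches (0, 0)
```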

Exercise A.11. Recall that GD minimizes the convex quadratic upper bound h_{η_k}(x) of f(x). Consider a modified GD, where the step size is η = (2 + ε)/β, with ε chosen positive. (Notice that the step size used in the conditions of Theorem A.3.2 was η ≤ 1/β.) Derive a modified version of Eq. (A.18). Can one find a quadratic convex function for which the modified algorithm fails to converge?

Exercise A.12. Consider minimization of the following (non-convex) function f:

f(x) → min
s.t.: ‖x − x*‖ ≤ ε, x ∈ Rⁿ,

where x* is the global and unique minimum of the β-smooth function f. Moreover, let

∀x ∈ Rⁿ : (1/2)‖∇f(x)‖₂² ≥ µ(f(x) − f(x*)).

Is it true that for some small ε > 0 the GD with step-size η_k = 1/β converges to the optimum? How does ε depend on β and µ?

Exercise A.13 (not graded - difficult). In many optimization problems it is often the case that the exact value of the gradient is polluted, i.e. only its noisy version is observed. In this case one may consider the following “inexact oracle” optimization: f(x) → min, x ∈ Rⁿ, assuming that for any x one can compute f̂(x) and ∇̂f(x) such that

∀x : |f(x) − f̂(x)| ≤ δ, and ‖∇f(x) − ∇̂f(x)‖₂ ≤ ε,

and seek an algorithm to solve it. Propose and analyze a modification of GD solving the “inexact oracle” optimization.

Gradient Descent in ℓp. GD in the ℓp norm,

x_{k+1} = arg min_{x∈S⊂Rⁿ} {f(x_k) + ∇f(x_k)⊤(x − x_k) + (1/2η_k)‖x − x_k‖ₚ²},

where η_k ≤ 1/βp and βp ≥ sup_x ‖g(x)‖ₚ, with a properly chosen p can converge much faster than in ℓ₂. GD in ℓ₁ is particularly popular.

Exercise A.14. Restate and prove the discrete-time version of Theorem A.3.2 for GD in the ℓp norm. (Hint: consider the following Lyapunov function: ‖x − x*‖ₚ².)

Gradient Descent for Strongly Convex, Smooth Functions.

Theorem A.3.4. GD for a µ-strongly convex, β-smooth function f with the fixed step-size policy

x_{k+1} = x_k − η∇f(x_k), η = 1/β,

converges to the optimal solution as

f(x_{k+1}) − f(x*) ≤ cᵏ(f(x₁) − f(x*)),

where c ≤ 1 − µ/β.

Exercise A.15. Extend the proof of Theorem A.3.2 to obtain Theorem A.3.4.



Fast Gradient Descent. GD is simple and efficient in practice. However, it may also be slow if the gradient is small. It may also oscillate about the point of optimality if the gradient points in a direction with a small projection on the optimal direction (pointing at the optimum). The following two modifications of the GD algorithm were introduced to cure these problems:

(1964) Polyak's heavy-ball rule:
x_{k+1} = x_k − η_k∇f(x_k) + µ_k(x_k − x_{k−1}); (A.21)

(1983) Nesterov's Fast Gradient Method (FGM):
x_{k+1} = x_k − η_k∇f(x_k + µ_k(x_k − x_{k−1})) + µ_k(x_k − x_{k−1}). (A.22)

The last term in Eqs. (A.21, A.22) is called the “momentum” or “inertia” term, to emphasize the relation to the respective phenomena in classical mechanics. The inertia term, added to the original GD term (which may be associated with “damping” or “friction”), aims to force the hypothetical “ball” to roll towards the optimum faster. In spite of their seemingly minor difference, the convergence rates of the FGM and of the heavy-ball method differ rather dramatically, as the heavy ball can lead to an overshoot (not enough “friction”).

Exercise A.16. Construct a convex function f with a piece-wise linear gradient such that the heavy-ball algorithm (A.21) with some fixed µ and η fails to converge.

Consider a slightly modified (less general, two-step recurrence) version of the FGM (A.22):

x_k = y_{k−1} − η∇f(y_{k−1}),   y_k = x_k + ((k − 1)/(k + 2))(x_k − x_{k−1}), (A.23)

which can be re-stated in continuous time as follows:

Ẍ(t) + (3/t)Ẋ(t) + ∇f(X) = 0. (A.24)

Indeed, assuming t ≈ k√η and re-scaling, one derives from Eq. (A.23)

(x_{k+1} − x_k)/√η = ((k − 1)/(k + 2)) (x_k − x_{k−1})/√η − √η ∇f(y_k). (A.25)

Let x_k ≈ X(k√η); then

X(t) ≈ x_{t/√η} = x_k,   X(t + √η) ≈ x_{(t+√η)/√η} = x_{k+1},

and utilizing the Taylor expansion

(x_{k+1} − x_k)/√η = Ẋ(t) + (1/2)Ẍ(t)√η + o(√η),
(x_k − x_{k−1})/√η = Ẋ(t) − (1/2)Ẍ(t)√η + o(√η),

one arrives at

Ẋ(t) + (1/2)Ẍ(t)√η + o(√η) = (1 − 3√η/t)(Ẋ(t) − (1/2)Ẍ(t)√η + o(√η)) − √η ∇f(X(t)) + o(√η),

resulting in Eq. (A.24).
To analyze the convergence rate of the FGM (A.24) we introduce the following Lyapunov function:

V(X(t)) = t²(f(X(t)) − f*) + 2‖X + tẊ/2 − x*‖₂².

The time derivative of the Lyapunov function is

V̇(X(t)) = 2t(f(X(t)) − f*) + t²∇f(X(t))⊤Ẋ(t) + 4(X(t) + tẊ(t)/2 − x*)⊤(3Ẋ(t)/2 + tẌ(t)/2).

Given that, by Eq. (A.24), 3Ẋ/2 + tẌ/2 = −t∇f(X)/2, and also utilizing the convexity of f, one derives (the terms containing Ẋ⊤∇f cancel)

V̇ = 2t(f(X) − f*) − 4(X − x*)⊤(t∇f(X)/2) = 2t(f(X) − f*) − 2t(X − x*)⊤∇f(X) ≤ 0.

Making use of the monotonicity of V and of the non-negativity of ‖X + tẊ/2 − x*‖, one finds

f(X(t)) − f* ≤ V(t)/t² ≤ V(0)/t² = 2‖x₀ − x*‖₂²/t².

Finally, substituting t ≈ k√η, one derives

f(x_k) − f* ≤ 2‖x₀ − x*‖₂²/(ηk²), η ≤ 1/β.
We have just sketched a proof of the following statement.

Theorem A.3.5. Fast GD for f(x) → min_{x∈Rⁿ}, where f(x) is a β-smooth convex function, with the update rule

x_k = y_{k−1} − η∇f(y_{k−1}),   y_k = x_k + ((k − 1)/(k + 2))(x_k − x_{k−1}),

converges to the optimum as

f(x_{k+1}) − f* ≤ 2‖x₀ − x*‖₂²/(ηk²).

As always, turning the continuous time sketch of the proof into an actual (discrete time) proof takes some additional technical effort.
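A toy side-by-side comparison of plain GD and the FGM recurrence (A.23) on an ill-conditioned quadratic; the matrix and iteration count are arbitrary choices:

```python
import numpy as np

A = np.diag([1.0, 100.0])            # f(x) = x^T A x / 2, so beta = 100
grad = lambda x: A @ x
eta = 1.0 / 100.0                    # eta = 1/beta

x_gd = np.array([1.0, 1.0])          # plain GD iterate
x = y = np.array([1.0, 1.0])         # FGM iterates, Eq. (A.23)
for k in range(1, 301):
    x_gd = x_gd - eta * grad(x_gd)
    x_new = y - eta * grad(y)
    y = x_new + (k - 1.0) / (k + 2.0) * (x_new - x)
    x = x_new
print(np.linalg.norm(x_gd), np.linalg.norm(x))  # FGM ends up much closer to 0
```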

Exercise A.17. Consider the following differential equation,

Ẍ(t) + (r/t)Ẋ(t) + ∇f(X) = 0,

at some positive r. Derive the respective discrete-time algorithm, analyze its convergence and show that if r ≤ 2, the convergence rate of the algorithm is O(1/k²).

Exercise A.18. Show that the FGM method, described by Eq. (A.23), transitions to
Eq. (A.22) at some ηk .

Non-Smooth Problems

Sub-Gradient Method. We start the discussion of Sub-Gradient (SG) methods with the simplest, and arguably most popular, SG algorithm:

x_{k+1} = x_k − η_k g_k, g_k ∈ ∂f(x_k), (A.26)

which is just the original GD with the gradient replaced by a sub-gradient, to deal with non-smooth f. Note, however, that it is not proper to call the algorithm (A.26) SG descent, because in the process of iterations f(x_{k+1}) may become larger than f(x_k). To fix the problem one may keep track of the best point, or substitute the result by an average of the points seen in the iterations so far (a finite-horizon portion of the past). For example, one may augment Eq. (A.26) at each k with

f_best^{(k)} = min{f_best^{(k−1)}, f(x_k)}.

We assume that the SG of f(x) is bounded, that is,

∀x : ‖g(x)‖ ≤ L, g(x) ∈ ∂f(x).

This condition follows, for example, from the Lipschitz condition, |f(x) − f(y)| ≤ L‖x − y‖, imposed on f. Let x* be the optimal point of f(x) → min_{x∈Rⁿ}; then

‖x_{k+1} − x*‖₂² = ‖x_k − η_k g_k − x*‖₂² = ‖x_k − x*‖₂² − 2η_k g_k⊤(x_k − x*) + η_k²‖g_k‖₂²
                 ≤ ‖x_k − x*‖₂² − 2η_k(f(x_k) − f(x*)) + η_k²‖g_k‖₂², (A.27)

where the last inequality is due to convexity of f, i.e. f(x*) ≥ f(x_k) + g_k⊤(x* − x_k).
Applying the inequality (A.27) recursively,

‖x_{k+1} − x*‖₂² ≤ ‖x₁ − x*‖₂² − 2 Σ_{j≤k} η_j(f(x_j) − f(x*)) + Σ_{j≤k} η_j²‖g_j‖₂²,

one derives

2 (Σ_{j≤k} η_j) (f_best^{(k)} − f(x*)) ≤ 2 Σ_{j≤k} η_j(f(x_j) − f(x*)) ≤ ‖x₁ − x*‖₂² + Σ_{j≤k} η_j²‖g_j‖₂²,

which becomes

f_best^{(k)} − f(x*) = min_{j≤k} f(x_j) − f* ≤ (‖x₁ − x*‖₂² + L₂² Σ_{j≤k} η_j²)/(2 Σ_{j≤k} η_j),

where we assume that the SGs of f are bounded by L₂ in the ℓ₂ norm. Therefore, if R₂² ≥ ‖x₁ − x*‖₂², one arrives at

min_{j≤k} f(x_j) − f* ≤ min_η (R₂² + L₂² Σ_{j≤k} η_j²)/(2 Σ_{j≤k} η_j) = RL/√k, (A.28)

where the minimizing step-size is η_k = R/(L√k). Note that the ∼1/√k scaling in Eq. (A.28) is much worse than the ∼1/k² we got above for smooth functions (with acceleration). In the following we discuss this result in more detail and suggest a number of ways to improve the convergence.
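A minimal sketch of the SG iteration (A.26) with the η_k = R/(L√k) schedule and best-point tracking; the test function f(x) = ‖x‖₁ and the constants are illustrative choices:

```python
import numpy as np

def subgradient_method(f, subgrad, x0, R, L, n_iter=2000):
    """Sub-gradient iterations x_{k+1} = x_k - eta_k * g_k, eta_k = R/(L*sqrt(k));
    f itself need not decrease monotonically, so track the best point seen."""
    x = x0.astype(float)
    x_best, f_best = x.copy(), f(x)
    for k in range(1, n_iter + 1):
        x = x - (R / (L * np.sqrt(k))) * subgrad(x)
        if f(x) < f_best:
            x_best, f_best = x.copy(), f(x)
    return x_best, f_best

f = lambda x: np.abs(x).sum()       # non-smooth at 0
subgrad = lambda x: np.sign(x)      # a valid sub-gradient of the l1 norm
x0 = np.array([3.0, -2.0, 1.0])
print(subgradient_method(f, subgrad, x0, R=np.linalg.norm(x0), L=np.sqrt(3)))
```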

Proximal Gradient Method. In multiple machine learning (and more generally statistics) applications we deal with a function built as a sum over samples. Inspired by this application, consider the following composite optimization:

f(x) = g(x) + h(x) → min_{x∈Rⁿ}, (A.29)

where we assume that g : Rⁿ → R is a convex and smooth function on Rⁿ, and h : Rⁿ → R is a closed, convex and possibly non-smooth function on Rⁿ. One of the most frequently used composite optimizations is the Lasso minimization:

f(x) = ‖Ax − b‖₂² + λ‖x‖₁ → min_{x∈Rⁿ}. (A.30)

Notice that the ‖x‖₁ term is not smooth at x = 0.


Let us now introduce the so-called proximal operator,

prox_h(x) = arg min_{u∈Rⁿ} {h(u) + (1/2)‖u − x‖₂²},

which will soon be linked to the composite optimization. Standard examples of the proximal operator/function are:

1. h(x) = I_C(x), that is, h(x) is the indicator of a convex set C. Then the proximal function,

prox_h(x) = arg min_{u∈C} ‖x − u‖₂²,

is the projection of x onto C.

2. h(x) = λ‖x‖₁; then the proximal function acts as a soft threshold:

prox_h(x)ᵢ = xᵢ − λ, if xᵢ ≥ λ;  xᵢ + λ, if xᵢ ≤ −λ;  0, otherwise.

The examples suggest using the proximal operator to handle the non-smooth functions entering the respective optimizations. Having this use of the proximal operator in mind, we introduce the Proximal Gradient Descent (PGD) algorithm:

x_{k+1} = prox_{η_k h}(x_k − η_k∇g(x_k)) = arg min_u {(1/2)‖x_k − η_k∇g(x_k) − u‖₂² + η_k h(u)}
        = arg min_u {g(x_k) + ∇g(x_k)⊤(u − x_k) + (1/2η_k)‖u − x_k‖₂² + h(u)},

where η_k ≤ 1/β, and g is a β-smooth function in the ℓ₂ norm.
Note that, as in the case of the GD algorithm, at each step of the PGD we minimize a convex upper bound of the objective function. It turns out that the PGD algorithm has the same convergence rate (measured in the number of iterations) as the GD algorithm. Finally, we are ready to connect the PGD algorithm to the composite optimization (A.29).

Theorem A.3.6. The PGD algorithm,

x_{k+1} = prox_{ηh}(x_k − η∇g(x_k)), η ≤ 1/β,

with a fixed step-size policy converges to the optimal value f* of the composite optimization (A.29) according to

f(x_{k+1}) − f* ≤ ‖x₀ − x*‖₂²/(2ηk).

The proof of Theorem A.3.6 repeats the logic we used to prove Theorem A.3.2 for the GD algorithm. Moreover, one can also accelerate the PGD, similarly to how we have accelerated GD. The accelerated version of the PGD is

x_k = prox_{η_k h}(y_{k−1} − η_k∇g(y_{k−1})),   y_k = x_k + ((k − 1)/(k + 2))(x_k − x_{k−1}).
We naturally arrive at the PGD version of Theorem A.3.5:

Theorem A.3.7. PGD for the composite convex optimization (A.29),

f(x) = g(x) + h(x) → min_{x∈Rⁿ},

with the update rule

x_k = prox_{ηh}(y_{k−1} − η∇g(y_{k−1})),   y_k = x_k + ((k − 1)/(k + 2))(x_k − x_{k−1}),

converges as

f(x_{k+1}) − f* ≤ 2‖x₀ − x*‖₂²/(ηk²),

for any β-smooth convex g (and closed convex h).
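A sketch of PGD applied to the Lasso (A.30), using the soft-threshold prox of example 2 above (this scheme is often called ISTA); the synthetic data and λ are arbitrary choices:

```python
import numpy as np

def soft_threshold(x, t):
    """prox of t*||.||_1: shrink each coordinate towards zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
x_true = (np.arange(20) < 3).astype(float)       # 3-sparse ground truth
b = A @ x_true + 0.01 * rng.standard_normal(50)
lam = 1.0
eta = 0.5 / np.linalg.norm(A.T @ A, 2)           # eta = 1/beta, beta = 2||A^T A||

x = np.zeros(20)
for _ in range(500):
    grad_g = 2 * A.T @ (A @ x - b)               # gradient of the smooth part g
    x = soft_threshold(x - eta * grad_g, eta * lam)  # prox step, t = eta*lam
print(np.round(x, 2))                            # approximately 3-sparse
```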

PGD is one possible approach developed to deal with non-smooth objectives. Another
sound alternative is discussed next.

Smoothing Out Non-Smooth Objectives

Consider the following min-max optimization,

max_{1≤i≤n} fᵢ(x) → min_{x∈Rⁿ},

which is one of the most common non-smooth optimizations. Recall that a smooth and convex approximation to the maximum function is provided by the soft-max function (A.2), which can then be minimized by the accelerated GD (that has a convergence rate of O(1/√ε), in contrast to O(1/ε²) for non-smooth functions). An accurate choice of λ (the parameter within the soft-max) normally allows one to speed up the algorithms to O(1/ε).

A.4 Constrained First-Order Convex Minimization

Projected Gradient Descent

The Projected Gradient Descent (PGD) is

x_{k+1} = Π_C(x_k − η_k∇f(x_k)) (A.31)
        = arg min_y {f(x_k) + ∇f(x_k)⊤(y − x_k) + (1/2η_k)‖y − x_k‖₂² + I_C(y)}
        = prox_{I_C}(x_k − η_k∇f(x_k)),

where Π_C is the Euclidean projection onto the convex set C, Π_C(y) = arg min_{x∈C} ‖x − y‖₂². PGD has the same convergence rate as GD. The proof is similar to that of gradient descent, taking into account that the projection does not lead to an expansion, i.e.

‖x_{k+1} − x*‖₂² ≤ ‖x_k − η_k∇f(x_k) − x*‖₂², as x* ∈ C.
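To make Π_C concrete (and as a partial pointer for Exercise A.6), here is a sketch of PGD over the probability simplex, with the standard sorting-based Euclidean projection; the quadratic objective is an arbitrary test case, and C = {x ≥ 0, Σᵢxᵢ = 1} is assumed:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto {x >= 0, sum(x) = 1} (sorting-based)."""
    u = np.sort(y)[::-1]
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(y) + 1) > cssv)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(y - theta, 0.0)

# PGD for f(x) = ||x - c||^2 / 2 over the simplex
c = np.array([0.9, 0.5, -0.3, 0.1])
x = np.full(4, 0.25)
for _ in range(100):
    x = project_simplex(x - 0.5 * (x - c))  # eta = 1/2, grad f = x - c
print(x, x.sum())                           # feasible minimizer, sums to 1
```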

Exercise A.19 (Alternating Projections). Consider two convex sets C, D ⊆ Rⁿ and pose the question of finding x ∈ C ∩ D. One starts from x₀ ∈ C and applies PGD:

y_k = Π_C(x_k),   x_{k+1} = Π_D(y_k).

How many iterations are required to guarantee

max{ inf_{x∈C} ‖x_k − x‖, inf_{x∈D} ‖x_k − x‖ } ≤ ε?

Frank-Wolfe Algorithm (Conditional Gradient)

The Frank-Wolfe algorithm solves the optimization problem

f(x) → min, s.t.: x ∈ C. (A.32)

In contrast to the PGD algorithm (A.31), which makes a projection at each iteration, the Frank-Wolfe (FW) algorithm solves the following linear problem on C:

y_k = arg min_{y∈C} y⊤∇f(x_k),   x_{k+1} = (1 − γ_k)x_k + γ_k y_k,   γ_k = 2/(k + 1). (A.33)

To illustrate, consider the case when C is a simplex:

f(x) → min, s.t.: x ∈ C = {x : x ≥ 0, x⊤1 = 1}.

In this case the update y_k of the FW algorithm is the unit vector corresponding to the minimal coordinate of the gradient. The overall time to update x_k is O(n), therefore resulting in a significant acceleration in comparison with the PGD algorithm.
The FW algorithm has an edge over the other algorithms considered so far because it has a reliable stopping criterion. Indeed, convexity of the objective guarantees that

f(y) ≥ f(x_k) + ∇f(x_k)⊤(y − x_k);

minimizing both sides of the inequality over y ∈ C, one derives

f* ≥ f(x_k) + min_{y∈C} ∇f(x_k)⊤(y − x_k),

where f* is the optimal value of Eq. (A.32), then leading to

max_{y∈C} ∇f(x_k)⊤(x_k − y) ≥ f(x_k) − f*. (A.34)

The value on the left of the inequality, max_{y∈C} ∇f(x_k)⊤(x_k − y), gives us an easy-to-compute stopping criterion.
The following statement characterizes convergence of the FW algorithm.

Theorem A.4.1. Given that f(x) in Eq. (A.32) is a convex β-smooth function and C is a bounded, convex, compact set, Eq. (A.33) converges to the optimal value f* of Eq. (A.32) as

f(x_k) − f* ≤ 2βD²/(k + 2),

where D² ≥ max_{x,y∈C} ‖x − y‖₂².

Proof. Convexity of f means that

f(x) ≥ f(x_k) + ∇f(x_k)⊤(x − x_k), ∀x ∈ C.

Minimizing both sides of the inequality over C, one derives

f(x*) ≥ f(x_k) + ∇f(x_k)⊤(y_k − x_k),

that is, f(x_k) − f(x*) ≤ ∇f(x_k)⊤(x_k − y_k). This inequality, in combination with the second sub-step of the FW algorithm, x_{k+1} = γ_k y_k + (1 − γ_k)x_k, and the β-smoothness of f, results in the following transformations:

f(x_{k+1}) − f(x*) ≤ f(x_k) + ∇f(x_k)⊤(x_{k+1} − x_k) + (β/2)‖x_{k+1} − x_k‖₂² − f(x*)
= f(x_k) + γ_k∇f(x_k)⊤(y_k − x_k) + (βγ_k²/2)‖y_k − x_k‖₂² − f(x*)
≤ f(x_k) − f(x*) − γ_k(f(x_k) − f(x*)) + (βγ_k²/2)D²,

and finally

f(x_{k+1}) − f* ≤ (1 − γ_k)(f(x_k) − f*) + βγ_k²D²/2.

Utilizing the inequality in a chain of inductive relations over k, starting from k = 1, one can show that f(x_k) − f* ≤ 2βD²/(k + 2).
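A sketch of the FW iteration (A.33) over the simplex, including the stopping criterion (A.34); the objective is an arbitrary test case:

```python
import numpy as np

c = np.array([0.9, 0.5, -0.3, 0.1])
f_grad = lambda x: x - c                 # f(x) = ||x - c||^2 / 2

x = np.full(4, 0.25)                     # start inside the simplex
for k in range(1, 201):
    g = f_grad(x)
    y = np.zeros(4)
    y[np.argmin(g)] = 1.0                # linear minimization over the simplex
    gap = g @ (x - y)                    # duality gap, Eq. (A.34): stopping test
    if gap < 1e-6:
        break
    gamma = 2.0 / (k + 1)
    x = (1 - gamma) * x + gamma * y
print(x, gap)
```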

The conditional GD is slower than the FGM method in terms of the number of iterations. However, it is often favorable in practice, especially when minimizing a convex function over sufficiently simple objects (like a norm-ball or a polytope), as it does not require implementing an explicit projection onto the constraining set.

Primal-Dual Gradient Algorithm

Consider the following smooth convex optimization problem:

f(x) → min
s.t.: Ax = b, x ∈ Rⁿ.

It is good practice to work with the equivalent augmented problem,

f(x) + (ρ/2)‖Ax − b‖₂² → min
s.t.: Ax = b,

where ρ > 0. Let us define the augmented Lagrangian

L(x, µ) = f(x) + µ⊤(Ax − b) + (ρ/2)‖Ax − b‖₂².

We say that a point (in the extended, augmented space), (x, µ), is primal-dual optimal iff

0 = ∇ₓL(x, µ) = ∇f(x) + A⊤µ + ρA⊤(Ax − b),
0 = −∇_µL(x, µ) = b − Ax.

One can also re-state the primal-dual optimality condition as

T(x, µ) = 0,   T(x, µ) = (∇ₓL(x, µ); −∇_µL(x, µ)).

The operator/function T is often called the Karush-Kuhn-Tucker (KKT) operator. (We may call T an operator to emphasize that it maps a function, f(x), to another function, ∇ₓL.)
We are now ready to state the Primal-Dual Gradient (PDG) algorithm:

(x; µ)_{k+1} = (x; µ)_k − η_k T(x_k, µ_k).

A similar construction works if inequality constraints are added:

f(x) → min
s.t.: gᵢ(x) ≤ 0, 1 ≤ i ≤ m.

The augmented problem, accounting for the inequalities, becomes

f(x) + (ρ/2) Σ_{i=1}^m (gᵢ(x))₊² → min
s.t.: gᵢ(x) ≤ 0, 1 ≤ i ≤ m.

The respective augmented Lagrangian is

L(x, λ) = f(x) + λ⊤F(x) + (ρ/2)‖F(x)‖₂²,

where F(x)ᵢ = (gᵢ(x))₊. We say that the pair (x, λ) is primal-dual optimal iff

0 = ∇ₓL(x, λ) = ∇f(x) + Σ_{i=1}^m (λᵢ + ρ(gᵢ(x))₊) ∇(gᵢ(x))₊,
0 = −∇_λL(x, λ) = −F(x).

The PDG algorithm accounting for the inequality constraints is

(x; λ)_{k+1} = (x; λ)_k − η_k T(x_k, λ_k).

The convergence analysis of the PDG algorithm repeats all the steps involved in the analysis of the original GD. The Lyapunov function here is V(x, λ) = ‖x − x*‖₂² + ‖λ − λ*‖₂².
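A sketch of the PDG iteration for a small equality-constrained problem; the instance, ρ and η are arbitrary choices:

```python
import numpy as np

# min ||x - c||^2 / 2  s.t.  <a, x> = 1, augmented Lagrangian with rho = 1
c = np.array([2.0, 1.0])
a = np.array([1.0, 1.0])
rho, eta = 1.0, 0.2

x, mu = np.zeros(2), 0.0
for _ in range(2000):
    r = a @ x - 1.0                          # constraint residual Ax - b
    grad_x = (x - c) + mu * a + rho * r * a  # nabla_x L
    x = x - eta * grad_x                     # descend in the primal variable
    mu = mu + eta * r                        # ascend in the dual variable
print(x, mu)   # x should satisfy <a, x> = 1 (here x -> (1, 0), mu -> 1)
```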

Exercise A.20. Analyze convergence of the PDG algorithm for convex optimization with
inequality constraints assuming that all the functions involved (in the objective, f , and in
the constraints, gi ) are convex and β-smooth.

Mirror Descent Algorithm

Our previous analysis was mostly focused on the case where the objective function f is smooth in the ℓ₂ norm and the distance from the starting point (where we initiate the algorithm) to the optimal point is measured in the ℓ₂ norm as well. From the perspective of GD, the optimization over a unit simplex and the optimization over a unit Euclidean sphere are equivalent complexity-wise. On the other hand, the volume of the unit simplex is exponentially smaller than the volume of the unit sphere. The Mirror Descent (MD) algorithm allows one to exploit the geometry of the domain, thus providing a faster algorithm for the case of the simplex. The acceleration is up to a ∼√d factor, where d is the dimensionality of the underlying space.
We start with the constrained convex optimization problem

f(x) → min
s.t.: x ∈ S ⊆ Rⁿ.

Consider in more detail an elementary iteration of the GD algorithm,

x_{k+1} = x_k − η_k∇f(x_k).

From the mathematical perspective we sum up objects from different spaces: x belongs to the primal space, while the space where ∇f(x) resides, called the dual (conjugate) space, may be different. To overcome this “inconsistency”, Nemirovski and Yudin proposed in 1978 the following algorithm:

y_k = ∇φ(x_k)  (map the point to the dual space),
y_{k+1} = y_k − η_k∇f(x_k)  (update the point in the dual space),
x̄_{k+1} = (∇φ)⁻¹(y_{k+1}) = ∇φ*(y_{k+1})  (map the point back to the primal space),
x_{k+1} = Π_S^{Dφ}(x̄_{k+1}) = arg min_{x∈S} Dφ(x, x̄_{k+1})  (project the point onto the feasible set),

where φ(x) is a strongly convex function defined on Rⁿ with ∇φ(Rⁿ) = Rⁿ, and φ*(y) = sup_{x∈Rⁿ}(y⊤x − φ(x)) is the Legendre-Fenchel (LF) transform (conjugate function) of φ(x). The function φ is also called the mirror map. Here

Dφ(u, v) = φ(u) − φ(v) − ∇φ(v)⊤(u − v)

is the so-called Bregman divergence, which measures (for a strictly convex φ) the gap between φ(u) and the linear approximation φ(v) + ∇φ(v)⊤(u − v) of φ built at v.

Exercise A.21. Let φ(x) be a strongly convex function on Rn . Using the definition of the
conjugate function prove that ∇φ∗ (∇φ(x)) = x, where φ∗ is a conjugate function to φ.

The Bregman divergence has a number of attractive properties:

• Non-negativity: Dφ(u, v) ≥ 0 for any convex function φ.

• Convexity in the first argument: the Bregman divergence Dφ(u, v) is convex in its first argument. (Notice that it is not necessarily convex in the second argument.)

• Linearity with respect to non-negative coefficients: for any strictly convex φ and ψ and any λ, µ ≥ 0,

D_{λφ+µψ}(u, v) = λDφ(u, v) + µDψ(u, v).

• Duality: let the function φ have a convex conjugate φ*; then

D_{φ*}(u*, v*) = Dφ(v, u), with u* = ∇φ(u) and v* = ∇φ(v).

Examples of the Bregman divergence:

• Euclidean norm. Let φ(x) = ‖x‖₂²; then Dφ(x, y) = ‖x‖₂² − ‖y‖₂² − 2y⊤(x − y) = ‖x − y‖₂².

• Negative entropy. φ(x) = Σ_{i=1}^n xᵢ ln xᵢ, φ : Rⁿ₊₊ → R. Then

Dφ(x, y) = Σ_{i=1}^n xᵢ ln(xᵢ/yᵢ) − Σ_{i=1}^n xᵢ + Σ_{i=1}^n yᵢ = D_KL(x‖y),

where D_KL(x‖y) is the so-called Kullback-Leibler (KL) divergence.

• Lower and upper bounds. Let φ be a µ-strongly convex function with respect to a norm ‖·‖; then Dφ(x, y) ≥ (µ/2)‖x − y‖². If φ is β-smooth, then Dφ(x, y) ≤ (β/2)‖x − y‖².
2 2

The following statement represents an important fact which will be used below to analyze the MD algorithm.

Theorem A.4.2 (Pinsker Inequality). For any x, y such that Σ_{i=1}^n xᵢ = Σ_{i=1}^n yᵢ = 1, x ≥ 0, y ≥ 0, one gets the following KL divergence estimate: D_KL(x‖y) ≥ (1/2)‖x − y‖₁².

An immediate corollary of the Theorem is that φ(x) = Σ_{i=1}^n xᵢ ln xᵢ is 1-strongly convex in the ℓ₁ norm:

φ(y) = φ(x) + ∇φ(x)⊤(y − x) + D_KL(y‖x) ≥ φ(x) + ∇φ(x)⊤(y − x) + (1/2)‖x − y‖₁².
The proximal form of the MD algorithm is

x_{k+1} = Π_S^{Dφ}( arg min_{x∈Rⁿ} {f(x_k) + ∇f(x_k)⊤(x − x_k) + (1/η_k)Dφ(x, x_k)} ),

where Π_S^{Dφ}(y) = arg min_{x∈S} Dφ(x, y).

Example A.4.3. Consider the following optimization problem over the unit simplex:

f(x) → min_{x∈Rⁿ}
s.t.: x ∈ S = {x : x⊤1 = 1, x ∈ Rⁿ₊₊}.

Let the distance-generating function φ(x) be the negative entropy, φ(x) = Σ_{i=1}^n xᵢ ln xᵢ. Then the MD algorithm update becomes

x_{k+1} = Π_S^{Dφ}( arg min_x {f(x_k) + ∇f(x_k)⊤(x − x_k) + (1/η_k)Dφ(x, x_k)} ),

where Dφ(x, y) = Σ_{i=1}^n (xᵢ ln(xᵢ/yᵢ) − (xᵢ − yᵢ)). The unconstrained minimizer y satisfies

∇φ(y) = ∇φ(x_k) − η_k∇f(x_k), that is, yᵢ = (x_k)ᵢ exp(−η_k∇f(x_k)ᵢ).

One observes that the Bregman projection onto the simplex is a re-normalization, Π_S^{Dφ}(y) = y/‖y‖₁. This results in the following expression for the MD update:

(x_{k+1})ᵢ = (x_k)ᵢ exp(−η_k∇f(x_k)ᵢ) / Σ_{j=1}^n (x_k)ⱼ exp(−η_k∇f(x_k)ⱼ).
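A sketch of this multiplicative (entropic mirror descent) update on a toy objective over the simplex; the objective and step size are arbitrary choices:

```python
import numpy as np

c = np.array([0.9, 0.5, -0.3, 0.1])
f_grad = lambda x: x - c                # f(x) = ||x - c||^2 / 2

x = np.full(4, 0.25)                    # uniform start on the simplex
eta = 0.5
for _ in range(300):
    w = x * np.exp(-eta * f_grad(x))    # dual-space step through the mirror map
    x = w / w.sum()                     # Bregman projection = re-normalization
print(x, x.sum())                       # iterate stays on the simplex
```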

Let us sketch the continuous-time analysis of the MD algorithm in the case of β-smooth convex functions. In contrast with the GD analysis, it is more appropriate to work here with a Lyapunov function in the dual space:

V(Z(t)) = D_{φ*}(Z(t), z*), Z(t) = ∇φ(X(t)),

where φ is a strongly convex distance-generating function. According to the definition of the Bregman divergence, one derives

(d/dt)V(Z(t)) = (d/dt)D_{φ*}(Z(t), z*) = (d/dt){φ*(Z(t)) − φ*(z*) − ∇φ*(z*)⊤(Z(t) − z*)}
             = (∇φ*(Z(t)) − ∇φ*(z*))⊤Ż(t) = (X(t) − x*)⊤Ż(t).

Given that Ż(t) = −∇f(X), one derives

(d/dt)V(Z(t)) = −∇f(X(t))⊤(X(t) − x*) ≤ −(f(X(t)) − f*).

Integrating both sides of the inequality, one arrives at

V(Z(0)) − V(Z(t)) ≥ ∫₀ᵗ f(X(τ)) dτ − tf* ≥ t( f((1/t)∫₀ᵗ X(τ) dτ) − f* ),

where the last transformation is due to the Jensen inequality. Therefore, similarly to the case of GD, the convergence rate of the MD algorithm is O(1/k). The resulting MD ODE is

X(t) = ∇φ*(Z(t)),
Ż(t) = −∇f(X(t)),
X(0) = x₀, Z(0) = z₀ with ∇φ*(z₀) = x₀.

The behavior of MD, when applied to a non-smooth convex function, repeats that of GD: the convergence rate is O(1/√k) in this case.
Bibliography

[1] M. Tabor, Principles and Methods of Applied Mathematics. University of Arizona


Press, 1999.

[2] V. Arnold, Ordinary Differential Equations. The MIT Press, 1973.

[3] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal
algorithms,” Physica D: Nonlinear Phenomena, vol. 60, no. 1, pp. 259–268, 1992.

[4] J. Calder, “The calculus of variations (lecture notes),” http://www-users.math.umn.edu/~jwcalder/CalculusOfVariations.pdf, 2019.

[5] B. Polyak, “Some methods of speeding up the convergence of iteration methods,” USSR
Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1 – 17, 1964.

[6] Y. E. Nesterov, “A method for solving the convex programming problem with convergence rate O(1/k²),” Dokl. Akad. Nauk SSSR, vol. 269, pp. 543–547, 1983.

[7] W. Su, S. Boyd, and E. J. Candes, “A Differential Equation for Modeling Nesterov’s
Accelerated Gradient Method: Theory and Insights,” arXiv:1503.01243, 2015.

[8] A. C. Wilson, B. Recht, and M. I. Jordan, “A Lyapunov Analysis of Momentum Meth-


ods in Optimization,” arXiv:1611.02635, 2016.

[9] M. Levi, Classical Mechanics with Calculus of Variations and Optimal Control: An
Intuitive Introduction. AMS, 2014.

[10] H. Touchette, “Legendre-Fenchel transforms in a nutshell,” 2014.

[11] A. Chambolle, “An algorithm for total variation minimization and applications,” Jour-
nal of Mathematical Imaging and Vision, vol. 20, pp. 89–97, 2004.

[12] R. K. P. Zia, E. F. Redish, and S. R. McKay, “Making sense of the Legendre transform,” American Journal of Physics, vol. 77, no. 7, pp. 614–622, Jul 2009. [Online]. Available: http://dx.doi.org/10.1119/1.3119512


[13] L. Pontryagin, V. Boltayanskii, R. Gamkrelidze, and E. Mishchenko, The mathematical


theory of optimal processes (translated from Russian in 1962). Wiley, 1956.

[14] A. T. Fuller, “Bibliography of Pontryagin's maximum principle,” Journal of Electronics and Control, vol. 15, no. 5, pp. 513–517, 1963.

[15] R. Bellman, “On the theory of dynamic programming,” PNAS, vol. 38, no. 8, p. 716,
1952.

[16] C. Moore and S. Mertens, The Nature of Computation. New York, NY, USA: Oxford
University Press, 2011.

[17] A. Sinclair, “UC Berkeley CS271 ‘Randomness & Computation’ course notes,” 2020. [Online]. Available: https://people.eecs.berkeley.edu/~sinclair/cs271/n13.pdf

[18] C. E. Shannon, “Prediction and entropy of printed English,” The Bell System Technical Journal, vol. 30, no. 1, pp. 50–64, 1951.

[19] D. J. C. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[20] N. G. van Kampen, Stochastic Processes in Physics and Chemistry. North Holland, 2007.

[21] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT


Press, 2018.

[22] F. Kelly and E. Yudovina, Stochastic Networks, ser. Institute of Mathematical Statistics
Textbooks. Cambridge University Press, 2014.

[23] D. B. Wilson, “Perfectly random sampling with Markov chains,” http://www.dbwilson.com/exact/.

[24] T. Richardson and R. Urbanke, Modern Coding Theory. USA: Cambridge University
Press, 2008.

[25] J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Constructing free-energy approximations


and generalized belief propagation algorithms,” IEEE Transactions on Information
Theory, vol. 51, no. 7, pp. 2282–2312, 2005.

[26] M. J. Wainwright and M. I. Jordan, “Graphical models, exponential families, and


variational inference,” Foundations and Trends in Machine Learning, vol. 1, no. 1, pp.
1–305, 2008.

[27] V. Likhosherstov, Y. Maximov, and M. Chertkov, “Inference and sampling of K33-free Ising models,” in Proceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. PMLR, 09–15 Jun 2019, pp. 3963–3972. [Online]. Available: http://proceedings.mlr.press/v97/likhosherstov19a.html

[28] A. Y. Lokhov, M. Vuffray, S. Misra, and M. Chertkov, “Optimal structure and parameter learning of Ising models,” Science Advances, vol. 4, no. 3, 2018. [Online]. Available: https://advances.sciencemag.org/content/4/3/e1700791

[29] C. Chow and C. Liu, “Approximating discrete probability distributions with depen-
dence trees,” IEEE Transactions on Information Theory, vol. 14, no. 3, pp. 462–467,
1968.

[30] G. Strang, Linear Algebra and Learning from Data. Wellesley-Cambridge Press, 2019.

[31] P. Kidger and T. Lyons, “Universal Approximation with Deep Narrow Networks,”
in Proceedings of Thirty Third Conference on Learning Theory, ser. Proceedings
of Machine Learning Research, J. Abernethy and S. Agarwal, Eds., vol.
125. PMLR, 09–12 Jul 2020, pp. 2306–2327. [Online]. Available: https:
//proceedings.mlr.press/v125/kidger20a.html
