Wang F. Foundation of Probability Theory 2024


Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

Library of Congress Control Number: 2024044604

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

概率论基础
Originally published in Chinese by Beijing Normal University Press (Group) Co., Ltd.
Copyright © Beijing Normal University Press (Group) Co., Ltd., 2010

FOUNDATION OF PROBABILITY THEORY

Copyright © 2025 by World Scientific Publishing Co. Pte. Ltd.


All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance
Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy
is not required from the publisher.

ISBN 978-981-12-9885-1 (hardcover)


ISBN 978-981-98-0035-3 (ebook for institutions)
ISBN 978-981-98-0036-0 (ebook for individuals)

For any available supplementary material, please visit


https://www.worldscientific.com/worldscibooks/10.1142/14003#t=suppl

Desk Editors: Nambirajan Karuppiah/Rok Ting Tan

Typeset by Stallion Press


Email: enquiries@stallionpress.com

Printed in Singapore
Preface

Why should we learn the course “Foundation of Probability Theory” after the elementary course “Probability Theory”? The reason is
that the elementary probability theory describes specific distribu-
tions induced by random trials, which is intuitively clear but math-
ematically less rigorous, while the foundation of probability theory
established by Kolmogorov is an axiomatization theory, which makes
probability theory a rigorous branch of mathematics.
For example, in the elementary probability theory, the sample space is the totality of possible outcomes of a random trial, and each subset of this space is called an event, whose probability is defined as the limit of its frequency of occurrence as the number of trials goes to infinity. These concepts are intuitively clear but not mathematically rigorous: Why can the trial be repeated infinitely many times? Why must the frequency converge? And how can one determine the limit if it does converge? One may argue that this limit exists due to the law of large numbers. However, the law of large numbers is itself established based on the definition of probability, which leads to a circular argument.
Now, the motivation to learn this course becomes clear: it enables us to grasp a rigorous foundation of probability theory within a mathematical axiomatic system. In contrast to the elementary probability theory, which deals with random events in specific examples of random trials, Foundation of Probability Theory is a general mathematical theory which provides rigorous descriptions of these examples. Therefore, this course has all the characteristics of mathematical theories: abstract contents, extensive applications, complete structures, and clear conclusions. Because of the abstract contents, we will face many difficulties while learning. To overcome them, a crucial trick is to keep concrete examples in mind while trying to understand an abstract concept, and to compare the abstract theory with related courses learned before, especially the Lebesgue measure theory. In the following, we give a brief chapterwise summary of the main contents of this textbook.
To define events without random trials, we first fix a global set
Ω and then construct a class A of subsets of Ω, which is equipped
with an algebra structure, so that each element in A is measurable
in a reasonable way. We then call the couple (Ω, A ) a measurable
space, where Ω refers to the sample space of a random trial and A
stands for the set of events. In general, A is strictly smaller than
the class of all subsets of Ω, i.e. not all subsets of Ω are measurable.
For instance, there exist nonmeasurable sets in the Lebesgue measure
theory. Following the line of the Lebesgue measure theory, we assume
that A contains Ω and is closed under countable set operations,
which leads to the concept of σ-algebra. Furthermore, the probability
of an event can be regarded as a nonnegative evaluation of the sets in A, i.e. a function P : A → [0, ∞). According to the requirements on a probability measure, we postulate that P(Ω) = 1 and that P is σ-additive, i.e.

P(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n)

for any sequence of mutually disjoint sets {A_n}_{n≥1} ⊂ A.
Without the restriction P(Ω) = 1, the map P is called a measure
and is denoted by μ rather than P to emphasize the difference. In
this way, we construct a triple (Ω, A , P), or more generally (Ω, A , μ),
which is called a probability space or a measure space.
So, how can we construct a probability measure on a σ-algebra?
According to the Lebesgue measure theory, we first define the mea-
sure value for simple sets, for instance, the intervals, and then extend
it to all measurable sets by an extension argument. To abstract this
method in the general framework, we first introduce the semi-algebra
for subsets of Ω in terms of the property of right semi-closed inter-
vals in the Euclidean space. From this semi-algebra of sets, we gen-
erate the σ-algebra A by establishing the monotone class theorem.
Moreover, by the monotone class theorem and the construction ideas
of Lebesgue measure, a measure defined on a semi-algebra S can

be extended (uniquely, under a σ-finiteness condition) to the minimal


σ-algebra generated by S . This is known as the measure extension
theorem, the core result of Chapter 1.
Having the measure space (Ω, A, μ) in hand, the task of Chapters 2 and 3 is to assign to each measurable function a value, called the integral of the function with respect to μ. The definition and properties of integrals are inherited from the theory of Lebesgue integrals, and hence are easy to understand given a basis in the Lebesgue measure theory. In particular, on a probability space (Ω, A, P), a measurable function is called a random variable, whose expectation is defined as the integral with respect to the probability measure. By the integral transformation formula (the Lebesgue–Stieltjes integral expression), the expectation can be formulated as the integral of the identity function with respect to the distribution of the random variable, where the distribution is a probability measure on the real line. In order to classify the distributions of random variables, we consider the decomposition of measures in Chapter 3.

To study several or infinitely many random variables together, we introduce product probability spaces and consider the conditional properties of some random variables given others. These are treated in Chapters 4 and 5, where the main difficulty is to clarify the definition of conditional expectation given a sub-σ-algebra and to introduce the regular conditional probability, which enables one to construct measures on product spaces and is fundamental for the further study of stochastic processes.
Chapter 6 presents several equivalent definitions of the weak con-
vergence for finite measures, which are also equivalent to the con-
vergence of the characteristic functions for finite measures in the
multi-dimensional Euclidean space. Chapter 7 introduces some prob-
ability distances on the space of probability measures. Both chapters
are important to develop the limit theory of random variables and
stochastic processes.
Finally, Chapter 8 introduces derivatives for functions of finite measures and establishes the chain rule and derivative formulas. These provide a quick way for readers to enter the frontier of analysis and geometry on the space of measures.

In conclusion, this course is an abstract and rigorous version of the elementary probability theory. Key points include the monotone class theorem, the measure extension theorem, conditional expectation and regular conditional probability, and weak convergence. To make the whole book easy to follow, at the beginning of each part we briefly introduce the main purpose of study based on the previous contents, outline the main structure, and explain the key ideas. If one clearly understands the background and basic ideas of each part, it is not hard to grasp the whole contents of this textbook. There are many books covering these contents; see the incomplete list of references at the end of this book.
The first seven chapters of the textbook are translated and modi-
fied from the Chinese version published in 2010 by the Beijing Normal
University Press. We would like to thank the Executive Editor
Ms Fengjuan Liu for the encouragement and the efficient work. We
gratefully acknowledge the support from the National Key R&D
Program of China (2022YFA1006000, 2020YFA0712900) and the
National Natural Science Foundation of China (11921001).
About the Authors

Feng-Yu Wang is a Professor at Tianjin University. After receiving his Ph.D. in 1993 from Beijing Normal University, he was exceptionally appointed by Beijing Normal University in 1995 as a Full Professor, by the Ministry of Education of China in 2000 as a Chang-Jiang Chair Professor, by Swansea University in 2007 as a Research Chair, and by Tianjin University in 2016 as a Chair Professor. His research areas include stochastic analysis on manifolds, functional inequalities and applications, stochastic (partial) differential equations, distribution dependent stochastic differential equations, and Wasserstein convergence of empirical measures. He has served as an Associate Editor of Electronic Journal of Probability, Electronic Communications in Probability, Journal of Theoretical Probability, Communications in Pure and Applied Analysis, Science China Mathematics, and Frontiers of Mathematics in China.
Yong-Hua Mao is a Professor at Beijing Normal University. He has worked on Markov chains, jump processes, and other Markov processes, especially on stationarity and quasi-stationarity of Markov processes. He has served as an Associate Editor of Statistics and Probability Letters and the Chinese Journal of Applied Probability and Statistics.

Contents

Preface v
About the Authors ix

1. Class of Sets and Measure 1


1.1 Class of Sets and Monotone Class Theorem . . . . . . . . 2
1.1.1 Semi-algebra . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Algebra . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.3 σ-algebra . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.4 Monotone class theorem . . . . . . . . . . . . . . . 6
1.1.5 Product measurable space . . . . . . . . . . . . . . 10
1.2 Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.2.1 Function of sets . . . . . . . . . . . . . . . . . . . . 11
1.2.2 Measure space . . . . . . . . . . . . . . . . . . . . . 15
1.3 Extension and Completion of Measure . . . . . . . . . . . 16
1.3.1 Extension from semi-algebra to the induced algebra 17
1.3.2 Extension from semi-algebra to the generated
σ-algebra . . . . . . . . . . . . . . . . . . . . . . . . 18
1.3.3 Completion of measures . . . . . . . . . . . . . . . 22
1.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2. Random Variable and Measurable Function 29


2.1 Measurable Function . . . . . . . . . . . . . . . . . . . . . 30
2.1.1 Definition and properties . . . . . . . . . . . . . . . 30
2.1.2 Construction of measurable function . . . . . . . . 32


2.1.3 Operations of measurable functions . . . . . . . . . 34


2.1.4 Monotone class theorem for functions . . . . . . . . 36
2.2 Distribution Function and Law . . . . . . . . . . . . . . . 39
2.3 Independent Random Variables . . . . . . . . . . . . . . . 42
2.4 Convergence of Measurable Functions . . . . . . . . . . . . 45
2.4.1 Almost everywhere convergence . . . . . . . . . . . 45
2.4.2 Convergence in measure . . . . . . . . . . . . . . . 46
2.4.3 Convergence in distribution . . . . . . . . . . . . . 50
2.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3. Integral and Expectation 55


3.1 Definition and Properties for Integral . . . . . . . . . . . 56
3.1.1 Definition of integral . . . . . . . . . . . . . . . . . 56
3.1.2 Properties of integral . . . . . . . . . . . . . . . . 59
3.2 Convergence Theorems . . . . . . . . . . . . . . . . . . . . 60
3.3 Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.3.1 Numerical characters and characteristic function . 65
3.3.2 Integral transformation and L–S representation of
expectation . . . . . . . . . . . . . . . . . . . . . . 68
3.4 Lr -space . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4.1 Some classical inequalities . . . . . . . . . . . . . . 71
3.4.2 Topology property of Lr (μ) . . . . . . . . . . . . . 74
3.4.3 Links of different convergences . . . . . . . . . . . . 75
3.5 Decompositions of Signed Measure . . . . . . . . . . . . . 77
3.5.1 Hahn’s decomposition theorem . . . . . . . . . . . 77
3.5.2 Lebesgue’s decomposition theorem . . . . . . . . . 80
3.5.3 Decomposition theorem of distribution function . . 84
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4. Product Measure Space 91


4.1 Fubini’s Theorem . . . . . . . . . . . . . . . . . . . . . . . 92
4.2 Infinite Product Probability Space . . . . . . . . . . . . . 95
4.3 Transition Measure and Transition Probability . . . . . . 99
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

5. Conditional Expectation and Conditional Probability 105
5.1 Conditional Expectation Given σ-Algebra . . . . . . . . . 106
5.2 Conditional Expectation Given Function . . . . . . . . . . 110

5.3 Regular Conditional Probability . . . . . . . . . . . . . . . 111


5.3.1 Definition and properties . . . . . . . . . . . . . . . 111
5.3.2 Conditional distribution . . . . . . . . . . . . . . . 112
5.3.3 Existence of regular conditional probability . . . . 113
5.4 Kolmogorov’s Consistent Theorem . . . . . . . . . . . . . 115
5.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6. Characteristic Function and Weak Convergence 121
6.1 Characteristic Function of Finite Measure . . . . . . . . . 121
6.1.1 Definition and properties . . . . . . . . . . . . . . . 121
6.1.2 Inverse formula . . . . . . . . . . . . . . . . . . . . 123
6.2 Weak Convergence of Finite Measures . . . . . . . . . . . 125
6.2.1 Definition and equivalent statements . . . . . . . . 125
6.2.2 Tightness and weak compactness . . . . . . . . . . 129
6.3 Characteristic Function and Weak Convergence . . . . . . 133
6.4 Characteristic Function and Nonnegative
Definiteness . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

7. Probability Distances 143


7.1 Metrization of Weak Topology . . . . . . . . . . . . . . . . 143
7.2 Wasserstein Distance and Optimal Transport . . . . . . . 145
7.2.1 Transport problem, coupling and Wasserstein
distance . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.2 Optimal coupling and Kantorovich’s dual
formula . . . . . . . . . . . . . . . . . . . . . . . . . 147
7.2.3 The metric space (Pp (E), Wp ) . . . . . . . . . . . 148
7.3 Total Variation Distance . . . . . . . . . . . . . . . . . . . 153
7.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8. Calculus on the Space of Finite Measures 157
8.1 Intrinsic Derivative and Chain Rule . . . . . . . . . . . . . 158
8.1.1 Vector field and tangent space . . . . . . . . . . . . 158
8.1.2 Intrinsic derivative and C 1 functions . . . . . . . . 159
8.1.3 Chain rule . . . . . . . . . . . . . . . . . . . . . . . 162
8.2 Extrinsic Derivative and Convexity Extrinsic Derivative . 167
8.3 Links of Intrinsic and Extrinsic Derivatives . . . . . . . . . 175
8.4 Gaussian Measures on P2 and M . . . . . . . . . . . . . . 181

8.4.1 Gaussian measure on Hilbert space . . . . . . . . . 182


8.4.2 Gaussian measures on P2 . . . . . . . . . . . . . . 184
8.4.3 Gaussian measures on M . . . . . . . . . . . . . . . 185
8.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

Bibliography 189
Index 191
Chapter 1

Class of Sets and Measure

What is a measure? It is a tool to determine the weights of “measurable sets” satisfying the countable additivity property, i.e. the sum of the weights of countably many disjoint sets coincides with the weight of the union of these sets. For example, under the Lebesgue measure, the weight of an interval [a, b) with real numbers b ≥ a is its length b − a, which uniquely determines a measure on the
class of “Lebesgue measurable sets” on R. The aim of this chapter
is to choose a reasonable class of subsets (i.e. measurable sets) for a
given global set Ω and to construct a measure on this class. To this
end, we first generate a class of subsets sharing the following fea-
tures of Lebesgue measurable sets: (1) it contains the empty set and
the total set; (2) it is closed under the countably infinite set oper-
ations (set union, intersection and difference). A class of sets with
these properties is called a σ-algebra or σ-field, which is our ideal
class of “measurable subsets” of the abstract global set Ω. To define
a measure on the σ-algebra, let us again go back to the Lebesgue
measure on R.
As mentioned above, the Lebesgue measure of an interval is
defined as the length. By a natural extension procedure, this mea-
sure can be extended to the smallest σ-algebra containing intervals,
which is nothing but the Borel σ-algebra whose completion is a class
of Lebesgue measurable sets. To realize the same procedure for the
present abstract setting with Ω in place of R, we consider the “semi-
algebra” which is a class of sets sharing the following features of
intervals: (1) it contains the empty set and the total set; (2) it is
closed under the set intersection; (3) the difference of any two sets


can be expressed as the union of finitely many disjoint sets in the class. We


first induce the smallest σ-algebra from the semi-algebra, where the
main tool is called the monotone class theorem, and then extend a
measure from the semi-algebra to the induced σ-algebra, where the
key step is to establish the measure extension theorem. These two
theorems are the key results of this chapter.

1.1 Class of Sets and Monotone Class Theorem

1.1.1 Semi-algebra
We first introduce operations for subsets of the global set Ω. Let ∪ and ∩ denote the union and intersection, respectively, let Aᶜ be the complement of the set A, and let A − B := A ∩ Bᶜ be the difference of A and B, which is called a proper difference if B ⊂ A. For simplicity, we will use AB to stand for A ∩ B, A + B for A ∪ B with AB = ∅, and Σ_n A_n for the union of finitely or countably many disjoint sets {A_n}.
Then the semi-algebra of sets is defined in terms of the above-
mentioned features of intervals.
Definition 1.1. A class S of subsets of Ω is called a semi-algebra
(of sets) in Ω if
(1) Ω, ∅ ∈ S,
(2) A ∩ B ∈ S for A, B ∈ S,
(3) for A_1, A ∈ S with A_1 ⊂ A, there exist n ≥ 1 and mutually disjoint A_1, A_2, …, A_n ∈ S such that A = Σ_{i=1}^n A_i.

Property 1.2. Under items (1) and (2) in Definition 1.1, item (3) is equivalent to the following:

(3′) if A ∈ S, then there exist n ≥ 1 and mutually disjoint A_1, A_2, …, A_n ∈ S such that Aᶜ = Σ_{i=1}^n A_i.

Proof. (3) ⇒ (3′): Since A ⊂ Ω, it follows from (3) that there exist n ≥ 1 and mutually disjoint A_1, A_2, …, A_n ∈ S, each disjoint from A, such that Ω = A + Σ_{i=1}^n A_i, so Aᶜ = Σ_{i=1}^n A_i.

(3′) ⇒ (3): It follows from (3′) that there exist n ≥ 2 and mutually disjoint A_2, …, A_n ∈ S such that A_1ᶜ = Σ_{i=2}^n A_i, so A = A_1 + Σ_{i=2}^n (A_i ∩ A). □

Example 1.3. Let Ω = [0, +∞) and S = {[a, b) : 0 ≤ a ≤ b ≤ +∞}. Then S is a semi-algebra in Ω.
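To make Definition 1.1 concrete, here is a minimal Python sketch (our own illustration, not part of the text) that verifies the three axioms by brute-force search for a discrete analogue of Example 1.3 on Ω = {1, 2, 3}; all names in it are illustrative.

```python
from itertools import combinations

# Discrete analogue of Example 1.3: all "intervals" {i, ..., j}, plus the
# empty set; sets are represented as frozensets.
Omega = frozenset({1, 2, 3})
S = {frozenset(range(i, j + 1)) for i in range(1, 4) for j in range(i, 4)}
S |= {frozenset(), Omega}

def is_semi_algebra(S, Omega):
    S = list(S)
    # (1) contains Omega and the empty set
    if frozenset() not in S or Omega not in S:
        return False
    # (2) closed under intersection
    if any(A & B not in S for A in S for B in S):
        return False
    # (3) every A1 in S with A1 contained in A in S extends to a finite
    #     disjoint decomposition of A by members of S
    for A in S:
        for A1 in S:
            if A1 <= A and not any(
                A1 in parts
                and all(p & q == frozenset() for p, q in combinations(parts, 2))
                and frozenset().union(*parts) == A
                for r in range(1, len(S) + 1)
                for parts in combinations(S, r)
            ):
                return False
    return True

print(is_semi_algebra(S, Omega))  # True
```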

To induce the σ-algebra from a semi-algebra, we introduce an intermediate notion, the “algebra”, which is closed under finitely many set operations.

1.1.2 Algebra

Definition 1.4. A class F of subsets of Ω is called an algebra (of


sets) in Ω, or Boolean algebra in Ω, if

(1) Ω ∈ F ,
(2) A, B ∈ F implies A − B ∈ F .

Property 1.5. Under item (1) in Definition 1.4, item (2) is equiva-
lent to any one of the following:

(2′) A, B ∈ F implies A ∪ B, Aᶜ, Bᶜ ∈ F,
(2″) A, B ∈ F implies A ∩ B, Aᶜ, Bᶜ ∈ F.

Proof. We will prove (2″) ⇒ (2′) ⇒ (2) ⇒ (2″).

(2″) ⇒ (2′): It follows from (2″) that F is closed under complement and intersection, so that A, B ∈ F implies A ∪ B = (Aᶜ ∩ Bᶜ)ᶜ ∈ F.
(2′) ⇒ (2): Assume A, B ∈ F. It follows from (2′) that F is closed under complement and union, so that A − B = (Aᶜ ∪ B)ᶜ ∈ F.
(2) ⇒ (2″): Assume A, B ∈ F. It follows from (2) that Aᶜ = Ω − A ∈ F and Bᶜ = Ω − B ∈ F, so that A ∩ B = A − Bᶜ ∈ F. □


Proposition 1.6. If F is an algebra in Ω, then for all A, B ∈ F we have Aᶜ, Bᶜ, A ∩ B, A ∪ B, A − B ∈ F.

Obviously, an algebra is a semi-algebra. The following theo-


rem provides an explicit formulation of the induced algebra from
a semi-algebra.

Theorem 1.7. If S is a semi-algebra, then

F := { Σ_{k=1}^n A_k : n ≥ 1, A_k ∈ S (1 ≤ k ≤ n) mutually disjoint }

is the smallest algebra containing S; it is called the algebra induced (or generated) by S and is denoted by F(S).

Proof. First, we prove that F is an algebra. Obviously, item (1) in Definition 1.4 is fulfilled. Moreover, for any A, B ∈ F, there exist A_1, A_2, …, A_n ∈ S and B_1, B_2, …, B_m ∈ S, mutually disjoint respectively, such that A = Σ_{i=1}^n A_i and B = Σ_{j=1}^m B_j. Then A ∩ B = Σ_{i,j} (A_i ∩ B_j). It follows from Definition 1.1(2) that A ∩ B ∈ F, so that F is closed under finite intersections.

Next, by Property 1.5, to prove that F is an algebra in Ω, we only need to verify Aᶜ ∈ F for any A ∈ F. Let A = Σ_{i=1}^n A_i ∈ F with A_i ∈ S. Then Aᶜ = ⋂_{i=1}^n A_iᶜ. By Property 1.2, each A_iᶜ can be expressed as a union of mutually disjoint sets in S, so A_iᶜ ∈ F. Since F is closed under finitely many intersections, we obtain Aᶜ ∈ F.

Finally, for any algebra F′ ⊃ S, Property 1.5 implies F′ ⊃ F. □


Example 1.8. S in Example 1.3 is not an algebra, and by Theorem 1.7, its induced algebra is

F(S) = { Σ_{i=1}^n [a_i, b_i) : n ≥ 1, 0 ≤ a_1 ≤ b_1 ≤ a_2 ≤ b_2 ≤ ⋯ ≤ a_n ≤ b_n }.
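The explicit formula in Theorem 1.7 is easy to experiment with. Below is a small Python sketch of ours (assuming a finite Ω, with sets as frozensets) that builds F(S) as all finite disjoint unions of members of S and then checks the closedness properties from Property 1.5.

```python
from itertools import combinations

def induced_algebra(S):
    # F(S): all unions of mutually disjoint finite sub-collections of S
    S = list(S)
    F = set()
    for r in range(1, len(S) + 1):
        for parts in combinations(S, r):
            if all(p & q == frozenset() for p, q in combinations(parts, 2)):
                F.add(frozenset().union(*parts))
    return F

Omega = frozenset({1, 2, 3})
S = {frozenset(), frozenset({1}), frozenset({2, 3}), Omega}
F = induced_algebra(S)
assert all(Omega - A in F for A in F)         # closed under complement
assert all(A & B in F for A in F for B in F)  # closed under intersection
print(sorted(map(sorted, F)))  # [[], [1], [1, 2, 3], [2, 3]]
```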

1.1.3 σ-algebra
According to the properties of Lebesgue measurable sets, a σ-algebra should be closed under countably many set operations. Since the union and intersection of sets are dual to each other under complementation, it suffices to require closedness under complement and countably many unions.

Definition 1.9. A class A of subsets of Ω is called a σ-algebra (or


σ-field) in Ω if

(1) Ω ∈ A,
(2) Aᶜ ∈ A holds for A ∈ A,
(3) ⋃_{n=1}^∞ A_n ∈ A holds for any {A_n}_{n≥1} ⊂ A.

In this case, we call (Ω, A ) a measurable space, and each element in


A is called an A -measurable set or simply a measurable set.

Property 1.10. A σ-algebra is an algebra.

Property 1.11. Under items (1) and (2) in Definition 1.9, (3) is equivalent to

(3′) ⋂_{n=1}^∞ A_n ∈ A for A_n ∈ A, n = 1, 2, …

Proof. Note that ⋂_{n=1}^∞ A_n = (⋃_{n=1}^∞ A_nᶜ)ᶜ. □

Property 1.12. The intersection of a family of σ-algebras in Ω is


also a σ-algebra.

Proof. Let {A_r : r ∈ Γ} be a family of σ-algebras in Ω. Then A = ⋂_{r∈Γ} A_r is a σ-algebra in Ω as well because of the following:

(1) for any r ∈ Γ, we have ∅, Ω ∈ A_r, so that ∅, Ω ∈ A;
(2) if A ∈ A, then A ∈ A_r for any r ∈ Γ, so that Aᶜ ∈ A_r (r ∈ Γ), i.e. Aᶜ ∈ A;
(3) if A_1, A_2, … ∈ A, then A_1, A_2, … ∈ A_r for any r ∈ Γ, so that ⋃_{n=1}^∞ A_n ∈ A_r for all r ∈ Γ, hence ⋃_{n=1}^∞ A_n ∈ A. □

Example 1.13. A = {∅, Ω} is the smallest σ-algebra in Ω, while

A = 2^Ω := {A : A ⊂ Ω}

is the largest σ-algebra in Ω, where the notation 2^Ω comes from the fact that a subset A of Ω is identified with an element of {0, 1}^Ω via Ω ∋ ω ↦ 1_A(ω), where 1_A is the indicator function of A.
Theorem 1.14. Let C be a class of subsets of Ω. Then there exists a unique σ-algebra A in Ω such that
(1) C ⊂ A,
(2) if A′ is a σ-algebra in Ω and A′ ⊃ C, then A′ ⊃ A.
We denote A by σ(C) and call it the σ-algebra induced (or generated) by C.
Proof. Since the largest σ-algebra includes C , there exists at
least one σ-algebra including C . Let A be the intersection of
all σ-algebras including C . By Property 1.12, A is the smallest
σ-algebra including C . 
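On a finite Ω, the generated σ-algebra can even be computed explicitly, since countable operations reduce to finite ones. The following Python sketch (ours; the function name sigma is illustrative) iterates closure under complement and union until the class stabilizes.

```python
# Compute sigma(C) on a finite Omega by iterating closure under complement
# and (finite) union; a minimal sketch assuming sets are frozensets.
def sigma(C, Omega):
    A = set(C) | {frozenset(), Omega}
    while True:
        new = {Omega - X for X in A} | {X | Y for X in A for Y in A}
        if new <= A:
            return A
        A |= new

Omega = frozenset(range(4))
C = [frozenset({0}), frozenset({0, 1})]
# The atoms of sigma(C) are {0}, {1}, {2, 3}, so it has 2**3 = 8 elements.
print(len(sigma(C, Omega)))  # 8
```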
The following theorem shows that the induced procedure from a
semi-algebra to σ-algebra can be decomposed into two steps, i.e. first
induce the algebra and then the σ-algebra.
Theorem 1.15. If S is a semi-algebra of Ω, then σ(S ) =
σ(F (S )).
Proof. Since σ(F (S )) ⊃ S , we have σ(F (S )) ⊃ σ(S ). Con-
versely, since σ(S ) is an algebra including S , we have σ(S ) ⊃
F (S ), and hence, σ(S ) ⊃ σ(F (S )). 

Example 1.16. Let (Ω, T) be a topological space, where T is the class of all open subsets of Ω. The σ-algebra B := σ(T) is called the Borel field (or Borel σ-algebra).

1.1.4 Monotone class theorem


By Theorem 1.7, the algebra is easily induced from a semi-algebra.
Combining this with Theorem 1.15, to induce the σ-algebra from a
semi-algebra, one only needs to generate it from the induced algebra.
Note that the difference between the algebra and the σ-algebra is

the following: the former is only closed under finitely many operations, while the latter is closed under countably many operations. Intuitively, countably many operations can be characterized as limits of finitely many operations. So, it is reasonable to consider the limit for sequences of sets.
Note that the limit for a sequence of sets is defined only in the
monotone case by the union (respectively, intersection) for an increas-
ing (respectively, decreasing) sequence. This leads to the notion of
monotone class.

Definition 1.17. A class M of subsets of Ω is called a monotone class if it is closed under the limits of monotone sequences, that is,

(1) if A_n ∈ M, n = 1, 2, …, and A_1 ⊂ A_2 ⊂ ⋯, then ⋃_{n=1}^∞ A_n ∈ M;
(2) if A_n ∈ M, n = 1, 2, …, and A_1 ⊃ A_2 ⊃ ⋯, then ⋂_{n=1}^∞ A_n ∈ M.

Theorem 1.18. A class of subsets of Ω is a σ-algebra if and only if


it is both an algebra and a monotone class.

Proof. It suffices to prove the sufficiency. Let A be both an algebra and a monotone class. We only need to show that A is closed under countably many unions. Since A is an algebra, for any A_1, A_2, … ∈ A, B_n := ⋃_{i=1}^n A_i ∈ A is increasing in n. Since A is a monotone class, this implies

⋃_{i=1}^∞ A_i = ⋃_{n=1}^∞ B_n ∈ A.

This finishes the proof. □

Theorem 1.19. Let C be an algebra in Ω. Then there exists a unique


monotone class M0 in Ω fulfilling
(1) M0 ⊃ C ,
(2) M ⊃ M0 holds for any monotone class M including C .
We call M0 the monotone class induced (or generated) from C and
denote it by M (C ).

Theorem 1.20. Let F be an algebra. Then M (F ) = σ(F ).


Proof. By Theorem 1.18, it suffices to prove that M(F) is an algebra.

(a) We first show that A ∈ M(F) ⇒ Aᶜ ∈ M(F). Let M_1 = {A : Aᶜ ∈ M(F)}. For any decreasing sequence {A_n} ⊂ M_1, the sequence {A_nᶜ} ⊂ M(F) is increasing. Since M(F) is a monotone class, we have ⋃_{n=1}^∞ A_nᶜ ∈ M(F), so ⋂_{n=1}^∞ A_n = (⋃_{n=1}^∞ A_nᶜ)ᶜ ∈ M_1. Similarly, M_1 is closed under the limits of increasing sequences. Thus, M_1 is a monotone class including F, so it includes M(F). Therefore, Aᶜ ∈ M(F) for any A ∈ M(F).
(b) Next, we prove that for any A ∈ F, A ∩ B ∈ M(F) holds for B ∈ M(F). To this end, let M_A = {B ∈ M(F) : A ∩ B ∈ M(F)}. Then M_A ⊃ F. If {B_n} ⊂ M_A is increasing, then so is {A ∩ B_n} ⊂ M(F). Since M(F) is a monotone class, we obtain A ∩ ⋃_{n=1}^∞ B_n ∈ M(F), which implies ⋃_{n=1}^∞ B_n ∈ M_A. A similar argument shows that M_A is closed under decreasing limits, so M_A is a monotone class. Thus, M_A ⊃ M(F); i.e. A ∩ B ∈ M(F) holds for any B ∈ M(F).
(c) Finally, we prove A ∩ B ∈ M(F) for A, B ∈ M(F). By (b), for any A ∈ M(F), M_A is a monotone class including F, so that M_A ⊃ M(F). Thus, A ∩ B ∈ M(F). □
The trick behind the proof can be summarized as follows. To prove that a class C_1 of sets has a certain property, we define a new class C_2 consisting of all sets having this property, and then it suffices to show that C_1 ⊂ C_2. To realize this procedure, we sometimes need to split it into several steps, as we did above in two steps. This technique will be used frequently.

Monotonicity is easier to check than closedness under countable unions (or intersections). In the spirit of Theorem 1.18, where a monotone class and an algebra give rise to a σ-algebra, we introduce in the following another pair of classes that form a σ-algebra.
Definition 1.21. (1) A class C of subsets of Ω is called a π-system
if it is closed under intersections.

(2) A class C of subsets of Ω is called a λ-system if it fulfills the following:
(i) Ω ∈ C,
(ii) B − A ∈ C holds for A, B ∈ C with A ⊂ B,
(iii) ⋃_{n=1}^∞ A_n ∈ C holds for any increasing {A_n} ⊂ C.

Property 1.22. If C is a λ-system, then it is a monotone class.

Proof. If {A_n} ⊂ C is decreasing, then {A_nᶜ} ⊂ C is increasing. By the definition of λ-system, we obtain ⋃_{n=1}^∞ A_nᶜ ∈ C, so that ⋂_{n=1}^∞ A_n = Ω − ⋃_{n=1}^∞ A_nᶜ ∈ C. □


Property 1.23. If C is both a π-system and a λ-system, then C is a σ-algebra.

Proof. By Theorem 1.18 and Property 1.22, it suffices to prove that C is an algebra, which follows easily from the definition of π-system and item (ii) in the definition of λ-system. □
In the same spirit of Theorems 1.14 and 1.19 for the induced σ-
algebra and monotone class, any class of sets C induces a unique
λ-system λ(C ), which is the smallest λ-system including C .
Theorem 1.24. Let C be a π-system. Then λ(C ) = σ(C ).
Proof. Since λ(C ) ⊂ σ(C ) and λ(C ) is a monotone class, by
Theorem 1.18, it suffices to prove λ(C ) is an algebra. By the def-
inition of λ-system, λ(C ) is closed under complement, so it remains
to verify the closedness under intersections. We split the proof into
two steps:
(1) Let A ∈ λ(C ), B ∈ C . We intend to prove that A∩B ∈ λ(C ). By
the trick explained above, let CB = {A : A ∩ B ∈ λ(C )}. Since
C is a π-system, we have CB ⊃ C . By the definition of CB and
the fact λ(C ) is a λ-system, it is easy to verify that CB is also a
λ-system, so CB ⊃ λ(C ), that is, A∩B ∈ λ(C ) for any A ∈ λ(C ).
(2) Let B ∈ λ(C ). From (1), we see that CB ⊃ C and it is a λ-system,
so that CB ⊃ λ(C ). Thus, A ∩ B ∈ λ(C ) for A, B ∈ λ(C ). 
Having the above preparations, we obtain the following important
theorem.

Theorem 1.25 (Monotone class theorem). Let C and A be


two classes of subsets of Ω with C ⊂ A .
(1) If A is a λ-system and C is a π-system, then σ(C ) ⊂ A .
(2) If A is a monotone class and C is an algebra, then σ(C ) ⊂ A .
This is the main result for classes of sets. In the following, we explain the main idea in applying the monotone class theorem. Let C be a class of sets having a certain property S, and suppose one wants to verify the same property for sets in σ(C). For this, let A = {B : B has property S}, so that A ⊃ C. By the monotone class theorem, it suffices to show that C is a π-system or an algebra, and accordingly, that A is a λ-system or a monotone class.
Remark 1.26. The relations among the various classes of sets can be summarized as follows:

σ-algebra ⟺ algebra + monotone class;
σ-algebra ⟺ π-system + λ-system;
σ-algebra ⟹ algebra ⟹ semi-algebra ⟹ π-system;
λ-system ⟹ monotone class.

1.1.5 Product measurable space


Let (Ω_i, A_i), i = 1, …, n, be finitely many measurable spaces. Let

C = {A_1 × ⋯ × A_n : A_i ∈ A_i, 1 ≤ i ≤ n},

where each element in C is called a rectangle in the product space

Ω := Ω1 × · · · × Ωn .

It is easy to check that C is a semi-algebra in Ω. We call A :=


σ(C ) the product σ-algebra of A1 , . . . , An and denote it by A1 ×
· · · × An . Moreover, (Ω, A ) is called the product measurable space
of (Ωi , Ai ), i = 1, . . . , n.
Theorem 1.27 (Associative law). For any n ≥ 3 and 1 ≤ k < n, we have

A_1 × A_2 × ⋯ × A_n = (A_1 × ⋯ × A_k) × (A_{k+1} × ⋯ × A_n).

Theorem 1.27 can be derived from the definition of the product σ-algebra and the monotone class theorem; the proof is left as an exercise.

1.2 Measure

After constructing a measurable space (Ω, A ), we intend to define


a real function on A , which is called a measure if it is nonnegative
and satisfies the countable additivity. In general, we consider a real
function defined on a class of sets, which is called a function of sets.

1.2.1 Function of sets


Definition 1.28. A function on a class of sets C in Ω is a map
Φ : C → (−∞, +∞],
such that Φ(A) < ∞ for some A ∈ C .
We allow a function of sets to take the value +∞ so that the Lebesgue measure is included. On the other hand, we do not allow it to take the value −∞, in order to avoid the sum of +∞ and −∞ when additivity is considered. In general, we study functions of sets with the following properties:

(1) (Additivity) A function Φ of sets is called additive if Φ(A + B) = Φ(A) + Φ(B) holds for any disjoint A, B ∈ C such that A + B ∈ C.
(2) (Finite additivity) A function Φ of sets is called finitely additive if

Φ(Σ_{i=1}^n A_i) = Σ_{i=1}^n Φ(A_i)

holds for any n ≥ 2 and mutually disjoint A_1, …, A_n ∈ C with Σ_{i=1}^n A_i ∈ C.
(3) (σ-additivity, or countable additivity) A function Φ of sets is called σ-additive if

Φ(Σ_{i=1}^∞ A_i) = Σ_{i=1}^∞ Φ(A_i)

holds for any mutually disjoint {A_n}_{n≥1} ⊂ C with Σ_{i=1}^∞ A_i ∈ C.
(4) (Finiteness) A function Φ of sets is called finite if Φ(A) ∈ R holds for all A ∈ C.
(5) (σ-finiteness) A function Φ of sets is called σ-finite if for any A ∈ C, there exists a sequence {A_n}_{n≥1} ⊂ C such that Φ(A_n) ∈ R for all n ≥ 1 and A = ⋃_{n=1}^∞ A_n.

Definition 1.29. A signed measure is a function of sets with σ-additivity. A measure is a signed measure taking nonnegative values. A probability measure is a measure with Φ(Ω) = 1. If a function Φ of sets takes nonnegative values and is finitely additive, then it is called a finitely additive measure.

Note that a signed measure or a finitely additive measure may not be a measure. The following propositions for functions of sets are obvious, so the proofs are omitted.
Proposition 1.30. Let Φ be a function on C .
(1) Finite additivity ⇒ additivity.
(2) If ∅ ∈ C , then σ-additivity ⇒ finite additivity.
(3) If C is an algebra, then finite additivity ⇔ additivity.
(4) If Φ is additive and ∅ ∈ C , then Φ(∅) = 0.
The following result characterizes the properties of functions on
different classes of sets.
Property 1.31.
(1) (Subtractivity) Let Φ be an additive function on an algebra F, and let A, B ∈ F with A ⊂ B. We have Φ(B) = Φ(A) + Φ(B − A). If Φ(A) < ∞, then Φ(B − A) = Φ(B) − Φ(A).
(2) (Monotonicity) Let μ be a finitely additive measure on a semi-algebra S. Then μ(A) ≤ μ(B) holds for A, B ∈ S with A ⊂ B.
(3) (Finiteness) Let Φ be a finitely additive function on a semi-algebra S. If Φ(B) < ∞ and A ⊂ B, then Φ(A) < ∞. In addition, if Φ(Ω) < ∞, then Φ is finite.
(4) (σ-finiteness) Let Φ be a finitely additive function on a semi-algebra S. If Ω = ⋃_{n=1}^∞ A_n with A_n ∈ S and Φ(A_n) < ∞ for all n ≥ 1, then for any A ∈ S there exist mutually disjoint {A_n′} ⊂ S such that A = Σ_{n=1}^∞ A_n′ and Φ(A_n′) < ∞ for all n ≥ 1.

Proof. We only prove (2) and (4); the rest is obvious.

(2) By the property of semi-algebra, there exist mutually disjoint A_1, …, A_n ∈ S such that B = A + A_1 + ⋯ + A_n. By the finite additivity and nonnegativity of μ, we have μ(B) = μ(A) + Σ_{i=1}^n μ(A_i) ≥ μ(A).
(4) We first prove that Ω can be expressed as the union of countably many disjoint sets in S whose Φ-values are finite. Let B_1 = A_1 and B_n = A_n − ⋃_{k=1}^{n−1} A_k for n > 1. By the definition of semi-algebra, there exist mutually disjoint B_{n1}, …, B_{nk_n} ∈ S with B_n = Σ_{i=1}^{k_n} B_{ni}, so that

Ω = Σ_{n=1}^∞ Σ_{i=1}^{k_n} B_{ni} = Σ_{k=1}^∞ B_k′ after renumbering, with {B_k′} ⊂ S mutually disjoint.

From (3), it follows that Φ(B_{ni}) < ∞ for all n and i, which implies Φ(B_k′) < ∞ for all k ≥ 1. The desired assertion then follows by letting A_n′ = A ∩ B_n′. □


Proposition 1.32.
(1) (Finite subadditivity) Let μ be a finitely additive measure on an algebra F. For any A, A_1, …, A_n ∈ F with A ⊂ ⋃_{k=1}^n A_k, we have μ(A) ≤ Σ_{k=1}^n μ(A_k).
(2) (σ-subadditivity) Let μ be a measure on an algebra F. If A ∈ F and {A_n}_{n≥1} ⊂ F are such that A ⊂ ⋃_{n=1}^∞ A_n, then μ(A) ≤ Σ_{n=1}^∞ μ(A_n).

Proof. (1) By induction, we only need to prove the case n = 2. By the monotonicity and additivity,

μ(A) ≤ μ(A_1 ∪ A_2) = μ(A_1 + (A_2 − A_1)) = μ(A_1) + μ(A_2 − A_1) ≤ μ(A_1) + μ(A_2).

(2) Let A_0 = ∅. By the monotonicity and σ-additivity,

μ(A) = μ(⋃_{n=1}^∞ (A_n ∩ A)) = μ(Σ_{n=1}^∞ A ∩ (A_n − ⋃_{i≤n−1} A_i)) = Σ_{n=1}^∞ μ(A ∩ (A_n − ⋃_{i≤n−1} A_i)) ≤ Σ_{n=1}^∞ μ(A_n). □

Definition 1.33. A function Φ on a class C is called lower continuous at A ∈ C if lim_{n→∞} Φ(A_n) = Φ(A) for any sequence C ∋ A_n ↑ A, and upper continuous at A ∈ C if lim_{n→∞} Φ(A_n) = Φ(A) for any sequence C ∋ A_n ↓ A such that Φ(A_n) < ∞ for some n. Moreover, Φ is called continuous at A ∈ C if it is both lower and upper continuous at A, and Φ is called continuous if it is continuous at every A ∈ C.

Note that we require the condition that Φ(A_n) < ∞ for some n in the definition of upper continuity; otherwise, the classical Lebesgue measure would be excluded. More precisely, for Ω = R and Φ the Lebesgue measure, A_n := (n, ∞) decreases to ∅, but Φ(∅) = 0 ≠ ∞ = lim_{n→∞} Φ(A_n).

Theorem 1.34. Let Φ be a signed measure on an algebra F . Then


Φ is continuous.


Proof. Let F ∋ A_n ↑ A ∈ F. We have A = ⋃_{n=1}^∞ A_n = A_1 + Σ_{n=2}^∞ (A_n − A_{n−1}). If there exists n such that Φ(A_n) = ∞, then Φ(A) = ∞ = lim_{n→∞} Φ(A_n). If Φ(A_n) < ∞ for every n, then by the σ-additivity and the subtractive property,

Φ(A) = Φ(A_1) + Σ_{n=2}^∞ Φ(A_n − A_{n−1}) = Φ(A_1) + Σ_{n=2}^∞ [Φ(A_n) − Φ(A_{n−1})]
= Φ(A_1) + lim_{n→∞} Σ_{k=2}^n [Φ(A_k) − Φ(A_{k−1})] = lim_{n→∞} Φ(A_n).

So, Φ is lower continuous.

On the other hand, let F ∋ A_n ↓ A ∈ F with Φ(A_{n_0}) < ∞ for some n_0. Then A_{n_0} − A_n ↑ A_{n_0} − A, so that Φ(A_{n_0} − A_n) → Φ(A_{n_0} − A) by the lower continuity. This and the subtractive property imply Φ(A_n) → Φ(A). □
Corollary 1.35. A measure on an algebra is continuous.
The following theorem shows that when Φ is finitely additive, the
continuity also implies the σ-additivity. This together with Theorem
1.34 implies the equivalence of the σ-additivity and the continuity of
finitely additive functions on an algebra.
Theorem 1.36. Let Φ be a finitely additive function on an algebra
F . If Φ satisfies one of the following conditions, then Φ is σ-additive:
(a) Φ is lower continuous;
(b) Φ is finite and is continuous at ∅.
Proof. Let (a) hold. If {A_n}_{n≥1} ⊂ F are mutually disjoint and A = Σ_{n=1}^∞ A_n ∈ F, then B_n := Σ_{k=1}^n A_k ↑ A. It follows from the lower continuity and the finite additivity that

Φ(A) = lim_{n→∞} Φ(B_n) = lim_{n→∞} Φ(Σ_{k=1}^n A_k) = lim_{n→∞} Σ_{k=1}^n Φ(A_k) = Σ_{k=1}^∞ Φ(A_k).

Let (b) hold, and let {A_n}_{n≥1} and {B_n}_{n≥1} be as above. We have F ∋ A − B_n ↓ ∅. By the continuity at ∅ and the subtractive property, we obtain 0 = lim_{n→∞} Φ(A − B_n) = Φ(A) − lim_{n→∞} Φ(B_n). Thus, Φ(A) = Σ_{k=1}^∞ Φ(A_k). □


1.2.2 Measure space


Definition 1.37. Let A be a σ-algebra in Ω and μ be a measure
on A . Then (Ω, A , μ) is called a measure space. If μ is a probability
measure, then (Ω, A , μ) is called a probability space, and in this case,
μ is often denoted by P, and a measurable set is called an event.
Let (Ω, A , P) be a probability space. By properties of a finite
measure, we have the following assertions for the probability
measure P:

(1) (Nonnegativity) P(A) ≥ 0 for all A ∈ A.
(2) (Normalization) P(Ω) = 1.
(3) (σ-additivity, hence finite additivity) If A_n ∈ A, n = 1, 2, …, are mutually disjoint, then

P(Σ_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).

(4) (Subtractive property, hence monotonicity) If A ⊂ B with A, B ∈ A, then

P(B − A) = P(B) − P(A), so that P(B) ≥ P(A).

(5) (Additive formula) P(A ∪ B) = P(A) + P(B) − P(A ∩ B). In general, for A_1, …, A_n ∈ A, we have

P(⋃_{k=1}^n A_k) = Σ_{k=1}^n P(A_k) − Σ_{1≤i<j≤n} P(A_i ∩ A_j) + ⋯ + (−1)^{n−1} P(A_1 ∩ ⋯ ∩ A_n)
= Σ_{ℓ=1}^n (−1)^{ℓ−1} Σ_{1≤i_1<i_2<⋯<i_ℓ≤n} P(A_{i_1} ∩ ⋯ ∩ A_{i_ℓ}).

(6) (Continuity) For A, A_n ∈ A, n ≥ 1,

A_n ↑ A ⇒ P(A_n) ↑ P(A);  A_n ↓ A ⇒ P(A_n) ↓ P(A).
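As a sanity check of the additive formula (5), the following Python sketch (ours, using exact rational arithmetic on a finite uniform space) compares both sides of the inclusion–exclusion identity for three events.

```python
from fractions import Fraction
from itertools import combinations

# Uniform probability P(A) = |A| / |Omega| on a 12-point sample space.
Omega = frozenset(range(12))
def P(A):
    return Fraction(len(A), len(Omega))

events = [frozenset(range(0, 8)), frozenset(range(4, 10)), frozenset(range(6, 12))]

lhs = P(frozenset().union(*events))
rhs = sum(
    (-1) ** (l - 1)
    * sum(P(frozenset.intersection(*sub)) for sub in combinations(events, l))
    for l in range(1, len(events) + 1)
)
assert lhs == rhs == Fraction(1)  # the three events cover Omega here
print(lhs)
```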

Example 1.38 (Geometric probability model). Let Ω ⊂ R be a Lebesgue measurable set with 0 < |Ω| < ∞, where |·| denotes the Lebesgue measure. Let A be the class of Lebesgue measurable subsets of Ω, and let P(A) = |A|/|Ω| for A ∈ A. Then (Ω, A, P) is a probability space.
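The geometric probability model is easy to illustrate numerically. The sketch below (ours, using the obvious two-dimensional analogue of Example 1.38 with Ω = [0, 1]² and A the quarter disc) estimates P(A) = |A|/|Ω| = π/4 by Monte Carlo sampling.

```python
import random

# Monte Carlo estimate of P(A) for A the quarter disc in the unit square.
random.seed(0)
n = 10**6
hits = sum(random.random() ** 2 + random.random() ** 2 <= 1 for _ in range(n))
print(hits / n)  # approximately 0.785 = pi/4
```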

1.3 Extension and Completion of Measure

As explained at the beginning of this chapter, a measure is often


easily defined on a semi-algebra. So, to build up a measure space, it
is crucial to extend a measure from a semi-algebra to the induced

σ-algebra. In this section, we first extend a measure from a semi-algebra to its generated algebra, which is easy to do via the formula for the induced algebra, then further extend it to the generated σ-algebra, and finally form the completion of the resulting measure space.

1.3.1 Extension from semi-algebra to the induced


algebra

Definition 1.39. Let C1 ⊂ C2 be two classes of sets in Ω, and let μi


be measures (or finitely additive measures) defined on Ci (i = 1, 2),
respectively. If μ1 (A) = μ2 (A) holds for any A ∈ C1 , then we call μ2
an extension of μ1 from C1 to C2 and call μ1 the restriction of μ2 on
C1 which is denoted by μ1 = μ2 |C1 .

Theorem 1.40. Let μ be a measure (or finitely additive measure)


on a semi-algebra S . Then μ can be uniquely extended to a measure
(or finitely additive measure) μ̃ on F(S).

Proof. By Theorem 1.7, for any A ∈ F(S), there exist mutually disjoint B_1, …, B_n ∈ S such that A = Σ_{i=1}^n B_i. Define μ̃(A) = Σ_{i=1}^n μ(B_i). First, we prove that μ̃(A) is independent of the choice of {B_i}. Let B_1′, …, B_m′ ∈ S be mutually disjoint with A = Σ_{j=1}^m B_j′. Then B_i = Σ_{j=1}^m (B_i ∩ B_j′). Since B_i ∩ B_j′ ∈ S, the finite additivity gives μ(B_i) = Σ_{j=1}^m μ(B_i ∩ B_j′). So,

Σ_{i=1}^n μ(B_i) = Σ_{i=1}^n Σ_{j=1}^m μ(B_i ∩ B_j′) = Σ_{j=1}^m Σ_{i=1}^n μ(B_i ∩ B_j′) = Σ_{j=1}^m μ(B_j′).

Thus, μ̃(A) is independent of the choice of {B_i}.


Next, we prove that μ̃ is a measure (or finitely additive measure). Nonnegativity and uniqueness are obvious, as is finite additivity. We now prove the σ-additivity. Let {A_n}_{n≥1} ⊂ F(S) be mutually disjoint with A = Σ_{n=1}^∞ A_n ∈ F(S). Take mutually disjoint B_1, …, B_k ∈ S such that A = Σ_{i=1}^k B_i. Again, for each n ≥ 1, take mutually disjoint C_{n1}, …, C_{nk_n} ∈ S satisfying A_n = Σ_{i=1}^{k_n} C_{ni}. Then, for each i ≤ k, B_i = A ∩ B_i = Σ_{n=1}^∞ Σ_{l=1}^{k_n} (B_i ∩ C_{nl}) is a union of mutually disjoint sets in S. By the σ-additivity of μ, we have μ(B_i) = Σ_{n=1}^∞ Σ_{l=1}^{k_n} μ(B_i ∩ C_{nl}). From this and the finite additivity, it follows that

μ̃(A) = Σ_{i=1}^k μ(B_i) = Σ_{i=1}^k Σ_{n=1}^∞ Σ_{l=1}^{k_n} μ(B_i ∩ C_{nl}) = Σ_{n=1}^∞ Σ_{l=1}^{k_n} Σ_{i=1}^k μ(B_i ∩ C_{nl}) = Σ_{n=1}^∞ μ̃(A_n). □

By applying Proposition 1.32 to μ̃ on F(S), we obtain the following result.

Corollary 1.41. Let μ be a finitely additive measure on a semi-algebra S, and let A, A_1, …, A_n ∈ S.

(a) If A_1, …, A_n are mutually disjoint and Σ_{i=1}^n A_i ⊂ A, then Σ_{i=1}^n μ(A_i) ≤ μ(A).
(b) If ⋃_{i=1}^n A_i ⊃ A, then Σ_{i=1}^n μ(A_i) ≥ μ(A).

If μ is σ-additive, the above assertions hold for n = ∞.
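On a finite Ω, the extension in Theorem 1.40 can be computed directly. The following Python sketch (ours; brute-force search over disjoint decompositions) evaluates μ̃ on F(S); by Theorem 1.40, the value does not depend on which decomposition the search finds.

```python
from itertools import combinations

def mu_tilde(A, S, mu):
    # Find any disjoint decomposition of A by members of S and sum mu over it.
    S = list(S)
    for r in range(1, len(S) + 1):
        for parts in combinations(S, r):
            if (all(p & q == frozenset() for p, q in combinations(parts, 2))
                    and frozenset().union(*parts) == A):
                return sum(mu[B] for B in parts)  # well defined by Theorem 1.40
    raise ValueError("A is not in F(S)")

Omega = frozenset({1, 2, 3})
S = [frozenset(), frozenset({1}), frozenset({2, 3}), Omega]
mu = {S[0]: 0.0, S[1]: 0.25, S[2]: 0.75, S[3]: 1.0}
print(mu_tilde(frozenset({1, 2, 3}), S, mu))  # 1.0, consistent with {1} + {2,3}
```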

1.3.2 Extension from semi-algebra to the generated


σ-algebra
Theorem 1.42 (Measure extension theorem). Let μ be a mea-
sure on a semi-algebra S in Ω. Then it can be extended to a measure
on σ(S ). If furthermore μ is σ-finite, then the extension is unique.

Following the line of the Lebesgue measure theory, we first define


an outer measure for every subset of Ω by the covering procedure and
then prove that the restriction of the outer measure is σ-additive on
the generated σ-algebra.
Definition 1.43. Let μ be a measure on a semi-algebra S in Ω. For any A ⊂ Ω,

μ*(A) := inf { Σ_{n=1}^∞ μ(A_n) : A ⊂ ⋃_{n=1}^∞ A_n, A_n ∈ S }

is called the outer measure of A, and the function μ* defined on the largest σ-algebra 2^Ω is called the outer measure generated by μ.
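On a finite Ω, the infimum in Definition 1.43 reduces to a minimum over finitely many covers, which the following Python sketch of ours computes by exhaustive search (the covering class and weights below are purely illustrative).

```python
from itertools import combinations

def outer_measure(A, S, mu):
    # Minimum total weight over all finite sub-collections of S covering A.
    best = float("inf")
    for r in range(1, len(S) + 1):
        for cover in combinations(S, r):
            if A <= frozenset().union(*cover):
                best = min(best, sum(mu[C] for C in cover))
    return best

Omega = frozenset({1, 2, 3, 4})
S = [frozenset({1, 2}), frozenset({3}), frozenset({3, 4}), Omega]
mu = {S[0]: 1.0, S[1]: 0.5, S[2]: 2.0, S[3]: 5.0}
print(outer_measure(frozenset({1, 3}), S, mu))  # 1.5, via the cover {1,2}, {3}
```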
Property 1.44.
(1) μ*|_S = μ.
(2) μ*(A) ≤ μ*(B) for any A ⊂ B.
(3) μ*(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ μ*(A_n) for any A_n ⊂ Ω, n ≥ 1.

Proof. (1) As A ⊂ A, by letting A_1 = A and A_n = ∅ for n ≥ 2, we obtain μ*(A) ≤ μ(A). On the other hand, by the σ-subadditivity of μ, μ(A) ≤ Σ_{n=1}^∞ μ(A_n) holds for any sequence {A_n} ⊂ S with ⋃_{n=1}^∞ A_n ⊃ A. So, μ*(A) ≥ μ(A).
(2) Obvious.
(3) For any ε > 0 and n ≥ 1, take A_{n1}, A_{n2}, … ∈ S such that ⋃_{i=1}^∞ A_{ni} ⊃ A_n and μ*(A_n) ≥ Σ_{i=1}^∞ μ(A_{ni}) − ε/2ⁿ. Then ⋃_{n=1}^∞ ⋃_{i=1}^∞ A_{ni} ⊃ ⋃_{n=1}^∞ A_n, and by the definition of μ*,

μ*(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ Σ_{i=1}^∞ μ(A_{ni}) ≤ Σ_{n=1}^∞ (μ*(A_n) + ε/2ⁿ) = Σ_{n=1}^∞ μ*(A_n) + ε.

Let ε ↓ 0 to derive the assertion. □




If μ* were a measure on 2^Ω, then the restriction μ*|_{σ(S)} would be the desired extended measure. However, this is in general not true, as the Lebesgue measure already provides a counterexample. So, we need to find a class A* of “regular” sets such that A* ⊃ σ(S) and μ* is σ-additive on A*. The intuition for selecting a “regular” set is that it should not lead to any loss of outer measure when used to cut other sets. In this spirit, we introduce the notion of μ*-measurable set as follows.
Definition 1.45. A set A ⊂ Ω is called μ∗ -measurable if

μ∗ (D) = μ∗ (A ∩ D) + μ∗ (Ac ∩ D), ∀D ⊂ Ω.

Let A ∗ = {A ⊂ Ω : A is μ∗ -measurable}.
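The Carathéodory condition can be tested by brute force on a tiny Ω. In the Python sketch below (ours), μ* is the outer measure induced by the trivial measure on the semi-algebra S = {∅, Ω} with μ(Ω) = 1, so that μ*(D) = 1 for every nonempty D; only ∅ and Ω pass the test, and indeed A* = σ(S) = {∅, Ω} here.

```python
from itertools import combinations

Omega = frozenset({1, 2, 3})

def subsets(X):
    X = list(X)
    return [frozenset(c) for r in range(len(X) + 1) for c in combinations(X, r)]

def mstar(D):
    # Outer measure induced by mu on S = {empty, Omega}, mu(Omega) = 1.
    return 0 if not D else 1

def is_mstar_measurable(A):
    # Caratheodory test of Definition 1.45 over all test sets D.
    return all(mstar(D) == mstar(A & D) + mstar((Omega - A) & D)
               for D in subsets(Omega))

print(sorted(map(sorted, (A for A in subsets(Omega) if is_mstar_measurable(A)))))
# [[], [1, 2, 3]]
```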
We shall prove that A ∗ is a σ-algebra including S and μ∗ is a
measure on A ∗ . For this, we first study the properties of μ∗ and A ∗ .
The following is a consequence of Property 1.44(3).
Property 1.46. A is a μ*-measurable set if and only if

μ*(D) ≥ μ*(A ∩ D) + μ*(Aᶜ ∩ D), ∀D ⊂ Ω.

Property 1.47. A ∗ ⊃ S .
Proof. Let A ∈ S and D ⊂ Ω. For any ε > 0, take {A_n} ⊂ S such that ⋃_{n=1}^∞ A_n ⊃ D and μ*(D) ≥ Σ_{n=1}^∞ μ(A_n) − ε. Then, by the σ-subadditivity of μ* and the finite additivity of μ on F(S), it follows that

μ*(Aᶜ ∩ D) + μ*(A ∩ D) ≤ Σ_{n=1}^∞ [μ(A_n ∩ A) + μ(A_n ∩ Aᶜ)] = Σ_{n=1}^∞ μ(A_n) ≤ μ*(D) + ε.

Let ε ↓ 0 to derive μ*(D) ≥ μ*(A ∩ D) + μ*(Aᶜ ∩ D). Thus, A ∈ A* by Property 1.46. □

Theorem 1.48.
(1) A* is a σ-algebra, so that A* ⊃ σ(S).
(2) If {A_n} ⊂ A* are mutually disjoint and A = Σ_{n=1}^∞ A_n, then for any D ⊂ Ω,

μ*(D ∩ A) = Σ_{n=1}^∞ μ*(D ∩ A_n).

(3) The restriction of μ* to A* is a measure on A*.

Proof. (1) We first prove that A* is an algebra. Since A* ⊃ S, we have ∅, Ω ∈ A*. It is obvious that A ∈ A* implies Aᶜ ∈ A*. So, it suffices to prove that A, B ∈ A* implies A ∩ B ∈ A*. By subadditivity,

μ*(D) = μ*(A ∩ D) + μ*(Aᶜ ∩ D)
= μ*(A ∩ B ∩ D) + μ*(A ∩ Bᶜ ∩ D) + μ*(Aᶜ ∩ D)
≥ μ*(A ∩ B ∩ D) + μ*((Aᶜ ∪ Bᶜ) ∩ D).

Thus, A ∩ B ∈ A* by Property 1.46.

Next, we prove that A* is a monotone class. Let A_n ∈ A* with A_n ↑ A. By Property 1.44(2) and letting A_0 = ∅, we have

μ*(D) = μ*(A_1 ∩ D) + μ*(A_1ᶜ ∩ D)
= μ*(A_1 ∩ D) + μ*(A_2 ∩ A_1ᶜ ∩ D) + μ*(D ∩ A_2ᶜ)
= ⋯ = Σ_{i=1}^n μ*((A_i − A_{i−1}) ∩ D) + μ*(D ∩ A_nᶜ)
≥ Σ_{i=1}^n μ*((A_i − A_{i−1}) ∩ D) + μ*(D ∩ Aᶜ).   (1.1)

Letting n → ∞, we derive

μ*(D) ≥ μ*(D ∩ Aᶜ) + Σ_{i=1}^∞ μ*((A_i − A_{i−1}) ∩ D) ≥ μ*(D ∩ Aᶜ) + μ*(D ∩ A).

Thus, A ∈ A*. Therefore, A* is a σ-algebra by the monotone class theorem.

(2) Let A = Σ_{n=1}^∞ A_n with A_n ∈ A* mutually disjoint. Then A ∈ A*. By Property 1.44(3), it suffices to prove μ*(D ∩ A) ≥ Σ_{n=1}^∞ μ*(D ∩ A_n). Replacing D by A ∩ D and A_n by Σ_{i=1}^n A_i in (1.1), we obtain μ*(D ∩ A) ≥ Σ_{i=1}^n μ*(D ∩ A_i). Then the proof is finished by letting n ↑ ∞.

(3) The σ-additivity of μ* on A* is obtained by letting D = Ω in (2). □

Proof of Theorem 1.42. Since A* ⊃ σ(S), the restriction of μ* to σ(S) is obviously a measure, and μ*(A) = μ(A) for A ∈ S. Thus, μ has an extension to σ(S).

Now, let μ be σ-finite on S. By Property 1.31(4), there exist mutually disjoint {A_n} ⊂ S such that Ω = Σ_{n=1}^∞ A_n and μ(A_n) < ∞, n ≥ 1. If both μ_1 and μ_2 are measures on σ(S) extending μ, it suffices to prove μ_1(A ∩ A_n) = μ_2(A ∩ A_n) for A ∈ σ(S) and n ≥ 1. For this, let M_n = {A ∈ σ(S) : μ_1(A ∩ A_n) = μ_2(A ∩ A_n)}. Then M_n ⊃ S. By the uniqueness of the extension of μ to F(S), we have M_n ⊃ F(S); thus, by the monotone class theorem, it is sufficient to show that M_n is a monotone class, which follows from the continuity of measures. □

Corollary 1.49. If S is a semi-algebra in Ω, and P is a probabil-


ity measure on S , then P can be uniquely extended to a probability
measure on σ(S ).

1.3.3 Completion of measures

Definition 1.50. Let (Ω, A , μ) be a measure space. A subset B of


Ω is called a μ-null set if there exists A ∈ A such that B ⊂ A and
μ(A) = 0. If all μ-null sets are contained in A , then (Ω, A , μ) is
called a complete measure space.

Theorem 1.51. For a measure space (Ω, A , μ), let

A¯ = {A ∪ N : A ∈ A , N is a μ-null set}

and define μ̄(A ∪ N ) := μ(A) for A ∈ A and N a μ-null set. Then


(Ω, A¯, μ̄) is a complete measure space, which is called the completion
of (Ω, A , μ).

Proof. We first prove that A¯ is a σ-algebra. By the σ-subadditivity


of μ, the union of countable many μ-null sets is still μ-null, so that
A¯ is closed under countable union. It remains to be proved that A¯
is closed under complement. Let A ∪ N ∈ A¯ with A ∈ A and N a
μ-null set. Assume B ∈ A such that B ⊃ N and μ(B) = 0. Then
(A ∪ N )c = Ac ∩ N c = Ac ∩ B c + Ac ∩ (N c − B c ). As Ac ∩ (N c − B c ) ⊂
Ω − B c = B, and μ(B) = 0, Ac ∩ (N c − B c ) is μ-null. Moreover,
Ac ∩ B c ∈ A , so that (A ∪ N )c ∈ A¯ by definition.
Next, it is easy to check that μ̄ is σ-additive on A¯. It suffices to
prove that (Ω, A¯, μ̄) is complete. Let N̄ be a μ̄-null set. Then ∃B̄ ∈ A¯
such that μ̄(B̄) = 0 and B̄ ⊃ N̄ . By B̄ ∈ A¯, we have B̄ = A ∪ N
for some A ∈ A and a μ-null set N . Then 0 = μ̄(B̄) = μ(A). Take
B ∈ A , B ⊃ N such that μ(B) = 0. We have N̄ ⊂ B̄ ⊂ A ∪ B, and
μ(A ∪ B) = 0. So, N̄ is μ-null, and hence, N̄ ∈ A¯ by the definition
of A¯. 

Theorem 1.52. Let μ be a measure on a semi-algebra S , and let μ∗


be the induced outer measure. Then for any A ⊂ Ω with μ∗ (A) < ∞,
∃B ∈ σ(S ) such that
(i) A ⊂ B,
(ii) μ∗ (A) = μ(B),
(iii) μ∗ (C) = 0, ∀C ⊂ B − A and C ∈ σ(S ).
The above B is called a measurable cover of A.


Proof. ∀n  1, take {Fnk }k1 ⊂ S such that A ⊂ Fnk and
k=1

∞ 

μ(Fnk )  μ∗ (A) + 1/n. Let Bn = Fnk . Then μ∗ (A)  μ∗ (Bn ).
k=1 k=1


By setting B = Bn , we have B ∈ σ(S ) with B ⊃ A and μ∗ (B) =
n=1
μ∗ (A). If C ∈ σ(S ) and C ⊂ B − A, then A ⊂ B − C. Thus, μ∗ (A) 
μ∗ (B − C) = μ(B) − μ(C). From this and μ∗ (B) = μ∗ (A) < ∞, it
follows that μ∗ (C) = 0. 

Theorem 1.53. Let μ be a σ-finite measure on a semi-algebra S ,


and let μ∗ be the induced outer measure. Then (Ω, A ∗ , μ∗ ) is the
completion of (Ω, σ(S ), μ).
Proof. By Theorem 1.51, we only need to prove A ∗ = A¯.

Let Ā ∈ A¯. Then there exist A ∈ σ(S) and a μ-null set N such that Ā = A ∪ N. It is clear that A* contains all μ-null sets, so Ā ∈ A*.

Conversely, for A ∈ A* with μ*(A) < ∞, let B be a measurable cover of A and C a measurable cover of B − A. Then A = (B − C) ∪ (C − (B − A)), where B − C ∈ σ(S) and C − (B − A) is μ-null. Thus, A ∈ A¯. When μ*(A) = ∞, by the σ-finiteness of μ, there exist {A_n}_{n≥1} ⊂ S such that ⋃_{n=1}^∞ A_n = Ω and μ(A_n) < ∞, n ≥ 1. From the previous argument, A ∩ A_n ∈ A¯ for n ≥ 1. Thus, A = ⋃_{n=1}^∞ (A_n ∩ A) ∈ A¯. □

Theorem 1.54. Let μ be a σ-finite measure on a semi-algebra S .


Then μ has a unique measure extension on A ∗ .
Proof. By Theorem 1.42, μ extends uniquely to a measure on σ(S), and this extension is the restriction of μ* to σ(S). Let μ_1 be another measure on A* extending μ. Then for any A ∪ N ∈ A* = A¯, where A ∈ σ(S) and N is μ-null, we have

μ*(A ∪ N) = μ*(A) = μ_1(A) ≤ μ_1(A ∪ N) ≤ μ_1(A) + μ_1(N) = μ*(A) + μ_1(N) = μ*(A ∪ N) + μ_1(N).

Let B ∈ σ(S) be such that B ⊃ N and μ(B) = 0. We obtain μ_1(N) ≤ μ_1(B) = μ(B) = 0. So, μ*(A ∪ N) = μ_1(A ∪ N). □

Theorem 1.55. Let μ be a measure on a semi-algebra S in Ω. Then for any A ∈ A* with μ*(A) < ∞ and any ε > 0, there exists A_ε ∈ F(S) such that μ*(A Δ A_ε) < ε, where A Δ A_ε := (A − A_ε) + (A_ε − A).

Proof. For any ε > 0, there exists a sequence {B_n}_{n≥1} ⊂ S such that ⋃_{n=1}^∞ B_n ⊃ A and μ*(A) ≤ Σ_{n=1}^∞ μ*(B_n) ≤ μ*(A) + ε/2. Since μ*(A) < ∞, we have Σ_{n=1}^∞ μ*(B_n) < ∞. Take n_0 ≥ 1 such that Σ_{n>n_0} μ*(B_n) < ε/2. Let A_ε = ⋃_{n=1}^{n_0} B_n and B_ε = ⋃_{n>n_0} B_n. Then A_ε ∈ F(S). By the σ-subadditivity of μ*, it follows that μ*(B_ε) < ε/2. So, μ*((A_ε ∪ B_ε) − A) < ε/2 by the monotone property. As A − A_ε ⊂ B_ε and A_ε − A ⊂ (A_ε ∪ B_ε) − A, we obtain μ*(A Δ A_ε) = μ*((A − A_ε) + (A_ε − A)) < ε. □

1.4 Exercises

1. Prove Proposition 1.6.


2. Prove Property 1.10.
3. Let C be a class of subsets of Ω. Prove that for any A ∈ σ(C), there exists a countable sub-class C_A of C such that A ∈ σ(C_A).
4. (Countable generation) A σ-algebra A is called countably gener-
ated if there exists a countable sub-class C such that σ(C ) = A .
Prove that the Borel σ-algebra B d in Rd is countably generated.
5. Let {C_n}_{n≥1} be an increasing sequence of classes of sets in Ω.

(a) If the {C_n}_{n≥1} are algebras, prove that ⋃_{n=1}^∞ C_n is an algebra.
(b) Exemplify that the {C_n}_{n≥1} may be σ-algebras while ⋃_{n=1}^∞ C_n is not a σ-algebra.

6. Prove Theorem 1.19.


7. Prove that a σ-algebra is either finite or uncountable.
8. Let (Ω_i, A_i), 1 ≤ i ≤ n, be measurable spaces. Prove that

C := {A_1 × ⋯ × A_n : A_i ∈ A_i}

is a semi-algebra in Ω := Ω_1 × ⋯ × Ω_n.
9. Prove Theorem 1.27.
10. Exemplify that an additive measure on a class of sets may not
be finitely additive.
11. Exemplify that the σ-algebra generated by a semi-algebra S cannot in general be expressed as

σ(S) = { ⋃_{n=1}^∞ A_n : A_n ∈ S, n ≥ 1 },

and prove this formula when Ω is finite or countable.



12. Let (Ω_n, A_n, μ_n), n ≥ 1, be a sequence of measure spaces with {Ω_n} mutually disjoint. Set

Ω = ⋃_{n=1}^∞ Ω_n,  A = {A ⊂ Ω : A ∩ Ω_n ∈ A_n for all n ≥ 1},
μ(A) = Σ_{n=1}^∞ μ_n(A ∩ Ω_n), A ∈ A.

Prove that (Ω, A, μ) is a measure space.


13. Let Ω be infinite, and let F be the class of finite subsets of Ω
and their complements. Define P(A) = 0 if A is finite and = 1 if
Ac is finite.
(a) Prove that F is an algebra and P is finitely additive.
(b) When Ω is countable, prove that P is not σ-additive.
(c) When Ω is uncountable, prove that P is σ-additive.

14. Prove Proposition 1.30.


15. Let (Ω, A, P) be a probability space without atoms, i.e. for any A ∈ A with P(A) > 0, there exists B ∈ A such that B ⊂ A and 0 < P(B) < P(A). For any A ∈ A with P(A) > 0, prove that {P(B) : B ∈ A, B ⊂ A} = [0, P(A)].
16. Prove Corollary 1.35.
17. Let ([0, 1], B([0, 1]), μ) be a finite measure space with μ({x}) = 0 for all x ∈ [0, 1], and let ε > 0. Prove that
(a) for every x ∈ [0, 1], there exists an interval I ∋ x such that μ(I) ≤ ε;
(b) there exists a dense subset A of [0, 1] such that μ(A) ≤ ε.

18. Prove Corollary 1.41.


19. Prove Property 1.46.
20. Construct an example of a measure μ on a semi-algebra C that admits more than one measure extension to σ(C).
21. Prove that a measure space (Ω, A , μ) is complete if and only if
A ⊃ {A ⊂ Ω : μ∗ (A) = 0}.

22. Let μ be a finite measure on a semi-algebra S. Define

μ_*(A) = sup { Σ_n μ(A_n) : A_n ∈ S mutually disjoint, ⋃_n A_n ⊂ A },
A_* = {A ⊂ Ω : μ*(A) = μ_*(A)}.

Prove that A* ⊃ A_*.
23. Let (Ω, A , μ) be a measure space. Prove that N ⊂ Ω is μ-null if
and only if μ∗ (N ) = 0.
24. For a measure space (Ω, A, μ), let A_i, B_i ⊂ Ω satisfy μ*(A_i Δ B_i) = 0, i ≥ 1. Prove that

μ*(⋃_{i=1}^∞ A_i) = μ*(⋃_{i=1}^∞ B_i).

25. Let C = {C_{a,b} = [−b, −a) ∪ (a, b] : 0 < a < b} and define μ(C_{a,b}) = b − a. Prove that μ can be extended to a measure on σ(C). Determine whether [1, 2] is μ*-measurable.
26. Let f : [0, ∞) → [0, ∞) be strictly increasing, strictly convex, and with f(0) = 0. For any A ⊂ (0, 1], define μ*(A) = f(λ*(A)), where λ* is the Lebesgue outer measure. Prove that μ* satisfies μ*(∅) = 0, nonnegativity, monotonicity, and σ-subadditivity.
27. Let (Ω, A, P) be a probability space, and let A ⊂ Ω with A ∉ A. Prove that P can be extended to a probability measure on A_1 := σ(A ∪ {A}).
28. Let f : R ∋ x ↦ x/3 ∈ R and A0 = [0, 1]. Prove that An+1 := f(An) ∪ (2/3 + f(An)) (n ≥ 0) is decreasing in n ≥ 0, where f(An) := {f(x) : x ∈ An}. The limit of An is denoted by C and is called the Cantor set. Prove that the Lebesgue measure of C is 0.
Chapter 2

Random Variable and


Measurable Function

Given a probability space (Ω, A , P), we define random variables and


their distribution functions as follows.
Definition 2.1.
(1) A real function ξ : Ω → R is called a random variable on (Ω, A, P) if {ω : ξ(ω) < x} ∈ A for every x ∈ R. Let i = √−1. We call ξ = η + iζ a complex random variable on (Ω, A, P) provided η and ζ are random variables.
(2) If ξ1, ..., ξn are real (complex) random variables on (Ω, A, P), then the vector-valued function ξ := (ξ1, ..., ξn) is called an n-dimensional real (complex) random variable on (Ω, A, P). A multi-dimensional random variable is also called a random vector.
(3) The distribution function of a random variable ξ := (ξ1, ..., ξn) is defined as
    F : Rn ∋ (x1, x2, ..., xn) ↦ P(ξi < xi : 1 ≤ i ≤ n).
(4) Let ξ := (ξ1, ..., ξn) and η := (η1, ..., ηn) be two random variables on (Ω, A, P). If
    P(ξi ≠ ηi) = 0, 1 ≤ i ≤ n,
    then these two random variables are called identical almost surely, denoted by ξ = η, a.s. If they have the same distribution function, we call them identically distributed.


In this chapter, we first extend the concept of random vari-


ables to measurable functions on a measurable space and then study
the construction of measurable functions and convergence theorems.
These provide a basis for the following chapter to define and study
the integral (in particular, expectation) of measurable functions
(random variables).

2.1 Measurable Function

2.1.1 Definition and properties


Let B be the Borel σ-algebra of R and R̄ = [−∞, ∞], B̄ = σ(B ∪
{∞} ∪ {−∞}). Let R̄n be the n-dimensional product space of R̄ and
B̄ n be the product σ-algebra. Similarly, we can define n-dimensional
product space C̄n of the generalized complex plane C̄ and the product
σ-algebra B̄cn .
Definition 2.2. Let (Ω, A ) and (E, E ) be two measurable spaces.
(1) A map f : Ω → E is called measurable from (Ω, A ) to (E, E )
if f −1 (B) := {ω ∈ Ω : f (ω) ∈ B} ∈ A for every B ∈ E , where
f −1 (B) is called the inverse image of B under f .
(2) A measurable map f from (Ω, A ) to (R̄, B̄) is called a mea-
surable function, denoted by f ∈ A . A measurable map from
(Ω, A ) to (R̄n , B̄ n ) is called an n-dimensional measurable func-
tion. If f1 and f2 are (n-dimensional) measurable functions, then
f := f1 + if2 is called an (n-dimensional) complex measurable
function.
In the following, we only consider real-valued measurable func-
tions, unless otherwise specified. Let C be a class of subsets of E.
Then {f −1 (B) : B ∈ C } is called the inverse image of C under
f , denoted by f −1 (C ) or σ(f ). Obviously, f : (Ω, A ) → (E, E ) is
measurable if and only if f −1 (E ) ⊂ A .
It is easy to see that taking inverse images commutes with all set operations.
Property 2.3. Let f be a map from Ω to E, and let {Bγ}γ∈Γ be a family of subsets of E. Then the following hold:
(1) f −1 (E) = Ω, f −1 (∅) = ∅.
(2) f −1 (B c ) = [f −1 (B)]c for any B ⊂ E.

(3) f−1(B1 − B2) = f−1(B1) − f−1(B2) for any B1, B2 ⊂ E.
(4) f−1(∪_{γ∈Γ} Bγ) = ∪_{γ∈Γ} f−1(Bγ).
(5) f−1(∩_{γ∈Γ} Bγ) = ∩_{γ∈Γ} f−1(Bγ).

Property 2.4. Let (E, E ) be a measurable space. Then for any map
f : Ω → E, f −1 (E ) is the smallest σ-algebra in Ω such that f is
measurable.
Property 2.5. Let C be a class of subsets of E and let f : Ω → E
be a map. Then σ(f ) := f −1 (σ(C )) = σ(f −1 (C )).
Proof. Since f−1(σ(C)) is a σ-algebra containing f−1(C), we have f−1(σ(C)) ⊃ σ(f−1(C)). So, it suffices to prove
A := {C ⊂ E : f−1(C) ∈ σ(f−1(C))} ⊃ σ(C).
In fact, we have (1) A ⊃ C; (2) f−1(E) = Ω ∈ σ(f−1(C)) ⇒ E ∈ A; (3) C ∈ A ⇒ f−1(Cc) = (f−1(C))c ∈ σ(f−1(C)) ⇒ Cc ∈ A; (4) {Cn}n≥1 ⊂ A ⇒ f−1(∪_{n=1}^∞ Cn) = ∪_{n=1}^∞ f−1(Cn) ∈ σ(f−1(C)) ⇒ ∪_{n=1}^∞ Cn ∈ A. Thus, A is a σ-algebra containing C, hence A ⊃ σ(C). □

Theorem 2.6.
(1) f is a real measurable function on (Ω, A) if and only if {f < x} ∈ A for every x ∈ R.
(2) f = (f1, ..., fn) is an n-dimensional measurable function on (Ω, A) if and only if fk is a real measurable function on (Ω, A) for 1 ≤ k ≤ n.

Proof. (1) The necessity is obvious. To prove the sufficiency, let S = {[−∞, x) : x ∈ R}. Then σ(S) = B̄, so by Property 2.5, we have f−1(B̄) = f−1(σ(S)) = σ(f−1(S)) ⊂ σ(A) = A. Thus, f is a measurable function on (Ω, A).
(2) Let f be measurable. Then for any 1 ≤ k ≤ n and Ak ∈ B̄, we have {fk ∈ Ak} = {f ∈ R̄ × · · · × Ak × · · · × R̄} ∈ A. So, fk is measurable for any 1 ≤ k ≤ n. On the other hand, let fk be measurable for any 1 ≤ k ≤ n. To prove the measurability of f, we take S = {{fk < r} : 1 ≤ k ≤ n, r ∈ R}. Since B̄n = σ({{x : xk < r} : 1 ≤ k ≤ n, r ∈ R}), by Property 2.5, we have f−1(B̄n) = σ(S). Combining this with the fact that the measurability of fk (1 ≤ k ≤ n) implies S ⊂ A, we obtain f−1(B̄n) ⊂ A, which means that f is measurable. □
Theorem 2.7. Let (Ωi, Ai), i = 1, 2, 3, be measurable spaces, and let f : (Ω1, A1) → (Ω2, A2) and g : (Ω2, A2) → (Ω3, A3) be measurable maps. Then g ◦ f is a measurable map from (Ω1, A1) to (Ω3, A3).
Proof. It follows from (g ◦ f )−1 (B) = f −1 (g−1 (B)) immediately.

By Theorem 2.6(1) and Definition 2.1, a random variable is nothing but a finite measurable function on the probability space, while Theorem 2.6(2) shows that a vector-valued function is measurable if and only if each component is measurable. Theorem 2.7 says that the composition of measurable maps remains measurable.
Corollary 2.8.
(1) Let g be a real (complex) measurable function on (R̄n , B̄ n )
and f1 , . . . , fn be real measurable functions on (Ω, A ). Then
g(f1 , . . . , fn ) is a real (complex) measurable function on (Ω, A ).
(2) Let g be a real (complex) measurable function on (C̄n, B̄cn) and f1, ..., fn be complex measurable functions on (Ω, A). Then g(f1, ..., fn) is a real (complex) measurable function on (Ω, A).
Corollary 2.9.
(1) Let g be a real (complex) measurable function on (R̄n , B̄ n )
and f1 , . . . , fn be real random variables on (Ω, A , P). If
P(|g(f1 , . . . , fn )| = ∞) = 0, then g(f1 , . . . , fn ) is a real (com-
plex) random variable on (Ω, A , P).
(2) Let g be a real (complex) measurable function on (C̄n , B̄cn )
and f1 , . . . , fn be complex random variables on (Ω, A , P). If
|g(f1 , . . . , fn )| < ∞, then g(f1 , . . . , fn ) is a complex random vari-
able on (Ω, A , P).

2.1.2 Construction of measurable function


We first recall indicator functions, which are in one-to-one correspondence with measurable sets, and then use their combinations and limits to construct all measurable functions. This construction is fundamental for the definition of integrals, where the integral of a measurable function is regarded as the measure of the function, so that it is natural to identify the integral of an indicator function with the measure of the corresponding set.
Definition 2.10.
(1) ∀A ⊂ Ω, its indicator function is defined by
    1A(ω) = 1 if ω ∈ A, and 1A(ω) = 0 otherwise.
(2) Let {Ak}1≤k≤n be a finite measurable partition of Ω, i.e. the Ak are mutually disjoint sets in A such that Ω = ∪_{k=1}^n Ak. Then for any a1, ..., an ∈ R, f := Σ_{k=1}^n ak 1Ak is called a simple function.
(3) If we take n = ∞ in (2) above, then f is called an elementary
function.
Property 2.11.
(1) 1A is a measurable function on (Ω, A ) if and only if A ∈ A .
(2) Simple functions and elementary functions are all measurable
functions on (Ω, A ).
Proof. Let f = Σ_{k=1}^n ak 1Ak be a simple (or elementary if n = ∞) function. Then ∀B ∈ B̄, we have f−1(B) = ∪_{k: ak∈B} Ak ∈ A. □

Theorem 2.12.
(1) A measurable function is the pointwise limit of a sequence of
simple functions.
(2) A measurable function is the uniform limit of a sequence of ele-
mentary functions.
(3) A bounded measurable function is the uniform limit of a sequence
of simple functions.
(4) A nonnegative measurable function is the (uniform) limit of a
sequence of increasing simple (elementary) functions.

Proof. (1) For n ≥ 1 and ω ∈ Ω, let
fn(ω) = Σ_{k=−n2^n}^{n2^n−1} (k/2^n) 1{k/2^n ≤ f(ω) < (k+1)/2^n} + n 1{f(ω) ≥ n} − n 1{f(ω) < −n}.
Then the fn are simple functions and |fn − f| 1{−n ≤ f < n} < 2^{−n}; when f = ∞, fn = n; when f = −∞, fn = −n. Thus, the sequence {fn}n≥1 converges pointwise to f.
(2) For any n ∈ N, let
fn = Σ_{k=−∞}^∞ (k/2^n) 1{k/2^n ≤ f < (k+1)/2^n} + ∞ · 1{f=∞} − ∞ · 1{f=−∞}.
Then {fn}n≥1 are elementary functions such that for any n ≥ 1,
|fn − f| 1{|f|<∞} < 2^{−n}; fn = f when |f| = ∞.
Thus, fn converges uniformly to f as n → ∞.
(3) If f is bounded, then by (1), the sequence {fn}n≥1 of simple functions converges uniformly to f.
(4) If f is nonnegative, then the sequences {fn}n≥1 constructed in (1) and (2) are increasing. □
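The dyadic construction in part (1) is easy to make concrete. The following Python sketch (an added illustration, not part of the original text) evaluates fn at a fixed value f(ω): on {−n ≤ f < n} it rounds f down to the grid of mesh 2^{−n}, and it truncates at the levels ±n outside this range.

import math

def dyadic_approx(f_val, n):
    # The simple function f_n of Theorem 2.12(1) evaluated at a point:
    # truncate at the levels n and -n, and round down to the 2^{-n} grid.
    if f_val >= n:
        return float(n)
    if f_val < -n:
        return float(-n)
    return math.floor(f_val * 2**n) / 2**n

# |f_n - f| < 2^{-n} on {-n <= f < n}, so f_n converges pointwise to f;
# for nonnegative f the approximations increase in n.
x = math.pi  # a sample value f(omega), assumed finite
for n in (1, 2, 4, 8, 16):
    print(n, dyadic_approx(x, n), abs(dyadic_approx(x, n) - x) < 2**-n)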
Let f be a real function on Ω. Define f+ = max{f, 0} and f− = max{−f, 0}, which are called the positive and the negative parts of f, respectively. Then f = f+ − f−, |f| = f+ + f−, f+ = (|f| + f)/2, f− = (|f| − f)/2.
Theorem 2.13. The positive part and the negative part of a mea-
surable function are measurable. So, any measurable function can be
expressed as the difference of two nonnegative measurable functions.

2.1.3 Operations of measurable functions


Proposition 2.14. Let {fn}n≥1 be a sequence of real functions on Ω.
(1) The upper limit, lower limit, supremum and infimum of {fn}n≥1 always exist, and
    liminf_{n→∞} fn = lim_{n→∞} inf_{k≥n} fk = sup_n inf_{k≥n} fk,
    limsup_{n→∞} fn = lim_{n→∞} sup_{k≥n} fk = inf_n sup_{k≥n} fk.

(2) The limit of {fn}n≥1, lim_{n→∞} fn, exists if and only if
    ∀ω ∈ Ω, liminf_{n→∞} fn(ω) = limsup_{n→∞} fn(ω).
    In this case, we write fn → f as n → ∞.

A sequence of complex functions fn := gn + i hn, n ≥ 1, is said to converge to f := g + i h if gn → g and hn → h as n → ∞.

Theorem 2.15. Let (Ω, A ) be a measurable space.


(1) Let {fn}n≥1 be a sequence of real measurable functions on (Ω, A). Then
    sup_{n≥1} fn, inf_{n≥1} fn, limsup_{n→∞} fn, liminf_{n→∞} fn
    are measurable as well.
(2) Let {fn}n≥1 be a sequence of complex measurable functions on (Ω, A). If lim_{n→∞} fn exists, then it is measurable.

Proof. Note that ∀x ∈ R,
{inf_{n≥1} fn < x} = ∪_{n≥1} {fn < x} ∈ A.
So, inf_{n≥1} fn is measurable. Since sup_{n≥1} fn = −inf_{n≥1}(−fn), sup_{n≥1} fn is measurable.
Finally, for x ∈ R,
{limsup_{n→∞} fn < x} = ∪_{m=1}^∞ ∪_{n=1}^∞ ∩_{k≥n} {fk < x − 1/m}.
Thus, limsup_{n→∞} fn is measurable. As liminf_{n→∞} fn = −limsup_{n→∞}(−fn), liminf_{n→∞} fn is measurable. □

Theorem 2.16. Let g be a continuous function on D ⊂ R̄n . Then


g is a measurable function on (D, D ∩ B̄ n ). The assertion remains
true for C̄n in place of R̄n .

Proof. For simplicity, assume g is real. ∀m ≥ 1, R̄n is divided into countably many disjoint cubes with side length 1/2^m:
A_{j1,...,jn} = [j1/2^m, (j1+1)/2^m) × · · · × [jn/2^m, (jn+1)/2^m), j1, ..., jn ∈ Z ∪ {±∞},
where for j = −∞ or +∞, by convention, [j/2^m, (j+1)/2^m) is set to be {−∞} or {+∞}, respectively. Rearranging these cubes, we denote them by {A_i^m : i, m ∈ N}. Given x_i^m ∈ A_i^m ∩ D (whenever this intersection is nonempty), define
gm(x) = Σ_{i=1}^∞ 1_{A_i^m ∩ D}(x) g(x_i^m).
Then gm is measurable, and the continuity of g implies that gm → g as m → ∞, so g is measurable. □

Theorem 2.17. Let D ⊂ C̄n and f1 , . . . , fn be measurable functions


on (Ω, A ) such that

(f1 , . . . , fn )(Ω) ⊂ D.

If g is a measurable function on D, then g(f1 , . . . , fn ) is measurable.


Proof. Simply note that the composition of measurable functions
is measurable. 

Corollary 2.18. The sum, difference, product and quotient of mea-


surable functions are measurable (if the operations make sense).
Corollary 2.19. Let ξ1, ..., ξn be (complex) random variables on (Ω, A, P) and let g be a finite continuous function on Rn (Cn). Then g(ξ1, ..., ξn) is a (complex) random variable. In particular, the sum, difference, product and quotient of (complex) random variables are (complex) random variables (if the operations make sense).

2.1.4 Monotone class theorem for functions


Definition 2.20. Let 𝓛 be a family of functions on Ω such that f ∈ 𝓛 ⇒ f+, f− ∈ 𝓛. A family L of functions on Ω is called an 𝓛-system if
(1) 1 ∈ L;
(2) L is closed under linear combinations;
(3) for any nonnegative increasing sequence {fn}n≥1 ⊂ L with fn ↑ f, if either f is bounded or f ∈ 𝓛, then f ∈ L.

Theorem 2.21 (Monotone class theorem for functions). Let L be an 𝓛-system. If L contains the indicator functions of all elements of a π-system C, then L contains all real σ(C)-measurable functions in 𝓛.

Proof. Let Λ = {A : 1A ∈ L}. Then Ω ∈ Λ, and Λ is closed under proper differences and increasing unions. So, Λ is a λ-system. Since Λ ⊃ C and C is a π-system, by the monotone class theorem, we have Λ ⊃ σ(C). From this and Definition 2.20(2), it follows that L contains all σ(C)-measurable simple functions. Let f ∈ 𝓛 be σ(C)-measurable. Then f+, f− ∈ 𝓛 are σ(C)-measurable, so that there exists a sequence of simple functions fn ↑ f+. From this and Definition 2.20(3), it follows that f+ ∈ L. Similarly, f− ∈ L. Thus, f = f+ − f− ∈ L by Definition 2.20(2). □

The monotone class theorem for functions is used to prove that a family F of functions has a certain property A0. To this end, we first choose a class of functions 𝓛 ⊃ F such that L := {f ∈ 𝓛 : f has property A0} is an 𝓛-system, then introduce a π-system C such that the indicator functions of all elements of C are contained in L, and finally verify that F is included in the family of all σ(C)-measurable functions in 𝓛. Then, by Theorem 2.21, all functions in F have property A0.
The following theorem is an example to illustrate this procedure.

Theorem 2.22. Let (E, E ) be a measurable space, and let σ(f ) =


f −1 (E ) for a map f : Ω → E. Then ϕ : Ω → R̄ is σ(f )-measurable
if and only if there exists an (E, E )-measurable function g such that
ϕ = g ◦ f . If ϕ is finite (bounded), then one can take finite (bounded)
g as well.

Proof. The sufficiency follows from the fact that the composition of measurable functions is measurable.

To prove the necessity, we choose 𝓛 to be the class of all σ(f)-measurable functions on Ω, and let L = {g ◦ f : g ∈ E}. Then L is an 𝓛-system such that the following hold:

(1) 1Ω = 1E ◦ f ∈ L.
(2) ∀g1 ◦ f, g2 ◦ f ∈ L and a1 , a2 ∈ R such that a1 (g1 ◦ f ) + a2 (g2 ◦ f )
makes sense, we have

a1 g1 ◦ f + a2 g2 ◦ f = [(a1 g1 + a2 g2 )1A ] ◦ f,

where A = {x ∈ E : a1 g1 (x) + a2 g2 (x) exists}. Thus, a1 g1 ◦ f +


a2 g2 ◦ f ∈ L.
(3) If ϕn ∈ L, ϕn ↑ ϕ, then ∀n ≥ 1, ∃gn ∈ E such that ϕn = gn ◦ f. Let g = sup_{n≥1} gn. Then g ∈ E and ϕ = g ◦ f, so ϕ ∈ L.

If C ∈ σ(f), then there exists B ∈ E such that C = f−1(B), so that 1C = 1B ◦ f. Thus, L contains the indicator functions of all elements of σ(f). By Theorem 2.21, L includes all σ(f)-measurable functions. This proves the first assertion.
If ϕ is finite (bounded) and ϕ = g ◦ f, then we can replace g by g1{|g| ≤ ‖ϕ‖∞} (resp. g1{|g| < ∞}), so that the second assertion holds true. □
Corollary 2.23. Let f be an n-dimensional real function on Ω. Then
ϕ is f −1 (B̄ n )-measurable if and only if there exists a measurable
function g on (R̄n , B̄ n ) such that ϕ = g ◦ f .
Theorem 2.24. Let 𝓛 be the family of all real functions on R̄n and L be an 𝓛-system on R̄n containing all bounded continuous functions. Then L contains all Borel measurable real functions.
Proof. Let S = {A : A is an open interval in R̄n}. Then S is a π-system and σ(S) = B̄n. For A ∈ S, set d(x, Ac) = inf{|x − y| : y ∈ Ac}. ∀m ≥ 1, let
fm(x) = 0 if x ∉ A; fm(x) = 1 if x ∈ A and d(x, Ac) > 1/m; fm(x) = m·d(x, Ac) if x ∈ A and d(x, Ac) ≤ 1/m.
Then fm is continuous and fm ↑ 1A, so 1A ∈ L. Now, the assertion follows from Theorem 2.21. □

2.2 Distribution Function and Law

For a real function F on Rn and a, b ∈ Rn with a ≤ b, the difference Δb,a F of F on the interval [a, b) is defined by Δb,a F := F(b) − F(a) when n = 1, and Δb,a F := Δb1,a1 Δb2,a2 · · · Δbn,an F when n ≥ 2 and a = (a1, ..., an), b = (b1, ..., bn), where Δbi,ai (1 ≤ i ≤ n) is the difference in the ith component.
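For computations, Δb,a F expands by inclusion–exclusion into a signed sum of the values of F at the 2^n vertices of [a, b): each vertex chooses ak or bk in the kth coordinate and carries the sign (−1)^{#{k : the kth choice is ak}}. The following Python sketch (an added illustration, not part of the original text) implements this expansion and checks it on F(x1, ..., xn) = x1 · · · xn from Example 2.28 below, for which Δb,a F = ∏(bk − ak).

from itertools import product

def delta(F, a, b):
    # Inclusion-exclusion over the vertices of [a, b):
    # the sign is (-1)^(number of coordinates taken from a).
    n = len(a)
    total = 0.0
    for choice in product((0, 1), repeat=n):
        vertex = [b[k] if choice[k] else a[k] for k in range(n)]
        total += (-1) ** (n - sum(choice)) * F(vertex)
    return total

F = lambda x: x[0] * x[1] * x[2]          # a distribution function on R^3
a, b = [0.0, 0.5, 1.0], [1.0, 1.5, 3.0]
print(delta(F, a, b))                      # (1-0)*(1.5-0.5)*(3-1) = 2.0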
We have the following characterization on the distribution func-
tion of a random variable.

Theorem 2.25. Let F be an n-dimensional real function. It is the


distribution function of an n-dimensional random variable if and only
if the following four items hold:

(a) F is increasing and Δb,a F ≥ 0 for a ≤ b;
(b) F is left-continuous;
(c) F(x) → 0 if xi → −∞ for some 1 ≤ i ≤ n;
(d) F(∞, ∞, ..., ∞) := lim_{n→∞} F(n, ..., n) = 1.

The necessity is obvious. So, for a function F satisfying (a)–(d),


we only need to construct an n-dimensional random variable ξ on a
probability space (Ω, A , P) such that F is its distribution function.
In the following, we prove a more general result for F not necessarily
having properties (c) and (d). For this, we introduce a general notion
of distribution functions.

Definition 2.26. A left-continuous finite real function F on Rn is called a distribution function if it has nonnegative differences, i.e. Δb,a F ≥ 0 for any a, b ∈ Rn with a ≤ b. In particular, F is called a probability distribution function if it satisfies (a)–(d) in Theorem 2.25.

Theorem 2.27. Let F be a distribution function on Rn. Then there exists a unique measure μF on Bn such that μF([a, b)) = Δb,a F, a ≤ b. The completion of μF, denoted again by μF, is called the Lebesgue–Stieltjes (L–S) measure generated by F.

Proof. Write [a, b) = ∏_{k=1}^n [ak, bk) for a = (a1, ..., an) ≤ b = (b1, ..., bn), where [ak, bk) is understood as (−∞, bk) when ak = −∞.

Let
C = {[a, b) : ak ≤ bk, ak ∈ [−∞, +∞), bk ∈ (−∞, +∞], 1 ≤ k ≤ n}.
It is clear that C is a semi-algebra in Rn and σ(C) = Bn.
Define a function on C by μF([a, b)) = Δb,a F, a ≤ b. When a component of b or a is ±∞, μF([a, b)) is understood as the limit as this component tends to ±∞. It is easy to check that μF is finitely additive. Since μF takes finite values on finite intervals, it is σ-finite. To prove that μF is σ-additive, let A ∈ C and let {Ak}k≥1 ⊂ C be mutually disjoint with ∪_{k=1}^∞ Ak = A. It suffices to verify that Σ_{k=1}^∞ μF(Ak) = μF(A).
Let μF be uniquely extended to a finitely additive measure on the algebra F(C) generated by C, denoted again by μF. Then
Σ_{k=1}^n μF(Ak) = μF(∪_{k=1}^n Ak) ≤ μF(A).
Letting n → ∞, we obtain Σ_{k=1}^∞ μF(Ak) ≤ μF(A).

It remains to prove that Σ_{k=1}^∞ μF(Ak) ≥ μF(A). By an approximation argument, we may assume that A is a finite interval, i.e. first use A(N) = A ∩ [−N, N)^n and Ak(N) = Ak ∩ [−N, N)^n in place of A and Ak respectively for N ∈ N, and then let N ↑ ∞.
Now, let A = [a, b) and Ak = [a(k), b(k)) with a, b, a(k), b(k) ∈ Rn such that a ≤ b, a(k) ≤ b(k) (k ≥ 1) and ∪_{k=1}^∞ Ak = A. By the left continuity of F, ∀ε > 0, ∃δ > 0 such that
μF(A) − ε < μF([a, b − δ)),
where δ = (δ, ..., δ). Moreover, for each k ≥ 1, there exists δ(k) > 0 such that
μF([a(k) − δ(k), b(k))) ≤ μF(Ak) + ε/2^k.

Since
[a, b − δ] ⊂ [a, b) = ∪_{k=1}^∞ [a(k), b(k)) ⊂ ∪_{k=1}^∞ (a(k) − δ(k), b(k)),
by the finite cover theorem, we find a natural number N ≥ 1 such that
[a, b − δ] ⊂ ∪_{k=1}^N (a(k) − δ(k), b(k)).
Thus,
μF([a, b)) ≤ μF([a, b − δ)) + ε ≤ ε + Σ_{k=1}^N μF([a(k) − δ(k), b(k))) ≤ 2ε + Σ_{k=1}^∞ μF(Ak).
Letting ε ↓ 0, we obtain Σ_{k=1}^∞ μF(Ak) ≥ μF(A).
So far, we have proved that μF is a σ-finite measure on the semi-algebra C. The proof is then finished by the measure extension theorem (Theorem 1.42). □
It is clear that the L–S measure induced by a distribution function is finite on compact sets. Such a measure is called a Radon measure. Indeed, the converse of Theorem 2.27 also holds: a Radon measure must be the L–S measure induced by some distribution function; see Exercise 6 at the end of this chapter.
Proof of Theorem 2.25. Let μF be the measure induced by F on Bn. By (c) and (d), μF is a probability measure. On the probability space (Ω, A, P) = (Rn, Bn, μF), define the random variable ξ(x) := x. Then ξ is an n-dimensional random variable such that P(ξ < x) = μF((−∞, x)) = F(x). □
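The canonical construction above takes (Ω, A, P) = (Rn, Bn, μF) and ξ(x) = x. In one dimension there is an alternative concrete realization via quantiles (compare Exercise 7 in Section 2.5): if F is a continuous, strictly increasing probability distribution function with inverse F−1, then ξ := F−1(U) for U uniform on (0, 1) satisfies P(ξ < x) = P(U < F(x)) = F(x). A minimal Python sketch of this, using the exponential distribution as a hypothetical example (not part of the original text):

import math, random

def sample_from_F(F_inv, n_samples=100_000):
    # xi = F^{-1}(U), with U uniform on (0, 1), has distribution function F.
    return [F_inv(random.random()) for _ in range(n_samples)]

F = lambda x: 1.0 - math.exp(-x) if x > 0 else 0.0   # exponential(1)
F_inv = lambda u: -math.log(1.0 - u)

xs = sample_from_F(F_inv)
for x in (0.5, 1.0, 2.0):
    emp = sum(1 for v in xs if v < x) / len(xs)
    print(x, round(emp, 3), round(F(x), 3))          # empirical frequency vs F(x)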

Example 2.28. Let F (x1 , . . . , xn ) = x1 x2 . . . xn for (x1 , . . . , xn ) ∈


Rn . Then F is a distribution function on Rn and μF is the Lebesgue
measure on Rn .

Definition 2.29. Let ξ be an n-dimensional random variable. The


probability measure

(P ◦ ξ −1 )(A) := P(ξ ∈ A), A ∈ B n

is called the distribution (or law, or distribution law) of ξ.

2.3 Independent Random Variables

Let T be a nonempty set. We write S ⋐ T if S is a nonempty finite subset of T.
 
Definition 2.30. Let {ξ(t) = (ξt,1, ..., ξt,mt) : t ∈ T} be a family of random variables on (Ω, A, P). We call {ξ(t) : t ∈ T} independent if for any l ∈ N, {t1, ..., tl} ⊂ T and x(ti) ∈ R^{mti}, i = 1, ..., l, there holds
P(ξ(t1) < x(t1), ..., ξ(tl) < x(tl)) = ∏_{i=1}^l P(ξ(ti) < x(ti)).

The following properties are obvious.

Property 2.31.
(1) {ξ(t) : t ∈ T} are independent if and only if {ξ(t) : t ∈ T'} are independent for every T' ⋐ T.
(2) Let T = ∪_{r∈I} Tr with the Tr mutually disjoint and |Tr| < ∞. Set ξ̄(r) = (ξ(t) : t ∈ Tr). If {ξ(t) : t ∈ T} are independent, then {ξ̄(r) : r ∈ I} are independent as well.

Due to Property 2.31, we only need to study the independence of finitely many random variables.

Theorem 2.32. Random variables {ξ(k)}1≤k≤n are independent if and only if ∀B(mk) ∈ B^{mk},
P(∩_{k=1}^n {ξ(k) ∈ B(mk)}) = ∏_{k=1}^n P(ξ(k) ∈ B(mk)).

Proof. The sufficiency is obvious. By induction, we only need to prove the necessity for n = 2. Indeed, by Property 2.31(2) and the necessity for n = 2, we obtain, for n = k + 1,
P(∩_{i=1}^{k+1} {ξ(i) ∈ B(mi)}) = P(∩_{i=1}^k {ξ(i) ∈ B(mi)}) P(ξ(k+1) ∈ B(mk+1)).
This implies the desired assertion for n = k + 1 by using that for n = k.
Now, let n = 2. We prove the necessity by using the monotone
class theorem in two steps:

(1) Let Sk = {(−∞, bk ) : bk ∈ Rmk } for k = 1, 2. Then Sk is a


π-system of Rmk and σ(Sk ) = B mk . Given (−∞, b) ∈ B m2 , let
      
C1 = A1 ∈ B m1 : P ξ (1) ∈ A1 , ξ (2) < b = P ξ (1) ∈ A1 P ξ (2) < b .

Then C1 ⊃ S1 . Now, we prove C1 is a λ-system.


Obviously, the whole space R^{m1} ∈ C1. If A(n) ∈ C1 and A(n) ↑ A1, then {ξ(1) ∈ A(n)} ↑ {ξ(1) ∈ A1}. By the continuity of probability, we have A1 ∈ C1. Moreover, if A'1 ⊃ A''1 and A'1, A''1 ∈ C1, then
P(ξ(1) ∈ A'1 − A''1, ξ(2) < b)
= P(ξ(1) ∈ A'1, ξ(2) < b) − P(ξ(1) ∈ A''1, ξ(2) < b)
= P(ξ(1) ∈ A'1) P(ξ(2) < b) − P(ξ(1) ∈ A''1) P(ξ(2) < b)
= P(ξ(1) ∈ A'1 − A''1) P(ξ(2) < b).
Thus, A'1 − A''1 ∈ C1, so C1 is a λ-system. It follows from the monotone class theorem that C1 ⊃ B^{m1}.
(2) ∀A1 ∈ B^{m1}, let
C2 = {A2 ∈ B^{m2} : P(ξ(1) ∈ A1, ξ(2) ∈ A2) = P(ξ(1) ∈ A1) P(ξ(2) ∈ A2)}.
Then C2 ⊃ S2 by (1). A similar argument shows that C2 is a λ-system. Therefore, the proof is completed by the monotone class theorem. □


As a consequence of Theorem 2.32, the following result says that functions of independent random variables are also independent.

Corollary 2.33. Assume {ξ(k) : 1 ≤ k ≤ n} are independent. Let fk : R^{mk} → R^{mk} be finite Borel measurable functions. Then {fk(ξ(k)) : 1 ≤ k ≤ n} are independent.

Proof. ∀Ak ∈ B^{mk}, we have fk−1(Ak) ∈ B^{mk}. Then
P(∩_{k=1}^n {fk(ξ(k)) ∈ Ak}) = P(∩_{k=1}^n {ξ(k) ∈ fk−1(Ak)}) = ∏_{k=1}^n P(ξ(k) ∈ fk−1(Ak)) = ∏_{k=1}^n P(fk(ξ(k)) ∈ Ak). □

 
Corollary 2.34. {ξ(k) : 1 ≤ k ≤ n} are independent if and only if the distribution function of (ξ(1), ..., ξ(n)) can be expressed as
F(x(1), ..., x(n)) = ∏_{k=1}^n Fk(x(k))
for some real functions Fk on R^{mk}, where mk is the dimension of ξ(k), 1 ≤ k ≤ n.

Proof. The necessity is obvious. To prove the sufficiency, we may assume that {Fk}1≤k≤n are nonnegative; otherwise, simply replace Fk by |Fk|. Since
P(ξ(k) < x(k)) = P(ξ(k) < x(k), ξ(i) < ∞, i ≠ k) = F(∞, ..., x(k), ..., ∞) = Fk(x(k)) ∏_{i≠k} Fi(∞),
by letting x(k) → ∞, we derive ∏_{i=1}^n Fi(∞) = 1, and hence the distribution function of ξ(k) is given by Fk(x(k))/Fk(∞). Thus,
F(x(1), ..., x(n)) = ∏_{k=1}^n Fk(x(k)) = ∏_{k=1}^n [Fk(x(k))/Fk(∞)],
which implies the independence of {ξ(1), ..., ξ(n)} by definition. □
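Corollary 2.34 can be observed empirically: for independent coordinates, the joint empirical distribution function approximately factorizes into the marginals. The sketch below (Python, an added illustration with hypothetical sample sizes and evaluation points) compares the joint frequency with the product of marginal frequencies for two independent uniform variables.

import random

N = 200_000
pairs = [(random.random(), random.random()) for _ in range(N)]  # independent U(0,1)

x, y = 0.3, 0.7
joint = sum(1 for u, v in pairs if u < x and v < y) / N   # estimates F(x, y)
m1 = sum(1 for u, _ in pairs if u < x) / N                # estimates F1(x)
m2 = sum(1 for _, v in pairs if v < y) / N                # estimates F2(y)
print(round(joint, 3), round(m1 * m2, 3))                 # both close to 0.21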

2.4 Convergence of Measurable Functions

Let (Ω, A, μ) be a complete measure space. If some relationship holds outside a μ-null set, we say that it holds μ-almost everywhere, denoted by μ-a.e., or simply a.e. if there is no confusion; the null set involved is called an exceptional set. In this section, all measurable functions are a.e. finite.

2.4.1 Almost everywhere convergence


Definition 2.35. Let {fn} be a sequence of measurable functions and f be a measurable function. We say that {fn} converges almost everywhere to f, denoted by fn → f a.e., if there exists N ∈ A with μ(N) = 0 such that fn(ω) → f(ω) as n → ∞ for every ω ∉ N.
The sequence {fn} is called mutually convergent almost everywhere, denoted by fn − fm → 0 a.e., if there exists a null set N such that ∀ω ∉ N, fn(ω) − fm(ω) → 0 as n, m → ∞.
Obviously, fn − fm → 0 a.e. if and only if fn+m − fn → 0 a.e. (n → ∞) uniformly in m ≥ 1.
Property 2.36.
(1) If fn → f a.e., then every subsequence {fnk} satisfies fnk → f a.e.
(2) If fn → f a.e. and fn → f' a.e., then f = f' a.e.
(3) If fn → f a.e. and gn = fn a.e., g = f a.e., then gn → g a.e.
(4) If fn(k) → f(k) a.e., k = 1, ..., m, and g is a continuous function on R̄m, then g(fn(1), ..., fn(m)) → g(f(1), ..., f(m)) a.e.

Theorem 2.37. Let {fn} be a sequence of finite measurable functions. Then there exists a finite measurable function f such that fn → f a.e. if and only if {fn} mutually converges almost everywhere.
Proof. If fn → f a.e., then there exists a null set N such that fn(ω) → f(ω) for ω ∉ N, so ∀ω ∉ N, {fn(ω)}n≥1 is a Cauchy sequence, that is, fn(ω) − fm(ω) → 0 as n, m → ∞. Thus, {fn}n≥1 mutually converges almost everywhere.
Conversely, if {fn}n≥1 mutually converges almost everywhere, then there exists a null set N such that ∀ω ∉ N, {fn(ω)}n≥1 is a Cauchy sequence, so it has a finite limit, denoted by f(ω). For ω ∈ N, set f(ω) = 0. Since N is measurable due to the completeness of the measure space, and since the limit of measurable functions is measurable, we conclude that f is measurable and fn → f a.e. □
The following theorem follows from Definition 2.35 immediately.
Theorem 2.38. Let f, fn, n ≥ 1, be finite measurable functions.
(1) fn → f a.e. if and only if
    ∀ε > 0, μ(∩_{n=1}^∞ ∪_{m=n}^∞ {|fm − f| ≥ ε}) = 0.
    In particular, when μ is finite, fn → f a.e. if and only if
    ∀ε > 0, μ(∪_{m=n}^∞ {|fm − f| ≥ ε}) → 0 (n → ∞).
(2) fn − fm → 0 a.e. if and only if
    ∀ε > 0, μ(∩_{n=1}^∞ ∪_{v=1}^∞ {|fn+v − fn| ≥ ε}) = 0.
    In particular, when μ is finite, fn − fm → 0 a.e. if and only if
    ∀ε > 0, μ(∪_{v=1}^∞ {|fn+v − fn| ≥ ε}) → 0 (n → ∞).

2.4.2 Convergence in measure


Definition 2.39. A sequence {fn}n≥1 of finite measurable functions is said to converge in measure μ to a measurable function f, denoted by fn →μ f, if ∀ε > 0, μ(|fn − f| ≥ ε) → 0 (n → ∞).

We call {fn}n≥1 mutually convergent in measure μ, denoted by fn+v − fn →μ 0, if ∀ε > 0,
sup_{v≥1} μ(|fn+v − fn| ≥ ε) → 0, n → ∞.
Clearly, if fn →μ f, then f is finite a.e. The following properties are obvious.

Property 2.40.
(1) If fn →μ f, then any subsequence fnk →μ f.
(2) If fn →μ f and fn →μ f', then f = f' a.e.
(3) If fn →μ f and gn = fn a.e., g = f a.e., then gn →μ g.

Theorem 2.41. Let f, fn : Ω → Rm be measurable and let D ⊃ f(Ω), D ⊃ ∪_{n=1}^∞ fn(Ω). If g : D → R is uniformly continuous and fn →μ f, then g(fn) →μ g(f).

Proof. ∀ε > 0, ∃δ > 0 such that |g(x) − g(y)| < ε whenever x, y ∈ D and |x − y| < δ. Then {|g(fn) − g(f)| ≥ ε} ⊂ {|fn − f| ≥ δ}. □
Corollary 2.42. If fn →μ f and gn →μ g, then fn + gn →μ f + g.
→ f and gn −
→ g, then fn + gn −
→ f + g.

Theorem 2.43. In Theorem 2.41, if μ is a finite measure on (Ω, A )


and D is an open set, then g can be replaced by a continuous function.

Proof. Let
DN = {x ∈ Rm : |x| ≤ N, d(x, Dc) ≥ 1/N}, with the convention d(x, ∅) = ∞.
Then DN is a bounded closed set (as d(·, Dc) is continuous) and d(DN, DcN+1) ≥ 1/(N(N+1)). When N ↑ ∞, DN ↑ D, so that μ(f−1(D\DN)) ↓ 0. Since g is uniformly continuous on DN+1, ∀ε ∈ (0, 1), there exists δN > 0 such that

|g(x) − g(y)| < ε whenever x, y ∈ DN+1 and |x − y| < δN. Thus,
An := {|g(fn) − g(f)| ≥ ε} ⊂ (An ∩ {fn, f ∈ DN+1}) ∪ {f ∉ DN} ∪ {|fn − f| ≥ 1/(N(N+1))} ⊂ {|fn − f| ≥ cN} ∪ {f ∉ DN},
where cN := min{δN, 1/(N(N+1))}. But limsup_{n→∞} μ(An) ≤ 0 + μ(f−1(D\DN)). Letting N ↑ ∞, we derive
lim_{n→∞} μ(|g(fn) − g(f)| ≥ ε) = 0. □

Finally, we illustrate the relationship between the a.e. convergence


and the convergence in measure.

Theorem 2.44. Let {fn}n≥1 be a sequence of finite measurable functions.
(1) If fn →μ f, then there exists a subsequence {fnk} such that fnk → f a.e.
(2) If fn+v − fn →μ 0, then there exist a subsequence {fnk} and a finite measurable function f such that fnk → f a.e. and fn →μ f.
(3) If μ is a finite measure, then fn → f a.e. implies fn →μ f.

Proof. (1) ∀k ≥ 1, ∃nk ↑ ∞ such that μ(|fnk − f| ≥ 2^{−k}) < 2^{−k}, k ≥ 1. Let f'k = fnk. Then μ(|f'k − f| ≥ 2^{−k}) < 2^{−k}, k ≥ 1. Thus, ∀ε > 0 and k' ≥ 1 with 2^{−k'} ≤ ε, we have
μ(∩_{k=1}^∞ ∪_{v=1}^∞ {|f'k+v − f| ≥ ε}) ≤ Σ_{v=1}^∞ μ(|f'k'+v − f| ≥ ε) ≤ Σ_{v=1}^∞ 2^{−(k'+v)} = 2^{−k'}.
Letting k' ↑ ∞, we derive fnk → f a.e. by Theorem 2.38(1).

(2) As in (1), we take nk ↑ ∞ such that
sup_{v≥1} μ(|f_{nk+v} − f_{nk}| ≥ 2^{−k}) < 2^{−k}, k ≥ 1.
Write f'k = fnk; then μ(|f'k+1 − f'k| ≥ 2^{−k}) < 2^{−k}. For ∀ε > 0 and k' ≥ 1 with 2^{1−k'} ≤ ε, we have Σ_{l=k'}^∞ 2^{−l} ≤ ε, so that
∪_{v=1}^∞ {|f'k'+v − f'k'| ≥ ε} ⊂ ∪_{l=k'}^∞ {|f'l+1 − f'l| ≥ 2^{−l}}.
Thus,
μ(∩_{k=1}^∞ ∪_{v=1}^∞ {|f'k+v − f'k| ≥ ε}) ≤ Σ_{l=k'}^∞ μ(|f'l+1 − f'l| ≥ 2^{−l}) ≤ Σ_{l=k'}^∞ 2^{−l} = 2^{1−k'}.
By letting k' ↑ ∞, it follows from Theorem 2.38(2) that {fnk} converges mutually almost everywhere, so that it converges almost everywhere to some finite measurable function f.
Next, we prove fn →μ f. By fnk → f a.e., there exists a null set N such that fnk(ω) → f(ω), ∀ω ∉ N. Then
{|f'k − f| ≥ ε} ⊂ N ∪ (∪_{i=1}^∞ {|f'k+i − f'k+i−1| ≥ 2^{−i}ε}),
which implies that for ε ≥ 2^{1−k},
μ(|fn − f| ≥ 2ε) ≤ μ(|f'k − f| ≥ ε) + μ(|fn − fnk| ≥ ε) ≤ Σ_{i=1}^∞ μ(|f'k+i − f'k+i−1| ≥ 2^{−(k+i−1)}) + μ(|fn − fnk| ≥ ε) ≤ 2^{1−k} + μ(|fn − fnk| ≥ ε).
Since {fn} mutually converges in measure, letting n, nk → ∞ and then k → ∞ yields fn →μ f.
(3) Let μ be a finite measure and fn → f a.e. Then
μ(|fn − f| ≥ ε) ≤ μ(∪_{m=n}^∞ {|fm − f| ≥ ε}).

Combining this with fn → f a.e. and the continuity from above of the measure, we obtain
lim_{n→∞} μ(|fn − f| ≥ ε) ≤ μ(∩_{n=1}^∞ ∪_{m=n}^∞ {|fm − f| ≥ ε}) = 0. □

Theorem 2.45. There exists a finite measurable function f such that fn →μ f if and only if fn+v − fn →μ 0.

Proof. The necessity follows from the triangle inequality. We now prove the sufficiency.
Let fn+v − fn →μ 0. By Theorem 2.44(2), there exist a subsequence {fnk} and a measurable function f such that fnk →μ f. Then
lim_{k→∞} μ(|fk − f| ≥ ε) ≤ lim_{k→∞} μ(|fk − fnk| ≥ ε/2) + lim_{k→∞} μ(|fnk − f| ≥ ε/2) = 0. □
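A standard example separating the modes of convergence in Theorems 2.44 and 2.45 is the "typewriter" sequence on ([0, 1], B([0, 1]), Lebesgue): fn is the indicator of the nth dyadic interval, the intervals sweeping [0, 1) repeatedly with lengths ↓ 0. Then μ(|fn| ≥ ε) → 0, so fn →μ 0, yet fn(x) converges at no point, while the subsequence f_{2^k} = 1_{[0, 2^{−k})} does converge a.e., as Theorem 2.44(1) predicts. A Python sketch of this example (an added illustration, not part of the original text):

def dyadic_interval(n):
    # For n = 2^k + j with 0 <= j < 2^k, the n-th set is [j/2^k, (j+1)/2^k).
    k = n.bit_length() - 1
    j = n - 2**k
    return j / 2**k, (j + 1) / 2**k

def f(n, x):
    a, b = dyadic_interval(n)
    return 1.0 if a <= x < b else 0.0

# Lengths 2^{-k} -> 0: convergence in measure to 0.
print([dyadic_interval(n)[1] - dyadic_interval(n)[0] for n in (1, 2, 4, 8, 16)])
# At x = 0.25 the values keep returning to 1: no pointwise limit.
print([f(n, 0.25) for n in range(1, 17)])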

2.4.3 Convergence in distribution


Definition 2.46. Let {ξn}n≥1 be a sequence of random variables of the same dimension with corresponding distribution functions {Fn}, and let ξ have distribution function F. We call ξn convergent in distribution (or in law) to ξ, denoted by Fn ⇒ F or ξn →d ξ, if Fn(x0) → F(x0) holds at every continuity point x0 of F.
Theorem 2.47. If ξn →P ξ, then ξn →d ξ.
Proof. By |P(A) − P(B)| ≤ P(A Δ B), where A Δ B = (A − B) ∪ (B − A) is the symmetric difference of A and B, for any unit vector e ∈ Rn, we have
|Fn(x) − F(x)| = |P(ξn < x) − P(ξ < x)|
≤ P(ξn < x, ξ ∉ (−∞, x + εe)) + P(ξ ∈ (−∞, x + εe) \ (−∞, x))
+ P(ξn ∉ (−∞, x), ξ < x − εe) + P(ξ ∈ (−∞, x) \ (−∞, x − εe))
≤ P(|ξ − ξn| ≥ ε) + P(ξ ∈ (−∞, x + εe) \ (−∞, x − εe)).
If x is a continuity point of F, then we derive Fn(x) → F(x) by letting first n ↑ ∞ and then ε ↓ 0. □

Corollary 2.48. Let a be a constant. Then ξn →P a if and only if ξn →d a.

Proof. We only need to prove the sufficiency. For simplicity, we consider the one-dimensional case. Since the distribution function of the constant random variable ξ ≡ a is F(x) = 1(a,∞)(x), both a − ε and a + ε are continuity points of F for any ε > 0. By ξn →d a, it follows that for any ε > 0, P(|ξn − a| > ε) = P(ξn < a − ε) + P(ξn > a + ε) → 0 (n → ∞). □
Similarly, it is easy to check the following two assertions.
Theorem 2.49. If ξn − ξ'n →P 0 and ξ'n →d ξ, then ξn →d ξ.

Theorem 2.50. If ξn →d ξ and ηn →d a, where a is a constant, then ξn + ηn →d ξ + a.

2.5 Exercises

1. Prove Property 2.3.


2. Prove Corollary 2.8.
3. Prove Corollary 2.9.
4. Prove Theorem 2.13.
5. Is a distribution function increasing? Prove or disprove with a
counterexample.
6. Prove that a Radon measure must be the L–S measure generated
by some distribution function.
7. Prove that if F (x) = P(ξ < x) is continuous, then η = F (ξ) has
the uniform distribution on [0, 1].
8. Prove Property 2.31.
9. Let {ξn}n≥1 be independent with identical distribution μ. Given A ∈ B with μ(A) > 0, define τ = inf{k : ξk ∈ A}. Prove that the distribution of ξτ is μ(· ∩ A)/μ(A).

10. Let ξ and ξ̃ be independent and identically distributed, and let η = ξ − ξ̃ (which is called the symmetrization of ξ). Prove that P(|η| > t) ≤ 2P(|ξ| > t/2).

11. Let (Ω, A, P) be a probability space. Subclasses C1, ..., Cn of A are called independent if
    P(∩_{i=1}^n Ai) = ∏_{i=1}^n P(Ai), Ai ∈ Ci, 1 ≤ i ≤ n.
    Prove that if C1, ..., Cn are independent π-systems, then σ(C1), ..., σ(Cn) are independent.
12. Prove the following 0–1 laws.
    (a) Let {An}n≥1 be a sequence of independent events and T = ∩_{n=1}^∞ σ({An, An+1, ...}). Then P(A) = 0 or 1 for any A ∈ T.
    (b) Let {ξn}n≥1 be a sequence of independent random variables, and let T = ∩_{n=1}^∞ σ(ξn, ξn+1, ...), where σ(ξn, ξn+1, ...) is the smallest σ-algebra in Ω such that {ξk : k ≥ n} are measurable. Then P(A) = 0 or 1 for any A ∈ T.
13. Prove Property 2.36.
14. Prove Theorem 2.38.
15. Let ξ1, ξ2, ... ∈ {1, 2, ..., r} be independent with identical distribution P(ξi = k) = p(k) > 0, 1 ≤ k ≤ r. Set πn(ω) = p(ξ1(ω)) · · · p(ξn(ω)). Prove that
    −n^{−1} log πn →P H := −Σ_{k=1}^r p(k) log p(k),
    where H is called Shannon's information entropy (see the simulation sketch after this exercise list).
16. Let ξn = 1An. Prove that ξn →P 0 if and only if P(An) → 0.
17. Let C be a class of sets in Ω, and let f be a function on Ω. If f is σ(C)-measurable, prove that there exists a countable subclass Cf of C such that f ∈ σ(Cf).

18. If the sequence of random variables {ξn}n≥1 is increasing and ξn →P ξ, then ξn → ξ a.e.
19. (a) If ξn → ξ a.e., then
        Sn := (1/n) Σ_{k=1}^n ξk → ξ a.e.
    (b) When ξn →P ξ, does it hold that Sn →P ξ?
20. (Ω, A, P) is called a purely atomic probability space if Ω has a partition {An}n≥1 such that A = σ({An : 1 ≤ n < ∞}); each An (≠ ∅) is called an atom. Prove that for a sequence of random variables on a purely atomic probability space, convergence in probability is equivalent to a.s. convergence.
21. (Egorov's theorem) Let (Ω, A, μ) be a finite measure space, and let fn, f be finite measurable functions such that fn → f a.e. Then ∀ε > 0, ∃N ∈ A with μ(N) ≤ ε such that fn converges uniformly to f on Nc.
22. For any sequence of random variables {ξn}, there exists a sequence of positive numbers {an} such that an ξn →P 0.
23. Exemplify that Theorem 2.41 may fail when g is only a contin-
uous function.
24. Prove Theorems 2.49 and 2.50.
25. Let Fn and F be the distribution functions of ξn and ξ, respectively. If ξn →d ξ, then P(ξn ≤ x) → P(ξ ≤ x) and P(ξn > x) → P(ξ > x) for every continuity point x of F.
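The convergence in Exercise 15 (the asymptotic equipartition property) is easy to watch numerically. The following Python sketch (an added illustration; the distribution p is a hypothetical choice) simulates −n^{−1} log πn and compares it with H:

import math, random

p = [0.5, 0.3, 0.2]                                  # hypothetical p(k), k = 1..r
H = -sum(q * math.log(q) for q in p)                 # Shannon entropy

def neg_log_pi_over_n(n):
    # Draw xi_1, ..., xi_n i.i.d. with P(xi = k) = p[k-1]; return -log(pi_n)/n.
    s = 0.0
    for _ in range(n):
        s += math.log(p[random.choices(range(len(p)), weights=p)[0]])
    return -s / n

for n in (100, 10_000, 100_000):
    print(n, round(neg_log_pi_over_n(n), 4), round(H, 4))   # values approach H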
Chapter 3

Integral and Expectation

In the elementary probability theory, the expectation (also called


mathematical expectation) has been defined for two typical types
of random variables, i.e. by using the weighted sum with respect to
the distribution sequence for a discrete random variable, and the
Lebesgue integral of the product of the identity function and the
distribution density function for a continuous random variable. In
this chapter, we aim to define and study the expectation for gen-
eral random variables on an abstract probability space (Ω, A , P).
More generally, we define the integral of a measurable function on a
complete measure space (Ω, A , μ), and when μ = P is a probability
measure, a measurable function reduces to a random variable, and
the integral is called mathematical expectation or expectation for
simplicity.
Intuitively, the integral of a measurable function f with respect to μ can be regarded as the measurement result of f under μ, so the integral of a measurable indicator function 1A is naturally defined as μ(A). Combining this with the construction of measurable functions from simple functions (Theorem 2.12), and equipping the integral with the linear property, we may define the integral for general measurable functions.


3.1 Definition and Properties for Integral

3.1.1 Definition of integral


Let (Ω, A, μ) be a complete measure space and f be a real measurable function on Ω. As explained above, for any A ∈ A, we call μ(A) the integral of 1A with respect to μ. Equipping the integral with the linear property, we define the integral of nonnegative simple functions as follows. When functions taking infinite values are involved, we use the convention 0 × ∞ = 0.

Definition 3.1. Let f be a nonnegative simple function, i.e.
f = Σ_{k=1}^n ak 1Ak,
where n ∈ N, {ak} ⊂ [0, ∞], and {Ak} ⊂ A is a partition of Ω. We call
∫_Ω f dμ := Σ_{k=1}^n ak μ(Ak)
the integral of f with respect to μ. For any A ∈ A, we call ∫_A f dμ = ∫_Ω f 1A dμ the integral of f on A with respect to μ.

Clearly, the value of ∫_Ω f dμ is independent of the representation of the simple function f and is hence well defined. As the integral is the measurement result of f under μ, we also denote μ(f) = ∫_Ω f dμ. The following properties are obvious.

Property 3.2. Let f and g be nonnegative simple functions.
(1) (Monotonicity) f ≤ g ⇒ μ(f) ≤ μ(g).
(2) (Linearity) μ(f + g) = μ(f) + μ(g) and μ(cf) = cμ(f), ∀c ≥ 0.
(3) Let μf(A) = ∫_A f dμ, A ∈ A. Then μf is a measure on A and ∫_Ω g dμf = ∫_Ω fg dμ.

By the monotonicity and the fact that a nonnegative measurable


function can be approximated from below by nonnegative simple
functions, we define the integrals of nonnegative measurable func-
tions as follows.

Definition 3.3. Let f be a nonnegative measurable function. Then
μ(f) = ∫_Ω f dμ := sup{ ∫_Ω g dμ : 0 ≤ g ≤ f, g is a simple function }
is called the integral of f with respect to μ. ∀A ∈ A, ∫_A f dμ := μ(1A f) is called the integral of f on A with respect to μ.

By the definition, monotonicity is preserved by the integral of nonnegative measurable functions.

Property 3.4. If 0 ≤ f ≤ g are measurable, then μ(f) ≤ μ(g).

The following result is fundamental in the study of limit theorems for integrals.

Theorem 3.5 (Monotone convergence theorem). Let {fn }n1


be nonnegative measurable functions on Ω. If fn ↑ f as n ↑ ∞,
then
lim μ(fn ) = μ(f ).
n→∞

Proof. By the monotonicity, μ(fn) is increasing, hence its limit exists, and fn ≤ f implies lim_{n→∞} μ(fn) ≤ μ(f). We only need to prove the converse inequality, i.e. for any simple function g = Σ_{j=1}^m aj 1Aj + ∞ · 1{f=∞} such that 0 ≤ g ≤ f, we have μ(g) ≤ lim_{n→∞} μ(fn). To see this, for any ε ∈ (0, min_{1≤j≤m} aj) and N ≥ 1, let
gn := Σ_{j=1}^m (aj − ε) 1{Aj ∩ {|fn−f| ≤ ε}} + N 1{f=∞, |fn| ≥ N}, n ≥ 1.

It is clear that fn ≥ gn. By fn → f, the monotonicity of the integral, and the continuity of the measure, we obtain
lim_{n→∞} μ(fn) ≥ lim_{n→∞} μ(gn) = Σ_{j=1}^m (aj − ε) lim_{n→∞} μ(Aj ∩ {|fn − f| ≤ ε}) + N lim_{n→∞} μ({f = ∞, |fn| ≥ N}) = Σ_{j=1}^m (aj − ε) μ(Aj) + N μ(f = ∞).
Since ε and N are arbitrary, this implies the desired inequality lim_{n→∞} μ(fn) ≥ μ(g). Hence, the proof is finished. □
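As a numerical illustration of Theorem 3.5 (added here, not part of the original text): on [0, 1] with the Lebesgue measure, take f(x) = x^{−1/2}, which is integrable with integral 2, and fn = min(f, n) ↑ f. The truncated integrals increase to 2; the Python sketch below approximates them by midpoint Riemann sums.

def integral01(h, steps=100_000):
    # Midpoint Riemann sum approximating the integral over [0, 1].
    dx = 1.0 / steps
    return sum(h((i + 0.5) * dx) for i in range(steps)) * dx

f = lambda x: x ** -0.5                     # integral over [0, 1] equals 2
for n in (1, 4, 16, 64, 256):
    f_n = lambda x, n=n: min(f(x), n)       # f_n = min(f, n) increases to f
    print(n, round(integral01(f_n), 4))     # values increase toward 2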
Finally, we define the integral of a measurable function f by the
linearity and the formula f = f + − f − , where f + and f − are the
positive and negative parts of f , respectively.
Definition 3.6.
(1) Let f be a measurable function on Ω. If either μ(f+) or μ(f−) is finite, then
    μ(f) = ∫_Ω f dμ := μ(f+) − μ(f−)
    is called the integral of f with respect to μ. For any A ∈ A such that the integral μ(1A f) exists,
    ∫_A f dμ := μ(1A f)
    is called the integral of f on A with respect to μ. When μ(f) exists and is finite, f is called integrable (with respect to μ). To emphasize the dependence of f on x, we also write μ(f) = ∫_Ω f(x) μ(dx) and μ(1A f) = ∫_A f(x) μ(dx).
(2) Let f = f1 + i f2 be a complex measurable function. If both μ(f1) and μ(f2) exist, we say that f has an integral, defined as
    μ(f) = ∫_Ω f dμ := ∫_Ω f1 dμ + i ∫_Ω f2 dμ.
    We call f integrable if both f1 and f2 are integrable.

Proposition 3.7. If f = g a.e. and their integrals exist, then μ(f) = μ(g).

3.1.2 Properties of integral


It is easy to see from Property 3.2 and Definition 3.6 that the integral
has the following properties.

Theorem 3.8. Let f and g be real measurable functions.
(1) Linear property
    (a) If the sum μ(f) + μ(g) exists, then the integral of f + g exists and μ(f + g) = μ(f) + μ(g).
    (b) If μ(f) exists and A, B ∈ A are disjoint, then ∫_{A+B} f dμ = ∫_A f dμ + ∫_B f dμ.
    (c) If c ∈ R and μ(f) exists, then μ(cf) exists, and μ(cf) = cμ(f).
(2) Monotonicity
    (a) If μ(f) and μ(g) exist and f ≤ g a.e., then ∫_A f dμ ≤ ∫_A g dμ, A ∈ A.
    (b) If μ(f) exists, then |μ(f)| ≤ μ(|f|).
    (c) When f ≥ 0, μ(f) = 0 if and only if f = 0 a.e.
    (d) Let N be a μ-null set. Then ∫_N f dμ = 0.
(3) Integrability
    (a) f is integrable if and only if μ(|f|) < ∞; when f is integrable, f is finite a.e.
    (b) If |f| ≤ g and g is integrable, so is f.
    (c) If f and g are integrable, so is f + g.
    (d) If μ(fg) exists, then (μ(fg))² ≤ μ(f²)μ(g²).

Corollary 3.9 (Markov's inequality). If f is measurable and nonnegative on A ∈ A, then ∀c > 0,
μ({f ≥ c} ∩ A) ≤ (1/c) ∫_A f dμ.

 
Proof. Let g = c1_{A∩{f≥c}}. Then g ≤ 1A f and ∫_Ω g dμ ≤ ∫_A f dμ, so
cμ({f ≥ c} ∩ A) = μ(g) ≤ ∫_A f dμ. □
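With μ = P a probability measure and A = Ω, Corollary 3.9 reads P(f ≥ c) ≤ Ef/c for f ≥ 0. A quick Monte Carlo check in Python (an added illustration; the exponential sample is a hypothetical choice):

import math, random

N = 100_000
sample = [-math.log(1.0 - random.random()) for _ in range(N)]  # Exp(1), mean 1

mean = sum(sample) / N                                # estimates E f
for c in (1.0, 2.0, 5.0):
    tail = sum(1 for s in sample if s >= c) / N       # estimates P(f >= c)
    print(c, round(tail, 4), round(mean / c, 4), tail <= mean / c)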

3.2 Convergence Theorems

As applications of the monotone convergence theorem, we have the following convergence theorems.

Theorem 3.10 (Fatou–Lebesgue theorem or Fatou's lemma). Let g and h be real integrable functions and {fn}n≥1 be a sequence of real measurable functions.
(1) If g ≤ fn for all n ≥ 1, then ∫_Ω liminf_{n→∞} fn dμ ≤ liminf_{n→∞} ∫_Ω fn dμ.
(2) If fn ≤ g for all n ≥ 1, then limsup_{n→∞} ∫_Ω fn dμ ≤ ∫_Ω limsup_{n→∞} fn dμ.
(3) If g ≤ fn ↑ f, or g ≤ fn ≤ h a.e. and fn → f a.e., then
    lim_{n→∞} ∫_Ω fn dμ = ∫_Ω f dμ.
 
Proof. If g ≤ fn, then fn− ≤ g−, so ∫_Ω fn− dμ < ∞ and hence ∫_Ω fn dμ exists. A similar argument shows that ∫_Ω fn dμ exists in (2) and (3).
(1) Let gn = inf_{k≥n}(fk − g). Then gn ≥ 0 and
gn ↑ liminf_{n→∞}(fn − g) = liminf_{n→∞} fn − g.
By the monotone convergence theorem,
∫_Ω liminf_{n→∞} fn dμ − ∫_Ω g dμ = lim_{n→∞} ∫_Ω inf_{k≥n}(fk − g) dμ ≤ liminf_{n→∞} ∫_Ω (fn − g) dμ = liminf_{n→∞} ∫_Ω fn dμ − ∫_Ω g dμ.
(2) Replacing fn by −fn in the above proof, (2) follows from (1) immediately.

(3) When g ≤ fn ↑ f, we have 0 ≤ fn − g ↑ f − g, so the assertion follows from the monotone convergence theorem. When g ≤ fn ≤ h a.e. and fn → f a.e., let N be a null set such that g ≤ fn ≤ h and fn → f hold on Nc. Then g1Nc ≤ fn1Nc ≤ h1Nc. By (1) and (2), we have
lim_{n→∞} μ(fn 1Nc) = μ(lim_{n→∞} fn 1Nc) = μ(f 1Nc).
Combining this with Theorem 3.8(1)(a) and (2)(d), we finish the proof. □

Theorem 3.11 (Dominated convergence theorem). Let g be an integrable function and let {fn} be measurable functions such that |fn| ≤ g a.e. for all n ≥ 1. If either fn → f a.e. or fn →μ f, then ∫_Ω fn dμ → ∫_Ω f dμ.

Proof. By Theorem 3.10(3), we only need to consider the case fn →μ f. By Theorem 3.8(2)(b), it suffices to show that lim_{n→∞} ∫_Ω |fn − f| dμ = 0. If this does not hold, then there exist nk ↑ ∞ and ε > 0 such that ∫_Ω |fnk − f| dμ ≥ ε, ∀k ≥ 1. Since fnk →μ f, there exists a further subsequence fn'k → f a.e., so that by Theorem 3.10(3), we derive lim_{k→∞} ∫_Ω |fn'k − f| dμ = 0, which is a contradiction. □
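A concrete instance of the theorem (an added illustration, not part of the original text): on [0, 1] with the Lebesgue measure, fn(x) = nx/(1 + n²x²) → 0 pointwise and |fn| ≤ 1/2, so ∫ fn dμ → 0. Without a dominating integrable function the exchange can fail: gn = n 1_{(0,1/n)} → 0 a.e. while ∫ gn dμ = 1 for all n. The Python sketch below exhibits both behaviors.

def integral01(h, steps=100_000):
    # Midpoint Riemann sum over [0, 1].
    dx = 1.0 / steps
    return sum(h((i + 0.5) * dx) for i in range(steps)) * dx

for n in (1, 10, 100, 1000):
    fn = lambda x, n=n: n * x / (1.0 + (n * x) ** 2)  # dominated by 1/2
    print(n, round(integral01(fn), 4))                # tends to 0

for n in (1, 10, 100, 1000):
    gn = lambda x, n=n: n if x < 1.0 / n else 0.0     # no integrable dominator
    print(n, round(integral01(gn), 4))                # stays equal to 1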

Corollary 3.12. Let {fn}n≥1 be a sequence of measurable functions. If the fn are nonnegative or Σ_{n=1}^∞ ∫_Ω |fn| dμ < ∞, then the integral of Σ_{n=1}^∞ fn exists and
∫_Ω (Σ_{n=1}^∞ fn) dμ = Σ_{n=1}^∞ ∫_Ω fn dμ.

Proof. Let gn = Σ_{k=1}^n fk. If the fn are nonnegative, then gn ↑ Σ_{n=1}^∞ fn, so the assertion follows from the monotone convergence theorem.
Assume Σ_{n=1}^∞ ∫_Ω |fn| dμ < ∞. Let g' = Σ_{n=1}^∞ |fn| and g'n = Σ_{k=1}^n |fk|. Then 0 ≤ g'n ↑ g'. It follows from the monotone convergence theorem that
∫_Ω g' dμ = lim_{n→∞} ∫_Ω g'n dμ = Σ_{n=1}^∞ ∫_Ω |fn| dμ < ∞,
so g' is integrable and |gn| ≤ g'. Since g' is integrable, it is a.e. finite, and hence gn → Σ_{n=1}^∞ fn a.e. Then the assertion follows from the dominated convergence theorem. □
Corollary 3.13. If μ(f) exists, then for A ∈ A and {An}n≥1 ⊂ A mutually disjoint with A = ∪_{n=1}^∞ An, we have ∫_A f dμ = Σ_{n=1}^∞ ∫_{An} f dμ.

Proof. As f± 1A = Σ_{n=1}^∞ f± 1An, we have ∫_A f± dμ = Σ_{n=1}^∞ ∫_{An} f± dμ. Since the integral of f exists, at least one of the two series is finite, so we can subtract term by term, which gives the assertion. □
The following result provides the definition of product measures.
Corollary 3.14. Let (Ωi, Ai, μi), 1 ≤ i ≤ n, be finitely many σ-finite measure spaces. Then there exists a unique σ-finite measure μ on the product measurable space (Ω1 × · · · × Ωn, A1 × · · · × An) such that
μ(A1 × · · · × An) = μ1(A1) · · · μn(An), Ai ∈ Ai, 1 ≤ i ≤ n. (3.1)
The measure μ is called the product measure of μi, 1 ≤ i ≤ n, and is denoted by μ1 × · · · × μn.
Proof. It is easy to see that
C := {A1 × · · · × An : Ai ∈ Ai, 1 ≤ i ≤ n}
is a semi-algebra in Ω1 × · · · × Ωn. By Theorem 1.42, it suffices to show that μ defined by (3.1) is a σ-finite measure on C. Since each μi is σ-finite, so is μ. It remains to prove the σ-additivity of μ. By induction, we only prove it for n = 2.
Let {A × B, Ai × Bi : i ≥ 1} ⊂ C be such that A × B = ∪_{i=1}^∞ Ai × Bi with the Ai × Bi mutually disjoint. Then for any ω2 ∈ Ω2, we have
1A 1B(ω2) = 1A×B(·, ω2) = Σ_{i=1}^∞ 1Ai 1Bi(ω2).
By Corollary 3.12 applied to integrals with respect to μ1, we obtain μ1(A)1B = Σ_{i=1}^∞ μ1(Ai)1Bi. Applying Corollary 3.12 again to integrals with respect to μ2, we finish the proof. □

Definition 3.15. Let f be a measurable function such that μ(f) exists. We call μf(A) := ∫_A f dμ (A ∈ A) the indefinite integral of f.

It is clear that when μ(f−) < ∞, the indefinite integral μf is a signed measure on A.

Proposition 3.16. Let (Ω, A, μ) be a measure space and ρ ≥ 0 be a measurable function on (Ω, A). If a measurable function f on (Ω, A) has an integral with respect to μρ, then μ(ρf) exists and μ(ρf) = μρ(f).

Proof. By definition, the assertion is true for f being an indicator function. Combining this with the linearity of the integral, Theorem 2.12(4), and Theorem 3.5, we derive the formula first for f being a simple function, then for f being a nonnegative measurable function, and finally for f being a measurable function such that μρ(f) exists. □

Corollary 3.17. If f is integrable, then ∫_{An} f dμ → 0 as μ(An) → 0.

Proof. Suppose ∫_{An} f dμ does not tend to 0. Since |∫_{An} f dμ| ≤ ∫_Ω |f| dμ < ∞, there exists nk ↑ ∞ such that ∫_{Ank} f dμ → ℓ ≠ 0. Take a subsequence {n'k} of {nk} such that μ(An'k) ≤ 2^{−k}. Let Bk = ∪_{i=k}^∞ An'i. Then μ(Bk) ≤ 2^{1−k}, so Bk ↓ B = ∩_{k=1}^∞ Bk, which is a null set. It follows that |1An'k f| ≤ |1Bk f| → 0 a.e. By the dominated convergence theorem, we have ∫_{An'k} f dμ → 0, which contradicts ∫_{Ank} f dμ → ℓ ≠ 0. □

As applications of the dominated or monotone convergence theorem, we have the following results on interchanging limits, derivatives, and sums with the integral.

Corollary 3.18 (Interchange of derivative and integral). Let T ⊂ R be an open set. If ∀t ∈ T, ft is integrable and ∀ω ∈ Ω, ft(ω) is differentiable at t0, then (d/dt) ft(ω)|_{t=t0} is measurable. If there exist an integrable function g and ε > 0 such that |(ft − ft0)/(t − t0)| ≤ g for 0 < |t − t0| < ε, then
(d/dt) ∫_Ω ft dμ |_{t=t0} = ∫_Ω (dft/dt)|_{t=t0} dμ.
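Corollary 3.18 can be checked numerically. With μ the Lebesgue measure on [0, 1] and ft(x) = e^{tx} (an added illustration with a hypothetical choice of ft, not part of the original text), the derivative of t ↦ ∫_0^1 e^{tx} dx at t0 should equal ∫_0^1 x e^{t0 x} dx; the Python sketch compares a central difference with the integral of the derivative.

import math

def integral01(h, steps=100_000):
    # Midpoint Riemann sum over [0, 1].
    dx = 1.0 / steps
    return sum(h((i + 0.5) * dx) for i in range(steps)) * dx

F = lambda t: integral01(lambda x: math.exp(t * x))   # t -> integral of f_t
t0, h = 0.5, 1e-5
lhs = (F(t0 + h) - F(t0 - h)) / (2 * h)               # derivative of the integral
rhs = integral01(lambda x: x * math.exp(t0 * x))      # integral of df_t/dt
print(round(lhs, 6), round(rhs, 6))                   # the two values agree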

Corollary 3.19. Let {ft}t∈(a,b) be a family of real integrable functions such that dft/dt exists. If there exists an integrable function g such that |dft/dt| ≤ g, then on (a, b),
(d/dt) ∫_Ω ft dμ = ∫_Ω (dft/dt) dμ.

Proof. By the mean value theorem of differentiation, |(ft − ft0)/(t − t0)| ≤ g for all t, t0 ∈ (a, b). The assertion then follows from Corollary 3.18 immediately. □

Corollary 3.20 (Interchange of integrals).
(1) Let {ft}t∈(a,b) be a family of real integrable functions such that ∀ω ∈ Ω, ft(ω) is continuous in t, and there exists an integrable function g such that |ft| ≤ g for all t ∈ (a, b). Then
    ∫_a^b (∫_Ω ft dμ) dt = ∫_Ω (∫_a^b ft dt) dμ.
(2) If the above equation holds on finite intervals, and ∫_{−∞}^∞ |ft| dt ≤ h with h integrable, then ∫_{−∞}^∞ (∫_Ω ft dμ) dt = ∫_Ω (∫_{−∞}^∞ ft dt) dμ.

Proof. (1) Let a = t0 < t1 < · · · < tn = b be partitions of [a, b] with mesh tending to zero. Then
∫_a^b ft dt = lim_{n→∞} Σ_{i=1}^n (ti − ti−1) fti.
Since |Σ_{i=1}^n (ti − ti−1) fti| ≤ (b − a)g, it follows from the dominated convergence theorem that ∫_Ω ft dμ is continuous in t. Thus, by the dominated convergence theorem and the linear property of the integral, we have
∫_Ω (∫_a^b ft dt) dμ = lim_{n→∞} Σ_{i=1}^n (ti − ti−1) ∫_Ω fti dμ = ∫_a^b (∫_Ω ft dμ) dt.
(2) Since ∫_{−∞}^∞ |ft| dt ≤ h, the functions gn = ∫_{−n}^n ft dt satisfy gn → ∫_{−∞}^∞ ft dt and |gn| ≤ h. Hence, the assertion follows from the dominated convergence theorem. □

Corollary 3.21 (Interchange of summations). Let {fnm}n,m≥1 be a family of real numbers. Assume fnm ≥ 0, or there exists a sequence of numbers {gn} such that Σ_{m=1}^∞ |fnm| ≤ gn (∀n) and Σ_{n=1}^∞ gn < ∞. Then
Σ_{m=1}^∞ Σ_{n=1}^∞ fnm = Σ_{n=1}^∞ Σ_{m=1}^∞ fnm.
Proof. Let Ω = N, let μ be the counting measure on Ω, and set g(n) = gn. Then g is integrable. Let fm(n) = fnm. Then Σ_{m=1}^∞ |fm| ≤ g. From the monotone convergence theorem or the dominated convergence theorem, it follows that
Σ_{m=1}^∞ Σ_{n=1}^∞ fnm = Σ_{m=1}^∞ ∫_Ω fm dμ = ∫_Ω (Σ_{m=1}^∞ fm) dμ = Σ_{n=1}^∞ Σ_{m=1}^∞ fnm. □

Corollary 3.22. Let {fnm}n,m≥1 be a family of real numbers such that 0 ≤ fnm ↑ fn (m ↑ ∞), or there exists a sequence of real numbers {gn}n≥1 such that |fnm| ≤ gn, Σ_{n=1}^∞ gn < ∞, and lim_{m→∞} fnm = fn. Then
lim_{m→∞} Σ_{n=1}^∞ fnm = Σ_{n=1}^∞ fn.

3.3 Expectation

In the following, we introduce the definition of the expectation and some numerical characters of a random variable, and then establish the integral transformation formula, which implies the L–S representation of expectation.

3.3.1 Numerical characters and characteristic


function
Definition 3.23. Let ξ be a random variable on a probability space (Ω, A, P). If the integral of ξ with respect to P exists, then this integral is called the expectation of ξ, denoted by Eξ = ∫_Ω ξ dP. If E|ξ| < ∞, we say that ξ has finite expectation.
As in the elementary probability theory, by using the expectation, we define the characteristic function and numerical characters of random variables as follows.
Definition 3.24.
(1) Let ξ = (ξ1, ..., ξn) be an n-dimensional random variable. We call
    Rn ∋ (t1, ..., tn) ↦ φξ(t1, ..., tn) := E e^{i⟨t,ξ⟩}
    the characteristic function of ξ, where i := √−1 and ⟨t, ξ⟩ := Σ_{j=1}^n tj ξj.
(2) Let ξ be a random variable such that Eξ exists. Then Dξ := E|ξ − Eξ|² is called the variance of ξ.
(3) Let ξ be a random variable and r > 0. Then E|ξ|^r is called the rth moment of ξ. When Eξ exists, E|ξ − Eξ|^r is called the rth central moment of ξ.
(4) Let ξ and η be two random variables such that Eξ and Eη are finite and
    bξ,η = E(ξ − Eξ)(η − Eη)
    exists. Then bξ,η is called the covariance of ξ and η. If DξDη ≠ 0 and is finite, then rξ,η := bξ,η/√(DξDη) is called the correlation coefficient of ξ and η.
(5) Let ξ = (ξ1, ..., ξn) be an n-dimensional random variable such that Eξ = (Eξ1, ..., Eξn) and bij := bξi,ξj, 1 ≤ i, j ≤ n, exist. Then the matrix B(ξ) = (bij)1≤i,j≤n is called the covariance matrix of ξ. If rij := rξi,ξj, 1 ≤ i, j ≤ n, exist, then R(ξ) = (rij)1≤i,j≤n is called the correlation matrix of ξ.
Integral and Expectation 67

Besides the properties listed in Theorem 3.8, the expectations of independent random variables also satisfy the following product formula.

Theorem 3.25 (Multiplication theorem). If random variables ξ1, ξ2, ..., ξn on the probability space (Ω, A, P) are independent, and all are either nonnegative or have finite expectations, then
E(ξ1 ξ2 · · · ξn) = Eξ1 Eξ2 · · · Eξn.

Proof. By induction, we only prove the formula for n = 2.
(1) Let ξ and η be nonnegative simple functions with ξ = Σ_{i=1}^n ai 1Ai (ai ≠ aj, i ≠ j) and η = Σ_{i=1}^m bi 1Bi (bi ≠ bj, i ≠ j). Then P(Ai ∩ Bj) = P(Ai)P(Bj) holds for all i, j, so that
ξη = Σ_{i=1}^n Σ_{j=1}^m ai bj 1_{Ai∩Bj}, E(ξη) = Σ_{i=1}^n Σ_{j=1}^m ai bj P(Ai)P(Bj) = Eξ Eη.
This implies the desired formula for nonnegative ξ and η by applying Theorems 2.12(4) and 3.5.
(2) Let ξ and η have finite expectations. By Corollary 2.33, (ξ+, ξ−) and (η+, η−) are independent nonnegative random variables. Combining this with ξ = ξ+ − ξ−, η = η+ − η−, and the formula for nonnegative independent random variables proved in step (1), we finish the proof. □
The characteristic function and numerical characters have the fol-
lowing properties.

Proposition 3.26.
(1) Random variables ξ1, ..., ξn are mutually independent if and only if
    φ(ξ1,...,ξn)(t1, ..., tn) = φξ1(t1) · · · φξn(tn), t1, ..., tn ∈ R.
(2) If ξ1, ..., ξn are independent random variables with finite variances, then D(ξ1 + · · · + ξn) = Dξ1 + · · · + Dξn.
(3) If ξ and η are independent and have finite expectations, then bξ,η = 0.
(4) Let ξ be an n-dimensional random vector such that B(ξ) is finite. Then B(ξ) ≥ 0 (nonnegative definite).
(5) If E|ξ|^r < ∞ for some r > 0, then E|ξ|^s < ∞ for all s ∈ (0, r).
Proof. We only prove (1), (4), and (5); the rest are obvious.
(1) By Theorem 3.25, we only need to prove the sufficiency. According to Theorem 4.9, which will be proved in Chapter 4, we may construct independent random variables ξ'1, ..., ξ'n whose characteristic functions are φξ1(t1), ..., φξn(tn), respectively. Then ξ̃ := (ξ'1, ..., ξ'n) and ξ := (ξ1, ..., ξn) have the same characteristic function, so by Theorem 6.4, which will be proved in Chapter 6, ξ and ξ̃ are identically distributed. From this and the mutual independence of ξ'1, ..., ξ'n, it follows that ξ1, ..., ξn are mutually independent.
(4) ∀t1, ..., tn ∈ C, we have
Σ_{i,j=1}^n bij ti t̄j = E|Σ_{i=1}^n ti(ξi − Eξi)|² ≥ 0.
(5) Note that |ξ|^s ≤ 1 + |ξ|^r for 0 < s < r. □

3.3.2 Integral transformation and L–S representation


of expectation
The expectation of a random variable ξ is defined as its integral with
respect to the probability measure P. Since in general the probability
space (Ω, A , P) is abstract, the expectation is not easy to calculate.
Since Eξ is determined by the distribution of ξ, we aim to express it as the integral of the identity function with respect to the distribution P ◦ ξ−1 of ξ, which is a probability measure on Euclidean space; see Definition 2.29. This is called the L–S representation of expectation.
In general, let f : (Ω, A ) → (E, E ) be a measurable map, and let
μ be a measure on (Ω, A ). Then f maps μ into the following measure
μ ◦ f −1 on (E, E ):
(μ ◦ f −1 )(B) := μ(f −1 (B)), B ∈ E,
which is called the image of μ under f . We have the following integral
transformation theorem.

Theorem 3.27 (Integral transformation theorem). Let f : (Ω, A) → (E, E) be measurable, let μ be a measure on A, and let g be a measurable function on (E, E) such that its integral with respect to μ ◦ f−1 exists. Then the integral of g ◦ f with respect to μ exists and
∫_{f−1(B)} (g ◦ f) dμ = ∫_B g d(μ ◦ f−1), B ∈ E.

Proof. (1) Let g be an indicator function, g = 1B', B' ∈ E. Then
∫_B g d(μ ◦ f−1) = (μ ◦ f−1)(B ∩ B') = μ(f−1(B ∩ B')) = ∫_{f−1(B)} 1_{f−1(B')} dμ = ∫_{f−1(B)} (1B' ◦ f) dμ.
(2) By step (1), the linear property of the integral, and the monotone convergence theorem, we derive the formula first for g being a simple function, then for g being a nonnegative measurable function, and finally for g being a measurable function such that μ(g ◦ f) exists. □
In references, the L–S measure μ induced by a distribution func-
tion F is also denoted by dμ = dF , and the associated integral is
called the L–S integral.
Definition 3.28. Let μ be the Lebesgue–Stieltjes (L–S, in short)
measure on (Rn , B n ) induced by a distribution function F . Let f be
a measurable function on (Rn , B n ) such that μ(f ) exists. Then the
integral of f with respect to μ is called an L–S integral, denoted by

μ(f) = ∫_{ℝⁿ} f dμ = ∫_{ℝⁿ} f dF.

Let ξ = (ξ_1, …, ξ_n) be an n-dimensional random variable on (Ω, A, P) with distribution function F. Then the distribution of ξ is expressed as

(P ∘ ξ^{-1})(A) := P(ξ ∈ A) = ∫_A dF = ∫_{ℝⁿ} 1_A dF,  A ∈ Bⁿ.

More generally, let g := (g_1, …, g_m) be a finite m-dimensional measurable function on ℝⁿ. Then by Theorem 3.27, the distribution of η := g(ξ) is

(P ∘ η^{-1})(A) = ∫_{g^{-1}(A)} dF,  A ∈ B^m.

This implies the following L–S representation of expectation.


Corollary 3.29. Let ξ = (ξ_1, …, ξ_n) be an n-dimensional random variable on (Ω, A, P) with distribution function F. Let g = (g_1, …, g_m) be a finite m-dimensional measurable function on ℝⁿ such that Eg(ξ) exists. Then Eg(ξ) = ∫_{ℝⁿ} g d(P ∘ ξ^{-1}) = ∫_{ℝⁿ} g dF. In particular,

Eξ = ∫_{ℝⁿ} x (P ∘ ξ^{-1})(dx) = ∫_{ℝⁿ} x dF(x).  (3.2)
Rn Rn

To conclude this section, we present the following two examples


to show that the general definition of expectation covers that for
discrete type and continuous type random variables presented in the
elementary probability theory.
Example 3.30. Let ξ be a discrete random variable, i.e. it takes values in a countable set {a_i : i ⩾ 1} with distribution sequence P(ξ = a_i) = p_i ⩾ 0, ∑_{i=1}^∞ p_i = 1. Then (P ∘ ξ^{-1})(A) = ∑_{a_i∈A} p_i holds for any A ∈ B. By (3.2) and noting that the identity function satisfies

x = ∑_{i=1}^∞ a_i 1_{{a_i}}(x),  (P ∘ ξ^{-1})-a.s.,

we obtain

Eξ = ∫_ℝ x (P ∘ ξ^{-1})(dx) = ∑_{i=1}^∞ a_i p_i.

Example 3.31. Let ξ be a continuous type random variable with distribution density function ρ such that Eξ exists. Then its distribution is the indefinite integral of ρ with respect to the Lebesgue measure dx, so that by Proposition 3.16 and (3.2), we obtain

Eξ = ∫_ℝ x (P ∘ ξ^{-1})(dx) = ∫_ℝ x ρ(x) dx.
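Both examples can be checked numerically. The following sketch (Python with NumPy; the chosen distributions are illustrative assumptions only) computes Eξ once from samples, i.e. from the abstract side of (3.2), and once from the distribution:

import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Discrete case (Example 3.30): P(xi = a_i) = p_i, so E(xi) = sum a_i p_i.
a = np.array([-1.0, 0.0, 2.0])
p = np.array([0.2, 0.5, 0.3])
xi = rng.choice(a, size=n, p=p)
print(np.mean(xi), np.sum(a * p))        # both ~ 0.4

# Continuous case (Example 3.31): density rho(x) = 2x on [0, 1],
# sampled by the inverse-CDF method; E(eta) = int_0^1 x * 2x dx = 2/3.
eta = np.sqrt(rng.uniform(size=n))
x = np.linspace(0.0, 1.0, 10_001)
print(np.mean(eta), np.trapz(x * 2.0 * x, x))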

3.4 L^r-space

Definition 3.32. Let (Ω, A , μ) be a measure space and let r ∈


(0, ∞).
Lr (μ) := {f : f is a measurable function on Ω, μ(|f |r ) < ∞}
is called the Lr -space of μ. A sequence {fn } ⊂ Lr (μ) is said to
converge in L^r(μ) to some measurable function f if μ(|f_n − f|^r) → 0 as n → ∞, which is denoted by f_n → f in L^r(μ).
To ensure the uniqueness of limit, an element in Lr (μ) is regarded
as an equivalent class in the sense of μ-a.e. equal, that is, we identify
two functions f and g in Lr (μ) if f = g μ-a.e.
Let ‖f‖_r = μ(|f|^r)^{1/(r∨1)}. We will prove that when r ∈ (0, 1), (L^r(μ), ‖·‖_r) is a complete metric space with distance d_r(f, g) := ‖f − g‖_r, and that when r ⩾ 1 it is a Banach space. In particular, L²(μ) is a Hilbert space with inner product ⟨f, g⟩ := μ(fg). To prove this assertion, we first recall some classical inequalities, then extend them to integrals of functions, and finally compare the convergence in L^r with the a.e. convergence and the convergence in measure. Moreover, the space (L^r(μ), ‖·‖_r) is separable if A is generated by an at most countable sub-class of A, which we will not prove in this textbook.

3.4.1 Some classical inequalities


Proposition 3.33. If a ⩾ 0, b ⩾ 0, 0 < α < 1, α + β = 1, then a^α b^β ⩽ αa + βb, and the equality holds if and only if a = b.

Proof. As log is a concave function, log(αa + βb) ⩾ α log a + β log b = log(a^α b^β), and clearly the equality holds if and only if a = b. □

Proposition 3.34 (Hölder's inequality). Let r > 1, 1/r + 1/s = 1. Then

μ(|fg|) ⩽ (μ(|f|^r))^{1/r} (μ(|g|^s))^{1/s}.  (3.3)

The equality holds if and only if ∃c_1, c_2 ∈ ℝ with |c_1| + |c_2| > 0 such that

c_1|f|^r + c_2|g|^s = 0, μ-a.e.  (3.4)

Proof. The inequality holds obviously when f = 0 μ-a.e., g = 0 μ-a.e., or the right-hand side is infinite. So, we may and do assume 0 < μ(|f|^r), μ(|g|^s) < ∞. In this case, let

a = |f|^r/μ(|f|^r),  b = |g|^s/μ(|g|^s),  α = 1/r,  β = 1/s.

From Proposition 3.33, it follows that

|fg|/(‖f‖_r ‖g‖_s) ⩽ |f|^r/(r μ(|f|^r)) + |g|^s/(s μ(|g|^s)),

where the equality holds if and only if |f|^r/μ(|f|^r) = |g|^s/μ(|g|^s). Combining this with Theorem 3.8, we may integrate both sides with respect to μ to derive (3.3), and the equality holds if and only if |f|^r/μ(|f|^r) = |g|^s/μ(|g|^s) holds μ-a.e., which implies (3.4) for c_1 = 1/μ(|f|^r) and c_2 = −1/μ(|g|^s). Finally, it is clear that (3.4) implies the equality in (3.3). □
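Hölder's inequality is easy to test numerically. The following sketch (Python with NumPy; the exponents and values are arbitrary assumptions) takes μ to be the counting measure on a ten-point space, so that the integrals become finite sums:

import numpy as np

rng = np.random.default_rng(3)
f = rng.uniform(0.1, 2.0, size=10)
g = rng.uniform(0.1, 2.0, size=10)
r, s = 3.0, 1.5                       # conjugate exponents: 1/3 + 2/3 = 1

lhs = np.sum(np.abs(f * g))
rhs = np.sum(np.abs(f) ** r) ** (1 / r) * np.sum(np.abs(g) ** s) ** (1 / s)
print(lhs <= rhs, lhs, rhs)           # inequality (3.3) holds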
Corollary 3.35 (Jensen's inequality). ∀r > 1, E|ξ| ⩽ (E|ξ|^r)^{1/r}, and the equality holds if and only if |ξ| is a.s. constant.

Proposition 3.36 (C_r-inequality). ∀a_1, …, a_n ∈ ℝ,

|a_1 + ⋯ + a_n|^r ⩽ n^{(r−1)⁺} (|a_1|^r + ⋯ + |a_n|^r).

When r > 1, the equality holds if and only if a_1 = ⋯ = a_n; when r = 1, the equality holds if and only if all a_i have the same sign; when r < 1, the equality holds if and only if at most one of {a_i} is not zero.

Proof. (1) Case r > 1. Let Ω = {1, …, n} and A = 2^Ω, equipped with the probability measure P(A) = |A|/n, where |A| is the number of points in A. Consider the random variable ξ(i) := a_i, 1 ⩽ i ⩽ n. Then E|ξ| = (1/n) ∑_{i=1}^n |a_i| and E|ξ|^r = (1/n) ∑_{i=1}^n |a_i|^r. By Jensen's inequality,

n^{−r} (∑_{i=1}^n |a_i|)^r ⩽ n^{−1} ∑_{i=1}^n |a_i|^r,

hence the inequality holds, and the equality holds if and only if |ξ| is constant, i.e. |a_i| = |a_j|, ∀i, j. But |∑_{i=1}^n a_i| = ∑_{i=1}^n |a_i| if and only if all a_i have the same sign, so the equality holds if and only if a_i = a_j, ∀i, j.

(2) Case r ⩽ 1. We only prove the case where the a_i are not all null. Note that

|a_k| / ∑_{i=1}^n |a_i| ⩽ |a_k|^r / (∑_{i=1}^n |a_i|)^r,  r ⩽ 1.

Summing over k on both sides derives the inequality. When r = 1, the equality holds if and only if all a_i have the same sign; and when r < 1, the equality holds if and only if ∀k, |a_k|/∑_{i=1}^n |a_i| = 1 or 0, i.e. only one of the a_i is not null. □

Proposition 3.37 (C_r-inequality). Let f_1, …, f_n be measurable functions. Then

μ(|f_1 + ⋯ + f_n|^r) ⩽ n^{(r−1)⁺} ∑_{i=1}^n μ(|f_i|^r),

and the equality holds if and only if

(1) when r > 1: ∀i ≠ j, f_i = f_j, a.e.;
(2) when r < 1: at most one of μ(|f_i|^r) is not null;
(3) when r = 1: the f_i have the same sign a.e.

Proposition 3.38 (Minkowski's inequality). Let r ⩾ 1, f, g ∈ L^r(μ). Then

(μ|f + g|^r)^{1/r} ⩽ (μ|f|^r)^{1/r} + (μ|g|^r)^{1/r},

and the equality holds if and only if

(1) when r > 1: there exist c_1, c_2 ∈ ℝ, not both null and having the same sign, such that c_1 f − c_2 g = 0, a.e.;
(2) when r = 1: f and g have the same sign a.e.

Proof. Since the assertion for r = 1 is trivial, we only consider r > 1. By Hölder's inequality, we obtain

μ(|f + g|^r) ⩽ μ(|f||f + g|^{r−1}) + μ(|g||f + g|^{r−1}) ⩽ ‖f‖_r (μ(|f + g|^r))^{(r−1)/r} + ‖g‖_r (μ(|f + g|^r))^{(r−1)/r},

and the equality holds if and only if there exist c_1, c_2, c_3, c_4 ∈ ℝ with |c_1| + |c_2| > 0 and |c_3| + |c_4| > 0 such that μ-a.e. c_1|f|^r + c_2|f + g|^r = c_3|g|^r + c_4|f + g|^r = 0 and f, g have the same sign. This implies the desired assertion. □

3.4.2 Topological properties of L^r(μ)


Theorem 3.39. Let r ∈ (0, ∞). Then L^r(μ) is a complete metric space under the distance d_r(f, g) := ‖f − g‖_r. Moreover, it is a Banach space when r ⩾ 1 and a Hilbert space when r = 2.

Proof. (a) Clearly, ‖f‖_r = 0 if and only if f = 0 μ-a.e., that is, if and only if f = 0 as an element of L^r(μ). L^r(μ) is a linear space, and d_r satisfies the triangle inequality by the C_r-inequality (for r < 1) or Minkowski's inequality (for r ⩾ 1). Moreover, when r ⩾ 1, ‖·‖_r is a norm.

(b) It remains to prove the completeness. Let {f_n} be a Cauchy sequence in (L^r(μ), d_r). Then ∀ε > 0, by Chebyshev's inequality, we derive that when n, m → ∞,

μ(|f_n − f_m| ⩾ ε) ⩽ ε^{−r} μ(|f_n − f_m|^r) → 0.

Thus, {f_n} converges mutually in measure. By Theorem 2.45, there exists a subsequence n_k ↑ ∞ such that f_{n_k} → f a.e. for some measurable f. It follows that ∀m ⩾ 1, f_m − f_{n_k} → f_m − f a.e. as n_k → ∞. Then Fatou's lemma (Theorem 3.10) implies

μ(|f_m − f|^r) = μ(lim_{n_k→∞} |f_m − f_{n_k}|^r) ⩽ lim inf_{n_k→∞} μ(|f_m − f_{n_k}|^r).

Since {f_n} is a Cauchy sequence in L^r(μ), by letting m → ∞, we obtain lim_{m→∞} μ(|f_m − f|^r) = 0. □

Proposition 3.40.

(1) Let μ be a finite measure. If f_n → f in L^r(μ), then f_n → f in L^{r′}(μ) for every r′ ∈ (0, r).
(2) If f_n → f in L^r(μ), then μ(|f_n|^r) → μ(|f|^r).

Proof. (1) and (2) follow from Hölder's inequality and the triangle inequality of d_r, respectively. □

3.4.3 Links of different convergences

Definition 3.41. Let (Ω, A, μ) be a finite measure space, and let {f_t, t ∈ T} be a family of real measurable functions on Ω.

(1) {f_t, t ∈ T} is called uniformly continuous in integral if

lim_{μ(A)→0} sup_{t∈T} μ(|f_t| 1_A) = 0.

(2) {f_t, t ∈ T} is called uniformly integrable if

lim_{n→∞} sup_{t∈T} μ(|f_t| 1_{{|f_t|⩾n}}) = 0.

Theorem 3.42. Let μ be a finite measure and {f_n}_{n⩾1} ⊂ L^r(μ). The following statements are equivalent:

(1) f_n → f in L^r(μ).
(2) f_n → f in measure μ and {|f_n − f|^r}_{n⩾1} is uniformly continuous in integral.
(3) f_n → f in measure μ and {|f_n|^r}_{n⩾1} is uniformly continuous in integral.
(4) f_n → f in measure μ and {|f_n|^r}_{n⩾1} is uniformly integrable.

Proof. (a) First, we prove (1) ⇔ (2).

(1) ⇒ (2) Since μ(|f_n − f| ⩾ ε) ⩽ ε^{−r} μ(|f_n − f|^r), (1) implies f_n → f in measure. To prove that {|f_n − f|^r}_{n⩾1} is uniformly continuous in integral, ∀ε > 0, take n_ε ⩾ 1 such that μ(|f_n − f|^r) < ε for ∀n ⩾ n_ε. Then we have

sup_{n⩾1} ∫_A |f_n − f|^r dμ ⩽ ε + ∑_{n=1}^{n_ε} μ(1_A |f_n − f|^r).

As lim_{μ(A)→0} μ(1_A |f_n − f|^r) = 0 for each given n, we get

lim_{μ(A)→0} sup_{n⩾1} μ(1_A |f_n − f|^r) ⩽ ε.

Since ε is arbitrary, {|f_n − f|^r}_{n⩾1} is uniformly continuous in integral.

(2) ⇒ (1) Let A_n = {|f_n − f| ⩾ ε}. Then μ(A_n) → 0, and the uniform continuity in integral implies μ(1_{A_n} |f_n − f|^r) ⩽ sup_{m⩾1} μ(1_{A_n} |f_m − f|^r) → 0 as n → ∞. Hence,

lim sup_{n→∞} μ(|f_n − f|^r) ⩽ lim sup_{n→∞} μ(|f_n − f|^r 1_{A_n}) + ε^r μ(Ω) = ε^r μ(Ω).

Since ε is arbitrary, we have f_n → f in L^r(μ).

(b) Again, by Theorem 2.44, f_n → f in measure implies that there exists a subsequence {f_{n_k}} such that f_{n_k} → f a.e. Thus, by Fatou's lemma,

∀A ∈ A,  μ(|f|^r 1_A) ⩽ lim inf_{k→∞} μ(|f_{n_k}|^r 1_A) ⩽ sup_n μ(|f_n|^r 1_A).

Since |‖1_A f_n‖_r − ‖1_A (f_n − f)‖_r| ⩽ ‖1_A f‖_r, the uniform continuity in integral of {|f_n − f|^r}_{n⩾1} is equivalent to that of {|f_n|^r}_{n⩾1}. That is, (2) ⇔ (3).

(c) From the equivalence of (2) and (3), it follows that {μ(|f_n|^r)}_{n⩾1} is bounded, so that the uniform continuity in integral of {|f_n|^r}_{n⩾1} is equivalent to the uniform integrability of {|f_n|^r}_{n⩾1} (cf. Exercise 25 in this chapter). Hence, (3) ⇔ (4). □
Noting that |μ(f_n) − μ(f)| ⩽ μ(|f_n − f|), and that f_n → f a.e. implies f_n → f in measure for finite μ, we have the following consequence of Theorem 3.42.
Corollary 3.43 (Dominated convergence). Let (Ω, A, μ) be a finite measure space, and let {f_n, f : n ⩾ 1} be uniformly integrable measurable functions. If f_n → f in measure μ, then μ(f_n) → μ(f).
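Uniform integrability (Definition 3.41(2)) can also be explored numerically. In the following sketch (Python with NumPy; the family {f_n} is an arbitrary illustrative assumption), the tail expectations E[|f_n| 1_{{|f_n|⩾M}}] are estimated by sample means and decrease to 0 uniformly in n as M grows:

import numpy as np

rng = np.random.default_rng(4)
# Scaled Gaussian samples; sup_n E|f_n|^2 < infinity, so the family
# is uniformly integrable (cf. Exercise 23 of this chapter).
samples = [rng.standard_normal(100_000) * (1.0 + 1.0 / n) for n in range(1, 6)]

for M in (1.0, 2.0, 4.0, 8.0):
    tails = [np.mean(np.abs(f) * (np.abs(f) >= M)) for f in samples]
    print(M, max(tails))   # the sup over n decreases towards 0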

3.5 Decompositions of Signed Measure

Based on the decomposition f = f⁺ − f⁻ for a function, we aim to formulate a signed measure ϕ as the difference of two measures ϕ⁺ and ϕ⁻. This is called Hahn's decomposition, from which we will introduce Lebesgue's decomposition, which uniquely expresses a signed measure as the sum of an indefinite integral part and a singular part. The uniqueness of Lebesgue's decomposition leads to the Radon–Nikodym derivative between measures, which is crucial to develop the analysis on the space of measures. By applying Lebesgue's decomposition to the L–S measure, we decompose a distribution function into a discrete part, an absolutely continuous part, and a singular part, which classifies random variables into three types: the discrete type, the continuous type, and the singular type, where the first two types have been studied in the elementary probability theory.

3.5.1 Hahn’s decomposition theorem


To decompose a signed measure as the difference of two measures, we consider the indefinite integral μ_f of a measurable function f with μ(f⁻) < ∞, for which the natural decomposition is μ_f = μ_{f⁺} − μ_{f⁻}. To define ϕ⁺ and ϕ⁻ for a general signed measure ϕ, we reformulate μ_{f⁺} and μ_{f⁻} as follows:

μ_{f⁻}(A) = −μ_f(A ∩ D),  μ_{f⁺}(A) = μ_f(A ∩ D^c),  A ∈ A,  D := {f ⩽ 0}.

It is clear that μ_f(D) = inf_{A∈A} μ_f(A). This indicates that for a signed measure ϕ, if we could find a set D ∈ A reaching inf_{A∈A} ϕ(A), we could define ϕ⁺(A) and ϕ⁻(A) as ϕ(A ∩ D^c) and −ϕ(A ∩ D), respectively. So, we first prove the existence of D.
Theorem 3.44. Let ϕ be a signed measure on (Ω, A). Then ∃D ∈ A such that

ϕ(D) = inf_{A∈A} ϕ(A).

Proof. Take {A_n} such that ϕ(A_n) ↓ inf_{A∈A} ϕ(A). Since inf_{A∈A} ϕ(A) ⩽ 0, we may assume that each ϕ(A_n) is finite. Let A = ⋃_{n=1}^∞ A_n. For any k ⩾ 1, we have A = A_k + (A − A_k) =: A_{k,1} + A_{k,2}. ∀n ⩾ 2, we can write

A = ∑_{i_1,i_2,…,i_n=1}^2 A_{1,i_1} ∩ A_{2,i_2} ∩ ⋯ ∩ A_{n−1,i_{n−1}} ∩ A_{n,i_n}.

As n increases, the partitions become finer and finer. For each partition, we take out the subsets with nonpositive ϕ-values. Intuitively, the union of such subsets approaches the desired set D as n → ∞. To confirm this observation, for each n ⩾ 1, let

B_n = ⋃_{1⩽i_1,…,i_n⩽2 : ϕ(A_{1,i_1}∩⋯∩A_{n,i_n})⩽0} A_{1,i_1} ∩ A_{2,i_2} ∩ ⋯ ∩ A_{n,i_n} =: ⋃_{i=1}^{k_n} A_{n,i}.

By the σ-additivity of ϕ, we have ϕ(B_n) ⩽ ϕ(A_n). Let

D = ⋂_{n=1}^∞ ⋃_{k=n}^∞ B_k = lim_{n→∞} ⋃_{k=n}^∞ B_k.

As the (n+1)th partition is finer than the nth, a subset A_{n+1,i} of B_{n+1} is either included in B_n or disjoint from B_n. Then for any m > n, we have

B_n ∪ ⋯ ∪ B_m = B_n + ∑_{A_{n+1,i}∩B_n=∅} A_{n+1,i} + ∑_{A_{n+2,i}∩(B_n∪B_{n+1})=∅} A_{n+2,i} + ⋯ + ∑_{A_{m,i}∩(B_n∪⋯∪B_{m−1})=∅} A_{m,i}.

Thus, by the σ-additivity of ϕ and ϕ(A_{i,j}) ⩽ 0, it follows that ϕ(B_n ∪ ⋯ ∪ B_m) ⩽ ϕ(B_n) ⩽ ϕ(A_n). Letting m ↑ ∞, we obtain −∞ < ϕ(⋃_{k=n}^∞ B_k) ⩽ ϕ(A_n) by the lower continuity of signed measures (note that ϕ(A_n) is finite). Finally, the upper continuity of signed measures implies that

ϕ(D) = lim_{n→∞} ϕ(⋃_{k=n}^∞ B_k) ⩽ lim_{n→∞} ϕ(A_n) = inf_{A∈A} ϕ(A). □
k=n

Corollary 3.45. Let ϕ be a signed measure on a measurable space (Ω, A). Then there exists D ∈ A such that ϕ(A ∩ D) = inf_{B∈A∩A} ϕ(B) and ϕ(A ∩ D^c) = sup_{B∈A∩A} ϕ(B) for any A ∈ A.

Proof. Let D ∈ A be such that ϕ(D) = inf_{A∈A} ϕ(A). Then ∀A ∈ A and B ∈ A ∩ A, we have ϕ(A ∩ D) + ϕ(D − A) = ϕ(D) ⩽ ϕ(B ∪ (D − A)) = ϕ(B) + ϕ(D − A). Since ϕ(D) ⩽ 0, and both ϕ(A ∩ D) and ϕ(D − A) are finite, ϕ(A ∩ D) ⩽ ϕ(B). Thus, inf_{B∈A∩A} ϕ(B) ⩽ ϕ(A ∩ D) ⩽ inf_{B∈A∩A} ϕ(B), i.e. ϕ(A ∩ D) = inf_{B∈A∩A} ϕ(B).

On the other hand, ∀B ∈ A ∩ A, ϕ(A ∩ D^c) + ϕ(A ∩ D) = ϕ(A) = ϕ(B) + ϕ(B^c ∩ A). Since ϕ(A ∩ D) = inf_{B∈A∩A} ϕ(B) is finite and ϕ(A ∩ B^c) ⩾ inf_{B∈A∩A} ϕ(B) = ϕ(A ∩ D), we obtain ϕ(A ∩ D^c) = ϕ(B) + ϕ(B^c ∩ A) − ϕ(A ∩ D) ⩾ ϕ(B). Hence, sup_{B∈A∩A} ϕ(B) ⩽ ϕ(A ∩ D^c) ⩽ sup_{B∈A∩A} ϕ(B), i.e. ϕ(A ∩ D^c) = sup_{B∈A∩A} ϕ(B). □

By Corollary 3.45, we have the following theorem.

Theorem 3.46 (Hahn's decomposition theorem). Let ϕ be a σ-additive set function on (Ω, A), and let

ϕ⁺(A) = sup_{B⊂A, B∈A} ϕ(B),  ϕ⁻(A) = − inf_{B⊂A, B∈A} ϕ(B),  A ∈ A.

Then both ϕ⁺ and ϕ⁻ are measures on A, and ϕ = ϕ⁺ − ϕ⁻.

The formula ϕ = ϕ⁺ − ϕ⁻ is called Hahn's decomposition of ϕ, where ϕ⁺ and ϕ⁻ are called the positive and negative parts of ϕ, respectively. Moreover, the measure |ϕ| := ϕ⁺ + ϕ⁻ is called the total variation measure of ϕ. Note that in general |ϕ(A)| ≠ |ϕ|(A) for A ∈ A.

By Corollary 3.45 and Theorem 3.46, if ϕ = μ_f for some measurable function f with μ(f⁻) < ∞, then ϕ⁺ = μ_{f⁺} and ϕ⁻ = μ_{f⁻}, as suggested in the beginning of this part.
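On a finite space, Hahn's decomposition is transparent: a signed measure is a vector of point masses, D collects the points with nonpositive mass, and ϕ = ϕ⁺ − ϕ⁻ splits it accordingly. A small sketch (Python with NumPy; the masses are arbitrary assumptions):

import numpy as np

phi = np.array([0.5, -1.2, 0.3, -0.4, 0.0, 0.8])  # signed point masses on {0,...,5}
D = phi <= 0.0                                    # the negative set from Theorem 3.44

phi_minus = -phi[D].sum()          # phi^-(Omega) = -phi(D)
phi_plus = phi[~D].sum()           # phi^+(Omega) = phi(D^c)
total_variation = phi_plus + phi_minus

print(phi_plus, phi_minus, total_variation)
print(np.isclose(phi_plus - phi_minus, phi.sum()))  # phi = phi^+ - phi^-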

3.5.2 Lebesgue’s decomposition theorem


Let (Ω, A , μ) be a measure space. If ϕ = μf is the indefinite integral
of a measurable function f with μ(f − ) < ∞, then |ϕ|(N ) = 0 holds
for any μ-null set N . We will prove the converse result, i.e. a signed
measure having this property must be the indefinite integral of a
measurable function with respect to μ. To this end, we introduce the
following notions and establish Lebesgue’s decomposition theorem.
Definition 3.47. Let (Ω, A , μ) be a measure space, and let ϕ be a
signed measure on A .
(1) ϕ is called absolutely continuous with respect to μ, denoted by ϕ ≪ μ, if ϕ(N) = 0 holds for every μ-null set N ∈ A.
(2) We call ϕ and μ (mutually) singular if there exists N ∈ A such
that μ(N ) = 0 and |ϕ|(N c ) = 0.
Theorem 3.48. Let (Ω, A, μ) be a σ-finite measure space, and let ϕ be a σ-finite signed measure. Then ϕ ≪ μ if and only if there exists a measurable function f such that μ(f⁻) < ∞ and ϕ = μ_f.
The sufficient part is obvious, and the necessary part is implied
by the following Lebesgue’s decomposition theorem.
Theorem 3.49 (Lebesgue's decomposition theorem). Let μ and ϕ be as in Theorem 3.48. Then ϕ is uniquely decomposed as ϕ = ϕ_c + ϕ_s, where ϕ_c is the indefinite integral of a measurable function with respect to μ and ϕ_s is a signed measure singular to μ.
Proof. (1) Uniqueness of decomposition.
Consider two such decompositions: ϕ = ϕ_c + ϕ_s = ϕ′_c + ϕ′_s. Then there are μ-null sets N_1, N_2 such that |ϕ_s|(N_1^c) = |ϕ′_s|(N_2^c) = 0. Let N = N_1 ∪ N_2. We have μ(N) = 0 and |ϕ_s|(N^c) = |ϕ′_s|(N^c) = 0. So, ∀A ∈ A,

ϕ_s(A ∩ N) = (ϕ_c + ϕ_s)(A ∩ N) = (ϕ′_c + ϕ′_s)(A ∩ N) = ϕ′_s(A ∩ N),

and ϕ_s(A ∩ N^c) = ϕ′_s(A ∩ N^c) = 0. Thus, ϕ_s = ϕ′_s. Similarly, we have ϕ_c(A ∩ N) = ϕ′_c(A ∩ N) = 0 and

ϕ_c(A ∩ N^c) = (ϕ_c + ϕ_s)(A ∩ N^c) = (ϕ′_c + ϕ′_s)(A ∩ N^c) = ϕ′_c(A ∩ N^c),

so that ϕ_c = ϕ′_c. Hence the decomposition is unique.

(2) Existence of decomposition.

(i) Assume that μ and ϕ are finite measures. Let

Φ = {f : f ⩾ 0, ∫_A f dμ ⩽ ϕ(A), ∀A ∈ A},  α = sup_{f∈Φ} μ(f).

Clearly, Φ is not empty and α ∈ [0, ϕ(Ω)]. Take {f_n}_{n⩾1} ⊂ Φ such that α_n := μ(f_n) ↑ α ⩽ ϕ(Ω) < ∞. Set g_n = sup_{k⩽n} f_k. Then 0 ⩽ g_n ↑ f := sup_{k⩾1} f_k. For given n ⩾ 1, put A_k = {ω : g_n(ω) = f_k(ω)} (1 ⩽ k ⩽ n). Then ⋃_{k=1}^n A_k = Ω. Moreover, let B_k = A_k − ⋃_{i=1}^{k−1} A_i. Then {B_k} are mutually disjoint and ∑_{k=1}^n B_k = Ω. Thus, ∀A ∈ A,

∫_A g_n dμ = ∑_{k=1}^n ∫_{A∩B_k} f_k dμ ⩽ ∑_{k=1}^n ϕ(B_k ∩ A) = ϕ(A).

Hence, ∫_A f dμ ⩽ ϕ(A) by the monotone convergence theorem. By this and the definition of α, it follows that μ(f) = α.

Let

ϕ_c(A) = ∫_A f dμ,  ϕ_s(A) = ϕ(A) − ∫_A f dμ.

For any n ⩾ 1, let ϕ_n = ϕ_s − (1/n)μ. From the proof of Hahn's decomposition theorem, it follows that there exists D_n ∈ A such that

ϕ_n(D_n ∩ A) ⩽ 0,  ϕ_n(D_n^c ∩ A) ⩾ 0,  ∀A ∈ A.

Let D = ⋂_{n=1}^∞ D_n. Then ∀n,

D ⊂ D_n,  ϕ_s(D ∩ A) ⩽ (1/n) μ(D ∩ A).

Thus, ϕ_s(D ∩ A) = 0 for any A ∈ A.

To prove μ(D^c) = 0, it suffices to show μ(D_n^c) = 0 for any n. In fact,

∫_A (f + (1/n) 1_{D_n^c}) dμ = ϕ_c(A) + (1/n) μ(A ∩ D_n^c)
 = ϕ(A) − ϕ_s(A) + (1/n) μ(A ∩ D_n^c)
 = ϕ(A) − ϕ_n(A ∩ D_n^c) − ϕ_s(A ∩ D_n)
 ⩽ ϕ(A) − ϕ_s(A ∩ D_n) ⩽ ϕ(A).

It follows that f + (1/n) 1_{D_n^c} ∈ Φ. Thus, α ⩾ ∫_Ω (f + (1/n) 1_{D_n^c}) dμ = ∫_Ω f dμ + (1/n) μ(D_n^c) ⩾ α. This together with μ(f) = α implies (1/n) μ(D_n^c) = 0, hence μ(D_n^c) = 0.
(ii) Let μ and ϕ be σ-finite measures. There exist mutually disjoint {A_n}_{n⩾1} such that ⋃_{n=1}^∞ A_n = Ω and μ(A_n), ϕ(A_n) < ∞ (∀n). From (i), it follows that there exist ϕ_c^{(n)} and ϕ_s^{(n)} such that

ϕ(A_n ∩ •) = ϕ_c^{(n)}(A_n ∩ •) + ϕ_s^{(n)}(A_n ∩ •),  ϕ_c^{(n)}(A_n ∩ •) = ∫_{A_n∩•} f^{(n)} dμ.

Let N_n be a μ-null set such that ϕ_s^{(n)}(N_n^c ∩ A ∩ A_n) = 0 for any A ∈ A. Set

f = ∑_{n=1}^∞ 1_{A_n} f^{(n)},  ϕ_c(A) = ∫_A f dμ,  ϕ_s(A) = ∑_{n=1}^∞ ϕ_s^{(n)}(A_n ∩ A).

Again, let N = ⋃_{n=1}^∞ N_n. Then ∀A ∈ A, we have ϕ_s^{(n)}(N^c ∩ A ∩ A_n) ⩽ ϕ_s^{(n)}(N_n^c ∩ A ∩ A_n) = 0. It follows that ϕ_s(N^c ∩ A) = ∑_n ϕ_s^{(n)}(N^c ∩ A ∩ A_n) = 0. Hence, ϕ_s and μ are singular.

(3) General case. From Hahn's decomposition theorem, we have ϕ = ϕ⁺ − ϕ⁻. By (ii), ϕ⁺ and ϕ⁻ have decompositions ϕ⁺ = ϕ⁺_c + ϕ⁺_s and ϕ⁻ = ϕ⁻_c + ϕ⁻_s. Then ϕ = (ϕ⁺_c − ϕ⁻_c) + (ϕ⁺_s − ϕ⁻_s). □

As a direct consequence of Lebesgue’s decomposition theorem, we


have the following result.
Theorem 3.50 (Radon–Nikodym theorem). Let (Ω, A , μ) be a
σ-finite measure space, and let ϕ be a σ-finite signed measure on A .
If ϕ is absolutely continuous with respect to μ, then there exists a
μ-a.e. unique measurable function f such that ϕ = μf .
This result can be extended to signed measures that are not σ-finite.
Theorem 3.51 (Generalization of the Radon–Nikodym
theorem). Let (Ω, A , μ) be a σ-finite measure space and ϕ be a
signed measure on A . If ϕ is absolutely continuous with respect to
μ, then there exists a μ-a.e. unique measurable function f such that
ϕ = μf .
Proof. By the σ-finiteness of μ and Hahn’s decomposition of ϕ, we
may assume that μ is a finite measure and ϕ is a measure. In this case,
we apply Theorem 3.50 with Ω replaced by the largest measurable
set on which the restriction of ϕ is σ-finite, then the function f is
defined as ∞ outside this set.
Let

C = {A ∈ A : ϕ is σ-finite on A}.

Set s = sup_{B∈C} μ(B) and take {B_n}_{n⩾1} ⊂ C such that μ(B_n) ↑ s. Let B = ⋃_{n=1}^∞ B_n. Then B ∈ C and s = μ(B). Since ϕ is σ-finite on B ∩ A, Theorem 3.50 implies that there exists f_1 such that ϕ(A ∩ B) = ∫_{A∩B} f_1 dμ, A ∈ A. Let

f(ω) = f_1(ω) for ω ∈ B,  f(ω) = ∞ for ω ∉ B.

Then for any A ∈ A with μ(A ∩ B^c) > 0, we have ∫_A f dμ = ∞. On the other hand, if μ(A ∩ B^c) > 0, then ϕ(A ∩ B^c) = ∞. If not, then ϕ would be σ-finite on B ∪ A and μ(B ∪ A) = μ(B^c ∩ A) + μ(B) > s, which contradicts the fact that s = sup_{B∈C} μ(B). Thus, ∀A ∈ A,

∫_A f dμ = ∫_{A∩B} f dμ + ∫_{A∩B^c} f dμ = ϕ(A ∩ B) + ∞ · μ(A ∩ B^c) = ϕ(A). □


The above theorem leads to the notion of Radon–Nikodym


derivative.
Definition 3.52. Let (Ω, A , μ) be a σ-finite measure space and let
ϕ be a signed measure which is absolutely continuous with respect
to μ. Then there exists a μ-a.e. unique measurable function f
such that ϕ = μf (equivalently, dϕ = f dμ). The function f is
called the Radon–Nikodym derivative of ϕ with respect to μ and is denoted by dϕ/dμ.

The following result is reformulated from Proposition 3.16.


Corollary 3.53. Let ν and μ be σ-finite measures on A such that ν ≪ μ. If f is a measurable function, then the integral of f with respect to ν exists if and only if the integral of f · (dν/dμ) with respect to μ exists, and in this case ∫_A f dν = ∫_A f (dν/dμ) dμ for any A ∈ A.
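On a finite space, the Radon–Nikodym derivative is just a pointwise ratio of masses, and Corollary 3.53 reduces to a weighted sum. A minimal sketch (Python with NumPy; the masses are arbitrary assumptions, with ν ≪ μ by construction):

import numpy as np

mu = np.array([0.1, 0.4, 0.2, 0.3])
nu = np.array([0.2, 0.2, 0.0, 0.6])   # nu vanishes wherever mu does, so nu << mu

f = np.divide(nu, mu, out=np.zeros_like(nu), where=mu > 0)  # f = dnu/dmu

A = np.array([True, False, True, True])       # an arbitrary test set
print(nu[A].sum(), (f * mu)[A].sum())         # nu(A) = int_A (dnu/dmu) dmu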

3.5.3 Decomposition theorem of distribution function


By applying Lebesgue’s decomposition theorem to the L–S measure
induced by a distribution function with respect to the Lebesgue mea-
sure dx, we derive the following decomposition theorem for distribu-
tion function.
Theorem 3.54. Any distribution function F on ℝⁿ can be decomposed as the sum of three parts: F = F_c + F_d + F_s, where the L–S measure induced by F_c is absolutely continuous with respect to dx, the L–S measure induced by F_d is supported on a finite or countable set, and the L–S measure induced by F_s is singular with respect to dx and has null measure on singletons. This decomposition is unique in the sense of the induced L–S measures. The functions F_c, F_d, and F_s are called the absolutely continuous part, the discrete part, and the singular part of F, respectively.
Proof. Let μ be the L–S measure induced by F. By Lebesgue's decomposition theorem, we have μ = μ_c + μ̃_s, where μ_c ≪ dx and μ̃_s is singular with respect to dx. Let A = {x ∈ ℝⁿ : μ̃_s({x}) > 0}. Then A is at most countable. Define μ_d(B) = ∑_{x∈B∩A} μ̃_s({x}) and let F_d be the distribution function of μ_d. Finally, let μ_s = μ̃_s − μ_d. Then μ_s is singular with respect to dx and μ_s({x}) = 0 for every x ∈ ℝⁿ, and its distribution function is F_s = F − F_c − F_d.

The uniqueness of the decomposition follows from that of the Lebesgue decomposition and the properties of F_s and F_d. □

3.6 Exercises

1. For a nonnegative measurable function f on a measure space (Ω, A, μ), let

∫̄_Ω f dμ = inf {∫_Ω g dμ : g ⩾ f, g is a simple function}.

Exemplify that ∫̄_Ω f dμ and ∫_Ω f dμ may not be identical.
2. Prove Theorem 3.8.
3. Exemplify that when f is a complex measurable function whose
integral exists and c is a complex, the integral of cf may not
exist. What happens when f is integrable?
4. Let f be a complex measurable function such that μ(f) exists. Prove |∫_Ω f dμ| ⩽ ∫_Ω |f| dμ.
5. Exemplify that we cannot get rid of the domination condition g ⩽ f_n in Theorem 3.10(1).
6. Let {f_{nm}}_{n,m⩾1} be a family of nonnegative numbers. Prove

∑_{n=1}^∞ lim inf_{m→∞} f_{nm} ⩽ lim inf_{m→∞} ∑_{n=1}^∞ f_{nm}.

7. Exemplify that for a sequence of random variables, the convergence in L^r(P) does not imply the a.s. convergence, and vice versa.
8. Let (Ω, A, μ) be a measure space, and let ϕ be a finite signed measure such that ϕ ≪ μ. Then ϕ(A_n) → 0 for any {A_n} ⊂ A with μ(A_n) → 0. Exemplify that this assertion does not hold when ϕ is σ-finite.

(Hint: Let Ω = (0, 1), μ = dx, ϕ = μ_f with f(x) = 1/x, A_n = (0, 1/n).)


9. If {ξ_n} converges in distribution to ξ, then E|ξ| ⩽ lim inf_{n→∞} E|ξ_n|.

10. Prove Corollary 3.35.


11. Prove Proposition 3.37.
12. Let ξ ⩾ 0 be such that Eξ² < ∞. Prove P(ξ > 0) ⩾ (Eξ)²/Eξ².


13. Let A_1, …, A_n be events and A = ⋃_{i=1}^n A_i. Prove

(a) 1_A ⩽ ∑_{i=1}^n 1_{A_i};
(b) P(A) ⩾ ∑_{i=1}^n P(A_i) − ∑_{i<j} P(A_i ∩ A_j);
(c) P(A) ⩽ ∑_{i=1}^n P(A_i) − ∑_{i<j} P(A_i ∩ A_j) + ∑_{i<j<k} P(A_i ∩ A_j ∩ A_k).

14. Apply Jensen's inequality to prove that the geometric mean is dominated by the arithmetic mean: for a_1, …, a_n ⩾ 0 and α_1, …, α_n ⩾ 0 such that α_1 + ⋯ + α_n = 1, we have ∏_{i=1}^n a_i^{α_i} ⩽ ∑_{i=1}^n α_i a_i.

15. Let ξ > 0. Prove

lim_{t→∞} ∫_{[ξ>t]} (t/ξ) dP = 0,  lim_{t→0} ∫_{[ξ>t]} (t/ξ) dP = 0.

16. Let ξ and η be independent random variables with distribu-


tion functions F and G, respectively. Formulate the distribution
function of ξ + η using F and G.
17. Random variables ξ and η are independent if and only if for any
f, g,

Ef (ξ)g(η) = Ef (ξ)Eg(η).



18. (a) If events A_1, A_2, … satisfy ∑_{n=1}^∞ P(A_n) < ∞, then

P(⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k) = 0.

(b) If events A_1, A_2, … are independent and ∑_{n=1}^∞ P(A_n) = ∞, then

P(⋂_{n=1}^∞ ⋃_{k=n}^∞ A_k) = 1.



19. Let p_n ∈ [0, 1). Apply the previous exercise to prove that ∏_{n=1}^∞ (1 − p_n) = 0 if and only if ∑_{n=1}^∞ p_n = ∞.

20. Let ξ be a random variable taking values in the nonnegative integers. Prove Eξ = ∑_{n⩾1} P(ξ ⩾ n). What happens when ξ takes values in the integers?
21. Let ξ be a nonnegative random variable. Prove Eξ = ∫_0^∞ P(ξ ⩾ x) dx. What happens for a general real random variable?
22. Prove ξ_n → ξ in probability if and only if

E(|ξ_n − ξ| / (1 + |ξ_n − ξ|)) → 0.

23. Let φ ⩾ 0 be such that lim_{x→∞} φ(x)/x = ∞, and let T be an index set. If Eφ(|ξ_t|) ⩽ C < ∞ for all t ∈ T, then {ξ_t, t ∈ T} is uniformly integrable.
24. Let {f_n} be a sequence of real measurable functions on (Ω, A, μ). If sup_{n⩾1} μ(|f_n|^r) < ∞ for some r > 0, then ∀s ∈ (0, r), {|f_n|^s} is uniformly continuous in integral.
25. Let (Ω, A , μ) be a finite measure space, and let {ft : t ∈ T }
be a family of random variables such that {μ(|ft |) : t ∈ T } is
bounded. Then {ft }t∈T is uniformly continuous in integral if and
only if it is uniformly integrable.

26. Let r ∈ (0, ∞) and let (Ω, A , μ) be a measure space. Prove that
the class of integrable simple functions is dense in Lr (μ).
27. Let 1/p + 1/q = 1, p, q > 1. Prove ‖f‖_p = sup{μ(fg) : ‖g‖_q ⩽ 1}.
28. If a sequence of random variables {ξn }n1 is uniformly bounded,
then ξn converges in probability if and only if it converges in
Lr (P), where r ∈ (0, ∞).
29. For a measurable function f, define the essential supremum by

‖f‖_∞ = inf {M : μ({ω : |f(ω)| > M}) = 0}.

(a) Prove that ‖·‖_∞ satisfies the triangle inequality.
(b) If μ(Ω) < ∞, then ‖f‖_∞ = lim_{r→∞} ‖f‖_r.

30. Assume Eξ² < ∞. Prove that c = Eξ attains the minimum of E(ξ − c)² over c ∈ ℝ.
31. If ξ and η are independent random variables having finite expectations and E|ξ + η|² < ∞, prove E(|ξ|² + |η|²) < ∞.
32. Let ξ be a random variable with m := Eξ ∈ ℝ and σ² := Dξ ∈ (0, ∞).

(a) Prove P(ξ − m ⩾ t) ⩽ σ²/(σ² + t²), t ⩾ 0.
(b) Prove P(|ξ − m| ⩾ t) ⩽ 2σ²/(σ² + t²), t ⩾ 0.

33. Let f be a convex function on ℝ, and let ξ be a random variable with finite expectation. Prove that Ef(ξ) exists and f(Eξ) ⩽ Ef(ξ).
34. Let ξ be a random variable with finite expectation. If there is a
strictly convex function φ on R such that Eφ(ξ) = φ(Eξ), then
ξ is an a.s. constant.

35. If independent random variables ξ and η have distributions μ and ν, respectively, then for A ∈ B and B ∈ B²,

P((ξ, η) ∈ B) = ∫_ℝ P((x, η) ∈ B) μ(dx) = ∫_ℝ P((ξ, y) ∈ B) ν(dy)

and

P(ξ ∈ A, (ξ, η) ∈ B) = ∫_A P((x, η) ∈ B) μ(dx).

36. (Uniform distribution on Cantor sets) Let C be the Cantor set defined in Exercise 31 of Chapter 2. We call

F(x) := 0 for x ⩽ 0;  1 for x ⩾ 1;  1/2 for x ∈ [1/3, 2/3];  1/4 for x ∈ [1/9, 2/9];  3/4 for x ∈ [7/9, 8/9];  and so on,

the uniform distribution function on C. Prove

(a) F is continuous;
(b) F is singular with respect to the Lebesgue measure.

37. Let μ_1 and μ_2 be finite signed measures. Set μ_1 ∨ μ_2 = μ_1 + (μ_2 − μ_1)⁺ and μ_1 ∧ μ_2 = μ_1 − (μ_1 − μ_2)⁺. Then μ_1 ∨ μ_2 is the minimal signed measure ν such that ν ⩾ μ_i (i = 1, 2), and μ_1 ∧ μ_2 is the maximal signed measure ν such that ν ⩽ μ_i (i = 1, 2).
38. Let μ be a σ-finite measure on (Ω, A ) such that A contains all
singletons. Then the set

{x ∈ Ω : μ({x}) > 0}

is at most countable.

39. Let {A_n} be an increasing sequence of σ-algebras in Ω, and let A = σ(⋃_n A_n). Assume that μ is a finite measure and ν is a probability measure on (Ω, A). Let μ_n, ν_n be the restrictions of μ, ν to A_n, respectively. If μ_n ≪ ν_n and f = lim_n dμ_n/dν_n, prove

μ(A) = ∫_A f dν + μ(A ∩ {f = ∞}),  A ∈ A.
Chapter 4

Product Measure Space

Why should we study multi-dimensional spaces and even infinite-


dimensional spaces? Let's consider a system of n random particles in the 3-dimensional real world, where the location of each particle is a 3-dimensional random variable, and the joint distribution of these particles is a probability measure on the 3n-dimensional
Euclidean space. Another example is to consider a particle randomly
moving on the real line, at each time its location is a 1-dimensional
random variable. If we want to describe the movement of the parti-
cle, we have to clarify its path when time varies, which is an infinite-
dimensional random variable (stochastic process), whose distribution
is a probability measure on an infinite product space. Note that the
finite product measure space has been introduced in Section 1.1.5
of Chapter 1 and Corollary 3.14 of Chapter 3. We will extend the
notion to the infinite product case.
In this chapter, we first establish Fubini’s theorem, which reduces
the integral with respect to a product measure to the iterated inte-
grals with respect to the marginal measures, as we have already stud-
ied in Lebesgue’s measure theory. We then extend this theorem to
nonproduct measures induced by a marginal measure and transition
measures. In particular, the construction of probability measures on
infinite product measurable spaces is fundamental in the study of
stochastic processes.


4.1 Fubini’s Theorem

Let us recall how to reduce a multiple integral on ℝ² to iterated integrals, as we learnt in calculus or Lebesgue's integral theory. Let A be a measurable subset of ℝ² and f be an integrable function with respect to the 2-dimensional Lebesgue measure. To compute ∫_A f(x_1, x_2) dx_1 dx_2, we first fix x_1 and determine the range of x_2, i.e. A_{x_1} := {x_2 ∈ ℝ : (x_1, x_2) ∈ A}. Then the multiple integral can be calculated as ∫_ℝ dx_1 ∫_{A_{x_1}} f(x_1, x_2) dx_2. The aim of this section is to realize this procedure for integrals on a general product measure space.
Let (Ω1 , A1 , μ1 ) and (Ω2 , A2 , μ2 ) be σ-finite measure spaces. By
Corollary 3.14 of Chapter 3, the product measure space (Ω1 ×
Ω2 , A1 × A2 , μ1 × μ2 ) is σ-finite as well. Given A ∈ A1 × A2 and a
measurable function f which is integrable with respect to μ1 × μ2 ,
we will prove

∫_A f d(μ_1 × μ_2) = ∫_{Ω_1} μ_1(dω_1) ∫_{A_{ω_1}} f(ω_1, ω_2) μ_2(dω_2),  (4.1)

where Aω1 is the section of A at ω1 . This formula is called Fubini’s


theorem. For this, we first introduce the section of a set.
Definition 4.1. Let A ⊂ Ω1 × Ω2 . ∀ω1 ∈ Ω1 ,
Aω1 := {ω2 ∈ Ω2 : (ω1 , ω2 ) ∈ A} ⊂ Ω2
is called the section of set A at ω1 . Similarly, we can define Aω2 ⊂ Ω1 ,
∀ω2 ∈ Ω2 .
Clearly, sections of sets have the following properties.
Property 4.2.

(1) A ∩ B = ∅ ⇒ A_{ω_i} ∩ B_{ω_i} = ∅.
(2) A ⊃ B ⇒ A_{ω_i} ⊃ B_{ω_i}.
(3) (⋃_n A^{(n)})_{ω_i} = ⋃_n (A^{(n)})_{ω_i}.
(4) (⋂_n A^{(n)})_{ω_i} = ⋂_n (A^{(n)})_{ω_i}.
(5) (A − B)_{ω_i} = A_{ω_i} − B_{ω_i}.
To prove (4.1), we need to clarify that the right-hand side of this formula makes sense by verifying that ∀ω_1 ∈ Ω_1, we have A_{ω_1} ∈ A_2, that ∫_{A_{ω_1}} f(ω_1, ω_2) μ_2(dω_2) is A_1-measurable in ω_1, and that it has an integral with respect to μ_1.

Theorem 4.3. Let A ∈ A1 × A2 . Then for any ωi ∈ Ωi , i = 1, 2, we


have Aω1 ∈ A2 and Aω2 ∈ A1 .

Proof. Let

M = {A ∈ A1 × A2 : ∀ω1 ∈ Ω1 , Aω1 ∈ A2 ; ∀ω2 ∈ Ω2 , Aω2 ∈ A1 } .

Clearly, M includes the semi-algebra {A1 × A2 : A1 ∈ A1 , A2 ∈ A2 }.


By Property 4.2, we see that M is a σ-algebra, so it includes
A1 × A2 . 

Recall that for a function f and a σ-algebra A , f ∈ A means


that f is A -measurable.

Theorem 4.4. For any A1 × A2 -measurable function f and any


ωi ∈ Ωi , i = 1, 2, we have fω1 (·) := f (ω1 , ·) ∈ A2 and fω2 (·) :=
f (·, ω2 ) ∈ A1 .

Proof. For any B ∈ B, we have

f_{ω_1}^{-1}(B) = {ω_2 ∈ Ω_2 : f_{ω_1}(ω_2) ∈ B} = {ω_2 ∈ Ω_2 : (ω_1, ω_2) ∈ f^{-1}(B)} = (f^{-1}(B))_{ω_1},

which is in A_2 by Theorem 4.3. So, f_{ω_1} is A_2-measurable. Similarly, f_{ω_2} is A_1-measurable. □

The functions f_{ω_1} and f_{ω_2} are called the section functions of f at ω_1 and ω_2, respectively.

Theorem 4.5. Let f be a nonnegative measurable function on (Ω_1 × Ω_2, A_1 × A_2). Then

∫_{Ω_1} f(ω_1, ω_2) μ_1(dω_1) ∈ A_2,  ∫_{Ω_2} f(ω_1, ω_2) μ_2(dω_2) ∈ A_1.

Proof. By the construction of measurable functions and the properties of integrals, we only need to prove for f = 1_A with A ∈ A_1 × A_2, and by the monotone class theorem, it suffices to consider A = A_1 × A_2, A_i ∈ A_i, i = 1, 2. In this case, we have

∫_{Ω_1} f(ω_1, ω_2) μ_1(dω_1) = μ_1(A_1) 1_{A_2} ∈ A_2,  ∫_{Ω_2} f(ω_1, ω_2) μ_2(dω_2) = μ_2(A_2) 1_{A_1} ∈ A_1. □

Theorem 4.6 (Fubini's theorem). Let f be an A_1 × A_2-measurable function having integral with respect to μ_1 × μ_2. Then

∫_{Ω_1×Ω_2} f d(μ_1 × μ_2) = ∫_{Ω_1} (∫_{Ω_2} f(ω_1, ω_2) μ_2(dω_2)) μ_1(dω_1) = ∫_{Ω_2} (∫_{Ω_1} f(ω_1, ω_2) μ_1(dω_1)) μ_2(dω_2).

Proof. By symmetry, we only prove the first equation.

(1) The equation holds obviously for f = 1_{A_1×A_2} (A_i ∈ A_i, i = 1, 2). From this, the monotone class theorem implies that the equation holds for f = 1_A (A ∈ A_1 × A_2).

(2) By the linearity of the integral and step (1), we obtain the equation for a simple function f. Combining this with Theorem 2.12(4) of Chapter 2 and Theorem 3.5 of Chapter 3, we prove the equation for a nonnegative measurable function f.

(3) For a general measurable function f such that (μ_1 × μ_2)(f) exists, assume, for instance, (μ_1 × μ_2)(f⁻) < ∞. By step (2), we have

∞ > ∫_{Ω_1×Ω_2} f⁻ d(μ_1 × μ_2) = ∫_{Ω_1} μ_1(dω_1) ∫_{Ω_2} f⁻(ω_1, ·) dμ_2.

Thus, for μ_1-a.e. ω_1, ∫_{Ω_2} f⁻(ω_1, ·) dμ_2 < ∞, so that

∫_{Ω_2} f(ω_1, ·) dμ_2 = ∫_{Ω_2} f⁺(ω_1, ·) dμ_2 − ∫_{Ω_2} f⁻(ω_1, ·) dμ_2.

Combining this with the linearity of the integral and step (2), we finish the proof. □

By applying Theorem 4.6 to f 1_A in place of f, we derive (4.1). Moreover, by induction, Fubini's theorem can be extended to multi-product measure spaces.

Let (Ω_i, A_i, μ_i), 1 ⩽ i ⩽ n, be σ-finite measure spaces and f be a measurable function having integral on the product measure space (Ω, A, μ) := (Ω_1 × ⋯ × Ω_n, A_1 × ⋯ × A_n, μ_1 × ⋯ × μ_n). Then

∫_Ω f dμ = ∫_{Ω_{i_1}} dμ_{i_1} ∫_{Ω_{i_2}} dμ_{i_2} ⋯ ∫_{Ω_{i_n}} f dμ_{i_n},

where (i_1, …, i_n) is any permutation of (1, …, n). This means that all integrals on the right-hand side exist, and the iterated integral equals the multiple integral on the left-hand side.
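When both factors are finite spaces, the iterated-integral formula is simply the freedom to sum a finite double array in either order. The following sketch (Python with NumPy; the weights and integrand are arbitrary assumptions) illustrates Theorem 4.6:

import numpy as np

rng = np.random.default_rng(5)
mu1 = rng.uniform(0.1, 1.0, size=4)   # point masses on Omega_1
mu2 = rng.uniform(0.1, 1.0, size=5)   # point masses on Omega_2
f = rng.normal(size=(4, 5))           # f(omega_1, omega_2)

order1 = (f @ mu2) @ mu1   # integrate over omega_2 first, then omega_1
order2 = (mu1 @ f) @ mu2   # integrate over omega_1 first, then omega_2
print(order1, order2, np.isclose(order1, order2))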

4.2 Infinite Product Probability Space

Let {(Ω_t, A_t, P_t)}_{t∈T} be a family of probability spaces, where T is an infinite index set. Let

Ω_T = ∏_{t∈T} Ω_t = {ω : ω = (ω_t)_{t∈T}, ω_t ∈ Ω_t, t ∈ T}.

We intend to define the product σ-algebra A_T = ∏_{t∈T} A_t and the product probability measure P_T = ∏_{t∈T} P_t.
Following the line of the finite product setting, one may define A_T as the σ-algebra generated by the class of rectangles {∏_{t∈T} A_t : A_t ∈ A_t, t ∈ T}. However, this class is not a semi-algebra as required by the measure extension theorem, and for a general rectangle ∏_{t∈T} A_t, the candidate probability ∏_{t∈T} P_t(A_t) is usually ill-defined. For this reason, we only allow the set {t ∈ T : A_t ≠ Ω_t} to be finite, so a natural way is to restrict ourselves to the following class of measurable cylindrical sets. This also explains why we only study infinite product probability measures rather than infinite product measures.
Definition 4.7. A set like

B^{T_N} × ∏_{t∉T_N} Ω_t

is called a measurable cylindrical set, where T_N ⋐ T (i.e. T_N is a finite subset of T) and B^{T_N} ∈ A_{T_N} := ∏_{t∈T_N} A_t. In this case, B^{T_N} is called the base of the cylindrical set (it is not unique!).

Let A^T be the total of measurable cylindrical sets, which is obviously an algebra. Define the infinite product σ-algebra by

A_T = ∏_{t∈T} A_t := σ(A^T).

Theorem 4.8. There exists a unique probability P on (Ω_T, A_T) such that

P(A^{T_N} × Ω_{T_N^c}) = (∏_{t∈T_N} P_t)(A^{T_N}),  (4.2)

where

T_N ⋐ T,  Ω_{T_N^c} = ∏_{t∉T_N} Ω_t,  A^{T_N} ∈ A_{T_N}.

Proof. (1) Formula (4.2) defines a set function on A^T, so we need to prove that this function is independent of the choice of representation of the cylindrical set.

Let A^{T_N} × Ω_{T_N^c} = A^{T′_N} × Ω_{(T′_N)^c} with T_N, T′_N ⋐ T, A^{T_N} ∈ A_{T_N}, A^{T′_N} ∈ A_{T′_N}. Let T″_N = T_N ∩ T′_N. Then there exists A^{T″_N} ∈ ∏_{t∈T″_N} A_t such that

A^{T_N} = A^{T″_N} × ∏_{t∈T_N−T″_N} Ω_t,  A^{T′_N} = A^{T″_N} × ∏_{t∈T′_N−T″_N} Ω_t.

Hence, (∏_{t∈T_N} P_t)(A^{T_N}) = (∏_{t∈T″_N} P_t)(A^{T″_N}) = (∏_{t∈T′_N} P_t)(A^{T′_N}).

(2) P is finitely additive.

Assume {A_k}_{k=1}^n ⊂ A^T are mutually disjoint. For 1 ⩽ k ⩽ n, let T_k be a finite subset of T and A^{T_k} ∈ ∏_{t∈T_k} A_t such that A_k = A^{T_k} × Ω_{T_k^c}. Then ∑_{k=1}^n A_k =: A_0 ∈ A^T. Let T_0 ⊂ T be a finite set and A^{T_0} ∈ ∏_{t∈T_0} A_t such that A_0 = A^{T_0} × Ω_{T_0^c}.

Set T_N = ⋃_{k=0}^n T_k and A_k^{T_N} = A^{T_k} × ∏_{t∈T_N−T_k} Ω_t. Then A_k = A_k^{T_N} × Ω_{(T_N)^c}, 0 ⩽ k ⩽ n. Clearly, {A_k^{T_N}}_{k=1}^n ⊂ ∏_{t∈T_N} A_t are mutually disjoint, and ∑_{k=1}^n A_k^{T_N} = A_0^{T_N}. For any finite T′ ⊂ T, let P_{T′} = ∏_{t∈T′} P_t be the product measure on (Ω_{T′}, A_{T′}), and let P^{T′} be the finitely additive function on A^{T\T′} defined by (4.2) with T \ T′ as its total index set. Since P_{T_N} := ∏_{t∈T_N} P_t is a probability measure, it follows from the definition of P and the finite additivity of measures that ∑_{k=1}^n P(A_k) = P(A_0).
(3) Since A^T is a set algebra and P is finitely additive, to get the σ-additivity of P, we only need to prove that it is continuous at ∅. We use the method of proof by contradiction.

Let {A_n}_{n⩾1} ⊂ A^T be decreasing, and suppose ∃ε > 0 such that P(A_n) ⩾ ε for every n ⩾ 1. We prove ⋂_{n=1}^∞ A_n ≠ ∅. Note that for any n ⩾ 1, there exist a finite set T_n ⊂ T and A_n^{T_n} ∈ ∏_{t∈T_n} A_t such that A_n = A_n^{T_n} × ∏_{t∉T_n} Ω_t.

Let T_∞ = ⋃_{n=1}^∞ T_n. Then T_∞ is countable, denoted by T_∞ = {t_1, t_2, …}. To prove ⋂_{n=1}^∞ A_n ≠ ∅, we only need to find (ω̄_{t_1}, …, ω̄_{t_n}, …) ∈ ∏_{t∈T_∞} Ω_t such that ⋂_{j=1}^∞ A_j(ω̄_{t_1}, …, ω̄_{t_n}, …) ≠ ∅, where A_j(ω̄_{t_1}, …, ω̄_{t_n}, …) is the section of A_j at (ω̄_{t_1}, …, ω̄_{t_n}, …).

First, we set B_1^{(j)} = {ω_{t_1} ∈ Ω_{t_1} : P^{{t_1}}(A_j(ω_{t_1})) ⩾ ε/2}. Since

P^{{t_1}}(A_j(ω_{t_1})) = P_{T_j\{t_1}}(A_j^{T_j}(ω_{t_1}))

is A_{t_1}-measurable, B_1^{(j)} ∈ A_{t_1}. Fubini's theorem gives

ε ⩽ P(A_j) = ∫_{Ω_{t_1}} P^{{t_1}}(A_j(ω_{t_1})) dP_{t_1} ⩽ P_{t_1}(B_1^{(j)}) + ε/2,

which implies that P_{t_1}(B_1^{(j)}) ⩾ ε/2. Since {B_1^{(j)}}_{j=1}^∞ is decreasing, we have P_{t_1}(⋂_{j=1}^∞ B_1^{(j)}) ⩾ ε/2, so ∃ω̄_{t_1} ∈ ⋂_{j=1}^∞ B_1^{(j)}, that is, P^{{t_1}}(A_j(ω̄_{t_1})) ⩾ ε/2 for every j ⩾ 1.

In general, assume for some k ⩾ 1 we have (ω̄_{t_1}, …, ω̄_{t_k}) ∈ Ω_{t_1} × ⋯ × Ω_{t_k} such that

P^{{t_1,…,t_k}}(A_j(ω̄_{t_1}, …, ω̄_{t_k})) ⩾ ε/2^k,  ∀j ⩾ 1.

Let

B_{k+1}^{(j)} = {ω_{t_{k+1}} ∈ Ω_{t_{k+1}} : P^{{t_1,…,t_{k+1}}}(A_j(ω̄_{t_1}, …, ω̄_{t_k}, ω_{t_{k+1}})) ⩾ ε/2^{k+1}}.

Then it follows from Fubini's theorem and the induction hypothesis that

ε/2^k ⩽ P^{{t_1,…,t_k}}(A_j(ω̄_{t_1}, …, ω̄_{t_k})) = ∫_{Ω_{t_{k+1}}} P^{{t_1,…,t_{k+1}}}(A_j(ω̄_{t_1}, …, ω̄_{t_k}, ω_{t_{k+1}})) dP_{t_{k+1}}(ω_{t_{k+1}}) ⩽ P_{t_{k+1}}(B_{k+1}^{(j)}) + ε/2^{k+1}.

Thus, P_{t_{k+1}}(B_{k+1}^{(j)}) ⩾ ε/2^{k+1} for every j ⩾ 1. Hence, ∃ω̄_{t_{k+1}} ∈ ⋂_{j=1}^∞ B_{k+1}^{(j)}, i.e.

P^{{t_1,…,t_{k+1}}}(A_j(ω̄_{t_1}, …, ω̄_{t_{k+1}})) ⩾ ε/2^{k+1},  ∀j ⩾ 1.

By induction, there exist {ω̄_{t_i} ∈ Ω_{t_i}}_{i⩾1} such that ⋂_{j=1}^∞ A_j(ω̄_{t_1}, …, ω̄_{t_n}) ≠ ∅ for every n ⩾ 1.

Take and fix ω̃ ∈ ∏_{t∉T_∞} Ω_t and let ω ∈ ∏_{t∈T} Ω_t be given by

ω_t = ω̄_t for t ∈ T_∞,  ω_t = ω̃_t for t ∉ T_∞.

Then for any j ⩾ 1, there exists N_j such that T_j ⊂ {t_1, …, t_{N_j}}, and A_j(ω̄_{t_1}, …, ω̄_{t_{N_j}}) ≠ ∅, so ω ∈ A_j for every j ⩾ 1. Hence, ω ∈ ⋂_j A_j. □

Theorem 4.9. Let {F_t}_{t∈T} be a family of probability distribution functions. Then there exists a family of independent random variables {ξ_t}_{t∈T} such that ξ_t has distribution function F_t for each t ∈ T.

Proof. Let P_t be the probability measure induced by F_t on (ℝ, B). Let

(Ω_t, A_t, P_t) = (ℝ, B, P_t),  Ω = ℝ^T,  A = B^T,  P = ∏_{t∈T} P_t.

Then ξ_t(ω) := ω_t, t ∈ T, are random variables on (Ω, A, P), and

P(ξ_t < x_t) = P_t((−∞, x_t)) = F_t(x_t).

Obviously, they are independent. □
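Numerically, the construction in Theorem 4.9 corresponds to inverse-CDF sampling applied coordinatewise to independent uniform variables. A sketch (Python with NumPy; the two target distributions are arbitrary illustrative assumptions):

import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
u1 = rng.uniform(size=n)            # independent coordinates of the product space
u2 = rng.uniform(size=n)

xi1 = -np.log(1.0 - u1)             # F_1: exponential(1), F_1^{-1}(u) = -log(1-u)
xi2 = np.floor(3.0 * u2)            # F_2: uniform on {0, 1, 2}

# Independence check on a rectangle: P(xi1 < 1, xi2 = 0) vs the product.
lhs = np.mean((xi1 < 1.0) & (xi2 == 0))
rhs = np.mean(xi1 < 1.0) * np.mean(xi2 == 0)
print(lhs, rhs)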



4.3 Transition Measure and Transition Probability

As shown in Theorem 4.9, the joint distribution of independent ran-


dom variables is a product measure. In this section, we aim to con-
struct non-product measures on a product measurable space. To this
end, we introduce the notion of transition measure, in particular
transition probability, which describes the conditional distribution
of a random variable given the value of another random variable. See
Chapter 5 for details.

Definition 4.10. Let (Ωi , Ai )(i = 1, 2) be two measurable spaces.


A map λ : Ω1 × A2 → [0, ∞] is called a transition measure from
(Ω1 , A1 ) to (Ω2 , A2 ) or simply a transition measure on Ω1 × A2 if it
has the following two properties:

(1) λ(ω_1, ·) is a measure on (Ω_2, A_2) for any ω_1 ∈ Ω_1;
(2) λ(·, A) is an A_1-measurable function for any A ∈ A_2.

If there exists a partition {B_n}_{n∈ℕ} ⊂ A_2 of Ω_2 such that λ(ω_1, B_n) < ∞ (n ⩾ 1, ω_1 ∈ Ω_1), then λ is called σ-finite. Furthermore, if sup_{ω_1∈Ω_1} λ(ω_1, B_n) < ∞ (∀n ⩾ 1), λ is called uniformly σ-finite. If λ(ω_1, ·) is a probability measure for every ω_1 ∈ Ω_1, then λ is called a transition probability.

To construct non-product measures on a product space by


using transition measures, and to extend Fubini’s theorem for the
integral with respect to such a measure, we need the following
theorem.

Theorem 4.11. Let λ be a σ-finite transition measure on Ω_1 × A_2, and let f be a nonnegative measurable function on A_1 × A_2. Then ∫_{Ω_2} f(·, ω_2) λ(·, dω_2) is A_1-measurable.

Proof. By Theorem 2.12 of Chapter 2 and Theorem 3.8 of Chapter 3, we only need to prove for f being an indicator function. Moreover, by the monotone class theorem, it suffices to consider f = 1_{A×B} for some A ∈ A_1, B ∈ A_2. In this case, ∫_{Ω_2} f(·, ω_2) λ(·, dω_2) = λ(·, B) 1_A, which is obviously A_1-measurable. □

Theorem 4.12. Let (Ω_i, A_i) (i = 1, 2, …, n) be finitely many measurable spaces, and let

Ω^{(k)} = ∏_{i=1}^k Ω_i,  A^{(k)} = ∏_{i=1}^k A_i,  k = 1, …, n.

If λ_1 is a σ-finite measure on A_1 and λ_k is a σ-finite transition measure on Ω^{(k−1)} × A_k for each k = 2, …, n, then

λ^{(n)}(B) := ∫_{Ω_1} ⋯ ∫_{Ω_n} 1_B(ω_1, …, ω_n) λ_n(ω_1, …, ω_{n−1}, dω_n) ⋯ λ_1(dω_1)

for B ∈ A^{(n)} defines a measure on A^{(n)}. If {λ_i}_{i=2,…,n} are uniformly σ-finite, then λ^{(n)} is σ-finite.

Proof. By Theorem 4.11, λ^{(n)} is a well-defined nonnegative function on A^{(n)}. By applying Corollary 3.12 n times, we see that λ^{(n)} is σ-additive. Hence, it is a measure on A^{(n)}.

Now, let {λ_i}_{i=2,…,n} be uniformly σ-finite; we intend to prove that λ^{(n)} is σ-finite. By induction, we only prove for n = 2.

Since λ_1 is σ-finite and λ_2 is uniformly σ-finite, we find measurable partitions {A_n}_{n∈ℕ} of Ω_1 and {B_n}_{n∈ℕ} of Ω_2 such that λ_1(A_n) < ∞ and sup_{ω_1∈A_m} λ_2(ω_1, B_n) < ∞ for all m, n ⩾ 1. Then {A_i × B_j}_{i,j⩾1} is a measurable partition of Ω_1 × Ω_2 satisfying

λ^{(2)}(A_i × B_j) = ∫_{A_i} λ_1(dω_1) ∫_{B_j} λ_2(ω_1, dω_2) = ∫_{A_i} λ_2(ω_1, B_j) λ_1(dω_1) ⩽ sup_{ω_1∈A_i} λ_2(ω_1, B_j) · λ_1(A_i) < ∞. □

Theorem 4.13 (Generalized Fubini's theorem). Let Ω^{(n)}, A^{(n)} and λ^{(n)} be as in Theorem 4.12, and let f be a measurable function on (Ω^{(n)}, A^{(n)}). Then the integral λ^{(n)}(f) exists if and only if at least one of I(f⁺) and I(f⁻) is finite, where

I(f^±) := ∫_{Ω_1} ⋯ ∫_{Ω_n} f^±(ω_1, …, ω_n) λ_n(ω_1, …, ω_{n−1}, dω_n) ⋯ λ_1(dω_1),

and in this case we have

λ^{(n)}(f) = ∫_{Ω_1} ⋯ ∫_{Ω_n} f(ω_1, …, ω_n) λ_n(ω_1, …, ω_{n−1}, dω_n) ⋯ λ_1(dω_1).

Proof. When f = 1_B, B ∈ A^{(n)}, the formula follows from Theorem 4.12. Combining this with Theorem 2.12 of Chapter 2 and Theorem 3.8 of Chapter 3, we prove the formula for any nonnegative measurable function f. In general, by the definition of the integral and the formula for nonnegative functions, we have λ^{(n)}(f^±) = I(f^±), so that λ^{(n)}(f) exists if and only if at least one of I(f⁺) and I(f⁻) is finite, and in this case,

∫ f dλ^{(n)} = ∫ f⁺ dλ^{(n)} − ∫ f⁻ dλ^{(n)} = I(f⁺) − I(f⁻) = ∫_{Ω_1} ⋯ ∫_{Ω_n} (f⁺ − f⁻) dλ_n ⋯ dλ_1 = ∫_{Ω_1} ⋯ ∫_{Ω_n} f dλ_n ⋯ dλ_1. □

Finally, we construct probability measures on an infinite product measurable space by using a marginal distribution P_1 and a sequence of transition probability measures {P_n}_{n⩾2}.

Theorem 4.14 (Tulcea's theorem). Let (Ω_n, A_n)_{n∈ℕ} be a sequence of measurable spaces, and let (Ω^{(n)}, A^{(n)}) be defined as in Theorem 4.12. Set

Ω = ∏_{i=1}^∞ Ω_i,  A = ∏_{i=1}^∞ A_i.

Let P_1 be a probability measure on (Ω_1, A_1), and for each n ⩾ 2, let P_n be a transition probability on Ω^{(n−1)} × A_n. Then there exists a unique probability measure P on (Ω, A) such that

P(B^{(n)} × ∏_{k>n} Ω_k) = P^{(n)}(B^{(n)}),  n ∈ ℕ, B^{(n)} ∈ A^{(n)},

where P^{(n)}(B^{(n)}) = ∫_{Ω_1} ⋯ ∫_{Ω_n} 1_{B^{(n)}} dP_n ⋯ dP_1.

Proof. Let C^{(∞)} be the class of all measurable cylindrical sets, which is an algebra in Ω. As explained in the proof of Theorem 4.8, P is a well-defined finitely additive nonnegative function on C^{(∞)} with P(Ω) = 1. Moreover, by using Theorem 4.13 in place of Fubini's theorem, the same argument as in the proof of Theorem 4.8 implies the continuity of P at ∅. Hence, P is a probability on C^{(∞)}, which is uniquely extended to a probability measure on (Ω, A). □

4.4 Exercises

1. Prove Property 4.2.


2. Let (Ω, A, μ) be a measure space and let f be a nonnegative measurable function. Prove

μ(f) = ∫_0^∞ μ(f > r) dr = ∫_0^∞ μ(f ⩾ r) dr.

3. Let (Ωi , Ai , μi )(i = 1, 2) be measure spaces, and let A, B ∈


A1 × A2 . If μ2 {ω2 : (ω1 , ω2 ) ∈ A} = μ2 {ω2 : (ω1 , ω2 ) ∈ B} holds
for μ1 -a.e. ω1 , prove (μ1 × μ2 )(A) = (μ1 × μ2 )(B).
4. Let (Ω_i, A_i) (i = 1, 2, 3) be measurable spaces, λ be a σ-finite transition measure on Ω_2 × A_3, and f be a measurable function on (Ω_1 × Ω_3, A_1 × A_3). If the integral g(ω_1, ω_2) := ∫_{Ω_3} f(ω_1, ω_3) λ(ω_2, dω_3) exists for all (ω_1, ω_2) ∈ Ω_1 × Ω_2, prove that g is A_1 × A_2-measurable.
5. Let (Ωi , Ai , μi )(i = 1, 2) be σ-finite measure spaces. Prove that
for any A ∈ A1 × A2 , the following statements are equivalent:

(a) μ1 × μ2 (A) = 0.
(b) μ1 (Aω2 ) = 0, μ2 -a.e.
(c) μ2 (Aω1 ) = 0, μ1 -a.e.

6. If an infinite matrix P = (p_{ij})_{i,j∈ℕ} satisfies p_{ij} ⩾ 0 and ∑_{j∈ℕ} p_{ij} = 1, ∀i ∈ ℕ, then P is called a transition probability matrix. Let λ(i, A) = ∑_{j∈A} p_{ij}, i ∈ ℕ, A ⊂ ℕ. Prove that λ is a transition probability on ℕ × 2^ℕ.
7. Let μ be the counting measure on ℕ, i.e. μ({i}) = 1 for any i ∈ ℕ. Let

f(i, j) = i if j = i;  f(i, j) = −i if j = i + 1;  f(i, j) = 0 for other i, j.

Prove

∫_ℕ (∫_ℕ f(ω_1, ω_2) μ(dω_2)) μ(dω_1) = 0, but ∫_ℕ (∫_ℕ f(ω_1, ω_2) μ(dω_1)) μ(dω_2) = ∞.

Does this contradict Fubini's theorem?


8. Construct a function f : [0, 1] × [0, 1] → [0, 1] fulfilling the following conditions:

(a) ∀z ∈ [0, 1], the functions f(z, ·) and f(·, z) are Borel measurable on [0, 1];
(b) f is not Borel measurable on [0, 1] × [0, 1];
(c) both ∫_0^1 (∫_0^1 f(x, y) dy) dx and ∫_0^1 (∫_0^1 f(x, y) dx) dy exist, but they are not equal.

9. Let μ_k, ν_k be σ-finite measures on (Ω_k, A_k), respectively, with ν_k ≪ μ_k (k = 1, 2). Prove that ν_1 × ν_2 ≪ μ_1 × μ_2 and

d(ν_1 × ν_2)/d(μ_1 × μ_2)(ω_1, ω_2) = (dν_1/dμ_1)(ω_1) · (dν_2/dμ_2)(ω_2),  μ_1 × μ_2-a.e.

10. Let (Ω_t, A_t)_{t∈T} be a family of measurable spaces, where A_t = σ(C_t) for t ∈ T. For each t ∈ T, let π_t : ∏_{s∈T} Ω_s ∋ ω ↦ ω_t ∈ Ω_t be the projection onto the tth space. Prove ∏_{t∈T} A_t = σ(⋃_{t∈T} π_t^{-1}(C_t)).

11. Let B^∞ be the product Borel σ-algebra on ℝ^∞ := ∏_{i∈ℕ} ℝ. Prove that the following sets are B^∞-measurable:

(a) {x ∈ ℝ^∞ : sup_n x_n < a};
(b) {x ∈ ℝ^∞ : ∑_{n=1}^∞ |x_n| < ∞};
(c) {x ∈ ℝ^∞ : lim_{n→∞} x_n exists and is finite};
(d) {x ∈ ℝ^∞ : lim sup_n x_n ⩽ a}.

12. Let F be a probability distribution function on ℝ. Prove ∫_ℝ (F(x + c) − F(x)) dx = c for any constant c ∈ ℝ, and if F is continuous, then ∫_ℝ F(x) dF(x) = 1/2.
Chapter 5

Conditional Expectation and


Conditional Probability

To describe the influence of a class of events (a sub-σ-algebra) C on a random variable ξ, we introduce the conditional expectation (or, more generally, the conditional distribution) of ξ given C. When the sub-σ-algebra C is induced by a family of random variables, the conditional expectation refers to the influence of these random variables on ξ. When ξ runs over the indicator functions of all measurable sets, the conditional expectation reduces to the conditional probability.
To define the conditional expectation, we recall the simple case
where the condition is given by an event B. Throughout this chap-
ter, (Ω, A , P) is a complete probability space. For any B ∈ A
with P(B) > 0, the conditional probability given B is defined as P(·|B) := P(· ∩ B)/P(B). Moreover, for any random variable ξ having expec-
tation, the conditional expectation E(ξ|B) of a random variable
ξ given B is defined as the integral of ξ with respect to condi-
tional probability. Similarly, if P(B c ) > 0, we define in the same
way the conditional expectation E(·|B c ) under event B c . Thus, the
conditional expectation of ξ given C = {B, B c , ∅, Ω} is naturally
defined as
E(ξ|C ) = 1B E(ξ|B) + 1B c E(ξ|B c ), (5.1)
which is a C -measurable random variable. This is P|C -a.s. well
defined even if B or B c is a P-null set.
The aim of this chapter is to define the conditional probability
and conditional expectation under an arbitrarily given sub-σ-algebra


C of A and make applications to the study of transition probabilities


and probability measures on product spaces.

5.1 Conditional Expectation Given σ-Algebra

We first extend the definition in (5.1) to a σ-algebra C generated by countably many atoms. A set B ∈ C is called an atom of C if ∀B′ ∈ C with B′ ⊂ B, we have B′ = B or B′ = ∅.
Definition 5.1. Let C = σ({B_n : n ⩾ 1}) for {B_n}_{n⩾1} ⊂ A being a partition of Ω, and let ξ be a random variable having expectation. Then

E(ξ|C) := ∑_{n=1}^∞ E(ξ|B_n) 1_{B_n}

is called the conditional expectation of ξ with respect to P given the σ-algebra C, where E(ξ|B_n) 1_{B_n} := 0 if P(B_n) = 0.
To further extend the definition to general sub-σ-algebra C ,
we present the following result which characterizes the conditional
expectation without using the expression of C .
Proposition 5.2. Let C = σ({B_n : n ⩾ 1}) for a partition {B_n}_{n⩾1} ⊂ A of Ω. Then for any random variable ξ having expectation, E(ξ|C) is C-measurable and satisfies

E(1_B ξ) = E(1_B E(ξ|C)),  ∀B ∈ C.

On the other hand, if η is a C-measurable function such that E(ξ 1_B) = E(η 1_B), ∀B ∈ C, then η = E(ξ|C), P|_C-a.s.
According to Proposition 5.2, we define the conditional expecta-
tion given a general σ-algebra C as follows.
Definition 5.3. Let C ⊂ A be a sub-σ-algebra of A, and let ξ be a random variable having expectation. The conditional expectation E(ξ|C) of ξ given C (with respect to P) is defined as the C-measurable function satisfying

∫_B E(ξ|C) dP = ∫_B ξ dP,  ∀B ∈ C.

To see that Definition 5.3 makes sense, we need to show the existence and uniqueness of E(ξ|C) when ξ has expectation. Without loss of generality, we may assume Eξ⁻ < ∞, so that C ∋ B ↦ ϕ(B) := ∫_B ξ dP is a signed measure with ϕ ≪ P|_C. By Theorem 3.51, there exists a P|_C-a.s. unique f ∈ C such that dϕ = f dP|_C, i.e. ϕ(B) = ∫_B f dP = ∫_B ξ dP for B ∈ C.

Definition 5.4. Let C ⊂ A be a σ-algebra. For any A ∈ A ,


P(A|C ) = E(1A |C ) is called the conditional probability of A given
C (with respect to P).

Since the conditional expectation is defined via integral, it inherits


most properties of integrals, but in the sense of P|C -a.s. We collect
some of them in the following result, where the convergence the-
orems can be proved by using the monotone convergence theorem
(Exercise 3), as shown in the proofs of the corresponding results
for integrals. In the same spirit, some inequalities for integrals and
expectations (such as Jensen’s inequality, Hölder’s inequality, and
Minkowski’s inequality) also hold for conditional expectations, which
are left as exercises as well.

Property 5.5. Assume that the random variables involved below have expectations:

(1) E(E(ξ|C)) = Eξ.
(2) If ξ ∈ C, then E(ξ|C) = ξ.
(3) (Monotonicity) ξ ⩽ η ⇒ E(ξ|C) ⩽ E(η|C).
(4) (Linearity) E(aξ + bη|C) = aE(ξ|C) + bE(η|C), a, b ∈ ℝ.
(5) (Fatou–Lebesgue convergence theorem) Let η and ζ be integrable. If ξ_n ⩾ η, P-a.e. for every n ⩾ 1, then E(lim inf_{n→∞} ξ_n |C) ⩽ lim inf_{n→∞} E(ξ_n|C). If ξ_n ⩽ ζ, P-a.e. for every n ⩾ 1, then lim sup_{n→∞} E(ξ_n|C) ⩽ E(lim sup_{n→∞} ξ_n |C).
(6) (Dominated convergence theorem) Let η be integrable. If η ⩽ ξ_n ↑ ξ, or if |ξ_n| ⩽ η for every n ⩾ 1 and ξ_n → ξ a.s., then E(ξ_n|C) → E(ξ|C), a.s.

Property 5.6. Let ξ and η be random variables such that η ∈ C and Eξη, Eξ exist. Then E(ξη|C) = ηE(ξ|C).

Proof. Since both E(ξη|C) and ηE(ξ|C) are C-measurable, by the definition of conditional expectation, we only need to show

∫_C ξη dP = ∫_C ηE(ξ|C) dP,  C ∈ C.

By Theorem 2.12 and Property 5.5(4) and (6), it suffices to prove this for ξ = 1_A, η = 1_B, A ∈ A, B ∈ C, and in this case the proof is finished since

∫_C ηE(ξ|C) dP = ∫_{C∩B} E(ξ|C) dP = ∫_{C∩B} ξ dP = P(A ∩ B ∩ C) = ∫_C ξη dP. □

Property 5.7. Let r ∈ [1, ∞). If ξ_n → ξ in L^r(P), then E(ξ_n|C) → E(ξ|C) in L^r(P).

Proof. By Jensen's inequality and the properties of conditional expectation, it follows that

E|E(ξ_n|C) − E(ξ|C)|^r = E|E(ξ_n − ξ|C)|^r ⩽ E(E(|ξ_n − ξ|^r |C)) = E|ξ_n − ξ|^r → 0 (n → ∞). □

The following result shows that the conditional expectation of ξ given C can be regarded as the average of ξ on each atom of C. This property is called the smoothness of conditional expectation.

Property 5.8. E(ξ|C) takes a constant value on each atom of C. If B is an atom with P(B) > 0, then E(ξ|C)(ω) = (1/P(B)) ∫_B ξ dP for any ω ∈ B.

Proof. Let B be an atom of C. If ∃ω_1, ω_2 ∈ B such that E(ξ|C)(ω_1) ≠ E(ξ|C)(ω_2), then C ∋ {ω ∈ B : E(ξ|C)(ω) = E(ξ|C)(ω_1)} ⊊ B is nonempty, contradicting the fact that B is an atom.

Let B be an atom with P(B) > 0. Since E(ξ|C) takes a constant value on B, we have

E(ξ|C)|_B · P(B) = ∫_B E(ξ|C) dP = ∫_B ξ dP.

Hence, E(ξ|C)|_B = (1/P(B)) ∫_B ξ dP. □
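For C generated by a finite partition, Properties 5.8 and 5.9 give an explicit recipe that is easy to mirror numerically: average ξ over each atom. A sketch (Python with NumPy; the partition and ξ are arbitrary illustrative assumptions):

import numpy as np

rng = np.random.default_rng(7)
n = 100_000
xi = rng.normal(size=n)
atom = rng.integers(0, 4, size=n)   # index of the atom B_k containing omega

cond_exp = np.zeros(n)
for k in range(4):
    B = atom == k
    cond_exp[B] = xi[B].mean()      # estimate of (1/P(B_k)) int_{B_k} xi dP

# Defining property (Proposition 5.2): E(1_B xi) = E(1_B E(xi|C)).
B = atom == 2
print(np.mean(xi * B), np.mean(cond_exp * B))   # approximately equal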

The following result shows that the general definition of conditional expectation is consistent with that for C induced by countably many atoms.

Property 5.9. Let {B_n}_{n⩾1} ⊂ A be a partition of Ω and C = σ({B_n : n ⩾ 1}). Then E(ξ|C) = ∑_{n=1}^∞ E(ξ|B_n) 1_{B_n}. In particular, E(ξ|C) = Eξ for C = {∅, Ω}.

Property 5.10. If C and σ(ξ) are independent, then E(ξ|C) = Eξ; if C ⊂ C′, then

E(ξ|C) = E(E(ξ|C′)|C).

Proof. ∀B ∈ C, 1_B and ξ are independent, so

∫_B E(ξ|C) dP = ∫_B ξ dP = E(1_B ξ) = (E1_B)(Eξ) = P(B)Eξ = ∫_B Eξ dP.

Since B ∈ C is arbitrary, it follows that E(ξ|C) = Eξ.

Let C ⊂ C′. Then ∀B ∈ C,

∫_B E(ξ|C′) dP = E[1_B E(ξ|C′)] = E(E(ξ 1_B|C′)) = E(ξ 1_B) = ∫_B ξ dP.

Hence, E(ξ|C) = E(E(ξ|C′)|C). □

Finally, we prove that E(ξ|C) is the optimal L² approximation of ξ among C-measurable functions.

Property 5.11 (Optimal mean square approximation). Let ξ ∈ L²(P) and let C ⊂ A be a sub-σ-algebra. Then E(ξ|C) ∈ L²(P_C), and E(ξ|C) is the optimal approximation of ξ in L²(P_C): for every η ∈ L²(P_C),

E|ξ − E(ξ|C)|² ≤ E|ξ − η|²,  E(|ξ − E(ξ|C)|² | C) ≤ E(|ξ − η|² | C),

and the equalities hold if and only if η = E(ξ|C), P-a.s.



Proof. We only prove the latter. By Jensen's inequality,

|E(ξ|C)|² ≤ E(|ξ|² | C),

so E(ξ|C) ∈ L²(P_C). For every η ∈ L²(P_C), we have

E(|ξ − η|² | C) = E(|ξ − E(ξ|C)|² | C) + E(|η − E(ξ|C)|² | C) − 2E((η − E(ξ|C))(ξ − E(ξ|C)) | C).

Since η − E(ξ|C) ∈ C, we have

E((η − E(ξ|C))(ξ − E(ξ|C)) | C) = (η − E(ξ|C)) E((ξ − E(ξ|C)) | C) = 0.

Hence, E(|ξ − η|² | C) ≥ E(|ξ − E(ξ|C)|² | C), and the equality holds if and only if η = E(ξ|C), P-a.s. □
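A quick Monte Carlo sketch of Property 5.11 (again an illustration with made-up data, not part of the text): among C-measurable candidates η, the atom-wise average — i.e. E(ξ|C) — gives the smallest mean square error.

# Sketch (assumptions: C is generated by the single event {U < 1/2},
# and xi is a made-up non-C-measurable random variable).
import numpy as np

rng = np.random.default_rng(1)
U = rng.uniform(size=200_000)
xi = np.exp(U) + rng.normal(scale=0.1, size=U.size)
atom = (U < 0.5).astype(int)                         # atom index, 0 or 1

proj = np.array([xi[atom == k].mean() for k in range(2)])[atom]  # E(xi|C)
print(np.mean((xi - proj) ** 2))                     # optimal L2 error

# any other C-measurable eta (constant on the two atoms) does worse:
for eta_vals in ([1.0, 2.0], [1.5, 2.2], [1.2, 2.0]):
    eta = np.array(eta_vals)[atom]
    print(np.mean((xi - eta) ** 2))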

5.2 Conditional Expectation Given Function

In this section, we study conditional expectations given the σ-algebra C = σ(f) = f⁻¹(E) induced by a measurable map f : (Ω, A) → (E, E), where (E, E) is a measurable space. In this case, we simply denote E(·|σ(f)) by E(·|f). In particular, when f is a random variable, i.e. (E, E) = (Rⁿ, Bⁿ) for some n ≥ 1, the conditional expectations reflect the influence of f on the random variables under study.

Theorem 5.12. Let ξ be a random variable having expectation, and let f : (Ω, A) → (E, E) be measurable. Then E(ξ|f) := E(ξ|σ(f)) = g ∘ f, where g : E → R is a measurable function such that

∫_B g d(P ∘ f⁻¹) = ∫_{f⁻¹(B)} ξ dP,  B ∈ E.

Proof. Since E(ξ|σ(f)) is σ(f)-measurable, by Theorem 2.22, there exists a measurable function g : (E, E) → (R, B) such that E(ξ|σ(f)) = g ∘ f. Combining this with the integral transform formula (Theorem 3.27) and the definition of conditional expectation, we derive the desired formula. □

By taking ξ = 1_A (A ∈ A) in Theorem 5.12, we obtain the following result.

Corollary 5.13. For any A ∈ A, we have P(A|σ(f)) = g ∘ f, where g : (E, E) → (R, B) is measurable, satisfying

∫_B g d(P ∘ f⁻¹) = P(A ∩ f⁻¹(B)),  ∀B ∈ E.

As explained in the beginning of this chapter, the conditional expectation given an event B can be formulated as the integral with respect to the conditional probability P(·|B). As we have already defined the conditional expectation E(·|C) and the conditional probability P(·|C), we wish to establish the same link between them, i.e. to formulate E(ξ|C) as the integral of ξ with respect to P(·|C). However, for each event A, P(A|C) is only P-a.s. defined. So, to establish the desired formula, we need to fix a P-version (i.e. a pointwise defined function) in the class of P(A|C), denoted by P_C(·, A). If P_C happens to be a transition probability on (Ω, C) × (Ω, A), then we will be able to verify the formula E(ξ|C) = ∫_Ω ξ dP_C. Such a transition probability is called the regular conditional probability given C.

5.3 Regular Conditional Probability

5.3.1 Definition and properties

Definition 5.14 (Regular conditional probability). Let C ⊂ A be a sub-σ-algebra of A. A transition probability P_C on (Ω, C) × (Ω, A) is called the regular conditional probability given C (with respect to P) if

P_C(·, A) = E(1_A|C) = P(A|C),  ∀A ∈ A.

Obviously, the regular conditional probability is P-a.s. unique. If it exists, we may formulate the conditional expectation given C as an integral with respect to P_C.

Theorem 5.15. Let P_C be the regular conditional probability given C. Then for any random variable ξ having expectation,

E(ξ|C) = ∫_Ω ξ P_C(·, dω).

Proof. By definition, the formula holds for ξ being an indicator function. Then the proof is finished by Theorem 2.12, Theorem 3.5 and the linearity of integrals. □

By the link between the regular conditional probability and the conditional expectation, properties of conditional expectation can be reformulated using the regular conditional probability. In the following, we only reformulate one property as an example.

Theorem 5.16. Let C ⊂ C′ ⊂ A be sub-σ-algebras, and let P_C and P_{C′} be the associated regular conditional probabilities. Then for any random variables ξ and ξ′ with ξ′ ∈ C′ such that Eξξ′ and Eξ exist, there holds

∫_Ω (ξ′ξ)(ω) P_C(·, dω) = ∫_Ω ξ′(ω′) ( ∫_Ω ξ(ω) P_{C′}(ω′, dω) ) P_C(·, dω′).

By the integral transformation theorem, the expectation of a random variable ξ can be formulated as the integral of the identity function with respect to the distribution of ξ, which only depends on the restriction of the probability measure to the σ-algebra induced by ξ. Correspondingly, in the following, we introduce the regular conditional distribution and the mixed conditional distribution of ξ given C.

5.3.2 Conditional distribution

Definition 5.17. Let ξ_T = {ξ_t : t ∈ T} be a family of random variables on (Ω, A, P), and let C ⊂ A be a sub-σ-algebra of A.

(1) A transition probability P_C on (Ω, C) × (Ω, σ(ξ_T)) is called the regular conditional distribution of ξ_T given C if

P_C(·, A) = P(A|C),  A ∈ σ(ξ_T).

(2) A transition probability P_C^{ξ_T} on (Ω, C) × (R^T, B^T) is called a mixed conditional distribution of ξ_T given C if

P_C^{ξ_T}(·, B) = P(ξ_T⁻¹(B)|C),  B ∈ B^T.

Theorem 5.18. Let g : R^T → R be Borel measurable such that Eg(ξ_T) exists. Suppose that the regular conditional distribution P_C and the mixed conditional distribution P_C^{ξ_T} of ξ_T given C exist. Then

E(g(ξ_T)|C) = ∫_Ω g(ξ_T(ω)) P_C(·, dω) = ∫_{R^T} g(x_T) P_C^{ξ_T}(·, dx_T).

Proof. As explained many times, the desired formulae follow from those with g = 1_B, B ∈ B^T. □

Theorem 5.19. If the regular conditional distribution of ξ_T given C exists, then its mixed conditional distribution exists too. When ξ_T(Ω) ∈ B^T, the converse assertion also holds.

Proof. Let P_C be the regular conditional distribution of ξ_T given C. Then

P_C^{ξ_T}(ω, B) = P_C(ω, ξ_T⁻¹(B)),  B ∈ B^T,

gives the mixed conditional distribution. Conversely, if P_C^{ξ_T} exists and ξ_T(Ω) ∈ B^T, then for any A ∈ σ(ξ_T), there exists B ∈ B^T such that A = ξ_T⁻¹(B), so that ξ_T(A) = B ∩ ξ_T(Ω) ∈ B^T. Hence, we can define the regular conditional distribution as P_C(·, A) = P_C^{ξ_T}(·, ξ_T(A)). □

5.3.3 Existence of regular conditional probability

We first prove the existence of the mixed conditional distribution.

Theorem 5.20. Let ξ = (ξ_1, ξ_2, ..., ξ_n) be an n-dimensional random variable on (Ω, A, P) and let C be a sub-σ-algebra of A. Then P_C^ξ exists; hence, when ξ(Ω) ∈ Bⁿ, the regular conditional distribution of ξ given C exists.

Proof. To construct P_C^ξ(ω, ·), we only need to determine the corresponding probability distribution function. By the left continuity, it suffices to fix the distribution function on a countable dense subset of R̄ⁿ, say, on the rational number space Qⁿ. For any r ∈ (Q ∪ {∞})ⁿ, we fix a C-measurable function F(·; r) in the class P(ξ < r|C). Obviously, F has the following properties: there exists a P-null set N such that

(1) ∀a, b ∈ Qⁿ with a ≤ b, Δ_{b,a}F(ω; ·) = P(ξ ∈ [a, b)|C)(ω) ≥ 0 and F(ω; b) ≥ F(ω; a), ω ∉ N;
(2) lim_{m→∞} F(ω; m, ..., m) = 1, ω ∉ N;
(3) for any 1 ≤ i ≤ n, let r_m^{(i)} ∈ Q̄ⁿ be such that the ith component is −m and the others are ∞; then lim_{m→∞} F(ω; r_m^{(i)}) = 0, ω ∉ N;
(4) ∀r_0 ∈ Qⁿ, lim_{m→∞} F(ω; r_0 − 1/m) = F(ω; r_0), ω ∉ N.

Let

F^C(ω; r) = F(ω; r) for ω ∈ N^c;  F^C(ω; r) = 1_{(0,∞)}(r) for ω ∈ N, r ∈ Qⁿ.

Moreover, for each x ∈ Rⁿ, let F^C(ω; x) = lim_{r↑x} F^C(ω; r), which is well defined by the increasing property. Then for every ω ∈ Ω, F^C(ω; ·) is a probability distribution function, so it induces a unique probability measure P_C^ξ(ω; ·) on (Rⁿ, Bⁿ) such that P_C^ξ(ω; (−∞, x)) = F^C(ω; x) for ω ∈ Ω and x ∈ Rⁿ.

Finally, let

Π = {(−∞, r) : r ∈ Qⁿ},  Λ = {B ∈ Bⁿ : P_C^ξ(·, B) = P(ξ ∈ B|C)}.

Then Π is a π-system, Λ ⊃ Π, and Λ is a λ-system. By the monotone class theorem, we obtain Λ = Bⁿ, so that P_C^ξ is the mixed conditional distribution of ξ given C. □
As a consequence of Theorem 5.20, we confirm the existence of the regular conditional probability for (Ω, A, P) = (Rⁿ, Bⁿ, P).

Theorem 5.21. Let (Ω, A, P) = (Rⁿ, Bⁿ, P). Then for any sub-σ-algebra C of Bⁿ, the regular conditional probability P_C exists.

Proof. Let ξ(x) = x for x ∈ Rⁿ. Then σ(ξ) = Bⁿ and P_C^ξ is just the regular conditional probability of P given C. □

As an application of the regular conditional probability, any probability measure on Rⁿ can be induced by a marginal distribution together with some transition probabilities, as in Theorem 4.12.

Theorem 5.22. Let P be a probability on (Rⁿ, Bⁿ). Then there exist a probability P_1 on B and transition probabilities P_k(x_1, ..., x_{k−1}, dx_k) on R^{k−1} × B for k = 2, ..., n such that

P(B) = ∫_R ··· ∫_R 1_B(x_1, ..., x_n) P_n(x_1, ..., x_{n−1}, dx_n) ··· P_1(dx_1),  B ∈ Bⁿ.

Proof. By induction, we only prove the case n = 2. In this case, let C = {A × R : A ∈ B}, and for any B_1, B_2 ∈ B and B ∈ B²,

P_1(B_1) = P(B_1 × R),  P_2(x_1, B_2) = P_C((x_1, 0), R × B_2),

P̃(B) = ∫_R P_1(dx_1) ∫_R 1_B P_2(x_1, dx_2).

Noting that P_2(x_1, B_2) = P_C((x_1, x_2), R × B_2) for any x_2 ∈ R, since {(x_1, x_2) : x_2 ∈ R} is an atom of C on which the conditional probability is constant, we obtain

P̃(B_1 × B_2) = ∫_R 1_{B_1}(x_1) P_1(dx_1) ∫_R 1_{B_2}(x_2) P_2(x_1, dx_2)
= ∫_{R²} 1_{B_1×R}(x_1, x_2) P_2(x_1, B_2) P(dx_1, dx_2)
= E[E(1_{B_1×B_2}|C)] = E1_{B_1×B_2} = P(B_1 × B_2).

This finishes the proof by the uniqueness in the measure extension theorem. □
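As a sanity check of Theorem 5.22 read in the constructive direction — building P on R² from a marginal P_1 and a kernel P_2 — here is a small simulation sketch; the concrete choices P_1 = N(0,1) and P_2(x_1, ·) = N(x_1/2, 1) are hypothetical, made up for the demo.

# Sketch: sampling x1 ~ P1 and then x2 ~ P2(x1, .) realizes the measure P
# given by the iterated-integral formula of Theorem 5.22 (n = 2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x1 = rng.normal(size=100_000)              # x1 ~ P1
x2 = rng.normal(loc=0.5 * x1, scale=1.0)   # x2 ~ P2(x1, dx2)

B1 = (x1 > 0); B2 = (x2 < 1)
print(np.mean(B1 & B2))                    # P(B1 x B2), empirically
# iterated form: E[1_{B1}(x1) * P2(x1, B2)] with P2(x1, B2) = Phi(1 - x1/2)
print(np.mean(B1 * norm.cdf(1 - 0.5 * x1)))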

5.4 Kolmogorov's Consistent Theorem

In this section, we construct probability measures on an infinite product space by using a family of consistent probability measures on finite product spaces. For this, we introduce the concept of consistency. Let T be an infinite index set, and let (Ω_t, A_t) be a measurable space for every t ∈ T. Recall that for every T′ ⊂ T, Ω^{T′} := ∏_{t∈T′} Ω_t and A^{T′} := ∏_{t∈T′} A_t. If S is a finite subset of T, we write S ⋐ T.

Definition 5.23. The family of probability measures {P_S : S ⋐ T} is called consistent if each P_S is a probability measure on A^S and

P_S(A_S) = P_{S′}(A_S × Ω^{S′∖S}),  A_S ∈ A^S, S ⊂ S′ ⋐ T.

Theorem 5.24 (Kolmogorov's consistent theorem). Let (Ω_t, A_t) = (R, B) for t ∈ T, and let {P_S : S ⋐ T} be a family of consistent probability measures. Then there exists a unique probability measure P on (R^T, B^T) such that

P(B^S × R^{T∖S}) = P_S(B^S),  S ⋐ T, B^S ∈ B^S.

Proof. (1) By the consistency, it is easy to see that P is a well-defined finitely additive measure on the class C^T of measurable cylindrical sets and P(R^T) = 1. So, by the measure extension theorem, it suffices to verify the σ-additivity of P.

(2) When T is countable, we may let T = N. By Theorem 5.22, there exist a probability P_1 on R and transition probabilities P_n on R^{n−1} × B for each n ≥ 2 such that P_{{1,2,...,n}} = P_n · P_{n−1} ··· P_1 for all n ≥ 2. Then the desired assertion follows from Tulcea's theorem.

(3) In general, we only need to prove that P is continuous at ∅. Let {A_n}_{n≥1} ⊂ C^T with A_n ↓ ∅. For any n ≥ 1, there exists T_n ⋐ T such that A_n = A_{T_n} × R^{T∖T_n} and A_{T_n} ∈ B^{T_n}. Set T_∞ = ∪_{n=1}^∞ T_n. Then T_∞ is countable. By step (2), P is σ-additive on the algebra C^{T_∞} × R^{T∖T_∞}, so that Theorem 1.34 implies P(A_n) ↓ 0 (n → ∞). □
Remark 5.25. The proof of Theorem 5.24 mainly uses Theorem 5.22
and Tulcea’s theorem, where the latter works for general (Ωt , At ).
Note that Theorem 5.22 can be extended to a Polish space in place
of Rn , see Exercise 20 in this chapter. So, Theorem 5.24 can be
extended to the case that each (Ωt , At ) is a Polish space.

5.5 Exercises

1. Prove Proposition 5.2.
2. For a sequence of random variables 0 ≤ ξ_n ↑ ξ, prove E(ξ_n|C) ↑ E(ξ|C).
3. Prove Property 5.5.
4. Prove Hölder's inequality E(|ξη| | C) ≤ E(|ξ|^p|C)^{1/p} E(|η|^q|C)^{1/q}, p > 1, 1/p + 1/q = 1.

5. Formulate and prove Jensen's and Minkowski's inequalities for conditional expectations.
6. Prove Property 5.9.
7. Construct a probability space (Ω, A, P), sub-σ-algebras C_1 and C_2 of A, and an integrable random variable ξ such that E(ξ|C_1 ∩ C_2) ≠ E(E(ξ|C_1)|C_2).
8. Enumerate the rational numbers as x_1, x_2, ..., and let

F(x) = Σ_{n=1}^∞ 2^{−n} 1_{(x_n,∞)}(x),  x ∈ R.

Prove that F is a probability distribution function on R.


9. (Martingale) Let {A_n}_{n≥1} be a sequence of increasing sub-σ-algebras of A. If a sequence of random variables {ξ_n}_{n≥1} satisfies

E(ξ_{n+1}|A_n) = ξ_n,  n ≥ 1,

then it is called a martingale sequence. For an integrable random variable ξ, prove that ξ_n = E(ξ|A_n) is a martingale sequence.

10. (Markov chain) Let {ξ_n}_{n≥1} be a sequence of random variables. Set A_n = σ({ξ_m : m ≤ n}). If

E(ξ_{n+1}|A_n) = E(ξ_{n+1}|ξ_n),  n ≥ 1,

then {ξ_n}_{n≥1} is called a Markov chain. Let {X_n}_{n≥1} be a sequence of independent random variables. Prove that ξ_n = Σ_{m=1}^n X_m is a Markov chain.

11. Let {ξ_n}_{n≥1} be a sequence of random variables, and let

A_n = σ({ξ_m : m ≤ n}),  A^n = σ({ξ_m : m ≥ n}),  n ≥ 1.

Prove that {ξ_n}_{n≥1} is a Markov chain if and only if one of the following conditions holds:

(a) E(ξ_m|A_n) = E(ξ_m|ξ_n), m ≥ n ≥ 1;
(b) E(η|A_n) = E(η|ξ_n), η ∈ A^n, n ≥ 1;

(c) ∀η ∈ A_n, ζ ∈ A^n such that η, ζ, ηζ are integrable, there holds

E[ηζ|ξ_n] = E[η|ξ_n]E[ζ|ξ_n],  n ≥ 1.


12. Let the matrix P = (p_{ij})_{i,j=0}^∞ satisfy p_{ij} ≥ 0 and Σ_{j=0}^∞ p_{ij} = 1. Construct a probability space (Ω, A, P) and a sequence of random variables {ξ_n}_{n≥0} such that P(ξ_{n+1} = j|ξ_n = i) = p_{ij} for n ≥ 0 and i, j ≥ 0.

13. Let ξ and η be random variables such that E(ξ|C) = η and Eξ² = Eη² < ∞. Prove that ξ = η, a.s.

14. Let ξ ∈ L¹(P). Prove that the family of random variables

{E(ξ|C) : C is a sub-σ-algebra of A}

is uniformly integrable.

15. Let ξ and η be independent and identically distributed such that Eξ exists. Prove E(ξ|ξ + η) = (ξ + η)/2.

16. Let (Ω, A, P) be a probability space, let (E, E) be a measurable space, and let T : Ω → E be a measurable map. Prove that for any sub-σ-algebra C of E, there holds

P(T⁻¹(B)|T⁻¹(C)) = (P ∘ T⁻¹)(B|C) ∘ T,  B ∈ E.

17. For an event A with P(A) > 0, denote P_A = P(·|A). Prove that for any B ∈ A and sub-σ-algebra C of A,

P_A(B|C) = P(A ∩ B|C) / P(A|C).

18. (a) Let C_1 ⊂ C_2 be sub-σ-algebras of A, and let ξ be a random variable with Eξ² < ∞. Prove

E((ξ − E(ξ|C_1))²) ≥ E((ξ − E(ξ|C_2))²).


(b) Let Var(ξ|C) = E((ξ − E(ξ|C))²|C). Prove

Var(ξ) = E(Var(ξ|C)) + Var(E(ξ|C)).

19. Let C_i, i = 1, 2, 3, be sub-σ-algebras of A, and let C_{ij} = σ(C_i ∪ C_j), 1 ≤ i, j ≤ 3. Prove that the following statements are equivalent to each other:

(a) P(A_3|C_{12}) = P(A_3|C_2), ∀A_3 ∈ C_3;
(b) P(A_1 ∩ A_3|C_2) = P(A_1|C_2)P(A_3|C_2), ∀A_1 ∈ C_1, A_3 ∈ C_3;
(c) P(A_1|C_{23}) = P(A_1|C_2), ∀A_1 ∈ C_1.

20. Let P be a probability on a Polish (i.e. complete separable metric) space E. Then for any sub-σ-algebra C of the Borel σ-algebra B(E), the regular conditional probability P_C exists.
Chapter 6

Characteristic Function and


Weak Convergence

We have learnt the characteristic function of a random variable, which is determined by the distribution function according to the L–S representation of expectation and has better analytic properties than the distribution function. In this chapter, we study characteristic functions for general finite measures on Rⁿ and establish an inverse formula to show that the characteristic function of a random variable also determines the distribution function. Therefore, we can use the convergence of characteristic functions to define the convergence of finite measures or random variables, which is called the weak convergence, and the associated topology on the space of finite measures is called the weak topology. More generally, we will introduce several different types of convergence for finite measures on a metric space and present some equivalent statements for the weak convergence. In particular, the weak convergence of the distributions of random variables is equivalent to the convergence in distribution.

6.1 Characteristic Function of Finite Measure

6.1.1 Definition and properties

Definition 6.1. Let μ be a finite measure on (Rⁿ, Bⁿ). The characteristic function (or Fourier–Stieltjes transform) of μ is defined as

f_μ(t) = ∫_{Rⁿ} e^{i⟨t,x⟩} μ(dx),  t ∈ Rⁿ.

Obviously, the characteristic function has the following properties.

Property 6.2. Let ā denote the conjugate of a ∈ C.

(1) Let μ be a finite measure on Rⁿ. Then for any t ∈ Rⁿ, we have

|f_μ(t)| ≤ f_μ(0) = μ(Rⁿ),  f̄_μ(t) = f_μ(−t),

and the increment inequality

|f_μ(t) − f_μ(t + h)|² ≤ 2f_μ(0)[f_μ(0) − Re f_μ(h)],  h ∈ Rⁿ.

Consequently, f_μ is uniformly continuous.

(2) Let μ_k be a finite measure on R^{m_k} (k = 1, 2, ..., n), and let μ := ∏_{k=1}^n μ_k. Then

f_μ(t) = ∏_{k=1}^n f_{μ_k}(t^{(m_k)}),  t = (t^{(m_1)}, ..., t^{(m_n)}) ∈ R^{m_1+···+m_n}.

Proof. We only prove the increment inequality, since the other assertions are obvious. Since f_μ(0) = μ(Rⁿ), by the Schwarz inequality, we have

|f_μ(t) − f_μ(t + h)|² ≤ f_μ(0) ∫_{Rⁿ} |e^{i⟨t,x⟩} − e^{i⟨t+h,x⟩}|² μ(dx)
= f_μ(0) ∫_{Rⁿ} |e^{i⟨h,x⟩} − 1|² μ(dx)
= 2f_μ(0) ∫_{Rⁿ} (1 − cos⟨h, x⟩) μ(dx)
= 2f_μ(0)(f_μ(0) − Re f_μ(h)). □


Finally, we characterize the derivatives of f_μ.

Proposition 6.3. Let μ be a finite measure on (R, B) and let n ≥ 1. If ∫_R |x|ⁿ μ(dx) < ∞, then f_μ has derivatives up to the nth order, and for all 0 ≤ k ≤ n,

f_μ^{(k)}(t) = i^k ∫_R x^k e^{itx} μ(dx),  t ∈ R.

In particular,

∫_R x^k μ(dx) = i^{−k} f_μ^{(k)}(0).
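The moment formula of Proposition 6.3 can be illustrated numerically; the sketch below (assumptions: μ is the standard normal law, approximated by a Monte Carlo sample) compares the empirical characteristic function with exp(−t²/2) and recovers ∫x² μ(dx) = 1 from a finite-difference second derivative at 0.

# Sketch (not from the text): empirical characteristic function of N(0,1).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200_000)
f = lambda t: np.mean(np.exp(1j * t * x))    # f_mu(t) for mu = law of x

print(abs(f(1.3) - np.exp(-1.3**2 / 2)))     # small sampling error
h = 1e-3                                     # i^{-2} f''(0) = E x^2 = 1
print(-(f(h) - 2 * f(0.0) + f(-h)).real / h**2)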

6.1.2 Inverse formula

In this part, we aim to determine a finite measure by using its characteristic function via the following inverse formula. An interval [a, b] in Rⁿ is called μ-continuous for a finite measure μ on Rⁿ if μ(∂[a, b]) = 0.

Theorem 6.4 (Inverse formula). Let μ be a finite measure on (Rⁿ, Bⁿ). Then for any μ-continuous interval [a, b] in Rⁿ, we have

μ([a, b)) = lim_{T→∞} (1/(2π)ⁿ) ∫_{[−T,T]ⁿ} ∏_{k=1}^n (e^{−it_k a_k} − e^{−it_k b_k})/(it_k) · f_μ(t_1, ..., t_n) dt_1 ··· dt_n.

Proof. Let I(T) denote the integral on the right-hand side over [−T, T]ⁿ. By the definition of f_μ and Fubini's theorem, we obtain

I(T) = ∫_{Rⁿ} μ(dx) ∫_{−T}^T ··· ∫_{−T}^T ∏_{k=1}^n (e^{−it_k a_k} − e^{−it_k b_k})/(it_k) · e^{i Σ_{k=1}^n t_k x_k} dt_1 ··· dt_n
= ∫_{Rⁿ} ∏_{k=1}^n ( ∫_{−T}^T (e^{−it_k a_k} − e^{−it_k b_k})/(it_k) · e^{it_k x_k} dt_k ) μ(dx)
= 2ⁿ ∫_{Rⁿ} ∏_{k=1}^n ( ∫_0^T (sin t_k(x_k − a_k) − sin t_k(x_k − b_k))/t_k dt_k ) μ(dx)
= 2ⁿ ∫_{Rⁿ} ∏_{k=1}^n ( ∫_{T(x_k−b_k)}^{T(x_k−a_k)} (sin t)/t dt ) μ(dx).

Since ∫_s^r (sin t)/t dt is bounded in s ≤ r ∈ R, and ∫_{−∞}^∞ (sin t)/t dt = π, the dominated convergence theorem implies

lim_{T→∞} I(T) = (2π)ⁿ μ((a, b)) = (2π)ⁿ μ([a, b)). □
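For a concrete check of the inverse formula, the following numerical sketch (dimension n = 1, with the assumed choices μ = N(0,1), [a, b) = [−1, 1), and truncation T = 200) approximates the right-hand side by a Riemann sum and compares it with μ([a, b)).

# Sketch: Theorem 6.4 for mu = N(0,1), whose characteristic function
# is exp(-t^2/2); the integrand at t = 0 is replaced by its limit b - a.
import numpy as np
from scipy.stats import norm

a, b, T = -1.0, 1.0, 200.0
t = np.linspace(-T, T, 400_001)
dt = t[1] - t[0]
ts = np.where(t == 0.0, 1.0, t)                  # placeholder, avoids 0/0
kern = np.where(t == 0.0, b - a,
                (np.exp(-1j*ts*a) - np.exp(-1j*ts*b)) / (1j*ts))
approx = (kern * np.exp(-t**2 / 2)).sum().real * dt / (2*np.pi)
print(approx, norm.cdf(b) - norm.cdf(a))         # both close to 0.6827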

To prove that μ is uniquely determined by f_μ via Theorem 6.4, we need to show that μ has sufficiently many continuous intervals, or, equivalently, that only rare intervals fail to be μ-continuous. To this end, we present the following result.

Lemma 6.5. Let μ be a finite measure on Rⁿ. Then the set

D(μ) := {a ∈ R : ∃k ∈ {1, ..., n} such that μ({x : x_k = a}) > 0}

is at most countable.

Proof. Let

D_{m,k}(μ) = {a ∈ R : μ({x : x_k = a}) ≥ 1/m},  m ≥ 1, 1 ≤ k ≤ n.

Then D(μ) = ∪_{k,m} D_{m,k}(μ). As μ is finite, each D_{m,k}(μ) is a finite set, so that D(μ) is at most countable. □

Lemma 6.6. If an interval [a, b] in Rⁿ is such that all components of a and b are in the set C(μ) := R∖D(μ), then it is μ-continuous.

Proof. Let a = (a_k)_{1≤k≤n} and b = (b_k)_{1≤k≤n} be such that {a_k, b_k : 1 ≤ k ≤ n} ⊂ C(μ). Then

∂[a, b] ⊂ ∪_{k=1}^n {x : x_k = a_k or x_k = b_k}

is a μ-null set. □

Proposition 6.7. Let μ_1 and μ_2 be finite measures on Rⁿ. If μ_1 and μ_2 are equal on their common continuous intervals, then μ_1 = μ_2. Consequently, a finite measure on Rⁿ is uniquely determined by its characteristic function.

Proof. By Theorem 6.4, we only need to prove the first assertion. As D(μ_1) ∪ D(μ_2) is at most countable, C := C(μ_1) ∩ C(μ_2) is dense in R. For every [a, b) ⊂ Rⁿ and 1 ≤ k ≤ n, there exist {a_k^{(m)}}, {b_k^{(m′)}} ⊂ C with a_k^{(m)} ↑ a_k and b_k^{(m′)} ↑ b_k. From the continuity of finite measures and the definition of C, it follows that

μ_1([a, b)) = lim_{m↑∞} μ_1([a^{(m)}, b)) = lim_{m↑∞} lim_{m′↑∞} μ_1([a^{(m)}, b^{(m′)}))
= lim_{m↑∞} lim_{m′↑∞} μ_2([a^{(m)}, b^{(m′)})) = μ_2([a, b)). □

6.2 Weak Convergence of Finite Measures

6.2.1 Definition and equivalent statements

Let (E, ρ) be a metric space and let E be its Borel σ-algebra. Denote by M the set of all finite measures on (E, E).

Lemma 6.8 (Regularity). Let μ ∈ M. Then for every A ∈ E,

μ(A) = inf_{G⊃A, G open} μ(G) = sup_{C⊂A, C closed} μ(C).

Proof. Let C be the class of all sets A ∈ E satisfying the desired equations. It suffices to prove that (1) C contains all open sets, which form a π-system, and (2) C is a λ-system.

To prove (1), let A be an open set. Then the first equation holds, and the second equation also holds if A = E. Now, let A ≠ E, so that A^c is a nonempty closed set. By the triangle inequality, the distance function to A^c defined by d(·, A^c) := inf_{y∈A^c} ρ(·, y) is Lipschitz continuous. Let C_n = {x ∈ E : d(x, A^c) ≥ 1/n}. Then C_n is closed and C_n ⊂ A. Since A is open, for any x ∈ A, there exists n ≥ 1 such that B(x, 1/n) ⊂ A. Thus, d(x, A^c) ≥ 1/n, so that x ∈ C_n. Therefore, C_n ↑ A (n → ∞). By the continuity of μ, we obtain lim_{n→∞} μ(C_n) = μ(A). This proves the second equation, so that A ∈ C.

To prove (2), it suffices to show that C is a monotone class and closed under proper differences. Let {A_n}_{n≥1} ⊂ C, A_n ↑ A (n → ∞). For every A_n, there exists an open set G_n ⊃ A_n such that |μ(G_n) − μ(A_n)| ≤ 2^{−n}; there also exists a closed set C_n ⊂ A_n such that |μ(C_n) − μ(A_n)| ≤ 2^{−n}. Then G̃_n = ∪_{m=n}^∞ G_m is an open set including A, while C_n is a closed set included in A. Moreover,

lim_{n→∞} |μ(G̃_n) − μ(A)| = lim_{n→∞} |μ(∪_{m=n}^∞ G_m) − μ(∪_{m=n}^∞ A_m)|
≤ lim_{n→∞} μ(∪_{m=n}^∞ (G_m − A_m)) ≤ lim_{n→∞} Σ_{m=n}^∞ 2^{−m} = 0,

and

lim_{n→∞} |μ(C_n) − μ(A)| = lim_{n→∞} |μ(C_n) − μ(A_n)| = 0.

Therefore, A ∈ C. Finally, let A_1, A_2 ∈ C with A_1 ⊃ A_2. It remains to prove that A_1 − A_2 ∈ C. For this, for every n ≥ 1, we take an open set G_n ⊃ A_1 and a closed set C_n ⊂ A_2 such that

|μ(G_n) − μ(A_1)| + |μ(C_n) − μ(A_2)| ≤ 1/n.

Then G_n ∖ C_n is open, includes A_1 − A_2, and

|μ(A_1 − A_2) − μ(G_n ∖ C_n)| ≤ |μ(G_n) − μ(A_1)| + |μ(C_n) − μ(A_2)| ≤ 1/n.

So, A := A_1 − A_2 satisfies the first equation. Symmetrically, we can prove the second equation for A := A_1 − A_2, so that A_1 − A_2 ∈ C. □

The above result shows that the class of open sets and that of closed sets are measure-determining classes, i.e. two finite measures are equal if they coincide on either of these two classes. In the following, we prove that Cb(E), the class of bounded continuous functions on E, is also a measure-determining class. For later use, we also introduce the class Bb(E) of bounded measurable functions on E as well as the class C_0(E) of continuous functions on E with compact support.

Lemma 6.9. Let μ_1, μ_2 ∈ M. If μ_1(f) = μ_2(f) for all f ∈ Cb(E), then μ_1 = μ_2.

Proof. By Lemma 6.8, it suffices to prove that μ_1(G) = μ_2(G) for any open G. Let g(x) = d(x, G^c) for x ∈ E. Then g(x) > 0 for x ∈ G and g is Lipschitz. Set h_n(r) = (nr) ∧ 1. Then h_n ∘ g is Lipschitz and h_n ∘ g ↑ 1_G (n ↑ ∞). From the monotone convergence theorem and μ_1(h_n ∘ g) = μ_2(h_n ∘ g), it follows that μ_1(G) = μ_2(G). □

Definition 6.10. Let {μ_n} ⊂ M and μ ∈ M.

(1) We say that (μ_n)_{n≥1} converges uniformly to μ, denoted by μ_n →ᵘ μ, if

sup_{A∈E} |μ_n(A) − μ(A)| → 0,  n ↑ ∞,

equivalently,

sup_{f∈Bb(E), |f|≤1} |μ_n(f) − μ(f)| → 0,  n ↑ ∞.

(2) We say that (μ_n)_{n≥1} converges strongly to μ, denoted by μ_n →ˢ μ, if

lim_{n→∞} μ_n(A) = μ(A),  ∀A ∈ E,

equivalently, μ_n(f) → μ(f) for every f ∈ Bb(E).

(3) We call (μ_n)_{n≥1} weakly convergent to μ, denoted by μ_n →ʷ μ, if μ_n(f) → μ(f) for every f ∈ Cb(E).

(4) (μ_n)_{n≥1} is called vaguely convergent to μ, denoted by μ_n →ᵛ μ, if μ_n(f) → μ(f) for every f ∈ C_0(E).

Definition 6.11. A set A ∈ E is called μ-continuous if μ(∂A) = 0.

The following are some equivalent characterizations of weak convergence.

Theorem 6.12. Let μ_n, μ ∈ M (n ≥ 1). The following statements are equivalent:

(1) μ_n(f) → μ(f) for every f ∈ Cb(E).
(2) μ_n(f) → μ(f) for every bounded uniformly continuous function f.
(3) μ_n(f) → μ(f) for every bounded Lipschitz continuous function f.
(4) lim inf_{n→∞} μ_n(G) ≥ μ(G) for every open G ⊂ E, and μ_n(E) → μ(E).
(5) lim sup_{n→∞} μ_n(C) ≤ μ(C) for every closed C ⊂ E, and μ_n(E) → μ(E).
(6) μ_n(A) → μ(A) for every μ-continuous set A.

Proof. (1) ⇒ (2) ⇒ (3) and (4) ⇔ (5) are obvious.

(3) ⇒ (5). Let C ⊂ E be a closed set, and define

f_m(x) = 1/(1 + m d(x, C)),  x ∈ E, m ≥ 1.

Then f_m is Lipschitz and 1 ≥ f_m ↓ 1_C as m ↑ ∞. By (3) and the dominated convergence theorem, we obtain

μ(C) = lim_{m→∞} ∫_E f_m dμ = lim_{m→∞} lim_{n→∞} ∫_E f_m dμ_n ≥ lim sup_{n→∞} μ_n(C).

(4) and (5) ⇒ (6). Let A be a μ-continuous set. Then μ(A) = μ(Ā) = μ(A°), where Ā and A° are the closure and interior of A, respectively. This together with (4) and (5) yields

μ(A) = μ(A°) ≤ lim inf_{n→∞} μ_n(A°) ≤ lim inf_{n→∞} μ_n(A),
μ(A) = μ(Ā) ≥ lim sup_{n→∞} μ_n(Ā) ≥ lim sup_{n→∞} μ_n(A).

So, (6) holds.

(6) ⇒ (1). For f ∈ Cb(E), we intend to find a sequence of simple functions {f_n}_{n≥1} generated by μ-continuous sets such that f_n → f uniformly as n → ∞. Since μ is finite, the set

D := {a ∈ R : μ({f = a}) > 0}

is at most countable. Thus, we may find a constant c > ||f||_∞ + 1 such that ±c ∈ D^c, and a sequence of partitions

I_n := {−c = r_0 < r_1 < ··· < r_n < r_{n+1} = c},  n ≥ 2,

such that

{r_i} ⊂ D^c,  δ(I_n) := max_{1≤k≤n+1} (r_k − r_{k−1}) → 0.

Let

f_n = Σ_{i=0}^n r_i 1_{{r_i ≤ f < r_{i+1}}}.

Then

||f_n − f||_∞ ≤ δ(I_n) → 0,  n → ∞,

so that

|μ(f) − μ_m(f)| ≤ |μ(f) − μ(f_n)| + |μ(f_n) − μ_m(f_n)| + |μ_m(f_n) − μ_m(f)|
≤ δ(I_n)(μ(E) + μ_m(E)) + Σ_{i=0}^n |r_i| |μ(r_i ≤ f < r_{i+1}) − μ_m(r_i ≤ f < r_{i+1})|.

Noting that {r_i} ⊂ D^c implies that each set {r_i ≤ f < r_{i+1}} is μ-continuous, by (6), we may let first m ↑ ∞ and then n ↑ ∞ to derive (1). □


6.2.2 Tightness and weak compactness

The topology induced by weak convergence on M is called the weak topology. In this part, we characterize weak compactness (i.e. compactness in the weak topology) for subsets of M. We first consider the simple case where E is a compact metric space. In this case, relative weak compactness is equivalent to boundedness, as is well known for subsets of Euclidean space. Recall that M ⊂ M is called bounded if sup_{μ∈M} μ(E) < ∞.
Theorem 6.13. Let (E, ρ) be a compact metric space. If {μ_n} ⊂ M is bounded, then there exist a subsequence {μ_{n_k}} and some μ ∈ M such that μ_{n_k} →ʷ μ as k → ∞.

Proof. Since E is compact, C(E) is a Polish space under the uniform norm. Let {f_n}_{n≥1} be a dense subset of C(E). Since {μ_n}_{n≥1} is bounded, for each m ≥ 1, {μ_n(f_m)}_{n≥1} is bounded in R and hence has a convergent subsequence. By the diagonal principle, we may find a subsequence n_k ↑ ∞ as k ↑ ∞ and a sequence {α_m}_{m≥1} ⊂ R such that

lim_{k→∞} μ_{n_k}(f_m) = α_m,  m ≥ 1.

Since {f_n}_{n≥1} is dense in C(E), for any f ∈ C(E) and ε > 0, there exists m_0 ≥ 1 such that ||f_{m_0} − f||_∞ ≤ ε. So, with C := sup_{n≥1} μ_n(E) < ∞,

|μ_{n_k}(f) − μ_{n_l}(f)| ≤ |μ_{n_k}(f − f_{m_0})| + |μ_{n_l}(f − f_{m_0})| + |μ_{n_k}(f_{m_0}) − μ_{n_l}(f_{m_0})|
≤ 2εC + |μ_{n_k}(f_{m_0}) − μ_{n_l}(f_{m_0})|.

By letting first k, l → ∞ and then ε → 0, we obtain

lim_{l,k→∞} |μ_{n_k}(f) − μ_{n_l}(f)| = 0.

Then {μ_{n_k}(f)} is a Cauchy sequence, and there exists α(f) ∈ R such that μ_{n_k}(f) → α(f). It is clear that α : C(E) → R is a nonnegative bounded linear functional. By the Riesz–Markov–Kakutani theorem [19, Theorem IV.14], there exists a unique μ ∈ M such that μ(f) = α(f). Therefore, μ_{n_k} →ʷ μ. □
When E is not compact, the above result remains true if the bounded sequence {μ_n}_{n≥1} is supported on a compact set K ⊂ E, i.e. μ_n(K^c) = 0, n ≥ 1. In general, we may extend the result to bounded {μ_n}_{n≥1} asymptotically supported on compact sets. This leads to the notion of tightness.

Definition 6.14. A bounded subset M of M is called tight if for any ε > 0, there exists a compact set K ⊂ E such that

sup_{μ∈M} μ(K^c) < ε.

Theorem 6.15 (Prohorov's theorem). Let (E, ρ) be a metric space and let {μ_n}_{n≥1} ⊂ M be bounded.

(1) If there exists a sequence of compact sets {K_m}_{m≥1} such that K_m ↑ E, then {μ_n}_{n≥1} has a vaguely convergent subsequence.
(2) If {μ_n}_{n≥1} is tight, then it has a weakly convergent subsequence.

Proof. Let {K_m}_{m≥1} be a sequence of increasing compact subsets of E. Given m, by Theorem 6.13 there exist a subsequence {μ_{m,n}}_{n≥1} and a finite measure μ^{(m)} on K_m such that

μ_{m,n}|_{K_m} →ʷ μ^{(m)}  (n → ∞),

where μ_{m,n}|_{K_m} is the restriction of μ_{m,n} to K_m. By the diagonal principle, there exists a subsequence {μ_{n_k}} such that

μ_{n_k}|_{K_m} →ʷ μ^{(m)}  (k → ∞),  m ≥ 1.

Clearly,

μ^{(m+1)}(A ∩ K_{m+1}) ≥ μ^{(m)}(A ∩ K_m),  ∀A ∈ E.

Indeed, for any closed set A, let

h_l(x) = 1/(1 + l d(x, A)),  l ≥ 1.

Then

μ^{(m+1)}(A ∩ K_{m+1}) = lim_{l→∞} μ^{(m+1)}(h_l 1_{K_{m+1}}) = lim_{l→∞} lim_{k→∞} μ_{n_k}(h_l 1_{K_{m+1}})
≥ lim_{l→∞} lim sup_{k→∞} μ_{n_k}(h_l 1_{K_m}) = μ^{(m)}(A ∩ K_m).

Thus, the limit μ(A) := lim_{m→∞} μ^{(m)}(A ∩ K_m) exists for any A ∈ E, so that μ ∈ M and μ^{(m)}(f 1_{K_m}) → μ(f) for any f ∈ Cb(E).

(1) Since K_m ↑ E, for any f ∈ C_0(E), there exists m_0 ≥ 1 such that supp f ⊂ K_m for all m ≥ m_0. Thus,

lim_{k→∞} μ_{n_k}(f) = lim_{m→∞} μ^{(m)}(f) = μ(f).

(2) Up to a subsequence, we may assume that sup_{n≥1} μ_n(K_m^c) ≤ 1/m for m ≥ 1. Then for every f ∈ Cb(E),

|μ_{n_k}(f) − μ(f)| ≤ |μ_{n_k}(f 1_{K_m}) − μ^{(m)}(f 1_{K_m})| + |μ_{n_k}(f − f 1_{K_m})| + |μ(f) − μ^{(m)}(f 1_{K_m})|
≤ ||f||_∞/m + |μ_{n_k}(f 1_{K_m}) − μ^{(m)}(f 1_{K_m})| + |μ(f) − μ^{(m)}(f 1_{K_m})|.

By first letting k ↑ ∞ and then m ↑ ∞, we prove μ_{n_k} →ʷ μ. □

Theorem 6.16. Let E be a Polish space. Then a subset of M is weakly relatively compact if and only if it is tight.

Proof. By Prohorov's theorem, we only need to prove the necessity. Let M ⊂ M be relatively compact; we intend to prove tightness. To this end, we first observe that

lim_{n→∞} sup_{μ∈M} μ(G_n^c) = 0  (6.1)

holds for any increasing open sets G_n ↑ E. To see this, for any n ≥ 1, we take μ_n ∈ M such that

μ_n(G_n^c) ≥ sup_{μ∈M} μ(G_n^c) − 1/n.

Since M is weakly relatively compact, there exist μ_0 ∈ M and a subsequence with μ_{n_k} →ʷ μ_0. Combining this with the increasing property of G_n, we obtain

lim_{n→∞} sup_{μ∈M} μ(G_n^c) = lim_{k→∞} sup_{μ∈M} μ(G_{n_k}^c) ≤ lim_{k→∞} [μ_{n_k}(G_{n_k}^c) + 1/n_k]
≤ lim_{m→∞} lim sup_{k→∞} μ_{n_k}(G_m^c) ≤ lim_{m→∞} μ_0(G_m^c) = 0.

So, (6.1) holds.

Since E is separable, for every m ≥ 1, there exists {x_{m,j}} such that E = ∪_{j=1}^∞ B(x_{m,j}, 2^{−m}). Let G(n, m) = ∪_{j=1}^n B(x_{m,j}, 2^{−m}). Since G(n, m) ↑ E (n ↑ ∞), by (6.1) with G_n = G(n, m), we conclude that for every ε > 0 there exists N(ε, m) ≥ 1 such that

sup_{μ∈M} μ(G(n, m)^c) ≤ ε/2^m,  n ≥ N(ε, m).

Then K_ε := ∩_{m=1}^∞ G(N(ε, m), m) is totally bounded, so that by Hausdorff's theorem, K̄_ε is compact, and

μ(K̄_ε^c) ≤ Σ_{r=1}^∞ μ(G(N(ε, r), r)^c) ≤ Σ_{r=1}^∞ ε/2^r = ε,  μ ∈ M. □

When E = R^d, we have one more equivalent statement for weak convergence, using continuous intervals in place of continuous sets.

Proposition 6.17. Let E = R^d. Then μ_n →ʷ μ if and only if μ_n(R^d) → μ(R^d) and μ_n([a, b)) → μ([a, b)) for any finite μ-continuous interval [a, b).

Proof. We only need to prove the sufficiency. If μ_n does not converge weakly to μ, then there exist δ > 0, f ∈ Cb(R^d) and a subsequence n_k → ∞ such that

|μ_{n_k}(f) − μ(f)| ≥ δ,  k ≥ 1.  (6.2)

By Lemmas 6.5 and 6.6, there exists a sequence of μ-continuous intervals I_m ↑ R^d. Then for every ε > 0, there exists m ≥ 1 such that μ(I_m^c) ≤ ε/2. Since μ_n(I_m) → μ(I_m) and μ_n(R^d) → μ(R^d) as n → ∞, we have lim sup_{n→∞} μ_n(I_m^c) ≤ ε/2. Thus, there exists n_0 ≥ 1 such that μ_n(I_m^c) < ε for all n ≥ n_0. Moreover, take a compact set K_1 such that μ_n(K_1^c) < ε for all n ≤ n_0. Then K = K_1 ∪ Ī_m is compact and satisfies μ_n(K^c) < ε, ∀n ≥ 1. Thus, {μ_{n_k}} is tight, so there exist a further subsequence {n_k′} and a finite measure μ′ such that μ_{n_k′} →ʷ μ′. Combining this with the condition that μ_n([a, b)) → μ([a, b)) for μ-continuous intervals [a, b), we see that μ′ and μ are equal on their common continuous intervals. By Proposition 6.7, we have μ′ = μ, which contradicts (6.2) since μ_{n_k′} →ʷ μ. □

6.3 Characteristic Function and Weak Convergence

In this section, we first identify the weak convergence of finite measures on Rⁿ by using the convergence of characteristic functions, and then prove that a complex function on Rⁿ is a characteristic function if and only if it is continuous and nonnegative definite.

Theorem 6.18. Let {μ_k, μ}_{k≥1} be finite measures on Rⁿ. Then μ_k →ʷ μ (k → ∞) if and only if f_{μ_k} → f_μ pointwise.

By the dominated convergence theorem, the necessity is obvious. The sufficiency follows from Theorem 6.22 on the convergence of integral characteristic functions.
integral characteristic functions.
Definition 6.19. Let f_μ be the characteristic function of a finite measure μ. The indefinite integral of f_μ,

f̃_μ(u_1, ..., u_n) = ∫_0^{u_1} ··· ∫_0^{u_n} f_μ(t_1, ..., t_n) dt_1 ··· dt_n,  u ∈ Rⁿ,

is called the integral characteristic function of μ, where ∫_0^{u_i} = −∫_{u_i}^0 if u_i < 0.

Since f_μ is continuous, f_μ and f̃_μ determine each other.

Lemma 6.20. The integral characteristic function of μ satisfies

f̃_μ(u_1, ..., u_n) = ∫_{Rⁿ} ∏_{k=1}^n (e^{iu_k x_k} − 1)/(i x_k) μ(dx_1, ..., dx_n),  u_1, ..., u_n ∈ R.

Proof. By the definition and Fubini's theorem, for u = (u_1, ..., u_n) ∈ Rⁿ,

f̃_μ(u) = ∫_0^{u_1} ··· ∫_0^{u_n} ( ∫_{Rⁿ} e^{i⟨t,x⟩} μ(dx) ) dt = ∫_{Rⁿ} μ(dx) ∫_{[0,u]} e^{i⟨t,x⟩} dt
= ∫_{Rⁿ} ∏_{k=1}^n (e^{iu_k x_k} − 1)/(i x_k) μ(dx_1, ..., dx_n). □

Let

F(x, u) = ∏_{k=1}^n (e^{iu_k x_k} − 1)/(i x_k),  x, u ∈ Rⁿ.

Then for given u, lim_{|x|→∞} F(x, u) = 0. Thus, F(·, u) can be uniformly approximated by continuous functions with compact supports.

Theorem 6.21. Let {μ_k}_{k≥1} be a bounded sequence of finite measures on Rⁿ. If f̃_{μ_k} → g̃ for some function g̃, then there exists a finite measure μ such that μ_k →ᵛ μ and g̃ = f̃_μ.

Proof. By Theorem 6.15, there exists a subsequence {μ_{n_k}}_{k≥1} of {μ_k}_{k≥1} which converges vaguely to some finite measure μ. Since a finite measure is determined by its integral characteristic function, we only need to prove f̃_μ = g̃. Since f̃_μ(u) = μ(F(·, u)), and F(·, u) can be uniformly approximated by continuous functions with compact supports, it is clear that μ_{n_k} →ᵛ μ and f̃_{μ_k} → g̃ imply f̃_μ = g̃. □

Theorem 6.22. Let {μ_k}_{k≥1} be a bounded sequence of finite measures on Rⁿ such that f_{μ_k} → g for some function g continuous at 0. Then there exists a finite measure μ such that μ_k →ʷ μ and f_μ = g.

Proof. By the dominated convergence theorem, f_{μ_k} → g implies f̃_{μ_k} → g̃. By Theorem 6.21, Proposition 6.17 and Exercise 9, it suffices to prove μ_k(Rⁿ) → μ(Rⁿ). Since g̃ = f̃_μ, we have g = f_μ dx-a.e., and since both g and f_μ are continuous at 0,

μ(Rⁿ) = f_μ(0) = g(0) = lim_{k→∞} f_{μ_k}(0) = lim_{k→∞} μ_k(Rⁿ). □

In the following, we introduce two important applications of Theorem 6.18.

Theorem 6.23 (Law of large numbers). Let {ξ_n} be i.i.d. random variables with Eξ_n = a ∈ R. Then

(1/n) Σ_{k=1}^n ξ_k →ᴾ a.

Proof. (1) It suffices to prove that the characteristic functions f_n of η_n := (1/n) Σ_{k=1}^n (ξ_k − a) satisfy f_n(t) → 1. In fact, if f_n(t) → 1, then by Theorem 6.18, we have P_{η_n} →ʷ δ_0 (the probability with total mass at 0). Since (−ε, ε) is δ_0-continuous for any ε > 0, there holds

lim_{n→∞} P(|(1/n) Σ_{k=1}^n ξ_k − a| < ε) = lim_{n→∞} P_{η_n}((−ε, ε)) = δ_0((−ε, ε)) = 1,

which implies

lim_{n→∞} P(|(1/n) Σ_{k=1}^n ξ_k − a| ≥ ε) = 0.

(2) Let ξ′_n = ξ_n − a. Then η_n = (1/n) Σ_{k=1}^n ξ′_k, so

f_n(t) = ∏_{k=1}^n f_{ξ′_k}(t/n) = [f(t/n)]ⁿ,

where f = f_{ξ′_1}. Since Eξ′_k = 0, Taylor's expansion gives

f_n(t) = (E e^{itξ′_k/n})ⁿ = (1 + o(1/n))ⁿ,  t ∈ R.

Thus,

lim_{n→∞} log f_n(t) = lim_{n→∞} log(1 + o(1/n))ⁿ = lim_{n→∞} n log(1 + o(1/n)) = 0.

Therefore, lim_{n→∞} f_n(t) = 1. □
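The convergence in probability asserted by Theorem 6.23 is easy to visualize by simulation; the sketch below (an illustration with the made-up choice of Exp(1) variables, so a = 1) estimates P(|n⁻¹ Σ ξ_k − a| ≥ ε) for growing n.

# Monte Carlo sketch of the law of large numbers (illustration only).
import numpy as np

rng = np.random.default_rng(4)
for n in (10, 100, 10_000):
    xi = rng.exponential(1.0, size=(1000, n))      # 1000 runs of length n
    means = xi.mean(axis=1)
    print(n, np.mean(np.abs(means - 1.0) >= 0.1))  # P(|mean - a| >= eps) -> 0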

Theorem 6.24 (Central limit theorem). Let {ξ^{(k)}}_{k≥1} be a sequence of n-dimensional i.i.d. random variables with invertible covariance matrix D and Eξ^{(k)} = m ∈ Rⁿ. Then for every x ∈ Rⁿ,

lim_{N→∞} P( (1/√N) Σ_{k=1}^N (ξ^{(k)} − m) < x ) = (1/((2π)^{n/2} |D|^{1/2})) ∫_{(−∞,x)} e^{−⟨t, D⁻¹t⟩/2} dt.

Proof. Let η^{(k)} = ξ^{(k)} − m. Then the {η^{(k)}} are i.i.d. with zero mean. Let f be the characteristic function of η^{(k)}. Then the characteristic function of (1/√N) Σ_{k=1}^N η^{(k)} is

f_N(t) = [f(t/√N)]^N,  t ∈ Rⁿ.

Since Eη^{(k)} = 0, Taylor's expansion shows

f(t/√N) = 1 − ⟨t, Dt⟩/(2N) + o(1/N),  t ∈ Rⁿ,

so that

log f(t/√N) = −⟨t, Dt⟩/(2N) + o(1/N).

Thus,

lim_{N→∞} log f_N(t) = −⟨t, Dt⟩/2,  t ∈ Rⁿ,

so that

lim_{N→∞} f_N(t) = e^{−⟨t,Dt⟩/2}.

By Theorem 6.18, this implies that (1/√N) Σ_{k=1}^N η^{(k)} converges in distribution to N(0, D), the centered normal distribution with covariance D. □
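Here is a one-dimensional simulation sketch of the central limit theorem (assumed setup: i.i.d. Uniform(0,1) summands, so m = 1/2 and D = 1/12); the empirical distribution function of the normalized sums is compared with that of N(0, 1/12).

# CLT sketch (illustration only, n = 1).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
N = 500
S = (rng.uniform(size=(20_000, N)).sum(axis=1) - N * 0.5) / np.sqrt(N)
for x in (-0.5, 0.0, 1.0):
    print(x, np.mean(S < x), norm.cdf(x, scale=np.sqrt(1/12)))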

6.4 Characteristic Function and Nonnegative Definiteness

Let μ be a finite measure on Rⁿ with characteristic function f_μ. Clearly, for all m ≥ 1, α_1, ..., α_m ∈ C, and t^{(1)}, ..., t^{(m)} ∈ Rⁿ, we have

Σ_{j,k=1}^m f_μ(t^{(j)} − t^{(k)}) α_j ᾱ_k = ∫_{Rⁿ} |Σ_{k=1}^m α_k e^{i⟨t^{(k)},x⟩}|² μ(dx) ≥ 0.

A function having this property is called a nonnegative definite function, and this property is called nonnegative definiteness. In this section, we will prove that a function on Rⁿ is the characteristic function of a finite measure if and only if it is continuous and nonnegative definite. To this end, we first observe that a nonnegative definite function shares some properties of characteristic functions.
some properties of characteristic functions.

Property 6.25. If f is a nonnegative definite function, then f(0) ≥ 0, f(−t) = f̄(t) and |f(t)| ≤ f(0).

Proof. Let m = 2, t^{(1)} = 0, t^{(2)} = t, α_1 = 1, α_2 ∈ C. By the nonnegative definiteness, we have

f(0)(1 + |α_2|²) + f(t)α_2 + f(−t)ᾱ_2 ≥ 0.

(1) Let α_2 = 0. Then f(0) ≥ 0.

(2) Let α_2 = 1. Then 2f(0) + f(−t) + f(t) ≥ 0, so that Im f(t) = −Im f(−t). Moreover, taking α_2 = i, we obtain 2f(0) + i(f(t) − f(−t)) ≥ 0, so that Re f(t) = Re f(−t). In conclusion, f(−t) = f̄(t).

(3) For f(t) ≠ 0 and α_2 := −f̄(t)/|f(t)|, we obtain 2f(0) ≥ 2|f(t)|, so that f(0) ≥ |f(t)|. □

Lemma 6.26. Let T_c = {kc : k ∈ Zⁿ}, c > 0. If f is a nonnegative definite function, then there exists a finite measure μ on Rⁿ such that

μ(Rⁿ) = μ([−π/c, π/c]ⁿ) = f(0),  f_μ(t) = f(t), ∀t ∈ T_c.

Proof. For every m ≥ 1, by the nonnegative definiteness of f, we obtain

0 ≤ (1/mⁿ) Σ_{j_1,...,j_n,k_1,...,k_n=0}^{m−1} f(c(j − k)) e^{−ic⟨j−k,x⟩} = Σ_{r_1,...,r_n=−m}^m ∏_{ℓ=1}^n (1 − |r_ℓ|/m) f(cr) e^{−ic⟨r,x⟩} =: G_m(x).

Let

μ_m(dx) = (c/(2π))ⁿ G_m(x) 1_{[−π/c, π/c]ⁿ}(x) dx.

Then

μ_m(Rⁿ) = μ_m([−π/c, π/c]ⁿ) = Σ_{r_1,...,r_n=−m}^m ∏_{ℓ=1}^n (1 − |r_ℓ|/m) f(cr) (c/(2π))ⁿ ∏_{ℓ=1}^n ∫_{−π/c}^{π/c} e^{−icr_ℓ x_ℓ} dx_ℓ = f(0).

Let f_m be the characteristic function of μ_m. Then

f_m(ck) = (c/(2π))ⁿ ∫_{[−π/c, π/c]ⁿ} e^{ic⟨k,x⟩} G_m(x) dx
= Σ_{r_1,...,r_n=−m}^m ∏_{ℓ=1}^n (1 − |r_ℓ|/m) f(cr) (c/(2π))ⁿ ∏_{ℓ=1}^n ∫_{−π/c}^{π/c} e^{ic(k_ℓ−r_ℓ)x_ℓ} dx_ℓ
= f(ck) ∏_{ℓ=1}^n (1 − |k_ℓ|/m) → f(ck)  (m → ∞).

Since {μ_m}_{m≥1} is tight, there exist μ and a subsequence {μ_{m_k}} such that μ_{m_k} →ʷ μ (k → ∞). Then

μ(Rⁿ) = μ([−π/c, π/c]ⁿ) = f(0)

and

f_μ(ck) = lim_{m→∞} f_m(ck) = f(ck). □

Theorem 6.27. If f is a continuous and nonnegative definite function on Rⁿ, then it is the characteristic function of a finite measure.

Proof. By Lemma 6.26 (applied with c = 1/m), there exists a sequence of finite measures {μ_m}_{m≥1} such that

μ_m(Rⁿ) = μ_m([−mπ, mπ]ⁿ) = f(0),

and their characteristic functions f_m satisfy f_m(t) = f(t), t ∈ (1/m)Zⁿ. For every t ∈ Rⁿ, take {t^{(m)}}_{m≥1} ⊂ T_{1/m} such that |t_k^{(m)} − t_k| ≤ 1/m, 1 ≤ k ≤ n, m ≥ 1. Thus, by the continuity of f and f(t^{(m)}) = f_m(t^{(m)}), we have

f(t) = lim_{m→∞} f(t^{(m)}) = lim_{m→∞} f_m(t^{(m)}).

From this and Theorem 6.18, it suffices to prove

lim_{m→∞} |f_m(t) − f_m(t^{(m)})| = 0.  (6.3)

For this, we use the increment inequality (Property 6.2) to derive

|f_m(t) − f_m(t^{(m)})|
≤ Σ_{i=0}^{n−1} |f_m(t_1, ..., t_i, t_{i+1}^{(m)}, ..., t_n^{(m)}) − f_m(t_1, ..., t_{i+1}, t_{i+2}^{(m)}, ..., t_n^{(m)})|
≤ Σ_{i=1}^n √(2f(0)(f(0) − Re f_m(e_i(t_i^{(m)} − t_i)))),  (6.4)

where e_i ∈ Rⁿ is the unit vector whose ith component is 1. Since |(t_i^{(m)} − t_i)x_i| ≤ π for x_i ∈ [−mπ, mπ], and cos θ is decreasing in |θ| on [−π, π], we have

0 ≤ f(0) − Re f_m(e_i(t_i^{(m)} − t_i)) = ∫_{[−mπ,mπ]ⁿ} (1 − cos((t_i^{(m)} − t_i)x_i)) μ_m(dx)
≤ ∫_{[−mπ,mπ]ⁿ} (1 − cos(x_i/m)) μ_m(dx) = f(0) − Re f_m(e_i/m).

Since e_i/m ∈ T_{1/m}, we have f_m(e_i/m) = f(e_i/m) → f(0) by the continuity of f. Combining this with (6.4), we prove (6.3). □
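The nonnegative definiteness that this section starts from can be probed numerically: for a genuine characteristic function, every Gram matrix (f(t^{(j)} − t^{(k)}))_{j,k} must be positive semidefinite. The sketch below (with the assumed choice f(t) = e^{−t²/2}, the characteristic function of N(0,1)) checks this at random points.

# Sketch: positive semidefiniteness of the Gram matrix of f(t) = exp(-t^2/2).
import numpy as np

rng = np.random.default_rng(6)
t = rng.uniform(-5, 5, size=40)
G = np.exp(-np.subtract.outer(t, t) ** 2 / 2)   # G[j,k] = f(t_j - t_k)
print(np.linalg.eigvalsh(G).min() >= -1e-10)    # True: nonnegative definite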


6.5 Exercises

1. Prove that the characteristic function f_μ of a finite measure μ on Rⁿ has the following properties:
(1) f_μ(0) = μ(Rⁿ); (2) |f_μ(t)| ≤ f_μ(0); (3) f̄_μ(t) = f_μ(−t).

2. Prove Property 6.2(2).

3. A finite measure μ on (R, B) is called symmetric if μ((−∞, −x)) = μ((x, ∞)) for any x ≥ 0. Prove the following:
(a) μ is symmetric if and only if μ(A) = μ(−A), A ∈ B, where −A = {x : −x ∈ A};
(b) μ is symmetric if and only if its characteristic function is a real function.

4. Let μ be a finite measure on R such that ∫_R |f_μ(t)| dt < ∞. Prove that μ is absolutely continuous with respect to dx and

μ(dx)/dx = (1/2π) ∫_R e^{−itx} f_μ(t) dt.

5. Prove Proposition 6.3.



6. Let {ξ_n, ξ}_{n≥1} be centered normal random variables with variances {σ_n², σ²}_{n≥1}. If ξ_n →ᵈ ξ, prove σ_n² → σ².

7. Let {ξ_n}_{n≥1} be i.i.d. random variables with P(ξ_i = 0) = P(ξ_i = 1) = 1/2. Calculate the distribution and characteristic function of ξ = 2 Σ_{j=1}^∞ ξ_j/3^j.

8. Give an example showing that vague convergence is not equivalent to weak convergence.

9. Let {μ_n, μ}_{n≥1} be finite measures on a metric space E. Prove the following:
(a) μ_n →ʷ μ if and only if μ_n(A) → μ(A) for any μ-continuous compact A;
(b) μ_n →ʷ μ if and only if μ_n(A) → μ(A) for any μ-continuous open A;
(c) when E = Rⁿ, μ_n →ᵛ μ if and only if μ_n(I) → μ(I) for any finite μ-continuous interval I.

10. Let g ≥ 0 be a continuous function on Rⁿ and let {ξ_k, ξ}_{k≥1} be n-dimensional random variables. If ξ_n →ʷ ξ, prove

lim inf_{n→∞} Eg(ξ_n) ≥ Eg(ξ).

11. Let {F_n, F}_{n≥1} be probability distribution functions on Rⁿ such that F is continuous. Prove that F_n → F implies sup_x |F_n(x) − F(x)| → 0.

12. Let {μ_n, μ}_{n≥1} be finite measures on a measurable space (E, E). Prove that

sup_{A∈E} |μ_n(A) − μ(A)| → 0,  n ↑ ∞,

is equivalent to

sup_{f measurable, |f|≤1} |μ_n(f) − μ(f)| → 0,  n ↑ ∞.

13. Prove that a family of probability measures {μ_t, t ∈ T} on Rⁿ is tight if and only if there exists an increasing function φ : R₊ → R₊ with lim_{x→∞} φ(x) = ∞ such that sup_{t∈T} μ_t(φ(|·|)) < ∞.

14. Let {ξ_k}_{k≥1} be a sequence of random variables on Rⁿ such that {P ∘ ξ_k⁻¹}_{k≥1} is tight. Prove that for any random variables η_n →ᴾ 0, there holds ξ_n η_n →ᴾ 0.

15. Let h : R → R be measurable, and let D_h be the class of discontinuity points of h. Prove that if D_h is measurable, then for any finite measures {μ_n, μ}_{n≥1} on R such that μ_n →ʷ μ and μ(D_h) = 0, there holds μ_n ∘ h⁻¹ →ʷ μ ∘ h⁻¹.

16. Let μ(dx) = p(x) dx be a finite measure on (R, B).
(a) Prove lim_{|t|→∞} f_μ(t) = 0 (Hint: lim_{|t|→∞} (1/t) ∫_0^t f_μ(s) ds = 0).
(b) If p has an integrable derivative p′, prove that lim_{|t|→∞} t f_μ(t) = 0.
(c) What happens if p has integrable derivatives p^{(k)} for 1 ≤ k ≤ n for some n ≥ 2?

17. Let μ be a finite measure on R. Prove that for any x ∈ R,

μ({x}) = lim_{T→∞} (1/2T) ∫_{−T}^T e^{−itx} f_μ(t) dt.
Chapter 7

Probability Distances

Let (E, ρ) be a metric space with Borel σ-algebra E, and let P(E) be the class of all probability measures on (E, E). In this chapter, we introduce some distances on P(E), including a metrization of the weak topology, the total variation distance for the uniform convergence, and the Wasserstein distance arising from optimal transport.

7.1 Metrization of Weak Topology

Let (E, ρ) be a Polish space. Then the space Cb(E) of bounded continuous functions is also a Polish space under the uniform norm ||f||_∞ = sup_E |f| (see [18–20]). For a dense sequence {f_n}_{n≥1} in Cb(E), define

d_w(μ, ν) := Σ_{n=1}^∞ 2^{−n} {|μ(f_n) − ν(f_n)| ∧ 1},  μ, ν ∈ P(E).

Theorem 7.1. Let (E, ρ) be a Polish space. Then (P(E), d_w) is a separable metric space, and for any {μ_n}_{n≥1} ⊂ P(E) and μ ∈ P(E), μ_n →ʷ μ if and only if d_w(μ_n, μ) → 0. If E is compact, then (P(E), d_w) is complete.

Proof. (a) d_w is a distance: Obviously, d_w(μ, μ) = 0. If d_w(μ, ν) = 0, then μ(f_n) − ν(f_n) = 0 for all n. Since {f_n}_{n≥1} is dense in Cb(E), it follows that μ(f) = ν(f) for any f ∈ Cb(E); thus μ = ν by Lemma 6.9. Finally, d_w clearly satisfies the triangle inequality.

(b) Equivalence to the weak topology: Obviously, if μ_n →ʷ μ, then d_w(μ_n, μ) → 0. Conversely, let d_w(μ_n, μ) → 0. We are going to prove μ_n(f) − μ(f) → 0 for any f ∈ Cb(E). Given f ∈ Cb(E), since {f_n} is dense in Cb(E), for any ε > 0, there exists n_0 ≥ 1 such that ||f_{n_0} − f||_∞ < ε. So,

lim sup_{n→∞} |μ_n(f) − μ(f)| ≤ 2ε + lim sup_{n→∞} |μ_n(f_{n_0}) − μ(f_{n_0})| ≤ 2ε + 2^{n_0+1} lim_{n→∞} d_w(μ_n, μ) = 2ε.

As ε is arbitrary, we have μ_n(f) → μ(f).

(c) Separability: For every m ≥ 1, let U_m = {(μ(f_1), ..., μ(f_m)) : μ ∈ P(E)} ⊂ Rᵐ. Since Rᵐ is separable, so is U_m. Thus, there exists a countable set P_m ⊂ P(E) such that

Ũ_m := {(μ(f_1), ..., μ(f_m)) : μ ∈ P_m}

is dense in U_m. Thus, P_∞ := ∪_{m=1}^∞ P_m is a countable subset of P(E), and it suffices to prove that P_∞ is dense in P(E) under the distance d_w.

In fact, for any μ ∈ P(E) and m ≥ 1, there exists μ_m ∈ P_m such that

|μ_m(f_i) − μ(f_i)| ≤ 1/m,  ∀1 ≤ i ≤ m.

Thus,

d_w(μ_m, μ) ≤ 2^{−m} + 1/m → 0  (m → ∞).

(d) Completeness of d_w: Assume E is compact. Let {μ_n}_{n≥1} ⊂ P(E) be a Cauchy sequence under d_w. Then for every m ≥ 1, {μ_n(f_m)}_{n≥1} is a Cauchy sequence, so it converges to some number, denoted by φ(f_m). Moreover, given f ∈ Cb(E), for every ε > 0 there exists m_0 ≥ 1 such that ||f_{m_0} − f||_∞ < ε. Thus,

lim sup_{m,n→∞} |μ_m(f) − μ_n(f)| ≤ 2ε + lim sup_{m,n→∞} |μ_m(f_{m_0}) − μ_n(f_{m_0})| = 2ε.

As ε is arbitrary, {μ_n(f)}_{n≥1} is also a Cauchy sequence, which converges to some number, denoted by φ(f). By the properties of the integral, it follows that

φ : Cb(E) → R

is a linear map, φ(1) = 1, and φ(f) ≥ 0 for f ≥ 0. By Riesz's representation theorem, there exists a unique μ ∈ P(E) such that μ(f) = φ(f) for every f ∈ Cb(E), see [11, Theorem IV.14]. By the construction of φ, it follows that μ_n →ʷ μ, hence d_w(μ_n, μ) → 0 by (b). □

7.2 Wasserstein Distance and Optimal Transport

In this section, we introduce the transportation problem initiated by Monge in 1781 and characterized by Kantorovich in the 1940s using couplings, which leads to the notion of the Wasserstein distance. In particular, when E is a Polish space, P(E) is also a Polish space under the Wasserstein distance.

Definition 7.2. Let P(E) be the class of probability measures on a measurable space (E, E), and let μ, ν ∈ P(E). A probability measure π on the product space (E × E, E × E) is called a coupling of μ and ν, denoted by π ∈ C(μ, ν), if its marginals are μ and ν, i.e.

π(A × E) = μ(A),  π(E × A) = ν(A),  A ∈ E.

A simple coupling is the product measure μ × ν, which is called the independent coupling of μ and ν. Therefore, C(μ, ν) ≠ ∅.

7.2.1 Transport problem, coupling and Wasserstein distance

Before introducing the general theory, let us consider a simple example. Let x_1, ..., x_n be n cities, each of which produces and consumes a certain product. Let μ and ν be the produced (initial) distribution and the consumed (target) distribution, respectively. We intend to design a scheme to transport the product from the initial distribution μ to the target distribution ν.

Let μ({x_i}) = μ_i, ν({x_i}) = ν_i, 1 ≤ i ≤ n. We have μ_i, ν_i ≥ 0 and Σ_{i=1}^n μ_i = Σ_{i=1}^n ν_i = 1. So, μ and ν are probability measures on the space E := {x_1, ..., x_n}. Let π = {π_{ij} : 1 ≤ i, j ≤ n} be a transport scheme, where π_{ij} ≥ 0 denotes the amount of product transported from x_i to x_j. Then the scheme π transports μ into ν if and only if

μ_i = Σ_{j=1}^n π_{ij},  ν_i = Σ_{j=1}^n π_{ji},  1 ≤ i ≤ n.

Thus, π is a scheme transporting μ into ν if and only if π ∈ C(μ, ν).

Let ρ_{ij} ≥ 0 be the cost to transport a unit of product from x_i to x_j, which is called the cost function. Then for any scheme π ∈ C(μ, ν), the total cost is

Σ_{i,j=1}^n ρ_{ij} π_{ij} = ∫_{E×E} ρ dπ.

Thus, the lowest cost to transport μ to ν is

W_1^ρ(μ, ν) := inf_{π∈C(μ,ν)} ∫_{E×E} ρ dπ,

which is called the L¹-Wasserstein distance between μ and ν induced by ρ; a linear-programming sketch of this discrete problem follows below.
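The discrete problem above is a finite linear program and can be solved directly; in the sketch below (cities, costs ρ_ij and the distributions μ, ν are made up for the demonstration), scipy's linprog returns the optimal cost W_1^ρ(μ, ν) and an optimal coupling.

# Sketch: the discrete transport problem as a linear program.
import numpy as np
from scipy.optimize import linprog

mu = np.array([0.5, 0.3, 0.2])          # produced distribution
nu = np.array([0.2, 0.2, 0.6])          # consumed distribution
rho = np.array([[0., 1., 2.],           # cost rho_ij to move one unit i -> j
                [1., 0., 1.],
                [2., 1., 0.]])

n = len(mu)
# equality constraints: row sums of pi equal mu, column sums equal nu
A_eq = np.zeros((2 * n, n * n)); b_eq = np.concatenate([mu, nu])
for i in range(n):
    A_eq[i, i*n:(i+1)*n] = 1.0          # sum_j pi_ij = mu_i
    A_eq[n + i, i::n] = 1.0             # sum_i pi_ij = nu_i (column sums)
res = linprog(rho.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
print(res.fun)                           # W_1^rho(mu, nu)
print(res.x.reshape(n, n).round(3))      # an optimal coupling pi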
In general, we define the L^p-Wasserstein distance on P(E) over a metric space (E, ρ) as follows.

Definition 7.3. Let (E, ρ) be a metric space. For p ∈ [1, ∞), define the L^p-Wasserstein distance induced by ρ as

W_p(μ, ν) := inf_{π∈C(μ,ν)} ( ∫_{E×E} ρ^p dπ )^{1/p},  μ, ν ∈ P(E).

A coupling π is called optimal if it attains the infimum.

In general, ρ may be unbounded, so that W_p(μ, ν) may be infinite for some μ, ν ∈ P(E). To make W_p finite, we restrict it to the following subspace of P(E) of finite p-th moment:

P_p(E) = {μ ∈ P(E) : μ(ρ(o, ·)^p) < ∞},  p ≥ 1,

where o ∈ E is any fixed point. By the triangle inequality, the definition of P_p(E) is independent of the choice of o ∈ E.

7.2.2 Optimal coupling and Kantorovich's dual formula

We first consider the existence of an optimal coupling.

Theorem 7.4. Let (E, ρ) be a Polish space. Then for all μ, ν ∈ P_p(E), there exists π ∈ C(μ, ν) such that W_p(μ, ν)^p = π(ρ^p).

Proof. Since μ, ν ∈ P_p(E) and μ × ν ∈ C(μ, ν), we have

W_p(μ, ν)^p ≤ ∫_{E×E} ρ^p(x, y) μ(dx)ν(dy) ≤ 2^{p−1} ∫_{E×E} (ρ^p(x, o) + ρ^p(y, o)) μ(dx)ν(dy) < ∞.

Thus, for any n ≥ 1, there exists π_n ∈ C(μ, ν) such that

W_p(μ, ν)^p ≥ π_n(ρ^p) − 1/n.  (7.1)

So, if the π_n converge weakly to some π_0, then π_0 is an optimal coupling. For this, we first prove that {π_n}_{n≥1} is tight. In fact, by Theorem 6.16, the finite set {μ, ν} is tight, so for any ε > 0, there exists a compact set K ⊂ E such that μ(K^c) + ν(K^c) < ε. Thus, for all π ∈ C(μ, ν), π((K × K)^c) ≤ π(K^c × E) + π(E × K^c) = μ(K^c) + ν(K^c) < ε. Therefore, C(μ, ν) is tight. Hence, there exist a subsequence {π_{n_k}}_{k≥1} and π_0 ∈ P(E × E) such that π_{n_k} →ʷ π_0 (k → ∞). Obviously, π_0 ∈ C(μ, ν). Combining this with (7.1), we obtain that for any N ∈ (0, ∞),

π_0(ρ^p ∧ N) = lim_{k→∞} π_{n_k}(ρ^p ∧ N) ≤ W_p(μ, ν)^p.

Letting N ↑ ∞ gives π_0(ρ^p) ≤ W_p(μ, ν)^p. □


From Definition 7.3, it is easy to derive an upper bound estimate
on the Wasserstein distance. To estimate it from below, we intro-
duce the Kantorovich dual formula by using the following classes of
function pairs for μ, ν ∈ P(E):

Fμ,ν = (f, g) : f ∈ L1 (μ), g ∈ L1 (ν), f (x)  g(y) + ρ(x, y)p , ∀x, y ∈ E ,
FLip = {(f, g) : f, g Lipschitz continuous f (x)  g(y) + ρ(x, y)p , x, y ∈ E}.

Theorem 7.5 (Kantorovich's dual formula). Let (E, ρ) be a Polish space. Then for all μ, ν ∈ P_p(E),

W_p(μ, ν)^p = sup_{(f,g)∈F_{μ,ν}} {μ(f) − ν(g)} = sup_{(f,g)∈F_Lip} {μ(f) − ν(g)}.  (7.2)

Proof. Since F_Lip ⊂ F_{μ,ν}, we only need to prove

sup_{(f,g)∈F_{μ,ν}} {μ(f) − ν(g)} ≤ W_p(μ, ν)^p ≤ sup_{(f,g)∈F_Lip} {μ(f) − ν(g)}.

In the following, we only prove the first inequality, as the second is far from trivial; see, for instance, [10, Section 3] or [3, Chapter 5] for details.

Let (f, g) ∈ F_{μ,ν} and π ∈ C(μ, ν). We have

μ(f) − ν(g) = ∫_{E×E} (f(x) − g(y)) π(dx, dy) ≤ ∫_{E×E} ρ(x, y)^p π(dx, dy).

Thus, the first inequality follows by taking the infimum over π ∈ C(μ, ν) and the supremum over (f, g) ∈ F_{μ,ν}. □
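For the finite example of Section 7.2.1, both sides of (7.2) are linear programs, so the duality can be checked numerically; the sketch below (same made-up μ, ν, ρ as in the transport sketch, p = 1) maximizes μ(f) − ν(g) subject to f(x_i) ≤ g(x_j) + ρ_ij and should reproduce the primal optimal cost.

# Sketch: Kantorovich dual of the discrete transport problem.
import numpy as np
from scipy.optimize import linprog

mu = np.array([0.5, 0.3, 0.2]); nu = np.array([0.2, 0.2, 0.6])
rho = np.array([[0., 1., 2.], [1., 0., 1.], [2., 1., 0.]])
n = len(mu)

# dual variables (f, g); constraints f_i - g_j <= rho_ij
A_ub = np.zeros((n * n, 2 * n)); b_ub = rho.ravel()
for i in range(n):
    for j in range(n):
        A_ub[i*n + j, i] = 1.0          # +f_i
        A_ub[i*n + j, n + j] = -1.0     # -g_j
c = np.concatenate([-mu, nu])           # maximize mu.f - nu.g
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
print(-res.fun)                         # equals the primal value W_1(mu, nu)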

7.2.3 The metric space (P_p(E), W_p)

Theorem 7.6. Let (E, ρ) be a Polish space. Then (P_p(E), W_p) is also a Polish space.

Proof. (a) First, we prove that W_p is a metric. Obviously, W_p(μ, ν) = 0 if and only if μ = ν, so we only need to prove the triangle inequality.

For μ_1, μ_2, μ_3 ∈ P_p(E), let π_12 and π_23 be optimal couplings of (μ_1, μ_2) and (μ_2, μ_3), respectively. We have

W_p(μ_1, μ_2) = π_12(ρ^p)^{1/p},  W_p(μ_2, μ_3) = π_23(ρ^p)^{1/p}.

To construct a coupling of μ_1 and μ_3, let π_12(x_1, dx_2) be the regular conditional probability of π_12 given x_1, and let π_23(x_2, dx_3) be the regular conditional probability of π_23 given x_2. Set

π_13(A × B) = ∫_A μ_1(dx_1) ∫_E π_23(x_2, B) π_12(x_1, dx_2).

It is clear that π_13 ∈ C(μ_1, μ_3). Then

π(dx_1, dx_2, dx_3) := μ_1(dx_1) π_12(x_1, dx_2) π_23(x_2, dx_3)

is a probability measure on E × E × E, and for

ρ_{ij}(x_1, x_2, x_3) := ρ(x_i, x_j),  1 ≤ i, j ≤ 3,

we have

π(ρ_{ij}^p) = π_{ij}(ρ^p),  1 ≤ i, j ≤ 3.

Thus, by the triangle inequality in L^p(π),

W_p(μ_1, μ_3) ≤ π(ρ_13^p)^{1/p} ≤ π((ρ_12 + ρ_23)^p)^{1/p} ≤ π(ρ_12^p)^{1/p} + π(ρ_23^p)^{1/p} = W_p(μ_1, μ_2) + W_p(μ_2, μ_3).

(b) Next, we prove that W_p is complete. Let {μ_n}_{n≥1} ⊂ P_p(E) be a Cauchy sequence under W_p. Then {μ_n}_{n≥1} is tight (see Lemma 6.14 in [13]). Without loss of generality, we assume that μ_n →ʷ μ for some μ ∈ P(E). On the other hand, given o ∈ E, we have

μ_n(ρ(o, ·)^p) ≤ 2^{p−1} μ_1(ρ(o, ·)^p) + 2^{p−1} W_p(μ_1, μ_n)^p,

which is bounded in n ≥ 1, so there exists C > 0 such that for all N ≥ 1,

μ(ρ(o, ·)^p ∧ N) = lim_{n→∞} μ_n(ρ(o, ·)^p ∧ N) ≤ C.

Thus, μ ∈ P_p(E) and

lim inf_{n→∞} μ_n(ρ(o, ·)^p) ≥ μ(ρ(o, ·)^p).  (7.3)

Moreover, for every ε > 0 there exists n_0 ≥ 1 such that W_p(μ_{n_0}, μ_n)^p ≤ ε for all n ≥ n_0. Then

μ_n((N − ρ(o, ·)^p)⁺) ≤ μ_{n_0}((N − ρ(o, ·)^p)⁺) + |μ_n((N − ρ(o, ·)^p)⁺) − μ_{n_0}((N − ρ(o, ·)^p)⁺)|
≤ μ_{n_0}((N − ρ(o, ·)^p)⁺) + 2^{p−1} W_p(μ_n, μ_{n_0})^p
≤ μ_{n_0}((N − ρ(o, ·)^p)⁺) + 2^{p−1} ε.

Hence,

lim sup_{n→∞} μ_n(ρ(o, ·)^p) ≤ lim inf_{n→∞} μ_n(ρ(o, ·)^p ∧ N) + 2^{p−1} ε.

As ε is arbitrary and μ_n →ʷ μ, we have

lim sup_{n→∞} μ_n(ρ(o, ·)^p) ≤ lim_{n→∞} μ_n(ρ(o, ·)^p ∧ N) = μ(ρ(o, ·)^p ∧ N) ≤ μ(ρ(o, ·)^p).

From this and (7.3), it follows that μ(ρ(o, ·)^p) = lim_{n→∞} μ_n(ρ(o, ·)^p). Thus, by Kantorovich's dual formula, the dominated convergence theorem, and W_p(μ_n, μ_m) → 0 as n, m → ∞, we obtain

lim_{n→∞} W_p(μ, μ_n)^p = lim_{n→∞} sup_{(f,g)∈F_Lip} {μ(f) − μ_n(g)}
= lim_{n→∞} sup_{(f,g)∈F_Lip} lim_{m→∞} {μ_m(f) − μ_n(g)} ≤ lim_{n→∞} lim sup_{m→∞} W_p(μ_m, μ_n)^p = 0.

(c) Finally, we prove that W_p is separable. For N ≥ 1, let

P_p^{(N)}(E) = {μ ∈ P_p(E) : supp μ ⊂ B̄(o, N)},

where B̄(o, N) is the closed ball of radius N centered at o. For any μ ∈ P_p(E), it is easy to prove that, as N → ∞,

μ_N := μ(· ∩ B(o, N))/μ(B(o, N)) → μ in W_p,

so ∪_{N=1}^∞ P_p^{(N)}(E) is dense in (P_p(E), W_p). Thus, we only need to prove that each P_p^{(N)}(E) is separable. Since ρ(o, ·) is bounded on B̄(o, N), as shown in step (b), the weak convergence is equivalent to the convergence in W_p (see Exercise 6). Then the proof is finished by Theorem 7.1, which says that P_p^{(N)}(E) is separable under the weak topology. □

Theorem 7.7. Let M ⊂ P_p(E). Then M is compact under W_p if and only if it is weakly compact and

lim_{N→∞} sup_{μ∈M} μ(ρ(o, ·)^p 1_{{ρ(o,·)≥N}}) = 0.  (7.4)

Proof. (a) Necessity. It is clear that W_p(μ_n, μ) → 0 implies μ_n(f) → μ(f) for any Lipschitz continuous function f, so the topology induced by W_p is stronger than the weak topology. Thus, if M is compact under W_p, then M is also compact under the weak topology. It remains to prove (7.4).

Since M is compact under W_p, for any ε > 0, there exist μ_1, ..., μ_n ∈ M such that

min_{1≤i≤n} W_p(μ_i, μ)^p < ε,  μ ∈ M.

Thus,

μ((ρ(o, ·)^p − N)⁺) ≤ max_{1≤i≤n} μ_i((ρ(o, ·)^p − N)⁺) + 2^{p−1} ε
≤ Σ_{i=1}^n μ_i((ρ(o, ·)^p − N)⁺) + 2^{p−1} ε,  μ ∈ M.

Hence,

lim sup_{N→∞} sup_{μ∈M} μ(ρ(o, ·)^p 1_{{ρ(o,·)≥N}}) ≤ 2 lim sup_{N→∞} sup_{μ∈M} μ((ρ(o, ·)^p − N^p/2)⁺) ≤ 2^p ε.

As ε is arbitrary, we get (7.4) immediately.
(b) Sufficiency. Let M be weakly compact and let (7.4) hold. We intend to prove that M is compact under W_p. For this, we only need to prove that any sequence {μ_n}_{n≥1} ⊂ M has a convergent subsequence under W_p. By the weak compactness of M, we may and do assume that μ_n →ʷ μ. Let {x_1, x_2, ...} be a dense subset of E. Then ∪_{i=1}^∞ B(x_i, ε) ⊃ E for any ε > 0, where B(x_i, ε) is the open ball of radius ε centered at x_i. Since the set {ε > 0 : ∃i ≥ 1 such that μ(∂B(x_i, ε)) > 0} is at most countable, for every m ≥ 1 we may take ε_m ∈ (0, 1/m) such that the B(x_i, ε_m) are all μ-continuous sets. Let

U_1 = B(x_1, ε_m),  U_{i+1} = B(x_{i+1}, ε_m) ∖ ∪_{j=1}^i B(x_j, ε_m).

Then {U_i}_{i≥1} is a sequence of mutually disjoint μ-continuous sets, ∪_{i=1}^∞ U_i = E, and the radius of each U_i is less than 1/m. Let r_n = Σ_{i=1}^∞ μ_n(U_i) ∧ μ(U_i). Then r_n ∈ [0, 1] with lim_{n→∞} r_n = 1. Let

Q_n(dx) = μ_n(dx) − Σ_{i=1}^∞ (μ_n(U_i) ∧ μ(U_i))/μ_n(U_i) · 1_{U_i}(x) μ_n(dx),
Q(dx) = μ(dx) − Σ_{i=1}^∞ (μ_n(U_i) ∧ μ(U_i))/μ(U_i) · 1_{U_i}(x) μ(dx).

Then

π_n(dx, dy) := Σ_{i=1}^∞ (μ_n(U_i) ∧ μ(U_i))/(μ_n(U_i)μ(U_i)) · 1_{U_i}(x) 1_{U_i}(y) μ_n(dx)μ(dy) + (1/(1 − r_n)) Q_n(dx)Q(dy)

is a coupling of μ_n and μ (if r_n = 1, the last term is set to 0), hence

W_p(μ_n, μ)^p ≤ π_n(ρ^p) ≤ m^{−p} + (2^{p−1}/(1 − r_n)) (Q_n(ρ(o, ·)^p) + Q(ρ(o, ·)^p))
≤ m^{−p} + 2^p N^p (1 − r_n) + 2^{p−1} sup_{k≥1} μ_k(ρ(o, ·)^p 1_{{ρ(o,·)≥N}}) + 2^{p−1} μ(ρ(o, ·)^p 1_{{ρ(o,·)≥N}}).



By first letting n → ∞, then N → ∞ and finally m → ∞, we obtain W_p(μ_n, μ) → 0 (n → ∞). □

Wp (μn , μ) → 0(n → ∞). 

7.3 Total Variation Distance

Let P(E) be the class of probability measures on a measurable space (E, E). The total variation distance on P(E) is defined as

    ||μ − ν||_Var := |μ − ν|(E) = 2(μ − ν)^+(E) = 2(ν − μ)^+(E).    (7.5)

We will characterize this distance by using the Wasserstein coupling. To this end, we first define the wedge μ ∧ ν of μ and ν.
Proposition 7.8. For any μ, ν ∈ P(E),

    μ ∧ ν := μ − (μ − ν)^+ = ν − (ν − μ)^+

is a sub-probability measure, i.e. it is a measure with (μ ∧ ν)(E) ≤ 1.

Proof. Since μ ≥ (μ − ν)^+ and ν ≥ (ν − μ)^+, both μ − (μ − ν)^+ and ν − (ν − μ)^+ are sub-probability measures. It suffices to prove that they are equal. By Hahn's decomposition theorem, there exists D ∈ E such that (μ − ν)(D) = inf_{A∈E} (μ − ν)(A), and for any A ∈ E,

    (μ − ν)^+(A) = (μ − ν)(D^c ∩ A),  (ν − μ)^+(A) = (ν − μ)(A ∩ D).

Thus,

    (μ − (μ − ν)^+)(A) = μ(A) − μ(D^c ∩ A) + ν(D^c ∩ A)
      = μ(A ∩ D) + ν(A) − ν(D ∩ A)
      = (ν − (ν − μ)^+)(A). □

The Wasserstein coupling of μ and ν is defined as

    π_0(dx, dy) := (μ ∧ ν)(dx) δ_x(dy) + (μ − ν)^+(dx)(μ − ν)^−(dy) / (μ − ν)^−(E),

where for μ = ν the last term is set to 0.

Regarding a coupling as a scheme to transport μ into ν, the idea of this coupling is to keep the common part of μ and ν fixed, and to transport (μ − ν)^+ to (μ − ν)^− = (ν − μ)^+ using the independent coupling. The following result shows that the Wasserstein coupling gives an optimal transport under the cost function 1_{{x≠y}}.
Theorem 7.9. Let D_0 = {(x, x) : x ∈ E} ∈ E × E. Then π_0 ∈ C(μ, ν), and

    ||μ − ν||_Var = 2 inf_{π∈C(μ,ν)} π(D_0^c) = 2π_0(D_0^c).    (7.6)

Proof. (a) Obviously, π_0 is a probability measure on the product space (E × E, E × E). When μ = ν, we have π_0(dx, dy) = μ(dx)δ_x(dy), so that

    π_0(A × E) = π_0(E × A) = μ(A),  A ∈ E.

Hence, π_0 ∈ C(μ, ν).

When μ ≠ ν, we have (μ − ν)^+(E) > 0. Since (μ − ν)^− = (ν − μ)^+ and μ(E) = ν(E) = 1, we have (μ − ν)^−(E) = (μ − ν)^+(E). Thus, for any A ∈ E,

    π_0(A × E) = (μ ∧ ν)(A) + (μ − ν)^+(A)(μ − ν)^−(E) / (μ − ν)^−(E)
      = μ(A) − (μ − ν)^+(A) + (μ − ν)^+(A) = μ(A),

    π_0(E × A) = (μ ∧ ν)(A) + (μ − ν)^−(A)
      = ν(A) − (μ − ν)^−(A) + (μ − ν)^−(A) = ν(A).

Hence, π_0 ∈ C(μ, ν).


(b) For any π ∈ C(μ, ν), we have

    μ(A) − ν(A) = π(A × E) − π(E × A) ≤ π({(x, y) : x ∈ A, y ∉ A}) ≤ π(D_0^c).

Thus, ||μ − ν||_Var ≤ 2π(D_0^c). To prove equation (7.6), we only need to prove ||μ − ν||_Var ≥ 2π_0(D_0^c). We prove this in the case μ ≠ ν, the case μ = ν being trivial.

By (7.5) and the definition of π_0, it follows that

    π_0(D_0^c) = (μ − ν)^−(E)^{−1} ∫_{D_0^c} (μ − ν)^+(dx)(μ − ν)^−(dy)
      ≤ (μ − ν)^−(E)^{−1} ∫_{E×E} (μ − ν)^+(dx)(μ − ν)^−(dy)
      = (μ − ν)^+(E) = (1/2)||μ − ν||_Var. □
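To make the Wasserstein coupling concrete, here is a minimal Python sketch (an illustration on a finite state space; the probability vectors below are arbitrary choices) that builds π_0 for two discrete measures and verifies both equalities in (7.6), together with the identity of Exercise 2 below.

```python
import numpy as np
from itertools import combinations

# Two probability vectors on the finite space {0, 1, 2, 3}.
mu = np.array([0.4, 0.3, 0.2, 0.1])
nu = np.array([0.1, 0.3, 0.2, 0.4])

diff = mu - nu
pos, neg = np.maximum(diff, 0), np.maximum(-diff, 0)   # (mu - nu)^+ and (mu - nu)^-
wedge = mu - pos                                       # the wedge mu ∧ nu
tv = np.abs(diff).sum()                                # |mu - nu|(E) = 2 (mu - nu)^+(E)

# Wasserstein coupling: keep the common part on the diagonal, and
# transport (mu - nu)^+ to (mu - nu)^- by the independent (product) coupling.
pi0 = np.diag(wedge) + np.outer(pos, neg) / neg.sum()
assert np.allclose(pi0.sum(axis=1), mu) and np.allclose(pi0.sum(axis=0), nu)

off_diag = pi0.sum() - np.trace(pi0)                   # pi0(D0^c)
print(tv, 2 * off_diag)                                # both equal 0.6 here

# Exercise 2 form of the same quantity: 2 sup_A |mu(A) - nu(A)|.
E = range(4)
sup_A = max(abs(mu[list(A)].sum() - nu[list(A)].sum())
            for r in E for A in combinations(E, r + 1))
print(2 * sup_A)                                       # again 0.6
```

The product term moves exactly the excess mass (μ − ν)^+ onto the deficit (μ − ν)^−, which is why the off-diagonal mass of π_0 equals half the total variation distance.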

7.4 Exercises

1. Let (E, ρ) be a Polish space. On P(E), construct a metric compatible with the vague convergence and justify the compatibility. Is this metric complete?
2. Let (E, E) be a measurable space. Prove that for all μ, ν ∈ P(E),

    ||μ − ν||_Var = 2 sup_{A∈E} |μ(A) − ν(A)| = |μ − ν|(E).

3. Let V ≥ 1 be a measurable function on a measurable space (E, E). For a finite measure μ on (E, E), define the weighted norm

    ||μ||_V = ∫_E V(x) μ(dx).

Prove that for any measurable function f,

    sup_{||μ||_V ≤ 1} ∫_E |f| dμ = sup_{x∈E} |f(x)| / V(x).

4. Let (E, ρ) be a Polish space. Prove that under the total variation distance, the space P(E) is complete. Give an example showing that it may not be separable.
5. Let (E, ρ) be a Polish space with Borel σ-algebra E, and let μ, ν ∈ P(E). Construct a probability space (Ω, A, P) and measurable maps

    ξ, η : Ω → E,

called E-valued random variables, such that P ∘ ξ^{−1} = μ, P ∘ η^{−1} = ν and W_p(μ, ν)^p = E ρ(ξ, η)^p.
156 Foundation of Probability Theory

6. Let (E, ρ) be a Polish space and {μ_n}_{n≥1} ∪ {μ} ⊂ P_p(E). Prove that W_p(μ_n, μ) → 0 if and only if μ_n → μ weakly and lim_{n→∞} μ_n(ρ(o,·)^p) = μ(ρ(o,·)^p), where o ∈ E is a fixed point.
7. Let (E, ρ) be a compact metric space. Prove that for any p ∈ [1, ∞), (P(E), W_p) is also a compact metric space. Give an example in which (E, ρ) is locally compact but (P(E), W_p) is not.
8. (Lévy distance) For any probability distribution functions F, G on R, let

    ρ_L(F, G) = inf{ε ≥ 0 : F(x − ε) − ε ≤ G(x) ≤ F(x + ε) + ε, x ∈ R}.

Prove that ρ_L is a distance and that ρ_L(F_n, F) → 0 if and only if F_n(x) → F(x) at every continuity point x of F.
9. For any one-dimensional random variables ξ, η, let F_ξ and F_η be their distribution functions. Define

    α(ξ, η) = inf{ε ≥ 0 : P(|ξ − η| > ε) ≤ ε}  and  β(ξ, η) = E[ |ξ − η| / (1 + |ξ − η|) ].

Prove

    ρ_L(F_ξ, F_η) ≤ α(ξ, η)

and

    α(ξ, η)² / (1 + α(ξ, η)) ≤ β(ξ, η) ≤ α(ξ, η) + (1 − α(ξ, η)) α(ξ, η) / (1 + α(ξ, η)).
Chapter 8

Calculus on the Space of Finite Measures

In this chapter, we introduce the intrinsic and extrinsic derivatives for functions of finite measures and develop the corresponding calculus. For simplicity, we only consider measures on R^d, but the related theory also applies to measures on more general spaces, such as Riemannian manifolds and separable Banach spaces.

Recall that M is the set of all finite measures on R^d. For fixed k ∈ [1, ∞), let

    M_k = {μ ∈ M : μ(|·|^k) < ∞}

be the set of all finite measures on R^d having finite kth moment, and let

    P_k = {μ ∈ M_k : μ(R^d) = 1}

be the set of probability measures on R^d having finite kth moment. Both are Polish spaces under the k-Wasserstein distance W_k, which is defined in Definition 7.4 on P_k and is extended to μ, ν ∈ M by

    W_k(μ, ν) := ||μ||_k := [μ(|·|^k)]^{1/k}  if ν = 0;
    W_k(μ, ν) := ||ν||_k := [ν(|·|^k)]^{1/k}  if μ = 0;
    W_k(μ, ν) := W_k(μ/μ(R^d), ν/ν(R^d)) + |μ(R^d) − ν(R^d)|  otherwise.

We will define intrinsic and extrinsic derivatives on P_k and M_k, and develop calculus with these derivatives.
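As a numerical illustration of this extension (an added sketch, not part of the original text): for two uniform empirical measures on R with the same number of atoms, W_k of the normalized parts is computed exactly by sorting and matching the atoms (the monotone coupling is optimal in dimension one for k ≥ 1), after which the mass-difference term is added. The helper names wk_prob and wk_finite and the sample data are illustrative choices.

```python
import numpy as np

def wk_prob(x, y, k):
    """W_k between two uniform empirical probability measures on R with the
    same number of atoms: sort and match (exact in dimension one, k >= 1)."""
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** k) ** (1 / k)

def wk_finite(x, mass_x, y, mass_y, k):
    """Extension of W_k to the finite measures mu = (mass_x/n) sum_i delta_{x_i}
    and nu = (mass_y/n) sum_j delta_{y_j}, following the three-case definition."""
    if mass_y == 0:
        return (mass_x * np.mean(np.abs(x) ** k)) ** (1 / k)   # ||mu||_k
    if mass_x == 0:
        return (mass_y * np.mean(np.abs(y) ** k)) ** (1 / k)   # ||nu||_k
    return wk_prob(x, y, k) + abs(mass_x - mass_y)

rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, 1000), rng.normal(1, 1, 1000)
print(wk_finite(x, 2.0, y, 1.5, k=2))   # W_2 of the normalized parts + |2.0 - 1.5|
```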

8.1 Intrinsic Derivative and Chain Rule

The intrinsic derivative for measures was introduced in [1] to construct diffusion processes on configuration spaces over a Riemannian manifold, and was used in [16] to study the geometry of dissipative evolution equations; see [2] for analysis and geometry on the Wasserstein space over a metric measure space.

In this chapter, we introduce the intrinsic derivative on P_k for k ∈ [1, ∞) and establish the chain rule for functions of the distributions of random variables having finite kth moment.

8.1.1 Vector field and tangent space


To define the intrinsic derivative, let us first recall the directional derivative along a vector v ∈ R^d of a differentiable function f on R^d:

    ∇_v f(x) := lim_{ε↓0} [f(x + εv) − f(x)]/ε,  x ∈ R^d.

The directional derivative operator ∇_v reflects the rate of change of a function along the line

    [0, 1] ∋ ε → x + εv,

which pushes a particle forward from the point x along the direction v. Moreover, the gradient

    ∇f(x) := (∂_{x_1} f(x), ..., ∂_{x_d} f(x))

is the unique element of R^d such that

    ⟨∇f(x), v⟩ = ∇_v f(x),  v ∈ R^d.    (8.1.1)

Now, let us characterize the rate of change of a function f on P_k. In this case, we replace x ∈ R^d by a distribution μ ∈ P_k. To push forward the distribution μ, we need to push forward all points x in the support of μ, so that instead of the single line ε → x + εv, we need a family of lines {ε → x + εv(x)}_{x∈R^d}, which leads to the notion of a vector field.
Definition 8.1. A vector field on R^d is a measurable map

    v : R^d ∋ x → v(x) ∈ R^d.

Now, for each vector field v on R^d, we may push forward a measure μ ∈ M along v as

    [0, 1] ∋ ε → μ ∘ (id + εv)^{−1},

where id : x → x is the identity map, and for each ε ∈ [0, 1], μ ∘ (id + εv)^{−1} is the finite measure on R^d defined by

    (μ ∘ (id + εv)^{−1})(A) := μ((id + εv)^{−1}(A))

for A ∈ B(R^d), the Borel σ-algebra on R^d.

Then we may define the directional derivative along v of a function f of measures as follows:

    lim_{ε↓0} [f(μ ∘ (id + εv)^{−1}) − f(μ)]/ε,

provided the limit exists. When a function on M_k is concerned, we need to assume that μ ∘ (id + εv)^{−1} ∈ M_k for μ ∈ M_k. It is easy to see that this is equivalent to

    ∫_{R^d} |v|^k dμ < ∞,

since by the integral transformation (Theorem 3.27),

    ∫_{R^d} |x|^k (μ ∘ (id + εv)^{−1})(dx) = ∫_{R^d} |x + εv(x)|^k μ(dx).

This leads to the notion of the tangent space at a point μ ∈ M_k, which is the class of all vector fields v on R^d with μ(|v|^k) < ∞.
Definition 8.2. Let k ∈ [1, ∞) and μ ∈ M_k. The tangent space at the point μ is defined as

    T_{μ,k} := L^k(R^d → R^d, μ) := {v : R^d → R^d measurable with μ(|v|^k) < ∞}.
The tangent space Tμ,k is a separable Banach space, and when k = 2,
it is a separable Hilbert space.
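For an empirical measure, the pushforward μ ∘ (id + εv)^{−1} simply moves every atom along the vector field, and the tangent-space condition μ(|v|^k) < ∞ is automatic. A minimal Python sketch (the Gaussian sample and the choice v(x) = −x are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
atoms = rng.normal(0.0, 1.0, size=(500, 2))   # atoms of an empirical measure mu on R^2
v = lambda x: -x                              # illustrative vector field v(x) = -x

def pushforward(atoms, v, eps):
    """Atoms of mu o (id + eps*v)^{-1}: the particle at x moves to x + eps*v(x)."""
    return atoms + eps * v(atoms)

# mu(|.|^2) along the flow: the contracting field pulls all mass toward 0.
for eps in (0.0, 0.5, 1.0):
    moved = pushforward(atoms, v, eps)
    print(eps, (moved ** 2).sum(axis=1).mean())   # (1 - eps)^2 times the original value
```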

8.1.2 Intrinsic derivative and C^1 functions

We are now able to define the directional derivative of a function on P_k (or M_k) along vector fields in the tangent space. Moreover, similarly to (8.1.1), we may define the intrinsic derivative as a linear functional on the tangent space.

Definition 8.3. Let f be a continuous function on P_k (or M_k).

(1) Let μ ∈ P_k (or M_k), and let v ∈ T_{μ,k}. If the limit

    D_v f(μ) := lim_{ε↓0} [f(μ ∘ (id + εv)^{−1}) − f(μ)]/ε

exists, then it is called the directional derivative of f at μ along v.

(2) Let μ ∈ P_k (or M_k). If D_v f(μ) exists for any v ∈ T_{μ,k}, and

    T_{μ,k} ∋ v → D_v f(μ)

is a bounded linear functional, we call f intrinsically differentiable at μ. In this case, the linear functional

    Df(μ) : v → D_v f(μ)

is called the intrinsic derivative of f at μ.

(3) If f is intrinsically differentiable at all elements of P_k (or M_k), we call f intrinsically differentiable on P_k (or M_k).

According to the definition, for any intrinsically differentiable function f and any μ ∈ P_k (or M_k), Df(μ) is the unique element of

    T*_{μ,k} := L^{k*}(R^d → R^d, μ),

where k* := k/(k − 1) ∈ (1, ∞], such that

    D_v f(μ) = ∫_{R^d} ⟨Df(μ)(x), v(x)⟩ μ(dx),  v ∈ T_{μ,k}.
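This dual representation can be tested numerically on cylindrical functions. The sketch below makes illustrative choices — f(μ) = g(μ(h)) with g(t) = t², h = sin, and the vector field v = cos — and compares the finite-difference directional derivative with the integral formula above; the closed form Df(μ)(x) = g′(μ(h))∇h(x) is the one derived in Exercise 2 of Section 8.5.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, 200_000)          # atoms of an empirical measure mu on R

h, dh = np.sin, np.cos                     # test function h and its gradient
g, dg = lambda t: t ** 2, lambda t: 2 * t  # outer function g and its derivative
f = lambda atoms: g(np.mean(h(atoms)))     # cylindrical f(mu) = g(mu(h))
v = np.cos                                 # a bounded vector field on R

eps = 1e-6
fd = (f(x + eps * v(x)) - f(x)) / eps                 # finite-difference D_v f(mu)
exact = dg(np.mean(h(x))) * np.mean(dh(x) * v(x))     # int <Df(mu), v> dmu
print(fd, exact)                                      # the two values agree closely
```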

Intuitively, the intrinsic derivative describes the movement of distributions along the flows of particles induced by vector fields. As a random particle can be regarded as a random variable on R^d, in the following we lift a function f on P_k to a function f̂ of random variables.

Let R_k be the class of all d-dimensional random variables ξ with E|ξ|^k < ∞. Then a function f on P_k induces the following function on R_k:

    R_k ∋ ξ → f̂(ξ) := f(L_ξ),
where L_ξ ∈ P_k is the distribution of ξ. The directional derivative of f̂ at ξ along η ∈ R_k is defined as

    ∇_η f̂(ξ) := lim_{ε↓0} [f̂(ξ + εη) − f̂(ξ)]/ε,

provided the limit exists. We aim to establish the chain rule

    ∇_η f̂(ξ) = E[⟨Df(L_ξ)(ξ), η⟩],  ξ, η ∈ R_k    (8.1.2)

for a class of intrinsically differentiable functions f on P_k. To this end, we introduce the notion of the L-derivative and the classes C^1(P_k) and C_b^1(P_k) as follows.
Definition 8.4. Let f : P_k → R be intrinsically differentiable.

(1) If

    lim_{||φ||_{T_{μ,k}} ↓ 0} |f(μ ∘ (id + φ)^{−1}) − f(μ) − D_φ f(μ)| / ||φ||_{T_{μ,k}} = 0,

then f is called L-differentiable at μ. In this case, the intrinsic derivative is also called the L-derivative.

(2) We write f ∈ C^1(P_k) if f is L-differentiable at any μ ∈ P_k and the L-derivative has a version Df(μ)(x) jointly continuous in (x, μ) ∈ R^d × P_k. If moreover Df(μ)(x) is bounded, we write f ∈ C_b^1(P_k).
To establish the chain rule (8.1.2), we need to assume that the
underlying probability space is Polish.
Definition 8.5. A probability space (Ω, F, P) is called Polish if F is the P-completion of a Borel σ-field induced by a Polish metric on Ω. P is called atomless if it has no atom, i.e. there is no A ∈ F with P(A) > 0 such that every measurable B ⊂ A satisfies P(B) ∈ {0, P(A)}.

When k = 2, the L-derivative is named after Lions due to his lecture notes (see the corresponding reference in [6]), where Df(μ) is defined as the unique element of T_{μ,2} such that for any atomless probability space (Ω, F, P) and any random variables X, Y with L_X = μ,

    lim_{||Y−X||_{L²(P)} ↓ 0} |f(L_Y) − f(L_X) − E[⟨Df(μ)(X), Y − X⟩]| / ||Y − X||_{L²(P)} = 0.

Since Df(μ) does not depend on the choice of probability space, when μ is atomless we may choose (Ω, F, P) = (R^d, B^d, μ), so that this definition is equivalent to the one we introduced above. Since by approximation one may drop the atomless condition, the above definition of the L-derivative coincides with, and is more straightforward than, the one given by Lions.

8.1.3 Chain rule


To establish the chain rule (8.1.2) for functions of distributions of
random variables, we need the following proposition.
Proposition 8.6. Let {(Ω_i, F_i, P_i)}_{i=1,2} be two atomless Polish probability spaces, and let X_i, i = 1, 2, be R^d-valued random variables on these two probability spaces, respectively, such that L_{X_1}|_{P_1} = L_{X_2}|_{P_2}. Then for any ε > 0, there exist measurable maps

    τ : Ω_1 → Ω_2,  τ^{−1} : Ω_2 → Ω_1

such that

    P_1(τ^{−1} ∘ τ = id_{Ω_1}) = P_2(τ ∘ τ^{−1} = id_{Ω_2}) = 1,
    P_1 = P_2 ∘ τ,  P_2 = P_1 ∘ τ^{−1},
    ||X_1 − X_2 ∘ τ||_{L^∞(P_1)} + ||X_2 − X_1 ∘ τ^{−1}||_{L^∞(P_2)} ≤ ε,

where id_{Ω_i} stands for the identity map on Ω_i, i = 1, 2.
Proof. Since R^d is separable, there is a measurable partition (A_n)_{n≥1} of R^d such that diam(A_n) < ε, n ≥ 1. Let A_n^i = {X_i ∈ A_n}, n ≥ 1, i = 1, 2. Then (A_n^i)_{n≥1} forms a measurable partition of Ω_i, so that ⋃_{n≥1} A_n^i = Ω_i, i = 1, 2, and, due to L_{X_1}|_{P_1} = L_{X_2}|_{P_2},

    P_1(A_n^1) = P_2(A_n^2),  n ≥ 1.

Since the probabilities (P_i)_{i=1,2} are atomless, according to Theorem C in Section 41 of [12], for any n ≥ 1 there exist measurable sets Ã_n^i ⊂ A_n^i with P_i(A_n^i \ Ã_n^i) = 0, i = 1, 2, and a measurable bijective map

    τ_n : Ã_n^1 → Ã_n^2

such that

    P_1|_{Ã_n^1} = P_2 ∘ τ_n|_{Ã_n^1},  P_2|_{Ã_n^2} = P_1 ∘ τ_n^{−1}|_{Ã_n^2}.

By diam(A_n) < ε and P_i(A_n^i \ Ã_n^i) = 0, we have

    ||(X_1 − X_2 ∘ τ_n)1_{Ã_n^1}||_{L^∞(P_1)} ∨ ||(X_2 − X_1 ∘ τ_n^{−1})1_{Ã_n^2}||_{L^∞(P_2)} ≤ ε.

Then the proof is finished by taking, for fixed points ω̂_i ∈ Ω_i, i = 1, 2,

    τ(ω_1) = τ_n(ω_1) if ω_1 ∈ Ã_n^1 for some n ≥ 1, and τ(ω_1) = ω̂_2 otherwise;
    τ^{−1}(ω_2) = τ_n^{−1}(ω_2) if ω_2 ∈ Ã_n^2 for some n ≥ 1, and τ^{−1}(ω_2) = ω̂_1 otherwise. □

The following chain rule is taken from Theorem 2.1 in [3], which extends the corresponding formulas for functions on P_2 presented in [6, 13] and the references therein.

Theorem 8.7. Let f : P_k → R be continuous for some k ∈ [1, ∞), and let (ξ_ε)_{ε∈[0,1]} be a family of R^d-valued random variables on a complete probability space (Ω, F, P) such that ξ̇_0 := lim_{ε↓0} (ξ_ε − ξ_0)/ε exists in L^k(Ω → R^d, P). We assume that either ξ_ε is continuous in ε ∈ [0, 1], or the probability space is Polish.

(1) Let μ_0 = L_{ξ_0} be atomless. If f is L-differentiable such that Df(μ_0) has a continuous version satisfying

    |Df(μ_0)(x)| ≤ C(1 + |x|^{k−1}),  x ∈ R^d    (8.1.3)

for some constant C > 0, then

    lim_{ε↓0} [f(L_{ξ_ε}) − f(L_{ξ_0})]/ε = E[⟨Df(μ_0)(ξ_0), ξ̇_0⟩].    (8.1.4)

(2) If f is L-differentiable in a neighborhood O of μ_0 such that Df has a version jointly continuous in (x, μ) ∈ R^d × O satisfying

    |Df(μ)(x)| ≤ C(1 + |x|^{k−1}),  (x, μ) ∈ R^d × O    (8.1.5)

for some constant C > 0, then (8.1.4) holds.
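The chain rule can be checked numerically on a simple functional. The following minimal Python sketch (an illustration with arbitrary choices, not part of the book's text) takes f(μ) = μ(|·|²) − μ(id)², the variance functional on P_2(R), whose L-derivative Df(μ)(x) = 2(x − μ(id)) follows from the cylindrical computation of Exercise 2 in Section 8.5, and compares both sides of (8.1.4) for ξ_ε = ξ_0 + εη with the direction η = ξ_0² depending on ξ_0:

```python
import numpy as np

rng = np.random.default_rng(3)
xi0 = rng.exponential(1.0, 2_000_000)   # xi_0 ~ Exp(1): E xi0 = 1, E xi0^2 = 2, E xi0^3 = 6
eta = xi0 ** 2                          # direction xi_dot_0 = eta, allowed to depend on xi_0

# f(mu) = Var(mu) = mu(x^2) - mu(x)^2 is cylindrical, with Df(mu)(x) = 2(x - mu(id)).
f = np.var

eps = 1e-5
lhs = (f(xi0 + eps * eta) - f(xi0)) / eps        # (f(L_{xi_eps}) - f(L_{xi_0})) / eps
rhs = np.mean(2.0 * (xi0 - xi0.mean()) * eta)    # E <Df(mu_0)(xi_0), xi_dot_0>
print(lhs, rhs)                                  # both close to 2 Cov(xi0, xi0^2) = 8
```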



Proof. Without loss of generality, we may and do assume that P is atomless. Otherwise, by taking

    (Ω̃, F̃, P̃) = (Ω × [0, 1], F × B([0, 1]), P × ds),  ξ̃_ε(ω, s) = ξ_ε(ω) for (ω, s) ∈ Ω̃,

where B([0, 1]) is the completion of the Borel σ-algebra on [0, 1] with respect to the Lebesgue measure ds, we have

    L_{ξ̃_ε}|_{P̃} = L_{ξ_ε}|_P,  E[⟨Df(μ_0)(ξ_0), ξ̇_0⟩] = Ẽ[⟨Df(μ_0)(ξ̃_0), ξ̃̇_0⟩].

In this way, we go back to the atomless situation. Moreover, it suffices to prove the assertion for the Polish probability space case. Indeed, when ξ_ε is continuous in ε, we may take Ω̄ = C([0, 1]; R^d), let P̄ be the distribution of ξ_·, let F̄ be the P̄-complete Borel σ-field on Ω̄ induced by the uniform norm, and consider the coordinate random variable ξ̄_·(ω) := ω, ω ∈ Ω̄. Then

    L_{ξ̄_·}|_{P̄} = L_{ξ_·}|_P,

so that L_{ξ̄_ε}|_{P̄} = L_{ξ_ε}|_P for any ε ∈ [0, 1] and L_{ξ̄̇_0}|_{P̄} = L_{ξ̇_0}|_P; hence we have reduced the situation to the Polish setting.

(1) Let L_{ξ_0} = μ_0 ∈ P_k be atomless. In this case, (R^d, B̄(R^d), μ_0) is an atomless Polish complete probability space, where B̄(R^d) is the μ_0-completion of the Borel σ-algebra of R^d. By Proposition 8.6, for any n ≥ 1, we find measurable maps

    τ_n : Ω → R^d,  τ_n^{−1} : R^d → Ω

such that

    P(τ_n^{−1} ∘ τ_n = id_Ω) = μ_0(τ_n ∘ τ_n^{−1} = id) = 1,
    P = μ_0 ∘ τ_n,  μ_0 = P ∘ τ_n^{−1},    (8.1.6)
    ||ξ_0 − τ_n||_{L^∞(P)} + ||id − ξ_0 ∘ τ_n^{−1}||_{L^∞(μ_0)} ≤ 1/n,

where id_Ω is the identity map on Ω.

Since f is L-differentiable at μ_0, there exists a decreasing function h : [0, 1] → [0, ∞) with h(r) ↓ 0 as r ↓ 0 such that

    sup_{||φ||_{L^k(μ_0)} ≤ r} |f(μ_0 ∘ (id + φ)^{−1}) − f(μ_0) − D_φ f(μ_0)| ≤ r h(r),  r ∈ [0, 1].    (8.1.7)
By L_{ξ_ε−ξ_0} ∈ P_k and (8.1.6), we have

    φ_{n,ε} := (ξ_ε − ξ_0) ∘ τ_n^{−1} ∈ T_{μ_0,k},  ||φ_{n,ε}||_{T_{μ_0,k}} = ||ξ_ε − ξ_0||_{L^k(P)}.    (8.1.8)

Next, (8.1.6) implies

    L_{τ_n+ξ_ε−ξ_0} = P ∘ (τ_n + ξ_ε − ξ_0)^{−1} = (μ_0 ∘ τ_n) ∘ (τ_n + ξ_ε − ξ_0)^{−1} = μ_0 ∘ (id + φ_{n,ε})^{−1}.    (8.1.9)

Moreover, by (ξ_ε − ξ_0)/ε → ξ̇_0 in L^k(P) as ε ↓ 0, we find a constant c ≥ 1 such that

    ||ξ_ε − ξ_0||_{L^k(P)} ≤ cε,  ε ∈ [0, 1].    (8.1.10)

Combining (8.1.6)–(8.1.10) leads to

    |f(L_{τ_n+ξ_ε−ξ_0}) − f(L_{ξ_0}) − E[⟨(Df)(μ_0)(τ_n), ξ_ε − ξ_0⟩]|
      = |f(μ_0 ∘ (id + φ_{n,ε})^{−1}) − f(μ_0) − D_{φ_{n,ε}} f(μ_0)|
      ≤ ||φ_{n,ε}||_{T_{μ_0,k}} h(||φ_{n,ε}||_{T_{μ_0,k}})
      = ||ξ_ε − ξ_0||_{L^k(P)} h(||ξ_ε − ξ_0||_{L^k(P)}),  ε ∈ [0, c^{−1}].    (8.1.11)

Since f(μ) is continuous in μ and Df(μ_0)(x) is continuous in x, by (8.1.3) and (8.1.6) we may apply the dominated convergence theorem to deduce from (8.1.11) with n → ∞ that

    |f(L_{ξ_ε}) − f(L_{ξ_0}) − E[⟨(Df)(μ_0)(ξ_0), ξ_ε − ξ_0⟩]| ≤ ||ξ_ε − ξ_0||_{L^k(P)} h(||ξ_ε − ξ_0||_{L^k(P)}),  ε ∈ [0, c^{−1}].

Combining this with (8.1.10) and h(r) → 0 as r → 0, we derive (8.1.4).

(2) When μ_0 has an atom, we take an R^d-valued bounded random variable X which is independent of (ξ_ε)_{ε∈[0,1]} and such that L_X does not have any atom. Then

    L_{ξ_0+sX+r(ξ_ε−ξ_0)} ∈ P_k

has no atom for any s > 0 and r, ε ∈ [0, 1]. By the conditions in Theorem 8.7(2), there exists a small constant s_0 ∈ (0, 1) such that for any
s, ε ∈ (0, s_0], we may apply (8.1.4) to the random variables

    ξ_0 + sX + (r + δ)(ξ_ε − ξ_0),  δ > 0

to conclude

    f(L_{ξ_ε+sX}) − f(L_{ξ_0+sX}) = ∫_0^1 (d/dδ) f(L_{ξ_0+sX+(r+δ)(ξ_ε−ξ_0)})|_{δ=0} dr
      = ∫_0^1 E[⟨Df(L_{ξ_0+sX+r(ξ_ε−ξ_0)})(ξ_0 + sX + r(ξ_ε − ξ_0)), ξ_ε − ξ_0⟩] dr.

By the conditions in Theorem 8.7(2), we may let s ↓ 0 to derive

    f(L_{ξ_ε}) − f(L_{ξ_0}) = ∫_0^1 E[⟨Df(L_{ξ_0+r(ξ_ε−ξ_0)})(ξ_0 + r(ξ_ε − ξ_0)), ξ_ε − ξ_0⟩] dr,  ε ∈ (0, s_0).

Multiplying both sides by ε^{−1} and letting ε ↓ 0, we finish the proof. □
As a consequence of the chain rule, we have the following Lipschitz estimate for L-differentiable functions on P_k.

Corollary 8.8. Let f be L-differentiable on P_k such that for any μ ∈ P_k, Df(μ)(·) has a continuous version satisfying

    |Df(μ)(x)| ≤ c(μ)(1 + |x|^{k−1}),  x ∈ R^d    (8.1.12)

for some constant c(μ) > 0, and

    K_0 := sup_{μ∈P_k} ||Df(μ)||_{L^{k/(k−1)}(μ)} < ∞.    (8.1.13)

Then

    |f(μ_1) − f(μ_2)| ≤ K_0 W_k(μ_1, μ_2),  μ_1, μ_2 ∈ P_k.    (8.1.14)

Proof. By Theorem 7.4, there exists π ∈ C(μ_1, μ_2) such that

    W_k(μ_1, μ_2) = (∫_{R^d×R^d} |x − y|^k π(dx, dy))^{1/k}.
Now, consider the probability space

    (Ω, F, P) = (R^d × R^d × R^d, B̄(R^d × R^d × R^d), π × G),

where G is the standard Gaussian measure on R^d and B̄(R^d × R^d × R^d) is the completion of the Borel σ-field B(R^d × R^d × R^d) with respect to P. Obviously, this probability space is atomless and Polish, and the random variables

    ξ_1(ω) := ω_1,  ξ_2(ω) := ω_2,  ω = (ω_1, ω_2, ω_3) ∈ Ω

satisfy

    L_{ξ_1} = μ_1,  L_{ξ_2} = μ_2,  W_k(μ_1, μ_2) = (E[|ξ_1 − ξ_2|^k])^{1/k}.

Moreover, the random variable

    η(ω) := ω_3,  ω = (ω_1, ω_2, ω_3) ∈ Ω

is independent of (ξ_1, ξ_2) with distribution L_η = G, so that the random variables

    γ_ε(r) := εη + rξ_1 + (1 − r)ξ_2,  r ∈ [0, 1], ε ∈ (0, 1]

have distributions absolutely continuous with respect to the Lebesgue measure. By Theorem 8.7, (8.1.12) and the continuity of Df(μ)(·), we obtain

    |f(L_{γ_ε(1)}) − f(L_{γ_ε(0)})| = |∫_0^1 E[⟨Df(L_{γ_ε(r)})(γ_ε(r)), ξ_1 − ξ_2⟩] dr|
      ≤ (E[|ξ_1 − ξ_2|^k])^{1/k} ∫_0^1 ||Df(L_{γ_ε(r)})||_{L^{k/(k−1)}(L_{γ_ε(r)})} dr
      ≤ K_0 W_k(μ_1, μ_2),  ε ∈ (0, 1].

Letting ε → 0, we derive (8.1.14). □

8.2 Extrinsic Derivative and Convexity Extrinsic Derivative

Regarding a measure as the distribution of a particle system, the intrinsic derivative describes the movement of particles. In this part, we consider the (convexity) extrinsic derivative, which refers to the birth and death rates of particles.

We first recall the extrinsic derivative, defined as a partial derivative in the direction of Dirac measures; see [17, Definition 1.2].
Definition 8.9 (Extrinsic derivative). Let f be a real function on M_k. For any x ∈ R^d, let δ_x be the Dirac measure at x, i.e. δ_x ∈ P with δ_x({x}) = 1.

(1) f is called extrinsically differentiable on M_k with derivative D^E f if

    D^E f(μ)(x) := lim_{ε↓0} [f(μ + εδ_x) − f(μ)]/ε ∈ R

exists for all (x, μ) ∈ R^d × M_k.

(2) If D^E f(μ)(x) exists and is continuous in (x, μ) ∈ R^d × M_k, we write f ∈ C^{E,1}(M_k).

(3) We write f ∈ C_K^{E,1}(M_k) if f ∈ C^{E,1}(M_k) and for any compact set K ⊂ M_k, there exists a constant C > 0 such that

    sup_{μ∈K} |D^E f(μ)(x)| ≤ C(1 + |x|^k),  x ∈ R^d.

(4) We write f ∈ C^{E,1,1}(M_k) if f ∈ C^{E,1}(M_k) is such that D^E f(μ)(x) is differentiable in x, ∇{D^E f(μ)(·)}(x) is continuous in (x, μ) ∈ R^d × M_k, and |∇{D^E f(μ)}| ∈ L^{k/(k−1)}(μ) for any μ ∈ M_k.

(5) We write f ∈ C_B^{E,1,1}(M_k) if f ∈ C^{E,1,1}(M_k) and for any constant L > 0, there exists C_L > 0 such that

    sup_{||μ||_k ≤ L} |∇{D^E f(μ)}|(x) ≤ C_L(1 + |x|^k),  x ∈ R^d.
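For a simple functional, the extrinsic derivative can be computed directly by a finite difference in the mass direction. A minimal Python sketch (illustrative choices: μ a three-atom finite measure, f(μ) = μ(h)² with h = tanh, for which D^E f(μ)(x) = 2μ(h)h(x)):

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])                 # atoms of mu = sum_i a_i delta_{x_i}
a = np.array([0.3, 0.5, 0.4])                  # weights (mu is a finite measure of mass 1.2)

h = np.tanh
f = lambda pts, wts: np.dot(wts, h(pts)) ** 2  # f(mu) = mu(h)^2, defined on all of M

def DEf(z, eps=1e-7):
    """D^E f(mu)(z) by finite difference: add an atom of mass eps at z."""
    return (f(np.append(x, z), np.append(a, eps)) - f(x, a)) / eps

mu_h = np.dot(a, h(x))
for z in (-2.0, 0.0, 1.0):
    print(DEf(z), 2.0 * mu_h * h(z))           # finite difference vs closed form
```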

Since for a probability measure μ and s > 0, μ + sδ_x is no longer a probability measure, for functions of probability measures we modify the definition of the extrinsic derivative by using the convex combination (1 − s)μ + sδ_x in place of μ + sδ_x. This leads to the notion of the convexity extrinsic derivative.

Definition 8.10 (Convexity extrinsic derivative). Let f be a real function on P_k.

(1) f is called extrinsically differentiable on P_k if the centered extrinsic derivative

    D̃^E f(μ)(x) := lim_{s↓0} [f((1 − s)μ + sδ_x) − f(μ)]/s ∈ R

exists for all (x, μ) ∈ R^d × P_k.
(2) We write f ∈ C^{E,1}(P_k) if D̃^E f(μ)(x) exists and is continuous in (x, μ) ∈ R^d × P_k.

(3) We write f ∈ C_K^{E,1}(P_k) if f ∈ C^{E,1}(P_k) and for any compact set K ⊂ P_k, there exists a constant C > 0 such that

    sup_{μ∈K} |D̃^E f(μ)(x)| ≤ C(1 + |x|^k),  x ∈ R^d.

(4) We write f ∈ C^{E,1,1}(P_k) if f ∈ C^{E,1}(P_k) is such that D̃^E f(μ)(x) is differentiable in x ∈ R^d, ∇{D̃^E f(μ)}(x) is continuous in (x, μ) ∈ R^d × P_k, and |∇{D̃^E f(μ)}| ∈ L^{k/(k−1)}(μ) for any μ ∈ P_k.

(5) We write f ∈ C_B^{E,1,1}(P_k) if f ∈ C^{E,1,1}(P_k) and for any constant L > 0, there exists C > 0 such that

    sup_{μ(|·|^k)≤L} |∇{D̃^E f(μ)}|(x) ≤ C(1 + |x|^k),  x ∈ R^d.

By Proposition 8.13 below with γ = δ_x and r = 0, we have

    lim_{s↓0} [f((1 − s)μ + sδ_x) − f(μ)]/s = D^E f(μ)(x) − μ(D^E f(μ)),  f ∈ C_K^{E,1}(M_k), x ∈ R^d.

So the convexity extrinsic derivative is indeed the centered extrinsic derivative.
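The centering identity can be checked numerically in the same way as the previous sketch, now perturbing along the convex combination (1 − s)μ + sδ_x (again with the illustrative choices f(μ) = μ(h)², h = tanh, and a three-atom probability measure):

```python
import numpy as np

x = np.array([-1.0, 0.5, 2.0])                 # atoms of a probability measure mu
a = np.array([0.2, 0.5, 0.3])                  # weights summing to 1

h = np.tanh
f = lambda pts, wts: np.dot(wts, h(pts)) ** 2  # f(mu) = mu(h)^2, restricted to P

def tilde_DEf(z, s=1e-7):
    """Convexity extrinsic derivative via the mixture (1-s)*mu + s*delta_z."""
    return (f(np.append(x, z), np.append((1 - s) * a, s)) - f(x, a)) / s

mu_h = np.dot(a, h(x))
DEf = lambda z: 2.0 * mu_h * h(z)              # plain extrinsic derivative of f
for z in (-2.0, 1.0):
    print(tilde_DEf(z), DEf(z) - np.dot(a, DEf(x)))   # the centered identity holds
```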
For μ ∈ M and a density function 0 ≤ h ∈ L¹(μ), hμ is the finite measure on R^d defined by

    (hμ)(A) = ∫_A h dμ,  A ∈ B(R^d).

Then a function f on M induces the following function of the density h:

    h → f(hμ).

To characterize this function by using the extrinsic derivative, we introduce the following class of density functions.
Definition 8.11. For a constant ε_0 > 0, we write h ∈ H_{ε_0} if h = (h_ε)_{ε∈[0,ε_0]} satisfies the following conditions:

(1) 0 ≤ h ∈ C([0, ε_0] × R^d);
(2) h_0 ≡ 0, sup_{ε∈[0,ε_0]} ||h_ε||_∞ < ∞, and there exists a compact set K ⊂ R^d such that h_ε|_{K^c} = 0 for all ε ∈ [0, ε_0];
(3) ḣ_ε := lim_{s↓0} (h_{ε+s} − h_ε)/s ∈ C_b(R^d) exists and is uniformly bounded for ε ∈ [0, ε_0).
The following proposition links f((1 + h_ε)μ) − f(μ) to the extrinsic derivative; it will be used to characterize the relation between Df and D^E f in the following section.

Proposition 8.12. Let k ∈ [1, ∞). For any h ∈ H_{ε_0} and any f ∈ C^{E,1,1}(M_k),

    f((1 + h_ε)μ) − f(μ) = ∫_0^ε dr ∫_{R^d} D^E f((1 + h_r)μ)(x) ḣ_r(x) μ(dx)    (8.2.1)

holds for all μ ∈ M_k and ε ∈ [0, ε_0].


Proof. (1) We first consider

    μ ∈ M_disc := {Σ_{i=1}^n a_i δ_{x_i} : n ≥ 1, a_i > 0, x_i ∈ R^d, 1 ≤ i ≤ n}.

In this case, for any ε ∈ [0, ε_0) and s ∈ (0, ε_0 − ε), by the definition of D^E, we have

    f((1 + h_{ε+s})μ) − f((1 + h_ε)μ)
      = f((1 + h_ε)μ + Σ_{i=1}^n {h_{ε+s} − h_ε}(x_i) a_i δ_{x_i}) − f((1 + h_ε)μ)
      = Σ_{k=1}^n [ f((1 + h_ε)μ + Σ_{i=1}^k {h_{ε+s} − h_ε}(x_i) a_i δ_{x_i})
          − f((1 + h_ε)μ + Σ_{i=1}^{k−1} {h_{ε+s} − h_ε}(x_i) a_i δ_{x_i}) ]
      = Σ_{k=1}^n ∫_0^{a_k{h_{ε+s}−h_ε}(x_k)} D^E f((1 + h_ε)μ + Σ_{i=1}^{k−1} {h_{ε+s} − h_ε}(x_i) a_i δ_{x_i} + rδ_{x_k})(x_k) dr.

Multiplying by s^{−1} and letting s ↓ 0, we deduce from this and the continuity of D^E f that

    lim_{s↓0} [f((1 + h_{ε+s})μ) − f((1 + h_ε)μ)]/s = Σ_{k=1}^n a_k ḣ_ε(x_k) D^E f((1 + h_ε)μ)(x_k)
      = ∫_{R^d} D^E f((1 + h_ε)μ)(x) ḣ_ε(x) μ(dx),  ε ∈ [0, ε_0), μ ∈ M_disc.    (8.2.2)

(2) In general, for any μ ∈ M_k, let {μ_n}_{n≥1} ⊂ M_disc be such that μ_n → μ in M_k. By (8.2.2), for any ε ∈ (0, ε_0) and s ∈ (0, ε_0 − ε), we have

    f((1 + h_ε)μ_n) − f(μ_n) = ∫_0^ε dr ∫_{R^d} D^E f((1 + h_r)μ_n)(x) ḣ_r(x) μ_n(dx),  n ≥ 1.    (8.2.3)

Next, since D^E f ∈ C(R^d × M_k) and h_r, ḣ_r ∈ C_b(R^d) for r ∈ [0, ε_0] with compact support contained in K, and μ_n → μ in M_k, we obtain

    lim_{n→∞} ∫_{R^d} D^E f((1 + h_r)μ)(x) ḣ_r(x) μ_n(dx) = ∫_{R^d} D^E f((1 + h_r)μ)(x) ḣ_r(x) μ(dx).    (8.2.4)

Moreover, μ_n → μ in M_k and h ∈ H_{ε_0} imply that the set

    K_r := {(1 + h_r)μ, (1 + h_r)μ_n : n ≥ 1}

is compact in M_k for any r ∈ [0, ε_0]. Combining this with D^E f ∈ C(R^d × M_k), we see that the function

    K_r × R^d ∋ (γ, x) → D^E f(γ)(x) ḣ_r(x)


is uniformly continuous and has support contained in K_r × K, so that (8.2.4) implies

    lim sup_{n→∞} | ∫_{R^d} D^E f((1 + h_r)μ_n)(x) ḣ_r(x) μ_n(dx) − ∫_{R^d} D^E f((1 + h_r)μ)(x) ḣ_r(x) μ(dx) |
      = lim sup_{n→∞} | ∫_{R^d} D^E f((1 + h_r)μ_n)(x) ḣ_r(x) μ_n(dx) − ∫_{R^d} D^E f((1 + h_r)μ)(x) ḣ_r(x) μ_n(dx) |
      ≤ lim sup_{n→∞} μ_n(K) sup_{x∈K} |D^E f((1 + h_r)μ_n)(x) ḣ_r(x) − D^E f((1 + h_r)μ)(x) ḣ_r(x)| = 0.

Combining this with

    sup_{(γ,x)∈K_r×K, r∈[0,ε_0]} |D^E f(γ)(x) ḣ_r(x)| < ∞,

we deduce from the dominated convergence theorem that

    lim_{n→∞} ∫_0^ε dr ∫_{R^d} D^E f((1 + h_r)μ_n)(x) ḣ_r(x) μ_n(dx)
      = ∫_0^ε dr ∫_{R^d} D^E f((1 + h_r)μ)(x) ḣ_r(x) μ(dx).    (8.2.5)

Therefore, by letting n → ∞ in (8.2.3) and using the continuity of f, we prove (8.2.1). □

To calculate the convexity extrinsic derivative, we present the following result.

Proposition 8.13. Let k ∈ [1, ∞). Then for any f ∈ C_K^{E,1}(M_k) and μ, γ ∈ M_k,

    (d/dr) f((1 − r)μ + rγ) := lim_{ε↓0} [f((1 − r − ε)μ + (r + ε)γ) − f((1 − r)μ + rγ)]/ε
      = ∫_{R^d} {D^E f((1 − r)μ + rγ)(x)} (γ − μ)(dx),  r ∈ [0, 1).

Consequently, for any f ∈ C_K^{E,1}(M_k),

    D̃^E f(μ)(x) := lim_{s↓0} [f((1 − s)μ + sδ_x) − f(μ)]/s = D^E f(μ)(x) − μ(D^E f(μ)),  (x, μ) ∈ R^d × M_k.

The assertions also hold with P_k replacing M_k.


Proof. As in the proof of Proposition 8.12, we take

    μ_n = Σ_{i=1}^n α_{n,i} δ_{x_{n,i}},  γ_n = Σ_{i=1}^n β_{n,i} δ_{x_{n,i}}

for some x_{n,i} ∈ R^d and α_{n,i}, β_{n,i} ≥ 0, such that

    μ_n → μ,  γ_n → γ in M_k as n → ∞.

For any r ∈ [0, 1) and ε ∈ (0, 1 − r), let

    Λ_{n,i}^ε := (1 − r)μ_n + rγ_n + ε Σ_{k=1}^{i−1} (β_{n,k} − α_{n,k}) δ_{x_{n,k}} ∈ M_k,  1 ≤ i ≤ n,

where by convention Σ_{k=1}^0 := 0. Then by the definition of D^E f, we have

    f((1 − r − ε)μ_n + (r + ε)γ_n) − f((1 − r)μ_n + rγ_n)
      = Σ_{i=1}^n [ f(Λ_{n,i}^ε + ε(β_{n,i} − α_{n,i})δ_{x_{n,i}}) − f(Λ_{n,i}^ε) ]
      = Σ_{i=1}^n ∫_0^{ε(β_{n,i}−α_{n,i})} D^E f(Λ_{n,i}^ε + sδ_{x_{n,i}})(x_{n,i}) ds,  ε ∈ (0, 1 − r).

Multiplying by ε^{−1} and letting ε ↓ 0, due to the continuity of D^E f, we derive

    (d/dr) f((1 − r)μ_n + rγ_n) = Σ_{i=1}^n (β_{n,i} − α_{n,i}) D^E f((1 − r)μ_n + rγ_n)(x_{n,i})
      = ∫_{R^d} {D^E f((1 − r)μ_n + rγ_n)(x)} (γ_n − μ_n)(dx),  r ∈ [0, 1), n ≥ 1.

Consequently, for any r ∈ [0, 1),

    f((1 − r − ε)μ_n + (r + ε)γ_n) − f((1 − r)μ_n + rγ_n)
      = ∫_r^{r+ε} ds ∫_{R^d} {D^E f((1 − s)μ_n + sγ_n)(x)} (γ_n − μ_n)(dx),  ε ∈ (0, 1 − r), n ≥ 1.

Noting that the set {μ_n, γ_n : n ≥ 1} is relatively compact in M_k, by this and the condition on f, we may let n → ∞ to derive

    f((1 − r − ε)μ + (r + ε)γ) − f((1 − r)μ + rγ)
      = ∫_r^{r+ε} ds ∫_{R^d} {D^E f((1 − s)μ + sγ)(x)} (γ − μ)(dx),  ε ∈ (0, 1 − r).

Multiplying by ε^{−1} and letting ε ↓ 0, we finish the proof. □

The following is a consequence of Proposition 8.13 for functions on P_k.

Proposition 8.14. Let k ∈ [1, ∞). Then for any f ∈ C_K^{E,1}(P_k) and μ, ν ∈ P_k,

    lim_{s↓0} [f((1 − s)μ + sν) − f(μ)]/s = ∫_{R^d} {D̃^E f(μ)(x)} (ν − μ)(dx).
Proof. To apply Proposition 8.13, we extend a function f on P_k to f̃ on M_k by letting

    f̃(μ) = h(μ(R^d)) f(μ/μ(R^d)),  μ ∈ M_k,

where h ∈ C_0^∞(R) has support contained in [1/4, 2] and h(r) = 1 for r ∈ [1/2, 3/2]. It is easy to see that

    f((1 − s)μ + sν) = f̃((1 − s)μ + sν),  s ∈ [0, 1], μ, ν ∈ P_k,

and f ∈ C_K^{E,1}(P_k) implies that f̃ ∈ C_K^{E,1}(M_k) with

    D^E f̃(μ) = D̃^E f(μ),  μ ∈ P_k.

Then the desired formula follows from Proposition 8.13 with r = 0. □
8.3 Links between Intrinsic and Extrinsic Derivatives

Theorem 8.15. Let k ∈ [1, ∞).

(1) Let f ∈ C^{E,1,1}(M_k). Then f is intrinsically differentiable and

    Df(μ)(x) = ∇{D^E f(μ)(·)}(x),  (x, μ) ∈ R^d × M_k.    (8.3.1)

When k ∈ [1, 2] and f ∈ C_B^{E,1,1}(M_k), we have f ∈ C^1(M_k).

(2) If f ∈ C^1(M_k), then for any s ≥ 0, f(μ + sδ_·) ∈ C^1(R^d) with

    ∇f(μ + sδ_·)(x) = sDf(μ + sδ_x)(x),  x ∈ R^d, s ≥ 0.    (8.3.2)

Consequently,

    Df(μ)(x) = lim_{s↓0} (1/s) ∇f(μ + sδ_·)(x),  (x, μ) ∈ R^d × M_k.    (8.3.3)
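As a quick illustration of (8.3.1) (a worked example added here; cf. Exercise 2 in Section 8.5), take the cylindrical function f(μ) = g(μ(h)) with g ∈ C_b^1(R) and h ∈ C_b^1(R^d). Since f(μ + εδ_x) = g(μ(h) + εh(x)),

    D^E f(μ)(x) = g′(μ(h)) h(x),  and hence  ∇{D^E f(μ)(·)}(x) = g′(μ(h)) ∇h(x),

which is exactly the intrinsic derivative Df(μ)(x) of the cylindrical function, in agreement with (8.3.1).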

Proof. We prove assertions (1) and (2) in turn.

(a) Proof of assertion (1). We first prove (8.3.1) for f ∈ C^{E,1,1}(M_k). Let v ∈ T_{μ,k}, and simply denote

    φ_{εv} := id + εv,  ε ≥ 0.

Since any μ ∈ M_k can be approximated by measures having smooth and strictly positive density functions with respect to the volume measure dx, by the argument leading to (8.2.5) it suffices to show that for any μ ∈ M_k satisfying

    μ(dx) = ρ(x) dx for some ρ ∈ C_b^∞(R^d) with inf ρ > 0,    (8.3.4)

there exists a constant ε_0 > 0 such that

    f(μ ∘ φ_{εv}^{−1}) − f(μ) = ∫_0^ε dr ∫_{R^d} ⟨∇{D^E f(μ ∘ φ_{rv}^{−1})}, v⟩ d(μ ∘ φ_{rv}^{−1}),  ε ∈ (0, ε_0).    (8.3.5)

Firstly, there exists a constant ε_0 > 0 such that

    ρ_ε^v := d(μ ∘ φ_{εv}^{−1})/dμ,  ρ̇_ε^v := lim_{s↓0} (ρ_{ε+s}^v − ρ_ε^v)/s

exist in C_b(R^d) and are uniformly bounded and continuous in ε ∈ [0, ε_0]. Next, by Proposition 8.12, we have

    f(μ ∘ φ_{εv}^{−1}) − f(μ) = ∫_0^ε dr ∫_{R^d} {D^E f(μ ∘ φ_{rv}^{−1})} ρ̇_r^v dμ,  ε ∈ [0, ε_0].    (8.3.6)

To calculate ρ̇_r^v, we note that for any g ∈ C_0^∞(R^d),

    (d/dr){g ∘ φ_{rv}} = ⟨∇g(φ_{rv}), v(φ_{rv})⟩ = ⟨∇g, v⟩(φ_{rv}),  r ≥ 0,

which is smooth and bounded in (r, x) ∈ [0, ε_0] × R^d. So,

    ∫_{R^d} g ρ̇_r^v dμ = ∫_{R^d} g lim_{s↓0} (ρ_{r+s}^v − ρ_r^v)/s dμ
      = lim_{s↓0} (1/s) ∫_{R^d} g d(μ ∘ φ_{(r+s)v}^{−1} − μ ∘ φ_{rv}^{−1})
      = lim_{s↓0} (1/s) ∫_{R^d} (g ∘ φ_{(r+s)v} − g ∘ φ_{rv}) dμ = ∫_{R^d} (d/dr)(g ∘ φ_{rv}) dμ
      = ∫_{R^d} ⟨∇g, v⟩ ∘ φ_{rv} dμ = ∫_{R^d} ⟨∇g, v⟩ d(μ ∘ φ_{rv}^{−1})
      = −∫_{R^d} g div_{μ∘φ_{rv}^{−1}}(v) d(μ ∘ φ_{rv}^{−1})
      = −∫_{R^d} g div_{μ∘φ_{rv}^{−1}}(v) ρ_r^v dμ,  g ∈ C_0^∞(R^d),

where div_{μ∘φ_{rv}^{−1}}(v) = div(v) + ⟨v, ∇ log(ρ_r^v ρ)⟩. This implies

    ρ̇_r^v = −div_{μ∘φ_{rv}^{−1}}(v) ρ_r^v,

so that the integration by parts formula and ρ_r^v μ = μ ∘ φ_{rv}^{−1} lead to

    ∫_{R^d} {D^E f(μ ∘ φ_{rv}^{−1})} ρ̇_r^v dμ = −∫_{R^d} {D^E f(μ ∘ φ_{rv}^{−1})} div_{μ∘φ_{rv}^{−1}}(v) d(μ ∘ φ_{rv}^{−1})
      = ∫_{R^d} ⟨∇{D^E f(μ ∘ φ_{rv}^{−1})}, v⟩ d(μ ∘ φ_{rv}^{−1}).

Combining this with (8.3.6), we prove (8.3.5).


Now let k ∈ [1, 2]; we intend to verify the L-differentiability of f. For any μ ∈ M_k and v ∈ T_{μ,2} with μ(|v|²) ≤ 1, we have

    sup_{s∈[0,1]} (μ ∘ φ_{sv}^{−1})(|·|^k) = sup_{s∈[0,1]} μ(|φ_{sv}|^k) ≤ 2μ(|·|^k + |v|^k) < ∞.

Then there exists a constant K > 0 such that

    sup_{s∈[0,1], μ(|v|²)≤1} (μ ∘ φ_{sv}^{−1} + μ)(|·|^k) ≤ K.    (8.3.7)
So, by Proposition 8.13, we obtain

    f(μ ∘ (id + v)^{−1}) − f(μ) = ∫_0^1 (d/dr) f(rμ ∘ φ_v^{−1} + (1 − r)μ) dr
      = ∫_0^1 dr ∫_{R^d} (D^E f)(rμ ∘ φ_v^{−1} + (1 − r)μ) d(μ ∘ φ_v^{−1} − μ)
      = ∫_0^1 dr ∫_{R^d} [ (D^E f)(rμ ∘ φ_v^{−1} + (1 − r)μ)(φ_v(x)) − (D^E f)(rμ ∘ φ_v^{−1} + (1 − r)μ)(x) ] μ(dx)
      = ∫_0^1 dr ∫_{R^d} μ(dx) ∫_0^1 ⟨∇{(D^E f)(rμ ∘ φ_v^{−1} + (1 − r)μ)}(φ_{sv}(x)), v(x)⟩ ds.

Thus, by the Cauchy–Schwarz inequality,

    I_v := |f(μ ∘ φ_v^{−1}) − f(μ) − ∫_{R^d} ⟨∇{D^E f(μ)}, v⟩ dμ|² / μ(|v|²)
      ≤ ∫_{[0,1]²×R^d} |∇{(D^E f)(rμ ∘ φ_v^{−1} + (1 − r)μ)}(φ_{sv}(x)) − ∇{D^E f(μ)}(x)|² dr ds μ(dx).

By (8.3.7), as ||v||_{L²(μ)} → 0 we have φ_{sv}(x) → x μ-a.e. and μ ∘ φ_{sv}^{−1} → μ in M_k for any s ≥ 0. Combining these with (8.3.7), we may apply the dominated convergence theorem to derive I_v → 0 as ||v||_{L²(μ)} → 0. Therefore, f is L-differentiable.
(b) Proof of assertion (2). It suffices to prove (8.3.2). Let f ∈ C^1(M_k). We first prove the formula for μ ∈ M_k and x ∈ R^d with μ({x}) = 0, and then extend it to the general situation.

Let μ({x}) = 0. In this case, for any v_0 ∈ R^d, let v = 1_{{x}} v_0. Then

    φ_{rv}(z) = z if z ≠ x,  and  φ_{rv}(z) = x + rv_0 if z = x.
By μ({x}) = 0, we have

    (μ + sδ_x) ∘ φ_{rv}^{−1} = μ + sδ_{x+rv_0}.    (8.3.8)

Since v can be approximated in L²(μ + sδ_x) by smooth vector fields with compact support, the L-differentiability of f and μ({x}) = 0 imply

    lim_{r↓0} [f((μ + sδ_x) ∘ φ_{rv}^{−1}) − f(μ + sδ_x)]/r = ∫_{R^d} ⟨Df(μ + sδ_x), v⟩ d(μ + sδ_x) = s⟨Df(μ + sδ_x)(x), v_0⟩.

Combining this with (8.3.8), we obtain

    lim_{r↓0} [f(μ + sδ_{x+rv_0}) − f(μ + sδ_x)]/r = s⟨Df(μ + sδ_x)(x), v_0⟩.

This implies that f(μ + sδ_·) is differentiable at the point x and that (8.3.2) holds.
holds.
In general, for any v_0 ∈ R^d, we extend v_0 to the constant vector field v ≡ v_0 on a ball B(x, r_0), r_0 > 0 (on a Riemannian manifold one would instead extend v_0 by parallel displacement along minimal geodesics). Since μ({x + θv_0}) = 0 for a.e. θ ≥ 0, by the continuity of f and the formula (8.3.2) proved above for the case μ({x}) = 0, we obtain

    [f(μ + sδ_{x+rv_0}) − f(μ + sδ_x)]/r = (1/r) ∫_0^r (d/dθ) f(μ + sδ_{x+θv_0}) dθ
      = (1/r) ∫_0^r ⟨∇f(μ + sδ_·)(x + θv_0), v_0⟩ dθ
      = (s/r) ∫_0^r ⟨Df(μ + sδ_{x+θv_0})(x + θv_0), v_0⟩ dθ,  r ∈ (0, r_0).

By the continuity of Df, letting r ↓ 0, this implies (8.3.2). □


Theorem 8.15 implies C_B^{E,1,1}(M_k) ⊂ C^1(M_k) for k ∈ [1, 2]. However, a function f ∈ C^1(M_k) is not necessarily extrinsically differentiable. For instance, let ψ ∈ C([0, ∞)) be non-differentiable, and let f(μ) = ψ(μ(R^d)). Then f(μ + sδ_x) = ψ(μ(R^d) + s), which is not differentiable in s, so that f is not extrinsically differentiable. But it is easy to see that f ∈ C^1(M_k) with Df(μ) = 0, since the pushforward μ ∘ (id + εv)^{−1} does not change the total mass. Of course, this counter-example does not work for functions on P_k.
By extending a function on P_k to M_k, we may apply Theorem 8.15 to establish the corresponding link for functions on P_k. As an application, we will present a derivative formula for the distributions of random variables.

For s_0 > 0 and a family of R^d-valued random variables {ξ_s}_{s∈[0,s_0)} on a probability space (Ω, F, P), we say that ξ̇_0 := (d/ds)ξ_s|_{s=0} exists in L^q(P) for some q ≥ 1 if ξ̇_0 ∈ R_k and

    lim_{s↓0} E| (ξ_s − ξ_0)/s − ξ̇_0 |^q = 0.    (8.3.9)
Corollary 8.16. Let k ∈ [1, ∞).

(1) Let f ∈ C^{E,1,1}(P_k). Then f is intrinsically differentiable and

    Df(μ)(x) = ∇{D̃^E f(μ)(·)}(x),  (x, μ) ∈ R^d × P_k.    (8.3.10)

When k ≤ 2 and f ∈ C_B^{E,1,1}(P_k), we have f ∈ C^1(P_k).

(2) If f ∈ C^{E,1}(P_k), then f((1 − s)μ + sδ_·) ∈ C^1(R^d) with

    ∇f((1 − s)μ + sδ_·)(x) = sDf((1 − s)μ + sδ_x)(x),  x ∈ R^d.    (8.3.11)

Consequently,

    Df(μ)(x) = lim_{s↓0} (1/s) ∇f((1 − s)μ + sδ_·)(x),  f ∈ C^{E,1}(P_k), (x, μ) ∈ R^d × P_k.    (8.3.12)

(3) Let {ξ_s}_{s∈[0,s_0)} be R^d-valued random variables with L_{ξ_s} ∈ P_k continuous in s, such that ξ̇_0 := (d/ds)ξ_s|_{s=0} exists in L^q(Ω → R^d; P) for some q ≥ 1. Then

    lim_{s↓0} [f(L_{ξ_s}) − f(L_{ξ_0})]/s = E[⟨Df(L_{ξ_0})(ξ_0), ξ̇_0⟩]    (8.3.13)

holds for any f ∈ C^{E,1,1}(P_k) such that for any compact set K ⊂ P_k,

    sup_{μ∈K} |∇{D̃^E f(μ)}|(x) ≤ C(1 + |x|)^{k(q−1)/q},  x ∈ R^d    (8.3.14)

holds for some constant C > 0.


Proof. To apply Theorem 8.15, we extend a function f on Pk to
f˜ on Mk as in the proof of Proposition 8.14, i.e. by letting
 
μ
f˜(μ) = h(μ(R ))f
d
, μ ∈ Mk ,
μ(Rd )
where h ∈ C0∞ (R) with support contained in [ 14 , 2] and
 
1 3
h(r) = 1 for r ∈ , .
2 2
It is easy to see that
f ((1 − s)μ + sν) = f˜((1 − s)μ + sν), s ∈ [0, 1], μ, ν ∈ Pk ,
and f ∈ C E,1,1 (Pk ) implies that f˜ ∈ C E,1,1 (Mk ) and
D E f˜(μ) = D̃ E f (μ), Df (μ) = D f˜(μ), μ ∈ P.
Then Corollary 8.16(1)–(4) follow from the corresponding assertions
in Theorem 8.15 with f˜ replacing f .
Finally, since f ∈ C E,1,1 (Pk ) and
∇{D̃ E f (μ)} = ∇{D E f˜}(μ) = Df (μ), μ ∈ Pk ,
(8.3.13) follows from Theorem 8.7. 

8.4 Gaussian Measures on P_2 and M

The Gaussian measure, also known as the normal distribution, plays a key role in probability theory and related analysis. For instance, by the central limit theorem, the renormalized partial sums of i.i.d. random variables converge weakly to the standard Gaussian measure. In this section, we introduce Gaussian measures on P_2 and M as images of Gaussian distributions on Hilbert spaces.
8.4.1 Gaussian measure on Hilbert space

Let H be a separable Hilbert space with orthonormal basis (ONB for short) {e_i}_{i≥1}. Let (L, D(L)) be a positive definite self-adjoint operator on H with

    Le_i = α_i e_i,  i ≥ 1,

for positive constants {α_i}_{i≥1} satisfying α_i ↓ 0 as i ↑ ∞ and Σ_{i=1}^∞ α_i < ∞.

Definition 8.17. Let {ξ_i}_{i≥1} be i.i.d. random variables with standard normal distribution N(0, 1). Then

    ξ = Σ_{i=1}^∞ α_i^{1/2} ξ_i e_i

is called a Gaussian random variable on H, whose distribution

    G_L(A) := P(ξ ∈ A),  A ∈ B(H),

is called the Gaussian measure on H with covariance operator L, denoted by G_L = N(0, L).
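A truncation of this series is easy to simulate. The following minimal Python sketch makes the illustrative choices H = L²([0,1], dt), e_i(t) = √2 sin(iπt) and α_i = i^{−2} (so that Σ_i α_i = π²/6 < ∞), and checks E||ξ||_H² = Σ_i α_i by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(4)
n, grid = 50, np.linspace(0.0, 1.0, 501)

# Illustrative spectral data: ONB e_i(t) = sqrt(2) sin(i*pi*t), eigenvalues i^{-2}.
alpha = 1.0 / np.arange(1, n + 1) ** 2
e = np.sqrt(2.0) * np.sin(np.outer(np.arange(1, n + 1), np.pi * grid))

# Truncated series xi = sum_i sqrt(alpha_i) * xi_i * e_i with xi_i i.i.d. N(0,1):
coeffs = rng.standard_normal((10_000, n)) * np.sqrt(alpha)
paths = coeffs @ e                                  # realizations of xi on the grid

norms2 = (paths ** 2).mean(axis=1)   # grid approximation of the squared L^2 norm
print(alpha.sum(), norms2.mean())    # both approximately 1.625 (truncated pi^2/6)
```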
In the following, we introduce the integration by parts formula for G_L. To this end, we introduce the class C_b^1(H) of functions on H.

Definition 8.18. A function f on H is called Gâteaux differentiable if for any x ∈ H,

    H ∋ v → ∇_v f(x) := lim_{ε↓0} [f(x + εv) − f(x)]/ε ∈ R

is a well-defined bounded linear functional. In this case, the unique element ∇f(x) ∈ H such that

    ⟨∇f(x), v⟩_H = ∇_v f(x),  v ∈ H,

is called the Gâteaux derivative of f at the point x.

If f has a Gâteaux derivative and

    lim_{||v||_H ↓ 0} [f(x + v) − f(x) − ∇_v f(x)] / ||v||_H = 0,  x ∈ H,

then f is called Fréchet differentiable.
If f is Fréchet differentiable and ∇f : H → H is continuous, we write f ∈ C^1(H). If, moreover, ||∇f||_H + |f| is bounded on H, we write f ∈ C_b^1(H).

Next, we introduce the divergence of vector fields on H.

Definition 8.19. A vector field on H is a measurable map v : H → H. We write v ∈ C_b^1(H; H) if for each i ≥ 1 we have

    v_i := ⟨v, e_i⟩_H ∈ C_b^1(H),

and there exists a constant c > 0 such that

    ||v||_H + Σ_{i=1}^∞ |∇_{e_i} v_i| ≤ c.

In this case,

    ∇·v := Σ_{i=1}^∞ ∇_{e_i} v_i ∈ C(H)

is called the divergence of v.


Theorem 8.20. Let f ∈ C_b^1(H) and v ∈ C_b^1(H; H) be such that Σ_{i=1}^∞ |v_i| α_i^{−1/2} is bounded. Then

    ∫_H ⟨v, ∇f⟩_H dG_L = −∫_H (∇·v − Σ_{i=1}^∞ r_i v_i / α_i) f dG_L,

where r_i := ⟨·, e_i⟩_H.

Proof. We first express G_L in the coordinates

    Φ : H ∋ x → (⟨x, e_i⟩_H)_{i≥1} ∈ ℓ²,

where

    ℓ² := {r = (r_i)_{i≥1} ∈ R^N : Σ_{i=1}^∞ r_i² < ∞}

is a Hilbert space. Let

    Λ_i(dr_i) = (2πα_i)^{−1/2} e^{−r_i²/(2α_i)} dr_i,  i ≥ 1,

be the centered normal distribution with variance α_i. Then the product measure Λ := ⊗_{i=1}^∞ Λ_i is supported on ℓ², since

    ∫_{R^N} Σ_{i=1}^∞ r_i² Λ(dr) = Σ_{i=1}^∞ α_i < ∞.

It is easy to see that G_L = Λ ∘ Φ^{−1}. Combining this with the integral transformation theorem (Theorem 3.27) and the one-dimensional integration by parts formula

    ∫_R h(r_i) g′(r_i) dΛ_i = −∫_R (h′(r_i) − r_i h(r_i)/α_i) g(r_i) Λ_i(dr_i),  h, g ∈ C_b^1(R),

we finish the proof. □

8.4.2 Gaussian measures on P_2

Let μ_0 ∈ P_2, so that the tangent space H := T_{μ_0,2} is a separable Hilbert space, and let G_L be a Gaussian measure on H. This measure induces a Gaussian measure on P_2 under the map

    Φ_1 : H ∋ φ → μ_0 ∘ (id + φ)^{−1} ∈ P_2.

Definition 8.21. We call

    N_{μ_0,L} := G_L ∘ Φ_1^{−1}

the Gaussian measure on P_2 with parameter (μ_0, L).

By the chain rule (Theorem 8.7), we have the following result.

Theorem 8.22. Let u, v ∈ C_b^1(P_2). Then

    f := u ∘ Φ_1,  g := v ∘ Φ_1 ∈ C_b^1(T_{μ_0,2})

and

    ∫_{P_2} ⟨Du(μ), Dv(μ)⟩_{T_{μ,2}} N_{μ_0,L}(dμ) = ∫_{T_{μ_0,2}} ⟨∇f, ∇g⟩_{T_{μ_0,2}} dG_L.
8.4.3 Gaussian measures on M

In this section, we construct Gaussian measures on M supported on the subspace of absolutely continuous measures

    M_ac := {μ ∈ M : ρ_μ := dμ/dx exists}.

In this case, we choose

    H = L²(R^d, dx) := {h : R^d → R measurable : ∫_{R^d} h(x)² dx < ∞}.

We then consider the image of the Gaussian measure G_L on H under the map

    Φ_2 : H = L²(dx) ∋ h → h(x)² dx ∈ M_ac.

Definition 8.23. We call

    N_L := G_L ∘ Φ_2^{−1}

the Gaussian measure on M (or M_ac) with parameter L.
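To visualize N_L, one can push a sampled h through Φ_2. The sketch below reuses the spectral data of the previous example (again illustrative: everything is restricted to d = 1 and the interval [0, 1], whereas the text works on R^d):

```python
import numpy as np

rng = np.random.default_rng(5)
n, grid = 50, np.linspace(0.0, 1.0, 501)

# Same illustrative spectral data as before, on H = L^2([0,1], dt):
alpha = 1.0 / np.arange(1, n + 1) ** 2
e = np.sqrt(2.0) * np.sin(np.outer(np.arange(1, n + 1), np.pi * grid))

h = (np.sqrt(alpha) * rng.standard_normal(n)) @ e   # a sample h with law G_L (truncated)
density = h ** 2                                    # Phi_2(h) = h(x)^2 dx, an element of M_ac

# mu := Phi_2(h) is a random finite measure; its total mass is ||h||_{L^2}^2.
print(density.mean())                               # grid approximation of mu([0,1])
```

Squaring the Gaussian field guarantees a nonnegative density, which is why Φ_2 maps H into M_ac rather than into signed measures.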


By Proposition 8.12 for k = 0 (see Exercise 5), we may prove the
following result.
Theorem 8.24. Let u, v ∈ CbE,1,1(M), and let H = L2 (Rd , dx).
Then

f := u ◦ Φ2 , g := v ◦ Φ2 ∈ Cb1 (H)

and

D E u(μ), D E v(μ) L2 (μ) NL ( dμ) = ∇f, ∇g H dGL .


P2 H

8.5 Exercises

1. Prove that Mk is a Polish space.


2. Let

    f(μ) = g(μ(h_1), ..., μ(h_n))

for some n ≥ 1, g ∈ C^1(R^n) and h_i ∈ C_b^1(R^d), 1 ≤ i ≤ n. Prove that f ∈ C_b^1(P_k) and

    Df(μ) = Σ_{i=1}^n (∂_i g)(μ(h_1), ..., μ(h_n)) ∇h_i.

Functions of this type are called C_b^1-cylindrical functions on P_k.


3. Let μ_0 ∈ P_k be absolutely continuous with respect to the Lebesgue measure. By [2, Theorem 6.2.10], for any μ ∈ P_k there exists a unique φ_μ ∈ T_{μ_0,k} such that

    μ = μ_0 ∘ (id + φ_μ)^{−1},  W_k(μ_0, μ) = (μ_0(|φ_μ|^k))^{1/k}.

Use this assertion and the chain rule to prove the following: if f ∈ C^1(P_k) satisfies (8.1.12) for all μ ∈ P_k with some constant c(μ) > 0, then for any μ ∈ P_k,

    f(μ) − f(μ_0) = ∫_0^1 ⟨(Df)(μ_0 ∘ (id + rφ_μ)^{−1}), φ_μ⟩_{L²(μ_0)} dr.

4. Let k ∈ [1, ∞). For two probability measures μ, ν ∈ P_k, define the k-variational distance

    ||μ − ν||_{k,var} := sup_{|f|≤1+|·|^k} |μ(f) − ν(f)|.

For any f ∈ C_K^{E,1,1}(P_k) such that

    |D̃^E f(μ)(x)| ≤ 1 + |x|^k,  μ ∈ P_k, x ∈ R^d,

prove

    |f(μ) − f(ν)| ≤ ||μ − ν||_{k,var}.

5. For k ∈ [0, 1], let P_k, M_k, D^E, D̃^E, C^{E,1,1}(M_k), C_K^{E,1,1}(P_k) and C_K^{E,1,1}(M_k) be defined as before. Prove that Propositions 8.12–8.14 still hold. Can one also define the directional intrinsic derivative and the intrinsic derivative on P_k and M_k for k ∈ [0, 1]?
6. Let k ∈ [0, ∞). Prove that there exists a constant c > 0 such that

    ||μ − ν||_var + W_k(μ, ν)^{1∨k} ≤ c ||μ − ν||_{k,var},  μ, ν ∈ P_k.

Moreover, when k > 1, find a counter-example showing that there is no constant c > 0 for which the inequality

    W_k(μ, ν) ≤ c ||μ − ν||_{k,var},  μ, ν ∈ P_k,

holds.


7. Prove Theorem 8.22.
8. Prove Theorem 8.24.
Bibliography

[1] Billingsley, P. Probability and Measure. 3rd Ed. John Wiley and Sons,
New York, 1995.
[2] Bogachev, V. I. Measure Theory (Vol. I). Springer, Berlin, 2007.
[3] Chen, Mu-Fa. From Markov Chains to Non-equilibrium Particle
Systems. 2nd Ed. World Scientific, River Edge, NJ, 2004.
[4] Dudley, R. Real Analysis and Probability. Wadsworth, Pacific Grove,
CA, 1989.
[5] Durrett, R. Probability: Theory and Examples. 5th Ed. Cambridge
University Press, Cambridge, 2019.
[6] Kallenberg, O. Foundations of Modern Probability. 3rd Ed. Springer,
Cham, 2021.
[7] Feller, W. An Introduction to Probability Theory and Its Applications (Vol. I, II). John Wiley and Sons, London, 1971.
[8] Loève, M. Probability Theory II. 4th Ed. Springer-Verlag, New York-Heidelberg, 1978.
[9] Neveu, J. Mathematical Foundations of the Calculus of Probability.
Holden-Day, San Francisco, Calif.-London-Amsterdam, 1965.
[10] Rachev, S. The Monge–Kantorovich Mass Transference Problem and
Its Stochastic Applications. Theory of Probability and Applications,
Vol. XXIX, 1985, 647–676.
[11] Reed, M., Simon, B. Methods of Modern Mathematical Physics (Vol. I). Academic Press, New York, 1980.
[12] Shiryayev, A. Probability. Springer-Verlag, New York, 1984.
[13] Villani, C. Optimal Transport. Springer-Verlag, Berlin, 2009.
[14] Wang, Jaigang. Foundation of Modern Probability Theory
(in Chinese). 2nd Ed. Fudan University Press, Shanghai, 2005.
[15] Yan, Jiaan. Lectures on Measure Theory (in Chinese). 2nd Ed. Science
Press, Beijing, 2004.


[16] Yan, Shi-Jian, Liu, Xiu-Fang. Measure and Probability (in Chinese). Beijing Normal University Press, Beijing, 2003.
[17] Yan, Shi-Jian, Wang, Jun-Xiang, Liu, Xiu-Fang. Foundation of Prob-
ability Theory (in Chinese). 2nd Ed. Science Press, Beijing, 2009.
[18] Yosida, K. Functional Analysis. 6th Ed. Springer-Verlag, Berlin,
1980.
[19] Zhang, Gongqing, Lin, Yuanqu. Lectures on Functional Analysis
(Vol I) (in Chinese). Peking University Press, Beijing, 1990.
[20] Zhang, Gongqing, Guo, Maozheng. Lectures on Functional Analysis
(Vol II) (in Chinese). Peking University Press, Beijing, 2001.
Index

A
absolute continuity, 80
additivity, 11
algebra, 3
almost everywhere convergence, 45
almost everywhere (a.e.), 45
almost sure (a.s.), 45

B
Boolean algebra, 3
Borel σ-algebra, 6
Borel field, 6
boundedness in integral, 75

C
central limit theorem, 136
characteristic function, 66
  of finite measure, 121
complete measure space, 22
conditional expectation, 106
conditional probability, 107
convergence in distribution, 50
convergence in rth mean, 71
convergence in measure, 46
convexity extrinsic derivative, 168
coupling, 145
covariance coefficient, 66
covariance matrix, 66
Cr inequality, 72–73

D
decomposition
  of distribution function, 84
directional derivative, 160
distribution function, 29, 39
  probability, 42
distribution law, 42
dominated convergence theorem, 61

E
L-system, 36
elementary function, 33
essential supremum, 88
existence of integral, 58
extension of measure, 17
extrinsic derivative, 168

F
family of consistent probability measures, 115
Fatou-Lebesgue theorem, 60
Fourier-Stieltjes transform, 121
Fubini's theorem, 94
  generalized, 100
function of sets, 11
  σ-additivity, 11
  σ-finite, 12
  continuous, 14
  finite, 12
  finite additivity, 11

G
Gaussian measure on Hilbert space, 182
Gaussian measure on M, 185
Gaussian measure on P2, 184
geometric probability model, 16

H
Hölder's inequality, 71
Hahn's decomposition theorem, 79

I
indefinite integral, 63
independent, 42
indicator function, 33
infinite product σ-algebra, 95
integrable, 58
integral, 56
integral characteristic function, 133
integral transformation theorem, 69
intrinsic derivative, 160
inverse formula, 123
inverse image, 30

J
Jensen's inequality, 72

K
Kantorovich dual formula, 148
Kolmogorov's consistent theorem, 116

L
λ-system, 8
law of large numbers, 135
Lebesgue's decomposition theorem, 80
Lebesgue-Stieltjes (L-S) integral, 69
Lebesgue-Stieltjes (L-S) measure, 39
L-derivative, 161
Lr space, 71

M
μ∗-measurable, 20
mathematical expectation, 65
measurable cover, 23
measurable cylindrical set, 95
measurable function, 30
measurable map, 30
measurable space, 5
measure, 12
measure extension theorem, 18
measure space, 15
metrization of weak topology, 143
Minkowski's inequality, 73
mixed conditional distribution, 112
monotone class theorem, 7
  for functions, 37
  for set classes, 10
monotone convergence theorem, 57
mutually singular, 80

N
nonnegative definite function, 137
null set, 22

O
optimal mean square approximation, 109
optimal transport, 145
outer measure, 19

P
π-system, 8
positive part and negative part of function, 34
probability measure, 12
probability space, 15
product σ-algebra, 10
Prohorov's theorem, 130

R
rth central moment, 66
rth moment, 66
Radon-Nikodym derivative, 84
Radon-Nikodym theorem, 83
random variable, 29
  continuous type, 70
  discrete type, 70
rectangle, 10
regular conditional distribution, 112
regular conditional probability, 111
restriction of measure, 17

S
section of a function, 93
section of a set, 92
semi-algebra, 2
σ-algebra, 5
signed measure, 12
simple function, 33
strong convergence, 126

T
tangent space, 159
tight, 130
total variation distance, 153
transition measures, 99
transition probability, 99
Tulcea's theorem, 101

U
uniform continuity in integral, 75
uniform convergence, 126
uniform integrability, 75

V
vague convergence, 126
variance, 66
vector field, 158

W
Wasserstein coupling, 153
Wasserstein distance, 146
weak convergence, 126
