Measure Theoretic Probability
P.J.C. Spreij
4 Integration
4.1 Integration of simple functions
4.2 A general definition of integral
4.3 Integrals over subsets
4.4 Expectation and integral
4.5 Functions of bounded variation and Stieltjes integrals
4.6 Lp-spaces of random variables
4.7 Lp-spaces of functions
4.8 Exercises
5 Product measures
5.1 Product of two measure spaces
5.2 Application in Probability theory
5.3 Infinite products
5.4 Exercises
6 Derivative of a measure
6.1 Linear functionals on Rn
6.2 Linear functionals on a Hilbert space
6.3 Real and complex measures
6.4 Absolute continuity and singularity
6.5 The Radon-Nikodym theorem
6.6 Decomposition of a distribution function
6.7 The fundamental theorem of calculus
6.8 Dual spaces
6.9 Additional results
6.10 Exercises
7 Convergence and Uniform Integrability
7.1 Modes of convergence
7.2 Uniform integrability
7.3 Exercises
8 Conditional expectation
8.1 A simple, finite case
8.2 Conditional expectation for X ∈ L1(Ω, F, P)
8.3 Conditional probabilities
8.4 Exercises
1.1 σ-algebras
Definition 1.1 Let S be a non-empty set. A collection Σ0 ⊂ 2S is called an
algebra (on S) if
(i) S ∈ Σ0 ,
(ii) E ∈ Σ0 ⇒ E c ∈ Σ0 ,
(iii) E, F ∈ Σ0 ⇒ E ∪ F ∈ Σ0 .
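For instance, on S = {1, 2, 3} the collection Σ0 = {∅, {1}, {2, 3}, S} is an algebra: S ∈ Σ0, complements of members are again members ({1}^c = {2, 3}, S^c = ∅), and the union of any two members lies in Σ0. Note that for a finite collection such as this one, being an algebra already implies being a σ-algebra, since countable unions of its members reduce to finite unions.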
Proof We prove the two obvious inclusions, starting with σ(I) ⊂ B. Since (−∞, x] = ∩n (−∞, x + 1/n) ∈ B, we have I ⊂ B and then also σ(I) ⊂ B, since σ(I) is the smallest σ-algebra that contains I. (Below we will use this kind of argument repeatedly.)
For the proof of the reverse inclusion we proceed in three steps. First we observe that (−∞, x) = ∪n (−∞, x − 1/n] ∈ σ(I). Knowing this, we conclude that (a, b) = (−∞, b) \ (−∞, a] ∈ σ(I). Let then G be an arbitrary open set.
Since G is open, for every x ∈ G there exists a rational εx > 0 such that
(x − 2εx , x + 2εx ) ⊂ G. Consider (x − εx , x + εx ) and choose a rational qx in
this interval, note that |x − qx | ≤ εx . It follows that x ∈ (qx − εx , qx + εx ) ⊂
(x − 2εx , x + 2εx ) ⊂ G. Hence G ⊂ ∪x∈G (qx − εx , qx + εx ) ⊂ G, and so
G = ∪x∈G (qx − εx , qx + εx ). But the union here is in fact a countable union,
since there are only countably many qx and εx . (Note that the arguments
above can be used for any metric space with a countable dense subset to get
that an open G is a countable union of open balls.) It follows that G ∈ σ(I),
hence O ⊂ σ(I), and therefore (recall B is the smallest σ-algebra containing O)
B ⊂ σ(I).
the empty set and conclude that B(R) has cardinality at most equal to 2ℵ0 .
1.2 Measures
Let Σ0 be an algebra on a set S, and Σ be a σ-algebra on S. We consider
mappings µ0 : Σ0 → [0, ∞] and µ : Σ → [0, ∞]. Note that ∞ is allowed as a
possible value.
We call µ0 finitely additive if µ0 (∅) = 0 and if µ0 (E ∪ F ) = µ0 (E) + µ0 (F )
for every pair of disjoint sets E and F in Σ0 . Of course this addition rule then
extends to arbitrary finite unions of disjoint sets. The mapping P µ0 is called
σ-additive or countably additive, if µ0 (∅) = 0 and if µ0 (∪n En ) = n µ0 (En ) for
every sequence (En ) of disjoint sets of Σ0 whose union is also in Σ0 . σ-additivity
is defined similarly for µ, but then we don’t have to require that ∪n En ∈ Σ.
This is true by definition.
Theorem 1.5 There exists a unique measure λ on (R, B) with the property
that for every interval I = (a, b] with a < b it holds that λ(I) = b − a.
The proof of this theorem is deferred to later, see Theorem 2.6. For the time
being, we take this existence result for granted. One remark is in order. One can
show that B is not the largest σ-algebra for which the measure λ can coherently
be defined. On the other hand, on the power set of R it is impossible to define
a measure that coincides with λ on the intervals. We’ll come back to this later.
(i) If E ⊂ F , then µ(E) ≤ µ(F ).
(ii) µ(E ∪ F ) ≤ µ(E) + µ(F ).
(iii) µ(∪_{k=1}^n Ek) ≤ ∑_{k=1}^n µ(Ek).
If µ is finite, we also have
(iv) If E ⊂ F , then µ(F \ E) = µ(F ) − µ(E).
(v) µ(E ∪ F ) = µ(E) + µ(F ) − µ(E ∩ F ).
Proof The set F can be written as the disjoint union F = E ∪ (F \ E). Hence
µ(F ) = µ(E) + µ(F \ E). Property (i) now follows and (iv) as well, provided µ
is finite. To prove (ii), we note that E ∪ F = E ∪ (F \ (E ∩ F )), a disjoint union,
and that E ∩ F ⊂ F . The result follows from (i). Moreover, (v) also follows, if
we apply (iv). Finally, (iii) follows from (ii) by induction.
Remark 1.9 The finiteness condition in the second assertion of Proposition 1.7
is essential. Consider N with the counting measure τ . Let Fn = {n, n + 1, . . .},
then ∩n Fn = ∅ and so it has measure zero. But τ (Fn ) = ∞ for all n.
1.4 π- and d-systems
In general it is hard to grab what the elements of a σ-algebra Σ are, but often
collections C such that σ(C) = Σ are easier to understand. In ‘good situations’
properties of Σ can easily be deduced from properties of C. This is often the
case when C is a π-system, to be defined next.
Proof Suppose that we would know that d(I) is a π-system as well. Then
Proposition 1.12 yields that d(I) is a σ-algebra, and so it contains σ(I). Since
the reverse inclusion is always true, we have equality. Therefore we will prove
that indeed d(I) is a π-system.
Step 1. Put D1 = {B ∈ d(I) : B ∩ C ∈ d(I), ∀C ∈ I}. We claim that
D1 is a d-system. Given that this holds and because, obviously, I ⊂ D1 , also
d(I) ⊂ D1 . Since D1 is defined as a subset of d(I), we conclude that these
two collections are the same. We now show that the claim holds. Evidently
S ∈ D1 . Let B1 , B2 ∈ D1 with B1 ⊂ B2 and C ∈ I. Write (B2 \ B1 ) ∩ C as
(B2 ∩ C) \ (B1 ∩ C). The last two intersections belong to d(I) by definition of
D1 and so does their difference, since d(I) is a d-system. For Bn ↑ B, Bn ∈ D1
and C ∈ I we have (Bn ∩ C) ∈ d(I) which then converges to B ∩ C ∈ d(I). So
B ∈ D1 .
Step 2. Put D2 = {C ∈ d(I) : B ∩ C ∈ d(I), ∀B ∈ d(I)}. We claim, again,
(and you check) that D2 is a d-system. The key observation is that I ⊂ D2 .
Indeed, take C ∈ I and B ∈ d(I). The latter collection is nothing else but D1 ,
according to step 1. But then B ∩ C ∈ d(I), which means that C ∈ D2 . It now
follows that d(I) ⊂ D2 , but then we must have equality, because D2 is defined
as a subset of d(I). The equality D2 = d(I) and the definition of D2 together
imply that d(I) is a π-system, as desired.
All these efforts lead to the following very useful theorem. It states that any
finite measure on Σ is characterized by its action on a rich enough π-system.
We will meet many occasions where this theorem is used.
Proof The whole idea behind the proof is to find a good d-system that contains
I. The following set is a reasonable candidate. Put D = {E ∈ Σ : µ1 (E) =
µ2 (E)}. The inclusions I ⊂ D ⊂ Σ are obvious. If we can show that D is a
d-system, then Corollary 1.14 gives the result. The fact that D is a d-system is
straightforward to check, we present only one verification. Let E, F ∈ D such
that E ⊂ F . Then (use Proposition 1.6 (iv)) µ1 (F \ E) = µ1 (F ) − µ1 (E) =
µ2 (F ) − µ2 (E) = µ2 (F \ E) and so F \ E ∈ D.
Remark 1.16 In the above proof we have used the fact that µ1 and µ2 are
finite. If this condition is violated, then the assertion of the theorem is not
valid in general. Here is a counterexample. Take N with the counting measure µ1 = τ and let µ2 = 2τ. A π-system that generates 2^N is given by the sets Gn = {n, n + 1, . . .} (n ∈ N). Indeed, µ1(Gn) = µ2(Gn) = ∞ for all n, but µ1({1}) = 1 ≠ 2 = µ2({1}).
1.5 Probability language
In Probability Theory, one usually writes (Ω, F, P) instead of (S, Σ, µ), and
one then speaks of a probability space. On the one hand this is merely a change of notation and language. We still have that Ω is a set, F a σ-algebra on it, and P
a measure, but in this case, P is a probability measure (often also simply called
probability), P(Ω) = 1. In probabilistic language, Ω is often called the set of
outcomes and elements of F are called events. So by definition, an event is a
measurable subset of the set of all outcomes.
A probability space (Ω, F, P) can be seen as a mathematical model of a random
experiment. Consider for example the experiment consisting of tossing two
coins. Each coin has individual outcomes 0 and 1. The set Ω can then be
written as {00, 01, 10, 11}, where the notation should be obvious. In this case,
we take F = 2^Ω and a choice of P could be such that P assigns probability 1/4
to all singletons. Of course, from a purely mathematical point of view, other
possibilities for P are conceivable as well.
A more interesting example is obtained by considering an infinite sequence
of coin tosses. In this case one should take Ω = {0, 1}N and an element ω ∈ Ω
is then an infinite sequence (ω1 , ω2 , . . .) with ωn ∈ {0, 1}. It turns out that one
cannot take the power set of Ω as a σ-algebra, if one wants to have a nontrivial
probability measure defined on it. As a matter of fact, this holds for the same
reason that one cannot take the power set on (0, 1] to have a consistent notion
of Lebesgue measure. This has everything to do with the fact that one can set
up a bijective correspondence between (0, 1) and {0, 1}N . Nevertheless, there
is a good candidate for a σ-algebra F on Ω. One would like to have that sets
like ‘the 12-th outcome is 1’ are events. Let C be the collection of all such sets,
C = {{ω ∈ Ω : ωn = s}, n ∈ N, s ∈ {0, 1}}. We take F = σ(C) and all sets
{ω ∈ Ω : ωn = s} are then events. One can show that there indeed exists a
probability measure P on this F with the nice property that for instance the set
{ω ∈ Ω : ω1 = ω2 = 1} (in the previous example it would have been denoted by
{11}) has probability 1/4.
Having the interpretation of F as a collection of events, we now introduce two
special events. Consider a sequence of events E1 , E2 , . . . and define
lim sup En := ∩_{m=1}^∞ ∪_{n=m}^∞ En,
lim inf En := ∪_{m=1}^∞ ∩_{n=m}^∞ En.
Note that the sets Fm = ∩n≥m En form an increasing sequence and the sets
Dm = ∪n≥m En form a decreasing sequence. Clearly, F is closed under taking
limsup and liminf. The terminology is explained by (i) of Exercise 1.4. In prob-
abilistic terms, lim sup En is described as the event that the En occur infinitely
often, abbreviated by En i.o. Likewise, lim inf En is the event that the En occur
eventually. The former interpretation follows by observing that ω ∈ lim sup En
iff for all m, there exists n ≥ m such that ω ∈ En . In other words, a particular
outcome ω belongs to lim sup En iff it belongs to some (infinite) subsequence of
(En).
The terminology to call ∪_{m=1}^∞ ∩_{n=m}^∞ En the lim inf of the sequence is justified
in Exercise 1.4. In this exercise, indicator functions of events are used, of which
we here recall the definition. If E is an event, then the function 1E is defined
by 1E (ω) = 1 if ω ∈ E and 1E (ω) = 0 if ω ∈ / E.
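A small example may help to fix these notions. Take En = [0, 1] for even n and En = [1, 2] for odd n. Every point of [0, 2] belongs to infinitely many En, so lim sup En = [0, 2], whereas only the point 1 belongs to all En from some index on, so lim inf En = {1}. In particular lim inf En ⊂ lim sup En, an inclusion that holds for any sequence of events.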
1.6 Exercises
1.1 Prove the following statements.
(a) The intersection of an arbitrary family of d-systems is again a d-system.
(b) The intersection of an arbitrary family of σ-algebras is again a σ-algebra.
(c) If C1 and C2 are collections of subsets of Ω with C1 ⊂ C2 , then d(C1 ) ⊂ d(C2 ).
1.2 Prove Corollary 1.8.
1.3 Prove the claim that D2 in the proof of Lemma 1.13 forms a d-system.
1.4 Consider a measure space (S, Σ, µ). Let (En ) be a sequence in Σ.
(a) Show that 1lim inf En = lim inf 1En .
(b) Show that µ(lim inf En ) ≤ lim inf µ(En ). (Use Proposition 1.7.)
(c) Show also that µ(lim sup En ) ≥ lim sup µ(En ), provided that µ is finite.
1.5 Let (S, Σ, µ) be a measure space. Call a subset N of S a (µ, Σ)-null set
if there exists a set N 0 ∈ Σ with N ⊂ N 0 and µ(N 0 ) = 0. Denote by N the
collection of all (µ, Σ)-null sets. Let Σ∗ be the collection of subsets E of S for
which there exist F, G ∈ Σ such that F ⊂ E ⊂ G and µ(G \ F ) = 0. For E ∈ Σ∗
and F, G as above we define µ∗ (E) = µ(F ).
(a) Show that Σ∗ is a σ-algebra and that Σ∗ = Σ ∨ N (= σ(N ∪ Σ)).
(b) Show that µ∗ restricted to Σ coincides with µ and that µ∗ (E) doesn’t
depend on the specific choice of F in its definition.
(c) Show that the collection of (µ∗ , Σ∗ )-null sets is N .
1.6 Let G and H be two σ-algebras on Ω. Let C = {G ∩ H : G ∈ G, H ∈ H}.
Show that C is a π-system and that σ(C) = σ(G ∪ H).
1.7 Let Ω be a countable set. Let F = 2^Ω and let p : Ω → [0, 1] satisfy ∑_{ω∈Ω} p(ω) = 1. Put P(A) = ∑_{ω∈A} p(ω) for A ∈ F. Show that P is a probability measure.
1.8 Let Ω be a countable set. Let A be the collection of A ⊂ Ω such that A
or its complement has finite cardinality. Show that A is an algebra. What is
d(A)?
1.9 Show that a finitely additive map µ : Σ0 → [0, ∞] is countably additive if µ(Hn) → 0 for every decreasing sequence of sets Hn ∈ Σ0 with ∩_n Hn = ∅. If µ is countably additive, do we necessarily have µ(Hn) → 0 for every decreasing sequence of sets Hn ∈ Σ0 with ∩_n Hn = ∅?
1.10 Consider the collection Σ0 of subsets of R that can be written as a finite
union of disjoint intervals of type (a, b] with −∞ ≤ a ≤ b < ∞ or (a, ∞). Show
that Σ0 is an algebra and that σ(Σ0 ) = B(R).
2 Existence of Lebesgue measure
In this chapter we construct the Lebesgue measure on the Borel sets of R. To
that end we need the concept of outer measure. Somewhat hidden in the proof of
the construction is the extension of a countably additive function on an algebra
to a measure on a σ-algebra. There are different versions of extension theorems,
originally developed by Carathéodory. Although extension theorems are of crucial importance in measure theory, we will confine our treatment mainly to what is needed for the construction of Lebesgue measure on (R, B). However, see also the end of this section.
By induction we obtain that for every sequence of disjoint sets Ei in Σµ it holds that for every F ⊂ S

µ∗(F ∩ ∪_{i=1}^n Ei) = ∑_{i=1}^n µ∗(F ∩ Ei). (2.1)

If E = ∪_{i=1}^∞ Ei, it follows from (2.1) and the monotonicity of µ∗ that

µ∗(F ∩ E) ≥ ∑_{i=1}^∞ µ∗(F ∩ Ei).

With Un := ∪_{i=1}^n Ei ∈ Σµ we then have, using (2.1) and Un^c ⊃ E^c,

µ∗(F) = µ∗(F ∩ Un) + µ∗(F ∩ Un^c)
 ≥ ∑_{i=1}^n µ∗(F ∩ Ei) + µ∗(F ∩ E^c)
 → ∑_{i=1}^∞ µ∗(F ∩ Ei) + µ∗(F ∩ E^c)
 = µ∗(F ∩ E) + µ∗(F ∩ E^c).
We will use Theorem 2.3 to show the existence of Lebesgue measure on (R, B).
Let E be a subset of R. By I(E) we denote a cover of E consisting of at
most countably many open intervals. For any interval I, we denote by λ0 (I)
its ordinary length. We now define a function λ∗ on 2^R by putting, for every E ⊂ R,

λ∗(E) = inf_{I(E)} ∑_{Ik∈I(E)} λ0(Ik). (2.3)
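As a first illustration of (2.3), consider a singleton {x}. For every ε > 0 the single interval (x − ε, x + ε) covers {x}, so λ∗({x}) ≤ 2ε, and hence λ∗({x}) = 0. Similarly, covering the k-th rational number by an open interval of length ε2^{−k} shows that λ∗(Q) ≤ ε for every ε > 0, so λ∗(Q) = 0.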
Proof Properties (i) and (ii) of Definition 2.1 are obviously true. We prove
subadditivity. Let E1 , E2 , . . . be arbitrary subsets of R and ε > 0. By definition
of λ∗ , there exist covers I(En ) of the En such that for all n
λ∗(En) ≥ ∑_{I∈I(En)} λ0(I) − ε2^{−n}. (2.4)
If this holds, then by taking the infimum on the right hand side of (2.5), it follows that λ0(I) ≤ λ∗(I). To prove (2.5) we proceed as follows. The covering intervals are open. By compactness of I, there exists a finite subcover {I1, . . . , In} of I. So it is sufficient to prove (2.5) for finite covers, which we do by induction on n. If n = 1, this is trivial. Assume it is true for covers with at most n − 1 elements. Assume that I = [a, b]. Then b is an element of some Ik = (ak, bk). Note that the interval I \ Ik (possibly empty) is covered by the remaining intervals, and by the induction hypothesis we have λ0(I \ Ik) ≤ ∑_{j≠k} λ0(Ij). But then we deduce λ0(I) = (b − ak) + (ak − a) ≤ (bk − ak) + (ak − a) ≤ λ0(Ik) + λ0(I \ Ik) ≤ ∑_j λ0(Ij).
Putting the previous results together, we obtain existence of the Lebesgue mea-
sure on B.
Theorem 2.6 The (restricted) function λ : B → [0, ∞] is the unique measure on B that satisfies λ(I) = λ0(I) for all intervals I.
Proof By Theorem 2.3 and Lemma 2.4 λ is a measure on Σλ and by Lemma 2.5
its restriction to B is a measure as well. Moreover, Lemma 2.4 states that λ(I) =
λ0 (I). The only thing that remains to be shown is that λ is the unique measure
with the latter property. Suppose that also a measure µ enjoys this property.
Then, for any a ∈ R and n ∈ N, we have that (−∞, a] ∩ [−n, +n] is an interval, hence λ((−∞, a] ∩ [−n, +n]) = µ((−∞, a] ∩ [−n, +n]). Since the intervals (−∞, a] form a π-system that generates B, we also have λ(B ∩ [−n, +n]) = µ(B ∩ [−n, +n]) for every B ∈ B, by Theorem 1.15. Letting n → ∞ and using Proposition 1.7, we conclude that λ = µ on B.
Proof We only sketch the main steps. First we define an outer measure on 2^S by putting

µ∗(E) = inf_{Σ0(E)} ∑_{Ek∈Σ0(E)} µ0(Ek),
where the infimum is taken over all Σ0 (E), countable covers of E with elements
Ek from Σ0 . Compare this to the definition in (2.3). It follows as in the proof
of Lemma 2.4 that µ∗ is an outer measure.
Let E ∈ Σ0 . Obviously, {E} is a (finite) cover of E and so we have that
µ∗(E) ≤ µ0(E). Let {E1, E2, . . .} be a cover of E with the Ek ∈ Σ0. Since µ0 is countably subadditive and E = ∪k (E ∩ Ek), we have µ0(E) ≤ ∑_k µ0(E ∩ Ek) and since µ0 is finitely additive, we also have µ0(E ∩ Ek) ≤ µ0(Ek). Collecting these results we obtain

µ0(E) ≤ ∑_k µ0(Ek).
Taking the infimum in the displayed inequality over all covers Σ0 (E), we obtain
µ0 (E) ≤ µ∗ (E), for E ∈ Σ0 . Hence µ0 (E) = µ∗ (E) and µ∗ is an extension of
µ0 .
In order to show that µ∗ restricted to Σ is a measure, it is by virtue of
Theorem 2.3 sufficient to show that Σ0 ⊂ Σµ , because we then also have Σ ⊂ Σµ .
We proceed to prove the former inclusion. Let F ⊂ S be arbitrary and ε > 0. Then there exists a cover Σ0(F) such that µ∗(F) ≥ ∑_{Ek∈Σ0(F)} µ0(Ek) − ε. Using the same kind of arguments as in the proof of Lemma 2.5, one obtains (using that µ0 is additive on the algebra Σ0, where it coincides with µ∗) for every E ∈ Σ0

µ∗(F) + ε ≥ ∑_k µ0(Ek)
 = ∑_k µ0(Ek ∩ E) + ∑_k µ0(Ek ∩ E^c)
 = ∑_k µ∗(Ek ∩ E) + ∑_k µ∗(Ek ∩ E^c)
 ≥ µ∗(F ∩ E) + µ∗(F ∩ E^c),

and since ε > 0 is arbitrary, E ∈ Σµ follows.
Now we show the mentioned key result. Let E ∈ Σ. Consider a cover Σ0(E) of E. Then we have, since ν is a measure on Σ, ν(E) ≤ ∑_k ν(Ek) = ∑_k µ0(Ek). By taking the infimum over such covers, we obtain ν(E) ≤ µ∗(E) = µ(E). We proceed to prove the converse inequality for sets E with µ(E) < ∞.
Let E ∈ Σ with µ(E) < ∞. Given ε > 0, we can choose a cover Σ0(E) such that µ(E) > ∑_{Ek∈Σ0(E)} µ(Ek) − ε. Let Un = ∪_{k=1}^n Ek and note that Un ∈ Σ0 and U := ∪_{n=1}^∞ Un = ∪_{k=1}^∞ Ek ∈ Σ. Since U ⊃ E, we obtain µ(E) ≤ µ(U), whereas σ-subadditivity of µ yields µ(U) ≤ ∑_k µ(Ek) < µ(E) + ε, which implies µ(U ∩ E^c) =
µ(U ) − µ(U ∩ E) = µ(U ) − µ(E) < ε. Since it also follows that µ(U ) < ∞, there
is N ∈ N such that µ(U ) < µ(UN ) + ε. Below we use that µ(UN ) = ν(UN ), the
already established fact that µ ≥ ν on Σ and arrive at the following chain of
(in)equalities.
ν(E) = ν(E ∩ U ) = ν(U ) − ν(U ∩ E c )
≥ ν(UN ) − µ(U ∩ E c )
≥ ν(UN ) − ε
= µ(UN ) − ε
> µ(U ) − 2ε
≥ µ(E) − 2ε.
It follows that ν(E) ≥ µ(E).
The assumption in Theorem 2.7 that the collection Σ0 is an algebra can be
weakened by only assuming that it is a semiring. This notion is beyond the
scope of the present course.
Unicity of the extension fails to hold for µ0 that are not σ-finite. Here is a
counterexample. Let S be an infinite set and Σ0 an arbitrary algebra consisting
of the empty set and infinite subsets of S. Let µ0 (E) = ∞, unless E = ∅, in
which case we have µ0 (E) = 0. Then µ(F ) defined by µ(F ) = ∞, unless F = ∅,
yields the extension of Theorem 2.7 on 2S , whereas the counting measure on 2S
also extends µ0 .
2.3 Exercises
2.1 Let µ be an outer measure on some set S. Let N ⊂ S be such that µ(N ) = 0.
Show that N ∈ Σµ .
2.2 Let (S, Σ, µ) be a measure space. A measurable covering of a subset A of S
is a countable collection {Ei : i ∈ N} ⊂ Σ such that A ⊂ ∪_{i=1}^∞ Ei. Let M(A) be the collection of all measurable coverings of A. Put µ∗(A) = inf{∑_{i=1}^∞ µ(Ei) : {E1, E2, . . .} ∈ M(A)}. Show that µ∗ is an outer measure on S and that
µ∗ (E) = µ(E), if E ∈ Σ. Show also that µ∗ (A) = inf{µ(E) : E ⊃ A, E ∈ Σ}.
We call µ∗ the outer measure associated to µ.
2.3 Let (S, Σ, µ) be a measure space and let µ∗ be the outer measure on S
associated to µ. If A ⊂ S, then there exists E ∈ Σ such that A ⊂ E and
µ∗ (A) = µ(E). Prove this.
2.4 Consider a measure space (S, Σ, µ) with σ-finite µ and let µ∗ be the outer
measure on S associated to µ. Show that Σµ ⊂ Σ ∨ N , where N is the collection
of all µ-null sets. Hint: Reduce the question to the case where µ is finite. Take
then A ∈ Σµ and E as in Exercise 2.3 and show that µ∗ (E \ A) = 0. (By
Exercise 2.1, we even have Σµ = Σ ∨ N .)
2.5 Show that the Lebesgue measure λ is translation invariant, i.e. λ(E + x) =
λ(E) for all E ∈ Σλ , where E + x = {y + x : y ∈ E}.
2.6 This exercise aims at showing the existence of a set E ∉ Σλ. First we define
an equivalence relation ∼ on R by saying x ∼ y iff x − y ∈ Q. By the axiom of
choice there exists a set E ⊂ (0, 1) that has exactly one point in each equivalence
class induced by ∼. The set E is our candidate.
(a) Show the following two statements. If x ∈ (0, 1), then ∃q ∈ Q ∩ (−1, 1) : x ∈ E + q. If q, r ∈ Q and q ≠ r, then (E + q) ∩ (E + r) = ∅.
(b) Assume that E ∈ Σλ . Put S = ∪q∈Q∩(−1,1) E +q and note that S ⊂ (−1, 2).
Use translation invariance of λ (Exercise 2.5) to show that λ(S) = 0,
whereas at the same time one should have λ(S) ≥ λ(0, 1).
(c) Show that λ∗ (E) = 1 and λ∗ ((0, 1) \ E) = 1.
3 Measurable functions and random variables
In this chapter we define random variables as measurable functions on a proba-
bility space and derive some properties.
Remark 3.4 Many variations on the assertions of Proposition 3.3 are possible. For instance, in (ii) we could also use {h < c} or {h > c}. Furthermore, (ii) is true for h : S → [−∞, ∞] as well. We proved (iv) by a simple composition argument, which also applies to a more general situation. Let (Si, Σi) be measurable spaces (i = 1, 2, 3), let h : S1 → S2 be Σ1/Σ2-measurable and let f : S2 → S3 be Σ2/Σ3-measurable. Then f ◦ h is Σ1/Σ3-measurable.
Theorem 3.6 Let H be a vector space of bounded functions, with the following
properties.
(i) 1 ∈ H.
(ii) If (fn ) is a nonnegative sequence in H such that fn+1 ≥ fn for all n, and
f := lim fn is bounded as well, then f ∈ H.
If, in addition, H contains the indicator functions of sets in a π-system I, then
H contains all bounded σ(I)-measurable functions.
Proof Put D = {F ⊂ S : 1F ∈ H}. One easily verifies that D is a d-system,
and that it contains I. Hence, by Corollary 1.14, we have Σ := σ(I) ⊂ D. We
will use this fact later in the proof.
Let f be a bounded, σ(I)-measurable function. Without loss of generality,
we may assume that f ≥ 0 (add a constant otherwise), and f < K for some real
constant K. Introduce the functions fn defined by fn = 2−n b2n f c. In explicit
terms, the fn are given by
fn(s) = ∑_{i=0}^{K2^n − 1} i2^{−n} 1_{{i2^{−n} ≤ f < (i+1)2^{−n}}}(s).
of X, we introduce its distribution function, usually denoted by F (or FX ,
or F X ). By definition it is the function F : R → [0, 1], given by F (x) =
µ((−∞, x]) = P(X ≤ x).
Proposition 3.8 The distribution function of a random variable is right con-
tinuous, non-decreasing and satisfies limx→∞ F (x) = 1 and limx→−∞ F (x) = 0.
The set of points where F is discontinuous is at most countable.
This proposition thus states, in a different wording, that for a random variable
X, its distribution, the collection of all probabilities P(X ∈ B) with B ∈ B, is
determined by the distribution function FX .
We call any function on R that has the properties of Proposition 3.8 a distri-
bution function. Note that any distribution function is Borel measurable (sets
{F ≥ c} are intervals and thus in B). Below, in Theorem 3.10, we justify this
terminology. We will see that for any distribution function F , it is possible
to construct a random variable on some (Ω, F, P), whose distribution function
equals F . This theorem is founded on the existence of the Lebesgue measure
λ on the Borel sets B[0, 1] of [0, 1], see Theorem 1.5. We now give a proba-
bilistic translation of this theorem. Consider (Ω, F, P) = ([0, 1], B[0, 1], λ). Let
U : Ω → [0, 1] be the identity map. The distribution of U on [0, 1] is trivially
the Lebesgue measure again, in particular the distribution function F U of U
satisfies F U (x) = x for x ∈ [0, 1] and so P(a < U ≤ b) = F U (b) − F U (a) = b − a
for a, b ∈ [0, 1] with a ≤ b. Hence, to the distribution function F U corresponds
a probability measure on ([0, 1], B[0, 1]) and there exists a random variable U
on this space, such that U has F U as its distribution function. The random
variable U is said to have the standard uniform distribution.
The proof of Theorem 3.10 (Skorokhod’s representation of a random variable
with a given distribution function) below is easy in the case that F is continuous
and strictly increasing (Exercise 3.6), given the just presented fact that a random
variable with a uniform distribution exists. The proof that we give below for
the general case just follows a more careful line of arguments, but is in spirit
quite similar.
Proof Let (Ω, F, P) = ((0, 1), B(0, 1), λ). We define X−(ω) = inf{z ∈ R : F(z) ≥ ω}. Then X−(ω) is finite for all ω and X− is a Borel measurable function, so a random variable; this follows from the relation, valid for all c ∈ R and ω ∈ (0, 1),

X−(ω) ≤ c if and only if ω ≤ F(c),

which one verifies using the right continuity of F.
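To see the definition of X− at work, take for F the standard exponential distribution function, F(z) = 1 − e^{−z} for z ≥ 0 and F(z) = 0 for z < 0. Then for ω ∈ (0, 1) one finds X−(ω) = inf{z : 1 − e^{−z} ≥ ω} = −log(1 − ω), and indeed P(X− ≤ x) = λ({ω ∈ (0, 1) : ω ≤ F(x)}) = F(x) for all x ≥ 0.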
3.3 Independence
Recall the definition of independent events. Two events E, F ∈ F are called
independent if the product rule P(E ∩ F ) = P(E)P(F ) holds. In the present
section we generalize this notion of independence to independence of a sequence
of events and to independence of a sequence of σ-algebras. It is even convenient
and elegant to start with the latter.
Definition 3.11
(i) A sequence of σ-algebras F1, F2, . . . is called independent, if for every n it holds that P(E1 ∩ · · · ∩ En) = ∏_{i=1}^n P(Ei), for all choices of Ei ∈ Fi (i = 1, . . . , n).
(ii) A sequence of random variables X1 , X2 , . . . is called independent if the
σ-algebras σ(X1 ), σ(X2 ), . . . are independent.
(iii) A sequence of events E1 , E2 , . . . is called independent if the random vari-
ables 1E1 , 1E2 , . . . are independent.
The above definition also applies to finite sequences. For instance, a finite sequence of σ-algebras F1, . . . , Fn is called independent if the infinite sequence F1, F2, . . . is independent in the sense of part (i) of the above definition, where Fm = {∅, Ω} for m > n. It follows that two σ-algebras F1 and F2 are independent, if P(E1 ∩ E2) = P(E1)P(E2) for all E1 ∈ F1 and E2 ∈ F2. To check
independence of two σ-algebras, Theorem 1.15 is again helpful.
Proposition 3.12 Let I and J be π-systems and suppose that for all I ∈ I
and J ∈ J the product rule P(I ∩ J) = P(I)P(J) holds. Then the σ-algebras
σ(I) and σ(J ) are independent.
Proof Put G = σ(I) and H = σ(J ). We define for each I ∈ I the finite
measures µI and νI on H by µI (H) = P(H ∩I) and νI (H) = P(H)P(I) (H ∈ H).
Notice that µI and νI coincide on J by assumption and that µI (Ω) = P(I) =
νI (Ω). Theorem 1.15 yields that µI (H) = νI (H) for all H ∈ H.
Now we consider for each H ∈ H the finite measures µH and ν H on G
defined by µH (G) = P(G ∩ H) and ν H (G) = P(G)P(H). By the previous step,
we see that µH and ν H coincide on I. Invoking Theorem 1.15 again, we obtain
P(G ∩ H) = P(G)P(H) for all G ∈ G and H ∈ H.
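A typical application: two random variables X and Y are independent as soon as P(X ≤ x, Y ≤ y) = P(X ≤ x)P(Y ≤ y) for all x, y ∈ R. Indeed, the collections {{X ≤ x} : x ∈ R} and {{Y ≤ y} : y ∈ R} are π-systems generating σ(X) and σ(Y) (Exercise 3.2), so Proposition 3.12 applies.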
(ii) If U is uniformly distributed on [0, 1), then there are Borel measurable
functions fk : [0, 1) → [0, 1) such that Zk = fk ◦ U defines an iid sequence,
with all Zk uniformly distributed on [0, 1) as well.
Proof (i) Let U have the uniform distribution on [0, 1). For x1, . . . , xn ∈ {0, 1}, one easily computes the joint probability P(X1 = x1, . . . , Xn = xn) = 2^{−n}. It follows that P(Xk = xk) = 1/2 for all k and that the Xk are independent.
Conversely, let the Xk be distributed as assumed. Let V be a random
variable having a uniform distribution on [0, 1). Then by the above part of the proof the sequence of Yk := bk ◦ V is distributed as the Xk and therefore ∑_{k=1}^∞ 2^{−k} Xk has the same distribution as ∑_{k=1}^∞ 2^{−k} Yk, which means that U and V have the same distribution. Hence U is uniformly distributed on [0, 1).
(ii) Take the functions bk and relabel them in a rectangular array as bkj, j, k = 1, 2, . . ., by using any bijective mapping from N onto N^2. Put fk(x) := ∑_{j=1}^∞ 2^{−j} bkj(x). The functions fk are Borel measurable. Since for fixed k the
bkj ◦ U are iid, we have by the first part of the lemma that Zk is uniform on
[0, 1). Moreover, for different k and k′ the sequences (bkj) and (bk′j) are disjoint and therefore Zk and Zk′ are independent. By extension of this argument the
whole sequence (Zk ) becomes independent (think about this!).
Proof Let (Ω, F, P) = ([0, 1), B[0, 1), λ). Choose for each k a random variable
Xk according to Theorem 3.10. Then certainly, the Xk have law µk . Let U be
the identity mapping on [0, 1), then U is uniformly distributed on [0, 1). Choose
the Zk as in part (ii) of Lemma 3.15 and define Yk = Xk ◦ Zk , k ≥ 1. These are
easily seen to have the desired properties.
3.4 Exercises
3.1 If h1 and h2 are Σ-measurable functions on (S, Σ, µ), then h1 h2 is Σ-
measurable too. Show this.
3.2 Let X be a random variable. Show that Π(X) := {{X ≤ x} : x ∈ R} is a
π-system and that it generates σ(X).
3.3 Let {Yγ : γ ∈ C} be an arbitrary collection of random variables and {Xn :
n ∈ N} be a countable collection of random variables, all defined on the same
probability space.
(a) Show that σ{Yγ : γ ∈ C} = σ{Yγ−1 (B) : γ ∈ C, B ∈ B}.
(b) Let Xn = σ{X1, . . . , Xn} (n ∈ N) and A = ∪_{n=1}^∞ Xn. Show that A is an algebra and that σ(A) = σ{Xn : n ∈ N}.
3.4 Prove Proposition 3.8.
3.5 Let F be a σ-algebra on Ω with the property that for all F ∈ F it holds
that P(F ) ∈ {0, 1}. Let X : Ω → R be F-measurable. Show that for some c ∈ R
one has P(X = c) = 1. (Hint: P(X ≤ x) ∈ {0, 1} for all x.)
4 Integration
In elementary courses on Probability Theory, there is usually a distinction be-
tween random variables X having a discrete distribution, on N say, and those
having a density. In the former case we have for the expectation E X the expression ∑_k k P(X = k), whereas in the latter case one has E X = ∫ x f(x) dx.
This distinction is annoying and not satisfactory from a mathematical point of
view. Moreover, there exist random variables whose distributions are neither
discrete, nor do they admit a density. Here is an example. Suppose Y and Z,
defined on the same (Ω, F, P), are independent random variables. Assume that
P(Y = 0) = P(Y = 1) = 1/2 and that Z has a standard normal distribution. Let X = Y Z and F the distribution function of X. Easy computations (do them!) yield F(x) = 1/2 (1_{[0,∞)}(x) + Φ(x)). We see that F has a jump at x = 0 and is
differentiable on R \ {0}, a distribution function of mixed type. How to compute
E X in this case?
In this section we will see that expectations are special cases of the unifying
concept of Lebesgue integral, a sophisticated way of addition. Lebesgue integrals
have many advantages. It turns out that Riemann integrable functions (on
a compact interval) are always Lebesgue integrable w.r.t. Lebesgue measure
and that the two integrals are the same. Also sums are examples of Lebesgue integrals. Furthermore, the theory of Lebesgue integrals allows for very powerful
limit theorems. Below we work with a measurable space (S, Σ, µ).
when f has representation (4.1).
Other notations that we often use for this integral are ∫ f(s) µ(ds) and µ(f).
Note that if f = 1A , then µ(f ) = µ(1A ) = µ(A), so there is a bit of ambiguity in
the notation, but also a reasonable level of consistency. Note that µ(f ) ∈ [0, ∞]
and also that the above summation is well defined, since all quantities involved
are nonnegative, although possibly infinite. For products ab for a, b ∈ [0, ∞], we
use the convention ab = 0, when a = 0.
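To get a feeling for the definition, here are two small computations. On (R, B, λ) take f = 2 · 1_{[0,1]} + 5 · 1_{(1,3]}; then µ(f) = 2λ([0, 1]) + 5λ((1, 3]) = 2 · 1 + 5 · 2 = 12. On N with the counting measure τ and f = ∑_{i=1}^n ai 1_{{i}}, one gets τ(f) = ∑_{i=1}^n ai, so finite sums are integrals of simple functions.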
It should be clear that this definition of integral is, at first sight, troublesome.
The representation of a simple function is not unique, and one might wonder if
the just defined integral takes on different values for different representations.
This would be very bad, and fortunately it is not the case.
Proposition 4.3 Let f be a nonnegative simple function. Then the value of
the integral µ(f ) is independent of the chosen representation.
Proof Step 1. Let f be given by (4.1) and define φ : S → {0, 1}n by φ(s) =
(1A1 (s), . . . , 1An (s)). Let {0, 1}n = {u1 , . . . , um } where m = 2n and put Uk =
φ−1 (uk ). Then the collection {U1 , . . . , Um } is a measurable partition of S (the
sets Uk are measurable). We will also need the sets Si = {k : Uk ⊂ Ai } and
Tk = {i : Uk ⊂ Ai }. Note that these sets are dual in the sense that k ∈ Si iff
i ∈ Tk .
Below we will use the fact Ai = ∪k∈Si Uk , when we rewrite (4.1). We obtain
by interchanging the summation order
f = ∑_i ai 1_{Ai} = ∑_i ai (∑_{k∈Si} 1_{Uk})
 = ∑_k (∑_{i∈Tk} ai) 1_{Uk}. (4.3)
Ai ∩ Bj ≠ ∅ ⇒ ai = bj. We compute the integral of f according to the definition. Of course, this yields (4.2) by using the representation (4.1) of f, but ∑_j bj µ(Bj) if we use (4.4). Rewrite

∑_j bj µ(Bj) = ∑_j bj µ(∪_i (Ai ∩ Bj)) = ∑_j ∑_i bj µ(Ai ∩ Bj)
 = ∑_i ∑_j bj µ(Ai ∩ Bj) = ∑_i ∑_j ai µ(Ai ∩ Bj)
 = ∑_i ai ∑_j µ(Ai ∩ Bj) = ∑_i ai µ(∪_j (Ai ∩ Bj))
 = ∑_i ai µ(Ai),
which shows that the two formulas for the integral are the same.
Step 3: Take now two arbitrary representations of f of the form (4.1)
and (4.4). According to step 1, we can replace each of them with a repre-
sentation in terms of a measurable partition, without changing the value of the
integral. According to step 2, each of the representations in terms of the par-
titions also gives the same value of the integral. This proves the proposition.
Corollary 4.4 Let f ∈ S+ and suppose that f assumes the different values 0 ≤ a1, . . . , an < ∞. Then µ(f) = ∑_{i=1}^n ai µ({f = ai}). If f is identically zero, then µ(f) = 0.

Proof We have the representation f = ∑_{i=1}^n ai 1_{{f=ai}}. The expression for µ(f) follows from Definition 4.2, which is unambiguous by Proposition 4.3. The result for the zero function follows by the representation f = 0 × 1S and the convention ab = 0 for a = 0 and b ∈ [0, ∞].
Example 4.6 Let (S, Σ, µ) = ([0, 1], B([0, 1]), λ) and f the indicator of the
rational numbers in [0, 1], f = 1Q∩[0,1] . We know that λ(Q ∩ [0, 1]) = 0 and it
follows that λ(f ) = 0. This f is a nice example of a function that is not Riemann
integrable, whereas its Lebesgue integral trivially exists and has a very sensible
value.
We say that a property of elements of S holds almost everywhere (usually abbre-
viated by a.e. or by µ-a.e.), if the set for which this property does not hold, has
measure zero. For instance, we say that two measurable functions are almost
everywhere equal, if µ({f ≠ g}) = 0. Elementary properties of the integral are
listed below.
Proposition 4.7 Let f, g ∈ S+ and c ∈ [0, ∞).
(i) If f ≤ g a.e., then µ(f ) ≤ µ(g).
(ii) If f = g a.e., then µ(f ) = µ(g).
(iii) µ(f + g) = µ(f ) + µ(g) and µ(cf ) = cµ(f ).
Proof (i) Represent f and g by means of measurable partitions, f = ∑_i ai 1_{Ai} and g = ∑_j bj 1_{Bj}. We have {f > g} = ∪_{i,j: ai>bj} Ai ∩ Bj, and since µ({f > g}) = 0, we have that µ(Ai ∩ Bj) = 0 if ai > bj. It follows that for all i and j, the inequality ai µ(Ai ∩ Bj) ≤ bj µ(Ai ∩ Bj) holds. We use this in the computations below.

µ(f) = ∑_i ai µ(Ai)
 = ∑_i ∑_j ai µ(Ai ∩ Bj)
 ≤ ∑_i ∑_j bj µ(Ai ∩ Bj)
 = ∑_j bj µ(Bj).
Assertion (ii) follows by a double application of (i), whereas (iii) can also be
proved by using partitions and intersections Ai ∩ Bj .
Proposition 4.9 Let f, g ∈ Σ+ . If f = 0 a.e., then µ(f ) = 0. If f ≤ g a.e.,
then µ(f ) ≤ µ(g), and if f = g a.e., then µ(f ) = µ(g).
h1N c ≤ f 1N c ≤ g1N c ≤ g.
Proof Because µ(f ) = 0, it holds that µ(h) = 0 for all nonnegative simple
functions with h ≤ f. Take hn = (1/n) 1_{{f ≥ 1/n}}, then hn ∈ S+ and hn ≤ f.
The equality µ(hn ) = 0 implies µ({f ≥ 1/n}) = 0. The result follows from
{f > 0} = ∪n {f ≥ 1/n} and Corollary 1.8.
We now present the first important limit theorem, the Monotone Convergence
Theorem.
Theorem 4.12 Let (fn ) be a sequence in Σ+ , such that fn+1 ≥ fn a.e. for
each n. Let f = lim sup fn . Then µ(fn ) ↑ µ(f ) ≤ ∞.
Proof We first consider the case where fn+1 (s) ≥ fn (s) for all s ∈ S, so (fn ) is
increasing everywhere. Then f (s) = lim fn (s) for all s ∈ S, possibly with value
infinity. It follows from Proposition 4.9, that µ(fn ) is an increasing sequence,
bounded by µ(f ). Hence we have ` := lim µ(fn ) ≤ µ(f ).
We show that we actually have an equality. Take h ∈ S+ with h ≤ f ,
c ∈ (0, 1) and put En = {fn ≥ ch}. The sequence (En ) is obviously increasing
and we show that its limit is S. Let s ∈ S and suppose that f (s) = 0. Then
also h(s) = 0 and s ∈ En for every n. If f (s) > 0, then eventually fn (s) ≥
cf (s) ≥ ch(s), and so s ∈ En . This shows that ∪n En = S. Consider the chain
of inequalities
Suppose that h has representation (4.1). Then µ(h 1_{En}) = ∑_i ai µ(Ai ∩ En). This is a finite sum of nonnegative numbers and hence the limit of it for n → ∞ can be taken inside the sum and thus equals µ(h), since En ↑ S and by the continuity of the measure (Proposition 1.7). From (4.5) we then conclude ` ≥ cµ(h), for
all c ∈ (0, 1), and thus ` ≥ µ(h). Since this holds for all our h, we get ` ≥ µ(f )
by taking the supremum over h. This proves the first case.
Next we turn to the almost everywhere version. Let Nn = {fn > fn+1 },
by assumption µ(Nn ) = 0. Put N = ∪n Nn , then also µ(N ) = 0. It follows
that µ(fn ) = µ(fn 1N c ). But on N c we have that f = f 1N c and similarly
µ(f ) = µ(f 1N c ). The previous case can be applied to get µ(fn 1N c ) ↑ µ(f 1N c ),
from which the result follows.
Example 4.13 Here is a nice application of Theorem 4.12. Let f ∈ Σ+ and,
for each n ∈ N, put En,i = {i2−n ≤ f < (i+1)2−n } (i ∈ In := {0, . . . , n2n −1}),
similar to the sets in the proof of Theorem 3.6. Put also En = {f ≥ n}. Note
that the sets En,i and En are in Σ. Define
fn = ∑_{i∈In} i2^{−n} 1_{En,i} + n 1_{En}.
Definition 4.17 Let f ∈ Σ and assume that µ(f + ) < ∞ or µ(f − ) < ∞. Then
we define µ(f ) := µ(f + ) − µ(f − ). If both µ(f + ) < ∞ and µ(f − ) < ∞, we
say that f is integrable. The collection of all integrable functions is denoted by
L1 (S, Σ, µ). Note that f ∈ L1 (S, Σ, µ) implies that |f | < ∞ µ-a.e.
Proposition 4.18 The following natural properties hold.
(i) Let f, g ∈ L1 (S, Σ, µ) and α, β ∈ R. Then αf + βg ∈ L1 (S, Σ, µ) and
µ(αf + βg) = αµ(f ) + βµ(g). Hence µ can be seen as a linear operator
on L1 (S, Σ, µ).
(ii) If f, g ∈ L1 (S, Σ, µ) and f ≤ g a.e., then µ(f ) ≤ µ(g).
(iii) Triangle inequality: If f ∈ L1 (S, Σ, µ), then |µ(f )| ≤ µ(|f |).
Proof Exercise 4.3.
The next theorem is known as the Dominated Convergence Theorem, also called
Lebesgue’s Convergence Theorem.
Theorem 4.19 Let (fn ) ⊂ Σ and f ∈ Σ. Assume that fn (s) → f (s) for all s
outside a set of measure zero. Assume also that there exists a function g ∈ Σ+
such that supn |fn | ≤ g a.e. and that µ(g) < ∞. Then µ(|fn − f |) → 0, and
hence µ(fn ) → µ(f ).
Proof The second assertion easily follows from the first one, which we prove
now for the case that fn → f everywhere. One has the inequality |f | ≤ g,
whence |fn − f | ≤ 2g. The second assertion of Fatou’s lemma immediately
yields lim sup µ(|fn − f |) ≤ 0, which is what we wanted. The almost everywhere
version is left as Exercise 4.4.
The convergence µ(|fn − f|) → 0 is often denoted by fn →^{L1} f. The following
result is known as Scheffé’s lemma.
Lemma 4.20 Let (fn ) ⊂ Σ+ and assume that fn → f a.e. Assume that µ(fn )
is finite for all n and µ(f ) < ∞ as well. Then µ(|fn − f |) → 0 iff µ(fn ) → µ(f ).
Proof The ‘only if’ part follows from Theorem 4.19. Assume then that µ(fn ) →
µ(f ). We have the elementary equality |fn − f | = (fn − f ) + 2(fn − f )− , and
hence µ(|fn − f |) = (µ(fn ) − µ(f )) + 2µ((fn − f )− ). The first term on the right
hand side of the last expression tends to zero by assumption. The second one
we treat as follows. Since f − fn ≤ f and f ≥ 0, it follows that (fn − f )− ≤ f .
Hence µ((fn − f )− ) → 0, by virtue of Theorem 4.19.
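The assumption µ(fn) → µ(f) cannot be dropped. On ([0, 1], B([0, 1]), λ) take fn = n 1_{(0,1/n)}. Then fn → 0 everywhere and λ(fn) = 1 for all n, so λ(|fn − 0|) = 1 does not tend to zero. The same sequence shows that the domination hypothesis in Theorem 4.19 is essential: no integrable g can dominate all fn.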
Example 4.21 Let (S, Σ, µ) = ([0, 1], B([0, 1]), λ), where λ is Lebesgue mea-
sure. Assume that f ∈ C[0, 1]. Exercise 4.6 yields that f ∈ L1 ([0, 1], B([0, 1]), λ)
and that λ(f) is equal to the Riemann integral ∫_0^1 f(x) dx. This implication fails
to hold if we replace [0, 1] with an unbounded interval, see Exercise 4.7.
On the other hand, one can even show that every function that is Riemann
integrable over [0, 1], not only a continuous function, is Lebesgue integrable too.
Knowledge of Chapter 2 is required for a precise statement and its proof, see
Exercise 4.14.
Many results in integration theory can be proved by what is sometimes called
the standard machine. This ‘machine’ works along the following steps. First
one shows that results hold true for an indicator function, then one extends
this by a linearity argument to nonnegative simple functions. Invoking the
Monotone Convergence Theorem, one can then prove the results for nonnegative
measurable functions. In the final step one shows the result to be true for
functions in L1 (S, Σ, µ) by splitting into positive and negative parts.
One verifies (Exercise 4.9) that ν is a measure on (S, Σ). We want to compute
ν(h) for h ∈ Σ+ . For measurable indicator functions we have by definition
that the integral ν(1E ) equals ν(E), which is equal to µ(1E f ) by (4.7). More
generally we have
Proposition 4.23 Let f ∈ Σ+ and h ∈ Σ. Then h ∈ L1 (S, Σ, ν) iff hf ∈
L1 (S, Σ, µ), in which case one has ν(h) = µ(hf ).
Example 4.24 Let (S, Σ, µ) = (R, B, λ), f ≥ 0, Borel measurable, ν(E) = ∫ 1_E f dλ and hf ∈ L1(R, B, λ). Then

ν(h) = ∫_{−∞}^{∞} h(x)f(x) dx,
where the equality is valid under conditions as for instance in Example 4.21 or,
in general, as in Exercise 4.14.
Remark 4.25 If f is continuous, see Example 4.21, then x ↦ F(x) = ∫_{[0,x]} f dλ defines a differentiable function on (0, 1), with F′(x) = f(x). This follows from the theory of Riemann integrals. We adopt the conventional notation F(x) = ∫_0^x f(u) du. This case can be generalized as follows. If f ∈ L1(R, B, λ), then (using a similar notational convention) x ↦ F(x) = ∫_{−∞}^x f(u) du is well
defined for all x ∈ R. Moreover, F is at (Lebesgue) almost all points x of
R differentiable with derivative F 0 (x) = f (x). The proof of this result, the
fundamental theorem of calculus for the Lebesgue integral, will be given in
Section 6.7.
provided that the integral is well defined, which is certainly the case if P(|X|) <
∞. Other often used notations for this integral are PX and E X. The latter
is the favorite one among probabilists and one speaks of the Expectation of
X. Note also that E X is always defined when X ≥ 0 almost surely, the latter concept meaning almost everywhere w.r.t. the probability measure P. We
abbreviate almost surely by a.s.
If h : R → R is Borel measurable, then Y := h ◦ X (we also write Y = h(X))
is a random variable as well. We give two recipes to compute E Y . One is of
course the direct application of the definition of expectation to Y . But we also
have
Proposition 4.27 Let X be a random variable, and h : R → R Borel mea-
surable. Let PX be the distribution of X. Then h ◦ X ∈ L1 (Ω, F, P) iff
h ∈ L1 (R, B, PX ), in which case
E h(X) = ∫_R h dPX. (4.9)

It follows from this proposition that one can also compute E Y as E Y = ∫_R y PY(dy).
Example 4.28 Suppose there exists f ≥ 0, Borel-measurable such that for all
B ∈ B one has PX (B) = λ(1B f ), in which case it is said that X has a density
f . Then, provided that the expectation is well defined, Example 4.24 yields
E h(X) = ∫_R h(x)f(x) dx.
The next two propositions have proven to be very useful in proofs of results
in Probability Theory. We use the fact that P is a probability measure in an
essential way.
Proof This follows from the inequality g(X)1{X≥c} ≥ g(c)1{X≥c} .
The inequality in Proposition 4.30 is known as Markov’s inequality. An example
is obtained by taking g(x) = x+ and by replacing X with |X|. One gets E |X| ≥
cP(|X| ≥ c). For the special case where g(x) = (x+ )2 , it is known as Chebychev’s
inequality. This name is especially used, if we apply it with |X − E X| instead
of X. For c ≥ 0 we then obtain Var X ≥ c2 P(|X − E X| ≥ c).
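Two quick numerical illustrations: if X ≥ 0 a.s. and E X = 1, then Markov's inequality gives P(X ≥ 10) ≤ 1/10, whatever the distribution of X; if E X = 0 and Var X = 1, then Chebychev's inequality yields P(|X| ≥ 3) ≤ 1/9.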
We now turn to a result that is known as Jensen’s inequality, Proposition 4.31
below. Recall that a function g : G → R is convex, if G is a convex set and if
for all x, y ∈ G and α ∈ [0, 1] one has
g(αx + (1 − α)y) ≤ αg(x) + (1 − α)g(y).
We consider only the case where G is an interval. Let us first give some proper-
ties of convex functions. Let x, y, z ∈ G and x < y < z. Then y = αx + (1 − α)z, with α = (z − y)/(z − x), and from convexity of g we get

(g(y) − g(x))(z − x) ≤ (g(z) − g(x))(y − x).
By taking the appropriate limits, one obtains that g is continuous on Int G, the
interior of G. Take x ∈ Int G and rewrite the above inequality as
(g(y) − g(x))/(y − x) ≤ (g(z) − g(x))/(z − x). (4.10)

It follows that the right derivative D+g(x) := lim_{y↓x} (g(y) − g(x))/(y − x) exists and is finite. In a similar way one can show that the left derivative D−g(x) := lim_{y↑x} (g(y) − g(x))/(y − x) exists and is finite. Moreover, one has D+g(x) ≥ D−g(x) for all x ∈ Int G and
both onesided derivatives are increasing. If one takes in (4.10) the limit when
y ↓ x, one gets the inequality (z − x)D+ g(x) ≤ g(z) − g(x), valid for z > x.
Likewise one has the inequality (x − z)D− g(x) ≥ g(x) − g(z), valid for z < x.
It follows that for any d(x) ∈ [D− g(x), D+ g(x)], it holds that
g(z) − g(x) ≥ d(x)(z − x). (4.11)
The d(x) are also called subgradients of g. The following proposition (Jensen’s
inequality) is now easy to prove.
Proposition 4.31 Let g : G → R be convex and X a random variable with
P(X ∈ G) = 1. Assume that E |X| < ∞ and E |g(X)| < ∞. Then
E g(X) ≥ g(E X).
Proof We exclude the trivial case P(X = x0 ) = 1 for some x0 ∈ G. Since
P(X ∈ G) = 1, we have E X ∈ Int G (Exercise 4.18) and (4.11) with x = E X
and z replaced with X holds a.s. So, in view of (4.11),
g(X) − g(E X) ≥ d(E X)(X − E X).
Take expectations to get E g(X) − g(E X) ≥ 0.
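Two standard instances: with g(x) = x² one obtains E X² ≥ (E X)², i.e. Var X ≥ 0 (for X ∈ L²), and with g(x) = e^x one gets E e^X ≥ e^{E X}, provided the relevant expectations are finite.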
4.5 Functions of bounded variation and Stieltjes integrals
In this section we define functions of bounded variation and review some ba-
sic properties. Stieltjes integrals will be discussed subsequently. We consider
functions defined on an interval [a, b]. Next to these we consider partitions Π
of [a, b], finite subsets {t0 , . . . , tn } of [a, b] with the convention t0 ≤ · · · ≤ tn ,
and µ(Π) denotes the mesh of Π. Extended partitions, denoted Π∗ , are parti-
tions Π, together with additional points τi , with ti−1 ≤ τi ≤ ti . By definition
µ(Π∗) = µ(Π). Along with a function α and a partition Π we define

V1(α; Π) := ∑_{i=1}^n |α(ti) − α(ti−1)|,

where the τi satisfy ti−1 ≤ τi ≤ ti and α′(τi) = (α(ti) − α(ti−1))/(ti − ti−1).
Proof Define

vα+(t) = (vα(t) + α(t) − α(a))/2,
vα−(t) = (vα(t) − α(t) + α(a))/2.
We only have to check that these functions are increasing, since the other statements are obvious. Let t′ > t. Then vα+(t′) − vα+(t) = (vα(t′) − vα(t) + α(t′) − α(t))/2. The difference vα(t′) − vα(t) is the variation of α over the interval [t, t′], which is greater than or equal to |α(t′) − α(t)|. Hence vα+(t′) − vα+(t) ≥ 0, and the same holds for vα−(t′) − vα−(t).
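For a non-decreasing function α one has vα(t) = α(t) − α(a), so vα+ = vα and vα− = 0. For α(t) = cos t on [0, 2π] one finds vα(2π) = 4 and consequently vα+(2π) = vα−(2π) = 2.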
We say that S(f, α) = lim_{µ(Π∗)→0} S(f, α; Π∗), if for all ε > 0, there exists δ > 0 such that µ(Π∗) < δ implies |S(f, α) − S(f, α; Π∗)| < ε. If this happens, we say that f is integrable w.r.t. α and we commonly write ∫ f dα for S(f, α), and call it the Stieltjes integral of f w.r.t. α.
It follows from

|S(f, α; Π∗1) − S(f, α; Π∗2)| ≤ |S(f, α; Π∗1) − S(f, α; Π∗)| + |S(f, α; Π∗) − S(f, α; Π∗2)|

that it suffices to show that |S(f, α; Π∗1) − S(f, α; Π∗2)| can be made small for Π2 a refinement of Π1. Let ε > 0 and choose δ > 0 such that |f(t) − f(s)| < ε whenever |t − s| < δ (possible by uniform continuity of f). Assume that µ(Π1) < δ, then also µ(Π2) < δ. Consider first two extended partitions of Π1, Π∗1 and Π′1, the latter with intermediate points τ′i. Then

|S(f, α; Π∗1) − S(f, α; Π′1)| ≤ ∑_i |f(τi) − f(τ′i)| |α(ti) − α(ti−1)|
 ≤ ε V1(α; Π1) ≤ ε V1(α).
In the next step we assume that Π2 is obtained from Π1 by adding one point,
namely τj for some τj from the extended partition Π∗1 . Further we assume that
Π∗2 contains all the intermediate points τi from Π∗1 , whereas we also take the
intermediate points from the intervals [tj−1 , τj ] and [τj , tj ] both equal to τj . It
follows that S(f, α; Π∗1 ) = S(f, α; Π∗2 ). A combination of the two steps finishes
the proof. The triangle inequality for the integrals holds almost trivially.
Proposition 4.39 Let f, α : [a, b] → R, f continuous and α of bounded variation. Then the Lebesgue integral ∫ f dµα and the Stieltjes integral ∫ f dα are equal, ∫ f dµα = ∫ f dα.
The notation || · || suggests that we deal with a norm. In a sense, this is correct, but we will not prove this until the end of this section. It is however obvious that Lp := Lp(Ω, F, P) is a vector space, since |X + Y|^p ≤ (|X| + |Y|)^p ≤ 2^p(|X|^p + |Y|^p).
In the special case p = 2, we have for X, Y ∈ L2 that |XY| = ((|X| + |Y|)² − X² − Y²)/2 has finite expectation and is thus in L1. Of course we have |E(XY)| ≤ E |XY| < ∞.
Proof The standard machine easily gives E (1A Y ) = P(A) · E Y for A an event
independent of Y . Assume that X ∈ S+ . Since X is integrable we can assume
that it is finite, and thus bounded by a constant c. Since then |XY| ≤ c|Y|, we obtain E |XY| < ∞. If we represent X as ∑_{i=1}^n ai 1_{Ai}, then E(XY) = ∑_{i=1}^n ai P(Ai) E Y readily follows and thus E(XY) = E X · E Y. The proof may be finished by letting the standard machine operate on X.
We continue with some properties of Lp -spaces. First we have monotonicity of
norms.
Proof It follows from the trivial inequality |u| ≤ 1+|u|a , valid for u ∈ R and a ≥
1, that |X|p ≤ 1+|X|r , by taking a = r/p, and hence X ∈ Lp (Ω, F, P). Observe
that x → |x|a is convex. We apply Jensen’s inequality to get (E |X|p )a ≤
E (|X|pa ), from which the result follows.
with the convention inf ∅ = ∞. It is clear that |f | ≤ ||f ||∞ a.e. We write
f ∈ L∞ (S, Σ, µ) if ||f ||∞ < ∞.
Here is the first of two fundamental inequalities, known as Hölder’s inequality.
P(E) = µ(1_E f^p)/µ(f^p).
Put h(s) = g(s)/f(s)^{p−1} if f(s) > 0 and h(s) = 0 otherwise. Jensen's inequality gives (P(h))^q ≤ P(h^q). We compute

P(h) = µ(fg)/µ(f^p),

and

P(h^q) = µ(1_{{f>0}} g^q)/µ(f^p) ≤ µ(g^q)/µ(f^p).

Insertion of these expressions into the above version of Jensen's inequality yields

(µ(fg))^q/(µ(f^p))^q ≤ µ(g^q)/µ(f^p),
whence (µ(f g))q ≤ µ(g q )µ(f p )q−1 . Take q-th roots on both sides and the result
follows.
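The special case p = q = 2 is the Cauchy-Schwarz inequality µ(fg) ≤ µ(f²)^{1/2} µ(g²)^{1/2}, which for random variables X, Y ∈ L² takes the familiar form E |XY| ≤ (E X²)^{1/2}(E Y²)^{1/2}.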
Theorem 4.47 Let f, g ∈ Lp (S, Σ, µ) and p ∈ [1, ∞]. Then ||f + g||p ≤ ||f ||p +
||g||p .
because (p − 1)q = p. After dividing by (µ(|f + g|^p))^{1/q}, we obtain the result, because 1 − 1/q = 1/p.
Recall the definition of a norm on a (real) vector space X. One should have
||x|| = 0 iff x = 0, ||αx|| = |α| ||x|| for α ∈ R (homogeneity) and ||x + y|| ≤
||x|| + ||y|| (triangle inequality). For || · ||p homogeneity is obvious, the triangle
inequality has just been proved under the name Minkowski’s inequality and
we also trivially have f = 0 ⇒ ||f||p = 0. But, conversely, ||f||p = 0 only implies f = 0 a.e. This annoying fact prevents || · ||p from being a genuine norm. This problem can be circumvented by identifying a function f that is
zero a.e. with the zero function. The proper mathematical way of doing this is
by defining the equivalence relation f ∼ g iff µ({f ≠ g}) = 0. By considering
the equivalence classes induced by this equivalence relation one gets the quotient
space Lp (S, Σ, µ) := Lp (S, Σ, µ)/ ∼. One can show that || · ||p induces a norm
on this space in the obvious way. We don’t care too much about these details
and just call || · ||p a norm and Lp (S, Σ, µ) a normed space, thereby violating a
bit the standard mathematical language.
A desirable property of a normed space, (a version of) completeness, holds for
Lp spaces. We give the proof for Lp (Ω, F, P).
Theorem 4.48 Let p ∈ [1, ∞]. The space Lp (Ω, F, P) is complete in the fol-
lowing sense. Let (Xn ) be a Cauchy-sequence in Lp : ||Xn − Xm ||p → 0 for
n, m → ∞. Then there exists a limit X ∈ Lp such that ||Xn − X||p → 0. The
limit is unique in the sense that any other limit X 0 satisfies ||X − X 0 ||p = 0.
Proof Assume that p ∈ [1, ∞). Since (Xn ) is Cauchy, for all n ∈ N there exists
kn ∈ N such that ||Xl − Xm||p ≤ 2^{−n} if l, m ≥ kn. By monotonicity of the p-norms, we then have ||X_{k_{n+1}} − X_{k_n}||1 ≤ ||X_{k_{n+1}} − X_{k_n}||p ≤ 2^{−n}. It follows (see Example 4.29) that E ∑_n |X_{k_{n+1}} − X_{k_n}| = ∑_n E |X_{k_{n+1}} − X_{k_n}| < ∞. But then ∑_n |X_{k_{n+1}} − X_{k_n}| < ∞ a.s., which implies that ∑_n (X_{k_{n+1}} − X_{k_n}) converges a.s. This is
a telescopic sum, so we obtain that lim Xkn exists and is finite a.s. To have a
proper random variable, we define X := lim sup Xkn and we have Xkn → X a.s.
We have to show that X is also a limit in Lp . Recall the definition of the kn .
Take m ≥ n and l ≥ kn . Then ||Xl − Xkm ||p ≤ 2−n , or E |Xl − Xkm |p ≤ 2−np .
We use Fatou’s lemma for m → ∞ and get
E |Xl − X|p = E lim inf |Xl − Xkm |p ≤ lim inf E |Xl − Xkm |p ≤ 2−np .
Remark 4.49 Notice that it follows from Theorem 4.48 and the discussion
preceding it, that Lp (Ω, F, P) is a truly complete normed space, a Banach space.
The same is true for Lp (S, Σ, µ) (p ∈ [0, ∞]), for which you need Exercise 4.13.
For the special case p = 2 we obtain that L2(S, Σ, µ) is a Hilbert space with the inner product ⟨f, g⟩ := ∫ f g dµ. Likewise L2(Ω, F, P) is a Hilbert space with inner product ⟨X, Y⟩ := E XY.
Let S be a vector space with inner product ⟨·, ·⟩ and norm || · || defined by ||x|| = ⟨x, x⟩^{1/2}. Recall that an orthogonal projection on a subspace L is a linear mapping π : S → L with the property that πx = arg inf{||x − y|| : y ∈ L}.
The argument of the infimum exists if L is complete, which is partly contained
in the following theorem.
(ii) ||x − y||² = ||x − x̂||² + ||x̂ − y||², ∀y ∈ L.
(iii) x − x̂ is orthogonal to L, i.e. ⟨x − x̂, y⟩ = 0, ∀y ∈ L.
Proof Let α be the infimum in (i). For every n ∈ N there exists xn ∈ L such
that ||x − xn||² < α² + 1/n. Add up the two equalities

||x − xm + ½(xm − xn)||² = ||x − xm||² + ||½(xm − xn)||² + 2⟨x − xm, ½(xm − xn)⟩

and

||x − xn − ½(xm − xn)||² = ||x − xn||² + ||½(xm − xn)||² − 2⟨x − xn, ½(xm − xn)⟩

to get

2||x − ½(xm + xn)||² = ||x − xm||² + ||x − xn||² − ½||xm − xn||². (4.12)
Since the left hand side of (4.12) is at least equal to 2α², one obtains by definition of xn and xm that ½||xm − xn||² ≤ 1/m + 1/n. Therefore (xn) is a Cauchy sequence
in L which by completeness has a limit, we call it x̂. We now show that x̂ attains
the infimum. Since ||x − x̂|| ≤ ||x − xn || + ||xn − x̂||, we get for n → ∞ that
||x − x̂|| ≤ α and hence we must have equality.
Suppose that there are two x̂1, x̂2 ∈ L that attain the infimum. Let x̂ = ½(x̂1 + x̂2). Then ||x − x̂|| ≤ ½(||x − x̂1|| + ||x − x̂2||) = α, so also x̂ attains the infimum. Replace in (4.12) xm with x̂1 and xn with x̂2 to conclude that ||x̂1 − x̂2|| = 0. Hence x̂1 = x̂2.
We now show that the three characterizations of x̂ are equivalent. First we
prove (i) ⇒ (iii). Consider the quadratic function f(t) = ||x − x̂ + ty||², t ∈ R, where y ∈ L is arbitrary. The function f has a minimum at t = 0. Computing f(t) explicitly gives f(t) = ||x − x̂||² + 2t⟨x − x̂, y⟩ + t²||y||². A minimum at t = 0 can only happen if the coefficient of t vanishes, so ⟨x − x̂, y⟩ = 0.
The implication (iii) ⇒ (ii) follows from ||x − y||² = ||x − x̂||² + 2⟨x − x̂, x̂ − y⟩ + ||y − x̂||², since the cross term vanishes by assumption.
The implication (ii) ⇒ (i) is obvious.
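Numerically, the characterizations above are easy to check in a finite-dimensional inner product space (where every subspace is complete). The following minimal Python sketch, with an arbitrarily chosen two-dimensional subspace L of R5 and arbitrary vectors, computes the least-squares projection x̂ and verifies properties (ii) and (iii).

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 2))        # L = column span of A, a (complete) subspace of R^5
x = rng.normal(size=5)

# xhat minimizes ||x - y|| over y in L: a least-squares problem
coef, *_ = np.linalg.lstsq(A, x, rcond=None)
xhat = A @ coef

# (iii): x - xhat is orthogonal to L (to both spanning columns of A)
print(np.allclose(A.T @ (x - xhat), 0.0))
# (ii): ||x - y||^2 = ||x - xhat||^2 + ||xhat - y||^2 for an arbitrary y in L
y = A @ rng.normal(size=2)
print(np.isclose(np.sum((x - y) ** 2),
                 np.sum((x - xhat) ** 2) + np.sum((xhat - y) ** 2)))

Both checks print True, up to floating point tolerance.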
Proof The space L2 (Ω, F, P) is complete in the sense of Theorem 4.48 and
L2 (Ω, G, P) is a closed subspace. The proof of this theorem then becomes just a
copy of the proof of Theorem 4.50, the only difference is that for two minimizers
one only gets ||X̂1 − X̂2 ||2 = 0, which is equivalent to X̂1 = X̂2 a.s.
4.8 Exercises
4.1 Let (x1 , x2 , . . .) be a sequence of nonnegative real numbers, let ℓ : N → N
be a bijection and define the sequence (y1 , y2 , . . .) by yk = xℓ(k) . Let for each
n the n-vector y n be given by y n = (y1 , . . . , yn ). Consider then for each n
a sequence of numbers xn defined by xnk = xk if xk is a coordinate of y n ;
otherwise put xnk = 0. Show that xnk ↑ xk for every k as n → ∞. Show that
∑k≥1 yk = ∑k≥1 xk .
(a) Exploit Riemann integrability to show that there exists a decreasing sequence
of simple functions (un ) in S+ and an increasing sequence (ℓn ) in
S+ such that ℓn ≤ f ≤ un , for all n, and λ(ℓn ) ↑ I, λ(un ) ↓ I.
(b) Let u = lim un and ℓ = lim ℓn and put f̂ = 1{u=ℓ} u. Show that λ({u ≠ ℓ}) = 0
and {f ≠ f̂ } ⊂ {u ≠ ℓ}.
(c) Conclude that f ∈ Σλ and that µ(f ) = I.
4.15 This exercise concerns a more general version of Theorem 4.19. Let (fn ) ⊂
Σ and f = lim sup fn and assume that fn (s) → f (s) for all s outside a set of
measure zero. Assume also there exist functions g, gn ∈ Σ+ such that |fn | ≤ gn
a.e. with gn (s) → g(s) for all s outside a set of measure zero and that µ(gn ) →
µ(g) < ∞. Show that µ(|fn − f |) → 0.
4.16 Let (S, Σ, µ) be a measurable space, Σ0 a sub-σ-algebra of Σ and µ0 be the
restriction of µ to Σ0 . Then also (S, Σ0 , µ0 ) is a measurable space and integrals of
Σ0 -measurable functions can be defined according to the usual procedure. Show
that µ0 (f ) = µ(f ), if f ≥ 0 and Σ0 -measurable. Show also that L1 (S, Σ, µ)∩Σ0 =
L1 (S, Σ0 , µ0 ).
4.17 Prove Proposition 4.39.
5 Product measures
So far we have considered measure spaces (S, Σ, µ) and we have looked at integrals
of the type µ(f ) = ∫ f dµ. Here f is a function of ‘one’ variable (depending
on how you count and what the underlying set S is). Suppose that we have two
measure spaces (S1 , Σ1 , µ1 ) and (S2 , Σ2 , µ2 ) and a function f : S1 × S2 → R.
Is it possible to integrate such a function of two variables w.r.t. some measure?
That measure has to be defined on some σ-algebra on S1 × S2 . There is a natural way
of constructing this σ-algebra and a natural construction of a measure on this
σ-algebra. Here is a setup with some informal thoughts.
Take f : S1 × S2 → R and assume any good notion of measurability and
integrability. Then µ1 (f (·, s2 )) := ∫ f (·, s2 ) dµ1 defines a function of s2 and so
we’d like to take the integral w.r.t. µ2 . We could as well have gone the other way
round (integrate first w.r.t. µ2 ), and the questions are whether these integrals
are well defined and whether both approaches yield the same result.
Here is a simple special case, where the latter question has a negative an-
swer. We have seen that integration w.r.t. counting measure is nothing else but
addition. What we have outlined above is in this context just interchanging the
summation order. P So P if (an,m ) P
is a P
double array of real numbers, the above is
about whether n m an,m = m n an,m . This is obviously true if n and m
run through a finite set, but things can go wrong for indices from infinite sets.
Consider for example
1 if n = m + 1
an,m = −1 if m = n + 1
0 else.
P P
One
P P easily verifies m a1,m = −1, m an,m = P n ≥ 2 and hence we find
0, if P
n a
m n,m = −1. Similarly one shows that m n an,m = +1. In order
that interchanging of the summation order yields the P same
P result, additional
conditions have to be imposed. We will see that m n |an,m | < ∞ is a
sufficient condition. As a side remark we note that this case has everything
to do with a well known theorem by Riemann that says that a series of real
numbers is absolutely convergent iff it is unconditionally convergent.
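The asymmetry of the two iterated sums can also be checked numerically. Here is a small Python sketch (the truncation size N is an arbitrary illustrative choice); since every row and every column of the array has at most two nonzero entries, the inner sums below are already exact.

def a(n, m):
    # the double array from the text
    if n == m + 1:
        return 1
    if m == n + 1:
        return -1
    return 0

N = 50
M = N + 1     # make the inner index range long enough to catch both nonzero entries

# sum over m first, then over n
row_first = sum(sum(a(n, m) for m in range(1, M + 1)) for n in range(1, N + 1))
# sum over n first, then over m
col_first = sum(sum(a(n, m) for n in range(1, M + 1)) for m in range(1, N + 1))

print(row_first, col_first)   # prints: -1 1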
Next to the projections, we now consider embeddings. For fixed s1 ∈ S1 we
define es1 : S2 → S by es1 (s2 ) = (s1 , s2 ). Similarly we define es2 (s1 ) = (s1 , s2 ).
One easily checks that the embeddings es1 are Σ2 /Σ-measurable and that the
es2 are Σ1 /Σ-measurable (Exercise 5.1). As a consequence we have the following
proposition.
Proof This follows from the fact that a composition of measurable functions is
also measurable.
Remark 5.2 The converse statement of Proposition 5.1 is in general not true.
There are functions f : S → R that are not measurable w.r.t. the product σ-
algebra Σ, although the mappings s1 7→ f (s1 , s2 ) and s2 7→ f (s1 , s2 ) are Σ1 -,
respectively Σ2 -measurable. Counterexamples are not obvious, see below for a
specific one. Fortunately, there are also conditions that are sufficient to have
measurability of f w.r.t. Σ, when measurability of the marginal functions is
given. See Exercise 5.8.
Here is a sketch of a counterexample, based on the Continuum Hypothesis,
with (S1 , Σ1 , µ1 ) = (S2 , Σ2 , µ2 ) = ([0, 1], B([0, 1]), λ). Consider V := {0, 1}^N and
the set W of (countable) ordinal numbers smaller than ω1 , the first uncountable
ordinal number. The cardinality of W is equal to ℵ1 , whereas {0, 1}^N has
cardinality 2^ℵ0 . The Continuum Hypothesis states that there exists a bijective
mapping between V and W . Hence there also exists a bijective mapping φ
between the interval [0, 1] and W . The set W has the property that for every
x ∈ [0, 1] the element φ(x) of W has countably many predecessors. Consider
Q := {(x, y) ∈ [0, 1]2 : φ(x) < φ(y)}. For fixed y we let Qy = {x ∈ [0, 1] :
(x, y) ∈ Q} and for fixed x we let Qx = {y ∈ [0, 1] : (x, y) ∈ Q}. It follows
that for every y, the set Qy is countable and thus Borel-measurable and it
has Lebesgue measure zero. For every x, the complement of Qx is countable
and has Lebesgue measure zero, hence Qx has Lebesgue measure one. For
f = 1Q , we thus have that x 7→ f (x, y) and y 7→ f (x, y) are Borel-measurable
and that I1f (x) = 1 and I2f (y) = 0. We see that Lemma 5.3 doesn’t hold and
therefore conclude that f cannot be measurable w.r.t. the product σ-algebra
B([0, 1]) × B([0, 1]).
defined (why?). Let then
I1f (s1 ) = ∫ f (s1 , s2 ) µ2 (ds2 ),
I2f (s2 ) = ∫ f (s1 , s2 ) µ1 (ds1 ).
Proof We use the Monotone Class Theorem, Theorem 3.6, and so we have
to find a good vector space H. The obvious candidate is the collection of all
bounded Σ-measurable functions f that satisfy the assertions of the lemma.
First we notice that H is indeed a vector space, since sums of measurable
functions are measurable and by linearity of the integral. Obviously, the con-
stant functions belong to H. Then we have to show that if fn ∈ H, fn ≥ 0 and
fn ↑ f , where f is bounded, then also f ∈ H. Of course here the Monotone
Convergence Theorem comes into play. First we notice that measurability of
the Iif follows from measurability of the Iifn for all n. Theorem 4.12 yields
that the sequences Iifn (si ) are increasing and converging to Iif (si ). Another
application of this theorem yields that µ1 (I1fn ) converges to µ1 (I1f ) and that
µ2 (I2fn ) converges to µ2 (I2f ). Since µ1 (I1fn ) = µ2 (I2fn ) for all n, we conclude
that µ1 (I1f ) = µ2 (I2f ), whence f ∈ H.
Next we check that H contains the indicators of sets in R. A quick com-
putation shows that for f = 1E1 ×E2 one has I1f = 1E1 µ2 (E2 ), which is Σ1 -
measurable, I2f = 1E2 µ1 (E1 ), and µ1 (I1f ) = µ2 (I2f ) = µ1 (E1 )µ2 (E2 ). Hence
f ∈ H. By Theorem 3.6 we conclude that H coincides with the space of all
bounded Σ-measurable functions.
It follows from Lemma 5.3 that for all E ∈ Σ, the indicator function 1E sat-
isfies the assertions of the lemma. This shows that the following definition is
meaningful.
In Theorem 5.5 below (known as Fubini’s theorem) we assert that this defines
a measure and it also tells us how to compute integrals w.r.t. this measure in
terms of iterated integrals w.r.t. µ1 and µ2 .
Theorem 5.5 The mapping µ of Definition 5.4 has the following properties.
(i) It is a measure on (S, Σ). Moreover, it is the only measure on (S, Σ) with
the property that µ(E1 × E2 ) = µ1 (E1 )µ2 (E2 ). It is therefore called the
product measure of µ1 and µ2 and often written as µ1 × µ2 .
(ii) If f ∈ Σ+ , then µ(f ) = µ1 (I1f ) = µ2 (I2f ). (5.3)
(iii) If f ∈ L1 (S, Σ, µ), then Equation (5.3) is still valid and µ(f ) ∈ R.
Theorem 5.5 has been proved under the standing assumption that the initial
measures µ1 and µ2 are finite. The results extend to the case where both these
measures are σ-finite. The approach is as follows. Write S1 = ∪i≥1 S1i with the
S1i ∈ Σ1 and µ1 (S1i ) < ∞. Without loss of generality, we can take the S1i disjoint.
Take a similar partition (S2j ) of S2 . Then S = ∪i,j Sij , where the Sij := S1i × S2j ,
form a countable disjoint union as well. Let Σij = {E ∩ Sij : E ∈ Σ}. On each
measurable space (Sij , Σij ) the above results apply and one has e.g. identity of
the involved integrals by splitting the integration over the sets Sij and adding
up the results.
We note that if one goes beyond σ-finite measures (often a good thing to
do if one wants to have counterexamples), the assertion may no longer be true.
Let S1 = S2 = [0, 1] and Σ1 = Σ2 = B[0, 1]. Take µ1 equal to Lebesgue measure
and µ2 the counting measure, the latter is not σ-finite. It is a nice exercise to
show that ∆ := {(x, y) ∈ S : x = y} ∈ Σ. Let f = 1∆ . Obviously I1f (s1 ) ≡ 1
and I2f (s2 ) ≡ 0 and the two iterated integrals in (5.3) are 1 and 0. So, more or
less everything above concerning product measures fails in this example.
We close the section with a few remarks on products with more than two factors.
The construction of a product measure space carries over, without any problem,
to products of more than two factors, as long as there are finitely many. This
results in product spaces of the form (S1 × . . . × Sn , Σ1 × . . . × Σn , µ1 × . . . × µn )
under conditions similar to those of Theorem 5.5. The product σ-algebra is
again defined as the smallest σ-algebra that makes all projections measurable.
Existence of product measures is proved in just the same way as before, using an
induction argument. Note that there will be many possibilities to extend (5.1)
and (5.2), since there are n! different integration orders. We leave the details to
the reader.
Things become however more complicated, when we work with infinite prod-
ucts of measure spaces (Si , Σi , µi ). In Section 5.3 we treat the construction of
an infinite product of probability spaces.
Remark 5.7 Observe that the proof of B(R) × B(R) ⊂ B(R2 ) generalizes to
the situation, where one deals with two topological spaces with the Borel sets.
For the proof of the other inclusion, we used (and needed) the fact that R is
separable under the ordinary topology. In a general setting one might have the
strict inclusion of the product σ-algebra in the Borel σ-algebra on the product
space (with the product topology).
We now know that there is no difference between B(R2 ) and B(R) × B(R). This
facilitates the use of the term 2-dimensional random vector and we have the
following easy to prove corollary.
Corollary 5.8 Let X1 , X2 : Ω → R be given. The vector mapping X =
(X1 , X2 ) : Ω → R2 is a random vector iff the Xi are random variables.
f : R2 → R be a continuous function. Then it is also B(R2 )-measurable, and by
Corollary 5.8 and composition of measurable functions, f (X1 , X2 ) is a random
variable as well. Apply this with f (x1 , x2 ) = x1 + x2 .
Recall that we defined in Section 3.2 the distribution, or the law, of a random
variable. Suppose that X = (X1 , X2 ) is a random vector defined on (Ω, F, P)
with values in R2 . Then
PX (E) := P(X ∈ E)
for E ∈ B(R2 ) defines a probability measure on (R2 , B(R2 )), the distribution of
X, also called the joint distribution of (X1 , X2 ). If we take E = E1 × R, then
we have PX (E1 × R) = P(X1 ∈ E1 ) = PX1 (E1 ). Common terminology is to call
PX1 the marginal distribution, or marginal law, of X1 .
Along with the (joint) distribution of X, we introduce the joint distribution
function F = FX : R2 → [0, 1], given by F (x1 , x2 ) = P(X1 ≤ x1 , X2 ≤ x2 ).
Notice that, for instance, FX1 (x1 ) = limx2 →∞ F (x1 , x2 ), also denoted F (x1 , ∞).
It may happen that there exists a nonnegative B(R2 )-measurable function f
such that PX (E) = ∫E f d(λ × λ), for all E ∈ B(R2 ). In that case, f is called
the (joint) density of X. The obvious marginal density fX1 of X1 is defined by
fX1 (x1 ) = ∫ f (x1 , x2 ) λ(dx2 ). One similarly defines the marginal density of X2 .
Check these are indeed densities in the sense of Example 4.28.
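For instance, take the joint density f (x1 , x2 ) = 2 · 1{0<x1 <x2 <1} . Then
fX1 (x1 ) = ∫ f (x1 , x2 ) λ(dx2 ) = 2(1 − x1 ) for x1 ∈ (0, 1),
fX2 (x2 ) = ∫ f (x1 , x2 ) λ(dx1 ) = 2x2 for x2 ∈ (0, 1),
and both marginal densities indeed integrate to one.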
Independence (of random variables) had to do with multiplication of probabili-
ties (see Definition 3.11), so it should in a natural way be connected to product
measures.
The results of the present section (Proposition 5.6, Corollary 5.8, Proposi-
tion 5.10) have obvious extensions to higher dimensional situations. We leave
the formulation to the reader.
and X(ω) = ω, having a Bernoulli distribution for P defined on the power set
of Ω with P({1}) = p. Clearly, it is impossible to define on this Ω a random
variable having more than two different outcomes. Extending the probability
space to a suitable product space offers a way out, see Exercise 5.13, from which
it even follows that X and Y are independent.
if Ei ∈ Fi , i = 1, . . . , n.
Proof The proof is based on an application of Theorem 2.7. We proceed by
showing that P0 is countably additive on C. To that end we invoke Exercise 1.9,
from which we deduce that it is sufficient to show that for a decreasing sequence
of cylinders C n with limn→∞ P0 (C n ) > 0, one must have C := ∩n≥1 C n ≠ ∅.
Without loss of generality we may assume that the C n are of the type Cn as
above. If it happens that there is an N ∈ N such that all Cn can be written
as Bn × ∏k>N Ωk , with Bn ∈ F1 × . . . × FN , we are done, since in this case
P0 (Cn ) = P1 × . . . × PN (Bn ). We already know that P1 × . . . × PN is a measure
and therefore countably additive. Henceforth we assume that such an N doesn’t
exist.
For simplicity we write Ω0n = ∏k>n Ωk , so that Cn = Bn × Ω0n . Typical
elements of Ω0n are ωn0 and we have ω = (ω1 , . . . , ωn , ωn0 ). On the cylinders in
Ω0n we can define set functions P0n in the same way as P0 was defined on C. Note
that, similar to the definition of P0 , the action of P0n on cylinders in Ω0n only
involves product measures with finitely many factors. For every cylinder C in C,
one defines C(ω1 , . . . , ωn ) = {ωn0 : (ω1 , . . . , ωn , ωn0 ) ∈ C}. Then the probabilities
P0n (C(ω1 , . . . , ωn )) are well defined.
Since we have assumed that limn→∞ P0 (Cn ) > 0 for the decreasing sequence
(Cn ), there exists ε > 0 such that P0 (Cn ) > ε for all n. Define
En1 = {ω1 : P01 (Cn (ω1 )) > ε/2}.
It follows from Lemma 5.3 that En1 ∈ F1 . Then, using ‘Fubini computations’,
we obtain
P0 (Cn ) = ∫Ω1 P01 (Cn (ω1 )) P1 (dω1 )
= ∫En1 P01 (Cn (ω1 )) P1 (dω1 ) + ∫Ω1 \En1 P01 (Cn (ω1 )) P1 (dω1 )
≤ P1 (En1 ) + (ε/2) P1 (Ω1 \ En1 ).
Since P0 (Cn ) > ε, it then follows that P1 (En1 ) > ε/2. Since the Cn form a
decreasing sequence, the same holds for the En1 . Letting E 1 = ∩n En1 , we get by
continuity of P1 that P1 (E 1 ) ≥ ε/2. In particular E 1 is not empty and we can
choose some ω1∗ ∈ E 1 for which we have P01 (Cn (ω1∗ )) > ε/2 for all n.
Then we repeat the above story applied to the sets Cn (ω1∗ ) instead of the Cn .
So we consider the sets Cn (ω1∗ , ω2 ) and En2 (ω1∗ ) = {ω2 : P02 (Cn (ω1∗ , ω2 )) > ε/4}.
This results in a non-empty limit set E 2 (ω1∗ ) ⊂ Ω2 from which we select some ω2∗
and we obtain P02 (Cn (ω1∗ , ω2∗ )) > ε/4 for all n. Continuing this way we construct
a point ω ∗ = (ω1∗ , ω2∗ , . . .) that belongs to all Cn and therefore the intersection
∩n Cn is not empty.
5.4 Exercises
5.1 Show that the embeddings es1 are Σ2 /Σ-measurable and that the es2 are
Σ1 /Σ-measurable. Also prove Proposition 5.1.
5.2 Prove part (iii) of Fubini’s theorem (Theorem 5.5) for f ∈ L1 (S, Σ, µ) (you
already know it for f ∈ Σ+ ). Explain why s1 7→ f (s1 , s2 ) is in L1 (S1 , Σ1 , µ1 )
for all s2 outside a set N of µ2 -measure zero and that I2f is well defined on N c .
Define
fX (x) = ∫R f (x, y) dy.
(a) Show that for all a, b, c ∈ R the function (x, y) 7→ bx + cf (a, y) is Borel-
measurable on R2 .
(b) Let ani = i/n, i ∈ Z, n ∈ N. Define
f n (x, y) = ∑i 1(ani−1 ,ani ] (x) ( ((ani − x)/(ani − ani−1 )) f (ani−1 , y) + ((x − ani−1 )/(ani − ani−1 )) f (ani , y) ).
Show that each f n is Borel-measurable on R2 .
5.9 Show that for t > 0
∫0∞ sin x e−tx dx = 1/(1 + t2 ).
Although x ↦ (sin x)/x doesn’t belong to L1 ([0, ∞), B([0, ∞)), λ), show that one can
use Fubini’s theorem to compute the improper Riemann integral
∫0∞ (sin x)/x dx = π/2.
where F (s−) = limu↑s F (u). Hint: integrate 1(a,b]2 and split the square
into a lower and an upper triangle.
(b) The above displayed formula is not symmetric in F and G. Show that it
can be rewritten in the symmetric form
F (b)G(b) − F (a)G(a) = ∫(a,b] F (s−) dG(s) + ∫(a,b] G(s−) dF (s) + [F, G](b) − [F, G](a),
where [F, G](t) = ∑a<s≤t ∆F (s)∆G(s) (for t ≥ a), with ∆F (s) = F (s) −
F (s−). Note that this sum involves at most countably many terms and is
finite.
5.11 Let F be the distribution function of a nonnegative random variable X
and α > 0. Show (use Exercise 5.10 for instance, or write E X α = E f (X), with
f (x) = ∫0x αy α−1 dy) that
E X α = α ∫0∞ xα−1 (1 − F (x)) dx.
5.12 Let I be an arbitrary uncountable index set. For each i there is a probability
space (Ωi , Fi , Pi ). Define the product σ-algebra F on ∏i∈I Ωi as for the
case that I is countable. Call a set C a countable cylinder if it can be written
as a product ∏i∈I Ci , with Ci ∈ Fi and Ci a strict subset of Ωi for at most
countably many indices i.
(a) Show that the collection of countable cylinders is a σ-algebra, that it contains
the measurable rectangles and that every set in F is in fact a countable cylinder.
(b) Let F = ∏i∈I Ci ∈ F and let IF be the set of indices i for which Ci is a
strict subset of Ωi . Define P(F ) := ∏i∈IF Pi (Ci ). Show that this defines
a probability measure on F with the property that P(πi−1 [E]) = Pi (E) for
every i ∈ I and E ∈ Fi .
5.13 Let X be a random variable, defined on some (Ω, F, P). Let Y be random
variable defined on another probability space (Ω0 , F 0 , P0 ). Consider the product
space, with the product σ-algebra and the product probability measure. Rede-
fine X and Y on the product space by X(ω, ω 0 ) = X(ω) and Y (ω, ω 0 ) = Y (ω 0 ).
Show that the redefined X and Y are independent and that their marginal
distributions are the same as they were originally.
6 Derivative of a measure
The topics of this chapter are absolute continuity and singularity of a pair of
measures. The main result is a kind of converse of Proposition 4.23, known as
the Radon-Nikodym theorem, Theorem 6.10. In the proof of it that we will
give, we need that a continuous map on a Hilbert space can be represented by
the inner product with a fixed element in that space. This will be shown in
Section 6.2.
Hence we can identify the mapping T with the vector y. Let H ∗ be the set of
all linear maps on H. Then we have for this case the identification of H ∗ with
H itself via equation (6.1).
Suppose that we know that (6.1) holds. Then the kernel K of T is the space
of vectors that are orthogonal to y and the orthogonal complement of K is the
space of all vectors that are multiples of y. This last observation is the core of
the following elementary proof of (6.1).
Let us first exclude the trivial situation in which T = 0. Let K be the kernel
of T . Then K is a proper linear subspace of H. Take a nonzero vector z in the
orthogonal complement of K, a one-dimensional linear subspace of H. Every
vector x can be written as a sum x = λz + u, with λ ∈ R and u ∈ K. Then we
have
λ = hx, zi/hz, zi and T (x) = λ T (z). (6.2)
Let y = (T (z)/hz, zi) z. Then hx, yi = (T (z)/hz, zi) hx, zi = λT (z) = T (x), as follows from (6.2).
Uniqueness of y is shown as follows. Let y 0 ∈ H be such that T (x) = hx, y 0 i.
Then hx, y − y 0 i is zero for all x ∈ H, in particular for x = y − y 0 . But then
y − y 0 must be the zero vector.
The interesting observation is that this proof carries over to the case where one
works with continuous linear functionals on a Hilbert space, which we treat in
the next section. We will henceforth write T x instead of T (x).
this inner product. Let T : H → R be a continuous linear functional on H, i.e. there
is a C > 0 such that |T x| ≤ C||x|| for all x ∈ H. We denote by H ∗ the linear
space of all continuous linear functionals on H, also called the dual space of H.
We will prove the Riesz-Fréchet theorem, which states that every continuous
linear functional on H is given by an inner product with a fixed element of H.
This theorem can be summarized as follows. The dual space H ∗ can be identified
with H itself. Moreover, we can turn H ∗ into a Hilbert space itself by defining an
inner product h·, ·i∗ on H ∗ , Let T, T 0 ∈ H ∗ and let y, y 0 the elements in H that
represent T and T 0 according to the theorem. Then we define hT, T 0 i∗ = hy, y 0 i.
One readily shows that this defines an inner product. Let || · ||∗ be the norm
on H ∗ generated by this inner product. Then H ∗ is complete as well. Indeed,
let (Tn ) be a Cauchy sequence in H ∗ with corresponding elements (yn ) in H,
satisfying Tn x ≡ hx, yn i. Then ||Tn − Tm ||∗ = ||yn − ym ||. The sequence (yn ) is
thus Cauchy in H and has a limit y. Define T x = hx, yi. Then T is obviously
linear and ||Tn − T ||∗ = ||yn − y|| → 0. Concluding, we say that the normed
spaces (H ∗ , || · ||∗ ) and (H, || · ||) are isomorphic.
The usual operator norm of a linear functional T on a normed space is defined
as ||T ||∗ = supx≠0 |T x|/||x||. It is a simple consequence of the Cauchy-Schwarz
inequality that this norm ||·||∗ is the same as the one in the previous paragraph.
In fact one can show that continuity of T as a mapping (in the usual sense) is
equivalent to finiteness of ||T ||∗ . Hence the constant C at the beginning of this
section satisfies C ≥ ||T ||∗ .
Remark 6.2 Suppose H = L2 (S, Σ, µ), then H is not a genuine Hilbert space,
since ||f ||2 = 0 only implies f = 0 a.e. The assertion of Theorem 6.1 still holds
true, except that uniqueness has to be replaced with uniqueness a.e.
this chapter a measure could be either a positive or a complex (or real) measure.
Notice that a positive measure can assume the value infinity, unlike a complex
measure, whose values lie in C (see also (6.3)).
Let µ be a complex measure and E1 , E2 , . . . be disjoint sets in Σ with E = ∪i≥1 Ei , then (by definition)
µ(E) = ∑i≥1 µ(Ei ),
where the sum is convergent and the summation is independent of the order.
Hence the series is absolutely convergent as well, and we also have
|µ(E)| ≤ ∑i≥1 |µ(Ei )| < ∞. (6.3)
For a given set E ∈ Σ let Π(E) be the collection of all measurable partitions
of E, countable partitions of E with elements in Σ. If µ is a complex measure,
then we define
|µ|(E) = sup{ ∑i |µ(Ei )| : Ei ∈ π(E) and π(E) ∈ Π(E)}.
We will show below that |µ| is a (positive) measure on (S, Σ) with |µ|(S) < ∞; it
is called the total variation measure (of µ). Notice that always |µ|(E) ≥ |µ(E)|
and that in particular µ(E) = 0 as soon as |µ|(E) = 0.
In the special case where µ is real valued,
µ+ = ½(|µ| + µ)
and
µ− = ½(|µ| − µ)
define two bounded positive measures such that
µ = µ+ − µ− . (6.4)
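For instance, for µ = δ0 − 2δ1 on ({0, 1}, 2{0,1} ) one has |µ|({0}) = 1 and |µ|({1}) = 2, so |µ| = δ0 + 2δ1 , and hence µ+ = δ0 and µ− = 2δ1 ; indeed µ = µ+ − µ− and |µ|(S) = 3 < ∞.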
Theorem 6.3 Let µ be a complex measure on (S, Σ), then |µ| is a positive
measure on (S, Σ), with |µ|(S) < ∞.
By summing over n and then letting ε → 0 we get
∑n bn ≤ ∑n,j |µ(An,j )| ≤ |µ|(A).
Taking now the supremum over all bn satisfying the constraints, we obtain
∑n |µ|(An ) ≤ |µ|(A).
We proceed by showing the converse of this inequality. Let (Bj ) be any other
measurable partition of A. Then for each fixed n, (An ∩ Bj ) is a partition of
An . We have
∑j |µ(Bj )| = ∑j |∑n µ(An ∩ Bj )| ≤ ∑n ∑j |µ(An ∩ Bj )| ≤ ∑n |µ|(An ).
Taking the supremum over all measurable partitions (Bj ) of A then yields |µ|(A) ≤ ∑n |µ|(An ).
Next we show that |µ|(S) is finite. It suffices to show that this is true for a real
measure. We claim the following fact. If E ∈ Σ is such that |µ|(E) = ∞, then
there exist disjoint A, B ∈ Σ such that A ∪ B = E, |µ(A)| > 1 and |µ|(B) = ∞.
This is proved as follows. Let m ∈ N and choose a measurable partition
(En ) of E such that ∑n |µ(En )| > m. Then there exists N ∈ N such that
∑Nn=1 |µ(En )| > m. It follows that there exists a subset J of {1, . . . , N } such
that either ∑n∈J µ(En ) > m/2 or ∑n∈J µ(En ) < −m/2. In either case, we
have |∑n∈J µ(En )| > m/2. Let A = ∪n∈J En . Then A ⊂ E and |µ(A)| > m/2.
Let then B = E \ A, for which we have by finiteness of µ, |µ(B)| = |µ(E) −
µ(A)| ≥ |µ(A)| − |µ(E)| ≥ m/2 − |µ(E)|. Choose m > 2(1 + |µ(E)|) to get
|µ(B)| > 1 as well as |µ(A)| > 1. Since E = A ∪ B and |µ|(E) = ∞, we must
have |µ|(A) = ∞ or |µ|(B) = ∞. This proves the claim (possibly after swapping
the roles of A and B).
Reasoning by contradiction, we now assume that |µ|(S) = ∞. Let B0 =
S. According to our claim, B0 is the disjoint union of sets A1 and B1 with
|µ(A1 )| > 1 and |µ|(B1 ) = ∞. By induction we construct a sequence (An ) of disjoint sets with
|µ(An )| > 1 and |µ|(Bn ) = ∞. Let U = ∪n≥1 An , which by countable additivity
of µ should satisfy µ(U ) = ∑n≥1 µ(An ). But the construction of the An was
such that this series cannot be convergent, yielding a contradiction.
If f : S → R is measurable and both ∫ |f | dµ+ and ∫ |f | dµ− are finite, which is equivalent
to ∫ |f | d|µ| < ∞, we put ∫ f dµ := ∫ f dµ+ − ∫ f dµ− . For f or µ complex
valued, one splits both into their real and imaginary parts and assumes that the
corresponding real integrals are well defined. Adding up in the obvious way
yields the definition of ∫ f dµ for this case. If all defining real integrals are
finite, one writes f ∈ L1 (S, Σ, µ).
For signed measures µ the usual triangle inequality |∫ f dµ| ≤ ∫ |f | dµ is
not valid in general, think of a negative µ. Instead we have the following result.
Proposition 6.4 Let µ be a real measure on a measurable space (S, Σ) and
|µ| its total variation measure. Assume that f ∈ L1 (S, Σ, µ). Then |∫ f dµ| ≤
∫ |f | d|µ|.
Proof Exercise 6.12.
Proof It follows that
νa0 − νa = νs − νs0 ,
νa0 − νa ≪ µ and νs − νs0 ⊥ µ (Proposition 6.6), and hence both are zero (Proposition 6.6 again).
Proof See Exercise 4.9 for nonnegative h. The other case is Exercise 6.3.
The Radon-Nikodym theorem of the next section states that every σ-finite mea-
sure ν that is absolutely continuous w.r.t. µ is of the form (6.6). We will use in
that case the notation h = dν/dµ.
In the next section we use
Lemma 6.9 Let µ be a finite positive measure and f ∈ L1 (S, Σ, µ), possibly
complex valued. Let A be the set of averages
aE = (1/µ(E)) ∫E f dµ,
where E runs through the collection of sets E ∈ Σ with µ(E) > 0. Then
µ({f ∉ Ā}) = 0.
Proof Assume that C \ Ā is not the empty set (otherwise there is nothing to
prove) and let B be a closed ball in C \ Ā with center c and radius r > 0. Notice
that |c − a| > r for all a ∈ Ā. It is sufficient to prove that E = f −1 [B] has
measure zero, since C \ Ā is a countable union of such balls (argue as in the
proof of Proposition 1.3). Suppose that µ(E) > 0. Then we would have
|aE − c| ≤ (1/µ(E)) ∫E |f − c| dµ ≤ r.
But this is a contradiction since aE ∈ A.
6.5 The Radon-Nikodym theorem
As an appetizer for the Radon-Nikodym theorem (Theorem 6.10) we consider
a special case. Let S be a finite or countable set and Σ = 2S . Let µ be
a positive σ-finite measure on (S, Σ) and ν another finite measure such that
ν ≪ µ. Define h(x) = ν({x})/µ({x}) if µ({x}) > 0 and zero otherwise. It is easy to
verify that h ∈ L1 (S, Σ, µ) and
ν(E) = µ(1E h), ∀E ⊂ S. (6.7)
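A quick numerical sketch of this discrete case (in Python; the three-point space and the weights are made up purely for illustration):

# S = {0, 1, 2}, Sigma = 2^S; nu << mu since nu puts no mass where mu puts none
mu = {0: 2.0, 1: 0.5, 2: 0.0}
nu = {0: 1.0, 1: 1.5, 2: 0.0}

h = {x: (nu[x] / mu[x] if mu[x] > 0 else 0.0) for x in mu}

for E in [{0}, {1}, {0, 2}, {0, 1, 2}]:
    lhs = sum(nu[x] for x in E)             # nu(E)
    rhs = sum(h[x] * mu[x] for x in E)      # mu(1_E h)
    print(lhs, rhs)                          # equal, as asserted in (6.7)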
Observe that we have obtained an expression like (6.6), but now starting from
the assumption ν ≪ µ. The principal theorem on absolute continuity (and
singularity) is the following.
Theorem 6.10 Let µ be a positive σ-finite measure and ν a complex measure.
Then there exists a unique decomposition ν = νa + νs and a function h ∈
L1 (S, Σ, µ) such that νa (E) = µ(1E h) for all E ∈ Σ (so νa ≪ µ) and νs ⊥ µ.
Moreover, h is unique in the sense that any other h0 with this property is such
that µ({h ≠ h0 }) = 0. The function h is called the Radon-Nikodym derivative
of νa w.r.t. µ and is often written as h = dνa /dµ.
Proof Uniqueness of the decomposition ν = νa + νs is the content of Proposi-
tion 6.7. Hence we proceed to show existence. Let us first assume that µ(S) < ∞
and that ν is positive and finite.
Consider then the positive bounded measure φ = ν + µ. Let f ∈ L2 (S, Σ, φ).
The Cauchy-Schwarz inequality (Remark 4.46) gives
|ν(f )| ≤ ν(|f |) ≤ φ(|f |) ≤ (φ(f 2 ))1/2 (φ(S))1/2 .
We see that the linear map f 7→ ν(f ) is bounded on L2 (S, Σ, φ). Hence
there exists, by virtue of the Riesz-Fréchet Theorem 6.1 and Remark 6.2, a
g ∈ L2 (S, Σ, φ) such that for all f
ν(f ) = φ(f g). (6.8)
Take f = 1E for any E with φ(E) > 0. Then φ(E) ≥ ν(E) = φ(1E g) ≥ 0 so
that the average (1/φ(E)) φ(1E g) lies in [0, 1]. From Lemma 6.9 we obtain that
φ({g ∉ [0, 1]}) = 0. Replacing g with g1{0≤g≤1} , we see that (6.8) still holds
and hence we may assume that 0 ≤ g ≤ 1.
Rewrite (6.8) as
ν(f (1 − g)) = µ(f g) (6.9)
and take f = 1{g=1} to obtain µ({g = 1}) = 0. Define the positive measure νs
on Σ by νs (E) = ν(E ∩ {g = 1}). Then νs ({g < 1}) = νs (∅) = 0. It follows that
νs ⊥ µ. Define the measurable function
h = (g/(1 − g)) 1{g<1} , (6.10)
and the measure νa on Σ by νa (E) = µ(1E h). By Proposition 6.8 this indeed
defines a measure and obviously νa ≪ µ.
Let f = 1E∩{g<1} /(1 − g) and note that f = limn fn , where fn = 1E∩{g<1} (1 + g + · · · + g n−1 ).
Apply (6.9) to the fn and apply monotone convergence to obtain
Theorem 6.10 is often used for probability measures Q and P with Q ≪ P. Write
Z = dQ/dP and note that E Z = 1. It is immediate from Proposition 4.23 that for
X ∈ L1 (Ω, F, Q) one has
EQ X = E [XZ], (6.11)
where the expectation on the right hand side is taken under P.
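As an illustration of (6.11), here is a small Monte Carlo sketch in Python; the Gaussian pair P = N(0, 1), Q = N(1, 1) and the sample size are hypothetical choices made only for this example. In this case dQ/dP(w) = exp(w − 1/2).

import numpy as np

rng = np.random.default_rng(0)
n = 10**6

w = rng.normal(0.0, 1.0, size=n)        # sample under P = N(0, 1)
Z = np.exp(w - 0.5)                     # dQ/dP evaluated in the sample points
print(Z.mean())                          # approximately E Z = 1

X = w**2                                 # some X in L^1(Omega, F, Q)
print((X * Z).mean())                    # approximately E_Q X = 2, via (6.11)
print((rng.normal(1.0, 1.0, size=n) ** 2).mean())   # direct simulation under Q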
Proposition 6.13 Let F be a distribution function, F : R → R. Then there
exists a purely discontinuous right-continuous nondecreasing function Fd , with
limx→−∞ Fd (x) = 0, a nonnegative Borel-measurable function f and a nonde-
creasing continuous function Fs with limx→−∞ Fs (x) = 0 such that the decom-
position
F = Fd + Fs + Fac ,
holds true, with Fac defined by Fac (x) = ∫(−∞,x] f (y) dy. Such a decomposition is
unique.
One verifies that Fd has the asserted properties, the set of discontinuities of Fd is
also D and that Fc := F −Fd is continuous. Up to the normalization constant 1−
Fd (∞), Fc is a distribution function if Fd (∞) < 1 and equal to zero if Fd (∞) = 1.
Hence there exists a subprobability measure µc on B such that µc ((−∞, x]) =
Fc (x), Theorem 3.10. According to the Radon-Nikodym theorem, we can split
µc = µac + µs , where µac is absolutely continuous w.r.t. Lebesgue measure
λ. Hence, there exists a λ-a.e. unique function f in L1+ (R, B, λ) such that
µac (B) = ∫B f dλ.
is another example. Let f = 1Q∩[0,∞) . Then F = 0, and trivially also F 0 (x) = 0
for all x > 0, equal to f (x) outside Q. In this example, F 0 exists everywhere on
(0, ∞), but only coincides with f outside a set of Lebesgue measure zero. This
all suggests a generalization of the classical fundamental theorem of calculus,
Theorem 6.18 below. We need some preparations, formulated as the next three
lemmas, but first a definition.
Definition 6.14 Let (S, Σ, µ) be a measure space and suppose that S is en-
dowed with a topology. A set A ∈ Σ is called regular if
Lemma 6.15 Let µ be a finite measure on the Borel sets of a finite interval
I = [0, b]. Then all sets in B([0, b]) are regular. In particular, if B is Borel with
µ(B) = 0, for any ε > 0, there exists an open set U in [0, b] such that U ⊃ B
and µ(U ) < ε.
Proof First one shows (Exercise 6.13) that the collection of sets A in B([0, b])
that are regular as well as their complements Ac form a σ-algebra, Σ say. Next,
every closed set in [0, b] is compact and thus automatically regular. Let then U
be open. The sets Un := {x : d(x, U c ) ≥ 1/n} (d the ordinary metric) are closed
and their union is U . It follows that U is regular. Hence Σ contains all open
(and closed) sets in [0, b] and thus Σ = B([0, b]).
To prove the second assertion, we observe that µ(B c ) = µ([0, b]) and that
B c is regular. Hence there is a compact set K such that K ⊂ B c and
µ(B c \ K) < ε. But then the open set K c contains B and µ(K c ) < ε.
Proof Let t < λ(U ). Because U is an interval, by the continuity of λ there exists
a compact interval K with K ⊂ U and λ(K) > t. Since U is an open cover of
K, there are finitely many U1 , . . . , Un in U such that ∪ni=1 Ui ⊃ K. Order the
Ui such that the λ(Ui ) are non-increasing. Let V1 := U1 . Suppose Ui ∩ V1 ≠ ∅
for all i ≥ 2. Write Ui = (mi − δi , mi + δi ) and W1 = (m1 − 3δ1 , m1 + 3δ1 ). If
Ui ∩ V1 ≠ ∅, we have for instance mi − δi < m1 + δ1 , hence mi + δi = mi − δi + 2δi <
m1 + δ1 + 2δ1 ≤ m1 + 3δ1 , since δi ≤ δ1 . Hence mi + δi ∈ W1 , and likewise mi − δi ∈ W1 .
It follows that Ui ⊂ W1 , for all i ≥ 1, and hence ∪ni=1 Ui ⊂ W1 and t < λ(∪ni=1 Ui ) ≤ λ(W1 ) = 3λ(V1 ).
In this case we set q = 1.
If there is Ui such that V1 ∩ Ui = ∅, we define V2 = Uk2 , where k2 is the
smallest index in {2, . . . , n} such that V1 ∩ Uk = ∅. If (V1 ∪ V2 ) ∩ Ui ≠ ∅ for all
i ≥ 2, we set q = 2 and next to W1 above, we construct W2 like W1 above by
‘blowing up’ V2 by a factor 3. For i ≠ 1, k2 , we have Ui ≠ V1 , V2 and Ui ∩ V1 ≠ ∅
or Ui ∩ V2 ≠ ∅. It follows, by parallel reasoning as above, that Ui ⊂ W1 or
Ui ⊂ W2 . Hence all Ui ⊂ W1 ∪ W2 and the same holds for their union. So
λ(∪ni=1 Ui ) ≤ λ(W1 ) + λ(W2 ) = 3(λ(V1 ) + λ(V2 )).
Here is the general case. Choose recursively for i ≥ 2, Vi such that Vi = Uki ,
where ki is the smallest integer k in {i, . . . , n} for which Uk ∩ ∪i−1j=1 Vj = ∅.
The largest i for which this is possible is q. Then any Uk is either one of the
Vi (and thus contained in the corresponding Wi ), or it intersects one of the
Vi , in which case it is also contained in the corresponding Wi . It follows that
∪nk=1 Uk ⊂ ∪qi=1 Wi , from which the assertion follows.
Lemma 6.17 Let µ be a probability (or a finite) measure on the Borel sets of
an interval [a, b]. Let F (x) = µ([a, x]) be its distribution function. If A is a
Borel subset of [a, b] with µ(A) = 0, then F is differentiable on a subset A0 of
A with λ(A \ A0 ) = 0 and F 0 (x) = 0 for all x in A0 .
Nj = {x ∈ A : lim suph↓0 µ((x − h, x + h))/h > 1/j}
have Lebesgue measure zero. Indeed on the set N = ∪j≥1 Nj , satisfying λ(N ) = 0,
the limit in (6.12) may not exist or may be different from zero. First we argue
that the Nj are measurable. It is easy to verify that (x, y, h) → 1(x−h,x+h) (y)
is a measurable function of its three arguments. By the construction R leading
to Fubini’s theorem (Lemma 5.3), the mapping (x, h) → 1(x−h,x+h) dλ =
µ((x − h, x + h)) is measurable as well. Realizing that by the continuity of the
measure, the map h → µ((x − h, x + h)) is left continuous, for computing the
lim sup in the definition of the Nj we can restrict ourself to rational sequences,
which makes the lim sup a measurable function of x. Consequently, Nj is Borel-
measurable.
Let ε > 0. Since µ(A) = 0, by virtue of Lemma 6.15 there exists an open
V ⊃ A, with µ(V ) < ε. Let x ∈ Nj . Then there is h > 0 such that (x −
h, x + h) ⊂ V and µ((x − h, x + h)) > h/j. (Note that the latter entails
2jµ((x − h, x + h)) > λ((x − h, x + h)).) All such intervals cover Nj and if
t < λ(Nj ), we can choose from them according to Lemma 6.16 disjoint intervals
J1 , . . . , Jq all contained in V such that (here it is essential that the intervals are
disjoint)
t ≤ 3 ∑qi=1 λ(Ji ) ≤ 6j ∑qi=1 µ(Ji ) = 6j µ(∪qi=1 Ji ) ≤ 6j µ(V ) ≤ 6jε.
Here is the result we aim at, the generalized fundamental theorem of calculus.
Theorem 6.18 Let f : R → R be measurable such that ∫K |f | dλ < ∞ for
every compact subset K of R. Define for any a ∈ R the function F : [a, ∞) → R
by F (x) = ∫(a,x] f dλ. Then outside a set N of Lebesgue measure zero, F is
differentiable and F 0 (x) = f (x), for all x not belonging to N .
and we obtain
lim suph↓0 (1/h) ∫(x,x+h) f dλ ≤ r.
lim inf h↓0 (1/h) ∫(x,x+h) f dλ ≥ f (x). (6.14)
We can go one step further. Also functions F that cannot be written as integrals
of f w.r.t. Lebesgue measure may be differentiable outside a set of Lebesgue
measure zero. Take for instance for F the Cantor function of Exercise 6.5, then
outside the Cantor set, F is differentiable with F 0 = 0.
Theorem 6.19 Let F : [0, ∞) → R be an increasing function. Then F is dif-
ferentiable in Lebesgue almost all points of (0, ∞). Denote by F 0 the derivative
in those points where it exists, and zero else. Then F 0 is integrable w.r.t. λ on
any interval [0, b].
Theorem 6.20 Let p ∈ [1, ∞). The dual space of Lp (Ω, F, P) is Lq (Ω, F, P),
where 1/p + 1/q = 1.
Case 1: p = 1. We show that Y is a.s. bounded. Let F = {Y > c}, for some
c > 0. By continuity of T , we have
c P (Y > c) ≤ E [1F Y ] = |T (1F )| ≤ ||T || ||1F ||1 = ||T || P(Y > c).
Hence, if P(Y > c) > 0 it follows that ||T || ≥ c. Stated otherwise ||T || ≥
sup{c > 0 : P(Y > c) > 0}. A similar argument yields ||T || ≥ sup{c > 0 :
P(Y < −c) > 0}. It follows that ||Y ||∞ ≤ ||T || < ∞.
We finally show that for every X ∈ L1 (Ω, F, P), it holds that T (X) =
E [XY ]. By the above construction this is true for X of the form X = 1F .
Hence also for (nonnegative) simple functions. Let X ∈ L1 (Ω, F, P) be arbitrary.
Choose a sequence (Xn ) of simple functions such that Xn → X a.s. and |Xn | ≤
|X|. Then, by dominated convergence, Xn → X in L1 (Ω, F, P) as well and,
since Y is a.s. bounded, we also have Xn Y → XY in L1 (Ω, F, P). But then
T (X) = E [XY ].
Case 2: p ∈ (1, ∞). We start from Equation (6.16). It follows that for
every X ∈ L∞ ((Ω, F, P)) one has T (X) = E XY . For every n ∈ N, put Xn =
sgn(Y )|Y |q−1 1En , where En = {|Y | ≤ n}. Note that every Xn is bounded,
Xn Y = |Y |q 1En and |Xn |p = |Y |q 1En . We obtain
E |Y |q 1En = E Xn Y = T (Xn ) ≤ ||T || ||Xn ||p = ||T || (E |Y |q 1En )1/p ,
from which it follows that (E |Y |q 1En )1/q ≤ ||T || (here it is used that p > 1).
By letting n → ∞, we obtain ||Y ||q ≤ ||T || < ∞, so Y ∈ Lq (Ω, F, P).
Finally, for X ∈ Lp (Ω, F, P) we put Xn = X1{|X|≤n} , so that Xn ∈
L∞ (Ω, F, P) and ||X − Xn ||p → 0. It follows by Hölder’s inequality (used
in the fourth step) that
T (X) = T (X − Xn ) + T (Xn )
= T (X − Xn ) + E Xn Y
= T (X − Xn ) + E (Xn − X)Y + E XY
→ E XY.
In both cases the random variable Y is a.s. unique. Indeed, if Y and Y 0 satisfy
E [XY ] = E [XY 0 ] = T (X) for all X ∈ Lp (Ω, F, P), we can choose X = sgn(Y −
Y 0 ) to obtain this result.
Remark 6.21 In the proof of Theorem 6.20 one actually has that ||Y ||q = ||T ||
(Exercise 6.15). The Riesz-Fréchet Theorem 6.1 can be applied to show that
the dual space of L2 (Ω, F, P) is isomorphic to itself, see also Remark 6.2 for the
L2 (Ω, F, P) case.
Suppose there is g ∈ L1 ((−1/2, 1/2), B, λ) such that T f ≡ ∫(−1/2,1/2) f g dλ. Then for
x0 ≠ 0 and all small ε
0 = (1/(2ε)) ∫[x0 −ε,x0 +ε] g dλ → g(x0 ),
in view of Theorem 6.18 for all x0 outside a set of measure zero. Hence g = 0
a.s. and then ∫(−1/2,1/2) f g dλ = 0 for all f ∈ L∞ ((−1/2, 1/2), B, λ). But T is not
identically zero.
Proposition 6.22 Let µ be a complex measure. Then µ ≪ |µ| and the Radon-Nikodym
derivative h = dµ/d|µ| may be taken such that |h| = 1.
Proof That µ ≪ |µ| has already been observed in Section 6.3. Let h be
any function as in the Radon-Nikodym theorem applied to µ ≪ |µ|. Since
||µ|(h1E )| = |µ(E)| ≤ |µ|(E), it follows from Lemma 6.9 that |µ|({|h| > 1}) = 0.
On the other hand, for A = {|h| ≤ r} (r > 0) and a measurable partition with
elements Aj of A, we have
∑j |µ(Aj )| = ∑j ||µ|(1Aj h)| ≤ ∑j |µ|(1Aj |h|) ≤ r|µ|(A).
Then we find, by taking suprema over such partitions, that |µ|(A) ≤ r|µ|(A).
Hence for r < 1 we find |µ|(A) = 0 and we conclude that |µ|({|h| < 1}) = 0.
Combining this with the previous result we get |µ|({|h| ≠ 1}) = 0. The function
that we look for is h1{|h|=1} + 1{|h|≠1} .
Corollary 6.23 Let µ be a real measure and h = dµ/d|µ|. Then for any E ∈ Σ we have
µ+ (E) = |µ|(1E∩{h=1} ) and µ− (E) = |µ|(1E∩{h=−1} ), and µ+ ⊥ µ− . Moreover,
d|ν|/dµ = |dν/dµ|.
Proof Exercise 6.8.
6.10 Exercises
6.1 Let µ be a real measure on a space (S, Σ). Define ν : Σ → [0, ∞) by
ν(E) = sup{µ(F ) : F ∈ Σ, F ⊂ E, µ(F ) ≥ 0}. Show that ν is a finite positive
measure. Give a characterization of ν.
6.2 Prove Proposition 6.6.
6.3 Prove a version of Proposition 6.8 adapted to the case where h ∈ L1 (S, Σ, µ)
is complex valued.
6.4 Let X be a symmetric Bernoulli distributed random variable (P(X = 0) =
P(X = 1) = 1/2) and Y uniformly distributed on [0, θ] (for some arbitrary θ > 0).
Assume that X and Y are independent.
(a) Show that the laws Lθ (θ > 0) of XY are not absolutely continuous w.r.t.
Lebesgue measure on R.
(b) Find a fixed dominating σ-finite measure µ such that Lθ ≪ µ for all θ and
determine the corresponding Radon-Nikodym derivatives.
6.5 Let X1 , X2 , . . . be an iid sequence of Bernoulli random variables, defined on
some probability space (Ω, F, P) with P(X1 = 1) = 1/2. Let
X = ∑k≥1 2−k Xk .
where the factor 3 only appears for esthetic reasons. Show that the distribution
function F : [0, 1] → R of Y is constant on (1/4, 3/4), that F (1 − x) =
1 − F (x) and that it satisfies F (x) = 2F (x/4) for x < 1/4.
(c) Make a sketch of F and show that F is continuous, but not absolutely
continuous w.r.t. Lebesgue measure. (Hence there is no Borel measurable
function f such that F (x) = ∫[0,x] f (u) du, x ∈ [0, 1].)
6.6 Let f ∈ L1 (S, Σ, µ) be such that µ(1E f ) = 0 for all E ∈ Σ. Show that
µ({f ≠ 0}) = 0. Conclude that the function h in the Radon-Nikodym theorem
has the stated uniqueness property.
6.7 Let µ and ν be positive σ-finite measures and φ an arbitrary measure on a
measurable space (S, Σ). Assume that φ ≪ ν and ν ≪ µ. Show that φ ≪ µ
and that
dφ/dµ = (dφ/dν)(dν/dµ).
6.8 Prove Proposition 6.24.
6.9 Let ν and µ be positive σ-finite measures on (S, Σ) with ν ≪ µ and let
h = dν/dµ, the standing assumptions in this exercise. Show that ν({h = 0}) = 0.
Show that µ({h = 0}) = 0 iff µ ≪ ν. What is dµ/dν if this happens?
6.10 Let µ and ν be positive σ-finite measures and φ a complex measure on
(S, Σ). Assume that φ ≪ µ and ν ≪ µ with Radon-Nikodym derivatives h and
k respectively. Let φ = φa + φs be the Lebesgue decomposition of φ w.r.t. ν.
Show that (ν-a.e.)
dφa /dν = (h/k) 1{k>0} .
6.11 Consider the measurable space (Ω, F) and a measurable map X : Ω → Rn
(Rn is endowed with the usual Borel σ-algebra B n ). Consider two probability
measure P and Q on (Ω, F) and let PX and QX be the corresponding dis-
tributions (laws) on (Rn , B n ). Assume that PX and QX are both absolutely
continuous w.r.t. some σ-finite measure (e.g. Lebesgue measure), with corre-
sponding Radon-Nikodym derivatives (in this context often called densities) f
and g respectively, so f, g : Rn → [0, ∞). Assume that g > 0. Show that for
F = σ(X) it holds that P ≪ Q and that (look at Exercise 6.10) the Radon-Nikodym
derivative here can be taken as the likelihood ratio
ω ↦ (dP/dQ)(ω) = f (X(ω))/g(X(ω)).
7 Convergence and Uniform Integrability
In this chapter we first review a number of convergence concepts for random
variables and study how they are interrelated. The important concept of uniform
integrability shall enable us to perform a more refined analysis.
Conversely, if (7.1) holds, we have almost sure convergence. Notice that we can
rewrite the probability in (7.1) as P(lim inf Enε ) = 1, with Enε = {|Xn − X| < ε}.
If Xn → X and Xn → X 0 in probability, then we have by the triangle inequality for any
ε > 0
The next proposition gives a partial converse to Proposition 7.3 (iii). Notice that
the assertion would trivially follow from the Dominated Convergence Theorem
for an a.s. converging sequence. The weaker assumption on convergence in
probability makes it slightly less trivial. In Theorem 7.15 we will see a kind of
converse to Proposition 7.3(iii) for p = 1.
Proof The first assertion follows from P(|X| > K + ε) ≤ P(|Xn − X| > ε), valid
for every ε > 0. Let n → ∞ to conclude P(|X| > K + ε) = 0, ∀ε > 0. Then
|Xn − X| ≤ 2K a.s., which we use to prove the second assertion. Consider for
any ε > 0
The following result tells how to use almost sure convergence when convergence
in probability has to be established.
Proposition 7.5 There is equivalence between
(i) Xn → X in probability, and
(ii) every subsequence of (Xn ) contains a further subsequence that is almost
surely convergent to X.
Proof Assume that (i) holds, then for any ε > 0 and any subsequence we also
have P(|Xnk − X| > ε) → 0. Hence for every p ∈ N, there is kp ∈ N such that
P(|Xnkp − X| > 2−p ) ≤ 2−p . Now we apply part (ii) of Proposition 7.3, which
gives us (ii), once we have verified that ∑p P(|Xnkp − X| > ε) < ∞ for all ε > 0.
This holds since
∑p P(|Xnkp − X| > ε) = ∑p:2−p >ε P(|Xnkp − X| > ε) + ∑p:2−p ≤ε P(|Xnkp − X| > ε),
But the sequence Xnk by assumption has an almost surely convergent subse-
quence (Xnkp ), which, by Proposition 7.3 (i), also converges in probability. This
contradicts (7.2).
Convergence, almost surely or in probability, of random vectors is defined simi-
larly. For instance, if X, X1 , X2 , . . . are n-dimensional random vectors and || · ||
is a norm on Rn (you may also take a metric instead of a norm), then we say
that Xn → X in probability if ||Xn − X|| → 0 in probability. Here we apply the definition of convergence
for real random variables to ||Xn − X|| (which is truly a random variable!). A
nice feature of the convergence concepts introduced above is that appropriate
convergence results for real valued random variables carry over to results for
random vectors.
Lemma 7.8 Let X ∈ L1 (Ω, F, P) and put ν(F ) := E |X|1F , F ∈ F. Then for
all ε > 0 there exists δ > 0, such that ν(F ) < ε, if P(F ) < δ.
Remark 7.9 The assertion of Lemma 7.8 justifies what we previously called
absolute continuity (of ν w.r.t. P).
limK→∞ E |X|1{|X|≥K} = 0.
We give some rather general examples of a uniformly integrable collection C.
Example 7.13 The content of the present example should be intuitively ob-
vious. Let Y be a nonnegative random variable with EY < ∞. Let C be a
collection of random variables X with the property |X| ≤ Y a.s. Then C is UI.
Indeed, we have from |X| ≤ Y a.s. that E |X|1{|X|≥K} ≤ E Y 1{Y ≥K} ,
hence supX∈C E |X|1{|X|≥K} ≤ E Y 1{Y ≥K} → 0 as K → ∞, by the Dominated Convergence Theorem.
Proof Assume that E |Xn − X| → 0. Then (i) was already known from Propo-
sition 7.3. To prove (ii) we consider
hence for every N ∈ N
supn≥N E |Xn |1{|Xn |>K} ≤ supn≥N E |Xn − X| + supn≥N E |X|1{|Xn |>K} . (7.3)
Let ε > 0 and choose N such that the first term is smaller than ε, which
can be done by L1 -convergence. To control the second term, we observe that
P(|Xn | > K) ≤ supn E |Xn |/K. By L1 -convergence one has supn E |Xn | < ∞
and hence, by selecting K = K(δ) large enough (K > supn E |Xn |/δ), one has
P(|Xn | > K) < δ, for any δ > 0. Now choose δ as in Lemma 7.8. Then we get
supn≥N E |X|1{|Xn |>K} < ε. It follows from (7.3) that
supn≥N E |Xn |1{|Xn |>K} < 2ε.
Next we consider the Xn for n < N . Since by Proposition 7.14 these form a UI
collection, we have for all sufficiently large K
supn<N E |Xn |1{|Xn |>K} < 2ε.
Combining the inequalities in the last two displays, we conclude that for all
large enough K
supn≥1 E |Xn |1{|Xn |>K} ≤ 2ε,
7.3 Exercises
7.1 Let X1 , Y1 , X2 , Y2 , . . . be an i.i.d. sequence whose members have a uniform
distribution on [0, 1] and let f : [0, 1] → [0, 1] be continuous. Define Zi =
1{f (Xi )>Yi } .
(a) Show that (1/n) ∑ni=1 Zi → ∫01 f (x) dx a.s.
(b) Show that E ((1/n) ∑ni=1 Zi − ∫01 f (x) dx)2 ≤ 1/(4n).
(c) Explain why these two results are useful.
7.2 Prove Proposition 7.6.
7.3 Prove Proposition 7.7.
7.4 Prove Corollary 7.10.
7.5 Let C be a collection of uniformly integrable random variables. Show that
C is bounded in L1 (Ω, F, P), i.e. sup{E |X| : X ∈ C} < ∞. Let (Ω, F, P) =
([0, 1], B[0, 1], λ) and Xn (ω) = n1(0,1/n) (ω). Show that {Xn : n ∈ N} is bounded
in L1 (Ω, F, P), but not uniformly integrable.
7.6 Prove Proposition 7.14.
7.7 Let C1 , . . . , Cn be uniformly integrable collections of random variables on
a common probability space. Show that ∪nk=1 Ck is uniformly integrable. (In
particular, a finite collection in L1 is uniformly integrable.)
7.8 If C1 and C2 are uniformly integrable collections in L1 (Ω, F, P), so is C1 +
C2 := {X1 + X2 : X1 ∈ C1 , X2 ∈ C2 }. Show this.
7.9 Here is uniform variation on Lemma 7.8. Let C be a class of random vari-
ables defined on some (Ω, F, P). Put νX (F ) := E |X|1F , X ∈ C, F ∈ F. Show
that C is uniformly integrable iff the following two conditions hold.
(i) C is bounded in L1 .
(ii) For all ε > 0 there exists δ > 0, such that νX (F ) < ε ∀X ∈ C, if P(F ) < δ.
7.10 Let C be a uniformly integrable collection of random variables.
(a) Consider C̄, the closure of C in L1 . Use Exercise 7.9 to show that also C̄ is
uniformly integrable.
(b) Let D be the convex hull of C, the smallest convex set that contains C.
Show that both D and its closure in L1 are uniformly integrable.
7.11 In this exercise you prove (fill in the details) the following characterization:
a collection C is uniformly integrable iff there exists a function G : R+ → R+
such that limt→∞ G(t)/t = ∞ and M := sup{E G(|X|) : X ∈ C} < ∞.
The necessity you prove as follows. Let ε > 0, choose a = M/ε and c such
that G(t)/t ≥ a for all t > c. To prove uniform integrability of C you use that
|X| ≤ G(|X|)/a on the set {|X| ≥ c}.
It is less easy to prove sufficiency. Proceed as follows. Suppose that we have a
sequence (gn ) with g0 = 0 and limn→∞ gn = ∞. Define g(t) = ∑n≥0 1[n,n+1) (t)gn
and G(t) = ∫0t g(s) ds. Check that limt→∞ G(t)/t = ∞.
With an (X) = P(|X| > n), it holds that E G(|X|) ≤ ∑n≥1 gn an (X). Furthermore,
for every k ∈ N we have ∫{|X|≥k} |X| dP ≥ ∑m≥k am (X). Pick for every n
a constant cn ∈ N such that ∫{|X|≥cn } |X| dP ≤ 2−n for all X ∈ C. Then ∑m≥cn am (X) ≤ 2−n
and hence ∑n≥1 ∑m≥cn am (X) ≤ 1. Choose then the sequence (gn ) as the
‘inverse’ of (cn ): gn = #{k : ck ≤ n}.
7.12 Prove that a collection C is uniformly integrable iff there exists an increasing
and convex function G : R+ → R+ such that limt→∞ G(t)/t = ∞ and
M := sup{E G(|X|) : X ∈ C} < ∞. (You may use the result of Exercise 7.11.)
Let D be the closure of the convex hull of a uniformly integrable collection C
in L1 . With the function G as above we have sup{E G(|X|) : X ∈ D} = M ,
whence also D is uniformly integrable.
7.13 Let p ≥ 1 and let X, X1 , X2 , . . . be random variables. Then Xn converges
to X in Lp iff the following two conditions are satisfied.
(i) Xn → X in probability,
(ii) The collection {|Xn |p : n ∈ N} is uniformly integrable.
7.14 Here is an extension of Proposition 10.9, but you can do the exercise now.
Let C be a uniformly integrable collection of random variables on some (Ω, F, P).
Let G be a family of sub-σ-algebras of F. Let D = {E [X|G] : X ∈ C, G ∈ G}
(‘in the sense of versions’). Show that also D is uniformly integrable. (Hint:
use Exercise 7.9.)
7.15 Let X, X1 , X2 , . . . be random variables that are defined on the same probability
space (Ω, F, P). Suppose that Xn → X in probability and that there exists a random
variable Y with E Y < ∞ such that |Xn | ≤ Y a.s. Show that P(|X| ≤ Y ) = 1
and that Xn → X in L1 .
7.16 Assume that X1 , X2 , . . . is an iid sequence and that X̄n → µ a.s. for some
µ ∈ R. Show that E |X1 | < ∞ and that E X1 = µ. Hint: Show first that
E |X1 | < ∞ iff ∑n≥0 P(|Xn | > n) < ∞ and use Borel-Cantelli. (This exercise
is a converse to the strong law of large numbers, Theorem 10.23.)
7.17 Here is a generalisation of Fatou’s lemma. Let (Xn ) be a sequence of
random variables such that the negative parts (Xn− ) are uniformly integrable.
Show that lim inf n→∞ E Xn ≥ E lim inf n→∞ Xn . Hint: show first that for given
ε > 0 there exists a < 0 such that E (Xn ∨ −a) ≤ E Xn + ε for all n.
8 Conditional expectation
8.1 A simple, finite case
Let X be a random variable with values in {x1 , . . . , xn } and Y a random variable
with values in {y1 , . . . , ym }. The conditional probability
P(X = xi |Y = yj ) := P(X = xi , Y = yj )/P(Y = yj )
E X̂1Ej = E X1Ej ,
the expectation of X̂ over the set Ej is the same as the expectation of X over
that set. We show this by simple computation. Note first that the values of
X1Ej are zero and xi , the latter reached on the event {X = xi } ∩ Ej that has
probability P({X = xi } ∩ Ej ). Note too that X̂1Ej = x̂j 1Ej . We then get
The just described two properties of the conditional expectation will lie at the
heart of a more general concept, conditional expectation of a random variable
given a σ-algebra, see Section 8.2.
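A small numerical sketch of this finite situation (in Python; the joint probabilities below are made-up numbers): it computes x̂j = E [X|Y = yj ] from the joint table and checks the defining property E X̂1Ej = E X1Ej for every j.

import numpy as np

x_vals = np.array([0.0, 1.0, 2.0])             # the values x_i of X
p = np.array([[0.10, 0.20],                     # p[i, j] = P(X = x_i, Y = y_j)
              [0.30, 0.15],
              [0.05, 0.20]])                    # entries sum to one

p_y = p.sum(axis=0)                             # P(Y = y_j)
xhat = (x_vals[:, None] * p).sum(axis=0) / p_y  # xhat_j = E[X | Y = y_j]
print(xhat)

# E[Xhat 1_{E_j}] = xhat_j P(Y = y_j) and E[X 1_{E_j}] = sum_i x_i P(X = x_i, Y = y_j)
print(np.allclose(xhat * p_y, (x_vals[:, None] * p).sum(axis=0)))   # True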
The random variable X̂ is a.s. the only σ(Y )-measurable random variable that
satisfies (8.1). Indeed, suppose that Z is σ(Y )-measurable and that E Z1E =
E X1E , ∀E ∈ σ(Y ). Let E = {Z > X̂}. Then (Z − X̂)1E ≥ 0 and has
expectation zero since E ∈ σ(Y ), so we have (Z − X̂)1{Z>X̂} = 0 a.s. Likewise
we get (Z − X̂)1{Z<X̂} = 0 a.s. and it then follows that Z − X̂ = 0 a.s.
Proposition 8.6 The following elementary properties hold.
(i) If X ≥ 0 a.s., then X̂ ≥ 0 a.s. If X ≥ Y a.s., then X̂ ≥ Ŷ a.s.
(ii) E X̂ = E X.
(iii) If a, b ∈ R and if X̂ and Ŷ are versions of E [X|G] and E [Y |G], then
aX̂ + bŶ is a version of E [aX + bY |G].
(iv) If X is G-measurable, then X is a version of E [X|G].
Proof (i) Let G = {X̂ < 0}. Then we have from (8.2) that 0 ≥ E 1G X̂ =
E 1G X ≥ 0. Hence 1G X̂ = 0 a.s.
(ii) Take G = Ω in (8.2).
(iii) Just verify that E 1G (aX̂ + bŶ ) = E 1G (aX + bY ), for all G ∈ G.
(iv) Obvious.
We have taken some care in formulating the assertions of the previous theorem
concerning versions. Bearing this in mind and being a bit less precise at the
same time, one often phrases e.g. (iii) as E [aX + bY |G] = aE [X|G] + bE [Y |G].
Some convergence properties are listed in the following theorem.
Proof (i) From the previous theorem we know that the X̂n form a.s. an increas-
ing sequence. Let X̂ := lim sup X̂n , then X̂ is G-measurable and X̂n ↑ X̂ a.s.
We verify that this X̂ is a version of E [X|G]. But this follows by application of
the Monotone Convergence Theorem to both sides of E 1G Xn = E 1G X̂n for all
G ∈ G.
(ii) and (iii) These properties follow by mimicking the proofs of the ordinary
versions of Fatou’s Lemma and the Dominated Convergence Theorem, Exer-
cises 8.4 and 8.5.
(ii) If Z is G-measurable such that ZX ∈ L1 (Ω, F, P), then Z X̂ is a version
of E [ZX|G]. We write ZE [X|G] = E [ZX|G].
(iii) Let X̂ be a version of E [X|G]. If H is independent of σ(X) ∨ G, then X̂
is a version of E [X|G ∨ H]. In particular, E X is a version of E [X|H] if
σ(X) and H are independent.
(iv) Let X be a G-measurable random variable and let the random variable
Y be independent of G. Assume that h ∈ B(R2 ) is such that h(X, Y ) ∈
L1 (Ω, F, P). Put ĥ(x) = E [h(x, Y )]. Then ĥ is a Borel function and ĥ(X)
is a version of E [h(X, Y )|G].
(v) If c : R → R is a convex function and E |c(X)| < ∞, then c(X̂) ≤ C,
a.s., where C is any version of E [c(X)|G]. We often write c(E [X|G]) ≤
E [c(X)|G] (Jensen’s inequality for conditional expectations).
(vi) ||X̂||p ≤ ||X||p , for every p ≥ 1.
Invoking the Monotone Convergence Theorem again results in E 1G ĥ(X) =
E 1G h(X, Y ). Since all ĥn (X) are G-measurable, the same holds for ĥ(X) and
we conclude that h ∈ V . The remainder of the proof is Exercise 8.12.
(v) Since c is convex, there are sequences (an ) and (bn ) in R such that
c(x) = sup{an x + bn : n ∈ N}, ∀x ∈ R. Hence for all n we have c(X) ≥
an X + bn and by the monotonicity property of conditional expectation, we
also have C ≥ an X̂ + bn a.s. If Nn is the set of probability zero, where this
inequality is violated, then also P(N ) = 0, where N = ∪∞n=1 Nn . Outside N we
have C ≥ supn (an X̂ + bn ) = c(X̂).
(vi) The statement concerning the p-norms follows upon choosing c(x) = |x|p
in (v) and taking expectations.
E [X1 |Sn ] = Sn /n a.s.
holds true. Argue as follows. By symmetry we have for all sets G = {Sn ∈
B} the equality E 1G X1 = E 1G Xj , for every j ∈ {1, . . . , n}. Hence we obtain
E [X1 |Sn ] = E [Xj |Sn ] and then Sn = E [Sn |Sn ] = ∑nj=1 E [Xj |Sn ] =
nE [X1 |Sn ]. Even more is true, E [X1 |Sn ] is also equal to E [X1 |Gn ], where
Gn = σ(Sn , Sn+1 , . . .). To see this, observe first that Gn = σ(Sn ) ∨ Hn , where
Hn = σ(Xn+1 , Xn+2 , . . .) and that Hn is independent of σ(X1 , Sn ). Now apply
Theorem 8.8(iii).
We conclude this section with the following loose statement, whose message
should be clear from the above results. A conditional expectation is a random
variable that has properties similar to those of ordinary expectation.
8.3 Conditional probabilities
Let F ∈ F and G a sub-σ-algebra of F. We define P(F |G) := E [1F |G], the
conditional probability of F given G. So a version of P(F |G) is a G-measurable
random variable P̂(F ) that satisfies
Theorem 8.11 Let X be a (real) random variable with law PX , a probability
measure on (R, B). There exists a regular conditional distribution of X given G.
That is, there exists a probability kernel P̂X on B × Ω with the property that
P̂X (B) is a version of P(X −1 [B]|G).
Proof We split the proof into two parts. First we show the existence of a
conditional distribution function, after which we show that it generates a regular
conditional distribution of X given G.
We will construct a conditional distribution function on the rational num-
bers. For each q ∈ Q we select a version of P(X ≤ q|G), call it G(q). Let
Erq = {G(r) < G(q)}. Assume that r > q. Then {X ≤ r} ⊃ {X ≤ q} and hence
G(r) ≥ G(q) a.s. and so P(Erq ) = 0. Hence we obtain that P(E) = 0, where
E = ∪r>q Erq . Note that E is the set where the random variables G(q) fail to be
increasing in the argument q. Let Fq = {inf r>q G(r) > G(q)}. Let {q1 , q2 , . . .}
be the set of rationals strictly bigger than q and let rn = inf{q1 , . . . , qn }.
Then rn ↓ q, as n → ∞. Since the indicators 1{X≤rn } are bounded, we
have G(rn ) ↓ G(q) a.s. It follows that P(Fq ) = 0, and then P(F ) = 0, where
F = ∪q∈Q Fq . Note that F is the event on which G(·) is not right-continuous.
Let then H be the set on which limq→∞ G(q) < 1 or limq→−∞ G(q) > 0. By a
similar argument, we have P(H) = 0. On the set Ω0 := (E ∪ F ∪ H)c , the ran-
dom function G(·) has the properties of a distribution function on the rationals.
Note that Ω0 ∈ G. Let F 0 be an arbitrary distribution function and define for
x∈R
It is easy to check that F̂ (·) is a distribution function for each hidden argument
ω. Moreover, F̂ (x) is G-measurable and since inf q>x 1{X≤q} = 1{X≤x} , we
obtain that F̂ (x) is a version of P(X ≤ x|G). This finishes the proof of the
construction of a conditional distribution function of X given G.
For every ω, the distribution function F̂ (·)(ω) generates a probability mea-
sure PX (·)(ω) on (R, B(R)). Let C be the class of Borel-measurable sets B for
which PX (B) is a version of P(X ∈ B|G). It follows that all intervals (−∞, x]
belong to C. Moreover, C is a d-system. By virtue of Dynkin’s Lemma 1.13,
C = B(R).
Proof Consider the collection H of all Borel functions h for which (8.4) is a
version of E [h(X)|G]. Clearly, in view of Theorem 8.11 the indicator functions
1B for B ∈ B(R) belong to H and so do linear combinations of them. If
h ≥ 0, then we can find nonnegative simple functions hn that converge to h
in a monotone way. Monotone convergence for conditional expectations yields
h ∈ H. If h is arbitrary, we split as usual h = h+ − h− and apply the previous
step.
Once more we emphasize that regular conditional probabilities in general don’t
exist. The general definition of conditional expectation would be pointless if
every conditional expectation could be computed by Proposition 8.12. The
good news is that in most common situations Proposition 8.12 can be applied.
In Exercise 8.8 you find an explicit expression for the regular conditional
distribution of a random variable X given another random variable Y .
8.4 Exercises
8.1 Let (Ω, F, P) be a probability space and let A = {A1 , . . . , An } be a partition of Ω, where the Ai belong to F. Let X ∈ L1 (Ω, F, P) and G = σ(A). Show that any version of E [X|G] is of the form Σ_{i=1}^n ai 1_{Ai} and determine the ai .
8.2 Let Y be a (real) random variable or random vector on a probability space
(Ω, F, P). Assume that Z is another random variable that is σ(Y )-measurable.
Use the standard machine to show that there exists a Borel-measurable function
h on R such that Z = h(Y ). Conclude that for integrable X it holds that
E [X|Y ] = h(Y ) for some Borel-measurable function h.
8.3 Prove Proposition 8.10.
8.4 Prove the conditional version of Fatou’s lemma, Theorem 8.7(ii).
8.5 Prove the conditional Dominated Convergence theorem, Theorem 8.7(iii).
8.6 Let (X, Y ) have a bivariate normal distribution with E X = µX , E Y = µY , Var X = σX^2 , Var Y = σY^2 and Cov (X, Y ) = c. Let
X̂ = µX + (c/σY^2)(Y − µY ).
Show that E (X − X̂)Y = 0. Show also (use a special property of the bivariate
normal distribution) that E (X − X̂)g(Y ) = 0 if g is a Borel-measurable function
such that E g(Y )2 < ∞. Conclude that X̂ is a version of E [X|Y ].
8.7 Let X, Y ∈ L1 (Ω, F, P) and assume that E [X|Y ] = Y and E [Y |X] = X (or
rather, versions of them are a.s. equal). Show that P(X = Y ) = 1. Hint: Start
to work on E (X − Y )1{X>z,Y ≤z} + E (X − Y )1{X≤z,Y ≤z} for arbitrary z ∈ R.
8.8 Let X and Y be random variables and assume that (X, Y ) admits a density
f w.r.t. Lebesgue measure on (R2 , B(R2 )). Let fY be the marginal density of
Y . Define fˆ(x|y) by
fˆ(x|y) = f (x, y)/fY (y) if fY (y) > 0, and fˆ(x|y) = 0 else.
Assume that E |h(X)| < ∞. Put ĥ(y) = ∫_R h(x)fˆ(x|y) dx. Show that ĥ(Y ) is a version of E [h(X)|Y ]. Show also that
P̂(E) = ∫_E fˆ(x|Y ) dx
8.12 Finish the proof of Theorem 8.8 (iv), i.e. show that the assertion also holds
without the boundedness condition on h.
9 Martingales and their relatives
In this chapter we define martingales, sub- and supermartingales. In the next
chapter we formulate convergence theorems for them and see how these can be
applied to give elegant proofs of some central results in probability theory. The
power of martingales is the combination of their main defining property, that is
shared by a rich class of special cases, and the strong convergence results that
can be obtained in spite of a seemingly innocent definition.
Remark 9.2 The equality (9.1) should be read in the sense that Mn is a version
of the conditional expectation E [Mn+1 |Fn ]. Although we have always been
careful in formulating properties of conditional expectations in terms of versions,
we will drop this care and leave it to the reader to properly interpret statements
given below concerning conditional expectations.
Remark 9.3 The definition of martingales has been given in terms of ‘one-
step-ahead’ conditional expectations. If we change (9.1) in the sense that we
replace on the left hand side E [Mn+1 |Fn ] with E [Mm |Fn ], m ≥ n + 1 arbitrary,
we obtain an equivalent definition. (Use the tower property to check this!) A
similar remark applies to the definitions of sub- and supermartingales, that we
will meet shortly.
The previous example was in terms of sums. The next one involves products.
Example 9.6 Let X be a random variable with E |X| < ∞ and F a filtration.
Put Mn = E [X|Fn ], n ≥ 0. By the tower property (see Theorem 8.8(i)) of con-
ditional expectation we obtain E [Mn+1 |Fn ] = E [E [X|Fn+1 ]|Fn ] = E [X|Fn ] =
Mn , where all equalities are to be understood in the a.s. sense. The process M
is thus a martingale.
Assume further that X ∈ L2 (Ω, F, P). We can interpret Mn as the best
prediction of X given the ‘information’ Fn . Take as a measure of ‘prediction
error’ the mean square loss E (X − Y )2 , Y ∈ L2 (Ω, Fn , P), see Proposition 8.10.
Put vn := E (Mn −X)2 . One can show that the vn are decreasing (Exercise 9.2),
which supports our intuitive understanding that with more information one
should be able to predict better.
The expressions (9.1) and (9.2) can be interpreted by saying that a martingale
follows a constant trend, whereas a submartingale displays an increasing trend.
Of course, a supermartingale fluctuates around a decreasing trend. Notice that
it also follows from (9.1) that a martingale has constant expectation, E Mn =
E M0 , for all n. On the other hand, a submartingale has increasing expectations,
as follows from (9.2).
Example 9.8 We revisit the first two examples given above. In Example 9.4
we obtain a submartingale if the Xn have positive expectation, resulting in
an increasing trend, whereas a negative expectation for the Xn turns S into a
supermartingale. In Example 9.5 we now restrict the Xn to be positive. Then
P will become a submartingale if E Xn ≥ 1 and a supermartingale for E Xn ≤ 1.
∆Xn = Xn − Xn−1 , n ≥ 1.
It trivially follows that Xn = X0 + Σ_{k=1}^n ∆Xk . Sometimes it is convenient to adopt the convention ∆X0 = X0 , from which we then obtain Xn = Σ_{k=0}^n ∆Xk .
The martingale property of an adapted integrable process X can then be for-
mulated as E [∆Xn+1 |Fn ] = 0 a.s. for n ≥ 0. For submartingales it holds that
E [∆Xn+1 |Fn ] ≥ 0 a.s.
If you want to interpret random variables ξn as the payoffs (profits or losses, depending on the sign) of your bets in the n-th game of a series, then Sn = Σ_{k=1}^n ξk
would be your accumulated total capital after n games. Here we have ∆Sn = ξn .
If S is a submartingale, you are playing a favorable game, and if S is a martingale
you are playing a fair game. It should be clear what an unfavorable game is. In
the next subsection we will investigate whether it is possible by playing clever
strategies to turn an unfavorable game into a favorable one.
You may consider a predictable process Y to be a strategy, it tells you what your
action at time n is going to be, given that you use your information available
at time n − 1. In a trivial sense, you ‘perfectly’ predict Yn at time n − 1.
Suppose that a sequence of random variables ξn (with ξ0 = 0) represents
the payoff of a game at time n when you make a unit bet, then Yn ξn would be
the payoff when you bet Yn units at time n. When you are not a clairvoyant
and have no insider information, your bet Yn cannot depend on future outcomes
ξm , m ≥ n of the game, but you are allowed to let them depend on what has been realized before, i.e. to take Y as F^ξ-predictable. The accumulated, or total, earnings up to time n are Sn = Σ_{k=1}^n Yk ξk (with S0 = 0). If we let Xn = Σ_{k=0}^n ξk , we get ∆Sn = Yn ∆Xn . We also have Sn = Σ_{k=1}^n Yk ∆Xk . This notation is a discrete time analogue of an expression like St = ∫_0^t Ys dXs . Such expressions, for suitably defined stochastic processes with a continuous time parameter, are called stochastic integrals. Because of the analogy, a process like S above is sometimes called a discrete time stochastic integral, in particular when
X is a martingale. In that case one also speaks of a martingale transform. A
common notation is to write in this case S = Y · X for the process S. Note that
the ‘dot’ here is not the multiplication operator. If Z is another (predictable)
process, we have Z · S = (ZY ) · X. The term martingale transform is justified
by the following proposition.
Proof Clearly, S is adapted. All three statements follow from the identity
E [∆Sn |Fn−1 ] = Yn E [∆Xn |Fn−1 ] a.s., which holds by virtue of Theorem 8.8(ii),
and the definitions of martingale and sub-, supermartingale.
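To make the discrete stochastic integral S = Y · X concrete, here is a small Python sketch (ours, not from the notes); the function name transform and the toy data are assumptions for illustration only. It computes Sn = Σ_{k=1}^n Yk ∆Xk ; in applications the Yk should of course be predictable, i.e. computable from the information up to time k − 1.

    def transform(y, x):
        """Martingale transform (Y.X)_n = sum_{k=1}^n y[k-1]*(x[k]-x[k-1]).

        x = [X_0, ..., X_N] is the integrator, y = [Y_1, ..., Y_N] the
        (predictable) integrand; the result starts at (Y.X)_0 = 0.
        """
        s = [0.0]
        for k in range(1, len(x)):
            s.append(s[-1] + y[k - 1] * (x[k] - x[k - 1]))
        return s

    # toy usage: constant unit bets reproduce X_n - X_0
    x = [0, 1, 0, -1, 0, 1]
    print(transform([1] * 5, x))   # [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]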
Back to the interpretation in terms of betting and games. If you play an unfavor-
able game, the accumulated ‘gains per unit bet’ process X is a supermartingale.
A (predictable) strategy Y has to be nonnegative (negative bets are usually
not allowed). Proposition 9.10 tells us that whatever strategy you play, your
accumulated gains process Y · X will still be a supermartingale, an unfavorable
game.
Although, as we have explained above, you are not able to change the nature
of the game by using predictable strategies, it is still a good question to ask for
optimal strategies. A strategy could be called optimal, if it maximizes SN at
some fixed end time N . Questions like this are answered in the theory of optimal
stopping. We don’t treat this theory in this course, but we do pay attention
to one of the basic ingredients, stopping times, sometimes also called optional
times. A formal definition follows.
Let T be a stopping time and n ∈ {0, 1, 2, . . .}. Define Tn (ω) := T (ω) ∧ n, the
minimum of T (ω) and n. Then Tn is also a stopping time. Indeed, for k < n
we have {Tn ≤ k} = {T ≤ k} ∈ Fk , whereas {Tn ≤ k} = Ω for k ≥ n. Usually
we write T ∧ n for Tn .
If X is an adapted process and T a stopping time, we define the stopped process X^T by X^T_n (ω) := X_{T(ω)∧n} (ω), n ≥ 0. Note that X^T_0 = X0 and X^T_n (ω) = X_{T(ω)} (ω) for n ≥ T (ω), abbreviated X^T_n = XT on {T ≤ n}. This explains the terminology.
For a stopping time T and an adapted process X, the obvious definition of the
random variable XT (the value of the process at the stopping time) would be
XT (ω) = X_{T(ω)} (ω), so XT = Xn on {T = n}. But since T may assume the
value ∞, there is a problem, because we often only have Xn for n < ∞. The
problem can be circumvented by defining XT only on {T < ∞} and setting it
equal to zero outside that set, which leads to XT = XT 1{T <∞} . Another way
out could be to assume that we also have a random variable X∞ , in which case
XT is properly defined everywhere. Of course the problem disappears if T is
finite, or even bounded.
Example 9.14 Here is a seemingly winning strategy for a fair game. Let (ξn)n≥1 be an iid sequence of random variables with P(ξn = ±1) = 1/2. Then the process X, Xn = Σ_{k=1}^n ξk with X0 = 0, is a martingale with E Xn = 0. For any strategy
Y , the total earnings process starting from zero is again S = Y · X. Note that
Sn = Sn−1 + Yn ξn . The strategy of interest is defined by the predictable process
Y , where Yn = 1 − Sn−1 and Y1 = 1. What happens at time n if ξn = 1? In that
case, Sn = Sn−1 + (1 − Sn−1 ) = 1 and then Yn+1 = 0, so Sn+1 = Sn = 1, and so
Sk = 1 for k ≥ n. If ξn = −1, we obtain Sn = Sn−1 − (1 − Sn−1 ) = 2Sn−1 − 1.
Hence Yn+1 = 2(1 − Sn−1 ) = 2Yn . It follows that the strategy doubles, as long
as the ξn are equal to −1, which results in Yn = 2^{n−1} on the event {ξ1 = · · · = ξn−1 = −1} and zero on its complement. As soon as ξn = 1, you stop playing and go home with your profit Sn = 1. Let T = inf{n : Sn = 1}. One checks that P(T = n) = P(ξ1 = · · · = ξn−1 = −1, ξn = 1) = 2^{−n} . Hence P(T < ∞) = 1.
Moreover ST = 1. Here is the pitfall of this strategy. The last (non-zero) bet is equal to YT = 2^{T−1} = Σ_{n=1}^∞ 2^{n−1} 1{T =n} , which assumes arbitrarily large values and has expectation E YT = Σ_{n=1}^∞ 2^{n−1} P(T = n) = ∞. Therefore, you need an infinite capital to successfully play this strategy to the very end.
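A quick simulation (ours, with hypothetical parameters) of Example 9.14 makes the pitfall visible: every run indeed ends with ST = 1, but the largest bet needed over many runs is enormous.

    import random

    # Sketch of the doubling strategy: Y_n = 1 - S_{n-1}, stop once xi_n = 1.
    random.seed(2)

    def play_once():
        s, bets = 0.0, []
        while True:
            y = 1.0 - s              # the doubling rule Y_n = 1 - S_{n-1}
            bets.append(y)
            xi = random.choice([-1, 1])
            s += y * xi
            if xi == 1:              # now S_n = 1: stop and go home
                return s, max(bets)

    results = [play_once() for _ in range(10_000)]
    print("all final gains equal 1:", all(abs(s - 1.0) < 1e-9 for s, _ in results))
    print("largest bet ever needed:", max(b for _, b in results))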
Theorem 9.15 Let X be a supermartingale and T an a.s. finite stopping time.
Then E XT ≤ E X0 under either of the assumptions
(i) X is a.s. bounded from below by a random variable Y ∈ L1 (Ω, F, P), or
(ii) T is bounded, i.e. there exists N < ∞ such that P(T ≤ N ) = 1 or
(iii) The process ∆X is bounded by a constant C and E T < ∞.
If X is a martingale, then E XT = E X0 under (ii) and (iii) and also under the
assumption (iv) that X is bounded.
Theorem 9.16 Let X be an adapted process and assume that E |Xn | < ∞ for
all n. Then there exists a predictable process A and a martingale M such that
Xn = An + Mn a.s. for n ≥ 1. The process A is a.s. unique in the sense that if
A′ + M′ is another additive decomposition of X with the same properties, then P(An = A′n , ∀n ≥ 1) = 1. Moreover, A is a.s. an increasing process iff X is a
submartingale.
Proof Note first that E |∆Xn | < ∞ for all n. We define the predictable process A as follows. For n ≥ 1, we put ∆An := E [∆Xn |Fn−1 ] (or rather, any version of it) and An = Σ_{k=1}^n ∆Ak . That A is predictable should be immediately clear. Knowing this, we define M by setting M0 = X0 and ∆Mn := ∆Xn − ∆An for n ≥ 1 and finally Mn = M0 + Σ_{k=1}^n ∆Mk . By its definition, M is a martingale,
since E [∆Mn |Fn−1 ] = 0 a.s. Note that ∆An ≥ 0 a.s. if X is a submartingale, in
which case A becomes increasing. The converse statement is as easy to prove.
To prove uniqueness, we argue as follows. Since Xn = An + Mn = A′n + M′n , we have A′n − An = M′n − Mn and so A′ − A is a predictable martingale. These properties yield A′n − An = E [A′n − An |Fn−1 ] = A′n−1 − An−1 a.s. It follows that P(∆A′n = ∆An ) = 1 for each individual n. But then also P(∆A′n = ∆An , ∀n ≥ 1) = 1, since {∆A′n = ∆An , ∀n ≥ 1} is the countable intersection of events with probability one. Since A′0 = A0 = 0, we get the assertion about uniqueness.
A martingale is called square integrable if E Mn2 < ∞ for all n.
Corollary 9.17 Let M be a square integrable martingale. Then there exists a
unique (in the sense of Theorem 9.16) increasing predictable process hM i with
hM i0 = 0 such that M 2 − hM i is a martingale. Moreover, for n ≥ 1 the random
variable ∆hM in is (a version of) the conditional variance of Mn given Fn−1 , i.e.
∆hM in = E [(Mn − E [Mn |Fn−1 ])2 |Fn−1 ] = E [(Mn − Mn−1 )2 |Fn−1 ] a.s.
It follows that ‘Pythagoras’s theorem’ holds for square integrable martingales,
E Mn^2 = E (M0 + Σ_{k=1}^n ∆Mk )^2 = E M0^2 + Σ_{k=1}^n E (∆Mk )^2 ,
Proof First we note that M 2 is a submartingale, see Exercise 9.3. Hence the
previous proposition applies and the only thing left to prove is the statement
about the conditional variance. Since M is a martingale, we have a.s.
E [(Mn − Mn−1 )^2 |Fn−1 ] = E [Mn^2 − 2Mn Mn−1 + Mn−1^2 |Fn−1 ]
= E [Mn^2 |Fn−1 ] − Mn−1^2
= E [Mn^2 − Mn−1^2 |Fn−1 ],
which is by the proof of Theorem 9.16 just the definition of ∆hM in .
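For a concrete check (ours, under the assumption that M is a simple symmetric random walk): ∆hM in = E [(∆Mn )^2 |Fn−1 ] = 1, so hM in = n and ‘Pythagoras’ predicts E Mn^2 = n. The sketch below verifies this numerically.

    import random

    # M_n = xi_1 + ... + xi_n with P(xi = +1) = P(xi = -1) = 1/2, so <M>_n = n.
    random.seed(3)
    n, reps = 50, 100_000
    second_moment = 0.0
    for _ in range(reps):
        m = sum(random.choice([-1, 1]) for _ in range(n))
        second_moment += m * m
    print("empirical E M_n^2:", second_moment / reps)   # should be close to n = 50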
The process hM i of Corollary 9.17 is also called the predictable quadratic varia-
tion process of M . If M and N are two square integrable martingales, then there
exists a unique predictable process hM, N i with hM, N i0 = 0, called the pre-
dictable covariation process of M and N , such that M N − hM, N i is a martingale.
See Exercise 9.13.
Remark 9.18 Under the conditions of Corollary 9.17 it holds that E Mn2 =
E M02 + E hM in . Since hM i is an a.s. increasing process, it has an almost sure
limit hM i∞ ≤ ∞. It follows that M is bounded in L2 (supn E Mn2 < ∞) iff
E hM i∞ < ∞.
Lemma 9.19 Let X be a submartingale and T a bounded stopping time, T ≤
N say for some N ∈ N. Then E |XT | < ∞ and
Proof First we show that XT is integrable. Notice that E |XT |1{T =∞} =
E |X∞ |1{T =∞} ≤ E |X∞ | < ∞. Next, because |X| is a submartingale with
last element |X∞ |,
E |XT |1{T <∞} = Σ_{n=0}^∞ E |Xn |1{T =n}
≤ Σ_{n=0}^∞ E |X∞ |1{T =n}
= E |X∞ |1{T <∞} < ∞.
We proceed with the proof of (i). Notice that T ∧ n is a bounded stopping time
for every n. But then by Lemma 9.19 it holds a.s. that
Let n → ∞ and apply the Dominated convergence theorem to get
Together with the trivial identity E [X∞ 1F 1{T =∞} ] = E [XT 1F 1{T =∞} ] this
yields E [X∞ 1F ] = E [XT 1F ] and (i) is proved.
For the proof of (ii) we use (i) two times and obtain
and hence
9.5 Exercises
9.1 Let Ω = {0, 1}N and denote by ω = (ω1 , ω2 , . . .) a typical element of Ω. Let
Xn : Ω → R be defined by Xn (ω) = ωn (n ≥ 1) and put Fn = σ(X1 , . . . , Xn ).
Write down the elements of F1 and F2 explicitly and describe the elements of
Fn for arbitrary n. How many elements does Fn have?
9.3 Let X be a submartingale and f an increasing convex function. Assume that
E |f (Xn )| < ∞ for all n. Show that f (X) = (f (Xn ))n≥0 is a submartingale too.
If X is a martingale, then f (X) is a submartingale even if f is not increasing,
but still convex.
9.4 Let X be an adapted process and T a stopping time that is finite. Show
that XT is F-measurable.
9.5 For every n we have a measurable function fn on Rn . Let Z1 , Z2 , . . . be
independent random variables and Fn = σ(Z1 , . . . , Zn ). Show that Xn =
fn (Z1 , . . . , Zn ) defines a martingale under the conditions that E |Xn | < ∞ and
Efn (z1 , . . . , zn−1 , Zn ) = fn−1 (z1 , . . . , zn−1 ) for every n.
9.6 If S and T are stopping times, then also S + T , S ∨ T and S ∧ T are stopping
times. Show this.
9.7 Show that an adapted process X is a martingale iff E[Xn+m |Fn ] = Xn for
all n, m ≥ 0.
9.11 Let M be a martingale with E Mn2 < ∞ for every n. Let C be a bounded
predictable process and X = C · M . Show that E Xn2 < ∞ for all n and that
hXi = C 2 · hM i.
9.12 Let M be a martingale with E Mn2 < ∞ for all n and T a stopping time. We
know that the stopped process M T is a martingale too. Show that E (MnT )2 < ∞
for all n and that hM T in = hM in∧T for every n.
9.13 Let M and N be two square integrable martingales. Show that there exists
a unique predictable process hM, N i with hM, N i0 = 0 such that M N − hM, N i
is martingale. Show also that for n ≥ 1
10 Convergence theorems
The key idea behind the convergence theorems of this chapter is explained first.
Consider a sequence of real numbers (xn ). Suppose that xn → x ∈ R and
let (a, b) be any open interval containing x. Then there is N > 0 such that
xn ∈ (a, b) for n ≥ N . Hence there will be only finitely many fluctuations of
the sequence between values smaller than a and values larger than b. This is
obviously also true when x ∉ (a, b). Let’s see what happens if (xn ) doesn’t have
a limit. In that case, x = lim inf xn < x̄ = lim sup xn . Let (a, b) ⊂ (x, x̄). Then
there exists a subsequence (xnk ) such that xnk > b for all k, and a subsequence
(xmk ) such that xmk < a. Hence there are infinitely many fluctuations between
values below a and values above b. We conclude that convergence of the sequence
is equivalent to having only finitely many fluctuations from below a to above b,
for any pair of real (even rational) numbers a, b with a < b. Something similar
can be said if the limit is equal to ±∞.
Below we use upcrossings, these are defined next. Recall that inf ∅ = ∞.
It follows from the discussion above that a sequence converges, possibly with
limits ±∞, iff U (a, b) < ∞ for all a < b. If X is a stochastic process, then
we can apply the definition of upcrossings to any sequence (Xn (ω)). Then all
Bn and Sn will depend on ω, which we then view as mappings Bn , Sn : Ω →
{0, 1, 2, . . .} ∪ {∞}. The same holds for the UN (a, b) and U (a, b). In fact they
are all random variables.
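For readers who like to experiment, the sketch below (ours; it assumes the usual ‘position opened at a level ≤ a, closed at a level ≥ b’ reading of an upcrossing) counts the number of completed upcrossings of (a, b) by a finite sequence.

    def upcrossings(x, a, b):
        """Number of completed upcrossings of (a, b) by the sequence x."""
        count, below = 0, False
        for value in x:
            if not below and value <= a:
                below = True         # path has come down to a level <= a
            elif below and value >= b:
                below = False        # path has climbed to a level >= b
                count += 1
        return count

    print(upcrossings([0, 2, -1, 3, 0, 4], a=0, b=2))   # prints 3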
Proposition 10.2 Let X be an adapted process. Fix a < b. Then the Bn and
Sn are stopping times. Furthermore, UN (a, b) is FN -measurable and U (a, b) is
F∞ -measurable.
Z = Y · X. If Sk < ∞, then Z_{S_k} − Z_{B_k} > (b − a). You may think of Y as a ‘buy
low, sell high’ strategy, if X has the interpretation of a stock price. During an
upcrossing your profit will be at least b − a.
Lemma 10.3 It holds that ZN ≥ (b − a)UN (a, b) − (XN − a)− .
Proof We discern two cases. If N belongs to a downcrossing, then we have ZN ≥ (b − a)UN (a, b), since there are exactly UN (a, b) upcrossings to the left of N . Note that in this case we have N ∈ (S_{U_N} , B_{U_N +1} ]. If N falls in an upcrossing, then we can write ZN = Z_{B_{U_N +1}} + (ZN − Z_{B_{U_N +1}}) = Z_{B_{U_N +1}} + (XN − X_{B_{U_N +1}}) ≥ (b − a)UN (a, b) + XN − a ≥ (b − a)UN (a, b) − (XN − a)^− . Combining the two cases, we arrive at the assertion.
Proposition 10.4 Let X be a supermartingale that is bounded in L1 (i.e.
there is an M > 0 such that supn E |Xn | < M ). Then for all a < b it holds that
E U (a, b) < ∞ and thus U (a, b) < ∞ a.s.
Proof If X is a supermartingale, then so is Z by virtue of Proposition 9.10.
It follows that E ZN ≤ E Z0 = 0. From Lemma 10.3 we obtain 0 ≥ E ZN ≥
(b − a)E UN (a, b) − E (XN − a)− . Hence (b − a)E UN (a, b) ≤ supn E (Xn − a)− ≤
|a| + supn E |Xn | ≤ |a| + M . Since the UN (a, b) form an increasing sequence,
the Monotone Convergence Theorem yields (b − a)E U (a, b) ≤ |a| + M .
Here is the first convergence result for supermartingales, Doob’s convergence
theorem.
Theorem 10.5 Let X be a supermartingale which is bounded in L1 . Then there exists a random variable X∞ ∈ L1 (Ω, F∞ , P) such that Xn → X∞ a.s.
Proof Define X∞ := lim inf Xn+ − lim inf Xn− . Then |X∞ | ≤ lim inf Xn+ +
lim inf Xn− and from Fatou’s lemma we deduce that E |X∞ | ≤ lim inf E Xn+ +
lim inf E Xn− ≤ 2 lim inf E |Xn |, which is finite by the assumption that X is
bounded in L1 . Note that if (Xn ) has an a.s. limit, it must be a.s. equal to
X∞ . Hence, the limit -if it exists- has finite expectation and is thus a.s. finite.
Let N be the set of ω’s such that Xn (ω) doesn’t have a limit in [−∞, ∞]. Then
N = {lim inf Xn < lim sup Xn }. We can write N = ∪a<b, a,b∈Q Na,b , where
Na,b = {lim inf Xn < a < b < lim sup Xn }. On Na,b it holds that U (a, b) = ∞.
But the latter event has probability zero, in view of Proposition 10.4. Hence,
N , being a countable union of events with probability zero also has probability
zero, which concludes the proof.
Remark 10.6 Notice that the theorem only states that a.s. convergence holds,
no other type of convergence, except convergence in probability, necessarily
holds true. If X is a martingale, it is attractive to add X∞ to the sequence X
to obtain a process with time index set that includes infinity. It would be very
nice if the martingale property as in Remark 9.3 extended to m = ∞,
i.e. E [X∞ |Fn ] = Xn a.s. But this is not true in general, see Exercise 10.2. In
the next section we will give necessary and sufficient conditions under which the
extended martingale property holds.
The following corollary can be seen as a stochastic version of the elementary
result that every decreasing sequence of real numbers that is bounded from
below has a (finite) limit.
E 1G X∞ − E 1G Xn ≤ E 1G X∞ − E 1G Xm
≤ E |1G (X∞ − Xm )|
≤ E |X∞ − Xm |,
Proposition 10.9 Let ξ ∈ L1 (Ω, F, P) and let G be a family of sub-σ-algebras
of F. Write XG for any version of E [ξ|G]. Then the collection C = {XG :
G ∈ G} is uniformly integrable. In particular, if G is a filtration (Fn ) and
Xn := E [ξ|Fn ], then the process X is a uniformly integrable martingale.
Proof Let ε > 0. We have to show the existence of k > 0 such that
P(|XG | > k) ≤ E |XG | / k ≤ E |ξ| / k < δ.
Write G = {|XG | > k}. Note that G ∈ G and by (10.3) we get
in view of Lemma 7.8. This proves (10.2). For the case, where G is a filtration,
it remains to show that X is a martingale, but we have already done that in
Example 9.6.
In the next theorem we connect the results of Theorem 10.8 and Proposi-
tion 10.9. It is known as Lévy’s upward convergence theorem.
shrinking to the left. Notice that F−∞ := ∩n<0 Fn is a σ-algebra as well and can
be considered as an infimum. Similarly we extend the notion of a martingale
to have negative indexes too. So, a martingale X = (Xn )n∈Z is a sequence
of integrable random variables adapted to a filtration for which the martingale
property (9.1) is valid for all n ∈ Z. Below we need these concepts only for
n < 0. The next result is known as Lévy’s downward theorem.
Theorem 10.11 Let F be a filtration on the negative integers and X an F-
adapted martingale. Then there is an F−∞ -measurable random variable X−∞
such that Xn → X−∞ both a.s. and in L1 as n → −∞. Moreover X−∞ =
E [X−1 |F−∞ ].
Proof Since Xn = E [X−1 |Fn ] for n < −1 we have that X is uniformly inte-
grable in view of Proposition 10.9. Hence L1 convergence follows from a.s. con-
vergence (Proposition 7.15), the latter to be established now. In the proof of
Theorem 10.5, we used the upcrossings inequality of Lemma 10.3. This applies
to the present case as well, if we shift the time over a distance of N to the left.
Denote the number of upcrossings over (a, b) in a time interval from −N to 0
by U−N (a, b). Taking expectations as in the proof of Theorem 10.5, we obtain
E U−N (a, b) ≤ E (X−1 −a)− /(b−a). Hence also U−∞ (a, b) := limN →∞ U−N (a, b)
has finite expectation. The rest of the proof of existence of the a.s. limit is as
before. The characterization of the limit as a conditional expectation is as in
the proof of Theorem 10.8.
In Section 10.4 we will see a nice application of this theorem.
Remark 10.13 Let ξ1 , ξ2 , . . . be a sequence of independent random variables with expectation zero and finite second moment. Let Mn = Σ_{k=1}^n ξk and Mn* = max{|M1 |, . . . , |Mn |}. Then λ^2 P(Mn* ≥ λ) ≤ E Mn^2 , as follows by taking Xn = Mn^2 in Proposition 10.12. This can be viewed as a ‘supremum version’ of
Chebychev’s inequality.
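A small numerical sanity check (ours, for a simple symmetric random walk and an arbitrarily chosen level λ = 8) of this supremum version of Chebychev’s inequality:

    import random

    # lambda^2 * P(M*_n >= lambda) should not exceed E M_n^2 = n.
    random.seed(4)
    n, reps, lam = 30, 50_000, 8.0
    hits = 0
    for _ in range(reps):
        m, running_max = 0, 0
        for _ in range(n):
            m += random.choice([-1, 1])
            running_max = max(running_max, abs(m))
        if running_max >= lam:
            hits += 1
    print("lambda^2 * P(M*_n >= lambda):", lam * lam * hits / reps)
    print("E M_n^2 = n                 :", n)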
Lemma 10.14 Let X and Y be nonnegative random variables for which the
inequality
holds for all λ > 0. If X ∈ Lp (Ω, F, P) for 1 < p ≤ ∞, then also Y ∈ Lp (Ω, F, P) and moreover, ||Y ||p ≤ q||X||p , where q = p/(p − 1), as in Hölder’s inequality.
Proof We give the proof for p < ∞, the case p = ∞ is left to the reader.
Since (10.6) holds for all λ > 0 we can integrate both sides after multiplication with pλ^{p−2} . On the left we get the integral ∫_0^∞ pλ^{p−1} P(Y ≥ λ) dλ = E Y^p , in view of Exercise 5.11. On the right we compute using Fubini’s theorem (you check that (ω, λ) 7→ λ^{p−2} 1{Y (ω)≥λ} X(ω) is jointly measurable in ω and λ) the
integral
∫_0^∞ pλ^{p−2} E 1{Y ≥λ} X dλ = E ∫_0^∞ pλ^{p−2} 1{Y ≥λ} X dλ
= E ( X ∫_0^Y pλ^{p−2} dλ )
= (p/(p − 1)) E X Y^{p−1} .
Here is the result we were aiming at. Inequality (10.7) is known as Doob’s
Lp -inequality.
(i) It holds that X* ∈ Lp (Ω, F∞ , P) and
||X*||p ≤ q supn ||Xn ||p , (10.7)
where q = p/(p − 1).
(ii) The a.s. limit X∞ = limn→∞ Xn exists and moreover, Xn → X∞ in Lp and ||X∞ ||p = supn ||Xn ||p = limn→∞ ||Xn ||p .
Proof Again existence of M∞ as an a.s. limit can be deduced from e.g. Theo-
rem 10.8. Moreover, |Mn −M∞ | ≤ 2X ∗ , where X ∗ = supn |Mn |. An application
of the previous theorem to the nonnegative submartingale X = |M | shows that
||X ∗ ||p < ∞. The rest of the proof is as in the proof of Theorem 10.15.
Proof (i) Let ε > 0 and choose m such that |xn − x| < ε for n > m. Then we
have for n > m
|x̄n − x| ≤ Σ_{k=1}^n wkn |xk − x| + |Σ_{k=1}^n wkn − 1| |x|
≤ Σ_{k=1}^m wkn |xk − x| + Σ_{k=m+1}^n wkn |xk − x| + |Σ_{k=1}^n wkn − 1| |x|
≤ Σ_{k=1}^m wkn |xk − x| + ε Σ_{k=m+1}^n wkn + |Σ_{k=1}^n wkn − 1| |x|.
It follows from the assumptions that lim supn→∞ |x̄n − x| ≤ ε. Since ε > 0 is
arbitrary the result follows.
(ii) Given n ≥ 1, define wkn = (wk − wk−1 )/wn , for k = 1, . . . , n, and yn = Σ_{k=n+1}^∞ xk /wk . Note that yn → 0. We compute
(1/wn ) Σ_{k=1}^n xk = −(1/wn ) Σ_{k=1}^n wk (yk − yk−1 )
= −yn + (1/wn ) Σ_{k=1}^n yk (wk − wk−1 )
= −yn + Σ_{k=1}^n wkn yk .
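A numerical illustration (ours, with the hypothetical choice wn = n and xk = (−1)^k √k, for which Σ_k xk /wk converges) of the conclusion of part (ii):

    import math

    # (1/w_n) * sum_{k<=n} x_k should tend to 0 as n grows.
    partial = 0.0
    for n in range(1, 10_001):
        partial += (-1.0) ** n * math.sqrt(n)
        if n in (10, 100, 1000, 10_000):
            print(f"n = {n:6d}: (1/w_n) sum x_k = {partial / n:+.5f}")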
Corollary 10.20 Let X1 , X2 , . . . be an independent sequence with E Xk = 0 and σk^2 := Var Xk < ∞ for all k. Let αn = Σ_{k=1}^n σk^2 . If αn → ∞, then (1/αn ) Σ_{k=1}^n Xk → 0 a.s.
The assertion of Corollary 10.20 is the strong law for a sequence of independent random variables with a finite second moment. If the sequence is moreover iid, then we get (1/n) Σ_{k=1}^n Xk → 0, the usual strong law of large numbers. The
assumption that an iid sequence has finite second moments can be dropped,
whereas the strong law still holds. This is the content of Theorem 10.23 below,
whose proof is based on completely different arguments.
We introduce some terminology. Let X1 , X2 , . . . be a sequence of random vari-
ables. Define Tn = σ(Xn+1 , Xn+2 , . . .) and T = ∩_{n=0}^∞ Tn ; T is called the tail
σ-algebra of the sequence. The following proposition is known as Kolmogorov’s
0-1 law. It tells us that the tail σ-algebra of an independent sequence is a.s. triv-
ial. Its proof is an easy consequence of Lévy’s theorem.
All the heavy machinery that we have developed so far now pays off by having
a relatively simple proof of the Strong Law of Large Numbers for iid sequences.
Proof We’d like to apply Theorem 10.11. The first thing we need is a filtration that is defined on the negative integers. The following choice turns out to be a clever one. Let Sn = Σ_{k=1}^n Xk and put F−n = σ(Sn , Sn+1 , . . .), n ≥ 1. Put M−n = X̄n . In Example 8.9 we have seen that E [X1 |F−n ] = M−n . It follows
from Theorem 10.11 that there exists M−∞ such that M−n → M−∞ both
a.s. and in L1 . We proceed by identifying the limit. From Example 10.22, we
know that M−∞ has to be equal to a constant a.s. But Theorem 10.11 also tells
us that E M−∞ = E M−1 , which is equal to µ. Hence M−∞ = µ a.s.
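As a hedged illustration (ours) of the strong law just proved, one can follow the running averages along a single simulated path of iid Exp(1) variables, whose mean is µ = 1:

    import random

    random.seed(5)
    total = 0.0
    for n in range(1, 100_001):
        total += random.expovariate(1.0)      # iid Exp(1) draws, E X_1 = 1
        if n in (10, 100, 1000, 10_000, 100_000):
            print(f"n = {n:6d}: running average = {total / n:.4f}")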
10.5 Exercises
10.1 Prove Proposition 10.2. Show also that the process Y below that proposi-
tion is predictable.
10.2 Consider the probability space (Ω, F, P) with Ω = [0, 1), F the Borel
sets of [0, 1) and P the Lebesgue measure. Let I_k^n = [k 2^{−n} , (k + 1) 2^{−n} ) for k = 0, . . . , 2^n − 1 and Fn the σ-algebra generated by the I_k^n for k = 0, . . . , 2^n − 1. Define Xn = 1_{I_0^n} 2^n . Show that Xn is a martingale and that the conditions of
Theorem 10.5 are satisfied. What is X∞ in this case? Do we have Xn → X∞
in L1 ?
10.3 Let X be a submartingale with supn≥0 E|Xn | < ∞. Show that there exists
a random variable X∞ such that Xn → X∞ a.s.
10.4 Show that for a supermartingale X the condition sup{E |Xn | : n ∈ N} < ∞
is equivalent to the condition sup{E Xn− : n ∈ N} < ∞.
10.5 Let Y ∈ L1 (Ω, F, P) and a filtration (Fn ) be given. Define for all n ∈ N
the random variable Xn = E [Y |Fn ]. We know that there is X∞ such that Xn → X∞ a.s. Show that for Y ∈ L2 (Ω, F, P), we have Xn → X∞ in L2 . Find a
condition such that X∞ = Y . Give also an example in which P(X∞ = Y ) = 0.
10.6 Let X = (Xn )n≤0 be a (backward) supermartingale.
(a) Show equivalence of the next two properties:
(i) supn E|Xn | < ∞ and (ii) limn→−∞ EXn < ∞.
(Use that x 7→ x+ is convex and increasing.)
(b) Under the condition supn E|Xn | =: A < ∞ the supermartingale X is uni-
formly integrable. To show this, you may proceed as follows (but other
solutions are equally welcome). Let ε > 0 and choose K ∈ Z such that
for all n < K one has 0 ≤ EXn − EXK < ε. It is then sufficient to
show that (Xn )n≤K is uniformly integrable. Let c > 0 be arbitrary and
Fn = {|Xn | > c}. Using the supermartingale inequality you show that
∫_{Fn} |Xn | dP ≤ ∫_{Fn} |XK | dP + ε.
Because P(Fn ) ≤ A/c you conclude the proof.
10.7 Suppose that Q is a probability measure on (Ω, F) such that Q ≪ P with dQ/ dP = Z. Denote by Pn and Qn the restrictions of P and Q to Fn (n ≥ 1). Show that Qn ≪ Pn and that
dQn / dPn = Mn ,
where Mn = EP [Z|Fn ].
10.8 Let M be a nonnegative martingale with E Mn = 1 for all n. Define
Qn (F ) = E 1F Mn for F ∈ Fn (n ≥ 1). Show that for all n ≥ 1 and k ≥ 1 one
has Qn+k (F ) = Qn (F ) for F ∈ Fn . Assume that M is uniformly integrable. Show that there exists a probability measure Q on F∞ = σ(∪n Fn ) that is
absolutely continuous w.r.t. P and that is such that for all n the restriction of
Q to Fn coincides with Qn . Characterize dQ/ dP.
10.9 Consider Theorem 10.15. Show that ||Xn ||p is increasing in n.
10.10 Let (Xn ) be a sequence of random variables with finite a.s. limit X.
Assume the existence of a random variable Y ≥ 0 with E Y < ∞ such that for
all n it holds that |Xn | ≤ Y . Let (Fn ) be an arbitrary filtration. Hunt’s lemma
states
E [Xn |Fn ] → E [X|F∞ ] a.s.
(a) Put Zm = supk≥m |Xk − X|. Show that Zm converges to zero, both in L1
and a.s.
(b) Show also that for n ≥ m:
P(sup_{k≤n} Mk ≥ x) ≤ exp(−x^2 / (2 Σ_{i=1}^n c_i^2 )).
(a) Let Rn = Zn^{1/2} . Show that rn := E Rn ≤ 1.
(b) Let N be the martingale defined by Nn = Π_{i=1}^n Ri /ri . Assume that
Π_{k=1}^∞ rk > 0. (10.8)
Show that there exists a random variable N∞ such that Nn → N∞ a.s.
Show (use Kronecker’s lemma) that Mn /(1 + An )^p has an a.s. finite limit.
11 Local martingales and Girsanov’s theorem
The purpose of this section is to formulate and prove Girsanov’s theorem on
absolutely continuous changes of measure for discrete time processes.
It is obvious that any martingale is a local martingale, but the converse is not
true, see Exercise 11.2. Note also that the definition implies that X0 = X0^{T^n} belongs to L1 (Ω, F, P). There exist alternative definitions for local martingales
that also allow for non-integrable X0 , but we don’t consider these.
Proposition 11.3 The following results on martingale transforms hold true.
(i) If Y is a predictable process and M a martingale, then X = Y · M is a
local martingale.
(ii) If X is a local martingale, then there exists a predictable process Y and
martingale M such that X = X0 + Y · M .
(iii) If Y is a predictable process and M a local martingale, then Y · M is a
local martingale.
un := Σ_{k≥0} (k + 1)^{−3} ∆Xn 1_{Ank} . Then un ∈ Fn , E |un | < ∞ and E [un |Fn−1 ] = 0. Thus (Mn ) with Mn = Σ_{k=1}^n uk is a martingale with M0 = 0. Put then Yn = Σ_{k≥0} (k + 1)^3 1_{Ank} and we see that ∆Xn = Yn ∆Mn .
(iii) From (ii) it follows that M , being a local martingale, must be of the form M = M0 + Y ′ · M ′ for a predictable process Y ′ and a martingale M ′ . This implies X = Y · (Y ′ · M ′ ) = Y Y ′ · M ′ and the result follows from (i).
Proof We first show that E Xk < ∞ for all k. Let (T^n ) be a localizing sequence. By Fatou’s lemma and by the martingale property of X^{T^n} we have
E Xk ≤ lim inf_{n→∞} E X_k^{T^n} = E X0 .
Furthermore, X_k^{T^n} ≤ Σ_{i=0}^k Xi , of which we now know that this upper bound has finite expectation. Hence we can apply the Dominated Convergence Theorem in E [X_k^{T^n} |Fk−1 ] = X_{k−1}^{T^n} to get E [Xk |Fk−1 ] = Xk−1 .
11.2 Quadratic variation
If X and Y are two stochastic processes, their optional quadratic covariation
process [X, Y ] is defined as
[X, Y ]_n = Σ_{k=1}^n ∆Xk ∆Yk .
For X = Y we write [X] instead of [X, Y ]. Note the integration by parts formula
This property carries over to the case where X and Y are local martingales under
the extra conditions E [∆Xn2 |Fn−1 ] < ∞ and E [∆Yn2 |Fn−1 ] < ∞, Exercise 11.3,
and under this condition, XY − hX, Y i is a local martingale.
One easily verifies that Z is a martingale under the probability measure P. Note
that Q(Zn > 0) = 1, but P(Zn > 0) may be less than 1, unless Pn ∼ Qn , in
which case P(F ) = EQ [ (1/Zn ) 1F ] for F ∈ Fn . Furthermore, on the set {Zn = 0} also
Zn−1 = 0, as follows from the martingale property of the nonnegative process
Z. Hence we can define Zn /Zn−1 with the understanding that it is put equal
to zero on {Zn−1 = 0}.
If M is a martingale under P, it is usually not a martingale under Q. We shall
see below how to transform M parallel to the measure change from P to Q such
that the changed process is a martingale again, or a local martingale in a more
general setting. The following lemma generalizes Equation (6.11) to conditional
expectations.
E [XZ1G ] = EQ [X1G ]
= EQ EQ [X1G |G]
= E (ZEQ [X|G]1G )
= E (E [ZEQ [X|G]1G |G])
= E (E [Z|G]EQ [X|G]1G ) .
Remark 11.8 One easily shows that Z = 0 P-a.s. on the set N := {E [Z|G] =
0}. Hence E [XZ|G]1N = 0 P-a.s. It also follows that EQ [X|G]1N = 0 Q-a.s.
Lemma 11.9 Let Q be locally absolutely continuous w.r.t. P with density pro-
cess Z and let T be a stopping time. Then for F ∈ FT it holds that
(i) The process X is a martingale under Q iff the process XZ is a martingale
under P.
(ii) The process X is a local martingale under Q if the process XZ is a local
martingale under P.
(iii) If moreover P is also locally absolutely continuous w.r.t. Q, then the pro-
cess XZ is a local martingale under P if the process X is a local martingale
under Q.
Proof To prove (i) we use Lemma 11.7 and recall that E [Zn |Fn−1 ] = Zn−1 .
Note that EQ |Xn | < ∞ iff E |Xn |Zn < ∞. We have
EQ [Xn |Fn−1 ] = E [Xn Zn |Fn−1 ] / Zn−1 ,
and both(!) terms on the right hand side are martingales (Exercise 11.6) under P, we can apply part (i) to deduce that X^{T^n} is a martingale under Q. Hence X is a local martingale under Q.
(iii) Let X be a local martingale under Q and (T n ) be a localizing sequence
for it. Then Q(limn→∞ T n < ∞) = 0 and by the argument in the proof of (ii)
also P(limn→∞ T n < ∞) = 0, since P is given to be locally absolutely continuous
w.r.t. Q. Similar to Equation (11.11) we have
X_k^{T^n} Z_k^{T^n} = X_k^{T^n} Zk − X_{T^n} (Zk − Z_{T^n} ) 1{k ≥ T^n} ,
again with two martingales under P on the right hand side, and so XZ is a local
martingale under P.
M̃^Q := M − (1/Z_− ) · hM, Zi
(ii) if moreover E [ |∆Mn |Zn |Fn−1 ] < ∞ a.s. for all n, then also M̃^Q is a local martingale under Q.
Proof The property Q(Zn > 0) = 1 allows for division by Zn a.s. under Q.
(i) Let us compute
∆M^Q_n = ∆Mn − ∆Mn ∆Zn /Zn
= (∆Mn /Zn )(Zn − ∆Zn )
= Zn−1 ∆Mn /Zn .
Hence, by Lemma 11.7 with Z = Zn , G = Fn−1 and E [Z|G] = Zn−1 we get
∆(M̃^Q Z)_n = Zn−1 ∆M̃^Q_n + M̃^Q_{n−1} ∆Zn + ∆M̃^Q_n ∆Zn .
Hence
The first two terms are obviously local martingale differences under P by virtue
of Proposition 11.3, whereas the same holds true for the difference ∆M̃^Q_n ∆Zn − ∆hM, Zin under the stated assumption, see Section 11.2.
The second assertion of Theorem 11.11 is the most appealing one, since it allows
to write
M = M̃^Q + (1/Z_− ) · hM, Zi,
where the first term is a local martingale under Q and the second term is
predictable. This gives the (generalized) Doob decomposition of M under Q.
The terms Z and Z− in the denominator in M^Q and M̃^Q are annoying in the sense that they may be zero with positive P-probability, although with Q-probability one they are strictly positive. This implies that in general M^Q and M̃^Q are not well defined under P. Clearly, this annoyance disappears under the
additional condition that P and Q are locally equivalent. Alternatively, there
exists another way to transform the given local martingale under P into a local
martingale under Q in which all terms involved are also well defined under P.
We use the following notation.
αk = (Zk /Zk−1 ) 1{Zk−1 >0} .
As we observed before, on the set {Zk−1 = 0} also Zk = 0. Hence, adopting the convention 0/0 = 0, we may also write αk = Zk /Zk−1 .
Theorem 11.12 Let M be a local martingale under P and let Q be locally
absolutely continuous w.r.t. P. Assume E [|∆Mn |αn |Fn−1 ] < ∞ P-a.s. for all
n ≥ 1. Then the process M̂^Q defined by
M̂^Q_n := Mn − Σ_{k=1}^n E [αk ∆Mk |Fk−1 ]
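Here is a worked toy example (ours, not from the notes) of this compensation. Under P let the steps ξk be iid fair ±1 coin flips and Mn = ξ1 + · · · + ξn , a P-martingale. Let Q make the steps iid with Q(ξk = 1) = p. Then αk equals 2p or 2(1 − p) according to the sign of ξk , so E [αk ∆Mk |Fk−1 ] = 2p − 1 and M̂^Q_n = Mn − n(2p − 1), which should be a martingale under Q. The sketch simulates directly under Q and checks that the compensated process has mean close to zero; p, n and reps are our hypothetical choices.

    import random

    random.seed(6)
    p, n, reps = 0.7, 40, 50_000
    mean_compensated = 0.0
    for _ in range(reps):
        m = sum(1 if random.random() < p else -1 for _ in range(n))   # walk under Q
        mean_compensated += m - n * (2 * p - 1)                       # Mhat^Q_n
    print("E_Q[ Mhat^Q_n ] ~", mean_compensated / reps)               # close to 0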
11.4 Exercises
11.1 Consider a probability space (Ω, F, P). Suppose X ≥ 0 and that G is a
sub-σ-algebra of F. Then there exists a G-measurable random variable X̂ such
that E X1G = E X̂1G for all G ∈ G.
Equation (11.12) to determine M̃^Q .
(c) The Qn can be seen as the restriction of a probability measure Q on (Ω, F) to Fn . Is Q ≪ P?
12 Weak convergence
In this chapter we encounter yet another convergence concept for random vari-
ables, weak convergence. Although the origin of the terminology is in functional
analysis, fortunately weak convergence is also weaker than the other kinds of
convergence for random variables that we have seen so far. First we sketch a
functional analytic background of weak convergence. After that we go back to
our probabilistic environment.
Consider a complete normed vector space (X, || · ||), a Banach space, and let X ∗
be the vector space of all continuous linear functionals T : X → R, also called
the (strong) dual space of X. The operator norm of T ∈ X ∗ is defined as
||T || = sup{ |T x| / ||x|| : ||x|| ≠ 0 }.
It is known that a linear functional is continuous iff it has finite norm. Note
that we use the same symbol || · || to denote both the norm on X and the one
on X ∗ . It follows that for all x ∈ X one has |T x| ≤ ||T || ||x||. One can show
that (X ∗ , || · ||) is a Banach space. Let (xn ) be a sequence in X that converges
to x in norm, ||xn − x|| → 0. If T ∈ X ∗ , we then have
If a sequence (xn ) satisfies (12.1) for some x ∈ X and all T ∈ X ∗ , we say that
xn converges to x weakly.
Now we mimic the above by taking Y = X ∗ as the basic space and along
with Y we consider Y ∗ , also denoted by X ∗∗ . A sequence of operators (Tn ) ⊂ Y
then strongly converges to T ∈ Y , if ||Tn − T || → 0. Of course, we then also
have for all y ∗ ∈ Y ∗ that
|y ∗ Tn − y ∗ T | → 0. (12.2)
Parallelling the above, we say that Tn converges weakly to T if (12.2) holds for
all y ∗ ∈ Y ∗ .
Let’s have a look at some special linear operators in Y ∗ = X ∗∗ . Let T ∈ X ∗
and x ∈ X. Define x(T ) = T x. We can view x(·) as an element of X ∗∗ , which is
easy to check. A sequence (Tn ) ⊂ X ∗ which weakly converges to T ∈ X ∗ , then
also satisfies (12.2) for y ∗ = x(·) and we have
|Tn x − T x| → 0, ∀x ∈ X. (12.3)
with finite norm. Note that every f ∈ X is bounded. We write Cb (R) for this
space. One easily checks that Cb (R) is a linear space and one can verify by a
direct argument that it is complete. This is also known from Theorem 4.48,
upon noticing that Cb (R) is a closed subspace of L∞ (R, B, λ).
Probability theory comes in when we look at special operators on Cb (R),
originating from probability measures on (R, B). Although in this chapter we
will confine ourselves mainly to (R, B), occasionally we will hint at a more
general context. If µ is such a probability measure, we view the integral µ :
f → µ(f ) as a linear functional. It is easy to check that every µ, viewed as a
functional, has norm ||µ|| = 1. Notice that the space of probability measures
on (R, B) is not a vector space, but only a convex set. The encompassing vector
space is the set of all signed finite measures. Following the above scheme, weak∗
convergence for a sequence of probability measures (µn ) means µn (f ) → µ(f ),
for all f ∈ Cb (R). However, in probability theory, it has become a convention
to speak of weak convergence of a sequence of probability measures instead
of weak∗ convergence, partly because the above notion of weak convergence
turns out to be too strong and not useful for certain purposes, see further
down in this chapter. One may view Cb (R) as a space of test functions. Other
choices of test functions are also possible, for instance the space of continuous
functions with compact support, CK (R), or the space of continuous functions
that converge to zero at ±∞, C0 (R). One can show that for convergence of
probability measures on (R, B), the definition of convergence by using any of
these collections yields the same convergence concept. To define convergence of
more general (signed) measures (possibly even defined on more general metric, or even topological spaces, with the Borel σ-algebra), the use of the different test functions leads to different convergence concepts. To distinguish between
these, different names are used. What we have just called weak convergence
is then called narrow convergence, whereas the name weak convergence is then
reserved for C0 (R) as the space of test functions. In fact, a theorem by Riesz
says that the dual space of C0 (R) can be identified with the space of all signed
measures on (R, B), which makes C0 (R) in a sense more natural to work with, if
one considers weak convergence. On the other hand, we shall characterize weak
convergence of probability measures by their action on a relatively small but
rich enough collection of functions that are not in C0 (R), see Theorem 12.15.
Below we will adhere to the custom followed in probability theory.
12.1 Generalities
Here is the formal definition of weak convergence of probability measures on
(R, B) and of a sequence of random variables.
Xn converges weakly to X, and write Xn →^w X if it holds that µn →^w µ. In this
case, one also says that Xn converges to X in distribution.
One could guess that checking weak convergence of a sequence of random vari-
ables may be a hard job, one needs to work with all functions in Cb (R). For-
tunately there is a fine characterization in terms of distribution functions, and
these are the objects that we are quite familiar with. As a first result, we have
the following proposition. Recall the notation F (x−) = limy↑x F (y).
Since this holds for all x ∈ R, we rename x + ε as x to obtain lim inf Fn (x) ≥
F (x − ε). Let ε ↓ 0. The final statement then also follows.
Proof Look back at Theorem 3.10 and its proof. We take ([0, 1], B, λ) as our
probability space and we consider the random variables X − and Xn− (n ≥ 1),
the definition of them should be clear. Along with these random variables we
also need X + (ω) = inf{z : F (z) > ω} and the similarly defined Xn+ (ω). Fix
ω ∈ (0, 1) and take z ∈ CF with z > X + (ω). Then F (z) > ω and by the assumed
convergence, eventually all Fn (z) > ω. It follows that z ≥ lim sup Xn+ (ω). Since
this holds for all z ∈ CF , we can choose a sequence of them (there is plenty
of choice, since the complement of CF is at most countable) that decreases to
X + (ω) to obtain X + (ω) ≥ lim sup Xn+ (ω). Similar reasoning yields X − (ω) ≤
lim inf Xn− (ω). Since Xn− (ω) ≤ Xn+ (ω), we deduce X − (ω) ≤ lim inf Xn+ (ω) ≤
lim sup Xn+ (ω) ≤ X + (ω). By Exercise 3.7, we know that λ({X − < X + }) = 0
and we thus conclude that lim inf Xn+ = lim sup Xn+ = X on {X − = X + },
which has probability one. Hence the Xn and X in the assertion can be taken
as Xn = Xn+ and X = X + (or as Xn = Xn− and X = X − ).
Here is a result that gives an appealing sufficient condition for weak convergence,
when the random variables involved admit densities.
Theorem 12.6 Consider real random variables X, X1 , X2 , . . . having densities
f, f1 , f2 , . . . w.r.t. Lebesgue measure λ. Suppose that fn → f λ-a.e. Then Xn →^w X.
Proof The proof’s main ingredients are an infinite repetition of the Bolzano-
Weierstraß theorem combined with a Cantor diagonalization. First we restrict
ourselves to working on Q, instead of R, and exploit the countability of Q.
Write Q = {q1 , q2 , . . .} and consider the Fn restricted to Q. Then the sequence
(Fn (q1 )) is bounded and along some subsequence (n_k^1 ) it has a limit, ℓ(q1 ) say. Look then at the sequence F_{n_k^1} (q2 ). Again, along some subsequence of (n_k^1 ), call it (n_k^2 ), we have a limit, ℓ(q2 ) say. Note that along the thinned subsequence, we still have the limit lim_{k→∞} F_{n_k^2} (q1 ) = ℓ(q1 ). Continue like this to construct a nested sequence of subsequences (n_k^j ) for which we have that lim_{k→∞} F_{n_k^j} (qi ) = ℓ(qi ) holds for every i ≤ j. Put nk = n_k^k ; then (nk ) is a subsequence of (n_k^i ) for every i ≤ k. Hence for any fixed i, eventually nk ∈ (n_k^i ). It follows that for arbitrary i one has lim_{k→∞} F_{nk} (qi ) = ℓ(qi ). In this way we have constructed
a function ` : Q → [0, 1] and by the monotonicity of the Fn this function is
increasing.
In the next step we extend this function to a function F on R that is right-
continuous, and still increasing. We put
Note that in general F (q) is not equal to ℓ(q) for q ∈ Q, but the inequality F (q) ≥ ℓ(q) always holds true. Obviously, F is an increasing function and
by construction it is right-continuous. An explicit verification of the latter
property is as follows. Let x ∈ R and ε > 0. There is q ∈ Q with q > x
such that `(q) < F (x) + ε. Pick y ∈ (x, q). Then F (y) < `(q) and we have
F (y) − F (x) < ε. Note that it may happen that for instance limx→∞ F (x) < 1, i.e. F can be defective.
The function F is of course the one we are aiming at. Having verified that
F is a (possibly defective) distribution function, we show that Fnk (x) → F (x)
if x ∈ CF . Take such an x and let ε > 0 and q as above. By left-continuity of F
at x, there is y < x such that F (x) < F (y) + ε. Take now r ∈ (y, x) ∩ Q, then
F (y) ≤ ℓ(r), hence F (x) < ℓ(r) + ε. So we have the inequalities
Then lim sup Fnk (x) ≤ lim Fnk (q) = ℓ(q) < F (x) + ε and lim inf Fnk (x) ≥ lim inf Fnk (r) = ℓ(r) > F (x) − ε. The result follows since ε is arbitrary.
Here is an example for which the limit is not a true distribution function. Let
µn be the Dirac measure concentrated on {n}. Then its distribution function
is given by Fn (x) = 1[n,∞) (x) and hence limn→∞ Fn (x) = 0. Hence any limit
function F in Theorem 12.7 has to be the zero function, which is clearly defec-
tive.
Translated into terms concerning probability measures on (R, B), the propo-
sition seems to say that every sequence of probability measures has a weakly
converging subsequence whose limit µ is a subprobability measure, µ(R) ≤ 1.
In topological terms this would mean that the family of probability measures is relatively sequentially compact (w.r.t. the topology generated by weak convergence). But look again at the example with the Dirac measures. The integral
µn (f ) is equal to f (n), which has in general no limit for n → ∞, if f ∈ Cb (R),
although the zero measure is the only possible limit. There are a number of
ways to circumvent the problem. One of them is to replace in the definition of
weak convergence the space Cb (R) with the smaller set C0 (R). Another way-
out is to look at probability measures on the Borel sets of [−∞, ∞]. The space
C([−∞, ∞]) can be identified with C[0, 1] and in this space every continuous
function automatically has limits at the boundary points. For the sequence of
Dirac measures, we would then have the Dirac measure concentrated on {∞}
as the limit and weak convergence holds again.
The example with the Dirac measures also provides another insight why the
limit is only a defective distribution function, the point masses at n ‘disappear’
from R as n tends to infinity. A possible way out to prevent this phenomenon
is by requiring that all probability measures involved have probability one on
a fixed bounded set. This is too stringent, because it rules out many useful
distributions. Fortunately, a considerably weaker assumption suffices. For any
probability measure µ on (R, B) it holds that limM →∞ µ([−M, M ]) = 1. If F is
the distribution function of µ, we equivalently have limM →∞ (F (M )−F (−M )) =
1. Note that F (M ) − F (−M ) = µ((−M, M ]). The next condition, tightness,
gives a uniform version of this.
Remark 12.9 Note that a sequence (µn ) is tight iff every ‘tail sequence’
(µn )n≥N is tight. In order to show that a sequence is tight it is thus sufficient to
show tightness from a certain suitably chosen index on. Tightness of a sequence
is also a necessary condition for weak convergence, as we shall see later. Recall
that a distribution function F has at most countably many points of discontinuity and
that µ({x}) > 0 iff x is a discontinuity point of F . In this case {x} is called an
atom of µ.
Proof (i) Fix ε > 0 and choose M > 0 such that µ([−M, M ]) > 1 − ε. Since
the collection of all atoms of all µn is at most countable, we can choose M such
that it is not an atom of any µn and not of µ. If F and Fn are the corresponding
distribution functions, we thus have that Fn (±M ) → F (±M ). Hence, there is
N > 0 such that |Fn (±M ) − F (±M )| < ε for n > N . For these n we then
have by the triangle inequality that µn ([−M, M ]) > 1 − 3ε. Hence the sequence
(µn )n>N is tight.
(ii) The tightness condition means that for all ε > 0 we can find M > 0 such
that µn ([−M, M ]) > 1 − ε for all n. Again we may assume that the singletons M and −M are not atoms of any of the probability measures involved. Take a
subsequence as in Theorem 12.7 and choose N such that |Fnk (±M )−F (±M )| <
ε for nk > N . Then F (M ) = (F (M ) − Fnk (M )) + Fnk (M ) > 1 − 2ε and likewise
F (−M ) < 2ε. It follows that F is not defective.
All definitions and results so far generalize without difficulties to (weak) con-
vergence of sequences of random vectors with values in Rn , although some care
must be taken in formulating the statements about convergence of distribution
functions at points where the limit is continuous. Take this for granted or verify
it, if you want. Here is a useful warning. If you know of two sequences of random
variables that Xn →^w X and Yn →^w Y , it tells you a priori nothing about weak
convergence of the random vectors (Xn , Yn ), simply because nothing is known
of the joint distribution of the (Xn , Yn ). Under extra conditions something can
be said though.
Proposition 12.11 Assume for simplicity that all random variables below are
defined on the same space.
(i) If (Xn ) and (Yn ) are independent sequences with Xn →^w X and Yn →^w Y , then also (Xn , Yn ) →^w (X, Y ).
(ii) If (Xn ) and (Yn ) are sequences of random variables with Xn →^w X and Yn →^w y, where y ∈ R is a constant, then also (Xn , Yn ) →^w (X, y).
(iii) In any of the previous cases, one also has weak convergence of (g(Xn , Yn ))
for a continuous g : R2 → R. In particular there holds weak convergence
for Xn + Yn .
We close this section with presenting some results that apply to a rather general
setting, probability measures defined on separable metric spaces (S, d) endowed
with the Borel σ-algebra.
Weak convergence can be characterized by a great variety of properties, we
present some of them in the next theorem, known as the portmanteau theorem.
Recall that the boundary ∂E of a set E in a topological space is ∂E = Cl E \
Int E.
Proof We start with (i)⇒(ii), the proof of which is similar to the one of Propo-
sition 12.3. We construct a function that is one on F , and zero just a bit away
from it. Let δ > 0. If x ∉ F , then d(x, F ) > 0. Define g by
g(x) = 0 if d(x, F ) > δ, and g(x) = 1 − d(x, F )/δ if d(x, F ) ≤ δ.
As a matter of fact one can show that weak convergence takes place, if (12.4)
holds for all bounded uniformly continuous functions (Exercise 12.4). In the
present section we take this as our characterization of weak convergence.
Our goal is to prove Theorem 12.16 for which we need a couple of prepara-
tory results. The approach followed in this section is based on smoothing by
convolution and small disturbances. In the sequel ||f || denotes the sup norm of
a function f .
Lemma 12.13 Let X and Y be random variables and f a bounded uniformly
continuous function. Then, for all ε > 0, there exists δ > 0 such that
Proof Let ε > 0 be given and choose δ > 0 such that |f (x)−f (y)| < ε whenever
|x − y| < δ. Then
Lemma 12.14 Let Y, X, X1 , X2 , . . . be random variables such that for all σ > 0 it holds that Xn + σY →^w X + σY . Then also Xn →^w X (the σ = 0 case).
|E f (X) − E f (X + σY )| ≤ ε + 2||f || P(|Y | ≥ δ/σ)
and
|E f (Xn ) − E f (Xn + σY )| ≤ ε + 2||f || P(|Y | ≥ δ/σ).
Now we consider
Hence
E f (X + σY ) = E fσ (X). (12.7)
Let pσ(x) = (1/√(2πσ²)) exp(−x²/(2σ²)) be the density of a N(0, σ²) distributed random variable. The function fσ is obtained by convolution of f with the normal density pσ. By the Dominated Convergence Theorem one can show (Exercise 12.3) that fσ has derivatives of all orders, given by
fσ^(k)(x) = ∫_{−∞}^{∞} f(z) pσ^(k)(z − x) dz. (12.8)
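As a numerical sanity check of (12.7) and of the convolution formula for fσ, here is a Python/numpy sketch (not part of the notes; the choice of f and of the distribution of X is arbitrary).

# Check of E f(X + sigma*Y) = E f_sigma(X), with f_sigma = f * p_sigma.
import numpy as np

f = lambda x: np.minimum(np.abs(x), 1.0)        # bounded, uniformly continuous
sigma = 0.3
grid = np.linspace(-8.0, 8.0, 4001)
p_sigma = np.exp(-grid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def f_sigma(x):
    # convolution of f with the N(0, sigma^2) density, evaluated at x
    return np.trapz(f(x + grid) * p_sigma, grid)

rng = np.random.default_rng(1)
X = rng.standard_exponential(100_000)
Y = rng.standard_normal(100_000)
print(np.mean(f(X + sigma * Y)))                      # E f(X + sigma Y)
print(np.mean([f_sigma(x) for x in X[:5_000]]))       # E f_sigma(X), smaller sample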
Proof Suppose that E f(Xn) → E f(X) for all f ∈ C∞; then this holds in particular for any fσ obtained from a uniformly continuous f as in (12.6). In view of (12.7), we have E f(Xn + σY) → E f(X + σY) for all bounded uniformly continuous f, so Xn + σY →w X + σY for all σ > 0. Now Lemma 12.14 applies.
As a final preparation for the proof of the Central Limit Theorem we proceed with some analytic technicalities that eventually lead to the crucial inequality (12.12). Let f ∈ C∞ and put
R(x, y) = f(x + y) − f(x) − y f′(x) − (1/2) y² f″(x).
Replacing x and y above by independent random variables X and Y and taking expectations then yields
E f(X + Y) − E f(X) − E Y E f′(X) − (1/2) E Y² E f″(X) = E R(X, Y).
Let W be another random variable, independent of X, and assume that E W = E Y and E W² = E Y². Then a similar equality is valid for W, and taking the difference we obtain the inequality
|E f(X + Y) − E f(X + W)| ≤ E |R(X, Y)| + E |R(X, W)|. (12.9)
We are now going to find bounds on the remainder terms in this inequality. The mean value theorem yields for any x and y that R(x, y) = (1/6) y³ f‴(θ1(x, y)) for some θ1(x, y) between x and x + y. Alternatively, we can express R(x, y) by another application of the mean value theorem as
R(x, y) = f(x + y) − f(x) − y f′(x) − (1/2) y² f″(x) = (1/2) y² (f″(θ2(x, y)) − f″(x)),
for some θ2(x, y) between x and x + y. Let C = max{||f″||, (1/6)||f‴||}. Then we have the estimates |R(x, y)| ≤ C|y|³ and |R(x, y)| ≤ C y², and hence
E |R(X, Y)| ≤ C E |Y|³ (12.10)
and, for every ε > 0,
E |R(X, Y)| ≤ C (ε E Y² + E[Y² 1{|Y|>ε}]). (12.11)
The proof of the Central Limit Theorem is based on the following idea. Consider a sum of independent random variables S = Σ_{j=1}^{n} ξj, where n is ‘big’. If we replace one of the ξj by another random variable, then we can think of a small perturbation of S and the expectation of f(S) will hardly change. This idea will be used repeatedly: all the ξj that sum up to S will be replaced step by step with other, normally distributed, random variables. We assume that the ξj have finite second moments and expectation zero. Let η1, . . . , ηn be independent normal random variables, also independent of the ξj, with expectation zero and E ηj² = E ξj². Put Z = Σ_{j=1}^{n} ηj and notice that Z also has a normal distribution, with variance equal to Σ_{j=1}^{n} E ξj². We are interested in E f(S) − E f(Z). The following notation is convenient. Put Xj = Σ_{i=1}^{j−1} ξi + Σ_{i=j+1}^{n} ηi. Notice that S = Xn + ξn, Z = X1 + η1 and Xj + ξj = Xj+1 + ηj+1 for 1 ≤ j ≤ n − 1. These facts and an application of (12.9) yield
|E f(S) − E f(Z)| ≤ Σ_{j=1}^{n} (E |R(Xj, ξj)| + E |R(Xj, ηj)|). (12.12)
and Σ_{j=1}^{kn} Var ξnj = 1. Let for every ε > 0
Ln(ε) = Σ_{j=1}^{kn} E[ξnj² 1{|ξnj|>ε}],
with an obvious meaning of the Xnj. For the first error term in (12.13) we use the estimate of (12.11), which yields Σ_{j=1}^{kn} E |R(Xnj, ξnj)| ≤ C(ε + Ln(ε)). In view of the Lindeberg condition, this term can be made arbitrarily small. We now focus on the second error term in (12.13). Let σnj² = E ξnj² = E ηnj² and use (12.10) to obtain
Σ_{j=1}^{kn} E |R(Xnj, ηnj)| ≤ C Σ_{j=1}^{kn} E |ηnj|³ = C Σ_{j=1}^{kn} σnj³ E |N(0, 1)|³.
Hence (use Σ_{j=1}^{kn} σnj² = 1)
Σ_{j=1}^{kn} σnj³ ≤ max_j σnj Σ_{j=1}^{kn} σnj² ≤ (ε² + Ln(ε))^{1/2}.
And, again, this term can be made arbitrarily small, because of the Lindeberg condition.
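The following small simulation (Python/numpy sketch, not part of the notes; the triangular array of scaled uniforms is only an illustrative choice) shows the quantities of this setting at work: the Lindeberg term Ln(ε) becomes negligible and the row sums are approximately standard normal.

# Triangular array xi_{n1}, ..., xi_{n k_n}: independent, centered, total variance 1.
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1
for k_n in [10, 100, 1000]:
    xi = (rng.uniform(size=(10_000, k_n)) - 0.5) / np.sqrt(k_n / 12.0)
    L_n = np.mean(np.sum(xi**2 * (np.abs(xi) > eps), axis=1))    # Lindeberg term
    S_n = xi.sum(axis=1)
    # variance close to 1, and P(S_n <= 1) close to Phi(1) ~ 0.8413
    print(k_n, L_n, S_n.var(), np.mean(S_n <= 1.0))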
12.3 Exercises
12.1 Prove the remaining assertions of Proposition 12.2.
12.2 Prove Proposition 12.11.
12.3 Show, using the Dominated Convergence Theorem, that (12.8) holds.
Show also that all the derivatives are bounded functions.
12.4 Show that Xn →w X iff Ef(Xn) → Ef(X) for all bounded uniformly continuous functions f. Hint: for one implication the proof of Proposition 12.3 is instructive.
12.5 Show the implication Fn(x) → F(x) for all x ∈ CF ⇒ µn →w µ of Corollary 12.5 without referring to the Skorohod representation. First you take for given ε > 0 a K > 0 such that F(K) − F(−K) > 1 − ε. Approximate a function f ∈ Cb(R) on the interval [−K, K] by a piecewise constant function, compute the integrals of this approximating function and use the convergence of the Fn(x) at continuity points of F, etc.
12.6 Suppose that Xn →w X and that the collection {Xn, n ≥ 1} is uniformly integrable (you make a minor change in the definition of this notion if the Xn are defined on different probability spaces). Use the Skorohod representation to show that Xn →w X implies EXn → EX.
12.7 Show the following variation on Fatou's lemma: if Xn →w X, then E|X| ≤ lim inf_{n→∞} E|Xn|.
12.8 Show that the weak limit of a sequence of probability measures is unique.
12.9 Consider the N (µn , σn2 ) distributions, where the µn are real numbers and
the σn2 nonnegative. Show that this family is tight iff the sequences (µn ) and (σn2 )
are bounded. Under what condition do we have that the N (µn , σn2 ) distributions
converge to a (weak) limit? What is this limit?
12.10 The classical Central Limit Theorem says that (1/(σ√n)) Σ_{j=1}^{n} (Xj − µ) →w N(0, 1), if the Xj are iid with EXj = µ and 0 < Var Xj = σ² < ∞. Show that this follows from the Lindeberg-Feller Central Limit Theorem.
12.11 For each n we have a sequence ξn1, . . . , ξnkn of independent random variables with Eξnj = 0 and Σ_{j=1}^{kn} Var ξnj = 1. If
Σ_{j=1}^{kn} E|ξnj|^{2+δ} → 0 as n → ∞ for some δ > 0, (12.14)
then Σ_{j=1}^{kn} ξnj →w N(0, 1). Show that this follows from the Lindeberg-Feller Central Limit Theorem 12.16. Condition (12.14) is called Lyapunov's condition.
12.12 Let X1, X2, . . . , Xn be an iid sequence having a distribution function F and a continuous density f (w.r.t. Lebesgue measure). Let m be such that F(m) = 1/2. Assume that f(m) > 0 and that n is odd, n = 2k − 1, say (k = (n + 1)/2).
(a) Show that m is the unique solution of the equation F(x) = 1/2. We call m the median of the distribution of X1.
(b) Let X(1) = min{X1, . . . , Xn}, X(2) = min({X1, . . . , Xn} \ {X(1)}), etc. The resulting X(1), X(2), . . . , X(n) is called the ordered sample. The sample median Mn of X1, . . . , Xn is by definition X(k). Show that with Unj = 1{Xj ≤ m + n^{−1/2} x} we have
P(n^{1/2}(Mn − m) ≤ x) = P(Σ_j Unj ≥ k).
(c) Let pn = E Unj = P(Xj ≤ m + n^{−1/2} x), bn = (n pn(1 − pn))^{1/2}, ξnj = (Unj − pn)/bn, Zn = Σ_{j=1}^{n} ξnj, tn = (k − n pn)/bn. Note that pn ∈ (0, 1) for n large enough. Rewrite the probabilities in part (b) as P(Zn ≥ tn) and show that tn → t := −2x f(m).
(d) Show that P(Zn ≥ t) → 1 − Φ(t), where Φ is the standard normal distribution function.
(e) Show that P(Zn ≥ tn) → Φ(2f(m)x) and conclude that the Central Limit Theorem for the sample median holds:
2f(m) n^{1/2}(Mn − m) →w N(0, 1).
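A quick simulation of this conclusion (a Python/numpy sketch, not a solution of the exercise; the standard normal population, for which m = 0 and f(m) = 1/√(2π), is just a convenient example):

# Sample median of n = 2k - 1 standard normal observations.
import numpy as np

rng = np.random.default_rng(3)
n = 2 * 500 - 1                              # odd sample size, k = 500
M_n = np.median(rng.standard_normal((10_000, n)), axis=1)   # sample medians X_(k)
Z = 2 * (1.0 / np.sqrt(2 * np.pi)) * np.sqrt(n) * M_n
print(Z.mean(), Z.var())                     # should be close to 0 and 1
print(np.mean(Z <= 1.0))                     # should be close to Phi(1) ~ 0.8413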
12.13 Consider the situation of Proposition 12.3 and assume that F is continuous. For given n ∈ N, select xk such that F(xk) = k/n with k ∈ {1, . . . , n − 1} and take x0 = −∞, xn = ∞.
(a) Show that for all x ∈ R it holds that
|Fm(x) − F(x)| ≤ sup_{0≤k≤n} |Fm(xk) − F(xk)| + 1/n.
13 Characteristic functions
‘Characteristic functions are characteristic.’ In this chapter we will explain what this statement means and why it is true. We develop some theory for characteristic functions, primarily with the goal of applying it to give an alternative proof of a Central Limit Theorem.
sometimes also by dividing this expression by √(2π). What we thus see is the equality φX(u) = f̂(−u).
Proof Properties (i), (iii) and (iv) are trivial. Consider (ii). Fixing u ∈ R, we consider φ(u + t) − φ(u) for t → 0. We have
|φ(u + t) − φ(u)| = |∫ (exp(i(u + t)x) − exp(iux)) µ(dx)|
≤ ∫ |exp(itx) − 1| µ(dx).
Remark 13.3 The implication that φX+Y is the product of φX and φY for independent random variables X and Y cannot simply be reversed, as shown by the following example. Let X have a standard Cauchy distribution. From Exercise 13.4 one sees that φX(u) = exp(−|u|). With Y = X one computes that exp(−2|u|) = φ2X(u) = φX+Y(u) = φX(u)φY(u), although in this case X and Y are obviously not independent. Similarly, if X1 = . . . = Xn = X, one has φnX(u) = φX(u)ⁿ. In fact, the property that φnX = (φX)ⁿ for all n ≥ 1 characterizes the (scaled) Cauchy distributions, see Exercise 13.5.
Theorem 13.4 Let F be a distribution function and φ its characteristic function. Then, for all a < b,
lim_{T→∞} (1/2π) ∫_{−T}^{T} ((e^{−iua} − e^{−iub})/(iu)) φ(u) du = F̃(b) − F̃(a). (13.3)
Proof Let a < b. We compute, using Fubini's theorem, which we will justify below,
ΦT := (1/2π) ∫_{−T}^{T} ((e^{−iua} − e^{−iub})/(iu)) φ(u) du (13.4)
= (1/2π) ∫_{−T}^{T} ((e^{−iua} − e^{−iub})/(iu)) ∫_R e^{iux} µ(dx) du
= (1/2π) ∫_R ∫_{−T}^{T} ((e^{−iua} − e^{−iub})/(iu)) e^{iux} du µ(dx)
= (1/2π) ∫_R ∫_{−T}^{T} ((e^{i(x−a)u} − e^{i(x−b)u})/(iu)) du µ(dx) (13.5)
=: ∫_R ET(x) µ(dx).
The function g given by g(s, t) = ∫_s^t (sin y)/y dy is continuous in (s, t). Hence it is bounded on any compact subset of R². Moreover, g(s, t) → π as s → −∞ and t → ∞. Hence g, as a function on R², is bounded in s, t. We conclude that also ET(x) is bounded, as a function of T and x, a first ingredient to apply the Dominated Convergence Theorem to (13.5), since µ is a finite measure. The second ingredient is E(x) := lim_{T→∞} ET(x). From Exercise 5.9 we deduce that
∫_0^∞ (sin αx)/x dx = sgn(α) π/2.
By comparing the location of x relative to a and b, we use the value of the latter integral to obtain
E(x) = 1 if a < x < b, E(x) = 1/2 if x = a or x = b, and E(x) = 0 else.
By dominated convergence we then obtain ΦT → ∫_R E(x) µ(dx) = µ((a, b)) + (1/2) µ({a, b}) = F̃(b) − F̃(a).
Corollary 13.5 If µ and ν are two probability measures on (R, B) whose char-
acteristic functions are the same, then they coincide.
The content of Corollary 13.5 explains why characteristic functions are called characteristic: there is a bijective correspondence between probability measures and characteristic functions.
Proof Define
f(x) = (1/2π) ∫_R e^{−iux} φ(u) du. (13.6)
Since |φ| has a finite integral, f is well defined for every x. Observe that f is real valued, because φ(−u) is the complex conjugate of φ(u). An easy application of the Dominated Convergence Theorem shows that f is continuous. Note first that the limit of the integral expression in (13.3) is equal to the (Lebesgue) integral (1/2π) ∫_R ((e^{−iua} − e^{−iub})/(iu)) φ(u) du, again because of dominated convergence. We use Fubini's theorem to compute for a < b
∫_a^b f(x) dx = (1/2π) ∫_a^b ∫_R e^{−iux} φ(u) du dx
= (1/2π) ∫_R φ(u) ∫_a^b e^{−iux} dx du
= (1/2π) ∫_R ((e^{−iua} − e^{−iub})/(iu)) φ(u) du
= F(b) − F(a),
for every a and b, because of Theorem 13.4 and because ∫_a^b f(x) dx is continuous in a and b. It also follows that f must be nonnegative and so it is a density.
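The inversion formula (13.6) can also be checked numerically. The sketch below (Python/numpy, not part of the notes; the choice φ(u) = exp(−u²/2) is only an example) recovers the N(0, 1) density from its characteristic function by straightforward numerical integration.

# Numerical inversion: f(x) = (1/2pi) * integral of exp(-iux) phi(u) du.
import numpy as np

u = np.linspace(-40.0, 40.0, 20001)
phi = np.exp(-u**2 / 2)                     # characteristic function of N(0, 1)

def f(x):
    return np.real(np.trapz(np.exp(-1j * u * x) * phi, u)) / (2 * np.pi)

for x in [0.0, 1.0, 2.0]:
    print(x, f(x), np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))   # numeric vs exact density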
Remark 13.7 Note the duality between the expressions (13.2) and (13.6). Apart from the presence of the minus sign in the integral and the factor 2π in the denominator in (13.6), the transformations f ↦ φ and φ ↦ f are similar.
Like in the real case, also here probability measures are uniquely determined
by their characteristic functions. The proof of this statement can be given as
a multi-dimensional version of Exercise 13.2. As a consequence we have the
following characterization of independent random variables.
Proposition 13.8 Let X = (X1, . . . , Xk) be a k-dimensional random vector. Then X1, . . . , Xk are independent random variables iff φX(u) = Π_{i=1}^{k} φXi(ui) for all u = (u1, . . . , uk) ∈ R^k.
Proof If the Xi are independent, the statement about the characteristic func-
tions is proved in the same way as Proposition 13.2 (v). If the characteristic
function φX factorizes as stated, the result follows by the uniqueness property
of characteristic functions.
Proof Consider for fixed u the function f (x) = eiux . It is obviously bounded
and continuous and we obtain straight from Definition 12.1 that µn (f ) → µ(f ).
But µn (f ) = φn (u).
Proof Since (µn) is tight we use Proposition 12.10 to deduce that there exists a weakly converging subsequence (µnk) with a probability measure as limit. Call this limit µ. From Proposition 13.10 we know that φnk(u) → φµ(u) for all u. Hence we must have φµ = φ. We will now show that any convergent subsequence of (µn) has µ as a limit. Suppose that there exists a subsequence (µn′k) with limit µ′. Then φn′k(u) converges to φµ′(u) for all u. But, since (µn′k) is a subsequence of the original sequence, by assumption the corresponding φn′k(u) must converge to φ(u) for all u. Hence we conclude that φµ′ = φµ and then µ′ = µ.
Suppose that the whole sequence (µn) does not converge to µ. Then there must exist a function f ∈ Cb(R) such that µn(f) does not converge to µ(f). So, there is ε > 0 such that for some subsequence (n′k) we have
|µn′k(f) − µ(f)| > ε for all k. (13.7)
Using Proposition 12.10, the sequence (µn′k) has a further subsequence (µn″k) that has a limit probability measure µ″. By the same argument as above (convergence of the characteristic functions) we conclude that µ″(f) = µ(f). Therefore µn″k(f) → µ(f), which contradicts (13.7).
Characteristic functions are a tool to give a rough estimate of the tail probabilities of a random variable, useful to establish tightness of a sequence of probability measures. To that end we will use the following lemma. Check first that ∫_{−a}^{a} (1 − φ(u)) du ∈ R for every a > 0.
Theorem 13.13 Let µ1, µ2, . . . be a sequence of probability measures on (R, B) and φ1, φ2, . . . the corresponding characteristic functions. Assume that for all u ∈ R the limit φ(u) := lim_{n→∞} φn(u) exists. If φ is continuous at zero, then there exists a probability measure µ on (R, B) such that φ = φµ and µn →w µ.
Proof We will show that under the present assumptions the sequence (µn) is tight. We will use Lemma 13.12. Let ε > 0. Since φ is continuous at zero, so is u ↦ φ(u) + φ(−u), and there is δ > 0 such that |φ(u) + φ(−u) − 2| < ε if |u| < δ. Notice that φ(u) + φ(−u) is real-valued and bounded from above by 2. Hence
2 ∫_{−δ}^{δ} (1 − φ(u)) du = ∫_{−δ}^{δ} (2 − φ(u) − φ(−u)) du ∈ [0, 2δε).
By the convergence of the characteristic functions (which are bounded), the Dominated Convergence Theorem implies that for big enough n, n ≥ N say,
∫_{−δ}^{δ} (1 − φn(u)) du < ∫_{−δ}^{δ} (1 − φ(u)) du + δε.
Proof If φn(u) → φ(u) for all u ∈ R, then we can apply Theorem 13.13, because φ, being a characteristic function, is continuous at zero. Hence there is a probability measure to which the µn weakly converge. But since the φn(u) converge to φ(u), the limiting probability measure must be µ. The converse statement we have encountered as Proposition 13.10.
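As an illustration of the continuity theorem (a Python sketch using numpy, not part of the notes), one can check the pointwise convergence of the characteristic functions of the Bin(n, λ/n) distributions to that of the Poisson(λ) distribution; by Corollary 13.14 this gives the weak convergence asked for in Exercise 13.6 below.

# Pointwise convergence of characteristic functions: Bin(n, lambda/n) -> Poisson(lambda).
import numpy as np

lam = 3.0
phi_poisson = lambda u: np.exp(lam * (np.exp(1j * u) - 1))
phi_binomial = lambda u, n: (1 - lam / n + (lam / n) * np.exp(1j * u)) ** n

for u in [0.5, 1.0, 2.0]:
    for n in [10, 100, 10_000]:
        print(u, n, abs(phi_binomial(u, n) - phi_poisson(u)))    # tends to 0 in n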
Lemma 13.16 Let X be a random variable with E X² < ∞ and with characteristic function φ. Then
|φ(u) − 1| ≤ E min{2, |uX|},
|φ(u) − 1 − iu E X| ≤ E min{2|u||X|, (1/2) u² X²}
and
|φ(u) − 1 − iu E X + (1/2) u² E X²| ≤ E min{u² X², (1/6) |u|³ |X|³}.
Proof Let x ∈ R. Then |e^{ix} − 1| ≤ 2 and |e^{ix} − 1| = |∫_0^x i e^{iy} dy| ≤ |x|. Hence |e^{ix} − 1| ≤ min{2, |x|}. Since
e^{ix} − 1 − ix = i ∫_0^x (e^{iy} − 1) dy
and
e^{ix} − 1 − ix + (1/2) x² = − ∫_0^x ∫_0^y (e^{it} − 1) dt dy,
We write
Σ_{j=1}^{kn} (φnj(u) − 1) = Σ_{j=1}^{kn} (φnj(u) − 1 + (1/2) u² E ξnj²) − (1/2) u² Σ_{j=1}^{kn} E ξnj².
The last term gives the desired limit, so it suffices to show that the first term converges to zero. By virtue of Lemma 13.16, we can bound its absolute value by
Σ_{j=1}^{kn} E min{u² ξnj², (1/6) |u|³ |ξnj|³}. (13.10)
But
E min{u² ξnj², (1/6) |u|³ |ξnj|³} ≤ (1/6) |u|³ ε E ξnj² 1{|ξnj|≤ε} + u² E ξnj² 1{|ξnj|>ε}.
Hence we get that the expression in (13.10) is majorized by
(1/6) |u|³ ε Σ_{j=1}^{kn} E ξnj² + u² Ln(ε) = (1/6) |u|³ ε + u² Ln(ε),
which tends to (1/6) |u|³ ε. Since ε is arbitrary, we have proved (13.9). It then also follows that
exp(Σ_{j=1}^{kn} (φnj(u) − 1)) → exp(−(1/2) u²). (13.11)
To apply this result we have to understand that the complex numbers involved indeed have norm less than or equal to one. For the φnj(u) this is one of the basic properties of characteristic functions. But it turns out that exp(φnj(·) − 1) is a characteristic function as well (see Exercise 13.3).
Let Mn(u) = max_j |φnj(u) − 1|. Now we use the inequality |e^z − 1 − z| ≤ |z|² e^{|z|} (which easily follows from a Taylor expansion) with z = φnj(u) − 1 to bound (13.13) by
Σ_{j=1}^{kn} |φnj(u) − 1|² exp(|φnj(u) − 1|) ≤ Mn(u) e^{Mn(u)} Σ_{j=1}^{kn} |φnj(u) − 1|.
From Lemma 13.16, second assertion, we get
Σ_{j=1}^{kn} |φnj(u) − 1| ≤ (1/2) u² Σ_{j=1}^{kn} E ξnj² = (1/2) u².
On the other hand, we have max_j E ξnj² ≤ ε² + Ln(ε). Hence
max_j E ξnj² → 0. (13.14)
This proves (13.12) and hence it completes the proof of the theorem.
13.4 Exercises
13.1 Prove Corollary 13.5.
13.2 Let µ and ν be probability measures on (R, B) having corresponding characteristic functions φµ and φν.
(a) Show that ∫_R exp(−iuy) φµ(y) ν(dy) = ∫_R φν(x − u) µ(dx).
(b) Assume that ν is the N(0, 1/σ²) distribution; then φν(u) = exp(−(1/2) u²/σ²). Let fσ² be the density of the N(0, σ²) distribution. Show that
(1/2π) ∫_R exp(−iuy) φµ(y) exp(−(1/2) σ² y²) dy = ∫_R fσ²(u − x) µ(dx),
and show that the right hand side gives the density of X + Y, where X has distribution µ, Y has the N(0, σ²) distribution and X and Y are independent (see also Exercise 5.6).
(c) Write Y = σZ, with Z having the standard normal distribution (X and Z independent). It follows that φµ and σ determine the distribution of X + σZ for all σ > 0. Show that φµ uniquely determines µ. (This gives an alternative proof of the assertion of Corollary 13.5.)
13.3 Let X1, X2, . . . be a sequence of iid random variables and N a Poisson(λ) distributed random variable, independent of the Xn. Put Y = Σ_{n=1}^{N} Xn. Let φ be the characteristic function of the Xn and ψ the characteristic function of Y. Show that ψ = exp(λφ − λ).
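A Monte Carlo check of this formula (Python/numpy sketch; the choice of exponentially distributed Xn, for which φ(u) = 1/(1 − iu) by Exercise 13.4(c), is arbitrary and not part of the exercise):

# Compound Poisson sum Y = X_1 + ... + X_N, N ~ Poisson(lambda) independent of the X_n.
import numpy as np

rng = np.random.default_rng(4)
lam, u = 2.0, 1.3
N = rng.poisson(lam, size=50_000)
Y = np.array([rng.standard_exponential(k).sum() for k in N])
print(np.mean(np.exp(1j * u * Y)))                    # empirical characteristic function of Y
print(np.exp(lam * (1.0 / (1.0 - 1j * u) - 1.0)))     # exp(lambda*(phi(u) - 1))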
13.4 Verify the formulas for the characteristic functions in each of the following cases.
(a) φN(0,1)(u) = exp(−(1/2) u²). Hint: Show that φN(0,1) is a solution to φ̇(u) = −u φ(u).
(b) φN(µ,σ²)(u) = exp(iuµ − (1/2) σ² u²).
(c) If X has an exponential distribution with parameter λ, then φX(u) = λ/(λ − iu).
(d) If X has a Cauchy distribution, then φX(u) = exp(−|u|). Show also that X̄n := (1/n) Σ_{k=1}^{n} Xk has a Cauchy distribution, if the Xk are iid and Cauchy distributed. Hence, in this case the averages X̄n have the same distribution as X1 for every n.
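The last statement in (d) can be seen in a small simulation (Python/numpy, illustrative only): averages of iid standard Cauchy variables do not concentrate, no matter how many terms are averaged.

# P(|X_bar_n| > 1) equals 1/2 for the standard Cauchy distribution, for every n.
import numpy as np

rng = np.random.default_rng(5)
for n in [1, 10, 1000]:
    means = rng.standard_cauchy((10_000, n)).mean(axis=1)
    print(n, np.mean(np.abs(means) > 1.0))            # stays near 0.5 for every n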
13.5 Let φ be a real characteristic function with the property that φ(nu) = φ(u)ⁿ for all u ∈ R and n ∈ N. Show that for some α ≥ 0 it holds that φ(u) = exp(−α|u|). Let X have characteristic function φ(u) = exp(−α|u|). If α > 0, show that X admits the density x ↦ (1/π) α/(α² + x²). What is the distribution of X if α = 0?
13.6 Let Xn have a Bin(n, λ/n) distribution (for n > λ). Show that Xn →w X, where X has a Poisson(λ) distribution.
13.7 Let X and Y be independent, and assume that Y has a N(0, 1) distribution. Let σ > 0. Let φ be the characteristic function of X: φ(u) = E exp(iuX).
(a) Show that Z = X + σY has density p(z) = (1/(σ√(2π))) E exp(−(z − X)²/(2σ²)).
(b) Show that p(z) = (1/(2πσ)) ∫_R φ(−y/σ) exp(iyz/σ − (1/2) y²) dy.
13.8 Let X, X1 , X2 , . . . be a sequence of random variables and Y a N (0, 1)-
distributed random variable independent of that sequence. Let φn be the char-
acteristic function of Xn and φ that of X. Let pn be the density of Xn + σY
and p the density of X + σY .
(a) If φn → φ pointwise, then pn → p pointwise. Invoke Exercise 13.7 and the
dominated convergence theorem to show this.
(b) Let f : R → R be bounded by B. Show that |Ef(Xn + σY) − Ef(X + σY)| ≤ 2B ∫ (p(z) − pn(z))⁺ dz.
(c) Show that |Ef (Xn + σY ) − Ef (X + σY )| → 0 (with f bounded) if φn → φ
pointwise.
(d) Prove Corollary 13.14: Xn →w X iff φn → φ pointwise.
13.9 Let Y be a random variable with a Gamma(t, 1) distribution, so it has density (1/Γ(t)) y^{t−1} e^{−y} 1{y>0}, where Γ(t) = ∫_0^∞ y^{t−1} e^{−y} dy for t > 0. Put Xt = (Y − t)/√t.
(a) Show that Xt has a density on (−√t, ∞) given by
ft(x) = (√t/Γ(t)) (x√t + t)^{t−1} e^{−(x√t + t)}.
(b) Show that the characteristic function φt(u) = E e^{iuXt} of Xt is given by
φt(u) = e^{−iu√t} / (1 − iu/√t)^t,
and conclude that φt(u) → e^{−u²/2} as t → ∞.
(c) Show that
(t^{t−1/2} e^{−t})/Γ(t) = (1/2π) ∫_{−∞}^{∞} φt(u) du,
and conclude that
lim_{t→∞} Γ(t)/(√(2π) e^{−t} t^{t−1/2}) = 1.
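The limit in part (c) is Stirling's formula; a quick numerical look (Python, standard library only; not part of the exercise):

# Gamma(t) ~ sqrt(2 pi) e^{-t} t^{t - 1/2} as t -> infinity.
import math

for t in [1.0, 5.0, 20.0, 100.0]:
    stirling = math.sqrt(2 * math.pi) * math.exp(-t) * t ** (t - 0.5)
    print(t, math.gamma(t) / stirling)                # ratio tends to 1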
13.10 Let X be a random variable defined on some probability space (Ω, F, P)
and let G be a sub-σ-algebra of F. Let for u ∈ R the random variable φ̂(u) be
a version of E[e^{iuX} | G]. Note that we actually have a map (u, ω) ↦ φ̂(u, ω).
(a) Show that the map u ↦ φ̂(u) is continuous in L¹ (E|φ̂(u + h) − φ̂(u)| → 0 for h → 0).
(b) Show that we can take the φ̂(u) such that u ↦ φ̂(u, ω) is continuous on a set of probability one.
(c) Suppose that there exists a function φ : R → C such that φ(u) is a version
of E [eiuX |G] for each u ∈ R. Show that φ is the characteristic function of
X and that G and σ(X) are independent.
13.11 Let X, X1, X2, . . . be a sequence of d-dimensional random vectors. Show, using Remark 13.15, that Xn →w X iff for all d-dimensional column vectors a one has aᵀXn →w aᵀX. (This equivalence is known as the Cramér-Wold device and it reduces d-dimensional weak convergence to 1-dimensional weak convergence.) Suppose that X has a multivariate normal distribution with mean vector µ and covariance matrix Σ. Show that (the notation should be obvious) Xn →w N(µ, Σ) iff aᵀXn →w N(aᵀµ, aᵀΣa), for all a ∈ R^d.
14 Brownian motion
This chapter is devoted to showing the existence of Brownian motion as a well
defined mathematical object. By definition, a stochastic process X = {Xt :
t ≥ 0} with time index [0, ∞) on a probability space (Ω, F, P) is a collection of random variables {ω ↦ Xt(ω) : t ≥ 0} on this probability space.
The topic of this chapter is thus to show that an appropriate (Ω, F, P) exists
and that one can define a Brownian motion on it. As it turns out, the choice
Ω = {ω ∈ C[0, ∞) : ω(0) = 0} of real valued continuous functions on [0, ∞) is
a good one. If one defines Wt(ω) = ω(t) for ω ∈ Ω, then automatically every t ↦ Wt(ω) = ω(t) is continuous, which already settles properties (i) and (ii).
Brownian motion is perhaps the most fundamental stochastic process with a
continuous time set. The style of this chapter is to some extent expository. Some
results that we have proved in previous chapters for finite dimensional spaces
have a generalization to infinite dimensional spaces. In the present chapter these
generalizations are sometimes presented without proof.
Then ρ defines a metric on C[0, ∞) (which we use throughout this chapter) and
one can show that the metric space (C[0, ∞), ρ) is complete and separable.
Later on we need the relatively compact subsets of C[0, ∞). To describe these we introduce the modulus of continuity mT. For each x ∈ C[0, ∞) and T, δ > 0 we define
mT(x, δ) = max{|x(s) − x(t)| : s, t ∈ [0, T], |s − t| ≤ δ}.
It holds that mT(·, δ) is continuous and lim_{δ↓0} mT(x, δ) = 0 for each x and T.
The following characterization is known as the Arzelà-Ascoli theorem.
Theorem 14.2 A set A in C[0, ∞) is relatively compact (has compact closure)
iff (i) sup{|x(0)| : x ∈ A} < ∞ and (ii) the functions in A are equicontinuous:
limδ↓0 sup{mT (x, δ) : x ∈ A} = 0 for all T > 0.
Proof Assume (i) and (ii). It is sufficient to prove the existence of a sequence of
functions (xn ) ⊂ A that is uniformly convergent on every interval [0, T ]. To that
end it suffices to show that for every T ∈ N the collection AT := {x : [0, T ] →
R| x ∈ A} contains a uniformly convergent sequence (note the subtle difference
and verify that the last statement is correct). Under conditions (i) and (ii), AT
is uniformly bounded and we can apply the same technique as in the first part
of the proof of Theorem 12.7. There exists a sequence (xn ) ⊂ AT such that
(xn(q)) converges for every rational q ∈ [0, T]. For any given ε > 0, there exists N(q, ε) such that |xn(q) − xm(q)| < ε/3 for m, n ≥ N(q, ε). By condition (ii) there also exists δ > 0 such that mT(x, δ) < ε/3 for all x ∈ A, hence |x(s) − x(t)| < ε/3 if |s − t| < δ and s, t ∈ [0, T]. For given t ∈ [0, T], choose a rational q = q(t) in [0, T] such that |t − q| < δ. Since [0, T] is bounded we can select this q from a finite set of rationals qi, i ∈ I. Then N := max{N(qi, ε) : i ∈ I} is finite and for n, m ≥ N one has
|xn (t) − xm (t)| ≤ |xn (t) − xn (q)| + |xn (q) − xm (q)| + |xm (q) − xm (t)| < ε,
Cylinder sets of C[0, ∞) have the typical form {x : (x(t1 ), . . . , x(tk )) ∈ A},
where A ∈ B(Rk ) for some k ≥ 1 and t1 , . . . , tk ∈ [0, ∞). A finite dimensional
projection on (C[0, ∞), ρ) is by definition of the following type: πt1 ,...,tk (x) =
(x(t1 ), . . . , x(tk )), where the ti are nonnegative real numbers. It is easy to
see that any finite dimensional projection is continuous (Rk is endowed with
the ordinary metric). Note that cylinder sets are inverse images under finite
dimensional projections of Borel sets of Rk (k ≥ 1). Let C be the collection of
all cylinder sets and B the Borel σ-algebra on C[0, ∞) induced by the metric ρ.
Let (Ω, F) be a measurable space. A map X : Ω → C[0, ∞) is called a
random element of C[0, ∞) if it is F/B-measurable. It then follows that πt1 ,...,tk ◦
X is a random vector in Rk , for any finite dimensional projection πt1 ,...,tk , and
it is usually denoted by (Xt1 , . . . , Xtk ). One can prove that B = σ(C) and thus
that X is a random element of C[0, ∞), if all Xt are real random variables.
Moreover, if P is a probability measure on (Ω, F) and X a random element of
C[0, ∞), then the distribution PX of X on (C[0, ∞), B) is completely determined
by the distributions of all k-tuples (Xt1 , . . . , Xtk ) on Rk (k ≥ 1, ti ∈ [0, ∞)).
14.2 Weak convergence on C[0, ∞)
Let (S, ρ) be a metric space and µ, µ1 , µ2 , . . . be probability measures on the
Borel σ-algebra B(S). Like in the real case we say that µn converges weakly to µ (notation µn →w µ) iff for all f ∈ Cb(S) one has lim µn(f) = µ(f). If X, X^1, X^2, . . . are random variables defined on probability spaces (Ω, F, P) and (Ω^n, F^n, P^n) (n ≥ 1) with values in one and the same (S, ρ), we say that X^n converges in distribution to X (X^n →w X) if the laws µ^n of the X^n converge weakly to the law µ of X, equivalently, iff E f(X^n) → E f(X) for all f ∈ Cb(S).
We need a generalization of tightness as given in Definition 12.8 that is
applicable in the present context. Since any interval [−M, M ] ⊂ R is compact,
the following is reasonable. A family of probability measures Π on B(S) is
called tight if for every ε > 0, there is a compact subset K of S such that
inf{µ(K) : µ ∈ Π} > 1 − ε. One can show that any single probability measure
on B(S) is tight if (S, ρ) is a separable and complete metric space (a Polish
space). A family of random variables with values in a metric space is called
tight if the family of their distributions is tight. Like in the real case (see Proposition 12.10 and Exercise 12.14), but much harder to prove here, there is an equivalence between relative compactness (in this context it means that every sequence in a set of probability measures has a weakly converging subsequence) and tightness; this is known as Prohorov's theorem.
If we take S = C[0, ∞) with the metric ρ of the previous section, we get the
following ‘stochastic version’ of the Arzelà-Ascoli theorem.
Theorem 14.5 Let µ1, µ2, . . . be a sequence of probability measures on the space (C[0, ∞), B). This sequence is tight iff
lim_{λ→∞} sup_n µn({x : |x(0)| > λ}) = 0 (14.3)
and
lim_{δ↓0} sup_n µn({x : mT(x, δ) > ε}) = 0 for all T, ε > 0. (14.4)
Conversely, assume (14.3) and (14.4) and let ε, T > 0, T integer, be given.
Choose λT such that sup_n µn({x : |x(0)| > λT}) ≤ ε 2^{−T−1}. For each k ≥ 1 we can also find δk such that sup_n µn({x : mT(x, δk) > 1/k}) ≤ ε 2^{−T−k−1}. Notice
that the sets AT,k = {x : mT (x, δk ) ≤ 1/k} and AT,0 = {x : |x(0)| ≤ λT } are
closed and so is their intersection over both k and (integer) T, call it K. From Theorem 14.2 we obtain that K has compact closure and it is thus compact itself. Finally we compute µn(K^c) ≤ Σ_{T≥1} µn(A^c_{T,0}) + Σ_{T≥1} Σ_{k≥1} µn(A^c_{T,k}) ≤ ε.
Proof Every subsequence of (µn) is tight as well and thus has a convergent subsequence. Different subsequences have to converge to the same limit, call it µ, since the finite dimensional distributions corresponding to these sequences converge. Hence, if (µn) has a limit, it must be µ. Suppose therefore that the µn do not converge. Then there is a bounded and continuous f and an ε > 0 such that |µnk(f) − µ(f)| > ε along a subsequence (µnk). No further subsequence of this one can have µ as a limit, which contradicts what we just showed.
X^n_t = (1/(σ√n)) (S_{[nt]} + (nt − [nt]) ξ_{[nt]+1}). (14.5)
Theorem 14.7 Let 0 = t0 < t1 < · · · < tk. Then the k-vectors of increments (X^n_{t1} − X^n_{t0}, . . . , X^n_{tk} − X^n_{tk−1}) converge in distribution to a random vector (N1, . . . , Nk) with independent elements Nj having a N(0, tj − tj−1) distribution.
Proof Since the term in (14.5) with the ξ_{[nt]+1} tends to zero in probability, we can ignore it as a consequence of Proposition 14.4. But then the conclusion for each of the increments separately follows from the classical Central Limit Theorem of Exercise 12.10. Check that for each n the S_{[ntj]} − S_{[ntj−1]} (j = 1, . . . , k) are independent to complete the proof.
Both Theorems 14.9 and 14.10 are known as Donsker’s invariance principle.
What we have done in this section can be summarized by saying that we have
shown the existence of a Wiener process and we have given a Functional Central
Limit Theorem.
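To get a feeling for the invariance principle, the following simulation (Python/numpy sketch, not part of the notes; the step distribution and the level a = 1 are illustrative choices) compares a distributional functional of the rescaled random walk (14.5) with the corresponding Brownian quantity; for the running maximum, the Brownian value 2Φ(a) − 1 is a known consequence of the reflection principle.

# Rescaled random walk versus Brownian motion: P(max_{t<=1} X^n_t <= a).
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(6)
n, reps, a = 1_000, 10_000, 1.0
xi = rng.choice([-1.0, 1.0], size=(reps, n))          # iid steps, mean 0, variance 1
S = np.cumsum(xi, axis=1) / np.sqrt(n)                # X^n at the grid points j/n
walk_prob = np.mean(S.max(axis=1) <= a)               # max of X^n is attained on the grid
bm_prob = 2 * 0.5 * (1 + erf(a / sqrt(2))) - 1        # 2*Phi(1) - 1, about 0.6827
print(walk_prob, bm_prob)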
Proof Assume η < γ. Let τ = min{j : |Sj| > γ}. Then we have to consider P(τ ≤ n). Split this probability up as

lim_{δ↓0} sup_{n≥1} P(max_{|s−t|≤δ, 0≤s,t≤T} |X^n_t − X^n_s| > ε) = 0 for all T, ε > 0. (14.9)

But since we only need tightness for all but finitely many n, we can as well replace the ‘sup’ by a ‘lim sup’. Let Yt = σ√n X^n_{t/n}. Each of the probabilities in (14.9) is less than
P(max_{|s−t|≤[nδ]+1, 0≤s,t≤[nT]+1} |Yt − Ys| > εσ√n).
But, since Y is piecewise linear between the integer values of its arguments, the max is attained at integer numbers. Hence we consider
P(max_{0≤j≤[nδ]+1, 0≤k≤[nT]+1} |Sj+k − Sk| > εσ√n),
which is at most
(([nT] + 1)/([nδ] + 1) + 2) P(max_{j≤[nδ]+1} |Sj| > εσ√n/3). (14.10)
In view of (14.6) (take η = σ√(2[nδ])) the probability in (14.10) is less than
P(|S_{[nδ]+1}| > εσ√n/3 − σ√(2[nδ])).
Now we apply the central limit theorem: (1/(σ√[nδ])) S_{[nδ]+1} →w Z, where Z has a N(0, 1) distribution. So for n → ∞ the last probability tends to P(|Z| > ε/(3√δ) − √2), which is less than δ² E Z⁴/(ε/3 − √(2δ))⁴. Hence the lim sup in (14.10) for n → ∞ is less than ((Tδ + 2δ²)/(ε/3 − √(2δ))⁴) E Z⁴, from which we obtain (14.9).
the value η_k^n at the midpoint t^n_{2k−1}. So η_k^n = f(t^n_{2k−1}) − (1/2)(f(t^{n−1}_{k−1}) + f(t^{n−1}_{k})). Then we have for t ∈ (t^n_{2k−2}, t^n_{2k}) the simple formula
fn(t) − f_{n−1}(t) = (η_k^n/σn) S_k^n(t),
and hence we get for all t
fn(t) = f_{n−1}(t) + Σ_{k∈I(n)} (η_k^n/σn) S_k^n(t). (14.12)
Summing Equation (14.12) over n leads with I(0) = {1} to the following representation of fn on the whole interval [0, 1]:
fn(t) = Σ_{m=0}^{n} Σ_{k∈I(m)} (η_k^m/σm) S_k^m(t). (14.13)
Proof Let ε > 0 and choose N such that we have |f(t) − f(s)| ≤ ε as soon as |t − s| < 2^{−N}. It is easy to see that then |η_k^n| ≤ ε if n ≥ N. On the interval [t^n_{2k−2}, t^n_{2k}] we have that

This bound holds on any of the intervals [t^n_{2k−2}, t^n_{2k}]. Hence ||f − fn|| → 0.
f = Σ_{m=0}^{∞} Σ_{k∈I(m)} (η_k^m/σm) S_k^m, (14.14)
will present below.
Let, as above, I(0) = {1} and I(n) be the set {1, . . . , 2^{n−1}} for n ≥ 1. Take a double sequence of independent standard normally distributed random variables ξ_k^n, all defined on some probability space Ω, with k ∈ I(n) and n ∈ N ∪ {0}, see Theorem 3.16 or Theorem 5.12. With the aid of these random variables we are going to construct a sequence of continuous processes W^n as follows. Let, also as above, σn = 2^{−(n+1)/2}. Put
W^0(t) = t ξ_1^0
and, for n ≥ 1,
W^n(t) = W^{n−1}(t) + Σ_{k∈I(n)} ξ_k^n S_k^n(t).
Theorem 14.14 For almost all ω the functions t ↦ W^n(ω, t) converge uniformly to a continuous function t ↦ W(ω, t) and the process W : (ω, t) ↦ W(ω, t) is Brownian motion on [0, 1].
Proof We start with the following result. If Z has a standard normal distribution and x > 0, then (Exercise 14.8)
P(|Z| > x) ≤ √(2/π) (1/x) exp(−(1/2) x²). (14.18)
Let βn = max_{k∈I(n)} |ξ_k^n|. Then bn := P(βn > n) ≤ 2^{n−1} √(2/π) (1/n) exp(−(1/2) n²). Observe that Σ_n bn is convergent and that hence, by virtue of the Borel-Cantelli lemma, P(lim sup{βn > n}) = 0. Hence we can find a subset Ω̃ of Ω with P(Ω̃) = 1, such that for all ω ∈ Ω̃ there exists a natural number n(ω) with the property that |ξ_k^n(ω)| ≤ n for all n ≥ n(ω) and k ∈ I(n). Consequently, for p > n ≥ n(ω) we have
sup_t |W^n(ω, t) − W^p(ω, t)| ≤ Σ_{m=n+1}^{∞} m σm < ∞. (14.19)
This shows that the sequence W n (ω, ·) with ω ∈ Ω̃ is Cauchy in C[0, 1], so that
it converges to a continuous limit, which we call W (ω, ·). For ω’s not in Ω̃ we
define W (ω, ·) = 0. So we now have continuous functions W (ω, ·) for all ω with
the property W (ω, 0) = 0.
As soon as we have verified properties (iii) and (iv) of Definition 14.1 we know
that W is a Brownian motion. We will verify these two properties at the same
time by showing that all increments ∆j := W (tj ) − W (tj−1 ) with tj > tj−1 are
independent N(0, tj − tj−1) distributed random variables. Thereto we will prove that the characteristic function E exp(i Σ_j λj ∆j) is equal to exp(−(1/2) Σ_j λj² (tj − tj−1)).
The Haar functions form a Complete Orthonormal System of L²[0, 1] (see Exercise 14.9). Hence every function f ∈ L²[0, 1] has the representation f = Σ_{n,k} ⟨f, H_k^n⟩ H_k^n = Σ_{n=0}^{∞} Σ_{k∈I(n)} ⟨f, H_k^n⟩ H_k^n, where ⟨·, ·⟩ denotes the inner product of L²[0, 1] and where the infinite sum is convergent in L²[0, 1]. As a result we have for any two functions f and g in L²[0, 1] the Parseval identity ⟨f, g⟩ = Σ_{n,k} ⟨f, H_k^n⟩⟨g, H_k^n⟩. Taking the specific choice f = 1_{[0,t]} and g = 1_{[0,s]} results in ⟨1_{[0,t]}, H_k^n⟩ = S_k^n(t) and t ∧ s = ⟨1_{[0,t]}, 1_{[0,s]}⟩ = Σ_{n,k} S_k^n(t) S_k^n(s).
We use this property to compute the limit of Cov(W^n(t), W^n(s)). We have
Cov(W^n(t), W^n(s)) = E (Σ_{m=0}^{n} Σ_{k∈I(m)} ξ_k^m S_k^m(t)) (Σ_{p=0}^{n} Σ_{l∈I(p)} ξ_l^p S_l^p(s))
= Σ_{m=0}^{n} Σ_{k∈I(m)} S_k^m(t) S_k^m(s).
Working out the double product, we get in the exponential the sum over the four variables i, j, m ≤ n, k of −(1/2) λi λj times
S_k^m(tj) S_k^m(ti) − S_k^m(tj−1) S_k^m(ti) − S_k^m(tj) S_k^m(ti−1) + S_k^m(tj−1) S_k^m(ti−1),
and this sum converges to exp(−(1/2) Σ_j λj² (tj − tj−1)) as n → ∞. This completes the proof.
Having constructed Brownian motion on [0, 1], we proceed to show that it also exists on [0, ∞). Take for each n ∈ N a probability space (Ωn, Fn, Pn) that supports a Brownian motion W^n on [0, 1]. Consider then Ω = Π_n Ωn, F = Π_n Fn and P = Π_n Pn. Note that these Brownian motions are independent by construction. Let ω = (ω1, ω2, . . .) and define then
W(ω, t) = Σ_{n≥0} 1_{[n,n+1)}(t) (Σ_{k=1}^{n} W^k(ωk, 1) + W^{n+1}(ωn+1, t − n)). (14.20)
This obviously defines a process with continuous paths and for all t the random
variable W (t) is the sum of independent normal random variables. It is not
hard to see that the thus defined process has independent increments. It is
immediate that E W (t) = 0 and that Var W (t) = t.
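The construction of Theorem 14.14 is easy to implement. The sketch below (Python/numpy; the explicit formula for the Schauder functions follows the normalization used above, but the code and its parameters are only an illustration, not part of the notes) builds the partial sums W^n on [0, 1] and checks two elementary properties.

# Schauder-function construction: W^n(t) = sum over m <= n, k in I(m) of xi^m_k * S^m_k(t),
# where S^m_k is the integral of the Haar function H^m_k, with peak sigma_m = 2^{-(m+1)/2}.
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 1025)

def schauder(t, n, k):
    # tent on [(k-1)*2^{-(n-1)}, k*2^{-(n-1)}] with peak 2^{-(n+1)/2} at the midpoint
    left = (k - 1) * 2.0 ** (-(n - 1))
    mid = left + 2.0 ** (-n)
    right = k * 2.0 ** (-(n - 1))
    peak = 2.0 ** (-(n + 1) / 2)
    slope = 2.0 ** ((n - 1) / 2)
    up = np.clip((t - left) * slope, 0.0, peak)
    down = np.clip((right - t) * slope, 0.0, peak)
    return np.where(t < mid, up, down)

W = t * rng.standard_normal()                      # level 0: W^0(t) = t * xi^0_1
for n in range(1, 11):                             # add levels 1, ..., 10
    for k in range(1, 2 ** (n - 1) + 1):
        W = W + rng.standard_normal() * schauder(t, n, k)

print(W[-1])                                       # W(1), a single N(0, 1) draw
print(np.var(np.diff(W[::16])) * 64)               # increment variance ~ spacing 1/64, so ~ 1 after scaling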
14.6 Exercises
14.1 Consider the sequence of ‘tents’ (X^n), where X^n_t = nt for t ∈ [0, 1/(2n)], X^n_t = 1 − nt for t ∈ [1/(2n), 1/n], and zero elsewhere. (There is no real ‘randomness’ here, the tents don't depend on some ω, but one may think of the X^n defined on whatever probability space, such that X^n_t(ω) is constant in ω.) Show that all finite dimensional distributions of the X^n converge, but X^n does not converge in distribution.
14.2 Show that ρ as in (14.1) defines a metric.
14.3 Suppose that the ξi of Section 14.3 are iid normally distributed random variables. Use Doob's supremal inequality (10.5) to obtain µ(max_{j≤n} |Sj| > γ) ≤ 3γ^{−4} n².
14.4 Show that a finite dimensional projection on C[0, ∞) (with the metric ρ)
is continuous.
14.5 Consider C[0, ∞) with the Borel σ-algebra B induced by ρ and some prob-
ability space (Ω, F, P). If X : (Ω, F) → (C[0, ∞), B) is measurable, then all
maps ω ↦ Xt(ω) are random variables. Show this, as well as its converse. For
the latter you need separability that allows you to say that the Borel σ-algebra
B is a product σ-algebra. See also Remark 5.7.
14.6 Prove Proposition 14.4.
14.7 Show that the random variable Z = (1/2)(W(s) + W(t)) + (1/2)√(t − s) ξ (see the paragraph following Corollary 14.13) has a normal distribution with mean zero and variance equal to (1/2)(s + t).
14.8 Prove inequality (14.18).
14.9 The Haar functions form a Complete Orthonormal System in L²[0, 1]. Show first that the Haar functions are orthonormal. To prove that the system is complete, you argue as follows. Let f be orthogonal to all H_k^n and set F(t) = ∫_0^t f(u) du. Show that F is zero in all t = k2^{−n}, and therefore zero on the whole interval. Conclude that f = 0.
14.10 Consider the processes W^n of Section 14.5. Let t1, . . . , tk ∈ [0, 1].
(a) Show that the sequence of random vectors (W^n_{t1}, . . . , W^n_{tk}) converges in the L²-sense to (W_{t1}, . . . , W_{tk}). (Hint: this sequence is Cauchy in L².)
(b) Show that it follows that (W_{t1}, . . . , W_{tk}) has a multivariate normal distribution (see also Exercise 13.13).
14.11 Show that for a random variable X with a N (0, σ 2 ) distribution it holds
that P(|X| ≤ x) < x/σ, for x, σ > 0.
Index

algebra, 1
σ-algebra, 1
  product, 46
  tail, 110
bounded variation
  function of, 36
Brownian motion, 149
characteristic function, 137
  inversion theorem, 139
Complete orthonormal system, 158
convergence
  almost sure, 74
  in Lp, 74
  in p-th mean, 74
  in probability, 74
  weak, 123
density process, 116
distribution, 19
  joint, 51
  regular conditional distribution, 87
distribution function, 20
  decomposition, 65
  defective, 126
distribution: conditional, 87
Donsker's invariance principle, 153
Doob's decomposition, 97
downcrossing, 102
event, 7
expectation, 33
  conditional, 83
    version, 83
  generalized conditional, 114
filtration, 91
fundamental theorem of calculus, 68
Haar functions, 155
independence, 21
inequality
  Chebychev, 34
  Doob's Lp-inequality, 108
  Doob's supremal inequality, 106
  Hölder, 40
  Jensen, 35
  Markov, 34
  Minkowski, 41
integral
  Lebesgue, 25
  Lebesgue-Stieltjes, 38
  Riemann, 31
  Stieltjes, 37
interpolation, 155
law, 19
lemma
  Borel-Cantelli, 22
  Fatou, 30
  Scheffé, 31
localizing sequence, 114
martingale, 91
  local, 114
  square integrable, 98
martingale transform, 94
measurable
  Borel function, 17
  Borel set, 1
  function, 17
  set, 1
  µ-measurable set, 10
measure, 3
  absolutely continuous, 61
  complex, 58
  Jordan decomposition, 59
  Lebesgue, 3
  Lebesgue decomposition, 62
  locally absolutely continuous, 116
  locally equivalent, 116
  outer, 10
  positive, 58
  probability, 7
  product, 49
  real, signed, 58
  singular, 61
  total variation, 59
  Wiener, 153
median, 136
null set, 4
optional sampling, 98
probability, 7
  conditional, 86
  regular conditional probability, 87
probability kernel, 87
process, 91
  adapted, 91
  predictable, 94
  stochastic, 91
quadratic covariation
  optional, 116
  predictable, 98
quadratic variation
  predictable, 98
Radon-Nikodym derivative, 63
random variable, 19
random vector, 50
regular set, 66
relative compactness, 151
Schauder functions, 155
simple function, 25
space
  Lp, 41
  Lp, 40
  complete measure space, 4
  dual space, 58, 69
  measurable, 1
  measure, 3
  probability, 7
standard machine, 31
stopping time, 95
strong law of large numbers, 110
submartingale, 93
subprobability measure, 126
supermartingale, 93
system
  π-system, 5
  d-system, 5
theorem
  Arzelà-Ascoli, 149
  Carathéodory, 13
  central limit theorem, 133
  dominated convergence, 31
  Doob's convergence theorem, 103
  Fubini, 48
  Girsanov, 118
  Helly, 126
  Lévy's continuity theorem, 142
  Lévy's downward theorem, 106
  Lévy's upward theorem, 105
  Lebesgue, 31
  Lindeberg-Feller, 133
  monotone class theorem, 18
  monotone convergence, 29
  portmanteau, 129
  Prohorov, 151
  Radon-Nikodym, 63
  Riesz-Fréchet, 58
  Skorokhod's representation, 20, 125
  tightness, 128
uniform integrability, 77
upcrossing, 102
Wiener process, 149