Ran

BIRKHOFF’S ERGODIC THEOREM
DAVID RAN
Abstract. In this paper we will introduce the theory of ergodic measure-

preserving transformations of probability spaces. Along the way there will
be several examples of such transformations. As a whole, this paper works
towards the proof of a main ergodic theorem by Birkhoff. Lastly we apply
Birkhoff’s Ergodic Theorem to the aforementioned examples to prove results
in number theory and other fields. Most of the examples found in this paper
are referenced from [1].
Contents
1. Measure Theory Preliminaries 2
2. Measure-Preserving Transformations 3
3. Recurrence 5
4. Ergodicity 6
5. Proving Ergodicity Using Fourier Series 9
6. Conditional Expectation 11
7. Birkhoff’s Ergodic Theorem 11
8. Consequences of Birkhoff’s Theorem 15
9. Application to Normal Numbers 17
10. Application to Continued Fractions 18
11. Ehrenfests’ Model 20
Acknowledgments 21
References 21
In ergodic theory one focuses on the long term behavior of a flow or an iterated
transformation T and studies properties such as recurrence, equidistribution, mix-
ing, and other behavior. Typically, the transformation T preserves the structure
of the space, such as the measure of a probability space. The field arose in or-
der to better understand problems in statistical physics, where a measure-theoretic
approach to mechanics proved useful when dealing with difficult equations.
Birkhoff’s Ergodic Theorem is one of two fundamental theorems in ergodic the-
ory, the other being Von Neumann’s Ergodic Theorem. We begin by presenting a
simplified version of Birkhoff’s Theorem.
Theorem 0.1 (Birkhoff’s Ergodic Theorem). Let (X, B, µ) be a σ-finite measure
space and let T : (X, B, µ) → (X, B, µ) be a measure-preserving ergodic transforma-
tion. Let A be measurable. Then for µ-a.e. x ∈ X we have:
1 µ(A)
lim card{j | T j (x) ∈ A, 1 ≤ j ≤ n − 1} → .
n→∞ n µ(X)
1
2 DAVID RAN
Given the orbit {T j x}∞j=0 , the theorem states that the time average (LHS) of
the orbit is equal to its space average (RHS). This paper will aim to prove this
result in a more general form and explore some of its applications. The examples
and applications in this paper are taken from [1].
1. Measure Theory Preliminaries

We will make heavy use of measure theory and draw upon the following defini-
tions and theorems. Experienced readers may skip this section.
Definition 1.1. A probability space is a measure space (X, B, µ) such that µ(X) =
1 and µ : B → R+ is a nonnegative measure. The space X is the sample space of
events. The probability that an event A occurs is given by its measure µ(A).
Any positive finite measure space (X, B, µ) can be made into a probability space
by scaling µ by 1/µ(X). Because we will work with probability spaces, we will
often deal with spaces such as the unit interval and the torus.
Proposition 1.2. In a finite measure space, the countable intersection of sets with
full measure is also a set with full measure.
Proof. Consider the finite measure space (X, B, µ) with sets
T An such that S µ(An ) =
µ(X) for all n ∈ N. Then each Acn is null. We have µ( An ) = µ(X \ Acn ) =
S c n≥0 n≥0
µ(X) − µ( An ) = µ(X).
n≥0
Theorem 1.3 (Hahn-Kolmogorov Extension Theorem). Every σ-finite measure

µ : B → [0, ∞] is uniquely determined by the values µ takes on an algebra B0 that
generates B.
Lemma 1.4 (Fatou’s Lemma). Let fn : (X, B, µ) → [0, ∞] be a nonnegative se-
quence of measurable functions. Define f a.e. by
f (x) = lim inf fn (x)
n→∞
Then f is measurable and

Z Z
f (x)dµ ≤ lim inf fn dµ
n→∞
Theorem 1.5 (Dominated Convergence Theorem). Let fn : (X, B, µ) → R be a

sequence of measurable functions such that |fn (x)| ≤ g(x) for some g ∈ L1 for all
n and all for x. Let f be the pointwise limit of fn . Then
Z Z
f dµ = lim fn dµ
n→∞
Theorem 1.6 (Radon-Nikodym). Consider the measurable space (X, B) and two
measures on the space, ν and µ. Let ν µ, meaning that for all A ∈ B if µ(A) = 0
then ν(A) = 0. Then there exists a measurable function f : (X, B, µ) → [0, ∞) such
that for all A ∈ B we have Z
ν(A) = f dµ
A
The function f is called the Radon-Nikodym derivative.
BIRKHOFF’S ERGODIC THEOREM 3
2. Measure-Preserving Transformations
Given an invertible transformation T : X → X we often want to study the
doubly infinite time orbit of a point x ∈ X, which we denote ω:
ω = (. . . , T −2 x, T −1 x, x, T x, T 2 x, . . . )
Here we work in discrete time, which is often easier to analyze. We expect the long
term behavior of the discretized system to approximate the continuous time one.
Additionally, the transformation T typically preserves some structure of the space.
Definition 2.1. A map T : (X, B, µ) → (X, B, µ) is measure-preserving if for every
measurable set B ∈ B we have µ(T −1 B) = µ(B)
Now note that if T is both measure-preserving and invertible, then T −1 is also
measure-preserving, because µ((T −1 )−1 A) = µ(T A) = µ(T −1 T A) = µ(A). How-
ever, if T is not invertible it is not necessarily true that µ(A) = µ(T A). Consider
the following example.
Example 2.2 (The Doubling Map). Consider the map T : [0, 1] → [0, 1] where
T (x) = 2x mod 1. Then with respect to Lebesgue measure µ we have µ([0, 21 ]) = 12
while µ(T [0, 21 ]) = µ([0, 1]) = 1. However, if we work with preimages, we have
µ([0, 12 ]) = 21 and µ(T −1 [0, 12 ]) = µ([0, 41 ] ∪ [ 12 , 34 ]) = 12 . Moreover, measure is
preserved for any general interval [a, b]. Because intervals generate the Borel σ-
algebra, measure is preserved for all Borel sets. We will later use the Doubling
Map to prove results in number theory.
Theorem 2.3. Let T be a measurable transformation on the probability space
(X, B, µ). The following are equivalent:
(1) T is measure-preservingR
(2) for all f ∈ L1 we have R f dµ = R f ◦ T dµ
R
(3) for all f ∈ L2 we have f dµ = f ◦ T dµ

Proof. We show (1) =⇒ (2): We show the result first for simple functions. Con-
sider the characteristic function χB for B ∈ B. R Note that χRB ◦ T = χT −1 B .
−1
R
Consequently
R χB dµ = µ(B) = µ(T B) = χT −1 B dµ = χB ◦ T dµ. The
integral f dµ is defined as the supremum of integrals of simple functions less
than f . We can check that A := {g | g simple, g ≤ f ◦ T } is the same set
as B := {h ◦ T | h simple, h ≤ f }. These two sets give the same correspond-
ing sets of integrals. Also, the integrals
R Rof B are the same as those given by
C := {h | h simple, h ≤ f }. Thus f dµ = f ◦ T dµ.
(2) =⇒ (3): Because L2 ⊂ L1 in finite measure spaces.
let f = χB ∈ L2 . Then µ(B) = χB dµ =
R
R (3) =⇒ (1): R Let B ∈ B. Then
χB ◦ T dµ = χT −1 B dµ = µ(T −1 B)
To prove that T preserves measure on the σ-algebra B it suffices to show that T
preserves the measure of an algebra B0 that generates B. By the Hahn-Kolmogorov
Extension Theorem, a σ-finite measure is uniquely determined by the values it takes
on a generating algebra. Thus we can show that T preserves measure by defining
a new measure ν := µ ◦ T −1 and proving ν(A) = µ(A) on B0 .
Example 2.4 (Rotations on a circle). We will consider the rotation by a on the
unit circle as:
T : S 1 → S 1 : x 7→ e2πia x.
4 DAVID RAN
The Lebesgue measure is translation invariant, so the rotation on a circle preserves

circular measure. It will be clear later that the orbit of every point is equidistributed
and dense if and only if a is irrational.
Example 2.5 (Automorphisms of the Torus). Consider the torus Rn /Zn and the
linear map A : Rn /Zn → Rn /Zn . We require that the entries of A be integers so
that A maps Zn to itself. This guarantees that A is a well-defined transformation of
Rn /Zn . If det A = ±1 then the matrix A−1 has integer values as well, meaning that
the inverse map AR−1 is also well-defined.
R Then by theRLebesgue change of variables
formula we have f ◦ Adµ = A(Rn /Zn ) f | det A|dµ = Rn /Zn f dµ. Consequently by
Theorem 2.3, A preserves Lebesgue measure on the Borel σ-algebra.
Example 2.6 (Gauss Map). The Gauss map and Gauss measure will be useful
later when we apply Birkhoff’s ergodic theorem to problems of continued fractions.
Define the Gauss map T : [0, 1] → [0, 1] by
(
1
mod 1 if x 6= 0,
T (x) = x
0 if x = 0.
Every x ∈ (0, 1] has a continued fraction expression: {xn } ⊂ N such that
1 1
x= which means that T (x) = .
1 1
x0 + x1 +
1 1
x1 + x2 +
x2 + · · · x3 + · · ·
If x is rational, this expression has finitely many terms. In fact x is irrational if
and only if the continued fraction has infinitely many terms. If x is irrational then
its continued fraction is unique. Note that T shifts the sequence {xn } to the left.
While T does not preserve Lebesgue measure, it preserves Gauss measure, defined
by Z
1 1
ν(B) = dµ.
log 2 B 1 + x
We multiply by a factor of log1 2 so that the space [0, 1] with the Gauss measure is a
probability space. We want to show T preserves Gauss measure. Because the Borel
σ-algebra is generated by closed sets, it suffices
S 1 to check that ν(T −1 [a, b]) = ν([a, b])
−1 1
for all a, b ∈ [0, 1]. We note T [a, b] = [ b+n , a+n ]. Then all that remains is to
n≥1
show ν([a, b]) = ν(T −1 [a, b]). This equality is left for the reader to check if desired.
Example 2.7 (Bernoulli Shifts). Define ρ to be a finite set of r elements. We
can take the algebra consisting of all subsets of ρ and make ρ a probability space
P fixing an r-dimensional probability vector p. This means 0 ≤ p1 ≤ 1 and
by
pi = 1. We assign each element of ρ a value pi . Let Ω be the space of doubly
infinite sequences ω = (ωn )∞ −∞ where ωn ∈ ρ. We want to analyze Ω because
the space is a model for infinitely repeated independent experiments. We want to
make Ω a probability space by introducing a σ-algebra and a measure. For some
n ∈ Z, k ∈ N we define a cylinder set A ⊂ Ω as the set {ω ∈ Ω | (ωn , . . . , ωn+k ) ∈ E}
where E ⊂ ρk . We define a thin cylinder set as a cylinder set where E is some
singleton set {i1 , . . . , ik } ⊂ ρk . Given a thin cylinder set, we define its measure by
µp (A) = pi1 · · · pik . Every cylinder set is the finite union of thin cylinder sets, so
we have effectively defined µp on the σ-algebra generated by countable unions of

cylinder sets.
Consider the map T that shifts all terms in ω one space to the left. It is clear
that T is measure-preserving on thin cylinders. In other words, the probability of
a cylinder is independent of where the cylinder is located. We can think of this as
time independence.
The probability measure µ need not be defined as above. More generally, we can
consider a general probability measure P on (Ω, B) and a shift T that preserves P .
In this case, the trials ωn need not be independent of each other. The invariance
under T implies that “when” does not matter, although order might. This condition
is called stationarity.
Example 2.8 (Markov Shifts). Consider an r × r right stochastic matrix P . This

is a matrix with positive entries whose row entries sum to 1. We think of P (im , in )
as the probability of im given that the most recent outcome was in . Given an initial
probability vector p, we have pP is also a probability vector because the rows of
P sum to 1. Thus, points stay in the system with probability 1. There exists a
probability vector p such that pP = p. Let A = {ω | (ωn , . . . , ωn+k ) = (i1 , . . . , ik )}.
We define a measure µP on the thin cylinder A by µ(A) = pi1 P (i1 , i2 ) · · · P (ik−1 , ik ).
Here P (im , in ) refers to the entry of P in the im th row and in th column. We can
check that the shift T is measure preserving just as in the previous example. We
use the fact pP = p.
The Markov Shift is a generalization of the Bernoulli one. In the Markov space,
the probability of each event depends on the previous one, whereas in the Bernoulli
space each event was independent.
We assumed that P is time-homogenous, i.e. probabilities do not change with
time. We can generalize to time-dependent transition matrices An , but these topics
are beyond the scope of this paper. The same goes for non-discrete time and state
spaces.
3. Recurrence
One property we are interested in is the frequency with which a point from a set
returns to that set. Poincaré’s Theorem is a first step in that direction. We will
prove stronger results later, so one may skip this proof.
Theorem 3.1 (Poincaré Recurrence Theorem). Let T be a measure-preserving

transformation of the probability space (X, B, µ). Let E ∈ B with µ(E) > 0. For
almost all points of E the orbit {T j x}∞
j=1 returns to E infinitely many times. More-
over, there exists F ⊂ E such that µ(F ) = µ(E) and the orbit of each point in F
returns to F infinitely many times.
∞
T −n E. Then we see E0 ⊃ E1 ⊃ E2 ⊃ · · · .
S
Proof. For N ≥ 0 define EN =
n=N
Note also that E ⊂ E0 . For all N we have EN +1 = T −1 EN . Because T is measure-
preserving, we have µ(EN ) = µ(EN +1 ) for all N . Then
∞
\
µ( EN ) = lim µ(EN ) = lim µ(E0 ) = µ(E0 )
N →∞ N →∞
N =0
6 DAVID RAN
∞ S
∞ ∞
T −n E =
T T
Let x ∈ EN . For all N ≥ 0 there exists n ≥ N such that
N =0n=N N =0
∞
T n x ∈ E. Therefore the set
T
EN contains points of X that enter E infinitely
N =0
∞
T
many times. Let F = E ∩ EN . We have
N =0
∞
\ ∞
\

µ(F ) = µ(E ∩ EN ) = µ E ∩ E0 \ (E0 \ EN )
N =0 N =0
∞
\
= µ(E ∩ E0 ) − µ E ∩ (E0 \ EN ) = µ(E ∩ E0 ) = µ(E)
N =0
Thus, almost every point in E returns to E infinitely many times. Moreover,
points in F return to F infinitely many times. Let x ∈ F . Then there exist
0 < n1 < n2 < · · · such that T ni x ∈ E for all i. Then the orbit of T ni x enters E
infinitely many times, because for all k > i we have T nk −ni (T ni (x)) = T nk ∈ E.
Therefore T ni (x) ∈ F , and this is true for all i. Therefore the orbit of x enters F
infinitely many times.
4. Ergodicity
Definition 4.1. A measure preserving map T is ergodic if for all sets B ∈ B such
that T −1 B = B we have µ(B) ∈ {0, 1}.
Ergodicity is a non-decomposability condition. If there exists B ∈ B such that
T −1 B = B and µ(B) ∈ (0, 1) then we can decompose X into two disjoint probability
1 1
spaces (B, B|B , µ(B) µ) and (X \ B, B |X\B , 1−µ(B) µ) that are invariant under T .
Definition 4.2. The symmetric difference of two sets A, B is A4B = (A \ B) ∪
(B \ A). If µ(A4B) = 0 we say A = B almost everywhere (a.e.)
If T is ergodic then the weaker condition B = T −1 B a.e. also implies µ(B) ∈
{0, 1}. We prove this in Corollary 4.5.
Proposition 4.3. If µ(A4B) = 0 then µ(A) = µ(B).
Lemma 4.4. Let T be measure-preserving. If µ(T −1 B4B) = 0 then there exists
B 0 ∈ B such that T −1 B 0 = B 0 and µ(B4B 0 ) = 0.
Proof. Let
∞ [
\ ∞
B0 = T −j B.
n=0j=n
∞
∞ S ∞ S∞ ∞ ∞
Then T −1 B 0 = T −1 T −j B = T −(j+1) B = T −j B = B 0
T T T S
n=0j=n n=0j=n n=0j=n+1
Let n ∈ N and observe that

n−1
[
T −n B4B ⊂ (T −(j+1) B \ T −j B) ∪ (T −j B \ T −(j+1) B)
j=0
n−1
[ n−1
[
= T −(j+1) B4T −j B = T −j (T −1 B4B)
j=0 j=0
Therefore
n−1
[ n−1
X
µ(T −n B4B) ≤ µ( T −j (T −1 B4B)) ≤ µ(T −j (T −1 B4B))
j=0 j=0
Because T is measure-preserving:
n−1
X
= µ(T −1 B4B) = nµ(T −1 B4B) = 0
j=0
Therefore for all n we have µ(T −n B4B) = 0. Then

∞ [
\ ∞ ∞ [
\ ∞ ∞ [
\ ∞
0 −j −j
µ(B4B ) = µ(B4 T B) = µ(B \ T B) + µ( T −j B \ B)
n=0j=n n=0j=n n=0j=n
∞
[ ∞
[ [∞
≤µ (B \ T −j B) + lim µ( T −j B \ B)
n→∞
n=0 j=n j=n
∞
X ∞
[ ∞
X
≤ µ(B \ T −j B) + lim µ(B4T −n B)
n→∞
k=0 j=k j=n
X∞ ∞
X
≤ µ(B4T −k B) + lim µ(B4T −n B)
n→∞
k=0 j=n
X∞ ∞
X
= 0 + lim 0=0
n→∞
k=0 j=n

Corollary 4.5. If T is ergodic and µ(T −1 B4B) = 0 then µ(B) ∈ {0, 1}.
Proof. By the lemma, there exists B 0 such that T −1 B 0 = B 0 . Because T is ergodic
µ(B 0 ) ∈ {0, 1}. Because µ(B4B 0 ) = 0 we have µ(B) = µ(B 0 ) ∈ {0, 1}.
Recall Poincaré’s Theorem. Does the orbit of a point ever enter or re-enter a
different set?
Proposition 4.6. Given a set of positive measure A ⊂ X, if T is measure-
preserving and ergodic then the orbit of almost every point of X enters A even-
tually.
∞
T −n A. Then T −1 B ⊂ B and µ(T −1 B) = µ(B)
S
Proof. Consider the set B =
n=0
because T is measure-preserving. Because T is ergodic, µ(B) ∈ {0, 1}. Because
A has positive measure, we have µ(B) = 1. If x ∈ B, there exists k such that
x ∈ T −k A, so T k x ∈ A. Thus, almost every point in X enters A.
We summarize the previous results in the following statement.
Theorem 4.7. If T is a measure-preserving transformation then the following are
equivalent:
(1) T is ergodic
(2) µ(T −1 B4B) = 0 =⇒ µ(B) ∈ {0, 1}
8 DAVID RAN
(3) For all sets A, B with positive measure, there exists n > 0 such that
µ(T −n A ∩ B) > 0.
Proof. (1) =⇒ (2): Follows from Corollary 4.5

(2) =⇒ (3): It is clear that (2) implies ergodicity. By Proposition 4.6 there
exists a set Y such that µ(Y c ) = 0 and all points in Y enter A at some point. Then
µ(B ∩ Y ) = µ(B \ Y c ) = µ(B).
S −n In other words, almost every point of B enters A at
some point. Therefore, µ( T A ∩ B) = µ(B ∩ Y ) = µ(B) > 0. Thus for some
n≥1
n we have µ(T −n A ∩ B) > 0.
(3) =⇒ (1): Suppose T is not ergodic. Then there exists B such that T −1 B = B
and 0 < µ(B) < 1. Then µ(T −n B ∩ B c ) = 0 for all n, which contradicts (3).
Proposition 4.8. Let T be a measure-preserving transformation. The following

are equivalent:
(1) T is ergodic
(2) If f ∈ L1 (X, B, µ) and f ◦ T = f µ a.e. then f is constant µ a.e.
Proof. We first show (1) =⇒ (2). Let T be ergodic and let f ∈ L1 be such that
f ◦ T = f µ-a.e. If f is complex-valued, we can split f into real and imaginary
parts. Thus, without loss of generality we assume f is real-valued.
For B ∈ B let A := f −1 B. Then T −1 A = A a.e. This is true because if

x ∈ T −1 A \ A then f ◦ T (x) ∈ f (A) ⊂ B. If x 6∈ A then f (x) 6∈ B. Then
f ◦ T (x) 6= f (x). By similar reasoning, if x ∈ A \ T −1 A then f ◦ T (x) 6= f (x). As
f ◦ T = f a.e. and T −1 A4A ⊂ {x | f ◦ T (x) 6= f (x)} we have µ(T −1 A4A) = 0,
which proves our claim. We divide R into sets

k k+1
A(k, n) := f −1 n , n
2 2
where k ∈ Z, n ∈ N. Because f is measurable, A(k, n) is measurable. We have
∞
T −1 A(k, n) = A(k, n) a.e. We also have X = f −1 R =
S
A(k, n). Then
k=−∞
∞
X ∞
[
µ(A(k, n)) = µ( A(k, n)) = µ(X) = 1.
k=−∞ k=−∞
For each n there exists kn such that A(kn , n) = 1 and A(k, n) = 0 for k 6= kn . This
is true for arbitrary n, so we take the intersection over all n:
∞
T
Let Y = A(kn , n). By Proposition 1.2, we still have µ(Y ) = 1. Moreover, for
n=1
x, y ∈ Y we have |f (x) − f (y)| ≤ 21n for all n. Therefore for all x, y ∈ Y we have
f (x) = f (y). Thus, f is constant on a set of full measure Y ⊂ X.
Now we show (2) =⇒ (1). Let B be a measurable set such that T −1 B = B.

Consider χB ∈ L1 . We have χB ◦ T = χB . Then χB is constant a.e. Because
χB = 1 on B and 0 everywhere else, we have µ(B) = 0 or µ(B) = 1.
The above theorem is also true for f ∈ L2 , so we can apply the above theorem
using L2 Fourier Series. In many cases it is easier to change to Fourier series to
show that f ◦ T = f implies f is constant a.e. With or without using Fourier series,
proving ergodicity can often be a difficult task.
5. Proving Ergodicity Using Fourier Series

In this section we will use Fourier series and the Hilbert space L2 to prove that
certain maps are ergodic. Consider the space L2 (R/Z). R This is a complete vector
space with a norm given by the inner product hf, gi = f ḡdµ. For all n ∈ Z define
the function en by en (x) = e2πnix . The functions {en }∞ −∞ form an orthonormal
basis for L2 . Each function f ∈ L2 can be written uniquely as a convergent linear
combination of those basis vectors. Given a function, we call its corresponding
linear combination of vectors {en } the Fourier series of the function.
First, we list some preliminary facts about Fourier series. For reference, see [1]
or a standard text on Fourier series.
Theorem 5.1. Given f, g ∈ L2 , the functions f and g are equal almost everywhere
if and only if they have the same sequence of Fourier coefficients.
Theorem 5.2 (Riemann-Lebesgue Lemma). For f ∈ L2 , the Fourier coefficients
converge to 0 at ±∞.
Proposition 5.3. Given f ∈ L2 , its Fourier series composed with T converges to
f ◦ T . The Fourier series composed with T written in the basis {en } is then the
Fourier series of f ◦ T .
The following examples deal with Fourier series of periodic functions on the circle
R/Z. The compactness, or rather the periodicity of the domain allow for the basis
{en } to approximate functions in L2 (R/Z).
Example 5.4 (Rotations on a circle). Recall the rotation T from Example 2.4
T : R/Z → R/Z : x 7→ x + a mod 1.
If a ∈ Q then T is not ergodic with respect to Lebesgue measure. Indeed, if
a ∈ Q then (e2πia )q = 1 for some q ∈ N. Then f (x) = xq is invariant under T and
f ∈ L1 . However, f is not constant a.e. and by Proposition 4.8 T is not ergodic.
If a 6∈ Q then T is ergodic with respect to Lebesgue measure. To see this, suppose
f ∈ L2 ⊂ L1 such that f ◦ T = f a.e. Let f have the Fourier series
∞
X
cn e2πnix .
n=−∞
Then by Proposition 5.3 the composition f ◦ T has the Fourier series

∞
X
cn e2πnia e2πnix .
n=−∞
Therefore because Fourier coefficients are unique, we have for all n

cn = e2πnia cn .
Because a is irrational, for all n 6= 0 we have e2πnia 6= 1. Therefore cn = 0 for
all n 6= 0. Thus the Fourier series of f is c0 , meaning that f is constant a.e. By
Proposition 4.8 f is ergodic.
10 DAVID RAN
Example 5.5 (Doubling Map). Recall the Doubling Map from before, but this
time generalized to the b-adic map for some b ∈ Z, b ≥ 2:
T : R/Z → R/Z : x 7→ bx mod 1.
Showing that T is measure-preserving is straightforward. We can show T is ergodic
by once again applying Proposition 4.8. Let f ∈ L2 ⊂ L1 be such that f ◦ T = f
a.e. By induction, f ◦ T k = f a.e. for all k. This is true because for each additional
composition of T only a null set of points are the exception to the equality. Now
let f have the Fourier series
X∞
cn e2πinx .
n=−∞
Then f ◦ T k has the Fourier series
∞
X k
cn e2πinb x .
n=−∞
By Proposition 5.3 we have for all n and for all k

cn = cnbk
The Fourier coefficients must tend to 0 at ±∞. Thus cn = 0 for all n 6= 0. Therefore
f has the Fourier series c0 and is constant a.e. By Proposition 4.8 T is ergodic.
Example 5.6 (Linear Endomorphism on Torus). Let A be a linear map on the
torus Rk /Zk whose entries are integers. The map A is ergodic if and only if none
of its eigenvalues is a root of unity.
We once again prove ergodicity using Fourier series. We generalize from the
Hilbert space L2 (R/Z) to the Hilbert space L2 (Rk /Zk ). The inner product and
norm are the same as before, except now we use the orthonormal basis {en }n∈Zk
given by
en (x) := e2πihn,xi .
We use the same criteria for convergence of Fourier series as before in the sense
that we want L2 convergence and take limits as |n| → ∞ where | · | can be the
maximal norm, Euclidean norm, or other norm. In this sense, Theorems 5.1, 5.2,
and 5.3 still hold.
Suppose that A has an eigenvalue m that is a pth root of unity. Then Ap has 1
as an eigenvalue and a corresponding left eigenvector nT . Because Ap is an integer
matrix, we can scale n so that it has integer values as well. Thus n ∈ Zk . Consider
the function f ∈ L2 defined
p−1 p−1
j T
Aj )T ,xi
X X
f (x) := e2πihn,A xi
= e2πih(n
j=0 j=0
The above equality holds because for any 1×k coordinate vectors v, w and any k ×k
matrix A we have h(v T A)T , wi = (v T A)w = v T (Aw) = hv, Awi. Next, observe that
f is A-invariant. We can also check that f is non-constant unless n = 0. This
raises a contradiction with Proposition 4.8.
Now suppose that A has no eigenvalue that is a root of unity. Suppose f ∈ L2
such that f ◦ A = f . Suppose f has the Fourier series
X
cn e2πihn,xi .
n∈Zk
Then f ◦ Aj has the Fourier series

X j X T j T
cn e2πihn,A xi = cn e2πih(n A ) ,xi
n∈Zk n∈Zk
Then c(nT Aj )T = cn for all j ∈ N and for all n ∈ Zk . Because A has no

eigenvalue that is a root of unity, for all j we have (nT Aj )T 6= n unless n = 0. By
the Riemann-Lebesgue lemma we require cn → 0 as |n| → ∞. Therefore cn = 0
for all n 6= 0. Thus f is constant, so by Proposition 4.8 A is ergodic.
There are many other maps, such as the examples discussed in section 2, that
are also ergodic. For those examples, we do not employ Fourier series. Though
we will not prove this here, the Gauss Map and the Bernoulli Shift are ergodic.
Moreover, the Markov shift is ergodic under certain conditions, namely when the
stochastic matrix P is irreducible. A stochastic matrix is called irreducible if for
each i, j there exists n such that P n (i, j) > 0, meaning that P can eventually send
any state i to any state j with positive probability. For reference, see [1].
6. Conditional Expectation
We introduce the idea of the conditional expectation of a function f as a nuance
that we will later encounter in Birkhoff’s Theorem. If we know T is measure-
preserving but not necessarily ergodic, we can still use Birkhoff’s Theorem to attain
a result regarding the conditional expectation of f , which is relevant to probability
theory and martingales.
Proposition 6.1. Let (X, B, µ) be a probability space and let A ⊂ B be a σ-
algebra Rof B. Let f ∈ L1 (X, B, µ). We can define a measure ν on A by taking
ν(A) = A f dµ for all A ∈ A. This is a measure and ν µ|A .
Proposition 6.2. By the Radon-Nikodym theorem there exists a unique function
E(f | A) : (X, A)
R → [0, ∞) that is measurable
R with respect to A such that for all
A ∈ A we have A E(f | A)dµ = ν(A) = A f dµ.
Since A ⊂ B, being A-measurable is more restrictive than begin B-measurable.
The function f need not be measurable with respect to A, as opposed to the function
dν
E(f | A). We refer to E(f | A) as the Radon Nikodym derivative dµ| A
or as the
conditional expectation of f with respect to A.
7. Birkhoff’s Ergodic Theorem

In the Birkhoff theorem, we find that the time average of f under T j x approaches
the conditional expectation E(f | I)(x) where I is the σ-algebra of sets that are
invariant under T .
Theorem 7.1. Let (X, B, µ) be a σ-finite measure space and let T : (X, B, µ) →
(X, B, µ) be a measure-preserving transformation. Let f ∈ L1 (X). Then for µ-a.e.
x ∈ X we have:
n−1
1X
lim f (T j (x)) → E(f | I)(x).
n→∞ n
j=0
1
R R
Also E(f | I) ∈ L (X). If µ(X) < ∞ then E(f | I)dµ = f dµ. Moreover,R if T
1
is ergodic and µ(X) < ∞ then E(f | I) is constant a.e. and equal to µ(X) f dµ
a.e.
12 DAVID RAN
In a probability space X with an ergodic transformation T and f ∈ L1 , this

means that for a.e. x we have
n−1 Z
1X
lim f (T j (x)) = f dµ.
n→∞ n
j=0
We can consider the expression on the left as a time average. It is the asymptotic
average of the points f (T n x) over time. Meanwhile, the expression on the right is
a space average. If we let f be the indicator function for a set A, we can measure
the long term frequency with which the orbit of x enters A as a proportion of time.
µ(A)
If T is ergodic, this frequency is µ(X) .
Let us proceed to the proof of the theorem, which will be broken into several
steps.
Proposition 7.2. Let T be a measure-preserving transformation. The operator
U : L1 → L1 that maps f 7→ f ◦ T is a positive linear operator with norm ||U || = 1.
(i.e. if f ≥ 0 then U f ≥ 0)
Proof. That U is positive follows from its definition, and ||U || = 1 by Theorem
2.3.
Theorem 7.3 (Maximal Inequality). Let U : L1 (X, B, µ) → L1 (X, B, µ) be a pos-

itive linear operator with ||U || ≤ 1. Let f ∈ L1 . For n ≥ 1 define f0 = 0 and
fn = f + U f + U 2 f + · · · + U n−1 f,
Fn (x) = max fj (x) ≥ 0.

0≤j≤n
Let A = {x ∈ X | Fn (x) > 0}. Then

Z
f dµ ≥ 0.
A
Proof. For each x there exists nx ≤ n such that Fn (x) = fnx (x). Then
n
X
|Fn (x)| = |fnx (x)| ≤ |fj (x)|.
j=1
R Pn
|fj |dµ < ∞, which proves Fn ∈ L1 .
R
Then |Fn |dµ ≤ j=1
For 0 ≤ j ≤ n we have Fn ≥ fj . Then U Fn ≥ U fj because U is positive.

Then U Fn + f ≥ U fj + f . Therefore U Fn + f ≥ max fj . If Fn (x) > 0 then
1≤j≤n+1
max fj (x) = max fj (x) ≥ Fn (x). Thus for x ∈ A we have U Fn (x) + f (x) ≥
1≤j≤n+1 0≤j≤n+1
Fn (x). Then f ≥ Fn − U Fn on A.
We use the facts that Fn = 0 on X \ A, U Fn ≥ 0 and ||U || ≤ 1 to show

Z Z Z Z Z Z Z
f dµ ≥ Fn dµ− U Fn dµ = Fn dµ− U Fn dµ ≥ Fn dµ− U Fn dµ ≥ 0.
A A A X A X X

Corollary 7.4. Let T : (X, B, µ) → (X, B, µ) be measure-preserving. Let g ∈ L1

and let
n−1
1X j
Mα = x ∈ X sup
g(T (x)) > α .
n≥1 n j=0
Then for all B such that T −1 B = B we have

Z
gdµ ≥ αµ(Mα ∩ B)
Mα ∩B
Proof. Let f = g − α. Using the notation from before, we note

∞ n−1
[ ∞ ∞
X [ [
Mα = x g(T j (x)) > nα = {x | fn (x) > 0} = {x | Fn (x) > 0}.
n=0 j=0 n=0 n=0
R R
By the Maximal Inequality we have Mα f dµ ≥ 0, so Mα gdµ ≥ αµ(Mα ).
We can apply the Maximal Inequality with T restricted to B. Then we see
Z
gdµ ≥ αµ(Mα ∩ B).
Mα ∩B
Completing the Proof of Birkhoff ’s Theorem. It suffices to prove the theo-

rem for real-valued functions, as complex valued functions can be split into real
and imaginary parts. Thus assume f ∈ L1 is real-valued. Define
n−1 n−1
1X 1X
f ∗ (x) = lim sup f (T j (x)) and f∗ (x) = lim inf f (T j (x))
n→∞ n j=0 n→∞ n j=0
These functions are defined for all x, but may be infinite. This proof will have
R ∗ distinct
several steps. We show (1) f ∗ ◦ T = f ∗ , (2) f ∗ = f∗ a.e., (3) f ∗ ∈ L1 , (4)
f dµ = f dµ, (5) f ∗ = E(f | I).
R
Pn−1
(1) We claim f ∗ ◦ T = f ∗ . Indeed, let an (x) = 1
n j=0 f (T j (x)). Then
n+1 1
n an+1 (x) = an (T x) + n f (x). As n → ∞ we have
lim sup an+1 (x) = lim sup an (T x)

n→∞ n→∞
Thus f ∗ ◦ T = f ∗ . Similarly f∗ ◦ T = f∗ .
(2) Now we show f ∗ = f∗ a.e. LetS Eu,v = {x ∈ X | f∗ < u, v < f ∗ (x)}.

Then {x ∈ X | f∗ (x) < f ∗ (x)} = Eu,v . We want to show µ(Eu,v ) = 0
u<v,u,v∈Q
for all u, v ∈ Q with u < v. By the definition introduced in Corollary 7.4 we see
Eu,v ∩ Mv = Eu,v . We apply Corollary 7.4 to find
Z
(7.5) f dµ ≥ vµ(Eu,v ).
Eu,v
We then prove the same inequality again, but this time we replace f, u, v by
−f, −v, −u respectively and note that (−f ∗ ) = −f∗ and (−f )∗ = −f ∗ . We see,
14 DAVID RAN
D−v,−u = {x ∈ X | (−f )∗ < −v, −u < (−f )∗ } = Eu,v . We apply Corollary 7.4 to
(−f ) and find
Z Z
(−f )dµ = (−f )dµ ≥ −uµ(D−v,−u ) = −uµ(E(u, v)).
Eu,v D(−v,−u)
Therefore
Z
(7.6) f dµ ≤ uµ(Eu,v ).
Eu,v
Combining (7.5) and (7.6) we find vµ(Eu,v ) ≤ uµ(Eu,vS). However, u < v, so

µ(Eu,v ) = 0. Then µ({x ∈ X | f∗ (x) < f ∗ (x)}) = µ( Eu,v ) = 0. There-
u<v,u,v∈Q
≥ f ∗ a.e. Clearly f∗ ≤ f ∗ as well, so we have f∗ = f ∗ a.e. Therefore
fore f∗ P
n−1
lim n j=0 f (T j (x)) exists a.e. and is equal to f ∗ (x).
1
n→∞
Pn−1
(3) Now we show f ∗ ∈ L1 . Consider the partial sum gn (x) = | n1 j=0 f (T j x)|
The sequence gn converges to |f ∗ | pointwise for almost every x. Using the Triangle
Inequality and by Theorem 2.3 we have
Z Z n−1 n−1 Z n−1 Z Z
1 X 1X 1X
f ◦ T j dµ ≤ |f ◦ T j |dµ =

gn dµ = |f |dµ = |f |dµ
n j=0 n j=0 n j=0
We apply Fatou’s Lemma to gn to conclude f ∗ ∈ L1 , since

Z Z Z Z
∗
f dµ = f∗ dµ = lim inf gn dµ ≤ |f |dµ.
(4) Now we show f ∗ dµ = f dµ if µ(X) < ∞. This is much like part (1).
R R
Define Dkn := {x ∈ X | nk ≤ f ∗ (x) < k+1 n }. Consider M n k

− for > 0. Then
Dkn ∩ M k − = Dkn . We use here because the lower bound inequality in Dkn is not
n
strict. Then we apply Corollary 7.4 to find
Z Z
k
f dµ = f dµ ≥ ( − )µ(Dkn ).
n
Dk n
Dk ∩M k n
−
n
This is true for all > 0, so

Z
k
f dµ ≥ µ(Dkn ).
n
Dk n
It follows that
Z Z
k+1 k 1 1
f ∗ dµ ≤ µ(Dkn ) = µ(Dkn ) + µ(Dkn ) ≤ f dµ + µ(Dkn ).
Dkn n n n n
Dk n
S n
Note that Dk = X. Summing over all k ∈ Z we have
k∈Z
Z Z
1
f ∗ dµ ≤ f dµ + µ(X).
X X n
This is true for all n, so we send n → ∞. If µ(X) is finite then X f ∗Rdµ ≤ X f dµ.
R R
∗
R We can show the same inequality for (−f ∗
). Doing so, we find X (−f
∗
R ) dµ ≤
RX (−f
∗
)dµ. Then, noting
R ∗ again R that (−f ) = −f∗ = −f , we have X f dµ ≤
X
f dµ. Therefore f dµ = f dµ.
(5) Now we show f ∗ = E(f | I). Because f ∗ = f ∗ ◦ T , we know that the preim-
age of any set under f ∗ will be invariant under T . Therefore f ∗ is measurable with
respect to I, the
R σ-algebraR of sets invariant under T . Also, if A is a set invariant
under T , then A f ∗ dµ = A f dµ which is true because we can restrict to the space
A and the function T |A : A → A. These two properties uniquely determine the
conditional expectation E(f | I).
Lastly, suppose T is ergodic. By Proposition 4.8 the fact f ∗ ◦ T = f ∗ implies

that f ∗ is constant a.e. Taking the average of the space X, we see that for a.e. x
Z
1
E(f | I)(x) = f dµ.
µ(X)

8. Consequences of Birkhoff’s Theorem

Under an ergodic transformation T , almost every point in X “forgets” its initial
position. As time N grows large, the orbit {T j x}N j=0 becomes equidistributed. In
the long run, the proportion of time the orbit {T j x}∞
j=0 spends in a set B is equal
to its measure µ(B) divided by µ(X). Once again, this is true for almost every x
regardless of where it is located.
Corollary 8.1. Let T be an ergodic measure-preserving transformation of the prob-
ability space (X, B, µ). Let B ∈ B. Then for a.e. x ∈ X we have
1
lim card{j ∈ {0, . . . , n − 1} | T j x ∈ B} = µ(B).
n→∞ n
Pn−1
Proof. Note card{j ∈ {0, . . . , n − 1} | T j x ∈ B} = j=0 χB ◦ T j (x). Thus we
apply Birkhoff’s Theorem using f = χB .
We can think of the Birkhoff Theorem not only in terms of orbits, but also in
terms of pre-images of sets, as in the following corollary.
Corollary 8.2. Let (X, B, µ) be a probability space and let T : X → X be a
measure-preserving transformation. Then T is ergodic if and only if for all A, B ∈ B
n−1
1X
lim µ(T −j A ∩ B) = µ(A)µ(B).
n→∞ n
j=0
Proof. Apply the Birkhoff Theorem to χA to get

n−1 Z
1X
lim χA (T j x) = χA dµ = µ(A).
n→∞ n
j=0
Then multiply by χB (x) to get

n−1
1X
lim χA (T j x)χB (x) = µ(A)χB (x).
n→∞ n
j=0
16 DAVID RAN
Then integrate
Z n−1 Z
1X
lim χA (T j x)χB (x)dµ = µ(A)χB (x)dµ = µ(A)µ(B).
n→∞ n
j=0
Pn−1
Observe that χA (T j )χB dµ = µ(T −j A ∩ B) and | n1 j=0 χA (T j x)χB (x)| ≤ 1 for
R
all x. This allows us to apply the Dominated Convergence Theorem:

n−1 n−1 Z
1X −j 1X
lim µ(T A ∩ B) = lim χA (T j )χB dµ
n→∞ n n→∞ n
j=0 j=0
Z n−1 Z n−1
1X j 1X
= lim χA (T )χB dµ = lim χA (T j )χB dµ = µ(A)µ(B).
n→∞ n j=0 n→∞ n
j=0

Recall Poincaré’s Theorem. Given a measure-preserving transformation T and a
set A of positive measure, almost every point in A will return to A infinitely many
times. While Poincaré’s Theorem shows that almost every point of A returns to A
under a measure-preserving transformation, it does not give a sense of how long we
would have to wait for a point to return to the set. Kac’s Lemma will give a sense
of how long we can expect to wait. As one might guess, a point from a larger set
on average takes less time to return to that set, whereas return times are longer for
smaller sets.
Define the first return time for a point x as nA (x) = min{n ≥ 1 | T n (x) ∈ A}.
We already know nA is finite for almost every x ∈ A, but Kac’s Lemma will tell us
more.
Theorem 8.3 (Kac’s Lemma). Let T be an ergodic measure-preserving transfor-
mation of the probability space (X, B, µ). Let A be a set of positive measure. Then
Z
nA dµ = 1.
A
Proof. Define the two sets
[
Lj := {T j (x) | x ∈ A, T k (x) 6∈ Y for 0 < k < j} and B := Lj .
j≥0
We claim that B is T -invariant up to a null set. First observe T −1 Lj ⊂ Lj−1 by

definition. Then it is clear that B ⊂ T −1 B. We then show T −1 B ⊂ B up to a null
set. We see
[ [ [
T −1 B = T −1 Lj = T −1 Lj ⊂ Lj ∪ T −1 A = B ∪ T −1 A.
n≥0 j≥0 j≥0
Because T preserves measure, T A is not null. Thus we must show T −1 A ⊂ B

−1
mod µ if we want to show T −1 B ⊂ B mod µ. We show C = T −1 A \ B is null.

For x ∈ C we have T −j x 6∈ A for all j. Otherwise there exists a minimal j
such that x = T j (T −j x) ∈ Lj ⊂ B. This then implies that the sets T −j C are
disjoint for all j. This is so because otherwise there would exist two minimal
j, k with j < k such that there exists x ∈ T −j C ∩ T −k C. Then T k x ∈ C but
T j−k xP∈ A, which Pwe showed cannot beS true. Because T preserves measure we
∞ ∞
have j=0 µ(C) = j=0 µ(T −j C) = µ( T −j C) ≤ 1. Therefore µ(C) = 0. Thus
j≥0
we have shown T −1 B = B a.e. Because T is ergodic and µ(B) has positive measure,
we see µ(B) = 1. S
Define Ak := {x ∈ A : nA (x) = k}. By Poincaré’s Theorem, Ak = A a.e. It
S j k≥0
is also clear that the sets Ak are disjoint. Also, Lj = T Ak up to a null set.
k≥j
Therefore, by switching the order of the unions and noting T preserves measure,
[ [[ [ [
µ(X) = µ(B) = µ( Lj ) = µ( T j Ak ) = µ( T j Ak )
j≥0 j≥0k≥j k≥0k≥j≥0
∞
X X∞ X∞ ∞
X X∞ Z
= µ(T j Ak ) = µ(Ak ) = kµ(Ak ) = nA dµ.
k≥0 k≥j≥0 k≥0 k≥j≥0 k≥0 A
Therefore Z
nA dµ = 1.
A

1
By Kac’s Lemma, the average return time for a point in A is nA (x) = µ(A) ,
inversely proportional to the size of the set.
As a side note, we can generalize the Birkhoff Theorem from functions in L1 to
functions in Lp with Lp convergence. We present the theorem for reference. More
information can be found in [3].
Theorem 8.4 (Lp Ergodic Theorem). Let T be a measure-preserving transforma-
tion of (X, B, µ) and let f ∈ Lp for some 1 ≤ p < ∞. There exists f ∗ ∈ Lp such
Pn−1
that f ∗ ◦ T = f ∗ a.e. and lim || n1 j=0 f (T j ) − f ∗ ||p = 0
n→∞
9. Application to Normal Numbers

Here we use Birkhoff’s Theorem to prove that Lebesgue almost every number is
normal. However,√the proof is not a constructive one. While it is conjectured that
numbers such as 2, ln 2, e and π are normal, it is unknown whether any of them
are. Although Lebesgue almost all numbers are normal, only few specific examples
are presently known, such as Chaitlin’s constant. An example of a computable
normal number was produced by Becher in 2002.
We now proceed to define what it means for a number to be normal. Consider
x ∈ R/Z. The number x can be written in base b for any b ∈ N with b ≥ 2. The
number x has a unique expression in base b unless it ends with repeating 0’s or
(b − 1)’s.
Definition 9.1. A number x is simply normal in base b if for all digits k ≤ b the
frequency with which k occurs in the base b expansion of x is 1b .
Proposition 9.2. For a natural number b ≥ 2 Lebesgue almost every number in
[0, 1] is simply normal in base b. In other words, all digits are equally distributed
in the long term.
Proof. Recall the b-adic map, which we denote Tb . Let x ∈ [0, 1] be written as
.p0 p1 p2 p3 . . . in base b. We have p0 = k if and only if x ∈ [ kb , k+1
b ). Observe
Tb x = p0 .p1 p2 . . . mod 1 = .p1 p2 p3 . . . . Applying the b-adic map shifts all the
digits to the left. Thus, p1 = k if and only if x ∈ [ kp , k+1
b ).
18 DAVID RAN
Recall that Tb is ergodic. Then by applying the Birkhoff ergodic theorem, for
Lebesgue almost all x ∈ [0, 1] we have
n−1 Z
1X 1
lim χ[ k , k+1 ) ◦ T j = χ[ k , k+1 ) = .
n→∞ n b b b b b
j=0
Thus, over the long term the proportion of digits in the base b expansion that are
k is 1b .
We now generalize the Definition 9.1.
Definition 9.3. A number x is normal in base b if each finite word of length k
occurs in x with the frequency b1k . A word of length k is a block of k adjacent digits
i0 i1 . . . ik−1 .
Proposition 9.4. For b ≥ 2, Lebesgue almost all numbers x ∈ [0, 1] are normal in
base b.
Proof. We generalize the proof from before. The first k digits of x in base b are
i0 i1 . . . ik−1 if and only if
n−1
X ij n−1X ij
1
x∈ , + k =: I.
j=0
bj j=0 bj b
By applying the b-adic map n times, we then see that the digits n through n + k − 1
form the word i0 i1 . . . ik−1 if and only if Tbn x ∈ I. Observe that I has measure b1k
and apply Birkhoff’s Theorem as we did before.
Definition 9.5. A number x is normal if it is normal in base b for all b ≥ 2.
Theorem 9.6. Lebesgue almost all numbers x ∈ [0, 1] are normal.
Proof. From the previous proposition, for all b ≥ 2 there exists a set Xb withT full
measure such that all x ∈ Xb are normal in base b. The countable intersection Xb
b≥2
is the set of normal numbers. By Proposition 1.2 this set has full measure.
10. Application to Continued Fractions

In this section, we use T to denote the Gauss map, λ to denote Lebesgue measure,
and ν to denote Gauss measure. Observe that the Lebesgue measure λ and the
Gauss measure ν are absolutely continuous with respect to each other. Thus the
two measures have the same null sets. Consequently, if a statement is true Gauss
almost everywhere, then it is true Lebesgue almost everywhere.
Proposition 10.1. For Lebesgue almost every x ∈ [0, 1] the number k appears in
1 (k+1)2
the continued fraction representation of x with the long-term frequency log 2 log k(k+2) .
1
Proof. By a quick computation we see that xn = k if and only if Tn x ∈ ( k+1 , k1 ].
By the Birkhoff Theorem for Lebesgue almost every x the number k appears in the
continued fraction of x with frequency
n−1
(k + 1)2
Z
1X j 1 1 1
lim χ( k+1
1 1 ◦T = χ 1 1 dν = ν ( , ] = log .
,k] ( k+1 , k ]
n→∞ n k+1 k log 2 k(k + 2)
j=0

For a.e. x we can also make some statements about the arithmetic and geometric
means of the terms in the continued fraction.
Proposition 10.2. For Lebesgue a.e. x ∈ [0, 1] the continued fraction expansion
satisfies
n−1
1X
lim xj = ∞.
n→∞ n
j=0
Proof. For each n we have xn = [ T 1n x ], where [x] denotes the integerR part of
x. Let f be the function f (x) = [ x1 ]. Note that f 6∈ L1 (ν), because f dν =
P∞ 1 1
P∞ 1 (k+1)2
k=0 kµ(( k+1 , k ]) = k=0 k log 2 log k(k+2) which does not converge. (We can
check by L’Hopital’s Rule that the terms in the sum do not converge to 0). How-
ever, we can “chop off” the unbounded part of f by defining a new function fN
that takes values of f when |f | < N and 0 otherwise. Clearly f ≥ fN . We have
fN ∈ L1 so we can apply Birkhoff’s Theorem. For Lebesgue a.e. x ∈ [0, 1]
n−1 n−1 n−1 Z
1X 1X 1X
lim xj = lim f ◦ T j x ≥ lim fN ◦ T j x = fN dν.
n→∞ n n→∞ n n→∞ n
j=0 j=0 j=0
R
This is true for all N . We claim lim fN dν = ∞. Therefore
N →∞
n−1
1X
lim xj = ∞.
n→∞ n
j=0
Proposition 10.3. For Lebesgue a.e. x ∈ [0, 1] the long-term geometric mean is
n−1
Y 1/n Y ∞ log k/ log 2
1
lim xj = 1+ 2 .
n→∞
j=0
k + 2k
k=1
Proof. We want to apply Birkhoff’s Theorem, but we are not dealing with a sum.
To remedy this we work with logarithms. Consider the function f : [0, 1] → R
1
where f (x) = log(k) when x ∈ ( k+1 , k1 ] for all k ∈ N and f (0) = 0. As one might
expect from prior examples, this function maps x to log k if x0 = k. We show
f ∈ L1 (ν).
Z ∞ X ∞
X 1 1 1 1
f dν = log k · ν ( , ] ≤ log k · 2λ ( , ] <∞
k+1 k k+1 k
k=1 k=1
We apply Birkhoff’s Theorem. For Lebesgue a.e. x ∈ [0, 1]

n−1
Y 1/n n−1 n−1 Z
1X 1X
lim log xj = lim log xj = lim f ◦ T j (x) = f dν
n→∞ n→∞ n n→∞ n
j=0 j=0 j=0
We have
∞ ∞ log k/ log 2
(k + 1)2 (k + 1)2
Z
X log k X
f dν = log = log
log 2 k 2 + 2k k 2 + 2k
k=1 k=1
20 DAVID RAN
Thus
n−1 1/n ∞ log k/ log 2
(k + 1)2
Y X
lim log xj = log
n→∞
j=0
k 2 + 2k
k=1
We then exponentiate both sides. We then interchange the series in the exponent
of the RHS with an infinite product instead, which is allowed because the series
converges. Doing so proves the theorem.
11. Ehrenfests’ Model

The next example is credited to P. and T. Ehrenfest, 1907. Imagine that we
have two urns and 100 balls placed in the first urn. Each ball is assigned a number
1 through 100 and at each minute one of those numbers is selected randomly. The
ball with the corresponding number is transferred from its current urn to the other
urn. We expect that the distribution of balls will reach an equilibrium in which
the balls are divided evenly between the two urns, with some fluctuations. Given
infinite time however, we also anticipate with probability 1 that all 100 balls will
eventually return to the first urn. How long can we expect to wait for this unlikely
event to “inevitably” occur?
We formalize this using a transition matrix P and a Markov measure µP . Con-
sider the sequence ω = (x0 , x1 , x2 , . . . ) that tracks the number of balls i in the first
urn at any given minute k. For i ∈ {0, 1, . . . , 100}, when there are i balls in the first
urn, then the probability that there will be i − 1 balls in the first urn a minute later
i
is 100 . The probability that there will be i + 1 balls in the first urn a minute later
i
is 1 − 100 . The probability of any other number of balls is 0. Therefore we desire
i i
that the transition matrix P satisfy P (i, i − 1) = 100 and P (i, i + 1) = 1 − 100 . We
have the transition matrix with rows and columns numbering 0 through 100
 
0 100 0 0 0 · · ·
1 0 99 0 0 · · ·
1  0 2 0 98 0 · · ·

P = .
100 0 0

3 0 97 · · ·
.. .. .. .. .. . .
 
. . . . . .
If all 100 balls are initially distributed randomly between the two urns, the proba-
bility that i balls are in the first urn is

1 100
p(i) = 100 .
2 i
This defines a probability vector p such that pP = p. Therefore p and P define
a Markov measure on the cylinder sets of {0, 1, . . . , 100}N . The matrix P is irre-
ducible, so the left shift operator T is ergodic with respect to the Markov measure
µP .
Let A be the cylinder A = {ω | ω0 = 100}, the set of all possible outcomes
given that we start with 100 balls in the first urn. For ω ∈ A, we want to study
nA (ω) = min{n ≥ 1 | T n (ω) ∈ A} = min{n ≥ 1 | ωn = 100}, which is the first
time at which we return to having 100R balls in the first urn for that given sequence
ω. We apply Kac’s Lemma to find A nA dµP = 1. Therefore, the average value of
nA on A is µP1(A) = 2100 . In other words, if we start with 100 balls in the first urn
the expected time it will take to return to having 100 balls in the first urn is 2100
minutes. That is equal to over 2.4 × 102 4 years, much longer than the accepted age
of the universe.
Acknowledgments. I would like to thank my mentor Daniel Campos Salas for
helping me explore different paper topics and for introducing me to ergodic theory.
I am very grateful for his patience and encouragement in guiding me through a
variety of different topics and helping me write this paper.
I would also like to thank Peter May for a wonderful REU experience. I believe
that the program would not have been so interesting or even possible without the
tremendous effort he put in.
References
[1] Charles Walkden. Ergodic Theory. http://www.maths.manchester.ac.uk/ cwalkden/ergodic-
theory/ergodic theory.pdf
[2] Patrick Billingsley. Ergodic Theory and Information. Wiley & Sons. 1965
[3] Peter Walters. Ergodic Theory - Introductory Lectures. Springer-Verlag. 1975.
[4] Henk Bruin. Notes on Ergodic Theory. http://www.mat.univie.ac.at/ bruin/ET2.pdf

Ran

Uploaded by

Copyright:

Available Formats

Ran

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Ran

Uploaded by

Copyright:

Available Formats

BIRKHOFF’S ERGODIC THEOREM

Abstract. In this paper we will introduce the theory of ergodic measure-

1. Measure Theory Preliminaries

Theorem 1.3 (Hahn-Kolmogorov Extension Theorem). Every σ-finite measure

Then f is measurable and

Theorem 1.5 (Dominated Convergence Theorem). Let fn : (X, B, µ) → R be a

(3) for all f ∈ L2 we have f dµ = f ◦ T dµ

The Lebesgue measure is translation invariant, so the rotation on a circle preserves

we have effectively defined µp on the σ-algebra generated by countable unions of

Example 2.8 (Markov Shifts). Consider an r × r right stochastic matrix P . This

Theorem 3.1 (Poincaré Recurrence Theorem). Let T be a measure-preserving

Let n ∈ N and observe that

Therefore for all n we have µ(T −n B4B) = 0. Then

Proof. (1) =⇒ (2): Follows from Corollary 4.5

Proposition 4.8. Let T be a measure-preserving transformation. The following

For B ∈ B let A := f −1 B. Then T −1 A = A a.e. This is true because if

Now we show (2) =⇒ (1). Let B be a measurable set such that T −1 B = B.

5. Proving Ergodicity Using Fourier Series

Then by Proposition 5.3 the composition f ◦ T has the Fourier series

Therefore because Fourier coefficients are unique, we have for all n

By Proposition 5.3 we have for all n and for all k

Then f ◦ Aj has the Fourier series

Then c(nT Aj )T = cn for all j ∈ N and for all n ∈ Zk . Because A has no

7. Birkhoff’s Ergodic Theorem

In a probability space X with an ergodic transformation T and f ∈ L1 , this

Theorem 7.3 (Maximal Inequality). Let U : L1 (X, B, µ) → L1 (X, B, µ) be a pos-

Fn (x) = max fj (x) ≥ 0.

Let A = {x ∈ X | Fn (x) > 0}. Then

For 0 ≤ j ≤ n we have Fn ≥ fj . Then U Fn ≥ U fj because U is positive.

We use the facts that Fn = 0 on X \ A, U Fn ≥ 0 and ||U || ≤ 1 to show

Corollary 7.4. Let T : (X, B, µ) → (X, B, µ) be measure-preserving. Let g ∈ L1

Then for all B such that T −1 B = B we have

Proof. Let f = g − α. Using the notation from before, we note

Completing the Proof of Birkhoff ’s Theorem. It suffices to prove the theo-

lim sup an+1 (x) = lim sup an (T x)

(2) Now we show f ∗ = f∗ a.e. LetS Eu,v = {x ∈ X | f∗ < u, v < f ∗ (x)}.

Combining (7.5) and (7.6) we find vµ(Eu,v ) ≤ uµ(Eu,vS). However, u < v, so

We apply Fatou’s Lemma to gn to conclude f ∗ ∈ L1 , since

Define Dkn := {x ∈ X | nk ≤ f ∗ (x) < k+1 n }. Consider M n k

This is true for all  > 0, so

Lastly, suppose T is ergodic. By Proposition 4.8 the fact f ∗ ◦ T = f ∗ implies

8. Consequences of Birkhoff’s Theorem

Proof. Apply the Birkhoff Theorem to χA to get

Then multiply by χB (x) to get

all x. This allows us to apply the Dominated Convergence Theorem:

We claim that B is T -invariant up to a null set. First observe T −1 Lj ⊂ Lj−1 by

Because T preserves measure, T A is not null. Thus we must show T −1 A ⊂ B

mod µ if we want to show T −1 B ⊂ B mod µ. We show C = T −1 A \ B is null.

9. Application to Normal Numbers

10. Application to Continued Fractions

We apply Birkhoff’s Theorem. For Lebesgue a.e. x ∈ [0, 1]

11. Ehrenfests’ Model

You might also like

This is true for all > 0, so