Measure Theory For Dummies
Maya R. Gupta
gupta@ee.washington.edu
Abstract
This tutorial is an informal introduction to measure theory for people who are interested in reading papers that
use measure theory. The tutorial assumes one has had at least a year of college-level calculus, some graduate-level
exposure to random processes, and familiarity with terms like “closed” and “open.” The focus is on the terms and
ideas relevant to applied probability and information theory. There are no proofs and no exercises.
Measure theory is a bit like grammar: many people communicate clearly without worrying about all the details,
but the details do exist and for good reasons. There are a number of great texts that do measure theory justice. This is
not one of them. Rather, this is a hack way to get the basic ideas down so you can read through research papers and
follow what’s going on. Hopefully, you’ll get curious and excited enough about the details to check out some of the
references for a deeper understanding.
A Something to measure
First, we need something to measure. So we define a “measurable space.” A measurable space is a collection of events
B, and the set of all outcomes Ω, which is sometimes called the sample space. Given a collection of possible events B,
why do you need to state Ω? For one, having a sample space makes it possible to define complements of sets: if the
event F ∈ B, then the event F^C is the set of outcomes in Ω that are not in F. A measurable space is written
(Ω, B).
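To make this concrete, here is a minimal sketch in Python (my own toy example, not from the original text) of a measurable space for a single die roll, taking B to be the power set of Ω, the largest possible collection of events; all the names are hypothetical:

    from itertools import chain, combinations

    # Sample space for one die roll.
    omega = frozenset({1, 2, 3, 4, 5, 6})

    def power_set(s):
        """Every subset of s: the largest sigma-algebra on s."""
        items = list(s)
        return {frozenset(c) for c in chain.from_iterable(
            combinations(items, r) for r in range(len(items) + 1))}

    events = power_set(omega)       # the event collection B
    F = frozenset({2, 4, 6})        # the event "roll is even"
    F_complement = omega - F        # {1, 3, 5}: why we need to state omega

    assert F in events and F_complement in events

Note that computing the complement F^C requires knowing Ω, which is exactly the point made above.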
B Measure
A measure µ takes a set A (from a measurable collection of sets B), and returns "the measure of A," which is some
nonnegative real number. So one writes µ : B → [0, ∞). An example measure is volume, which goes by the name Lebesgue
measure. In general, measures are generalized notions of volume. The triple (Ω, B, µ) combines a measurable space
and a measure, and thus the triple is called a measure space. A measure is defined by two properties:
1. Nonnegativity: µ(A) ≥ 0 for all A ∈ B
2. Countable Additivity: If A_i ∈ B are disjoint sets for i = 1, 2, . . ., then the measure of the union of the A_i is
equal to the sum of the measures of the A_i, that is, µ(∪_i A_i) = Σ_i µ(A_i).
You can see how our ordinary notion of volume satisfies these two properties. There are a couple of variations on measure
that you will run into. One is a signed measure, which can be negative. A special case of measure is the probability
measure. A probability space is just a measure space with a probability measure. A probability measure P has
the two above properties of a measure, but it is also normalized, such that P(Ω) = 1.
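As a quick sanity check (my own sketch, with hypothetical names), here is the one-dimensional Lebesgue measure, that is, length, evaluated on finite disjoint unions of intervals, illustrating nonnegativity and (finite) additivity in Python:

    # Length of a disjoint union of half-open intervals [(a, b), ...].
    def length(intervals):
        assert all(a <= b for a, b in intervals)
        return sum(b - a for a, b in intervals)

    A = [(0.0, 0.5)]
    B = [(0.5, 2.0)]                 # disjoint from A
    assert length(A) >= 0            # nonnegativity
    assert abs(length(A + B) - (length(A) + length(B))) < 1e-12  # additivity

Of course, code can only check finitely many sets; countable additivity over infinitely many disjoint sets is what the definition actually demands.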
A probability measure P over a discrete set of events is basically what you know as a probability mass function. For
example, given a probability measure P and two sets A, B ∈ B, we can familiarly write

P(B|A) = P(A ∩ B) / P(A).
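For instance (a worked example of my own, assuming a fair die), the conditional probability formula plays out in Python as:

    # pmf of a fair die; P(event) sums the point masses in the event.
    masses = {w: 1/6 for w in range(1, 7)}

    def P(event):
        return sum(masses[w] for w in event)

    A = {2, 4, 6}                # "the roll is even"
    B = {4, 5, 6}                # "the roll is at least four"
    print(P(A & B) / P(A))       # P(B|A) = (2/6) / (3/6) = 0.666...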
B.4 Support
The support of a measure is, roughly, the set where the measure lives: every set outside the support has measure
zero. For example, you might say, "the probability measure µ has support only on the unit interval," by which you mean
there is zero probability of drawing a point bigger than one or smaller than zero. You often see written "the measure
has compact support" to note that the support of the measure forms a compact (= closed and bounded) set.
B.6 Borel measure
To call a measure a Borel measure means it is defined over a Borel σ-algebra.
B.7 Example
This example, based on one from Capinski and Kopp's book (page 45) [6], illustrates what it means to say "draw a
number from [0, 1] at random."
Restrict Lebesgue measure m to the interval B = [0, 1] and consider the σ-field M of measurable subsets of [0, 1].
Then m_[0,1] is a probability measure on M. Since all subintervals of [0, 1] with the same length have the same
measure, the mass of m_[0,1] is spread uniformly over [0, 1], so that the measure of [0, 1/10) is the same as the measure
of [6/10, 7/10) (both are 1/10). Thus all ten numerals are equally likely to appear as the first digit of the decimal
expansion of a number drawn at random according to this measure.
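A small simulation (my own sketch) makes the same point: drawing uniformly from [0, 1) and tabulating the first decimal digit gives each numeral a frequency near 1/10.

    import random
    from collections import Counter

    random.seed(0)
    n = 100_000
    # int(10 * u) is the first decimal digit of u in [0, 1).
    counts = Counter(int(10 * random.random()) for _ in range(n))
    for digit in range(10):
        print(digit, counts[digit] / n)   # each close to 0.1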
This works for every Borel set in the output space, so the random variable X induces a probability measure over the
space.
As shorthand, one writes the induced probability P_X(A) = P(X ∈ A). In fact, it's common not to write the induced measure
at all, and just write P(X ∈ A).
C.1 Distributions
The probability measure P_X over the output measurable space induced by a random variable X is called the distribution
of X [7]. However, the term distribution is also used in a more specific way. As we foreshadowed in the section on
Borel sets, the complete description of the probability measure induced by a random variable X requires knowledge
of P(X ∈ A) for all Borel sets A ⊆ R. However, since the Borel σ-algebra can be generated by the set of intervals
(−∞, x] for all x ∈ R, we only have to know P(X ∈ A) for every set A = (−∞, x]. Then, the distribution function
of X is F(x) = P(X ≤ x). The distribution function is usually indexed by the random variable, such as F_X or F_Y.
Then one can say that the induced probability measure of the interval (a, b] is P_X((a, b]) = F(b) − F(a).
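As an illustration (my own sketch; scipy.stats.norm.cdf is the standard normal distribution function F), one can compute the induced measure of an interval this way:

    from scipy.stats import norm

    a, b = -1.0, 1.0
    # P((a, b]) = F(b) - F(a) for a standard normal random variable.
    print(norm.cdf(b) - norm.cdf(a))   # about 0.6827

This is the familiar "68% of the mass lies within one standard deviation" computed as a difference of two values of the distribution function.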
C.3 Discrete distributions
A discrete distribution F has the familiar corresponding point mass function, or probability mass function. For a
discrete distribution F there is some countable set of numbers {x_j} and point masses {p_j} such that

F(x) = Σ_{x_j ≤ x} p_j,

for all x ∈ R. The {p_j} form the probability mass function (pmf) over the events, which are defined to be intervals of
the real line.
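A sketch of my own in Python, with a hypothetical pmf on {1, 2, 3}, shows how the point masses determine the distribution function:

    xs = [1, 2, 3]           # the countable set {x_j}
    ps = [0.2, 0.5, 0.3]     # the point masses {p_j}

    def F(x):
        """F(x) = sum of the p_j with x_j <= x."""
        return sum(p for xj, p in zip(xs, ps) if xj <= x)

    print(F(0.5), F(1), F(2.7), F(10))   # 0.0 0.2 0.7 1.0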
If E|X| < ∞, then one says that the random variable X is integrable. You'll note from the limit-sum definition
that if one takes the integral over a set of measure 0, one gets 0. That is a simple but key idea in many proofs, and you'll
often see equivalence relationships where some q equals some p if they agree on all sets that do not have measure 0.
The most common measure to use in integration is the Lebesgue measure, which is for almost all practical purposes
equivalent to the standard Riemann integration that one first learns. (For example, the rationals have Lebesgue measure 0,
so the Lebesgue integral of the indicator function of the rationals is 0, even though that function is not Riemann
integrable at all.) As noted in Section E, Riemann integration has some problems that make it not as useful as Lebesgue
integration, and the reader is referred (for example) to Capinski and Kopp for more details [6].
D Entropy
Entropy is a useful function of a random variable, and in this section we will use it to solidify some of the ideas
and notation introduced above. First, consider the entropy of a random variable f with discrete alphabet A, defined on the
probability space (Ω, B, P). Then the entropy is

H_P(f) = − Σ_{a ∈ A} P(f = a) ln P(f = a).
Also, f induces a probability mass function (pmf) p_f, where p_f(a) = P(ω : f(ω) = a) = P(f = a), so you can
equivalently write

H_P(f) = − Σ_{a ∈ A} p_f(a) ln p_f(a).
A discrete random variable f induces a partition of the input space Ω that corresponds to the inverse image of each
event: let the partition Q consist of the sets {Q_i : i = 1, 2, . . . , |A|}, where Q_i = {ω : f(ω) = a_i} = f^{−1}({a_i}). Then
you can also write entropy in terms of the induced partition Q:

H_P(Q) = − Σ_{i=1}^{|A|} P(Q_i) ln P(Q_i).
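A quick numerical sketch (my own, with a hypothetical induced pmf) of the entropy formula, in nats since the text uses the natural log:

    import math

    p_f = {"a": 0.5, "b": 0.25, "c": 0.25}   # hypothetical induced pmf
    H = -sum(p * math.log(p) for p in p_f.values() if p > 0)
    print(H)   # about 1.0397 nats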
E Limits
To quote Gut [5], "One of the basic questions in mathematics is to what extent limits of objects carry over to limits of
functions of objects." One of the more important results in this area is Lebesgue's Dominated Convergence Theorem.
A formal statement of this can be found on mathworld's page (www.mathworld.com). Basically, it says that the integral
of a function f with respect to a measure µ is the same as the limit of the integrals of the f_n, where f_1, f_2, . . . is a sequence of
measurable functions that converges to f. There are some other restrictions on the f_n's (see the formal statement); in
particular, the f_n must be dominated by some integrable function g, which is what gives the theorem its name.
What's powerful about this theorem, though, is that one doesn't have to assume that f is integrable; instead, the
theorem concludes that f is integrable, and shows you how to integrate it by instead taking the limit of the integrals of
the sequence of functions. The integration one learns as a kid is Riemann integration. This theorem doesn't work for
Riemann integration, and that is considered one of the flaws of Riemann integration that makes Lebesgue integration
more general and more useful. For more on the flaws of Riemann integration, see for example [6].
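Here is a numerical illustration of my own of what dominated convergence promises. Take f_n(x) = x^n on [0, 1]: the sequence converges to 0 for every x < 1 (so almost everywhere), each f_n is dominated by g(x) = 1, and the integrals ∫ f_n = 1/(n + 1) do indeed converge to 0, the integral of the limit function:

    # Crude Riemann-sum check that the integral of x**n on [0, 1] tends to 0.
    N = 100_000
    xs = [i / N for i in range(N)]

    def integral(n):
        return sum(x ** n for x in xs) / N

    for n in (1, 10, 100, 1000):
        print(n, integral(n))   # 0.5, ~0.091, ~0.0099, ~0.001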
F Read on
If you are interested in a more thorough understanding of measure theory and probability, one of the friendliest books
is Resnick's [8], which teaches graduate-level measure-theoretic probability with the assumption that you do not have
a B.A. in mathematics. Other good texts are an undergraduate text on measure theory [6], and Gut's graduate-level
measure-theoretic probability book [5]. There are certainly plenty of other probability and measure theory text books,
but these three are relatively well-suited for self-study. If you’ve decided you aren’t so interested in formal probability,
but want to learn more about approaches to solving probability problems, I recommend Richard Hamming’s book [2].
If you are interested in information theory you can solidify your understanding of the use of measure theory in
information theory by reading Bob Gray’s book [7], Kullback’s book [1], or the information theory book by Ash [4]
(you might want to read the less formal book by Reza [3] before Ash). Gray’s book is available free on-line, and the
Kullback, Ash, and Reza books are available in inexpensive Dover editions.
References
[1] S. Kullback, “Information theory and statistics”, Dover, 1997.
[2] R. W. Hamming, “The art of probability for scientists and engineers”, Addison Wesley, 1993.
[3] F. M. Reza, “An introduction to information theory”, Dover, 1994.
[4] R. B. Ash, “Information theory”, Dover, 1990.
[5] A. Gut, "Probability: a graduate course", Springer, 2005.
[6] M. Capinski and E. Kopp, "Measure, integral and probability", Springer, 1999.
[7] R. M. Gray, "Entropy and information theory", Springer-Verlag, 1990.
[8] S. I. Resnick, “A probability path”, Birkhäuser, 1999.
[9] T. M. Cover and J. A. Thomas, “Elements of Information Theory”, Wiley Series in Telecommunications, 1991.