We assume that many readers are familiar with much of the material presented in this
chapter. However, we do not view this material as superfluous, and we feature it prominently
as the first chapter of these notes for several reasons. First, some of these topics may have
been learned long ago by readers, and a review of this chapter may remind them of knowledge
they have forgotten. Second, including these preliminary topics as a separate chapter makes
the notes more self-contained than if the topics were omitted: We do not have to refer
readers to “a standard calculus textbook” or “a standard mathematical statistics textbook”
whenever an advanced result relies on this preliminary material. Third, some of the topics
here are likely to be new to some readers, particularly readers who have not taken a course
in real analysis.
Fourth, and perhaps most importantly, we wish to set the stage in this chapter for a math-
ematically rigorous treatment of large-sample theory. By “mathematically rigorous,” we
do not mean “difficult” or “advanced”; rather, we mean logically sound, relying on argu-
ments in which assumptions and definitions are unambiguously stated and assertions must
be provable from these assumptions and definitions. Thus, even well-prepared readers who
know the material in this chapter often benefit from reading it and attempting the exercises,
particularly if they are new to rigorous mathematics and proof-writing. We strongly caution
against the alluring idea of saving time by skipping this chapter when teaching a course,
telling students “you can always refer to Chapter 1 when you need to”; we have learned the
hard way that this is a dangerous approach that can waste more time in the long run than
it saves!
1.1 Limits and Continuity
Fundamental to the study of large-sample theory is the idea of the limit of a sequence. Much
of these notes will be devoted to sequences of random variables; however, we begin here by
focusing on sequences of real numbers. Technically, a sequence of real numbers is a function
from the natural numbers {1, 2, 3, . . .} into the real numbers R; yet we always write a1 , a2 , . . .
instead of the more traditional function notation a(1), a(2), . . ..
We begin by defining a limit of a sequence of real numbers. This is a concept that will be
intuitively clear to readers familiar with calculus. For example, the fact that the sequence
a1 = 1.3, a2 = 1.33, a3 = 1.333, . . . has a limit equal to 4/3 is unsurprising. Yet there are
some subtleties that arise with limits, and for this reason and also to set the stage for a
rigorous treatment of the topic, we provide two separate definitions. It is important to
remember that even these two definitions do not cover all possible sequences; that is, not
every sequence has a well-defined limit.
Definition 1.1 A sequence of real numbers a1 , a2 , . . . has limit equal to the real num-
ber a if for every ε > 0, there exists N such that |an − a| < ε for all n > N . In
this case, we write an → a as n → ∞ or limn→∞ an = a.
Definition 1.2 A sequence of real numbers a1 , a2 , . . . has limit ∞ if for every real
number M , there exists N such that an > M for all n > N . In this case, we write
an → ∞ as n → ∞. The limit −∞ is defined analogously, with an < M replacing an > M .
Implicit in the language of Definition 1.1 is that N may depend on ε. Similarly, N may
depend on M (in fact, it must depend on M ) in Definition 1.2.
The symbols +∞ and −∞ are not considered real numbers; otherwise, Definition 1.1 would
be invalid for a = ∞ and Definition 1.2 would never be valid since M could be taken to
be ∞. Throughout these notes, we will assume that symbols such as an and a denote real
numbers unless stated otherwise; if situations such as a = ±∞ are allowed, we will state this
fact explicitly.
A crucial fact regarding sequences and limits is that not every sequence has a limit, even
when “has a limit” includes the possibilities ±∞. (However, see Exercise 1.4, which asserts
that every nondecreasing sequence has a limit.) A simple example of a sequence without a
limit is given in Example 1.3. A common mistake made by students is to “take the limit of
both sides” of an equation an = bn or an inequality an ≤ bn . This is a meaningless operation
unless it has been established that such limits exist. On the other hand, an operation that is
valid is to take the limit superior or limit inferior of both sides, concepts that will be defined
in Section 1.1.1. One final word of warning, though: When taking the limit superior of a
strict inequality, < or > must be replaced by ≤ or ≥; see the discussion following Lemma
1.10.
Two or more sequences may be added, multiplied, or divided, and the results follow in-
tuitively pleasing rules: The sum (or product) of limits equals the limit of the sums (or
products); and as long as division by zero does not occur, the ratio of limits equals the limit
of the ratios. These rules are stated formally as Theorem 1.5, whose complete proof is the
subject of Exercise 1.1. To prove only the “limit of sums equals sum of limits” part of the
theorem, if we are given an → a and bn → b then we need to show that for a given ε > 0,
there exists N such that for all n > N , |an + bn − (a + b)| < ε. But the triangle inequality
gives

|an + bn − (a + b)| ≤ |an − a| + |bn − b|,    (1.1)

and furthermore we know that there must be N1 and N2 such that |an − a| < ε/2 for n > N1
and |bn − b| < ε/2 for n > N2 (since ε/2 is, after all, a positive constant and we know an → a
and bn → b). Therefore, we may take N = max{N1 , N2 } and conclude by inequality (1.1)
that for all n > N ,

|an + bn − (a + b)| < ε/2 + ε/2 = ε,

which proves that an + bn → a + b.
Theorem 1.5 Suppose an → a and bn → b as n → ∞. Then an + bn → a + b and
an bn → ab; furthermore, if b ≠ 0 then an /bn → a/b.
A similar result states that continuous transformations preserve limits; see Theorem 1.16.
Theorem 1.5 may be extended by replacing a and/or b by ±∞, and the results remain true
as long as they do not involve the indeterminate forms ∞ − ∞, ±∞ × 0, or ±∞/∞.
The limit superior and limit inferior of a sequence, unlike the limit itself, are defined for any
sequence of real numbers. Before considering these important quantities, we must first define
supremum and infimum, which are generalizations of the ideas of maximum and minimum.
That is, for a set of real numbers that has a minimum, or smallest element, the infimum
is equal to this minimum; and similarly for the maximum and supremum. For instance,
any finite set contains both a minimum and a maximum. (“Finite” is not the same as
“bounded”; the former means having finitely many elements and the latter means contained
in an interval neither of whose endpoints are ±∞.) However, not all sets of real numbers
contain a minimum (or maximum) value. As a simple example, take the open interval (0, 1).
Since neither 0 nor 1 is contained in this interval, there is no single element of this interval
that is smaller (or larger) than all other elements. Yet clearly 0 and 1 are in some sense
important in bounding this interval below and above. It turns out that 0 and 1 are the
infimum and supremum, respectively, of (0, 1).
An upper bound of a set S of real numbers is (as the name suggests) any value m such that
s ≤ m for all s ∈ S. A least upper bound is an upper bound with the property that no smaller
upper bound exists; that is, m is a least upper bound if m is an upper bound such that for
any ε > 0, there exists s ∈ S such that s > m − ε. A similar definition applies to greatest
lower bound. A useful fact about the real numbers—a consequence of the completeness of
the real numbers which we do not prove here—is that every set that has an upper (or lower)
bound has a least upper (or greatest lower) bound.
Definition 1.6 For any set of real numbers, say S, the supremum sup S is defined to
be the least upper bound of S (or +∞ if no upper bound exists). The infimum
inf S is defined to be the greatest lower bound of S (or −∞ if no lower bound
exists).
Example 1.7 Let S = {a1 , a2 , a3 , . . .}, where an = 1/n. Then inf S, which may also
be denoted inf n an , equals 0 even though 0 ∉ S. But supn an = 1, which is
contained in S. In this example, max S = 1 but min S is undefined.
If we denote by supk≥n ak the supremum of {an , an+1 , . . .}, then we see that this supremum is
taken over a smaller and smaller set as n increases. Therefore, supk≥n ak is a nonincreasing
sequence in n, which implies that it has a limit as n → ∞ (see Exercise 1.4). Similarly,
inf k≥n ak is a nondecreasing sequence, which implies that it has a limit.
Definition 1.8 The limit superior of a sequence a1 , a2 , . . ., denoted lim supn an or
sometimes limn an , is the limit of the nonincreasing sequence

supk≥1 ak , supk≥2 ak , . . . .

The limit inferior, denoted lim inf n an or sometimes limn an , is the limit of the
nondecreasing sequence

inf k≥1 ak , inf k≥2 ak , . . . .
Intuitively, the limit superior and limit inferior may be understood as follows: If we define
a limit point of a sequence to be any number which is the limit of some subsequence, then
lim inf and lim sup are the smallest and largest limit points, respectively (more precisely,
they are the infimum and supremum, respectively, of the set of limit points).
Example 1.9 In Example 1.3, the sequence dn = (−1)^n does not have a limit. How-
ever, since supk≥n dk = 1 and inf k≥n dk = −1 for all n, it follows that

lim supn dn = 1 and lim inf n dn = −1.

In this example, the set of limit points of the sequence d1 , d2 , . . . is simply {−1, 1}.
Here are some useful facts regarding limits superior and inferior:
Lemma 1.10 Let a1 , a2 , . . . and b1 , b2 , . . . be arbitrary sequences of real numbers.
• Both lim sup and lim inf preserve nonstrict inequalities; that is, if an ≤ bn
for all n, then lim supn an ≤ lim supn bn and lim inf n an ≤ lim inf n bn .
• lim supn (−an ) = − lim inf n an .
The next-to-last claim in Lemma 1.10 is no longer true if “nonstrict inequalities” is replaced
by “strict inequalities”. For instance, 1/(n + 1) < 1/n is true for all positive n, but the limit
superior of each side equals zero. Thus, it is not true that
lim supn 1/(n + 1) < lim supn 1/n.
We must replace < by ≤ (or > by ≥) when taking the limit superior or limit inferior of both
sides of an inequality.
1.1.2 Continuity
Although Definitions 1.1 and 1.2 concern limits, they apply only to sequences of real numbers.
Recall that a sequence is a real-valued function of the natural numbers. We shall also require
the concept of a limit of a real-valued function of a real variable. To this end, we make the
following definition.
Definition 1.11 For a real-valued function f (x) defined for all points in a neighbor-
hood of x0 except possibly x0 itself, we call the real number a the limit of f (x)
as x goes to x0 , written
lim f (x) = a,
x→x0
if for each ε > 0 there is a δ > 0 such that |f (x) − a| < ε whenever 0 < |x − x0 | < δ.
First, note that Definition 1.11 is sensible only if both x0 and a are finite (but see Definition
1.13 for the case in which one or both of them is ±∞). Furthermore, it is very important
to remember that 0 < |x − x0 | < δ may not be replaced by |x − x0 | < δ: The latter would
imply something specific about the value of f (x0 ) itself, whereas the correct definition does
not even require that this value be defined. In fact, by merely replacing 0 < |x − x0 | < δ
by |x − x0 | < δ (and insisting that f (x0 ) be defined), we could take Definition 1.11 to be
the definition of continuity of f (x) at the point x0 (see Definition 1.14 for an equivalent
formulation).
Implicit in Definition 1.11 is the fact that a is the limiting value of f (x) no matter whether
x approaches x0 from above or below; thus, f (x) has a two-sided limit at x0 . We may also
consider one-sided limits:
Definition 1.12 The value a is called the right-handed limit of f (x) as x goes to x0 ,
written

limx→x0 + f (x) = a or f (x0 +) = a,

if for each ε > 0 there is a δ > 0 such that |f (x) − a| < ε whenever 0 < x − x0 < δ.
The left-handed limit, limx→x0 − f (x) or f (x0 −), is defined analogously: f (x0 −) =
a if for each ε > 0 there is a δ > 0 such that |f (x) − a| < ε whenever −δ < x − x0 <
0.
In other words, the (two-sided) limit of Definition 1.11 exists if and only if both one-sided limits exist and they
coincide. Before using the concept of a limit to define continuity, we conclude the discussion
of limits by addressing the possibilities that f (x) has a limit as x → ±∞ or that f (x) tends
to ±∞:
As mentioned above, the value of f (x0 ) in Definitions 1.11 and 1.12 is completely irrelevant;
in fact, f (x0 ) might not even be defined. In the special case that f (x0 ) is defined and equal
to a, then we say that f (x) is continuous (or right- or left-continuous) at x0 , as summarized
by Definition 1.14 below. Intuitively, f (x) is continuous at x0 if it is possible to draw the
graph of f (x) through the point [x0 , f (x0 )] without lifting the pencil from the page.
Finally, even though continuity is inherently a local property of a function (since Defini-
tion 1.14 applies only to the particular point x0 ), we often speak globally of “a continuous
function,” by which we mean a function that is continuous at every point in its domain.
Statement (1.2) implies that every (globally) continuous function is right-continuous. How-
ever, the converse is not true, and in statistics the canonical example of a function that is
right-continuous but not continuous is the cumulative distribution function for a discrete
random variable.
Figure 1.1: The cumulative distribution function for a Bernoulli (1/2) random variable is
discontinuous at the points t = 0 and t = 1, but it is everywhere right-continuous.
Example 1.15 Let X be a Bernoulli (1/2) random variable, so that the events X = 0
and X = 1 each occur with probability 1/2. Then the distribution function
F (t) = P (X ≤ t) is right-continuous but it is not continuous because it has
“jumps” at t = 0 and t = 1 (see Figure 1.1). Using the one-sided limit notation of
Definition 1.12, we may write F (0−) = 0 ≠ 1/2 = F (0) = F (0+) and F (1−) = 1/2 ≠ 1 = F (1) = F (1+).
We conclude with a simple yet important result relating continuity to the notion of the
limit of a sequence. Intuitively, this result states that continuous functions preserve limits
of sequences.
Theorem 1.16 If a is a real number such that an → a as n → ∞ and the real-valued
function f (x) is continuous at the point a, then f (an ) → f (a).
Proof: We need to show that for any ε > 0, there exists N such that |f (an ) − f (a)| < ε
for all n > N . To this end, let ε > 0 be a fixed arbitrary constant. From the definition of
continuity, we know that there exists some δ > 0 such that |f (x) − f (a)| < ε for all x such
that |x − a| < δ. Since we are told an → a and since δ > 0, there must by definition be
some N such that |an − a| < δ for all n > N . We conclude that for all n greater than this
particular N , |f (an ) − f (a)| < ε. Since ε was arbitrary, the proof is finished.
Exercise 1.2 For a fixed real number c, define an (c) = (1 + c/n)^n . Then Equation
(1.9) states that an (c) → exp(c). A different sequence with the same limit is
obtained from the power series expansion of exp(c):
bn (c) = Σ_{i=0}^{n−1} c^i / i! .
For each of the values c ∈ {−10, −1, 0.2, 1, 5}, find the smallest value of n such
that |an (c) − exp(c)|/ exp(c) < .01. Now replace an (c) by bn (c) and repeat.
Comment on any general differences you observe between the two sequences.
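For readers who wish to experiment, here is one possible R sketch for this exercise; the helper functions a(), b(), and smallest.n() are our own names, not part of the text.

# Compare how quickly a_n(c) = (1 + c/n)^n and the partial sums b_n(c) of the
# exponential series approach exp(c).
a <- function(n, cc) (1 + cc/n)^n
b <- function(n, cc) sum(cc^(0:(n - 1)) / factorial(0:(n - 1)))
smallest.n <- function(seq.fn, cc, tol = 0.01) {
  n <- 1
  while (abs(seq.fn(n, cc) - exp(cc)) / exp(cc) >= tol) n <- n + 1
  n
}
for (cc in c(-10, -1, 0.2, 1, 5)) {
  cat("c =", cc, "  a_n:", smallest.n(a, cc), "  b_n:", smallest.n(b, cc), "\n")
}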
(a) Graph f (x). Experiment with various ranges on the axes until you attain a
visually pleasing and informative plot that gives a sense of the overall behavior
of the function.
(b) For each of x0 ∈ {−1, 0, 1, 2}, answer these questions: Is f (x) continuous at
x0 , and if not, could f (x0 ) be defined so as to make the answer yes? What are
the right- and left-hand limits of f (x) at x0 ? Does it have a limit at x0 ? Finally,
what are limx→∞ f (x) and limx→−∞ f (x)?
Exercise 1.8 Define F (t) as in Example 1.15 (and as pictured in Figure 1.1). This
function is not continuous, so Theorem 1.16 does not apply. That is, an → a does
not imply that F (an ) → F (a).
(a) Give an example of a sequence {an } and a real number a such that an → a
but lim supn F (an ) ≠ F (a).
(b) Change your answer to part (a) so that an → a and lim supn F (an ) = F (a),
but limn F (an ) does not exist.
(c) Explain why it is not possible to change your answer so that an → a and
lim inf n F (an ) = F (a), but limn F (an ) does not exist.
Differential calculus plays a fundamental role in much asymptotic theory. In this section
we review simple derivatives and one form of Taylor’s well-known theorem. Approximations
to functions based on Taylor’s Theorem, often called Taylor expansions, are ubiquitous in
large-sample theory.
We assume that readers are familiar with the definition of a derivative of a real-valued
function f (x):
Definition 1.17 If f (x) is continuous in a neighborhood of x0 and
lim_{x→x0} [f (x) − f (x0 )] / (x − x0 )    (1.4)
exists, then f (x) is said to be differentiable at x0 and the limit (1.4) is called the
derivative of f (x) at x0 and is denoted by f ′ (x0 ) or f^{(1)} (x0 ).
We use the standard notation for second- and higher-order derivatives. Thus, if f ′ (x) is
itself differentiable at x0 , we express its derivative as f ′′ (x0 ) or f^{(2)} (x0 ). In general, if the
kth derivative f^{(k)} (x) is differentiable at x0 , then we denote this derivative by f^{(k+1)} (x0 ). We
also write (d^k /dx^k )f (x) (omitting the k when k = 1) to denote the function f^{(k)} (x), and to
denote the evaluation of this function at a specific point (say x0 ), we may use the following
notation, which is equivalent to f^{(k)} (x0 ):
(d^k /dx^k ) f (x) |_{x=x0} .
In some cases, we will find it helpful to have an explicit form of rd (x, a). This is possible
under stronger assumptions, namely, that f (x) has d + 1 derivatives on the closed interval
from x to a. In this case, we may write
rd (x, a) = ∫_a^x [(x − t)^d / d!] f^{(d+1)} (t) dt    (1.6)
in equation (1.5). Equation (1.6) is often called the integral form of the remainder. By the
Mean Value Theorem of calculus, there exists x∗ somewhere in the closed interval from x to
a such that
rd (x, a) = [(x − a)^{d+1} / (d + 1)!] f^{(d+1)} (x∗ ).    (1.7)
Expression (1.7), since it follows from Equation (1.6) by this mean value argument, is referred to as the
Lagrange form of the remainder.
To conclude this section, we state the well-known calculus result known as l’Hôpital’s Rule.
This useful theorem provides an elegant way to prove Theorem 1.18, among other things.
Theorem 1.19 l’Hôpital’s Rule: For a real number c, suppose that f (x) and g(x)
are differentiable for all points in a neighborhood containing c except possibly c
itself. If limx→c f (x) = 0 and limx→c g(x) = 0, then
lim_{x→c} f (x)/g(x) = lim_{x→c} f ′ (x)/g ′ (x),    (1.8)
provided the right-hand limit exists. Similarly, if limx→c f (x) = ∞ and limx→c g(x) =
∞, then Equation (1.8) also holds. Finally, the theorem also applies if c = ±∞, in
which case a “neighborhood containing c” refers to an interval (a, ∞) or (−∞, a).
lim_{n→∞} log(1 + c xn )/xn = lim_{n→∞} n log(1 + c/n) = c,    (1.10)
which was to be proved. Now we use the fact that the exponential function
h(t) = exp t is a continuous function, so Equation (1.9) follows from Theorem
1.16 once we apply the exponential function to Equation (1.10).
Exercise 1.10 For f (x) continuous in a neighborhood of x0 , consider
lim_{x→x0} [f (x) − f (2x0 − x)] / [2(x − x0 )].    (1.11)
(a) Prove or give a counterexample: When f ′ (x0 ) exists, limit (1.11) also exists
and it is equal to f ′ (x0 ).
(b) Prove or give a counterexample: When limit (1.11) exists, it equals f ′ (x0 ),
which also exists.
Then use l’Hôpital’s rule, Theorem 1.19, d − 1 times. (You can do this because
the existence of f (d) (a) implies that all lower-order derivatives exist on an interval
containing a.) You cannot use l’Hôpital’s rule d times, but you won’t need to if
you use Definition 1.17.
Exercise 1.12 Let f (t) = log t. Taking a = 1 and x = a + h, find the explicit
remainder term rd (x, a) in Equation (1.5) for all values of d ∈ {2, 3} and h ∈
{0.1, 0.01, 0.001}. Give your results in a table. How does rd (x, a) appear to vary
with d? How does rd (a + h, a) appear to vary with h?
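Since rd (x, a) in Equation (1.5) is just the gap between f (x) and its degree-d Taylor polynomial about a, one way to build the requested table in R is the sketch below (the function name is ours, not part of the exercise):

# r_d(1 + h, 1) for f(t) = log t: the difference between log(1 + h) and the
# degree-d Taylor polynomial about a = 1, namely sum_{k=1}^d (-1)^(k+1) h^k / k.
remainder <- function(h, d) {
  k <- 1:d
  log(1 + h) - sum((-1)^(k + 1) * h^k / k)
}
h.values <- c(0.1, 0.01, 0.001)
tab <- outer(h.values, 2:3, Vectorize(remainder))
dimnames(tab) <- list(h = h.values, d = c("d = 2", "d = 3"))
tab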
Exercise 1.13 The idea for Exercise 1.10 is based on a numerical trick for accurately
approximating the derivative of a function that can be evaluated directly but for
which no formula for the derivative is known.
(a) First, construct a “first-order” approximation to a derivative. Definition
1.17 with d = 1 suggests that we may choose a small h and obtain
f ′ (a) ≈ [f (a + h) − f (a)] / h.    (1.12)
For f (x) = log x and a = 2, calculate the approximation to f 0 (a) in Equation
(1.12) using h ∈ {0.5, 0.05, 0.005}. How does the difference between the true
value (which you happen to know in this case) and the approximation appear to
vary as a function of h?
(b) Next, expand both f (a + h) and f (a − h) using Taylor’s theorem with
d = 2. Subtract one expansion from the other and solve for f 0 (a). Ignore the
remainder terms and you have a “second-order” approximation. (Compare this
approximation with Exercise 1.10, substituting x0 and x−x0 for a and h.) Repeat
the computations of part (a). Now how does the error appear to vary as a function
of h?
(c) Finally, construct a “fourth-order” approximation. Perform Taylor expan-
sions of f (x + 2h), f (x + h), f (x − h), and f (x − 2h) with d = 4. Ignore the
remainder terms, then find constants C1 and C2 such that the second, third, and
fourth derivatives all disappear and you obtain

f ′ (x) ≈ (1/h) [ C1 {f (x + h) − f (x − h)} + C2 {f (x + 2h) − f (x − 2h)} ].    (1.13)
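A possible R sketch for parts (a) and (b); we take the symmetric formula [f(a + h) − f(a − h)]/(2h) to be the second-order approximation that part (b) leads to, and we leave part (c) to the reader.

# Approximate f'(2) for f(x) = log(x); the true value is 1/2.
f <- function(x) log(x)
a <- 2
h <- c(0.5, 0.05, 0.005)
first.order  <- (f(a + h) - f(a)) / h            # Equation (1.12)
second.order <- (f(a + h) - f(a - h)) / (2 * h)  # symmetric difference of part (b)
cbind(h, error.first = abs(first.order - 0.5), error.second = abs(second.order - 0.5))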
Exercise 1.14 The gamma function Γ(x) is defined for positive real x as
Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt    (1.14)
[in fact, equation (1.14) is also valid for complex x with positive real part]. The
gamma function may be viewed as a continuous version of the factorial function
in the sense that Γ(n) = (n − 1)! for all positive integers n. The gamma function
satisfies the identity

Γ(x + 1) = x Γ(x)    (1.15)

even for noninteger positive values of x. Since Γ(x) grows very quickly as x in-
creases, it is often convenient in numerical calculations to deal with the logarithm
of the gamma function, which we term the log-gamma function. The digamma
function Ψ(x) is defined to be the derivative of the log-gamma function; this func-
tion often arises in statistical calculations involving certain distributions that use
the gamma function.
(a) Apply the result of Exercise 1.13(b) using h = 1 to demonstrate how to
obtain the approximation
Ψ(x) ≈ (1/2) log [x(x − 1)]    (1.16)
for x > 2.
Hint: Use Identity (1.15).
(b) Test Approximation (1.16) numerically for all x in the interval (2, 100) by
plotting the ratio of the approximation to the true Ψ(x). What do you no-
tice about the quality of the approximation? If you are using R or Splus, then
digamma(x) gives the value of Ψ(x).
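One way to carry out part (b) in R (digamma is a base R function, as noted above):

# Ratio of approximation (1.16) to the true digamma function on (2, 100).
x <- seq(2.01, 100, length.out = 500)
ratio <- (0.5 * log(x * (x - 1))) / digamma(x)
plot(x, ratio, type = "l", ylab = "approximation / true value")
abline(h = 1, lty = 2)  # exact agreement would fall on this dashed line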
Exercise 1.15 The second derivative of the log-gamma function is called the trigamma
function:
Ψ′ (x) = (d^2 /dx^2 ) log Γ(x).    (1.17)
Like the digamma function, it often arises in statistical calculations; for example,
see Exercise 1.35.
(a) Using the method of Exercise 1.13(c) with h = 1 [that is, expanding f (x+2h),
f (x + h), f (x − h), and f (x − 2h) and then finding a linear combination that
makes all but the second derivative of the log-gamma function disappear], show
how to derive the following approximation to Ψ′ (x) for x > 2:

Ψ′ (x) ≈ (1/12) log [ (x/(x − 1))^{15} (x − 2)/(x + 1) ].    (1.18)
As we saw in Example 1.3, the limiting behavior of a sequence is not fully characterized
by the value of its limit alone, if the limit exists. In that example, both 1 + (−1)^n /n and
1 + (−1)^n /n^2 converge to the same limit, but they approach this limit at different rates. In
this section we consider not only the value of the limit, but the rate at which that limit is
approached. In so doing, we present some convenient notation for comparing the limiting
behavior of different sequences.
The expression |(an − bn )/an | above is called the relative error in approximating an by bn .
The definition of asymptotic equivalence does not say that
lim_n an / lim_n bn = 1;
the above fraction might equal 0/0 or ∞/∞, or the limits might not even exist! (See Exercise
1.17.)
There are multiple ways to prove Stirling’s formula. We outline one proof, based
on the Poisson distribution, in Exercise 4.5.
This is proved in Exercise 1.19. But what about the case k = −1? Let us prove
that
Σ_{i=1}^n (1/i) ∼ log n.    (1.21)
and combine this with the limit superior of the right inequality (remember to
change < to ≤ when doing this; see the discussion following Lemma 1.10) to
obtain
1 ≤ lim inf_n [Σ_{i=1}^n (1/i)] / log n ≤ lim sup_n [Σ_{i=1}^n (1/i)] / log n ≤ 1.
This implies that the limit inferior and limit superior are in fact the same, so the
limit exists and is equal to 1. This is what we wished to show.
The next notation we introduce expresses the idea that one sequence is asymptotically neg-
ligible compared to another sequence.
Definition 1.24 We write an = o(bn ) (“an is little-o of bn ”) as n → ∞ if an /bn → 0
as n → ∞.
Among other advantages, the o-notation makes it possible to focus on the most important
terms of a sequence while ignoring the terms that are comparatively negligible.
Example 1.25 According to Definition 1.24, we may write
1/n − 2/n^2 + 4/n^3 = 1/n + o(1/n) as n → ∞.
This makes it clear at a glance how fast the sequence on the left tends to zero,
since all terms other than the dominant term are lumped together as o(1/n).
Some of the exercises in this section require proving that one sequence is little-o of another
sequence. Sometimes, l’Hôpital’s rule may be helpful; yet as in Example 1.20, care must be
exercised because l’Hôpital’s rule applies to functions of real numbers whereas a sequence is
a function of the positive integers.
Example 1.26 Let us prove that log log n = o(log n). The function (log log x)/ log x,
defined for x > 1, agrees with (log log n)/ log n on the positive integers; thus,
since l’Hôpital’s rule implies
lim_{x→∞} (log log x)/(log x) = lim_{x→∞} [1/(x log x)] / (1/x) = lim_{x→∞} 1/(log x) = 0,
we conclude that (log log n)/ log n must also tend to 0 as n tends to ∞ as an
integer.
Often, however, one may simply prove an = o(bn ) without resorting to l’Hôpital’s rule, as in
the next example.
Example 1.27 Prove that
n = o ( Σ_{i=1}^n √i ).    (1.22)
Proof: Letting ⌊n/2⌋ denote the largest integer less than or equal to n/2,

Σ_{i=1}^n √i ≥ Σ_{i=⌊n/2⌋}^n √i ≥ (n/2) √⌊n/2⌋.

Since n = o(n√n), the desired result follows.
Equation (1.22) could have been proved using the result of Example 1.23, in which
Equation (1.20) with k = 1/2 implies that
Σ_{i=1}^n √i ∼ 2n^{3/2} /3.    (1.23)
However, we urge extreme caution when using asymptotic equivalences like Ex-
pression (1.23). It is tempting to believe that expressions that are asymptotically
equivalent may be substituted for one another under any circumstances, and this
is not true! In this particular example, we may write
n / Σ_{i=1}^n √i = [3/(2√n)] × [ (2n^{3/2} /3) / Σ_{i=1}^n √i ],
and because we know that the second fraction in parentheses tends to 1 by Ex-
pression (1.23) and the first fraction in parentheses tends to 0, we conclude that
the product of the two converges to 0 and Equation (1.22) is proved.
We define one additional order notation, the capital O.
Definition 1.28 We write an = O(bn ) (“an is big-O of bn ”) as n → ∞ if there exist
M > 0 and N > 0 such that |an /bn | < M for all n > N .
In particular, an = o(bn ) implies an = O(bn ). In a vague sense, o and O relate to sequences
as < and ≤ relate to real numbers. However, this analogy is not perfect: For example, note
that it is not always true that either an = O(bn ) or bn = O(an ).
Although the notation above is very precisely defined, unfortunately this is not the case with
the language used to describe the notation. In particular, “an is of order bn ” is ambiguous;
it may mean simply that an = O(bn ), or it may mean something more precise: Some authors
define an ≍ bn or an = Θ(bn ) to mean that |an | remains bounded between m|bn | and M |bn |
for large enough n for some constants 0 < m < M . Although the language can be imprecise,
it is usually clear from context what the speaker’s intent is.
This latter case, where an = O(bn ) but an ≠ o(bn ), is one in which the ratio |an /bn | remains
bounded and also bounded away from zero: There exist positive constants m and M , and
an integer N , such that
m < |an /bn | < M for all n > N .    (1.24)
f (x) = f (a) + (x − a)f ′ (a) + · · · + [(x − a)^d /d!] [f^{(d)} (a) + o(1)] as x → a.
It is important to write “as x → a” in this case.
It is often tempting, when faced with an equation such as an = o(bn ), to attempt to apply a
function f (x) to each side and claim that f (an ) = o[f (bn )]. Unfortunately, however, this is
not true in general and it is not hard to find a counterexample [see Exercise 1.18(d)]. There
are certain circumstances in which it is possible to claim that f (an ) = o[f (bn )], and one such
circumstance is particularly helpful. It involves a convex function f (x), defined as follows:
Definition 1.30 We say that a function f (x) is convex if for all x, y and any α ∈ [0, 1],
we have

f [αx + (1 − α)y] ≤ αf (x) + (1 − α)f (y).    (1.25)
If f (x) is everywhere twice differentiable and f ′′ (x) ≥ 0 for all x, then f (x) is convex (this is
proved in Exercise 1.24). For instance, the function f (x) = exp(x) is convex because its
second derivative is always positive.
We now see a general case in which it may be shown that f (an ) = o[f (bn )].
Theorem 1.31 Suppose that a1 , a2 , . . . and b1 , b2 , . . . are sequences of real numbers
such that an → ∞, bn → ∞, and an = o(bn ); and f (x) is a convex function such
that f (x) → ∞ as x → ∞. Then f (an ) = o[f (bn )].
The proof of Theorem 1.31 is the subject of Exercise 1.25.
There are certain rates of growth toward ∞ that are so common that they have names, such
as logarithmic, polynomial, and exponential growth. If α, β, and γ are arbitrary positive
constants, then the sequences (log n)^α , n^β , and (1 + γ)^n exhibit logarithmic, polynomial, and
exponential growth, respectively. Furthermore, we always have
(log n)^α = o(n^β ) and n^β = o([1 + γ]^n ).    (1.26)
Thus, in the sense of Definition 1.24, logarithmic growth is always slower than polynomial
growth and polynomial growth is always slower than exponential growth.
To prove Statement (1.26), first note that log log n = o(log n), as shown in Example 1.26.
Therefore, α log log n = o(β log n) for arbitrary positive constants α and β. Since exp(x) is
a convex function, Theorem 1.31 gives
(log n)^α = o(n^β ).    (1.27)
As a special case of Equation (1.27), we obtain log n = o(n), which immediately gives
β log n = o[n log(1 + γ)] for arbitrary positive constants β and γ. Exponentiating once again
and using Theorem 1.31 yields
n^β = o[(1 + γ)^n ].
Exercise 1.17 For each of the following statements, prove the statement or provide
a counterexample that disproves it.
(a) If an ∼ bn , then limn an / limn bn = 1.
(b) If limn an / limn bn is well-defined and equal to 1, then an ∼ bn .
(c) If neither limn an nor limn bn exists, then an ∼ bn is impossible.
Exercise 1.19 Prove the asymptotic relationship in Example 1.23.
Hint: One way to proceed is to prove that the sum lies between two simple-
to-evaluate integrals that are themselves asymptotically equivalent. Consult the
proof of Expression (1.21) as a model.
Exercise 1.20 According to the result of Exercise 1.16, the limit (1.21) implies that
the relative difference between Σ_{i=1}^n (1/i) and log n goes to zero. But this does
not imply that the difference itself goes to zero (in general, the difference may
not even have any limit at all). In this particular case, the difference converges to
a constant called Euler’s constant that is sometimes used to define the complex-
valued gamma function.

Evaluate Σ_{i=1}^n (1/i) − log n for various large values of n (say, n ∈ {100, 1000, 10000})
to approximate the Euler constant.
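In R, the requested evaluations take one line apiece; for instance:

# Approximations to Euler's constant (0.5772...) via sum_{i=1}^n (1/i) - log(n).
sapply(c(100, 1000, 10000), function(n) sum(1 / (1:n)) - log(n))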
and nb variancen are roughly constant. (“Nice” here may be interpreted to mean
integers or half-integers.)
Hints: To generate a sample from the given distribution, use the fact that if
U1 , U2 , . . . is a sample from a uniform (0, 1) density and the continuous distribu-
tion function F (x) may be inverted explicitly, then letting Xi = F^{−1} (Ui ) results
in X1 , X2 , . . . being a simple random sample from F (x). When using Splus or R, a
sample from uniform (0, 1) of size, say, 50 may be obtained by typing runif(50).
Calculating δn involves taking the sum of logarithms. Mathematically, this is
the same as the logarithm of the product. However, mathematically equivalent
expressions are not necessarily computationally equivalent! For a large sample,
multiplying all the values could result in overflow or underflow, so the logarithm
of the product won’t always work. Adding the logarithms is safer even though
it requires more computation due to the fact that many logarithms are required
instead of just one.
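The overflow and underflow issue is easy to see directly in R; this small illustration is ours and is not part of the exercise:

# The product of many numbers in (0, 1) underflows to zero, so taking the
# logarithm of the product fails, while the sum of the logarithms is fine.
set.seed(1)
u <- runif(1000)
prod(u)       # underflows to 0
log(prod(u))  # -Inf, which is useless
sum(log(u))   # approximately -1000, the correct value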
Exercise 1.24 Prove that if f (x) is everywhere twice differentiable and f 00 (x) ≥ 0 for
all x, then f (x) is convex.
Hint: Expand both αf (x) and (1 − α)f (y) using Taylor’s theorem 1.18 with
d = 1, then add. Use the mean value theorem version of the Lagrange remainder
(1.7).
Exercise 1.26 Create counterexamples to the result in Theorem 1.31 if the hypothe-
ses of the theorem are weakened as follows:
(a) Find an , bn , and convex f (x) with limx→∞ f (x) = ∞ such that an = o(bn )
but f (an ) ≠ o[f (bn )].
(b) Find an , bn , and convex f (x) such that an → ∞, bn → ∞, and an = o(bn )
but f (an ) ≠ o[f (bn )].
(c) Find an , bn , and f (x) with limx→∞ f (x) = ∞ such that an → ∞, bn → ∞,
and an = o(bn ) but f (an ) ≠ o[f (bn )].
Exercise 1.27 Recall that log n always denotes the natural logarithm of n. Assuming
that log n means log10 n will change some of the answers in this exercise!
(a) The following 5 sequences have the property that each tends to ∞ as n → ∞,
and for any pair of sequences, one is little-o of the other. List them in order of
rate of increase from slowest to fastest. In other words, give an ordering such that
first sequence = o(second sequence), second sequence = o(third sequence), etc.
√n    log n!    Σ_{i=1}^n i^{1/3}    2^{log n}    (log n)^{log log n}
Prove the 4 order relationships that result from your list.
Hint: Here and in part (b), using a computer to evaluate some of the sequences
for large values of n can be helpful in suggesting the correct ordering. However,
note that this procedure does not constitute a proof!
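For instance, reading the five sequences above as √n, log n!, Σ_{i=1}^n i^{1/3}, 2^{log n}, and (log n)^{log log n}, one might evaluate them in R at a single large n as follows (lfactorial avoids overflow when computing log n!); remember that this only suggests an ordering and proves nothing:

n <- 1e6
c(sqrt(n),                 # sqrt(n)
  lfactorial(n),           # log(n!)
  sum((1:n)^(1/3)),        # sum of cube roots
  2^log(n),                # 2^(log n)
  log(n)^log(log(n)))      # (log n)^(log log n)

Some pairs are still close at n = 10^6, so a larger n may be needed before the limiting order becomes apparent.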
(b) Follow the directions of part (a) for the following 13 sequences.
log(n!)    n^2    n^n    3^n
We now consider vectors in Rk , k > 1. We denote vectors by bold face and their components
by regular type with subscripts; thus, a is equivalent to (a1 , . . . , ak ). For sequences of
vectors, we use bold face with subscripts, as in a1 , a2 , . . .. This notation has a drawback:
Since subscripts denote both component numbers and sequence numbers, it is awkward to
denote specific components of specific elements in the sequence. When necessary, we will
denote the jth component of the ith vector by aij . In other words, ai = (ai1 , . . . , aik )⊤ for
i = 1, 2, . . .. We follow the convention that vectors are to be considered as columns instead
of rows unless stated otherwise, and the transpose of a matrix or vector is denoted by a
superscript ⊤.
The extension to the multivariate case from the univariate case is often so trivial that it is
reasonable to ask why we consider the cases separately at all. There are two main reasons.
The first is pedagogical: We feel that any disadvantage due to repeated or overlapping
material is outweighed by the fact that concepts are often intuitively easier to grasp in R
than in Rk . Furthermore, generalizing from R to Rk is often instructive in and of itself, as in
the case of the multivariate concept of differentiability. The second reason is mathematical:
Some one-dimensional results, like Taylor’s Theorem 1.18 for d > 2, need not (or cannot, in
some cases) be extended to multiple dimensions in these notes. In later chapters in these
notes, we will treat univariate and multivariate topics together sometimes and separately
sometimes, and we will maintain the bold-face notation for vectors throughout.
To define a limit of a sequence of vectors, we must first define a norm on Rk . We are
interested primarily in whether the norm of a vector goes to zero, a concept for which any
norm will suffice, so we may as well take the Euclidean norm:
‖a‖ = √( Σ_{i=1}^k ai^2 ) = √(a⊤ a).
With this norm, we say that an → c if ‖an − c‖ → 0; this occurs if and only if anj → cj for
each j. The “if” part follows from repeated use of Theorem 1.5 (which says that the limit of a sum is
the sum of the limits and the limit of a product is the product of the limits) and Theorem 1.16
(which says that continuous functions preserve limits). The “only if” part follows because
|anj − cj | ≤ ‖an − c‖ for each j.
There is no multivariate analogue of Definition 1.2; it is nonsensical to write an → ∞.
However, since ‖an ‖ is a real number, writing ‖an ‖ → ∞ is permissible. If we write
lim_{‖x‖→∞} f (x) = c for a real-valued function f (x), then it must be true that f (x) tends
to the same limit c no matter what path x takes as ‖x‖ → ∞.
Suppose that the function f (x) maps vectors in some open subset U of Rk to vectors in Rℓ ,
a property denoted by f : U → Rℓ . In order to define continuity, we first extend Definition
1.11 to the multivariate case:
Definition 1.34 For a function f : U → Rℓ , where U is open in Rk , we write
limx→a f (x) = c for some a ∈ U and c ∈ Rℓ if for every ε > 0 there exists a
δ > 0 such that ‖f (x) − c‖ < ε whenever x ∈ U and 0 < ‖x − a‖ < δ.
In Definition 1.34, ‖f (x) − c‖ refers to the norm on Rℓ , while ‖x − a‖ refers to the norm on
Rk .
Definition 1.35 A function f : U → Rℓ is continuous at a ∈ U ⊂ Rk if limx→a f (x) =
f (a).
Since there is no harm in letting k = 1 or ℓ = 1 or both, Definitions 1.34 and 1.35 include
Definitions 1.11 and 1.14(a), respectively, as special cases.
The extension of differentiation from the univariate to the multivariate setting is not quite as
straightforward as the extension of continuity. Part of the difficulty lies merely in notation,
but we will also rely on a qualitatively different definition of the derivative in the multivariate
setting. Recall that in the univariate case, Taylor’s Theorem 1.18 implies that the derivative
f ′ (x) of a function f (x) satisfies
[f (x + h) − f (x) − hf ′ (x)] / h → 0 as h → 0.    (1.28)
It turns out that Equation (1.28) could have been taken as the definition of the derivative
f ′ (x). To do so would have required just a bit of extra work to prove that Equation (1.28)
uniquely defines f ′ (x), but this is precisely how we shall now extend differentiation to the
multivariate case:
Definition 1.36 Suppose that f : U → Rℓ , where U ⊂ Rk is open. For a point a ∈ U ,
suppose there exists an ℓ × k matrix Jf (a), depending on a but not on h, such
that

lim_{h→0} [f (a + h) − f (a) − Jf (a)h] / ‖h‖ = 0.    (1.29)
Then Jf (a) is unique and we call Jf (a) the Jacobian matrix of f (x) at a. We say
that f (x) is differentiable at the point a, and Jf (x) may be called the derivative
of f (x).
The assertion in Definition 1.36 that Jf (a) is unique may be proved as follows: Suppose that
Jf^{(1)} (a) and Jf^{(2)} (a) are two versions of the Jacobian matrix. Then Equation (1.29) implies
that

lim_{h→0} [Jf^{(1)} (a) − Jf^{(2)} (a)] h / ‖h‖ = 0;

but h/‖h‖ is an arbitrary unit vector, which means that Jf^{(1)} (a) − Jf^{(2)} (a) must be the
zero matrix, proving the assertion.
Although Definition 1.36, sometimes called the Fréchet derivative, is straightforward and
quite common throughout the calculus literature, there is unfortunately not a universally
accepted notation for multivariate derivatives. Various authors use notation such as f 0 (x),
ḟ (x), Df (x), or ∇f (x) to denote the Jacobian matrix or its transpose, depending on the
situation. In these notes, we adopt perhaps the most widespread of these notations, letting
∇f (x) denote the transpose of the Jacobian matrix Jf (x). We often refer to ∇f as the
gradient of f .
When the Jacobian matrix exists, it is equal to the matrix of partial derivatives, which are
defined as follows:
Definition 1.37 Let g(x) be a real-valued function defined on a neighborhood of a
in Rk . For 1 ≤ i ≤ k, let ei denote the ith standard basis vector in Rk , consisting
of a one in the ith component and zeros elsewhere. We define the ith partial
derivative of g(x) at a to be
∂g(x) def g(a + hei ) − g(a)
= lim ,
∂xi x=a
h→0 h
if this limit exists.
Now we are ready to state that the Jacobian matrix is the matrix of partial derivatives.
Theorem 1.38 Suppose f (x) is differentiable at a in the sense of Definition 1.36.
Define the gradient matrix ∇f (a) to be the transpose of the Jacobian matrix
Jf (a). Then
∇f (a) =
[ ∂f1 (x)/∂x1   · · ·   ∂fℓ (x)/∂x1 ]
[      ⋮          ⋱          ⋮      ]
[ ∂f1 (x)/∂xk   · · ·   ∂fℓ (x)/∂xk ]   evaluated at x = a.    (1.30)
The converse of Theorem 1.38 is not true, in the sense that the existence of partial derivatives
of a function does not guarantee the differentiability of that function (see Exercise 1.31).
When f maps k-vectors to ℓ-vectors, ∇f (x) is a k × ℓ matrix, a fact that is important to
memorize; it is often very helpful to remember the dimensions of the gradient matrix when
trying to recall the form of various multivariate results. To try to simplify the admittedly
confusing notational situation resulting from the introduction of both a Jacobian matrix
and a gradient, we will use only the gradient notation ∇f (x), defined in Equation (1.30),
throughout these notes.
By Definition 1.36, the gradient matrix satisfies the first-order Taylor formula

f (x) = f (a) + ∇f (a)⊤ (x − a) + r1 (x, a), where r1 (x, a)/‖x − a‖ → 0 as x → a.    (1.31)
Definition 1.39 The k × k matrix on the right hand side of Equation (1.32), when
it exists, is called the Hessian matrix of the function f (x) at a.
Twice differentiability guarantees the existence (by two applications of Theorem 1.38) and
symmetry (by Theorem 1.40 below) of the Hessian matrix. The Hessian may exist for a
function that is not twice differentiable, as seen in Exercise 1.33, but this mathematical
curiosity will not concern us elsewhere in these notes.
We state the final theorem of this section, which extends second-order Taylor expansions to
a particular multivariate case, without proof, but the interested reader may consult Magnus
and Neudecker (1999) for an encyclopedic treatment of this and many other topics involving
differentiation.
Theorem 1.40 Suppose that the real-valued function f (x) is twice differentiable at
some point a ∈ Rk . Then ∇2 f (a) is a symmetric matrix, and

f (x) = f (a) + ∇f (a)⊤ (x − a) + (1/2) (x − a)⊤ ∇2 f (a)(x − a) + r2 (x, a),

where r2 (x, a)/‖x − a‖^2 → 0 as x → a.
Exercises for Section 1.4
Exercise 1.28 (a) Suppose that f (x) is continuous at 0. Prove that f (tei ) is con-
tinuous as a function of t at t = 0 for each i, where ei is the ith standard basis
vector.
(b) Prove that the converse of (a) is not true by inventing a function f (x) that
is not continuous at 0 but such that f (tei ) is continuous as a function of t at
t = 0 for each i.
Exercise 1.31 Prove that the converse of Theorem 1.38 is not true by finding a
function that is not differentiable at some point but whose partial derivatives at
that point all exist.
The score vector is defined to be the gradient of the loglikelihood. Find the score
vector for this example.
Hint: The score vector is a vector with two components and it is a function of
X1 , . . . , Xn , µ, and σ 2 . Setting the score vector equal to zero and solving for µ
and σ 2 gives the well-known maximum likelihood estimators of µ and σ 2 , namely
X̄ and (1/n) Σ_i (Xi − X̄)^2 .
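A numerical sanity check in R, assuming the loglikelihood is the usual normal one: the score, computed here by a central finite difference rather than from your formula, should be essentially zero at the maximum likelihood estimators.

loglik <- function(theta, x) {   # theta = c(mu, sigma2)
  -length(x)/2 * log(2 * pi * theta[2]) - sum((x - theta[1])^2) / (2 * theta[2])
}
num.score <- function(theta, x, h = 1e-6) {
  sapply(1:2, function(j) {
    e <- replace(numeric(2), j, h)
    (loglik(theta + e, x) - loglik(theta - e, x)) / (2 * h)
  })
}
set.seed(1)
x <- rnorm(25, mean = 1, sd = 2)
mle <- c(mean(x), mean((x - mean(x))^2))  # the estimators given in the hint
num.score(mle, x)                         # both entries should be near zero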
Use Theorem 1.40 to demonstrate that f (x, y) is not twice differentiable at (0, 0)
by showing that ∇2 f (0, 0), which does exist, is not symmetric.
Exercise 1.34 (a) Find the Hessian matrix of the loglikelihood function defined in
Exercise 1.32.
(b) Suppose that n = 10 and that we observe this sample:
2.946 0.975 1.333 4.484 1.711
2.627 -0.628 2.476 2.599 2.143
Evaluate the Hessian matrix at the maximum likelihood estimator (µ̂, σ̂ 2 ). (A
formula for the MLE is given in the hint to Exercise 1.32).
(c) As we shall see in Chapter 7, the negative inverse of the Hessian matrix is a
reasonable large-sample estimator of the covariance matrix of the MLE (though
with only n = 10, it is not clear how good this estimator would be in this
example!). Invert your answer from part (b), then put a negative sign in front
and use the answer to give approximate standard errors (the square roots of the
diagonal entries) for µ̂ and σ̂ 2 .
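A numerical sketch for parts (b) and (c), assuming the usual normal loglikelihood; optimHess (in R's stats package) computes a numerical Hessian, which you can compare with your analytic answer from part (a).

x <- c(2.946, 0.975, 1.333, 4.484, 1.711,
       2.627, -0.628, 2.476, 2.599, 2.143)
loglik <- function(theta) {   # theta = c(mu, sigma2)
  -length(x)/2 * log(2 * pi * theta[2]) - sum((x - theta[1])^2) / (2 * theta[2])
}
mle <- c(mean(x), mean((x - mean(x))^2))
H <- optimHess(mle, loglik)   # numerical Hessian at the MLE
H                             # compare with the analytic Hessian of part (a)
sqrt(diag(solve(-H)))         # approximate standard errors for part (c)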
Exercise 1.36 The gamma distribution with shape parameter α > 0 and rate pa-
rameter β > 0 has density function

f (x; α, β) = [β^α / Γ(α)] x^{α−1} e^{−βx} for x > 0.
(a) Calculate the score vector for an independent and identically distributed
gamma(α, β) sample of size n.
(b) Using the approximation to the digamma function Ψ(x) given in Equation
(1.16), find a closed-form approximation to the maximum likelihood estimator
(obtained by setting the score vector equal to zero and solving for α and β).
Simulate 1000 samples of size n = 50 from gamma(5, 1) and calculate this ap-
proximation for each. Give histograms of these estimators. Can you characterize
their performance?
The approximation of Ψ(x) in Equation (1.16) can be extremely poor for x < 2,
so the method above is not a reliable general-purpose estimation procedure.
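A simulation scaffold in R for part (b); the estimator below is only a method-of-moments stand-in so that the code runs, and it should be replaced by the closed-form approximate MLE the exercise asks you to derive.

est.fn <- function(x) {
  m <- mean(x); v <- mean((x - m)^2)
  c(alpha = m^2 / v, beta = m / v)  # method-of-moments placeholder; replace me
}
set.seed(1)
est <- t(replicate(1000, est.fn(rgamma(50, shape = 5, rate = 1))))
par(mfrow = c(1, 2))
hist(est[, "alpha"], main = "estimates of alpha", xlab = "alpha")
hist(est[, "beta"],  main = "estimates of beta",  xlab = "beta")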
While random variables have made only occasional appearances in these notes before now,
they will be featured prominently from now on. We do not wish to make the definition of a
random variable rigorous here—to do so requires measure theory—but we assume that the
reader is familiar with the basic idea: A random variable is a function from a sample space
Ω into R. (We often refer to “random vectors” rather than “random variables” if the range
space is Rk rather than R.)
For any random variable X, we denote the expected value of X, if this value exists, by
E X. We assume that the reader is already familiar with expected values for commonly-
encountered random variables, so we do not attempt here to define the expectation operator
E rigorously. In particular, we avoid writing explicit formulas for E X (e.g., sums if X is
discrete or integrals if X is continuous) except when necessary. Much of the theory in these
notes may be developed using only the E X notation; exceptions include cases in which
we wish to evaluate particular expectations and cases in which we must deal with density
functions (such as the topic of maximum likelihood estimation). For students who have not
been exposed to any sort of rigorous treatment of random variables and expectation, we
hope that the many applications of this theory presented here will pique their curiosity and
encourage them to delve further into the technical details of random variables, expectations,
and conditional expectations. Nearly any advanced probability textbook will develop these
details. For a quick, introductory-level exposure to these intricacies, we recommend the first
chapter of Lange (2003).
Not all random variables have expectations, even if we allow the possibilities E X = ±∞:
Let X + = max{X, 0} and X − = max{−X, 0} denote the positive and negative parts of X,
so that X = X + − X − . Now both E X + and E X − are always well-defined if we allow ∞ as
a possibility, but if both X + and X − have infinite expectation, then there is no sensible way
to define E X. It is easy to find examples of random variables X for which E X is undefined.
Perhaps the best-known example is a Cauchy random variable (whose density function is
given in Exercise 7.3), but we may construct other examples by taking any two independent
nonnegative random variables Y1 and Y2 with infinite expectation—e.g., let Yi take the value
2^n with probability 2^{−n} for all positive integers n—and simply defining X = Y1 − Y2 .
The expectation operator has several often-used properties, listed here as axioms because
we will not derive them from first principles. We assume below that X and Y are defined
on the same sample space Ω and E X and E Y are well-defined.
As a special case of the conditioning property, note that if X and Y are independent, then
E (X|Y ) = E X, which gives the well-known identity
E XY = E {E (XY |Y )} = E {Y E (X|Y )} = E {Y E X} = E X E Y,
where we have used the fact that E (XY |Y ) = Y E (X|Y ), which is always true because
conditioning on Y is like holding it constant.
The variance and covariance operators are defined as usual, namely,
Cov (X, Y ) = E XY − (E X)(E Y )
and Var (X) = Cov (X, X). The linearity property above extends to random vectors: For
scalars a and b we have E (aX + bY) = a E (X) + b E (Y), and for matrices P and Q with
dimensions such that P X + QY is well-defined, E (P X + QY) = P E (X) + Q E (Y). The
covariance between two random vectors is
Cov (X, Y) = E XY⊤ − (E X)(E Y)⊤ ,
and the variance matrix of a random vector (sometimes referred to as the covariance matrix)
is Var (X) = Cov (X, X). Among other things, these properties imply that
where throughout these notes, I{·} denotes the indicator function
I{expression} = 1 if expression is true, and 0 if expression is not true.    (1.34)
f (E X) ≤ E f (X) (1.37)
for any convex function f (x) and any random variable X. Definition 1.30 tells
precisely what a convex function is, but the intuition is simple: Any line segment
connecting two points on the graph of a convex function must never go below the
graph (valley-shaped graphs are convex; hill-shaped graphs are not). To prove
inequality 1.37, we require another property of any convex function, called the
supporting hyperplane property. This property, whose proof is the subject of
Exercise 1.38, essentially guarantees that for any point on the graph of a convex
function, it is possible to construct a hyperplane through that point that puts
the entire graph on one side of that hyperplane.
In the context of inequality (1.37), the supporting hyperplane property guarantees
that there exists a line g(x) = ax + b through the point [E X, f (E X)] such
that g(x) ≤ f (x) for all x (see Figure 1.2). By monotonicity, we know that
E g(X) ≤ E f (X). We now invoke the linearity of the expectation operator to
conclude that E f (X) ≥ E g(X) = a E X + b = g(E X) = f (E X), which proves inequality (1.37).
Figure 1.2: The solid curve is a convex function f (x) and the dotted line is a supporting
hyperplane g(x), tangent at x = E X. This figure shows how to prove Jensen’s inequality.
Exercise 1.37 Show by example that equality can hold in inequality 1.36.
Exercise 1.38 Let f (x) be a convex function on some interval, and let x0 be any
point on the interior of that interval.
(a) Prove that
lim_{x→x0 +} [f (x) − f (x0 )] / (x − x0 )    (1.38)
exists and is finite; that is, a one-sided derivative exists at x0 .
Hint: Using Definition 1.30, show that the fraction in expression (1.38) is non-
increasing and bounded below as x decreases to x0 .
(b) Prove that there exists a linear function g(x) = ax+b such that g(x0 ) = f (x0 )
and g(x) ≤ f (x) for all x in the interval. This fact is the supporting hyperplane
property in the case of a convex function taking a real argument.
Hint: Let f 0 (x0 +) denote the one-sided derivative of part (a). Consider the line
f (x0 ) + f 0 (x0 +)(x − x0 ).
Exercise 1.39 Prove Hölder’s inequality: For random variables X and Y and positive
p and q such that p + q = 1,
E |XY | ≤ (E |X|^{1/p} )^p (E |Y |^{1/q} )^q .    (1.39)
(If p = q = 1/2, inequality (1.39) is also called the Cauchy-Schwarz inequality.)
Hint: Use the convexity of exp(x) to prove that |abXY | ≤ p|aX|^{1/p} + q|bY |^{1/q}
whenever aX ≠ 0 and bY ≠ 0 (the same inequality is also true if aX = 0 or
bY = 0). Take expectations, then find values for the scalars a and b that give the
desired result when the right side of inequality (1.39) is nonzero.
Exercise 1.40 Use Hölder’s Inequality (1.39) to prove that if α > 1, then
(E |X|)^α ≤ E |X|^α .
to be the centered kth partial sum for 1 ≤ k ≤ n. Then for a > 0, Kolmogorov’s
inequality states that
Var Sn
P max |Sk | ≥ a ≤ . (1.40)
1≤k≤n a2
(a) Let Ak denote the event that |Si | ≥ a for the first time when i = k; that is,
that |Sk | ≥ a and |Sj | < a for all j < k. Prove that
a^2 P ( max_{1≤k≤n} |Sk | ≥ a ) ≤ Σ_{k=1}^n E [ I{Ak } Sk^2 ].
Hint: Use the fact that the Ak are nonoverlapping, which implies that 1 ≥
I(A1 ) + · · · + I(An ). Also use Sn^2 = Sk^2 + 2Sk (Sn − Sk ) + (Sn − Sk )^2 .
(c) Using parts (a) and (b), prove inequality (1.40).
Hint: By independence, E [I{Ak }Sk (Sn − Sk )] = E [I{Ak }Sk ] E (Sn − Sk ).
What is E (Sn − Sk )?
Exercise 1.42 Try a simple numerical example to check how much sharper Kol-
mogorov’s inequality (1.40) is than Chebyshev’s inequality (1.36).
(a) Take n = 8 and assume that X1 , . . . , Xn are independent normal random
variables with E Xi = 0 and Var Xi = 9 − i. Take a = 12. Calculate the exact
values on both sides of Chebyshev’s inequality (1.36).
(b) Simulate 10^4 realizations of the situation described in part (a). For each,
record the maximum value attained by |Sk | for k = 1, . . . , 8. Approximate the
probability on the left hand side of Kolmogorov’s inequality (1.40). Describe
what you find when you compare parts (a) and (b). How does a histogram of the
maxima found in part (b) compare with the distribution of |Sn |?
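One possible R sketch for this exercise:

set.seed(1)
n <- 8; a <- 12
vars <- 9 - (1:n)                       # Var X_i = 9 - i, and E X_i = 0
# Part (a): exact probability and Chebyshev bound for S_n ~ N(0, sum(vars))
c(exact = 2 * pnorm(-a / sqrt(sum(vars))), chebyshev = sum(vars) / a^2)
# Part (b): Monte Carlo estimate of P(max_k |S_k| >= a)
max.abs.S <- replicate(1e4, max(abs(cumsum(rnorm(n, sd = sqrt(vars))))))
mean(max.abs.S >= a)                    # left-hand side of (1.40), approximately
hist(max.abs.S, main = "simulated maxima of |S_k|", xlab = "max |S_k|")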
Exercise 1.43 The complex plane C consists of all points x + iy, where x and y are
real numbers and i = √−1. The elegant result known as Euler’s formula relates
the points on the unit circle to the complex exponential function:

e^{it} = cos t + i sin t.    (1.41)
Because e^{it} is on the unit circle for all real-valued t, the norm (also known as
the modulus) of e^{it} , denoted |e^{it} |, equals 1. This fact leads to the following
generalization of the triangle inequality: For any real-valued function g(x) and
any real number t,
| ∫_0^t g(x) e^{ix} dx | ≤ | ∫_0^t |g(x) e^{ix} | dx | = | ∫_0^t |g(x)| dx | .    (1.42)
The inequalities below in parts (a) through (d) involving exp{it} will be used
in Chapter 4. Assume t is a real number, then use Equations (1.6) and (1.41),
together with Inequality (1.42), to prove them. [Since we only claim Equation
(1.6) to be valid for real-valued functions of real variables, it is necessary here
to use Euler’s formula to separate e^{it} into its real and imaginary parts, namely
cos t and sin t, then Taylor-expand them separately before reassembling the parts
using Euler’s formula again.]
(a) In Equation (1.6), use a = 0 and d = 0 on both cos t and sin t to show that
for any t ∈ R,
| exp{it} − 1| ≤ |t|.
(d) Proceed as above but using d = 1 for sin t, then d = 2 together with
integration by parts for cos t, to show that

| exp{it} − 1 − it + t^2 /2 | ≤ t^2 .
Exercise 1.44 Refer to Exercise 1.43. Graph the functions | exp{it} − 1 − it + t^2 /2 |,
|t|^3 /6, and t^2 for t in the interval [−10, 10]. Graph the three curves on the same
set of axes, using different plotting styles so they are distinguishable from one
another. As a check, verify that the inequalities in Exercises 1.43(c) and (d)
appear to be satisfied.
Hint: The modulus |z| of a complex number z = x + iy equals √(x^2 + y^2 ). Refer
to Equation (1.41) to deal with the expression exp{it}.
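One way to produce the requested plot in R, using complex arithmetic directly:

t <- seq(-10, 10, length.out = 1001)
lhs <- Mod(exp(1i * t) - 1 - 1i * t + t^2 / 2)  # modulus of the complex expression
plot(t, abs(t)^3 / 6, type = "l", lty = 2, ylab = "value")  # |t|^3 / 6
lines(t, t^2, lty = 3)                                      # t^2
lines(t, lhs, lty = 1)                                      # the modulus itself
legend("top", lty = c(1, 2, 3),
       legend = c("|exp(it) - 1 - it + t^2/2|", "|t|^3 / 6", "t^2"))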
Exercise 1.45 For any nonnegative random variable Y with finite expectation, prove
that
Σ_{i=1}^∞ P (Y ≥ i) ≤ E Y.    (1.43)
(This equation remains true if E Y = ∞.) To sketch a proof, note that if we
can prove E ∫ f (Y, t) dt = ∫ E f (Y, t) dt, the result follows immediately by taking
f (Y, t) = I{Y ≥ t}.