Statistics Notes 2


Probability and Statistics Lecture Notes

Antonio Jiménez-Martínez
Chapter 1
Probability spaces
In this chapter we introduce the theoretical structures that will allow us to assign probabilities in a wide range of probability problems.
1.1. Examples of random phenomena
Science attempts to formulate general laws on the basis of observation and experiment.
The simplest and most used scheme of such laws is:

    if a set of conditions B is satisfied ⟹ event A occurs.

Examples of such laws are the law of gravity, the law of conservation of mass, and many other instances in chemistry, physics, biology, and so on. If event A occurs inevitably whenever the set of conditions B is satisfied, we say that A is certain or sure (under the set of conditions B). If A can never occur whenever B is satisfied, we say that A is impossible (under the set of conditions B). If A may or may not occur whenever B is satisfied, then A is said to be a random phenomenon.
Random phenomena are our subject matter. Unlike certain and impossible events, the presence of randomness implies that the set of conditions B does not reflect all the necessary and sufficient conditions for the event A to occur. It might then seem impossible to make any worthwhile statements about random phenomena. However, experience has shown that many random phenomena exhibit a statistical regularity that makes them subject to study. For such random phenomena it is possible to estimate the chance of occurrence of the random event. This estimate can be obtained from laws, called probabilistic or stochastic, with the form:

    if a set of conditions B is satisfied repeatedly n times ⟹ event A occurs m times out of the n repetitions.

Probabilistic laws play an important role in almost every field of science. If one lets n → ∞, then the probability that event A occurs (under the set of conditions B) can be estimated as the ratio m/n.
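The relative-frequency estimate m/n can be illustrated with a short simulation. This is an illustrative sketch only; the function and event names below are invented for the example, not taken from the text:

```python
import random

def estimate_probability(event, experiment, n, seed=0):
    """Estimate P(A) as m/n: the fraction of n repetitions in which A occurs."""
    rng = random.Random(seed)
    m = sum(1 for _ in range(n) if event(experiment(rng)))
    return m / n

# Set of conditions B: roll a fair die once; event A: the roll is even.
roll = lambda rng: rng.randint(1, 6)
is_even = lambda outcome: outcome % 2 == 0

estimate = estimate_probability(is_even, roll, n=100_000)
# For large n the ratio m/n should be close to the true probability 1/2.
```

For n = 100,000 repetitions the estimate lands within a percent or so of 1/2, illustrating the statistical regularity described above.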
Now, how are probabilities assigned to random events? Historically, there have been two approaches to the study of random phenomena, the relative frequency (or statistical) method and the classical (or a priori) method. The relative frequency method relies upon observation of the occurrence of the event A under a large number of repetitions of the fixed set of conditions B. Then, one counts the number of times that event A has occurred. The classical method, whose introduction is credited to Laplace (1812), makes use of the concept of equal likelihood, which is taken as a primitive of the model. Under this approach, if an event is regarded as the aggregation of several mutually exclusive and equally likely elementary events, the probability of such an event is obtained as the sum of the individual probabilities of the elementary events.
These two approaches were joined in the 20th century by the axiomatic approach,
introduced by A. N. Kolmogorov, which is consistent with both of them and allows for a
systematic and rigorous treatment of general classes of random phenomena. This is the
approach that we shall follow.
1.2. Probability spaces
We would like to use a development that allows us to assign probabilities to as many events as possible. We begin with an arbitrary nonempty set Ω of elementary events or (sample) points. In decision theory, the elementary events are also known as states of the world. Consider a collection or family F of subsets of Ω with a generic element A ∈ F (i.e., A ⊆ Ω). Elements A ∈ F are called (random) events. We wish to impose formal conditions on F that allow us to assign probabilities in general probability problems. To do this, we use measurable structures, which are at the foundations of probability and statistics.

Intuitively, suppose that random events are described by sentences and we wish to assign probabilities to them. Given events A and B, it makes sense to connect such sentences so as to form new sentences like "A and B," "A or B," and "not A." Then, the family of events should be closed under intersections, unions, and complements. It should also include the entire set Ω of elementary events. Such a family of sets is called an algebra (or a field) of sets. If we also wish to discuss results along the lines of the law of averages, which have to do with the average behavior over an infinite sequence of trials, then it is useful to add closure under countable unions and intersections to our list of desiderata. An algebra that is closed under countable unions is a σ-algebra (or a σ-field). A nonempty set equipped with a σ-algebra of subsets is a measurable space, and elements of this σ-algebra are called measurable sets or (random) events.
Definition 1. Let Ω be an arbitrary nonempty set. A nonempty family F of subsets of Ω is an algebra on Ω if A, B ∈ F implies:
(a) A ∪ B ∈ F;
(b) Ω \ A ∈ F.
A σ-algebra on Ω is an algebra on Ω that is also closed under countable unions, i.e., for each sequence {A_n}_{n=1}^∞ ⊆ F, we have ⋃_{n=1}^∞ A_n ∈ F. The elements of a σ-algebra are called (random) events.
Definition 2. A measurable space is a pair (Ω, F) where Ω is an arbitrary nonempty set and F is a σ-algebra on Ω.

Each algebra F on Ω contains Ω and ∅.[1] Since F is nonempty by definition, there exists some A ∈ F, so Ω \ A ∈ F follows from (b) in the definition above. Hence, Ω = A ∪ (Ω \ A) ∈ F and ∅ = Ω \ Ω ∈ F, respectively, from (a) and from (b) in the definition above. The events Ω and ∅ are called, respectively, the certain (or sure) event and the impossible event.

Notice that, by De Morgan's law, A ∪ B = Ω \ [(Ω \ A) ∩ (Ω \ B)] and A ∩ B = Ω \ [(Ω \ A) ∪ (Ω \ B)]. Thus, if F is a σ-algebra on Ω, then, since it is closed under complementation, we obtain that F is closed under the formation of finite unions if and only if it is closed under the formation of finite intersections. Then, we can replace requirement (a) in the definition above by (a′) A ∩ B ∈ F. Analogously, by applying the infinite form of De Morgan's law, we can replace the statement above that defines a σ-algebra by the requirement that a σ-algebra is an algebra that is also closed under countable intersections, i.e., for each sequence {A_n}_{n=1}^∞ ⊆ F, we have ⋂_{n=1}^∞ A_n ∈ F.
Example 1. Using set operations, formal statements regarding events are expressed as:
(1) Event A does not occur: Ω \ A;
(2) Both events A and B occur: A ∩ B;
(3) Either event A, event B, or both occur: A ∪ B;
(4) Either event A or event B occurs, but not both of them: (A \ B) ∪ (B \ A) =: A △ B (the symmetric difference set operation);
(5) Events A and B are mutually exclusive: A ∩ B = ∅;
(6) Event A occurs and event B does not occur: A \ B;
(7) If event A occurs, then event B also occurs: A ⊆ B;
(8) Neither event A nor event B occurs: Ω \ (A ∪ B).
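The eight correspondences above map directly onto Python's built-in set operations. A small sketch; the concrete events chosen here (a die roll) are invented for illustration:

```python
# Modeling the statements of Example 1 with Python sets, where `omega`
# plays the role of the sample set Ω.
omega = {1, 2, 3, 4, 5, 6}           # elementary events of one die roll
A = {1, 2, 3}                        # "a number less than 4 comes up"
B = {2, 4, 6}                        # "an even number comes up"

not_A         = omega - A            # (1) Ω \ A
both          = A & B                # (2) A ∩ B
either        = A | B                # (3) A ∪ B
exactly_one   = A ^ B                # (4) symmetric difference A △ B
mutually_excl = (A & B) == set()     # (5) is A ∩ B = ∅?
A_without_B   = A - B                # (6) A \ B
A_implies_B   = A <= B               # (7) is A ⊆ B?
neither       = omega - (A | B)      # (8) Ω \ (A ∪ B)
```

Here `A ^ B` gives {1, 3, 4, 6} (an odd or even-but-not-2 number, not both events), and `neither` is {5}.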
Example 2. Consider the experiment of rolling a die once. Then Ω = {1, . . . , 6}. If we wish to be able to discern among all possible subsets of Ω, then we would take 2^Ω as our σ-algebra. However, suppose that we wish to model the pieces of information obtained by a person who is only told whether or not 1 has come up. Then 2^Ω would not be the most appropriate σ-algebra. For instance, {1, 2, 3} ∈ 2^Ω is the event "a number less than 4 has come up," a piece of information that this person does not receive in this experiment. In this experiment, it makes more sense to choose a σ-algebra like {∅, Ω, {1}, {2, . . . , 6}}.
Example 3. Consider the experiment of rolling a die arbitrarily many times. Then Ω = {1, . . . , 6} × {1, . . . , 6} × · · · = {1, . . . , 6}^∞. Suppose that we wish to talk about the event "number 2 comes up in the ith roll of the die." Then, we should certainly choose a σ-algebra on Ω that contains all sets of the form

    A_i = {(ω_n)_{n=1}^∞ ∈ Ω : ω_i = 2}, i = 1, 2, . . . .

Notice that, given this σ-algebra, the situation B = "in neither the second nor the third roll does number 2 come up" is formally an event since

    B = {(ω_n)_{n=1}^∞ ∈ Ω : ω_2 ≠ 2, ω_3 ≠ 2} = (Ω \ A_2) ∩ (Ω \ A_3).

Also, the situations "number 2 comes up at least once through the rolls," described by ⋃_{i=1}^∞ A_i, and "each roll results in number 2 coming up," described by {(2, 2, . . . )} = ⋂_{i=1}^∞ A_i, are formally events under this σ-algebra.

[1] Some textbooks explicitly include Ω ∈ F as a requirement in the definition of algebra.
The simplest example of a σ-algebra is {∅, Ω}, which is the smallest (with respect to set inclusion) σ-algebra on Ω. The largest possible σ-algebra on Ω is the power class 2^Ω, the collection of all subsets of Ω. In many probability problems, one often wishes to do the following. Beginning with a collection F of subsets of Ω, one searches for a family of subsets of Ω that (a) contains F, (b) is itself a σ-algebra on Ω, and (c) is in a certain sense as small as possible. The notion of the σ-algebra generated by F gives us precisely this.
Definition 3. Let Ω be an arbitrary nonempty set and let F be a nonempty family of subsets of Ω. The σ-algebra generated by F is the family of subsets of Ω

    σ(F) := ⋂_{i∈I} F_i, where each F_i ⊆ 2^Ω is a σ-algebra on Ω with F ⊆ F_i,

and where I is an arbitrary index set running over all such σ-algebras. From the definition above one immediately observes that

    σ(F) = inf {G ⊆ 2^Ω : G ⊇ F and G is a σ-algebra on Ω}.
Theorem 1. Let Ω be a nonempty set and let F be a nonempty family of subsets of Ω. Then σ(F) satisfies the following properties:
(a) σ(F) is a σ-algebra on Ω;
(b) F ⊆ σ(F);
(c) if F ⊆ G and G is a σ-algebra on Ω, then σ(F) ⊆ G.

Proof. (a) First, take A ∈ σ(F); then A ∈ F_i for each F_i ⊇ F, i ∈ I. Since each F_i is a σ-algebra on Ω, we have Ω \ A ∈ F_i for each F_i ⊇ F, i ∈ I, and, therefore, Ω \ A ∈ σ(F). Second, take a sequence {A_n}_{n=1}^∞ ⊆ σ(F); then {A_n}_{n=1}^∞ ⊆ F_i for each F_i ⊇ F, i ∈ I. Since each F_i is a σ-algebra on Ω, we have ⋃_{n=1}^∞ A_n ∈ F_i for each F_i ⊇ F, i ∈ I. Therefore, ⋃_{n=1}^∞ A_n ∈ σ(F).
(b) The result follows directly from the definition of σ(F): each F_i in the intersection contains F, so F ⊆ ⋂_{i∈I} F_i = σ(F).
(c) Take a σ-algebra G on Ω such that G ⊇ F. Then, it must be the case that G = F_k for some k ∈ I, so that σ(F) = ⋂_{i∈I} F_i ⊆ F_k = G.
Definition 4. A topology on a set is a collection of subsets of the set that contains the empty set and the set itself, and that is closed under finite intersections and arbitrary (maybe uncountable!) unions. A member of a topology on a set is called an open set in such a set.

Notice that a σ-algebra on a countable set is also a topology on that set, but the converse is not true.
Example 4. Consider the set Ω = {a, b, c, d} and its family of subsets

    τ = {∅, {a}, {a, d}, {b, c}, {a, b, c, d}}.

We have that τ is not a σ-algebra on Ω since, for instance, Ω \ {a} = {b, c, d} ∉ τ. Furthermore, τ is not a topology on Ω either since, for instance, {a} ∪ {b, c} = {a, b, c} ∉ τ. We can add one extra element to τ so that τ′ = τ ∪ {{a, b, c}} is indeed a topology on Ω. However, τ′ is still not a σ-algebra on Ω. If we look for the σ-algebras generated by τ and τ′, separately, we obtain

    σ(τ) = σ(τ′) = {∅, {a}, {a, d}, {b, c}, {a, b, c, d}, {a, b, c}, {b, c, d}, {d}}.
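On a finite sample set, the generated σ-algebra can be computed by brute-force closure under complements and pairwise unions (which, on a finite set, also yields closure under intersections via De Morgan's law). A sketch with invented helper names; it reproduces the eight sets of Example 4:

```python
from itertools import combinations

def generated_sigma_algebra(omega, family):
    """On a finite omega, close `family` under complements and (pairwise,
    hence all finite) unions; the fixed point is sigma(family)."""
    sigma = {frozenset(), frozenset(omega)} | {frozenset(s) for s in family}
    changed = True
    while changed:
        changed = False
        current = list(sigma)
        for s in current:
            comp = frozenset(omega) - s          # complement
            if comp not in sigma:
                sigma.add(comp)
                changed = True
        for s, t in combinations(current, 2):    # pairwise unions
            u = s | t
            if u not in sigma:
                sigma.add(u)
                changed = True
    return sigma

omega = {'a', 'b', 'c', 'd'}
tau = [{'a'}, {'a', 'd'}, {'b', 'c'}]
sigma = generated_sigma_algebra(omega, tau)
# sigma has the eight members listed in Example 4.
```

Running this yields exactly {∅, {a}, {a, d}, {b, c}, {a, b, c, d}, {a, b, c}, {b, c, d}, {d}}.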
Let us now comment on a generated σ-algebra that is extensively used in probability and statistics.

Definition 5. Let Ω be an arbitrary nonempty set endowed with a topology τ. The Borel σ-algebra on the topological space (Ω, τ) is the σ-algebra generated by such a topology, B_τ := σ(τ). The members A ∈ B_τ are called Borel sets in (Ω, τ) (or, simply, in Ω when the topology is understood from the context).

Therefore, the Borel σ-algebra of a set is a concept relative to the topology that one considers for this set. Of particular interest in this course is the Borel σ-algebra B_ℝ on ℝ. When no particular topology on ℝ is provided, it is usually understood that the Borel σ-algebra on ℝ is generated by the topology of all open sets in ℝ, that is,

    B_ℝ := σ({S ⊆ ℝ : S is open in ℝ}).

This notion of Borel σ-algebra can be directly extended to Euclidean spaces.
Example 5. Consider the collection of open intervals in ℝ,

    C = {(a, b) ⊆ ℝ : −∞ < a < b < +∞}.

We now show that σ(C) = B_ℝ. Let τ denote the collection of all open sets in ℝ. Since each open interval is an open set in ℝ, we have that C ⊆ σ(τ). Then σ(C) ⊆ σ(τ) because σ(τ) is a σ-algebra on ℝ. On the other hand, since each open set in ℝ can be expressed as the union of countably many open intervals, we know that τ ⊆ σ(C). This is so because, as a σ-algebra that contains C, σ(C) must contain the unions of countably many open intervals. But then σ(τ) ⊆ σ(C) follows from the fact that σ(C) is a σ-algebra on ℝ. Therefore, σ(C) = σ(τ) = B_ℝ.
Example 6. Consider the collection of all bounded right-semiclosed intervals of ℝ,

    D = {(a, b] ⊆ ℝ : −∞ < a < b < +∞}.

We now show that σ(D) = B_ℝ. Let C denote the collection of all open intervals, for which the previous example gives σ(C) = B_ℝ. First, note that for each a, b ∈ ℝ such that −∞ < a < b < +∞, we have

    (a, b] = ⋂_{n=1}^∞ (a, b + 1/n).

Then D ⊆ σ(C) since, as a σ-algebra that contains C, σ(C) must contain the intersections of countably many open intervals. From the fact that σ(C) is a σ-algebra on ℝ, it follows that σ(D) ⊆ σ(C). Second, note that for each a, b ∈ ℝ such that −∞ < a < b < +∞, we have

    (a, b) = ⋃_{n=1}^∞ (a, b − 1/n].

Then, by a totally analogous argument, we have C ⊆ σ(D) and, then, σ(C) ⊆ σ(D). Therefore, σ(D) = σ(C) = B_ℝ.
Using arguments analogous to those in the previous two examples, one can show that the Borel σ-algebra on ℝ coincides also with the σ-algebras generated, respectively, by the following collections of sets in ℝ:
(1) the collection of all closed intervals,
(2) the collection of all bounded left-semiclosed intervals,
(3) the collection of all intervals of the form (−∞, a],
(4) the collection of all intervals of the form (−∞, a),
(5) the collection of all intervals of the form (b, +∞),
(6) the collection of all intervals of the form [b, +∞), and
(7) the collection of all closed sets.
The fact in (7) above that the collection of all closed sets induces B_ℝ implies that singletons and countable sets in ℝ are members of its Borel σ-algebra (of course, provided that no reference to a particular topology different from the collection of all open sets is given!).
Up to now we have imposed conditions on the set of possible random events to which we can assign probabilities. But how do we actually assign probabilities to such events? A. N. Kolmogorov (1929, 1933) developed the axiomatic approach to probability theory, which relates probability theory to set theory and to the modern developments in measure theory. Historically, mathematicians have been interested in generalizing the notions of length, area, and volume. The most useful generalization of these concepts is provided by the notion of a measure. Using these tools from measure theory, Kolmogorov's axioms follow from the definition of probability measure below.
Definition 6. A set function is an (extended) real function defined on a family of subsets of a measurable space.

In its abstract form, a measure is a set function with some additivity properties that reflect the properties of length, area, and volume.

Definition 7. Let (Ω, F) be a measurable space. A measure P on (Ω, F) is a set function P : F → ℝ*, where ℝ* := ℝ ∪ {−∞, +∞}, satisfying:
(a) P(∅) = 0 and P(A) ≥ 0 for each A ∈ F;
(b) (σ-additivity) if {A_n}_{n=1}^∞ ⊆ F is a sequence of pairwise disjoint events in F, then

    P(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ P(A_n).

If a measure P on (Ω, F) also satisfies P(Ω) = 1, then it is called a probability measure.[2]
Definition 8. A probability space is a triple (Ω, F, P) where Ω is an arbitrary nonempty set, F is a σ-algebra of subsets of Ω, and P is a probability measure on (Ω, F).

From the definition of probability measure above we can deduce some properties of probabilities that will be used extensively throughout the course. Let (Ω, F, P) be a probability space. Then,

(P1) P(∅) = 0, as stated directly in the definition of probability measure.

(P2) for each A ∈ F, P(Ω \ A) = 1 − P(A).
Proof. Since Ω = (Ω \ A) ∪ A and (Ω \ A) ∩ A = ∅, σ-additivity implies

    1 = P(Ω) = P(Ω \ A) + P(A) ⟹ P(Ω \ A) = 1 − P(A).

(P3) for each A ∈ F, 0 ≤ P(A) ≤ 1.
Proof. Note that P(A) ≥ 0 is stated directly in the definition of probability measure. Also, P(A) ≤ 1 follows from the result P(A) = 1 − P(Ω \ A) shown above and from the fact that P(Ω \ A) ≥ 0 stated in the definition of probability measure.
[2] Given some set function P : F → ℝ*, Kolmogorov's axioms are:
(i) P(A) ≥ 0 for each A ∈ F;
(ii) P(Ω) = 1;
(iii) for each finite set {A_1, . . . , A_n} of pairwise disjoint events, we have P(⋃_{i=1}^n A_i) = Σ_{i=1}^n P(A_i).
Compare these axioms with the given definition of probability measure.
(P4) for A, B ∈ F, if A ⊆ B, then P(A) ≤ P(B).
Proof. Since we can write B = A ∪ [(Ω \ A) ∩ B] and A ∩ [(Ω \ A) ∩ B] = ∅, σ-additivity implies

    P(B) = P(A) + P((Ω \ A) ∩ B) ≥ P(A).

(P5) for each A, B ∈ F, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. Consider the following identities:

    A ∪ B = A ∪ [B \ (A ∩ B)],
    B = [A ∩ B] ∪ [B \ (A ∩ B)].

Since the sets united in each identity are disjoint, it follows from σ-additivity that

    P(A ∪ B) = P(A) + P(B \ (A ∩ B)),
    P(B) = P(A ∩ B) + P(B \ (A ∩ B)).

Then, the result follows by combining these two identities.

(P6) for each A, B ∈ F, P(A ∪ B) ≤ P(A) + P(B).
Proof. The result follows directly from property (P5) above combined with the fact that P(A ∩ B) ≥ 0, as stated directly in the definition of probability measure.

(P7) for each sequence {A_n}_{n=1}^∞ ⊆ F of events (not necessarily disjoint!), P(⋃_{n=1}^∞ A_n) = 1 − P(⋂_{n=1}^∞ (Ω \ A_n)); that is, the probability that at least one of the events A_n occurs is 1 minus the probability that none of the events occurs.
Proof. By applying the infinite form of De Morgan's law, we have

    ⋃_{n=1}^∞ A_n = Ω \ [⋂_{n=1}^∞ (Ω \ A_n)].

Then, the result follows using property (P2).
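Properties (P2) and (P4) to (P6) are easy to check numerically on a finite probability space. A sketch using exact rational arithmetic; the uniform die space chosen here is my own illustration:

```python
from fractions import Fraction

# A finite probability space: a fair die with the power set as sigma-algebra
# and the uniform measure P(A) = |A| / |Omega|.
omega = frozenset({1, 2, 3, 4, 5, 6})

def P(A):
    return Fraction(len(A), len(omega))

A = frozenset({1, 2, 3})   # "less than 4"
B = frozenset({2, 4, 6})   # "even"

# (P2) complement rule
assert P(omega - A) == 1 - P(A)
# (P4) monotonicity
assert P(frozenset({2})) <= P(B)
# (P5) inclusion-exclusion
assert P(A | B) == P(A) + P(B) - P(A & B)
# (P6) subadditivity
assert P(A | B) <= P(A) + P(B)
```

For these two events, inclusion-exclusion gives P(A ∪ B) = 1/2 + 1/2 − 1/6 = 5/6.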
Definition 9. A Borel measure is a measure defined on the Borel sets of a topological space.

One of the most important examples of measures is the Lebesgue measure on the real line and its generalizations to Euclidean spaces. It is the unique measure on the Borel sets whose value on each interval is its length.

Definition 10. The Lebesgue measure on ℝ is the set function λ : B_ℝ → [0, +∞] specified by λ((a, b]) := b − a for each (a, b] ∈ B_ℝ.
Similarly, we can define the Lebesgue measure on ℝ^n by assigning to each rectangle its n-dimensional volume. For a, b ∈ ℝ^n such that a_i ≤ b_i, i = 1, . . . , n, let (a, b] := ∏_{i=1}^n (a_i, b_i]. The Lebesgue measure on ℝ^n is then the set function λ : B_{ℝ^n} → [0, +∞] defined by λ((a, b]) := ∏_{i=1}^n (b_i − a_i).
Sometimes, one starts with a probability measure defined on a small σ-algebra (or even on an algebra!) and then wishes to extend it to a larger σ-algebra. For example, Lebesgue measure can be constructed by defining it first on the collection of finite unions of right-semiclosed intervals and extending it to the collection of Borel sets of ℝ, which is generated by the former collection. The following example illustrates why extending probability measures should interest us.
Example 7. Consider the experiment of tossing a coin three times. Let 0 = heads and 1 = tails. Then Ω = {0, 1}^3 and |Ω| = 2^3. Since Ω is finite, we can take F = 2^Ω right away. Now, notice that the finiteness of Ω allows us to assign probabilities (in quite an intuitive way) to the events in F using the notion of relative frequency. Therefore, for S ∈ F, we can take P(S) = |S| / |Ω|. For instance, the probability of the event S = "at least one of the last two tosses comes up tails" would be computed by noting that

    S = {(0, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1)},

so that P(S) = 6/2^3 = 3/4. Now, how should we take the space of events and how should we compute probabilities if the sample set were infinite? Suppose that the coin is tossed infinitely many times. Then Ω = {0, 1}^∞. To consider the space of events, let us deal first with easy events. For instance, consider the set

    A_S = {(ω_i)_{i=1}^∞ ∈ {0, 1}^∞ : (ω_1, ω_2, ω_3) ∈ S},

where S ⊆ {0, 1}^3 is the event obtained above. This type of set is known as a cylinder set, and it is specified by imposing conditions only on a finite number of the coordinates of a generic element of it. More generally, given a set T ⊆ {0, 1}^k for some finite k, a cylinder set is a set of the form

    A_T = {(ω_i)_{i=1}^∞ ∈ {0, 1}^∞ : (ω_1, . . . , ω_k) ∈ T}.

Notice that this cylinder set is nothing but the event "the outcome of the first k tosses belongs to T." Then, if we use the notion of relative frequency to compute probabilities, we would like to assign to it the probability |T| / 2^k. For instance, using the earlier event S to construct the cylinder set A_S, we would compute the probability of the event "out of the first three tosses, at least one of the last two comes up tails" as P(A_S) = |S| / 2^3 = 3/4. In this manner, as k increases, we would be able to compute probabilities of events when the number of tosses becomes arbitrary. Then, it makes sense that we
include the set of all cylinder sets

    A = ⋃_{k=1}^∞ {A_T : T ⊆ {0, 1}^k} = ⋃_{k=1}^∞ {{(ω_i)_{i=1}^∞ ∈ {0, 1}^∞ : (ω_1, . . . , ω_k) ∈ T} : T ⊆ {0, 1}^k}

in our event space. Therefore, we would like to consider A as the nucleus of our event space, so that σ(A) would be the chosen σ-algebra. We have that A is an algebra (check it!) but need not be a σ-algebra. So it seems worrisome that we only know how to assign probabilities to the elements in A. What do we do with the events in σ(A) \ A?
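The cylinder-set probabilities |T| / 2^k can be computed by enumeration for small k. A sketch; the helper name is invented, not from the notes:

```python
from itertools import product

def cylinder_probability(condition, k):
    """Enumerate T = {w in {0,1}^k : condition(w)} and return |T| / 2**k,
    the probability assigned to the cylinder set A_T."""
    T = [w for w in product((0, 1), repeat=k) if condition(w)]
    return len(T) / 2 ** k

# S: "at least one of the last two of the first three tosses is tails (= 1)"
S_condition = lambda w: w[1] == 1 or w[2] == 1
p3 = cylinder_probability(S_condition, 3)   # |S| / 2**3
p4 = cylinder_probability(S_condition, 4)   # same event viewed inside {0,1}^4
```

Both calls return 3/4: a condition on the first three coordinates determines the same probability no matter how many further tosses one enumerates, which is the consistency that makes the cylinder-set assignment well defined.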
A general method for extending measures was developed by C. Carathéodory (1918) and is known as the Carathéodory extension procedure. We present briefly the approach followed by this procedure and some useful consequences of it. Intuitively, the Carathéodory extension method allows us to construct a probability measure on a σ-algebra by specifying it only on the algebra that generates that σ-algebra. Furthermore, under fairly general conditions, such a construction is done in a unique way. We begin with a formulation of Carathéodory's Theorem which reflects precisely this intuition.
Theorem 2. Let A be an algebra on a nonempty set Ω and let P : A → ℝ*. If P is σ-additive, then there exists a measure P* on σ(A) such that P*(A) = P(A) for each A ∈ A. Moreover, if P(Ω) < ∞, then P* is unique.

Now, we get into the formal details of the procedure and present another formulation of Carathéodory's Theorem.
Definition 11. An outer measure on an arbitrary nonempty set Ω is a set function μ : 2^Ω → ℝ*₊, where ℝ*₊ := ℝ₊ ∪ {+∞}, that verifies
(a) μ(∅) = 0;
(b) (monotonicity) for A, B ∈ 2^Ω, A ⊆ B implies μ(A) ≤ μ(B);
(c) (σ-subadditivity) for each sequence {A_n}_{n=1}^∞ of subsets of Ω, we have μ(⋃_{n=1}^∞ A_n) ≤ Σ_{n=1}^∞ μ(A_n).
Definition 12. Let P be a measure on a measurable space (Ω, F). The measure P generates a set function P* : 2^Ω → ℝ*₊ defined by

    P*(A) := inf {Σ_{n=1}^∞ P(A_n) : {A_n}_{n=1}^∞ ⊆ F and A ⊆ ⋃_{n=1}^∞ A_n},   (*)

which is called the Carathéodory extension of P.

Intuitively, the Carathéodory extension P* of a measure P is constructed from P by approximating events from the outside. If {A_n}_{n=1}^∞ forms a good covering of A, in the sense that the A_n do not overlap one another very much or extend much beyond A, then Σ_{n=1}^∞ P(A_n) should be a good outer approximation to the measure assigned to A. Then, this approach allows for the following. Consider a measure P on a measurable space (Ω, F) and the σ-algebra generated by F, F* = σ(F). Then, F* is a σ-algebra larger than F (in the sense that F* ⊇ F). The formulation of Carathéodory's Theorem stated below asserts that there exists an outer measure P* on (Ω, F*) such that:
(a) P*(A) = P(A) for each A ∈ F;
(b) if Q is another measure on (Ω, F*) such that Q(A) = P(A) for each A ∈ F, then it must be the case that Q(A) = P*(A) for each A ∈ F*.
Theorem 3. A measure P on a measurable space (Ω, F) such that P(Ω) < ∞ has a unique extension P* (i.e., conditions (a) and (b) above are satisfied), defined by equation (*) above, to the generated σ-algebra σ(F). Moreover, the extension P* is an outer measure on Ω.

The extension P* of P identified in the Theorem above is also known as the outer measure generated by P. Given a probability space (Ω, F, P), the phrase "P-almost everywhere" (often substituted by just "almost everywhere" or "almost surely" when the probability measure P is understood from the context) means "everywhere except possibly for a set A ∈ F with P*(A) = 0," where P* is the outer measure generated by P. For example, we say that two functions f, g : A → B are P-almost everywhere equal if P*({a ∈ A : f(a) ≠ g(a)}) = 0. We use the symbol =_{a.e.} to denote "equal almost everywhere."
1.3. Conditional probability and Bayes theorem
In many problems there is some information available about the outcome of the random phenomenon at the moment at which we assign probabilities to events. Hence, one may face questions of the form: what is the probability that event A occurs given that another event B has occurred?

Definition 13. Let (Ω, F, P) be a probability space. Consider two events A, B ∈ F such that P(B) > 0; then the conditional probability of A given B is given by:

    P(A|B) := P(A ∩ B) / P(B).

If P(B) = 0, then the conditional probability of A given B is undefined.
Using the definition of conditional probability, we obtain the chain-rule formulas

    P(A ∩ B) = P(B)P(A|B),
    P(A ∩ B ∩ C) = P(B)P(A|B)P(C|A ∩ B),

and so on. Furthermore, if {B_n}_{n=1}^∞ partitions Ω, then, for A ∈ F, it can be shown that

    P(A) = Σ_{n=1}^∞ P(B_n)P(A|B_n).

This property is known as the law of total probability.

If one begins with a probability space (Ω, F, P) and considers an event B ∈ F such that P(B) > 0, then it can be shown that the set function P(·|B) : F → [0, 1] specified in the definition above is a well-defined probability measure (check it!) on the measurable space (Ω, F), so that (Ω, F, P(·|B)) constitutes a probability space.
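The law of total probability can be verified on a small finite example. The partition and event below are my own illustration, not from the text:

```python
from fractions import Fraction

# A fair die with the uniform measure; partition by B1 = {1,2}, B2 = {3,...,6},
# and let A be the event "the roll is even".
omega = frozenset({1, 2, 3, 4, 5, 6})
P = lambda E: Fraction(len(E), len(omega))

def P_cond(A, B):
    """P(A|B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    return P(A & B) / P(B)

A = frozenset({2, 4, 6})
B1, B2 = frozenset({1, 2}), frozenset({3, 4, 5, 6})
total = P(B1) * P_cond(A, B1) + P(B2) * P_cond(A, B2)
# total equals P(A) = 1/2, as the law of total probability requires.
```

Here P(B1) = 1/3 with P(A|B1) = 1/2, and P(B2) = 2/3 with P(A|B2) = 1/2, so the weighted sum recovers P(A) = 1/2.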
Theorem 4 (Bayes' rule). Let (Ω, F, P) be a probability space and {A_1, . . . , A_n} ⊆ F a set of n mutually disjoint events such that P(A_i) > 0 for each i = 1, . . . , n and ⋃_{i=1}^n A_i = Ω. If B ∈ F is an event such that P(B) > 0, then

    P(A_k|B) = P(B|A_k)P(A_k) / Σ_{i=1}^n P(B|A_i)P(A_i)

for each k = 1, . . . , n.
Proof. Since

    B = B ∩ (⋃_{i=1}^n A_i) = ⋃_{i=1}^n (B ∩ A_i)

and {B ∩ A_1, . . . , B ∩ A_n} is a set of disjoint events, we obtain

    P(B) = Σ_{i=1}^n P(B ∩ A_i).

Moreover, by applying the definition of conditional probability, we have

    Σ_{i=1}^n P(B ∩ A_i) = Σ_{i=1}^n P(B|A_i)P(A_i).

Therefore, using the definition of conditional probability again, we can write, for each k = 1, . . . , n,

    P(A_k|B) = P(A_k ∩ B) / P(B) = P(B|A_k)P(A_k) / Σ_{i=1}^n P(B|A_i)P(A_i),

as stated.
Example 8. A ball is drawn from one of two urns depending on the outcome of a roll of a fair die. If the die shows 1 or 2, the ball is drawn from Urn I, which contains 6 red balls and 2 white balls. If the die shows 3, 4, 5, or 6, the ball is drawn from Urn II, which contains 7 red balls and 3 white balls. We ask ourselves: given that a white ball is drawn, what is the probability that it came from Urn I? From Urn II?

Let I (II) denote the event "the ball comes from Urn I (Urn II)." Let w (r) denote the event "the drawn ball is white (red)." We compute P(I|w) and P(II|w) by applying Bayes' rule:

    P(I|w) = P(w|I)P(I) / [P(w|I)P(I) + P(w|II)P(II)] = (1/4)(1/3) / [(1/4)(1/3) + (3/10)(2/3)] = 5/17,

    P(II|w) = P(w|II)P(II) / [P(w|I)P(I) + P(w|II)P(II)] = (3/10)(2/3) / [(1/4)(1/3) + (3/10)(2/3)] = 12/17.
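The arithmetic of Example 8 can be reproduced with exact fractions. A short check; the variable names are my own:

```python
from fractions import Fraction

# Priors from the die roll and white-ball likelihoods per urn.
P_I, P_II = Fraction(1, 3), Fraction(2, 3)        # die shows {1,2} vs {3,4,5,6}
P_w_I, P_w_II = Fraction(2, 8), Fraction(3, 10)   # 2 of 8 white, 3 of 10 white

P_w = P_w_I * P_I + P_w_II * P_II                 # law of total probability
P_I_w = P_w_I * P_I / P_w                         # Bayes' rule for Urn I
P_II_w = P_w_II * P_II / P_w                      # Bayes' rule for Urn II
# P_I_w and P_II_w come out to 5/17 and 12/17.
```

Working in `Fraction` rather than floats keeps the posteriors exact, so 5/17 and 12/17 appear verbatim rather than as rounded decimals.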
1.4. Independence
Let A and B be two events in a probability space. An interesting case occurs when knowledge that B occurs does not change the odds that A occurs. We may think intuitively that A and B occur in an independent way if P(A|B) = P(A) when P(B) > 0. Hence, the definition of conditional probability leads to the following definition.

Definition 14. Let (Ω, F, P) be a probability space. Two events A, B ∈ F are independent if P(A ∩ B) = P(A)P(B).

More generally, a finite collection A_1, . . . , A_n of events is independent if

    P(A_{k_1} ∩ · · · ∩ A_{k_j}) = P(A_{k_1}) · · · P(A_{k_j})

for each 2 ≤ j ≤ n and each 1 ≤ k_1 < · · · < k_j ≤ n. That is, a finite collection of events is independent if each of its subcollections is. Analogously, an infinite (perhaps uncountable) collection of events is defined to be independent if each of its finite subcollections is.
Example 9. Consider an experiment with a sample set Ω = {a, b, c, d} and suppose that the probability of each outcome is 1/4. Consider the following three events:

    A = {a, b}, B = {a, c}, C = {a, d}.

Then, we have

    P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = P(A ∩ B ∩ C) = P({a}) = 1/4,

so that P(A ∩ B) = P(A)P(B), P(A ∩ C) = P(A)P(C), and P(B ∩ C) = P(B)P(C). However, P(A ∩ B ∩ C) = 1/4 ≠ 1/8 = P(A)P(B)P(C). Therefore, we must say that events A, B, and C are pairwise independent but all three of them are not independent.
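The distinction in Example 9 between pairwise and mutual independence can be verified directly:

```python
from fractions import Fraction
from itertools import combinations

# Uniform measure on the four-point sample set of Example 9.
omega = frozenset({'a', 'b', 'c', 'd'})
P = lambda E: Fraction(len(E), len(omega))
A, B, C = frozenset({'a', 'b'}), frozenset({'a', 'c'}), frozenset({'a', 'd'})

# Every pair factorizes...
pairwise = all(P(X & Y) == P(X) * P(Y) for X, Y in combinations([A, B, C], 2))
# ...but the triple intersection does not: P({a}) = 1/4, not 1/8.
mutual = pairwise and P(A & B & C) == P(A) * P(B) * P(C)
```

Here `pairwise` evaluates to True and `mutual` to False, matching the conclusion of the example.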
Example 10. Let Ω = {(x, y) ∈ ℝ² : 0 ≤ x, y ≤ 1} and consider the probability space (Ω, B_Ω, λ), where λ is the Lebesgue measure on ℝ². It can be shown that the events

    A = {(x, y) ∈ ℝ² : 0 ≤ x ≤ 1/2, 0 ≤ y ≤ 1},
    B = {(x, y) ∈ ℝ² : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1/4}

are independent. To do this, let us compute the area of the respective rectangles. First, notice that

    A ∩ B = {(x, y) ∈ ℝ² : 0 ≤ x ≤ 1/2, 0 ≤ y ≤ 1/4}.

Then, one obtains λ(A) = 1/2, λ(B) = 1/4, and λ(A ∩ B) = 1/8, as required.

Consider now the event

    C = {(x, y) ∈ ℝ² : 0 ≤ x ≤ 1/2, 0 ≤ y ≤ 1, y ≥ x}.

We have λ(C) = 1/2 − (1/2)(1/2)² = 3/8 and λ(C ∩ B) = (1/2)(1/4)² = 1/32, so that λ(C)λ(B) = 3/32 ≠ 1/32 = λ(C ∩ B); that is, C and B are not independent.
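The areas in Example 10 can be checked with exact rational arithmetic; the decomposition follows the elementary geometry above (half-strip minus a triangle for C, a small triangle for C ∩ B), and the variable names are mine:

```python
from fractions import Fraction

# Rectangle areas for A, B, and A ∩ B.
lam_A = Fraction(1, 2) * 1                      # [0,1/2] x [0,1]
lam_B = 1 * Fraction(1, 4)                      # [0,1] x [0,1/4]
lam_AB = Fraction(1, 2) * Fraction(1, 4)        # [0,1/2] x [0,1/4]

# C = {0 <= x <= 1/2, x <= y <= 1}: the half-strip minus a right triangle.
lam_C = Fraction(1, 2) - Fraction(1, 2) * Fraction(1, 2) ** 2    # 3/8
# C ∩ B is the right triangle with legs 1/4.
lam_CB = Fraction(1, 2) * Fraction(1, 4) ** 2                    # 1/32

independent_AB = lam_AB == lam_A * lam_B    # the product rule holds
independent_CB = lam_CB == lam_C * lam_B    # 1/32 versus 3/32: it fails
```

As in the text, `independent_AB` is True while `independent_CB` is False.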
Problems
1. [Infinite form of De Morgan's law] Let A_1, A_2, . . . be an infinite sequence of distinct subsets of some nonempty set Ω. Show that
(a) Ω \ (⋃_{n=1}^∞ A_n) = ⋂_{n=1}^∞ (Ω \ A_n);
(b) Ω \ (⋂_{n=1}^∞ A_n) = ⋃_{n=1}^∞ (Ω \ A_n).
2. Let F be a collection of subsets of some nonempty set Ω.
(a) Suppose that Ω ∈ F and that A, B ∈ F implies A \ B ∈ F. Show that F is an algebra.
(b) Suppose that Ω ∈ F and that F is closed under the formation of complements and finite disjoint unions. Show that F need not be an algebra.
3. Let F_1, F_2, . . . be collections of subsets of some nonempty set Ω.
(a) Suppose that the F_n are algebras satisfying F_n ⊆ F_{n+1}. Show that ⋃_{n=1}^∞ F_n is an algebra.
(b) Suppose that the F_n are σ-algebras satisfying F_n ⊆ F_{n+1}. Show by example that ⋃_{n=1}^∞ F_n need not be a σ-algebra.
4. Let (Ω, F) be a measurable space. For A ∈ F, let F(A) be the collection of subsets of Ω of the form A ∩ B, where B ∈ F. Show that, for a given A ∈ F, F(A) is a σ-algebra on A.
5. Let Ω = {(x, y) ∈ ℝ² : 0 < x, y ≤ 1}, let F be the collection of sets of Ω of the form

    {(x, y) ∈ ℝ² : x ∈ A, 0 < y ≤ 1},

where A ∈ B_{(0,1]}, and let P({(x, y) ∈ ℝ² : x ∈ A, 0 < y ≤ 1}) = λ(A), where λ is Lebesgue measure on ℝ.
(a) Show that (Ω, F, P) is a probability space.
(b) Show that P*({(x, y) ∈ ℝ² : 0 < x ≤ 1, y = 1/2}) = 1, where P* is the outer measure generated by P.
6. Let (Ω, F, P) be a probability space and, for A ∈ F, let P_A : F → [0, 1] be a set function defined by P_A(B) := P(A ∩ B) for each B ∈ F.
(a) Show that, for a given A ∈ F, P_A is a measure on (Ω, F).
(b) Show that, for a given A ∈ F such that P(A) > 0, the set function Q_A on F defined by Q_A(B) := P_A(B)/P(A) for each B ∈ F is a probability measure on (Ω, F).
7. Let P_1, . . . , P_n be probability measures on some measurable space (Ω, F). Show that Q := Σ_{i=1}^n a_i P_i, where a_i ∈ ℝ₊ for each i = 1, . . . , n and Σ_{i=1}^n a_i = 1, is a probability measure on (Ω, F).
8. Let (Ω, F, P) be a probability space and let A_1, . . . , A_n be events in F such that P(⋂_{i=1}^k A_i) > 0 for each k = 1, . . . , n − 1.
(a) Show that

    P(⋂_{i=1}^n A_i) = P(A_1)P(A_2|A_1)P(A_3|A_1 ∩ A_2) · · · P(A_n|A_1 ∩ A_2 ∩ · · · ∩ A_{n−1}).

(b) Show that if P(⋂_{i=1}^k A_i) = 0 for some k ∈ {1, . . . , n − 1}, then P(⋂_{i=1}^n A_i) = 0.
9. Let (Ω, F, P) be a probability space and let A_1, . . . , A_n be independent events in F. Let B_1, . . . , B_n be another sequence of events in F such that, for each i = 1, . . . , n, either B_i = A_i or B_i = Ω \ A_i. Show that B_1, . . . , B_n are independent events.
10. There are three coins in a box. One is a two-headed coin, another is a two-tailed coin, and the third is a fair coin. One of the three coins is chosen at random and flipped. It shows heads. What is the probability that it is the two-headed coin?
11. Two dice are rolled once and the 36 possible outcomes are equally likely. Compute the probability that the sum of the numbers on the two faces is even.
12. A box has 10 numbered balls. A ball is picked at random and then a second ball is picked at random from the remaining 9 balls. Compute the probability that the numbers on the two selected balls differ by two or more.
13. A box has 10 balls, 6 of which are black and 4 of which are white. Three balls are removed at random from the box, but their colors are not noted.
(a) Compute the probability that a fourth ball removed from the box is white.
Suppose now that it is known that at least one of the three removed balls is black.
(b) Compute the probability that all three of the removed balls are black.
14. A box has 5 numbered balls. Two balls are drawn independently from the box with replacement. It is known that the number on the second ball is at least as large as the number on the first ball. Compute the probability that the number on the first ball is 2.
15. Let (Ω, F, P) be a probability space and let A, B, and C be three events in F such
that P(A ∩ B ∩ C) > 0. Show that P(C|A ∩ B) = P(C|B) implies P(A|B ∩ C) = P(A|B).
Chapter 2
Combinatorics
Consider a finite sample set Ω and suppose that its elementary events are equally likely
(as considered by the classical approach to probability theory). Then, using the relative
frequency interpretation of probability, we can compute the probability of an event A
simply by dividing the cardinality of A by the cardinality of Ω.
The aim of this chapter is to introduce several combinatorial formulas which are commonly
used for counting the number of elements of a set. These methods rely upon special
structures that exist in some common random experiments.
2.1. Ordered samples
Consider a finite set S := {1, 2, . . . , s} and suppose that we are interested in drawing a
sequence of m ≤ s elements from this set. Then, the outcome of the draws can be regarded
as an m-tuple or sequence ω = (ω_1, ω_2, . . . , ω_m), where ω_i is the element obtained in the
ith draw.
Suppose first that we draw such a sequence by putting each drawn element back into
the set before the next element is drawn. This procedure is called random sampling
with replacement. Here, we have Ω = S^m so that |Ω| = s^m.
On the other hand, suppose that we do not return the elements into the set before the
following draw. Then, Ω = {(ω_i)_{i=1}^m : ω_i ≠ ω_j for each i ≠ j}. This procedure is known
as random sampling without replacement. Here, |Ω| = s(s − 1)(s − 2) ··· (s − m + 1).
Let (s)_m := s(s − 1)(s − 2) ··· (s − m + 1) denote the number of different possible m-tuples
when there is no replacement. Also, if the elements from S are drawn m = s times without
replacement, then there are (s)_s =: s! possible outcomes of the experiment.
Example 11. Suppose that we are asked to compute the probability that at least one
head shows up when a fair coin is tossed n times. We can regard flipping the coin n
times as drawing a random sample with replacement of size n from the set {0, 1}, where
head is indicated by 0. Therefore, |Ω| = 2^n. Let C be the event "at least one zero shows
up" and let B_i be the event that the ith toss results in a zero. Then C = ∪_{i=1}^n B_i,
and note that the sets B_1, . . . , B_n are not pairwise disjoint! However, using property (P2)
of probabilities and De Morgan's law, we have
P(C) = 1 − P(Ω \ C) = 1 − P(Ω \ ∪_{i=1}^n B_i) = 1 − P(∩_{i=1}^n (Ω \ B_i)).
But ∩_{i=1}^n (Ω \ B_i) consists of the event "the n tosses all yield ones", i.e., ∩_{i=1}^n (Ω \ B_i) =
{(1, . . . , 1)}. Then, P(∩_{i=1}^n (Ω \ B_i)) = 2^{−n}, and the sought probability is P(C) = 1 − 2^{−n}.
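The count in Example 11 can be verified by brute force for a small n; the following sketch (with n = 4 as an illustrative choice) enumerates the sample set {0, 1}^n and compares the relative frequency of "at least one zero" with 1 − 2^{−n}.

```python
from itertools import product

n = 4                                        # small enough to enumerate fully
outcomes = list(product((0, 1), repeat=n))   # the sample set {0,1}^n
favorable = [w for w in outcomes if 0 in w]  # "at least one zero (head)"
prob = len(favorable) / len(outcomes)
assert prob == 1 - 2 ** -n
print(prob)  # 0.9375
```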
Let us now consider the following problem: a random sample of size m is chosen from
a set S of s distinct objects with replacement. We ask ourselves about the probability of
the event A = "in the sample no element appears twice". Note that the cardinality of the
sample set is s^m. Also, the number of elementary events from the sample set where no
element from S appears twice, out of the s^m possible elementary events, is nothing but
the cardinality of the set
A = {(ω_i)_{i=1}^m : ω_i ≠ ω_j for each i ≠ j}.
But this is precisely the cardinality of the sample set (or sure event) associated with
an experiment of random sampling without replacement from that set S! So, the sought
probability can be computed as
s(s − 1)(s − 2) ··· (s − m + 1)/s^m = (1 − 1/s)(1 − 2/s) ··· (1 − (m − 1)/s).
A typical problem of this sort is known as the birthday problem.
Example 12. Suppose that we are asked to compute the probability of the event A = "no
two people from a group of five friends have a common birthday". Here we shall ignore
leap years and the fact that birth rates are not exactly uniform over the year. Using
the expression obtained above, here m = 5 and s = 365, so that we can easily compute
P(A) = (1 − 1/365)(1 − 2/365)(1 − 3/365)(1 − 4/365).
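Evaluating this product numerically (a quick sketch; the function name is ours) shows that five friends avoid a shared birthday with probability about 0.973:

```python
def no_common_birthday(m, s=365):
    """Probability that m people all have distinct birthdays (s days in a year)."""
    p = 1.0
    for i in range(1, m):
        p *= 1 - i / s        # the factor (1 - i/s) from the formula above
    return p

print(round(no_common_birthday(5), 4))  # 0.9729
```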
2.2. Permutations
Suppose that we have a set S of s distinct objects, which we permute randomly among
themselves. Then, we ask about the final positions of some pre-specified objects. Here, we
should identify each position i after the rearrangement with the element ω_i drawn from
the set S. Also, notice that, since two distinct objects from S cannot end up in the same
position, we are considering random sampling without replacement and, consequently, the
number of possible ways of distributing the s objects into the s final positions is (s)_s = s!.
This is the cardinality of the sample set of our experiment, i.e., |Ω| = s!.
Now, let M be a strict subset of size m of S, and consider the event A = "m < s
pre-specified objects from S end up in m pre-specified positions". Given that m elements
from S are required to end up in fixed positions, the number of sequences with s − m
coordinates that can be extracted without replacement from the set S \ M is (s − m)!.
Therefore, |A| = (s − m)! and the probability that m specified objects from S end up in
m specified positions after permuting randomly among themselves the s distinct objects
is
P(A) = (s − m)!/s! = 1/(s(s − 1) ··· (s − m + 1)).
On the other hand, notice that the number of sequences that can be obtained from S if
only m < s objects can be used in the permutations is (s)_m = s(s − 1) ··· (s − m + 1).
This is the number of different permutations with m objects that can be made from a
set of s > m objects if the final order of the elements matters. In other words, we are
computing the number of ways of obtaining an ordered subset of m elements from a set
of s > m elements. We denote this number by P^s_m,
P^s_m = (s)_m = s(s − 1) ··· (s − m + 1),
which is nothing but the inverse of the probability, calculated above, that m pre-specified
elements from S end up in m pre-specified positions after randomly permuting the s
objects.
Example 13. A committee of 5 members, consisting of a president, a secretary, and three
officials, is selected from a club of 50 members. The officials are ranked as officials 1, 2, and
3, according to the degree of their importance within the club. The presidency and the
secretary position are assigned, respectively, to the oldest and the youngest members of
the club. Then, the three officials are selected at random from the remaining 48 members
of the club. We are interested in computing the probability that three friends, Peter, Paul,
and Pierce, end up chosen, respectively, as officials 1, 2, and 3. Notice that, since two pre-
specified members of {1, . . . , 50} must end up in two pre-specified positions, there are P^48_3
ways in which the three officials can be selected, provided that the order of the sequence of
size 3 matters. Therefore, the sought probability is 1/P^48_3 = 1/(48 · 47 · 46).
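In Python, the ordered count P^48_3 is available as math.perm; a quick check of the computation above:

```python
from math import perm

ways = perm(48, 3)   # P^48_3 = 48*47*46 ordered triples of officials
assert ways == 48 * 47 * 46
# probability that Peter, Paul, and Pierce land in positions 1, 2, 3 in that order
print(1 / ways)
```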
2.3. Combinations
Suppose that we have a set S of s distinct objects, which we permute randomly among
themselves. Again, we should identify each position i after the rearrangement with the
element ω_i drawn from the set S. Also, since two distinct objects from S cannot end up
in the same position, we are considering random sampling without replacement, as in the
section above.
Now we are interested in computing the number of different subsets of size m ≤ s that
can be extracted from S. In other words, we wish to compute the number of sequences that
can be obtained if the order of their coordinates does not matter. First, notice that there are
(s)_m different sequences of size m that can be drawn from S without replacement. But,
instead of focusing on ordered m-tuples, we care indeed about the subsets of size m that
can be drawn from S disregarding the order in which their elements are selected. Notice that
the elements of each set M ⊆ S of m elements can be rearranged in m! different ways.
Then, since we wish to ignore the order in which the elements are selected, these
m! reorderings of the elements of M should be considered the same. Therefore, there are
(s)_m/m! different samples of size m that can be drawn from S without replacement and
regardless of the order of their elements. Using the binomial coefficient notation C(s, m)
(read "s choose m"), we usually write
C(s, m) = (s)_m/m! = P^s_m/m! = s!/(m!(s − m)!).
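The equalities above can be checked mechanically; this sketch (with s = 10, m = 4 as arbitrary illustrative values) compares the falling factorial, the factorial formula, and Python's built-in math.comb:

```python
from math import comb, factorial, perm

s, m = 10, 4
falling = 1
for i in range(m):
    falling *= s - i                       # (s)_m = s(s-1)...(s-m+1)

assert falling == perm(s, m)               # (s)_m = P^s_m
assert falling // factorial(m) == comb(s, m)
assert comb(s, m) == factorial(s) // (factorial(m) * factorial(s - m))
print(comb(s, m))  # 210
```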
Example 14. Consider again the problem in Example 11 above, where we were interested
in computing the probability of the event C = "at least one zero (head) shows up when a fair
coin is tossed n times". Consider the event A_i = "exactly i zeroes show up". Then,
C = ∪_{i=1}^n A_i and A_i ∩ A_j = ∅ for each i, j = 1, . . . , n such that i ≠ j. So, P(C) =
Σ_{i=1}^n P(A_i). To compute each P(A_i), note that C(n, i) gives us the number of subsets of size
i that can be extracted from {1, . . . , n}. In other words, C(n, i) gives us the cardinality of the
event that i tosses result in zero while, at the same time, the remaining n − i
tosses yield one. This is precisely the cardinality of A_i. Therefore,
P(C) = Σ_{i=1}^n C(n, i) 2^{−n},
which coincides with 1 − 2^{−n}, as obtained in Example 11.
Example 15. The economics department consists of 8 full professors, 14 associate pro-
fessors, and 18 assistant professors. A committee of 5 is selected at random from the
faculty of the department. Suppose that we are asked to compute the probability that all
the members of the committee are assistant professors. To answer this, notice first that in
all there are 40 faculty members. So, the committee of five can be chosen from the forty
in C(40, 5) ways. There are 18 assistant professors, so that the committee of five can be chosen
from them in C(18, 5) ways. Therefore, the sought probability is C(18, 5)/C(40, 5).
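Numerically (a quick check with Python's math.comb):

```python
from math import comb

p = comb(18, 5) / comb(40, 5)
print(comb(18, 5), comb(40, 5), round(p, 4))  # 8568 658008 0.013
```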
Example 16. A die is rolled 12 times and suppose first that we are interested in com-
puting the probability of getting exactly 2 fives. Let A denote that event of interest.
Here note that Ω = {1, . . . , 6}^12 so that |Ω| = 6^12. Now consider the event A_(i,j), where
i, j = 1, . . . , 12, i < j, which describes the outcome such that number 5 shows up only in
the ith and jth rolls. Then, we have |A_(i,j)| = 5^10 regardless of the value of the pair (i, j).
Also, we know that A_(i,j) ∩ A_(k,l) = ∅ whenever (i, j) ≠ (k, l) and
A = ∪_{(i,j)∈Q} A_(i,j),
where Q is the set specified as
Q = {(i, j) ∈ {1, . . . , 12}^2 : i < j}.
Therefore, we know that
P(A) = |Q| · 5^10/6^12.
All that we need then is to compute the cardinality of the set Q. Note that Q is nothing but
the set of different pairs of numbers that can be extracted from {1, . . . , 12}. Therefore,
its cardinality is given by C(12, 2) and we thus obtain
P(A) = C(12, 2) · 5^10/6^12.
Suppose now that we wish to compute the probability that at least one 1 shows up.
Let B denote that event of interest and consider the event B_k, where k = 1, 2, . . . , 12,
which describes the outcome such that number 1 shows up exactly k times. Then, we
have B = ∪_{k=1}^{12} B_k and B_k ∩ B_l = ∅ whenever k ≠ l. Therefore, we know that P(B) =
Σ_{k=1}^{12} P(B_k). Following the same reasoning as above, we finally obtain
P(B) = Σ_{k=1}^{12} C(12, k) 5^{12−k}/6^12.
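Both probabilities from Example 16 can be sanity-checked in a few lines; the second sum must agree with the complement rule P(B) = 1 − (5/6)^12:

```python
from math import comb

p_two_fives = comb(12, 2) * 5**10 / 6**12
p_at_least_one = sum(comb(12, k) * 5**(12 - k) for k in range(1, 13)) / 6**12

# complement rule: at least one 1 = 1 - P(no 1 in 12 rolls)
assert abs(p_at_least_one - (1 - (5 / 6) ** 12)) < 1e-12
print(round(p_two_fives, 4), round(p_at_least_one, 4))
```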
Example 17. A set of n balls is distributed randomly into n boxes and we are asked to
compute the probability that only the first box ends up being empty. Here, an elementary
event should be identified with the final position of the balls, so that ω_i should be inter-
preted as the box where the ith ball ends up. Then, the sample space is Ω = {1, . . . , n}^n
so that |Ω| = n^n. Notice that we are considering random sampling with replacement since
two different balls may end up in the same box.
Consider the event A = "only box 1 ends up being empty". Notice that this can happen
if and only if exactly one of the remaining n − 1 boxes contains two balls and all the
other n − 2 boxes have exactly one ball each. Consider then the event B_i = "box 1 ends up
empty, box i ends up with two balls, and the remaining n − 2 boxes end up with exactly
one ball each". We have A = ∪_{i=2}^n B_i and B_i ∩ B_j = ∅ whenever i ≠ j.
To compute P(B_i), notice first that the number of subsets of two balls that can be extracted
from the n balls is C(n, 2). Then, the remaining n − 2 balls can be rearranged
in the remaining n − 2 boxes in (n − 2)! different ways. Therefore, the number of distinct
ways in which one can put no ball in box 1, two balls into box i, and exactly one ball in
each of the remaining boxes is C(n, 2)(n − 2)!. So, we obtain
P(B_i) = C(n, 2)(n − 2)!/n^n.
Consequently, the sought probability is
P(A) = Σ_{i=2}^n P(B_i) = (n − 1) C(n, 2)(n − 2)!/n^n = C(n, 2)(n − 1)!/n^n.
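For n = 4 the formula can be confirmed by enumerating all n^n = 256 placements (a brute-force sketch):

```python
from itertools import product
from math import comb, factorial

n = 4
count = 0
for placement in product(range(1, n + 1), repeat=n):  # placement[i] = box of ball i+1
    occupied = set(placement)
    if 1 not in occupied and len(occupied) == n - 1:   # box 1, and only box 1, empty
        count += 1

assert count / n**n == comb(n, 2) * factorial(n - 1) / n**n
print(count, n**n)  # 36 256
```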
2.4. Partitions
Many combinatorial problems involving unordered samples are of the following type. There
is a box that contains r red balls and b black balls. A random sample of size m is drawn
from the box without replacement. What is the probability that this sample contains
exactly k red balls (and, therefore, m − k black balls)? The essence of this type of
problem is that the total population is partitioned into two classes. A random sample of
a certain size is taken and we inquire about the probability that the sample contains a
specified number of elements of the two classes.
First, note that we are interested only in the number of red and black balls in the
sample and not in the order in which these balls are drawn. Thus, we are dealing with
sampling without replacement and without regard to order. Then, we can take as our
sample space the collection of all samples of size m drawn from a set of b + r elements without
replacement and without regard to order. As argued before, the probability that we must
assign to each of these samples is 1/C(r + b, m).
We must now compute the number of ways in which a sample of size m can be drawn so
as to have exactly k red balls. Notice that the k red balls can be chosen from the subset
of r red balls in C(r, k) ways without replacement and without regard to order, and the m − k black balls can be
chosen from the subset of b black balls in C(b, m − k) ways without replacement and without regard to order. Since each choice of k red balls
can be paired with each choice of m − k black balls, there are a total of C(r, k) C(b, m − k)
possible choices. Therefore, the sought probability can be computed as
C(r, k) C(b, m − k)/C(r + b, m).
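Written as a function (the name is ours), the partition formula also lets one verify that the probabilities over all feasible k sum to one, which is Vandermonde's identity in disguise:

```python
from math import comb

def hypergeometric(k, m, r, b):
    """P(exactly k red) for a size-m sample without replacement
    from r red and b black balls."""
    return comb(r, k) * comb(b, m - k) / comb(r + b, m)

r, b, m = 6, 4, 3                      # illustrative values
total = sum(hypergeometric(k, m, r, b) for k in range(m + 1))
assert abs(total - 1) < 1e-12
print(hypergeometric(2, m, r, b))      # 0.5: two reds in a sample of three
```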
Example 18. We have a box containing r numbered balls. A random sample of size
n < r is drawn without replacement and the numbers of the balls are noted. These balls
are then returned to the box, and a second random sample of size m < r is drawn without
replacement. Suppose that we are asked to compute the probability that the two samples
have exactly l balls in common. To solve this problem, we can proceed as follows. The first
sample partitions the balls into two classes, those n selected and those r − n not chosen.
The problem then turns out to consist of computing the probability that the sample of
size m contains exactly l balls from the first class. So, the sought probability is
C(n, l) C(r − n, m − l)/C(r, m).
Problems
1. Given the digits 1, 2, 3, 4, and 5, how many four-digit numbers can be formed if
(a) there is no repetition;
(b) there can be repetition;
(c) the number must be even and there is no repetition;
(d) the digits 2 and 3 must appear in that order in the number and there is no repetition.
2. A bridge deck has 52 cards divided into 4 suits of 13 cards each: hearts, spades,
diamonds, and clubs. Compute the probability that, when drawing 5 cards from a bridge
deck (a poker hand),
(a) all of them are diamonds;
(b) one card is a diamond, one a spade, and the other three are clubs;
(c) exactly two of them are hearts if it is known that four of them are either hearts or
diamonds;
(d) none of them is a queen;
(e) exactly two of them are kings;
(f) exactly three of them are of the same suit.
3. In a hand of 13 cards drawn from a bridge deck, compute the probability of getting
exactly 5 clubs, 3 diamonds, 4 hearts, and 1 spade.
4. A man has 8 keys, one of which fits the lock. He tries the keys one at a time, at each
attempt choosing at random from the keys that were not tried earlier. Find the probability
that the 6th key tried is the correct one.
5. A set of n balls is distributed at random into n boxes. Compute the probabilities of
the following events:
(a) exactly one box is empty;
(b) only one box is empty if it is known that box 1 is empty;
(c) box 1 is empty if it is known that only one box is empty.
6. Suppose that n balls are distributed at random into r boxes. Compute the probability
that box 1 contains exactly k balls, where 0 ≤ k ≤ n.
7. A random sample of 3 balls is drawn from a box that contains 10 numbered balls.
Compute the probability that balls 1 and 4 are among the three picked balls.
8. A random sample of size n is drawn from a set of s elements. Compute the probability
that none of k pre-specied elements is in the sample if the method used is:
(a) sampling without replacement;
(b) sampling with replacement.
9. A set of n objects are permuted among themselves. Show that the probability that k
pre-specified objects occupy k pre-specified positions is (n − k)!/n!.
10. Two boxes contain n numbered balls each. A random sample of size k ≤ n is drawn
without replacement from each box. Compute the probability that the samples have
exactly l numbers in common.
11. Show that, for two positive integers s and n such that s ≥ n, we have
(1 − (n − 1)/s)^{n−1} ≤ (s)_n/s^n ≤ (1 − 1/s)^{n−1},
where (s)_n := s(s − 1) ··· (s − n + 1).
12. A die is rolled 12 times. Compute the probability of getting at most 3 fours.
Chapter 3
Random variables and their distributions
3.1. Definitions
From here onwards we shall use the following bracket notation for inverse images. Given
two sets Ω and Ω′ and a function f : Ω → Ω′, for A ⊆ Ω′, we write [f ∈ A] to denote
f^{−1}(A) = {ω ∈ Ω : f(ω) ∈ A}. Also, for simplicity, we shall write throughout P[·] instead
of P([·]) when using that bracket notation for inverse images.
Definition 15. Let F_Ω and F_Ω′ be two nonempty families of subsets of two arbitrary
nonempty sets Ω and Ω′, respectively. A function f : Ω → Ω′ is (F_Ω, F_Ω′)-measurable
if
f^{−1}(B) = {ω ∈ Ω : f(ω) ∈ B} = [f ∈ B] ∈ F_Ω
for each B ∈ F_Ω′. When the families F_Ω and F_Ω′ are understood from the context, we
simply say that f is measurable.
Even though it is not required by the definition above, usually F_Ω and F_Ω′ are σ-algebras.
In the special case of a real-valued function f : Ω → R, given a σ-algebra F of subsets of
Ω, we say that f is F-measurable if it is (F, B_R)-measurable.
Definition 16. Let (Ω, F) be a measurable space. A random variable on (Ω, F) is a
real-valued F-measurable function X : Ω → R. If a random variable X assumes finitely
many values, then it is called a simple random variable.
The point of the definition of a random variable X is to ensure that, for each Borel set
B ∈ B_R, X^{−1}(B) can be assigned a measure or probability. The notion of random variable
is a key concept in probability theory. It gives us a transformation through which we can
assign probabilities on subsets of arbitrary sets by means of doing so on subsets of the
real line. Note that the definition of a random variable is not attached to a probability
measure. However, we need a probability measure to speak of the distribution of a random
variable.
3.2. Probability distribution of a random variable
Definition 17. Let (Ω, F, P) be a probability space and X a random variable on (Ω, F).
The probability distribution of the random variable X is the probability measure μ
on (R, B_R) defined by
μ(B) := P(X^{−1}(B)) = P({ω ∈ Ω : X(ω) ∈ B}) = P[X ∈ B], for each B ∈ B_R.
Moreover, the distribution function of the random variable X is the function F : R → R
defined by
F(x) := μ((−∞, x]) = P({ω ∈ Ω : X(ω) ≤ x}) = P[X ≤ x].
Example 19. Consider the experiment of rolling three dice together and suppose that we
are interested in the sum of the numbers that show up. In principle, we could take as our
primitive the probability space (Ω, F, P), where Ω = {1, . . . , 6}^3 so that ω = (ω_1, ω_2, ω_3)
is a generic elementary event, F = 2^Ω, and P is specified by P(A) = |A|/6^3 for each A ∈ F.
But this approach would not be particularly useful since we are interested only in the sum
of the numbers that show up. Alternatively, we can make use of a function X : Ω → R
specified as
X((ω_1, ω_2, ω_3)) = ω_1 + ω_2 + ω_3.
It can be checked that this function X is measurable and, therefore, a random variable.
Now, consider the event A = (3, 5] ∈ B_R. Using the concept of the probability distribution of
X, we can compute the probability that the sum of the numbers that show up is larger
than three but no larger than five as
μ(A) = P[X^{−1}(A)] = |{(ω_1, ω_2, ω_3) ∈ Ω : 3 < ω_1 + ω_2 + ω_3 ≤ 5}|/6^3 = 9/6^3.
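The count of 9 favorable triples is easy to confirm by enumeration:

```python
from itertools import product

# count triples of dice whose sum lies in (3, 5]
hits = sum(1 for w in product(range(1, 7), repeat=3) if 3 < sum(w) <= 5)
assert hits == 9
print(hits / 6**3)  # 9/216 = 0.041666...
```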
Using the definition of the distribution function, we can apply limits, for ε > 0, to
compute
P[X < x] = lim_{ε↓0} P[X ≤ x − ε] = lim_{t↑x} F(t).
Also,
P[X = x] = F(x) − lim_{t↑x} F(t).
The following result gives us a useful characterization of a distribution function. Some
textbooks take this result as the definition of a distribution function.
Theorem 5. Let F : R → R be a nondecreasing, right-continuous function satisfying
lim_{x→−∞} F(x) = 0 and lim_{x→+∞} F(x) = 1.
Then, there exists a random variable X on some probability space (Ω, F, P) such that
F(x) = P[X ≤ x].
Proof. Let the probability space (Ω, F, P) be ((0, 1), B_(0,1), λ), where λ is the Lebesgue
measure on ((0, 1), B_(0,1)). To understand the proof, suppose first that F is strictly increas-
ing and continuous. Define the mapping φ := F^{−1}. Since F : R → (0, 1) is a one-to-one
function³, φ : (0, 1) → R is an increasing function. For ω ∈ (0, 1), let X(ω) := φ(ω). Since
φ is increasing, X is B_(0,1)-measurable. Now take a given ω ∈ (0, 1). Then, we have
φ(ω) ≤ x ⇔ ω ≤ F(x). Since λ is Lebesgue measure, we obtain
P[X ≤ x] = λ({ω ∈ (0, 1) : φ(ω) ≤ x}) = λ((0, F(x)]) = F(x) − 0 = F(x),
as required.
Now, if F has discontinuities or is not strictly increasing, define, for ω ∈ (0, 1),
φ(ω) := inf {x ∈ R : ω ≤ F(x)}.
Note that, since F is nondecreasing and right-continuous, {x ∈ R : ω ≤ F(x)} is an
interval of the form [φ(ω), +∞) (i.e., it is closed on the left and
stretches to +∞). Therefore, we obtain again φ(ω) ≤ x ⇔ ω ≤ F(x), so that, by defining
X(ω) := φ(ω) for ω ∈ (0, 1) and by applying the same argument as above, we obtain that
X is a random variable on ((0, 1), B_(0,1), λ) and P[X ≤ x] = F(x).
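The construction in the proof is exactly the inverse-CDF (quantile) sampling method. As a concrete sketch, take F(x) = 1 − e^{−x} for x ≥ 0 (an exponential distribution, chosen here only for illustration), whose inverse is φ(u) = −ln(1 − u); applying φ to uniform draws on (0, 1) should reproduce F, at least approximately by simulation:

```python
import math
import random

random.seed(0)                          # reproducible illustration

def phi(u):
    return -math.log(1 - u)             # inverse of F(x) = 1 - exp(-x)

sample = [phi(random.random()) for _ in range(100_000)]
x = 1.0
empirical = sum(1 for v in sample if v <= x) / len(sample)
assert abs(empirical - (1 - math.exp(-x))) < 0.01   # within Monte Carlo error
print(empirical)
```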
Consider a probability space (Ω, F, P) and a random variable X on it. Then, given a
Borel set (a, b] ⊆ R, it follows immediately from the definition of the distribution function
that, if F is the distribution function of X, then
P[a < X ≤ b] = P[X ≤ b] − P[X ≤ a] = F(b) − F(a).
3.3. Discrete random variables
Definition 18. Let P be a probability measure on some measurable space (Ω, F). A
support for P is an event supp(P) ∈ F such that P(supp(P)) = 1.
Definition 19. A random variable X on a probability space (Ω, F, P) with probability
distribution μ is discrete if μ has a countable support supp(μ) = {x_1, x_2, . . . }.⁴ In this
case, μ is completely determined by the values μ({x_i}) = P[X = x_i] for i = 1, 2, . . . .
If a random variable is discrete, then we say that its probability distribution and its
distribution function are discrete as well.
The support of the probability distribution of a random variable X is often called the
support of the random variable X itself, which we denote by supp(X). In this case,
the support of the random variable X is identified with the range of X considered as a
function. If X is a discrete random variable on a probability space (Ω, F, P) with support
{x_1, x_2, . . . }, then we can compute
F(x) = P[X ≤ x] = Σ_{x_i ≤ x} P[X = x_i].
³ A function f : A → B is one-to-one, or an injection, if for each y ∈ B there is at most one x ∈ A
satisfying f(x) = y.
⁴ Notice that each simple random variable is a discrete random variable but the converse is not true.
3.4. Integration
Before proceeding to the notion of a continuous random variable, we need to introduce the
notion of the integral of a function. Integration is the approach used in modern mathematics
to compute areas and volumes. Hence, integration is a tool closely related to the notion of
measure and, in particular, to the Lebesgue measure. A detailed treatment of the notion
of integral is not within the scope of this course and, therefore, in this section we just
introduce briefly its formal definition.
Consider a probability space (Ω, F, P). Let us take first an F-measurable function
g : Ω → R that assumes a finite number of values, say {y_1, y_2, . . . , y_n} ⊆ R, such that
A_i := g^{−1}({y_i}) ∈ F for each i = 1, . . . , n. This type of function is known as an F-simple
function. Further, assume that P(A_i) < ∞ for each i = 1, . . . , n. Then, we say that the
function g is a P-step function. In this case, g admits the following standard repre-
sentation:
g(ω) = Σ_{i=1}^n y_i I_{A_i}(ω),
where I_{A_i} is the indicator function of the set A_i, that is, I_{A_i}(ω) = 1 if ω ∈ A_i and
I_{A_i}(ω) = 0 if ω ∉ A_i.
The integral of the step function g is then defined as
∫ g(ω)dP(ω) = ∫ g(ω)P(dω) := Σ_{i=1}^n y_i P(A_i).
The integration problem consists of enlarging this definition so that it may be applied to
more general classes of functions, and not only to step functions.
So, let us take now a bounded F-measurable function f : Ω → R. We say that f is
P-integrable (or P-summable) if
sup {∫ g(ω)dP(ω) : g ∈ L_P and g ≤ f} = inf {∫ g(ω)dP(ω) : g ∈ L_P and f ≤ g},
where L_P is the space of P-step functions. The common value is called the integral of f
with respect to P and is denoted either ∫ f dP, ∫ f(ω)dP(ω), or ∫ f(ω)P(dω). There are
several approaches to the abstract concept of integral. For example, the approach most
closely related to the notion of measure is that of the Lebesgue integral, while the most
used approach in Euclidean spaces, within the elementary calculus benchmark, is that of
the Riemann integral.
Let (Ω, F, P) be a probability space and let X be a random variable on (Ω, F) with
probability distribution μ and distribution function F, both associated to the probability
measure P. Throughout this course, we shall use the notation dP(ω), dF(x), or dμ(x)
interchangeably.
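On a finite sample set the step-function integral is just a weighted sum; this toy sketch (with made-up values and probabilities) checks the standard-representation formula against the pointwise sum:

```python
omega = ["a", "b", "c", "d"]
P = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}      # a probability measure
g = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 5.0}      # a step function

# standard representation: sum over distinct values y_i of y_i * P(A_i),
# where A_i = g^{-1}({y_i})
integral = sum(y * sum(P[w] for w in omega if g[w] == y)
               for y in set(g.values()))
# same number as the pointwise sum over omega
assert abs(integral - sum(g[w] * P[w] for w in omega)) < 1e-12
print(integral)  # 2.9 (= 1*0.3 + 2*0.3 + 5*0.4), up to float rounding
```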
3.5. Continuous random variables
Definition 20. Let P and Q be two measures on a measurable space (Ω, F). Measure
P is said to have density f : Ω → R with respect to measure Q if f is a nonnegative
F-measurable function and the following condition holds:
P(A) = ∫_A f dQ = ∫_A f(ω)dQ(ω) = ∫_A f(ω)Q(dω) for each A ∈ F.
Using the general definition above of a density of a measure, we can particularize it so as
to introduce the concept of a density of a random variable, which is simply a density of
the probability distribution of the random variable with respect to Lebesgue measure on
the real line.
Definition 21. A random variable X on a probability space (Ω, F, P), with probability
distribution μ, has density function f : R → R if μ has density f with respect to
Lebesgue measure λ on (R, B_R). That is, if f is nonnegative, B_R-measurable, and the following
condition is satisfied:
P[X ∈ B] = μ(B) = ∫_{x∈B} f(x)λ(dx) = ∫_{x∈B} f(x)dx for each B ∈ B_R.
Definition 22. A random variable X on a probability space (Ω, F, P) is absolutely
continuous if it has a density function f : R → R. That is, for each Borel set B ∈ B_R,
we have
P[X ∈ B] = ∫_{x∈B} f(x)dx.
Carathéodory's theorem about the unique extension of a measure implies that if a
random variable X with distribution function F is absolutely continuous, then
F(b) − F(a) = ∫_a^b f(x)dx
holds for every a, b ∈ R such that a ≤ b. Therefore, we obtain that, if f is a density of the
random variable X, then
F(b) − F(a) = P[a < X ≤ b] = ∫_a^b f(x)dx.
Also, f and F are related by
F(x) = P[X ≤ x] = ∫_{−∞}^x f(t)dt.
Notice that F′(x) = f(x) need not hold for each x ∈ R; all that is required is that f
integrates properly, as expressed in the definition of an absolutely continuous random variable
above. On the other hand, if F′(x) = f(x) holds for each x ∈ R and f is continuous, then
it follows from the fundamental theorem of calculus that f is indeed a density of F.⁵
Nevertheless, if X is an absolutely continuous random variable with distribution function
F, then F can be differentiated almost everywhere, and at each continuity point x of f,
we have F′(x) = f(x).
Discrete and continuous random variables allow for an analogous treatment. For a
discrete random variable X on some probability space (Ω, F, P) with support {x_1, x_2, . . . },
the function f : {x_1, x_2, . . . } → R defined by f(x_i) := P[X = x_i] is called the discrete
density function of X. Now, compare how we compute probabilities in the discrete case,
P[X ≤ x] = F(x) = Σ_{x_i ≤ x} f(x_i) and P[a < X ≤ b] = F(b) − F(a) = Σ_{a < x_i ≤ b} f(x_i),
with the case in which X is an absolutely continuous random variable with density f:
P[X ≤ x] = F(x) = ∫_{−∞}^x f(t)dt and P[a < X ≤ b] = F(b) − F(a) = ∫_a^b f(x)dx.
Throughout the remainder of the course we shall use integrals when no possible confusion
arises. In general, the particular analysis of discrete cases will only require changing integrals
to sums in the appropriate formulae.
Example 20. A box contains a > 0 red balls and b > 0 black balls. A random sample of n
balls is drawn from the box. Let X be the number of red balls picked. We would like to
compute the density of X, considered as a random variable on some probability space, if
the sampling is with replacement. To answer this, let us specify the set of balls as
S := {1, . . . , a, a + 1, . . . , a + b}
and let us follow the convention that {1, . . . , a} are red balls and {a + 1, . . . , a + b} are
black balls. Then, since there is replacement, our sample set is Ω = S^n so that |Ω| =
(a + b)^n. The random variable X can be specified as
X(ω) = X((ω_1, . . . , ω_n)) = |{ω_i ∈ S : ω_i ≤ a, i = 1, . . . , n}|.
Also, the discrete density function of X is defined as f(x) = P[X = x]. Therefore, we
need to count the samples which have exactly x coordinates no larger
than a. In other words, we must compute the cardinality of the event
A = {ω ∈ Ω : |{ω_i ∈ S : ω_i ≤ a}| = x}.
Since the sampling is with replacement, there are a^x ways of selecting x coordinates yield-
ing numbers no larger than a and b^{n−x} ways of selecting the remaining n − x coordinates
yielding numbers between a + 1 and a + b. Finally, there are C(n, x) ways of choosing x
coordinates from the n coordinates in the sample. Then, we obtain
f(x) = C(n, x) a^x b^{n−x}/(a + b)^n.
Now, in this experiment the probability of choosing a red ball after drawing one ball from
the box is p = a/(a + b). This is known as the probability of success in a sequence of n
Bernoulli trials. Using this probability of success, we can rewrite
f(x) = C(n, x) (a/(a + b))^x (b/(a + b))^{n−x} = C(n, x) p^x (1 − p)^{n−x},
which is known as the density function of a Binomial distribution with parameter p.
⁵ We make no formal statements on the relation between differentiation and integration.
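Two quick checks on the density just derived (with arbitrary illustrative values a = 3, b = 2, n = 4): the two expressions for f agree, and f sums to one over x = 0, . . . , n:

```python
from math import comb

a, b, n = 3, 2, 4
p = a / (a + b)

def f(x):
    """Binomial density in the 'counting' form of Example 20."""
    return comb(n, x) * a**x * b**(n - x) / (a + b)**n

for x in range(n + 1):
    # agrees with the p, 1-p form
    assert abs(f(x) - comb(n, x) * p**x * (1 - p)**(n - x)) < 1e-12
assert abs(sum(f(x) for x in range(n + 1)) - 1) < 1e-12
print([round(f(x), 4) for x in range(n + 1)])  # [0.0256, 0.1536, 0.3456, 0.3456, 0.1296]
```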
3.6. Functions of a random variable
Given a random variable X on a probability space (Ω, F, P), a typical problem in prob-
ability theory is that of finding the density of the random variable g(X). Since there is
a measurability requirement in the definition of a random variable, one usually restricts
attention to the case where g is a one-to-one function.
The treatment of this problem is relatively simple in the discrete case.
Example 21. Consider a discrete random variable X supported on {1, 2, . . . , n} and with
discrete density function f(x) = C(n, x) p^x q^{n−x} for some 0 < p, q < 1. Let Y = g(X) = a + bX
for some a, b > 0. We are interested in obtaining the discrete density function of Y. Let
us denote such a density by h. First, notice that, by applying g to the elements in the
support of X, the support of Y is g({1, 2, . . . , n}) = {a + b, a + 2b, . . . , a + nb}. Then, we
can compute
h(y) = f((y − a)/b) = C(n, (y − a)/b) p^{(y−a)/b} q^{n−(y−a)/b},
where y ∈ {a + b, a + 2b, . . . , a + nb}.
Suppose now that X is absolutely continuous with distribution function F and density
f. Furthermore, assume that f is continuous. Consider a one-to-one function g : R → R
such that g^{−1} is differentiable on R. We ask ourselves about the density of Y = g(X).
Let H and h denote, respectively, the distribution function and the density of the random
variable Y := g(X). Let T := g^{−1}. Recall that for an absolutely continuous random variable
X with distribution function F and density f, if f is continuous, then F′(x) = f(x) holds
for each x ∈ R.
First, suppose that g is increasing. Then, for y ∈ R,

    H(y) = P[g(X) ≤ y] = P[X ≤ T(y)] = F(T(y)).

Therefore, since T = g^{−1} is differentiable, we obtain

    h(y) = (d/dy) H(y) = (d/dy) F(T(y)) = F'(T(y)) T'(y) = f(T(y)) T'(y).

On the other hand, if g is decreasing, then

    H(y) = P[g(X) ≤ y] = P[g(X) < y] = P[X > T(y)] = 1 − F(T(y)),

so that

    h(y) = (d/dy) H(y) = −F'(T(y)) T'(y) = −f(T(y)) T'(y).

Thus, in either case the random variable Y = g(X) has density

    h(y) = f(T(y)) |T'(y)|.
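The recipe h(y) = f(T(y)) |T'(y)| can be checked numerically. Below is a small sketch where, purely for illustration, X is taken to have the exponential density e^{−x} on (0, ∞) and g(x) = 1/x (a decreasing one-to-one map); the transformed density should still integrate to 1:

```python
import math

# X with density f(x) = e^{-x}, x > 0; g(x) = 1/x, so T(y) = g^{-1}(y) = 1/y
# and T'(y) = -1/y^2, giving h(y) = f(1/y) / y^2 on (0, inf).
def f(x):
    return math.exp(-x)

def h(y):
    T = 1.0 / y               # T(y) = g^{-1}(y)
    dT = -1.0 / y ** 2        # T'(y)
    return f(T) * abs(dT)     # h(y) = f(T(y)) |T'(y)|

# Crude Riemann sum of h over (0.001, 200): should be close to 1.
step = 0.001
mass = sum(h(0.001 + i * step) for i in range(200000)) * step
```

The small deficit from 1 comes only from truncating the integration range and the discretization, not from the formula itself.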
Example 22. Consider a positive absolutely continuous random variable X with con-
tinuous density f. We are interested in obtaining the density function of 1/X. Note that
T(y) = g^{−1}(y) = 1/y, which is differentiable for each y ≠ 0. Also, T'(y) = −1/y², so that

    h(y) = f(1/y)/y².
We can use arguments similar to those above to obtain the density of a transformation
Y = g(X) even when g is not one-to-one.
Example 23. Suppose that X is an absolutely continuous random variable with density

    f(x) = (1/√(2π)) e^{−x²/2},

which corresponds to a distribution known as normal with parameters (0, 1). Let Y = X²
be a random variable with distribution function H and density function h. Here notice
that the transformation g(X) = X² is not one-to-one. However, note that

    H(y) = P[X² ≤ y] = P[−√y ≤ X ≤ √y] = (1/√(2π)) ∫_{−√y}^{√y} e^{−x²/2} dx = (2/√(2π)) ∫_0^{√y} e^{−x²/2} dx,

since the graph of f is symmetric around the origin. Now, substituting x = √y, so that
dx = (1/2) y^{−1/2} dy, we have

    H(x) = ∫_0^x (2/√(2π)) e^{−y/2} (1/2) y^{−1/2} dy.

Therefore, we obtain that Y = X² has density

    h(y) = 0 if y ≤ 0,   and   h(y) = (1/√(2π)) y^{−1/2} e^{−y/2} if y > 0,

which corresponds to a distribution known as chi-square with parameter 1.
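A quick simulation sketch of this example: squaring standard normal draws and comparing P[Y ≤ 1] against the chi-square(1) density integrated over (0, 1] (sample size and the cutoff 1 are arbitrary choices):

```python
import math
import random

random.seed(0)
# Simulate Y = X^2 for X standard normal and estimate P[Y <= 1].
n = 200000
p_sim = sum(random.gauss(0.0, 1.0) ** 2 <= 1.0 for _ in range(n)) / n

# Integrate h(y) = y^{-1/2} e^{-y/2} / sqrt(2*pi) over (0, 1].  Substituting
# y = u^2 removes the singularity at 0:  integral = ∫_0^1 2 e^{-u^2/2}/sqrt(2*pi) du.
du = 0.001
p_int = sum(2 * math.exp(-(((k + 0.5) * du) ** 2) / 2) / math.sqrt(2 * math.pi) * du
            for k in range(1000))
```

Both numbers should be close to P[|X| ≤ 1] ≈ 0.6827 for a standard normal X.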
Problems
1. Let X be a random variable on some probability space (Ω, F, P) and let g : R → R be
a one-to-one function. Show that Y := g(X) is a random variable on (Ω, F, P).
2. Let F_1, . . . , F_n be distribution functions on some probability space (Ω, F, P). Show that
G := Σ_{i=1}^n a_i F_i, where a_i ∈ R_+ for each i = 1, . . . , n and Σ_{i=1}^n a_i = 1, is a distribution
function on (Ω, F, P).
3. Let X be an absolutely continuous random variable on some probability space (Ω, F, P)
with density f(x) = (1/2)e^{−|x|} for x ∈ R. Compute P[X ≤ 0], P[|X| ≤ 2], and
P[1 ≤ |X| ≤ 2].
4. Any point in the interval [0, 1) can be represented by its decimal expansion .x_1 x_2 . . . .
Suppose that a point is chosen at random from the interval [0, 1). Let X be the first digit
in the decimal expansion representing the point. Compute the density of X considered as
a random variable on some probability space.
5. A box contains 6 red balls and 4 black balls. A random sample of n balls is drawn from
the box. Let X be the number of red balls picked. Compute the density of X, considered
as a random variable on some probability space, if the sampling is without replacement.
6. Let n be a positive integer and let h be a real-valued function defined by

    h(x) := c 2^{−x} if x = 1, 2, . . . , n,   and   h(x) := 0 otherwise.

Find the value of c such that h is a discrete density function on some probability space.
7. Let X be a discrete random variable on some probability space with support

    {−3, −1, 0, 1, 2, 3, 5, 8}

and discrete density function f specified by f(−3) = .2, f(−1) = .15, f(0) = .2, f(1) = .1,
f(2) = .1, f(3) = .15, f(5) = .05, and f(8) = .05. Compute the following probabilities:
(a) X is negative;
(b) X is even;
(c) X takes a value between 1 and 5 inclusive;
(d) P[X = 3 | X ≥ 0];
(e) P[X ≥ 3 | X > 0].
8. A box contains 12 numbered balls. Two balls are drawn with replacement from the
box. Let X be the larger of the two numbers on the balls. Compute the density of X
considered as a random variable on some probability space.
9. Let X be a random variable on some probability space (Ω, F, P) such that
P[|X − 1| = 2] = 0. Express P[|X − 1| ≤ 2] in terms of the distribution function F of X.
10. Show that the distribution function F of a random variable is continuous from the
right and that

    lim_{x→−∞} F(x) = 0   and   lim_{x→+∞} F(x) = 1.
11. A point is chosen at random from the interior of a sphere of radius r. Each point in
the sphere is equally likely to be chosen. Let X be the square of the Euclidean distance
of the chosen point from the center of the sphere. Find the distribution function of X
considered as a random variable on some probability space.
12. The distribution function F of some random variable X on some probability space is
defined by

    F(x) := 0 if x ≤ 0,   and   F(x) := 1 − e^{−λx} if x > 0,

where λ > 0. Find a number m such that F(m) = 1/2.
13. Let X be a random variable (on some probability space) with distribution function

    F(x) := 0 if x < 0,   x/3 if 0 ≤ x < 1,   x/2 if 1 ≤ x < 2,   1 if x ≥ 2.

Compute the following probabilities:
(a) P[1/2 ≤ X ≤ 3/2];
(b) P[1/2 ≤ X ≤ 1];
(c) P[1/2 ≤ X < 1];
(d) P[1 ≤ X ≤ 3/2];
(e) P[1 < X < 2].
14. The distribution function F of some random variable X (on some probability space)
is defined by

    F(x) = 1/2 + x/(2(|x| + 1)),   x ∈ R.

Find a density function f for F. At what points x will F'(x) = f(x)?
15. Let X be an absolutely continuous random variable with density f. Find a formula
for the density of Y = |X|.
16. Let X be a positive absolutely continuous random variable with density f. Find a
formula for the density of Y = 1/(X + 1).
17. Let T be a positive absolutely continuous random variable on some probability space
(Ω, F, P). Let T denote the failure date of some system. Let F be the distribution function
of T, and assume that F(t) < 1 for each t > 0. Then, we can write F(t) = 1 − e^{−G(t)} for
some one-to-one function G : R_{++} → R_{++}. Assume also that G'(t) = g(t) exists for each
t > 0.
(a) Show that T has density f satisfying

    f(t)/(1 − F(t)) = g(t),   t > 0.

(b) Show that for s, t > 0,

    P[T > t + s | T > t] = e^{−∫_t^{t+s} g(m) dm}.
Chapter 4
Random vectors
4.1. Definitions
Many random phenomena have multiple aspects or dimensions. To study such phe-
nomena we need to consider random vectors.
Definition 23. Let (Ω, F) be a measurable space. An n-dimensional random vector on
(Ω, F) is a vector-valued F-measurable function X : Ω → R^n.
It can be shown that a vector-valued function X = (X_1, . . . , X_n) is a random vector
on some measurable space (Ω, F) if and only if each X_i, i = 1, . . . , n, is a random variable
on (Ω, F). Therefore, a random vector is simply an n-tuple X = (X_1, . . . , X_n) of random
variables.
Definition 24. Let X = (X_1, . . . , X_n) be a random vector on some probability space
(Ω, F, P). The joint probability distribution of the random vector X is the probability
measure μ on (R^n, B_{R^n}) defined by

    μ(B) := P[{ω ∈ Ω : (X_1(ω), . . . , X_n(ω)) ∈ B}] = P[(X_1, . . . , X_n) ∈ B],

for each B ∈ B_{R^n}.
Moreover, the joint distribution function of the random vector X is the function
F : R^n → R defined by^6

    F(x_1, . . . , x_n) := μ(S_x) = P[{ω ∈ Ω : (X_1(ω), . . . , X_n(ω)) ≤ (x_1, . . . , x_n)}] = P[X_1 ≤ x_1, . . . , X_n ≤ x_n],

where S_x := {y ∈ R^n : y_i ≤ x_i, i = 1, . . . , n} is the set of points southwest of x.
Definition 25. A random vector X = (X_1, . . . , X_n) on a probability space (Ω, F, P)
with probability distribution μ is discrete if μ has a countable support. In this case, μ
is completely determined by μ({x}) = P[X_1 = x_1, . . . , X_n = x_n] for x ∈ supp(X) ⊆ R^n.
If a random vector is discrete, then we say that its joint probability distribution and its
joint distribution function are discrete as well.
Definition 26. Let X = (X_1, . . . , X_n) be a discrete random vector on some probability
space (Ω, F, P). The real-valued function f : supp(X) → R defined by f(x_1, . . . , x_n) :=
P[X_1 = x_1, . . . , X_n = x_n] is called the discrete joint density function of X.

^6 For x, y ∈ R^n, notation x ≤ y means x_i ≤ y_i for each i = 1, . . . , n.
Definition 27. A random vector X = (X_1, . . . , X_n) on a probability space (Ω, F, P) with
probability distribution μ is absolutely continuous if it has a density f : R^n → R with
respect to Lebesgue measure on (R^n, B_{R^n}) such that for each Borel set B ∈ B_{R^n},

    P[(X_1, . . . , X_n) ∈ B] = ∫···∫_B f(x_1, . . . , x_n) dx_1 · · · dx_n.

The density f is called the joint density function of X.
If X is an n-dimensional random vector (on some probability space) with probability
distribution μ, joint distribution function F, and joint density f (which may be discrete
as well) and g_i : R^n → R is the function defined by g_i(x_1, . . . , x_n) := x_i, then g_i(X) is the
random variable X_i with (a) marginal probability distribution μ_i given by

    μ_i(B) = μ({(x_1, . . . , x_n) ∈ R^n : x_i ∈ B}) = P[X_i ∈ B] for each B ∈ B_R,

(b) marginal distribution function given by

    F_i(x_i) = P[X_i ≤ x_i],

and (c) marginal density given by

    f_i(x_i) = ∫···∫_{R^{n−1}} f(x_1, . . . , x_n) dx_1 · · · dx_{i−1} dx_{i+1} · · · dx_n,

in the absolutely continuous case, or

    f_i(x_i) = Σ_{supp(X_1)} · · · Σ_{supp(X_{i−1})} Σ_{supp(X_{i+1})} · · · Σ_{supp(X_n)} f(x_1, . . . , x_n),

in the discrete case.
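The discrete case is easy to sketch in code: the marginal of a coordinate is obtained by summing the joint density over all the other coordinates (the joint density below is made up):

```python
# Discrete joint density of (X1, X2) on a small support (made-up numbers):
f = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def marginal(f, i):
    """Marginal discrete density of coordinate i: sum f over the other coordinates."""
    g = {}
    for point, p in f.items():
        g[point[i]] = g.get(point[i], 0.0) + p
    return g

f1 = marginal(f, 0)   # marginal density of X1
f2 = marginal(f, 1)   # marginal density of X2
```

Each marginal is again a discrete density: its values are nonnegative and sum to 1.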
4.2. Functions of random vectors
A change of variables for a discrete random vector is done straightforwardly just as shown
for the case of a discrete random variable. Therefore, we turn to analyze the case of
absolutely continuous random vectors.
Let g : U → V be a one-to-one continuously differentiable function, where U, V ⊆ R^n
are open sets. We begin with n random variables X_1, . . . , X_n, with joint density function
f, transform them into new random variables Y_1, . . . , Y_n through the functions

    y_1 = g_1(x_1, . . . , x_n)
    . . .
    y_n = g_n(x_1, . . . , x_n),

and ask about the joint density function of Y_1, . . . , Y_n. Let T := g^{−1} and suppose that its
Jacobian never vanishes, i.e.,

    J(y) = det( ∂T_i/∂y_j (y) )_{i,j=1,...,n} ≠ 0 for each y ∈ V.
Under these conditions we can state the following useful result.
Theorem 6 (Change of Variables). Let g : U → V be a one-to-one continuously
differentiable function, where U, V are open sets in R^n. Suppose that T := g^{−1} satisfies
J(y) ≠ 0 for each y ∈ V. If X is a random vector with density f supported in U, then the
random vector Y = g(X) has density h supported in V and given by

    h(y) = f(T(y)) |J(y)| if y ∈ V,   and   h(y) = 0 if y ∉ V.
Example 24. Let (X_1, X_2) be an absolutely continuous random vector with joint density
function

    f(x_1, x_2) = e^{−(x_1 + x_2)},   x_1, x_2 ∈ R_+.

Consider the transformation given by

    y_1 = x_1 + x_2,   y_2 = 2x_1 − x_2.

We are asked to find the joint density function of (Y_1, Y_2). To answer this, note first that

    x_1 = (y_1 + y_2)/3,   x_2 = (2y_1 − y_2)/3.

Then, by applying the result in the theorem above, one obtains

    |J(y)| = |(∂x_1/∂y_1)(∂x_2/∂y_2) − (∂x_2/∂y_1)(∂x_1/∂y_2)| = 1/3,

and, consequently,

    h(y_1, y_2) = (1/3) e^{−y_1}

for y_1 ≥ 0 and −y_1 ≤ y_2 ≤ 2y_1 (the image of R_+² under g).
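Example 24 can be verified numerically: the Jacobian determinant of the (linear) inverse map is constant at −1/3, and integrating h(y_1, y_2) = (1/3)e^{−y_1} over the image region {y_1 ≥ 0, −y_1 ≤ y_2 ≤ 2y_1} gives total mass 1. A small sketch (step sizes are arbitrary):

```python
import math

# Inverse map T(y1, y2) = ((y1 + y2)/3, (2*y1 - y2)/3) from Example 24.
def T(y1, y2):
    return (y1 + y2) / 3.0, (2.0 * y1 - y2) / 3.0

def jac_det(y1, y2, eps=1e-6):
    """Numerical Jacobian determinant of T via central differences."""
    d11 = (T(y1 + eps, y2)[0] - T(y1 - eps, y2)[0]) / (2 * eps)  # dx1/dy1
    d12 = (T(y1, y2 + eps)[0] - T(y1, y2 - eps)[0]) / (2 * eps)  # dx1/dy2
    d21 = (T(y1 + eps, y2)[1] - T(y1 - eps, y2)[1]) / (2 * eps)  # dx2/dy1
    d22 = (T(y1, y2 + eps)[1] - T(y1, y2 - eps)[1]) / (2 * eps)  # dx2/dy2
    return d11 * d22 - d12 * d21

J = jac_det(1.0, 0.5)   # T is linear, so J = -1/3 everywhere

# For fixed y1, y2 ranges over an interval of length 3*y1, so the total mass of
# h is ∫_0^∞ (1/3) e^{-y1} * (3*y1) dy1 = ∫ y1 e^{-y1} dy1 = 1 (midpoint rule).
mass = sum((1 / 3) * math.exp(-y1) * (3 * y1) * 0.001
           for y1 in (0.0005 + k * 0.001 for k in range(40000)))
```

Both checks agree with the hand computation: |J| = 1/3 and the transformed density integrates to 1.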
4.3. Independent random variables
Analogously to the analysis of independent random events, an interesting case appears
when the random phenomenon (or dimension of a random phenomenon) described by one
random variable occurs regardless of that captured by another random variable.
Definition 28. The random variables X_1, . . . , X_n (on some probability space (Ω, F, P))
are independent random variables if

    P[X_1 ∈ B_1, . . . , X_n ∈ B_n] = P[X_1 ∈ B_1] · · · P[X_n ∈ B_n]

for all Borel sets B_1, . . . , B_n with the form B_i = (a_i, b_i], where a_i < b_i, i = 1, . . . , n.
Let (X_1, . . . , X_n) be a random vector (on some probability space) with joint distribution
function F and joint density f (corresponding either to the absolutely continuous or the
discrete case). It can be shown that the requirement in the definition of independence of
random variables above is equivalent to the condition

    P[X_1 ≤ x_1, . . . , X_n ≤ x_n] = P[X_1 ≤ x_1] · · · P[X_n ≤ x_n] for each (x_1, . . . , x_n) ∈ R^n.

Then, using the definition of distribution function, one obtains that X_1, . . . , X_n are inde-
pendent if and only if

    F(x_1, . . . , x_n) = F_1(x_1) · · · F_n(x_n).

Furthermore, by Fubini's theorem, we also obtain that X_1, . . . , X_n are independent if and
only if

    f(x_1, . . . , x_n) = f_1(x_1) · · · f_n(x_n).
Example 25. Let X_1, X_2, X_3 be independent absolutely continuous random variables
with common density

    f(x) = e^{−x},   x > 0,

and suppose that we are interested in obtaining the density function h(y) of the random
variable Y = min{X_1, X_2, X_3}. Then, for a given number y > 0, we have

    H(y) = P[min{X_1, X_2, X_3} ≤ y] = 1 − P[min{X_1, X_2, X_3} > y]
         = 1 − P[X_1 > y, X_2 > y, X_3 > y] = 1 − P[X_1 > y] P[X_2 > y] P[X_3 > y]
         = 1 − (∫_y^∞ e^{−x} dx)^3 = 1 − e^{−3y}.

Consequently, h(y) = H'(y) = 3e^{−3y} for y > 0.
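A short simulation confirms Example 25: the minimum of three independent unit exponentials should satisfy H(y) = 1 − e^{−3y} (the evaluation point 0.5 and the sample size are arbitrary):

```python
import math
import random

random.seed(1)
# X1, X2, X3 i.i.d. with density e^{-x}; Y = min(X1, X2, X3) has density 3 e^{-3y}.
n = 200000
mins = [min(random.expovariate(1.0) for _ in range(3)) for _ in range(n)]

p_sim = sum(m <= 0.5 for m in mins) / n      # empirical P[Y <= 0.5]
p_exact = 1 - math.exp(-3 * 0.5)             # H(0.5) = 1 - e^{-1.5}
```

The empirical frequency matches the closed-form distribution function to within sampling error.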
4.4. Probability distribution of independent random vectors
The theory of independent random variables can be extended readily to random vectors.
If X_i is a k_i-dimensional random vector, then X_1, . . . , X_n are independent random vectors
if the earlier definition of independence holds with the appropriate changes in the formula
so as to consider random vectors instead of random variables. In particular, we say that
X_1, . . . , X_n are independent random vectors if

    P[X_1 ≤ x_1, . . . , X_n ≤ x_n] = P[X_1 ≤ x_1] · · · P[X_n ≤ x_n] for each x_1 ∈ R^{k_1}, . . . , x_n ∈ R^{k_n}.
Let us analyze in more detail the probability distribution of two independent random
vectors (or variables). Let X and Y be independent random vectors (on some probability
space (Ω, F, P)) with distributions μ_x and μ_y in R^n and R^m, respectively. Then, (X, Y)
has probability distribution μ_x × μ_y in R^{n+m} given by

    (μ_x × μ_y)(B) = ∫_{R^n} μ_y({y : (x, y) ∈ B}) μ_x(dx) for each B ∈ B_{R^{n+m}}.
The following result allows us to use the probability distribution of two independent
random vectors (or variables) to compute probabilities.
Theorem 7. Let X and Y be independent random vectors (on some probability space
(Ω, F, P)) with distributions μ_x and μ_y in R^n and R^m, respectively. Then,

    P[(X, Y) ∈ B] = ∫_{R^n} P[(x, Y) ∈ B] μ_x(dx) for each B ∈ B_{R^{n+m}},

and

    P[X ∈ A, (X, Y) ∈ B] = ∫_A P[(x, Y) ∈ B] μ_x(dx) for each A ∈ B_{R^n} and B ∈ B_{R^{n+m}}.
Now, we can use the result above to propose an alternative approach to obtain the
distribution of a particular function of random variables, namely, the sum of independent
random variables.
Let X and Y be two independent random variables with respective distributions μ_x
and μ_y. Using the first result in the Theorem above, we obtain

    P[X + Y ∈ B] = ∫_R μ_y(B − x) μ_x(dx) = ∫_R P[Y ∈ B − x] μ_x(dx),

where B − x := {b − x : b ∈ B}. The convolution of μ_x and μ_y is the measure μ_x * μ_y
defined by

    (μ_x * μ_y)(B) := ∫_R μ_y(B − x) μ_x(dx) for each B ∈ B_R.

That is, X + Y has probability distribution μ_x * μ_y. Now, if F and G are the distribution
functions corresponding to μ_x and μ_y, then X + Y has distribution function F * G. Taking
B = (−∞, y], the definition of convolution above gives us

    (F * G)(y) = ∫_R G(y − x) dF(x).

Furthermore, if μ_x and μ_y have respective density functions f and g, then Z = X + Y has
density f * g, which, in turn, gives us the density h_{X+Y} of the sum of two independent
random variables,

    h_{X+Y}(z) = (f * g)(z) = ∫_R g(z − x) f(x) dx.

These arguments can be easily generalized to obtain the distribution of the sum of an
arbitrary finite number n of independent random variables.
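The convolution formula can be sketched with a discretized integral. Taking X and Y independent Uniform(0, 1) variables for illustration, the convolution of their densities is the triangular density of X + Y (h(z) = z on [0, 1] and 2 − z on [1, 2]):

```python
# Discretized convolution h(z) = ∫ g(z - x) f(x) dx for X, Y ~ Uniform(0, 1).
dx = 0.001
grid = [i * dx for i in range(1001)]          # points covering the support [0, 1]

def f(x):
    """Uniform(0, 1] density (used for both f and g here)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def h(z):
    """Riemann-sum approximation of the convolution (f * f)(z)."""
    return sum(f(z - x) * f(x) for x in grid) * dx

val_half = h(0.5)          # triangular density gives 0.5 here
val_three_half = h(1.5)    # and 2 - 1.5 = 0.5 here
```

The numerical convolution reproduces the triangular shape up to the discretization step.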
4.5. Covariance and correlation
Definition 29. Let X_1 and X_2 be two random variables (on a probability space (Ω, F, P))
with joint probability distribution μ and joint distribution function F. The covariance
of X_1 and X_2 is

    Cov[X_1, X_2] := ∫_R ∫_R (x_1 − μ_1)(x_2 − μ_2) dF(x_1, x_2),

where

    μ_i := ∫_R x_i dF_i(x_i) for each i = 1, 2.

If Cov[X_1, X_2] = 0, then we say that the random variables X_1 and X_2 are uncorrelated.
Moreover, the correlation coefficient between X_1 and X_2 is the number

    ρ[X_1, X_2] = Cov[X_1, X_2] / [ ∫_R (x_1 − μ_1)² dF_1(x_1) ∫_R (x_2 − μ_2)² dF_2(x_2) ]^{1/2}.
Obviously, if two random variables X_1 and X_2 are uncorrelated, then ρ[X_1, X_2] = 0.
Also, if ∫_R (x_i − μ_i)² dF_i(x_i) < ∞ for i = 1, 2, then ρ[X_1, X_2] = 0 implies that X_1 and X_2
are uncorrelated.
Let us now study the relation between independence and no correlation of two random
variables. Consider two random variables X_1 and X_2 with joint distribution function F.
Then, applying the definition of covariance above, one obtains

    Cov[X_1, X_2] = ∫_R ∫_R (x_1 − μ_1)(x_2 − μ_2) dF(x_1, x_2)
                  = ∫_R ∫_R x_1 x_2 dF(x_1, x_2) − μ_1 ∫_R x_2 dF_2(x_2) − μ_2 ∫_R x_1 dF_1(x_1) + μ_1 μ_2
                  = ∫_R ∫_R x_1 x_2 dF(x_1, x_2) − μ_1 μ_2.
Now, suppose that the random variables X_1 and X_2 are independent. Then, since in that
case F(x_1, x_2) = F_1(x_1) F_2(x_2), the expression above becomes

    ∫_R ∫_R x_1 x_2 dF(x_1, x_2) − μ_1 μ_2 = μ_1 μ_2 − μ_1 μ_2 = 0.

Therefore, if two random variables are independent, then they are uncorrelated too. How-
ever, two uncorrelated random variables need not be independent.
Problems
1. Let (X_1, X_2) be a random vector with joint density

    f(x_1, x_2) = (1/2π) e^{−(x_1² + x_2²)/2},   x_1, x_2 ∈ R.

Consider the transformation g to polar coordinates, so that T = g^{−1} is given by

    (x_1, x_2) = T(y_1, y_2) = (y_1 cos y_2, y_1 sin y_2),

with V = {(y_1, y_2) ∈ R² : y_1 > 0, 0 < y_2 < 2π}. Let h denote the joint density of
(Y_1, Y_2), and let h_1 and h_2 be the marginal densities of Y_1 and Y_2, respectively. Show that
(a) h(y_1, y_2) = (2π)^{−1} y_1 e^{−y_1²/2},
(b) h_1(y_1) = y_1 e^{−y_1²/2},
(c) h_2(y_2) = (2π)^{−1}.
2. Let X and Y be two absolutely continuous random variables whose respective densities,
given two numbers σ, τ > 0,

    f(x) = (1/(σ√(2π))) e^{−x²/(2σ²)}   and   l(y) = (1/(τ√(2π))) e^{−y²/(2τ²)},

are supported in R. Show that if X and Y are independent, then S = X + Y has density

    m(s) = (1/√(2π(σ² + τ²))) e^{−s²/(2(σ² + τ²))},

supported in R.
3. Suppose that X and Y are independent absolutely continuous random variables. Derive
formulas for the joint density for (X + Y, X), the density of X + Y , and the density of
Y X.
4. Let X and Y be absolutely continuous random variables with joint distribution function
F and joint density f. Find the joint distribution function and the joint density of the
random variables W = X² and Z = Y². Show that if X and Y are independent, then W
and Z are independent too.
5. Let X and Y be two independent absolutely continuous random variables (on some
probability space (Ω, F, P)) having the same density each, f(x) = g(y) = 1 for x, y ∈ (0, 1].
Find
(a) P[|X − Y| ≤ .5];
(b) P[|X/Y − 1| ≤ .5];
(c) P[Y ≥ X | Y ≥ 1/3].
6. Let X and Y be absolutely continuous random variables with joint density

    f(x, y) = λ² e^{−λy} if 0 ≤ x ≤ y,   and   f(x, y) = 0 otherwise,

where λ > 0. Find the marginal densities of X and Y. Find the joint distribution function
of X and Y.
7. Let f(x, y) = c e^{−(x² − xy + 4y²)/2} for x, y ∈ R. How should c be chosen to make f a joint
density for two random variables X and Y? Find the marginal densities of f.
8. Let X, Y and Z be absolutely continuous random variables with joint density

    f(x, y, z) = c if x² + y² + z² ≤ 1,   and   f(x, y, z) = 0 otherwise.

How should c be chosen to make f indeed a joint density of X, Y and Z? Find the marginal
density of X. Are X, Y and Z independent?
9. Let X be an absolutely continuous random variable with density f(x) = 1/2 for
x ∈ (−1, 1]. Let Y = X². Show that X and Y are uncorrelated but not independent.
Chapter 5
Moments and expected value of a distribution
5.1. Definition of moments of a distribution
The information contained in a probability distribution can often be summarized by some
characteristics of the general shape of the distribution and its location. Such characteristics
are in most cases described by numbers known as the moments of the distribution.
Definition 30. Let X be a random variable with distribution function F. Suppose that,
given a positive integer r and a real number k, ∫_R |x − k|^r dF(x) < ∞. Then, the rth
moment about the point k of the random variable is

    m_{r,k} := ∫_R (x − k)^r dF(x).

Moreover, if k = 0, then the moment m_{r,0} is called the rth moment about the origin.
We usually write m_r instead of m_{r,0}.
5.2. Definition of expected value
The concept of expected value of a random variable can be interpreted as the every-day
notion of mean value, weighted arithmetic sum, or arithmetic average, according to the
frequency interpretation of probabilities.
Definition 31. Let X be a random variable on a probability space (Ω, F, P). The ex-
pected value (or expectation) of X on (Ω, F, P) is the integral of X with respect to the
measure P:

    E[X] = ∫_Ω X dP = ∫_Ω X(ω) dP(ω).

Also, suppose that the random variable X on (Ω, F, P) has probability distribution μ and
distribution function F. Then, the expected value of X can accordingly be specified as

    E[X] = ∫_R x dμ = ∫_R x dF(x).

Furthermore, if X is a discrete random variable with support {x_1, x_2, . . . } and discrete
density function f, we have

    E[X] = Σ_{i=1}^∞ x_i P[X = x_i] = Σ_{i=1}^∞ x_i f(x_i).

On the other hand, if X is an absolutely continuous random variable with density f, then

    E[X] = ∫_R x f(x) dx.

As in the definition of covariance between two random variables, we shall often use the
notation μ = E[X] when no possible confusion arises.
Notice that the expected value of a random variable E[X] coincides with its first
moment (about the origin), m_1.
Given a random variable X with distribution function F and a function g : R → R,
we have

    E[g(X)] = ∫_R g(x) dF(x).
Finally, consider a random variable X on a probability space (Ω, F, P) with distribution
function F. If the first two moments of X (about the origin), m_1 and m_2, are finite, then
we can define the variance of X as

    Var[X] := E[(X − μ)²] = ∫_Ω (X(ω) − μ)² dP(ω) = ∫_R (x − μ)² dF(x),

which coincides with the second moment of X about the point μ, m_{2,μ}. Intuitively, the
variance of a random variable is a measure of the degree of dispersion of the corresponding
distribution with respect to its expected value.
5.3. Properties of expected values
The following properties, which can be immediately derived from analogous properties
of the integral operator, will prove useful in the remainder of the course. We shall not
demonstrate formally the required properties of the integral. Consider two random vari-
ables X and Y whose expected values are well defined on some probability space and two
real numbers α and β. Then, the following properties hold:
(P1) E[αX] = αE[X];
(P2) E[β] = β;
(P3) E[αX + βY] = αE[X] + βE[Y];
(P4) if X ≤ Y almost everywhere, then E[X] ≤ E[Y].
As for the variance of a random variable X, using some of the properties above, we can
obtain the following useful result.

    Var[X] = E[(X − μ)²]
           = E[X² + μ² − 2μX]
           = E[X²] + μ² − 2μE[X]
           = E[X²] + μ² − 2μ²
           = E[X²] − μ² = m_2 − m_1².
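The identity Var[X] = m_2 − m_1² is easy to verify numerically for a discrete density (the density below is made up for illustration):

```python
# Discrete random variable X with a made-up density f; check Var[X] = m2 - m1^2.
f = {0: 0.2, 1: 0.5, 2: 0.3}

m1 = sum(x * p for x, p in f.items())             # first moment (expected value)
m2 = sum(x ** 2 * p for x, p in f.items())        # second moment about the origin
var_direct = sum((x - m1) ** 2 * p for x, p in f.items())   # E[(X - mu)^2]
```

Both routes give the same variance, here 0.49.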
Finally, we can also use the moments of a distribution to obtain an expression for the
covariance between two random variables. Consider two random variables X_1 and X_2 with
joint distribution function F. In the previous chapter, we showed that

    Cov[X_1, X_2] = ∫_R ∫_R x_1 x_2 dF(x_1, x_2) − μ_1 μ_2.

So, if we let g : R² → R be the function g(X_1, X_2) = X_1 X_2, then we have

    ∫_R ∫_R x_1 x_2 dF(x_1, x_2) = E[X_1 X_2].

Therefore, we can write

    Cov[X_1, X_2] = E[X_1 X_2] − m_1(X_1) m_1(X_2),

where m_1(X_i) indicates the moment of order 1 of the random variable X_i, i = 1, 2.
5.4. Conditional expected value
In many applications we are interested in deriving the expected value of a random variable
given that there is some information available about another random variable. Consider
two random variables, X and Y, on a probability space (Ω, F, P). We would like to answer
questions of the form: what is the expected value of the random variable X given that the
realization of another random variable Y is y (i.e., Y(ω) = y)? We begin by introducing
the theory of conditional expected value from first principles and then relate this concept
to that of conditional distribution.
Consider a probability space (Ω, F, P) and another σ-algebra G ⊆ F on Ω. We start
by facing the question: what is the expected value of a random variable X on (Ω, F, P)
given that we know, for each B ∈ G, whether or not ω ∈ B?
Definition 32. Let (Ω, F, P) be a probability space and X a P-integrable random vari-
able on (Ω, F, P). Consider a σ-algebra G ⊆ F on Ω. Then, there exists a random variable
E[X|G], called the conditional expected value of X given G, which satisfies:
(1) E[X|G] is G-measurable and P-integrable;
(2) E[X|G] meets the functional equation

    ∫_B E[X|G](ω) dP(ω) = ∫_B X(ω) dP(ω) for each B ∈ G.
We shall not prove the existence of the random variable specified in the definition above.
There will be in general many such random variables E[X|G], any of which is called a
version of the conditional expected value. However, any two versions are equal almost
everywhere. For a given ω ∈ Ω, the value E[X|G](ω) is intuitively interpreted as the
expected value of X for an observer who knows, for each B ∈ G, whether or not ω ∈ B (ω
itself in general remains unknown).
Example 26. Let B_1, B_2, . . . be either a finite or a countable partition of Ω that generates
the σ-algebra G (i.e., G = σ({B_1, B_2, . . . })). First, notice that since E[X|G] : Ω → R is G-
measurable, it must be the case that E[X|G](ω) = k_i for each ω ∈ B_i, for each i = 1, 2, . . . ,
where each k_i is a constant. Then, by applying condition (2) of the definition of conditional
expectation above when B = B_i, one obtains, for each B_i:

    ∫_{B_i} k_i dP(ω) = ∫_{B_i} X(ω) dP(ω)
    ⟹ E[X|G](ω) P(B_i) = ∫_{B_i} X(ω) dP(ω)
    ⟹ E[X|G](ω) = (1/P(B_i)) ∫_{B_i} X(ω) dP(ω),

which is specified in this way for P(B_i) > 0.
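Example 26 can be made concrete on a finite sample space (the space, the random variable, and the partition below are all made up for illustration):

```python
# Finite Ω = {0,...,5} with equally likely outcomes; the partition {B1, B2}
# generates the sub-σ-algebra G.
omega = [0, 1, 2, 3, 4, 5]
P = {w: 1.0 / 6.0 for w in omega}
X = {w: w ** 2 for w in omega}          # a random variable on Ω
partition = [{0, 1, 2}, {3, 4, 5}]

def cond_exp(w):
    """E[X|G](w) = (1/P(Bi)) * integral of X over the block Bi containing w."""
    Bi = next(B for B in partition if w in B)
    PB = sum(P[v] for v in Bi)
    return sum(X[v] * P[v] for v in Bi) / PB
```

On B_1 = {0, 1, 2} the conditional expectation is constant at (0 + 1 + 4)/3 = 5/3, and on B_2 it is 50/3; averaging E[X|G] over Ω recovers E[X], as condition (2) with B = Ω requires.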
For a random variable X on a probability space (Ω, F, P), the conditional probabil-
ity of the event {X ∈ B}, for B ∈ B_R, given a σ-algebra G ⊆ F on Ω is a G-measurable,
P-integrable random variable P[X ∈ B|G] : Ω → R on (Ω, F, P) satisfying

    ∫_C P[X ∈ B|G](ω) dP(ω) = P[{X ∈ B} ∩ C], for each C ∈ G.

Intuitively, P[X ∈ B|G] gives us the probability of the event {X ∈ B} when the observer
knows, for each C ∈ G, whether or not ω ∈ C. The existence of P[X ∈ B|G] is guaranteed
if X is P-integrable. There will be in general many such random variables P[X ∈ B|G],
but any two of them are equal almost everywhere.
Example 27. As in the previous example, let C_1, C_2, . . . be either a finite or a count-
able partition of Ω that generates the σ-algebra G (i.e., G = σ({C_1, C_2, . . . })). Since
P[X ∈ B|G] is G-measurable, it must be the case that P[X ∈ B|G](ω) = λ_i for each
ω ∈ C_i, for each i = 1, 2, . . . , where each λ_i is a constant. Then, by applying the defini-
tion of conditional probability above when C = C_i, one obtains, for each C_i:

    λ_i P(C_i) = P[{X ∈ B} ∩ C_i]
    ⟹ P[X ∈ B|G](ω) = P[{X ∈ B} ∩ C_i] / P(C_i)

whenever P(C_i) > 0.
Next we introduce the notion of conditional probability distribution. We do so by
means of the following result.
Theorem 8. Let X be a random variable on a probability space (Ω, F, P) and let G ⊆ F
be a σ-algebra on Ω. Then, there exists a function μ : B_R × Ω → R satisfying:
(1) for each ω ∈ Ω, μ(·, ω) is a probability measure on B_R;
(2) for each B ∈ B_R, μ(B, ·) is a version of P[X ∈ B|G].
The probability measure μ(·, ω) identified in the Theorem above is known as a condi-
tional probability distribution of X given G.
As for the relation between a conditional probability distribution and a conditional
expected value of a random variable, the following result provides the required insight.
Theorem 9. Let X be a random variable on a probability space (Ω, F, P) and, for ω ∈ Ω,
let μ(·, ω) be a conditional probability distribution of X given a σ-algebra G ⊆ F on Ω.
Then ∫_R x μ(dx, ω) is a version of E[X|G](ω) for ω ∈ Ω.
Now we can use this theory to obtain an analogous framework for the density of a
random vector. By doing so we are able to answer the question raised at the beginning
of the section. Consider a random vector (X, Y) on some probability space with joint
density f and respective marginal densities f_1 and f_2. Then, the conditional density of
X given Y = y is the density defined by

    f(x|y) = f(x, y) / f_2(y).

Now, using the Theorem above, it can be shown that the conditional expected value of
X given Y = y can be written as

    E[X|Y = y] =_{a.e.} ∫_R x f(x|y) dx.
Example 28. Let X and Y be two absolutely continuous random variables with joint
density

    f(x, y) = n(n − 1)(y − x)^{n−2} if 0 ≤ x ≤ y ≤ 1,   and   f(x, y) = 0 otherwise.

We are asked to compute the conditional density and conditional expected value of Y
given X = x.
First, the marginal density of X is given by

    f_1(x) = ∫_{−∞}^{+∞} f(x, y) dy = n(n − 1) ∫_x^1 (y − x)^{n−2} dy
           = n(n − 1) [(y − x)^{n−1}/(n − 1)]_x^1 = n(1 − x)^{n−1}

if 0 ≤ x ≤ 1, and f_1(x) = 0 otherwise. Therefore, for 0 ≤ x ≤ 1,

    f(y|x) = (n − 1)(y − x)^{n−2}/(1 − x)^{n−1} if x ≤ y < 1,   and   f(y|x) = 0 otherwise.

Then, we can compute

    E[Y|X = x] = ∫_{−∞}^{+∞} y f(y|x) dy = (n − 1)(1 − x)^{1−n} ∫_x^1 y(y − x)^{n−2} dy.

To compute the integral above, note that

    y(y − x)^{n−2} = [y − x + x](y − x)^{n−2} = x(y − x)^{n−2} + (y − x)^{n−1}.

So, using that algebraic identity, we have

    E[Y|X = x] = (n − 1)(1 − x)^{1−n} ∫_x^1 [x(y − x)^{n−2} + (y − x)^{n−1}] dy
               = (n − 1)(1 − x)^{1−n} [(1 − x)^n/n + x(1 − x)^{n−1}/(n − 1)]
               = (n − 1)(1 − x)/n + x = (n − 1 + x)/n.
5.5. Moment generating function
Even though the definition of a moment gives us a closed expression, from a practical
viewpoint it may sometimes be complicated (if not impossible) to compute the required
integral. Therefore, we outline another method for deriving the moments of most distri-
butions.
Definition 33. Let X be a random variable (on a probability space (Ω, F, P)) with dis-
tribution function F. The moment generating function of X is the real-valued function
φ defined by

    φ(t) := E[e^{tX}] = ∫_R e^{tx} dF(x)

for each t ∈ R for which φ(t) is finite.
We now make use of the theory of Taylor expansions. In particular, let us first invoke
the expansion result

    e^s = 1 + s + s²/2! + s³/3! + · · · = Σ_{i=0}^∞ s^i/i!,   s ∈ R.
Now, suppose that φ(t) is well defined throughout the interval (−t_0, t_0) for some t_0 > 0.
Then, since e^{|tx|} ≤ e^{tx} + e^{−tx} and ∫_R [e^{tx} + e^{−tx}] dF(x) is finite for |t| < t_0, it follows
that

    ∫_R e^{|tx|} dF(x) = Σ_{i=0}^∞ ∫_R (|tx|^i/i!) dF(x) < ∞.

Therefore, one obtains

    φ(t) = ∫_R e^{tx} dF(x) = Σ_{i=0}^∞ (t^i/i!) ∫_R x^i dF(x)
         = 1 + t ∫_R x dF(x) + (t²/2!) ∫_R x² dF(x) + (t³/3!) ∫_R x³ dF(x) + . . . for |t| < t_0.

Finally, using the theory of Taylor expansions, it follows that if the rth derivative of φ(t),
φ^{(r)}(t), exists in some neighborhood of t = 0, then

    φ^{(r)}(0) = ∫_R x^r dF(x) = m_r.
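The relation φ^{(r)}(0) = m_r can be illustrated numerically. Taking a standard normal for concreteness (its moment generating function is φ(t) = e^{t²/2}), finite-difference derivatives at t = 0 recover the first two moments m_1 = 0 and m_2 = 1:

```python
import math

def phi(t):
    """MGF of a standard normal random variable: phi(t) = e^{t^2/2}."""
    return math.exp(t * t / 2.0)

eps = 1e-4
m1 = (phi(eps) - phi(-eps)) / (2 * eps)                  # phi'(0)  ≈ m_1 = 0
m2 = (phi(eps) - 2 * phi(0.0) + phi(-eps)) / eps ** 2    # phi''(0) ≈ m_2 = 1
```

In practice one differentiates the closed-form φ symbolically; the finite differences here just confirm the mechanism.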
The definition of the moment generating function can be extended readily to random
vectors.
Definition 34. Let X = (X_1, . . . , X_n) be a random vector (on a probability space
(Ω, F, P)) with joint distribution function F. The moment generating function of
X is the real-valued function φ : R^n → R defined by

    φ(t_1, . . . , t_n) := ∫···∫_{R^n} e^{(t_1 x_1 + ··· + t_n x_n)} dF(x_1, . . . , x_n)

for each (t_1, . . . , t_n) ∈ R^n for which φ(t_1, . . . , t_n) is finite.
Then, using the theory of Taylor expansions, as in the case of a random variable, one
obtains

    ∂^r φ(0, . . . , 0)/∂t_k^r = ∫···∫_{R^n} x_k^r dF(x_1, . . . , x_n) = ∫_R x_k^r dF_k(x_k) = m_r(X_k),

where m_r(X_k) denotes the rth moment (about the origin) of the random variable X_k.
The moment generating function of a random variable or of a random vector has
several important applications in the study of the underlying probability distributions. In
particular, the moment generating function of a random vector can be used to study
whether a set of random variables are independent or not.
Theorem 10. Let (X, Y) be a random vector with moment generating function φ(t_1, t_2).
Then, the random variables X and Y are independent if and only if

    φ(t_1, t_2) = φ(t_1, 0) φ(0, t_2)

for each t_1, t_2 ∈ R.
Also, the moment generating function of a random variable can be used to characterize
the distribution of the random variable itself. Using the following result, we can identify
a distribution just by identifying its moment generating function.
Theorem 11. The moment generating function uniquely determines a probability distri-
bution and, conversely, if the moment generating function of a distribution exists, then it
is unique.
The uniqueness property of the theorem above can be used to find the probability
distribution of a transformation Y = g(X_1, . . . , X_n).
Example 29. Let X be an absolutely continuous random variable with density function

    f(x) = (1/√(2π)) e^{−x²/2},   −∞ < x < ∞,

and consider the transformation Y = X². Then, we have

    φ_Y(t) = E[e^{tX²}] = (1/√(2π)) ∫_{−∞}^{+∞} e^{−(1/2)(1−2t)x²} dx = (1 − 2t)^{−1/2} for t < 1/2,

a moment generating function that happens to correspond to a continuous random variable
with density function

    h(y) = e^{−y/2}/√(2πy),   y > 0.

Recall that we already obtained this result in Example 23.
Example 30. Let X_1, X_2, . . . , X_k be a set of discrete random variables with common (discrete) density function
f(x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, 2, . . . , n,
where n ≥ 1 is an integer and p ∈ (0, 1). In the next chapter we will show that the moment generating function of each X_i is given by
ψ_i(t) = [(1 − p) + pe^t]^n.
Suppose that the random variables X_1, X_2, . . . , X_k are independent. Then, the moment generating function of the random variable X = Σ_{i=1}^k X_i can be obtained as
ψ(t) = Π_{i=1}^k [(1 − p) + pe^t]^n = [(1 − p) + pe^t]^{kn}.
It follows that the random variable X has density function
f(x) = C(kn, x) p^x (1 − p)^{kn−x}, x = 0, 1, 2, . . . , kn.
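As a numerical cross-check of Example 30, the density of X = X_1 + · · · + X_k can also be computed by convolving the common b(n, p) density with itself k times; the result should match the b(kn, p) density term by term. A minimal sketch in plain Python, with illustrative values n = 4, p = 0.3, k = 3:

```python
from math import comb

def binom_pmf(n, p, x):
    # density of b(n, p) at x: C(n, x) p^x (1 - p)^(n - x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

def convolve(f, g):
    # density of the sum of two independent discrete random variables
    h = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            h[i + j] += fi * gj
    return h

n, p, k = 4, 0.3, 3
single = [binom_pmf(n, p, x) for x in range(n + 1)]
total = single
for _ in range(k - 1):
    total = convolve(total, single)

# The k-fold convolution coincides with the b(kn, p) density term by term.
target = [binom_pmf(k * n, p, x) for x in range(k * n + 1)]
assert all(abs(a - b) < 1e-12 for a, b in zip(total, target))
```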
Problems
1. Let (X_1, X_2) be a random vector. Using the concept of moment generating function, show that
Cov[X_1, X_2] = ∂²ψ(0, 0)/∂t_1∂t_2 − (∂ψ(0, 0)/∂t_1)(∂ψ(0, 0)/∂t_2).
2. Let (X, Y) be an absolutely continuous random vector with joint density
f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp {Q},
where
Q = −(1/(2(1 − ρ²))) [ (x − μ_x)²/σ_x² + (y − μ_y)²/σ_y² − 2ρ(x − μ_x)(y − μ_y)/(σ_xσ_y) ].
Show that
f(x|y) = (1/√(2π(1 − ρ²)σ_x²)) exp { −(1/(2(1 − ρ²)σ_x²)) [ (x − μ_x) − ρ(σ_x/σ_y)(y − μ_y) ]² }.
3. Let X be a random variable on some probability space (Ω, F, P) which takes only the values 0, 1, 2, . . . . Show that E[X] = Σ_{n=1}^∞ P[X ≥ n].
4. Let X be an absolutely continuous random variable with supp(X) = [0, b], where b > 0, with distribution function F, and with density function f. Show that
E[X] = ∫_0^b [1 − F(x)] dx.
5. Let X and Y be random variables with joint density
f(x, y) = { c if x² + y² ≤ 1; 0 if x² + y² > 1 }.
Find the conditional density of X given Y = y and compute the conditional expected
value E[X|Y = y].
6. Let X_1, . . . , X_n be independent random variables having a common density with mean μ and variance σ². Set X̄_n = (X_1 + · · · + X_n)/n.
(a) By writing X_k − X̄_n = (X_k − μ) − (X̄_n − μ), show that
Σ_{k=1}^n (X_k − X̄_n)² = Σ_{k=1}^n (X_k − μ)² − n(X̄_n − μ)².
(b) From (a) obtain
E[ Σ_{k=1}^n (X_k − X̄_n)² ] = (n − 1)σ².
7. Let X be a random variable which takes only the values 0, 1, 2, . . . . Show that, for t ∈ (−1, 1), ψ(t) = E[t^X], ψ′(t) = E[Xt^{X−1}], and ψ″(t) = E[X(X − 1)t^{X−2}].
8. Let X and Y be two random variables (on some probability space (Ω, F, P)) such that
P[|X − Y| ≤ a] = 1
for some constant a ∈ R. Show that if Y is P-integrable, then X is P-integrable too and
|E[X] − E[Y]| ≤ a.
9. Show that Var[aX] = a² Var[X] for any random variable X and constant a ∈ R.
10. Let X and Y be absolutely continuous random variables with joint density
f(x, y) = { λ² e^{−λy} if 0 ≤ x ≤ y; 0 otherwise }.
Find the conditional density f(y|x).
11. Let X and Y be absolutely continuous random variables with joint density
f(x, y) = c e^{−(x² − xy + y²)/2},
for each x, y ∈ R. Find the conditional expected value of Y given X = x.
Hint: Use the Gaussian integral identity: ∫_{−∞}^{+∞} e^{−z²} dz = √π.
12. Let X and Y be two absolutely continuous random variables with joint density
f(x, y) = { n(n − 1)(y − x)^{n−2} if 0 ≤ x ≤ y ≤ 1; 0 otherwise }.
Find the conditional expected value of X given Y = y.
Chapter 6
Some special distributions
6.1. Some discrete distributions
We begin with the binomial distribution.
6.1.1. The binomial distribution
A Bernoulli trial is a random experiment with two possible mutually exclusive outcomes. Without loss of generality we can call these outcomes success and failure (e.g., defective or non-defective, female or male). Denote by p ∈ (0, 1) the probability of success. A sequence of independent Bernoulli trials, in the sense that the outcome of any trial does not affect the outcome of any other trial, is called a sequence of binomial or Bernoulli trials.
For a sequence of n Bernoulli trials, let X_i be the random variable associated with the outcome of the ith trial so that X_i(success) = 1 and X_i(failure) = 0. Also, let X be the random variable associated with the number of successes in the n trials. The number of ways of selecting x successes out of n trials is C(n, x). Since trials are independent and the probability of each of these ways is p^x (1 − p)^{n−x}, the discrete density function of X is given by
f(x) = P[X = x] = C(n, x) p^x (1 − p)^{n−x} for x = 0, 1, 2, . . . , n.
Recall that this density function was obtained earlier in Example 18. The probability distribution of X is called the binomial distribution and we write X ∼ b(n, p). Using the fact that, for a positive integer n, (a + b)^n = Σ_{x=0}^n C(n, x) b^x a^{n−x}, we can obtain
ψ(t) = Σ_{x=0}^n e^{tx} C(n, x) p^x (1 − p)^{n−x} = Σ_{x=0}^n C(n, x) (pe^t)^x (1 − p)^{n−x} = [(1 − p) + pe^t]^n.
Then,
ψ′(t) = n[(1 − p) + pe^t]^{n−1} pe^t
and
ψ″(t) = n(n − 1)[(1 − p) + pe^t]^{n−2} p²e^{2t} + n[(1 − p) + pe^t]^{n−1} pe^t.
It follows that
E[X] = m_1 = ψ′(0) = n[1 − p + p]^{n−1} p = np
and
Var[X] = m_2 − m_1² = ψ″(0) − (E[X])² = n(n − 1)[1 − p + p]^{n−2} p² + np − n²p² = n²p² − np² + np − n²p² = np(1 − p).
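The moment computations above can be checked numerically: differentiating ψ(t) = [(1 − p) + pe^t]^n at t = 0 by finite differences should reproduce E[X] = np and Var[X] = np(1 − p). A minimal sketch with illustrative values n = 10, p = 0.4:

```python
from math import exp

def mgf(t, n, p):
    # ψ(t) = [(1 - p) + p e^t]^n, the binomial moment generating function
    return ((1 - p) + p * exp(t)) ** n

n, p, h = 10, 0.4, 1e-5
m1 = (mgf(h, n, p) - mgf(-h, n, p)) / (2 * h)                    # ≈ ψ'(0) = E[X]
m2 = (mgf(h, n, p) - 2 * mgf(0.0, n, p) + mgf(-h, n, p)) / h**2  # ≈ ψ''(0) = E[X²]

assert abs(m1 - n * p) < 1e-4                      # E[X] = np = 4
assert abs((m2 - m1**2) - n * p * (1 - p)) < 1e-3  # Var[X] = np(1-p) = 2.4
```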
Consider now the special case of the binomial distribution that one obtains when n = 1. Then, X is the random variable associated with the outcome of a single Bernoulli trial so that X(success) = 1 and X(failure) = 0. The probability distribution of X is called the Bernoulli distribution. We write X ∼ b(1, p) and the discrete density function of X is
f(x) = P[X = x] = p^x (1 − p)^{1−x} for x = 0, 1.
One can easily compute
E[X] = (0)(1 − p) + (1)(p) = p;
Var[X] = (0 − p)²(1 − p) + (1 − p)²(p) = p(1 − p);
ψ(t) = e^{t(0)}(1 − p) + e^{t(1)}(p) = 1 + p(e^t − 1).
Notice that the binomial distribution can also be considered as the distribution of the sum of n independent, identically distributed X_i ∼ b(1, p) random variables. Clearly, the number of successes is given by X = X_1 + · · · + X_n. Following this approach, we have
E[X] = Σ_{i=1}^n E[X_i] = np
and
Var[X] = Var[ Σ_{i=1}^n X_i ] = np(1 − p).
Theorem 12. Let X_i ∼ b(n_i, p), i = 1, . . . , k, be independent random variables. Then,
Y_k = Σ_{i=1}^k X_i ∼ b(Σ_{i=1}^k n_i, p).
Corollary 1. Let X_i ∼ b(n, p), i = 1, . . . , k, be independent random variables. Then,
Y_k = Σ_{i=1}^k X_i ∼ b(kn, p).
This result has been demonstrated in Example 30.
6.1.2. The negative binomial distribution
Consider now a sequence (maybe infinite) of Bernoulli trials and let X be the random variable associated to the number of failures in the sequence before the rth success, where r ≥ 1. Then, X + r is the number of trials necessary to produce exactly r successes. This will happen if and only if the (X + r)th trial results in a success and among the previous (X + r − 1) trials there are exactly X failures or, equivalently, r − 1 successes. We remark that we need to take into account the probability that the (X + r)th trial results in a success. It follows by the independence of trials that
f(x) = P[X = x] = C(x + r − 1, x) p^r (1 − p)^x = C(x + r − 1, r − 1) p^r (1 − p)^x for x = 0, 1, 2, . . .
We say that the random variable X has the negative binomial distribution and write X ∼ NB(r, p). For the special case given by r = 1, we say that X has the geometric distribution and write X ∼ G(p). For the negative binomial distribution, we have
ψ(t) = p^r [1 − (1 − p)e^t]^{−r};
E[X] = r(1 − p)/p;
Var[X] = r(1 − p)/p².
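The negative binomial formulas above can be verified directly from the density by truncating the infinite sums at a point where the tail is negligible. A small sketch with illustrative values r = 3, p = 0.4:

```python
from math import comb

def nb_pmf(x, r, p):
    # P[X = x]: probability of x failures before the rth success
    return comb(x + r - 1, x) * p**r * (1 - p)**x

r, p = 3, 0.4
xs = range(400)  # the tail beyond x = 400 is negligible for p = 0.4
mean = sum(x * nb_pmf(x, r, p) for x in xs)
second = sum(x * x * nb_pmf(x, r, p) for x in xs)

assert abs(sum(nb_pmf(x, r, p) for x in xs) - 1) < 1e-12
assert abs(mean - r * (1 - p) / p) < 1e-9                 # E[X] = 4.5
assert abs(second - mean**2 - r * (1 - p) / p**2) < 1e-9  # Var[X] = 11.25
```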
6.1.3. The multinomial distribution
The binomial distribution is generalized in a natural way to the multinomial distribution as follows. Suppose that a random experiment is repeated n independent times. Each repetition of the experiment results in one of k mutually exclusive and exhaustive events A_1, A_2, . . . , A_k. Let p_i be the probability that the outcome (of any repetition) is an element of A_i and assume that each p_i remains constant throughout the n repetitions. Let X_i be the random variable associated with the number of outcomes which are elements of A_i. Also, let x_1, x_2, . . . , x_{k−1} be nonnegative integers such that x_1 + x_2 + · · · + x_{k−1} ≤ n. Then, the probability that exactly x_i outcomes terminate in A_i, i = 1, 2, . . . , k − 1, and, therefore, x_k = n − (x_1 + x_2 + · · · + x_{k−1}) outcomes terminate in A_k is
P[X_1 = x_1, . . . , X_k = x_k] = (n!/(x_1! x_2! · · · x_k!)) p_1^{x_1} p_2^{x_2} · · · p_k^{x_k}.
This is the joint discrete density of a multinomial distribution.
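The joint density above is straightforward to evaluate; the sketch below codes it directly and, as a sanity check, confirms that with k = 2 events it reduces to the binomial density (illustrative values n = 6, p = 0.3):

```python
from math import comb, factorial

def multinomial_pmf(counts, probs):
    # n!/(x_1! ... x_k!) * p_1^{x_1} ... p_k^{x_k}
    n = sum(counts)
    coef = factorial(n)
    for x in counts:
        coef //= factorial(x)
    prob = float(coef)
    for x, p in zip(counts, probs):
        prob *= p**x
    return prob

# With k = 2 events the formula collapses to the binomial density.
n, p = 6, 0.3
for x in range(n + 1):
    assert abs(multinomial_pmf([x, n - x], [p, 1 - p])
               - comb(n, x) * p**x * (1 - p)**(n - x)) < 1e-12
```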
6.1.4. The Poisson distribution
Recall that, for each r ∈ R, we have
e^r = 1 + r + r²/2! + r³/3! + · · · = Σ_{s=0}^∞ r^s/s!.
Then, given r > 0, consider the function f : R → R_+ defined by
f(x) = r^x e^{−r}/x! for x = 0, 1, 2, . . .
One can check that
Σ_{x=0}^∞ f(x) = e^{−r} Σ_{x=0}^∞ r^x/x! = e^{−r} e^r = 1.
Hence, f satisfies the conditions required for being a discrete density function. The distribution associated to the density function above is known as the Poisson distribution and, for a random variable X that follows such a distribution, we write X ∼ P(r). Empirical evidence indicates that the Poisson distribution can be used to analyze a wide class of applications. In those applications one deals with a process that generates a number of changes (accidents, claims, etc.) in a fixed interval (of time or space). If a process can be modeled by a Poisson distribution, then it is called a Poisson process. Examples of random variables distributed according to the Poisson distribution are: (1) X indicates the number of defective goods manufactured by a productive process in a certain period of time, (2) X indicates the number of car accidents in a unit of time, and so on. For X ∼ P(r), we have
E[X] = Var[X] = r
and
ψ(t) = Σ_{x=0}^∞ e^{tx} r^x e^{−r}/x! = e^{−r} Σ_{x=0}^∞ (re^t)^x/x! = e^{−r} e^{re^t} = e^{r(e^t − 1)}.
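These Poisson facts are easy to confirm numerically from the density itself; the sketch below truncates the series at a point where the tail is numerically zero (illustrative rate r = 2.5):

```python
from math import exp, factorial

def poisson_pmf(x, r):
    # f(x) = r^x e^{-r} / x!
    return r**x * exp(-r) / factorial(x)

r = 2.5
xs = range(100)  # the tail beyond x = 100 is numerically zero here
total = sum(poisson_pmf(x, r) for x in xs)
mean = sum(x * poisson_pmf(x, r) for x in xs)
second = sum(x * x * poisson_pmf(x, r) for x in xs)

assert abs(total - 1) < 1e-12
assert abs(mean - r) < 1e-9             # E[X] = r
assert abs(second - mean**2 - r) < 1e-9  # Var[X] = r
```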
Theorem 13. Let X_i ∼ P(r_i), i = 1, . . . , k, be independent random variables. Then,
S_k = Σ_{i=1}^k X_i ∼ P(r_1 + · · · + r_k).
The following results relate the Poisson with the binomial distribution.
Theorem 14. Let X ∼ P(r_x) and Y ∼ P(r_y) be independent random variables. Then the conditional distribution of X given X + Y is binomial. In particular, (X|X + Y = z) ∼ b(z, r_x/(r_x + r_y)) (that is, for a sequence of z Bernoulli trials). Conversely, let X and Y be independent nonnegative integer-valued random variables with strictly positive densities. If (X|X + Y = z) ∼ b(z, p), then X ∼ P(λp/(1 − p)) and Y ∼ P(λ) for an arbitrary λ > 0.
Theorem 15. If X ∼ P(r) and (Y|X = x) ∼ b(x, p), then Y ∼ P(rp).
6.2. Some continuous distributions
In this section we introduce some of the most frequently used absolutely continuous distributions and describe their properties.
6.2.1. The uniform distribution
A random variable X is said to have the uniform distribution on the interval [a, b] if its density function is given by
f(x) = { 1/(b − a) if a ≤ x ≤ b; 0 otherwise }.
We write X ∼ U[a, b]. Intuitively, the uniform distribution is related to random phenomena where the possible outcomes have the same probability of occurrence. One can easily obtain that
F(x) = { 0 if x ≤ a; (x − a)/(b − a) if a < x ≤ b; 1 if x > b },
E[X] = (a + b)/2, Var[X] = (b − a)²/12, and ψ(t) = (e^{tb} − e^{ta})/(t(b − a)).
Example 31. Let X be a random variable with density
f(x) = { λe^{−λx} if x > 0; 0 otherwise },
where λ > 0. One can easily obtain
F(x) = { 0 if x ≤ 0; 1 − e^{−λx} if x > 0 }.
Consider the transformation Y = F(X) = 1 − e^{−λX}. We note then: x = T(y) = −ln(1 − y)/λ and T′(y) = 1/(λ(1 − y)), so that the density of Y is given by
h(y) = f(T(y)) |T′(y)| = λe^{−λ(−ln(1−y)/λ)} · (1/(λ(1 − y))) = 1
for 0 ≤ y < 1.
So, is it a mere coincidence that in the example above F(X) is uniformly distributed on the interval [0, 1]? The following theorem answers this question and provides a striking result about the uniformity of F(X) for any continuous distribution function F.
Theorem 16. Let X be a random variable with a continuous distribution function F. Then F(X) is uniformly distributed on [0, 1]. Conversely, let F be any distribution function and let X ∼ U[0, 1]. Then, there exists a function g : [0, 1] → R such that g(X) has F as its distribution function, that is, P[g(X) ≤ x] = F(x) for each x ∈ R.
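The converse direction of Theorem 16 is the basis of inverse transform sampling: to simulate from F, push U ∼ U[0, 1] through g = F^{−1}. A minimal sketch using the exponential distribution of Example 31 (illustrative λ = 2, seeded for reproducibility; the final check is a rough Monte Carlo tolerance, not an exact identity):

```python
import math
import random

# F(x) = 1 - e^{-λx} has inverse g(u) = -ln(1 - u)/λ, so g(U) ~ F for U ~ U[0,1].
lam = 2.0
random.seed(0)
samples = [-math.log(1.0 - random.random()) / lam for _ in range(100_000)]

# Monte Carlo sanity check: the sample mean approaches E[X] = 1/λ.
mean = sum(samples) / len(samples)
assert abs(mean - 1 / lam) < 0.01
```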
6.2.2. The Γ, χ², and Beta distributions
It is a well-known result that the integral
Γ(α) := ∫_0^∞ y^{α−1} e^{−y} dy
yields a finite positive number for α > 0. Integration by parts gives us
Γ(α) = (α − 1) ∫_0^∞ y^{α−2} e^{−y} dy = (α − 1)Γ(α − 1).
Thus, if α is a positive integer, then
Γ(α) = (α − 1)(α − 2) · · · (2)(1)Γ(1) = (α − 1)!.
Let us now consider another parameter β > 0 and introduce a new variable by writing y = x/β. Then, we have
Γ(α) = ∫_0^∞ (x/β)^{α−1} e^{−x/β} (1/β) dx.
Therefore, we obtain
1 = ∫_0^∞ (1/(Γ(α)β^α)) x^{α−1} e^{−x/β} dx.
Hence, since Γ(α), α, β > 0, we see that
f(x) = { (1/(Γ(α)β^α)) x^{α−1} e^{−x/β} if x > 0; 0 otherwise }
is a density function of an absolutely continuous random variable. A random variable X with the density above is said to have the gamma distribution and we write X ∼ Γ(α, β). The special case when α = 1 yields the exponential distribution with parameter β. In that case, we write X ∼ exp(β) ≡ Γ(1, β) and the corresponding density function is, therefore,
f(x) = { (1/β) e^{−x/β} if x > 0; 0 otherwise }.
The gamma distribution is often used to model waiting times.
The distribution function associated to a gamma distribution is
F(x) = { 0 if x ≤ 0; (1/(Γ(α)β^α)) ∫_0^x y^{α−1} e^{−y/β} dy if x > 0 }.
The corresponding moment generating function is obtained as follows. First,
ψ(t) = ∫_0^∞ e^{tx} (1/(Γ(α)β^α)) x^{α−1} e^{−x/β} dx = ∫_0^∞ (1/(Γ(α)β^α)) x^{α−1} e^{−x(1−βt)/β} dx.
Second, by setting y = x(1 − βt)/β or, equivalently,
x = βy/(1 − βt) and dx = (β/(1 − βt)) dy,
we obtain
ψ(t) = ∫_0^∞ ((β/(1 − βt))/(Γ(α)β^α)) (βy/(1 − βt))^{α−1} e^{−y} dy = (1/(1 − βt)^α) (1/Γ(α)) ∫_0^∞ y^{α−1} e^{−y} dy = 1/(1 − βt)^α for t < 1/β.
Therefore, for the gamma distribution, we obtain
E[X] = ψ′(0) = αβ and Var[X] = ψ″(0) − (E[X])² = α(α + 1)β² − α²β² = αβ².
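The gamma moments just derived can be cross-checked by integrating the density numerically (midpoint rule, illustrative values α = 3, β = 2):

```python
from math import gamma as Gamma, exp

def gamma_pdf(x, alpha, beta):
    # f(x) = x^{α-1} e^{-x/β} / (Γ(α) β^α) for x > 0
    return x ** (alpha - 1) * exp(-x / beta) / (Gamma(alpha) * beta**alpha)

alpha, beta, dx = 3.0, 2.0, 0.001
m1 = m2 = 0.0
# midpoint-rule integration on (0, 60]; the tail beyond 60 is negligible here
for i in range(60_000):
    x = (i + 0.5) * dx
    w = gamma_pdf(x, alpha, beta) * dx
    m1 += x * w
    m2 += x * x * w

assert abs(m1 - alpha * beta) < 1e-3             # E[X] = αβ = 6
assert abs(m2 - m1**2 - alpha * beta**2) < 1e-2  # Var[X] = αβ² = 12
```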
We turn now to consider the special case of the gamma distribution when α = r/2, for some positive integer r, and β = 2. This gives the distribution of an absolutely continuous random variable X with density
f(x) = { (1/(Γ(r/2)2^{r/2})) x^{r/2−1} e^{−x/2} if x > 0; 0 otherwise }.
This distribution is called the chi-square distribution and we write X ∼ χ²(r) where, for no obvious reason, r is called the number of degrees of freedom of the distribution. The moment generating function of the chi-square distribution is
ψ(t) = 1/(1 − 2t)^{r/2} for t < 1/2,
and its expected value and variance are, respectively, E[X] = r and Var[X] = 2r.
Theorem 17. Let X_i ∼ Γ(α_i, β), i = 1, . . . , k, be independent random variables. Then, Y_k = Σ_{i=1}^k X_i ∼ Γ(Σ_{i=1}^k α_i, β).
Theorem 18. Let X ∼ U[0, 1]. Then, Y = −2 ln X ∼ χ²(2).
Theorem 19. Let X ∼ Γ(α_x, β) and Y ∼ Γ(α_y, β) be two independent random variables. Then, X + Y and X/Y are independent random variables and X + Y and X/(X + Y) are also independent random variables.
Theorem 20. Let {X_n}_{n=1}^∞ be a sequence of independent random variables such that X_n ∼ exp(β) for each n = 1, 2, . . . . Let Y_n = Σ_{i=1}^n X_i for n = 1, 2, . . . and let Z be the random variable corresponding to the number of Y_n ∈ [0, t] for t > 0. Then Z ∼ P(t/β).
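Theorem 20 lends itself to a direct simulation: generate i.i.d. exp(β) interarrival times, count how many partial sums land in [0, t], and compare the count's mean and variance with t/β. A sketch with illustrative values β = 0.5, t = 3 (seeded; the tolerances are loose Monte Carlo margins):

```python
import random

random.seed(1)
beta, t, trials = 0.5, 3.0, 100_000

def arrivals_in(t, beta):
    # number of partial sums Y_n = X_1 + ... + X_n that land in [0, t]
    total, count = 0.0, 0
    while True:
        total += random.expovariate(1.0 / beta)  # exp(β) has mean β
        if total > t:
            return count
        count += 1

counts = [arrivals_in(t, beta) for _ in range(trials)]
mean = sum(counts) / trials
var = sum(c * c for c in counts) / trials - mean**2

# For Z ~ P(t/β): E[Z] = Var[Z] = t/β = 6
assert abs(mean - t / beta) < 0.1
assert abs(var - t / beta) < 0.2
```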
We close this subsection by introducing another important distribution related with the gamma distribution. Let X, Y be two independent random variables such that X ∼ Γ(α, 1) and Y ∼ Γ(β, 1). The joint density function of (X, Y) is then
f(x, y) = (1/(Γ(α)Γ(β))) x^{α−1} y^{β−1} e^{−x−y}, for 0 < x, y < ∞.
Consider the change of variables given by U = X + Y and V = X/(X + Y). Using the change of variables formula, one obtains
h(u, v) = (1/(Γ(α)Γ(β))) u^{α+β−1} v^{α−1} (1 − v)^{β−1} e^{−u}, for 0 < u < ∞ and 0 < v < 1.
The marginal density of V is then
h_2(v) = (v^{α−1}(1 − v)^{β−1}/(Γ(α)Γ(β))) ∫_0^∞ u^{α+β−1} e^{−u} du = (Γ(α + β)/(Γ(α)Γ(β))) v^{α−1} (1 − v)^{β−1} for 0 < v < 1.
The density function above is that of the beta distribution with parameters α and β, and we write V ∼ B(α, β). Now, it follows from Theorem 19 above that U and V are independent random variables. Therefore, since h(u, v) = h_1(u)h_2(v), it must be the case that
h_1(u) = (1/Γ(α + β)) u^{α+β−1} e^{−u}, for 0 < u < ∞.
The function h_1(u) above corresponds to the density function of a gamma distribution such that U ∼ Γ(α + β, 1).
It can be checked that the expected value and the variance of V, which has a beta distribution, are given by
E[V] = α/(α + β) and Var[V] = αβ/((α + β + 1)(α + β)²).
There is no closed expression for the moment generating function of a beta distribution.
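The beta mean formula (and the fact that the density integrates to one) can be confirmed numerically; a midpoint-rule sketch with illustrative parameters α = 2, β = 5:

```python
from math import gamma as G

def beta_pdf(v, a, b):
    # Γ(a+b)/(Γ(a)Γ(b)) v^{a-1} (1-v)^{b-1} on (0, 1)
    return G(a + b) / (G(a) * G(b)) * v ** (a - 1) * (1 - v) ** (b - 1)

a, b, dv = 2.0, 5.0, 1e-5
total = mean = 0.0
for i in range(100_000):
    v = (i + 0.5) * dv          # midpoint rule on (0, 1)
    w = beta_pdf(v, a, b) * dv
    total += w
    mean += v * w

assert abs(total - 1) < 1e-7
assert abs(mean - a / (a + b)) < 1e-7   # E[V] = α/(α+β) = 2/7
```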
The intuition given above regarding the relation between the gamma and the beta
distributions can be extended by the following result.
Theorem 21. Let X ∼ Γ(α, θ) and Y ∼ Γ(β, θ) be two independent random variables. Then X/(X + Y) ∼ B(α, β).
6.2.3. The normal distribution
We introduce now one of the most important distributions in the study of probability and mathematical statistics, the normal distribution. The Central Limit Theorem shows that normal distributions provide a key family of distributions for applications and for statistical inference.
Definition 35. A random variable X is said to have the normal distribution if its density function is given by
f(x) = (1/(σ√(2π))) exp{ −(1/2)((x − μ)/σ)² }.
The parameters μ and σ² correspond, respectively, to the mean and variance of the distribution. We write X ∼ N(μ, σ²). The standard normal distribution is the normal distribution obtained when μ = 0 and σ² = 1.
Suppose that X ∼ N(0, 1) and consider the transformation Y = a + bX for b > 0. Using the change of variable formula, we can derive the expression for the density function of Y as
h(y) = f((y − a)/b) (1/b) = (1/(b√(2π))) exp{ −(1/2)((y − a)/b)² },
so that Y ∼ N(a, b²). For a = μ and b² = σ² one can obtain the converse implication by applying the change of variable formula too. Therefore, the following claim holds.
Theorem 22. A random variable X has a N(μ, σ²) distribution if and only if the random variable (X − μ)/σ has a N(0, 1) distribution.
Using the result above, we can obtain the moment generating function of a random variable X ∼ N(μ, σ²) by using the fact that X = σZ + μ for some random variable Z ∼ N(0, 1). This is done as follows. First, note that
ψ(t) = E[e^{tX}] = E[e^{tσZ+tμ}] = e^{tμ} E[e^{tσZ}] = e^{tμ} ∫_{−∞}^{+∞} e^{tσz} (1/√(2π)) e^{−z²/2} dz.
Second, we compute the integral above as
∫_{−∞}^{+∞} e^{tσz} (1/√(2π)) e^{−z²/2} dz = e^{σ²t²/2} ∫_{−∞}^{+∞} (1/√(2π)) e^{−(z−σt)²/2} dz = e^{σ²t²/2} ∫_{−∞}^{+∞} (1/√(2π)) e^{−s²/2} ds = e^{σ²t²/2},
using the change of variable s = z − σt and the fact that ∫_{−∞}^{+∞} (1/√(2π)) e^{−s²/2} ds = 1.
Therefore, we finally obtain
ψ(t) = e^{tμ} e^{σ²t²/2} = e^{tμ + σ²t²/2}.
Even though many applications can be analyzed using normal distributions, normal density functions usually contain a factor of the type exp{−s²}. Therefore, since antiderivatives cannot be obtained in closed form, numerical integration techniques must be used. Given the relation between a normal distribution and the standard normal distribution, we make use of numerical integration computations as follows. Consider a random variable X ∼ N(μ, σ²), denote by F its distribution function and by H(z) = ∫_{−∞}^z (1/√(2π)) e^{−s²/2} ds the distribution function of the random variable Z = (X − μ)/σ ∼ N(0, 1). Now, suppose that we wish to compute F(x) = P[X ≤ x]. Then, we use the fact that
P[X ≤ x] = P[Z ≤ (x − μ)/σ] = H((x − μ)/σ).
Therefore, all that we need are numerical computations for H(z).
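In practice H(z) need not be tabulated by hand: most numerical environments expose the error function, and H(z) = (1 + erf(z/√2))/2. A minimal sketch of the reduction P[X ≤ x] = H((x − μ)/σ):

```python
from math import erf, sqrt

def H(z):
    # standard normal distribution function, expressed via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    # P[X <= x] = H((x - μ)/σ) for X ~ N(μ, σ²)
    return H((x - mu) / sigma)

assert abs(normal_cdf(2.0, 2.0, 3.0) - 0.5) < 1e-12    # symmetry about μ
assert abs(H(1.96) - 0.975) < 1e-3                     # familiar tabulated value
assert abs(normal_cdf(6.0, 2.0, 2.0) - H(2.0)) < 1e-12
```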
We close this section with a few important results concerning normal distributions.
Theorem 23. Let X be a standard normal random variable. Then,
P[X > x] ∼ (1/(√(2π) x)) e^{−x²/2} as x → ∞.
Theorem 24. If X and Y are independent normally distributed random variables with the same variance, then X + Y and X − Y are independent.
Theorem 25. Let X_i ∼ N(μ_i, σ_i²), i = 1, . . . , n, be independent random variables. Then, for α_1, . . . , α_n ∈ R, we have
Σ_{i=1}^n α_i X_i ∼ N( Σ_{i=1}^n α_i μ_i, Σ_{i=1}^n α_i² σ_i² ).
Theorem 26. If X ∼ N(μ, σ²), then (X − μ)²/σ² ∼ χ²(1).
The result above has already been demonstrated in Examples 23 and 29.
6.2.4. The multivariate normal distribution
Here we consider the generalization of the normal distribution to random vectors.
Definition 36. A random vector X = (X_1, . . . , X_n) is said to have the n-variate normal distribution if its density function is given by
f(x) = f(x_1, . . . , x_n) = (1/((2π)^{n/2} |Σ|^{1/2})) exp{ −(1/2)(x − μ)′ Σ^{−1} (x − μ) },
where Σ ∈ R^n × R^n is a symmetric, positive semi-definite matrix and μ = (μ_1, . . . , μ_n) ∈ R^n. We write X = (X_1, . . . , X_n) ∼ N(μ, Σ). The vector μ is called the mean vector and the matrix Σ is called the dispersion matrix or variance-covariance matrix of the multivariate distribution.⁷
The special case n = 2 yields the bivariate normal distribution. Consider a random vector (X, Y) ∼ N(μ, Σ), where
μ = (μ_x, μ_y)′ and Σ = [ σ_x² σ_xy ; σ_xy σ_y² ].
Here σ_xy denotes the covariance between X and Y. Thus, if ρ is the correlation coefficient between X and Y, then we have σ_xy = ρσ_xσ_y, where the symbol σ_k stands for the standard deviation, σ_k = +(σ_k²)^{1/2}, of the corresponding random variable k = x, y. After noting these notational rearrangements, the matrix Σ above can be easily inverted to obtain
Σ^{−1} = (1/(σ_x²σ_y²(1 − ρ²))) [ σ_y² −ρσ_xσ_y ; −ρσ_xσ_y σ_x² ].
Therefore, the joint density function of (X, Y) is
f(x, y) = (1/(2πσ_xσ_y√(1 − ρ²))) exp {Q},
where
Q = −(1/(2(1 − ρ²))) [ ((x − μ_x)/σ_x)² − 2ρ((x − μ_x)/σ_x)((y − μ_y)/σ_y) + ((y − μ_y)/σ_y)² ].
The following result is crucial to analyze the relation between a multivariate normal
distribution and its marginal distributions.
⁷ Some authors refer to Σ^{−1}, instead of Σ, as the variance-covariance matrix.
Theorem 27. Let X ∼ N(μ, Σ) be such that X, μ, and Σ can be partitioned as
X = (X_1, X_2)′, μ = (μ_1, μ_2)′ and Σ = [ Σ_11 Σ_12 ; Σ_21 Σ_22 ].
Then, X_s ∼ N(μ_s, Σ_ss), s = 1, 2. Moreover, X_1 and X_2 are independent random vectors if and only if Σ_12 = Σ_21 = 0.
The result in the theorem above tells us that any marginal distribution of a multivariate normal distribution is also normal and, further, its mean vector and variance-covariance matrix are those associated with that partial vector. It also asserts that, in the normal case, independence of the random variables follows from their lack of correlation.
Let us consider the bivariate case to fix ideas. It follows from the theorem above that if (X, Y) ∼ N(μ, Σ), with
μ = (μ_x, μ_y)′ and Σ = [ σ_x² σ_xy ; σ_xy σ_y² ],
then X ∼ N(μ_x, σ_x²) and Y ∼ N(μ_y, σ_y²). Suppose now that X and Y are uncorrelated. Then, ρ = 0 and we can use the expression above for f(x, y) to conclude that f(x, y) = f_x(x)f_y(y), where
f_k(k) = (1/(σ_k√(2π))) exp{ −(1/2)((k − μ_k)/σ_k)² } for k = x, y.
Hence, if (X, Y) is bivariate normally distributed with the parameters given above, and X and Y are uncorrelated, then X ∼ N(μ_x, σ_x²) and Y ∼ N(μ_y, σ_y²); this follows simply from the fact that (X, Y) is bivariate normally distributed, as stated in the theorem above. Furthermore, X and Y are independent!
However, it is possible for two random variables X and Y to be distributed jointly in a way such that each one alone is marginally normally distributed and they are uncorrelated, but they are not independent. This can happen only if these two random variables are not distributed jointly as bivariate normal.
Example 32. Suppose that X has a normal distribution with mean 0 and variance 1. Let W be a random variable which takes the values either 1 or −1, each with probability 1/2, and assume W is independent of X. Now, let Y = WX. Then, it can be checked that
(i) X and Y are uncorrelated,
(ii) X and Y have the same normal distribution, and
(iii) X and Y are not independent.
To see that X and Y are uncorrelated, notice that
Cov[X, Y] = E[XY] − E[X]E[Y] = E[XY] = E[XY|W = 1]P[W = 1] + E[XY|W = −1]P[W = −1] = E[X²](1/2) − E[X²](1/2) = 1(1/2) − 1(1/2) = 0.
To see that X and Y have the same normal distribution notice that
F_Y(x) = P[Y ≤ x] = P[Y ≤ x|W = 1]P[W = 1] + P[Y ≤ x|W = −1]P[W = −1] = P[X ≤ x](1/2) + P[−X ≤ x](1/2) = P[X ≤ x](1/2) + P[X ≤ x](1/2) = P[X ≤ x] = F_X(x).
Finally, to see that X and Y are not independent, simply note that |Y| = |X|.
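Claims (i) and (iii) of Example 32 can be illustrated by simulation (seeded; the covariance check uses a loose Monte Carlo tolerance, while |Y| = |X| holds exactly):

```python
import random

random.seed(2)
n = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [random.choice((-1, 1)) * x for x in xs]  # Y = WX with W = ±1

# (i) the sample covariance is near zero: X and Y are uncorrelated ...
cov = sum(x * y for x, y in zip(xs, ys)) / n
assert abs(cov) < 0.05
# (iii) ... yet X and Y are certainly not independent: |Y| = |X| always
assert all(abs(y) == abs(x) for x, y in zip(xs, ys))
```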
We have already seen how to obtain the marginal distributions from a multivariate
normal distribution. We have learned that the marginal distributions are also normal.
We now ask whether putting together two normal distributions yields a bivariate normal
distribution. The answer to this question depends crucially on whether the two random
variables are independent or not.
Theorem 28. Let X ∼ N(μ_x, σ_x²) and Y ∼ N(μ_y, σ_y²) be two independent random variables. Then (X, Y) ∼ N(μ, Σ), where
μ = (μ_x, μ_y)′ and Σ = [ σ_x² 0 ; 0 σ_y² ].
However, in general the fact that two random variables X and Y both have a normal distribution does not imply that the pair (X, Y) has a joint normal distribution. A simple example is one in which X has a normal distribution with expected value 0 and variance 1, and Y = X if |X| > c and Y = −X if |X| ≤ c, where c is approximately equal to 1.54. In this example the two random variables X and Y are uncorrelated but not independent.
The following result tells us about the distribution of a linear transformation of a
normal random vector.
Theorem 29. Let X ∼ N(μ, Σ), and let A ∈ R^m × R^n and b ∈ R^m. Then,
Y = AX + b ∼ N(Aμ + b, AΣA′).
The following result clarifies the relation between a multivariate normal distribution and its conditional distributions.
Theorem 30. Let X ∼ N(μ, Σ) be such that X, μ, and Σ can be partitioned as
X = (X_1, X_2)′, μ = (μ_1, μ_2)′ and Σ = [ Σ_11 Σ_12 ; Σ_21 Σ_22 ].
Assume that Σ is positive definite. Then, the conditional distribution of X_1|X_2 = x_2 is
N( μ_1 + Σ_12 Σ_22^{−1} (x_2 − μ_2), Σ_11 − Σ_12 Σ_22^{−1} Σ_21 ).
For the bivariate normal case, one can use the expression given above for the joint density of (X, Y) to obtain, after dividing such expression by the marginal density of X,
[Y|X = x] ∼ N( μ_y + ρ(σ_y/σ_x)(x − μ_x), σ_y²(1 − ρ²) ),
as stated in the result in the theorem above. We conclude by emphasizing that the conditional expected value of Y given X = x is linear in x:
E[Y|X = x] = μ_y + ρ(σ_y/σ_x)(x − μ_x).
6.2.5. The t and the F distributions
Definition 37. A random variable X is said to have the t distribution if its density function is given by
f(x) = (Γ((ν + 1)/2)/((νπ)^{1/2} Γ(ν/2))) (1 + x²/ν)^{−(ν+1)/2} for each x ∈ R.
We write X ∼ t(ν) and ν is called the degree of freedom of the distribution.
The t distribution is important in statistics because of the following results.
Theorem 31. Let X ∼ N(0, 1) and Y ∼ χ²(n) be independent random variables. Then
T = X/√(Y/n) ∼ t(n).
Theorem 32. Let X_i ∼ N(μ, σ²), i = 1, . . . , n, be independent random variables and let X̄_n and S_n² be the random variables defined as
X̄_n = Σ_{i=1}^n X_i/n and S_n² = Σ_{i=1}^n (X_i − X̄_n)²/(n − 1).
Then:
(i) X̄_n ∼ N(μ, σ²/n);
(ii) X̄_n and S_n² are independent;
(iii) (n − 1)S_n²/σ² ∼ χ²(n − 1);
(iv) (X̄_n − μ)/(S_n/√n) ∼ t(n − 1).
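Part (iii) of Theorem 32 can be illustrated by simulation: for normal samples, (n − 1)S_n²/σ² should have mean n − 1 and variance 2(n − 1). A sketch with illustrative values μ = 5, σ = 2, n = 10 (seeded, loose Monte Carlo tolerances):

```python
import random

random.seed(3)
mu, sigma, n, trials = 5.0, 2.0, 10, 20_000

stats = []
for _ in range(trials):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # sample variance S_n^2
    stats.append((n - 1) * s2 / sigma**2)

mean = sum(stats) / trials
var = sum(s * s for s in stats) / trials - mean**2
assert abs(mean - (n - 1)) < 0.25      # E[χ²(9)] = 9
assert abs(var - 2 * (n - 1)) < 2.0    # Var[χ²(9)] = 18
```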
Definition 38. A random variable X is said to have the F distribution if its density function is given by
f(x) = (Γ((ν + ω)/2) ν^{ν/2} ω^{ω/2}/(Γ(ν/2)Γ(ω/2))) · x^{(ν/2)−1}/(ω + νx)^{(ν+ω)/2} for x > 0,
and f(x) = 0 for x ≤ 0. We write X ∼ F(ν, ω), and ν and ω are called the degrees of freedom of the distribution.
The F distribution is important in statistical work because of the following result.
Theorem 33. Let X ∼ χ²(ν) and Y ∼ χ²(ω) be independent random variables. Then
Z = (X/ν)/(Y/ω) ∼ F(ν, ω).
Problems
1. Let X be a random variable with moment generating function
ψ(t) = (3/4 + (1/4)e^t)^6.
Obtain the density function of X.
2. Let X be the random variable associated to the number of successes throughout n independent repetitions of a random experiment with probability p of success. Show that X satisfies the following form of the Weak Law of Large Numbers:
lim_{n→∞} P[ |X/n − p| < ε ] = 1 for each given ε > 0.
3. Let X be a random variable with density function f(x) = (1/3)(2/3)^x, x = 0, 1, 2, . . . . Find the conditional density of X given that X ≥ 3.
4. Let X be a random variable with geometric distribution. Show that
P[X > k + j|X > k] = P[X > j].
5. Let X be a random variable with moment generating function
ψ(t) = e^{5(e^t − 1)}.
Compute P[X ≤ 4].
6. Let X ∼ P(1). Compute, if it exists, the expected value E[X!].
7. Prove Theorem 17.
8. Let X_1, X_2, and X_3 be independent and identically distributed random variables, each with density function f(x) = e^{−x} for x > 0. Find the density function of Y = min {X_1, X_2, X_3}.
9. Let X ∼ U[0, 1]. Find the density function of Y = −ln X.
10. Prove Theorem 23.
11. Let (X_1, X_2, X_3) have a multivariate normal distribution with mean vector 0 and variance-covariance matrix
Σ = [ 1 0 0 ; 0 2 1 ; 0 1 2 ].
Find P[X_1 > X_2 + X_3 + 2].
12. Mr. Banach carries a matchbox in each of his left and right pockets. When he wants a match, he selects the left pocket with probability p and the right pocket with probability 1 − p. Suppose that initially each box contains k > 0 matches. Compute the probability that Mr. Banach discovers a box empty while the other contains 0 < r ≤ k matches.
13. The number of female insects in a given region is distributed according to a Poisson with mean λ, while the number of eggs laid by each insect is distributed according to a Poisson with mean μ. Find the probability distribution of the number of eggs in the region.
14. Let X_i ∼ N(0, 1), i = 1, . . . , 4, be independent random variables. Show that Y = X_1X_2 + X_3X_4 has the density function f(y) = (1/2) exp {−|y|} for each y ∈ R.
15. Let X and Y be two random variables distributed standard normally. Denote by f and F the density function and the distribution function of X, respectively. Likewise, denote by g and G the density function and the distribution function of Y. Let (X, Y) have joint density function
h(x, y) = f(x)g(y)[ 1 + α(2F(x) − 1)(2G(y) − 1) ],
where α is a constant such that |α| ≤ 1. Show that X + Y is not normally distributed except in the trivial case α = 0, i.e., when X and Y are independent.
16. Give a closed expression for E[X^r], r = 1, 2, . . . , where X ∼ F(ν, ω).
17. Let X ∼ χ²(n) and Y ∼ χ²(m) be independent random variables. Find the density of Z = X/(X + Y).
18. Let (X, Y) ∼ N(μ, Σ). Determine the distribution of the random vector (X + Y, X − Y). Show that X + Y and X − Y are independent if Var[X] = Var[Y].
19. Let X ∼ N(2, 4). Compute P[1 < X < 6] using only the function Φ(y) := (1/√(2π)) ∫_0^y e^{−s²/2} ds.
20. Let (X, Y) have joint density function:
f(x, y) = (1/(6π√7)) exp{ −(8/7)( x²/16 − 31x/32 + xy/8 + y²/9 − 4y/3 + 71/16 ) } for x, y ∈ R.
(a) Find the means and variances of X and Y . Find Cov[X, Y ] too.
(b) Find the conditional density of Y |X = x, E[Y |X = x], and Var[Y |X = x].
(c) Find P[4 ≤ Y ≤ 6|X = 4].
Chapter 7
Convergence of probability distributions
In this chapter we study convergence properties of sequences of random variables.
7.1. Convergence in distribution
Convergence in distribution is the weakest mode of convergence that we shall analyze.
Definition 39. Given some probability space, let {X_n}_{n=1}^∞ be a sequence of random variables and let X be a random variable. Let {F_n}_{n=1}^∞ and F be the corresponding sequence of distribution functions and the distribution function. We say that X_n converges in distribution to X or, equivalently, F_n converges in law (or weakly) to F if
lim_{n→∞} F_n(x) = F(x)
for each point x at which F is continuous. We write X_n →_L X and F_n →_w F.
Example 33. Let X_1, X_2, . . . , X_n be independent and identically distributed (henceforth, i.i.d.) random variables with (common) density function
f(x) = { 1/θ if 0 ≤ x < θ; 0 otherwise },
where 0 < θ < ∞. Let Y_n := max {X_1, X_2, . . . , X_n} for n = 1, 2, . . . . Then, the distribution function of Y_n is given by
G_n(y) = P[Y_n ≤ y] = P[X_1 ≤ y, . . . , X_n ≤ y] = [F(y)]^n = { 0 if y < 0; (y/θ)^n if 0 ≤ y < θ; 1 if y ≥ θ }.
Then, given y ≥ 0,
lim_{n→∞} G_n(y) = G(y) = { 0 if y < θ; 1 if y ≥ θ }.
Therefore, Y_n →_L Y, where Y is the random variable associated to a random experiment that yields θ with certainty.
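The limit in Example 33 is easy to see numerically from the closed form of G_n (illustrative θ = 2):

```python
theta = 2.0

def G_n(y, n):
    # distribution function of Y_n = max{X_1, ..., X_n}, X_i ~ U[0, θ)
    if y < 0:
        return 0.0
    if y >= theta:
        return 1.0
    return (y / theta) ** n

# For any y < θ, G_n(y) → 0 as n grows; for y ≥ θ it is already 1:
# the limit is the distribution function of a point mass at θ.
assert G_n(1.9, 10) > 0.5        # substantial mass below θ for small n
assert G_n(1.9, 1000) < 1e-20    # vanishes as n → ∞
assert G_n(theta, 5) == 1.0
```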
The following example shows that convergence in distribution does not imply convergence of the moments.
Example 34. Let {F_n}_{n=1}^∞ be a sequence of distribution functions defined by
F_n(x) = { 0 if x < 0; 1 − 1/n if 0 ≤ x < n; 1 if x ≥ n }.
Note that, for each n = 1, 2, . . . , F_n is the distribution function of a discrete random variable X_n, supported on the set {0, n}, with density function
P[X_n = 0] = 1 − 1/n, P[X_n = n] = 1/n.
We have, for each given x ∈ R,
lim_{n→∞} F_n(x) = F(x) = { 0 if x < 0; 1 if x ≥ 0 }.
Note that F is the distribution function of a random variable X degenerate at x = 0 so that, clearly, for r = 1, 2, . . . , one obtains E[X^r] = 0. However, we have
E[X_n^r] = (0)^r (1 − 1/n) + (n)^r (1/n) = n^{r−1},
so that, evidently, lim_{n→∞} E[X_n^r] ≠ E[X^r].
7.2. Convergence in probability
Here we formalize a way of saying that a sequence of random variables approaches another random variable. Note, however, that the definition below says nothing about convergence of the random variables in the sense in which convergence is understood in real analysis; it tells us something about the convergence of a sequence of probabilities.
Definition 40. Given some probability space, let {X_n}_{n=1}^∞ be a sequence of random variables and let X be a random variable. We say that X_n converges in probability to X if for each ε > 0,
lim_{n→∞} P[ |X_n − X| > ε ] = 0.
We write X_n →_P X.
Note that the condition in the definition above can be rewritten as
lim_{n→∞} P[ |X_n − X| ≤ ε ] = 1.
Example 35. Let {X_n}_{n=1}^∞ be a sequence of random variables with (discrete) density function
P[X_n = 0] = 1 − 1/n, P[X_n = 1] = 1/n.
Then,
P[ |X_n| > ε ] = { 1/n if 0 < ε < 1; 0 if ε ≥ 1 },
so that lim_{n→∞} P[ |X_n| > ε ] = 0 and, therefore, X_n →_P 0.
Example 36. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of i.i.d. random variables with (common) density function
$$f(x) = \begin{cases} e^{-(x-\theta)} & \text{if } x > \theta;\\ 0 & \text{if } x \le \theta, \end{cases}$$
where $\theta \in \mathbb{R}$, and let $Y_n := \min\{X_1, \dots, X_n\}$ for each $n = 1, 2, \dots$. Let us show that $Y_n \xrightarrow{P} \theta$. To do this, note that, for any given real number $y > \theta$, we have
$$F_n(y) = P[\min\{X_1, \dots, X_n\} \le y] = 1 - P[\min\{X_1, \dots, X_n\} > y] = 1 - P[X_1 > y, \dots, X_n > y] = 1 - \Big(\int_y^{\infty} e^{-(x-\theta)}\,dx\Big)^n = 1 - e^{-n(y-\theta)}.$$
Therefore, for a given $\varepsilon > 0$, we obtain
$$P[|Y_n - \theta| \le \varepsilon] = P[\theta - \varepsilon \le Y_n \le \theta + \varepsilon] = F_n(\theta + \varepsilon) - F_n(\theta - \varepsilon) = 1 - e^{-n\varepsilon},$$
where we have taken into account that $F_n(\theta - \varepsilon) = 0$ since $\theta - \varepsilon < \theta$. Finally, we trivially obtain $1 - e^{-n\varepsilon} \to 1$ as $n \to \infty$, as required.
Theorem 34. Suppose $X_n \xrightarrow{P} X$ and let $g : \mathbb{R} \to \mathbb{R}$ be a continuous function. Then $g(X_n) \xrightarrow{P} g(X)$.
Theorem 35. Suppose $X_n \xrightarrow{P} X$ and $Y_n \xrightarrow{P} Y$. Then:
(i) $\alpha X_n + \beta Y_n \xrightarrow{P} \alpha X + \beta Y$ for each $\alpha, \beta \in \mathbb{R}$;
(ii) $X_n Y_n \xrightarrow{P} X Y$.
Theorem 36. Suppose $X_n \xrightarrow{P} X$; then $X_n \xrightarrow{L} X$.
7.3. Weak law of large numbers
Theorem 37 (WLLN). Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of i.i.d. random variables with $E[X_n] = \mu$ and $\operatorname{Var}[X_n] = \sigma^2 < \infty$, and let $\overline{X}_n = \sum_{i=1}^n X_i \big/ n$. Then, $\overline{X}_n \xrightarrow{P} \mu$.
Proof. Using the inequality of Chebyshev-Bienaymé, for $\varepsilon > 0$, we have
$$0 \le P\big[|\overline{X}_n - \mu| \ge \varepsilon\big] \le \frac{\operatorname{Var}[\overline{X}_n]}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.$$
Then, the result follows since $\sigma^2/(n\varepsilon^2) \to 0$ as $n \to \infty$.
There is also a well-known result, called the Strong Law of Large Numbers, where the requirements of the theorem above can be weakened so as to assume only that the random variables $X_1, X_2, \dots$ are independent and each of them has finite mean $\mu$. Hence, the Strong Law is a first moment result while the Weak Law requires the existence of the second moment.
Example 37. Consider a sequence $\{X_n\}_{n=1}^{\infty}$ of i.i.d. random variables with $E[X_n] = \mu$ and $\operatorname{Var}[X_n] = \sigma^2 < \infty$. Let
$$\overline{X}_n = \sum_{i=1}^n X_i \Big/ n \quad \text{and} \quad S_n^2 = \sum_{i=1}^n \big(X_i - \overline{X}_n\big)^2 \Big/ (n-1).$$
The WLLN states that $\overline{X}_n \xrightarrow{P} \mu$. We ask ourselves about convergence results concerning the sample variance $S_n^2$. Assume that $E[X_i^4] < \infty$ for each $i = 1, 2, \dots$, so that $\operatorname{Var}[S_n^2] < \infty$ for each $n = 1, 2, \dots$. We obtain
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n \big(X_i - \overline{X}_n\big)^2 = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n X_i^2 - \overline{X}_n^2\Big).$$
Using the WLLN, we know that $\overline{X}_n \xrightarrow{P} \mu$. Also, by taking $Y_i = X_i^2$, the WLLN tells us that
$$\overline{Y}_n = \sum_{i=1}^n X_i^2 \big/ n \xrightarrow{P} E[Y_k] = E[X_k^2]$$
for each given $k = 1, 2, \dots$. Then, combining the results in the theorems above, we obtain
$$S_n^2 = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n X_i^2 - \overline{X}_n^2\Big) \xrightarrow{P} 1 \cdot \big(E[X_k^2] - \mu^2\big) = \sigma^2.$$
7.4. Central limit theorem
We start this section by introducing the notion of random sample.
Definition 41. A random sample of size $n$ from a distribution with distribution function $F$ is a set $\{X_1, X_2, \dots, X_n\}$ of i.i.d. random variables whose (common) distribution function is $F$.
Using the results in Proposition 2 and Theorem 20, one can show easily that if $X_1, X_2, \dots, X_n$ are i.i.d. normal random variables with mean $\mu$ and variance $\sigma^2$, then the random variable $\sqrt{n}\,(\overline{X}_n - \mu)/\sigma$ has the standard normal distribution.
Now, suppose that $X_1, X_2, \dots, X_n$ are the observations (not necessarily normal!) of a random sample of size $n$ obtained from any distribution with finite variance $\sigma^2 > 0$ and, therefore, finite mean $\mu$. The important result stated below says that the random variable $\sqrt{n}\,(\overline{X}_n - \mu)/\sigma$ converges in distribution to a random variable distributed according to the standard normal. It will then be possible to use this approximation to the normal distribution to compute approximate probabilities concerning $\overline{X}_n$. In the statistical problem where $\mu$ is unknown, we shall use this approximate distribution of $\overline{X}_n$ to estimate $\mu$.
Theorem 38 (Lindeberg-Lévy Central Limit Theorem). Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of i.i.d. random variables with $E[X_n] = \mu$ and $0 < \operatorname{Var}[X_n] = \sigma^2 < \infty$, and let $\overline{X}_n = \sum_{i=1}^n X_i \big/ n$. Then, the sequence of random variables $\{Y_n\}_{n=1}^{\infty}$ defined by
$$Y_n := \frac{\sum_{i=1}^n X_i - n\mu}{\sigma\sqrt{n}} = \frac{\sqrt{n}\,\big(\overline{X}_n - \mu\big)}{\sigma}$$
satisfies $Y_n \xrightarrow{L} Y \sim N(0, 1)$.
Example 38. Consider a set $\{X_1, \dots, X_{75}\}$ of i.i.d. random variables with $X_i \sim U[0,1]$ for each $i = 1, \dots, 75$. We are interested in computing $P[0.45 < \overline{X}_n < 0.55]$, where $\overline{X}_n = \sum_{i=1}^{75} X_i \big/ 75$. Such a computation may be complicated to obtain directly. However, using the theorem above together with the fact that $\mu = 1/2$ and $\sigma^2 = 1/12$, one obtains
$$P[0.45 < \overline{X}_n < 0.55] \approx P\Big[\frac{\sqrt{75}\,(0.45 - 0.5)}{1/\sqrt{12}} < \frac{\sqrt{n}\,(\overline{X}_n - \mu)}{\sigma} < \frac{\sqrt{75}\,(0.55 - 0.5)}{1/\sqrt{12}}\Big] = P[-1.5 < 30(\overline{X}_n - 0.5) < 1.5] \approx 0.866,$$
since $30(\overline{X}_n - 0.5)$ is approximately distributed according to the standard normal distribution.
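The normal approximation in this example is easy to cross-check by simulation. The sketch below is my addition, not part of the notes; the trial count is arbitrary:

```python
import random

def estimate_prob(trials=20000, seed=2):
    """Monte Carlo estimate of P[0.45 < X_bar < 0.55] for the mean of
    75 i.i.d. U(0, 1) draws."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xbar = sum(rng.random() for _ in range(75)) / 75
        if 0.45 < xbar < 0.55:
            hits += 1
    return hits / trials

print(estimate_prob())  # should be close to the CLT value 0.866
```

With 20,000 trials the Monte Carlo standard error is roughly 0.002, so the agreement with 0.866 is visible to two decimal places.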
Problems
1. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables with $X_i \sim b(n, p)$ for each $i = 1, \dots, n$ ($0 < p < 1$). Obtain the probability distribution of a random variable $X$ such that $X_n \xrightarrow{L} X$.
2. Let $\{X_n\}_{n=1}^{\infty}$ be a sequence of random variables with mean $\mu < \infty$ and variance $a/n^p$, where $a \in \mathbb{R}$ and $p > 0$. Show that $X_n \xrightarrow{P} \mu$.
3. Let $\overline{X}$ be the mean of a random sample of size 128 from a Gamma distribution with $\alpha = 2$ and $\beta = 4$. Approximate $P[7 < \overline{X} < 9]$.
4. Let $f(x) = 1/x^2$ for $1 < x < \infty$ and $f(x) = 0$ for $x \le 1$. Consider a random sample of size 72 from the probability distribution of a random variable $X$ which has $f$ as density function. Compute approximately the probability that more than 50 observations of the random variable are less than 3.
Chapter 8
Parametric point estimation
Sometimes we are interested in working with a random variable $X$ but we do not know its distribution function $F$. The distribution function $F$ describes the behavior of a phenomenon or population (whose individuals are, accordingly, the realizations of the random variable $X$). This lack of knowledge can take two forms: either we are completely ignorant of the form of $F$, or we know the functional form of $F$ but not a set of parameters upon which $F$ depends. The problem of point estimation is of the second type. For instance, we may know that a certain population has a normal distribution $N(\mu, \sigma^2)$ but not know one of the parameters, say $\sigma^2$. Then, after drawing a random sample $\{X_1, X_2, \dots, X_n\}$ from the distribution $N(\mu, \sigma^2)$, the problem of point estimation consists of choosing a number $T(X_1, X_2, \dots, X_n)$ that depends only on the sample and best estimates the unknown parameter $\sigma^2$. If both parameters $\mu$ and $\sigma^2$ are unknown, then we need to seek a pair
$$T(X_1, X_2, \dots, X_n) = \big(T_1(X_1, X_2, \dots, X_n),\, T_2(X_1, X_2, \dots, X_n)\big) \in \mathbb{R}^2$$
such that $T_1$ estimates $\mu$ and $T_2$ estimates $\sigma^2$.
We formalize our problem by considering that the random variable $X$ has a distribution function $F_\theta$ and a density function $f_\theta$ which depend on some unknown parameter $\theta = (\theta_1, \dots, \theta_k) \in \mathbb{R}^k$. Let $\Theta \subseteq \mathbb{R}^k$ denote the set of possible values for the parameter $\theta$ and let $\mathcal{X}$ denote the set of possible random samples of size $n$. Thus, we are indeed considering a family $\{F_\theta : \theta \in \Theta\}$ of distribution functions parameterized by $\theta$. A point estimator (or statistic) for $\theta$ is any function $T : \mathcal{X} \to \Theta$.
Next, we introduce certain desirable properties of estimators. The criteria that we discuss are consistency, sufficiency, unbiasedness, and efficiency.
8.1. Consistent estimation
Definition 42. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from $F_\theta$. A point estimator $T(X_1, X_2, \dots, X_n)$ is consistent for $\theta$ if
$$T(X_1, X_2, \dots, X_n) \xrightarrow{P} \theta.$$
Intuitively, consistency requires that it be likely that the estimator approaches the true value of the parameter as the size of the random sample increases. In other words, if an estimator $T$ is consistent for a parameter $\theta$, we may interpret it as $T$ being close to $\theta$ on average as $n$ increases.
Example 39. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a binomial distribution $b(1, p)$. Then, $E[X_k] = p$ for each $k = 1, 2, \dots, n$. From the WLLN, we know that
$$\overline{X}_n = T(X_1, X_2, \dots, X_n) = \sum_{i=1}^n X_i \Big/ n \xrightarrow{P} p,$$
so that the sample mean is consistent to estimate $p$. Now it can be easily checked that
$$\frac{\sum_{i=1}^n X_i + 1}{n + 2} = \frac{\sum_{i=1}^n X_i}{n} \cdot \frac{n}{n+2} + \frac{1}{n+2} \xrightarrow{P} p.$$
Thus, a consistent estimator for a certain parameter need not be unique. Finally, as shown earlier,
$$S_n^2 = \frac{n}{n-1}\Big(\frac{1}{n}\sum_{i=1}^n X_i^2 - \overline{X}_n^2\Big) \xrightarrow{P} \operatorname{Var}[X_k]$$
for each $k = 1, 2, \dots, n$, so that the sample variance is consistent to estimate the variance of the distribution. It can be easily checked that $S_n^2$ is not the unique consistent estimator for the variance of the population.
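The two consistent estimators of $p$ above can be compared on simulated data. A small sketch of my own (the value $p = 0.3$ is arbitrary); both columns should settle on $p$ as $n$ grows:

```python
import random

def estimates(p, n, seed=3):
    """Draw one b(1, p) sample of size n and return the two consistent
    estimators discussed above: S/n and (S + 1)/(n + 2), with S = sum X_i."""
    rng = random.Random(seed)
    s = sum(1 if rng.random() < p else 0 for _ in range(n))
    return s / n, (s + 1) / (n + 2)

for n in (10, 1000, 100000):
    print(n, estimates(0.3, n))  # both estimators approach p = 0.3
```

For small $n$ the two estimators differ noticeably; the difference between them is $O(1/n)$, which is why both share the same probability limit.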
8.2. Sufficient estimation
The desiderata associated with the sufficiency criterion can be summarized by saying that, when proposing an estimator, we wish that the only information obtained about the unknown parameter $\theta$ be that provided by the sample itself. Thus, we find it desirable to rule out possible relations between the proposed estimator and the parameter. Under this criterion, we seek estimators that make full use of the information contained in the sample.
Definition 43. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from $F_\theta$. A point estimator $T(X_1, X_2, \dots, X_n)$ is sufficient for $\theta$ if the conditional density of $(X_1, X_2, \dots, X_n)$, given $T(X_1, X_2, \dots, X_n) = t$, does not depend on $\theta$ (except perhaps on a set $A$ of zero measure, $P_\theta[X \in A] = 0$).
Example 40. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a binomial distribution $b(1, p)$ and consider the estimator $T(X_1, X_2, \dots, X_n) = \sum_{i=1}^n X_i$. Then, by considering $t = \sum_{i=1}^n x_i$,
$$f_\theta(x_1, \dots, x_n \mid T = t) = \frac{P\big[X_1 = x_1, \dots, X_n = x_n,\ \sum_{i=1}^n X_i = \sum_{i=1}^n x_i\big]}{P\big[\sum_{i=1}^n X_i = t\big]} = \frac{p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}}{\binom{n}{t} p^t (1-p)^{n-t}} = \frac{1}{\binom{n}{t}},$$
which does not depend on $p$. So, $\sum_{i=1}^n X_i$ is sufficient to estimate $p$.
Often it turns out to be difficult to use the definition of sufficiency to check whether an estimator is sufficient or not. The following result is then helpful in many applications.
Theorem 39 (Fisher-Neyman Factorization Criterion). Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from $F_\theta$ and let $f_\theta$ denote the joint density function of $(X_1, X_2, \dots, X_n)$. Then, an estimator $T(X_1, X_2, \dots, X_n)$ is sufficient for a parameter $\theta$ if and only if $f_\theta(x_1, \dots, x_n)$ can be factorized as follows:
$$f_\theta(x_1, \dots, x_n) = h(x_1, \dots, x_n)\, g_\theta\big(T(x_1, \dots, x_n)\big),$$
where $h$ is a nonnegative function of $x_1, \dots, x_n$ only and does not depend on $\theta$, and $g_\theta$ is a nonnegative nonconstant function of $\theta$ and $T(x_1, \dots, x_n)$ only.
Example 41. As in the previous example, let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a binomial distribution $b(1, p)$ and consider the estimator $T(X_1, X_2, \dots, X_n) = \sum_{i=1}^n X_i$. Then, we can write
$$f_p(x_1, \dots, x_n) = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i} = 1 \cdot (1-p)^n \Big(\frac{p}{1-p}\Big)^{\sum_{i=1}^n x_i},$$
so that, by taking $h(x_1, \dots, x_n) = 1$ and $g_p\big(\sum_{i=1}^n x_i\big) = (1-p)^n \big(p/(1-p)\big)^{\sum_{i=1}^n x_i}$, we obtain that $\sum_{i=1}^n X_i$ is sufficient to estimate $p$.
Example 42. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a normal distribution $N(\mu, \sigma^2)$ and suppose that we are interested in estimating both $\mu$ and $\sigma^2$. Then, we can write
$$f_{(\mu,\sigma^2)}(x_1, \dots, x_n) = \frac{1}{\big(\sigma\sqrt{2\pi}\big)^n} \exp\Big(-\frac{\sum_{i=1}^n (x_i - \mu)^2}{2\sigma^2}\Big) = \frac{1}{\big(\sigma\sqrt{2\pi}\big)^n} \exp\Big(\frac{\mu \sum_{i=1}^n x_i}{\sigma^2} - \frac{\sum_{i=1}^n x_i^2}{2\sigma^2} - \frac{n\mu^2}{2\sigma^2}\Big).$$
Then, using the factorization theorem above, it follows that
$$T(X_1, X_2, \dots, X_n) = \Big(\sum_{i=1}^n X_i,\ \sum_{i=1}^n X_i^2\Big)$$
is a sufficient estimator for $(\mu, \sigma^2)$.
8.3. Unbiased estimation
Definition 44. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from $F_\theta$. A point estimator $T(X_1, X_2, \dots, X_n)$ is unbiased for $\theta$ if
$$E_\theta\big[T(X_1, X_2, \dots, X_n)\big] = \theta.$$
We now show that the sample mean and the sample variance are unbiased estimators for the population mean and the population variance, respectively. Consider a random variable $X$ with $E[X] = \mu$, $\operatorname{Var}[X] = \sigma^2$, and distribution function $F_{(\mu,\sigma^2)}$. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from $F_{(\mu,\sigma^2)}$. First, we easily obtain
$$E\big[\overline{X}_n\big] = \frac{1}{n}\sum_{i=1}^n E[X_i] = \mu.$$
Second, we have
$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^n \big(X_i - \overline{X}_n\big)^2 = \frac{1}{n-1}\Big(\sum_{i=1}^n X_i^2 - n\overline{X}_n^2\Big),$$
so that
$$E[S_n^2] = \frac{1}{n-1}\Big(\sum_{i=1}^n E[X_i^2] - nE\big[\overline{X}_n^2\big]\Big).$$
Then, since $E[Z^2] = \big(E[Z]\big)^2 + \operatorname{Var}[Z]$ for any random variable $Z$, we obtain
$$E[S_n^2] = \frac{1}{n-1}\Big(n\mu^2 + n\sigma^2 - n\Big(\big(E[\overline{X}_n]\big)^2 + \operatorname{Var}[\overline{X}_n]\Big)\Big) = \frac{1}{n-1}\Big(n\mu^2 + n\sigma^2 - n\Big(\mu^2 + \frac{\sigma^2}{n}\Big)\Big) = \sigma^2.$$
Hence, the sample variance is unbiased to estimate the population variance. On the other hand, notice that the estimator $\sum_{i=1}^n \big(X_i - \overline{X}_n\big)^2 \big/ n$ is biased to estimate $\sigma^2$.
8.4. Maximum likelihood estimation
In the previous sections we have introduced several desirable properties that can be used to search for appropriate estimators. Here, we introduce another method, which has a constructive approach. The basic tool of this method is the likelihood function of a random sample, which is nothing but its joint density function. To follow the usual notation, given a random sample $\{X_1, X_2, \dots, X_n\}$ from a distribution $F_\theta$, we rename its joint density function as
$$L(x_1, \dots, x_n; \theta) \equiv f_\theta(x_1, \dots, x_n) = \prod_{i=1}^n f_\theta(x_i).$$
Furthermore, for tractability reasons, in many applications it is convenient to work with the log transformation of the likelihood function:
$$\ell(\theta) := \ln L(x_1, \dots, x_n; \theta) = \sum_{i=1}^n \ln f_\theta(x_i).$$
At this point, we need to make some assumptions on our working benchmark.
Assumption 1 (Regularity Conditions). (i) For $\theta, \theta' \in \Theta$, we have $\theta \ne \theta' \Rightarrow f_\theta \ne f_{\theta'}$; (ii) the support of $f_\theta$ does not depend on $\theta$ for each $\theta \in \Theta$.
Now, suppose that the actual value of the unknown parameter is $\theta_0$. The following result gives us theoretical reasons for being interested in obtaining the maximum of the function $\ell(\theta)$. It tells us that the maximum of $\ell(\theta)$ asymptotically separates the true model at $\theta_0$ from any other model $\theta \ne \theta_0$.
Theorem 40. Given Assumption 1, if $\theta_0$ is the true value of the unknown parameter $\theta$, then
$$\lim_{n\to\infty} P_{\theta_0}\big[L(X_1, \dots, X_n; \theta_0) \ge L(X_1, \dots, X_n; \theta)\big] = 1 \quad \text{for each } \theta \in \Theta.$$
Proof. Notice that, by taking logs, the inequality $L(X_1, \dots, X_n; \theta_0) \ge L(X_1, \dots, X_n; \theta)$ can be rewritten as
$$\sum_{i=1}^n \ln f_\theta(X_i) \le \sum_{i=1}^n \ln f_{\theta_0}(X_i) \iff Y_n := \frac{1}{n}\sum_{i=1}^n \ln\Big(\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)}\Big) \le 0.$$
From the WLLN it follows that
$$\frac{1}{n}\sum_{i=1}^n \ln\Big(\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)}\Big) \xrightarrow{P} E_{\theta_0}\Big[\ln\Big(\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\Big)\Big].$$
Now, using the fact that $\ln(s)$ is a strictly concave function in $s$, we can use Jensen's inequality to obtain
$$E_{\theta_0}\Big[\ln\Big(\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\Big)\Big] < \ln\Big(E_{\theta_0}\Big[\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\Big]\Big).$$
However, notice that
$$E_{\theta_0}\Big[\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\Big] = \int_{-\infty}^{+\infty} \frac{f_\theta(x_1)}{f_{\theta_0}(x_1)}\, dF_{\theta_0}(x_1) = \int_{-\infty}^{+\infty} \frac{f_\theta(x_1)}{f_{\theta_0}(x_1)}\, f_{\theta_0}(x_1)\, dx_1 = 1.$$
Since $\ln(1) = 0$, we have obtained that
$$Y_n = \frac{1}{n}\sum_{i=1}^n \ln\Big(\frac{f_\theta(X_i)}{f_{\theta_0}(X_i)}\Big) \xrightarrow{P} Z < 0,$$
where $Z := E_{\theta_0}\big[\ln\big(f_\theta(X_1)/f_{\theta_0}(X_1)\big)\big]$. Therefore, for any $\varepsilon > 0$, from the definition of convergence in probability, we know that
$$\lim_{n\to\infty} P_{\theta_0}\big[Z - \varepsilon \le Y_n \le Z + \varepsilon\big] = 1.$$
Since $Z < 0$, by choosing $\varepsilon > 0$ small enough so as to have $Z + \varepsilon \le 0$, the equality above implies (considering only one of the inequalities within the probability operator)
$$\lim_{n\to\infty} P_{\theta_0}\big[Y_n \le 0\big] = 1,$$
as desired.
Therefore, asymptotically the likelihood function is maximized at the true value $\theta_0$.
Definition 45. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from $F_\theta$ and let $(x_1, x_2, \dots, x_n)$ be a realization of that sample. The value $T(x_1, x_2, \dots, x_n) = \hat{\theta}$ is a maximum likelihood estimate for $\theta$ if
$$\ell(\hat{\theta}) \ge \ell(\theta') \quad \text{for each } \theta' \in \Theta.$$
Example 43. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a binomial distribution $b(1, p)$. Then,
$$f_p(x_1, \dots, x_n) = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}$$
and, consequently,
$$\ell(p) = \Big(\sum_{i=1}^n x_i\Big) \ln p + \Big(n - \sum_{i=1}^n x_i\Big) \ln(1-p).$$
Then,
$$\frac{d\ell(p)}{dp} = 0 \iff (1-p)\sum_{i=1}^n x_i = p\Big(n - \sum_{i=1}^n x_i\Big) \iff p = \sum_{i=1}^n x_i \Big/ n.$$
Thus, the sample mean is the maximum likelihood estimator of $p$.
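The closed-form answer can be cross-checked by maximizing $\ell(p)$ numerically over a grid. A sketch of my own (the sample realization below is hypothetical, with mean 0.6):

```python
import math

def log_likelihood(p, xs):
    """ell(p) = (sum x_i) ln p + (n - sum x_i) ln(1 - p) for a b(1, p) sample."""
    s, n = sum(xs), len(xs)
    return s * math.log(p) + (n - s) * math.log(1 - p)

xs = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]        # hypothetical realization, mean 0.6
grid = [i / 1000 for i in range(1, 1000)]  # interior grid avoids log(0)
p_hat = max(grid, key=lambda p: log_likelihood(p, xs))
print(p_hat)  # the grid maximizer coincides with the sample mean
```

Because $\ell(p)$ is strictly concave with its stationary point at the sample mean, the grid search lands exactly on that value whenever the mean is a grid point.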
Example 44. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a uniform distribution $U[0, \theta]$. Since the parameter $\theta$ is in the support of the distribution, differentiation is not helpful here. Notice instead that the corresponding likelihood function can be written as
$$L(x_1, \dots, x_n; \theta) = \frac{1}{\theta^n}\, \chi\big(\max\{x_i : i = 1, \dots, n\},\, \theta\big),$$
where $\chi(a, b) = 1$ if $a \le b$ and $\chi(a, b) = 0$ if $a > b$. So, $L(x_1, \dots, x_n; \theta)$ is decreasing in $\theta$ for $\theta \ge \max\{x_i : i = 1, \dots, n\}$ and equals zero for $\theta < \max\{x_i : i = 1, \dots, n\}$. Furthermore, notice that, despite $L$ being decreasing in $\theta$ for $\theta \ge \max\{x_i : i = 1, \dots, n\}$, its maximum is attained at $\hat{\theta} = \max\{x_i : i = 1, \dots, n\}$ since for $\theta' < \max\{x_i : i = 1, \dots, n\}$ one obtains $L(x_1, \dots, x_n; \theta') = 0$.
Example 45. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a normal distribution $N(0, \sigma^2)$. The likelihood function is obtained as
$$L(x_1, \dots, x_n; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big(-\frac{\sum_{i=1}^n x_i^2}{2\sigma^2}\Big)$$
so that
$$\ell(\sigma^2) = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln \sigma^2 - \frac{\sum_{i=1}^n x_i^2}{2\sigma^2}.$$
Therefore,
$$\frac{d\ell(\sigma^2)}{d\sigma^2} = -\frac{n}{2\sigma^2} + \frac{\sum_{i=1}^n x_i^2}{2\sigma^4} = 0 \iff \hat{\sigma}^2 = \frac{\sum_{i=1}^n x_i^2}{n}.$$
8.5. Rao-Cramér lower bound and efficient estimation
Here we introduce an important inequality which provides us with a lower bound for the variance of any unbiased estimator. First, we need to restrict further our benchmark by imposing a few requirements additional to those given by Assumption 1.
Assumption 2 (Additional Regularity Conditions). (i) The point $\theta_0$ is an interior point of $\Theta$; (ii) $f_\theta$ is twice differentiable with respect to $\theta$; (iii) the integral $\int f_\theta(x_i)\,dx_i$ can be differentiated twice (under the integral sign) with respect to $\theta$.
Theorem 41 (Rao-Cramér Lower Bound). Under Assumptions 1 and 2, if $\{X_1, X_2, \dots, X_n\}$ is a random sample from $F_\theta$ and $T(X_1, X_2, \dots, X_n)$ is a point estimator for $\theta$ with mean $E\big[T(X_1, X_2, \dots, X_n)\big] = \psi(\theta)$, then
$$\operatorname{Var}\big[T(X_1, X_2, \dots, X_n)\big] \ge \frac{\big(\psi'(\theta)\big)^2}{nI(\theta)},$$
where
$$nI(\theta) := E_\theta\bigg[\Big(\frac{\partial \ln f_\theta(x_1, \dots, x_n)}{\partial \theta}\Big)^2\bigg]$$
is a quantity called the Fisher information of the random sample.
Note that if $T(X_1, X_2, \dots, X_n)$ is an unbiased estimator of $\theta$, then the Rao-Cramér inequality becomes
$$\operatorname{Var}\big[T(X_1, X_2, \dots, X_n)\big] \ge \frac{1}{nI(\theta)}.$$
The Rao-Cramér lower bound gives us another criterion for choosing appropriate estimators.
Definition 46. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from $F_\theta$. A point estimator $T(X_1, X_2, \dots, X_n)$ is efficient for $\theta$ if its variance attains the Rao-Cramér lower bound.
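As an illustration of the bound (this worked case is my addition, not part of the notes), for a random sample from $b(1, p)$ one can check that the sample mean is efficient:

```latex
% Score of one observation from b(1,p): f_p(x) = p^x (1-p)^{1-x}, x \in \{0,1\}.
\frac{\partial \ln f_p(x)}{\partial p} = \frac{x}{p} - \frac{1-x}{1-p},
\qquad
I(p) = E_p\!\left[\Big(\frac{X}{p} - \frac{1-X}{1-p}\Big)^{\!2}\right]
     = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}.
% Hence the Rao-Cramer bound for unbiased estimators of p is
\frac{1}{nI(p)} = \frac{p(1-p)}{n} = \operatorname{Var}\big[\overline{X}_n\big],
% so the (unbiased) sample mean attains the bound and is efficient for p.
```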
Problems
1. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from $F_\theta$ and let $T(X_1, X_2, \dots, X_n)$ be a point estimator of $\theta$. Show that if $T$ is unbiased for $\theta$ and $\lim_{n\to\infty} \operatorname{Var}[T] = 0$, then $T$ is consistent for $\theta$.
2. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a distribution with density function
$$f_\theta(x) = \theta x^{\theta - 1}, \quad \text{for } 0 < x < 1,$$
where $\theta > 0$. Argue whether the product $X_1 X_2 \cdots X_n$ is a sufficient estimator for $\theta$ or not.
3. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a Poisson distribution with mean $r$. Propose a maximum likelihood estimator for $r$.
4. Let $X$ and $Y$ be two random variables such that $E[Y] = \mu$ and $\operatorname{Var}[Y] = \sigma^2$. Let $T(x) = E[Y \mid X = x]$. Show that $E[T(X)] = \mu$ and $\operatorname{Var}[T(X)] \le \sigma^2$.
5. What is a sufficient estimator for $\theta$ if the random sample is drawn from a beta distribution with $\alpha = \beta = \theta > 0$?
6. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a distribution with density function
$$f_\theta(x) = \frac{e^{-(x-\theta)}}{\big[1 + e^{-(x-\theta)}\big]^2}, \quad \text{for } -\infty < x < +\infty,$$
where $\theta \in \mathbb{R}$. Show that there exists a unique maximum likelihood estimator for $\theta$.
7. Let $X_1$ and $X_2$ constitute a random sample from a Poisson distribution with mean $r$. Show that $X_1 + X_2$ is a sufficient estimator for $r$ and that $X_1 + 2X_2$ is not a sufficient estimator for $r$.
Chapter 9
Hypothesis testing
9.1. Neyman-Pearson theory of hypothesis testing
In the previous chapter we analyzed the problem of using sample information to estimate unknown parameters of a probability distribution. In this chapter we follow a slightly different approach: we use sample information to test hypotheses about the unknown parameters. The treatment of this problem is as follows. We have a distribution function $F_\theta$ that depends on some unknown parameter (or vector of parameters) $\theta$ and our objective is to use a random sample $\{X_1, X_2, \dots, X_n\}$ from this distribution to test hypotheses about the value of $\theta$. As in the previous chapter, we assume that the functional form of $F_\theta$, except for the parameter $\theta$ itself, is known. Suppose that we think, from preliminary information, that $\theta \in \Theta_0$ where $\Theta_0 \subset \Theta$. This assertion is usually known as the null hypothesis, $H_0 : \theta \in \Theta_0$, while the statement $H_1 : \theta \in \Theta_1 := \Theta \setminus \Theta_0$ is known as the alternative hypothesis. We write
$$H_0 : \theta \in \Theta_0; \qquad H_1 : \theta \in \Theta_1.$$
There are two types of hypotheses: if $\Theta_0$ (resp., $\Theta_1$) contains only one point, the hypothesis is simple; otherwise the hypothesis is composite. Note that if a hypothesis is simple, then the distribution function $F_\theta$ becomes completely specified under that hypothesis. For example, consider a random variable $X \sim N(\mu, \sigma^2)$. Then, we might propose the test
$$H_0 : \mu \le 1,\ \sigma^2 > 2; \qquad H_1 : \mu > 1,\ \sigma^2 \le 2,$$
where both the null and the alternative hypotheses are composite. Here, under either of those hypotheses, the distribution of $X$ remains not fully specified.
The procedure that we follow to test hypotheses is as follows. Given the sample space $\mathcal{X}$, we search for a decision rule that allows us, for each realization $(x_1, \dots, x_n)$ of the random sample, to either accept (roughly speaking) or reject the null hypothesis. More specifically, for $\Theta \subseteq \mathbb{R}^k$, we consider a statistic $T : \mathcal{X} \to \Theta$ and partition the sample space of that statistic into two sets $C \subseteq \mathbb{R}^k$ and $C^c := \mathbb{R}^k \setminus C$. Now, if $T(x_1, \dots, x_n) \in C$, then we reject $H_0$, while if $T(x_1, \dots, x_n) \in C^c$, then we fail to reject $H_0$. When $T(x_1, \dots, x_n) \in C^c$ and, consequently, we fail to reject $H_0$, we shall write from here onwards "accept $H_0$". However, we emphasize that this does not necessarily mean that $H_0$ can be granted our stamp of approval. It rather means that the sample does not provide us with sufficient evidence against $H_0$.
Alternatively, we can partition the space of the random sample itself (instead of the set of possible values taken by the statistic) into $A \subseteq \mathbb{R}^n$ and $A^c := \mathbb{R}^n \setminus A$. Then, we can use the same reasoning as before; that is, if $(x_1, \dots, x_n) \in A$, then we reject $H_0$, and we accept it otherwise.
The set $C$ (resp., $A$) such that if $T(x_1, \dots, x_n) \in C$ (resp., $(x_1, \dots, x_n) \in A$), then $H_0$ is rejected (with probability 1) is called the critical region of the test. There are four possibilities that can arise when one uses this procedure:
(1) $H_0$ is accepted when it is correct,
(2) $H_0$ is rejected when it is correct,
(3) $H_0$ is accepted when it is incorrect (and, thus, $H_1$ is correct),
(4) $H_0$ is rejected when it is incorrect (and, thus, $H_1$ is correct).
Possibilities (2) and (3) above are known, respectively, as type I and type II errors. We proceed to the basic theory underlying hypothesis testing.
Definition 47. A Borel-measurable function $\varphi : \mathbb{R}^n \to [0,1]$ is a test function. Further, a test function $\varphi$ is a test of hypothesis $H_0 : \theta \in \Theta_0$ against the alternative $H_1 : \theta \in \Theta_1$, with error probability (or significance level) $\alpha$, if
$$E_\theta\big[\varphi(X_1, \dots, X_n)\big] \le \alpha \quad \text{for each } \theta \in \Theta_0.$$
The function (as a function of $\theta$) $E_\theta[\varphi(X_1, \dots, X_n)]$ is known as the power function of the test $\varphi$ and the least upper bound $\sup_{\theta \in \Theta_0} E_\theta[\varphi(X_1, \dots, X_n)]$ is known as the size of the test $\varphi$.
The interpretation of the concepts above is as follows. A test $\varphi$ allows us to assign to each sample realization $(x_1, \dots, x_n) \in \mathbb{R}^n$ a number $\varphi(x_1, \dots, x_n) \in [0,1]$, which is to be interpreted as the probability of rejecting $H_0$. Thus, the inequality $E_\theta[\varphi(X_1, \dots, X_n)] \le \alpha$ for $\theta \in \Theta_0$ says that if $H_0$ were true, then the test rejects it with probability
$$E_\theta\big[\varphi(X_1, \dots, X_n)\big] = P[\text{reject } H_0 \mid H_0 \text{ is true}] = P\big[T(X_1, \dots, X_n) \in C \mid H_0\big] = P\big[(X_1, \dots, X_n) \in A \mid H_0\big] \le \alpha.$$
In other words, the definition of test requires that the probability of the type I error not exceed $\alpha$.
There is an intuitive class of tests, used often in applications, called nonrandomized tests, such that $\varphi(x_1, \dots, x_n) = 1$ if $(x_1, \dots, x_n) \in A$ and $\varphi(x_1, \dots, x_n) = 0$ if $(x_1, \dots, x_n) \notin A$ for some set $A \subseteq \mathbb{R}^n$ (i.e., $\varphi$ is the indicator function $I_A$ for a subset $A$ of sample realizations). In the sequel, we will make use of this class of tests.
Given an error probability equal to $\alpha$, let us use $(\alpha, \Theta_0, \Theta_1)$ as short notation for our hypothesis testing problem. Also, let $\Phi_\alpha$ be the set of all tests for the problem $(\alpha, \Theta_0, \Theta_1)$.
Definition 48. Given a random sample $\{X_1, X_2, \dots, X_n\}$ from $F_\theta$, a test $\varphi^* \in \Phi_\alpha$ is a most powerful test against an alternative $\theta_1 \in \Theta_1$ if
$$E_{\theta_1}\big[\varphi^*(X_1, \dots, X_n)\big] \ge E_{\theta_1}\big[\varphi(X_1, \dots, X_n)\big] \quad \text{for each } \varphi \in \Phi_\alpha.$$
If a test $\varphi^*$ is a most powerful test against each alternative $\theta_1 \in \Theta_1$, then $\varphi^*$ is a uniformly most powerful test.
To obtain an intuitive interpretation, suppose that both hypotheses are simple, so that $(\{\theta_0\}, \{\theta_1\}, \alpha)$ is our hypothesis testing problem. Then, note first that
$$E_{\theta_1}\big[\varphi(X_1, \dots, X_n)\big] = P[\text{reject } H_0 \mid H_1 \text{ is true}] = P\big[T(X_1, \dots, X_n) \in C \mid H_1\big] = P\big[(X_1, \dots, X_n) \in A \mid H_1\big] = 1 - P[\text{accept } H_0 \mid H_1 \text{ is true}].$$
Note that the expected value $E_{\theta_1}[\varphi(X_1, \dots, X_n)]$ is the power of the test evaluated at the alternative hypothesis. Then, when we seek a most powerful test, we are indeed trying to solve the problem
$$\min_{A \subseteq \mathbb{R}^n} P_{\theta_1}\big[(X_1, \dots, X_n) \in A^c\big] \quad \text{s.t.:} \quad P_{\theta_0}\big[(X_1, \dots, X_n) \in A\big] \le \alpha.$$
In other words, the method of a most powerful test leads us to minimize the probability of type II error subject to the restriction that the probability of type I error not exceed $\alpha$, as imposed by the definition of the test. This method then gives us the practical procedure to follow in choosing the critical region for testing a hypothesis: choose the critical region (and, therefore, the test) in such a way that, for a given size (or probability of type I error), the power of the test is maximized (or, equivalently, the probability of type II error is minimized).
Note that, for a general hypothesis testing problem $(\Theta_0, \Theta_1, \alpha)$, finding a uniformly most powerful test is equivalent to proposing a critical region $A \subseteq \mathbb{R}^n$ that, for each $\theta_1 \in \Theta_1$, minimizes the probability $P_{\theta_1}\big[(X_1, \dots, X_n) \in A^c\big]$ under the restriction
$$\sup_{\theta_0 \in \Theta_0} P_{\theta_0}\big[(X_1, \dots, X_n) \in A\big] \le \alpha.$$
Example 46. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a normal distribution $N(\mu, 1)$. We know that $\Theta = \{\mu_0, \mu_1\}$, $\mu_0 < \mu_1$. Consider the test
$$H_0 : \mu = \mu_0; \qquad H_1 : \mu = \mu_1,$$
so that both $H_0$ and $H_1$ are simple hypotheses. We choose the sample mean $\overline{X}_n$ as statistic so that, intuitively, one would accept $H_0$ if $\overline{X}_n$ is closer to $\mu_0$ than to $\mu_1$. That is, one would reject $H_0$ if $\overline{X}_n > c$, for some constant $c$, and would otherwise accept $H_0$. Then, for $0 < \alpha < 1$, we have
$$\alpha = P[\text{reject } H_0 \mid H_0 \text{ is true}] = P[\overline{X}_n > c \mid \mu = \mu_0] = P\Big[\frac{\overline{X}_n - \mu_0}{1/\sqrt{n}} > \frac{c - \mu_0}{1/\sqrt{n}}\Big] = 1 - F_Z\Big(\frac{c - \mu_0}{1/\sqrt{n}}\Big),$$
where $Z \sim N(0, 1)$. Therefore, the value $c$ must solve the equation
$$F_Z\Big(\frac{c - \mu_0}{1/\sqrt{n}}\Big) = 1 - \alpha,$$
so that one obtains
$$c = \mu_0 + \frac{z_{(1-\alpha)}}{\sqrt{n}},$$
where $z_{(1-\alpha)}$ denotes the realization $z$ of the random variable $Z$ such that $P[Z \le z] = 1 - \alpha$, i.e., the quantile of order $(1-\alpha)$ of the distribution of $Z$. Therefore, the corresponding nonrandomized test is specified as
$$\varphi(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } \sum_{i=1}^n x_i \big/ n > \mu_0 + z_{(1-\alpha)} \big/ \sqrt{n};\\ 0 & \text{otherwise.} \end{cases}$$
Finally, the power of the test at $\mu_1$ is
$$E[\varphi(x_1, \dots, x_n) \mid \mu = \mu_1] = P\Big[\overline{X}_n > \mu_0 + \frac{z_{(1-\alpha)}}{\sqrt{n}} \;\Big|\; \mu = \mu_1\Big] = P\Big[\frac{\overline{X}_n - \mu_1}{1/\sqrt{n}} > (\mu_0 - \mu_1)\sqrt{n} + z_{(1-\alpha)}\Big] = 1 - F_Z\big(z_{(1-\alpha)} - (\mu_1 - \mu_0)\sqrt{n}\big).$$
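The critical value $c$ and the power at $\mu_1$ can be evaluated numerically with nothing beyond the error function. A sketch of my own (the values $\mu_0 = 0$, $\mu_1 = 0.5$, $n = 25$, $\alpha = 0.05$ are hypothetical):

```python
import math

def phi_cdf(z):
    """Standard normal CDF F_Z via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_quantile(q):
    """Quantile z_(q) of the standard normal, by bisection on phi_cdf."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if phi_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

mu0, mu1, n, alpha = 0.0, 0.5, 25, 0.05   # hypothetical numbers
z = normal_quantile(1 - alpha)            # z_(1 - alpha), about 1.645
c = mu0 + z / math.sqrt(n)                # critical value for rejecting H0
power = 1 - phi_cdf(z - (mu1 - mu0) * math.sqrt(n))
print(c, power)
```

Bisection is used for the quantile so the sketch stays dependency-free; any normal inverse-CDF routine would serve equally well.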
The result below, due to Neyman and Pearson, gives us a general method for finding a most powerful test of a simple hypothesis against a simple alternative. Following the notation used in the previous chapter, let $L(x_1, \dots, x_n; \theta)$ denote the likelihood function of the random sample $\{X_1, \dots, X_n\}$ given that the true value of the parameter is $\theta$.
Theorem 42 (Neyman-Pearson Fundamental Lemma). Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a distribution function $F_\theta$. Let $\theta_0$ and $\theta_1$ be two distinct values of $\theta$ and let $k$ be a positive number. Consider the following test of two simple hypotheses:
$$H_0 : \theta = \theta_0; \qquad H_1 : \theta = \theta_1.$$
Let $A$ and $A^c$ be a subset of the set of sample realizations and its complement, respectively, such that
$$\frac{L(x_1, \dots, x_n; \theta_0)}{L(x_1, \dots, x_n; \theta_1)} \le k \quad \text{for each } (x_1, \dots, x_n) \in A,$$
$$\frac{L(x_1, \dots, x_n; \theta_0)}{L(x_1, \dots, x_n; \theta_1)} \ge k \quad \text{for each } (x_1, \dots, x_n) \in A^c,$$
and
$$\alpha = \int \cdots \int_A L(x_1, \dots, x_n; \theta_0)\, dx_1 \cdots dx_n.$$
Then, $A$ is a critical region for a most powerful test against the alternative $\theta_1$.
The most powerful test identified in the theorem above must necessarily be specified as
$$\varphi = \begin{cases} 1 & \text{if } f_{\theta_1}(x_1, \dots, x_n) > q f_{\theta_0}(x_1, \dots, x_n);\\ \gamma(x_1, \dots, x_n) & \text{if } f_{\theta_1}(x_1, \dots, x_n) = q f_{\theta_0}(x_1, \dots, x_n);\\ 0 & \text{if } f_{\theta_1}(x_1, \dots, x_n) < q f_{\theta_0}(x_1, \dots, x_n), \end{cases}$$
for some $q \ge 0$ and $0 \le \gamma(x_1, \dots, x_n) \le 1$. When $q = \infty$, $\varphi$ is specified as
$$\varphi = \begin{cases} 1 & \text{if } f_{\theta_0}(x_1, \dots, x_n) = 0;\\ 0 & \text{if } f_{\theta_0}(x_1, \dots, x_n) > 0. \end{cases}$$
Finally, it can be shown that there is a functional form for $\gamma(x_1, \dots, x_n)$ such that $\gamma$ indeed does not depend on $(x_1, \dots, x_n)$ and the resulting $\varphi$ is as identified by the Neyman-Pearson Lemma.
Example 47. As in the previous example, consider a random sample $\{X_1, X_2, \dots, X_n\}$ from a normal distribution $N(\mu, 1)$. We know that $\Theta = \{\mu_0, \mu_1\}$, $\mu_0 < \mu_1$. Consider the test
$$H_0 : \mu = \mu_0; \qquad H_1 : \mu = \mu_1,$$
so that both $H_0$ and $H_1$ are simple hypotheses. Then,
$$L(x_1, \dots, x_n; \mu_s) = (2\pi)^{-n/2} \exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu_s)^2\Big), \quad s = 0, 1,$$
so that
$$\frac{L(x_1, \dots, x_n; \mu_0)}{L(x_1, \dots, x_n; \mu_1)} = \exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu_0)^2 + \frac{1}{2}\sum_{i=1}^n (x_i - \mu_1)^2\Big) \le k,$$
for some positive number $k$ that depends on $\alpha$. Taking logs in the expression above, we obtain
$$-\sum_{i=1}^n (x_i - \mu_0)^2 + \sum_{i=1}^n (x_i - \mu_1)^2 \le 2\ln(k).$$
From the equation above, using the fact that, for $s = 0, 1$,
$$\sum_{i=1}^n (x_i - \mu_s)^2 = \sum_{i=1}^n (x_i - \overline{x}_n)^2 + n\big(\overline{x}_n - \mu_s\big)^2 + 2(\overline{x}_n - \mu_s)\sum_{i=1}^n (x_i - \overline{x}_n),$$
where $\sum_{i=1}^n (x_i - \overline{x}_n) = n\overline{x}_n - n\overline{x}_n = 0$, we get to
$$n\Big(\big(\overline{x}_n - \mu_1\big)^2 - \big(\overline{x}_n - \mu_0\big)^2\Big) \le 2\ln(k).$$
Then, by computing the squares of the terms in brackets and by rearranging terms, we obtain
$$\overline{x}_n(\mu_0 - \mu_1) \le \frac{1}{2}\big(\mu_0^2 - \mu_1^2\big) + \frac{1}{n}\ln(k).$$
Therefore, the critical region identified by the Neyman-Pearson Lemma is
$$\overline{x}_n \ge \frac{1}{2}(\mu_0 + \mu_1) - \frac{\ln(k)}{n(\mu_1 - \mu_0)}.$$
Note that the statistic selected is the sample mean. Finally, we set
$$\frac{1}{2}(\mu_0 + \mu_1) - \frac{\ln(k)}{n(\mu_1 - \mu_0)} =: c,$$
where $c$ is nothing but the constant proposed in the previous example. We can then proceed as in that example to obtain
$$c = \mu_0 + \frac{z_{(1-\alpha)}}{\sqrt{n}}.$$
We end this section by briefly discussing the application of the Neyman-Pearson approach to testing a simple hypothesis against a composite alternative. Using the Neyman-Pearson Lemma, one can conclude that a test is a most powerful test for a simple hypothesis against a single value of the parameter as alternative. To follow this approach for a set of alternatives which is not a singleton, we should check the Neyman-Pearson criterion for each value of the parameter within the set of alternatives. Thus, we would be searching for a uniformly most powerful test. Unfortunately, it is typical that a uniformly most powerful test does not exist for all values of the parameter. In such cases, we must seek tests that are most powerful within a restricted class of tests. One such restricted class is, for instance, the class of unbiased tests.
9.2. Tests based on the likelihood ratio
We present here a classical method for testing a simple or composite hypothesis against a simple or composite alternative. This method is based on the ratio of the sample likelihood function given the null hypothesis over the likelihood function given either the alternative or the entire parameter space. This method gives us a test which is based on a sufficient statistic, if one exists. Also, this procedure often (but not necessarily) leads to a most powerful test or a uniformly most powerful test, if they exist.
Definition 49. Given a hypothesis testing problem $(\alpha, \Theta_0, \Theta_1)$, the critical region
$$A := \{(x_1, \dots, x_n) \in \mathbb{R}^n : \lambda(x_1, \dots, x_n) < k\},$$
where $k \in \mathbb{R}$ is a constant and
$$\lambda(x_1, \dots, x_n) = \frac{\sup_{\theta \in \Theta_0} L(x_1, \dots, x_n; \theta)}{\sup_{\theta \in \Theta} L(x_1, \dots, x_n; \theta)},$$
corresponds to a test called a generalized likelihood ratio test.
In addition, it can be shown that the critical region specied above gives us the same
test as the region specied using the statistic
(x
1
, . . . , x
n
) =
sup

1
L(x
1
, . . . , x
n
; )
sup

0
L(x
1
, . . . , x
n
; )
.
The idea behind this method is as follows. The numerator in the ratio is the best expla-
nation of $(X_1, \dots, X_n)$ under $H_0$, while the denominator is the best possible explanation of
$(X_1, \dots, X_n)$. Therefore, this test proposes that $H_0$ be rejected if there is a much better
explanation of $(X_1, \dots, X_n)$ than the one provided by $H_0$.
For practical matters, $0 \le \lambda \le 1$, and the value of the constant $k$ is determined using
the restriction on the size of the test
$$
\sup_{\theta \in \Theta_0} P_\theta \left[ \lambda(x_1, \dots, x_n) < k \right] = \alpha,
$$
where, accordingly, $\alpha$ is the significance level of the test.
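As an illustration (a sketch, not part of the notes), the two suprema in $\lambda$ can be computed numerically. The code below assumes a $N(\mu, 1)$ model with $\Theta = \mathbb{R}$ and $\Theta_0 = (-\infty, 0]$, i.e. $H_0\colon \mu \le 0$; the bounds, seed, and sample are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_lik(mu, x):
    # log-likelihood of a N(mu, 1) sample, additive constants dropped
    return -0.5 * np.sum((x - mu) ** 2)

def glr(x, mu0=0.0):
    # numerator: sup over Theta_0 = (-inf, mu0]; denominator: sup over Theta = R
    # (a finite lower bound of -100 stands in for -infinity)
    num = minimize_scalar(lambda m: -log_lik(m, x),
                          bounds=(-100.0, mu0), method="bounded")
    den = minimize_scalar(lambda m: -log_lik(m, x),
                          bounds=(-100.0, 100.0), method="bounded")
    return np.exp(den.fun - num.fun)  # lambda = sup_0 L / sup L, always <= 1

rng = np.random.default_rng(0)
x = rng.normal(0.8, 1.0, size=30)   # data drawn away from H0
print(glr(x))                       # well below 1: evidence against H0
```

A small $\lambda$ then leads to rejection once $k$ has been calibrated through the size restriction above.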
Theorem 43. For a hypothesis testing problem $(\Theta, \Theta_0, \Theta_1)$, the likelihood ratio test is a
function of each sufficient statistic for the parameter $\theta$.
Example 48. Let $\{X\}$ be a random sample, consisting of a single random variable, from
a binomial distribution $b(n, p)$. We seek a test of significance level $\alpha$ for
$$
H_0 \colon p \le p_0; \qquad H_1 \colon p > p_0,
$$
for some $0 < p_0 < 1$. Then, if we propose the monotone likelihood ratio test, we have
$$
\lambda(x) = \frac{\sup_{p \le p_0} \binom{n}{x} p^x (1-p)^{n-x}}{\sup_{0 \le p \le 1} \binom{n}{x} p^x (1-p)^{n-x}}
= \frac{\max_{p \le p_0} p^x (1-p)^{n-x}}{\max_{0 \le p \le 1} p^x (1-p)^{n-x}}.
$$
Now, it can be checked that the function $p^x (1-p)^{n-x}$ first increases until it achieves its
maximum at $p = x/n$ and from then on it decreases. Therefore,
$$
\max_{0 \le p \le 1} p^x (1-p)^{n-x} = \left( \frac{x}{n} \right)^x \left( 1 - \frac{x}{n} \right)^{n-x}
$$
and
$$
\max_{p \le p_0} p^x (1-p)^{n-x} =
\begin{cases}
p_0^x (1-p_0)^{n-x} & \text{if } p_0 < x/n, \\[4pt]
\left( \dfrac{x}{n} \right)^x \left( 1 - \dfrac{x}{n} \right)^{n-x} & \text{if } p_0 \ge x/n.
\end{cases}
$$
Consequently,
$$
\lambda(x) =
\begin{cases}
\dfrac{p_0^x (1-p_0)^{n-x}}{(x/n)^x \left[ 1 - (x/n) \right]^{n-x}} & \text{if } x > np_0, \\[6pt]
1 & \text{if } x \le np_0.
\end{cases}
$$
It follows that $\lambda(x) \le 1$ for $x > np_0$ and $\lambda(x) = 1$ for $x \le np_0$, so that $\lambda$ is a non-increasing
function of $x$. Therefore, $\lambda(x) < k$ if and only if $x > k^*$ for some constant $k^*$, and we should
reject $H_0 \colon p \le p_0$ when $x > k^*$.
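A quick numerical check of this monotonicity (a sketch; the values of $n$ and $p_0$ are arbitrary choices):

```python
n, p0 = 20, 0.3

def lam(x):
    # generalized likelihood ratio for H0: p <= p0 in the b(n, p) model
    if x <= n * p0:
        return 1.0
    phat = x / n  # unconstrained maximizer of p^x (1 - p)^(n - x)
    return (p0**x * (1 - p0)**(n - x)) / (phat**x * (1 - phat)**(n - x))

vals = [lam(x) for x in range(n + 1)]
print(vals[6], vals[7])  # equal to 1 up to n*p0 = 6, then decreasing
```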
Since $X$ is a discrete random variable, it may not be possible to obtain the size $\alpha$ exactly.
We have
$$
\alpha = \sup_{p \le p_0} P_p[X > k^*] = P_{p_0}[X > k^*].
$$
If such a $k^*$ does not exist, then we should choose an integer $k^*$ such that
$$
P_{p_0}[X > k^*] \le \alpha \quad \text{and} \quad P_{p_0}[X > k^* - 1] > \alpha.
$$
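For instance (a sketch assuming SciPy, with arbitrary choices of $n$, $p_0$, $\alpha$), `binom.sf(k, n, p)` computes $P[X > k]$, so the integer $k^*$ can be found by a direct scan:

```python
from scipy.stats import binom

n, p0, alpha = 20, 0.3, 0.05

# smallest integer k* with P_{p0}[X > k*] <= alpha;
# then automatically P_{p0}[X > k* - 1] > alpha
k_star = next(k for k in range(n + 1) if binom.sf(k, n, p0) <= alpha)
print(k_star)                    # k* = 9 for these values
print(binom.sf(k_star, n, p0))   # attained size, slightly below alpha
```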
Example 49. Let $\{X_1, X_2, \dots, X_n\}$ be a random sample from a normal distribution
$N(\mu, \sigma^2)$ and consider the hypothesis testing problem
$$
H_0 \colon \mu = \mu_0; \qquad H_1 \colon \mu \ne \mu_0,
$$
where $\sigma^2$ is also unknown. Here we have $\theta = (\mu, \sigma^2)$,
$$
\Theta = \left\{ (\mu, \sigma^2) \in \mathbb{R}^2 : -\infty < \mu < \infty,\ \sigma^2 > 0 \right\},
$$
and
$$
\Theta_0 = \left\{ (\mu_0, \sigma^2) \in \mathbb{R}^2 : \sigma^2 > 0 \right\}.
$$
We obtain
$$
\sup_{\theta \in \Theta_0} L(x_1, \dots, x_n; \theta) = \frac{1}{(\hat{\sigma}_0 \sqrt{2\pi})^n} \exp \left( - \frac{\sum_{i=1}^n (x_i - \mu_0)^2}{2 \hat{\sigma}_0^2} \right),
$$
where $\hat{\sigma}_0^2 = (1/n) \sum_{i=1}^n (x_i - \mu_0)^2$ is nothing but the maximum likelihood estimator for
$\sigma^2$ given that the mean of the distribution is $\mu_0$. It follows that
$$
\sup_{\theta \in \Theta_0} L(x_1, \dots, x_n; \theta) = \frac{1}{(2\pi/n)^{n/2} \left[ \sum_{i=1}^n (x_i - \mu_0)^2 \right]^{n/2}} \, e^{-n/2}.
$$
Now, it can be checked that the maximum likelihood estimator for $(\mu, \sigma^2)$ when both $\mu$
and $\sigma^2$ are unknown is
$$
(\hat{\mu}, \hat{\sigma}^2) = \left( \bar{X}_n,\ \frac{\sum_{i=1}^n (X_i - \bar{X}_n)^2}{n} \right).
$$
Then, we obtain
$$
\sup_{\theta \in \Theta} L(x_1, \dots, x_n; \theta) = \frac{1}{(2\pi/n)^{n/2} \left[ \sum_{i=1}^n (x_i - \bar{x}_n)^2 \right]^{n/2}} \, e^{-n/2}.
$$
Therefore,
$$
\lambda(x_1, \dots, x_n)
= \left[ \frac{\sum_{i=1}^n (x_i - \bar{x}_n)^2}{\sum_{i=1}^n (x_i - \mu_0)^2} \right]^{n/2}
= \left[ \frac{\sum_{i=1}^n (x_i - \bar{x}_n)^2}{\sum_{i=1}^n (x_i - \bar{x}_n)^2 + n(\bar{x}_n - \mu_0)^2} \right]^{n/2}
= \left[ \frac{1}{1 + \dfrac{n(\bar{x}_n - \mu_0)^2}{\sum_{i=1}^n (x_i - \bar{x}_n)^2}} \right]^{n/2},
$$
which happens to be a decreasing function of $(\bar{x}_n - \mu_0)^2 / \sum_{i=1}^n (x_i - \bar{x}_n)^2$. Thus,
$$
\lambda(x_1, \dots, x_n) < k
\iff \frac{n \, |\bar{x}_n - \mu_0| / \sqrt{n-1}}{\sqrt{\sum_{i=1}^n (x_i - \bar{x}_n)^2 / (n-1)}} > k'
\iff \left| \frac{\sqrt{n} (\bar{x}_n - \mu_0)}{s_n} \right| > k'',
$$
where $s_n^2 = [1/(n-1)] \sum_{i=1}^n (x_i - \bar{x}_n)^2$ is the sample variance and $k'' = k' \sqrt{(n-1)/n}$.
Also, we know that the statistic
$$
T(X_1, \dots, X_n) := \frac{\sqrt{n} (\bar{X}_n - \mu_0)}{S_n}
$$
has a $t$ distribution with $n-1$ degrees of freedom (recall Theorem 31 (iv)). So, given the
symmetry of the density function of a $t$-distributed random variable, we should make use
of the quantile $t_{n-1, \alpha/2}$ to specify $k''$.
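As a sketch (assuming SciPy; the sample, seed, and $\mu_0$ are made-up values), the resulting procedure is exactly the classical two-sided one-sample $t$-test, which we can verify against `scipy.stats.ttest_1samp`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(5.4, 2.0, size=16)          # hypothetical sample
mu0, alpha = 5.0, 0.05
n = x.size

xbar = x.mean()
s_n = x.std(ddof=1)                        # sample standard deviation
T = np.sqrt(n) * (xbar - mu0) / s_n        # the statistic T of the example
k2 = stats.t.ppf(1 - alpha / 2, df=n - 1)  # quantile t_{n-1, alpha/2}
reject = abs(T) > k2

# cross-check against SciPy's one-sample t-test
t_scipy, p_scipy = stats.ttest_1samp(x, mu0)
print(T, t_scipy, reject)
```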
Problems
1. Let $\{X_1, \dots, X_n\}$ be a random sample from a normal distribution $N(\mu, 1)$. Use the
result in the Neyman-Pearson Lemma to test the null hypothesis $H_0 \colon \mu = 0$ against the
alternative $H_1 \colon \mu = 1$. For $n = 25$ and $\alpha = 0.05$, compute the power of this test when the
alternative is true.
2. Let $\{X_1, \dots, X_n\}$ and $\{Y_1, \dots, Y_m\}$ be independent random samples from normal dis-
tributions $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, respectively. Use a monotone likelihood ratio test to
test the hypothesis $H_0 \colon \sigma_1^2 = \sigma_2^2$ against $H_1 \colon \sigma_1^2 \ne \sigma_2^2$.