Stochastic Processes, Detection and Estimation: 6.432 Course Notes
    \sum_{i=1}^{n} \Pr[A_i],        (1.6)

where equality in (1.6) holds if and only if the A_i are a collection of mutually exclusive events, i.e., if A_i \cap A_j = \emptyset for i \neq j.

Footnote 1: In fact we cannot compute the probability of every subset of Ω. Those that we can, we will term valid subsets. In formal mathematical treatments, a probability space is specified in terms of a sample space, a probability measure, and a collection of valid sets. At our level of treatment, however, you can assume that any subset we mention or construct, either explicitly or implicitly, is valid.
1.2.1 Conditional Probabilities
The conditional probability of event A given event B is defined by

    \Pr[A \mid B] = \frac{\Pr[A \cap B]}{\Pr[B]} .        (1.7)

Exploiting, in turn, the definition of Pr[B | A] in the numerator of (1.7) yields

    \Pr[A \mid B] = \frac{\Pr[B \mid A]\, \Pr[A]}{\Pr[B]} .        (1.8)

A straightforward extension of (1.8) is Bayes' Rule: let A_i, i = 1, 2, ..., n be a set of mutually exclusive events that are also exhaustive, i.e.,

    \bigcup_{i=1}^{n} A_i = \Omega ;

then for any event B,

    \Pr[A_j \mid B] = \frac{\Pr[B \mid A_j]\, \Pr[A_j]}{\sum_{i=1}^{n} \Pr[B \mid A_i]\, \Pr[A_i]} .        (1.9)
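As a concrete illustration of (1.9), the short Python sketch below computes the posterior probabilities Pr[A_j | B] from a set of priors Pr[A_i] and likelihoods Pr[B | A_i]. The specific numbers are hypothetical and chosen only for illustration.

```python
# Numerical illustration of Bayes' Rule (1.9) with made-up priors/likelihoods.
priors = [0.5, 0.3, 0.2]          # Pr[A_i]: mutually exclusive and exhaustive
likelihoods = [0.9, 0.5, 0.1]     # Pr[B | A_i]

# Total probability of B (the denominator of (1.9)).
pr_B = sum(p * l for p, l in zip(priors, likelihoods))

# Posterior probabilities Pr[A_j | B].
posteriors = [p * l / pr_B for p, l in zip(priors, likelihoods)]

print(pr_B)        # 0.62
print(posteriors)  # the posteriors sum to 1
```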
1.2.2 Independence
Two (nontrivial) events are independent if knowledge of one event's occurrence provides no information about the other event's occurrence, i.e., if

    \Pr[A \mid B] = \Pr[A] .        (1.10)

Using (1.7) we see that (1.10) is equivalent to

    \Pr[A \cap B] = \Pr[A]\, \Pr[B] .        (1.11)

More generally, a collection of events A_1, A_2, ..., A_N are said to be mutually independent if for every i we have

    \Pr[A_i \mid A_j, j \in J] = \Pr[A_i] ,        (1.12)

where J is any subset of indices between 1 and N but excluding i. The condition (1.12) is equivalent to the requirement that for every subset of distinct indices i_1, i_2, ..., i_K, drawn from {1, 2, ..., N} and corresponding to some K \le N, we have

    \Pr\left[ \bigcap_{k=1}^{K} A_{i_k} \right] = \prod_{k=1}^{K} \Pr[A_{i_k}] .        (1.13)
For three events A, B, and C to be mutually independent, for example, this means that we require that all of the following hold:

    \Pr[A \cap B \cap C] = \Pr[A]\, \Pr[B]\, \Pr[C]        (1.14)
    \Pr[A \cap B] = \Pr[A]\, \Pr[B]        (1.15)
    \Pr[A \cap C] = \Pr[A]\, \Pr[C]        (1.16)
    \Pr[B \cap C] = \Pr[B]\, \Pr[C] .        (1.17)

In particular, (1.14) alone is not sufficient; (1.15)-(1.17) are also required.
1.3 RANDOM VARIABLES
In these notes, we adopt the useful convention of using fonts without serifs for random variables, and the corresponding fonts with serifs for sample values and dummy arguments. For example, x, y, and z will denote random variables, and x, y, and z the corresponding generic sample values.

Formally, a random variable x is a real-valued function on the sample space Ω. The probability distribution function for x is defined by

    P_x(x) = \Pr[\{\omega \mid x(\omega) \le x\}] = \Pr[x \le x] ,        (1.18)

where the last expression we use for notational convenience. This distribution is a complete characterization of the random variable. Likewise, the probability density function (pdf) p_x(x), which is related to the distribution by

    p_x(x) = \frac{dP_x(x)}{dx} ,        (1.19)

is also a complete characterization (see Footnote 2). This follows from the fact that for any valid A, we can write

    \Pr[x \in A] = \int_A p_x(x)\, dx .        (1.20)

If x takes on particular values with nonzero probability, then P_x(x) will contain step-discontinuities and p_x(x) will contain impulses. For example,

    p_x(x) = \frac{1}{2}\,\delta(x + 1) + \frac{1}{2}\,\delta(x - 1)        (1.21)

is the density of a random variable taking on the values ±1, each with probability 1/2. To accommodate the possibility of p_x(x) having impulses and remain consistent with (1.18), we write the inverse of (1.19) as

    P_x(x) = \int_{-\infty}^{x+} p_x(u)\, du ,        (1.22)

Footnote 2: We'll assume in our treatment that densities always exist for the quantities of interest, at least in this generalized sense (i.e., allowing impulses). However, it is worth keeping in mind that there exist random variables whose probability distributions are not differentiable even in a generalized sense.
using x+ in the upper limit of (1.22) to indicate that the endpoint x is included in the interval. Also, since [cf. (1.19)]

    p_x(x_0) = \lim_{\Delta x \to 0} \frac{\Pr[x_0 < x \le x_0 + \Delta x]}{\Delta x} ,

we have the frequently useful approximation, valid for suitably small Δx:

    \Pr[x_0 < x \le x_0 + \Delta x] \approx p_x(x_0)\, \Delta x .        (1.23)
1.3.1 Expectations
Often we are interested in partial characterizations of a random variable in the form of certain expectations. The expected value of a function g(x) of a random variable is given by

    \overline{g(x)} \triangleq E[g(x)] = \int_{-\infty}^{+\infty} g(x)\, p_x(x)\, dx .        (1.24)

Remember that since z = g(x) is itself a random variable, we may also write (1.24) in the form

    E[g(x)] = E[z] = \int_{-\infty}^{+\infty} z\, p_z(z)\, dz ,        (1.25)

where p_z(z) is the probability density for z = g(x). If g(·) is a one-to-one and differentiable function, a simple expression for p_z(z) can be derived, viz.,

    p_z(z) = \frac{p_x(g^{-1}(z))}{|g'(g^{-1}(z))|} ,        (1.26)

where g'(·) denotes the first derivative of g(·). If g(·) is not invertible, the more general method-of-events approach for deriving densities, which we briefly review later in the multivariate case, can be employed to obtain p_z(z). In terms of the ultimate goal of evaluating E[g(x)], whether (1.24) or (1.25) turns out to be more convenient depends on the problem at hand.

Several expectations that are important partial characterizations of a random variable are the mean value (or first moment)

    E[x] = \overline{x} \triangleq m_x ,        (1.27)

the mean-squared value (or second moment)

    E[x^2] = \overline{x^2} ,        (1.28)

and the variance (or second central moment)

    E[(x - m_x)^2] = \overline{x^2} - m_x^2 \triangleq \operatorname{var} x \triangleq \sigma_x^2 \triangleq \lambda_x .        (1.29)

In (1.27)-(1.29) we have introduced a variety of notation that will be convenient to use in subsequent sections of these notes. The standard deviation is σ_x, the square root of the variance. One important bound provided by these moments is the Chebyshev inequality

    \Pr[\,|x - m_x| \ge \epsilon\,] \le \frac{\sigma_x^2}{\epsilon^2} .        (1.30)

Observe that (1.30) implies that x is a constant (i.e., Pr[x = c] = 1 for some constant c) if σ_x^2 = 0. Note that the corresponding "only if" statement follows immediately from (1.29).

The Chebyshev bound is a particularly convenient bound to use in practice because its calculation involves only the mean and variance of the random variable, i.e., it doesn't depend on the detailed form of the density. However, for this same reason the Chebyshev bound is not a particularly tight bound (see Footnote 3).
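To make the looseness of (1.30) concrete, the following sketch compares the Chebyshev bound with the empirical tail probability for a particular distribution; the exponential choice is arbitrary and used only as an example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=200_000)   # mean 1, variance 1

m, var = x.mean(), x.var()
for eps in [1.0, 2.0, 3.0]:
    empirical = np.mean(np.abs(x - m) >= eps)  # Pr[|x - m_x| >= eps]
    bound = var / eps**2                       # Chebyshev bound (1.30)
    print(f"eps={eps}: empirical={empirical:.4f}, Chebyshev bound={bound:.4f}")
```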
1.3.2 Characteristic Functions
The characteristic function of a random variable is defined as

    M_x(jv) = E[e^{jvx}] = \int_{-\infty}^{+\infty} e^{jvx}\, p_x(x)\, dx        (1.31)

and, as is apparent from the integral in (1.31), corresponds to the Fourier transform of the density (to within a minor sign change). As a Fourier transform, we can recover p_x(x) from M_x(jv) via the inverse formula

    p_x(x) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-jvx}\, M_x(jv)\, dv ,        (1.32)

and hence the characteristic function is an equivalent complete characterization of a random variable.

Characteristic functions are particularly useful in computing certain expectations involving the random variable. For example, the moments of x can all be efficiently recovered from M_x(jv) by differentiation, i.e.,

    E[x^n] = \left[ \frac{1}{j^n}\, \frac{d^n}{dv^n} M_x(jv) \right]_{v=0} .        (1.33)

Observe that (1.33) implies that the characteristic function can be expanded in terms of the power series

    M_x(jv) = \sum_{k=0}^{+\infty} E[x^k]\, \frac{(jv)^k}{k!}        (1.34)

when all the moments of the form (1.33) exist. This result implies, in turn, that knowledge of all moments is an equivalent characterization for such random variables: given these moments we can reconstruct M_x(jv) via (1.34).

Footnote 3: As an aside, an alternative bound that is typically much tighter but which requires access to more information about the random variable is the Chernoff bound.
The characteristic function is also frequently useful in deriving the density of the sum of independent random variables. In particular, if x_1, x_2, ..., x_N is a set of mutually independent random variables, then the characteristic function for their sum

    z = x_1 + x_2 + \cdots + x_N

is simply given by the product of the characteristic functions of the constituents, i.e.,

    M_z(jv) = E[e^{jvz}] = E[e^{jv(x_1 + x_2 + \cdots + x_N)}]
            = E[e^{jvx_1}]\, E[e^{jvx_2}] \cdots E[e^{jvx_N}]
            = M_{x_1}(jv)\, M_{x_2}(jv) \cdots M_{x_N}(jv) .        (1.35)

Thus, after computing M_z(jv) via (1.35), we can determine p_z(z) by the Fourier transform inverse formula (1.32). Note that inverting (1.35) directly yields, via the convolution property of Fourier transforms, the familiar result

    p_z(z) = p_{x_1}(z) * p_{x_2}(z) * \cdots * p_{x_N}(z) ,        (1.36)

where * denotes the convolution operator.
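As a numerical check of (1.36), the sketch below convolves two densities on a uniform grid and compares the result with a Monte Carlo histogram of the corresponding sum; the uniform and exponential densities are arbitrary illustrative choices.

```python
import numpy as np

dx = 0.01
t = np.arange(0, 10, dx)

# Densities of two independent random variables (arbitrary choices).
p1 = np.where(t <= 2.0, 0.5, 0.0)        # uniform on [0, 2]
p2 = np.exp(-t)                          # exponential, rate 1

# Density of the sum via discrete convolution, cf. (1.36).
pz = np.convolve(p1, p2)[: len(t)] * dx

# Monte Carlo check of the same density.
rng = np.random.default_rng(1)
z = rng.uniform(0, 2, 100_000) + rng.exponential(1.0, 100_000)
hist, edges = np.histogram(z, bins=50, range=(0, 10), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(np.interp(centers, t, pz) - hist)))  # small discrepancy
```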
1.3.3 Discrete Random Variables
Random variables that take on only integer values can be fully developed within the framework we've been describing. In particular, their probability densities consist entirely of uniformly-spaced impulses with suitable weights. However, to make manipulation of these quantities less cumbersome, it is sometimes convenient to adopt some special notation for specifically discrete random variables. In particular, we define the probability mass function (pmf) of an integer-valued random variable k as

    p_k[k] = \Pr[k = k] ,        (1.37)

using square brackets to distinguish masses from densities, and to remind us that the argument is integer-valued. The density can, of course, be derived from the mass function via

    p_k(k) = \sum_{i=-\infty}^{+\infty} p_k[i]\, \delta(k - i)        (1.38)

and related to the distribution function via

    P_k(k) = \sum_{i=-\infty}^{k} p_k[i] .        (1.39)

With this notation, expectations can be expressed in terms of sums rather than integrals, e.g.,

    E[f[k]] = \sum_{k=-\infty}^{+\infty} f[k]\, p_k[k] .

Furthermore, the characteristic function can be viewed as the discrete-time Fourier transform (again to within a minor sign change) of the mass function, i.e.,

    M_k(jv) = E[e^{jvk}] = \sum_{k=-\infty}^{+\infty} e^{jvk}\, p_k[k] .
1.4 PAIRS OF RANDOM VARIABLES
We will frequently deal with several random variables, and we will find it convenient to use vector notation in this case. Before we do that, however, let us recall some complete joint characterizations of a pair of random variables x and y. One such characterization is the joint distribution function for x and y, which is defined by

    P_{x,y}(x, y) = \Pr[x \le x \text{ and } y \le y] .        (1.40)

A second complete joint characterization is the joint density of x and y, i.e.,

    p_{x,y}(x, y) = \frac{\partial^2 P_{x,y}(x, y)}{\partial x\, \partial y} .        (1.41)

If A \subset R^2 is a valid set, where R^2 denotes the plane of all pairs (x, y) (see Footnote 4), then

    \Pr[(x, y) \in A] = \iint_A p_{x,y}(x, y)\, dx\, dy .        (1.42)

Evidently, a special case of (1.42) is the inverse to (1.41), i.e.,

    P_{x,y}(x, y) = \int_{-\infty}^{x+} \int_{-\infty}^{y+} p_{x,y}(u, v)\, du\, dv .

Again, from (1.41) we also have the following approximation, valid for suitably small Δx and Δy:

    \Pr[x_0 < x \le x_0 + \Delta x \text{ and } y_0 < y \le y_0 + \Delta y] \approx p_{x,y}(x_0, y_0)\, \Delta x\, \Delta y .        (1.43)
1.4.1 Marginal and Conditional Densities
Recall that the marginal densities of either x or y can be recovered by integrating out the other variable:

    p_x(x) = \int_{-\infty}^{+\infty} p_{x,y}(x, y)\, dy        (1.44)
    p_y(y) = \int_{-\infty}^{+\infty} p_{x,y}(x, y)\, dx .        (1.45)

Footnote 4: See Appendix 1.A for a discussion of such spaces.

Geometrically, it is useful to visualize the marginal densities p_x(x) and p_y(y) as (integrated) projections of the joint density p_{x,y}(x, y) onto the x and y axes, respectively, of the (x, y) plane.

The conditional density for x given that y = y is defined by

    p_{x|y}(x|y) = \frac{p_{x,y}(x, y)}{p_y(y)} ,        (1.46)

which, as a function of x and for a particular y = y_0, corresponds to a slice through the joint density along the y = y_0 line that is normalized to have unit integral. In particular, the denominator in (1.46), i.e., p_y(y), is precisely this normalization factor. Note too that since p_{y|x}(y|x) is defined analogously, we have, e.g.,

    p_{x|y}(x|y) = \frac{p_{y|x}(y|x)\, p_x(x)}{p_y(y)} .
1.4.2 Independence
A pair of random variables x and y are independent if knowledge of the value of one does not affect the density of the other, i.e., if

    p_{x|y}(x|y) = p_x(x) .        (1.47)

Using (1.47) with (1.46), we get the equivalent condition for independence

    p_{x,y}(x, y) = p_x(x)\, p_y(y) .        (1.48)
1.4.3 Expectations and Correlation
Expectations also provide partial characterizations for pairs of random variables. The expected value of a function of x and y is given by

    E[f(x, y)] = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} f(x, y)\, p_{x,y}(x, y)\, dx\, dy = \overline{f(x, y)} .        (1.49)

Note that (1.49) gives (1.24) as a special case:

    E[g(x)] = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} g(x)\, p_{x,y}(x, y)\, dx\, dy = \int_{-\infty}^{+\infty} g(x)\, p_x(x)\, dx .        (1.50)

On many occasions we will exploit the fact that expectation is a linear operation. For example, for arbitrary constants α and β we have

    E[\alpha x + \beta y] = \alpha E[x] + \beta E[y] .

This means that in computations we can typically interchange expectations with summations, integrations, and other linear operations.

In addition to (1.27)-(1.29) for x and their counterparts for y, some additional expectations that constitute useful partial characterizations of the statistical relationship between x and y are
Correlation:

    E[xy]        (1.51)

Covariance:

    E[(x - m_x)(y - m_y)] = E[xy] - m_x m_y \triangleq \lambda_{xy} \triangleq \operatorname{cov}(x, y) .        (1.52)

Note that var x = cov(x, x).

A pair of variables x and y are said to be uncorrelated if λ_{xy} = 0, i.e., if

    E[xy] = E[x]\, E[y] .

If random variables x and y are independent, they are also uncorrelated:

    E[xy] = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} xy\, p_{x,y}(x, y)\, dx\, dy
          = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} xy\, p_x(x)\, p_y(y)\, dx\, dy
          = E[x]\, E[y] .

However, the converse is not true: uncorrelated random variables are generally not independent. A pair of random variables is said to be orthogonal when

    E[xy] = 0 .

For reasons that will become apparent in Section 1.7, we sometimes express this condition using the notation x ⊥ y.

The correlation coefficient ρ_{xy} is a normalized measure of the correlation between two random variables x and y, and is defined by

    \rho_{xy} = \frac{\lambda_{xy}}{\sigma_x \sigma_y} .        (1.53)

Later, in Section 1.7, we will establish that -1 \le \rho_{xy} \le 1, with

    \rho_{xy} = 0    if and only if x and y are uncorrelated
    \rho_{xy} = +1   if and only if x is a positive multiple of y plus a constant
    \rho_{xy} = -1   if and only if x is a negative multiple of y plus a constant.        (1.54)
We can also define conditional expectations involving x and y. The conditional expectation of x given y = y, for example, is given by

    E[x \mid y = y] = \int_{-\infty}^{+\infty} x\, p_{x|y}(x|y)\, dx .        (1.55)
Note that E[x | y = y] is a function of y, and consequently E[x | y] can be viewed as a random variable. Its expectation is then

    E[E[x \mid y]] = \int_{-\infty}^{+\infty} E[x \mid y = y]\, p_y(y)\, dy
                   = \int_{-\infty}^{+\infty} \left[ \int_{-\infty}^{+\infty} x\, p_{x|y}(x|y)\, dx \right] p_y(y)\, dy
                   = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} x\, p_{x,y}(x, y)\, dx\, dy = E[x] .        (1.56)

The identity (1.56), which we'll use on many occasions in this course, is called the law of iterated expectation.

As a final remark, we point out that the condition

    E[x \mid y] = E[x]        (1.57)

is equivalent to neither independence nor uncorrelatedness. While it is true that (1.57) holds if x and y are independent, since

    E[x \mid y = y] = \int_{-\infty}^{+\infty} x\, p_{x|y}(x|y)\, dx = \int_{-\infty}^{+\infty} x\, p_x(x)\, dx = E[x] ,

the converse is not true: x and y are not necessarily independent if (1.57) holds. As a simple counterexample we have the joint density

    p_{x,y}(x, y) = \frac{1}{3}\,\delta(x, y - 1) + \frac{1}{3}\,\delta(x + 1, y) + \frac{1}{3}\,\delta(x - 1, y) .        (1.58)

Likewise, it is true that x and y are uncorrelated if (1.57) holds since, using iterated expectation, we have

    E[xy] = E[E[xy \mid y]] = E[y\, E[x \mid y]] = E[y\, E[x]] = E[x]\, E[y] ;

however, the converse is again not true: if x and y are uncorrelated, we cannot deduce that (1.57) holds. A simple counterexample is the density (1.58) with x and y interchanged:

    p_{x,y}(x, y) = \frac{1}{3}\,\delta(x - 1, y) + \frac{1}{3}\,\delta(x, y + 1) + \frac{1}{3}\,\delta(x, y - 1) .
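The discrete counterexample (1.58) is easy to check by direct enumeration; the sketch below confirms that E[x | y] = E[x] for every value of y even though x and y are clearly dependent.

```python
# Point masses of the density (1.58): (x, y) pairs, each with probability 1/3.
points = [(0, 1), (-1, 0), (1, 0)]
p = 1.0 / 3.0

Ex = sum(x * p for x, y in points)
print("E[x] =", Ex)                         # 0

# Conditional expectation E[x | y = y0] for each value y0 that y can take.
for y0 in sorted({y for _, y in points}):
    sel = [(x, y) for x, y in points if y == y0]
    pr_y0 = p * len(sel)
    Ex_given_y0 = sum(x * p for x, _ in sel) / pr_y0
    print(f"E[x | y={y0}] =", Ex_given_y0)  # 0 in both cases

# Yet x and y are dependent: Pr[x=0, y=1] = 1/3 but Pr[x=0] Pr[y=1] = 1/9.
```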
1.5 RANDOM VECTORS
It is generally more convenient to represent collections of two or more random variables in terms of a vector of random variables (see Footnote 5). If, for example, we let

    x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}        (1.59)

denote a vector of N random variables, then the joint distribution function can be expressed in the form

    P_x(x) = \Pr[x_1 \le x_1, x_2 \le x_2, \ldots, x_N \le x_N] .        (1.60)

Provided this distribution function is differentiable, the corresponding joint density function is

    p_x(x) = \frac{\partial^N P_x(x)}{\partial x_1\, \partial x_2 \cdots \partial x_N} .        (1.61)

Analogous to the scalar case, we have p_x(x) \ge 0 and

    \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} p_x(x)\, dx = 1 .        (1.62)

The joint density function is a complete characterization of the random variables that comprise the vector. In particular, for any valid set A \subset R^N,

    \Pr[x \in A] = \int \cdots \int_A p_x(x)\, dx_1 \cdots dx_N = \int_A p_x(x)\, dx ,        (1.63)

where the last expression is a notational convenience. Furthermore, we can reconstruct the original distribution function from the joint density via

    P_x(x) = \int_{-\infty}^{x_1+} \int_{-\infty}^{x_2+} \cdots \int_{-\infty}^{x_N+} p_x(u)\, du .        (1.64)

Rather than collecting all the random variables of interest into a single vector x, in many problems it is often more natural and more convenient to divide them among several random vectors of possibly different sizes.

In the case where we divide our random variables into two random vectors x \in R^N and y \in R^M, we can define the joint distribution

    P_{x,y}(x, y) = \Pr[x_1 \le x_1, \ldots, x_N \le x_N, y_1 \le y_1, \ldots, y_M \le y_M]        (1.65)

Footnote 5: Since we use boldface fonts for vectors, random vectors will be denoted using boldface fonts without serifs, and sample values will be denoted using boldface fonts with serifs. For example, x, y, and z will be random vectors, and x, y, and z the associated sample values.
and the joint density

    p_{x,y}(x, y) = \frac{\partial^{N+M} P_{x,y}(x, y)}{\partial x_1 \cdots \partial x_N\, \partial y_1 \cdots \partial y_M} ,        (1.66)

each of which fully characterizes the statistical relationship among all the elements of x and y.

In turn, using the joint density (1.66), we can recover marginal densities, e.g.,

    p_x(x) = \int_{-\infty}^{+\infty} p_{x,y}(x, y)\, dy .        (1.67)
1.5.1 Conditional Densities and Independence
The conditional density for x given y = y is given by

    p_{x|y}(x|y) = \frac{p_{x,y}(x, y)}{p_y(y)} ,        (1.68)

and hence we have

    p_{x,y}(x, y) = p_{x|y}(x|y)\, p_y(y) = p_{y|x}(y|x)\, p_x(x) .        (1.69)

Thus, using (1.69) we have, in turn,

    p_{x|y}(x|y) = \frac{p_{y|x}(y|x)\, p_x(x)}{p_y(y)} .        (1.70)

Two random vectors x and y are independent (meaning that the two corresponding collections of random variables are mutually independent of one another) if knowledge of any of the elements of y provides no information about any of the elements of x (or vice versa), i.e., if

    p_{x|y}(x|y) = p_x(x) .        (1.71)

Analogous to our earlier results, using (1.68) we find that (1.71) is equivalent to the condition

    p_{x,y}(x, y) = p_x(x)\, p_y(y) .        (1.72)

All of these formulas extend to more than two random vectors. For instance, a collection of K random vectors x_1, x_2, ..., x_K are mutually independent if for every i we have

    p_{x_i | \{x_j, j \in J\}}(x_i \mid x_j, j \in J) = p_{x_i}(x_i) ,        (1.73)

where J is any subset of indices between 1 and K but excluding i. The condition (1.73) is equivalent to the requirement that

    p_{x_1, x_2, \ldots, x_K}(x_1, x_2, \ldots, x_K) = \prod_{i=1}^{K} p_{x_i}(x_i) .        (1.74)
Note that by integrating out any subset of vectors in (1.74) we obtain lower-order independence relations among arbitrary subsets of the random vectors as well, i.e., if I is an arbitrary subset of distinct indices selected from 1 to K, then

    p_{\{x_i, i \in I\}}(x_i, i \in I) = \prod_{i \in I} p_{x_i}(x_i) .        (1.75)

To show that (1.74) implies (1.73) involves a straightforward application of the definition of conditional probabilities:

    p_{x_i | \{x_j, j \in J\}}(x_i \mid x_j, j \in J)
        = \frac{p_{x_i, \{x_j, j \in J\}}(x_i, x_j, j \in J)}{p_{\{x_j, j \in J\}}(x_j, j \in J)}
        = \frac{\prod_{k \in \{i\} \cup J} p_{x_k}(x_k)}{\prod_{k \in J} p_{x_k}(x_k)}
        = p_{x_i}(x_i) .

To show the converse, that (1.73) implies (1.74), requires rewriting the joint density as the product of conditionals of the form of the left-hand side of (1.73). As a special case, if x, y and z are three random vectors then their joint density can be expressed as

    p_{x,y,z}(x, y, z) = p_{x|y,z}(x|y, z)\, p_{y,z}(y, z)
                       = p_{x|y,z}(x|y, z)\, p_{y|z}(y|z)\, p_z(z) .        (1.76)

Applying (1.73) to each of the right-hand side terms of (1.76), we get that x, y and z are mutually independent if

    p_{x,y,z}(x, y, z) = p_x(x)\, p_y(y)\, p_z(z) .        (1.77)

We emphasize that from integrations of (1.77) we get that mutual independence implies pairwise independence, i.e.,

    p_{x,y}(x, y) = p_x(x)\, p_y(y)
    p_{x,z}(x, z) = p_x(x)\, p_z(z)
    p_{y,z}(y, z) = p_y(y)\, p_z(z) .

However, the converse is not true: pairwise independence alone does not ensure mutual independence.
Finally, we note that all of the results in this section can be used in the special case in which each of the random vectors has only a single element and they are, hence, scalar random variables. As an example, we have that the random variables x_1, x_2, ..., x_N are mutually independent if and only if

    p_{x_1, x_2, \ldots, x_N}(x_1, x_2, \ldots, x_N) = p_{x_1}(x_1)\, p_{x_2}(x_2) \cdots p_{x_N}(x_N) .        (1.78)
1.5.2 Derived Distributions and Jacobians
Suppose

    y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} = g(x) = \begin{bmatrix} g_1(x) \\ g_2(x) \\ \vdots \\ g_M(x) \end{bmatrix}

is an M-dimensional random vector obtained as a function of the N-dimensional random vector x. We can always in principle calculate the distribution for y from the method of events:

    P_y(y) = \Pr[g_1(x) \le y_1, g_2(x) \le y_2, \ldots, g_M(x) \le y_M]
           = \int_{A(y)} p_x(x)\, dx ,        (1.79)

where

    A(y) = \{x \mid g_1(x) \le y_1, g_2(x) \le y_2, \ldots, g_M(x) \le y_M\} .        (1.80)

We can then obtain the density via

    p_y(y) = \frac{\partial^M P_y(y)}{\partial y_1\, \partial y_2 \cdots \partial y_M} .        (1.81)

If M = N and g(x) is one-to-one, this approach leads to the expression

    p_y(y) = \frac{p_x(g^{-1}(y))}{\left| \dfrac{dg}{dx}(g^{-1}(y)) \right|} ,        (1.82)

where, as is discussed in Appendix 1.B, dg/dx is the Jacobian matrix corresponding to g and |·| = det(·) denotes the determinant of its matrix argument.
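For an invertible linear map y = g(x) = Ax, equation (1.82) reduces to p_y(y) = p_x(A^{-1}y)/|det A|. The sketch below checks this numerically for a two-dimensional example with a standard Gaussian p_x; the particular matrix A is an arbitrary choice.

```python
import numpy as np
from scipy.stats import multivariate_normal

A = np.array([[2.0, 1.0],
              [0.5, 1.5]])                  # arbitrary invertible matrix
px = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))

def py(y):
    """Density of y = A x via the Jacobian formula (1.82)."""
    x = np.linalg.solve(A, y)               # g^{-1}(y)
    return px.pdf(x) / abs(np.linalg.det(A))

# Compare with the known answer: y is Gaussian with covariance A A^T.
py_exact = multivariate_normal(mean=np.zeros(2), cov=A @ A.T)
y0 = np.array([1.0, -0.5])
print(py(y0), py_exact.pdf(y0))             # the two values agree
```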
1.5.3 Expectations and Covariance Matrices
The expectation of a scalar-valued function of x is given by

    E[f(x)] = \int_{-\infty}^{+\infty} f(x)\, p_x(x)\, dx .        (1.83)

The expectations of vector-valued (or even matrix-valued) functions of x are defined component-wise. For example, if

    f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \\ \vdots \\ f_M(x) \end{bmatrix} ,        (1.84)
then

    E[f(x)] = \begin{bmatrix} E[f_1(x)] \\ E[f_2(x)] \\ \vdots \\ E[f_M(x)] \end{bmatrix} .        (1.85)

Some important expectations are:

Mean Vector:

    E[x] = m_x        (1.86)

Correlation Matrix:

    E[x x^T]        (1.87)

Covariance Matrix:

    \operatorname{cov}(x, x) = \Lambda_x = E[(x - m_x)(x - m_x)^T] = E[x x^T] - m_x m_x^T        (1.88)

Cross-Covariance Matrix:

    \operatorname{cov}(x, y) = \Lambda_{xy} = E[(x - m_x)(y - m_y)^T] = E[x y^T] - m_x m_y^T        (1.89)

Conditional Mean:

    m_{x|y}(y) = m_{x|y=y} = E[x \mid y = y] = \int_{-\infty}^{+\infty} x\, p_{x|y}(x|y)\, dx        (1.90)

Conditional Covariance:

    \Lambda_{x|y}(y) = \Lambda_{x|y=y} = \int_{-\infty}^{+\infty} (x - m_{x|y}(y))(x - m_{x|y}(y))^T\, p_{x|y}(x|y)\, dx .

Note that the cross-covariance matrices satisfy

    \Lambda_{yx} = \Lambda_{xy}^T        (1.95)

and Λ_x is a symmetric matrix, i.e.,

    \Lambda_x = \Lambda_x^T .        (1.96)

The random vectors x and y are uncorrelated if every element of x is uncorrelated with every element of y, i.e.,

    \Lambda_{xy} = 0        (1.97)

or

    E[x y^T] = E[x]\, [E[y]]^T .        (1.98)

And we say two random vectors x and y are orthogonal if

    E[x y^T] = 0 ,

i.e., if every element of x is orthogonal to every element of y.
1.5.4 Characteristic Functions of Random Vectors
The characteristic function of a random vector is given by

    M_x(jv) = E[e^{j v^T x}] = \int_{-\infty}^{+\infty} e^{j v^T x}\, p_x(x)\, dx ,        (1.99)

and corresponds to the (sign-reversed) N-dimensional Fourier transform of the joint density. Analogous to the scalar case, when it exists, the characteristic function corresponds to an alternative complete statistical characterization of the random vector, and in particular the inverse formula for reconstructing p_x(x) from M_x(jv) is

    p_x(x) = \frac{1}{(2\pi)^N} \int_{-\infty}^{+\infty} e^{-j v^T x}\, M_x(jv)\, dv .        (1.100)

Among its uses, all mixed moments of x can be efficiently recovered from the characteristic function via differentiation; specifically, with K = k_1 + k_2 + ... + k_N, we have

    E[x_1^{k_1} x_2^{k_2} \cdots x_N^{k_N}] = \left[ \frac{1}{j^K}\, \frac{\partial^K M_x(jv)}{\partial v_1^{k_1}\, \partial v_2^{k_2} \cdots \partial v_N^{k_N}} \right]_{v=0} .        (1.101)

In turn, (1.101) implies that M_x(jv) can be expanded in a power series of the form

    M_x(jv) = \sum_{k_1=0}^{+\infty} \sum_{k_2=0}^{+\infty} \cdots \sum_{k_N=0}^{+\infty} E[x_1^{k_1} x_2^{k_2} \cdots x_N^{k_N}]\, \frac{(jv_1)^{k_1}}{k_1!} \frac{(jv_2)^{k_2}}{k_2!} \cdots \frac{(jv_N)^{k_N}}{k_N!} ,        (1.102)
provided all the constituent moments exist. Hence, many classes of random vectors are completely characterized by the complete set of moments of the form (1.101).

In addition, note that the collection of random variables x_1, x_2, ..., x_N are mutually independent if and only if

    M_x(jv) = M_{x_1}(jv_1)\, M_{x_2}(jv_2) \cdots M_{x_N}(jv_N) .        (1.103)

To establish the "only if" part, it suffices to note that if the x_1, x_2, ..., x_N are mutually independent then

    M_x(jv) = E[e^{j v^T x}] = E[e^{j(v_1 x_1 + v_2 x_2 + \cdots + v_N x_N)}]
            = E[e^{j v_1 x_1}]\, E[e^{j v_2 x_2}] \cdots E[e^{j v_N x_N}]
            = M_{x_1}(jv_1)\, M_{x_2}(jv_2) \cdots M_{x_N}(jv_N) .        (1.104)

To establish the "if" part, we note that if (1.103) holds then by (1.100) and (1.32) we have

    p_x(x) = \frac{1}{(2\pi)^N} \int_{-\infty}^{+\infty} e^{-j v^T x}\, M_x(jv)\, dv
           = \frac{1}{(2\pi)^N} \int_{-\infty}^{+\infty} dv \prod_{i=1}^{N} e^{-j v_i x_i}\, M_{x_i}(jv_i)
           = \prod_{i=1}^{N} \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-j v_i x_i}\, M_{x_i}(jv_i)\, dv_i
           = \prod_{i=1}^{N} p_{x_i}(x_i) .
Among several implications of (1.103) is the following conceptually useful alternative condition for mutual independence: the collection of random variables x_1, x_2, ..., x_N is mutually independent if and only if for all choices of functions f_1(·), f_2(·), ..., f_N(·) we have

    E[f_1(x_1)\, f_2(x_2) \cdots f_N(x_N)] = E[f_1(x_1)]\, E[f_2(x_2)] \cdots E[f_N(x_N)] .        (1.105)

The "only if" part requires a straightforward application of (1.78). To establish the "if" part, it suffices to choose f_i(x_i) = e^{j v_i x_i} and then exploit (1.103).

Finally, by combining (1.103) with the power series expansions (1.102) and (1.34), we get a related but milder equivalent condition for mutual independence: the collection of random variables x_1, x_2, ..., x_N is mutually independent if and only if for every set of nonnegative integers k_1, k_2, ..., k_N we have

    E[x_1^{k_1} x_2^{k_2} \cdots x_N^{k_N}] = E[x_1^{k_1}]\, E[x_2^{k_2}] \cdots E[x_N^{k_N}] .        (1.106)
Properties and Geometry of the Covariance Matrix
Let x and y be random vectors and define z as follows:

    z = Ax + By + b ,        (1.107)

where A and B are matrices of appropriate dimensions and b is a deterministic vector. Then straightforward calculations yield the following:

    m_z = A m_x + B m_y + b        (1.108)
    \Lambda_z = A \Lambda_x A^T + A \Lambda_{xy} B^T + B \Lambda_{yx} A^T + B \Lambda_y B^T .        (1.109)

Note that if x and y are uncorrelated, (1.109) simplifies to

    \Lambda_z = A \Lambda_x A^T + B \Lambda_y B^T .        (1.110)

As a special case, let a be a vector of numbers and consider the scalar random variable

    z = a^T x = \sum_{i=1}^{N} a_i x_i .        (1.111)

Then from (1.109)

    \sigma_z^2 = \Lambda_z = a^T \Lambda_x a .        (1.112)

Since σ_z^2 must be a non-negative number, we see that Λ_x must be a positive semidefinite matrix (see Footnote 6). If Λ_x is invertible, i.e., if it is positive definite, then σ_z^2 > 0 for any vector a ≠ 0. However, if Λ_x is singular, so that Λ_x is not positive definite, then there is some vector a ≠ 0 such that σ_z^2 = a^T Λ_x a = 0. Consequently, z in this case is a known constant, and therefore one of the x_i equals a constant plus a deterministic linear combination of the other components of x.

Footnote 6: See Appendix 1.A for a discussion of positive definite and semidefinite matrices.
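The sketch below verifies (1.108) and (1.110) by Monte Carlo for independent (hence uncorrelated) x and y; the dimensions, matrices, and covariances are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[1.0, 2.0], [0.0, 1.0]])
B = np.array([[0.5, -1.0], [1.0, 0.5]])
b = np.array([1.0, -2.0])

# Independent x and y with known covariances.
Lx = np.array([[2.0, 0.3], [0.3, 1.0]])
Ly = np.array([[1.0, -0.2], [-0.2, 0.5]])
x = rng.multivariate_normal([0, 0], Lx, size=200_000)
y = rng.multivariate_normal([1, 1], Ly, size=200_000)

z = x @ A.T + y @ B.T + b          # z = A x + B y + b, cf. (1.107)

print(z.mean(axis=0))              # approx A m_x + B m_y + b, cf. (1.108)
print(np.cov(z.T))                 # approx A Lx A^T + B Ly B^T, cf. (1.110)
print(A @ Lx @ A.T + B @ Ly @ B.T)
```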
Example 1.1
Let x and y be two scalar random variables, and consider the random vector

    w = \begin{bmatrix} x \\ y \end{bmatrix} .        (1.113)

Then

    \Lambda_w = \begin{bmatrix} \sigma_x^2 & \lambda_{xy} \\ \lambda_{xy} & \sigma_y^2 \end{bmatrix}
              = \begin{bmatrix} \sigma_x^2 & \rho_{xy}\sigma_x\sigma_y \\ \rho_{xy}\sigma_x\sigma_y & \sigma_y^2 \end{bmatrix} .        (1.114)

For Λ_w to be positive definite, it must be true that the determinant of Λ_w is positive:

    \det(\Lambda_w) = (1 - \rho_{xy}^2)\, \sigma_x^2 \sigma_y^2 > 0 .        (1.115)

From this equation we can see that Λ_w will not be positive definite if and only if the correlation coefficient ρ_{xy} equals ±1. In either of these cases we can conclude that x must equal a multiple of y plus a constant, i.e., x = cy + d for some constants c and d. It is straightforward to check that the sign of c is the same as that of ρ_{xy}.
To develop the geometrical properties of the covariance matrix, suppose x is an N-dimensional random vector. As discussed in Appendix 1.A, since Λ_x is symmetric, there exists an orthogonal matrix P such that if we define

    z = Px        (1.116)

then

    \Lambda_z = P \Lambda_x P^T = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_N) ,        (1.117)

where λ_1, λ_2, ..., λ_N are the eigenvalues of Λ_x.

The interpretation of this result is very important. Specifically, we see that we can perform a change of coordinates (1.116) on our random vector so that the components of the transformed vector z are uncorrelated. The columns of P^T, which are the eigenvectors of Λ_x, are the principal directions of Λ_x, i.e., they specify the linear combinations of x that make up the uncorrelated components of z. Moreover, the eigenvalues of Λ_x are the variances of the corresponding components of z.
Example 1.2
Continuing Example 1.1, suppose that

    \Lambda_w = \begin{bmatrix} 3/2 & 1/2 \\ 1/2 & 3/2 \end{bmatrix} .

Then the eigenvalues are λ_1 = 2 and λ_2 = 1, and the corresponding normalized eigenvectors are

    p_1 = \begin{bmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{bmatrix} ,  \quad
    p_2 = \begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix} .

Hence, we can conclude that the pair of random variables

    u = \sqrt{2}\, p_1^T w = x + y
    v = \sqrt{2}\, p_2^T w = x - y

are uncorrelated and have variances

    \operatorname{var} u = \operatorname{var}[x + y] = 2 \operatorname{var}\!\left[ \frac{x + y}{\sqrt{2}} \right] = 2\lambda_1 = 4
    \operatorname{var} v = \operatorname{var}[x - y] = 2 \operatorname{var}\!\left[ \frac{x - y}{\sqrt{2}} \right] = 2\lambda_2 = 2 .
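Example 1.2 can be reproduced directly with a numerical eigendecomposition; a minimal sketch:

```python
import numpy as np

Lw = np.array([[1.5, 0.5],
               [0.5, 1.5]])      # covariance matrix of Example 1.2

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors
# as columns; the rows of P = eigvecs.T give the change of coordinates (1.116).
eigvals, eigvecs = np.linalg.eigh(Lw)
print(eigvals)                   # [1. 2.]
print(eigvecs)                   # columns span the (1, -1) and (1, 1) directions

P = eigvecs.T
print(P @ Lw @ P.T)              # diag(1, 2), cf. (1.117)
```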
1.6 GAUSSIAN RANDOM VARIABLES
In this section we define and develop the basic properties of jointly Gaussian or normal random variables (or, equivalently, Gaussian or normal random vectors). Gaussian random variables are important for at least two reasons. First, Gaussian random vectors are good models in many physical scenarios. For example, Gaussian distributions arise in practice when the quantity observed is composed of a superposition of a large number of small, independent, random contributions. This behavior is captured by the Central Limit Theorem, which we describe shortly. Second, jointly Gaussian random variables are highly tractable, having convenient mathematical properties that greatly simplify a variety of calculations involving, e.g., linear transformations.

Figure 1.1. The probability density function of a scalar Gaussian random variable.
A Gaussian random variable x has a probability density of the form

    p_x(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(x - m)^2}{2\sigma^2} \right]        (1.118)

for some parameters m and σ^2 > 0. To emphasize the fact that this density is parametrized by these two numbers, we will use the notation x ~ N(m, σ^2) as shorthand for "x is Gaussian with mean m and variance σ^2," and we will also write

    p_x(x) = N(x; m, \sigma^2) = N(x - m; 0, \sigma^2) = \frac{1}{\sigma}\, N\!\left( \frac{x - m}{\sigma}; 0, 1 \right) .        (1.119)

The density (1.118) is the familiar bell-shaped curve depicted in Fig. 1.1. It is centered at x = m, and σ_x is a measure of its width. In fact, the first and second (central) moments of x are related to the parameters in the manner one would expect based on our choice of notation, i.e.,

    E[x] = m        (1.120)
    \operatorname{var} x = \sigma^2 .        (1.121)

Eqs. (1.120) and (1.121) can be verified by direct computation of the expectation integrals.

The characteristic function of a Gaussian random variable x is frequently useful in computations. It takes the form

    M_x(jv) = \exp\left[ jvm - \frac{1}{2} v^2 \sigma^2 \right] ,        (1.122)
which, again, can be verified by direct computation of the corresponding Fourier transform (1.31).

As an example application, we can use the characteristic function to establish that if a and b are arbitrary constants, the random variable z = ax + b is N(am + b, a^2σ^2). To verify this, we note that

    M_z(jv) = E[e^{jvz}] = E[e^{jvax} e^{jvb}] = M_x(jva)\, e^{jvb}
            = \exp\left[ jv(am + b) - \frac{1}{2} v^2 (a^2\sigma^2) \right] ,        (1.123)

where we have used (1.122) to obtain the last equality in (1.123).
1.6.1 Central Limit Theorem
One fairly general form of the Central Limit Theorem is stated formally as follows. Let

    x_1, x_2, x_3, \ldots

be a sequence of mutually independent zero-mean random variables with distributions

    P_{x_1}(x_1), P_{x_2}(x_2), P_{x_3}(x_3), \ldots

and variances

    \sigma_1^2, \sigma_2^2, \sigma_3^2, \ldots,

respectively. If for any ε > 0 there exists a k (depending on ε) sufficiently large that

    \sigma_i < \epsilon S_k    \quad \text{for } i = 1, 2, \ldots, k,        (1.124)

with

    S_k = \sqrt{ \sum_{i=1}^{k} \sigma_i^2 } ,

then the distribution function of the normalized sum

    z_n = \frac{1}{S_n} \sum_{i=1}^{n} x_i        (1.125)

converges to the distribution function of a Gaussian random variable with zero mean and unit variance, i.e.,

    P_{z_n}(z) \to \int_{-\infty}^{z} N(x; 0, 1)\, dx    \quad \text{as } n \to \infty .

A couple of points are worth emphasizing. First, the somewhat exotic constraint (1.124) essentially ensures that no one term dominates the sum (1.125) (see Footnote 7). In fact, a simpler special case of this theorem corresponds to the x_i being identically distributed and having a finite common variance. Second, it is important to emphasize that the theorem guarantees convergence in distribution but not in density. In fact, when the random variables in the sum are discrete, it is impossible to have convergence in density since arbitrary partial sums will be discrete!

Footnote 7: To see that this constraint is critical, it suffices to consider the sequence of independent Bernoulli random variables x_i, each of which is ±1/2^i with equal probability. Note that this sequence does not satisfy (1.124). For this sequence, it is straightforward to verify using (1.36) that the distribution of the normalized sum (1.125) converges to a uniform rather than Gaussian distribution.
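A quick simulation illustrates the theorem for i.i.d. terms (the simpler special case mentioned above); uniform summands are an arbitrary choice.

```python
import numpy as np
from math import erf

rng = np.random.default_rng(3)
n, trials = 100, 50_000

# Zero-mean uniform summands; each has variance 1/12, so S_n = sqrt(n/12).
x = rng.uniform(-0.5, 0.5, size=(trials, n))
Sn = np.sqrt(n / 12.0)
z = x.sum(axis=1) / Sn                       # normalized sum (1.125)

# Compare a few points of the empirical distribution with the N(0,1) CDF.
for a in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    empirical = np.mean(z <= a)
    gaussian = 0.5 * (1 + erf(a / np.sqrt(2)))
    print(f"a={a:+.1f}: empirical={empirical:.4f}, N(0,1) CDF={gaussian:.4f}")
```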
1.6.2 Error Functions
In the context of many engineering applications of Gaussian random variables, we need to compute the area under the tail of a Gaussian density. In general, there is no closed-form expression for such quantities. However, the corresponding quantities for normalized Gaussian densities are often available numerically via tables or computer software packages.

In particular, if x ~ N(0, 1), then the standard Q-function is defined according to Q(α) = Pr[x > α], i.e.,

    Q(\alpha) \triangleq \frac{1}{\sqrt{2\pi}} \int_{\alpha}^{\infty} e^{-x^2/2}\, dx .        (1.126)

This function, the area under the tail of the unit Gaussian (normal) density, is closely related to the so-called complementary error function erfc(·) via

    Q(\alpha) = \frac{1}{2} \operatorname{erfc}\!\left( \frac{\alpha}{\sqrt{2}} \right) .        (1.127)

This function is also well-tabulated, and can be evaluated, e.g., using the MATLAB function erfc. In calculations, it is often convenient to exploit the symmetry property

    Q(-\alpha) = 1 - Q(\alpha)        (1.128)

and the bound

    Q(\alpha) \le \frac{1}{2} e^{-\alpha^2/2} ,        (1.129)

valid for α > 0. A variety of tighter bounds that are useful in a number of applications can also be developed.

Via a change of variables, the area under tails of other nonnormalized Gaussian densities is readily expressed in terms of the Q-function. For example, if x ~ N(m, σ^2), then (x - m)/σ ~ N(0, 1), so

    \Pr[x > \gamma] = \Pr\left[ \frac{x - m}{\sigma} > \frac{\gamma - m}{\sigma} \right] = Q\!\left( \frac{\gamma - m}{\sigma} \right) .
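A minimal Python counterpart of the MATLAB usage mentioned above, using scipy.special.erfc to evaluate (1.126) via (1.127) and to compute a Gaussian tail probability:

```python
import numpy as np
from scipy.special import erfc

def Q(alpha):
    """Standard Q-function via the complementary error function, cf. (1.127)."""
    return 0.5 * erfc(alpha / np.sqrt(2.0))

print(Q(0.0), Q(1.0), Q(3.0))        # 0.5, ~0.1587, ~0.00135

# Tail of a nonnormalized Gaussian: Pr[x > gamma] for x ~ N(m, sigma^2).
m, sigma, gamma = 1.0, 2.0, 4.0
print(Q((gamma - m) / sigma))        # Pr[x > 4] for N(1, 4)

# The simple bound (1.129) for alpha > 0.
alpha = 3.0
print(Q(alpha) <= 0.5 * np.exp(-alpha**2 / 2))   # True
```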
1.6.3 Gaussian Random Vectors
The notion of a Gaussian random vector is a powerful and important one, and builds on our notion of a Gaussian random variable. Specifically, an N-dimensional random vector

    x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}

is defined to be a Gaussian random vector, or equivalently x_1, x_2, ..., x_N are defined to be a set of jointly Gaussian random variables, when for all choices of the constant vector

    a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{bmatrix}        (1.130)

the scalar y = a^T x is a Gaussian random variable.

Gaussian random vectors have several important properties. In what follows, suppose x is a Gaussian random vector whose mean is m_x and whose covariance matrix is Λ_x.

First, all subsets of x_1, x_2, ..., x_N are jointly Gaussian. Deriving this result simply requires setting some of the a_i's in (1.130) to zero. As a special case of this result, corresponding to having only one nonzero component in (1.130), we have that all the constituents must be individually Gaussian random variables, i.e.,

    x_i \sim N(m_i, \lambda_{ii})    \quad \text{for } i = 1, 2, \ldots, N,

where λ_{ii} = [Λ_x]_{ii}.

The characteristic function for a Gaussian random vector takes the form

    M_x(jv) = \exp\left[ j v^T m_x - \frac{1}{2} v^T \Lambda_x v \right] .        (1.131)
To prove (1.131), first note that

    M_x(jv) = E[e^{j v^T x}] = E[e^{j(v_1 x_1 + v_2 x_2 + \cdots + v_N x_N)}] .        (1.132)

Now for an arbitrary a let

    y = a^T x ,        (1.133)

which from the definition of a Gaussian random vector means that y is a Gaussian random variable; specifically, y ~ N(a^T m_x, a^T Λ_x a). Then

    M_y(jv) = E[e^{jvy}] = E[e^{j(v a_1 x_1 + v a_2 x_2 + \cdots + v a_N x_N)}] .        (1.134)

Combining (1.132) with (1.134) we obtain

    M_y(jv) = M_x(jva) ,        (1.135)

while combining (1.134) with (1.122) we obtain

    M_y(jv) = \exp\left[ j(v a^T) m_x - \frac{1}{2} (v a^T) \Lambda_x (v a) \right] .        (1.136)

Finally, equating (1.135) and (1.136) and choosing a = v/v (so that va = v), we obtain our desired result (1.131).
Having derived the characteristic function of a Gaussian random vector, we now turn our attention to the corresponding density. When |Λ_x| = det(Λ_x) > 0 (i.e., the nondegenerate case), the density function for a Gaussian random vector takes the following form:

    p_x(x) = \frac{ \exp\left[ -\frac{1}{2} (x - m_x)^T \Lambda_x^{-1} (x - m_x) \right] }{ (2\pi)^{N/2}\, |\Lambda_x|^{1/2} }        (1.137)
           \triangleq N(x; m_x, \Lambda_x)
           = |\Lambda_x|^{-1/2}\, N(\Lambda_x^{-1/2}(x - m_x); 0, I) ,        (1.138)

where Λ_x^{-1} is the inverse matrix associated with Λ_x, and where Λ_x^{1/2} is the positive definite square root matrix of Λ_x, i.e., as discussed in Appendix 1.A, the (unique) matrix satisfying (see Footnote 8)

    \Lambda_x^{1/2} = \left( \Lambda_x^{1/2} \right)^T > 0        (1.139a)
    \Lambda_x^{1/2}\, \Lambda_x^{1/2} = \Lambda_x .        (1.139b)

Footnote 8: We have used Λ_x^{-1/2} to denote the inverse of this square root matrix. Incidentally, it is straightforward to verify that this matrix is also the positive definite square root matrix of Λ_x^{-1}.
The density (1.137) can be obtained via direct computation of the inverse Fourier transform of (1.131); a derivation is as follows. First, observe that

    p_x(x) = \frac{1}{(2\pi)^N} \int_{-\infty}^{+\infty} M_x(jv)\, e^{-j v^T x}\, dv
           = \frac{1}{(2\pi)^N} \int_{-\infty}^{+\infty} \exp\left[ -j v^T (x - m_x) - \frac{1}{2} v^T \Lambda_x v \right] dv .        (1.140)

Then, using the change of variables u = Λ_x^{1/2} v with Λ_x^{1/2} as defined in (1.139), and noting that the Jacobian of the transformation is, using (1.139b), |du/dv| = |Λ_x^{1/2}| = |Λ_x|^{1/2}, we can rewrite (1.140) as

    p_x(x) = \frac{1}{(2\pi)^N} \int_{-\infty}^{+\infty} \exp\left[ -j u^T \Lambda_x^{-1/2}(x - m_x) - \frac{1}{2} u^T u \right] \frac{du}{|\Lambda_x|^{1/2}} ,

which, when we adopt the convenient notation

    \tilde{x} = \Lambda_x^{-1/2}(x - m_x)        (1.141)
and complete the square in the exponential, yields

    p_x(\Lambda_x^{1/2}\tilde{x} + m_x)
        = \frac{1}{(2\pi)^N |\Lambda_x|^{1/2}} \exp\left[ -\frac{1}{2} \tilde{x}^T \tilde{x} \right] \int_{-\infty}^{+\infty} \exp\left[ -\frac{1}{2} (u + j\tilde{x})^T (u + j\tilde{x}) \right] du
        = \frac{1}{(2\pi)^N |\Lambda_x|^{1/2}} \exp\left[ -\frac{1}{2} \tilde{x}^T \tilde{x} \right] \prod_{i=1}^{N} \sqrt{2\pi} \int_{-\infty}^{+\infty} N(u_i; -j\tilde{x}_i, 1)\, du_i .        (1.142)

Finally, recognizing that each of the integrals in (1.142) is unity, and replacing \tilde{x} with its definition (1.141), we obtain, after some simple manipulations, our desired result (1.137).
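The closed form (1.137) can be checked against a library implementation; a small sketch with an arbitrarily chosen mean and covariance:

```python
import numpy as np
from scipy.stats import multivariate_normal

m = np.array([1.0, -1.0])
L = np.array([[2.0, 0.6],
              [0.6, 1.0]])         # a valid (positive definite) covariance

def gauss_pdf(x, m, L):
    """Evaluate the Gaussian density (1.137) directly."""
    N = len(m)
    d = x - m
    expo = -0.5 * d @ np.linalg.solve(L, d)
    norm = (2 * np.pi) ** (N / 2) * np.sqrt(np.linalg.det(L))
    return np.exp(expo) / norm

x0 = np.array([0.5, 0.0])
print(gauss_pdf(x0, m, L))
print(multivariate_normal(mean=m, cov=L).pdf(x0))   # same value
```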
Other Properties of Gaussian Random Vectors
Several additional important properties of Gaussian random vectors are worth developing. First, a pair of jointly Gaussian random vectors x and y are independent if and only if they are uncorrelated. We established the "only if" part for any pair of random vectors earlier. To establish the "if" part, let

    z = \begin{bmatrix} x \\ y \end{bmatrix}        (1.143)

so that z ~ N(m_z, Λ_z), with

    m_z = \begin{bmatrix} m_x \\ m_y \end{bmatrix}        (1.144)

    \Lambda_z = \begin{bmatrix} \Lambda_x & \Lambda_{xy} \\ \Lambda_{yx} & \Lambda_y \end{bmatrix} .        (1.145)

Then when x and y are uncorrelated, i.e., when Λ_{xy} = 0, we have

    \det \Lambda_z = \det \Lambda_x\, \det \Lambda_y        (1.146)

and

    \Lambda_z^{-1} = \begin{bmatrix} \Lambda_x^{-1} & 0 \\ 0 & \Lambda_y^{-1} \end{bmatrix} .        (1.147)

Using the expression (1.137) for the Gaussian density with these results, one can easily check that in this case

    p_z(z) = p_{x,y}(x, y) = p_x(x)\, p_y(y) .        (1.148)
Second, linear transformations of Gaussian random vectors always produce Gaussian random vectors. To see this, let A be an arbitrary m x n matrix and let y = Ax, where x is a Gaussian random vector. Then y is a Gaussian random vector provided z = b^T y is a Gaussian random variable for every b. But z = \tilde{b}^T x where \tilde{b} = A^T b. Hence, since x is a Gaussian random vector, z is indeed a Gaussian random variable.

Note that as an immediate corollary to the last result we have that x and y are also jointly Gaussian random vectors. To see this, it suffices to observe that

    \begin{bmatrix} x \\ y \end{bmatrix} = \tilde{A} x = \begin{bmatrix} I \\ A \end{bmatrix} x .
Third, for jointly Gaussian random vectors x and y, the conditional distribution for x given y = y is also Gaussian, with mean

    m_{x|y}(y) = m_x + \Lambda_{xy} \Lambda_y^{-1} (y - m_y)        (1.149)

and covariance

    \Lambda_{x|y}(y) = \Lambda_x - \Lambda_{xy} \Lambda_y^{-1} \Lambda_{xy}^T .        (1.150)

A particularly straightforward proof of this result will appear later in our discussion of optimal estimation of random vectors in Chapter 3. Also, note that, consistent with our preceding discussions, if Λ_{xy} = 0 then m_{x|y}(y) = m_x and Λ_{x|y}(y) = Λ_x.

Footnote 9: Note from (1.150) that Λ_{x|y}(y) is a constant matrix, i.e., independent of the value of y.
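Formulas (1.149) and (1.150) are easy to exercise numerically; the sketch below computes the conditional mean and covariance for a jointly Gaussian pair with arbitrarily chosen (scalar) parameters.

```python
import numpy as np

# Joint second-order description of (x, y); one-dimensional here for simplicity.
Lx, Ly, Lxy = np.array([[4.0]]), np.array([[1.0]]), np.array([[1.5]])
mx, my = np.array([2.0]), np.array([-1.0])

def conditional(y):
    """Conditional mean (1.149) and covariance (1.150) of x given y = y."""
    gain = Lxy @ np.linalg.inv(Ly)
    m_cond = mx + gain @ (y - my)
    L_cond = Lx - gain @ Lxy.T
    return m_cond, L_cond

print(conditional(np.array([0.0])))   # mean 3.5, covariance 1.75
```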
Finally, we stress that a Gaussian random vector is completely determined by its mean and covariance. For example, suppose that x and y are independent Gaussian random vectors and consider

    z = Ax + By + b .        (1.151)

Then, since as we've shown Gaussianity is preserved under linear operations, we know that z is Gaussian. Consequently, in order to specify its density completely, we need only calculate its mean and covariance. As we saw in (1.108) and (1.110), these computations are also straightforward; we repeat them here for convenience:

    m_z = A m_x + B m_y + b
    \Lambda_z = A \Lambda_x A^T + B \Lambda_y B^T .

Since the mean vector and covariance matrix fully parameterize the density of a collection of jointly Gaussian random variables, this means that all moments can be expressed as functions of the mean and covariance. Moreover, in the Gaussian case, these moments can be computed extremely efficiently. To see this, let x_1, x_2, ..., x_N be a set of jointly Gaussian random variables with mean values \overline{x}_i and covariances λ_{ij} = cov(x_i, x_j), 1 ≤ i, j ≤ N. In addition, for convenience define

    \tilde{x}_i = x_i - \overline{x}_i .

Then for any set of integers i_1, i_2, ..., i_L selected from {1, 2, ..., N}, with repetition allowed, it follows that

    E[\tilde{x}_{i_1} \tilde{x}_{i_2} \cdots \tilde{x}_{i_L}] =
        \begin{cases}
            0 & L \text{ odd} \\
            \sum \lambda_{j_1 j_2} \lambda_{j_3 j_4} \cdots \lambda_{j_{L-1} j_L} & L \text{ even,}
        \end{cases}        (1.152)
where the summation in (1.152) is over all distinct pairings {j_1, j_2}, {j_3, j_4}, ..., {j_{L-1}, j_L} of the set of symbols i_1, i_2, ..., i_L. Although we won't develop it here, this result may be derived in a relatively straightforward manner using, e.g., a Taylor series expansion of M_x(jv). As an example application of (1.152) we have

    E[\tilde{x}_{i_1} \tilde{x}_{i_2} \tilde{x}_{i_3} \tilde{x}_{i_4}] = \lambda_{i_1 i_2}\lambda_{i_3 i_4} + \lambda_{i_1 i_3}\lambda_{i_2 i_4} + \lambda_{i_1 i_4}\lambda_{i_2 i_3} ,        (1.153)

so that

    E[\tilde{x}_1 \tilde{x}_2 \tilde{x}_3 \tilde{x}_4] = \lambda_{12}\lambda_{34} + \lambda_{13}\lambda_{24} + \lambda_{14}\lambda_{23}        (1.154)
    E[\tilde{x}_1^2 \tilde{x}_2^2] = \lambda_{11}\lambda_{22} + 2\lambda_{12}^2        (1.155)
    E[\tilde{x}_1^4] = 3\lambda_{11}^2 .        (1.156)
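Relation (1.155), for instance, can be verified by simulation; a rough Monte Carlo sketch with an arbitrary 2x2 covariance:

```python
import numpy as np

rng = np.random.default_rng(4)
L = np.array([[2.0, 0.8],
              [0.8, 1.0]])                   # lambda_11, lambda_12, lambda_22

x = rng.multivariate_normal([0.0, 0.0], L, size=2_000_000)
emp = np.mean(x[:, 0] ** 2 * x[:, 1] ** 2)   # E[x1^2 x2^2], zero-mean case
pred = L[0, 0] * L[1, 1] + 2 * L[0, 1] ** 2  # prediction from (1.155)
print(emp, pred)                             # close to each other
```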
As a final remark, we point out that the detailed shape of the contours of equiprobability for the multidimensional Gaussian density can be directly deduced from the geometry of the covariance matrix as developed in Section 1.5.4. In particular, from (1.137) we see that the contours of equiprobability are the N-dimensional ellipsoids defined by

    (x - m_x)^T \Lambda_x^{-1} (x - m_x) = \text{constant} .        (1.157)

From this perspective, the transformation (1.116) from x to z corresponds to a generalized coordinate rotation (i.e., a length-preserving transformation) such that the components of z represent the principal or major axes of this ellipsoid, i.e.,

    \frac{(z_1 - m_{z_1})^2}{\lambda_1} + \frac{(z_2 - m_{z_2})^2}{\lambda_2} + \cdots + \frac{(z_N - m_{z_N})^2}{\lambda_N} = \text{constant} .

Note that the λ_i = var z_i describe the proportions of the ellipsoid: they correspond to (squares of) the relative lengths along the principal axes. Note too that since z is Gaussian, its components are not only uncorrelated but mutually independent random variables.
We conclude this section by specializing our results to the case of two-dimensional Gaussian random vectors, where we let

    m_x = \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} ,  \quad
    \Lambda_x = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix} .        (1.158)

Here

    p_x(x) = \frac{1}{(2\pi)^{N/2} |\Lambda_x|^{1/2}} \exp\left[ -\frac{1}{2} (x - m_x)^T \Lambda_x^{-1} (x - m_x) \right]        (1.159)
           = \frac{ \exp\left\{ -\dfrac{ (x_1 - m_1)^2 \sigma_2^2 - 2(x_1 - m_1)(x_2 - m_2)\rho\sigma_1\sigma_2 + (x_2 - m_2)^2 \sigma_1^2 }{ 2\sigma_1^2\sigma_2^2 (1 - \rho^2) } \right\} }{ 2\pi\sigma_1\sigma_2 (1 - \rho^2)^{1/2} } .        (1.160)
Fig. 1.2 depicts the joint density of a pair of Gaussian random variables. In Fig. 1.3 we have plotted the associated contours of constant values of p_x(x), which are the ellipses

    (x_1 - m_1)^2 \sigma_2^2 - 2(x_1 - m_1)(x_2 - m_2)\rho\sigma_1\sigma_2 + (x_2 - m_2)^2 \sigma_1^2 = \text{constant} .        (1.161)
Figure 1.2. The two-dimensional probability density function of a pair of jointly Gaussian random variables.

Figure 1.3. The contours of equiprobability corresponding to the density of Fig. 1.2.
As indicated in the figure, the components of z define the principal axes of the ellipses in (1.161), i.e., this equation in the transformed coordinates becomes

    \frac{(z_1 - \tilde{m}_1)^2}{\lambda_1} + \frac{(z_2 - \tilde{m}_2)^2}{\lambda_2} = \text{constant} ,

where λ_1 and λ_2 are the eigenvalues of Λ_x and where

    m_z = \begin{bmatrix} \tilde{m}_1 \\ \tilde{m}_2 \end{bmatrix} = P m_x .

The larger |ρ| is, the more eccentric these ellipses become, degenerating to lines when |ρ| = 1.
1.7 ABSTRACT VECTOR SPACE, AND SPACES OF RANDOM VARIABLES
The notion of a vector space is very powerful, and one that we will exploit on numerous occasions throughout the course. Clearly, we've already used certain vector space ideas in preceding sections, exploiting results from Appendix 1.A. In particular, we've exploited properties of the Euclidean space R^N consisting of N-dimensional vectors. However, while Euclidean space is an important example of a vector space, there are in fact many other somewhat more abstract vector spaces that turn out to be at least as important to us in this course. Although more abstract, many properties carry over from the Euclidean case, and you will often be able to rely on the geometric picture and intuition you have developed for this case.

Most generally, a vector space is a collection of elements or objects satisfying certain properties. This collection of elements may indeed consist of vectors x as we usually think of them, or they may be other kinds of objects like whole sequences x[n] or functions x(t), or even random variables x(ω). To avoid a conceptual bias, we'll just use the generic notation x for one such element.

For our purposes, vector spaces are special classes of metric spaces, i.e., spaces in which there is some notion of distance between the various elements in the collection (see Footnote 10). A metric space is described by the pair (S, d(·, ·)), where S is the collection of elements and d(·, ·) is referred to as the metric. It is a measure of distance between an arbitrary pair of elements in the set; in particular, d(x, y) is the distance between elements x and y in S.

For a metric to be useful, it must satisfy certain key properties that are consistent with our intuition about what distance is. In particular, we must have, for any elements x, y, and z in S,

    d(x, y) \ge 0        (1.162)
    d(x, y) = 0 \iff x = y        (1.163)
    d(x, y) = d(y, x)        (1.164)
    d(x, y) \le d(x, z) + d(z, y) .        (1.165)

The last of these, i.e., (1.165), is referred to as the triangle inequality.

An obvious (but not unique) example of a metric in R^N is the usual Euclidean distance

    d(x, y) = \sqrt{ \sum_{n=1}^{N} (x_n - y_n)^2 } ,        (1.166)

where x_n and y_n are the nth elements of x and y, respectively. You can verify that (1.166) satisfies (1.162)-(1.165).
The metric spaces we're usually interested in have additional structure. First, we want to work with spaces that are complete. While the technical definition is beyond the scope of our treatment here, in essence completeness means the metric space has no "holes." An example of a metric space that isn't complete is S = (0, 1] \subset R with d(x, y) = |x - y|. Note that the elements of the sequence x_n = 1/n for n = 1, 2, ... are all in the space S, but lim_{n→∞} x_n = 0 is not. The sequence x_n is an example of what is called a Cauchy sequence, and for a metric space to be complete, all such Cauchy sequences must converge to an element of S.

Footnote 10: As a note of caution, our use of the term "vector space" is not universal. Some references consider the term to be equivalent to the term "linear space." However, as will become apparent, we will find it convenient to define vector spaces as linear spaces that are also metric spaces.

A vector space V is a metric space that is linear. In order to talk about linearity, we'll need to define addition and scalar multiplication operators for objects in V. For the cases of interest to us, we'll be using the usual definitions of these operators. We say V is a vector space if the following two properties hold (see Footnote 11):

    x, y \in V \implies x + y \in V        (1.167)
    x \in V, \; \alpha \in R \implies \alpha x \in V .        (1.168)

There are lots of important examples of vector spaces. First, there is the usual Euclidean space R^N composed of N-dimensional vectors x. There is also the space of sequences x[n] with finite energy,

    \sum_{n=-\infty}^{\infty} x^2[n] < \infty ,

which is usually denoted ℓ_2(Z), and the space of (integrable) functions x(t) with finite energy,

    \int_{-\infty}^{+\infty} x^2(t)\, dt < \infty ,

which is usually denoted L_2(R). And there is the space of random variables x(ω) with finite mean-square,

    E[x^2] = \int_{-\infty}^{+\infty} x^2\, p_x(x)\, dx < \infty ,

which is usually denoted L_2(Ω) (see Footnote 12).

Footnote 11: When the scalar α is restricted to be a real number as (1.168) indicates, the result is referred to as a real vector space; when it can be a complex number, i.e., α \in C, the result is a complex vector space. Although we will largely focus on the former class in this course to simplify our development, we remark in advance that we will sometimes need to work with complex vector spaces. Fortunately, however, there are no significant conceptual differences between the two.

Footnote 12: Incidentally, for every probability space, there is an associated vector space of such random variables.
1.7.1 Linear Subspaces
Subspaces are an important concept associated with vector spaces. A subspace is a vector space that lies within another vector space, i.e., a subset W \subset V is a subspace if W is itself a vector space (see Footnote 13). As an example, if V is the plane R^2, then the line

    W = \{(x, y) \in R^2 \mid y = x\}

is a subspace.

Footnote 13: Technically, the mathematical notion of a subspace is more general. The definition we provide is of a specifically linear subspace, which is the only type of interest to us.
1.7.2 Linear Transformations
A linear transformation L(·) is a linear mapping from one vector space V to another vector space U. This means that the powerful principle of superposition is satisfied, i.e., if x_k for k = 1, 2, ..., K are each elements of V, and if α_k for k = 1, 2, ..., K are scalars, then

    L(\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_K x_K) = \alpha_1 y_1 + \alpha_2 y_2 + \cdots + \alpha_K y_K ,

where y_k = L(x_k).

When the vector spaces are the familiar Euclidean spaces, e.g., V = R^N and U = R^M, then L(·) is represented by a matrix, i.e.,

    y = L(x) = Ax ,

where A is an M x N-dimensional matrix. Several properties of matrices are developed in Appendix 1.A.
1.7.3 Linear Independence
A set of elements x_1, x_2, ..., x_K in a vector space V is said to be linearly independent when

    \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_K x_K = 0 \implies \alpha_1 = \alpha_2 = \cdots = \alpha_K = 0 .        (1.169)

From (1.169) we see that linear dependency implies that the set of elements is redundant, i.e., that some of the elements can be expressed as a linear combination of the others. For example, if (1.169) does not hold, then assuming α_1 ≠ 0 we have

    x_1 = \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_K x_K ,

where β_k = -α_k/α_1.
1.7.4 Bases
A basis for V is a linearly independent set of elements in V that span V. A set of elements x_1, x_2, ..., x_K is said to span V if any element x \in V can be represented as a linear combination of the elements in this set, i.e., there exist α_k for k = 1, 2, ..., K such that

    x = \alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_K x_K .

Note that when this set is linearly independent, the α_k must be unique.

All bases for a vector space have the same cardinality. This cardinality is referred to as the dimension of the vector space. As you've seen, the Euclidean space V = R^N has dimension N. Hence, these spaces are finite-dimensional. Other spaces, like the space of finite-energy sequences ℓ_2(Z) and the space of finite-energy functions L_2(R), are infinite-dimensional. The space of finite mean-square random variables L_2(Ω) is not only infinite-dimensional, but its dimension is uncountable (unless the probability space is discrete)! Although infinite-dimensional spaces are difficult to visualize, much intuition from finite-dimensional Euclidean space carries over. Furthermore, in many problems involving these vector spaces, we will often work with finite-dimensional subspaces, for which our geometric pictures are well-developed.

Let us continue to add more geometric structure to our notion of vector space.
1.7.5 Normed Vector Spaces
A normed vector space is a special kind of vector space for which the concept of length is defined for elements of the space. Let us use ‖x‖ to denote the length or norm of each x \in V, so a normed vector space is defined by specifying the pair (V, ‖·‖). In order for a function ‖·‖ to make sense as a norm on V, it must satisfy certain properties. In particular, for x \in V and an arbitrary scalar α, it must satisfy:

    \|x\| \ge 0        (1.170)
    \|x + y\| \le \|x\| + \|y\|        (1.171)
    \|\alpha x\| = |\alpha|\, \|x\|        (1.172)
    \|x\| = 0 \iff x = 0 .        (1.173)

Note that (1.171) is referred to as the triangle inequality for norms.

For normed vector spaces, the following rather natural metric can be defined: for x and y in V,

    d(x, y) = \|x - y\| .        (1.174)

It should be a straightforward exercise to verify that (1.174) satisfies the necessary properties of a metric, i.e., (1.162)-(1.165). Normed vector spaces that are complete in the sense we discussed earlier have a special and somewhat arcane name: they are referred to as Banach spaces.
As examples, R^N, ℓ_2(Z), L_2(R), and L_2(Ω) are all normed vector spaces. The corresponding norms are defined by, respectively,

    \|x\|^2 = \sum_{n=1}^{N} x_n^2
    \|x[\cdot]\|^2 = \sum_{n} x^2[n]
    \|x(\cdot)\|^2 = \int x^2(t)\, dt
    \|x(\cdot)\|^2 = E[x^2] .

Note, however, that there are many other norms one can define even for vectors in R^N; for example,

    \|x\| = \max_{1 \le n \le N} |x_n| .

Likewise, for functions x(t), for any positive integer p,

    \|x(\cdot)\| = \left[ \int |x(t)|^p\, dt \right]^{1/p}

is a valid norm, and defines a whole family of normed vector spaces L_p(R) parameterized by p. We emphasize that to fully specify a normed vector space we need both a collection of elements and a norm.

Ultimately, we're interested in normed vector spaces with even more structure, as we'll now develop.
1.7.6 Inner Product Spaces
An inner product space is a normed vector space where there is a notion of relative orientation, or angle, between elements. We use the notation ⟨x, y⟩ to denote the inner product between two elements x and y in V. An inner product space is therefore defined by the pair (V, ⟨·, ·⟩). An inner product defines the operation of projection of one element onto another. A valid inner product must satisfy the following properties (see Footnote 14):

    \langle x + y, z \rangle = \langle x, z \rangle + \langle y, z \rangle        (1.175)
    \langle \alpha x, y \rangle = \alpha \langle x, y \rangle        (1.176)
    \langle x, y \rangle = \langle y, x \rangle        (1.177)
    \langle x, x \rangle > 0 \quad \text{for } x \neq 0 .        (1.178)

Footnote 14: For simplicity, we'll restrict our attention to real-valued inner products even though many important examples are complex-valued.

For each inner product space there is a natural notion of norm. We call this the induced norm, and it is defined in terms of the inner product as follows:

    \|x\| = \sqrt{\langle x, x \rangle} .        (1.179)

In turn, from the induced norm we get the associated metric

    d(x, y) = \sqrt{\langle x - y, x - y \rangle} .

From the inner product and the induced norm, we arrive at a definition of the angle θ between two elements x, y \in V. In particular, we have

    \cos\theta = \frac{\langle x, y \rangle}{\|x\|\, \|y\|} .        (1.180)

One enormously useful inequality that applies to inner product spaces is the Cauchy-Schwarz inequality: for any x and y in V,

    |\langle x, y \rangle| \le \|x\|\, \|y\| .        (1.181)

A proof is as follows. For any α, we have, from the properties of a norm and (1.179),

    \langle x - \alpha y, x - \alpha y \rangle = \|x - \alpha y\|^2 \ge 0 .        (1.182)

Exploiting (1.175)-(1.178), we can rewrite the left-hand side of (1.182) to get

    \langle x - \alpha y, x - \alpha y \rangle = \|x\|^2 - 2\alpha \langle x, y \rangle + \alpha^2 \|y\|^2 \ge 0 .        (1.183)

Then if we let α = ⟨x, y⟩ / ⟨y, y⟩, (1.183) becomes

    \|x\|^2 - \frac{\langle x, y \rangle^2}{\|y\|^2} \ge 0 ,        (1.184)

which can be rewritten in the form (1.181). As a final comment, note from (1.182) and (1.178) that equality in (1.181) holds if and only if x - αy = 0 for some α.

Inner product spaces that are complete also have a special and arcane name: they are referred to as Hilbert spaces.
As examples, R^N, ℓ_2(Z), L_2(R), and L_2(Ω) are also all complete inner product (i.e., Hilbert) spaces. The corresponding inner products are defined by, respectively,

    \langle x, y \rangle = x^T y = \sum_{n=1}^{N} x_n y_n
    \langle x[\cdot], y[\cdot] \rangle = \sum_{n} x[n]\, y[n]
    \langle x(\cdot), y(\cdot) \rangle = \int x(t)\, y(t)\, dt
    \langle x(\cdot), y(\cdot) \rangle = E[xy] .

Note that the Cauchy-Schwarz inequality for L_2(Ω) implies that

    (E[xy])^2 \le E[x^2]\, E[y^2] ,        (1.185)

with equality if and only if x = αy for some α, i.e., if and only if x and y are scaled versions of the same random variable. Likewise, for the subspace of L_2(Ω) consisting of zero-mean, finite-variance random variables, the specialization of (1.185) yields the following property of the correlation coefficient mentioned earlier in the chapter:

    |\rho_{xy}| = \frac{|\operatorname{cov}(x, y)|}{\sqrt{\operatorname{var} x\, \operatorname{var} y}} \le 1 ,

again with equality if and only if x = αy for some α.
1.7.7 Orthonormal Bases
With inner product spaces we have enough structure that we can finally talk about the concept of orthogonality. Specifically, we say that elements $x$ and $y$ in $V$ are orthogonal, denoted $x \perp y$, when their inner product is zero, i.e.,

$x \perp y \iff \langle x, y \rangle = 0 .$   (1.186)

In turn, we can talk about orthogonal complements of a subspace. In particular, if $W \subset V$ is a subspace of $V$, then its orthogonal complement, denoted $W^{\perp}$, is defined as follows:

$W^{\perp} = \{ x \in V : \langle x, y \rangle = 0 \text{ for all } y \in W \} .$

Note that $V$ is the direct sum of $W$ and $W^{\perp}$, which we write as $V = W \oplus W^{\perp}$. This means that every $x \in V$ can be uniquely expressed as the sum $x = u + v$ where $u \in W$ and $v \in W^{\perp}$.
A collection of elements $x_1, x_2, \dots, x_K$ in $V$ is an orthonormal basis for $V$ if the elements are mutually orthogonal and of unit length, i.e., $\langle x_k, x_l \rangle = 0$ for $k \ne l$ and $\|x_k\| = 1$, and if every $x \in V$ can be expressed in the form

$x = \sum_{k=1}^{K} \alpha_k x_k$

where the $\alpha_k$ are projections of $x$ onto the basis functions $x_k$, i.e., $\alpha_k = \langle x, x_k \rangle$.
An important identity that applies to orthonormal bases is Parseval's relation. In particular, if $x_1, x_2, \dots, x_K$ is an orthonormal basis for $V$ and if $x$ and $y$ are arbitrary elements of $V$, then

$\langle x, y \rangle = \sum_{k=1}^{K} \alpha_k \beta_k$

where $\alpha_k = \langle x, x_k \rangle$ and $\beta_k = \langle y, x_k \rangle$. A special case of Parseval's relation that corresponds to choosing $x = y$ is the Plancherel formula:

$\|x\|^2 = \sum_{k=1}^{K} |\alpha_k|^2$

where again $\alpha_k = \langle x, x_k \rangle$.
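As a numerical illustration of Parseval's relation and the Plancherel formula in $\mathbb{R}^N$, the following sketch (assuming Python with NumPy) builds an orthonormal basis from the QR factorization of a random matrix and compares $\langle x, y \rangle$ with $\sum_k \alpha_k \beta_k$; the names are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    N = 6

    # Columns of Q form an orthonormal basis x_1, ..., x_N for R^N.
    Q, _ = np.linalg.qr(rng.standard_normal((N, N)))

    x = rng.standard_normal(N)
    y = rng.standard_normal(N)

    # Expansion coefficients: alpha_k = <x, x_k>, beta_k = <y, x_k>.
    alpha = Q.T @ x
    beta = Q.T @ y

    print(np.isclose(x @ y, alpha @ beta))        # Parseval's relation
    print(np.isclose(x @ x, np.sum(alpha**2)))    # Plancherel formula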
If you are interested in exploring the concept of an abstract vector space in more detail, a good starting point is, e.g., A. W. Naylor and G. R. Sell, Linear Operator Theory in Engineering and Science, Springer-Verlag, New York, 1982.
1.A LINEAR ALGEBRA AND EUCLIDEAN VECTOR SPACE
1.A.1 Vectors and Matrices
In this course vectors will be column matrices, and as such will be denoted by boldface lowercase characters; for example,

$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$   (1.187)

where $x_1, x_2, \dots, x_n$ are either real or complex numbers. The set of all such $n$-dimensional vectors of real numbers is denoted by $\mathbb{R}^n$. The corresponding set of all $n$-dimensional vectors of complex numbers is denoted by $\mathbb{C}^n$. The transpose of a column vector $x$ is the row vector

$x^T = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} .$   (1.188)
Vector addition and scalar multiplication are defined componentwise, i.e.,

$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix}$   (1.189)

and

$\alpha \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} \alpha x_1 \\ \alpha x_2 \\ \vdots \\ \alpha x_n \end{bmatrix}$   (1.190)

where $\alpha$ is a real or complex number.
A set of vectors $x_1, x_2, \dots, x_r$ in $\mathbb{R}^n$ is linearly independent if$^{15}$

$\alpha_1 x_1 + \alpha_2 x_2 + \cdots + \alpha_r x_r = 0$   (1.191)

implies that

$\alpha_1 = \alpha_2 = \cdots = \alpha_r = 0 .$   (1.192)

Otherwise the set of vectors is said to be linearly dependent, and in this case one of the $x_i$ can be written as a linear combination of the others. For example, if $\alpha_1 \ne 0$ in (1.191), then

$x_1 = \beta_2 x_2 + \cdots + \beta_r x_r$   (1.193)

with

$\beta_2 = -\alpha_2 / \alpha_1, \;\dots,\; \beta_r = -\alpha_r / \alpha_1 .$   (1.194)
In $\mathbb{R}^n$ there exist sets of at most $n$ linearly independent vectors. Any such set of $n$ linearly independent vectors $x_1, x_2, \dots, x_n$ forms a basis for $\mathbb{R}^n$. That is, any $x \in \mathbb{R}^n$ can be written as a linear combination of $x_1, x_2, \dots, x_n$.
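To make the basis idea concrete, the following sketch (Python with NumPy assumed; the vectors chosen are arbitrary) checks linear independence of a candidate set via the matrix rank and then solves for the coefficients that express a given $x \in \mathbb{R}^3$ in that basis.

    import numpy as np

    # Columns of B are the candidate basis vectors x_1, x_2, x_3 for R^3.
    B = np.array([[1.0, 1.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])

    # Full column rank means the columns are linearly independent.
    print(np.linalg.matrix_rank(B) == 3)

    # Any x in R^3 is a unique linear combination of the columns:
    # solve B a = x for the coefficient vector a.
    x = np.array([2.0, -1.0, 3.0])
    a = np.linalg.solve(B, x)
    print(np.allclose(B @ a, x), a)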
Matrices will in general be denoted by boldface uppercase characters. The element in the $i$th row and $j$th column of $A$ will be denoted by $a_{ij}$ or, alternatively, by $[A]_{ij}$. If $A$ is $m \times n$, i.e., if $A$ has $m$ rows and $n$ columns, then

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} .$   (1.195)

The set of all $m \times n$ real-valued matrices is denoted $\mathbb{R}^{m \times n}$. As with vectors, we define matrix addition and scalar multiplication componentwise. If $m = n$, $A$ is a square matrix. The transpose of an $m \times n$ matrix $A$ is the $n \times m$ matrix

$A^T = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix} .$   (1.196)

$^{15}$The symbol $0$ denotes the matrix or vector of appropriate dimension, all of whose components are zero.
A square matrix is said to be symmetric if $A^T = A$. A diagonal square matrix is one of the form

$A = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$   (1.197)

where the last expression in (1.197) introduces notation that is sometimes convenient. The identity matrix is defined as

$I = \mathrm{diag}(1, 1, \dots, 1) .$   (1.198)

On (rare) occasions when there is risk of ambiguity, we will write $I_n$ to make explicit the size (i.e., $n \times n$) of the identity matrix. The trace of a square matrix $A$ is the sum of its diagonal elements:

$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii} .$   (1.199)
Let $A$ be an $m \times n$ matrix and $B$ an $n \times p$ matrix. Then we can define the product

$C = AB .$   (1.200)

Here $C$ is an $m \times p$ matrix whose elements are given by

$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} .$   (1.201)

Note the required compatibility condition: the number of columns of $A$ must equal the number of rows of $B$ for $AB$ to be defined. Note too that $BA$ may not be defined even if $AB$ is (e.g., let $m = 7$, $n = 4$, $p = 3$). Even if $BA$ is defined, it is generally not the same size as $AB$. For example, if $A$ is $2 \times 4$ and $B$ is $4 \times 2$, then $AB$ is $2 \times 2$, but $BA$ is $4 \times 4$. If $A$ and $B$ are square and of the same size, then $AB$ and $BA$ are as well. In general, however, $AB \ne BA$. Note also that

$AI = IA = A$   (1.202)

$(AB)^T = B^T A^T .$   (1.203)

Also, if $A \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^n$, then $Ax \in \mathbb{R}^m$. In addition, if both $AB$ and $BA$ are defined,

$\mathrm{tr}(AB) = \mathrm{tr}(BA) .$   (1.204)
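The shape bookkeeping and the trace identity (1.204) are easy to check numerically. A minimal sketch, assuming Python with NumPy, follows the $2 \times 4$ and $4 \times 2$ example from the text.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((2, 4))
    B = rng.standard_normal((4, 2))

    AB, BA = A @ B, B @ A
    print(AB.shape, BA.shape)            # (2, 2) versus (4, 4)

    # tr(AB) = tr(BA) even though AB and BA have different sizes.
    print(np.isclose(np.trace(AB), np.trace(BA)))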
Let $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$. Then the dyadic or outer product of $x$ and $y$ is the $n \times m$ matrix

$x y^T = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_m \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_m \\ \vdots & \vdots & \ddots & \vdots \\ x_n y_1 & x_n y_2 & \cdots & x_n y_m \end{bmatrix} \in \mathbb{R}^{n \times m} .$   (1.205)
If $n = m$ we can also define the dot or inner product

$x^T y = \sum_{i=1}^{n} x_i y_i = y^T x \in \mathbb{R} .$   (1.206)

Two $n$-vectors $x$ and $y$ are orthogonal, denoted $x \perp y$, if

$x^T y = 0 .$   (1.207)

Note that a set of nonzero, mutually orthogonal vectors is linearly independent. The length or norm of $x \in \mathbb{R}^n$ is

$\|x\| = (x^T x)^{1/2} = \left( x_1^2 + x_2^2 + \cdots + x_n^2 \right)^{1/2} .$   (1.208)

Note, too, that

$\|x\|^2 = x^T x = \mathrm{tr}(x^T x) = \mathrm{tr}(x x^T) .$   (1.209)
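The identities in (1.205)–(1.209) can be spot-checked as follows (Python with NumPy assumed); np.outer forms $x y^T$, and np.trace closes the loop on (1.209).

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.standard_normal(5)
    y = rng.standard_normal(5)

    # Outer product x y^T is 5 x 5; inner product x^T y is a scalar.
    outer = np.outer(x, y)
    inner = x @ y
    print(outer.shape, np.isclose(inner, y @ x))

    # (1.209): ||x||^2 = x^T x = tr(x x^T).
    print(np.isclose(np.dot(x, x), np.trace(np.outer(x, x))))
    print(np.isclose(np.linalg.norm(x)**2, np.dot(x, x)))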
On occasion we will find it useful to deal with matrices written in block form, such as

$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$   (1.210)

where $A_{11}$ and $A_{12}$ have the same number of rows, and $A_{11}$ and $A_{21}$ have the same number of columns. The product of two matrices in block form is computed in a manner analogous to usual matrix multiplication. For example,

$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} = \begin{bmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{bmatrix}$   (1.211)

where the blocks on the left side of (1.211) must be partitioned in a compatible fashion, and where the order of multiplication of the various terms on the right-hand side is important.
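A short sketch (Python with NumPy assumed; the block sizes are an arbitrary choice) confirms that multiplying compatible blocks as in (1.211) reproduces the ordinary product of the assembled matrices.

    import numpy as np

    rng = np.random.default_rng(4)
    # Compatible 2 x 2 partitions: each block here is 3 x 3.
    A11, A12, A21, A22 = (rng.standard_normal((3, 3)) for _ in range(4))
    B11, B12, B21, B22 = (rng.standard_normal((3, 3)) for _ in range(4))

    A = np.block([[A11, A12], [A21, A22]])
    B = np.block([[B11, B12], [B21, B22]])

    # Blockwise product following (1.211).
    C_blocks = np.block([[A11 @ B11 + A12 @ B21, A11 @ B12 + A12 @ B22],
                         [A21 @ B11 + A22 @ B21, A21 @ B12 + A22 @ B22]])

    print(np.allclose(A @ B, C_blocks))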
1.A.2 Matrix Inverses and Determinants
An $n \times n$ matrix $A$ is invertible or nonsingular if there exists another $n \times n$ matrix $A^{-1}$, called the inverse of $A$, such that

$A A^{-1} = A^{-1} A = I .$   (1.212)

If no such matrix exists, $A$ is said to be singular or, equivalently, noninvertible. Consider the set of equations

$Ax = y$   (1.213)

where $A$ is $n \times n$. This equation has a unique solution $x$ for any $y$ if and only if $A$ is invertible (in which case the solution is $A^{-1} y$). Consequently, the equation $Ax = 0$ has a nonzero solution if and only if $A$ is not invertible.
The determinant of a square matrix $A$, denoted by $|A|$ or $\det(A)$, can be computed recursively. If $A$ is $1 \times 1$, then $|A| = a_{11}$. If $A$ is $n \times n$, then we can compute $|A|$ by expanding by minors using any row or column. For example, using the $i$th row,

$|A| = a_{i1} A_{i1} + a_{i2} A_{i2} + \cdots + a_{in} A_{in} ,$   (1.214)

or, using the $j$th column,

$|A| = a_{1j} A_{1j} + a_{2j} A_{2j} + \cdots + a_{nj} A_{nj} ,$   (1.215)

where the cofactors $A_{ij}$ are given by

$A_{ij} = (-1)^{i+j} \det(M_{ij})$   (1.216)

and where $M_{ij}$ is the $(n-1) \times (n-1)$ matrix obtained from $A$ by deleting the $i$th row and $j$th column.

As a simple example, we have

$\begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11} a_{22} - a_{12} a_{21} .$   (1.217)
As a more complex example, we have

$\begin{vmatrix} 2 & 0 & 0 & 3 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 \\ 5 & 1 & 1 & 9 \end{vmatrix} = 2(-1)^{1+1} \begin{vmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 1 & 1 & 9 \end{vmatrix} + 0(-1)^{1+2} \begin{vmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 5 & 1 & 9 \end{vmatrix} + 0(-1)^{1+3} \begin{vmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 5 & 1 & 9 \end{vmatrix} + 3(-1)^{1+4} \begin{vmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 5 & 1 & 1 \end{vmatrix}$

$= 2 \cdot 1 \cdot (-1)^{1+1} \begin{vmatrix} 1 & 0 \\ 1 & 9 \end{vmatrix} - 3 \cdot 1 \cdot (-1)^{1+1} \begin{vmatrix} 1 & 1 \\ 1 & 1 \end{vmatrix} - 3 \cdot 1 \cdot (-1)^{1+2} \begin{vmatrix} 1 & 1 \\ 5 & 1 \end{vmatrix}$

$= 2 \cdot 9 - 3 \cdot 0 + 3 \cdot (-4) = 6 .$
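The arithmetic in the $4 \times 4$ example above can be confirmed with a one-line computation (Python with NumPy assumed):

    import numpy as np

    A = np.array([[2, 0, 0, 3],
                  [1, 1, 0, 0],
                  [1, 1, 1, 0],
                  [5, 1, 1, 9]], dtype=float)

    print(np.linalg.det(A))   # approximately 6, matching the cofactor expansion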
Several useful properties of determinants are

$|AB| = |A| \, |B|$   (1.218)

$|\alpha A| = \alpha^n |A|$   (1.219)

$|A^T| = |A|$   (1.220)

$|A^{-1}| = \frac{1}{|A|} .$   (1.221)

The invertibility of a matrix $A$ is equivalent to each of the following statements:

1. $|A| \ne 0$
2. All of the columns of $A$ are linearly independent.
3. All of the rows of $A$ are linearly independent.
The inverse of $A$ can be expressed as

$A^{-1} = \frac{1}{|A|} \, \mathrm{adj}\, A$   (1.222)

where $[\mathrm{adj}\, A]_{ij} = A_{ji}$ is referred to as the adjugate or adjoint matrix, with $A_{ij}$ as defined in (1.216). As a simple example, we have

$\begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}^{-1} = \frac{1}{a_{11} a_{22} - a_{12} a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} .$   (1.223)
Some useful properties of inverses are

$(A^T)^{-1} = (A^{-1})^T$   (1.224)

$(AB)^{-1} = B^{-1} A^{-1}$   (1.225)

and

$A = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) \implies A^{-1} = \mathrm{diag}\!\left( \frac{1}{\lambda_1}, \frac{1}{\lambda_2}, \dots, \frac{1}{\lambda_n} \right) .$   (1.226)
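The inverse identities (1.224)–(1.226) can be verified numerically as follows (Python with NumPy assumed); random Gaussian matrices are invertible with probability one, but a singular draw would make np.linalg.inv raise an error.

    import numpy as np

    rng = np.random.default_rng(5)
    A = rng.standard_normal((4, 4))
    B = rng.standard_normal((4, 4))

    print(np.allclose(np.linalg.inv(A.T), np.linalg.inv(A).T))        # (1.224)
    print(np.allclose(np.linalg.inv(A @ B),
                      np.linalg.inv(B) @ np.linalg.inv(A)))           # (1.225)

    d = np.array([2.0, 0.5, -3.0])
    D = np.diag(d)
    print(np.allclose(np.linalg.inv(D), np.diag(1.0 / d)))            # (1.226)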
A matrix $P$ is said to be orthogonal if

$P^{-1} = P^T .$   (1.227)

If we think of $P$ as consisting of a set of columns, i.e.,

$P = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} ,$   (1.228)

then, in general,

$P^T P = \begin{bmatrix} x_1^T x_1 & x_1^T x_2 & \cdots & x_1^T x_n \\ x_2^T x_1 & x_2^T x_2 & \cdots & x_2^T x_n \\ \vdots & \vdots & \ddots & \vdots \\ x_n^T x_1 & x_n^T x_2 & \cdots & x_n^T x_n \end{bmatrix} .$   (1.229)

Consequently, we see that $P$ is orthogonal if and only if its columns are orthonormal, i.e., if $x_i \perp x_j$ for $i \ne j$ and $\|x_i\| = 1$ for each $i$.
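A quick numerical check of (1.227)–(1.229) (Python with NumPy assumed): the QR factorization of a random matrix supplies an orthogonal $P$, whose columns are then orthonormal.

    import numpy as np

    rng = np.random.default_rng(6)
    P, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # P has orthonormal columns

    print(np.allclose(P.T @ P, np.eye(4)))             # (1.229) reduces to I
    print(np.allclose(np.linalg.inv(P), P.T))          # (1.227)
    print(np.allclose(np.linalg.norm(P, axis=0), 1.0)) # each column has unit length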
There are also some useful results for block matrices. For example, for a block diagonal matrix,

$A = \mathrm{diag}(F_1, F_2, \dots, F_r) \implies A^{-1} = \mathrm{diag}(F_1^{-1}, F_2^{-1}, \dots, F_r^{-1}) .$   (1.230)

Also, we have the formulas

$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} & -(A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} A_{12} A_{22}^{-1} \\ -A_{22}^{-1} A_{21} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} & A_{22}^{-1} + A_{22}^{-1} A_{21} (A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} A_{12} A_{22}^{-1} \end{bmatrix}$   (1.231)
and

$\det \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} = \left| A_{11} - A_{12} A_{22}^{-1} A_{21} \right| \, |A_{22}| ,$   (1.232)

which are valid if $A_{22}$ is nonsingular. These formulas can be verified by exploiting the identity

$\begin{bmatrix} I & -A_{12} A_{22}^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} \begin{bmatrix} I & 0 \\ -A_{22}^{-1} A_{21} & I \end{bmatrix} = \begin{bmatrix} A_{11} - A_{12} A_{22}^{-1} A_{21} & 0 \\ 0 & A_{22} \end{bmatrix} .$   (1.233)
If on the other hand $A_{11}$ is nonsingular, then as an alternative to (1.231) we have

$\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1} = \begin{bmatrix} A_{11}^{-1} + A_{11}^{-1} A_{12} (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} \\ -(A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} A_{21} A_{11}^{-1} & (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} \end{bmatrix} .$   (1.234)
Other useful results are obtained by comparing (1.231) and (1.234). For example, equating the upper left blocks in these two expressions yields the useful identity

$(A_{11} - A_{12} A_{22}^{-1} A_{21})^{-1} = A_{11}^{-1} + A_{11}^{-1} A_{12} (A_{22} - A_{21} A_{11}^{-1} A_{12})^{-1} A_{21} A_{11}^{-1} .$   (1.235)
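The identity (1.235) is easy to test numerically; the sketch below (Python with NumPy assumed) draws a random symmetric positive definite block matrix so that all the required inverses exist.

    import numpy as np

    rng = np.random.default_rng(7)
    n1, n2 = 3, 4
    M = rng.standard_normal((n1 + n2, n1 + n2))
    S = M @ M.T + (n1 + n2) * np.eye(n1 + n2)      # symmetric positive definite

    A11, A12 = S[:n1, :n1], S[:n1, n1:]
    A21, A22 = S[n1:, :n1], S[n1:, n1:]
    inv = np.linalg.inv

    lhs = inv(A11 - A12 @ inv(A22) @ A21)
    rhs = inv(A11) + inv(A11) @ A12 @ inv(A22 - A21 @ inv(A11) @ A12) @ A21 @ inv(A11)
    print(np.allclose(lhs, rhs))                   # (1.235)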
1.A.3 Eigenvalues and Eigenvectors
Let $A$ be an $n \times n$ real matrix. A scalar $\lambda$ is called an eigenvalue of $A$ with associated nonzero eigenvector $x$ if

$Ax = \lambda x .$   (1.236)

The above equation can be rewritten as

$(\lambda I - A) x = 0 .$   (1.237)

Thus $\lambda$ is an eigenvalue of $A$ if and only if (1.237) has a solution $x \ne 0$. This will be the case if and only if $\lambda I - A$ is singular, i.e., if and only if $\lambda$ is a solution of the characteristic equation

$\phi_A(\lambda) = |\lambda I - A| = 0 .$   (1.238)

Here $\phi_A(\lambda)$ is called the characteristic polynomial of $A$ and is of the form

$\phi_A(\lambda) = \lambda^n + a_{n-1} \lambda^{n-1} + \cdots + a_1 \lambda + a_0$   (1.239)

$= (\lambda - \lambda_1)(\lambda - \lambda_2) \cdots (\lambda - \lambda_n) .$   (1.240)

The $\lambda_1, \lambda_2, \dots, \lambda_n$ in (1.240) are the $n$ eigenvalues, which may or may not be distinct. Some of the $\lambda_i$ may in general be complex, in which case they occur in complex conjugate pairs. However, if $A$ is symmetric, the $\lambda_i$ are always real. Also note that

$|A| = (-1)^n \phi_A(0) = (-1)^n a_0 = \prod_{i=1}^{n} \lambda_i$   (1.241)

so that $A$ is invertible if and only if all of the eigenvalues of $A$ are nonzero. In addition, one can show that

$\mathrm{tr}(A) = -a_{n-1} = \sum_{i=1}^{n} \lambda_i .$   (1.242)
If $\lambda_i$ is an eigenvalue of $A$, then we can determine an associated eigenvector $x_i$ by solving the set of linear equations

$A x_i = \lambda_i x_i .$   (1.243)

Note that if $x_i$ is an eigenvector, so is $\alpha x_i$ for any nonzero scalar $\alpha$. Consequently, we can always adjust the length of the eigenvectors arbitrarily, and, in particular, we can normalize them to have unit length. It is also possible to show that each distinct $\lambda_i$ has a linearly independent $x_i$ corresponding to it. If, on the other hand, $\lambda_i$ has multiplicity $k > 1$, i.e., if $\lambda_i$ is a $k$th-order root of $\phi_A(\lambda)$, then in general there may be anywhere from 1 to $k$ linearly independent eigenvectors associated with $\lambda_i$. Note that we can always combine (1.243) for different values of $i$ into the single equation

$A \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix} .$   (1.244)
If $A$ is symmetric, some special properties result. First, eigenvectors corresponding to distinct eigenvalues are not only linearly independent, but orthogonal. Second, eigenvalues with multiplicity $k$ have a full set of (i.e., $k$) linearly independent eigenvectors, which can also be chosen to be orthogonal to one another. Hence, symmetric matrices always have a full set of linearly independent eigenvectors that can be chosen so as to be orthonormal as well.

In general, an $n \times n$ matrix $A$ that has a full set of $n$ linearly independent eigenvectors $x_1, x_2, \dots, x_n$ is called diagonalizable. For a diagonalizable matrix, the matrix of eigenvectors in (1.244), viz.,

$\begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \triangleq P^{-1} ,$   (1.245)

is nonsingular. Hence, we can write

$P A P^{-1} = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) .$   (1.246)

We emphasize that if $A$ is a symmetric matrix we can choose the $x_i$ to be orthonormal so that $P^{-1} = P^T$, further simplifying manipulations. The matrix representation (1.246) will prove very useful. It is an example of a similarity transformation, which we briefly describe next.
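With the convention of (1.245), where the columns of $P^{-1}$ are the eigenvectors, a symmetric example can be checked numerically (Python with NumPy assumed); np.linalg.eigh returns orthonormal eigenvectors as the columns of its second output, so $P$ is simply its transpose.

    import numpy as np

    rng = np.random.default_rng(8)
    M = rng.standard_normal((4, 4))
    A = M + M.T                          # symmetric, so eigenvalues are real

    lam, V = np.linalg.eigh(A)           # columns of V are orthonormal eigenvectors
    P_inv = V                            # matrix of eigenvectors, as in (1.245)
    P = V.T                              # orthonormal columns give P^{-1} = P^T

    print(np.allclose(P @ A @ P_inv, np.diag(lam)))     # (1.246)
    print(np.isclose(np.linalg.det(A), np.prod(lam)))   # (1.241)
    print(np.isclose(np.trace(A), np.sum(lam)))         # (1.242)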
1.A.4 Similarity Transformation
Let $A$ be an $n \times n$ matrix, and let $P$ be an invertible matrix of the same size. We can then define a similarity transformation of $A$ as

$B = P A P^{-1} .$   (1.247)

We sometimes say that $B$ is similar to $A$. A similarity transformation can be interpreted as arising out of a change of coordinates. To see this, suppose

$y = Ax$   (1.248)

and consider the change of coordinates

$u = Px$   (1.249)

$v = Py ,$   (1.250)

so that (since $x = P^{-1} u$) each component of $u$, for example, is a weighted sum of components of $x$ and vice versa. Then

$v = Bu$   (1.251)

with $B$ as given in (1.247). Furthermore,

$\phi_B(\lambda) = |\lambda I - B| = |\lambda P P^{-1} - P A P^{-1}| = |P (\lambda I - A) P^{-1}| = |P| \, |\lambda I - A| \, |P^{-1}| = |\lambda I - A| = \phi_A(\lambda) ,$

so the eigenvalues of $B$ and $A$ are the same. Thus by (1.241) and (1.242), $A$ and $B$ have the same determinant and trace, respectively.
1.A.5 Positive Definite Matrices
A symmetric square matrix $A$ is positive semidefinite, written $A \ge 0$, if and only if

$x^T A x \ge 0$   (1.252)

for all vectors $x$. The matrix $A$ is positive definite, written $A > 0$, if and only if

$x^T A x > 0 \quad \text{for all } x \ne 0 .$   (1.253)

It is not difficult to see that a positive semidefinite matrix is positive definite if and only if it is invertible.

Some basic facts about positive semidefinite matrices are the following:

1. If $A \ge 0$ and $B \ge 0$, then $A + B \ge 0$, since

$x^T (A + B) x = x^T A x + x^T B x .$   (1.254)

2. If either $A$ or $B$ in (1.254) is positive definite, then so is $A + B$. This again follows from (1.254).

3. If $A > 0$, then $A^{-1} > 0$, since

$x^T A^{-1} x = (A^{-1} x)^T A (A^{-1} x) > 0 \quad \text{if } x \ne 0 .$   (1.255)

4. If $Q \ge 0$, then $F^T Q F \ge 0$ for any (not necessarily square) matrix $F$ for which $F^T Q F$ is defined. This follows from

$x^T (F^T Q F) x = (Fx)^T Q (Fx) \ge 0 .$   (1.256)

5. If $Q > 0$ and $F$ is invertible, then $F^T Q F > 0$. This also follows from (1.256).
One test for positive definiteness is Sylvester's Test. Let

$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{12} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{nn} \end{bmatrix} .$   (1.257)

Then $A$ is positive definite if and only if the determinant of every upper left submatrix of $A$ is positive, i.e.,$^{16}$

$a_{11} > 0, \qquad \begin{vmatrix} a_{11} & a_{12} \\ a_{12} & a_{22} \end{vmatrix} > 0, \qquad \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{12} & a_{22} & a_{23} \\ a_{13} & a_{23} & a_{33} \end{vmatrix} > 0, \qquad \text{etc.}$   (1.258)
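Sylvester's test translates directly into a few lines of code. The sketch below (Python with NumPy assumed; the function name check_sylvester is an arbitrary choice) computes the leading principal minors of (1.258). As the accompanying footnote cautions, nonnegative leading minors do not certify positive semidefiniteness.

    import numpy as np

    def check_sylvester(A: np.ndarray) -> bool:
        """Return True if the symmetric matrix A passes Sylvester's test (1.258)."""
        n = A.shape[0]
        return all(np.linalg.det(A[:k, :k]) > 0 for k in range(1, n + 1))

    A = np.array([[2.0, -1.0, 0.0],
                  [-1.0, 2.0, -1.0],
                  [0.0, -1.0, 2.0]])     # a classic positive definite matrix
    print(check_sylvester(A))            # True

    B = np.array([[0.0, 0.0],
                  [0.0, -1.0]])          # leading minors are 0 and 0, yet B is not PSD
    print(check_sylvester(B))            # False: the test certifies definiteness only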
Let $A$ be symmetric and let $P$ be the orthogonal matrix of eigenvectors so that [cf. (1.246)]

$P A P^T = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) \triangleq \Lambda .$   (1.259)

Then, letting $z = Px$, we have

$x^T A x = x^T P^T (P A P^T) P x = z^T \Lambda z = \lambda_1 z_1^2 + \lambda_2 z_2^2 + \cdots + \lambda_n z_n^2 .$   (1.260)

From this we can conclude that a symmetric matrix $A$ is positive semidefinite (positive definite) if and only if all its eigenvalues are nonnegative (positive).

Another characterization of positive semidefinite matrices is in terms of their square root matrices. In particular, any $A \ge 0$ has a square root matrix $F$ such that

$A = F^T F .$   (1.261)
$^{16}$Beware, however: there is no corresponding test for positive semidefiniteness that involves examining upper left submatrices for nonnegative determinants. Consider, e.g., the matrix

$\begin{bmatrix} 0 & 0 \\ 0 & -1 \end{bmatrix} .$

For example, one choice of square root matrix in (1.261) is

$F = \Lambda^{1/2} P$   (1.262)

where

$\Lambda^{1/2} = \mathrm{diag}\!\left( \sqrt{\lambda_1}, \sqrt{\lambda_2}, \dots, \sqrt{\lambda_n} \right) .$   (1.263)
Note that the F we choose in (1.262) is invertible if and only if A > 0.
In general, the square root matrix as defined in (1.261) is not unique. For example, let $Q$ be any orthogonal matrix, and let

$\tilde{F} = QF .$   (1.264)

Then $\tilde{F}$ is also a valid square root matrix for $A$, i.e.,

$\tilde{F}^T \tilde{F} = F^T Q^T Q F = F^T I F = F^T F = A .$   (1.265)

However, choosing $Q = P^T$ in (1.264) gives the positive semidefinite square root matrix

$\tilde{F} = P^T \Lambda^{1/2} P .$   (1.266)

In fact, (1.266) is the unique positive semidefinite square root matrix associated with $A$, and hence we will reserve the notation $A^{1/2}$ for this particular matrix.
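The construction of $A^{1/2}$ via the eigendecomposition can be sketched as follows (Python with NumPy assumed); it builds $P^T \Lambda^{1/2} P$ as in (1.266) and confirms that the result is symmetric positive semidefinite and squares back to $A$.

    import numpy as np

    rng = np.random.default_rng(9)
    M = rng.standard_normal((4, 4))
    A = M @ M.T                               # symmetric positive semidefinite

    lam, V = np.linalg.eigh(A)                # A = V diag(lam) V^T, lam >= 0
    lam = np.clip(lam, 0.0, None)             # guard against tiny negative round-off
    A_half = V @ np.diag(np.sqrt(lam)) @ V.T  # P^T Lambda^{1/2} P with P = V^T, (1.266)

    print(np.allclose(A_half, A_half.T))                  # symmetric
    print(np.all(np.linalg.eigvalsh(A_half) >= -1e-10))   # eigenvalues nonnegative
    print(np.allclose(A_half @ A_half, A))                # A^{1/2} A^{1/2} = F^T F = A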
As a final important remark, it is often convenient to make use of matrix inequalities of the form

$A \ge B ,$   (1.267)

which are interpreted in the sense of positive definiteness. In particular, (1.267) means that $A - B \ge 0$, i.e., that the difference matrix $A - B$ is positive semidefinite. Similarly, the notation $A > B$ means that $A - B$ is positive definite, and the notation $A < B$ means that $B - A$ is positive definite. Also, it is occasionally convenient to use the terminology negative definite to refer to a matrix $A$ satisfying $A < 0$, and negative semidefinite to refer to a matrix $A$ satisfying $A \le 0$. Using these conventions, we have, for example, that $-A$ is negative definite whenever $A$ is positive definite, etc. A matrix that is neither positive semidefinite nor negative semidefinite is termed indefinite.

We emphasize that (1.267) does not mean that every entry of $A$ is at least as big as the corresponding entry of $B$. However, if we choose, for any $j$,

$[x]_i = \begin{cases} 1 & i = j \\ 0 & \text{otherwise} \end{cases} ,$

then the definition of positive semidefiniteness, i.e., (1.252), implies that

$[A]_{jj} \ge 0 \quad \text{for all } j .$   (1.268)

Hence, using (1.268) we can conclude that (1.267) implies, among other relationships, that every diagonal entry of $A$ is not less than the corresponding diagonal entry of $B$.
1.A.6 Subspaces
A subset $S \subset \mathbb{R}^n$ is a subspace if $S$ is closed under vector addition and scalar multiplication. Examples of subspaces of $\mathbb{R}^2$ are$^{17}$

$S_1 = \left\{ \begin{bmatrix} a \\ 0 \end{bmatrix} : a \in \mathbb{R} \right\}$   (1.269)

$S_2 = \left\{ \begin{bmatrix} a \\ 2a \end{bmatrix} : a \in \mathbb{R} \right\} .$   (1.270)

The dimension of a subspace equals the maximum number of vectors in $S$ that can form a linearly independent set.
Let $K$ be any subset of $\mathbb{R}^n$. The orthogonal complement of $K$ in $\mathbb{R}^n$ is defined as follows:

$K^{\perp} = \{ x \in \mathbb{R}^n \;|\; x \perp y \text{ for all } y \in K \} .$   (1.271)

Note that $K^{\perp}$ is always a subspace, since for any $x_1, x_2 \in K^{\perp}$, any scalar $\alpha$, and any $y \in K$,

$(x_1 + x_2)^T y = x_1^T y + x_2^T y = 0$   (1.272)

$(\alpha x_1)^T y = \alpha x_1^T y = 0 ,$   (1.273)

so $x_1 + x_2 \in K^{\perp}$ and $\alpha x_1 \in K^{\perp}$.
Let $d$ be a single nonzero vector in $\mathbb{R}^n$ (so $\{d\}$ is not a subspace), and consider $d^{\perp} = \{ x \in \mathbb{R}^n : d^T x = 0 \}$, as depicted in Figure 1.4. This hyperplane splits $\mathbb{R}^n$ into two half-spaces, one corresponding to those $x$ for which $d^T x > 0$, the other to those for which $d^T x < 0$.
For additional insights into the concepts and results summarized in this section, see, e.g., G. S. Strang, Linear Algebra and its Applications, 3rd ed., Academic Press, New York, 1988.
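As a computational aside (Python with NumPy assumed; the example vectors are arbitrary), the orthogonal complement of the span of a set of vectors can be obtained from the singular value decomposition: when the vectors are collected as the columns of $K$, the null space of $K^T$ is exactly $K^{\perp}$.

    import numpy as np

    # Columns of K span a 2-dimensional subspace of R^4.
    K = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0],
                  [0.0, 0.0]])

    U, s, _ = np.linalg.svd(K)
    r = np.sum(s > 1e-12)          # numerical rank of K
    W_perp = U[:, r:]              # columns span the orthogonal complement K^perp

    # Every complement vector is orthogonal to every column of K.
    print(np.allclose(K.T @ W_perp, 0.0))
    print(W_perp.shape)            # (4, 2): dimension n - r = 2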
1.B VECTOR CALCULUS
Several results from vector calculus, which we briefly summarize here, will prove useful. First, consider a scalar function of a vector of $n$ real variables,

$f(x) = f\!\left( \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \right) = f(x_1, x_2, \dots, x_n) .$   (1.274)

$^{17}$Here $\mathbb{R}$ denotes the set of real numbers.
[Figure 1.4. An example of a one-dimensional orthogonal complement subspace: the set $\{ x : d^T x = 0 \}$ orthogonal to the vector $d$.]
Partial derivatives, integrals, etc., can all be defined in a useful manner. For example, it is convenient to define a Jacobian row vector, which consists of first partial derivatives:

$\frac{df}{dx}(x) = \nabla_x f(x) = \begin{bmatrix} \dfrac{\partial f}{\partial x_1}(x) & \dfrac{\partial f}{\partial x_2}(x) & \cdots & \dfrac{\partial f}{\partial x_n}(x) \end{bmatrix} .$   (1.275)

It is also convenient to define a Hessian matrix, which consists of second-order partial derivatives:

$\frac{d^2 f}{dx^2}(x) = \nabla_x^2 f(x) = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x_1^2}(x) & \dfrac{\partial^2 f}{\partial x_1 \partial x_2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \dfrac{\partial^2 f}{\partial x_2 \partial x_1}(x) & \dfrac{\partial^2 f}{\partial x_2^2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_2 \partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1}(x) & \dfrac{\partial^2 f}{\partial x_n \partial x_2}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_n^2}(x) \end{bmatrix} .$   (1.276)

Note that the Hessian is a symmetric matrix. Furthermore, the Hessian matrix at $x = x_0$ is positive semidefinite, i.e., $d^2 f / dx^2(x_0) \ge 0$, whenever $x_0$ corresponds to a local minimum of $f(\cdot)$. Similarly, if $x = x_0$ is the location of a local maximum of $f(\cdot)$, then the Hessian satisfies $d^2 f / dx^2(x_0) \le 0$, which means that $-d^2 f / dx^2(x_0)$ is positive semidefinite.
Using the notation (1.275) and (1.276) we can conveniently express the multivariable Taylor's series expansion as

$f(x + \delta x) = f(x) + \frac{df}{dx}(x) \, \delta x + \frac{1}{2!} (\delta x)^T \frac{d^2 f}{dx^2}(x) \, \delta x + \cdots$   (1.277)

where $\cdots$ in (1.277) denotes higher-order terms.
Finally, we briefly discuss vector-valued functions $f(\cdot)$. Derivatives, integrals, limits, etc., for functions of this type are defined componentwise, e.g., for a vector-valued function with a scalar argument, we have

$\frac{d}{dx} f(x) = \begin{bmatrix} \dfrac{d}{dx} f_1(x) \\ \dfrac{d}{dx} f_2(x) \\ \vdots \\ \dfrac{d}{dx} f_m(x) \end{bmatrix} .$   (1.278)

More generally, for a vector-valued function of a vector argument, we define the Jacobian matrix$^{18}$

$\frac{df}{dx}(x) = \nabla_x f(x) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1}(x) & \dfrac{\partial f_1}{\partial x_2}(x) & \cdots & \dfrac{\partial f_1}{\partial x_n}(x) \\ \dfrac{\partial f_2}{\partial x_1}(x) & \dfrac{\partial f_2}{\partial x_2}(x) & \cdots & \dfrac{\partial f_2}{\partial x_n}(x) \\ \vdots & \vdots & \ddots & \vdots \\ \dfrac{\partial f_m}{\partial x_1}(x) & \dfrac{\partial f_m}{\partial x_2}(x) & \cdots & \dfrac{\partial f_m}{\partial x_n}(x) \end{bmatrix} .$   (1.279)

Defining second-order derivatives for vector-valued functions of vector-valued arguments is possible but generally less useful (because of the need for three-dimensional matrices). In any case, a multidimensional Taylor series expansion can be obtained from (1.277) through componentwise operations on $f(\cdot)$, yielding

$f(x + \delta x) = f(x) + \frac{df}{dx}(x) \, \delta x + \cdots$   (1.280)

where, again, $\cdots$ denotes higher-order terms.
Some simple (but useful) examples of the calculations described in this section are the following:

$\frac{dx}{dx} = I$   (1.281)

$\frac{d}{dx}(Ax) = A$   (1.282)

$\frac{d}{dx}\left( x^T A x \right) = x^T (A + A^T)$   (1.283)

$\frac{d^2}{dx^2}\left( x^T A x \right) = A + A^T .$   (1.284)

$^{18}$Note that (1.279) is consistent both with (1.275) when $m = 1$ and with (1.278) when $n = 1$.
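The derivative formulas (1.283) and (1.284) can be checked against finite differences; the sketch below (Python with NumPy assumed; step sizes and tolerances are arbitrary choices) compares the analytic Jacobian row vector and Hessian of $f(x) = x^T A x$ with numerical approximations. Since $f$ is quadratic, the difference formulas incur no truncation error.

    import numpy as np

    rng = np.random.default_rng(10)
    n = 4
    A = rng.standard_normal((n, n))
    x = rng.standard_normal(n)

    f = lambda v: v @ A @ v        # scalar function f(x) = x^T A x
    I = np.eye(n)

    # Jacobian row vector of f via central differences; compare with (1.283).
    h = 1e-5
    num_grad = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in I])
    print(np.allclose(num_grad, x @ (A + A.T), atol=1e-6))

    # Hessian of f via second differences; compare with (1.284).
    h = 1e-3
    num_hess = np.array([[(f(x + h*ei + h*ej) - f(x + h*ei - h*ej)
                           - f(x - h*ei + h*ej) + f(x - h*ei - h*ej)) / (4 * h**2)
                          for ej in I] for ei in I])
    print(np.allclose(num_hess, A + A.T, atol=1e-6))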