Chapter 2

Linear Algebra
2.1 Introduction
We discuss vectors, matrices, transposes, covariance, correlation, diagonal and inverse
matrices, orthogonality, subspaces and eigenanalysis. An alternative source for much
of this material is the excellent book by Strang [58].

2.2 Transposes and Inner Products


A collection of variables may be treated as a single entity by writing them as a vector.
For example, the three variables x_1, x_2 and x_3 may be written as the vector
x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}    (2.1)
Bold face type is often used to denote vectors (scalars, i.e. single variables, are written
in normal type). Vectors can be written as column vectors, where the variables go
down the page, or as row vectors, where the variables go across the page (it needs to
be made clear when using vectors whether x means a row vector or a column vector;
most often it will mean a column vector, and in our text it will always mean a column
vector unless we say otherwise). To turn a column vector into a row vector we use
the transpose operator

x^T = [x_1, x_2, x_3]    (2.2)

The transpose operator also turns row vectors into column vectors. We now define
the inner product of two vectors
x^T y = [x_1, x_2, x_3] \begin{bmatrix} y_1 \\ y_2 \\ y_3 \end{bmatrix}    (2.3)
      = x_1 y_1 + x_2 y_2 + x_3 y_3
      = \sum_{i=1}^{3} x_i y_i
which is seen to be a scalar. The outer product of two vectors produces a matrix
x y^T = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} [y_1, y_2, y_3]    (2.4)

      = \begin{bmatrix} x_1 y_1 & x_1 y_2 & x_1 y_3 \\ x_2 y_1 & x_2 y_2 & x_2 y_3 \\ x_3 y_1 & x_3 y_2 & x_3 y_3 \end{bmatrix}
An N × M matrix has N rows and M columns. The ijth entry of a matrix is the
entry in the ith row and jth column. Given a matrix A (matrices are also often
written in bold type) the ijth entry is written as A_{ij}. When applying the transpose
operator to a matrix the ith row becomes the ith column. That is, if
A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}    (2.5)

then

A^T = \begin{bmatrix} a_{11} & a_{21} & a_{31} \\ a_{12} & a_{22} & a_{32} \\ a_{13} & a_{23} & a_{33} \end{bmatrix}    (2.6)
A matrix is symmetric if A_{ij} = A_{ji}. Another way to say this is that, for symmetric
matrices, A = A^T.
Two matrices can be multiplied if the number of columns in the first matrix equals
the number of rows in the second. Multiplying A, an N × M matrix, by B, an M × K
matrix, results in C, an N × K matrix. The ijth entry in C is the inner product
between the ith row in A and the jth column in B. As an example

\begin{bmatrix} 2 & 3 & 4 \\ 5 & 6 & 7 \end{bmatrix} \begin{bmatrix} 1 & 3 & 7 & 2 \\ 4 & 3 & 4 & 1 \\ 5 & 6 & 4 & 2 \end{bmatrix} = \begin{bmatrix} 34 & 39 & 42 & 15 \\ 64 & 75 & 87 & 30 \end{bmatrix}    (2.7)

Given two matrices A and B we note that

(AB)^T = B^T A^T    (2.8)
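
As a quick numerical check of these definitions, here is a minimal NumPy sketch (the array values follow the example above; the vector values and variable names are our own) computing an inner product, an outer product, the matrix product of equation 2.7, and verifying (AB)^T = B^T A^T.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y                 # x^T y, a scalar
outer = np.outer(x, y)        # x y^T, a 3 x 3 matrix

A = np.array([[2, 3, 4],
              [5, 6, 7]])
B = np.array([[1, 3, 7, 2],
              [4, 3, 4, 1],
              [5, 6, 4, 2]])

C = A @ B                     # should reproduce equation 2.7
print(C)

# transpose of a product: (AB)^T = B^T A^T
print(np.allclose(C.T, B.T @ A.T))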

2.2.1 Properties of matrix multiplication


Matrix multiplication is associative

(AB)C = A(BC)    (2.9)

distributive

A(B + C) = AB + AC    (2.10)

but not commutative

AB ≠ BA    (2.11)
2.3 Types of matrices
2.3.1 Covariance matrices
In the previous chapter the covariance, σ_{xy}, between two variables x and y was
defined. Given p variables there are p × p covariances to take account of. If we write
the covariance between variables x_i and x_j as σ_{ij}, then all the covariances can be
summarised in a covariance matrix, which we write below for p = 3

C = \begin{bmatrix} σ_1^2 & σ_{12} & σ_{13} \\ σ_{21} & σ_2^2 & σ_{23} \\ σ_{31} & σ_{32} & σ_3^2 \end{bmatrix}    (2.12)

The ith diagonal element is the covariance between the ith variable and itself, which
is simply the variance of that variable; we therefore write σ_i^2 instead of σ_{ii}. Also, note
that because σ_{ij} = σ_{ji} covariance matrices are symmetric.
We now look at computing a covariance matrix from a given data set. Suppose we
have p variables and that a single observation x_i (a row vector) consists of measuring
these variables, and suppose there are N such observations. We now make a matrix
X by putting each x_i into the ith row. The matrix X is therefore an N × p matrix
whose rows are made up of different observation vectors. If all the variables have zero
mean then the covariance matrix can then be evaluated as

C = \frac{1}{N-1} X^T X    (2.13)

This is a multiplication of a p × N matrix, X^T, by an N × p matrix, X, which results in
a p × p matrix. To illustrate the use of covariance matrices for time series, figure 2.1
shows 3 time series which have the following covariance relation

C_1 = \begin{bmatrix} 1 & 0.1 & 1.6 \\ 0.1 & 1 & 0.2 \\ 1.6 & 0.2 & 2.0 \end{bmatrix}    (2.14)

and mean vector

m_1 = [13, 17, 23]^T    (2.15)
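
A short NumPy sketch of equation 2.13, assuming zero-mean data; the synthetic data and variable names are our own, and np.cov is used only as a cross-check.

import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.standard_normal((N, p))   # N observations in rows, p variables in columns
X = X - X.mean(axis=0)            # remove the mean of each variable

C = X.T @ X / (N - 1)             # equation 2.13
print(np.allclose(C, np.cov(X, rowvar=False)))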

2.3.2 Diagonal matrices


A diagonal matrix is a square matrix (M = N) where all the entries are zero except
along the diagonal. For example

D = \begin{bmatrix} 4 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 6 \end{bmatrix}    (2.16)
Figure 2.1: Three time series having the covariance matrix C_1 and mean vector m_1
shown in the text. The top and bottom series have high covariance but none of the
other pairings do.
There is also a more compact notation for the same matrix

D = diag([4, 1, 6])    (2.17)

If a covariance matrix is diagonal it means that the covariances between variables are
zero, that is, the variables are all uncorrelated. Non-diagonal covariance matrices are
known as full covariance matrices. If V is a vector of variances, V = [σ_1^2, σ_2^2, σ_3^2]^T, then
the corresponding diagonal covariance matrix is V_d = diag(V).

2.3.3 The correlation matrix


The correlation matrix, R, can be derived from the covariance matrix by the equation

R = BCB    (2.18)

where B is a diagonal matrix of inverse standard deviations

B = diag([1/σ_1, 1/σ_2, 1/σ_3])    (2.19)
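
A minimal sketch of equation 2.18 in NumPy; the covariance matrix below is our own example, chosen to be a valid (positive-definite) covariance.

import numpy as np

C = np.array([[4.0, 1.2, 0.0],
              [1.2, 1.0, 0.3],
              [0.0, 0.3, 2.0]])          # an example covariance matrix

B = np.diag(1.0 / np.sqrt(np.diag(C)))   # inverse standard deviations
R = B @ C @ B                            # equation 2.18
print(R)                                 # unit diagonal, correlations elsewhere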

2.3.4 The identity matrix


The identity matrix is a diagonal matrix with ones along the diagonal. Multiplication
of any matrix X by the identity matrix results in X. That is

IX = X    (2.20)

The identity matrix is the matrix equivalent of multiplying by 1 for scalars.
2.4 The Matrix Inverse
Given a matrix X its inverse X^{-1} is defined by the properties

X^{-1} X = I    (2.21)
X X^{-1} = I

where I is the identity matrix. The inverse of a diagonal matrix with entries d_{ii} is
another diagonal matrix with entries 1/d_{ii}. This satisfies the definition of an inverse,
e.g.

\begin{bmatrix} 4 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 6 \end{bmatrix} \begin{bmatrix} 1/4 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1/6 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}    (2.22)
More generally, the calculation of inverses involves a lot more computation. Before
looking at the general case we first consider the problem of solving simultaneous
equations. These constitute relations between a set of input or independent variables
x_i and a set of output or dependent variables y_i. Each input-output pair constitutes
an observation. In the following example we consider just N = 3 observations and
p = 3 dimensions per observation

2w_1 + w_2 + w_3 = 5
4w_1 - 6w_2 = -2
-2w_1 + 7w_2 + 2w_3 = 9
which can be written in matrix form

\begin{bmatrix} 2 & 1 & 1 \\ 4 & -6 & 0 \\ -2 & 7 & 2 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 5 \\ -2 \\ 9 \end{bmatrix}    (2.23)

or, more compactly, as

Xw = y    (2.24)
This system of equations can be solved in a systematic way by subtracting multiples
of the first equation from the second and third equations and then subtracting multiples
of the second equation from the third. For example, subtracting twice the first
equation from the second and -1 times the first from the third gives

\begin{bmatrix} 2 & 1 & 1 \\ 0 & -8 & -2 \\ 0 & 8 & 3 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 5 \\ -12 \\ 14 \end{bmatrix}    (2.25)

Then, subtracting -1 times the second from the third gives

\begin{bmatrix} 2 & 1 & 1 \\ 0 & -8 & -2 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 5 \\ -12 \\ 2 \end{bmatrix}    (2.26)
This process is known as forward elimination. We can then substitute the value for
w_3 from the third equation into the second, and so on. This process is back-substitution.
The two processes together are known as Gaussian elimination. Following this through
for our example we get w = [1, 1, 2]^T.
When we come to invert a matrix (as opposed to solving a system of equations as in
the previous example) we start with the equation

A A^{-1} = I    (2.27)

and just write down all the entries in the A and I matrices in one big matrix

\begin{bmatrix} 2 & 1 & 1 & 1 & 0 & 0 \\ 4 & -6 & 0 & 0 & 1 & 0 \\ -2 & 7 & 2 & 0 & 0 & 1 \end{bmatrix}    (2.28)

We then perform forward elimination¹ until the part of the matrix corresponding to
A equals the identity matrix; the matrix on the right is then A^{-1} (this is because in
equation 2.27 if A becomes I then the left hand side is A^{-1} and the right side must
equal the left side). We get

\begin{bmatrix} 1 & 0 & 0 & 12/16 & -5/16 & -6/16 \\ 0 & 1 & 0 & 4/8 & -3/8 & -2/8 \\ 0 & 0 & 1 & -1 & 1 & 1 \end{bmatrix}    (2.29)
This process is known as the Gauss-Jordan method. For more details see Strang's
excellent book on Linear Algebra [58], from which this example was taken.
Inverses can be used to solve equations of the form Xw = y. This is achieved by
multiplying both sides by X^{-1}, giving

w = X^{-1} y    (2.30)

Hence,

\begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = \begin{bmatrix} 12/16 & -5/16 & -6/16 \\ 4/8 & -3/8 & -2/8 \\ -1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 5 \\ -2 \\ 9 \end{bmatrix}    (2.31)

which also gives w = [1, 1, 2]^T.
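
A minimal NumPy sketch of this worked example; np.linalg.solve and np.linalg.inv are the standard routines, and in practice solve is preferred over forming the inverse explicitly.

import numpy as np

X = np.array([[ 2.0,  1.0, 1.0],
              [ 4.0, -6.0, 0.0],
              [-2.0,  7.0, 2.0]])
y = np.array([5.0, -2.0, 9.0])

w = np.linalg.solve(X, y)          # Gaussian elimination under the hood
print(w)                           # [1. 1. 2.]

X_inv = np.linalg.inv(X)           # compare with equation 2.29
print(np.allclose(X_inv @ y, w))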
The inverse of a product of matrices is given by

(AB)^{-1} = B^{-1} A^{-1}    (2.32)

Only square matrices are invertible because, for y = Ax, if y and x are of different
dimension then we will not necessarily have a one-to-one mapping between them.
¹We do not perform back-substitution but instead continue with forward elimination until we get
a diagonal matrix.
2.5 Orthogonality
The length of a d-element vector x is written as ||x||, where

||x||^2 = \sum_{i=1}^{d} x_i^2 = x^T x    (2.33)
Two vectors x and y are orthogonal if

||x||^2 + ||y||^2 = ||x - y||^2    (2.34)

Figure 2.2: Two vectors x and y. These vectors will be orthogonal if they obey
Pythagoras' relation, i.e. that the sum of the squares of the sides equals the square of
the hypotenuse.

That is, if

x_1^2 + ... + x_d^2 + y_1^2 + ... + y_d^2 = (x_1 - y_1)^2 + ... + (x_d - y_d)^2    (2.35)

Expanding the terms on the right and re-arranging leaves only the cross-terms

x_1 y_1 + ... + x_d y_d = 0    (2.36)
x^T y = 0
That is, two vectors are orthogonal if their inner product is zero.

2.5.1 Angles between vectors


Given a vector b = [b_1, b_2]^T and a vector a = [a_1, a_2]^T (see figure 2.3) we can work out that

cos α = a_1 / ||a||,    sin α = a_2 / ||a||    (2.37)
cos β = b_1 / ||b||,    sin β = b_2 / ||b||    (2.38)

Figure 2.3: Working out the angle between two vectors.
Now, cos δ = cos(β - α), which we can expand using the trig identity

cos(β - α) = cos β cos α + sin β sin α    (2.39)

Hence

cos δ = \frac{a_1 b_1 + a_2 b_2}{||a|| \, ||b||}    (2.40)

More generally, we have

cos δ = \frac{a^T b}{||a|| \, ||b||}    (2.41)

Because cos(π/2) = 0, this again shows that vectors are orthogonal when a^T b = 0. Also,
because |cos δ| ≤ 1, where |x| denotes the absolute value of x, we have

|a^T b| ≤ ||a|| ||b||    (2.42)

which is known as the Schwarz inequality.
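
A small NumPy check of equation 2.41 and the Schwarz inequality; the vector values are our own example.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

cos_delta = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
delta = np.arccos(cos_delta)            # angle in radians
print(np.degrees(delta))

# Schwarz inequality: |a^T b| <= ||a|| ||b||
print(abs(a @ b) <= np.linalg.norm(a) * np.linalg.norm(b))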

2.5.2 Projections
The projection of a vector b onto a vector a results in a projection vector p, which is
the point on the line a that is closest to the point b.

Figure 2.4: The projection of b onto a is the point on a which is closest to b.

Because p is a point on a it must be some scalar multiple of it. That is

p = wa    (2.43)
where w is some coefficient. Because p is the point on a closest to b, the vector b - p
is orthogonal to a. Therefore

a^T (b - p) = 0    (2.44)
a^T (b - wa) = 0

Re-arranging gives

w = \frac{a^T b}{a^T a}    (2.45)

and

p = \frac{a^T b}{a^T a} a    (2.46)
We refer to p as the projection vector and to w as the projection.
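
A sketch of equations 2.45 and 2.46 in NumPy; the example vectors are our own.

import numpy as np

a = np.array([2.0, 0.0, 1.0])
b = np.array([1.0, 3.0, 2.0])

w = (a @ b) / (a @ a)     # projection coefficient, equation 2.45
p = w * a                 # projection vector, equation 2.46

# b - p should be orthogonal to a (equation 2.44)
print(np.isclose(a @ (b - p), 0.0))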

2.5.3 Orthogonal Matrices


The set of vectors q_1, ..., q_k are orthogonal if

q_j^T q_k = 0 for j ≠ k,    q_j^T q_k = d_k for j = k    (2.47)

If these vectors are placed in the columns of the matrix Q then

Q^T Q = Q Q^T = D    (2.48)
2.5.4 Orthonormal Matrices
The set of vectors q_1, ..., q_k are orthonormal if

q_j^T q_k = 1 for j = k,    q_j^T q_k = 0 for j ≠ k    (2.49)

If these vectors are placed in the columns of the matrix Q then

Q^T Q = Q Q^T = I    (2.50)

Hence, the transpose equals the inverse

Q^T = Q^{-1}    (2.51)
The vectors q_1, ..., q_k are said to provide an orthonormal basis. This means that any
vector can be written as a linear combination of the basis vectors. A trivial example
is the two-dimensional Cartesian coordinate system, where q_1 = [1, 0]^T (the x-axis)
and q_2 = [0, 1]^T (the y-axis). More generally, to represent the vector x we can write

x = \tilde{x}_1 q_1 + \tilde{x}_2 q_2 + ... + \tilde{x}_d q_d    (2.52)

To find the appropriate coefficients \tilde{x}_k (the co-ordinates in the new basis), multiply
both sides by q_k^T. Due to the orthonormality property all terms on the right disappear
except one, leaving

\tilde{x}_k = q_k^T x    (2.53)

The new coordinates are the projections of the data onto the basis vectors (cf.
equation 2.45; there is no denominator since q_k^T q_k = 1). In matrix form, equation 2.52
can be written as x = Q\tilde{x}, which therefore has the solution \tilde{x} = Q^{-1}x. But given
that Q^{-1} = Q^T we have

\tilde{x} = Q^T x    (2.54)
Transformation to an orthonormal basis preserves lengths. This is because²

||\tilde{x}||^2 = ||Q^T x||^2    (2.55)
        = (Q^T x)^T Q^T x
        = x^T Q Q^T x
        = x^T x
        = ||x||^2

Similarly, inner products, and therefore angles between vectors, are preserved. That is

\tilde{x}^T \tilde{y} = (Q^T x)^T Q^T y    (2.56)
        = x^T Q Q^T y
        = x^T y

Therefore, transformation by an orthonormal matrix constitutes a rotation of the
co-ordinate system.
²Throughout this chapter we will make extensive use of the matrix identities (AB)^T = B^T A^T
and (AB)C = A(BC). We will also use (AB)^{-1} = B^{-1} A^{-1}.
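
A quick numerical check that an orthonormal (rotation) matrix preserves lengths and inner products; the 2-D rotation matrix and test vectors below are our own example.

import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthonormal: Q^T Q = I

x = np.array([1.0, 2.0])
y = np.array([-3.0, 0.5])

x_tilde, y_tilde = Q.T @ x, Q.T @ y                              # co-ordinates in the rotated basis
print(np.isclose(np.linalg.norm(x_tilde), np.linalg.norm(x)))    # lengths preserved
print(np.isclose(x_tilde @ y_tilde, x @ y))                      # inner products preserved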
2.6 Subspaces
A space is, for example, the set of real numbers. A subspace S is a set of points {x}
such that (i) if we take two vectors from S and add them we remain in S, and (ii) if
we take a vector from S and multiply it by a scalar we also remain in S (S is said to be
closed under addition and scalar multiplication). An example is a 2-D plane in a 3-D space.
A subspace can be defined by a basis.

2.7 Determinants
The determinant of a two-by-two matrix

A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}    (2.57)

is given by

det(A) = ad - bc    (2.58)

The determinant of a three-by-three matrix

A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix}    (2.59)

is given by

det(A) = a \det\begin{bmatrix} e & f \\ h & i \end{bmatrix} - b \det\begin{bmatrix} d & f \\ g & i \end{bmatrix} + c \det\begin{bmatrix} d & e \\ g & h \end{bmatrix}    (2.60)
Determinants are important because of their properties. In particular, if two rows of
a matrix are equal then the determinant is zero, e.g. if

A = \begin{bmatrix} a & b \\ a & b \end{bmatrix}    (2.61)

then

det(A) = ab - ba = 0    (2.62)

In this case the transformation from x = [x_1, x_2]^T to y = [y_1, y_2]^T given by

Ax = y    (2.63)

reduces two pieces of information (x_1 and x_2) to one piece of information

y = y_1 = y_2 = a x_1 + b x_2    (2.64)

In this case it is not possible to reconstruct x from y; the transformation is not
invertible, the matrix A does not have an inverse, and the value of the determinant
provides a test for this: if det(A) = 0 the matrix A is not invertible; it is singular.

Figure 2.5: A singular (non-invertible) transformation.

Conversely, if det(A) ≠ 0 then A is invertible. Other properties of the determinant
are

det(A^T) = det(A)    (2.65)
det(AB) = det(A) det(B)
det(A^{-1}) = 1 / det(A)
det(A) = \prod_k a_{kk}   (for a diagonal or triangular A)

Another important property of determinants is that they measure the `volume' of a
matrix. For a 3-by-3 matrix the three rows of the matrix form the edges of a
parallelepiped, and the determinant is the volume of this parallelepiped. For a d-by-d
matrix the rows form the edges of a d-dimensional parallelepiped. Again, the
determinant is the volume.
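
A NumPy check of these determinant properties on random matrices (our own example):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

print(np.isclose(np.linalg.det(A.T), np.linalg.det(A)))
print(np.isclose(np.linalg.det(A @ B), np.linalg.det(A) * np.linalg.det(B)))
print(np.isclose(np.linalg.det(np.linalg.inv(A)), 1.0 / np.linalg.det(A)))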

2.8 Eigenanalysis
The square matrix A has eigenvalues λ and eigenvectors q if

A q = λ q    (2.66)

Therefore

(A - λI) q = 0    (2.67)

To satisfy this equation either q = 0, which is uninteresting, or the matrix A - λI
must reduce q to the null vector (a single point). For this to happen A - λI must
be singular. Hence

det(A - λI) = 0    (2.68)

Eigenanalysis therefore proceeds by (i) solving the above equation to find the eigenvalues
λ_i and then (ii) substituting them into equation 2.66 to find the eigenvectors.
For example, if

A = \begin{bmatrix} 4 & -5 \\ 2 & -3 \end{bmatrix}    (2.69)

then

det(A - λI) = (4 - λ)(-3 - λ) - (-5)(2) = 0    (2.70)

which can be rearranged as

λ^2 - λ - 2 = 0    (2.71)
(λ + 1)(λ - 2) = 0

Hence the eigenvalues are λ = -1 and λ = 2. Substituting back into equation 2.66
gives an eigenvector q_1 which is any multiple of [1, 1]^T. Similarly, eigenvector q_2 is
any multiple of [5, 2]^T.
We now note that the determinant of a matrix is also equal to the product of its
eigenvalues

det(A) = \prod_k λ_k    (2.72)

We also define the Trace of a matrix as the sum of its diagonal elements

Tr(A) = \sum_k a_{kk}    (2.73)

and note that it is also equal to the sum of the eigenvalues

Tr(A) = \sum_k λ_k    (2.74)

Eigenanalysis applies only to square matrices.
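
Checking the worked example with np.linalg.eig (the returned eigenvectors are normalised to unit length, so they are scalar multiples of [1, 1]^T and [5, 2]^T):

import numpy as np

A = np.array([[4.0, -5.0],
              [2.0, -3.0]])

evals, evecs = np.linalg.eig(A)
print(evals)                                           # -1 and 2 (order may vary)
print(evecs)                                           # columns are the eigenvectors

print(np.isclose(np.prod(evals), np.linalg.det(A)))    # equation 2.72
print(np.isclose(np.sum(evals), np.trace(A)))          # equation 2.74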

2.9 Gram-Schmidt
A general class of procedures for finding eigenvectors are the deflation methods, of
which QR-decomposition and Gram-Schmidt orthogonalization are examples.
In Gram-Schmidt, we are given a set of vectors, say a, b and c, and we wish to find a
set of corresponding orthonormal vectors which we'll call q_1, q_2 and q_3. To start with
we let

q_1 = \frac{a}{||a||}    (2.75)

We then compute b', which is the original vector b minus the projection vector (see
equation 2.46) of b onto q_1

b' = b - (q_1^T b) q_1    (2.76)

The second orthogonal vector is then a unit-length version of b'

q_2 = \frac{b'}{||b'||}    (2.77)

Finally, the third orthonormal vector is given by

q_3 = \frac{c'}{||c'||}    (2.78)

where

c' = c - (q_1^T c) q_1 - (q_2^T c) q_2    (2.79)

In QR-decomposition the Q terms are given by the q_i and the R terms by the
projections such as q_i^T c.
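
A sketch of the procedure for a set of linearly independent vectors, written as a small function (the function name and test vectors are our own); numpy.linalg.qr performs an equivalent factorisation.

import numpy as np

def gram_schmidt(vectors):
    """Return orthonormal vectors spanning the same space (classical Gram-Schmidt)."""
    basis = []
    for v in vectors:
        # subtract the projections onto the vectors found so far (equations 2.76, 2.79)
        for q in basis:
            v = v - (q @ v) * q
        basis.append(v / np.linalg.norm(v))   # normalise (equations 2.75, 2.77, 2.78)
    return np.array(basis)

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 1.0])

Q = gram_schmidt([a, b, c])
print(np.allclose(Q @ Q.T, np.eye(3)))   # rows are orthonormal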
2.9.1 Diagonalization
If we put the eigenvectors into the columns of a matrix

Q = \begin{bmatrix} | & | & & | \\ q_1 & q_2 & \cdots & q_d \\ | & | & & | \end{bmatrix}    (2.80)

then, because A q_k = λ_k q_k, we have

A Q = \begin{bmatrix} | & | & & | \\ λ_1 q_1 & λ_2 q_2 & \cdots & λ_d q_d \\ | & | & & | \end{bmatrix}    (2.81)

If we put the eigenvalues into the diagonal matrix Λ then the above matrix can also be written
as QΛ. Therefore,

A Q = Q Λ    (2.82)

Pre-multiplying both sides by Q^{-1} gives

Q^{-1} A Q = Λ    (2.83)

This shows that any square matrix can be converted into a diagonal form (provided
it has distinct eigenvalues; see e.g. [58], p. 255). Sometimes there won't be d distinct
eigenvalues and sometimes they'll be complex.

2.9.2 Spectral Theorem


For any real symmetric matrix all the eigenvalues are real and there is a full set of d
orthogonal eigenvectors (if the matrix is not symmetric the eigenvectors won't, in general,
be orthogonal). They can be normalised and placed into the matrix Q. Since Q is now
orthonormal we have Q^{-1} = Q^T. Hence

Q^T A Q = Λ    (2.84)

Pre-multiplying by Q and post-multiplying by Q^T gives

A = Q Λ Q^T    (2.85)

which is known as the spectral theorem. It says that any real, symmetric matrix can
be represented as above, where the columns of Q contain the eigenvectors and Λ is a
diagonal matrix containing the eigenvalues, λ_i. Equivalently,

A = \begin{bmatrix} | & | & & | \\ q_1 & q_2 & \cdots & q_d \\ | & | & & | \end{bmatrix} \begin{bmatrix} λ_1 & & & \\ & λ_2 & & \\ & & \ddots & \\ & & & λ_d \end{bmatrix} \begin{bmatrix} - & q_1^T & - \\ - & q_2^T & - \\ & \vdots & \\ - & q_d^T & - \end{bmatrix}    (2.86)

This can also be written as a summation

A = \sum_{k=1}^{d} λ_k q_k q_k^T    (2.87)
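
A numerical check of the spectral theorem on a random symmetric matrix (np.linalg.eigh is NumPy's routine for symmetric/Hermitian matrices; the example matrix is our own):

import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                      # make a symmetric matrix

lam, Q = np.linalg.eigh(A)             # eigenvalues and orthonormal eigenvectors
print(np.allclose(Q @ np.diag(lam) @ Q.T, A))    # equation 2.85

# equivalently, as a sum of rank-one terms (equation 2.87)
A_sum = sum(lam[k] * np.outer(Q[:, k], Q[:, k]) for k in range(4))
print(np.allclose(A_sum, A))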

2.10 Complex Matrices


If

A = \begin{bmatrix} 3 + 2i & 4 & 6 + 3i \\ 2 + i & 3 + 2i & 7 + 4i \end{bmatrix}    (2.88)

then the complex transpose or Hermitian transpose is given by

A^H = \begin{bmatrix} 3 - 2i & 2 - i \\ 4 & 3 - 2i \\ 6 - 3i & 7 - 4i \end{bmatrix}    (2.89)

i.e. each entry changes into its complex conjugate (see appendix) and we then transpose
the result. Just as A^{-T} denotes the transpose of the inverse, so A^{-H} denotes
the Hermitian transpose of the inverse.
If A^H A = I then A is said to be a unitary matrix. It is the complex
equivalent of an orthonormal matrix.
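
In NumPy the Hermitian transpose is simply the conjugate of the transpose; a short sketch using the matrix from equation 2.88:

import numpy as np

A = np.array([[3 + 2j, 4 + 0j, 6 + 3j],
              [2 + 1j, 3 + 2j, 7 + 4j]])

A_H = A.conj().T          # Hermitian transpose, equation 2.89
print(A_H)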

2.11 Quadratic Forms


The quadratic function

f(x) = a_{11} x_1^2 + a_{12} x_1 x_2 + a_{21} x_2 x_1 + ... + a_{dd} x_d^2    (2.90)

can be written in matrix form as

f(x) = [x_1, x_2, ..., x_d] \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1d} \\ a_{21} & a_{22} & \cdots & a_{2d} \\ \vdots & & & \vdots \\ a_{d1} & a_{d2} & \cdots & a_{dd} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix}    (2.91)

which is written compactly as

f(x) = x^T A x    (2.92)

If f(x) > 0 for all non-zero x then A is said to be positive-definite. Similarly, if
f(x) ≥ 0 then A is positive-semi-definite.
If we substitute A = Q Λ Q^T and x = Q y, where y are the projections of x onto the
eigenvectors, then we can write

f(x) = y^T Λ y    (2.93)
     = \sum_i λ_i y_i^2

Hence, for positive-definiteness we must therefore have λ_i > 0 for all i (i.e. positive
eigenvalues).
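
A small check of positive-definiteness via the eigenvalues (the example matrix is our own):

import numpy as np

A = np.array([[ 2.0, -1.0],
              [-1.0,  2.0]])          # symmetric

evals = np.linalg.eigvalsh(A)
print(evals)                          # all positive, so A is positive-definite

x = np.array([0.3, -1.7])             # any non-zero x should give x^T A x > 0
print(x @ A @ x > 0)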

2.11.1 Ellipses
For 2-by-2 matrices, if A = I then we have

f = x_1^2 + x_2^2    (2.94)

which is the equation of a circle with radius \sqrt{f}. If A = kI we have

f/k = x_1^2 + x_2^2    (2.95)

The radius is now \sqrt{f/k}. If A = diag([k_1, k_2]) we have

f = k_1 x_1^2 + k_2 x_2^2    (2.96)

which is the equation of an ellipse. For k_1 > k_2 the major axis has length \sqrt{f/k_2} and
the minor axis has length \sqrt{f/k_1}.
For a non-diagonal A we can diagonalise it using A = Q Λ Q^T. This gives

f = λ_1 \tilde{x}_1^2 + λ_2 \tilde{x}_2^2    (2.97)

where the ellipse now lives in a new co-ordinate system given by the rotation \tilde{x} = Q^T x.
The major and minor axes have lengths \sqrt{f/λ_2} and \sqrt{f/λ_1} (assuming λ_1 > λ_2).
