Pca
Pca
Pca
Component Analysis
(1) Theoretical Discussion
(2) Application to Face
Detection and Recognition
Aly A. Farag
Shireen Elhabian
University of Louisville, CVIP Lab
September 2009
PCA is
A backbone of modern data analysis.
A black box that is widely used but poorly
understood.
PCA
PCA
OK lets dispel the magic behind
this black box
PCA - Overview
It is a mathematical tool from applied linear
algebra.
It is a simple, non-parametric method of
extracting relevant information from confusing
data sets.
It provides a roadmap for how to reduce a
complex data set to a lower dimension.
What do we need under our BELT?!!!
Basics of statistical measures, e.g. variance and
covariance.
Basics of linear algebra:
Matrices
Vector space
Basis
Eigen vectors and eigen values
5
Variance
A measure of the spread of the data in a data
set with mean
Variance is claimed to be the original statistical
measure of spread of data.
( )
( ) 1
1
2
2
=
n
X X
n
i
i
X
Covariance
Variance measure of the deviation from the mean for points in one
dimension, e.g., heights
Covariance a measure of how much each of the dimensions varies
from the mean with respect to each other.
Covariance is measured between 2 dimensions to see if there is a
relationship between the 2 dimensions, e.g., number of hours studied
and grade obtained.
The covariance between one dimension and itself is the variance
( )( )
( )
( )( )
( ) 1
) , cov(
1
) var(
1
1
=
=
n
Y Y X X
Y X
n
X X X X
X
n
i
i i
i
n
i
i
7
Covariance
What is the interpretation of covariance calculations?
Say you have a 2-dimensional data set
X: number of hours studied for a subject
Y: marks obtained in that subject
And assume the covariance value (between X and Y) is:
104.53
What does this value mean?
8
Covariance
Exact value is not as important as its sign.
A positive value of covariance indicates that both
dimensions increase or decrease together, e.g., as the
number of hours studied increases, the grades in that
subject also increase.
A negative value indicates while one increases the other
decreases, or vice-versa, e.g., active social life vs.
performance in ECE Dept.
If covariance is zero: the two dimensions are independent
of each other, e.g., heights of students vs. grades obtained
in a subject.
9
Covariance
Why bother with calculating (expensive)
covariance when we could just plot the 2 values
to see their relationship?
Covariance calculations are used to find
relationships between dimensions in high
dimensional data sets (usually greater than 3)
where visualization is difficult.
10
Covariance Matrix
Representing covariance among dimensions as a
matrix, e.g., for 3 dimensions:
Properties:
Diagonal: variances of the variables
cov(X,Y)=cov(Y,X), hence matrix is symmetrical about
the diagonal (upper triangular)
m-dimensional data will result in mxm covariance matrix
=
) , cov( ) , cov( ) , cov(
) , cov( ) , cov( ) , cov(
) , cov( ) , cov( ) , cov(
Z Z Y Z X Z
Z Y Y Y X Y
Z X Y X X X
C
Linear algebra
Matrix A:
Matrix Transpose
Vector a
= =
mn m m
n
n
n m ij
a a a
a a a
a a a
a A
...
... ... ... ...
...
...
] [
2 1
2 22 21
1 12 11
m j n i a b A b B
ji ij
T
m n ij
= = =
1 , 1 ; ] [
] ,..., [ ; ...
1
1
n
T
n
a a a
a
a
a =
=
Matrix and vector multiplication
Matrix multiplication
Outer vector product
Vector-matrix product
) ( ) ( , ] [
; ] [ ; ] [
B col A row c where c C AB
b B a A
j i ij n m ij
n p ij p m ij
= = =
= =
matrix n m an AB b a c
b B b a A a
n ij
T
m ij
= =
= = = =
,
; ] [ ; ] [
1 1
m length of vector matrix m an Ab C
b B b a A
n ij n m ij
= = =
= = =
1
; ] [ ; ] [
1
Inner Product
Inner (dot) product:
Length (Eucledian norm) of a vector
a is normalized iff ||a|| = 1
The angle between two n-dimesional
vectors
An inner product is a measure of
collinearity:
a and b are orthogonal iff
a and b are collinear iff
A set of vectors is linearly independent if
no vector is a linear combination of
other vectors.
=
=
n
i
i i
T
b a b a
1
=
= =
n
i
i
T
a a a a
1
2
|| || || ||
cos
b a
b a
T
=
0 = b a
T
|| || || || b a b a
T
=
Determinant and Trace
Determinant
det(AB)= det(A)det(B)
Trace
) det( ) 1 (
; ,.... 1 ; ) det(
; ] [
1
ij
j i
ij
n
j
ij ij
n n ij
M A
n i A a A
a A
+
=
=
= =
=
= =
n
j
jj n n ij
a A tr a A
1
] [ ; ] [
Matrix Inversion
A (n x n) is nonsingular if there exists B such that:
A=[2 3; 2 2], B=[-1 3/2; 1 -1]
A is nonsingular iff
Pseudo-inverse for a non square matrix, provided
is not singular
1
;
= = = A B I BA AB
n
0 || || A
T T
A A A A
1 #
] [
=
A A
T
I A A =
#
Linear Independence
A set of n-dimensional vectors x
i
R
n
, are said
to be linearly independent if none of them can
be written as a linear combination of the others.
In other words,
0 ...
0 ...
2 1
2 2 1 1
= = = =
= + + +
k
k k
c c c iff
x c x c x c
Linear Independence
Another approach to reveal a vectors
independence is by graphing the vectors.
=
4
6
,
2
3
2 1
x x
=
2
3
,
2
1
2 1
x x
Not linearly independent vectors
Linearly independent vectors
Span
A span of a set of vectors x
1
, x
2
, , x
k
is the
set of vectors that can be written as a linear
combination of x
1
, x
2
, , x
k
.
( ) { } + + + =
k k k k
c c c x c x c x c x x x span ,..., , | ... ,..., ,
2 1 2 2 1 1 2 1
Basis
A basis for R
n
is a set of vectors which:
Spans R
n
, i.e. any vector in this n-dimensional space
can be written as linear combination of these basis
vectors.
Are linearly independent
Clearly, any set of n-linearly independent vectors
form basis vectors for R
n
.
Orthogonal/Orthonormal Basis
An orthonormal basis of an a vector space V with an inner product, is a set
of basis vectors whose elements are mutually orthogonal and of magnitude 1 (unit
vectors).
Elements in an orthogonal basis do not have to be unit vectors, but must be mutually
perpendicular. It is easy to change the vectors in an orthogonal basis by scalar
multiples to get an orthonormal basis, and indeed this is a typical way that an
orthonormal basis is constructed.
Two vectors are orthogonal if they are perpendicular, i.e., they form a right angle, i.e.
if their inner product is zero.
The standard basis of the n-dimensional Euclidean space R
n
is an example of
orthonormal (and ordered) basis.
b a b a b a
n
i
i i
T
= =
=
0
1
21
Transformation Matrices
Consider the following:
The square (transformation) matrix scales (3,2)
Now assume we take a multiple of (3,2)
2 3
2 1
3
2
=
12
8
= 4
3
2
4
6
4
16
24
4
6
1 2
3 2
4
6
2
3
2
22
Transformation Matrices
Scale vector (3,2) by a value 2 to get (6,4)
Multiply by the square transformation matrix
And we see that the result is still scaled by 4.
WHY?
A vector consists of both length and direction. Scaling a
vector only changes its length and not its direction. This is
an important observation in the transformation of matrices
leading to formation of eigenvectors and eigenvalues.
Irrespective of how much we scale (3,2) by, the solution
(under the given transformation matrix) is always a multiple
of 4.
23
Eigenvalue Problem
The eigenvalue problem is any problem having the
following form:
A . v = . v
A: m x m matrix
v: m x 1 non-zero vector
: scalar
Any value of for which this equation has a
solution is called the eigenvalue of A and the vector
v which corresponds to this value is called the
eigenvector of A.
24
Eigenvalue Problem
Going back to our example:
A . v = . v
Therefore, (3,2) is an eigenvector of the square matrix A
and 4 is an eigenvalue of A
The question is:
Given matrix A, how can we calculate the eigenvector
and eigenvalues for A?
2 3
2 1
3
2
=
12
8
= 4
3
2
25
Calculating Eigenvectors & Eigenvalues
Simple matrix algebra shows that:
A . v = . v
A . v - . I . v = 0
(A - . I ). v = 0
Finding the roots of |A - . I| will give the eigenvalues
and for each of these eigenvalues there will be an
eigenvector
Example
26
Calculating Eigenvectors &
Eigenvalues
Let
Then:
And setting the determinant to 0, we obtain 2 eigenvalues:
1
= -1 and
2
= -2
( ) ( ) ( ) 2 3 1 2 3
3 2
1
0
0
3 2
1 0
1 0
0 1
3 2
1 0
.
2
+ + = =
I A
A =
0 1
2 3
27
Calculating Eigenvectors & Eigenvalues
For
1
the eigenvector is:
Therefore the first eigenvector is any column vector in
which the two elements have equal magnitude and opposite
sign.
( )
2 : 1 1 : 1
2 : 1 1 : 1 2 : 1 1 : 1
2 : 1
1 : 1
1 1
0 2 2 0
0 .
2 2
1 1
0 . .
v v
v v and v v
v
v
v I A
=
= = +
=
=
28
Calculating Eigenvectors & Eigenvalues
Therefore eigenvector v
1
is
where k
1
is some constant.
Similarly we find that eigenvector v
2
where k
2
is some constant.
+
=
1
1
1 1
k v
+
=
2
1
2 2
k v
29
Properties of Eigenvectors and
Eigenvalues
Eigenvectors can only be found for square matrices and
not every square matrix has eigenvectors.
Given an m x m matrix (with eigenvectors), we can find
n eigenvectors.
All eigenvectors of a symmetric
*
matrix are
perpendicular to each other, no matter how many
dimensions we have.
In practice eigenvectors are normalized to have unit
length.
*Note:covariancematricesaresymmetric!
Now Lets go back
Example of a problem
We collected m parameters about 100 students:
Height
Weight
Hair color
Average grade
We want to find the most important parameters that
best describe a student.
Each student has a vector of data
which describes him of length m:
(180,70,purple,84,)
We have n = 100 such vectors.
Lets put them in one matrix,
where each column is one
student vector.
So we have a mxn matrix. This
will be the input of our problem.
Example of a problem
Example of a problem
m
-
d
i
m
e
n
s
i
o
n
a
l
d
a
t
a
v
e
c
t
o
r
n - students
Every student is a vector that
lies in an m-dimensional
vector space spanned by an
orthnormal basis.
All data/measurement vectors
in this space are linear
combination of this set of
unit length basis vectors.
Which parameters can we ignore?
Constant parameter (number of heads)
1,1,,1.
Constant parameter with some noise - (thickness of
hair)
0.003, 0.005,0.002,.,0.0008 low variance
Parameter that is linearly dependent on other
parameters (head size and height)
Z= aX + bY
Which parameters do we want to keep?
Parameter that doesnt depend on others (e.g.
eye color), i.e. uncorrelated low covariance.
Parameter that changes a lot (grades)
The opposite of noise
High variance
Questions
How we describe most important
features using math?
Variance
How do we represent our data so that
the most important features can be
extracted easily?
Change of basis
Change of Basis !!!
Let X and Y be m x n matrices related by a linear transformation P.
X is the original recorded data set and Y is a re-representation of that
data set.
PX = Y
Lets define;
p
i
are the rows of P.
x
i
are the columns of X.
y
i
are the columns of Y.
Change of Basis !!!
Let X and Y be m x n matrices related by a linear transformation P.
X is the original recorded data set and Y is a re-representation of that
data set.
PX = Y
What does this mean?
P is a matrix that transforms X into Y.
Geometrically, P is a rotation and a stretch (scaling) which again
transforms X into Y.
The rows of P, {p
1,
p
2, ,
p
m
} are a set of new basis vectors for
expressing the columns of X.
Change of Basis !!!
Lets write out the explicit dot
products of PX.
We can note the form of each
column of Y.
We can see that each coefficient of
y
i
is a dot-product of x
i
with the
corresponding row in P.
In other words, the j
th
coefficient of y
i
is a projection onto the
j
th
row of P.
Therefore, the rows of P are a new set of basis vectors for
representing the columns of X.
Changing the basis doesnt change the data only its
representation.
Changing the basis is actually projecting the data
vectors on the basis vectors.
Geometrically, P is a rotation and a stretch of X.
If P basis is orthonormal (length = 1) then the
transformation P is only a rotation
Change of Basis !!!
Questions Remaining !!!
Assuming linearity, the problem now is to find the
appropriate change of basis.
The row vectors {p
1,
p
2, ,
p
m
} in this transformation
will become the principal components of X.
Now,
What is the best way to re-express X?
What is the good choice of basis P?
Moreover,
what features we would like Y to exhibit?
What does best express the data
mean ?!!!
As weve said before, we want to filter out noise
and extract the relevant information from the
given data set.
Hence, the representation we are looking for will
decrease both noise and redundancy in the data
set at hand.
Noise
Noise in any data must be low or no matter the analysis
technique no information about a system can be extracted.
There exists no absolute scale for the noise but rather it is
measured relative to the measurement, e.g. recorded ball
positions.
A common measure is the signal-to-noise ration (SNR), or a ratio of
variances.
A high SNR (>>1) indicates high precision data, while a low
SNR indicates noise contaminated data.
Find the axis rotation that
maximizes SNR = maximizes
the variance between axis.
Why Variance ?!!!
Redundancy
Multiple sensors record the same dynamic information.
Consider a range of possible plots between two arbitrary measurement
types r
1
and r
2
.
Panel(a) depicts two recordings with no redundancy, i.e. they are un-
correlated, e.g. persons height and his GPA.
However, in panel(c) both recordings appear to be strongly related, i.e.
one can be expressed in terms of the other.
Covariance Matrix
Assuming zero mean data (subtract the mean), consider the
indexed vectors {x
1,
x
2, ,
x
m
} which are the rows of an mxn
matrix X.
Each row corresponds to all measurements of a particular
measurement type (x
i
).
Each column of X corresponds to a set of measurements from
particular time instant.
We now arrive at a definition for the covariance matrix S
X
.
where
Covariance Matrix
The ij
th
element of the variance is the dot product between the
vector of the i
th
measurement type with the vector of the j
th
measurement type.
S
X
is a square symmetric mm matrix.
The diagonal terms of S
X
are the variance of particular
measurement types.
The off-diagonal terms of S
X
are the covariance between
measurement types.
where
Covariance Matrix
Computing S
X
quantifies the correlations between all possible
pairs of measurements. Between one pair of measurements, a
large covariance corresponds to a situation like panel (c), while
zero covariance corresponds to entirely uncorrelated data as in
panel (a).
where
Covariance Matrix
Suppose, we have the option of manipulating S
X
. We will
suggestively define our manipulated covariance matrix S
Y
.
What features do we want to optimize in S
Y
?
Diagonalize the Covariance Matrix
Our goals are to find the covariance matrix that:
1. Minimizes redundancy, measured by covariance. (off-diagonal),
i.e. we would like each variable to co-vary as little as possible
with other variables.
2. Maximizes the signal, measured by variance. (the diagonal)
Since covariance is non-negative, the optimized covariance matrix
will be a diagonal matrix.
Diagonalize the Covariance Matrix
PCA Assumptions
PCA assumes that all basis vectors {p
1
, . . . , p
m
}
are orthonormal (i.e. p
i
p
j
=
ij
).
Hence, in the language of linear algebra, PCA
assumes P is an orthonormal matrix.
Secondly, PCA assumes the directions with the
largest variances are the most important or in
other words, most principal.
Why are these assumptions easiest?
Diagonalize the Covariance Matrix
PCA Assumptions
By the variance assumption PCA first selects a normalized direction in
m-dimensional space along which the variance in X is maximized - it
saves this as p
1
.
Again it finds another direction along which variance is maximized,
however, because of the orthonormality condition, it restricts its
search to all directions perpendicular to all previous selected directions.
This could continue until m directions are selected. The resulting
ordered set of ps are the principal components.
The variances associated with each direction p
i
quantify how principal
each direction is. We could thus rank-order each basis vector p
i
according to the corresponding variances.
This pseudo-algorithm works, however we can solve it using linear
algebra.
Solving PCA: Eigen Vectors of
Covariance Matrix
We will derive our first algebraic solution to PCA using
linear algebra. This solution is based on an important
property of eigenvector decomposition.
Once again, the data set is X, an mn matrix, where m is the
number of measurement types and n is the number of data
trials.
The goal is summarized as follows:
Find some orthonormal matrix P where Y = PX such
that is diagonalized. The rows of P are the
principal components of X.
We begin by rewriting S
Y
in terms of our variable of choice P.
Note that we defined a new matrix A = XX
T
, where A is symmetric
Our roadmap is to recognize that a symmetric matrix (A) is diagonalized by
an orthogonal matrix of its eigenvectors
A symmetric matrix A can be written as EDE
T
where D is a diagonal matrix
and E is a matrix of eigenvectors of A arranged as columns.
The matrix A has r m orthonormal eigenvectors where r is the rank of the
matrix.
Solving PCA: Eigen Vectors of
Covariance Matrix
Now comes the trick. We select the matrix P to be a matrix where each
row p
i
is an eigenvector of XX
T
.
By this selection, P = E
T
. Hence A = P
T
DP.
With this relation and the fact that P
1
= P
T
since the inverse of
orthonormal matrix is its transpose, we can finish evaluating S
Y
as follows;
It is evident that the choice of
P diagonalizes S
Y.
This was the
goal for PCA.
Now lets do this in steps by
counter example.
Solving PCA: Eigen Vectors of
Covariance Matrix
56
PCA Process STEP 1
Subtract the mean from each of the dimensions
This produces a data set whose mean is zero.
Subtracting the mean makes variance and covariance
calculation easier by simplifying their equations.
The variance and co-variance values are not affected by
the mean value.
Suppose we have two measurement types X
1
and X
2
,
hence m = 2, and ten samples each, hence n = 10.
57
PCA Process STEP 1
http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
01 . 1 71 . 0
31 . 0 31 . 0
81 . 0 81 . 0
31 . 0 19 . 0
79 . 0 49 . 0
09 . 1 29 . 1
29 . 0 09 . 0
99 . 0 39 . 0
21 . 1 31 . 1
49 . 0 69 . 0
'
91 . 1
81 . 1
9 . 0 2 . 1
6 . 1 5 . 1
1 . 1 0 . 1
6 . 1 0 . 2
7 . 2 3 . 2
0 . 3 1 . 3
2 . 2 9 . 1
9 . 2 2 . 2
7 . 0 5 . 0
4 . 2 5 . 2
2 1
2
1
2 1
=
=
X X
X
X
X X
58
PCA Process STEP 2
Calculate the covariance matrix
Since the non-diagonal elements in this covariance matrix
are positive, we should expect that both the X
1
and X
2
variables increase together.
Since it is symmetric, we expect the eigenvectors to be
orthogonal.
=
716555556 . 0 615444444 . 0
615444444 . 0 616555556 . 0
cov
59
PCA Process STEP 3
Calculate the eigenvectors and eigenvalues of
the covariance matrix
=
735178656 . 0 677873399 . 0
677873399 . 0 735178656 . 0
28402771 . 1
490833989 . 0
rs eigenvecto
s eigenvalue
60
PCA Process STEP 3
Eigenvectors are plotted as
diagonal dotted lines on the
plot. (note: they are
perpendicular to each other).
One of the eigenvectors goes
through the middle of the
points, like drawing a line of
best fit.
The second eigenvector gives
us the other, less important,
pattern in the data, that all the
points follow the main line,
but are off to the side of the
main line by some amount.
61
PCA Process STEP 4
Reduce dimensionality and form feature vector
The eigenvector with the highest eigenvalue is the principal
component of the data set.
In our example, the eigenvector with the largest eigenvalue
is the one that points down the middle of the data.
Once eigenvectors are found from the covariance matrix,
the next step is to order them by eigenvalue, highest to
lowest. This gives the components in order of significance.
62
PCA Process STEP 4
Now, if youd like, you can decide to ignore the components
of lesser significance.
You do lose some information, but if the eigenvalues are
small, you dont lose much
m dimensions in your data
calculate m eigenvectors and eigenvalues
choose only the first r eigenvectors
final data set has only r dimensions.
63
PCA Process STEP 4
When the
i
s are sorted in descending order, the proportion
of variance explained by the r principal components is:
If the dimensions are highly correlated, there will be a small
number of eigenvectors with large eigenvalues and r will be
much smaller than m.
If the dimensions are not correlated, r will be as large as m
and PCA does not help.
m p
r
m
i
i
r
i
i
+ + + + +
+ + +
=
=
=
K K
K
2 1
2 1
1
1
64
PCA Process STEP 4
Feature Vector
FeatureVector = (
1
2
3
r
)
(take the eigenvectors to keep from the ordered list of eigenvectors, and form a matrix
with these eigenvectors in the columns)
We can either form a feature vector with both of the
eigenvectors:
or, we can choose to leave out the smaller, less significant
component and only have a single column:
677873399 . 0 735178656 . 0
735178656 . 0 677873399 . 0
735178656 . 0
677873399 . 0
65
PCA Process STEP 5
Derive the new data
FinalData = RowFeatureVector x RowZeroMeanData
RowFeatureVector is the matrix with the
eigenvectors in the columns transposed so that the
eigenvectors are now in the rows, with the most
significant eigenvector at the top.
RowZeroMeanData is the mean-adjusted data
transposed, i.e., the data items are in each column,
with each row holding a separate dimension.
66
PCA Process STEP 5
FinalData is the final data set, with data items in
columns, and dimensions along rows.
What does this give us?
The original data solely in terms of the vectors we
chose.
We have changed our data from being in terms
of the axes X
1
and X
2
, to now be in terms of
our 2 eigenvectors.
67
PCA Process STEP 5
FinalData (transpose: dimensions along columns)
162675287 . 0 22382956 . 1
0177646297 . 0 438046137 . 0
0464172582 . 0 14457216 . 1
349824698 . 0 0991094375 . 0
175282444 . 0 912949103 . 0
209498461 . 0 67580142 . 1
130417207 . 0 274210416 . 0
384374989 . 0 992197494 . 0
142857227 . 0 77758033 . 1
175115307 . 0 827870186 . 0
2 1
newX newX
68
PCA Process STEP 5
69
Reconstruction of Original Data
Recall that:
FinalData = RowFeatureVector x RowZeroMeanData
Then:
RowZeroMeanData = RowFeatureVector
-1
x FinalData
And thus:
RowOriginalData = (RowFeatureVector
-1
x FinalData) +
OriginalMean
If we use unit eigenvectors, the inverse is the same
as the transpose (hence, easier).
70
Reconstruction of Original Data
If we reduce the dimensionality (i.e., r < m),
obviously, when reconstructing the data we lose
those dimensions we chose to discard.
In our example let us assume that we considered
only a single eigenvector.
The final data is newX
1
only and the
reconstruction yields
71
Reconstruction of original Data
The variation along
the principal
component is
preserved.
The variation along
the other component
has been lost.
72
Next
We will practice with each other how would we use this powerful
mathematical tool in one of computer vision applications, which is
2D face recognition so stay excited
References
J. SHLENS, "TUTORIAL ON PRINCIPAL
COMPONENT ANALYSIS," 2005
http://www.snl.salk.edu/~shlens/pub/notes/pca.pdf
Introduction to PCA,
http://dml.cs.byu.edu/~cgc/docs/dm/Slides/