The EM Algorithm for Mixtures of Factor Analyzers
Zoubin Ghahramani
Geoffrey E. Hinton
Abstract
Factor analysis, a statistical method for modeling the covariance structure of high dimensional data using a small number of latent variables, can be extended by allowing different local factor models in different regions of the input space. This results in a model which concurrently performs clustering and dimensionality reduction, and can be thought of as a reduced dimension mixture of Gaussians. We present an exact Expectation-Maximization algorithm for fitting the parameters of this mixture of factor analyzers.
1 Introduction
Clustering and dimensionality reduction have long been considered two of the fundamental
problems in unsupervised learning (Duda & Hart, 1973; Chapter 6). In clustering, the goal
is to group data points by similarity between their features. Conversely, in dimensionality
reduction, the goal is to group (or compress) features that are highly correlated. In this
paper we present an EM learning algorithm for a method which combines one of the basic forms of dimensionality reduction (factor analysis) with a basic method for clustering (the Gaussian mixture model). What results is a statistical method which concurrently performs clustering and, within each cluster, local dimensionality reduction.
Local dimensionality reduction presents several benefits over a scheme in which clustering and dimensionality reduction are performed separately. First, different features may be correlated within different clusters and thus the metric for dimensionality reduction may need to vary between different clusters. Conversely, the metric induced in dimensionality reduction may guide the process of cluster formation; i.e., different clusters may appear more separated depending on the local metric.
Recently, there has been a great deal of research on the topic of local dimensionality
reduction, resulting in several variants on the basic concept with successful applications to
character and face recognition (Bregler and Omohundro, 1994; Kambhatla and Leen, 1994;
Sung and Poggio, 1994; Schwenk and Milgram, 1995; Hinton et al., 1995). The algorithm
used by these authors for dimensionality reduction is principal components analysis (PCA).
Figure 1: The factor analysis generative model, with latent factors z generating the observed data vector x.
2 Factor Analysis
In maximum likelihood factor analysis, a p-dimensional real-valued data vector x (assumed to have zero mean) is modeled as generated from a k-dimensional vector of real-valued factors z, with k typically much smaller than p:

x = \Lambda z + u,    (1)

where \Lambda is the p \times k factor loading matrix, the factors are distributed as z \sim \mathcal{N}(0, I), and the noise u \sim \mathcal{N}(0, \Psi) is independent of z with diagonal covariance matrix \Psi. The expected value of the factors given an observation is

E(z|x) = \beta x,    (2)

where \beta \equiv \Lambda'(\Psi + \Lambda\Lambda')^{-1}, a fact that results from the joint normality of data and factors:

P\left( \begin{bmatrix} x \\ z \end{bmatrix} \right) = \mathcal{N}\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Lambda\Lambda' + \Psi & \Lambda \\ \Lambda' & I \end{bmatrix} \right).    (3)
Note that since \Psi is diagonal, the p \times p matrix (\Psi + \Lambda\Lambda') can be efficiently inverted using the matrix inversion lemma:

(\Psi + \Lambda\Lambda')^{-1} = \Psi^{-1} - \Psi^{-1}\Lambda (I + \Lambda'\Psi^{-1}\Lambda)^{-1} \Lambda'\Psi^{-1},
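For concreteness, the following minimal NumPy sketch (illustrative only, not the authors' distributed Matlab code) uses the lemma to form \beta = \Lambda'(\Psi + \Lambda\Lambda')^{-1} while inverting only the diagonal \Psi and a k \times k matrix; the variable names are ours.

    import numpy as np

    def beta_via_lemma(Lambda, psi_diag):
        """beta = Lambda' (Psi + Lambda Lambda')^{-1}, computed without inverting a p x p matrix."""
        k = Lambda.shape[1]
        Lt_Psi_inv = Lambda.T / psi_diag                   # Lambda' Psi^{-1} (Psi is diagonal), shape (k, p)
        A = np.eye(k) + Lt_Psi_inv @ Lambda                # the k x k matrix I + Lambda' Psi^{-1} Lambda
        # Matrix inversion lemma, multiplied on the left by Lambda':
        # Lambda'(Psi + Lambda Lambda')^{-1} = Lambda'Psi^{-1} - (Lambda'Psi^{-1}Lambda) A^{-1} (Lambda'Psi^{-1})
        return Lt_Psi_inv - (Lt_Psi_inv @ Lambda) @ np.linalg.solve(A, Lt_Psi_inv)

Only a k \times k linear system is solved, so the dominant cost scales with the number of factors k rather than with the data dimensionality p.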
where I is the k \times k identity matrix. Furthermore, it is possible (and in fact necessary for EM) to compute the second moment of the factors,

E(zz'|x) = \mathrm{Var}(z|x) + E(z|x)E(z|x)'
         = I - \beta\Lambda + \beta x x' \beta',    (4)

which provides a measure of uncertainty in the factors, a quantity that has no analogue in PCA.
The expectations (2) and (4) form the basis of the EM algorithm for maximum likelihood
factor analysis (see Appendix A and Rubin & Thayer, 1982):
E-step: Compute E(z|x_i) and E(zz'|x_i) for each data point x_i, given \Lambda and \Psi.
M-step:

\Lambda^{new} = \left( \sum_{i=1}^{n} x_i\, E(z|x_i)' \right) \left( \sum_{l=1}^{n} E(zz'|x_l) \right)^{-1}    (5)

\Psi^{new} = \frac{1}{n}\, \mathrm{diag}\left\{ \sum_{i=1}^{n} x_i x_i' - \Lambda^{new} E(z|x_i)\, x_i' \right\}    (6)
where the diag operator sets all the off-diagonal elements of a matrix to zero.
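The following NumPy sketch strings the two steps together for one EM iteration. It is an illustration under our own variable names (reusing the beta_via_lemma helper sketched above), not the Matlab code referenced in section 4, and it assumes the data have been centered.

    import numpy as np

    def fa_em_step(X, Lambda, psi_diag):
        """One EM iteration for factor analysis; X is (n, p) and assumed zero-mean."""
        n, p = X.shape
        k = Lambda.shape[1]
        beta = beta_via_lemma(Lambda, psi_diag)            # (k, p), as in equation (2)
        Ez = X @ beta.T                                    # rows are E(z|x_i), equation (2)
        # Sum over i of E(zz'|x_i) = I - beta Lambda + beta x_i x_i' beta', equation (4)
        sum_Ezz = n * (np.eye(k) - beta @ Lambda) + Ez.T @ Ez
        Lambda_new = (X.T @ Ez) @ np.linalg.inv(sum_Ezz)   # equation (5)
        # Equation (6): diagonal of (1/n) sum_i [ x_i x_i' - Lambda_new E(z|x_i) x_i' ]
        psi_new = np.mean(X * X - (Ez @ Lambda_new.T) * X, axis=0)
        return Lambda_new, psi_new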
3 Mixture of Factor Analyzers

Consider a mixture of m factor analyzers, indexed by the discrete indicator variable \omega_j, j = 1, \ldots, m, in which the j-th analyzer has its own factor loading matrix \Lambda_j and mean \mu_j, while all analyzers share the diagonal noise covariance \Psi. The generative model now obeys the mixture density

P(x) = \sum_{j=1}^{m} \int P(x | z, \omega_j)\, P(z | \omega_j)\, P(\omega_j)\, dz.    (7)

As in regular factor analysis, the factors are all assumed to be \mathcal{N}(0, I) distributed, therefore,

P(z | \omega_j) = P(z) = \mathcal{N}(0, I).    (8)
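Integrating the factors out of (7) gives each component a Gaussian marginal with covariance \Lambda_j\Lambda_j' + \Psi. The following minimal NumPy sketch (illustrative, not the authors' Matlab implementation) evaluates the resulting mixture density and the posterior responsibilities h_{ij} = P(\omega_j | x_i) that appear later in the EM derivation; the component means \mu_j, loadings \Lambda_j, shared diagonal \Psi, and mixing proportions \pi_j = P(\omega_j) are assumed given.

    import numpy as np
    from scipy.stats import multivariate_normal

    def mfa_joint(X, pis, mus, Lambdas, psi_diag):
        """Component-wise joint densities pi_j * N(x_i; mu_j, Lambda_j Lambda_j' + Psi), shape (n, m)."""
        cols = []
        for pi_j, mu_j, Lambda_j in zip(pis, mus, Lambdas):
            cov_j = Lambda_j @ Lambda_j.T + np.diag(psi_diag)   # component marginal covariance
            cols.append(pi_j * multivariate_normal.pdf(X, mean=mu_j, cov=cov_j))
        return np.column_stack(cols)

    def mfa_density_and_responsibilities(X, pis, mus, Lambdas, psi_diag):
        joint = mfa_joint(X, pis, mus, Lambdas, psi_diag)
        density = joint.sum(axis=1)      # P(x_i), equation (7) with the factors integrated out
        h = joint / density[:, None]     # h_ij = P(omega_j | x_i)
        return density, h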
Figure 2: The generative model for the mixture of factor analyzers, relating the mixture indicator \omega, the factors z, and the observed data x.
4 Discussion
We have described an EM algorithm for fitting a mixture of factor analyzers. Matlab source code for the algorithm can be obtained from ftp://ftp.cs.toronto.edu/pub/zoubin/mfa.tar.gz. An extension of this architecture to time series data, in which both the factors z and the discrete variables \omega depend on their value at a previous time step, is currently being developed.
One of the important issues not addressed in this note is model selection. In fitting a mixture of factor analyzers the modeler has two free parameters to decide: the number of factor analyzers to use (m), and the number of factors in each analyzer (k). One method by which these can be selected is cross-validation: several values of m and k are fit to the data and the log likelihood on a validation set is used to select the final values. Greedy methods based on pruning or growing the mixture may be more efficient at the cost of some performance loss. Alternatively, a full-fledged Bayesian analysis, in which these model parameters are integrated over, may also be possible.
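As a sketch of the cross-validation procedure just described, the helpers mfa_fit and mfa_loglik below are hypothetical placeholders for a fitting routine and a held-out log-likelihood routine (for example, ones built from the EM sketches in this note); only the selection loop itself is shown.

    def select_m_k(X_train, X_valid, m_values, k_values, mfa_fit, mfa_loglik):
        """Choose the number of analyzers m and factors k by validation log likelihood."""
        best = None
        for m in m_values:
            for k in k_values:
                params = mfa_fit(X_train, m, k)        # fit a mixture of m analyzers with k factors each
                score = mfa_loglik(X_valid, params)    # log likelihood on the validation set
                if best is None or score > best[0]:
                    best = (score, m, k, params)
        return best                                    # (validation log likelihood, m, k, fitted parameters)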
Acknowledgements
We thank C. Bishop for comments on the manuscript. The research was funded by grants
from the Canadian Natural Science and Engineering Research Council and the Ontario
Information Technology Research Center. GEH is the Nesbitt-Burns fellow of the Canadian
Institute for Advanced Research.
A EM for Factor Analysis

Maximizing the expected complete-data log likelihood with respect to \Lambda, and setting \partial Q / \partial \Psi^{-1} to zero under the diagonal constraint on \Psi, yields the re-estimation equations quoted in section 2:

\Lambda^{new} = \left( \sum_{i=1}^{n} x_i\, E(z|x_i)' \right) \left( \sum_{l=1}^{n} E(zz'|x_l) \right)^{-1},

\Psi^{new} = \frac{1}{n}\, \mathrm{diag}\left\{ \sum_{i=1}^{n} x_i x_i' - \Lambda^{new} E(z|x_i)\, x_i' \right\}.
B EM for Mixture of Factor Analyzers

The expected complete-data log likelihood for the mixture of factor analyzers is

Q = E\left[ \log \prod_{i}\prod_{j} \left( (2\pi)^{-p/2} |\Psi|^{-1/2} \exp\left\{ -\tfrac{1}{2}\, [x_i - \Lambda_j z - \mu_j]'\, \Psi^{-1}\, [x_i - \Lambda_j z - \mu_j] \right\} \right)^{w_{ij}} \right],

where w_{ij} is an indicator variable equal to 1 if data point i was generated by analyzer j, and 0 otherwise. Defining the augmented loading matrix \tilde{\Lambda}_j = [\Lambda_j \; \mu_j] and the augmented factor vector \tilde{z} = [z' \; 1]', this can be rewritten as

Q = E\left[ \log \prod_{i}\prod_{j} \left( (2\pi)^{-p/2} |\Psi|^{-1/2} \exp\left\{ -\tfrac{1}{2}\, [x_i - \tilde{\Lambda}_j \tilde{z}]'\, \Psi^{-1}\, [x_i - \tilde{\Lambda}_j \tilde{z}] \right\} \right)^{w_{ij}} \right]
  = c - \frac{n}{2} \log|\Psi| - \sum_{i,j} \left( \tfrac{1}{2} h_{ij}\, x_i' \Psi^{-1} x_i - h_{ij}\, x_i' \Psi^{-1} \tilde{\Lambda}_j E[\tilde{z}|x_i, \omega_j] + \tfrac{1}{2} h_{ij}\, \mathrm{tr}\!\left[ \tilde{\Lambda}_j' \Psi^{-1} \tilde{\Lambda}_j E[\tilde{z}\tilde{z}'|x_i, \omega_j] \right] \right),

where c is a constant independent of the parameters and h_{ij} = E[w_{ij}|x_i]. Maximizing Q with respect to \tilde{\Lambda}_j gives

\tilde{\Lambda}_j^{new} = \left( \sum_{l} h_{lj}\, x_l\, E[\tilde{z}|x_l, \omega_j]' \right) \left( \sum_{l} h_{lj}\, E[\tilde{z}\tilde{z}'|x_l, \omega_j] \right)^{-1},    (15)
where

E[\tilde{z}|x_i, \omega_j] = \begin{bmatrix} E[z|x_i, \omega_j] \\ 1 \end{bmatrix}

and

E[\tilde{z}\tilde{z}'|x_l, \omega_j] = \begin{bmatrix} E[zz'|x_l, \omega_j] & E[z|x_l, \omega_j] \\ E[z|x_l, \omega_j]' & 1 \end{bmatrix}.

We re-estimate the matrix \Psi through its inverse, setting

\frac{\partial Q}{\partial \Psi^{-1}} = \frac{n}{2} \Psi^{new} - \sum_{i,j} \left( \tfrac{1}{2} h_{ij}\, x_i x_i' - h_{ij}\, \tilde{\Lambda}_j^{new} E[\tilde{z}|x_i, \omega_j]\, x_i' + \tfrac{1}{2} h_{ij}\, \tilde{\Lambda}_j^{new} E[\tilde{z}\tilde{z}'|x_i, \omega_j]\, \tilde{\Lambda}_j^{new}{}' \right) = 0.
Substituting equation (15) for \tilde{\Lambda}_j^{new} and using the diagonal constraint on \Psi, we obtain

\Psi^{new} = \frac{1}{n}\, \mathrm{diag}\left\{ \sum_{i,j} h_{ij} \left( x_i - \tilde{\Lambda}_j^{new} E[\tilde{z}|x_i, \omega_j] \right) x_i' \right\}.    (16)
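To make the augmented-matrix bookkeeping concrete, here is a brief NumPy sketch (again illustrative only, not the distributed Matlab code) of the M-step updates (15) and (16). It assumes the E-step has already produced the responsibilities h[i, j], the conditional means Ez[i, j] = E[z|x_i, \omega_j], and the conditional second moments Ezz[i, j] = E[zz'|x_i, \omega_j]; the array names are ours.

    import numpy as np

    def mfa_m_step(X, h, Ez, Ezz):
        """M-step for a mixture of factor analyzers.
        X: (n, p) data; h: (n, m) responsibilities;
        Ez: (n, m, k) conditional factor means; Ezz: (n, m, k, k) conditional second moments."""
        n, p = X.shape
        _, m, k = Ez.shape
        Lambdas_tilde = np.zeros((m, p, k + 1))
        psi_accum = np.zeros(p)
        for j in range(m):
            # Augmented expectations: E[~z|x_i, w_j] = [E[z|x_i, w_j]; 1]
            Ez_t = np.concatenate([Ez[:, j, :], np.ones((n, 1))], axis=1)   # (n, k+1)
            # E[~z ~z'|x_i, w_j] = [[E[zz'], E[z]], [E[z]', 1]]
            Ezz_t = np.zeros((n, k + 1, k + 1))
            Ezz_t[:, :k, :k] = Ezz[:, j, :, :]
            Ezz_t[:, :k, k] = Ez[:, j, :]
            Ezz_t[:, k, :k] = Ez[:, j, :]
            Ezz_t[:, k, k] = 1.0
            hj = h[:, j]
            # Equation (15): (sum_i h_ij x_i E[~z|x_i,w_j]') (sum_l h_lj E[~z~z'|x_l,w_j])^{-1}
            A = (X * hj[:, None]).T @ Ez_t                                  # (p, k+1)
            B = np.einsum('i,iab->ab', hj, Ezz_t)                           # (k+1, k+1)
            Lambdas_tilde[j] = A @ np.linalg.inv(B)
            # Accumulate the diagonal term of equation (16)
            psi_accum += np.sum(hj[:, None] * (X - Ez_t @ Lambdas_tilde[j].T) * X, axis=0)
        psi_new = psi_accum / n
        return Lambdas_tilde, psi_new   # ~Lambda_j^new = [Lambda_j^new  mu_j^new], shared diagonal Psi^new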
References