Gharamani Pca

Download as pdf or txt
Download as pdf or txt
You are on page 1of 8

The EM Algorithm for Mixtures of Factor Analyzers

Zoubin Ghahramani
Geo rey E. Hinton

Department of Computer Science


University of Toronto
6 King's College Road
Toronto, Canada M5S 1A4
Email: zoubin@cs.toronto.edu
Technical Report CRG-TR-96-1
May 21, 1996 (revised Feb 27, 1997)

Abstract

Factor analysis, a statistical method for modeling the covariance structure of high
dimensional data using a small number of latent variables, can be extended by allowing
di erent local factor models in di erent regions of the input space. This results in a
model which concurrently performs clustering and dimensionality reduction, and can
be thought of as a reduced dimension mixture of Gaussians. We present an exact
Expectation{Maximization algorithm for tting the parameters of this mixture of factor
analyzers.

1 Introduction
Clustering and dimensionality reduction have long been considered two of the fundamental
problems in unsupervised learning (Duda & Hart, 1973; Chapter 6). In clustering, the goal
is to group data points by similarity between their features. Conversely, in dimensionality
reduction, the goal is to group (or compress) features that are highly correlated. In this
paper we present an EM learning algorithm for a method which combines one of the basic
forms of dimensionality reduction|factor analysis|with a basic method for clustering|the
Gaussian mixture model. What results is a statistical method which concurrently performs
clustering and, within each cluster, local dimensionality reduction.
Local dimensionality reduction presents several bene ts over a scheme in which clustering
and dimensionality reduction are performed separately. First, di erent features may be
correlated within di erent clusters and thus the metric for dimensionality reduction may
need to vary between di erent clusters. Conversely, the metric induced in dimensionality
reduction may guide the process of cluster formation|i.e. di erent clusters may appear
more separated depending on the local metric.
Recently, there has been a great deal of research on the topic of local dimensionality
reduction, resulting in several variants on the basic concept with successful applications to
character and face recognition (Bregler and Omohundro, 1994; Kambhatla and Leen, 1994;
Sung and Poggio, 1994; Schwenk and Milgram, 1995; Hinton et al., 1995). The algorithm
used by these authors for dimensionality reduction is principal components analysis (PCA).
1


z



x

-

Figure 1: The factor analysis generative model (in vector form).


PCA, unlike maximum likelihood factor analysis (FA), does not de ne a proper density
model for the data, as the cost of coding a data point is equal anywhere along the principal
component subspace (i.e. the density is un-normalized along these directions). Furthermore,
PCA is not robust to independent noise in the features of the data (see Hinton et al., 1996,
for a comparison of PCA and FA models) . Hinton, Dayan, and Revow (1996), also exploring
an application to digit recognition, were the rst to extend mixtures of principal components
analyzers to a mixture of factor analyzers. Their learning algorithm consisted of an outer
loop of approximate EM to t the mixture components, combined with an inner loop of
gradient descent to t each individual factor model. In this note we present an exact EM
algorithm for mixtures of factor analyzers which obviates the need for an outer and inner
loop. This simpli es the implementation, reduces the number of heuristic parameters (i.e.
learning rates or steps of conjugate gradient descent), and can potentially result in speed-ups.
In the next section we present background material on factor analysis and the EM algorithm. This is followed by the derivation of the learning algorithm for mixture of factor
analyzers in section 3. We close with a discussion in section 4.

2 Factor Analysis

In maximum likelihood factor analysis (FA), a p-dimensional real-valued data vector x is


modeled using a k-dimensional vector of real-valued factors, z, where k is generally much
smaller than p (Everitt, 1984). The generative model is given by:
x = z + u;
(1)
where  is known as the factor loading matrix (see Figure 1). The factors z are assumed
to be N (0; I ) distributed (zero-mean independent normals, with unit variance). The pdimensional random variable u is distributed N (0; ), where is a diagonal matrix. The
diagonality of is one of the key assumptions of factor analysis: The observed variables are
independent given the factors. According to this model, x is therefore distributed with zero
mean and covariance 0 + ; and the goal of factor analysis is to nd the  and that
best model the covariance structure of x. The factor variables z model correlations between
the elements of x, while the u variables account for independent noise in each element of x.
The k factors play the same role as the principal components in PCA: They are informative projections of the data. Given  and , the expected value of the factors can be
2

computed through the linear projection:

E (zjx) = x;
(2)
where  0( + 0);1 , a fact that results from the joint normality of data and factors:
" #!
" # "
0 +  #!
x
0

P z
= N 0 ;
(3)
0
I :
Note that since is diagonal, the p  p matrix ( + 0), can be eciently inverted using
the matrix inversion lemma:
( + 0);1 = ;1 ; ;1(I + 0 ;1);10 ;1 ;
where I is the k  k identity matrix. Furthermore, it is possible (and in fact necessary for
EM) to compute the second moment of the factors,
E (zz0jx) = Var(zjx) + E (zjx)E (zjx)0
= I ;  + xx0 0;
(4)
which provides a measure of uncertainty in the factors, a quantity that has no analogue in
PCA.
The expectations (2) and (4) form the basis of the EM algorithm for maximum likelihood
factor analysis (see Appendix A and Rubin & Thayer, 1982):

E-step: Compute E (zjxi) and E (zz0jxi) for each data point xi, given  and .
! n
!;1
n
M-step:
X
X
new
0
0
 =
xiE (zjxi)
E (zz jxl)
(5)
i=1 (
n
X
1

new = n diag

i=1

l=1

xix0i ; newE [zjxi]x0i ;

(6)

where the diag operator sets all the o -diagonal elements of a matrix to zero.

3 Mixture of Factor Analyzers

Assume we have a mixture of m factor analyzers indexed by !j , j = 1; : : : ; m. The generative


model now obeys the following mixture distribution (see Figure 2):

P (x) =

m Z
X
j =1

P (xjz; !j )P (zj!j )P (!j )dz:

(7)

As in regular factor analysis, the factors are all assumed to be N (0; I ) distributed, therefore,

P (zj!j ) = P (z) = N (0; I ):


3

(8)




!
z
 
 ;

x

S


S  j
Sw /
-

Figure 2: The mixture of factor analysis generative model.


Whereas in factor analysis the data mean was irrelevant and was subtracted before tting the
model, here we have the freedom to give each factor analyzer a di erent mean, j , thereby
allowing each to model the data covariance structure in a di erent part of input space,
P (xjz; !j ) = N (j + j z; ):
(9)
The parameters of this model are f(j ; j )mj=1; ; g; 1 the vector  parametrizes the
adaptable mixing proportions, j = P (!j ). The latent variables in this model are the factors
z and the mixture indicator variable !, where wj = 1 when the data point was generated
by !j . For the E-step of the EM algorithm, one needs to compute expectations of all
the interactions of the hidden variables that appear in the log likelihood. Fortunately, the
following statements can be easily veri ed,
E [wj zjxi] = E [wj jxi] E [zj!j ; xi]
(10)
0
0
E [wj zz jxi] = E [wj jxi] E [zz j!j ; xi]:
(11)
De ning
hij = E [wj jxi] / P (xi; !j ) = j N (xi ; j ; j 0j + )
(12)
and using equations (2) and (10) we obtain
E [wj zjxi] = hij j (xi ; j );
(13)
where j  0j ( + j 0j );1 . Similarly, using equations (4) and (11) we obtain


E [wj zz0jxi] = hij I ; j j + j (xi ; j )(xi ; j )0 j0 :
(14)
The EM algorithm for mixtures of factor analyzers therefore becomes:
E-step: Compute hij , E [zjxi; !j ] and E [zz0jxi; !j ] for all data points i and mixture
components j .
M-step: Solve a set of linear equations for j , j , j and (see Appendix B).
The mixture of factor analyzers is, in essence, a reduced dimensionality mixture of Gaussians. Each factor analyzer ts a Gaussian to a portion of the data, weighted by the posterior
probabilities, hij . Since the covariance matrix for each Gaussian is speci ed through the
lower dimensional factor loading matrices, the model has mkp + p, rather than mp(p + 1)=2,
parameters dedicated to modeling covariance structure.
Note that each model can also be allowed to have a separate matrix. This, however, changes its
interpretation as sensor noise.
1

4 Discussion

We have described an EM algorithm for tting a mixture of factor analyzers. Matlab source
code for the algorithm can be obtained from ftp://ftp.cs.toronto.edu/pub/zoubin/
mfa.tar.gz. An extension of this architecture to time series data, in which both the factors
z and the discrete variables ! depend on their value at a previous time step, is currently
being developed.
One of the important issues not addressed in this note is model selection. In tting a
mixture of factor analyzers the modeler has two free parameters to decide: The number of
factor analyzers to use (m), and the number of factor in each analyzer (k). One method
by which these can be selected is cross-validation: several values of m and k are t to the
data and the log likelihood on a validation set is used to select the nal values. Greedy
methods based on pruning or growing the mixture may be more ecient at the cost of
some performance loss. Alternatively, a full- edged Bayesian analysis, in which these model
parameters are integrated over, may also be possible.

Acknowledgements

We thank C. Bishop for comments on the manuscript. The research was funded by grants
from the Canadian Natural Science and Engineering Research Council and the Ontario
Information Technology Research Center. GEH is the Nesbitt-Burns fellow of the Canadian
Institute for Advanced Research.

A EM for Factor Analysis


The expected log likelihood for factor analysis is
"
#
Y
1
p=
2
;
1=2
0
;
1
Q = E log (2) j j expf; 2 [xi ; z] [xi ; z]g
i

X 
n
= c ; 2 log j j ; E 21 x0i ;1xi ; x0i ;1z + 21 z00 ;1z
i

h
i
X
n
1
1
0
;
1
0
;
1
0
;
1
0
= c ; 2 log j j ;
xi xi ; xi  E [zjxi] + 2 tr   E [zz jxi] ;
i 2
where c is a constant, independent of the parameters, and tr is the trace operator.
To re-estimate the factor loading matrix we set
@Q = ; X ;1x E [zjx ]0 + X ;1newE [zz0jx ] = 0
i
i
l
@
i
l
obtaining


new

E [zz0jxl]0

xiE [zjxi]0

from which we get equation (5).


We re-estimate the matrix through its inverse, setting
@Q = n new ; X  1 x x0 ; new E [zjx ] x0 + 1 newE [zz0jx ]new  = 0:
i i
i i
i
@ ;1
2
2
i 2
0

Substituting equation (5),

n new =
2

X1

0 1 new
0
x
i xi ;  E [zjxi] xi
2
2

and using the diagonal constraint,


(
)
X
1
0
new
0
new

= n diag
xixi ;  E [zjxi]xi :
i

B EM for Mixture of Factor Analyzers

The expected log likelihood for mixture of factor analysis is

Q =

2
YY
4
E log
(2)p=2j j;1=2 expf; 12 [ i ; j ; j ]0 ;1[ i ; j
i j

wj 3
; j ]g 5

To jointly estimate the mean j and the factor loadings j it is useful to de ne an


augmented column vector of factors
" #
z~ = z1
and an augmented factor loading matrix ~ j = [j j ]. The expected log likelihood is then

Q =
=

2
wj 3
YY
1
E 4log
(2)p=2j j;1=2 expf; 2 [ i ; ~ j ~]0 ;1[ i ; ~ j ~]g 5
i j
h
X
c ; n2 log j j ; 21 hij 0i ;1 i ; hij 0i ;1~ j E [~j i; !j ] + 21 hij tr ~ 0j ;1~ j
i;j

zx

E [~z~z0jxi; !j ]

where c is a constant. To estimate ~ j we set


@Q = ; X h ;1x E [~zjx ; ! ]0 + h ;1~ newE [~z~z0jx ; ! ] = 0:
ij
i
i j
ij
i j
j
@ ~ j
i
This results in a linear equation for re-estimating the means and factor loadings,
h

j j
new

new

= ~ new
j =

hij xiE [~zjxi; !j ]0


6

!;1

hlj E [~z~z0jxl; !j ]

(15)

where
and

"

E [~zjxi; !j ] = E [zjx1i; !j ]

"

E [zz0jxl; !j ] E [zjxl; !j ] :
E [zjxl; !j ]0
1
We re-estimate the matrix through its inverse, setting
@Q = n new ; X 1 h x x0 ; h ~ newE [~zjx ; ! ]x0 + 1 h ~ newE [~zz~0jx ; ! ]~ new = 0:
ij i i
ij j
i j i
i j j
@ ;1
2
2 ij j
ij 2
E [~z~z0jxl; !j ] =

Substituting equation (15) for ~ j and using the diagonal constraint on we obtain,
new =

8
9
<
=


X
1 diag
0 :
~ new
h
;

E
[~
j
;
!
]
ij
i
i
j
j
i
;
n : ij

zx

(16)

Finally, to re-estimate the mixing proportions we use the de nition,


Z

j = P (!j ) = P (!j jx)P (x) dx:


Since hij = P (!j jxi), using the empirical distribution of the data as an estimate of P (x) we
get
n
X
jnew = n1 hij :
i=1

References

Bregler, C. and Omohundro, S. M. (1994). Surface learning with applications to lip-reading.


In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information
Processing Systems 6, pages 43{50. Morgan Kaufman Publishers, San Francisco, CA.
Duda, R. O. and Hart, P. E. (1973). Pattern Classi cation and Scene Analysis. Wiley, New
York.
Everitt, B. S. (1984). An Introduction to Latent Variable Models. Chapman and Hall,
London.
Hinton, G., Revow, M., and Dayan, P. (1995). Recognizing handwritten digits using mixtures
of Linear models. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in
Neural Information Processing Systems 7, pages 1015{1022. MIT Press, Cambridge,
MA.
Hinton, G. E., Dayan, P., and Revow, M. (1996). Modeling the manifolds of Images of
handwritten digits. Submitted for Publication.
7

Kambhatla, N. and Leen, T. K. (1994). Fast non-linear dimension reduction. In Cowan,


J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing
Systems 6, pages 152{159. Morgan Kaufman Publishers, San Francisco, CA.
Rubin, D. and Thayer, D. (1982). EM algorithms for ML factor analysis. Psychometrika,
47(1):69{76.
Schwenk, H. and Milgram, M. (1995). Transformation invariant autoassociation with application to handwritten character recognition. In Tesauro, G., Touretzky, D., and Leen,
T., editors, Advances in Neural Information Processing Systems 7, pages 991{998. MIT
Press, Cambridge, MA.
Sung, K.-K. and Poggio, T. (1994). Example-based learning for view-based human face
detection. MIT AI Memo 1521, CBCL Paper 112.

You might also like