Some Studies of Expectation Maximization Clustering Algorithm To Enhance Performance
The probability density function of cluster $l$ is modeled as a $d$-dimensional Gaussian:

$$p(X \mid l) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_l|^{1/2}} \exp\!\left( -\frac{1}{2} (X - M_l)^T \Sigma_l^{-1} (X - M_l) \right) \qquad (1)$$

where $X$ is the sample column vector, the superscript $T$ indicates the transpose of a column vector, $|\Sigma_l|$ is the determinant of $\Sigma_l$ and $\Sigma_l^{-1}$ is its matrix inverse. The mixture model probability density function is:
$$p(X) = \sum_{l=1}^{k} w_l \, p(X \mid l). \qquad (2)$$
The coefficient $w_l$ is the weight corresponding to the fraction of the database represented by cluster $l$. A given data record $X$ is a member of each of the $k$ clusters with different probabilities of membership. The membership probability in population $l$ is given by:
$$p(l \mid X) = \frac{w_l \, p(X \mid l)}{p(X)}.$$
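As an illustration, the following C sketch evaluates Eqs. (1)-(2) and the membership probabilities for a single record. It is a minimal sketch, not the authors' implementation: the function names (`gaussian_pdf`, `memberships`), the flat row-major storage, and the assumption that the inverse covariance matrices and determinants are precomputed are our own choices.

```c
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Multivariate Gaussian density p(X|l) for one cluster (Eq. 1).
 * d       : number of features
 * x, mean : sample X and cluster mean M_l (length d)
 * inv_cov : precomputed inverse covariance Sigma_l^{-1} (d x d, row-major)
 * det     : precomputed determinant |Sigma_l|
 */
static double gaussian_pdf(int d, const double *x, const double *mean,
                           const double *inv_cov, double det)
{
    double q = 0.0;                /* quadratic term (X-M)^T Sigma^{-1} (X-M) */
    for (int i = 0; i < d; i++)
        for (int j = 0; j < d; j++)
            q += (x[i] - mean[i]) * inv_cov[i * d + j] * (x[j] - mean[j]);
    return exp(-0.5 * q) / (pow(2.0 * M_PI, d / 2.0) * sqrt(det));
}

/* Membership probabilities p(l|X) for all k clusters (Eqs. 2 and 3). */
static void memberships(int d, int k, const double *x,
                        const double *w, const double *means,
                        const double *inv_covs, const double *dets,
                        double *post)
{
    double px = 0.0;               /* mixture density p(X), Eq. 2 */
    for (int l = 0; l < k; l++) {
        post[l] = w[l] * gaussian_pdf(d, x, means + l * d,
                                      inv_covs + l * d * d, dets[l]);
        px += post[l];
    }
    for (int l = 0; l < k; l++)
        post[l] /= px;             /* p(l|X) = w_l p(X|l) / p(X) */
}
```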
The EM algorithm starts by initializing the mixture model parameters. Every EM iteration provides a non-decreasing log-likelihood of the model. The process is repeated until the log-likelihood of the mixture model at the previous iteration is sufficiently close to the log-likelihood of the current model or a fixed number of iterations is exhausted.
The algorithm proceeds as follows for the Gaussian mixture model:
1. Initialize the mixture model parameters $w_l$, $M_l$ and $\Sigma_l$, $l = 1, \ldots, k$, and set the current iteration $j = 0$.
2. Given the mixture model parameters at iteration $j$, update them as follows: for each database record $X_t$, $t = 1, \ldots, m$, compute the membership probability of $X_t$ in each cluster:
$$p(l \mid X_t) = \frac{w_l \, p(X_t \mid l)}{p(X_t)}, \qquad l = 1, \ldots, k. \qquad (3)$$
3. Update the mixture model parameters:
$$w_l = \frac{1}{m} \sum_{t=1}^{m} p(l \mid X_t), \qquad (4)$$
$$M_l = \frac{\sum_{t=1}^{m} p(l \mid X_t)\, X_t}{\sum_{t=1}^{m} p(l \mid X_t)}, \qquad (5)$$
$$\Sigma_l = \frac{\sum_{t=1}^{m} p(l \mid X_t)\, (X_t - M_l)(X_t - M_l)^T}{\sum_{t=1}^{m} p(l \mid X_t)}, \qquad l = 1, \ldots, k. \qquad (6)$$
4. Calculate the log-likelihood value,
$$llh_j = \sum_{t=1}^{m} \log\big(p(X_t)\big) = \sum_{t=1}^{m} \log\!\left( \sum_{l=1}^{k} w_l \, p(X_t \mid l) \right). \qquad (7)$$
5. If either the difference between the log-likelihood values of the present and last iterations is small, $|llh_j - llh_{j-1}| \le \varepsilon$, or the maximum number of iterations is exhausted, $j = 100$, then stop. Otherwise set $j = j + 1$ and go to step 2.
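A minimal, self-contained C sketch of this iteration skeleton (steps 1-5) is given below. To keep it short it is restricted to spherical covariances $\Sigma_l = \sigma_l^2 I$, so the determinant and inverse are trivial; the full-covariance updates follow Eq. (6) in exactly the same way. The function name `em_spherical` and the row-major data layout are our own assumptions.

```c
#include <math.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Minimal EM iteration skeleton (steps 1-5) for a spherical Gaussian mixture,
 * Sigma_l = var[l] * I, so |Sigma_l| = var[l]^d and Sigma_l^{-1} = I / var[l].
 * X              : m x d data matrix, row-major
 * w, mean, var   : mixture parameters, initialised by the caller (step 1)
 */
void em_spherical(int m, int d, int k, const double *X,
                  double *w, double *mean, double *var,
                  int max_iter, double eps)
{
    double *post = malloc((size_t)m * k * sizeof *post);   /* p(l|X_t) */
    if (!post) return;
    double prev_llh = -HUGE_VAL;

    for (int j = 0; j < max_iter; j++) {
        /* E-step (Eq. 3) and log-likelihood (Eq. 7) */
        double llh = 0.0;
        for (int t = 0; t < m; t++) {
            const double *x = X + (size_t)t * d;
            double px = 0.0;
            for (int l = 0; l < k; l++) {
                double q = 0.0;
                for (int i = 0; i < d; i++) {
                    double z = x[i] - mean[l * d + i];
                    q += z * z / var[l];
                }
                post[t * k + l] = w[l] * exp(-0.5 * q) /
                                  pow(2.0 * M_PI * var[l], d / 2.0);
                px += post[t * k + l];
            }
            llh += log(px);
            for (int l = 0; l < k; l++)
                post[t * k + l] /= px;
        }

        /* M-step (Eqs. 4-6, spherical case) */
        for (int l = 0; l < k; l++) {
            double s = 0.0;
            for (int i = 0; i < d; i++) mean[l * d + i] = 0.0;
            for (int t = 0; t < m; t++) {
                double p = post[t * k + l];
                s += p;
                for (int i = 0; i < d; i++)
                    mean[l * d + i] += p * X[(size_t)t * d + i];
            }
            for (int i = 0; i < d; i++) mean[l * d + i] /= s;
            double v = 0.0;
            for (int t = 0; t < m; t++) {
                double p = post[t * k + l], q = 0.0;
                for (int i = 0; i < d; i++) {
                    double z = X[(size_t)t * d + i] - mean[l * d + i];
                    q += z * z;
                }
                v += p * q;
            }
            var[l] = v / (s * d);        /* average per-feature variance */
            w[l] = s / m;                /* Eq. 4 */
        }

        /* step 5: stop when the log-likelihood change is small */
        if (fabs(llh - prev_llh) <= eps) break;
        prev_llh = llh;
    }
    free(post);
}
```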
3. Quadratic Term and Techniques to Speed Up Its Calculation
The expression $(X - M_l)^T \Sigma_l^{-1} (X - M_l)$ is known as the quadratic term. For every data sample, the probability density function (pdf) has to be calculated, which in turn requires calculating the quadratic term. N. B. Venkateswarlu et al. introduce some methods, useful in machine learning classification, that reduce the number of computations while classifying a sample by using fewer computations in calculating the pdf of each group [36]. The quadratic term in the pdf is expressed as a monotonically increasing sum of squares by representing the inverse covariance matrix in Lower Triangular Canonical Form (LTCF).
The computational complexity of LTCF in calculating quadratic terms recursively is identified to be about half that of the standard quadratic term computation.
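To make this concrete: a direct evaluation of the quadratic term multiplies the $d \times d$ inverse covariance by the length-$d$ vector $X - M_l$ and then forms a dot product, i.e. roughly $d^2 + d$ multiplications per sample and cluster, whereas the LTCF counts derived below are $d(d+3)/2$. The direct count here is our own back-of-the-envelope figure, not taken from the paper's tables:

$$\frac{d(d+3)/2}{d^2 + d} = \frac{d+3}{2(d+1)} \longrightarrow \frac{1}{2} \quad \text{as } d \text{ grows.}$$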
In this paper, one recursive approach and two approaches based on LTCF are used for quadratic term computation in the EM clustering algorithm. The quadratic term value of the first $n$ features can be used to calculate the quadratic term value of the first $n + 1$ ($1 \le n \le d - 1$) features iteratively. The computational time required by the proposed EM algorithms is compared with that of the standard EM algorithm. The performance of these algorithms is studied by applying the EM algorithm to synthetic data, varying the dimensionality and the number of Gaussian distributions.
3.1 Recursive Approach
We partition $\Sigma_{n+1}$ in the following manner:
$$\Sigma_{n+1} = \begin{pmatrix} \Sigma_n & U_{n+1} \\ U_{n+1}^T & \sigma_{n+1,n+1} \end{pmatrix}$$
where
$$\Sigma_n = [\sigma_{ij}] = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix}
\quad \text{and} \quad
U_{n+1} = [\,\sigma_{1,n+1} \;\; \sigma_{2,n+1} \;\; \cdots \;\; \sigma_{n,n+1}\,]^T.$$
Mathematically, we can show that
$$\Sigma_{n+1}^{-1} = \begin{pmatrix} \Sigma_n^{-1} + \delta_{n+1}\, \Sigma_n^{-1} U_{n+1} U_{n+1}^T \Sigma_n^{-1} & -\delta_{n+1}\, \Sigma_n^{-1} U_{n+1} \\ -\delta_{n+1}\, U_{n+1}^T \Sigma_n^{-1} & \delta_{n+1} \end{pmatrix}$$
where
$$\delta_{n+1} = \frac{1}{\sigma_{n+1,n+1} - U_{n+1}^T \Sigma_n^{-1} U_{n+1}}.$$
In the above equations, the subscript $n$ is used to denote the first $n$ features. By setting $Z_n = X_n - M_n$, a recursive form of the quadratic term can be derived as:
$$Q(x)_{n+1} = (X_{n+1} - M_{n+1})^T \, \Sigma_{n+1}^{-1} \, (X_{n+1} - M_{n+1})$$
$$= \begin{pmatrix} Z_n^T & z_{n+1} \end{pmatrix}
\begin{pmatrix} \Sigma_n^{-1} + \delta_{n+1}\, \Sigma_n^{-1} U_{n+1} U_{n+1}^T \Sigma_n^{-1} & -\delta_{n+1}\, \Sigma_n^{-1} U_{n+1} \\ -\delta_{n+1}\, U_{n+1}^T \Sigma_n^{-1} & \delta_{n+1} \end{pmatrix}
\begin{pmatrix} Z_n \\ z_{n+1} \end{pmatrix}$$
$$= Z_n^T \Sigma_n^{-1} Z_n + \delta_{n+1} \Big( \big(U_{n+1}^T \Sigma_n^{-1} Z_n\big)^2 - 2\, z_{n+1}\, U_{n+1}^T \Sigma_n^{-1} Z_n + z_{n+1}^2 \Big)$$
$$= Q(x)_n + \delta_{n+1} \big( t_n^2 - 2\, t_n z_{n+1} + z_{n+1}^2 \big)$$
$$= Q(x)_n + \delta_{n+1} \, (t_n - z_{n+1})^2$$
where
$$t_n = U_{n+1}^T \Sigma_n^{-1} Z_n = A_n^T Z_n = \sum_{k=1}^{n} a_k z_k,
\qquad
Z_{n+1} = X_{n+1} - M_{n+1} = \begin{pmatrix} Z_n \\ z_{n+1} \end{pmatrix},
\qquad
A_n^T = U_{n+1}^T \Sigma_n^{-1} = [\,a_1 \;\; a_2 \;\; \cdots \;\; a_n\,].$$
The first term is known from the previous iteration. The vector $A_n^T = U_{n+1}^T \Sigma_n^{-1}$ needs to be calculated only once for each cluster at the beginning, then saved in memory and regarded as a constant at the clustering stage for each sample. At the clustering stage, the calculation of the above equation requires only $n + 3$ multiplications. Thus, the total number of multiplications for one cluster through all $d$ features is $\frac{d(d+5)}{2} - 3$.
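A possible C realisation of this recursion is sketched below, under our own storage assumptions: for each cluster, the vectors $A_n^T$ ($n = 1, \ldots, d-1$) are precomputed once and packed row after row into the flat array `a`, the scalars $\delta_{n+1}$ into `delta`, and `inv_s11` holds $1/\sigma_{11}$ for the one-feature starting value $Q(x)_1 = z_1^2 / \sigma_{11}$.

```c
/* Recursive quadratic-term evaluation (Section 3.1) for one sample and one
 * cluster.  Precomputed once per cluster, before scanning the data:
 *   inv_s11 : 1 / sigma_11
 *   a       : A_n^T = U_{n+1}^T Sigma_n^{-1} for n = 1..d-1, stored flat;
 *             row n has n entries and starts at offset n*(n-1)/2
 *   delta   : delta[n-1] = 1 / (sigma_{n+1,n+1} - U_{n+1}^T Sigma_n^{-1} U_{n+1})
 * Per sample: z[i] = x[i] - mean[i], i = 0..d-1.
 */
static double quad_recursive(int d, const double *z, double inv_s11,
                             const double *a, const double *delta)
{
    double q = z[0] * z[0] * inv_s11;          /* Q(x)_1 */
    for (int n = 1; n < d; n++) {
        const double *an = a + n * (n - 1) / 2;
        double t = 0.0;                        /* t_n = sum_{k=1}^{n} a_k z_k */
        for (int k = 0; k < n; k++)
            t += an[k] * z[k];
        double r = t - z[n];
        q += delta[n - 1] * r * r;             /* Q_{n+1} = Q_n + delta_{n+1} (t_n - z_{n+1})^2 */
    }
    return q;
}
```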
3.2 Lower Triangular Canonical Form (LTCF) with Matrix Inversion
The covariance matrix can be represented in terms of the Cholesky decomposition, and so can its inverse. That is, the inverse covariance matrix of a cluster, $\Sigma_n^{-1}$, can be expressed in terms of a lower triangular matrix $L_n$ as $\Sigma_n^{-1} = (L_n^{-1})^T L_n^{-1}$. According to the Cholesky decomposition,
$$\Sigma_n = [\sigma_{ij}] = L_n L_n^T$$
with
$$L_n = [\,l_{ij}\,] = \begin{pmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{pmatrix}$$
and $l_{ij}$ is given by:
$$l_{ii} = \left( \sigma_{ii} - \sum_{k=1}^{i-1} l_{ik}^2 \right)^{1/2}, \qquad i = 1, \ldots, n,$$
$$l_{ij} = \left( \sigma_{ij} - \sum_{k=1}^{j-1} l_{ik}\, l_{jk} \right) \Big/ \, l_{jj}, \qquad i = j + 1, \ldots, n.$$
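A standard in-place Cholesky factorisation matching these recurrences is sketched below; the row-major layout and the `cholesky` name are our own assumptions.

```c
#include <math.h>

/* In-place Cholesky factorisation Sigma = L * L^T of a symmetric positive
 * definite n x n matrix (row-major).  On return the lower triangle of `a`
 * holds L; the strict upper triangle is left untouched.
 * Returns 0 on success, -1 if the matrix is not positive definite.
 */
static int cholesky(int n, double *a)
{
    for (int j = 0; j < n; j++) {
        double s = a[j * n + j];
        for (int k = 0; k < j; k++)
            s -= a[j * n + k] * a[j * n + k];        /* sigma_jj - sum l_jk^2 */
        if (s <= 0.0) return -1;
        a[j * n + j] = sqrt(s);
        for (int i = j + 1; i < n; i++) {
            double t = a[i * n + j];
            for (int k = 0; k < j; k++)
                t -= a[i * n + k] * a[j * n + k];    /* sigma_ij - sum l_ik l_jk */
            a[i * n + j] = t / a[j * n + j];
        }
    }
    return 0;
}
```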
The inverse of $L_n$, denoted by $L_n^{-1} = [\,l^{-1}_{ij}\,]$, is given by:
$$l^{-1}_{ij} = -\frac{1}{l_{ii}} \sum_{k=j}^{i-1} l_{ik}\, l^{-1}_{kj}, \qquad j = 1, \ldots, i - 1,$$
$$l^{-1}_{ii} = \frac{1}{l_{ii}}.$$
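A direct C transcription of these recurrences, filling $L_n^{-1}$ column by column, might look as follows (again a sketch with our own naming and layout):

```c
/* Inverse of a lower triangular n x n matrix L (row-major), written to Linv.
 * Diagonal entries first, then each column j from the diagonal downwards.
 */
static void lower_tri_inverse(int n, const double *L, double *Linv)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Linv[i * n + j] = 0.0;
    for (int j = 0; j < n; j++) {
        Linv[j * n + j] = 1.0 / L[j * n + j];        /* l^{-1}_jj = 1 / l_jj */
        for (int i = j + 1; i < n; i++) {
            double s = 0.0;
            for (int k = j; k < i; k++)
                s += L[i * n + k] * Linv[k * n + j]; /* sum l_ik * l^{-1}_kj */
            Linv[i * n + j] = -s / L[i * n + i];
        }
    }
}
```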
The quadratic term for each cluster can then be rewritten as:
$$Q(x)_n = Z_n^T \, (L_n^{-1})^T L_n^{-1} \, Z_n = Y^T Y$$
where
$$Y = L_n^{-1} Z_n = [\, y_1 \;\; y_2 \;\; \cdots \;\; y_n \,]^T$$
with
$$y_i = \sum_{j=1}^{i} l^{-1}_{ij}\, z_j, \qquad i = 1, 2, \ldots, n.$$
The above equation can be rewritten as:
$$Q(x)_n = \sum_{i=1}^{n} y_i^2.$$
A recursive formula for calculating $Q(x)$ can be given by:
$$Q(x)_n = \sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n-1} y_i^2 + y_n^2 = Q(x)_{n-1} + y_n^2.$$
The calculation of the above equation requires only $n + 1$ multiplications. Thus, the total number of multiplications for a cluster through all $d$ features is $\frac{d(d+3)}{2}$.
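Putting the pieces together, a sketch of the LTCF-with-matrix-inversion quadratic term is given below; it assumes $L_n^{-1}$ has been precomputed once per cluster with routines such as the `cholesky` and `lower_tri_inverse` sketches above.

```c
/* Quadratic term via LTCF with matrix inversion: given Linv = L^{-1}
 * (precomputed once per cluster), accumulate Q(x)_i = Q(x)_{i-1} + y_i^2
 * with y_i = sum_{j<=i} l^{-1}_ij * z_j, where z = x - mean.
 */
static double quad_ltcf_inverse(int d, const double *Linv, const double *z)
{
    double q = 0.0;
    for (int i = 0; i < d; i++) {
        double y = 0.0;
        for (int j = 0; j <= i; j++)
            y += Linv[i * d + j] * z[j];   /* y_i = sum_{j<=i} l^{-1}_ij z_j */
        q += y * y;                        /* Q_i = Q_{i-1} + y_i^2 */
    }
    return q;
}
```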
3.3 Lower Triangular Canonical Form (LTCF) with Forward Substitution
In the above quadratic term calculation, $Q(x)_n = Z_n^T (L_n^{-1})^T L_n^{-1} Z_n = Y^T Y$, once the Cholesky decomposition $\Sigma = L L^T$ has been performed, $Y$ can be calculated by forward substitution using $L Y = Z$. A closed-form solution of the equation $L Y = Z$ by forward substitution is given by:
$$y_n = \frac{z_n - \sum_{k=1}^{n-1} l_{nk}\, y_k}{l_{nn}}.$$
The number of multiplications for forming $Q(x)_n$ from $Y^T Y$ is the same as that of the previous approach. Note that in this approach $\Sigma^{-1}$ need not be calculated.
The calculation of the above equation requires only $n + 1$ multiplications. Thus, the total number of multiplications for a cluster through all $d$ features is $\frac{d(d+3)}{2}$.
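A corresponding C sketch, solving $L Y = Z$ by forward substitution and accumulating the squares as it goes (our own naming; `y` is caller-provided scratch space):

```c
/* Quadratic term via LTCF with forward substitution: solve L y = z by
 * forward substitution (no explicit inverse needed) and accumulate y_i^2.
 * L is the Cholesky factor of Sigma (row-major, lower triangle used).
 */
static double quad_ltcf_forward(int d, const double *L, const double *z,
                                double *y /* scratch, length d */)
{
    double q = 0.0;
    for (int i = 0; i < d; i++) {
        double s = z[i];
        for (int k = 0; k < i; k++)
            s -= L[i * d + k] * y[k];      /* z_i - sum_{k<i} l_ik y_k */
        y[i] = s / L[i * d + i];
        q += y[i] * y[i];                  /* Q_i = Q_{i-1} + y_i^2 */
    }
    return q;
}
```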
4. Experiments and Discussion
Implementing the above three methods and the standard EM on synthetic datasets with 1 million rows, varying the number of dimensions (d) from 50 to 100 in steps of 10 and the number of clusters (k) from 5 to 10, gives the following observations. The experiments were carried out on an Intel Dual-Core system with a 2.6 GHz processor, 1 GB RAM, Fedora 10 OS and the gcc compiler.
4.1 Synthetic dataset varying number of dimensions
Table I gives the time taken by the proposed methods and standard EM for quadratic term computation on synthetic data, varying the number of dimensions (d) from 50 to 100 in steps of 10, with the number of samples (m) fixed at 1 million rows and the number of clusters (k) at 5. Table II gives a comparative statement of the time taken for quadratic term computation by the proposed approaches against that of standard EM. Fig. 1 is a graphical representation of Table II.
Table I. Comparison of execution times (in sec) of proposed methods against standard EM, varying the number of dimensions
Table II. Comparison of % execution times of proposed methods against standard EM, varying the number of dimensions
Fig. 1. Comparison of % execution times of proposed methods against standard EM, varying the number of dimensions
The conclusion is that the LTCF Inverse approach outperforms the rest of the methods; it takes only 44.61% of the time of standard EM.
4.2 Synthetic dataset varying number of clusters
Table III gives the time taken by the proposed methods and standard EM for quadratic term computation on synthetic data of 1 million rows, varying the number of clusters (k) from 5 to 10 and keeping the number of dimensions (d) fixed at 50. Table IV gives a comparative statement of the time taken for quadratic term computation by the proposed approaches against that of standard EM. Fig. 2 is a graphical representation of Table IV.
Table III. Comparison of execution times (in sec) of proposed methods against standard EM, varying the number of clusters
Table IV. Comparison of % execution times of proposed methods against standard EM, varying the number of clusters
Fig. 2. Comparison of % execution times of proposed methods against standard EM, varying the number of clusters
The conclusion is that the LTCF with Matrix Inversion approach outperforms the rest of the methods, taking only 47.11% of the time of standard EM.
5. Conclusion
The observation is that all the proposed approaches work as expected. Of the three approaches, LTCF with Matrix Inversion is the fastest. We are trying to further improve the speed of the Expectation step by applying additional matrix algebra techniques, and to study them in various operating system and computing environments.
References
[1] Cédric Archambeau, John Aldo Lee and Michel Verleysen. On Convergence Problems of the EM Algorithm for Finite Gaussian Mixtures. ESANN, pages 99-106, 2003.
[2] Pavel Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[3] Jeff A. Bilmes. A gentle tutorial on the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical report, 1997.
[4] Sean Borman. The expectation maximization algorithm: A short tutorial. Unpublished paper available at http://www.seanborman.com/publications. Technical report, 2004.
[5] Nizar Bouguila. Clustering of Count Data Using Generalized Dirichlet Multinomial Distributions. IEEE Transactions on Knowledge and Data Engineering, 20:462-474, 2008.
[6] Paul S. Bradley, Usama M. Fayyad and Cory A. Reina. Scaling EM (Expectation-Maximization) Clustering to Large Databases. 1999.
[7] A. P. Dempster, N. M. Laird and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[8] Wenbin Guo and Shuguang Cui. A q-Parameterized Deterministic Annealing EM Algorithm Based on Nonextensive Statistical Mechanics. IEEE Transactions on Signal Processing, 56(7-2):3069-3080, 2008.
[9] Radu Horaud, Florence Forbes, Manuel Yguel, Guillaume Dewaele and Jian Zhang. Rigid and Articulated Point Registration with Expectation Conditional Maximization. 2009.
[10] A. K. Jain, M. N. Murty and P. J. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3), 1999.
[11] H. D. Jin, M. L. Wong and K. S. Leung. Scalable Model-Based Clustering for Large Databases Based on Data Summarization. IEEE Trans. Pattern Analysis and Machine Intelligence, 27(??):1710-1719, 2005.
[12] F.-X. Jollois and M. Nadif. Speed-up for the expectation-maximization algorithm for clustering categorical data. 2007.
[13] Michael I. Jordan. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[14] Chanho Jung. Unsupervised Segmentation of Overlapped Nuclei Using Bayesian Classification. IEEE, 2010.
[15] Wojtek Kowalczyk and Nikos Vlassis. Newscast EM. In NIPS 17, pages 713-720, 2005. MIT Press.
[16] Martin H. C. Law, Mário A. T. Figueiredo and Anil K. Jain. Simultaneous Feature Selection and Clustering Using Mixture Models. IEEE Trans. Pattern Anal. Mach. Intell., 26(??):1154-1166, 2004.
[17] Adam Lopez. Statistical machine translation. ACM Comput. Surv., 40(3), 2008.
[18] M. W. Mak, S. Y. Kung and S. H. Lin. Expectation-Maximization Theory. 2005.
[19] Andrew McCallum and Kamal Nigam. Employing EM in Pool-Based Active Learning for Text Classification. 1998.
[20] Marina Meilă and David Heckerman. An experimental comparison of several clustering and initialization methods. Technical report, 1998.
[21] Norwati Mustapha. Expectation Maximization Clustering Algorithm for User Modeling in Web Usage Mining Systems. EUJSR, 2009.
[22] Radford Neal and Geoffrey E. Hinton. A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants. Learning in Graphical Models, pages 355-368, 1998. Kluwer Academic Publishers.
[23] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun and Tom Mitchell. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning, 39(2/3):103, 2000.
[24] C. Nikou, N. P. Galatsanos and A. C. Likas. A Class-Adaptive Spatially Variant Mixture Model for Image Segmentation. IEEE Trans. Image Processing, 16(4):1121-1130, 2007.
[25] Zachary A. Pardos and Neil T. Heffernan. Navigating the parameter space of Bayesian Knowledge Tracing models: Visualizations of the convergence of the Expectation Maximization algorithm. In Ryan Shaun Joazeiro de Baker, Agathe Merceron and Philip I. Pavlik Jr, editors, Educational Data Mining 2010, The 3rd International Conference on Educational Data Mining, Pittsburgh, PA, USA, June 11-13, 2010, Proceedings, pages 161-170, 2010. www.educationaldatamining.org.
[26] C. K. Reddy, H. D. Chiang and B. Rajaratnam. TRUST-TECH-Based Expectation Maximization for Learning Finite Mixture Models. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(7):1146-1157, 2008.
[27] Leonardo Rigutini and Marco Maggini. An EM based training algorithm for Cross-Language Text Categorization. In Proceedings of the Web Intelligence Conference (WI), pages 529-535, 2005.
[28] Jonathan Le Roux, Hirokazu Kameoka, Nobutaka Ono, Alain de Cheveigné and Shigeki Sagayama. Single and Multiple F0 Contour Estimation Through Parametric Spectrogram Modeling of Speech in Noisy Environments. IEEE Transactions on Audio, Speech & Language Processing, 15(4):1135-1145, 2007.
[29] Sam Roweis. EM Algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, pages 626-632, 1998. MIT Press.
[30] S. P. Smith and C. Y. Lin. Efficient implementation of the new restricted maximum likelihood algorithm. 1989.
[31] Ruslan Salakhutdinov, Sam Roweis and Zoubin Ghahramani. Optimization with EM and Expectation-Conjugate-Gradient. pages 672-679, 2003.
[32] Christian D. Sigg and Joachim M. Buhmann. Expectation-maximization for sparse and non-negative PCA. In William W. Cohen, Andrew McCallum and Sam T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, ACM International Conference Proceeding Series, pages 960-967, 2008. ACM.
[33] Bo Thiesson, Christopher Meek and David Heckerman. Accelerating EM for large databases. Technical report, Machine Learning, 2001.
[34] Michael E. Tipping and Chris M. Bishop. Probabilistic Principal Component Analysis. Journal of the Royal Statistical Society, Series B, 61:611-622, 1999.
[35] Rudolph Triebel and Wolfram Burgard. Using hierarchical EM to extract planes from 3D range scans. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), 2005.
[36] N. B. Venkateswarlu, S. Balaji and R. D. Boyle. Some Further Results of Three Stage ML Classification Applied to Remotely Sensed Images. Technical report, 1993.
[37] Jakob J. Verbeek, Jan R. J. Nunnink and Nikos Vlassis. Accelerated EM-based clustering of large data sets. Data Mining and Knowledge Discovery, 13:2006, 2006.
[38] Jason Wolfe, Aria Haghighi and Dan Klein. Fully distributed EM for very large datasets. In William W. Cohen, Andrew McCallum and Sam T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, ACM International Conference Proceeding Series, pages 1184-1191, 2008. ACM.
[39] Lei Xu and Michael I. Jordan. On Convergence Properties of the EM Algorithm for Gaussian Mixtures. Neural Computation, 8:129-151, 1995.
[40] R. Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645-678, 2005.