EM and K-means Relations


Advanced Introduction to Machine Learning
CMU-10715
Clustering and EM

Barnabás Póczos
Contents

• Clustering
  – K-means
  – Mixture of Gaussians
• Expectation Maximization
• Variational Methods
2
Clustering

3
K-means clustering
What is clustering?
Clustering:
the process of grouping a set of objects into classes of similar objects
– high intra-class similarity
– low inter-class similarity
It is the most common form of unsupervised learning.

Clustering is subjective.

4
K-means clustering
What is Similarity?

Hard to define! …but we know it when we see it

6
The K-means Clustering Problem

7
K-means Clustering Problem

K-means clustering problem:
partition the n observations into K sets (K ≤ n), S = {S1, S2, …, SK},
such that the sets minimize the within-cluster sum of squares:
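In symbols, with μi denoting the mean of the points in Si:

$$\min_{S}\; \sum_{i=1}^{K} \sum_{x_j \in S_i} \big\| x_j - \mu_i \big\|^2 .$$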

(Figure: an example partition with K = 3.)

8
How hard is this problem?

The problem is NP-hard, but there are good heuristic algorithms that seem
to work well in practice:
• the K-means algorithm
• mixture of Gaussians
9
K-means Clustering Alg: Step 1

• Given n objects.
• Guess the cluster centers k1, k2, k3 (they were the cluster means μ1, μ2, μ3 in the previous slide).
10
K-means Clustering Alg: Step 2

• Build a Voronoi diagram based on the cluster centers k1, k2, k3.
• Decide the class memberships of the n objects by assigning each to its
nearest cluster center.
11
K-means Clustering Alg: Step 3

• Re-estimate the cluster centers (a.k.a. the centroids or means),
assuming the memberships found above are correct.
12
K-means Clustering Alg: Step 4

• Build a new Voronoi diagram based on the new cluster centers.


• Decide the class memberships of the n objects based on this diagram
13
K-means Clustering Alg: Step 5

• Re-estimate the cluster centers.


14
K-means Clustering Alg: Step 6

• Stop when everything is settled.


(The Voronoi diagrams don’t change anymore)
15
K-means clustering
K-means Clustering Algorithm

Input
– Data + the desired number of clusters, K
Initialize
– the K cluster centers (randomly, if necessary)
Iterate
1. Decide the class memberships of the n objects by assigning them to the
nearest cluster centers.
2. Re-estimate the K cluster centers (a.k.a. the centroids or means),
assuming the memberships found above are correct.
Termination
– If none of the n objects changed membership in the last iteration, exit.
Otherwise go to 1. (A NumPy sketch of this loop follows below.)
16
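A minimal NumPy sketch of this loop; the seed-based initialization, the empty-cluster guard, and the iteration cap are illustrative choices, not prescribed by the slides:

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-means sketch. X: (n, d) array of n objects, K: number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize: guess the K cluster centers by picking K distinct data points.
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Step 1: assign each object to its nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (n, K)
        new_labels = dists.argmin(axis=1)
        # Termination: exit if no object changed membership in the last iteration.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: re-estimate each center as the centroid (mean) of its members.
        for i in range(K):
            if np.any(labels == i):  # keep the old center if a cluster went empty
                centers[i] = X[labels == i].mean(axis=0)
    return centers, labels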
K-means clustering
K-means Algorithm: Computational Complexity

• At each iteration,
– computing the distance between each of the n objects and the K cluster
centers is O(Kn);
– computing the cluster centers: each object gets added once to some
cluster, so this is O(n).

• Assume these two steps are each done once in each of the l iterations: O(lKn).

Can you prove that the K-means algorithm is guaranteed to terminate?

17
K-means clustering
Seed Choice
The results of the K-means algorithm can vary based on the random seed
selection.

• Some seeds can result in a poor convergence rate, or in convergence to
a sub-optimal clustering.

• The K-means algorithm can get stuck easily in local minima.
– Select good seeds using a heuristic (e.g., pick the object least similar to any
existing mean).
– Try out multiple starting points (very important!).
– Initialize with the results of another method.

20
Alternating Optimization

21
K-means clustering
K-means Algorithm (more formally)

• Randomly initialize K centers μ1(0), …, μK(0).

• Classify: at iteration t, assign each point xj (j ∈ {1, …, n}) to the
nearest center (this is the classification at iteration t).

• Recenter: μi(t+1) is the centroid of the new set of points assigned to
center i (re-assign the cluster centers at iteration t).
22
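In symbols, writing C(t)(j) for the cluster assigned to point xj at iteration t, the two steps are:

$$C^{(t)}(j) \;=\; \arg\min_{i \in \{1,\dots,K\}} \big\| x_j - \mu_i^{(t)} \big\|^2 ,
\qquad
\mu_i^{(t+1)} \;=\; \frac{1}{\big|\{\, j : C^{(t)}(j) = i \,\}\big|} \sum_{j:\, C^{(t)}(j) = i} x_j .$$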
K-means clustering
What is the K-means algorithm optimizing?

• Define the following potential function F of the centers μ and the point
allocation C (there are two equivalent versions).

• It is easy to see that the optimal solution of the K-means problem is
the minimum of F over both μ and C.
23
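Written out, with the second version grouping the sum by cluster:

$$F(\mu, C) \;=\; \sum_{j=1}^{n} \big\| x_j - \mu_{C(j)} \big\|^2
\;=\; \sum_{i=1}^{K} \sum_{j:\, C(j) = i} \big\| x_j - \mu_i \big\|^2 ,
\qquad\text{and the K-means problem is}\quad \min_{\mu}\,\min_{C}\, F(\mu, C).$$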
K-means clustering
K-means Algorithm
Optimize the potential function F:

K-means algorithm:
(1) Assign each point to the nearest cluster center (exactly the first step).
(2) Re-estimate the cluster centers given those assignments (exactly the 2nd step: re-center).


24
K-means clustering
K-means Algorithm
Optimize the potential function F:

K-means algorithm (coordinate descent on F):

(1) minimize F over C: the “Expectation step”

(2) minimize F over μ: the “Maximization step”

Today we will see a generalization of this approach:
the EM algorithm.
25
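In coordinate-descent form, with F as defined above:

$$C^{(t+1)} \;=\; \arg\min_{C} F\big(\mu^{(t)}, C\big),
\qquad
\mu^{(t+1)} \;=\; \arg\min_{\mu} F\big(\mu, C^{(t+1)}\big).$$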
Gaussian Mixture Model

26
Density Estimation
Generative approach

• There is a latent parameter Θ.
• For all i = 1, 2, …, n, draw the observed xi from the parametric
distribution given the parameter Θ.

What if the basic model (e.g., a single Gaussian) doesn’t fit all the data?
⇒ Mixture modelling, partitioning algorithms:
different parameters for different parts of the domain.
27
K-means clustering
Partitioning Algorithms
• K-means
– hard assignment: each object xj belongs to only one cluster

• Mixture modeling
– soft assignment: probability that an object belongs to a cluster

28
K-means clustering
Gaussian Mixture Model
Mixture of K Gaussian distributions (a multi-modal distribution):
• There are K components.
• Component i has an associated mean vector μi.
• Component i generates data from a Gaussian centered at μi (in this simple
case the components share a spherical covariance σ²I).

Each data point is generated using the following process: first pick a component,
then draw from that component’s Gaussian.
29
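Concretely, for the spherical shared-variance case described above (the general case with per-component covariances Σi comes later):

$$y \sim \text{Multinomial}(\pi_1,\dots,\pi_K),
\qquad
x \mid y = i \;\sim\; \mathcal{N}\big(\mu_i,\, \sigma^2 I\big),
\qquad
p(x) \;=\; \sum_{i=1}^{K} \pi_i\, \mathcal{N}\big(x;\, \mu_i, \sigma^2 I\big).$$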
Gaussian Mixture Model
Mixture of K Gaussian distributions (a multi-modal distribution): the density of an
observed data point x is the sum, over the hidden variable y, of the mixture
proportion P(y = i) times the mixture component p(x | y = i).
30
Mixture of Gaussians Clustering
Assume that each component is a Gaussian with mean μi and the same
spherical covariance σ²I (as above).

For a given x, we want to decide whether it belongs to cluster i or cluster j.

Cluster x based on the ratio of posteriors:

31
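Writing the ratio out with Bayes’ rule (under the equal spherical covariance assumption), the log-ratio is linear in x, which is what produces the piecewise linear decision boundary shown on the next slides:

$$\frac{P(y=i \mid x)}{P(y=j \mid x)}
= \frac{\pi_i\, \mathcal{N}(x;\, \mu_i, \sigma^2 I)}{\pi_j\, \mathcal{N}(x;\, \mu_j, \sigma^2 I)},
\qquad
\log\frac{P(y=i \mid x)}{P(y=j \mid x)}
= \log\frac{\pi_i}{\pi_j}
+ \frac{(\mu_i - \mu_j)^{\top} x}{\sigma^2}
+ \frac{\|\mu_j\|^2 - \|\mu_i\|^2}{2\sigma^2}.$$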
Piecewise linear decision boundary

33
MLE for GMM
What if we don’t know the parameters?

⇒ Maximum Likelihood Estimate (MLE)

34
K-means and GMM
MLE:

• What happens if we assume hard assignment?

P(yj = i) = 1 if i = C(j), and 0 otherwise.

In this case the MLE estimate of the means is the same as in K-means!
35
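To see why, write the part of the complete-data log-likelihood that depends on the means, under hard assignments and equal spherical covariances; maximizing it over the means is exactly minimizing the K-means objective, and the maximizer is the cluster centroid:

$$\sum_{j=1}^{n} \log p\big(x_j \mid y_j = C(j)\big)
= -\frac{1}{2\sigma^2} \sum_{j=1}^{n} \big\| x_j - \mu_{C(j)} \big\|^2 + \text{const},
\qquad
\hat{\mu}_i = \frac{1}{|\{\, j : C(j) = i \,\}|} \sum_{j:\, C(j) = i} x_j .$$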
General GMM
General GMM – Gaussian Mixture Model

• There are K components.
• Component i has an associated mean vector μi.
• Component i generates data from a Gaussian with mean μi and
covariance matrix Σi.

Each data point is generated according to the following recipe:
1) Pick a component at random: choose component i with probability P(y = i).
2) Draw the datapoint x ~ N(μi, Σi).
36
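The corresponding mixture density is:

$$p(x) \;=\; \sum_{i=1}^{K} P(y = i)\, \mathcal{N}\big(x;\, \mu_i, \Sigma_i\big).$$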
General GMM
GMM – Gaussian Mixture Model: the density is again a sum over components of
the mixture proportion times the mixture component density, now with a full
covariance matrix per component.
37
General GMM
Assume that each component is a Gaussian with its own mean μi and its own
covariance matrix Σi.

Clustering is again based on ratios of posteriors:
“Quadratic decision boundary”: the second-order terms don’t cancel out.
38
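Writing the log posterior ratio out makes the quadratic term explicit; it vanishes only when Σi = Σj:

$$\log\frac{P(y=i \mid x)}{P(y=j \mid x)}
= -\tfrac{1}{2}\, x^{\top}\big(\Sigma_i^{-1} - \Sigma_j^{-1}\big)\, x
+ \big(\Sigma_i^{-1}\mu_i - \Sigma_j^{-1}\mu_j\big)^{\top} x
+ \text{const}.$$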


General GMM MLE Estimation
What if we don’t know the parameters (mixture weights, means, and covariances)?
⇒ Maximize the marginal likelihood (MLE):

Non-linear, not analytically solvable.

Doable, but often slow.

39
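The objective being maximized is the marginal log-likelihood; the sum over components sits inside the logarithm, which is what makes it non-analytic:

$$\ell(\pi, \mu, \Sigma)
= \sum_{j=1}^{n} \log p(x_j)
= \sum_{j=1}^{n} \log \sum_{i=1}^{K} \pi_i\, \mathcal{N}\big(x_j;\, \mu_i, \Sigma_i\big).$$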
The EM algorithm
What is EM in the general case, and why does it work?

40
Expectation-Maximization (EM)
A general algorithm to deal with hidden data, but we will study it in
the context of unsupervised learning (hidden class labels =
clustering) first.
• EM is an optimization strategy for objective functions that can be interpreted
as likelihoods in the presence of missing data.
• EM is “simpler” than gradient methods:
No need to choose step size.
• EM is an iterative algorithm with two linked steps:
o E-step: fill in hidden values using inference
o M-step: apply standard MLE/MAP method to completed data
• We will prove that this procedure monotonically improves the likelihood (or
leaves it unchanged). EM always converges to a local optimum of the
likelihood.

41
General EM algorithm
Notation

Observed data: x1, …, xn

Unknown (hidden) variables: z1, …, zn
For example, in clustering: the cluster labels.
Parameters: θ

For example, in MoG: the mixture weights, means, and covariances.

Goal: maximize the marginal likelihood of the observed data in θ.

42
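In symbols:

$$\mathcal{D} = \{x_1,\dots,x_n\}, \qquad z = (z_1,\dots,z_n) \text{ hidden}, \qquad
\hat{\theta} = \arg\max_{\theta}\, \log p(\mathcal{D} \mid \theta)
= \arg\max_{\theta}\, \log \sum_{z} p(\mathcal{D}, z \mid \theta).$$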
General EM algorithm
Other Examples: Hidden Markov Models

Observed data: the observation sequence x1, …, xT

Unknown variables: the hidden state sequence z1, …, zT
Parameters:
Initial probabilities: P(z1 = i)

Transition probabilities: P(zt+1 = j | zt = i)

Emission probabilities: P(xt | zt = i)

Goal: maximize the marginal likelihood of the observation sequence.

43
General EM algorithm
Goal: maximize the marginal likelihood of the observed data in θ.

Free energy: a lower bound F(q, θ) on the marginal log-likelihood, depending on
the parameters θ and on an auxiliary distribution q over the hidden variables.
E Step: maximize F in q, holding θ fixed.

M Step: maximize F in θ, holding q fixed.

We are going to discuss why this approach works.
44
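A standard way to write the free energy and the two steps (the second form of F shows that it is a lower bound on log p(x | θ), with the gap equal to a KL divergence):

$$F(q, \theta)
= \mathbb{E}_{q(z)}\big[\log p(x, z \mid \theta)\big] + H(q)
= \log p(x \mid \theta) - \mathrm{KL}\big(q(z)\,\big\|\, p(z \mid x, \theta)\big)$$

$$\text{E step:}\quad q^{(t+1)} = \arg\max_{q} F\big(q, \theta^{(t)}\big) = p\big(z \mid x, \theta^{(t)}\big),
\qquad
\text{M step:}\quad \theta^{(t+1)} = \arg\max_{\theta} F\big(q^{(t+1)}, \theta\big)
= \arg\max_{\theta} \mathbb{E}_{q^{(t+1)}}\big[\log p(x, z \mid \theta)\big].$$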


General EM algorithm

Free energy: F(q, θ), as above.

E Step: the maximizing q turns out to be the posterior p(z | x, θ(t)). Let us see why!

M Step:
we maximize F only in θ here!
45
General EM algorithm

Free energy: F(q, θ), as defined above.

Theorem: During the EM algorithm the marginal likelihood is not decreasing!

Proof:
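The proof is the following chain, using the two properties of F noted above:

$$\log p\big(x \mid \theta^{(t)}\big)
= F\big(q^{(t+1)}, \theta^{(t)}\big)
\;\le\; F\big(q^{(t+1)}, \theta^{(t+1)}\big)
\;\le\; \log p\big(x \mid \theta^{(t+1)}\big).$$

The equality holds because the E step sets q to the posterior, which makes the KL gap zero; the first inequality is the M step; the second holds because the KL divergence is non-negative, so F never exceeds the marginal log-likelihood.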

46
General EM algorithm
Goal: maximize the marginal likelihood of the observed data in θ.

E Step: set q to the posterior of the hidden variables given the data and θ(t).

M Step: maximize F(q, θ) in θ, holding q fixed.

During the EM algorithm the marginal likelihood is not decreasing!

47
Convergence of EM

(Figure: a sequence of EM lower-bound F-functions.)

EM monotonically converges to a local maximum of the likelihood!

48
Convergence of EM

(Figure: different sequences of EM lower-bound F-functions, depending on the initialization.)

Use multiple, randomized initializations in practice.


49
Variational Methods

50
Variational methods

Free energy:

Variational methods might decrease the marginal likelihood!

51
Variational methods

Free energy: F(q, θ), as before.

Partial E Step: improve F in q (for example, over a restricted, tractable family of
distributions), but not necessarily finding the best choice, which would be the
exact posterior.

Partial M Step: improve F in θ.

Variational methods might decrease the marginal likelihood!
52
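The reason the EM guarantee can fail here: the gap between the bound and the marginal log-likelihood,

$$\log p(x \mid \theta) - F(q, \theta) = \mathrm{KL}\big(q(z)\,\big\|\, p(z \mid x, \theta)\big) \;\ge\; 0,$$

is no longer driven to zero by the (partial) E step, so an increase in F does not guarantee an increase in log p(x | θ).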


Summary: EM Algorithm
A way of maximizing the likelihood function for hidden-variable models.
Finds the MLE of the parameters when the original (hard) problem can be broken up
into two (easy) pieces:
1. Estimate some “missing” or “unobserved” data from the observed data and the
current parameters.
2. Using this “complete” data, find the MLE parameter estimates.

Alternate between filling in the latent variables using the best guess (the posterior)
and updating the parameters based on this guess:

E Step: set q to the posterior of the latent variables given the data and the current θ.
M Step: maximize the expected complete-data log-likelihood in θ.

In the M-step we optimize a lower bound F on the likelihood L.
In the E-step we close the gap, making the bound F equal to the likelihood L.
EM performs coordinate ascent on F, and can get stuck in local optima.
53
EM Examples

54
Expectation-Maximization (EM)
A simple case:
• We have unlabeled data x1, x2, …, xn
• We know there are K classes
• We know the class priors P(y=1)=π1, P(y=2)=π2, P(y=3)=π3, …, P(y=K)=πK
• We know the common variance σ²
• We don’t know μ1, μ2, …, μK, and we want to learn them
We can write the likelihood as a product over the independent data points,
marginalizing each over its class label:

⇒ learn μ1, μ2, …, μK
55
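In symbols, the product over independent data points, with each factor marginalized over its class, is:

$$P(x_1,\dots,x_n \mid \mu_1,\dots,\mu_K)
= \prod_{j=1}^{n} P(x_j \mid \mu_1,\dots,\mu_K)
= \prod_{j=1}^{n} \sum_{i=1}^{K} \pi_i\, \mathcal{N}\big(x_j;\, \mu_i, \sigma^2 I\big),$$

where the first equality uses that the data are independent and the second marginalizes over the class label; the goal is to maximize this in μ1, …, μK.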
Expectation (E) step
We want to learn: the means μ1, …, μK.
Our estimator at the end of iteration t−1: μ(t−1).
At iteration t, construct the function Q: the expected complete-data log-likelihood,
where the expectation over the class labels is taken given the data and μ(t−1)
(this is the E step).

Equivalent to assigning clusters to each data point as in K-means, but in a soft way.
56
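The weights computed in this step are the posterior class probabilities (responsibilities):

$$w_{ij}^{(t)}
= P\big(y_j = i \mid x_j, \mu^{(t-1)}\big)
= \frac{\pi_i \exp\!\big(-\| x_j - \mu_i^{(t-1)} \|^2 / 2\sigma^2\big)}
       {\sum_{k=1}^{K} \pi_k \exp\!\big(-\| x_j - \mu_k^{(t-1)} \|^2 / 2\sigma^2\big)} .$$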
Maximization (M) step

We calculated these weights wij in the E step.

The joint distribution is simple.

M step: at iteration t, maximize the function Q in μ(t).

(The sum over components i can be handled term by term: when maximizing in μi,
only the i-th term involves μi; the rest is just indexing.)

Equivalent to updating the cluster centers in K-means.
57
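Maximizing Q in the means gives the weighted-mean update:

$$\mu_i^{(t)} \;=\; \frac{\sum_{j=1}^{n} w_{ij}^{(t)}\, x_j}{\sum_{j=1}^{n} w_{ij}^{(t)}} .$$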


EM for spherical, same-variance GMMs
E-step
Compute the “expected” classes of all datapoints, for each class.

In the K-means “E-step” we do hard assignment; EM does soft assignment.

M-step
Compute the maximum of the function Q, i.e., update μ given our data’s class
membership distributions (the weights).

Iterate. Exactly the same as MLE with weighted data.
58


EM for general GMMs
The more general case:
• We have unlabeled data x1, x2, …, xn
• We know there are K classes
• We don’t know the class priors P(y=1)=π1, P(y=2)=π2, …, P(y=K)=πK
• We don’t know the covariances Σ1, …, ΣK
• We don’t know the means μ1, μ2, …, μK

We want to learn all of these parameters.
Our estimator at the end of iteration t−1: θ(t−1).

The idea is the same:
at iteration t, construct the function Q (E step) and maximize it in θ(t) (M step).
59
EM for general GMMs
At iteration t, construct the function Q (E step) and maximize it in θ(t) (M step).

E-step
Compute “expected” classes of all datapoints for each class

M-step
Compute MLEs given our data’s class membership distributions (weights)

60
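The updates have the same form as before, now with full covariances; with wij denoting the responsibilities computed in the E-step:

$$w_{ij} = \frac{\pi_i\, \mathcal{N}(x_j;\, \mu_i, \Sigma_i)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_j;\, \mu_k, \Sigma_k)},
\qquad
\pi_i = \frac{1}{n}\sum_{j=1}^{n} w_{ij},
\qquad
\mu_i = \frac{\sum_{j} w_{ij}\, x_j}{\sum_{j} w_{ij}},
\qquad
\Sigma_i = \frac{\sum_{j} w_{ij}\, (x_j - \mu_i)(x_j - \mu_i)^{\top}}{\sum_{j} w_{ij}} .$$

The whole loop, as a compact NumPy/SciPy sketch; the initialization and the small regularizer added to each covariance are illustrative choices, not prescribed by the slides:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    """Sketch of EM for a general GMM. X: (n, d) data array, K: number of components."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Illustrative initialization: random data points as means, identity covariances, uniform weights.
    mu = X[rng.choice(n, size=K, replace=False)].astype(float)
    Sigma = np.stack([np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: responsibilities w[j, i] = P(y_j = i | x_j, current parameters).
        w = np.column_stack([
            pi[i] * multivariate_normal(mu[i], Sigma[i]).pdf(X) for i in range(K)
        ])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates for the mixture weights, means, and covariances.
        Nk = w.sum(axis=0)                      # effective number of points per component
        pi = Nk / n
        mu = (w.T @ X) / Nk[:, None]
        for i in range(K):
            diff = X - mu[i]
            Sigma[i] = (w[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, w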
EM for general GMMs: Example

61
EM for general GMMs: Example
After 1st iteration

62
EM for general GMMs: Example
After 2nd iteration

63
EM for general GMMs: Example
After 3rd iteration

64
EM for general GMMs: Example
After 4th iteration

65
EM for general GMMs: Example
After 5th iteration

66
EM for general GMMs: Example
After 6th iteration

67
EM for general GMMs: Example
After 20th iteration

68
GMM for Density Estimation

69
Thanks for your attention!

70
