Clustering 4
Fuzzy C-Means (FCM)
– For each cluster $C_j$: $0 < \sum_{i=1}^{m} w_{ij} < m$
– Weight update:
$$w_{ij} = \frac{\left(1/\mathrm{dist}(x_i, c_j)^2\right)^{\frac{1}{p-1}}}{\sum_{q=1}^{k} \left(1/\mathrm{dist}(x_i, c_q)^2\right)^{\frac{1}{p-1}}}$$
– When $p = 2$:
$$w_{ij} = \frac{1/\mathrm{dist}(x_i, c_j)^2}{\sum_{q=1}^{k} 1/\mathrm{dist}(x_i, c_q)^2}$$
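The weight update above can be written compactly in code. Here is a minimal sketch, assuming points is an (m, d) numpy array and centroids is a (k, d) array; the function name and the eps guard against zero distances are my own illustrative additions, not part of the slides:

import numpy as np

def fcm_weights(points, centroids, p=2.0, eps=1e-12):
    # Squared Euclidean distances dist(x_i, c_j)^2, shape (m, k)
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # (1 / dist^2)^(1 / (p - 1)), guarded against division by zero
    inv = (1.0 / np.maximum(d2, eps)) ** (1.0 / (p - 1.0))
    # Normalize over clusters so each row of weights sums to 1
    return inv / inv.sum(axis=1, keepdims=True)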
Choice of p
• When p → 1, FCM behaves like traditional k-means
• When p is larger, the cluster centroids approach the global centroid of all data points
• The partition becomes fuzzier as p increases
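To make the effect of p concrete, one can evaluate the fcm_weights sketch above on a toy configuration (hypothetical numbers, illustrative only):

pts = np.array([[0.0], [1.0], [5.0]])
cts = np.array([[0.0], [4.0]])
print(fcm_weights(pts, cts, p=1.1))  # near-crisp 0/1 memberships
print(fcm_weights(pts, cts, p=4.0))  # weights drift toward 1/k: fuzzier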
Mixture Models
– Example: a mixture of two Gaussian distributions with parameters $\theta = (\mu, \sigma)$, where $\theta_1 = (-4, 2)$ and $\theta_2 = (4, 2)$:
$$\mathrm{prob}(x \mid \Theta) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x+4)^2}{8}} + \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{(x-4)^2}{8}}$$
– Maximizing the likelihood prob(X | Θ) is equivalent to maximizing the log likelihood (shown here for a single Gaussian):
$$\log \mathrm{prob}(X \mid \Theta) = -\sum_{i=1}^{m} \frac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$$
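As a sanity check (not on the original slide), setting the partial derivatives of this log likelihood to zero yields the familiar maximum likelihood estimates:
$$\frac{\partial \log \mathrm{prob}(X \mid \Theta)}{\partial \mu} = \sum_{i=1}^{m} \frac{x_i - \mu}{\sigma^2} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{m} \sum_{i=1}^{m} x_i$$
$$\frac{\partial \log \mathrm{prob}(X \mid \Theta)}{\partial \sigma} = \sum_{i=1}^{m} \frac{(x_i - \mu)^2}{\sigma^3} - \frac{m}{\sigma} = 0 \;\Rightarrow\; \hat{\sigma}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \hat{\mu})^2$$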
EM Algorithm
• Expectation Maximization algorithm
Select an initial set of model parameters
Repeat
  Expectation Step: for each object xi, calculate the probability that it belongs to each distribution θj, i.e., prob(θj | xi)
  Maximization Step: given the probabilities from the expectation step, find the new estimates of the parameters that maximize the expected likelihood
Until the parameters are stable
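Below is a minimal sketch of EM for the two-Gaussian example above, assuming equal mixing weights and θ = (μ, σ); the function name, initial values, and tolerance are illustrative choices, not from the slides:

import numpy as np

def em_two_gaussians(x, theta1=(-1.0, 1.0), theta2=(1.0, 1.0),
                     n_iter=100, tol=1e-6):
    # x: 1-D numpy array of observations; theta = (mu, sigma) per component
    mu1, s1 = theta1
    mu2, s2 = theta2
    for _ in range(n_iter):
        # Expectation step: responsibility of each component for each
        # point; the shared 1/sqrt(2*pi) factor cancels in the ratio
        p1 = np.exp(-(x - mu1) ** 2 / (2 * s1 ** 2)) / s1
        p2 = np.exp(-(x - mu2) ** 2 / (2 * s2 ** 2)) / s2
        r1 = p1 / (p1 + p2)
        r2 = 1.0 - r1
        # Maximization step: responsibility-weighted means and spreads
        new_mu1 = np.sum(r1 * x) / np.sum(r1)
        new_mu2 = np.sum(r2 * x) / np.sum(r2)
        s1 = np.sqrt(np.sum(r1 * (x - new_mu1) ** 2) / np.sum(r1))
        s2 = np.sqrt(np.sum(r2 * (x - new_mu2) ** 2) / np.sum(r2))
        stable = abs(new_mu1 - mu1) + abs(new_mu2 - mu2) < tol
        mu1, mu2 = new_mu1, new_mu2
        if stable:  # until the parameters are stable
            break
    return (mu1, s1), (mu2, s2)

# Illustrative run on data drawn from the slide's two components:
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-4, 2, 500), rng.normal(4, 2, 500)])
print(em_two_gaussians(x))  # roughly ((-4, 2), (4, 2))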
[Figure: CLIQUE example. Two 2-D views: salary (in units of $10,000, ticks 0-7) vs. age (20-60), and vacation (in weeks, ticks 0-7) vs. age (20-60).]
CLIQUE: Pros and Cons
• Automatically finds subspaces of the highest dimensionality that contain high-density clusters
• Insensitive to the order of input
– Does not presume any canonical data distribution
• Scales linearly with the size of input
• Scales well with the number of dimensions
• The accuracy of the clustering result may be degraded, the price paid for the simplicity of the method
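For intuition, here is a minimal sketch of CLIQUE's bottom-up step: find dense units in 1-D subspaces, then keep only those 2-D grid cells whose 1-D projections are both dense (Apriori-style pruning). The grid size xi, density threshold tau, and function name are illustrative assumptions, not the original implementation:

from collections import Counter
from itertools import combinations
import numpy as np

def dense_units(points, xi=10, tau=0.05):
    m, d = points.shape
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    # Map each coordinate to one of xi equal-width intervals per dimension
    cells = np.minimum(((points - lo) / span * xi).astype(int), xi - 1)
    min_count = tau * m
    # Dense 1-D units: (dimension, interval) pairs holding >= tau of the points
    one_d = {u for u, c in Counter(
        (j, cells[i, j]) for i in range(m) for j in range(d)).items()
        if c >= min_count}
    # Candidate 2-D units: both 1-D projections must be dense;
    # then check the 2-D cell's own point count
    two_d = Counter()
    for i in range(m):
        for j1, j2 in combinations(range(d), 2):
            u1, u2 = (j1, cells[i, j1]), (j2, cells[i, j2])
            if u1 in one_d and u2 in one_d:
                two_d[(u1, u2)] += 1
    return one_d, {u for u, c in two_d.items() if c >= min_count}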
– Standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
– Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
– Covariance:
$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$$
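A quick numeric check of these formulas (illustrative data; numpy's ddof=1 selects the n - 1 denominator used above):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
y = np.array([1.0, 3.0, 3.0, 5.0])
print(np.std(x, ddof=1))           # sample standard deviation s
print(np.var(x, ddof=1))           # sample variance s^2
print(np.cov(x, y, ddof=1)[0, 1])  # sample covariance cov(X, Y)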
[Figure: a projection direction, -0.707x + 0.707y, shown in the original (x, y) space]
Intrinsic Methods
When the ground truth of a data set is not available, we have to use an intrinsic method to assess the clustering quality. In general, intrinsic methods evaluate a clustering by examining how well the clusters are separated and how compact they are.
Silhouette Coefficient
• No ground truth is assumed
• Suppose a data set D of n objects is partitioned
into k clusters, C1, …, Ck
• For each object o,
– Calculate a(o), the average distance between o and every other object in the same cluster – this reflects the compactness of o's cluster; the smaller, the better
– Calculate b(o), the minimum average distance from o to all objects in each cluster that o does not belong to – this reflects the degree of separation from other clusters; the larger, the better
$$a(o) = \frac{\sum_{o' \in C_i,\, o' \neq o} \mathrm{dist}(o, o')}{|C_i| - 1}$$
$$b(o) = \min_{C_j :\, o \notin C_j} \left\{ \frac{\sum_{o' \in C_j} \mathrm{dist}(o, o')}{|C_j|} \right\}$$
• Then
$$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$$
• Use the average silhouette coefficient of all
objects as the overall measure
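A minimal sketch of this computation, assuming points is an (n, d) numpy array and labels an (n,) array of cluster ids; the names are illustrative, and singleton clusters are skipped, a common convention the slides do not spell out:

import numpy as np

def silhouette(points, labels):
    # Pairwise Euclidean distances, shape (n, n)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    scores = []
    for o in range(len(points)):
        same = (labels == labels[o])
        if same.sum() < 2:  # a(o) is undefined for singletons
            continue
        # a(o): average distance to the other objects in o's own cluster
        a = dist[o, same].sum() / (same.sum() - 1)
        # b(o): minimum over other clusters of the average distance to o
        b = min(dist[o, labels == c].mean()
                for c in set(labels) - {labels[o]})
        scores.append((b - a) / max(a, b))
    # Overall measure: average silhouette coefficient of all objects
    return float(np.mean(scores))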
Multi-Clustering
• A data set may be clustered in different
ways
– In different subspaces, that is, using different
attributes
– Using different similarity measures
– Using different clustering methods
• Some different clusterings may capture
different meanings of categorization
– Orthogonal clusterings
• Putting users in the loop
To-Do List
• Read Chapters 10.5, 10.6, and 11.1
• Find out how Gaussian mixture models can be used in Spark MLlib
• (for thesis-based graduate students only)
Learn LDA (Latent Dirichlet allocation) by
yourself