Clustering 4

The document describes fuzzy clustering and the fuzzy c-means (FCM) algorithm. FCM assigns data points to multiple clusters with varying degrees of membership. It iteratively computes cluster centroids based on the fuzzy membership and updates the membership based on distances to centroids. The algorithm converges when centroids do not change significantly. Mixture models can also model clusters, assuming distributions like multivariate normal. The EM algorithm is used to estimate maximum likelihood parameters for mixture models.

Fuzzy Clustering

•  Each point x_i is assigned a weight (degree of membership) w_ij
of belonging to cluster C_j
•  Requirements
–  For each point x_i: $\sum_{j=1}^{k} w_{ij} = 1$
–  For each cluster C_j: $0 < \sum_{i=1}^{m} w_{ij} < m$

Jian Pei: CMPT 459/741 Clustering (4) 1


Fuzzy C-Means (FCM)
Select an initial fuzzy pseudo-partition, i.e., assign
values to all the wij
Repeat
Compute the centroid of each cluster using the fuzzy
pseudo-partition
Recompute the fuzzy pseudo-partition, i.e., the wij
Until the centroids do not change (or the change
is below some threshold)

Jian Pei: CMPT 459/741 Clustering (4) 2


Critical Details
•  Optimization on the sum of the squared error (SSE):
   $SSE(C_1, \ldots, C_k) = \sum_{j=1}^{k} \sum_{i=1}^{m} w_{ij}^p \, dist(x_i, c_j)^2$
•  Computing centroids: $c_j = \sum_{i=1}^{m} w_{ij}^p x_i \,/\, \sum_{i=1}^{m} w_{ij}^p$
•  Updating the fuzzy pseudo-partition:
   $w_{ij} = \dfrac{(1 / dist(x_i, c_j)^2)^{\frac{1}{p-1}}}{\sum_{q=1}^{k} (1 / dist(x_i, c_q)^2)^{\frac{1}{p-1}}}$
–  When p = 2: $w_{ij} = \dfrac{1 / dist(x_i, c_j)^2}{\sum_{q=1}^{k} 1 / dist(x_i, c_q)^2}$
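The update rules above translate almost line for line into code. Below is a minimal NumPy sketch of FCM (not the course's implementation); Euclidean distance is assumed, and the function and parameter names are illustrative.

import numpy as np

def fuzzy_c_means(X, k, p=2, max_iter=100, tol=1e-5, seed=0):
    # Minimal FCM sketch: X is an (m, d) data array, k the number of clusters,
    # p the fuzzifier from the slides.
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Initial fuzzy pseudo-partition: random weights whose rows sum to 1
    W = rng.random((m, k))
    W /= W.sum(axis=1, keepdims=True)
    centroids = None
    for _ in range(max_iter):
        Wp = W ** p
        # Centroids: c_j = sum_i w_ij^p x_i / sum_i w_ij^p
        new_centroids = (Wp.T @ X) / Wp.sum(axis=0)[:, None]
        # Squared Euclidean distances dist(x_i, c_j)^2
        d2 = ((X[:, None, :] - new_centroids[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership update: w_ij proportional to (1 / dist^2)^(1/(p-1))
        inv = (1.0 / d2) ** (1.0 / (p - 1))
        W = inv / inv.sum(axis=1, keepdims=True)
        # Stop when the centroids no longer change significantly
        if centroids is not None and np.abs(new_centroids - centroids).max() < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return W, centroids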
Jian Pei: CMPT 459/741 Clustering (4) 3
Choice of P
•  When p → 1, FCM behaves like traditional
k-means
•  When p is larger, the cluster centroids
approach the global centroid of all data
points
•  The partition becomes fuzzier as p
increases

Jian Pei: CMPT 459/741 Clustering (4) 4


Effectiveness

Jian Pei: CMPT 459/741 Clustering (4) 5


Mixture Models
•  A cluster can be modeled as a probability
distribution
–  Practically, assume a distribution can be
approximated well using multivariate normal
distribution
•  Multiple clusters form a mixture of different
probability distributions
•  A data set is a set of observations from a
mixture model
Jian Pei: CMPT 459/741 Clustering (4) 6
Object Probability
•  Suppose there are k clusters and a set X of
m objects
–  Let the j-th cluster have parameter θj = (µj, σj)
–  The probability that a point is in the j-th cluster is
wj, w1 + …+ wk = 1
•  The probability of an object x is
   $prob(x \mid \Theta) = \sum_{j=1}^{k} w_j \, p_j(x \mid \theta_j)$
   $prob(X \mid \Theta) = \prod_{i=1}^{m} prob(x_i \mid \Theta) = \prod_{i=1}^{m} \sum_{j=1}^{k} w_j \, p_j(x_i \mid \theta_j)$

Jian Pei: CMPT 459/741 Clustering (4) 7


Example
   $prob(x_i \mid \theta) = \dfrac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$

   $\theta_1 = (-4, 2) \qquad \theta_2 = (4, 2)$

   $prob(x \mid \Theta) = \dfrac{1}{2\sqrt{2\pi}} e^{-\frac{(x+4)^2}{8}} + \dfrac{1}{2\sqrt{2\pi}} e^{-\frac{(x-4)^2}{8}}$
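As a quick numeric check, the mixture density in the example can be evaluated directly; the following lines (illustrative only) plug a few values of x into the formula above.

import numpy as np

def example_mixture_density(x):
    # prob(x | Theta) for the example: mu1 = -4, mu2 = 4, sigma = 2
    coef = 1.0 / (2.0 * np.sqrt(2.0 * np.pi))
    return coef * np.exp(-(x + 4.0) ** 2 / 8.0) + coef * np.exp(-(x - 4.0) ** 2 / 8.0)

for x in (-4.0, 0.0, 4.0):
    print(f"prob({x:+.1f} | Theta) = {example_mixture_density(x):.4f}")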

Jian Pei: CMPT 459/741 Clustering (4) 8


Maximum Likelihood Estimation
•  Maximum likelihood principle: if we know that a set
of objects comes from one distribution but do not
know the parameters, we choose the parameter values
that maximize the probability of the observed data
•  Maximize $prob(X \mid \Theta) = \prod_{i=1}^{m} \dfrac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$
–  Equivalently, maximize
   $\log prob(X \mid \Theta) = -\sum_{i=1}^{m} \dfrac{(x_i - \mu)^2}{2\sigma^2} - 0.5\, m \log 2\pi - m \log \sigma$
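For this single-Gaussian case, setting the derivatives of the log-likelihood with respect to μ and σ to zero gives the familiar closed-form estimates (a standard result, stated here as a reminder):

$\hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \hat{\mu})^2$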
Jian Pei: CMPT 459/741 Clustering (4) 9
EM Algorithm
•  Expectation Maximization algorithm
Select an initial set of model parameters
Repeat
Expectation Step: for each object x_i, calculate the
probability that it belongs to each distribution j, i.e.,
prob(distribution j | x_i)
Maximization Step: given the probabilities from the
expectation step, find the new estimates of the
parameters that maximize the expected likelihood
Until the parameters are stable
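Below is a minimal NumPy sketch of EM for a one-dimensional mixture of k Gaussians, following the E/M steps above; the responsibilities computed in the E-step play the role of the membership probabilities. The function and variable names are illustrative, not from the course code.

import numpy as np

def em_gaussian_mixture(x, k, n_iter=100, seed=0):
    # EM sketch for a 1-d Gaussian mixture; x is a 1-d array of observations.
    rng = np.random.default_rng(seed)
    m = len(x)
    w = np.full(k, 1.0 / k)                    # mixing weights w_j
    mu = rng.choice(x, size=k, replace=False)  # initial means
    var = np.full(k, np.var(x))                # initial variances
    for _ in range(n_iter):
        # E-step: responsibility of each distribution j for each point x_i
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the responsibilities
        nj = resp.sum(axis=0)
        w = nj / m
        mu = (resp * x[:, None]).sum(axis=0) / nj
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return w, mu, var

# Example: data drawn from the two-component mixture used earlier (means -4 and 4, sigma 2)
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-4, 2, 500), rng.normal(4, 2, 500)])
print(em_gaussian_mixture(data, k=2))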

Jian Pei: CMPT 459/741 Clustering (4) 10


Advantages and Disadvantages
•  Mixture models are more general than k-
means and fuzzy c-means
•  Clusters can be characterized by a small
number of parameters
•  The results may satisfy the statistical
assumptions of the generative models
•  Computationally expensive
•  Need large data sets
•  Hard to estimate the number of clusters
Jian Pei: CMPT 459/741 Clustering (4) 11
Grid-based Clustering Methods
•  Ideas
–  Using multi-resolution grid data structures
–  Using dense grid cells to form clusters
•  Several interesting methods
–  CLIQUE
–  STING
–  WaveCluster

Jian Pei: CMPT 459/741 Clustering (4) 12


CLIQUE
•  CLIQUE: CLustering In QUEst
•  Automatically identify subspaces of a high
dimensional data space
•  Both density-based and grid-based

Jian Pei: CMPT 459/741 Clustering (4) 13


CLIQUE: the Ideas
•  Partition each dimension into the same
number of equal length intervals
–  Partition an m-dimensional data space into non-
overlapping rectangular units
•  A unit is dense if the number of data points
in the unit exceeds a threshold
•  A cluster is a maximal set of connected
dense units within a subspace

Jian Pei: CMPT 459/741 Clustering (4) 14


CLIQUE: the Method
•  Partition the data space and find the number of
points in each cell of the partition
–  Apriori-style pruning: a k-dimensional cell cannot be dense if
one of its (k-1)-dimensional projections is not dense
•  Identify clusters:
–  Determine dense units in all subspaces of interest and
connected dense units in all subspaces of interest (sketched below)
•  Generate a minimal description for the clusters
–  Determine the minimal cover for each cluster
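A compact sketch of the bottom-up step is given below, under simplifying assumptions: equal-width intervals on each dimension, and a candidate 2-d unit is counted only if both of its 1-d projections are dense (the Apriori-style pruning above). This is illustrative Python, not the original CLIQUE implementation.

import numpy as np
from itertools import combinations

def dense_units_2d(X, n_bins=10, threshold=20):
    # Find dense 1-d units, then check candidate 2-d units (Apriori pruning).
    m, d = X.shape
    # Discretize every dimension into equal-length intervals 0..n_bins-1
    span = X.max(axis=0) - X.min(axis=0) + 1e-12
    bins = np.minimum((((X - X.min(axis=0)) / span) * n_bins).astype(int), n_bins - 1)
    # Dense 1-d units: (dimension, interval) pairs containing enough points
    dense_1d = {(dim, b) for dim in range(d) for b in range(n_bins)
                if np.sum(bins[:, dim] == b) > threshold}
    # Candidate 2-d units: both 1-d projections must already be dense
    dense_2d = set()
    for (d1, b1), (d2, b2) in combinations(sorted(dense_1d), 2):
        if d1 == d2:
            continue
        if np.sum((bins[:, d1] == b1) & (bins[:, d2] == b2)) > threshold:
            dense_2d.add(((d1, b1), (d2, b2)))
    return dense_1d, dense_2d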

Jian Pei: CMPT 459/741 Clustering (4) 15


CLIQUE: An Example

[Figure: grid partitions of the (age, salary) and (age, vacation) subspaces; salary is in units of $10,000, vacation in weeks, age ranges over 20–60, and each dimension is divided into intervals 0–7. The highlighted dense units fall in the age range 30–50.]
Jian Pei: CMPT 459/741 Clustering (4) 16
CLIQUE: Pros and Cons
•  Automatically finds subspaces of the highest
dimensionality in which high-density clusters exist
•  Insensitive to the order of input records
–  Does not presume any canonical data distribution
•  Scales linearly with the size of the input
•  Scales well with the number of dimensions
•  The accuracy of the clustering result may be degraded
for the sake of the simplicity of the method

Jian Pei: CMPT 459/741 Clustering (4) 17


Bad Cases for CLIQUE

•  Parts of a cluster may be missed
•  A cluster from CLIQUE may contain noise

Jian Pei: CMPT 459/741 Clustering (4) 18


Dimensionality Reduction
•  Clustering a high dimensional data set is
challenging
–  Distance between two points could be dominated by
noise
•  Dimensionality reduction: choosing the informative
dimensions for clustering analysis
–  Feature selection: choosing a subset of existing
dimensions
–  Feature construction: construct a new (small) set of
informative attributes

Jian Pei: CMPT 459/741 Clustering (4) 19


Variance and Covariance
•  Given a set of 1-d points, how different are
those points?
–  Standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$
–  Variance: $s^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$
•  Given a set of 2-d points, are the two
dimensions correlated?
–  Covariance: $cov(X, Y) = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}$
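These quantities map directly onto NumPy; a small illustrative check (ddof=1 gives the n-1 denominator used above, and the data values are arbitrary):

import numpy as np

x = np.array([2.5, 0.5, 2.2, 1.9, 3.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0])

print(np.std(x, ddof=1))     # sample standard deviation s
print(np.var(x, ddof=1))     # sample variance s^2
print(np.cov(x, y, ddof=1))  # 2x2 matrix; the off-diagonal entries are cov(X, Y)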

Jian Pei: CMPT 459/741 Clustering (4) 20


Principal Components

Art work and example from http://csnet.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

Jian Pei: CMPT 459/741 Clustering (4) 21


Step 1: Mean Subtraction
•  Subtract the mean from each dimension for each
data point
•  Intuition: centering the data set at the origin

Jian Pei: CMPT 459/741 Clustering (4) 22


Step 2: Covariance Matrix
   $C = \begin{pmatrix}
   cov(D_1, D_1) & cov(D_1, D_2) & \cdots & cov(D_1, D_n) \\
   cov(D_2, D_1) & cov(D_2, D_2) & \cdots & cov(D_2, D_n) \\
   \vdots & \vdots & \ddots & \vdots \\
   cov(D_n, D_1) & cov(D_n, D_2) & \cdots & cov(D_n, D_n)
   \end{pmatrix}$

Jian Pei: CMPT 459/741 Clustering (4) 23


Step 3: Eigenvectors and Eigenvalues

•  Compute the eigenvectors and the
eigenvalues of the covariance matrix
–  Intuition: find the direction-invariant vectors as
candidates for new attributes
–  Eigenvalues indicate how much the direction-
invariant vectors are scaled – the larger the
eigenvalue, the better that direction manifests the
data variance

Jian Pei: CMPT 459/741 Clustering (4) 24


Step 4: Forming New Features
•  Choose the principal components and form new
features
–  Typically, choose the top-k components

Jian Pei: CMPT 459/741 Clustering (4) 25


New Features

NewData = RowFeatureVector x RowDataAdjust
(in the tutorial's notation, the chosen eigenvectors, as rows, multiply the mean-adjusted data)
Here only the first principal component is used
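The four steps fit into a few lines of NumPy. The sketch below (illustrative names, random example data) keeps the top-k eigenvectors as the new features; it is not the tutorial's code.

import numpy as np

def pca(X, k):
    # Project the rows of X onto the top-k principal components.
    X_centered = X - X.mean(axis=0)         # Step 1: mean subtraction
    C = np.cov(X_centered, rowvar=False)    # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # Step 3: eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1][:k]   # keep the k largest eigenvalues
    components = eigvecs[:, order]          # Step 4: form the feature vector
    return X_centered @ components          # the new (derived) features

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca(X, k=2).shape)  # (100, 2)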

Jian Pei: CMPT 459/741 Clustering (4) 26


Clustering in Derived Space
[Figure: data points in the original (X, Y) space and the derived axis -0.707x + 0.707y onto which they are projected for clustering.]
Jian Pei: CMPT 459/741 Clustering (4) 27


Spectral Clustering
Data → affinity matrix W = [W_ij] → A = f(W) → compute the leading k eigenvectors of A (Av = λv) → cluster in the new space → project the clustering back to the original data

Jian Pei: CMPT 459/741 Clustering (4) 28


Affinity Matrix
•  Using a distance measure:
   $W_{ij} = e^{-\frac{dist(o_i, o_j)^2}{\sigma^2}}$
   where σ is a scaling parameter controlling
how fast the affinity W_ij decreases as the
distance increases
•  In the Ng-Jordan-Weiss algorithm, Wii is set
to 0

Jian Pei: CMPT 459/741 Clustering (4) 29


Clustering
•  In the Ng-Jordan-Weiss algorithm, we define a
diagonal matrix D such that $D_{ii} = \sum_{j=1}^{n} W_{ij}$
•  Then, $A = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}$
•  Use the k leading eigenvectors of A to form a
new space
•  Map the original data to the new space and
conduct clustering
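A minimal sketch of the pipeline just described: Gaussian affinity, A = D^{-1/2} W D^{-1/2}, the k leading eigenvectors, then k-means in the new space (scikit-learn's KMeans is used for the last step). The sigma value and the row normalization follow the usual Ng-Jordan-Weiss description; this is an illustrative sketch, not the exact course code.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # Affinity matrix with W_ii = 0, as in the Ng-Jordan-Weiss algorithm
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    # A = D^{-1/2} W D^{-1/2} with D_ii = sum_j W_ij
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1) + 1e-12)
    A = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # The k leading eigenvectors of A form the new space (rows normalized)
    eigvals, eigvecs = np.linalg.eigh(A)
    Y = eigvecs[:, -k:]
    Y = Y / (np.linalg.norm(Y, axis=1, keepdims=True) + 1e-12)
    # Cluster the mapped points; the labels carry back to the original data
    return KMeans(n_clusters=k, n_init=10).fit_predict(Y)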
Jian Pei: CMPT 459/741 Clustering (4) 30
Is a Clustering Good?
•  Feasibility
–  Applying any clustering method to a uniformly
distributed data set is meaningless
•  Quality
–  Do the clustering results meet the users' interest?
–  Clustering patients into groups corresponding to
various diseases or sub-phenotypes is meaningful
–  Clustering patients into groups corresponding to
male or female is not meaningful

Jian Pei: CMPT 459/741 Clustering (4) 31


Major Tasks
•  Assessing clustering tendency
–  Are there non-random structures in the data?
•  Determining the number of clusters or other
critical parameters
•  Measuring clustering quality

Jian Pei: CMPT 459/741 Clustering (4) 32


Uniformly Distributed Data
•  Clustering uniformly distributed data is
meaningless
•  A uniformly distributed data set is generated
by a uniform data distribution

Jian Pei: CMPT 459/741 Clustering (4) 33


[Figure 10.21: a data set that is uniformly distributed]
Hopkins Statistic
•  Hypothesis: the data is generated by a
uniform distribution in a space
•  Sample n points, p1, …, pn, uniformly from
the space of D
•  For each point pi, find the nearest neighbor
of pi in D, let xi be the distance between pi
and its nearest neighbor in D
   $x_i = \min_{v \in D} \{ dist(p_i, v) \}$

Jian Pei: CMPT 459/741 Clustering (4) 34


Hopkins Statistic
•  Sample n points, q1, …, qn, uniformly from D
•  For each qi, find the nearest neighbor of qi
in D – {qi}, let yi be the distance between qi
and its nearest neighbor in D – {qi}
   $y_i = \min_{v \in D,\, v \neq q_i} \{ dist(q_i, v) \}$
•  Calculate the Hopkins Statistic H
   $H = \dfrac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i + \sum_{i=1}^{n} y_i}$
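A small NumPy sketch of the statistic as defined above: the points p_i are drawn uniformly from the bounding box of D, and the q_i are sampled from D itself. The names and the choice n = 50 are illustrative.

import numpy as np

def hopkins_statistic(D, n=50, seed=0):
    # H near 0.5 suggests uniformly distributed data; H near 0 suggests clusters.
    rng = np.random.default_rng(seed)
    lo, hi = D.min(axis=0), D.max(axis=0)
    # p_1..p_n uniform in the data space; x_i = distance to the nearest neighbor in D
    P = rng.uniform(lo, hi, size=(n, D.shape[1]))
    x = np.array([np.sqrt(((D - p) ** 2).sum(axis=1)).min() for p in P])
    # q_1..q_n sampled from D; y_i = distance to the nearest neighbor in D - {q_i}
    y = []
    for i in rng.choice(len(D), size=n, replace=False):
        dists = np.sqrt(((D - D[i]) ** 2).sum(axis=1))
        dists[i] = np.inf  # exclude q_i itself
        y.append(dists.min())
    y = np.array(y)
    return y.sum() / (x.sum() + y.sum())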
Jian Pei: CMPT 459/741 Clustering (4) 35
Explanation
•  If D is uniformly distributed, then $\sum_{i=1}^{n} x_i$ and
$\sum_{i=1}^{n} y_i$ would be close to each other, and thus
H would be around 0.5
•  If D is highly skewed (clustered), then $\sum_{i=1}^{n} y_i$ would be
substantially smaller, and thus H would be
close to 0
•  If H > 0.5, then it is unlikely that D has
statistically significant clusters
Jian Pei: CMPT 459/741 Clustering (4) 36
Finding the Number of Clusters
•  Depending on many factors
–  The shape and scale of the distribution in the
data set
–  The clustering resolution required by the user
•  Many methods exist
–  Set $k = \sqrt{n/2}$; each cluster has $\sqrt{2n}$ points on
average
–  Plot the sum of within-cluster variances with
respect to k, and find the first (or the most significant)
turning point

Jian Pei: CMPT 459/741 Clustering (4) 37


A Cross-Validation Method
•  Divide the data set D into m parts
•  Use m – 1 parts to find a clustering
•  Use the remaining part as the test set to test
the quality of the clustering
–  For each point in the test set, find the closest
centroid or cluster center
–  Use the squared distances between all points in the
test set and the corresponding centroids to
measure how well the clustering model fits the test
set
•  Repeat m times for each value of k, and use the
average as the quality measure (a sketch of this procedure follows)
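A sketch of this procedure with k-means as the clustering method (k-means, 10 folds, and the scikit-learn utilities are assumptions made for illustration, not part of the slide):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_cluster_quality(X, k, m_folds=10, seed=0):
    # Average test-set sum of squared distances over m folds, for a given k.
    scores = []
    kf = KFold(n_splits=m_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[train_idx])
        # Squared distance from each test point to its closest centroid
        d2 = ((X[test_idx, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
        scores.append(d2.min(axis=1).sum())
    return np.mean(scores)

# Compare candidate values of k and look for where the score stops improving:
# for k in range(2, 11): print(k, cv_cluster_quality(X, k))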
Jian Pei: CMPT 459/741 Clustering (4) 38
Measuring Clustering Quality
•  Ground truth: the ideal clustering
determined by human experts
•  Two situations
–  There is a known ground truth – the extrinsic
(supervised) methods, comparing the clustering
against the ground truth
–  The ground truth is unavailable – the intrinsic
(unsupervised) methods, measuring how well
the clusters are separated

Jian Pei: CMPT 459/741 Clustering (4) 39


Quality in Extrinsic Methods
•  Cluster homogeneity: the more pure the
clusters in a clustering are, the better the
clustering
•  Cluster completeness: objects in the same
cluster in the ground truth should be clustered
together
•  Rag bag: putting a heterogeneous object into a
pure cluster is worse than putting it into a rag
bag
•  Small cluster preservation: splitting a small
cluster in the ground truth into pieces is worse
than splitting a bigger one
Jian Pei: CMPT 459/741 Clustering (4) 40
Bcubed Precision and Recall
•  D = {o1, …, on}
–  L(oi) is the cluster of oi given by the ground truth
•  C is a clustering on D
–  C(oi) is the cluster-id of oi in C
•  For two objects o_i and o_j, the correctness is
1 if L(o_i) = L(o_j) ⇔ C(o_i) = C(o_j), and 0
otherwise

Jian Pei: CMPT 459/741 Clustering (4) 41


Bcubed Precision and Recall
   $Correctness(o_i, o_j) = \begin{cases} 1 & \text{if } L(o_i) = L(o_j) \Leftrightarrow C(o_i) = C(o_j) \\ 0 & \text{otherwise} \end{cases}$
•  Precision
   $Precision\ BCubed = \dfrac{\sum_{i=1}^{n} \dfrac{\sum_{o_j:\, i \neq j,\, C(o_i) = C(o_j)} Correctness(o_i, o_j)}{\| \{ o_j \mid i \neq j,\, C(o_i) = C(o_j) \} \|}}{n}$
•  Recall
   $Recall\ BCubed = \dfrac{\sum_{i=1}^{n} \dfrac{\sum_{o_j:\, i \neq j,\, L(o_i) = L(o_j)} Correctness(o_i, o_j)}{\| \{ o_j \mid i \neq j,\, L(o_i) = L(o_j) \} \|}}{n}$
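A direct (quadratic-time) sketch of the two definitions above; labels_true plays the role of L and labels_pred the role of C (the names are illustrative).

import numpy as np

def bcubed(labels_true, labels_pred):
    # BCubed precision and recall computed straight from the definitions.
    L = np.asarray(labels_true)
    C = np.asarray(labels_pred)
    n = len(L)
    same_L = L[:, None] == L[None, :]
    same_C = C[:, None] == C[None, :]
    np.fill_diagonal(same_L, False)   # only pairs with i != j count
    np.fill_diagonal(same_C, False)
    correct = same_L & same_C         # pairs with Correctness(o_i, o_j) = 1
    # Per-object averages; objects with no qualifying partner contribute 0
    prec_i = np.divide(correct.sum(axis=1), same_C.sum(axis=1),
                       out=np.zeros(n), where=same_C.sum(axis=1) > 0)
    rec_i = np.divide(correct.sum(axis=1), same_L.sum(axis=1),
                      out=np.zeros(n), where=same_L.sum(axis=1) > 0)
    return prec_i.mean(), rec_i.mean()

print(bcubed([0, 0, 1, 1], [0, 0, 0, 1]))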

Jian Pei: CMPT 459/741 Clustering (4) 42
Silhouette Coefficient
•  No ground truth is assumed
•  Suppose a data set D of n objects is partitioned
into k clusters, C1, …, Ck
•  For each object o,
–  Calculate a(o), the average distance between o and
every other object in the same cluster –
compactness of a cluster, the smaller, the better
–  Calculate b(o), the minimum average distance from
o to all objects in a cluster that o does not
belong to – degree of separation from other
clusters, the larger, the better
Jian Pei: CMPT 459/741 Clustering (4) 43
Silhouette Coefficient
   $a(o) = \dfrac{\sum_{o' \in C_i,\, o' \neq o} dist(o, o')}{|C_i| - 1}$
   $b(o) = \min_{C_j:\, o \notin C_j} \left\{ \dfrac{\sum_{o' \in C_j} dist(o, o')}{|C_j|} \right\}$
•  Then
   $s(o) = \dfrac{b(o) - a(o)}{\max\{a(o), b(o)\}}$
•  Use the average silhouette coefficient of all
objects as the overall measure
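The definitions translate into a short NumPy sketch (clusters are assumed to contain at least two objects; scikit-learn's silhouette_score computes the same measure):

import numpy as np

def silhouette(X, labels):
    # Average silhouette coefficient over all objects.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    idx = np.arange(len(X))
    scores = []
    for i in idx:
        same = (labels == labels[i]) & (idx != i)
        a = dist[i, same].mean()                 # a(o): average distance within o's cluster
        b = min(dist[i, labels == c].mean()      # b(o): smallest average distance
                for c in np.unique(labels) if c != labels[i])  # to another cluster
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))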
Jian Pei: CMPT 459/741 Clustering (4) 44
Multi-Clustering
•  A data set may be clustered in different
ways
–  In different subspaces, that is, using different
attributes
–  Using different similarity measures
–  Using different clustering methods
•  Some different clusterings may capture
different meanings of categorization
–  Orthogonal clusterings
•  Putting users in the loop
Jian Pei: CMPT 459/741 Clustering (4) 45
To-Do List
•  Read Chapters 10.5, 10.6, and 11.1
•  Find out how Gaussian mixture can be used
in SPARK MLlib
•  (for thesis-based graduate students only)
Learn LDA (Latent Dirichlet allocation) by
yourself

Jian Pei: CMPT 459/741 Clustering (4) 46
