Lecture 3. Partitioning-Based Clustering Methods
Basic Concepts of Partitioning Algorithms
The K-Means Clustering Method
Initialization of K-Means Clustering
The K-Medoids Clustering Method
The K-Medians and K-Modes Clustering Methods
The Kernel K-Means Clustering Method
Summary
Session 1: Basic Concepts of Partitioning Algorithms
Partitioning Algorithms: Basic Concepts
Partitioning method: Discovering the groupings in the data by optimizing a specific
objective function and iteratively improving the quality of partitions
K-partitioning method: Partitioning a dataset D of n objects into a set of K clusters
so that an objective function is optimized (e.g., the sum of squared distances is
minimized, where ck is the centroid or medoid of cluster Ck)
A typical objective function: Sum of Squared Errors (SSE), computed concretely in the sketch at the end of this list:

    SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} || x_i - c_k ||^2
Problem definition: Given K, find a partition of K clusters that optimizes the chosen
partitioning criterion
Global optimum: finding it requires exhaustively enumerating all possible partitions
Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-Medoids, etc.
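To make the SSE objective concrete, here is a minimal sketch in Python with NumPy (our own illustration; the data values and function name are made up):

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for k, c in enumerate(centroids):
        members = points[labels == k]          # points assigned to cluster k
        total += np.sum((members - c) ** 2)    # squared L2 distances to centroid
    return total

# Toy example with two clusters (hypothetical data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5, 0.0], [10.5, 10.0]])
print(sse(X, labels, centroids))  # 4 points, each 0.5 away: 4 * 0.25 = 1.0
```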
Session 2: The K-Means Clustering Method
The K-Means Clustering Method
K-Means (MacQueen’67, Lloyd’57/’82)
Each cluster is represented by the center of the cluster
Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows
Select K points as initial centroids
Repeat
Form K clusters by assigning each point to its closest centroid
Re-compute the centroids (i.e., mean point) of each cluster
Until convergence criterion is satisfied
Different proximity measures can be used: Manhattan distance (L1 norm), Euclidean distance (L2 norm), cosine similarity (a minimal implementation sketch follows below)
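The loop above is short enough to implement directly; a minimal sketch in Python with NumPy (our own illustration, using Euclidean distance and plain random seeding; better seeding is covered in Session 3):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids (plain random choice)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the centroid (i.e., mean point) of each cluster
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members):                 # guard against an empty cluster
                new_centroids[k] = members.mean(axis=0)
        # Convergence criterion: centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```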
Example: K-Means Clustering
[Figure: K-Means on a toy data set. The original data points with K = 2 randomly selected centroids; arrows labeled "Assign points to clusters" and "Recompute cluster centers" illustrate the iteration, ending with a redone point assignment.]
Execution of the K-Means Clustering Algorithm
Select K points as initial centroids
Repeat
• Form K clusters by assigning each point to its closest centroid
• Re-compute the centroids (i.e., mean point) of each cluster
Until convergence criterion is satisfied
Discussion on the K-Means Method
Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of iterations
Normally, K, t << n; thus, an efficient method
K-Means clustering often terminates at a local optimum
Initialization can be important to find high-quality clusters
Need to specify K, the number of clusters, in advance
There are ways to automatically determine the “best” K
In practice, one often runs the algorithm with a range of K values and selects the "best" one
Sensitive to noisy data and outliers
Variations: Using K-medians, K-medoids, etc.
K-means is applicable only to objects in a continuous n-dimensional space
Using the K-modes for categorical data
Not suitable to discover clusters with non-convex shapes
Using density-based clustering, kernel K-means, etc.
Variations of K-Means
There are many variants of the K-Means method, varying in different aspects
Choosing better initial centroid estimates
K-Means++, Intelligent K-Means, Genetic K-Means (to be discussed in this lecture)
Choosing different representative prototypes for the clusters
K-Medoids, K-Medians, K-Modes (to be discussed in this lecture)
Applying feature transformation techniques
Weighted K-Means, Kernel K-Means (to be discussed in this lecture)
Session 3: Initialization of K-Means Clustering
Initialization of K-Means
Different initializations may generate rather different clustering
results (some could be far from optimal)
Original proposal (MacQueen’67): Select K seeds randomly
Need to run the algorithm multiple times using different seeds
There are many methods proposed for better initialization of k seeds
K-Means++ (Arthur & Vassilvitskii’07):
The first centroid is selected at random
Each subsequent centroid is sampled with probability proportional to its squared distance from the nearest already-selected centroid (a weighted probability score that favors far-away points)
The selection continues until K centroids are obtained
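A minimal sketch of the K-Means++ seeding step in Python with NumPy (our own illustration):

```python
import numpy as np

def k_means_pp_init(X, K, seed=0):
    """K-Means++ seeding: sample each next centroid with probability
    proportional to the squared distance to the nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # The first centroid is selected uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                # the weighted probability score
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```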
Example: Poor Initialization May Lead to Poor Clustering
[Figure: another random selection of K centroids for the same data points; K-Means is rerun with these seeds, using the same assign/recompute steps, and this run generates a poor-quality clustering.]
Session 4: The K-Medoids Clustering Method
Handling Outliers: From K-Means to K-Medoids
The K-Means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster
The K-Medoids clustering algorithm:
Select K points as the initial representative objects (i.e., as initial K medoids)
Repeat
Assign each point to the cluster with the closest medoid
Randomly select a non-representative object oi
Compute the total cost S of swapping the medoid m with oi
If S < 0, then swap m with oi to form the new set of medoids
Until convergence criterion is satisfied
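A compact sketch of this randomized-swap loop in Python with NumPy (our own illustration; full PAM, discussed below, evaluates candidate swaps more systematically):

```python
import numpy as np

def k_medoids(X, K, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    medoid_idx = rng.choice(n, size=K, replace=False)   # initial K medoids

    def total_cost(idx):
        # Sum over all points of the distance to the closest medoid
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2)
        return d.min(axis=1).sum()

    cost = total_cost(medoid_idx)
    for _ in range(max_iter):
        # Randomly select a medoid m and a non-representative object o_i
        m = rng.integers(K)
        candidates = np.setdiff1d(np.arange(n), medoid_idx)
        o = rng.choice(candidates)
        trial = medoid_idx.copy()
        trial[m] = o
        new_cost = total_cost(trial)
        if new_cost < cost:          # swap only if the total cost decreases (S < 0)
            medoid_idx, cost = trial, new_cost
    labels = np.linalg.norm(
        X[:, None, :] - X[medoid_idx][None, :, :], axis=2
    ).argmin(axis=1)
    return labels, medoid_idx
```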
PAM: A Typical K-Medoids Algorithm
[Figure: PAM execution on a toy data set with K = 2. Arbitrarily choose K objects as initial medoids; assign each remaining object to its nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping a medoid m with O_random; swap if the clustering quality improves; repeat until the convergence criterion is satisfied.]
Discussion on K-Medoids Clustering
K-Medoids Clustering: Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids: Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids, and
Iteratively replaces one of the medoids by one of the non-medoids if it improves
the total sum of the squared errors (SSE) of the resulting clustering
PAM works effectively for small data sets but does not scale well for large data
sets (due to the computational complexity)
Computational complexity: PAM: O(K(n − K)^2) (quite expensive!)
Efficiency improvements on PAM
CLARA (Kaufmann & Rousseeuw, 1990):
PAM on samples: O(Ks^2 + K(n − K)), where s is the sample size
CLARANS (Ng & Han, 1994): Randomized re-sampling, ensuring efficiency + quality
Session 5: The K-Medians and K-Modes Clustering Methods
K-Medians: Handling Outliers by Computing Medians
Medians are less sensitive to outliers than means
Think of the median salary vs. mean salary of a large firm when adding a few top
executives!
K-Medians: Instead of taking the mean value of the objects in a cluster as a reference point, medians are used (with the L1-norm as the distance measure)
The criterion function for the K-Medians algorithm:

    S = \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{j} | x_{ij} - med_{kj} |

where med_{kj} is the median of feature j over the objects in cluster C_k
The K-Medians clustering algorithm:
Select K points as the initial representative objects (i.e., as initial K medians)
Repeat
Assign every point to its nearest median
Re-compute the median using the median of each individual feature
Until convergence criterion is satisfied
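A minimal sketch in Python with NumPy (our own illustration); it differs from K-Means only in the L1 assignment distance and the per-feature median update:

```python
import numpy as np

def k_medians(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    medians = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Assign every point to its nearest median under the L1 (Manhattan) distance
        d = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        # Re-compute the median of each individual feature per cluster
        new_medians = medians.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members):
                new_medians[k] = np.median(members, axis=0)
        if np.allclose(new_medians, medians):
            break
        medians = new_medians
    return labels, medians
```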
K-Modes: Clustering Categorical Data
K-Means cannot handle non-numerical (categorical) data
Mapping categorical values to 1/0 codes cannot generate quality clusters for high-dimensional data
K-Modes: An extension to K-Means by replacing means of clusters with modes
Dissimilarity measure between object X and the center Z_l of a cluster:
    Φ(x_j, z_j) = 1 − n_j^r / n_l when x_j = z_j;  Φ(x_j, z_j) = 1 when x_j ≠ z_j
where z_j is the categorical value of attribute j in Z_l, n_l is the number of objects
in cluster l, and n_j^r is the number of objects whose attribute value is r
This dissimilarity measure (distance function) is frequency-based
Algorithm is still based on iterative object cluster assignment and centroid update
A fuzzy K-Modes method is proposed to calculate a fuzzy cluster membership
value for each object to each cluster
A mixture of categorical and numerical data: Using a K-Prototype method
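A simplified sketch in Python with NumPy (our own illustration; it uses plain simple-matching dissimilarity, counting mismatched attributes, rather than the frequency-based measure above):

```python
import numpy as np

def k_modes(X, K, max_iter=100, seed=0):
    """X: (n, d) array of categorical codes (small non-negative integers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    modes = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # Simple-matching dissimilarity: number of mismatched attributes
        d = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        new_modes = modes.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members):
                # Mode of each attribute: its most frequent category in the cluster
                new_modes[k] = [np.bincount(col).argmax() for col in members.T]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes
```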
Session 6: Kernel K-Means Clustering
Kernel K-Means Clustering
Kernel K-Means can be used to detect non-convex clusters
K-Means can only detect clusters that are linearly separable
Idea: Project data onto the high-dimensional kernel space, and
then perform K-Means clustering
Map data points in the input space onto a high-dimensional feature
space using the kernel function
Perform K-Means on the mapped feature space
Computational complexity is higher than K-Means
Need to compute and store the n × n kernel matrix obtained by applying the kernel function to the original data
The widely studied spectral clustering can be considered as a variant of
Kernel K-Means clustering
Kernel Functions and Kernel K-Means Clustering
Typical kernel functions:
Polynomial kernel of degree h: K(Xi, Xj) = (Xi∙Xj + 1)h
Gaussian radial basis function (RBF) kernel: K(X_i, X_j) = e^{−||X_i − X_j||^2 / (2σ^2)}
Sigmoid kernel: K(Xi, Xj) = tanh(κXi∙Xj −δ)
The kernel matrix entry for any two points x_i, x_j ∈ C_k is K_{x_i x_j} = φ(x_i) · φ(x_j)
The SSE criterion of Kernel K-Means:

    SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} || φ(x_i) - c_k ||^2

The formula for the cluster centroid:

    c_k = \frac{\sum_{x_i \in C_k} φ(x_i)}{|C_k|}
Clustering can be performed without the actual individual projections φ(xi) and φ(xj)
for the data points xi, xj є Ck
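This can be seen in a short sketch in Python with NumPy (our own illustration): expanding ||φ(x_i) − c_k||^2 gives K_ii − (2/|C_k|) Σ_{j∈C_k} K_ij + (1/|C_k|^2) Σ_{j,l∈C_k} K_jl, so only kernel matrix entries are needed.

```python
import numpy as np

def kernel_k_means(K_mat, K, max_iter=100, seed=0):
    """Cluster using only the n x n kernel matrix, never the projections phi(x)."""
    rng = np.random.default_rng(seed)
    n = K_mat.shape[0]
    labels = rng.integers(K, size=n)        # random initial assignment
    for _ in range(max_iter):
        dist = np.zeros((n, K))
        for k in range(K):
            mask = labels == k
            nk = mask.sum()
            if nk == 0:
                dist[:, k] = np.inf         # empty cluster attracts no points
                continue
            # ||phi(x_i) - c_k||^2 expanded in terms of kernel entries only
            dist[:, k] = (
                np.diag(K_mat)
                - 2 * K_mat[:, mask].sum(axis=1) / nk
                + K_mat[np.ix_(mask, mask)].sum() / nk**2
            )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```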
Example: Kernel Functions and Kernel K-Means Clustering
Gaussian radial basis function (RBF) kernel: K(X_i, X_j) = e^{−||X_i − X_j||^2 / (2σ^2)}
Suppose there are 5 original 2-dimensional points:
    x1(0, 0), x2(4, 4), x3(−4, 4), x4(−4, −4), x5(4, −4)
If we set σ to 4, we will have the following points in the kernel space
    E.g., ||x1 − x2||^2 = (0 − 4)^2 + (0 − 4)^2 = 32, therefore K(x1, x2) = e^{−32/(2·4^2)} = e^{−1}

Original Space        RBF Kernel Space (σ = 4)
       x    y    K(x_i, x1)  K(x_i, x2)  K(x_i, x3)  K(x_i, x4)  K(x_i, x5)
x1     0    0    1           e^{−1}      e^{−1}      e^{−1}      e^{−1}
x2     4    4    e^{−1}      1           e^{−2}      e^{−4}      e^{−2}
x3    −4    4    e^{−1}      e^{−2}      1           e^{−2}      e^{−4}
x4    −4   −4    e^{−1}      e^{−4}      e^{−2}      1           e^{−2}
x5     4   −4    e^{−1}      e^{−2}      e^{−4}      e^{−2}      1
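The table entries can be checked mechanically with a short Python/NumPy snippet (our own illustration):

```python
import numpy as np

X = np.array([[0, 0], [4, 4], [-4, 4], [-4, -4], [4, -4]], dtype=float)
sigma = 4.0
# Pairwise squared distances, then the RBF kernel matrix
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-sq / (2 * sigma**2))
print(np.round(K, 4))
# Diagonal is e^0 = 1; K[0, 1] = e^{-32/32} = e^{-1} ≈ 0.3679, etc.
```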
Example: Kernel K-Means Clustering
[Figure, three panels: the original data set; the result of K-Means clustering; the result of Gaussian Kernel K-Means clustering.]
The above data set cannot generate quality clusters by K-Means since it contains non-convex clusters
The Gaussian RBF kernel transformation maps the data to a kernel matrix K_{x_i x_j} = φ(x_i) · φ(x_j) for any two points x_i, x_j, with the Gaussian kernel K(X_i, X_j) = e^{−||X_i − X_j||^2 / (2σ^2)}
K-Means clustering is then conducted on the mapped data, generating quality clusters
Session 7: Summary
Summary: Partitioning-Based Clustering Methods
Basic Concepts of Partitioning Algorithms
The K-Means Clustering Method
Initialization of K-Means Clustering
The K-Medoids Clustering Method
The K-Medians and K-Modes Clustering Methods
The Kernel K-Means Clustering Method
Summary
Recommended Readings
J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proc.
of the 5th Berkeley Symp. on Mathematical Statistics and Probability, 1967
S. Lloyd. Least Squares Quantization in PCM. IEEE Trans. on Information Theory, 28(2), 1982
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John
Wiley & Sons, 1990
R. Ng and J. Han. Efficient and Effective Clustering Method for Spatial Data Mining. VLDB'94
B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear Component Analysis as a Kernel Eigenvalue
Problem. Neural Computation, 10(5):1299–1319, 1998
I. S. Dhillon, Y. Guan, and B. Kulis. Kernel K-Means: Spectral Clustering and Normalized Cuts. KDD’04
D. Arthur and S. Vassilvitskii. K-means++: The Advantages of Careful Seeding. SODA’07
C. K. Reddy and B. Vinzamuri. A Survey of Partitional and Hierarchical Clustering Algorithms, in
(Chap. 4) Aggarwal and Reddy (eds.), Data Clustering: Algorithms and Applications. CRC Press, 2014
M. J. Zaki and W. Meira, Jr.. Data Mining and Analysis: Fundamental Concepts and Algorithms.
Cambridge Univ. Press, 2014