Lecture 3. Partitioning-Based Clustering Methods
Basic Concepts of Partitioning Algorithms
The K-Means Clustering Method
Initialization of K-Means Clustering
The K-Medoids Clustering Method
The K-Medians and K-Modes Clustering Methods
The Kernel K-Means Clustering Method
Summary
Session 1: Basic Concepts of Partitioning Algorithms
Partitioning Algorithms: Basic Concepts
Partitioning method: Discovering the groupings in the data by optimizing a specific
objective function and iteratively improving the quality of partitions
K-partitioning method: Partitioning a dataset D of n objects into a set of K clusters
so that an objective function is optimized (e.g., the sum of squared distances is
minimized, where ck is the centroid or medoid of cluster Ck)
A typical objective function: Sum of Squared Errors (SSE), computed concretely in the sketch at the end of this list:

    SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} || x_i - c_k ||^2
Problem definition: Given K, find a partition of K clusters that optimizes the chosen
partitioning criterion
Global optimum: finding it requires exhaustively enumerating all possible partitions
Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-Medoids, etc.
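To make the SSE objective concrete, here is a minimal sketch in Python with NumPy (our own illustration; the data values and function name are made up):

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for k, c in enumerate(centroids):
        members = points[labels == k]          # points assigned to cluster k
        total += np.sum((members - c) ** 2)    # squared L2 distances to centroid
    return total

# Toy example with two clusters (hypothetical data)
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5, 0.0], [10.5, 10.0]])
print(sse(X, labels, centroids))  # 4 points, each 0.5 away: 4 * 0.25 = 1.0
```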
Session 2: The K-Means Clustering Method
The K-Means Clustering Method
K-Means (MacQueen’67, Lloyd’57/’82)
Each cluster is represented by the center of the cluster
Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows
Select K points as initial centroids
Repeat
Form K clusters by assigning each point to its closest centroid
Re-compute the centroids (i.e., mean point) of each cluster
Until convergence criterion is satisfied
Different proximity measures can be used: Manhattan distance (L1 norm), Euclidean distance (L2 norm), cosine similarity (a minimal implementation sketch follows below)
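The loop above is short enough to implement directly; a minimal sketch in Python with NumPy (our own illustration, using Euclidean distance and plain random seeding; better seeding is covered in Session 3):

```python
import numpy as np

def k_means(X, K, max_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Select K points as initial centroids (plain random choice)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-compute the centroid (i.e., mean point) of each cluster
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members):                 # guard against an empty cluster
                new_centroids[k] = members.mean(axis=0)
        # Convergence criterion: centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```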
Example: K-Means Clustering
[Figure: K-Means on a toy data set. The original data points with K = 2 randomly selected centroids; arrows labeled "Assign points to clusters" and "Recompute cluster centers" illustrate the iteration, ending with a redone point assignment.]
Execution of the K-Means Clustering Algorithm
Select K points as initial centroids
Repeat
• Form K clusters by assigning each point to its closest centroid
• Re-compute the centroids (i.e., mean point) of each cluster
Until convergence criterion is satisfied
Discussion on the K-Means Method
Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of iterations
Normally, K, t << n; thus, an efficient method
K-Means clustering often terminates at a local optimum
Initialization can be important to find high-quality clusters
Need to specify K, the number of clusters, in advance
There are ways to automatically determine the “best” K
In practice, one often runs the algorithm with a range of K values and selects the "best" one
Sensitive to noisy data and outliers
Variations: Using K-medians, K-medoids, etc.
K-means is applicable only to objects in a continuous n-dimensional space
Using the K-modes for categorical data
Not suitable to discover clusters with non-convex shapes
Using density-based clustering, kernel K-means, etc.
Variations of K-Means
There are many variants of the K-Means method, varying in different aspects
Choosing better initial centroid estimates
K-Means++, Intelligent K-Means, Genetic K-Means (to be discussed in this lecture)
Choosing different representative prototypes for the clusters
K-Medoids, K-Medians, K-Modes (to be discussed in this lecture)
Applying feature transformation techniques
Weighted K-Means, Kernel K-Means (to be discussed in this lecture)
Session 3: Initialization of K-Means Clustering
Initialization of K-Means
Different initializations may generate rather different clustering
results (some could be far from optimal)
Original proposal (MacQueen’67): Select K seeds randomly
Need to run the algorithm multiple times using different seeds
There are many methods proposed for better initialization of k seeds
K-Means++ (Arthur & Vassilvitskii’07):
The first centroid is selected at random
Each subsequent centroid is sampled with probability proportional to its squared distance from the nearest already-selected centroid (a weighted probability score that favors far-away points)
The selection continues until K centroids are obtained
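A minimal sketch of the K-Means++ seeding step in Python with NumPy (our own illustration):

```python
import numpy as np

def k_means_pp_init(X, K, seed=0):
    """K-Means++ seeding: sample each next centroid with probability
    proportional to the squared distance to the nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # The first centroid is selected uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance of every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                # the weighted probability score
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```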
Example: Poor Initialization May Lead to Poor Clustering
[Figure: another random selection of K centroids for the same data points; K-Means is rerun with these seeds, using the same assign/recompute steps, and this run generates a poor-quality clustering.]
Session 4: The K-Medoids Clustering Method
Handling Outliers: From K-Means to K-Medoids
The K-Means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data
K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster
The K-Medoids clustering algorithm:
Select K points as the initial representative objects (i.e., as initial K medoids)
Repeat
Assign each point to the cluster with the closest medoid
Randomly select a non-representative object oi
Compute the total cost S of swapping the medoid m with oi
If S < 0, then swap m with oi to form the new set of medoids
Until convergence criterion is satisfied
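A compact sketch of this randomized-swap loop in Python with NumPy (our own illustration; full PAM, discussed below, evaluates candidate swaps more systematically):

```python
import numpy as np

def k_medoids(X, K, max_iter=200, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    medoid_idx = rng.choice(n, size=K, replace=False)   # initial K medoids

    def total_cost(idx):
        # Sum over all points of the distance to the closest medoid
        d = np.linalg.norm(X[:, None, :] - X[idx][None, :, :], axis=2)
        return d.min(axis=1).sum()

    cost = total_cost(medoid_idx)
    for _ in range(max_iter):
        # Randomly select a medoid m and a non-representative object o_i
        m = rng.integers(K)
        candidates = np.setdiff1d(np.arange(n), medoid_idx)
        o = rng.choice(candidates)
        trial = medoid_idx.copy()
        trial[m] = o
        new_cost = total_cost(trial)
        if new_cost < cost:          # swap only if the total cost decreases (S < 0)
            medoid_idx, cost = trial, new_cost
    labels = np.linalg.norm(
        X[:, None, :] - X[medoid_idx][None, :, :], axis=2
    ).argmin(axis=1)
    return labels, medoid_idx
```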
PAM: A Typical K-Medoids Algorithm
[Figure: PAM execution on a toy data set with K = 2. Arbitrarily choose K objects as initial medoids; assign each remaining object to its nearest medoid; randomly select a non-medoid object O_random; compute the total cost of swapping a medoid m with O_random; swap if the clustering quality improves; repeat until the convergence criterion is satisfied.]
Discussion on K-Medoids Clustering
K-Medoids Clustering: Find representative objects (medoids) in clusters
PAM (Partitioning Around Medoids: Kaufmann & Rousseeuw 1987)
Starts from an initial set of medoids, and
Iteratively replaces one of the medoids by one of the non-medoids if it improves
the total sum of the squared errors (SSE) of the resulting clustering
PAM works effectively for small data sets but does not scale well for large data
sets (due to the computational complexity)
Computational complexity: PAM: O(K(n − K)^2) (quite expensive!)
Efficiency improvements on PAM
CLARA (Kaufmann & Rousseeuw, 1990):
PAM on samples: O(Ks^2 + K(n − K)), where s is the sample size
CLARANS (Ng & Han, 1994): Randomized re-sampling, ensuring efficiency + quality
Session 5: The K-Medians and K-Modes Clustering Methods
K-Medians: Handling Outliers by Computing Medians
Medians are less sensitive to outliers than means
Think of the median salary vs. mean salary of a large firm when adding a few top
executives!
K-Medians: Instead of taking the mean value of the objects in a cluster as a reference point, medians are used (with the L1-norm as the distance measure)
The criterion function for the K-Medians algorithm:

    S = \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{j} | x_{ij} - med_{kj} |

where med_{kj} is the median of feature j over the objects in cluster C_k
The K-Medians clustering algorithm:
Select K points as the initial representative objects (i.e., as initial K medians)
Repeat
Assign every point to its nearest median
Re-compute the median using the median of each individual feature
Until convergence criterion is satisfied
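A minimal sketch in Python with NumPy (our own illustration); it differs from K-Means only in the L1 assignment distance and the per-feature median update:

```python
import numpy as np

def k_medians(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    medians = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Assign every point to its nearest median under the L1 (Manhattan) distance
        d = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        # Re-compute the median of each individual feature per cluster
        new_medians = medians.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members):
                new_medians[k] = np.median(members, axis=0)
        if np.allclose(new_medians, medians):
            break
        medians = new_medians
    return labels, medians
```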
K-Modes: Clustering Categorical Data
K-Means cannot handle non-numerical (categorical) data
Mapping categorical values to 1/0 codes cannot generate quality clusters for high-dimensional data
K-Modes: An extension to K-Means by replacing means of clusters with modes
Dissimilarity measure between object X and the center Z_l of a cluster:
    Φ(x_j, z_j) = 1 − n_j^r / n_l when x_j = z_j;  Φ(x_j, z_j) = 1 when x_j ≠ z_j
where z_j is the categorical value of attribute j in Z_l, n_l is the number of objects
in cluster l, and n_j^r is the number of objects whose attribute value is r
This dissimilarity measure (distance function) is frequency-based
Algorithm is still based on iterative object cluster assignment and centroid update
A fuzzy K-Modes method is proposed to calculate a fuzzy cluster membership
value for each object to each cluster
A mixture of categorical and numerical data: Using a K-Prototype method
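A simplified sketch in Python with NumPy (our own illustration; it uses plain simple-matching dissimilarity, counting mismatched attributes, rather than the frequency-based measure above):

```python
import numpy as np

def k_modes(X, K, max_iter=100, seed=0):
    """X: (n, d) array of categorical codes (small non-negative integers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    modes = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # Simple-matching dissimilarity: number of mismatched attributes
        d = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        new_modes = modes.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members):
                # Mode of each attribute: its most frequent category in the cluster
                new_modes[k] = [np.bincount(col).argmax() for col in members.T]
        if np.array_equal(new_modes, modes):
            break
        modes = new_modes
    return labels, modes
```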
Session 6: Kernel K-Means Clustering
Kernel K-Means Clustering
Kernel K-Means can be used to detect non-convex clusters
K-Means can only detect clusters that are linearly separable
Idea: Project data onto the high-dimensional kernel space, and
then perform K-Means clustering
Map data points in the input space onto a high-dimensional feature
space using the kernel function
Perform K-Means on the mapped feature space
Computational complexity is higher than K-Means
Need to compute and store the n × n kernel matrix obtained by applying the kernel function to the original data
The widely studied spectral clustering can be considered as a variant of
Kernel K-Means clustering
Kernel Functions and Kernel K-Means Clustering
Typical kernel functions:
Polynomial kernel of degree h: K(Xi, Xj) = (Xi∙Xj + 1)h
Gaussian radial basis function (RBF) kernel: K(X_i, X_j) = e^{−||X_i − X_j||^2 / (2σ^2)}
Sigmoid kernel: K(Xi, Xj) = tanh(κXi∙Xj −δ)
The kernel matrix entry for any two points x_i, x_j ∈ C_k is K_{x_i x_j} = φ(x_i) · φ(x_j)
The SSE criterion of Kernel K-Means:

    SSE(C) = \sum_{k=1}^{K} \sum_{x_i \in C_k} || φ(x_i) - c_k ||^2

The formula for the cluster centroid:

    c_k = \frac{\sum_{x_i \in C_k} φ(x_i)}{|C_k|}
Clustering can be performed without the actual individual projections φ(xi) and φ(xj)
for the data points xi, xj є Ck
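This can be seen in a short sketch in Python with NumPy (our own illustration): expanding ||φ(x_i) − c_k||^2 gives K_ii − (2/|C_k|) Σ_{j∈C_k} K_ij + (1/|C_k|^2) Σ_{j,l∈C_k} K_jl, so only kernel matrix entries are needed.

```python
import numpy as np

def kernel_k_means(K_mat, K, max_iter=100, seed=0):
    """Cluster using only the n x n kernel matrix, never the projections phi(x)."""
    rng = np.random.default_rng(seed)
    n = K_mat.shape[0]
    labels = rng.integers(K, size=n)        # random initial assignment
    for _ in range(max_iter):
        dist = np.zeros((n, K))
        for k in range(K):
            mask = labels == k
            nk = mask.sum()
            if nk == 0:
                dist[:, k] = np.inf         # empty cluster attracts no points
                continue
            # ||phi(x_i) - c_k||^2 expanded in terms of kernel entries only
            dist[:, k] = (
                np.diag(K_mat)
                - 2 * K_mat[:, mask].sum(axis=1) / nk
                + K_mat[np.ix_(mask, mask)].sum() / nk**2
            )
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```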
Example: Kernel Functions and Kernel K-Means Clustering
Gaussian radial basis function (RBF) kernel: K(X_i, X_j) = e^{−||X_i − X_j||^2 / (2σ^2)}
Suppose there are 5 original 2-dimensional points:
    x1(0, 0), x2(4, 4), x3(−4, 4), x4(−4, −4), x5(4, −4)
If we set σ to 4, we will have the following points in the kernel space
    E.g., ||x1 − x2||^2 = (0 − 4)^2 + (0 − 4)^2 = 32, therefore K(x1, x2) = e^{−32/(2·4^2)} = e^{−1}

Original Space        RBF Kernel Space (σ = 4)
       x    y    K(x_i, x1)  K(x_i, x2)  K(x_i, x3)  K(x_i, x4)  K(x_i, x5)
x1     0    0    1           e^{−1}      e^{−1}      e^{−1}      e^{−1}
x2     4    4    e^{−1}      1           e^{−2}      e^{−4}      e^{−2}
x3    −4    4    e^{−1}      e^{−2}      1           e^{−2}      e^{−4}
x4    −4   −4    e^{−1}      e^{−4}      e^{−2}      1           e^{−2}
x5     4   −4    e^{−1}      e^{−2}      e^{−4}      e^{−2}      1
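The table entries can be checked mechanically with a short Python/NumPy snippet (our own illustration):

```python
import numpy as np

X = np.array([[0, 0], [4, 4], [-4, 4], [-4, -4], [4, -4]], dtype=float)
sigma = 4.0
# Pairwise squared distances, then the RBF kernel matrix
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
K = np.exp(-sq / (2 * sigma**2))
print(np.round(K, 4))
# Diagonal is e^0 = 1; K[0, 1] = e^{-32/32} = e^{-1} ≈ 0.3679, etc.
```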
Example: Kernel K-Means Clustering
[Figure, three panels: the original data set; the result of K-Means clustering; the result of Gaussian Kernel K-Means clustering.]
The above data set cannot generate quality clusters by K-Means since it contains non-convex clusters
The Gaussian RBF kernel transformation maps the data to a kernel matrix K_{x_i x_j} = φ(x_i) · φ(x_j) for any two points x_i, x_j, with the Gaussian kernel K(X_i, X_j) = e^{−||X_i − X_j||^2 / (2σ^2)}
K-Means clustering is then conducted on the mapped data, generating quality clusters
Session 7: Summary
Summary: Partitioning-Based Clustering Methods
Basic Concepts of Partitioning Algorithms
The K-Means Clustering Method
Initialization of K-Means Clustering
The K-Medoids Clustering Method
The K-Medians and K-Modes Clustering Methods
The Kernel K-Means Clustering Method
Summary
Recommended Readings
J. MacQueen. Some Methods for Classification and Analysis of Multivariate Observations. In Proc.
of the 5th Berkeley Symp. on Mathematical Statistics and Probability, 1967
S. Lloyd. Least Squares Quantization in PCM. IEEE Trans. on Information Theory, 28(2), 1982
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John
Wiley & Sons, 1990
R. Ng and J. Han. Efficient and Effective Clustering Method for Spatial Data Mining. VLDB'94
B. Schölkopf, A. Smola, and K. R. Müller. Nonlinear Component Analysis as a Kernel Eigenvalue
Problem. Neural Computation, 10(5):1299–1319, 1998
I. S. Dhillon, Y. Guan, and B. Kulis. Kernel K-Means: Spectral Clustering and Normalized Cuts. KDD’04
D. Arthur and S. Vassilvitskii. K-means++: The Advantages of Careful Seeding. SODA’07
C. K. Reddy and B. Vinzamuri. A Survey of Partitional and Hierarchical Clustering Algorithms, in
(Chap. 4) Aggarwal and Reddy (eds.), Data Clustering: Algorithms and Applications. CRC Press, 2014
M. J. Zaki and W. Meira, Jr.. Data Mining and Analysis: Fundamental Concepts and Algorithms.
Cambridge Univ. Press, 2014