Chapter 8
CLUSTERING
• Clustering refers to unsupervised methods that seek a partition of the data into
distinct groups, so that the observations within each group are quite similar to
each other.
➢ K-Means Clustering
➢ Hierarchical Clustering
K-Means Clustering
Let 𝐶1 , . . . , 𝐶𝐾 denote sets containing the indices of the observations in each cluster.
These sets satisfy two properties:
1. 𝐶1 ∪ 𝐶2 ∪ · · · ∪ 𝐶𝐾 = {1, . . . , n}: each observation belongs to at least one of the K clusters.
2. 𝐶𝑘 ∩ 𝐶𝑘′ = ∅ for all k ≠ k′: the clusters are non-overlapping; no observation belongs to more than one cluster.
The algorithm (see the sketch below):
1. Randomly assign a number, from 1 to K, to each observation; these serve as initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid (the mean vector of the observations in the cluster).
(b) Assign each observation to the cluster whose centroid is closest (e.g. using Euclidean distance).
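A minimal NumPy sketch of these steps (the function name, iteration cap and random seed are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means following the steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of the K clusters.
    labels = rng.integers(0, K, size=X.shape[0])
    for _ in range(n_iter):
        # Step 2(a): compute each cluster centroid (the mean vector).
        # Note: this sketch assumes no cluster ever becomes empty.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2(b): reassign each observation to the closest centroid
        # (squared Euclidean distance gives the same assignment).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # cluster assignments stopped changing
        labels = new_labels
    return labels, centroids
```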
K-Means Clustering
Properties of the Algorithm
• This algorithm is guaranteed to decrease the value of the objective (the total within-cluster variation) at each step; however, it finds a local rather than a global optimum, so the result depends on the initial random assignment.
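A common remedy is to run the algorithm from several random initializations and keep the solution with the lowest objective. A short sketch, assuming scikit-learn is available (the toy data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(100, 2))  # toy data (illustrative)

# n_init runs the algorithm from several random initializations and keeps
# the solution with the lowest objective; inertia_ is the final value of
# the within-cluster sum of squares.
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_, km.labels_[:10])
```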
Hierarchical Clustering
1. Begin with n observations and a measure (such as Euclidean distance) of all the
n(n − 1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2. For i = n, n − 1, . . . , 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters.
Identify the pair of clusters that are least dissimilar and fuse them. The
dissimilarity between these two clusters indicates the height in the
dendrogram at which the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the i − 1
remaining clusters (see the sketch below).
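This agglomerative algorithm is implemented in SciPy; a minimal sketch, assuming SciPy (and matplotlib, which dendrogram uses for drawing) are installed:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(2).normal(size=(9, 2))  # e.g. nine observations

# linkage() runs the agglomerative algorithm above: it starts from n
# singleton clusters and repeatedly fuses the least dissimilar pair;
# each row of Z records one fusion and its height.
Z = linkage(X, method="complete", metric="euclidean")
dendrogram(Z)  # draws the dendrogram
```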
Hierarchical Clustering
Linkage: Dissimilarity Between Two Clusters
The concept of dissimilarity between a pair of observations can be extended to
a pair of groups of observations using the notion of linkage; the most common types are complete, average, single and centroid linkage.
• Centroid linkage: the dissimilarity between the centroid for cluster A and the centroid for cluster B.
Centroid linkage can result in undesirable inversions, where two clusters fuse at a height below either of their earlier fusions (i.e. similarity can appear to increase as the algorithm proceeds); see the sketch below.
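In SciPy the linkage choice is passed via the method argument; a sketch comparing the common options (toy data, illustrative variable names):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(2).normal(size=(9, 2))  # toy data (illustrative)

Z_complete = linkage(X, method="complete")  # largest pairwise dissimilarity
Z_average = linkage(X, method="average")    # mean pairwise dissimilarity
Z_single = linkage(X, method="single")      # smallest pairwise dissimilarity
Z_centroid = linkage(X, method="centroid")  # distance between centroids
# With centroid linkage the fusion heights Z_centroid[:, 2] need not be
# monotonically increasing -- an inversion in the dendrogram.
```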
Hierarchical Clustering
Interpreting a Dendrogram
• Observations 5 and 7 are quite similar to each other, as are observations 1 and 6.
• Observation 9, however, is no more similar to observation 2 than it is to
observations 8, 5 and 7: all of 2, 8, 5 and 7 fuse with observation 9 at the same
height, approximately 1.8.
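Fusion heights can also be used programmatically: cutting the dendrogram at a chosen height yields cluster labels. A sketch assuming SciPy; the threshold 1.8 simply echoes the height mentioned above and will not match the toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(2).normal(size=(9, 2))  # toy data (illustrative)
Z = linkage(X, method="complete", metric="euclidean")

# Cut the dendrogram at a given height: observations whose clusters only
# fuse above the threshold end up in different groups.
labels = fcluster(Z, t=1.8, criterion="distance")
print(labels)
```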
Hierarchical Clustering
Example 2: 9 points
Hierarchical Clustering
Choice of Dissimilarity
Left: the number of items purchased, without scaling. Center: the same data after scaling each variable. Right: the amount spent (USD), without scaling.
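A sketch of scaling before clustering, assuming scikit-learn's StandardScaler (the toy variables are stand-ins for a purchase count and a USD amount):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.preprocessing import StandardScaler

# Toy data: one count-like variable and one USD-like variable (illustrative).
rng = np.random.default_rng(3)
X = np.column_stack([rng.integers(0, 10, size=50),
                     rng.normal(1000.0, 300.0, size=50)])

# Scaling gives each variable mean zero and standard deviation one, so the
# USD amounts no longer dominate the Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)
Z_scaled = linkage(X_scaled, method="complete", metric="euclidean")
```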
Hierarchical Clustering
Practical Issues
• Euclidean distance was used to determine the similarity among the tumours, and
complete linkage was used to iteratively build up the clusters. The data were normalized.
• The analysis was restricted to a subset of 293 ER+/HER2− tumours for which scorings
of the seven proteins of interest were available.
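As a hedged sketch, the described pipeline (normalize, Euclidean distance, complete linkage) could look like the following; the scores array is a random placeholder with the stated shape, not the actual tumour data, and the number of groups is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import StandardScaler

# Placeholder for the protein scorings: 293 tumours x 7 proteins. The shape
# mirrors the description above; the values are random stand-ins.
scores = np.random.default_rng(0).normal(size=(293, 7))

scores = StandardScaler().fit_transform(scores)             # normalize
Z = linkage(scores, method="complete", metric="euclidean")  # complete linkage
groups = fcluster(Z, t=3, criterion="maxclust")             # e.g. 3 groups
```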