Chapter 8
CLUSTERING
• Clustering refers to unsupervised methods that seek a partition of the data into
distinct groups, so that the observations within each group are quite similar to
each other.
➢ K-Means Clustering
➢ Hierarchical Clustering
K-Means Clustering
Let 𝐶1 , . . . , 𝐶𝐾 denote sets containing the indices of the observations in each cluster.
These sets satisfy two properties:
1. 𝐶1 ∪ 𝐶2 ∪ · · · ∪ 𝐶𝐾 = {1, . . . , n}: each observation belongs to at least one of the K clusters.
2. 𝐶𝑘 ∩ 𝐶𝑘′ = ∅ for all k ≠ k′: the clusters are non-overlapping; no observation belongs to more than one cluster.
The algorithm (see the sketch below):
1. Randomly assign a number, from 1 to K, to each observation; these serve as initial cluster assignments.
2. Iterate until the cluster assignments stop changing:
(a) For each of the K clusters, compute the cluster centroid (the mean vector of the observations in the cluster).
(b) Assign each observation to the cluster whose centroid is closest (e.g. using Euclidean distance).
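A minimal NumPy sketch of these steps (the function name, iteration cap and random seed are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means following the steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of the K clusters.
    labels = rng.integers(0, K, size=X.shape[0])
    for _ in range(n_iter):
        # Step 2(a): compute each cluster centroid (the mean vector).
        # Note: this sketch assumes no cluster ever becomes empty.
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step 2(b): reassign each observation to the closest centroid
        # (squared Euclidean distance gives the same assignment).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # cluster assignments stopped changing
        labels = new_labels
    return labels, centroids
```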
K-Means Clustering
Properties of the Algorithm
• This algorithm is guaranteed to decrease the value of the objective (the total within-cluster variation) at each step; however, it finds a local rather than a global optimum, so the result depends on the initial random assignment.
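A common remedy is to run the algorithm from several random initializations and keep the solution with the lowest objective. A short sketch, assuming scikit-learn is available (the toy data are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(1).normal(size=(100, 2))  # toy data (illustrative)

# n_init runs the algorithm from several random initializations and keeps
# the solution with the lowest objective; inertia_ is the final value of
# the within-cluster sum of squares.
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_, km.labels_[:10])
```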
Hierarchical Clustering
1. Begin with n observations and a measure (such as Euclidean distance) of all the
n(n − 1)/2 pairwise dissimilarities. Treat each observation as its own cluster.
2. For i = n, n − 1, . . . , 2:
(a) Examine all pairwise inter-cluster dissimilarities among the i clusters.
Identify the pair of clusters that are least dissimilar and fuse them. The
dissimilarity between these two clusters indicates the height in the
dendrogram at which the fusion should be placed.
(b) Compute the new pairwise inter-cluster dissimilarities among the i − 1
remaining clusters (see the sketch below).
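This agglomerative algorithm is implemented in SciPy; a minimal sketch, assuming SciPy (and matplotlib, which dendrogram uses for drawing) are installed:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.random.default_rng(2).normal(size=(9, 2))  # e.g. nine observations

# linkage() runs the agglomerative algorithm above: it starts from n
# singleton clusters and repeatedly fuses the least dissimilar pair;
# each row of Z records one fusion and its height.
Z = linkage(X, method="complete", metric="euclidean")
dendrogram(Z)  # draws the dendrogram
```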
Hierarchical Clustering
Linkage: Dissimilarity Between Two Clusters
The concept of dissimilarity between a pair of observations can be extended to
a pair of groups of observations using the notion of linkage; the most common types are complete, average, single and centroid linkage.
• Centroid linkage: the dissimilarity between the centroid for cluster A and the centroid for cluster B.
Centroid linkage can result in undesirable inversions, where two clusters fuse at a height below either of their earlier fusions (i.e. similarity can appear to increase as the algorithm proceeds); see the sketch below.
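In SciPy the linkage choice is passed via the method argument; a sketch comparing the common options (toy data, illustrative variable names):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(2).normal(size=(9, 2))  # toy data (illustrative)

Z_complete = linkage(X, method="complete")  # largest pairwise dissimilarity
Z_average = linkage(X, method="average")    # mean pairwise dissimilarity
Z_single = linkage(X, method="single")      # smallest pairwise dissimilarity
Z_centroid = linkage(X, method="centroid")  # distance between centroids
# With centroid linkage the fusion heights Z_centroid[:, 2] need not be
# monotonically increasing -- an inversion in the dendrogram.
```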
Hierarchical Clustering
Interpreting a Dendrogram
• Observations 5 and 7 are quite similar to each other, as are observations 1 and 6.
• Observation 9, however, is no more similar to observation 2 than it is to
observations 8, 5 and 7: all of 2, 8, 5 and 7 fuse with observation 9 at the same
height, approximately 1.8.
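Fusion heights can also be used programmatically: cutting the dendrogram at a chosen height yields cluster labels. A sketch assuming SciPy; the threshold 1.8 simply echoes the height mentioned above and will not match the toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(2).normal(size=(9, 2))  # toy data (illustrative)
Z = linkage(X, method="complete", metric="euclidean")

# Cut the dendrogram at a given height: observations whose clusters only
# fuse above the threshold end up in different groups.
labels = fcluster(Z, t=1.8, criterion="distance")
print(labels)
```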
Hierarchical Clustering
Example 2: 9 points
Hierarchical Clustering
Choice of Dissimilarity
Left: the number of items purchased, without scaling. Center: the same data after scaling each variable. Right: the amount spent (USD), without scaling.
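A sketch of scaling before clustering, assuming scikit-learn's StandardScaler (the toy variables are stand-ins for a purchase count and a USD amount):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.preprocessing import StandardScaler

# Toy data: one count-like variable and one USD-like variable (illustrative).
rng = np.random.default_rng(3)
X = np.column_stack([rng.integers(0, 10, size=50),
                     rng.normal(1000.0, 300.0, size=50)])

# Scaling gives each variable mean zero and standard deviation one, so the
# USD amounts no longer dominate the Euclidean distances.
X_scaled = StandardScaler().fit_transform(X)
Z_scaled = linkage(X_scaled, method="complete", metric="euclidean")
```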
Hierarchical Clustering
Practical Issues
• Euclidean distance was used to determine the similarity among the tumours, and
complete linkage was used to iteratively build up the clusters. The data were normalized.
• The analysis was restricted to a subset of 293 ER+/HER2− tumours for which scorings
of the seven proteins of interest were available.
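As a hedged sketch, the described pipeline (normalize, Euclidean distance, complete linkage) could look like the following; the scores array is a random placeholder with the stated shape, not the actual tumour data, and the number of groups is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.preprocessing import StandardScaler

# Placeholder for the protein scorings: 293 tumours x 7 proteins. The shape
# mirrors the description above; the values are random stand-ins.
scores = np.random.default_rng(0).normal(size=(293, 7))

scores = StandardScaler().fit_transform(scores)             # normalize
Z = linkage(scores, method="complete", metric="euclidean")  # complete linkage
groups = fcluster(Z, t=3, criterion="maxclust")             # e.g. 3 groups
```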