Part 22
Cluster Analysis
15.075/ESD.07J
Statistical Thinking and Data Analysis
Spring 2014
M.I.T.

What Is Cluster Analysis?
Tan, Steinbach, Kumar
Clustering divides data into groups so that intra-cluster distances are minimized and inter-cluster distances are maximized.
[Figure: the same set of points interpreted as two clusters, four clusters, or six clusters.]
Types of Clusterings
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree
Hierarchical Clustering
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
[Figure: six points grouped into nested clusters, with the corresponding dendrogram; the vertical axis shows merge distance (0 to 0.2).]
Clustering Algorithms
Hierarchical techniques.
Optimization techniques.
Hierarchical Clustering
Agglomerative:
At each step, merge the closest pair of clusters until only one cluster
(or k clusters) is left
Divisive:
At each step, split a cluster until each cluster contains a single point (or
there are k clusters)
Hierarchical Techniques
Distance
The Data
Explanation of Variables
Distance matrix (cases 1-5):

      1    2    3    4    5
1   0.0  3.1  3.7  2.5  4.1
2   3.1  0.0  4.9  2.2  3.9
3   3.7  4.9  0.0  4.1  4.5
4   2.5  2.2  4.1  0.0  4.1
5   4.1  3.9  4.5  4.1  0.0
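The distance matrix above can be fed directly to an agglomerative algorithm. A minimal single-linkage sketch in pure Python follows; the cluster labels 1-5 are generic indices (the original case names are not recoverable from the slides):

```python
# Single-linkage agglomerative clustering on the 5x5 distance matrix
# above: repeatedly merge the pair of clusters whose closest members
# are nearest, until one cluster remains.
from itertools import combinations

# Upper triangle of the symmetric distance matrix, keyed (i, j), i < j.
D = {
    (1, 2): 3.1, (1, 3): 3.7, (1, 4): 2.5, (1, 5): 4.1,
    (2, 3): 4.9, (2, 4): 2.2, (2, 5): 3.9,
    (3, 4): 4.1, (3, 5): 4.5, (4, 5): 4.1,
}

def dist(a, b):
    """Order-independent lookup into the upper-triangle matrix."""
    return D[(min(a, b), max(a, b))]

def single_linkage(points):
    """Merge the closest pair of clusters at each step.
    Returns a list of (members of merged cluster, merge height)."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        # Single linkage: distance between clusters = min over cross-pairs.
        (ca, cb), h = min(
            (((a, b), min(dist(p, q) for p in a for q in b))
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[1])
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca | cb]
        merges.append((sorted(ca | cb), h))
    return merges

for members, height in single_linkage([1, 2, 3, 4, 5]):
    print(members, "merged at distance", height)
```

Tracing it by hand: cases 2 and 4 merge first (distance 2.2), then case 1 joins them (2.5), then case 3 (3.7), then case 5 (3.9) — the merge heights are what the dendrogram displays.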
Figure C: Dendrogram: Complete Linkage for All 22 Utilities, Using All 8 Measurements (vertical axis: Distance)
[Figure: gene expression matrix — n genes (rows, labeled gene_1 ... gene_n) by m samples (columns), with n > m.]
Why cluster?
Cluster genes (rows)
Measure expression at multiple time-points,
different conditions, etc.
Similar expression patterns may suggest similar functions of genes
2006 C. Burge - Copyright 2006 Massachusetts Institute of Technology. All Rights Reserved.
Validating Clusters
Interpretability:
Summary statistics
Common features not used in the clustering
Can a meaningful label be assigned to each cluster?
Cluster stability (across partitions):
Cluster partition A of the data
Assign cases in partition B to the cluster in A with the closest centroid
Assess consistency against the clusters obtained from all of the data
A type of cross-validation
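The stability check described above can be sketched as follows. This is a minimal illustration, not the course's code: the two synthetic blobs, the small k-means helper, and the agreement score are all assumptions made for the example.

```python
# Cluster-stability sketch: cluster partition A, assign partition B's
# cases to the nearest A-centroid, and compare with the assignment
# implied by clustering all of the data.  Pure Python, synthetic data.
import random
import math

random.seed(0)

def kmeans(points, k, iters=20):
    """Plain k-means; returns the final centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k),
                       key=lambda j: math.dist(p, centroids[j]))].append(p)
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)]
    return centroids

def assign(points, centroids):
    """Index of the closest centroid for each point."""
    return [min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j])) for p in points]

# Two well-separated synthetic blobs (illustration data only).
blob = lambda cx, cy, n: [(cx + random.gauss(0, 0.3),
                           cy + random.gauss(0, 0.3)) for _ in range(n)]
data = blob(0, 0, 30) + blob(5, 5, 30)
random.shuffle(data)

A, B = data[::2], data[1::2]            # split the data into partitions
cent_A = kmeans(A, k=2)                 # cluster partition A
cent_all = kmeans(data, k=2)            # cluster all of the data
labels_B_from_A = assign(B, cent_A)     # assign B to A's clusters
labels_B_from_all = assign(B, cent_all)

# Consistency = agreement of the two assignments, up to label swap.
agree = sum(a == b for a, b in zip(labels_B_from_A, labels_B_from_all))
consistency = max(agree, len(B) - agree) / len(B)
print(f"consistency: {consistency:.2f}")
```

A consistency near 1 suggests the clusters are stable; a value near 0.5 (for k = 2) suggests they depend heavily on which cases happened to be clustered.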
Optimization Methods
A non-hierarchical approach to forming good
clusters.
Specify a desired number of clusters, say k, and
assign each case to one of the k clusters so as to minimize
a measure of dispersion within the clusters.
Common measure is the sum of squared Euclidean
distances from the mean of each cluster.
The optimization problem is difficult.
In practice, clusters are often computed using fast,
heuristic methods that generally produce good (but
not necessarily optimal) solutions. A very popular
(non-hierarchical) method is the k-Means algorithm.
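The dispersion criterion described above can be written out explicitly (a standard formulation, where C_j denotes cluster j and x̄_j its mean):

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}_j \rVert^2,
\qquad \bar{x}_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Minimizing over all possible assignments of cases to k clusters is combinatorially hard, which is why heuristics such as k-means are used in practice.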
K-Means
Greedy, local improvement heuristic for
minimizing within-cluster squared Euclidean
distances
Starting clusters required
Converges to a local minimum only; the global minimum is not guaranteed
Very fast, in practice requires few iterations
Many variations
Not scale invariant, often needs normalization
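The k-means iteration described above can be sketched in a few lines. This is a minimal one-dimensional illustration with synthetic data, not a production implementation:

```python
# Minimal k-means (Lloyd's algorithm): alternate between assigning
# each case to its nearest centroid and moving each centroid to the
# mean of its assigned cases.  1-D synthetic data for illustration.
import random

random.seed(1)

def kmeans(points, k, iters=25):
    centroids = random.sample(points, k)   # starting clusters required
    for _ in range(iters):
        # Assignment step: nearest centroid by (squared) distance.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k),
                       key=lambda j: abs(p - centroids[j]))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(g) / len(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    # Within-cluster sum of squared distances (the dispersion measure).
    wss = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return sorted(centroids), wss

# Two separated 1-D blobs around 0 and 10.
data = [random.gauss(0, 1) for _ in range(50)] + \
       [random.gauss(10, 1) for _ in range(50)]
centroids, wss = kmeans(data, k=2)
print("centroids:", [round(c, 2) for c in centroids],
      "within-cluster SS:", round(wss, 1))
```

Because the result depends on the random starting clusters, the algorithm is typically run several times and the solution with the smallest within-cluster sum of squares is kept.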
Multiple runs (different random starting clusters)
Helps, but probability is not on your side
Similarity Measures
Sometimes it is more natural or convenient to work with a
similarity measure between cases rather than a distance,
which measures dissimilarity.
Such similarity measures can always be converted to
dissimilarity measures.
An example of a similarity measure is the square of a
correlation coefficient (Pearson or Spearman).
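One such conversion can be sketched as follows. The choice d = 1 − r² is one common convention, assumed here for illustration; other conversions exist:

```python
# Converting a similarity measure to a dissimilarity measure.
# Similarity = squared Pearson correlation r^2 between two cases;
# dissimilarity = 1 - r^2 (one common convention among several).
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def dissimilarity(x, y):
    """d = 1 - r^2: near 0 for strongly (anti)correlated cases,
    near 1 for uncorrelated cases."""
    return 1 - pearson_r(x, y) ** 2

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]      # perfectly correlated with a       -> d ~ 0
c = [5, 4, 3, 2, 1]       # perfectly anti-correlated with a  -> d ~ 0 as well
print(dissimilarity(a, b), dissimilarity(a, c))
```

Note that squaring the correlation makes strong negative correlation count as high similarity too; whether that is appropriate depends on the application (for gene expression, anti-correlated patterns may or may not be "similar").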