Part 22
Cluster Analysis
15.075/ESD.07J
Statistical Thinking and Data Analysis
Spring 2014
M.I.T.

What Is Cluster Analysis?
Tan, Steinbach, Kumar
Clustering divides data into groups so that intra-cluster distances are minimized and inter-cluster distances are maximized.
[Figure: the same set of points interpreted as two clusters, four clusters, or six clusters.]
Types of Clusterings
Partitional Clustering
A division of data objects into non-overlapping subsets (clusters)
such that each data object is in exactly one subset
Hierarchical Clustering
A set of nested clusters organized as a hierarchical tree
Hierarchical Clustering
Produces a set of nested clusters organized as a
hierarchical tree
Can be visualized as a dendrogram
[Figure: six points grouped into nested clusters, with the corresponding dendrogram; the vertical axis shows merge distance (0 to 0.2).]
Clustering Algorithms
Hierarchical techniques.
Optimization techniques.
Hierarchical Clustering
Agglomerative:
At each step, merge the closest pair of clusters until only one cluster
(or k clusters) is left
Divisive:
At each step, split a cluster until each cluster contains a single point (or
there are k clusters)
Hierarchical Techniques
Distance
The Data
Explanation of Variables
Distance matrix (cases 1-5):

      1    2    3    4    5
1   0.0  3.1  3.7  2.5  4.1
2   3.1  0.0  4.9  2.2  3.9
3   3.7  4.9  0.0  4.1  4.5
4   2.5  2.2  4.1  0.0  4.1
5   4.1  3.9  4.5  4.1  0.0
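The distance matrix above can be fed directly to an agglomerative algorithm. A minimal single-linkage sketch in pure Python follows; the cluster labels 1-5 are generic indices (the original case names are not recoverable from the slides):

```python
# Single-linkage agglomerative clustering on the 5x5 distance matrix
# above: repeatedly merge the pair of clusters whose closest members
# are nearest, until one cluster remains.
from itertools import combinations

# Upper triangle of the symmetric distance matrix, keyed (i, j), i < j.
D = {
    (1, 2): 3.1, (1, 3): 3.7, (1, 4): 2.5, (1, 5): 4.1,
    (2, 3): 4.9, (2, 4): 2.2, (2, 5): 3.9,
    (3, 4): 4.1, (3, 5): 4.5, (4, 5): 4.1,
}

def dist(a, b):
    """Order-independent lookup into the upper-triangle matrix."""
    return D[(min(a, b), max(a, b))]

def single_linkage(points):
    """Merge the closest pair of clusters at each step.
    Returns a list of (members of merged cluster, merge height)."""
    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        # Single linkage: distance between clusters = min over cross-pairs.
        (ca, cb), h = min(
            (((a, b), min(dist(p, q) for p in a for q in b))
             for a, b in combinations(clusters, 2)),
            key=lambda t: t[1])
        clusters = [c for c in clusters if c not in (ca, cb)] + [ca | cb]
        merges.append((sorted(ca | cb), h))
    return merges

for members, height in single_linkage([1, 2, 3, 4, 5]):
    print(members, "merged at distance", height)
```

Tracing it by hand: cases 2 and 4 merge first (distance 2.2), then case 1 joins them (2.5), then case 3 (3.7), then case 5 (3.9) — the merge heights are what the dendrogram displays.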
Figure C: Dendrogram: Complete Linkage for All 22 Utilities, Using All 8 Measurements (vertical axis: Distance)
[Figure: gene expression matrix — n genes (rows, labeled gene_1 ... gene_n) by m samples (columns), with n > m.]
Why cluster?
Cluster genes (rows)
Measure expression at multiple time-points,
different conditions, etc.
Similar expression patterns may suggest similar functions of genes
2006 C. Burge - Copyright 2006 Massachusetts Institute of Technology. All Rights Reserved.
Validating Clusters
Interpretability:
Summary statistics
Common features not used in the clustering
Can a meaningful label be assigned to each cluster?
Cluster stability (across partitions):
Cluster partition A of the data
Assign cases in partition B to the cluster in A with the closest centroid
Assess consistency against the clusters obtained from all of the data
A type of cross-validation
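The stability check described above can be sketched as follows. This is a minimal illustration, not the course's code: the two synthetic blobs, the small k-means helper, and the agreement score are all assumptions made for the example.

```python
# Cluster-stability sketch: cluster partition A, assign partition B's
# cases to the nearest A-centroid, and compare with the assignment
# implied by clustering all of the data.  Pure Python, synthetic data.
import random
import math

random.seed(0)

def kmeans(points, k, iters=20):
    """Plain k-means; returns the final centroids."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k),
                       key=lambda j: math.dist(p, centroids[j]))].append(p)
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)]
    return centroids

def assign(points, centroids):
    """Index of the closest centroid for each point."""
    return [min(range(len(centroids)),
                key=lambda j: math.dist(p, centroids[j])) for p in points]

# Two well-separated synthetic blobs (illustration data only).
blob = lambda cx, cy, n: [(cx + random.gauss(0, 0.3),
                           cy + random.gauss(0, 0.3)) for _ in range(n)]
data = blob(0, 0, 30) + blob(5, 5, 30)
random.shuffle(data)

A, B = data[::2], data[1::2]            # split the data into partitions
cent_A = kmeans(A, k=2)                 # cluster partition A
cent_all = kmeans(data, k=2)            # cluster all of the data
labels_B_from_A = assign(B, cent_A)     # assign B to A's clusters
labels_B_from_all = assign(B, cent_all)

# Consistency = agreement of the two assignments, up to label swap.
agree = sum(a == b for a, b in zip(labels_B_from_A, labels_B_from_all))
consistency = max(agree, len(B) - agree) / len(B)
print(f"consistency: {consistency:.2f}")
```

A consistency near 1 suggests the clusters are stable; a value near 0.5 (for k = 2) suggests they depend heavily on which cases happened to be clustered.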
Optimization Methods
A non-hierarchical approach to forming good
clusters.
Specify a desired number of clusters, say k, and
assign each case to one of the k clusters so as to minimize
a measure of dispersion within the clusters.
Common measure is the sum of squared Euclidean
distances from the mean of each cluster.
The optimization problem is difficult.
In practice, clusters are often computed using fast,
heuristic methods that generally produce good (but
not necessarily optimal) solutions. A very popular
(non-hierarchical) method is the k-Means algorithm.
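The dispersion criterion described above can be written out explicitly (a standard formulation, where C_j denotes cluster j and x̄_j its mean):

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \bar{x}_j \rVert^2,
\qquad \bar{x}_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

Minimizing over all possible assignments of cases to k clusters is combinatorially hard, which is why heuristics such as k-means are used in practice.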
K-Means
Greedy, local improvement heuristic for
minimizing within-cluster squared Euclidean
distances
Starting clusters required
Converges to a local minimum only; the global minimum is not guaranteed
Very fast, in practice requires few iterations
Many variations
Not scale invariant, often needs normalization
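The k-means iteration described above can be sketched in a few lines. This is a minimal one-dimensional illustration with synthetic data, not a production implementation:

```python
# Minimal k-means (Lloyd's algorithm): alternate between assigning
# each case to its nearest centroid and moving each centroid to the
# mean of its assigned cases.  1-D synthetic data for illustration.
import random

random.seed(1)

def kmeans(points, k, iters=25):
    centroids = random.sample(points, k)   # starting clusters required
    for _ in range(iters):
        # Assignment step: nearest centroid by (squared) distance.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k),
                       key=lambda j: abs(p - centroids[j]))].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [sum(g) / len(g) if g else centroids[j]
                     for j, g in enumerate(groups)]
    # Within-cluster sum of squared distances (the dispersion measure).
    wss = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return sorted(centroids), wss

# Two separated 1-D blobs around 0 and 10.
data = [random.gauss(0, 1) for _ in range(50)] + \
       [random.gauss(10, 1) for _ in range(50)]
centroids, wss = kmeans(data, k=2)
print("centroids:", [round(c, 2) for c in centroids],
      "within-cluster SS:", round(wss, 1))
```

Because the result depends on the random starting clusters, the algorithm is typically run several times and the solution with the smallest within-cluster sum of squares is kept.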
Multiple runs (different random starting clusters)
Helps, but probability is not on your side
Similarity Measures
Sometimes it is more natural or convenient to work with a
similarity measure between cases rather than a distance,
which measures dissimilarity.
Such similarity measures can always be converted to
dissimilarity measures.
An example of a similarity measure is the square of a
correlation coefficient (Pearson or Spearman).
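One such conversion can be sketched as follows. The choice d = 1 − r² is one common convention, assumed here for illustration; other conversions exist:

```python
# Converting a similarity measure to a dissimilarity measure.
# Similarity = squared Pearson correlation r^2 between two cases;
# dissimilarity = 1 - r^2 (one common convention among several).
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def dissimilarity(x, y):
    """d = 1 - r^2: near 0 for strongly (anti)correlated cases,
    near 1 for uncorrelated cases."""
    return 1 - pearson_r(x, y) ** 2

a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]      # perfectly correlated with a       -> d ~ 0
c = [5, 4, 3, 2, 1]       # perfectly anti-correlated with a  -> d ~ 0 as well
print(dissimilarity(a, b), dissimilarity(a, c))
```

Note that squaring the correlation makes strong negative correlation count as high similarity too; whether that is appropriate depends on the application (for gene expression, anti-correlated patterns may or may not be "similar").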