Hierarchical Clustering: Class Program University Semester Lecturer Sources
Hierarchical Clustering: Class Program University Semester Lecturer Sources
Hierarchical Clustering: Class Program University Semester Lecturer Sources
Sources:
● Mohammed J. Zaki, Wagner Meira, Jr., Data Mining and Analysis:
http://www.cs.bu.edu/~evimaria/cs565-13.html
1
http://www.talkorigins.org/faqs/comdesc/phylo.html 2
http://www.chegg.com/homework-help/questions-and-answers/part-phylogenetic-tree-shown-figure-b-sines-10-12-13-first-insert-genomes-artiodactyls-phy-q4932026
3
Hierarchical Clustering
• Produces a set of nested clusters organized as
a hierarchical tree
• Can be visualized as a dendrogram
– A tree-like diagram that records the sequences of
merges or splits
Strengths of Hierarchical
Clustering
• No assumptions on the number of clusters
– Any desired number of clusters can be obtained by
‘cutting’ the dendogram at the proper level
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or
there are k clusters)
p2
p3
p4
p5
.
.
.
Distance/Proximity Matrix
Intermediate State
• After some merging steps, we have some clusters
C1 C2 C3 C4 C5
C1
C2
C3
C3
C4
C4
C5
C1
Distance/Proximity Matrix
C2 C5
Intermediate State
• Merge the two closest clusters (C2 and C5) and update the
distance matrix.
C1 C2 C3 C4 C5
C1
C3 C2
C3
C4
C4
C5
C1
Distance/Proximity Matrix
C2 C5
After Merging
• “How do we update the distance matrix?”
C2 U
C5
C1 C3 C4
C3 C1 ?
C4 C2 U C5 ? ? ? ?
C3 ?
C1 C4 ?
C2 U C5
Distance between two
clusters
• Each cluster is a set of points
1 2 3 4 5
Single-link clustering: example
5
1
3
5
2 1
2 3 6
4
4
5 11 13 16 25 36 38 39 42 60 62 64 67
Exercise:
Create a hierarchical agglomerative clustering for this data.
To make this deterministic, if there are ties, pick the left-most link.
http://chato.cl/2015/data-analysis/exercise-answers/hierarchical-clustering_exercise_01_answer.txt 17
Strengths of single-link clustering
1 2 3 4 5
Complete-link clustering: example
4 1
2 5
5
2
3 6
3
1
4
1 2 3 4 5
Average-link clustering: example
5 4 1
2
5
2
3 6
1
4
3
• Strengths
– Less susceptible to noise and outliers
• Limitations
– Biased towards globular clusters
Distance between two
clusters
• Centroid distance between clusters Ci and Cj is
the distance between the centroid ri of Ci and the
centroid rj of Cj
Distance between two
clusters
• Ward’s distance between clusters Ci and Cj is the
difference between the total within cluster sum of
squares for the two clusters separately, and the
within cluster sum of squares resulting from
merging the two clusters in cluster Cij
• ri: centroid of Ci
• rj: centroid of Cj
• rij: centroid of Cij
Ward’s distance for
clusters
• Similar to group average and centroid distance
5
1 5 4 1
2 2
5 Ward’s Method 5
2 2
3 6 Group Average 3 6
3
4 1 1
4 4
3
Hierarchical Clustering: Time and Space
requirements