Basics of hierarchical clustering
CLUSTER ANALYSIS IN PYTHON
Shaumik Daityari
Business Analyst
Creating a distance matrix using linkage
scipy.cluster.hierarchy.linkage(observations,
                                method='single',
                                metric='euclidean',
                                optimal_ordering=False
)
method : how to calculate the proximity of two clusters
metric : the distance metric between observations
optimal_ordering : whether to reorder the linkage matrix so that distances between successive leaves are minimal
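As a minimal sketch of the call above, the snippet below runs linkage() on a handful of made-up 2-D points (the coordinates are illustrative assumptions, not data from the course) and inspects the result. Each row of the returned array records one merge: the two cluster indices joined, the distance between them, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Illustrative observations (assumed values): one row per point, one column per feature
observations = np.array([[2.0, 1.0], [3.0, 1.0], [5.0, 5.0], [6.0, 5.0], [2.0, 2.0]])

Z = linkage(observations, method='single', metric='euclidean', optimal_ordering=False)

# n observations always produce n - 1 merges, each described by 4 values
print(Z.shape)
print(Z)
```

The last row describes the final merge, so its fourth value equals the total number of observations.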
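Which method should you use?
single: based on the two closest objects
complete: based on the two farthest objects
average: based on the arithmetic mean of the distances between all pairs of objects
centroid: based on the distance between the clusters' centroids
median: like centroid, but a merged cluster's centroid is the midpoint of the two centroids it was formed from
ward: based on the sum of squares (merges the pair that least increases within-cluster variance)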
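To make the difference between methods concrete, here is a small sketch that builds the hierarchy for the same points under three of the methods listed above and records the height of the final merge. The four points on a line are an illustrative assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four illustrative points on a line (assumed values) at positions 0, 1, 3 and 7
observations = np.array([[0.0], [1.0], [3.0], [7.0]])

# Height (distance) of the final merge under each method
final_heights = {}
for method in ['single', 'complete', 'average']:
    Z = linkage(observations, method=method, metric='euclidean')
    final_heights[method] = Z[-1, 2]

print(final_heights)
```

For these points the last cluster to join is the point at 7. It joins at 4 under single (its closest neighbor, at 3), at 7 under complete (its farthest neighbor, at 0), and at about 5.67 under average (the mean of 7, 6 and 4).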
Create cluster labels with fcluster
scipy.cluster.hierarchy.fcluster(distance_matrix,
                                 num_clusters,
                                 criterion
)
distance_matrix : output of the linkage() function
num_clusters : number of clusters (when criterion is 'maxclust')
criterion : how to decide thresholds to form clusters
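The two calls can be chained as in the sketch below, which assumes a small made-up DataFrame and the 'maxclust' criterion. The column name cluster_labels is an illustrative choice.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative data (assumed values): two visibly separate groups of points
df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2]})

distance_matrix = linkage(df[['x', 'y']], method='ward', metric='euclidean')

# With criterion='maxclust', the second argument is the maximum number of clusters
df['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

print(df)
```

fcluster() returns one integer label per observation, numbered from 1, so the result can be stored directly as a new column.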
Hierarchical clustering with ward method
Hierarchical clustering with single method
Hierarchical clustering with complete method
Final thoughts on selecting a method
No single method is right for every dataset
You need to understand the distribution of your data carefully
Let's try some exercises
Visualize clusters
Why visualize clusters?
Try to make sense of the clusters formed
An additional step in validation of clusters
Spot trends in data
An introduction to seaborn
seaborn : a Python data visualization library based on matplotlib
Has better, easily modifiable aesthetics than matplotlib!
Contains functions that make data visualization tasks easy in the context of data analytics
Use case for clustering: hue parameter for plots
Visualize clusters with matplotlib
import pandas as pd
from matplotlib import pyplot as plt

df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})
# Map each cluster label to a plotting color
colors = {'A': 'red', 'B': 'blue'}
df.plot.scatter(x='x',
                y='y',
                c=df['labels'].map(colors))
plt.show()
Visualize clusters with seaborn
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})
# hue colors each point by its cluster label and adds a legend
sns.scatterplot(x='x',
                y='y',
                hue='labels',
                data=df)
plt.show()
Comparison of both methods of visualization
[Figure: matplotlib plot and seaborn plot, side by side]
Next up: try some visualizations
How many clusters?
Introduction to dendrograms
Strategy so far: decide the number of clusters by visual inspection
Dendrograms show the progression as clusters are merged
A dendrogram is a branching diagram that shows how each cluster is composed by branching out into its child nodes
Create a dendrogram in SciPy
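from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# df is assumed to already hold the scaled columns 'x_whiten' and 'y_whiten'
Z = linkage(df[['x_whiten', 'y_whiten']],
            method='ward',
            metric='euclidean')
dn = dendrogram(Z)
plt.show()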
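For readers who want to run the dendrogram step end to end, the following sketch builds the whitened columns first and then inspects the dendrogram data without drawing it. The data values are illustrative assumptions, and no_plot=True is used only so the example runs without a display.

```python
import pandas as pd
from scipy.cluster.vq import whiten
from scipy.cluster.hierarchy import linkage, dendrogram

# Illustrative data (assumed values)
df = pd.DataFrame({'x': [2, 3, 5, 6, 2, 9, 10],
                   'y': [1, 1, 5, 5, 2, 9, 8]})

# whiten() rescales each feature to unit standard deviation
scaled = whiten(df[['x', 'y']].to_numpy(dtype=float))
df['x_whiten'] = scaled[:, 0]
df['y_whiten'] = scaled[:, 1]

Z = linkage(df[['x_whiten', 'y_whiten']], method='ward', metric='euclidean')

# no_plot=True returns the dendrogram structure as a dict instead of drawing it
dn = dendrogram(Z, no_plot=True)

print(dn['ivl'])   # leaf labels in left-to-right order
print(Z[:, 2])     # merge heights, i.e. the vertical positions of the joins
```

Large jumps between consecutive merge heights correspond to the long vertical branches in the plotted dendrogram, which is where a visual inspection would typically suggest cutting.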
Next up: try some exercises
Limitations of hierarchical clustering
Measuring speed in hierarchical clustering
Use the timeit module
Measure the speed of the linkage() function
Use randomly generated points
Run it over several dataset sizes to extrapolate
Use of timeit module
from scipy.cluster.hierarchy import linkage
import pandas as pd
import random, timeit

points = 100
df = pd.DataFrame({'x': random.sample(range(0, points), points),
                   'y': random.sample(range(0, points), points)})
# %timeit is an IPython/Jupyter magic command, not plain Python
%timeit linkage(df[['x', 'y']], method='ward', metric='euclidean')

1.02 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Comparison of runtime of linkage method
Increasing runtime with data points
Quadratic increase of runtime
Not feasible for large datasets
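Outside a notebook, the same comparison can be sketched with the standard-library timeit module. The helper time_linkage() and the list of sizes are hypothetical choices for illustration, and the timings will vary by machine.

```python
import random
import timeit

import pandas as pd
from scipy.cluster.hierarchy import linkage


def time_linkage(points, repeats=3):
    """Best-of-`repeats` time in seconds for one linkage() call on random 2-D points."""
    df = pd.DataFrame({'x': random.sample(range(0, points), points),
                       'y': random.sample(range(0, points), points)})
    timer = timeit.Timer(
        lambda: linkage(df[['x', 'y']], method='ward', metric='euclidean'))
    # Taking the minimum reduces noise from other processes
    return min(timer.repeat(repeat=repeats, number=1))


# Sizes are an arbitrary choice for illustration
runtimes = {points: time_linkage(points) for points in [100, 200, 400, 800]}
for points, seconds in runtimes.items():
    print(f"{points:>5} points: {seconds * 1000:.2f} ms")
```

Doubling the number of points repeatedly, as here, makes it easy to see how quickly the runtime grows as the dataset gets larger.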
Next up - exercises