Cluster Analysis in Python Chapter2 PDF

This document provides an overview of hierarchical clustering techniques in Python. It discusses different linkage methods for calculating distances between clusters like single, complete, average, centroid and ward. It also covers creating cluster labels using fcluster and visualizing clusters using matplotlib and seaborn. Dendrograms are introduced as a way to determine the number of clusters by showing how clusters are merged. Limitations like the quadratic increase in runtime with data points, making it unsuitable for large datasets, are also covered.

Uploaded by

Fgpeqw

Basics of hierarchical clustering
CLUSTER ANALYSIS IN PYTHON

Shaumik Daityari
Business Analyst
Creating a distance matrix using linkage
scipy.cluster.hierarchy.linkage(observations,
                                method='single',
                                metric='euclidean',
                                optimal_ordering=False
)

method : how to calculate the proximity of clusters

metric : distance metric

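optimal_ordering : whether to reorder the result so that adjacent leaves are as close as possible (off by default because it can be slow on large datasets)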
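As a minimal sketch of the call described above, the snippet below runs `linkage` on five 2-D points. The `x`/`y` values are reused from the toy DataFrame in the visualization slides, and the variable name `observations` simply mirrors the signature. Note that SciPy's own documentation calls the returned array a linkage matrix: it has one row per merge rather than one entry per pair of points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy data: five 2-D observations (same x/y values as the plotting slides).
observations = np.array([[2.0, 1.0],
                         [3.0, 1.0],
                         [5.0, 5.0],
                         [6.0, 5.0],
                         [2.0, 2.0]])

Z = linkage(observations,
            method='single',
            metric='euclidean',
            optimal_ordering=False)

# n observations need n - 1 merges, so Z has n - 1 rows.
# Columns: first merged cluster, second merged cluster,
# distance between them, number of points in the new cluster.
print(Z.shape)
print(Z)
```

The last row describes the final merge, so its fourth column equals the total number of observations.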

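Which method should you use?
single: based on the two closest objects

complete: based on the two farthest objects

average: based on the arithmetic mean of the distances between all pairs of objects

centroid: based on the distance between the centroids (geometric centers) of the clusters

median: like centroid, but the center of a merged cluster is the average of the two merged clusters' centers

ward: based on the increase in the within-cluster sum of squares when clusters are merged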
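To make the difference between these proximity rules concrete, here is a small sketch that computes the single, complete, average and centroid distances between two hand-picked clusters. The cluster contents and variable names are illustrative assumptions, not part of the slides, and the calculation is done by hand with `cdist` rather than through `linkage`.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points.
cluster_a = np.array([[2.0, 1.0], [3.0, 1.0], [2.0, 2.0]])
cluster_b = np.array([[5.0, 5.0], [6.0, 5.0]])

# Euclidean distance between every point in A and every point in B.
pairwise = cdist(cluster_a, cluster_b, metric='euclidean')

single_dist = pairwise.min()     # two closest objects
complete_dist = pairwise.max()   # two farthest objects
average_dist = pairwise.mean()   # mean of all pairwise distances
centroid_dist = np.linalg.norm(
    cluster_a.mean(axis=0) - cluster_b.mean(axis=0)
)                                # distance between cluster centers

print(f"single:   {single_dist:.3f}")
print(f"complete: {complete_dist:.3f}")
print(f"average:  {average_dist:.3f}")
print(f"centroid: {centroid_dist:.3f}")
```

The single distance is never larger than the average, and the average is never larger than the complete distance, which is one reason the methods can produce different merge orders on the same data.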

Create cluster labels with fcluster
scipy.cluster.hierarchy.fcluster(distance_matrix,
                                 num_clusters,
                                 criterion
)

distance_matrix : output of the linkage() function

num_clusters : number of clusters (treated as the maximum when criterion='maxclust')

criterion : how to decide thresholds to form clusters, for example 'maxclust'
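The two steps can be chained as in the sketch below, which assumes the `'maxclust'` criterion and the same five toy points used elsewhere in the slides. The names `distance_matrix` and `labels` are chosen to match the slide's wording; in SciPy's signature the arguments are called `Z` and `t`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: five 2-D observations.
observations = np.array([[2.0, 1.0],
                         [3.0, 1.0],
                         [5.0, 5.0],
                         [6.0, 5.0],
                         [2.0, 2.0]])

# Step 1: build the hierarchy.
distance_matrix = linkage(observations, method='ward', metric='euclidean')

# Step 2: cut the hierarchy into at most 2 flat clusters.
labels = fcluster(distance_matrix, 2, criterion='maxclust')

# One integer label per observation, in the original row order.
print(labels)
```

Because the labels come back in row order, they can be assigned directly to a new DataFrame column and used for coloring plots.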

Hierarchical clustering with ward method

Hierarchical clustering with single method

Hierarchical clustering with complete method

Final thoughts on selecting a method
No one right method for all

Need to carefully understand the distribution of data

Visualize clusters
Why visualize clusters?
Try to make sense of the clusters formed

An additional step in validation of clusters

Spot trends in data

An introduction to seaborn
seaborn : a Python data visualization library based on matplotlib

Has better, easily modifiable aesthetics than matplotlib!

Contains functions that make data visualization tasks easy in the context of data analytics

Use case for clustering: hue parameter for plots

Visualize clusters with matplotlib
import pandas as pd
from matplotlib import pyplot as plt

# Toy data: coordinates plus a cluster label for each point
df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})

# Map each cluster label to a color
colors = {'A': 'red', 'B': 'blue'}

df.plot.scatter(x='x',
                y='y',
                c=df['labels'].apply(lambda x: colors[x]))
plt.show()

Visualize clusters with seaborn
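import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

# Toy data: coordinates plus a cluster label for each point
df = pd.DataFrame({'x': [2, 3, 5, 6, 2],
                   'y': [1, 1, 5, 5, 2],
                   'labels': ['A', 'A', 'B', 'B', 'A']})

# hue colors each point by its cluster label; no manual color map is needed
sns.scatterplot(x='x',
                y='y',
                hue='labels',
                data=df)
plt.show()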

Comparison of both methods of visualization
[Figure: two panels titled "Matplotlib plot" and "Seaborn plot"]

How many clusters?
Introduction to dendrograms
Strategy till now - decide clusters on visual inspection

Dendrograms help in showing progressions as clusters are merged

A dendrogram is a branching diagram that demonstrates how each cluster is composed by branching out into its child nodes

Create a dendrogram in SciPy
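from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# df is assumed to already hold the scaled columns 'x_whiten' and 'y_whiten'
Z = linkage(df[['x_whiten', 'y_whiten']],
            method='ward',
            metric='euclidean')

# Draw the merge tree; the height of each join is the merge distance
dn = dendrogram(Z)
plt.show()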
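A dendrogram is usually read by eye, looking for the tallest vertical stretch that no merge crosses. The sketch below applies the same idea numerically to the merge distances in the third column of the linkage matrix. This largest-gap rule is a common heuristic and an assumption here, not something the slides prescribe, and the three synthetic blobs are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = ([0.0, 0.0], [5.0, 5.0], [0.0, 5.0])
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
                    for c in centers])

Z = linkage(points, method='ward', metric='euclidean')

# Column 2 holds the distance at which each merge happened,
# i.e. the heights drawn in the dendrogram.
merge_heights = Z[:, 2]

# Gap between each merge and the next one.
gaps = np.diff(merge_heights)

# Cutting inside the largest gap keeps every merge up to that point.
# After (j + 1) merges, n - (j + 1) clusters remain.
largest_gap_index = int(np.argmax(gaps))
suggested_clusters = len(points) - (largest_gap_index + 1)

print(suggested_clusters)
```

On real data the gaps are rarely this clear, so the suggestion is best treated as a starting point to check against the plotted dendrogram.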

Limitations of hierarchical clustering
Measuring speed in hierarchical clustering
Use the timeit module

Measure the speed of the linkage() function

Use randomly generated points

Run various iterations to extrapolate

Use of timeit module
from scipy.cluster.hierarchy import linkage
import pandas as pd
import random, timeit

# Generate 100 random points
points = 100
df = pd.DataFrame({'x': random.sample(range(0, points), points),
                   'y': random.sample(range(0, points), points)})

# %timeit is an IPython/Jupyter magic command, not plain Python
%timeit linkage(df[['x', 'y']], method = 'ward', metric = 'euclidean')

1.02 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Comparison of runtime of linkage method
Increasing runtime with data points

Quadratic increase of runtime

Not feasible for large datasets
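The growth in runtime can be checked outside a notebook with the standard `timeit` module instead of the `%timeit` magic. The sketch below times `linkage` on a few dataset sizes; the sizes, the repeat counts and the use of NumPy arrays instead of a DataFrame are arbitrary choices made for illustration, and the measured values will differ from machine to machine.

```python
import timeit

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
timings = {}

for n_points in (100, 200, 400, 800):
    # Random 2-D points in the unit square.
    data = rng.random((n_points, 2))

    # Run 5 calls per measurement, repeat 3 times, keep the best run.
    runs = timeit.repeat(
        lambda: linkage(data, method='ward', metric='euclidean'),
        number=5,
        repeat=3,
    )
    timings[n_points] = min(runs) / 5

    print(f"{n_points:>4} points: {timings[n_points] * 1e3:.2f} ms per call")
```

If runtime grows roughly quadratically, each doubling of the number of points should take about four times as long, although small inputs are often dominated by fixed overhead.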


