Cluster Analysis in Python Chapter1 PDF

This document discusses unsupervised learning and cluster analysis in Python. It begins by explaining the differences between labeled and unlabeled data, with unlabeled data being the focus of unsupervised learning techniques. Unsupervised learning algorithms like clustering are used to find patterns in unlabeled data and group similar items together. The document then covers hierarchical and k-means clustering algorithms in Python using SciPy and demonstrates how to perform each type of clustering on sample Pokémon sighting data. Finally, it discusses the importance of preparing data for clustering through techniques like normalization prior to analyzing the data.

Uploaded by

Fgpeqw


Unsupervised learning: basics
CLUSTER ANALYSIS IN PYTHON

Shaumik Daityari
Business Analyst
Everyday example: Google news
How does Google News classify articles?

Unsupervised Learning Algorithm: Clustering

Match frequent terms in articles to find similarity


Labeled and unlabeled data
Data with no labels
Point 1: (1, 2)
Point 2: (2, 2)
Point 3: (3, 1)

Data with labels
Point 1: (1, 2), Label: Danger Zone
Point 2: (2, 2), Label: Normal Zone
Point 3: (3, 1), Label: Normal Zone
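
As a rough illustration of the distinction above, the same three points can be stored with or without labels. The variable names are illustrative and do not come from the slides.

```python
# Unlabeled data: coordinates only
unlabeled_points = [(1, 2), (2, 2), (3, 1)]

# Labeled data: each point is paired with a label (labels taken from the slide)
labeled_points = [
    ((1, 2), "Danger Zone"),
    ((2, 2), "Normal Zone"),
    ((3, 1), "Normal Zone"),
]

# Dropping the labels recovers the unlabeled form that clustering works on
stripped = [point for point, _label in labeled_points]
print(stripped == unlabeled_points)
```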


What is unsupervised learning?
A group of machine learning algorithms that find patterns in data

Data for algorithms has not been labeled, classified or characterized

The objective of the algorithm is to interpret any structure in the data

Common unsupervised learning algorithms: clustering, neural networks, anomaly detection


What is clustering?
The process of grouping items with similar characteristics

Items in a group are more similar to each other than to items in other groups

Example: distance between points on a 2D plane
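
A minimal sketch of the distance idea, assuming plain Euclidean distance as the similarity measure. The three points are picked from the sightings data used in this chapter; the variable names are illustrative.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

point_a = (80, 87)
point_b = (93, 96)   # close to point_a
point_c = (9, 57)    # far from point_a

near = dist(point_a, point_b)
far = dist(point_a, point_c)

print(f"a-b: {near:.2f}, a-c: {far:.2f}")
# A distance-based clustering would tend to group a with whichever point is closer
print("a groups with b" if near < far else "a groups with c")
```

The smaller the distance between two points, the more likely a distance-based algorithm is to place them in the same cluster.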


Plotting data for clustering - Pokemon sightings
from matplotlib import pyplot as plt

x_coordinates = [80, 93, 86, 98, 86, 9, 15, 3, 10, 20, 44, 56, 49, 62, 44]
y_coordinates = [87, 96, 95, 92, 92, 57, 49, 47, 59, 55, 25, 2, 10, 24, 10]

plt.scatter(x_coordinates, y_coordinates)
plt.show()

Basics of cluster analysis

What is a cluster?
A group of items with similar characteristics

Google News: articles where similar words and word associations appear together

Customer Segments


Clustering algorithms
Hierarchical clustering

K-means clustering

Other clustering algorithms: DBSCAN, Gaussian mixture methods

Hierarchical clustering in SciPy
from scipy.cluster.hierarchy import linkage, fcluster
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates,
                   'y_coordinate': y_coordinates})

# Build the linkage matrix using Ward's method
Z = linkage(df, 'ward')
# Cut the hierarchy into at most 3 flat clusters
df['cluster_labels'] = fcluster(Z, 3, criterion='maxclust')

# Plot the points, colored by cluster label
sns.scatterplot(x='x_coordinate', y='y_coordinate',
                hue='cluster_labels', data=df)
plt.show()
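
To check what the hierarchical clustering step produced without plotting, the linkage matrix and the label counts can be inspected directly. This is a sketch that uses a plain NumPy array in place of the DataFrame (an assumption made for brevity); the names `points`, `labels` and `counts` are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

x = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
     10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
     47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
points = np.column_stack([x, y])  # shape (15, 2)

Z = linkage(points, method='ward')
labels = fcluster(Z, 3, criterion='maxclust')

# The linkage matrix has one row per merge: n - 1 rows and 4 columns
print(Z.shape)

# Count how many points fall into each flat cluster (labels start at 1)
unique_labels, counts = np.unique(labels, return_counts=True)
print(dict(zip(unique_labels.tolist(), counts.tolist())))
```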

K-means clustering in SciPy
from scipy.cluster.vq import kmeans, vq
from matplotlib import pyplot as plt
import seaborn as sns, pandas as pd

# kmeans draws its initial centroids from NumPy's random state,
# so seed NumPy rather than Python's built-in random module
import numpy as np
np.random.seed((1000, 2000))

x_coordinates = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
                 10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y_coordinates = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
                 47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]

df = pd.DataFrame({'x_coordinate': x_coordinates, 'y_coordinate': y_coordinates})

# Compute 3 cluster centers; the second return value (distortion) is ignored
centroids, _ = kmeans(df, 3)
# Assign each point to its nearest centroid
df['cluster_labels'], _ = vq(df, centroids)

# Plot the points, colored by cluster label
sns.scatterplot(x='x_coordinate', y='y_coordinate',
                hue='cluster_labels', data=df)
plt.show()
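
The two values discarded with `_` in the k-means slide can also be kept and inspected. The sketch below assumes a NumPy array instead of the DataFrame and an arbitrarily chosen seed; `points`, `mean_distortion` and `distances` are illustrative names.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

np.random.seed(0)  # arbitrary seed, chosen only for this sketch

x = [80.1, 93.1, 86.6, 98.5, 86.4, 9.5, 15.2, 3.4,
     10.4, 20.3, 44.2, 56.8, 49.2, 62.5, 44.0]
y = [87.2, 96.1, 95.6, 92.4, 92.4, 57.7, 49.4,
     47.3, 59.1, 55.5, 25.6, 2.1, 10.9, 24.1, 10.3]
points = np.column_stack([x, y])

# kmeans returns the cluster centers and the mean distance to the nearest center
centroids, mean_distortion = kmeans(points, 3)

# vq returns, for each point, the index of its nearest centroid and that distance
labels, distances = vq(points, centroids)

print(centroids.shape)
print(np.bincount(labels))
```

Unlike `fcluster`, whose labels start at 1, the labels returned by `vq` are zero-based indices into `centroids`.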

Data preparation for cluster analysis

Why do we need to prepare data for clustering?
Variables have incomparable units (product dimensions in cm, price in $)

Variables with same units have vastly different scales and variances (expenditures on cereals, travel)

Data in raw form may lead to bias in clustering

Clusters may be heavily dependent on one variable

Solution: normalization of individual variables
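
The points above can be sketched with a small, entirely hypothetical dataset: four products described by length (cm) and price ($). Before rescaling, the price column dominates the Euclidean distance; after dividing each variable by its standard deviation, both variables contribute on a similar scale.

```python
from math import dist
from statistics import pstdev  # population standard deviation

# Hypothetical products: (length in cm, price in $)
lengths = [10, 12, 50, 52]
prices = [1000, 5000, 1100, 5200]
raw = list(zip(lengths, prices))

# Raw distances from product 0: price differences swamp length differences
raw_d01 = dist(raw[0], raw[1])   # similar length, very different price
raw_d02 = dist(raw[0], raw[2])   # very different length, similar price

# Rescale each variable by its own standard deviation
len_sd, price_sd = pstdev(lengths), pstdev(prices)
scaled = [(l / len_sd, p / price_sd) for l, p in raw]

scaled_d01 = dist(scaled[0], scaled[1])
scaled_d02 = dist(scaled[0], scaled[2])

print(f"raw:    d01={raw_d01:.1f}, d02={raw_d02:.1f}")
print(f"scaled: d01={scaled_d01:.2f}, d02={scaled_d02:.2f}")
```

In the raw data a 40 cm difference in length barely registers next to a price gap in the thousands; after rescaling the two distances are of comparable size.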


Normalization of data
Normalization: process of rescaling data to a standard deviation of 1

x_new = x / std_dev(x)

from scipy.cluster.vq import whiten

data = [5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5]

scaled_data = whiten(data)
print(scaled_data)

[2.73, 0.55, 1.64, 1.64, 1.09, 1.64, 1.64, 4.36, 0.55, 1.09, 1.09, 1.64, 2.73]
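
As a quick check of the formula x_new = x / std_dev(x), the result of `whiten` can be compared with a manual NumPy computation. This sketch assumes the population standard deviation (NumPy's default, `ddof=0`); the name `manual` is illustrative.

```python
import numpy as np
from scipy.cluster.vq import whiten

data = np.array([5, 1, 3, 3, 2, 3, 3, 8, 1, 2, 2, 3, 5], dtype=float)

# Manual rescaling: divide every value by the standard deviation
manual = data / np.std(data)

# SciPy's whiten applies the same rescaling
scaled = whiten(data)

print(np.allclose(manual, scaled))
# After rescaling, the standard deviation of the data is 1
print(np.isclose(np.std(scaled), 1.0))
```

Note that `whiten` only divides by the standard deviation; it does not subtract the mean, so the rescaled values are not centered on zero.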


Illustration: normalization of data
# Import plotting library
from matplotlib import pyplot as plt

# Plot original and scaled data (data and scaled_data come from the whiten example)
plt.plot(data, label="original")
plt.plot(scaled_data, label="scaled")

# Show legend and display plot
plt.legend()
plt.show()
