Unit 4
Clustering
- DBScan
- Hierarchical Clustering
- Fuzzy Clustering
- Graph Based Clustering
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 1
Objective
●
Clustering Algorithms
– Centroid based
– Density based
– Hierarchical
– Fuzzy
– Graph based
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 2
Evaluation of Clustering
●
Challenging due to
– In many real-world datasets, the boundaries between clusters are not
clear-cut.
●
Some data points might sit at the boundary of two clusters and could be
reasonably assigned to both.
– Different applications might prioritize different aspects of clustering.
●
For example, in one application, it might be essential to have tight, well-
separated clusters, while in another, capturing the overall data structure
might be more important.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 4
Types of Evaluation
●
Two types of clustering evaluation measures (or metrics)
●
Internal measures do not require any ground truth to assess
the quality of clusters. They are based solely on the data and
the clustering results.
●
External measures compare the clustering results to ground
truth labels.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 5
Internal Evaluation Measures
●
Most internal validation measures are based on the following two
criteria:
●
Compactness measures how closely related objects in the same
cluster are.
– Compactness can be measured in different ways, such as by using the
variance of the points within each cluster, or computing the average pairwise
distance between them.
●
Separation measures how distinct or well-separated a cluster is
from other clusters.
– Examples for measures of separation include pairwise distances between
cluster centers or pairwise minimum distances between objects in different
clusters.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 6
External Evaluation Measures
●
External evaluation measures are used when the true labels of
the data points are known.
●
These measures compare the the results of the clustering
algorithm against the ground truth labels.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 7
Other Clustering Algorithms
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 8
Hierarchical Clustering
●
Hierarchical clustering organizes data into a hierarchy of
clusters, represented as a tree-like structure known as a
dendrogram
– This algorithm builds a hierarchy of clusters by iteratively merging or
splitting clusters based on their similarity.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 9
Hierarchical Clustering
●
Types:
– There are two main types of hierarchical clustering:
●
agglomerative (bottom-up) and
●
divisive (top-down).
●
Strengths
– Hierarchical clustering can discover clusters of arbitrary shapes and
sizes, and it provides a visual representation of the hierarchical
relationships between clusters.
●
Weaknesses
– Hierarchical clustering can be computationally expensive, especially for
large datasets. It is also sensitive to the initial ordering of the data points
and the choice of the distance metric.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 10
Hierarchical Clustering - Types
●
Agglomerative Clustering:
– Bottom-up approach merging similar clusters sequentially.
●
Divisive Clustering:
– Top-down approach dividing clusters iteratively.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 11
Basic Agglomerative Hierarchical Algo.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 12
Hierarchical Agglomerative Clustering (HAC)
●
Types
– single link
●
For the single link or MIN version of hierarchical clustering,
the proximity of two clusters is defined as the minimum of
the distance (maximum of the similarity) between any two
points in the two different clusters
– complete link
●
For the complete link or MAX version of hierarchical
clustering, the proximity of two clusters is defined as the
maximum of the distance (minimum of the similarity)
between any two points in the two different clusters
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 13
Hierarchical Agglomerative Clustering (HAC)
●
Use the distance matrix in table below to perform single link
and complete link hierarchical clustering. Show your results by
drawing a dendogram.
●
The dendogram should clearly show the order in which the
points are merged.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 14
HAC
●
Calculate Euclidean distance, create the distance matrix.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 15
HAC
●
Find the minimum value element from distance matrix.
– The minimum value element is (p3,p6)and value is 0.11
– i.e. our 1st cluster (p3,p6)
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 16
HAC
●
Recalculate or update the distance matrix for cluster(p3,p6)
●
Same formula can be used for p2,p4,p5
●
Updated distance matrix
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 17
HAC
●
Repeat to get another pair
– The minimum value element is (p2,p5)and value is 0.14
– i.e. our 2nd cluster (p2,p5)
●
Recalculate or update the distance
matrix for cluster (p2,p5)
– Same formula can be used for
(p3,p6),p4
– Updated distance matrix:
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 18
HAC
●
Repeat
– 1st cluster (p3,p6)
– 2nd cluster (p2,p5)
– 3rd cluster (p2,p5,p3,p6)
– 4th cluster (p2,p5,p3,p6,p4)
– 5th cluster (p2,p5,p3,p6,p4,p1)
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 19
HAC
●
Repeat
– 1st cluster (p3,p6)
– 2nd cluster (p2,p5)
– 3rd cluster (p2,p5,p3,p6)
– 4th cluster (p2,p5,p3,p6,p4)
– 5th cluster (p2,p5,p3,p6,p4,p1)
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 20
MAX version of HAC
●
Find the min distance and group them
●
Update the distance
●
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 21
MAX version of HAC
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 22
Numericals
●
https://aryabhattacollege.ac.in/samplepaper/63722744950872
5497DataMining(Chapter8).pdf
●
https://www.analyticsvidhya.com/blog/2021/06/single-link-hiera
rchical-clustering-clearly-explained/
●
https://medium.com/@rohanjoseph_91119/learn-with-an-exam
ple-hierarchical-clustering-873b5b50890c
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 23
Density Based Clustering
●
Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly constructed
density function
●
Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 24
Density Based Clustering
●
Algorithms
– DBSCAN: Ester, et al. (KDD 96)
– DENCLUE: Hinneburg & D. Keim (KDD 98/2006)
– OPTICS: Ankerst, et al (SIGMOD 99).
– CLIQUE: Agrawal, et al. (SIGMOD 98)
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 25
DBSCAN
●
DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius r (Ɛ - EPS)
– A point is a core point if it has more than a specified number of points
(MinPts) within Ɛ
●
These are points that are at the interior of a cluster
– A border point has fewer than MinPts within Ɛ, but is in the
neighborhood of a core point
– A noise point is any point that is not a core point or a border point.
●
DBSCAN is a density-based algorithm.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 26
DBSCAN- core, border, and noise points
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 27
Simplified view of DBSCAN
●
Create a graph whose nodes are the points to be clustered
●
For each core-point c create an edge from c to every point p in the Ɛ-
neighborhood of c
●
Set N to the nodes of the graph;
●
If N does not contain any core points terminate
●
Pick a core point c in N
●
Let X be the set of nodes that can be reached from c by going forward;
– create a cluster containing X U {c}
– N=N / (X U {c})
●
Continue with step 4
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 28
Estimating Ɛ and MinPts
●
Idea is that for points in a cluster, their kth nearest neighbors
are at roughly the same distance
– Noise points have the kth nearest neighbor at farther distance
– So, plot sorted distance of every point to its kth nearest neighbor
●
Thus, eps=10
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 29
Where DBSCAN Fails Larger EPS , MinPts= 4
●
Varying densities
●
High-dimensional data
Original Points
2024, MDS
MDS 602 (Advanced Data Mining)
Master’s in Data Science Unit 4: Clustering Smaller EPS , MinPts= 4 30
Resources
●
https://cdn.aaai.org/KDD/1996/KDD96-037.pdf
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 31
Hard partition
●
Hard partition
– where the data points are strictly allocated as a member of one cluster
and are not a member of another cluster, assuming that the number of
clusters is known.
– The k-means is one of the algorithms that use a hard partition.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 32
Soft Partition
●
Every data point is given a probability of closeness [0, 1] for
existing clusters, assuming that the number of clusters is
known.
●
One of the algorithms that use fuzzy partition is Fuzzy c-
means that we will talk about it in depth.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 33
Fuzzy Clustering
●
Traditional clustering algorithms assign data points to
exclusive clusters, where each point belongs to one and only
one cluster.
●
Fuzzy clustering, on the other hand, acknowledges the
inherent ambiguity in real-world data and allows for a more
flexible approach.
– With fuzzy clustering, a data point can belong to multiple clusters to
varying degrees, providing a richer representation of complex
relationships.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 34
Fuzzy Clustering
●
The key idea behind fuzzy clustering lies in the concept of
membership functions.
– Instead of assigning binary membership values (0 or 1) as in
traditional clustering, fuzzy clustering employs membership values
ranging from 0 to 1.
– These values represent the degree of belongingness of each data
point to each cluster.
●
A higher membership value indicates a stronger association
with a particular cluster, while a lower value signifies a weaker
association.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 35
Application of fuzzy clustering
●
Image Segmentation
– In computer vision, fuzzy clustering can be used for image
segmentation tasks.
– By assigning membership values to pixels, it allows for a more flexible
and granular delineation of image regions, enabling better object
recognition and scene understanding.
●
Pattern Recognition
– Fuzzy clustering is useful in pattern recognition tasks, such as
character recognition or speech recognition.
– By capturing the uncertainty and variability inherent in patterns, it
allows for more robust and accurate recognition algorithms.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 36
Application of fuzzy clustering
●
Customer Segmentation
– Fuzzy clustering is valuable in customer segmentation for marketing
purposes.
– By considering the degree of membership to different segments,
businesses can tailor personalized marketing strategies to effectively
target customers based on their varying preferences and behaviors.
●
Document clustering
– Fuzzy clustering can be used to cluster documents based on their
content, such as keywords, topics, and themes.
– This can help in organizing and retrieving documents efficiently.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 37
Fuzzy c-means (FCM)
●
FCM is the most well-known and widely used fuzzy clustering
technique.
– It is an iterative algorithm that minimizes the sum of the weighted
squared distances between each data point and the centers of the
clusters.
– The degree of membership of each data point to each cluster is
calculated using a membership function, which assigns a probability
value between 0 and 1 for each cluster.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 38
Graph based Clustering
●
Connected components clustering (CCC)
●
HCS:
– stands for highly connected subgraphs clustering
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 39
HCS
●
It uses the notion of min-cuts.
●
There are many heuristic algorithms to find min-cuts or
approximations
●
HCS starts with the entire graph and checks if it is highly
connected.
– We can define highly connected however we like so long as we can
test for this condition efficiently.
– If the graph is not highly connected, it partitions the graph into two
non-empty subgraphs in such a way that the number of edges that
cross the partition is minimized.
– There are algorithms to find min-cuts under additional constraints.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 40
HCS
●
We then recurse on the two subgraphs.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 41
HCS
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 42
Use of HCS
●
Network based summarization
●
Mapping application
●
Use Case
– Online Delivery
– E-Commerce Store setup
– Bus Route planning etc.
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 43
Thank you
MDS 602 (Advanced Data Mining)
2024, MDS Master’s in Data Science Unit 4: Clustering 44