unit-3
Clustering metrics
• Clustering metrics are quantitative measures used to evaluate the quality of
clustering algorithms and the resulting clusters.
• Internal Evaluation Metrics: These metrics evaluate the quality of clusters without
any external reference. They measure the compactness of data points within the
same cluster and the separation between different clusters.
• Such as: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index (Variance Ratio Criterion),
Dunn Index
• External Evaluation Metrics: These metrics require a ground truth or a reference set
of labels to compare the clustering results against. They measure the agreement
between the generated clusters and the true classes in the reference set.
• Such as: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Fowlkes-Mallows Index
Silhouette Score
• Silhouette Coefficient or silhouette score is a metric
used to calculate the goodness of a clustering
technique. Its value ranges from -1 to 1.
• 1: Means clusters are well apart from each other and
clearly distinguished (Best Case)
• 0: Means clusters are indifferent, or we can say that the
distance between clusters is not significant
(Overlapping Clusters)
• -1: Means clusters are assigned in the wrong way (Worst
Case)
• The Silhouette Score is a metric used to evaluate the quality of clustering in
unsupervised learning.
• It measures how well data points are clustered by considering both cohesion
(how close points are within a cluster) and separation (how distinct or far
away clusters are from one another).
Applications:
• The following parameters appear in a typical silhouette-evaluation workflow
(see the code sketch below):
• The cluster_std parameter controls how far the generated cluster points
are spread.
• The random_state parameter makes the random number generation
reproducible.
• n_clusters is a variable specifying how many clusters will be created.
• The KMeans class is used to implement the K-Means clustering
algorithm.
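• A minimal sketch of how these pieces fit together (the specific values below are illustrative, not taken from the slides):

# Minimal sketch: generate synthetic data, cluster it with K-Means,
# and score the result with the Silhouette Score.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# cluster_std controls how far the generated cluster points are spread;
# random_state makes the random number generation reproducible.
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

# n_clusters specifies how many clusters K-Means will create.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Silhouette Score: values close to 1 indicate compact, well-separated clusters.
print("Silhouette Score:", silhouette_score(X, labels))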
Worked example: pairwise distance matrix between data points A1–A4
(A1 and A2 belong to cluster C1; A3 and A4 belong to cluster C2)
Datapoint   A1     A2     A3     A4
A1          0      0.10   0.65   0.55
A2          0.10   0      0.70   0.60
A3          0.65   0.70   0      0.30
A4          0.55   0.60   0.30   0
It is computed as:
• The silhouette value is a measure of how similar an
object is to its own cluster (cohesion) compared to other
clusters (separation).
• Silhouette Score: s(i) = (b(i) - a(i)) / max(a(i), b(i))
where
• a(i) = the average intra-cluster distance, i.e. the average distance
between point i and all other points in its own cluster
• b(i) = the average inter-cluster distance, i.e. the average distance
between point i and all points in the nearest neighbouring cluster
First Table:
1. Lists data points (A1, A2, A3, A4) and their corresponding cluster labels (C1 or C2).
2. A1 and A2 belong to cluster C1, while A3 and A4 belong to cluster C2.
Second Table:
3. Contains a distance matrix showing pairwise distances between each data point (A1,
A2, A3, A4).
4. The rows and columns represent the data points, and the values represent the
distance between each pair.
1. a(i): For A1, compute the average distance to the other points in its own cluster C1
(here just A2); do the same for A2.
2. b(i): For both A1 and A2, compute the average distance to the points of the nearest
other cluster (C2: A3 and A4). The full calculation is worked out in the sketch below.
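• Working this out from the distance matrix gives, for example, a(A1) = 0.10 and b(A1) = (0.65 + 0.55) / 2 = 0.60, so s(A1) = (0.60 - 0.10) / 0.60 ≈ 0.83. The short script below (a sketch reusing the matrix and cluster labels above) checks all four points and the mean score:

import numpy as np
from sklearn.metrics import silhouette_score, silhouette_samples

# Pairwise distance matrix for A1..A4 from the table above.
D = np.array([
    [0.00, 0.10, 0.65, 0.55],
    [0.10, 0.00, 0.70, 0.60],
    [0.65, 0.70, 0.00, 0.30],
    [0.55, 0.60, 0.30, 0.00],
])
labels = [0, 0, 1, 1]  # A1, A2 -> C1; A3, A4 -> C2

# Per-point silhouettes, e.g. for A1: a = 0.10, b = 0.60, s ≈ 0.833.
print(silhouette_samples(D, labels, metric="precomputed"))
# Mean silhouette over all four points (≈ 0.68).
print(silhouette_score(D, labels, metric="precomputed"))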
Davies-Bouldin Index
• It measures the average similarity ratio of each cluster to its most similar
cluster, considering the intra-cluster distance (how compact a cluster is) and
the inter-cluster distance (how separated clusters are); lower values indicate better clustering.
Dunn Index
• The Dunn Index is the ratio of the minimum inter-cluster distance to the
maximum intra-cluster distance, so higher values indicate compact, well-separated clusters.
• Data Point 1: (2, 3) Data Point 2: (3, 2) Data Point 3: (4, 3) Data Point 4: (8, 8)
Data Point 5: (9, 7) Data Point 6: (10, 8) Data Point 7: (15, 14) Data Point 8: (16,
13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:
• Cluster 1: {Data Point 1, Data Point 2, Data Point 3} Cluster 2: {Data
Point 4, Data Point 5, Data Point 6, Data Point 7, Data Point 8}
• Now, let's calculate the inter-cluster distances (minimum distance
between any two points from different clusters) and intra-cluster
distances (maximum distance between any two points within the
same cluster):
• Inter-cluster distance:
• Between Cluster 1 and Cluster 2:
• Min inter-cluster distance = distance(Data Point 3, Data Point 4) = sqrt((8 - 4)^2 + (8 - 3)^2) = sqrt(41) ≈ 6.40
• Intra-cluster distances:
• For Cluster 1:
• Max intra-cluster distance = max(distance(Data Point 1, Data Point 2), distance(Data Point 1, Data
Point 3), distance(Data Point 2, Data Point 3)) = max(sqrt(2), 2, sqrt(2)) = 2
• For Cluster 2:
• Max intra-cluster distance = max(distance(Data Point 4, Data Point 5), distance(Data Point 4, Data
Point 6), distance(Data Point 4, Data Point 7), distance(Data Point 4, Data Point 8), distance(Data
Point 5, Data Point 6), distance(Data Point 5, Data Point 7), distance(Data Point 5, Data Point 8),
distance(Data Point 6, Data Point 7), distance(Data Point 6, Data Point 8), distance(Data Point 7,
Data Point 8)) = max(sqrt(2), 2, sqrt(85), sqrt(89), sqrt(2), sqrt(85), sqrt(85), sqrt(61), sqrt(61),
sqrt(2)) = sqrt(89) ≈ 9.43
• Now, calculate the Dunn Index:
• Dunn Index = Min inter-cluster distance / Max intra-cluster distance ≈
6.40 / 9.43 ≈ 0.68
• In this example, the Dunn Index for this clustering solution is
approximately 0.68. A higher Dunn Index indicates that the clusters
are well-separated and compact within themselves, which is desirable
for a good clustering solution.
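• These numbers can be checked with a quick script (a sketch using SciPy's distance utilities; the points and the two-cluster split are exactly those listed above):

import numpy as np
from scipy.spatial.distance import cdist, pdist

cluster1 = np.array([[2, 3], [3, 2], [4, 3]])
cluster2 = np.array([[8, 8], [9, 7], [10, 8], [15, 14], [16, 13]])

# Smallest distance between a point of Cluster 1 and a point of Cluster 2.
min_inter = cdist(cluster1, cluster2).min()                    # sqrt(41) ≈ 6.40
# Largest pairwise distance inside either cluster.
max_intra = max(pdist(cluster1).max(), pdist(cluster2).max())  # sqrt(89) ≈ 9.43

print("Dunn Index:", min_inter / max_intra)                    # ≈ 0.68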
Example
• Let's work through a numerical example to calculate the Dunn Index for a simple set of clusters.
Suppose you have the following data points and their corresponding clusters:
Data Points:
• A(1, 2)
• B(2, 2)
• C(5, 8)
• D(6, 8)
• E(10, 12)
• F(11, 12)
Clusters:
• Cluster 1: {A, B}
• Cluster 2: {C, D}
• Cluster 3: {E, F}
• Distance between Cluster 1 and Cluster 2 (d(1, 2)):
• d(A, C) = sqrt((1-5)^2 + (2-8)^2) = sqrt(52) ≈ 7.21
• d(A, D) = sqrt((1-6)^2 + (2-8)^2) = sqrt(61) ≈ 7.81
• d(B, C) = sqrt((2-5)^2 + (2-8)^2) = sqrt(45) ≈ 6.71
• d(B, D) = sqrt((2-6)^2 + (2-8)^2) = sqrt(52) ≈ 7.21
• So, d(1, 2) = sqrt(45) ≈ 6.71
• Distance between Cluster 1 and Cluster 3 (d(1, 3)):
• d(A, E) = sqrt((1-10)^2 + (2-12)^2) = sqrt(181) ≈ 13.45
• d(A, F) = sqrt((1-11)^2 + (2-12)^2) = sqrt(200) ≈ 14.14
• d(B, E) = sqrt((2-10)^2 + (2-12)^2) = sqrt(164) ≈ 12.81
• d(B, F) = sqrt((2-11)^2 + (2-12)^2) = sqrt(181) ≈ 13.45
• So, d(1, 3) = sqrt(164) ≈ 12.81
• Distance between Cluster 2 and Cluster 3 (d(2, 3)):
• d(C, E) = sqrt((5-10)^2 + (8-12)^2) = sqrt(41) ≈ 6.40
• d(C, F) = sqrt((5-11)^2 + (8-12)^2) = sqrt(52) ≈ 7.21
• d(D, E) = sqrt((6-10)^2 + (8-12)^2) = sqrt(32) ≈ 5.66
• d(D, F) = sqrt((6-11)^2 + (8-12)^2) = sqrt(41) ≈ 6.40
• So, d(2, 3) = sqrt(32) ≈ 5.66
• Now, let's calculate the minimum inter-cluster distance and the
maximum intra-cluster distance:
• Minimum inter-cluster distance: min(d(1, 2), d(1, 3), d(2, 3)) =
min(6.71, 12.81, 5.66) ≈ 5.66
• Maximum intra-cluster distance: max(d(A, B), d(C, D), d(E, F)) =
max(1, 1, 1) = 1
• Finally, we can calculate the Dunn Index:
• Dunn Index = min(d(1, 2), d(1, 3), d(2, 3)) / max(d(A, B), d(C, D),
d(E, F)) = 5.66 / 1 ≈ 5.66
• So, the Dunn Index for these clusters is approximately 5.66, reflecting
clusters that are very compact relative to how far apart they are.
Data Point Coordinates Cluster
1 (2, 3) A
2 (2, 5) A
3 (3, 8) B
4 (6, 5) B
5 (8, 8) C
6 (9, 6) C
7 (10, 2) A
8 (12, 4) A
9 (15, 7) C
10 (17, 5) C
Drawbacks of Dunn index:
• As the number of clusters and dimensionality of the
data increase, the computational cost also increases.
• Application of Dunn Index
• The Dunn Index is particularly useful when trying to determine the quality
of clusters in fields where compact and well-separated clusters are
desirable, such as:
• Bioinformatics: Grouping genes or proteins based on similarity in sequence
or structure.
• Market segmentation: Dividing customers into homogeneous groups that
are distinctly different from other groups.
• Social network analysis: Identifying tightly-knit communities that are well
separated from each other.
• Image processing: Segmenting images into distinct regions that represent
objects or areas of interest.
• Advantages of Dunn Index
1.Emphasizes compactness and separation: The Dunn Index prioritizes
clusters that are both compact and well-separated, making it an ideal
measure for high-quality clustering.
2.Helps in selecting the number of clusters: It can be used to compare
different clustering results with varying numbers of clusters and select the
one with the highest Dunn Index.
3.Works with any clustering algorithm: Like other clustering evaluation
metrics, the Dunn Index can be applied to results from various algorithms
(e.g., KMeans, DBSCAN, hierarchical clustering).
• Disadvantages of Dunn Index
1.Computationally expensive: Calculating pairwise distances between points
across clusters makes the Dunn Index computationally expensive, especially
for large datasets.
2.Sensitive to noise: The presence of noise or outliers in the data can
significantly affect the Dunn Index, leading to lower scores even for well-
formed clusters.
3.Limited interpretability: While a higher Dunn Index indicates better
clustering, it is hard to interpret the meaning of specific values without a
reference.
4.Not widely available in libraries: Unlike other metrics like Silhouette Score
or Davies-Bouldin Index, the Dunn Index is not directly available in popular
libraries like scikit-learn, requiring a custom implementation (a minimal
sketch is given below).
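• Since the Dunn Index is not shipped with scikit-learn, a minimal custom implementation might look like the sketch below (it assumes Euclidean distances over numeric feature vectors and at least two clusters):

import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Dunn Index = min inter-cluster distance / max intra-cluster distance.

    X      : (n_samples, n_features) array of points.
    labels : cluster label for each point (at least two distinct clusters).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = [X[labels == c] for c in np.unique(labels)]

    # Smallest distance between points belonging to different clusters.
    min_inter = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    # Largest distance between two points inside the same cluster
    # (a singleton cluster contributes 0).
    max_intra = max(pdist(c).max() if len(c) > 1 else 0.0 for c in clusters)
    return min_inter / max_intra

# Example: the 8 points and two clusters used earlier give a Dunn Index of ≈ 0.68.
X = np.array([[2, 3], [3, 2], [4, 3], [8, 8], [9, 7], [10, 8], [15, 14], [16, 13]])
labels = [0, 0, 0, 1, 1, 1, 1, 1]
print(dunn_index(X, labels))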
• Real-Time Example of Dunn Index
• Example: Gene Expression Clustering in Bioinformatics
• In bioinformatics, researchers often cluster gene expression data to identify
groups of genes with similar expression patterns across different conditions
or time points. These clusters can help identify genes involved in the same
biological processes.
• Application of Dunn Index: After clustering gene expression data,
researchers use the Dunn Index to evaluate the quality of clusters, ensuring
that genes within the same cluster are similar and different from those in
other clusters.
• Advantage: The Dunn Index helps ensure that the gene clusters are well-
formed, providing more reliable insights into gene function.
• Disadvantage: Due to the high dimensionality and noisy nature of gene
expression data, the Dunn Index can be computationally expensive and may
be sensitive to outliers.
• Conclusion
• The Dunn Index is a valuable clustering evaluation metric that emphasizes
both compactness and separation of clusters.
Adjusted Rand Index (ARI)
• Obtain true class labels: If you have access to ground-truth labels or class
assignments for your data, this is ideal. These true labels represent the
actual groups or categories that your data points belong to.
• The ARI is based on the Rand Index (RI), which measures the similarity between
two clusterings by counting pairs of points that are grouped consistently; ARI
corrects the RI for chance agreement.
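• A quick sketch of the corresponding scikit-learn call (the label arrays below are illustrative, not from the slides):

from sklearn.metrics import adjusted_rand_score

labels_true = [0, 0, 0, 1, 1, 2]   # hypothetical ground-truth classes
labels_pred = [0, 0, 1, 1, 1, 2]   # hypothetical cluster assignments

# ARI = 1 for identical partitions, ≈ 0 for random labelings, and can be negative.
print("Adjusted Rand Index:", adjusted_rand_score(labels_true, labels_pred))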
• NMI in scikit-learn:
• from sklearn.metrics import normalized_mutual_info_score
• U = [0, 0, 0, 1, 1, 1]   # e.g., ground-truth class labels
• V = [0, 0, 1, 1, 2, 2]   # e.g., cluster labels produced by an algorithm
• # Calculate NMI
• nmi_score = normalized_mutual_info_score(U, V)
• print(f"Normalized Mutual Information Score: {nmi_score}")
Calculating NMI for Clustering
• Assume m = 3 classes (triangles, rectangles, stars) and k = 2 clusters, each
containing 10 of the 20 points.
• H(Y|C): conditional entropy of the class labels for clustering C.
• First, consider Cluster-1:
– P(Y=1|C=1) = 3/10 (three triangles in cluster-1)
– P(Y=2|C=1) = 3/10 (three rectangles in cluster-1)
– P(Y=3|C=1) = 4/10 (four stars in cluster-1)
– Its contribution to the conditional entropy is:
H(Y|C=1) = -P(C=1) × Σ_y P(Y=y|C=1) log2 P(Y=y|C=1),  y ∈ {1, 2, 3}
= -(1/2) × [(3/10) log2(3/10) + (3/10) log2(3/10) + (4/10) log2(4/10)] ≈ 0.7855
• Now, consider Cluster-2:
– P(Y=1|C=2) = 2/10 (two triangles in cluster-2)
– P(Y=2|C=2) = 7/10 (seven rectangles in cluster-2)
– P(Y=3|C=2) = 1/10 (one star in cluster-2)
– Calculate its contribution to the conditional entropy as:
H(Y|C=2) = -P(C=2) × Σ_y P(Y=y|C=2) log2 P(Y=y|C=2),  y ∈ {1, 2, 3}
= -(1/2) × [(2/10) log2(2/10) + (7/10) log2(7/10) + (1/10) log2(1/10)] ≈ 0.5784
I(Y; C)
• The mutual information is I(Y; C) = H(Y) - H(Y|C).
• With H(Y) = 1.5 and H(Y|C) = 0.7855 + 0.5784 = 1.3639, we get
I(Y; C) = 1.5 - 1.3639 ≈ 0.1361
• The NMI is therefore:
NMI(Y, C) = 2 × I(Y; C) / (H(Y) + H(C))
= 2 × 0.1361 / (1.5 + 1) ≈ 0.1089
(H(C) = 1 because each of the two clusters contains half of the 20 points.)
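• The arithmetic above can be reproduced from the 2 × 3 contingency table of the example (cluster-1: 3 triangles, 3 rectangles, 4 stars; cluster-2: 2, 7, 1). The sketch below uses base-2 logarithms, as in the hand calculation:

import numpy as np

# Rows = clusters, columns = classes (triangles, rectangles, stars).
counts = np.array([[3, 3, 4],
                   [2, 7, 1]], dtype=float)
N = counts.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

H_Y = entropy(counts.sum(axis=0) / N)                        # 1.5
H_C = entropy(counts.sum(axis=1) / N)                        # 1.0
H_Y_given_C = sum((row.sum() / N) * entropy(row / row.sum())
                  for row in counts)                         # ≈ 1.3639
I_YC = H_Y - H_Y_given_C                                     # ≈ 0.1361
NMI = 2 * I_YC / (H_Y + H_C)                                 # ≈ 0.1089
print(H_Y, H_C, I_YC, NMI)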
Calculate NMI for the following
Homogeneity
• A clustering result satisfies homogeneity if all of its
clusters contain only data points which are members of
a single class.
• This metric is independent of the absolute values of the
labels: a permutation of the class or cluster label values
won’t change the score value in any way.
• Syntax: sklearn.metrics.homogeneity_score(labels_true, labels_pred)
• The metric is not symmetric: switching labels_true with labels_pred
will return the completeness_score.
Parameters:
• labels_true: <int array, shape = [n_samples]>: the ground-truth class
labels used as a reference.
• labels_pred: <array-like of shape (n_samples,)>: the cluster labels to evaluate.
Returns:
• homogeneity: <float>: a score between 0.0 and 1.0, where 1.0 stands for
perfectly homogeneous labeling.
• To calculate homogeneity numerically, you can use the following formula:
homogeneity = 1 - H(C|K) / H(C),
where H(C|K) is the conditional entropy of the classes given the cluster
assignments and H(C) is the entropy of the classes.
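• A short sketch of the call (the labels below are made up for illustration):

from sklearn.metrics import homogeneity_score

labels_true = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth classes
labels_pred = [0, 0, 0, 1, 1, 1]   # hypothetical cluster labels

# 1.0 means every cluster contains members of a single class only.
print("Homogeneity:", homogeneity_score(labels_true, labels_pred))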
Completeness score
• This score is complementary to the previous one. Its
purpose is to provide a piece of information about the
assignment of samples belonging to the same class.
• More precisely, a good clustering algorithm should
assign all samples with the same true label to the same
cluster.
Completeness (completeness_score) portrays how close the clustering algorithm comes
to this perfection; numerically, completeness = 1 - H(K|C) / H(K), where H(K|C) is the
conditional entropy of the clusters given the class labels and H(K) is the entropy of
the clusters.
This metric is independent of the absolute values of the labels: a permutation of the
cluster label values won’t change the score value in any way.
sklearn.metrics.completeness_score()
Syntax: sklearn.metrics.completeness_score(labels_true, labels_pred)
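• A short sketch with the same made-up labels as above, also illustrating the asymmetry noted in the homogeneity section:

from sklearn.metrics import completeness_score, homogeneity_score

labels_true = [0, 0, 1, 1, 2, 2]   # hypothetical ground-truth classes
labels_pred = [0, 0, 0, 1, 1, 1]   # hypothetical cluster labels

# 1.0 means all members of each class end up in the same cluster.
print("Completeness:", completeness_score(labels_true, labels_pred))
# Swapping the arguments of homogeneity_score returns the same value.
print("Swapped     :", homogeneity_score(labels_pred, labels_true))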
Fowlkes-Mallows Index
• FMI = TP / sqrt((TP + FP) × (TP + FN))
• where TP is the number of True Positives (i.e. the number of pairs of points that belong to the
same cluster in both labels_true and labels_pred),
• FP is the number of False Positives (i.e. the number of pairs of points that belong to the same
cluster in labels_true but not in labels_pred),
• and FN is the number of False Negatives (i.e. the number of pairs of points that belong to the
same cluster in labels_pred but not in labels_true).
The score ranges from 0 to 1. A high value indicates a good similarity between the two clusterings.
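• The same quantity is available as sklearn.metrics.fowlkes_mallows_score; a small sketch with illustrative labels:

from sklearn.metrics import fowlkes_mallows_score

labels_true = [0, 0, 0, 1, 1, 1]   # hypothetical ground-truth classes
labels_pred = [0, 0, 1, 1, 2, 2]   # hypothetical cluster labels

# FMI = TP / sqrt((TP + FP) * (TP + FN)); 1.0 means the two labelings
# group pairs of points identically.
print("Fowlkes-Mallows Index:", fowlkes_mallows_score(labels_true, labels_pred))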
• Adjusted Mutual Information (AMI) is a variation of the Normalized
Mutual Information (NMI) that adjusts for the chance grouping of data
points.
• In other words, AMI corrects for the possibility that mutual information
might be high purely due to random clustering.
• The AMI score is based on the concept of Mutual Information (MI), but it
incorporates a correction factor to account for the fact that random
clusterings will still have some degree of mutual information due to
chance.
• This adjustment ensures that the score reflects the actual quality of the
clustering, especially when comparing different clustering algorithms on a
dataset.
• Advantages of AMI
• Chance Correction: AMI adjusts for random clustering and eliminates any
bias caused by chance groupings, making it a more reliable measure for
clustering performance than NMI.
• Symmetry: Like NMI, AMI is symmetric, meaning the clustering order does
not affect the score.
• Scale Independence: The AMI score is bounded above by 1 (identical
partitions), is close to 0 for random labelings, and can take small negative
values, making it intuitive to interpret.
• Disadvantages of AMI
• Computational Complexity: AMI can be computationally expensive due to
the need to calculate the expected mutual information, which involves
random permutations.
• Sensitivity to Cluster Number: Like most clustering metrics, AMI is
sensitive to the number of clusters. If the true number of clusters is not
specified correctly, the AMI score may not provide an accurate evaluation.
• Real-time Example and Use Case of AMI
• Real-time Example: Suppose you have a dataset of documents and you
apply two different clustering methods to organize the documents into
topics. You can use AMI to compare the two clustering results. If you also
have ground truth labels (e.g., predefined topics), you can use AMI to
check how closely your clustering matches these true labels while
correcting for chance.
• Use Case: In biological data analysis, AMI can be used to evaluate the
performance of clustering algorithms applied to gene expression data.
Since clusters may emerge by random chance, AMI helps ensure that
clusters found in the data are meaningful and not due to random
partitioning.
• from sklearn.metrics import adjusted_mutual_info_score
• # U, V: the same two label assignments used in the NMI example above
• # Calculate AMI
• ami_score = adjusted_mutual_info_score(U, V)
• print(f"Adjusted Mutual Information Score: {ami_score}")
• AMI vs. NMI
• AMI corrects for random clustering and adjusts for chance, making it
more reliable when randomness is a concern.
• NMI is simpler but does not account for chance alignments, so its scores
can be noticeably above zero even for random clusterings (see the sketch
below). For evaluating and comparing multiple clustering algorithms, AMI is
generally preferred over NMI.
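• The effect of the chance correction can be seen by scoring two purely random labelings (a sketch; the exact values depend on the random seed, but NMI typically stays well above 0 while AMI stays near 0):

import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_mutual_info_score

rng = np.random.default_rng(0)
# Two completely random labelings of 100 points into 10 "clusters" each.
a = rng.integers(0, 10, size=100)
b = rng.integers(0, 10, size=100)

print("NMI on random labels:", normalized_mutual_info_score(a, b))  # noticeably > 0
print("AMI on random labels:", adjusted_mutual_info_score(a, b))    # close to 0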