unit-3

Clustering metrics
• Clustering metrics are quantitative measures used to evaluate the quality of
clustering algorithms and the resulting clusters.
• Internal Evaluation Metrics: These metrics evaluate the quality of clusters without
any external reference. They measure the compactness of data points within the
same cluster and the separation between different clusters.
• Such as: Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index (Variance Ratio Criterion),
Dunn Index
• External Evaluation Metrics: These metrics require a ground truth or a reference set
of labels to compare the clustering results against. They measure the agreement
between the generated clusters and the true classes in the reference set.
• Such as: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), Fowlkes-Mallows Index
Silhouette Score
• Silhouette Coefficient or silhouette score is a metric
used to calculate the goodness of a clustering
technique. Its value ranges from -1 to 1.
• 1: Means clusters are well apart from each other and clearly distinguished (Best Case)
• 0: Means the distance between clusters is not significant, i.e., the clusters overlap (Overlapping Clusters)
• -1: Means data points are assigned to the wrong clusters (Worst Case)
• The Silhouette Score is a metric used to evaluate the quality of clustering in
unsupervised learning.

• It measures how well data points are clustered by considering both cohesion
(how close points are within a cluster) and separation (how distinct or far
away clusters are from one another).
Applications:

• The Silhouette Score helps in choosing the optimal number of clusters in algorithms like K-Means, Agglomerative Clustering, and DBSCAN.
• It provides a quick insight into how well-defined the clusters are without requiring labels.
• The Silhouette score is a metric used to evaluate how good clustering results are in data clustering.
• This score is calculated by measuring each data point's similarity to the cluster to which it belongs and how different it is from other clusters.
• The Silhouette score is commonly used to assess the performance of clustering algorithms like K-Means.
• Key characteristics of the Silhouette score include:
• It ranges from -1 to +1:
• Positive values indicate that data points belong to the
correct clusters, indicating good clustering results.
• A score of zero suggests overlapping clusters or data
points equally close to multiple clusters.
• Negative values indicate that data points are assigned
to incorrect clusters, indicating poor clustering results.
• A higher Silhouette score indicates better clustering
results.
• Therefore, the Silhouette score is an important criterion
used to evaluate the settings and outcomes of data
clustering algorithms.

• A high Silhouette score indicates more consistent and better clustering results, while a low score may indicate that data points are assigned to incorrect clusters or that the clustering algorithm is not suitable for the data.
• The make_blobs function creates a dummy dataset with a given number of samples (n_samples) and center points (centers).
Here, a dataset is created with 300 samples and 4 center points.

• The cluster_std parameter controls how far the created cluster points
are spread.
• The random_state parameter provides the repeatability of the
random number generation process.
• n_clusters is used as a variable specifying how many clusters will be
created.
• The KMeans class is used to implement the K-Means clustering
algorithm.

• random_state controls the random initialization of the algorithm.

• The fit_predict method clusters the data and predicts which cluster each data point belongs to.
• The results are stored in an array called labels.
• The Silhouette score is then used to evaluate how good the K-Means clustering results are.
• This score measures how well the dataset is clustered by measuring within-cluster similarity and out-of-cluster separation.
• The silhouette_score function calculates the Silhouette score from the dataset and the array of labels indicating which cluster each data point belongs to.
• The result is printed to the screen with the text "Silhouette Score".
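A runnable sketch of the example described above (n_samples=300 and centers=4 are taken from the description; cluster_std=1.0 and random_state=42 are illustrative choices):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Create a dummy dataset with 300 samples and 4 center points
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

# Cluster the data with K-Means; n_clusters specifies how many clusters to create
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
labels = kmeans.fit_predict(X)

# Measure within-cluster cohesion and between-cluster separation
score = silhouette_score(X, labels)
print("Silhouette Score:", score)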
Example:
Datapoints Cluster Label
A1 C1
A2 C1
A3 C2
A4 C2

Datapoint A1 A2 A3 A4
A1 0 0.10 0.65 0.55
A2 0.10 0 0.70 0.60
A3 0.65 0.70 0 0.30
A4 0.55 0.60 0.30 0
It is computed as:
• The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
• Silhouette Score = (b - a) / max(a, b)
where
• a = average intra-cluster distance, i.e., the average distance from the point to the other points within its own cluster
• b = average inter-cluster distance, i.e., the average distance from the point to the points in the nearest neighbouring cluster
First Table:

1. Lists data points (A1, A2, A3, A4) and their corresponding cluster labels (C1 or C2).
2. A1 and A2 belong to cluster C1, while A3 and A4 belong to cluster C2.

Second Table:

1. Contains a distance matrix showing pairwise distances between each data point (A1, A2, A3, A4).
2. The rows and columns represent the data points, and the values represent the distance between each pair.

Silhouette Score = (b - a) / max(a, b)
• Steps to calculate the Silhouette Score for each point:

For A1 and A2 (Cluster C1):
1. a(i): Calculate the average distance between A1 and A2 within the same cluster (C1). Similarly, calculate for A2 and A1.
2. b(i): For both A1 and A2, calculate the average distance to points in the nearest cluster (C2: A3 and A4).

For A3 and A4 (Cluster C2):
1. a(i): Calculate the average distance between A3 and A4 within the same cluster (C2).
2. b(i): For both A3 and A4, calculate the average distance to points in the nearest cluster (C1: A1 and A2).

• Finally, compute the Silhouette score for each point and take the average for the overall score.
For point A1:
a = 0.10/1 = 0.10
b = (0.65 + 0.55)/2 = 0.60
Silhouette Score = (b - a)/max(a, b) = (0.60 - 0.10)/0.60 = 0.833

For point A2:
a = 0.10/1 = 0.10
b = (0.70 + 0.60)/2 = 0.65
Silhouette Score = (b - a)/max(a, b) = (0.65 - 0.10)/0.65 = 0.846

For point A3:
a = 0.30/1 = 0.30
b = (0.65 + 0.70)/2 = 0.675
Silhouette Score = (b - a)/max(a, b) = (0.675 - 0.30)/0.675 = 0.555

For point A4:
a = 0.30/1 = 0.30
b = (0.55 + 0.60)/2 = 0.575
Silhouette Score = (b - a)/max(a, b) = (0.575 - 0.30)/0.575 = 0.478
Points A1 and A2 lie in cluster C1, so the Silhouette Score for cluster C1 is:
(0.833 + 0.846)/2 = 1.679/2 = 0.839

Points A3 and A4 lie in cluster C2, so the Silhouette Score for cluster C2 is:
(0.555 + 0.478)/2 = 1.033/2 = 0.5165

The Silhouette Score/Coefficient for the overall clustering problem is:
(0.839 + 0.5165)/2 = 0.6775, or about 0.678
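Because the two clusters have the same number of points, this average also matches the overall score computed directly from the distance matrix. A minimal check of the hand calculation, passing the precomputed distances to scikit-learn:

import numpy as np
from sklearn.metrics import silhouette_score

# Pairwise distance matrix and cluster labels taken from the tables above
distances = np.array([
    [0.00, 0.10, 0.65, 0.55],   # A1
    [0.10, 0.00, 0.70, 0.60],   # A2
    [0.65, 0.70, 0.00, 0.30],   # A3
    [0.55, 0.60, 0.30, 0.00],   # A4
])
labels = np.array([0, 0, 1, 1])  # C1, C1, C2, C2

# metric="precomputed" tells silhouette_score to use the distances directly
score = silhouette_score(distances, labels, metric="precomputed")
print(f"Overall Silhouette Score: {score:.3f}")  # ≈ 0.678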
import numpy as np
from sklearn.metrics import silhouette_score

# Five 2-D points and their (manually assigned) cluster labels
data = np.array([[3, 7], [9, 8], [5, 4], [2, 6], [5, 6]])
labels = np.array([0, 0, 1, 1, 1])

# Average silhouette score over all points
silhouette = silhouette_score(data, labels)
print("The average silhouette score is:", silhouette)
• Exercise: Silhouette Score
• A dataset contains 15 data points and has been clustered into 3
clusters. After computing the Silhouette Score for each point, the
results are as follows:
• Cluster 1: Silhouette Scores = {0.4, 0.5, 0.6}
• Cluster 2: Silhouette Scores = {0.2, 0.3, 0.35, 0.4, 0.1}
• Cluster 3: Silhouette Scores = {-0.2, 0.1, 0.05, 0.3, 0.4, 0.45, -0.1}
• Task:
• Compute the overall Silhouette Score for the clustering by averaging the
silhouette scores of all points.
• Interpret the clustering quality based on the average Silhouette Score.
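A minimal sketch of the averaging step for this exercise (the per-point scores are the hypothetical values listed above); the overall score is simply the mean over all 15 points:

import numpy as np

cluster_scores = [
    [0.4, 0.5, 0.6],                          # Cluster 1
    [0.2, 0.3, 0.35, 0.4, 0.1],               # Cluster 2
    [-0.2, 0.1, 0.05, 0.3, 0.4, 0.45, -0.1],  # Cluster 3
]
all_scores = np.concatenate([np.array(s) for s in cluster_scores])
print("Overall Silhouette Score:", all_scores.mean())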
• You have a dataset with 9 data points that has been clustered into
two clusters:
Cluster 1: {P1(1, 2), P2(2, 1), P3(2, 2)}
Cluster 2: {P4(10, 10), P5(11, 10), P6(12, 12), P7(50, 50), P8(52, 53),
P9(51, 52)}
• Task:
• Calculate the intra-cluster and inter-cluster distances for any point in Cluster
1.
• Discuss whether you expect the Silhouette Score to be high or low for this
clustering based on your findings.
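A rough sketch (not a full solution) of the distance bookkeeping for point P1 in this exercise, using scipy's cdist; a and b here follow the silhouette definitions above:

import numpy as np
from scipy.spatial.distance import cdist

cluster_1 = np.array([[1, 2], [2, 1], [2, 2]])            # P1, P2, P3
cluster_2 = np.array([[10, 10], [11, 10], [12, 12],
                      [50, 50], [52, 53], [51, 52]])      # P4-P9

p1 = cluster_1[:1]                                        # P1(1, 2)
a = cdist(p1, cluster_1[1:]).mean()  # average intra-cluster distance for P1
b = cdist(p1, cluster_2).mean()      # average distance from P1 to Cluster 2
print(f"a (intra) = {a:.2f}, b (inter) = {b:.2f}")
# For P1, b is far larger than a, so P1's own silhouette value is close to 1;
# whether the overall score is high also depends on the points in Cluster 2.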
• To implement the Silhouette Score for clustering evaluation, we can use the
scikit-learn library in Python. The Silhouette Score measures how similar an
object is to its own cluster (cohesion) compared to other clusters
(separation). Its value ranges from -1 to 1, where:
• A score of 1 indicates that the clusters are well apart from each other and the
points are well clustered.
• A score of 0 means that the points are on or very close to the decision
boundary between two clusters.
• A score of -1 indicates that the points are assigned to the wrong clusters.
• Explanation:
1.make_blobs: Generates a dataset with blobs (clusters). You can adjust
the number of clusters, number of samples, and other parameters.
2.KMeans: This performs K-means clustering on the dataset. You can
change the clustering algorithm based on your preference (DBSCAN,
Agglomerative, etc.).
3.silhouette_score: This function computes the Silhouette Score based on
the cluster labels produced by the clustering algorithm.
• Application of Silhouette Score
• The Silhouette Score is a metric used to assess how well clusters are
separated in a dataset. It helps to evaluate:
1.Cluster quality: It measures how close data points in one cluster are to
points in other clusters, determining the tightness and separation of
clusters.
2.Optimal number of clusters: It is used for model selection, particularly
when you're unsure how many clusters to use. By running clustering
algorithms for different numbers of clusters, you can compare Silhouette
Scores and select the model with the highest score.
3.Clustering algorithm comparison: The score can help compare different
clustering algorithms (like KMeans, DBSCAN, or Agglomerative) to select
the one that best fits the dataset.
• The score is typically used in:
• Customer segmentation
• Image segmentation
• Anomaly detection
• Social network analysis
• Biological data classification
• Advantages of Silhouette Score
1.Simple to compute: It is easy to implement and provides an intuitive
score that reflects the cohesion (within-cluster compactness) and
separation (between-cluster distinction) of the clusters.
2.Interpretability: The score ranges between -1 and 1, making it
straightforward to interpret the quality of clustering (positive values
indicate well-clustered data).
3.No need for ground truth labels: Since it’s an unsupervised learning
metric, it doesn't require labeled data for evaluation.
4.Works with any clustering algorithm: It can be used to evaluate the
clustering results of any algorithm, from KMeans to DBSCAN or
hierarchical clustering.
5.Model selection aid: Helps in determining the optimal number of
clusters for a dataset by computing the score for various cluster numbers
and comparing the results.
• Disadvantages of Silhouette Score
1.Scalability: The Silhouette Score is computationally expensive, particularly
for large datasets, as it requires pairwise distance calculations between
points and their cluster centroids.
2.Limited in high dimensions: In high-dimensional spaces, distance metrics
often become less meaningful, which can affect the accuracy of the
Silhouette Score.
3.Sensitivity to clusters of varying density: It assumes that clusters are
convex and roughly equal in size, so it may not perform well with clusters of
varying density or irregular shapes.
4.Distance-based limitations: Since it is based on distances, it might not be
ideal for clustering algorithms that use different methods of defining
clusters (e.g., DBSCAN’s density-based clusters).
5.Unable to handle overlapping clusters: If clusters have significant overlap,
the Silhouette Score might incorrectly suggest that these clusters are poorly
formed.
• Real-Time Example of Silhouette Score
• Customer Segmentation in Retail:
• A retail company wants to segment its customer base to offer personalized
marketing campaigns. The company gathers purchase data (e.g., total
purchases, frequency, categories of products purchased). Using clustering
(e.g., KMeans), the company groups customers into clusters based on their
purchase patterns.
• Application of Silhouette Score: After clustering, the company calculates the
Silhouette Score for various numbers of clusters (e.g., k = 2, 3, 4, 5, etc.) to
determine the optimal number of customer segments.
• Advantage: Helps the company ensure the clusters are well-formed (i.e.,
customers in the same group have similar purchasing behavior, while different
groups are distinct).
• Disadvantage: If the purchase patterns are highly overlapping or vary greatly
in density, the Silhouette Score might not be fully reliable in determining the
best number of customer segments.
Davies-Bouldin Index

• The Davies-Bouldin index (DBI) is a metric for assessing the separation and compactness of clusters.
• It is based on the idea that good clusters are those that
have low within-cluster variation and high between-
cluster separation.
• The minimum score is zero, with lower values indicating
better clustering.
• The lower the DB index value, the better the clustering. It also has a drawback: a good value reported by this method does not imply the best information retrieval.
• Davies-Bouldin Index (DBI)
• The Davies-Bouldin Index (DBI) is a clustering evaluation metric used to
assess the quality of clusters generated by a clustering algorithm.

• It measures the average similarity ratio of each cluster to its most similar
cluster, considering the intra-cluster distance (how compact a cluster is) and
the inter-cluster distance (how separated clusters are).

• The goal of clustering is to have compact clusters that are well separated, and DBI provides a way to measure this.
• It is computed as:
  DBI = (1/k) * Σ_i max_{j ≠ i} [ (s_i + s_j) / d(c_i, c_j) ]
Where,
• k is the number of clusters,
• s_i is the average distance of the points in cluster i to its centroid c_i (compactness), and
• d(c_i, c_j) is the distance between the centroids of clusters i and j (separation).
• Implementation of Davies-Bouldin Index
• The Davies-Bouldin Index can be computed using the scikit-learn library in Python. Here's an example:

# Import necessary libraries
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Step 1: Create a sample dataset
X, y = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.0, random_state=42)

# Step 2: Perform clustering using KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Step 3: Calculate the Davies-Bouldin Index
dbi_score = davies_bouldin_score(X, y_kmeans)

# Output the score
print(f'Davies-Bouldin Index: {dbi_score:.4f}')
• Explanation:
1.make_blobs: Generates synthetic data with predefined clusters.
2.KMeans: Performs clustering on the dataset.
3.davies_bouldin_score: This function computes the Davies-Bouldin Index for
the clustering result. The lower the DBI, the better the clustering (0
indicates perfect clustering).
• Application of Davies-Bouldin Index
• The DBI is applied in scenarios where the quality of clustering needs to be
evaluated, especially in unsupervised learning. Some common applications
include:
• Customer segmentation: Grouping customers based on behavior, purchases, or
demographics to create distinct segments for targeted marketing.
• Image segmentation: Evaluating the quality of clusters formed when
segmenting images into meaningful regions (e.g., separating background and
foreground).
• Anomaly detection: Identifying unusual or outlying data points by clustering
normal data points and flagging outliers.
• Product categorization: Grouping products into categories based on features
like price, reviews, and user preferences.
• In each of these applications, DBI helps determine how well the clustering
algorithm has performed by measuring the compactness of clusters and how
distinct they are from each other.
• Advantages of Davies-Bouldin Index
1.Model evaluation without ground truth: Like the Silhouette Score, DBI
doesn’t require labeled data, making it a suitable metric for unsupervised
learning.
2.Simple to compute: DBI is straightforward to implement using standard
libraries and requires relatively little computation for small to medium-sized
datasets.
3.Cluster separation and cohesion: It evaluates both the compactness of
clusters (intra-cluster distance) and the separation between clusters (inter-
cluster distance), giving a holistic view of clustering quality.
4.Helps in choosing the number of clusters: It can be used to determine the
optimal number of clusters by comparing DBI scores for different values of k
(number of clusters).
• Disadvantages of Davies-Bouldin Index
1.Sensitive to noise: The DBI score can be heavily influenced by outliers or
noise in the data, which can cause misleading evaluations.
2.Prefers spherical clusters: It assumes that clusters are spherical and equally
sized, so it may not perform well with irregular or elongated clusters (similar
to the Silhouette Score).
3.Less interpretable than other metrics: While it provides a useful number,
the interpretation of this number isn’t always intuitive (lower is better, but
the exact meaning of a specific score is less clear).
4.Scalability: It becomes computationally expensive as the dataset grows
larger because it calculates pairwise distances between cluster centroids
and within clusters.
• Real-Time Example of Davies-Bouldin Index
• Example: Image Segmentation in Medical Imaging
• A healthcare organization is working with MRI scans to segment brain tissues
into different regions for medical analysis. The organization uses a clustering
algorithm (e.g., KMeans) to partition the pixels of MRI scans into clusters
representing different tissue types.
• Application of DBI: After performing image segmentation, the Davies-Bouldin
Index is used to evaluate how well the clusters (segmented tissues) are
formed. A lower DBI score indicates that the segmentation is likely to be more
accurate, with distinct regions for each tissue type.
• Advantage: Helps ensure the clusters of tissues are compact (similar tissues
are grouped together) and well-separated from other clusters (different tissue
types are clearly distinguishable).
• Disadvantage: If the MRI data contains noise or artifacts, DBI could be skewed,
giving misleading results and suggesting poor clustering when the clusters are
actually useful for the task at hand.
Example
• Data Point 1: (2, 3)
• Data Point 2: (3, 2)
• Data Point 3: (8, 8)
• Data Point 4: (9, 7)
• Data Point 5: (15, 14)
• Data Point 6: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with K = 2. After clustering, the data points are divided into two clusters:
• Cluster 1: {Data Point 1, Data Point 2}
• Cluster 2: {Data Point 3, Data Point 4, Data Point 5, Data Point 6}
• The centroids of these clusters are approximately:
• Centroid of Cluster 1: (2.5, 2.5); Centroid of Cluster 2: (12, 10.5)
• Now, let's calculate the distance between the centroids of the two clusters:
• Distance between the Cluster 1 centroid and the Cluster 2 centroid:
• sqrt((12 - 2.5)^2 + (10.5 - 2.5)^2) = sqrt(90.25 + 64) ≈ 12.42
• Next, for each cluster, let's calculate the average distance of its data points to its centroid:
• For Cluster 1:
• s1 = [sqrt((2.5 - 2)^2 + (2.5 - 3)^2) + sqrt((2.5 - 3)^2 + (2.5 - 2)^2)] / 2 ≈ 0.71
• For Cluster 2:
• s2 = [sqrt((12 - 8)^2 + (10.5 - 8)^2) + sqrt((12 - 9)^2 + (10.5 - 7)^2) + sqrt((12 - 15)^2 + (10.5 - 14)^2) + sqrt((12 - 16)^2 + (10.5 - 13)^2)] / 4 ≈ 4.66
• Now, calculate the similarity ratio between the two clusters:
• R12 = (s1 + s2) / d(c1, c2) = (0.71 + 4.66) / 12.42 ≈ 0.43
• With only two clusters, the worst-case (maximum) ratio for each cluster is R12 itself, so:
• DBI = (0.43 + 0.43) / 2 ≈ 0.43
• In this example, the Davies-Bouldin Index for this clustering solution is approximately 0.43. Lower values of the Davies-Bouldin Index indicate better-defined clusters, where data points within clusters are close to each other and far from points in other clusters.
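A small check of this worked example with scikit-learn's davies_bouldin_score, using the six points and the stated two-cluster assignment:

import numpy as np
from sklearn.metrics import davies_bouldin_score

X = np.array([[2, 3], [3, 2], [8, 8], [9, 7], [15, 14], [16, 13]])
labels = np.array([0, 0, 1, 1, 1, 1])  # Cluster 1: first two points, Cluster 2: the rest

print(f"Davies-Bouldin Index: {davies_bouldin_score(X, labels):.2f}")  # ≈ 0.43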
import numpy as np
from sklearn.metrics import davies_bouldin_score

data1 = np.array([[6, 8], [9, 5], [5, 4], [2, 6], [5, 6], [3, 4]])
labels1 = np.array([0, 0, 1, 1, 1, 0])

davies_bouldin_index = davies_bouldin_score(data1, labels1)
print("Davies-Bouldin Index:", davies_bouldin_index)
Dunn index:
• A metric for evaluating clustering algorithms, is an internal evaluation scheme,
where the result is based on the clustered data itself.
• Like all other such indices, the aim of the Dunn index is to identify sets of clusters that are compact, with a small variance between members of the cluster, and well separated, where the means of different clusters are sufficiently far apart, as compared to the within-cluster variance.
• Higher the Dunn index value, better is the clustering.
• The number of clusters that maximizes Dunn index is taken as the optimal number
of clusters k.
It is computed as:
Dunn Index = (minimum inter-cluster distance) / (maximum intra-cluster distance)
Where,
• the inter-cluster distance between two clusters is taken here as the minimum distance between a point in one cluster and a point in the other, and
• the intra-cluster distance of a cluster is its diameter, i.e., the maximum distance between any two points within it.
• Implementation of Dunn Index
• The Dunn Index is not available directly in libraries like scikit-learn, but it
can be computed using custom code.
• The formula involves calculating the minimum distance between clusters
(inter-cluster separation) and dividing it by the maximum intra-cluster
distance (compactness). Here's a basic implementation:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

# Function to calculate intra-cluster distances (cluster diameters)
def calculate_intra_cluster_distances(X, labels, num_clusters):
    intra_distances = []
    for i in range(num_clusters):
        cluster_i = X[labels == i]
        if len(cluster_i) > 1:
            intra_distances.append(np.max(cdist(cluster_i, cluster_i)))
    return intra_distances

# Function to calculate the Dunn Index
def dunn_index(X, labels):
    unique_clusters = np.unique(labels)
    num_clusters = len(unique_clusters)

    # Calculate intra-cluster distances
    intra_distances = calculate_intra_cluster_distances(X, labels, num_clusters)

    # Calculate inter-cluster distances
    inter_distances = []
    for i in range(num_clusters):
        for j in range(i + 1, num_clusters):
            cluster_i = X[labels == unique_clusters[i]]
            cluster_j = X[labels == unique_clusters[j]]
            inter_distances.append(np.min(cdist(cluster_i, cluster_j)))

    # Calculate Dunn Index
    if intra_distances and inter_distances:
        return np.min(inter_distances) / np.max(intra_distances)
    else:
        return np.inf

# Step 1: Create a sample dataset
X, _ = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1.0, random_state=42)

# Step 2: Perform clustering using KMeans
kmeans = KMeans(n_clusters=4, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Step 3: Calculate the Dunn Index
dunn = dunn_index(X, y_kmeans)
print(f'Dunn Index: {dunn:.4f}')
• Explanation:
1.cdist: This function calculates the pairwise distances between points in
clusters.
2.intra-cluster distance: Maximum distance between points in the same
cluster (compactness).
3.inter-cluster distance: Minimum distance between points from different
clusters (separation).
4.Dunn Index: The ratio of the minimum inter-cluster distance to the
maximum intra-cluster distance.
• A higher Dunn Index value indicates better clustering, as it implies that the clusters are compact and well-separated.
• Let's go through a numerical example to illustrate the concept.
• Consider a dataset with 8 data points in a 2D space:
• Data Point 1: (2, 3), Data Point 2: (3, 2), Data Point 3: (4, 3), Data Point 4: (8, 8), Data Point 5: (9, 7), Data Point 6: (10, 8), Data Point 7: (15, 14), Data Point 8: (16, 13)
• Let's say we want to perform K-means clustering on this dataset with
K = 2. After clustering, the data points are divided into two clusters:
• Cluster 1: {Data Point 1, Data Point 2, Data Point 3} Cluster 2: {Data
Point 4, Data Point 5, Data Point 6, Data Point 7, Data Point 8}
• Now, let's calculate the inter-cluster distance (minimum distance between any two points from different clusters) and the intra-cluster distances (maximum distance between any two points within the same cluster):
• Inter-cluster distance:
• Between Cluster 1 and Cluster 2:
• Min inter-cluster distance = distance(Data Point 3, Data Point 4) = sqrt((8 - 4)^2 + (8 - 3)^2) = sqrt(41) ≈ 6.40
• Intra-cluster distances:
• For Cluster 1:
• Max intra-cluster distance = max(distance(Data Point 1, Data Point 2), distance(Data Point 1, Data Point 3), distance(Data Point 2, Data Point 3)) = max(sqrt(2), 2, sqrt(2)) = 2
• For Cluster 2:
• Max intra-cluster distance = distance(Data Point 4, Data Point 8) = sqrt((16 - 8)^2 + (13 - 8)^2) = sqrt(89) ≈ 9.43 (the largest of all pairwise distances within Cluster 2)
• Now, calculate the Dunn Index:
• Dunn Index = Min inter-cluster distance / Max intra-cluster distance ≈ 6.40 / 9.43 ≈ 0.68
• In this example, the Dunn Index for this clustering solution is approximately 0.68. A higher Dunn Index indicates that the clusters are well-separated and compact within themselves, which is desirable for a good clustering solution.
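A quick check of this example, reusing the dunn_index function defined in the implementation above (assumed to be available in the same session):

import numpy as np

X = np.array([[2, 3], [3, 2], [4, 3], [8, 8], [9, 7], [10, 8], [15, 14], [16, 13]])
labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # Cluster 1: first three points

print(f"Dunn Index: {dunn_index(X, labels):.2f}")  # ≈ 0.68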
Example
• Let's work through a numerical example to calculate the Dunn Index for a simple set of clusters.
Suppose you have the following data points and their corresponding clusters:
Data Points:
• A(1, 2)
• B(2, 2)
• C(5, 8)
• D(6, 8)
• E(10, 12)
• F(11, 12)
Clusters:
• Cluster 1: {A, B}
• Cluster 2: {C, D}
• Cluster 3: {E, F}
• Distance between Cluster 1 and Cluster 2 (d(1, 2)):
• d(A, C) = sqrt((1-5)^2 + (2-8)^2) = sqrt(52) ≈ 7.21
• d(A, D) = sqrt((1-6)^2 + (2-8)^2) = sqrt(61) ≈ 7.81
• d(B, C) = sqrt((2-5)^2 + (2-8)^2) = sqrt(45) ≈ 6.71
• d(B, D) = sqrt((2-6)^2 + (2-8)^2) = sqrt(52) ≈ 7.21
• So, d(1, 2) = sqrt(45) ≈ 6.71
• Distance between Cluster 1 and Cluster 3 (d(1, 3)):
• d(A, E) = sqrt((1-10)^2 + (2-12)^2) = sqrt(181) ≈ 13.45
• d(A, F) = sqrt((1-11)^2 + (2-12)^2) = sqrt(200) ≈ 14.14
• d(B, E) = sqrt((2-10)^2 + (2-12)^2) = sqrt(164) ≈ 12.81
• d(B, F) = sqrt((2-11)^2 + (2-12)^2) = sqrt(181) ≈ 13.45
• So, d(1, 3) = sqrt(164) ≈ 12.81
• Distance between Cluster 2 and Cluster 3 (d(2, 3)):
• d(C, E) = sqrt((5-10)^2 + (8-12)^2) = sqrt(41) ≈ 6.40
• d(C, F) = sqrt((5-11)^2 + (8-12)^2) = sqrt(52) ≈ 7.21
• d(D, E) = sqrt((6-10)^2 + (8-12)^2) = sqrt(32) ≈ 5.66
• d(D, F) = sqrt((6-11)^2 + (8-12)^2) = sqrt(41) ≈ 6.40
• So, d(2, 3) = sqrt(32) ≈ 5.66
• Now, let's calculate the minimum inter-cluster distance and the maximum intra-cluster distance:
• Minimum Inter-cluster Distance: min(d(1, 2), d(1, 3), d(2, 3)) = min(6.71, 12.81, 5.66) = 5.66
• Maximum Intra-cluster Distance: max(d(A, B), d(C, D), d(E, F)) = max(1, 1, 1) = 1
• Finally, we can calculate the Dunn Index:
• Dunn Index = Minimum Inter-cluster Distance / Maximum Intra-cluster Distance = 5.66 / 1 = 5.66
• So, the Dunn Index for these clusters is approximately 5.66.
Data Point Coordinates Cluster
1 (2, 3) A
2 (2, 5) A
3 (3, 8) B
4 (6, 5) B
5 (8, 8) C
6 (9, 6) C
7 (10, 2) A
8 (12, 4) A
9 (15, 7) C
10 (17, 5) C
Drawbacks of Dunn index:
• As the number of clusters and dimensionality of the
data increase, the computational cost also increases.
• Application of Dunn Index
• The Dunn Index is particularly useful when trying to determine the quality
of clusters in fields where compact and well-separated clusters are
desirable, such as:
• Bioinformatics: Grouping genes or proteins based on similarity in sequence
or structure.
• Market segmentation: Dividing customers into homogeneous groups that
are distinctly different from other groups.
• Social network analysis: Identifying tightly-knit communities that are well
separated from each other.
• Image processing: Segmenting images into distinct regions that represent
objects or areas of interest.
• Advantages of Dunn Index
1.Emphasizes compactness and separation: The Dunn Index prioritizes
clusters that are both compact and well-separated, making it an ideal
measure for high-quality clustering.
2.Helps in selecting the number of clusters: It can be used to compare
different clustering results with varying numbers of clusters and select the
one with the highest Dunn Index.
3.Works with any clustering algorithm: Like other clustering evaluation
metrics, the Dunn Index can be applied to results from various algorithms
(e.g., KMeans, DBSCAN, hierarchical clustering).
• Disadvantages of Dunn Index
1.Computationally expensive: Calculating pairwise distances between points
across clusters makes the Dunn Index computationally expensive, especially
for large datasets.
2.Sensitive to noise: The presence of noise or outliers in the data can
significantly affect the Dunn Index, leading to lower scores even for well-
formed clusters.
3.Limited interpretability: While a higher Dunn Index indicates better
clustering, it is hard to interpret the meaning of specific values without a
reference.
4.Not widely available in libraries: Unlike other metrics like Silhouette Score
or Davies-Bouldin Index, the Dunn Index is not directly available in popular
libraries like scikit-learn, requiring custom implementation.
• Real-Time Example of Dunn Index
• Example: Gene Expression Clustering in Bioinformatics
• In bioinformatics, researchers often cluster gene expression data to identify
groups of genes with similar expression patterns across different conditions
or time points. These clusters can help identify genes involved in the same
biological processes.
• Application of Dunn Index: After clustering gene expression data,
researchers use the Dunn Index to evaluate the quality of clusters, ensuring
that genes within the same cluster are similar and different from those in
other clusters.
• Advantage: The Dunn Index helps ensure that the gene clusters are well-
formed, providing more reliable insights into gene function.
• Disadvantage: Due to the high dimensionality and noisy nature of gene
expression data, the Dunn Index can be computationally expensive and may
be sensitive to outliers.
• Conclusion
• The Dunn Index is a valuable clustering evaluation metric that emphasizes both compactness and separation of clusters.
• It is particularly useful in applications like bioinformatics, social network analysis, and market segmentation, where compact and well-separated clusters are critical.
• However, its computational complexity and sensitivity to noise mean it should be used alongside other metrics, especially for large or noisy datasets.
• Ground truth labels refer to the actual, real-world labels or categories associated with the data points in a dataset.
• These labels are considered the "true" classification or segmentation for the data and are used as a standard to evaluate the performance of machine learning models, especially in supervised learning and clustering tasks.
Adjusted Rand Index (ARI)
• It measures the similarity between the true class labels and the
clusters generated by a clustering algorithm while accounting for
chance agreement.
• The ARI produces a score between -1 and 1, where higher values
indicate better agreement between the predicted clusters and the
true labels.
• Contingency table: m x n
• m = number of clusters produced by algorithm C1
• n = number of clusters produced by algorithm C2 (Ground Truth)
• Adjusted Rand Index (ARI) is a clustering evaluation metric that measures the similarity between two clusterings by comparing how many pairs of points are placed in the same or different clusters in both the true and predicted clusterings.
• ARI adjusts for the chance grouping of points, providing a normalized measure that ranges from -1 (complete disagreement) to 1 (perfect agreement), with 0 indicating random labeling.
To use the Adjusted Rand Index for evaluating clustering, follow these steps:

Perform Clustering: Apply a clustering algorithm to your dataset to create clusters of data points. This could be an algorithm like k-means, hierarchical clustering, DBSCAN, or any other clustering technique.

Obtain True Class Labels: If you have access to ground truth labels or class assignments for your data, this is ideal. These true labels represent the actual groups or categories that your data points belong to.
The ARI is computed as:

ARI = (RI - Expected(RI)) / (max(RI) - Expected(RI))

where RI is the Rand Index, which measures the similarity between two clusterings, and Expected(RI) is the expected similarity under random clustering.
Example
• Suppose you have two clustering results for a dataset with 100 data points.
• Create a contingency table that shows the number of data points in common between the two clusterings. Rows represent clusters in Clustering A, and columns represent clusters in Clustering B:

                 | Cluster 1 (B) | Cluster 2 (B) | Row total
  Cluster 1 (A)  |      30       |      20       |    50
  Cluster 2 (A)  |      40       |      10       |    50
  Column total   |      70       |      30       |   100

• Using the pair-counting form of the ARI:
• Index = C(30,2) + C(20,2) + C(40,2) + C(10,2) = 435 + 190 + 780 + 45 = 1450
• Expected Index = [C(50,2) + C(50,2)] x [C(70,2) + C(30,2)] / C(100,2) = (2450 x 2850) / 4950 ≈ 1410.61
• Max Index = [2450 + 2850] / 2 = 2650
• ARI = (1450 - 1410.61) / (2650 - 1410.61) ≈ 0.032

Conclusion
The Adjusted Rand Index (ARI) for the given contingency table is approximately 0.032, which indicates that the clustering result is very close to random.
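A small sketch that rebuilds the label arrays from the contingency table above and checks the ARI with scikit-learn:

import numpy as np
from sklearn.metrics import adjusted_rand_score

# 30 points fall in (A1, B1), 20 in (A1, B2), 40 in (A2, B1), 10 in (A2, B2)
labels_a = np.repeat([0, 0, 1, 1], [30, 20, 40, 10])
labels_b = np.repeat([0, 1, 0, 1], [30, 20, 40, 10])

print("Adjusted Rand Index:", adjusted_rand_score(labels_a, labels_b))  # ≈ 0.032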
from sklearn.metrics import adjusted_rand_score
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans

# Generate a synthetic dataset
X, true_labels = make_classification(n_samples=300, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     n_clusters_per_class=1, n_classes=3,
                                     random_state=42)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
predicted_labels = kmeans.fit_predict(X)

# Compute the Adjusted Rand Index
ari_score = adjusted_rand_score(true_labels, predicted_labels)
print(f"Adjusted Rand Index: {ari_score}")
• n_clusters_per_class: This is a valid parameter in make_classification, but you
must ensure that other parameters like n_informative and n_redundant are
compatible.
• Set n_redundant=0: This specifies that no features are generated as linear
combinations of the informative features.
• Set n_informative=2: Ensure that the number of informative features is
compatible with n_features=2.
• Added n_classes=3: This ensures that the generated dataset has 3 classes,
compatible with the KMeans algorithm where you’re clustering into 3 clusters.
• Practical Use Cases of ARI:
• Biology: For evaluating clustering of gene expression data where the true
groupings (cell types) are known.
• Social Network Analysis: Measuring similarity between predicted
communities and actual groups in a network.
• Marketing: Evaluating the performance of customer segmentation
algorithms by comparing the predicted clusters with known customer
segments.
• Advantages of Using ARI:
1.Corrects for Random Chance: ARI adjusts for the possibility of random
clustering. This makes it more reliable compared to basic clustering
metrics like Rand Index.
2.Handles Different Cluster Sizes: ARI works well even when clusters have
different sizes, providing a normalized score.
3.Symmetry: ARI is symmetric, meaning the result will be the same if the
true labels and predicted labels are swapped, ensuring consistent
performance evaluation.
4.Applicable in Multi-Class Clustering: It can handle more than two clusters
and is applicable in complex clustering scenarios where multiple true
clusters exist.
5.Works with Partitions: ARI is designed to evaluate clusterings as
partitions, making it ideal for datasets where all instances are assigned to
a single cluster.
• Disadvantages of Using ARI:
1.Requires Ground Truth: ARI relies on having known, true labels for the data.
This limits its applicability to supervised settings, and it cannot be used for
pure unsupervised learning where labels are unavailable.
2.Not Interpretable Without Labels: ARI has little interpretability without
comparing the predicted clusters to known labels, making it less useful for
purely exploratory data analysis.
3.Sensitive to Number of Clusters: The score can be highly sensitive to the
number of clusters used in the algorithm, particularly if the number of true
clusters and predicted clusters differ significantly.
4.Computationally Intensive: For large datasets, calculating ARI can become
computationally expensive as it requires comparing pairs of elements. This
makes it slower for large-scale problems.
5.Does Not Handle Noise Well: ARI is not designed to work with noise or
outliers. If your clustering algorithm identifies noise (e.g., DBSCAN labels
noise points), ARI might not be the best evaluation metric.
• Real-Time Examples of ARI Application:
• 1. Customer Segmentation in Marketing
• Example: A company wants to group its customers into different segments
based on their purchasing behavior. They apply various clustering
algorithms (e.g., K-Means, DBSCAN) and evaluate the effectiveness of each
clustering against known segments (based on previous customer profiles).
• ARI Usage: ARI can be used to compare the predicted clusters (from the
clustering algorithm) with the true customer segments, ensuring that the
algorithm identifies similar customer groups as the ground truth.
• 2. Gene Expression Data in Bioinformatics
• Example: In genetics, clustering is used to find groups of similar gene
expression profiles across various conditions (e.g., normal vs. cancerous
cells). In cases where some labels are known (e.g., cancer vs. healthy
tissue), ARI can be applied to evaluate the clustering results against these
known labels.
• ARI Usage: ARI helps in assessing how well the clustering algorithm groups
similar gene expression profiles together and separates different profiles.
• 3. Document Clustering in Natural Language Processing (NLP)
• Example: A company needs to group large sets of documents or articles
into topics (e.g., sports, politics, tech). After clustering, they can compare
the clustering results with known topics or categories using ARI.
• ARI Usage: ARI can measure how well the algorithm has grouped similar
documents into the correct topics by comparing it with labeled datasets.
• 4. Community Detection in Social Networks
• Example: In a social network analysis, clusters (communities) of users can
be identified based on interactions, friendship connections, etc. If you
have known communities (e.g., groups of friends, families), ARI can be
used to measure how well the clustering algorithm detects these groups.
• ARI Usage: ARI will evaluate whether the predicted communities match
the real communities or known labels of social interactions.
• In Classification: Ground truth labels are the true class labels for each
instance in the dataset. For example, in a dataset of images of cats and
dogs, the ground truth labels would indicate whether each image is a "cat"
or a "dog."
• In Clustering: Ground truth labels represent the correct grouping of data
points. For example, if you're clustering customers into segments, the
ground truth labels might indicate the actual customer segments based on
their purchasing behavior.
• Uses of Ground Truth Labels:
• Model Evaluation: Ground truth labels are used to compare the
predictions of a machine learning model and compute performance
metrics such as accuracy, precision, recall, F1-score, or clustering
evaluation metrics like Adjusted Rand Index (ARI).
• Supervised Learning: In supervised learning, the model is trained using
data points along with their ground truth labels, so the model learns to
associate input features with the correct label.
• Clustering and Segmentation: For evaluating clustering algorithms, ground
truth labels are used to determine how well the clusters generated by the
algorithm align with the actual labels of the data points.
• Real-World Examples:
• Image Classification: In an image classification task, the ground truth label
is the correct class assigned to an image (e.g., "dog," "cat," or "bird").
• Medical Diagnosis: In a medical dataset, ground truth labels might
represent whether a patient has a specific disease or not, based on a
doctor's diagnosis.
• Speech Recognition: In a dataset of audio recordings, ground truth labels
would be the actual spoken words in each audio clip.
• Importance:
• Model Training: In supervised learning, ground truth labels are essential
for training the model, allowing it to learn the correct output for various
inputs.
• Performance Comparison: By comparing the predicted labels with the
ground truth, you can measure the accuracy and reliability of a model.
NMI
• NMI is a good measure for determining the
quality of clustering.
• It is an external measure because we need the
class labels of the instances to determine the
NMI.
• Since it’s normalized we can measure and
compare the NMI between different
clusterings having different number of
clusters.
• Normalized Mutual Information (NMI) is a metric used to evaluate the similarity between two clustering results.
• It measures how much information is shared between two partitions (clusterings) of a dataset, normalized to ensure values range between 0 and 1.
• A value of 1 indicates perfect correlation, while 0 indicates that the clusterings are completely independent.
• Advantages of NMI
• Scale Independence: NMI ranges from 0 to 1, making it easy to interpret.
A score close to 1 means a high correlation between two clusterings, while
a score close to 0 implies minimal similarity.
• Symmetry: NMI is symmetric, meaning the result is the same regardless of
which clustering is used as the reference.
• Robust to Noise: NMI is robust to small changes or noise in the data
because it focuses on the shared information between clusters.
• Disadvantages of NMI
• Assumption of Mutual Dependence: NMI assumes that clusters are
mutually dependent. This might not always be true, especially in complex
datasets where some clusters could have non-linear or hierarchical
relationships.
• Interpretability with Overlapping Clusters: NMI might not work well in
scenarios where clusters overlap significantly, as it assumes distinct
partitioning.
• Computational Cost: NMI computation can be expensive for very large
datasets, particularly in situations with many clusters.
• Real-time Example and Use Case of NMI
• Real-time Example: NMI can be applied to evaluate clustering algorithms
in customer segmentation. Assume you have a dataset of customer
purchase behaviors, and you apply two different clustering methods (e.g.,
KMeans and hierarchical clustering) to segment your customers. NMI can
be used to evaluate how similarly these two clustering approaches
categorize the same customers, which is crucial in understanding if
different clustering methods yield consistent customer groupings.
• Use Case: In image segmentation, NMI can be used to compare the
clustering result of pixels produced by two different image segmentation
algorithms, ensuring that the segmented areas of the image are consistent
across different approaches.
from sklearn.metrics import normalized_mutual_info_score

# True and predicted labels
U = [0, 0, 1, 1, 2]
V = [1, 1, 0, 0, 2]

# Calculate NMI
nmi_score = normalized_mutual_info_score(U, V)
print(f"Normalized Mutual Information Score: {nmi_score}")
Calculating NMI for Clustering
• Assume m = 3 classes and k = 2 clusters, with 20 instances in total.
• Cluster-1 (C=1) contains 10 instances: 3 of Class-1 (Y=1, triangles), 3 of Class-2 (Y=2, rectangles), and 4 of Class-3 (Y=3, stars).
• Cluster-2 (C=2) contains 10 instances: 2 of Class-1, 7 of Class-2, and 1 of Class-3.

H(Y) = Entropy of Class Labels
• P(Y=1) = 5/20 = 1/4
• P(Y=2) = 10/20 = 1/2
• P(Y=3) = 5/20 = 1/4
• H(Y) = -(1/4)log2(1/4) - (1/2)log2(1/2) - (1/4)log2(1/4) = 1.5
This is calculated for the entire dataset and can be calculated prior to clustering, as it will not change depending on the clustering output.

H(C) = Entropy of Cluster Labels
• P(C=1) = 10/20 = 1/2
• P(C=2) = 10/20 = 1/2
• H(C) = -(1/2)log2(1/2) - (1/2)log2(1/2) = 1
This will be calculated every time the clustering changes. You can see that the clusters are balanced (have an equal number of instances).

I(Y;C) = Mutual Information
• Mutual information is given as: I(Y;C) = H(Y) - H(Y|C)
  – We already know H(Y)
  – H(Y|C) is the entropy of class labels within each cluster; how do we calculate this?
• Mutual Information tells us the reduction in the entropy of class labels that we get if we know the cluster labels (similar to information gain in decision trees).

H(Y|C): conditional entropy of class labels for clustering C
• Consider Cluster-1:
  – P(Y=1|C=1) = 3/10 (three triangles in cluster-1)
  – P(Y=2|C=1) = 3/10 (three rectangles in cluster-1)
  – P(Y=3|C=1) = 4/10 (four stars in cluster-1)
  – Weighted conditional entropy for Cluster-1:
    -P(C=1) x [ (3/10)log2(3/10) + (3/10)log2(3/10) + (4/10)log2(4/10) ] ≈ 0.7855
• Now, consider Cluster-2:
  – P(Y=1|C=2) = 2/10 (two triangles in cluster-2)
  – P(Y=2|C=2) = 7/10 (seven rectangles in cluster-2)
  – P(Y=3|C=2) = 1/10 (one star in cluster-2)
  – Weighted conditional entropy for Cluster-2:
    -P(C=2) x [ (2/10)log2(2/10) + (7/10)log2(7/10) + (1/10)log2(1/10) ] ≈ 0.5784

I(Y;C)
• Finally, the mutual information is:
  I(Y;C) = H(Y) - H(Y|C) = 1.5 - (0.7855 + 0.5784) = 0.1361
• The NMI is therefore:
  NMI(Y,C) = 2 x I(Y;C) / (H(Y) + H(C)) = (2 x 0.1361) / (1.5 + 1) ≈ 0.109
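A small check of this worked example: rebuild the 20 class labels and the 2 cluster labels from the counts above and compare with scikit-learn's NMI (which uses the same arithmetic-mean normalization by default):

import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# Cluster 1: 3 of class 1, 3 of class 2, 4 of class 3
# Cluster 2: 2 of class 1, 7 of class 2, 1 of class 3
class_labels = np.repeat([1, 2, 3, 1, 2, 3], [3, 3, 4, 2, 7, 1])
cluster_labels = np.repeat([1, 2], [10, 10])

nmi = normalized_mutual_info_score(class_labels, cluster_labels)
print(f"NMI: {nmi:.3f}")  # ≈ 0.109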
Calculate NMI for the following
Homogeneity
• A clustering result satisfies homogeneity if all of its
clusters contain only data points which are members of
a single class.
• This metric is independent of the absolute values of the
labels: a permutation of the class or cluster label values
won’t change the score value in any way.
• Syntax : sklearn.metrics.homogeneity_score(labels_tru
e, labels_pred)
• The metric is not symmetric: switching labels_true with labels_pred will return the completeness_score.
Parameters :
•labels_true:<int array, shape = [n_samples]> : It accept the
ground truth class labels to be used as a reference.
•labels_pred: <array-like of shape (n_samples,)>: It accepts the
cluster labels to evaluate.
Returns:
homogeneity: <float>: It returns a score between 0.0 and 1.0, where 1.0 stands for perfectly homogeneous labeling.
• To calculate homogeneity numerically, you can use the following formula:
  Homogeneity, h = 1 - H(C|K) / H(C)
  where H(C|K) is the conditional entropy of the class labels given the cluster assignments, and H(C) is the entropy of the class labels.
Completeness score
• This score is complementary to the previous one. Its
purpose is to provide a piece of information about the
assignment of samples belonging to the same class.
• More precisely, a good clustering algorithm should
assign all samples with the same true label to the same
cluster.
Completeness portrays the closeness of the clustering algorithm to this
(completeness_score) perfection.
This metric is independent of the absolute values of the labels: a permutation of the cluster label values won't change the score value in any way.

sklearn.metrics.completeness_score()
Syntax: sklearn.metrics.completeness_score(labels_true, labels_pred)

•labels_true: <int array, shape = [n_samples]>: It accepts the ground truth class labels to be used as a reference.
•labels_pred: <array-like of shape (n_samples,)>: It accepts the cluster labels to evaluate.
Returns: completeness score between 0.0 and 1.0. 1.0 stands for perfectly complete labeling.
V-Measure
One of the primary disadvantages of any clustering technique is that it is difficult to evaluate its performance. To tackle this problem, the metric of V-Measure was developed. The calculation of the V-Measure first requires the calculation of two terms:

1. Homogeneity: A perfectly homogeneous clustering is one where each cluster has data points belonging to the same class label. Homogeneity describes the closeness of the clustering algorithm to this perfection.
2. Completeness: A perfectly complete clustering is one where all data points belonging to the same class are clustered into the same cluster. Completeness describes the closeness of the clustering algorithm to this perfection.

sklearn.metrics.v_measure_score(labels_true, labels_pred, *, beta=1.0)

The V-measure is the harmonic mean between homogeneity and completeness:

v = (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)
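A brief sketch tying the three scores together on a toy labeling (the label arrays here are illustrative, not from the slides):

from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print("Homogeneity: ", homogeneity_score(labels_true, labels_pred))
print("Completeness:", completeness_score(labels_true, labels_pred))
print("V-Measure:   ", v_measure_score(labels_true, labels_pred))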
Fowlkes-Mallows Score
• The Fowlkes-Mallows Score is an evaluation metric used to evaluate the similarity between clusterings obtained by applying different clustering algorithms.
• Although technically it is used to quantify the similarity between two clusterings, it is typically used to evaluate the clustering performance of a clustering algorithm by taking the second clustering to be the ground truth (the observed data) and assuming it to be the perfect clustering.
• The Fowlkes-Mallows Index (FMI) is a clustering evaluation metric that measures the similarity between two sets of clusters: true labels (or ground truth) and predicted clusters.
• It computes the geometric mean of precision and recall by comparing pairs of points in the dataset. The score ranges from 0 to 1, where:
• 1 indicates perfect clustering (i.e., the predicted clusters perfectly match the true labels),
• 0 indicates that the clustering is completely random.
• Advantages of Fowlkes-Mallows Index (FMI)
• Simple Interpretation: FMI has an intuitive interpretation, ranging from 0 to
1. A higher value means better clustering performance, making it easy to
evaluate clustering results.
• Focus on Pairwise Accuracy: FMI evaluates the clustering based on the
pairs of points, which can be useful in applications where grouping similar
points together is important.
• Applicable in Supervised Clustering: FMI works well in situations where you
have ground truth labels to compare against predicted clusters.
• Disadvantages of Fowlkes-Mallows Index (FMI)
• Dependence on Cluster Sizes: FMI is affected by the size and number of
clusters. It may not perform well when clusters have different sizes or if
there are outliers.
• Requires Ground Truth: Since FMI compares clusters with true labels, it
cannot be used in fully unsupervised clustering tasks where no ground truth
is available.
• Does Not Penalize Large Clusters: FMI doesn't penalize algorithms for
generating large clusters that may contain dissimilar points. It only
evaluates pairwise groupings, not the quality of individual clusters.
• Real-time Example and Use Case of FMI
• Real-time Example: Suppose you have labeled data on customer segments
(e.g., based on purchase patterns), and you want to evaluate how well a
clustering algorithm (e.g., KMeans) has grouped customers based on a
different set of features. FMI will tell you how similar the predicted clusters
are to the ground truth customer segments.
• Use Case: FMI can be used in image segmentation tasks to evaluate how
well a clustering algorithm has grouped pixels into meaningful regions (e.g.,
foreground and background) by comparing the predicted segmentation
with a manually labeled image.
from sklearn.metrics import fowlkes_mallows_score

# True and predicted labels
U = [0, 0, 1, 1, 2]
V = [1, 1, 0, 0, 2]

# Calculate Fowlkes-Mallows Index
fmi_score = fowlkes_mallows_score(U, V)
print(f"Fowlkes-Mallows Index: {fmi_score}")
The Fowlkes-Mallows index (FMI) is defined as the geometric mean of the pairwise precision and recall:

FMI = TP / sqrt((TP + FP) * (TP + FN))

• Where TP is the number of True Positives (i.e., the number of pairs of points that belong to the same cluster in both labels_true and labels_pred),
• FP is the number of False Positives (i.e., the number of pairs of points that belong to the same cluster in labels_pred but not in labels_true),
• and FN is the number of False Negatives (i.e., the number of pairs of points that belong to the same cluster in labels_true but not in labels_pred).

The score ranges from 0 to 1. A high value indicates a good similarity between two clusters.
• Adjusted Mutual Information (AMI) is a variation of the Normalized Mutual Information (NMI) that adjusts for the chance grouping of data points.
• In other words, AMI corrects for the possibility that mutual information might be high purely due to random clustering.
• The AMI score is based on the concept of Mutual Information (MI), but it incorporates a correction factor to account for the fact that random clusterings will still have some degree of mutual information due to chance.
• This adjustment ensures that the score reflects the actual quality of the clustering, especially when comparing different clustering algorithms on a dataset.
• Advantages of AMI
• Chance Correction: AMI adjusts for random clustering and eliminates any
bias caused by chance groupings, making it a more reliable measure for
clustering performance than NMI.
• Symmetry: Like NMI, AMI is symmetric, meaning the clustering order does
not affect the score.
• Scale Independence: The AMI score is bounded above by 1, is around 0 for random labelings, and can be negative, making it intuitive to interpret.
• Disadvantages of AMI
• Computational Complexity: AMI can be computationally expensive due to
the need to calculate the expected mutual information, which involves
random permutations.
• Sensitivity to Cluster Number: Like most clustering metrics, AMI is
sensitive to the number of clusters. If the true number of clusters is not
specified correctly, the AMI score may not provide an accurate evaluation.
• Real-time Example and Use Case of AMI
• Real-time Example: Suppose you have a dataset of documents and you
apply two different clustering methods to organize the documents into
topics. You can use AMI to compare the two clustering results. If you also
have ground truth labels (e.g., predefined topics), you can use AMI to
check how closely your clustering matches these true labels while
correcting for chance.
• Use Case: In biological data analysis, AMI can be used to evaluate the
performance of clustering algorithms applied to gene expression data.
Since clusters may emerge by random chance, AMI helps ensure that
clusters found in the data are meaningful and not due to random
partitioning.
from sklearn.metrics import adjusted_mutual_info_score

# True and predicted labels
U = [0, 0, 1, 1, 2]
V = [1, 1, 0, 0, 2]

# Calculate AMI
ami_score = adjusted_mutual_info_score(U, V)
print(f"Adjusted Mutual Information Score: {ami_score}")
• AMI vs. NMI
• AMI corrects for random clustering and adjusts for chance, making it
more reliable when randomness is a concern.
• NMI is simpler but does not account for chance alignments, so its scores
may be higher for random clusterings. For better evaluation in
clustering, AMI is generally preferred over NMI when comparing
multiple algorithms.
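A tiny illustration of the AMI-vs-NMI point above: for purely random labelings with several clusters, NMI tends to sit noticeably above 0 while AMI stays close to 0 (the random labels here are generated for illustration):

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=200)   # random "true" labels, 10 clusters
b = rng.integers(0, 10, size=200)   # random "predicted" labels, 10 clusters

print("NMI:", normalized_mutual_info_score(a, b))
print("AMI:", adjusted_mutual_info_score(a, b))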
