Add more unsupervised clustering metrics #6654

@pambot

Hello,

I noticed that most of the clustering metrics available in scikit-learn are supervised, meaning they need ground-truth labels, with only the Silhouette score for unsupervised cases. Do you think it might be useful if I added one or more unsupervised clustering metrics as features? I would also be happy to write examples and non-API documentation, of course.

As you probably already know, for unsupervised clustering, metrics typically take into account compactness (how close samples in one group are to each other) and separability (how distinct different groups are from each other). The different algorithms for computing metrics mostly differ in how they interpret those two terms and how they are synthesized into a single score. I propose to add a few that are well-cited (on Google Scholar) and that show a diversity of approach:

Calinski-Harabasz index
Citations: 2481
Paper: A dendrite method for cluster analysis (1974)
This is a pretty classic metric, and it essentially amounts to a ratio of between-cluster sum of squares (separability) to within-cluster sum of squares (compactness). It's mostly used to evaluate the K in K-means. It's simple and intuitive, widely used, and it handles close clusters better than the Silhouette Index. It also, however, handles skewed clusters worse, just like K-means.
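
To make the ratio concrete, here is a minimal NumPy sketch of how I read the index: between-cluster sum of squares over within-cluster sum of squares, each divided by its degrees of freedom. The function name `calinski_harabasz_sketch` is made up for illustration and is not an existing scikit-learn function.

```python
import numpy as np


def calinski_harabasz_sketch(X, labels):
    """Ratio of between-cluster to within-cluster dispersion (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples = X.shape[0]
    clusters = np.unique(labels)
    k = clusters.size
    if not 1 < k < n_samples:
        raise ValueError("need between 2 and n_samples - 1 clusters")

    overall_mean = X.mean(axis=0)
    between = 0.0  # separability: spread of centroids around the overall mean
    within = 0.0   # compactness: spread of samples around their own centroid
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - overall_mean) ** 2)
        within += np.sum((members - centroid) ** 2)

    if within == 0.0:
        return float("inf")  # every cluster is a single repeated point
    return (between / (k - 1)) / (within / (n_samples - k))
```

On the four points `[[0, 0], [0, 1], [10, 0], [10, 1]]`, the natural labelling `[0, 0, 1, 1]` scores 200.0 and the scrambled labelling `[0, 1, 0, 1]` scores 0.02, so higher means better-separated, tighter clusters.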

SDbw validity index
Citations: 242
Paper: Clustering validity assessment: finding the optimal partitioning of a data set (2001)
This one just makes the citation cut, but I include it because of the approach and the performance. SDbw defines compactness as scattering, which is the sum of the standard deviations of each cluster over the standard deviation of the whole dataset. It defines separability in terms of the density of the midpoint of the cluster centers with respect to the cluster center densities. Not only does it perform better compared to a lot of metrics, but it also seems especially well-suited for DBSCAN, which is shown here to accommodate all of the weird shapes thrown at it.
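
A rough sketch of how the two terms could be combined, under my own assumptions. I'm using the norm of the per-feature variance vector as the spread measure (the formulation I believe the paper uses), a neighbourhood radius derived from the average cluster spread, and a plain sum of the scattering and density terms. The name `s_dbw_sketch` is made up, and I have not checked this against a reference implementation.

```python
import numpy as np


def s_dbw_sketch(X, labels):
    """Scattering plus between-cluster density; lower is better (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    k = clusters.size
    if k < 2:
        raise ValueError("need at least 2 clusters")

    groups = [X[labels == c] for c in clusters]
    centers = [g.mean(axis=0) for g in groups]

    # Compactness: average cluster spread relative to the whole dataset's spread.
    # Assumes the dataset is not a single repeated point (total_spread > 0).
    cluster_spread = np.array([np.linalg.norm(g.var(axis=0)) for g in groups])
    total_spread = np.linalg.norm(X.var(axis=0))
    scat = cluster_spread.mean() / total_spread

    # Assumed neighbourhood radius used when counting points around a location.
    radius = np.sqrt(cluster_spread.sum()) / k

    def density(point, points):
        return np.sum(np.linalg.norm(points - point, axis=1) <= radius)

    # Separability: density at the midpoint between two centers, relative to
    # the denser of the two centers, averaged over ordered pairs of clusters.
    dens_bw = 0.0
    for i in range(k):
        for j in range(k):
            if i == j:
                continue
            pair = np.vstack([groups[i], groups[j]])
            midpoint = (centers[i] + centers[j]) / 2
            denom = max(density(centers[i], pair), density(centers[j], pair))
            if denom > 0:  # skip pairs with no points near either center
                dens_bw += density(midpoint, pair) / denom
    dens_bw /= k * (k - 1)

    return scat + dens_bw
```

Like the other metrics, it only needs the dataset and the labels, so it would fit the existing API template.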

Gap statistic
Citations: 2088
Paper: Estimating the number of clusters in a data set via the gap statistic (2001)
The gap statistic has a ridiculous number of citations considering its young age, possibly because its authors include two of the authors of Elements of Statistical Learning. It compares the within-cluster dispersion (pairwise distances between samples in a cluster) with the dispersion expected under a null hypothesis. This one is tricky because you have to be able to input different types of null distributions, and you have to sample the chosen one many times in order to get the expected null dispersion. This isn't the fastest metric, is what I'm saying. Other implementations, such as the one in R, ask you to input a range of parameter values and perform the clustering within the function. It's not unusual for scikit-learn to have functions as inputs for internal processing, but this may not be in keeping with the API template of the other metrics, which only ask for the dataset and the labels. On the other hand, it's the only metric I've encountered that asks the statistical question of whether any clustering should be occurring at all.
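
Here is a minimal sketch of what a function-as-input version might look like. Everything about the signature is an assumption on my part: `gap_statistic_sketch` and `cluster_fn` are made-up names, the null distribution is fixed to uniform over the bounding box of the data, and dispersion is measured with squared Euclidean distance (in which case the pairwise form reduces to the sum of squares around each centroid).

```python
import numpy as np


def log_within_dispersion(X, labels):
    """log of the pooled within-cluster sum of squares around each centroid."""
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += np.sum((members - members.mean(axis=0)) ** 2)
    return np.log(total)


def gap_statistic_sketch(X, k, cluster_fn, n_refs=20, seed=0):
    """Return (gap, s_k) for k clusters.

    cluster_fn is a hypothetical callable: cluster_fn(X, k) -> array of labels.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)

    observed = log_within_dispersion(X, np.asarray(cluster_fn(X, k)))

    # Null reference (assumed): uniform samples over the data's bounding box.
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = np.empty(n_refs)
    for b in range(n_refs):
        X_ref = rng.uniform(lo, hi, size=X.shape)
        ref[b] = log_within_dispersion(X_ref, np.asarray(cluster_fn(X_ref, k)))

    gap = ref.mean() - observed
    # Spread of the reference dispersions, inflated for simulation error.
    s_k = ref.std() * np.sqrt(1 + 1 / n_refs)
    return gap, s_k
```

Any clusterer can be wrapped as `cluster_fn`, for example `lambda X, k: KMeans(n_clusters=k, n_init=10).fit_predict(X)`. The clustering is rerun on every reference sample, which is where the cost comes from.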

So that's it for now. These metrics are implemented elsewhere, so I would use them as validation of both accuracy and efficiency. What do you think? It's not all-or-none and I'm not limited to the ones I listed. If you think it's a good idea, I'm open to other contributors. I can use the results of this as a final project for a class, by the way, so I'm motivated not to leave it hanging.

P.S. I know the contribution docs suggest making small, easy edits first to test the waters, and I did think of that first, but it seemed to me that all of the ones involving Python code marked Easy were taken and the ones marked Moderate that I knew anything about were as well.
