
RFC: Introduce DBCV cluster validity metric #28243


Closed
luis261 opened this issue Jan 24, 2024 · 3 comments
Labels: Needs Triage (Issue requires triage)

Comments


luis261 commented Jan 24, 2024

Dear scikit-learn core developers/maintainers,

I am opening this issue to make the case for including DBCV as a scikit-learn clustering metric. In line with your contribution guidelines, I went ahead and tried to address possible concerns upfront (see below). The effort to get the actual functionality up and running should be manageable if we opt to transfer the version of DBCV that ships with scikit-learn-contrib/HDBSCAN, which is what I'd suggest (for technical details see #28244). Looking forward to your feedback!

DBCV is a clustering validity metric defined in this paper (direct link to PDF hosted on a German university server; alternative ref). It stands for Density-Based Clustering Validation and is characterized by not relying on cluster centroids in its validity calculations. Instead, it evaluates the quality of clusters based on the concepts of density separation (the highest-density area between clusters) and density sparseness (the lowest-density area within a cluster), propped up on the underlying construct of minimum spanning trees, much like the HDBSCAN clustering algorithm (the three authors of the original 2013 HDBSCAN paper also co-authored the DBCV paper linked above).
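To make the proposed usage concrete, here is a minimal sketch of computing DBCV via the existing contrib implementation (this assumes scikit-learn-contrib/hdbscan is installed and that its validity_index function is the one we would transfer; the dataset and DBSCAN parameters are illustrative):

```python
import numpy as np
from hdbscan.validity import validity_index  # DBCV implementation in scikit-learn-contrib/hdbscan
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two elongated, non-spherical clusters
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)

# cast to float64 to be safe: the Cython internals of the contrib
# implementation expect double-precision input
score = validity_index(X.astype(np.float64), labels)
print(f"DBCV score: {score:.3f}")  # ranges from -1 to 1, higher is better
```

A hypothetical scikit-learn version could follow the same (X, labels) calling convention as the existing unsupervised clustering metrics.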

I'd argue DBCV meets the algorithm inclusion criteria (https://scikit-learn.org/dev/faq.html#new-algorithms-inclusion-criteria):

"at least 3 years since publication"

  • the paper was published in 2014

"200+ citations"

  • according to Google Scholar, this criterion is met as well (Google Scholar citation count screenshot)

I believe I have thoroughly addressed the further inclusion criteria (see below). If my proposal is not up to par in that regard in any way, please let me know.

"The contributor should support the importance of the proposed addition with research papers and/or implementations in other similar packages, demonstrate its usefulness via common use-cases/applications and corroborate performance improvements, if any, with benchmarks and/or plots. It is expected that the proposed algorithm should outperform the methods that are already implemented in scikit-learn at least in some areas."

scikit-learn currently provides the following clustering validity metrics that do not require ground-truth labels (source, documentation):

  • Silhouette Coefficient (silhouette_score)
  • Calinski-Harabasz index (calinski_harabasz_score)
  • Davies-Bouldin index (davies_bouldin_score)

All of the metrics listed above implicitly assume clusters of spherical shape, based on their mathematical formulations.
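All three follow the same (X, labels) calling convention that a DBCV function would presumably adopt; here is a quick illustrative sketch (the KMeans labelling below is just an assumption for demonstration purposes):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# spherical blobs: the setting these existing metrics are well suited for
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Silhouette Coefficient:", silhouette_score(X, labels))         # higher is better
print("Calinski-Harabasz:     ", calinski_harabasz_score(X, labels))  # higher is better
print("Davies-Bouldin:        ", davies_bouldin_score(X, labels))     # lower is better
```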

scikit-learn ships with further metrics for clustering evaluation. However, these all require knowledge of the underlying ground-truth labels, which makes them irrelevant for driving optimization processes in practice: those labels aren't available when working with non-synthetic, real-world data, which is the reason for clustering in the first place.

When trying to optimize the hyperparameters of a clustering algorithm on a given dataset, it does not make sense to drive the optimization with a metric that does not align with (or is even opposed to) the algorithm's internal, relative metric, i.e. its implicit notion of what a cluster is and thus what makes a cluster "better" or "worse".
That notion materializes from the way the clustering algorithm operates, e.g.:

  • the way KMeans iterates based on squared distances from cluster centers
  • DBSCAN assigning noise labels based on proximity to other points

If one is not careful about which metric is paired with which algorithm (and how it is applied), the underlying mathematical constructs can "clash" in a way that leads to subpar or nonsensical results, e.g. applying SWC (the Silhouette Coefficient) to DBSCAN output on a dataset with elongated clusters such as the one shown below, or applying it without filtering out noise first (see the sketch that follows).
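To make the noise-filtering pitfall concrete, here is a minimal sketch (the dataset and DBSCAN parameters are illustrative assumptions, not the exact setup from the notebook linked below):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# two elongated clusters with a bit of noise
X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# naive: noise points (label -1) are scored as if they formed one extra "cluster"
print("SWC incl. noise:", silhouette_score(X, labels))

# filtered: only score the points that were actually assigned to a cluster
mask = labels != -1
print("SWC excl. noise:", silhouette_score(X[mask], labels[mask]))
```

Even with the noise handled, the Silhouette Coefficient still rewards compact, spherical clusters, which is exactly the mismatch the example below illustrates.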

Since there is currently no density-based metric, there is no suitable option available when trying to optimize a scikit-learn clustering method with scikit-learn metrics. Including DBCV would address that.

Below, I have included an example that illustrates the shortcomings of centroid-based metrics when applied to density-based clusters such as those produced by established algorithms like DBSCAN, HDBSCAN or OPTICS. For details regarding the example, please have a look at this notebook.

If you're looking for something more thorough/empirical, I suggest having a look at this repo: https://github.com/luis261/clustering-validity-comparison (a performance benchmark in the sense of measuring outcome quality, not runtime performance).

The plot below shows a synthetic dataset containing two elongated clusters with some background noise.
The points were labeled based on the underlying ground truth; just for the sake of argument, you can imagine the labels being the output of a density-based clustering algorithm instead.
Afterwards, the Silhouette Coefficient is computed for each non-noise point according to that labelling, and a color palette is applied based on the score of each point.
The result is a visualization showing the high internal disparity of the scores within each cluster, depending on each point's distance to the other cluster's centroid. This incongruity can arise whenever clusters emitted by an algorithm are judged by a metric ill-suited for the algorithm/dataset, and it will affect the optimization process, with outputs of arguably higher quality losing against alternatives that are less "ideal".

https://gist.github.com/luis261/5ad7f13d8b8ebe6762e424cc1d430f30
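For anyone who wants to reproduce a comparable picture without opening the gist, here is a rough sketch of the described visualization (dataset generation and styling are my own assumptions, not the exact code behind the plot, and background noise is omitted):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_samples

# synthetic stand-in for the two elongated, ground-truth-labeled clusters
X, labels = make_moons(n_samples=500, noise=0.05, random_state=0)

# per-point Silhouette Coefficients under the ground-truth labelling
scores = silhouette_samples(X, labels)

# color each point by its silhouette score to expose the within-cluster disparity
scatter = plt.scatter(X[:, 0], X[:, 1], c=scores, cmap="viridis", s=15)
plt.colorbar(scatter, label="Silhouette Coefficient")
plt.title("Per-point SWC on elongated clusters (ground-truth labels)")
plt.show()
```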

github-actions bot added the Needs Triage label on Jan 24, 2024
adrinjalali (Member) commented:

This is somewhat of a duplicate of #27259; please leave a comment there.


luis261 commented Jan 26, 2024

@adrinjalali thank you for pointing me in the right direction!


luis261 commented Feb 2, 2024

@Micky774 I think the example displayed above fits exactly what you asked for, since it demonstrates the insufficiency of spherical metrics (in particular SWC); for the implementation, see here.

> +1 to DBCV. I agree that we should be open to unsupervised metrics, esp. density-based approaches. I think finding adversarial examples that highlight insufficiency in current scikit-learn metrics would help motivate these, especially as we'll have to consider how we recommend their usage to users.
