New clustering metrics #27259

Open
Wainberg opened this issue Aug 31, 2023 · 24 comments · May be fixed by #28036
Labels: Needs Decision - Include Feature, New Feature

Comments

@Wainberg

Describe the workflow you want to enable

Scikit-learn defines three popular metrics for evaluating clustering performance when there are no ground-truth cluster labels: sklearn.metrics.silhouette_score, sklearn.metrics.calinski_harabasz_score and sklearn.metrics.davies_bouldin_score. But there are lots of others, and it's previously been discussed whether to integrate more into scikit-learn.
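
For reference, all three existing metrics share the same (X, labels) call signature; a minimal usage sketch on synthetic data (purely illustrative) looks like this:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Higher is better for Silhouette and Calinski-Harabasz; lower is better for Davies-Bouldin.
print(silhouette_score(X, labels))
print(calinski_harabasz_score(X, labels))
print(davies_bouldin_score(X, labels))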

Describe your proposed solution

I've implemented four relatively popular ones, using the same interface and code style as in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/metrics/cluster/_unsupervised.py. Would there be interest in integrating these into scikit-learn?

import numpy as np
from itertools import combinations
from sklearn.utils import check_X_y
from sklearn.metrics.cluster._unsupervised import check_number_of_labels
from sklearn.preprocessing import LabelEncoder


def log_ss_ratio(X, labels):
    """Log of the ratio of between-cluster to within-cluster dispersion."""
    X, labels = check_X_y(X, labels)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    n_samples, _ = X.shape
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)

    extra_disp, intra_disp = 0.0, 0.0
    mean = X.mean(axis=0)
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = cluster_k.mean(axis=0)
        extra_disp += len(cluster_k) * ((mean_k - mean) ** 2).sum()
        intra_disp += ((cluster_k - mean_k) ** 2).sum()

    return np.log(extra_disp / intra_disp)


def ball_hall(X, labels):
    """Ball-Hall index: mean over clusters of the mean within-cluster dispersion."""
    X, labels = check_X_y(X, labels)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    n_samples, _ = X.shape
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)

    sum_mean_dispersions = 0

    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = cluster_k.mean(axis=0)
        intra_disp_cluster = ((cluster_k - mean_k) ** 2).sum()
        sum_mean_dispersions += intra_disp_cluster / len(cluster_k)

    return sum_mean_dispersions / n_labels


def banfield_raftery(X, labels):
    """Banfield-Raftery index: sum over clusters of n_k * log(within-cluster dispersion / n_k)."""
    X, labels = check_X_y(X, labels)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    n_samples, _ = X.shape
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)

    br_index = 0
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = cluster_k.mean(axis=0)
        intra_disp_cluster = ((cluster_k - mean_k) ** 2).sum()
        br_index += len(cluster_k) * np.log(intra_disp_cluster / len(cluster_k))

    return br_index


def ray_turi(X, labels):
    """Ray-Turi index: within-cluster dispersion over the minimum squared distance between cluster means."""
    X, labels = check_X_y(X, labels)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    n_samples, _ = X.shape
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)

    intra_disp = 0.0

    cluster_means = []
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = cluster_k.mean(axis=0)
        intra_disp += ((cluster_k - mean_k) ** 2).sum()
        cluster_means.append(mean_k)

    min_cluster_mean_diff = min(
        ((mean_i - mean_j) ** 2).sum()
        for mean_i, mean_j in combinations(cluster_means, 2))

    return intra_disp / (min_cluster_mean_diff * n_samples)
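
For illustration, the proposed functions could be exercised the same way (a sketch assuming the definitions above are in scope; the data is synthetic):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Print each proposed index for the same clustering.
for score in (log_ss_ratio, ball_hall, banfield_raftery, ray_turi):
    print(score.__name__, score(X, labels))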


@Wainberg added the Needs Triage and New Feature labels on Aug 31, 2023
@glemaitre
Member

@Wainberg could you provide articles from the scientific literature on these metrics, so that we can judge their relevance and decide whether or not to include them in scikit-learn?

@glemaitre added the Needs Decision - Include Feature label and removed the Needs Triage label on Sep 16, 2023
@Wainberg
Author

The Ray-Turi paper has been cited 932 times since 1999. Its main contribution is the clustering metric.

The Ball-Hall paper has been cited 1681 times since 1965. It introduced ISODATA (a variant of k-means) along with the clustering metric, but if you look at the top citations, they're mostly about using the clustering metric. One disadvantage of this metric is that, unlike the others here (and unlike the three already implemented in scikit-learn: Silhouette score, Davies-Bouldin and Calinski-Harabasz), you have to look at the maximum difference in the metric between two adjacent values of k (the number of clusters) to determine the best k.
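
(For concreteness, that selection heuristic might look like the sketch below, assuming the ball_hall function from the snippet above; picking the k just after the largest change between adjacent values is one common reading of the heuristic, not a settled rule.)

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
ks = list(range(2, 11))
scores = [ball_hall(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in ks]

# Largest difference in the metric between two adjacent values of k.
diffs = np.abs(np.diff(scores))
best_k = ks[int(np.argmax(diffs)) + 1]
print(best_k)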

The Banfield-Raftery paper has been cited 3185 times since 1993. The top citations are a mix of the clustering metric and the paper's other contributions.

The log SS ratio is less popular than the other three, so it might make sense to leave it out.

The clearest case for inclusion would be Ray-Turi, which has been cited nearly 1000 times just for the clustering metric.

@adrinjalali
Member

With those stats I don't think we can argue they're not popular. I don't mind including them. But cc'ing @lorentzenchr since he probably has a better overview of the scores we're adding.

@lorentzenchr
Member

I'm no expert on that field. Given the citations, Ray-Turi, Ball-Hall and Banfield-Raftery seem fine to include.
Although it is mentioned in the docs, I would appreciate a clearer separation between supervised and unsupervised metrics for clustering.

@Wainberg
Author

So I went on a bit of a deep dive through a few implementations of unsupervised cluster evaluation metrics and found 37 of them. I looked at each one's number of Google hits and number of citations; keep in mind that some papers are cited for things other than the clustering metric. The ones that particularly stand out on both popularity measures are called out in the summary at the end.

  1. "Baker-Hubert gamma index" clustering: 3,370 Google hits (3,330 for "Hubert index" clustering, 3,360 for "Hubert gamma index" clustering), 379 citations
  2. "Ball-Hall index" clustering: 904 Google hits (15,400 for "Ball index" clustering), 1,681 citations
  3. "Banfield-Raftery index" clustering: 10 Google hits (but 463 for the misspelling "Banfeld-Raftery index" clustering), 3,185 citations
  4. "Beale index" clustering: 1,940 Google hits, 444 citations
  5. "C index" clustering: 224,000 Google hits (may include false positives talking about AUROC, which is also a "C index", but "C index" clustering Hubert Levin gets 8,510 results), 486 citations
  6. "Calinski-Harabasz index" clustering: 24,200 Google hits (already in scikit-learn), 8,244 citations
  7. "Cubic Clustering Criterion" clustering: 11,400 Google hits, 604 citations
  8. "D index" clustering: 45,100 Google hits, 3,033 citations
  9. "Davies-Bouldin index" clustering: 72,000 Google hits (already in scikit-learn), 9,078 citations
  10. "Det Ratio index" clustering: 1,180 Google hits, 654 citations - this paper also introduced the Log Det Ratio and Scott-Symons indices
  11. "Duda index" clustering: 294 Google hits, 26,694 citations - this paper also introduced the pseudo t2 index
  12. "Dunn index" clustering: 51,900 Google hits, 3,097 citations
  13. "Frey index" clustering: 557 Google hits, ?? citations (paper doesn't seem to be available on the internet)
  14. "Friedman index" clustering: 869 Google hits (74 for "Trace WiB index" clustering), 943 citations - this paper also introduced the Rubin index
  15. "Gap index" clustering: 11,300 Google hits, 6,673 citations
  16. "Generalized Dunn index" clustering: 2,640 Google hits, 1,487 citations
  17. "G plus index" clustering: 548 Google hits, 226 citations - this paper also introduced the Tau index
  18. "Hartigan index" clustering: 2,390 Google hits (10 for "Log SS Ratio index"), 47 citations
  19. "KL index" clustering: 1,620 Google hits (1,650 for "Krzanowski-Lai index" clustering), 899 citations
  20. "Log Det Ratio index" clustering: 68 Google hits, 654 citations - this paper also introduced the Det Ratio and Scott-Symons indices
  21. "Marriot index" clustering: 344 Google hits (1,260 for "Marriott index" clustering, 10 for "Ksq DetW index" clustering), 370 citations
  22. "McClain-Rao index" clustering: 238 Google hits (377 for "McClain index" clustering), 176 citations
  23. "PBM index" clustering: 3,730 Google hits, 956 citations
  24. "Point biserial index" clustering: 1,710 Google hits, 575 citations
  25. "Pseudo t2 index" clustering: 135 Google hits, 26,694 citations - this paper also introduced the Duda index
  26. "Ratkowsky-Lance index" clustering: 174 Google hits (116 for "Ratkowsky index" clustering), 171 citations
  27. "Ray-Turi index" clustering: 729 Google hits, 932 citations
  28. "Rubin index" clustering: 653 Google hits, 943 citations - this paper also introduced the Friedman index
  29. "Scott-Symons index" clustering: 1,230 Google hits (4,690 for "Scott index" clustering), 654 citations - this paper also introduced the Det Ratio and Log Det Ratio indices
  30. "SD index" clustering: 12,400 Google hits (6,470 for "SDindex" clustering), 466 citations
  31. "S Dbw index" clustering: 1,280 Google hits (954 for "S_Dbw index" clustering), 610 citations
  32. "Silhouette index" clustering: 51,900 Google hits (68,000 for "Silhouette score" clustering; already in scikit-learn), 18,202 citations
  33. "Tau index" clustering: 3,530 Google hits, 226 citations - this paper also introduced the G plus index
  34. "Trace W index" clustering: 97 Google hits (170 for "TraceW index" clustering), 785 citations
  35. "trcovw index" clustering: 157 Google hits, 5,119 citations - this may not be the earliest citation
  36. "Wemmert-Gancarski index" clustering: 170 Google hits, 52 citations
  37. "Xie-Beni index" clustering: 9,000 Google hits, 4,380 citations

Besides the three in scikit-learn (Silhouette, Davies-Bouldin and Calinski-Harabasz), the ones that stand out are the D index, Dunn index, Gap index, and Xie-Beni index. However, the D index citation is a book that seems to be mostly cited for things other than the index.

To sum up, the most popular unsupervised clustering metrics missing from scikit-learn seem to be the Dunn index, Gap index, and Xie-Beni index. However, maybe the even bigger takeaway here is that there are a lot of widely used and widely cited metrics, and it might be worth supporting a more comprehensive set than just Dunn, Gap and Xie-Beni.
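
(For reference, one common formulation of the Dunn index, given as a hedged sketch rather than a proposed API, is the minimum between-cluster distance divided by the maximum cluster diameter:)

import numpy as np
from itertools import combinations
from sklearn.metrics import pairwise_distances


def dunn_index(X, labels):
    # Smallest distance between points in different clusters,
    # divided by the largest within-cluster diameter.
    clusters = [X[labels == k] for k in np.unique(labels)]
    min_separation = min(pairwise_distances(a, b).min()
                         for a, b in combinations(clusters, 2))
    max_diameter = max(pairwise_distances(c).max() for c in clusters)
    return min_separation / max_diameter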

@glevv
Contributor

glevv commented Sep 24, 2023

I would love to see more "unsupervised" (without the need for labels) clustering metrics, since the current ones are not very flexible (they rely too heavily on SSq or L2 distance). I don't know about metrics that need labels, though.

@lorentzenchr
Member

I'm personally not a big fan of adding that many unsupervised metrics; I'd rather add just the most important ones.
Is there any theory for selecting such a metric that would limit the choices?

@Wainberg
Author

My personal experience: I tried out a bunch of these on single-cell RNA sequencing data, trying to select the optimal resolution hyperparameter for Leiden clustering. My experience was that Ray-Turi was the most reasonable, Davies-Bouldin and Calinski-Harabasz were way off, and Silhouette was relatively good but also ridiculously computationally expensive (also a problem that Dunn, Gap and Xie-Beni all suffer from). All of the metrics gave completely different answers from each other!
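
(To make that workflow concrete, the comparison amounts to a sweep like the sketch below; KMeans and synthetic blobs stand in for Leiden clustering and the single-cell data, and ray_turi is the function from the snippet above.)

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Ray-Turi and Davies-Bouldin are minimized; Silhouette is maximized.
    print(k, ray_turi(X, labels), davies_bouldin_score(X, labels),
          silhouette_score(X, labels))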

@glevv
Contributor

glevv commented Sep 25, 2023

Sadly, all the metrics described in the OP rely on L2 distance from the "center". I was thinking about something like this:

[attached image]

Density-Based Clustering Validation 2014, Moulavi - Cited by 243

@Nylio-prog

I would like to work on adding the DBCV score. Let me know if anyone is already working on it, so we don't duplicate our work.

@luis261

luis261 commented Jan 26, 2024

I've arrived here from #28243, where I laid out the case for including DBCV by explaining how it fills a gap left by the three current metrics, which were also mentioned above:

"I would love to see more "unsupervised" (without the need in labels) clustering metrics, since current ones are not very flexible (they rely to heavily on SSq or L2 distance)."

"Sadly all metrics described in OP rely on L2 distance from the "center". I was thinking about something like this"

That's exactly the problem I experienced when tasked with clustering non-spherical data in a practical setting using the "conventional" metrics.
@glevv Just like you correctly demonstrated with your example, DBCV avoids this typical shortcoming: it does not rely on centroids and therefore does not make an implicit spherical assumption. Instead, it works based on MSTs, much like HDBSCAN does (my issue linked above goes into more detail on this as well).

My PR #28244 aims to include the implementation of DBCV that's shipped with https://github.com/scikit-learn-contrib/hdbscan, rather than one based on https://github.com/FelSiq/DBCV.

The HDBSCAN DBCV version also contains an optional performance improvement (toggled via a kwarg), based on this benchmark of mine, which outperforms the default DBCV version by avoiding excessive inflation of core distances (so far only demonstrated for low-dimensional data).
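
(For anyone who wants to try it, the contrib package exposes its DBCV implementation roughly as sketched below; this is from memory, so please check the package for the exact function name and signature.)

import numpy as np
from hdbscan import HDBSCAN
from hdbscan.validity import validity_index  # DBCV implementation shipped with hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = HDBSCAN(min_cluster_size=15).fit_predict(X)

# Cast to float64 to be safe; higher DBCV values indicate better clusterings.
print(validity_index(X.astype(np.float64), labels))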

"I would like to work on adding the DBCV score. Tell me if anyone is already on it to not duplicate our work"

@Nylio-prog I was not aware of your PR before ramping up myself; sorry about that.

Regardless of which implementation gets merged (if any), I believe my benchmark and the argument laid out in #28243 can provide helpful context to aid the decision-making process when it comes to adding this feature or not.
My argument for including DBCV is backed by an illustrative example as well as an experiment with a more systematic, empirical methodology following this paper.

@lorentzenchr
Member

lorentzenchr commented Jan 26, 2024

Based on the very good discussion here, I'm in favor of including

  • Ray-Turi
  • DBCV

@scikit-learn/core-devs and all other participants, giving your +1 (thumbs-up on this comment) might help push this issue forward a lot.

@thomasjpfan
Member

I am +1 on DBCV and not sure about Ray-Turi. Is there a reference that showcases the advantages of Ray-Turi compared to the existing scikit-learn cluster metrics?

@Micky774
Contributor

+1 to DBCV. I agree that we should be open to unsupervised metrics, especially density-based approaches. I think finding adversarial examples that highlight insufficiencies in the current scikit-learn metrics would help motivate these, particularly since we'll have to consider how we recommend their usage to users.
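
(As one hedged illustration of such an adversarial example: on two interleaved half-moons, the silhouette score can prefer a convex KMeans split over the true non-convex labels; the exact numbers depend on the noise level.)

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette often rates the convex KMeans split above the true moon labels,
# even though the moons are the intended clustering here.
print(silhouette_score(X, kmeans_labels), silhouette_score(X, y_true))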

@luis261

luis261 commented Feb 2, 2024

My PR which addresses this issue is ready for review now: #28244

@glevv
Contributor

glevv commented Feb 2, 2024

Usually the process goes like this:
issue -> discussion -> decision -> if green light -> development plan -> PR.

I don't think we're at the PR stage yet.

@luis261

luis261 commented Feb 2, 2024

Thank you for letting me know. While I did read the contribution guidelines, my understanding of the process of adding entirely new features is limited. I assumed we had accumulated enough votes in addition to covering the required scientific criteria and practical considerations over here to get an actual implementation for DBCV underway. I'll stand by until a decision is made.

@lorentzenchr
Member

Usually the process goes like this: issue -> discussion -> decision -> if green light -> development plan -> PR.

I don't think we're at the PR stage yet.

There seems to be solid agreement on adding DBCV, so green light for that PR; reviews welcome.

@luis261

luis261 commented Feb 2, 2024

There seems to be solid agreement on adding DBCV, so green light for that PR; reviews welcome.

Great, can't wait!

@luis261

luis261 commented Feb 23, 2024

It's been 3 weeks since #28244 was greenlit for review, and I have not been able to observe any movement on this issue in any regard whatsoever since then.

If there is still interest in pushing this forward, I expect to see an assigned reviewer soon, and ideally to be informed of an estimated time by which the initial, preliminary review will be completed.

@adrinjalali
Member

@luis261 We don't assign reviewers, and reviewer time is the bottleneck in most open-source projects. Our green-lighting this doesn't mean it has high priority. Unfortunately, a few weeks is not a long time for a big project with over 600 open pull requests.

You can ping people who have shown interest in this thread every now and then and hopefully they can allocate some time to review.

@luis261

luis261 commented Feb 23, 2024

@adrinjalali sure, I understand that. Hope I didn't come off as rude. I just wanted to get everyone on the same page and probe a bit to find out whether there is still interest in this issue, so I can adjust my own priorities.

Seems we are on track for now though; I've already gotten valuable feedback on the PR that I'll take care to implement now (:

@lesteve
Member

lesteve commented Feb 24, 2024

@luis261 let me try to be honest: your comment #27259 (comment) sounded very demanding.

No harm done; there could be plenty of reasons for this. Maybe I was having a bad day, or I have a heightened sensitivity to this kind of message, or to being pinged on something I know very little about in general. Also, English is not my main language, and maybe not yours either, which sometimes does not help.

I completely get your frustration: you have worked on something for a considerable amount of time and then get no feedback for three weeks, which is definitely not great. Be reassured that, as a maintainer, it certainly does not feel great either to drop things / let people down, but as Adrin said, we do have plenty on our plate and we try to do what we can.

Personally, when I don't get any feedback on an issue, I try to say things like "of course I completely understand if you have other priorities" or "let me know if there is anything I can do to help push this forward". A recent example where I did not have any feedback for roughly three weeks: colesbury/nogil-wheels#6 (comment)

I really like Brett Cannon talk on open-source participation:
https://snarky.ca/setting-expectations-for-open-source-participation/. In particular, I find the "PR as a puppy" metaphor great.

@luis261

luis261 commented Feb 24, 2024

let me try to be honest: your comment #27259 (comment) sounded very demanding.

@lesteve I'm sorry it came off that way; that's certainly not what I intended. What I meant to communicate was not that I feel entitled to a review and therefore demand to receive one, but rather that, under the assumption that there is still interest in the issue at hand, I'm anticipating a review in the near future. If I hadn't gotten any response, or had gotten an explicit no, that would have been fine with me as well; at least in that case I could have "cut my losses" and moved on. I'd prefer that outcome over ambiguity that results in me unnecessarily remaining on standby.

I really like Brett Cannon talk on open-source participation:
https://snarky.ca/setting-expectations-for-open-source-participation/. In particular, I find the "PR as a puppy" metaphor great.

I read through the talk in its article form. Looking at PRs as puppies seems like a good way to think about it. I like to think that I personally have a history of interacting with maintainers in a polite way that's consistent with that.

Personally when I don't get any feed-back on an issue, I kind of try to say things like "of course I completely understand if you have other priorities" or "let me know if there is anything I can do to help this push forward". A recent example where I did not have any feed-back in roughly three weeks: colesbury/nogil-wheels#6 (comment)

That seems like a good way of going about it.
