New clustering metrics #27259

Open
Wainberg opened this issue Aug 31, 2023 · 24 comments · May be fixed by #28036
Labels: Needs Decision - Include Feature, New Feature

Comments

@Wainberg

Describe the workflow you want to enable

Scikit-learn defines three popular metrics for evaluating clustering performance when there are no ground-truth cluster labels: sklearn.metrics.silhouette_score, sklearn.metrics.calinski_harabasz_score and sklearn.metrics.davies_bouldin_score. But there are lots of others, and it's previously been discussed whether to integrate more into scikit-learn.
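
For reference, all three existing metrics share the same (X, labels) call signature; a minimal usage sketch on synthetic data (purely illustrative) looks like this:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Higher is better for Silhouette and Calinski-Harabasz; lower is better for Davies-Bouldin.
print(silhouette_score(X, labels))
print(calinski_harabasz_score(X, labels))
print(davies_bouldin_score(X, labels))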

Describe your proposed solution

I've implemented four relatively popular ones, using the same interface and code style as in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/metrics/cluster/_unsupervised.py. Would there be interest in integrating these into scikit-learn?

import numpy as np
from itertools import combinations
from sklearn.utils import check_X_y
from sklearn.metrics.cluster._unsupervised import check_number_of_labels
from sklearn.preprocessing import LabelEncoder


def log_ss_ratio(X, labels):
    """Log of the ratio of between-cluster to within-cluster dispersion."""
    X, labels = check_X_y(X, labels)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    n_samples, _ = X.shape
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)

    extra_disp, intra_disp = 0.0, 0.0
    mean = X.mean(axis=0)
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = cluster_k.mean(axis=0)
        extra_disp += len(cluster_k) * ((mean_k - mean) ** 2).sum()
        intra_disp += ((cluster_k - mean_k) ** 2).sum()

    return np.log(extra_disp / intra_disp)


def ball_hall(X, labels):
    """Ball-Hall index: mean over clusters of the mean within-cluster dispersion."""
    X, labels = check_X_y(X, labels)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    n_samples, _ = X.shape
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)

    sum_mean_dispersions = 0

    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = cluster_k.mean(axis=0)
        intra_disp_cluster = ((cluster_k - mean_k) ** 2).sum()
        sum_mean_dispersions += intra_disp_cluster / len(cluster_k)

    return sum_mean_dispersions / n_labels


def banfield_raftery(X, labels):
    """Banfield-Raftery index: sum over clusters of n_k * log(within-cluster dispersion / n_k)."""
    X, labels = check_X_y(X, labels)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    n_samples, _ = X.shape
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)

    br_index = 0
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = cluster_k.mean(axis=0)
        intra_disp_cluster = ((cluster_k - mean_k) ** 2).sum()
        br_index += len(cluster_k) * np.log(intra_disp_cluster / len(cluster_k))

    return br_index


def ray_turi(X, labels):
    """Ray-Turi index: within-cluster dispersion over the minimum squared distance between cluster means."""
    X, labels = check_X_y(X, labels)
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    n_samples, _ = X.shape
    n_labels = len(le.classes_)

    check_number_of_labels(n_labels, n_samples)

    intra_disp = 0.0

    cluster_means = []
    for k in range(n_labels):
        cluster_k = X[labels == k]
        mean_k = cluster_k.mean(axis=0)
        intra_disp += ((cluster_k - mean_k) ** 2).sum()
        cluster_means.append(mean_k)

    min_cluster_mean_diff = min(
        ((mean_i - mean_j) ** 2).sum()
        for mean_i, mean_j in combinations(cluster_means, 2))

    return intra_disp / (min_cluster_mean_diff * n_samples)
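
For illustration, the proposed functions could be exercised the same way (a sketch assuming the definitions above are in scope; the data is synthetic):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Print each proposed index for the same clustering.
for score in (log_ss_ratio, ball_hall, banfield_raftery, ray_turi):
    print(score.__name__, score(X, labels))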


@Wainberg added the Needs Triage and New Feature labels on Aug 31, 2023
@glemaitre
Member

@Wainberg could you provide articles from the scientific literature on these metrics, so that we can judge their relevance and decide whether or not to include them in scikit-learn?

@glemaitre added the Needs Decision - Include Feature label and removed the Needs Triage label on Sep 16, 2023
@Wainberg
Author

The Ray-Turi paper has been cited 932 times since 1999. Its main contribution is the clustering metric.

The Ball-Hall paper has been cited 1681 times since 1965. It introduced ISODATA (a variant of k-means) along with the clustering metric, but if you look at the top citations, they're mostly about using the clustering metric. One disadvantage of this metric is that, unlike the others here (and unlike the three already implemented in scikit-learn: Silhouette score, Davies-Bouldin and Calinski-Harabasz), you have to look at the maximum difference in the metric between two adjacent values of k (the number of clusters) to determine the best k.
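
(For concreteness, that selection heuristic might look like the sketch below, assuming the ball_hall function from the snippet above; picking the k just after the largest change between adjacent values is one common reading of the heuristic, not a settled rule.)

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
ks = list(range(2, 11))
scores = [ball_hall(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
          for k in ks]

# Largest difference in the metric between two adjacent values of k.
diffs = np.abs(np.diff(scores))
best_k = ks[int(np.argmax(diffs)) + 1]
print(best_k)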

The Banfield-Raftery paper has been cited 3185 times since 1993. The top citations are a mix of the clustering metric and the paper's other contributions.

The log SS ratio is less popular than the other three, so it might make sense to leave it out.

The clearest case for inclusion would be Ray-Turi, which has been cited nearly 1000 times just for the clustering metric.

@adrinjalali
Member

With those stats I don't think we can argue they're not popular. I don't mind including them. But cc'ing @lorentzenchr since he probably has a better overview of the scores we're adding.

@lorentzenchr
Member

I'm no expert on that field. Given the citations, Ray-Turi, Ball-Hall and Banfield-Raftery seem fine to include.
Although it is mentioned in the docs, I would appreciate a clearer separation between supervised and unsupervised metrics for clustering.

@Wainberg
Author

So I went on a bit of a deep dive through a few implementations of unsupervised cluster evaluation metrics and found 37 of them. I looked at each one's number of Google hits and number of citations; keep in mind that some papers are cited for things other than the clustering metric. The ones that particularly stand out on both popularity measures are called out in the summary at the end.

  1. "Baker-Hubert gamma index" clustering: 3,370 Google hits (3,330 for "Hubert index" clustering, 3,360 for "Hubert gamma index" clustering), 379 citations
  2. "Ball-Hall index" clustering: 904 Google hits (15,400 for "Ball index" clustering), 1,681 citations
  3. "Banfield-Raftery index" clustering: 10 Google hits (but 463 for the misspelling "Banfeld-Raftery index" clustering), 3,185 citations
  4. "Beale index" clustering: 1,940 Google hits, 444 citations
  5. "C index" clustering: 224,000 Google hits (may include false positives talking about AUROC, which is also a "C index", but "C index" clustering Hubert Levin gets 8,510 results), 486 citations
  6. "Calinski-Harabasz index" clustering: 24,200 Google hits (already in scikit-learn), 8,244 citations
  7. "Cubic Clustering Criterion" clustering: 11,400 Google hits, 604 citations
  8. "D index" clustering: 45,100 Google hits, 3,033 citations
  9. "Davies-Bouldin index" clustering: 72,000 Google hits (already in scikit-learn), 9,078 citations
  10. "Det Ratio index" clustering: 1,180 Google hits, 654 citations - this paper also introduced the Log Det Ratio and Scott-Symons indices
  11. "Duda index" clustering: 294 Google hits, 26,694 citations - this paper also introduced the pseudo t2 index
  12. "Dunn index" clustering: 51,900 Google hits, 3,097 citations
  13. "Frey index" clustering: 557 Google hits, ?? citations (paper doesn't seem to be available on the internet)
  14. "Friedman index" clustering: 869 Google hits (74 for "Trace WiB index" clustering), 943 citations - this paper also introduced the Rubin index
  15. "Gap index" clustering: 11,300 Google hits, 6,673 citations
  16. "Generalized Dunn index" clustering: 2,640 Google hits, 1,487 citations
  17. "G plus index" clustering: 548 Google hits, 226 citations - this paper also introduced the Tau index
  18. "Hartigan index" clustering: 2,390 Google hits (10 for "Log SS Ratio index"), 47 citations
  19. "KL index" clustering: 1,620 Google hits (1,650 for "Krzanowski-Lai index" clustering), 899 citations
  20. "Log Det Ratio index" clustering: 68 Google hits, 654 citations - this paper also introduced the Det Ratio and Scott-Symons indices
  21. "Marriot index" clustering: 344 Google hits (1,260 for "Marriott index" clustering, 10 for "Ksq DetW index" clustering), 370 citations
  22. "McClain-Rao index" clustering: 238 Google hits (377 for "McClain index" clustering), 176 citations
  23. "PBM index" clustering: 3,730 Google hits, 956 citations
  24. "Point biserial index" clustering: 1,710 Google hits, 575 citations
  25. "Pseudo t2 index" clustering: 135 Google hits, 26,694 citations - this paper also introduced the Duda index
  26. "Ratkowsky-Lance index" clustering: 174 Google hits (116 for "Ratkowsky index" clustering), 171 citations
  27. "Ray-Turi index" clustering: 729 Google hits, 932 citations
  28. "Rubin index" clustering: 653 Google hits, 943 citations - this paper also introduced the Friedman index
  29. "Scott-Symons index" clustering: 1,230 Google hits (4,690 for "Scott index" clustering), 654 citations - this paper also introduced the Det Ratio and Log Det Ratio indices
  30. "SD index" clustering: 12,400 Google hits (6,470 for "SDindex" clustering), 466 citations
  31. "S Dbw index" clustering: 1,280 Google hits (954 for "S_Dbw index" clustering), 610 citations
  32. "Silhouette index" clustering: 51,900 Google hits (68,000 for "Silhouette score" clustering; already in scikit-learn), 18,202 citations
  33. "Tau index" clustering: 3,530 Google hits, 226 citations - this paper also introduced the G plus index
  34. "Trace W index" clustering: 97 Google hits (170 for "TraceW index" clustering), 785 citations
  35. "trcovw index" clustering: 157 Google hits, 5,119 citations - this may not be the earliest citation
  36. "Wemmert-Gancarski index" clustering: 170 Google hits, 52 citations
  37. "Xie-Beni index" clustering: 9,000 Google hits, 4,380 citations

Besides the three in scikit-learn (Silhouette, Davies-Bouldin and Calinski-Harabasz), the ones that stand out are the D index, Dunn index, Gap index, and Xie-Beni index. However, the D index citation is a book that seems to be mostly cited for things other than the index.

To sum up, the most popular unsupervised clustering metrics missing from scikit-learn seem to be the Dunn index, Gap index, and Xie-Beni index. However, maybe the even bigger takeaway here is that there are a lot of widely used and widely cited metrics, and it might be worth supporting a more comprehensive set than just Dunn, Gap and Xie-Beni.
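
(For reference, one common formulation of the Dunn index, given as a hedged sketch rather than a proposed API, is the minimum between-cluster distance divided by the maximum cluster diameter:)

import numpy as np
from itertools import combinations
from sklearn.metrics import pairwise_distances


def dunn_index(X, labels):
    # Smallest distance between points in different clusters,
    # divided by the largest within-cluster diameter.
    clusters = [X[labels == k] for k in np.unique(labels)]
    min_separation = min(pairwise_distances(a, b).min()
                         for a, b in combinations(clusters, 2))
    max_diameter = max(pairwise_distances(c).max() for c in clusters)
    return min_separation / max_diameter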

@glevv
Contributor

glevv commented Sep 24, 2023

I would love to see more "unsupervised" (without the need for labels) clustering metrics, since the current ones are not very flexible (they rely too heavily on SSq or L2 distance). I don't know about metrics that need labels, though.

@lorentzenchr
Member

I'm personally not a big fan of adding that many unsupervised metrics; I'd rather add just the most important ones.
Is there any theory for selecting such a metric that would limit the choices?

@Wainberg
Author

My personal experience: I tried out a bunch of these on single-cell RNA sequencing data, trying to select the optimal resolution hyperparameter for Leiden clustering. My experience was that Ray-Turi was the most reasonable, Davies-Bouldin and Calinski-Harabasz were way off, and Silhouette was relatively good but also ridiculously computationally expensive (also a problem that Dunn, Gap and Xie-Beni all suffer from). All of the metrics gave completely different answers from each other!
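
(To make that workflow concrete, the comparison amounts to a sweep like the sketch below; KMeans and synthetic blobs stand in for Leiden clustering and the single-cell data, and ray_turi is the function from the snippet above.)

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Ray-Turi and Davies-Bouldin are minimized; Silhouette is maximized.
    print(k, ray_turi(X, labels), davies_bouldin_score(X, labels),
          silhouette_score(X, labels))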

@glevv
Contributor

glevv commented Sep 25, 2023

Sadly, all the metrics described in the OP rely on L2 distance from the "center". I was thinking about something like this:

[attached image]

Density-Based Clustering Validation 2014, Moulavi - Cited by 243

@Nylio-prog

I would like to work on adding the DBCV score. Let me know if anyone is already working on it, so we don't duplicate our work.

@luis261

luis261 commented Jan 26, 2024

I've arrived here from #28243, where I laid out the case for including DBCV by explaining how it fills a gap left by the three current metrics, which were also mentioned above:

"I would love to see more "unsupervised" (without the need in labels) clustering metrics, since current ones are not very flexible (they rely to heavily on SSq or L2 distance)."

"Sadly all metrics described in OP rely on L2 distance from the "center". I was thinking about something like this"

That's exactly the problem I experienced when tasked with clustering non-spherical data in a practical setting using the "conventional" metrics.
@glevv Just like you correctly demonstrated with your example, DBCV avoids this typical shortcoming: it does not rely on centroids and therefore does not make an implicit spherical assumption. Instead, it works based on MSTs, much like HDBSCAN does (my issue linked above goes into more detail on this as well).

My PR #28244 aims to include the implementation of DBCV that's shipped with https://github.com/scikit-learn-contrib/hdbscan, rather than one based on https://github.com/FelSiq/DBCV.

The HDBSCAN DBCV version also contains an optional performance improvement (toggled via a kwarg), based on this benchmark of mine, which outperforms the default DBCV version by avoiding excessive inflation of core distances (so far only demonstrated for low-dimensional data).
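
(For anyone who wants to try it, the contrib package exposes its DBCV implementation roughly as sketched below; this is from memory, so please check the package for the exact function name and signature.)

import numpy as np
from hdbscan import HDBSCAN
from hdbscan.validity import validity_index  # DBCV implementation shipped with hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = HDBSCAN(min_cluster_size=15).fit_predict(X)

# Cast to float64 to be safe; higher DBCV values indicate better clusterings.
print(validity_index(X.astype(np.float64), labels))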

"I would like to work on adding the DBCV score. Tell me if anyone is already on it to not duplicate our work"

@Nylio-prog I was not aware of your PR before ramping up myself; sorry about that.

Regardless of which implementation gets merged (if any), I believe my benchmark and the argument laid out in #28243 can provide helpful context to aid the decision-making process when it comes to adding this feature or not.
My argument for including DBCV is backed by an illustrative example as well as an experiment with a more systematic, empirical methodology following this paper.

@lorentzenchr
Member

lorentzenchr commented Jan 26, 2024

Based on the very good discussion here, I'm in favor of including

  • Ray-Turi
  • DBCV

@scikit-learn/core-devs and all other participants, giving your +1 (thumbs-up on this comment) might help push this issue forward a lot.

@thomasjpfan
Member

I am +1 on DBCV and not sure about Ray-Turi. Is there a reference that showcases the advantages of Ray-Turi compared to the existing scikit-learn cluster metrics?

@Micky774
Contributor

+1 to DBCV. I agree that we should be open to unsupervised metrics, especially density-based approaches. I think finding adversarial examples that highlight insufficiencies in the current scikit-learn metrics would help motivate these, particularly since we'll have to consider how we recommend their usage to users.
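
(As one hedged illustration of such an adversarial example: on two interleaved half-moons, the silhouette score can prefer a convex KMeans split over the true non-convex labels; the exact numbers depend on the noise level.)

from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Silhouette often rates the convex KMeans split above the true moon labels,
# even though the moons are the intended clustering here.
print(silhouette_score(X, kmeans_labels), silhouette_score(X, y_true))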

@luis261

luis261 commented Feb 2, 2024

My PR which addresses this issue is ready for review now: #28244

@glevv
Contributor

glevv commented Feb 2, 2024

Usually the process goes like this:
issue -> discussion -> decision -> if green light -> development plan -> PR.

I don't think we're at the PR stage yet.

@luis261

luis261 commented Feb 2, 2024

Thank you for letting me know. While I did read the contribution guidelines, my understanding of the process of adding entirely new features is limited. I assumed we had accumulated enough votes in addition to covering the required scientific criteria and practical considerations over here to get an actual implementation for DBCV underway. I'll stand by until a decision is made.

@lorentzenchr
Member

Usually the process goes like this: issue -> discussion -> decision -> if green light -> development plan -> PR.

I don't think we're at the PR stage yet.

There seems to be solid agreement on adding DBCV, so green light for that PR; reviews welcome.

@luis261

luis261 commented Feb 2, 2024

There seems to be solid agreement on adding DBCV, so green light for that PR; reviews welcome.

Great, can't wait!

@luis261

luis261 commented Feb 23, 2024

It's been 3 weeks since #28244 was greenlit for review, and I have not been able to observe any movement on this issue in any regard whatsoever since then.

If there is still interest in pushing this forward, I expect to see an assigned reviewer soon, and ideally to be informed of an estimated time by which the initial, preliminary review will be completed.

@adrinjalali
Member

@luis261 We don't assign reviewers, and reviewer time is the bottleneck in most open-source projects. Our green-lighting this doesn't mean it has high priority. Unfortunately, a few weeks is not a long time for a big project with over 600 open pull requests.

You can ping people who have shown interest in this thread every now and then and hopefully they can allocate some time to review.

@luis261

luis261 commented Feb 23, 2024

@adrinjalali sure, I understand that. Hope I didn't come off as rude. I just wanted to get everyone on the same page and probe a bit to find out whether there is still interest in this issue, so I can adjust my own priorities.

Seems we are on track for now though; I've already gotten valuable feedback on the PR that I'll take care to implement now (:

@lesteve
Member

lesteve commented Feb 24, 2024

@luis261 let me try to be honest: your comment #27259 (comment) sounded very demanding.

No harm done; there could be plenty of reasons for this. Maybe I was having a bad day, or I have a heightened sensitivity to this kind of message, or to being pinged on something I know very little about in general. Also, English is not my main language, and maybe not yours either, which sometimes does not help.

I completely get your frustration: you have worked on something for a considerable amount of time and then get no feedback for three weeks, which is definitely not great. Be reassured that, as a maintainer, it certainly does not feel great either to drop things / let people down, but as Adrin said, we do have plenty on our plate and we try to do what we can.

Personally, when I don't get any feedback on an issue, I try to say things like "of course I completely understand if you have other priorities" or "let me know if there is anything I can do to help push this forward". A recent example where I did not have any feedback for roughly three weeks: colesbury/nogil-wheels#6 (comment)

I really like Brett Cannon talk on open-source participation:
https://snarky.ca/setting-expectations-for-open-source-participation/. In particular, I find the "PR as a puppy" metaphor great.

@luis261

luis261 commented Feb 24, 2024

let me try to be honest: your comment #27259 (comment) sounded very demanding.

@lesteve I'm sorry it came off that way; that's certainly not what I intended. What I meant to communicate was not that I feel entitled to a review and therefore demand to receive one, but rather that, under the assumption that there is still interest in the issue at hand, I'm anticipating a review in the near future. If I hadn't gotten any response, or had gotten an explicit no, that would have been fine with me as well; at least in that case I could have "cut my losses" and moved on. I'd prefer that outcome over ambiguity that results in me unnecessarily remaining on standby.

I really like Brett Cannon talk on open-source participation:
https://snarky.ca/setting-expectations-for-open-source-participation/. In particular, I find the "PR as a puppy" metaphor great.

I read through the talk in its article form. Looking at PRs as puppies seems like a good way to think about it. I like to think that I personally have a history of interacting with maintainers in a polite way that's consistent with that.

Personally when I don't get any feed-back on an issue, I kind of try to say things like "of course I completely understand if you have other priorities" or "let me know if there is anything I can do to help this push forward". A recent example where I did not have any feed-back in roughly three weeks: colesbury/nogil-wheels#6 (comment)

That seems like a good way of going about it.
