New clustering metrics #27259
Comments
@Wainberg could you provide articles from the scientific literature regarding these metrics, so that we can judge their pertinence and decide whether or not to include them in scikit-learn?
The Ray-Turi paper has been cited 932 times since 1999. Its main contribution is the clustering metric.

The Ball-Hall paper has been cited 1681 times since 1965. It introduced ISODATA (a variant of k-means) along with the clustering metric, but if you look at the top citations they're mostly about using the clustering metric. One disadvantage of this metric is that unlike the others here (and also unlike the three already implemented in scikit-learn: Silhouette score, Davies-Bouldin, and Calinski-Harabasz), you have to look at the maximum difference in the metric between two adjacent values of k (the number of clusters) to determine the best k.

The Banfield-Raftery paper has been cited 3185 times since 1993. The top citations are a mix of the clustering metric and the paper's other contributions.

The log_SS_ratio is less popular than the other three and might make sense to leave out. The clearest case for inclusion would be Ray-Turi, which has been cited nearly 1000 times just for the clustering metric.
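To make that selection rule concrete, here is a rough sketch; the `ball_hall_score` helper is hypothetical (not part of scikit-learn) and follows one common formulation of the index as the mean within-cluster dispersion:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def ball_hall_score(X, labels):
    """Hypothetical helper: mean, over clusters, of the mean squared
    distance of each point to its cluster centroid (one common
    formulation of the Ball-Hall index; lower is better)."""
    total = 0.0
    clusters = np.unique(labels)
    for c in clusters:
        members = X[labels == c]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum(axis=1).mean()
    return total / len(clusters)


X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
ks = list(range(2, 10))
scores = [
    ball_hall_score(X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X))
    for k in ks
]
# The index decreases as k grows, so instead of taking the minimum,
# look for the largest change between adjacent values of k
# (conventions differ on whether to report the k before or after the jump).
jump = int(np.argmax(np.abs(np.diff(scores))))
best_k = ks[jump + 1]
print(best_k)
```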
With those stats I don't think we can argue they're not popular. I don't mind including them. But cc'ing @lorentzenchr since he probably has a better overview of the scores we're adding.
I'm no expert on that field. Given the citations, Ray-Turi, Ball-Hall and Banfield-Raftery seem fine to include.
So I went on a bit of a deep dive through a few implementations of unsupervised cluster evaluation metrics:
and found 37 metrics. I looked at their number of Google hits and their number of citations. Keep in mind that some papers are cited for things other than the clustering metric. I've bolded the ones that particularly stand out on both popularity metrics.
Besides the three in scikit-learn (Silhouette, Davies-Bouldin and Calinski-Harabasz), the ones that stand out are the D index, Dunn index, Gap index, and Xie-Beni index. However, the D index citation is a book that seems to be mostly cited for things other than the index. To sum up, the most popular unsupervised clustering metrics missing from scikit-learn seem to be the Dunn index, Gap index, and Xie-Beni index. However, maybe the even bigger takeaway here is that there are a lot of widely used and widely cited metrics, and it might be worth supporting a more comprehensive set than just Dunn, Gap and Xie-Beni.
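For reference, the Dunn index itself is simple to state: the smallest between-cluster distance divided by the largest within-cluster diameter (higher is better). A naive, purely illustrative sketch, quadratic in the number of samples:

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def dunn_index(X, labels):
    """Naive O(n^2) Dunn index: smallest distance between points in
    different clusters divided by the largest within-cluster diameter
    (higher is better)."""
    labels = np.asarray(labels)
    D = pairwise_distances(X)
    same_cluster = labels[:, None] == labels[None, :]
    min_between = D[~same_cluster].min()
    max_diameter = max(
        D[np.ix_(labels == c, labels == c)].max() for c in np.unique(labels)
    )
    return min_between / max_diameter
```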
I would love to see more "unsupervised" (no need for labels) clustering metrics, since the current ones are not very flexible (they rely too heavily on SSq or the L2 distance). I don't know much about metrics that need labels, though.
I'm personally not a big fan of adding that many unsupervised metrics, just the most important ones.
My personal experience: I tried out a bunch of these on single-cell RNA sequencing data, trying to select the optimal resolution hyperparameter for Leiden clustering. My experience was that Ray-Turi was the most reasonable, Davies-Bouldin and Calinski-Harabasz were way off, and Silhouette was relatively good but also ridiculously computationally expensive (also a problem that Dunn, Gap and Xie-Beni all suffer from). All of the metrics gave completely different answers from each other!
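To make that workflow concrete, a sweep with the metrics already in scikit-learn looks roughly like the sketch below (KMeans over k stands in for Leiden over a resolution parameter, which scikit-learn doesn't ship; silhouette_score's sample_size option is one way to tame its quadratic cost):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X, _ = make_blobs(n_samples=2000, centers=5, random_state=0)
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(
        k,
        # subsampling keeps the O(n^2) silhouette computation tractable
        silhouette_score(X, labels, sample_size=1000, random_state=0),
        calinski_harabasz_score(X, labels),
        davies_bouldin_score(X, labels),
    )
```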
Sadly, all the metrics described in the OP rely on the L2 distance from the "center". I was thinking about something like this: Density-Based Clustering Validation (Moulavi et al., 2014), cited by 243.
I would like to work on adding the DBCV score. Tell me if anyone is already on it, so we don't duplicate work.
I've arrived here from #28243, where I laid out the case for including DBCV by explaining how it fills a gap left by the three current metrics, which were also mentioned above.
That's exactly the problem I experienced when tasked with clustering non-spherical data in a practical setting while relying on the "conventional" metrics. My PR #28244 aims to include the implementation of DBCV that's shipped with https://github.com/scikit-learn-contrib/hdbscan, instead of basing it on https://github.com/FelSiq/DBCV
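For anyone who wants to try DBCV before anything lands in scikit-learn, the hdbscan package already exposes it as validity_index. A rough usage sketch, assuming hdbscan is installed; I believe the function wants a float64 array, but double-check the exact API against the package docs:

```python
import numpy as np
import hdbscan
from hdbscan.validity import validity_index
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)

# DBCV lies in [-1, 1]; higher means denser, better-separated clusters.
score = validity_index(X.astype(np.float64), labels)
print(score)
```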
@Nylio-prog I was not aware of your PR before ramping up myself, sorry about that. Regardless of which implementation gets merged (if any), I believe my benchmark and the argument laid out in #28243 can provide helpful context to aid the decision-making process when it comes to adding this feature or not.
Based on the very good discussion here, I'm in favor of including
@scikit-learn/core-devs and all other participants: giving your +1 (thumbs up on this comment) might help push this issue forward a lot.
I am +1 on DBCV and not sure about Ray-Turi. Is there a reference that showcases the advantages of Ray-Turi compared to the existing scikit-learn cluster metrics?
+1 to DBCV. I agree that we should be open to unsupervised metrics, esp. density-based approaches. I think finding adversarial examples that highlight insufficiency in current scikit-learn metrics would help motivate these, especially as we'll have to consider how we recommend their usage to users.
My PR which addresses this issue is ready for review now: #28244
Usually the process goes like this: I don't think we're at the PR stage yet.
Thank you for letting me know. While I did read the contribution guidelines, my understanding of the process of adding entirely new features is limited. I assumed we had accumulated enough votes, in addition to covering the required scientific criteria and practical considerations over here, to get an actual implementation of DBCV underway. I'll stand by until a decision is made.
There seems to be solid agreement on adding DBCV. So green light for that PR; reviews welcome.
Great, can't wait!
It's been 3 weeks since #28244 was greenlit for review. I have not been able to observe any movement on this issue in any regard whatsoever since then. If there is still interest in pushing this forward, I expect to see an assigned reviewer soon, as well as, ideally, an estimate of when the initial review will be completed.
@luis261 We don't assign reviewers, and reviewer time is the bottleneck in most open source projects. Our green-lighting this doesn't mean it has a high priority. Unfortunately, a few weeks is not a long time for a big project with over 600 open pull requests. You can ping people who have shown interest in this thread every now and then, and hopefully they can allocate some time to review.
@adrinjalali Sure, I understand that. I hope I didn't come off as rude. I just wanted to get everyone on the same page and probe a bit, so I know whether there is still interest in this issue and can adjust my own priorities accordingly. It seems we are on track for now, though; I already got valuable feedback on the PR that I'll take care to implement now (:
@luis261 Let me try to be honest: your comment #27259 (comment) sounded very demanding. No harm done; there could be plenty of reasons for this. Maybe I was having a bad day, or I have a heightened sensitivity to this kind of message, or to generally being pinged on something I know very little about. Also, English is not my main language, and maybe not yours either, which sometimes does not help.

I completely get your frustration: you have worked on something for a considerable amount of time and you get no feedback for three weeks, which is definitely not great. Be reassured that, as a maintainer, it certainly does not feel great either to drop things / let people down, but as Adrin said, we do have plenty on our plate and we try to do what we can.

Personally, when I don't get any feedback on an issue, I try to say things like "of course I completely understand if you have other priorities" or "let me know if there is anything I can do to help push this forward". A recent example where I did not get any feedback in roughly three weeks: colesbury/nogil-wheels#6 (comment)

I really like Brett Cannon's talk on open-source participation:
@lesteve I'm sorry it came off that way; that's certainly not what I intended. What I meant to communicate was not that I feel entitled to a review and therefore demand to receive one, but rather that, assuming there is still interest in the issue at hand, I'm anticipating a review in the near future. Had I gotten no response at all, or an explicit no, that would have been fine with me as well; at least in that case I could "cut my losses" and move on. I'd prefer that outcome over ambiguity that results in me unnecessarily remaining on standby.
I read through the talk in its article form. Looking at PRs as puppies seems like a good way to think about it. I like to think that I personally have a history of interacting with maintainers in a polite way that's consistent with that.
That seems like a good way of going about it
Describe the workflow you want to enable
Scikit-learn defines three popular metrics for evaluating clustering performance when there are no ground-truth cluster labels: sklearn.metrics.silhouette_score, sklearn.metrics.calinski_harabasz_score and sklearn.metrics.davies_bouldin_score. But there are lots of others, and it's previously been discussed whether to integrate more into scikit-learn.
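For context, the three existing metrics share the same call signature, a data matrix plus the predicted labels:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))         # in [-1, 1], higher is better
print(calinski_harabasz_score(X, labels))  # unbounded, higher is better
print(davies_bouldin_score(X, labels))     # >= 0, lower is better
```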
Describe your proposed solution
I've implemented four relatively popular ones, using the same interface and code style as in https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/metrics/cluster/_unsupervised.py. Would there be interest in integrating these into scikit-learn?
Describe alternatives you've considered, if relevant
No response
Additional context
No response