[WIP] ENH Optimal n_clusters values #6948
Conversation
Force-pushed from 7d02026 to e7ad3b0.
@jnothman I have begun to address some of your suggestions; I still have some work to do. It now uses parallel work to compute the best value for `n_clusters` quickly. I will send examples soon. BTW: why do the clustering algorithms use […]?
I've not worked much with the GMM code and API, but it doesn't support […].
With the current Gap implementation, you lose the early stopping criterion (but you parallelized the computations). On datasets with large N, and GapStatistic(data, estimator) < ~10, I think this implementation is slower. [my2cents] The gap algorithm is built to return a low number of clusters and stop early, so unless the work is very distributed, the parallelized version is slower. [/my2cents] (We can go with this implementation and make a separate PR to produce an early-stopping gap algorithm.) (There could be a "parallelized" option, but "early stopping" should be the default behavior; it's the algorithm in the paper.)

On the examples, I'd like a non-convex example, e.g. 4 concentric circles or 4 half-moons (stability should work whereas all the others should fail; you need an estimator able to recover these clusters, such as SpectralClustering).
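For reference, here is what the suggested non-convex setup can look like with plain scikit-learn, independent of this PR's searchers (a minimal sketch; `make_moons` yields two half-moons rather than the four suggested above):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering, KMeans

# Two interleaved half-moons: non-convex clusters that KMeans-style
# criteria handle poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# SpectralClustering can recover the non-convex structure...
spectral_labels = SpectralClustering(
    n_clusters=2, affinity='nearest_neighbors',
    random_state=0).fit_predict(X)

# ...whereas KMeans tends to cut each moon in half.
kmeans_labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
```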
```python
        fitting_process='stability', metric=fowlkes_mallows_score)),
    ('Distortion jump', OptimalNClusterSearch(
        estimator=estimator, parameters=parameters,
        fitting_process='distortion_jump')),
```
Is there an interaction between `fitting_process` and `metric`?
For the `unsupervised` case, the method computes the max value of an unsupervised metric passed as a parameter (e.g. Calinski-Harabasz or silhouette).

For the `stability` case, the method computes the fitted labels of two subsets of X and checks that we obtain the same clusters by applying a supervised metric (e.g. Fowlkes-Mallows).

If you want, I can create new fitting processes called `silhouette`, `calinski`, ... and remove the `metric` attribute.
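To make the distinction concrete, here is a hand-rolled sketch using only existing scikit-learn pieces (not the PR's `OptimalNClusterSearch`; the subsampling scheme is an illustrative assumption): the `unsupervised` process maximises an unsupervised metric such as silhouette, while the `stability` process compares the labels induced by two subsample fits using a supervised metric such as Fowlkes-Mallows.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, fowlkes_mallows_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)
rng = np.random.RandomState(0)

for k in range(2, 7):
    # 'unsupervised' process: score each candidate with an unsupervised metric.
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    unsup = silhouette_score(X, labels)

    # 'stability' process: fit on two subsamples, then compare the labels
    # each model assigns to the full data with a supervised metric.
    a = rng.choice(len(X), size=len(X) // 2, replace=False)
    b = rng.choice(len(X), size=len(X) // 2, replace=False)
    labels_a = KMeans(n_clusters=k, random_state=0).fit(X[a]).predict(X)
    labels_b = KMeans(n_clusters=k, random_state=1).fit(X[b]).predict(X)
    stab = fowlkes_mallows_score(labels_a, labels_b)

    print(k, round(unsup, 3), round(stab, 3))
```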
That's true, but I assumed it was better to have parallelisation for everyone than fast computation for only one case. Moreover, I've designed these classes to do a global analysis of the clustering obtained by an estimator. I'm not 100% sure because I don't know R, but it seems that the R implementation needs to know all the values too (function `maxSE`).
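For reference, the selection rule from Tibshirani et al. (one of the criteria R's `maxSE` generalises) is what permits early stopping: pick the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}. A minimal sketch, assuming `gap[i]` and `sk[i]` hold the gap value and its standard error for k = k_min + i:

```python
def first_acceptable_k(gap, sk, k_min=1):
    # Smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}; a sequential
    # implementation can stop as soon as this test succeeds, instead of
    # evaluating every candidate k (the parallel strategy above).
    for i in range(len(gap) - 1):
        if gap[i] >= gap[i + 1] - sk[i + 1]:
            return k_min + i
    return k_min + len(gap) - 1
```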
The way these classes are defined, you can apply SpectralClustering only in the unsupervised case, because SpectralClustering doesn't have a score function. I thought the conclusion of the discussion in #4301 was that it doesn't make sense to apply non-convex clustering methods with these criteria? @agramfort @jnothman What is your point of view on these points?
It would be a shame to develop powerful techniques (SpectralClustering + stability solving the clustering problem in complex cases) and not show them off. Convex cases are the simplest ones. If it doesn't fit in this example, maybe stability can have its own example (compared to Calinski on one convex and two non-convex datasets).

P.S.: I re-read Tibshirani's article. They do not tie the gap statistic to the clustering method, only to the chosen distance between points. From the article's point of view, we could use SpectralClustering and gap. I understand agramfort's point. Still, I have cases where I compute the clustering with some data (semantic distance between keywords) and select the number of clusters with other data (I want keywords from a cluster to have similar conversion rates). There would be much to debate. I'm OK not using gap & spectral in the examples, as it is a rather odd usage.
Very good. Thanks
Honestly I don't get how you can obtain a sound result using gap stats and spectral clustering, but maybe I am missing something? Basically you need a score, and the score has to be related to what the algo is trying to maximize or minimize.
I have a problem where I want to maximize A, but only have reliable individual data for B. So my algo optimizes B, and I have a score on A (which becomes reliable on groups because they have far more data than individual points). I don't think it's relevant to this discussion, as I agree that gap + spectral shouldn't be in the example. I'd be happy to explain by e-mail if you're interested.
I have to admit it's a bit obscure what you're saying :) Let's keep the public sklearn API so it addresses the use cases of the majority of us.
Yes, stability works for all estimators. Fine with me.
@tguillemot, could you please explain why we are constraining this to use a clusterer's […]?
What's the status on this? I'm super excited :)
@amueller I have to finish another thing right now, but this PR is next on my todo list.
To address your initial question about […]
@tguillemot, do you have capacity for this at the moment?
Force-pushed from 5f0415a to 1868d75.
Sorry for the late answer :(. I've done a refresh of this PR, added the unsupervised metric for each searcher, and added documentation. Once everyone is OK with it, we will make this PR mergeable. If I can't finish it, I will hand the PR over to someone else. Does that seem OK to everyone?
What do you mean by any other metric? It seems like only […].

I'm quite confused reading the code in its current state, with the documentation not quite consistent. When are we talking about searching over all clusterer parameters, and when just the number of clusters? IIRC some searchers were generic to any parameter, others required `n_clusters`.

Am I right in thinking that anywhere a supervised metric can be used, an unsupervised metric should be usable, but that our metric function interface makes this hard (except that you might presume supervised if y is provided, otherwise unsupervised)? (It looks like […].)

We should consider whether […].

In cases where the data is not resampled/altered (which is what motivates `refit` in grid search), I suspect we need not refit. We could have an option `store_fitted={'all', 'none', 'best'}`. Or else we could allow the user to memoize the models (or we could do it for them) to avoid an expensive refit.

It would be appropriate to make […].
Maybe I misunderstood what you asked, but I've added for Gap, Pham and Distortion the possibility to use any arbitrary unsupervised metric rather than the score function of the estimator. As I said, I'm -1 on adding unsupervised metrics in that way, because Gap, Pham, ... are not defined to work for estimators other than KMeans.

Sorry, I thought the documentation would be the best way to understand the API, but I didn't take enough time to check it in detail. Sorry again.

We are always searching over all clusterer parameters.

The thing is, all these methods are totally different from one another, and some of them need to fit the estimator for `n_clusters` and `n_clusters - 1` (which is the case for Gap and Pham). In my view, the API can be implemented in three ways: […]

Indeed you can, but what would be the point? The only searcher which uses a supervised metric is `stability`. The other searchers are not designed to deal with supervised metrics, because the methods are not defined that way. Gap, Pham, ... are mathematical functions defined from the score of KMeans only. It's already complicated to integrate unsupervised metrics, and I don't see how we can change the scorer to use supervised metrics.

I have added `_compute_score` to simplify the integration of unsupervised metrics in the searchers. To be honest I don't like it, because it's only useful for unsupervised metrics, and I don't know how to deal with the supervised case without doing something complicated.

That is what I tried to do, indeed.

I'm not sure I see what you are referring to. For example, do you mean the case where the user fits only on sampled data and wants to apply the same parameters to fit all points?
+1
> As I said, I'm -1 on adding unsupervised metrics in that way, because Gap, Pham, ... are not defined to work for estimators other than KMeans.
That's patently untrue. The paper introducing Gap explicitly says in its abstract that "the technique uses the output of any clustering algorithm". Maybe you mean to say that they're not defined to work with any measure other than KMeans inertia? I agree Gap should not take an arbitrary clustering score metric as a parameter - it uses the weighted sum of within-cluster distances - but it does allow for arbitrary distance metrics (not just euclidean).

I've not looked into Pham, etc.
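For concreteness, Tibshirani et al. define the dispersion W_k as the pooled, size-normalised sum of within-cluster pairwise distances, and Gap(k) as E*[log W_k] - log W_k, with the expectation estimated over reference datasets. A sketch of W_k under an arbitrary distance metric (hand-rolled here, not this PR's code):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def within_dispersion(X, labels, metric='euclidean'):
    # W_k = sum_r D_r / (2 * n_r), where D_r is the (double) sum of pairwise
    # distances within cluster r; any distance metric is allowed.
    w = 0.0
    for r in np.unique(labels):
        X_r = X[labels == r]
        w += pairwise_distances(X_r, metric=metric).sum() / (2 * len(X_r))
    return w
```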
> In cases where the data is not resampled/altered (which is what motivates `refit` in grid search), I suspect we need not refit. We could have an option `store_fitted={'all', 'none', 'best'}`. Or else we could allow the user to memoize the models (or we could do it for them) to avoid an expensive refit.
>
> I'm not sure I see what you are referring to. For example, do you mean the case where the user fits only on sampled data and wants to apply the same parameters to fit all points?
I'm saying, for instance, if a method runs `estimator.fit(X)` for each parameter setting, with the whole input `X`, then we should not need to re-fit it later. Either keep the model in memory or keep the model in `joblib.Memory`. In `GridSearchCV`, refit is necessary because the estimator is never fit on the whole `X` during the search.
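As a sketch of the memoization option (the `Memory` API is real joblib; wiring it into the searchers is hypothetical):

```python
from joblib import Memory
from sklearn.base import clone

memory = Memory(location='/tmp/n_clusters_cache', verbose=0)

@memory.cache
def fit_model(estimator, params, X):
    # Fitted models are cached on disk keyed by their inputs, so picking
    # the best parameter setting afterwards needs no expensive refit.
    return clone(estimator).set_params(**params).fit(X)
```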
In terms of shared code: I think we try to balance code readability with avoiding duplicated code here. Usually this means not using this kind of inheritance with little parts of each implementation pulled out, but I'm not yet familiar enough with the code to say what is and isn't applicable here. But in terms of shared public API, let's:
Methods that are only designed to operate over `n_clusters` […]. But if there's no reason to constrain something like stability search to using Fowlkes-Mallows (and not ARI, V-measure or B-cubed), why would we?

Ultimately, I'm not as familiar with all these techniques as I should be (or will be), and I think I would benefit greatly from seeing a tabular comparison, in the comments here or drafted in the narrative docs. This is much easier to grok than a series of docstrings and implementations. That is, for each technique I'd like to see: […]
Unless I've misunderstood, all these searches are intended to work where gold-standard labels are unavailable. If I'm mistaken, note this too. Is this going over something we've covered and I've just lost the comment?

And I apologise if some of my preceding comments have been under-informed or misleading, or have wasted your time. In general, please take my comments as suggestions and thoughts, often made in a hurry. I have not always had the time to look in depth at something, and often intend to clarify the situation. Here we're working with a lot of subtly different techniques, and it's hard for me to get them all into my mind while reviewing tens of PRs.
Do we see this happening in one of the next couple of releases?
@jnothman To be honest, I don't have much time now. Sorry. It's not a problem if someone else wants to work on this.
Hello everybody, […]
I think there are too many open questions on this, and it doesn't seem like the OP has the time to get it to the finish line (which, it seems, might not be possible with our inclusion criteria these days). Related issue: #27259
Reference Issue
I propose to integrate the algorithms which automatically compute the `n_clusters` value, proposed in #4301 by @afouchet. This PR is the second step of the work initiated by #6823.
What does this implement/fix? Explain your changes.
I propose several algorithms that analyse the best value for `n_clusters`: […]
Any other comments?
TODO:
- `n_clusters` […]