[WIP] ENH Optimal n_clusters values #6948


Closed
wants to merge 5 commits

Conversation

tguillemot
Contributor

@tguillemot tguillemot commented Jun 29, 2016

Reference Issue

I propose to integrate the algorithms that automatically compute the optimal n_clusters value, as proposed in #4301 by @afouchet.

This PR is the second step of the work initiated by #6823.

What does this implement/fix? Explain your changes.

I propose several criteria to determine the best value for 'n_clusters':

  • max value of some unsupervised metrics (silhouette, Calinski-Harabasz); a rough sketch of this criterion follows the list
  • stability criterion (using the Fowlkes-Mallows metric)
  • max value of the distortion jump
  • gap criterion
  • Pham criterion
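
For illustration, here is a minimal sketch of the first criterion (maximize an unsupervised metric over candidate values of n_clusters), written against the existing scikit-learn API rather than the searcher proposed in this PR; the helper `best_n_clusters` is made up (note that `calinski_harabasz_score` was spelled `calinski_harabaz_score` in releases contemporary with this PR):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def best_n_clusters(X, candidates, metric=silhouette_score, random_state=0):
    """Fit KMeans for each candidate and return the n_clusters maximizing `metric`."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = metric(X, labels)
    return max(scores, key=scores.get), scores


X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
k_sil, _ = best_n_clusters(X, range(2, 10), metric=silhouette_score)
k_ch, _ = best_n_clusters(X, range(2, 10), metric=calinski_harabasz_score)
print(k_sil, k_ch)  # both criteria should recover 4 on well-separated blobs
```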

Any other comments?

TODO:

  • Use parallelism to accelerate the computation
  • Do not require a range of values for n_clusters
  • Use the ParameterGrid class to allow studying the evolution of other parameters
  • Adapt the API so that its use is similar to GridSearch
  • Make the results analysable with pandas
  • docstring
  • docs
  • tests
  • examples

@tguillemot force-pushed the optimal_n_clusters branch 2 times, most recently from 7d02026 to e7ad3b0 on June 29, 2016 13:32
@tguillemot
Contributor Author

tguillemot commented Jun 29, 2016

@jnothman I have begun to address some of your suggestions. I still have some work to do.

It now uses parallel workers to compute the best value for n_clusters quickly. Moreover, it is no longer necessary to give a range for the values of n_clusters, and other parameters can be given through the parameters attribute. The usage is very similar to GridSearchCV.

I will send examples soon.

BTW: why do the clustering algorithms use n_clusters as the attribute when the GMM uses n_components? Why not use n_components for all the classes?

@agramfort @afouchet

@tguillemot
Contributor Author

Here is a plot produced by the plot_optimal_n_clusters.py example:
[Image: plot_optimal_n_clusters]

@jnothman
Member

I've not worked much with the GMM code and API, but it doesn't support labels_ either; it's being somewhat set apart from clustering, but perhaps it should not be. n_components as a parameter is borrowed from decomposition, for good or bad. :/ No clear answer here.

@afouchet

With the current Gap implementation, you lose the early-stopping criterion (but you parallelized the computations). On datasets with large N, where the gap statistic selects a small number of clusters (roughly fewer than ~10), I think this implementation is slower. [my2cents] The gap algorithm is more or less built to return a low number of clusters and stop early, so unless the work is very distributed, the parallelized version is slower. [/my2cents] (We can go with this implementation and make a separate PR to produce an early-stopping gap algorithm.) (There could be a "parallelized" option, but early stopping should be the default behavior; it's the algorithm in the paper.)
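
For reference, a minimal sketch of the early-stopping rule from Tibshirani et al. (2001) being discussed here, assuming W_k is the KMeans inertia and that reference datasets are drawn uniformly over the bounding box of X; the helper names are made up and this is not the PR's code:

```python
import numpy as np
from sklearn.cluster import KMeans


def _log_inertia(X, k, rng):
    """log(W_k): log of the pooled within-cluster sum of squares for k clusters."""
    if k == 1:
        return np.log(((X - X.mean(axis=0)) ** 2).sum())
    km = KMeans(n_clusters=k, n_init=10,
                random_state=rng.randint(2**31 - 1)).fit(X)
    return np.log(km.inertia_)


def _gap(X, k, n_refs, rng):
    """Gap(k) and its standard error s_k, using uniform reference datasets."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_logs = np.array([
        _log_inertia(rng.uniform(mins, maxs, size=X.shape), k, rng)
        for _ in range(n_refs)
    ])
    gap = ref_logs.mean() - _log_inertia(X, k, rng)
    s = ref_logs.std() * np.sqrt(1 + 1.0 / n_refs)
    return gap, s


def gap_statistic_early_stop(X, max_clusters=10, n_refs=10, random_state=0):
    """Return the smallest k such that Gap(k) >= Gap(k + 1) - s_{k + 1}."""
    rng = np.random.RandomState(random_state)
    gap_k, _ = _gap(X, 1, n_refs, rng)
    for k in range(1, max_clusters):
        gap_next, s_next = _gap(X, k + 1, n_refs, rng)
        if gap_k >= gap_next - s_next:
            return k  # early stop: larger values of k are never fitted
        gap_k = gap_next
    return max_clusters

# k_hat = gap_statistic_early_stop(X)   # X: array of shape (n_samples, n_features)
```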

Regarding the examples, I'd like a non-convex example, for instance 4 concentric circles or 4 half-moons (stability should work whereas all the others should fail; you need an estimator able to recover these clusters, such as SpectralClustering).

fitting_process='stability', metric=fowlkes_mallows_score)),
('Distortion jump', OptimalNClusterSearch(
estimator=estimator, parameters=parameters,
fitting_process='distortion_jump')),
Member

is there an interaction between fitting_process and metric?

Contributor Author

For the unsupervised case, the method takes the maximum value of an unsupervised metric passed in the parameters (e.g. Calinski-Harabasz or silhouette).
For the stability case, the method computes the fitted labels of two subsets of X and checks that we obtain the same clusters by applying a supervised metric (e.g. Fowlkes-Mallows).
If you want, I can create new fitting processes called silhouette, calinski, ... and remove the metric attribute.
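
To make the stability case concrete, here is a rough sketch (not the PR's StabilitySearch; `stability_score` is a made-up helper) that clusters two overlapping random subsamples and compares the labels of the shared points with Fowlkes-Mallows:

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score


def stability_score(estimator, X, n_draws=10, subsample=0.8, random_state=0):
    """Average Fowlkes-Mallows agreement between clusterings of two random subsamples."""
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    size = int(subsample * n_samples)
    scores = []
    for _ in range(n_draws):
        idx_a = rng.choice(n_samples, size=size, replace=False)
        idx_b = rng.choice(n_samples, size=size, replace=False)
        labels_a = clone(estimator).fit_predict(X[idx_a])
        labels_b = clone(estimator).fit_predict(X[idx_b])
        # Compare the labels assigned to the points present in both subsamples.
        common = np.intersect1d(idx_a, idx_b)
        pos_a = {i: p for p, i in enumerate(idx_a)}
        pos_b = {i: p for p, i in enumerate(idx_b)}
        scores.append(fowlkes_mallows_score(
            [labels_a[pos_a[i]] for i in common],
            [labels_b[pos_b[i]] for i in common]))
    return np.mean(scores)

# Hypothetical usage: pick the most stable n_clusters over a candidate range.
# best_k = max(range(2, 10),
#              key=lambda k: stability_score(KMeans(n_clusters=k, n_init=10), X))
```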

@tguillemot
Contributor Author

With the current Gap implementation, you lose the early-stopping criterion (but you parallelized the computations).

That's true, but I assumed it was better to have parallelisation for everyone than fast computation for only one.

Moreover, I've designed these classes to do a global analysis of the clustering obtained by an estimator.
It seems more interesting to me to see how the score changes for different parameters (as is done in GridSearchCV) than to only find the best configuration.
The Matlab Gap implementation computes all the values to do a global analysis (see the example):
http://fr.mathworks.com/help/stats/clustering.evaluation.gapevaluation-class.html

I'm not 100% sure because I don't know R, but it seems that the R implementation needs to know all the values too (function maxSE):
https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clusGap.html

Regarding the examples, I'd like a non-convex example, for instance 4 concentric circles or 4 half-moons (stability should work whereas all the others should fail; you need an estimator able to recover these clusters, such as SpectralClustering).

The way these classes are defined, you can apply SpectralClustering only in the unsupervised case because SpectralClustering doesn't have a score function. I thought the conclusion of the discussion in #4301 was that it doesn't make sense to apply non-convex clustering methods with these criteria?
Maybe I'm wrong about that.

@agramfort @jnothman What is your point of view on these points?

@afouchet

The way these classes are defined, you can apply SpectralClustering only in the unsupervised case because SpectralClustering doesn't have a score function. I thought the conclusion of the discussion in #4301 was that it doesn't make sense to apply non-convex clustering methods with these criteria?

It would be a shame to develop powerful techniques (SpectralClustering + stability solving the clustering problem in complex cases) and not show them. Convex cases are the simplest ones.

If it doesn't fit in this example, maybe stability can have its own example (compared to Calinski-Harabasz on one convex and 2 non-convex datasets).

P.S.: I read Tibshirani's article again. They do not link the gap statistic to the clustering method, only to the chosen distance between points. From the article's point of view, we could use SpectralClustering and gap. I understand agramfort's point. Still, I have cases where I compute the clustering with some data (semantic distance between keywords) and select the number of clusters with other data (I want keywords from a cluster to have similar conversion rates). There would be much to debate. I'm OK with not using gap & spectral in the examples, as it is a rather odd usage.

@afouchet

That's true, but I assumed it was better to have parallelisation for everyone than fast computation for only one. [...]

Very good. Thanks

@agramfort
Member

agramfort commented Jul 11, 2016 via email

@afouchet

afouchet commented Jul 11, 2016

Honestly, I don't get how you can obtain a sound result using gap stats and spectral clustering, but maybe I am missing something? Basically you need a score, and the score has to be related to what the algo is trying to maximize or minimize.

I have a problem where I want to maximize A, but I only have reliable individual data B. -> My algo optimizes (B), and I have a score on (A) (which becomes reliable on groups because they have far more data than individual points). I don't think it's relevant to this discussion, as I agree that gap + spectral shouldn't be in the example. I'd be happy to explain by e-mail if you're interested.

@agramfort
Member

agramfort commented Jul 11, 2016 via email

@agramfort
Member

Yes stability works for all estimators. Fine with me

@jnothman
Member

@tguillemot, could you please explain why we are constraining this to use a clusterer's score method, not allowing an arbitrary unsupervised clustering metric?

@amueller
Member

What's the status on this? I'm super excited :)

@amueller amueller added this to the 0.19 milestone Oct 11, 2016
@tguillemot
Contributor Author

@amueller I have to finish another thing right now, but this PR is next on my todo list.
Happy to know you're motivated by this PR ;)

@amueller
Member

To address your initial question about n_components: because n_components is really weird for clustering, and n_clusters is really weird for mixture models ;) I'd be slightly more OK with using n_clusters for GMMs, but probably not ;)

@jnothman
Member

jnothman commented Jan 9, 2017

@tguillemot, do you have capacity for this at the moment?

@tguillemot
Contributor Author

Do you have capacity for this at the moment?
@jnothman Unfortunately, I don't have much time to work on it, but the main problem for me is that it does not make sense to use an unsupervised metric for every searcher.

Could you please explain why we are constraining this to use a clusterer's score method, not allowing an arbitrary unsupervised clustering metric?

Sorry for the late answer :(.
It's technically not a problem, but (from my point of view) it does not make any sense to compute Gap, Pham or Distortion with another metric. But as I'm not an expert, I have made the modification in the PR and I'll let you play with it.

So I've refreshed this PR, added the unsupervised metric for each searcher, and added some documentation.
I'll let you play with it and tell me:

  1. what you think about the API and whether I need to make some modifications.
  2. try the code with unsupervised metrics (you can apply SpectralClustering in practice now) and tell me if you think the results are meaningful. @amueller @jnothman @afouchet

Once everyone is OK with it, we will make this PR mergeable. If I can't, I will hand the PR over to someone else.

Does that seem OK to everyone?

@jnothman
Member

jnothman commented Jan 9, 2017

Why are we constraining this to use a clusterer's score method, not allowing an arbitrary unsupervised clustering metric?

It's technically not a problem, but (from my point of view) it does not make any sense to compute Gap, Pham or Distortion with another metric.

What do you mean by any other metric? It seems like only KMeans even defines score!

I'm quite confused reading the code in its current state, with the documentation not quite consistent.

When are we talking about searching over all clusterer parameters, and when just number of clusters? IIRC some searchers were generic to any parameter, others required n_clusters since models need to be searched in order of increasing complexity and the best is defined by the difference between subsequent models.

Am I right in thinking that anywhere a supervised metric can be used, an unsupervised metric should be usable, but that our metric function interface makes this hard (except that you might presume supervised if y is provided, otherwise unsupervised)? (It looks like _compute_score does not work yet for supervised.)

We should consider whether results_ wants to be structured like GridSearchCV.cv_results_.

In cases where the data is not resampled/altered (which is what motivates refit in grid search), I suspect we need not refit. We could have an option store_fitted={'all', 'none', 'best'}. Or else we could allow the user to memoize the models (or we could do it for them) to avoid an expensive refit.

It would be appropriate to make labels_ available when best_estimator_ is.

@tguillemot
Contributor Author

What do you mean by any other metric? It seems like only KMeans even defines score!

Maybe I misunderstood what you asked, but I've added the possibility for Gap, Pham and Distortion to use an arbitrary unsupervised metric rather than the score function of the estimator.

As I said, I'm -1 on adding unsupervised metrics in that way because Gap, Pham, ... are not defined to work with estimators other than KMeans.

I'm quite confused reading the code in its current state, with the documentation not quite consistent.

Sorry, I thought the documentation would be the best way to understand the API, but I didn't take much time to check it in detail. Sorry again.

When are we talking about searching over all clusterer parameters, and when just number of clusters?

We are always searching over all clusterer parameters.

IIRC some searchers were generic to any parameter, others required n_clusters since models need to be searched in order of increasing complexity and the best is defined by the difference between subsequent models.

The thing is, all these methods are quite different from one another, and some of them need to fit the estimator for n_clusters and n_clusters - 1 (which is the case for Gap and Pham).
These methods share a lot of common code, so I tried to find a way to unify them into a single API.
I was certainly wrong to do that.

From my point of view, the API can be implemented in three ways:
a) Keep the API unified (removing a lot of code duplication but making the design more complex).
b) Make a specific function for each method with its own parameters (creating understandable code but duplicating code).
c) Remove Pham, Gap and Distortion to simplify the API.

Am I right in thinking that anywhere a supervised metric can be used, an unsupervised metric should be usable,

Indeed you can, but what would be the meaning of doing that? The only searcher which uses a supervised metric is StabilitySearch. Its purpose is to compare the labels of the points common to two different subsets fitted with an estimator. Replacing the supervised metric by an unsupervised one would change the meaning of the algorithm.

The other searchers are not designed to deal with supervised metrics because the methods are not defined that way. Gap, Pham, ... are mathematical criteria defined from the KMeans score only (see the sketch below). It's already complicated to integrate unsupervised metrics, and I don't see how we could change the scorer to use supervised metrics.
Do we want to keep Gap, Pham, ... in this PR?
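
For context, a rough sketch of the Pham et al. (2005) f(K) criterion, which is indeed built only from consecutive K-means distortions S_K (the inertia); the helper `pham_f` is made up and this is not the PR's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans


def pham_f(X, max_clusters=10, random_state=0):
    """Return {k: f(k)}; the k with the smallest f(k) (typically < 0.85) is preferred."""
    n_features = X.shape[1]
    # S_k: pooled within-cluster sum of squares (KMeans inertia); S_1 is the total SS.
    inertia = {1: ((X - X.mean(axis=0)) ** 2).sum()}
    for k in range(2, max_clusters + 1):
        inertia[k] = KMeans(n_clusters=k, n_init=10,
                            random_state=random_state).fit(X).inertia_
    f, alpha = {1: 1.0}, {}
    for k in range(2, max_clusters + 1):
        alpha[k] = (1 - 3.0 / (4 * n_features) if k == 2
                    else alpha[k - 1] + (1 - alpha[k - 1]) / 6.0)
        f[k] = 1.0 if inertia[k - 1] == 0 else inertia[k] / (alpha[k] * inertia[k - 1])
    return f

# f_values = pham_f(X); best_k = min(f_values, key=f_values.get)
```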

but that our metric function interface makes this hard (except that you might presume supervised if y is provided, otherwise unsupervised)? (It looks like _compute_score does not work yet for supervised.)

I added _compute_score to simplify the integration of unsupervised metrics in the searchers. To be honest, I don't like it because it's only useful for unsupervised metrics, and I don't know how to deal with the supervised case without doing something complicated.

We should consider whether results_ wants to be structured like GridSearchCV.cv_results_.

That is indeed what I tried to do.

In cases where the data is not resampled/altered (which is what motivates refit in grid search), I suspect we need not refit. We could have an option store_fitted={'all', 'none', 'best'}. Or else we could allow the user to memoize the models (or we could do it for them) to avoid an expensive refit.

I'm not sure I see what you are talking about. For example, do you mean the case where the user fits only on sampled data and wants to apply the same parameters to fit all points?

It would be appropriate to make labels_ available when best_estimator_ is.

+1

Member

@jnothman jnothman left a comment


As I said, I'm -1 on adding unsupervised metrics in that way because Gap, Pham, ... are not defined to work with estimators other than KMeans.

That's patently untrue. The paper introducing Gap explicitly has in its abstract, "the technique uses the output of any clustering algorithm". Maybe you mean to say that they're not defined to work with any measure other than KMeans inertia? I agree Gap should not take as a parameter an arbitrary clustering score metric - it uses the weighted sum of within-cluster distances - but it does allow for arbitrary distance metrics (not just euclidean).
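
For concreteness, a small sketch of that pooled within-cluster dispersion W_k, computed from any clusterer's labels and any distance metric supported by pairwise_distances; the helper `within_dispersion` is made up:

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def within_dispersion(X, labels, metric="euclidean"):
    """W_k = sum_r D_r / (2 * n_r), with D_r the sum of pairwise distances in cluster r."""
    w = 0.0
    for label in np.unique(labels):
        cluster = X[labels == label]
        w += pairwise_distances(cluster, metric=metric).sum() / (2 * cluster.shape[0])
    return w

# With metric="sqeuclidean" this reduces to the K-means inertia of the given labels.
```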

I've not looked into Pham, etc.

Member

@jnothman jnothman left a comment


In cases where the data is not resampled/altered (which is what motivates refit in grid search), I suspect we need not refit. We could have an option store_fitted={'all', 'none', 'best'}. Or else we could allow the user to memoize the models (or we could do it for them) to avoid an expensive refit.

I'm not sure I see what you are talking about. For example, do you mean the case where the user fits only on sampled data and wants to apply the same parameters to fit all points?

I'm saying, for instance, if a method runs estimator.fit(X) for each parameter setting, with the whole input X, then we should not need to re-fit it later. Either keep the model in memory or keep the model in joblib.Memory. In GridSearchCV, refit is necessary because the estimator is never fit on the whole X.
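
A possible sketch of that memoization idea with joblib.Memory (the function name `fit_candidate` and the cache location are illustrative, not part of this PR):

```python
from joblib import Memory
from sklearn.base import clone
from sklearn.cluster import KMeans

memory = Memory(location="./_fit_cache", verbose=0)


@memory.cache
def fit_candidate(estimator, params, X):
    """Fit a fresh copy of `estimator` with `params` on the full X; cached on disk."""
    return clone(estimator).set_params(**params).fit(X)

# models = {k: fit_candidate(KMeans(n_init=10), {"n_clusters": k}, X)
#           for k in range(2, 10)}
# Selecting the best k afterwards reuses a cached fit instead of refitting.
```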

@jnothman
Member

In terms of shared code: I think we try to balance code readability with avoiding duplicated code here. Usually this means not using this kind of inheritance with little parts of each implementation pulled out, but I'm not yet familiar enough with the code to say what is and isn't applicable here.

But in terms of shared public API, let's:

  • report results in a consistent manner;
  • not force methods to do things they aren't meant to do

Methods that are only designed to operate over n_clusters (e.g. requiring evaluation of both n_clusters and n_clusters - 1, i.e. requiring discreteness and ordering in the parameters) should not do anything else; but where there's nothing about n_clusters that differentiates it from another parameter, we could allow arbitrary parameter grids... but maybe in version 2.0.

But if there's no reason to constrain something like stability search to using Fowlkes-Mallows (and not ARI, v-measure or B-cubed), why would we?

Ultimately, I'm not familiar with all these techniques as I should be (or will be), and think I would benefit greatly from seeing a tabular comparison, in the comments here or drafted in the narrative docs. This is much easier to grok than a series of docstrings and implementations.

That is, for each technique I'd like to see:

  • name and hyperlink to reference if possible
  • estimate of importance in literature (year and cites if nothing else)
  • search constrained to n_clusters vs any parameter
  • parameters (e.g. distance metric, supervised/unsupervised clustering metric)
  • complexity (number of fits per search candidate)

Unless I've misunderstood, all these searches are intended to work where gold standard labels are unavailable. If I'm mistaken, note this too.

Is this going over something we've covered and I've just lost the comment?

And I apologise if some of my preceding comments have been under-informed or misleading or have wasted your time. In general, please take my comments as suggestions and thoughts, often made in a hurry. I have not always had the time to look in depth at something, and often intend to clarify the situation later. Here we're working with a lot of subtly different techniques, and it's hard for me to get them all into my mind while reviewing tens of PRs.

@jnothman
Member

jnothman commented Jun 7, 2017

Do we see this happening in one of the next couple of releases?

@tguillemot
Contributor Author

@jnothman To be honest, I don't have much time now. Sorry. If someone wants to work on this, that's not a problem.

@amueller amueller modified the milestone: 0.19 Jun 12, 2017
Base automatically changed from master to main January 22, 2021 10:49
@Mathemilda

Hello everybody,
I have been working on a scikit-learn-style estimator module to estimate an optimal number of clusters using a numeric version of the elbow method. I use bootstrapping to figure out what might be an optimal number under the given conditions. My method was featured on the scikit-learn mailing list as a function, and I have seen scores of downloads from my GitHub repo since then. It is still cloned a few times per week. I have talked with other people about it and have seen a lot of interest in the method.
https://github.com/Mathemilda/Numeric_ElbowMethod_For_K-means

@adrinjalali
Member

I think there are too many open questions on this, and it doesn't seem like the OP has the time to get it to the finish line (and it seems that might not be possible with our inclusion criteria these days).

Related issue: #27259
