[WIP] ENH Optimal n_clusters values #6948


Closed
wants to merge 5 commits

Conversation

tguillemot
Contributor

@tguillemot tguillemot commented Jun 29, 2016

Reference Issue

I propose to integrate the algorithms that automatically compute the optimal n_clusters value, as proposed in #4301 by @afouchet.

This PR is the second step of the work initiated by #6823.

What does this implement/fix? Explain your changes.

I propose several criteria to determine the best value for 'n_clusters':

  • max value of some unsupervised metrics (silhouette, Calinski-Harabasz); a rough sketch of this criterion follows the list
  • stability criterion (using the Fowlkes-Mallows metric)
  • max value of the distortion jump
  • gap criterion
  • Pham criterion
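
For illustration, here is a minimal sketch of the first criterion (maximize an unsupervised metric over candidate values of n_clusters), written against the existing scikit-learn API rather than the searcher proposed in this PR; the helper `best_n_clusters` is made up (note that `calinski_harabasz_score` was spelled `calinski_harabaz_score` in releases contemporary with this PR):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score


def best_n_clusters(X, candidates, metric=silhouette_score, random_state=0):
    """Fit KMeans for each candidate and return the n_clusters maximizing `metric`."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=random_state).fit_predict(X)
        scores[k] = metric(X, labels)
    return max(scores, key=scores.get), scores


X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
k_sil, _ = best_n_clusters(X, range(2, 10), metric=silhouette_score)
k_ch, _ = best_n_clusters(X, range(2, 10), metric=calinski_harabasz_score)
print(k_sil, k_ch)  # both criteria should recover 4 on well-separated blobs
```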

Any other comments?

TODO:

  • Use parallelism to accelerate the computation
  • Do not require a range of values for n_clusters
  • Use the ParameterGrid class to allow studying the evolution of other parameters
  • Adapt the API so that its use is similar to GridSearch
  • Make the results analysable with pandas
  • docstring
  • docs
  • tests
  • examples

@tguillemot force-pushed the optimal_n_clusters branch 2 times, most recently from 7d02026 to e7ad3b0 on June 29, 2016 13:32
@tguillemot
Contributor Author

tguillemot commented Jun 29, 2016

@jnothman I have begun to address some of your suggestions. I still have some work to do.

It now uses parallel workers to compute the best value for n_clusters quickly. Moreover, it is no longer necessary to give a range for the values of n_clusters, and other parameters can be given through the parameters attribute. The usage is very similar to GridSearchCV.

I will send examples soon.

BTW: why do the clustering algorithms use n_clusters as the attribute when the GMM uses n_components? Why not use n_components for all the classes?

@agramfort @afouchet

@tguillemot
Contributor Author

Here is a plot produced by the plot_optimal_n_clusters.py example:
[Image: plot_optimal_n_clusters]

@jnothman
Member

I've not worked much with the GMM code and API, but it doesn't support labels_ either; it's being somewhat set apart from clustering, but perhaps it should not be. n_components as a parameter is borrowed from decomposition, for good or bad. :/ No clear answer here.

@afouchet

With the current Gap implementation, you lose the early-stopping criterion (but you parallelized the computations). On datasets with large N, where the gap statistic selects a small number of clusters (roughly fewer than ~10), I think this implementation is slower. [my2cents] The gap algorithm is more or less built to return a low number of clusters and stop early, so unless the work is very distributed, the parallelized version is slower. [/my2cents] (We can go with this implementation and make a separate PR to produce an early-stopping gap algorithm.) (There could be a "parallelized" option, but early stopping should be the default behavior; it's the algorithm in the paper.)
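
For reference, a minimal sketch of the early-stopping rule from Tibshirani et al. (2001) being discussed here, assuming W_k is the KMeans inertia and that reference datasets are drawn uniformly over the bounding box of X; the helper names are made up and this is not the PR's code:

```python
import numpy as np
from sklearn.cluster import KMeans


def _log_inertia(X, k, rng):
    """log(W_k): log of the pooled within-cluster sum of squares for k clusters."""
    if k == 1:
        return np.log(((X - X.mean(axis=0)) ** 2).sum())
    km = KMeans(n_clusters=k, n_init=10,
                random_state=rng.randint(2**31 - 1)).fit(X)
    return np.log(km.inertia_)


def _gap(X, k, n_refs, rng):
    """Gap(k) and its standard error s_k, using uniform reference datasets."""
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_logs = np.array([
        _log_inertia(rng.uniform(mins, maxs, size=X.shape), k, rng)
        for _ in range(n_refs)
    ])
    gap = ref_logs.mean() - _log_inertia(X, k, rng)
    s = ref_logs.std() * np.sqrt(1 + 1.0 / n_refs)
    return gap, s


def gap_statistic_early_stop(X, max_clusters=10, n_refs=10, random_state=0):
    """Return the smallest k such that Gap(k) >= Gap(k + 1) - s_{k + 1}."""
    rng = np.random.RandomState(random_state)
    gap_k, _ = _gap(X, 1, n_refs, rng)
    for k in range(1, max_clusters):
        gap_next, s_next = _gap(X, k + 1, n_refs, rng)
        if gap_k >= gap_next - s_next:
            return k  # early stop: larger values of k are never fitted
        gap_k = gap_next
    return max_clusters

# k_hat = gap_statistic_early_stop(X)   # X: array of shape (n_samples, n_features)
```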

Regarding the examples, I'd like a non-convex example, for instance 4 concentric circles or 4 half-moons (stability should work whereas all the others should fail; you need an estimator able to recover these clusters, such as SpectralClustering).

fitting_process='stability', metric=fowlkes_mallows_score)),
('Distortion jump', OptimalNClusterSearch(
estimator=estimator, parameters=parameters,
fitting_process='distortion_jump')),
Member

is there an interaction between fitting_process and metric?

Contributor Author

For the unsupervised case, the method takes the maximum value of an unsupervised metric passed in the parameters (e.g. Calinski-Harabasz or silhouette).
For the stability case, the method computes the fitted labels of two subsets of X and checks that we obtain the same clusters by applying a supervised metric (e.g. Fowlkes-Mallows).
If you want, I can create new fitting processes called silhouette, calinski, ... and remove the metric attribute.
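
To make the stability case concrete, here is a rough sketch (not the PR's StabilitySearch; `stability_score` is a made-up helper) that clusters two overlapping random subsamples and compares the labels of the shared points with Fowlkes-Mallows:

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score


def stability_score(estimator, X, n_draws=10, subsample=0.8, random_state=0):
    """Average Fowlkes-Mallows agreement between clusterings of two random subsamples."""
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    size = int(subsample * n_samples)
    scores = []
    for _ in range(n_draws):
        idx_a = rng.choice(n_samples, size=size, replace=False)
        idx_b = rng.choice(n_samples, size=size, replace=False)
        labels_a = clone(estimator).fit_predict(X[idx_a])
        labels_b = clone(estimator).fit_predict(X[idx_b])
        # Compare the labels assigned to the points present in both subsamples.
        common = np.intersect1d(idx_a, idx_b)
        pos_a = {i: p for p, i in enumerate(idx_a)}
        pos_b = {i: p for p, i in enumerate(idx_b)}
        scores.append(fowlkes_mallows_score(
            [labels_a[pos_a[i]] for i in common],
            [labels_b[pos_b[i]] for i in common]))
    return np.mean(scores)

# Hypothetical usage: pick the most stable n_clusters over a candidate range.
# best_k = max(range(2, 10),
#              key=lambda k: stability_score(KMeans(n_clusters=k, n_init=10), X))
```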

@tguillemot
Contributor Author

With the current Gap implementation, you lose the early-stopping criterion (but you parallelized the computations).

That's true, but I assumed it was better to have parallelisation for everyone than fast computation for only one.

Moreover, I've designed these classes to do a global analysis of the clustering obtained by an estimator.
It seems more interesting to me to see how the score changes for different parameters (as is done in GridSearchCV) than to only find the best configuration.
The Matlab Gap implementation computes all the values to do a global analysis (see the example):
http://fr.mathworks.com/help/stats/clustering.evaluation.gapevaluation-class.html

I'm not 100% sure because I don't know R, but it seems that the R implementation needs to know all the values too (function maxSE):
https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clusGap.html

Regarding the examples, I'd like a non-convex example, for instance 4 concentric circles or 4 half-moons (stability should work whereas all the others should fail; you need an estimator able to recover these clusters, such as SpectralClustering).

The way these classes are defined, you can apply SpectralClustering only in the unsupervised case because SpectralClustering doesn't have a score function. I thought the conclusion of the discussion in #4301 was that it doesn't make sense to apply non-convex clustering methods with these criteria?
Maybe I'm wrong about that.

@agramfort @jnothman What is your point of view on these points?

@afouchet

The way these classes are defined, you can apply SpectralClustering only in the unsupervised case because SpectralClustering doesn't have a score function. I thought the conclusion of the discussion in #4301 was that it doesn't make sense to apply non-convex clustering methods with these criteria?

It would be a shame to develop powerful techniques (SpectralClustering + stability solving the clustering problem in complex cases) and not show them. Convex cases are the simplest ones.

If it doesn't fit in this example, maybe stability can have its own example (compared to Calinski-Harabasz on one convex and 2 non-convex datasets).

P.S.: I read Tibshirani's article again. They do not link the gap statistic to the clustering method, only to the chosen distance between points. From the article's point of view, we could use SpectralClustering and gap. I understand agramfort's point. Still, I have cases where I compute the clustering with some data (semantic distance between keywords) and select the number of clusters with other data (I want keywords from a cluster to have similar conversion rates). There would be much to debate. I'm OK with not using gap & spectral in the examples, as it is a rather odd usage.

@afouchet

That's true, but I assumed it was better to have parallelisation for everyone than fast computation for only one. [...]

Very good. Thanks

@agramfort
Member

agramfort commented Jul 11, 2016 via email

@afouchet

afouchet commented Jul 11, 2016

Honestly, I don't get how you can obtain a sound result using gap stats and spectral clustering, but maybe I am missing something? Basically you need a score, and the score has to be related to what the algo is trying to maximize or minimize.

I have a problem where I want to maximize A, but I only have reliable individual data B. -> My algo optimizes (B), and I have a score on (A) (which becomes reliable on groups because they have far more data than individual points). I don't think it's relevant to this discussion, as I agree that gap + spectral shouldn't be in the example. I'd be happy to explain by e-mail if you're interested.

@agramfort
Member

agramfort commented Jul 11, 2016 via email

@agramfort
Member

Yes stability works for all estimators. Fine with me

@jnothman
Member

@tguillemot, could you please explain why we are constraining this to use a clusterer's score method, not allowing an arbitrary unsupervised clustering metric?

@amueller
Member

What's the status on this? I'm super excited :)

@amueller amueller added this to the 0.19 milestone Oct 11, 2016
@tguillemot
Contributor Author

@amueller I have to finish another thing right now, but this PR is next on my todo list.
Happy to know you're motivated by this PR ;)

@amueller
Member

To address your initial question about n_components: because n_components is really weird for clustering, and n_clusters is really weird for mixture models ;) I'd be slightly more OK with using n_clusters for GMMs, but probably not ;)

@jnothman
Member

jnothman commented Jan 9, 2017

@tguillemot, do you have capacity for this at the moment?

@tguillemot
Contributor Author

Do you have capacity for this at the moment?
@jnothman Unfortunately, I don't have much time to work on it, but the main problem for me is that it does not make sense to use an unsupervised metric for every searcher.

Could you please explain why we are constraining this to use a clusterer's score method, not allowing an arbitrary unsupervised clustering metric?

Sorry for the late answer :(.
It's technically not a problem, but (from my point of view) it does not make any sense to compute Gap, Pham or Distortion with another metric. But as I'm not an expert, I have made the modification in the PR and I'll let you play with it.

So I've refreshed this PR, added the unsupervised metric for each searcher, and added some documentation.
I'll let you play with it and tell me:

  1. what you think about the API and whether I need to make some modifications.
  2. try the code with unsupervised metrics (you can apply SpectralClustering in practice now) and tell me if you think the results are meaningful. @amueller @jnothman @afouchet

Once everyone is OK with it, we will make this PR mergeable. If I can't, I will hand the PR over to someone else.

Does that seem OK to everyone?

@jnothman
Member

jnothman commented Jan 9, 2017

Why are we constraining this to use a clusterer's score method, not allowing an arbitrary unsupervised clustering metric?

It's technically not a problem, but (from my point of view) it does not make any sense to compute Gap, Pham or Distortion with another metric.

What do you mean by any other metric? It seems like only KMeans even defines score!

I'm quite confused reading the code in its current state, with the documentation not quite consistent.

When are we talking about searching over all clusterer parameters, and when just number of clusters? IIRC some searchers were generic to any parameter, others required n_clusters since models need to be searched in order of increasing complexity and the best is defined by the difference between subsequent models.

Am I right in thinking that anywhere a supervised metric can be used, an unsupervised metric should be usable, but that our metric function interface makes this hard (except that you might presume supervised if y is provided, otherwise unsupervised)? (It looks like _compute_score does not work yet for supervised.)

We should consider whether results_ wants to be structured like GridSearchCV.cv_results_.

In cases where the data is not resampled/altered (which is what motivates refit in grid search), I suspect we need not refit. We could have an option store_fitted={'all', 'none', 'best'}. Or else we could allow the user to memoize the models (or we could do it for them) to avoid an expensive refit.

It would be appropriate to make labels_ available when best_estimator_ is.

@tguillemot
Contributor Author

What do you mean by any other metric? It seems like only KMeans even defines score!

Maybe I misunderstood what you asked, but I've added the possibility for Gap, Pham and Distortion to use an arbitrary unsupervised metric rather than the score function of the estimator.

As I said, I'm -1 on adding unsupervised metrics in that way because Gap, Pham, ... are not defined to work with estimators other than KMeans.

I'm quite confused reading the code in its current state, with the documentation not quite consistent.

Sorry, I thought the documentation would be the best way to understand the API, but I didn't take much time to check it in detail. Sorry again.

When are we talking about searching over all clusterer parameters, and when just number of clusters?

We are always searching over all clusterer parameters.

IIRC some searchers were generic to any parameter, others required n_clusters since models need to be searched in order of increasing complexity and the best is defined by the difference between subsequent models.

The thing is, all these methods are quite different from one another, and some of them need to fit the estimator for n_clusters and n_clusters - 1 (which is the case for Gap and Pham).
These methods share a lot of common code, so I tried to find a way to unify them into a single API.
I was certainly wrong to do that.

From my point of view, the API can be implemented in three ways:
a) Keep the API unified (removing a lot of code duplication but making the design more complex).
b) Make a specific function for each method with its own parameters (creating understandable code but duplicating code).
c) Remove Pham, Gap and Distortion to simplify the API.

Am I right in thinking that anywhere a supervised metric can be used, an unsupervised metric should be usable,

Indeed you can, but what would be the meaning of doing that? The only searcher which uses a supervised metric is StabilitySearch. Its purpose is to compare the labels of the points common to two different subsets fitted with an estimator. Replacing the supervised metric by an unsupervised one would change the meaning of the algorithm.

The other searchers are not designed to deal with supervised metrics because the methods are not defined that way. Gap, Pham, ... are mathematical criteria defined from the KMeans score only (see the sketch below). It's already complicated to integrate unsupervised metrics, and I don't see how we could change the scorer to use supervised metrics.
Do we want to keep Gap, Pham, ... in this PR?
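
For context, a rough sketch of the Pham et al. (2005) f(K) criterion, which is indeed built only from consecutive K-means distortions S_K (the inertia); the helper `pham_f` is made up and this is not the PR's implementation:

```python
import numpy as np
from sklearn.cluster import KMeans


def pham_f(X, max_clusters=10, random_state=0):
    """Return {k: f(k)}; the k with the smallest f(k) (typically < 0.85) is preferred."""
    n_features = X.shape[1]
    # S_k: pooled within-cluster sum of squares (KMeans inertia); S_1 is the total SS.
    inertia = {1: ((X - X.mean(axis=0)) ** 2).sum()}
    for k in range(2, max_clusters + 1):
        inertia[k] = KMeans(n_clusters=k, n_init=10,
                            random_state=random_state).fit(X).inertia_
    f, alpha = {1: 1.0}, {}
    for k in range(2, max_clusters + 1):
        alpha[k] = (1 - 3.0 / (4 * n_features) if k == 2
                    else alpha[k - 1] + (1 - alpha[k - 1]) / 6.0)
        f[k] = 1.0 if inertia[k - 1] == 0 else inertia[k] / (alpha[k] * inertia[k - 1])
    return f

# f_values = pham_f(X); best_k = min(f_values, key=f_values.get)
```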

but that our metric function interface makes this hard (except that you might presume supervised if y is provided, otherwise unsupervised)? (It looks like _compute_score does not work yet for supervised.)

I added _compute_score to simplify the integration of unsupervised metrics in the searchers. To be honest, I don't like it because it's only useful for unsupervised metrics, and I don't know how to deal with the supervised case without doing something complicated.

We should consider whether results_ wants to be structured like GridSearchCV.cv_results_.

That is indeed what I tried to do.

In cases where the data is not resampled/altered (which is what motivates refit in grid search), I suspect we need not refit. We could have an option store_fitted={'all', 'none', 'best'}. Or else we could allow the user to memoize the models (or we could do it for them) to avoid an expensive refit.

I'm not sure I see what you are talking about. For example, do you mean the case where the user fits only on sampled data and wants to apply the same parameters to fit all points?

It would be appropriate to make labels_ available when best_estimator_ is.

+1

Member

@jnothman jnothman left a comment


As I said, I'm -1 on adding unsupervised metrics in that way because Gap, Pham, ... are not defined to work with estimators other than KMeans.

That's patently untrue. The paper introducing Gap explicitly has in its abstract, "the technique uses the output of any clustering algorithm". Maybe you mean to say that they're not defined to work with any measure other than KMeans inertia? I agree Gap should not take as a parameter an arbitrary clustering score metric - it uses the weighted sum of within-cluster distances - but it does allow for arbitrary distance metrics (not just euclidean).
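
For concreteness, a small sketch of that pooled within-cluster dispersion W_k, computed from any clusterer's labels and any distance metric supported by pairwise_distances; the helper `within_dispersion` is made up:

```python
import numpy as np
from sklearn.metrics import pairwise_distances


def within_dispersion(X, labels, metric="euclidean"):
    """W_k = sum_r D_r / (2 * n_r), with D_r the sum of pairwise distances in cluster r."""
    w = 0.0
    for label in np.unique(labels):
        cluster = X[labels == label]
        w += pairwise_distances(cluster, metric=metric).sum() / (2 * cluster.shape[0])
    return w

# With metric="sqeuclidean" this reduces to the K-means inertia of the given labels.
```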

I've not looked into Pham, etc.

Member

@jnothman jnothman left a comment


In cases where the data is not resampled/altered (which is what motivates refit in grid search), I suspect we need not refit. We could have an option store_fitted={'all', 'none', 'best'}. Or else we could allow the user to memoize the models (or we could do it for them) to avoid an expensive refit.

I'm not sure I see what you are talking about. For example, do you mean the case where the user fits only on sampled data and wants to apply the same parameters to fit all points?

I'm saying, for instance, if a method runs estimator.fit(X) for each parameter setting, with the whole input X, then we should not need to re-fit it later. Either keep the model in memory or keep the model in joblib.Memory. In GridSearchCV, refit is necessary because the estimator is never fit on the whole X.
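
A possible sketch of that memoization idea with joblib.Memory (the function name `fit_candidate` and the cache location are illustrative, not part of this PR):

```python
from joblib import Memory
from sklearn.base import clone
from sklearn.cluster import KMeans

memory = Memory(location="./_fit_cache", verbose=0)


@memory.cache
def fit_candidate(estimator, params, X):
    """Fit a fresh copy of `estimator` with `params` on the full X; cached on disk."""
    return clone(estimator).set_params(**params).fit(X)

# models = {k: fit_candidate(KMeans(n_init=10), {"n_clusters": k}, X)
#           for k in range(2, 10)}
# Selecting the best k afterwards reuses a cached fit instead of refitting.
```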

@jnothman
Member

In terms of shared code: I think we try to balance code readability with avoiding duplicated code here. Usually this means not using this kind of inheritance with little parts of each implementation pulled out, but I'm not yet familiar enough with the code to say what is and isn't applicable here.

But in terms of shared public API, let's:

  • report results in a consistent manner;
  • not force methods to do things they aren't meant to do

Methods that are only designed to operate over n_clusters (e.g. requiring evaluation of both n_clusters and n_clusters - 1, i.e. requiring discreteness and ordering in the parameters) should not do anything else; but where there's nothing about n_clusters that differentiates it from another parameter, we could allow arbitrary parameter grids... but maybe in version 2.0.

But if there's no reason to constrain something like stability search to using Fowlkes-Mallows (and not ARI, v-measure or B-cubed), why would we?

Ultimately, I'm not familiar with all these techniques as I should be (or will be), and think I would benefit greatly from seeing a tabular comparison, in the comments here or drafted in the narrative docs. This is much easier to grok than a series of docstrings and implementations.

That is, for each technique I'd like to see:

  • name and hyperlink to reference if possible
  • estimate of importance in literature (year and cites if nothing else)
  • search constrained to n_clusters vs any parameter
  • parameters (e.g. distance metric, supervised/unsupervised clustering metric)
  • complexity (number of fits per search candidate)

Unless I've misunderstood, all these searches are intended to work where gold standard labels are unavailable. If I'm mistaken, note this too.

Is this going over something we've covered and I've just lost the comment?

And I apologise if some of my preceding comments have been under-informed or misleading or have wasted your time. In general, please take my comments as suggestions and thoughts, often made in a hurry. I have not always had the time to look in depth at something, and often intend to clarify the situation later. Here we're working with a lot of subtly different techniques, and it's hard for me to get them all into my mind while reviewing tens of PRs.

@jnothman
Member

jnothman commented Jun 7, 2017

Do we see this happening in one of the next couple of releases?

@tguillemot
Contributor Author

@jnothman To be honest, I don't have much time now. Sorry. If someone wants to work on this, that's not a problem.

@amueller amueller modified the milestone: 0.19 Jun 12, 2017
Base automatically changed from master to main January 22, 2021 10:49
@Mathemilda

Hello everybody,
I have been working on a scikit-learn-style estimator module to estimate an optimal number of clusters using a numeric version of the elbow method. I use bootstrapping to figure out what might be an optimal number under the given conditions. My method was featured on the scikit-learn mailing list as a function, and I have seen scores of downloads from my GitHub repo since then. It is still cloned a few times per week. I have talked with other people about it and have seen a lot of interest in the method.
https://github.com/Mathemilda/Numeric_ElbowMethod_For_K-means

@adrinjalali
Member

I think there are too many open questions on this, and it doesn't seem like the OP has the time to get it to the finish line (and it seems that might not be possible with our inclusion criteria these days).

Related issue: #27259
