
[MRG] Choose number of clusters #4301


Closed
wants to merge 8 commits into scikit-learn:master from idlead:choose-nb-cluster

Conversation

afouchet

This module implements 7 algorithms to find the "optimal" number of clusters (stability, gap statistic, distortion jump, silhouette, Calinski and Harabasz index, and 2 "elbow" methods).

If a metric is needed (for example, to compute the distortion, i.e. the mean distance of a point to its cluster's center), any distance from scipy.spatial.distance can be used.
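(A minimal sketch of what such a metric-parameterized distortion could look like, assuming X is a NumPy array, labels is an integer array and centers are the cluster centers; this is an illustration, not necessarily the PR's implementation:)

import numpy as np
from scipy.spatial.distance import cdist

def distortion(X, labels, centers, metric="euclidean"):
    # mean distance of each point to its own cluster's center,
    # for any metric accepted by scipy.spatial.distance.cdist
    dist_to_centers = cdist(X, centers, metric=metric)
    return dist_to_centers[np.arange(len(X)), labels].mean()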

@raghavrv
Member

I haven't looked into your entire code, but I feel all these need to be added (if not already present) to sklearn.metrics.cluster. Kindly look at existing metrics at sklearn/metrics/cluster and see which file might be a good fit for the newly introduced metrics.

Also I do not know about others but we already have silhouette at sklearn/metrics/cluster/unsupervised.py ... Please take a look at the same...

Sorry if I had understood your PR incorrectly though...

@afouchet
Author

I was unsure where to put this module.

The job of the module is, given data and a clustering algorithm, to return the number of clusters. I did not find such a module.

For example, there is the silhouette score in sklearn, but there is no module that returns the number of clusters k for which the silhouette score is maximal. (In the case of the silhouette, it's only a loop; it is a little bit different for other algorithms.)
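(The loop mentioned above could look roughly like this, using the existing sklearn.metrics.silhouette_score; KMeans and the k range are illustrative choices, not the PR's code:)

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 11)):
    # fit a clustering for each k and keep the k with the best silhouette
    scores = [silhouette_score(X, KMeans(n_clusters=k, random_state=0).fit_predict(X))
              for k in k_range]
    return list(k_range)[int(np.argmax(scores))]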

I'll check the already-coded silhouette and will amend the code.

@raghavrv
Member

I am inclined to suggest the following (but please wait for review from other devs before finalizing)

  • Amend all the new algorithms into separate metric functions and add them to sklearn.metrics.cluster in the appropriate files.
  • Add a few examples of using the new/old metrics in GridSearchCV to find the optimal n_clusters value. (The new/old metrics, after appropriate processing of the output data, could act as the scoring callable parameter passed to GridSearchCV when instantiating it; refer to the GridSearchCV docs linked above for the format of the function; a sketch follows below.)
    Also refer to this example.

More links -
The metrics.clusters module.
The grid_search module.
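(A rough sketch of that suggestion, using the existing silhouette score as the scoring callable and assuming X is the unlabelled data; GridSearchCV lives in sklearn.model_selection today (sklearn.grid_search at the time of this PR), and the later discussion questions whether CV splitting is appropriate here at all:)

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

def silhouette_scorer(estimator, X, y=None):
    # score the labels that the fitted clusterer assigns to this fold
    return silhouette_score(X, estimator.predict(X))

search = GridSearchCV(KMeans(random_state=0),
                      param_grid={"n_clusters": list(range(2, 11))},
                      scoring=silhouette_scorer, cv=3)
search.fit(X)                       # X: the unlabelled data
best_k = search.best_params_["n_clusters"]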

@raghavrv
Member

In fact adding such examples (which show how the optimal n_clusters can be obtained using GridSearchCV and the new/old metrics) would partially fix #4040 too :)

@amueller
Member

You can simply use GridSearchCV for this, right? Well, that would use a hold-out set. Do you want to search just on the training set?

@afouchet
Author

As implemented, GridSearchCV can do this. For further development, many methods require the same computations. I intend to add a consensus method showing the results of all methods for (almost) the computation time of running only one.

(For example, many methods compute the k-clustering, then compute the distortion, then return a result. We can compute the distortion once for all k and feed that list to every distortion-based method.)

Do you think this kind of trick can be done with GridSearchCV?

@afouchet
Author

In my opinion, I don't have a training set and a test set; I only have unlabelled data.

@jnothman
Member

Firstly, the recently added example at http://scikit-learn.org/dev/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#example-cluster-plot-kmeans-silhouette-analysis-py addresses the question of showing the user how to find the number of clusters with the silhouette.

What you write is not altogether clear, but I suspect supporting multiple metrics in grid search (#1850) is what you need.

@raghavrv
Member

raghavrv commented Mar 1, 2015

Also don't you think the newly added metrics could be useful?

From a quick glance these were the new algorithms that could be refactored into a metric -

I am not sure about distortion/elbow though...

@afouchet
Author

afouchet commented Mar 2, 2015

Well, GridSearchCV does not seem relevant to me. GridSearchCV gives me
(algorithm, parameters) -> mean score across all folds and selects the best one. The algorithms in the PR do not split the data into test and training sets (also, in the example http://scikit-learn.org/dev/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#example-cluster-plot-kmeans-silhouette-analysis-py, no cross-validation is done). GridSearchCV uses KFold, which says "n_fold: [...] must be at least 2".

Besides, for most algorithms, the important part is not computing the score for each parameter, but comparing these scores to find an inflection point. So I'm not looking for the parameter which minimizes the score.

Calinski and Harabasz can be refactored into metrics.cluster.
GAP Stat as well, but, for the gap-statistic algorithm, you do not look for the k maximizing the gap, but for the minimum k such that Gap(k) > Gap(k+1) - some_kind_of_standard_deviation. So the algorithm must be written somewhere, and I don't know if the gap statistic would be useful by itself.
Stability does not measure the stability of a cluster, but the stability of a clustering algorithm. I can put it in metrics.cluster, but it may be misleading.
Stability can also be seen as a way to find optimal parameters (like cross-validation) (but, IMO, it sucks compared to cross-validation).

@afouchet
Author

afouchet commented Mar 2, 2015

Can you specify what is clear and what isn't? As I wrote this code, I don't have much hindsight and many steps seem logical to me.

In short, I have unlabeled data and I assume it has a group structure, as in the pictures from scikit-learn (http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html).

To determine the optimal number of clusters, given a clustering algorithm, the methods work as follows:

  • "elbow method" or "distortion jump" see how the "variance" of data has been reduce. As the number of clusters increase, the mean distance of a point to its cluster's centroid decreases. But we expect this mean distance to decrease a lot if k < k_optimal, and to decrease very little for k > k_optimal. So I compute the clustering for all k, see the distortion (can be seen as mean distance of a point to its centroïd) an look for an inflection point
  • "stability selection" assumes that, if the right number of cluster is chosen, then the algorithm is stable. Stability computes k-clustering on smally-perturbed subsets E_1 and E_2 of the data. On common data, stability see in what proportion clusters made on E_1 are similar to clusters made on E_2. (You do that comparaison 100 times to have the expected stability of k-clustering)
  • "gap statistic" : computes the difference of "variance" of data in k-clusters versus "variance" of random data in k-clusters. (plus some small tricks of Tibshirani, we look for a k such as Gap(k) > Gap(k+1) - standard_deviation_like_term) (plus, you can do several models to draw "random data")

@raghavrv
Member

raghavrv commented Mar 2, 2015

If I understand things correctly -

I agree with your point about not needing to split the data to choose the optimal n_clusters, since you might want to apply a particular metric (silhouette/distortion/any other) to the whole of your data...
Also, GridSearchCV optimizes a particular score over various combinations of parameters... And since your use case has a single parameter, you could simply do a linear search...

So I suggest: split the core of your algorithm into 2 metric functions --> 1. one which returns the raw value of the distortion / index etc... 2. one which is a monotonic function of 1. and positively correlated with the optimality of k (n_clusters), such that minimizing it over a range of k gives you the optimal value of n_clusters...

A few naive suggestions (they could sound stupid, in which case kindly ignore them):

we expect this mean distance to decrease a lot while k < k_optimal, and to decrease very little for k > k_optimal. So I compute the clustering for all k, look at the distortion (which can be seen as the mean distance of a point to its centroid) and look for an inflection point

  1. a metric clustering_distortion / distortion that returns the distortion value given the data and labels.
    Usage : distortion(data, model(n_clusters=k).fit(data).labels_) --> distortion for the predicted data using model initialized with n_clusters=k.
  2. a metric distortion_laplacian(data, n_clusters_range, Estimator, *args, **kwargs) which computes the 2nd derivative of the distortion vs. n_clusters curve at the given n_clusters...

( or better distortion_laplacians(data, range_n_clusters, Estimator, *args, **kwargs) which might be a lot more efficient since you will compute the labels for different n_clusters only once... and use numpy.diff directly on the results... )

Usage for finding the optimal k

Simply do a linear search - range_k[argmin(distortion_laplacians(data, range_k, KMeans))]

but, for the gap-statistic algorithm, you do not look for k maximizing the gap, but for the minimum k such that Gap(k) > Gap(k+1)

You could make 2 metric functions

  • gap(k) which returns the plain gap(k).
  • gap_cluster_score(k) which returns k*(gap(k) - gap(k+1)) or even better k*int(gap(k) > gap(k+1))

(That is crudely put; you ought to take in only (data, labels) for gap(), like the distortion described above... Similarly, to avoid computing gap(k) multiple times, we could have gap_cluster_scores(Estimator, range_k, *args, **kwargs) return k*(gap(k) - gap(k+1)) for a range of k values...)
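(A hedged sketch of the gap-statistic rule discussed here, i.e. Tibshirani's criterion, assuming k-means and a uniform reference distribution; the function names and defaults are illustrative, not the PR's API:)

import numpy as np
from sklearn.cluster import KMeans

def _log_inertia(X, k):
    # log of the within-cluster sum of squares
    return np.log(KMeans(n_clusters=k, random_state=0).fit(X).inertia_)

def gap_statistic(X, k, n_refs=10, seed=0):
    # Gap(k) = mean log-inertia of uniform reference data - log-inertia of X,
    # together with the reference spread used in the stopping rule
    rng = np.random.RandomState(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_logs = [_log_inertia(rng.uniform(mins, maxs, size=X.shape), k)
                for _ in range(n_refs)]
    gap = np.mean(ref_logs) - _log_inertia(X, k)
    s_k = np.std(ref_logs) * np.sqrt(1.0 + 1.0 / n_refs)
    return gap, s_k

def gap_best_k(X, k_range=range(1, 11)):
    # smallest k such that Gap(k) >= Gap(k+1) - s_(k+1)
    ks = list(k_range)
    gaps, s = zip(*(gap_statistic(X, k) for k in ks))
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return ks[i]
    return ks[-1]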

Similarly, you should be able to find a way to turn the stability measure into a monotonically increasing metric value that is positively correlated with the optimality of k...

Apologies if any of this sounds stupid w.r.t your situation/scikit learn's api :)

@afouchet
Author

afouchet commented Mar 3, 2015

All right, I can refactor all methods as suggested. I only have three remarks / questions:

  • for stability, I can easily refactor it as a monotonically increasing metric. Yet this metric measures the stability of the algorithm, not of the clusters. Should I put it in metrics.cluster or somewhere else?
  • Some algorithms will look for argmin_k(score_with_k_clusters), others will look for argmax. Would you rather have all algorithms looking for argmin, even if I code -score_func instead of score_func? I'm assuming not, and I'll code score_func.
  • As the algorithms' purpose is to take data and a clustering algorithm and return an optimal number of clusters, I'd like to have functions "data, cluster_algorithm -> optimal number of clusters". They can be as simple as "def algo(...): return argmin(...)". If that's okay with you, where can I put them?

@jnothman
Member

jnothman commented Mar 5, 2015

Well, GridSearchCV does not seem relevant to me. GridSearchCV gives me (algorithm, parameters) -> mean score across all folds and selects the best one.

You're right that GridSearchCV isn't exactly suited to this task, though I suspect that applying these metrics to multiple samples from the data and evaluating the score on each might be more robust than only evaluating the score for the full population. I've not read enough clustering literature to know if this is done.

I think there were also discussions about how to optimise TSNE's metric (which does not determine number of clusters). Ideally it should be possible to do this with a GridSearchCV-like apparatus: it should be possible to assess multiple parameter settings, using either a grid or a randomised search approach -- or any that appears in the future. Optimising such transductive algorithms should be discussed on the mailing list, probably.

@amueller
Member

amueller commented Mar 5, 2015

+1 on having a mechanism to search over parameters for transductive methods.

@afouchet
Author

afouchet commented Mar 5, 2015

I suspect that applying these metrics to multiple samples from the data and evaluating the score on each might be more robust than only evaluating the score for the full population.

From the literature I have read, it's mostly done on the whole population. (I finished my thesis in early 2014. Clustering was not a main topic, but I did look at the selection of the number of clusters for inspiration for my problem of parameter selection.) I agree it should be more robust, but I wouldn't be able to cite a reference for that algorithm.

Stability already evaluates a score on multiple samples, so CV shouldn't help it much. (There are many ways to do stability. This code subsamples and compares clusters made on the subsamples, but stability can also be done by clustering data with added noise, by subsampling and comparing to clusters made on the whole dataset, etc.)
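(A rough sketch of the subsample-and-compare variant just described, assuming X is a NumPy array, k-means as the clusterer and the Fowlkes-Mallows score as the cluster similarity; names and defaults are illustrative, not the PR's implementation:)

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def one_stability_measure(X, k, subsample=0.8, seed=0):
    # cluster two random subsamples and compare labels on their shared points
    rng = np.random.RandomState(seed)
    n = len(X)
    idx1 = rng.choice(n, int(subsample * n), replace=False)
    idx2 = rng.choice(n, int(subsample * n), replace=False)
    labels1 = KMeans(n_clusters=k, random_state=0).fit_predict(X[idx1])
    labels2 = KMeans(n_clusters=k, random_state=1).fit_predict(X[idx2])
    # positions of the common points within each subsample
    _, pos1, pos2 = np.intersect1d(idx1, idx2, return_indices=True)
    return fowlkes_mallows_score(labels1[pos1], labels2[pos2])

def stability(X, k, n_runs=20):
    # expected stability of a k-clustering over repeated subsampling
    return np.mean([one_stability_measure(X, k, seed=s) for s in range(n_runs)])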

@afouchet
Author

afouchet commented Mar 5, 2015

Ideally it should be possible to do this with a GridSearchCV-like apparatus: it should be possible to assess multiple parameter settings, using either a grid or a randomised search approach -- or any that appears in the future. Optimising such transductive algorithms should be discussed on the mailing list, probably.

On the one hand, stability-like algorithms can easily be generalized to other transductive algorithms. One would have to specify which output is used for stability (here, it's the cluster assignment; for a semi-supervised regression with feature selection, the set of selected features could be the output that we want to be stable).

On the other hand, IMO, some of the algorithms are very cluster-specific (mainly the ones based on distortion). They could be refactored into a GridSearch-like algorithm which:

  • computes a score for the chosen hyper-parameters
  • runs a compare_score function to return

But, to me, it seems like a lot of generalisation for little gain or re-use.

In particular, the goal of my pull request was to implement some classic algorithms to select the number of clusters (which I'd like to have in scikit-learn). Can I refactor them as ragv suggested and update the pull request?

@afouchet afouchet force-pushed the choose-nb-cluster branch from da070ae to fb0f911 on March 16, 2015 09:19
@coveralls


Coverage increased (+0.0%) to 95.12% when pulling fb0f911 on idlead:choose-nb-cluster into 4be6214 on scikit-learn:master.

@afouchet
Author

Hello everybody,

I refactored, trying to follow ragv's guidelines. Do you have other remarks on the PR?

@amueller
Member

@afouchet thanks. This needs a review by someone who is really familiar with clustering metrics; not sure who'd be a good candidate.
Can you maybe add an example, or add the metrics to an existing example, showing how these metrics are more helpful than the ones currently in scikit-learn?
Also, a description should be added to the user guide in doc/modules/model_evaluation.rst in the clustering section. It is currently pretty threadbare (not sure if there is a user-guide entry for silhouette_score, for example), but in general we like to have at least a description and some intuition about its use in the user guide.

@afouchet
Author

Sure, I can do that.

Very briefly, silhouette results rely heavily on the distance. Using the scikit-learn clustering examples (http://scikit-learn.org/stable/modules/clustering.html), it's very hard for the silhouette to find the right number of clusters in the first 2 cases (2 concentric circles and 2 half circles), whereas stability should recover the right number. Nevertheless, stability is slower.

I'll run them on scikit's clustering examples.

@amueller
Member

If they work with these toy datasets, that is great for demonstrating and illustrating their use. That would definitely deserve its own example.
If you had some additional insights / demonstrations on "real" data, that would also be great (though I don't have good ideas there - you could run digits or MNIST).

@afouchet afouchet force-pushed the choose-nb-cluster branch from fb0f911 to 88193ca on April 8, 2015 14:52
@afouchet
Author

afouchet commented Apr 8, 2015

Hello,

I plotted the results of every algorithm on the clustering toy datasets, using spectral clustering:
[image: chosen_clusters]

As (I) expected, distortion_jump and Calinski-Harabasz are faster than silhouette, with the same clustering relevance. Stability is always able to recover the "right" number of clusters, but is very slow. The gap statistic was also always able to recover the "right" number, with a computational time of the same order as silhouette. (I am pretty surprised that the gap statistic recovered everything.)

I added documentation and a script to produce the plot. My PR should be clearer now. Let me know
how you feel about this.

@afouchet afouchet force-pushed the choose-nb-cluster branch 2 times, most recently from 4648ed6 to a3d51f4 on April 8, 2015 15:42
@glouppe
Contributor

glouppe commented Apr 8, 2015

Having recently stumbled upon the same problem, I am +1 for including more silhouette-like metrics, as proposed in this PR with highly-cited alternatives.

@coveralls


Coverage decreased (-0.0%) to 95.15% when pulling a3d51f4 on idlead:choose-nb-cluster into b1eaac0 on scikit-learn:master.

@tguillemot
Contributor

tguillemot commented Jun 20, 2016

@afouchet I have a technical question for you: does it make sense to apply GAP or stability/distortion to an algorithm different from k-means (or another blob-clustering technique)? As these methods are defined from the inertia, does it make sense to replace the clustering algorithm with SpectralClustering as you've done in your example?

As I'm not from the clustering community, what I want to know is whether GAP and stability/distortion are used only for k-means or can be applied to other clustering algorithms, and why (if the answer is yes)?

Sorry for my stupid question and thanks in advance if you have an answer to that.

@afouchet
Author

@tguillemot: the question is rather "do GAP or distortion fit my data?" (do I expect convex clusters?). In that case, they should work with any (good) clustering algorithm. Then again, if you expect convex clusters, k-means is a natural choice of clustering algorithm.

So, they can be used with any clustering algorithm, as long as you expect convex clusters in your data.

(When I built the example, I thought that stability would stand out as the only method that retrieves the right number of clusters on the concentric circles or the half-moons. I'm surprised that GAP worked on these examples; GAP can probably find more complex clusters than convex ones.)

@agramfort
Member

agramfort commented Jun 21, 2016 via email

@tguillemot
Contributor

I agree with @agramfort: for those methods you need a score function.

In fact, for the circles example, the best case is not k=2 but k=1.
Your implementation begins from n_clusters=2, which is why you have not seen it.

Thanks @afouchet and @agramfort for the explanation.

@tguillemot
Contributor

@afouchet @jnothman OK, before proposing a real PR, I'm sending you a gist of the refactoring of those methods:
https://gist.github.com/tguillemot/f95d7ec55c90ccfb41f29667a71e9d46
Are you OK with that design? Do you have any suggestions or comments?

@tguillemot
Contributor

@jnothman What do you think about the refactoring of the clustering methods:
https://gist.github.com/tguillemot/f95d7ec55c90ccfb41f29667a71e9d46
Any comments about the design of those classes before I do a PR?

@jnothman
Member

Hi @tguillemot; I have your gist open to look at, but haven't got to it yet. It doesn't help that I feel more comfortable leaving comments on a PR than on a gist. If you made a PR, I'm sure you'd get a reply, but I'll try to look at the gist sooner or later.

@jnothman
Member

Yes, I'm okay with that design. A couple of comments:

  • I think it makes assumptions that we're only searching over the number of clusters (to the exclusion of other parameters), and that we're searching in a contiguous range. I suppose that's reasonable.
  • It uses the clusterer's score function, when the metric should be configurable in at least some cases.
  • It uses a different notion of best_estimator_ from that in GridSearchCV. Here, the best_estimator_ does not have a fitted model; there it does.
  • I suppose that, like GridSearchCV, if a fitted best_estimator_ is attached, the estimator should then act transparently like a clusterer, providing labels_, predict if available, etc.
  • It's unclear whether the "scorers" should be standalone estimators, or just the merged estimator where the "scorers" are methods.
  • I don't think you'll get away with using "scorer"; that is already jargon in scikit-learn.

@michaelaye

What happened? This all sounded so exciting and almost done?

@jnothman
Member

jnothman commented Jan 7, 2017

I agree @michaelaye, this would be great to have. But we are severely under-staffed, and things get lost easily.

@tguillemot
Contributor

@michaelaye As @jnothman said, this PR would be great to have, but I'm working on other stuff in sklearn now.

Some of the metrics given by @afouchet are in sklearn now (#6823).

For the question about Gap, Pham, ..., a discussion about the continuation of this work is happening in #6948.

@turian

turian commented Aug 30, 2017

For those interested in cluster stability, I see two issues with this implementation:

  • adjacency_matrix is O(n^2) in time and memory, n = # examples, so it will be very slow for large n. This could be greatly improved by computing the adjacency_matrix in a sparse way.
  • I think _one_stability_measure could return the wrong results for non 'fowlkes-mallows' metrics, if the cluster order has changed.

adji_matrix[i, j] = cluster_assignement[i] == cluster_assignement[j]
"""
n_samples = len(cluster_assignement)
adj_matrix = np.zeros((n_samples, n_samples))
@turian

It looks like this will require O(n^2) memory, O(n^2) time. This could be improved significantly if a sparse matrix were used.

@jnothman (Member)

As long as we materialise the matrix in memory (even sparsely), this needs worst-case O(n^2) time, but only because it needs to be O(max_cluster_size^2 + n). Here's one sparse construction, assuming y is the cluster assignment and clusters are numbered contiguously from 0:

import numpy as np
from scipy.sparse import csr_matrix

# get an array of member samples for each cluster:
# [might be just as good to do this with a defaultdict(list)]
cluster_sizes = np.bincount(y)
cluster_members = np.split(np.argsort(y), np.cumsum(cluster_sizes))
# row i of the adjacency matrix lists the members of sample i's cluster
rows = [cluster_members[c] for c in y]
indices = np.hstack(rows)
indptr = np.cumsum(np.hstack([0, np.take(cluster_sizes, y)]))
out = csr_matrix((np.ones_like(indices), indices, indptr))

Another sparse construction with the same condition on y (using sklearn.metrics.pairwise.cosine_similarity and scipy.sparse as sp):

out = cosine_similarity(sp.csr_matrix((np.ones_like(y), y, np.arange(len(y)+1))), dense_output=False).astype(int)


clustering_1 = [assi_1[c] for c in common_points_1]
clustering_2 = [assi_2[c] for c in common_points_2]
return cluster_similarity(clustering_1, clustering_2)
@turian

If the order of the clusters changes, then I think all metrics besides 'fowlkes-mallows' will be computed incorrectly?

@jnothman (Member) left a comment

Yes, @turian, I'd also like to see something like this progress...

@jnothman
Member

But I think #6948 is the most recent work in this direction

@failable

I would like to know whether there has been any progress since the pull request 3 years ago. Is a simple linear search still the best way to find the best n_clusters as 2018 approaches?

@manugarri

manugarri commented Jun 27, 2018

@Isolet I think this PR is dead

@jnothman
Member

jnothman commented Jun 27, 2018 via email

@adrinjalali
Member

Closing as inactive.

@adrinjalali adrinjalali closed this Mar 7, 2024