[MRG] Choose number of clusters #4301
Conversation
I haven't looked into your entire code, but I feel all these need to be added (if not already present) to sklearn.metrics. Also, I don't know about others, but we already have silhouette there. Sorry if I have misunderstood your PR, though... |
b5ee352
to
da070ae
Compare
I was unsure where to put this module. The job of the module is, given data and a clustering algorithm, to return the number of clusters. I did not find such a module. For example, there is the silhouette score in sklearn, but there is no module that returns the number of clusters k for which the silhouette score is maximum. (In the case of the silhouette, it's only a loop; it is a little bit different for other algorithms.) I'll check the already-coded silhouette and will amend the code. |
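For concreteness, the "it's only a loop" remark above could look like the following minimal sketch. This is not the PR's code: the helper name and k range are hypothetical, and KMeans is only used as an example clusterer.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range=range(2, 11)):
    # fit one KMeans per candidate k and keep the k with the highest silhouette
    scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                            random_state=0).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)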
I am inclined to suggest the following (but please wait for review from other devs before finalizing)
More links - |
In fact adding such examples (which show how the optimal |
You can simply use |
As implemented, GridSearchCV can do this. For further development, many methods require the same computations. I intend to do a consensus method showing the results of all of them with (almost) the computation time of doing only one. (For example, many methods compute the k clusters, then compute the distortion, then return the result. We can compute the distortion for all k and feed this list to the distortion-based methods.) Do you think this kind of trick can be done with GridSearchCV? |
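A minimal sketch of the "compute once, feed to several methods" idea above, assuming KMeans and using its inertia_ as the distortion; the helper name is hypothetical, not part of the PR.

from sklearn.cluster import KMeans

def distortions_for_all_k(X, k_range=range(1, 11)):
    # fit one KMeans per k and record the within-cluster sum of squares (inertia)
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_range}

# the resulting dict can then be fed to several distortion-based selection rules
# (elbow, distortion jump, ...) without re-running the clustering for each rule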
In my opinion, I don't have a training and a test set; I only have unlabelled data. |
Firstly, the recently added example at http://scikit-learn.org/dev/auto_examples/cluster/plot_kmeans_silhouette_analysis.html#example-cluster-plot-kmeans-silhouette-analysis-py addresses the question of showing the user how to find the number of clusters with silhouette. What you write is not altogether clear, but I suspect supporting multiple metrics in grid search (#1850) is what you need. |
Also, don't you think the newly added metrics could be useful? From a quick glance, these were the new algorithms that could be refactored into a metric -
I am not sure about distortion/elbow though... |
Well, GridSearchCV does not seem relevant to me. GridSearchCV gives me… Besides, for most algorithms, the important part is not computing the score for each parameter, but comparing these scores to find an inflection point. So I'm not looking for the parameter which minimizes the score. Calinski and Harabasz can be refactored into metrics.cluster. |
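To illustrate the "inflection point" remark above, here is one common elbow heuristic as a hypothetical helper (not the PR's implementation): pick the k at which the distortion curve has the largest discrete second difference, i.e. where the curve bends the most.

import numpy as np

def elbow_k(distortions):
    # distortions: {k: distortion value} for consecutive values of k
    ks = np.array(sorted(distortions))
    d = np.array([distortions[k] for k in ks])
    # discrete second difference of the curve; the "elbow" is where it bends most
    second_diff = d[:-2] - 2 * d[1:-1] + d[2:]
    return ks[1:-1][np.argmax(second_diff)]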
Can you specify what is clear and what isn't? As I wrote this code, I don't have hindsight and many… In short, I have unlabeled data and I assumed it has a group structure as in the picture from scikit-learn (http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html). To determine the optimal number of clusters, given a clustering algorithm, the method:
|
If I understand things correctly, I agree with your point on not needing to split the data to choose the optimal k. So I suggest splitting the core of your algorithm into 2 metric functions: 1. one which returns the raw value of the distortion / index etc.; 2. one which is a monotonic function of 1 and is always positively correlated with the optimality of k. A few naive suggestions (could sound stupid, in which case kindly ignore them):
(or better…) Usage for finding the optimal k: simply do a linear search - |
You could make 2 metric functions
(That is crudely put; you ought to be taking in only data and labels for… Similarly, you must be able to find a way to make the stability measure into a monotonically increasing metric value that is positively correlated with the optimality of k... Apologies if any of this sounds stupid w.r.t. your situation / scikit-learn's API :) |
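Roughly, the two-function split and the linear search suggested above could look like the sketch below. The Calinski-Harabasz index (available in recent scikit-learn releases as calinski_harabasz_score) is used as a stand-in raw criterion; since higher already means a better k for it, the monotonic wrapper is trivial here. Function names and the k range are hypothetical.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

def raw_index(X, labels):
    # metric 1: raw value of the criterion for a given clustering
    return calinski_harabasz_score(X, labels)

def k_score(X, labels):
    # metric 2: monotonic function of metric 1, oriented so that higher = better k
    return raw_index(X, labels)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
best_k = max(range(2, 11),
             key=lambda k: k_score(X, KMeans(n_clusters=k, n_init=10,
                                             random_state=0).fit_predict(X)))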
All right. I can refactor all methods as suggested. I only have three remarks / questions:
|
You're right that GridSearchCV isn't exactly suited to this task, though I suspect that applying these metrics to multiple samples from the data and evaluating the score on each might be more robust than only evaluating the score for the full population. I've not read enough clustering literature to know if this is done. I think there were also discussions about how to optimise TSNE's metric (which does not determine number of clusters). Ideally it should be possible to do this with a |
+1 on having a mechanism to search over parameters for transductive methods. |
From the literature I have read, it's mostly done on the whole population. (I finished my thesis in early 2014. Clustering was not a main topic, but I did look at the selection of number of clusters for inspiration for my problem of parameter selection). I agree it should be more robust, but I wouldn't be able to reference the algorithm. Stability is evaluating a score on multiple samples, so CV shouldn't help it much. (There are many ways to do stability. This code subsamples and compares clusters done on subsamples, but stability can be done by clustering data with added noise, by subsampling and comparing to clusters done on the whole dataset, etc). |
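As a rough illustration of the subsample-and-compare flavour of stability described above (this is not the PR's implementation: the adjusted Rand index is used as the label-permutation-invariant similarity, KMeans as the example clusterer, and the function name and defaults are hypothetical):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability(X, k, n_draws=10, subsample=0.8, random_state=0):
    # cluster two random subsamples and compare the labellings on their overlap
    rng = np.random.RandomState(random_state)
    n, size = len(X), int(subsample * len(X))
    scores = []
    for _ in range(n_draws):
        a = rng.choice(n, size, replace=False)
        b = rng.choice(n, size, replace=False)
        labels_a = KMeans(n_clusters=k, n_init=10).fit_predict(X[a])
        labels_b = KMeans(n_clusters=k, n_init=10).fit_predict(X[b])
        common, ia, ib = np.intersect1d(a, b, return_indices=True)
        scores.append(adjusted_rand_score(labels_a[ia], labels_b[ib]))
    return np.mean(scores)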
On the one hand, stability-like algorithms can easily be generalized to other transductive algorithms. One would have to specify which output is used for stability (here, it's the cluster assignment; for a semi-supervised regression with feature selection, the set of selected features could be the output that we want to be stable). On the other hand, IMO, some of the algorithms are very cluster-specific (mainly the ones based on distortion). They could be refactored into a GridSearch-like algorithm which:
But, to me, it seems like a lot of generalisation for little gain or re-use. In particular, the goal of my pull request was to implement some classic algorithms to select the number of clusters (which I'd like to have in scikit-learn). Can I refactor them as ragv suggested and update the pull request? |
da070ae
to
fb0f911
Compare
Hello everybody, I refactored, trying to follow ragv's guidelines. Do you have other remarks on the PR? |
@afouchet thanks. This needs a review by someone who is really familiar with clustering metrics; not sure who'd be a good candidate. |
Sure, I can do that. Very briefly, silhouette results rely heavily on the distance. Using the scikit-learn clustering examples (http://scikit-learn.org/stable/modules/clustering.html), it's very hard for the silhouette to find the right number of clusters in the first 2 cases (2 concentric circles and 2 half circles), whereas stability should recover the right number. Nevertheless, stability is slower. I'll run them on scikit-learn's clustering examples. |
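A sketch of the kind of toy-data experiment mentioned above, using the two concentric circles (an illustration only, not the PR's example script; the clusterer and parameters are arbitrary choices):

from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import silhouette_score

X, _ = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
for k in range(2, 6):
    labels = SpectralClustering(n_clusters=k, affinity="nearest_neighbors",
                                random_state=0).fit_predict(X)
    # the Euclidean silhouette tends not to peak at k=2 on the concentric circles,
    # which is the behaviour discussed above
    print(k, silhouette_score(X, labels))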
If they work with these toy datasets, that is great for demonstrating and illustrating the use. That would definitely deserve its own example. |
fb0f911
to
88193ca
Compare
4648ed6
to
a3d51f4
Compare
Having recently stumbled upon the same problem, I am +1 for including more silhouette-like metrics, as proposed here in this PR, with highly-cited alternatives. |
@afouchet I have a technical question for you: does it make sense to apply GAP or… As I'm not from the clustering community, what I want to know is if GAP,… Sorry for my stupid question and thanks in advance if you have an answer to that. |
@tguillemot: the question is rather "do GAP or distortion fit my data?" (do I expect convex clusters?). In that case, they should work for any (good) clustering algorithm. Then again, if you expect convex clusters, k-means is a natural choice of clustering algorithm. So, they can be used with any clustering algorithm, as long as you expect convex clusters in your data. (When I built the example, I thought that stability would stand out as the only method that retrieves the right number of clusters on the concentric circles or the half-moons. I'm surprised that GAP worked on these examples; GAP can probably find more complex clusters than convex ones.) |
To me, to use a GAP statistic you need a score, like the inertia for k-means. Spectral clustering does not have such a score. Or maybe it could have one if we use the optimisation problem before convex relaxation (ratio cut). Also, the fact that it works on the circles is just "luck".
|
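To make the point above concrete, here is a minimal sketch of a gap-statistic computation in the spirit of Tibshirani et al. (a hypothetical helper, not the PR's code): the only quantity it needs from the clustering is a dispersion score, here the k-means inertia, computed on the data and on uniform reference data.

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, random_state=0):
    # the only thing needed from the clusterer is a dispersion score (here, inertia)
    rng = np.random.RandomState(random_state)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
    # the same score on reference data drawn uniformly in the bounding box of X
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref = [np.log(KMeans(n_clusters=k, n_init=10, random_state=0)
                  .fit(rng.uniform(mins, maxs, size=X.shape)).inertia_)
           for _ in range(n_refs)]
    return np.mean(ref) - log_wk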
I agree with @agramfort; for those methods you need a score function. In fact, for the circles example, the best case is not… Thanks @afouchet and @agramfort for the explanation. |
@afouchet @jnothman OK, before proposing a real PR, I'm sending you a gist of the refactoring of those methods: |
@jnothman What do you think about the refactoring of the clustering method : |
Hi @tguillemot; I have your gist open to look at, but haven't got to it yet. It doesn't help that I feel more comfortable leaving comments on a PR than on a gist. If you made a PR, I'm sure you'd get a reply, but I'll try to look at the gist sooner or later. |
Yes, I'm okay with that design. A couple of comments:
|
What happened? This all sounded so exciting and almost done? |
I agree @michaelaye, this would be great to have. But we are severely under-staffed, and things get lost easily. |
@michaelaye As @jnothman said, this PR would be great to have, but I'm working on other stuff in sklearn now. Some of the metrics given by @afouchet are in sklearn now (#6823). As for the question about Gap, Pham, ..., a discussion about the continuation of this work is taking place in #6948. |
For those interested in cluster stability, I see two issues with this implementation:
|
adj_matrix[i, j] = cluster_assignement[i] == cluster_assignement[j]
"""
n_samples = len(cluster_assignement)
adj_matrix = np.zeros((n_samples, n_samples))
It looks like this will require O(n^2) memory, O(n^2) time. This could be improved significantly if a sparse matrix were used.
As long as we materialise the matrix in memory (even sparsely), this needs in the worst case O(n^2) time, but only because it needs to be O(max_cluster_size^2 + n). Here's one sparse construction, assuming y is the cluster assignment and clusters are numbered from 0:
import numpy as np
from scipy.sparse import csr_matrix

# get an array of member samples for each cluster:
# [might be just as good to do this with a defaultdict(list)]
cluster_sizes = np.bincount(y)
cluster_members = np.split(np.argsort(y), np.cumsum(cluster_sizes))
# for each sample, the indices of all samples in the same cluster
lil = [cluster_members[c] for c in y]
indices = np.hstack(lil)
indptr = np.cumsum(np.hstack([0, np.take(cluster_sizes, y)]))
out = csr_matrix((np.ones_like(indices), indices, indptr))
Another sparse construction with the same condition on y:
out = cosine_similarity(sp.csr_matrix((np.ones_like(y), y, np.arange(len(y)+1))), dense_output=False).astype(int)
|
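For reference, a quick sanity check of the second construction against the dense adjacency it is meant to replace (a small toy label vector is assumed here):

import numpy as np
import scipy.sparse as sp
from sklearn.metrics.pairwise import cosine_similarity

y = np.array([0, 1, 0, 2, 1, 0])
dense = (y[:, None] == y[None, :]).astype(int)          # the O(n^2) adjacency
sparse = cosine_similarity(
    sp.csr_matrix((np.ones_like(y), y, np.arange(len(y) + 1))),
    dense_output=False).astype(int)
assert (sparse.toarray() == dense).all()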
clustering_1 = [assi_1[c] for c in common_points_1]
clustering_2 = [assi_2[c] for c in common_points_2]
return cluster_similarity(clustering_1, clustering_2)
If the order of the clusters changes, then I think all metrics besides 'fowlkes-mallows' will be computed incorrectly?
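To illustrate the concern about cluster ordering: label-comparison metrics that depend only on the partition, such as those in scikit-learn's own metrics module (available in recent releases), are invariant to a relabelling of the clusters. A quick sketch, separate from the PR's helpers:

from sklearn.metrics import adjusted_rand_score, fowlkes_mallows_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]   # same partition, cluster ids permuted
# both scores are 1.0: these comparisons depend only on the partition,
# not on the integer id assigned to each cluster
print(adjusted_rand_score(labels_a, labels_b))
print(fowlkes_mallows_score(labels_a, labels_b))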
Yes, @turian, I'd also like to see something like this progress...
But I think #6948 is the most recent work in this direction |
I would like to know if there has been any progress since the pull request 3 years ago. Is using a single linear search still the best way to find the best k? |
@Isolet I think this PR is dead |
I would welcome its resuscitation!
|
Closing as inactive. |
This module implements 7 algorithms to find the "optimal" number of clusters (stability, gap statistic, distortion jump, silhouette, Calinski and Harabasz index, and 2 "elbow" methods).
If a metric is needed (for example, to compute the distortion, the mean distance of a point to its cluster's center), any distance from scipy.spatial.distance can be used.
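For illustration, a sketch of the distortion described above with a configurable scipy.spatial.distance metric (a hypothetical helper, not necessarily the module's actual signature; labels are assumed to be integer cluster indices aligned with the rows of centers):

import numpy as np
from scipy.spatial.distance import cdist

def distortion(X, labels, centers, metric="euclidean"):
    # mean distance of each point to the centre of its assigned cluster;
    # `metric` can be any metric name accepted by scipy.spatial.distance.cdist
    dists = cdist(X, centers, metric=metric)        # shape (n_samples, n_clusters)
    return dists[np.arange(len(X)), labels].mean()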