[WIP] Add prediction strength method to determine number of clusters #8206
Conversation
I'm interested in the design, which I will look at later, but see #6948. How does this compare to the methods proposed there? Actually, this sounds a lot like what is called "stability" there, with references including:
Although here it looks like you're interested in a measure from the minimum, rather than the argmax clustering.
I'm not familiar with all the measures in #6948, but the paper by Tibshirani and Walther (2005) references the gap statistic, the Ben-Hur, Elisseeff, and Guyon paper you mentioned, the work of Kerr and Churchill, Calinski and Harabasz, and Krzanowski and Lai. Their main argument is that prediction strength is easier to interpret and comes with a theoretical justification. From their experiments, they conclude (p. 523):
My comparison to stability wasn't quite right (nor my statement about argmax). The minimum score across clusters clearly sets this aside. I wish I had some idea of what techniques were seminal enough for inclusion (i.e. what will stand the test of time).
An important difference here is that it requires that your clusterer is able to make predictions for samples it was not trained on. Only some of our clusterers can do this at the moment (excluding, for instance, hierarchical agglomerative clustering). I'm not convinced that it really requires the clusterer to be centroid-based.
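As a quick illustration (not from this PR), inductiveness can be checked by looking for a predict method on standard scikit-learn clusterers:

from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans

for Est in (KMeans, AgglomerativeClustering, DBSCAN):
    # only inductive clusterers expose predict() for unseen samples
    print(Est.__name__, hasattr(Est(), "predict"))
# prints: KMeans True, AgglomerativeClustering False, DBSCAN False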
Code under review:

continue
...
matches = 0
for i, j in permutations(xrange(cluster_test_size), 2):
Please make use of contingency_matrix to avoid this looping.
I don't think contingency_matrix would work here, because we want to check whether pairs of samples in samples_test_k are assigned to the same cluster in labels_train too. Most importantly, the actual label of the cluster is irrelevant. Assume a pair (x, y) of samples from labels_test with label k=0, then check whether the label of x equals the label of y according to labels_train; if k != 0 in labels_train, that's okay. Please see the unit tests for examples.
No, that's basically the point: all the information you need to measure a clustering based on true and predicted cluster labels is in the contingency matrix.
Quickly:
from sklearn.metrics.cluster import contingency_matrix

C = contingency_matrix(y_true, y_pred)
number_of_pairs = (C * (C - 1) / 2).sum()
# the same operation over the marginals gives the number of pairs in the true and predicted clusterings
M = C.sum(axis=0)
(M * (M - 1) / 2).sum()
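Building on that, here is a rough sketch (not the PR's actual implementation; the function name and argument order are illustrative) of how those per-column pair counts yield the prediction strength itself:

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def prediction_strength_from_contingency(labels_train, labels_test):
    # rows: labels assigned by the model fit on the training half,
    # columns: labels from clustering the test half itself
    C = contingency_matrix(labels_train, labels_test)
    # per test cluster (column): pairs of samples that the train-fitted
    # model also places in a common cluster
    same_pairs = (C * (C - 1) / 2).sum(axis=0)
    # per test cluster: all pairs of samples in that cluster
    sizes = C.sum(axis=0)
    all_pairs = sizes * (sizes - 1) / 2
    valid = all_pairs > 0  # ignore singleton clusters
    return (same_pairs[valid] / all_pairs[valid]).min()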
About the paper by Tibshirani and Walther (2005): why do you only split into two sets (training and testing), and not k folds (as in k-fold cross-validation)? Is there much randomness in the result (does the optimal n_clusters returned vary a lot depending on the chosen training and test set)? Can it be used, in theory, with all clustering algorithms? Can it be used with non-convex clusters (as in the clustering examples, two half-moons or two concentric circles)? As it is based on the clustering method's results, it seems possible.
Any method of splitting the data can be used.
As long as they can assign labels to previously unseen data, yes.
I don't see a reason why not. It might not be the best choice to check whether a clusterer for spherical data, such as k-means, is suitable for a dataset that is in fact non-convex. They discuss this problem in more detail in the paper.
@jnothman I incorporated the changes you suggested in the latest commit.
Code under review:

if n_clusters == 1:
    return 1.0  # by definition

C = contingency_matrix(labels_train, labels_test)
Ideally we would use sparse=True as with the other metrics. In which case, * needs to be replaced by .multiply and C - 1 needs to be replaced by something like _nonzero_add(C, -1) with:

def _nonzero_add(M, d):
    out = M.copy()
    out.data += d
    return out
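For instance, the column-wise pair count from the earlier snippet would then look roughly like this (a sketch assuming the _nonzero_add helper above):

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def sparse_same_pair_counts(labels_train, labels_test):
    C = contingency_matrix(labels_train, labels_test, sparse=True)
    # elementwise C * (C - 1) without densifying the sparse matrix
    prod = C.multiply(_nonzero_add(C, -1))
    # number of co-assigned pairs per test cluster (column)
    return np.asarray(prod.sum(axis=0)).ravel() / 2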
@jnothman I updated the code to use a sparse matrix and rebased the changes on top of the latest master.
Codecov Report
@@ Coverage Diff @@
## master #8206 +/- ##
==========================================
- Coverage 94.75% 94.73% -0.02%
==========================================
Files 342 342
Lines 60801 61001 +200
==========================================
+ Hits 57609 57792 +183
- Misses 3192 3209 +17
Hi @sebp, I know it has been a while and I am sorry for that. Are you still interested in working on this pull request? Do you think you would be able to synchronize with upstream? Thanks for your work so far.
It's been almost 3 years, so I'd have to check how upstream has changed by now. If it doesn't take a huge amount of time to rebase, I'd be willing to do it.
Thank you @sebp. Let us know if you need help.
@cmarmo I re-applied my changes on top of master. Currently, there's one issue with
Any ideas where this is coming from?
Fixed the issue from above. It appears that coverage is too low; I can fix this, but will wait for a code review first.
Just some quick feedback: I don't get the motivation for a dedicated PredictionStrengthGridSearchCV object. To me this is yet another scoring method and it should leverage the scoring API that works with the sklearn model selection machinery. Also, we already have a number of scores for clustering. Why another one? Are there empirical or theoretical arguments to motivate such an addition? my 2c
@agramfort it seems to me to be somewhat different to CV (but this is a fault of the proposed API too). We do not care about held-out data splits, but are effectively making arbitrary partitions of the dataset into two (or could we use two bootstraps), building two corresponding models, and then comparing their predictions on all samples in the dataset. No held-out data should be needed for evaluation. (The current PR seems to evaluate only on the training data, rather than the full dataset, and I don't know why that would be beneficial.) Scorers do not have this capability because they are not given the full (or training) dataset. I don't really see why the splitters used in CV would be appropriate. Why would you want such differently sized train and test sets as are produced by KFold? (Grouped splitting strategies are still well motivated in the case of non-independent samples, however.)
hum... regarding using the same data for train and test: you could return a splitter that just returns np.arange(n_samples) for both train and test, so you could evaluate/score on the train data if that's what you want. my 2c
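A minimal sketch of such a splitter (purely illustrative, not part of this PR):

import numpy as np

class IdentitySplitter:
    # yields all sample indices as both the train and the test set
    def __init__(self, n_splits=1):
        self.n_splits = n_splits

    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        for _ in range(self.n_splits):
            yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits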
I don't really see how that adds up to the needed capability. For each parameter setting, the proposal is to:
1. split the data into two parts,
2. fit one model on each part,
3. score the agreement between the two models' cluster assignments,
before refitting on the whole of X.
OK, so what do you suggest? A new meta-estimator to implement this?
If we think it's worth including in the library, then yes, a new meta-estimator seems appropriate. I'm not so comfortable with the current implementation's reuse of _fit_and_score and the rest of the grid search mechanism. But I've not thought about how it should be done better.
@sebp is there a reason you want to reuse the concept of CV, when we should probably always be using random equal partitions of the data?
I should say that I am, in general, still interested in providing better tools for clustering model selection. I think a major downside of this one is that it requires the model to be inductive, which only a few of our clusterers are. (Although maybe we could allow for nearest-neighbor based labelling of new points for the purposes of this technique, when applying it to DBSCAN, OPTICS, etc.) While I can see that it is beneficial to reuse some of the implementation in BaseSearchCV (although there are a lot of differences if we acknowledge that fit time is ambiguous when we fit two models for each candidate, that the scoring interface is perhaps not relevant, etc.), I think we should at least consider alternatives to modifying it.
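To make the nearest-neighbor idea concrete, a hypothetical wrapper (the class name and behaviour are assumptions, not part of this PR) could label unseen points with the cluster of their nearest non-noise training point:

from sklearn.neighbors import KNeighborsClassifier

class NearestNeighborLabeler:
    def __init__(self, clusterer):
        self.clusterer = clusterer

    def fit(self, X):
        labels = self.clusterer.fit_predict(X)
        keep = labels != -1  # DBSCAN/OPTICS mark noise samples with -1
        # assumes at least one non-noise sample was found
        self.nn_ = KNeighborsClassifier(n_neighbors=1).fit(X[keep], labels[keep])
        return self

    def predict(self, X):
        return self.nn_.predict(X)

Wrapped this way, e.g. NearestNeighborLabeler(DBSCAN()) would expose the predict method this technique needs.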
I don't like the PredictionStrengthGridSearchCV in _search.py: it leaks one clustering score into a common module. Is there a reference / a name for what this class implements, beyond the work of Tibshirani in 2005?
@jnothman My main motivation to subclass GridSearchCV is to avoid duplicating code related to checking parameters, running fit/score in parallel, and the verbose print statements in _fit_and_score.
I understand that from an API design point of view, the proposed solution is not ideal. If PredictionStrengthGridSearchCV becomes independent of GridSearchCV, this will disentangle it but add duplicate code, which also has its downsides.
If you and the sklearn maintainers agree that prediction strength should be included in sklearn, I'd be happy to update the code to fit the API design you deem best.
Description
Prediction strength is a metric that measures the stability of a clustering algorithm and can be used to determine an optimal number of clusters without knowing the true cluster assignments. It was introduced by Tibshirani and Walther (2005).
First, one splits the data into two parts (A) and (B). One obtains two cluster assignments, the first one using the centroids derived from the subset (A), and the second one using the centroids from the subset (B). Prediction strength measures the proportion of observation pairs that are assigned to the same clusters according to both clusterings. The overall prediction strength is the minimum of this quantity over all predicted clusters.
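A rough end-to-end sketch of that procedure with KMeans on synthetic data (it assumes the prediction_strength_score helper proposed in this PR; the argument order shown here is an assumption):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
X_a, X_b = train_test_split(X, test_size=0.5, random_state=0)

strengths = {}
for k in range(1, 8):
    km_a = KMeans(n_clusters=k, random_state=0).fit(X_a)
    km_b = KMeans(n_clusters=k, random_state=0).fit(X_b)
    # label subset (B) with the model fit on (A) and compare against the
    # clustering obtained on (B) itself; k=1 scores 1.0 by definition
    strengths[k] = prediction_strength_score(km_a.predict(X_b), km_b.labels_)

# largest k whose prediction strength exceeds the suggested 0.8 threshold
best_k = max(k for k, ps in strengths.items() if ps >= 0.8)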
The definition of prediction strength makes it difficult to integrate it into the existing GridSearchCV and metrics API, because a clusterer needs to be trained twice, each time on a different subset of the data, and evaluated on a single subset.
Implementation
Given two lists of cluster assignments, the function prediction_strength_score computes the actual score. The class PredictionStrengthGridSearchCV extends GridSearchCV and modifies the way _fit_and_score operates. Instead of fitting an estimator on a training set and evaluating it on a test set (the _default_fit_and_score function), we need to fit and predict twice and compute the prediction strength based on the two predictions (the _prediction_strength_fit_and_score function). In addition, PredictionStrengthGridSearchCV uses the cross-validation generator twice, where the roles of training and test set are reversed the second time around. The final prediction strength for a given k and fold is the average of the two runs.
Moreover, PredictionStrengthGridSearchCV overrides how the best estimator is selected. The optimal number of clusters k is the largest k such that the corresponding prediction strength is above some threshold. Tibshirani and Walther suggest a threshold in the range 0.8 to 0.9 for well-separated clusters. If no configuration exceeds the threshold, a warning is displayed and the configuration with the highest prediction strength is selected. PredictionStrengthGridSearchCV assumes that the param_grid parameter contains a key "n_clusters".
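For reference, a simplified sketch of that two-direction evaluation for a single fold (this is not the actual _prediction_strength_fit_and_score; the prediction_strength_score call and its argument order are assumptions based on the description above):

from sklearn.base import clone

def fold_prediction_strength(estimator, X, train_idx, test_idx):
    # fit one clusterer per half; the estimator is assumed to expose
    # labels_ after fitting and predict() for unseen samples
    est_train = clone(estimator).fit(X[train_idx])
    est_test = clone(estimator).fit(X[test_idx])
    # direction 1: the model fit on the training half labels the test half
    ps_one = prediction_strength_score(est_train.predict(X[test_idx]), est_test.labels_)
    # direction 2: the roles of the two halves are reversed
    ps_two = prediction_strength_score(est_test.predict(X[train_idx]), est_train.labels_)
    return (ps_one + ps_two) / 2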
Open questions
The user could define additional parameters besides "n_clusters" in the parameter grid. For a given number of clusters, one would first need to choose the configuration that maximizes the prediction strength, and then choose the maximum k such that the maximum prediction strength is above a threshold.
Alternatively, one could remove "param_grid" and only offer the option to define a list of parameters for "n_clusters".
Finally, I had to add a flag use_prediction_strength to _fit_and_score to allow deviating from the common case (fit once and predict once). I welcome any suggestions on how this could be improved without duplicating too much code.