
[WIP] Add prediction strength method to determine number of clusters #8206


Open
sebp wants to merge 5 commits into main from prediction-strength

Conversation

@sebp (Contributor) commented Jan 15, 2017

Description

Prediction strength is a metric that measures the stability of a clustering algorithm and can be used to determine an optimal number of clusters without knowing the true cluster assignments. It was introduced by Tibshirani and Guenther (2005).

First, the data are split into two parts, (A) and (B). Two cluster assignments are then obtained: one using the centroids derived from subset (A), and one using the centroids derived from subset (B). Prediction strength measures the proportion of observation pairs that are assigned to the same cluster according to both clusterings. The overall prediction strength is the minimum of this quantity over all predicted clusters.
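
For illustration only (this is a sketch of the definition above, not the code in this PR, and the function name is made up), the score could be computed by naive pair counting:

from itertools import combinations

import numpy as np

def prediction_strength_naive(labels_train, labels_test):
    # labels_test: cluster assignments obtained from the test-derived centroids
    # labels_train: assignments of the same samples to the train-derived centroids
    labels_train = np.asarray(labels_train)
    labels_test = np.asarray(labels_test)
    strengths = []
    for k in np.unique(labels_test):
        members = np.flatnonzero(labels_test == k)
        if members.shape[0] < 2:
            continue  # a singleton cluster contributes no pairs
        pairs = list(combinations(members, 2))
        same = sum(labels_train[i] == labels_train[j] for i, j in pairs)
        strengths.append(same / len(pairs))
    # minimum over all predicted clusters; 1.0 if every cluster is a singleton
    return min(strengths) if strengths else 1.0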

The definition of prediction strength makes it difficult to integrate it into the existing GridSearchCV and metrics API, because a clusterer needs to be trained twice, each time on a different subset of the data, and evaluated on a single subset.

Implementation

Given two lists of cluster assignments, the function prediction_strength_score computes the actual score.
The class PredictionStrengthGridSearchCV extends GridSearchCV and modifies the way _fit_and_score operates. Instead of fitting an estimator on a training set and evaluating it on a test set (the _default_fit_and_score function), we need to fit and predict twice and compute the prediction strength based on the two predictions (the _prediction_strength_fit_and_score function). In addition, PredictionStrengthGridSearchCV uses the cross-validation generator twice, with the roles of training and test set reversed the second time around. The final prediction strength for a given k and fold is the average of both runs.

Moreover, PredictionStrengthGridSearchCV overrides how the best estimator is selected. The optimal number of clusters k is the largest k such that the corresponding prediction strength is above some threshold. Tibshirani and Guenther suggest a threshold in the range 0.8 to 0.9 for well separated clusters. If no configuration exceeds the threshold, a warning is displayed and the configuration with the highest prediction strength is selected.
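
A standalone sketch of that selection rule (the helper name and signature here are purely illustrative, not the PR's API):

import warnings

import numpy as np

def select_n_clusters(candidates, strengths, threshold=0.8):
    # candidates: the tried values of n_clusters; strengths: their prediction strengths
    candidates = np.asarray(candidates)
    strengths = np.asarray(strengths)
    above = strengths >= threshold
    if not above.any():
        warnings.warn("no candidate exceeds the prediction strength threshold; "
                      "falling back to the best-scoring candidate")
        return int(candidates[strengths.argmax()])
    # largest number of clusters whose prediction strength exceeds the threshold
    return int(candidates[above].max())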

PredictionStrengthGridSearchCV assumes that the param_grid parameter contains a key "n_clusters".

Open questions

  1. The user could define additional parameters besides "n_clusters" in the parameter grid. For a given number of clusters, one would first need to choose the configuration that maximizes the prediction strength, and then choose the largest k such that this maximum prediction strength is above the threshold.

  2. Alternatively, one could remove "param_grid" and only offer the option to define a list of values for "n_clusters".

  3. Finally, I had to add a flag use_prediction_strength to _fit_and_score to allow deviating from the common case: fit once and predict once. I welcome any suggestions on how this could be improved without duplicating too much code.

@jnothman (Member)

I'm interested in the design, which I will look at later, but see #6948. How does this compare to the methods proposed there? Actually, this sounds a lot like what is called "stability" there, with references including:

  • "A stability based method for discovering structure in clustered data" <http://www.researchgate.net/profile/Asa_Ben-Hur/publication/11435997_A_stability_based_method_for_discovering_structure_in_clustered_data/links/00b4953988b6d0233d000000.pdf>_
    Ben-Hur, A., Elisseeff, A., & Guyon, I., Pacific symposium on biocomputing (2001, December)

  • "Clustering stability: an overview" <http://arxiv.org/pdf/1007.1075.pdf>_
    Von Luxburg, U., (2010)

Although here it looks like you're interested in a measure taken from the minimum over clusters, rather than from the argmax clustering.

@sebp (Contributor, Author) commented Jan 16, 2017

I'm not familiar with all the measures in #6948, but the paper by Tibshirani and Guenther (2005) references the gap statistic, the Ben-Hur, Elisseeff, and Guyon paper you mentioned, the work of Kerr and Churchill, Calinski and Harabasz, and Krzanowski and Lai. Their main argument is that prediction strength is easier to interpret and comes with a theoretical justification. From their experiments, they conclude (p. 523):

Overall, the simulation study shows that the prediction strength estimate compares well to the other methods except for strongly elongated clusters.

@jnothman (Member) left a comment

My comparison to stability wasn't quite right (nor my statement about argmax). The minimum score across clusters clearly sets this apart. I wish I had some idea of which techniques were seminal enough for inclusion (i.e. what will stand the test of time).

An important difference here is that it requires that your clusterer is able to make predictions for samples it was not trained on. Only some of our clusterers can do this atm (excluding, for instance, hierarchical agglomerative clustering). I'm not convinced that it really requires the clusterer to be centroid based.

continue

matches = 0
for i, j in permutations(xrange(cluster_test_size), 2):
@jnothman (Member)

Please make use of contingency_matrix to avoid this looping.

@sebp (Contributor, Author)

I don't think contingency_matrix would work here, because we want to check whether pairs of samples in samples_test_k are assigned to the same cluster in labels_train too. Most importantly, the actual label of the cluster is irrelevant. Assume a pair (x, y) of samples from labels_test with label k=0; we then check whether the label of x equals the label of y according to labels_train, and it does not matter if that shared label is different from 0 in labels_train. Please see the unit tests for examples.

@jnothman (Member) commented Jan 21, 2017

No, that's basically the point: all the information you need to measure a clustering based on true and predicted cluster labels is in the contingency matrix.

Quickly:

from sklearn.metrics.cluster import contingency_matrix

C = contingency_matrix(y_true, y_pred)
# pairs of samples assigned to the same cluster in both labelings
number_of_pairs = (C * (C - 1) / 2).sum()
# the same operation over the marginals gets the number of pairs in the true and predicted clusterings
M = C.sum(axis=0)
(M * (M - 1) / 2).sum()
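
For what it's worth, a dense sketch of the full prediction strength computed this way (my own illustration, not code from this PR):

import numpy as np

from sklearn.metrics.cluster import contingency_matrix

def prediction_strength_from_contingency(labels_train, labels_test):
    # rows correspond to train-based labels, columns to test clusters
    C = contingency_matrix(labels_train, labels_test)
    sizes = C.sum(axis=0)                 # number of samples per test cluster
    same = (C * (C - 1)).sum(axis=0)      # ordered pairs co-assigned by the train labeling
    mask = sizes > 1                      # singleton clusters contribute no pairs
    if not mask.any():
        return 1.0
    return (same[mask] / (sizes[mask] * (sizes[mask] - 1))).min()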

@afouchet

About the paper by Tibshirani and Guenther (2005): why do you only split into two sets (training and testing), and not into k sets (like a k-fold cross-validation)? Is there much randomness in the result (does the optimal n_clusters returned vary a lot depending on the chosen training and test set)?

Can it be used, in theory, with all clustering algorithms?

Can it be used with non-convex clusters (in the clustering examples, two half-moons or two concentric circles)? As it is based on the clustering method's results, it seems possible.

@sebp (Contributor, Author) commented Jan 20, 2017

About the paper by Tibshirani and Guenther (2005): why do you only split into two sets (training and testing), and not into k sets (like a k-fold cross-validation)? Is there much randomness in the result (does the optimal n_clusters returned vary a lot depending on the chosen training and test set)?

Any method of splitting the data can be used, see PredictionStrengthGridSearchCV._fit_parallel. In fact, cross-validation is repeated twice, with the role of training and test set reversed the second time around. Also keep in mind that training and testing are probably not the right words in this context, because for each train/test split, we train a clusterer on both parts of the data (separately) and predict the train portion of the data. IMHO, the names training and testing are arbitrary in this context.

Can it be used, in theory, with all clustering algorithms?

As long as they can assign labels to previously unseen data, yes.

Can it be used with non-convex clusters (in the clustering examples, two half-moons or two concentric circles)? As it is based on the clustering method's results, it seems possible.

I don't see a reason why not, although it might not be the best way to check whether a clusterer designed for spherical clusters, such as k-means, is suitable for a dataset that is in fact non-convex. They discuss this problem in more detail in the paper.

@sebp force-pushed the prediction-strength branch from 535bb20 to e298446 on January 26, 2017, 22:45
@sebp (Contributor, Author) commented Jan 26, 2017

@jnothman I incorporated the changes suggested by you in the latest commit.

if n_clusters == 1:
    return 1.0  # by definition

C = contingency_matrix(labels_train, labels_test)
@jnothman (Member)

Ideally we would use sparse=True, as with the other metrics. In that case, * needs to be replaced by .multiply, and C - 1 needs to be replaced by something like _nonzero_add(C, -1), with:

def _nonzero_add(M, d):
    # add d to the stored (nonzero) entries only, preserving sparsity
    out = M.copy()
    out.data += d
    return out
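
A toy sketch of how the sparse pieces might fit together (it assumes the _nonzero_add helper above and is only an illustration, not the PR's code):

import numpy as np

from sklearn.metrics.cluster import contingency_matrix

labels_train = np.array([0, 0, 1, 1, 1, 2])  # toy assignments, for illustration only
labels_test = np.array([0, 0, 0, 1, 1, 1])

C = contingency_matrix(labels_train, labels_test, sparse=True)
same = np.asarray(C.multiply(_nonzero_add(C, -1)).sum(axis=0)).ravel()
sizes = np.asarray(C.sum(axis=0)).ravel()
strength = (same / (sizes * (sizes - 1))).min()  # 1/3 for this toy data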

@jnothman (Member)

Thanks @sebp. We (and I) have quite a backlog, so expect a full review to take a while, especially given the tricky API design questions in relation to #6948.

@sebp force-pushed the prediction-strength branch from fba59f0 to e300b2d on February 13, 2017, 22:21
@sebp (Contributor, Author) commented Feb 13, 2017

@jnothman I updated the code to use a sparse matrix and rebased the changes on top of latest master.

@codecov (bot) commented Feb 13, 2017

Codecov Report

Merging #8206 into master will decrease coverage by 0.02%.
The diff coverage is 91.96%.

@@            Coverage Diff             @@
##           master    #8206      +/-   ##
==========================================
- Coverage   94.75%   94.73%   -0.02%     
==========================================
  Files         342      342              
  Lines       60801    61001     +200     
==========================================
+ Hits        57609    57792     +183     
- Misses       3192     3209      +17
Impacted Files                                         Coverage Δ
sklearn/model_selection/__init__.py                    100% <100%> (ø)
sklearn/metrics/cluster/__init__.py                    100% <100%> (ø)
sklearn/metrics/cluster/unsupervised.py                100% <100%> (ø)
sklearn/model_selection/_search.py                     98.21% <100%> (+0.33%)
sklearn/metrics/cluster/tests/test_unsupervised.py     100% <100%> (ø)
sklearn/utils/estimator_checks.py                      92.61% <100%> (ø)
sklearn/utils/testing.py                               86.98% <100%> (ø)
sklearn/metrics/__init__.py                            100% <100%> (ø)
sklearn/model_selection/_validation.py                 93.96% <70.21%> (-4.76%)
sklearn/model_selection/tests/test_search.py           98.83% <92.3%> (-0.48%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@cmarmo (Contributor) commented Dec 15, 2020

Hi @sebp, I know it has been a while and I am sorry for that. Are you still interested in working on this pull request? Do you think you would be able to synchronize with upstream? Thanks for your work so far.

@sebp (Contributor, Author) commented Dec 15, 2020 via email

@cmarmo (Contributor) commented Dec 15, 2020

If it doesn't take a huge amount of time to rebase, I'd be willing to do it.

Thank you @sebp. Let us know if you need help.

@sebp force-pushed the prediction-strength branch 3 times, most recently from 93d0639 to 1e6650b, on January 5, 2021, 15:22
@sebp force-pushed the prediction-strength branch 2 times, most recently from 4ab6d7c to c9a3116, on January 5, 2021, 15:54
@sebp (Contributor, Author) commented Jan 5, 2021

@cmarmo I re-applied my changes on top of master.

Currently, there's one issue with test_prediction_strength_fit_failing_clusterer() in sklearn/model_selection/tests/test_search.py. I created a class FailingClusterer, which sub-classes BaseEstimator and throws an error in its fit method to trigger the except block in _prediction_strength_fit_and_score(). For some reason, (get|set)_params() of FailingClusterer is convinced that the argument to __init__ is named parameter, leading to the following error when executing sklearn/model_selection/_validation.py:582:

Invalid parameter n_clusters for estimator FailingClassifier(parameter=5). Check the list of available parameters with `estimator.get_params().keys()`.'.

Any ideas where this is coming from?

@sebp force-pushed the prediction-strength branch from c9a3116 to 6ec55b5 on January 5, 2021, 19:02
@sebp force-pushed the prediction-strength branch from 6ec55b5 to c81d564 on January 5, 2021, 19:03
@sebp (Contributor, Author) commented Jan 5, 2021

Fixed the issue from above.

It appears that test coverage is too low. I can fix this, but I will wait for a code review first.

@agramfort (Member)

Just a quick bit of feedback: I don't get the motivation for a dedicated PredictionStrengthGridSearchCV object. To me this is yet another scoring method, and it should leverage the scoring API that works with the sklearn model selection machinery.

Also, we already have a number of scores for clustering. Why another one? Are there empirical or theoretical arguments to motivate such an addition?

my 2c

@jnothman (Member)

@agramfort it seems to me to be somewhat different to CV (but this is a fault of the proposed API too). We do not care about held-out data splits, but are effectively making arbitrary partitions of the dataset into two (or could we use two bootstraps), building two corresponding models, and then comparing their predictions on all samples in the dataset. No held-out data should be needed for evaluation. (The current PR seems to evaluate only on the training data, rather than the full dataset, and I don't know why that would be beneficial.) Scorers do not have this capability because they are not given the full (or training) dataset.

I don't really see why the splitters used in CV would be appropriate. Why would you want such differently sized train and test sets as are produced by KFold? (Grouped splitting strategies are still well motivated in the case of non-independent samples, however.)

@agramfort (Member)

agramfort commented Jan 11, 2021 via email

@jnothman (Member)

regarding the same data for train and test you could return a splitter that just returns np.arange(n_samples) for train and test so you could evaluate/score on train data if it's what you want.

I don't really see how that adds up to the needed capability. For each parameter setting, the proposal is to:

  • training on X[m] and predicting on X
  • training on X[~m] and predicting on X
  • comparing those predictions to produce a score

Then refitting on the whole of X.
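
An illustrative sketch of that scheme (the use of KMeans and all names here are mine, not taken from the PR):

import numpy as np

from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
m = rng.permutation(len(X)) < len(X) // 2  # random equal partition of the samples

labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[m]).predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[~m]).predict(X)
# labels_a and labels_b would then be compared (e.g. via prediction strength)
# before refitting a final model on the whole of X.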

@agramfort (Member)

agramfort commented Jan 12, 2021 via email

@jnothman (Member)

jnothman commented Jan 12, 2021 via email

@cmarmo added the Needs Decision label on Jan 14, 2021
@jnothman (Member)

@sebp is there a reason you want to reuse the concept of CV, when we should probably always be using random equal partitions of the data?

@jnothman (Member)

I should say that I am, in general, still interested in providing better tools for clustering model selection.

I think a major downside of this one is that it requires the model to be inductive, which only a few of our clusterers are. (Although maybe we could allow for nearest-neighbor based labelling of new points for the purposes of this technique, when applying it to DBSCAN, OPTICS, etc.)

While I can see that it is beneficial to reuse some of the implementation in BaseSearchCV (although there are a lot of differences if we acknowledge that fit time is ambiguous when we fit two models for each candidate, that the scoring interface is perhaps not relevant, etc), I think we should at least consider alternatives to modifying _fit_and_score. One alternative is to make _fit_and_score pluggable by making it a method of BaseSearchCV that is then overridden here...?

@agramfort (Member)

I don't like the PredictionStrengthGridSearchCV in _search.py:

it leaks one clustering score into a common module.

Is there a ref / a name for what this class implements, beyond the work of Tibshirani in 2005?

@sebp (Contributor, Author) commented Jan 16, 2021 via email

Base automatically changed from master to main January 22, 2021 10:49