
[WIP] Add prediction strength method to determine number of clusters #8206


Open
sebp wants to merge 5 commits into main from prediction-strength

Conversation

@sebp (Contributor) commented Jan 15, 2017

Description

Prediction strength is a metric that measures the stability of a clustering algorithm and can be used to determine an optimal number of clusters without knowing the true cluster assignments. It was introduced by Tibshirani and Guenther (2005).

First, the data are split into two parts, (A) and (B). Two cluster assignments are then obtained: one using the centroids derived from subset (A), and one using the centroids derived from subset (B). Prediction strength measures the proportion of observation pairs that are assigned to the same cluster according to both clusterings. The overall prediction strength is the minimum of this quantity over all predicted clusters.
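
For illustration only (this is a sketch of the definition above, not the code in this PR, and the function name is made up), the score could be computed by naive pair counting:

from itertools import combinations

import numpy as np

def prediction_strength_naive(labels_train, labels_test):
    # labels_test: cluster assignments obtained from the test-derived centroids
    # labels_train: assignments of the same samples to the train-derived centroids
    labels_train = np.asarray(labels_train)
    labels_test = np.asarray(labels_test)
    strengths = []
    for k in np.unique(labels_test):
        members = np.flatnonzero(labels_test == k)
        if members.shape[0] < 2:
            continue  # a singleton cluster contributes no pairs
        pairs = list(combinations(members, 2))
        same = sum(labels_train[i] == labels_train[j] for i, j in pairs)
        strengths.append(same / len(pairs))
    # minimum over all predicted clusters; 1.0 if every cluster is a singleton
    return min(strengths) if strengths else 1.0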

The definition of prediction strength makes it difficult to integrate it into the existing GridSearchCV and metrics API, because a clusterer needs to be trained twice, each time on a different subset of the data, and evaluated on a single subset.

Implementation

Given two lists of cluster assignments, the function prediction_strength_score computes the actual score.
The class PredictionStrengthGridSearchCV extends GridSearchCV and modifies the way _fit_and_score operates. Instead of fitting an estimator on a training set and evaluating it on a test set (the _default_fit_and_score function), we need to fit and predict twice and compute the prediction strength based on the two predictions (the _prediction_strength_fit_and_score function). In addition, PredictionStrengthGridSearchCV uses the cross-validation generator twice, with the roles of training and test set reversed the second time around. The final prediction strength for a given k and fold is the average of both runs.

Moreover, PredictionStrengthGridSearchCV overrides how the best estimator is selected. The optimal number of clusters k is the largest k such that the corresponding prediction strength is above some threshold. Tibshirani and Guenther suggest a threshold in the range 0.8 to 0.9 for well separated clusters. If no configuration exceeds the threshold, a warning is displayed and the configuration with the highest prediction strength is selected.
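
A standalone sketch of that selection rule (the helper name and signature here are purely illustrative, not the PR's API):

import warnings

import numpy as np

def select_n_clusters(candidates, strengths, threshold=0.8):
    # candidates: the tried values of n_clusters; strengths: their prediction strengths
    candidates = np.asarray(candidates)
    strengths = np.asarray(strengths)
    above = strengths >= threshold
    if not above.any():
        warnings.warn("no candidate exceeds the prediction strength threshold; "
                      "falling back to the best-scoring candidate")
        return int(candidates[strengths.argmax()])
    # largest number of clusters whose prediction strength exceeds the threshold
    return int(candidates[above].max())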

PredictionStrengthGridSearchCV assumes that the param_grid parameter contains a key "n_clusters".

Open questions

  1. The user could define additional parameters besides "n_clusters" in the parameter grid. For a given number of clusters, one would first need to choose the configuration that maximizes the prediction strength, and then choose the largest k such that this maximum prediction strength is above the threshold.

  2. Alternatively, one could remove "param_grid" and only offer the option to define a list of values for "n_clusters".

  3. Finally, I had to add a flag use_prediction_strength to _fit_and_score to allow deviating from the common case: fit once and predict once. I welcome any suggestions on how this could be improved without duplicating too much code.

@jnothman (Member)

I'm interested in the design, which I will look at later, but see #6948. How does this compare to the methods proposed there? Actually, this sounds a lot like what is called "stability" there, with references including:

  • "A stability based method for discovering structure in clustered data" <http://www.researchgate.net/profile/Asa_Ben-Hur/publication/11435997_A_stability_based_method_for_discovering_structure_in_clustered_data/links/00b4953988b6d0233d000000.pdf>_
    Ben-Hur, A., Elisseeff, A., & Guyon, I., Pacific symposium on biocomputing (2001, December)

  • "Clustering stability: an overview" <http://arxiv.org/pdf/1007.1075.pdf>_
    Von Luxburg, U., (2010)

Although here it looks like you're interested in a measure taken from the minimum over clusters, rather than from the argmax clustering.

@sebp (Contributor, Author) commented Jan 16, 2017

I'm not familiar with all the measures in #6948, but the paper by Tibshirani and Guenther (2005) references the gap statistic, the Ben-Hur, Elisseeff, and Guyon paper you mentioned, the work of Kerr and Churchill, Calinski and Harabasz, and Krzanowski and Lai. Their main argument is that prediction strength is easier to interpret and comes with a theoretical justification. From their experiments, they conclude (p. 523):

Overall, the simulation study shows that the prediction strength estimate compares well to the other methods except for strongly elongated clusters.

@jnothman (Member) left a comment

My comparison to stability wasn't quite right (nor my statement about argmax). The minimum score across clusters clearly sets this apart. I wish I had some idea of which techniques were seminal enough for inclusion (i.e. what will stand the test of time).

An important difference here is that it requires that your clusterer is able to make predictions for samples it was not trained on. Only some of our clusterers can do this atm (excluding, for instance, hierarchical agglomerative clustering). I'm not convinced that it really requires the clusterer to be centroid based.

continue

matches = 0
for i, j in permutations(xrange(cluster_test_size), 2):
@jnothman (Member)

Please make use of contingency_matrix to avoid this looping.

@sebp (Contributor, Author)

I don't think contingency_matrix would work here, because we want to check whether pairs of samples in samples_test_k are assigned to the same cluster in labels_train too. Most importantly, the actual label of the cluster is irrelevant. Assume a pair (x, y) of samples from labels_test with label k=0; we then check whether the label of x equals the label of y according to labels_train, and it does not matter if that shared label is different from 0 in labels_train. Please see the unit tests for examples.

@jnothman (Member) commented Jan 21, 2017

No, that's basically the point: all the information you need to measure a clustering based on true and predicted cluster labels is in the contingency matrix.

Quickly:

from sklearn.metrics.cluster import contingency_matrix

C = contingency_matrix(y_true, y_pred)
# pairs of samples assigned to the same cluster in both labelings
number_of_pairs = (C * (C - 1) / 2).sum()
# the same operation over the marginals gets the number of pairs in the true and predicted clusterings
M = C.sum(axis=0)
(M * (M - 1) / 2).sum()
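
For what it's worth, a dense sketch of the full prediction strength computed this way (my own illustration, not code from this PR):

import numpy as np

from sklearn.metrics.cluster import contingency_matrix

def prediction_strength_from_contingency(labels_train, labels_test):
    # rows correspond to train-based labels, columns to test clusters
    C = contingency_matrix(labels_train, labels_test)
    sizes = C.sum(axis=0)                 # number of samples per test cluster
    same = (C * (C - 1)).sum(axis=0)      # ordered pairs co-assigned by the train labeling
    mask = sizes > 1                      # singleton clusters contribute no pairs
    if not mask.any():
        return 1.0
    return (same[mask] / (sizes[mask] * (sizes[mask] - 1))).min()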

@afouchet

About the paper by Tibshirani and Guenther (2005): why do you only split into two sets (training and testing), and not into k sets (like a k-fold cross-validation)? Is there much randomness in the result (does the optimal n_clusters returned vary a lot depending on the chosen training and test set)?

Can it be used, in theory, with all clustering algorithms?

Can it be used with non-convex clusters (in the clustering examples, two half-moons or two concentric circles)? As it is based on the clustering method's results, it seems possible.

@sebp (Contributor, Author) commented Jan 20, 2017

About the paper by Tibshirani and Guenther (2005): why do you only split into two sets (training and testing), and not into k sets (like a k-fold cross-validation)? Is there much randomness in the result (does the optimal n_clusters returned vary a lot depending on the chosen training and test set)?

Any method of splitting the data can be used, see PredictionStrengthGridSearchCV._fit_parallel. In fact, cross-validation is repeated twice, with the role of training and test set reversed the second time around. Also keep in mind that training and testing are probably not the right words in this context, because for each train/test split, we train a clusterer on both parts of the data (separately) and predict the train portion of the data. IMHO, the names training and testing are arbitrary in this context.

Can it be used, in theory, with all clustering algorithms?

As long as they can assign labels to previously unseen data, yes.

Can it be used with non-convex clusters (in the clustering examples, two half-moons or two concentric circles)? As it is based on the clustering method's results, it seems possible.

I don't see a reason why not, although it might not be the best way to check whether a clusterer designed for spherical clusters, such as k-means, is suitable for a dataset that is in fact non-convex. They discuss this problem in more detail in the paper.

@sebp force-pushed the prediction-strength branch from 535bb20 to e298446 on January 26, 2017, 22:45
@sebp (Contributor, Author) commented Jan 26, 2017

@jnothman I incorporated the changes suggested by you in the latest commit.

if n_clusters == 1:
    return 1.0  # by definition

C = contingency_matrix(labels_train, labels_test)
@jnothman (Member)

Ideally we would use sparse=True, as with the other metrics. In that case, * needs to be replaced by .multiply, and C - 1 needs to be replaced by something like _nonzero_add(C, -1), with:

def _nonzero_add(M, d):
    # add d to the stored (nonzero) entries only, preserving sparsity
    out = M.copy()
    out.data += d
    return out
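
A toy sketch of how the sparse pieces might fit together (it assumes the _nonzero_add helper above and is only an illustration, not the PR's code):

import numpy as np

from sklearn.metrics.cluster import contingency_matrix

labels_train = np.array([0, 0, 1, 1, 1, 2])  # toy assignments, for illustration only
labels_test = np.array([0, 0, 0, 1, 1, 1])

C = contingency_matrix(labels_train, labels_test, sparse=True)
same = np.asarray(C.multiply(_nonzero_add(C, -1)).sum(axis=0)).ravel()
sizes = np.asarray(C.sum(axis=0)).ravel()
strength = (same / (sizes * (sizes - 1))).min()  # 1/3 for this toy data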

@jnothman (Member)

Thanks @sebp. We (and I) have quite a backlog, so expect a full review to take a while, especially given the tricky API design questions in relation to #6948.

@sebp force-pushed the prediction-strength branch from fba59f0 to e300b2d on February 13, 2017, 22:21
@sebp (Contributor, Author) commented Feb 13, 2017

@jnothman I updated the code to use a sparse matrix and rebased the changes on top of latest master.

@codecov (bot) commented Feb 13, 2017

Codecov Report

Merging #8206 into master will decrease coverage by 0.02%.
The diff coverage is 91.96%.

@@            Coverage Diff             @@
##           master    #8206      +/-   ##
==========================================
- Coverage   94.75%   94.73%   -0.02%     
==========================================
  Files         342      342              
  Lines       60801    61001     +200     
==========================================
+ Hits        57609    57792     +183     
- Misses       3192     3209      +17
Impacted Files                                         Coverage Δ
sklearn/model_selection/__init__.py                    100% <100%> (ø)
sklearn/metrics/cluster/__init__.py                    100% <100%> (ø)
sklearn/metrics/cluster/unsupervised.py                100% <100%> (ø)
sklearn/model_selection/_search.py                     98.21% <100%> (+0.33%)
sklearn/metrics/cluster/tests/test_unsupervised.py     100% <100%> (ø)
sklearn/utils/estimator_checks.py                      92.61% <100%> (ø)
sklearn/utils/testing.py                               86.98% <100%> (ø)
sklearn/metrics/__init__.py                            100% <100%> (ø)
sklearn/model_selection/_validation.py                 93.96% <70.21%> (-4.76%)
sklearn/model_selection/tests/test_search.py           98.83% <92.3%> (-0.48%)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@cmarmo (Contributor) commented Dec 15, 2020

Hi @sebp, I know it has been a while and I am sorry for that. Are you still interested in working on this pull request? Do you think you would be able to synchronize with upstream? Thanks for your work so far.

@sebp (Contributor, Author) commented Dec 15, 2020 via email

@cmarmo (Contributor) commented Dec 15, 2020

If it doesn't take a huge amount of time to rebase, I'd be willing to do it.

Thank you @sebp. Let us know if you need help.

@sebp force-pushed the prediction-strength branch 3 times, most recently from 93d0639 to 1e6650b, on January 5, 2021, 15:22
@sebp force-pushed the prediction-strength branch 2 times, most recently from 4ab6d7c to c9a3116, on January 5, 2021, 15:54
@sebp (Contributor, Author) commented Jan 5, 2021

@cmarmo I re-applied my changes on top of master.

Currently, there's one issue with test_prediction_strength_fit_failing_clusterer() in sklearn/model_selection/tests/test_search.py. I created a class FailingClusterer, which sub-classes BaseEstimator and throws an error in its fit method to trigger the except block in _prediction_strength_fit_and_score(). For some reason, (get|set)_params() of FailingClusterer is convinced that the argument to __init__ is named parameter, leading to the following error when executing sklearn/model_selection/_validation.py:582:

Invalid parameter n_clusters for estimator FailingClassifier(parameter=5). Check the list of available parameters with `estimator.get_params().keys()`.'.

Any ideas where this is coming from?

@sebp force-pushed the prediction-strength branch from c9a3116 to 6ec55b5 on January 5, 2021, 19:02
@sebp force-pushed the prediction-strength branch from 6ec55b5 to c81d564 on January 5, 2021, 19:03
@sebp (Contributor, Author) commented Jan 5, 2021

Fixed the issue from above.

It appears that test coverage is too low. I can fix this, but I will wait for a code review first.

@agramfort (Member)

Just a quick bit of feedback: I don't get the motivation for a dedicated PredictionStrengthGridSearchCV object. To me this is yet another scoring method, and it should leverage the scoring API that works with the sklearn model selection machinery.

Also, we already have a number of scores for clustering. Why another one? Are there empirical or theoretical arguments to motivate such an addition?

my 2c

@jnothman (Member)

@agramfort it seems to me to be somewhat different to CV (but this is a fault of the proposed API too). We do not care about held-out data splits, but are effectively making arbitrary partitions of the dataset into two (or could we use two bootstraps), building two corresponding models, and then comparing their predictions on all samples in the dataset. No held-out data should be needed for evaluation. (The current PR seems to evaluate only on the training data, rather than the full dataset, and I don't know why that would be beneficial.) Scorers do not have this capability because they are not given the full (or training) dataset.

I don't really see why the splitters used in CV would be appropriate. Why would you want such differently sized train and test sets as are produced by KFold? (Grouped splitting strategies are still well motivated in the case of non-independent samples, however.)

@agramfort (Member)

agramfort commented Jan 11, 2021 via email

@jnothman (Member)

regarding the same data for train and test you could return a splitter that just returns np.arange(n_samples) for train and test so you could evaluate/score on train data if it's what you want.

I don't really see how that adds up to the needed capability. For each parameter setting, the proposal is to:

  • training on X[m] and predicting on X
  • training on X[~m] and predicting on X
  • comparing those predictions to produce a score

Then refitting on the whole of X.
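
An illustrative sketch of that scheme (the use of KMeans and all names here are mine, not taken from the PR):

import numpy as np

from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
m = rng.permutation(len(X)) < len(X) // 2  # random equal partition of the samples

labels_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[m]).predict(X)
labels_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X[~m]).predict(X)
# labels_a and labels_b would then be compared (e.g. via prediction strength)
# before refitting a final model on the whole of X.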

@agramfort (Member)

agramfort commented Jan 12, 2021 via email

@jnothman (Member)

jnothman commented Jan 12, 2021 via email

@cmarmo added the Needs Decision label on Jan 14, 2021
@jnothman (Member)

@sebp is there a reason you want to reuse the concept of CV, when we should probably always be using random equal partitions of the data?

@jnothman (Member)

I should say that I am, in general, still interested in providing better tools for clustering model selection.

I think a major downside of this one is that it requires the model to be inductive, which only a few of our clusterers are. (Although maybe we could allow for nearest-neighbor based labelling of new points for the purposes of this technique, when applying it to DBSCAN, OPTICS, etc.)

While I can see that it is beneficial to reuse some of the implementation in BaseSearchCV (although there are a lot of differences if we acknowledge that fit time is ambiguous when we fit two models for each candidate, that the scoring interface is perhaps not relevant, etc), I think we should at least consider alternatives to modifying _fit_and_score. One alternative is to make _fit_and_score pluggable by making it a method of BaseSearchCV that is then overridden here...?

@agramfort (Member)

I don't like the PredictionStrengthGridSearchCV in _search.py:

it leaks one clustering score into a common module.

Is there a ref / a name for what this class implements, beyond the work of Tibshirani in 2005?

@sebp (Contributor, Author) commented Jan 16, 2021 via email

Base automatically changed from master to main January 22, 2021 10:49