Skip to content

ENH Add dtype preservation for SpectralClustering #22669

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Mar 3, 2022

Reference Issues/PRs

Partially addresses #22881
Precedes #22590

What does this implement/fix? Explain your changes.

This parametrizes tests from test_spectral.py to run on 32bit datasets.

Any other comments?

We could introduce a mechanism to be able to able to remove tests' execution on 32bit datasets if this takes too much time to complete.

@jjerphan jjerphan marked this pull request as ready for review March 4, 2022 13:19
@jjerphan jjerphan changed the title TST Adapt test_spectral.py to test implementations on 32bit datasets TST use global_dtype in sklearn/cluster/tests/test_spectral.py Mar 24, 2022
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jeremiedbb jeremiedbb added the Quick Review For PRs that are quick to review label Jun 3, 2022
@jeremiedbb
Copy link
Member

According to some irl discussions, such tests should only be added after SpectralClustering can natively work on float32 data. For now it converts to float64 right away.

Let's keep this PR open in the mean time.

@cmarmo cmarmo added float32 Issues related to support for 32bit data and removed Waiting for Reviewer Quick Review For PRs that are quick to review labels Jul 16, 2022
@jjerphan
Copy link
Member Author

Resurrecting this PR: SpectralClustering should now preserve dtype with the latest commits.

@jjerphan jjerphan changed the title TST use global_dtype in sklearn/cluster/tests/test_spectral.py ENH Add dtype preservation for SpectralClustering Nov 23, 2022
@jjerphan jjerphan added the Waiting for Second Reviewer First reviewer is done, need a second one! label Nov 23, 2022
@@ -744,7 +744,7 @@ def fit(self, X, y=None):
params["coef0"] = self.coef0
self.affinity_matrix_ = pairwise_kernels(
X, metric=self.affinity, filter_params=True, **params
)
).astype(X.dtype, copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should work on making pairwise_kernels work efficiently with float32 data first.

Since affinity="rbf" is the default, flagging SpectralClustering as dtype preserving without that prerequisite would be misleading to our users: there would be very few performance or peak memory usage gains in passing float32 data to this estimator when using the default hyper-params.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW we also need a test that checks the dtype of affinity_matrix_ when the affinity param is the str name of a kernel function and another such assertion in a test that covers the case when affinity is nearest_neighbors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need to have ArgKmin and RadiusNeighbors preserve dtypes because it is also used when affinity="precomputed_nearest_neighbors".

Let's turn this PR as a draft for now.

@@ -96,11 +97,12 @@ def test_spectral_clustering_sparse(assign_labels):
assert adjusted_rand_score(y, labels) == 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the dtype of affinity_matrix_ here.

@@ -122,13 +124,14 @@ def test_precomputed_nearest_neighbors_filtering():
assert_array_equal(results[0], results[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the dtype of affinity_matrix_ here.

@jjerphan jjerphan marked this pull request as draft November 30, 2022 09:37
@glemaitre glemaitre requested review from glemaitre and removed request for glemaitre December 28, 2022 14:04
@glemaitre glemaitre removed the Waiting for Second Reviewer First reviewer is done, need a second one! label Dec 28, 2022
@jjerphan
Copy link
Member Author

Closing for now, might reopen later.

@jjerphan jjerphan closed this Mar 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
float32 Issues related to support for 32bit data module:cluster No Changelog Needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants