SubsampledNeighborsTransformer: Subsampled nearest neighbors for faster and more space efficient estimators that accept precomputed distance matrices #17843
Conversation
Thank you for the PR @jenniferjang !
You can run make flake8-diff locally to find the flake8 errors.
Are there references to this approach being used before https://arxiv.org/abs/2006.06743?
Hi @thomasjpfan, thanks for reviewing this. I initially implemented it as a transformer, but do you think it would work better as a function instead? I can't get the test check_methods_subset_invariance to pass for the transformer because in our case, transform generates the distance matrix for the data, which shouldn't satisfy the subset invariance property. To my knowledge I haven't seen this approach being used before in the literature, at least in the context of DBSCAN.
It's not because of the distance matrix that it shouldn't satisfy the subset invariance property, only because of the random nature of the transformation. See other estimators that disable 'check_methods_subset_invariance', such as ...
I've added the tag to skip the test.
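For context, a minimal sketch of what "adding the tag" could look like, assuming the developer tag API scikit-learn used at the time (_more_tags with an _xfail_checks entry); only the tag machinery is shown, fit/transform are elided:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class SubsampledNeighborsTransformer(TransformerMixin, BaseEstimator):
    # Sketch: fit/transform omitted; only the estimator tags are shown.

    def _more_tags(self):
        # Mark transform as randomized and record that the common
        # subset-invariance check is expected to fail, with a reason.
        return {
            "non_deterministic": True,
            "_xfail_checks": {
                "check_methods_subset_invariance": (
                    "transform samples pairs of points at random, so "
                    "transform(X[mask]) is not a subset of transform(X)[mask]"
                ),
            },
        }
```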
@thomasjpfan @jnothman Thanks for the speedy review! I've updated the fit function to better conform to the transformer paradigm. Please take a look at the updated code at your convenience.
@thomasjpfan @jnothman Hi there, can you please take a look at the changes? Thank you!
Sorry for the late response. This PR most likely needs an example to demonstrate how it compares to regular DBSCAN. Looking at your paper, good candidates for an example are the Faces or Bank datasets. We do not want the example to run for too long because we run the examples every time the docs get built. The example can also cover how ...
Hi @jenniferjang, thanks for your work. Please note that some checks are failing because of ...
Sorry for the late response! @jnothman, @thomasjpfan, @cmarmo, it appears that the class ...
I've also added a new example, ...
One thing I noticed is that ...
In order to speed up subsampled DBSCAN, I sort the output of ...
Could you see if ...
Hi @thomasjpfan, at your suggestion I looked into pairwise_distances_chunked. It calculates the distance matrix for all pairs of points, right? In that case our running time would be O(n^2) -- the same as DBSCAN -- and wouldn't work for large inputs. Is there a way for ...
Interesting. The pairwise_distances implementation of Euclidean distance is taking advantage of having a constant norm to use in calculating each row and column of the distance matrix.

I don't think you can use chunking here in any straightforward way. You would need to implement a version of paired distances that accepted norms for each X, Y, as with the fast Euclidean pairwise distances implementation.
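A rough sketch of the kind of per-pair distance helper described above, assuming Euclidean distances and plain NumPy; the function name is made up for illustration and is not an existing scikit-learn API:

```python
import numpy as np


def paired_euclidean_from_norms(X, rows, cols):
    """Hypothetical helper: Euclidean distances for selected (row, col)
    pairs only, reusing squared norms computed once for all samples."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.einsum("ij,ij->i", X, X)          # ||x_i||^2 for every sample
    dots = np.einsum("ij,ij->i", X[rows], X[cols])  # x_i . x_j for each sampled pair
    sq_dists = sq_norms[rows] + sq_norms[cols] - 2.0 * dots
    return np.sqrt(np.maximum(sq_dists, 0.0))       # clip tiny negative round-off


# Example: distances for ~1% of the possible pairs of 1000 points.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
rows = rng.integers(0, len(X), 5000)
cols = rng.integers(0, len(X), 5000)
d = paired_euclidean_from_norms(X, rows, cols)
```

The cost is linear in the number of sampled pairs rather than quadratic in the number of samples, which is the point of the subsampling approach.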
Hi @jenniferjang, any update on this?
New PR: #30523
Reference Issues/PRs
See #17650
What does this implement/fix? Explain your changes.
Instead of calculating the pairwise distances for all pairs of points to obtain nearest-neighbor graphs for estimators like DBSCAN, SubsampledNeighborsTransformer only calculates distances for a fraction s of the pairs, selected uniformly at random. This makes estimators that accept precomputed distance matrices feasible for larger datasets. In recent work with Google Research [1], we found that you can get over a 200x speedup and a 250x saving in memory this way without hurting the clustering quality (in some cases s = 0.001 suffices).
[1] https://arxiv.org/abs/2006.06743
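To make the idea concrete, here is a rough, self-contained sketch of the subsampling scheme using only existing scikit-learn/SciPy APIs; the toy data, parameter values, and assembly logic are illustrative and are not the PR's actual SubsampledNeighborsTransformer implementation:

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))   # toy data; values are arbitrary
n = len(X)
s = 0.01                         # fraction of pairs to sample

# Draw roughly s * n^2 (i, j) pairs uniformly at random,
# dropping self-pairs and duplicates.
n_pairs = int(s * n * n)
rows = rng.integers(0, n, n_pairs)
cols = rng.integers(0, n, n_pairs)
keep = rows != cols
pairs = np.unique(np.stack([rows[keep], cols[keep]], axis=1), axis=0)
rows, cols = pairs[:, 0], pairs[:, 1]

# Compute distances only for the sampled pairs: O(s * n^2) work and memory.
dists = np.linalg.norm(X[rows] - X[cols], axis=1)

# Assemble a symmetric sparse distance graph and pass it to DBSCAN as a
# precomputed sparse neighborhood graph; entries that were not sampled
# are simply never considered neighbors.
graph = coo_matrix((dists, (rows, cols)), shape=(n, n)).tocsr()
graph = graph.maximum(graph.T)
labels = DBSCAN(eps=1.5, min_samples=5, metric="precomputed").fit_predict(graph)
```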
Any other comments?