SubsampledNeighborsTransformer: Subsampled nearest neighbors for faster and more space efficient estimators that accept precomputed distance matrices #17843
Conversation
Thank you for the PR @jenniferjang !
You can run make flake8-diff locally to find the flake8 errors.
Are there references to this approach being used before https://arxiv.org/abs/2006.06743?
Hi @thomasjpfan, thanks for reviewing this. I initially implemented it as a transformer, but do you think it would work better as a function instead? I can't get the test check_methods_subset_invariance to pass for the transformer because in our case, transform generates the distance matrix for the data, which shouldn't satisfy the subset invariance property. To my knowledge I haven't seen this approach being used before in the literature, at least in the context of DBSCAN.
It's not because of the distance matrix that it shouldn't satisfy the subset invariance property, only because of the random nature of the transformation. See other estimators that disable 'check_methods_subset_invariance', such as ...
I've added the tag to skip the test.
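For context, a minimal sketch of what "adding the tag" could look like, assuming the developer tag API scikit-learn used at the time (_more_tags with an _xfail_checks entry); only the tag machinery is shown, fit/transform are elided:

```python
from sklearn.base import BaseEstimator, TransformerMixin


class SubsampledNeighborsTransformer(TransformerMixin, BaseEstimator):
    # Sketch: fit/transform omitted; only the estimator tags are shown.

    def _more_tags(self):
        # Mark transform as randomized and record that the common
        # subset-invariance check is expected to fail, with a reason.
        return {
            "non_deterministic": True,
            "_xfail_checks": {
                "check_methods_subset_invariance": (
                    "transform samples pairs of points at random, so "
                    "transform(X[mask]) is not a subset of transform(X)[mask]"
                ),
            },
        }
```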
@thomasjpfan @jnothman Thanks for the speedy review! I've updated the fit function to better conform to the transformer paradigm. Please take a look at the updated code at your convenience.
@thomasjpfan @jnothman Hi there, can you please take a look at the changes? Thank you!
Sorry for the late response. This PR most likely needs an example to demonstrate how it compares to regular DBSCAN. Looking at your paper, good candidates for an example are the Faces or Bank datasets. We do not want the example to run for too long because we run the examples every time the docs get built. The example can also cover how ...
Hi @jenniferjang, thanks for your work. Please note that some checks are failing because of ...
Sorry for the late response! @jnothman, @thomasjpfan, @cmarmo, it appears that the class ...
I've also added a new example, ...
One thing I noticed is that ...
In order to speed up subsampled DBSCAN, I sort the output of ...
Could you see if ...
Hi @thomasjpfan, at your suggestion I looked into pairwise_distances_chunked. It calculates the distance matrix for all pairs of points, right? In that case our running time would be O(n^2) -- the same as DBSCAN -- and wouldn't work for large inputs. Is there a way for ...
Interesting. The pairwise_distances implementation of Euclidean distance is taking advantage of having a constant norm to use in calculating each row and column of the distance matrix.

I don't think you can use chunking here in any straightforward way. You would need to implement a version of paired distances that accepted norms for each X, Y, as with the fast Euclidean pairwise distances implementation.
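A rough sketch of the kind of per-pair distance helper described above, assuming Euclidean distances and plain NumPy; the function name is made up for illustration and is not an existing scikit-learn API:

```python
import numpy as np


def paired_euclidean_from_norms(X, rows, cols):
    """Hypothetical helper: Euclidean distances for selected (row, col)
    pairs only, reusing squared norms computed once for all samples."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.einsum("ij,ij->i", X, X)          # ||x_i||^2 for every sample
    dots = np.einsum("ij,ij->i", X[rows], X[cols])  # x_i . x_j for each sampled pair
    sq_dists = sq_norms[rows] + sq_norms[cols] - 2.0 * dots
    return np.sqrt(np.maximum(sq_dists, 0.0))       # clip tiny negative round-off


# Example: distances for ~1% of the possible pairs of 1000 points.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))
rows = rng.integers(0, len(X), 5000)
cols = rng.integers(0, len(X), 5000)
d = paired_euclidean_from_norms(X, rows, cols)
```

The cost is linear in the number of sampled pairs rather than quadratic in the number of samples, which is the point of the subsampling approach.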
Hi @jenniferjang, any update on this?
New PR: #30523
Reference Issues/PRs
See #17650
What does this implement/fix? Explain your changes.
Instead of calculating the pairwise distances for all pairs of points to obtain nearest-neighbor graphs for estimators like DBSCAN, SubsampledNeighborsTransformer only calculates distances for a fraction s of the pairs, selected uniformly at random. This makes estimators that accept precomputed distance matrices feasible for larger datasets. In recent work with Google Research [1], we found that you can get over a 200x speedup and a 250x saving in memory this way without hurting the clustering quality (in some cases s = 0.001 suffices).
[1] https://arxiv.org/abs/2006.06743
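To make the idea concrete, here is a rough, self-contained sketch of the subsampling scheme using only existing scikit-learn/SciPy APIs; the toy data, parameter values, and assembly logic are illustrative and are not the PR's actual SubsampledNeighborsTransformer implementation:

```python
import numpy as np
from scipy.sparse import coo_matrix
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))   # toy data; values are arbitrary
n = len(X)
s = 0.01                         # fraction of pairs to sample

# Draw roughly s * n^2 (i, j) pairs uniformly at random,
# dropping self-pairs and duplicates.
n_pairs = int(s * n * n)
rows = rng.integers(0, n, n_pairs)
cols = rng.integers(0, n, n_pairs)
keep = rows != cols
pairs = np.unique(np.stack([rows[keep], cols[keep]], axis=1), axis=0)
rows, cols = pairs[:, 0], pairs[:, 1]

# Compute distances only for the sampled pairs: O(s * n^2) work and memory.
dists = np.linalg.norm(X[rows] - X[cols], axis=1)

# Assemble a symmetric sparse distance graph and pass it to DBSCAN as a
# precomputed sparse neighborhood graph; entries that were not sampled
# are simply never considered neighbors.
graph = coo_matrix((dists, (rows, cols)), shape=(n, n)).tocsr()
graph = graph.maximum(graph.T)
labels = DBSCAN(eps=1.5, min_samples=5, metric="precomputed").fit_predict(graph)
```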
Any other comments?