
DBSCAN++: Run DBSCAN on 100x larger datasets, up to 100x faster in subsampling #30523


Open · wants to merge 3 commits into base: main

Conversation

@jenniferjang (Author) commented Dec 21, 2024

Reference Issues/PRs

See #17650.

What does this implement/fix? Explain your changes.

Instead of computing pairwise distances between all pairs of points to obtain the nearest-neighbor graph for DBSCAN, the subsampled option computes neighborhoods only for a fraction of points selected uniformly at random. These neighborhoods are computed only for the subsampled points, but every point in the dataset can still appear as a neighbor. As a result, we obtain only a subset of the core samples, but our research shows that clustering performance is similar.
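For illustration, here is a minimal, self-contained sketch of that idea (this is not the code in this PR; the function name and the `subsample` parameter below are placeholders), built only on the public `NearestNeighbors` API:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors


def subsampled_dbscan_sketch(X, eps=1.0, min_samples=10, subsample=0.1, seed=0):
    """Sketch of the idea: subsample core-point *candidates*, but judge
    their density against the full dataset."""
    rng = np.random.default_rng(seed)
    n = len(X)
    candidates = rng.choice(n, size=max(1, int(subsample * n)), replace=False)

    # Epsilon-neighborhoods of the candidates are queried against the FULL
    # dataset, so core status reflects the true local density.
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X[candidates], return_distance=False)
    is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])
    core_idx = candidates[is_core]
    if len(core_idx) == 0:          # no core points found in the sample
        return np.full(n, -1)       # -> everything is noise

    # Core points within eps of each other form the clusters
    # (connected components of the eps-graph restricted to core points).
    core_nn = NearestNeighbors(radius=eps).fit(X[core_idx])
    core_graph = core_nn.radius_neighbors_graph(X[core_idx], mode="connectivity")
    _, core_labels = connected_components(core_graph, directed=False)

    # Every remaining point joins the cluster of its nearest core point if
    # that core point lies within eps; otherwise it is labelled noise (-1).
    labels = np.full(n, -1)
    dist, nearest = core_nn.kneighbors(X, n_neighbors=1)
    reachable = dist[:, 0] <= eps
    labels[reachable] = core_labels[nearest[reachable, 0]]
    return labels
```

The essential point is that the radius queries are issued only for the sampled candidates, while the neighbor index is built on all of `X`.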

With this implementation, we saw a more than 10x speedup on datasets where DBSCAN finished running at all. For datasets where DBSCAN did not finish, subsampled DBSCAN is able to run on datasets 100x larger.

Performance was measured on a c5d.24xlarge Amazon EC2 instance with 96 vCPUs and 192 GiB RAM. For all runs, eps = 1 and min_samples = 10. The dataset was an equal mixture of four multivariate normals. We compare the results using the adjusted Rand score and the adjusted mutual information score as clustering metrics.
[Image: benchmark results table]
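For context, the kind of benchmark data described above can be generated roughly as follows; the dimensionality, component means, and mixture scale are placeholders, since the PR does not state them, and the reported runs used much larger sample sizes:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
n_per_component, dim = 25_000, 10       # placeholders; the reported runs were far larger
means = rng.uniform(-10, 10, size=(4, dim))
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per_component, dim))
               for m in means])          # equal mixture of four multivariate normals

# Settings used in the reported runs; at the benchmark sizes this full DBSCAN
# fit is exactly the step the subsampled option is meant to accelerate.
labels = DBSCAN(eps=1, min_samples=10).fit_predict(X)
```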

See: Jang, J. and Jiang, H. "DBSCAN++: Towards fast and scalable density clustering". Proceedings of the 36th International Conference on Machine Learning, 2019.

Any other comments?

Per the November scikit-learn monthly meeting notes:
[Image: excerpt from the November scikit-learn monthly meeting notes]


@kno10 (Contributor) commented Dec 21, 2024

Note that quality differences depend a lot on the data.
If you have synthetic data (think of make_blobs, or the "equal mixture of four multivariate normals" in your case), downsampling will be just fine, but that is not a realistic use case. The more complex the data gets and the less redundancy you have, the more downsampling can hurt.

I am not really sure about a use case where you (A) are fine with working on a subsample and only extending the result to the remaining points, but (B) actually need to extend the labels to every object. Primarily, this is an explorative technique to help you understand the data, and in such cases you can just work with a manageable subsample of the data right from the start.

jenniferjang marked this pull request as draft on December 22, 2024
@jenniferjang (Author) commented:

> Note that quality differences depend a lot on the data. If you have synthetic data (think of make_blobs, or the "equal mixture of four multivariate normals" in your case), downsampling will be just fine, but that is not a realistic use case. The more complex the data gets and the less redundancy you have, the more downsampling can hurt.
>
> I am not really sure about a use case where you (A) are fine with working on a subsample and only extending the result to the remaining points, but (B) actually need to extend the labels to every object. Primarily, this is an explorative technique to help you understand the data, and in such cases you can just work with a manageable subsample of the data right from the start.

Hi Erich, thanks for your reply.

Addressing your comment:

“In such cases you can just work with a manageable subsample of the data right from the start.” I think there is a slight misunderstanding of how our subsampling works. It does not downsample the entire dataset and then cluster the sample. It subsamples only the core-sample candidates, while still using the epsilon-neighborhood graph over the whole dataset to determine which of those candidates are core samples, and then assigns the entire dataset to that smaller set of core points. It works precisely because the whole dataset is used to determine the core points.

As long as you choose samples from each cluster, you are likely to be able to detect it; redundancy is not required. If we were to run DBSCAN on a downsample only, the examples in minority clusters might not be correctly identified as core samples, and those clusters might not be detected.
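As an editorial illustration of that failure mode (this snippet is not part of the PR; the data and parameters are made up), plain downsample-then-cluster can lose a small cluster entirely:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two large blobs and one small one, well separated (hypothetical setup).
X, _ = make_blobs(n_samples=[10_000, 10_000, 200],
                  centers=[(0, 0), (10, 0), (0, 10)],
                  cluster_std=0.3, random_state=0)

# Downsample-then-cluster: after keeping 2% of ALL points, roughly 4 points of
# the small blob survive, so none of them can reach min_samples = 10 neighbors
# inside the downsample and the whole blob comes out as noise.
rng = np.random.default_rng(0)
keep = rng.random(len(X)) < 0.02
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X[keep])

# The variant proposed in this PR instead subsamples only the core-point
# candidates; their eps-neighborhoods are still counted over the full dataset,
# so the small blob's density is judged against all 200 of its points.
```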

We show in our paper that this works on 13 real (not synthetic) datasets and is robust across hyperparameters, in many cases outperforming standard DBSCAN because it is less sensitive to border points.

From our paper:

[Table image] Summary of the open-source datasets used, including dataset size (n), number of features (D), number of clusters (c), and the subsample size (m) used.

[Table image] Highest scores for each dataset and algorithm. The first row is the adjusted Rand index and the second row the adjusted mutual information. The standard error over 10 runs is given in parentheses for subsampled DBSCAN with uniform sampling. Each algorithm was tuned across a range of eps with min_samples = 10. We used subsample values of 0.1, 0.2, or 0.3. We see that DBSCAN + uniform sampling performs better than DBSCAN on 17 out of 22 metrics.

jenniferjang marked this pull request as ready for review on December 25, 2024
@jenniferjang (Author) commented:

Hi @kno10, let me know what you think! Thanks!

@kno10 (Contributor) commented Jan 3, 2025

None of these data sets (up to 60,000 instances) poses a real scalability challenge, and the larger ones (MNIST) do not cluster in a useful way with DBSCAN at all.

Can you prove any approximation guarantee for your method?

In fact, it is even a warning sign if your method outperforms exact DBSCAN by much more than a standard deviation; here I'd argue that the labels simply do not reflect clusters, and hence the evaluation is meaningless. Clustering has a huge evaluation problem, because we abuse classification labels all the time.

But again: I do not know of any actual use case for having to cluster large data sets "exactly"; standard subsampling to, e.g., 1000 instances is usually enough for explorative data analysis. So your method IMHO targets a very small niche: what applications exist where standard subsampling is not enough, the exact result is not necessary, and fast approximate neighbor search (e.g., with HNSW) is not acceptable either?

(Which is not meant to say that users might not want this: often users do not want to subsample because they fear they might lose something, and want to use clustering for classification... only to be disappointed.)

@jenniferjang (Author) commented Jan 16, 2025

> None of these data sets (up to 60,000 instances) poses a real scalability challenge, and the larger ones (MNIST) do not cluster in a useful way with DBSCAN at all.
>
> Can you prove any approximation guarantee for your method?
>
> In fact, it is even a warning sign if your method outperforms exact DBSCAN by much more than a standard deviation; here I'd argue that the labels simply do not reflect clusters, and hence the evaluation is meaningless. Clustering has a huge evaluation problem, because we abuse classification labels all the time.
>
> But again: I do not know of any actual use case for having to cluster large data sets "exactly"; standard subsampling to, e.g., 1000 instances is usually enough for explorative data analysis. So your method IMHO targets a very small niche: what applications exist where standard subsampling is not enough, the exact result is not necessary, and fast approximate neighbor search (e.g., with HNSW) is not acceptable either?
>
> (Which is not meant to say that users might not want this: often users do not want to subsample because they fear they might lose something, and want to use clustering for classification... only to be disappointed.)

In the paper, we do show that our modification of DBSCAN has theoretical guarantees similar to those of the original DBSCAN. At the last monthly scikit-learn meeting, we discussed this change and agreed that, if the implementation is simple, there should be room for it, since many people were bumping the original thread. Can we get the feature in and see how it goes?

@jenniferjang (Author) commented:

@kno10 friendly ping! Thanks!

@kno10 (Contributor) commented Feb 13, 2025

So far, your DBSCAN++ does not, in my opinion, satisfy the official requirements for inclusion in scikit-learn: "A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness." In particular, it does not appear to be used anywhere.

Instead of implementing some "obscure" approximation, the memory consumption of scikit-learn's DBSCAN could easily be reduced by following the original algorithm more closely instead of using the current bulk operation: that reduces worst-case quadratic memory to linear memory consumption. But while this issue has long been known (it is documented in the scikit-learn documentation), nobody on the scikit-learn team seems to care about unsupervised learning. The scikit-learn BIRCH implementation is still incomplete to the point of being useless, and the k-means implementation was "de-optimized" based on very poor benchmarks. The only major feature added to scikit-learn clustering was HDBSCAN (in version 1.3, 2023), literally the single major addition since 1.0. So don't expect anything to happen... I have given up on scikit-learn merging anything in unsupervised learning, and cannot recommend it in this domain.

A key improvement for users would be to get rid of poor default values in scikit-learn, which includes removing the meaningless default epsilon for DBSCAN:

#13541
#13570
#14942

Getting rid of poor defaults, choosing the more memory-efficient implementation for larger data sets, and suggesting that people use OPTICS and HDBSCAN* instead may be better than approximating DBSCAN with subsampling, IMHO, but of course your opinion may vary.
