
DBSCAN++: Run DBSCAN on 100x larger datasets, up to 100x faster in subsampling #30523


Open · wants to merge 3 commits into base: main

Conversation

@jenniferjang (Author) commented Dec 21, 2024

Reference Issues/PRs

See #17650.

What does this implement/fix? Explain your changes.

Instead of computing pairwise distances between all pairs of points to obtain the nearest-neighbor graph for DBSCAN, the subsampled option computes neighborhoods only for a fraction of points selected uniformly at random. These neighborhoods are computed only for the subsampled points, but every point in the dataset can still appear as a neighbor. As a result, we obtain only a subset of the core samples, but our research shows that clustering performance is similar.
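For illustration, here is a minimal, self-contained sketch of that idea (this is not the code in this PR; the function name and the `subsample` parameter below are placeholders), built only on the public `NearestNeighbors` API:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors


def subsampled_dbscan_sketch(X, eps=1.0, min_samples=10, subsample=0.1, seed=0):
    """Sketch of the idea: subsample core-point *candidates*, but judge
    their density against the full dataset."""
    rng = np.random.default_rng(seed)
    n = len(X)
    candidates = rng.choice(n, size=max(1, int(subsample * n)), replace=False)

    # Epsilon-neighborhoods of the candidates are queried against the FULL
    # dataset, so core status reflects the true local density.
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X[candidates], return_distance=False)
    is_core = np.array([len(nb) >= min_samples for nb in neighborhoods])
    core_idx = candidates[is_core]
    if len(core_idx) == 0:          # no core points found in the sample
        return np.full(n, -1)       # -> everything is noise

    # Core points within eps of each other form the clusters
    # (connected components of the eps-graph restricted to core points).
    core_nn = NearestNeighbors(radius=eps).fit(X[core_idx])
    core_graph = core_nn.radius_neighbors_graph(X[core_idx], mode="connectivity")
    _, core_labels = connected_components(core_graph, directed=False)

    # Every remaining point joins the cluster of its nearest core point if
    # that core point lies within eps; otherwise it is labelled noise (-1).
    labels = np.full(n, -1)
    dist, nearest = core_nn.kneighbors(X, n_neighbors=1)
    reachable = dist[:, 0] <= eps
    labels[reachable] = core_labels[nearest[reachable, 0]]
    return labels
```

The essential point is that the radius queries are issued only for the sampled candidates, while the neighbor index is built on all of `X`.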

With this implementation, we saw a more than 10x speedup on datasets where DBSCAN finished running at all. For datasets where DBSCAN did not finish, subsampled DBSCAN is able to run on datasets 100x larger.

Performance was measured on a c5d.24xlarge Amazon EC2 instance with 96 vCPUs and 192 GiB RAM. For all runs, eps = 1 and min_samples = 10. The dataset was an equal mixture of four multivariate normals. We compare the results using the adjusted Rand score and the adjusted mutual information score as clustering metrics.
[Image: benchmark results table]
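For context, the kind of benchmark data described above can be generated roughly as follows; the dimensionality, component means, and mixture scale are placeholders, since the PR does not state them, and the reported runs used much larger sample sizes:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
n_per_component, dim = 25_000, 10       # placeholders; the reported runs were far larger
means = rng.uniform(-10, 10, size=(4, dim))
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per_component, dim))
               for m in means])          # equal mixture of four multivariate normals

# Settings used in the reported runs; at the benchmark sizes this full DBSCAN
# fit is exactly the step the subsampled option is meant to accelerate.
labels = DBSCAN(eps=1, min_samples=10).fit_predict(X)
```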

See: Jang, J. and Jiang, H. "DBSCAN++: Towards fast and scalable density clustering". Proceedings of the 36th International Conference on Machine Learning, 2019.

Any other comments?

Per the November scikit-learn monthly meeting notes:
[Image: excerpt from the November scikit-learn monthly meeting notes]


@kno10 (Contributor) commented Dec 21, 2024

Note that quality differences depend a lot on the data.
If you have synthetic data (think of make_blobs, or the "equal mixture of four multivariate normals" in your case), downsampling will be just fine, but that is not a realistic use case. The more complex the data gets and the less redundancy you have, the more downsampling can hurt.

I am not really sure about a use case where you (A) are fine with working on a subsample and only extending the result to the remaining points, but (B) actually need to extend the labels to every object. Primarily, this is an explorative technique to help you understand the data, and in such cases you can just work with a manageable subsample of the data right from the start.

jenniferjang marked this pull request as draft on December 22, 2024
@jenniferjang (Author) commented:

> Note that quality differences depend a lot on the data. If you have synthetic data (think of make_blobs, or the "equal mixture of four multivariate normals" in your case), downsampling will be just fine, but that is not a realistic use case. The more complex the data gets and the less redundancy you have, the more downsampling can hurt.
>
> I am not really sure about a use case where you (A) are fine with working on a subsample and only extending the result to the remaining points, but (B) actually need to extend the labels to every object. Primarily, this is an explorative technique to help you understand the data, and in such cases you can just work with a manageable subsample of the data right from the start.

Hi Erich, thanks for your reply.

Addressing your comment:

“In such cases you can just work with a manageable subsample of the data right from the start.” I think there is a slight misunderstanding of how our subsampling works. It does not downsample the entire dataset and then cluster the sample. It subsamples only the core-sample candidates, while still using the epsilon-neighborhood graph over the whole dataset to determine which of those candidates are core samples, and then assigns the entire dataset to that smaller set of core points. It works precisely because the whole dataset is used to determine the core points.

As long as you choose samples from each cluster, you are likely to be able to detect it; redundancy is not required. If we were to run DBSCAN on a downsample only, the examples in minority clusters might not be correctly identified as core samples, and those clusters might not be detected.
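As an editorial illustration of that failure mode (this snippet is not part of the PR; the data and parameters are made up), plain downsample-then-cluster can lose a small cluster entirely:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two large blobs and one small one, well separated (hypothetical setup).
X, _ = make_blobs(n_samples=[10_000, 10_000, 200],
                  centers=[(0, 0), (10, 0), (0, 10)],
                  cluster_std=0.3, random_state=0)

# Downsample-then-cluster: after keeping 2% of ALL points, roughly 4 points of
# the small blob survive, so none of them can reach min_samples = 10 neighbors
# inside the downsample and the whole blob comes out as noise.
rng = np.random.default_rng(0)
keep = rng.random(len(X)) < 0.02
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X[keep])

# The variant proposed in this PR instead subsamples only the core-point
# candidates; their eps-neighborhoods are still counted over the full dataset,
# so the small blob's density is judged against all 200 of its points.
```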

We show in our paper that this works on 13 real (not synthetic) datasets and is robust across hyperparameters, in many cases outperforming standard DBSCAN because it is less sensitive to border points.

From our paper:

[Table image] Summary of the open-source datasets used, including dataset size (n), number of features (D), number of clusters (c), and the subsample size (m) used.

[Table image] Highest scores for each dataset and algorithm. The first row is the adjusted Rand index and the second row the adjusted mutual information. The standard error over 10 runs is given in parentheses for subsampled DBSCAN with uniform sampling. Each algorithm was tuned across a range of eps with min_samples = 10. We used subsample values of 0.1, 0.2, or 0.3. We see that DBSCAN + uniform sampling performs better than DBSCAN on 17 out of 22 metrics.

jenniferjang marked this pull request as ready for review on December 25, 2024
@jenniferjang (Author) commented:

Hi @kno10, let me know what you think! Thanks!

@kno10 (Contributor) commented Jan 3, 2025

None of these data sets (up to 60,000 instances) poses a real scalability challenge, and the larger ones (MNIST) do not cluster in a useful way with DBSCAN at all.

Can you prove any approximation guarantee for your method?

In fact, it is even a warning sign if your method outperforms exact DBSCAN by much more than a standard deviation; here I'd argue that the labels simply do not reflect clusters, and hence the evaluation is meaningless. Clustering has a huge evaluation problem, because we abuse classification labels all the time.

But again: I do not know of any actual use case for having to cluster large data sets "exactly"; standard subsampling to, e.g., 1000 instances is usually enough for explorative data analysis. So your method IMHO targets a very small niche: what applications exist where standard subsampling is not enough, the exact result is not necessary, and fast approximate neighbor search (e.g., with HNSW) is not acceptable either?

(Which is not meant to say that users might not want this: often users do not want to subsample because they fear they might lose something, and want to use clustering for classification... only to be disappointed.)

@jenniferjang (Author) commented Jan 16, 2025

> None of these data sets (up to 60,000 instances) poses a real scalability challenge, and the larger ones (MNIST) do not cluster in a useful way with DBSCAN at all.
>
> Can you prove any approximation guarantee for your method?
>
> In fact, it is even a warning sign if your method outperforms exact DBSCAN by much more than a standard deviation; here I'd argue that the labels simply do not reflect clusters, and hence the evaluation is meaningless. Clustering has a huge evaluation problem, because we abuse classification labels all the time.
>
> But again: I do not know of any actual use case for having to cluster large data sets "exactly"; standard subsampling to, e.g., 1000 instances is usually enough for explorative data analysis. So your method IMHO targets a very small niche: what applications exist where standard subsampling is not enough, the exact result is not necessary, and fast approximate neighbor search (e.g., with HNSW) is not acceptable either?
>
> (Which is not meant to say that users might not want this: often users do not want to subsample because they fear they might lose something, and want to use clustering for classification... only to be disappointed.)

In the paper, we do show that our modification of DBSCAN has theoretical guarantees similar to those of the original DBSCAN. At the last monthly scikit-learn meeting, we discussed this change and agreed that, if the implementation is simple, there should be room for it, since many people were bumping the original thread. Can we get the feature in and see how it goes?

@jenniferjang (Author) commented:

@kno10 friendly ping! Thanks!

@kno10 (Contributor) commented Feb 13, 2025

So far, your DBSCAN++ does not, in my opinion, satisfy the official requirements for inclusion in scikit-learn: "A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness." In particular, it does not appear to be used anywhere.

Instead of implementing some "obscure" approximation, the memory consumption of scikit-learn's DBSCAN could easily be reduced by following the original algorithm more closely instead of using the current bulk operation: that reduces worst-case quadratic memory to linear memory consumption. But while this issue has long been known (it is documented in the scikit-learn documentation), nobody on the scikit-learn team seems to care about unsupervised learning. The scikit-learn BIRCH implementation is still incomplete to the point of being useless, and the k-means implementation was "de-optimized" based on very poor benchmarks. The only major feature added to scikit-learn clustering was HDBSCAN (in version 1.3, 2023), literally the single major addition since 1.0. So don't expect anything to happen... I have given up on scikit-learn merging anything in unsupervised learning, and cannot recommend it in this domain.

A key improvement for users would be to get rid of poor default values in scikit-learn, which includes removing the meaningless default epsilon for DBSCAN:

#13541
#13570
#14942

Getting rid of poor defaults, choosing the more memory-efficient implementation for larger data sets, and suggesting that people use OPTICS and HDBSCAN* instead may be better than approximating DBSCAN with subsampling, IMHO, but of course your opinion may vary.
