DBSCAN++: Run DBSCAN on 100x larger datasets, up to 100x faster in subsampling #30523
Conversation
Note that quality differences depend a lot on the data. I am not really sure about a use case where you (A) are fine with working on a subsample and only extending the result to the remaining points, and (B) actually need to extend the labels to every object. Primarily, this is an explorative technique to help you understand the data; in such cases you can just work with a manageable subsample of the data right from the start.
Hi @kno10, let me know what you think! Thanks!
None of these data sets (up to 60000 instances) is really a major scalability issue. Can you prove any approximation guarantee for your method? In fact, it is even a warning sign if your method outperforms exact DBSCAN by much more than a standard deviation; here I'd argue that the labels simply do not reflect clusters, and hence the evaluation is meaningless. But again: I do not know any actual use case for having to cluster large data sets "exactly"; standard subsampling to, e.g., 1000 instances is usually enough for explorative data analysis. So your method IMHO targets a very small niche: what applications exist where standard subsampling is not enough, but the exact result is not necessary and fast approximate neighbor search (e.g., with HNSW) is not acceptable either? (Which is not to say that users might not want this: often users do not want to subsample because they fear they might lose something, and want to use clustering for classification... then only to be disappointed.)
In the paper, we do show that our modification of DBSCAN has similar theoretical guarantees to the original DBSCAN. At the last monthly scikit-learn meeting, we discussed this change and agreed that if it is simple, there should be room for it, as many people were bumping the original thread. Can we get the feature in and see how it goes?
@kno10 friendly ping! Thanks!
So far, your DBSCAN++ does not, in my opinion, satisfy the official requirements for inclusion in scikit-learn: "A rule of thumb is at least 3 years since publication, 200+ citations, and wide use and usefulness." In particular, it does not appear to be used anywhere.

Instead of implementing some "obscure" approximation, the memory consumption of scikit-learn's DBSCAN could easily be reduced by more closely following the original algorithm instead of using the current bulk operation: that reduces worst-case quadratic memory to linear memory consumption. But while this issue has long been known (it is documented in the scikit-learn documentation), nobody in the scikit-learn team seems to care about unsupervised learning. The scikit-learn BIRCH implementation is still incomplete to the point of being useless, and the k-means implementation was "de-optimized" based on very poor benchmarks. The only major feature added to scikit-learn clustering was HDBSCAN, in version 1.3 in 2023 (literally the single major addition since 1.0). So don't expect anything to happen... I've given up on scikit-learn merging anything in unsupervised learning, and cannot recommend it in this domain.

A key improvement for users would be to get rid of poor default values in scikit-learn, which includes removing the meaningless default epsilon for DBSCAN. Getting rid of poor defaults, choosing the more memory-efficient implementation for larger data sets, and suggesting that people use OPTICS and HDBSCAN* instead may be better than approximating DBSCAN with subsampling, IMHO, but of course your opinion may vary.
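For readers unfamiliar with the memory point above, here is a rough sketch of the two query styles; it is not scikit-learn's actual code, and `core_point_counts` is just an illustrative helper contrasting bulk versus per-point neighborhood queries.

```python
# Rough illustration of the memory trade-off mentioned above (not scikit-learn's code):
# materialising every eps-neighborhood up front can cost O(n^2) memory in the worst
# case, while querying one neighborhood at a time keeps peak memory linear in n.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def core_point_counts(X, eps, min_samples, bulk=True):
    nn = NearestNeighbors(radius=eps).fit(X)
    if bulk:
        # bulk style: all neighborhoods are held in memory at once
        neighborhoods = nn.radius_neighbors(X, return_distance=False)
        return sum(len(nb) >= min_samples for nb in neighborhoods)
    # per-point style: one radius query at a time, nothing large is retained
    count = 0
    for x in np.asarray(X):
        nb = nn.radius_neighbors(x.reshape(1, -1), return_distance=False)[0]
        count += len(nb) >= min_samples
    return count
```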
Reference Issues/PRs
See #17650.
What does this implement/fix? Explain your changes.
Instead of calculating the pairwise distances for all pairs of points to obtain nearest-neighbor graphs for DBSCAN, the `subsampled` option only calculates nearest-neighbor graphs for a fraction of points selected uniformly at random. The nearest-neighbor graphs are calculated only for subsampled points, but every point in the dataset may be a neighbor. This means that we will have a subset of core samples, but our research shows that clustering performance is similar.

With this implementation, we saw over 10x speedup on datasets where DBSCAN actually finished running. For datasets where DBSCAN did not finish running, subsampled DBSCAN is able to run on 100x larger datasets.
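For intuition, here is a minimal sketch of the subsampling idea built from existing scikit-learn primitives. It is not the code in this PR; the function name, the `sample_fraction` parameter, and the core-point linking step are illustrative assumptions.

```python
# Minimal sketch of the subsampling idea (not this PR's implementation):
# eps-neighborhoods are queried only for a random fraction of points, so only
# those points can become core samples; every other point is then attached to
# the cluster of a core sample within eps, or labelled noise (-1).
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors

def subsampled_dbscan(X, eps=1.0, min_samples=10, sample_fraction=0.1, random_state=0):
    rng = np.random.default_rng(random_state)
    n = len(X)
    sub = rng.choice(n, size=max(1, int(sample_fraction * n)), replace=False)

    # neighborhoods are computed only for the subsample, but against all points
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X[sub], return_distance=False)
    core = sub[np.array([len(nb) >= min_samples for nb in neighborhoods])]

    labels = np.full(n, -1)
    if len(core) == 0:
        return labels

    # link core samples that lie within eps of each other into clusters
    core_graph = NearestNeighbors(radius=eps).fit(X[core]).radius_neighbors_graph(X[core])
    _, core_labels = connected_components(core_graph, directed=False)

    # every point inherits the cluster of its nearest core sample, if within eps
    dist, idx = NearestNeighbors(n_neighbors=1).fit(X[core]).kneighbors(X)
    reachable = dist[:, 0] <= eps
    labels[reachable] = core_labels[idx[reachable, 0]]
    return labels
```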
Performance is on a c5d.24xlarge Amazon EC2 instance with 96 vCPUs and 196 GB RAM. For all runs, `eps = 1` and `min_samples = 10`. The dataset was an equal mixture of four multivariate normals. We compare the results using the clustering metrics adjusted Rand score and adjusted mutual information score.

See: Jang, J. and Jiang, H. "DBSCAN++: Towards fast and scalable density clustering". Proceedings of the 36th International Conference on Machine Learning, 2019.
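To make the evaluation setup above concrete, a hedged sketch of a comparable harness is below. The dataset size, centers, and spreads are illustrative assumptions rather than the values behind the reported numbers, and `subsampled_dbscan` refers to the sketch earlier in this description, not this PR's API.

```python
# Illustrative benchmark harness: a mixture of four Gaussians, eps=1, min_samples=10,
# with the subsampled result scored against the exact DBSCAN labels.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

# 40,000 points from four multivariate normals (size and spread assumed for illustration)
X, _ = make_blobs(n_samples=40_000, centers=4, cluster_std=0.5, random_state=0)

exact_labels = DBSCAN(eps=1, min_samples=10).fit_predict(X)
approx_labels = subsampled_dbscan(X, eps=1.0, min_samples=10, sample_fraction=0.1)

print("Adjusted Rand score         :", adjusted_rand_score(exact_labels, approx_labels))
print("Adjusted mutual information :", adjusted_mutual_info_score(exact_labels, approx_labels))
```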
Any other comments?
Per the November scikit-learn monthly meeting notes:
