I can reproduce a similar behaviour on my machine.
One reason is that `hdbscan.HDBSCAN` uses the Boruvka algorithm by default, which is not implemented in `sklearn.cluster.HDBSCAN`. There was some work a while ago to add the Boruvka algorithm in #27572. See also #26801 for more context.
Reviving the Boruvka PR and pushing it forward would definitely be a significant effort.
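For readers unfamiliar with it, Boruvka's algorithm builds a minimum spanning tree by repeatedly picking the cheapest edge leaving each connected component and merging along those edges. The sketch below is the classic graph version in plain Python, for illustration only. The function name `boruvka_mst` and the edge-list format are assumptions, and it is not the tree-accelerated variant used inside `hdbscan` or the code from the PR mentioned above.

```python
def boruvka_mst(n_vertices, edges):
    """Return (total_weight, mst_edges) for a connected, undirected graph.

    edges is a list of (weight, u, v) tuples with vertices numbered 0..n-1.
    """
    parent = list(range(n_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst_edges = []
    n_components = n_vertices
    while n_components > 1:
        # Cheapest edge leaving each component, keyed by component root.
        cheapest = {}
        for edge in edges:
            _, u, v = edge
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # both endpoints are already in the same component
            for root in (ru, rv):
                # Tuple comparison breaks weight ties consistently.
                if root not in cheapest or edge < cheapest[root]:
                    cheapest[root] = edge
        if not cheapest:
            raise ValueError("graph is not connected")
        # Merge components along the selected edges.
        for weight, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                mst_edges.append((weight, u, v))
                n_components -= 1
    return sum(w for w, _, _ in mst_edges), mst_edges


# Small example graph: a 4-cycle with one diagonal.
example_edges = [(1, 0, 1), (2, 1, 2), (3, 2, 3), (4, 0, 3), (5, 0, 2)]
total_weight, tree = boruvka_mst(4, example_edges)
```

Each round at least halves the number of components, so the loop runs a logarithmic number of times, and every round is a bulk scan over edges rather than a one-edge-at-a-time update.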
lesteve changed the title from "HDBSCAN performance issues" to "HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented)" on Jun 10, 2025
Describe the bug
When switching from scikit-learn's HDBSCAN implementation to the original one from the `hdbscan` library, I noticed that scikit-learn's implementation has much worse performance. I tried different parameters, but they don't seem to have an effect on the performance.
I created a synthetic benchmark using the `make_blobs` function. These are my results:
CPU: Ryzen 5 1600, 12 threads @ 3.6 GHz*
RAM: 32 GB DDR4
Steps/Code to Reproduce
I am starting both algorithms with `n_jobs=-1` to rule out any difference that may occur because of the default setting of `core_dist_n_jobs=4` in `hdbscan`.
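A minimal sketch of a benchmark along these lines might look as follows. The dataset size, feature count, number of centers and the helper names (`time_fit`, `run_benchmark`) are assumptions for illustration, not the values from the report. `core_dist_n_jobs=-1` is used here as the `hdbscan` counterpart of `n_jobs=-1`.

```python
import time


def time_fit(estimator, X):
    """Return the wall-clock seconds taken by estimator.fit(X)."""
    start = time.perf_counter()
    estimator.fit(X)
    return time.perf_counter() - start


def run_benchmark(n_samples=10_000, n_features=10, centers=5, random_state=0):
    """Time both HDBSCAN implementations on the same synthetic blobs.

    The default sizes are placeholders, not the ones used in the report.
    """
    # Imported here so that time_fit stays usable without these packages.
    import hdbscan
    from sklearn.cluster import HDBSCAN as SklearnHDBSCAN
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(
        n_samples=n_samples,
        n_features=n_features,
        centers=centers,
        random_state=random_state,
    )

    estimators = {
        "sklearn.cluster.HDBSCAN": SklearnHDBSCAN(n_jobs=-1),
        # hdbscan names its parallelism parameter core_dist_n_jobs.
        "hdbscan.HDBSCAN": hdbscan.HDBSCAN(core_dist_n_jobs=-1),
    }
    # Map each implementation name to its fit time in seconds.
    return {name: time_fit(est, X) for name, est in estimators.items()}
```

Calling `run_benchmark()` with increasing `n_samples` returns a dictionary of fit times per implementation, which makes it easy to see how the gap grows with dataset size.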
Expected Results
Similar performance between the algorithms from scikit-learn and the `hdbscan` library.
Actual Results
The scikit-learn implementation of `HDBSCAN` performs much worse than the original library. For example, when testing a much bigger dataset, the `hdbscan` library performs `fit` in 25 s on my hardware, while scikit-learn needs 5 minutes to perform the clustering.
Versions