Skip to content

HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) #31503

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
lkwinta opened this issue Jun 8, 2025 · 1 comment
Labels
Hard Hard level of difficulty help wanted New Feature

Comments

@lkwinta
Copy link

lkwinta commented Jun 8, 2025

Describe the bug

When switching from Sklearn HDBSCAN implementation to original one from hdbscan library, I've notice that Sklearn's implementation has much worse implementation. I've tried investigating different parameters but it doesn't seem to have an effect on the performance.

I've created synthetic benchmark using make_blobs function. And those are my results:

CPU: Ryzen 5 1600, 12 Threads@3.6Ghz*
RAM: 32GB DDR4

# dataset
X, y = make_blobs(n_samples=10000, centers=5, cluster_std=0.60, random_state=0, n_features=10)

# hdbscan params 
og_hdbscan = OGHDBSCAN(core_dist_n_jobs=-1)
sk_hdbscan = SKHDBSCAN(n_jobs=-1)

Image

  • Tested out on Google Collab with similar results

Steps/Code to Reproduce

I am starting both algorithms with n_jobs=-1 to rule out the difference that may occure because of default setting of core_dist_n_jobs=4 in hdbscan

from hdbscan import HDBSCAN as OGHDBSCAN
from sklearn.cluster import HDBSCAN as SKHDBSCAN

import matplotlib.pyplot as plt
import numpy as np
import time

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, centers=5, cluster_std=0.60, random_state=0, n_features=10)

og_hdbscan = OGHDBSCAN(core_dist_n_jobs=-1)
sk_hdbscan = SKHDBSCAN(n_jobs=-1)

RUNS = 10

def time_hdbscan(hdbscan, X, runs):
    times = []
    for _ in range(runs):
        start = time.time()
        hdbscan.fit(X)
        end = time.time()
        times.append(end - start)
    return times

times_og = time_hdbscan(og_hdbscan, X, RUNS)
times_sk = time_hdbscan(sk_hdbscan, X, RUNS)

print("Mean time OGHDBSCAN: ", np.mean(times_og))
print("Mean time SKHDBSCAN: ", np.mean(times_sk))

plt.plot(range(RUNS), times_og, label='OGHDBSCAN', marker='o')
plt.plot(range(RUNS), times_sk, label='SKHDBSCAN', marker='x')
plt.xlabel('Run')
plt.ylabel('Time (seconds)')
plt.title('HDBSCAN Timing Comparison')
plt.legend()
plt.show()

Expected Results

Similar performance between algorithms from Sklearn and hdbscan library

Actual Results

Sklearn implementation of HDBSCAN gets much worse performance than original library. For example when testing much bigger dataset, i.e.

X, y = make_blobs(n_samples=100000, centers=5, cluster_std=0.60, random_state=0, n_features=20)

hdbscan library performs fit in 25s on my hardware, while Sklearn needs 5 minutes to perform clustering.

Versions

System:
    python: 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
executable: d:\Documents\Projects\Machine-Learning-Basics\.venv\Scripts\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.6.1
          pip: None
   setuptools: 80.9.0
        numpy: 1.26.4
        scipy: 1.15.3
       Cython: None
       pandas: 2.3.0
   matplotlib: 3.10.3
       joblib: 1.5.1
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Zen

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: vcomp
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\scipy.libs\libscipy_openblas-f07f5a5d207a3a47104dca54d6d0c86a.dll
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell
@lkwinta lkwinta added Bug Needs Triage Issue requires triage labels Jun 8, 2025
@lesteve lesteve added New Feature help wanted Hard Hard level of difficulty and removed Bug Needs Triage Issue requires triage labels Jun 10, 2025
@lesteve
Copy link
Member

lesteve commented Jun 10, 2025

I can reproduce a similar behaviour on my machine.

One of the reason is that hdbscan.HDBSCAN uses Boruvka algorithm by default which is not implemented in sklearn.HDBSCAN. There was some work some time ago to add Boruvka algorithm in #27572. See also #26801 for more context.

This would definitely be a significant effort to revive the Boruvka PR and push it forward.

@lesteve lesteve changed the title HDBSCAN performance issues HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) Jun 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Hard Hard level of difficulty help wanted New Feature
Projects
None yet
Development

No branches or pull requests

2 participants