HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) #31503

lkwinta · 2025-06-08T14:53:52Z

Describe the bug

When switching from Sklearn HDBSCAN implementation to original one from hdbscan library, I've notice that Sklearn's implementation has much worse implementation. I've tried investigating different parameters but it doesn't seem to have an effect on the performance.

I've created synthetic benchmark using make_blobs function. And those are my results:

CPU: Ryzen 5 1600, 12 Threads@3.6Ghz*
RAM: 32GB DDR4

# dataset
X, y = make_blobs(n_samples=10000, centers=5, cluster_std=0.60, random_state=0, n_features=10)

# hdbscan params 
og_hdbscan = OGHDBSCAN(core_dist_n_jobs=-1)
sk_hdbscan = SKHDBSCAN(n_jobs=-1)

Tested out on Google Collab with similar results

Steps/Code to Reproduce

I am starting both algorithms with n_jobs=-1 to rule out the difference that may occure because of default setting of core_dist_n_jobs=4 in hdbscan

from hdbscan import HDBSCAN as OGHDBSCAN
from sklearn.cluster import HDBSCAN as SKHDBSCAN

import matplotlib.pyplot as plt
import numpy as np
import time

from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, centers=5, cluster_std=0.60, random_state=0, n_features=10)

og_hdbscan = OGHDBSCAN(core_dist_n_jobs=-1)
sk_hdbscan = SKHDBSCAN(n_jobs=-1)

RUNS = 10

def time_hdbscan(hdbscan, X, runs):
    times = []
    for _ in range(runs):
        start = time.time()
        hdbscan.fit(X)
        end = time.time()
        times.append(end - start)
    return times

times_og = time_hdbscan(og_hdbscan, X, RUNS)
times_sk = time_hdbscan(sk_hdbscan, X, RUNS)

print("Mean time OGHDBSCAN: ", np.mean(times_og))
print("Mean time SKHDBSCAN: ", np.mean(times_sk))

plt.plot(range(RUNS), times_og, label='OGHDBSCAN', marker='o')
plt.plot(range(RUNS), times_sk, label='SKHDBSCAN', marker='x')
plt.xlabel('Run')
plt.ylabel('Time (seconds)')
plt.title('HDBSCAN Timing Comparison')
plt.legend()
plt.show()

Expected Results

Similar performance between algorithms from Sklearn and hdbscan library

Actual Results

Sklearn implementation of HDBSCAN gets much worse performance than original library. For example when testing much bigger dataset, i.e.

X, y = make_blobs(n_samples=100000, centers=5, cluster_std=0.60, random_state=0, n_features=20)

hdbscan library performs fit in 25s on my hardware, while Sklearn needs 5 minutes to perform clustering.

Versions

System:
    python: 3.11.9 (tags/v3.11.9:de54cf5, Apr  2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
executable: d:\Documents\Projects\Machine-Learning-Basics\.venv\Scripts\python.exe
   machine: Windows-10-10.0.19045-SP0

Python dependencies:
      sklearn: 1.6.1
          pip: None
   setuptools: 80.9.0
        numpy: 1.26.4
        scipy: 1.15.3
       Cython: None
       pandas: 2.3.0
   matplotlib: 3.10.3
       joblib: 1.5.1
threadpoolctl: 3.6.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libopenblas
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\numpy.libs\libopenblas64__v0.3.23-293-gc2f4bdbb-gcc_10_3_0-2bde3a66a51006b2b53eb373ff767a3f.dll
        version: 0.3.23.dev
threading_layer: pthreads
   architecture: Zen

       user_api: openmp
   internal_api: openmp
    num_threads: 12
         prefix: vcomp
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\sklearn\.libs\vcomp140.dll
        version: None

       user_api: blas
   internal_api: openblas
    num_threads: 12
         prefix: libscipy_openblas
       filepath: D:\Documents\Projects\Machine-Learning-Basics\.venv\Lib\site-packages\scipy.libs\libscipy_openblas-f07f5a5d207a3a47104dca54d6d0c86a.dll
        version: 0.3.28
threading_layer: pthreads
   architecture: Haswell

The text was updated successfully, but these errors were encountered:

lesteve · 2025-06-10T08:36:56Z

I can reproduce a similar behaviour on my machine.

One of the reason is that hdbscan.HDBSCAN uses Boruvka algorithm by default which is not implemented in sklearn.HDBSCAN. There was some work some time ago to add Boruvka algorithm in #27572. See also #26801 for more context.

This would definitely be a significant effort to revive the Boruvka PR and push it forward.

lkwinta added Bug Needs Triage Issue requires triage labels Jun 8, 2025

lesteve added New Feature help wanted Hard Hard level of difficulty and removed Bug Needs Triage Issue requires triage labels Jun 10, 2025

lesteve changed the title ~~HDBSCAN performance issues~~ HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) Jun 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) #31503

HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) #31503

lkwinta commented Jun 8, 2025 •

edited

Loading

lesteve commented Jun 10, 2025 •

edited

Loading

Uh oh!

Uh oh!

HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) #31503

HDBSCAN performance issues compared to original hdbscan implementation (likely because Boruvka algorithm is not implemented) #31503

Comments

lkwinta commented Jun 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

lesteve commented Jun 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lkwinta commented Jun 8, 2025 •

edited

Loading

lesteve commented Jun 10, 2025 •

edited

Loading