DBSCAN always triggers and EfficiencyWarning #31030

luispedro · 2025-03-19T20:51:03Z

Describe the bug

Calling dbscan always triggers an efficiency warning. There is no apparent way to either call it correctly or disable the warning.

This was originally reported as an issue in SemiBin, which uses DBSCAN under the hood: BigDataBiology/SemiBin#175

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import dbscan
from sklearn.neighbors import kneighbors_graph, sort_graph_by_row_values

f = np.random.randn(10_000, 240)
dist_matrix = kneighbors_graph(
    f,
    n_neighbors=200,
    mode='distance',
    p=2,
    n_jobs=3)

_, labels = dbscan(dist_matrix,
        eps=0.1, min_samples=5, n_jobs=4, metric='precomputed')


dist_matrix = sort_graph_by_row_values(dist_matrix)
_, labels = dbscan(dist_matrix,
        eps=0.1, min_samples=5, n_jobs=4, metric='precomputed')

Expected Results

No warning, at least in second call

Actual Results

/home/luispedro/.mambaforge/envs/py3.11/lib/python3.11/site-packages/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.
  warnings.warn(
/home/luispedro/.mambaforge/envs/py3.11/lib/python3.11/site-packages/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.
  warnings.warn(

Versions

I tested on the current main branch, 5cdbbf15e3fade7cc2462ef66dc4ea0f37f390e3, but it has been going on for a while (see original SemiBin report from September 2024):


System:
    python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
executable: /home/luispedro/.mambaforge/envs/py3.11/bin/python3.11
   machine: Linux-6.8.0-55-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.7.dev0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/luispedro/.mambaforge/envs/py3.11/lib/libopenblasp-r0.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /home/luispedro/.mambaforge/envs/py3.11/lib/libgomp.so.1.0.0
        version: None

The text was updated successfully, but these errors were encountered:

adrinjalali · 2025-03-21T17:28:36Z

So this seems to be the culprit:

scikit-learn/sklearn/cluster/_dbscan.py

Line 408 in 8951184

X.setdiag(X.diagonal())

Before this line, _is_sorted_by_data(X) is True, and after this line, it becomes False, and hence the warning.

I'm not really sure how to fix this. Maybe @adam2392 has an idea here?

adam2392 · 2025-03-21T21:35:52Z

If there are 0's in the diagonal, and X.setdiag(X.diagonal()) puts the 0's somewhere where they weren't before, doesn't that imply the sparse data structure is not sorted by data anymore?

disclaimer: I have not looked into this explicitly, but this is just my first read intuition.

yuwei-1 · 2025-03-28T22:43:56Z

/take

adrinjalali · 2025-03-29T09:19:45Z

@yuwei-1 please leave a comment with what you understand the issue is and how you intend to solve it before tagging an issue as taken.

Also, this is not a good first issue. You might want to start with an easier issue.

yuwei-1 · 2025-03-29T10:05:30Z

Apologies, I wasn't aware of this protocol. I had reproduced the error yesterday, and tracked the problem down to the same line. I am not entirely sure why this issue is occurring - I assume it is something to do with how sparse matrices are stored - and when we concretely store the diagonal, this changes the ordering of the data. I was intending to solve it by looking into how the scipy csr matrices are stored, the calculations inside _is_sorted_by_data, and what changes before and after X.setdiag(X.diagonal()).

adrinjalali · 2025-03-29T12:17:13Z

That sounds great. Thanks. Let us know what you find 😊

yuwei-1 · 2025-03-29T13:07:40Z

Hi @adrinjalali, I think I have found the issue, but I do not know how to go about resolving. The TLDR is that by concretising the diagonal values, I do not believe there is any guarantee in the general case that we can still preserve the data ordering in a per-row basis (which is what _is_sorted_by_data is checking for).

I went through a particular worked example below:

from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data


docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in docs:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))


sparse_array = csr_array((data, indices, indptr), dtype=int)
print("original array: ", sparse_array.toarray())
print("(data, indices, indptr) for original: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))

sparse_array.setdiag(sparse_array.diagonal())
print("concretised array: ", sparse_array.toarray())
print("(data, indices, indptr) for concretised: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))


output:

original array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for original:  [1 1 1 1 1 1] [0 1 0 2 3 1] [0 3 6]
is sorted by data:  True
concretised array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for concretised:  [2 1 1 1 1] [0 1 1 2 3] [0 2 5]
is sorted by data:  False

From this, we see that concretising the diagonal essentially forces the values on the diagonal ([2, 1]) to be stored explicitly in the graph.data array (instead of before, where the 2 was stored in two separate elements in the data array. This is a problem because it violates the condition inside _is_sorted_by_data, where data on the same row should be equal or ascending in size.

I've checked and the setdiag() line does impact tests passing so it does impact the results of DBSCAN. I'm not exactly sure how concretisation affects the algorithm (as the underlying array should be the same) but we could explore ways of achieving the same thing without having to concretise the diagonal, perhaps? The alternative could be to simply make a more detailed warning for this particular case.

adrinjalali · 2025-04-10T15:13:26Z

I think figuring out why we need to set the diagonal, and how it affects the algorithm is important. You might need to dig deep into the implementation to figure that out, but it'd be very much appreciated.

yuwei-1 · 2025-04-19T13:13:48Z

Agreed. Will look into this over the next week and let you know what I find!

luispedro added Bug Needs Triage Issue requires triage labels Mar 19, 2025

luispedro mentioned this issue Mar 19, 2025

Warnings when binning using long reads. BigDataBiology/SemiBin#175

Open

adrinjalali removed the Needs Triage Issue requires triage label Mar 21, 2025

adrinjalali added help wanted Needs Investigation Issue requires investigation labels Mar 24, 2025

github-actions bot assigned yuwei-1 Mar 28, 2025

github-actions bot removed the help wanted label Mar 28, 2025

adrinjalali added the help wanted label Mar 29, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DBSCAN always triggers and EfficiencyWarning #31030

DBSCAN always triggers and EfficiencyWarning #31030

luispedro commented Mar 19, 2025

adrinjalali commented Mar 21, 2025

adam2392 commented Mar 21, 2025

yuwei-1 commented Mar 28, 2025

adrinjalali commented Mar 29, 2025 •

edited

Loading

yuwei-1 commented Mar 29, 2025 •

edited

Loading

adrinjalali commented Mar 29, 2025

yuwei-1 commented Mar 29, 2025 •

edited

Loading

adrinjalali commented Apr 10, 2025

yuwei-1 commented Apr 19, 2025

DBSCAN always triggers and EfficiencyWarning #31030

DBSCAN always triggers and EfficiencyWarning #31030

Comments

luispedro commented Mar 19, 2025

Describe the bug

Steps/Code to Reproduce

Expected Results

Actual Results

Versions

adrinjalali commented Mar 21, 2025

adam2392 commented Mar 21, 2025

yuwei-1 commented Mar 28, 2025

adrinjalali commented Mar 29, 2025 • edited Loading

yuwei-1 commented Mar 29, 2025 • edited Loading

adrinjalali commented Mar 29, 2025

yuwei-1 commented Mar 29, 2025 • edited Loading

adrinjalali commented Apr 10, 2025

yuwei-1 commented Apr 19, 2025

adrinjalali commented Mar 29, 2025 •

edited

Loading

yuwei-1 commented Mar 29, 2025 •

edited

Loading

yuwei-1 commented Mar 29, 2025 •

edited

Loading