DBSCAN always triggers an EfficiencyWarning #31030

Open
luispedro opened this issue Mar 19, 2025 · 15 comments
Labels: Bug, help wanted, Needs Investigation (Issue requires investigation)

Comments

@luispedro (Contributor)

Describe the bug

Calling dbscan always triggers an efficiency warning. There is no apparent way to either call it correctly or disable the warning.
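(For what it's worth, the warning can at least be muted at the call site with Python's standard warnings machinery, though that only hides it rather than avoiding the underlying re-sort. A self-contained sketch follows; the `EfficiencyWarning` class and `dbscan_like_call` below are stand-ins so the snippet runs without scikit-learn installed, and in real code the category to filter is `sklearn.exceptions.EfficiencyWarning`.)

```python
import warnings

# Stand-in for sklearn.exceptions.EfficiencyWarning so this sketch is
# self-contained; in real code, import the actual class from sklearn.exceptions.
class EfficiencyWarning(UserWarning):
    pass

def dbscan_like_call():
    # Mimics a library call that emits the warning as a side effect.
    warnings.warn(
        "Precomputed sparse input was not sorted by row values.",
        EfficiencyWarning,
    )
    return "labels"

with warnings.catch_warnings():
    # Filter only this category, and only within the with-block.
    warnings.simplefilter("ignore", category=EfficiencyWarning)
    labels = dbscan_like_call()

print(labels)  # → labels
```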

This was originally reported as an issue in SemiBin, which uses DBSCAN under the hood: BigDataBiology/SemiBin#175

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import dbscan
from sklearn.neighbors import kneighbors_graph, sort_graph_by_row_values

f = np.random.randn(10_000, 240)
dist_matrix = kneighbors_graph(
    f,
    n_neighbors=200,
    mode='distance',
    p=2,
    n_jobs=3)

_, labels = dbscan(dist_matrix,
        eps=0.1, min_samples=5, n_jobs=4, metric='precomputed')


dist_matrix = sort_graph_by_row_values(dist_matrix)
_, labels = dbscan(dist_matrix,
        eps=0.1, min_samples=5, n_jobs=4, metric='precomputed')

Expected Results

No warning, at least in the second call.

Actual Results

/home/luispedro/.mambaforge/envs/py3.11/lib/python3.11/site-packages/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.
  warnings.warn(
/home/luispedro/.mambaforge/envs/py3.11/lib/python3.11/site-packages/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.
  warnings.warn(

Versions

I tested on the current main branch, 5cdbbf15e3fade7cc2462ef66dc4ea0f37f390e3, but it has been going on for a while (see original SemiBin report from September 2024):


System:
    python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
executable: /home/luispedro/.mambaforge/envs/py3.11/bin/python3.11
   machine: Linux-6.8.0-55-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.7.dev0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/luispedro/.mambaforge/envs/py3.11/lib/libopenblasp-r0.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /home/luispedro/.mambaforge/envs/py3.11/lib/libgomp.so.1.0.0
        version: None
@adrinjalali (Member)

So this seems to be the culprit:

X.setdiag(X.diagonal())

Before this line, _is_sorted_by_data(X) is True, and after this line, it becomes False, and hence the warning.
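The effect is easy to reproduce on a tiny matrix. In the sketch below, `is_sorted_by_data` is a local stand-in for the private `sklearn.neighbors._base._is_sorted_by_data` check, on the assumption that it verifies the same property: within each CSR row, the stored values must be non-decreasing.

```python
import numpy as np
from scipy.sparse import csr_array

def is_sorted_by_data(graph):
    # Stand-in for sklearn's private check: within each CSR row,
    # the stored values must be non-decreasing.
    return all(
        np.all(np.diff(graph.data[start:stop]) >= 0)
        for start, stop in zip(graph.indptr[:-1], graph.indptr[1:])
    )

X = csr_array([[0.0, 2.0], [3.0, 0.0]])
print(is_sorted_by_data(X))  # True: each row stores a single value

X.setdiag(X.diagonal())      # materialises explicit zeros on the diagonal
print(is_sorted_by_data(X))  # False: the new zeros sit after larger values
```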

I'm not really sure how to fix this. Maybe @adam2392 has an idea here?

adrinjalali removed the Needs Triage (Issue requires triage) label on Mar 21, 2025
@adam2392 (Member)

If there are 0's in the diagonal, and X.setdiag(X.diagonal()) puts the 0's somewhere where they weren't before, doesn't that imply the sparse data structure is not sorted by data anymore?

disclaimer: I have not looked into this explicitly, but this is just my first read intuition.

adrinjalali added the help wanted and Needs Investigation (Issue requires investigation) labels on Mar 24, 2025
@yuwei-1

yuwei-1 commented Mar 28, 2025

/take

@adrinjalali (Member)

adrinjalali commented Mar 29, 2025

@yuwei-1 please leave a comment with what you understand the issue is and how you intend to solve it before tagging an issue as taken.

Also, this is not a good first issue. You might want to start with an easier issue.

@yuwei-1

yuwei-1 commented Mar 29, 2025

Apologies, I wasn't aware of this protocol. I had reproduced the error yesterday, and tracked the problem down to the same line. I am not entirely sure why this issue is occurring - I assume it is something to do with how sparse matrices are stored - and when we concretely store the diagonal, this changes the ordering of the data. I was intending to solve it by looking into how the scipy csr matrices are stored, the calculations inside _is_sorted_by_data, and what changes before and after X.setdiag(X.diagonal()).

@adrinjalali (Member)

That sounds great. Thanks. Let us know what you find 😊

@yuwei-1

yuwei-1 commented Mar 29, 2025

Hi @adrinjalali, I think I have found the issue, but I do not know how to go about resolving it. The TL;DR is that by concretising the diagonal values, I do not believe there is any guarantee in the general case that we can still preserve the data ordering on a per-row basis (which is what _is_sorted_by_data checks for).

I went through a particular worked example below:

from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data


docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in docs:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))


sparse_array = csr_array((data, indices, indptr), dtype=int)
print("original array: ", sparse_array.toarray())
print("(data, indices, indptr) for original: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))

sparse_array.setdiag(sparse_array.diagonal())
print("concretised array: ", sparse_array.toarray())
print("(data, indices, indptr) for concretised: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))


output:

original array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for original:  [1 1 1 1 1 1] [0 1 0 2 3 1] [0 3 6]
is sorted by data:  True
concretised array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for concretised:  [2 1 1 1 1] [0 1 1 2 3] [0 2 5]
is sorted by data:  False

From this, we see that concretising the diagonal essentially forces the values on the diagonal ([2, 1]) to be stored explicitly in the graph.data array (instead of before, where the 2 was stored as two separate elements in the data array). This is a problem because it violates the condition inside _is_sorted_by_data, where data in the same row should be equal or ascending in size.

I've checked, and the setdiag() line does affect whether tests pass, so it does impact the results of DBSCAN. I'm not exactly sure how concretisation affects the algorithm (as the underlying array should be the same), but perhaps we could explore ways of achieving the same thing without having to concretise the diagonal? The alternative would be to simply emit a more detailed warning for this particular case.

@adrinjalali (Member)

I think figuring out why we need to set the diagonal, and how it affects the algorithm is important. You might need to dig deep into the implementation to figure that out, but it'd be very much appreciated.

@yuwei-1

yuwei-1 commented Apr 19, 2025

Agreed. Will look into this over the next week and let you know what I find!

@Luis-Varona (Contributor)

Luis-Varona commented May 8, 2025

@adrinjalali @luispedro, would adding X = sort_graph_by_row_values(X, warn_when_not_sorted=False) not be the simplest fix? I added that to my local fork and built sklearn in dev, and I no longer got the warning. After all, the original warning was

/Users/luismbvarona/GitHub/scikit-learn/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.

Perhaps I'm missing some obvious reason to avoid simply sorting, though… As an alternative to sorting, we can also implement the below solution to replicate the logic from actually setting the diagonal of the X matrix:

-        if self.metric == "precomputed" and sparse.issparse(X):
-            # set the diagonal to explicit values, as a point is its own neighbor
-            X = X.copy()  # copy to avoid in-place modification
-            with warnings.catch_warnings():
-                warnings.simplefilter("ignore", sparse.SparseEfficiencyWarning)
-                X.setdiag(X.diagonal())
 
         neighbors_model = NearestNeighbors(
             radius=self.eps,
             algorithm=self.algorithm,
             leaf_size=self.leaf_size,
             metric=self.metric,
             metric_params=self.metric_params,
             p=self.p,
             n_jobs=self.n_jobs,
         )
         neighbors_model.fit(X)
         # This has worst case O(n^2) memory complexity
         neighborhoods = neighbors_model.radius_neighbors(X, return_distance=False)
 
+        if self.metric == "precomputed" and sparse.issparse(X):
+            # Each point is its own neighbor; update neighborhoods accordingly
+            for i, neighborhood in enumerate(neighborhoods):
+                if i not in neighborhoods[i]:
+                    neighborhoods[i] = np.append(neighborhood, i)

Again, I re-ran the initial example provided by @luispedro on my machine with this change, and it no longer showed the warning. (This approach might actually be better than the first just from the perspective of speed/allocations.) Do let me know if there's some flaw in my logic, though.
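The post-fit update in the diff above can be illustrated on toy data (the neighborhoods list below is hypothetical, mimicking the shape of the output of radius_neighbors with return_distance=False):

```python
import numpy as np

# Hypothetical neighborhoods: one array of neighbor indices per point.
# Point 1's array does not contain the point itself.
neighborhoods = [np.array([0, 2]), np.array([0]), np.array([2, 0])]

# Each point is its own neighbor; append any missing self-indices.
for i, neighborhood in enumerate(neighborhoods):
    if i not in neighborhood:
        neighborhoods[i] = np.append(neighborhood, i)

print([nb.tolist() for nb in neighborhoods])  # → [[0, 2], [0, 1], [2, 0]]
```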

Luis-Varona added a commit to Luis-Varona/scikit-learn that referenced this issue May 8, 2025
As per scikit-learn#31030, DBSCAN consistently triggers efficiency warnings due to explicitly setting the diagonal of the X matrix in fitting neighborhoods. This stems from not sorting the precomputed sparse matrix by row values. Here, we instead update the neighborhoods variable after the initial fitting to avoid this.

It may also be possible to simply add X = sort_graph_by_row_values(X, warn_when_not_sorted=False) after the original code's X.setdiag(X.diagonal()), but (1) this way seems more efficient and (2) I am not sure if this reordering would potentially affect the data in an undesired manner.
@adam2392 (Member)

adam2392 commented May 8, 2025

If there's an efficiency warning but we turn off the warning, then the process is still "inefficient" tho regardless no?

I could be wrong. Haven't followed too heavily.

@Luis-Varona (Contributor)

Luis-Varona commented May 8, 2025

If there's an efficiency warning but we turn off the warning, then the process is still "inefficient" tho regardless no?

I could be wrong. Haven't followed too heavily.

The original EfficiencyWarning derives from the sparse X matrix no longer being sorted by row value after explicitly setting the diagonal entries. The first fix I proposed was to sort the X matrix by row value after setting the entries but before fitting the neighborhoods, although it is true that this may introduce unnecessary overhead (not 100% sure yet). (To clarify: turning off the warning during the sorting affected a different warning from the one we initially got; it is potentially inefficient for a different reason altogether.)

Hence, my second proposed fix (the one I included in #31337) avoids setting the diagonal entries of the sparse matrix at all in favor of updating the neighborhoods data after the model fitting. Hopefully this is more efficient; your thoughts, @adam2392?

@EngineerDanny (Contributor)

EngineerDanny commented May 8, 2025

Hi @adrinjalali and @yuwei-1, I have also looked into this issue.
I will break it down and outline possible next steps.

Firstly, we do need the X.setdiag(X.diagonal()) line because csr arrays only store non-zero elements explicitly. If diagonal elements are missing (which is common in distance matrices), DBSCAN adds them but also changes the internal structure in the process.
There are documented issues with DBSCAN when diagonal elements are missing: #8306 and #439.
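The implicit-zeros point can be seen directly on a toy distance matrix; note that for scipy sparse containers, nnz counts stored entries, including explicit zeros:

```python
from scipy.sparse import csr_array

# Toy symmetric "distance matrix": the diagonal zeros are implicit,
# i.e. not stored in the CSR data/indices arrays at all.
D = csr_array([[0.0, 1.0], [1.0, 0.0]])
nnz_before = D.nnz           # 2: only the off-diagonal distances are stored

D.setdiag(D.diagonal())      # materialise the diagonal as explicit zeros
nnz_after = D.nnz            # 4: the zeros are now stored entries

print(nnz_before, nnz_after)  # → 2 4
```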

So this means we must re-sort during or after the change to the matrix's internal structure. That is the most efficient way to fix the warning, and it has to be an internal change.

Proposed solution (I agree with @Luis-Varona):

      X.setdiag(X.diagonal())
      # Re-sort rows so they are monotone by value again; this sorts the
      # matrix only if it is not already sorted.
      from sklearn.neighbors import sort_graph_by_row_values
      sort_graph_by_row_values(X, copy=False, warn_when_not_sorted=False)

Observe the structure of the matrix at each point.

from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data
from sklearn.neighbors import sort_graph_by_row_values

A = csr_array([[0,2],[3,0]], dtype=float)
print(A)
print(_is_sorted_by_data(A))        # True

A.setdiag(A.diagonal())             # materialise zeros
print(A)
print(_is_sorted_by_data(A))        # False

# sort data
sort_graph_by_row_values(A, copy=False, warn_when_not_sorted=False)
print(A)
print(_is_sorted_by_data(A))        # True

Output

<Compressed Sparse Row sparse array of dtype 'float64'
        with 2 stored elements and shape (2, 2)>
  Coords        Values
  (0, 1)        2.0
  (1, 0)        3.0
True
<Compressed Sparse Row sparse array of dtype 'float64'
        with 4 stored elements and shape (2, 2)>
  Coords        Values
  (0, 1)        2.0
  (0, 0)        0.0
  (1, 0)        3.0
  (1, 1)        0.0
False
<Compressed Sparse Row sparse array of dtype 'float64'
        with 4 stored elements and shape (2, 2)>
  Coords        Values
  (0, 0)        0.0
  (0, 1)        2.0
  (1, 1)        0.0
  (1, 0)        3.0
True

@Luis-Varona (Contributor)

@EngineerDanny, my ultimate suggestion is actually not to call sort_graph_by_row_values but rather to do away with explicit entry setting altogether and modify the neighborhoods variable after model fitting. See #31337 for more details. Would appreciate your feedback 🙂

@yuwei-1

yuwei-1 commented May 8, 2025

From an initial look, it seems both solutions fix the warning but I would be inclined to choose your second proposal @Luis-Varona since the first just seems like a quick fix (which could potentially be inefficient). Could you explain what exactly it is doing?
