DBSCAN always triggers an EfficiencyWarning #31030

Open
luispedro opened this issue Mar 19, 2025 · 15 comments
Labels: Bug, help wanted, Needs Investigation (Issue requires investigation)

Comments

@luispedro (Contributor)

Describe the bug

Calling dbscan always triggers an efficiency warning. There is no apparent way to either call it correctly or disable the warning.
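(For what it's worth, the warning can at least be muted at the call site with Python's standard warnings machinery, though that only hides it rather than avoiding the underlying re-sort. A self-contained sketch follows; the `EfficiencyWarning` class and `dbscan_like_call` below are stand-ins so the snippet runs without scikit-learn installed, and in real code the category to filter is `sklearn.exceptions.EfficiencyWarning`.)

```python
import warnings

# Stand-in for sklearn.exceptions.EfficiencyWarning so this sketch is
# self-contained; in real code, import the actual class from sklearn.exceptions.
class EfficiencyWarning(UserWarning):
    pass

def dbscan_like_call():
    # Mimics a library call that emits the warning as a side effect.
    warnings.warn(
        "Precomputed sparse input was not sorted by row values.",
        EfficiencyWarning,
    )
    return "labels"

with warnings.catch_warnings():
    # Filter only this category, and only within the with-block.
    warnings.simplefilter("ignore", category=EfficiencyWarning)
    labels = dbscan_like_call()

print(labels)  # → labels
```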

This was originally reported as an issue in SemiBin, which uses DBSCAN under the hood: BigDataBiology/SemiBin#175

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import dbscan
from sklearn.neighbors import kneighbors_graph, sort_graph_by_row_values

f = np.random.randn(10_000, 240)
dist_matrix = kneighbors_graph(
    f,
    n_neighbors=200,
    mode='distance',
    p=2,
    n_jobs=3)

_, labels = dbscan(dist_matrix,
        eps=0.1, min_samples=5, n_jobs=4, metric='precomputed')


dist_matrix = sort_graph_by_row_values(dist_matrix)
_, labels = dbscan(dist_matrix,
        eps=0.1, min_samples=5, n_jobs=4, metric='precomputed')

Expected Results

No warning, at least in the second call.

Actual Results

/home/luispedro/.mambaforge/envs/py3.11/lib/python3.11/site-packages/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.
  warnings.warn(
/home/luispedro/.mambaforge/envs/py3.11/lib/python3.11/site-packages/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.
  warnings.warn(

Versions

I tested on the current main branch, 5cdbbf15e3fade7cc2462ef66dc4ea0f37f390e3, but it has been going on for a while (see original SemiBin report from September 2024):


System:
    python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
executable: /home/luispedro/.mambaforge/envs/py3.11/bin/python3.11
   machine: Linux-6.8.0-55-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.7.dev0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/luispedro/.mambaforge/envs/py3.11/lib/libopenblasp-r0.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /home/luispedro/.mambaforge/envs/py3.11/lib/libgomp.so.1.0.0
        version: None
@adrinjalali (Member)

So this seems to be the culprit:

X.setdiag(X.diagonal())

Before this line, _is_sorted_by_data(X) is True, and after this line, it becomes False, and hence the warning.
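The effect is easy to reproduce on a tiny matrix. In the sketch below, `is_sorted_by_data` is a local stand-in for the private `sklearn.neighbors._base._is_sorted_by_data` check, on the assumption that it verifies the same property: within each CSR row, the stored values must be non-decreasing.

```python
import numpy as np
from scipy.sparse import csr_array

def is_sorted_by_data(graph):
    # Stand-in for sklearn's private check: within each CSR row,
    # the stored values must be non-decreasing.
    return all(
        np.all(np.diff(graph.data[start:stop]) >= 0)
        for start, stop in zip(graph.indptr[:-1], graph.indptr[1:])
    )

X = csr_array([[0.0, 2.0], [3.0, 0.0]])
print(is_sorted_by_data(X))  # True: each row stores a single value

X.setdiag(X.diagonal())      # materialises explicit zeros on the diagonal
print(is_sorted_by_data(X))  # False: the new zeros sit after larger values
```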

I'm not really sure how to fix this. Maybe @adam2392 has an idea here?

adrinjalali removed the Needs Triage (Issue requires triage) label on Mar 21, 2025
@adam2392 (Member)

If there are 0's in the diagonal, and X.setdiag(X.diagonal()) puts the 0's somewhere where they weren't before, doesn't that imply the sparse data structure is not sorted by data anymore?

disclaimer: I have not looked into this explicitly, but this is just my first read intuition.

adrinjalali added the help wanted and Needs Investigation (Issue requires investigation) labels on Mar 24, 2025
@yuwei-1

yuwei-1 commented Mar 28, 2025

/take

@adrinjalali (Member)

adrinjalali commented Mar 29, 2025

@yuwei-1 please leave a comment with what you understand the issue is and how you intend to solve it before tagging an issue as taken.

Also, this is not a good first issue. You might want to start with an easier issue.

@yuwei-1

yuwei-1 commented Mar 29, 2025

Apologies, I wasn't aware of this protocol. I had reproduced the error yesterday, and tracked the problem down to the same line. I am not entirely sure why this issue is occurring - I assume it is something to do with how sparse matrices are stored - and when we concretely store the diagonal, this changes the ordering of the data. I was intending to solve it by looking into how the scipy csr matrices are stored, the calculations inside _is_sorted_by_data, and what changes before and after X.setdiag(X.diagonal()).

@adrinjalali (Member)

That sounds great. Thanks. Let us know what you find 😊

@yuwei-1

yuwei-1 commented Mar 29, 2025

Hi @adrinjalali, I think I have found the issue, but I do not know how to go about resolving it. The TL;DR is that by concretising the diagonal values, I do not believe there is any guarantee in the general case that we can still preserve the data ordering on a per-row basis (which is what _is_sorted_by_data checks for).

I went through a particular worked example below:

from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data


docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in docs:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))


sparse_array = csr_array((data, indices, indptr), dtype=int)
print("original array: ", sparse_array.toarray())
print("(data, indices, indptr) for original: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))

sparse_array.setdiag(sparse_array.diagonal())
print("concretised array: ", sparse_array.toarray())
print("(data, indices, indptr) for concretised: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))


output:

original array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for original:  [1 1 1 1 1 1] [0 1 0 2 3 1] [0 3 6]
is sorted by data:  True
concretised array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for concretised:  [2 1 1 1 1] [0 1 1 2 3] [0 2 5]
is sorted by data:  False

From this, we see that concretising the diagonal essentially forces the values on the diagonal ([2, 1]) to be stored explicitly in the graph.data array (instead of before, where the 2 was stored as two separate elements in the data array). This is a problem because it violates the condition inside _is_sorted_by_data, where data in the same row should be equal or ascending in size.

I've checked, and the setdiag() line does affect whether tests pass, so it does impact the results of DBSCAN. I'm not exactly sure how concretisation affects the algorithm (as the underlying array should be the same), but perhaps we could explore ways of achieving the same thing without having to concretise the diagonal? The alternative would be to simply emit a more detailed warning for this particular case.

@adrinjalali (Member)

I think figuring out why we need to set the diagonal, and how it affects the algorithm is important. You might need to dig deep into the implementation to figure that out, but it'd be very much appreciated.

@yuwei-1

yuwei-1 commented Apr 19, 2025

Agreed. Will look into this over the next week and let you know what I find!

@Luis-Varona (Contributor)

Luis-Varona commented May 8, 2025

@adrinjalali @luispedro, would adding X = sort_graph_by_row_values(X, warn_when_not_sorted=False) not be the simplest fix? I added that to my local fork and built sklearn in dev, and I no longer got the warning. After all, the original warning was

/Users/luismbvarona/GitHub/scikit-learn/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.

Perhaps I'm missing some obvious reason to avoid simply sorting, though… As an alternative to sorting, we can also implement the below solution to replicate the logic from actually setting the diagonal of the X matrix:

-        if self.metric == "precomputed" and sparse.issparse(X):
-            # set the diagonal to explicit values, as a point is its own neighbor
-            X = X.copy()  # copy to avoid in-place modification
-            with warnings.catch_warnings():
-                warnings.simplefilter("ignore", sparse.SparseEfficiencyWarning)
-                X.setdiag(X.diagonal())
 
         neighbors_model = NearestNeighbors(
             radius=self.eps,
             algorithm=self.algorithm,
             leaf_size=self.leaf_size,
             metric=self.metric,
             metric_params=self.metric_params,
             p=self.p,
             n_jobs=self.n_jobs,
         )
         neighbors_model.fit(X)
         # This has worst case O(n^2) memory complexity
         neighborhoods = neighbors_model.radius_neighbors(X, return_distance=False)
 
+        if self.metric == "precomputed" and sparse.issparse(X):
+            # Each point is its own neighbor; update neighborhoods accordingly
+            for i, neighborhood in enumerate(neighborhoods):
+                if i not in neighborhoods[i]:
+                    neighborhoods[i] = np.append(neighborhood, i)

Again, I re-ran the initial example provided by @luispedro on my machine with this change, and it no longer showed the warning. (This approach might actually be better than the first just from the perspective of speed/allocations.) Do let me know if there's some flaw in my logic, though.
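The post-fit update in the diff above can be illustrated on toy data (the neighborhoods list below is hypothetical, mimicking the shape of the output of radius_neighbors with return_distance=False):

```python
import numpy as np

# Hypothetical neighborhoods: one array of neighbor indices per point.
# Point 1's array does not contain the point itself.
neighborhoods = [np.array([0, 2]), np.array([0]), np.array([2, 0])]

# Each point is its own neighbor; append any missing self-indices.
for i, neighborhood in enumerate(neighborhoods):
    if i not in neighborhood:
        neighborhoods[i] = np.append(neighborhood, i)

print([nb.tolist() for nb in neighborhoods])  # → [[0, 2], [0, 1], [2, 0]]
```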

Luis-Varona added a commit to Luis-Varona/scikit-learn that referenced this issue May 8, 2025
As per scikit-learn#31030, DBSCAN consistently triggers efficiency warnings due to explicitly setting the diagonal of the X matrix in fitting neighborhoods. This stems from not sorting the precomputed sparse matrix by row values. Here, we instead update the neighborhoods variable after the initial fitting to avoid this.

It may also be possible to simply add X = sort_graph_by_row_values(X, warn_when_not_sorted=False) after the original code's X.setdiag(X.diagonal()), but (1) this way seems more efficient and (2) I am not sure if this reordering would potentially affect the data in an undesired manner.
@adam2392 (Member)

adam2392 commented May 8, 2025

If there's an efficiency warning but we turn off the warning, then the process is still "inefficient" tho regardless no?

I could be wrong. Haven't followed too heavily.

@Luis-Varona (Contributor)

Luis-Varona commented May 8, 2025

If there's an efficiency warning but we turn off the warning, then the process is still "inefficient" tho regardless no?

I could be wrong. Haven't followed too heavily.

The original EfficiencyWarning derives from the sparse X matrix no longer being sorted by row value after explicitly setting the diagonal entries. The first fix I proposed was to sort the X matrix by row value after setting the entries but before fitting the neighborhoods, although it is true that this may introduce unnecessary overhead (not 100% sure yet). (To clarify: turning off the warning during the sorting affected a different warning from the one we initially got; it is potentially inefficient for a different reason altogether.)

Hence, my second proposed fix (the one I included in #31337) avoids setting the diagonal entries of the sparse matrix at all in favor of updating the neighborhoods data after the model fitting. Hopefully this is more efficient; your thoughts, @adam2392?

@EngineerDanny (Contributor)

EngineerDanny commented May 8, 2025

Hi @adrinjalali and @yuwei-1, I have also looked into this issue.
I will break it down and outline possible next steps.

Firstly, we do need the X.setdiag(X.diagonal()) line because csr arrays only store non-zero elements explicitly. If diagonal elements are missing (which is common in distance matrices), DBSCAN adds them but also changes the internal structure in the process.
There are documented issues with DBSCAN when diagonal elements are missing: #8306 and #439.
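The implicit-zeros point can be seen directly on a toy distance matrix; note that for scipy sparse containers, nnz counts stored entries, including explicit zeros:

```python
from scipy.sparse import csr_array

# Toy symmetric "distance matrix": the diagonal zeros are implicit,
# i.e. not stored in the CSR data/indices arrays at all.
D = csr_array([[0.0, 1.0], [1.0, 0.0]])
nnz_before = D.nnz           # 2: only the off-diagonal distances are stored

D.setdiag(D.diagonal())      # materialise the diagonal as explicit zeros
nnz_after = D.nnz            # 4: the zeros are now stored entries

print(nnz_before, nnz_after)  # → 2 4
```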

So this means we must re-sort during or after the change to the matrix's internal structure. That is the most efficient way to fix the warning, and it has to be an internal change.

Proposed solution (I agree with @Luis-Varona):

      X.setdiag(X.diagonal())
      # Re-sort rows so they are monotone by value again; this sorts the
      # matrix only if it is not already sorted.
      from sklearn.neighbors import sort_graph_by_row_values
      sort_graph_by_row_values(X, copy=False, warn_when_not_sorted=False)

Observe the structure of the matrix at each point.

from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data
from sklearn.neighbors import sort_graph_by_row_values

A = csr_array([[0,2],[3,0]], dtype=float)
print(A)
print(_is_sorted_by_data(A))        # True

A.setdiag(A.diagonal())             # materialise zeros
print(A)
print(_is_sorted_by_data(A))        # False

# sort data
sort_graph_by_row_values(A, copy=False, warn_when_not_sorted=False)
print(A)
print(_is_sorted_by_data(A))        # True

Output

<Compressed Sparse Row sparse array of dtype 'float64'
        with 2 stored elements and shape (2, 2)>
  Coords        Values
  (0, 1)        2.0
  (1, 0)        3.0
True
<Compressed Sparse Row sparse array of dtype 'float64'
        with 4 stored elements and shape (2, 2)>
  Coords        Values
  (0, 1)        2.0
  (0, 0)        0.0
  (1, 0)        3.0
  (1, 1)        0.0
False
<Compressed Sparse Row sparse array of dtype 'float64'
        with 4 stored elements and shape (2, 2)>
  Coords        Values
  (0, 0)        0.0
  (0, 1)        2.0
  (1, 1)        0.0
  (1, 0)        3.0
True

@Luis-Varona (Contributor)

@EngineerDanny, my ultimate suggestion is actually not to call sort_graph_by_row_values but rather to do away with explicit entry setting altogether and modify the neighborhoods variable after model fitting. See #31337 for more details. Would appreciate your feedback 🙂

@yuwei-1

yuwei-1 commented May 8, 2025

From an initial look, it seems both solutions fix the warning but I would be inclined to choose your second proposal @Luis-Varona since the first just seems like a quick fix (which could potentially be inefficient). Could you explain what exactly it is doing?
