
DBSCAN always triggers an EfficiencyWarning #31030


Open
luispedro opened this issue Mar 19, 2025 · 9 comments
Labels: Bug, help wanted, Needs Investigation (Issue requires investigation)

Comments

@luispedro (Contributor)

Describe the bug

Calling dbscan always triggers an EfficiencyWarning. There is no apparent way to call it that avoids the warning, nor any way to disable it.

This was originally reported as an issue in SemiBin, which uses DBSCAN under the hood: BigDataBiology/SemiBin#175

Steps/Code to Reproduce

import numpy as np
from sklearn.cluster import dbscan
from sklearn.neighbors import kneighbors_graph, sort_graph_by_row_values

f = np.random.randn(10_000, 240)
dist_matrix = kneighbors_graph(
    f,
    n_neighbors=200,
    mode='distance',
    p=2,
    n_jobs=3)

_, labels = dbscan(dist_matrix,
        eps=0.1, min_samples=5, n_jobs=4, metric='precomputed')


dist_matrix = sort_graph_by_row_values(dist_matrix)
_, labels = dbscan(dist_matrix,
        eps=0.1, min_samples=5, n_jobs=4, metric='precomputed')

Expected Results

No warning, at least for the second call

Actual Results

/home/luispedro/.mambaforge/envs/py3.11/lib/python3.11/site-packages/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.
  warnings.warn(
/home/luispedro/.mambaforge/envs/py3.11/lib/python3.11/site-packages/sklearn/neighbors/_base.py:248: EfficiencyWarning: Precomputed sparse input was not sorted by row values. Use the function sklearn.neighbors.sort_graph_by_row_values to sort the input by row values, with warn_when_not_sorted=False to remove this warning.
  warnings.warn(
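Until the underlying cause is fixed, the warning can at least be silenced around the call. A minimal sketch (the dbscan call itself is elided here; the fallback class exists only so the snippet runs without scikit-learn installed):

```python
import warnings

try:
    from sklearn.exceptions import EfficiencyWarning
except ImportError:  # stand-in so the sketch runs without scikit-learn
    class EfficiencyWarning(UserWarning):
        pass

with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=EfficiencyWarning)
    # the dbscan(...) call from the reproduction would go here; any
    # EfficiencyWarning raised inside this block is suppressed
    warnings.warn("demo warning", EfficiencyWarning)  # not printed
```

This is a blunt workaround, of course: it silences all EfficiencyWarnings inside the block, not just the spurious one.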

Versions

I tested on the current main branch (5cdbbf15e3fade7cc2462ef66dc4ea0f37f390e3), but the problem has been present for a while (see the original SemiBin report from September 2024):


System:
    python: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
executable: /home/luispedro/.mambaforge/envs/py3.11/bin/python3.11
   machine: Linux-6.8.0-55-generic-x86_64-with-glibc2.39

Python dependencies:
      sklearn: 1.7.dev0
          pip: 24.0
   setuptools: 70.0.0
        numpy: 1.26.4
        scipy: 1.13.1
       Cython: None
       pandas: 2.2.2
   matplotlib: 3.8.4
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 16
         prefix: libopenblas
       filepath: /home/luispedro/.mambaforge/envs/py3.11/lib/libopenblasp-r0.3.27.so
        version: 0.3.27
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libgomp
       filepath: /home/luispedro/.mambaforge/envs/py3.11/lib/libgomp.so.1.0.0
        version: None
@adrinjalali (Member)

So this seems to be the culprit:

X.setdiag(X.diagonal())

Before this line, _is_sorted_by_data(X) is True, and after this line, it becomes False, and hence the warning.
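The structural change can be reproduced with scipy alone (a hypothetical minimal example, not taken from the sklearn code path): a kneighbors_graph has no stored diagonal entries, so setdiag(X.diagonal()) writes explicit zeros into each row, and the zero lands after larger values in the second row.

```python
import warnings
import numpy as np
from scipy.sparse import csr_array, SparseEfficiencyWarning

# A 2x2 distance-like matrix whose zero diagonal is implicit (not stored).
X = csr_array(np.array([[0.0, 0.5], [0.7, 0.0]]))
print(X.nnz)   # 2: only the off-diagonal entries are stored
print(X.data)  # [0.5 0.7] -- each row's stored values are non-decreasing

with warnings.catch_warnings():
    # setdiag changes the sparsity structure, which scipy itself warns about
    warnings.simplefilter("ignore", SparseEfficiencyWarning)
    X.setdiag(X.diagonal())  # writes the zero diagonal back explicitly

print(X.nnz)   # 4: explicit zeros are now stored on the diagonal
print(X.data)  # row 1 now stores [0.7, 0.0] -- no longer non-decreasing
```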

I'm not really sure how to fix this. Maybe @adam2392 has an idea here?

@adrinjalali removed the Needs Triage label Mar 21, 2025
@adam2392 (Member)

If there are 0's in the diagonal, and X.setdiag(X.diagonal()) puts the 0's somewhere where they weren't before, doesn't that imply the sparse data structure is not sorted by data anymore?

disclaimer: I have not looked into this explicitly, but this is just my first read intuition.

@adrinjalali added the help wanted and Needs Investigation labels Mar 24, 2025
@yuwei-1

yuwei-1 commented Mar 28, 2025

/take

@adrinjalali (Member)

adrinjalali commented Mar 29, 2025

@yuwei-1 please leave a comment with what you understand the issue is and how you intend to solve it before tagging an issue as taken.

Also, this is not a good first issue. You might want to start with an easier issue.

@yuwei-1

yuwei-1 commented Mar 29, 2025

Apologies, I wasn't aware of this protocol. I reproduced the error yesterday and tracked the problem down to the same line. I am not entirely sure why this issue is occurring; I assume it has something to do with how sparse matrices are stored, and that concretely storing the diagonal changes the ordering of the data. I intend to solve it by looking into how scipy CSR matrices are stored, the calculations inside _is_sorted_by_data, and what changes before and after X.setdiag(X.diagonal()).

@adrinjalali (Member)

That sounds great. Thanks. Let us know what you find 😊

@yuwei-1

yuwei-1 commented Mar 29, 2025

Hi @adrinjalali, I think I have found the issue, but I do not know how to go about resolving it. The TLDR is that once the diagonal values are concretised, I do not believe there is any guarantee in the general case that the data ordering is preserved on a per-row basis (which is what _is_sorted_by_data checks for).

I went through a particular worked example below:

from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data


docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in docs:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))


sparse_array = csr_array((data, indices, indptr), dtype=int)
print("original array: ", sparse_array.toarray())
print("(data, indices, indptr) for original: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))

sparse_array.setdiag(sparse_array.diagonal())
print("concretised array: ", sparse_array.toarray())
print("(data, indices, indptr) for concretised: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))


output:

original array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for original:  [1 1 1 1 1 1] [0 1 0 2 3 1] [0 3 6]
is sorted by data:  True
concretised array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for concretised:  [2 1 1 1 1] [0 1 1 2 3] [0 2 5]
is sorted by data:  False

From this, we see that concretising the diagonal forces the values on the diagonal ([2, 1]) to be stored explicitly in the graph.data array (instead of before, where the 2 was stored as two separate elements in the data array). This is a problem because it violates the condition inside _is_sorted_by_data, which requires the data within each row to be equal or ascending in size.
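That per-row condition can be restated in a few lines of scipy. The helper below is a hypothetical re-implementation for illustration, not the actual sklearn.neighbors._base._is_sorted_by_data code:

```python
import numpy as np
from scipy.sparse import csr_array

def is_sorted_by_row_values(graph):
    """Within each CSR row, the stored values must be non-decreasing
    (the condition _is_sorted_by_data is described as checking)."""
    for start, end in zip(graph.indptr[:-1], graph.indptr[1:]):
        if np.any(np.diff(graph.data[start:end]) < 0):
            return False
    return True

sorted_graph = csr_array(np.array([[0.1, 0.2], [0.0, 0.3]]))
unsorted_graph = csr_array(np.array([[0.2, 0.1], [0.0, 0.3]]))
print(is_sorted_by_row_values(sorted_graph))    # True
print(is_sorted_by_row_values(unsorted_graph))  # False
```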

I've checked, and the setdiag() line does affect whether tests pass, so it does impact the results of DBSCAN. I'm not exactly sure how concretisation affects the algorithm (as the underlying array should be the same), but we could explore ways of achieving the same thing without having to concretise the diagonal, perhaps? The alternative could be to simply emit a more detailed warning for this particular case.
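Another option would be to restore the invariant after setdiag rather than avoid setdiag. A sketch of that repair in plain scipy (assuming re-sorting each row by value is safe here, i.e. roughly what sklearn.neighbors.sort_graph_by_row_values does):

```python
import numpy as np
from scipy.sparse import csr_array

def resort_rows_by_value(graph):
    """Reorder each CSR row's (value, column) pairs so that the stored
    values are non-decreasing again. Note this leaves the column indices
    within a row unsorted, which scipy tolerates."""
    for start, end in zip(graph.indptr[:-1], graph.indptr[1:]):
        order = np.argsort(graph.data[start:end], kind="stable")
        graph.data[start:end] = graph.data[start:end][order]
        graph.indices[start:end] = graph.indices[start:end][order]
    return graph

# Row 1 stores [0.7, 0.0], i.e. its values are out of order.
g = csr_array((np.array([0.5, 0.7, 0.0]), np.array([1, 0, 1]),
               np.array([0, 1, 3])), shape=(2, 2))
resort_rows_by_value(g)
print(g.data)  # [0.5 0.  0.7] -- row 1 reordered to [0.0, 0.7]
```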

@adrinjalali (Member)

I think figuring out why we need to set the diagonal, and how it affects the algorithm is important. You might need to dig deep into the implementation to figure that out, but it'd be very much appreciated.

@yuwei-1

yuwei-1 commented Apr 19, 2025

Agreed. Will look into this over the next week and let you know what I find!
