DBSCAN always triggers an EfficiencyWarning #31030
So this seems to be the culprit: `sklearn/cluster/_dbscan.py`, line 408 (commit 8951184). I'm not really sure how to fix this. Maybe @adam2392 has an idea here?
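The snippet embedded by that permalink was not captured here. For context, an approximate paraphrase of the block around that line in `_dbscan.py` (a reconstruction, not an exact quote of commit 8951184; `metric` and `X` are the surrounding function's variables):

```python
import warnings
from scipy import sparse

# Approximate paraphrase of the precomputed-sparse handling in DBSCAN.fit:
if metric == "precomputed" and sparse.issparse(X):
    # set the diagonal to explicit values, as a point is its own neighbor
    X = X.copy()  # copy to avoid modifying the user's matrix in place
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", sparse.SparseEfficiencyWarning)
        X.setdiag(X.diagonal())  # the call under discussion
```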
If there are 0's on the diagonal, and X.setdiag(X.diagonal()) puts those 0's somewhere they weren't stored before, doesn't that imply the sparse data structure is no longer sorted by data? Disclaimer: I have not looked into this explicitly; this is just my first-read intuition.
/take
@yuwei-1 please leave a comment explaining what you understand the issue to be and how you intend to solve it before tagging an issue as taken. Also, this is not a good first issue; you might want to start with an easier one.
Apologies, I wasn't aware of this protocol. I reproduced the error yesterday and tracked the problem down to the same line. I am not entirely sure why this issue occurs; I assume it has something to do with how sparse matrices are stored, and that concretely storing the diagonal changes the ordering of the data. I intend to solve it by looking into how SciPy CSR matrices are stored, the calculations inside _is_sorted_by_data, and what changes before and after X.setdiag(X.diagonal()).
That sounds great. Thanks. Let us know what you find 😊
Hi @adrinjalali, I think I have found the issue, but I do not know how to go about resolving it. The TL;DR is that once we concretise the diagonal values, I do not believe there is any guarantee in the general case that the data ordering is preserved on a per-row basis (which is what _is_sorted_by_data checks for). I went through a particular worked example below:

```python
from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data

# Build a small CSR array with a duplicate entry ("hello" appears twice in row 0).
docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in docs:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))

sparse_array = csr_array((data, indices, indptr), dtype=int)
print("original array: ", sparse_array.toarray())
print("(data, indices, indptr) for original: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))

sparse_array.setdiag(sparse_array.diagonal())
print("concretised array: ", sparse_array.toarray())
print("(data, indices, indptr) for concretised: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))
```
Output:

```
original array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for original:  [1 1 1 1 1 1] [0 1 0 2 3 1] [0 3 6]
is sorted by data:  True
concretised array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for concretised:  [2 1 1 1 1] [0 1 1 2 3] [0 2 5]
is sorted by data:  False
```

From this, we see that concretising the diagonal forces the diagonal values ([2, 1]) to be stored explicitly in the graph.data array (whereas before, the 2 was stored as two separate elements of the data array). This is a problem because it violates the condition inside _is_sorted_by_data, which requires the data within each row to be in equal-or-ascending order. I've checked, and the setdiag() line does affect whether tests pass, so it does impact the results of DBSCAN. I'm not exactly sure how concretisation affects the algorithm (since the underlying array is the same), but perhaps we could explore ways of achieving the same thing without having to concretise the diagonal? The alternative would be to simply emit a more detailed warning for this particular case.
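For reference, the condition being violated can be sketched as a per-row monotonicity check (a paraphrase of what _is_sorted_by_data does, not the exact scikit-learn implementation):

```python
import numpy as np
from scipy.sparse import csr_array

def is_sorted_by_data(graph):
    """Sketch: True if each CSR row's stored values are non-decreasing."""
    for start, stop in zip(graph.indptr[:-1], graph.indptr[1:]):
        row = graph.data[start:stop]
        if np.any(row[:-1] > row[1:]):  # a larger value precedes a smaller one
            return False
    return True

# The concretised example above fails this check: row 0 stores [2, 1].
print(is_sorted_by_data(csr_array([[2, 1], [0, 3]])))  # False
```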
I think figuring out why we need to set the diagonal, and how it affects the algorithm, is important. You might need to dig deep into the implementation to figure that out, but it'd be very much appreciated.
Agreed. Will look into this over the next week and let you know what I find! |
@adrinjalali @luispedro, would adding a sort_graph_by_row_values call after the X.setdiag(X.diagonal()) line fix this? Perhaps I'm missing some obvious reason to avoid simply sorting, though… As an alternative to sorting, we can also replicate the logic of actually setting the diagonal of the sparse matrix by updating the neighborhoods after the initial fitting instead. Again, I re-ran the initial example provided by @luispedro on my machine with this change, and it no longer showed the warning. (This approach might actually be better than the first just from the perspective of speed/allocations.) Do let me know if there's some flaw in my logic, though.
As per scikit-learn#31030, DBSCAN consistently triggers efficiency warnings because it explicitly sets the diagonal of the X matrix when fitting neighborhoods, which leaves the precomputed sparse matrix no longer sorted by row values. Here, we instead update the neighborhoods variable after the initial fitting to avoid this. It may also be possible to simply add X = sort_graph_by_row_values(X, warn_when_not_sorted=False) after the original code's X.setdiag(X.diagonal()), but (1) this way seems more efficient and (2) I am not sure whether that reordering could affect the data in an undesired manner.
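A sketch of that idea (my reconstruction of the approach described, not the actual code from #31337; the toy data is an assumption): fit on the untouched precomputed graph, then append each sample's own index to its neighborhood rather than materialising the diagonal:

```python
import numpy as np
from sklearn.neighbors import (
    NearestNeighbors, radius_neighbors_graph, sort_graph_by_row_values,
)

# Toy precomputed sparse distance graph, for illustration only.
rng = np.random.default_rng(0)
points = rng.random((20, 2))
X = sort_graph_by_row_values(
    radius_neighbors_graph(points, radius=0.3, mode="distance"),
    warn_when_not_sorted=False,
)

# Fit without touching X's diagonal, so the row-sorted structure survives...
model = NearestNeighbors(radius=0.3, metric="precomputed").fit(X)
neighborhoods = model.radius_neighbors(X, return_distance=False)

# ...then make each sample its own neighbor by patching the output,
# which is what X.setdiag(X.diagonal()) achieved indirectly.
neighborhoods = np.array(
    [np.append(hood, i) if i not in hood else hood
     for i, hood in enumerate(neighborhoods)],
    dtype=object,
)
```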
If there's an efficiency warning but we turn off the warning, the process is still "inefficient" regardless, no? I could be wrong; I haven't followed too closely.
The original X.setdiag(X.diagonal()) call is what leaves the matrix unsorted in the first place, so merely suppressing the warning would indeed leave the inefficiency in place. Hence, my second proposed fix (the one I included in #31337) avoids setting the diagonal entries of the sparse matrix at all, in favor of updating the neighborhoods data after the model fitting. Hopefully this is more efficient; your thoughts, @adam2392?
Hi @adrinjalali and @yuwei-1, I have also looked into this issue. Firstly, we do need the X.setdiag(X.diagonal()) call, as each point has to be registered as its own neighbor. So this means we must sort during or after the internal structure of the matrix is changed.

Proposed solution (I agree with @Luis-Varona):

```python
# 1. Materialise the diagonal entries explicitly
X.setdiag(X.diagonal())

# 2. Re-sort rows so they are monotone by value again
from sklearn.neighbors import sort_graph_by_row_values

sort_graph_by_row_values(X, copy=False, warn_when_not_sorted=False)  # <-- this sorts the matrix only if it is not sorted
```

Observe the structure of the matrix at each point:

```python
from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data
from sklearn.neighbors import sort_graph_by_row_values

A = csr_array([[0, 2], [3, 0]], dtype=float)
print(A)
print(_is_sorted_by_data(A))  # True

A.setdiag(A.diagonal())  # materialise zeros
print(A)
print(_is_sorted_by_data(A))  # False

# sort data
sort_graph_by_row_values(A, copy=False, warn_when_not_sorted=False)
print(A)
print(_is_sorted_by_data(A))  # True
```
@EngineerDanny, my ultimate suggestion is actually not to call X.setdiag(X.diagonal()) at all, but to update the neighborhoods after fitting instead; that way the matrix is never unsorted and never needs re-sorting.
From an initial look, it seems both solutions fix the warning, but I would be inclined to choose your second proposal, @Luis-Varona, since the first just seems like a quick fix (and could potentially be inefficient). Could you explain what exactly it is doing?
Describe the bug
Calling dbscan always triggers an efficiency warning. There is no apparent way to either call it correctly or disable the warning.
This was originally reported as an issue in SemiBin, which uses DBSCAN under the hood: BigDataBiology/SemiBin#175
Steps/Code to Reproduce
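The original reproducer was not captured here. As a stand-in, a sketch of the kind of call that triggers the warning (the toy data and parameters are assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import radius_neighbors_graph, sort_graph_by_row_values

rng = np.random.default_rng(0)
points = rng.random((50, 2))

# 1) Fit on a raw precomputed sparse distance graph: EfficiencyWarning
#    (the graph is not sorted by row values).
graph = radius_neighbors_graph(points, radius=0.3, mode="distance")
DBSCAN(eps=0.3, metric="precomputed").fit(graph)

# 2) Sort the graph exactly as the warning's message recommends, then fit
#    again: the warning is STILL raised, because DBSCAN internally calls
#    X.setdiag(X.diagonal()), which unsorts the rows again.
graph = sort_graph_by_row_values(graph, warn_when_not_sorted=False)
DBSCAN(eps=0.3, metric="precomputed").fit(graph)
```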
Expected Results
No warning, at least in the second call
Actual Results
Versions