DBSCAN always triggers an EfficiencyWarning #31030
Comments
So this seems to be the culprit: `scikit-learn/sklearn/cluster/_dbscan.py`, line 408 (at commit 8951184).
I'm not really sure how to fix this, though. Maybe @adam2392 has an idea here?
If there are zeros on the diagonal, and X.setdiag(X.diagonal()) puts those zeros somewhere they weren't stored before, doesn't that imply the sparse data structure is no longer sorted by data? Disclaimer: I have not looked into this explicitly; this is just my first-read intuition.
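For what it's worth, a minimal scipy sketch of that intuition (hypothetical data, not taken from the original report): writing an all-zero diagonal back with setdiag does insert explicit zeros into the stored structure.

```python
import warnings
import numpy as np
from scipy.sparse import csr_array

# Hypothetical example: a matrix whose diagonal is implicitly zero,
# i.e. no diagonal entries are stored at all.
X = csr_array(np.array([[0.0, 5.0, 0.0],
                        [2.0, 0.0, 0.0],
                        [0.0, 0.0, 0.0]]))
nnz_before = X.nnz  # only the two off-diagonal entries are stored

with warnings.catch_warnings():
    # setdiag warns (SparseEfficiencyWarning) because it changes the structure
    warnings.simplefilter("ignore")
    X.setdiag(X.diagonal())  # writes the all-zero diagonal back explicitly

nnz_after = X.nnz  # larger now: the zeros are stored as explicit entries
print(nnz_before, nnz_after)
```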
/take
@yuwei-1 please leave a comment with what you understand the issue is and how you intend to solve it before tagging an issue as taken. Also, this is not a good first issue. You might want to start with an easier issue.
Apologies, I wasn't aware of this protocol. I had reproduced the error yesterday and tracked the problem down to the same line. I am not entirely sure why this issue is occurring; I assume it is something to do with how sparse matrices are stored, and that when we concretely store the diagonal, this changes the ordering of the data. I was intending to solve it by looking into how scipy CSR matrices are stored, the calculations inside _is_sorted_by_data, and what changes before and after X.setdiag(X.diagonal()).
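For reference, here is a simplified sketch (my own re-implementation, not sklearn's actual code) of the invariant that _is_sorted_by_data checks: within each CSR row, the stored values in graph.data must be non-decreasing.

```python
import numpy as np
from scipy.sparse import csr_array

def rows_sorted_by_data(graph):
    # Within each CSR row (graph.indptr[i]:graph.indptr[i + 1]),
    # the stored values must be in non-decreasing order.
    return all(
        np.all(np.diff(graph.data[start:stop]) >= 0)
        for start, stop in zip(graph.indptr[:-1], graph.indptr[1:])
    )

# Row 0 stores values [1, 3] (sorted); row 1 stores [2, 1] (not sorted).
g = csr_array(
    (np.array([1, 3, 2, 1]), np.array([0, 1, 1, 0]), np.array([0, 2, 4])),
    shape=(2, 2),
)
print(rows_sorted_by_data(g))  # False: row 1 violates the invariant
```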
That sounds great. Thanks. Let us know what you find 😊
Hi @adrinjalali, I think I have found the issue, but I do not know how to go about resolving it. The TL;DR is that by concretising the diagonal values, I do not believe there is any guarantee in the general case that we can still preserve the data ordering on a per-row basis (which is what _is_sorted_by_data checks for). I went through a particular worked example below:

```python
from scipy.sparse import csr_array
from sklearn.neighbors._base import _is_sorted_by_data

docs = [["hello", "world", "hello"], ["goodbye", "cruel", "world"]]
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in docs:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))

sparse_array = csr_array((data, indices, indptr), dtype=int)
print("original array: ", sparse_array.toarray())
print("(data, indices, indptr) for original: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))

sparse_array.setdiag(sparse_array.diagonal())
print("concretised array: ", sparse_array.toarray())
print("(data, indices, indptr) for concretised: ", sparse_array.data, sparse_array.indices, sparse_array.indptr)
print("is sorted by data: ", _is_sorted_by_data(sparse_array))
```
output:

```
original array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for original:  [1 1 1 1 1 1] [0 1 0 2 3 1] [0 3 6]
is sorted by data:  True
concretised array:  [[2 1 0 0]
 [0 1 1 1]]
(data, indices, indptr) for concretised:  [2 1 1 1 1] [0 1 1 2 3] [0 2 5]
is sorted by data:  False
```

From this, we see that concretising the diagonal forces the values on the diagonal ([2, 1]) to be stored explicitly in the graph.data array (whereas before, the 2 was stored as two separate elements of the data array). This is a problem because it violates the condition inside _is_sorted_by_data, namely that data within the same row should be equal or ascending in size. I've checked, and the setdiag() line does affect whether the tests pass, so it does impact the results of DBSCAN. I'm not exactly sure how concretisation affects the algorithm (as the underlying array should be the same), but perhaps we could explore ways of achieving the same thing without having to concretise the diagonal? The alternative would be to simply emit a more detailed warning for this particular case.
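One possible direction, sketched here on the concretised arrays from the worked example (this is my own sketch of the idea, similar in spirit to sklearn.neighbors.sort_graph_by_row_values, not the actual implementation): re-sort each row's stored entries by value after concretising, which restores the invariant without changing the represented matrix.

```python
import numpy as np
from scipy.sparse import csr_array

def sort_rows_by_data(graph):
    # Sketch: sort each CSR row's stored (value, column) pairs by value.
    # The column order within a row is scrambled, but the represented
    # matrix is unchanged, and the per-row ascending-data invariant holds.
    for start, stop in zip(graph.indptr[:-1], graph.indptr[1:]):
        order = np.argsort(graph.data[start:stop], kind="stable")
        graph.data[start:stop] = graph.data[start:stop][order]
        graph.indices[start:stop] = graph.indices[start:stop][order]
    return graph

# The concretised (data, indices, indptr) from the worked example above.
g = csr_array(
    (np.array([2, 1, 1, 1, 1]), np.array([0, 1, 1, 2, 3]), np.array([0, 2, 5])),
    shape=(2, 4),
)
dense_before = g.toarray()
sort_rows_by_data(g)
print(g.data)  # row 0 now stores [1, 2] instead of [2, 1]
```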
I think figuring out why we need to set the diagonal, and how it affects the algorithm, is important. You might need to dig deep into the implementation to figure that out, but it'd be very much appreciated.
Agreed. Will look into this over the next week and let you know what I find! |
Describe the bug
Calling dbscan always triggers an efficiency warning. There is no apparent way to either call it correctly or disable the warning.
This was originally reported as an issue in SemiBin, which uses DBSCAN under the hood: BigDataBiology/SemiBin#175
Steps/Code to Reproduce
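A minimal sketch of the kind of call that triggers the warning (hypothetical stand-in data; this assumes a precomputed sparse distance matrix, as in the linked SemiBin report):

```python
import warnings
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Hypothetical data: a small symmetric sparse "distance" matrix with an
# implicit (unstored) zero diagonal, passed as a precomputed input.
rng = np.random.default_rng(0)
dense = rng.random((20, 20))
dense = (dense + dense.T) / 2
dense[dense > 0.6] = 0.0      # drop large distances from the stored structure
np.fill_diagonal(dense, 0.0)  # self-distances are zero, hence unstored
D = csr_matrix(dense)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model = DBSCAN(eps=0.5, min_samples=3, metric="precomputed").fit(D)

# On affected versions, this list includes scipy's SparseEfficiencyWarning,
# raised by the X.setdiag(X.diagonal()) call in _dbscan.py.
print(sorted({w.category.__name__ for w in caught}))
```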
Expected Results
No warning, at least on the second call.
Actual Results
Versions