-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
DBSCAN gives incorrect result on precomputed sparse input if there are only zeros in first row #8306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
So we're to interpret this as: sample |
Can I try working on this? I am new though. |
Go ahead, have a try.
…On 7 February 2017 at 15:19, Kanchana Ranasinghe ***@***.***> wrote:
Can I try working on this? I am new though.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#8306 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz64YNY5YHNCJWjtB8I88z6xFHqX1nks5rZ_C_gaJpZM4L5Arq>
.
|
You use |
But scipy.spatial.distance.is_valid_dm() verifies this as an acceptable distance matrix. I even tried plotting the corresponding points and got: However, distance matrices (at least for Euclidean) aren't allowed to have zero values off the diagonal, are they? |
dbscan treats implicit zeros in sparse precomputed input as masks for "out
of any eps we care about", not as literal zeros.
…On 7 Feb 2017 7:19 pm, "Kanchana Ranasinghe" ***@***.***> wrote:
But scipy.spatial.distance.is_valid_dm() verifies this as an acceptable
distance matrix. I even tried plotting the corresponding points and got:
[0.1008,0.2193],[1.6069,0.0202],[-0.5538,-0.1653],[0.
2431,-0.1523],[-1.3969,0.0781]
However, distance matrices (at least for Euclidean) aren't allowed to have
zero values off the diagonal, are they?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#8306 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz62G_wPmeIK_XsR1cGAdPC3c_RgABks5raCj9gaJpZM4L5Arq>
.
|
Let me be more concrete. Problem is specific to sparse matrix. Error happens if and only if the first line in matrix is empty. Example sparse matrix synthetically created from dense input for easy reproduction. All zero-values there are treated as non-existent elements. Incorrect behavior is in this part of code
In sparce matrix row indexer first element is always zero. If first row is empty then second element in row indexer is also zero. In my project I reimplemented your dbscan() function with this code
I doubt it's the best option, but it works correctly and doesn't look like some David Blane street magic. |
If you have a patch, it's best reviewed as as a pull request
…On 8 Feb 2017 12:10 am, "Alexey Grachev" ***@***.***> wrote:
Let I be more concrete. Problem is specific to sparse matrix. Error
happens if and only if the first line in matrix is empty. Example sparse
matrix synthetically created from dense input for easy reproduction. All
zero-values there are treated as non-existent elements.
Incorrect behavior is in this part of code
https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/cluster/
dbscan_.py#L122
if metric == 'precomputed' and sparse.issparse(X):
neighborhoods = np.empty(X.shape[0], dtype=object)
X.sum_duplicates() # XXX: modifies X's internals in-place
X_mask = X.data <= eps
masked_indices = astype(X.indices, np.intp, copy=False)[X_mask]
masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1] # <<== UNVALIDATED SHIFT-DECREMENT
# insert the diagonal: a point is its own neighbor, but 0 distance
# means absence from sparse matrix data
masked_indices = np.insert(masked_indices, masked_indptr,
np.arange(X.shape[0]))
masked_indptr = masked_indptr[:-1] + np.arange(1, X.shape[0])
# split into rows
neighborhoods[:] = np.split(masked_indices, masked_indptr)
In sparce matrix row indexer first element is always zero. If first row is
empty then second element in row indexer is also zero.
There is shift-decrement operation in code. After this operation second
element becomes first with value equal -1 and thus points to the tail of
column index. This treats first row as neighbor of all other rows.
In my project I reimplemented your dbscan() function with this code
if metric == 'precomputed' and sparse.issparse(X):
X.sum_duplicates() # XXX: modifies X's internals in-place
X.data = (X.data <= eps).astype(np.intp)
X.eliminate_zeros()
# Convert to linked list for better performance and to use built-in .rows object
X_lil = X.tolil()
# insert the diagonal: a point is its own neighbor.
X_lil.setdiag(1)
# split into rows
neighborhoods[:] = X_lil.rows
I doubt it's the best option, but it works correctly and don't look like
some David Blane street magic.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#8306 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6zgL1S5LMsmg3H4zLrXXSZnzAU0sks5raG1RgaJpZM4L5Arq>
.
|
Description
DBSCAN returns incorrect labels array on given precomputed sparse input if there are only zeros in first row.
Steps/Code to Reproduce
Expected Results
(array([1, 2, 3, 4]), array([-1, 0, 0, 1, 1]))
Actual Results
(array([0, 2, 3, 4]), array([0, 0, 0, 1, 0]))
Error appears only if first row of input matrix consist of only zeroes.
Versions
Linux-4.9.4-100.fc24.x86_64-x86_64-with-fedora-24-Twenty_Four
('Python', '2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.4')
('SciPy', '0.17.0')
('Scikit-Learn', '0.17.1')
The text was updated successfully, but these errors were encountered: