Skip to content

DBSCAN gives incorrect result on precomputed sparse input if there are only zeros in first row #8306

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
alxgrh opened this issue Feb 7, 2017 · 8 comments · Fixed by #8339
Labels

Comments

@alxgrh
Copy link

alxgrh commented Feb 7, 2017

Description

DBSCAN returns incorrect labels array on given precomputed sparse input if there are only zeros in first row.

Steps/Code to Reproduce

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import dbscan

# Create example distance matrix 
# On such input and with epsilon value equal to 0.2 DBSCAN should leave first row unclustered, put 2nd and 3rd rows to one cluster and put 4th and 5th rows to another cluster
ar = np.array([
        [0.0, 0.0, 0.0, 0.0, 0.0 ],
        [0.0, 0.0, 0.2, 0.0, 0.3 ],
        [0.0, 0.2, 0.0, 0.0, 0.0 ],
        [0.0, 0.0, 0.0, 0.0, 0.1 ],
        [0.0, 0.3, 0.0, 0.1, 0.0 ]
    ])
matrix = csr_matrix(ar)

# direct method used just for reference. DBSCAN.fit() gives the same result.
dbscan(matrix, metric='precomputed', eps=0.2, min_samples=2)

Expected Results

(array([1, 2, 3, 4]), array([-1, 0, 0, 1, 1]))

Actual Results

(array([0, 2, 3, 4]), array([0, 0, 0, 1, 0]))

Error appears only if first row of input matrix consist of only zeroes.

Versions

Linux-4.9.4-100.fc24.x86_64-x86_64-with-fedora-24-Twenty_Four
('Python', '2.7.12 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:42:40) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]')
('NumPy', '1.10.4')
('SciPy', '0.17.0')
('Scikit-Learn', '0.17.1')

@jnothman
Copy link
Member

jnothman commented Feb 7, 2017

So we're to interpret this as: sample X[0] has no other items within eps distance. I agree this looks like a bug. Thanks for the report and the minimal example.

@kahnchana
Copy link

Can I try working on this? I am new though.

@jnothman
Copy link
Member

jnothman commented Feb 7, 2017 via email

@gan3sh500
Copy link

You use ar as a pairwise distance matrix but can a pairwise distance matrix have a row of zeros without everything being zeros?

@kahnchana
Copy link

But scipy.spatial.distance.is_valid_dm() verifies this as an acceptable distance matrix. I even tried plotting the corresponding points and got:
[0.1008,0.2193],[1.6069,0.0202],[-0.5538,-0.1653],[0.2431,-0.1523],[-1.3969,0.0781]

However, distance matrices (at least for Euclidean) aren't allowed to have zero values off the diagonal, are they?

@jnothman
Copy link
Member

jnothman commented Feb 7, 2017 via email

@alxgrh
Copy link
Author

alxgrh commented Feb 7, 2017

Let me be more concrete. Problem is specific to sparse matrix. Error happens if and only if the first line in matrix is empty. Example sparse matrix synthetically created from dense input for easy reproduction. All zero-values there are treated as non-existent elements.

Incorrect behavior is in this part of code
https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/cluster/dbscan_.py#L122

    if metric == 'precomputed' and sparse.issparse(X):
        neighborhoods = np.empty(X.shape[0], dtype=object)
        X.sum_duplicates()  # XXX: modifies X's internals in-place
        X_mask = X.data <= eps
        masked_indices = astype(X.indices, np.intp, copy=False)[X_mask]
        masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1] # <<== UNVALIDATED SHIFT-DECREMENT
        # insert the diagonal: a point is its own neighbor, but 0 distance
        # means absence from sparse matrix data
        masked_indices = np.insert(masked_indices, masked_indptr,
                                   np.arange(X.shape[0]))
        masked_indptr = masked_indptr[:-1] + np.arange(1, X.shape[0])
        # split into rows
neighborhoods[:] = np.split(masked_indices, masked_indptr)

In sparce matrix row indexer first element is always zero. If first row is empty then second element in row indexer is also zero.
There is shift-decrement operation in code. After this operation second element becomes first with value equal -1 and thus points to the tail of column index. This treats first row as neighbor of all other rows.

In my project I reimplemented your dbscan() function with this code

    if metric == 'precomputed' and sparse.issparse(X):
        X.sum_duplicates()  # XXX: modifies X's internals in-place
        X.data = (X.data <= eps).astype(np.intp)
        X.eliminate_zeros()

        # Convert to linked list for better performance and to use built-in .rows object
        X_lil = X.tolil()
        # insert the diagonal: a point is its own neighbor.
        X_lil.setdiag(1)

        # split into rows
        neighborhoods[:] = X_lil.rows

I doubt it's the best option, but it works correctly and doesn't look like some David Blane street magic.

@jnothman
Copy link
Member

jnothman commented Feb 7, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
4 participants