Skip to content

[MRG+1] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN. #8339

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 23, 2017

Conversation

Akshay0724
Copy link
Contributor

Reference Issue

Fixes #8306

What does this implement/fix? Explain your changes.

This implementation fixes the incorrect output case when first row of precomputed sparse matrix is all zero.
Issue was due to this line-

masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1]

with such input X.indptr is of form [0, 0, ....] and first element of index [X.indptr[1:] - 1] become -1 which in python means the last element.
So it would be better to avoid the case when X.indptr become [0, 0, ...] for this I have changed the value of first element of first row to eps+1.

Any other comments?

@Akshay0724 Akshay0724 changed the title Fixes incorrect output for precomputed input [MRG] Fixes incorrect output when input is precomputed sparse matrix. Feb 11, 2017
@Akshay0724 Akshay0724 changed the title [MRG] Fixes incorrect output when input is precomputed sparse matrix. [MRG] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN. Feb 11, 2017
@codecov
Copy link

codecov bot commented Feb 11, 2017

Codecov Report

Merging #8339 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8339      +/-   ##
==========================================
+ Coverage   94.75%   94.75%   +<.01%     
==========================================
  Files         342      342              
  Lines       60801    60806       +5     
==========================================
+ Hits        57609    57614       +5     
  Misses       3192     3192
Impacted Files Coverage Δ
sklearn/cluster/tests/test_dbscan.py 100% <100%> (ø)
sklearn/cluster/dbscan_.py 100% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb9fe80...62ff80f. Read the comment docs.

@jnothman
Copy link
Member

I've not yet understood what the code's meant to be doing. But this doesn't feel like the right fix.

Would replacing np.cumsum(X_mask)[X.indptr[1:] - 1] with np.cumsum(X_mask).take(X.indptr[1:] - 1, mode='clip') be the right thing or nonsense?

@Akshay0724
Copy link
Contributor Author

Akshay0724 commented Feb 12, 2017

In this implementation I just changed the input matrix form-
[[0.0, 0.0, 0.0, 0.0, 0.0], [.......], [.......], [.......], [.......]]
to-
[[eps+1, 0.0, 0.0, 0.0, 0.0], [.......], [.......], [.......], [.......]]

So that index -1 can't be generated and we even do not require to change rest part of the code.
If input does not looks like this than matrix will not be changed.

Is this correct or we should not change the matrix?

@Akshay0724
Copy link
Contributor Author

I have checked that these modification gives correct output for all case.

@jnothman
Copy link
Member

jnothman commented Feb 12, 2017 via email

@Akshay0724
Copy link
Contributor Author

Akshay0724 commented Feb 13, 2017

I understand that changing the input do not looks correct, that is why I checked np.cumsum(X_mask).take(X.indptr[1:] - 1, mode='clip' and found that it is not working correctly. Reason for this I think is when index -1 occurs than it return first element of the array but first element of the array np.cumsum(X_mask) will depend on X_mask and in our case it must be 0(if first row is all zero).
So, I think it will be better to check if first row is all zero or not after masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1].

if it's all zero than make first element of masked_indptr as zero.
code for this
if X.indptr[0] == X.indptr[1] ==0: masked_indptr[0] = 0

This if condition is sufficient to check if all elements of first row is zero or not.

@Akshay0724
Copy link
Contributor Author

I have checked that this works correctly and also have committed these changes.

@Akshay0724
Copy link
Contributor Author

Akshay0724 commented Feb 13, 2017 via email

@@ -125,6 +125,10 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None,
X_mask = X.data <= eps
masked_indices = astype(X.indices, np.intp, copy=False)[X_mask]
masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1]

if X.indptr[0] == X.indptr[1] == 0: # check if first row is all zero
masked_indptr[0] = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the case where multiple initial rows are all zero, this index needs to be [:np.argmin(masked_indptr)].

The other way of doing it is replacing np.cumsum(X_mask)[X.indptr[1:] - 1] with np.concatenate([0, np.cumsum(X_mask)][X.indptr[1:]]. Choose whichever you feel is more intuitive (the latter is a bit less efficient, but it pales in comparison to DBSCAN complexity overall).

Please add a test and a bug fix entry in whats_new.rst. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.concatenate([0, np.cumsum(X_mask)][X.indptr[1:]] looks like the best fix as it do not require any if condition and works for all cases. Thanks for your suggestion @jnothman.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise LGTM

@@ -124,7 +124,9 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None,
X.sum_duplicates() # XXX: modifies X's internals in-place
X_mask = X.data <= eps
masked_indices = astype(X.indices, np.intp, copy=False)[X_mask]
masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1]
masked_indptr = np.concatenate(([0], np.cumsum(X_mask)),
axis=0)[X.indptr[1:]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why you need axis=0; you can just use hstack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I have removed it.

@jnothman jnothman changed the title [MRG] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN. [MRG+1] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN. Feb 15, 2017
@jnothman jnothman added the Bug label Feb 15, 2017
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the following phrasing comments in the changelog, +1 on my side as well.

@@ -149,6 +149,10 @@ Enhancements

Bug fixes
.........
- Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect
result when input is precomputed sparse matrix with initial rows
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...when input is a precomputed sparse matrix...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not addressed

@@ -149,6 +149,10 @@ Enhancements

Bug fixes
.........
- Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect
result when input is precomputed sparse matrix with initial rows
all zero.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have rather said "with all-zeros initial rows" but I am not a native English speaker and I don't know which is more correct.

@jnothman
Copy link
Member

jnothman commented Feb 16, 2017 via email

@Akshay0724
Copy link
Contributor Author

Both "with all-zero initial rows" and "with initial rows all zero" seems to be correct to me. If you want it to be changed tell me I will commit those changes.

@Akshay0724
Copy link
Contributor Author

Thanks for approving this PR.

@Akshay0724
Copy link
Contributor Author

Hello @jnothman, can this PR be merged or you want some changes in it?
Thanks

@jnothman
Copy link
Member

Merging. Thanks @Akshay0724

@jnothman jnothman merged commit 2cd1220 into scikit-learn:master Feb 23, 2017
@Akshay0724
Copy link
Contributor Author

Akshay0724 commented Feb 23, 2017 via email

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
lemonlaug pushed a commit to lemonlaug/scikit-learn that referenced this pull request Jan 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DBSCAN gives incorrect result on precomputed sparse input if there are only zeros in first row
3 participants