-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN. #8339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #8339 +/- ##
==========================================
+ Coverage 94.75% 94.75% +<.01%
==========================================
Files 342 342
Lines 60801 60806 +5
==========================================
+ Hits 57609 57614 +5
Misses 3192 3192
Continue to review full report at Codecov.
|
I've not yet understood what the code's meant to be doing. But this doesn't feel like the right fix. Would replacing |
In this implementation I just changed the input matrix form- So that index -1 can't be generated and we even do not require to change rest part of the code. Is this correct or we should not change the matrix? |
I have checked that these modification gives correct output for all case. |
Which modification? Your eps, or my clip?
…On 12 Feb 2017 2:39 pm, "akshay0724" ***@***.***> wrote:
I have checked that these modification gives correct output for all case.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#8339 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6-oQ9Da4ZWCJGlQKSyfYroeRi5KMks5rbn8IgaJpZM4L-PkC>
.
|
I understand that changing the input do not looks correct, that is why I checked if it's all zero than make first element of This if condition is sufficient to check if all elements of first row is zero or not. |
I have checked that this works correctly and also have committed these changes. |
I was talking about mine modification.
On Mon, Feb 13, 2017 at 1:22 AM, Joel Nothman <notifications@github.com>
wrote:
… Which modification? Your eps, or my clip?
On 12 Feb 2017 2:39 pm, "akshay0724" ***@***.***> wrote:
> I have checked that these modification gives correct output for all case.
>
> —
> You are receiving this because you commented.
> Reply to this email directly, view it on GitHub
> <#8339#
issuecomment-279194482>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AAEz6-
oQ9Da4ZWCJGlQKSyfYroeRi5KMks5rbn8IgaJpZM4L-PkC>
> .
>
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#8339 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ANfGMC6XZtRyFMTDloSpHUTQiFtASKtWks5rb2L3gaJpZM4L-PkC>
.
|
sklearn/cluster/dbscan_.py
Outdated
@@ -125,6 +125,10 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None, | |||
X_mask = X.data <= eps | |||
masked_indices = astype(X.indices, np.intp, copy=False)[X_mask] | |||
masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1] | |||
|
|||
if X.indptr[0] == X.indptr[1] == 0: # check if first row is all zero | |||
masked_indptr[0] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the case where multiple initial rows are all zero, this index needs to be [:np.argmin(masked_indptr)]
.
The other way of doing it is replacing np.cumsum(X_mask)[X.indptr[1:] - 1]
with np.concatenate([0, np.cumsum(X_mask)][X.indptr[1:]]
. Choose whichever you feel is more intuitive (the latter is a bit less efficient, but it pales in comparison to DBSCAN complexity overall).
Please add a test and a bug fix entry in whats_new.rst
. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.concatenate([0, np.cumsum(X_mask)][X.indptr[1:]]
looks like the best fix as it do not require any if condition and works for all cases. Thanks for your suggestion @jnothman.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM
sklearn/cluster/dbscan_.py
Outdated
@@ -124,7 +124,9 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None, | |||
X.sum_duplicates() # XXX: modifies X's internals in-place | |||
X_mask = X.data <= eps | |||
masked_indices = astype(X.indices, np.intp, copy=False)[X_mask] | |||
masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1] | |||
masked_indptr = np.concatenate(([0], np.cumsum(X_mask)), | |||
axis=0)[X.indptr[1:]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why you need axis=0
; you can just use hstack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. I have removed it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the following phrasing comments in the changelog, +1 on my side as well.
doc/whats_new.rst
Outdated
@@ -149,6 +149,10 @@ Enhancements | |||
|
|||
Bug fixes | |||
......... | |||
- Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect | |||
result when input is precomputed sparse matrix with initial rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...when input is a precomputed sparse matrix...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not addressed
doc/whats_new.rst
Outdated
@@ -149,6 +149,10 @@ Enhancements | |||
|
|||
Bug fixes | |||
......... | |||
- Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect | |||
result when input is precomputed sparse matrix with initial rows | |||
all zero. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would have rather said "with all-zeros initial rows" but I am not a native English speaker and I don't know which is more correct.
"with all-zeros ..." is not right.
"with all-zero initial rows" would be fine.
"with initial rows all zero" is *okay* but seems to suggest that each row
is zero (rather than a vector of zeros), which only makes sense if you're a
linear algebra person :)
…On 16 February 2017 at 20:33, Olivier Grisel ***@***.***> wrote:
***@***.**** approved this pull request.
Besides the following phrasing comments in the changelog, +1 on my side as
well.
------------------------------
In doc/whats_new.rst
<#8339 (comment)>
:
> @@ -149,6 +149,10 @@ Enhancements
Bug fixes
.........
+ - Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect
+ result when input is precomputed sparse matrix with initial rows
...when input is a precomputed sparse matrix...
------------------------------
In doc/whats_new.rst
<#8339 (comment)>
:
> @@ -149,6 +149,10 @@ Enhancements
Bug fixes
.........
+ - Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect
+ result when input is precomputed sparse matrix with initial rows
+ all zero.
I would have rather said "with all-zeros initial rows" but I am not a
native English speaker and I don't know which is more correct.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8339 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6_9bV3IEDbFVcEFv43FdkVlPx2-Cks5rdBfXgaJpZM4L-PkC>
.
|
Both "with all-zero initial rows" and "with initial rows all zero" seems to be correct to me. If you want it to be changed tell me I will commit those changes. |
Thanks for approving this PR. |
Hello @jnothman, can this PR be merged or you want some changes in it? |
Merging. Thanks @Akshay0724 |
Reference Issue
Fixes #8306
What does this implement/fix? Explain your changes.
This implementation fixes the incorrect output case when first row of precomputed sparse matrix is all zero.
Issue was due to this line-
masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1]
with such input
X.indptr
is of form [0, 0, ....] and first element of index[X.indptr[1:] - 1]
become -1 which in python means the last element.So it would be better to avoid the case when
X.indptr
become [0, 0, ...] for this I have changed the value of first element of first row to eps+1.Any other comments?