[MRG+1] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN. #8339

Akshay0724 · 2017-02-11T17:53:05Z

Reference Issue

What does this implement/fix? Explain your changes.

This fixes the incorrect output produced when the first row of a precomputed sparse matrix is all zero.
The issue was caused by this line:

masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1]

With such input, X.indptr has the form [0, 0, ...], so the first element of the index X.indptr[1:] - 1 becomes -1, which in Python refers to the last element.
To avoid the case where X.indptr is [0, 0, ...], I have changed the value of the first element of the first row to eps+1.
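
To make the failure concrete, here is a minimal sketch that reproduces the wrap-around using only NumPy. The `indptr` and `data` arrays are hand-built, hypothetical CSR arrays for a 3-row matrix whose first row stores no entries; they are not taken from the original bug report.

```python
import numpy as np

# Hypothetical CSR arrays (illustrative values only):
#   row 0: no stored entries
#   row 1: stored distances 0.2 and 0.9
#   row 2: stored distance 0.3
indptr = np.array([0, 0, 2, 3])
data = np.array([0.2, 0.9, 0.3])
eps = 0.5

X_mask = data <= eps                  # [True, False, True]
cumulative = np.cumsum(X_mask)        # [1, 1, 2]
lookup = indptr[1:] - 1               # [-1, 1, 2]: row 0 yields index -1

# Index -1 wraps around to the last element of `cumulative`.
masked_indptr = cumulative[lookup]
print(masked_indptr)                  # [2 1 2]; row 0 should end at 0, not 2
```

Row 0 keeps no entries, so its end offset should be 0, but the lookup reports the total number of kept entries instead.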

Any other comments?

codecov · 2017-02-11T18:23:16Z

Codecov Report

Merging #8339 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #8339      +/-   ##
==========================================
+ Coverage   94.75%   94.75%   +<.01%     
==========================================
  Files         342      342              
  Lines       60801    60806       +5     
==========================================
+ Hits        57609    57614       +5     
  Misses       3192     3192

| Impacted Files | Coverage Δ | |
|---|---|---|
| sklearn/cluster/tests/test_dbscan.py | `100% <100%> (ø)` | ✅ |
| sklearn/cluster/dbscan_.py | `100% <100%> (ø)` | ✅ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eb9fe80...62ff80f. Read the comment docs.

jnothman · 2017-02-12T03:01:19Z

I've not yet understood what the code's meant to be doing. But this doesn't feel like the right fix.

Would replacing np.cumsum(X_mask)[X.indptr[1:] - 1] with np.cumsum(X_mask).take(X.indptr[1:] - 1, mode='clip') be the right thing or nonsense?

Akshay0724 · 2017-02-12T03:38:11Z

In this implementation I just changed the input matrix from
[[0.0, 0.0, 0.0, 0.0, 0.0], [.......], [.......], [.......], [.......]]
to
[[eps+1, 0.0, 0.0, 0.0, 0.0], [.......], [.......], [.......], [.......]]

so that index -1 can't be generated and the rest of the code does not need to change.
If the input does not look like this, the matrix is not changed.

Is this correct, or should we not change the matrix?

Akshay0724 · 2017-02-12T03:39:51Z

I have checked that this modification gives the correct output in all cases.

jnothman · 2017-02-12T19:52:00Z

Which modification? Your eps, or my clip?


Akshay0724 · 2017-02-13T15:01:38Z

I understand that changing the input does not look correct, which is why I checked np.cumsum(X_mask).take(X.indptr[1:] - 1, mode='clip') and found that it does not work correctly. I think the reason is that when index -1 occurs, it returns the first element of the array, but the first element of np.cumsum(X_mask) depends on X_mask, whereas in our case the result must be 0 (if the first row is all zero).
So I think it would be better to check whether the first row is all zero after masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1].

If it is all zero, then set the first element of masked_indptr to zero.
Code for this:
if X.indptr[0] == X.indptr[1] == 0: masked_indptr[0] = 0

This condition is sufficient to check whether all elements of the first row are zero.
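
The difference between the two options can be sketched as follows. This is an illustration under assumed inputs: the CSR arrays are hypothetical and hand-built so that the first stored entry passes the `eps` test, which is the case where clipping gives the wrong offset.

```python
import numpy as np

# Hypothetical CSR arrays: row 0 is empty, and the first stored
# distance (0.2) is within eps.
indptr = np.array([0, 0, 2, 3])
data = np.array([0.2, 0.9, 0.3])
eps = 0.5

X_mask = data <= eps
cumulative = np.cumsum(X_mask)        # [1, 1, 2]

# mode='clip' turns index -1 into 0, so row 0 reads cumulative[0],
# which is 1 here because X_mask[0] is True.
clipped = cumulative.take(indptr[1:] - 1, mode='clip')
print(clipped)                        # [1 1 2]

# Explicit check: overwrite the first offset when row 0 is empty.
masked_indptr = cumulative[indptr[1:] - 1]
if indptr[0] == indptr[1] == 0:
    masked_indptr[0] = 0
print(masked_indptr)                  # [0 1 2]
```

Clipping only happens to give 0 for the empty first row when the first stored entry is larger than `eps`.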

Akshay0724 · 2017-02-13T15:17:44Z

I have checked that this works correctly and also have committed these changes.

Akshay0724 · 2017-02-13T19:47:23Z

I was talking about my modification.

jnothman · 2017-02-14T09:02:56Z

sklearn/cluster/dbscan_.py

@@ -125,6 +125,10 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None,
        X_mask = X.data <= eps
        masked_indices = astype(X.indices, np.intp, copy=False)[X_mask]
        masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1]
+
+        if X.indptr[0] == X.indptr[1] == 0:  # check if first row is all zero
+            masked_indptr[0] = 0


For the case where multiple initial rows are all zero, this index needs to be [:np.argmin(masked_indptr)].

The other way of doing it is to replace np.cumsum(X_mask)[X.indptr[1:] - 1] with np.concatenate(([0], np.cumsum(X_mask)))[X.indptr[1:]]. Choose whichever you feel is more intuitive (the latter is a bit less efficient, but that pales in comparison to DBSCAN's overall complexity).

Please add a test and a bug fix entry in whats_new.rst. Thanks.

np.concatenate(([0], np.cumsum(X_mask)))[X.indptr[1:]] looks like the best fix, as it does not require any if condition and works for all cases. Thanks for your suggestion, @jnothman.
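
As a rough sketch of why prepending a zero covers several leading empty rows while the first-row check does not, consider the following. The arrays are hypothetical and hand-built for a 4-row matrix whose first two rows store nothing.

```python
import numpy as np

# Hypothetical CSR arrays: rows 0 and 1 are empty.
indptr = np.array([0, 0, 0, 2, 3])
data = np.array([0.2, 0.9, 0.3])
eps = 0.5

X_mask = data <= eps
cumulative = np.cumsum(X_mask)        # [1, 1, 2]

# Original indexing plus the first-row check: only row 0 is repaired,
# row 1 still picks up the wrapped-around last element.
patched = cumulative[indptr[1:] - 1]
if indptr[0] == indptr[1] == 0:
    patched[0] = 0
print(patched)                        # [0 2 1 2]

# Prepending a zero shifts every position by one, so indptr[1:] can be
# used directly and no index is ever negative.
fixed = np.concatenate(([0], cumulative))[indptr[1:]]
print(fixed)                          # [0 0 1 2]
```

With the leading zero in place, an empty row simply looks up position 0 (or repeats the previous row's offset), so no special case is needed.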

jnothman

Otherwise LGTM

jnothman · 2017-02-15T13:11:42Z

sklearn/cluster/dbscan_.py

@@ -124,7 +124,9 @@ def dbscan(X, eps=0.5, min_samples=5, metric='minkowski', metric_params=None,
        X.sum_duplicates()  # XXX: modifies X's internals in-place
        X_mask = X.data <= eps
        masked_indices = astype(X.indices, np.intp, copy=False)[X_mask]
-        masked_indptr = np.cumsum(X_mask)[X.indptr[1:] - 1]
+        masked_indptr = np.concatenate(([0], np.cumsum(X_mask)),
+                                       axis=0)[X.indptr[1:]]


I don't see why you need axis=0; you can just use hstack.

You are right. I have removed it.
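
For 1-D inputs, the point about axis=0 can be checked with a short sketch. The `cumulative` values below are made up for illustration.

```python
import numpy as np

cumulative = np.array([1, 1, 2])      # illustrative values only

# For 1-D arrays, axis=0 is already the default of np.concatenate,
# and np.hstack joins 1-D inputs along that same axis.
with_axis = np.concatenate(([0], cumulative), axis=0)
without_axis = np.concatenate(([0], cumulative))
with_hstack = np.hstack(([0], cumulative))

print(with_axis, without_axis, with_hstack)   # [0 1 1 2] three times
```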

ogrisel

Besides the following phrasing comments in the changelog, +1 on my side as well.

ogrisel · 2017-02-16T09:27:56Z

doc/whats_new.rst

@@ -149,6 +149,10 @@ Enhancements

 Bug fixes
 .........
+   - Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect 
+     result when input is precomputed sparse matrix with initial rows


...when input is a precomputed sparse matrix...

Not addressed

ogrisel · 2017-02-16T09:29:03Z

doc/whats_new.rst

@@ -149,6 +149,10 @@ Enhancements

 Bug fixes
 .........
+   - Fixed a bug where :class:`sklearn.cluster.DBSCAN` gives incorrect 
+     result when input is precomputed sparse matrix with initial rows
+     all zero.


I would have rather said "with all-zeros initial rows" but I am not a native English speaker and I don't know which is more correct.

jnothman · 2017-02-16T13:47:19Z

"with all-zeros ..." is not right. "with all-zero initial rows" would be fine. "with initial rows all zero" is *okay* but seems to suggest that each row is zero (rather than a vector of zeros), which only makes sense if you're a linear algebra person :)


Akshay0724 · 2017-02-16T17:19:03Z

Both "with all-zero initial rows" and "with initial rows all zero" seem correct to me. If you want it changed, tell me and I will commit the change.

Akshay0724 · 2017-02-16T17:19:32Z

Thanks for approving this PR.

Akshay0724 · 2017-02-20T19:00:32Z

Hello @jnothman, can this PR be merged, or do you want some changes to it?
Thanks

jnothman · 2017-02-23T10:29:11Z

Merging. Thanks @Akshay0724

Akshay0724 · 2017-02-23T14:03:18Z

Thanks for merging.



Akshay0724 changed the title ~~Fixes incorrect output for precomputed input~~ [MRG] Fixes incorrect output when input is precomputed sparse matrix. Feb 11, 2017

Akshay0724 changed the title ~~[MRG] Fixes incorrect output when input is precomputed sparse matrix.~~ [MRG] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN. Feb 11, 2017

Fixes incorrect output in DBSCAN

11f0307

Akshay0724 force-pushed the dbscan branch from 8cd7c87 to 11f0307 Compare February 13, 2017 15:10

jnothman reviewed Feb 14, 2017

View reviewed changes

Akshay0724 added 3 commits February 15, 2017 17:21

Added support for multiple initial row all zero

fe924f9

Added a test for issue scikit-learn#8306

4875516

Added entry in whats_new.rst for issue scikit-learn#8306

749733b

jnothman reviewed Feb 15, 2017

View reviewed changes

jnothman changed the title ~~[MRG] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN.~~ [MRG+1] Fixes incorrect output when input is precomputed sparse matrix in DBSCAN. Feb 15, 2017

jnothman added the Bug label Feb 15, 2017

Removed axis=0

dcb5433

ogrisel approved these changes Feb 16, 2017

View reviewed changes

Wording in changelog

62ff80f

jnothman merged commit 2cd1220 into scikit-learn:master Feb 23, 2017

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017

[MRG+1] Fixes incorrect output when input is precomputed sparse matri…

b7a5752

…x in DBSCAN. (scikit-learn#8339)

Przemo10 mentioned this pull request Mar 17, 2017

update fork (#1) #8606

Closed

Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017

[MRG+1] Fixes incorrect output when input is precomputed sparse matri…

6a2933d

…x in DBSCAN. (scikit-learn#8339)

NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017

[MRG+1] Fixes incorrect output when input is precomputed sparse matri…

cb42ea2

…x in DBSCAN. (scikit-learn#8339)

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

[MRG+1] Fixes incorrect output when input is precomputed sparse matri…

761d4e3

…x in DBSCAN. (scikit-learn#8339)

maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017

[MRG+1] Fixes incorrect output when input is precomputed sparse matri…

e12f9df

…x in DBSCAN. (scikit-learn#8339)

lemonlaug pushed a commit to lemonlaug/scikit-learn that referenced this pull request Jan 6, 2021

[MRG+1] Fixes incorrect output when input is precomputed sparse matri…

8ef0402

…x in DBSCAN. (scikit-learn#8339)
