-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] Fix OPTICS processing order #12357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than those pep8 issues, LGTM! Thanks @kno10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @kno10
I think picking the smallest index on ties is reasonable. It seems computational effective, provide us with stable result and I don't think more complex methods can give us better result. I've seen your PR for R dbscan. Hopefully, ELKI, R dbscan and scikit-learn will soon provide users with the same result.
(Alternative) I've not tried ELKI myself (will try it soon) but we'll appreciate if you can somehow provide more detailed instructions for us to reproduce the result on ELKI since we've encountered similar problems before.
Thanks again for your help on OPTICS.
sklearn/cluster/tests/test_optics.py
Outdated
@@ -217,236 +216,168 @@ def test_cluster_sigmin_pruning(reach, n_child, members): | |||
|
|||
|
|||
def test_reach_dists(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This name is no longer valid, replace it with test_optics_ELKI
or anything you like.
sklearn/cluster/tests/test_optics.py
Outdated
|
||
|
||
def test_processing_order(): | ||
"""Unit test for bug in ca. version 0.20, see #12090""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we can remove this comment. This is not a bug because OPTICS is not released.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep the issue number in there, in case someone wonders where this test case comes from.
How about naming this test_processing_order_issue12090 and dropping the comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe remove bug
and version 0.20
? It's not a bug in 0.20. Another thing is that this one is actually a different bug, so I doubt whether it's good to link to #12090.
sklearn/cluster/tests/test_optics.py
Outdated
assert_array_equal(clust.ordering_, np.array(o)) | ||
assert_array_equal(clust.predecessor_[clust.ordering_], np.array(p)) | ||
assert_allclose(clust.reachability_[clust.ordering_], | ||
np.array(r), atol=1e-15) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we don't need such a high precision here? I'm not an expert on these things but I think there might be some unexpected numeric issues in other environments (we only test some environments here). Maybe using the default value provided by assert_allclose
?
# Define return order based on reachability distance | ||
return (unproc[quick_scan(np.take(self.reachability_, unproc), | ||
dists)]) | ||
# Choose next based on smallest reachability distance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment like this : if there're multiple points with the same reachability distance, then we ...
@kno10 What version are you using? I'm unable to reproduce with 0.7.1. I've not looked into it carefully but seems that 0.7.1 and R dbscan are picking largest index on ties? If so, is it possible for ELKI to release 0.7.2 or 0.8? I'm a bit uncomfortable to compare with dev version. |
The commit that prefers lower ids (rather than larger) is quite recent, its not in a release yet. |
Thanks. I'm now able to reproduce your result. Couple of questions: Overall, given the current status of ELKI and your PR for R, I'm fine to merge this one, even without a ELKI release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kno10 Please take some time to address all the review comments and answer my questions above. Thanks :)
sklearn/cluster/tests/test_optics.py
Outdated
@@ -217,236 +216,168 @@ def test_cluster_sigmin_pruning(reach, n_child, members): | |||
|
|||
|
|||
def test_reach_dists(): | |||
# Tests against known extraction array | |||
# Expected values, computed with ELKI using: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please note down the version you're using. Apart from that, current comment seems enough from my side.
(1) The preference for larger ids was never intentional. This was introduced in ELKI to stabilize the results when changing the heap implementation, and since it worked for the purpose of testing it went unnoticed. But there are good reasons to prefer smaller ids, so this was recently changed in ELKI. |
Thanks @kno10 for the detailed reply :)
The release cycle of scikit-learn will generally be >=6 months these days, so if there's a ELKI release at the beginning of next year, that will be fine. (and I think it's acceptable if there's not a ELKI release at that time) Feel free to ping me and I'll push these minor changes to save your time. We have several OPTICS issues to solve and I'd prefer to get this in ASAP. |
The current code may expand the wrong point, because it only considers the neighbors of the current point, not all currently unprocessed points (which may have a better reachability). Using the distance from the latest point as tiebreaker then does not work anymore, because it might not yet be computed for all unprocessed points when using an index. If we choose the first on ties, we get the same result same as R and ELKI. But the order of points in "unproc" is also unstable, so we cannot rely on the first smallest to have the smallest index. Instead of the cython quick_scan, we now use numpy masked arrays.
Pushed a new version. Also added a note that this implementation does not use a heap to the documentation. |
@@ -216,237 +215,167 @@ def test_cluster_sigmin_pruning(reach, n_child, members): | |||
assert_array_equal(members, root.children[0].points) | |||
|
|||
|
|||
def test_reach_dists(): | |||
def test_compare_to_ELKI(): | |||
# Expected values, computed with (future) ELKI 0.7.5 using: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe simply using things like dev version
? It seems strange. I'll take care of it after ELKI is released so you don't need to make unnecessary promise here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @kno10 (feel free to ignore the previous review if you can promise a release)
@qinhanmin2014 I think that this looks good to me-- my one question is if we're happy with @kno10 thank you for taking the time to roll out the patch add to the tests. There's still room for some other small fix-up items in OPTICS that I may open, but nothing directly related to what is addressed in this PR. |
I'm fine with one
Maybe not directly related, but my point here is that RD is used everywhere in OPTICS and it's important to first ensure that we get correct RD. Merging with +3 from @adrinjalali @espg and me. |
* Add a test case for OPTICS bug (closes scikit-learn#12134) * ENH Fix processing order in OPTICS. See scikit-learn#12090 The current code may expand the wrong point, because it only considers the neighbors of the current point, not all currently unprocessed points (which may have a better reachability). Using the distance from the latest point as tiebreaker then does not work anymore, because it might not yet be computed for all unprocessed points when using an index. If we choose the first on ties, we get the same result same as R and ELKI. But the order of points in "unproc" is also unstable, so we cannot rely on the first smallest to have the smallest index. Instead of the cython quick_scan, we now use numpy masked arrays.
* Add a test case for OPTICS bug (closes scikit-learn#12134) * ENH Fix processing order in OPTICS. See scikit-learn#12090 The current code may expand the wrong point, because it only considers the neighbors of the current point, not all currently unprocessed points (which may have a better reachability). Using the distance from the latest point as tiebreaker then does not work anymore, because it might not yet be computed for all unprocessed points when using an index. If we choose the first on ties, we get the same result same as R and ELKI. But the order of points in "unproc" is also unstable, so we cannot rely on the first smallest to have the smallest index. Instead of the cython quick_scan, we now use numpy masked arrays.
This reverts commit e7d8b82.
This reverts commit e7d8b82.
Reference Issues/PRs
Fixes #12090: a subtle bug that only the neighbors of the current point - not earlier neighbors - were considered for expanding clusters in OPTICS.
Fixes #12134: test case is included and passes
What does this implement/fix? Explain your changes.
With these changes, the result is the same as ELKIs, and the unit test contains reference data generated with ELKI. R currently differs, because ELKI changed the order to prefer smaller ids on ties, while R is still replicating older ELKI behavior of prefering higher ids on ties. A pull request for this is open with R dbscan: mhahsler/dbscan#25
Any other comments?
This probably also obsoletes #12054
and helps with resolving #11857 and #12049 and #11677