[MRG+1] Fix OPTICS processing order #12357

kno10 · 2018-10-11T16:44:08Z

Reference Issues/PRs

Fixes #12090: a subtle bug that only the neighbors of the current point - not earlier neighbors - were considered for expanding clusters in OPTICS.
Fixes #12134: test case is included and passes

What does this implement/fix? Explain your changes.

With these changes, the result is the same as ELKIs, and the unit test contains reference data generated with ELKI. R currently differs, because ELKI changed the order to prefer smaller ids on ties, while R is still replicating older ELKI behavior of prefering higher ids on ties. A pull request for this is open with R dbscan: mhahsler/dbscan#25

Any other comments?

This probably also obsoletes #12054
and helps with resolving #11857 and #12049 and #11677

adrinjalali

Other than those pep8 issues, LGTM! Thanks @kno10

qinhanmin2014

LGTM, thanks @kno10
I think picking the smallest index on ties is reasonable. It seems computational effective, provide us with stable result and I don't think more complex methods can give us better result. I've seen your PR for R dbscan. Hopefully, ELKI, R dbscan and scikit-learn will soon provide users with the same result.
(Alternative) I've not tried ELKI myself (will try it soon) but we'll appreciate if you can somehow provide more detailed instructions for us to reproduce the result on ELKI since we've encountered similar problems before.
Thanks again for your help on OPTICS.

qinhanmin2014 · 2018-10-12T02:32:53Z

sklearn/cluster/tests/test_optics.py

@@ -217,236 +216,168 @@ def test_cluster_sigmin_pruning(reach, n_child, members):


 def test_reach_dists():


This name is no longer valid, replace it with test_optics_ELKI or anything you like.

qinhanmin2014 · 2018-10-12T02:35:08Z

sklearn/cluster/tests/test_optics.py

+
+
+def test_processing_order():
+    """Unit test for bug in ca. version 0.20, see #12090"""


I guess we can remove this comment. This is not a bug because OPTICS is not released.

I'd keep the issue number in there, in case someone wonders where this test case comes from.
How about naming this test_processing_order_issue12090 and dropping the comment?

Maybe remove bug and version 0.20 ? It's not a bug in 0.20. Another thing is that this one is actually a different bug, so I doubt whether it's good to link to #12090.

qinhanmin2014 · 2018-10-12T02:50:41Z

ping @jnothman @espg

qinhanmin2014 · 2018-10-12T13:36:26Z

sklearn/cluster/tests/test_optics.py

+    assert_array_equal(clust.ordering_, np.array(o))
+    assert_array_equal(clust.predecessor_[clust.ordering_], np.array(p))
+    assert_allclose(clust.reachability_[clust.ordering_],
+                    np.array(r), atol=1e-15)


I guess we don't need such a high precision here? I'm not an expert on these things but I think there might be some unexpected numeric issues in other environments (we only test some environments here). Maybe using the default value provided by assert_allclose?

qinhanmin2014 · 2018-10-12T13:42:33Z

sklearn/cluster/optics_.py

-        # Define return order based on reachability distance
-        return (unproc[quick_scan(np.take(self.reachability_, unproc),
-                                  dists)])
+        # Choose next based on smallest reachability distance


Maybe add a comment like this : if there're multiple points with the same reachability distance, then we ...

qinhanmin2014 · 2018-10-12T15:00:14Z

@kno10 What version are you using? I'm unable to reproduce with 0.7.1. I've not looked into it carefully but seems that 0.7.1 and R dbscan are picking largest index on ties? If so, is it possible for ELKI to release 0.7.2 or 0.8? I'm a bit uncomfortable to compare with dev version.

kno10 · 2018-10-12T18:56:29Z

The commit that prefers lower ids (rather than larger) is quite recent, its not in a release yet.

qinhanmin2014 · 2018-10-13T02:04:23Z

The commit that prefers lower ids (rather than larger) is quite recent, its not in a release yet.

Thanks. I'm now able to reproduce your result. Couple of questions:
(1) What's the reason for ELKI (and R dbscan if you know) to prefer lager ids when tie previously?
(2) What's the release plan of ELKI (and R dbscan if you know)? I'd prefer to compare with a stable version.
(3) What's the advantages of preferring lower ids, compared with preferring lager ids. In your PR for R dbscan, you said that "Lower ids yields more expected results.". Could you please provide some more details?
(4) Do OPTICSHeap and OPTICSList in ELKI produce the same result?

Overall, given the current status of ELKI and your PR for R, I'm fine to merge this one, even without a ELKI release.

qinhanmin2014

@kno10 Please take some time to address all the review comments and answer my questions above. Thanks :)

qinhanmin2014 · 2018-10-13T02:06:03Z

sklearn/cluster/tests/test_optics.py

@@ -217,236 +216,168 @@ def test_cluster_sigmin_pruning(reach, n_child, members):


 def test_reach_dists():
-    # Tests against known extraction array
+    # Expected values, computed with ELKI using:


Please note down the version you're using. Apart from that, current comment seems enough from my side.

kno10 · 2018-10-13T11:28:11Z

(1) The preference for larger ids was never intentional. This was introduced in ELKI to stabilize the results when changing the heap implementation, and since it worked for the purpose of testing it went unnoticed. But there are good reasons to prefer smaller ids, so this was recently changed in ELKI.
R simply simulates ELKIs behavior to verify their implementation.
(2) An interim ELKI release is long overdue. I would have said September... there are some larger changes prepared, and we want to make a 0.7.5 release before we apply them. But since we want to make our releases citeable, getting appropriate documentation set up requires some effort, and we're just too busy.
(3) if you reorder (and re-enumerate) objects by their cluster order, and run OPTICS again, the order will be stable if you prefer smaller ids. With larger ids, probably every other run would be the same...
(4) they are supposed to, but we currently only test they give the same quality. This supposedly doesn't hold for the old 0.7.1 release yet.

qinhanmin2014 · 2018-10-13T12:01:59Z

Thanks @kno10 for the detailed reply :)

An interim ELKI release is long overdue. I would have said September...

The release cycle of scikit-learn will generally be >=6 months these days, so if there's a ELKI release at the beginning of next year, that will be fine. (and I think it's acceptable if there's not a ELKI release at that time)

Feel free to ping me and I'll push these minor changes to save your time. We have several OPTICS issues to solve and I'd prefer to get this in ASAP.

The current code may expand the wrong point, because it only considers the neighbors of the current point, not all currently unprocessed points (which may have a better reachability). Using the distance from the latest point as tiebreaker then does not work anymore, because it might not yet be computed for all unprocessed points when using an index. If we choose the first on ties, we get the same result same as R and ELKI. But the order of points in "unproc" is also unstable, so we cannot rely on the first smallest to have the smallest index. Instead of the cython quick_scan, we now use numpy masked arrays.

kno10 · 2018-10-13T12:15:51Z

Pushed a new version. Also added a note that this implementation does not use a heap to the documentation.

qinhanmin2014 · 2018-10-13T12:16:05Z

sklearn/cluster/tests/test_optics.py

@@ -216,237 +215,167 @@ def test_cluster_sigmin_pruning(reach, n_child, members):
    assert_array_equal(members, root.children[0].points)


-def test_reach_dists():
+def test_compare_to_ELKI():
+    # Expected values, computed with (future) ELKI 0.7.5 using:


Maybe simply using things like dev version? It seems strange. I'll take care of it after ELKI is released so you don't need to make unnecessary promise here.

qinhanmin2014

LGTM, thanks @kno10 (feel free to ignore the previous review if you can promise a release)

qinhanmin2014 · 2018-10-13T12:28:29Z

ping @espg @jnothman
We now compare with ELKI dev version and a PR in R dbscan (i.e., prefer smaller id when there're multiple points with the same reachability distances).

espg · 2018-10-13T19:26:34Z

@qinhanmin2014 I think that this looks good to me-- my one question is if we're happy with test_compare_to_ELKI() using just the one parameter min_samples=5 ... for #12054 there was some suggestion that we should be using two different min_samples values (i.e., 5 and 7, or 5 and 10). I'm fine with just using min_samples=5 if you are though...

@kno10 thank you for taking the time to roll out the patch add to the tests.

There's still room for some other small fix-up items in OPTICS that I may open, but nothing directly related to what is addressed in this PR.

qinhanmin2014 · 2018-10-14T00:57:31Z

my one question is if we're happy with test_compare_to_ELKI() using just the one parameter min_samples=5

I'm fine with one min_samples, as long as it's not 2.

There's still room for some other small fix-up items in OPTICS that I may open, but nothing directly related to what is addressed in this PR.

Maybe not directly related, but my point here is that RD is used everywhere in OPTICS and it's important to first ensure that we get correct RD.

Merging with +3 from @adrinjalali @espg and me.

* Add a test case for OPTICS bug (closes scikit-learn#12134) * ENH Fix processing order in OPTICS. See scikit-learn#12090 The current code may expand the wrong point, because it only considers the neighbors of the current point, not all currently unprocessed points (which may have a better reachability). Using the distance from the latest point as tiebreaker then does not work anymore, because it might not yet be computed for all unprocessed points when using an index. If we choose the first on ties, we get the same result same as R and ELKI. But the order of points in "unproc" is also unstable, so we cannot rely on the first smallest to have the smallest index. Instead of the cython quick_scan, we now use numpy masked arrays.

This reverts commit e7d8b82.

Add a test case for OPTICS bug (closes scikit-learn#12134)

dd95eba

adrinjalali approved these changes Oct 11, 2018

View reviewed changes

kno10 force-pushed the patch-optics branch from b17e351 to 3b06809 Compare October 11, 2018 23:01

qinhanmin2014 approved these changes Oct 12, 2018

View reviewed changes

qinhanmin2014 added this to the 0.20.1 milestone Oct 12, 2018

qinhanmin2014 modified the milestones: 0.20.1, 0.21 Oct 12, 2018

qinhanmin2014 reviewed Oct 12, 2018

View reviewed changes

qinhanmin2014 approved these changes Oct 13, 2018

View reviewed changes

qinhanmin2014 changed the title ~~Fix OPTICS processing order, #12090~~ [MRG+1] Fix OPTICS processing order Oct 13, 2018

kno10 force-pushed the patch-optics branch from 3b06809 to 51b362b Compare October 13, 2018 12:10

kno10 force-pushed the patch-optics branch from 51b362b to 43a8770 Compare October 13, 2018 12:12

qinhanmin2014 reviewed Oct 13, 2018

View reviewed changes

qinhanmin2014 approved these changes Oct 13, 2018

View reviewed changes

qinhanmin2014 merged commit 0296916 into scikit-learn:master Oct 14, 2018

qinhanmin2014 mentioned this pull request Apr 14, 2019

[MRG+2] OPTICS: add extract_xi method #12077

Merged

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "FIX (0.21) OPTICS processing order (scikit-learn#12357)"

65e8ea0

This reverts commit e7d8b82.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "FIX (0.21) OPTICS processing order (scikit-learn#12357)"

c7629ca

This reverts commit e7d8b82.

		@@ -217,236 +216,168 @@ def test_cluster_sigmin_pruning(reach, n_child, members):


		def test_reach_dists():



		def test_processing_order():
		"""Unit test for bug in ca. version 0.20, see #12090"""

Uh oh!

[MRG+1] Fix OPTICS processing order #12357

[MRG+1] Fix OPTICS processing order #12357

Uh oh!

Conversation

kno10 commented Oct 11, 2018

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Any other comments?

Uh oh!

adrinjalali left a comment

Choose a reason for hiding this comment

Uh oh!

qinhanmin2014 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

qinhanmin2014 commented Oct 12, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

qinhanmin2014 commented Oct 12, 2018

Uh oh!

kno10 commented Oct 12, 2018

Uh oh!

qinhanmin2014 commented Oct 13, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

qinhanmin2014 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kno10 commented Oct 13, 2018

Uh oh!

qinhanmin2014 commented Oct 13, 2018

Uh oh!

kno10 commented Oct 13, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

qinhanmin2014 left a comment

Choose a reason for hiding this comment

Uh oh!

qinhanmin2014 commented Oct 13, 2018

Uh oh!

espg commented Oct 13, 2018

Uh oh!

qinhanmin2014 commented Oct 14, 2018

Uh oh!

Uh oh!

qinhanmin2014 left a comment •

edited

Loading

qinhanmin2014 commented Oct 13, 2018 •

edited

Loading

qinhanmin2014 left a comment •

edited

Loading