Skip to content

[MRG+1] Fix OPTICS processing order #12357

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Oct 14, 2018

Conversation

kno10
Copy link
Contributor

@kno10 kno10 commented Oct 11, 2018

Reference Issues/PRs

Fixes #12090: a subtle bug that only the neighbors of the current point - not earlier neighbors - were considered for expanding clusters in OPTICS.
Fixes #12134: test case is included and passes

What does this implement/fix? Explain your changes.

With these changes, the result is the same as ELKIs, and the unit test contains reference data generated with ELKI. R currently differs, because ELKI changed the order to prefer smaller ids on ties, while R is still replicating older ELKI behavior of prefering higher ids on ties. A pull request for this is open with R dbscan: mhahsler/dbscan#25

Any other comments?

This probably also obsoletes #12054
and helps with resolving #11857 and #12049 and #11677

Copy link
Member

@adrinjalali adrinjalali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than those pep8 issues, LGTM! Thanks @kno10

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @kno10
I think picking the smallest index on ties is reasonable. It seems computational effective, provide us with stable result and I don't think more complex methods can give us better result. I've seen your PR for R dbscan. Hopefully, ELKI, R dbscan and scikit-learn will soon provide users with the same result.
(Alternative) I've not tried ELKI myself (will try it soon) but we'll appreciate if you can somehow provide more detailed instructions for us to reproduce the result on ELKI since we've encountered similar problems before.
Thanks again for your help on OPTICS.

@@ -217,236 +216,168 @@ def test_cluster_sigmin_pruning(reach, n_child, members):


def test_reach_dists():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is no longer valid, replace it with test_optics_ELKI or anything you like.



def test_processing_order():
"""Unit test for bug in ca. version 0.20, see #12090"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we can remove this comment. This is not a bug because OPTICS is not released.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep the issue number in there, in case someone wonders where this test case comes from.
How about naming this test_processing_order_issue12090 and dropping the comment?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove bug and version 0.20 ? It's not a bug in 0.20. Another thing is that this one is actually a different bug, so I doubt whether it's good to link to #12090.

@qinhanmin2014 qinhanmin2014 added this to the 0.20.1 milestone Oct 12, 2018
@qinhanmin2014
Copy link
Member

ping @jnothman @espg

@qinhanmin2014 qinhanmin2014 modified the milestones: 0.20.1, 0.21 Oct 12, 2018
assert_array_equal(clust.ordering_, np.array(o))
assert_array_equal(clust.predecessor_[clust.ordering_], np.array(p))
assert_allclose(clust.reachability_[clust.ordering_],
np.array(r), atol=1e-15)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we don't need such a high precision here? I'm not an expert on these things but I think there might be some unexpected numeric issues in other environments (we only test some environments here). Maybe using the default value provided by assert_allclose?

# Define return order based on reachability distance
return (unproc[quick_scan(np.take(self.reachability_, unproc),
dists)])
# Choose next based on smallest reachability distance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment like this : if there're multiple points with the same reachability distance, then we ...

@qinhanmin2014
Copy link
Member

@kno10 What version are you using? I'm unable to reproduce with 0.7.1. I've not looked into it carefully but seems that 0.7.1 and R dbscan are picking largest index on ties? If so, is it possible for ELKI to release 0.7.2 or 0.8? I'm a bit uncomfortable to compare with dev version.

@kno10
Copy link
Contributor Author

kno10 commented Oct 12, 2018

The commit that prefers lower ids (rather than larger) is quite recent, its not in a release yet.

@qinhanmin2014
Copy link
Member

qinhanmin2014 commented Oct 13, 2018

The commit that prefers lower ids (rather than larger) is quite recent, its not in a release yet.

Thanks. I'm now able to reproduce your result. Couple of questions:
(1) What's the reason for ELKI (and R dbscan if you know) to prefer lager ids when tie previously?
(2) What's the release plan of ELKI (and R dbscan if you know)? I'd prefer to compare with a stable version.
(3) What's the advantages of preferring lower ids, compared with preferring lager ids. In your PR for R dbscan, you said that "Lower ids yields more expected results.". Could you please provide some more details?
(4) Do OPTICSHeap and OPTICSList in ELKI produce the same result?

Overall, given the current status of ELKI and your PR for R, I'm fine to merge this one, even without a ELKI release.

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kno10 Please take some time to address all the review comments and answer my questions above. Thanks :)

@@ -217,236 +216,168 @@ def test_cluster_sigmin_pruning(reach, n_child, members):


def test_reach_dists():
# Tests against known extraction array
# Expected values, computed with ELKI using:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note down the version you're using. Apart from that, current comment seems enough from my side.

@qinhanmin2014 qinhanmin2014 changed the title Fix OPTICS processing order, #12090 [MRG+1] Fix OPTICS processing order Oct 13, 2018
@kno10
Copy link
Contributor Author

kno10 commented Oct 13, 2018

(1) The preference for larger ids was never intentional. This was introduced in ELKI to stabilize the results when changing the heap implementation, and since it worked for the purpose of testing it went unnoticed. But there are good reasons to prefer smaller ids, so this was recently changed in ELKI.
R simply simulates ELKIs behavior to verify their implementation.
(2) An interim ELKI release is long overdue. I would have said September... there are some larger changes prepared, and we want to make a 0.7.5 release before we apply them. But since we want to make our releases citeable, getting appropriate documentation set up requires some effort, and we're just too busy.
(3) if you reorder (and re-enumerate) objects by their cluster order, and run OPTICS again, the order will be stable if you prefer smaller ids. With larger ids, probably every other run would be the same...
(4) they are supposed to, but we currently only test they give the same quality. This supposedly doesn't hold for the old 0.7.1 release yet.

@qinhanmin2014
Copy link
Member

Thanks @kno10 for the detailed reply :)

An interim ELKI release is long overdue. I would have said September...

The release cycle of scikit-learn will generally be >=6 months these days, so if there's a ELKI release at the beginning of next year, that will be fine. (and I think it's acceptable if there's not a ELKI release at that time)

Feel free to ping me and I'll push these minor changes to save your time. We have several OPTICS issues to solve and I'd prefer to get this in ASAP.

The current code may expand the wrong point, because it only considers
the neighbors of the current point, not all currently unprocessed points
(which may have a better reachability).

Using the distance from the latest point as tiebreaker then does not
work anymore, because it might not yet be computed for all unprocessed
points when using an index.

If we choose the first on ties, we get the same result same as R and ELKI.

But the order of points in "unproc" is also unstable, so we cannot rely
on the first smallest to have the smallest index.

Instead of the cython quick_scan, we now use numpy masked arrays.
@kno10
Copy link
Contributor Author

kno10 commented Oct 13, 2018

Pushed a new version. Also added a note that this implementation does not use a heap to the documentation.

@@ -216,237 +215,167 @@ def test_cluster_sigmin_pruning(reach, n_child, members):
assert_array_equal(members, root.children[0].points)


def test_reach_dists():
def test_compare_to_ELKI():
# Expected values, computed with (future) ELKI 0.7.5 using:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe simply using things like dev version? It seems strange. I'll take care of it after ELKI is released so you don't need to make unnecessary promise here.

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @kno10 (feel free to ignore the previous review if you can promise a release)

@qinhanmin2014
Copy link
Member

ping @espg @jnothman
We now compare with ELKI dev version and a PR in R dbscan (i.e., prefer smaller id when there're multiple points with the same reachability distances).

@espg
Copy link
Contributor

espg commented Oct 13, 2018

@qinhanmin2014 I think that this looks good to me-- my one question is if we're happy with test_compare_to_ELKI() using just the one parameter min_samples=5 ... for #12054 there was some suggestion that we should be using two different min_samples values (i.e., 5 and 7, or 5 and 10). I'm fine with just using min_samples=5 if you are though...

@kno10 thank you for taking the time to roll out the patch add to the tests.

There's still room for some other small fix-up items in OPTICS that I may open, but nothing directly related to what is addressed in this PR.

@qinhanmin2014
Copy link
Member

my one question is if we're happy with test_compare_to_ELKI() using just the one parameter min_samples=5

I'm fine with one min_samples, as long as it's not 2.

There's still room for some other small fix-up items in OPTICS that I may open, but nothing directly related to what is addressed in this PR.

Maybe not directly related, but my point here is that RD is used everywhere in OPTICS and it's important to first ensure that we get correct RD.

Merging with +3 from @adrinjalali @espg and me.

@qinhanmin2014 qinhanmin2014 merged commit 0296916 into scikit-learn:master Oct 14, 2018
anuragkapale pushed a commit to anuragkapale/scikit-learn that referenced this pull request Oct 23, 2018
* Add a test case for OPTICS bug (closes scikit-learn#12134)

* ENH Fix processing order in OPTICS. See scikit-learn#12090

The current code may expand the wrong point, because it only considers
the neighbors of the current point, not all currently unprocessed points
(which may have a better reachability).

Using the distance from the latest point as tiebreaker then does not
work anymore, because it might not yet be computed for all unprocessed
points when using an index.

If we choose the first on ties, we get the same result same as R and ELKI.

But the order of points in "unproc" is also unstable, so we cannot rely
on the first smallest to have the smallest index.

Instead of the cython quick_scan, we now use numpy masked arrays.
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
* Add a test case for OPTICS bug (closes scikit-learn#12134)

* ENH Fix processing order in OPTICS. See scikit-learn#12090

The current code may expand the wrong point, because it only considers
the neighbors of the current point, not all currently unprocessed points
(which may have a better reachability).

Using the distance from the latest point as tiebreaker then does not
work anymore, because it might not yet be computed for all unprocessed
points when using an index.

If we choose the first on ties, we get the same result same as R and ELKI.

But the order of points in "unproc" is also unstable, so we cannot rely
on the first smallest to have the smallest index.

Instead of the cython quick_scan, we now use numpy masked arrays.
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OPTICS test_reach_dists inappropriate
4 participants