Skip to content

[MRG+1] OPTICS correctly handle multiple infs in reachability array. #12029

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Oct 7, 2018

Conversation

adrinjalali
Copy link
Member

See #11857

The extraction part of the code assumes only the first reachability_ to be inf, but the array may include multiple infs. This PR fixes the issue, and fixes a test which used to result in all inf values. It also adds a ValueError in case all reachability values are inf. And also tests for that ValueError.

This is yet another PR to extract parts of #11857.

Also see #11857 (review)

Ping @jnothman

@jnothman jnothman added this to the 0.20 milestone Sep 6, 2018
@jnothman
Copy link
Member

jnothman commented Sep 7, 2018

Please credit yourself for the OPTICS feature in what's new v0.20

@jnothman
Copy link
Member

jnothman commented Sep 7, 2018

I'm not sure why the python 2 ci is suddenly failing

@adrinjalali adrinjalali changed the title OPTICS correctly handle multiple infs in reachability array. [MRG+1] OPTICS correctly handle multiple infs in reachability array. Sep 7, 2018
@adrinjalali
Copy link
Member Author

This seems ready to me. A second review is appreciated.

@jnothman jnothman modified the milestones: 0.20, 0.21 Sep 17, 2018
@@ -606,7 +606,13 @@ def _extract_optics(ordering, reachability, maxima_ratio=.75,
"""

# Extraction wrapper
reachability = reachability / np.max(reachability[1:])
# according to the paper (p. 5), for a small enough generative distance
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which paper? Maybe at least provide the author and the time.

@@ -751,17 +757,17 @@ def _cluster_tree(node, parent_node, local_maxima_points,
avg_reach2 = np.mean(reachability_plot[node_2.start:(node_2.start
+ check_value_2)])

if ((avg_reach1 / reachability_plot[s]) > maxima_ratio or
(avg_reach2 / reachability_plot[s]) > maxima_ratio):
if ((avg_reach1 / maxima_ratio) > reachability_plot[s] or
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Not related) I'm wondering whether we need both maxima_ratio and rejection_ratio. Too many parameters make our OPTICS somehow unfriendly at least to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any insights from your side? @adrinjalali

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand it correctly, setting the two parameters the same, would result in each split point to accept only one of the children, since the next condition is exactly the opposite. But I agree this method has quite a few parameters, which can be confusing specially if the user hasn't read both the papers for OPTICS and this method (which we're calling SQLNK I guess).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I agree this method has quite a few parameters, which can be confusing specially if the user hasn't read both the papers for OPTICS and this method (which we're calling SQLNK I guess).

You find these two parameters in a paper? Which one? I only find one.

if ((average reachability value in any node in NL) / s.r_dist > 0.75)
    // if split point s is not significant, ignore s and continue
    cluster_tree(N, parent_of_N, L); //

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from these two parameters, some other parameters also seems strange, e.g., significant_min (we're normalizing RD so I don't think it makes sense to users) and the magical check_ratio inside the implementation. I notice that amyxzhang noted that this is "An implementation of the following algorithm, with some minor add-ons". I think we need to check these add-ons carefully. As I missing something @adrinjalali, otherwise I'm going to open an issue. Thanks in advance.

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @adrinjalali

@qinhanmin2014 qinhanmin2014 merged commit 2020867 into scikit-learn:master Oct 7, 2018
@adrinjalali adrinjalali deleted the optics/infs branch October 7, 2018 09:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants