-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] OPTICS correctly handle multiple infs in reachability array. #12029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Please credit yourself for the OPTICS feature in what's new v0.20 |
I'm not sure why the python 2 ci is suddenly failing |
This seems ready to me. A second review is appreciated. |
sklearn/cluster/optics_.py
Outdated
@@ -606,7 +606,13 @@ def _extract_optics(ordering, reachability, maxima_ratio=.75, | |||
""" | |||
|
|||
# Extraction wrapper | |||
reachability = reachability / np.max(reachability[1:]) | |||
# according to the paper (p. 5), for a small enough generative distance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which paper? Maybe at least provide the author and the time.
@@ -751,17 +757,17 @@ def _cluster_tree(node, parent_node, local_maxima_points, | |||
avg_reach2 = np.mean(reachability_plot[node_2.start:(node_2.start | |||
+ check_value_2)]) | |||
|
|||
if ((avg_reach1 / reachability_plot[s]) > maxima_ratio or | |||
(avg_reach2 / reachability_plot[s]) > maxima_ratio): | |||
if ((avg_reach1 / maxima_ratio) > reachability_plot[s] or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Not related) I'm wondering whether we need both maxima_ratio and rejection_ratio. Too many parameters make our OPTICS somehow unfriendly at least to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any insights from your side? @adrinjalali
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand it correctly, setting the two parameters the same, would result in each split point to accept only one of the children, since the next condition is exactly the opposite. But I agree this method has quite a few parameters, which can be confusing specially if the user hasn't read both the papers for OPTICS and this method (which we're calling SQLNK I guess).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I agree this method has quite a few parameters, which can be confusing specially if the user hasn't read both the papers for OPTICS and this method (which we're calling SQLNK I guess).
You find these two parameters in a paper? Which one? I only find one.
if ((average reachability value in any node in NL) / s.r_dist > 0.75)
// if split point s is not significant, ignore s and continue
cluster_tree(N, parent_of_N, L); //
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from these two parameters, some other parameters also seems strange, e.g., significant_min (we're normalizing RD so I don't think it makes sense to users) and the magical check_ratio inside the implementation. I notice that amyxzhang noted that this is "An implementation of the following algorithm, with some minor add-ons". I think we need to check these add-ons carefully. As I missing something @adrinjalali, otherwise I'm going to open an issue. Thanks in advance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @adrinjalali
See #11857
The extraction part of the code assumes only the first
reachability_
to beinf
, but the array may include multipleinf
s. This PR fixes the issue, and fixes a test which used to result in allinf
values. It also adds aValueError
in case all reachability values are inf. And also tests for thatValueError
.This is yet another PR to extract parts of #11857.
Also see #11857 (review)
Ping @jnothman