-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
DOC Minor updates to OPTICS
docstring
#31363
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also some questions regarding the main description.
sklearn/cluster/_optics.py
Outdated
neighborhood radius. Better suited for usage on large datasets than the | ||
current sklearn implementation of DBSCAN. | ||
current scikit-learn implementation of DBSCAN. | ||
|
||
Clusters are then extracted using a DBSCAN-like method |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "then" makes it seem like we should mention the order list e.g.,
"Clusters are then extracted from the order list using...", though not sure what the right term is here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going into that level of details will be hard, and it'll make this docstring very long, I'm not sure if we should do that here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about just removing the "then" then?
"Clusters are extracted using a DBSCAN-like method ... "
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though:
"Clusters are then extracted order list using a DBSCAN-like method"
isn't that much longer. Or are you saying that if we use the term "order list" we need to explain it further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for me "then" here implies "as the next step", but sure, you're the native speaker here 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sorry I was meaning more that we could improve the documentation by including the info of what happens before. Or just removing the "then", so it's not confusing as it's not clear atm what the "before" is to the "then".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, happy with that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I used the term 'cluster-order' as it seems to be the term used in the paper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you forgot to push?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes done now. Had to do something else in the middle 😬 .
sklearn/cluster/_optics.py
Outdated
neighborhood radius. Better suited for usage on large datasets than the | ||
current sklearn implementation of DBSCAN. | ||
current scikit-learn implementation of DBSCAN. | ||
|
||
Clusters are then extracted using a DBSCAN-like method | ||
(cluster_method = 'dbscan') or an automatic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the paragraph below (can't add this comment that far down), does the sentence:
" This implementation deviates from the original OPTICS by first performing
k-nearest-neighborhood searches on all points to identify core sizes, then
computing only the distances to unprocessed points when constructing the
cluster order."
mean that the original OPTICS algorithm does not only compute unprocessed points, or is it saying that the KNN part differs and this is just the step that follows (but does not differ from the original OPTICS)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the original OPTICS algorithm proposed in the paper calculates core distance and reachability distances for all objects as a first step, which we don't. It's not easy to explain really, I had a hard time converting that to code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this comment is not for the average user, and rather for somebody picky who reads the paper and then comes looks at the implementation here. You could argue it could be a comment in the code maybe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh I think I understand the meaning now.
Does "unprocessed point" mean the same in both implementations? Does the original OPTICS algorithm re-compute reachability distances, updating if they are smaller? i.e., is the second part of the algorithm the same here as in the original paper?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I'd have to read the paper and compare with implementation to be able to answer that, can't really confidently respond to this question easily
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah fair, I just looked at it and am still confused 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay looking at the pseudo code from OPTICS, I think OPTICS will re-compute reachability distances, for neighbours in the 'heap', updating if smaller, which is the same as our implementation. Note that only the original object is set as 'processed' not the neighbors.
and our algorithm will also calculate reachability of all unprocessed neighbours, updating existing if smaller:
scikit-learn/sklearn/cluster/_optics.py
Lines 712 to 714 in d077f82
improved = np.where(rdists < np.take(reachability_, unproc)) | |
reachability_[unproc[improved]] = rdists[improved] | |
predecessor_[unproc[improved]] = point_index |
I've made some changes that may make it clearer, but also it's complicated and I would be happy to leave as is too.
ping @adrinjalali as I looks like you wrote this originally |
Reference Issues/PRs
cluster_method
to string options allowed instead of juststr
distance
moduleWhat does this implement/fix? Explain your changes.
Any other comments?
Noticed while reviewing #31102