Skip to content

DOC Minor updates to OPTICS docstring #31363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

lucyleeow
Copy link
Member

@lucyleeow lucyleeow commented May 14, 2025

Reference Issues/PRs

  • Small grammar fixes
  • Update type specification of cluster_method to string options allowed instead of just str
  • Use reference to scipy distance module

What does this implement/fix? Explain your changes.

Any other comments?

Noticed while reviewing #31102

Copy link

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ee57017. Link to the linter CI: here

Copy link
Member Author

@lucyleeow lucyleeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also some questions regarding the main description.

neighborhood radius. Better suited for usage on large datasets than the
current sklearn implementation of DBSCAN.
current scikit-learn implementation of DBSCAN.

Clusters are then extracted using a DBSCAN-like method
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "then" makes it seem like we should mention the order list e.g.,
"Clusters are then extracted from the order list using...", though not sure what the right term is here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going into that level of details will be hard, and it'll make this docstring very long, I'm not sure if we should do that here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just removing the "then" then?

"Clusters are extracted using a DBSCAN-like method ... "

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though:

"Clusters are then extracted order list using a DBSCAN-like method"

isn't that much longer. Or are you saying that if we use the term "order list" we need to explain it further?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for me "then" here implies "as the next step", but sure, you're the native speaker here 😁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I agree that "then" means "as the next step", I was trying to add the info of "next step after what"?

I know above we mention "Unlike DBSCAN, it keeps cluster hierarchy for a variable neighborhood radius", but then we talk about some other stuff, and it is not obvious what step happens before cluster extraction? i.e., what is the "then" following?

Copy link
Member

@adrinjalali adrinjalali May 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. The "then" follows calculating a cluster hierarchy kinda thing before creating any "real" clusters in a "points are clustered into clusters, and each point is a single cluster" kinda sense. Cause in OPTICS, each point can be in multiple clusters in a "hierarchical" sense, and big clusters can have sub clusters.

But in scikit-learn clustering algorithms expose a "labels_" thing which is the clusters for each point.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry I was meaning more that we could improve the documentation by including the info of what happens before. Or just removing the "then", so it's not confusing as it's not clear atm what the "before" is to the "then".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, happy with that

neighborhood radius. Better suited for usage on large datasets than the
current sklearn implementation of DBSCAN.
current scikit-learn implementation of DBSCAN.

Clusters are then extracted using a DBSCAN-like method
(cluster_method = 'dbscan') or an automatic
Copy link
Member Author

@lucyleeow lucyleeow May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the paragraph below (can't add this comment that far down), does the sentence:

" This implementation deviates from the original OPTICS by first performing
k-nearest-neighborhood searches on all points to identify core sizes, then
computing only the distances to unprocessed points when constructing the
cluster order.
"

mean that the original OPTICS algorithm does not only compute unprocessed points, or is it saying that the KNN part differs and this is just the step that follows (but does not differ from the original OPTICS)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the original OPTICS algorithm proposed in the paper calculates core distance and reachability distances for all objects as a first step, which we don't. It's not easy to explain really, I had a hard time converting that to code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this comment is not for the average user, and rather for somebody picky who reads the paper and then comes looks at the implementation here. You could argue it could be a comment in the code maybe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I think I understand the meaning now.

Does "unprocessed point" mean the same in both implementations? Does the original OPTICS algorithm re-compute reachability distances, updating if they are smaller? i.e., is the second part of the algorithm the same here as in the original paper?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I'd have to read the paper and compare with implementation to be able to answer that, can't really confidently respond to this question easily

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah fair, I just looked at it and am still confused 🙃

@lucyleeow
Copy link
Member Author

ping @adrinjalali as I looks like you wrote this originally

sidrtx

This comment was marked as spam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants