Skip to content

DOC Minor updates to OPTICS docstring #31363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 23, 2025

Conversation

lucyleeow
Copy link
Member

@lucyleeow lucyleeow commented May 14, 2025

Reference Issues/PRs

  • Small grammar fixes
  • Update type specification of cluster_method to string options allowed instead of just str
  • Use reference to scipy distance module

What does this implement/fix? Explain your changes.

Any other comments?

Noticed while reviewing #31102

Copy link

github-actions bot commented May 14, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e9c9410. Link to the linter CI: here

Copy link
Member Author

@lucyleeow lucyleeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also some questions regarding the main description.

neighborhood radius. Better suited for usage on large datasets than the
current sklearn implementation of DBSCAN.
current scikit-learn implementation of DBSCAN.

Clusters are then extracted using a DBSCAN-like method
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "then" makes it seem like we should mention the order list e.g.,
"Clusters are then extracted from the order list using...", though not sure what the right term is here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

going into that level of details will be hard, and it'll make this docstring very long, I'm not sure if we should do that here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about just removing the "then" then?

"Clusters are extracted using a DBSCAN-like method ... "

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though:

"Clusters are then extracted order list using a DBSCAN-like method"

isn't that much longer. Or are you saying that if we use the term "order list" we need to explain it further?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for me "then" here implies "as the next step", but sure, you're the native speaker here 😁

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sorry I was meaning more that we could improve the documentation by including the info of what happens before. Or just removing the "then", so it's not confusing as it's not clear atm what the "before" is to the "then".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, happy with that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I used the term 'cluster-order' as it seems to be the term used in the paper.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you forgot to push?

Copy link
Member Author

@lucyleeow lucyleeow May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes done now. Had to do something else in the middle 😬 .

neighborhood radius. Better suited for usage on large datasets than the
current sklearn implementation of DBSCAN.
current scikit-learn implementation of DBSCAN.

Clusters are then extracted using a DBSCAN-like method
(cluster_method = 'dbscan') or an automatic
Copy link
Member Author

@lucyleeow lucyleeow May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the paragraph below (can't add this comment that far down), does the sentence:

" This implementation deviates from the original OPTICS by first performing
k-nearest-neighborhood searches on all points to identify core sizes, then
computing only the distances to unprocessed points when constructing the
cluster order.
"

mean that the original OPTICS algorithm does not only compute unprocessed points, or is it saying that the KNN part differs and this is just the step that follows (but does not differ from the original OPTICS)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the original OPTICS algorithm proposed in the paper calculates core distance and reachability distances for all objects as a first step, which we don't. It's not easy to explain really, I had a hard time converting that to code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this comment is not for the average user, and rather for somebody picky who reads the paper and then comes looks at the implementation here. You could argue it could be a comment in the code maybe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh I think I understand the meaning now.

Does "unprocessed point" mean the same in both implementations? Does the original OPTICS algorithm re-compute reachability distances, updating if they are smaller? i.e., is the second part of the algorithm the same here as in the original paper?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I'd have to read the paper and compare with implementation to be able to answer that, can't really confidently respond to this question easily

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah fair, I just looked at it and am still confused 🙃

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay looking at the pseudo code from OPTICS, I think OPTICS will re-compute reachability distances, for neighbours in the 'heap', updating if smaller, which is the same as our implementation. Note that only the original object is set as 'processed' not the neighbors.

from OPTICS:
image

and our algorithm will also calculate reachability of all unprocessed neighbours, updating existing if smaller:

improved = np.where(rdists < np.take(reachability_, unproc))
reachability_[unproc[improved]] = rdists[improved]
predecessor_[unproc[improved]] = point_index

I've made some changes that may make it clearer, but also it's complicated and I would be happy to leave as is too.

@lucyleeow
Copy link
Member Author

ping @adrinjalali as I looks like you wrote this originally

sidrtx

This comment was marked as spam.

@adrinjalali adrinjalali merged commit a2ceff3 into scikit-learn:main May 23, 2025
36 checks passed
@lucyleeow lucyleeow deleted the doc_optics branch May 23, 2025 12:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants