-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
MAINT Group all sorting utilities in sklearn.utils._sorting
#25606
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a28418e
to
831ba2a
Compare
sklearn.utils._sorting
sklearn.utils._sorting
As of 3ca2258, I do not face the problem building scikit-learn locally. |
do you have cython >= 3 locally ? |
Indeed, I was using cython>=3.0 locally. f347471 should solve the problem. I am searching to cross-reference a relevant Cython Issue or Pull Request: maybe cython/cython#3617 fixes it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought: Since this is now going to be a "utility" functionality that's intended for usage presumably by any submodule of sklearn, is there any rough direction we can give for which sorting algorithm to use and when, or perhaps that is too ambitious?
Otw, I think the refactoring LGTM and is much desired cuz as a developer, I was confused why there were different sorting mechanisms scattered.
I am thinking that this better be done in a follow-up PR for separation of concerns. What do you think? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks cleaner. Just a few suggestions.
Could you also please run a few existing ASV benchmarks for the estimators that call these functions to make sure that this PR does not introduce any performance regression? This is quite unlikely but since some of those routines are very performance sensitive, I think it's better to check prior to merging the PR.
Also +1 for a follow-up PR with a docstring that explains the pros and cons of each kind
value.
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Let's wait for the benchmarks that Olivier requested before merging
Co-authored-by: jeremiedbb <jeremiedbb@users.noreply.github.com>
@Micky774, @OmarManzoor or anyone else: if you are interested in continuing this work, feel free to do so (I do no have time for it currently). |
Closing, might reopen later. |
Reference Issues/PRs
Precedes #25097.
What does this implement/fix? Explain your changes.
We have various sorting utilities in scikit-learn whose complexity, and worst case scenarios aren't entirely documented and whose behaviors are not tested.
Moreover, in regards to #25097, we might want to agree on an API swap various algorithms e.g. when we require stable sorting algorithms, which is for instance the case when distance metrics are to be used on boolean data.
Hence, this PR aims at documenting, testing, and unifying the API of those algorithms.
Any other comments?