Skip to content

MAINT Group all sorting utilities in sklearn.utils._sorting #25606

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 20 commits into from

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Feb 14, 2023

Reference Issues/PRs

Precedes #25097.

What does this implement/fix? Explain your changes.

We have various sorting utilities in scikit-learn whose complexity, and worst case scenarios aren't entirely documented and whose behaviors are not tested.

Moreover, in regards to #25097, we might want to agree on an API swap various algorithms e.g. when we require stable sorting algorithms, which is for instance the case when distance metrics are to be used on boolean data.

Hence, this PR aims at documenting, testing, and unifying the API of those algorithms.

Any other comments?

@jjerphan jjerphan force-pushed the maint/factor-sorting-utils branch from a28418e to 831ba2a Compare February 14, 2023 12:57
@jjerphan jjerphan changed the title MAINT Move all sorting utilities to sklearn.utils._sorting MAINT Group all sorting utilities in sklearn.utils._sorting Mar 20, 2023
@jjerphan
Copy link
Member Author

As of 3ca2258, I do not face the problem building scikit-learn locally.

@jeremiedbb
Copy link
Member

do you have cython >= 3 locally ?

@jjerphan
Copy link
Member Author

jjerphan commented Mar 21, 2023

do you have cython >= 3 locally ?

Indeed, I was using cython>=3.0 locally.

f347471 should solve the problem. I am searching to cross-reference a relevant Cython Issue or Pull Request: maybe cython/cython#3617 fixes it?

Copy link
Member

@adam2392 adam2392 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought: Since this is now going to be a "utility" functionality that's intended for usage presumably by any submodule of sklearn, is there any rough direction we can give for which sorting algorithm to use and when, or perhaps that is too ambitious?

Otw, I think the refactoring LGTM and is much desired cuz as a developer, I was confused why there were different sorting mechanisms scattered.

@jjerphan
Copy link
Member Author

jjerphan commented Mar 23, 2023

is there any rough direction we can give for which sorting algorithm to use and when, or perhaps that is too ambitious?

I am thinking that this better be done in a follow-up PR for separation of concerns.

What do you think?

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks cleaner. Just a few suggestions.

Could you also please run a few existing ASV benchmarks for the estimators that call these functions to make sure that this PR does not introduce any performance regression? This is quite unlikely but since some of those routines are very performance sensitive, I think it's better to check prior to merging the PR.

Also +1 for a follow-up PR with a docstring that explains the pros and cons of each kind value.

jjerphan and others added 2 commits March 24, 2023 09:40
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Copy link
Member

@jeremiedbb jeremiedbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's wait for the benchmarks that Olivier requested before merging

jjerphan and others added 2 commits March 24, 2023 17:40
Co-authored-by: jeremiedbb <jeremiedbb@users.noreply.github.com>
@jjerphan
Copy link
Member Author

@Micky774, @OmarManzoor or anyone else: if you are interested in continuing this work, feel free to do so (I do no have time for it currently).

@jjerphan
Copy link
Member Author

jjerphan commented Nov 2, 2023

Closing, might reopen later.

@jjerphan jjerphan closed this Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants