MAINT Group all sorting utilities in `sklearn.utils._sorting` #25606

jjerphan · 2023-02-14T10:30:07Z

Reference Issues/PRs

Precedes #25097.

What does this implement/fix? Explain your changes.

We have various sorting utilities in scikit-learn whose complexity and worst-case scenarios aren't entirely documented, and whose behavior is not tested.

Moreover, with regard to #25097, we might want to agree on an API for swapping between algorithms, e.g. when we require a stable sorting algorithm, which is for instance the case when distance metrics are used on boolean data.
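To illustrate why stability matters here: distances computed on boolean data take only a few distinct values, so ties are frequent, and a stable sort keeps tied entries in their input order. The following is a minimal pure-Python sketch of that property, not the Cython code from this PR; `stable_argsort` is a hypothetical helper name and the distance values are made up for illustration.

```python
def stable_argsort(values):
    """Return the indices that sort `values` in ascending order.

    Python's built-in sort is guaranteed to be stable, so entries with
    equal values keep their original relative order.
    """
    return sorted(range(len(values)), key=values.__getitem__)


# Made-up distances with many ties, as is typical for boolean data.
distances = [0.5, 0.25, 0.5, 0.25, 0.0]
order = stable_argsort(distances)

# Ties (indices 1 and 3, then 0 and 2) stay in input order.
print(order)
```

With an unstable algorithm, the relative order of indices 1 and 3 (or 0 and 2) could differ between implementations, which is what makes results hard to compare across code paths.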

Hence, this PR aims at documenting, testing, and unifying the API of those algorithms.
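As a rough illustration of the kind of behavior such tests can pin down, here is a pure-Python stand-in for a simultaneous sort (sorting values while applying the same permutation to a companion index array) together with a sanity check. This is a sketch under assumptions: the function names, the quicksort variant, and the check are hypothetical and do not reproduce the Cython routines in `sklearn/utils/_sorting.pyx`.

```python
import random


def simultaneous_sort(values, indices):
    """Sort `values` in place (ascending), applying the same swaps to `indices`.

    Plain quicksort: not stable, so tied values may end up in any order.
    """
    def swap(i, j):
        values[i], values[j] = values[j], values[i]
        indices[i], indices[j] = indices[j], indices[i]

    def quicksort(lo, hi):
        # Sorts the inclusive range [lo, hi].
        while lo < hi:
            # Move the middle element to the end and use it as the pivot.
            swap((lo + hi) // 2, hi)
            pivot = values[hi]
            store = lo
            for i in range(lo, hi):
                if values[i] < pivot:
                    swap(i, store)
                    store += 1
            swap(store, hi)  # The pivot is now at its final position.
            # Recurse into the smaller side and loop on the larger one
            # to keep the recursion depth logarithmic.
            if store - lo < hi - store:
                quicksort(lo, store - 1)
                lo = store + 1
            else:
                quicksort(store + 1, hi)
                hi = store - 1

    quicksort(0, len(values) - 1)


def check_simultaneous_sort(n=200, seed=0):
    """Sanity check: values end up sorted and indices still point at them."""
    rng = random.Random(seed)
    original = [rng.random() for _ in range(n)]
    values = list(original)
    indices = list(range(n))
    simultaneous_sort(values, indices)
    assert values == sorted(original)
    assert all(original[i] == v for i, v in zip(indices, values))
    return True
```

Calling `check_simultaneous_sort()` compares the result against Python's built-in `sorted` as a reference, which is the same idea as checking a compiled routine against a trusted implementation.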

Any other comments?

sklearn/utils/_sorting.pxd

sklearn/neighbors/tests/test_neighbors_tree.py

sklearn/utils/_sorting.pyx

jjerphan · 2023-03-21T09:36:45Z

As of 3ca2258, I do not face the problem building scikit-learn locally.

jeremiedbb · 2023-03-21T14:38:08Z

do you have cython >= 3 locally ?

jjerphan · 2023-03-21T17:56:25Z

> do you have cython >= 3 locally ?

Indeed, I was using cython>=3.0 locally.

f347471 should solve the problem. I am looking for a relevant Cython issue or pull request to cross-reference: maybe cython/cython#3617 fixes it?

sklearn/utils/_sorting.pyx

adam2392

Just a thought: Since this is now going to be a "utility" functionality that's intended for usage presumably by any submodule of sklearn, is there any rough direction we can give for which sorting algorithm to use and when, or perhaps that is too ambitious?

Otherwise, the refactoring LGTM and is much desired: as a developer, I was confused about why different sorting mechanisms were scattered around the codebase.

jjerphan · 2023-03-23T07:06:42Z

> is there any rough direction we can give for which sorting algorithm to use and when, or perhaps that is too ambitious?

I think this would be better done in a follow-up PR, for separation of concerns.

What do you think?

ogrisel

This looks cleaner. Just a few suggestions.

Could you also please run a few existing ASV benchmarks for the estimators that call these functions to make sure that this PR does not introduce any performance regression? This is quite unlikely but since some of those routines are very performance sensitive, I think it's better to check prior to merging the PR.

Also +1 for a follow-up PR with a docstring that explains the pros and cons of each `kind` value.

sklearn/utils/_sorting.pxd

sklearn/utils/_sorting.pyx

sklearn/utils/tests/test_sorting.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jeremiedbb

LGTM. Let's wait for the benchmarks that Olivier requested before merging.

sklearn/utils/_sorting.pyx

sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx.tp

sklearn/tree/_splitter.pyx

Co-authored-by: jeremiedbb <jeremiedbb@users.noreply.github.com>

jjerphan · 2023-08-22T12:25:32Z

@Micky774, @OmarManzoor or anyone else: if you are interested in continuing this work, feel free to do so (I do not have time for it currently).

jjerphan · 2023-11-02T17:49:19Z

Closing, might reopen later.

github-actions bot added the cython label Feb 14, 2023

jjerphan added 6 commits February 14, 2023 13:57

MAINT Move sorting utilities to sklearn.utils._sorting

e8d4c6f

Use floating for sorting routines

42367fc

Replace simultaneous_sort with sort in concrete `PairwiseDistancesReductions`

f5b593c

Only expose sort and simultaneous_sort

a8fcdb9

Use cnp.intp_t directly and remove duplicated logic

de756be

Propagate right sorting interfaces

831ba2a

jjerphan force-pushed the maint/factor-sorting-utils branch from a28418e to 831ba2a Compare February 14, 2023 12:57

jjerphan added 3 commits February 14, 2023 17:25

MAINT Make signature of sorting utilities uniform

4d81dd1

TST Add sanity check test

8176a54

Merge branch 'main' into maint/factor-sorting-utils

2cbe56c

jjerphan added the No Changelog Needed label Feb 27, 2023

jjerphan mentioned this pull request Mar 10, 2023

PERF PairwiseDistancesReductions initial work #22587

Closed

jjerphan marked this pull request as ready for review March 20, 2023 09:29

jjerphan added the Waiting for Reviewer label Mar 20, 2023

jjerphan added 2 commits March 20, 2023 10:29

Correct _simultaneous_sort

a73cb9b

Rename aliases

16d579f

jjerphan commented Mar 20, 2023

sklearn/utils/_sorting.pxd

Complete TODO comment

f6c5f4b

jjerphan commented Mar 20, 2023

sklearn/neighbors/tests/test_neighbors_tree.py

jjerphan changed the title ~~MAINT Move all sorting utilities to sklearn.utils._sorting~~ MAINT Group all sorting utilities in sklearn.utils._sorting Mar 20, 2023

TST Use np.intp over np.in64 for portability

83a82aa

adam2392 reviewed Mar 20, 2023

sklearn/utils/_sorting.pyx

jjerphan commented Mar 20, 2023

sklearn/utils/_sorting.pyx

Merge branch 'main' into maint/factor-sorting-utils

3ca2258

jjerphan added 2 commits March 21, 2023 19:48

Work-around Cython type inference limitations

f347471

DOC Document _simultaneous_sort

a1d7194

adam2392 reviewed Mar 23, 2023

sklearn/utils/_sorting.pyx

adam2392 reviewed Mar 23, 2023

ogrisel approved these changes Mar 23, 2023

sklearn/utils/_sorting.pxd

sklearn/utils/_sorting.pyx

sklearn/utils/tests/test_sorting.py

sklearn/utils/tests/test_sorting.py

adam2392 approved these changes Mar 23, 2023

jjerphan and others added 2 commits March 24, 2023 09:40

DOC Rename quick_sort to quicksort

d3f33c6

TST Update tests

b25345a

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

jeremiedbb approved these changes Mar 24, 2023

sklearn/utils/_sorting.pyx

sklearn/metrics/_pairwise_distances_reduction/_argkmin.pyx.tp

sklearn/tree/_splitter.pyx

jjerphan and others added 2 commits March 24, 2023 17:40

Use libc.math.log2 directly

3f65032

Do not use aliases

f7bd05d

Co-authored-by: jeremiedbb <jeremiedbb@users.noreply.github.com>

jjerphan closed this Nov 2, 2023
