-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
FEA PairwiseDistancesReductions
: support for Boolean DistanceMetrics
via stable simultaneous sort
#25097
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FEA PairwiseDistancesReductions
: support for Boolean DistanceMetrics
via stable simultaneous sort
#25097
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please sync with main
to get github happy?
sklearn/utils/_sorting.pxd
Outdated
cnp.npy_intp* samples, | ||
cnp.npy_intp start, | ||
cnp.npy_intp end, | ||
) nogil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to export the low-level helper functions like sift_down
, swp
, median3
in the .pxd
file?
Also I don't think we need to export both sort
and introsort
and I would find it more intuitive to rename:
sort
tointrosort
(the main API to call from other modules in scikit-learn),introsort
tointrosort_recursion
or_introsort
or similar to make it more explicit that this function is not meant to be used directly but only useful to expose the prototype imposed by the recursive calls in the algorithm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By the way I think we could add a few lines of doc for each sorting function (assuming we stop exposing the helpers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did iterations locally that I have not yet pushed and that are treating most of your comments.
sklearn/tree/_splitter.pyx
Outdated
from ..utils._sorting cimport ( | ||
sort, | ||
swap, | ||
median3, | ||
introsort, | ||
heapsort | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the impression that only the sort
function is called in the _splitter.pyx
file:
from ..utils._sorting cimport ( | |
sort, | |
swap, | |
median3, | |
introsort, | |
heapsort | |
) | |
from ..utils._sorting cimport sort |
sklearn/utils/_sorting.pyx
Outdated
root = maxind | ||
|
||
|
||
cdef void heapsort(cnp.npy_float32* Xf, cnp.npy_intp* samples, cnp.npy_intp n) nogil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could rename:
Xf
todata
orvalues
samples
toindices
n
tosize
to make the prototype of this function more generic and more intuitive when used in other contexts than for tree growing. Similar comment for the other functions.
@ogrisel: note that this PR is experimental (hence the draft mode): I am cutting corners pragmatically to see if the algorithm can be made stable. |
As of d27f138, |
PairwiseDistancesReductions
: support for Boolean DistanceMetrics
via stable simulataneous sortPairwiseDistancesReductions
: support for Boolean DistanceMetrics
via stable simultaneous sort
We might want to adapt Glide sort to have it sort a structure of arrays. |
Based on a IRL discussion with @ogrisel, we also might be interested in exploring https://github.com/scandum/fluxsort which should be Cython-interfaceable. |
@Micky774, @OmarManzoor or anyone else: Similarly to other PRs of mine, if you are interested in continuing this work, feel free to do so (I do no have time for it currently). |
Closing, might reopen later. |
Reference Issues/PRs
Towards #22587.
What does this implement/fix? Explain your changes.
Use a stable sort to support boolean distance metrics, as explained here:
scikit-learn/sklearn/metrics/_pairwise_distances_reduction/_dispatcher.py
Lines 65 to 71 in 7af5297
Any other comments?
This moves sorting utilities to be able to reuse them in for
PairwiseDistancesReductions
.