Skip to content

FEA PairwiseDistancesReductions: support for Boolean DistanceMetrics via stable simultaneous sort #25097

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

jjerphan
Copy link
Member

@jjerphan jjerphan commented Dec 2, 2022

Reference Issues/PRs

Towards #22587.

What does this implement/fix? Explain your changes.

Use a stable sort to support boolean distance metrics, as explained here:

# In order to support discrete distance metrics, we need to have a
# stable simultaneous sort which preserves the order of the indices
# because there generally is a lot of occurrences for a given values
# of distances in this case.
# TODO: implement a stable simultaneous_sort.
"hamming",
*BOOL_METRICS,

Any other comments?

This moves sorting utilities to be able to reuse them in for PairwiseDistancesReductions.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please sync with main to get github happy?

cnp.npy_intp* samples,
cnp.npy_intp start,
cnp.npy_intp end,
) nogil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to export the low-level helper functions like sift_down, swp, median3 in the .pxd file?

Also I don't think we need to export both sort and introsort and I would find it more intuitive to rename:

  • sort to introsort (the main API to call from other modules in scikit-learn),
  • introsort to introsort_recursion or _introsort or similar to make it more explicit that this function is not meant to be used directly but only useful to expose the prototype imposed by the recursive calls in the algorithm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way I think we could add a few lines of doc for each sorting function (assuming we stop exposing the helpers).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did iterations locally that I have not yet pushed and that are treating most of your comments.

Comment on lines 23 to 29
from ..utils._sorting cimport (
sort,
swap,
median3,
introsort,
heapsort
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the impression that only the sort function is called in the _splitter.pyx file:

Suggested change
from ..utils._sorting cimport (
sort,
swap,
median3,
introsort,
heapsort
)
from ..utils._sorting cimport sort

root = maxind


cdef void heapsort(cnp.npy_float32* Xf, cnp.npy_intp* samples, cnp.npy_intp n) nogil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could rename:

  • Xf to data or values
  • samples to indices
  • n to size

to make the prototype of this function more generic and more intuitive when used in other contexts than for tree growing. Similar comment for the other functions.

@jjerphan
Copy link
Member Author

jjerphan commented Dec 2, 2022

@ogrisel: note that this PR is experimental (hence the draft mode): I am cutting corners pragmatically to see if the algorithm can be made stable.

@jjerphan
Copy link
Member Author

jjerphan commented Dec 2, 2022

As of d27f138, sklearn.utils._sorting.sort might implement an unstable sorting algorithm as some tests (e.g. sklearn/neighbors/tests/test_neighbors.py::test_kneighbors_brute_backend) fail when I add the boolean distance metrics in.

@jjerphan jjerphan changed the title FEA PairwiseDistancesReductions: support for Boolean DistanceMetrics via stable simulataneous sort FEA PairwiseDistancesReductions: support for Boolean DistanceMetrics via stable simultaneous sort Jan 10, 2023
@jjerphan
Copy link
Member Author

jjerphan commented Feb 4, 2023

We might want to adapt Glide sort to have it sort a structure of arrays.

@jjerphan
Copy link
Member Author

Based on a IRL discussion with @ogrisel, we also might be interested in exploring https://github.com/scandum/fluxsort which should be Cython-interfaceable.

@jjerphan
Copy link
Member Author

@Micky774, @OmarManzoor or anyone else: Similarly to other PRs of mine, if you are interested in continuing this work, feel free to do so (I do no have time for it currently).

@jjerphan
Copy link
Member Author

jjerphan commented Nov 2, 2023

Closing, might reopen later.

@jjerphan jjerphan closed this Nov 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants