Skip to content

_weighted_percentile NaN handling with array API #31368

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
lucyleeow opened this issue May 15, 2025 · 6 comments
Open

_weighted_percentile NaN handling with array API #31368

lucyleeow opened this issue May 15, 2025 · 6 comments

Comments

@lucyleeow
Copy link
Member

There isn't necessarily anything to fix here, but I thought it would be useful to open this for documentation, at least.


_weighted_percentile added support for NaN in #29034 and support for array APIs in #29431.

Our implementation relys on sort putting NaN values at the end:

largest_value_per_column = array[
sorted_idx[-1, ...], xp.arange(n_features, device=device)
]
# NaN values get sorted to end (largest value)
if xp.any(xp.isnan(largest_value_per_column)):

AFAICT (confirmed by @ev-br) array API specs do not specify how sort should handle NaN, which means it is left to individual packages to determine.

As everything works, I don't think we need to do anything here (especially as we ultimately want to drop maintaining our own quantile function), but just thought it would be useful to document.

cc @StefanieSenger @ogrisel

@github-actions github-actions bot added the Needs Triage Issue requires triage label May 15, 2025
@StefanieSenger
Copy link
Contributor

I think it would be nice to define a handling of NaNs for xp.sort (and other functions?) in the spec, in order to define the status quo as a standard (also for future array libraries that want to adopt the array api standard). Maybe we open an issue in the data-apis repo?

@lucyleeow
Copy link
Member Author

lucyleeow commented May 16, 2025

Agreed, it's on my to-do list. The outcome may just be that the spec explicitly states that nans handling method is left to the package though.

@lucyleeow
Copy link
Member Author

So there is a note that states:

the sort order of NaNs and signed zeros is unspecified and thus implementation-dependent.

see: https://data-apis.org/array-api/latest/API_specification/sorting_functions.html

So while it is not guaranteed to work, I think our implementation is acceptable. Maybe it's just like our assumption that arrays are stored 'C' order.

@StefanieSenger
Copy link
Contributor

I see now, and I have also missed that note.

I agree that since the currently supported array libraries in scikit-learn all sort NaNs to the end (that is with the C order(?)), we're fine for now, and in case a new array library would handle this differently, there will be many places where this problem would pop up and then this will be re-discussed.

Thanks for looking into this! ❤

@lesteve lesteve removed the Needs Triage Issue requires triage label May 19, 2025
@lucyleeow
Copy link
Member Author

that is with the C order(?)

we assume arrays are stored by default C-contiguous (vs F-contiguous) (I don't have a default reference explaining this but you should be able to find articles on this).
I vaguely recall a note somewhere in the code about us assuming arrays are C-contiguous but I can't find it now.

@StefanieSenger
Copy link
Contributor

Yea, I had read about this, but I think somewhere in numpy's docs. It's a question on what language (C or Fortran) the way of storing arrays is optimised for. I didn't find any directive on how scikit-learn deals with this. For the time being, I will just accept that we prefer C-ordered arrays then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants