-
-
Notifications
You must be signed in to change notification settings - Fork 25.9k
_weighted_percentile
NaN handling with array API
#31368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I think it would be nice to define a handling of NaNs for |
Agreed, it's on my to-do list. The outcome may just be that the spec explicitly states that nans handling method is left to the package though. |
So there is a note that states:
see: https://data-apis.org/array-api/latest/API_specification/sorting_functions.html So while it is not guaranteed to work, I think our implementation is acceptable. Maybe it's just like our assumption that arrays are stored 'C' order. |
I see now, and I have also missed that note. I agree that since the currently supported array libraries in scikit-learn all sort NaNs to the end (that is with the Thanks for looking into this! ❤ |
we assume arrays are stored by default C-contiguous (vs F-contiguous) (I don't have a default reference explaining this but you should be able to find articles on this). |
Yea, I had read about this, but I think somewhere in numpy's docs. It's a question on what language (C or Fortran) the way of storing arrays is optimised for. I didn't find any directive on how scikit-learn deals with this. For the time being, I will just accept that we prefer C-ordered arrays then. |
There isn't necessarily anything to fix here, but I thought it would be useful to open this for documentation, at least.
_weighted_percentile
added support for NaN in #29034 and support for array APIs in #29431.Our implementation relys on
sort
putting NaN values at the end:scikit-learn/sklearn/utils/stats.py
Lines 70 to 74 in 8cfc72b
AFAICT (confirmed by @ev-br) array API specs do not specify how
sort
should handle NaN, which means it is left to individual packages to determine.float('nan')
andtorch.nan
) but this is not mentioned in the docs. There is some discussion of ordering NaN as the largest value here: PyTorch NaN behavior and API design pytorch/pytorch#46544 (comment) and a related issue about negative NaN here: [MPS]sort
incorrectly handles 'negative' NaNs pytorch/pytorch#116567As everything works, I don't think we need to do anything here (especially as we ultimately want to drop maintaining our own quantile function), but just thought it would be useful to document.
cc @StefanieSenger @ogrisel
The text was updated successfully, but these errors were encountered: