`_weighted_percentile` NaN handling with array API #31368

lucyleeow · 2025-05-15T12:10:19Z

There isn't necessarily anything to fix here, but I thought it would be useful to open this for documentation, at least.

_weighted_percentile added support for NaN in #29034 and support for array APIs in #29431.

Our implementation relies on sort putting NaN values at the end:

scikit-learn/sklearn/utils/stats.py

Lines 70 to 74 in 8cfc72b

    
    largest_value_per_column = array[
        sorted_idx[-1, ...], xp.arange(n_features, device=device)
    ]
    # NaN values get sorted to end (largest value)
    if xp.any(xp.isnan(largest_value_per_column)):
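To make the reliance on sort order concrete, here is a minimal sketch of the same check using plain NumPy instead of the array API namespace `xp`. The toy data and the name `has_nan_per_column` are made up for illustration; only the indexing pattern mirrors the snippet above.

```python
import numpy as np

# Toy data (assumed for illustration): 3 samples, 2 features.
# Only the first column contains a NaN.
array = np.array([[1.0, 4.0], [np.nan, 2.0], [3.0, 6.0]])
n_features = array.shape[1]

# Row indices that would sort each column independently.
sorted_idx = np.argsort(array, axis=0)

# For each column, pick the element that was sorted last.
largest_value_per_column = array[sorted_idx[-1, ...], np.arange(n_features)]

# NumPy sorts NaN to the end, so a NaN in the last position means
# the column contains at least one NaN.
has_nan_per_column = np.isnan(largest_value_per_column)
print(has_nan_per_column)
```

If a library sorted NaN to the front instead, the last element would be finite and this check would silently miss the NaN.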

AFAICT (confirmed by @ev-br) array API specs do not specify how sort should handle NaN, which means it is left to individual packages to determine.

torch seems to follow numpy and sorts NaN to the end (tested manually with float('nan') and torch.nan), but this is not mentioned in the docs. There is some discussion of ordering NaN as the largest value here: PyTorch NaN behavior and API design pytorch/pytorch#46544 (comment) and a related issue about negative NaN here: [MPS] sort incorrectly handles 'negative' NaNs pytorch/pytorch#116567
CuPy seems to follow numpy behaviour as well (relevant issue: different result between cupy.sort and numpy.sort with NaN cupy/cupy#3324; they also seem to have tests checking that their results match numpy's when sorting NaN).
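Since the behaviour was checked manually per library, a small probe like the following could do the same check programmatically. This is only a sketch: `sorts_nan_last` is a hypothetical helper, not something that exists in scikit-learn, and it assumes the namespace passed in provides `asarray`, `sort`, `isnan` and `any` as in the array API. It is shown with NumPy only.

```python
import numpy as np


def sorts_nan_last(xp):
    """Return True if xp.sort places NaN after every other value, including +inf.

    Hypothetical helper for illustration; `xp` is any array namespace
    exposing asarray, sort, isnan and any.
    """
    x = xp.asarray([float("nan"), 1.0, float("inf"), -1.0])
    out = xp.sort(x)
    last_is_nan = bool(xp.isnan(out[-1]))
    nan_elsewhere = bool(xp.any(xp.isnan(out[:-1])))
    return last_is_nan and not nan_elsewhere


print(sorts_nan_last(np))
```

Including `inf` in the probe distinguishes "NaN sorted last" from "NaN treated like a large finite number".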

As everything works, I don't think we need to do anything here (especially as we ultimately want to drop maintaining our own quantile function), but just thought it would be useful to document.

cc @StefanieSenger @ogrisel


StefanieSenger · 2025-05-16T10:36:36Z

I think it would be nice for the spec to define how xp.sort (and other functions?) should handle NaNs, so that the status quo becomes a standard (also for future array libraries that want to adopt the array API standard). Maybe we could open an issue in the data-apis repo?

lucyleeow · 2025-05-16T11:22:47Z

Agreed, it's on my to-do list. The outcome may just be that the spec explicitly states that NaN handling is left to each package, though.

lucyleeow · 2025-05-19T02:32:21Z

So there is a note that states:

the sort order of NaNs and signed zeros is unspecified and thus implementation-dependent.

see: https://data-apis.org/array-api/latest/API_specification/sorting_functions.html

So while it is not guaranteed to work, I think our implementation is acceptable. Maybe it's just like our assumption that arrays are stored in 'C' order.

StefanieSenger · 2025-05-19T08:40:20Z

I see now, and I have also missed that note.

I agree that since the currently supported array libraries in scikit-learn all sort NaNs to the end (that is with the C order(?)), we're fine for now, and in case a new array library would handle this differently, there will be many places where this problem would pop up and then this will be re-discussed.

Thanks for looking into this! ❤

lucyleeow · 2025-05-19T11:41:13Z

that is with the C order(?)

We assume arrays are stored C-contiguous by default (as opposed to F-contiguous). I don't have a go-to reference explaining this, but you should be able to find articles on it.
I vaguely recall a note somewhere in the code about us assuming arrays are C-contiguous but I can't find it now.
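For anyone unfamiliar with the distinction, here is a minimal NumPy sketch of C-contiguous (row-major) versus F-contiguous (column-major) storage. It only illustrates the memory layouts; it makes no claim about where scikit-learn relies on either.

```python
import numpy as np

# NumPy creates C-contiguous (row-major) arrays by default.
a = np.arange(6).reshape(2, 3)

# Same values, but stored column-major (Fortran order).
f = np.asfortranarray(a)

print(a.flags["C_CONTIGUOUS"], a.flags["F_CONTIGUOUS"])
print(f.flags["C_CONTIGUOUS"], f.flags["F_CONTIGUOUS"])

# Strides show the layout: in C order, moving along a row is the small step;
# in F order, moving down a column is the small step.
print(a.strides, f.strides)

# The layout does not change the values.
print(np.array_equal(a, f))
```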

StefanieSenger · 2025-05-19T12:35:38Z

Yea, I had read about this, I think somewhere in numpy's docs. It's a question of which language (C or Fortran) the array storage layout is optimised for. I didn't find any directive on how scikit-learn deals with this. For the time being, I will just accept that we prefer C-ordered arrays then.

github-actions bot added the Needs Triage label May 15, 2025

lucyleeow mentioned this issue May 16, 2025

Clarify NaN handling in sort data-apis/array-api#944

Closed

lesteve removed the Needs Triage label May 19, 2025

virchan added the Array API label Jun 8, 2025
