Add array API support for _weighted_percentile #29431

EmilyXinyi · 2024-07-08T11:09:31Z

TO DO:

Modify the function _weight_percentile to support array API
Modify and add tests in test_stats.py
- Add array API specific tests that compares array API outputs to numpy outputs

cc: @StefanieSenger

github-actions · 2024-07-08T11:10:54Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: ce620de. Link to the linter CI: here}

adrinjalali · 2024-07-18T14:21:43Z

cc @lesteve as well.

StefanieSenger

Hey @EmilyXinyi! Thank you so much for your work. ✨

I went through it and commented to the best of my knowledge. I will probably err on some things, since I am also only learning about Array API, and I hope this is not too confusing. Surely a maintainer can help more than me. But maybe my comments can help us into some kind of productive discussion.

I have yet to check _take_along_axis(), which I will do later, when I have a clearer head for this.
Edit: I have done it now.

sklearn/utils/stats.py

StefanieSenger · 2024-07-19T10:56:38Z

sklearn/utils/stats.py

@@ -29,41 +28,60 @@ def _weighted_percentile(array, sample_weight, percentile=50):
    percentile : int if `array` 1D, ndarray if `array` 2D
        Weighted percentile.
    """
+    xp, _ = get_namespace(array)
    n_dim = array.ndim
    if n_dim == 0:
        return array[()]


As far as I have understood, we still would want to return a scalar value from the same namespace (and same device) as the original array in the case when the dimension of the original array is 0. I'm not sure about the correct syntax though.

I think this code will keep working and do the right thing for array API inputs. At least the following works:

In [12]: x_ = array_api_strict.asarray(1) In [13]: x_.ndim Out[13]: 0 In [14]: x[()] Out[14]: np.int64(1)

(but maybe this only works because array api strict uses numpy in the background?

Thank you @betatim.

I have a question: what do you mean when you say, the code works? I am not sure what the distinction criterium for working and not working is, since we won't get any error either way. Do we need to measure performance?
If you are looking for the data type: Why did you expect np.int64 here?

I think what he meant is the indexing a 0d-array with an empty tuple preserves the namespace and the dtype (and presumably the device) of the original array.

Actually, I misread the output of the snippet. Indeed, the namespace is not preserved since this returns a NumPy scalar instance and not a 0d array of the original namespace.

But this is somewhat but not exactly in line with the return type described in the docstring of the function (int).

>>> import numpy as np >>> np.isscalar(np.asarray(1)) False >>> np.isscalar(np.asarray(1)[()]) True >>> np.isscalar(np.int64(1)) True >>> import numbers >>> isinstance(np.int64(1), numbers.Integral) True >>> isinstance(np.int64(1), int) False

Actually I cannot reproduce Tim's snippet with array-api-strict 2.3:

>>> import array_api_strict as xp >>> xp.asarray(1)[()] Array(1, dtype=array_api_strict.int64) >>> xp.__version__ '2.3'

So this code is actually returning a 0d array of the original namespace (and device) rather than a NumPy scalar instance.

And torch does the same:

>>> import array_api_compat.torch as xp >>> xp.asarray(1, device="mps")[()] tensor(1, device='mps:0')

I think the caller will be responsible for calling float(output) if we want to convert a NumPy scalar or a 0d (non-NumPy) array value into a Python float scalar in the end.

For 0D NumPy inputs, returning a NumPy scalar or a 0D NumPy array should not matter much. For other kinds of 0D input arrays, we would always return 0D output arrays (because the array API spec does not support namespace-specific scalar values, but only 0D arrays).

We could probably remove the [()] in the 0D input case. I don't think it matters, neither for NumPy inputs nor other kinds of inputs. I tried to remove it on main and all the tests still pass.

@ogrisel to clarify, API spec does not support namespace-specific scalar values, so:

array_api_strict.asarray(1)[()]

array_api_strict.asarray(1)

both result in the same thing, a 0D array ?

Whereas numpy does have scalars so

np.array(1)[()]

np.array(1)

result in different things, 0D array vs scalar.

So I think I agree that we should remove the [()] for 0D input case? (so numpy output is the same as other API arrays?)

I am fine with either option.

sklearn/utils/stats.py

sklearn/utils/tests/test_stats.py

sklearn/utils/stats.py

sklearn/utils/tests/test_stats.py

sklearn/utils/stats.py

sklearn/utils/tests/test_stats.py

sklearn/utils/stats.py

sklearn/utils/tests/test_stats.py

ogrisel

General comment: most of the time when we add array API support to a function in scikit-learn, we do not touch the existing (numpy-only) tests to make sure that the PR does not change the default behavior of scikit-learn on traditional inputs when array API is not enabled.

Instead we add a few new test functions that:

generate some random test data with numpy or sklearn.datasets.make_*;
call the function once on the numpy inputs without enabling array API dispatch;
convert the inputs to a namespace / device combo passed as parameter to the test;
call the function with array API dispatching enabled (under a with sklearn.config_context(array_api_dispatch=True) block
check that the results are on the same namespace and device as the input
convert back the output to a numpy array using _convert_to_numpy
compare the original / reference numpy results and the xp computation results converted back to numpy using assert_allclose or similar.

Those tests should have array_api somewhere in their name to makes sure that we can run all the array API compliance tests with a keyword search in the pytest command line, e.g.:

pytest -k array_api sklearn/some/subpackage

In particular, for cost reasons, our CUDA GPU CI only runs pytest -k array_api sklearn. So it's very important to respect this naming conventions, otherwise we will not tests all what we are supposed to test on CUDA.

StefanieSenger · 2024-09-11T10:29:24Z

General comment: most of the time when we add array API support to a function in scikit-learn, we do not touch the existing (numpy-only) tests to make sure that the PR does not change the default behavior of scikit-learn on traditional inputs when array API is not enabled.
...

Awesome!
This is such a valuable guideline that I think it deserves to be put on the issue description of the main Array API issue (#26024).

This implies that we don't test Array API down every control flow branch (like codecov), but rather have one new test per function. This is also how it is meant, correct?

ogrisel · 2024-09-11T14:11:10Z

This implies that we don't test Array API down every control flow branch (like codecov), but rather have one new test per function. This is also how it is meant, correct?

Ideally we don't even need one test per function when the existing common tests are enough as we did in many past PRs for:

estimators (by adding the array_api_support=True estimator tag):

scikit-learn/sklearn/utils/estimator_checks.py

Lines 143 to 145 in d9deffe

    
           if tags.array_api_support: 
        
               for check in _yield_array_api_checks(estimator): 
        
                   yield check

e.g. for PCA:

scikit-learn/sklearn/decomposition/_pca.py

Lines 850 to 854 in d9deffe

    
           def __sklearn_tags__(self): 
        
               tags = super().__sklearn_tags__() 
        
               tags.transformer_tags.preserves_dtype = ["float64", "float32"] 
        
               tags.array_api_support = True 
        
               return tags

metric functions (by registering them to the list of metric functions to be checked (array_api_metric_checkers))

But we can also write custom test functions (with array_api in their name) if we want to test a function that cannot be tested by an existing common test/check or when we want to cover non-default parameter combinations with an estimator/function specific array_api compliance checking logic + custom some pytest parameterization, e.g. as done for PCA here:

scikit-learn/sklearn/decomposition/tests/test_pca.py

Lines 1007 to 1036 in d9deffe

    
           @pytest.mark.parametrize( 
        
               "array_namespace, device, dtype_name", yield_namespace_device_dtype_combinations() 
        
           ) 
        
           @pytest.mark.parametrize( 
        
               "check", 
        
               [check_array_api_input_and_values, check_array_api_get_precision], 
        
               ids=_get_check_estimator_ids, 
        
           ) 
        
           @pytest.mark.parametrize( 
        
               "estimator", 
        
               [ 
        
                   PCA(n_components=2, svd_solver="full"), 
        
                   PCA(n_components=2, svd_solver="full", whiten=True), 
        
                   PCA(n_components=0.1, svd_solver="full", whiten=True), 
        
                   PCA(n_components=2, svd_solver="covariance_eigh"), 
        
                   PCA(n_components=2, svd_solver="covariance_eigh", whiten=True), 
        
                   PCA( 
        
                       n_components=2, 
        
                       svd_solver="randomized", 
        
                       power_iteration_normalizer="QR", 
        
                       random_state=0,  # how to use global_random_seed here? 
        
                   ), 
        
               ], 
        
               ids=_get_check_estimator_ids, 
        
           ) 
        
           def test_pca_array_api_compliance( 
        
               estimator, check, array_namespace, device, dtype_name 
        
           ): 
        
               name = estimator.__class__.__name__ 
        
               check(name, estimator, array_namespace, device=device, dtype_name=dtype_name)

Note that this particular wrapper of a common estimator check as a custom check will probably be refactored once #29820 is merged, but the main idea is that it's ok to add a few extra tests when the existing common tests / estimator checks are not enough to test parameter combinations we would like to cover.

The point is that the existing (pre-array API) tests should be enough to check that all branches works as expected with numpy and then on top of that array compliance checks check that numpy and any other array API library find the same results (up to rounding errors) on random data (with various combinations of extra parameters that should support array API).

ogrisel · 2024-09-11T14:31:43Z

I updated the description of the meta issue to add testing guidelines.

StefanieSenger · 2024-09-16T11:29:29Z

Thanks for the explanations about testing and updating the issue description, @ogrisel.
Its very helpful to know how maintainers want the testing to be done and prevents me from guesswork. 😂

The point is that the existing (pre-array API) tests should be enough to check that all branches works as expected with numpy and then on top of that array compliance checks check that numpy and any other array API library find the same results (up to rounding errors) on random data (with various combinations of extra parameters that should support array API).

This is something I am still wrapping my head around. Testing if the (old) numpy outputs match the array api outputs for the default params seems incomplete to me. In my mind, the default is nothing special compared to other branches of a function. This doubt was infact why I wrote this down so explicitly to get your confirmation. I will keep thinking about it.

ogrisel

Another pass of feedback w.r.t. missing array API functions.

ogrisel · 2024-10-08T15:16:40Z

sklearn/utils/_array_api.py

+    if _is_numpy_namespace(xp):
+        return numpy.nextafter(x1, x2)


Suggested change

if _is_numpy_namespace(xp):

return numpy.nextafter(x1, x2)

if hasattr(xp, "nextafter"):

return xp.nextafter(x1, x2)

sklearn/utils/_array_api.py

sklearn/utils/stats.py

ogrisel · 2025-03-26T15:40:37Z

One of the tests would fail on torch with MPS device locally because of a bug in torch.nextafter on that device. I pushed a workaround in 7a2f907. I will report the issue upstream.

ogrisel · 2025-03-26T15:55:13Z

@ogrisel GH bot undid your labelling...

That's the expected behaviour. Maybe we could change it though.

ogrisel · 2025-03-26T16:15:08Z

Ok I will stop pushing to your PR for today @lucyleeow ;) Feel free to resume addressing the remaining open thread in the review above.

sklearn/utils/tests/test_stats.py

lucyleeow · 2025-03-31T12:11:31Z

FYI I've opened a PR to fix the array-api-strict bug (data-apis/array-api-strict#139), but I guess we'd have to wait for a release to remove the xfail?

Note it looks codecov is complaining about test_stats.py, line:

pytest.xfail(f"xp.nextafter is broken on {device}")

ogrisel · 2025-03-31T15:35:50Z

FYI I've opened a PR to fix the array-api-strict bug (data-apis/array-api-strict#139), but I guess we'd have to wait for a release to remove the xfail?

Yes. And a weekly lock file update.

ogrisel

LGTM. Thank you very much @lucyleeow @EmilyXinyi @StefanieSenger.

This is ready for another second review (ping @OmarManzoor @betatim).

lucyleeow · 2025-04-03T12:54:49Z

FYI data-apis/array-api-strict#139 is now merged, I'll keep an eye on the new release.

Thanks for your review and fixes @ogrisel !

ogrisel · 2025-04-03T15:33:03Z

For information, I opened scipy/scipy#22794 so that in the long term, we could consolidate quantile implementation with array API + NaN + weight support all in SciPy instead of scikit-learn.

SciPy already has array API and NaN support but lacks:

weight support for method="inverted_cdf";
weight support for method="averaged_inverted_cdf".

sklearn/utils/stats.py

betatim

This looks nice! I've not followed the whole conversation in the ~125 comments, only looked at the code as it is now.

I had two questions, but otherwise I think this is nice and ready to go!

sklearn/utils/stats.py

betatim · 2025-04-04T08:22:49Z

sklearn/utils/tests/test_stats.py

+            pass
+        else:
+            if device == array_api_strict.Device("device1"):
+                # See https://github.com/data-apis/array-api-strict/issues/134


It looks like data-apis/array-api-strict#139 fixed the bug, so do we still need this xfail? Asked differently, when can we remove this xfail? Maybe we can write it down in the comment to help us-from-the-future

Can you tell I didn't read the conversation :D This is answered in #29431 (comment) - maybe still worth a comment here or do you think the release will happen soon enough that you still remember @lucyleeow ?

I agree, I should add a comment here, and I will add this to my to do list to follow up

betatim · 2025-04-04T11:58:52Z

In the last monthly meeting we discussed "why do people not merge PRs once they have enough +1s?" - so in the spirit of merging things a bit more promptly: I'll merge this now. It looks like the comment threads not marked as "resolved" are either out of date or questions. If there is one that leads to some changes, lets do that in a new PR.

Thanks for the persistence in this long running PR

lucyleeow · 2025-04-04T12:07:21Z

@betatim you beat me by minutes! I guess if I don't add that comment, we will have to rely solely on me remembering to do it 😱 I've added it to my todo.

lucyleeow · 2025-04-04T12:07:41Z

Thanks for the review and speedy merge :D

WIP on array API support for _weighted_percentile

c43f256

github-actions bot added the module:utils label Jul 8, 2024

ogrisel added the Array API label Jul 8, 2024

EmilyXinyi and others added 2 commits July 17, 2024 08:51

Merge branch 'scikit-learn:main' into array_API_weighted_percentile

26c160d

change existing tests for array API compatibility

437093b

StefanieSenger reviewed Jul 19, 2024

View reviewed changes

Merge branch 'main' into array_API_weighted_percentile

20e0223

StefanieSenger reviewed Jul 24, 2024

View reviewed changes

sklearn/utils/stats.py Outdated Show resolved Hide resolved

betatim reviewed Jul 26, 2024

View reviewed changes

sklearn/utils/stats.py Outdated Show resolved Hide resolved

betatim reviewed Jul 26, 2024

View reviewed changes

sklearn/utils/tests/test_stats.py Outdated Show resolved Hide resolved

Merge branch 'scikit-learn:main' into array_API_weighted_percentile

4a2b303

EmilyXinyi commented Aug 7, 2024

View reviewed changes

EmilyXinyi added 2 commits August 7, 2024 11:46

WIP addressing review comments

5c19da7

_apply_along_axis and _nextafter

c0a4583

EmilyXinyi commented Aug 8, 2024

View reviewed changes

sklearn/utils/tests/test_stats.py Outdated Show resolved Hide resolved

EmilyXinyi added 2 commits August 9, 2024 08:17

Merge branch 'scikit-learn:main' into array_API_weighted_percentile

745c668

Merge branch 'scikit-learn:main' into array_API_weighted_percentile

259b7c4

ogrisel reviewed Sep 10, 2024

View reviewed changes

ogrisel reviewed Oct 8, 2024

View reviewed changes

ogrisel mentioned this pull request Oct 8, 2024

MNT _weighted_percentile supports np.nan values #29034

Merged

Merge branch 'scikit-learn:main' into array_API_weighted_percentile

7b0e733

ogrisel mentioned this pull request Oct 21, 2024

ENH Add array api support for laplacian_kernel and manhattan_distances #29881

Closed

EmilyXinyi and others added 2 commits October 31, 2024 10:51

push to experiment with getting around indexing errors

9d31949

Merge branch 'scikit-learn:main' into array_API_weighted_percentile

bc06da9

ogrisel added the CUDA CI label Mar 26, 2025

github-actions bot removed the CUDA CI label Mar 26, 2025

Workaround MPS bug in torch.nextafter

7a2f907

Link to pytorch issue in comment

8360191

lucyleeow added 4 commits March 28, 2025 16:15

nitpick xfail wording

e5276b9

comment nits

b4964a2

Merge branch 'main' into array_API_weighted_percentile

01da4bb

review

877dba1

ogrisel reviewed Mar 28, 2025

View reviewed changes

sklearn/utils/tests/test_stats.py Show resolved Hide resolved

ogrisel reviewed Mar 28, 2025

View reviewed changes

sklearn/utils/tests/test_stats.py Show resolved Hide resolved

amend test

52e4e1d

Tweak tests cases to pass on MPS and cover the 0D case

fedd05b

ogrisel added the CUDA CI label Apr 3, 2025

github-actions bot removed the CUDA CI label Apr 3, 2025

ogrisel approved these changes Apr 3, 2025

View reviewed changes

ogrisel reviewed Apr 3, 2025

View reviewed changes

sklearn/utils/stats.py Outdated Show resolved Hide resolved

lucyleeow added 2 commits April 4, 2025 11:33

add xp

32b1cd4

add xp aver weighted percentile

ce620de

betatim approved these changes Apr 4, 2025

View reviewed changes

betatim merged commit bb261bf into scikit-learn:main Apr 4, 2025
31 checks passed

Add array API support for _weighted_percentile #29431

Add array API support for _weighted_percentile #29431

Conversation

EmilyXinyi commented Jul 8, 2024 • edited by ogrisel Loading

TO DO:

github-actions bot commented Jul 8, 2024 • edited Loading

✔️ Linting Passed

adrinjalali commented Jul 18, 2024

StefanieSenger left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

StefanieSenger Jul 26, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ogrisel Mar 25, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ogrisel Mar 26, 2025 • edited Loading

Choose a reason for hiding this comment

lucyleeow Mar 31, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ogrisel left a comment • edited Loading

Choose a reason for hiding this comment

StefanieSenger commented Sep 11, 2024

ogrisel commented Sep 11, 2024 • edited Loading

ogrisel commented Sep 11, 2024

StefanieSenger commented Sep 16, 2024

ogrisel left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ogrisel commented Mar 26, 2025

ogrisel commented Mar 26, 2025

ogrisel commented Mar 26, 2025

lucyleeow commented Mar 31, 2025

ogrisel commented Mar 31, 2025

ogrisel left a comment

Choose a reason for hiding this comment

lucyleeow commented Apr 3, 2025

ogrisel commented Apr 3, 2025 • edited Loading

betatim left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

betatim commented Apr 4, 2025

lucyleeow commented Apr 4, 2025

lucyleeow commented Apr 4, 2025

EmilyXinyi commented Jul 8, 2024 •

edited by ogrisel

Loading

github-actions bot commented Jul 8, 2024 •

edited

Loading

StefanieSenger left a comment •

edited

Loading

StefanieSenger Jul 26, 2024 •

edited

Loading

ogrisel Mar 25, 2025 •

edited

Loading

ogrisel Mar 26, 2025 •

edited

Loading

lucyleeow Mar 31, 2025 •

edited

Loading

ogrisel left a comment •

edited

Loading

ogrisel commented Sep 11, 2024 •

edited

Loading

ogrisel commented Apr 3, 2025 •

edited

Loading