Skip to content

Add array API support for _weighted_percentile #29431

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 49 commits into from
Apr 4, 2025

Conversation

EmilyXinyi
Copy link
Contributor

@EmilyXinyi EmilyXinyi commented Jul 8, 2024

TO DO:

  • Modify the function _weight_percentile to support array API
  • Modify and add tests in test_stats.py
    • Add array API specific tests that compares array API outputs to numpy outputs

cc: @StefanieSenger

Copy link

github-actions bot commented Jul 8, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: ce620de. Link to the linter CI: here

@adrinjalali
Copy link
Member

cc @lesteve as well.

Copy link
Contributor

@StefanieSenger StefanieSenger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @EmilyXinyi! Thank you so much for your work. ✨

I went through it and commented to the best of my knowledge. I will probably err on some things, since I am also only learning about Array API, and I hope this is not too confusing. Surely a maintainer can help more than me. But maybe my comments can help us into some kind of productive discussion.

I have yet to check _take_along_axis(), which I will do later, when I have a clearer head for this.
Edit: I have done it now.

@@ -29,41 +28,60 @@ def _weighted_percentile(array, sample_weight, percentile=50):
percentile : int if `array` 1D, ndarray if `array` 2D
Weighted percentile.
"""
xp, _ = get_namespace(array)
n_dim = array.ndim
if n_dim == 0:
return array[()]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I have understood, we still would want to return a scalar value from the same namespace (and same device) as the original array in the case when the dimension of the original array is 0. I'm not sure about the correct syntax though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code will keep working and do the right thing for array API inputs. At least the following works:

In [12]: x_ = array_api_strict.asarray(1)

In [13]: x_.ndim
Out[13]: 0

In [14]: x[()]
Out[14]: np.int64(1)

(but maybe this only works because array api strict uses numpy in the background?

Copy link
Contributor

@StefanieSenger StefanieSenger Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @betatim.

I have a question: what do you mean when you say, the code works? I am not sure what the distinction criterium for working and not working is, since we won't get any error either way. Do we need to measure performance?
If you are looking for the data type: Why did you expect np.int64 here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what he meant is the indexing a 0d-array with an empty tuple preserves the namespace and the dtype (and presumably the device) of the original array.

Copy link
Member

@ogrisel ogrisel Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I misread the output of the snippet. Indeed, the namespace is not preserved since this returns a NumPy scalar instance and not a 0d array of the original namespace.

But this is somewhat but not exactly in line with the return type described in the docstring of the function (int).

>>> import numpy as np
>>> np.isscalar(np.asarray(1))
False
>>> np.isscalar(np.asarray(1)[()])
True
>>> np.isscalar(np.int64(1))
True
>>> import numbers
>>> isinstance(np.int64(1), numbers.Integral)
True
>>> isinstance(np.int64(1), int)
False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I cannot reproduce Tim's snippet with array-api-strict 2.3:

>>> import array_api_strict as xp
>>> xp.asarray(1)[()]
Array(1, dtype=array_api_strict.int64)
>>> xp.__version__
'2.3'

So this code is actually returning a 0d array of the original namespace (and device) rather than a NumPy scalar instance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And torch does the same:

>>> import array_api_compat.torch as xp
>>> xp.asarray(1, device="mps")[()]
tensor(1, device='mps:0')

Copy link
Member

@ogrisel ogrisel Mar 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the caller will be responsible for calling float(output) if we want to convert a NumPy scalar or a 0d (non-NumPy) array value into a Python float scalar in the end.

For 0D NumPy inputs, returning a NumPy scalar or a 0D NumPy array should not matter much. For other kinds of 0D input arrays, we would always return 0D output arrays (because the array API spec does not support namespace-specific scalar values, but only 0D arrays).

We could probably remove the [()] in the 0D input case. I don't think it matters, neither for NumPy inputs nor other kinds of inputs. I tried to remove it on main and all the tests still pass.

Copy link
Member

@lucyleeow lucyleeow Mar 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ogrisel to clarify, API spec does not support namespace-specific scalar values, so:

  • array_api_strict.asarray(1)[()]
  • array_api_strict.asarray(1)

both result in the same thing, a 0D array ?

Whereas numpy does have scalars so

  • np.array(1)[()]
  • np.array(1)

result in different things, 0D array vs scalar.

So I think I agree that we should remove the [()] for 0D input case? (so numpy output is the same as other API arrays?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with either option.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment: most of the time when we add array API support to a function in scikit-learn, we do not touch the existing (numpy-only) tests to make sure that the PR does not change the default behavior of scikit-learn on traditional inputs when array API is not enabled.

Instead we add a few new test functions that:

  • generate some random test data with numpy or sklearn.datasets.make_*;
  • call the function once on the numpy inputs without enabling array API dispatch;
  • convert the inputs to a namespace / device combo passed as parameter to the test;
  • call the function with array API dispatching enabled (under a with sklearn.config_context(array_api_dispatch=True) block
  • check that the results are on the same namespace and device as the input
  • convert back the output to a numpy array using _convert_to_numpy
  • compare the original / reference numpy results and the xp computation results converted back to numpy using assert_allclose or similar.

Those tests should have array_api somewhere in their name to makes sure that we can run all the array API compliance tests with a keyword search in the pytest command line, e.g.:

pytest -k array_api sklearn/some/subpackage

In particular, for cost reasons, our CUDA GPU CI only runs pytest -k array_api sklearn. So it's very important to respect this naming conventions, otherwise we will not tests all what we are supposed to test on CUDA.

@StefanieSenger
Copy link
Contributor

General comment: most of the time when we add array API support to a function in scikit-learn, we do not touch the existing (numpy-only) tests to make sure that the PR does not change the default behavior of scikit-learn on traditional inputs when array API is not enabled.
...

Awesome!
This is such a valuable guideline that I think it deserves to be put on the issue description of the main Array API issue (#26024).

This implies that we don't test Array API down every control flow branch (like codecov), but rather have one new test per function. This is also how it is meant, correct?

@ogrisel
Copy link
Member

ogrisel commented Sep 11, 2024

This implies that we don't test Array API down every control flow branch (like codecov), but rather have one new test per function. This is also how it is meant, correct?

Ideally we don't even need one test per function when the existing common tests are enough as we did in many past PRs for:

  • estimators (by adding the array_api_support=True estimator tag):

if tags.array_api_support:
for check in _yield_array_api_checks(estimator):
yield check

e.g. for PCA:

def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.transformer_tags.preserves_dtype = ["float64", "float32"]
tags.array_api_support = True
return tags

  • metric functions (by registering them to the list of metric functions to be checked (array_api_metric_checkers))

But we can also write custom test functions (with array_api in their name) if we want to test a function that cannot be tested by an existing common test/check or when we want to cover non-default parameter combinations with an estimator/function specific array_api compliance checking logic + custom some pytest parameterization, e.g. as done for PCA here:

@pytest.mark.parametrize(
"array_namespace, device, dtype_name", yield_namespace_device_dtype_combinations()
)
@pytest.mark.parametrize(
"check",
[check_array_api_input_and_values, check_array_api_get_precision],
ids=_get_check_estimator_ids,
)
@pytest.mark.parametrize(
"estimator",
[
PCA(n_components=2, svd_solver="full"),
PCA(n_components=2, svd_solver="full", whiten=True),
PCA(n_components=0.1, svd_solver="full", whiten=True),
PCA(n_components=2, svd_solver="covariance_eigh"),
PCA(n_components=2, svd_solver="covariance_eigh", whiten=True),
PCA(
n_components=2,
svd_solver="randomized",
power_iteration_normalizer="QR",
random_state=0, # how to use global_random_seed here?
),
],
ids=_get_check_estimator_ids,
)
def test_pca_array_api_compliance(
estimator, check, array_namespace, device, dtype_name
):
name = estimator.__class__.__name__
check(name, estimator, array_namespace, device=device, dtype_name=dtype_name)

Note that this particular wrapper of a common estimator check as a custom check will probably be refactored once #29820 is merged, but the main idea is that it's ok to add a few extra tests when the existing common tests / estimator checks are not enough to test parameter combinations we would like to cover.

The point is that the existing (pre-array API) tests should be enough to check that all branches works as expected with numpy and then on top of that array compliance checks check that numpy and any other array API library find the same results (up to rounding errors) on random data (with various combinations of extra parameters that should support array API).

@ogrisel
Copy link
Member

ogrisel commented Sep 11, 2024

I updated the description of the meta issue to add testing guidelines.

@StefanieSenger
Copy link
Contributor

Thanks for the explanations about testing and updating the issue description, @ogrisel.
Its very helpful to know how maintainers want the testing to be done and prevents me from guesswork. 😂

The point is that the existing (pre-array API) tests should be enough to check that all branches works as expected with numpy and then on top of that array compliance checks check that numpy and any other array API library find the same results (up to rounding errors) on random data (with various combinations of extra parameters that should support array API).

This is something I am still wrapping my head around. Testing if the (old) numpy outputs match the array api outputs for the default params seems incomplete to me. In my mind, the default is nothing special compared to other branches of a function. This doubt was infact why I wrote this down so explicitly to get your confirmation. I will keep thinking about it.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another pass of feedback w.r.t. missing array API functions.

Comment on lines 1138 to 1139
if _is_numpy_namespace(xp):
return numpy.nextafter(x1, x2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if _is_numpy_namespace(xp):
return numpy.nextafter(x1, x2)
if hasattr(xp, "nextafter"):
return xp.nextafter(x1, x2)

@ogrisel
Copy link
Member

ogrisel commented Mar 26, 2025

One of the tests would fail on torch with MPS device locally because of a bug in torch.nextafter on that device. I pushed a workaround in 7a2f907. I will report the issue upstream.

@ogrisel
Copy link
Member

ogrisel commented Mar 26, 2025

@ogrisel GH bot undid your labelling...

That's the expected behaviour. Maybe we could change it though.

@ogrisel
Copy link
Member

ogrisel commented Mar 26, 2025

Ok I will stop pushing to your PR for today @lucyleeow ;) Feel free to resume addressing the remaining open thread in the review above.

@lucyleeow
Copy link
Member

FYI I've opened a PR to fix the array-api-strict bug (data-apis/array-api-strict#139), but I guess we'd have to wait for a release to remove the xfail?

Note it looks codecov is complaining about test_stats.py, line:

pytest.xfail(f"xp.nextafter is broken on {device}")

@ogrisel
Copy link
Member

ogrisel commented Mar 31, 2025

FYI I've opened a PR to fix the array-api-strict bug (data-apis/array-api-strict#139), but I guess we'd have to wait for a release to remove the xfail?

Yes. And a weekly lock file update.

@ogrisel ogrisel added the CUDA CI label Apr 3, 2025
@github-actions github-actions bot removed the CUDA CI label Apr 3, 2025
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thank you very much @lucyleeow @EmilyXinyi @StefanieSenger.

This is ready for another second review (ping @OmarManzoor @betatim).

@lucyleeow
Copy link
Member

FYI data-apis/array-api-strict#139 is now merged, I'll keep an eye on the new release.

Thanks for your review and fixes @ogrisel !

@ogrisel
Copy link
Member

ogrisel commented Apr 3, 2025

For information, I opened scipy/scipy#22794 so that in the long term, we could consolidate quantile implementation with array API + NaN + weight support all in SciPy instead of scikit-learn.

SciPy already has array API and NaN support but lacks:

  • weight support for method="inverted_cdf";
  • weight support for method="averaged_inverted_cdf".

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice! I've not followed the whole conversation in the ~125 comments, only looked at the code as it is now.

I had two questions, but otherwise I think this is nice and ready to go!

pass
else:
if device == array_api_strict.Device("device1"):
# See https://github.com/data-apis/array-api-strict/issues/134
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like data-apis/array-api-strict#139 fixed the bug, so do we still need this xfail? Asked differently, when can we remove this xfail? Maybe we can write it down in the comment to help us-from-the-future

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you tell I didn't read the conversation :D This is answered in #29431 (comment) - maybe still worth a comment here or do you think the release will happen soon enough that you still remember @lucyleeow ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I should add a comment here, and I will add this to my to do list to follow up

@betatim
Copy link
Member

betatim commented Apr 4, 2025

In the last monthly meeting we discussed "why do people not merge PRs once they have enough +1s?" - so in the spirit of merging things a bit more promptly: I'll merge this now. It looks like the comment threads not marked as "resolved" are either out of date or questions. If there is one that leads to some changes, lets do that in a new PR.

Thanks for the persistence in this long running PR

@betatim betatim merged commit bb261bf into scikit-learn:main Apr 4, 2025
31 checks passed
@lucyleeow
Copy link
Member

@betatim you beat me by minutes! I guess if I don't add that comment, we will have to rely solely on me remembering to do it 😱 I've added it to my todo.

@lucyleeow
Copy link
Member

Thanks for the review and speedy merge :D

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants