
MAINT: convert numpy.array_api to array-api-strict #28555


Merged
merged 15 commits into scikit-learn:main on Mar 18, 2024

Conversation

rgommers
Contributor

@rgommers rgommers commented Feb 29, 2024

The numpy.array_api module has been converted into a standalone package (array-api-strict). This new package is stable and has had a 1.0 release. The numpy.array_api module was marked experimental and will be removed in NumPy 2.0. Since array-api-strict works with both NumPy 1.2x and NumPy 2.0, it should always be preferred over numpy.array_api for testing compliance with the syntax and semantics of the array API standard.
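
For context, a minimal sketch of the import-level swap in test code (illustrative only, not taken from the PR diff):

```python
# Before (NumPy 1.2x only; emits an experimental-module warning on import and
# is removed in NumPy 2.0):
#     from numpy import array_api as xp
# After (works with both NumPy 1.2x and NumPy 2.0):
import array_api_strict as xp

a = xp.asarray([1.0, 2.0, 3.0])
print(xp.mean(a))  # only standard-specified functions and behaviours are exposed
```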

Reference Issues/PRs

No separate scikit-learn issue for this; NEP 56 documents the decision to remove numpy.array_api before the NumPy 2.0 release.

EDIT: numpy/numpy#25911 is the relevant NumPy PR.

What does this implement/fix? Explain your changes.

This is mostly a 1:1 replacement of numpy.array_api with array_api_strict. The improvements in the latter package did turn up a couple of places where the code wasn't compliant with the standard, due to the following (a short sketch of the fixes follows the list):

  • using a device='cpu' string, which is NumPy/PyTorch-specific,
  • indexing non-NumPy arrays with an integer numpy.ndarray (this should use indices of the same array type and the take function instead of fancy indexing),
  • use of dtype.kind, which isn't guaranteed to exist or to return one-letter NumPy-style type codes,
  • use of the builtin int as a dtype.
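
A hedged sketch of standard-compliant versions of those patterns (variable names are illustrative and not taken from the PR diff):

```python
import array_api_strict as xp

x = xp.asarray([[3.0, 1.0], [2.0, 4.0]])

# 1. Don't hard-code device="cpu"; propagate the device of an existing array.
y = xp.zeros(x.shape, dtype=x.dtype, device=x.device)

# 2. Build indices in the same namespace as x and use take() rather than
#    fancy indexing with an integer numpy.ndarray.
indices = xp.asarray([1, 0])
rows = xp.take(x, indices, axis=0)

# 3. Use isdtype() instead of the NumPy-specific dtype.kind letter codes.
assert xp.isdtype(x.dtype, "real floating")

# 4. Pass the namespace's dtype objects rather than the builtin int.
counts = xp.zeros((3,), dtype=xp.int64)
```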

Any other comments?

I have tested with:

  • numpy 1.26.4
  • array-api-strict 1.0
  • pytorch 2.1.2 (CPU and CUDA 12.3)
  • cupy 13.0.0 (CUDA 12.3)

The only package with failures was CuPy. I'm seeing 35 failures on main, and 24 with the changes in this PR. The remaining CuPy failures come from:

  1. test_array_api.py::test_nan_reductions (20x), with these errors (see the sketch after this list):
TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly

TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
  2. model_selection/tests/test_split.py::test_array_api_train_test_split (4x) with these errors:
FAILED sklearn/model_selection/tests/test_split.py::test_array_api_train_test_split[True-None-cupy.array_api-None-None] - IndexError: Single-axes index [9 1 6 7 3 0 5] is a non-zero-dimensional integer array, but advanced integer indexing is not s...
FAILED sklearn/model_selection/tests/test_split.py::test_array_api_train_test_split[True-stratify1-cupy-None-None] - ValueError: kind can only be None or 'stable'
FAILED sklearn/model_selection/tests/test_split.py::test_array_api_train_test_split[True-stratify1-cupy.array_api-None-None] - ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class ca...
FAILED sklearn/model_selection/tests/test_split.py::test_array_api_train_test_split[False-None-cupy.array_api-None-None] - IndexError: Single-axes index [0 1 2 3 4 5 6] is a non-zero-dimensional integer array, but advanced integer indexing is not s...
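
For context, the first CuPy error above is CuPy refusing an implicit device-to-host transfer. A minimal illustration (assumes a machine with CUDA and cupy installed; not part of this PR's diff):

```python
import cupy as cp
import numpy as np

x_gpu = cp.asarray([1.0, float("nan"), 3.0])

# np.asarray(x_gpu) raises:
#   TypeError: Implicit conversion to a NumPy array is not allowed.
#   Please use `.get()` to construct a NumPy array explicitly
x_cpu = x_gpu.get()  # explicit device-to-host copy, returns numpy.ndarray
print(np.nanmin(x_cpu))
```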

Failures visible on main that are resolved with this PR are due to:

ValueError: Unsupported device 'cpu'

in the following tests:

  • utils/tests/test_array_api.py::test_weighted_sum
  • tests/test_common.py::test_estimators[LinearDiscriminantAnalysis()-check_array_api_input
  • metrics/tests/test_common.py::test_array_api_compliance


github-actions bot commented Feb 29, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: b4befe9.

@rgommers
Contributor Author

rgommers commented Mar 1, 2024

Re linting failures: I've installed the exact versions of black and ruff used in CI, and I cannot reproduce the linter complaint, either by running the tools directly or by running ./build_tools/linting.sh. It seems unrelated to my changes, just a different way of ordering imports, but since the current ordering is also reasonable (import xxx first, then from xxx import yyy) I'll wait for a maintainer to comment before touching that.

@adrinjalali
Member

We use pre-commit hooks; they should take care of things without needing to install any of the linters in your environment.

@adrinjalali
Member

I ran pre-commit run --files path/to/those/files and then committed all the changes. The linter's happy now.

@rgommers
Contributor Author

rgommers commented Mar 1, 2024

We use pre-commit hooks; they should take care of things without needing to install any of the linters in your environment.

I'm not using pre-commit, mainly for security reasons, but also because it borked my scipy repo once when I evaluated it. So I have to rely on the bot comment instructions posted here.

@adrinjalali
Member

I'd be curious about those security reasons. I've used pre-commit here and in other projects and I've never had an issue, so I'm curious how it broke scipy.

@adrinjalali
Member

I'll let @betatim and @ogrisel review the array API related issues.

@rgommers
Contributor Author

rgommers commented Mar 1, 2024

so curious about how it broke scipy.

I never got to the bottom of that unfortunately. It was inside a conda env, so it may have had to do with how it was installed (I don't like installing dev tools globally with something like pipx).

I'd be curious about those security reasons.

xref scipy/scipy#18895 for a good write-up (not from me). But tl;dr: I create my dev envs myself and like to be in control of them and understand what I'm using. I also actually browse the "to be installed" list that mamba shows me before installing things. A hook on a repo that silently installs things, often from git repos of individuals, without me having control over that doesn't appeal to me. So I choose to do the bit of extra effort to install dev tools myself.

Anyway, let's keep this PR focused on the maintenance issue rather than pre-commit :) Thanks for addressing the linting issue.

Member

@ogrisel ogrisel left a comment


Thanks @rgommers! I like the PR in general, but there are concurrent PRs that might impact it. I am not sure which one should be finalized and merged first. See the comments below for details.

@@ -110,7 +110,7 @@ def size(x):

def _is_numpy_namespace(xp):
"""Return True if xp is backed by NumPy."""
return xp.__name__ in {"numpy", "array_api_compat.numpy", "numpy.array_api"}
return xp.__name__ in {"numpy", "array_api_compat.numpy"}
Member


I like this change: _is_numpy_namespace is now really a marker for the NumPy case, and array-api-strict is just considered like any other non-NumPy array API implementation that is only used for testing purposes, and is therefore not subject to any special treatment motivated by backward-compat considerations.

@ogrisel
Member

ogrisel commented Mar 12, 2024

I tried to sync with main after the merge of #27904 via the GitHub API. I hope I got it right.

@ogrisel
Member

ogrisel commented Mar 13, 2024

For the record, I added the array-api-strict dependency to the pylatest_conda_forge_mkl_linux-64 CI config so that the changes in this PR are properly tested. This is the CI config where array-api-compat and pytorch are already installed.

Member

@ogrisel ogrisel left a comment


I think this PR is good to go on my end. I implemented all the pending things we discussed earlier and updated the docstring slightly to better reflect the current state of affairs.

+1 for merge once #28407 is merged and resulting conflicts with this PR are resolved.

@@ -10,7 +10,7 @@
from .._config import get_config
from .fixes import parse_version

_NUMPY_NAMESPACE_NAMES = {"numpy", "array_api_compat.numpy", "numpy.array_api"}
_NUMPY_NAMESPACE_NAMES = {"numpy", "array_api_compat.numpy"}
Member

@betatim betatim Mar 14, 2024


I support this change, but not including array_api_strict here is a change from how we have been treating numpy.array_api. It might make some uses of is_numpy_namespace() obsolete. (A good thing IMHO)

Related: do we need to keep numpy.array_api here for people who have an older NumPy version, keep using it, and expect that support for something experimental continues to exist in scikit-learn? I think we shouldn't keep it, and should just assume that there are ~0 people in the world who use numpy.array_api.

Member

@ogrisel ogrisel Mar 14, 2024


It might make some uses of is_numpy_namespace() obsolete.

Indeed, for some cases (that we should simplify progressively), but it's still useful from time to time to detect cases where:

  • array_api_dispatch is enabled (via sklearn.set_config(array_api_dispatch=True)) and the input is a (wrapped) numpy array,

vs

  • array_api_dispatch is enabled and the input is not a (wrapped) numpy array.

This is important to know when we can safely convert back and forth to numpy without overhead, for instance to use the out= kwarg for memory efficiency reasons.
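
A hedged illustration of that pattern; scale_inplace_if_possible is a hypothetical helper, not scikit-learn code, while the namespace check matches the diff above:

```python
import numpy as np


def _is_numpy_namespace(xp):
    return xp.__name__ in {"numpy", "array_api_compat.numpy"}


def scale_inplace_if_possible(x, xp, factor=2.0):
    # Hypothetical helper: when the data is already NumPy-backed we can use
    # out= to avoid allocating a temporary; the standard namespace has no out=.
    if _is_numpy_namespace(xp):
        np.multiply(x, factor, out=x)
        return x
    return x * factor
```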

do we need to keep numpy.array_api here for people who have an older numpy version and keep using it and expect that support for something experimental continues to exist in scikit-learn?

I don't see any value in this. numpy.array_api was a temporary experiment and it's going away. There is no point in using it in the future: it just adds complexity for no benefit.

numpy.array_api's value was mostly for testing array API compliance, and that is now better served by array-api-strict, which works even with older NumPy versions, as demonstrated by our CI.

@ogrisel
Member

ogrisel commented Mar 18, 2024

Let me resolve the conflicts quickly.

Member

@betatim betatim left a comment


LGTM.

If we feel like it, updating the docstring as I suggested would be nice. But otherwise: let's merge this thing.

@ogrisel ogrisel enabled auto-merge (squash) March 18, 2024 16:22
@ogrisel
Member

ogrisel commented Mar 18, 2024

I pushed your docstring suggestion and enabled auto-merge on the PR. Thanks for the PR @rgommers and others for the reviews.

@ogrisel ogrisel merged commit 5efc667 into scikit-learn:main Mar 18, 2024