
ENH: Array API standard and Numpy compatibility #21135

Open · vnmabus opened this issue Mar 2, 2022 · 19 comments

@vnmabus

vnmabus commented Mar 2, 2022

Proposed new feature or change:

I was testing how to add support for the array API standard to a small project of mine, while also remaining compatible with NumPy's ndarray, as it is what everyone uses right now. However, the differences between the standard's API and ndarray, and the decision to conform to only the minimal implementation of the standard, make it really difficult to support both use cases with the same code.

For example, in order to create a copy of the input array, which should be a simple thing to do, using the array API standard I would call x = xp.asarray(x, copy=True). However, when I receive a NumPy ndarray I fall back to xp = np, and np.asarray does not have a copy parameter.
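
A minimal sketch of the two code paths this forces (the xp argument and the xp-is-np fallback convention are only illustrative here, not an established API):

import numpy as np

def copy_array(x, xp):
    # Copy x using whichever namespace the caller fell back to.
    if xp is np:
        # np.asarray has no copy= keyword, so the NumPy fallback needs
        # a different call than the standard path.
        return np.array(x, copy=True)
    # Array API standard path
    return xp.asarray(x, copy=True)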

I can't just convert the ndarray to Array, both because I would be returning a different type than the input type, and because Array does not allow the object dtype (and I explicitly allow ndarrays containing Fraction or Decimal objects).

The ideal choice would be either to make the basic NumPy ndarray functions compatible with the standard, or to expose an advanced version of the array_api that deals directly with ndarray objects and supports all NumPy functionality in a manner compatible with the standard. Otherwise, making existing code compatible with both ndarray and the API standard will require a lot of effort and duplicated code to accommodate both.

@mattip
Member

mattip commented Mar 2, 2022

We discussed this at the weekly community meeting. There are a few issues here:

  1. How can the small project accept an ndarray and do array-api compliant operations on it
  2. How can the small project return an ndarray after doing operations via the array-api namespace, and not an xp object
  3. How can the small project support non-array-api compliant operations (i.e. use object or structured arrays) as well as array-api compliant operations within the same code base?

As far as I can tell (1) should JustWork: converting an ndarray to an xp.Array should be seamless when doing xp.asarray(x).
(2) is more problematic, and is a general problem when using xp.asarray. There is no base or other indication of the source of the xp.Array.
(3) is by design. The whole point of the array api (as I understand it) is to provide a common denominator of defined operations and dtypes, which excludes some of the current ndarray operations and dtypes. So if your small project wants to support object arrays, it should not use the array api standard.
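
A rough illustration of points (1) and (2), assuming the experimental numpy.array_api namespace (names and details may still change):

import numpy as np
import numpy.array_api as xp  # importing it warns that the module is experimental

x_np = np.arange(5.0)

# (1) going in is seamless:
x = xp.asarray(x_np)
y = xp.sin(x)

# (2) coming back is the open question: y is an xp Array, not an ndarray, and
# it carries no record that the input was an ndarray, so a library has no
# obvious, standard way to hand back the caller's original array type.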

I think it is worthwhile asking the array api team for clarifications as well. Could you open an issue on their issue tracker https://github.com/data-apis/array-api asking how they view this?

@vnmabus
Author

vnmabus commented Mar 2, 2022

Ok, I created data-apis/array-api#400 to keep them in the loop.

IMHO, the problem is not really with the API standard, but with NumPy and its decision to provide only the minimal implementation of the standard, in a separate namespace. I think that the minimal implementation should be moved to a separate package of array API utils (which maybe should also provide the necessary Protocols for static typing), and that NumPy should try to achieve standard compliance using ndarray objects: allowing advanced functionality when it is compatible with the standard (I think object arrays are not forbidden, just not required), and allowing the usage of non-standard functions through direct use of the np namespace (so ndarray.__array_namespace__ could expose only the minimal functions, to make it clear when you are not being standard-compliant).

The only problem I see with this approach is with ndarray methods, in the cases where they are absent from the standard or their behaviour differs. I think they should try to converge to the standard behaviour, although it probably won't be an easy task.

@asmeurer
Member

asmeurer commented Mar 3, 2022

Worth noting that some of these things have already been discussed in NEP 47.

@seberg
Member

seberg commented Mar 8, 2022

@asmeurer would you be able to detail this a bit more? I did not reread the NEP, but it has no example code snippets for this user-story. As far as I understood both of these questions were not quite settled when the NEP was written (I had asked the same questions at the time).

To my knowledge the answer to this "user story" for library authors is currently still being explored (probably mainly as a sklearn PR, but I am not sure).
I had hoped (and half insisted) that we would have a PR showing this user story before adding the array-API namespace to NumPy. I did not worry about the addition to NumPy as such, but this is a very important question to answer and discuss before we remove the "experimental" status.

@rgommers
Member

rgommers commented Mar 10, 2022

I did not reread the NEP, but it has no example code snippets for this user-story. As far as I understood both of these questions were not quite settled when the NEP was written (I had asked the same questions at the time).

Agreed with this. The NEP has a section Feedback from downstream library authors which is TODO and mentions trying out use cases. From the scikit-learn and SciPy PRs that are in progress we should be learning what the pain points and the preferred approach are. @vnmabus's report of this friction in supporting both array objects and namespaces is discussed in most detail in scikit-learn issue #22352.

I think that the minimal implementation should be moved to a separate package for array API utils (that maybe should provide also the necessary Protocols for static typing) and that NumPy should try to achieve standard compliance using ndarray objects, allowing advanced functionality when compatible with the standard (I think object arrays are not forbidden, just not required), and allowing the usage of non-standard functions through direct use of the np namespace

This is starting to look like a more attractive option indeed. And it is correct that object arrays are not forbidden. Rather than "attractive" I should perhaps say "necessary" - I'm still not super enthusiastic about a standalone Python package with a compliant implementation, but there are probably two reasons to do it indeed:

  1. To reuse numpy.array_api for the more extended namespace
  2. For typing protocols

Given that that standalone package would still have a dependency on a very recent NumPy version and may need to use some private NumPy APIs (that's what numpy.array_api needs to do now), there are of course downsides there as well. Even versioning will not be trivial - should it have the same version number as NumPy and be released in sync, for example? (A concern for later, but separate packages always come with such headaches.)

Also, there's the array API standard test suite, which is now being versioned with a date-based scheme. There were requests for turning that into a package as well; the two could be combined.

Related important point

On the scikit-learn issue it was pointed out that the divergence between the numpy and numpy.array_api namespaces for regular functions is not ideal. That is a good point; ideally, long-term we'd have a single namespace (numpy) which is array API standard compliant and provides a (large) superset of functionality. That was the original goal when starting NEP 47; it's just quite difficult because of backwards-compatibility concerns. The single largest concern was casting rules. Now that @seberg is trying to tackle getting rid of value-based casting (EDIT: link to PR for NEP 50 draft), there may indeed be a path to get there. However, that'd be a >1 year trajectory I'd think, and it's not yet fully clear that we'd be willing to make the necessary backwards-compatibility breaks for value-based casting and a few other topics.

If we do get there, a numpy.array_api as a strict implementation of only what is in the standard is still useful; a superset is then no longer useful.
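
For readers unfamiliar with the value-based casting mentioned above, a small illustration of the current (pre-NEP 50) NumPy behaviour that the standard disallows:

import numpy as np

a = np.array([1, 2], dtype=np.int8)
print((a + 1).dtype)     # int8  - the Python int 1 is cast based on its value
print((a + 1000).dtype)  # int16 - a larger scalar value changes the result dtype
# The array API standard requires result dtypes to depend only on the operand
# dtypes, never on the values involved.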

@vnmabus
Author

vnmabus commented Mar 10, 2022

Note that the array_api module only needs to be used to create the array objects. Once they have been created, one uses get_namespace to get the appropriate module. Thus:

  • If the advanced API uses just ndarray objects, no public array_api module is needed.
  • If a different kind of object is used (as currently ndarray objects are not standard compliant, in particular considering the behaviour of 0-d arrays, which are almost always converted to scalars), then only the code that creates the objects initially needs to change. Ideally that means that most libraries won't need to change at all, and client code only by a couple of lines, as sketched below.
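
A rough sketch of that pattern (the get_namespace helper here is written out for illustration; it is not an official API):

def get_namespace(*arrays):
    # Each standard-compliant array advertises its own namespace.
    namespaces = {a.__array_namespace__() for a in arrays}
    if len(namespaces) != 1:
        raise ValueError("arrays come from different array libraries")
    return namespaces.pop()

def library_function(x):
    xp = get_namespace(x)      # library code never imports a specific backend
    return xp.mean(x, axis=0)  # works for any namespace implementing the standard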

@asmeurer
Member

I was out last week, so sorry for not responding sooner. I only meant that the NEP answers the questions of why array_api is in NumPy and not a standalone package (https://numpy.org/neps/nep-0047-array-api-standard.html#alternatives), why the implementation is minimal (https://numpy.org/neps/nep-0047-array-api-standard.html#high-level-design), and the technical reasons why it is a separate namespace instead of the main namespace (https://numpy.org/neps/nep-0047-array-api-standard.html#implementation). The NEP doesn't directly decide whether or not the main numpy namespace should conform to the standard.

I'm in agreement that NumPy should aim to have full spec compatibility in its main namespace. If you search the code of numpy/array_api for # Note you will find all the places where the two diverge (the NEP also outlines the biggest ones, as does the docstring of numpy.array_api). If people feel it would be helpful, I can extract the various differences into a more readable document.

It's actually not that many things. Quite a few things, like dtype checking, are done for the sake of strictness but aren't actually required by the spec. The biggest thing is some function/keyword argument renames, but those can be added as aliases without breaking compatibility. The list of things that require a compatibility break is small, with the most notable being the removal of value-based casting, which is already being addressed.

A strict implementation like numpy.array_api will always be useful, whether it is part of NumPy or a separate package, as it provides an easy way for array API consumers to check that they aren't accidentally using some array library functionality that isn't part of the spec.
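
A sketch of that "strictness check" use case, assuming the experimental numpy.array_api module:

import numpy.array_api as xp  # strict, minimal implementation

def center(x):
    # xp.mean is part of the spec; something like xp.nanmean is not, and would
    # raise AttributeError here, flagging an accidental non-spec dependency.
    return x - xp.mean(x, axis=0, keepdims=True)

x = xp.asarray([[1.0, 2.0], [3.0, 4.0]])
centered = center(x)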

By the way, there's a somewhat separate issue here, which is how users can handle receiving a numpy.ndarray, converting that into a numpy.array_api.Array, doing computation on it, and then converting it back to numpy.ndarray before returning. This is something that we are discussing how to do better in the standard itself.

@stefanv
Contributor

stefanv commented Mar 15, 2022

By the way, there's a somewhat separate issue here, which is how users can handle receiving a numpy.ndarray, converting that into a numpy.array_api.Array, doing computation on it, and then converting it back to numpy.ndarray before returning. This is something that we are discussing how to do better in the standard itself.

Thanks for looking at that: having a single, simple code path for library authors to handle numpy arrays alongside "foreign arrays" would be good progress.

@asmeurer
Member

Thanks for looking at that: having a single, simple code path for library authors to handle numpy arrays alongside "foreign arrays" would be good progress.

Just to be clear, it's not so much about "foreign arrays". Mixing two different libraries isn't something that the consortium has discussed much. It's about how to handle libraries like NumPy that have a "main" namespace and a separate "array API compatible" namespace. Right now the recommended get_namespace pattern shows how to convert from the main namespace to the array-API-compatible one, but not how to convert back for the return value.

@stefanv
Contributor

stefanv commented Mar 15, 2022

Understood about the mixing. What I meant was that currently you have to do something different for NumPy arrays than for Array-API-compatible arrays; I just want to see that gap reduced so that the same code patterns apply to both.

@seberg
Member

seberg commented Mar 15, 2022

I can extract the various differences into a more readable document.

My 2¢: I think it would be great if you could do that. A good way may be to create a single "almost compatible" module (i.e. we ignore any incompatibilities of np.ndarray itself – and possibly more if there are some tricky ones).
I assume that most of it will be a list of functions that directly map to the NumPy version (well enough!), and then a few that don't.

If that is the case, then a module may actually be good enough to read! And it will also be a perfect start for further summarizing only the more difficult parts (if those even exist – aside from value-based promotion).

Why? For two reasons:

  1. It would be a good basis for a discussion about stating an explicit goal that the NumPy main namespace should be fully compatible in the future. Without it, it is hard to tell how difficult this would be and how fast we should do it. (I understand now that this was an idea when NEP 47 was written. But it is not currently what NEP 47 proposes, I think.)
  2. I bet it will be immediately useful:
    • libraries like sklearn could vendor the file for now, to use a pattern of:
      try:
          xnp = get_namespace(x)               # x provides __array_namespace__
      except (AttributeError, TypeError):      # plain ndarray: no namespace yet
          import compat_namespace as xnp
      
    • If we have this namespace, we can discuss whether it may make sense to return it from ndarray.__array_namespace__.
    • And we may consider using it as the np.array_api (moving that elsewhere in NumPy or outside), so that downstream can use that in the pattern above. Or, with an additional try/except, use it if available.

About the "promotion" problem: This thread for example mentions it also. I am very sure there was some discussion about it before in an issue w.r.t to NEP 37; but I can't find it ☹ (e.g. having __array_namespace__ be a function that gets passed all types to support the rare mixed cases, similar to __array_function__. There may have been other thoughts, I don't remember.).

I think the promotion use-case is important and should not be forgotten. But it is not related to this discussion and not urgent. (This could be very important for Dask, since Dask tries to work well with both NumPy arrays and cupy in __array_function__ IIRC.)

@asmeurer
Member

My 2¢: I think it would be great if you could do that. A good way may be to create a single "almost compatible" module (i.e. we ignore any incompatibilities of np.ndarray itself – and possibly more if there are some tricky ones).

I'm working on this. Ralf suggested adding it to the NumPy documentation for numpy.array_api (which doesn't exist yet, but I will add it).

One question I came across concerns scalars vs. 0-D arrays. The spec only has 0-D arrays, not scalars. The question is: assuming NumPy fixes type promotion on scalars so that they promote the same way as 0-D arrays, do you know of any other incompatibilities between them that are relevant for the spec?

If there aren't any, I think NumPy can be compatible with its current behavior just by pretending that scalars are 0-D arrays. They print differently and their Python type isn't ndarray, but there's nothing in the spec that requires that.
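
A quick check of that idea: NumPy scalars already expose the array attributes the spec relies on (shown here for a single dtype only):

import numpy as np

s = np.float64(1.5)   # a scalar; its type is np.float64, not ndarray
a = np.asarray(1.5)   # a true 0-D array

assert s.shape == a.shape == ()
assert s.ndim == a.ndim == 0
assert s.dtype == a.dtype == np.float64
# They print differently and isinstance(s, np.ndarray) is False, but the spec
# does not mandate a particular Python type for arrays.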

@vnmabus
Author

vnmabus commented Mar 18, 2022

IMHO, there should be no scalars, only 0-D arrays, period. Otherwise, even if they behave almost identically, they break isinstance checks and static typing. For example you cannot do:

from typing import TypeVar

# ArrayProtocolYetToDetermine and get_namespace are placeholders for a
# protocol and a helper that do not exist yet.
T = TypeVar("T", bound=ArrayProtocolYetToDetermine)

def my_sum(array: T) -> T:
    xp = get_namespace(array)
    return xp.sum(array)

as this wouldn't hold for NumPy arrays: np.sum returns a scalar (e.g. np.float64), not an ndarray, so the annotated return type T would not match what is actually returned.

@asmeurer
Member

There is a discussion about having a typing protocol for array objects in the spec data-apis/array-api#229. As far as I understand, a NumPy scalar would pass this protocol, because it has all the same attributes as ndarray (I'm not an expert on typing stuff, though, so please correct me if I'm wrong). The spec doesn't specify a specific array type anywhere. array is used in the type annotations, but libraries are free to make array be whatever their array class happens to be. For NumPy array could be Union[np.ndarray, np.generic] (I've not followed the NumPy typing discussions, but presumably that's already the annotated return type of most NumPy ufuncs?).

I agree in principle that just having 0-D arrays is better than having scalars, and the consortium agrees too, which is why the spec only includes 0-D arrays. But removing scalars from NumPy would be a very difficult task and it would be much simpler if NumPy could be spec compliant without actually having to do that. But again, I might be missing some other incompatibility with them, which is what I'd like to determine.

@vnmabus
Author

vnmabus commented Mar 18, 2022

In the above code the TypeVar would be resolved to a concrete type (either ndarray or generic) in a particular call, AFAIK, and thus Mypy won't infer the right return type for an ndarray parameter.

@vnmabus
Author

vnmabus commented Mar 25, 2022

Another small inconvenience when falling back to xp = np: when you then use a dtype such as xp.bool, an ugly warning is generated:

DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
  Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations

I don't know an easy way to please both NumPy and the standard array API.
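
One possible workaround (just a sketch, not an endorsed pattern): map the standard dtype names explicitly instead of using getattr on the NumPy namespace:

import numpy as np

# Partial mapping from standard dtype names to NumPy dtypes (extend as needed).
_NP_DTYPE_FALLBACKS = {"bool": np.bool_, "float32": np.float32, "float64": np.float64}

def get_dtype(xp, name):
    if xp is np:
        # avoids deprecated aliases such as np.bool
        return _NP_DTYPE_FALLBACKS[name]
    # array API namespaces expose dtypes under the standard names, e.g. xp.bool
    return getattr(xp, name)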

@asmeurer
Member

asmeurer commented Apr 7, 2022

There is now a document in the documentation that enumerates all the differences between numpy and numpy.array_api, and categorizes each difference based on whether it exists only for strictness (and so wouldn't need to be ported to numpy), could be added to numpy in a backwards-compatible way, or would require a backwards compatibility break: https://numpy.org/devdocs/reference/array_api.html

The most important things to consider here are the breaking changes, although thinking about how to do the compatible changes (most of which would just be name aliases) is also useful.

@leofang
Contributor

leofang commented Apr 7, 2022

cc: @kmaehashi @asi1024 @emcastillo

@rgommers
Member

DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`

I opened gh-22021 to address that. We want to be able to use np.bool normally.
