
DOC: Document interoperability best practices #20998

Open
astrojuanlu opened this issue Feb 4, 2022 · 5 comments


@astrojuanlu
Contributor

Issue with current documentation:

I was reading the wonderful Interoperability with NumPy guide introduced in #20185 (thanks!) and it does a very good job of explaining what the user can do, (a) giving detailed explanations of what the different methods do and (b) showing examples from popular libraries that apply them.

However, if one dives deep enough there seems to be some conflicting information. For example, NEP 18 (2018) says:

There are no general requirements on the return value from __array_function__, although most sensible implementations should probably return array(s) with the same type as one of the function's arguments.

However, NEP 37 (2019) recalls:

__array_function__ has significant implications for libraries that use it: [...] users expect NumPy functions like np.concatenate to return NumPy arrays. [...] Libraries like Dask and CuPy have looked at and accepted the backwards incompatibility impact of __array_function__; it would still have been better for them if that impact didn't exist.

Therefore, from my understanding (and I'd love to be corrected here if I'm wrong!) even though NEP 18 left the return value of __array_function__ loosely defined, there was some friction when projects tried to adopt it. I think these findings that were only possible after putting the code in the hands of users are extremely valuable for new library authors.

In a way, I believe this goes to the core of something the guide briefly mentions at the beginning (inspiration from the chart at the top of data-apis/array-api#1): the difference between "array providers" (CuPy, pydata/sparse, dask.array) and "array consumers" (xarray), including the gray area between the two (pandas Series consume NumPy arrays but also offer an array interface, and the same goes for astropy.units).

Idea or request for content:

Expand the current interoperability guide (or add another guide/cookbook/whatever) that addresses practical questions like

  • Should new implementers of NEP 18 return ndarrays, or custom classes?
  • What is the best way to make your array consumer library interoperate with several array providers?
  • Should your custom class adopt the Python array API standard?
  • What are the performance implications of each dispatch mechanism?

(random questions off the top of my head)

If folks are happy to discuss these, either here or in the mailing list, I'd be happy to try to sort out those thoughts and contribute something to the docs myself.

@seberg
Member

seberg commented Feb 4, 2022

I would suggest we discuss this here (or in a new topic there).

I think the transition problem should be clear enough. __array_function__ should return whatever makes most sense for the inputs (which usually means dask input, dask output, etc.), I simply don't see a point in returning numpy arrays.
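For concreteness, the "same type in, same type out" behaviour described here can be sketched with a minimal NEP 18 implementation. The `MyArray` class and the `implements` registry helper below are hypothetical names used for illustration, not NumPy API; only the `__array_function__` hook itself is part of the protocol:

```python
import numpy as np

HANDLED = {}  # maps NumPy functions to MyArray-aware implementations


def implements(numpy_func):
    """Register a MyArray implementation for a NumPy function (hypothetical helper)."""
    def decorator(func):
        HANDLED[numpy_func] = func
        return func
    return decorator


class MyArray:
    """Toy duck array wrapping an ndarray (illustrative only)."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        if func not in HANDLED:
            return NotImplemented
        # Only handle types we know about, to keep dispatch predictable.
        if not all(issubclass(t, (MyArray, np.ndarray)) for t in types):
            return NotImplemented
        return HANDLED[func](*args, **kwargs)


@implements(np.concatenate)
def _concatenate(arrays, axis=0):
    # MyArray in, MyArray out: return the same type as the inputs.
    return MyArray(np.concatenate([np.asarray(a.data) for a in arrays], axis=axis))


a = MyArray([1, 2])
b = MyArray([3, 4])
c = np.concatenate([a, b])
print(type(c).__name__)  # MyArray, not ndarray
```

Calling `np.concatenate` on `MyArray` inputs dispatches through `__array_function__` and returns a `MyArray`, which is the behaviour being argued for here.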

In my personal opinion, the question is not about what we want, but about how to transition users. There are three ways to do that, which I noted in the second point in the above discussion.

For this issue: I doubt we should bother here for documentation, this needs to be solved ecosystem wide in those discussions first. Note that IMO, NEP 37 "moves" the responsibility a bit, but doesn't remove the problem itself entirely.

@astrojuanlu
Contributor Author

Thanks a lot @seberg. I went over those posts during the weekend; it's quite a lot to digest and I don't think I'll easily wrap my head around all of it. In any case, I'll try to see what we can distill from those discussions now (even by adding notes or warnings about possible caveats), and what is still in flux and needs wider ecosystem consensus first.

@astrojuanlu
Contributor Author

To give one particular example, it seems that the idea of subclassing ndarrays is frowned upon, but I couldn't find detailed explanations, only breadcrumbs. For example, @rgommers in one of the threads you linked:

The only reason we’re even thinking about this is because of NumPy being overly flexible and accepting too much, and as a result the proliferation of asarray throughout the ecosystem, which then resulted in us recommending against using ndarray subclasses etc.

Subclasses are tangentially mentioned in NEP 47 when discussing the asarray / asanyarray (anti)pattern:

Many existing libraries use the same asarray (or asanyarray) pattern as NumPy itself does; accepting any object that can be coerced into a np.ndarray. We consider this design pattern problematic - keeping in mind the Zen of Python, “explicit is better than implicit”, as well as the pattern being historically problematic in the SciPy ecosystem for ndarray subclasses and with over-eager object creation. All other array/tensor libraries are more strict, and that works out fine in practice. We would advise authors of new libraries to avoid the asarray pattern. Instead they should either accept just NumPy arrays or, if they want to support multiple kinds of arrays, check if the incoming array object supports the array API standard by checking for __array_namespace__ as shown in the example above.
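The `__array_namespace__` check that NEP 47 recommends over the `asarray` pattern can be sketched roughly as follows. The `total` function is a hypothetical consumer written for illustration; note that plain `np.ndarray` only gained `__array_namespace__` in NumPy 2.0, hence the explicit `ndarray` branch here:

```python
import numpy as np


def total(x):
    """Sum an array-like without blindly coercing unknown objects via np.asarray."""
    if isinstance(x, np.ndarray):
        return np.sum(x)
    # Array API standard entry point: a compliant array returns its own
    # namespace of functions, so we operate in the caller's array library.
    ns = getattr(x, "__array_namespace__", None)
    if ns is not None:
        xp = ns()
        return xp.sum(x)
    # Explicit is better than implicit: reject everything else.
    raise TypeError(f"cannot handle {type(x).__name__}: not a recognized array type")


print(total(np.array([1, 2, 3])))  # 6
```

Unlike `asarray`, this rejects lists, scalars, and unknown objects loudly instead of eagerly converting them, which is the design stance the NEP 47 excerpt advocates.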

And the readme of your precodita:

The second restriction is implemented because in the world of arrays, subclasses usually add behaviour in a way that means implementations cannot be expected to return useful results anymore. This makes subclassing less convenient (because sometimes things work out) but protects from those surprises common to NumPy's masked arrays and matrices (why did I get a normal array back and the mask got ignored?).

In summary, there seems to be some "tribal knowledge" that I would like to help properly capture in written form. It doesn't have to be "everybody agrees subclassing is bad", but at least we could describe the kind of surprises that might appear, how to avoid them, etc.

@melissawm
Member

I'll be honest - I'd be out of my depth here. But as long as someone can point me in the right direction as far as what should actually be documented, or even pair up on this doc, I'd be happy to work on it.

@seberg
Member

seberg commented Feb 10, 2022

there seems to be some "tribal knowledge"

I prefer "culture" ;). More seriously: subclasses are problematic for exactly the reason subclasses are always problematic. If you break the Liskov substitution principle, you have to live with the consequences.

The problem is that:

  • Almost all interesting subclasses break Liskov's substitution principle
  • Arrays have a large API surface and most of that is functions and not methods. Just like math.<function> won't work for a "masked scalar" a lot of NumPy doesn't actually work right for "masked arrays". These functions will often return base-class arrays and may convert the input silently to a base-class array (which again should be OK for a "proper" subclass!).
  • And finally, there are some bad subclasses that look like they should be fine (masked arrays, matrix).
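The "why did I get a normal array back and the mask got ignored?" surprise mentioned for masked arrays is easy to reproduce. Any code path that funnels its input through `np.asarray` silently demotes the subclass to a base-class ndarray:

```python
import numpy as np

# A masked array hides its second element behind the mask.
m = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(m)  # [1.0 -- 3.0]

# np.asarray returns a base-class ndarray view of the underlying data:
# the mask is silently dropped and the "hidden" value reappears.
a = np.asarray(m)
print(type(a).__name__)  # ndarray
print(a)                 # [1. 2. 3.]
```

This is exactly the enrichment problem described below: the subclass carries vital information (the mask) that the base class cannot represent, so any base-class round trip loses it.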

but while such notes may make sense somewhere, I think they would belong on the doc page for "subclassing" which should probably point to the interop stuff.

There is nothing wrong with subclassing (memmaps are fine, except that they should demote more aggressively IMO), but almost everyone breaks Liskov's. That is OK, but there are clear limits and most of the time the gains (a few things "just work") are not worth the downsides (the stuff that doesn't work may surprise you).

EDIT: OK, to be clear, I am not sure masked arrays break "Liskov" explicitly – matrix does. But they "enrich" the base class in ways that mean you lose vital information if you go back to the base-class.

seberg added a commit to seberg/numpy that referenced this issue Mar 1, 2022
I think we have consensus that for many users subclassing is a bad idea,
but our docs don't really give much of a hint as to why, and what else.

We long have the answers for these, so this is a start to actually
write them down.

Addresses some of the points in numpygh-20998.
seberg added a commit to seberg/numpy that referenced this issue Mar 1, 2022
seberg added a commit to seberg/numpy that referenced this issue Mar 15, 2022