
ENH: implement NEP-35's like= argument #16935


Merged
68 commits merged into numpy:master on Aug 28, 2020

Conversation

@pentschev (Contributor) commented Jul 23, 2020

This PR adds the implementation of NEP-35's like= argument, allowing dispatch of array creation functions with __array_function__ based on a reference array.

There are two ways to dispatch, details below.

The first is via the Python API (as demonstrated by np.ones and np.full), where array_function_dispatch is used. Unlike the existing dispatch for compute functions, which usually dispatches on the first argument, this dispatches on the like= keyword argument, returning (like,). We also check that like is not None as a performance optimization, see #16935 (comment) .
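
A minimal sketch of this Python-side pattern follows (helper names such as _ones_dispatcher and _ones_with_like are illustrative, and error handling for a like= value that does not implement __array_function__ is omitted; the actual wiring in the PR may differ):

import numpy as np
from numpy.core.overrides import array_function_dispatch

def _ones_dispatcher(shape, dtype=None, order=None, *, like=None):
    # Dispatch only on the reference array passed through like=.
    return (like,)

def ones(shape, dtype=None, order='C', *, like=None):
    # Performance optimization: only enter __array_function__ dispatch
    # when a reference array was actually provided.
    if like is not None:
        return _ones_with_like(shape, dtype=dtype, order=order, like=like)
    a = np.empty(shape, dtype, order)
    a.fill(1)
    return a

# Wrapping the public function once creates the dispatching entry point;
# in the real implementation like= is stripped before downstream libraries
# see the call.
_ones_with_like = array_function_dispatch(_ones_dispatcher)(ones)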

The second dispatch occurs via C, by splitting array_implement_array_function into two functions. The first remains very similar to the original implementation (array_implement_array_function) but adds a step to remove the like= argument before calling downstream libraries -- downstream libraries shall not add like= to their signatures. The second function (array_implement_c_array_function) also removes the like= argument, but additionally extracts the reference array from it and obtains the public_api Python function by importing np.function_name, where function_name is passed in by the calling function.
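
Conceptually, and rendered here in Python purely for illustration (the real dispatcher is written in C, and none of the names below are the actual C symbols), the second dispatcher does something along these lines:

import numpy as np

def implement_c_array_function(function_name, args, kwargs):
    # like= is consumed here; downstream libraries never see it.
    like = kwargs.pop("like", None)
    if like is None:
        return NotImplemented
    # The C code obtains the public NumPy function by name, e.g.
    # "array" resolves to np.array, so downstream code is handed the
    # public API object rather than a private implementation.
    public_api = getattr(np, function_name)
    # Defer to the reference array's __array_function__ implementation.
    return like.__array_function__(public_api, (type(like),), args, kwargs)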

The usage of the C dispatch is very straightforward and optimized for the case where like=None, adding minimal overhead to such functions. The necessary work is thus done only when a reference array is passed, such as importing the NumPy Python function -- this could still be improved with a lookup mechanism to avoid reimporting on each subsequent call, but it was decided not to do that in this PR. The caller function still needs to add a like argument to its keyword list and parse it, even though it is not used anywhere other than the dispatcher function. The caller also needs to call the C dispatcher and check its return value: if it returns Py_NotImplemented, the caller continues with NumPy's implementation; otherwise it returns that value (from the downstream library) immediately.
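
For reference, the user-facing behavior this enables looks like the following (CuPy is used only as an example of a library implementing __array_function__ and is not part of this PR):

import numpy as np
import cupy

ref = cupy.arange(3)

# Without like=, NumPy's own implementation runs and returns a numpy.ndarray.
a = np.asarray([1, 2, 3])

# With like=, creation is dispatched to the library of the reference array,
# so b is a CuPy array rather than a NumPy array.
b = np.asarray([1, 2, 3], like=ref)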

This new function allows dispatching from C directly, while also
implementing the new `like=` argument, requiring only minimal
changes to existing array creation functions that need to add
support for that argument.
The implementation uses array_implement_c_array_function, thus
introducing minimal complexity to the original _array_fromobject
code.
np.asarray can rely on np.array's C dispatcher instead.
Tests cover some of the functions that have already been implemented (a small dispatch sketch follows the list):
* np.array (C dispatcher)
* np.asarray (indirect C dispatcher via np.array)
* np.full (Python dispatcher)
* np.ones (Python dispatcher)
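
The dispatch can also be exercised without any third-party library by a small duck-typed class implementing __array_function__ (a hypothetical helper for illustration only, not the tests added in this PR; requires NumPy >= 1.20):

import numpy as np

class LikeCapture:
    # Minimal duck-typed object: it only reports which public NumPy
    # function the like= dispatch handed over.
    def __array_function__(self, func, types, args, kwargs):
        return func.__name__

ref = LikeCapture()
assert np.ones(3, like=ref) == "ones"
assert np.full(3, 7, like=ref) == "full"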
@pentschev (Contributor Author)

Replace PyUnicode_FromString with PyUnicode_InternFromString to cache
the "like" string.
Check and handle errors when importing NumPy's Python functions.
Use the Python dispatcher for all of them. Using the C dispatcher
directly on the `np.array` call can result in incorrect behavior if the
downstream library's implementation is different or if not all keyword
arguments are supported.
@pentschev (Contributor Author)

Thanks @seberg for bringing this up, I've fixed the conflict now. I'm very excited that we're close to merging this, so more folks can do some testing now! :)

public_api.__doc__ = public_api.__doc__.replace(
    "${ARRAY_FUNCTION_LIKE}",
    array_function_like_doc,
)
Member

Suggested change:
    )
    return public_api  # to allow decorator use

Contributor Author

Thanks @eric-wieser, I tried something much more complicated and failed to achieve this; I'm glad it's so simple! Maybe one day I should take a few Python decorator classes with you! :)

That is now done in 1fc1b60 .

    public_api=ascontiguousarray,
)
def _ascontiguousarray_with_like(a, dtype=None, *, like=None):
    pass
@eric-wieser (Member) commented Aug 26, 2020

This strikes me as really weird. The following spelling would also work, and avoids repeating the signature:

_ascontiguousarray_with_like = array_function_dispatch(
    _asarray_contiguous_fortran_dispatcher
)(ascontiguousarray)

As written right now, the body and signatures of these functions are completely ignored - a recipe for confusion and stale signatures

Contributor Author

Thanks, I definitely missed we could do something simple like this. Done now in d83cf9d .

@jakirkham (Contributor)

@eric-wieser, would you be ok just pushing the minor changes proposed or rolling them into a follow-up PR? Peter is out-of-the-office (though I see he kindly stopped by to tidy things briefly; thanks Peter! 🙂). So I think anything else we are wanting addressed in the near term would be better handled directly. Does that sound ok?

@pentschev (Contributor Author)

Sorry @jakirkham, I only saw your message just now, but the changes are in. Thanks for being on top of this! 😄

    possible_flags = {'C': 'C', 'C_CONTIGUOUS': 'C', 'CONTIGUOUS': 'C',
                      'F': 'F', 'F_CONTIGUOUS': 'F', 'FORTRAN': 'F',
                      'A': 'A', 'ALIGNED': 'A',
                      'W': 'W', 'WRITEABLE': 'W',
                      'O': 'O', 'OWNDATA': 'O',
                      'E': 'E', 'ENSUREARRAY': 'E'}
    if not requirements:
        return asanyarray(a, dtype=dtype)
Member

Why pass like here given we know it is None? Do we actually want to dispatch on like in these functions, or should we just forward the argument to np.array?

Contributor Author

Good catch, this was probably a leftover from some early attempts; I've now removed those in c1e8754.

@eric-wieser (Member) commented Aug 26, 2020

I think this pattern is present in lots of places. I'd perhaps argue that, for instance, np.array should have like dispatching, but np.asarray and np.asanyarray should remain aliases for array(subok=..., copy=..., like=like). Thoughts @seberg?

Contributor Author

Downstream libraries may have completely different implementations; that's why we need each function to have like=. For instance, CuPy doesn't support subok, as per the note in https://docs.cupy.dev/en/stable/reference/generated/cupy.array.html, so we have to dispatch directly to CuPy so that it handles that appropriately.

Member

Hmmmpf, I thought I had brought that up but forgot about it again.

I am not sure why we should have more than one entry point for the as.*array() functions. In principle it seems like only ever using np.array() with the full signature during __array_function__ dispatching should be good enough?

@pentschev did you have a reason for wanting np.asarray, np.ascontiguousarray, ... instead of only np.array() as the public_api dispatching "target"? Dask should not really have to know that the call was np.asarray(arr) as opposed to np.array(arr, copy=False), right? And that would mean fewer functions to wrap (although I guess you probably just wrap everything anyway).

We might still want an if like is None: branch for micro-optimization, but I am not sure whether it will make a real difference in that case.

Member

Yeah, I was typing when you posted. That seems like a reason, although I didn't quite grok the implications yet...

Contributor Author

Note we also discussed this previously in #16935 (comment):

Or, we admit that these are just aliases around np.array, and only defer for np.array itself. But I suppose we do not want that, since the error messages would be confusing (np.asarray saying that it did not find an implementation for np.array).

I did that in the beginning, but I realized that's not necessarily correct and removed it in 2c54820; downstream libraries may implement these functions in a completely different manner than NumPy.

Contributor Author

Yeah, I was typing when you posted. That seems like a reason, although I didn't quite grok the implications yet...

The implication is that calling np.array(..., subok=True, like=cupy_array) will fail with NotImplemented because cupy.array doesn't implement subok, whereas cupy.asanyarray is implemented differently and works with what we currently do, i.e., dispatching np.asanyarray(..., like=cupy_array).

Member

Hmm, yeah, I don't mind these aliases too much (they are a bit tedious, but only a bit), but I don't really like it much either :). While cupy.array() probably shouldn't support subok=True, I wonder if the function called by its __array_function__ could just as well support (and ignore) it...

Although the whole subok parameter is slightly dubious in combination with like=?

PS: I added it to the tracking issue just in case we don't quite find a resolution.

Contributor Author

As far as I understand it, implementing the NumPy API doesn't imply that the underlying implementation has to behave exactly the same. With that said, some downstream library implementing asanyarray could, for example, implement it to always return np.array(1) for whatever reason, and that could be totally valid. Therefore, from the NumPy perspective, we shouldn't assume that downstream libraries are supposed to implement those aliases exactly the same way as NumPy does. My example is obviously a bit extreme, but I'm just trying to make the point that with __array_function__ we're using the API only; the underlying implementation is entirely the responsibility of the downstream library, and thus we should only dispatch on the API.

This was there for a bit, but the same thing can already be
achieved without the parameter, so it is now unused.
@seberg seberg merged commit 4cd6e4b into numpy:master Aug 28, 2020
@seberg (Member) commented Aug 28, 2020

Thanks everyone for the review and the last fixups! @eric-wieser (and others), if you do continue to comment here, I will make sure to fix things up (I had added one more commit to undo now-unnecessary changes in override.py). One reason I liked merging it now is that it should end up in the weekly build.

Everyone should be aware that very likely this will not be the final version, and it may not land unmodified in NumPy 1.20 (at least not without the NEP being accepted)!

@pentschev can you make sure we schedule a meeting in a month or so to discuss this? At least assuming you or sklearn had some chance of trying this out? @thomasjpfan just a ping in case you were not in the loop here.

Comment on lines +2170 to +2174
     if like is not None:
         return _identity_with_like(n, dtype=dtype, like=like)

     from numpy import eye
-    return eye(n, dtype=dtype)
+    return eye(n, dtype=dtype, like=like)
Member

This one is a really strong argument for not having a like override - it shouldn't be up to dask or cupy to decide whether eye and identity are aliases.

Member

(also, this like=like is always like=None)

Contributor Author

I'll fix that in a separate PR, thanks @eric-wieser for spotting this.

@jakirkham (Contributor)

Really excited to see this land! Thanks for the hard work all 😄

@pentschev (Contributor Author)

Thanks everyone for reviews and @seberg for merging this!

@pentschev can you make sure we schedule a meeting in a month or so to discuss this? At least assuming you or sklearn had some chance of trying this out? @thomasjpfan just a ping in case you were not in the loop here.

We will try this out in Dask and other potential use cases, and then I'll update those who were in our previous meeting discussing NEPs 35 and 37 to discuss next steps; we still have an email thread with those involved, so I'll write there.

@paugier commented Mar 15, 2021

I detected a performance regression using a Python accelerator (Pythran) between NumPy 1.19.5 and 1.20 (see serge-sans-paille/pythran#1735). The regression is in a loop which does not involve Python (only pure C++ without CPython API calls). This indicates that it is about the raw data of the arrays. I didn't try to investigate whether it can also impact other tools that accelerate Python or create Python extensions (like Cython or Numba), but I don't see why it would be specific to Pythran.

By preparing the arrays used in the computation with arr.copy() (where arr is a NumPy ndarray produced from a Pandas dataframe with df.values), I get good results with NumPy 1.19.5 and less good results with NumPy 1.20. I'm posting this here because the problem appears with commit 4cd6e4b, which seems to correspond to this PR.

@pentschev or @seberg, could you shed some light on what is happening? I see that arr.copy() no longer produces the same arrays. It's quite subtle because the values, the flags and the strides are the same; the arrays are just such that native computations on them are a bit less efficient.

@pentschev (Contributor Author)

@paugier would it be possible for you to open a new GH issue with a minimal reproducer for that, hopefully detached from Pandas and other 3rd-party libraries? I read and responded a few minutes back on the discussion list, but it's still not exactly clear to me what the differences are that you're referring to, plus a reproducer would allow us to verify the codepath that is potentially causing this.

@pentschev (Contributor Author)

Also, for completeness on this thread: curiously, I see NumPy 1.20.1 being considerably faster for small arrays and mildly faster with large arrays (results here).

@seberg (Member) commented Mar 15, 2021

@pentschev yeah, the life-cycle of small arrays was improved when we cleaned up the buffer handling (creating a view and deleting it should be about 30% faster, e.g. for array[...], IIRC).

I will distinctly not worry about this PR having anything to do with @paugier's observation. As said multiple times on the mailing list, these types of fluctuations are as common as they are puzzling. As much as I would like to know how to avoid them, I would be shocked if looking for a code change in NumPy got you anywhere closer to understanding this.

@pentschev (Contributor Author)

Thanks for the insight @seberg. I agree with you: unless there's a consistent minimal reproducer that can show this or any other commit to be responsible, there isn't much that can be done.

Labels: 01 - Enhancement, component: __array_function__, Priority: high