-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: implement NEP-35's like=
argument
#16935
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This new function allows dispatching from C directly, while also implementing the new `like=` argument, requiring only minimal changes to existing array creation functions that need to add support for that argument.
The implementation uses array_implement_c_array_function, thus introducing minimal complexity to the original _array_fromobject code.
np.asarray can rely on np.array's C dispatcher instead.
Tests comprise some of the functions that have been implemented already: * np.array (C dispatcher) * np.asarray (indirect C dispatcher via np.array) * np.full (Python dispatcher) * np.ones (Python dispatcher)
Replace PyUnicode_FromString by PyUnicode_InternFromString to cache "like" string.
Check and handle errors on importing NumPy's Python functions
Using Python dispatcher for all of them. Using the C dispatcher directly on the `np.array` call can result in incorrect behavior. Incorrect behavior may happen if the downstream library's implementation is different or if not all keyword arguments are supported.
Thanks @seberg for bringing this up, I've fixed the conflict now and I am very much excited we're close to merging this, so more folks can do some testing now! :) |
public_api.__doc__ = public_api.__doc__.replace( | ||
"${ARRAY_FUNCTION_LIKE}", | ||
array_function_like_doc, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
) | |
) | |
return public_api # to allow decorator use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eric-wieser , I tried something much more complicated and failed to achieve this, I'm glad it's so simple! Maybe one day I should take a few Python decorator classes with you! :)
That is now done in 1fc1b60 .
numpy/core/_asarray.py
Outdated
public_api=ascontiguousarray, | ||
) | ||
def _ascontiguousarray_with_like(a, dtype=None, *, like=None): | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This strikes me as really weird. The following spelling would also work, and avoids repeating the signature:
_ascontiguousarray_with_like = array_function_dispatch(
_asarray_contiguous_fortran_dispatcher
)(ascontiguousarray)
As written right now, the body and signatures of these functions are completely ignored - a recipe for confusion and stale signatures
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I definitely missed we could do something simple like this. Done now in d83cf9d .
@eric-wieser, would you be ok just pushing the minor changes proposed or rolling them into a follow-up PR? Peter is out-of-the-office (though I see he kindly stopped by to tidy things briefly; thanks Peter! 🙂). So I think anything else we are wanting addressed in the near term would be better handled directly. Does that sound ok? |
Sorry @jakirkham , I only saw your message here now but changes are in now. Thanks for being on top of this! 😄 |
numpy/core/_asarray.py
Outdated
possible_flags = {'C': 'C', 'C_CONTIGUOUS': 'C', 'CONTIGUOUS': 'C', | ||
'F': 'F', 'F_CONTIGUOUS': 'F', 'FORTRAN': 'F', | ||
'A': 'A', 'ALIGNED': 'A', | ||
'W': 'W', 'WRITEABLE': 'W', | ||
'O': 'O', 'OWNDATA': 'O', | ||
'E': 'E', 'ENSUREARRAY': 'E'} | ||
if not requirements: | ||
return asanyarray(a, dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why pass like here given we know it is none? Do we actually want to dispatch to like
in these functions, or should we just forward the argument to np.array
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this was probably a leftover from some early attempts, I now removed those in c1e8754 .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this pattern is present in lots of places. I'd perhaps argue that for instance np.array
should have like
dispatching, but np.asarray
and np.asanyarray
should remain an alias for array(subok=..., copy=..., like=like)
. Thoughts @seberg?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Downstream libraries may have completely different implementations, that's why we need each function to have like=
. For instance, CuPy doesn't support subok
, as per the note in https://docs.cupy.dev/en/stable/reference/generated/cupy.array.html , so we have to dispatch it directly to CuPy so it handles that appropriately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmmpf, I thought I had brought that up but forgot about it again.
I am not sure why we should have more than one entry point for the as.*array()
functions. In principle it seems like only ever using np.array()
with the full signature during __array_function__
dispatching should be good enough?
@pentschev did you have a reason for wanting np.asarray
, np.ascontiguousarray
, ... instead of only np.array()
as the public_api
dispatching "target"? Dask should not really have to know that the call was np.asarray(arr)
as opposed to np.array(arr, copy=False)
, right? And that would mean fewer functions to wrap (although I guess you probably just wrap everything anyway).
We might still want an if like is None:
branch for micro-optimization, but I am not sure whether it will make a real difference in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was typing when you posted. That seems like a reason, although I didn't quite grok the implications yet...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note we also discussed this previously in #16935 (comment) :
Or, we admit that these are just aliases around
np.array
, and only defer fornp.array
itself. But I suppose we do not want that, since the error messages would be confusing (np.asarray
saying that it did not find an implementation fornp.array
).I did that in the beginning, but I realized that's not necessarily correct and removed that in 2c54820, downstream libraries may implement these functions in a completely different manner than NumPy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was typing when you posted. That seems like a reason, although I didn't quite grok the implications yet...
The implication is when you call np.array(..., subok=True, like=cupy_array)
would will fail with NotImplemented
because cupy.array
doesn't implement subok
, and cupy.asanyarray
is implemented differently which works with how it's currently implemented doing np.asanyarray(..., like=cupy_array)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, yeah, I don't mind these aliases too much (they are a bit tedious, but only a bit), but I don't really like it much either :). While cupy.array()
probably shouldn't support subok=True
, I wonder if the function called by its __array_function__
could just as well support (and ignore) it...
Although the whole subok parameter is slightly dubious in combination with like=
?
PS: I added it to the tracking issue just in case we don't quite find a resolution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand it, implementing the NumPy API doesn't imply that the underlying implementation has to behave exactly the same. With that said, some downstream library implementing asanyarray
could, for example, implement it to always return np.array(1)
for whatever reason, and that could be totally valid. Therefore, from the NumPy perspective, we shouldn't assume that downstream libraries are supposed to implement those aliases exactly the same way as NumPy does. My example is obviously a bit extreme, but I'm just trying to make the point that with __array_function__
we're using the API only, and the underlying implementation is entirely of responsibility of the downstream library and thus we should only dispatch on the API.
This was there for a bit, but the same thing can already be achieved without the parameter, so that it is now unused.
Thanks everyone for the review and the last fixups! @eric-wieser (and others) if you do continue to comment here, I will make sure to fix things up (I had added one more commit to undo now unnecessary changes in Everyone should be aware that very likely this will not be the final version, and not land unmodified in NumPy 1.20 (at least not without the NEP being accepted)! @pentschev can you make sure we schedule a meeting in a month or so to discuss this? At least assuming you or sklearn had some chance of trying this out? @thomasjpfan just a ping in case you were not in the loop here. |
if like is not None: | ||
return _identity_with_like(n, dtype=dtype, like=like) | ||
|
||
from numpy import eye | ||
return eye(n, dtype=dtype) | ||
return eye(n, dtype=dtype, like=like) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is a really strong argument for not having a like override - it shouldn't be up to dask or cupy to decide whether eye
and identity
are aliases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also, this like=like
is always like=None
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll fix that in a separate PR, thanks @eric-wieser for spotting this.
Really excited to see this land! Thanks for the hard work all 😄 |
Thanks everyone for reviews and @seberg for merging this!
We will try this out in Dask and other potential use cases and then I'll update those who have been in our previous meeting discussing NEPs 35 and 37 to discuss next steps, we still have an email thread with those involved so I'll write it there. |
I detected a performance regression using a Python accelerator (Pythran) between Numpy 0.19.5 and 0.20 (see serge-sans-paille/pythran#1735). The regression is in a loop which does not involve Python (only pure C++ without CPython API call). This indicates that it is about the raw data of the arrays. I didn't try to investigate if it can also impact other tools to accelerate Python or create Python extensions (like Cython or Numba) but I don't see why it would be specific to Pythran. By preparing the arrays used in the computation with @pentschev or @seberg, you might be able to shed some light on what is happening? I see that |
@paugier would it be possible for you to open a new GH issue with a minimal reproducer for that, hopefully detached from Pandas and other 3rd-party libraries? I read and responded a few minutes back on the discussion list, but it's still not exactly clear to me what the differences are that you're referring to, plus a reproducer would allow us to verify the codepath that is potentially causing this. |
Also for completeness on this thread, curiously I see NumPy 1.20.1 being considerably |
@pentschev yeah, the life-cycle of small arrays was improved when we cleaned up the buffer handling. (creating a view and deleting it should be about 30% faster, e.g. for I will distinctly not worry about this PR having anything to do related to @paugier observation. As said multiple times on the mailing list, these type of fluctuations are as common as they are puzzling. As much as I would like to know how to avoid them, I would be shocked if looking for a code change in NumPy will get you anywhere closer to understanding this. |
Thanks for the insight @seberg . I agree with you, unless there's a consistent minimal reproducer that can show this or any other commit to be responsible there isn't much that can be done. |
This PR adds the implementation of NEP-35's
like=
argument, allowing dispatch of array creation functions with__array_function__
based on a reference array.There are two ways to dispatch, details below.
The first is via Python API (as demonstrated by
np.ones
andnp.full
), wherearray_function_dispatch
is used, but differently than the existing dispatch for compute functions, where it's dispatched usually on the first argument, this is dispatched on thelike=
keyword argument, returning(like,)
. We also checkif like is not None
as a performance optimization, see #16935 (comment) .The second dispatch occurs via C through splitting
array_implement_array_function
in two functions. The first function remains very similar to how it was originally implemented (array_implement_array_function
) but adds a step to removelike=
argument before calling downstream libraries -- downstream libraries shall not addlike=
to their signatures. The second function (array_implement_c_array_function
) will also remove thelike=
argument, but it will also extract the reference array from it and will gather thepublic_api
Python function by doing an import onnp.function_name
, wherefunction_name
shall be passed by the calling function.The usage of the C dispatch is very straightforward and optimized for the case where
like=None
, adding minimal overhead to such functions. The necessary work will thus be only done when a reference array is passed, such as importing the NumPy Python function -- this can still be improved by using a lookup mechanism to avoid reimporting for each subsequent call but it was decided not to do that in this PR. The caller function will still need to add alike
argument to its keyword list and parse that, but it will not be used anywhere other than the dispatcher function, it will also need to call the C dispatcher and check for its return value, if it returnsPy_NotImplemented
it will continue with NumPy's implementation, otherwise return that value (from downstream library) immediately.