ENH: port np.core.overrides to C for speed #12317
Conversation
(force-pushed from 465be4b to c8de8ff)
Oops... I forgot to add
I've made some progress -- NumPy builds and imports, but running the tests in numpy/core/tests/test_overrides.py segfaults. I'm currently trying to get gdb and Python configured so I can get a stack trace. Currently I see:
This seems a bit premature. Is there an actual performance benefit from doing this?
Well, it doesn't work yet, so I haven't been able to benchmark how much it helps. But checking for overrides definitely adds some overhead, and it's now on every single NumPy function call, e.g.,
I'll check the ASV suite, but I suspect it would be a meaningful performance win if we can shave 1-2 us of overhead off every NumPy operation; typical NumPy operations take ~10 us on small arrays.
asv shows regressions on array_ufunc in a number of places: https://pv.github.io/numpy-bench/#regressions
Here's a spreadsheet with all the ASV changes (from @pv's site) that I identified as due to
OK, I managed to debug this. For future reference, I wasn't able to get

It still needs docs and a bit of cleanup, but it's in good enough shape to run benchmarks and for anyone interested to take a look at the implementation. Here are some highlights. The general override benchmarks are only about 10% faster for functions that handle a single argument, but 3x faster for functions given a long list of arrays:
For the `hstack` benchmark in particular, I see:
"With nested dispatch" refers to manually adjusting |
Thanks to some performance tips from @eric-wieser, this latest version is working much faster. Pure Python implementation:
C implementation:
These should be compared to my previous results in #12317 (comment). And here are the results for the
The bottom line is that the overhead for the typical case of checking a single argument that doesn't have an override is now in the range of ~0.7 us, about the level of a single function call. For the typical case of no overrides with only a few arguments, the C version is about 30% faster than the Python version. For functions like
Rewriting
This is ready for someone else to weigh in (maybe @mhvk?). I think this is in a reasonable state to review, but there's at least one major design decision to make: should we put an upper limit (e.g.,
I had a bit of a look earlier today and the code looked... "oddly familiar". Which of course is great!

I had noted the many times we are getting

On the number of arguments, I think using

Will try to look in more detail soon.
OK, I rewrote in terms of static arrays. That does show some improvement for cases where there are duck arrays:
Tests are passing, except for
I think I fixed my bug(s) -- tests are passing now (no more out-of-bounds memory access).
One issue worth discussing: I've implemented a different approach here for handling the order in which to call subclasses than what is currently used in ufunc overrides:
I like my approach for two reasons:
Of course, asymptotic complexity isn't really relevant here -- the typical scenario is having only 1-2 arguments that implement overrides. I don't know which approach is faster in practice -- this should probably be benchmarked. Either way, we should probably use the same approach in both places. It would also be ideal to consolidate the actual logic between these overrides, but in practice there are enough minor differences (e.g., in the form of the input arguments) that this may be tricky to do while preserving maximum performance.
@shoyer - at least in the abstract, I agree that your approach is better and I think it makes sense to do the same for
The differences I noticed:
Of course, they also collect different methods (
This looks great! My only slightly larger comment is about the need to first check whether there are any overrides at all, iterating over all arguments, before doing essentially the same iteration again when actually making the list of overrides. Indeed, since in the latter case you check a given class only once, I think it may be faster to just use it directly.
@shoyer - my attempts to make
Interesting -- did this actually speed things up? I guess it could be equivalent in the typical case of no outputs, but if you have to allocate another list, my intuition is that this would be slower.
Looking back at #11372, I see I only allocated a new tuple if output arguments were actually present. The cost was offset by avoiding the need to search for the
I suppose this could indeed be a reasonable shortcut to support. Some questions to think about:
I'm struggling to think of real use cases for (2), especially given the current state of affairs in the broader ecosystem where casting function inputs to NumPy arrays with
Here, maybe good to remember that what was merged was my adaptation of the earlier
Maybe this would work better as two PRs - one with the documentation and pure Python changes, including the benchmarks, and another with only replacing the Python implementation with an equivalent C-based one.
In any case, I would prefer the PR not remove the pure Python implementation.
@mattip - I'm not sure there is much point in keeping the Python implementation around - it will only get out of sync with the C implementation. If we really want to keep it for inspection in a way that is easier than digging through git, perhaps it should be as an addendum to the NEP? That makes it more obvious that it is not something that is necessarily exactly up to date with the real implementation. But that can be done as a separate PR (though it may suggest not making any changes to the NEP here). As for the present code, I think we should merge it rather than worry about further small changes, in part because 1.16 is waiting for this, but also because it would be really nice to next factor out common parts with
@mhvk understood. It would be nice if, when squashing this down, it became three commits: the documentation and tests, the changes to the Python implementation, and the new C implementation / Python implementation removal. That will make it easier to refactor in the future if we ever decide to revive the pure Python version. I can do that once the code is ready to be merged.
OK, assuming the review process is done, let me see if I can rewrite history here.
I can't easily split up the tests/refactor of the pure Python version, so I'm going to stick with two commits: (1) pure Python fixes and (2) C implementation / Python removal. |
(force-pushed from 0696e66 to db6e223)
(force-pushed from db6e223 to 4541345)
Note that the pure Python implementation from the first commit is correct, but it does not exactly match the C version, and there may very well be a performance regression compared to what we currently have on master. But it may still be a good starting point for PyPy.
OK, I think this is ready to merge? There are two commits, one for Python/doc/test changes and one for switching to the C implementation. Both pass our test suite. |
Had another look and agree this should now just go in, so merging! cc @charris, since I think this was one of the main items 1.16 was waiting for. |
p.s. Thanks, @shoyer, this is really nice! |
Last time I asked, Stephan said that the Python version was good for 1.16, so I wasn't waiting on the C version. Now that the first 1.16.0 rc is out, I would prefer to leave it with the Python version. We can backport later if that seems reasonable.
@charris - I think the Python version is indeed fine, but would recommend backporting so that the C version also gets some testing by those interested in trying it; it also means we'll get more useful complaints about possible performance regressions. Whether to backport for rc2 and/or 1.16.0 or for 1.16.1, I'll leave up to you.
p.s. Actually, people will presumably test on master before complaining too much, so really it doesn't matter that much. |
I don't think there would be much risk in backporting: the feature is new and currently unused. However, I'd like to keep the number of potential problems from getting any bigger, and I'm already expecting trouble (knock, knock). Let's give it a micro release and go from there.
@shoyer How do you propose to handle functions that transition to ufuncs? I ask here because it is convenient, and because it looks like we will need to deal with that. If there is much discussion of the topic, we may want to open an issue and update the NEP. EDIT: @eric-wieser.
@charris We did actually address this explicitly in the NEP. See the warning at the top: http://www.numpy.org/neps/nep-0018-array-function-protocol.html#implementation
And below under "non-goals": http://www.numpy.org/neps/nep-0018-array-function-protocol.html#non-goals
More concretely, we can do this with a (possibly abbreviated) deprecation cycle:
This way, a third-party library can retain their use of

Actually, now that I write this down, maybe it is also worth putting this in the NEP, just so people know what will happen. This is probably worth implementing in the near-ish term as part of the
TL;DR: This significantly speeds up dispatch for `__array_function__`, reducing the overhead to about 0.7 microseconds for typical cases (vs. 2-3 microseconds currently with a pure Python implementation). For functions that handle many arguments (like `concatenate`), this reduces the overhead of dispatching by something like 100x, i.e., from "makes `concatenate` 4x slower" to "barely noticeable".

Original post:
Still needs:
Currently NumPy doesn't even import properly after I build it with this change. Hopefully I'm doing something obviously wrong!
Also: I expect it's obvious, but I'm pretty new to C and Python's C-API. I expect there are lots of ways this PR could be improved.