ENH: Improve array function overhead by using vectorcall #23020
Conversation
@@ -169,7 +169,7 @@ def _remove_nan_1d(arr1d, overwrite_input=False):
     s = np.nonzero(c)[0]
     if s.size == arr1d.size:
         warnings.warn("All-NaN slice encountered", RuntimeWarning,
-                      stacklevel=5)
+                      stacklevel=6)
I triple-checked this and it seemed just way too low before. It can even now sometimes point inside NumPy (but that is due to use at different call depths). (Same below.)
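For context, a minimal sketch (mine, not from the PR) of how `stacklevel` walks up the call stack; every wrapper frame added between the user's call and `warnings.warn` requires bumping the value by one, which is why the new C dispatch layer shifts the right number here from 5 to 6:

```python
import warnings

def _impl():
    # stacklevel counts frames upward from this call: 1 would point at the
    # warn() line itself, 2 at api(), 3 at api()'s caller.
    warnings.warn("All-NaN slice encountered", RuntimeWarning,
                  stacklevel=3)

def api():
    _impl()

api()  # the warning is attributed to this line
```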
Ah, seems we still have a 3.8 job or two, and PyPy doesn't have
If you mean the PyPy failure in the build_test run, it is on PyPy 3.9 and fails on
The armv7_simd_test run is on CPython 3.8. I think updating that would require modifying the container build.
I think the debug run is also using an apt-installed python3.8-dbg.
Are you happy with moving the
Ah, I must have mixed up the PyPy and debug runs. I guess there is a chance that the PyPy failure is a bug in this code in that case.
Sorry, I guess this was unclear: this is an implementation detail, since we never pass
Ahh, and in order to make it a non-optional positional argument, you moved it to be the first one, since there could be a different number of other positional arguments?
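If that reading is right, a hypothetical Python analogy for the argument-ordering point (the names here are invented, not from the PR): a required parameter placed before the variadic part can never be swallowed by it, no matter how many positional arguments follow, so the wrapper can pop it unconditionally:

```python
# `dispatcher` stands in for the one argument the wrapper itself consumes;
# everything after it is forwarded untouched, whatever its length.
def wrapped_call(dispatcher, *args, **kwargs):
    relevant = dispatcher(*args, **kwargs)
    return relevant, args, kwargs

# The first slot is unambiguous regardless of how many arguments follow:
print(wrapped_call(lambda *a, **k: list(a), 1, 2, 3, axis=0))
# -> ([1, 2, 3], (1, 2, 3), {'axis': 0})
```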
Force-pushed from 54d9afd to 7da93fe.
Turns out the PyPy problem was a refcounting issue that wasn't bad enough to show up in the non-debug build... and the debug build was failing after all :).
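As an aside, a small sketch of the mechanism that makes such bugs visible on debug builds: `sys.gettotalrefcount` only exists on `--with-pydebug` interpreters (like the `python3.8-dbg` job above), which is why a leak can hide in normal CI. This helper is illustrative only, not part of the PR:

```python
import sys

def refcount_delta(fn, *args, **kwargs):
    # Debug builds track the interpreter-wide refcount total; on a normal
    # build this check is silently skipped and returns None.
    if not hasattr(sys, "gettotalrefcount"):
        return None
    fn(*args, **kwargs)                 # warm-up: populate caches first
    before = sys.gettotalrefcount()
    fn(*args, **kwargs)
    return sys.gettotalrefcount() - before
```

A delta that grows steadily over repeated calls points at a missing `Py_DECREF` somewhere in the C path.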
.github/workflows/build_test.yml
pip3 install cython==0.29.30 setuptools\<49.2.0 hypothesis==6.23.3 pytest==6.2.5 'typing_extensions>=4.2.0' &&
apt install -y git python3.9 python3.9-dev &&
ln -s /usr/bin/python3.9 /usr/bin/pythonx &&
pythonx -m pip install --upgrade pip setuptools wheel &&
This `pythonx` pattern was stolen from the `old_gcc` job above.
Force-pushed from 35c9df3 to c3b96e4.
numpy/core/overrides.py
    Arbitrary positional arguments originally passed into ``public_api``.
kwargs : dict
    Arbitrary keyword arguments originally passed into ``public_api``.
    overrides when called like.
Suggested change:
- overrides when called like.
+ overrides when called like ``implementation(*args, **kwargs)``.
Not sure this is the right way to call it, but the sentence seems to stop all of a sudden.
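For readers unfamiliar with the protocol being documented, a small self-contained example (mine, not from the PR) of how `__array_function__` intercepts a public NumPy call and receives `args`/`kwargs` exactly as the caller passed them:

```python
import numpy as np

class Quacks:
    # Any object defining __array_function__ takes over dispatch for the
    # public NumPy functions it is passed to.
    def __array_function__(self, func, types, args, kwargs):
        return (func.__name__, args, kwargs)

print(np.concatenate((Quacks(), Quacks()), axis=0))
# Roughly: ('concatenate', ((<Quacks ...>, <Quacks ...>),), {'axis': 0})
```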
This moves dispatching for `__array_function__` into a C-wrapper. This helps speed for multiple reasons:

* Avoids one additional dispatching function call to C
* Avoids the use of `*args, **kwargs`, which is slower
* For simple NumPy calls we can stay in the faster "vectorcall" world

This speeds up things generally a little, but can speed things up a lot when keyword arguments are used on lightweight functions, for example::

    np.can_cast(arr, dtype, casting="same_kind")

is more than twice as fast with this.

There is one alternative in principle to get best speed: we could inline the "relevant argument"/dispatcher extraction. That changes behavior in an acceptable but larger way (it passes default arguments). Unless the C-entry point seems unwanted, this should be a decent step in the right direction even if we want to do that eventually, though.

Closes numpygh-20790
Closes numpygh-18547 (although not quite sure why)
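A rough way to reproduce the kind of measurement behind that claim (a sketch only; the array size and iteration count are my choices, and absolute numbers depend on machine and NumPy version, so only the before/after ratio matters):

```python
import timeit
import numpy as np

arr = np.arange(100)

# Time the keyword-heavy lightweight call cited in the commit message.
n = 100_000
t = timeit.timeit(
    'np.can_cast(arr, "float64", casting="same_kind")',
    globals={"np": np, "arr": arr},
    number=n,
)
print(f"{t / n * 1e9:.0f} ns per call")
```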
The refactor to use vectorcall/fastcall is obviously much better if we don't have to go back and forth; for concatenate we get::

    arr = np.random.random(20)
    %timeit np.concatenate((arr, arr), axis=0)

Going from ~1.2µs to just below 1µs and all the way down to ~850ns (fluctuates quite a lot, down to 822ns even). ~40% speedup in total, which is not too shabby.
This makes these functions much faster when used with keyword arguments now that the array-function dispatching does not rely on `*args, **kwargs` anymore.
Force-pushed from cb91aa7 to 2403dbe.
I noticed that this PR breaks the slight enhancement from gh-21731 (the test there is subtly wrong). This is somewhat more tedious to get right in C, so I think I would prefer fixing it as a follow-up. EDIT: Just to note, in general it is a bit nicer, because moving things to C removes the confusing
Thanks Sebastian. I don't know if all the changes will be trouble-free, but the CI is green, and the best way to find out if there are other problems is to get it tested downstream.
Thanks for working on this Sebastian! 🙏 cc @pentschev @leofang (for awareness)