
ENH: Improve array function overhead by using vectorcall #23020


Merged: 10 commits merged into numpy:main on Jan 17, 2023

Conversation

seberg
Member

@seberg seberg commented Jan 16, 2023

This moves dispatching for __array_function__ into a C-wrapper. This
helps speed for multiple reasons:

  • Avoids one additional dispatching function call to C
  • Avoids the use of *args, **kwargs which is slower.
  • For simple NumPy calls we can stay in the faster "vectorcall" world

This generally speeds things up a little, but it can speed things up a lot
when keyword arguments are used on lightweight functions, for example:

np.can_cast(arr, dtype, casting="same_kind")

is more than twice as fast with this.
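For background, the dispatch being moved to C serves the `__array_function__` protocol (NEP 18). A minimal sketch of that protocol from the user side, with an illustrative duck-array type (`MyDuck` is not part of NumPy):

```python
import numpy as np

# A duck-array type that intercepts NumPy API calls made on it via the
# __array_function__ protocol (NEP 18).  The C wrapper in this PR is the
# layer that routes calls like np.concatenate to such overrides.
class MyDuck:
    def __array_function__(self, func, types, args, kwargs):
        # `func` is the public NumPy function that was called.
        if func is np.concatenate:
            return "handled by MyDuck"
        return NotImplemented

print(np.concatenate((MyDuck(), MyDuck())))  # -> handled by MyDuck
```

Plain ndarrays still take the ordinary path; the speedup here is about how cheaply that routing decision is made per call.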

In principle there is one alternative to get the best speed: we could inline
the "relevant argument"/dispatcher extraction. That changes behavior in
an acceptable but larger way (it passes default arguments).
Unless the C entry point seems unwanted, though, this should be a decent step
in the right direction even if we eventually want to do that.

Closes #20790
Closes #18547 (although not quite sure why)

@@ -169,7 +169,7 @@ def _remove_nan_1d(arr1d, overwrite_input=False):
     s = np.nonzero(c)[0]
     if s.size == arr1d.size:
         warnings.warn("All-NaN slice encountered", RuntimeWarning,
-                      stacklevel=5)
+                      stacklevel=6)
Member Author

I triple-checked this and it seemed just way too low before. It can even now sometimes point inside (but that is due to use at a different depth). (Same below.)
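The effect of `stacklevel` can be seen with a small standalone example (the wrapper names here are illustrative, not NumPy's internals): each extra call frame between the user and the `warnings.warn()` call requires bumping `stacklevel` by one for the warning to point at the user's code.

```python
import warnings

def implementation():
    # stacklevel=2 attributes the warning to implementation()'s caller;
    # an extra dispatch frame in between would require stacklevel=3.
    warnings.warn("All-NaN slice encountered", RuntimeWarning, stacklevel=2)

def public_api():
    implementation()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    public_api()

# The recorded warning points at the call site inside public_api(),
# not at the warnings.warn() line itself.
print(caught[0].filename, caught[0].lineno)
```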

@seberg
Member Author

seberg commented Jan 16, 2023

Ah, seems we still have a 3.8 job or two and PyPy doesn't have PyObject_VectorCall after all. I could add a (slowish) compatibility shim for it?

@mattip
Member

mattip commented Jan 16, 2023

If you mean the pypy failure in the build_test run, it is on PyPy 3.9 and fails on

malloc_consolidate(): unaligned fastbin chunk detected

@mattip
Member

mattip commented Jan 16, 2023

The armv7_simd_test run is on CPython3.8. I think updating that would require modifying the container build

@mattip
Member

mattip commented Jan 16, 2023

I think the debug run is also using an apt-installed python3.8-dbg

@mattip
Member

mattip commented Jan 16, 2023

Are you happy with moving the like argument to be the first one? It seems to be a big change. I understand it needs to transition from kwarg to positional; can it still remain the last argument?

@seberg
Member Author

seberg commented Jan 16, 2023

Ah, I must have mixed up the pypy and debug run. I guess there is a chance that the PyPy failure is a bug in this code in that case.

@seberg
Member Author

seberg commented Jan 16, 2023

Are you happy with moving the like argument to be the first one? It seems to be a big change. I understand it needs to transition from kwarg to positional; can it still remain the last argument?

Sorry, I guess this was unclear: this is an implementation detail, since we never pass like to __array_function__. Our dispatcher now expects like as the first argument, while previously it expected a kwarg like=like and then would delete it.

@mattip
Member

mattip commented Jan 16, 2023

Ahh, and in order to make it a non-optional positional argument you moved it to be the first one, since there could be a different number of other positional arguments?

@seberg seberg force-pushed the faster-array-function branch from 54d9afd to 7da93fe Compare January 16, 2023 18:01
@seberg
Member Author

seberg commented Jan 16, 2023

Turns out the PyPy problem was a refcounting issue that wasn't bad enough to show up in the non-debug build... and the debug build was failing after all :).

pip3 install cython==0.29.30 setuptools\<49.2.0 hypothesis==6.23.3 pytest==6.2.5 'typing_extensions>=4.2.0' &&
apt install -y git python3.9 python3.9-dev &&
ln -s /usr/bin/python3.9 /usr/bin/pythonx &&
pythonx -m pip install --upgrade pip setuptools wheel &&
Member Author

This pythonx pattern was stolen from the old_gcc job above.

@seberg seberg force-pushed the faster-array-function branch 5 times, most recently from 35c9df3 to c3b96e4 Compare January 17, 2023 09:07
Arbitrary positional arguments originally passed into ``public_api``.
kwargs : dict
Arbitrary keyword arguments originally passed into ``public_api``.
overrides when called like.
Contributor

Suggested change
overrides when called like.
overrides when called like ``implementation(*args, **kwargs)``.

Not sure this is the right way to call, but the sentence seems to stop all of a sudden.

seberg and others added 10 commits January 17, 2023 18:40
The refactor to use vectorcall/fastcall is obviously much better
if we don't have to go back and forth, for concatenate we get:

     arr = np.random.random(20)
     %timeit np.concatenate((arr, arr), axis=0)

Going from ~1.2µs to just below 1µs and all the way down to ~850ns
(fluctuates quite a lot down to 822 even).  ~40% speedup in total
which is not too shabby.
This makes these functions much faster when used with keyword arguments
now that the array-function dispatching does not rely on `*args, **kwargs`
anymore.
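The quoted numbers came from a micro-benchmark along these lines (absolute timings vary a lot by machine and NumPy version, so treat this as a sketch of the methodology rather than a reproducible result):

```python
import timeit
import numpy as np

arr = np.random.random(20)

# Time the dispatch-heavy lightweight call from the commit message.
# For a 20-element array, per-call overhead dominates the actual copy.
n = 10_000
total = timeit.timeit(lambda: np.concatenate((arr, arr), axis=0), number=n)
print(f"~{total / n * 1e9:.0f} ns per np.concatenate call")
```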
@seberg seberg force-pushed the faster-array-function branch from cb91aa7 to 2403dbe Compare January 17, 2023 17:46
@seberg
Member Author

seberg commented Jan 17, 2023

I noticed that this PR breaks the slight enhancement from gh-21731 (the test there is subtly wrong). This is somewhat more tedious to get right in C, so I think I would prefer fixing it as a follow-up.

EDIT: Just to note, in general it is a bit nicer, because moving things to C removes the confusing <array-function-internals> stack from the traceback.

@charris charris merged commit 8535df6 into numpy:main Jan 17, 2023
@charris
Member

charris commented Jan 17, 2023

Thanks Sebastian. I don't know if all the changes will be trouble free, but the CI is green and the best way to find out if there are other problems is to get it tested downstream.

@jakirkham
Contributor

Thanks for working on this Sebastian! 🙏

cc @pentschev @leofang (for awareness)

@seberg seberg deleted the faster-array-function branch January 17, 2023 19:45