ENH: avoid temporary arrays in expressions (again) #7997
Conversation
Force-pushed 5ca9a16 to 7e520b8
First, let me say that this is terrifying and this comment should not be taken as condoning it. That said -- I think that all we need for the optimization to be safe is for the frames up to the last call to (I might have the function name slightly wrong. You know what I mean.)
I thought about this method too. It probably works; while you can call back into python from a c-extension, I can't think of a way you could get that code to operate on a refcount-1 array which is not a temporary.
Naive question: the benefit of 1.5x-2x is clear, but how common are the use cases where this would hurt? And how badly?
We don't try it when the arrays aren't large enough to make the check worthwhile. The size depends on the CPU's cache sizes and is between 400 and 800 KiB. At that size, array operations go from 150 to 500 µs.
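To put that threshold in terms of element counts, here is a small sketch (the exact cutoff is a tunable implementation detail, so the numbers are only illustrative):

    import numpy as np

    # 400-800 KiB of float64 data is roughly 51k-102k elements.
    itemsize = np.dtype(np.float64).itemsize          # 8 bytes
    low, high = 400 * 1024 // itemsize, 800 * 1024 // itemsize
    print(low, high)                                   # 51200 102400

    # Arrays at or above this size are the ones where eliding the
    # temporary is expected to pay for the backtrace check.
    a = np.ones(high)
    b = np.ones(high)
    r = a + b + 1.0   # candidate for temporary elision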
Force-pushed f08caa3 to cd79fc1
Updated so it now also elides when it can do safe casting; this allows e.g. double += float or int += smaller-int. There is still a problem on systems where libpython is not statically linked into python, as is the case on e.g. Debian. Aside from this, how do we want to proceed? We need to tune the cutoff to avoid performance penalties. We could set it rather high so we have basically no penalty ever. Somewhere around 1 MiB of data should be good; even if that would fit into a CPU cache, it will currently take long enough that the checking overhead is negligible.
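To make the safe-casting case concrete, here is a hedged sketch of the kind of expression this enables (array sizes chosen to be above the elision threshold; exact behaviour depends on the final cutoff):

    import numpy as np

    n = 200_000                                  # well above the size cutoff
    a = np.ones(n)                               # float64
    b = np.ones(n)                               # float64
    f = np.ones(n, dtype=np.float32)             # casts safely to float64

    # (a + b) creates a refcount-1 float64 temporary.  With safe-casting
    # elision the "+ f" can reuse that temporary, roughly equivalent to
    # writing the in-place version by hand:
    r1 = a + b + f

    t = a + b
    t += f                                       # explicit in-place form
    assert np.array_equal(r1, t)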
Force-pushed 1a2cf0c to 76efa23
Updated to stop at the frame evaluation function instead of checking everything, which is probably good enough. I think this should be ready for some wider testing.
numpy/core/src/multiarray/number.c
can_elide_temp(PyArrayObject * m1, PyObject * m2, int * cannot)
{
    int i;
    if (Py_REFCNT(m1) != 1 || !PyArray_CheckExact(m1) ||
Py_REFCNT should be the NumPy-specific PyArray_REFCOUNT.
This way the whole function will not compile on PyPy, which is good as it is very CPython-specific.
Force-pushed 0ca7b11 to a70ffa3
@juliantaylor Very naive question. Could you elaborate a bit more on "Then the refcounts of a, b and c are at least two during the evaluation of the expression"? How?
One count is in the variable itself and the other is the reference to the variable on the interpreter stack.
The loads onto the stack increase the reference count, and binary_add reduces them both again. This is only valid from the interpreter; a C extension could correctly call PyNumber_Add on two refcount-1 objects directly, so we have to do this backtrace checking.
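A small pure-python illustration of the extra stack reference (a sketch; sys.getrefcount itself holds one temporary reference to its argument, which is exactly the kind of reference being described):

    import sys
    import numpy as np

    a = np.ones(10)

    # One reference is held by the variable `a`; passing `a` to
    # sys.getrefcount adds a second (the argument on the call), so this
    # prints at least 2 -- the same effect the interpreter's stack loads
    # have while evaluating `a + b`.
    print(sys.getrefcount(a))

    # A true temporary, by contrast, reaches the operator holding only its
    # stack reference, e.g. the result of (a * 2) inside (a * 2) + a.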
Any further comments on the implementation? Otherwise I'll squash the commits and prepare it for merging (release notes etc.).
@juliantaylor I'd rather push this off to 1.13.0 along with @pv's overlap inplace work. There is already a lot of stuff in 1.12 and the last big item I want in it is
1.12.x is branched, might be a good time to clean up the history.
☔ The latest upstream changes (presumably #8367) made this pull request unmergeable. Please resolve the merge conflicts.
Force-pushed ff0bbc2 to 94f711d
Force-pushed 8c2d717 to 8537ed4
Squashed the history.
 * A possible future improvement would be to change cpython to give us access
 * to the top of the stack. Then we could just check that the objects involved
 * are on the cpython stack instead of checking the function callstack.
 */
It would be good to list the acceptable types and the valid operators and the distinction of commutative vs non-commutative operators. Maybe define a few terms as well such as elision and DSO. I know this is in the code, but it would be good if one could read this and know more detail of what is going on. Might also mention how you can determine if the stack objects are in python, numpy, etc. I'm also not clear on why it is OK for a temporary to be in numpy, or is it the case that you only want to know that the possible non-temporaries come from numpy?
For this to work, the assumption is that numpy does not call PyNumber_* functions; the whole backtrace thing is only necessary because other libraries (e.g. Cython) can do that. To my knowledge numpy does not for basic types.
Updated the docs more.
I think this is looking good, but more documentation is desirable.
Force-pushed a5668d8 to e53ee05
Odd, why is windows failing now? It shouldn't even be enabled there. Does the windows compiler not handle our typenum enum like the standard says it should?
Temporary arrays generated in expressions are expensive as they imply extra memory bandwidth, which is the bottleneck in most numpy operations. For example:

    r = -a + b

One can avoid temporaries when the reference count of one of the operands is one, as it cannot be referenced by any other python code (an rvalue in C++ terms). Python itself uses this method to optimize string concatenation.

The tricky part is that via the C-API one can use the PyNumber_ functions directly and skip increasing the reference count when not needed. The python stack cannot be used to determine this as python does not export the stack size, only the current position.

To avoid the problem we collect the backtrace of the call until the python frame evaluation function, and if it consists exclusively of functions inside python or the c-library we can assume no other library is using the C-API in between.

One issue is that the reliability of backtraces is unknown on non-GNU platforms. On GNU and amd64 it should be reliable enough, even without frame pointers, as glibc will use GCC's stack unwinder via the mandatory dwarf stack annotations. Other platforms with backtrace need to be tested.

Another problem is that the stack unwinding is very expensive. Unwinding a 100 function deep stack (which is not uncommon from python) takes around 35us, so the elision can only be done for relatively large arrays. Heuristically it seems to be beneficial around 256kb array sizes (which is about the typical L2 cache size).

The performance gain is quite significant: around 1.5 to 2 times faster operations with temporaries can be observed. The speed is similar to rewriting the operations to inplace operations manually.
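For a concrete picture of what "rewriting the operations to inplace operations manually" means, here is a sketch of the transformation that elision performs automatically (plain python, names are illustrative only):

    import numpy as np

    a = np.ones(1_000_000)
    b = np.ones(1_000_000)

    # Expression form: "-a" creates a refcount-1 temporary, and without
    # elision "+ b" would allocate a second array for the result.
    r = -a + b

    # Manual in-place form that elision approximates: reuse the temporary
    # produced by the negation as the output buffer of the addition.
    t = np.negative(a)
    t += b

    assert np.array_equal(r, t)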
Force-pushed e53ee05 to c3e24b2
Odd, must have been a CI fluke or the intp bug that's still in master.
 * Only check until the python frame evaluation function is found
 * approx 10us overhead for stack size of 10
 *
 * TODO some calls go over scalarmath in umath but we cannot get the base
At some point I'd like to move scalarmath back into multiarray. The idea would be to put all needed functions in npymath. I think there are only a few ufuncs still needed.
In it goes. Thanks Julian.
For future reference, PyNumber functions are called from several places in multiarray with arrays of reference count 1. AFAICT with a quick scan, it would be OK to use the passed arrays like temporaries. Am I missing something, or have you looked at those places? EDIT: I assume there is nothing on the interpreter call stack in those cases.
Documentation LGTM now, much more understandable.
Most of them are for object arrays, which are not handled here; the handful of non-object ones look fine, either reference count != 1 or actually ones which can be done inplace. Of course I might have overlooked some, which would mean we have holes in our test coverage.
@juliantaylor It is a bit concerning as it introduces one more thing to watch out for when writing or reviewing C code. Apropos Ralf's comment, which platforms do we know this will work on? Is there an easy way to tell? @rgommers I'll fix those backticks when it comes time to edit the release notes.
I don't think it is something to worry about; usage of PyNumber_ in numpy outside of object arrays is very rare (most of it is a C-API stddev computation I doubt anyone is actually using), and new code must have tests anyway. The release note is as accurate as I can be: platforms that have the backtrace() function. I have tested Linux and Mac; BSD and other Unixes may also have it, but I don't know.
Did you guys look into whether this could work with libunwind?
It works with libunwind too, but it's unnecessary, at least on the platforms I could test; backtrace with glibc on x86 is pretty good and doesn't need frame pointers.
Will the patch apply only to large ndarrays, or to all ndarrays?
Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth which is the bottleneck in most numpy operations.
For example:

    r = -a + b
One can avoid temporaries when the reference count of one of the
operands is one as it cannot be referenced by any other python code
(rvalue in C++ terms).
Python itself uses this method to optimize string concatenation.
The tricky part is that via the C-API one can use the PyNumber_ functions
directly and skip increasing the reference count when not needed.
The python stack cannot be used to determine this as python does not
export the stack size, only the current position.
To avoid the problem we collect the backtrace of the call, and if it
consists exclusively of functions inside python or the c-library we can
assume no other library is using the C-API in between.
One issue is that the reliability of backtraces is unknown on non-GNU
platforms. On GNU and amd64 it should be reliable enough, even without frame
pointers, as glibc will use GCC's stack unwinder via the mandatory dwarf
stack annotations.
Other platforms with backtrace need to be tested.
Another problem is that the stack unwinding is very expensive. Unwinding
a 100 function deep stack (which is not uncommon from python) takes
around 35us, so the elision can only be done for relatively large
arrays.
Heuristically it seems to be beneficial around 400kb array sizes (which is
about the typical L2 cache size).
The performance gain is quite significant: around 1.5 to 2 times faster
operations with temporaries can be observed. The speed is similar to
rewriting the operations to inplace operations manually.
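As a rough way to check the claimed gain on a given machine, one could time the expression form against the manual in-place rewrite (a sketch; exact numbers depend on array size, cache sizes, and whether elision is enabled on the platform):

    import timeit
    import numpy as np

    n = 1_000_000                      # well above the elision threshold
    a, b, c = (np.ones(n) for _ in range(3))

    # Expression with temporaries; with elision this should approach the
    # speed of the hand-written in-place version below.
    t_expr = timeit.timeit("d = a + b + c", globals=globals(), number=200)

    # Manual in-place rewrite, the baseline the commit message compares against.
    t_inplace = timeit.timeit("d = a + b; d += c", globals=globals(), number=200)

    print("expression: %.3fs  in-place: %.3fs" % (t_expr, t_inplace))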