ENH: avoid temporary arrays in expressions (again) #7997

Merged
3 commits merged into numpy:master on Feb 25, 2017

Conversation

juliantaylor
Contributor

Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

 r = -a + b * c**2

One can avoid temporaries when the reference count of one of the
operands is one as it cannot be referenced by any other python code
(rvalue in C++ terms).
Python itself uses this method to optimize string concatenation.
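
One can observe this from the interpreter (a minimal sketch; the exact
counts are CPython implementation details and may vary between versions):

    import sys
    import numpy as np

    a = np.ones(10)
    # a named array is referenced by the variable and by the call
    # argument, so getrefcount reports at least 2
    print(sys.getrefcount(a))      # 2 on CPython

    # a temporary created inside the expression is only referenced by
    # the call argument, so it reports 1 and its buffer could be reused
    print(sys.getrefcount(a + 1))  # 1 on CPython
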
The tricky part is that via the C-API one can call the PyNumber_
functions directly and skip increasing the reference count when it is
not needed.
The python stack cannot be used to determine this as python does not
export the stack size, only the current position.
To avoid the problem we collect the backtrace of the call and, if it
consists exclusively of functions inside python or the c-library, we can
assume no other library is using the C-API in between.

One issue is that the reliability of backtraces is unknown on non-GNU
platforms. On GNU and amd64 it should be reliable enough, even without
frame pointers, as glibc will use GCC's stack unwinder via the mandatory
DWARF stack annotations.
Other platforms with backtrace need to be tested.
Another problem is that stack unwinding is very expensive. Unwinding a
100-function-deep stack (which is not uncommon from python) takes around
35 µs, so the elision can only be done for relatively large arrays.
Heuristically it seems to be beneficial around 400KiB array sizes (which
is about the typical L2 cache size).

The performance gain is quite significant: around 1.5 to 2 times faster
operations with temporaries can be observed. The speed is similar to
rewriting the operations into inplace operations manually.
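
For reference, the manual inplace rewrite that the elision is compared
against looks like this (a sketch, not code from the patch):

    import numpy as np

    a, b, c = (np.random.rand(1_000_000) for _ in range(3))

    # naive form: c**2, b * (...), -a and the final add each allocate
    r = -a + b * c**2

    # manual rewrite: one allocation, the rest done in place
    r2 = c ** 2
    r2 *= b
    r2 -= a

    np.testing.assert_allclose(r, r2)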

@juliantaylor juliantaylor force-pushed the temporary-elide branch 3 times, most recently from 5ca9a16 to 7e520b8 on August 30, 2016 20:00
@njsmith
Member

njsmith commented Aug 31, 2016

First, let me say that this is terrifying and this comment should not be taken as condoning it.

That said -- I think that all we need for the optimization to be safe is for the frames up to the last call to Py_EvalFrameEx to belong to cpython/numpy. (1) Is this true? (2) Does it allow a faster version of this where we only calculate the backtrace for, like, 5 frames deep or so, on the optimistic assumption that in almost any case where the optimization can apply, that will be enough to reach the last Py_EvalFrameEx?

(I might have the function name slightly wrong. You know what I mean.)

@juliantaylor
Contributor Author

juliantaylor commented Aug 31, 2016

I thought about this method too. It probably works: while you can call back into python from a c-extension, I can't think of a way you could get that code to operate on a refcount-1 array which is not a temporary.
So this is a possible optimization, but the code would be very similar: instead of determining the python dso address range we determine the EvalFrameEx range (as calling dladdr for the name is expensive), and then we need to find the longest call chain that ends up in multiarray.so. An issue here might be that the compiler could perform deinlining, which would spread the function over a non-contiguous address range. Keeping track of all entry point addresses instead might be a way to avoid this problem.
This could get the overhead down to around 5-10 µs, which would allow us to reduce the minimum array size where we elide.
But the profitable size is currently already in the L2 cache range. Unless we can reduce it to L1 size (which I doubt with a 5 µs budget) there might not be a significant performance gain.

@rgommers
Member

rgommers commented Sep 2, 2016

Naive question: the benefit of 1.5x-2x is clear, but how common are the use cases where this would hurt? And how badly?

@juliantaylor
Contributor Author

juliantaylor commented Sep 3, 2016

we don't try it when the arrays aren't large enough to make the check worthwhile. The size depends on the CPU's cache sizes and is between 400 and 800KiB. At that size array operations go from 150 to 500 µs.
So the worst case, a c-extension using refcount-1 arrays in the cache sweet spot with our check tuned too optimistically, would have an overhead of ~20%; for larger arrays it quickly becomes insignificant.
But this should be a pretty rare situation: while you can construct a situation where you need the stacktrace check (see our incref_elide tests), I don't think it actually happens often in practice. I haven't seen anything do it yet.

@juliantaylor
Contributor Author

updated so it now also elides when it can do safe casting; this allows e.g. double += float or int += smaller-int
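
For illustration, the condition corresponds to numpy's 'safe' casting
rules (a sketch; that this matches the exact check in the C code is an
assumption):

    import numpy as np

    # double += float is eligible: float32 casts safely to float64,
    # so the float64 operand's buffer can hold the result
    print(np.can_cast(np.float32, np.float64, casting='safe'))  # True

    # the reverse cast would lose precision, so no elision there
    print(np.can_cast(np.float64, np.float32, casting='safe'))  # False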

there is still a problem on systems where libpython is not statically linked into python, as is the case on e.g. debian.
before main there is a symbol on the stack that doesn't seem to be identifiable with backtrace. A solution would be to check the Py_Main address (and maybe others for embedding) and stop when that is found. Or switch to stopping at EvalFrameEx, which would also be a bit faster.

Aside from this, how do we want to proceed?
It still needs testing on mac; while it seems to have backtrace(), I don't know how good it is performance- and accuracy-wise. Does someone want to test that?
Windows likely won't work, though there are probably equivalent windows api functions one could use. Someone with more windows knowledge should look into that.

Then we need to tune the cutoff to avoid performance penalties. We could set it rather high so we have basically no penalty ever; somewhere around 1MiB of data should be good, as even if that fits into a CPU cache the operation will currently take long enough that the checking overhead is negligible.
But a too high value also reduces the benefit, so we should spend a little effort tuning it.
I have tested a couple of linux machines with different caches: an AMD with L2/L3 of 512/6144, an intel core2 with only an L2 of 6144, an intel with L2/L3 of 256/3072 and one with L2/L3 of 512/18000 (all in KiB). The threshold where (d + 1) + 1 becomes profitable varies from 400KiB to 800KiB. It doesn't seem easy to relate it directly to the cache sizes, in particular when an L3 is available.
Maybe we could write a little data gathering script and send it to the mailing list for testing?
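
Something along these lines, a rough sketch of such a script (the sizes
and repeat counts are arbitrary choices, not taken from the patch):

    import timeit
    import numpy as np

    def with_temp(d):
        return (d + 1) + 1       # the second add could reuse the temporary

    def in_place(d):
        r = d + 1
        r += 1                   # explicitly reuse the buffer
        return r

    for kib in (64, 128, 256, 512, 1024, 2048):
        d = np.ones(kib * 1024 // 8)   # float64: 8 bytes per element
        t1 = min(timeit.repeat(lambda: with_temp(d), number=100, repeat=5))
        t2 = min(timeit.repeat(lambda: in_place(d), number=100, repeat=5))
        print('%5d KiB  temp/inplace time ratio: %.2f' % (kib, t1 / t2))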

@juliantaylor
Contributor Author

juliantaylor commented Sep 10, 2016

Updated to stop at the frame evaluation function instead of checking everything; probably good enough.
That reduces the overhead to 5-10 µs, so one could enable it already at around 100KiB array sizes, but I set it to 160KiB just in case.
Note the function name changed in python 3.6, which has the juicy pep 523 implemented; that should allow us to do all this with zero overhead (and maybe more), but I'll leave it for later.

I think this should be ready for some wider testing.

can_elide_temp(PyArrayObject * m1, PyObject * m2, int * cannot)
{
    int i;
    if (Py_REFCNT(m1) != 1 || !PyArray_CheckExact(m1) ||
Member


Py_REFCNT should be the numpy-specific PyArray_REFCOUNT.

This way the whole function will not compile on PyPy, which is good as it is very CPython-specific.

@rainwoodman
Contributor

@juliantaylor Very naive question. Could you elaborate a bit more on "The tricky part is that via the C-API one can call the PyNumber_ functions directly and skip increasing the reference count when it is not needed"? Do you mean that if one writes

 r = -a + b * c**2

then the refcounts of a, b and c are at least two during the evaluation of the expression? How?

@juliantaylor
Contributor Author

one count is in the variable itself and the other is the reference to the variable on the stack.
The expression a + b + b basically compiles to this bytecode:

          0 LOAD_FAST                0 (a)
          3 LOAD_FAST                1 (b)
          6 BINARY_ADD          
          7 LOAD_FAST                1 (b)
         10 BINARY_ADD      

the loads onto the stack increase the reference count; BINARY_ADD decreases them both again.
So a non-temporary coming from the python interpreter always has a count of at least two, but the temporary result of the addition is a new object not assigned to a variable (no store), so it has a count of one.
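
The disassembly above can be reproduced with the dis module (on a
CPython of that era; newer versions show BINARY_OP instead of
BINARY_ADD):

    import dis

    def f(a, b):
        return a + b + b

    dis.dis(f)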

This is only valid from the interpreter; a C extension could legitimately call PyNumber_Add on two refcount-1 objects directly, so we have to do this backtrace checking.

@juliantaylor
Contributor Author

Any further comments on the implementation? Otherwise I'll squash the commits and prepare it for merging (release notes etc.).
I'll probably double the threshold just to be sure not to cause performance regressions, and check whether divisions can be added (I think my earlier concerns about division changing types should be handled by the safe can_cast check now).
Some testing on the mac would be great; a benchmark is available as `check.ipy` here:
https://github.com/juliantaylor/numpy/tree/elide-bench
https://mail.scipy.org/pipermail/numpy-discussion/2016-September/076034.html

@charris
Member

charris commented Nov 2, 2016

@juliantaylor I'd rather push this off to 1.13.0 along with @pv's overlap inplace work. There is already a lot of stuff in 1.12 and the last big item I want in it is __numpy_ufunc__, and I may push that off also.

@charris charris added this to the 1.13.0 release milestone Nov 4, 2016
@charris
Member

charris commented Nov 6, 2016

1.12.x is branched, might be a good time to clean up the history.

@homu
Contributor

homu commented Dec 13, 2016

☔ The latest upstream changes (presumably #8367) made this pull request unmergeable. Please resolve the merge conflicts.

@juliantaylor juliantaylor force-pushed the temporary-elide branch 2 times, most recently from 8c2d717 to 8537ed4 on January 18, 2017 13:09
@juliantaylor
Contributor Author

squashed the history

* A possible future improvement would be to change cpython to give us access
* to the top of the stack. Then we could just check that the objects involved
* are on the cpython stack instead of checking the function callstack.
*/
Member


It would be good to list the acceptable types, the valid operators, and the distinction between commutative and non-commutative operators. Maybe define a few terms as well, such as elision and DSO. I know this is in the code, but it would be good if one could read this and know in more detail what is going on. Might also mention how you can determine whether the stack objects are in python, numpy, etc. I'm also not clear on why it is OK for a temporary to be in numpy; or is it the case that you only want to know that the possible non-temporaries come from numpy?

Contributor Author


the assumption for this to work is that numpy does not call PyNumber_* functions itself. The whole backtrace thing is only necessary because other libraries (e.g. cython) can do that.
To my knowledge numpy does not, for basic types.

updated the docs more

@charris
Member

charris commented Feb 24, 2017

I think this is looking good, but more documentation is desirable.

@juliantaylor juliantaylor force-pushed the temporary-elide branch 2 times, most recently from a5668d8 to e53ee05 on February 24, 2017 16:46
@juliantaylor
Contributor Author

odd, why is windows failing now? It shouldn't even be enabled there. Does the windows compiler not handle our typenum enum like the standard says it should?

Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

     r = -a + b

One can avoid temporaries when the reference count of one of the
operands is one as it cannot be referenced by any other python code
(rvalue in C++ terms).
Python itself uses this method to optimize string concatenation.
The tricky part is that via the C-API one can call the PyNumber_
functions directly and skip increasing the reference count when it is
not needed.
The python stack cannot be used to determine this as python does not
export the stack size, only the current position.
To avoid the problem we collect the backtrace of the call until the
python frame evaluation function and, if it consists exclusively of
functions inside python or the c-library, we can assume no other library
is using the C-API in between.

One issue is that the reliability of backtraces is unknown on non-GNU
platforms. On GNU and amd64 it should be reliable enough, even without
frame pointers, as glibc will use GCC's stack unwinder via the mandatory
DWARF stack annotations.
Other platforms with backtrace need to be tested.
Another problem is that stack unwinding is very expensive. Unwinding a
100-function-deep stack (which is not uncommon from python) takes around
35 µs, so the elision can only be done for relatively large arrays.
Heuristically it seems to be beneficial around 256KiB array sizes (which
is about the typical L2 cache size).

The performance gain is quite significant: around 1.5 to 2 times faster
operations with temporaries can be observed. The speed is similar to
rewriting the operations into inplace operations manually.
@juliantaylor
Contributor Author

odd, must have been a CI fluke or the intp bug that's still in master

* Only check until the python frame evaluation function is found
* approx 10us overhead for stack size of 10
*
* TODO some calls go over scalarmath in umath but we cannot get the base
Member


At some point I'd like to move scalarmath back into multiarray. The idea would be to put all needed functions in npymath. I think there are only a few ufuncs still needed.

@charris charris merged commit 7bc881d into numpy:master Feb 25, 2017
@charris
Member

charris commented Feb 25, 2017

In it goes. Thanks Julian.

@charris
Member

charris commented Feb 25, 2017

For future reference, PyNumber functions are called from several places in multiarray with arrays of reference count 1. AFAICT from a quick scan, it would be OK to use the passed arrays as temporaries. Am I missing something, or have you looked at those places?

EDIT: I assume there is nothing on the interpreter call stack in those cases.

@rgommers
Member

Documentation LGTM now, much more understandable

@juliantaylor
Contributor Author

juliantaylor commented Feb 25, 2017

most of them are for object arrays, which are not handled here; the handful of non-object ones look fine: either the reference count is != 1 or they are actually ones which can be done inplace. Of course I might have overlooked some, which would mean we have holes in our test coverage.

@juliantaylor juliantaylor deleted the temporary-elide branch February 25, 2017 10:05
@charris
Member

charris commented Feb 25, 2017

@juliantaylor It is a bit concerning as it introduces one more thing to watch out for when writing or reviewing C code. Apropos Ralf's comment, what platforms do we know that this will work on? Is there an easy way to tell?

@rgommers I'll fix those backticks when it comes time to edit the release notes.

@juliantaylor
Contributor Author

I don't think it is something to worry about; usage of PyNumber_ in numpy outside of object arrays is very rare (most of it is a c-api stddev computation I doubt anyone is actually using) and new code must have tests anyway.

The release notes are as accurate as I can be: platforms that have the backtrace() function. I have tested linux and mac. The BSDs and other unixes may also have it, but I don't know.

@jakirkham
Contributor

Did you guys look into whether this could work with libunwind?

@juliantaylor
Contributor Author

It works with libunwind too, but it's unnecessary, at least on the platforms I could test. backtrace with glibc on x86 is pretty good and doesn't need frame pointers.
Possibly libunwind is more reliable for non-x86 and non-glibc architectures. Restricting to those known-good systems might be a good idea for release time, but for master it's good to leave it unrestricted to find possibly non-working systems.

@Marco-Sulla

Marco-Sulla commented Mar 12, 2020

Heuristically it seems to be beneficial around 400KiB array sizes (which is about the typical L2 cache size).

Does the patch apply only to large ndarrays, or to all ndarrays?
