ENH: avoid temporary arrays in expressions (again) #7997

Merged
3 commits merged into numpy:master on Feb 25, 2017

Conversation

juliantaylor
Contributor

Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

 r = -a + b * c**2

One can avoid temporaries when the reference count of one of the
operands is one as it cannot be referenced by any other python code
(rvalue in C++ terms).
Python itself uses this method to optimize string concatenation.
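
One can observe this from the interpreter (a minimal sketch; the exact
counts are CPython implementation details and may vary between versions):

    import sys
    import numpy as np

    a = np.ones(10)
    # a named array is referenced by the variable and by the call
    # argument, so getrefcount reports at least 2
    print(sys.getrefcount(a))      # 2 on CPython

    # a temporary created inside the expression is only referenced by
    # the call argument, so it reports 1 and its buffer could be reused
    print(sys.getrefcount(a + 1))  # 1 on CPython
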
The tricky part is that via the C-API one can call the PyNumber_
functions directly and skip increasing the reference count when it is
not needed.
The python stack cannot be used to determine this as python does not
export the stack size, only the current position.
To avoid the problem we collect the backtrace of the call and, if it
consists exclusively of functions inside python or the c-library, we can
assume no other library is using the C-API in between.

One issue is that the reliability of backtraces is unknown on non-GNU
platforms. On GNU and amd64 it should be reliable enough, even without
frame pointers, as glibc will use GCC's stack unwinder via the mandatory
DWARF stack annotations.
Other platforms with backtrace need to be tested.
Another problem is that stack unwinding is very expensive. Unwinding a
100-function-deep stack (which is not uncommon from python) takes around
35 µs, so the elision can only be done for relatively large arrays.
Heuristically it seems to be beneficial around 400KiB array sizes (which
is about the typical L2 cache size).

The performance gain is quite significant: around 1.5 to 2 times faster
operations with temporaries can be observed. The speed is similar to
rewriting the operations into inplace operations manually.
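
For reference, the manual inplace rewrite that the elision is compared
against looks like this (a sketch, not code from the patch):

    import numpy as np

    a, b, c = (np.random.rand(1_000_000) for _ in range(3))

    # naive form: c**2, b * (...), -a and the final add each allocate
    r = -a + b * c**2

    # manual rewrite: one allocation, the rest done in place
    r2 = c ** 2
    r2 *= b
    r2 -= a

    np.testing.assert_allclose(r, r2)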

@juliantaylor juliantaylor force-pushed the temporary-elide branch 3 times, most recently from 5ca9a16 to 7e520b8 on August 30, 2016 20:00
@njsmith
Member

njsmith commented Aug 31, 2016

First, let me say that this is terrifying and this comment should not be taken as condoning it.

That said -- I think that all we need for the optimization to be safe is for the frames up to the last call to Py_EvalFrameEx to belong to cpython/numpy. (1) Is this true? (2) Does it allow a faster version of this where we only calculate the backtrace for, like, 5 frames deep or so, on the optimistic assumption that in almost any case where the optimization can apply, that will be enough to reach the last Py_EvalFrameEx?

(I might have the function name slightly wrong. You know what I mean.)

@juliantaylor
Contributor Author

juliantaylor commented Aug 31, 2016

I thought about this method too. It probably works: while you can call back into python from a c-extension, I can't think of a way you could get that code to operate on a refcount-1 array which is not a temporary.
So this is a possible optimization, but the code would be very similar: instead of determining the python dso address range we determine the EvalFrameEx range (as calling dladdr for the name is expensive), and then we need to find the longest call chain that ends up in multiarray.so. An issue here might be that the compiler could perform deinlining, which would spread the function over a non-contiguous address range. Keeping track of all entry point addresses instead might be a way to avoid this problem.
This could get the overhead down to around 5-10 µs, which would allow us to reduce the minimum array size where we elide.
But the profitable size is currently already in the L2 cache range. Unless we can reduce it to L1 size (which I doubt with a 5 µs budget) there might not be a significant performance gain.

@rgommers
Member

rgommers commented Sep 2, 2016

Naive question: the benefit of 1.5x-2x is clear, but how common are the use cases where this would hurt? And how badly?

@juliantaylor
Contributor Author

juliantaylor commented Sep 3, 2016

we don't try it when the arrays aren't large enough to make the check worthwhile. The size depends on the CPU's cache sizes and is between 400 and 800KiB. At that size array operations go from 150 to 500 µs.
So the worst case, a c-extension using refcount-1 arrays in the cache sweet spot with our check tuned too optimistically, would have an overhead of ~20%; for larger arrays it quickly becomes insignificant.
But this should be a pretty rare situation: while you can construct a situation where you need the stacktrace check (see our incref_elide tests), I don't think it actually happens often in practice. I haven't seen anything do it yet.

@juliantaylor
Contributor Author

updated so it now also elides when it can do safe casting; this allows e.g. double += float or int += smaller-int
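
For illustration, the condition corresponds to numpy's 'safe' casting
rules (a sketch; that this matches the exact check in the C code is an
assumption):

    import numpy as np

    # double += float is eligible: float32 casts safely to float64,
    # so the float64 operand's buffer can hold the result
    print(np.can_cast(np.float32, np.float64, casting='safe'))  # True

    # the reverse cast would lose precision, so no elision there
    print(np.can_cast(np.float64, np.float32, casting='safe'))  # False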

there is still a problem on systems where libpython is not statically linked into python, as is the case on e.g. debian.
before main there is a symbol on the stack that doesn't seem to be identifiable with backtrace. A solution would be to check the Py_Main address (and maybe others for embedding) and stop when that is found. Or switch to stopping at EvalFrameEx, which would also be a bit faster.

Aside from this, how do we want to proceed?
It still needs testing on mac; while it seems to have backtrace(), I don't know how good it is performance- and accuracy-wise. Does someone want to test that?
Windows likely won't work, though there are probably equivalent windows api functions one could use. Someone with more windows knowledge should look into that.

Then we need to tune the cutoff to avoid performance penalties. We could set it rather high so we have basically no penalty ever; somewhere around 1MiB of data should be good, as even if that fits into a CPU cache the operation will currently take long enough that the checking overhead is negligible.
But a too high value also reduces the benefit, so we should spend a little effort tuning it.
I have tested a couple of linux machines with different caches: an AMD with L2/L3 of 512/6144, an intel core2 with only an L2 of 6144, an intel with L2/L3 of 256/3072 and one with L2/L3 of 512/18000 (all in KiB). The threshold where (d + 1) + 1 becomes profitable varies from 400KiB to 800KiB. It doesn't seem easy to relate it directly to the cache sizes, in particular when an L3 is available.
Maybe we could write a little data gathering script and send it to the mailing list for testing?
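
Something along these lines, a rough sketch of such a script (the sizes
and repeat counts are arbitrary choices, not taken from the patch):

    import timeit
    import numpy as np

    def with_temp(d):
        return (d + 1) + 1       # the second add could reuse the temporary

    def in_place(d):
        r = d + 1
        r += 1                   # explicitly reuse the buffer
        return r

    for kib in (64, 128, 256, 512, 1024, 2048):
        d = np.ones(kib * 1024 // 8)   # float64: 8 bytes per element
        t1 = min(timeit.repeat(lambda: with_temp(d), number=100, repeat=5))
        t2 = min(timeit.repeat(lambda: in_place(d), number=100, repeat=5))
        print('%5d KiB  temp/inplace time ratio: %.2f' % (kib, t1 / t2))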

@juliantaylor
Contributor Author

juliantaylor commented Sep 10, 2016

Updated to stop at the frame evaluation function instead of checking everything; probably good enough.
That reduces the overhead to 5-10 µs, so one could enable it already at around 100KiB array sizes, but I set it to 160KiB just in case.
Note the function name changed in python 3.6, which has the juicy pep 523 implemented; that should allow us to do all this with zero overhead (and maybe more), but I'll leave it for later.

I think this should be ready for some wider testing.

can_elide_temp(PyArrayObject * m1, PyObject * m2, int * cannot)
{
    int i;
    if (Py_REFCNT(m1) != 1 || !PyArray_CheckExact(m1) ||
Member


Py_REFCNT should be the numpy-specific PyArray_REFCOUNT.

This way the whole function will not compile on PyPy, which is good as it is very CPython-specific.

@rainwoodman
Contributor

@juliantaylor Very naive question. Could you elaborate a bit more on "The tricky part is that via the C-API one can call the PyNumber_ functions directly and skip increasing the reference count when it is not needed"? Do you mean that if one writes

 r = -a + b * c**2

then the refcounts of a, b and c are at least two during the evaluation of the expression? How?

@juliantaylor
Contributor Author

one count is in the variable itself and the other is the reference to the variable on the stack.
The expression a + b + b basically compiles to this bytecode:

          0 LOAD_FAST                0 (a)
          3 LOAD_FAST                1 (b)
          6 BINARY_ADD          
          7 LOAD_FAST                1 (b)
         10 BINARY_ADD      

the loads onto the stack increase the reference count; BINARY_ADD decreases them both again.
So a non-temporary coming from the python interpreter always has a count of at least two, but the temporary result of the addition is a new object not assigned to a variable (no store), so it has a count of one.
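
The disassembly above can be reproduced with the dis module (on a
CPython of that era; newer versions show BINARY_OP instead of
BINARY_ADD):

    import dis

    def f(a, b):
        return a + b + b

    dis.dis(f)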

This is only valid from the interpreter; a C extension could legitimately call PyNumber_Add on two refcount-1 objects directly, so we have to do this backtrace checking.

@juliantaylor
Contributor Author

Any further comments on the implementation? Otherwise I'll squash the commits and prepare it for merging (release notes etc.).
I'll probably double the threshold just to be sure not to cause performance regressions, and check whether divisions can be added (I think my earlier concerns about division changing types should be handled by the safe can_cast check now).
Some testing on the mac would be great; a benchmark is available as `check.ipy` here:
https://github.com/juliantaylor/numpy/tree/elide-bench
https://mail.scipy.org/pipermail/numpy-discussion/2016-September/076034.html

@charris
Member

charris commented Nov 2, 2016

@juliantaylor I'd rather push this off to 1.13.0 along with @pv's overlap inplace work. There is already a lot of stuff in 1.12 and the last big item I want in it is __numpy_ufunc__, and I may push that off also.

@charris charris added this to the 1.13.0 release milestone Nov 4, 2016
@charris
Member

charris commented Nov 6, 2016

1.12.x is branched, might be a good time to clean up the history.

@homu
Contributor

homu commented Dec 13, 2016

☔ The latest upstream changes (presumably #8367) made this pull request unmergeable. Please resolve the merge conflicts.

@juliantaylor juliantaylor force-pushed the temporary-elide branch 2 times, most recently from 8c2d717 to 8537ed4 on January 18, 2017 13:09
@juliantaylor
Contributor Author

squashed the history

* A possible future improvement would be to change cpython to give us access
* to the top of the stack. Then we could just check that the objects involved
* are on the cpython stack instead of checking the function callstack.
*/
Member


It would be good to list the acceptable types, the valid operators, and the distinction between commutative and non-commutative operators. Maybe define a few terms as well, such as elision and DSO. I know this is in the code, but it would be good if one could read this and know in more detail what is going on. Might also mention how you can determine whether the stack objects are in python, numpy, etc. I'm also not clear on why it is OK for a temporary to be in numpy; or is it the case that you only want to know that the possible non-temporaries come from numpy?

Contributor Author


the assumption for this to work is that numpy does not call PyNumber_* functions itself. The whole backtrace thing is only necessary because other libraries (e.g. cython) can do that.
To my knowledge numpy does not, for basic types.

updated the docs more

@charris
Member

charris commented Feb 24, 2017

I think this is looking good, but more documentation is desirable.

@juliantaylor juliantaylor force-pushed the temporary-elide branch 2 times, most recently from a5668d8 to e53ee05 on February 24, 2017 16:46
@juliantaylor
Contributor Author

odd, why is windows failing now? It shouldn't even be enabled there. Does the windows compiler not handle our typenum enum like the standard says it should?

Temporary arrays generated in expressions are expensive as they imply
extra memory bandwidth, which is the bottleneck in most numpy operations.
For example:

     r = -a + b

One can avoid temporaries when the reference count of one of the
operands is one as it cannot be referenced by any other python code
(rvalue in C++ terms).
Python itself uses this method to optimize string concatenation.
The tricky part is that via the C-API one can call the PyNumber_
functions directly and skip increasing the reference count when it is
not needed.
The python stack cannot be used to determine this as python does not
export the stack size, only the current position.
To avoid the problem we collect the backtrace of the call until the
python frame evaluation function and, if it consists exclusively of
functions inside python or the c-library, we can assume no other library
is using the C-API in between.

One issue is that the reliability of backtraces is unknown on non-GNU
platforms. On GNU and amd64 it should be reliable enough, even without
frame pointers, as glibc will use GCC's stack unwinder via the mandatory
DWARF stack annotations.
Other platforms with backtrace need to be tested.
Another problem is that stack unwinding is very expensive. Unwinding a
100-function-deep stack (which is not uncommon from python) takes around
35 µs, so the elision can only be done for relatively large arrays.
Heuristically it seems to be beneficial around 256KiB array sizes (which
is about the typical L2 cache size).

The performance gain is quite significant: around 1.5 to 2 times faster
operations with temporaries can be observed. The speed is similar to
rewriting the operations into inplace operations manually.
@juliantaylor
Contributor Author

odd, must have been a CI fluke or the intp bug that's still in master

* Only check until the python frame evaluation function is found
* approx 10us overhead for stack size of 10
*
* TODO some calls go over scalarmath in umath but we cannot get the base
Member


At some point I'd like to move scalarmath back into multiarray. The idea would be to put all needed functions in npymath. I think there are only a few ufuncs still needed.

@charris charris merged commit 7bc881d into numpy:master Feb 25, 2017
@charris
Member

charris commented Feb 25, 2017

In it goes. Thanks Julian.

@charris
Member

charris commented Feb 25, 2017

For future reference, PyNumber functions are called from several places in multiarray with arrays of reference count 1. AFAICT from a quick scan, it would be OK to use the passed arrays as temporaries. Am I missing something, or have you looked at those places?

EDIT: I assume there is nothing on the interpreter call stack in those cases.

@rgommers
Member

Documentation LGTM now, much more understandable

@juliantaylor
Contributor Author

juliantaylor commented Feb 25, 2017

most of them are for object arrays, which are not handled here; the handful of non-object ones look fine: either the reference count is != 1 or they are actually ones which can be done inplace. Of course I might have overlooked some, which would mean we have holes in our test coverage.

@juliantaylor juliantaylor deleted the temporary-elide branch February 25, 2017 10:05
@charris
Member

charris commented Feb 25, 2017

@juliantaylor It is a bit concerning as it introduces one more thing to watch out for when writing or reviewing C code. Apropos Ralf's comment, what platforms do we know that this will work on? Is there an easy way to tell?

@rgommers I'll fix those backticks when it comes time to edit the release notes.

@juliantaylor
Contributor Author

I don't think it is something to worry about; usage of PyNumber_ in numpy outside of object arrays is very rare (most of it is a c-api stddev computation I doubt anyone is actually using) and new code must have tests anyway.

The release notes are as accurate as I can be: platforms that have the backtrace() function. I have tested linux and mac. The BSDs and other unixes may also have it, but I don't know.

@jakirkham
Contributor

Did you guys look into whether this could work with libunwind?

@juliantaylor
Contributor Author

It works with libunwind too, but it's unnecessary, at least on the platforms I could test. backtrace with glibc on x86 is pretty good and doesn't need frame pointers.
Possibly libunwind is more reliable for non-x86 and non-glibc architectures. Restricting to those known-good systems might be a good idea for release time, but for master it's good to leave it unrestricted to find possibly non-working systems.

@Marco-Sulla

Marco-Sulla commented Mar 12, 2020

Heuristically it seems to be beneficial around 400KiB array sizes (which is about the typical L2 cache size).

Does the patch apply only to large ndarrays, or to all ndarrays?
