ENH: perform inplace operations if one operand is a temporary #4322
It passes the scipy and pandas test suites, though if cython creates … Example benchmark: improves by 30% with the branch (9.79ms vs 14.8ms), as it saves on zeroing memory pages under Linux.
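The elision the branch performs at the C level can be sketched in pure Python (the names here are illustrative, not the PR's API): when one operand of an addition is a temporary with no other owners, the result can be written into that temporary's buffer instead of a fresh allocation.

```python
import numpy as np

# In `a = b + c + d`, the intermediate (c + d) is a temporary with
# refcount 1, so `b + temp` can write into temp instead of allocating
# a second temporary.
b = np.ones(5)
c = np.ones(5)
d = np.ones(5)

temp = c + d                 # the refcount-1 intermediate
np.add(b, temp, out=temp)    # in-place reuse: no second temporary
assert np.array_equal(temp, np.full(5, 3.0))
```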
```c
{
    if (Py_REFCNT(m1) == 1 &&
        PyArray_CheckExact(m1) && PyArray_CheckExact(m2) &&
        PyArray_DESCR(m1)->type_num == PyArray_DESCR(m2)->type_num &&
```
IIUC this type check is assuming that if m1.dtype == m2.dtype then m1.dtype == result.dtype as well. But I don't think this is true in all cases -- true_divide violates it all over the place, but also there's 'mm->d' for np.divide, and in the future arithmetic on bools will probably cast to int.
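The true_divide case is easy to see from Python: for integer inputs the result is a float, so matching operand dtypes do not imply the result shares them, and reusing an integer temporary's buffer would be wrong.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Operand dtypes match...
assert a.dtype == b.dtype
# ...but true_divide produces float64, so writing the result in place
# into the integer temporary's buffer would be incorrect.
assert np.true_divide(a, b).dtype == np.float64
```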
Also, what if the dtype is NPY_VOID, but the fields are different?
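A quick check from Python shows the hazard: two structured dtypes share NPY_VOID's type number while having entirely different fields and itemsizes, so a type_num comparison alone would wrongly allow buffer reuse.

```python
import numpy as np

a = np.zeros(3, dtype=[('x', 'i4')])
b = np.zeros(3, dtype=[('y', 'f8'), ('z', 'f8')])

# Both dtypes report the same type number (NPY_VOID)...
assert a.dtype.num == b.dtype.num
# ...but the fields and itemsizes differ, so the buffers are not
# interchangeable for an in-place result.
assert a.dtype != b.dtype
assert a.dtype.itemsize != b.dtype.itemsize
```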
Hm, I don't think there is a way to check these special casting rules in this place. Should we instead just whitelist operations?
Should consider whether to support using the second operand as the output as well. And possibly wire this deeper into the ufunc machinery, and implement for other operations (|, &, ~, ...). But perhaps this is a good start to see if this is feasible at all :-)

Might be a bit tricky to put it deeper, since numpy also likes to increase ref counts when calling into the umath machinery. May be nonsense, but I was wondering if we can/should corrupt the original array object so that unsafe direct C-API usage may be found. For the tests, I am not sure what the stride checks are for, but I think we need to check that m2 does not have possible memory overlap. I was going to suggest a contiguity check for m1, but it may be overly strict, though I don't like the idea that someone creates overlapped-memory arrays (usually those should not have refcount=1 and owndata, but I think you could do that).
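The overlap concern can be made concrete with views. np.shares_memory postdates this discussion (it arrived in later NumPy releases), but it makes explicit the check an elision would need before writing a result into one of its inputs:

```python
import numpy as np

base = np.arange(6.0)
m1 = base[1:]    # view of elements 1..5
m2 = base[:-1]   # view of elements 0..4 -- overlaps m1 in memory

# Writing the result into m1 while still reading m2 would read
# already-updated values, so in-place elision must be skipped here.
assert np.shares_memory(m1, m2)

# Disjoint views, by contrast, are safe candidates for buffer reuse.
left, right = base[:3], base[3:]
assert not np.shares_memory(left, right)
```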
```c
/* check if m1 is a temporary (refcnt == 1) so we can do inplace operations
 * instead of creating a new temporary */
static int can_alide_temp(PyArrayObject * m1, PyObject * m2)
```
s/alide/elide/
Added some WIP commits, including a check for m2.ref_cnt == 1 for associative operations to trigger more elisions and find more issues in test suites. The implementation is still very simplistic. How can I find out if an array is an exact type and can reuse temporaries in true_divide (besides hardcoding all types)?

An option to move the temporary reuse into the ufunc machinery would be to mark the arrays as temporary in the tp_ slot layer; that way the refcount can be changed on the way down to the ufunc functions while keeping the C API unchanged.
Yeah, it's a bit ugly but we could use an array flag for this. (Just have … For avoiding the issue with parametrized dtypes like void/str/datetime, a … The current hack may be good enough for testing things like Cython …
This might not be safe in Cython. Stefan Behnel wrote this on the Cython user list:

> Nathaniel Smith wrote: […] That can happen, yes. Cython only guarantees that the object it passes is safely owned so that the reference cannot go away while it's being processed by a function. If it's in a local (non-closure) variable (or Cython temporary variable), that guarantee holds, so it's safe to pass objects with only a single reference into a C function, and Cython will do that.
>
> Stefan
Hm, too bad. I guess that kills these plans.

Somewhat related #2802.

@juliantaylor Should this be closed?
Today I Learned that CPython itself makes a similar optimization in some circumstances: http://utcc.utoronto.ca/~cks/space/blog/python/ExaminingStringConcatOpt It's probably worth looking into how exactly they pull this off. It seems pretty plausible to me that either CPython also assumes that refcnt=1 implies temporary, in which case Cython is already broken and we could hassle Cython into fixing it, or else there's some way to tell the difference between a string concat that comes in via a public API like …
In cpython this is an optimization applied on the BINARY_ADD byte code, so cython is not broken. The only options I see are to make it an optional build flag for non-cython users, or to let cython set some flag to tell numpy not to do this optimization.
Ah, I see it. Well, foo. Optional build flags seem pointless because no one will set them. (And … But I don't see how to make temporary caching work without nasty side … The cython flag idea might work, but there might be other extensions that … If this optimization is really as valuable as it looks, then I think we …
It seems one can get the current frame with PyEval_GetFrame; possibly one can figure out whether the current instruction is e.g. BINARY_ADD, which should mean we are not being called from cython.

Something along the lines of: … (opcode 23 is BINARY_ADD)
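The idea above can be sketched in pure Python (the function name is ours, and the PR's check would live in C inside nb_add, where the eval loop is sitting on a BINARY_ADD while the operation runs; note the opcode is numbered 23 only in the CPython of that era, and is called BINARY_OP on Python >= 3.11):

```python
import dis
import sys

def current_caller_opname():
    """Return the name of the bytecode instruction the caller's frame
    is currently executing.

    When numpy's C-level add is entered from the interpreter loop, the
    analogous check would see BINARY_ADD/BINARY_OP; when entered from
    Cython or direct C-API calls, no such bytecode state applies.
    """
    frame = sys._getframe(1)                  # the caller's frame
    op = frame.f_code.co_code[frame.f_lasti]  # opcode byte at last instruction
    return dis.opname[op]
```

Calling this from ordinary Python code yields a call opcode rather than BINARY_ADD, which is exactly the distinction the check exploits.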
There is one idea that was raised but seems to have been discarded and I don't … Does Cython call the current C function that corresponds to add directly …

On Mon, Mar 31, 2014 at 6:49 PM, Julian Taylor notifications@github.com wrote: …
As far as I know, cython just calls PyNumber_Add, which calls the nb_add slot (== …
Not sure that checking the current instruction is enough; it could be that …
Right, unwinding the stack to see our caller (e.g. with libunwind) might be an option.
Thanks for the explanation.

On Mon, Mar 31, 2014 at 7:19 PM, Julian Taylor notifications@github.com wrote: …
Unwinding seems to work nicely, though it has about 10µs of overhead, so we would only want to do it for larger arrays.
Unwinding seems like it might be fragile? (Depending on architecture, build … If we can get ahold of the Python stack pointer, and op == BINARY_ADD, …

On Tue, Apr 1, 2014 at 6:23 PM, Julian Taylor notifications@github.com wrote: …

Nathaniel J. Smith
libunwind only works on a couple of Linux arches (x86, ia64, mips and arm as far as I know); macOS and Windows would need to do something else (Windows has APIs for this). I don't think you can get the stack pointer; it's probably a local variable of the bytecode loop in the frame evaluator.
Do we want to pursue this idea here? It sounds like there are a number of complications to deal with before it would be reliable.

Depends on whether a Linux-only solution is acceptable; using libunwind for this is just a couple lines of code.
Do we have any compelling examples of why this optimization is useful, on … (Asking because I'm considering asking python-dev and it would be good to …

On Thu, Apr 10, 2014 at 8:16 PM, Julian Taylor notifications@github.com wrote: …
It's about 10-15% in that case due to saving some cache; on large data you save the page zeroing that needs to be done on the temporary memory. numexpr or similar approaches (like compiling the AST into more efficient numpy-only code) are probably the way to go.

Well, it seems worth a try, anyway; we can see what happens: http://thread.gmane.org/gmane.comp.python.devel/148001
I have been poking around the python eval loop a bit again, and while the stack pointer and current instruction are local variables, it does seem to update the last instruction offset after most operations.

I forgot: we need the stack pointer too, to check that the operands are known to python.
Are you trying to figure out specifically how to make this work on older … For 3.5 purposes, I think the best strategy would be to prototype our ideal …

On Wed, Jul 2, 2014 at 10:29 PM, Julian Taylor notifications@github.com wrote: …
I think we might be able to solve it without python changes, maybe even without relying on the internals of the last instruction counter.
Pushed a commit with this change, tested with a small cythonized file: …

called via …

Here the top opcode is BINARY_ADD and cython does not increase the refcount of d, so without the stack check d would wrongly be 2 instead of 1; with the stack check d is correctly unchanged. It does still rely on implementation details, like the existence of the stack and that the stack is only cleared after the number operation is performed (I think it's never cleared until the frame ends).
So your idea is that we don't actually need to know what the arguments of …

On Wed, Jul 2, 2014 at 11:17 PM, Julian Taylor notifications@github.com wrote: …
Yes.
So let's see. If it's on the stack, then the stack has a refcnt for it -- … Re: Question 1: I think a = [1, 1, 1, 1, np.zeros(...)] will result in a … I think code like this therefore stands a chance of triggering a bug:

    cdef do_something(list a):
        …
On Wed, Jul 2, 2014 at 11:31 PM, Julian Taylor notifications@github.com wrote: …
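The list hazard can be made concrete with refcounts (a pure-Python sketch; note that sys.getrefcount itself holds one temporary reference to its argument, which is included in the number it reports):

```python
import sys
import numpy as np

a = [1, 1, 1, 1, np.zeros(3)]

# The array's owners: the list slot, plus the temporary reference that
# getrefcount's argument binding creates -- so at least 2.
assert sys.getrefcount(a[-1]) >= 2

# If an extension skipped the incref when passing a[-1] down (as Cython
# may for borrowed references), a callee would observe refcount 1 and
# wrongly treat the list's array as a disposable temporary.
```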
Urg, you are right about the list case; cython currently doesn't do this optimization, but it could. Too bad.
So how much would it cost for the interpreter to put the top of the stack …
Probably not much, but I really would like a solution that does not require a new python version :/
I know, that would be pretty sweet :-/ But even if it's possible it'll be a …

On Thu, Jul 3, 2014 at 1:11 AM, Julian Taylor notifications@github.com wrote: …
How about storing the current frame, current instruction and produced array pointer in thread-local storage?
Temporaries are identified by having a reference count of 1; if this is the case we can perform inplace operations to avoid creating another temporary. E.g. a = b + c + d contains one temporary (c + d) with reference count 1 while the other operands have count 2, so b + temp can be performed inplace. This saves memory allocation overheads and improves cache utilization.
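The refcount distinction driving the elision can be observed from Python for the named operands (the temporary itself cannot be inspected this way, since inspecting it adds a reference):

```python
import sys
import numpy as np

b = np.ones(4)
c = np.ones(4)
d = np.ones(4)

# A named array is owned by at least its variable binding plus the
# temporary argument reference getrefcount creates, so its count is
# >= 2; only true intermediates like (c + d) reach the C-level add
# with a refcount of exactly 1, marking them safe to reuse.
assert sys.getrefcount(b) >= 2

result = b + c + d           # the intermediate (c + d) is elidable
assert np.array_equal(result, np.full(4, 3.0))
```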
Implementation is not good yet; it just improves the checking, to find issues in test suites.

If it is on the stack it must have correct refcounts, unlike objects from e.g. cython where refcount increases can be skipped for internal objects.
Benchmark (IPython):

```python
import numpy as np

n = 10000
d = np.ones(n)
e = np.ones(n)
f = np.ones(n)

def fun(d, e, f):
    t = d + e
    t += f

%timeit -r 20 d+e+f
%timeit -r 20 fun(d, e, f)

n = 100000
d = np.ones(n)
e = np.ones(n)
f = np.ones(n)
%timeit -r 20 d+e+f
%timeit -r 20 fun(d, e, f)

n = 1000000
d = np.ones(n)
e = np.ones(n)
f = np.ones(n)
%timeit -r 20 d+e+f
%timeit -r 20 fun(d, e, f)
```
@juliantaylor I'm going to close this. Feel free to reopen if you want to pursue it.

Replaced by #7997 I think?