ENH: Allow an arbitrary number of operands in nditer #28080
Internally, the change is fairly moderate: it mainly requires using `int` in the struct (growing it a bit) and a temporary work-array in the buffer setup function (one could probably use scratch space on the nditer, but this seemed OK to me). The bigger and somewhat annoying change is that the Python wrapper needs a lot of per-operand space, so setting that up takes some work.
npyiter_convert_ops(PyObject *op_in, PyObject *op_flags_in,
                    PyArrayObject **op, npy_uint32 *op_flags,
                    int *nop_out)
npyiter_prepare_ops(PyObject *op_in, PyObject **out_owner, PyObject ***out_objs)
I effectively split this, because I need `nop` early. One helper fetches an array of operands (with `PySequence_Fast_ITEMS`) and the original function then converts them to arrays.
(I didn't change the fact that functions with "convert" in their name return 1 for success...)
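For illustration, a minimal sketch of the two-step pattern described above (the name, signature, and error message are illustrative, not the actual NumPy helper):

#include <Python.h>

static Py_ssize_t
prepare_ops_sketch(PyObject *op_in, PyObject **out_owner, PyObject ***out_objs)
{
    /* PySequence_Fast returns a list/tuple that owns the items. */
    PyObject *seq = PySequence_Fast(op_in, "operands must be a sequence");
    if (seq == NULL) {
        return -1;
    }
    *out_owner = seq;                        /* caller must Py_DECREF this */
    *out_objs = PySequence_Fast_ITEMS(seq);  /* borrowed item pointers */
    return PySequence_Fast_GET_SIZE(seq);
}

With `nop` known up front, the conversion step can then allocate exactly the per-operand space it needs.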
"Iterator operand %zd is write-only", i); | ||
return NULL;
}
#endif
In a sense a nice comment, but didn't seem very useful here...
-    npy_uint8 ndim, nop;
-    npy_int8 maskop;
+    npy_uint8 ndim;
+    int nop, maskop;
So I think effectively, this means we need 64 bits more space here (on typical systems). A bit unfortunate, but even if we limit to `int16`, we lose 32 bits here anyway.
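For a rough feel of the size impact, a standalone toy comparison (these are not the actual NpyIter fields, just the same widths):

#include <stdio.h>

struct narrow { unsigned char ndim, nop; signed char maskop; };
struct wide   { unsigned char ndim; int nop, maskop; };

int main(void)
{
    /* Typically prints 3 and 12: the wider fields plus alignment padding
     * cost on the order of 64 extra bits, in line with the estimate above. */
    printf("%zu %zu\n", sizeof(struct narrow), sizeof(struct wide));
    return 0;
}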
The check is unnecessary here (it cannot fail) but it seemed clearer to not use that knowledge when it depends on where we call it.
This is really nice! It also makes the code clearer and cleaner, which is a nice bonus. My comments are all at the extremely nitpicky end.
p.s. When merged, we can use it to remove some annoying work-arounds in stride_tricks.py
/*
 * Converts the operand array and op_flags array into the form
 * NpyIter_AdvancedNew needs. Sets nop, and on success, each
This does not set `nop`. Just start the sentence with "On success, ..."?
Py_INCREF(op_in);
op[0] = (PyArrayObject *)op_in;
*out_owner = op_in;
*out_objs = out_owner;
Just set to `op_in`, and swap order, so it is the same as in the stanza just above.
Will swap; setting it to `op_in` doesn't work, though, so I'll add a comment.
PyObject *op_in_owned;  /* Sequence/object owning op_objs. */
int nop = npyiter_prepare_ops(op_in, &op_in_owned, &op_objs);
if (nop < 0) {
    npy_free_cache_dim_obj(itershape);
Maybe a `goto early_failure` (or `pre_alloc_fail`) that goes to the end of `finish`? Would require initializing `op_in_owned = NULL` (`res = -1` is already OK), so one could have,
...
npy_free_workspace(op_axes);
early_fail:
Py_XDECREF(op_in_owned);
npy_free_cache_dim_obj(itershape);
return res;
}
Could be used above and below.
Yeah, it is only used twice, but that is more than once, so I added `pre_alloc_fail:`.
NPY_ALLOC_WORKSPACE(op_axes, int *, 8, nop);
/*
 * Trying to allocate should be OK if one failed, check for error now
 * that we can use `goto cleanup` to clean up everything.
`goto finish`
PyArray_Descr **op_request_dtypes = (PyArray_Descr **)(op + nop);
PyArray_Descr **op_request_dtypes_inner = op_request_dtypes + nop;

/* And other workspaces (that do not need to clean up their content) */
Same comment as above - might as well move these above `op` and write `finish` so it can handle `op == NULL`.
And again fine to ignore!
NPY_ALLOC_WORKSPACE(op_flags_inner, npy_uint32, 8, nop);
NPY_ALLOC_WORKSPACE(op_axes_storage, int, 8 * NPY_MAXDIMS, nop * NPY_MAXDIMS);
NPY_ALLOC_WORKSPACE(op_axes, int *, 2 * 8, 2 * nop);
int **op_axes_nop = op_axes + nop;
I'd do this after the sanity check (I know it makes no difference, `NULL` just being `0`, but somehow feels better...).
Good point. Pretty sure null pointer arithmetic is also undefined (dunno if it happens here, but technically it is probably allowed to trip).
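A minimal sketch of the reordering being suggested (illustrative names, plain `malloc` instead of the workspace macro): derive the second pointer only after the allocation check, so no arithmetic is ever done on a NULL pointer.

#include <stdlib.h>

static int **
alloc_op_axes_sketch(int nop, int ***op_axes_nop_out)
{
    int **op_axes = malloc(2 * (size_t)nop * sizeof(*op_axes));
    if (op_axes == NULL) {
        return NULL;                    /* caller reports the error */
    }
    *op_axes_nop_out = op_axes + nop;   /* well defined: op_axes != NULL */
    return op_axes;
}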
numpy/_core/tests/test_nditer.py
@@ -3379,6 +3378,40 @@ def test_partial_iteration_error(in_dtype, buf_dtype):
    assert count == sys.getrefcount(value)


def test_arbitrary_number_of_ops():
    # 2*16 + 1 is still just a few kiB, so should be fast an easy to deal with
and easy
numpy/_core/tests/test_nditer.py
def test_arbitrary_number_of_ops_nested():
    # 2*16 + 1 is still just a few kiB, so should be fast an easy to deal with
same
@pytest.mark.slow
@requires_memory(9 * np.iinfo(np.intc).max)
shouldn't this be 8? Or is there a byte for write flags or so? (not that it matters!)
Thought I'd just make it a bit more (of course a small addition should be enough, and that probably doesn't matter in practice).
(mostly; some are slightly less clear on whether they are a big improvement)
Agreed, that would be nice. May also introduce fast-call (already looked at it), although even there ...
OK, looks good, though with a small further comment. More importantly, I think on this second go I spotted a memory leak; do check!
numpy/_core/src/multiarray/alloc.h
@@ -98,6 +98,11 @@ _npy_init_workspace(
    TYPE *NAME; \
    _npy_init_workspace((void **)&NAME, NAME##_static, (fixed_size), sizeof(TYPE), (size))

/* Same as above, but additionally nulls the workspace */
This is not a bad idea, but it is not used right now, and `_npy_cinit_workspace` is not defined either, so it cannot work!
Oops, sorry - I was briefly considering it and then decided that it should be a follow-up.
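If the zeroing variant comes back as a follow-up, it could look roughly like this standalone sketch (NumPy's real `NPY_ALLOC_WORKSPACE` / `_npy_init_workspace` machinery differs; names here are made up): a small static buffer for the common case, a heap allocation otherwise, zeroed either way.

#include <stdlib.h>
#include <string.h>

/* Declares NAME, pointing at a stack buffer for small sizes or at heap
 * memory otherwise; NAME is NULL if the heap allocation failed. */
#define SKETCH_CALLOC_WORKSPACE(NAME, TYPE, FIXED_SIZE, SIZE)              \
    TYPE NAME##_static[FIXED_SIZE];                                        \
    TYPE *NAME = ((size_t)(SIZE) <= (FIXED_SIZE))                          \
            ? NAME##_static : (TYPE *)malloc(sizeof(TYPE) * (SIZE));       \
    if (NAME != NULL) {                                                    \
        memset(NAME, 0, sizeof(TYPE) * (SIZE));                            \
    }

#define SKETCH_FREE_WORKSPACE(NAME)                                        \
    do {                                                                   \
        if (NAME != NAME##_static) {                                       \
            free(NAME);                                                    \
        }                                                                  \
    } while (0)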
/* The read/write settings */
NpyIter_GetReadFlags(iter, self->readflags);
if (self->writeflags == NULL) {
    self->writeflags = PyMem_Malloc(sizeof(char) * NpyIter_GetNOp(iter));
This should be deallocated somewhere, no? Presumably in `npyiter_dealloc`?
ouch, yes thanks, this is a serious thing!
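A minimal sketch of the kind of fix being discussed (hypothetical wrapper struct, not the actual nditer wrapper): the lazily allocated buffer has to be released in the type's dealloc.

#include <Python.h>

typedef struct {
    PyObject_HEAD
    char *writeflags;   /* allocated lazily with PyMem_Malloc */
} IterWrapperSketch;

static void
iter_wrapper_dealloc(IterWrapperSketch *self)
{
    PyMem_Free(self->writeflags);   /* PyMem_Free(NULL) is a no-op */
    self->writeflags = NULL;
    Py_TYPE(self)->tp_free((PyObject *)self);
}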
memset(op, 0, sizeof(PyObject *) * 2 * nop);
PyArray_Descr **op_request_dtypes = (PyArray_Descr **)(op + nop);

/* And other workspaces (that do not need to clean up their content) */
Happy to keep as is!
OK, looks all great now!
Best to wait for tests to complete, of course - and probably best to squash-merge.
if (PyTuple_Check(op_in) || PyList_Check(op_in)) {
nop = PySequence_Size(op_in);
if (nop == 0) {
PyObject *seq = PySequence_Fast(op_in, "failed accessing item list");
This makes me a bit nervous from the perspective of free-threaded Python. Is `op_in` ever exposed to Python?
There are many other places where we use it in more dangerous ways (e.g. array coercion), and for good reason (both speed and code simplicity).
I feel this needs to be fixed as a general pattern rather than by making the code more inconvenient here?
But if @ngoldbaum has a better suggestion, I am happy with that.
You can hold a critical section on `op_in`. You only need to do that if the operand is mutable (e.g. not a tuple), but it might be annoying and require some refactoring to conditionally enter a critical section, because the macros have brackets.
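A rough sketch of that suggestion (illustrative names; assumes CPython 3.13+, where `Py_BEGIN_CRITICAL_SECTION` is public and is a no-op on non-free-threaded builds):

#include <Python.h>

static Py_ssize_t
checked_nop_sketch(PyObject *op_in)
{
    Py_ssize_t nop = -1;
    if (PyTuple_Check(op_in)) {
        /* Tuples are immutable, so no locking is needed. */
        nop = PyTuple_GET_SIZE(op_in);
    }
    else if (PyList_Check(op_in)) {
        /* Lock the list so another thread cannot resize it while we
         * inspect it; note that the macros open and close a block. */
        Py_BEGIN_CRITICAL_SECTION(op_in);
        nop = PyList_GET_SIZE(op_in);
        Py_END_CRITICAL_SECTION();
    }
    else {
        PyErr_SetString(PyExc_TypeError, "expected a tuple or list of operands");
    }
    return nop;
}

The awkward part mentioned above is that an early `return` or `goto` out of the bracketed region is not allowed, which is what makes entering the critical section conditionally around shared code annoying.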
Also, is it really so performance-critical to avoid the strong references? If it really is about code ergonomics more than performance, what if we defined e.g. a `PySequence_Items` that returns a heap-allocated C array of strong references to the items in the sequence, which the caller is responsible for freeing?
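A sketch of what such a (currently non-existent, hypothetical) helper could look like - a heap-allocated array of strong references that the caller frees:

#include <Python.h>

static PyObject **
sequence_items_sketch(PyObject *seq, Py_ssize_t *n_out)
{
    Py_ssize_t n = PySequence_Size(seq);
    if (n < 0) {
        return NULL;
    }
    PyObject **items = PyMem_New(PyObject *, n > 0 ? n : 1);
    if (items == NULL) {
        return (PyObject **)PyErr_NoMemory();
    }
    for (Py_ssize_t i = 0; i < n; i++) {
        items[i] = PySequence_GetItem(seq, i);    /* new (strong) reference */
        if (items[i] == NULL) {
            while (i-- > 0) {
                Py_DECREF(items[i]);
            }
            PyMem_Free(items);
            return NULL;
        }
    }
    *n_out = n;
    return items;
}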
That was my reasoning for asking whether CPython shouldn't just guarantee it or have a convenience around it. If we make a heap allocation, a solution might as well be to write `tmp = PyList_AsTuple(obj); Py_SETREF(tmp, PySequence_Fast(tmp, "..."))` (plus error checks).
Of course you can also re-organize the code for a critical section...
(Right now we avoid most heap allocations, but only because avoiding them isn't any less convenient. For all but the most trivial iterations it likely doesn't matter, and if we care there are bigger fish, such as `tp_vectorcall`, first.)
A `PySequence_Fast_immutable()` might be a convenience method then (also potentially better for other Python implementations, which is an important reason `PySequence_Fast` exists).
Anyway, I need to think about the right pattern, preferably one that also works well for other places in NumPy (although that may just be 1-2 places, since much of the use is internal and array coercion is special enough).
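And a minimal sketch of the "convert to a tuple first" option with the error checks spelled out (again illustrative, not the code used in this PR):

#include <Python.h>

static PyObject *
ops_as_tuple_sketch(PyObject *op_in)
{
    if (PyTuple_CheckExact(op_in)) {
        Py_INCREF(op_in);
        return op_in;
    }
    /* PySequence_Tuple copies the item references, so a later mutation
     * of the original list by another thread cannot invalidate them. */
    return PySequence_Tuple(op_in);
}

On the returned tuple, `PySequence_Fast_ITEMS` and friends are then safe to use without any locking.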
CPython has `Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST`, which isn't public but handles checking for a mutable vs. immutable sequence and only activates the critical section in the mutable case. Maybe we could ask for that to be made public?
Yeah, I dunno how slow a critical section is (i.e. does it matter?). For array creation, where this is important since huge, simple lists may happen, we probably don't get many tuples in practice.
For this case, and especially the similar `PyArray_IntpFromIndexSequence` (which already does one heap allocation, though), `Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST` seems useful, but maybe also simple enough to write ourselves (i.e. a `PyTuple_CheckExact`?).
For the case here, the heap allocation may just be the simplest path (or using `GetItem`, since we don't expect a very long list/tuple).
OTOH, maybe `Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST` is so simple that it is best to just make it public, since it may also be a good way to document the solutions:
- Use `Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST`
- Convert to a tuple (so it is immutable)
- Stick to the functional API (so `tp_getitem` has fine-grained locking if necessary).
> I dunno how slow a critical section is (i.e. does it matter?)
It's pretty fast if it's uncontended. If there's contention it slows down a lot and as we saw elsewhere it doesn't scale all that well.
> but maybe also simple enough to write ourselves
Unfortunately we can't without using the private `_PyCriticalSection_Begin` functions that implement the macros with brackets. They very purposefully made it so the public symbols are macros with brackets, to force users to deal with the control-flow issues. I doubt any of these APIs will change long-term, so it could also be a place where we include CPython internals and deal with that.
That said, this one particular case really does seem like a clear candidate for CPython to make public, since it's specifically for use with `PySequence_Fast` and we have real-world examples in NumPy where it would simplify things a lot.
I'll chat about this with our team internally and bring it up with upstream if there aren't any showstoppers.
Does this affect benchmarks for small iterators? There is now additional overhead of memory management, but maybe it is too small to notice?
Well, for complicated iterators things seem just random to me, which I suppose is the way of these timings...
@mattip - the code does still have a static allocation for small ...
Thanks @seberg
EDIT on the description above: note to self - once merged, inform numexpr (this may also be very interesting for numba ufuncs, but that needs a decent amount of further changes in ufuncs).