ENH: Allow an arbitrary number of operands in nditer #28080


Merged
merged 5 commits into numpy:main from nditer-avoid-maxargs-fully on Jan 6, 2025

Conversation

seberg
Member

@seberg seberg commented Dec 31, 2024

Internally, the change is very moderate, mainly requiring the use of int in the struct (growing it a bit) and otherwise a temporary work-array in the buffer setup function (one could probably use scratch space on the nditer, but this seemed OK to me).

The big and somewhat annoying change is that the Python wrapper needs a lot of per-operand space, so setting it up is a bit tedious.

EDIT: Note to self, once merged inform numexpr (for numba, ufuncs may be very interesting, but that needs a decent amount of further changes in ufuncs).

@seberg seberg force-pushed the nditer-avoid-maxargs-fully branch from 9e23d1d to 3f5818a Compare December 31, 2024 12:32
@seberg seberg force-pushed the nditer-avoid-maxargs-fully branch from 3f5818a to b592672 Compare December 31, 2024 12:45
npyiter_convert_ops(PyObject *op_in, PyObject *op_flags_in,
PyArrayObject **op, npy_uint32 *op_flags,
int *nop_out)
npyiter_prepare_ops(PyObject *op_in, PyObject **out_owner, PyObject ***out_objs)
Member Author

I split this, effectively, because I need nop: one helper to fetch an array of operands (with PySequence_Fast_ITEMS), and then the original function to convert them to arrays.

(I didn't change the fact that functions with "convert" in their name return 1 for success...)
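
A rough sketch of what such a prepare-style helper could look like (illustrative only, not the PR's exact code; the name, error message, and the omitted overflow check are assumptions). It returns the operand count, exposes the items as a borrowed array via PySequence_Fast_ITEMS, and uses the owner slot itself as the one-element array in the single-operand case:

    #include <Python.h>

    /* Return the number of operands, or -1 on error.  *out_objs points at a
     * borrowed array of operand objects that stays valid while *out_owner
     * (a new reference) is alive. */
    static int
    prepare_ops_sketch(PyObject *op_in, PyObject **out_owner, PyObject ***out_objs)
    {
        if (PyTuple_Check(op_in) || PyList_Check(op_in)) {
            PyObject *seq = PySequence_Fast(op_in, "failed accessing item list");
            if (seq == NULL) {
                return -1;
            }
            Py_ssize_t nop = PySequence_Fast_GET_SIZE(seq);
            if (nop == 0) {
                PyErr_SetString(PyExc_ValueError, "need at least one operand");
                Py_DECREF(seq);
                return -1;
            }
            *out_owner = seq;                        /* owns the item array */
            *out_objs = PySequence_Fast_ITEMS(seq);  /* borrowed items */
            return (int)nop;
        }
        /* Single operand: the owner slot doubles as a length-1 object array
         * (assigning op_in itself would not give us a PyObject ** array). */
        Py_INCREF(op_in);
        *out_owner = op_in;
        *out_objs = out_owner;
        return 1;
    }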

"Iterator operand %zd is write-only", i);
return NULL;
}
#endif
Member Author

In a sense a nice comment, but didn't seem very useful here...

npy_uint8 ndim, nop;
npy_int8 maskop;
npy_uint8 ndim;
int nop, maskop;
Member Author

So I think, effectively, this means we need 64 bits more space here (on typical systems). A bit unfortunate, but even if we limited it to int16, we would lose 32 bits here anyway.

The check is unnecessary here (it cannot fail), but it seemed clearer not to rely on that knowledge, since whether it holds depends on where we call it.
Contributor

@mhvk mhvk left a comment

This is really nice! It also makes the code clearer and cleaner, which is a nice bonus. My comments are all at the extremely nitpicky end.

p.s. When merged, we can use it to remove some annoying work-arounds in stride_tricks.py


/*
* Converts the operand array and op_flags array into the form
* NpyIter_AdvancedNew needs. Sets nop, and on success, each
Contributor

This does not set nop. Just start the sentence with "On success, ..."?

Py_INCREF(op_in);
op[0] = (PyArrayObject *)op_in;
*out_owner = op_in;
*out_objs = out_owner;
Contributor

Just set to op_in, and swap order, so it is the same as in the stanza just above.

Member Author

Will swap; setting to op_in doesn't work, but I'll add a comment.

PyObject *op_in_owned; /* Sequence/object owning op_objs. */
int nop = npyiter_prepare_ops(op_in, &op_in_owned, &op_objs);
if (nop < 0) {
npy_free_cache_dim_obj(itershape);
Contributor

Maybe a goto early_failure (or pre_alloc_fail) that goes to the end of finish? Would require initializing op_in_owned=NULL (res = -1 is already OK), so one could have,

...
    npy_free_workspace(op_axes);
early_fail:
    Py_XDECREF(op_in_owned);
    npy_free_cache_dim_obj(itershape);
    return res;
}

Could be used above and below.

Member Author

Yeah, it is only used twice, but that is more than once, so I added pre_alloc_fail:
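
A minimal generic illustration of the resulting two-label cleanup ordering (plain C, not the NumPy code; the names and the stand-in resources are assumptions): failures before the workspaces are allocated jump to pre_alloc_fail and skip the workspace frees, while later failures go through finish first and fall through into the same early-resource cleanup.

    #include <stdlib.h>
    #include <string.h>

    static int
    two_stage_cleanup_example(int nop)
    {
        int res = -1;
        char *early_resource = malloc(16);   /* stands in for op_in_owned/itershape */
        if (early_resource == NULL) {
            return -1;
        }
        if (nop <= 0) {                      /* validation before any workspace */
            goto pre_alloc_fail;
        }
        int *workspace = malloc((size_t)nop * sizeof(int));
        if (workspace == NULL) {
            goto pre_alloc_fail;             /* no workspace to free yet */
        }
        memset(workspace, 0, (size_t)nop * sizeof(int));
        if (workspace[0] != 0) {             /* stand-in for a later failure */
            goto finish;
        }
        res = 0;
    finish:
        free(workspace);                     /* workspaces first */
    pre_alloc_fail:
        free(early_resource);                /* then the early resources */
        return res;
    }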

NPY_ALLOC_WORKSPACE(op_axes, int *, 8, nop);
/*
* Trying to allocate should be OK if one failed, check for error now
* that we can use `goto cleanup` to clean up everything.
Contributor

goto finish

PyArray_Descr **op_request_dtypes = (PyArray_Descr **)(op + nop);
PyArray_Descr **op_request_dtypes_inner = op_request_dtypes + nop;

/* And other workspaces (that do not need to clean up their content) */
Contributor

Same comment as above - might as well move these above op and write finish so it can handle op == NULL

Contributor

And again fine to ignore!

NPY_ALLOC_WORKSPACE(op_flags_inner, npy_uint32, 8, nop);
NPY_ALLOC_WORKSPACE(op_axes_storage, int, 8 * NPY_MAXDIMS, nop * NPY_MAXDIMS);
NPY_ALLOC_WORKSPACE(op_axes, int *, 2 * 8, 2 * nop);
int **op_axes_nop = op_axes + nop;
Contributor

I'd do this after the sanity check (I know it makes no difference, NULL just being 0, but somehow feels better...)

Member Author

Good point. Pretty sure null pointer arithmetic is also undefined (dunno if it happens, but technically it is allowed to trip, probably).
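
A tiny illustration of the reordering (generic C with malloc, not the actual workspace macro; the function name follows the diff above but is made up): derive the second pointer only after the allocation check, since pointer arithmetic on NULL is technically undefined behavior.

    #include <stdlib.h>

    /* Allocate 2*nop pointers and split the block into two halves; returns
     * the base pointer or NULL.  op_axes + nop is only computed once op_axes
     * is known to be non-NULL. */
    static int **
    split_axes_workspace(int nop, int ***out_op_axes_nop)
    {
        int **op_axes = malloc(2 * (size_t)nop * sizeof(int *));
        if (op_axes == NULL) {
            return NULL;                    /* never compute NULL + nop */
        }
        *out_op_axes_nop = op_axes + nop;   /* safe: op_axes is non-NULL here */
        return op_axes;
    }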

@@ -3379,6 +3378,40 @@ def test_partial_iteration_error(in_dtype, buf_dtype):
assert count == sys.getrefcount(value)


def test_arbitrary_number_of_ops():
# 2*16 + 1 is still just a few kiB, so should be fast an easy to deal with
Contributor

and easy



def test_arbitrary_number_of_ops_nested():
# 2*16 + 1 is still just a few kiB, so should be fast an easy to deal with
Contributor

same



@pytest.mark.slow
@requires_memory(9 * np.iinfo(np.intc).max)
Contributor

shouldn't this be 8? Or is there a byte for write flags or so? (not that it matters!)

Member Author

I thought I'd just make it a bit more (of course a small addition should be enough, and it probably doesn't matter in practice).

(mostly, some are slightly less clear if they are a big improvement)
@seberg
Member Author

seberg commented Jan 2, 2025

p.s. When merged, we can use it to remove some annoying work-arounds in stride_tricks.py

Agreed, that would be nice. May also introduce fast-call (already looked at it), although even there nditer always has a few other Python calls around it, so it may not add much.

Contributor

@mhvk mhvk left a comment

OK, looks good, though with a small further comment. More importantly, I think on this second go I spotted a memory leak; do check!

@@ -98,6 +98,11 @@ _npy_init_workspace(
TYPE *NAME; \
_npy_init_workspace((void **)&NAME, NAME##_static, (fixed_size), sizeof(TYPE), (size))

/* Same as above, but additionally nulls the workspace */
Contributor

This is not a bad idea, but it is not used right now, and _npy_cinit_workspace is not defined either, so it cannot work!

Member Author

Oops, sorry, I was briefly considering it and then decided that it should be a follow-up.

/* The read/write settings */
NpyIter_GetReadFlags(iter, self->readflags);
if (self->writeflags == NULL) {
self->writeflags = PyMem_Malloc(sizeof(char) * NpyIter_GetNOp(iter));
Contributor

This should be deallocated somewhere, no? Presumably in npyiter_dealloc?

Member Author

ouch, yes thanks, this is a serious thing!
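
A hedged sketch of the fix being discussed (the struct and function names here are illustrative, not NumPy's actual ones): whatever is handed out by PyMem_Malloc during setup needs a matching PyMem_Free in the object's dealloc.

    #include <Python.h>

    typedef struct {
        PyObject_HEAD
        char *readflags;    /* PyMem_Malloc'd per-operand flag arrays */
        char *writeflags;
    } iter_wrapper_sketch;

    static void
    iter_wrapper_dealloc_sketch(iter_wrapper_sketch *self)
    {
        /* PyMem_Free(NULL) is a no-op, so freeing unconditionally is fine
         * even if setup failed before the allocations happened. */
        PyMem_Free(self->writeflags);
        PyMem_Free(self->readflags);
        Py_TYPE(self)->tp_free((PyObject *)self);
    }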

memset(op, 0, sizeof(PyObject *) * 2 * nop);
PyArray_Descr **op_request_dtypes = (PyArray_Descr **)(op + nop);

/* And other workspaces (that do not need to clean up their content) */
Contributor

Happy to keep as is!

PyArray_Descr **op_request_dtypes = (PyArray_Descr **)(op + nop);
PyArray_Descr **op_request_dtypes_inner = op_request_dtypes + nop;

/* And other workspaces (that do not need to clean up their content) */
Contributor

And again fine to ignore!

Contributor

@mhvk mhvk left a comment

OK, looks all great now!

@mhvk
Contributor

mhvk commented Jan 2, 2025

Best to wait for tests to complete, of course - and probably best to squash-merge.

if (PyTuple_Check(op_in) || PyList_Check(op_in)) {
nop = PySequence_Size(op_in);
if (nop == 0) {
PyObject *seq = PySequence_Fast(op_in, "failed accessing item list");
Member

This makes me a bit nervous from the perspective of free-threaded Python. Is op_in ever exposed to Python?

Member Author

There are many other places where we use it in more dangerous ways (e.g. array coercion) and for good reason (both speed and code simplicity).
I feel that needs to be fixed as a general pattern rather than by making the code less convenient here?

But if @ngoldbaum has a better suggestion, I am happy with that.

Member

You can hold a critical section on op_in. You only need to do that if the operand is mutable (e.g. not a tuple), but also it might be annoying and require some refactoring to conditionally enter a critical section because the macros have brackets.
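
A minimal sketch of that idea (assuming CPython 3.13+, where Py_BEGIN_CRITICAL_SECTION/Py_END_CRITICAL_SECTION are public and are effectively no-ops on non-free-threaded builds). The helper name and the snapshot-into-strong-references behavior are assumptions for illustration; note that the macros open a block, so nothing may return or jump out of the locked region:

    #include <Python.h>

    /* Take strong references to the items of op_in (assumed to already be a
     * list or tuple) while holding the per-object lock, so another thread
     * cannot swap items out from under the borrowed pointers. */
    static Py_ssize_t
    snapshot_ops(PyObject *op_in, PyObject **out, Py_ssize_t max_out)
    {
        Py_ssize_t nop;
        Py_BEGIN_CRITICAL_SECTION(op_in);
        nop = PySequence_Fast_GET_SIZE(op_in);
        if (nop > max_out) {
            nop = -1;   /* signal "too many" without returning inside the lock */
        }
        else {
            for (Py_ssize_t i = 0; i < nop; i++) {
                out[i] = PySequence_Fast_GET_ITEM(op_in, i);  /* borrowed */
                Py_INCREF(out[i]);                            /* now owned */
            }
        }
        Py_END_CRITICAL_SECTION();
        return nop;
    }

The caller owns the references placed in out and has to release them once the operands have been converted.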

Member

Also, is it really so performance-critical to avoid the strong references? If it is really about code ergonomics more than performance, what if we defined e.g. a PySequence_Items that returns a heap-allocated C array of strong references to the items in the sequence, which the caller is responsible for freeing?

Member Author

That was my reasoning for asking whether CPython shouldn't just guarantee it or have a convenience around it. If we make a heap allocation, a solution might as well be to write tmp = PyList_AsTuple(obj); Py_SETREF(tmp, PySequence_Fast(tmp)) (plus error checks).
Of course you can also re-organize the code for a critical section...
(Right now we avoid most heap allocations, but just because it isn't any more convenient to not avoid them. For all but the most trivial iterations that likely doesn't matter, and if we care there are bigger fish such as tp_vectorcall first.)

A PySequence_Fast_immutable() might be a convenience method then (also potentially better for other implementations, which PySequence_Fast is an important reason for).

Anyway, I need to think about the right pattern, preferably one that also works well for other places in NumPy (although that may just be 1-2 places, since much of the use seems internal and array coercion is special enough).
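
A sketch of the convert-to-a-tuple option mentioned above, spelled with PySequence_Tuple rather than the PyList_AsTuple/Py_SETREF combination (the helper name is illustrative): the resulting tuple is a private, immutable snapshot, so its item array can then be read without locking, at the cost of one allocation.

    #include <Python.h>

    /* Returns the number of operands, or -1 with an exception set.
     * *out_tuple is a new reference the caller must release; *out_items is
     * borrowed from it and stays valid while the tuple lives. */
    static Py_ssize_t
    snapshot_as_tuple(PyObject *op_in, PyObject **out_tuple, PyObject ***out_items)
    {
        PyObject *tup = PySequence_Tuple(op_in);   /* immutable snapshot */
        if (tup == NULL) {
            return -1;
        }
        *out_tuple = tup;
        *out_items = PySequence_Fast_ITEMS(tup);
        return PyTuple_GET_SIZE(tup);
    }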

Member

CPython has Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST, which isn't public but handles checking for a mutable vs. immutable sequence and only activates the critical section in the mutable case. Maybe we could ask for that to be made public?

Member Author

Yeah, I dunno how slow a critical section is (i.e. does it matter?). For array creation, where this is important since huge, simple lists may happen, we probably don't get many tuples in practice.

For this case, and especially the similar PyArray_IntpFromIndexSequence (which already does one heap allocation, though), Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST seems useful, but it may also be simple enough to write ourselves (i.e. a PyTuple_CheckExact?).

For the case here, the heap-allocation may just be the simplest path (or using GetItem since we don't expect a very long list/tuple).


OTOH, maybe Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST is so simple that it is best to just make it public, since it may also be a good way to document the solutions:

  • Use Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST
  • Convert to tuple (so it is immutable)
  • Stick to the functional API (so tp_getitem has fine-grained locking if necessary); a sketch of this option follows below.
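
A sketch of the functional-API option from the last bullet (an illustrative helper, not NumPy code): PySequence_GetItem hands back new references through the sequence type's own item access, so any fine-grained locking it does is respected; fine when the operand list is short and per-call overhead does not matter.

    #include <Python.h>

    /* Visit each operand via the abstract sequence protocol; returns the
     * number of items visited, or -1 on error. */
    static Py_ssize_t
    for_each_op(PyObject *op_in, int (*visit)(PyObject *op, void *ctx), void *ctx)
    {
        Py_ssize_t nop = PySequence_Size(op_in);
        if (nop < 0) {
            return -1;
        }
        for (Py_ssize_t i = 0; i < nop; i++) {
            PyObject *item = PySequence_GetItem(op_in, i);  /* new reference */
            if (item == NULL) {
                return -1;
            }
            int status = visit(item, ctx);
            Py_DECREF(item);
            if (status < 0) {
                return -1;
            }
        }
        return nop;
    }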

Member

I dunno how slow a critical section is (i.e. does it matter?)

It's pretty fast if it's uncontended. If there's contention it slows down a lot and as we saw elsewhere it doesn't scale all that well.

but maybe also simple enough to write ourselves

Unfortunately we can't without using the private _PyCriticalSection_Begin functions that implement the macros with brackets. They very purposefully made it so the public symbols are macros with brackets, to force users to deal with the control flow issues. I doubt any of these APIs will change long-term so it could also be a place where we include CPython internals and deal with that.

That said, this one particular case really does seem like a clear candidate for CPython to make public, since it's specifically for use with PySequence_Fast and we have real-world examples in NumPy where it would simplify things a lot.

I'll chat about this with our team internally and bring it up with upstream if there aren't any showstoppers.

Member

@mattip mattip left a comment

Does this affect benchmarks for small iterators? There is now additional overhead of memory management, but maybe it is too small to notice?

@seberg
Member Author

seberg commented Jan 6, 2025

Does this affect benchmarks for small iterators?

Well, the nditer setup doesn't change; these are all fast-pathed after all, so there isn't even the 10% slowdown (or whatever it would be) for the allocations.
For simple sums (one of the main "simple iterations" not using a lot of Python), I can see no difference.

For complicated iterators things seem just random to me, which I suppose is the way of these timings...

@mhvk
Contributor

mhvk commented Jan 6, 2025

@mattip - the code does still have a static allocation for small nditers - the overhead only kicks in if there are more arguments than are accounted for by that (but now the number of arguments is no longer limited!).
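
A generic illustration of that pattern (plain C, not the actual NPY_ALLOC_WORKSPACE macro; the fixed size of 8 and the function itself are made up for the example): a small fixed buffer serves the common case, and only unusually large operand counts pay for a heap allocation.

    #include <stdlib.h>

    #define FIXED_WORKSPACE 8   /* illustrative; the real fixed sizes differ */

    static int
    sum_with_workspace(const int *per_op_sizes, int nop)
    {
        int fixed_buf[FIXED_WORKSPACE];
        int *work = fixed_buf;              /* common case: no allocation */
        if (nop > FIXED_WORKSPACE) {
            work = malloc((size_t)nop * sizeof(int));  /* rare, large case */
            if (work == NULL) {
                return -1;
            }
        }
        int total = 0;
        for (int i = 0; i < nop; i++) {
            work[i] = per_op_sizes[i];      /* stand-in for per-operand scratch work */
            total += work[i];
        }
        if (work != fixed_buf) {
            free(work);                     /* heap path only */
        }
        return total;
    }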

@mattip mattip merged commit 11e7844 into numpy:main Jan 6, 2025
67 checks passed
@mattip
Member

mattip commented Jan 6, 2025

Thanks @seberg

@seberg seberg deleted the nditer-avoid-maxargs-fully branch January 6, 2025 17:51