ENH: Allow an arbitrary number of operands in nditer #28080


Merged
merged 5 commits into numpy:main from nditer-avoid-maxargs-fully on Jan 6, 2025

Conversation

seberg
Member

@seberg seberg commented Dec 31, 2024

Internally, the change is very moderate, mainly requiring the use of int in the struct (growing it a bit) and otherwise a temporary work-array in the buffer setup function (one could probably use scratch space on the nditer, but this seemed OK to me).

The big and somewhat annoying change is that the Python wrapper needs a lot of per-operand space, so setting it up is a bit tedious.

EDIT: Note to self, once merged inform numexpr (for numba, ufuncs may be very interesting, but that needs a decent amount of further changes in ufuncs).

@seberg seberg force-pushed the nditer-avoid-maxargs-fully branch from 9e23d1d to 3f5818a Compare December 31, 2024 12:32
@seberg seberg force-pushed the nditer-avoid-maxargs-fully branch from 3f5818a to b592672 Compare December 31, 2024 12:45
npyiter_convert_ops(PyObject *op_in, PyObject *op_flags_in,
PyArrayObject **op, npy_uint32 *op_flags,
int *nop_out)
npyiter_prepare_ops(PyObject *op_in, PyObject **out_owner, PyObject ***out_objs)
Member Author

I split this, effectively, because I need nop: one helper to fetch an array of operands (with PySequence_Fast_ITEMS), and then the original function to convert them to arrays.

(I didn't change the fact that functions with "convert" in their name return 1 for success...)
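
A rough sketch of what such a prepare-style helper could look like (illustrative only, not the PR's exact code; the name, error message, and the omitted overflow check are assumptions). It returns the operand count, exposes the items as a borrowed array via PySequence_Fast_ITEMS, and uses the owner slot itself as the one-element array in the single-operand case:

    #include <Python.h>

    /* Return the number of operands, or -1 on error.  *out_objs points at a
     * borrowed array of operand objects that stays valid while *out_owner
     * (a new reference) is alive. */
    static int
    prepare_ops_sketch(PyObject *op_in, PyObject **out_owner, PyObject ***out_objs)
    {
        if (PyTuple_Check(op_in) || PyList_Check(op_in)) {
            PyObject *seq = PySequence_Fast(op_in, "failed accessing item list");
            if (seq == NULL) {
                return -1;
            }
            Py_ssize_t nop = PySequence_Fast_GET_SIZE(seq);
            if (nop == 0) {
                PyErr_SetString(PyExc_ValueError, "need at least one operand");
                Py_DECREF(seq);
                return -1;
            }
            *out_owner = seq;                        /* owns the item array */
            *out_objs = PySequence_Fast_ITEMS(seq);  /* borrowed items */
            return (int)nop;
        }
        /* Single operand: the owner slot doubles as a length-1 object array
         * (assigning op_in itself would not give us a PyObject ** array). */
        Py_INCREF(op_in);
        *out_owner = op_in;
        *out_objs = out_owner;
        return 1;
    }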

"Iterator operand %zd is write-only", i);
return NULL;
}
#endif
Member Author

In a sense a nice comment, but didn't seem very useful here...

npy_uint8 ndim, nop;
npy_int8 maskop;
npy_uint8 ndim;
int nop, maskop;
Member Author

So I think, effectively, this means we need 64 bits more space here (on typical systems). A bit unfortunate, but even if we limited it to int16, we would lose 32 bits here anyway.

The check is unnecessary here (it cannot fail), but it seemed clearer not to rely on that knowledge, since whether it holds depends on where we call it.
Contributor

@mhvk mhvk left a comment

This is really nice! It also makes the code clearer and cleaner, which is a nice bonus. My comments are all at the extremely nitpicky end.

p.s. When merged, we can use it to remove some annoying work-arounds in stride_tricks.py


/*
* Converts the operand array and op_flags array into the form
* NpyIter_AdvancedNew needs. Sets nop, and on success, each
Contributor

This does not set nop. Just start the sentence with "On success, ..."?

Py_INCREF(op_in);
op[0] = (PyArrayObject *)op_in;
*out_owner = op_in;
*out_objs = out_owner;
Contributor

Just set to op_in, and swap order, so it is the same as in the stanza just above.

Member Author

Will swap; setting to op_in doesn't work, but I'll add a comment.

PyObject *op_in_owned; /* Sequence/object owning op_objs. */
int nop = npyiter_prepare_ops(op_in, &op_in_owned, &op_objs);
if (nop < 0) {
npy_free_cache_dim_obj(itershape);
Contributor

Maybe a goto early_failure (or pre_alloc_fail) that goes to the end of finish? Would require initializing op_in_owned=NULL (res = -1 is already OK), so one could have,

...
    npy_free_workspace(op_axes);
early_fail:
    Py_XDECREF(op_in_owned);
    npy_free_cache_dim_obj(itershape);
    return res;
}

Could be used above and below.

Member Author

Yeah, it is only used twice, but that is more than once, so I added pre_alloc_fail:
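
A minimal generic illustration of the resulting two-label cleanup ordering (plain C, not the NumPy code; the names and the stand-in resources are assumptions): failures before the workspaces are allocated jump to pre_alloc_fail and skip the workspace frees, while later failures go through finish first and fall through into the same early-resource cleanup.

    #include <stdlib.h>
    #include <string.h>

    static int
    two_stage_cleanup_example(int nop)
    {
        int res = -1;
        char *early_resource = malloc(16);   /* stands in for op_in_owned/itershape */
        if (early_resource == NULL) {
            return -1;
        }
        if (nop <= 0) {                      /* validation before any workspace */
            goto pre_alloc_fail;
        }
        int *workspace = malloc((size_t)nop * sizeof(int));
        if (workspace == NULL) {
            goto pre_alloc_fail;             /* no workspace to free yet */
        }
        memset(workspace, 0, (size_t)nop * sizeof(int));
        if (workspace[0] != 0) {             /* stand-in for a later failure */
            goto finish;
        }
        res = 0;
    finish:
        free(workspace);                     /* workspaces first */
    pre_alloc_fail:
        free(early_resource);                /* then the early resources */
        return res;
    }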

NPY_ALLOC_WORKSPACE(op_axes, int *, 8, nop);
/*
* Trying to allocate should be OK if one failed, check for error now
* that we can use `goto cleanup` to clean up everything.
Contributor

goto finish

PyArray_Descr **op_request_dtypes = (PyArray_Descr **)(op + nop);
PyArray_Descr **op_request_dtypes_inner = op_request_dtypes + nop;

/* And other workspaces (that do not need to clean up their content) */
Contributor

Same comment as above - might as well move these above op and write finish so it can handle op == NULL

Contributor

And again fine to ignore!

NPY_ALLOC_WORKSPACE(op_flags_inner, npy_uint32, 8, nop);
NPY_ALLOC_WORKSPACE(op_axes_storage, int, 8 * NPY_MAXDIMS, nop * NPY_MAXDIMS);
NPY_ALLOC_WORKSPACE(op_axes, int *, 2 * 8, 2 * nop);
int **op_axes_nop = op_axes + nop;
Contributor

I'd do this after the sanity check (I know it makes no difference, NULL just being 0, but somehow feels better...)

Member Author

Good point. Pretty sure null pointer arithmetic is also undefined (dunno if it happens, but technically it is allowed to trip, probably).
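
A tiny illustration of the reordering (generic C with malloc, not the actual workspace macro; the function name follows the diff above but is made up): derive the second pointer only after the allocation check, since pointer arithmetic on NULL is technically undefined behavior.

    #include <stdlib.h>

    /* Allocate 2*nop pointers and split the block into two halves; returns
     * the base pointer or NULL.  op_axes + nop is only computed once op_axes
     * is known to be non-NULL. */
    static int **
    split_axes_workspace(int nop, int ***out_op_axes_nop)
    {
        int **op_axes = malloc(2 * (size_t)nop * sizeof(int *));
        if (op_axes == NULL) {
            return NULL;                    /* never compute NULL + nop */
        }
        *out_op_axes_nop = op_axes + nop;   /* safe: op_axes is non-NULL here */
        return op_axes;
    }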

@@ -3379,6 +3378,40 @@ def test_partial_iteration_error(in_dtype, buf_dtype):
assert count == sys.getrefcount(value)


def test_arbitrary_number_of_ops():
# 2*16 + 1 is still just a few kiB, so should be fast an easy to deal with
Contributor

and easy



def test_arbitrary_number_of_ops_nested():
# 2*16 + 1 is still just a few kiB, so should be fast an easy to deal with
Contributor

same



@pytest.mark.slow
@requires_memory(9 * np.iinfo(np.intc).max)
Contributor

shouldn't this be 8? Or is there a byte for write flags or so? (not that it matters!)

Member Author

I thought I'd just make it a bit more (of course a small addition should be enough, and it probably doesn't matter in practice).

(mostly, some are slightly less clear if they are a big improvement)
@seberg
Member Author

seberg commented Jan 2, 2025

p.s. When merged, we can use it to remove some annoying work-arounds in stride_tricks.py

Agreed, that would be nice. May also introduce fast-call (already looked at it), although even there nditer always has a few other Python calls around it, so it may not add much.

Contributor

@mhvk mhvk left a comment

OK, looks good, though with a small further comment. More importantly, I think on this second go I spotted a memory leak; do check!

@@ -98,6 +98,11 @@ _npy_init_workspace(
TYPE *NAME; \
_npy_init_workspace((void **)&NAME, NAME##_static, (fixed_size), sizeof(TYPE), (size))

/* Same as above, but additionally nulls the workspace */
Contributor

This is not a bad idea, but it is not used right now, and _npy_cinit_workspace is not defined either, so it cannot work!

Member Author

Oops, sorry, I was briefly considering it and then decided that it should be a follow-up.

/* The read/write settings */
NpyIter_GetReadFlags(iter, self->readflags);
if (self->writeflags == NULL) {
self->writeflags = PyMem_Malloc(sizeof(char) * NpyIter_GetNOp(iter));
Contributor

This should be deallocated somewhere, no? Presumably in npyiter_dealloc?

Member Author

ouch, yes thanks, this is a serious thing!
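
A hedged sketch of the fix being discussed (the struct and function names here are illustrative, not NumPy's actual ones): whatever is handed out by PyMem_Malloc during setup needs a matching PyMem_Free in the object's dealloc.

    #include <Python.h>

    typedef struct {
        PyObject_HEAD
        char *readflags;    /* PyMem_Malloc'd per-operand flag arrays */
        char *writeflags;
    } iter_wrapper_sketch;

    static void
    iter_wrapper_dealloc_sketch(iter_wrapper_sketch *self)
    {
        /* PyMem_Free(NULL) is a no-op, so freeing unconditionally is fine
         * even if setup failed before the allocations happened. */
        PyMem_Free(self->writeflags);
        PyMem_Free(self->readflags);
        Py_TYPE(self)->tp_free((PyObject *)self);
    }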

memset(op, 0, sizeof(PyObject *) * 2 * nop);
PyArray_Descr **op_request_dtypes = (PyArray_Descr **)(op + nop);

/* And other workspaces (that do not need to clean up their content) */
Contributor

Happy to keep as is!

PyArray_Descr **op_request_dtypes = (PyArray_Descr **)(op + nop);
PyArray_Descr **op_request_dtypes_inner = op_request_dtypes + nop;

/* And other workspaces (that do not need to clean up their content) */
Contributor

And again fine to ignore!

Contributor

@mhvk mhvk left a comment

OK, looks all great now!

@mhvk
Contributor

mhvk commented Jan 2, 2025

Best to wait for tests to complete, of course - and probably best to squash-merge.

if (PyTuple_Check(op_in) || PyList_Check(op_in)) {
nop = PySequence_Size(op_in);
if (nop == 0) {
PyObject *seq = PySequence_Fast(op_in, "failed accessing item list");
Member

This makes me a bit nervous from the perspective of free-threaded Python. Is op_in ever exposed to Python?

Member Author

There are many other places where we use it in more dangerous ways (e.g. array coercion) and for good reason (both speed and code simplicity).
I feel that needs to be fixed as a general pattern rather than by making the code less convenient here?

But if @ngoldbaum has a better suggestion, I am happy with that.

Member

You can hold a critical section on op_in. You only need to do that if the operand is mutable (e.g. not a tuple), but also it might be annoying and require some refactoring to conditionally enter a critical section because the macros have brackets.
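
A minimal sketch of that idea (assuming CPython 3.13+, where Py_BEGIN_CRITICAL_SECTION/Py_END_CRITICAL_SECTION are public and are effectively no-ops on non-free-threaded builds). The helper name and the snapshot-into-strong-references behavior are assumptions for illustration; note that the macros open a block, so nothing may return or jump out of the locked region:

    #include <Python.h>

    /* Take strong references to the items of op_in (assumed to already be a
     * list or tuple) while holding the per-object lock, so another thread
     * cannot swap items out from under the borrowed pointers. */
    static Py_ssize_t
    snapshot_ops(PyObject *op_in, PyObject **out, Py_ssize_t max_out)
    {
        Py_ssize_t nop;
        Py_BEGIN_CRITICAL_SECTION(op_in);
        nop = PySequence_Fast_GET_SIZE(op_in);
        if (nop > max_out) {
            nop = -1;   /* signal "too many" without returning inside the lock */
        }
        else {
            for (Py_ssize_t i = 0; i < nop; i++) {
                out[i] = PySequence_Fast_GET_ITEM(op_in, i);  /* borrowed */
                Py_INCREF(out[i]);                            /* now owned */
            }
        }
        Py_END_CRITICAL_SECTION();
        return nop;
    }

The caller owns the references placed in out and has to release them once the operands have been converted.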

Member

Also, is it really so performance-critical to avoid the strong references? If it is really about code ergonomics more than performance, what if we defined e.g. a PySequence_Items that returns a heap-allocated C array of strong references to the items in the sequence, which the caller is responsible for freeing?

Member Author

That was my reasoning for asking whether CPython shouldn't just guarantee it or have a convenience around it. If we make a heap allocation, a solution might as well be to write tmp = PyList_AsTuple(obj); Py_SETREF(tmp, PySequence_Fast(tmp)) (plus error checks).
Of course you can also re-organize the code for a critical section...
(Right now we avoid most heap allocations, but just because it isn't any more convenient to not avoid them. For all but the most trivial iterations that likely doesn't matter, and if we care there are bigger fish such as tp_vectorcall first.)

A PySequence_Fast_immutable() might be a convenience method then (also potentially better for other implementations, which PySequence_Fast is an important reason for).

Anyway, I need to think about the right pattern, preferably one that also works well for other places in NumPy (although that may just be 1-2 places, since much of the use seems internal and array coercion is special enough).
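
A sketch of the convert-to-a-tuple option mentioned above, spelled with PySequence_Tuple rather than the PyList_AsTuple/Py_SETREF combination (the helper name is illustrative): the resulting tuple is a private, immutable snapshot, so its item array can then be read without locking, at the cost of one allocation.

    #include <Python.h>

    /* Returns the number of operands, or -1 with an exception set.
     * *out_tuple is a new reference the caller must release; *out_items is
     * borrowed from it and stays valid while the tuple lives. */
    static Py_ssize_t
    snapshot_as_tuple(PyObject *op_in, PyObject **out_tuple, PyObject ***out_items)
    {
        PyObject *tup = PySequence_Tuple(op_in);   /* immutable snapshot */
        if (tup == NULL) {
            return -1;
        }
        *out_tuple = tup;
        *out_items = PySequence_Fast_ITEMS(tup);
        return PyTuple_GET_SIZE(tup);
    }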

Member

CPython has Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST, which isn't public but handles checking for a mutable vs. immutable sequence and only activates the critical section in the mutable case. Maybe we could ask for that to be made public?

Member Author

Yeah, I dunno how slow a critical section is (i.e. does it matter?). For array creation, where this is important since huge, simple lists may happen, we probably don't get many tuples in practice.

For this case, and especially the similar PyArray_IntpFromIndexSequence (which already does one heap allocation, though), Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST seems useful, but it may also be simple enough to write ourselves (i.e. a PyTuple_CheckExact?).

For the case here, the heap-allocation may just be the simplest path (or using GetItem since we don't expect a very long list/tuple).


OTOH, maybe Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST is so simple that it is best to just make it public, since it may also be a good way to document the solutions:

  • Use Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST
  • Convert to tuple (so it is immutable)
  • Stick to the functional API (so tp_getitem has fine-grained locking if necessary); a sketch of this option follows below.
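
A sketch of the functional-API option from the last bullet (an illustrative helper, not NumPy code): PySequence_GetItem hands back new references through the sequence type's own item access, so any fine-grained locking it does is respected; fine when the operand list is short and per-call overhead does not matter.

    #include <Python.h>

    /* Visit each operand via the abstract sequence protocol; returns the
     * number of items visited, or -1 on error. */
    static Py_ssize_t
    for_each_op(PyObject *op_in, int (*visit)(PyObject *op, void *ctx), void *ctx)
    {
        Py_ssize_t nop = PySequence_Size(op_in);
        if (nop < 0) {
            return -1;
        }
        for (Py_ssize_t i = 0; i < nop; i++) {
            PyObject *item = PySequence_GetItem(op_in, i);  /* new reference */
            if (item == NULL) {
                return -1;
            }
            int status = visit(item, ctx);
            Py_DECREF(item);
            if (status < 0) {
                return -1;
            }
        }
        return nop;
    }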

Member

I dunno how slow a critical section is (i.e. does it matter?)

It's pretty fast if it's uncontended. If there's contention it slows down a lot and as we saw elsewhere it doesn't scale all that well.

but maybe also simple enough to write ourselves

Unfortunately we can't without using the private _PyCriticalSection_Begin functions that implement the macros with brackets. They very purposefully made it so the public symbols are macros with brackets, to force users to deal with the control flow issues. I doubt any of these APIs will change long-term so it could also be a place where we include CPython internals and deal with that.

That said, this one particular case really does seem like a clear candidate for CPython to make public, since it's specifically for use with PySequence_Fast and we have real-world examples in NumPy where it would simplify things a lot.

I'll chat about this with our team internally and bring it up with upstream if there aren't any showstoppers.

Member

@mattip mattip left a comment

Does this affect benchmarks for small iterators? There is now additional overhead of memory management, but maybe it is too small to notice?

@seberg
Member Author

seberg commented Jan 6, 2025

Does this affect benchmarks for small iterators?

Well, the nditer setup doesn't change; these are all fast-pathed after all, so there isn't even the 10% slowdown (or whatever it would be) for the allocations.
For simple sums (one of the main "simple iterations" not using a lot of Python), I can see no difference.

For complicated iterators things seem just random to me, which I suppose is the way of these timings...

@mhvk
Contributor

mhvk commented Jan 6, 2025

@mattip - the code does still have a static allocation for small nditers - the overhead only kicks in if there are more arguments than are accounted for by that (but now the number of arguments is no longer limited!).
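
A generic illustration of that pattern (plain C, not the actual NPY_ALLOC_WORKSPACE macro; the fixed size of 8 and the function itself are made up for the example): a small fixed buffer serves the common case, and only unusually large operand counts pay for a heap allocation.

    #include <stdlib.h>

    #define FIXED_WORKSPACE 8   /* illustrative; the real fixed sizes differ */

    static int
    sum_with_workspace(const int *per_op_sizes, int nop)
    {
        int fixed_buf[FIXED_WORKSPACE];
        int *work = fixed_buf;              /* common case: no allocation */
        if (nop > FIXED_WORKSPACE) {
            work = malloc((size_t)nop * sizeof(int));  /* rare, large case */
            if (work == NULL) {
                return -1;
            }
        }
        int total = 0;
        for (int i = 0; i < nop; i++) {
            work[i] = per_op_sizes[i];      /* stand-in for per-operand scratch work */
            total += work[i];
        }
        if (work != fixed_buf) {
            free(work);                     /* heap path only */
        }
        return total;
    }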

@mattip mattip merged commit 11e7844 into numpy:main Jan 6, 2025
67 checks passed
@mattip
Member

mattip commented Jan 6, 2025

Thanks @seberg

@seberg seberg deleted the nditer-avoid-maxargs-fully branch January 6, 2025 17:51