ENH: Add ndmax parameter to np.array to control recursion depth #29569


Merged: 10 commits merged into numpy:main on Aug 16, 2025

Conversation

kibitzing (Contributor)

Implementation of #29499

Description

This PR introduces a new parameter, ndmax, to the np.array function. This parameter allows users to explicitly limit the maximum number of dimensions that NumPy will create when converting nested sequences (like lists of lists) into an array.

To ensure backward compatibility, the default behavior remains unchanged. When ndmax is not specified, the function falls back to its current heuristic. This means it will create as many dimensions as possible from nested sequences, up to the compile-time limit defined by NPY_MAXDIMS.


Motivation and Context

As discussed in issue #29499, the current behavior of np.array can lead to unexpected results when dtype=object is specified. For example:

  • A list of lists with non-uniform lengths correctly creates a 1D array of list objects.
  • A list of lists with uniform lengths is implicitly promoted to a 2D ndarray, even with dtype=object.
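The contrast above can be reproduced directly on current NumPy (the values here are chosen for illustration):

```python
import numpy as np

# Non-uniform inner lengths: no rectangular shape is possible,
# so NumPy builds a 1-D array of list objects.
ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)
print(ragged.shape)     # (2,)
print(type(ragged[0]))  # <class 'list'>

# Uniform inner lengths: the heuristic promotes to 2-D,
# even though dtype=object was requested.
uniform = np.array([[1, 2, 3], [4, 5, 6]], dtype=object)
print(uniform.shape)     # (2, 3)
print(type(uniform[0]))  # <class 'numpy.ndarray'>
```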

This inconsistency arises because, as @rkern explained, NumPy has a strong heuristic that assumes uniformly nested sequences are intended to be multi-dimensional arrays. While this is often the desired outcome, it overrides the user's explicit intent in cases where a 1D array of list objects is needed, regardless of their shape. This can cause subtle bugs, particularly in data processing pipelines for machine learning.

The existing workarounds, such as pre-allocating with np.empty and then assigning or using np.fromiter, are effective but less intuitive than a direct parameter. The addition of ndmax provides a clear and explicit way to control this behavior directly within the most commonly used array creation function.
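For reference, the two workarounds mentioned above look like this (a sketch; variable names are illustrative):

```python
import numpy as np

data = [[1, 2, 3], [4, 5, 6]]

# Workaround 1: pre-allocate an object array, then assign into it.
a = np.empty(len(data), dtype=object)
a[:] = data
print(a.shape)     # (2,)
print(type(a[0]))  # <class 'list'>

# Workaround 2: np.fromiter with dtype=object (supported since NumPy 1.23).
b = np.fromiter(data, dtype=object)
print(b.shape)     # (2,)
print(type(b[0]))  # <class 'list'>
```

Both produce the 1-D array of lists that `ndmax=1` requests directly.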


How to use ndmax

The ndmax parameter provides fine-grained control over the array creation process.

  • Current Behavior (Unexpected Promotion):

    ```python
    # Even with dtype=object, this creates a 2D NumPy array
    b = np.array([[1, 2, 3], [4, 5, 6]], dtype=object)
    print(b.shape)      # (2, 3)
    print(type(b[0]))   # <class 'numpy.ndarray'>
    ```

  • New Behavior with ndmax=1 (Desired Outcome):

    ```python
    # With ndmax=1, NumPy stops at 1 dimension, preserving inner lists
    b_new = np.array([[1, 2, 3], [4, 5, 6]], dtype=object, ndmax=1)
    print(b_new.shape)      # (2,)
    print(type(b_new[0]))   # <class 'list'>
    ```

This aligns the behavior for uniform lists with that of non-uniform lists when the user's intent is to create an array of objects.


Implementation Details

The implementation was made straightforward by leveraging the pre-existing max_depth parameter in the core C-API function, PyArray_FromAny_int. The new Python-level ndmax argument is passed directly to this internal parameter.

This approach ensures that the new feature is built upon NumPy's established and tested array creation logic, minimizing risk and maintaining internal consistency.

@mhvk (Contributor) left a comment:
@kibitzing - thanks for the PR. I think this is rather nice, especially that you found all it really takes is exposing existing functionality!

However, to me at least there is one annoyance: it would seem to me that there is no reason not to support np.array(..., ndmax=0) as meaning that the user wants an array with ndim=0 (i.e., a scalar array).

Unfortunately, this clashes with the numpy C API, which has PyArray_FromAny and friends interpret max_depth=0 as arbitrary (it is actually not documented as such, but since it is used like that throughout the code base, it surely is used outside of numpy as well).

But I think we can make it work by adjusting PyArray_FromAny to change 0 to NPY_MAXDIMS and then have PyArray_FromAny_int not do any checks. See in-line comments.

What do you think?
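For context, the 0-D result under discussion can already be constructed today by pre-allocating. This sketch shows the array that `np.array(..., ndmax=0)` would be expected to return under the proposed semantics (the `ndmax=0` spelling itself is what this thread is deciding):

```python
import numpy as np

# A 0-D object array whose single element is the whole list --
# the result ndmax=0 would request directly.
a = np.empty((), dtype=object)
a[()] = [1, 2, 3]
print(a.ndim)       # 0
print(a.shape)      # ()
print(type(a[()]))  # <class 'list'>
```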

```diff
@@ -1747,8 +1748,19 @@ array_array(PyObject *NPY_UNUSED(ignored),
     op = args[0];
 }

 if (ndmax > NPY_MAXDIMS || ndmax <= 0) {
```
Contributor:
So here I'd also check ndmax < 0 rather than <= 0.

Contributor Author:
Thank you for the suggestion, I've changed as you suggested.

```diff
@@ -1545,6 +1545,10 @@ PyArray_FromAny_int(PyObject *op, PyArray_Descr *in_descr,
     int ndim = 0;
     npy_intp dims[NPY_MAXDIMS];

     if (max_depth == 0 || max_depth > NPY_MAXDIMS) {
```
Contributor:
To support actually using max_depth = 0 as a real option, the check here would need to be removed.

But that means adjusting code that calls it. The main user is PyArray_FromAny (in ctors.c as well), which would then need this if statement (indeed, perhaps the whole statement should be moved there; it is not unreasonable to assume that for internal calls the values are known to be good).

One other user is PyArray_CheckFromAny_int (also in ctors.c), though logically there also perhaps PyArray_CheckFromAny is the one that should be adjusted.

The final user seems to be scalartypes.c.src -- there one could just replace the 0 with NPY_MAXDIMS.

Contributor Author:

Thank you for the clear guidance.

As you suggested, I've refactored the logic so PyArray_FromAny_int can treat max_depth=0 as a request for a 0-D array.
The responsibility for handling the "no limit" case (by converting 0 to NPY_MAXDIMS) has been moved to its callers:

  • The check was added to PyArray_FromAny and PyArray_CheckFromAny.
  • Call sites in scalartypes.c.src and array_converter_new were updated to use NPY_MAXDIMS explicitly.
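The resulting division of responsibility can be sketched as a pure-Python model (illustrative only; the real logic lives in the C functions named above, and the depth walk here ignores the uniformity checks the actual discovery performs):

```python
NPY_MAXDIMS = 64  # stand-in for the compile-time limit

def from_any(obj, max_depth=0):
    """Public entry point: 0 keeps its historical meaning of 'no limit'."""
    if max_depth == 0 or max_depth > NPY_MAXDIMS:
        max_depth = NPY_MAXDIMS
    return from_any_int(obj, max_depth)

def from_any_int(obj, max_depth):
    """Inner worker: takes max_depth literally, so 0 means a 0-D result."""
    ndim = 0
    while ndim < max_depth and isinstance(obj, (list, tuple)) and obj:
        obj = obj[0]
        ndim += 1
    return ndim  # stand-in for the discovered dimensionality

print(from_any([[1, 2], [3, 4]]))         # 2 (0 -> no limit at the outer layer)
print(from_any_int([[1, 2], [3, 4]], 0))  # 0 (literal 0-D request)
```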

@kibitzing (Contributor Author):

Hello @mhvk,
Thank you for the excellent feedback and for proposing a clear solution!

I also considered supporting ndmax=0 for scalar creation. I initially decided against it, partly because of the C-API's convention that you mentioned, and partly because I was cautious about modifying such a core piece of code.

However, the detailed guidance you've provided makes the path forward very clear, so I'm happy to get to work on implementing it.

@mhvk (Contributor) left a comment:

@kibitzing - this looks great. I was about to approve, but then I realized we should make sure this is actually documented. So, could you add a description in _add_newdocs.py (search for 'array' - l.806)? You'll need a `.. versionadded::` directive in there; see examples in that file.

Furthermore, personally, I feel this is worth a little what's-new entry - there are many who have run into this issue! (We've got some code in astropy that we can now simplify!) For this you need to add a fragment in doc/release/upcoming_changes (see the README.rst in that directory).

Since that means another CI run anyway, I also put two absolutely nitpicky comments in-line, which you might as well do too... (but feel free to disagree and ignore).

Let me also ping @mattip and @rkern, in case they want to have a look as well (though in the end this is so much a case of just exposing what was effectively in place already that I'm quite happy to just get it in myself too).

```diff
@@ -1747,8 +1748,19 @@ array_array(PyObject *NPY_UNUSED(ignored),
     op = args[0];
 }

-if (ndmax > NPY_MAXDIMS || ndmax < 0) {
+if (ndmax > NPY_MAXDIMS) {
+    PyErr_Format(PyExc_ValueError, "ndmax must be <= NPY_MAXDIMS (=%d)", NPY_MAXDIMS);
```
Contributor:
Again, I wouldn't ask if there wasn't another reason to push a further commit, but to have error path code interrupt the flow as little as possible, I'd tend to have just one PyErr_Format, with (e.g.) "must have 0 <= ndmax <= NPY_MAXDIMS (=%d)".

Contributor Author:

Thank you for the feedback!
Updated to use a single PyErr_Format with the combined bounds check message.


kibitzing commented Aug 16, 2025

Hello @mhvk,
Thanks for the thorough review! I've addressed all the points:

  • Added documentation and examples in _add_newdocs.py with the .. versionadded:: directive
  • Created a news fragment in doc/release/upcoming_changes for the what's new entry (happy to hear it will help simplify things for you!)
  • Applied the inline suggestions. Thanks for spotting those, they definitely improve the code quality

Please let me know if there's anything else you'd like me to adjust!

@mhvk (Contributor) left a comment:

Super, this now looks great! Let's get it in. Thanks very much for solving a long-standing annoyance in an elegant way!

@mhvk mhvk merged commit 4e00e4d into numpy:main Aug 16, 2025
55 of 56 checks passed