Description
Describe the issue:
When creating a NumPy array from a list of lists with dtype=object, NumPy implicitly promotes the inner lists to ndarray if their lengths are uniform, but keeps them as list if they are not. This creates inconsistent behavior despite the explicitly specified dtype=object.
Expected Behavior
When dtype=object is explicitly specified, I expect all inner elements to remain as-is, i.e., as Python lists, regardless of shape consistency.
Actual Behavior
- If inner list lengths are not uniform, elements are preserved as list.
- If inner list lengths are uniform, elements are implicitly converted to ndarray, even though dtype=object was specified.
This implicit promotion undermines the explicit request for object dtype.
Is this behavior intentional or a bug?
If this behavior is intentional, could you clarify why NumPy promotes inner lists to ndarray when their lengths are uniform, even when dtype=object is explicitly specified?
Reproduce the code example:
import numpy as np
# Inner lists of different lengths → preserved as Python lists (as expected)
a = np.array([[1, 2, 3], [1, 2]], dtype=object)
print(a.shape) # (2,)
print(type(a[0])) # <class 'list'>
# Inner lists of same length → promoted to ndarray unexpectedly
b = np.array([[1, 2, 3], [1, 2, 3]], dtype=object)
print(b.shape) # (2, 3)
print(type(b[0])) # <class 'numpy.ndarray'>
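One way to guarantee a 1-D object array regardless of inner-list lengths (a sketch added for illustration, not part of the original report; to_object_array is a hypothetical helper name, not NumPy API) is to pre-allocate an empty object array and fill it element by element, which sidesteps np.array's shape inference entirely:

```python
import numpy as np

def to_object_array(rows):
    """Build a 1-D object array whose elements are the rows, as-is.

    Pre-allocating with np.empty and assigning element-wise avoids
    np.array's shape inference, so uniform-length inner lists are
    not promoted to a 2-D ndarray.
    """
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = row  # each cell holds the original Python list
    return out

b = to_object_array([[1, 2, 3], [1, 2, 3]])
print(b.shape)     # (2,)
print(type(b[0]))  # <class 'list'>
```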
Error message:
Python and NumPy Versions:
3.12.11 | packaged by conda-forge | (main, Jun 4 2025, 14:38:53) [Clang 18.1.8 ]
Runtime Environment:
No response
Context for the issue:
This issue arose in the context of deep learning training, where data is processed step-by-step, and in some cases, we need to concatenate the current batch with the previous one.
If the current batch contains uniform-length lists and the previous batch does not, using np.array(..., dtype=object) can silently produce inconsistent results: one batch becomes a proper 1-D object array of lists, while the other is unexpectedly promoted to a 2-D ndarray. This mismatch can lead to dimension-related errors during concatenation, such as ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s), which can be difficult to debug.
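The concatenation failure described above can be reproduced in a few lines (a minimal sketch of the scenario, with made-up batch contents):

```python
import numpy as np

# Previous batch is ragged -> stays a 1-D object array of lists.
prev_batch = np.array([[1, 2, 3], [1, 2]], dtype=object)
# Current batch is uniform -> silently promoted to a 2-D ndarray.
curr_batch = np.array([[1, 2, 3], [4, 5, 6]], dtype=object)

print(prev_batch.ndim, curr_batch.ndim)  # 1 2

try:
    np.concatenate([curr_batch, prev_batch])
except ValueError as e:
    # Dimension mismatch between the 2-D and 1-D arrays.
    print("concatenate failed:", e)
```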
I encountered this exact issue (volcengine/verl#2741) during training with Verl, and had to work around it by switching to np.fromiter to prevent NumPy from making assumptions about the structure. While this workaround is effective, it is not well known, and most developers instinctively reach for np.array(..., dtype=object) without expecting any implicit promotion to happen.
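For reference, the np.fromiter workaround can look like the sketch below (note that dtype=object support in np.fromiter requires NumPy >= 1.23):

```python
import numpy as np

data = [[1, 2, 3], [4, 5, 6]]  # uniform lengths that np.array would promote to 2-D

# fromiter consumes the iterable one element at a time, so each inner
# list is stored as a single object and no shape inference takes place.
arr = np.fromiter(data, dtype=object)
print(arr.shape)     # (2,)
print(type(arr[0]))  # <class 'list'>
```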
Because of this, I believe this behavior should either be corrected or at least clearly documented, as it has real consequences in machine learning pipelines and other dynamic data processing scenarios.