Possible problem with ragged-array as object deprecation #15041

Closed
mhvk opened this issue Dec 3, 2019 · 15 comments
Labels
06 - Regression component: numpy._core triaged Issue/PR that was discussed in a triage meeting

Comments

@mhvk
Contributor

mhvk commented Dec 3, 2019

#14794 deprecated auto-creation of object arrays for ragged lists (hurray!!), which, perhaps not surprisingly, is causing some test failures in astropy. We can address all of those without problem (two out of three just need more care in the test), but I wasn't entirely sure whether a deprecation warning should be expected for the following:

import numpy as np
lofo = [1, [2, 3]]
a = np.array(lofo, dtype=object)
a == lofo
# /usr/bin/ipython3:1: DeprecationWarning: Creating an ndarray with automatic object dtype is deprecated, use dtype=object if you intended it, otherwise specify an exact dtype
# array([ True,  True])

In this case, arguably the intent is unambiguous: since one array is an object array, the list might as well be interpreted as an object array.

But it also seems fine to just be strict, so no problem if this is not fixed.
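
For reference, the strict route can be sketched as follows: the list is converted explicitly before the comparison, so nothing relies on automatic object-dtype creation. This is only an illustration; the names `other` and `result` are not from the original report.

```python
import numpy as np

lofo = [1, [2, 3]]
a = np.array(lofo, dtype=object)

# Convert the list explicitly so that the comparison itself never has to
# guess a dtype for the ragged list.
other = np.array(lofo, dtype=object)

# Elementwise comparison of two object arrays of shape (2,).
result = a == other
```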

@charris
Member

charris commented Dec 3, 2019

The long range intent is to require an explicit dtype=object so that we can dispense with the automatic creation of object arrays. This is just a first step.

EDIT: Although automatic promotion, as here, is something that needs consideration.

@mattip
Member

mattip commented Dec 4, 2019

The deeper problem is that even with a = np.array([1, [2, 3]], dtype=object), it is not clear what the shape of a should be. It could be either

  1. a.shape == (1,) and a[0] == [1, [2, 3]] or
  2. a.shape == (2,) and a[0] == 1, a[1] == [2, 3].

I am not sure what is correct. NumPy has chosen 2, but this is fragile, see the reverting of PR gh-13913 in gh-14341, and the reverting of gh-14800 in gh-14839 for moving forward with the array_richcompare deprecations. Perhaps we should let libraries like awkward-arrays handle more complicated cases instead.
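
A small sketch of the two readings, assuming a recent NumPy: passing `dtype=object` directly gives interpretation 2, while interpretation 1 has to be built by hand by allocating the array first. The variable names are illustrative only.

```python
import numpy as np

nested = [1, [2, 3]]

# Interpretation 2: the outer list is an axis, giving shape (2,).
two = np.array(nested, dtype=object)

# Interpretation 1: the whole list is a single element, giving shape (1,).
# Allocating first and assigning by integer index stores the list as-is.
one = np.empty(1, dtype=object)
one[0] = nested

print(two.shape, one.shape)
```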

What is the use case in astropy, pandas, or other places for such ragged arrays? Is it a replacement for masked arrays?

@jbrockmendel
Contributor

What is the use case in astropy, pandas, or other places for such ragged arrays?

This is affecting 85 tests in pandas's npdev build (https://dev.azure.com/pandas-dev/pandas/_build/results?buildId=22337). A bunch of these I can fix by manually adding dtype=object; the first one that I can't is pandas.tests.test_strings.test_string_array, with a minimal reproducer:

$ python3 -Werror

>>> import numpy as np
>>> arr = np.array(["a", "b"], dtype=object)
>>> objs = [arr, "foo", arr]
>>> np.sum(objs, axis=0)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 6, in sum
  File "/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2229, in sum
    initial=initial, where=where)
  File "/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 90, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
DeprecationWarning: Creating an ndarray with automatic object dtype is deprecated, use dtype=object if you intended it, otherwise specify an exact dtype

LMK if this needs to go in a separate issue; I don't want to hijack the thread.
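
One possible workaround, sketched here under the assumption that a 1-D object array with three elements is what the reduction is meant to operate on: allocate the object array explicitly and fill it element by element, so that `np.sum` never has to infer a dtype from the ragged list. The name `container` is illustrative, and this is not necessarily the fix pandas adopted.

```python
import numpy as np

arr = np.array(["a", "b"], dtype=object)
objs = [arr, "foo", arr]

# Build the 1-D object array by hand; integer-index assignment stores each
# item (array or string) as a single element.
container = np.empty(len(objs), dtype=object)
for i, obj in enumerate(objs):
    container[i] = obj

# The reduction now adds the three elements: arr + "foo" + arr.
total = np.sum(container, axis=0)
```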

@mattip
Member

mattip commented Dec 4, 2019

I created an issue.

@mhvk
Contributor Author

mhvk commented Dec 4, 2019

@mattip - before anything else, for astropy the problems are easy to fix (and have been).

Your point about the ambiguity is well taken. I guess the question is whether in the example I gave, it is worth implicitly doing np.array(other, dtype=object). On the plus side, turning a list into an array is what happens internally anyway (with dtype=None), and the dtype=object that would keep existing behaviour is quite obvious. On the minus side, it is yet more special-casing, it locks in place how np.array deals with nested lists, and one can wonder whether the auto-coercion is all that great generally, so perhaps reducing it slowly is not a bad idea.

Possibly the best answer is to do nothing until a better case comes along!

@seberg
Member

seberg commented Dec 4, 2019

We have floated ideas of things such as ndim=... or depth=... kwargs (on the C side, these may even half exist as max ndims and min ndims). If we had those, users would have a way to remove the ambiguity, and we could deprecate the usage of object unless the other argument was also given. OTOH, ndim would be practically only a sanity check on arrays which are not object.
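
To make the idea concrete, here is a rough pure-Python sketch of what such a keyword could mean. `object_array` and its `ndim` parameter are hypothetical and are not part of NumPy; the sketch also assumes the outer `ndim` levels are rectangular.

```python
import numpy as np

def object_array(nested, ndim):
    """Hypothetical helper: treat only the outer `ndim` list levels as axes."""
    # Read the shape off the first item at each level (assumes the outer
    # levels are rectangular).
    shape = []
    level = nested
    for _ in range(ndim):
        shape.append(len(level))
        level = level[0]

    out = np.empty(shape, dtype=object)
    for index in np.ndindex(*shape):
        # Walk down to the item at this index and store it as one element.
        item = nested
        for i in index:
            item = item[i]
        out[index] = item
    return out

data = [[1, 2], [3, [4, 5]]]
as_2d = object_array(data, ndim=2)  # shape (2, 2); as_2d[1, 1] is the list [4, 5]
as_1d = object_array(data, ndim=1)  # shape (2,); each element is an inner list
```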

@rgommers rgommers added triaged Issue/PR that was discussed in a triage meeting and removed triage review Issue/PR to be discussed at the next triage meeting labels Dec 7, 2019
@mattip
Member

mattip commented Dec 10, 2019

We have floated ideas of things such as ndim=... or depth=... kwargs

I think users who want this should create an empty array of the correct shape and assign to it:

np.array([[1, 2], [3, [4, 5]]])  # fails, do they want `shape=(2, 2)` or `shape=(2,)`?
arr = np.empty((2, 2), dtype=object)
arr[0, :] = [1, 2]
arr[1, :] = [3, [4, 5]]

@tacaswell
Contributor

This is also causing ~40 test failures in Matplotlib. Will provide details soon(ish).

@seberg
Member

seberg commented Mar 6, 2020

@tacaswell is this the old ragged array warning, or the new fix of empty array-like handling? The dtype=object one has been in master for quite a while.
For the array-like change, I knew that there were some small issues in pandas, which pandas seemed fine with. I did not expect issues in matplotlib, to be honest.

@seberg
Member

seberg commented May 13, 2020

What happened with this issue? Is the current state acceptable for downstream (i.e. to remain identical for the 1.19 branch), and were all of the necessary fixups done by now?

The initial issue by Marten seems like one we can fix by just using dtype=object here, even if Matti is probably right that it can still be considered ambiguous (but so is the initial array creation). OTOH, I am not totally opposed to forcing the user to call np.array() manually here...

@mhvk
Contributor Author

mhvk commented May 13, 2020

Earlier, I wrote

Possibly the best answer is to do nothing until a better case comes along!

Since no better cases for auto-converting seem to have come along, I think it makes sense to close this issue - for now, it avoids us building in some special-casing that might later be regretted.

@seberg
Member

seberg commented May 13, 2020

Thanks Marten. I was also wondering about the current state in matplotlib, @tacaswell. I will close this though; please do open a new issue for the impact on other projects!

@seberg seberg closed this as completed May 13, 2020
@tacaswell
Contributor

Matplotlib is still broken, but @QuLogic is making progress on fixing it.

@seberg
Member

seberg commented May 14, 2020

@tacaswell good to hear. We are planning to branch 1.19 very soon, which means a release is within sight, so if this would be disruptive should you not have a release before that, we can see what to do. Right now this is a VisibleDeprecationWarning, to inform users, but if users might face it because of matplotlib internals, maybe a normal DeprecationWarning is better.

@QuLogic
Contributor

QuLogic commented May 14, 2020

That is matplotlib/matplotlib#17289, for reference. There's only one last thing to fix there.

8 participants