COMPAT: np 1.18 wants explicit dtype=object #30035
Conversation
pandas/core/strings.py
Outdated
with warnings.catch_warnings():
    # See https://github.com/numpy/numpy/issues/15041
    warnings.filterwarnings("ignore", ".*with automatic object dtype.*")
    out = np.sum(list_of_columns, axis=0)
can these be replaced with a sep.join(...) with a list-comprehension?
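A rough sketch of what that suggestion might look like, assuming list_of_columns holds equal-length object-dtype arrays of strings and sep is the separator from the enclosing function (both assumptions; this is not the PR's code):

import numpy as np

# Sketch only: build each row's string explicitly so no object-dtype
# inference (and hence no warning) happens inside np.sum.
out = np.array(
    [sep.join(row) for row in zip(*list_of_columns)], dtype=object
)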
maybe? ATM I'm just trying to get a handle on the scope of the problem
We will revert this for 1.18. Probably also on master (but probably only temporarily, to pacify downstream test suites for a bit!). The interesting thing would be if there are cases where it cannot be fixed easily. I.e. adding a warning filter should not be necessary except for specific tests. (I am sure you are aware, but the filter context manager is not a good solution in this code, since it is not thread safe at all.)
@@ -1482,6 +1482,11 @@ def _format_strings(self) -> List[str]:
    if is_categorical_dtype(values.dtype):
        # Categorical is special for now, so that we can preserve tzinfo
        array = values._internal_get_values()
    elif values.dtype.kind == "O":
prefer to use is_object_dtype
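For reference, a minimal sketch of the suggested spelling (the branch body is elided here):

from pandas.api.types import is_object_dtype

# Same intent as `values.dtype.kind == "O"`, via the pandas helper.
if is_object_dtype(values.dtype):
    ...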
i didn't check whether that works on JSONArray, just that values.dtype == object doesn't
small comments, ideally you can xfail the other instances that are not fixed here?
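A sketch of the xfail pattern being asked for (the test name and reason string are illustrative only):

import pytest

@pytest.mark.xfail(
    reason="ragged-array deprecation on numpy dev, see numpy/numpy#15041"
)
def test_not_yet_fixed_on_npdev():
    ...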
pandas/tests/io/json/test_ujson.py
Outdated
-arr = np.array(arr_list)
-tm.assert_numpy_array_equal(np.array(ujson.decode(ujson.encode(arr))), arr)
+arr = np.array(arr_list, dtype=object)
+result = np.array(ujson.decode(ujson.encode(arr)), dtype=object)
can you create an expected here
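A sketch of the requested shape for the assertion, comparing against an explicit expected (names follow the diff above):

arr = np.array(arr_list, dtype=object)
result = np.array(ujson.decode(ujson.encode(arr)), dtype=object)
expected = np.array(arr_list, dtype=object)
tm.assert_numpy_array_equal(result, expected)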
pandas/core/arrays/numpy_.py
Outdated
with warnings.catch_warnings():
    # See https://github.com/numpy/numpy/issues/15041
    warnings.filterwarnings("ignore", ".*with automatic object dtype.*")
    result = np.asarray(scalars, dtype=dtype)
This should be fixed at the caller, not here
yah, pretty much every usage where we're just suppressing the warnings was a kludge
My thought for a way forward with the new sentinel in numpy/numpy#15119 is that you would leave this unchanged. When needed, you would add the sentinel as the dtype in the call to _from_sequence (or even to whoever is calling _from_sequence). During the deprecation period, try to rework the use case that justifies the inefficient ragged array, and get back to NumPy with the use case for such a construct so we can stop or rethink the deprecation.
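A purely hypothetical illustration of that idea; the sentinel's name below is invented, since numpy/numpy#15119 never fixed one:

# `np.legacy_auto_dtype` is NOT a real NumPy attribute; it stands in
# for the opt-in sentinel proposed in numpy/numpy#15119.
result = cls._from_sequence(scalars, dtype=np.legacy_auto_dtype)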
if self.dtype.kind == "O":
    # See https://github.com/numpy/numpy/issues/15041
    return np.asarray(self.values, dtype=object)
return np.asarray(self.values)
Since this is documented as being 1d only, this should be:
x = np.empty(len(self.values), dtype=self.dtype)
x[:] = self.values
return x
which will work correctly even if values == [[1, 2], [3, 4]]
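As a standalone illustration of that trick (a per-element loop is used here, since slice-assigning a raw nested list can itself trip NumPy's shape inference):

import numpy as np

values = [[1, 2], [3, 4]]
x = np.empty(len(values), dtype=object)
for i, v in enumerate(values):
    x[i] = v
# x.shape == (2,); each cell holds a Python list, not a (2, 2) array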
I think we (cc @mattip) might not have communicated how to fix this problem well enough in NEP 34. As I see it, there are two cases when dealing with coercing object arrays: …
@jbrockmendel can you rebase

NumPy is considering adding something like …

thanks @mattip, yeah we likely need a combination of strategies here; e.g. to fix where we can be sure of what the input is, and suppress / remove where needed.
pandas/core/dtypes/concat.py
Outdated
@@ -183,6 +185,15 @@ def concat_categorical(to_concat, axis: int = 0):
    return result


def _safe_array(x):
so i actually like this as a real function; can you move to pandas.compat.numpy and use everywhere?
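The diff above is truncated at the signature, but a helper along these lines is presumably what is meant (a sketch under that assumption, written against current NumPy, where ragged input raises ValueError rather than warning):

import numpy as np

def _safe_array(x):
    # Sketch: let NumPy infer when it can; fall back to an explicit
    # object dtype when the input is ragged.
    try:
        return np.asarray(x)
    except ValueError:
        return np.asarray(x, dtype=object)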
can you do this one
@@ -78,7 +78,7 @@ def data_for_sorting(allow_in_pandas, dtype):
     if dtype.numpy_dtype == "object":
         # Use an empty tuple for first element, then remove,
         # to disable np.array's shape inference.
-        return PandasArray(np.array([(), (2,), (3,), (1,)])[1:])
+        return PandasArray(np.array([(), (2,), (3,), (1,)], dtype=object)[1:])
@TomAugspurger does the comment above about shape inference have any bearing on the inference here?
I think using object dtype is still correct, and I think the comment is still correct.

If we want to get rid of the empty tuple / shape stuff, we could probably allocate the array with the right shape and set the values later:
In [6]: a = np.empty((3,), dtype=object)
In [7]: a[:] = (2,), (3,), (1,)
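Applied to the fixture in question, that might look like the following sketch (cell-by-cell assignment is used because a[:] = (2,), (3,), (1,) can hit a broadcast error when NumPy coerces the right-hand side to shape (3, 1)):

a = np.empty(3, dtype=object)
for i, t in enumerate([(2,), (3,), (1,)]):
    a[i] = t
return PandasArray(a)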
@jbrockmendel can you rebase again

needs a rebase again

Just to make sure you guys are up to date. It is very likely that the Opt-In to the old behaviour will not be …

There is a difference between …

You could build numpy with numpy/numpy#15119. I just pushed it so you might want to wait to see if it passes all tests before trying it out.

Closing, will work on this locally.

@jbrockmendel did you already make progress on this? I am curious if an … In the best case, the cases that are not easily fixed are cases where …

We've updated most of the places that were easy to just add an explicit "dtype=object". I haven't gotten around to working through this locally as suggested here. Do you agree that would be a good next step (and if so, is that still the right branch to use)?

Good question, mostly good to know for us right now. If you (and we) are fine with …

No, there are still many places where we currently call … There are a good chunk of those cases where we know we want 1D output, so as soon as we see a listlike element we know it'll end up object-dtype. Would passing …
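A sketch of that 1D-output pattern (the helper name here is illustrative, not pandas' actual internal):

import numpy as np
from pandas.api.types import is_list_like

def _coerce_1d(values):
    # Sketch: any list-like element forces a 1-D object result, so
    # declare that explicitly instead of letting NumPy guess.
    if any(is_list_like(v) for v in values):
        out = np.empty(len(values), dtype=object)
        for i, v in enumerate(values):
            out[i] = v
        return out
    return np.asarray(values)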
@jbrockmendel adding an … Do you have a start with the particularly tough spots? Maybe we have to look at it from the numpy side to get a better feel for the pain in pandas. I would be happy if pandas is just fine with …

I just cloned this branch and will test against it shortly.

i didn't get any of the expected errors on that branch. is there another one i should try?

xref numpy/numpy#15119
xref #30030, xref numpy/numpy#15041, numpy/numpy#15045
ATM we have 85 tests in the npdev build failing; this gets it down to 14 (at least for me locally).