Skip to content

Improve support for pandas Extension Arrays (#10301) #10380

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

richard-berg
Copy link

@richard-berg richard-berg commented May 30, 2025

The core ideas here are:

  • Provide an array-api implementation of np.result_type(*arrays_or_dtypes). This unlocks arbitrary N-ary operations on ExtensionArrays without loss of type info (as found in pre-2024 releases) or blowing up due to lack of EA-specific implementations (as documented in Regression in DataArrays created from Pandas #10301).
  • Provide said implementations of operators and ufuncs by delegating to pd.Series

Minor refactors & bugfixes are documented inline.

  • Closes #xxxx
  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

Copy link

welcome bot commented May 30, 2025

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

Comment on lines -225 to +241
def preprocess_types(t):
if isinstance(t, str | bytes):
return type(t)
elif isinstance(dtype := getattr(t, "dtype", t), np.dtype) and (
np.issubdtype(dtype, np.str_) or np.issubdtype(dtype, np.bytes_)
):
def maybe_promote_to_variable_width(
array_or_dtype: np.typing.ArrayLike | np.typing.DTypeLike,
) -> np.typing.ArrayLike | np.typing.DTypeLike:
if isinstance(array_or_dtype, str | bytes):
return type(array_or_dtype)
elif isinstance(
dtype := getattr(array_or_dtype, "dtype", array_or_dtype), np.dtype
) and (np.issubdtype(dtype, np.str_) or np.issubdtype(dtype, np.bytes_)):
# drop the length from numpy's fixed-width string dtypes, it is better to
# recalculate
# TODO(keewis): remove once the minimum version of `numpy.result_type` does this
# for us
return dtype.type
else:
return t
return array_or_dtype
Copy link
Author

@richard-berg richard-berg May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This diff looks ugly, but it's simply renaming the fn + its argument.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that this does not promote to variable width strings (which would be the new string dtype, numpy.dtypes.StringDType) but rather drops the width of a existing fixed-width string dtype to force numpy to recalculate the width. The aim is to avoid truncating a python string object.

Additionally, it might be good to keep the "type" in the name of the function, since it only operates on string dtypes.

Comment on lines +259 to +262
except TypeError:
# passing individual objects to xp.result_type means NEP-18 implementations won't have
# a chance to intercept special values (such as NA) that numpy core cannot handle
pass
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only this except block is new. The rest of this fn was lifted as-is from result_type() below.

Comment on lines -255 to -267
if any(is_extension_array_dtype(x) for x in scalars_or_arrays):
extension_array_types = [
x.dtype for x in scalars_or_arrays if is_extension_array_dtype(x)
]
if len(extension_array_types) == len(scalars_or_arrays) and all(
isinstance(x, type(extension_array_types[0])) for x in extension_array_types
):
return scalars_or_arrays
raise ValueError(
"Cannot cast arrays to shared type, found"
f" array types {[x.dtype for x in scalars_or_arrays]}"
)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we now provide an array-api implementation of np.result_type, we no longer need these special cases. (which were far too special, IMO; the cases where we raised ValueError are perfectly valid)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking this on, it's great work!

Can you extract out any non-EA changes to duck_array_ops.py and dtypes.py and make a separate PR please? It will be far easier to review then

@dcherian dcherian requested a review from keewis May 30, 2025 14:51
Copy link
Contributor

@ilan-gold ilan-gold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dcherian I would like to hear from @richard-berg about this. Aside from these new ops, everything else in this PR appears to be directly relevant.

By using pandas (maybe private) machinery directly, this PR seems to handle a much more broad set of use-cases than what we used to, and integrates with existing xarray functionality internally in a valid way (via duck array ops / NEP 18). I never felt confident enough/it necessary to go "this deep," especially with duck array ops, but I think this change is super nice and clean and really makes things clear so I'm a fan.

The big questions is what the status of this internal pandas machinery is.

I'll give @richard-berg a day or two to reply, and then start fiddling around with the PR a bit myself to confirm some of these things myself.

Big thanks @richard-berg because this is quite clean and nice!

Comment on lines +11 to +14
from pandas.api.types import is_scalar as pd_is_scalar
from pandas.core.dtypes.astype import astype_array_safe
from pandas.core.dtypes.cast import find_result_type
from pandas.core.dtypes.concat import concat_compat
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So one reason I was not aware of some of these functions is that they are internal to pandas and do not appear in the public documentation.

What is the policy on this? Are these functions "public" in some other sense in terms of stability?

or not subok
or order != "K"
):
return NotImplemented
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise TypeError(f"{array} is not an pandas ExtensionArray.")
self.array = array

self._add_ops_dunders()

def _add_ops_dunders(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unrelated to the fix for reindex, I believe?

elif is_extension_array_dtype(dtype):
# data may or may not be an ExtensionArray, so we can't rely on
# np.asarray to call our NEP-18 handler; gotta hook it ourselves
converted = PandasExtensionArray(as_extension_array(data, dtype))
Copy link
Contributor

@ilan-gold ilan-gold Jun 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
converted = PandasExtensionArray(as_extension_array(data, dtype))
converted = PandasExtensionArray(np.asarray(data, dtype))

Isn't this what as_extension_array is meant for really i.e., to be used by the above suggested API?

@@ -29,6 +54,97 @@ def __extension_duck_array__issubdtype(
return False # never want a function to think a pandas extension dtype is a subtype of numpy


@implements("astype") # np.astype was added in 2.1.0, but we only require >=1.24
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean practically for xarray? Do we need a shim here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic-arrays related to flexible array support topic-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants