-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Fix typing for extension arrays and extension dtypes without isin and astype #40421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
OK, next commit will remove those three things, which were changing the |
pandas/core/internals/blocks.py
Outdated
# error: Invalid index type "List[Any]" for "ExtensionArray"; expected | ||
# type "Union[int, slice, ndarray]" | ||
return self.values[[loc]] # type: ignore[index] | ||
return self.values[[loc]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may not be a false positive. The EA interface does not specify (in the docstring) that a list is accepted for __getitem__, although maybe we have tests in the base extension tests that enforce this. will need to check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the docs need to get updated, because this works:
>>> s=pd.array([1,2,3,4,5])
>>> s[[2,4]]
<IntegerArray>
[3, 5]
Length: 2, dtype: Int64
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure for our internal EAs. it's 3rd party EAs that may not support this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I guess there are two choices:
- Require 3rd party EAs to support it and update the docs
- Remove the
List[Any]
that I put in, and just mark the mypy error on that one place
I will assume you prefer (2), and will remove that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will look in detail tomorrow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last commits do it using (2).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Require 3rd party EAs to support it and update the docs
This one (not necessarily for this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other thought on this. There was some comment about having the EA's include a 2D signature, but I think that can be done in a subclass that implements it. So right now, the signature for __getitem__()
is:
def __getitem__(
self, item: Union[int, slice, np.ndarray]
) -> Union[ExtensionArray, Any]:
The discussion here is (to do in another PR) to make it (and change the docs):
def __getitem__(
self, item: Union[int, slice, np.ndarray, List[int]]
) -> Union[ExtensionArray, Any]:
Note the use of List[int]
here, not List[Any]
as I had it before.
If there is a 2D EA that wants to allow Tuple[int, ...]
, it can have it's own signature that widens the EA signature:
def __getitem__(
self, item: Union[int, slice, np.ndarray, List[int], Tuple[int, ...]]
) -> Union[ExtensionArray, Any]:
But we don't have to include that signature with Tuple[int,...]
in the base EA class.
@@ -1073,7 +1069,7 @@ def unique(self): | |||
values = self._values | |||
|
|||
if not isinstance(values, np.ndarray): | |||
result = values.unique() | |||
result: ArrayLike = values.unique() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we've got ExtensionArray in the namespace, can make the isinstance check above for EA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the isinstance
check, but you still need to declare result
as ArrayLike
, because the code below that line can return the result as an np.ndarray
. Here's the excerpt (as now written):
if isinstance(values, ExtensionArray):
result: ArrayLike = values.unique()
if self.dtype.kind in ["m", "M"] and isinstance(self, ABCSeries):
# GH#31182 Series._values returns EA, unpack for backward-compat
if getattr(self.dtype, "tz", None) is None:
result = np.asarray(result)
@simonjayhawkins I believe I've addressed all your requests on this. Anything else to do? |
Thanks @Dr-Irv for addressing the comments. I want to make another thorough pass before approving. |
# error: Argument "dtype" to "array" has incompatible type | ||
# "Union[ExtensionDtype, dtype[Any]]"; expected "Union[dtype[Any], None, | ||
# type, _SupportsDType, str, Union[Tuple[Any, int], Tuple[Any, | ||
# Union[int, Sequence[int]]], List[Any], _DTypeDict, Tuple[Any, Any]]]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't you just assert that this is a NpDtype here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We decided to address astype()
(where the above code is) in a later PR. That will remove the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see previous comment, remove the added type and this ignore will not be needed yet.
@simonjayhawkins pinging you again on review of this. Thanks! |
@Dr-Irv can you resolve conflicts. |
@simonjayhawkins I've done this. Note that with respect to what was in #41097, I changed the name of the Note that the way that I use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Dr-Irv
@@ -191,7 +191,7 @@ def extract_bool_array(mask: ArrayLike) -> np.ndarray: | |||
# We could have BooleanArray, Sparse[bool], ... | |||
# Except for BooleanArray, this is equivalent to just | |||
# np.asarray(mask, dtype=bool) | |||
mask = mask.to_numpy(dtype=bool, na_value=False) | |||
mask = mask.to_numpy(dtype=np.dtype(bool), na_value=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't need to be changed. i've opened #41185 as a precursor to fix these false positives.
copy: bool = False, | ||
na_value=lib.no_default, | ||
na_value: Any | None = lib.no_default, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None seems redundant here.
@@ -510,8 +508,7 @@ def nbytes(self) -> int: | |||
# ------------------------------------------------------------------------ | |||
# Additional Methods | |||
# ------------------------------------------------------------------------ | |||
|
|||
def astype(self, dtype, copy=True): | |||
def astype(self, dtype: Dtype, copy: bool = True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when adding types, we should ensure that the types match the docstrings. From the PR title and previous discussion, IIUC astype was being done separately.
I would revert this for now until this method is sorted properly.
see also #41018 (comment) and response.
The docstring states we return np.ndarray and the one-liner suggests that too. We sometimes also return an ExtensionArray, this is dependent on the type of dtype.
my concern is that if we add the type now, this may get forgotten.
# error: Argument "dtype" to "array" has incompatible type | ||
# "Union[ExtensionDtype, dtype[Any]]"; expected "Union[dtype[Any], None, | ||
# type, _SupportsDType, str, Union[Tuple[Any, int], Tuple[Any, | ||
# Union[int, Sequence[int]]], List[Any], _DTypeDict, Tuple[Any, Any]]]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see previous comment, remove the added type and this ignore will not be needed yet.
*args: Any, | ||
**kwargs: Any, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would leave these untyped for now.
These are passed through to numpy.argsort, so we should be able to be more precise.
try to avoid adding Any where we maybe able to more precise with further investigation.
side note: these actually don't seem to be passed in the base implementation.
@@ -696,7 +693,7 @@ def value_counts(self, dropna: bool = True) -> Series: | |||
|
|||
_str_na_value = ArrowStringDtype.na_value | |||
|
|||
def _str_map(self, f, na_value=None, dtype: Dtype | None = None): | |||
def _str_map(self, f, na_value=None, dtype: DtypeArg | None = None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
# type "Union[ExtensionArray, ndarray]"; expected "Sequence[Any]" | ||
locs = self._values[mask].searchsorted( | ||
where._values, side="right" # type: ignore[arg-type] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments regarding using Sequence for ArrayLike
Closing this as I've started making even smaller PRs and tracking status here: #39501 (comment) in the comment from the original big PR |
Same as #39501 by removing changes related to
isin
and the overloads forastype
and return type forastype
Per discussion with @simonjayhawkins #39501 (comment)