
BUG: Fix all-NaT when ArrowEA.astype to categorical #62055


Open

arthurlw wants to merge 10 commits into main

Conversation

arthurlw
Member

@arthurlw arthurlw commented Aug 6, 2025

@arthurlw arthurlw changed the title from "BUG: Added condition for CategoricalDtype" to "BUG: Fix all-NaY when ArrowEA.astype to categorical" Aug 6, 2025
@arthurlw arthurlw changed the title from "BUG: Fix all-NaY when ArrowEA.astype to categorical" to "BUG: Fix all-NaT when ArrowEA.astype to categorical" Aug 6, 2025
dtype = pandas_dtype(dtype)
if dtype == self.dtype:
    if not copy:
        return self
    else:
        return self.copy()

if isinstance(dtype, CategoricalDtype):
Member

Pretty weird to do this in the base class. Can you track down where the actual problem is? It could be hiding other bugs, e.g. in Categorical._from_sequence.

Member Author

Ah okay, I'll investigate further.

if is_datetime64_any_dtype(cat_dtype) or is_timedelta64_dtype(
    cat_dtype
):
    values = values.to_numpy()
Member

This can be expensive, particularly for dt64tz. Have you figured out why the problem is happening, i.e. if you step through the code, where does the first wrong result show up?

@arthurlw
Member Author

arthurlw commented Aug 7, 2025

Using astype with a CategoricalDtype coerces the values into an Index, where the code then tries to compare two Index objects with numpy and pyarrow dtypes.

In the first example, Index._should_compare decides the two indexes are not comparable and returns all -1s. In the second example, the code correctly determines that they are comparable, but the comparison itself fails after both indexes are coerced to object dtype by Index._find_common_type_compat. Hence, both cases return all -1s.
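
For context, a minimal reproduction sketch of the failure mode described above. The values and the timestamp[us] dtype are assumptions for illustration, not the original report's exact examples.

import pandas as pd
import pyarrow as pa

# An Arrow-backed timestamp array cast to categorical comes back as all-NaT
# on affected versions: the datetime64 categories and the Arrow-typed values
# are not treated as comparable, so every code resolves to -1.
arr = pd.array(
    [pd.Timestamp("2023-01-01"), pd.Timestamp("2023-01-02")],
    dtype=pd.ArrowDtype(pa.timestamp("us")),
)
print(arr.astype("category"))  # expected: the two timestamps, not NaT

# The same mismatch surfaces directly in Index.get_indexer:
cats = pd.DatetimeIndex(["2023-01-01", "2023-01-02"])  # datetime64 categories
target = pd.Index(arr)                                 # ArrowDtype timestamp index
print(cats.get_indexer(target))                        # all -1 on affected versions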

@jbrockmendel
Member

So we should start by fixing should_compare?

@arthurlw
Member Author

> So we should start by fixing should_compare?

Sorry I missed this earlier. The fix I pushed addresses that and resolves the second issue too.

@@ -3674,6 +3674,14 @@ def get_indexer(
        orig_target = target
        target = self._maybe_cast_listlike_indexer(target)

        from pandas.api.types import is_timedelta64_dtype
Member

this can go at the top of the file

@@ -3674,6 +3674,14 @@ def get_indexer(
        orig_target = target
        target = self._maybe_cast_listlike_indexer(target)

        from pandas.api.types import is_timedelta64_dtype

        if target.dtype == "string[pyarrow]" and is_timedelta64_dtype(self.dtype):
Member

I don't think we generally do this implicit casting anymore.

Member Author

That's fair, but the array is initialized as string[pyarrow] rather than as a pyarrow temporal type, and I'm not sure how to compare strings with timedelta64 without some coercion step.

Member

Looking at this more closely, I was wrong about not casting strings.

tdi = pd.date_range("2016-01-01", periods=3) - pd.Timestamp("2016-01-01")
target = pd.array(["0 days"], dtype="string[pyarrow]")

>>> tdi.get_indexer(["0 days"])
array([0])
>>> tdi._maybe_cast_listlike_indexer(["0 days"])
TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None)
>>> tdi._maybe_cast_listlike_indexer(target)
TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None)

If a patch is needed, can it go in _maybe_cast_listlike_indexer?

Member Author

Yeah, I was thinking about using a _maybe_-style helper function too.
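
A rough standalone sketch of the helper-based coercion being discussed, written here as a free function; the name, the to_datetime/to_timedelta parsing, and the fallback behaviour are assumptions for illustration, not the PR's actual code.

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_timedelta64_dtype

def cast_stringlike_indexer_target(index, target):
    # Sketch: parse a string[pyarrow] target into the datetime-like index's
    # own dtype so get_indexer can match values; leave it unchanged if the
    # strings cannot be parsed.
    target = pd.Index(target)
    if target.dtype != "string[pyarrow]":
        return target
    try:
        if is_timedelta64_dtype(index.dtype):
            return pd.to_timedelta(target)
        if is_datetime64_any_dtype(index.dtype):
            return pd.to_datetime(target)
    except (TypeError, ValueError):
        pass  # leave unparseable targets unchanged
    return target

tdi = pd.date_range("2016-01-01", periods=3) - pd.Timestamp("2016-01-01")
target = pd.array(["0 days"], dtype="string[pyarrow]")
print(cast_stringlike_indexer_target(tdi, target))
# TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None)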

@@ -384,6 +384,18 @@ def _is_comparable_dtype(self, dtype: DtypeObj) -> bool:
        if self.tz is not None:
            # If we have tz, we can compare to tzaware
            return isinstance(dtype, DatetimeTZDtype)

        from pandas import ArrowDtype
Member

can go at the top of the file


        return (
            pa.types.is_date32(dtype.pyarrow_dtype)
            or pa.types.is_date64(dtype.pyarrow_dtype)
Member

I think timestamp is comparable but date is not.

Member Author

The original issue was with pyarrow date dtypes, which compare fine when using astype, so I think they should be treated as comparable here
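
A quick illustration of the case being referred to; the values are assumed, but the dtype matches the description of the original report (an Arrow-backed date array).

import datetime as dt
import pandas as pd
import pyarrow as pa

# Whether date32/date64 count as comparable to datetime64 decides whether
# this round-trips as date categories or comes back all-NaT.
arr = pd.array(
    [dt.date(2023, 1, 1), dt.date(2023, 1, 2)],
    dtype=pd.ArrowDtype(pa.date32()),
)
print(arr.astype("category"))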

Member

dti = pd.date_range("2016-01-01", periods=3)
item = dti[0].date()
>>> (item == dti)[0]
np.False_

We don't have a non-pyarrow date dtype, but if we did, it would not be considered comparable to datetime64

Member Author

I think in that case the question is whether we want astype with categoricals to succeed here, or whether astype between pyarrow date and datetime64 should be disallowed for consistency

Member

Do we have analogous special-casing for the non-pyarrow dt64 that I'm missing?

Member Author

Not that I know of

Member

Then I expect it shouldn't be necessary here. I'll take a closer look on Monday.

Member

I think the relevant special-casing is in Index._maybe_downcast_for_indexing. Take a look for the inferred_type checks.

@@ -160,3 +163,20 @@ def test_astype_category_readonly_mask_values(self):
        result = arr.astype("category")
        expected = array([0, 1, 2], dtype="Int64").astype("category")
        tm.assert_extension_array_equal(result, expected)

    def test_arrow_array_astype_to_categorical_dtype_temporal(self):
Member

Can you also test the intermediate steps that used to fail?
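
A sketch of what such an intermediate-step test could look like; the test name, dtypes, and expected values are assumptions for illustration, not the PR's actual tests. It exercises get_indexer directly with an Arrow-typed target instead of checking only the final astype result.

def test_get_indexer_arrow_timestamp_target(self):
    # Sketch only; imports placed inside the test to keep it self-contained.
    import numpy as np
    import pyarrow as pa

    import pandas as pd
    import pandas._testing as tm

    dti = pd.date_range("2016-01-01", periods=3)
    target = dti.astype(pd.ArrowDtype(pa.timestamp("ns")))
    result = dti.get_indexer(target)
    expected = np.array([0, 1, 2], dtype=np.intp)
    tm.assert_numpy_array_equal(result, expected)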

Successfully merging this pull request may close these issues.

BUG: ArrowEA.astype to categorical returning all-NaT