Skip to content

BUG: fixes for three related stringdtype issues (#26436) #26459

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 16, 2024

Conversation

charris
Copy link
Member

@charris charris commented May 16, 2024

Backport of #26436.

Fixes #26420.

While working on the issue with where found in #26240, I noticed two other related issues. Since the fix for where depends on the other two fixes, I figured it would make sense to send them all as one PR.

First, it turns out that advanced indexing was broken for indexing any entry in an array that needs a heap allocation. On current main, this leads to errors or segfaults:

>>> import numpy as np
>>> a = np.array(["a", "c", "h"*25], dtype=np.dtypes.StringDType(na_object=None))
>>> a[[1,2]]
... elide traceback ...
MemoryError: Failed to load string in StringDType getitem

The fix is to properly set the output descriptor in the casting setup in array_subscript.

Second, I noticed that the nonzero function completely ignores nulls. I also noticed that the string to bool cast assumed all nulls are truthy and ignored the existence of nulls like None that should be falsey, following the behavior of object array:

>>> import numpy as np
>>> np.nonzero(np.array(['hello', np.nan, 'world'], dtype=object))
(array([0, 1, 2]),)
>>> np.nonzero(np.array(['hello', None, 'world'], dtype=object))
(array([0, 2]),)

This updates both nonzero and the string to bool cast to account for this and makes sure they behave identically.

Finally, the issue with where is also caused by not setting the input and output descriptors properly in the cast setup in PyArray_Where. The casting code here dates back to #23770, which was before stringdtype had an arena allocator. With the arena allocator, we need to be more careful about bookkeeping on input and output descriptors. This means we also need to setup a separate cast for the second input descriptor.

Also adds tests for all three fixed issues.

  • BUG: fix broken fancy indexing for stringdtype

  • BUG: fix incorrect casting for stringdtype in PyArray_Where

  • BUG: ensure casting to bool and nonzero treats null strings the same way

  • MNT: refactor so itemsizes are correct

  • MNT: refactor so trivial copy check aligns with casting setup

* BUG: fix broken fancy indexing for stringdtype

* BUG: fix incorrect casting for stringdtype in PyArray_Where

* BUG: ensure casting to bool and nonzero treats null strings the same way

* MNT: refactor so itemsizes are correct

* MNT: refactor so trivial copy check aligns with casting setup
@charris charris added 00 - Bug 08 - Backport Used to tag backport PRs labels May 16, 2024
@charris charris added this to the 2.0.0 release milestone May 16, 2024
@charris charris merged commit 5c3fc21 into numpy:maintenance/2.0.x May 16, 2024
60 checks passed
@charris charris deleted the backport-26436 branch May 16, 2024 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug 08 - Backport Used to tag backport PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants