Skip to content

MAINT: Ensure correct handling for very large unicode strings #27875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 4, 2024

Conversation

seberg
Copy link
Member

@seberg seberg commented Nov 28, 2024

In the future, we can handle these strings (in parts we already can maybe), but for now have to stick to int length because more of the code needs cleanup to actually use it safely. (For user dtypes this is less of a problem, although corner cases probably exist.)

This adds necessary checks to avoid large unicode dtypes.

@seberg seberg added the 09 - Backport-Candidate PRs tagged should be backported label Nov 28, 2024
@seberg seberg force-pushed the fix-unicode-length branch 4 times, most recently from 474c052 to 4f85950 Compare November 29, 2024 10:07
In the future, we can handle these strings (in parts we already
can maybe), but for now have to stick to `int` length because
more of the code needs cleanup to actually use it safely.
(For user dtypes this is less of a problem, although corner cases
probably exist.)

This adds necessary checks to avoid large unicode dtypes.
@seberg seberg force-pushed the fix-unicode-length branch from 4f85950 to faf10e9 Compare November 29, 2024 10:37
@seberg seberg requested a review from ngoldbaum December 2, 2024 21:51
Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity, did you look at whether StringDType handles these cases as well?

I guess we can see if anyone complains because suddenly NumPy objects to creating very large strings. It looks like the testing here was poor at best.

If we can ban strings longer than INT_MAX bytes for stringdtype, then that unlocks some interesting possible memory savings on 64 bit architectures: #26059

if (itemsize > NPY_MAX_INT) {
/* We can allow this, but should audit code paths before we do. */
PyErr_SetString(PyExc_TypeError,
"string too large to store inside array.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this error message and the other new error messages you're adding include the values of itemsize and the maximum string size?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can do it, although I am a bit unenthusiastic about bothering because the alternative error is an overflow which isn't much more expressive. (Or most likely a memory error.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm putting myself in the shoes of someone who happens to hit this error in the wild. It might not be immediately obvious especially for the unicode case that the limit is related to INT_MAX.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you aren't wrong that it is easy enough to just add, even if I still think you are more likely to hit other errors that don't explain the issue as well.

Anyway, added (plus one additional check that should never happen, but why not be future proof...)

large_string = "A" * (very_large + 1)
except Exception:
# We may not be able to create this Python string on 32bit.
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if you could statically detect this case and use pytest.mark.skipif. But if that's impossible I think it's better to use pytest.skip, so the comment above can go in the reason https://docs.pytest.org/en/stable/reference/reference.html#pytest.skip

return _NPY_ERROR_OCCURRED_IN_CAST;
}

loop_descrs[2] = PyArray_DescrNew(loop_descrs[0]);
if (loop_descrs[2] == NULL) {
Py_DECREF(loop_descrs[0]);
Py_DECREF(loop_descrs[1]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oof, good catch!

@seberg seberg added this to the 2.1.4 release milestone Dec 3, 2024
@seberg
Copy link
Member Author

seberg commented Dec 3, 2024

Just out of curiosity, did you look at whether StringDType handles these cases as well?

I have not, I would think that is hard to achieve but for certain functions it could be wrong in the string ufunc core, though.

If we can ban strings longer than INT_MAX bytes for stringdtype

I think that is a choice that one can do. But here it is just a historic choice, because NumPy only even supports larger dtypes since 2.0, but we did not audit the code carefully enough to really exploit that.

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Go ahead and merge if you still disagree about the error messages. In general I try to err on the side of including more information than not in python error messages generated from C, since most programmers don't realize that they can grep the C source code for error messages to learn more.

@ngoldbaum
Copy link
Member

I have not, I would think that is hard to achieve but for certain functions it could be wrong in the string ufunc core, though.

Added an item to #25693 to check this.

@charris
Copy link
Member

charris commented Dec 3, 2024

A note to myself that this needs two backports.

Also add future proof guard, just in case we got a larger string
in addition.
@ngoldbaum
Copy link
Member

Thanks @seberg!

@ngoldbaum ngoldbaum merged commit 7901f71 into numpy:main Dec 4, 2024
69 checks passed
@seberg seberg deleted the fix-unicode-length branch December 4, 2024 17:23
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants