MAINT: Ensure correct handling for very large unicode strings #27875
Conversation
Force-pushed from 474c052 to 4f85950 (compare)
In the future we can handle these strings (in parts we already can, maybe), but for now we have to stick to `int` length, because more of the code needs cleanup before larger lengths can be used safely. (For user dtypes this is less of a problem, although corner cases probably exist.) This adds the necessary checks to avoid large unicode dtypes.
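For context, a minimal sketch of the kind of construction these checks are meant to reject (the exception type and exact message are illustrative here, not quoted from the diff):

```python
import numpy as np

# NumPy's unicode dtype stores 4 bytes (UCS4) per character, so an
# element of 2**31 // 4 characters needs an itemsize of 2**31 bytes,
# which no longer fits in a C int (NPY_MAX_INT == 2**31 - 1).
too_many_chars = 2**31 // 4
try:
    np.empty(1, dtype=f"U{too_many_chars}")
except Exception as exc:  # expected to raise with this PR's checks
    print(type(exc).__name__, exc)
```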
Force-pushed from 4f85950 to faf10e9 (compare)
Just out of curiosity, did you look at whether StringDType handles these cases as well?
I guess we can see if anyone complains because suddenly NumPy objects to creating very large strings. It looks like the testing here was poor at best.
If we can ban strings longer than INT_MAX bytes for stringdtype, then that unlocks some interesting possible memory savings on 64-bit architectures: #26059
numpy/_core/src/multiarray/common.c
Outdated
```c
if (itemsize > NPY_MAX_INT) {
    /* We can allow this, but should audit code paths before we do. */
    PyErr_SetString(PyExc_TypeError,
            "string too large to store inside array.");
```
Can this error message and the other new error messages you're adding include the values of `itemsize` and the maximum string size?
Can do, although I am a bit unenthusiastic about bothering, because the alternative error is an overflow, which isn't much more expressive. (Or, most likely, a memory error.)
I'm putting myself in the shoes of someone who happens to hit this error in the wild. It might not be immediately obvious, especially for the unicode case, that the limit is related to INT_MAX.
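For reference, the back-of-the-envelope arithmetic behind that limit (assuming the cap is `NPY_MAX_INT` bytes of itemsize, as in the diff above, and 4 bytes per UCS4 character):

```python
NPY_MAX_INT = 2**31 - 1     # the C int maximum: 2147483647
BYTES_PER_CHAR = 4          # NumPy unicode elements are stored as UCS4
print(NPY_MAX_INT // BYTES_PER_CHAR)  # 536870911 characters per element
```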
Yeah, you aren't wrong that it is easy enough to just add, even if I still think you are more likely to hit other errors that don't explain the issue as well.
Anyway, added (plus one additional check that should never happen, but why not be future-proof...).
numpy/_core/tests/test_strings.py
Outdated
```python
    large_string = "A" * (very_large + 1)
except Exception:
    # We may not be able to create this Python string on 32bit.
    return
```
It would be nice if you could statically detect this case and use `pytest.mark.skipif`. But if that's impossible I think it's better to use `pytest.skip`, so the comment above can go in the reason: https://docs.pytest.org/en/stable/reference/reference.html#pytest.skip
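A minimal sketch of both variants of that suggestion (test names and thresholds here are illustrative, not taken from the PR):

```python
import sys
import pytest

# Statically detectable variant: a 32-bit build cannot index a string
# of 2**31 characters at all, so the skip can be decided up front.
@pytest.mark.skipif(sys.maxsize < 2**32,
                    reason="cannot create a >2 GiB str on 32-bit builds")
def test_large_unicode_static():
    large_string = "A" * (2**31)
    ...

# Runtime variant: if the failure cannot be predicted, pytest.skip()
# still records the reason in the test report, unlike a bare return.
def test_large_unicode_runtime():
    try:
        large_string = "A" * (2**31)
    except (MemoryError, OverflowError):
        pytest.skip("unable to create a Python string this large")
    ...
```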
```c
        return _NPY_ERROR_OCCURRED_IN_CAST;
    }

    loop_descrs[2] = PyArray_DescrNew(loop_descrs[0]);
    if (loop_descrs[2] == NULL) {
        Py_DECREF(loop_descrs[0]);
        Py_DECREF(loop_descrs[1]);
```
oof, good catch!
I have not; I would think that is hard to achieve, but it could be wrong for certain functions in the string ufunc core.
I think that is a choice one can make. But here it is just a historical choice: NumPy has only even supported larger dtypes since 2.0, and we did not audit the code carefully enough to really exploit that.
Go ahead and merge if you still disagree about the error messages. In general I try to err on the side of including more information than not in Python error messages generated from C, since most programmers don't realize that they can grep the C source code for error messages to learn more.
Added an item to #25693 to check this.
A note to myself that this needs two backports.
Also add a future-proof guard, just in case we got a larger string in addition.
Thanks @seberg!