MAINT: Ensure correct handling for very large unicode strings #27875

seberg · 2024-11-28T18:43:24Z

In the future, we can handle these strings (in parts we maybe already can), but for now we have to stick to `int` length because more of the code needs cleanup before it can use larger lengths safely. (For user dtypes this is less of a problem, although corner cases probably exist.)

This adds necessary checks to avoid large unicode dtypes.

ngoldbaum

Just out of curiosity, did you look at whether StringDType handles these cases as well?

I guess we can see if anyone complains because suddenly NumPy objects to creating very large strings. It looks like the testing here was poor at best.

If we can ban strings longer than INT_MAX bytes for stringdtype, then that unlocks some interesting possible memory savings on 64 bit architectures: #26059

ngoldbaum · 2024-12-02T21:57:11Z

numpy/_core/src/multiarray/common.c

+        if (itemsize > NPY_MAX_INT) {
+            /* We can allow this, but should audit code paths before we do. */
+            PyErr_SetString(PyExc_TypeError,
+                    "string too large to store inside array.");


Can this error message and the other new error messages you're adding include the values of itemsize and the maximum string size?

Can do it, although I am a bit unenthusiastic about bothering because the alternative error is an overflow which isn't much more expressive. (Or most likely a memory error.)

I'm putting myself in the shoes of someone who happens to hit this error in the wild. It might not be immediately obvious, especially for the unicode case, that the limit is related to INT_MAX.

Yeah you aren't wrong that it is easy enough to just add, even if I still think you are more likely to hit other errors that don't explain the issue as well.

Anyway, added (plus one additional check that should never happen, but why not be future proof...)
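For readers following this thread, here is a minimal Python sketch of why the unicode limit is not obviously tied to INT_MAX, and what an error message carrying both values could look like. It assumes a 32-bit C `int` and NumPy's fixed-width unicode storage of 4 bytes per character. The helper name `check_unicode_itemsize` and the message wording are hypothetical; they are not the code or text added in this PR.

```python
# Assumption: C int is 32 bits wide, so INT_MAX is 2**31 - 1.
INT_MAX = 2**31 - 1
# Assumption: fixed-width unicode is stored as UCS4, 4 bytes per character.
UCS4_BYTES_PER_CHAR = 4


def check_unicode_itemsize(n_chars):
    """Hypothetical helper: return the itemsize in bytes, or raise if too large."""
    itemsize = n_chars * UCS4_BYTES_PER_CHAR
    if itemsize > INT_MAX:
        max_chars = INT_MAX // UCS4_BYTES_PER_CHAR
        # Report both the offending size and the limit, in bytes and characters.
        raise TypeError(
            f"string of {n_chars} characters needs {itemsize} bytes, "
            f"but the maximum itemsize is {INT_MAX} bytes "
            f"({max_chars} characters)."
        )
    return itemsize
```

Under these assumptions the character limit is `INT_MAX // 4`, roughly 537 million characters, which is the kind of detail a user is unlikely to guess from a message without numbers.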

ngoldbaum · 2024-12-02T22:03:04Z

numpy/_core/tests/test_strings.py

+        large_string = "A" * (very_large + 1)
+    except Exception:
+        # We may not be able to create this Python string on 32bit.
+        return


It would be nice if you could statically detect this case and use `pytest.mark.skipif`. But if that's impossible, I think it's better to use `pytest.skip`, so the comment above can go in the reason: https://docs.pytest.org/en/stable/reference/reference.html#pytest.skip
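The two options mentioned here can be sketched roughly as follows. This is an illustration only, not the test that was merged: the constant `VERY_LARGE`, the `IS_32BIT` check based on `sys.maxsize`, and both test names are assumptions, and the NumPy part of the real test is left out so the sketch only shows the skip mechanics.

```python
import sys

import pytest

# Assumption: 32-bit C int and 4 bytes per unicode character.
INT_MAX = 2**31 - 1
VERY_LARGE = INT_MAX // 4

# Assumption: a 32-bit Python build can be detected via sys.maxsize.
IS_32BIT = sys.maxsize <= 2**31 - 1


@pytest.mark.skipif(
    IS_32BIT, reason="may not be able to create this Python string on 32bit"
)
def test_large_string_static_skip():
    # Option 1: decide statically, before the test body runs.
    large_string = "A" * (VERY_LARGE + 1)
    assert len(large_string) == VERY_LARGE + 1


def test_large_string_runtime_skip():
    # Option 2: try to build the string and report a skip with a reason,
    # instead of returning silently as if the test had passed.
    try:
        large_string = "A" * (VERY_LARGE + 1)
    except (MemoryError, OverflowError):
        pytest.skip("could not create this Python string on this platform")
    assert len(large_string) == VERY_LARGE + 1
```

Either way the test report shows a skip with its reason, rather than a pass for a test that did nothing.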

ngoldbaum · 2024-12-02T22:03:49Z

numpy/_core/src/umath/string_ufuncs.cpp

        return _NPY_ERROR_OCCURRED_IN_CAST;
    }

    loop_descrs[2] = PyArray_DescrNew(loop_descrs[0]);
    if (loop_descrs[2] == NULL) {
+        Py_DECREF(loop_descrs[0]);
+        Py_DECREF(loop_descrs[1]);


oof, good catch!

seberg · 2024-12-03T13:28:11Z

> Just out of curiosity, did you look at whether StringDType handles these cases as well?

I have not, I would think that is hard to achieve but for certain functions it could be wrong in the string ufunc core, though.

> If we can ban strings longer than INT_MAX bytes for stringdtype

I think that is a choice one can make. But here it is just a historical choice: NumPy has only supported larger dtypes since 2.0, and we did not audit the code carefully enough to really exploit that.

ngoldbaum

Go ahead and merge if you still disagree about the error messages. In general, I try to err on the side of including more information rather than less in Python error messages generated from C, since most programmers don't realize that they can grep the C source code for error messages to learn more.

ngoldbaum · 2024-12-03T15:47:05Z

> I have not, I would think that is hard to achieve but for certain functions it could be wrong in the string ufunc core, though.

Added an item to #25693 to check this.

charris · 2024-12-03T18:32:07Z

A note to myself that this needs two backports.

Also add future proof guard, just in case we got a larger string in addition.

ngoldbaum · 2024-12-04T16:43:02Z

Thanks @seberg!

seberg added the 09 - Backport-Candidate label Nov 28, 2024

github-actions bot added the 03 - Maintenance label Nov 28, 2024

seberg force-pushed the fix-unicode-length branch 4 times, most recently from 474c052 to 4f85950 Compare November 29, 2024 10:07

seberg force-pushed the fix-unicode-length branch from 4f85950 to faf10e9 Compare November 29, 2024 10:37

seberg requested a review from ngoldbaum December 2, 2024 21:51

ngoldbaum reviewed Dec 2, 2024

seberg added this to the 2.1.4 release milestone Dec 3, 2024

TST: Use skipif in test to signal that the test did nothing

446f52a

ngoldbaum approved these changes Dec 3, 2024

ngoldbaum mentioned this pull request Dec 3, 2024

TSK: Follow-up things for stringdtype #25693

Open

14 tasks

Add length information to exception

acc31aa

Also add future proof guard, just in case we got a larger string in addition.

ngoldbaum merged commit 7901f71 into numpy:main Dec 4, 2024
69 checks passed

charris mentioned this pull request Dec 4, 2024

MAINT: Ensure correct handling for very large unicode strings #27904

Merged

seberg deleted the fix-unicode-length branch December 4, 2024 17:23

charris mentioned this pull request Dec 4, 2024

MAINT: Ensure correct handling for very large unicode strings #27905

Merged

charris removed the 09 - Backport-Candidate label Dec 4, 2024
