ENH: Add partition/rpartition ufunc for string dtypes #26082

lysnikolaou · 2024-03-19T14:05:30Z

numpy/_core/code_generators/ufunc_docstrings.py

mhvk · 2024-03-19T14:54:21Z

I guess the disadvantage of the boolean is that sum(partition(string, sep), initial='') != string. I guess to reproduce the input, one would do something like np.where(result[1], result[0] + sep + result[2], result[0])? I'd still slightly prefer returning an array of string, but seeing that there is an easy solution, I won't insist.

lysnikolaou · 2024-03-19T15:22:35Z

I tend to agree. While implementing it and writing the tests, I was at some points a bit confused as to what the original input string was, because it wasn't immediately clear from the result tuple. The trade-off is 1 vs 4 bytes per element, if the input is UTF-32, so not a huge amount of memory, for having a cleaner way to reconstruct the original string.

@ngoldbaum What do you think?

ngoldbaum · 2024-03-19T17:19:31Z

The trade-off is 1 vs 4 bytes per element, if the input is UTF-32, so not a huge amount of memory, for having a cleaner way to reconstruct the original string.

And we diverge less from the CPython API too. Go for it and drop the boolean idea :)
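To make the trade-off concrete, here is a minimal pure-Python sketch of the two return conventions discussed above. It uses the built-in str.partition rather than the NumPy ufunc, and partition_with_flag and rebuild_from_flag are hypothetical helpers written only for illustration; they are not part of the PR.

```python
def partition_with_flag(s, sep):
    """Hypothetical boolean-style result: (before, sep_found, after)."""
    before, found_sep, after = s.partition(sep)
    return before, bool(found_sep), after


def rebuild_from_flag(parts, sep):
    """Reconstruct the input from a boolean-style result; needs sep again."""
    before, sep_found, after = parts
    return before + sep + after if sep_found else before


s, sep = "Numpy is nice!", " "

# String-style result (what CPython returns): joining the parts is enough.
assert "".join(s.partition(sep)) == s

# Boolean-style result: the caller must branch on the flag and re-supply sep.
assert rebuild_from_flag(partition_with_flag(s, sep), sep) == s
assert rebuild_from_flag(partition_with_flag(s, "?"), "?") == s
```

With the string-style result, reconstruction never needs the separator or a branch, which is the property the reviewers preferred.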

mhvk

OK, looks good! Mostly comments on the docstrings and tests. Also a suggestion for speed-up, though that can be for another PR.

numpy/_core/code_generators/ufunc_docstrings.py

numpy/_core/strings.py

numpy/_core/tests/test_strings.py

ngoldbaum · 2024-03-20T19:11:39Z

FWIW I’m planning to go over this and add stringdtype support tomorrow.

ngoldbaum · 2024-03-21T21:00:36Z

I started on this but got distracted by other fires and didn’t finish it. Will work more tomorrow.

ngoldbaum

A few inline comments, I also just pushed a commit that adds stringdtype support, would appreciate your eyes on my changes.

ngoldbaum · 2024-03-22T17:58:44Z

numpy/_core/src/umath/string_ufuncs.cpp

+    if (!given_descrs[3] || !given_descrs[4] || !given_descrs[5]) {
+        PyErr_SetString(
+            PyExc_TypeError,
+            "The 'out' kwarg is necessary. Use numpy.strings without it.");


If you remove NPY_UNUSED from the self argument above, you can use self->name to include the ufunc name in the error message.

I also reworded this in my version, take it or leave it, or feel free to give me edits:

PyErr_Format(
    PyExc_TypeError,
    "The '%s' ufunc requires the out keyword to be set. The "
    "python wrapper in numpy.strings can be used without the "
    "out keyword.", self->name);

I ended up realizing that I have the opposite problem with stringdtype and I don't want to support out at all, so the final error message ended up inverted from yours.

numpy/_core/strings.py

ngoldbaum · 2024-03-22T19:28:22Z

The failures on 32 bit architectures are real. Going to try to set up a 32 bit windows python environment to reproduce it...

ngoldbaum · 2024-03-22T20:51:17Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+            goto fail;
+        }
+        else if (NPY_UNLIKELY(i1_isnull || i2_isnull)) {
+            if (!has_string_na) {


I decided against adding support for nan-like nulls. Happy to add it if reviewers think it's needed.

mhvk

A few comments on the StringDType part.

Also a general one: for StringDType it is in principle not necessary to pass in the indices; those can just be found as one goes (which would I think be faster). Might it make sense to have separate ufuncs for that reason? Obviously, that can be done later too!

mhvk · 2024-03-22T21:15:53Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+        PyArray_Descr *loop_descrs[3],
+        npy_intp *NPY_UNUSED(view_offset))
+{
+    if (given_descrs[3] || given_descrs[4] || given_descrs[5]) {


More out of curiosity than anything else, but why doesn't out work? Isn't it fine as long as the dtypes are consistent?

Right now all the np.strings wrappers don’t accept out. They could, in the future, but I figured it wasn’t worth going out of my way to add code paths for in-place operations that can’t happen yet. The fixed-width versions heavily leverage out by precomputing the output width, but they don’t support user-supplied out unless someone calls the ufunc directly with the private name.

mhvk · 2024-03-22T21:19:40Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+            goto fail;
+        }
+        else if (NPY_UNLIKELY(i1_isnull || i2_isnull)) {
+            if (!has_string_na) {


I guess one has to be slightly careful, since one wants sum(parts) to work, so only the left or right result should be NaN, the others still empty strings. It seems that would work automatically, since find would already not return a result.

mhvk · 2024-03-22T21:33:15Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+        // out1 and out3 can be no longer than in1, so conservatively
+        // overallocate buffers big enough to store in1
+        // out2 must be no bigger than in2
+        char *out1_mem = (char *)PyMem_RawCalloc(i1s.size, 1);


We know all the sizes so in principle one can allocate the right size directly.

Though I guess the fact that one needs to allocate a buffer at all exposes a bit of a shortcoming of the API: really, one would just want to allocate the output strings with the known sizes and then pass on the corresponding buffers to string_partition...
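As an illustration of "we know all the sizes", this sketch computes the exact byte length of each of the three outputs once the separator position is known. It is plain Python on bytes objects, not the C++ loop from the PR, and partition_sizes is a hypothetical name.

```python
def partition_sizes(in1: bytes, sep: bytes):
    """Return the exact byte sizes of (before, separator, after).

    If sep is not found, the whole input goes to the first output and the
    other two are empty, mirroring bytes.partition.
    """
    idx = in1.find(sep)
    if idx < 0:
        return len(in1), 0, 0
    return idx, len(sep), len(in1) - idx - len(sep)


data, sep = b"Numpy is nice!", b" "
sizes = partition_sizes(data, sep)

# The computed sizes match what bytes.partition actually produces.
assert sizes == tuple(len(part) for part in data.partition(sep))
```

None of the three sizes can exceed len(in1) or len(sep) respectively, which is why the conservative over-allocation in the diff is safe, just not tight.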

mhvk · 2024-03-22T21:43:36Z

numpy/_core/strings.py

-    return _to_bytes_or_str_array(
-        _vec_string(a, np.object_, 'partition', (sep,)), a)
+    a = np.asanyarray(a)
+    sep = np.asanyarray(sep).astype(a.dtype)


Might as well do np.array(sep, a.dtype, copy=None, subok=True). The .astype() does seem logical.

ngoldbaum · 2024-03-25T18:20:33Z

Also a general one: for StringDType it is in principle not necessary to pass in the indices; those can just be found as one goes (which would I think be faster). Might it make sense to have separate ufuncs for that reason?

Good call. I renamed the ufuncs @lysnikolaou added to _partition_index and _rpartition_index and defined new _partition and _rpartition ufuncs that find and partition all in one ufunc. I think this ends up being simpler even though it does leave some minor duplication of logic between string_buffer.h and stringdtype_ufuncs.cpp.
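The difference between the two flavours can be sketched in pure Python. The function names below are hypothetical stand-ins for the index-taking and find-as-you-go variants; they operate on single Python strings and say nothing about how the real ufunc loops are written.

```python
def partition_index(s, sep, idx):
    """Index-taking variant: the caller has already located sep.

    A negative idx means "not found", as with str.find.
    """
    if idx < 0:
        return s, "", ""
    return s[:idx], sep, s[idx + len(sep):]


def partition(s, sep):
    """Combined variant: search and split in one call."""
    return partition_index(s, sep, s.find(sep))


s, sep = "Numpy is nice!", " "

# Two-step use: a separate find pass produces the index first.
idx = s.find(sep)
assert partition_index(s, sep, idx) == s.partition(sep)

# One-step use: no intermediate index is exposed to the caller.
assert partition(s, sep) == s.partition(sep)
```

The two-step form suits fixed-width dtypes, where the index is needed up front to size the outputs; the one-step form avoids materialising an index array at all.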

numpy/_core/src/umath/string_buffer.h

mhvk

Nice to have the separate loops! Some small remaining comments.

numpy/_core/src/umath/string_buffer.h

numpy/_core/src/umath/stringdtype_ufuncs.cpp

mhvk · 2024-03-25T19:09:11Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+        npy_intp idx;
+
+        if (startposition == STARTPOSITION::FRONT) {
+            idx = fastsearch((char *)i1s.buf, i1s.size, (char *)i2s.buf, i2s.size, -1, FAST_SEARCH);


In principle, nothing wrong with also passing in string_find and string_rfind as auxiliary data and then doing the search with find_like_function *function (as in string_findlike_strided_loop).

EDIT: or, perhaps easier, on the top of the loop define

find_like_function *function = startposition == STARTPOSITION::FRONT ?
    string_find<ENCODING::UTF8> : string_rfind<ENCODING::UTF8>;
// or
int direction = startposition == STARTPOSITION::FRONT ? FAST_SEARCH : FAST_RSEARCH;

(Maybe latter better, assuming string_find is just a wrapper around fastsearch.)

I can't use string_find - its result is defined in terms of code point indices, but I need a byte index, which is exactly what fastsearch gives me, but I can use the suggestion in the edit. Thanks for that. I'll add a comment explaining why string_find/string_rfind can't work since that's not obvious at first glance.
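A small Python illustration of why a code point index cannot stand in for a byte index when slicing a UTF-8 buffer. This uses only built-in str and bytes methods as an analogy; it does not call string_find or fastsearch.

```python
s = "caf\u00e9 au lait"   # U+00E9 (e with acute accent) takes two bytes in UTF-8
sep = " "
raw = s.encode("utf-8")

# Index counted in code points, as str.find reports it.
codepoint_idx = s.find(sep)
# Index counted in bytes, as a search over the encoded buffer reports it.
byte_idx = raw.find(sep.encode("utf-8"))

assert codepoint_idx == 4
assert byte_idx == 5

# Slicing the buffer at the byte index yields valid UTF-8 ...
assert raw[:byte_idx].decode("utf-8") == "caf\u00e9"
# ... while slicing at the code point index cuts the two-byte character in half.
assert raw[:codepoint_idx] == b"caf\xc3"
```

The two indices only agree while everything before the separator is ASCII, so a test suite with ASCII-only inputs would not catch the mix-up.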

mhvk · 2024-03-25T19:32:18Z

Failures look real, sadly!

ngoldbaum · 2024-03-25T19:46:26Z

Failures look real, sadly!

Yup! working on it…

numpy/_core/src/umath/string_ufuncs.cpp

numpy/_core/src/umath/stringdtype_ufuncs.cpp

ngoldbaum · 2024-03-26T19:13:59Z

The test failures are happening because of a more fundamental issue with stringdtype and you can trigger it on a 64 bit architecture with the following script:

import numpy as np

buf = np.array("𐌁𐌁𐌁𐌁𐌂𐌂𐌂𐌂𐌀𐌀𐌀𐌀", dtype="T")
sep1 = np.array("𐌂𐌂𐌂𐌂", dtype="T")

print(np.strings.partition(buf, sep1))

The test in the PR fails on 32 bit and passes on 64 bit because "𐌂𐌂" is 8 bytes, which is a small string on 64 bit but an arena string on 32 bit.

The line of code causing the problem is the use of np.array(sep, dtype=a.dtype, copy=None, subok=True) as @mhvk suggested the other day, which bypasses the code path in NewFromDescr_int that calls the dtype finalization function. Going to experiment a bit to see if I can fix the problem....
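The 8-byte figure above can be checked directly in Python. The size thresholds in this sketch are an assumption on my part (15 bytes on 64-bit and 7 bytes on 32-bit for inline "small" strings); they are not stated in this thread, and storage_kind is a hypothetical helper that ignores any other storage mode StringDType may use.

```python
# Assumed limits for inline "small string" storage; not taken from this thread.
SMALL_STRING_MAX_BYTES = {"64-bit": 15, "32-bit": 7}


def utf8_size(s):
    """Number of bytes s occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def storage_kind(s, arch):
    """Classify s as 'small' or 'arena' under the assumed limits."""
    return "small" if utf8_size(s) <= SMALL_STRING_MAX_BYTES[arch] else "arena"


# Each of these characters lies outside the Basic Multilingual Plane: 4 bytes.
assert utf8_size("𐌂𐌂") == 8

# Under the assumed limits, the PR's test separator is classified differently per platform.
assert storage_kind("𐌂𐌂", "64-bit") == "small"
assert storage_kind("𐌂𐌂", "32-bit") == "arena"

# The four-character separator from the script is 16 bytes, beyond both limits.
assert storage_kind("𐌂𐌂𐌂𐌂", "64-bit") == "arena"
```

If the assumed limits hold, this is consistent with the report: the original test only crossed the threshold on 32-bit, while the longer separator in the script crosses it everywhere.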

ngoldbaum · 2024-03-26T20:21:17Z

I ended up adding a stringdtype-specific hack to _array_fromobject_generic. For stringdtype specifically, it doesn't ever make sense to use the user-provided dtype instance in this case; we should always use the instance that was attached to the original array, otherwise we'll lose access to the arena containing the data needed by the view.

@seberg @mhvk maybe there's a better way to spell the stringdtype-specific hack I added? Or maybe we need a dtype flag for this?

The test_creation_from_view test I added in the last commit fails on main.

mhvk

Hmm, that's unfortunate! (But good sleuthing!) Ideally, we avoid this in this PR - does my suggestion of passing sep directly into the ufunc work?

mhvk · 2024-03-26T20:42:18Z

numpy/_core/src/multiarray/multiarraymodule.c

@@ -1611,7 +1611,7 @@ _array_fromobject_generic(
        oldtype = PyArray_DESCR(oparr);
        if (PyArray_EquivTypes(oldtype, dtype)) {
            if (copy != NPY_COPY_ALWAYS && STRIDING_OK(oparr, order)) {
-                if (oldtype == dtype) {
+                if (oldtype == dtype || oldtype->kind == 'T') {


Hmm, is this quite right? What if the user provides a dtype with a different na_object? I'd really expect result.dtype == requested_dtype... Or does EquivTypes guarantee na_object is the same?

Ah good point, you're right, we need to do something a little more complicated here, I'll revert this.

mhvk · 2024-03-26T20:46:27Z

numpy/_core/strings.py

-    return _to_bytes_or_str_array(
-        _vec_string(a, np.object_, 'partition', (sep,)), a)
+    a = np.asanyarray(a)
+    sep = np.array(sep, dtype=a.dtype, copy=None, subok=True)


Can we just forgo this for StringDType? I.e., move it after the if and let the ufunc machinery take care of it? Or would the wrong ufunc be called in that case if sep was an array with U dtype?

I'd need to add a promoter too, but it does work if I just call astype like I had it originally. Would you be OK with adding the extra copy in this PR and then we can come back to dealing with the views later?

Yes, let's just go back to your .astype()! sep is usually just a single string, so the copy will generally not matter much.

mhvk

Only little things left - most important probably to remove the extraneous things (a rebase/squash may be in order)

mhvk · 2024-03-26T21:17:05Z

numpy/_core/code_generators/ufunc_docstrings.py

+
+    Examples
+    --------
+    >>> x = np.array(["Numpy is nice!"])


Tiny thing, but let's move this line together with the np.strings call (could even do np.strings.partition(["Numpy is nice!"], " "))

mhvk · 2024-03-26T21:17:39Z

numpy/_core/code_generators/ufunc_docstrings.py

+
+    Examples
+    --------
+    >>> a = np.array(['aAaAaA', '  aA  ', 'abBABba'])


Same here, combine with example at l.5125

mhvk · 2024-03-26T21:17:58Z

numpy/_core/code_generators/ufunc_docstrings.py

+
+    Examples
+    --------
+    >>> x = np.array(["Numpy is nice!"], dtype="T")


mhvk · 2024-03-26T21:18:47Z

numpy/_core/code_generators/ufunc_docstrings.py

+
+    Parameters
+    ----------
+    x1 : array-like, with ``StringDType``, ``bytes_``, or ``str_`` dtype


This can no longer be StringDType, correct? Need to adjust the docstring (same for reverse one)

mhvk · 2024-03-26T21:19:03Z

numpy/_core/code_generators/ufunc_docstrings.py

+
+    Parameters
+    ----------
+    x1 : array-like, with ``StringDType``, ``bytes_``, or ``str_`` dtype


This one only takes and produces StringDType!

mhvk · 2024-03-26T21:20:19Z

numpy/_core/src/multiarray/multiarraymodule.c

@@ -1611,7 +1611,7 @@ _array_fromobject_generic(
        oldtype = PyArray_DESCR(oparr);
        if (PyArray_EquivTypes(oldtype, dtype)) {
            if (copy != NPY_COPY_ALWAYS && STRIDING_OK(oparr, order)) {
-                if (oldtype == dtype) {
+                if (oldtype == dtype || oldtype->kind == 'T') {


This should now be removed, right!?

mhvk · 2024-03-26T21:21:10Z

numpy/_core/src/multiarray/stringdtype/utf8_utils.c

@@ -299,6 +299,26 @@ num_codepoints_for_utf8_bytes(const unsigned char *s, size_t *num_codepoints, si
    return state != UTF8_ACCEPT;
 }

+NPY_NO_EXPORT npy_int64


This function is no longer used! Since we removed the other bits, perhaps this one too?

mhvk · 2024-03-26T21:21:32Z

numpy/_core/src/multiarray/stringdtype/utf8_utils.h

@@ -39,6 +39,9 @@ utf8_character_index(
        const char* start_loc, size_t start_byte_offset, size_t start_index,
        size_t search_byte_offset, size_t buffer_size);

+NPY_NO_EXPORT npy_int64


Also remove from header if we don't use it...

numpy/_core/strings.py

ngoldbaum · 2024-03-26T21:42:49Z

Please feel free to squash merge, I don't want to rewrite history on a branch on @lysnikolaou's fork.

mhvk

OK, let's get it in!

ENH: Add partition/rpartition ufunc for string dtypes

5993849

Closes numpy#25993.

lysnikolaou requested review from ngoldbaum and mhvk March 19, 2024 14:05

github-actions bot added the 01 - Enhancement label Mar 19, 2024

Fix doctests

8a35bf4

mhvk reviewed Mar 19, 2024

Fix docstrings in ufunc_docstrings.py as well

0376078

Return array with the separators // optimize using find ufunc results

a0084be

mhvk reviewed Mar 20, 2024

Address feedback

f0d1bc6

Fix chararray __array_finalize__

481a013

ENH: add stringdtype partition/rpartition

edd4f11

ngoldbaum reviewed Mar 22, 2024

BUG: remove unnecessary size_t cast

0ff6f95

ngoldbaum reviewed Mar 22, 2024

BUG: fix error handling and resource cleanup

afeb289

mhvk reviewed Mar 22, 2024

MNT: refactor so stringdtype can combine find and partition

4958d70

ngoldbaum reviewed Mar 25, 2024

ngoldbaum added 2 commits March 25, 2024 12:29

Merge branch 'main' into string-ufuncs-partition

f77aecc

MNT: update signatures to reflect const API changes

18b4ec4

mhvk reviewed Mar 25, 2024

ngoldbaum reviewed Mar 25, 2024

MNT: simplfy fastsearch call

a7c085c

mhvk reviewed Mar 25, 2024

ngoldbaum and others added 3 commits March 25, 2024 15:37

MNT: move variable binding out of inner loop

8ff579b

Fix error message about out; fix promoter

0235134

Remove unused import in defchararray; add assertion

50e1b6e

BUG: don't use a user-provided descriptor to initialize a new stringdtype view

61917f0

mhvk reviewed Mar 26, 2024

MNT: back out attempted fix for stringdtype view problem

4d644fa

mhvk reviewed Mar 26, 2024

ngoldbaum mentioned this pull request Mar 26, 2024

BUG: creating an array that shares memory with a stringdtype array can fail #26140

Closed

MNT: address code review comments

457c1aa

mhvk approved these changes Mar 26, 2024

mhvk merged commit e1bf1d6 into numpy:main Mar 26, 2024

neutrinoceros mentioned this pull request Apr 4, 2024

BUG: declare private str ufuncs as unsupported (np.core.math._(r)partition(_index)) astropy/astropy#16270

Merged

1 task
