ENH: Implement np.strings.slice as a gufunc #27789

ArvidJB · 2024-11-18T17:20:30Z

This commit adds a np.strings.slice function which vectorizes slicing of string arrays. For example

    >>> a = np.array(['hello', 'world'])
    >>> np.strings.slice(a, 2)
    array(['he', 'wo'], dtype='<U5')

This supports fixed-length and variable-length string dtypes. It also supports broadcasting the start, stop, step args:

    >>> b = np.array(['hello world', 'γεια σου κόσμε', '你好世界', '👋 🌍️'], dtype=np.dtypes.StringDType())
    >>> np.strings.slice(b, [3, 0, 2, 1], -1)
    array(['lo worl', 'γεια σου κόσμ', '世', ' 🌍'], dtype=StringDType())
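
For readers without a NumPy build that includes this function, the semantics shown above can be modeled in pure Python. This is only an illustrative sketch: `slice_strings` is a hypothetical helper, not part of NumPy, and it handles flat lists only rather than full array broadcasting.

```python
def slice_strings(strings, start=None, stop=None, step=None):
    """Pure-Python model of element-wise string slicing (hypothetical helper)."""
    # Mimic builtins.slice: a single positional argument is treated as stop.
    if stop is None and step is None:
        start, stop = None, start

    n = len(strings)

    def broadcast(arg):
        # A scalar (or None) applies to every string; a sequence must match in length.
        if isinstance(arg, (list, tuple)):
            if len(arg) != n:
                raise ValueError("argument length does not match input")
            return list(arg)
        return [arg] * n

    starts, stops, steps = broadcast(start), broadcast(stop), broadcast(step)
    # Python str slicing works on codepoints, which is the behavior being modeled.
    return [s[a:b:c] for s, a, b, c in zip(strings, starts, stops, steps)]


print(slice_strings(["hello", "world"], 2))
print(slice_strings(["hello world", "你好世界"], [3, 2], -1))
```

The first call mirrors the one-argument form (stop only); the second mirrors a per-element start with a shared stop.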

Closes #8579

seberg · 2024-11-20T09:33:09Z

I didn't have a real look, but this looks thorough! I'm mainly commenting to link an ancient PR that introduces this for old strings: gh-20694.
I think this ufunc version (which copies) is better though, both implementation-wise and behavior-wise. Of course, implementing it as a ufunc would have been hard back then.

ArvidJB · 2024-11-21T18:14:49Z

Thanks @seberg for taking a look!

I had a couple of questions:

1. Is the API design acceptable? I intentionally made slice take only positional arguments to mimic the behavior of builtins.slice. Or should we support keyword arguments, like np.strings.slice(a, step=-1)?
2. For variable-length strings we end up doing three passes over the string: (1) to compute num_codepoints, (2) to compute outsize, (3) to copy each codepoint. I cannot see a way to avoid this, but maybe someone has an idea?
3. Negative steps (step < 0) are tricky, because we could accidentally read past the beginning of the string. I would be thankful if someone could double-check the logic I wrote there.
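
One way to sanity-check the negative-step bounds is to compare against what CPython's own `slice.indices()` produces, since it normalizes and clamps start and stop for both directions of travel. A minimal sketch, assuming the goal is to match Python's slicing rules; `selected_indices` is a hypothetical name, and this is not the C++ logic in the PR:

```python
def selected_indices(length, start, stop, step):
    """Return the character positions that s[start:stop:step] visits."""
    # slice.indices() resolves None and negative values and clamps them to
    # the valid range, so the resulting range never leaves [0, length).
    lo, hi, stride = slice(start, stop, step).indices(length)
    return list(range(lo, hi, stride))


s = "hello"
for args in [(None, None, -1), (10, -10, -2), (3, None, -1), (-100, 100, 3)]:
    picked = selected_indices(len(s), *args)
    # Every visited index is in bounds, and the result matches native slicing.
    assert all(0 <= i < len(s) for i in picked)
    assert "".join(s[i] for i in picked) == s[slice(*args)]
```

Any index a C implementation computes for a negative step could be checked against this list in a test.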

seberg · 2024-11-22T09:24:17Z

We can always start with only positional and relax it if someone feels strongly (I also wondered briefly if we should accept slices, but probably better not).

I am hoping @lysnikolaou or @ngoldbaum will have time to dive in (but I would not expect this to happen very quickly necessarily).
In general, I would think you could at least bring it down to 2 passes (merge the first two), although it may need a custom new method on the buffer helper implementation?
(I'll assume that small steps, usually one, are typical, and that we shouldn't worry much about optimizing large steps.)

ngoldbaum · 2024-11-22T18:23:58Z

Neat, thanks for the ping. I'd like to spend some time looking closely at this. Do you mind if I push directly to this branch with fixes?

pypy has a neat optimization that allows O(1) indexing into UTF-8 strings by storing effectively a reverse index into the string. There are also some clever tricks to bring the memory overhead down. Adopting something like that would help for performance on operations like this that assume indexing into strings is cheap.

ArvidJB · 2024-11-22T22:35:16Z

> Neat, thanks for the ping. I'd like to spend some time looking closely at this. Do you mind if I push directly to this branch with fixes?

Sure, that sounds fine to me!

> pypy has a neat optimization that allows O(1) indexing into UTF-8 strings by storing effectively a reverse index into the string. There are also some clever tricks to bring the memory overhead down. Adopting something like that would help for performance on operations like this that assume indexing into strings is cheap.

Oh, that's clever! Curious to see the details. We should probably add some benchmarks? It's probably worth having a dedicated codepath for step==1.

ngoldbaum · 2024-11-22T23:09:23Z

So there's a repo here: https://github.com/pypy/fast-utf8-methods. The pypy UTF-8 code also lives in pypy here: https://github.com/pypy/pypy/tree/6742499bbf3fc0aa63702fe4aa27147e11050c74/rpython/rlib/fastutf8

@mattip might know if there's a technical explanation somewhere of how this works. I'm not aware of a blog post or paper.

This is more of a long-term idea. I'm not sure if the memory overhead of building the index is worth it - not everyone wants to do fast string indexing by character.

mattip · 2024-11-23T21:06:28Z

> I'm not sure if the memory overhead of building the index is worth it - not everyone wants to do fast string indexing by character.

I don't think there is a technical summary of the SIMD utf8 code. The code actually was never merged into PyPy, it was an experiment that one developer tried to push forward but was never completed. PyPy does build an index of every 4th codepoint which is only allocated and built on-demand when a utf-8 string is sliced.
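
The every-4th-codepoint idea can be sketched roughly as follows. This is a toy model based only on the description above, not PyPy's actual code; `STRIDE`, `build_sparse_index` and `byte_offset` are hypothetical names, and the input is assumed to be valid UTF-8.

```python
STRIDE = 4  # store one byte offset per 4 codepoints


def build_sparse_index(data):
    """Byte offsets of codepoints 0, 4, 8, ... in valid UTF-8 bytes."""
    index = []
    codepoint = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte: a codepoint starts here
            if codepoint % STRIDE == 0:
                index.append(offset)
            codepoint += 1
    return index


def byte_offset(data, index, n):
    """Byte offset of codepoint n (must be a valid codepoint position)."""
    offset = index[n // STRIDE]          # jump to the nearest stored codepoint
    for _ in range(n % STRIDE):          # then walk at most STRIDE - 1 codepoints
        offset += 1
        while offset < len(data) and (data[offset] & 0xC0) == 0x80:
            offset += 1
    return offset


text = "γεια σου κόσμε"
data = text.encode("utf-8")
index = build_sparse_index(data)
for n in range(len(text)):
    assert byte_offset(data, index, n) == len(text[:n].encode("utf-8"))
```

The index costs one integer per four codepoints, and each lookup scans a bounded number of bytes, which is the trade-off between memory overhead and indexing speed discussed above.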

ngoldbaum · 2024-12-08T22:50:50Z

Reminder to myself to look through this sometime next week.

ngoldbaum

I didn't go through the C++ code yet but I spotted two issues in the python layer on my first pass

numpy/_core/defchararray.py

numpy/_core/strings.py

ngoldbaum

I think it is necessary to do this in two passes for UTF-8 but I think you only need to do one pass over the whole string if you allocate another temporary buffer to build and save a byte index from the input string into the output string on the first pass.

I don't think it makes sense to build and cache an index for the whole string, but there's no need to calculate the index of each character in the output string twice.
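
A rough Python sketch of that idea: one pass records the byte offset of each codepoint in a temporary list, and the copy step reuses those offsets instead of decoding again. The function names are hypothetical, valid UTF-8 is assumed, and for simplicity the table covers the whole string, so this is not the C++ implementation in the PR.

```python
def codepoint_offsets(data):
    """One pass: byte offset of every codepoint, plus an end sentinel."""
    offsets = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    offsets.append(len(data))
    return offsets


def slice_utf8(data, start, stop, step):
    """Slice UTF-8 bytes by codepoint using a temporary offset table."""
    offsets = codepoint_offsets(data)         # temporary, not cached
    n = len(offsets) - 1                      # number of codepoints
    lo, hi, stride = slice(start, stop, step).indices(n)
    if stride == 1:
        # Contiguous case: one copy whose size is a difference of two offsets.
        return data[offsets[lo]:offsets[max(lo, hi)]]
    out = bytearray()
    for i in range(lo, hi, stride):
        # offsets[i + 1] - offsets[i] is the byte length of codepoint i.
        out += data[offsets[i]:offsets[i + 1]]
    return bytes(out)


text = "你好世界 hello"
for args in [(None, None, None), (2, -1, None), (None, None, -1), (1, None, 2)]:
    assert slice_utf8(text.encode("utf-8"), *args).decode("utf-8") == text[slice(*args)]
```

The temporary list trades one allocation for not having to recompute character positions during the copy.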

numpy/_core/src/umath/stringdtype_ufuncs.cpp

ngoldbaum · 2024-12-12T19:30:42Z

This also needs a release note.

ngoldbaum · 2024-12-25T15:15:47Z

Just a heads-up: I'm on Christmas break right now. I'm planning to look at this in the new year.

numpy/_core/src/umath/stringdtype_ufuncs.cpp

numpy/char/__init__.py

ngoldbaum · 2025-01-03T22:33:29Z

@lysnikolaou is there any chance I can get you to do a high-level once-over on this PR? There's no need to go through the new C++ code in detail, I just did that.

ngoldbaum · 2025-01-06T16:51:10Z

@mhvk could you give this one a once-over if you have a little time?

mhvk

This looks very nice! I did still have a few comments, see in-line.

mhvk · 2025-01-06T17:40:05Z

numpy/_core/src/umath/stringdtype_ufuncs.cpp

+        // compute outsize
+        npy_intp outsize = 0;
+        for (int i = start; step > 0 ? i < stop : i > stop; i += step) {
+            outsize += num_bytes_for_utf8_character(codepoint_offsets[i]);


This seems weird - you already calculated the number of bytes - can one not just take a difference between the relevant offsets? (Alternatively, perhaps there should also be a std::vector<unsigned char> code_point_lengths?)

I didn't want to add the extra allocation because this function is so simple it seemed better to just call it twice. That said the allocation is amortized over the array so perhaps it doesn't matter.

What about making num_bytes_for_utf8_character a static inline function? One can also make it branchless, I think, with some bitmath...

Yes, you're right, it should be possible to count the number of bytes in UTF-8 by bitmath; after all, all information on a character's length is in the first byte (e.g., http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html)

p.s. Totally fine to leave this for a subsequent PR!
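
For illustration, here is the bit-math idea in Python, where it shows the arithmetic rather than any real speedup; the same expressions carry over to C or C++. Both helpers are hypothetical names and assume valid UTF-8 input:

```python
def utf8_char_length(lead):
    """Byte length of a UTF-8 character, given its (valid) lead byte."""
    # Each comparison contributes 0 or 1, so no if/else chain is needed:
    # 0xxxxxxx -> 1, 110xxxxx -> 2, 1110xxxx -> 3, 11110xxx -> 4.
    return 1 + (lead >= 0xC0) + (lead >= 0xE0) + (lead >= 0xF0)


def count_codepoints(data):
    """Count codepoints by counting bytes that are not 10xxxxxx continuations."""
    return sum((b & 0xC0) != 0x80 for b in data)


for ch in "aγ你👋":
    encoded = ch.encode("utf-8")
    assert utf8_char_length(encoded[0]) == len(encoded)
assert count_codepoints("γεια σου".encode("utf-8")) == 8
```

Counting non-continuation bytes is also the basis for processing several bytes at a time with word-sized or SIMD operations.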

There's also https://github.com/simdutf/simdutf. But I will leave all of this as an exercise for the reader....

Here's a link to the codepoint counting implementation:
https://github.com/simdutf/simdutf/blob/c4c8e0c09c8b65d4d729ae7192f62ae1eac24b4d/src/generic/utf8.h#L10

And for AVX512: https://github.com/simdutf/simdutf/blob/c4c8e0c09c8b65d4d729ae7192f62ae1eac24b4d/src/icelake/implementation.cpp#L1272

numpy/_core/src/umath/stringdtype_ufuncs.cpp

numpy/_core/code_generators/ufunc_docstrings.py

numpy/_core/strings.py

numpy/_core/tests/test_strings.py

mhvk

Two utter nitpicks, and one query as I'm confused why the asanyarray would be needed - but clearly I'm missing something too...

None of this is important, so I'll approve now (though probably need to squash-merge given the large number of commits).

numpy/_core/src/umath/stringdtype_ufuncs.cpp

numpy/_core/strings.py


ArvidJB · 2025-01-08T03:14:07Z

I rebased on main and squashed the commits. Is this ready to be merged?

seberg · 2025-01-08T09:43:38Z

> I rebased on main and squashed the commits. Is this ready to be merged?

I'll have Nathan have another look over, but likely. The CircleCI failure is real and we should address it (a tiny thing): The documentation is slightly wrong somewhere:

1669     Examples
1670     --------
1671     >>> import numpy as np
1672     >>> a = np.array(['hello', 'world'])
1673     >>> np.char.slice(a, 2)
UNEXPECTED EXCEPTION: AttributeError("module 'numpy.char' has no attribute 'slice'")

ArvidJB · 2025-01-08T15:30:30Z

> > I rebased on main and squashed the commits. Is this ready to be merged?
>
> I'll have Nathan have another look over, but likely. The CircleCI failure is real and we should address it (a tiny thing): The documentation is slightly wrong somewhere:
>
> 1669     Examples
> 1670     --------
> 1671     >>> import numpy as np
> 1672     >>> a = np.array(['hello', 'world'])
> 1673     >>> np.char.slice(a, 2)
> UNEXPECTED EXCEPTION: AttributeError("module 'numpy.char' has no attribute 'slice'")

Ah, looks like doctest running was fixed in one of the commits included in the rebase!

We had decided to not add np.char.slice since it's supposedly a deprecated module, but I had not fixed the doctests.

Everything is passing now.

ngoldbaum · 2025-01-08T15:31:35Z

Thanks for your patience on getting this reviewed and merged @ArvidJB. If you spot other bugs or missing features in np.strings please don’t hesitate to follow up with more issues or PRs!

mhvk · 2025-01-08T18:18:37Z

Thanks, @ArvidJB, really nice!

github-actions bot added the 01 - Enhancement label Nov 18, 2024

ngoldbaum self-assigned this Dec 8, 2024

ngoldbaum reviewed Dec 11, 2024

numpy/_core/defchararray.py (outdated, resolved)

numpy/_core/strings.py (outdated, resolved)

ngoldbaum reviewed Dec 12, 2024

ngoldbaum added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Dec 12, 2024

ArvidJB requested a review from ngoldbaum December 25, 2024 04:37

ngoldbaum reviewed Jan 3, 2025

numpy/_core/src/umath/stringdtype_ufuncs.cpp (outdated, resolved)

ngoldbaum reviewed Jan 3, 2025

numpy/char/__init__.py (outdated, resolved)

mhvk reviewed Jan 6, 2025

ArvidJB requested a review from mhvk January 7, 2025 02:47

mhvk approved these changes Jan 7, 2025

numpy/_core/src/umath/stringdtype_ufuncs.cpp (outdated, resolved)

numpy/_core/src/umath/stringdtype_ufuncs.cpp (outdated, resolved)

numpy/_core/strings.py (resolved)

ArvidJB force-pushed the string_slice_gufunc branch from 323aa73 to 2fa92ed Compare January 8, 2025 02:39

ArvidJB force-pushed the string_slice_gufunc branch from 2fa92ed to 56c900a Compare January 8, 2025 03:00

Fix reference to numpy.char.slice

cafa9c2

ngoldbaum removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jan 8, 2025

ngoldbaum merged commit 2e700c6 into numpy:main Jan 8, 2025
67 checks passed

mwtoews mentioned this pull request Jan 8, 2025

ENH: Added np.char.slice_ #20694

Closed

neutrinoceros mentioned this pull request Jan 9, 2025

TST: declare str ufunc np._core.umath._slice as unsupported astropy/astropy#17613

Merged

1 task

ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull request May 16, 2025

DOC: __trunc__ for scalars (numpy#27789)

245bee9

ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull request May 16, 2025

DOC: __trunc__ for scalars (numpy#27789)

bb11cf4

jorenham mentioned this pull request May 25, 2025

TYP: annotate strings.slice #29048

Merged
