Skip to content

ENH: Implement np.strings.slice as a gufunc #27789

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 8, 2025

Conversation

ArvidJB
Copy link
Contributor

@ArvidJB ArvidJB commented Nov 18, 2024

This commit adds a np.strings.slice function which vectorizes slicing of string arrays. For example

    >>> a = np.array(['hello', 'world'])
    >>> np.char.slice(a, 2)
    array(['he', 'wo'], dtype='<U5')

This supports fixed-length and variable-length string dtypes. It also supports broadcasting the start, stop, step args:

    >>> b = np.array(['hello world', 'γεια σου κόσμε', '你好世界', '👋 🌍️'], dtype=np.dtypes.StringDType())
    >>> np.strings.slice(b, [3, 0, 2, 1], -1)
    array(['lo worl', 'γεια σου κόσμ', '世', ' 🌍'], dtype=StringDType())

Closes #8579

@seberg
Copy link
Member

seberg commented Nov 20, 2024

Didn't have a real look, but looks thorough! Mainly to link that there is some ancient PR that introduces it for old strings in gh-20694.
I think this ufunc version (which copies) is better though (both implementation and behavior wise, of course implementing it as a ufunc would have been hard back then).

@ArvidJB
Copy link
Contributor Author

ArvidJB commented Nov 21, 2024

Thanks @seberg for taking a look!

I had a couple of questions:

  1. is the API design acceptable? I intentionally made slice only take positional arguments to mimic builtins.slice's behavior. Or should we support keyword arguments, like np.strings.slice(a, step=-1)?
  2. For variable-length strings we end up doing three passes over the string: (1) to compute num_codepoints, (2) to compute outsize, (3) to copy each codepoint. I cannot see a way to avoid this, but maybe someone has an idea?
  3. Negative steps, where "step < 1" are tricky, because we could potentially read past the beginning of the string by accident. I would be thankful if someone could double-check the logic I wrote there.

@seberg
Copy link
Member

seberg commented Nov 22, 2024

We can always start with only positional and relax it if someone feels strongly (I also wondered briefly if we should accept slices, but probably better not).

I am hoping @lysnikolaou or @ngoldbaum will have time to dive in (but I would not expect this to happen very quickly necessarily).
In general, I would think you could at least bring it down to 2 passes (merge the first two), although it may need a custom new method on the buffer helper implementation?
(I'll assume that small, usually one, steps are typically, and we shouldn't worry about optimizing large steps much.)

@ngoldbaum
Copy link
Member

Neat, thanks for the ping. I'd like to spend some time looking closely at this. Do you mind if I push directly to this branch with fixes?

pypy has a neat optimization that allows O(1) indexing into UTF-8 strings by storing effectively a reverse index into the string. There are also some clever tricks to bring the memory overhead down. Adopting something like that would help for performance on operations like this that assume indexing into strings is cheap.

@ArvidJB
Copy link
Contributor Author

ArvidJB commented Nov 22, 2024

Neat, thanks for the ping. I'd like to spend some time looking closely at this. Do you mind if I push directly to this branch with fixes?

Sure, that sounds fine to me!

pypy has a neat optimization that allows O(1) indexing into UTF-8 strings by storing effectively a reverse index into the string. There are also some clever tricks to bring the memory overhead down. Adopting something like that would help for performance on operations like this that assume indexing into strings is cheap.

Oh, that's clever! Curious to see the details. We should probably add some benchmarks? It's probably worth to have a dedicated codepath for step==1.

@ngoldbaum
Copy link
Member

So there's a repo here: https://github.com/pypy/fast-utf8-methods, the pypy UTF-8 code also lives in pypy here: https://github.com/pypy/pypy/tree/6742499bbf3fc0aa63702fe4aa27147e11050c74/rpython/rlib/fastutf8

@mattip might know if there's a technical explanation somewhere of how this works. I'm not aware of a blog post or paper.

This is more of a long-term idea. I'm not sure if the memory overhead of building the index is worth it - not everyone wants to do fast string indexing by character.

@mattip
Copy link
Member

mattip commented Nov 23, 2024

I'm not sure if the memory overhead of building the index is worth it - not everyone wants to do fast string indexing by character.

I don't think there is a technical summary of the SIMD utf8 code. The code actually was never merged into PyPy, it was an experiment that one developer tried to push forward but was never completed. PyPy does build an index of every 4th codepoint which is only allocated and built on-demand when a utf-8 string is sliced.

@ngoldbaum
Copy link
Member

Reminder to myself to look through this sometime next week.

@ngoldbaum ngoldbaum self-assigned this Dec 8, 2024
Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't go through the C++ code yet but I spotted two issues in the python layer on my first pass

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is necessary to do this in two passes for UTF-8 but I think you only need to do one pass over the whole string if you allocate another temporary buffer to build and save a byte index from the input string into the output string on the first pass.

I don't think it makes sense to build and cache an index for the whole string, but there's no need to calculate the index of each character in the output string twice.

@ngoldbaum
Copy link
Member

This also needs a release note.

@ngoldbaum ngoldbaum added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Dec 12, 2024
@ArvidJB ArvidJB requested a review from ngoldbaum December 25, 2024 04:37
@ngoldbaum
Copy link
Member

Just a head's up I'm on Christmas break right now. I'm
planning to look at this in the new year.

@ngoldbaum
Copy link
Member

@lysnikolaou is there any chance I can get you to do a high-level once-over on this PR? There's no need to go through the new C++ code in detail, I just did that.

@ngoldbaum
Copy link
Member

@mhvk could you give this one a once-over if you have a little time?

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very nice! I did still have a few comments, see in-line.

// compute outsize
npy_intp outsize = 0;
for (int i = start; step > 0 ? i < stop : i > stop; i += step) {
outsize += num_bytes_for_utf8_character(codepoint_offsets[i]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems weird - you already calculated the number of bytes - can one not just take a difference between the relevant offsets? (Alternatively, perhaps there should also be a std::vector::<unsigned char> code_point_lengths?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to add the extra allocation because this function is so simple it seemed better to just call it twice. That said the allocation is amortized over the array so perhaps it doesn't matter.

What about making num_bytes_for_utf8_character a static inline function? It's can also make it branchless, I think, with some bitmath...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right, it should be possible to count the number of bytes in UTF8 by bitmath; after all, all information on a charcter's length is in the first byte (e.g., http://www.daemonology.net/blog/2008-06-05-faster-utf8-strlen.html)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.s. Totally fine to leave this for a subsequent PR!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also https://github.com/simdutf/simdutf. But I will leave all of this as an exercise for the reader....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ArvidJB ArvidJB requested a review from mhvk January 7, 2025 02:47
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two utter nitpicks, and one query as I'm confused why the asanyarray would be needed - but clearly I'm missing something too...

None of this is important, so I'll approve now (though probably need to squash-merge given the large number of commits).

@ArvidJB ArvidJB force-pushed the string_slice_gufunc branch from 323aa73 to 2fa92ed Compare January 8, 2025 02:39
This commit adds a `np.strings.slice` function which
vectorizes slicing of string arrays. For example
```
    >>> a = np.array(['hello', 'world'])
    >>> np.char.slice(a, 2)
    array(['he', 'wo'], dtype='<U5')
```

This supports fixed-length and variable-length string dtypes.
It also supports broadcasting the start, stop, step args:
```
    >>> b = np.array(['hello world', 'γεια σου κόσμε', '你好世界', '👋 🌍️'], dtype=np.dtypes.StringDType())
    >>> np.strings.slice(b, [3, 0, 2, 1], -1)
    array(['lo worl', 'γεια σου κόσμ', '世', ' 🌍'], dtype=StringDType())
```

Closes numpy#8579
@ArvidJB ArvidJB force-pushed the string_slice_gufunc branch from 2fa92ed to 56c900a Compare January 8, 2025 03:00
@ArvidJB
Copy link
Contributor Author

ArvidJB commented Jan 8, 2025

I rebased on main and squashed the commits. Is this ready to be merged?

@seberg
Copy link
Member

seberg commented Jan 8, 2025

I rebased on main and squashed the commits. Is this ready to be merged?

I'll have Nathan have another look over, but likely. The CircleCI failure is real and we should address it (a tiny thing): The documentation is slightly wrong somewhere:

1669     Examples
1670     --------
1671     >>> import numpy as np
1672     >>> a = np.array(['hello', 'world'])
1673     >>> np.char.slice(a, 2)
UNEXPECTED EXCEPTION: AttributeError("module 'numpy.char' has no attribute 'slice'")

@ngoldbaum ngoldbaum removed the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jan 8, 2025
@ArvidJB
Copy link
Contributor Author

ArvidJB commented Jan 8, 2025

I rebased on main and squashed the commits. Is this ready to be merged?

I'll have Nathan have another look over, but likely. The CircleCI failure is real and we should address it (a tiny thing): The documentation is slightly wrong somewhere:

1669     Examples
1670     --------
1671     >>> import numpy as np
1672     >>> a = np.array(['hello', 'world'])
1673     >>> np.char.slice(a, 2)
UNEXPECTED EXCEPTION: AttributeError("module 'numpy.char' has no attribute 'slice'")

Ah, looks like doctest running was fixed in one of the commits included in the rebase!

We had decided to not add np.char.slice since it's supposedly a deprecated module, but I had not fixed the doctests.

Everything is passing now.

@ngoldbaum
Copy link
Member

Thanks for your patience on getting this reviewed and merged @ArvidJB. If you spot other bugs or missing features in np.strings please don’t hesitate to follow up with more issues or PRs!

@ngoldbaum ngoldbaum merged commit 2e700c6 into numpy:main Jan 8, 2025
67 checks passed
@mhvk
Copy link
Contributor

mhvk commented Jan 8, 2025

Thanks, @ArvidJB, really nice!

ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull request May 16, 2025
ganesh-k13 added a commit to ganesh-k13/numpy that referenced this pull request May 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

Feature request: vectorized character slice
5 participants