Skip to content

ENH: Add partition/rpartition ufunc for string dtypes #26082

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Mar 26, 2024

Conversation

lysnikolaou
Copy link
Member

Closes #25993.

@mhvk
Copy link
Contributor

mhvk commented Mar 19, 2024

I guess the disadvantage of the boolean is that sum(partition(string, sep), initial='') != string. I guess to reproduce the input, one would do something like np.where(result[1], result[0] + sep + result[2], result[0])? I'd still slightly prefer returning an array of string, but seeing that there is an easy solution, I won't insist.

@lysnikolaou
Copy link
Member Author

I tend to agree. While implementing it and writing the tests, I was at some points a bit confused as to what the original input string was, cause it wasn't immediately clear from the result truple. The trade-off is 1 vs 4 bytes per element, if the input is UTF-32, so not a huge amount of memory, for having a cleaner way to reconstruct the original string.

@ngoldbaum What do you think?

@ngoldbaum
Copy link
Member

The trade-off is 1 vs 4 bytes per element, if the input is UTF-32, so not a huge amount of memory, for having a cleaner way to reconstruct the original string.

And we diverge less from the CPython API too. Go for it and drop the boolean idea :)

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, looks good! Mostly comments on the docstrings and tests. Also a suggestion for speed-up, though that can be for another PR.

@ngoldbaum
Copy link
Member

FWIW I’m planning to go over this and add stringdtype support tomorrow.

@ngoldbaum
Copy link
Member

I started on this but got distracted by other fires and didn’t finish it. Will work more tomorrow.

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few inline comments, I also just pushed a commit that adds stringdtype support, would appreciate your eyes on my changes.

if (!given_descrs[3] || !given_descrs[4] || !given_descrs[5]) {
PyErr_SetString(
PyExc_TypeError,
"The 'out' kwarg is necessary. Use numpy.strings without it.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you remove NPY_UNUSED from the self argument above, you can use self->name to include the ufunc name in the error message.

I also reworded this in my version, take it or leave it, or feel free to give me edits:

PyErr_Format(
                PyExc_TypeError,
                "The '%s' ufunc requires the out keyword to be set. The "
                "python wrapper in numpy.strings can be used without the "
		"out keyword."
                self->name);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up realizing that I have the opposite problem with stringdtype and I don't want to support out at all, so the final error message ended up inverted from yours.

@ngoldbaum
Copy link
Member

The failures on 32 bit architectures are real. Going to try to set up a 32 bit windows python environment to reproduce it...

goto fail;
}
else if (NPY_UNLIKELY(i1_isnull || i2_isnull)) {
if (!has_string_na) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided against adding support for nan-like nulls. Happy to add it if reviewers think it's needed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one has to be slightly careful, since one wants sum(parts) to work, so only the left or right result should be NaN, the others still empty strings. It seems that would work automatically, since find would already not return a result.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on the StringDType part.

Also a general one: for StringDType it is in principle not necessary to pass in the indices; those can just be found as one goes (which would I think be faster). Might it make sense to have separate ufuncs for that reason? Obviously, that can be done later too!

PyArray_Descr *loop_descrs[3],
npy_intp *NPY_UNUSED(view_offset))
{
if (given_descrs[3] || given_descrs[4] || given_descrs[5]) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More out of curiosity than anything else, but why doesn't out work? Isn't it fine as long as the dtypes are consistent?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now all the np.strings wrappers don’t accept out. They could, in the future, but I figured it wasn’t worth going out of my way to add code paths for in-place operations that can’t happen yet. The fixed-width versions heavily leverage out by precomputing the output width, but they don’t support user-supplied out unless someone calls the ufunc directly with the private name.

goto fail;
}
else if (NPY_UNLIKELY(i1_isnull || i2_isnull)) {
if (!has_string_na) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one has to be slightly careful, since one wants sum(parts) to work, so only the left or right result should be NaN, the others still empty strings. It seems that would work automatically, since find would already not return a result.

// out1 and out3 can be no longer than in1, so conservatively
// overallocate buffers big enough to store in1
// out2 must be no bigger than in2
char *out1_mem = (char *)PyMem_RawCalloc(i1s.size, 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know all the sizes so in principle one can allocate the right size directly.

Though I guess the fact that one needs to allocate a buffer at all exposes a bit of a short-coming of the API: really, one would just want to allocate the output strings with the known sizes and then pass on the corresponding buffers to string_partition...

return _to_bytes_or_str_array(
_vec_string(a, np.object_, 'partition', (sep,)), a)
a = np.asanyarray(a)
sep = np.asanyarray(sep).astype(a.dtype)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well do np.array(sep, a.dtype, copy=None, subok=True). The .astype() does seem logical.

@ngoldbaum
Copy link
Member

Also a general one: for StringDType it is in principle not necessary to pass in the indices; those can just be found as one goes (which would I think be faster). Might it make sense to have separate ufuncs for that reason?

Good call. I renamed the ufuncs @lysnikolaou added to _partition_index and _rpartition_index and defined new _partition and _rpartition ufuncs that find and partition all in one ufunc. I think ends up being simpler even though it does leave some minor duplication of logic between string_buffer.h and stringdtype_ufuncs.cpp.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to have the separate loops! Some small remaining comments.

npy_intp idx;

if (startposition == STARTPOSITION::FRONT) {
idx = fastsearch((char *)i1s.buf, i1s.size, (char *)i2s.buf, i2s.size, -1, FAST_SEARCH);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In principle, nothing wrong with also passing in string_find and string_rfind as auxilliary data and then do the search with find_like_function *function (as in string_findlike_strided_loop).

EDIT: or, perhaps easier, on the top of the loop define

find_like_function *function = startposition == STARTPOSITION::FRONT ? string_find<ENCODING::UTF8> : string_rfind<ENCODING::UTF8>;
# or
int direction = startposition == STARTPOSITION::FRONT ? FAST_SEARCH : FAST_RSEARCH;

(Maybe latter better, assuming string_find is just a wrapper around fastsearch.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't use string_find - its result is defined in terms of code point indices, but I need a byte index, which is exactly what fastsearch gives me, but I can use the suggestion in the edit. Thanks for that. I'll add a comment explaining why string_find/string_rfind can't work since that's not obvious at first glance.

@mhvk
Copy link
Contributor

mhvk commented Mar 25, 2024

Failures look real, sadly!

@ngoldbaum
Copy link
Member

Failures look real, sadly!

Yup! working on it…

@ngoldbaum
Copy link
Member

ngoldbaum commented Mar 26, 2024

The test failures are happening because of a more fundamental issue with stringdtype and you can trigger it on a 64 bit architecture with the following script:

import numpy as np

buf = np.array("𐌁𐌁𐌁𐌁𐌂𐌂𐌂𐌂𐌀𐌀𐌀𐌀", dtype="T")
sep1 = np.array("𐌂𐌂𐌂𐌂", dtype="T")

print(np.strings.partition(buf, sep1))

The test in the PR fails on 32 bit and passes on 64 bit because "𐌂𐌂" is 8 bytes, which is a small string on 64 bit but an arena string on 32 bit.

The line of code causing the problem is the use of np.array(sep, dtype=a.dtype, copy=None, subok=True) as @mhvk suggested the other day, which bypasses the code path in NewFromDescr_int that calls the dtype finalization function. Going to experiment a bit to see if I can fix the problem....

@ngoldbaum
Copy link
Member

I ended up adding a stringdtype-specific hack to _array_fromobject_generic. For stringdtype specifically, it doesn't ever make sense to use the user-provided dtype instance in this case, we should always use the instance that was attached to the original array, otherwise we'll lose access to the arena containing the data needed by the view.

@seberg @mhvk maybe there's a better way to spell the stringdtype-specific hack I added? Or maybe we need a dtype flag for this?

The test_creation_from_view test I added in the last commit fails on main.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that's unfortunate! (But good sleuthing!) Ideally, we avoid this in this PR - does my suggestion of passing sep directly into the ufunc work?

@@ -1611,7 +1611,7 @@ _array_fromobject_generic(
oldtype = PyArray_DESCR(oparr);
if (PyArray_EquivTypes(oldtype, dtype)) {
if (copy != NPY_COPY_ALWAYS && STRIDING_OK(oparr, order)) {
if (oldtype == dtype) {
if (oldtype == dtype || oldtype->kind == 'T') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, is this quite right? What if the user provides a dtype with a different na_object? I'd really expect result.dtype == requested_dtype... Or is does EquivTypes guarantee na_object is the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good point, you're right, we need to do something a little more complicated here, I'll revert this.

return _to_bytes_or_str_array(
_vec_string(a, np.object_, 'partition', (sep,)), a)
a = np.asanyarray(a)
sep = np.array(sep, dtype=a.dtype, copy=None, subok=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just forego this for StringDType? I.e., move it after the if and let the ufunc machinery take care? Or would in that case the wrong ufunc be called if sep was an array with U dtype?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd need to add a promoter too, but it does work if I just call astype like I had it originally. Would you be OK with adding the extra copy in this PR and then we can come back to dealing with the views later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's just go back to your .astype()! sep is usually just a single string, so the copy will generally not matter much.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only little things left - most important probably to remove the extraneous things (a rebase/squash may be in order)


Examples
--------
>>> x = np.array(["Numpy is nice!"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny think, but let's move this line together with the np.strings call (could even do np.strings.partition(["Numpy is nice!"], " ")


Examples
--------
>>> a = np.array(['aAaAaA', ' aA ', 'abBABba'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, combine with example at l.5125


Examples
--------
>>> x = np.array(["Numpy is nice!"], dtype="T")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here


Parameters
----------
x1 : array-like, with ``StringDType``, ``bytes_``, or ``str_`` dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can no longer be StringDType, correct? Need to adjust the docstring (same for reverse one)


Parameters
----------
x1 : array-like, with ``StringDType``, ``bytes_``, or ``str_`` dtype
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one only takes and produces StringDType!

@@ -1611,7 +1611,7 @@ _array_fromobject_generic(
oldtype = PyArray_DESCR(oparr);
if (PyArray_EquivTypes(oldtype, dtype)) {
if (copy != NPY_COPY_ALWAYS && STRIDING_OK(oparr, order)) {
if (oldtype == dtype) {
if (oldtype == dtype || oldtype->kind == 'T') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should now be removed, right!?

@@ -299,6 +299,26 @@ num_codepoints_for_utf8_bytes(const unsigned char *s, size_t *num_codepoints, si
return state != UTF8_ACCEPT;
}

NPY_NO_EXPORT npy_int64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is no longer used! Since we removed the other bits, perhaps this one too?

@@ -39,6 +39,9 @@ utf8_character_index(
const char* start_loc, size_t start_byte_offset, size_t start_index,
size_t search_byte_offset, size_t buffer_size);

NPY_NO_EXPORT npy_int64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also remove from header if we don't use it...

@ngoldbaum
Copy link
Member

Please feel free to squash merge, I don't want to rewrite history on a branch on @lysnikolaou's fork.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's get it in!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

MAINT: What should we do about np.strings.partition?
3 participants