ENH: Add partition/rpartition ufunc for string dtypes #26082
Conversation
I guess the disadvantage of the boolean is that …
I tend to agree. While implementing it and writing the tests, I was at some points a bit confused as to what the original input string was, because it wasn't immediately clear from the result tuple. The trade-off is 1 vs. 4 bytes per element if the input is UTF-32, so not a huge amount of memory for having a cleaner way to reconstruct the original string. @ngoldbaum What do you think?
And we diverge less from the CPython API too. Go for it and drop the boolean idea :)
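Since the decision here is to mirror CPython's API, a quick sketch of the semantics being adopted (this is plain `str.partition`, not the NumPy code from this PR):

```python
# CPython's str.partition always returns a 3-tuple
# (part before sep, sep itself, part after sep), so the
# original string is recoverable without any extra flag.
s = "Numpy is nice!"
before, sep, after = s.partition(" ")
print((before, sep, after))  # ('Numpy', ' ', 'is nice!')
assert before + sep + after == s  # clean reconstruction
```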
OK, looks good! Mostly comments on the docstrings and tests. Also a suggestion for speed-up, though that can be for another PR.
FWIW I'm planning to go over this and add stringdtype support tomorrow.
I started on this but got distracted by other fires and didn't finish it. Will work more tomorrow.
A few inline comments, I also just pushed a commit that adds stringdtype support, would appreciate your eyes on my changes.
if (!given_descrs[3] || !given_descrs[4] || !given_descrs[5]) {
    PyErr_SetString(
        PyExc_TypeError,
        "The 'out' kwarg is necessary. Use numpy.strings without it.");
If you remove `NPY_UNUSED` from the `self` argument above, you can use `self->name` to include the ufunc name in the error message.
I also reworded this in my version, take it or leave it, or feel free to give me edits:
PyErr_Format(
    PyExc_TypeError,
    "The '%s' ufunc requires the out keyword to be set. The "
    "python wrapper in numpy.strings can be used without the "
    "out keyword.",
    self->name);
I ended up realizing that I have the opposite problem with stringdtype and I don't want to support `out` at all, so the final error message ended up inverted from yours.
The failures on 32-bit architectures are real. Going to try to set up a 32-bit Windows Python environment to reproduce it…
    goto fail;
}
else if (NPY_UNLIKELY(i1_isnull || i2_isnull)) {
    if (!has_string_na) {
I decided against adding support for nan-like nulls. Happy to add it if reviewers think it's needed.
I guess one has to be slightly careful, since one wants `sum(parts)` to work, so only the left or right result should be `NaN`, the others still empty strings. It seems that would work automatically, since `find` would already not return a result.
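To make the concern concrete, here is the plain-`str` behavior the ufuncs mirror: when the separator is missing, `partition` leaves the middle and right slots empty (and `rpartition` the middle and left), so concatenating the parts always reconstructs the input, and only the slot holding the full string would need NaN-like treatment.

```python
# Separator found: all three parts populated.
hit = "a-b".partition("-")
assert hit == ("a", "-", "b")

# Separator missing: partition keeps the input in the LEFT slot,
# rpartition keeps it in the RIGHT slot; the rest are empty strings.
miss = "ab".partition("-")
rmiss = "ab".rpartition("-")
assert miss == ("ab", "", "")
assert rmiss == ("", "", "ab")

# Reconstruction works in every case.
assert "".join(hit) == "a-b" and "".join(miss) == "ab"
```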
A few comments on the `StringDType` part.

Also a general one: for `StringDType` it is in principle not necessary to pass in the indices; those can just be found as one goes (which I think would be faster). Might it make sense to have separate ufuncs for that reason? Obviously, that can be done later too!
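The two strategies can be sketched in Python (the helper names here are illustrative, not NumPy internals): the fixed-width loops consume a separator index precomputed by a prior `find` pass, while a `StringDType`-only ufunc could locate the separator inline during its single pass over each element.

```python
# Hypothetical sketch of the two strategies; names are made up.

def partition_with_precomputed_index(s: str, sep: str, idx: int):
    # idx comes from a separate `find` pass over the whole array
    if idx < 0:
        return s, "", ""
    return s[:idx], sep, s[idx + len(sep):]

def partition_find_inline(s: str, sep: str):
    # single pass: locate the separator while partitioning
    idx = s.find(sep)
    if idx < 0:
        return s, "", ""
    return s[:idx], sep, s[idx + len(sep):]

assert partition_with_precomputed_index("a-b", "-", 1) == ("a", "-", "b")
assert partition_find_inline("a-b", "-") == ("a", "-", "b")
```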
    PyArray_Descr *loop_descrs[3],
    npy_intp *NPY_UNUSED(view_offset))
{
    if (given_descrs[3] || given_descrs[4] || given_descrs[5]) {
More out of curiosity than anything else, but why doesn't `out` work? Isn't it fine as long as the dtypes are consistent?
Right now all the `np.strings` wrappers don't accept `out`. They could, in the future, but I figured it wasn't worth going out of my way to add code paths for in-place operations that can't happen yet. The fixed-width versions heavily leverage `out` by precomputing the output width, but they don't support user-supplied `out` unless someone calls the ufunc directly with the private name.
// out1 and out3 can be no longer than in1, so conservatively
// overallocate buffers big enough to store in1
// out2 must be no bigger than in2
char *out1_mem = (char *)PyMem_RawCalloc(i1s.size, 1);
We know all the sizes, so in principle one can allocate the right size directly. Though I guess the fact that one needs to allocate a buffer at all exposes a bit of a shortcoming of the API: really, one would just want to allocate the output strings with the known sizes and then pass on the corresponding buffers to `string_partition`…
numpy/_core/strings.py (outdated)
        return _to_bytes_or_str_array(
            _vec_string(a, np.object_, 'partition', (sep,)), a)
    a = np.asanyarray(a)
    sep = np.asanyarray(sep).astype(a.dtype)
Might as well do `np.array(sep, a.dtype, copy=None, subok=True)`. The `.astype()` does seem logical.
Good call. I renamed the ufuncs @lysnikolaou added to …
Nice to have the separate loops! Some small remaining comments.
npy_intp idx;

if (startposition == STARTPOSITION::FRONT) {
    idx = fastsearch((char *)i1s.buf, i1s.size, (char *)i2s.buf, i2s.size, -1, FAST_SEARCH);
In principle, nothing wrong with also passing in `string_find` and `string_rfind` as auxiliary data and then doing the search with `find_like_function *function` (as in `string_findlike_strided_loop`).

EDIT: or, perhaps easier, at the top of the loop define

find_like_function *function = startposition == STARTPOSITION::FRONT
    ? string_find<ENCODING::UTF8> : string_rfind<ENCODING::UTF8>;
// or
int direction = startposition == STARTPOSITION::FRONT ? FAST_SEARCH : FAST_RSEARCH;

(Maybe the latter is better, assuming `string_find` is just a wrapper around `fastsearch`.)
I can't use `string_find` - its result is defined in terms of code point indices, but I need a byte index, which is exactly what `fastsearch` gives me. But I can use the suggestion in the edit - thanks for that. I'll add a comment explaining why `string_find`/`string_rfind` can't work, since that's not obvious at first glance.
Failures look real, sadly!
Yup! Working on it…
The test failures are happening because of a more fundamental issue with …

The test in the PR fails on 32-bit and passes on 64-bit because "𐌂𐌂" is 8 bytes, which is a small string on 64-bit but an arena string on 32-bit. The line of code causing the problem is the use of …
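The byte count is easy to verify; the small-string vs. arena threshold tracking pointer size is as described in the comment above, not something this snippet checks:

```python
# "𐌂" (U+10302) is a 4-byte code point in UTF-8, so "𐌂𐌂" is 8 bytes:
# short enough for the inline small-string representation on 64-bit
# builds, but not on 32-bit builds, where it goes to the arena.
assert len("𐌂".encode("utf-8")) == 4
assert len("𐌂𐌂".encode("utf-8")) == 8
```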
I ended up adding a stringdtype-specific hack to …

@seberg @mhvk maybe there's a better way to spell the stringdtype-specific hack I added? Or maybe we need a dtype flag for this? The …
Hmm, that's unfortunate! (But good sleuthing!) Ideally, we avoid this in this PR - does my suggestion of passing `sep` directly into the ufunc work?
@@ -1611,7 +1611,7 @@ _array_fromobject_generic(
     oldtype = PyArray_DESCR(oparr);
     if (PyArray_EquivTypes(oldtype, dtype)) {
         if (copy != NPY_COPY_ALWAYS && STRIDING_OK(oparr, order)) {
-            if (oldtype == dtype) {
+            if (oldtype == dtype || oldtype->kind == 'T') {
Hmm, is this quite right? What if the user provides a dtype with a different `na_object`? I'd really expect `result.dtype == requested_dtype`… Or does `EquivTypes` guarantee `na_object` is the same?
Ah, good point - you're right, we need to do something a little more complicated here. I'll revert this.
numpy/_core/strings.py (outdated)
        return _to_bytes_or_str_array(
            _vec_string(a, np.object_, 'partition', (sep,)), a)
    a = np.asanyarray(a)
    sep = np.array(sep, dtype=a.dtype, copy=None, subok=True)
Can we just forego this for `StringDType`? I.e., move it after the `if` and let the ufunc machinery take care of it? Or would in that case the wrong ufunc be called if `sep` was an array with `U` dtype?
I'd need to add a promoter too, but it does work if I just call `astype` like I had it originally. Would you be OK with adding the extra copy in this PR and then we can come back to dealing with the views later?
Yes, let's just go back to your `.astype()`! `sep` is usually just a single string, so the copy will generally not matter much.
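For reference, the `.astype()` coercion being settled on here amounts to (a minimal sketch, not the exact wrapper code):

```python
import numpy as np

# The wrapper coerces the separator to the array's dtype up front;
# for a scalar sep this is one small copy.
a = np.asanyarray(["hello world", "foo bar"])
sep = np.asanyarray(" ").astype(a.dtype)
assert sep.dtype == a.dtype  # e.g. '<U11' for this input
```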
Only little things left - most important probably to remove the extraneous things (a rebase/squash may be in order)
Examples
--------
>>> x = np.array(["Numpy is nice!"])
Tiny thing, but let's move this line together with the `np.strings` call (could even do `np.strings.partition(["Numpy is nice!"], " ")`).
Examples
--------
>>> a = np.array(['aAaAaA', ' aA ', 'abBABba'])
Same here, combine with example at l.5125
Examples
--------
>>> x = np.array(["Numpy is nice!"], dtype="T")
And here
Parameters
----------
x1 : array-like, with ``StringDType``, ``bytes_``, or ``str_`` dtype
This can no longer be `StringDType`, correct? Need to adjust the docstring (same for the reverse one).
Parameters
----------
x1 : array-like, with ``StringDType``, ``bytes_``, or ``str_`` dtype
This one only takes and produces `StringDType`!
@@ -1611,7 +1611,7 @@ _array_fromobject_generic(
     oldtype = PyArray_DESCR(oparr);
     if (PyArray_EquivTypes(oldtype, dtype)) {
         if (copy != NPY_COPY_ALWAYS && STRIDING_OK(oparr, order)) {
-            if (oldtype == dtype) {
+            if (oldtype == dtype || oldtype->kind == 'T') {
This should now be removed, right!?
@@ -299,6 +299,26 @@ num_codepoints_for_utf8_bytes(const unsigned char *s, size_t *num_codepoints, si
     return state != UTF8_ACCEPT;
 }

+NPY_NO_EXPORT npy_int64
This function is no longer used! Since we removed the other bits, perhaps this one too?
@@ -39,6 +39,9 @@ utf8_character_index(
     const char* start_loc, size_t start_byte_offset, size_t start_index,
     size_t search_byte_offset, size_t buffer_size);

+NPY_NO_EXPORT npy_int64
Also remove from header if we don't use it...
Please feel free to squash merge, I don't want to rewrite history on a branch on @lysnikolaou's fork.
OK, let's get it in!
Closes #25993.