-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: Implement string comparison ufuncs (or almost) #21716
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* ENH: Implement string comparison ufuncs (or almost) This makes all comparison operators and ufuncs work on strings using the ufunc machinery. It requires a half-manual "ufunc" to keep supporting void comparisons and especially `np.compare_chararrays` (that one may have a bit more overhead now). In general the new code should be much faster, and has a lot of easier optimization potential. It is also much simpler since it can outsource some complexities to the ufunc/iterator machinery. This further fixes a couple of bugs with byte-swapped strings. The backward compatibility related change is that using the normal ufunc machinery means that string comparisons between string and unicode now give a `FutureWarning` (instead of just False). * MAINT: Do not use C99 tagged struct init in C++ C++ does not like it (at least not before C++20)... GCC and clang don't seem to mind, but MSVC seems to. * BENCH: Add basic string comparison benchmarks * DOC,STY: Fixup string-comparisons comments based on review Thanks to Marten's comments, a few clarfications and slight fixups. * ENH: Use `memcmp` because it may be faster for the byte case * TST: Improve string and unicode comparison tests. * MAINT: Use switch statement based on review As suggested be Serge. Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu> * TST: Make unicode byte-swap test slightly more concrete The issue is that the `view` needs to use native byte-order, so just ensure native byte-order for the view, and then do another cast to get it right. * BUG: Add `np.compare_chararrays` to test and fix typo * TST: Add test for empty string comparisons * TST: Fixup string test based on martens review * MAINT: Move definitions back into string_ufuncs.h * MAINT: Use enum class for comparison operator templating This removes the need for a dynamic (or static) assert in the switch statement. * Template version of add_loop to avoid redundant code * STY: Fixup style, two spaces, error is -1 * STY: Small `string_ufuncs.cpp` fixups based on Serge's review * MAINT: Fix merge conflict (ensure_dtype_nbo was removed) Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>
Oh, I never added a release note over in the original PR? Assuming this backport goes in, do you want me to add a release notes on this PR directly, or make a separate PR later? I have to check how much actually changed here, I may have been confusing the structured comparison PR with this one, and this one should have no changes except bug-fixes. But it is maybe still a nice new feature to note :). |
What/if you decide for the release note, just post it here and I will put it into the 1.23.0 release note in preparation for the rc3 release. |
I suppose could add a brief note under improvements?
|
Thanks Sebastian, I've added that note. |
Backport of #21041.
This makes all comparison operators and ufuncs work on strings
using the ufunc machinery.
It requires a half-manual "ufunc" to keep supporting void comparisons
and especially
np.compare_chararrays
(that one may have a bit moreoverhead now).
In general the new code should be much faster, and has a lot of easier
optimization potential. It is also much simpler since it can outsource
some complexities to the ufunc/iterator machinery.
This further fixes a couple of bugs with byte-swapped strings.
The backward compatibility related change is that using the normal
ufunc machinery means that string comparisons between string and
unicode now give a
FutureWarning
(instead of just False).C++ does not like it (at least not before C++20)... GCC and clang
don't seem to mind, but MSVC seems to.
BENCH: Add basic string comparison benchmarks
DOC,STY: Fixup string-comparisons comments based on review
Thanks to Marten's comments, a few clarfications and slight fixups.
ENH: Use
memcmp
because it may be faster for the byte caseTST: Improve string and unicode comparison tests.
MAINT: Use switch statement based on review
As suggested be Serge.
Co-authored-by: Serge Guelton serge.guelton@telecom-bretagne.eu
The issue is that the
view
needs to use native byte-order, sojust ensure native byte-order for the view, and then do another cast
to get it right.
BUG: Add
np.compare_chararrays
to test and fix typoTST: Add test for empty string comparisons
TST: Fixup string test based on martens review
MAINT: Move definitions back into string_ufuncs.h
MAINT: Use enum class for comparison operator templating
This removes the need for a dynamic (or static) assert in the
switch statement.
Template version of add_loop to avoid redundant code
STY: Fixup style, two spaces, error is -1
STY: Small
string_ufuncs.cpp
fixups based on Serge's reviewMAINT: Fix merge conflict (ensure_dtype_nbo was removed)
Co-authored-by: Serge Guelton serge.guelton@telecom-bretagne.eu