ENH: Implement string comparison ufuncs (or almost) #21716

charris · 2022-06-10T13:55:58Z

Backport of #21041.

ENH: Implement string comparison ufuncs (or almost)

This makes all comparison operators and ufuncs work on strings
using the ufunc machinery.
It requires a half-manual "ufunc" to keep supporting void comparisons
and especially np.compare_chararrays (that one may have a bit more
overhead now).

In general the new code should be much faster and leaves a lot of room
for easier optimization. It is also much simpler, since it can outsource
some complexities to the ufunc/iterator machinery.

This further fixes a couple of bugs with byte-swapped strings.

The one backward-compatibility-related change is that, because the normal
ufunc machinery is now used, comparisons between byte strings and
unicode give a FutureWarning (instead of just returning False).

MAINT: Do not use C99 tagged struct init in C++

C++ does not like it (at least not before C++20)... GCC and clang
don't seem to mind, but MSVC seems to.

BENCH: Add basic string comparison benchmarks
DOC,STY: Fixup string-comparisons comments based on review

Thanks to Marten's comments, a few clarifications and slight fixups.

ENH: Use memcmp because it may be faster for the byte case
TST: Improve string and unicode comparison tests.
MAINT: Use switch statement based on review

As suggested by Serge.

Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>

TST: Make unicode byte-swap test slightly more concrete

The issue is that the view needs to use native byte-order, so
just ensure native byte-order for the view, and then do another cast
to get it right.

BUG: Add np.compare_chararrays to test and fix typo
TST: Add test for empty string comparisons
TST: Fixup string test based on Marten's review
MAINT: Move definitions back into string_ufuncs.h
MAINT: Use enum class for comparison operator templating

This removes the need for a dynamic (or static) assert in the
switch statement.

Template version of add_loop to avoid redundant code
STY: Fixup style, two spaces, error is -1
STY: Small string_ufuncs.cpp fixups based on Serge's review
MAINT: Fix merge conflict (ensure_dtype_nbo was removed)

Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>

seberg · 2022-06-10T14:00:03Z

Oh, I never added a release note over in the original PR? Assuming this backport goes in, do you want me to add a release note on this PR directly, or make a separate PR later?

I have to check how much actually changed here; I may have been confusing the structured comparison PR with this one, and this one should have no changes except bug fixes. But it is maybe still a nice new feature to note :).

charris · 2022-06-10T17:42:22Z

Whatever you decide for the release note, just post it here and I will put it into the 1.23.0 release notes in preparation for the rc3 release.

seberg · 2022-06-10T19:32:18Z

I suppose I could add a brief note under improvements?

String comparisons now supported in ufuncs
------------------------------------------
The comparison ufuncs ``np.equal``, ``np.greater``, etc. now support
unicode and byte string inputs (dtypes ``S`` and ``U``).
Due to this change a ``FutureWarning`` is now given when comparing
unicode to byte strings.  Such comparisons always returned ``False``
and continue to do so at this time.

charris · 2022-06-10T20:16:54Z

Thanks Sebastian, I've added that note.

charris added the 01 - Enhancement and 08 - Backport labels Jun 10, 2022

charris added this to the 1.23.0 release milestone Jun 10, 2022

charris mentioned this pull request Jun 10, 2022

ENH: Implement string comparison ufuncs (or almost) #21041

Merged

charris merged commit 3942b6a into numpy:maintenance/1.23.x Jun 10, 2022

charris deleted the backport-21401 branch June 10, 2022 16:48

charris restored the backport-21401 branch June 16, 2022 14:55

charris mentioned this pull request Jun 16, 2022

REV: Revert "ENH: Implement string comparison ufuncs (or almost) " #21777

Merged

charris deleted the backport-21401 branch December 29, 2022 22:20
