Skip to content

ENH: Implement string comparison ufuncs (or almost) #21716

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jun 10, 2022

Conversation

charris
Copy link
Member

@charris charris commented Jun 10, 2022

Backport of #21041.

  • ENH: Implement string comparison ufuncs (or almost)

This makes all comparison operators and ufuncs work on strings
using the ufunc machinery.
It requires a half-manual "ufunc" to keep supporting void comparisons
and especially np.compare_chararrays (that one may have a bit more
overhead now).

In general the new code should be much faster, and has a lot of easier
optimization potential. It is also much simpler since it can outsource
some complexities to the ufunc/iterator machinery.

This further fixes a couple of bugs with byte-swapped strings.

The backward compatibility related change is that using the normal
ufunc machinery means that string comparisons between string and
unicode now give a FutureWarning (instead of just False).

  • MAINT: Do not use C99 tagged struct init in C++

C++ does not like it (at least not before C++20)... GCC and clang
don't seem to mind, but MSVC seems to.

  • BENCH: Add basic string comparison benchmarks

  • DOC,STY: Fixup string-comparisons comments based on review

Thanks to Marten's comments, a few clarfications and slight fixups.

  • ENH: Use memcmp because it may be faster for the byte case

  • TST: Improve string and unicode comparison tests.

  • MAINT: Use switch statement based on review

As suggested be Serge.

Co-authored-by: Serge Guelton serge.guelton@telecom-bretagne.eu

  • TST: Make unicode byte-swap test slightly more concrete

The issue is that the view needs to use native byte-order, so
just ensure native byte-order for the view, and then do another cast
to get it right.

  • BUG: Add np.compare_chararrays to test and fix typo

  • TST: Add test for empty string comparisons

  • TST: Fixup string test based on martens review

  • MAINT: Move definitions back into string_ufuncs.h

  • MAINT: Use enum class for comparison operator templating

This removes the need for a dynamic (or static) assert in the
switch statement.

  • Template version of add_loop to avoid redundant code

  • STY: Fixup style, two spaces, error is -1

  • STY: Small string_ufuncs.cpp fixups based on Serge's review

  • MAINT: Fix merge conflict (ensure_dtype_nbo was removed)

Co-authored-by: Serge Guelton serge.guelton@telecom-bretagne.eu

* ENH: Implement string comparison ufuncs (or almost)

This makes all comparison operators and ufuncs work on strings
using the ufunc machinery.
It requires a half-manual "ufunc" to keep supporting void comparisons
and especially `np.compare_chararrays` (that one may have a bit more
overhead now).

In general the new code should be much faster, and has a lot of easier
optimization potential.  It is also much simpler since it can outsource
some complexities to the ufunc/iterator machinery.

This further fixes a couple of bugs with byte-swapped strings.

The backward compatibility related change is that using the normal
ufunc machinery means that string comparisons between string and
unicode now give a `FutureWarning` (instead of just False).

* MAINT: Do not use C99 tagged struct init in C++

C++ does not like it (at least not before C++20)... GCC and clang
don't seem to mind, but MSVC seems to.

* BENCH: Add basic string comparison benchmarks

* DOC,STY: Fixup string-comparisons comments based on review

Thanks to Marten's comments, a few clarfications and slight fixups.

* ENH: Use `memcmp` because it may be faster for the byte case

* TST: Improve string and unicode comparison tests.

* MAINT: Use switch statement based on review

As suggested be Serge.

Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>

* TST: Make unicode byte-swap test slightly more concrete

The issue is that the `view` needs to use native byte-order, so
just ensure native byte-order for the view, and then do another cast
to get it right.

* BUG: Add `np.compare_chararrays` to test and fix typo

* TST: Add test for empty string comparisons

* TST: Fixup string test based on martens review

* MAINT: Move definitions back into string_ufuncs.h

* MAINT: Use enum class for comparison operator templating

This removes the need for a dynamic (or static) assert in the
switch statement.

* Template version of add_loop to avoid redundant code

* STY: Fixup style, two spaces, error is -1

* STY: Small `string_ufuncs.cpp` fixups based on Serge's review

* MAINT: Fix merge conflict (ensure_dtype_nbo was removed)

Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>
@seberg
Copy link
Member

seberg commented Jun 10, 2022

Oh, I never added a release note over in the original PR? Assuming this backport goes in, do you want me to add a release notes on this PR directly, or make a separate PR later?

I have to check how much actually changed here, I may have been confusing the structured comparison PR with this one, and this one should have no changes except bug-fixes. But it is maybe still a nice new feature to note :).

@charris charris merged commit 3942b6a into numpy:maintenance/1.23.x Jun 10, 2022
@charris charris deleted the backport-21401 branch June 10, 2022 16:48
@charris
Copy link
Member Author

charris commented Jun 10, 2022

What/if you decide for the release note, just post it here and I will put it into the 1.23.0 release note in preparation for the rc3 release.

@seberg
Copy link
Member

seberg commented Jun 10, 2022

I suppose could add a brief note under improvements?

String comparisons now support in ufuncs
----------------------------------------
The comparison ufuncs `np.equal`, `np.greater`, etc. now support
unicode and byte string inputs (dtypes ``S`` and ``U``).
Due to this change a ``FutureWarning`` is now given when comparing
unicode to byte strings.  Such comparisons always returned ``False``
and continue to do so at this time.

@charris
Copy link
Member Author

charris commented Jun 10, 2022

Thanks Sebastian, I've added that note.

@charris charris restored the backport-21401 branch June 16, 2022 14:55
@charris charris deleted the backport-21401 branch December 29, 2022 22:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
01 - Enhancement 08 - Backport Used to tag backport PRs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants