Skip to content

ENH: Implement string comparison ufuncs (or almost) #21041

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 10, 2022

Conversation

seberg
Copy link
Member

@seberg seberg commented Feb 12, 2022

This makes all comparison operators and ufuncs work on strings
using the ufunc machinery.
It requires a half-manual "ufunc" to keep supporting void comparisons
and especially np.compare_chararrays (that one may have a bit more
overhead now).

In general the new code should be much faster, and has a lot of easier
optimization potential. It is also much simpler since it can outsource
some complexities to the ufunc/iterator machinery.

This further fixes a couple of bugs with byte-swapped strings.

The backward compatibility related change is that using the normal
ufunc machinery means that string comparisons between string and
unicode now give a FutureWarning (instead of just False).


This also shows two things:

  1. How to use new-style loops (very simple one)
  2. Writes them in C++ templating.

I do admit, that I think we will have to refactor some of this when we get more ufuncs created in this way. Note that for now np.equal.types will not list SS->? and UU->?.

In general, this is finished, however, it should get a bit of new tests for direct np.equal, etc. ufunc use and for byte-swapped or unaligned unicode inputs.

NPY_NO_EXPORT PyObject *
_umath_strings_richcompare(
PyArrayObject *self, PyArrayObject *other, int cmp_op, int rstrip);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this how we do this for C++? I am not sure which header I could cram this in so that i can just include it from multiarray/multiarrayobject.c and multiarray/arrayobject.c which need access to these. Currently I put it there directly, which is OK, but I don't like it too much.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not in an string_ufuncs.h ?

@seberg
Copy link
Member Author

seberg commented Feb 12, 2022

@serge-sans-paille could you vet the C++ again? Maybe you also know how to reduce those annoying repetitions when creating the many different loops? That would be awesome :).

EDIT: MSVC doesn't like those named struct initializers? C++ 11 really doesn't allow/cover this simple C99 feature?!
EDIT2: Oh well, expanding it into multiple-lines rather than the {} is not terrible...

@seberg seberg force-pushed the string-ufuncs branch 2 times, most recently from c549c66 to 26f0be6 Compare February 12, 2022 22:43
NPY_ITER_WRITEONLY | NPY_ITER_ALLOCATE | NPY_ITER_ALIGNED};

PyArrayMethod_StridedLoop *strided_loop = nullptr;
NPY_BEGIN_THREADS_DEF;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C++ forced me to move all the inits up, or it didn't want the goto, another C99 feature not included in C11?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if flagging these variables as constexpr changes something wrt. goto ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tried for the flags, and it didn't seem to work.

@mattip
Copy link
Member

mattip commented Feb 13, 2022

Windows is not happy.

Do we have benchmarks that are affected by this change?

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg - I had a look mostly just because I was curious both how your new dtypes held up here and how the new templating works. Both seem great! My comments are really just about trying to make comments clearer; you probably want one of the more c++ people to have a look.

@@ -214,7 +214,7 @@ typedef struct {
} PyArrayMethod_Spec;


typedef PyObject *_ufunc_addloop_fromspec_func(
typedef int _ufunc_addloop_fromspec_func(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated to the refactoring part and probably deserves a commit on its own

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, although I don't care about when/how it gets in, hehe

/* From umath/string_ufuncs.cpp/h */
NPY_NO_EXPORT PyObject *
_umath_strings_richcompare(
PyArrayObject *self, PyArrayObject *other, int cmp_op, int rstrip);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not including the according header instead of forward-declaring that function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There be dragons... I am not quite sure how big they are but there are difficulties with crossing the multiarray/umath boundary.

template <typename character>
static NPY_INLINE int
character_cmp(character a, character b)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could that be return static_cast<int>(b) - static_cast<int>(a) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, although I am not sure in case of invalid unicode characters (they are probably unsigned, but that would be out of range). Happy to change if think it makes it much clearer? (I don't have an opinion either way.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, maybe something other than an in is better as typename std::intermediate_type<npy_ucs4, npy_byte>::type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly still don't really see that this will lead to easier to read code :), and I assume the compilers should optimize it just fine?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite, and that's why I'm trying to convince you :-)

https://godbolt.org/z/GMWPKE4EK

Copy link
Member Author

@seberg seberg Feb 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but that is not what we are comparing? We are comparing these:
https://godbolt.org/z/f5qh9a3b7

And there func1 func0 actually ends up shorter in some cases (depends on the operand)?

EDIT: sorry, I am silly, no since this first is applies on the full word, and only then to an integer. So I guess it may compile to something faster, hmmmm.

else if (comp == Py_GE) {
res = cmp >= 0;
}
else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as comp is a template parameter, this could be a static_assert

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I thought so first, but my linter didn't like static_assert(0)? But probably just a bad linting hint, not fully understanding the templating, will change.

Copy link
Member Author

@seberg seberg Feb 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Edit: oops different window, so missed my own comment ;))

I just tried it again, and even the compiler complained, not just the linter?

NPY_NO_EXPORT PyObject *
_umath_strings_richcompare(
PyArrayObject *self, PyArrayObject *other, int cmp_op, int rstrip);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not in an string_ufuncs.h ?

NPY_ITER_WRITEONLY | NPY_ITER_ALLOCATE | NPY_ITER_ALIGNED};

PyArrayMethod_StridedLoop *strided_loop = nullptr;
NPY_BEGIN_THREADS_DEF;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if flagging these variables as constexpr changes something wrt. goto ?

@seberg seberg marked this pull request as ready for review February 13, 2022 20:53
@seberg
Copy link
Member Author

seberg commented Feb 13, 2022

IMO, the main point here is making the string comparisons sane and adding them as ufuncs to np.equal, etc. to begin with. But benchmarks showed a slightly different performance "profile" for S with more characters and when things match. I am not convinced that it is actually better how it was before (unless we anticipate long strings), but for now changed it, more than happy to revert though!

The benchmarks proof my point that this is generally much faster:

(This time, smaller ratio is better)

       before           after         ratio
     [d9538400]       [623cea57]
     <main>           <string-ufuncs>
+         104±1μs        118±0.4μs     1.14  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '!=')
+      54.6±0.9μs      60.3±0.09μs     1.10  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '!=')
+         223±7μs          237±2μs     1.06  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '!=')
-       131±0.3μs        122±0.7μs     0.94  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '!=')
-     1.27±0.01μs      1.18±0.02μs     0.93  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '!=')
-     1.79±0.01μs      1.40±0.01μs     0.78  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '<=')
-     1.30±0.03μs      1.01±0.01μs     0.78  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '<=')
-     1.29±0.02μs         999±10ns     0.78  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '==')
-      53.8±0.5μs       41.7±0.3μs     0.78  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '>=')
-     1.80±0.01μs      1.40±0.01μs     0.77  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '>=')
-        1.28±0μs         991±20ns     0.77  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '>')
-     1.78±0.01μs      1.38±0.01μs     0.77  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '>')
-     1.28±0.01μs          984±4ns     0.77  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '>=')
-        1.27±0μs          975±2ns     0.77  bench_strings.StringComparisons.time_compare_identical(100, 'U', False, '<')
-        54.3±2μs       41.6±0.3μs     0.77  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '<=')
-         970±8ns          740±3ns     0.76  bench_strings.StringComparisons.time_compare_identical(100, 'S', False, '>')
-       213±0.4μs          162±2μs     0.76  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '>=')
-        54.1±2μs       41.1±0.3μs     0.76  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '>')
-         104±1μs       79.0±0.3μs     0.76  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '<=')
-     1.22±0.01μs          926±4ns     0.76  bench_strings.StringComparisons.time_compare_identical(100, 'S', True, '>=')
-      52.9±0.5μs       40.1±0.5μs     0.76  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '==')
-         976±5ns         739±20ns     0.76  bench_strings.StringComparisons.time_compare_identical(100, 'S', False, '<=')
-     54.2±0.06μs       40.7±0.1μs     0.75  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '<')
-        1.81±0μs      1.36±0.01μs     0.75  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '==')
-         106±6μs       79.5±0.6μs     0.75  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '>=')
-        984±20ns          738±5ns     0.75  bench_strings.StringComparisons.time_compare_identical(100, 'S', False, '>=')
-       106±0.5μs       79.2±0.4μs     0.75  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '>')
-         104±1μs       77.7±0.6μs     0.75  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '==')
-        966±10ns          719±4ns     0.74  bench_strings.StringComparisons.time_compare_identical(100, 'S', False, '==')
-     1.85±0.02μs      1.36±0.02μs     0.74  bench_strings.StringComparisons.time_compare_identical(100, 'U', True, '<')
-       966±0.7ns          711±2ns     0.74  bench_strings.StringComparisons.time_compare_different(100, 'S', False, '<=')
-        1.23±0μs          901±6ns     0.73  bench_strings.StringComparisons.time_compare_identical(100, 'S', True, '>')
-        982±10ns          721±3ns     0.73  bench_strings.StringComparisons.time_compare_different(100, 'S', False, '>')
-         108±1μs      78.8±0.06μs     0.73  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '<')
-     1.19±0.03μs          873±8ns     0.73  bench_strings.StringComparisons.time_compare_identical(100, 'S', True, '<')
-     1.22±0.01μs          894±6ns     0.73  bench_strings.StringComparisons.time_compare_identical(100, 'S', True, '<=')
-         974±7ns          713±2ns     0.73  bench_strings.StringComparisons.time_compare_different(100, 'S', False, '<')
-     1.20±0.02μs          874±6ns     0.73  bench_strings.StringComparisons.time_compare_different(100, 'S', True, '!=')
-         970±1ns          706±6ns     0.73  bench_strings.StringComparisons.time_compare_different(100, 'S', False, '!=')
-         970±9ns          705±5ns     0.73  bench_strings.StringComparisons.time_compare_different(100, 'S', False, '==')
-     1.22±0.02μs          887±4ns     0.72  bench_strings.StringComparisons.time_compare_different(100, 'S', True, '>=')
-     1.21±0.01μs          878±8ns     0.72  bench_strings.StringComparisons.time_compare_different(100, 'S', True, '<=')
-         977±8ns          707±4ns     0.72  bench_strings.StringComparisons.time_compare_identical(100, 'S', False, '<')
-     1.22±0.01μs          884±2ns     0.72  bench_strings.StringComparisons.time_compare_different(100, 'S', True, '>')
-         218±7μs          158±2μs     0.72  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '<=')
-         218±7μs        158±0.9μs     0.72  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '<')
-     1.23±0.01μs          890±4ns     0.72  bench_strings.StringComparisons.time_compare_different(100, 'S', True, '<')
-        996±30ns          718±2ns     0.72  bench_strings.StringComparisons.time_compare_different(100, 'S', False, '>=')
-         978±5ns          698±2ns     0.71  bench_strings.StringComparisons.time_compare_identical(100, 'S', False, '!=')
-     1.25±0.02μs          891±4ns     0.71  bench_strings.StringComparisons.time_compare_identical(100, 'S', True, '==')
-         228±1μs          160±1μs     0.70  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '>')
-         226±1μs          158±1μs     0.70  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', True, '==')
-     1.23±0.03μs          859±5ns     0.70  bench_strings.StringComparisons.time_compare_identical(100, 'S', True, '!=')
-     1.24±0.04μs         850±10ns     0.68  bench_strings.StringComparisons.time_compare_different(100, 'S', True, '==')
-         115±7μs       78.1±0.2μs     0.68  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '>=')
-      59.4±0.8μs      39.5±0.09μs     0.66  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '>=')
-      31.2±0.1μs       20.1±0.6μs     0.65  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '>=')
-         113±6μs       72.5±0.3μs     0.64  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '==')
-       128±0.5μs         81.6±1μs     0.64  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '>=')
-       130±0.8μs       82.6±0.9μs     0.64  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '==')
-        53.9±3μs       34.0±0.2μs     0.63  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '<')
-       131±0.5μs       82.4±0.5μs     0.63  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '<=')
-        55.1±5μs       34.6±0.3μs     0.63  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '<=')
-        58.4±1μs       36.5±0.1μs     0.63  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '<=')
-         966±7ns          603±3ns     0.62  bench_strings.StringComparisons.time_compare_different(100, 'U', False, '<')
-       131±0.9μs       81.6±0.8μs     0.62  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '>')
-        29.4±2μs       18.3±0.6μs     0.62  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '>')
-        972±20ns          603±5ns     0.62  bench_strings.StringComparisons.time_compare_different(100, 'U', False, '!=')
-         131±2μs       81.1±0.6μs     0.62  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'U', False, '<')
-        59.3±2μs       36.6±0.3μs     0.62  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '==')
-         954±6ns          586±2ns     0.61  bench_strings.StringComparisons.time_compare_different(100, 'U', False, '<=')
-        971±40ns          587±4ns     0.60  bench_strings.StringComparisons.time_compare_different(100, 'U', False, '>=')
-        977±10ns          590±1ns     0.60  bench_strings.StringComparisons.time_compare_different(100, 'U', False, '==')
-      27.3±0.7μs       16.5±0.2μs     0.60  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '<')
-        30.3±2μs       18.3±0.6μs     0.60  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '<=')
-         113±7μs       67.8±0.5μs     0.60  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '>')
-         988±4ns          593±1ns     0.60  bench_strings.StringComparisons.time_compare_different(100, 'U', False, '>')
-        51.9±2μs         31.1±1μs     0.60  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '==')
-     31.2±0.09μs       18.6±0.5μs     0.60  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '==')
-        59.6±4μs         35.4±2μs     0.59  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '>')
-       122±0.2μs       72.3±0.2μs     0.59  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '>')
-         116±6μs       68.2±0.2μs     0.59  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '!=')
-      31.3±0.5μs       18.3±0.5μs     0.59  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '<=')
-       122±0.6μs         71.1±3μs     0.58  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '<=')
-      30.2±0.6μs       17.5±0.9μs     0.58  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '<')
-        70.7±2μs      40.7±0.07μs     0.58  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '>=')
-        1.21±0μs         697±20ns     0.57  bench_strings.StringComparisons.time_compare_different(100, 'U', True, '!=')
-         112±7μs         64.3±3μs     0.57  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '>=')
-        59.5±1μs       33.8±0.2μs     0.57  bench_strings.StringComparisons.time_compare_identical(10000, 'S', True, '!=')
-         114±9μs         64.7±2μs     0.57  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '!=')
-        30.3±1μs       17.1±0.5μs     0.57  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '!=')
-      31.0±0.7μs      17.5±0.08μs     0.56  bench_strings.StringComparisons.time_compare_identical(10000, 'S', False, '!=')
-         119±6μs         67.0±3μs     0.56  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', True, '<')
-        60.4±2μs       33.8±0.2μs     0.56  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '!=')
-        60.8±3μs       34.0±0.4μs     0.56  bench_strings.StringComparisons.time_compare_different(10000, 'S', True, '>=')
-         118±3μs         65.1±2μs     0.55  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '<')
-        30.9±2μs       17.0±0.7μs     0.55  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '>')
-      31.1±0.2μs       17.0±0.6μs     0.55  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '>=')
-     1.22±0.01μs          667±2ns     0.55  bench_strings.StringComparisons.time_compare_different(100, 'U', True, '<')
-        1.23±0μs          668±5ns     0.54  bench_strings.StringComparisons.time_compare_different(100, 'U', True, '>=')
-     1.23±0.01μs          666±7ns     0.54  bench_strings.StringComparisons.time_compare_different(100, 'U', True, '==')
-     1.25±0.02μs         673±10ns     0.54  bench_strings.StringComparisons.time_compare_different(100, 'U', True, '<=')
-         120±3μs         64.6±3μs     0.54  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '<=')
-     1.24±0.02μs          665±1ns     0.54  bench_strings.StringComparisons.time_compare_different(100, 'U', True, '>')
-        29.6±2μs       15.8±0.5μs     0.53  bench_strings.StringComparisons.time_compare_different(10000, 'S', False, '==')
-     25.3±0.08μs      13.1±0.07μs     0.52  bench_strings.StringComparisons.time_compare_different(10000, 'U', False, '>=')
-        74.1±3μs      38.0±0.07μs     0.51  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '==')
-      25.2±0.3μs       12.9±0.1μs     0.51  bench_strings.StringComparisons.time_compare_different(10000, 'U', False, '>')
-      25.3±0.2μs      12.9±0.08μs     0.51  bench_strings.StringComparisons.time_compare_different(10000, 'U', False, '<=')
-     25.0±0.08μs       12.7±0.1μs     0.51  bench_strings.StringComparisons.time_compare_different(10000, 'U', False, '<')
-         117±3μs         59.1±3μs     0.50  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', True, '==')
-      25.0±0.1μs       12.5±0.2μs     0.50  bench_strings.StringComparisons.time_compare_different(10000, 'U', False, '==')
-        75.7±3μs         37.4±1μs     0.49  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '>')
-      25.6±0.6μs       12.6±0.1μs     0.49  bench_strings.StringComparisons.time_compare_different(10000, 'U', False, '!=')
-        72.7±2μs         34.2±1μs     0.47  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '>=')
-      75.8±0.4μs      35.3±0.05μs     0.47  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '<=')
-        75.7±2μs       35.1±0.1μs     0.46  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '<=')
-        76.0±2μs       34.9±0.1μs     0.46  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '<')
-        76.4±2μs       35.1±0.2μs     0.46  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '<')
-        76.0±3μs         34.0±1μs     0.45  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '!=')
-        78.6±1μs         33.9±1μs     0.43  bench_strings.StringComparisons.time_compare_identical((1000, 20), 'S', False, '!=')
-        78.8±3μs         33.9±1μs     0.43  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '>')
-      60.4±0.2μs       25.7±0.1μs     0.43  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', False, '>=')
-        76.1±2μs         31.3±1μs     0.41  bench_strings.StringComparisons.time_compare_different((1000, 20), 'S', False, '==')
-      62.5±0.4μs       25.7±0.1μs     0.41  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', False, '<=')
-      63.0±0.3μs       25.7±0.1μs     0.41  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', False, '>')
-      62.5±0.7μs      25.1±0.03μs     0.40  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', False, '==')
-        64.6±2μs      25.6±0.07μs     0.40  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', False, '<')
-        64.8±2μs       25.5±0.2μs     0.39  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', False, '!=')
-      48.7±0.3μs       18.8±0.4μs     0.39  bench_strings.StringComparisons.time_compare_different(10000, 'U', True, '<=')
-      48.9±0.2μs       18.6±0.2μs     0.38  bench_strings.StringComparisons.time_compare_different(10000, 'U', True, '<')
-      48.6±0.2μs       18.4±0.1μs     0.38  bench_strings.StringComparisons.time_compare_different(10000, 'U', True, '!=')
-      49.0±0.6μs       18.5±0.2μs     0.38  bench_strings.StringComparisons.time_compare_different(10000, 'U', True, '==')
-      48.9±0.5μs      18.4±0.09μs     0.38  bench_strings.StringComparisons.time_compare_different(10000, 'U', True, '>')
-      49.4±0.6μs       18.5±0.2μs     0.38  bench_strings.StringComparisons.time_compare_different(10000, 'U', True, '>=')
-      97.0±0.3μs       36.2±0.2μs     0.37  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', True, '!=')
-      96.9±0.7μs       35.9±0.2μs     0.37  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', True, '<')
-      98.0±0.4μs       36.2±0.8μs     0.37  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', True, '<=')
-      97.9±0.4μs       36.1±0.4μs     0.37  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', True, '>=')
-      97.9±0.3μs       35.7±0.7μs     0.36  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', True, '>')
-      98.1±0.8μs       35.6±0.3μs     0.36  bench_strings.StringComparisons.time_compare_different((1000, 20), 'U', True, '==')

@mhvk
Copy link
Contributor

mhvk commented Feb 13, 2022

@seberg - those are very impressive improvements!! I'm rather puzzled by this, though:

+         104±1μs        118±0.4μs     1.14  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '!=')
-         104±1μs       77.7±0.6μs     0.75  bench_strings.StringComparisons.time_compare_identical(10000, 'U', True, '==')

How is it possible that '!=' is 50% slower than '=='?? I would have expected the two to be identical (as was the case in the old code). And indeed, all the other comparisons are the same (within uncertainties) as '=='.

The same 50% penalty holds when not using rstrip, so that is not the problem:

+      54.6±0.9μs      60.3±0.09μs     1.10  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '!=')
-      52.9±0.5μs       40.1±0.5μs     0.76  bench_strings.StringComparisons.time_compare_identical(10000, 'U', False, '==')

As well as for the 2-d case.

@seberg
Copy link
Member Author

seberg commented Feb 13, 2022

Hmmm, no idea, without thinking about it, I would have assumed that all comparisons will have the same speed. Is there some reason the compiler can optimize some path more than the other, or just random?

@mhvk
Copy link
Contributor

mhvk commented Feb 13, 2022

The timing differences are very systematic: all comparisons are the same except '!=' for 'U'. I cannot see any reason for it in your code; after all, the base function doing the comparisons does not even know what comparison is being done; it just returns -1, 0, or +1. Maybe it is some kind of weird fluke?

@mhvk
Copy link
Contributor

mhvk commented Feb 13, 2022

I.e., maybe worth running the benchmarks again?

@seberg
Copy link
Member Author

seberg commented Feb 13, 2022

Its reproducible on my comp, also if I change the order in which the benchmarks run. The compiler must be spitting something different for !=.

@mhvk
Copy link
Contributor

mhvk commented Feb 13, 2022

Weird, I just thought I'd try to reproduce it too, and I cannot:

In [1]: import numpy as np

In [2]: a = np.arange(100000.0).astype("U")

In [3]: b = a.copy()

In [4]: %timeit a != b
2.23 ms ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [5]: %timeit a == b
2.29 ms ± 11.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

(For both, current master gives 3.46 ms, so I confirm the great speed-up! I also tried with np.arange(10000), just to be closer to the benchmark.)

@seberg
Copy link
Member Author

seberg commented Feb 13, 2022

To be honest, I currently don't care much about this oddness related to the core of the loop right now. Could be compiler version or even hardware differences (gcc 11.2 and a Ryzen 3 here) 🤷.
But if it is specific to my hardware, maybe that adds a reason to not worry about it?

If anyone to squeeze performance, I would suggest implementing the get_loop and looking into specializing it for short/long strings (and even contiguous+short or so), I also didn't even try adding NPY_GCC_OPT3. Is that worthwhile for string arrays?

@mhvk
Copy link
Contributor

mhvk commented Feb 13, 2022

@seberg - agreed that in the larger scheme of things this just doesn't matter. The code is much nicer and faster, what else could one ask for!?

@seberg
Copy link
Member Author

seberg commented Feb 14, 2022

OK, I think I covered all the important paths, so this should be ready now for another review and generally ready modulo smaller header or other cleanups.
The only tricky thing would be if anyone is worried about that FutureWarning change aligning strings with other dtypes in comparisons.

(We also really need to deprecate unicode + string -> unicode promotion at some point, but that is unrelated.)

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Mostly just a reiteration of @serge-sans-paille's suggestion to create an 'string_ufuncs.h', plus a small nit on one of the tests.

Though I'd prefer a more C++ adept person to sign off on that part!

}


extern "C" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like below: I think it makes sense to create a 'strings_ufunc.h' that contains these definitions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @mattip, you have a good idea? I had that first, and it should work for the init_string_ufuncs, but the umath/multiarray boundary is always creating some headaches: it seems hard to include that header from multiarray somehow?

I admit, it is very likely I am just doing something stupid.

Also @serge-sans-paille, I thought another problem was that the symbol export didn't actually work right when I had it as its own header here. But I guess that may have been just a bad try?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe good to separate what should more obviously work, which is to put the ones double-defined here in 'strings_ufunc.h' and import that new .h above. And only then worry about the other places.

Regarding multiarray/umath, I thought we had tried to harmonize those, but not sure...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are now one build-unit, which is why this works at all without much hassle. But including umath stuff in multiarray is still tricky (I am not exactly sure why)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, moved things back, it is working fine, but did not try to cross the multiarray boundary yet. I guess one thing could be that I did not actually include other stuff in the header right now (not sure if I should).

template <typename character>
static NPY_INLINE int
character_cmp(character a, character b)
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, maybe something other than an in is better as typename std::intermediate_type<npy_ucs4, npy_byte>::type

res = cmp >= 0;
break;
default:
assert(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the right static_assert here would be (in the function prelude`

static_assert(Py_EQ == comp || Py_NE == comp || ..., "valid comparison");

or something based on a range if you know for sure which enumerator is the lowest one.

case Py_GE:
return string_comparison_loop<rstrip, Py_GE, character>;
default:
assert(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, seeing the same assert again, what you probably should do is

static constexpr is_valid_comp(int comp) {
   return comp == Py_LT || ...;
}

and use it either in fonction prelude, either as an assert or a static_assert depending on the context

Or even better: make the int an enum class but that's probably not possible, just mentioning it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I didn't love the idea of a long duplicate Py_EQ || ... where the asserts were, but if it goes into the function prelude, I can see the point, will adopt :).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang static analysis seems to not be quite smart enough to realize that the switch will always match, but I guess I will ignore that (it warns that res may be undefined in the other case).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I could use default for the last one, but dunno.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, silly me... This assert is actually not static at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the verboseness. Pushed a change with that the enum class instead, figured that the static assert was just as verbose anyway...

@serge-sans-paille
Copy link
Contributor

@seberg : feel free to cherry-pick serge-sans-paille@0bf1a47

@seberg
Copy link
Member Author

seberg commented Feb 15, 2022

Done that cherry pick, but also an STY commit because some style stuff I just can't agree with ;). Left some long lines, wasn't sure how to break them, and I think they are somewhat fine (i.e. nothing non-obvious hidden), although not great.

I am happy with this, it is nice that it removes the duplication, it adds a bit of "C++" knowledge requirement to truly understand what is going on, but I guess that is OK.

@dlee992
Copy link

dlee992 commented Jun 10, 2022

@seberg, @charris, just "a gentle hi". Recently I want to contribute a PR about string ufunc in Numba (numba/numba#8104), which is empty PR right now. I discussed with Numba members, and we hope to keep consistent with NumPy handling way. And I saw this PR! Very impressive and we need it (especially SS->?, UU->?, and unify string with other types in ufunc loop)!
I wonder whether this PR can be merged into NumPy 1.23.0 version ASAP, since I am pending on this very much. Haha. Thanks!

@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Jun 10, 2022
@charris charris merged commit 71f7f7c into numpy:main Jun 10, 2022
@charris
Copy link
Member

charris commented Jun 10, 2022

Yeah, let's give this a shot. Thanks Sebastian. I'm going to do an rc3 this weekend.

Looks like this needs a release note. @seberg Since I will backport this for @dlee992, could you add the release note directly against the 1.23.x branch after this is backported?

charris pushed a commit to charris/numpy that referenced this pull request Jun 10, 2022
* ENH: Implement string comparison ufuncs (or almost)

This makes all comparison operators and ufuncs work on strings
using the ufunc machinery.
It requires a half-manual "ufunc" to keep supporting void comparisons
and especially `np.compare_chararrays` (that one may have a bit more
overhead now).

In general the new code should be much faster, and has a lot of easier
optimization potential.  It is also much simpler since it can outsource
some complexities to the ufunc/iterator machinery.

This further fixes a couple of bugs with byte-swapped strings.

The backward compatibility related change is that using the normal
ufunc machinery means that string comparisons between string and
unicode now give a `FutureWarning` (instead of just False).

* MAINT: Do not use C99 tagged struct init in C++

C++ does not like it (at least not before C++20)... GCC and clang
don't seem to mind, but MSVC seems to.

* BENCH: Add basic string comparison benchmarks

* DOC,STY: Fixup string-comparisons comments based on review

Thanks to Marten's comments, a few clarfications and slight fixups.

* ENH: Use `memcmp` because it may be faster for the byte case

* TST: Improve string and unicode comparison tests.

* MAINT: Use switch statement based on review

As suggested be Serge.

Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>

* TST: Make unicode byte-swap test slightly more concrete

The issue is that the `view` needs to use native byte-order, so
just ensure native byte-order for the view, and then do another cast
to get it right.

* BUG: Add `np.compare_chararrays` to test and fix typo

* TST: Add test for empty string comparisons

* TST: Fixup string test based on martens review

* MAINT: Move definitions back into string_ufuncs.h

* MAINT: Use enum class for comparison operator templating

This removes the need for a dynamic (or static) assert in the
switch statement.

* Template version of add_loop to avoid redundant code

* STY: Fixup style, two spaces, error is -1

* STY: Small `string_ufuncs.cpp` fixups based on Serge's review

* MAINT: Fix merge conflict (ensure_dtype_nbo was removed)

Co-authored-by: Serge Guelton <serge.guelton@telecom-bretagne.eu>
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Jun 10, 2022
@charris charris removed this from the 1.23.0 release milestone Jun 10, 2022
@charris
Copy link
Member

charris commented Jun 10, 2022

The backport is in #21716.

@seberg
Copy link
Member Author

seberg commented Jun 10, 2022

Oh we are giving a backport a shot? Will there be another (brief) rc? I don't really expect this to create issues, but it might have some very subtle changes.

I don't mind backporting, I do wonder though. @dlee992 are you interested in NumPy supporting np.equal(), etc. on the strings. Or do you need the API to extend NumPy ufuncs yourself in that way. Because the API here is only public experimentally right now.

@charris
Copy link
Member

charris commented Jun 10, 2022

Will there be another (brief) rc?

Yes, there are already > 10 backports since rc2. And then I just want to sit on it for two weeks or so and stop doing that :) The zero release is the true NumPy rc, since that is when widespread testing really happens.

@charris
Copy link
Member

charris commented Jun 10, 2022

If @dlee992 decides that a backport will not serve, the we can close the open PR for that.

@seberg seberg deleted the string-ufuncs branch June 10, 2022 14:08
@dlee992
Copy link

dlee992 commented Jun 10, 2022

I don't mind backporting, I do wonder though. @dlee992 are you interested in NumPy supporting np.equal(), etc. on the strings. Or do you need the API to extend NumPy ufuncs yourself in that way. Because the API here is only public experimentally right now.

@seberg, hi. I think I just need NumPy supporting np.equal(), etc. on the strings, not to extend other NumPy ufuncs by myself. BTW, I'd guess there will be other PRs to add SS->?/UU->? into np.equal.types etc? Since numba also relies on np.equal.types to implement its ufunc capabilities (maybe because it was implemented by Pitrou several years ago, which I am not sure).

If @dlee992 decides that a backport will not serve, the we can close the open PR for that.

@charris, hi, I need the very backport. I am glad it can be backported to 1.23.0.

Thanks for ur quick response very much!

@seberg
Copy link
Member Author

seberg commented Jun 10, 2022

@dlee992 np.equal.types is unmodified right now, I had not planned on finding a solution to that at this time. I suppose we could try (there may not be a complete solution, so I expect we may need to add new Python/C-API for numba in any case if you need this).

We should be able to expand the output of np.equal.types (but not the binary version stored on the ufunc struct) in some of the clear cases (such as np.equal.types).

Note that np.equal.types lists old-style loops, but these are new-style loops (they have a different signature).

@dlee992
Copy link

dlee992 commented Jun 13, 2022

@dlee992 np.equal.types is unmodified right now, I had not planned on finding a solution to that at this time. I suppose we could try (there may not be a complete solution, so I expect we may need to add new Python/C-API for numba in any case if you need this).

@seberg, maybe I can try to follow these parts of NumPy code, and try to extend np.equal.types for NumPy, if u don't have time to do this. But I am a newbie for NumPy, I will give it a try. Any suggestion is helpful to me!

We should be able to expand the output of np.equal.types (but not the binary version stored on the ufunc struct) in some of the clear cases (such as np.equal.types).

I'd guess these related stuff are implemented in string_ufuncs.cpp and arrayobject.c? IIRC, numba only uses np.equal.types as a string list to select an appropriate signature, and nothing more (not quite sure), it will generate compiled version by itself (some discussion is here numba/numba#7859).

@seberg
Copy link
Member Author

seberg commented Jun 13, 2022

@seberg, maybe I can try to follow these parts of NumPy code, and try to extend np.equal.types for NumPy

Of course, the problem is timing. If we really need this in the next release we have to do it super-fast and convince Chuck that backporting matters. It is probably not hard, but it seems a bit steep to me.
There may also be some subtleties, although for you it just matters to have the strings there (i.e. not a full solution, but something good enough for you).

I'd guess these related stuff are implemented in string_ufuncs.cpp and arrayobject.c? IIRC, numba only uses np.equal.types as a string list to select an appropriate signature, and nothing more (not quite sure)

This happens in umath/ufunc_object. there is a getter for the types. The problem is that the string loops have a different ABI then the "legacy" loops. And the ufunc.types are a summary of the "legacy" loops and do not contain the new ABI loops.
Even worse, a list of strings is probably not sufficient for a future-proof exposing in any case (i.e. it may work for strings, but not generally).

A "workaround" might be to add a "legacy" loop that is never used and just does nothing/sets an error if used. But that is pretty annoying.
The other workaround is to just manually extend the Python ufunc.types to include some of the new loops (if possible to describe them using simple strings).

@dlee992
Copy link

dlee992 commented Jun 14, 2022

@seberg, many thanks! But in fact, I cannot follow ur thought about extending ufunc.types right now, I have read Numba doc several times and opened several PRs for Numba, but I just began to read NumPy doc e.g., Numpy C-API doc, and build NumPy from source. I have to get a big picture about various concepts used in NumPy more carefully, then I guess I can follow this special string ufunc stuff and your workaround thought. I will spend several days in these parts and be back after this process.

PS, I think I am not that hurry about extending ufunc.types in NumPy 1.23.0 (maybe 1.23.1 is also OK for me), we can do it better with more thoughtful discussion. BTW, I saw the DDL is June 15 for 1.23.0 milestone, it is precise or will be delayed for two weeks?

Thanks again!

@seberg
Copy link
Member Author

seberg commented Jun 14, 2022

If there is consensus that returning two extra entries in np.equal.types we could do that later maybe, the question is whether we should worry about any backwards compatibility issue (say because someone auto-generates things from it, similar to numba).
In that case, I am not sure if it is a "bugfix" to land in 1.23.1 as opposed to 1.24 (we have a half-year release cadence).

We are now in the third rc for 1.23, so I would not expect many more changes unless something comes up. In the end it is up to Chuck though.

The following is what returns ufunc.types to Python:

static PyObject *
ufunc_get_types(PyUFuncObject *ufunc, void *NPY_UNUSED(ignored))
{
/* return a list with types grouped input->output */
PyObject *list;
PyObject *str;
int k, j, n, nt = ufunc->ntypes;
int ni = ufunc->nin;
int no = ufunc->nout;
char *t;
list = PyList_New(nt);
if (list == NULL) {
return NULL;
}
t = PyArray_malloc(no+ni+2);
n = 0;
for (k = 0; k < nt; k++) {
for (j = 0; j<ni; j++) {
t[j] = _typecharfromnum(ufunc->types[n]);
n++;
}
t[ni] = '-';
t[ni+1] = '>';
for (j = 0; j < no; j++) {
t[ni + 2 + j] = _typecharfromnum(ufunc->types[n]);
n++;
}
str = PyUnicode_FromStringAndSize(t, no + ni + 2);
PyList_SET_ITEM(list, k, str);
}
PyArray_free(t);
return list;
}

As you can see it reads ufunc->types. But NumPy now also has another list of "loops", which is stored as an actual list in ufunc->_loops. Now, we could scan/use ufunc->_loops rather than ufunc->types. The problems are:

  • The ufunc->_loops can have additional entries we need to skip.
  • In principle at least, ufunc->_loops could have entries that are not representable by single characters (breaking the formatting).
  • ufunc->_loops can get outdated for numba's DUFuncs, so you probably still need to have a check on ufunc->types (at least for changes).

@seberg
Copy link
Member Author

seberg commented Jun 16, 2022

@dlee992 for your awareness. It seems we should (and will) pull this out of NumPy 1.23. It leads to subtle changes for subclasses (subclasses were sometimes ignored before). That leads to subtle changes/breaks in MaskedArrays paths (that were arguably always a bit bad). And those breaks domino into failures for astropy...

@dlee992
Copy link

dlee992 commented Jun 17, 2022

@seberg, thanks! I will stay tuned about this ENH.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants