ENH: introduce a notion of "compatible" stringdtype instances #26261


Merged
7 commits merged on Apr 16, 2024

Conversation

ngoldbaum (Member) commented Apr 11, 2024

This is a followup to #26198; it accomplishes the same thing I wanted in that PR.

See the change to the NEP for the details and explanation.

In short, this relaxes the error checking in stringdtype ufuncs and changes the common_instance logic, allowing operations between distinct stringdtype instances as long as the result isn't ambiguous. This makes it much simpler to work with non-default stringdtype instances: in the most common cases, such as passing a Python string as an argument, users no longer need to zealously convert every ufunc argument to the same dtype before handing it to numpy.

For all operations that take more than one string argument, we now only raise an error if the inputs have distinct na_object settings. We allow distinct coerce settings and just choose coerce=False for string outputs if any input dtype had coerce=False set.
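The rule above can be sketched in pure Python. This is only a model for illustration: `promote` is a hypothetical helper, not numpy's actual C code, and for simplicity `None` stands in for "na_object unset" (a real StringDType distinguishes the two, and handles NaN-like objects specially).

```python
# Hypothetical pure-Python model of the promotion rule described above;
# the real implementation lives in numpy's C stringdtype code.
def promote(na1, na2, coerce1, coerce2):
    # Both inputs set a null object, and they differ: ambiguous result.
    if na1 is not None and na2 is not None and na1 is not na2 and na1 != na2:
        raise TypeError("cannot find a compatible null string value")
    # Whichever na_object is set (if any) propagates to the result.
    na = na1 if na1 is not None else na2
    # coerce=False wins: the result only coerces if every input coerces.
    return na, coerce1 and coerce2
```

For example, combining a dtype with an na_object with the default dtype picks the na_object, and mixing coerce settings yields coerce=False.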

Also added a test. There was one spot in the existing tests where we were doing equality comparisons between arrays with distinct na_object settings, so I updated that test to account for the behavior change.

Ideally we could get this merged in time to be included with 2.0 RC2. I'd like to have this in NumPy 2.0 because it will eliminate a lot of boilerplate argument sanitizing in pandas when it calls NumPy ufuncs. I totally understand if this is coming too late in the game, though.

@ngoldbaum added labels 01 - Enhancement and 09 - Backport-Candidate (PRs tagged should be backported) on Apr 11, 2024
@ngoldbaum force-pushed the compatible-stringdtype-ufuncs branch from f698e8a to b98767b on April 11, 2024 20:19
mhvk (Contributor) left a comment:

commit 1: all OK, but perhaps numpy should consider having some form of pre-commit that checks imports...

commit 2: I guess I understand why Py_NewRef doesn't "just work" on array descriptors, but it is a pity that this will force new ref variants for every class one devises. In that sense, the code becomes less readable. But the reduction in boilerplate is probably still worth it.

static int
na_eq_cmp(PyObject *a, PyObject *b) {
NPY_NO_EXPORT int
na_eq_cmp(PyObject *a, PyObject *b, int coerce_nulls) {
mhvk (Contributor):

coerce_nulls does not feel very clear to me. How about null_equals_all? Though since you use true only once, my tendency would be to instead just adjust there (see comment further down).

common_instance(PyArray_StringDTypeObject *dtype1, PyArray_StringDTypeObject *dtype2)
{
int eq = _eq_comparison(dtype1->coerce, dtype2->coerce, dtype1->na_object,
dtype2->na_object);
int eq = na_eq_cmp(dtype1->na_object, dtype2->na_object, 1);
mhvk (Contributor):

How about just dealing with the special case here, i.e.,

int eq = dtype1->na_object == NULL || dtype2->na_object == NULL || na_eq_cmp(dtype1->na_object, dtype2->na_object);

or perhaps just move the lot inside the if - arguably clearer.

if (dtype1->na_object && dtype2->na_object && !na_eq_cmp(dtype1->na_object, dtype2->na_object)) {

Both have the advantage of a (small) performance increase.

seberg (Member):

Seems like a good idea if it isn't used in many places at least. (I don't care about the perf, the compiler may just inline and optimize it all away anyway.)

return NULL;
}

return (PyArray_StringDTypeObject *)new_stringdtype_instance(
dtype1->na_object, dtype1->coerce);
dtype1->na_object != NULL ? dtype1->na_object : dtype2->na_object,
!((dtype1->coerce == 0) || (dtype2->coerce == 0)));
mhvk (Contributor):

Isn't this just dtype1->coerce && dtype2->coerce?
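A quick exhaustive check (plain Python, added here only to verify the boolean simplification of the expression in the diff above):

```python
# Exhaustive check that, for 0/1 flags, the C expression
#   !((c1 == 0) || (c2 == 0))
# reduces by De Morgan's law to plain c1 && c2.
for c1 in (0, 1):
    for c2 in (0, 1):
        negated_form = not ((c1 == 0) or (c2 == 0))
        assert negated_form == bool(c1 and c2)
```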

@@ -294,16 +301,16 @@ stringdtype_setitem(PyArray_StringDTypeObject *descr, PyObject *obj, char **data
// so we do the comparison before acquiring the allocator.

if (na_object != NULL) {
is_cmp = na_eq_cmp(obj, na_object);
if (is_cmp == -1) {
na_cmp = na_eq_cmp(obj, na_object, 1);
mhvk (Contributor):

If you follow my suggestion above of keeping na_eq_cmp as is, then no change here is needed, since both na_object and obj are guaranteed to be non-NULL.

@@ -280,7 +287,7 @@ stringdtype_setitem(PyArray_StringDTypeObject *descr, PyObject *obj, char **data
{
npy_packed_static_string *sdata = (npy_packed_static_string *)dataptr;

int is_cmp = 0;
int na_cmp = 0;
mhvk (Contributor):

If you keep your na_eq_cmp code change, I would suggest

    // borrow reference
    PyObject *na_object = descr->na_object;
    int na_cmp = na_eq_cmp(obj, na_object);
    if (na_cmp == -1) {
        return -1;
    }

i.e., remove the NULL check below (note that I prefer not making the change to na_eq_cmp; see comment a few lines down)

ngoldbaum (Member Author):

Oops, missed this, just pushed another commit addressing this comment. Cleans up the duplicate NULL checking too.

@@ -845,37 +820,21 @@ string_startswith_endswith_resolve_descriptors(
{
PyArray_StringDTypeObject *descr1 = (PyArray_StringDTypeObject *)given_descrs[0];
PyArray_StringDTypeObject *descr2 = (PyArray_StringDTypeObject *)given_descrs[1];
PyArray_StringDTypeObject *common_descr = common_instance(descr1, descr2);
mhvk (Contributor):

Same

else:
# comparisons between arrays with distinct NA objects
# aren't allowed
if ((na1 is pd_NA or na2 is pd_NA or
mhvk (Contributor):

How about writing

if ((na1 := getattr(dtype, "na_object", None)) is not None
        and (na2 := getattr(dtype2, "na_object", None)) is not None
        and na1 is not na2
        and (na1 is pd_NA or ...)):
    with pytest.raises(...):
        ...
else:
    assert_array_equal(...)

ngoldbaum (Member Author):

That doesn't work because na_object being unset is the default. It's only a valid attribute if you pass it:

>>> np.dtypes.StringDType().na_object
AttributeError: 'numpy.dtypes.StringDType' object has no attribute 'na_object'.

But I see what you're saying; I refactored a bit to hopefully make it less confusing.

mhvk (Contributor) left a comment:

Overall, good, but some comments inline - main one being not to adjust na_eq_cmp but rather do the check in common_instance.

Bigger picture, I wonder a bit about using this inside, e.g., string_comparison_resolve_descriptors, since in the end you generally don't actually use the result. Since before no check was done, what goes wrong if you don't adjust this routine? Is it just that one gets a less clear error message? Somewhat similarly with the other resolve_descriptors routines where the output of common_instance is not used.

seberg (Member) left a comment:

Didn't have a very close look, but I think it is good once Marten's comments are addressed.

I like the NPy_NewDescrRef cleanup enough that I am happy to add it, although it would be nicer to have a single NewRef macro(?) that works for all types...
Maybe even #define Npy_NewRefT(obj) (Py_NewRef(obj), obj), which feels a bit hacky, but not too much.

FWIW, I think we should definitely backport since this is limited to StringDType and there is no code (beyond your pandas tries) that even can be using it.


@@ -49,6 +49,9 @@ stringdtype_finalize_descr(PyArray_Descr *dtype);
NPY_NO_EXPORT int
_eq_comparison(int scoerce, int ocoerce, PyObject *sna, PyObject *ona);

NPY_NO_EXPORT PyArray_StringDTypeObject *
common_instance(PyArray_StringDTypeObject *dtype1, PyArray_StringDTypeObject *dtype2);
seberg (Member):

might make sense to name it string_common_instance or so if we export it, in case this might end up getting included elsewhere? (just a passing thought)

@ngoldbaum force-pushed the compatible-stringdtype-ufuncs branch from b98767b to 53a4e75 on April 12, 2024 20:30
ngoldbaum (Member Author):

I just pushed an update addressing all your comments.

Regarding NewRef, I realized today that using it complicates backports, so I backed out using it in this PR. Sorry for the noise!

Let's not add it until we're no longer actively backporting stuff to numpy 2.0. I guess we could also backport adding pythoncapi-compat, but I'm ok living without NewRef for a little while.

// currently this can only return 1 or -1, the latter indicating that the
// error indicator is set
NPY_NO_EXPORT int
stringdtype_compatible_na(PyObject *na1, PyObject *na2) {
ngoldbaum (Member Author):

This is only used by string_comparison_resolve_descriptors, in principle I could expose na_eq_cmp but exposing this seemed more useful.

mhvk (Contributor):

Agreed this is more useful. My sense, though, would be to make this function like your next one, i.e., have its signature include an PyObject **out_na, and, if this is non-NULL, return the compatible na in it. So, something like,

NPY_NO_EXPORT int
stringdtype_compatible_na(PyObject *na1, PyObject *na2, PyObject **out_na)
{
    if (na1 != NULL && na2 != NULL) {
        int na_eq = na_eq_cmp(na1, na2);
        if (na_eq < 0) {
            return -1;
        }
        else if (na_eq == 0) {
            PyErr_Format(PyExc_TypeError,
                     "Cannot find a compatible null string value for "
                     "null strings '%R' and '%R'", na1, na2);
            return -1;
        }
    }
    if (out_na != NULL) {
        *out_na = na1 ? na1 : na2;
    }
    return 0;
}

I think with that, you don't really need the next function any more - in the few places where it is used you can just calculate coerce directly.

ngoldbaum (Member Author):

Applied this suggestion in the most recent commit.

mhvk (Contributor) left a comment:

The approach makes sense, but a suggestion to just have one added function, which can set an output compatible na.


mhvk (Contributor) left a comment:

Thanks, looks good! One nitpick and one question about the exception type, but approving now.

"Cannot find common instance for unequal dtype instances");
if (stringdtype_compatible_na(
dtype1->na_object, dtype2->na_object, &out_na_object) == -1) {
PyErr_Format(PyExc_TypeError,
mhvk (Contributor):

This is one of those things I never know, but isn't the old ValueError more appropriate than the new TypeError? After all, the type is OK; it is the value of na_object that is not.

(Also, I assume an already set error is just overridden? I haven't done much error stuff in C)

ngoldbaum (Member Author):

It's a type error because it's a problem with the dtype for an operation. It would be a ValueError if it was a problem with the array values, not the dtype.

if (out_na != NULL) {
*out_na = na1 ? na1 : na2;
}
return 1;
mhvk (Contributor):

Shouldn't this be return 0? (I.e., success.) I don't think any comparison other than == -1 is done, so it doesn't affect anything, but 0 seems more logical.

# a dtype instance with an unset na object is compatible
# with a dtype that has one set

# this test uses the "add" ufunc but all ufuncs that accept more
mhvk (Contributor):

The test actually checks comparisons too, but doesn't really matter - just if for some other reason it is worth pushing another version.

ngoldbaum (Member Author):

Thanks so much for your comments!

ngoldbaum (Member Author):

Marten gave this a close look and both Marten and Sebastian were positive about the idea, so I'm going to merge this.

@ngoldbaum ngoldbaum merged commit 3c09f16 into numpy:main Apr 16, 2024
charris added a commit that referenced this pull request Apr 27, 2024
ENH: introduce a notion of "compatible" stringdtype instances #26261
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Apr 27, 2024