-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
ENH: introduce a notion of "compatible" stringdtype instances #26261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: introduce a notion of "compatible" stringdtype instances #26261
Conversation
f698e8a
to
b98767b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commit 1: all OK, but perhaps numpy should consider having some form of pre-commit that checks imports...
commit 2: I guess I understand why Py_NewRef
doesn't "just work" on array descriptors, but it is a pity that this will force new ref variants for every class one devises. In that sense, the code becomes less readable. But the reduction in boilerplate is probably still worth it.
static int | ||
na_eq_cmp(PyObject *a, PyObject *b) { | ||
NPY_NO_EXPORT int | ||
na_eq_cmp(PyObject *a, PyObject *b, int coerce_nulls) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coerce_nulls
does not feel very clear to me. How about null_equals_all
? Though since you use true only once, my tendency would be to instead just adjust there (see comment further down).
common_instance(PyArray_StringDTypeObject *dtype1, PyArray_StringDTypeObject *dtype2) | ||
{ | ||
int eq = _eq_comparison(dtype1->coerce, dtype2->coerce, dtype1->na_object, | ||
dtype2->na_object); | ||
int eq = na_eq_cmp(dtype1->na_object, dtype2->na_object, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just dealing with the special case here, i.e.,
int eq = dtype1->na_object == NULL || dtype->na_object == NULL || na_eq_cmp(dtype1->na_object, dtype2->na_object);
or perhaps just move the lot inside the if
- arguably clearer.
if (dtype1->na_object && dtype2->na_object && !na_eq_cmp(dtype1->na_object, dtype2->na_object) {
Both have the advantage of a (small) performance increase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea if it isn't used in many places at least. (I don't care about the perf, the compiler may just inline and optimize it all away anyway.)
return NULL; | ||
} | ||
|
||
return (PyArray_StringDTypeObject *)new_stringdtype_instance( | ||
dtype1->na_object, dtype1->coerce); | ||
dtype1->na_object != NULL ? dtype1->na_object : dtype2->na_object, | ||
!((dtype1->coerce == 0) || (dtype2->coerce == 0))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this just dtype1->coerce || dtype2->coerce
?
@@ -294,16 +301,16 @@ stringdtype_setitem(PyArray_StringDTypeObject *descr, PyObject *obj, char **data | |||
// so we do the comparison before acquiring the allocator. | |||
|
|||
if (na_object != NULL) { | |||
is_cmp = na_eq_cmp(obj, na_object); | |||
if (is_cmp == -1) { | |||
na_cmp = na_eq_cmp(obj, na_object, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you follow my suggestion above of keeping na_eq_cmp
as is, then no change here is needed, since both na_object
and obj
are guaranteed to be non-NULL.
@@ -280,7 +287,7 @@ stringdtype_setitem(PyArray_StringDTypeObject *descr, PyObject *obj, char **data | |||
{ | |||
npy_packed_static_string *sdata = (npy_packed_static_string *)dataptr; | |||
|
|||
int is_cmp = 0; | |||
int na_cmp = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you keep your na_eq_cmp
code change, I would suggest
// borrow reference
PyObject *na_object = descr->na_object;
int na_cmp = na_eq_cmp(obj, na_object);
if (na_cmp == -1) {
return -1;
}
i.e., remove the NULL check below (note that I prefer not making the change to na_eq_cmp
; see comment a few lines down)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, missed this, just pushed another commit addressing this comment. Cleans up the duplicate NULL checking too.
@@ -845,37 +820,21 @@ string_startswith_endswith_resolve_descriptors( | |||
{ | |||
PyArray_StringDTypeObject *descr1 = (PyArray_StringDTypeObject *)given_descrs[0]; | |||
PyArray_StringDTypeObject *descr2 = (PyArray_StringDTypeObject *)given_descrs[1]; | |||
PyArray_StringDTypeObject *common_descr = common_instance(descr1, descr2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
else: | ||
# comparisons between arrays with distinct NA objects | ||
# aren't allowed | ||
if ((na1 is pd_NA or na2 is pd_NA or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about writing
if ((na1 := getattr(dtype, "na_object", None) is not None
and (na2 := getattr(dtype2, "na_object", None) is not None
and (na1 is not na2)
and (na1 is pd_NA or ...)):
with pytest.raises
...
else:
assert_array_equal(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't work because na_object
being unset is the default. It's only a valid attribute if you pass it:
>>> np.dtypes.StringDType().na_object
AttributeError: 'numpy.dtypes.StringDType' object has no attribute 'na_object'.
but I see what you're saying, refactored a bit to hopefully be less confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, good, but some comments inline - main one being not to adjust na_cmp_eq
but rather do the check in common_instance
.
Bigger picture, I wonder a bit about using this inside, e.g., string_comparison_resolve_descriptors
, since in the end you just don't generally actually use it. Since before no check was done, what goes wrong if you don't adjust this routine? Is it just that one gets a less clear error message? Somewhat similarly with the other resolve descriptors routines where the output of common_instance
is not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't have a very close look, but I think it is good once Marten's comments are addressed.
I like the NPy_NewDescrRef
cleanup enough that I am happy to add it, although it would be nicer to have a single NewRef
macro(?) that works for all types...
Maybe even #define Npy_NewRefT(obj) (Py_NewRef(obj), obj)
, which feels a bit hacky, but not too much.
FWIW, I think we should definitely backport since this is limited to StringDType
and there is no code (beyond your pandas tries) that even can be using it.
common_instance(PyArray_StringDTypeObject *dtype1, PyArray_StringDTypeObject *dtype2) | ||
{ | ||
int eq = _eq_comparison(dtype1->coerce, dtype2->coerce, dtype1->na_object, | ||
dtype2->na_object); | ||
int eq = na_eq_cmp(dtype1->na_object, dtype2->na_object, 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a good idea if it isn't used in many places at least. (I don't care about the perf, the compiler may just inline and optimize it all away anyway.)
@@ -49,6 +49,9 @@ stringdtype_finalize_descr(PyArray_Descr *dtype); | |||
NPY_NO_EXPORT int | |||
_eq_comparison(int scoerce, int ocoerce, PyObject *sna, PyObject *ona); | |||
|
|||
NPY_NO_EXPORT PyArray_StringDTypeObject * | |||
common_instance(PyArray_StringDTypeObject *dtype1, PyArray_StringDTypeObject *dtype2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might make sense to name it string_common_instance
or so if we export it, in case this might end up getting included elsewhere? (just a passing thought)
b98767b
to
53a4e75
Compare
I just pushed an update addressing all your comments. Regarding Let's not add it until we're no longer actively backporting stuff to numpy 2.0. I guess we could also backport adding pythoncapi-compat, but I'm ok living without |
// currently this can only return 1 or -1, the latter indicating that the | ||
// error indicator is set | ||
NPY_NO_EXPORT int | ||
stringdtype_compatible_na(PyObject *na1, PyObject *na2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is only used by string_comparison_resolve_descriptors
, in principle I could expose na_eq_cmp
but exposing this seemed more useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this is more useful. My sense, though, would be to make this function like your next one, i.e., have its signature include an PyObject **out_na
, and, if this is non-NULL, return the compatible na in it. So, something like,
NPY_NO_EXPORT int
stringdtype_compatible_na(PyObject *na1, PyObject *na2, PyObject **out_na)
{
if (na1 != NULL && na2 != NULL) {
int na_eq = na_eq_comp(na1, na2);
if (na_eq < 0) {
return -1;
}
else if (na_eq == 0) {
PyErr_Format(PyExc_TypeError,
"Cannot find a compatible null string value for "
"null strings '%R' and '%R'", na1, na2);
return -1;
}
}
if (out_na != NULL) {
*out_na = na1 ? na1 : na2;
}
return 0;
}
I think with that, you don't really need the next function any more - in the few places where it is used you can just calculate coerce
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied this suggestion in the most recent commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The approach makes sense, but a suggestion to just have one added function, which can set an output compatible na
.
// currently this can only return 1 or -1, the latter indicating that the | ||
// error indicator is set | ||
NPY_NO_EXPORT int | ||
stringdtype_compatible_na(PyObject *na1, PyObject *na2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed this is more useful. My sense, though, would be to make this function like your next one, i.e., have its signature include an PyObject **out_na
, and, if this is non-NULL, return the compatible na in it. So, something like,
NPY_NO_EXPORT int
stringdtype_compatible_na(PyObject *na1, PyObject *na2, PyObject **out_na)
{
if (na1 != NULL && na2 != NULL) {
int na_eq = na_eq_comp(na1, na2);
if (na_eq < 0) {
return -1;
}
else if (na_eq == 0) {
PyErr_Format(PyExc_TypeError,
"Cannot find a compatible null string value for "
"null strings '%R' and '%R'", na1, na2);
return -1;
}
}
if (out_na != NULL) {
*out_na = na1 ? na1 : na2;
}
return 0;
}
I think with that, you don't really need the next function any more - in the few places where it is used you can just calculate coerce
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good! One nitpick and one question about the exception type, but approving now.
"Cannot find common instance for unequal dtype instances"); | ||
if (stringdtype_compatible_na( | ||
dtype1->na_object, dtype2->na_object, &out_na_object) == -1) { | ||
PyErr_Format(PyExc_TypeError, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of these I never knows, but isn't the old ValueError
more appropriate than the new TypeError
? After all, the type is OK, it is the value of object_na
that is not.
(Also, I assume an already set error is just overridden? I haven't done much error stuff in C)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a type error because it's a problem with the dtype for an operation. It would be a ValueError if it was a problem with the array values, not the dtype.
if (out_na != NULL) { | ||
*out_na = na1 ? na1 : na2; | ||
} | ||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be return 0
? (I.e., success.) I don't think any comparison other than == -1
is done, so it doesn't affect anything, but 0 seems more logical.
# a dtype instance with an unset na object is compatible | ||
# with a dtype that has one set | ||
|
||
# this tests uses the "add" ufunc but all ufuncs that accept more |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test actually checks comparisons too, but doesn't really matter - just if for some other reason it is worth pushing another version.
Thanks so much for your comments! |
Marten gave this a close look and both Marten and Sebastian were positive about the idea, so I'm going to merge this. |
ENH: introduce a notion of "compatible" stringdtype instances #26261
This is a followup for #26198, it accomplishes the same thing I wanted in that PR.
See the change to the NEP for the details and explanation.
In short, this relaxes the error checking in stringdtype ufuncs and changes the
common_instance
logic, allowing operations between distinct stringdtype instances as long as the result isn't ambiguous. This makes it much simpler to work with non-default stringdtype instances, since users don't need to zealously convert all ufunc arguments to the same dtype before passing them to numpy in the most common cases like passing a python string as an argument.For all operations that take more than one string argument, we now only raise an error if the inputs have distinct na_object settings. We allow distinct
coerce
settings and just choosecoerce=False
for string outputs if any input dtype hadcoerce=False
set.Also added a test. There was one spot in the existing tests where we were doing equality comparisons between arrays with distinct na_object settings, so I updated that test to account for the behavior change.
Ideally we could get this merged in time to be included with 2.0 RC2. I'd like to have this in NumPy 2.0 because it will eliminate a lot of boilerplate argument sanitizing in pandas when it called numpy ufuncs. I totally understand if this is coming too late in the game though.