ENH: introduce a notion of "compatible" stringdtype instances #26261


Merged
7 commits merged on Apr 16, 2024

Conversation

ngoldbaum (Member) commented Apr 11, 2024

This is a followup to #26198; it accomplishes the same thing I wanted in that PR.

See the change to the NEP for the details and explanation.

In short, this relaxes the error checking in stringdtype ufuncs and changes the common_instance logic, allowing operations between distinct stringdtype instances as long as the result isn't ambiguous. This makes it much simpler to work with non-default stringdtype instances: in the most common cases, such as passing a Python string as an argument, users no longer need to zealously convert every ufunc argument to the same dtype before handing it to numpy.

For all operations that take more than one string argument, we now only raise an error if the inputs have distinct na_object settings. We allow distinct coerce settings and just choose coerce=False for string outputs if any input dtype had coerce=False set.
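The rule above can be sketched in pure Python. This is only a model for illustration: `promote` is a hypothetical helper, not numpy's actual C code, and for simplicity `None` stands in for "na_object unset" (a real StringDType distinguishes the two, and handles NaN-like objects specially).

```python
# Hypothetical pure-Python model of the promotion rule described above;
# the real implementation lives in numpy's C stringdtype code.
def promote(na1, na2, coerce1, coerce2):
    # Both inputs set a null object, and they differ: ambiguous result.
    if na1 is not None and na2 is not None and na1 is not na2 and na1 != na2:
        raise TypeError("cannot find a compatible null string value")
    # Whichever na_object is set (if any) propagates to the result.
    na = na1 if na1 is not None else na2
    # coerce=False wins: the result only coerces if every input coerces.
    return na, coerce1 and coerce2
```

For example, combining a dtype with an na_object with the default dtype picks the na_object, and mixing coerce settings yields coerce=False.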

Also added a test. There was one spot in the existing tests where we were doing equality comparisons between arrays with distinct na_object settings, so I updated that test to account for the behavior change.

Ideally we could get this merged in time to be included with 2.0 RC2. I'd like to have this in NumPy 2.0 because it will eliminate a lot of boilerplate argument sanitizing in pandas when it calls NumPy ufuncs. I totally understand if this is coming too late in the game, though.

@ngoldbaum added labels 01 - Enhancement and 09 - Backport-Candidate (PRs tagged should be backported) on Apr 11, 2024
@ngoldbaum force-pushed the compatible-stringdtype-ufuncs branch from f698e8a to b98767b on April 11, 2024 20:19
mhvk (Contributor) left a comment:

commit 1: all OK, but perhaps numpy should consider having some form of pre-commit that checks imports...

commit 2: I guess I understand why Py_NewRef doesn't "just work" on array descriptors, but it is a pity that this will force new ref variants for every class one devises. In that sense, the code becomes less readable. But the reduction in boilerplate is probably still worth it.

static int
na_eq_cmp(PyObject *a, PyObject *b) {
NPY_NO_EXPORT int
na_eq_cmp(PyObject *a, PyObject *b, int coerce_nulls) {
mhvk (Contributor):

coerce_nulls does not feel very clear to me. How about null_equals_all? Though since you use true only once, my tendency would be to instead just adjust there (see comment further down).

common_instance(PyArray_StringDTypeObject *dtype1, PyArray_StringDTypeObject *dtype2)
{
int eq = _eq_comparison(dtype1->coerce, dtype2->coerce, dtype1->na_object,
dtype2->na_object);
int eq = na_eq_cmp(dtype1->na_object, dtype2->na_object, 1);
mhvk (Contributor):

How about just dealing with the special case here, i.e.,

int eq = dtype1->na_object == NULL || dtype2->na_object == NULL || na_eq_cmp(dtype1->na_object, dtype2->na_object);

or perhaps just move the lot inside the if - arguably clearer.

if (dtype1->na_object && dtype2->na_object && !na_eq_cmp(dtype1->na_object, dtype2->na_object)) {

Both have the advantage of a (small) performance increase.

seberg (Member):

Seems like a good idea if it isn't used in many places at least. (I don't care about the perf, the compiler may just inline and optimize it all away anyway.)

return NULL;
}

return (PyArray_StringDTypeObject *)new_stringdtype_instance(
dtype1->na_object, dtype1->coerce);
dtype1->na_object != NULL ? dtype1->na_object : dtype2->na_object,
!((dtype1->coerce == 0) || (dtype2->coerce == 0)));
mhvk (Contributor):

Isn't this just dtype1->coerce && dtype2->coerce?
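A quick exhaustive check (plain Python, added here only to verify the boolean simplification of the expression in the diff above):

```python
# Exhaustive check that, for 0/1 flags, the C expression
#   !((c1 == 0) || (c2 == 0))
# reduces by De Morgan's law to plain c1 && c2.
for c1 in (0, 1):
    for c2 in (0, 1):
        negated_form = not ((c1 == 0) or (c2 == 0))
        assert negated_form == bool(c1 and c2)
```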

@@ -294,16 +301,16 @@ stringdtype_setitem(PyArray_StringDTypeObject *descr, PyObject *obj, char **data
// so we do the comparison before acquiring the allocator.

if (na_object != NULL) {
is_cmp = na_eq_cmp(obj, na_object);
if (is_cmp == -1) {
na_cmp = na_eq_cmp(obj, na_object, 1);
mhvk (Contributor):

If you follow my suggestion above of keeping na_eq_cmp as is, then no change here is needed, since both na_object and obj are guaranteed to be non-NULL.

@@ -280,7 +287,7 @@ stringdtype_setitem(PyArray_StringDTypeObject *descr, PyObject *obj, char **data
{
npy_packed_static_string *sdata = (npy_packed_static_string *)dataptr;

int is_cmp = 0;
int na_cmp = 0;
mhvk (Contributor):

If you keep your na_eq_cmp code change, I would suggest

    // borrow reference
    PyObject *na_object = descr->na_object;
    int na_cmp = na_eq_cmp(obj, na_object);
    if (na_cmp == -1) {
        return -1;
    }

i.e., remove the NULL check below (note that I prefer not making the change to na_eq_cmp; see comment a few lines down)

ngoldbaum (Member Author):

Oops, missed this, just pushed another commit addressing this comment. Cleans up the duplicate NULL checking too.

@@ -845,37 +820,21 @@ string_startswith_endswith_resolve_descriptors(
{
PyArray_StringDTypeObject *descr1 = (PyArray_StringDTypeObject *)given_descrs[0];
PyArray_StringDTypeObject *descr2 = (PyArray_StringDTypeObject *)given_descrs[1];
PyArray_StringDTypeObject *common_descr = common_instance(descr1, descr2);
mhvk (Contributor):

Same

else:
# comparisons between arrays with distinct NA objects
# aren't allowed
if ((na1 is pd_NA or na2 is pd_NA or
mhvk (Contributor):

How about writing

if ((na1 := getattr(dtype, "na_object", None)) is not None
        and (na2 := getattr(dtype2, "na_object", None)) is not None
        and na1 is not na2
        and (na1 is pd_NA or ...)):
    with pytest.raises(...):
        ...
else:
    assert_array_equal(...)

ngoldbaum (Member Author):

That doesn't work because na_object being unset is the default. It's only a valid attribute if you pass it:

>>> np.dtypes.StringDType().na_object
AttributeError: 'numpy.dtypes.StringDType' object has no attribute 'na_object'.

But I see what you're saying; I refactored a bit to hopefully make it less confusing.

mhvk (Contributor) left a comment:

Overall, good, but some comments inline - main one being not to adjust na_eq_cmp but rather do the check in common_instance.

Bigger picture, I wonder a bit about using this inside, e.g., string_comparison_resolve_descriptors, since in the end you generally don't actually use the result. Since before no check was done, what goes wrong if you don't adjust this routine? Is it just that one gets a less clear error message? Somewhat similarly with the other resolve_descriptors routines where the output of common_instance is not used.

seberg (Member) left a comment:

Didn't have a very close look, but I think it is good once Marten's comments are addressed.

I like the NPy_NewDescrRef cleanup enough that I am happy to add it, although it would be nicer to have a single NewRef macro(?) that works for all types...
Maybe even #define Npy_NewRefT(obj) (Py_NewRef(obj), obj), which feels a bit hacky, but not too much.

FWIW, I think we should definitely backport since this is limited to StringDType and there is no code (beyond your pandas tries) that even can be using it.


@@ -49,6 +49,9 @@ stringdtype_finalize_descr(PyArray_Descr *dtype);
NPY_NO_EXPORT int
_eq_comparison(int scoerce, int ocoerce, PyObject *sna, PyObject *ona);

NPY_NO_EXPORT PyArray_StringDTypeObject *
common_instance(PyArray_StringDTypeObject *dtype1, PyArray_StringDTypeObject *dtype2);
seberg (Member):

might make sense to name it string_common_instance or so if we export it, in case this might end up getting included elsewhere? (just a passing thought)

@ngoldbaum force-pushed the compatible-stringdtype-ufuncs branch from b98767b to 53a4e75 on April 12, 2024 20:30
ngoldbaum (Member Author):

I just pushed an update addressing all your comments.

Regarding NewRef, I realized today that using it complicates backports, so I backed out using it in this PR. Sorry for the noise!

Let's not add it until we're no longer actively backporting stuff to numpy 2.0. I guess we could also backport adding pythoncapi-compat, but I'm ok living without NewRef for a little while.

// currently this can only return 1 or -1, the latter indicating that the
// error indicator is set
NPY_NO_EXPORT int
stringdtype_compatible_na(PyObject *na1, PyObject *na2) {
ngoldbaum (Member Author):

This is only used by string_comparison_resolve_descriptors, in principle I could expose na_eq_cmp but exposing this seemed more useful.

mhvk (Contributor):

Agreed this is more useful. My sense, though, would be to make this function like your next one, i.e., have its signature include an PyObject **out_na, and, if this is non-NULL, return the compatible na in it. So, something like,

NPY_NO_EXPORT int
stringdtype_compatible_na(PyObject *na1, PyObject *na2, PyObject **out_na)
{
    if (na1 != NULL && na2 != NULL) {
        int na_eq = na_eq_cmp(na1, na2);
        if (na_eq < 0) {
            return -1;
        }
        else if (na_eq == 0) {
            PyErr_Format(PyExc_TypeError,
                     "Cannot find a compatible null string value for "
                     "null strings '%R' and '%R'", na1, na2);
            return -1;
        }
    }
    if (out_na != NULL) {
        *out_na = na1 ? na1 : na2;
    }
    return 0;
}

I think with that, you don't really need the next function any more - in the few places where it is used you can just calculate coerce directly.

ngoldbaum (Member Author):

Applied this suggestion in the most recent commit.

mhvk (Contributor) left a comment:

The approach makes sense, but a suggestion to just have one added function, which can set an output compatible na.


mhvk (Contributor) left a comment:

Thanks, looks good! One nitpick and one question about the exception type, but approving now.

"Cannot find common instance for unequal dtype instances");
if (stringdtype_compatible_na(
dtype1->na_object, dtype2->na_object, &out_na_object) == -1) {
PyErr_Format(PyExc_TypeError,
mhvk (Contributor):

This is one of those things I never know, but isn't the old ValueError more appropriate than the new TypeError? After all, the type is OK; it is the value of na_object that is not.

(Also, I assume an already set error is just overridden? I haven't done much error stuff in C)

ngoldbaum (Member Author):

It's a type error because it's a problem with the dtype for an operation. It would be a ValueError if it was a problem with the array values, not the dtype.

if (out_na != NULL) {
*out_na = na1 ? na1 : na2;
}
return 1;
mhvk (Contributor):

Shouldn't this be return 0? (I.e., success.) I don't think any comparison other than == -1 is done, so it doesn't affect anything, but 0 seems more logical.

# a dtype instance with an unset na object is compatible
# with a dtype that has one set

# this test uses the "add" ufunc but all ufuncs that accept more
mhvk (Contributor):

The test actually checks comparisons too, but doesn't really matter - just if for some other reason it is worth pushing another version.

ngoldbaum (Member Author):

Thanks so much for your comments!

ngoldbaum (Member Author):

Marten gave this a close look and both Marten and Sebastian were positive about the idea, so I'm going to merge this.

@ngoldbaum ngoldbaum merged commit 3c09f16 into numpy:main Apr 16, 2024
charris added a commit that referenced this pull request Apr 27, 2024
ENH: introduce a notion of "compatible" stringdtype instances #26261
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Apr 27, 2024