Skip to content

MAINT: Refactor stringdtype casts.c to use cpp templates #28091

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 32 commits into from
Jan 24, 2025

Conversation

peytondmurray
Copy link
Contributor

@peytondmurray peytondmurray commented Jan 3, 2025

Summary

Part of #25693.

This PR replaces several places where macros were being used to generate cast specs for the vstring type with c++ templates. No user-facing changes were made.

Other changes

  • In working on this, I added a few checks in various places to ensure that the cast specs were being correctly generated.
  • Added a few comments about why 64-bit and 128-bit floats are handled differently than the 16-bit and 32-bit floats
  • Fixed an instance where a UnicodeEncodeError could be raised. Previously this was done like this:
PyObject *exc = PyObject_CallFunction(
        PyExc_UnicodeEncodeError, "ss#nns", "ascii", s.buf,
        (Py_ssize_t)s.size, (Py_ssize_t)i, (Py_ssize_t)(i+1), "ordinal not in range(128)");
PyErr_SetObject(PyExceptionInstance_Class(exc), exc);
Py_DECREF(exc);

but in using ss#nns, PyExc_UnicodeEncodeError expected a string and buffer length for the second and third arguments. I've changed this to sOnns, to accept a bytes object containing the offending data.

Notes

  • In swapping over to templates, it was helpful to add a few new functions that mapped numpy types to names, shortnames, and PyArray_DTypeMeta structs. I'm not sure if there is a better way to do this, but if you have any ideas about this I'd be grateful to learn how this could be done more elegantly (or if these would make sense to have in a different place in the code - I could see them being useful in more places than just this...):
  1. typenum_to_shortname_cstr - I realize not all NPY_TYPES have corresponding shortnames, but these were the ones necessary to finish this effort
  2. typenum_to_cstr
  3. typenum_to_dtypemeta
  • C++ is a bit more picky about a few other things that are unrelated to the macros, so I fixed those as well:
    • Variables of type char can't be initialized with string literals
    • Some cases where integer types were being compared to enum types or size_t
    • Cases where enum flags were being combined with |, e.g. NPY_METH_NO_FLOATINGPOINT_ERRORS | NPY_METH_REQUIRES_PYAPI is not of type NPY_ARRAYMETHOD_FLAGS, it is of type int. Explicit casts were added to ensure the correct type
    • There was one instance where a goto was jumping over a variable initialization, which isn't allowed by the cpp compiler. The instance was modified to work without the jump.

@ngoldbaum I noticed that the string to f128 cast has only the NPY_METH_NO_FLOATINGPOINT_ERRORS array method flag. But if you look at string_to_float, or string_to_float_resolve_descriptors, it looks like there are lines that are accessing python objects, calling Py_INCREF, etc. I've kept the array flags identical to how they were before, but should this also have a NPY_METH_REQUIRES_PYAPI flag as well?

@seberg seberg changed the title [MAINT] Refactor casts.c to use cpp templates [MAINT] Refactor stringdtype casts.c to use cpp templates Jan 3, 2025
@seberg seberg changed the title [MAINT] Refactor stringdtype casts.c to use cpp templates MAINT:Refactor stringdtype casts.c to use cpp templates Jan 3, 2025
@seberg seberg changed the title MAINT:Refactor stringdtype casts.c to use cpp templates MAINT: Refactor stringdtype casts.c to use cpp templates Jan 3, 2025
@peytondmurray peytondmurray force-pushed the 25693-npystring-templating-casts branch 8 times, most recently from 220615b to d15e583 Compare January 9, 2025 06:56
Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If anyone else wants to look over this carefully, I strongly recommend the side-by-side diff view.

Most of these comments are formatting related. Overall this looks like a very nice improvement.

@peytondmurray peytondmurray marked this pull request as ready for review January 10, 2025 21:27
Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peytondmurray the string to longdouble cast is based on NumPyOS_ascii_strtold, which doesn't need the GIL. That said, the uses of PyErr_Warn in that function are probably incorrect without the GIL, and we'd need a warning equivalent of npy_gil_error to re-acquire the GIL and raise the warning to fix that.

That said, it occurs to me that the other string to float casts could probably all use NumPyOS_ascii_strtod, and that would probably be a lot faster than creating a python float for each entry.

If you're not feeling up to fixing that stuff, I think just adding NEEDS_PYAPI to the string to longdouble cast is fine and adding a TODO is fine.

@peytondmurray
Copy link
Contributor Author

Okay, NPY_METH_REQUIRES_PYAPI has been added for the string->long double cast, and I've implemented a npy_gil_warning modeled after npy_gil_error. Ready for another look!

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few more comments but this is looking much closer.

Maybe @mhvk is interested in giving this a look? Or maybe @lysnikolaou or @seberg?

I'm very aware this is a 1000-line diff and no worries if you're busy and this is too much.

@ngoldbaum
Copy link
Member

ngoldbaum commented Jan 16, 2025

Heh, it turns out a user just reported a bug that is way easier to fix once this PR lands since there's no need to mess with macros: #28157.

The issue comes down to the float to string casts not accounting for the case where the float value is exactly the na_object of the destination array.

I think it's better to land the fix for that issue after this lands - this is already huge and is getting close to being mergeable.

@peytondmurray peytondmurray force-pushed the 25693-npystring-templating-casts branch from 61cd0a1 to 0d46409 Compare January 16, 2025 20:31
@peytondmurray
Copy link
Contributor Author

Looks like the 32-bit build broke as a result of the std::isinf changes. Why aren't the 64-bit builds affected? 🤔

@peytondmurray
Copy link
Contributor Author

Okay, I can't see exactly where the root of the problem lies, but I think std::isinf is defined differently on 32-bit and 64-bit platforms. For the 32-bit case it seems like it's an overloaded function, and the compiler complains. For now I'm just going to define a wrapper that always takes in a type T and returns a bool. If there's a more elegant solution here that can work on all platforms, I'd be happy to implement it.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, no thorough review but I'll trust Nathan. I noticed memory errors not being handled right. If that is just too much here, let's overlook it (I think it was also wrong in the old code).

I am suspecting that it would be helpful to refactor this more to have a:

template <typename to_type, maybe-enable-if-magic>
struct CastSpec {
    PyArrayMethod_Spec spec;
    ...
}

so that you can group the initializer function with data (such as the name) so that cleanup can also happen there...
Can't pass that out unfortunately if dtype.c isn't also .cpp so cleanup would have to jump through hoops.

Of course that would double-down on full-fledged C++, but as we see here there are some hoops that are just annoying in the half state this is in.

However, I think it is probably even better if the "refactor this more" is a follow-up (if it happens). The diff here is already big enough.

casts[cast_i++] = StringToVoidCastSpec;
casts[cast_i++] = VoidToStringCastSpec;
casts[cast_i++] = StringToBytesCastSpec;
casts[cast_i++] = BytesToStringCastSpec;
casts[cast_i++] = NULL;

// Check that every cast spec is valid
for (int i = 0; i<num_casts; i++) {
assert(casts[i] != NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If things are fully correct error-wise. Then this actually can happen due to OOM.

I dunno what to do about it, because fixing things seems quite awkward (at least right now). Maybe just return NULL and put a comment that it leaks a bit of memory but this should never happen anyway.

I am also OK if you decide to just keep it as a bug. Since it is a crash that happens at import time and only if OoM, it isn't super concerning.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that this can happen due to OOM, but if that's the case the error indicator should be set by getCastSpec. So maybe the solution could be here to check the error indicator, and return NULL if it's set.

Copy link
Member

@lysnikolaou lysnikolaou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a pass over this. Overall a great improvement. Left a few comments.

@peytondmurray
Copy link
Contributor Author

Okay, I added a bunch of error handling for the places where allocations weren't being checked before. Let's make sure the tests pass, then I think we'll be good to go!

@ngoldbaum
Copy link
Member

I did one more pass with an eye towards resource management and error handling. I didn't spot any issues but I added some comments so future readers don't get tripped up by pyobj_to_string stealing a reference. Definitely a future cleanup to remove the stolen reference in the internal API there.

I'll merge this assuming all the tests pass. Thanks so much for seeing this through.

@ngoldbaum ngoldbaum merged commit 8893c03 into numpy:main Jan 24, 2025
67 checks passed
@peytondmurray peytondmurray deleted the 25693-npystring-templating-casts branch January 24, 2025 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants