MAINT: Refactor stringdtype casts.c to use cpp templates #28091

peytondmurray · 2025-01-03T06:58:01Z

Summary

Part of #25693.

This PR replaces several macros that were used to generate cast specs for the vstring type with C++ templates. No user-facing changes were made.
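To illustrate the general idea of swapping macro-generated per-type functions for one template, here is a minimal sketch. It is not the code from this PR: `demo_string_to_int` is a hypothetical name, it parses with the C library's `strtoll`, and it only handles signed integer targets.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <limits>
#include <type_traits>

// One template stands in for a family of per-type functions that a macro
// would otherwise stamp out (one for int8, one for int16, and so on).
// Hypothetical helper, not part of NumPy.
template <typename T>
inline bool demo_string_to_int(const char *text, T *out) {
    static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                  "this sketch only handles signed integer targets");
    assert(text != nullptr && out != nullptr);

    char *end = nullptr;
    errno = 0;
    long long value = std::strtoll(text, &end, 10);
    // Reject empty input, trailing junk, and values outside long long.
    if (end == text || *end != '\0' || errno == ERANGE) {
        return false;
    }
    // Reject values that do not fit the (possibly narrower) target type.
    if (value < std::numeric_limits<T>::min() ||
        value > std::numeric_limits<T>::max()) {
        return false;
    }
    *out = static_cast<T>(value);
    return true;
}
```

The target type is deduced from the output pointer, so a call such as `demo_string_to_int("127", &small)` with `signed char small` picks the narrow range check without any per-type boilerplate.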

Other changes

- In working on this, I added a few checks in various places to ensure that the cast specs were being correctly generated.
- Added a few comments about why 64-bit and 128-bit floats are handled differently than the 16-bit and 32-bit floats.
- Fixed an instance where a UnicodeEncodeError could be raised. Previously this was done like this:

PyObject *exc = PyObject_CallFunction(
        PyExc_UnicodeEncodeError, "ss#nns", "ascii", s.buf,
        (Py_ssize_t)s.size, (Py_ssize_t)i, (Py_ssize_t)(i+1), "ordinal not in range(128)");
PyErr_SetObject(PyExceptionInstance_Class(exc), exc);
Py_DECREF(exc);

but with the ss#nns format string, the second and third arguments are interpreted as a string buffer and its length. I've changed the format to sOnns so that the call accepts a bytes object containing the offending data.

Notes

In swapping over to templates, it was helpful to add a few new functions that map NumPy types to names, short names, and PyArray_DTypeMeta structs. I'm not sure whether there is a better way to do this; if you have any ideas, I'd be grateful to learn how this could be done more elegantly, or whether these would make sense in a different place in the code (I could see them being useful in more places than just this):

- typenum_to_shortname_cstr - I realize not all NPY_TYPES have corresponding shortnames, but these were the ones necessary to finish this effort
- typenum_to_cstr
- typenum_to_dtypemeta
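A rough sketch of what such lookup helpers can look like follows. Everything here is an assumption for illustration: `DemoTypeNum` stands in for NPY_TYPES, and the function names, returned strings, and the `demo_make_cast_name` helper are hypothetical rather than taken from the PR.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a handful of NPY_TYPES values.
enum DemoTypeNum { DEMO_BOOL, DEMO_INT32, DEMO_FLOAT64, DEMO_UNMAPPED };

// Long name for a type number, or nullptr when the sketch has no entry.
inline const char *demo_typenum_to_cstr(DemoTypeNum typenum) {
    switch (typenum) {
        case DEMO_BOOL:    return "bool";
        case DEMO_INT32:   return "int32";
        case DEMO_FLOAT64: return "float64";
        default:           return nullptr;
    }
}

// Short name for a type number (illustrative strings), or nullptr.
inline const char *demo_typenum_to_shortname_cstr(DemoTypeNum typenum) {
    switch (typenum) {
        case DEMO_BOOL:    return "b";
        case DEMO_INT32:   return "i32";
        case DEMO_FLOAT64: return "f64";
        default:           return nullptr;
    }
}

// Build a cast name from a prefix and the mapped type name. An empty
// string signals that the type number has no mapping.
inline std::string demo_make_cast_name(const char *prefix,
                                       DemoTypeNum typenum) {
    assert(prefix != nullptr);
    const char *name = demo_typenum_to_cstr(typenum);
    if (name == nullptr) {
        return std::string();
    }
    return std::string(prefix) + name;
}
```

Returning `nullptr` for unmapped values keeps the "not every type has a short name" case visible to callers instead of hiding it behind a placeholder string.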

C++ is a bit more picky about a few other things that are unrelated to the macros, so I fixed those as well:
- Variables of type char * can't be initialized with string literals
- Some cases where integer types were being compared to enum types or size_t
- Cases where enum flags were being combined with |, e.g. NPY_METH_NO_FLOATINGPOINT_ERRORS | NPY_METH_REQUIRES_PYAPI, which is of type int rather than NPY_ARRAYMETHOD_FLAGS. Explicit casts were added to ensure the correct type.
- There was one instance where a goto was jumping over a variable initialization, which the C++ compiler doesn't allow. The instance was modified to work without the jump.
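Two of the C++ strictness points in the list above can be shown in a small standalone sketch. The enum and function names are hypothetical stand-ins, not NumPy's definitions.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for NPY_ARRAYMETHOD_FLAGS.
enum DemoMethodFlags {
    DEMO_REQUIRES_PYAPI = 1 << 0,
    DEMO_NO_FLOATINGPOINT_ERRORS = 1 << 2
};

inline DemoMethodFlags demo_combined_flags() {
    // `a | b` on two unscoped enum values yields an int in C++, and an int
    // does not convert back to the enum implicitly, so the cast is required.
    return static_cast<DemoMethodFlags>(DEMO_REQUIRES_PYAPI |
                                        DEMO_NO_FLOATINGPOINT_ERRORS);
}

// True if `flag` is set in `flags`.
inline bool demo_has_flag(DemoMethodFlags flags, DemoMethodFlags flag) {
    assert(flag != 0);
    return (static_cast<int>(flags) & static_cast<int>(flag)) != 0;
}

// Returns the length of `text`, or -1 for a null pointer.
inline long demo_checked_length(const char *text) {
    // A C-style `goto fail;` placed here would jump over the initialization
    // of `length` below, which C++ rejects; an early return avoids the jump.
    if (text == nullptr) {
        return -1;
    }
    long length = static_cast<long>(std::strlen(text));
    return length;
}
```

The same code without the `static_cast` compiles as C but fails as C++, which is why this kind of change shows up when a `.c` file is renamed to `.cpp`.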

@ngoldbaum I noticed that the string to f128 cast has only the NPY_METH_NO_FLOATINGPOINT_ERRORS array method flag. But if you look at string_to_float, or string_to_float_resolve_descriptors, it looks like there are lines that are accessing Python objects, calling Py_INCREF, etc. I've kept the array flags identical to how they were before, but should this also have the NPY_METH_REQUIRES_PYAPI flag?

numpy/_core/src/multiarray/stringdtype/casts.cpp

ngoldbaum

If anyone else wants to look over this carefully, I strongly recommend the side-by-side diff view.

Most of these comments are formatting related. Overall this looks like a very nice improvement.

numpy/_core/src/multiarray/stringdtype/casts.cpp

ngoldbaum

@peytondmurray the string to longdouble cast is based on NumPyOS_ascii_strtold, which doesn't need the GIL. That said, the uses of PyErr_Warn in that function are probably incorrect without the GIL, and we'd need a warning equivalent of npy_gil_error to re-acquire the GIL and raise the warning to fix that.

It also occurs to me that the other string to float casts could probably all use NumPyOS_ascii_strtod, and that would probably be a lot faster than creating a Python float for each entry.

If you're not feeling up to fixing that stuff, I think just adding NEEDS_PYAPI to the string to longdouble cast and adding a TODO is fine.
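As a rough illustration of parsing a float directly from a C string, without creating a Python object per entry, here is a sketch built on the standard `std::strtod`. This is an assumption-laden stand-in: `std::strtod` depends on the current C locale, whereas NumPyOS_ascii_strtod is NumPy's own ASCII parser, and `demo_parse_double` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdlib>

// Parse all of `text` as a double. Returns false if nothing could be
// parsed or if characters remain after the number.
// Hypothetical helper; uses the C library's locale-dependent strtod.
inline bool demo_parse_double(const char *text, double *out) {
    assert(text != nullptr && out != nullptr);
    char *end = nullptr;
    double value = std::strtod(text, &end);
    if (end == text) {
        return false;  // no characters were consumed
    }
    if (*end != '\0') {
        return false;  // trailing characters after the number
    }
    *out = value;
    return true;
}
```

A function of this shape touches no Python objects, so it can run without the GIL as long as error reporting is deferred to a point where the GIL is held.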

numpy/_core/src/multiarray/stringdtype/casts.cpp

peytondmurray · 2025-01-16T00:13:13Z

Okay, NPY_METH_REQUIRES_PYAPI has been added for the string->long double cast, and I've implemented an npy_gil_warning modeled after npy_gil_error. Ready for another look!

ngoldbaum

I left a few more comments but this is looking much closer.

Maybe @mhvk is interested in giving this a look? Or maybe @lysnikolaou or @seberg?

I'm very aware this is a 1000-line diff and no worries if you're busy and this is too much.

numpy/_core/src/common/gil_utils.c

numpy/_core/src/multiarray/stringdtype/casts.cpp

ngoldbaum · 2025-01-16T17:36:03Z

Heh, it turns out a user just reported a bug that is way easier to fix once this PR lands since there's no need to mess with macros: #28157.

The issue comes down to the float to string casts not accounting for the case where the float value is exactly the na_object of the destination array.

I think it's better to land the fix for that issue after this lands - this is already huge and is getting close to being mergeable.

numpy/_core/src/multiarray/abstractdtypes.h

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>

numpy/_core/src/multiarray/stringdtype/casts.cpp

peytondmurray · 2025-01-16T23:11:59Z

Looks like the 32-bit build broke as a result of the std::isinf changes. Why aren't the 64-bit builds affected? 🤔

peytondmurray · 2025-01-17T00:05:25Z

Okay, I can't see exactly where the root of the problem lies, but I think std::isinf is defined differently on 32-bit and 64-bit platforms. For the 32-bit case it seems like it's an overloaded function, and the compiler complains. For now I'm just going to define a wrapper that always takes in a type T and returns a bool. If there's a more elegant solution here that can work on all platforms, I'd be happy to implement it.
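A minimal sketch of such a wrapper is below, assuming the trouble is that `std::isinf` can name a set of overloads rather than a single function. The names `demo_is_inf` and `demo_any` are hypothetical, and the wrapper actually used in the PR may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Wrapper with exactly one signature per T. Unlike the name std::isinf,
// which may refer to several overloads, demo_is_inf<double> names a single
// function and can be passed around as a function pointer.
template <typename T>
inline bool demo_is_inf(T value) {
    return std::isinf(value);
}

// Example consumer that takes the predicate by function pointer.
inline bool demo_any(const double *values, std::size_t count,
                     bool (*predicate)(double)) {
    assert(values != nullptr && predicate != nullptr);
    for (std::size_t i = 0; i < count; i++) {
        if (predicate(values[i])) {
            return true;
        }
    }
    return false;
}
```

Calling `demo_any(values, n, demo_is_inf<double>)` works the same way on every platform because the template argument fixes the signature explicitly.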

seberg

A few comments, no thorough review but I'll trust Nathan. I noticed memory errors not being handled right. If that is just too much here, let's overlook it (I think it was also wrong in the old code).

I suspect it would be helpful to refactor this more to have a:

template <typename to_type, maybe-enable-if-magic>
struct CastSpec {
    PyArrayMethod_Spec spec;
    ...
}

so that you can group the initializer function with data (such as the name) so that cleanup can also happen there...
Can't pass that out unfortunately if dtype.c isn't also .cpp so cleanup would have to jump through hoops.

Of course that would double-down on full-fledged C++, but as we see here there are some hoops that are just annoying in the half state this is in.

However, I think it is probably even better if the "refactor this more" is a follow-up (if it happens). The diff here is already big enough.
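One way the suggested grouping could look is sketched here, purely as an illustration. `DemoMethodSpec` is a hypothetical, cut-down stand-in for PyArrayMethod_Spec, the `std::enable_if` constraint is just one possible reading of "maybe-enable-if-magic", and the name format is made up.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical stand-in for PyArrayMethod_Spec, reduced to a few fields.
struct DemoMethodSpec {
    const char *name;
    int nin;
    int nout;
};

// Groups the spec with the storage for its name, so the name is released
// automatically when the DemoCastSpec object is destroyed.
template <typename to_type,
          typename = typename std::enable_if<
                  std::is_arithmetic<to_type>::value>::type>
struct DemoCastSpec {
    std::string name;     // owns the storage; declared first, built first
    DemoMethodSpec spec;  // spec.name points into `name`

    explicit DemoCastSpec(const std::string &type_name)
            : name("cast_string_to_" + type_name),
              spec{name.c_str(), 1, 1} {
        assert(!type_name.empty());
    }

    // Copying would leave spec.name pointing at another object's buffer.
    DemoCastSpec(const DemoCastSpec &) = delete;
    DemoCastSpec &operator=(const DemoCastSpec &) = delete;
};
```

With this shape, `DemoCastSpec<double> cast("float64");` yields a spec whose name stays valid for exactly as long as `cast` lives, which is the cleanup property the comment is after.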

numpy/_core/src/multiarray/stringdtype/casts.cpp

numpy/_core/src/multiarray/dtypemeta.h

numpy/_core/src/multiarray/stringdtype/casts.cpp

seberg · 2025-01-17T08:50:52Z

numpy/_core/src/multiarray/stringdtype/casts.cpp

    casts[cast_i++] = StringToVoidCastSpec;
    casts[cast_i++] = VoidToStringCastSpec;
    casts[cast_i++] = StringToBytesCastSpec;
    casts[cast_i++] = BytesToStringCastSpec;
    casts[cast_i++] = NULL;

+    // Check that every cast spec is valid
+    for (int i = 0; i<num_casts; i++) {
+        assert(casts[i] != NULL);


If things are fully correct error-wise, then this can actually happen due to OOM.

I dunno what to do about it, because fixing things seems quite awkward (at least right now). Maybe just return NULL and put a comment that it leaks a bit of memory but this should never happen anyway.

I am also OK if you decide to just keep it as a bug. Since it is a crash that happens at import time and only on OOM, it isn't super concerning.

peytondmurray · Jan 21, 2025

You're right that this can happen due to OOM, but if that's the case the error indicator should be set by getCastSpec. So maybe the solution here could be to check the error indicator and return NULL if it's set.
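The "return NULL instead of asserting" approach can be sketched without the Python C API as follows. All names are hypothetical, and a `fail_at` parameter simulates the allocation failure that would set the error indicator in the real code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for a cast spec.
struct DemoSpec {
    int id;
};

// Hypothetical factory: returns nullptr when allocation fails (or when a
// failure is simulated for demonstration purposes).
inline DemoSpec *demo_get_cast_spec(int id, bool simulate_failure) {
    if (simulate_failure) {
        return nullptr;
    }
    DemoSpec *spec = static_cast<DemoSpec *>(std::malloc(sizeof(DemoSpec)));
    if (spec != nullptr) {
        spec->id = id;
    }
    return spec;
}

// Free a nullptr-terminated array produced by demo_get_casts.
inline void demo_free_casts(DemoSpec **casts) {
    if (casts == nullptr) {
        return;
    }
    for (std::size_t i = 0; casts[i] != nullptr; i++) {
        std::free(casts[i]);
    }
    std::free(casts);
}

// Build a nullptr-terminated array of `num_casts` specs. If any spec fails
// to allocate, release what was built so far and return nullptr rather
// than asserting. `fail_at` picks the index that fails; use -1 for none.
inline DemoSpec **demo_get_casts(int num_casts, int fail_at) {
    assert(num_casts >= 0);
    std::size_t slots = static_cast<std::size_t>(num_casts) + 1;
    // calloc zero-fills, so the array is nullptr-terminated at every step.
    DemoSpec **casts =
            static_cast<DemoSpec **>(std::calloc(slots, sizeof(DemoSpec *)));
    if (casts == nullptr) {
        return nullptr;
    }
    for (int i = 0; i < num_casts; i++) {
        casts[i] = demo_get_cast_spec(i, i == fail_at);
        if (casts[i] == nullptr) {
            demo_free_casts(casts);
            return nullptr;
        }
    }
    return casts;
}
```

Because the array is zero-filled up front, the cleanup helper can walk it safely even when construction stops partway through.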

numpy/_core/src/multiarray/stringdtype/casts.cpp

numpy/_core/src/common/gil_utils.c

numpy/_core/src/multiarray/stringdtype/casts.cpp

lysnikolaou

I did a pass over this. Overall a great improvement. Left a few comments.

numpy/_core/src/multiarray/stringdtype/casts.cpp

peytondmurray · 2025-01-21T22:09:26Z

Okay, I added a bunch of error handling for the places where allocations weren't being checked before. Let's make sure the tests pass, then I think we'll be good to go!

ngoldbaum · 2025-01-24T18:13:36Z

I did one more pass with an eye towards resource management and error handling. I didn't spot any issues but I added some comments so future readers don't get tripped up by pyobj_to_string stealing a reference. Definitely a future cleanup to remove the stolen reference in the internal API there.

I'll merge this assuming all the tests pass. Thanks so much for seeing this through.

seberg changed the title ~~[MAINT] Refactor casts.c to use cpp templates~~ [MAINT] Refactor stringdtype casts.c to use cpp templates Jan 3, 2025

seberg changed the title ~~[MAINT] Refactor stringdtype casts.c to use cpp templates~~ MAINT:Refactor stringdtype casts.c to use cpp templates Jan 3, 2025

seberg changed the title ~~MAINT:Refactor stringdtype casts.c to use cpp templates~~ MAINT: Refactor stringdtype casts.c to use cpp templates Jan 3, 2025

peytondmurray force-pushed the 25693-npystring-templating-casts branch 8 times, most recently from 220615b to d15e583 Compare January 9, 2025 06:56

ngoldbaum reviewed Jan 9, 2025

numpy/_core/src/multiarray/stringdtype/casts.cpp (outdated)

ngoldbaum reviewed Jan 9, 2025

peytondmurray marked this pull request as ready for review January 10, 2025 21:27

ngoldbaum reviewed Jan 13, 2025

numpy/_core/src/multiarray/stringdtype/casts.cpp (outdated)

peytondmurray requested a review from ngoldbaum January 16, 2025 00:13

ngoldbaum reviewed Jan 16, 2025

ngoldbaum mentioned this pull request Jan 16, 2025

BUG: StringDType: na_object ignored in full #28157

Closed

ngoldbaum reviewed Jan 16, 2025

numpy/_core/src/multiarray/abstractdtypes.h (outdated)

peytondmurray added 9 commits January 16, 2025 12:30

MAINT: Refactor casts.c to use cpp templates

50cc494

Avoid specifying specific bit width for complex types

828a892

Use longdouble types instead of f128

94ef809

Fix byte/short/int string to int cast specs

7d637ba

Switch from explicit sized types to NPY_TYPES

9eae051

Guard against platforms where npy_float64 == npy_longdouble

0860b0c

Add back in longdouble; keep separate definition independent of float64

3c26de0

Fix string_to_bytes slot when encountering unicode encode error

b945cdf

Temporary debugging

1e42e93

peytondmurray and others added 16 commits January 16, 2025 12:30

Try allocating names at init time, cleaning up after casts registered

b8b9ce9

Fix cast name generation

8b0a512

Use memcpy instead of strncpy to avoid stringop-truncation warning

889f586

Print some debug info for 32-bit tests

5171fe5

Fix the overflow handling for string to int

44fa72e

Remove unused typenum_to_typechar

c4c026d

Update numpy/_core/src/multiarray/stringdtype/casts.cpp

82a5e29

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>

Update numpy/_core/src/multiarray/stringdtype/casts.cpp

3505b08

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>

Update numpy/_core/src/multiarray/stringdtype/casts.cpp

84d32c8

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>

Update numpy/_core/src/multiarray/stringdtype/casts.cpp

5a48673

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>

Update numpy/_core/src/multiarray/stringdtype/casts.cpp

190031b

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>

Update numpy/_core/src/multiarray/stringdtype/casts.cpp

e5b1174

Co-authored-by: Nathan Goldbaum <nathan.goldbaum@gmail.com>

Address various review comments

0fc5015

Add npy_gil_warning; require gil for string to longdouble cast

3db8ffb

Move typenum_to_dtypemeta to abstractdtypes; address PR comments

66bbb9a

Move typenum_to_dtypemeta->dtypemeta.h; fix npy_gil_warning declaration

0d46409

peytondmurray force-pushed the 25693-npystring-templating-casts branch from 61cd0a1 to 0d46409 Compare January 16, 2025 20:31

ngoldbaum reviewed Jan 16, 2025

numpy/_core/src/multiarray/stringdtype/casts.cpp (outdated)

Remove unused include

3987ecd

Add a wrapper for std::isinf to make it work on 32-bit platforms

d84d9a8

seberg reviewed Jan 17, 2025

lysnikolaou reviewed Jan 17, 2025

Address review comments; add error checks

c6e569f

MAINT: add comments explaining pyobj_to_string steals a reference

3c99bc6

ngoldbaum merged commit 8893c03 into numpy:main Jan 24, 2025
67 checks passed

peytondmurray deleted the 25693-npystring-templating-casts branch January 24, 2025 21:26
