MAINT,API: Make metadata, c_metadata, fields, and names only exist on old-style dtypes #25802

seberg · 2024-02-10T18:34:17Z

Right now, this is a heads-up a start to:

Introduce _PyArray_LegacyDescr for the full struct
Introduce some additional access macros for the fields that may not exist now which check for that (and cast). In some cases we use them even though we could also cast.

I need to do another pass, there are warnings. Also random seems to be build in INTERNAL mode, because it picks up the new struct definition (and thus fails to compile, because we have a function to fetch subarray in the Cython file even though it isn't used).

I would much appreciate a heads up if anyone thinks this isn't right. When it comes to naming, replacing the name is easy and always possible, though.
The casts are a bit annoying, but honestly half the time we need guards anyway, and casting for subclasses isn't that odd.

It took many hours to get this far, but the full diff isn't that bad, IMO.

mhvk · 2024-02-11T00:25:04Z

I've not looked through the details, but is the idea that new-style dtypes will not be structured so don't need fields and names? There's a lot of code out there that checks for structured dtype with things like if dtype.names is None, etc.; will those break if a new-style dtype is used?

seberg · 2024-02-11T08:16:53Z

That doesn't break, we just still define those to be None in Python (I might have to modify the C code a tiny bit).

The idea is mostly that as a followup, C-code has to use new macros at the very least, so that we could create a new way to do structured dtypes.

seberg · 2024-02-11T13:54:04Z

This is working now and should be fine overall, a note for review:

There is a little script added I used to get a start on changing direct access to macro access. I could remove it here.
The changes the ndarraytypes.h are the most interesting.
The changes to the Cython were needed for random to compile. I don't see them as bad enough to worry about (Python access seems fine to me, subarray dtypes are very rare after all).

The other commits are probably best viewed together, I added a few later ones when reviewing myself. But those later ones for the most part reduce the amount of lines changed by reverting to direct access (because the descr was cast to a LegacyDescr already.)

There are a decent amount of places that might make more or just as much sense to cast, but do use the macros: Overall, I didn't want to overly worry about using the macro in a few places where the ISLEGACY check is unnecessary.

mattip · 2024-02-12T11:22:42Z

numpy/_core/include/numpy/ndarraytypes.h

@@ -578,6 +578,11 @@ typedef struct {
 #define PyDataType_REFCHK(dtype) \
        PyDataType_FLAGCHK(dtype, NPY_ITEM_REFCOUNT)

+
+#if !defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD


Shouldn't this be just

Suggested change

#if !defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD

#if !defined(NPY_INTERNAL_BUILD)

Thanks, this might be the issue about random requiring changes in the init.pyd!

Unless you disagree, will just add the brackets around it, since checking both seems the pattern we use everywhere (I guess it forces setting it to an actual value which seems a good habit).

Cool, that was the reason for the Cython/random oddity. I added new commits for the changes for review, but happy to squash-merge.

mattip · 2024-02-12T11:25:14Z

numpy/_core/include/numpy/ndarraytypes.h

+#define PyDataType_METADATA(dtype) (!PyDataType_ISLEGACY(dtype) ?  \
+            NULL : ((_PyArray_LegacyDescr *)dtype)->metadata)
+#define PyDataType_C_METADATA(dtype) (!PyDataType_ISLEGACY(dtype) ?  \
+            NULL : ((_PyArray_LegacyDescr *)dtype)->c_metadata)


what about a comprimise: make them static functions so they can be properly typed?

Sounds good, that also means I enforce the direct access for when it is already cast to _PyArray_LegacyDescr *. (Won't touch the old macros here, as it is more tedious, although I can see us changing them also.)

mattip · 2024-02-13T12:19:16Z

tools/descriptor_update.py

+        #content = simple_fix(content, apply_filter=is_public, name="alignment")
+
+        # Write the result:
+        file.write_text(content)


Do you see this being useful once this PR is merged? If so, there are some linting problems. If not, maybe we don't need it in the repo?

Not particularly, I thought it might be intersting for reviewers to see. Let me remove it, we can add it back in a dedicated step if anyone thinks its useful.

ngoldbaum

This may make numpy internals a bit harder to understand. Is the plan to refactor so we're not passing legacy descriptors around internally? I think ideally there would be only one internal representation people need to think about.

ngoldbaum · 2024-02-13T18:29:48Z

numpy/_core/include/numpy/npy_2_compat.h

@@ -82,6 +82,8 @@
    /* Aliases of 2.x names to 1.x only equivalent names */
    #define NPY_NTYPES NPY_NTYPES_LEGACY
    #define PyArray_DescrProto PyArray_Descr
+    /* NumPy 2 definition always works, but add it for 1.x only */
+    #define PyDataType_ISLEGACY(dtype) (1)


Technically not true because of the experimental dtype flag but it's probably a weird thing to do that across the numpy 1.0/2.0 boundary so this is fine IMO.

ngoldbaum · 2024-02-13T18:34:02Z

numpy/_core/src/multiarray/dtypemeta.c

            dt_slots->get_clear_loop =
-                    npy_get_clear_void_and_legacy_user_dtype_loop;
+                    (void *)npy_get_clear_void_and_legacy_user_dtype_loop;


are these void* casts a remnant of an earlier version or this PR where you modified the NPY_DType_Slots struct or do you need these for some other reason?

The void casts are a bit lazy (could use the full name, which is long though), but because of the changing the signature, you need to cast now to avoid compiler warnings.

instead of adding these void * casts, I'd prefer adding new function typedefs for internal usages, even better if we can figure out how to make this function take the real struct

Yeah, sorry: I'll follow up with the actual signature names.

I am not sure I see this is a problem: it seems like a common pattern in type slots to cast, i.e. array number slots will use PyArrayObject * (must clearer) and then cast it when assigning to avoid the C-warnings because the typedef is for PyObject *.

For me, the problem is it adds cognitive overhead while reading the code.m, since I need to stop and look at the definitions and compare types to see if the void * cast is safe. I also don't think the type checking is useless, I made a few type errors that the typedefs in the C API caught while developing stringdtype.

ngoldbaum · 2024-02-13T18:35:15Z

The change itself looks fine, just a couple general higher level questions.

ngoldbaum · 2024-02-13T18:42:13Z

Also still needs a release note and some discussion in the numpy 2.0 transition guide for C API users (latter doesn't need to happen here, just a note that it does need to happen).

seberg · 2024-02-13T20:59:22Z

This may make numpy internals a bit harder to understand. Is the plan to refactor so we're not passing legacy descriptors around internally?

Yeah, I don't love it to some degree, but having fields and names on the new dtypes when users can't reasonably use them seems not great. We have subclasses now, and that means casting to the subclass when you need to access the subclass info...

seberg · 2024-02-13T21:13:09Z

Also still needs a release note and some discussion in the numpy

Technically, this PR doesn't change public API yet (because I want that change to be a big bang). But I added a release note pretending a bit that it did.

ngoldbaum

This one seems less risky, all of the uses of these fields I'm finding are in custom legacy dtypes and not in any major packages, but it's hard to tell for sure because "metadata", "names", and "fields" are pretty generic and show up in other contexts too.

I'd like it if another maintainer could give their opinion about whether we like the change to internally using _PyArray_LegacyDescr in many places. I suspect it will be confusing to handle for people who are used to looking at numpy internals but aren't following this discussion closely.

ngoldbaum · 2024-02-16T15:47:34Z

doc/release/upcoming_changes/25802.c_api.rst

+------------------------------------------------------------------
+The dtype structures fields ``metadata``, ``c_metadata``, ``names``,
+``fields``, and ``subarray`` must now be accessed through new
+functions following the same names, such as ``PyDataType_METADATA``.


We should list these new accessor macros in the C API docs.

ngoldbaum · 2024-02-16T16:01:55Z

numpy/_core/src/multiarray/dtypemeta.c

            dt_slots->get_clear_loop =
-                    npy_get_clear_void_and_legacy_user_dtype_loop;
+                    (void *)npy_get_clear_void_and_legacy_user_dtype_loop;


instead of adding these void * casts, I'd prefer adding new function typedefs for internal usages, even better if we can figure out how to make this function take the real struct

numpy/_core/include/numpy/ndarraytypes.h

(odd that I didn't see the warning locally...)

…d though)" This reverts commit cc12306.

seberg · 2024-02-18T15:21:35Z

I have updated the other PR. This needs a little work, because it collided a bit more with the flags PR (requiring direct access/use of a macro or casts in 5 places or so).

I will re-instate metadata for the sake of being minimal (it is a very minor part, but it is the one thing that is OK to keep also for new dtypes, whether they actually use it or not).

I'd like it if another maintainer could give their opinion about whether we like the change to internally using _PyArray_LegacyDescr in many places

Happy if others disagree! We can keep the fields (although they should move around anyway probably). I don't loves the casts, but also don't think it is so terrible:

It is specific to structured dtypes and datetimes (effectively) and seeing them as subclasses that extend the normal dtype struct isn't so odd (subclasses require such casts).
(I would love a better name, but StructDescr isn't actually correct for c_metadata at least. Fortunately, it is at least something we can change at any time.)
While it is a lot of lines changes, they are mostly in code paths that specifically deal with promotion/casts of structured dtypes.

rgommers · 2024-03-03T14:41:35Z

This looks like it may end up being the last PR marked for 2.0.0. Since Chuck planned to branch today (unlikely to happen though, but it's any day now) and this still has a merge conflict and seems to need some more work, could we have a status update on this one? Is it actually blocking for 2.0.0rc1 or is it nice to have?

Could also check the flags manually, but this seemed few enough that it isn't a big deal.

I prefer to have the casts rather than cast in the functions because this is the pattern that is used in Python C-slots as well.

rgommers · 2024-03-04T11:10:16Z

I'll have a look at the FreeBSD failure (unrelated and getting annoying).

For the doc build issue, that's related:

The following files were not found by towncrier:
    25802.c_removal.rst

For the final additions/changes to the release notes, can you please edit the doc/source/release/2.0.0-notes.rst directly rather than using towncrier snippets?

seberg · 2024-03-04T16:38:56Z

I think this should be ready. I stuck with the cast on the methods because I think that is a very common pattern, otherwise you need to cast in the functions themselves (the assignment itself is not often read).

(Please squash merge of course)

ngoldbaum

LGTM. I'm OK with the internal code changes now and feel like the additional comments and explanation helps to understand why the legacy struct exists.

I guess we can't do much about subarrays, fields, and names but the metadata members we can probably deprecate in the future and eventually get rid of at least some of this legacy compat code. Maybe if we're really ambitious someone can take on overhauling structured dtypes...

mhvk

I started reviewing, but as I went on, it seems a lot of changes that make the code less readable for future developers. Would it make more sense to just bite the bullet and use PyDataType_NAMES, etc., whenever they are needed? Obviously, one can change things around so that for internal use it doesn't do the legacy type check but rather just replaces with ((_PyArray_LegacyDescr *)descr)->names. But I think our future selves will thank us...

mhvk · 2024-03-04T16:57:01Z

doc/source/reference/c-api/array.rst

+    this list should not be mutated, NumPy may change the way fields are
+    stored in the future.
+
+.. c:function:: PyObject *PyTypeNum_NAMES(PyArray_Descr *descr)


NAMES -> FIELDS??

mhvk · 2024-03-04T16:57:35Z

doc/source/release/2.0.0-notes.rst

@@ -407,6 +407,17 @@ to adapt your code and achieve compatibility with both 1.x and 2.x.
 (`gh-25792 <https://github.com/numpy/numpy/pull/25792>`__)


+Structured dtype information access through functions
+-----------------------------------------------------
+The dtype structures fields `c_metadata``, ``names``,


Missing backtick on c_metadata

mhvk · 2024-03-04T16:58:28Z

doc/source/release/2.0.0-notes.rst

+The dtype structures fields `c_metadata``, ``names``,
+``fields``, and ``subarray`` must now be accessed through new
+functions following the same names, such as ``PyDataType_NAMES``.
+Direct access of the fields is not valid as they do not exist for


Here, you write PyDataType_NAMES but in the .rst above you call it PyTypeNum_NAMES - I think the former is better but it should be consistent.

mhvk · 2024-03-04T16:59:13Z

numpy/_core/include/numpy/arrayscalars.h

@@ -150,7 +150,12 @@ typedef struct {
 typedef struct {
        PyObject_VAR_HEAD
        char *obval;
+#if defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD
+        /* Internally it should be able to conveniently access names/fields */


able -> possible

mhvk · 2024-03-04T17:01:57Z

numpy/_core/include/numpy/ndarraytypes.h

@@ -573,6 +573,10 @@ typedef struct {
                                NPY_ITEM_IS_POINTER | NPY_ITEM_REFCOUNT | \
                                NPY_NEEDS_INIT | NPY_NEEDS_PYAPI)

+#if !(defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD)
+/*
+ * Public version of the Descriptor struct


I'm confused - why does the public version still have those things like names. Isn't the point that one should use the accessors? Or am I misreading the above if?

No, you are not: The point is that I am splitting this into multiple PRs and this one doesn't actually change anything. The only reason I added the release note here, was because Nathan asked for it, and I agreed it might be easier to reason about it here than in a PR that breaks more things.

mhvk · 2024-03-04T17:07:59Z

numpy/_core/src/multiarray/_multiarray_tests.c.src

@@ -496,7 +496,7 @@ static PyObject *
 create_custom_field_dtype(PyObject *NPY_UNUSED(mod), PyObject *args)
 {
    PyArray_DescrProto proto;
-    PyArray_Descr *dtype;
+    _PyArray_LegacyDescr *dtype;  /* Is checked for void, so legacy is OK */


I'd tend to go the other way around, and just recast or define a new variable when you need the legacy one. But this is fine as is too.

mhvk · 2024-03-04T17:08:28Z

numpy/_core/src/multiarray/_multiarray_tests.c.src

@@ -510,7 +510,7 @@ create_custom_field_dtype(PyObject *NPY_UNUSED(mod), PyObject *args)
    if (dtype->type_num != NPY_VOID || dtype->fields == NULL ||
            !PyDict_CheckExact(dtype->fields) ||
            PyTuple_Size(dtype->names) != 1 ||
-            !PyDataType_REFCHK(dtype) ||
+            !PyDataType_REFCHK((PyArray_Descr *)dtype) ||


Basically feel one should avoid recasting a legacy type into a "real" one (but really am nitpicking here)

That gets annoying since you cannot assign using the macros. Having both around is reasonable.
Also, what you don't see is that I removed a lot of changed lines of code again by going the cast route.

It would be nice to just define some of these internally in a way that supports the subclass LegacyDescr *, but it was few enough instances where this was needed that I didn't think it was very important.

mhvk · 2024-03-04T17:10:30Z

numpy/_core/src/multiarray/arrayobject.c


        /* Use promotion to decide whether the comparison is valid */
-        PyArray_Descr *promoted = PyArray_PromoteTypes(self_descr, other_descr);
+        PyArray_Descr *promoted = PyArray_PromoteTypes(


Maybe do this promotion check before even defining self_descr and other_descr?

Yeah, probably a bit nicer, changed it.

seberg · 2024-03-04T18:42:53Z

metadata members we can probably deprecate in the future and eventually get rid of at least some of this legacy compat code

Yeah, maybe. H5Py uses it I think, so I don't care too much either way. If you lean towards that, I will lean towards removing the field (it is a follow-up anyway).

Either way user-dtypes could enable an equivalent metdata themselves (and we could invent API to make it easier to do so).

that make the code less readable for future developers

Yes, the point is that structured dtype and datetime/timedelta need to be thought of as subclasses of dtype which extend it (ignore the Legacy name).

So the choice is: either we have fields, names, and c_metadata around on user dtypes but they must always be NULL/None, or we do this internal dance.

I don't like the dance, but I hate to force these fields on everyone forever (when they cannot be used). And I don't hate the dance that much if I think about it as a sunbclass: These fields are same as the allocator in StringDType, the main difference is that the old code always tried to do all dtype logic in a single code path making this unfortunate messy.

mhvk · 2024-03-04T19:23:43Z

I like your way of thinking of them as subclasses - that really helps. And that also makes using a different struct much more logical, so removes much of my argument for not doing.

But: Would it be an idea to just go with that and for the structured case call it PyArray_StructuredDataType already now? (probably with an underscore to keep it private). And only use it for structured? (I think there is not much else and for those I think you mostly use the new functions anyway...) Really my main worry here is about clarity for future developers - this code is not easy and _legacy* is not necessarily very helpful...

p.s. Sorry to be commenting so late in the cycle - if it is too late, that's obviously OK.

seberg · 2024-03-05T07:14:59Z

I would like to just call it VoidDescr or so, unfortunately we don't strictly prevent e.g. fields on other dtypes right now and in a few places (the odd user dtype at least) that is used.

So we have a bunch of descr subclasses, but in subclass speak, all of them are trivial subclasses of a single "legacy" and much of our code bunches them together (since all old code used to).

Maybe it would be nice to disallow most of this (I think we can get away with it, I would even do it in a 2.1 e.g. for fields on non-void) and then just create a bunch of aliases. The metadata field is the odd one out, as it makes sense for all current dtypes.

(About metadata: Should we give our dtypes a __dict__ slot and just give that as a way to tag on metadata? That comes with no guarantees on how it is propagated, but we have little guarantee already.)

In either case, I will merge this because I can't really continue working on the actual breaking PR without it (hopefully one): Please add anything that comes to mind to the issue.

github-actions bot added the 03 - Maintenance label Feb 10, 2024

seberg force-pushed the create-legacy-descr branch 2 times, most recently from ea1443a to 39df139 Compare February 11, 2024 13:49

seberg marked this pull request as ready for review February 11, 2024 13:49

seberg requested a review from ngoldbaum February 12, 2024 08:45

mattip reviewed Feb 12, 2024

View reviewed changes

mattip reviewed Feb 13, 2024

View reviewed changes

seberg force-pushed the create-legacy-descr branch from d624287 to 74d719b Compare February 13, 2024 12:25

ngoldbaum reviewed Feb 13, 2024

View reviewed changes

seberg mentioned this pull request Feb 13, 2024

In-flight DType change follow-ups #25820

Closed

5 tasks

seberg requested a review from ngoldbaum February 16, 2024 14:49

rgommers added this to the 2.0.0 release milestone Feb 16, 2024

ngoldbaum reviewed Feb 16, 2024

View reviewed changes

ngoldbaum reviewed Feb 17, 2024

View reviewed changes

numpy/_core/include/numpy/ndarraytypes.h Show resolved Hide resolved

seberg added 8 commits February 18, 2024 15:14

MAINT: Internally split some Descriptor fields into LegacyDescriptor

439bc0f

MAINT: Temporary fix for cython to build random (not super bad though)

7329d7e

MAINT: Massive fixup to cast to _PyArray_LegacyDescr or use macro access

c82245a

MAINT: Further places where to prefer casts (and rename in cast prep)

ba10883

MAINT: Further preferences for casts instead of macro with check

873d3ec

MAINT: Further use of direct access on LegacyDescr (and one related fix)

87d9d95

MAINT: Add missing Legacydescr cast

8e7cc23

(odd that I didn't see the warning locally...)

Revert "MAINT: Temporary fix for cython to build random (not super ba…

b58c790

…d though)" This reverts commit cc12306.

seberg added 6 commits February 18, 2024 15:17

MAINT: Use static inline functions rather than macros as per review

03b4277

MAINT: Use direct struct access again in more places

70409c4

DOC: Add release note snippet for descr field access indirection

0910b58

MAINT: Delete PyDataType_GetDatetimeMetaData completely

a8f4618

MAINT: Keep ->metadata as direct access internally

5a3eafa

MAINT: Fix merge/rebase conflict

80beb25

ngoldbaum mentioned this pull request Feb 20, 2024

NEP: add NEP 56 on array API standard support in main namespace #25542

Merged

rgommers mentioned this pull request Mar 3, 2024

Coordination: last items before branching 2.0.x #25918

Closed

10 tasks

seberg added 2 commits March 4, 2024 11:14

Merge branch 'main' into create-legacy-descr

80b6d6e

MAINT: Add a few casts back to normal descriptor for flag checking

b37cbf7

Could also check the flags manually, but this seemed few enough that it isn't a big deal.

seberg force-pushed the create-legacy-descr branch from 9fae7d3 to b37cbf7 Compare March 4, 2024 10:23

MAINT: Add explicitly casts for functions rather than void *

82ac0ba

I prefer to have the casts rather than cast in the functions because this is the pattern that is used in Python C-slots as well.

seberg force-pushed the create-legacy-descr branch from 54506ce to 82ac0ba Compare March 4, 2024 10:45

DOC: Add brief documentation on new accessors and move release notes

56194f0

ngoldbaum approved these changes Mar 4, 2024

View reviewed changes

mhvk reviewed Mar 4, 2024

View reviewed changes

MAINT,DOC: Fixups based on review

31e5d47

seberg merged commit 2c95083 into numpy:main Mar 5, 2024

lithomas1 mentioned this pull request Mar 8, 2024

COMPAT: Adapt to Numpy 2.0 dtype changes pandas-dev/pandas#57774

Closed

5 tasks

	#if !defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD
	#if !defined(NPY_INTERNAL_BUILD)

MAINT,API: Make metadata, c_metadata, fields, and names only exist on old-style dtypes #25802

MAINT,API: Make metadata, c_metadata, fields, and names only exist on old-style dtypes #25802

Conversation

seberg commented Feb 10, 2024

mhvk commented Feb 11, 2024

seberg commented Feb 11, 2024

seberg commented Feb 11, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ngoldbaum left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ngoldbaum Feb 17, 2024 • edited Loading

Choose a reason for hiding this comment

ngoldbaum commented Feb 13, 2024

ngoldbaum commented Feb 13, 2024

seberg commented Feb 13, 2024

seberg commented Feb 13, 2024

ngoldbaum left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg commented Feb 18, 2024

rgommers commented Mar 3, 2024

rgommers commented Mar 4, 2024

seberg commented Mar 4, 2024

ngoldbaum left a comment

Choose a reason for hiding this comment

mhvk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg commented Mar 4, 2024

mhvk commented Mar 4, 2024

seberg commented Mar 5, 2024

ngoldbaum Feb 17, 2024 •

edited

Loading

ngoldbaum left a comment •

edited

Loading