Skip to content

MAINT,API: Make metadata, c_metadata, fields, and names only exist on old-style dtypes #25802

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Mar 5, 2024

Conversation

seberg
Copy link
Member

@seberg seberg commented Feb 10, 2024

Right now, this is a heads-up a start to:

  • Introduce _PyArray_LegacyDescr for the full struct
  • Introduce some additional access macros for the fields that may not exist now which check for that (and cast). In some cases we use them even though we could also cast.

I need to do another pass, there are warnings. Also random seems to be build in INTERNAL mode, because it picks up the new struct definition (and thus fails to compile, because we have a function to fetch subarray in the Cython file even though it isn't used).


I would much appreciate a heads up if anyone thinks this isn't right. When it comes to naming, replacing the name is easy and always possible, though.
The casts are a bit annoying, but honestly half the time we need guards anyway, and casting for subclasses isn't that odd.

It took many hours to get this far, but the full diff isn't that bad, IMO.

@mhvk
Copy link
Contributor

mhvk commented Feb 11, 2024

I've not looked through the details, but is the idea that new-style dtypes will not be structured so don't need fields and names? There's a lot of code out there that checks for structured dtype with things like if dtype.names is None, etc.; will those break if a new-style dtype is used?

@seberg
Copy link
Member Author

seberg commented Feb 11, 2024

That doesn't break, we just still define those to be None in Python (I might have to modify the C code a tiny bit).

The idea is mostly that as a followup, C-code has to use new macros at the very least, so that we could create a new way to do structured dtypes.

@seberg seberg force-pushed the create-legacy-descr branch 2 times, most recently from ea1443a to 39df139 Compare February 11, 2024 13:49
@seberg seberg marked this pull request as ready for review February 11, 2024 13:49
@seberg
Copy link
Member Author

seberg commented Feb 11, 2024

This is working now and should be fine overall, a note for review:

  • There is a little script added I used to get a start on changing direct access to macro access. I could remove it here.
  • The changes the ndarraytypes.h are the most interesting.
  • The changes to the Cython were needed for random to compile. I don't see them as bad enough to worry about (Python access seems fine to me, subarray dtypes are very rare after all).

The other commits are probably best viewed together, I added a few later ones when reviewing myself. But those later ones for the most part reduce the amount of lines changed by reverting to direct access (because the descr was cast to a LegacyDescr already.)

There are a decent amount of places that might make more or just as much sense to cast, but do use the macros: Overall, I didn't want to overly worry about using the macro in a few places where the ISLEGACY check is unnecessary.

@seberg seberg requested a review from ngoldbaum February 12, 2024 08:45
@@ -578,6 +578,11 @@ typedef struct {
#define PyDataType_REFCHK(dtype) \
PyDataType_FLAGCHK(dtype, NPY_ITEM_REFCOUNT)


#if !defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be just

Suggested change
#if !defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD
#if !defined(NPY_INTERNAL_BUILD)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this might be the issue about random requiring changes in the init.pyd!

Unless you disagree, will just add the brackets around it, since checking both seems the pattern we use everywhere (I guess it forces setting it to an actual value which seems a good habit).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, that was the reason for the Cython/random oddity. I added new commits for the changes for review, but happy to squash-merge.

#define PyDataType_METADATA(dtype) (!PyDataType_ISLEGACY(dtype) ? \
NULL : ((_PyArray_LegacyDescr *)dtype)->metadata)
#define PyDataType_C_METADATA(dtype) (!PyDataType_ISLEGACY(dtype) ? \
NULL : ((_PyArray_LegacyDescr *)dtype)->c_metadata)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about a comprimise: make them static functions so they can be properly typed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, that also means I enforce the direct access for when it is already cast to _PyArray_LegacyDescr *. (Won't touch the old macros here, as it is more tedious, although I can see us changing them also.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

#content = simple_fix(content, apply_filter=is_public, name="alignment")

# Write the result:
file.write_text(content)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you see this being useful once this PR is merged? If so, there are some linting problems. If not, maybe we don't need it in the repo?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not particularly, I thought it might be intersting for reviewers to see. Let me remove it, we can add it back in a dedicated step if anyone thinks its useful.

@seberg seberg force-pushed the create-legacy-descr branch from d624287 to 74d719b Compare February 13, 2024 12:25
Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may make numpy internals a bit harder to understand. Is the plan to refactor so we're not passing legacy descriptors around internally? I think ideally there would be only one internal representation people need to think about.

@@ -82,6 +82,8 @@
/* Aliases of 2.x names to 1.x only equivalent names */
#define NPY_NTYPES NPY_NTYPES_LEGACY
#define PyArray_DescrProto PyArray_Descr
/* NumPy 2 definition always works, but add it for 1.x only */
#define PyDataType_ISLEGACY(dtype) (1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically not true because of the experimental dtype flag but it's probably a weird thing to do that across the numpy 1.0/2.0 boundary so this is fine IMO.

dt_slots->get_clear_loop =
npy_get_clear_void_and_legacy_user_dtype_loop;
(void *)npy_get_clear_void_and_legacy_user_dtype_loop;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these void* casts a remnant of an earlier version or this PR where you modified the NPY_DType_Slots struct or do you need these for some other reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The void casts are a bit lazy (could use the full name, which is long though), but because of the changing the signature, you need to cast now to avoid compiler warnings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding these void * casts, I'd prefer adding new function typedefs for internal usages, even better if we can figure out how to make this function take the real struct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, sorry: I'll follow up with the actual signature names.

I am not sure I see this is a problem: it seems like a common pattern in type slots to cast, i.e. array number slots will use PyArrayObject * (must clearer) and then cast it when assigning to avoid the C-warnings because the typedef is for PyObject *.

Copy link
Member

@ngoldbaum ngoldbaum Feb 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For me, the problem is it adds cognitive overhead while reading the code.m, since I need to stop and look at the definitions and compare types to see if the void * cast is safe. I also don't think the type checking is useless, I made a few type errors that the typedefs in the C API caught while developing stringdtype.

@ngoldbaum
Copy link
Member

The change itself looks fine, just a couple general higher level questions.

@ngoldbaum
Copy link
Member

Also still needs a release note and some discussion in the numpy 2.0 transition guide for C API users (latter doesn't need to happen here, just a note that it does need to happen).

@seberg
Copy link
Member Author

seberg commented Feb 13, 2024

This may make numpy internals a bit harder to understand. Is the plan to refactor so we're not passing legacy descriptors around internally?

Yeah, I don't love it to some degree, but having fields and names on the new dtypes when users can't reasonably use them seems not great. We have subclasses now, and that means casting to the subclass when you need to access the subclass info...

@seberg
Copy link
Member Author

seberg commented Feb 13, 2024

Also still needs a release note and some discussion in the numpy

Technically, this PR doesn't change public API yet (because I want that change to be a big bang). But I added a release note pretending a bit that it did.

@seberg seberg mentioned this pull request Feb 13, 2024
5 tasks
@seberg seberg requested a review from ngoldbaum February 16, 2024 14:49
@rgommers rgommers added this to the 2.0.0 release milestone Feb 16, 2024
Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one seems less risky, all of the uses of these fields I'm finding are in custom legacy dtypes and not in any major packages, but it's hard to tell for sure because "metadata", "names", and "fields" are pretty generic and show up in other contexts too.

I'd like it if another maintainer could give their opinion about whether we like the change to internally using _PyArray_LegacyDescr in many places. I suspect it will be confusing to handle for people who are used to looking at numpy internals but aren't following this discussion closely.

------------------------------------------------------------------
The dtype structures fields ``metadata``, ``c_metadata``, ``names``,
``fields``, and ``subarray`` must now be accessed through new
functions following the same names, such as ``PyDataType_METADATA``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should list these new accessor macros in the C API docs.

dt_slots->get_clear_loop =
npy_get_clear_void_and_legacy_user_dtype_loop;
(void *)npy_get_clear_void_and_legacy_user_dtype_loop;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding these void * casts, I'd prefer adding new function typedefs for internal usages, even better if we can figure out how to make this function take the real struct

@seberg
Copy link
Member Author

seberg commented Feb 18, 2024

I have updated the other PR. This needs a little work, because it collided a bit more with the flags PR (requiring direct access/use of a macro or casts in 5 places or so).

I will re-instate metadata for the sake of being minimal (it is a very minor part, but it is the one thing that is OK to keep also for new dtypes, whether they actually use it or not).

I'd like it if another maintainer could give their opinion about whether we like the change to internally using _PyArray_LegacyDescr in many places

Happy if others disagree! We can keep the fields (although they should move around anyway probably). I don't loves the casts, but also don't think it is so terrible:

  • It is specific to structured dtypes and datetimes (effectively) and seeing them as subclasses that extend the normal dtype struct isn't so odd (subclasses require such casts).
    (I would love a better name, but StructDescr isn't actually correct for c_metadata at least. Fortunately, it is at least something we can change at any time.)
  • While it is a lot of lines changes, they are mostly in code paths that specifically deal with promotion/casts of structured dtypes.

@rgommers
Copy link
Member

rgommers commented Mar 3, 2024

This looks like it may end up being the last PR marked for 2.0.0. Since Chuck planned to branch today (unlikely to happen though, but it's any day now) and this still has a merge conflict and seems to need some more work, could we have a status update on this one? Is it actually blocking for 2.0.0rc1 or is it nice to have?

seberg added 2 commits March 4, 2024 11:14
Could also check the flags manually, but this seemed few enough
that it isn't a big deal.
@seberg seberg force-pushed the create-legacy-descr branch from 9fae7d3 to b37cbf7 Compare March 4, 2024 10:23
I prefer to have the casts rather than cast in the functions because
this is the pattern that is used in Python C-slots as well.
@seberg seberg force-pushed the create-legacy-descr branch from 54506ce to 82ac0ba Compare March 4, 2024 10:45
@rgommers
Copy link
Member

rgommers commented Mar 4, 2024

I'll have a look at the FreeBSD failure (unrelated and getting annoying).

For the doc build issue, that's related:

The following files were not found by towncrier:
    25802.c_removal.rst

For the final additions/changes to the release notes, can you please edit the doc/source/release/2.0.0-notes.rst directly rather than using towncrier snippets?

@seberg
Copy link
Member Author

seberg commented Mar 4, 2024

I think this should be ready. I stuck with the cast on the methods because I think that is a very common pattern, otherwise you need to cast in the functions themselves (the assignment itself is not often read).

(Please squash merge of course)

Copy link
Member

@ngoldbaum ngoldbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm OK with the internal code changes now and feel like the additional comments and explanation helps to understand why the legacy struct exists.

I guess we can't do much about subarrays, fields, and names but the metadata members we can probably deprecate in the future and eventually get rid of at least some of this legacy compat code. Maybe if we're really ambitious someone can take on overhauling structured dtypes...

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started reviewing, but as I went on, it seems a lot of changes that make the code less readable for future developers. Would it make more sense to just bite the bullet and use PyDataType_NAMES, etc., whenever they are needed? Obviously, one can change things around so that for internal use it doesn't do the legacy type check but rather just replaces with ((_PyArray_LegacyDescr *)descr)->names. But I think our future selves will thank us...

this list should not be mutated, NumPy may change the way fields are
stored in the future.

.. c:function:: PyObject *PyTypeNum_NAMES(PyArray_Descr *descr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NAMES -> FIELDS??

@@ -407,6 +407,17 @@ to adapt your code and achieve compatibility with both 1.x and 2.x.
(`gh-25792 <https://github.com/numpy/numpy/pull/25792>`__)


Structured dtype information access through functions
-----------------------------------------------------
The dtype structures fields `c_metadata``, ``names``,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing backtick on c_metadata

The dtype structures fields `c_metadata``, ``names``,
``fields``, and ``subarray`` must now be accessed through new
functions following the same names, such as ``PyDataType_NAMES``.
Direct access of the fields is not valid as they do not exist for
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, you write PyDataType_NAMES but in the .rst above you call it PyTypeNum_NAMES - I think the former is better but it should be consistent.

@@ -150,7 +150,12 @@ typedef struct {
typedef struct {
PyObject_VAR_HEAD
char *obval;
#if defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD
/* Internally it should be able to conveniently access names/fields */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

able -> possible

@@ -573,6 +573,10 @@ typedef struct {
NPY_ITEM_IS_POINTER | NPY_ITEM_REFCOUNT | \
NPY_NEEDS_INIT | NPY_NEEDS_PYAPI)

#if !(defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD)
/*
* Public version of the Descriptor struct
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused - why does the public version still have those things like names. Isn't the point that one should use the accessors? Or am I misreading the above if?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you are not: The point is that I am splitting this into multiple PRs and this one doesn't actually change anything. The only reason I added the release note here, was because Nathan asked for it, and I agreed it might be easier to reason about it here than in a PR that breaks more things.

@@ -496,7 +496,7 @@ static PyObject *
create_custom_field_dtype(PyObject *NPY_UNUSED(mod), PyObject *args)
{
PyArray_DescrProto proto;
PyArray_Descr *dtype;
_PyArray_LegacyDescr *dtype; /* Is checked for void, so legacy is OK */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd tend to go the other way around, and just recast or define a new variable when you need the legacy one. But this is fine as is too.

@@ -510,7 +510,7 @@ create_custom_field_dtype(PyObject *NPY_UNUSED(mod), PyObject *args)
if (dtype->type_num != NPY_VOID || dtype->fields == NULL ||
!PyDict_CheckExact(dtype->fields) ||
PyTuple_Size(dtype->names) != 1 ||
!PyDataType_REFCHK(dtype) ||
!PyDataType_REFCHK((PyArray_Descr *)dtype) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically feel one should avoid recasting a legacy type into a "real" one (but really am nitpicking here)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That gets annoying since you cannot assign using the macros. Having both around is reasonable.
Also, what you don't see is that I removed a lot of changed lines of code again by going the cast route.

It would be nice to just define some of these internally in a way that supports the subclass LegacyDescr *, but it was few enough instances where this was needed that I didn't think it was very important.


/* Use promotion to decide whether the comparison is valid */
PyArray_Descr *promoted = PyArray_PromoteTypes(self_descr, other_descr);
PyArray_Descr *promoted = PyArray_PromoteTypes(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe do this promotion check before even defining self_descr and other_descr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, probably a bit nicer, changed it.

@seberg
Copy link
Member Author

seberg commented Mar 4, 2024

metadata members we can probably deprecate in the future and eventually get rid of at least some of this legacy compat code

Yeah, maybe. H5Py uses it I think, so I don't care too much either way. If you lean towards that, I will lean towards removing the field (it is a follow-up anyway).

Either way user-dtypes could enable an equivalent metdata themselves (and we could invent API to make it easier to do so).

that make the code less readable for future developers

Yes, the point is that structured dtype and datetime/timedelta need to be thought of as subclasses of dtype which extend it (ignore the Legacy name).

So the choice is: either we have fields, names, and c_metadata around on user dtypes but they must always be NULL/None, or we do this internal dance.

I don't like the dance, but I hate to force these fields on everyone forever (when they cannot be used). And I don't hate the dance that much if I think about it as a sunbclass: These fields are same as the allocator in StringDType, the main difference is that the old code always tried to do all dtype logic in a single code path making this unfortunate messy.

@mhvk
Copy link
Contributor

mhvk commented Mar 4, 2024

I like your way of thinking of them as subclasses - that really helps. And that also makes using a different struct much more logical, so removes much of my argument for not doing.

But: Would it be an idea to just go with that and for the structured case call it PyArray_StructuredDataType already now? (probably with an underscore to keep it private). And only use it for structured? (I think there is not much else and for those I think you mostly use the new functions anyway...) Really my main worry here is about clarity for future developers - this code is not easy and _legacy* is not necessarily very helpful...

p.s. Sorry to be commenting so late in the cycle - if it is too late, that's obviously OK.

@seberg
Copy link
Member Author

seberg commented Mar 5, 2024

I would like to just call it VoidDescr or so, unfortunately we don't strictly prevent e.g. fields on other dtypes right now and in a few places (the odd user dtype at least) that is used.

So we have a bunch of descr subclasses, but in subclass speak, all of them are trivial subclasses of a single "legacy" and much of our code bunches them together (since all old code used to).

Maybe it would be nice to disallow most of this (I think we can get away with it, I would even do it in a 2.1 e.g. for fields on non-void) and then just create a bunch of aliases. The metadata field is the odd one out, as it makes sense for all current dtypes.

(About metadata: Should we give our dtypes a __dict__ slot and just give that as a way to tag on metadata? That comes with no guarantees on how it is propagated, but we have little guarantee already.)


In either case, I will merge this because I can't really continue working on the actual breaking PR without it (hopefully one): Please add anything that comes to mind to the issue.

@seberg seberg merged commit 2c95083 into numpy:main Mar 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants