-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
MAINT,API: Make metadata, c_metadata, fields, and names only exist on old-style dtypes #25802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I've not looked through the details, but is the idea that new-style dtypes will not be structured so don't need |
That doesn't break, we just still define those to be The idea is mostly that as a followup, C-code has to use new macros at the very least, so that we could create a new way to do structured dtypes. |
ea1443a
to
39df139
Compare
This is working now and should be fine overall, a note for review:
The other commits are probably best viewed together, I added a few later ones when reviewing myself. But those later ones for the most part reduce the amount of lines changed by reverting to direct access (because the descr was cast to a There are a decent amount of places that might make more or just as much sense to cast, but do use the macros: Overall, I didn't want to overly worry about using the macro in a few places where the |
@@ -578,6 +578,11 @@ typedef struct { | |||
#define PyDataType_REFCHK(dtype) \ | |||
PyDataType_FLAGCHK(dtype, NPY_ITEM_REFCOUNT) | |||
|
|||
|
|||
#if !defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be just
#if !defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD | |
#if !defined(NPY_INTERNAL_BUILD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this might be the issue about random
requiring changes in the init.pyd
!
Unless you disagree, will just add the brackets around it, since checking both seems the pattern we use everywhere (I guess it forces setting it to an actual value which seems a good habit).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, that was the reason for the Cython/random oddity. I added new commits for the changes for review, but happy to squash-merge.
#define PyDataType_METADATA(dtype) (!PyDataType_ISLEGACY(dtype) ? \ | ||
NULL : ((_PyArray_LegacyDescr *)dtype)->metadata) | ||
#define PyDataType_C_METADATA(dtype) (!PyDataType_ISLEGACY(dtype) ? \ | ||
NULL : ((_PyArray_LegacyDescr *)dtype)->c_metadata) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about a comprimise: make them static functions so they can be properly typed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, that also means I enforce the direct access for when it is already cast to _PyArray_LegacyDescr *
. (Won't touch the old macros here, as it is more tedious, although I can see us changing them also.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
tools/descriptor_update.py
Outdated
#content = simple_fix(content, apply_filter=is_public, name="alignment") | ||
|
||
# Write the result: | ||
file.write_text(content) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you see this being useful once this PR is merged? If so, there are some linting problems. If not, maybe we don't need it in the repo?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not particularly, I thought it might be intersting for reviewers to see. Let me remove it, we can add it back in a dedicated step if anyone thinks its useful.
d624287
to
74d719b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may make numpy internals a bit harder to understand. Is the plan to refactor so we're not passing legacy descriptors around internally? I think ideally there would be only one internal representation people need to think about.
@@ -82,6 +82,8 @@ | |||
/* Aliases of 2.x names to 1.x only equivalent names */ | |||
#define NPY_NTYPES NPY_NTYPES_LEGACY | |||
#define PyArray_DescrProto PyArray_Descr | |||
/* NumPy 2 definition always works, but add it for 1.x only */ | |||
#define PyDataType_ISLEGACY(dtype) (1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically not true because of the experimental dtype flag but it's probably a weird thing to do that across the numpy 1.0/2.0 boundary so this is fine IMO.
dt_slots->get_clear_loop = | ||
npy_get_clear_void_and_legacy_user_dtype_loop; | ||
(void *)npy_get_clear_void_and_legacy_user_dtype_loop; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these void*
casts a remnant of an earlier version or this PR where you modified the NPY_DType_Slots
struct or do you need these for some other reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The void casts are a bit lazy (could use the full name, which is long though), but because of the changing the signature, you need to cast now to avoid compiler warnings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding these void * casts, I'd prefer adding new function typedefs for internal usages, even better if we can figure out how to make this function take the real struct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sorry: I'll follow up with the actual signature names.
I am not sure I see this is a problem: it seems like a common pattern in type slots to cast, i.e. array number slots will use PyArrayObject *
(must clearer) and then cast it when assigning to avoid the C-warnings because the typedef is for PyObject *
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For me, the problem is it adds cognitive overhead while reading the code.m, since I need to stop and look at the definitions and compare types to see if the void *
cast is safe. I also don't think the type checking is useless, I made a few type errors that the typedefs in the C API caught while developing stringdtype.
The change itself looks fine, just a couple general higher level questions. |
Also still needs a release note and some discussion in the numpy 2.0 transition guide for C API users (latter doesn't need to happen here, just a note that it does need to happen). |
Yeah, I don't love it to some degree, but having |
Technically, this PR doesn't change public API yet (because I want that change to be a big bang). But I added a release note pretending a bit that it did. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one seems less risky, all of the uses of these fields I'm finding are in custom legacy dtypes and not in any major packages, but it's hard to tell for sure because "metadata", "names", and "fields" are pretty generic and show up in other contexts too.
I'd like it if another maintainer could give their opinion about whether we like the change to internally using _PyArray_LegacyDescr
in many places. I suspect it will be confusing to handle for people who are used to looking at numpy internals but aren't following this discussion closely.
------------------------------------------------------------------ | ||
The dtype structures fields ``metadata``, ``c_metadata``, ``names``, | ||
``fields``, and ``subarray`` must now be accessed through new | ||
functions following the same names, such as ``PyDataType_METADATA``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should list these new accessor macros in the C API docs.
dt_slots->get_clear_loop = | ||
npy_get_clear_void_and_legacy_user_dtype_loop; | ||
(void *)npy_get_clear_void_and_legacy_user_dtype_loop; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of adding these void * casts, I'd prefer adding new function typedefs for internal usages, even better if we can figure out how to make this function take the real struct
(odd that I didn't see the warning locally...)
…d though)" This reverts commit cc12306.
I have updated the other PR. This needs a little work, because it collided a bit more with the I will re-instate
Happy if others disagree! We can keep the fields (although they should move around anyway probably). I don't loves the casts, but also don't think it is so terrible:
|
This looks like it may end up being the last PR marked for 2.0.0. Since Chuck planned to branch today (unlikely to happen though, but it's any day now) and this still has a merge conflict and seems to need some more work, could we have a status update on this one? Is it actually blocking for 2.0.0rc1 or is it nice to have? |
Could also check the flags manually, but this seemed few enough that it isn't a big deal.
9fae7d3
to
b37cbf7
Compare
I prefer to have the casts rather than cast in the functions because this is the pattern that is used in Python C-slots as well.
54506ce
to
82ac0ba
Compare
I'll have a look at the FreeBSD failure (unrelated and getting annoying). For the doc build issue, that's related:
For the final additions/changes to the release notes, can you please edit the |
I think this should be ready. I stuck with the cast on the methods because I think that is a very common pattern, otherwise you need to cast in the functions themselves (the assignment itself is not often read). (Please squash merge of course) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I'm OK with the internal code changes now and feel like the additional comments and explanation helps to understand why the legacy struct exists.
I guess we can't do much about subarrays
, fields
, and names
but the metadata members we can probably deprecate in the future and eventually get rid of at least some of this legacy compat code. Maybe if we're really ambitious someone can take on overhauling structured dtypes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started reviewing, but as I went on, it seems a lot of changes that make the code less readable for future developers. Would it make more sense to just bite the bullet and use PyDataType_NAMES
, etc., whenever they are needed? Obviously, one can change things around so that for internal use it doesn't do the legacy type check but rather just replaces with ((_PyArray_LegacyDescr *)descr)->names
. But I think our future selves will thank us...
doc/source/reference/c-api/array.rst
Outdated
this list should not be mutated, NumPy may change the way fields are | ||
stored in the future. | ||
|
||
.. c:function:: PyObject *PyTypeNum_NAMES(PyArray_Descr *descr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NAMES -> FIELDS??
@@ -407,6 +407,17 @@ to adapt your code and achieve compatibility with both 1.x and 2.x. | |||
(`gh-25792 <https://github.com/numpy/numpy/pull/25792>`__) | |||
|
|||
|
|||
Structured dtype information access through functions | |||
----------------------------------------------------- | |||
The dtype structures fields `c_metadata``, ``names``, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing backtick on c_metadata
The dtype structures fields `c_metadata``, ``names``, | ||
``fields``, and ``subarray`` must now be accessed through new | ||
functions following the same names, such as ``PyDataType_NAMES``. | ||
Direct access of the fields is not valid as they do not exist for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, you write PyDataType_NAMES
but in the .rst above you call it PyTypeNum_NAMES
- I think the former is better but it should be consistent.
@@ -150,7 +150,12 @@ typedef struct { | |||
typedef struct { | |||
PyObject_VAR_HEAD | |||
char *obval; | |||
#if defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD | |||
/* Internally it should be able to conveniently access names/fields */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
able -> possible
@@ -573,6 +573,10 @@ typedef struct { | |||
NPY_ITEM_IS_POINTER | NPY_ITEM_REFCOUNT | \ | |||
NPY_NEEDS_INIT | NPY_NEEDS_PYAPI) | |||
|
|||
#if !(defined(NPY_INTERNAL_BUILD) && NPY_INTERNAL_BUILD) | |||
/* | |||
* Public version of the Descriptor struct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused - why does the public version still have those things like names
. Isn't the point that one should use the accessors? Or am I misreading the above if
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you are not: The point is that I am splitting this into multiple PRs and this one doesn't actually change anything. The only reason I added the release note here, was because Nathan asked for it, and I agreed it might be easier to reason about it here than in a PR that breaks more things.
@@ -496,7 +496,7 @@ static PyObject * | |||
create_custom_field_dtype(PyObject *NPY_UNUSED(mod), PyObject *args) | |||
{ | |||
PyArray_DescrProto proto; | |||
PyArray_Descr *dtype; | |||
_PyArray_LegacyDescr *dtype; /* Is checked for void, so legacy is OK */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd tend to go the other way around, and just recast or define a new variable when you need the legacy one. But this is fine as is too.
@@ -510,7 +510,7 @@ create_custom_field_dtype(PyObject *NPY_UNUSED(mod), PyObject *args) | |||
if (dtype->type_num != NPY_VOID || dtype->fields == NULL || | |||
!PyDict_CheckExact(dtype->fields) || | |||
PyTuple_Size(dtype->names) != 1 || | |||
!PyDataType_REFCHK(dtype) || | |||
!PyDataType_REFCHK((PyArray_Descr *)dtype) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically feel one should avoid recasting a legacy type into a "real" one (but really am nitpicking here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That gets annoying since you cannot assign using the macros. Having both around is reasonable.
Also, what you don't see is that I removed a lot of changed lines of code again by going the cast route.
It would be nice to just define some of these internally in a way that supports the subclass LegacyDescr *
, but it was few enough instances where this was needed that I didn't think it was very important.
|
||
/* Use promotion to decide whether the comparison is valid */ | ||
PyArray_Descr *promoted = PyArray_PromoteTypes(self_descr, other_descr); | ||
PyArray_Descr *promoted = PyArray_PromoteTypes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe do this promotion check before even defining self_descr
and other_descr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, probably a bit nicer, changed it.
Yeah, maybe. H5Py uses it I think, so I don't care too much either way. If you lean towards that, I will lean towards removing the field (it is a follow-up anyway). Either way user-dtypes could enable an equivalent
Yes, the point is that structured dtype and datetime/timedelta need to be thought of as subclasses of dtype which extend it (ignore the So the choice is: either we have I don't like the dance, but I hate to force these fields on everyone forever (when they cannot be used). And I don't hate the dance that much if I think about it as a sunbclass: These fields are same as the |
I like your way of thinking of them as subclasses - that really helps. And that also makes using a different struct much more logical, so removes much of my argument for not doing. But: Would it be an idea to just go with that and for the structured case call it p.s. Sorry to be commenting so late in the cycle - if it is too late, that's obviously OK. |
I would like to just call it So we have a bunch of descr subclasses, but in subclass speak, all of them are trivial subclasses of a single "legacy" and much of our code bunches them together (since all old code used to). Maybe it would be nice to disallow most of this (I think we can get away with it, I would even do it in a 2.1 e.g. for fields on non-void) and then just create a bunch of aliases. The (About metadata: Should we give our In either case, I will merge this because I can't really continue working on the actual breaking PR without it (hopefully one): Please add anything that comes to mind to the issue. |
Right now, this is a heads-up a start to:
_PyArray_LegacyDescr
for the full structI need to do another pass, there are warnings. Also
random
seems to be build in INTERNAL mode, because it picks up the new struct definition (and thus fails to compile, because we have a function to fetchsubarray
in the Cython file even though it isn't used).I would much appreciate a heads up if anyone thinks this isn't right. When it comes to naming, replacing the name is easy and always possible, though.
The casts are a bit annoying, but honestly half the time we need guards anyway, and casting for subclasses isn't that odd.
It took many hours to get this far, but the full diff isn't that bad, IMO.