-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: Dtype without metadata #23185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Dtype without metadata #23185
Conversation
If I understand correctly, this allows a programmer to side-step #23169, but doesn't fix it directly. To fix it, |
It's true that I use it just before saving HDF5 array in zarr store.
But in your case, if you have already files with metadata, i do not know if
you can force the dtype in np.load or take only the buffer part of the
array, not the dtype part
Le jeu. 9 févr. 2023 à 15:00, Sjoerd de Vries ***@***.***> a
écrit :
… If I understand correctly, this allows a programmer to side-step #23169
<#23169>, but doesn't fix it
directly. To fix it, utils.dtype_without_metadata must be invoked by
np.save.
As it is, every .npy file with metadata is *already* unreadable by np.load,
so no harm would be done.
Please correct me if I am wrong.
—
Reply to this email directly, view it on GitHub
<#23185 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AH75LVJG25HW5GIRFZ7QQBLWWT2ABANCNFSM6AAAAAAUWQRNNA>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
@ninousf you can. You could also edit the file pretty easily, but both requires deeper dives into the header parsing code. If anyone wants to try and help with a solution, it might make sense to create a simple recipe for how to solve it and then refer to it from the current error. |
I could help, but I am still not sure which problem(s) you would like to have solved.
Personally, I don't need any of these, the current PR will solve all of my problems. |
I would be in favor of doing 1., potentially with a warning (not sure its actually worth it since its annoying). All else seems a lot of hassle for no clear gain (since I doubt you need to round-trip via the h5py metadata). I could imagine the error to have some bread-crumbs for unlucky folks who ended up storing with metadata. |
In my case, this function solves the problem when a library (like zarr) use descr to build dtype in the save function (not in load for npy format). ` dt = np.dtype({'names': ['r','g','b','a'], It can resolve the problem n.1. It is true that allowing a file to be stored in the knowledge that it cannot be re-read, is a bug for me |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was your plan to just use this internally for now? In that case things are maybe a little bit simpler to settle quickly.
While dealing with structured dtypes can be in Python in either case, it does need to recurse via dtypes.fields
(or names, but fields might almost be easier, there are also "titles"?).
I understand the urge to run auto-formatting, but we don't do it yet in NumPy and the it does make the diff a bit unwieldy in this case.
numpy/lib/tests/test_utils.py
Outdated
for k in dir(dt1): | ||
if not ((k.startswith('__') and k.endswith('__')) | ||
or k in ['descr', 'metadata', 'newbyteorder']): | ||
ret &= (getattr(dt1, k) == getattr(dt2, k)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would write this as:
for ...:
if getattr(dt1, k) != getattr(dt2, k):
return False
return True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could also check np.can_can(dt1, dt2, casting="no")
, since that should only really work only metadata mismatches. (The only issue should be e.g. "long"
and "longlong"
which on most systems, except windows, are the same).
So, I am fine with this also, it is at least thorough and easy enough to read.
numpy/lib/utils.py
Outdated
result : dtype | ||
Dtype containing no metadata | ||
""" | ||
descr = dt.descr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do this in python as a utility, implementing it as dtype.drop_metadata()
or so would have the advantage of easier generalizing user DTypes.
Fields can be nested, so I think this must use a recursive approach. I.e. check dtype.names is None
if it is we should iterate dtype.fields
, if not check dtype.shape != ()
(not a subarray). After that, we would have to assume dtype.str
works (for now).
Both structured (names and shape) will need to then call the function again recursively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok but this means that drop_metadata is a method of dtype class, right ?
Is there a python file where we can put this method in numpy project ? or we have to code it in cython. I'm not confortable with this language
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are OK with private, it could probably stay as a function for now. If not, I think I like a method? Even then it could be implemented in Python (for most of the logic at least), we would need a C stub that forwards it to Python. That should already be done for some other methods or getters/setters.
But if that is what we do, and you are not happy to dig into Python C-API a bit on your own, I can help with that part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there something that i do not understand. You suggest dtype.drop_metadata()
with no parameter, so i suppose it is a method of dtype class.
And as metadata is a readonly attribute, drop_metadata(self)
must return a copy, no?.
Or you want to add a setter to metadata in order to allow self
to clean all its metadatas ? In this case drop_metadata(self)
returns nothing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would have to return a new dtype. Honestly, dtypes should be immutable anyway. I suppose if there is no metadata to drop, returning the same one would be fine/good.
bdf5773
to
5225af2
Compare
5225af2
to
89d6441
Compare
Hmmmm, seems like all commits went away by accident. I think you might have to just open a new PR, since github does this auto-closing when this happens. |
next episode #23371 |
To avoid problems with metadata in dtypes when copying hdf5 arrays to zarr or npy file or in other cases:
i add a numpy tool function that creates a dtype from an original one but without any metada inside.
it does the same for all the underlying dtypes too