ENH: Dtype without metadata #23185

ninousf · 2023-02-09T12:57:31Z

To avoid problems with metadata in dtypes when copying hdf5 arrays to zarr or npy file or in other cases:

Error converting hdf5 file to zarr zarr-developers/zarr-python#1152
Inconsistent behavior of dtype.descr with metadata #15488
BUG: unable to read .npy file with non-eval'able metadata #23169
i add a numpy tool function that creates a dtype from an original one but without any metada inside.
it does the same for all the underlying dtypes too

sjdv1982 · 2023-02-09T14:00:22Z

If I understand correctly, this allows a programmer to side-step #23169, but doesn't fix it directly. To fix it, utils.dtype_without_metadata must be invoked by np.save.
As it is, every .npy file with metadata is already unreadable by np.load, so no harm would be done.
Please correct me if I am wrong.

ninousf · 2023-02-09T17:08:49Z

It's true that I use it just before saving HDF5 array in zarr store. But in your case, if you have already files with metadata, i do not know if you can force the dtype in np.load or take only the buffer part of the array, not the dtype part Le jeu. 9 févr. 2023 à 15:00, Sjoerd de Vries ***@***.***> a écrit :

…

If I understand correctly, this allows a programmer to side-step #23169 <#23169>, but doesn't fix it directly. To fix it, utils.dtype_without_metadata must be invoked by np.save. As it is, every .npy file with metadata is *already* unreadable by np.load, so no harm would be done. Please correct me if I am wrong. — Reply to this email directly, view it on GitHub <#23185 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AH75LVJG25HW5GIRFZ7QQBLWWT2ABANCNFSM6AAAAAAUWQRNNA> . You are receiving this because you authored the thread.Message ID: ***@***.***>

seberg · 2023-02-09T18:15:05Z

@ninousf you can. You could also edit the file pretty easily, but both requires deeper dives into the header parsing code. If anyone wants to try and help with a solution, it might make sense to create a simple recipe for how to solve it and then refer to it from the current error.

sjdv1982 · 2023-02-09T18:34:34Z

@ninousf you can. You could also edit the file pretty easily, but both requires deeper dives into the header parsing code. If anyone wants to try and help with a solution, it might make sense to create a simple recipe for how to solve it and then refer to it from the current error.

I could help, but I am still not sure which problem(s) you would like to have solved.

Force npy.save to strip metadata, just because npy.load currently can't read any of them. This is a one-liner.
Amend the dtype constructor so that it guesses between metadata and shape information, instead of assuming shape information (the current case). This is complicated, and what ENH: better handle dtype creation with metadata #15962 tries to do.
In addition to 2., support np.load header parsing for safe_eval'able metadata. This is easy enough to do, but I am not sure it has practical value. h5py metadata is not safe_eval'able.
In addition to 2., support np.load header parsing for non-safe_eval'able metadata, e.g. headers containing dtype and class expressions, as written by h5py. That would allow the loading of h5py-exported .npy files, but carries a security risk.

Personally, I don't need any of these, the current PR will solve all of my problems.

seberg · 2023-02-09T20:27:13Z

I would be in favor of doing 1., potentially with a warning (not sure its actually worth it since its annoying). All else seems a lot of hassle for no clear gain (since I doubt you need to round-trip via the h5py metadata). I could imagine the error to have some bread-crumbs for unlucky folks who ended up storing with metadata.

ninousf · 2023-02-11T08:30:42Z

In my case, this function solves the problem when a library (like zarr) use descr to build dtype in the save function (not in load for npy format).
the issue is when you pass descr with metadata to dtype constructor, sometimes it works, sometimes not:

`
dt = np.dtype('S8', metadata={'msg': 'Hello'})
s_dt = np.dtype([('toto', dt)])
#s_dt2 = np.dtype(s_dt.descr) <= KO
print('metadata in descr', s_dt.descr)

dt = np.dtype({'names': ['r','g','b','a'],
'formats': [np.uint8, np.uint8, np.uint8, np.uint8]}, metadata={'msg': 'Hello'})
s_dt = np.dtype([('toto', dt)])
s_dt2 = np.dtype(s_dt.descr) # <= OK
print('metadata not in descr', s_dt.descr)
`
to avoid these weird issues, i made this function in order to clean dtype when you want to store an numpy array no matter the file format (npy, zarr, ....)

It can resolve the problem n.1. It is true that allowing a file to be stored in the knowledge that it cannot be re-read, is a bug for me

seberg

Was your plan to just use this internally for now? In that case things are maybe a little bit simpler to settle quickly.

While dealing with structured dtypes can be in Python in either case, it does need to recurse via dtypes.fields (or names, but fields might almost be easier, there are also "titles"?).

I understand the urge to run auto-formatting, but we don't do it yet in NumPy and the it does make the diff a bit unwieldy in this case.

seberg · 2023-02-13T20:26:51Z

numpy/lib/tests/test_utils.py

+    for k in dir(dt1):
+        if not ((k.startswith('__') and k.endswith('__'))
+                or k in ['descr', 'metadata', 'newbyteorder']):
+            ret &= (getattr(dt1, k) == getattr(dt2, k))


I would write this as:

for ...: if getattr(dt1, k) != getattr(dt2, k): return False return True

You could also check np.can_can(dt1, dt2, casting="no"), since that should only really work only metadata mismatches. (The only issue should be e.g. "long" and "longlong" which on most systems, except windows, are the same).

So, I am fine with this also, it is at least thorough and easy enough to read.

seberg · 2023-02-13T20:43:04Z

numpy/lib/utils.py

+    result : dtype
+        Dtype containing no metadata
+    """
+    descr = dt.descr


We could do this in python as a utility, implementing it as dtype.drop_metadata() or so would have the advantage of easier generalizing user DTypes.

Fields can be nested, so I think this must use a recursive approach. I.e. check dtype.names is None if it is we should iterate dtype.fields, if not check dtype.shape != () (not a subarray). After that, we would have to assume dtype.str works (for now).

Both structured (names and shape) will need to then call the function again recursively.

ok but this means that drop_metadata is a method of dtype class, right ?
Is there a python file where we can put this method in numpy project ? or we have to code it in cython. I'm not confortable with this language

If we are OK with private, it could probably stay as a function for now. If not, I think I like a method? Even then it could be implemented in Python (for most of the logic at least), we would need a C stub that forwards it to Python. That should already be done for some other methods or getters/setters.

But if that is what we do, and you are not happy to dig into Python C-API a bit on your own, I can help with that part.

there something that i do not understand. You suggest dtype.drop_metadata() with no parameter, so i suppose it is a method of dtype class.
And as metadata is a readonly attribute, drop_metadata(self) must return a copy, no?.
Or you want to add a setter to metadata in order to allow self to clean all its metadatas ? In this case drop_metadata(self) returns nothing

Yes, it would have to return a new dtype. Honestly, dtypes should be immutable anyway. I suppose if there is no metadata to drop, returning the same one would be fine/good.

seberg · 2023-03-10T16:17:31Z

Hmmmm, seems like all commits went away by accident. I think you might have to just open a new PR, since github does this auto-closing when this happens.

ninousf · 2023-03-10T19:05:33Z

next episode #23371

ninousf changed the title ~~Dtype without metadata~~ ENH: Dtype without metadata Feb 9, 2023

seberg reviewed Feb 13, 2023

View reviewed changes

charris added the 01 - Enhancement label Mar 3, 2023

ninousf force-pushed the dtype-without-metadata branch from bdf5773 to 5225af2 Compare March 10, 2023 14:59

ninousf closed this Mar 10, 2023

ninousf force-pushed the dtype-without-metadata branch from 5225af2 to 89d6441 Compare March 10, 2023 15:00

ninousf mentioned this pull request Mar 10, 2023

ENH: drop dtype metadata #23371

Merged

Uh oh!

ENH: Dtype without metadata #23185

ENH: Dtype without metadata #23185

Uh oh!

Conversation

ninousf commented Feb 9, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sjdv1982 commented Feb 9, 2023

Uh oh!

ninousf commented Feb 9, 2023 via email

Uh oh!

seberg commented Feb 9, 2023

Uh oh!

sjdv1982 commented Feb 9, 2023

Uh oh!

seberg commented Feb 9, 2023

Uh oh!

ninousf commented Feb 11, 2023

Uh oh!

seberg left a comment

Choose a reason for hiding this comment

Uh oh!

seberg Feb 13, 2023

Choose a reason for hiding this comment

Uh oh!

seberg Feb 13, 2023

Choose a reason for hiding this comment

Uh oh!

seberg Feb 13, 2023

Choose a reason for hiding this comment

Uh oh!

ninousf Feb 23, 2023

Choose a reason for hiding this comment

Uh oh!

seberg Feb 23, 2023

Choose a reason for hiding this comment

Uh oh!

ninousf Feb 27, 2023

Choose a reason for hiding this comment

Uh oh!

seberg Feb 27, 2023

Choose a reason for hiding this comment

Uh oh!

seberg commented Mar 10, 2023

Uh oh!

ninousf commented Mar 10, 2023

Uh oh!

Uh oh!

ninousf commented Feb 9, 2023 •

edited

Loading