Skip to content

ENH: Dtype without metadata #23185

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 0 commits into from
Closed

Conversation

ninousf
Copy link
Contributor

@ninousf ninousf commented Feb 9, 2023

To avoid problems with metadata in dtypes when copying hdf5 arrays to zarr or npy file or in other cases:

@ninousf ninousf changed the title Dtype without metadata ENH: Dtype without metadata Feb 9, 2023
@sjdv1982
Copy link

sjdv1982 commented Feb 9, 2023

If I understand correctly, this allows a programmer to side-step #23169, but doesn't fix it directly. To fix it, utils.dtype_without_metadata must be invoked by np.save.
As it is, every .npy file with metadata is already unreadable by np.load, so no harm would be done.
Please correct me if I am wrong.

@ninousf
Copy link
Contributor Author

ninousf commented Feb 9, 2023 via email

@seberg
Copy link
Member

seberg commented Feb 9, 2023

@ninousf you can. You could also edit the file pretty easily, but both requires deeper dives into the header parsing code. If anyone wants to try and help with a solution, it might make sense to create a simple recipe for how to solve it and then refer to it from the current error.

@sjdv1982
Copy link

sjdv1982 commented Feb 9, 2023

@ninousf you can. You could also edit the file pretty easily, but both requires deeper dives into the header parsing code. If anyone wants to try and help with a solution, it might make sense to create a simple recipe for how to solve it and then refer to it from the current error.

I could help, but I am still not sure which problem(s) you would like to have solved.

  1. Force npy.save to strip metadata, just because npy.load currently can't read any of them. This is a one-liner.
  2. Amend the dtype constructor so that it guesses between metadata and shape information, instead of assuming shape information (the current case). This is complicated, and what ENH: better handle dtype creation with metadata #15962 tries to do.
  3. In addition to 2., support np.load header parsing for safe_eval'able metadata. This is easy enough to do, but I am not sure it has practical value. h5py metadata is not safe_eval'able.
  4. In addition to 2., support np.load header parsing for non-safe_eval'able metadata, e.g. headers containing dtype and class expressions, as written by h5py. That would allow the loading of h5py-exported .npy files, but carries a security risk.

Personally, I don't need any of these, the current PR will solve all of my problems.

@seberg
Copy link
Member

seberg commented Feb 9, 2023

I would be in favor of doing 1., potentially with a warning (not sure its actually worth it since its annoying). All else seems a lot of hassle for no clear gain (since I doubt you need to round-trip via the h5py metadata). I could imagine the error to have some bread-crumbs for unlucky folks who ended up storing with metadata.

@ninousf
Copy link
Contributor Author

ninousf commented Feb 11, 2023

In my case, this function solves the problem when a library (like zarr) use descr to build dtype in the save function (not in load for npy format).
the issue is when you pass descr with metadata to dtype constructor, sometimes it works, sometimes not:

`
dt = np.dtype('S8', metadata={'msg': 'Hello'})
s_dt = np.dtype([('toto', dt)])
#s_dt2 = np.dtype(s_dt.descr) <= KO
print('metadata in descr', s_dt.descr)

dt = np.dtype({'names': ['r','g','b','a'],
'formats': [np.uint8, np.uint8, np.uint8, np.uint8]}, metadata={'msg': 'Hello'})
s_dt = np.dtype([('toto', dt)])
s_dt2 = np.dtype(s_dt.descr) # <= OK
print('metadata not in descr', s_dt.descr)
`
to avoid these weird issues, i made this function in order to clean dtype when you want to store an numpy array no matter the file format (npy, zarr, ....)

It can resolve the problem n.1. It is true that allowing a file to be stored in the knowledge that it cannot be re-read, is a bug for me

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was your plan to just use this internally for now? In that case things are maybe a little bit simpler to settle quickly.

While dealing with structured dtypes can be in Python in either case, it does need to recurse via dtypes.fields (or names, but fields might almost be easier, there are also "titles"?).

I understand the urge to run auto-formatting, but we don't do it yet in NumPy and the it does make the diff a bit unwieldy in this case.

for k in dir(dt1):
if not ((k.startswith('__') and k.endswith('__'))
or k in ['descr', 'metadata', 'newbyteorder']):
ret &= (getattr(dt1, k) == getattr(dt2, k))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write this as:

for ...:
    if getattr(dt1, k) != getattr(dt2, k):
        return False

return True

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also check np.can_can(dt1, dt2, casting="no"), since that should only really work only metadata mismatches. (The only issue should be e.g. "long" and "longlong" which on most systems, except windows, are the same).

So, I am fine with this also, it is at least thorough and easy enough to read.

result : dtype
Dtype containing no metadata
"""
descr = dt.descr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do this in python as a utility, implementing it as dtype.drop_metadata() or so would have the advantage of easier generalizing user DTypes.

Fields can be nested, so I think this must use a recursive approach. I.e. check dtype.names is None if it is we should iterate dtype.fields, if not check dtype.shape != () (not a subarray). After that, we would have to assume dtype.str works (for now).

Both structured (names and shape) will need to then call the function again recursively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok but this means that drop_metadata is a method of dtype class, right ?
Is there a python file where we can put this method in numpy project ? or we have to code it in cython. I'm not confortable with this language

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are OK with private, it could probably stay as a function for now. If not, I think I like a method? Even then it could be implemented in Python (for most of the logic at least), we would need a C stub that forwards it to Python. That should already be done for some other methods or getters/setters.

But if that is what we do, and you are not happy to dig into Python C-API a bit on your own, I can help with that part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there something that i do not understand. You suggest dtype.drop_metadata() with no parameter, so i suppose it is a method of dtype class.
And as metadata is a readonly attribute, drop_metadata(self) must return a copy, no?.
Or you want to add a setter to metadata in order to allow self to clean all its metadatas ? In this case drop_metadata(self) returns nothing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it would have to return a new dtype. Honestly, dtypes should be immutable anyway. I suppose if there is no metadata to drop, returning the same one would be fine/good.

@ninousf ninousf force-pushed the dtype-without-metadata branch from bdf5773 to 5225af2 Compare March 10, 2023 14:59
@ninousf ninousf closed this Mar 10, 2023
@ninousf ninousf force-pushed the dtype-without-metadata branch from 5225af2 to 89d6441 Compare March 10, 2023 15:00
@seberg
Copy link
Member

seberg commented Mar 10, 2023

Hmmmm, seems like all commits went away by accident. I think you might have to just open a new PR, since github does this auto-closing when this happens.

@ninousf
Copy link
Contributor Author

ninousf commented Mar 10, 2023

next episode #23371

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

4 participants