Skip to content

Inconsistent behavior of dtype.descr with metadata #15488

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
dmbelov opened this issue Feb 1, 2020 · 21 comments
Open

Inconsistent behavior of dtype.descr with metadata #15488

dmbelov opened this issue Feb 1, 2020 · 21 comments

Comments

@dmbelov
Copy link
Contributor

dmbelov commented Feb 1, 2020

We used to the following property: if x is a dtype then numpy.dtype(x.descr) is the same dtype. This property is broken in NumPy 1.8 when one specifies metadata to dtype. Is it possible to at least fix that descr returned by dtype can always be evaluated back to a dtype?

Related issues: h5py/h5py#1307

Reproducing code example:

import numpy as np

# 1. metadata is lost for simple dtype
s_dt = np.dtype('S8', metadata={'msg': 'Hello'})
print(s_dt.descr)
# prints [('', '|S8')]

# 2. metadata is returned for structured dtype, but it cannot be evaluated into dtype
struct_dtype = np.dtype([('a', s_dt)])
print(struct_dtype.descr)
# prints [('a', ('|S8', {'msg': 'Hello'}))]

# next line raises a ValueError exception
# ValueError: invalid shape in fixed-type tuple.
np.dtype(struct_dtype.descr)

Numpy/Python version information:

1.18.1 3.8.1 | packaged by conda-forge | (default, Jan 29 2020, 14:55:04)
[GCC 7.3.0]

@eric-wieser
Copy link
Member

I don't think this property ever held, and my recommendation would be to avoid using .descr at all. Presumably your use case is serialization?

@dmbelov
Copy link
Contributor Author

dmbelov commented Feb 2, 2020

We need a way to store dtype in text format. We are simply calling dtype.descr for this. During 10 years of using descr I have never met a structured array which dtype.descr was not evaluated into dtype (I always assumed native alignment).

Regardless of the usage, it is unclear why dtype.descr with metadata cannot be evaluated back into dtype. I think it is an easy change to a function numpy.dtype.

@eric-wieser
Copy link
Member

eric-wieser commented Feb 2, 2020

I have never met a structured array which dtype.descr was not evaluated into dtype (I always assumed native alignment).

Try this one ;)

dt = np.dtype(dict(formats=['f4', 'f4'], names=['a', 'b'], offsets=[4, 0]))

These dtypes can appear more often than they used to too.

@eric-wieser
Copy link
Member

We need a way to store dtype in text format.

The most correct option would be something like base64_encode(pickle.dumps(dt)), especially if there are no constraints on the metadata itself.

@eric-wieser
Copy link
Member

Possible duplicate of #14142

@dmbelov
Copy link
Contributor Author

dmbelov commented Feb 3, 2020

  1. Alright, I understand that descr does not exit for every dtype. However, why one cannot get dtype back for those dtypes for which descr exists?
  2. dtype.descr produces inconsistent output. See example in the issue description: in the first case metadata for dtype s_dt is NOT shown, while in the second case metadata for dtype s_dt is shown. Is not this a bug?

@seberg
Copy link
Member

seberg commented Feb 3, 2020

The fact is, that metadata breaks things here. But hdf5 came around and probably as the first serious project, started to use it. I dislike the metadata idea as is, but they probably have good reason for using it.

Do you have a proposal? I am not sure what best to do. We could try and drop metadata from .descr with a UserWarning probably. Or, if we have a "clean" proposal, even add a new method...

@dmbelov
Copy link
Contributor Author

dmbelov commented Feb 5, 2020

Proposal

  1. Enforce the following property going forward: if dtype.descr exists then np.dtype constructor must be able to parse it. Hopefully, it should return a dtype which is very close to the original (some optional fields, like metadata, can be missing).
  2. We do one of the following:
    1. remove metadata from .descr (preferable). If needed, one can get metadata by accessing attr .metadata.
    2. standardize output of .descr and modify np.dtype so that it can parse it. E.g. if metadata is not None .descr should return the tuple with full description of dtype (str, shape, dict) even if shape is an empty tuple. I do not know what to do if there is a non-trivial offset.

@eric-wieser
Copy link
Member

eric-wieser commented Feb 5, 2020

Hopefully, it should return a dtype which is very close to the original

Unfortunately by design it doesn't really do this:

>>> a = np.dtype('i4')
>>> a
dtype('int32')
>>> np.dtype(a.descr)
dtype([('f0', '<i4')])

So at best, we can achieve "vaguely close to the original"

@seberg
Copy link
Member

seberg commented Feb 5, 2020

I am not sure what descr is actually good for :). We postprocess in a few places, but other than that? Trying to cram this feature in descr is just a bad idea. And it may well be that nobody really should be using .descr at all?

@mattip
Copy link
Member

mattip commented Feb 6, 2020

I am not sure what descr is actually good for

Maybe deprecate/make it private as part of the dtype refactor?

@dmbelov
Copy link
Contributor Author

dmbelov commented Mar 20, 2020

Have you made any decision on what to do in this ticket?

I just want to notify you that changing behavior of descr requires me, in particular, and may be many other people, to change code that had been working without any problems for years! Please be responsible, do not break other's people code.

@eric-wieser
Copy link
Member

For now, I'd recommend you use something like gh-14994 in your own code, and avoid using .descr if metadata saved. That matches your 2.i suggestion.

@dmbelov
Copy link
Contributor Author

dmbelov commented Mar 21, 2020

So, basically you are forcing the whole NumPy community to change their code because YOU added a new feature that breaks everybody's code! This is not a good direction of development.

Why cannot you just patch numpy by either

  1. do not put metadata into descr and let people to access metadata data as a property of dtype;
  2. or change constructor numpy.dtype to understand dictionary metadata as it is printed by descr?

Both of these approaches make less damage to existing code than the one that your have chosen now.
The new feature, that breaks code, was introduced very recently, so only few developers are using it. Moreover the fact that developers have to use hacky solution like gh-14994 adds negativity to these new feature.

@eric-wieser
Copy link
Member

eric-wieser commented Mar 21, 2020

The new feature, that breaks code, was introduced very recently

Which feature are you describing? I wasnt aware we were talking about a change that had already happened.

@seberg
Copy link
Member

seberg commented Mar 21, 2020

Indeed, I do am not aware of any change either. The only change I am aware of is that h5py (and maybe others) suddenly started to actually use metadata. So the issue you have is reverse, we are hesitant to change behaviour here as to not force anyone to change code, although it is and probably was not super consistent.

@dmbelov
Copy link
Contributor Author

dmbelov commented Mar 23, 2020

Thanks for your replies. I looked at the source code: the code for creation of dtype from list is written consistently, there is just a simple logical bug there.

Look at the file numpy/core/src/multiarray/descriptor.c, function _convert_from_tuple, line 279

    // the BUG is on the next line:
    // val might be dict, but metadata is NULL; in this case val is simply ignored.
    //else if (type->metadata && (PyDict_Check(val) || PyDictProxy_Check(val))) {
    // replace the line above by the following patch
    // Start of PATCH
    else if (PyDict_Check(val) || PyDictProxy_Check(val)) {
        // if metadata is NULL, just create a new dictionary
        if (!type->metadata) {
            type->metadata = PyDict_New();
            if (type->metadata == NULL) {
                Py_DECREF(type);
                return NULL;
            }
        }
    // End of PATCH
        if (PyDict_Merge(type->metadata, val, 0) == -1) {
            Py_DECREF(type);
            return NULL;
        }
        return type;
    }

I complied NumPy, and I checked that this code works on the following examples:

dt = numpy.dtype([('a', ('S8', {'x': "Hello"}), 2)])
dt.descr
# [('a', ('|S8', {'x': 'Hello'}), (2,))]

numpy.dtype(dt.descr) # works

numpy.dtype(dt.descr) == dt
# returns True

Could you please review this patch and add it to the closest NumPy update?

@dmbelov
Copy link
Contributor Author

dmbelov commented Apr 11, 2020

eric-wieser, seberg are you going to include my patch shown in the previous update to the nearest NumPy release?

@seberg
Copy link
Member

seberg commented Apr 11, 2020

@dmbelov, not sure, but could you open a PR for discussion rather than a patch in an issue?

@dmbelov
Copy link
Contributor Author

dmbelov commented Apr 13, 2020

@seberg, I created PR request, gh-15962

@eric-wieser
Copy link
Member

Would you mind sharing some actual code that actually uses descr? It's too easy for me to say "well don't do that" to np.dtype(dt.descr), but if your use case is something else, there might be another way for us to approach the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants