ENH: better handle dtype creation with metadata #15962
Conversation
@@ -238,6 +238,20 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
    return 0;
}

// This function checks if val == {'name': any tuple}.
// See gh-2865, issue #8235 for details
inline npy_bool _check_if_size_1_and_contains_name_and_tuple(PyObject* val) {
fix formatting and add static
Suggested change:
-inline npy_bool _check_if_size_1_and_contains_name_and_tuple(PyObject* val) {
+static inline npy_bool
+_check_if_size_1_and_contains_name_and_tuple(PyObject* val) {
Also change the return type to int.
 # recursive: metadata on the field of a dtype
 (np.dtype({'names': ['a', 'b'], 'formats': [
     float, np.dtype({'names': ['c'], 'formats': [np.dtype(int, metadata={})]})
-]}), False)
+]}), False, False)
It seems we are now missing cases where fail==True?
Correct. It does not fail anymore on np.load. Should I remove the fail option completely?
npy_bool is_ok = PyDict_Size(val) == 1;
if (is_ok) {
    PyObject* name = PyUnicode_FromString("name");
    if (!name) return 1;
Don't silence MemoryError here:
Suggested change:
-    if (!name) return 1;
+    if (!name) {
+        return -1;
+    }
The caller will need to handle this specially too.
// This function checks if val == {'name': any tuple}.
// See gh-2865, issue #8235 for details
For reviewers, gh-2865
// This function checks if val == {'name': any tuple}.
I don't follow what's special about the word "name"
I agree, I made things more complicated than they should be. I misunderstood how an "inherited" dtype is initialized. I made a change; please check again.
@eric-wieser, @mattip I implemented your comments. Can you take a look again? Will this patch be included in the next NumPy release?
@@ -238,6 +238,22 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
    return 0;
}

// This function checks if val is mistyped dict of inherited dtype,
// e.g val = {key: tuple of size >= 2}; see gh-2865, issue #8235 for details
I am a bit confused about what actually happens here. Could you give an example of when this function does not return True for is_mistyped? And how does such a descriptor differ from a mistyped one? What about the possibility that the metadata actually includes a correctly typed dtype?
So is it possible to create dtypes, e.g. with odd metadata, where this logic is broken?
There are two usages of a dictionary inside a tuple that defines a dtype, and they contradict each other. Consider the expression np.dtype(('i4', adict)) (where 'i4' can be replaced by any other valid type):

1. The dictionary adict here should be interpreted as metadata:

   dt = np.dtype(('i4', {'comment': 'qty'}))
   dt.metadata  # mappingproxy({'comment': 'qty'})

2. But if we do this we will break historically working code, convert_from_iherit_tuple (see gh-8235, "BUG: add checks for some invalid structured dtypes. Fixes #2865."), which allows one to split a given type into parts and use up to 2 names to access them. For example:

   dt = np.dtype(('i4', {'part_1': ('i2', 0, 'Part 1'), 'part_2': ('i2', 2, 'Part 2')}))
   dt.descr  # [(('Part 1', 'part_1'), '<i2'), (('Part 2', 'part_2'), '<i2')]

   In this format, each tuple inside the dictionary must be of size 2 or 3.

Thus, given a dictionary adict, one has to decide whether it is of type 1 or type 2. I've just submitted a change that makes this choice very clear. Specifically, adict is a mistyped dict of type 2 if all of the following is true:

- all adict values are tuples of size >= 2;
- at least one tuple satisfies one of the following:
  a. val[1] is not an integer (this is required to pass existing regression tests);
  b. val[1] is an integer but val[1] < 0;
  c. len(val) >= 4.

I added a description of this algorithm to the docstring of numpy.core._internal._is_mistyped_inherited_type_dict.
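For concreteness, here is a minimal Python sketch of that decision rule (an illustrative helper only, not the PR's actual code, which lives in C and in numpy.core._internal; the function name here is made up):

```python
def is_mistyped_inherited_type_dict(adict):
    """Apply the rule described above: True means the dict superficially
    looks like an inherited-dtype spec but is not a valid one."""
    values = list(adict.values())
    # All values must be tuples of size >= 2 ...
    if not values or not all(isinstance(v, tuple) and len(v) >= 2 for v in values):
        return False
    # ... and at least one tuple must have a non-integer second element,
    # a negative integer second element, or length >= 4.
    return any(
        not isinstance(v[1], int) or v[1] < 0 or len(v) >= 4
        for v in values
    )

# Applying the rule to the examples above:
# {'comment': 'qty'}                                              -> False (values are not tuples)
# {'part_1': ('i2', 0, 'Part 1'), 'part_2': ('i2', 2, 'Part 2')}  -> False (no offending tuple)
# {'key': ('i2', 'oops')}                                         -> True  (second element is not an integer)
```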
Does this resolve your confusion?
Yes, so what I need to figure out is whether we can deprecate that whole other meaning. Because otherwise we are guessing what the user wants, and guessing is bad. E.g.:

In [13]: dt = np.dtype(np.float64, metadata={"original_type": (np.int8, 8)})
In [14]: np.dtype([("f1", dt)]).descr
Out[14]: [('f1', ('<f8', {'original_type': (numpy.int8, 8)}))]

In that case your code will misinterpret the metadata for a dtype when converting the .descr. Sure, it's a corner case, but if you want to build reliable software based on this guess, I feel the inconsistency will bite you at some point.
I agree with you, reliability and consistency of software (especially widely used software) is very important. It is a hard decision and only one of you can make it.
Note that we are already in that inconsistent territory. Recall my original concern: the descr of a dtype cannot be evaluated back into a dtype when metadata is provided.
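For reference, a small sketch of that original concern (based on the behavior described in this thread; the exact descr output and failure mode depend on the NumPy version):

```python
import numpy as np

# A structured dtype whose field dtype carries metadata.
field = np.dtype('S8', metadata={'info': 'hello'})
dt = np.dtype([('a', field)])

# descr embeds the metadata as a (typestr, dict) tuple,
# e.g. [('a', ('|S8', {'info': 'hello'}))].
print(dt.descr)

# The problematic step: feeding descr back into np.dtype(). The (typestr, dict)
# tuple can be taken for an "inherited dtype" specification rather than metadata,
# so np.dtype(dt.descr) may fail or misinterpret it, depending on the version.
```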
So my feeling is that we should only do this if it is reasonable to put a FutureWarning on the other branch (i.e. remove it). Even in that case, it would be nicer if we could put in the FutureWarning at least one version before going here.
But I am not sure how valid the existing use case is. It only works with offsets, which I assume are not super-common, but if you use offsets, then it actually seems like a fairly inconsistent and nice way to spell this out.
I still would be a bit more curious about either dropping metadata entirely from .descr, or just creating a new way to do this... maybe expose the buffer-protocol representation (which has its problems as well, though).
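For context on the buffer-protocol idea, the PEP 3118 format string of a structured array can be inspected through memoryview; note that dtype metadata does not appear there at all (shown only to illustrate which representation is being referred to):

```python
import numpy as np

arr = np.zeros(1, dtype=[('a', '<i4'), ('b', '<f8')])
# Struct-style PEP 3118 format string; the exact text depends on the dtype
# and platform, roughly of the form 'T{i:a:d:b:}'.
print(memoryview(arr).format)
```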
Another idea is to extend the format of the tuple that defines a dtype: a tuple that carries metadata must have size 3.
- np.dtype((type, dict)) will be parsed as an inherited dtype.
- np.dtype((type, None or dict_1, metadata)) will be parsed with metadata.
I have a better idea that will keep the old code (and users of inherited dtype) working: require metadata to be stored in the dictionary under the name metadata. For example,

np.dtype(('i4', {'part_1': ('i2', 0), 'part_2': ('i2', 2), 'metadata': {"x": 1}}))
np.dtype(('i8', {'metadata': {'x': 1}}))
np.dtype({'names': ['a', 'b'], 'formats': ['i2', 'i2'], 'metadata': {'x': 1}})

In this way we can unambiguously infer a dtype from a dict: the dict is either an inherited dtype or a dict with keys 'names', 'formats', 'offsets', ...
What do you think?
Why do we need that when we have np.dtype(type, metadata=...), though? (Apologies if I've forgotten something, I've not looked at this in a while.)
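For readers following the thread, the keyword form mentioned here attaches metadata to the dtype as a whole (the same mechanism used in the examples above):

```python
import numpy as np

# metadata attached to the dtype itself, via the keyword argument
dt = np.dtype(np.float64, metadata={'unit': 'm'})
print(dt.metadata)  # mappingproxy({'unit': 'm'})
```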
As far as I understand, we want to be able to create a dtype from a list and to specify metadata for individual fields, e.g.

np.dtype([('a', ('i8', {'x': 1})), ('b', ('f8', {'y': 1}))], metadata={'z': 1})

The metadata keyword only allows one to specify metadata for the final dtype.
The whole discussion came about because adding metadata to a dtype breaks construction of a dtype from its descr.
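For comparison, a sketch of attaching per-field metadata today by building the field dtypes first (this assumes field dtypes keep their metadata when embedded in a structured dtype, as in the recursive test case shown earlier in this thread):

```python
import numpy as np

a = np.dtype('i8', metadata={'x': 1})
b = np.dtype('f8', metadata={'y': 1})
dt = np.dtype([('a', a), ('b', b)], metadata={'z': 1})

print(dt.metadata)                 # mappingproxy({'z': 1})  -- the keyword only covers the outer dtype
print(dt.fields['a'][0].metadata)  # mappingproxy({'x': 1})  -- field metadata lives on the field dtype
```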
This change fixes issue #15488: inconsistent behavior of dtype.descr with metadata. The problem with handling metadata actually appeared in 2016 in pull request gh-8235 (it is even mentioned there that it breaks creation of a dtype with metadata). I fixed the code in such a way that:
- np.dtype([('a', ('S8', {'info': 'hello'}))]);