-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG: add checks for some invalid structured dtypes. Fixes #2865. #8235
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@ahaldane Heads up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, nice catch and solution!
@@ -287,7 +287,7 @@ _convert_from_tuple(PyObject *obj) | |||
type->elsize = itemsize; | |||
} | |||
} | |||
else if (PyDict_Check(val) || PyDictProxy_Check(val)) { | |||
else if (type->metadata && (PyDict_Check(val) || PyDictProxy_Check(val))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine, but I just want to note for posterity that this whole if-block is weird undocumented behavior. (as are a number of things related to metadata).
(Actually, conceivably the "correct" behavior here if type->metadata == NULL
might be to do type->metadata = val
, eg compare to the end of convert_from_dict
. But that's not clear.).
In the future we might consider removing the block anyway, so I'm fine with this as-is.
goto fail; | ||
} | ||
} | ||
else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this whole else block be replaced by
if (PyDataType_REFCHK(conv)) {
goto fail;
}
?
That would also catch the case there are objects in nested structures.
assert_raises(ValueError, np.dtype, | ||
('i', {'name': ('i', 'wrongtype', 'title')})) | ||
# this one is allowed as a special case though | ||
a = np.ones(1, dtype=('O', [('name', 'O')])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it important that this example works? If not, I would be tempted to altogether disallow the (base_dtype, new_dtype)
form of specification if either base_dtype
or new_dtype
have objects (tested with PyDataType_REFCHK
).
Note that similar operations are already disallowed:
>>> np.zeros(3, dtype='O').view(dtype([('name', 'O')]))
TypeError: Cannot change data-type for object array.
That wouldn't allow your special case though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback. I noticed that test_dtype.py
already has a test that requires that this case works:
def test_base_dtype_with_object_type(self):
# Issue gh-2798, should not error.
np.array(['a'], dtype="O").astype(("O", [("name", "O")]))
There is a discussion at #2798, but I don't really understand the rationale. Hmm, I also notice that you can currently have a dtype with object fields in the first element of the tuple, like this:
>>> np.zeros(1, dtype=('O,O', 'O,i8'))
array([(0, 0)],
dtype=[('f0', 'O'), ('f1', '<i8')])
or even multiple fields in the first element and just 'O'
in the second one, like this:
>>> np.zeros(1, dtype=('i4,i4', 'O'))
array([(0, 0)],
dtype=[('f0', '<i4'), ('f1', '<i4')])
but this seems especially useless, since one of the two dtypes in the tuple is just ignored (as long as it has the correct size), even if it doesn't make sense to map it to the other one.
So should we allow all of the dtypes that currently work just in case somebody is using them for some reason, or just ones that kind of make sense (e.g. ('O,O', 'O,O')
), or just the one that is already tested for, or none of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, #2798 suggests that some people (eg h5py) have wanted to be able to create dtype(("O", [("name", "O")]))
, in order to add a field name. I would have liked to simply disallow that, but maybe we can't.
Its probably safe to assume no-one is doing anything like dtype(('O,O','O,i8'))
, so my suggestion is to disallow all cases involving objects (by RECHCK on both elements of the tuple) except the special case you already have for a single field.
I am a little tempted to add a deprecation warning for a release cycle, but it's probably such a rare problem that I would rather just go ahead with the change. But since it can break code, we should clearly document it. So please add something to doc/release/1.13.0-notes.rst
decsribing what will break.
By the way, the example causing problems in h5py is in h5py/h5py#217. I tried the example, but it looks like it no longer involves the special case dtype. Also, incidentally, after #6053 gets in they will be able to write np.zeros(3, dtype='O').astype([('name', 'O')])
instead (this is currently disallowed). So there will be a better syntax for adding a field.
Also, by the way, I once wrote a lot of code in #5548 (in _check_field_overlap
) to carefully check whether views like 'O,O'
to 'O,i8'
would clobber pointers or not, by searching out the positions of all the pointers. But it was a huge complicated affair we got rid of later because it was too slow. That's why I favor the stupid solution of simply disallowing views involving objects.
As future reference, it is preferable to put the |
a91be86
to
cb65c8a
Compare
OK, I think I've covered everything, though I was slightly unsure where to put the release note. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you address the two comments and squash I think it's good to go.
* people have been using to add a field to an object array without fields | ||
*/ | ||
static int | ||
validate_structured_object_dtype(PyArray_Descr *new, PyArray_Descr *conv) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might like a less general function name. Maybe invalid_union_object_dtype
?
if (!PyDataType_REFCHK(new) && !PyDataType_REFCHK(conv)) { | ||
return 0; | ||
} | ||
if (PyDataType_HASFIELDS(new)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think by organizing the tests this way you are trying to allow cases like np.dtype(('i4', 'O'))
. But I think we should rule that out too: Consider the strange fact that np.dtype(('i8', 'O')).hasobject
is True
.
Maybe reorganize the if-statements here a little, eg start with if (new->kind != 'O')
, then if (!PyDataType_HASFIELDS(conv))
, then if (PyTuple_GET_SIZE(names) != 1)
an so on, an fail if any is true.
452061b
to
9fe73dd
Compare
Looks good. I'll merge in a few minutes. Thanks @J-Sand ! |
This also fixes a similar bug I found while investigating the issue: with structured dtypes of the form
('i', {'name': ('i', offset, 'optional title')})
, if the inner tuple has the wrong number or types of items, a bad python API call is made, resulting in aSystemError
.I'm completely new to the numpy source, so I may have made some mistakes. In particular, I wasn't certain how to deal with a dtype like ('O', [('name', 'O')]). As far as I can tell, it's the only structured dtype of this form containing an object dtype that currently works, so I've allowed it as a special case. On the other hand, it seems pretty useless.