-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
API: Fix structured dtype cast-safety, promotion, and comparison #19226
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Fix structured dtype cast-safety, promotion, and comparison #19226
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This all looks quite clear, but there is one point that probably needs to be clarified. Currently, the docstring for np.promote_types
states:
Returns the data type with the smallest size and smallest scalar
kind to which both ``type1`` and ``type2`` may be safely cast.
But for structured dtypes, this PR insists names match, which implies casting='equiv'
.
This does seem intentional, but I think it is good to ensure it is the right decision. Should np.promote_types
correspond to casting='safe'
instead? Looking back at #15494, which concerns the concatenation of two arrays with the same dtype but different names, it is not obvious to me that should be disallowed (though I'd definitely expect the output dtype have names from the dtype of the first array).
Note that I'm far from sure what the right answer is. But if it is disallowed, we should clarify this in the docs for promote_types
. And more generally it might make sense to have some overview of when and how field names are used!
p.s. Probably too unrelated: should the following work without a copy?
a = np.array([(1., 2.), (3., 4.)], dtype=[('a', 'f8'), ('b', 'f8')])
a.astype([('b', 'f8'), ('a', 'f8')], copy=False)
field_dims_a != 0 && /* neither is subarray */ | ||
/* Compare only the added (subarray) dimensions: */ | ||
!PyArray_CompareLists( | ||
PyArray_DIMS(a) + PyArray_NDIM(self), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gosh, took me a while to understand there was pointer addition being done here...
return NULL; | ||
} | ||
|
||
temp = array_richcompare(a, (PyObject *)b,cmp_op); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add space after second ,
else { | ||
/* Cannot have subarray, since subarray is absorbed into array. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add comment that this is for two unstructured voids? To me, the asserts feel a bit superfluous, but obviously doesn't hurt.
Py_ssize_t field_count = PyTuple_Size(from->names); | ||
if (field_count != PyTuple_Size(to->names)) { | ||
/* TODO: This should be rejected! */ | ||
return NPY_UNSAFE_CASTING; | ||
return -1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yay! (as long as np.zeros((), dtype)
continues to work!)
'offsets': [4, 0]}) | ||
assert_equal(x == y, False) | ||
# But it is currently an equivalent cast: | ||
assert np.can_cast(x, y, casting="equiv") | ||
# This is an safe cast (not equiv) due to the different names: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really tricky example, as the way it was written, it described exactly the same bytes in memory, but just with a different order. With the change to use field order, I guess it makes sense that, as written, it would become "unsafe" (integer to float and vice versa). Might it be good to keep that as an example?
And then also insert the new example, where, because it is by field order, it is "safe".
Hmm, it seems the answer to my question about the behaviour of
I think "unambigously" is key here - with name mismatches, one has to pick, which means the result is ambiguous (and it breaks the symmetry promise of @seberg - could you also include in this PR the changelog entry and doc updates to On the larger scheme, mostly out of curiosity: is the goal that eventually the |
Yes, sorry, I should have copied over the doc changes/change log from the other PR and expanded on that, its a bit too raw right now here! (And I thought adding the About the "safe" casting... My opinion is that the documentation should be updated. And I can probably do that even ignoring this PR. I have given up on the idea that the common dtype (promotion) derives from casting. I honestly think its a fallacy. The two are related, but distinct. Rather the properties of
Ensuring this is the job of the type promotion (i.e. the DType implementer has to take care). But in that definition type promotion can be more strict.
It is absolutely related! But the fix/change itself is mostly orthongal. We have to modify
The changes here are designed so that that if we do that change, your example will also be no-copy.
This is already the case. But you don't have the public API to "recurse" into arbitrary dtypes well (e.g. to write the casting functions). So you could write such a new DType today, but you have to do it in NumPy. |
That really helps! I also like your proposed logic for p.s. I think |
numpy/core/_internal.py
Outdated
"NumPy currently does not support promotion with field titles.") | ||
|
||
return dtype([(name, promote_types(dt1[name], dt2[name])) | ||
for name in dt1.names]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jeesh, I realized that this is actually incorrect. What is missing is something like align=dt1.isalignedstruct or dt2.isalignedstruct
.
Is that what we want? Preserve alignment if one is aligned, or drop alignment when one is not aligned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your choice of preserving alignment when one is aligned.
5b1c020
to
716fe3c
Compare
return PyArray_DescrNewByteorder(type, NPY_NATIVE); | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer needed for non-structured arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is just moved, but maybe I should limit the change a bit (since this is perfectly fine for most NumPy builtin dtypes). In any case, it is actually a change of one of the previous PRs that would have to go in first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #19345
9191e9d
to
a8aa5de
Compare
Code wise, this should be mostly ready now. I narrowed the special case in |
d9abddd
to
a8de4ef
Compare
@seberg - sounds good. Probably easiest to review once the doc changes are ready too... |
2e13510
to
38ffb10
Compare
This PR replaces the old numpygh-15509 implementing proper type promotion for structured voids. It further fixes the casting safety to consider casts with equivalent field number and matching order as "safe" and if the names, titles, and offsets match as "equiv". The change perculates into the void comparison, and since it fixes the order, it removes the current FutureWarning there as well. This addresses liberfa/pyerfa#77 and replaces numpygh-15509 (the implementation has changed too much). Fixes numpygh-15494 (and probably a few more) Co-authored-by: Allan Haldane <allan.haldane@gmail.com>
Negative offsets in structured dtypes give indirect access outside of the original data in principle, thus we should always reject them (this is as of now practically unused in NumPy). Further, `np.result_type` and `np.promote_types` should behave the same as canonicalization for twice the same input (at least mostly, metadata is a rare exception). This was not the case, because promote-types had a fast path that would return the wrong thing for structured dtypes. Thus, reject them from the fast-path, and limit it to builtin types for now.
…some stuff in the NEP 50 draft...
…structured dtypes Identical means here that the promotion is between the same dtype *and* promoting it does not modify any of the included dtypes *and* does not modify the offset layout or itemsize.
Also modify the error message to remove the "may return an array of False note". That is fathomable, but if we really want to go there, maybe it should just get a new FutureWarning...
5f7b973
to
079ca88
Compare
Most of the removed info is already in the structured array dtype doc or later in the release note - leave high-level summary in place.
One of the following test is mainly interesting on little endian machines, but that seems OK, the tests definitely run on somewhere :)
@@ -1029,35 +1029,85 @@ static PyObject * | |||
_void_compare(PyArrayObject *self, PyArrayObject *other, int cmp_op) | |||
{ | |||
if (!(cmp_op == Py_EQ || cmp_op == Py_NE)) { | |||
PyErr_SetString(PyExc_ValueError, | |||
PyErr_SetString(PyExc_TypeError, | |||
"Void-arrays can only be compared for equality."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The path is actually unreachable, so no coverage is not a surprise.
@ahaldane just a ping in case you have time to have a look on the logic, or whether the tests make sense. (I think it would be good to include this in the upcoming release and branching is looming, to finally get us all the way on "order", rather than "names" with structured dtypes.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, though I mostly focused on the Python components + tests!
assert_equal(a != b, [True, False]) | ||
|
||
a = np.array([(5, 42), (10, 1)], dtype=[('a', '>f8'), ('b', '<f8')]) | ||
b = np.array([(5, 43), (10, 1)], dtype=[('a', '<i8'), ('b', '>i8')]) | ||
assert_equal(a == b, [False, True]) | ||
assert_equal(a != b, [True, False]) | ||
|
||
# Including with embedded subarray dtype (although subarray comparison | ||
# itself may still be a bit weird and compare the raw data) | ||
a = np.array([(5, 42), (10, 1)], dtype=[('a', '10>f8'), ('b', '5<f8')]) | ||
b = np.array([(5, 43), (10, 1)], dtype=[('a', '10<i8'), ('b', '5>i8')]) | ||
assert_equal(a == b, [False, True]) | ||
assert_equal(a != b, [True, False]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a readability standpoint it's a little confusing that both the arrays themselves and the fields names are the same (i.e. a
and b
)! Not that it's a blocker or anything :)
Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
49665bd
to
64e3c5c
Compare
The linting "errors" are fine to ignore. Thanks @seberg |
This PR replaces the old gh-15509 implementing proper type promotion
for structured voids. It further fixes the casting safety to consider
casts with equivalent field number and matching order as "safe"
and if the names, titles, and offsets match as "equiv".
The change perculates into the void comparison, and since it fixes
the order, it removes the current FutureWarning there as well.
This addresses liberfa/pyerfa#77
and replaces gh-15509 (the implementation has changed too much).
Fixes gh-15494 (and probably a few more)
Co-authored-by: Allan Haldane allan.haldane@gmail.com
Ping @ahaldane, @mhvk. The tests will fail right now. And we have to decide a few things, and probably one more change that is necessary here:
ensure_nbo
. My plan forensure_nbo(descr)
is to actually call a newDType.ensure_canonical(descr)
(for native dtypes, this will just do the same asensure_nbo
) but for structured ones it would: Recursively callensure_canonical
and drop offsets. (Fixing the test failure) This should be simple, but requires adding the new slot to the DType, I will probably just do this soon.FutureWarning
from the void comparison path. This means we have to decide wether void comparisons should ever return an array ofFalse
if the types they include mismatch. Or if they should raise an error. Note that most field mismatches will still raise aFutureWarning
(since it will callself[field1_self] == other[field1_other]
which will give a warning when the types are not comparable. This is thus only about the explicit error messages below. (Or alternatively, keep the warning for now, even if it is very strange for dtypes that can promote to not also compare.)As such this is currently draft, although all changes should be good, there are some things missing. But I decided to have a look since gh-15509 is outdated now and it might be tricky to resurrect it without knowing the new layout well.
EDIT: Currently based off gh-19346