Skip to content

API: Fix structured dtype cast-safety, promotion, and comparison #19226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

seberg
Copy link
Member

@seberg seberg commented Jun 11, 2021

This PR replaces the old gh-15509 implementing proper type promotion
for structured voids. It further fixes the casting safety to consider
casts with equivalent field number and matching order as "safe"
and if the names, titles, and offsets match as "equiv".

The change perculates into the void comparison, and since it fixes
the order, it removes the current FutureWarning there as well.

This addresses liberfa/pyerfa#77
and replaces gh-15509 (the implementation has changed too much).

Fixes gh-15494 (and probably a few more)

Co-authored-by: Allan Haldane allan.haldane@gmail.com


Ping @ahaldane, @mhvk. The tests will fail right now. And we have to decide a few things, and probably one more change that is necessary here:

  • The tests fail, because the old PR included a test which ensured that the promotion result is always a packed struct. I think this makes sense, but we further should ensure the promotion result is always native byte order (the old code probably did this also). The new code doesn't call "promotion" because both descriptors are identical. This calls ensure_nbo. My plan for ensure_nbo(descr) is to actually call a new DType.ensure_canonical(descr) (for native dtypes, this will just do the same as ensure_nbo) but for structured ones it would: Recursively call ensure_canonical and drop offsets. (Fixing the test failure) This should be simple, but requires adding the new slot to the DType, I will probably just do this soon.
  • I removed the FutureWarning from the void comparison path. This means we have to decide wether void comparisons should ever return an array of False if the types they include mismatch. Or if they should raise an error. Note that most field mismatches will still raise a FutureWarning (since it will call self[field1_self] == other[field1_other] which will give a warning when the types are not comparable. This is thus only about the explicit error messages below. (Or alternatively, keep the warning for now, even if it is very strange for dtypes that can promote to not also compare.)
  • We need to improve test coverage (at least in all those points that coverage points too)

As such this is currently draft, although all changes should be good, there are some things missing. But I decided to have a look since gh-15509 is outdated now and it might be tricky to resurrect it without knowing the new layout well.

EDIT: Currently based off gh-19346

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This all looks quite clear, but there is one point that probably needs to be clarified. Currently, the docstring for np.promote_types states:

Returns the data type with the smallest size and smallest scalar
kind to which both ``type1`` and ``type2`` may be safely cast.

But for structured dtypes, this PR insists names match, which implies casting='equiv'.

This does seem intentional, but I think it is good to ensure it is the right decision. Should np.promote_types correspond to casting='safe' instead? Looking back at #15494, which concerns the concatenation of two arrays with the same dtype but different names, it is not obvious to me that should be disallowed (though I'd definitely expect the output dtype have names from the dtype of the first array).

Note that I'm far from sure what the right answer is. But if it is disallowed, we should clarify this in the docs for promote_types. And more generally it might make sense to have some overview of when and how field names are used!

p.s. Probably too unrelated: should the following work without a copy?

a = np.array([(1., 2.), (3., 4.)], dtype=[('a', 'f8'), ('b', 'f8')])
a.astype([('b', 'f8'), ('a', 'f8')], copy=False)

field_dims_a != 0 && /* neither is subarray */
/* Compare only the added (subarray) dimensions: */
!PyArray_CompareLists(
PyArray_DIMS(a) + PyArray_NDIM(self),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gosh, took me a while to understand there was pointer addition being done here...

return NULL;
}

temp = array_richcompare(a, (PyObject *)b,cmp_op);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add space after second ,

else {
/* Cannot have subarray, since subarray is absorbed into array. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add comment that this is for two unstructured voids? To me, the asserts feel a bit superfluous, but obviously doesn't hurt.

Py_ssize_t field_count = PyTuple_Size(from->names);
if (field_count != PyTuple_Size(to->names)) {
/* TODO: This should be rejected! */
return NPY_UNSAFE_CASTING;
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay! (as long as np.zeros((), dtype) continues to work!)

'offsets': [4, 0]})
assert_equal(x == y, False)
# But it is currently an equivalent cast:
assert np.can_cast(x, y, casting="equiv")
# This is an safe cast (not equiv) due to the different names:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really tricky example, as the way it was written, it described exactly the same bytes in memory, but just with a different order. With the change to use field order, I guess it makes sense that, as written, it would become "unsafe" (integer to float and vice versa). Might it be good to keep that as an example?

And then also insert the new example, where, because it is by field order, it is "safe".

@mhvk
Copy link
Contributor

mhvk commented Jun 12, 2021

Hmm, it seems the answer to my question about the behaviour of promote_types was already given by @ahaldane in #15494 (comment):

It seems like we want for two dtypes with different field names but same field dtypes:

  • can be cast to each other
  • cannot be promoted to a common type unambiguously

So can_cast should succeed, but promote_types should fail. Does that make sense?

I think "unambigously" is key here - with name mismatches, one has to pick, which means the result is ambiguous (and it breaks the symmetry promise of np.promote_types). So, this PR should include an update to the np.promote_types docstring...

@seberg - could you also include in this PR the changelog entry and doc updates to structured_arrays.py from #15509? They help understand what is being done!

On the larger scheme, mostly out of curiosity: is the goal that eventually the DType classes themselves can tell how promotion/casting should be done? I.e., that if one wanted to have a pure by-order or by-name dtype, one could implemented this?

@seberg
Copy link
Member Author

seberg commented Jun 12, 2021

Yes, sorry, I should have copied over the doc changes/change log from the other PR and expanded on that, its a bit too raw right now here! (And I thought adding the ensure_canonical is a bit more straight forward.)


About the "safe" casting... My opinion is that the documentation should be updated. And I can probably do that even ignoring this PR. I have given up on the idea that the common dtype (promotion) derives from casting. I honestly think its a fallacy. The two are related, but distinct.

Rather the properties of P = promote_types(A, B) are:

  • Both A and B must (should?) safely cast to P. (It seems surprising if we allow violating this. But arguable we already do with promote_types(int64, float64) -> float64 which is not really a safe cast. The main reason we call it a safe cast is just to make promotion work – because it previously relied on safe casting.)
  • We do not guarantee this is the "smallest". This is typical and guaranteed for vanilla NumPy, but not a generic rule!
    • If a user adds an int24 both uint16 and int16 can safely cast to it. But promote_types(uint16, int16) must continue to return int32!
  • If A can "equiv" cast to B, then I suppose we have some strong rules:
    • P must exist.
    • P can "equiv" cast to B
    • B can "equiv" cast to P
    • (P may be B, or B.copy() or B.ensure_canonical() – the last one is a byte-swap or removal offsets/padding.)

Ensuring this is the job of the type promotion (i.e. the DType implementer has to take care). But in that definition type promotion can be more strict.


p.s. Probably too unrelated: should the following work without a copy?

a = np.array([(1., 2.), (3., 4.)], dtype=[('a', 'f8'), ('b', 'f8')])
a.astype([('b', 'f8'), ('a', 'f8')], copy=False)

It is absolutely related! But the fix/change itself is mostly orthongal. We have to modify arr.astype() to use the following logic:

casting = get_casting_safety(self.dtype, new_dtype)
if casting == "no":
    # the current branch, should maybe use `self.dtype is new_dtype`, though!
    return self
if casting & cast_is_view:
    # new branch!
    return self.view(new_dype)
else:
    # create new array and cast

The changes here are designed so that that if we do that change, your example will also be no-copy.


is the goal that eventually the DType classes themselves can tell how promotion/casting should be done? I.e., that if one wanted to have a pure by-order or by-name dtype, one could implemented this?

This is already the case. But you don't have the public API to "recurse" into arbitrary dtypes well (e.g. to write the casting functions). So you could write such a new DType today, but you have to do it in NumPy.
Making enough of that public needs more careful though than most of the API, so I think it is longer term than general user DTypes. On the other hand, I am also not sure that it is actually problematic to just develop those as part of NumPy! "Dynamically" structured DTypes are maybe just special enough :).

@mhvk
Copy link
Contributor

mhvk commented Jun 12, 2021

That really helps! I also like your proposed logic for astype, using the (hidden) viewable casting.

p.s. I think ensure_canonical is a good idea.

"NumPy currently does not support promotion with field titles.")

return dtype([(name, promote_types(dt1[name], dt2[name]))
for name in dt1.names])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jeesh, I realized that this is actually incorrect. What is missing is something like align=dt1.isalignedstruct or dt2.isalignedstruct.

Is that what we want? Preserve alignment if one is aligned, or drop alignment when one is not aligned?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your choice of preserving alignment when one is aligned.

@seberg seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch 2 times, most recently from 5b1c020 to 716fe3c Compare June 26, 2021 00:32
return PyArray_DescrNewByteorder(type, NPY_NATIVE);
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer needed for non-structured arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is just moved, but maybe I should limit the change a bit (since this is perfectly fine for most NumPy builtin dtypes). In any case, it is actually a change of one of the previous PRs that would have to go in first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #19345

@seberg seberg added this to the 1.23.0 release milestone Feb 22, 2022
@seberg seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch 2 times, most recently from 9191e9d to a8aa5de Compare February 24, 2022 22:38
@seberg
Copy link
Member Author

seberg commented Feb 24, 2022

Code wise, this should be mostly ready now. I narrowed the special case in promote_types (I have not much love for that, but I have no better idea). I need to reorganize the docs with the release notes and review the old comments though.

@seberg seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch 2 times, most recently from d9abddd to a8de4ef Compare February 24, 2022 23:04
@mhvk
Copy link
Contributor

mhvk commented Feb 25, 2022

@seberg - sounds good. Probably easiest to review once the doc changes are ready too...

@seberg seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch 2 times, most recently from 2e13510 to 38ffb10 Compare May 5, 2022 13:37
seberg and others added 9 commits May 9, 2022 17:13
This PR replaces the old numpygh-15509 implementing proper type promotion
for structured voids.  It further fixes the casting safety to consider
casts with equivalent field number and matching order as "safe"
and if the names, titles, and offsets match as "equiv".

The change perculates into the void comparison, and since it fixes
the order, it removes the current FutureWarning there as well.

This addresses liberfa/pyerfa#77
and replaces numpygh-15509 (the implementation has changed too much).

Fixes numpygh-15494  (and probably a few more)

Co-authored-by: Allan Haldane <allan.haldane@gmail.com>
Negative offsets in structured dtypes give indirect access outside
of the original data in principle, thus we should always reject
them (this is as of now practically unused in NumPy).

Further, `np.result_type` and `np.promote_types` should behave
the same as canonicalization for twice the same input (at least
mostly, metadata is a rare exception).

This was not the case, because promote-types had a fast path that
would return the wrong thing for structured dtypes.
Thus, reject them from the fast-path, and limit it to builtin types
for now.
…structured dtypes

Identical means here that the promotion is between the same dtype *and* promoting it
does not modify any of the included dtypes *and* does not modify the offset layout
or itemsize.
Also modify the error message to remove the "may return an array
of False note".  That is fathomable, but if we really want to go
there, maybe it should just get a new FutureWarning...
@seberg seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch from 5f7b973 to 079ca88 Compare May 9, 2022 19:56
rossbar added 2 commits May 11, 2022 09:58
Most of the removed info is already in the structured array dtype
doc or later in the release note - leave high-level summary in place.
@seberg seberg marked this pull request as ready for review May 11, 2022 08:42
@seberg seberg added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label May 11, 2022
One of the following test is mainly interesting on little endian machines,
but that seems OK, the tests definitely run on somewhere :)
@@ -1029,35 +1029,85 @@ static PyObject *
_void_compare(PyArrayObject *self, PyArrayObject *other, int cmp_op)
{
if (!(cmp_op == Py_EQ || cmp_op == Py_NE)) {
PyErr_SetString(PyExc_ValueError,
PyErr_SetString(PyExc_TypeError,
"Void-arrays can only be compared for equality.");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path is actually unreachable, so no coverage is not a surprise.

@seberg
Copy link
Member Author

seberg commented May 12, 2022

@ahaldane just a ping in case you have time to have a look on the logic, or whether the tests make sense. (I think it would be good to include this in the upcoming release and branching is looming, to finally get us all the way on "order", rather than "names" with structured dtypes.)

@charris charris added the triage review Issue/PR to be discussed at the next triage meeting label May 18, 2022
Copy link
Contributor

@rossbar rossbar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, though I mostly focused on the Python components + tests!

Comment on lines +1254 to +1266
assert_equal(a != b, [True, False])

a = np.array([(5, 42), (10, 1)], dtype=[('a', '>f8'), ('b', '<f8')])
b = np.array([(5, 43), (10, 1)], dtype=[('a', '<i8'), ('b', '>i8')])
assert_equal(a == b, [False, True])
assert_equal(a != b, [True, False])

# Including with embedded subarray dtype (although subarray comparison
# itself may still be a bit weird and compare the raw data)
a = np.array([(5, 42), (10, 1)], dtype=[('a', '10>f8'), ('b', '5<f8')])
b = np.array([(5, 43), (10, 1)], dtype=[('a', '10<i8'), ('b', '5>i8')])
assert_equal(a == b, [False, True])
assert_equal(a != b, [True, False])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a readability standpoint it's a little confusing that both the arrays themselves and the fields names are the same (i.e. a and b)! Not that it's a blocker or anything :)

Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>
@seberg seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch from 49665bd to 64e3c5c Compare May 19, 2022 01:20
@mattip mattip merged commit db481ba into numpy:main May 19, 2022
@mattip
Copy link
Member

mattip commented May 19, 2022

The linting "errors" are fine to ignore. Thanks @seberg

@seberg seberg deleted the fix-void-cast-safety-promotion-and-comparison branch May 19, 2022 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
30 - API 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. triage review Issue/PR to be discussed at the next triage meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: concatenate of structured dtypes does not check field orders match
5 participants