API: Fix structured dtype cast-safety, promotion, and comparison #19226

seberg · 2021-06-11T23:37:28Z

This PR replaces the old gh-15509 implementing proper type promotion
for structured voids. It further fixes the casting safety to consider
casts with equivalent field number and matching order as "safe"
and if the names, titles, and offsets match as "equiv".

The change perculates into the void comparison, and since it fixes
the order, it removes the current FutureWarning there as well.

This addresses liberfa/pyerfa#77
and replaces gh-15509 (the implementation has changed too much).

Fixes gh-15494 (and probably a few more)

Co-authored-by: Allan Haldane allan.haldane@gmail.com

Ping @ahaldane, @mhvk. The tests will fail right now. And we have to decide a few things, and probably one more change that is necessary here:

The tests fail, because the old PR included a test which ensured that the promotion result is always a packed struct. I think this makes sense, but we further should ensure the promotion result is always native byte order (the old code probably did this also). The new code doesn't call "promotion" because both descriptors are identical. This calls ensure_nbo. My plan for ensure_nbo(descr) is to actually call a new DType.ensure_canonical(descr) (for native dtypes, this will just do the same as ensure_nbo) but for structured ones it would: Recursively call ensure_canonical and drop offsets. (Fixing the test failure) This should be simple, but requires adding the new slot to the DType, I will probably just do this soon.
I removed the FutureWarning from the void comparison path. This means we have to decide wether void comparisons should ever return an array of False if the types they include mismatch. Or if they should raise an error. Note that most field mismatches will still raise a FutureWarning (since it will call self[field1_self] == other[field1_other] which will give a warning when the types are not comparable. This is thus only about the explicit error messages below. (Or alternatively, keep the warning for now, even if it is very strange for dtypes that can promote to not also compare.)
We need to improve test coverage (at least in all those points that coverage points too)

As such this is currently draft, although all changes should be good, there are some things missing. But I decided to have a look since gh-15509 is outdated now and it might be tricky to resurrect it without knowing the new layout well.

EDIT: Currently based off gh-19346

mhvk

This all looks quite clear, but there is one point that probably needs to be clarified. Currently, the docstring for np.promote_types states:

Returns the data type with the smallest size and smallest scalar
kind to which both ``type1`` and ``type2`` may be safely cast.

But for structured dtypes, this PR insists names match, which implies casting='equiv'.

This does seem intentional, but I think it is good to ensure it is the right decision. Should np.promote_types correspond to casting='safe' instead? Looking back at #15494, which concerns the concatenation of two arrays with the same dtype but different names, it is not obvious to me that should be disallowed (though I'd definitely expect the output dtype have names from the dtype of the first array).

Note that I'm far from sure what the right answer is. But if it is disallowed, we should clarify this in the docs for promote_types. And more generally it might make sense to have some overview of when and how field names are used!

p.s. Probably too unrelated: should the following work without a copy?

a = np.array([(1., 2.), (3., 4.)], dtype=[('a', 'f8'), ('b', 'f8')])
a.astype([('b', 'f8'), ('a', 'f8')], copy=False)

mhvk · 2021-06-12T15:54:53Z

numpy/core/src/multiarray/arrayobject.c

+                    field_dims_a != 0 &&  /* neither is subarray */
+                    /* Compare only the added (subarray) dimensions: */
+                    !PyArray_CompareLists(
+                            PyArray_DIMS(a) + PyArray_NDIM(self),


Gosh, took me a while to understand there was pointer addition being done here...

mhvk · 2021-06-12T15:55:15Z

numpy/core/src/multiarray/arrayobject.c

+                return NULL;
+            }
+
+            temp = array_richcompare(a, (PyObject *)b,cmp_op);


Add space after second ,

mhvk · 2021-06-12T15:58:19Z

numpy/core/src/multiarray/arrayobject.c

    else {
+        /* Cannot have subarray, since subarray is absorbed into array. */


Maybe add comment that this is for two unstructured voids? To me, the asserts feel a bit superfluous, but obviously doesn't hurt.

mhvk · 2021-06-12T16:03:47Z

numpy/core/src/multiarray/convert_datatype.c

    Py_ssize_t field_count = PyTuple_Size(from->names);
    if (field_count != PyTuple_Size(to->names)) {
-        /* TODO: This should be rejected! */
-        return NPY_UNSAFE_CASTING;
+        return -1;


Yay! (as long as np.zeros((), dtype) continues to work!)

mhvk · 2021-06-12T16:32:19Z

numpy/core/tests/test_dtype.py

                      'offsets': [4, 0]})
        assert_equal(x == y, False)
-        # But it is currently an equivalent cast:
-        assert np.can_cast(x, y, casting="equiv")
+        # This is an safe cast (not equiv) due to the different names:


This is a really tricky example, as the way it was written, it described exactly the same bytes in memory, but just with a different order. With the change to use field order, I guess it makes sense that, as written, it would become "unsafe" (integer to float and vice versa). Might it be good to keep that as an example?

And then also insert the new example, where, because it is by field order, it is "safe".

mhvk · 2021-06-12T17:46:10Z

Hmm, it seems the answer to my question about the behaviour of promote_types was already given by @ahaldane in #15494 (comment):

It seems like we want for two dtypes with different field names but same field dtypes:

can be cast to each other

cannot be promoted to a common type unambiguously

So can_cast should succeed, but promote_types should fail. Does that make sense?

I think "unambigously" is key here - with name mismatches, one has to pick, which means the result is ambiguous (and it breaks the symmetry promise of np.promote_types). So, this PR should include an update to the np.promote_types docstring...

@seberg - could you also include in this PR the changelog entry and doc updates to structured_arrays.py from #15509? They help understand what is being done!

On the larger scheme, mostly out of curiosity: is the goal that eventually the DType classes themselves can tell how promotion/casting should be done? I.e., that if one wanted to have a pure by-order or by-name dtype, one could implemented this?

seberg · 2021-06-12T18:25:32Z

Yes, sorry, I should have copied over the doc changes/change log from the other PR and expanded on that, its a bit too raw right now here! (And I thought adding the ensure_canonical is a bit more straight forward.)

About the "safe" casting... My opinion is that the documentation should be updated. And I can probably do that even ignoring this PR. I have given up on the idea that the common dtype (promotion) derives from casting. I honestly think its a fallacy. The two are related, but distinct.

Rather the properties of P = promote_types(A, B) are:

Both A and B must (should?) safely cast to P. (It seems surprising if we allow violating this. But arguable we already do with promote_types(int64, float64) -> float64 which is not really a safe cast. The main reason we call it a safe cast is just to make promotion work – because it previously relied on safe casting.)
We do not guarantee this is the "smallest". This is typical and guaranteed for vanilla NumPy, but not a generic rule!
- If a user adds an int24 both uint16 and int16 can safely cast to it. But promote_types(uint16, int16) must continue to return int32!
If A can "equiv" cast to B, then I suppose we have some strong rules:
- P must exist.
- P can "equiv" cast to B
- B can "equiv" cast to P
- (P may be B, or B.copy() or B.ensure_canonical() – the last one is a byte-swap or removal offsets/padding.)

Ensuring this is the job of the type promotion (i.e. the DType implementer has to take care). But in that definition type promotion can be more strict.

p.s. Probably too unrelated: should the following work without a copy?
a = np.array([(1., 2.), (3., 4.)], dtype=[('a', 'f8'), ('b', 'f8')])
a.astype([('b', 'f8'), ('a', 'f8')], copy=False)

It is absolutely related! But the fix/change itself is mostly orthongal. We have to modify arr.astype() to use the following logic:

casting = get_casting_safety(self.dtype, new_dtype)
if casting == "no":
    # the current branch, should maybe use `self.dtype is new_dtype`, though!
    return self
if casting & cast_is_view:
    # new branch!
    return self.view(new_dype)
else:
    # create new array and cast

The changes here are designed so that that if we do that change, your example will also be no-copy.

is the goal that eventually the DType classes themselves can tell how promotion/casting should be done? I.e., that if one wanted to have a pure by-order or by-name dtype, one could implemented this?

This is already the case. But you don't have the public API to "recurse" into arbitrary dtypes well (e.g. to write the casting functions). So you could write such a new DType today, but you have to do it in NumPy.
Making enough of that public needs more careful though than most of the API, so I think it is longer term than general user DTypes. On the other hand, I am also not sure that it is actually problematic to just develop those as part of NumPy! "Dynamically" structured DTypes are maybe just special enough :).

mhvk · 2021-06-12T18:33:57Z

That really helps! I also like your proposed logic for astype, using the (hidden) viewable casting.

p.s. I think ensure_canonical is a good idea.

seberg · 2021-06-25T00:35:31Z

numpy/core/_internal.py

+            "NumPy currently does not support promotion with field titles.")
+
+    return dtype([(name, promote_types(dt1[name], dt2[name]))
+                  for name in dt1.names])


Jeesh, I realized that this is actually incorrect. What is missing is something like align=dt1.isalignedstruct or dt2.isalignedstruct.

Is that what we want? Preserve alignment if one is aligned, or drop alignment when one is not aligned?

I agree with your choice of preserving alignment when one is aligned.

doc/release/upcoming_changes/19226.compatibility.rst

mattip · 2021-06-26T21:40:08Z

numpy/core/src/multiarray/convert_datatype.c

-        return PyArray_DescrNewByteorder(type, NPY_NATIVE);
-    }
-}
-


This is no longer needed for non-structured arrays?

It is just moved, but maybe I should limit the change a bit (since this is perfectly fine for most NumPy builtin dtypes). In any case, it is actually a change of one of the previous PRs that would have to go in first.

seberg · 2022-02-24T22:40:43Z

Code wise, this should be mostly ready now. I narrowed the special case in promote_types (I have not much love for that, but I have no better idea). I need to reorganize the docs with the release notes and review the old comments though.

mhvk · 2022-02-25T10:48:43Z

@seberg - sounds good. Probably easiest to review once the doc changes are ready too...

This PR replaces the old numpygh-15509 implementing proper type promotion for structured voids. It further fixes the casting safety to consider casts with equivalent field number and matching order as "safe" and if the names, titles, and offsets match as "equiv". The change perculates into the void comparison, and since it fixes the order, it removes the current FutureWarning there as well. This addresses liberfa/pyerfa#77 and replaces numpygh-15509 (the implementation has changed too much). Fixes numpygh-15494 (and probably a few more) Co-authored-by: Allan Haldane <allan.haldane@gmail.com>

Negative offsets in structured dtypes give indirect access outside of the original data in principle, thus we should always reject them (this is as of now practically unused in NumPy). Further, `np.result_type` and `np.promote_types` should behave the same as canonicalization for twice the same input (at least mostly, metadata is a rare exception). This was not the case, because promote-types had a fast path that would return the wrong thing for structured dtypes. Thus, reject them from the fast-path, and limit it to builtin types for now.

…some stuff in the NEP 50 draft...

…structured dtypes Identical means here that the promotion is between the same dtype *and* promoting it does not modify any of the included dtypes *and* does not modify the offset layout or itemsize.

Also modify the error message to remove the "may return an array of False note". That is fathomable, but if we really want to go there, maybe it should just get a new FutureWarning...

Most of the removed info is already in the structured array dtype doc or later in the release note - leave high-level summary in place.

One of the following test is mainly interesting on little endian machines, but that seems OK, the tests definitely run on somewhere :)

seberg · 2022-05-11T14:06:35Z

numpy/core/src/multiarray/arrayobject.c

@@ -1029,35 +1029,85 @@ static PyObject *
 _void_compare(PyArrayObject *self, PyArrayObject *other, int cmp_op)
 {
    if (!(cmp_op == Py_EQ || cmp_op == Py_NE)) {
-        PyErr_SetString(PyExc_ValueError,
+        PyErr_SetString(PyExc_TypeError,
                "Void-arrays can only be compared for equality.");


The path is actually unreachable, so no coverage is not a surprise.

seberg · 2022-05-12T05:55:05Z

@ahaldane just a ping in case you have time to have a look on the logic, or whether the tests make sense. (I think it would be good to include this in the upcoming release and branching is looming, to finally get us all the way on "order", rather than "names" with structured dtypes.)

rossbar

This LGTM, though I mostly focused on the Python components + tests!

numpy/core/tests/test_numeric.py

numpy/core/tests/test_nditer.py

numpy/core/tests/test_multiarray.py

rossbar · 2022-05-18T19:28:47Z

numpy/core/tests/test_multiarray.py

+        assert_equal(a != b, [True, False])
+
+        a = np.array([(5, 42), (10, 1)], dtype=[('a', '>f8'), ('b', '<f8')])
+        b = np.array([(5, 43), (10, 1)], dtype=[('a', '<i8'), ('b', '>i8')])
+        assert_equal(a == b, [False, True])
+        assert_equal(a != b, [True, False])
+
+        # Including with embedded subarray dtype (although subarray comparison
+        # itself may still be a bit weird and compare the raw data)
+        a = np.array([(5, 42), (10, 1)], dtype=[('a', '10>f8'), ('b', '5<f8')])
+        b = np.array([(5, 43), (10, 1)], dtype=[('a', '10<i8'), ('b', '5>i8')])
+        assert_equal(a == b, [False, True])
+        assert_equal(a != b, [True, False])


From a readability standpoint it's a little confusing that both the arrays themselves and the fields names are the same (i.e. a and b)! Not that it's a blocker or anything :)

numpy/core/tests/test_dtype.py

numpy/core/src/multiarray/dtypemeta.c

Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>

mattip · 2022-05-19T17:00:39Z

The linting "errors" are fine to ignore. Thanks @seberg

github-actions bot added the 30 - API label Jun 11, 2021

seberg mentioned this pull request Jun 11, 2021

BUG: Disallow type-promotion of dtypes with different field names #15509

Closed

mhvk reviewed Jun 12, 2021

View reviewed changes

seberg commented Jun 25, 2021

View reviewed changes

seberg mentioned this pull request Jun 25, 2021

API: Fix np.result_type(structured_dtype) to "canonicalize" #19346

Merged

seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch 2 times, most recently from 5b1c020 to 716fe3c Compare June 26, 2021 00:32

mattip reviewed Jun 26, 2021

View reviewed changes

doc/release/upcoming_changes/19226.compatibility.rst Show resolved Hide resolved

mattip reviewed Jun 26, 2021

View reviewed changes

seberg added this to the 1.23.0 release milestone Feb 22, 2022

seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch 2 times, most recently from 9191e9d to a8aa5de Compare February 24, 2022 22:38

seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch 2 times, most recently from d9abddd to a8de4ef Compare February 24, 2022 23:04

seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch 2 times, most recently from 2e13510 to 38ffb10 Compare May 5, 2022 13:37

seberg and others added 9 commits May 9, 2022 17:13

DOC: Add a long release-note about structured casts and promotion

f376d23

WIP: docs... Need to start a whole new "promotion" probably based on …

8f8e446

…some stuff in the NEP 50 draft...

fixup promote types

dcb8b07

MAINT,ENH: Adjust metadata preservation, preserving it for identical …

59b7afe

…structured dtypes Identical means here that the promotion is between the same dtype *and* promoting it does not modify any of the included dtypes *and* does not modify the offset layout or itemsize.

TST: Fortify tests for improved struct dtype promotion and comparison

eb8b2e3

BUG: Add missing return NULL in void common instance

4091d91

TST: Fixup slow metadata promotion tests to match new behaviour

1b9a98a

TST: Add additional (basic) failure tests for void comparisons

079ca88

Also modify the error message to remove the "may return an array of False note". That is fathomable, but if we really want to go there, maybe it should just get a new FutureWarning...

seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch from 5f7b973 to 079ca88 Compare May 9, 2022 19:56

rossbar mentioned this pull request May 10, 2022

DOC: restructure docs/release note seberg/numpy#45

Merged

rossbar added 2 commits May 11, 2022 09:58

Rm preamble from release note.

14b601c

Most of the removed info is already in the structured array dtype doc or later in the release note - leave high-level summary in place.

DOC: mv details from release note to user guide.

9967058

seberg marked this pull request as ready for review May 11, 2022 08:42

BUG: Failed comparison should be TypeError and fix title mismatch test

cacc3b7

seberg added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label May 11, 2022

TST: Fix byte-swap test for big-endian machines

e19c7d5

One of the following test is mainly interesting on little endian machines, but that seems OK, the tests definitely run on somewhere :)

seberg commented May 11, 2022

View reviewed changes

charris added the triage review Issue/PR to be discussed at the next triage meeting label May 18, 2022

rossbar approved these changes May 18, 2022

View reviewed changes

DOC: Fixup test and general doc strings

64e3c5c

Co-authored-by: Ross Barnowski <rossbar@berkeley.edu>

seberg force-pushed the fix-void-cast-safety-promotion-and-comparison branch from 49665bd to 64e3c5c Compare May 19, 2022 01:20

mattip merged commit db481ba into numpy:main May 19, 2022

seberg deleted the fix-void-cast-safety-promotion-and-comparison branch May 19, 2022 17:25

jakevdp mentioned this pull request Jun 23, 2022

autodiff fails under numpy 1.23 jax-ml/jax#11221

Closed

		else {
		/* Cannot have subarray, since subarray is absorbed into array. */

Uh oh!

API: Fix structured dtype cast-safety, promotion, and comparison #19226

API: Fix structured dtype cast-safety, promotion, and comparison #19226

Uh oh!

Conversation

seberg commented Jun 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mhvk left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mhvk commented Jun 12, 2021

Uh oh!

seberg commented Jun 12, 2021

Uh oh!

mhvk commented Jun 12, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seberg commented Feb 24, 2022

Uh oh!

mhvk commented Feb 25, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

seberg commented May 12, 2022

Uh oh!

rossbar left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mattip commented May 19, 2022

Uh oh!

Uh oh!

seberg commented Jun 11, 2021 •

edited

Loading