Skip to content

BUG: ensure that casting to/from structured is properly checked. #13648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

mhvk
Copy link
Contributor

@mhvk mhvk commented May 28, 2019

As noted in #11114, currently the following clearly wrong things happen:

np.can_cast(1., 'i2,i2', casting='safe')
# True
# while
np.can_cast(1., 'i2', casting='safe')
# False
np.array(1000.).astype('i1,i1', casting='safe')
# array((-24, -24), dtype=[('f0', 'i1'), ('f1', 'i1')])
# while
np.array(1000.).astype('i1', casting='safe')
# TypeError: Cannot cast array from dtype('float64') to dtype('int8') according to the rule 'safe'

The present PR fixes it, and has some elementary tests. In particular, it allows casting from simple to structured (EDIT) in unsafe mode, and from structured to simple only in unsafe mode and when the structured dtype has only a single field.

fixes #11114

p.s. Replaces #13646, which confused the general void type with structured.

}
else {
return 0;
}
Copy link
Member

@eric-wieser eric-wieser May 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could perhaps see an argument that in this direction, this should be any(PyArray_CanCastTypeTo(from, field, casting) for field in fields), allowing 2 to cast into (2, 2, 2,), which I think we have precedent for

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was perhaps more driven by symmetry here than anything else. But I think you are right: casting to multiple fields is in principle not ambiguous. But it should be all(cancast(...)), then, not any(...), correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To take a step back here, do we want safe casting? It feels right, but it also feels right that safe casting in most cases aligns with "common_type" and I am not sure I feel that I feel things like np.concatenate should feel they can just "broadcast" to all fields without some manual casting. (I did not make up my mind yet, probably its OK).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it's a question of if we already set a precedent for this weird behavior. If we didn't, then the original more conservative patch was probably the right one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm moving this out of the in-line comments, so it doesn't get lost when I change things.

@mhvk mhvk force-pushed the disallow-can-cast-of-everything-to-structured branch 2 times, most recently from e41981e to 7ce1813 Compare May 29, 2019 01:05
@mhvk
Copy link
Contributor Author

mhvk commented May 29, 2019

@eric-wieser - I changed it to allow casting to multiple fields. Also realized I had to put it further up, since the changes hold also for unsafe casting. Since I think this is getting close, I also updated the docstrings. I think can_cast is now consistent with actually executing a.astype(...).

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose a few tests assume that unsafe casting works in strange ways... I am not sure we want to go there and change that just now...

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose a few tests assume that unsafe casting works in strange ways... I am not sure we want to go there and change that just now...

@mhvk
Copy link
Contributor Author

mhvk commented May 29, 2019

@seberg, @eric-wieser, thanks for all the comments!

On the casting, I'm a bit torn. I agree with @eric-wieser that we should try to make can_cast consistent with actual casting behaviour in, e.g., astype, to the extent possible. The current behaviour of astype and __setitem__ (see below) can be summarized as follows:

  • Structured to simple: only allowed for a single field and when casting="unsafe"; can_cast is inconsistent with this in always returning True for "unsafe", independent of the number of fields. This is an easy fix.
  • Simple to structured: always done for any casting but "equiv" and "no"; this is the bug I hoped to fix.

My own sense is that symmetry suggests that we do not allow simple to structured at all for any casting but "unsafe". But for "unsafe", it does seem fine to just allow the cast to all fields, especially as this is what happens also in __setitem__.

This seems to make sufficient sense that I'll go ahead and change it accordingly...

Current behaviour

Going from structured to simple:

import numpy as np
# Multiple fields to simple
i8 = np.array([1000])
i1i1 = np.array([(1,2)], dtype='i1,i1')
np.can_cast('i1,i1', 'i8', casting='unsafe')
# True # and False for any other casting mode
i1i1.astype('i8', casting='unsafe')
# ValueError: Can't cast from structure to non-structure, except if the structure only has a single field.
i8[...] = i1i1
# ValueError: Can't cast from structure to non-structure, except if the structure only has a single field.

# Single field to simple
i1 = np.array([(1,)], dtype=[('i', 'i1')])
np.can_cast([('i', 'i1')], 'i8', casting='unsafe')
# True  # and False for any other casting mode
i1.astype('i8', casting='unsafe')
# array([1])  # note that casting='unsafe' is default
i8[...] = i1
# Passes -> i8=1
np.array([(1,)], dtype=[('i', 'i1')]).astype('i8', casting='safe')
# TypeError: Cannot cast array from dtype([('i', 'i1')]) to dtype('int64') according to the rule 'safe'

So, for this case, astype behaves sensibly in allowing a cast from structured to simple only when there is a single field, and even then only when casting="unsafe" - the current PR would extend this to allow any case where the single field can be cast to the simple type, but I think I should revert that and instead just make can_cast consistent with astype, i.e., change its behaviour for "unsafe" to not allow the case of multiple fields.

Now turning to casting from simple to structured

import numpy as np
i8 = np.array([1000])
# Simple to multiple, non-safe fields
np.can_cast('i8', 'i1,i1', casting='safe')
# True  # only 'no' and 'equiv' give False
i8.astype('i1,i1', casting='safe')
# array([(-24, -24)], dtype=[('f0', 'i1'), ('f1', 'i1')])
i1i1[...] = i8
# Passes -> same array.
# only 'no' and 'equiv' error
# Behaviour identical for safe fields and for a single field!

So, the current behaviour is that casting to a structured type is never allowed for "equiv" and always allowed for "safe" and beyond.

@mhvk
Copy link
Contributor Author

mhvk commented May 29, 2019

p.s. I think there actually is another inconsistency for structured to structured casting, in that can_cast looks at field names while astype and __setitem__ go by field order. Probably a leftover of #6053, so cc @ahaldane (and let's leave it out of this PR...)

EDIT: see gh-13667

@mhvk mhvk force-pushed the disallow-can-cast-of-everything-to-structured branch from 7ce1813 to 7f15d42 Compare May 29, 2019 15:18
@mhvk
Copy link
Contributor Author

mhvk commented May 29, 2019

Rewritten as noted above. Quite a bit simpler PR, which suggests it may be better..

@mhvk mhvk force-pushed the disallow-can-cast-of-everything-to-structured branch from 7f15d42 to cd90983 Compare May 29, 2019 15:22
@seberg seberg self-requested a review May 29, 2019 15:42
Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may get more complicating, fear... I am willing to move this forward though as an improvement though if you feel that is better. There is no regression here in 1.17, right?

/*
* Fast path for basic types.
*/
if ((NPY_LIKELY(from->type_num < NPY_OBJECT) &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose <= would work as well, not that it matters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was there before, but you're right, no sense in not including object.

*/
if (!PyDataType_HASFIELDS(to) && to->type_num != NPY_OBJECT) {
return (casting == NPY_UNSAFE_CASTING &&
PyDict_Size(from->fields) == 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if Eric won't come around a corner here and spook you with a dtype with nested fields :), it seems probable that there needs to be a recursion somewhere to get it right.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, this should be recursive.

return 1;
}
}
else if (PyDataType_HASFIELDS(to)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about object and subarrays? E.g np.dtype("(1,2)u8")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, of course rather something like> np.dtype(["f1", "f4", (5,)]) or np.dtype([("f1", "f,f")])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for casting to structured array, sub-arrays do not matter - certainly- astype happily does it. This particular stanza is really mostly there to ensure we return False when casting is not "unsafe".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For subarrays in from, astype seems quite happy to just take the first element. So, for now just mimicking that.

Copy link
Member

@eric-wieser eric-wieser May 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens for:

from_dt = np.dtype([('a', ('i4,i4', 2))])
to_dt = np.dtype([('a', 'i4')])

For instance

arr = np.array(([(1, 2), (3, 4)],), dtype=from_dt)
arr.astype(to_dt)

What's currently implemented is not "just take the first element", but "take the first element and ignore all casting rules when casting it"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your particular example with astype raises an error in both in current master and in my PR, because from_dt['a'] itself has two fields, which cannot be cast to a single one.

Sadly, though, can_cast happily claims this works even in my PR as from['a'] has no fields (but does have a subdtype) - though only for casting='unsafe'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that particular case is solved. But really astype is weird - it checks can_cast internally but can still throw an error later. This should not be allowed...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, astype casting could fail based on the values. Of course you are right and it should not fail based on information already inside the dtype.

I have to catch up a bit here, sorry, but I do not really want crunch my mind too much about it right now, we are talking about corner cases and if we have new infrastructure things might get easier to align.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg - agreed that we should take this bit-by-bit. My main goal really is to ensure that structured with multiple fields to unstructured always returns False, as this cannot possibly work (and has caused me real head-aches).

@mhvk mhvk force-pushed the disallow-can-cast-of-everything-to-structured branch from 7a2d01b to 2675ef9 Compare May 29, 2019 23:58
@mhvk
Copy link
Contributor Author

mhvk commented May 29, 2019

OK, pushed another update. Am still trying to keep the impact minimal. Another worry I have is whether user types can have fields - though at least this should be better than it was before...

@eric-wieser eric-wieser self-requested a review May 30, 2019 04:57
@mhvk mhvk force-pushed the disallow-can-cast-of-everything-to-structured branch from 2675ef9 to 09639c4 Compare May 31, 2019 17:53
@mhvk
Copy link
Contributor Author

mhvk commented May 31, 2019

Damn, astype is really quite inconsistent too in what it does for subarrays. For now, focussing on getting the structured from right...

@mhvk mhvk force-pushed the disallow-can-cast-of-everything-to-structured branch from 09639c4 to e4efd1d Compare May 31, 2019 20:43
@seberg seberg self-requested a review June 1, 2019 00:11
@charris charris added the 09 - Backport-Candidate PRs tagged should be backported label Jun 3, 2019
@charris
Copy link
Member

charris commented Jun 3, 2019

Also failing doctest at numpy.doc.structured_arrays.

@mhvk mhvk force-pushed the disallow-can-cast-of-everything-to-structured branch from 840d924 to a6f1713 Compare June 4, 2019 20:02
@mhvk
Copy link
Contributor Author

mhvk commented Jun 12, 2019

@seberg, @eric-wieser - what do you think we should do with this PR? Overall, I feel it at least fixes one obvious glaring inconsistency. On the other, it leaves many others. But those generally need more thought at developing a better set of tests. Personally, I'd prefer to approach this incrementally (mostly as I don't think I've got the time to do the larger overhaul...).

@seberg
Copy link
Member

seberg commented Jun 12, 2019

@mhvk sorry, meant to look through this for a bit now, but then did not when I saw we had put 1.18, and thinking I can't merge it anyway ;).
I hope more restructuring of casting can/will happen, but I am ok with doing some small improvements before the end. Plus anything that works and is tested now, is something we can't break when we try to do larger changes.

@mhvk
Copy link
Contributor Author

mhvk commented Jun 12, 2019

Ideally this would still go in 1.17...

@charris
Copy link
Member

charris commented Jun 12, 2019

The 1.18 just means it isn't a blocker for 1.17, it can certainly go in 1.17.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, sorry for that. Can't find anything to squirm about right now. But I do think we should add release notes, since casting is changed in subtle ways. As most actual casting uses "unsafe" mode, I doubt that those changes affect a lot of code though (and are not already buggy).

@seberg seberg added this to the 1.17.0 release milestone Jun 12, 2019
@seberg seberg added the 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes label Jun 12, 2019
Allow unsafe casting from a simple data type to a structured one with
multiple fields, but only from structured data type with a single
field (checked recursively) to a simple data type.

For now, continue to allow any structured to structured with
casting="unsafe", as the current can_cast implementation does not
match that of "astype".
@mhvk mhvk force-pushed the disallow-can-cast-of-everything-to-structured branch from a6f1713 to e12c6bd Compare June 12, 2019 18:48
@mhvk
Copy link
Contributor Author

mhvk commented Jun 12, 2019

OK, made the changes, and added a release note.

Previously, ``can_cast`` returned `True` for almost all inputs for
``casting='unsafe'``, even for cases where casting was not possible, such as
from a structured dtype to a regular one. This has been fixed, making it
more consistent with actual casting using, e.g., the ``.astype`` method.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also the change that np.zeros(3, "i4").astype("i4,i4", casting="safe") (or related can_cast) should fail after this? But maybe it is very fringe, since astype defaults to "unsafe" in any case, and so does assignment I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyway, I am good with merging this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, let me just merge this...

@mhvk mhvk merged commit 6420e7f into numpy:master Jun 12, 2019
@mhvk mhvk deleted the disallow-can-cast-of-everything-to-structured branch June 12, 2019 20:12
@seberg
Copy link
Member

seberg commented Jun 12, 2019

Cool, thanks Marten.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG? anything is allowed to cast to structured dtypes?
4 participants