-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
MAINT: struct assignment "by field position", multi-field indices return views #6053
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d240955
to
f205e04
Compare
a2c9006
to
c5128ed
Compare
b3a11f3
to
49fa42a
Compare
ed4a602
to
988bec1
Compare
All right, this PR is complete except for release notes. Although I'm aiming for 1.11 which is a long way off, it might be worth discussing the deprecation warnings I mentioned for 1.10. Also, for some context here are a few notes on other things that could be improved about structured arrays. |
☔ The latest upstream changes (presumably #6269) made this pull request unmergeable. Please resolve the merge conflicts. |
@ahaldane Needs a rebase. |
You need to choose which fields you want to keep, now: required_data = np.require(data[['a', 'b']], dtype=dt) |
But what if there is nested dtype:
Thanks |
Hmm, true, there does not seem to be a replacement for that case... Assignment-by-position still seems a lot better to me than assignment-by-name, but it would be good to find a way to replace all the old use-cases such as that one. |
Hi @ahaldane , Our clients have a lot of python scripts which uses different numpy functions like: Sample tests:
Error:
We can't upgrade version of numpy for clients higher than 1.13.3 due to "Multiple-field indexing/assignment of structured arrays" changes in 1.14. Do you plan to implement some functionality in newer version of numpy to cover these old cases? Thanks, |
Hmm, this makes me think I should add a function It would effectively do what the old |
That would be great |
@ihormelnyk I put up a draft of these functions at #11526. |
@ahaldane: I can think of a non-breaking way to change this functionality - introduce a new
The fact that all numpy dtypes can be structured seems super weird to me - perhaps we could move away from that at the same time as introducing these new semantics |
@eric-wieser That would be great. My understanding is that the funding @njsmith and others at berkeley got was (in part) to work on the "epic dtype cleanup plan", and I think one of the points there was that structured types should be their own type, not void. I would strongly support using the new semantics when we create that new type. |
By the way @ihormelnyk, your "extend_dtype" test might be hiding some unexpected behavior since you use >>> t1 = np.dtype([('b', np.float64)])
>>> t2 = np.dtype([('a', np.float32), ('b', np.float64)])
>>> data1 = np.array([(1,), (2,), (3,)], t1)
>>> np.require(data1, t2)
array([( 0., 1.), ( 0., 2.), ( 0., 3.)],
dtype=[('a', '<f4'), ('b', '<f8')]) That's the kind of inexplicit behavior I was trying to avoid with some of these changes. |
@ihormelnyk's issues look different from any mentioned before. Perhaps the behavior asked for should just be available as |
AFAIK, these changes were rolled back in 1.14.1 - so you absolutely can have your clients upgrade, as long as they skip 1.14.0 |
Output:
For numpy 1.13.3 it works well |
For me it works in 1.13.3, but displays
In @ihormelnyk 's case that new behavior would change the dtype
|
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
This returns a dtype of the same size with other fields removed, just like that used in numpygh-6053 It should be possible to fall back on this implementation in mapping.c:_get_field_view
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
This is a try at improving structured array assignment, aiming for 1.11. The goal is to remove various inconsistencies in structure assignment, and to allow multi-field indexing to return a view.
The biggest change is that assignment from one structured array to another now works "by position" instead of "by field name", as in this example:
Previously, arr1 would be set to "(5,4,6)" but with this PR it is set to "(4,5,6)".
This fixes the issues in structured array assignment outlined below, and allows casting/assignment in new situations. Then I allow multi-field indexing (like
arr1[['b', 'c']]
) to return a view instead of a copy (issue #5994), since assignment to this view will now work more sensibly.This PR also fixes #2353 #6085 #3351.
Problems in structure assignment
There are two code paths used to assign structured items, 1. using
setitem
or 2. using thedtype_transfer
functions. The two paths behave differently from each other, and each sometimes have unexpected behaviors. Consider this setup:Output:
In summary, only the assignment using exactly the same dtype (v1) works, all other cases are broken in some way. For v2,
setitem
effectively assigns the fields "by position", butdtype_transfer
code assigns "by field name". For v3,dtype_transfer
converts "missing" fields to 0. In v4dtype_transfer
sets missing fields to 0, whilesetitem
pads short buffers with 0. For v5,setitem
copies the bytes directly (no casting). For the "partial view"w
, we get uninitialized values at unviewed fields in both paths.Proposed Solution
In this PR I made all field assignment occur "by position", and cast if needed. That is, if you order the fields from left to right in dst and src arrays, dst field 0 gets set to src field 0 and so on, regardless of field name. This makes sense since the "python representation" of a structured element (in
VOID_getitem
) is a tuple, and tuple assignment to a structure works "by position" too:Assignment to a different number of fields (v4 above) is no longer allowed. With this behavior, the output of the example will be:
This is a backward incompatible change, but I suspect it's not a big problem for most people given how inconsistently structure assignment worked. The most common case - assigning using an identical dtype (v1) - behaves the same as before. I ran unit tests for scipy, pandas and astropy with the PR on my system (python2, x64 linux). Scipy and pandas are unaffected, and astropy fails 1 new test (this one).
One alternative is to make all assignment work "by field name", but that also involves a compatibility break for all of the setitem cases. The fact that
dtype_transfer
code assigns by field name is currently undocumented (as far as I see), however there are a couple unit tests that test whether "reordering" of fields is allowed (like in v2), which depends on assignment by field name. I removed/modified these tests. I think assignment by field name is confusing and bug prone though, and without it I was able to get rid of a number of unit tests for edge cases we no longer have to worry about (eg, checking that "unviewed" fields of various kinds got set to 0 or None).Historical note: Based on some comments in unit tests it looks like numpy 1.5 and before assigned fields "by position", but this was changed in 1.6 to assignment "by field name" in PR #36, though this was not noted anywhere in docs or release notes.
Implementation involved rewriting parts of
VOID_setitem
andget_fields_transfer_function
.Multi-Field Assignment
With these changes it is now reasonable to allow writeable multi-field views, which allows assignment like:
Note that with assignment "by position" reordering the fields (like in v2) would no longer work. However, multi-field assignment makes up for this, since fields will be assigned in the order they are subscripted, allowing reordering in a different way. For example,
swaps the field data (while it was a no-op/broken in the old assignment "by field name").
Note that multi-field views (whether by this PR or another) introduce another backward incompatible change, since the byte-arrangement of a multi-field copy is different from the byte-arrangement of a multi-field view:
This is a problem for code which does things like
a[['f0','f2']].view('i4')
(like in unit testtest_field_names
), which would end up viewing different values in the old vs new code. In the new code this particular view raises an error as a result of my other PR #4488 to add safety checks to views. Similarly,np.recarray
code uses such a view internally, and needed to be fixed. There might be a similar problem for the planned conversion tonp.diagonal
to a view.I suggest we add a deprecation warning for code that attempts such a view in 1.10. Multi-field views have been planned for a long time and there is already a deprecation warning that copies will become views, but the warning only appears if you try to assign to the copy, and not if you try to view the copy. I've made another PR, #6054, which adds the appropriate deprecation warnings.
This change required a small edit to
_internals.py
, and removing warnings inmapping.c
.Other changes
One difference between
setitem
anddtype_transfer
was thatdtype_transfer
allows assignment of a scalar to a structure (a[ind] = 1
), which sets all fields to that value, whilesetitem
did not. This is important for egnp.ones
, which initializes the array with such a statement. I made it so both paths now allow assignment using a scalar.Relatedly, got rid of undocumented special case for assignment from a structure to a scalar array of boolean type: If the scalar was of boolean type it would be set to "True" regardless of the values in the structured array, with the explanation that "the existence of fields implies True". This broke a MaskedArray test, which I fixed by updating
make_mask
to do the equivalent.Previously, if you tried to assign a structured array to a non-structured array just the first field was copied over. Eg
Now, I made it raise an exception unless the structured type has only one field, since I found it confusing.
Slightly changed PyArray_EquivTypes for structured arrays. Now it checks that the fields are in the same order in both dtypes. (previously only cared that the dtype.fields dicts had the same values). In other words, field order is now important!
I rewrote most of the docs on structured arrays. A possibly iffy part of the new docs is the section "obsolete features", in which I discourage use of field titles and use the "second" dictionary based form of dtype specification.