MAINT: struct assignment "by field position", multi-field indices return views #6053


Merged (4 commits) on Sep 9, 2017

Conversation

@ahaldane (Member) commented Jul 8, 2015

This is an attempt at improving structured array assignment, aiming for 1.11. The goal is to remove various inconsistencies in structure assignment and to allow multi-field indexing to return a view.

The biggest change is that assignment from one structured array to another now works "by position" instead of "by field name", as in this example:

>>> arr1 = np.array([(1,2,3)], dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'i4')])
>>> arr2 = np.array([(4,5,6)], dtype=[('b', 'i4'), ('a', 'i4'), ('c', 'i4')])
>>> arr1[:] = arr2

Previously, arr1 would be set to "(5,4,6)" but with this PR it is set to "(4,5,6)".

This fixes the issues in structured array assignment outlined below and allows casting/assignment in new situations. I then allow multi-field indexing (like arr1[['b', 'c']]) to return a view instead of a copy (issue #5994), since assignment to this view will now work more sensibly.

This PR also fixes #2353 #6085 #3351.

Problems in structure assignment

There are two code paths used to assign structured items: 1. setitem, and 2. the dtype_transfer functions. The two paths behave differently from each other, and each sometimes has unexpected behavior. Consider this setup:

import numpy as np

def testassign(arr, v):
    c = arr.copy()
    c[0] = v   # assign using setitem
    c[1:] = v  # assign using "dtype_transfer"
    print(c)

arr = np.ones(2, dtype=[('foo', 'i8'), ('bar', 'i8')])
v1 = np.array([(2,3)], dtype=[('foo', 'i8'), ('bar', 'i8')])
v2 = np.array([(2,3)], dtype=[('bar', 'i8'), ('foo', 'i8')])
v3 = np.array([(2,3)], dtype=[('bar', 'i8'), ('baz', 'i8')])
v4 = np.array([(2,)],  dtype=[('bar', 'i8')])
v5 = np.array([(2,3)], dtype=[('foo', 'f8'), ('bar', 'f8')])
w = arr.view({'names': ['bar'], 'formats': ['i8'], 'offsets': [8]})

testassign(arr, v1)
testassign(arr, v2)
testassign(arr, v3)
testassign(arr, v4)
testassign(arr, v5)
w[:] = 4
print(arr)

Output:

[(2, 3) (2, 3)]
[(2, 3) (3, 2)]
[(2, 3) (0, 2)]
[(2, 0) (0, 2)]
[(4607182418800017408, 4611686018427387904) (2, 3)]
[(140514438937232, 4) (140514438937232, 4)]

In summary, only the assignment using exactly the same dtype (v1) works; all the other cases are broken in some way. For v2, setitem effectively assigns the fields "by position", but the dtype_transfer code assigns "by field name". For v3, dtype_transfer converts "missing" fields to 0. For v4, dtype_transfer sets missing fields to 0, while setitem pads short buffers with 0. For v5, setitem copies the bytes directly (no casting). For the "partial view" w, we get uninitialized values in the unviewed fields on both paths.

Proposed Solution

In this PR I made all field assignment occur "by position", and cast if needed. That is, if you order the fields from left to right in dst and src arrays, dst field 0 gets set to src field 0 and so on, regardless of field name. This makes sense since the "python representation" of a structured element (in VOID_getitem) is a tuple, and tuple assignment to a structure works "by position" too:

>>> a[0] = (1,2)
>>> a[0] = a[1].item()  # works the same, essentially "a.setitem(0, a.getitem(1))"

Assignment to a different number of fields (v4 above) is no longer allowed. With this behavior, the output of the example will be:

[(2, 3) (2, 3)]
[(2, 3) (2, 3)]
[(2, 3) (2, 3)]
# v4 raises exception
[(2, 3) (2, 3)]
[(1, 4) (1, 4)]

This is a backward-incompatible change, but I suspect it's not a big problem for most people given how inconsistently structure assignment worked. The most common case, assigning using an identical dtype (v1), behaves the same as before. I ran unit tests for scipy, pandas and astropy with the PR on my system (python2, x64 linux). Scipy and pandas are unaffected, and astropy fails 1 new test (this one).

One alternative is to make all assignment work "by field name", but that would also break compatibility for all of the setitem cases. The fact that the dtype_transfer code assigns by field name is currently undocumented (as far as I can see), though there are a couple of unit tests that check whether "reordering" of fields is allowed (as in v2), which depends on assignment by field name. I removed/modified those tests. I find assignment by field name confusing and bug prone, and without it I was able to remove a number of unit tests for edge cases we no longer have to worry about (e.g., checking that "unviewed" fields of various kinds got set to 0 or None).

Historical note: Based on some comments in unit tests it looks like numpy 1.5 and before assigned fields "by position", but this was changed in 1.6 to assignment "by field name" in PR #36, though this was not noted anywhere in docs or release notes.

Implementation involved rewriting parts of VOID_setitem and get_fields_transfer_function.

Multi-Field Assignment

With these changes it is now reasonable to allow writeable multi-field views, which allows assignment like:

>>> a = np.ones(4, dtype=[('foo', 'i8'), ('bar', 'i8'), ('baz', 'f4')])
>>> a[['bar', 'baz']] = (2, 3.0)
>>> a[['baz', 'foo']][:2] = (4.0, 5)
>>> a
array([(5, 2, 4.0), (5, 2, 4.0), (1, 2, 3.0), (1, 2, 3.0)], 
      dtype=[('foo', '<i8'), ('bar', '<i8'), ('baz', '<f4')])

Note that with assignment "by position" reordering the fields (like in v2) would no longer work. However, multi-field assignment makes up for this, since fields will be assigned in the order they are subscripted, allowing reordering in a different way. For example,

>>> a[['foo', 'bar']] = a[['bar', 'foo']]

swaps the field data (while it was a no-op/broken in the old assignment "by field name").
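A minimal sketch of that swap, assuming the by-position semantics of this PR as eventually released in NumPy >= 1.16:

```python
import numpy as np

# Multi-field indices return views, and assignment is by field position,
# so assigning the reversed view swaps the two columns of data.
a = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i8'), ('bar', 'i8')])
a[['foo', 'bar']] = a[['bar', 'foo']]
print(a)  # 'foo' now holds the old 'bar' values and vice versa
```

NumPy copies the source when it overlaps the destination, which is why this swap is safe even though both sides view the same memory.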

Note that multi-field views (whether by this PR or another) introduce another backward incompatible change, since the byte-arrangement of a multi-field copy is different from the byte-arrangement of a multi-field view:

>>> a = np.ones(10, dtype='i8,i8,i8')
>>> a[['f0','f2']].dtype
# old output, for the multi-field copy:
dtype([('f0', '<i8'), ('f2', '<i8')])
# this PR's output, for the multi-field view:
dtype({'names':['f0','f2'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})
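The difference is easy to inspect programmatically (a sketch, assuming the view semantics eventually released in NumPy >= 1.16): the view's dtype keeps the parent's itemsize and field offsets instead of packing the fields.

```python
import numpy as np

a = np.ones(10, dtype='i8,i8,i8')
v = a[['f0', 'f2']]

# The view keeps the parent's 24-byte itemsize, and 'f2' stays at
# offset 16, whereas the old copy behavior produced a packed dtype.
print(v.dtype.itemsize)          # itemsize under view semantics
print(v.dtype.fields['f2'][1])   # byte offset of 'f2' within each item
```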

This is a problem for code that does things like a[['f0','f2']].view('i4') (as in the unit test test_field_names), which would end up viewing different values in the old vs. new code. In the new code this particular view raises an error as a result of my other PR #4488, which adds safety checks to views. Similarly, the np.recarray code uses such a view internally and needed to be fixed. There might be a similar problem for the planned conversion of np.diagonal to a view.

I suggest we add a deprecation warning for code that attempts such a view in 1.10. Multi-field views have been planned for a long time and there is already a deprecation warning that copies will become views, but the warning only appears if you try to assign to the copy, and not if you try to view the copy. I've made another PR, #6054, which adds the appropriate deprecation warnings.

This change required a small edit to _internals.py, and removing warnings in mapping.c.

Other changes

  1. One difference between setitem and dtype_transfer was that dtype_transfer allowed assignment of a scalar to a structure (a[ind] = 1), setting all fields to that value, while setitem did not. This matters for e.g. np.ones, which initializes the array with such a statement. Both paths now allow assignment from a scalar.

  2. Relatedly, I got rid of an undocumented special case for assignment from a structure to a scalar array of boolean type: the scalar would be set to True regardless of the values in the structured array, with the explanation that "the existence of fields implies True". This broke a MaskedArray test, which I fixed by updating make_mask to do the equivalent.

  3. Previously, if you tried to assign a structured array to a non-structured array, just the first field was copied over, e.g.:

 >>> a = np.array([1,2,3])
 >>> b = np.array([(10,20,3.0)], dtype='i4,i8,f4')
 >>> a[:] = b
 >>> a
 array([10, 10, 10])

This now raises an exception unless the structured type has only one field, since I found the old behavior confusing.

  4. Slightly changed PyArray_EquivTypes for structured arrays: it now checks that the fields are in the same order in both dtypes (previously it only checked that the dtype.fields dicts had the same values). In other words, field order is now important!

  5. I rewrote most of the docs on structured arrays. A possibly iffy part of the new docs is the section "obsolete features", in which I discourage use of field titles and of the "second", dictionary-based form of dtype specification.
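A short sketch exercising three of the changes above (scalar assignment to a structure, assigning a multi-field structure to a plain array, and field-order-sensitive dtype equivalence), assuming the semantics of this PR as eventually released in NumPy >= 1.16:

```python
import numpy as np

# Scalar assignment: every field of every element is set to the scalar.
a = np.zeros(2, dtype=[('x', 'i8'), ('y', 'f8')])
a[:] = 3

# Assigning a multi-field structured array to a plain array now raises.
plain = np.zeros(3, dtype='i8')
try:
    plain[:] = np.ones(3, dtype='i4,i8,f4')
    raised = False
except (TypeError, ValueError):
    raised = True

# Field order now matters for dtype equivalence.
d1 = np.dtype([('a', 'i4'), ('b', 'i4')])
d2 = np.dtype([('b', 'i4'), ('a', 'i4')])
print(a['x'], a['y'], raised, d1 == d2)
```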

@ahaldane ahaldane force-pushed the multifield_structassign branch from d240955 to f205e04 Compare July 15, 2015 13:01
@ahaldane ahaldane changed the title WIP: allow multi fields views, improve struct assignment WIP: struct assignment "by field position", multi-field indices return views Jul 15, 2015
@ahaldane ahaldane force-pushed the multifield_structassign branch 3 times, most recently from a2c9006 to c5128ed Compare July 20, 2015 14:38
@ahaldane ahaldane force-pushed the multifield_structassign branch 5 times, most recently from b3a11f3 to 49fa42a Compare July 28, 2015 15:50
@ahaldane ahaldane force-pushed the multifield_structassign branch 3 times, most recently from ed4a602 to 988bec1 Compare August 7, 2015 02:19
@ahaldane (Member, Author) commented Aug 7, 2015

All right, this PR is complete except for release notes.

Although I'm aiming for 1.11 which is a long way off, it might be worth discussing the deprecation warnings I mentioned for 1.10.

Also, for some context here are a few notes on other things that could be improved about structured arrays.

@ahaldane ahaldane changed the title WIP: struct assignment "by field position", multi-field indices return views MAINT: struct assignment "by field position", multi-field indices return views Aug 7, 2015
@homu (Contributor) commented Sep 28, 2015

☔ The latest upstream changes (presumably #6269) made this pull request unmergeable. Please resolve the merge conflicts.

@charris (Member) commented Oct 18, 2015

@ahaldane Needs a rebase.

@ahaldane (Member, Author):

Now you need to choose which fields you want to keep:

required_data = np.require(data[['a', 'b']], dtype=dt)

@ihormelnyk commented Apr 20, 2018

But what if there is a nested dtype:

dt = np.dtype([('a', 'i1'), ('b', [('c', 'i4'), ('d', 'f8')])])
dt_extended = np.dtype([('a', 'i1'), ('b', [('c', 'i4'), ('zzz', 'i1'), ('d', 'f8')])])

data = np.zeros(2, dt_extended)
required_data = np.require(data[:], dtype=dt)

Thanks

@ahaldane (Member, Author):

Hmm, true, there does not seem to be a replacement for that case...

Assignment-by-position still seems a lot better to me than assignment-by-name, but it would be good to find a way to replace all the old use-cases such as that one.

@ihormelnyk:

Hi @ahaldane,

Our clients have a lot of Python scripts which use numpy functions like np.require, data.astype(dt2), and np.array(data1, dt2) to extend the data or to select a subset of it; the dtype is usually structured (complex).

Sample tests:

    def test_squeeze_dtype(self):
        t1 = np.dtype([('a', np.float32), ('b', np.float64)], align=True)
        t2 = np.dtype([('b', np.float64)], align=True)
        data1 = np.zeros(2, t1)
        data2 = np.require(data1, t2)
        data2 = data1.astype(t2)
        data2 = np.array(data1, t2)
        data2 = np.asarray(data1, t2)
        self.assertEqual(data2.dtype.descr, [('b', '<f8')])

    def test_extend_dtype(self):
        t1 = np.dtype([('b', np.float64)], align=True)
        t2 = np.dtype([('a', np.float32), ('b', np.float64)], align=True)
        data1 = np.zeros(2, t1)
        data2 = np.require(data1, t2)
        data2 = data1.astype(t2)
        data2 = np.array(data1, t2)
        data2 = np.asarray(data1, t2)
        self.assertEqual(data2.dtype.descr, [('a', '<f4'), ('', '|V4'), ('b', '<f8')])

Error:

    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: structures must have the same size

We can't upgrade our clients' numpy past 1.13.3 due to the "Multiple-field indexing/assignment of structured arrays" changes in 1.14.

Do you plan to implement some functionality in newer version of numpy to cover these old cases?

Thanks,
Ihor Melnyk

@ahaldane (Member, Author) commented Jul 7, 2018

Hmm, this makes me think I should add a function numpy.lib.recfunctions.require_fields, along with the new function repack_fields I already added.

It would effectively do what the old require call would do.
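For reference, the helper as it eventually landed in numpy.lib.recfunctions (NumPy >= 1.16) assigns by field name, much like the old np.require call did; a small sketch:

```python
import numpy as np
from numpy.lib.recfunctions import require_fields

a = np.ones(2, dtype=[('a', 'i4'), ('b', 'f8')])

# Cast to a required dtype by *field name*: fields are matched by name,
# cast as needed, and fields absent from the required dtype are dropped.
b = require_fields(a, [('b', 'f4')])
print(b.dtype.names)  # only 'b' remains
```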

@ihormelnyk:

That would be great

@ahaldane (Member, Author) commented Jul 7, 2018

@ihormelnyk I put up a draft of these functions at #11526.

@eric-wieser (Member):

@ahaldane: I can think of a non-breaking way to change this functionality - introduce a new np.structured data type, which is like np.void, but:

  • must always have named fields (no fixed-length bytestrings like void)
  • uses the new semantics

The fact that all numpy dtypes can be structured seems super weird to me - perhaps we could move away from that at the same time as introducing these new semantics

@ahaldane (Member, Author) commented Jul 8, 2018

@eric-wieser That would be great.

My understanding is that the funding @njsmith and others at Berkeley got was (in part) to work on the "epic dtype cleanup plan", and I think one of the points there was that structured types should be their own type, not void.

I would strongly support using the new semantics when we create that new type.

@ahaldane (Member, Author) commented Jul 8, 2018

By the way @ihormelnyk, your "extend_dtype" test might be hiding some unexpected behavior since you use zeros. Note how we get some zeros in the following:

>>> t1 = np.dtype([('b', np.float64)])
>>> t2 = np.dtype([('a', np.float32), ('b', np.float64)])
>>> data1 = np.array([(1,), (2,), (3,)], t1)
>>> np.require(data1, t2)
array([( 0.,  1.), ( 0.,  2.), ( 0.,  3.)],
      dtype=[('a', '<f4'), ('b', '<f8')])

That's the kind of implicit behavior I was trying to avoid with some of these changes.

@eric-wieser (Member):

@ihormelnyk's issues look different from any mentioned before. Perhaps the behavior asked for should just be available as .astype(casting='same_kind')?

@eric-wieser (Member) commented Jul 9, 2018

@ihormelnyk:

We can't upgrade version of numpy for clients higher than 1.13.3 due to "Multiple-field indexing/assignment of structured arrays" changes in 1.14.

AFAIK, these changes were rolled back in 1.14.1 - so you absolutely can have your clients upgrade, as long as they skip 1.14.0

@ihormelnyk:

@eric-wieser

import sys; print(sys.version)
import numpy; print(numpy.__version__)

import numpy as np

t1 = np.dtype([('a', np.float32), ('b', np.float64)], align=True)
t2 = np.dtype([('b', np.float64)], align=True)
data1 = np.zeros(2, t1)
data2 = np.require(data1, t2)

Output:

python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)]
numpy 1.14.5

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    data2 = np.require(data1, t2)
  File "Python27\lib\site-packages\numpy\core\numeric.py", line 690, in require
    return asanyarray(a, dtype=dtype)
  File "Python27\lib\site-packages\numpy\core\numeric.py", line 544, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: structures must have the same size

For numpy 1.13.3 it works well

@mattip (Member) commented Jul 9, 2018

For me it works in 1.13.3, but displays

FutureWarning: Assignment between structured arrays with different field names will change in numpy 1.14.
Previously fields in the dst would be set to the value of the identically-named field in the src. In numpy 1.14 fields will instead be assigned 'by position': The Nth field of the dst will be set to the Nth field of the src array.

In @ihormelnyk 's case that new behavior would change the dtype float32 to float64.

See the release notes for details

detrout added a commit to detrout/dask that referenced this pull request Jan 10, 2019
Numpy attempted to release a feature in 1.14.0 (numpy/numpy#6053), which was then reverted in 1.14.1 with numpy/numpy#10411, and finally released in 1.16 by numpy/numpy#12447.

This also makes a minor enhancement to determine which numpy function should be used at import time instead of run time.

The function name was also updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16, as that's the primary version the feature was released in.
TomAugspurger pushed a commit to dask/dask that referenced this pull request Jan 14, 2019
eric-wieser added a commit to eric-wieser/numpy that referenced this pull request May 11, 2019
This returns a dtype of the same size with other fields removed, just like that used in numpygh-6053

It should be possible to fall back on this implementation in mapping.c:_get_field_view
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019
seberg pushed a commit that referenced this pull request Jun 7, 2019
…10417)

This returns a dtype of the same size with other fields removed, just like that used in gh-6053
and is used internally for the indexing code in mapping.c now.
Successfully merging this pull request may close these issues.

Assiging to structured array rows with list/array fails (Trac #1758)
9 participants