MAINT: struct assignment "by field position", multi-field indices return views #6053


Merged (4 commits) on Sep 9, 2017

Conversation

@ahaldane (Member) commented Jul 8, 2015

This is an attempt at improving structured array assignment, aiming for 1.11. The goal is to remove various inconsistencies in structure assignment and to allow multi-field indexing to return a view.

The biggest change is that assignment from one structured array to another now works "by position" instead of "by field name", as in this example:

>>> arr1 = np.array([(1,2,3)], dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'i4')])
>>> arr2 = np.array([(4,5,6)], dtype=[('b', 'i4'), ('a', 'i4'), ('c', 'i4')])
>>> arr1[:] = arr2

Previously, arr1 would be set to "(5,4,6)" but with this PR it is set to "(4,5,6)".

This fixes the issues in structured array assignment outlined below and allows casting/assignment in new situations. I then allow multi-field indexing (like arr1[['b', 'c']]) to return a view instead of a copy (issue #5994), since assignment to this view will now work more sensibly.

This PR also fixes #2353 #6085 #3351.

Problems in structure assignment

There are two code paths used to assign structured items: 1. setitem, and 2. the dtype_transfer functions. The two paths behave differently from each other, and each sometimes has unexpected behavior. Consider this setup:

import numpy as np

def testassign(arr, v):
    c = arr.copy()
    c[0] = v   # assign using setitem
    c[1:] = v  # assign using "dtype_transfer"
    print(c)

arr = np.ones(2, dtype=[('foo', 'i8'), ('bar', 'i8')])
v1 = np.array([(2,3)], dtype=[('foo', 'i8'), ('bar', 'i8')])
v2 = np.array([(2,3)], dtype=[('bar', 'i8'), ('foo', 'i8')])
v3 = np.array([(2,3)], dtype=[('bar', 'i8'), ('baz', 'i8')])
v4 = np.array([(2,)],  dtype=[('bar', 'i8')])
v5 = np.array([(2,3)], dtype=[('foo', 'f8'), ('bar', 'f8')])
w = arr.view({'names': ['bar'], 'formats': ['i8'], 'offsets': [8]})

testassign(arr, v1)
testassign(arr, v2)
testassign(arr, v3)
testassign(arr, v4)
testassign(arr, v5)
w[:] = 4
print(arr)

Output:

[(2, 3) (2, 3)]
[(2, 3) (3, 2)]
[(2, 3) (0, 2)]
[(2, 0) (0, 2)]
[(4607182418800017408, 4611686018427387904) (2, 3)]
[(140514438937232, 4) (140514438937232, 4)]

In summary, only the assignment using exactly the same dtype (v1) works; all the other cases are broken in some way. For v2, setitem effectively assigns the fields "by position", but the dtype_transfer code assigns "by field name". For v3, dtype_transfer converts "missing" fields to 0. For v4, dtype_transfer sets missing fields to 0, while setitem pads short buffers with 0. For v5, setitem copies the bytes directly (no casting). For the "partial view" w, we get uninitialized values in the unviewed fields on both paths.

Proposed Solution

In this PR I made all field assignment occur "by position", and cast if needed. That is, if you order the fields from left to right in dst and src arrays, dst field 0 gets set to src field 0 and so on, regardless of field name. This makes sense since the "python representation" of a structured element (in VOID_getitem) is a tuple, and tuple assignment to a structure works "by position" too:

>>> a[0] = (1,2)
>>> a[0] = a[1].item()  # works the same, essentially "a.setitem(0, a.getitem(1))"

Assignment to a different number of fields (v4 above) is no longer allowed. With this behavior, the output of the example will be:

[(2, 3) (2, 3)]
[(2, 3) (2, 3)]
[(2, 3) (2, 3)]
# v4 raises exception
[(2, 3) (2, 3)]
[(1, 4) (1, 4)]

This is a backward-incompatible change, but I suspect it's not a big problem for most people given how inconsistently structure assignment worked. The most common case, assigning using an identical dtype (v1), behaves the same as before. I ran unit tests for scipy, pandas and astropy with the PR on my system (python2, x64 linux). Scipy and pandas are unaffected, and astropy fails 1 new test (this one).

One alternative is to make all assignment work "by field name", but that would also break compatibility for all of the setitem cases. The fact that the dtype_transfer code assigns by field name is currently undocumented (as far as I can see), though there are a couple of unit tests that check whether "reordering" of fields is allowed (as in v2), which depends on assignment by field name. I removed/modified those tests. I find assignment by field name confusing and bug prone, and without it I was able to remove a number of unit tests for edge cases we no longer have to worry about (e.g., checking that "unviewed" fields of various kinds got set to 0 or None).

Historical note: Based on some comments in unit tests it looks like numpy 1.5 and before assigned fields "by position", but this was changed in 1.6 to assignment "by field name" in PR #36, though this was not noted anywhere in docs or release notes.

Implementation involved rewriting parts of VOID_setitem and get_fields_transfer_function.

Multi-Field Assignment

With these changes it is now reasonable to allow writeable multi-field views, which allows assignment like:

>>> a = np.ones(4, dtype=[('foo', 'i8'), ('bar', 'i8'), ('baz', 'f4')])
>>> a[['bar', 'baz']] = (2, 3.0)
>>> a[['baz', 'foo']][:2] = (4.0, 5)
>>> a
array([(5, 2, 4.0), (5, 2, 4.0), (1, 2, 3.0), (1, 2, 3.0)], 
      dtype=[('foo', '<i8'), ('bar', '<i8'), ('baz', '<f4')])

Note that with assignment "by position" reordering the fields (like in v2) would no longer work. However, multi-field assignment makes up for this, since fields will be assigned in the order they are subscripted, allowing reordering in a different way. For example,

>>> a[['foo', 'bar']] = a[['bar', 'foo']]

swaps the field data (while it was a no-op/broken in the old assignment "by field name").
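A minimal sketch of that swap, assuming the by-position semantics of this PR as eventually released in NumPy >= 1.16:

```python
import numpy as np

# Multi-field indices return views, and assignment is by field position,
# so assigning the reversed view swaps the two columns of data.
a = np.array([(1, 2), (3, 4)], dtype=[('foo', 'i8'), ('bar', 'i8')])
a[['foo', 'bar']] = a[['bar', 'foo']]
print(a)  # 'foo' now holds the old 'bar' values and vice versa
```

NumPy copies the source when it overlaps the destination, which is why this swap is safe even though both sides view the same memory.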

Note that multi-field views (whether by this PR or another) introduce another backward incompatible change, since the byte-arrangement of a multi-field copy is different from the byte-arrangement of a multi-field view:

>>> a = np.ones(10, dtype='i8,i8,i8')
>>> a[['f0','f2']].dtype
# old output, for the multi-field copy:
dtype([('f0', '<i8'), ('f2', '<i8')])
# this PR's output, for the multi-field view:
dtype({'names':['f0','f2'], 'formats':['<i8','<i8'], 'offsets':[0,16], 'itemsize':24})
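The difference is easy to inspect programmatically (a sketch, assuming the view semantics eventually released in NumPy >= 1.16): the view's dtype keeps the parent's itemsize and field offsets instead of packing the fields.

```python
import numpy as np

a = np.ones(10, dtype='i8,i8,i8')
v = a[['f0', 'f2']]

# The view keeps the parent's 24-byte itemsize, and 'f2' stays at
# offset 16, whereas the old copy behavior produced a packed dtype.
print(v.dtype.itemsize)          # itemsize under view semantics
print(v.dtype.fields['f2'][1])   # byte offset of 'f2' within each item
```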

This is a problem for code that does things like a[['f0','f2']].view('i4') (as in the unit test test_field_names), which would end up viewing different values in the old vs. new code. In the new code this particular view raises an error as a result of my other PR #4488, which adds safety checks to views. Similarly, the np.recarray code uses such a view internally and needed to be fixed. There might be a similar problem for the planned conversion of np.diagonal to a view.

I suggest we add a deprecation warning for code that attempts such a view in 1.10. Multi-field views have been planned for a long time and there is already a deprecation warning that copies will become views, but the warning only appears if you try to assign to the copy, and not if you try to view the copy. I've made another PR, #6054, which adds the appropriate deprecation warnings.

This change required a small edit to _internals.py, and removing warnings in mapping.c.

Other changes

  1. One difference between setitem and dtype_transfer was that dtype_transfer allowed assignment of a scalar to a structure (a[ind] = 1), setting all fields to that value, while setitem did not. This matters for e.g. np.ones, which initializes the array with such a statement. Both paths now allow assignment from a scalar.

  2. Relatedly, I got rid of an undocumented special case for assignment from a structure to a scalar array of boolean type: the scalar would be set to True regardless of the values in the structured array, with the explanation that "the existence of fields implies True". This broke a MaskedArray test, which I fixed by updating make_mask to do the equivalent.

  3. Previously, if you tried to assign a structured array to a non-structured array, just the first field was copied over, e.g.:

 >>> a = np.array([1,2,3])
 >>> b = np.array([(10,20,3.0)], dtype='i4,i8,f4')
 >>> a[:] = b
 >>> a
 array([10, 10, 10])

This now raises an exception unless the structured type has only one field, since I found the old behavior confusing.

  4. Slightly changed PyArray_EquivTypes for structured arrays: it now checks that the fields are in the same order in both dtypes (previously it only checked that the dtype.fields dicts had the same values). In other words, field order is now important!

  5. I rewrote most of the docs on structured arrays. A possibly iffy part of the new docs is the section "obsolete features", in which I discourage use of field titles and of the "second", dictionary-based form of dtype specification.
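A short sketch exercising three of the changes above (scalar assignment to a structure, assigning a multi-field structure to a plain array, and field-order-sensitive dtype equivalence), assuming the semantics of this PR as eventually released in NumPy >= 1.16:

```python
import numpy as np

# Scalar assignment: every field of every element is set to the scalar.
a = np.zeros(2, dtype=[('x', 'i8'), ('y', 'f8')])
a[:] = 3

# Assigning a multi-field structured array to a plain array now raises.
plain = np.zeros(3, dtype='i8')
try:
    plain[:] = np.ones(3, dtype='i4,i8,f4')
    raised = False
except (TypeError, ValueError):
    raised = True

# Field order now matters for dtype equivalence.
d1 = np.dtype([('a', 'i4'), ('b', 'i4')])
d2 = np.dtype([('b', 'i4'), ('a', 'i4')])
print(a['x'], a['y'], raised, d1 == d2)
```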

@ahaldane ahaldane force-pushed the multifield_structassign branch from d240955 to f205e04 Compare July 15, 2015 13:01
@ahaldane ahaldane changed the title WIP: allow multi fields views, improve struct assignment WIP: struct assignment "by field position", multi-field indices return views Jul 15, 2015
@ahaldane ahaldane force-pushed the multifield_structassign branch 3 times, most recently from a2c9006 to c5128ed Compare July 20, 2015 14:38
@ahaldane ahaldane force-pushed the multifield_structassign branch 5 times, most recently from b3a11f3 to 49fa42a Compare July 28, 2015 15:50
@ahaldane ahaldane force-pushed the multifield_structassign branch 3 times, most recently from ed4a602 to 988bec1 Compare August 7, 2015 02:19
@ahaldane (Member, Author) commented Aug 7, 2015

All right, this PR is complete except for release notes.

Although I'm aiming for 1.11 which is a long way off, it might be worth discussing the deprecation warnings I mentioned for 1.10.

Also, for some context here are a few notes on other things that could be improved about structured arrays.

@ahaldane ahaldane changed the title WIP: struct assignment "by field position", multi-field indices return views MAINT: struct assignment "by field position", multi-field indices return views Aug 7, 2015
@homu (Contributor) commented Sep 28, 2015

☔ The latest upstream changes (presumably #6269) made this pull request unmergeable. Please resolve the merge conflicts.

@charris (Member) commented Oct 18, 2015

@ahaldane Needs a rebase.

@ahaldane (Member, Author):

Now you need to choose which fields you want to keep:

required_data = np.require(data[['a', 'b']], dtype=dt)

@ihormelnyk commented Apr 20, 2018

But what if there is a nested dtype:

dt = np.dtype([('a', 'i1'), ('b', [('c', 'i4'), ('d', 'f8')])])
dt_extended = np.dtype([('a', 'i1'), ('b', [('c', 'i4'), ('zzz', 'i1'), ('d', 'f8')])])

data = np.zeros(2, dt_extended)
required_data = np.require(data[:], dtype=dt)

Thanks

@ahaldane (Member, Author):

Hmm, true, there does not seem to be a replacement for that case...

Assignment-by-position still seems a lot better to me than assignment-by-name, but it would be good to find a way to replace all the old use-cases such as that one.

@ihormelnyk:

Hi @ahaldane,

Our clients have a lot of Python scripts which use numpy functions like np.require, data.astype(dt2), and np.array(data1, dt2) to extend the data or to select a subset of it; the dtype is usually structured (complex).

Sample tests:

    def test_squeeze_dtype(self):
        t1 = np.dtype([('a', np.float32), ('b', np.float64)], align=True)
        t2 = np.dtype([('b', np.float64)], align=True)
        data1 = np.zeros(2, t1)
        data2 = np.require(data1, t2)
        data2 = data1.astype(t2)
        data2 = np.array(data1, t2)
        data2 = np.asarray(data1, t2)
        self.assertEqual(data2.dtype.descr, [('b', '<f8')])

    def test_extend_dtype(self):
        t1 = np.dtype([('b', np.float64)], align=True)
        t2 = np.dtype([('a', np.float32), ('b', np.float64)], align=True)
        data1 = np.zeros(2, t1)
        data2 = np.require(data1, t2)
        data2 = data1.astype(t2)
        data2 = np.array(data1, t2)
        data2 = np.asarray(data1, t2)
        self.assertEqual(data2.dtype.descr, [('a', '<f4'), ('', '|V4'), ('b', '<f8')])

Error:

    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: structures must have the same size

We can't upgrade our clients' numpy past 1.13.3 due to the "Multiple-field indexing/assignment of structured arrays" changes in 1.14.

Do you plan to implement some functionality in newer version of numpy to cover these old cases?

Thanks,
Ihor Melnyk

@ahaldane (Member, Author) commented Jul 7, 2018

Hmm, this makes me think I should add a function numpy.lib.recfunctions.require_fields, along with the new function repack_fields I already added.

It would effectively do what the old require call would do.
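For reference, the helper as it eventually landed in numpy.lib.recfunctions (NumPy >= 1.16) assigns by field name, much like the old np.require call did; a small sketch:

```python
import numpy as np
from numpy.lib.recfunctions import require_fields

a = np.ones(2, dtype=[('a', 'i4'), ('b', 'f8')])

# Cast to a required dtype by *field name*: fields are matched by name,
# cast as needed, and fields absent from the required dtype are dropped.
b = require_fields(a, [('b', 'f4')])
print(b.dtype.names)  # only 'b' remains
```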

@ihormelnyk:

That would be great

@ahaldane (Member, Author) commented Jul 7, 2018

@ihormelnyk I put up a draft of these functions at #11526.

@eric-wieser (Member):

@ahaldane: I can think of a non-breaking way to change this functionality - introduce a new np.structured data type, which is like np.void, but:

  • must always have named fields (no fixed-length bytestrings like void)
  • uses the new semantics

The fact that all numpy dtypes can be structured seems super weird to me - perhaps we could move away from that at the same time as introducing these new semantics

@ahaldane (Member, Author) commented Jul 8, 2018

@eric-wieser That would be great.

My understanding is that the funding @njsmith and others at Berkeley got was (in part) to work on the "epic dtype cleanup plan", and I think one of the points there was that structured types should be their own type, not void.

I would strongly support using the new semantics when we create that new type.

@ahaldane (Member, Author) commented Jul 8, 2018

By the way @ihormelnyk, your "extend_dtype" test might be hiding some unexpected behavior since you use zeros. Note how we get some zeros in the following:

>>> t1 = np.dtype([('b', np.float64)])
>>> t2 = np.dtype([('a', np.float32), ('b', np.float64)])
>>> data1 = np.array([(1,), (2,), (3,)], t1)
>>> np.require(data1, t2)
array([( 0.,  1.), ( 0.,  2.), ( 0.,  3.)],
      dtype=[('a', '<f4'), ('b', '<f8')])

That's the kind of implicit behavior I was trying to avoid with some of these changes.

@eric-wieser (Member):

@ihormelnyk's issues look different from any mentioned before. Perhaps the behavior asked for should just be available as .astype(casting='same_kind')?

@eric-wieser (Member) commented Jul 9, 2018

@ihormelnyk:

We can't upgrade version of numpy for clients higher than 1.13.3 due to "Multiple-field indexing/assignment of structured arrays" changes in 1.14.

AFAIK, these changes were rolled back in 1.14.1 - so you absolutely can have your clients upgrade, as long as they skip 1.14.0

@ihormelnyk:

@eric-wieser

import sys; print(sys.version)
import numpy; print(numpy.__version__)

import numpy as np

t1 = np.dtype([('a', np.float32), ('b', np.float64)], align=True)
t2 = np.dtype([('b', np.float64)], align=True)
data1 = np.zeros(2, t1)
data2 = np.require(data1, t2)

Output:

python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)]
numpy 1.14.5

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    data2 = np.require(data1, t2)
  File "Python27\lib\site-packages\numpy\core\numeric.py", line 690, in require
    return asanyarray(a, dtype=dtype)
  File "Python27\lib\site-packages\numpy\core\numeric.py", line 544, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
ValueError: structures must have the same size

For numpy 1.13.3 it works well

@mattip (Member) commented Jul 9, 2018

For me it works in 1.13.3, but displays

FutureWarning: Assignment between structured arrays with different field names will change in numpy 1.14.
Previously fields in the dst would be set to the value of the identically-named field in the src. In numpy 1.14 fields will instead be assigned 'by position': The Nth field of the dst will be set to the Nth field of the src array.

In @ihormelnyk 's case that new behavior would change the dtype float32 to float64.

See the release notes for details

detrout added a commit to detrout/dask that referenced this pull request Jan 10, 2019
Numpy attempted to release a feature in 1.14.0 (numpy/numpy#6053), which was then reverted in 1.14.1 with numpy/numpy#10411, and finally released in 1.16 by numpy/numpy#12447.

This also makes a minor enhancement to determine which numpy function should be used at import time instead of run time.

The function name was also updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16, as that's the primary version the feature was released in.
TomAugspurger pushed a commit to dask/dask that referenced this pull request Jan 14, 2019
eric-wieser added a commit to eric-wieser/numpy that referenced this pull request May 11, 2019
This returns a dtype of the same size with other fields removed, just like that used in numpygh-6053

It should be possible to fall back on this implementation in mapping.c:_get_field_view
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019
seberg pushed a commit that referenced this pull request Jun 7, 2019
…10417)

This returns a dtype of the same size with other fields removed, just like that used in gh-6053
and is used internally for the indexing code in mapping.c now.
Successfully merging this pull request may close these issues.

Assiging to structured array rows with list/array fails (Trac #1758)
9 participants