BUG: Revert multifield-indexing adds padding bytes for NumPy 1.15. #10411

ahaldane · 2018-01-16T04:14:51Z

This is a potential fix to the multifield-indexing bugs in 1.14.0 caused by how easy it is now to get a structured array with padding bytes, which trigger other old bugs like #3176, #8100.

Changed Behavior

This PR reverts the change in 1.14.0 that a multifield-index returns a view instead of a copy, but we plan to make the change in 1.15. That is:

>>> a = np.zeros(3, dtype=[('a', 'i4'), ('b', 'i4'), ('c', 'f4')])

1.13 behavior:

>>> a[['a', 'c']] 
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype=[('a', '<i4'), ('c', '<f4')]

1.14.0 behavior (pushed off to 1.15):

>>> a[['a', 'c']]   
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype={'names':['a','c'], 'formats':['<i4','<f4'], 'offsets':[0,8], 'itemsize':12})

Summary of the problem

Why revert? The problem boils down to padding bytes. Compare the 1.13 to 1.14.0 output above, and note how there are "padding" or missing bytes at offset 4 in the structure in 1.14.0, and how the itemsize is different. The 1.13 behavior essentially returns a copy of the 1.14 output, but with the fields re-packed so that they are contiguous. The two ways users are affected were:

A) Code that attempts to "view" the result, as in:

     >>> a[['a', 'c']].view('i8')
     array([0, 0, 0])

In 1.13 this works fine, since the indexed array has an itemsize of 8, but in 1.14.0 this now fails since the itemsize of the view is still 12.

B) Code that attempts to use np.load/np.save to save these arrays, as in:

     >>> np.save('file', a[['a', 'c']])
     >>> np.load('file')

This fails because of another long-standing bug, that currently np.load cannot handle structured arrays with padding bytes, #8100. In 1.13 this bug was not triggered because there were no padding bytes, but in 1.14.0 they now appear often.

Proposed Way Forward:

Revert the copy-> view change, and leave it for 1.15.

That way, we have time to fix #8100 by then, solving problem B).

Problem A) will still cause broken code, but we hope to mitigate this by introducing a new method pack_fields in this PR for 1.14.1, which allows convertsion of the 1.15 output into the 1.13 output:

>>> np.pack_fields(a[['a', 'c']])
array([(0, 0.), (0, 0.), (0, 0.)],
      dtype=[('a', '<i4'), ('c', '<f4')]

See comments below for more discussion. Also see the "documentation" updates at the end of the diff for a description of the proposed behavior, under "Accessing Multiple Fields".

(Old outdated summary, for reference)

This PR implements a couple different possible behaviors and makes them user-configurable; we can decide in discussion which one we want, which should be default, or which to remove. You control the behavior by setting the variable np.multifield_index_setting to one of 'view', 'copy' or 'copywarn' (default).

For 'view', the behavior is unchanged from 1.14.0, meaning that multi-field indexing returns a view which will often contain padding bytes.

For 'copy', the behavior reverts back to 1.13 behavior, in which a packed-copy of the array fields is returned. This should solve most of people's bugs.

For 'copywarn', do the same as 'copy' but warn anytime multifield indexing is used. (default)

Is this kind of configuration option a good idea, or will it lead to a dependency mess? (I got the idea from the legacy mode for printing, which also allows the old and new behavior but defaults to the new one, unlike this PR currently).

In this PR I also introduce a new user-visible function np.pack_fields, which removes padding bytes from structured ndarrays and dtypes. Its existence should help any users transition to the new behavior: Doing np.pack_fields(a[['a', 'c']]) in the new behavior ('view' setting) returns what you get in the old ('copy') setting.

eric-wieser · 2018-01-16T05:27:34Z

numpy/core/numeric.py

+    packed : ndarray or dtype
+       Copy of a with padding bytes removed
+    """
+    if isinstance(a, dtype):


How about if not isinstance(a, dtype): return a.astype(pack_fields(a.dtype))

eric-wieser · 2018-01-16T05:28:21Z

numpy/core/numeric.py

@@ -623,6 +624,45 @@ def asfortranarray(a, dtype=None):
    """
    return array(a, dtype, copy=False, order='F', ndmin=1)

+def pack_fields(a):
+    """Return copy of a structured array with padding between fields removed.


And fields reordered in memory

Oh right I forgot about reordering. Will think about it tomorrow.

eric-wieser · 2018-01-16T05:30:17Z

I'd rather not introduced any more global state into numpy, to avoid running into #9444 in more places. The legacy mode changes are alright, because nothing apart from doc tests should depend on them - but real code is likely to want to turn on this behavior.

ahaldane · 2018-01-16T05:41:23Z

Here's a few different non-configurable ways we could go forward, which I would be OK with:

Keep the new view behavior in 1.14, and tell people to use pack_fields if their code broke. (statsmodels people may be unhappy... but I ran their tests on their master, I see only 1 test failure before this PR).
Revert to copywarn behavior for 1.14 with a note to use pack_fields (it will be a noop for copywarn), and then switch to solution 1 in 1.15. Needs a way to turn off the warning.
Keep the copy behavior forever for back-compatibility. Honestly I'm not even that unhappy with that. I care more about all the other MAINT: struct assignment "by field position", multi-field indices return views #6053 changes which we are keeping. Also note that I allow multifield-assignment in this PR, which is different from 1.13, and that alone is a pretty good improvement (see the doc updates).

eric-wieser · 2018-01-16T06:31:48Z

Keep the copy behavior forever for back-compatibility. Honestly I'm not even that unhappy with that.

That actually doesn't seem too awful. It would be even better if we could preserve the view behavior under a different API, perhaps:

a[['field3', 'field1']]  # master creates a view, proposal copies
a.view(a.dtype[['field3', 'field1']])  # a view no matter what

ahaldane · 2018-01-16T17:16:24Z

I want to collect some past discussions on this issue:

We discussed multifield-views more extensively in #350 and briefly in #5994 and #3641. This change has been originally planned as far back as numpy 1.7. It was mentioned as a FutureWarning in the 1.8 release note.

I also found these mailing list threads. ★ means more important.

★ Selection of only a certain number of fields (Feb 05, 2009) Travis Oliphant introduces multifield indexing, and debates view vs copy. People, including Travis, prefer "view", but travis later argues there may be a technical challenge which forces us to "copy". (This is overcome today as of MAINT: struct assignment "by field position", multi-field indices return views #6053).
FYI: New "select-multiple-fields" behavior (Feb 11, 2009) Travis thinks is is not possible to make multifield indexing return a view. He notes that fancy indexing does not return a view either.
issue #1431 accessing multiple fields in a recarray (Jun 27, 2010) Somewhat unrelated, this is a very short discussion about field reordering in multifield indexing. MAINT: struct assignment "by field position", multi-field indices return views #6053 supersedes/fixes this.
view of a structured array? ( Jan 25, 2012) Question asking if it is possible to get a view from a multi-field index. Answer is no, workaround suggested.
★ Extracting sub-fields from an array as a view (PR 350) ( Jul 14, 2012) Travis proposes to turn the copy into a view. Presumably this discussion turned into Add transition code for returning view when selecting subset of fields #350.
About ready to start 1.8 release process. (Aug 12, 2013) Chuck asked whether we wanted to go ahead with multifield-views FutureWarning, but no-one answered. So we did.

I think if that we want to argue in favor of the change, there is clearly a long history of planning and warning about it.

As a reminder of reported issues caused by the copy->view change, we have:

Re: NumPy 1.14.0 release (Jan 13, 2018) where it initially seemed like the view->copy change had caused statsmodels test failures, but I think further investigation pointed to a different problem, Possible regression with numpy 1.14.0 and recfromcsv #10394.
BUG: structured array indexing dtype and dtype.descr changed in 1.14.0 #10387, the save/load bug, which can be fixed with BUG: np.save() and np.load() are not idempotent when align=True or fields are discontiguous #8100.
New in NumPy 1.14: failures in PyTables #10344, which can also hopefully be fixed by BUG: np.save() and np.load() are not idempotent when align=True or fields are discontiguous #8100

eric-wieser · 2018-01-17T17:16:55Z

While there is a history of planning and warning, it seems that support for the non-contiguous dtypes it creates was a gap in our preparation - depending on how much more we want to squeeze into 1.14.1, it might be better to roll back views for now to give us time to think about the right way to fix this problem in 1.15.

ahaldane · 2018-01-17T17:22:51Z

Yes, that is what I'm thinking too.

The problems caused by the new behavior really boil down to two issues: #10409 for views, and #10387 for save/load.

If we delay this change until later, I am pretty sure we can fix #10387 before then so that it will not be a problem. I'd guess #10387 is the one causing the most test failures, too.

But I think #10409 will cause unavoidable downstream code-breakage if we return a padded view. The best we can do is introduce pack_fields now, and perhaps add more warnings so people start using it. I suspect that this problem is the rarer one.

ahaldane · 2018-01-17T22:16:49Z

Updated. Now we keep the copy behavior for 1.14, to be changed for real in 1.15, hopefully.

I added back all the deprecation warning tests. By the way, those people who ran across the view issue of #10409 have quite surely been getting big FutureWarnings from numpy. No excuses!

In these latest changes I kept a simplified version of the np.multifield_index_view setting, but made it mostly "secret". See the new documentation section in this PR for details.

Test failure is not real, it's due to #10425

Taken from numpygh-10411

eric-wieser · 2018-01-18T09:51:48Z

doc/release/1.14.1-notes.rst

@@ -0,0 +1,22 @@
+==========================
+NumPy 1.14.1 Release Notes


Extracted to #10431

ahaldane · 2018-01-19T22:26:50Z

If there are no objections from devs at this point, later today or tomorrow I'll post to the list about this PR and how we are reverting and pushing the copy->view change to 1.15.

I don't expect anyone will object, but maybe users who were affected have extra ideas we could implement.

rgommers · 2018-01-19T22:54:32Z

Disclaimer: I've followed along but didn't have enough time to read all the history in linked issues/threads.

Opinions:

Agree with @eric-wieser that more global state is not desirable
The revert seems necessary
In general, trying to change copy behavior to view behavior is a bad idea, the gains are not worth the backwards compatibility breaks. (for new functionality, by all means go for views of course)
In this case, there's something fishy in addition to that - if a new function pack_fields is needed to guard against mishaps with the view behavior, that seems like a sign that keeping the copy behavior is the best option from an API perspective.

So if you still want to make the change in 1.15.0, a clear story about the pros/cons of just that change would be good to see.

Also, was introducing np.pack_fields proposed on the mailing list? It may be a useful utility, but doesn't seem to deserve being in the top-level namespace.

ahaldane · 2018-01-19T23:26:59Z

Yes, we got rid of the global state (except... I kept it in hidden form for dev purposes), and are planning to revert. We can also move np.pack_fields to the np.rec submodule which has similar methods.

I'll incorporate your other points into the email this weekend.

charris · 2018-01-20T01:08:06Z

IIRC, I did do a preliminary run at replacing the deprecation with a view several releases ago, but ran into the same problem of filler bytes and a failing test, so never followed through. Another thing we should revisit is the diagonal view replacing a copy.

ahaldane · 2018-06-11T02:23:10Z

OK, rebased and updated.

Just to avoid some confusion: This PR is the forward port of #10537 (or rather, #10537 is the backport of this PR).

So the plan is to merge this PR for 1.15, and then fix the save/view code for 1.16.

charris · 2018-06-11T03:02:28Z

I changed the PR title, might want to check that. Does this also fix #10409?

charris · 2018-06-11T03:41:02Z

Couple of failing tests.

ahaldane · 2018-06-11T04:29:17Z

numpy/core/src/multiarray/convert.c

+    flags = PyArray_FLAGS(self);
+    dtype = PyArray_DESCR(self);
+
+    if (type != NULL && !PyArray_EquivTypes(dtype, type) &&


The addition of PyArrayEquivTypes here, (the only real change), is so that code in records.py like arr.view((np.record, arr.dtype)) doesn't complain.

That kind of view doesn't change the memory layout so is save to keep in 1.16.

eric-wieser · 2018-06-11T08:38:41Z

numpy/core/src/multiarray/mapping.c

@@ -1507,7 +1547,7 @@ _get_field_view(PyArrayObject *arr, PyObject *ind, PyArrayObject **view)
                }
                Py_DECREF(title);
            }
-            // disallow duplicate field indices
+            /* disallow duplicate field indices */


Why are we using C-style comments again? If we have a good reason, it seems our tests ought to be catching this...

Yes, no test catches this. But it is in the style guide.

We did release 1.13 with these comments present and no-one complained, so maybe it is harmless. We fixed them in the 1.14 branch though.

Also, a grep shows that there is only one other place in numpy with a C++ comment, in common.c, which was also added by me in 2016. If we want to remove all C++ comments, I could fix that one into this PR too.

We do it because there used to be compilers that didn't accept the C++ style comments. We can probably drop this when we go to c99, although mixed comment styles can be a bit annoying. We should probably test for that, -ansi might detect that, although it could give us other errors. IIRC, it didn't work (many years ago).

Right, although given that no-one complained in 1.13 it seems no-one building numpy uses incompatible compilers any more.

In any case, I just made the change in common.c in this PR, so now numpy is totally free of C++ comments.

eric-wieser · 2018-06-11T08:39:47Z

numpy/core/tests/test_multiarray.py

+            # assert_equal(b[['f1', 'f3']][0].tolist(), (2, (1,)))
+            with suppress_warnings() as sup:
+                sup.filter(FutureWarning,
+                           "Numpy has detected that you .*")


"selecting multiple fields" is a better pattern to look for here (possible with leading/trailing .*)

charris · 2018-06-11T14:33:51Z

@ahaldane Should this be applied before or after the branch? The latter will leave master as is.

ahaldane · 2018-06-11T15:49:03Z

Let's do it before the branch. It is easy to remove the unneeded parts when the time comes, and it will make the future update/change more explicit in the commit history.

charris · 2018-06-11T16:37:34Z

Thanks Allan.

eric-wieser · 2018-06-12T05:57:59Z

numpy/core/src/private/npy_import.h

@@ -29,4 +29,18 @@ npy_cache_import(const char *module, const char *attr, PyObject **cache)
    }
 }

+NPY_INLINE static PyObject *
+npy_import(const char *module, const char *attr)


Why was this function added?

Oh... I thought it was used in _multifield_view_to_copy, but I must have removed that. Yes, I think it was related to the config variabel that I removed.

I'll make a followup to remove this.

Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in

ahaldane force-pushed the revert_multifield_view branch from a9dc2c5 to 64a8ab9 Compare January 16, 2018 05:07

eric-wieser reviewed Jan 16, 2018

View reviewed changes

rgommers added component: numpy._core 03 - Maintenance labels Jan 16, 2018

ahaldane force-pushed the revert_multifield_view branch 2 times, most recently from cdd5246 to dd143dd Compare January 17, 2018 19:55

ahaldane force-pushed the revert_multifield_view branch 3 times, most recently from 561f242 to 46d2973 Compare January 18, 2018 02:07

ahaldane added this to the 1.14.1 release milestone Jan 18, 2018

eric-wieser added a commit to eric-wieser/numpy that referenced this pull request Jan 18, 2018

REL: Add 1.14.1 release notes template

c1cf40d

Taken from numpygh-10411

eric-wieser mentioned this pull request Jan 18, 2018

REL: Add 1.14.1 release notes template #10431

Merged

eric-wieser reviewed Jan 18, 2018

View reviewed changes

doc/release/1.14.1-notes.rst Outdated

@@ -0,0 +1,22 @@

==========================

NumPy 1.14.1 Release Notes

Copy link

Member

eric-wieser Jan 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extracted to #10431

ahaldane force-pushed the revert_multifield_view branch 2 times, most recently from 2ee7650 to 2275ff8 Compare January 18, 2018 14:36

ahaldane mentioned this pull request Jan 18, 2018

DOC: Add 1.14.1 release notes template (forward port) #10433

Merged

ahaldane force-pushed the revert_multifield_view branch from 2275ff8 to 81ba635 Compare January 18, 2018 20:15

eric-wieser mentioned this pull request Jan 20, 2018

ENH: Allow dtype objects to be indexed with multiple fields at once #10417

Merged

ahaldane force-pushed the revert_multifield_view branch from 930a747 to 7a41cf9 Compare June 11, 2018 02:21

ahaldane force-pushed the revert_multifield_view branch 2 times, most recently from 505c8d4 to 37f45ef Compare June 11, 2018 02:25

charris changed the title ~~BUG: 1.14 multifield-indexing adds padding bytes~~ BUG: Revert multifield-indexing adds padding bytes for NumPy 1.15. Jun 11, 2018

ahaldane force-pushed the revert_multifield_view branch from 37f45ef to 18ca7f3 Compare June 11, 2018 04:16

ahaldane commented Jun 11, 2018

View reviewed changes

ahaldane force-pushed the revert_multifield_view branch 3 times, most recently from 8d5af5a to 807ddda Compare June 11, 2018 04:34

eric-wieser reviewed Jun 11, 2018

View reviewed changes

ahaldane force-pushed the revert_multifield_view branch 2 times, most recently from 1b16f7f to ad1f995 Compare June 11, 2018 13:43

MAINT: push back multifield copy->view changes to 1.16

e08eced

ahaldane force-pushed the revert_multifield_view branch from ad1f995 to e08eced Compare June 11, 2018 15:51

charris merged commit 1b92080 into numpy:master Jun 11, 2018

eric-wieser reviewed Jun 12, 2018

View reviewed changes

ahaldane mentioned this pull request Jun 12, 2018

MAINT: remove unused "npy_import" #11314

Merged

ahaldane mentioned this pull request Jul 7, 2018

.view no longer works on field selections of structured arrays #10409

Closed

ahaldane mentioned this pull request Aug 17, 2018

BUG: np.save() and np.load() are not idempotent when align=True or fields are discontiguous #8100

Closed

ahaldane mentioned this pull request Nov 24, 2018

ENH: add back the multifield copy->view change #12447

Merged

		@@ -0,0 +1,22 @@
		==========================
		NumPy 1.14.1 Release Notes

Uh oh!

BUG: Revert multifield-indexing adds padding bytes for NumPy 1.15. #10411

BUG: Revert multifield-indexing adds padding bytes for NumPy 1.15. #10411

Uh oh!

Conversation

ahaldane commented Jan 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changed Behavior

Summary of the problem

Proposed Way Forward:

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

eric-wieser commented Jan 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Jan 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eric-wieser commented Jan 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Jan 16, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

eric-wieser commented Jan 17, 2018

Uh oh!

ahaldane commented Jan 17, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ahaldane commented Jan 17, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ahaldane commented Jan 19, 2018

Uh oh!

rgommers commented Jan 19, 2018

Uh oh!

ahaldane commented Jan 19, 2018

Uh oh!

charris commented Jan 20, 2018

Uh oh!

ahaldane commented Jun 11, 2018

Uh oh!

charris commented Jun 11, 2018

Uh oh!

charris commented Jun 11, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

charris commented Jun 11, 2018

Uh oh!

ahaldane commented Jun 11, 2018

Uh oh!

charris commented Jun 11, 2018

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ahaldane commented Jan 16, 2018 •

edited

Loading

eric-wieser commented Jan 16, 2018 •

edited

Loading

ahaldane commented Jan 16, 2018 •

edited

Loading

eric-wieser commented Jan 16, 2018 •

edited

Loading

ahaldane commented Jan 16, 2018 •

edited

Loading

ahaldane commented Jan 17, 2018 •

edited

Loading