-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
BUG: Revert multifield-indexing adds padding bytes for NumPy 1.15. #10411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a9dc2c5
to
64a8ab9
Compare
numpy/core/numeric.py
Outdated
packed : ndarray or dtype | ||
Copy of a with padding bytes removed | ||
""" | ||
if isinstance(a, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about if not isinstance(a, dtype): return a.astype(pack_fields(a.dtype))
numpy/core/numeric.py
Outdated
@@ -623,6 +624,45 @@ def asfortranarray(a, dtype=None): | |||
""" | |||
return array(a, dtype, copy=False, order='F', ndmin=1) | |||
|
|||
def pack_fields(a): | |||
"""Return copy of a structured array with padding between fields removed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And fields reordered in memory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh right I forgot about reordering. Will think about it tomorrow.
I'd rather not introduced any more global state into numpy, to avoid running into #9444 in more places. The legacy mode changes are alright, because nothing apart from doc tests should depend on them - but real code is likely to want to turn on this behavior. |
Here's a few different non-configurable ways we could go forward, which I would be OK with:
|
That actually doesn't seem too awful. It would be even better if we could preserve the view behavior under a different API, perhaps:
|
I want to collect some past discussions on this issue: We discussed multifield-views more extensively in #350 and briefly in #5994 and #3641. This change has been originally planned as far back as numpy 1.7. It was mentioned as a I also found these mailing list threads. ★ means more important.
I think if that we want to argue in favor of the change, there is clearly a long history of planning and warning about it. As a reminder of reported issues caused by the copy->view change, we have:
|
While there is a history of planning and warning, it seems that support for the non-contiguous dtypes it creates was a gap in our preparation - depending on how much more we want to squeeze into 1.14.1, it might be better to roll back views for now to give us time to think about the right way to fix this problem in 1.15. |
Yes, that is what I'm thinking too. The problems caused by the new behavior really boil down to two issues: #10409 for views, and #10387 for save/load. If we delay this change until later, I am pretty sure we can fix #10387 before then so that it will not be a problem. I'd guess #10387 is the one causing the most test failures, too. But I think #10409 will cause unavoidable downstream code-breakage if we return a padded view. The best we can do is introduce |
cdd5246
to
dd143dd
Compare
Updated. Now we keep the copy behavior for 1.14, to be changed for real in 1.15, hopefully. I added back all the deprecation warning tests. By the way, those people who ran across the view issue of #10409 have quite surely been getting big FutureWarnings from numpy. No excuses! In these latest changes I kept a simplified version of the Test failure is not real, it's due to #10425 |
561f242
to
46d2973
Compare
doc/release/1.14.1-notes.rst
Outdated
@@ -0,0 +1,22 @@ | |||
========================== | |||
NumPy 1.14.1 Release Notes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extracted to #10431
2ee7650
to
2275ff8
Compare
2275ff8
to
81ba635
Compare
If there are no objections from devs at this point, later today or tomorrow I'll post to the list about this PR and how we are reverting and pushing the copy->view change to 1.15. I don't expect anyone will object, but maybe users who were affected have extra ideas we could implement. |
Disclaimer: I've followed along but didn't have enough time to read all the history in linked issues/threads. Opinions:
So if you still want to make the change in 1.15.0, a clear story about the pros/cons of just that change would be good to see. Also, was introducing |
Yes, we got rid of the global state (except... I kept it in hidden form for dev purposes), and are planning to revert. We can also move I'll incorporate your other points into the email this weekend. |
IIRC, I did do a preliminary run at replacing the deprecation with a view several releases ago, but ran into the same problem of filler bytes and a failing test, so never followed through. Another thing we should revisit is the diagonal view replacing a copy. |
930a747
to
7a41cf9
Compare
505c8d4
to
37f45ef
Compare
I changed the PR title, might want to check that. Does this also fix #10409? |
Couple of failing tests. |
37f45ef
to
18ca7f3
Compare
numpy/core/src/multiarray/convert.c
Outdated
flags = PyArray_FLAGS(self); | ||
dtype = PyArray_DESCR(self); | ||
|
||
if (type != NULL && !PyArray_EquivTypes(dtype, type) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition of PyArrayEquivTypes
here, (the only real change), is so that code in records.py
like arr.view((np.record, arr.dtype))
doesn't complain.
That kind of view doesn't change the memory layout so is save to keep in 1.16.
8d5af5a
to
807ddda
Compare
@@ -1507,7 +1547,7 @@ _get_field_view(PyArrayObject *arr, PyObject *ind, PyArrayObject **view) | |||
} | |||
Py_DECREF(title); | |||
} | |||
// disallow duplicate field indices | |||
/* disallow duplicate field indices */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using C-style comments again? If we have a good reason, it seems our tests ought to be catching this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, no test catches this. But it is in the style guide.
We did release 1.13 with these comments present and no-one complained, so maybe it is harmless. We fixed them in the 1.14 branch though.
Also, a grep shows that there is only one other place in numpy with a C++ comment, in common.c
, which was also added by me in 2016. If we want to remove all C++ comments, I could fix that one into this PR too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do it because there used to be compilers that didn't accept the C++ style comments. We can probably drop this when we go to c99, although mixed comment styles can be a bit annoying. We should probably test for that, -ansi
might detect that, although it could give us other errors. IIRC, it didn't work (many years ago).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, although given that no-one complained in 1.13 it seems no-one building numpy uses incompatible compilers any more.
In any case, I just made the change in common.c
in this PR, so now numpy is totally free of C++ comments.
numpy/core/tests/test_multiarray.py
Outdated
# assert_equal(b[['f1', 'f3']][0].tolist(), (2, (1,))) | ||
with suppress_warnings() as sup: | ||
sup.filter(FutureWarning, | ||
"Numpy has detected that you .*") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"selecting multiple fields" is a better pattern to look for here (possible with leading/trailing .*
)
1b16f7f
to
ad1f995
Compare
@ahaldane Should this be applied before or after the branch? The latter will leave master as is. |
Let's do it before the branch. It is easy to remove the unneeded parts when the time comes, and it will make the future update/change more explicit in the commit history. |
ad1f995
to
e08eced
Compare
Thanks Allan. |
@@ -29,4 +29,18 @@ npy_cache_import(const char *module, const char *attr, PyObject **cache) | |||
} | |||
} | |||
|
|||
NPY_INLINE static PyObject * | |||
npy_import(const char *module, const char *attr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this function added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh... I thought it was used in _multifield_view_to_copy
, but I must have removed that. Yes, I think it was related to the config variabel that I removed.
I'll make a followup to remove this.
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
Fixes #10387
Fixes #10344
This is a potential fix to the multifield-indexing bugs in 1.14.0 caused by how easy it is now to get a structured array with padding bytes, which trigger other old bugs like #3176, #8100.
Changed Behavior
This PR reverts the change in 1.14.0 that a multifield-index returns a view instead of a copy, but we plan to make the change in 1.15. That is:
1.13 behavior:
1.14.0 behavior (pushed off to 1.15):
Summary of the problem
Why revert? The problem boils down to padding bytes. Compare the 1.13 to 1.14.0 output above, and note how there are "padding" or missing bytes at offset 4 in the structure in 1.14.0, and how the itemsize is different. The 1.13 behavior essentially returns a copy of the 1.14 output, but with the fields re-packed so that they are contiguous. The two ways users are affected were:
A) Code that attempts to "view" the result, as in:
In 1.13 this works fine, since the indexed array has an itemsize of 8, but in 1.14.0 this now fails since the itemsize of the view is still 12.
B) Code that attempts to use np.load/np.save to save these arrays, as in:
This fails because of another long-standing bug, that currently np.load cannot handle structured arrays with padding bytes, #8100. In 1.13 this bug was not triggered because there were no padding bytes, but in 1.14.0 they now appear often.
Proposed Way Forward:
Revert the copy-> view change, and leave it for 1.15.
That way, we have time to fix #8100 by then, solving problem B).
Problem A) will still cause broken code, but we hope to mitigate this by introducing a new method
pack_fields
in this PR for 1.14.1, which allows convertsion of the 1.15 output into the 1.13 output:See comments below for more discussion. Also see the "documentation" updates at the end of the diff for a description of the proposed behavior, under "Accessing Multiple Fields".
(Old outdated summary, for reference)
This PR implements a couple different possible behaviors and makes them user-configurable; we can decide in discussion which one we want, which should be default, or which to remove. You control the behavior by setting the variable
np.multifield_index_setting
to one of'view'
,'copy'
or'copywarn'
(default).For
'view'
, the behavior is unchanged from 1.14.0, meaning that multi-field indexing returns a view which will often contain padding bytes.For
'copy'
, the behavior reverts back to 1.13 behavior, in which a packed-copy of the array fields is returned. This should solve most of people's bugs.For
'copywarn'
, do the same as'copy'
but warn anytime multifield indexing is used. (default)Is this kind of configuration option a good idea, or will it lead to a dependency mess? (I got the idea from the legacy mode for printing, which also allows the old and new behavior but defaults to the new one, unlike this PR currently).
In this PR I also introduce a new user-visible function
np.pack_fields
, which removes padding bytes from structured ndarrays and dtypes. Its existence should help any users transition to the new behavior: Doingnp.pack_fields(a[['a', 'c']])
in the new behavior ('view'
setting) returns what you get in the old ('copy'
) setting.