-
-
Notifications
You must be signed in to change notification settings - Fork 10.8k
ENH: add back the multifield copy->view change #12447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
To help users update their code to account for these changes, a number of | ||
functions have been added to the ``numpy.lib.recfunctions`` module which | ||
safely allow such operations. For instance, the above view can be achieved | ||
with ``structured_to_unstructured(arr[['f1', 'f3']], dtype='float64')``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be neat if it did produce a view, but right now this produces a copy not a view, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular example produces a copy, because even before this PR a copy was unavoidable.
However, it was actually your suggestion to rework the code so structured_to_unstructured
returns a view in many cases, for instance structured_to_unstructured(arr[['f1', 'f2', 'f3']], dtype='float64')
actually makes no copies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sentence should be reworked to avoid the word view though...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for instance structured_to_unstructured(arr[['f1', 'f2', 'f3']], dtype='float64') actually makes no copies.
Is that a statement about what actually happens in master, or just what I was suggesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was proposing something stronger, where the following would also produce a view:
structured_to_unstructured(arr[['f1', 'f3']], dtype='float64')
structured_to_unstructured(arr[['f3', 'f1']], dtype='float64')
structured_to_unstructured(arr[['f3', 'f2', 'f1']], dtype='float64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, let's say we're dealing with the array arr = np.ones(3, dtype='f8,f8,f8')
. Then arr[['f0', 'f2']]
returns a 24-byte structured array where each element is organized as FxF
where F
is 8 bytes of float memory, and x is 8 bytes of padding. The 3-element array as a whole is FxFFxFFxF
(72 bytes).
Is it possible to view this as an unstructured float64 array with the right stride? I suppose this particular 3-field array might be viewed with shape (3, 2) and stride (24, 16). np.ndarray((3,2), strides=(24, 16), dtype='f8', buffer=arr)
works.
But it seems it would be quite an involved computation to determine more generally whether appropriate strides exist, given arbitrary field offsets. Fo three fields it is always possible, but with 4 fields it is not, eg np.ones(3, 'f8,f8,f8,f8')[['f0', 'f1, 'f3']]
cannot be viewed as an unstructured array.
My feeling is it's too difficult to do the calculation generally, it needs some complex gcd computation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fo three fields it is always possible
Not if they're not contiguous. You're right that it's only possible in some special cases - but reshape
can only avoid copies in some special cases, and we decided that was worth doing - so it might also be worthwhile here
But it seems it would be quite an involved computation to determine more generally whether appropriate strides exist, given arbitrary field offsets
The computation is straightforward - pseudo-code:
if len(offsets) == 0:
stride = 0
offset = 0
elif len(offsets) == 1:
stride = 0
offset = offsets[0]
else:
offset = offsets[0]
stride = offsets[1] - offsets[0]
for i, o in enumerate(offsets[2:],2):
if o != offset + stride * i:
raise NotViewable
return offset, stride
Which should give enough information to be passed to as_strided
somehow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should work, indeed. I would have to mostly rewrite structured_to_unstructured
to account for it, and it seems like not enough of a good enough cost/benefit ratio for the effort, to me. I'd rather leave that to a separate PR if it's desired.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as the contract of structured_to_unstructured
allows us to make that change in a later release, I'm fine with not doing it now
ecd6c48
to
79e0e6a
Compare
79e0e6a
to
845def0
Compare
I'd like us to consider this one for 1.16.0, possibly, so I added the tag. |
That seems fair, given that the warning messages say it would change in 1.16 |
Ping. What is the current status of this? |
It is ready-to-go by my reading. There is just a social question about whether we should actually make this change, since we did get some pushback on the mailing list. But in my opinion, we should go ahead with it, since we have now mitigated the back-compat break, and the change has been "promised" since numpy 1.7 and the current Warnings say it will happen in 1.16. We already implemented this change in 1.14.0 (reverted in 1.14.1), and almost all the bug reports we got last time should not re-occur now. |
Just a 👍 to merging this. |
Thanks Allan. |
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
Numpy attempted to release a feature in 1.14.0, numpy/numpy#6053 which was then reverted in 1.14.1 with numpy/numpy#10411 And then was finally released in 1.16 by numpy/numpy#12447 This also makes a minor enhancement to determines which numpy function should be used at import time instead of run time. The function name was also then updated from _make_sliced_dtype_np_ge_14 to _make_sliced_dtype_np_ge_16 as thats the primary version the feature was released in
Fixes #10409 Closes #11530
This is the (hopefully) last of a long-running series of PRs attepting to make multifield-indexes return views instead of copies, which has been planned since about Numpy 1.6 or 1.7. I have some writeup of the motivation here: https://gist.github.com/ahaldane/6cd44886efb449f9c8d5ea012747323b#multifield-views and in the linked mailing list discussions there.
This is an "un-revert" of #10411, which was a revert of part of #6053. The reason we can do the unreversion now is that: 1. We have fixed #8100, which was the main source of bugs that caused us to revert this change originally, and 2. We have merged #11526 which provides helper functions for cases which this change will break.
In the mailing list discussions we did get some pushback on this backcompat break. So please consider this PR an opportunity to debate whether we should make the change at all. Myself, I'd say we go ahead with it, since it seemed enough devs supported the change on the mailing list, because I think the new helper functions are better and safer than the old "trick" using views in any case, plus the other reasons in the introductory maling list post: https://mail.python.org/pipermail/numpy-discussion/2018-January/077624.html